P*(X ≥ t) ≤ P(X* ≥ t) ≤ P*(X > t − ε).

Proof Clearly {X > t} ⊂ {X* > t} and {X ≥ t} ⊂ {X* ≥ t}, so we have "≤" in (a) and the first inequality in (b). Take a measurable cover A ⊃ {X > t}, so P*(X > t) = P(A). Then X* ≤ t a.s. outside A, so P(X* > t) ≤ P(A), proving (a). Thus for 0 < δ < ε, P(X* > t − δ) = P*(X > t − δ) ≤ P*(X > t − ε). Letting δ ↓ 0, so that P(X* > t − δ) ↓ P(X* ≥ t), proves (b). □

Let (Ω, A, P) be a probability space. For a function f from Ω into [−∞, ∞], let the lower integral be ∫_* f dP := sup{∫ g dP : g measurable, g ≤ f}, and let f_* be the essential supremum of all measurable functions g ≤ f. Then, just as for f*, f_* is well-defined up to a.s. equality, and ∫_* f dP = ∫ f_* dP whenever either side is defined, as in Theorem 3.2.1.
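On a finite probability space these objects are easy to compute, which makes for a useful sanity check: if the σ-algebra is generated by a partition into atoms, then f* takes the supremum of f over each atom and f_* the infimum. A minimal sketch (the atoms, weights, and f below are invented for illustration):

```python
# Atoms of a finite sigma-algebra on {0,...,5}: a function is measurable
# iff it is constant on each atom.  P puts its mass on whole atoms.
atoms = [[0, 1], [2, 3], [4, 5]]
P = [0.5, 0.3, 0.2]                    # P(atoms[i])
f = [3.0, 1.0, 0.0, 4.0, 2.0, 2.0]    # an arbitrary (non-measurable) function

def cover(f, agg):
    """Measurable cover f* (agg=max) or lower envelope f_* (agg=min)."""
    g = [0.0] * len(f)
    for A in atoms:
        v = agg(f[x] for x in A)
        for x in A:
            g[x] = v
    return g

f_star = cover(f, max)     # smallest measurable g >= f
f_low = cover(f, min)      # largest measurable g <= f

def E(g):                  # integral of a measurable function, atom by atom
    return sum(P[i] * g[A[0]] for i, A in enumerate(atoms))

upper, lower = E(f_star), E(f_low)   # upper and lower integrals of f
assert lower <= upper
# f_* = -((-f)*), as in the text
assert f_low == [-v for v in cover([-v for v in f], max)]
```

Here the identities ∫* f dP = ∫ f* dP and ∫_* f dP = ∫ f_* dP of Theorem 3.2.1 hold by construction.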
It is easy to check that f_* = −((−f)*) and that ∫_* f dP = −∫* (−f) dP. So the convergence in law Yₙ ⇒ Y₀ as defined in Section 3.1 also implies that ∫_* g(Yₙ) dQₙ → ∫ g(Y₀) dQ₀ for bounded continuous g. Next, here is a one-sided Tonelli-Fubini theorem for starred functions:
3.2.7 Theorem Let (X, A, P) × (Y, B, Q) be a product of two probability spaces. For a real-valued function f ≥ 0 on X × Y, define f* with respect to P × Q. For each x ∈ X let (E₂*f)(x) := ∫* f(x, y) dQ(y). Then

(3.2.8)    E₁*E₂*f(X, Y) ≤ ∫ f*(x, y) d(P × Q)(x, y),

where E₁* = E* with respect to P. Also, if Q is purely atomic, so that Σⱼ Q({yⱼ}) = 1 for some yⱼ ∈ Y, then

(3.2.9)    E₁*E₂*f(X, Y) ≤ E*f(X, Y) = E₂E₁*f(X, Y).
Proof By the usual Tonelli-Fubini theorem, we have

(3.2.10)    ∫ f*(x, y) d(P × Q)(x, y) = ∫∫ f*(x, y) dQ(y) dP(x),

and ∫ f*(x, y) dQ(y) is measurable in x. For each x ∈ X let fₓ(y) := f(x, y). Then f(x, y) ≤ f*(x, y), which is measurable in y, so fₓ*(y) ≤ f*(x, y) for Q-almost all y. Thus, since E₂*fₓ = Efₓ* by Theorem 3.2.1,

(E₂*f)(x) = ∫ fₓ*(y) dQ(y) ≤ ∫ f*(x, y) dQ(y),

and (3.2.8) follows by integrating in x and using (3.2.10). Next, to prove (3.2.9): on Y, since Q is purely atomic, all real-valued functions are measurable (for the completion of Q), so E₂* = E₂. The left inequality then follows from (3.2.8), and the equality from (3.2.10), so (3.2.9) is proved. □

We also have a one-sided monotone convergence theorem with stars:
3.2.11 Theorem Let (Ω, A, P) be a probability space and fⱼ real-valued functions on Ω such that fⱼ ↑ f, that is, fⱼ(x) ↑ f(x) for all x ∈ Ω. If E*f₁ > −∞ then E*fⱼ ↑ E*f as j → ∞.

Proof By Theorem 3.2.1, E*fⱼ = Efⱼ* for each j, and as j → ∞, fⱼ* increases a.s. up to some measurable function g, where clearly g ≤ f* almost surely. If g < f* on some set A with P(A) > 0, then fⱼ ≤ g a.s. for all j implies f ≤ g a.s., and so f* ≤ g a.s. on A, a contradiction. So g = f* a.s., and by monotone convergence (using E*f₁ > −∞), E*fⱼ = Efⱼ* ↑ Eg = Ef* = E*f. The theorem follows. □
Uniform Central Limit Theorems: Donsker Classes
Recall that for any measure space (X, A, μ) and any C ⊂ X, the outer measure is defined by μ*(C) := inf{μ(A) : A ∈ A, A ⊃ C}. Note that there exist subsets Aⱼ := A(j) of [0, 1] with outer measure λ*(Aⱼ) = 1 for all j and Aⱼ ↓ ∅ (RAP, Problem 3.4.2). Letting fⱼ := 1_{A(j)}, we have fⱼ ↓ 0 and E*fⱼ = 1 for all j, so the monotone convergence theorem fails for E* on decreasing sequences. Next is a Fatou lemma with stars.
3.2.12 Theorem Let (Ω, A, P) be a probability space and fⱼ any nonnegative real-valued functions on Ω. Then E* lim inf_{j→∞} fⱼ ≤ lim inf_{j→∞} E*fⱼ.

Proof For 1 ≤ k ≤ m we have inf_{n≥k} fₙ ≤ fₘ and so E* inf_{n≥k} fₙ ≤ E*fₘ, thus E* inf_{n≥k} fₙ ≤ inf_{m≥k} E*fₘ. Taking the supremum over k on both sides and using the monotone convergence theorem (3.2.11) gives the result. □
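The starred convergence theorems above can be checked by brute force on a toy finite space, modeling nonmeasurability by a σ-algebra with two atoms (all numbers here are invented; the decreasing-sequence counterexample in the text genuinely needs nonmeasurable subsets of [0, 1] and has no finite analogue):

```python
import random

atoms = [[0, 1], [2, 3]]
P = [0.6, 0.4]

def Estar(f):   # E* f = sum over atoms of P(atom) * sup of f on the atom
    return sum(P[i] * max(f[x] for x in A) for i, A in enumerate(atoms))

# Theorem 3.2.11: f_j increasing pointwise to f gives E* f_j increasing to E* f
f = [1.0, 3.0, 2.0, 5.0]
seq = [[f[x] - (x + 1) / j for x in range(4)] for j in range(1, 101)]
vals = [Estar(fj) for fj in seq]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
assert vals[-1] <= Estar(f) < vals[-1] + 0.05

# Theorem 3.2.12 (Fatou with stars), with the liminf over a finite family
random.seed(1)
fs = [[random.random() for _ in range(4)] for _ in range(50)]
inf_f = [min(fj[x] for fj in fs) for x in range(4)]
assert Estar(inf_f) <= min(Estar(fj) for fj in fs) + 1e-12
```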
3.3 Almost uniform convergence and convergence in outer probability

In Section 3.1 the definition of convergence of laws was adapted to define convergence in law for random elements which may not have laws defined. In this section the same will be done for convergence in probability and almost sure convergence. Let (Ω, A, P) be a probability space, (S, d) a metric space, and fₙ functions from Ω into S. Then fₙ will be said to converge to f₀ in outer probability if d(fₙ, f₀)* → 0 in probability as n → ∞, or equivalently, by Lemma 3.2.6, for every ε > 0, P*(d(fₙ, f₀) > ε) → 0 as n → ∞. Also, fₙ is said to converge to f₀ almost uniformly if, as n → ∞, d(fₙ, f₀)* → 0 almost surely. The following is immediate:
3.3.1 Proposition Almost uniform convergence always implies convergence in outer probability.
If fn are all measurable functions, then clearly convergence in outer probability is equivalent to the usual convergence in probability, and almost uniform convergence to almost sure convergence. Now some definitions will be given for Glivenko-Cantelli properties, which are laws of large numbers for empirical measures.
Definition. If (X, A, P) is a probability space and F is a class of integrable real-valued functions, F ⊂ 𝓛¹(X, A, P), then F will be called a strong (resp. weak) Glivenko-Cantelli class for P if, as n → ∞, ‖Pₙ − P‖_F → 0 almost uniformly (resp. in outer probability), where ‖Pₙ − P‖_F := sup{|∫ f dPₙ − ∫ f dP| : f ∈ F}. In the following proposition, part (C) is what is usually called "Egorov's theorem" for almost surely convergent sequences of measurable functions (RAP, Theorem 7.5.1).
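As a concrete instance of the Glivenko-Cantelli definition, take P uniform on [0, 1] and F the indicators of half-lines; then ‖Pₙ − P‖_F is the Kolmogorov-Smirnov statistic sup_t |Fₙ(t) − t|, and the classical Glivenko-Cantelli theorem says it goes to 0. A quick simulation (sample sizes and seed arbitrary):

```python
import random

random.seed(0)

def ks_uniform(n):
    """||P_n - P||_F = sup_t |F_n(t) - t| for an i.i.d. Uniform[0,1]
    sample of size n, F being the indicators of half-lines (-inf, t]."""
    xs = sorted(random.random() for _ in range(n))
    # the supremum is attained at the jump points of the empirical cdf F_n
    return max(max(abs((i + 1) / n - x), abs(x - i / n))
               for i, x in enumerate(xs))

d100, d100000 = ks_uniform(100), ks_uniform(100_000)
assert d100000 < d100 < 1.0   # the discrepancy shrinks as n grows
```

Here everything is measurable, so almost uniform convergence reduces to the usual almost sure convergence; the starred definitions matter for richer classes F where the supremum need not be measurable.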
3.3.2 Proposition Let (Ω, A, P) be a probability space, (S, d) a metric space, and fₙ any functions from Ω into S for n = 0, 1, .... Then the following are equivalent:

(A) fₙ → f₀ almost uniformly.
(B) For any ε > 0, P*{sup_{n≥m} d(fₙ, f₀) > ε} ↓ 0 as m → ∞.
(C) For any δ > 0 there is some B ∈ A with P(B) > 1 − δ such that fₙ → f₀ uniformly on B.
(D) There exist measurable hₙ ≥ d(fₙ, f₀) with hₙ → 0 a.s.

Proof (A) implies that

(sup_{n≥m} d(fₙ, f₀))* ≤ sup_{n≥m} (d(fₙ, f₀)*) ↓ 0 a.s. as m → ∞,

which implies (B).

Assuming (B), for k = 1, 2, ..., let Cₖ := {sup_{n≥m(k)} d(fₙ, f₀) > 1/k}, where m(k) is large enough so that P*(Cₖ) < 2⁻ᵏ. Take measurable covers Bₖ for Cₖ, so Cₖ ⊂ Bₖ, Bₖ ∈ A, and P(Bₖ) < 2⁻ᵏ. For r = 1, 2, ..., let Aᵣ := Ω \ ∪_{k≥r} Bₖ. Then P(Aᵣ) > 1 − 2^{1−r} and fₙ → f₀ uniformly on Aᵣ, so (C) holds.

Now assume (C). Take Cₖ ∈ A, k = 1, 2, ..., such that P(Cₖ) ↑ 1 and fₙ → f₀ uniformly on Cₖ; we can take C₁ ⊂ C₂ ⊂ .... Take mₖ such that d(fₙ, f₀) ≤ 1/k on Cₖ for all n ≥ mₖ. Then d(fₙ, f₀)* ≤ 1/k a.s. on Cₖ for n ≥ mₖ, so d(fₙ, f₀)* → 0 a.s., proving (A). Clearly, (A) and (D) are equivalent. □
Example. In [0, 1] with Lebesgue measure P, let A₁ ⊃ A₂ ⊃ ... be sets with P*(Aₙ) = 1 and ∩_{n=1}^∞ Aₙ = ∅ (e.g., RAP, Section 3.4, Problem 2; Cohn, 1980, p. 35). Then 1_{Aₙ} → 0 everywhere and, in that sense, almost surely, but not almost uniformly. Note also that 1_{Aₙ} doesn't converge to 0 in law as defined in Section 3.1. To avoid such pathology, almost uniform convergence is helpful.
3.3.3 Proposition Let (S, d) and (Y, e) be two metric spaces and (Ω, A, P) a probability space. Let fₙ be functions from Ω into S for n = 0, 1, ..., such that fₙ → f₀ in outer probability as n → ∞. Assume that f₀ has separable range and is measurable (for the Borel σ-algebra on S). Let g be a continuous function from S into Y. Then g(fₙ) → g(f₀) in outer probability.

Note. If ∫ G(f₀) dP is defined for all bounded continuous real G (as it must be if fₙ ⇒ f₀, by definition), then the image measure P∘f₀⁻¹ is defined on all Borel subsets of S (RAP, Theorem 7.1.1). Such a law does have a separable support except perhaps in some set-theoretically pathological cases (Appendix C).

Proof Given ε > 0, for k = 1, 2, ..., let Bₖ := {x ∈ S : for all y ∈ S, d(x, y) < 1/k implies e(g(x), g(y)) < ε}. Then each Bₖ is closed and Bₖ ↑ S as k → ∞. Fix k large enough so that P(f₀⁻¹(Bₖ)) > 1 − ε. Then

{e(g(fₙ), g(f₀)) > ε} ∩ f₀⁻¹(Bₖ) ⊂ {d(fₙ, f₀) ≥ 1/k}.

Thus

P*{e(g(fₙ), g(f₀)) > ε} ≤ ε + P*{d(fₙ, f₀) ≥ 1/k} < 2ε

for n large enough. □
3.3.4 Lemma Let (Ω, A, P) be a probability space and {gₙ}_{n≥0} a uniformly bounded sequence of real-valued functions on Ω such that g₀ is measurable. If gₙ → g₀ in outer probability then lim sup_{n→∞} ∫* gₙ dP ≤ ∫ g₀ dP.

Proof Let |gₙ(x)| ≤ M < ∞ for all n and all x ∈ Ω; we can assume M = 1. Given ε > 0, for n large enough P*(|gₙ − g₀| > ε) < ε, so there is a measurable set Aₙ on which |gₙ − g₀| ≤ ε with P(Ω \ Aₙ) < ε. Then

∫* gₙ dP ≤ ε + ∫*_{Aₙ} gₙ dP ≤ 2ε + ∫_{Aₙ} g₀ dP ≤ 3ε + ∫ g₀ dP.

Letting ε ↓ 0 completes the proof. □

On any metric space, the σ-algebra will be the Borel σ-algebra unless something is said to the contrary.
3.3.5 Corollary If fₙ are functions from a probability space into a metric space, fₙ → f₀ in outer probability, and f₀ is measurable with separable range, then fₙ ⇒ f₀.

Proof Apply Proposition 3.3.3 to g = G for any bounded continuous G, and Lemma 3.3.4 to gₙ := G∘fₙ and also to gₙ := −G∘fₙ. □
3.4 Perfect functions

For a function g defined on a set A let g[A] := {g(x) : x ∈ A}. It will be useful that, under some conditions on a measurable function g and general real-valued f, (f∘g)* = f*∘g. Here are some equivalent conditions:

3.4.1 Theorem Let (X, A, P) be a probability space, (Y, B) any measurable space, and g a measurable function from X to Y. Let Q be the restriction of P∘g⁻¹ to B. For any real-valued function f on Y, define f* for Q. Then the following are equivalent:

(a) For any A ∈ A there is a B ∈ B with B ⊂ g[A] and Q(B) ≥ P(A);
(b) for any A ∈ A with P(A) > 0 there is a B ∈ B with B ⊂ g[A] and Q(B) > 0;
(c) for every real function f on Y, (f∘g)* = f*∘g a.s.;
(d) for any D ⊂ Y, (1_D∘g)* = 1_D*∘g a.s.
Proof Clearly (a) implies (b).

To show (b) implies (c), note that always (f∘g)* ≤ f*∘g a.s., since f*∘g is a measurable function ≥ f∘g a.s. Suppose (f∘g)* < f*∘g on a set of positive probability. Then for some rational r, (f∘g)* < r < f*∘g on a set A ∈ A with P(A) > 0, and we may assume f∘g ≤ (f∘g)* everywhere on A. By (b), take B ∈ B with B ⊂ g[A] and Q(B) > 0. Each y ∈ B is of the form g(x) for some x ∈ A, so f(y) ≤ (f∘g)*(x) < r while f*(y) > r. Then the measurable function equal to min(f*, r) on B and to f* elsewhere is ≥ f but less than f* on B, with Q(B) > 0, contradicting the minimality of f*. So (b) implies (c), and clearly (c) implies (d).

To show (d) implies (a), let A ∈ A and D := Y \ g[A]. For the indicator 1_D we can take 1_D* = 1_C, where C ∈ B is a measurable cover of D. Since 1_D∘g = 0 on A and A is measurable, (1_D∘g)* = 0 a.s. on A. Let B := Y \ C. Then B ⊂ g[A], and

Q(B) = 1 − Q(C) = 1 − ∫ 1_D* d(P∘g⁻¹),

which by the image measure theorem (RAP, Theorem 4.1.11) and by (d) equals

1 − ∫ 1_D*∘g dP = 1 − ∫ (1_D∘g)* dP ≥ P(A),

so (a) holds. □

Note. In (a) or (b), if the direct image g[A] ∈ B, we could just set B := g[A]. But for any uncountable complete separable metric space Y, there exist a complete separable metric space S (for example, a countable product ℕ^∞ of copies of ℕ) and a continuous function f from S into Y such that the image f[S] is not a Borel set in Y (RAP, Theorem 13.2.1, Proposition 13.2.5). If f is only required to be Borel measurable, then S can also be any uncountable complete separable metric space (RAP, Theorem 13.1.1).
A function g satisfying any of the four equivalent conditions in Theorem 3.4.1 will be called perfect, or P-perfect. Coordinate projections on a product space are, as one would hope, perfect:

3.4.2 Proposition Suppose A = X × Y, P is a product probability ν × m, and g is the natural projection of A onto Y. Then g is P-perfect.

Proof Here P∘g⁻¹ = m. For any B ⊂ A let B_y := {x : (x, y) ∈ B}, y ∈ Y. If B is measurable, then by the Tonelli-Fubini theorem, C := {y : ν(B_y) > 0} is measurable, C ⊂ g[B], and P(B) ≤ m(C), so condition (a) of Theorem 3.4.1 holds. □
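Condition (a) for a coordinate projection can be verified exhaustively on a small finite product space (the spaces and weights below are made up; every subset is measurable here, so the check runs over all 2⁶ sets B):

```python
from itertools import product

X, Y = [0, 1, 2], [0, 1]
nu = {0: 0.2, 1: 0.3, 2: 0.5}       # law on X
m = {0: 0.4, 1: 0.6}                # law on Y;  P = nu x m,  g(x, y) = y

def check_condition_a(B):
    """Condition (a) of Theorem 3.4.1 for the projection g onto Y."""
    PB = sum(nu[x] * m[y] for (x, y) in B)
    # C := {y : nu(B_y) > 0} is measurable and contained in g[B]
    C = {y for y in Y if any((x, y) in B for x in X)}
    assert C == {y for (_, y) in B}                # here C equals g[B]
    assert sum(m[y] for y in C) >= PB - 1e-12      # Q(C) >= P(B)

pts = list(product(X, Y))
for mask in range(1 << len(pts)):
    check_condition_a({pts[i] for i in range(len(pts)) if mask >> i & 1})
```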
3.4.3 Theorem Let (Ω, A, P) be a probability space and (S, d) a metric space. Suppose that for each n = 0, 1, ..., (Yₙ, Bₙ) is a measurable space, gₙ a perfect measurable function from Ω into Yₙ, and fₙ a function from Yₙ into S, where f₀ has separable range and is measurable. Let Qₙ := P∘gₙ⁻¹ on Bₙ and suppose fₙ∘gₙ → f₀∘g₀ in outer probability as n → ∞. Then fₙ ⇒ f₀ as n → ∞, for fₙ on (Yₙ, Bₙ, Qₙ).
Before this is proved, an example will be given. Recall that for any measure space (X, S, μ) and C ⊂ X, the inner measure is defined by μ_*(C) := sup{μ(A) : A ∈ S, A ⊂ C}.

3.4.4 Proposition Theorem 3.4.3 can fail without the hypothesis that the gₙ be perfect.

Proof Let C ⊂ I := [0, 1] satisfy 0 = λ_*(C) < λ*(C) = 1 for Lebesgue measure λ (RAP, Theorem 3.4.4). Let P := λ*, giving a probability measure on the Borel sets of C (RAP, Theorem 3.3.6). Let Ω := C, f₀ ≡ 0, Yₙ := I, fₙ := 1_{I\C}, and let gₙ be the identity from C into Yₙ for all n. Then fₙ∘gₙ ≡ 0 for all n, so fₙ∘gₙ → f₀∘g₀ in outer probability (and in any other sense). Let Bₙ be the Borel σ-algebra on Yₙ = I for each n, and let G be the identity from I into ℝ. Then ∫* G(fₙ) dQₙ = ∫* fₙ dQₙ = 1 for n ≥ 1, while ∫ G(f₀) dQ₀ = 0, so fₙ does not converge to f₀ in law. □
After Theorem 3.4.3 is proved, it will follow that the gn in the last proof are not perfect, as can also be seen directly from condition (c) or (d) in Theorem 3.4.1.
Proof of Theorem 3.4.3 By Corollary 3.3.5, fₙ∘gₙ ⇒ f₀∘g₀. Let H be any bounded, continuous, real-valued function on S. Then by RAP, Theorem 4.1.11,

∫* H(fₙ(gₙ)) dP → ∫ H(f₀(g₀)) dP = ∫ H(f₀) dQ₀.

Also,

∫* H(fₙ(gₙ)) dP = ∫ H(fₙ(gₙ))* dP        by Theorem 3.2.1
                = ∫ (H∘fₙ)*(gₙ) dP        by Theorem 3.4.1
                = ∫ (H∘fₙ)* dQₙ           (RAP, Theorem 4.1.11)
                = ∫* H(fₙ) dQₙ            by Theorem 3.2.1,

and the theorem follows. □
In Proposition 3.4.2, X × Y could be an arbitrary product probability space, but projection is a rather special function. The following fact will show that all measurable functions on reasonable domain spaces are perfect.

Recall that a law P is called tight if sup{P(K) : K compact} = 1. A set 𝒫 of laws is called uniformly tight if for every ε > 0 there is a compact K such that P(K) > 1 − ε for all P ∈ 𝒫. Also, a metric space (S, d) is called universally measurable (u.m.) if for every law P on the completion of S, S is measurable for the completion of P (RAP, Section 11.5). So any complete metric space, or any Borel set in its completion, is u.m.
3.4.5 Theorem Let (S, d) be a u.m. separable metric space and P a probability measure on the Borel σ-algebra of S. Then any Borel measurable function g from S into a separable metric space Y is perfect for P.

Note. In view of Appendix C, the hypothesis that S be separable is not very restrictive.

Proof Let A be any Borel set in S with P(A) > 0, and let 0 < ε < P(A). By the extended Lusin theorem (Theorem D.1, Appendix D) there is a closed set F with P(F) > 1 − ε/2 such that g restricted to F is continuous. Since P is tight (RAP, Theorems 11.5.1 and 7.1.3), there is a compact set K ⊂ A with P(K) > ε. Then C := F ∩ K is compact, C ⊂ A, P(C) > ε/2 > 0, and g is continuous on C, so g[C] is compact, g[C] ⊂ g[A], and (P∘g⁻¹)(g[C]) ≥ P(C) > 0, so the conclusion follows from condition (b) of Theorem 3.4.1. □
Let (Ω, A, P) be a probability space and g a measurable function from Ω into Y, where (Y, B) is a measurable space. Let Q := P∘g⁻¹ on B. Call g quasiperfect for P, or P-quasiperfect, if for every C ⊂ Y with g⁻¹(C) ∈ A, C is measurable for the completion of Q. The probability space (Ω, A, P) is then called perfect if every real-valued function G on Ω, measurable for the usual Borel σ-algebra on ℝ, is quasiperfect.
Example. A measurable, quasiperfect function g on a finite set need not be perfect: let X := {a₁, a₂, a₃, a₄, a₅, a₆}, U := {a₁, a₂}, V := {a₃, a₄}, W := {a₅, a₆}, A := {∅, U, V, W, U∪V, U∪W, V∪W, X}, P(U) = P(V) = P(W) = 1/3, Y := {0, 1, 2}, g(a₁) := g(a₃) := 0, g(a₂) := g(a₅) := 1, g(a₄) := g(a₆) := 2. Let B := {∅, Y}. For C ⊂ Y, g⁻¹(C) ∈ A if and only if C ∈ B, so g is quasiperfect. But P(U) > 0 and g[U] does not include any nonempty set in B, so g is not perfect.
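This example is small enough to verify by brute force (the labels follow the text):

```python
from itertools import combinations

X = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
U, V, W = {'a1', 'a2'}, {'a3', 'a4'}, {'a5', 'a6'}
A = [set(), U, V, W, U | V, U | W, V | W, set(X)]   # sigma-algebra on X
g = {'a1': 0, 'a3': 0, 'a2': 1, 'a5': 1, 'a4': 2, 'a6': 2}
Y = {0, 1, 2}
B = [set(), Y]                                      # sigma-algebra on Y

def preimage(C):
    return {x for x in X if g[x] in C}

subsets_of_Y = [set(c) for r in range(4) for c in combinations(sorted(Y), r)]

# quasiperfect: g^{-1}(C) is in A exactly when C is in B (i.e. trivial)
assert all((preimage(C) in A) == (C in B) for C in subsets_of_Y)

# not perfect: P(U) = 1/3 > 0, but g[U] = {0, 1} contains no nonempty
# member of B, so condition (b) of Theorem 3.4.1 fails
gU = {g[x] for x in U}
assert gU == {0, 1}
assert all(not (S and S <= gU) for S in B)
```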
3.4.6 Proposition Any perfect function is quasiperfect.

Proof Let C ⊂ Y with A := g⁻¹(C) ∈ A. By Theorem 3.4.1 take B ⊂ g[A] with B ∈ B and Q(B) ≥ P(A). Then B ⊂ C, so Q(B) = P(g⁻¹(B)) ≤ P(g⁻¹(C)) = P(A), and thus Q(B) = P(A). So the inner measure Q_*(C) = (P∘g⁻¹)(C). Likewise Q_*(Y \ C) = (P∘g⁻¹)(Y \ C), so Q*(C) = (P∘g⁻¹)(C) and C is measurable for the completion of Q. □
3.5 Almost surely convergent realizations

First let's recall a theorem of Skorokhod (RAP, Theorem 11.7.2): if (S, d) is a complete separable metric space and Pₙ are laws on S converging to a law P₀, then on some probability space there exist S-valued measurable functions Xₙ such that Xₙ has law Pₙ for each n and Xₙ → X₀ almost surely. This section will prove an extension of Skorokhod's theorem to our current setup. Having almost uniformly convergent realizations shows that the definition of convergence in law for random elements is reasonable, and the realizations will be useful in some later proofs on convergence in law.

Suppose fₙ ⇒ f₀, where the fₙ are random elements, in other words functions, not necessarily measurable except for n = 0, defined on some probability spaces (Ωₙ, Aₙ, Qₙ) into a possibly nonseparable metric space S. We want to find random elements Yₙ "having the same laws" as the fₙ for each n such that Yₙ → Y₀ almost surely or, better, almost uniformly. At first look it isn't clear what "having the same laws" should mean for random elements fₙ, n ≥ 1, not
having laws defined on any nontrivial σ-algebra. A way that turns out to work is to define Yₙ := fₙ∘gₙ, where the gₙ are functions from some other probability space Ω with probability measure Q into Ωₙ such that each gₙ is measurable and Q∘gₙ⁻¹ = Qₙ for each n. Thus the argument of fₙ will have the same law Qₙ as before. It turns out, moreover, that the gₙ should be not only measurable but perfect. Before stating the theorem, here is an example to show that there may really be no way to define a σ-algebra on S on which laws could be defined and yield an equivalence as in the next theorem, even if S is a finite set.

Example. Let (Xₙ, Aₙ, Qₙ) = ([0, 1], B, λ) for all n (λ = Lebesgue measure, B = Borel σ-algebra). Take sets C(n) ⊂ [0, 1] with 0 = λ_*(C(n)) < λ*(C(n)) = 1/n² (RAP, Theorem 3.4.4). Let S be the two-point space {0, 1} with its usual metric. Then fₙ := 1_{C(n)} → 0 in law and almost uniformly, but each "law" Qₙ∘fₙ⁻¹ is defined only on the trivial σ-algebra {∅, S}. The only larger σ-algebra on S is 2^S, and no Qₙ∘fₙ⁻¹ for n ≥ 1 is defined on 2^S.
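On the line, the Skorokhod construction recalled at the start of this section can be realized explicitly by quantile transforms on Ω = (0, 1) with Lebesgue measure: Xₙ := Fₙ⁻¹(U) has law Pₙ and converges pointwise when the quantile functions do. A sketch with exponential laws of rate 1 + 1/n (the particular laws are invented for illustration):

```python
import math

def X(n, u):
    """Quantile transform of Exp(rate), rate = 1 + 1/n for n >= 1 and
    rate = 1 for n = 0; as a function of u in (0, 1) with Lebesgue
    measure, X(n, .) has exactly that exponential law."""
    rate = 1.0 + 1.0 / n if n >= 1 else 1.0
    return -math.log(1.0 - u) / rate

us = [i / 1000.0 for i in range(1, 1000)]
err = [max(abs(X(n, u) - X(0, u)) for u in us) for n in (1, 10, 100)]
assert err[0] > err[1] > err[2] > 0     # X_n -> X_0 pointwise on (0, 1)
```

Theorem 3.5.1 below generalizes this: the measurable, perfect maps gₙ play the role of the single uniform variable U here.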
3.5.1 Theorem Let (S, d) be any metric space, (Xₙ, Aₙ, Qₙ) any probability spaces, and fₙ a function from Xₙ into S for each n = 0, 1, .... Suppose f₀ has separable range S₀ and is measurable (for the Borel σ-algebra on S₀). Then fₙ ⇒ f₀ if and only if there exist a probability space (Ω, S, Q) and perfect measurable functions gₙ from (Ω, S) to (Xₙ, Aₙ) for each n = 0, 1, ..., such that Q∘gₙ⁻¹ = Qₙ on Aₙ for each n and fₙ∘gₙ → f₀∘g₀ almost uniformly as n → ∞.

Notes. Proposition 3.4.4 and the "if and only if" in Theorem 3.5.1 show that the hypothesis that the gₙ be perfect can't simply be dropped from the theorem.
Proof "If" follows from Proposition 3.3.1 and Theorem 3.4.3. "Only if" will be proved much as in RAP, Theorem 11.7.2. Let Ω be the Cartesian product Π_{n≥0} Xₙ × Iₙ, where each Iₙ is a copy of [0, 1]. Here gₙ will be the natural projection of Ω onto Xₙ for each n. Let P := Q₀∘f₀⁻¹ on the Borel σ-algebra of S, concentrated in the separable subset S₀. A set B ⊂ S will be called a continuity set (RAP, Section 11.1) for P if P(∂B) = 0, where ∂B is the boundary of B. Then:

3.5.2 Lemma For any ε > 0 there are disjoint open continuity sets Uⱼ, j = 1, ..., J, for some J < ∞, where for each j, diam Uⱼ := sup{d(x, y) : x, y ∈ Uⱼ} < ε, and with Σ_{j=1}^J P(Uⱼ) > 1 − ε.
Proof Let {xⱼ}_{j=1}^∞ be dense in S₀. Let B(x, r) := {y ∈ S : d(x, y) < r} for 0 < r < ∞ and x ∈ S₀. Then B(xⱼ, r) is a continuity set of P for all but at most countably many values of r. Choose rⱼ with ε/3 < rⱼ < ε/2 such that B(xⱼ, rⱼ) is a continuity set of P for each j. The continuity sets form an algebra (RAP, Proposition 11.1.4). Let

Uⱼ := B(xⱼ, rⱼ) \ ∪_{i<j} {y : d(xᵢ, y) ≤ rᵢ}.

Then the Uⱼ are disjoint open continuity sets of diameter < ε with Σ_{j=1}^∞ P(Uⱼ) = 1, so there is a J < ∞ with Σ_{j=1}^J P(Uⱼ) > 1 − ε. □
Now to continue the proof of Theorem 3.5.1: for each k = 1, 2, ..., by the last lemma take disjoint open continuity sets U_{kj} := U(k, j) of P, for j = 1, ..., J_k := J(k) < ∞, with diam(U_{kj}) < 1/k, P(U_{kj}) > 0, and

(3.5.3)    Σ_{j=1}^{J(k)} P(U_{kj}) > 1 − 2⁻ᵏ.

For any open set U in S with complement F, let d(x, F) := inf{d(x, y) : y ∈ F}. For r = 1, 2, ..., let F_r := {x : d(x, F) ≥ 1/r}. Then F_r is closed and F_r ↑ U as r → ∞. There is a continuous h_r on S with 0 ≤ h_r ≤ 1, h_r = 1 on F_r, and h_r = 0 outside F_{2r}: let h_r(x) := min(1, max(0, 2r·d(x, F) − 1)).

For each j and k, let F(k, j) := S \ U_{kj}. Take r := r(k, j) large enough so that P(F(k, j)_r) > (1 − 2⁻ᵏ)P(U_{kj}), where F(k, j)_r := {x : d(x, F(k, j)) ≥ 1/r} ⊂ U_{kj}. Let h_{kj} be the h_r as defined above for such an r, and H_{kj} the h_{2r}. For n large enough, say for n ≥ n_k, we have

∫_* h_{kj}(fₙ) dQₙ > (1 − 2⁻ᵏ)P(U_{kj})    and    ∫* H_{kj}(fₙ) dQₙ < (1 + 2⁻ᵏ)P(U_{kj})

for all j = 1, ..., J_k. We may assume n₁ < n₂ < .... For every n = 0, 1, ..., let f_{kjn} := (h_{kj}∘fₙ)_* for Qₙ, so that by Theorem 3.2.1,

∫_* h_{kj}(fₙ) dQₙ = ∫ f_{kjn} dQₙ,    0 ≤ f_{kjn} ≤ h_{kj}(fₙ),

and f_{kjn} is Aₙ-measurable. For n ≥ 1 let B_{kjn} := {f_{kjn} > 0} ∈ Aₙ, and let B_{kj0} := f₀⁻¹(U_{kj}) ∈ A₀. For each k and n, the B_{kjn} ⊂ fₙ⁻¹(U_{kj}) are disjoint for j = 1, ..., J_k, and H_{kj}(fₙ) = 1 on B_{kjn}, so for n ≥ n_k,

(3.5.4)    (1 − 2⁻ᵏ)P(U_{kj}) < Qₙ(B_{kjn}) < (1 + 2⁻ᵏ)P(U_{kj}),
and Q₀(B_{kj0}) = P(U_{kj}). Let Tₙ := Xₙ × Iₙ, and let μₙ be the product law Qₙ × λ on Tₙ, where λ is Lebesgue measure on the Borel σ-algebra B of Iₙ. For each k ≥ 1, let

C_{kjn} := B_{kjn} × [0, F(k, j, n)] ⊂ Tₙ,
D_{kjn} := B_{kj0} × [0, G(k, j, n)] ⊂ T₀,

where F and G are defined so that

μₙ(C_{kjn}) = μ₀(D_{kjn}) = min(Qₙ(B_{kjn}), Q₀(B_{kj0})).

Then for each k, j, and n ≥ n_k, we have by (3.5.4), since 1/(1 + 2⁻ᵏ) > 1 − 2⁻ᵏ,

(3.5.5)    1 − 2⁻ᵏ < min(F, G)(k, j, n) ≤ max(F, G)(k, j, n) = 1.

Let

C_{k0n} := Tₙ \ ∪_{j=1}^{J(k)} C_{kjn},    D_{k0n} := T₀ \ ∪_{j=1}^{J(k)} D_{kjn}.
For k = 0 let J₀ := J(0) := 0, C₀₀ₙ := Tₙ, D₀₀ₙ := T₀, and n₀ := 0. For each n = 1, 2, ..., let k(n) be the unique k such that n_k ≤ n < n_{k+1}. Then for n ≥ 1, Tₙ is the disjoint union of the sets W_{nj} := C_{k(n)jn}, j = 0, 1, ..., J_{k(n)}. We also have:

(3.5.6)    if j ≥ 1 and (v, s) ∈ W_{nj}, then v ∈ B_{k(n)jn}, so fₙ(v) ∈ U_{k(n)j}.

Next, T₀ is the disjoint union of the sets E_{nj} := D_{k(n)jn}. Then μₙ(W_{nj}) = μ₀(E_{nj}) for each n and j, and if j ≥ 1 or k(n) = 0, then μ₀(E_{nj}) > 0. For x in T₀ and each n,

(3.5.7)    let j(n, x) be the j such that x ∈ E_{nj}.

Let L := {x ∈ T₀ : μ₀(E_{nj(n,x)}) > 0 for all n}. Then T₀ \ L ⊂ ∪ᵢ E_{n(i)0} for some (possibly empty or finite) sequence n(i) such that μ₀(E_{n(i)0}) = 0 for all i. Thus μ₀(L) = 1. For x ∈ L and any measurable set B ⊂ Tₙ, in other words B in the product σ-algebra Aₙ ⊗ B, let j := j(n, x), recall that μₙ(W_{nj}) = μ₀(E_{nj}), and let

(3.5.8)    P_{nj}(B) := μₙ(B ∩ W_{nj}) / μ₀(E_{nj}),    P_{nx} := P_{nj(n,x)}.

Then P_{nx} is a probability measure on Aₙ ⊗ B. Let p_x be the product measure Π_{n=1}^∞ P_{nx} on T := Π_{n=1}^∞ Tₙ (RAP, Theorem 8.2.2).
3.5.9 Lemma For any measurable set H ⊂ T (for the infinite product σ-algebra with Aₙ ⊗ B on each factor), x ↦ p_x(H) is measurable on (T₀, A₀ ⊗ B).

Proof Let 𝓗 be the collection of all H for which the assertion holds. Given n, P_{nx} is one of finitely many laws, each obtained for x in a measurable subset E_{nj}. Thus if Yₙ is the natural projection of T onto Tₙ and H = Yₙ⁻¹(B) for some B ∈ Aₙ ⊗ B, then H ∈ 𝓗. If H = ∩_{i∈M} Y_{m(i)}⁻¹(Bᵢ), where Bᵢ ∈ A_{m(i)} ⊗ B and M is finite, we may assume the m(i) are distinct; then p_x(H) = Π_{i∈M} p_x(Y_{m(i)}⁻¹(Bᵢ)), so H ∈ 𝓗. Then any finite disjoint union of such intersections is in 𝓗, and such unions form an algebra. If Hₙ ∈ 𝓗 and Hₙ ↑ H or Hₙ ↓ H, then H ∈ 𝓗. Since the smallest monotone class containing an algebra is a σ-algebra (RAP, Theorem 4.4.2), the lemma follows. □
Now returning to the proof of Theorem 3.5.1, Ω = T₀ × T. For any set C ⊂ Ω, measurable for the product σ-algebra, and x ∈ T₀, let C_x := {y ∈ T : (x, y) ∈ C}, and Q(C) := ∫ p_x(C_x) dμ₀(x). Here x ↦ p_x(C_x) is measurable if C is a finite union of products Aᵢ × Fᵢ, where Aᵢ ∈ A₀ ⊗ B and Fᵢ is product measurable in T; such a union equals a disjoint one (RAP, Proposition 3.2.2). Thus by monotone classes again, x ↦ p_x(C_x) is measurable on T₀ for any product-measurable set C ⊂ Ω, so Q is defined. It is then clearly a countably additive probability measure.
Let p be the natural projection of Tₙ onto Xₙ. Recall that P_{nx} = P_{nj} for all x ∈ E_{nj}, by (3.5.7) and (3.5.8). The marginal of Q on Xₙ, in other words Q∘gₙ⁻¹, is by (3.5.8) again

Σ_{j=0}^{J(k(n))} μ₀(E_{nj}) P_{nj}∘p⁻¹ = μₙ∘p⁻¹ = Qₙ.

Thus Q has marginal Qₙ on Xₙ for each n, as desired. By (3.5.3),

Σ_{k=1}^∞ Q₀(X₀ \ ∪_{j=1}^{J(k)} f₀⁻¹(U_{kj})) ≤ Σₖ 2⁻ᵏ < ∞.

So Q₀-almost every y ∈ X₀ belongs to ∪ⱼ f₀⁻¹(U_{kj}) for all large enough k. Also, if t ∈ I₀ and t < 1, then by (3.5.5), t < G(k, j, n) for all j ≥ 1 as soon
as 1 − 2⁻ᵏ ≥ t and n ≥ n_k. Thus for μ₀-almost all (y, t), there is an m such that (y, t) ∈ ∪_{j=1}^{J(k(n))} E_{nj} for all n ≥ m. If x := (y, t) ∈ E_{nj} for j ≥ 1, then y ∈ B_{k(n)j0}, so f₀(y) ∈ U_{k(n)j}. Also, by (3.5.8), P_{nx} = P_{nj} is concentrated in W_{nj}, and for (v, s) ∈ W_{nj}, fₙ(v) ∈ U_{k(n)j} by (3.5.6). Since diam(U_{kj}) < 1/k for each j ≥ 1,

Q*(d(fₙ(gₙ), f₀(g₀)) > 1/k(n) for some n ≥ m) ≤ μ₀({(y, t) : (y, t) ∈ E_{n0} for some n ≥ m}) → 0

as m → ∞, so fₙ(gₙ) → f₀(g₀) almost uniformly.

Lastly, let's show that the gₙ are perfect. Suppose Q(A) > 0 for some measurable A ⊂ Ω; then Q(A) = ∫ p_x(A_x) dμ₀(x). First let n ≥ 1. Then for some x, p_x(A_x) > 0; if μ₀(E_{n0}) = 0, we take x ∉ E_{n0}. Now T = Tₙ × T⁽ⁿ⁾, where T⁽ⁿ⁾ := Π_{m≠n} Tₘ. Then on T, p_x = P_{nx} × Q_{nx} for the law Q_{nx} := Π_{m≠n} P_{mx} on T⁽ⁿ⁾. Let A(x) := A_x. By the Tonelli-Fubini theorem, p_x(A_x) = ∫∫ 1_{A(x)}(u, v) dP_{nx}(u) dQ_{nx}(v). Thus for some v, ∫ 1_{A(x)}(u, v) dP_{nx}(u) > 0. Choose and fix such a v as well as x. Now P_{nx} = P_{nj} for j = j(n, x) with μ₀(E_{nj}) > 0. Let u = (s, t), s ∈ Xₙ, t ∈ Iₙ. Then, since P_{nj} is Qₙ × λ restricted to a set of positive measure and normalized,

0 < ∫∫ 1_{A(x)}(s, t, v) dQₙ(s) dt.

Choose and fix a t with 0 < ∫ 1_{A(x)}(s, t, v) dQₙ(s), and let C := {s ∈ Xₙ : (s, t, v) ∈ A_x}. Then Qₙ(C) > 0. Clearly C ⊂ gₙ[A], so gₙ is perfect for n ≥ 1 by Theorem 3.4.1.

To show g₀ is perfect, note that μ₀ = Q₀ × λ and that p_x(A_x) > 0 for x = (y, t) in a set X with μ₀(X) > 0. There is a t ∈ I₀ such that Q₀(C) > 0, where C := {y : (y, t) ∈ X}. Then C ⊂ g₀[A], so g₀ is perfect, finishing the proof of Theorem 3.5.1. □
3.6 Conditions equivalent to convergence in law

Conditions equivalent to convergence of laws on separable metric spaces are given in the portmanteau theorem (RAP, Theorem 11.1.1) and the metrization theorem (RAP, Theorem 11.3.3). Here, the conditions will be extended to general random elements for the theory being developed in this chapter. For any probability space (Ω, A, P) and real-valued function f on Ω let E*f := ∫* f dP and E_*f := ∫_* f dP. If (S, d) is a metric space and f is a real-valued function on S, recall (RAP, Section 11.2) that the Lipschitz seminorm of f is defined by

‖f‖_L := sup{|f(x) − f(y)| / d(x, y) : x ≠ y},

and f is called a Lipschitz function if ‖f‖_L < ∞. The bounded Lipschitz norm is defined by ‖f‖_BL := ‖f‖_L + ‖f‖_∞, where ‖f‖_∞ := sup_x |f(x)|. Then f is called a bounded Lipschitz function if ‖f‖_BL < ∞, and ‖·‖_BL is a norm on the space of all such functions.
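The cutoff functions g(x) = max(0, 1 − d(x, K)/ε), used repeatedly in the proofs below, are the standard examples: ‖g‖_∞ ≤ 1 and ‖g‖_L ≤ 1/ε, so ‖g‖_BL ≤ 1 + 1/ε. A grid check on the line with K = [0, 1] (the choices of K and ε are arbitrary):

```python
eps = 0.25

def dist_K(x):                 # d(x, K) for K = [0, 1] in the real line
    return max(0.0, -x, x - 1.0)

def g(x):                      # 1 on K, 0 outside the eps-neighborhood of K
    return max(0.0, 1.0 - dist_K(x) / eps)

grid = [i / 100.0 for i in range(-300, 401)]
sup_norm = max(abs(g(x)) for x in grid)
lip = max(abs(g(x) - g(y)) / (y - x)       # slopes over adjacent grid points
          for x, y in zip(grid, grid[1:]))
assert sup_norm <= 1.0
assert lip <= 1.0 / eps + 1e-9             # ||g||_L <= 1/eps
assert g(0.5) == 1.0 and g(1.0 + eps) == 0.0
```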
The extended portmanteau theorem about to be proved is an adaptation of RAP, Theorem 11.1.1, plus some further facts based on the last section (Theorem 3.5.1). The proof to be given includes relatively easy implications, some of which consist of putting in stars at appropriate places in the proofs in RAP.

3.6.1 Theorem Let (S, d) be any metric space. For n = 0, 1, 2, ..., let (Xₙ, Aₙ, Qₙ) be a probability space and fₙ a function from Xₙ into S. Suppose f₀ has separable range S₀ and is measurable. Let P := Q₀∘f₀⁻¹ on S. Then the following are equivalent:

(a) fₙ ⇒ f₀;
(a′) lim sup_{n→∞} E*G(fₙ) ≤ EG(f₀) for each bounded continuous real-valued G on S;
(b) E*G(fₙ) → EG(f₀) as n → ∞ for every bounded Lipschitz function G on S;
(b′) (a′) holds for all bounded Lipschitz G on S;
(c) sup{|E*G(fₙ) − EG(f₀)| : ‖G‖_BL ≤ 1} → 0 as n → ∞;
(d) for any closed F ⊂ S, P(F) ≥ lim sup_{n→∞} Qₙ*(fₙ ∈ F);
(e) for any open U ⊂ S, P(U) ≤ lim inf_{n→∞} (Qₙ)_*(fₙ ∈ U);
(f) for any continuity set A of P in S, Qₙ*(fₙ ∈ A) → P(A) and (Qₙ)_*(fₙ ∈ A) → P(A) as n → ∞;
(g) there exist a probability space (Ω, S, Q) and measurable functions gₙ from Ω into Xₙ and hₙ from Ω into S such that the gₙ are perfect, Q∘gₙ⁻¹ = Qₙ and Q∘hₙ⁻¹ = P for all n, and d(fₙ∘gₙ, hₙ) → 0 almost uniformly.

Moreover, (g) remains equivalent if any of the following changes are made in it: "almost uniformly" can be replaced by "in outer probability"; we can take hₙ = f₀∘γₙ for some measurable functions γₙ from Ω into X₀, which can be taken to be perfect; and we can take the γₙ to be all the same, γₙ = γ₁ for all n.
Proof Clearly (a) implies (a′). Conversely, applying (a′) to −G gives

lim inf_{n→∞} E*G(fₙ) ≥ lim inf_{n→∞} E_*G(fₙ) ≥ EG(f₀),

and (a) follows, so (a) and (a′) are equivalent. Clearly (a) implies (b), which is equivalent to (b′) just as (a) is to (a′).

To show (b) implies (c), let T be the completion of S; all the fₙ take values in S ⊂ T. Each bounded Lipschitz function G on S extends uniquely to such a function on T, and the functions G∘fₙ on Xₙ stay exactly the same, so in this step we can assume S and S₀ are complete. Let ε > 0. By Ulam's theorem (RAP, Theorem 7.1.4), take a compact K ⊂ S₀ with P(K) = Q₀(f₀ ∈ K) > 1 − ε. Recall that d(x, K) := inf{d(x, y) : y ∈ K} and K^ε := {x : d(x, K) < ε}. Let g(x) := max(0, 1 − d(x, K)/ε) for x ∈ S. Then g is a bounded Lipschitz function with ‖g‖_BL ≤ 1 + 1/ε < ∞, and clearly 1_K ≤ g ≤ 1_{K^ε}. Since, by (b) applied to ±g, E_*g(fₙ) → Eg(f₀) as n → ∞, it follows that for n large enough

(3.6.2)    (Qₙ)_*(fₙ ∈ K^ε) ≥ E_*g(fₙ) ≥ Eg(f₀) − ε > 1 − 2ε.

Let B be the set of all G on S with ‖G‖_BL ≤ 1. The functions in B are uniformly bounded and equicontinuous, so their restrictions to K are totally bounded for the supremum distance over K by the Arzelà-Ascoli theorem (RAP, Theorem 2.4.7). Let G₁, ..., G_J, for some finite J, be functions in B such that for each G ∈ B, sup_{x∈K} |(G − Gⱼ)(x)| < ε for some j = 1, ..., J. Next, for any G ∈ B, choose such a j. Then

(3.6.3)    |E*G(fₙ) − EG(f₀)| ≤ |E*G(fₙ) − E*Gⱼ(fₙ)| + |E*Gⱼ(fₙ) − EGⱼ(f₀)| + |E(Gⱼ(f₀) − G(f₀))|,

a sum of three terms. For the last term, splitting the integral into two parts according as f₀ ∈ K or not, the first part is bounded by ε and the second by 2ε, since |Gⱼ − G| ≤ 2 everywhere and Q₀(f₀ ∉ K) < ε. The middle term on the right side of (3.6.3) is bounded above by max_{j≤J} |E*Gⱼ(fₙ) − EGⱼ(f₀)|, which is less than ε for n large enough, by (b). The first term on the right in (3.6.3) is a sum of two parts, one over a measurable subset Aₙ of Xₙ on which fₙ ∈ K^ε, with Qₙ(Aₙ) > 1 − 2ε by (3.6.2), and the other over the complement of Aₙ. Since G, Gⱼ ∈ B and |G − Gⱼ| < ε on K, we have |G − Gⱼ| < 3ε on K^ε. It follows that |(G∘fₙ)* − (Gⱼ∘fₙ)*| ≤ 3ε
on Aₙ, so the first part is bounded above by 3ε. The second part is bounded by 2(1 − Qₙ(Aₙ)) < 4ε. So for n large enough the left side of (3.6.3) is less than 11ε, uniformly for G ∈ B, and (c) follows. Clearly (c) implies (b), so (b) and (c) are equivalent.

Next, (b) implies (d) by taking bounded Lipschitz functions gₖ decreasing down to 1_F as k → ∞, specifically the functions g in the proof that (b) implies (c), with F in place of K and ε = 1/k. (d) and (e) are equivalent by taking complements, and (d) and (e) together easily imply (f). For any bounded continuous function G, all but countably many of the sets {G < t} are continuity sets of P. It follows that G can be approximated uniformly within ε by a simple function Σᵢ tᵢ 1_{Aᵢ}, where the Aᵢ are continuity sets {tᵢ₋₁ ≤ G < tᵢ}. So (f) implies (a), and (a) through (f) are all equivalent.

Theorem 3.5.1 says that (a) implies (g), in the strongest form, where hₙ = h₁ = f₀∘γ₁ and γ₁ is perfect. The weakest form of (g), with convergence in outer probability and hₙ depending on n and not necessarily a composition of f₀ with any (perfect) function, will be shown to imply (b′). Given ε > 0, take n large enough so that

Q*{d(fₙ∘gₙ, hₙ) > ε} < ε.

So on a measurable set Aₙ with Q(Aₙ) > 1 − ε, we have d(fₙ∘gₙ, hₙ) ≤ ε, and thus for any G ∈ B (in other words, ‖G‖_BL ≤ 1),

|G(fₙ(gₙ)) − G(hₙ)| ≤ ε·1_{Aₙ} + 2·1_{Ω\Aₙ}.

It follows that

E*G(fₙ(gₙ)) ≤ EG(hₙ) + 3ε = EG(f₀) + 3ε,

since hₙ has the same distribution as f₀. Also,

E*G(fₙ(gₙ)) = E(((G∘fₙ)∘gₙ)*) = E((G∘fₙ)*∘gₙ)

by Theorem 3.2.1 and since gₙ is perfect, where all these expectations are with respect to Q. The latter integral, by the image measure theorem (RAP, Theorem 4.1.11), equals E*G(fₙ) (for Qₙ). Letting ε ↓ 0, (b′) follows, so all the forms of (g) are equivalent to any of (a) through (f). □

Theorem 1.1.1, in a special case, proved a form of convergence which in that case is now easily seen to imply the conditions in Theorem 3.6.1: the one in (c), for example. It will be shown that convergence in law is also equivalent to convergence in some analogues of the Prokhorov and dual-bounded-Lipschitz metrics, which
3.6 Conditions equivalent to convergence in law
metrize convergence of laws on separable metric spaces, as shown in RAP, Theorem 11.3.3. We will have analogues of metrics, rather than actual metrics, because of the asymmetry between the nonmeasurable random elements fn and the limiting measurable random variable f0.
Definitions. Let (Xm, Am, Qm) be probability spaces, m = 0, 1, and let (S, d) be a metric space. Let fm be functions from Xm into S, m = 0, 1, such that f0 is measurable and has separable range. Let P := Q0 ∘ f0^{−1}. Then let
    β(f1, f0) := sup{|E*G(f1) − EG(f0)| : ‖G‖_BL ≤ 1},
and for F ranging over nonempty closed subsets of S,
    ρ(f1, f0) := inf{ε > 0 : sup_F [P(F) − (Q1)_*(f1 ∈ F^ε)] ≤ ε}.

3.6.4 Theorem For any metric space (S, d), probability spaces (Xm, Am, Qm), and functions fm from Xm into S, where f0 has separable range and is measurable, the following are equivalent:

(i) fm ⇒ f0;
(ii) β(fm, f0) → 0 as m → ∞;
(iii) ρ(fm, f0) → 0 as m → ∞.
Proof (i) and (ii) are equivalent just as (a) and (c) are in Theorem 3.6.1. To show that (ii) implies (iii), we have:
3.6.5 Lemma For any f1 and f0 as in the statement of Theorem 3.6.4,

    ρ(f1, f0) ≤ min(1, (2β(f1, f0))^{1/2}).
Proof Let F be any closed set in S and P = Q0 ∘ f0^{−1}. Given 0 < ε < 1, let g(x) := max(0, 1 − d(x, F)/ε) as before. Then ‖g‖_BL ≤ 1 + 1/ε, and

    (Q1)_*(f1 ∈ F^ε) ≥ ∫_* g(f1) dQ1 ≥ ∫ g(f0) dQ0 − β(f1, f0)(1 + 1/ε) ≥ P(F) − ε

if β(f1, f0) ≤ ε²/2. Clearly ρ(f1, f0) ≤ 1 in all cases, so letting ε ↓ (2β(f1, f0))^{1/2} when the latter is < 1, the lemma follows.
Now to continue the proof of Theorem 3.6.4, applying the lemma to fm in place of f1 for each m shows that (ii) implies (iii). To show that (iii) implies (i), let U be an open set in S. For k = 1, 2, …, let Fk := {x : d(x, y) < 1/k
implies y ∈ U}. Then the Fk are closed sets and Fk ↑ U as k → ∞. For n large enough, ρ(fn, f0) ≤ 1/k, and then P(Fk) ≤ (Qn)_*(fn ∈ Fk^{1/k}) + 1/k ≤ (Qn)_*(fn ∈ U) + 1/k, so P(Fk) ≤ liminf_{n→∞} (Qn)_*(fn ∈ U) + 1/k. Then letting k → ∞ gives P(U) ≤ liminf_{n→∞} (Qn)_*(fn ∈ U). This is (e) of Theorem 3.6.1, so the proof of Theorem 3.6.4 is complete.

Next is a version of Theorem 3.6.4 where we have two indices.
3.6.6 Theorem Let (S, d) be a metric space and (Ω, S, Q) a probability space. Suppose that for each m, n = 1, 2, …, fmn is a function from Ω into S, and f0 is a measurable function from Ω into S with separable range. Then the following are equivalent:

(i) fmn ⇒ f0, that is, for every bounded continuous real function G on S,
    ∫* G(fmn) dQ → ∫ G(f0) dQ as m, n → ∞;
(ii) β(fmn, f0) → 0 as m, n → ∞;
(iii) ρ(fmn, f0) → 0 as m, n → ∞.

Proof For a double sequence amn of real numbers, let

    lim sup_{m,n→∞} amn := inf_k sup_{m≥k, n≥k} amn,
    lim inf_{m,n→∞} amn := sup_k inf_{m≥k, n≥k} amn.
Then the steps in the proof of Theorem 3.6.1 showing that (a) through (f) are equivalent extend directly to convergence as m → ∞ and n → ∞ if lim sup_{n→∞} is replaced by lim sup_{m,n→∞} (not by lim sup_{m→∞} lim sup_{n→∞} or the iterated lim sup in the reverse order, which may be different), and likewise for lim inf. Thus, the proof of Theorem 3.6.4 also extends.
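The warning that the joint lim sup may differ from either iterated lim sup can be seen in a small numerical experiment; the double sequence amn = mn/(m² + n²) below is an illustrative choice, not from the text (each row and column tends to 0, yet ann = 1/2 for every n):

```python
# Joint vs. iterated lim sup for the double sequence a_{mn} = mn/(m^2 + n^2).

def a(m, n):
    return m * n / (m * m + n * n)

K = 300  # truncation level standing in for "k large"

# joint lim sup: inf_k sup_{m >= k, n >= k} a_{mn}, approximated on a window;
# the diagonal terms a_{nn} = 1/2 keep it at 1/2
joint = max(a(m, n) for m in range(K, 2 * K) for n in range(K, 2 * K))

# iterated lim sup: first n -> oo at fixed m (each row limit is 0), then m -> oo
N = 10**7
iterated = max(a(m, N) for m in range(K, 2 * K))

print(joint, iterated)  # joint stays at 1/2; iterated is near 0
```

So replacing lim sup_{m,n→∞} by an iterated lim sup would genuinely change the statements being proved.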
We have the following:
3.6.7 Continuous mapping theorem Under the conditions of Theorem 3.6.4, if (T, e) is another metric space, G is a continuous function from S into T, and fm ⇒ f0, then G(fm) ⇒ G(f0).

Proof This follows directly from the definition of convergence in law ⇒.
We also have
3.6.8 Theorem Suppose fm, m ≥ 0, are measurable random variables taking values in a separable metric space S, so that the laws L(fm) exist on the Borel σ-algebra of S. Then convergence fm ⇒ f0 is equivalent to convergence of the laws L(fm) → L(f0) in the usual sense.

Proof For any bounded continuous real-valued function G on S, the functions G(fm) are measurable, so upper integrals reduce to integrals, and the result follows from the definitions.
3.7 Asymptotic equicontinuity and Donsker classes

Recall from Section 3.1 the definitions of the empirical measures Pn, the empirical process νn, the pseudometric ρP, and the Gaussian process GP. Recall that a set F ⊂ L²(P) is called pregaussian if a GP process restricted to F exists whose sample functions f ↦ GP(f)(ω) are (almost) all bounded and uniformly continuous for ρP on F. Such a GP process is called coherent if, in addition, its sample functions are prelinear on F as in Lemma 2.5.3.
3.7.1 Proposition For any probability measure P and pregaussian set F ⊂ L²(P), the symmetric convex hull sco(F) of F is also pregaussian, and there exists a GP process defined on the linear span of F and the constant functions which is 0 on the constant functions and coherent on sco(F).

Proof A GP process is isonormal on the space L²(P) for the semi-inner product (f, g)_{0,P} := P(fg) − (Pf)(Pg), and GP is 0 on constant functions. So Theorem 2.5.5 on isonormal processes applies and gives the result.

For a signed measure ν and a measurable function f such that ∫ f dν is defined, let ν(f) := ∫ f dν.
A class F of functions will be said to satisfy the asymptotic equicontinuity condition for P and a pseudometric τ on F, or F ∈ AEC(P, τ) for short, if for every ε > 0 there are a δ > 0 and an n0 large enough such that for n ≥ n0,

    Pr*{sup{|νn(f − g)| : f, g ∈ F, τ(f, g) < δ} > ε} < ε.

Then F ∈ AEC(P) will mean F ∈ AEC(P, ρP).
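The probability in this condition can be estimated by simulation for a concrete class. The sketch below (all choices — the class, grid, sample sizes, and tolerances — are illustrative, not from the text) uses F = {1_{[0,t]} : t on a finite grid} under P = U[0, 1], where ρP(1_{[0,s]}, 1_{[0,t]})² = |t − s| − (t − s)²:

```python
import bisect, math, random

def aec_probability(n, delta, eps, trials=100):
    """Monte Carlo estimate of Pr*{sup{|nu_n(f - g)| : rho_P(f,g) < delta} > eps}
    for F = {1_[0,t] : t on a grid} under P = U[0,1] (illustrative sketch)."""
    grid = [k / 100 for k in range(101)]
    exceed = 0
    for _ in range(trials):
        x = sorted(random.random() for _ in range(n))
        # nu_n(1_[0,t]) = sqrt(n) * (P_n[0,t] - t)
        nu = [math.sqrt(n) * (bisect.bisect_right(x, t) / n - t) for t in grid]
        sup = max((abs(nu[i] - nu[j])
                   for i, s in enumerate(grid) for j, t in enumerate(grid)
                   if 0 < abs(t - s) - (t - s) ** 2 < delta ** 2),
                  default=0.0)
        exceed += sup > eps
    return exceed / trials

random.seed(0)
print(aec_probability(1000, 0.2, 1.0))
```

For this class the estimated probability shrinks as δ decreases at fixed ε, consistent with the condition holding (by Donsker's theorem, Section 1.1).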
3.7.2 Theorem Let F ⊂ L²(X, A, P). Then the following are equivalent:

(I) F is a Donsker class for P, in other words F is P-pregaussian and νn ⇒ GP in ℓ∞(F);
(II) (a) F is totally bounded for ρP, and (b) F satisfies the asymptotic equicontinuity condition for P, F ∈ AEC(P);
(III) there is a pseudometric τ on F such that F is totally bounded for τ and F ∈ AEC(P, τ).
Proof If F is a Donsker class and ε > 0, then since F is pregaussian, it is totally bounded for ρP by Theorem 2.3.5, so (a) holds. Take 0 < δ < ε/3 such that for any coherent GP process,

    Pr{sup{|GP(f) − GP(g)| : ρP(f, g) < δ} > ε/3} < ε/2.

By almost surely convergent realizations (Theorem 3.5.1), for n ≥ n0 large enough we can assume Pr*{‖νn − GP‖_F > ε/3} < ε/2. If ‖νn − GP‖_F ≤ ε/3 and |GP(f) − GP(g)| ≤ ε/3, then |νn(f) − νn(g)| ≤ ε. So the asymptotic equicontinuity condition holds with the δ and n0 chosen, and (b) holds, so (I) implies (II).
(II) implies (III) directly, with τ = ρP.
To show (III) implies (I), suppose F is τ-totally bounded and F ∈ AEC(P, τ). Let UC := UC(F) denote the set of all real-valued functions on F uniformly continuous for τ. Then UC is a separable subspace of ℓ∞(F) for ‖·‖_F, since F is totally bounded and UC equals, in effect, the space of continuous functions on the compact completion (RAP, Corollary 11.2.5). For any finite subset G of F, by the finite-dimensional central limit theorem (RAP, Theorem 9.5.6) we can let n → ∞ in the asymptotic equicontinuity condition and so replace νn(f − g) by GP(f) − GP(g) for f, g ∈ G. Given ε > 0, take δ := δ(ε) > 0 and n0 := n0(ε) < ∞ from the definition of the asymptotic equicontinuity condition. Then δ and n0 don't depend on G ⊂ F. So we can let G increase up to a countable dense set H ⊂ F. GP has sample functions almost surely uniformly continuous on H.
If f0 ∈ F and fk ∈ H are such that τ(fk, f0) → 0, then by applying the finite-dimensional central limit theorem to the finite set G = {f0, f1, …, fk} for each k, applying the asymptotic equicontinuity condition, and letting k → ∞, we see that GP is almost surely uniformly continuous for τ on the set {fj}_{j≥0}. So GP(fk) → GP(f0) almost surely. Thus for each f ∈ F, the almost sure limit of GP(hk) as hk → f through H, which exists by uniform continuity, equals GP(f) a.s., so the almost sure limit defines a GP process. This GP has uniformly continuous sample functions for τ on F. So by Theorem 2.8.6, since
GP is isonormal for (·, ·)_{0,P}, it follows that F is pregaussian. So GP has a law µ3 defined on the Borel sets of the separable Banach space UC.

Given ε > 0, take δ > 0 from the asymptotic equicontinuity condition and a finite set G ⊂ F such that for each f ∈ F there is a g ∈ G such that τ(f, g) < δ. Then R^G is the set of all real-valued functions on G. Let µ2 be the law of GP on R^G and let µ23 be the law on R^G × UC where GP on G in R^G is just the restriction of GP on UC. So µ23 has marginals µ2 and µ3. Let µ1,n be the law of νn on G, so µ1,n is also defined on R^G. Then by the finite-dimensional central limit theorem again, the laws µ1,n converge to µ2 on R^G. So for the Prokhorov metric ρ, since it metrizes convergence of laws (RAP, Theorem 11.3.3), ρ(µ1,n, µ2) < ε for n large enough. Take n ≥ n0 also, then fix n. By Strassen's theorem (RAP, Corollary 11.6.4), there is a law µ12 on R^G × R^G with marginals µ1,n and µ2 such that µ12{(x, y) : |x − y| > ε} ≤ ε. By the Vorob'ev–Berkes–Philipp theorem (1.1.10), there is a Borel measure µ123 on R^G × R^G × UC having marginals µ12 and µ23 on the appropriate spaces.

The next step is to link up νn with its restriction to the finite set G. The Vorob'ev–Berkes–Philipp theorem may not apply here since νn on F may not be in a Polish space, at least not one that seems apparent. (About nonmeasurability on nonseparable spaces see the remarks at the end of Section 1.1.) Here we can use instead:
3.7.3 Lemma Let S and T be Polish spaces and (Ω, A, P) a probability space. Let Q be a law on S × T with marginal q on S. Let V be a random variable on Ω with values in S and law L(V) = q. Suppose there is a real random variable U on Ω independent of V with continuous distribution function F_U. Then there is a random variable W : Ω → T such that the joint law L(V, W) of (V, W) is Q.

Proof Two metric spaces X, Y are called Borel-isomorphic if there is a one-to-one Borel measurable map of X onto Y with Borel measurable inverse. Every Polish space is Borel-isomorphic to some compact subset of [0, 1]: either the whole interval, a finite set, or a convergent sequence and its limit (RAP, Theorem 13.1.1). Since the lemma involves only measurability and not topological properties of the Polish spaces, we can assume S = T = [0, 1]. Here we need the fact that for any real-valued random variable X with continuous distribution function F, F(X) has a uniform distribution on [0, 1], which can be seen as follows. For −∞ < t < ∞, the probability that X ≤ t equals F(t). Now X ≤ t implies F(X) ≤ F(t). On the other hand, the probability that F(X) ≤ F(t) is the supremum of the probabilities that X ≤ y for y such that F(y) ≤ F(t), and this supremum is at most F(t). So the probability that X ≤ t equals the probability
that F(X) ≤ F(t). Since F is continuous, for any s with 0 < s < 1 there is a t ∈ R with F(t) = s, so the probability that F(X) ≤ s equals F(t) = s, and F(X) is uniformly distributed on [0, 1] as claimed. So, taking F_U(U), we can assume U is uniformly distributed on [0, 1].

By way of regular conditional probabilities (RAP, Section 10.2; Bauer, 1981) we can write Q = ∫ Q_x dq(x), where for each x, Q_x is a probability measure on T, so that for any measurable set A in (the square) S × T, Q(A) = ∫∫ 1_A(x, y) dQ_x(y) dq(x) (RAP, Theorems 10.2.1 and 10.2.2). Let F_x be the distribution function of Q_x and

    F_x^{−1}(t) := inf{u : F_x(u) ≥ t},    0 < t < 1.

Then for any real z and 0 < t < 1, F_x^{−1}(t) ≤ z if and only if F_x(z) ≥ t. Now x ↦ F_x(z) is measurable for any fixed z. It follows that x ↦ F_x^{−1}(t) is measurable for each t, 0 < t < 1. For each x, F_x^{−1} is left-continuous and nondecreasing in t. It follows that (x, t) ↦ F_x^{−1}(t) is jointly measurable. Thus for W(ω) := F_{V(ω)}^{−1}(U(ω)), ω ↦ W(ω) is measurable. For each x we have the image measure λ ∘ (F_x^{−1})^{−1} = Q_x (RAP, Proposition 9.1.2). So for any bounded Borel function g,
    ∫ g dQ = ∫₀¹ ∫₀¹ g(x, y) dQ_x(y) dq(x)        (by Theorem 10.2.1 of RAP)
           = ∫₀¹ ∫₀¹ g(x, F_x^{−1}(y)) dy dq(x)    (by the image measure theorem)
           = ∫ g(x, F_x^{−1}(y)) d(q × λ)(x, y)    (by the Tonelli–Fubini theorem)
           = E(g(V, F_V^{−1}(U))) = Eg(V, W),

since U is independent of V and L(V) = q, and by the image measure theorem again. So L(V, W) = Q, proving Lemma 3.7.3.

Now let (Ω, S, Q) be a probability space on which all the empirical processes νn and an independent U are defined, specifically a countable product of copies of the probability space (X, A, P) times one copy of [0, 1] with Lebesgue measure λ. Then Lemma 3.7.3 applies to Q with S = R^G and T = R^G × UC, where V is νn restricted to G, and Q = µ123 on S × T. On Ω we then have processes νn and GP defined on F, which by construction and the asymptotic equicontinuity condition are within 3ε of each other uniformly on F except with probability at most 3ε (as in the proof of Donsker's theorem in Section 1.1).
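The coupling W(ω) := F_{V(ω)}^{−1}(U(ω)) in the proof of Lemma 3.7.3 can be sketched numerically. In this hedged example the target law is an illustrative choice, not from the text: Q is the joint law of (V, V + Z) with V uniform on [0, 1] and Z standard normal, so Q_x = N(x, 1) and F_x^{−1}(t) = x + Φ^{−1}(t):

```python
import math, random

def phi_inv(t):
    """Crude bisection inverse of the standard normal distribution function."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(40):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if Phi(mid) < t else (lo, mid)
    return (lo + hi) / 2.0

random.seed(0)
diffs = []
for _ in range(10000):
    v = random.random()          # V with law q = U[0,1]
    u = random.random()          # U uniform on [0,1], independent of V
    w = v + phi_inv(u)           # W = F_V^{-1}(U): conditional quantile transform
    diffs.append(w - v)

# If L(V, W) = Q, then W - V should be approximately N(0, 1).
mean = sum(diffs) / len(diffs)
var = sum(d * d for d in diffs) / len(diffs) - mean ** 2
print(round(mean, 3), round(var, 3))
```

The same recipe works for any target Q whose conditional distribution functions F_x can be inverted, which is exactly what the joint measurability argument in the proof guarantees.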
Let ε ↓ 0 through the sequence ε = 1/k, k = 1, 2, …. Let the approximation just shown hold for n ≥ nk on a probability space (Ωk, Sk, Qk). We can assume nk is nondecreasing in k. Let Akn be the νn process defined on Ωk and let Gkn be the corresponding GP process on Ωk. Let n0 := 1 and let (Ω0, S0, Q0) be a probability space on which νn processes A0n and GP processes G0n are defined, where the G0n are independent of the A0n processes. Let An := Akn and Gn := Gkn if and only if nk ≤ n < nk+1, for k = 0, 1, …. Then for all n, An is a νn process and Gn is a GP process. Also, ‖An − Gn‖_F → 0 in outer probability, so by Theorem 3.6.1, νn ⇒ GP on F, and (III) implies (I), proving Theorem 3.7.2.
3.8 Unions of Donsker classes

It will be shown in this section that the union of any two Donsker classes F and G is a Donsker class. This is not surprising: one might think it was enough, given the asymptotic equicontinuity conditions for the separate classes, for a given ε > 0, to take the larger of the two n0's and the smaller of the two δ's. But it is not so easy as that. For example, F and G could both be finite sets, with distinct elements of F at distance, say, more than 0.2 apart for ρP, and likewise for G, but there may be some element of F very close to an element of G. So the equicontinuity condition on the union won't just follow from the conditions on the separate families.

Given a probability measure P, F ⊂ L²(P), ε > 0, δ > 0, and a positive integer n0, say that AE(F, n0, ε, δ) holds if for all n ≥ n0,

    Pr*{sup{|νn(f − g)| : f, g ∈ F, ρP(f, g) < δ} > ε} < ε.

Then the asymptotic equicontinuity condition, as in the previous section, holds for F and P if and only if for every ε > 0 there are a δ > 0 and an n0 such that AE(F, n0, ε, δ) holds. The asymptotic equicontinuity condition, together with total boundedness of F for ρP, is equivalent to the Donsker property of F for P (Theorem 3.7.2).
3.8.1 Theorem (K. Alexander) Let (Ω, A, P) be a probability space and let F1 and F2 be two Donsker classes for P. Then F := F1 ∪ F2 is also a Donsker class for P.

Proof (M. Arcones) Given ε > 0, take δi > 0 and ni < ∞ so that AE(Fi, ni, ε/3, δi) holds, i = 1, 2. F1 and F2, being Donsker classes, are pregaussian, and GP is an isonormal process for (·, ·)_{0,P}, so F is pregaussian by Corollary
2.5.9. So there is an α > 0 such that for a suitable version of GP,

    Pr{sup{|GP(f − g)| : ρP(f, g) ≤ α} > ε/3} < ε/3.

Let δ := min(δ1, δ2, α/3). Take finite sets Hi ⊂ Fi, i = 1, 2, such that for each i and f ∈ Fi there is an h := πi f ∈ Hi with ρP(f, h) < δ. Since H := H1 ∪ H2 is finite, by the finite-dimensional central limit theorem (RAP, Theorem 9.5.6), νn restricted to H converges in law to GP restricted to H. Let F(H, α, ε/3) be the set of all y ∈ R^H such that |y(f) − y(g)| ≥ ε/3 for some f, g ∈ H with ρP(f, g) ≤ α. Then F(H, α, ε/3) is closed and has probability less than ε/3 for the law of GP on R^H. Thus by the portmanteau theorem (RAP, Theorem 11.1.1), there is an m such that

(3.8.2)    AE(H, m, ε/3, α) holds.
Let n0 := max(n1, n2, m). It will be shown that AE(F, n0, ε, δ) holds. By the asymptotic equicontinuity conditions in each Fi, and since n0 ≥ max(n1, n2) and δ ≤ min(δ1, δ2), there is a set of probability less than ε/3 for each i = 1, 2 such that outside these sets, |νn(f) − νn(g)| < ε/3 for any f, g in the same Fi with ρP(f, g) < δ. For pairs f, g with ρP(f, g) < δ, with f ∈ F1 and g ∈ F2, we have ρP(π1 f, π2 g) ≤ α since ρP(f, π1 f) < δ, ρP(g, π2 g) < δ, and 3δ ≤ α. Thus by (3.8.2), |νn(π1 f) − νn(π2 g)| < ε/3 for all such f, g, outside of another set of probability at most ε/3. Thus

    |νn(f) − νn(g)| ≤ |νn(f) − νn(π1 f)| + |νn(π1 f) − νn(π2 g)| + |νn(π2 g) − νn(g)|
                   < ε/3 + ε/3 + ε/3 = ε

for all f ∈ F1 and g ∈ F2 except on a set of probability at most ε/3 + ε/3 + ε/3 = ε.
3.9 Sequences of sets and functions

This section will show how the asymptotic equicontinuity condition in Theorem 3.7.2 can be applied to prove that some sequences of sets and functions are Donsker classes. In Chapters 6 and 7, other sufficient conditions for the Donsker property will be given that will apply to uncountable families of sets and functions.

3.9.1 Theorem Let (X, A, P) be a probability space and {Cm}_{m≥1} a sequence of measurable sets. If

(3.9.2)    Σ_{m=1}^∞ (P(Cm)(1 − P(Cm)))^r < ∞    for some r < ∞,
then the sequence {Cm}_{m≥1} is a Donsker class for P. Conversely, if the sets Cm are independent for P, then the sequence is a Donsker class only if (3.9.2) holds.

Proof Suppose (3.9.2) holds. Then the positive integers can be decomposed into two subsequences, over one of which P(Cm) → 0 and over the other P(Cm) → 1. It's enough to prove the Donsker property separately for each subsequence by Theorem 3.8.1. For any measurable set A with complement A^c, νn(A^c) ≡ −νn(A) and GP(A^c) ≡ −GP(A). The transformation of these processes into their negatives preserves convergence in law if it holds. So we can assume P(Cm) ↓ 0 as m → ∞, and then Σm P(Cm)^r < ∞. Also, {Cm}_{m≥1} is totally bounded for ρP. By Theorem 3.8.1 we can assume pm := P(Cm) ≤ 1/2 for all m. For any i and m such that P(Ci Δ Cm) = 0, we will have almost surely for any n that Pn(Ci) = Pn(Cm). So we can assume that P(Ci Δ Cm) > 0 for all i ≠ m. For any m such that P(Cm) = 0 we will have Pn(Cm) = 0 almost surely for any n, and then νn(Cm) = 0, so we can assume P(Cm) > 0 for all m. Let 0 < ε < 1. Suppose we can find M and N such that for all n ≥ N,

(3.9.3)    Pr{sup_{m≥M} |νn(Cm)| > ε} < ε.
Then for J large enough, pm ≤ pM/2 for m > J. Let γ := min{P(Ci Δ Cj) : 1 ≤ i < j ≤ J} and α := min(γ, pM)/2. Then

    sup{|νn(Ci) − νn(Cj)| : P(Ci Δ Cj) < α} ≤ sup{|νn(Ci) − νn(Cj)| : i, j > M}
        ≤ 2 sup{|νn(Cj)| : j > M} ≤ 2ε

with probability at least 1 − ε, proving the asymptotic equicontinuity condition. So it will be enough to prove (3.9.3). For that, recalling the binomial probabilities defined in Section 1.3, it will suffice to find M and N such that for
n ≥ N,

(3.9.4)    Σ_{m=M}^∞ E(npm + εn^{1/2}, n, pm) < ε/2

and

(3.9.5)    Σ_{m=M}^∞ B(npm − εn^{1/2}, n, pm) < ε/2.
Let qm := 1 − pm. For (3.9.5), by the Chernoff–Okamoto inequality (1.3.10),

    B(npm − εn^{1/2}, n, pm) ≤ exp(−ε²/(2pmqm)).

For some K, 1 ≤ K < ∞, pm ≤ K m^{−1/r} for all m. Choose M large enough so that

    Σ_{m=M}^∞ exp(−m^{1/r} ε²/(2K)) < ε/2,

and so (3.9.5) holds for all n. The other side, (3.9.4), is harder. Bernstein's inequality (1.3.2) gives, if 2pn^{1/2} ≥ ε and 0 < p ≤ 1/2, with q := 1 − p, that

    E(np + εn^{1/2}, n, p) ≤ exp(−ε²/(2pq + εn^{−1/2})) ≤ exp(−ε²/(6pq)). Then
(3.9.6)    Σ_{m=M}^∞ {E(npm + εn^{1/2}, n, pm) : 2pm n^{1/2} ≥ ε}
               ≤ Σ_{m=M}^∞ {exp(−ε²/(6pm)) : 2pm n^{1/2} ≥ ε}
               ≤ Σ_{m=M}^∞ exp(−ε² m^{1/r}/(6K)) < ε/4

for all n if M is large enough.
for all n if M is large enough. It remains to treat the sum, which will be called S2, of (3.9.4) restricted to values of m with 2pmn1/2 < E.
(3.9.7)
For p:= pm, inequality (1.3.11) implies
E(np + en 1/2, n, p) < (npl (np +
En1/2))np+En1/2eenl/2
Lety:= y(n,m,e) :=n1/2pm/e,so y> 0. Let f(x) - (1+x)log(1+x Then
- (np + en1/2) log (1 + e/(pn1/2)) = En1/2 (1 - f(y)). Forx > 0, f(x) < 0. By (3.9.7), y < 1/2, so f(y) > f(1/2) > 3/2. Thus En1/2
1 < 2f(y)/3, En1/2 (1 - f(y)) < -En 1/2 f (y) /3, and 00
{exp(-(En1/2+npm) [log (1+E/(n1/2pm)}]/3): 2pmn112<E1
S2 m=1 00
JJ ((
M=1
{(pmnl/2/e)En1/2l3:
2pmn1/2 < E}
since 1/y ≤ (1 + 1/y)^{1+y}, so (1 + 1/y)^{−(1+y)} ≤ y. Now since pm ≤ K m^{−1/r} we have S2 ≤ S3 + S4, where
    S3 := Σ_{m=1}^∞ {(K m^{−1/r} n^{1/2}/ε)^{εn^{1/2}/3} : 2K m^{−1/r} n^{1/2} < ε}
       ≤ (K n^{1/2}/ε)^{εn^{1/2}/3} Σ_{m>G} m^{−εn^{1/2}/(3r)},

where G := (2K n^{1/2}/ε)^r ≥ 2. Then for any n1 ≥ (3r/ε)² and n ≥ n1,
    S3 ≤ (K n^{1/2}/ε)^{εn^{1/2}/3} ∫_{G−1}^∞ x^{−εn^{1/2}/(3r)} dx
       = (K n^{1/2}/ε)^{εn^{1/2}/3} (εn^{1/2}(3r)^{−1} − 1)^{−1} (G − 1)^{1 − εn^{1/2}/(3r)}.

For K, ε and r fixed, as n → ∞ the logarithm of the latter expression is asymptotic to −εn^{1/2}(log 2)/3 → −∞, so S3 → 0. Thus S3 < ε/8 for n ≥ n2 for some n2. Finally, define S4 by
    S4 := Σ_{m=1}^∞ {(pm n^{1/2}/ε)^{εn^{1/2}/3} : ε ≤ 2K m^{−1/r} n^{1/2}}
       ≤ (2K n^{1/2}/ε)^r 2^{−εn^{1/2}/3} → 0
as n → ∞, so S4 < ε/8 for n ≥ n3 for some n3. Thus for n ≥ max(n1, n2, n3), S2 < ε/4. This and (3.9.6) give (3.9.4). So (3.9.2) implies that {Cm}_{m≥1} is a Donsker class.

For the converse, if the measurable sets {Cm}_{m≥1} are independent for P and form a Donsker class for P, it will be shown that (3.9.2) holds. Note that for each n, the Pn(Cm) are independent random variables for m = 1, 2, …. First suppose lim sup_{m→∞} pm =: p < 1 and, for all n, Σ_{m=1}^∞ pm^n = +∞. Then for each n, Pr{Pn(Cm) = 1 for infinitely many m} = 1 by the Borel–Cantelli lemma. Now Pn(Cm) = 1 implies νn(Cm) = n^{1/2}(1 − pm) ≥ n^{1/2}(1 − p)/2 for m large. From Theorem 3.7.2, {1_{Cm} : m ≥ 1} must be totally bounded for ρP and satisfy the asymptotic equicontinuity condition. It follows that pm → 0 as m → ∞, and by the asymptotic equicontinuity condition, we must have Σ_{m=1}^∞ pm^n < ∞ for some n. A symmetrical argument applies if lim inf_{m→∞} pm > 0. We can write {Cm}_{m≥1} as a union of two subsequences, one for which P(Cm) ≤ 1/2 and another for which P(Cm) > 1/2, both Donsker classes. Thus by the argument just given, for some n < ∞, the first subsequence satisfies Σm P(Cm)^n < ∞ and the second Σm (1 − P(Cm))^n < ∞. So for the original sequence, (3.9.2) holds.
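The two binomial tail bounds driving the proof can be checked against exact binomial tails. In this sketch the parameter values are illustrative; B(k, n, p) := Pr(Sn ≤ k) and E(k, n, p) := Pr(Sn ≥ k) for Sn binomial(n, p), as in Section 1.3:

```python
import math

def pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def B(x, n, p):   # lower tail Pr(S_n <= x)
    return sum(pmf(k, n, p) for k in range(0, math.floor(x) + 1))

def E(x, n, p):   # upper tail Pr(S_n >= x)
    return sum(pmf(k, n, p) for k in range(math.ceil(x), n + 1))

n, p, eps = 400, 0.3, 0.5
q = 1.0 - p
assert 2 * p * math.sqrt(n) >= eps        # condition used in the Bernstein step
lower = B(n * p - eps * math.sqrt(n), n, p)
upper = E(n * p + eps * math.sqrt(n), n, p)
chernoff_okamoto = math.exp(-eps ** 2 / (2 * p * q))   # as in (1.3.10)
bernstein = math.exp(-eps ** 2 / (6 * p * q))          # weakened form of (1.3.2)
print(lower <= chernoff_okamoto, upper <= bernstein)   # True True
```

The bounds are loose at this scale, which is all the proof needs: only their exponential decay in m (through pm ≤ K m^{−1/r}) makes the tail sums (3.9.4) and (3.9.5) small.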
Next, let's consider sequences of functions. For a probability space (X, A, P) and f ∈ L²(X, A, P) let σ_P²(f) := ∫ f² dP − (∫ f dP)² (the variance of f). Here is a sufficient condition for the Donsker property of a sequence {fm} which is easy to prove, yet turns out to be optimal of its kind:

3.9.8 Theorem If {fm}_{m≥1} ⊂ L²(P) and Σ_{m=1}^∞ σ_P²(fm) < ∞, then {fm}_{m≥1} is a Donsker class for P.
Proof Since νn and GP are the same on fm − Efm as on fm a.s. for all m, we can assume Efm = 0 for all m. Then fm → 0 in L², so the sequence {fm} is totally bounded for ρP. For any 0 < ε < 1, n ≥ 1 and m ≥ 1, by Chebyshev's inequality,

    Pr{|νn(fj)| > ε/2 for some j > m} ≤ Σ_{j>m} 4σ_P²(fj)/ε² < ε

for m ≥ m0 for some m0 < ∞. We have a.s. for all n that νn(fj) = 0 for all j such that σ_P(fj) = 0, and νn(fj) = νn(fk) for all j and k with σ_P(fj − fk) = 0. Let α be the infimum of σ_P(fj − fk) over all j and k such that σ_P(fj − fk) > 0 and j ≤ m0. Then α > 0, since in some ρP-neighborhood of each fj there are only finitely many fk. Let δ := min(α, 1). Then the probability that |νn(fj) − νn(fk)| > ε for some j, k with σ_P(fj − fk) < δ is less than ε, implying the asymptotic equicontinuity condition and so finishing the proof by Theorem 3.7.2.

The following shows that Theorem 3.9.8, although it does not imply the first half of Theorem 3.9.1, is sharp in one sense:
3.9.9 Proposition Let X := [0, 1] and P := U[0, 1] := Lebesgue measure on [0, 1]. Let am > 0 satisfy Σ_{m=1}^∞ am = +∞. Then there is a sequence {fm} ⊂ L²(X, A, P) with σ_P²(fm) ≤ am for all m such that {fm} is not a Donsker class.

Proof We can assume am ↓ 0. There exist cm ↓ 0 such that Σm am cm = +∞ (see problem 12). In X let Cm be independent sets with P(Cm) = am cm for each m (see problem 13). Let fm := cm^{−1/2} 1_{Cm}. Then σ_P²(fm) ≤ ∫ fm² dP = am. For each n, almost surely Pn(Cm) ≥ 1/n for infinitely many m. Then νn(Cm) ≥ n^{−1/2}/2 for infinitely many m, and sup_m νn(fm) = +∞ a.s., so the asymptotic equicontinuity condition fails and {fm} is not a Donsker class for P.
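The Chebyshev step in the proof of Theorem 3.9.8 can be illustrated by simulation. All choices below are illustrative, not from the text: fj := 1_{[0, 2^{−j}]} under P = U[0, 1], so the variances σ_P²(fj) = 2^{−j}(1 − 2^{−j}) are summable, and since Var(νn(fj)) = σ_P²(fj) for every n, the union bound Σj 4σ_P²(fj)/ε² dominates the exceedance probability:

```python
import math, random

random.seed(1)
n, trials, eps = 200, 1000, 1.0
levels = list(range(3, 10))   # j = 3, ..., 9
# Chebyshev union bound: sum_j 4 sigma_P^2(f_j) / eps^2
bound = sum(4 * 2 ** -j * (1 - 2 ** -j) / eps ** 2 for j in levels)

exceed = 0
for _ in range(trials):
    x = [random.random() for _ in range(n)]
    # sup_j |nu_n(f_j)| over the chosen levels
    sup = max(abs(math.sqrt(n) * (sum(xi <= 2.0 ** -j for xi in x) / n - 2.0 ** -j))
              for j in levels)
    exceed += sup > eps / 2
rate = exceed / trials
print(rate, "<=", round(bound, 3))
```

In the counterexample of Proposition 3.9.9 the variances are not summable, and the corresponding supremum blows up instead of being controlled by such a bound.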
Problems

1. Let (Ω, A, P) be a probability space, let (S, d) be a (possibly nonseparable) metric space, and let xn, n = 0, 1, 2, …, be points of S. Let fn(ω) := xn for all ω. Show that fn ⇒ f0 if and only if xn → x0.

2. In a general (possibly nonseparable) metric space show that if X0 ≡ p ∈ S is a constant random variable, then random elements Xn ⇒ X0 if and only if Xn → X0 in outer probability.
3. Let (T, d) be any metric space and S ⊂ T a separable subspace. Let Xn, n ≥ 0, be S-valued measurable random variables on some probability space such that Xn → X0 in law as n → ∞ in S. Show that Xn ⇒ X0 as T-valued random elements.

4. Show that C := {(−∞, t] : t ∈ R} is a Donsker class for any law P on R. Hint: Use Section 1.1.
5. Let Ak be independent sets in a probability space (Ω, A, P) such that Σk [P(Ak)(1 − P(Ak))]^n = +∞ for all n = 1, 2, …. Show that {Ak}_{k≥1} is not a Glivenko–Cantelli class, that is, sup_k |(Pn − P)(Ak)| doesn't converge to 0 in probability as n → ∞. Hint: See the last part of the proof of Theorem 3.9.1.
6. Let (X, A, P) be a probability space and let F ⊂ L²(X, A, P) be a Donsker class for P.
(a) Show that the convex hull of F, namely

    co(F) := {Σ_{j=1}^k λj fj : fj ∈ F, λj ≥ 0 for j = 1, …, k, Σ_{j=1}^k λj = 1},

is a Donsker class. Hint: Use Theorem 2.5.5 to get that co(F) is pregaussian and that GP can be taken to be prelinear on co(F). Then use almost surely convergent realizations (3.5.1).
(b) For any fixed k < ∞, show that {f1 + ⋯ + fk : fj ∈ F for j = 1, …, k} is also a Donsker class. Hints: Use induction and Theorem 3.8.1 on unions. Thus take k = 2. It is easy to show that 2F := {2f : f ∈ F} is Donsker. Then apply part (a).

7. Let c > 0. For Lebesgue measure λ on [0, 1], the Poisson process with intensity measure cλ is defined by first choosing n at random having a Poisson distribution with parameter c, namely P(n = k) = e^{−c} c^k/k! for k = 0, 1, …, then setting Yc := Σ_{j≤n} δ_{Xj}, where the Xj are i.i.d. with law λ on [0, 1]. In the Banach space of bounded functions on [0, 1] with supremum
norm, prove that as c → ∞ (along any sequence) the random functions t ↦ (Yc − cλ)c^{−1/2}[0, t], 0 ≤ t ≤ 1, converge in law to the Brownian motion process xt, 0 ≤ t ≤ 1. (Recall that xt = L(1_{[0,t]}).) Hints: For c = ck → ∞ let n = nk be Poisson (ck). For F(t) ≡ t, 0 ≤ t ≤ 1, write Yc(t) := Yc([0, t]) = n Fn(t), so

    c^{−1/2}(Yc(t) − ct) = (n/c)^{1/2}[n^{1/2}(Fn − F)(t)] + c^{−1/2}(n − c)t.

By Donsker's theorem take Brownian bridges y(n) such that n^{1/2}(Fn − F) is close to y(n) for n large. Also, c^{−1/2}(n − c) is close in law by the central limit theorem, and so in probability (Strassen's theorem), to some N(0, 1) random variable Zc. Show that one can take n(ω) independent of X1, X2, …; then yt(ω) := y(n(ω))_t(ω) is a Brownian bridge; apply the Vorob'ev–Berkes–Philipp theorem to take Zc independent of yt, and then yt + Zc t is Brownian.

8. Let P be a law on a separable, infinite-dimensional Hilbert space H such that ∫ ‖x‖² dP(x) < ∞ and with mean 0, so that ∫ (x, h) dP(x) = 0 for all h ∈ H. Let X1, X2, … be i.i.d. in H with law P and Sn := X1 + ⋯ + Xn.
(a) Show that the central limit theorem holds in H, that is, Sn/n^{1/2} converges in law to some normal measure on H. Hint: Prove, using variances, that the laws of Sn/n^{1/2} are uniformly tight.
(b) Show that the class of functions x ↦ (x, h) for h ∈ H with ‖h‖ ≤ 1 (the unit ball of the dual space of H) is a Donsker class of functions on H for P.
Hints: For part (a) let {en} be an orthonormal basis, so that for any x ∈ H, x = Σn xn en with ‖x‖² = Σn xn². Thus Σn Exn² is finite. Show that for some cn ↓ 0 slowly enough, Σn Exn²/cn is still finite. For any K let CK be the set of x such that Σn xn²/cn ≤ K² (an infinite-dimensional ellipsoid). Show that each CK is compact and that the laws of Sn/n^{1/2} are uniformly tight by way of the sets CK. Thus subsequences of these laws converge. Show that they all converge to the same Gaussian limit law, since by the Stone–Weierstrass theorem, functions depending on only finitely many xj are dense in the continuous functions on each CK.

9. The two-sample empirical process. Let X1, …, Xm, Y1, …, Yn be i.i.d. with the uniform distribution U[0, 1] on [0, 1] (Lebesgue measure restricted to [0, 1]). Let Fm be the empirical distribution function based on X1, …, Xm, and Gn likewise based on Y1, …, Yn. Show that in the space of all bounded functions on [0, 1] with supremum norm,

    (mn/(m + n))^{1/2} (Fm − Gn) ⇒ y
as m, n → ∞, where t ↦ yt, 0 ≤ t ≤ 1, is a Brownian bridge process. Hints: As in Donsker's theorem (1.1.1), for m, n large, m^{1/2}(Fm − F) is close to a Brownian bridge Y(m), and n^{1/2}(Gn − F) is close to an independent Brownian bridge Z(n). It follows that (mn/(m + n))^{1/2}(Fm − Gn) is close to (n/(m + n))^{1/2} Y(m) − (m/(m + n))^{1/2} Z(n), which is a Brownian bridge.
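The normalization in problem 9 can be checked at a single time point by simulation: at fixed t the statistic has mean 0 and variance exactly t(1 − t) for all m and n, the Brownian bridge variance. The parameters below are illustrative choices, not from the text:

```python
import random

random.seed(2)
m, n, t, trials = 300, 200, 0.3, 2000
c = (m * n / (m + n)) ** 0.5
vals = []
for _ in range(trials):
    fm = sum(random.random() <= t for _ in range(m)) / m   # F_m(t)
    gn = sum(random.random() <= t for _ in range(n)) / n   # G_n(t)
    vals.append(c * (fm - gn))

mean = sum(vals) / trials
var = sum(v * v for v in vals) / trials - mean ** 2
print(round(mean, 2), round(var, 2))  # compare with 0 and t*(1 - t) = 0.21
```

This is only the one-dimensional marginal check; the content of the problem is the convergence of the whole process in supremum norm.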
10. Let fn(x) := cos(2πnx) for 0 ≤ x ≤ 1, with law U[0, 1] on [0, 1]. For real c let Fc be the sequence of functions n^{−c} fn for n = 1, 2, …. For what values of c is Fc pregaussian? A Donsker class? Hints: If c ≤ 0, show easily that the class is not pregaussian and so not Donsker. If c > 0, then it is pregaussian, by metric entropy. If c > 1/2, it is Donsker by Theorem 3.9.8. Show that it is Donsker for 0 < c ≤ 1/2 by the Bernstein inequality and the asymptotic equicontinuity condition.

11. For any law P on R, show that for any fixed k < ∞,

    C := {⋃_{j=1}^k (aj, bj] : aj ≤ bj for all j}

is a Donsker class, that is, F := {1_C : C ∈ C} is a Donsker class. Hint: Apply Donsker's theorem (1.1.1) and take differences (and sums). For k = 2 reduce to the case of disjoint intervals and apply problem 6. Do induction to get general k.
12. Show that, as stated in the proof of Proposition 3.9.9, for any am > 0 with Σm am = +∞ there are cm ↓ 0 with Σm cm am = +∞. Hint: Take a sequence mk such that Σ{am : mk ≤ m < mk+1} ≥ k for each k = 1, 2, …. Let cm have the same value for mk ≤ m < mk+1.

13. Show that, as also stated in the proof of Proposition 3.9.9, in [0, 1] for P = U[0, 1] there exist independent sets Cm with any given probabilities. Hint: Use binary expansions. By decomposing the set of positive integers into a countable union of countably infinite sets, show that ([0, 1], P) is isomorphic as a probability space to a countable Cartesian product of copies of itself.

14. Suppose that {Cm}_{m≥1} are independent for P,

    Σ_{m=1}^∞ P(Cm)(1 − P(Cm)) = +∞,

and cm → ∞. Show that {cm 1_{Cm}}_{m≥1} is not a Donsker class.
Notes
Notes to Section 3.1. For any metric space (S, d), let Bb(S, d) be the σ-algebra generated by all balls B(x, r) := {y : d(x, y) < r}, x ∈ S, r > 0. Then Bb(S, d) is always included in the Borel σ-algebra B(S, d) generated by all the open sets, with Bb(S, d) = B(S, d) if (S, d) is separable. Suppose the Yn are functions from a probability space (Ω, P) into S, measurable for Bb(S, d). Then each Yn has a law µn = P ∘ Yn^{−1} on Bb(S, d). Dudley (1966, 1967) defined convergence in law of Yn to Y0 to mean that ∫* H dµn → ∫ H dµ0 for every bounded continuous real-valued function H on S. Hoffmann-Jørgensen (1984) gave the newer definition adopted generally and here, where the upper integrals and integral are taken over Ω, not S, so that the laws µn are not necessarily defined on any particular σ-algebra in S. Hoffmann-Jørgensen's monograph was published in 1991. Andersen (1985a,b), Andersen and Dobrić (1987, 1988), and Dudley (1985) developed further the theory based on Hoffmann-Jørgensen's definition.

Notes to Section 3.2. Blumberg (1935) defined the measurable cover function f*; see also Goffman and Zink (1960). (Thanks to Rae M. Shortt for pointing out Blumberg's paper.) Later, Eames and May (1967) also defined f*. Lemmas 3.2.2 through 3.2.6 are more or less as in Dudley and Philipp (1982, Section 2), except that Lemma 3.2.4(c) is new here. Theorem 3.2.1 and its proof are as in Vulikh (1967, pp. 78–79) for Lebesgue measure on an interval (the proof needs no change). Luxemburg and Zaanen (1983, Lemma 94.4, p. 222) also prove existence of essential suprema and infima of families of measurable (extended) real functions.
Note to Section 3.3. This section is based on parts of Dudley (1985).

Notes to Section 3.4. Perfect probability spaces were apparently first defined by Gnedenko and Kolmogorov (1949, Section 3), and their theory was carried on among others by Ryll-Nardzewski (1953), Sazonov (1962), and Pachl (1979).
Perfect functions are defined and treated in Hoffmann-Jorgensen (1984, 1985) and Andersen (1985a,b); see also Dudley (1985).
Notes to Section 3.5. The existence of almost surely convergent random variables with a given converging sequence of laws was first proved by Skorokhod (1956) for complete separable metric spaces, then by Dudley (1968) for any separable metric space, with a re-exposition in RAP, Section 11.7, and by Wichura (1970) for laws on the σ-algebra generated by balls in an arbitrary metric space
as mentioned in the notes to Section 3.1. The current version was given in Dudley (1985).
Notes to Section 3.6. Hoffmann-Jorgensen (1984), who defined convergence in law in the sense adopted in this chapter, also developed the theory of it as in this section, and partly in a more general form (with nets instead of sequences, and other classes of functions in place of the bounded Lipschitz functions). Andersen and Dobric (1987, Remark 2.13) pointed out that the portmanteau theorem (as in Topsoe, 1970, Theorem 8.1) "can be extended to the nonmeasurable case. The proof of this extension is the same as the ordinary proof." Much the same might be said of other equivalences in this section. Dudley (1990, Theorem A) gave a form of the portmanteau theorem and (Theorem B) of the metrization theorem (3.6.4).
But not all facts or proofs from the separable case extend so easily: for example, in the separable case, there is an inequality for the two metrics, 0 < 2p, in the opposite direction to Lemma 3.6.5, which follows from Strassen's theorem on nearby variables with nearby laws (RAP, Corollary 11.6.5), but Strassen's theorem seems not to extend well to the nonmeasurable case (Dudley, 1994).
Notes to Section 3.7. An early form of the asymptotic equicontinuity condition appeared in Dudley (1966, Proposition 2) and a later form in Dudley (1978).
The equivalence with a different pseudometric r in Theorem 3.7.2 is due to Gine and Zinn (1986, p. 58). Lemma 3.7.3 is essentially contained in the proof of Skorokhod (1976, Theorem 1), as Erich Berger kindly pointed out. See also Ershov (1975).

Notes to Section 3.8. Alexander (1987, Corollary 2.7) stated Theorem 3.8.1 but didn't publish a proof of it, although he had written out an unpublished proof several years earlier. He says that the result is "an extension of a slightly weaker result of Dudley (1981)," where F2 is finite, but this author himself doesn't think his 1981 result was only "slightly weaker"! The proof presented was suggested by Miguel Arcones in Berkeley during the fall of 1991, but I take responsibility for any possible errors in it. Apparently van der Vaart (1996, Theorem A.3) first published a proof.
Notes to Section 3.9. Theorem 3.9.1 first appeared in Dudley (1978, Section 2), Theorem 3.9.8 and Proposition 3.9.10 in Dudley (1981), and Proposition 3.9.9 in Dudley (1984).
References

Alexander, K. S. (1987). The central limit theorem for empirical processes on Vapnik-Cervonenkis classes. Ann. Probab. 15, 178-203.
Andersen, N. T. (1985a). The central limit theorem for non-separable valued functions. Z. Wahrscheinlichkeitstheorie verw. Gebiete 70, 445-455.
Andersen, N. T. (1985b). The calculus of non-measurable functions and sets. Various Publication Series no. 36, Matematisk Institut, Aarhus Universitet.
Andersen, N. T., and Dobric, V. (1987). The central limit theorem for stochastic processes. Ann. Probab. 15, 164-177.
Andersen, N. T., and Dobric, V. (1988). The central limit theorem for stochastic processes II. J. Theoret. Probab. 1, 287-303.
Bauer, H. (1981). Probability Theory and Elements of Measure Theory, 2d ed. Academic Press, London.
Blumberg, Henry (1935). The measurable boundaries of an arbitrary function. Acta Math. (Uppsala) 65, 263-282.
Cohn, D. L. (1980). Measure Theory. Birkhauser, Boston.
Dudley, R. M. (1966). Weak convergence of probabilities on nonseparable metric spaces and empirical measures on Euclidean spaces. Illinois J. Math. 10, 109-126.
Dudley, R. M. (1967). Measures on non-separable metric spaces. Illinois J. Math. 11, 449-453.
Dudley, R. M. (1968). Distances of probability measures and random variables. Ann. Math. Statist. 39, 1563-1572.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899-929; Correction 7 (1979), 909-911.
Dudley, R. M. (1981). Donsker classes of functions. In Statistics and Related Topics (Proc. Symp. Ottawa, 1980), North-Holland, New York, 341-352.
Dudley, R. M. (1984). A course on empirical processes. Ecole d'ete de probabilites de St.-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Dudley, R. M. (1985). An extended Wichura theorem, definitions of Donsker class, and weighted empirical distributions. In Probability in Banach Spaces V (Proc. Conf. Medford, 1984), Lecture Notes in Math. (Springer) 1153, 141-178.
Dudley, R. M. (1990). Nonlinear functionals of empirical measures and the bootstrap. In Probability in Banach Spaces VII (Proc. Conf. Oberwolfach, 1988), Progress in Probability 21, Birkhauser, Boston, 63-82.
Dudley, R. M. (1994). Metric marginal problems for set-valued or non-measurable variables. Probab. Theory Related Fields 100, 175-189.
Dudley, R. M., and Philipp, Walter (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Z. Wahrscheinlichkeitsth. verw. Gebiete 62, 509-552.
Eames, W., and May, L. E. (1967). Measurable cover functions. Canad. Math. Bull. 10, 519-523.
Ershov, M. P. (1975). The Choquet theorem and stochastic equations. Analysis Math. 1, 259-271.
Gine, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces (Proc. Conf. Zaragoza, 1985), Lecture Notes in Math. (Springer) 1221, 50-113.
Gnedenko, B. V., and Kolmogorov, A. N. (1949). Limit Distributions for Sums of Independent Random Variables. Moscow. Transl. and ed. by K. L. Chung, Addison-Wesley, Reading, MA, 1954; rev. ed. 1968.
Goffman, C., and Zink, R. E. (1960). Concerning the measurable boundaries of a real function. Fund. Math. 48, 105-111.
Hoffmann-Jorgensen, Jorgen (1984). Stochastic processes on Polish spaces. Published (1991): Various Publication Series no. 39, Matematisk Institut, Aarhus Universitet.
Hoffmann-Jorgensen, Jorgen (1985). The law of large numbers for non-measurable and non-separable random elements. Asterisque 131, 299-356.
Luxemburg, W. A. J., and Zaanen, A. C. (1983). Riesz Spaces, vol. 2. North-Holland, Amsterdam.
Pachl, Jan K. (1979). Two classes of measures. Colloq. Math. 42, 331-340.
Ryll-Nardzewski, C. (1953). On quasi-compact measures. Fund. Math. 40, 125-130.
Sazonov, V. V. (1962). On perfect measures (in Russian). Izv. Akad. Nauk SSSR 26, 391-414.
Skorokhod, Anatolii Vladimirovich (1956). Limit theorems for stochastic processes. Theor. Probab. Appl. 1, 261-290.
Skorokhod, A. V. (1976). On a representation of random variables. Theor. Probab. Appl. 21, 628-632 (English), 645-648 (Russian).
Topsoe, Flemming (1970). Topology and Measure. Lecture Notes in Math. (Springer) 133.
van der Vaart, Aad (1996). New Donsker classes. Ann. Probab. 24, 2128-2140.
Vulikh, B. Z. (1961). Introduction to the Theory of Partially Ordered Spaces (transl. by L. F. Boron, 1967). Wolters-Noordhoff, Groningen.
Wichura, Michael J. (1970). On the construction of almost uniformly convergent random variables with given weakly convergent image laws. Ann. Math. Statist. 41, 284-291.
4
Vapnik-Cervonenkis Combinatorics
This chapter will treat some classes of sets satisfying a combinatorial condition. In Chapter 6, it will be shown that under a mild measurability condition to be treated in Chapter 5, these classes have the Donsker property, for all probability measures P on the sample space, and satisfy a law of large numbers (Glivenko-Cantelli property) uniformly in P. Moreover, for either of these limit-theorem properties of a class of sets (without assuming any measurability), the Vapnik-Cervonenkis property is necessary (Section 6.4). The present chapter will be self-contained, not depending on anything earlier in this book, except in some examples.
4.1 Vapnik-Cervonenkis classes

Let X be any set and C a collection of subsets of X. For A ⊂ X let C_A := C ⊓ A := A ⊓ C := {C ∩ A : C ∈ C}. Let card(A) := |A| denote the cardinality (number of elements) of A and 2^A := {B : B ⊂ A}. Let Δ^C(A) := |C_A|. If A ⊓ C = 2^A, then C is said to shatter A. If A is finite, then C shatters A if and only if Δ^C(A) = 2^{|A|}.
Let m^C(n) := max{Δ^C(F) : F ⊂ X, |F| = n} for n = 0, 1, …, or if |X| < n let m^C(n) := m^C(|X|). Then m^C(n) ≤ 2^n for all n. Let

V(C) := inf{n : m^C(n) < 2^n} if this is finite, or +∞ if m^C(n) = 2^n for all n;

S(C) := sup{n : m^C(n) = 2^n}, or −1 if C is empty.

Then S(C) = V(C) − 1, and S(C) is the largest cardinality of a set shattered by C, or +∞ if arbitrarily large finite sets are shattered. So, V(C) is the smallest n
such that no set of cardinality n is shattered by C. If V(C) < ∞, or equivalently if S(C) < ∞, C will be called a Vapnik-Cervonenkis class or VC class. If X is finite, with n elements, then clearly 2^X is a VC class, with S(2^X) = n. Let

\binom{N}{j} := N!/(j!(N − j)!) for j = 0, 1, …, N, and \binom{N}{j} := 0 for j > N,

and let \binom{N}{\le k} := Σ_{j=0}^{k} \binom{N}{j}. Then \binom{N}{\le k} = 2^N whenever k ≥ N, and we have:

4.1.1 Proposition \binom{N}{\le k} = \binom{N−1}{\le k} + \binom{N−1}{\le k−1} for k ≥ 1 and N = 1, 2, ….

Proof For each j = 1, 2, …, N, we have by the classical Pascal's triangle identity

\binom{N}{j} = \binom{N−1}{j} + \binom{N−1}{j−1},  \binom{N}{0} = \binom{N−1}{0} = 1.

Summing over j = 0, 1, …, k, the conclusion follows. If N ≤ k, the two sides give 2^N = 2^{N−1} + 2^{N−1}.
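The combinatorial definitions above (the trace class C ⊓ A, the shatter coefficient Δ^C, m^C(n), and the index S(C)) can be computed directly for small finite examples. The following Python sketch is ours, not part of the text; the helper names `subsets`, `delta`, `m`, and `S` are ours. It checks the chain of initial segments of {0, 1, 2, 3} (index 1) against the full power set (index 4).

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of s, as frozensets (the class 2^s)."""
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def delta(C, F):
    """Delta^C(F): the number of distinct traces A & F for A in C."""
    return len({A & F for A in C})

def m(C, X, n):
    """m^C(n): max of Delta^C(F) over n-element subsets F of X."""
    return max(delta(C, frozenset(F)) for F in combinations(X, n))

def S(C, X):
    """Largest n with m^C(n) = 2^n: the index S(C), over the finite set X."""
    return max(n for n in range(len(X) + 1) if m(C, X, n) == 2 ** n)

X = tuple(range(4))
chains = [frozenset(range(k)) for k in range(5)]  # linearly ordered by inclusion
print(S(chains, X), S(subsets(X), X))             # prints: 1 4
```

A chain shatters every singleton but no pair (a trace can never contain the larger point without the smaller), while the full power set shatters X itself.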
For a non-VC class C we have m^C(n) = 2^n for all n. For a VC class, the next fact, which is fundamental in the Vapnik-Cervonenkis theory, will imply that m^C(n) grows only as a polynomial rather than exponentially in n.

4.1.2 Theorem (Sauer's lemma) If m^C(n) > \binom{n}{\le k−1} for some n, then C shatters some set of cardinality k. Thus if S(C) = k < ∞, then m^C(n) ≤ \binom{n}{\le k} for all n.

Proof The proof is by induction on k and n. For k = 1, \binom{n}{\le 0} = 1, so the hypothesis gives two sets in C with different traces on some finite set F; then some x ∈ F belongs to exactly one of the two traces, and C shatters {x}.

Suppose the theorem holds up to some k = K ≥ 1 and fix k := K + 1. For n < k we have \binom{n}{\le k−1} = 2^n ≥ m^C(n), so the statement holds vacuously. For n = k, the hypothesis m^C(k) > \binom{k}{\le k−1} = 2^k − 1 gives m^C(k) = 2^k, so C shatters some set of cardinality k. Now suppose the statement holds for n = N ≥ k and let n := N + 1. Take H_n = {x_1, …, x_n} with Δ^C(H_n) > \binom{n}{\le K}, and let H_N := {x_1, …, x_N}. If

(4.1.3) Δ^C(H_N) > \binom{N}{\le K},

then by the induction hypothesis on n, C shatters some set of cardinality k, so assume Δ^C(H_N) ≤ \binom{N}{\le K}. Let C_n := H_n ⊓ C := {A ∩ H_n : A ∈ C}. Call a set E ⊂ H_N full if both E and E ∪ {x_n} belong to C_n. Let f be the number of full sets. Then the map A ↦ A ∩ H_N takes C ⊓ H_n onto C ⊓ H_N and is two-to-one onto full sets and one-to-one onto non-full sets in C ⊓ H_N. Thus

(4.1.4) Δ^C(H_n) = Δ^C(H_N) + f.

Let F be the collection of all full sets. Suppose f = Δ^F(H_N) > \binom{N}{\le K−1}. Then by the induction hypothesis on k, F shatters some set E ⊂ H_N of cardinality K. Then C shatters E ∪ {x_n}, of cardinality k = K + 1: for any G ⊂ E, take a full set E′ with E′ ∩ E = G; then E′ and E′ ∪ {x_n} are both traces of C on H_n, inducing G and G ∪ {x_n} on E ∪ {x_n}.

In the remaining case, f ≤ \binom{N}{\le K−1}, so by (4.1.4) and Proposition 4.1.1,

Δ^C(H_n) ≤ \binom{N}{\le K} + \binom{N}{\le K−1} = \binom{n}{\le K},

contrary to hypothesis, completing the induction. The second sentence follows, since if m^C(n) > \binom{n}{\le k} for some n, C would shatter a set of cardinality k + 1.
4.1.5 Proposition For any nonnegative integers k and n with n ≥ k + 2,

(4.1.5) \binom{n}{\le k} ≤ 1.5n^k/k!.

Proof The inequality clearly holds for k = 0, so assume k ≥ 1. By the binomial theorem, n^{k−1}(k + n) ≤ (n + 1)^k, so

(4.1.6) n^{k−1}/(k − 1)! + n^k/k! ≤ (n + 1)^k/k!.

The proof will be done by induction on n and k. For k = 1 the inequality is n + 1 ≤ 1.5n, which holds for all n ≥ 2. For n = k + 2, the desired inequality is

2^n − n − 1 ≤ 1.5n^{n−2}/(n − 2)! = 1.5(n − 1)n^{n−1}/n!.

This can be checked directly for n = 3, 4, 5, and 6. Stirling's formula with an error bound (Theorem 1.3.13) gives n! ≤ (n/e)^n (2πn)^{1/2} e^{1/(12n)}. So it will be enough to prove

2^n (n/e)^n (2πn)^{1/2} e^{1/(12n)} ≤ 1.5(n − 1)n^{n−1},  n ≥ 7,

which follows from (e/2)^n > 2n^{1/2}, n ≥ 7; for f(x) := (e/2)^x and g(x) := 2x^{1/2} it is straightforward to check that f(7) > g(7), f′(7) > g′(7), f″ > 0, and g″ < 0, so f(x) > g(x) for all x ≥ 7.

Now suppose the proposition has been proved for n = k + i, i = 2, …, j, for all k, and for n = k + J, J := j + 1, for k = 1, …, K, as we have done for j = 2 and for K = 1. We need to prove (4.1.5) for n = k + J and k = K + 1. We have k + j = K + J = n − 1, and

\binom{n}{\le k} = \binom{n−1}{\le k} + \binom{n−1}{\le K}  by Proposition 4.1.1
 ≤ 1.5(k + j)^k/k! + 1.5(K + J)^K/K!  by the induction hypotheses
 ≤ 1.5n^k/k!  by (4.1.6),

completing the proof.
Combining Theorem 4.1.2 and Proposition 4.1.5 gives m^C(n) ≤ 1.5n^k/k! for n ≥ k + 2 where k := S(C). To see that Theorem 4.1.2 is sharp, let X be an infinite set and C the collection of all subsets of X with cardinality k. Then S(C) = k and the inequality in the second sentence of the theorem becomes an equality for all n.
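The sharpness of Sauer's lemma in this example, and the bound of Proposition 4.1.5, can be verified numerically. The sketch below is ours, not part of the text; it imitates an infinite X by a ground set with k extra points outside F.

```python
from itertools import combinations
from math import comb, factorial

k, n = 2, 6
X = range(n + k)            # k extra points play the role of X \ F
C = [frozenset(c) for c in combinations(X, k)]   # all subsets of cardinality k
F = frozenset(range(n))

m_n = len({A & F for A in C})                    # Delta^C(F), which equals m^C(n) here
bound = sum(comb(n, j) for j in range(k + 1))    # binom(n, <= k)
assert m_n == bound                              # Sauer's inequality is an equality
assert bound <= 1.5 * n ** k / factorial(k)      # Proposition 4.1.5, since n >= k + 2
```

Any subset of F with at most k elements extends to a k-element subset of X using points outside F, so every such subset occurs as a trace; that is why equality holds.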
Let dens(C) be the infimum of all r > 0 such that for some K < ∞, m^C(n) ≤ Kn^r for all n ≥ 1. Then we have

4.1.7 Corollary For any set X and C ⊂ 2^X, dens(C) ≤ S(C). Conversely, if dens(C) < ∞ then S(C) < ∞.

Proof By Theorem 4.1.2 and Proposition 4.1.5, there is a K such that m^C(n) ≤ Kn^{S(C)} for all n ≥ S(C) + 2. The same holds for all n ≥ 1, possibly with a larger K, so dens(C) ≤ S(C). Conversely, if dens(C) < ∞ then since m^C(n) ≤ Kn^r < 2^n for n large we have S(C) < ∞.

Note that S(C) can be determined by one large shattered set while dens(C) has to do with the behavior of C on arbitrarily large finite sets. For example, if X is a set with card(X) = n and C = 2^X then S(C) = n while dens(C) = 0.
For any set X, it is immediate that if C ⊂ D ⊂ 2^X then S(C) ≤ S(D) and dens(C) ≤ dens(D). The following is straightforward since for any set X, the map A ↦ X \ A is one-to-one from 2^X onto itself, and for any A, B, C ⊂ X, A ∩ B ≠ C ∩ B if and only if (X \ A) ∩ B ≠ (X \ C) ∩ B:

4.1.8 Proposition If X is any set, C ⊂ 2^X and D := {X \ A : A ∈ C}, then for all B ⊂ X, Δ^C(B) = Δ^D(B), so m^C(n) = m^D(n) for all n, S(D) = S(C) and dens(D) = dens(C).
4.2 Generating Vapnik-Cervonenkis classes

Let's begin with some examples of non-VC classes for which some limit theorems for empirical measures will fail. First, let X = [0, 1] and let C be the class of all finite subsets of X. Let P be the uniform (Lebesgue) law on [0, 1]. Clearly, S(C) = +∞, and C is not a VC class. Also, for any possible value of P_n, we will have P_n(A) = 1 for some A = {X_1, …, X_n} ∈ C while P(A) = 0. Thus sup_{A∈C}(P_n − P)(A) = 1 for all n, so C is not a Glivenko-Cantelli class for P; in other words,

‖P_n − P‖_C := sup_{A∈C} |(P_n − P)(A)|

doesn't approach 0 as n → ∞ in outer probability or in any sense, since it is identically 1. It follows that C is also not a Donsker class for P. Note that all functions 1_A for A ∈ C equal 0 almost surely for P. Thus, the whole class F := {1_A : A ∈ C} reduces to the one point 0 in the space L²(P) of equivalence classes, for equality almost everywhere, of functions in ℒ²(P), that is, measurable, square-integrable functions. Thus for purposes of empirical processes, functions equal a.s. P are not the same, and we need to deal with classes F ⊂ ℒ²(P) of actual real-valued functions, not equivalence classes. Then the integral ∫ f d(P_n − P) will be well-defined for any f ∈ ℒ²(P). This integral is linear in f and thus prelinear for f ∈ F for any set F ⊂ ℒ²(P). For the empirical process ν_n = n^{1/2}(P_n − P) we will not be taking versions or modifications as was done for Gaussian processes (Appendix J).

Next, let C₂ be the collection of all closed, convex subsets of ℝ². Let S¹ be the unit circle {(x, y) : x² + y² = 1}. For any finite subset F of S¹, the convex polygon with vertices in F (a singleton if |F| = 1, or a line segment if |F| = 2) is in C₂, and its intersection with S¹ is F. Thus S(C₂) = +∞ and C₂ is not a VC class. Let P be the uniform law dP(θ) = dθ/(2π) on S¹. Then the Glivenko-Cantelli and Donsker properties fail for P just as in the previous example.

Classes with S(C) finite, in other words Vapnik-Cervonenkis classes, can be formed in various ways. Here is one. Let G be a collection of real-valued
functions on a set X. Let

pos(g) := {x : g(x) > 0},  nn(g) := {x : g(x) ≥ 0},  g ∈ G,

pos(G) := {pos(g) : g ∈ G},  nn(G) := {nn(g) : g ∈ G},

U(G) := pos(G) ∪ nn(G).
4.2.1 Theorem Let H be an m-dimensional real vector space of functions on a set X, let f be any real function on X, and let H₁ := {f + h : h ∈ H}. Then S(pos(H₁)) = S(nn(H₁)) = m. If H contains the constants then also S(U(H₁)) = m.
Proof First it will be shown that S(pos(H₁)) = m. Clearly card(X) ≥ m. If card(X) = m, then H = H₁ is the set ℝ^X of all real-valued functions on X, so the result holds.

Otherwise, let A ⊂ X with card(A) = m + 1. Let G be the vector space {af + h : a ∈ ℝ, h ∈ H}. Let r_A : G → ℝ^A be the restriction of functions in G to A. If r_A is not onto, take v ≠ 0 in ℝ^A where v is orthogonal to r_A(G) for the usual inner product (·, ·)_A. Let A⁺ := {x ∈ A : v(x) > 0}. We can assume A⁺ is nonempty, replacing v by −v if necessary. If A⁺ = A ∩ pos(g) for some g ∈ G, then (r_A(g), v)_A > 0, a contradiction. So pos(G) doesn't shatter A.

Suppose instead that r_A(G) = ℝ^A. Then r_A is 1-1 on G, f ∉ H, and r_A(H₁) is a hyperplane in ℝ^A not containing 0, so that for some v ∈ ℝ^A, (j, v)_A = −1 for all j ∈ r_A(H₁). Again let A⁺ := {x ∈ A : v(x) > 0}. If A⁺ = A ∩ pos(f + h) for some h ∈ H, then (r_A(f + h), v)_A ≥ 0, a contradiction (here A⁺ may be empty). Thus pos(H₁) never shatters A, so S(pos(H₁)) ≤ m.

For each x ∈ X, a linear form δ_x is defined on H by δ_x(h) := h(x), h ∈ H. Let H′ be the vector space of all real linear forms on H. Then H′ is m-dimensional. Let H′_X be the linear span in H′ of the set of all δ_x, x ∈ X:

(4.2.2) H′_X := {Σ_{j=1}^r a_j δ_{x_j} : x_j ∈ X, a_j ∈ ℝ, r = 1, 2, …}.

The map h ↦ (τ ↦ τ(h)), h ∈ H, τ ∈ H′_X, is 1-1 and linear from H into (H′_X)′, so H′_X is m-dimensional. Take B = {x_1, …, x_m} ⊂ X such that the δ_{x_j} are linearly independent in H′_X. So r_B(H) = ℝ^B, r_B(H₁) = ℝ^B, and pos(r_B(H₁)) = 2^B, so S(pos(H₁)) = m.

Then S(nn(H₁)) = m by taking complements (Proposition 4.1.8). If H contains the constant functions, then the sets nn(f), f ∈ H₁, are the same as
the sets {f ≥ t}, f ∈ H₁, t ∈ ℝ, and the sets pos(f), f ∈ H₁, are the same as the sets {f > t}, f ∈ H₁, t ∈ ℝ. Now for any finite subset A of X, f ∈ H₁ and t ∈ ℝ, since f takes only finitely many values on A, there exist s and u such that A ∩ {f ≥ t} = A ∩ {f > s} and A ∩ {f > t} = A ∩ {f ≥ u}. So in this case S(U(H₁)) = m.

Examples. (I) Let H := P_{d,k} be the space of all polynomials of degree at most k on ℝ^d. Then for each d and k, H is a finite-dimensional vector space of functions, so pos(H) is a Vapnik-Cervonenkis class. For k = 2, it follows specifically that the set of all ellipsoids in ℝ^d is included in a Vapnik-Cervonenkis class and thus is one.
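For instance, with d = 1 and k = 2 (quadratics on the line, a 3-dimensional space), Theorem 4.2.1 gives S(pos(H)) = 3, and explicit quadratics realizing all 8 traces on a 3-point set are easy to exhibit. This check is ours, not from the text; the particular coefficient triples are illustrative choices.

```python
points = (-1.0, 0.0, 1.0)

def pos_trace(a, b, c):
    """Trace on `points` of pos(p) = {x : p(x) > 0} for p(x) = a + b*x + c*x**2."""
    return frozenset(x for x in points if a + b * x + c * x * x > 0)

coeffs = [(-1, 0, 0), (1, 0, 0), (-0.5, -1, 0), (-0.5, 1, 0),
          (0.5, 0, -1), (-0.5, 0, 1), (0.5, -1, 0), (0.5, 1, 0)]
traces = {pos_trace(a, b, c) for (a, b, c) in coeffs}
assert len(traces) == 8   # all subsets of the 3-point set are realized
```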
(II) Let X = ℝ. Let H be the 1-dimensional space of linear functions f(x) = cx, x ∈ ℝ, c ∈ ℝ. Then S(pos(H)) = S(nn(H)) = 1 by Theorem 4.2.1, but U(H) shatters {0, 1}. Since sets in U(H) are convex (half-lines), it follows that S(U(H)) = 2. So the condition that H contains the constants can't just be dropped from Theorem 4.2.1 for U(H₁).

Let X be a real vector space of dimension m. Let H be the space of all real affine functions on X, in other words functions of the form h + c where h is real linear and c is any real constant. Then H has dimension m + 1 and pos(H) is the set of all open half-spaces of X. Letting f = 0 in Theorem 4.2.1 for this H gives a special case known as Radon's theorem. On the other hand, Theorem 4.2.1 for f = 0 with general X and H follows from Radon's theorem via the following stability fact.
4.2.3 Theorem If X and Y are sets, F is a function from X into Y, C ⊂ 2^Y, and F⁻¹(C) := {F⁻¹(A) : A ∈ C}, then S(F⁻¹(C)) ≤ S(C). If F is onto Y, then S(F⁻¹(C)) = S(C).

Proof Let F⁻¹(C) shatter {x_1, …, x_m} where x_i ≠ x_j for i ≠ j. Then F(x_i) ≠ F(x_j) for i ≠ j and C shatters {F(x_1), …, F(x_m)}. So S(F⁻¹(C)) ≤ S(C). If F is onto Y and H ⊂ Y with card(H) = m, choose G ⊂ X such that F takes G 1-1 onto H. Then if C shatters H, F⁻¹(C) shatters G, so S(F⁻¹(C)) = S(C).

Now let X be any set and G a finite-dimensional real vector space of real functions on X. Then there is a natural map F : x ↦ δ_x from X into the space of linear functions on G. Then by Theorem 4.2.3 one could deduce Theorem 4.2.1 from its special case where X is an m- or (m + 1)-dimensional real vector space and f and all functions in H are affine, so that sets in pos(H₁) are open half-spaces.
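Theorem 4.2.3 can be illustrated on small finite sets. In this sketch (ours, not from the text), a map from a 6-point set onto a 3-point set pulls a class back without changing its index.

```python
from itertools import combinations

def S(C, X):
    """Largest cardinality of a subset of the finite set X shattered by C."""
    def shatters(F):
        return len({A & frozenset(F) for A in C}) == 2 ** len(F)
    return max((n for n in range(len(X) + 1)
                if any(shatters(F) for F in combinations(X, n))), default=-1)

X, Y = tuple(range(6)), tuple(range(3))
F = {x: x % 3 for x in X}                                    # F maps X onto Y
C = [frozenset({0}), frozenset({0, 1}), frozenset({2})]      # subsets of Y
C_pull = [frozenset(x for x in X if F[x] in A) for A in C]   # F^{-1}(C)

assert S(C_pull, X) == S(C, Y)   # equality, since F is onto
```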
Next it will be seen how a bounded number of Boolean operations preserves the Vapnik-Cervonenkis property.
4.2.4 Theorem Let X be a set, C ⊂ 2^X, and for k = 1, 2, …, let C(k) be the union of all (Boolean) algebras generated by k or fewer elements of C. Then dens(C(k)) ≤ k·dens(C), so if S(C) < ∞ then S(C(k)) < ∞.

Proof Let dens(C) = r, so that for any ε > 0 there is some M < ∞ such that m^C(n) ≤ Mn^{r+ε} for all n. For any A ⊂ X we have A ⊓ C(k) = (A ⊓ C)(k). An algebra A with k generators A_1, …, A_k has at most 2^k atoms, which are those nonempty sets that are intersections of some of the A_i and the complements of the rest. Sets in A are unions of atoms, so |A| ≤ 2^{2^k}. Thus |A ⊓ C(k)| ≤ 2^{2^k}|A ⊓ C|^k ≤ 2^{2^k}M^k|A|^{k(r+ε)}. Letting ε ↓ 0 gives dens(C(k)) ≤ k·dens(C). If S(C) < ∞ then by Corollary 4.1.7, S(C(k)) < ∞.

The constant 2^{2^k} is very large if k is at all large. Let C(⊓k) be the class of all intersections of at most k sets in C. Then C(⊓k) ⊂ C(k). For C(⊓k) the constant 2^{2^k} is not needed. Theorems 4.2.1 and 4.2.4 can be combined to generate Vapnik-Cervonenkis classes. For example, half-spaces in ℝ^d form a VC class. Intersections of at most k half-spaces give convex polytopes with at most k faces, so these form a VC class.
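As a small illustration of closure under bounded Boolean operations (our sketch, not from the text): on a 5-point line, left and right half-lines each form a class of index 1, and the class of intersections of at most two of them (discrete intervals) has index 2, still finite.

```python
from itertools import combinations

X = tuple(range(5))
halves = [frozenset(x for x in X if x <= t) for t in X] + \
         [frozenset(x for x in X if x >= t) for t in X]
C2 = {A & B for A in halves for B in halves} | set(halves)   # C(⊓2): intervals

def shatters(C, F):
    return len({A & frozenset(F) for A in C}) == 2 ** len(F)

assert any(shatters(C2, F) for F in combinations(X, 2))      # some pair is shattered
assert not any(shatters(C2, F) for F in combinations(X, 3))  # no triple: index 2
```

An interval's trace on x < y < z can never be {x, z}, which is why no 3-point set is shattered.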
Remarks. Let X be an infinite set, r = 1, 2, …, and let C_r be the collection of all subsets of X with at most r elements. Then clearly dens(C_r) = S(C_r) = r. It's easy to check that D := C_r(k) consists of all sets B such that either B or X \ B has at most kr elements. Thus m^D(n) ≤ 2\binom{n}{\le kr} for all n, while D ⊃ C_{kr}, so dens(D) = kr = k·dens(C_r), showing that the bound in Theorem 4.2.4 is sharp.

Classes with V(C) = 0 or 1 are easily characterized:
4.2.5 Proposition A class C of subsets of a set X has V (C) = 0, or equivalently S(C) = -1, if and only if C is empty. Also, V (C) = 1, or equivalently
S(C) = 0, if and only if C contains exactly one set. Thus S(C) ≥ 1 if and only if C contains at least two sets.

Proof Clearly C shatters the empty set if and only if C contains at least one set. If C contains at least two sets, then for some A, B ∈ C and x ∈ X, x ∈ A \ B. Then C shatters {x}, so S(C) ≥ 1. Conversely, if S(C) ≥ 1, then clearly C contains at least two sets.
Here are two sufficient conditions for S(C) = 1:
4.2.6 Theorem If C is a collection of at least two subsets of a set X, then S(C) = 1 if either (a) C is linearly ordered by inclusion, or (b) any two sets in C are disjoint.

Proof In any case, S(C) ≥ 1. If C is linearly ordered by inclusion, suppose it shatters {x, y}. Let A, B ∈ C with A ∩ {x, y} = {x} and B ∩ {x, y} = {y}. But A ⊂ B or B ⊂ A, giving a contradiction. If the sets in C are disjoint, then we can argue as in part (a), taking also C ∈ C with {x, y} ⊂ C; but C cannot be disjoint from A or B, a contradiction.

Section 4.4 will go more into detail about classes of index 1.
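Both cases of Theorem 4.2.6 are easy to check exhaustively on small examples; the sketch below is ours, not from the text.

```python
from itertools import combinations

def shatters(C, F):
    return len({A & frozenset(F) for A in C}) == 2 ** len(F)

X = tuple(range(6))
chain    = [frozenset(range(k)) for k in range(1, 7)]   # linearly ordered by inclusion
disjoint = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]

for C in (chain, disjoint):
    assert any(shatters(C, F) for F in combinations(X, 1))      # S(C) >= 1
    assert not any(shatters(C, F) for F in combinations(X, 2))  # so S(C) = 1
```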
*4.3 Maximal classes
Let C ⊂ A be classes of subsets of a set X. Then C will be called (A, n)-maximal if S(C) = n while if C ⊂ D strictly and D ⊂ A, then S(D) > n. If A is the class 2^X of all subsets of X, then C will be called n-maximal. If C is n-maximal, then clearly C is (A, n)-maximal for any A such that C ⊂ A ⊂ 2^X. In view of Proposition 4.2.5, classes C with S(C) = i, i = −1 or 0, are empty or contain just one set, respectively, and so are always i-maximal. Thus n-maximality becomes interesting only for n ≥ 1.
Examples. 1. For any set X, let C consist of ∅ (the empty set) and all singletons {x} for x ∈ X. Then C is clearly 1-maximal.

2. Let X = ℝ. Let LH consist of ∅, ℝ, and all left half-lines, closed (−∞, x] or open (−∞, x), for x ∈ ℝ. In other words, LH is the collection of all subsets A ⊂ ℝ such that whenever x < y ∈ A then also x ∈ A. Then clearly S(LH) = 1 since for x < y and A ∈ LH, A ∩ {x, y} ≠ {y}. On the
other hand if any subset of ℝ not in LH is adjoined, then some 2-element set is shattered, so LH is 1-maximal.
3. Let X = ℝ and let C₀ consist of all subintervals of ℝ, namely ∅, ℝ, any closed or open, left or right half-line, and any bounded interval, open or closed at either end. In other words, C₀ is the class of all convex subsets of ℝ. Then S(C₀) = 2; in fact C₀ shatters every 2-element subset of ℝ, while if x < y < z and A ∈ C₀, then A ∩ {x, y, z} ≠ {x, z}. On the other hand if any set not in C₀ is adjoined to it, its index becomes 3, so C₀ is 2-maximal.

Here is an existence theorem for maximal classes:
4.3.1 Theorem Let X be a set and D ⊂ 2^X. Suppose that C ⊂ D and S(C) = n. Then there exists a (D, n)-maximal class B with C ⊂ B.

Proof Zorn's lemma (RAP, Section 1.5) will be applied. Let (B_α)_{α∈I} be such that C ⊂ B_α ⊂ D for all α in the index set I and such that the B_α are linearly ordered (i.e., form a chain) by inclusion, with S(B_α) = n for all α. Let A := ⋃_α B_α. To show that S(A) = n, suppose A shatters a set F with |F| = n + 1. Each of the 2^{n+1} subsets of F is induced by a set in some B_α. Since there are only finitely many of these sets and the B_α are linearly ordered by inclusion, there is some α such that B_α shatters F, a contradiction. So the chain {B_α}_{α∈I} has an upper bound. So by Zorn's lemma the collection of all B with C ⊂ B ⊂ D and S(B) = n has a maximal element (for inclusion).

The following fact is straightforward:
4.3.2 Proposition For any set X, Y ⊂ X, C ⊂ 2^X, and C_Y := C ⊓ Y, we have S(C_Y) ≤ S(C).

Recall that Z₂ := {0, 1} with addition mod 2, in other words the usual addition except that 1 + 1 = 0. For any set X, the group Z₂^X of all functions from X into Z₂, with the natural addition (f + g)(x) := f(x) + g(x) in Z₂, provides a group structure for the collection 2^X of all subsets of X. Addition of indicator functions mod 2 corresponds to the symmetric difference AΔB := (A \ B) ∪ (B \ A), so that 1_A + 1_B = 1_{AΔB} mod 2. For any fixed set A ⊂ X, the translation 1_B ↦ 1_A + 1_B takes Z₂^X one-to-one and onto itself. If the functions are restricted to a subset Y ⊂ X, translation still takes Z₂^Y one-to-one and onto itself. For any A ⊂ X and C ⊂ 2^X, let AΔC := {AΔC : C ∈ C}. Then for any finite F ⊂ X, C shatters F if and only if AΔC does. It follows that:
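The invariance of shattering under the translation C ↦ AΔC can be checked directly; the sketch below is ours, not from the text, with an arbitrary small class and translating set.

```python
from itertools import combinations

def S(C, X):
    """Index of C over the finite set X (largest shattered cardinality)."""
    def shatters(F):
        return len({B & frozenset(F) for B in C}) == 2 ** len(F)
    return max((n for n in range(len(X) + 1)
                if any(shatters(F) for F in combinations(X, n))), default=-1)

X = tuple(range(4))
C = [frozenset({0}), frozenset({0, 1}), frozenset({1, 2, 3})]
A = frozenset({1, 3})
C_delta = [A ^ B for B in C]      # AΔC: symmetric difference with the fixed set A

assert S(C, X) == S(C_delta, X)   # translation preserves the index
```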
144
Vapnik-Cervonenkis Combinatorics
4.3.3 Proposition For any fixed set A ⊂ X and class C ⊂ 2^X, S(C) = S(AΔC). Also, C is n-maximal if and only if AΔC is.

Next, we have:
4.3.4 Proposition If C is an n-maximal class of subsets of a set X, and n ≥ 1, then ⋃{A : A ∈ C} = X and ⋂{A : A ∈ C} = ∅.

Proof Suppose there is an x such that x ∉ A for all A ∈ C. Take any B ∈ C and let D := C ∪ {B ∪ {x}}. Then by n-maximality, D must shatter some set F of cardinality n + 1 ≥ 2, and evidently x ∈ F. Thus D contains at least 2^n ≥ 2 sets which contain x, but it only contains one, a contradiction. This proves the first conclusion. The second follows on taking complements, setting A = X in Proposition 4.3.3.
On Z₂^X = 2^X there is a product topology coming from the discrete topology on Z₂. The product topology is compact by Tychonoff's theorem (RAP, Theorem 2.2.8).

4.3.5 Proposition For any set X, any n-maximal class C ⊂ 2^X is closed and therefore compact in 2^X.

Proof Suppose C_α → C is a convergent net in 2^X with C_α ∈ C for all α. Then for any finite set F ⊂ X, there is some α with C_α ∩ F = C ∩ F, so S(C ∪ {C}) = S(C) and C ∈ C. So C is closed (RAP, Theorem 2.1.3).

A class C of subsets of a set X will be called complemented if X \ A ∈ C for every A ∈ C.
4.3.6 Theorem If S(C) = n, C ⊂ A strictly, and C is complemented, then C is not (A, n)-maximal.

Proof For any finite set F ⊂ X and G ⊂ F, G ∈ C ⊓ F if and only if F \ G ∈ C ⊓ F. Thus if |F| = n + 1, then |C ⊓ F| ≤ 2^{n+1} − 2. So, for any A ∈ A \ C, S(C ∪ {A}) = n.

If 𝓕 is a k-dimensional real vector space of real-valued functions on a set X containing the constants, and C is the collection U(𝓕) of all sets {x : f(x) > 0} or {x : f(x) ≥ 0} for all f ∈ 𝓕, then S(C) = k by Theorem 4.2.1. Since C is complemented, if U(𝓕) ≠ 2^X, then C is not k-maximal.
Let X be any set and C_k the collection of all subsets of X with at most k elements. Then clearly S(C_k) = k. Also, C_k is k-maximal since if A ∉ C_k, A ⊂ X, then |A| > k, and if B is any subset of A with |B| = k + 1 then B is shattered by C_k ∪ {A}. For C = C_k we have m^C(n) = \binom{n}{\le k} for all n.
*4.4 Classes of index 1

In this section the structure of classes C with S(C) = 1 will be treated. Recall that for classes of two or more sets, disjoint classes and classes linearly ordered by inclusion have S(C) = 1 (Theorem 4.2.6). A common extension of these two kinds of classes is given by treelike partial orderings, defined as follows. A binary relation ≤ on a set X will be called a quasi-order if it is transitive: x ≤ y and y ≤ z imply x ≤ z; and reflexive: x ≤ x for all x ∈ X. The quasi-order is called a partial order if also x ≤ y and y ≤ x imply x = y. For any set S, inclusion (⊂) is a partial order on 2^S or any subset of 2^S. Let ≤ be a quasi-order on a set X. Then two elements x and y of X are called comparable if at least one of x ≤ y and y ≤ x holds, or incomparable if neither holds. A quasi-order ≤ on X will be called linear if any two elements of X are comparable. A quasi-order ≤ will be called treelike if for any y ∈ X and L_y := {x : x ≤ y}, the restriction of ≤ to L_y is linear.

4.4.1 Theorem Let C ⊂ 2^X contain at least two sets and satisfy, for any x ≠ y in X,

(4.4.2) A ∩ {x, y} = ∅ for some A ∈ C.

Then the following are equivalent:

(a) S(C) = 1;
(b) for every Y ⊂ X, the partial ordering of C_Y := Y ⊓ C by inclusion is treelike;
(c) for every Y ⊂ X with |Y| = 2, the partial ordering of C_Y := Y ⊓ C by inclusion is treelike.

Note. Clearly, (4.4.2) holds if ∅ ∈ C.
Proof In proving (a) implies (b), since S(C_Y) ≤ S(C) and classes with S ≤ 0 contain at most one set and trivially have a treelike ordering by inclusion, we can assume Y = X. If C doesn't have a treelike ordering by inclusion, there are a set D ∈ C and B ⊂ D, C ⊂ D with B, C ∈ C such that B and C are not comparable, so there exist some x ∈ B \ C and y ∈ C \ B. Take A from assumption (4.4.2). But then the sets A, B, C, and D shatter {x, y} and S(C) ≥ 2, a contradiction. So (b) holds. Now (b) implies (c) directly. If (c) holds and |Y| = 2, since 2^Y doesn't have a treelike ordering by inclusion, C must not shatter Y, so (a) follows.
4.4.3 Proposition Let X be a set and A ⊂ 2^X where ∅ ∈ A and for any B and C in A, B ∩ C ∈ A. If C is (A, 1)-maximal and satisfies (4.4.2) for any x ≠ y in X, then ∅ ∈ C and B ∩ C ∈ C for any B and C in C.

Proof If x ≠ y and (4.4.2) holds for A, then A ∩ {x, y} = ∅ = ∅ ∩ {x, y}, so adjoining ∅ to C doesn't induce any additional subsets of sets with two elements, and S(C ∪ {∅}) = 1, so by maximality ∅ ∈ C. Suppose B, C ∈ C and S(C ∪ {B ∩ C}) > 1. Then for some x ≠ y in X, B ∩ C ∩ {x, y} ≠ D ∩ {x, y} for all D ∈ C. Then by (4.4.2), we can assume x ∈ B ∩ C. If {x, y} ⊂ B ∩ C ⊂ B then taking D = B would give a contradiction, so y ∉ B ∩ C. Now B ∩ C ∩ {x, y} = {x} ≠ D ∩ {x, y} for D = B or C implies y ∈ B and y ∈ C, again a contradiction. So B ∩ C ∈ C.
4.4.4 Proposition Let X be a set and C a finite class of subsets of X with S(C) = 1 such that for any x ≠ y in X, (4.4.2) holds. Let D := D(C) consist of ∅ and all intersections of nonempty subclasses of C. Then S(D) = 1. For each nonempty set D ∈ D there is a C := C(D) ∈ D such that C ⊂ D strictly (C ≠ D) and if B is any set in D with B ⊂ D strictly, then B ⊂ C.

Proof By Theorem 4.3.1, let C ⊂ E where E is 1-maximal. Then by Proposition 4.4.3 for A = 2^X and induction, C ⊂ D ⊂ E, so S(D) = 1. Clearly, D is finite. For each nonempty D ∈ D, by Theorem 4.4.1, {B ∈ D : B ⊂ D} is linearly ordered by inclusion and contains ∅, so it has a largest element C(D) other than D itself.
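Propositions 4.4.4 and 4.4.5 can be seen concretely. In this sketch (ours, not from the text; the class C is an arbitrary small example with ∅ ∈ C, so that (4.4.2) holds), D(C) is built by brute force and the pieces D \ C(D) come out nonempty and pairwise disjoint.

```python
from itertools import combinations

# A small class with S(C) = 1 containing the empty set.
C = [frozenset(), frozenset({0}), frozenset({0, 1}), frozenset({0, 2})]

# D(C): the empty set together with all intersections of nonempty subclasses.
D = {frozenset()}
for r in range(1, len(C) + 1):
    for sub in combinations(C, r):
        inter = sub[0]
        for A in sub[1:]:
            inter = inter & A
        D.add(inter)

def C_of(Dset):
    """C(D): the largest member of D strictly contained in Dset."""
    return max((B for B in D if B < Dset), key=len)

pieces = [Dset - C_of(Dset) for Dset in D if Dset]
assert all(pieces)                      # each D \ C(D) is nonempty
for P, Q in combinations(pieces, 2):
    assert not (P & Q)                  # and they are pairwise disjoint
```

Taking the largest proper member by size is justified here because, by Proposition 4.4.4, the members of D below a given D are linearly ordered by inclusion.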
4.4.5 Proposition Under the hypotheses of Proposition 4.4.4, the sets D \ C(D) for distinct nonempty D ∈ D are all disjoint and are nonempty.

Proof Let A ≠ B in D. If B ⊂ A, then B ⊂ C(A), so B and hence B \ C(B) are disjoint from A \ C(A). Otherwise, A ∩ B ⊂ B strictly, and then A ∩ B ⊂ C(B), so again A \ C(A) is disjoint from B \ C(B). That A \ C(A) ≠ ∅ for A ≠ ∅ follows from the definitions.
A graph is a nonempty set S together with a set E of unordered pairs {x, y} for some x ≠ y in S. Then S will be called the set of nodes and E the set of edges of the graph. The graph (S, E) is called a tree if (a) it is connected, in other words for any x and y in S there are a finite n and x_i ∈ S, i = 0, 1, …, n, such that x_0 = x, x_n = y, and {x_{k−1}, x_k} ∈ E for k = 1, …, n; and (b) it is acyclic, which means that there is no cycle, where a cycle is a set of distinct x_1, …, x_n ∈ S such that n ≥ 3 and, letting x_0 := x_n, {x_{k−1}, x_k} ∈ E for k = 1, …, n.

4.4.6 Theorem (a) For any positive integer m, there exist connected graphs with m nodes and m − 1 edges. (b) A connected graph with m nodes cannot have fewer than m − 1 edges. (c) A connected graph with m nodes has exactly m − 1 edges if and only if it is a tree.
Proof (a) is clear. (b) will be proved by induction. It is clearly true for m = 1, 2. Suppose (S, E) is a connected graph with |S| = m, |E| ≤ m − 2, and m ≥ 3. The edges in E contain at most 2m − 4 nodes, counted with multiplicity, so at least four nodes appear in only one edge each, or some node is in no edge, a contradiction. Select a node in only one edge and delete it and the edge that contains it. The remaining graph, with m − 1 nodes and at most m − 3 edges, must be connected, but by induction assumption it is not, a contradiction, so (b) holds.
For (c), let (S, E) be a connected graph with |S| = m and |E| = m − 1. If the graph contains a cycle, we can delete any one edge in the cycle and the graph remains connected, contradicting (b). So (S, E) is a tree. Conversely, let (S, E) be a tree with |S| = m. It will be proved by induction that |E| ≤ m − 1.
This is clearly true for m = 1, 2. Suppose |E| ≥ m ≥ 3. Take a maximal set C := {x_1, …, x_k} ⊂ S such that the x_i are distinct and {x_{j−1}, x_j} ∈ E for j = 2, …, k. Then there is no y ≠ x_2 with {x_1, y} ∈ E: y cannot be any x_j, j ≥ 3, or there would be a cycle, and there is no such y ∉ C since C and k are maximal. So we can delete the node x_1 and the edge {x_1, x_2} from the graph, leaving a graph which is still a tree with m − 1 nodes and at least m − 1 edges, contradicting the induction hypothesis and so proving (c). □

Let the class D in Propositions 4.4.4 and 4.4.5 form the nodes of a graph G whose edges are the pairs {C(D), D} for D ∈ D, D ≠ ∅.
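Theorem 4.4.6(b) can also be confirmed exhaustively for a small node set. The sketch below (our code, with a hypothetical helper `connected`) checks that no graph on m = 5 nodes with fewer than m − 1 edges is connected, while a path on m nodes, with exactly m − 1 edges, is:

```python
from itertools import combinations

def connected(nodes, edges):
    """Graph-search reachability from the first node."""
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for a, b in edges:
            v = b if u == a else a if u == b else None
            if v is not None and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

m = 5
nodes = list(range(m))
for k in range(m - 1):                              # any edge count below m − 1
    for E in combinations(combinations(nodes, 2), k):
        assert not connected(nodes, E)              # Theorem 4.4.6(b)
path = [(i, i + 1) for i in range(m - 1)]           # a tree with m − 1 edges
assert connected(nodes, path)
```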
4.4.7 Proposition The graph G is a tree.

Proof If D has m elements, then there are exactly m − 1 pairs {C(D), D} for D ∈ D, D ≠ ∅. Starting with any D ∈ D, we have a decreasing sequence of sets D ⊃ C(D) ⊃ C(C(D)) ⊃ …, which must end with the empty set, so all sets in D are connected in G via the empty set, and G is connected. Then by Theorem 4.4.6 it is a tree. □

4.4.8 Proposition Let X be a finite set. Let C be 1-maximal in X and suppose (4.4.2) holds for all x ≠ y in X. Then C = D(C) as defined in Proposition 4.4.4. The sets D \ C(D) for nonempty D ∈ C are all the singletons {x}, x ∈ X. If |X| = m then |C| = m + 1.

Proof C = D(C) by Proposition 4.4.4. Suppose that for some D ∈ C, D \ C(D) has two or more elements. Then for some B, C := C(D) ⊂ B ⊂ D where both inclusions are strict. It will be shown that S(C ∪ {B}) = 1. If not, then for some x ≠ y, C ∪ {B} shatters {x, y}, so B ∩ {x, y} ≠ F ∩ {x, y} for all F ∈ C. Letting F = C shows that B ∩ {x, y} ≠ ∅. Likewise, letting F = D shows that B ∩ {x, y} ≠ {x, y}. So we can assume B ∩ {x, y} = {x}. Taking F = D shows that y ∈ D. If G ∩ {x, y} = {y} for some G ∈ C, then y ∈ G ∩ D ∈ C, and G ∩ D ⊂ D strictly, so G ∩ D ⊂ C and y ∈ C ⊂ B, giving a contradiction. So C ∪ {B} does not shatter {x, y}, and S(C ∪ {B}) = 1, contradicting 1-maximality of C.

So, each set D \ C(D) for ∅ ≠ D ∈ C is a singleton. Each singleton {x} equals D \ C(D) for at most one D ∈ C by Proposition 4.4.5. By Proposition 4.3.4, X = ⋃_{D∈C} D. For any x ∈ X take D_1 ∈ C with x ∈ D_1. Let D_{n+1} := C(D_n) for n = 1, 2, …. For some m, D_m = ∅, and {x} = D_j \ C(D_j) for some j < m. So all singletons are of the form D \ C(D), D ∈ C. This gives a 1-1 correspondence between singletons and nonempty sets in C, so if |X| = m there are exactly m such sets and |C| = m + 1. □

Suppose throughout this paragraph that C is a class of two or more sets such that (4.4.2) holds with ∅ replaced by {x, y}. Then the class of complements, N := {X \ C : C ∈ C}, satisfies the original hypotheses of Theorem 4.4.1. If C is 1-maximal, so is N by Proposition 4.3.3. So Theorem 4.4.1 and the facts numbered 4.4.3 through 4.4.7 apply to N, and so does Proposition 4.4.8 if X is finite. Then C itself has a "cotreelike" ordering, where for each C ∈ C, {D ∈ C : C ⊂ D} is linearly ordered by inclusion. Propositions 4.4.3 and 4.4.4 apply to C if ∅ is replaced by X and intersections by unions; in Proposition 4.4.4, we will have an immediate successor D(C) ⊃ C instead of a
predecessor; and sets D(C) \ C instead of D \ C(D) in Propositions 4.4.5 and 4.4.8. The resulting tree (Proposition 4.4.7) then branches out as sets become smaller rather than larger. Next will be several facts in the general case, that is, without the hypothesis (4.4.2).
4.4.9 Theorem Let X be any set and C any collection of subsets with S(C) = 1. Then for any C ∈ C, the collection C_{X\C} := {B \ C : B ∈ C} satisfies (4.4.2) for any x ≠ y as a collection of subsets of X \ C. Likewise, C'_C := {C \ B : B ∈ C} satisfies (4.4.2) for any x ≠ y as a collection of subsets of C. Also, S(C'_C) ≤ 1 and S(C_{X\C}) ≤ 1.

Proof Letting B = C shows that both classes C'_C and C_{X\C} contain ∅, so (4.4.2) holds for them. Both have index S ≤ 1 by Propositions 4.3.2 and 4.3.3. □
So, for an arbitrary class C with S(C) = 1, we have by Theorem 4.4.1 a treelike partial ordering by inclusion in one part X \ C of X and a cotreelike ordering in the complementary part C, for any C ∈ C. If X \ C also happens to be in C, both orderings are linear. To see how the two orderings fit together in general, Proposition 4.3.3 gives:
4.4.10 Corollary Let C be any class of sets with S(C) = 1 and A ∈ C. Let D := A Δ C := {A Δ C : C ∈ C}. Then S(D) = 1 and ∅ ∈ D. If C is 1-maximal, so is D. Then Theorem 4.4.1, Proposition 4.4.3, and, if C is finite, Propositions 4.4.4, 4.4.5, 4.4.7, and, if X is finite, Proposition 4.4.8, apply to D.

The last sentence in Proposition 4.4.8 has a converse and extension:
4.4.11 Proposition Let X be finite with m elements and C ⊂ 2^X with S(C) = 1. Then C is 1-maximal if and only if |C| = m + 1.
Proof For any fixed C ∈ C, we can replace C by {C Δ B : B ∈ C} without loss of generality by Proposition 4.3.3. So we can assume ∅ ∈ C, and then (4.4.2) holds for all x, y. Then Proposition 4.4.8 implies "only if."

Conversely, let S(C) = 1 and |C| = m + 1. Let C ⊂ D strictly. Then |D| ≥ m + 2, so by Sauer's lemma (Theorem 4.1.2), S(D) ≥ 2. So C is 1-maximal. □
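Proposition 4.4.11 is easy to confirm by exhaustion on a small X. In this sketch (our code; `shatters` and `shatters_some_pair` are helper names of our choosing) the class is a maximal chain, which has m + 1 members and index 1, and adjoining any further subset forces a shattered pair:

```python
from itertools import combinations

def shatters(classes, pts):
    traces = {frozenset(set(C) & set(pts)) for C in classes}
    return len(traces) == 2 ** len(pts)

def shatters_some_pair(classes, X):
    return any(shatters(classes, p) for p in combinations(X, 2))

m = 4
X = list(range(m))
C = [set(X[:k]) for k in range(m + 1)]     # ∅ ⊂ {0} ⊂ … ⊂ X, so S(C) = 1
assert len(C) == m + 1
assert not shatters_some_pair(C, X)        # index 1
# adjoining any other subset of X pushes the index to 2:
for r in range(m + 1):
    for B in map(set, combinations(X, r)):
        if B not in C:
            assert shatters_some_pair(C + [B], X)
```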
Now m + 1 = (m choose 0) + (m choose 1), which is the maximum value of m_C(m) for S(C) = 1 by Sauer's lemma (Theorem 4.1.2). The example at the end of Section 4.3
shows that Proposition 4.4.11, in the form |C| = (m choose 0) + (m choose 1), does not extend to indices larger than 1. For a set X, a subset Y ⊂ X, and a class C ⊂ 2^X, recall that C_Y := C ∩ Y := {A ∩ Y : A ∈ C}.
4.4.12 Theorem If C is 1-maximal and ∅ ≠ Y ⊂ X, then C_Y is a 1-maximal class of subsets of Y.
Proof Let A ∈ C. Without changing 1-maximality, C can be replaced by {A Δ C : C ∈ C} (Proposition 4.3.3). So we can assume ∅ ∈ C. We can also assume that |X| ≥ 2.

Case 1. Suppose X is finite, |X| = m < ∞. Then by Proposition 4.4.8 or 4.4.11, |C| = m + 1. Let x ∈ X and Y := X \ {x}. Let B := {y ∈ Y : {x, y} ∈ C_{x,y} and {y} ∈ C_{x,y}}, where C_{x,y} denotes the traces of C on {x, y}. A set C ⊂ Y will be called full if C ∈ C and C ∪ {x} ∈ C.

4.4.13 Lemma If A ⊂ Y and A is full then A = B.

Proof Suppose A is full. First suppose y ∈ A \ B. Then A ∪ {x} ∈ C implies {x, y} ∈ C_{x,y}, and A ∈ C implies {y} ∈ C_{x,y}, contradicting y ∉ B. Next, suppose y ∈ B \ A. Then {x} = (A ∪ {x}) ∩ {x, y}, ∅ = A ∩ {x, y}, and y ∈ B imply that C shatters {x, y}, a contradiction. So A = B, proving the lemma. □
Continuing the proof of Theorem 4.4.12, any two distinct sets in C have different intersections with Y, except possibly for B and B ∪ {x}. Now |C| = m + 1 by Proposition 4.4.11, so |C_Y| ≥ m, and |C_Y| = m since S(C_Y) ≤ 1 and |Y| = m − 1. So if m ≥ 2, C_Y is a 1-maximal class of subsets of Y by Proposition 4.4.11 again. Then by downward induction, C_Y is 1-maximal in Y for any Y ⊂ X, Y ≠ ∅, proving the theorem if X is finite.
Case 2. Let X be general and ∅ ≠ Y finite. Suppose C_Y is not 1-maximal in Y. Then by Case 1, for any finite Z ⊃ Y, C_Z is not 1-maximal in Z. Let E be the class of all B ⊂ Y such that B ∉ C_Y and S({B} ∪ C_Y) ≤ 1. Suppose for each B ∈ E there is some finite Z(B) ⊃ Y such that there is no A ⊂ Z(B) with S({A} ∪ C_{Z(B)}) ≤ 1 and A ∩ Y = B. Let Z := ⋃_{B∈E} Z(B). Then Z is a finite set including Y. So C_Z is strictly included in some class D which is 1-maximal in Z, by Theorem 4.3.1 (with A = 2^Z), and D_Y is 1-maximal
by Case 1. So B ∈ D_Y for some B ∈ E, say D ∩ Y = B for some D ∈ D. Then A := D ∩ Z(B) gives a contradiction.

So, for some B ∈ E and any finite Z ⊃ Y, there is an A ⊂ Z such that A ∩ Y = B and S(C_Z ∪ {A}) ≤ 1. Let D_Z be the class of all subsets D of X for which D ∩ Z is such a set A. Then D_Z is compact in the product topology of 2^X (which was treated in Proposition 4.3.5). For any finite U ⊃ Y and V ⊃ Y, D_U ∩ D_V ⊃ D_{U∪V} ≠ ∅. So the intersection of all the D_U is nonempty. Let D ∈ D_U for all finite U ⊃ Y. Then S(C ∪ {D}) = 1, so D ∈ C, but D ∩ Y = B ∉ C_Y, a contradiction. This finishes the proof in Case 2.
Case 3. Let X be infinite and U any infinite subset. Let F(U) := {D ⊂ U : D ∩ Y ∈ C_Y for all finite Y ⊂ U}. Then clearly C_U ⊂ F(U). Conversely, if D ∈ F(U), then for each finite Z ⊂ X, let Y := Z ∩ U and let G(Z) := {A ∈ C : A ∩ Y = D ∩ Y}. Then G(Z) is compact by Proposition 4.3.5, nonempty since D ∈ F(U), and decreases as Z increases. Thus there is some C ∈ ⋂_Z G(Z). Then for any finite Z, C ∩ Z ∩ U = D ∩ Z ∩ U, so C ∩ U = D ∩ U = D. So D ∈ C_U and C_U = F(U).

For each finite, nonempty Y ⊂ U, C_Y is 1-maximal in Y by Case 2. Suppose F(U) is not 1-maximal in U. Let A ⊂ U, A ∉ F(U), and S(E) ≤ 1 where E := {A} ∪ F(U). For each finite Y ⊂ U, C_Y ⊂ E_Y, so C_Y = E_Y. Let R_Y := {B ∈ C : B ∩ Y = A ∩ Y}. Then R_Y ≠ ∅ and, as shown above, ⋂_Y R_Y ≠ ∅. Let B ∈ ⋂_Y R_Y. Then A = B ∩ U ∈ C_U = F(U), a contradiction. So C_U is 1-maximal in U. □
For any set X and C ⊂ 2^X, let x ≤ y if and only if x ∈ B for every B ∈ C with y ∈ B, and for each y ∈ X let L_y := {x : x ≤ y}.

4.4.14 Theorem If S(C) = 1, ⋃_{B∈C} B = X, C satisfies (4.4.2) for all x ≠ y, and C is the class of all subsets of X linearly ordered by ≤ and closed downward for ≤, then S(C) = 1 and C is 1-maximal.

Proof Suppose S(C) = 1 and (4.4.2) holds for all x ≠ y. Let x and y be incomparable for ≤, so that there are C, D ∈ C with x ∈ C \ D and y ∈ D \ C. If some member of C contained both x and y, then C would shatter {x, y}, a contradiction. So no set of C contains two incomparable points; in particular each set {x : x ≤ y} = L_y ∈ C, and ∅ ∈ C, so S(C) = 1.

Let A ⊂ X and suppose A is not linearly ordered by ≤. Take x, y ∈ A which are not comparable for ≤. Since ∅, L_x and L_y are all in C, C ∪ {A} shatters {x, y}, and S(C ∪ {A}) ≥ 2.

Let B ⊂ X and suppose there exist x ≤ y with y ∈ B and x ∉ B. Then L_x ∈ C, L_y ∈ C, and B ∩ {x, y} = {y} imply that S(C ∪ {B}) ≥ 2. This has now been shown for any B ∉ C, so C is 1-maximal. □

Example. Let X = {1, 2, 3, 4, 5} and
G := {∅, {1}, {5}, {2, 5}, {1, 2, 4}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}}. Let C be the complement of G in 2^X. Then it can be checked that C is 3-maximal, but for Y = {1, 2, 3, 4}, C_Y is not 3-maximal in Y. So in Theorem 4.4.12, "1-maximal" and "Y ≠ ∅" cannot be replaced by "3-maximal" and "|Y| ≥ 3" respectively.
*4.5 Combining VC classes

Recalling the density as in Corollary 4.1.7, the following is clear:
4.5.1 Theorem For any set X, if A ⊂ 2^X and C ⊂ 2^X then dens(A ∪ C) = max(dens(A), dens(C)).

For the Vapnik-Červonenkis index we have instead:
4.5.2 Proposition For any set X, A ⊂ 2^X and C ⊂ 2^X, S(A ∪ C) ≤ S(A) + S(C) + 1. This bound is optimal: for any nonnegative integers k and m there exist X and A, C ⊂ 2^X with S(A) = k, S(C) = m, and S(A ∪ C) = k + m + 1.
Proof By Theorem 4.1.2, if k := S(A), m := S(C), and n ≥ k + m + 2, then

m_{A∪C}(n) ≤ m_A(n) + m_C(n) ≤ Σ_{j=0}^{k} (n choose j) + Σ_{j=n−m}^{n} (n choose j) < 2^n,

so S(A ∪ C) < n. It follows that S(A ∪ C) ≤ k + m + 1 as claimed. Conversely, given k and m, let n = k + m + 1 and let X be a set with n members. Let A be the set of all subsets of X with at most k members and let C be the set of subsets of X with at least n − m = k + 1 members. Then S(A) = k, S(C) = m, and A ∪ C = 2^X, so S(A ∪ C) = n. □

Let X be a set and C, D any two collections of subsets of X. Let
C ⊓ D := {C ∩ D : C ∈ C, D ∈ D},
C ⊔ D := {C ∪ D : C ∈ C, D ∈ D}.

If A is a class of subsets of another set Y let

C ⊗ A := {C × A : C ∈ C, A ∈ A}.

For such classes, we have:
4.5.3 Theorem For any C ⊂ 2^X and D ⊂ 2^X or 2^Y, let k := dens(C) and m := dens(D). Then dens(C □ D) ≤ k + m for □ = ⊓, ⊔, or ⊗.

Proof For any of the three operations, we have m_{C□D}(n) ≤ m_C(n) m_D(n), and the conclusions follow. □
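The optimality construction in the proof of Proposition 4.5.2 can be verified directly for small k and m. The code below is our own illustration (helper names are ours): it checks that on n = k + m + 1 points, A (sets of at most k points) and C (sets of at least n − m points) have indices k and m and together fill out 2^X, so S(A ∪ C) = n.

```python
from itertools import combinations

def shatters(classes, pts):
    traces = {frozenset(set(C) & set(pts)) for C in classes}
    return len(traces) == 2 ** len(pts)

def vc_index(classes, X):
    n = 0
    while n < len(X) and any(shatters(classes, c)
                             for c in combinations(X, n + 1)):
        n += 1
    return n

k, m = 1, 2
n = k + m + 1
X = list(range(n))
A = [set(s) for r in range(k + 1) for s in combinations(X, r)]           # ≤ k points
C = [set(X) - set(s) for r in range(m + 1) for s in combinations(X, r)]  # ≥ n − m points
assert vc_index(A, X) == k and vc_index(C, X) == m
union = {frozenset(s) for s in A} | {frozenset(s) for s in C}
assert len(union) == 2 ** n      # A ∪ C = 2^X, so S(A ∪ C) = n = k + m + 1
```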
For the Vapnik-Červonenkis index the behavior of the □ operations is not so simple. For k, m = 0, 1, 2, …, and □ = ⊔, ⊓, or ⊗, let □(k, m) := max{S(C □ D) : S(C) = k, S(D) = m}. Here the maximum is taken where X and Y may be infinite sets. Then we have:
4.5.4 Theorem For any k = 0, 1, 2, …, m = 0, 1, 2, …, and □ = ⊔, ⊓, or ⊗, we have □(k, m) < ∞.
Proof By Theorem 4.1.2 and Proposition 4.1.5, if S(C) = k and S(D) = m, then

m_{C□D}(n) ≤ m_C(n) m_D(n) ≤ 9 n^{k+m}/(4 k! m!) < 2^n

for n ≥ n_0(k, m) large enough. □
4.5.5 Theorem For any k, m = 0, 1, 2, …, ⊓(k, m) = ⊔(k, m) = ⊗(k, m).
Proof The first equation follows from taking complements (Proposition 4.1.8). For the second, for given k and m, by Theorem 4.5.4 it is enough to consider large enough finite sets in place of X and Y, and then we can assume X = Y. We have ⊓(k, m) ≤ ⊗(k, m) by restricting to the diagonal in X × X and applying Proposition 4.3.2. In the other direction, let Π_X and Π_Y be the projections of X × Y onto X and Y respectively. Let C ⊂ 2^X and A ⊂ 2^Y. Let F := {Π_X^{-1}(C) : C ∈ C} and B := {Π_Y^{-1}(A) : A ∈ A}. By Theorem 4.2.3, S(F) = S(C) and S(B) = S(A). Since Π_X^{-1}(C) ∩ Π_Y^{-1}(A) = C × A it follows that S(F ⊓ B) ≥ S(C ⊗ A), and the theorem is proved. □

Let S(k, m) be the common value of the quantities in Theorem 4.5.5. Theorem 4.5.4 can be improved as follows. For any nonnegative integers j, k let
θ(j, k) := sup{r ∈ ℕ : (Σ_{i=0}^{j} (r choose i))(Σ_{i=0}^{k} (r choose i)) ≥ 2^r}. Then θ(j, k) < ∞ for each j, k by Proposition 4.1.5, and we have:
4.5.6 Proposition S(j, k) ≤ θ(j, k) for any j, k ∈ ℕ.
Proof Let S(C) = j and S(D) = k. Then for any n > θ(j, k), by Sauer's lemma (Theorem 4.1.2) and Proposition 4.1.5 again,

m_{C⊔D}(n) ≤ m_C(n) m_D(n) ≤ (Σ_{i=0}^{j} (n choose i))(Σ_{i=0}^{k} (n choose i)) < 2^n. □
Can the values S(k, m) be computed? It is shown in Dudley (1984, Proposition 9.2.8) that S(1, 1) = 3 (which is not a corollary of Proposition 4.5.2, apparently). Also, after Proposition 9.2.10 there, an example of L. Birgé is given showing that S(1, 2) ≥ 5. For classes satisfying stronger conditions, more can be proved:
4.5.7 Theorem Let X and Y be sets, C ⊂ 2^X and D ⊂ 2^Y. If C is linearly ordered by inclusion, then S(C ⊗ D) ≤ S(D) + 1.

Proof Let m := S(D). We can assume that m < ∞ and X ∈ C. Let F ⊂ X × Y with |F| = m + 2. Suppose C ⊗ D shatters F. The subsets of Π_X F ⊂ X induced by C are linearly ordered by inclusion. Let G be the next largest such subset after Π_X F itself, and p ∈ Π_X F \ G. Take y such that (p, y) ∈ F. Then all subsets of F containing (p, y) are induced by sets of the form Π_X F × D, D ∈ D. Thus H := F \ {(p, y)} is shattered by such sets. Now |H| = m + 1, and Π_Y must be one-to-one on H or it could not be shattered by sets of the given form. But then D shatters Π_Y H, giving a contradiction. □
4.5.8 Theorem For any set X and C, D ⊂ 2^X, if C is linearly ordered by inclusion, then S(C □ D) ≤ S(D) + 1 for □ = ⊓ or ⊔.

Proof First consider □ = ⊓. Again let m := S(D), and we can assume m < ∞. Let F ⊂ X with |F| = m + 2. Suppose C ⊓ D shatters F. Now C_F is linearly ordered by inclusion. We can assume that X ∈ C, so F ∈ C_F. Let G be the next largest element of C_F and p ∈ F \ G. Each set A ⊂ F containing p is of the form C ∩ D ∩ F, C ∈ C, D ∈ D, so we must have C ∩ F = F, and A = D ∩ F. So D shatters F \ {p}, a contradiction. Since the complements of a class linearly ordered by inclusion are again linearly ordered by inclusion, the case of unions follows by taking complements (Proposition 4.1.8). □
Then by Theorem 4.2.6 and induction we have:
4.5.9 Corollary Let C_i be classes of subsets of a set X, each linearly ordered by inclusion, and let C := {⋂_{i=1}^{n} C_i : C_i ∈ C_i, i = 1, …, n}. Then S(C) ≤ n.

Definition. For any set X and Vapnik-Červonenkis class C ⊂ 2^X, C will be called bordered if for some F ⊂ X with |F| = S(C) and some x ∈ X \ F, F is shattered by sets in C all containing x.
4.5.10 Theorem Let C_i ⊂ 2^{X(i)} be bordered Vapnik-Červonenkis classes for i = 1, 2. Then S(C_1 ⊗ C_2) ≥ S(C_1) + S(C_2).

Proof Take F_i ⊂ X(i) and x_i as in the definition of "bordered." Let H := ({x_1} × F_2) ∪ (F_1 × {x_2}). Then |H| = S(C_1) + S(C_2). For any V_i ⊂ F_i, i = 1, 2, take C_i ∈ C_i with C_i ∩ F_i = V_i and x_i ∈ C_i. Then (C_1 × C_2) ∩ H = ({x_1} × V_2) ∪ (V_1 × {x_2}), so C_1 ⊗ C_2 shatters H. □
Theorem 4.5.10 extends by induction to any number of factors. One consequence is:
4.5.11 Corollary In ℝ let J be the set of all intervals, which may be open or closed, bounded or unbounded on each side; in other words, J is the set of all convex subsets of ℝ. In ℝ^m let C be the collection of all rectangles parallel to the axes, C := {∏_{i=1}^{m} J_i : J_i ∈ J, i = 1, …, m}. Then S(C) = 2m. Let D be the set of all left half-lines (−∞, x] or (−∞, x) for x ∈ ℝ. Let T := {∏_{i=1}^{m} H_i : H_i ∈ D, i = 1, …, m}, so T is the class of lower orthants parallel to the given axes. Then S(T) = m.

Proof The class D is linearly ordered by inclusion and is bordered with S(D) = 1, and so is the class of half-lines [a, ∞) or (a, ∞), a ∈ ℝ. The class J of all intervals in ℝ is bordered with S(J) = 2. The results now follow from Corollary 4.5.9 with n = 2m and n = m, and from Theorem 4.5.10 and induction. □
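The claim S(C) = 2m of Corollary 4.5.11 can be checked numerically for m = 2: axis-parallel rectangles shatter a 4-point "diamond," while no 5 points of a small grid are shattered (the general upper bound comes from Corollary 4.5.9). In this sketch (our code, with helper name `rect_traces`), it suffices to try rectangles with corner coordinates drawn from the points themselves, since any trace is also cut out by the minimal enclosing rectangle of that trace.

```python
from itertools import combinations

def rect_traces(pts):
    """All subsets of pts cut out by closed axis-parallel rectangles."""
    xs = sorted({p[0] for p in pts})
    ys = sorted({p[1] for p in pts})
    traces = {frozenset()}
    for x0 in xs:
        for x1 in xs:
            for y0 in ys:
                for y1 in ys:
                    traces.add(frozenset(p for p in pts
                                         if x0 <= p[0] <= x1 and y0 <= p[1] <= y1))
    return traces

diamond = [(0, 1), (1, 0), (2, 1), (1, 2)]
assert len(rect_traces(diamond)) == 16        # 4 points shattered, so S(C) ≥ 4
grid = [(i, j) for i in range(3) for j in range(3)]
for pts in combinations(grid, 5):
    assert len(rect_traces(list(pts))) < 32   # no 5 grid points are shattered
```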
4.5.12 Proposition Let J be the set of all intervals in ℝ. Let Y be any set and C ⊂ 2^Y with Y ∈ C. Then in ℝ × Y, S(J ⊗ C) ≤ 2 + S(C).

Proof If S(C) = +∞ the result is clear, so suppose m := S(C) < ∞. Let F ⊂ ℝ × Y with |F| = 3 + m and suppose J ⊗ C shatters F. Let (x_i, y_i), i = 1, …, m + 3, be the points of F. Let u := min_i x_i and v := max_i x_i, and let p := (u, y_i) ∈ F and q := (v, y_j) ∈ F be points attaining them. All subsets of F which include {p, q} must be induced by sets of the form ℝ × C, C ∈ C. So Π_Y must be one-to-one on F \ {p, q}, and C shatters Π_Y(F \ {p, q}) of cardinality m + 1, a contradiction. □
4.6 Probability laws and independence

Let (X, A, P) be a probability space. Recall the pseudometric d_P(A, B) := P(A Δ B) on A, where A Δ B := (A \ B) ∪ (B \ A). Recall also that for any (pseudo)metric space (S, d) and ε > 0, D(ε, S, d) denotes the maximum number of points more than ε apart (Section 1.2).
Definition. For a measurable space (X, A) and C ⊂ A let s(C) := inf{w : there is a K = K(w, C) < ∞ such that for every law P on A and 0 < ε ≤ 1, D(ε, C, d_P) ≤ K ε^{−w}}.
This index s(C) turns out to equal the density:
4.6.1 Theorem For any measurable space (X, A) and C ⊂ A, dens(C) = s(C).
Proof Let P be a probability measure on A. Suppose A_1, …, A_m ∈ C and d_P(A_i, A_j) > ε > 0 for 1 ≤ i < j ≤ m, with m ≥ 2. Let X_1, X_2, … be i.i.d. (P), specifically coordinates on a countable product of copies of (X, A, P). Note that P may have atoms. For n = 1, 2, …,

Pr{for some i ≠ j, X_k ∉ A_i Δ A_j for all k ≤ n} ≤ m(m − 1)(1 − ε)^n/2 < 1

for n large enough, namely n > −log(m(m − 1)/2)/log(1 − ε). Let P_n := n^{−1} Σ_{i=1}^{n} δ_{X_i} (as usual) be empirical measures for P. For such n, there is positive probability that P_n(A_i Δ A_j) > 0 for all i ≠ j, and so A_i and A_j induce different subsets of {X_1, …, X_n}, and m_C(n) ≥ m. For any r > dens(C) there is an M = M(r, C) < ∞, where we can take M ≥ 2, such that m_C(n) ≤ M n^r for all n.
Note that −log(1 − ε) > ε. Thus for m ≥ 2, m ≤ M(2 log(m²))^r ε^{−r}, or m (log m)^{−r} ≤ M_1 ε^{−r} for some M_1 = M_1(r, C) = 4^r M. For any δ > 0 and C large enough, (log m)^r ≤ C m^δ for all m > 1. Thus for all m ≥ 2, m^{1−δ} ≤ M_2 ε^{−r} for 0 < ε < 1 and some M_2 = M_2(r, C, δ), so m ≤ (M_2 ε^{−r})^{1/(1−δ)}. Letting r ↓ dens(C) and δ ↓ 0 gives dens(C) ≥ s(C). In the converse direction, let |A| := card(A). Since it is not the case that
m_C(n) ≤ k n^t for all n ≥ 1, for r < t < dens(C) and each k = 1, 2, …, let A_k ⊂ X with A_k ≠ ∅ and |A_k ∩ C| > k |A_k|^t, where A ∩ C denotes the class of traces {A ∩ C : C ∈ C}. Then |A_k| → ∞ as k → ∞. Let B_0 := A_1. Other sets B_n will be defined recursively. Let B_0, …, B_{n−1} be disjoint subsets of X and let C(n) := ⋃_{0≤j<n} B_j. Choose k(n) large enough that |A_{k(n)} ∩ C| ≥ 2^{|C(n)|} 2^{2^n} |A_{k(n)}|^t, and let B_n := A_{k(n)} \ C(n). So all the B_n are defined and disjoint. Each set in B_n ∩ C is induced by at most 2^{|C(n)|} different sets in A_{k(n)} ∩ C. Thus

|B_n ∩ C| ≥ |A_{k(n)} ∩ C| / 2^{|C(n)|} ≥ 2^{2^n} |A_{k(n)}|^t ≥ 2^{2^n} |B_n|^t.

Since B_n is nonempty, it follows that |B_n ∩ C| ≥ 2^{2^n}, and hence |B_n| ≥ 2^n. Let

a_n := |B_n|^{−t/r},   S := Σ_{n=0}^{∞} |B_n|^{1−t/r} < ∞.

Let P be the probability measure on ⋃_{n≥0} B_n ⊂ X giving mass a_n/S to each point of B_n for each n. The distinct sets in B_n ∩ C are at d_P-distance at least a_n/S apart, and so are elements of C which induce the subsets of B_n. So for all n,

D(a_n/(2S), C, d_P) ≥ |B_n ∩ C| ≥ 2^{2^n} |B_n|^t = 2^{2^n} a_n^{−r} = 2^{2^n} (2S)^{−r} (a_n/(2S))^{−r}.

For ε := a_n/(2S) → 0 as n → ∞, this implies D(ε, C, d_P) ≥ 2^{2^n} (2S)^{−r} ε^{−r}.
Since 2^{2^n} → ∞ as a_n ↓ 0, this implies r ≤ s(C). Then letting r ↑ dens(C), the proof is complete. □

There is a notion of independence for sets without probability. To define it, for any set X and subset A ⊂ X let A^1 := A and A^{−1} := X \ A. Sets A_1, …, A_m from 2^X are called independent, or independent as sets, if for every function s from {1, …, m} into {−1, +1}, ⋂_{j=1}^{m} A_j^{s(j)} ≠ ∅. Such intersections, when they are nonempty, are called atoms of the Boolean algebra generated by A_1, …, A_m. Thus for A_1, …, A_m to be independent as sets means that the Boolean algebra they generate has the maximum possible number, 2^m, of atoms. If A_1, …, A_m are independent as sets, then one can define a probability law on the algebra they generate for which they are jointly independent in the usual probability sense and for which P(A_i) = 1/2, i = 1, …, m. For example, choose a point in each atom and put mass 1/2^m at each point chosen. Or, if desired, given any q_i, 0 < q_i < 1, one can define a probability measure Q for which the A_i are jointly independent and have Q(A_i) = q_i, i = 1, …, m. For a set X and C ⊂ 2^X let I(C) be the supremum of m such that A_1, …, A_m are independent as sets for some A_i ∈ C, i = 1, …, m.
4.6.2 Theorem For any set X, C ⊂ 2^X, and n = 1, 2, …, if S(C) ≥ 2^n then I(C) ≥ n. Conversely, if I(C) ≥ 2^n then S(C) ≥ n. So I(C) < ∞ if and only if S(C) < ∞. In both cases, 2^n cannot be replaced by 2^n − 1.
Proof Clearly, if a set Y has n or more independent subsets, then |Y| ≥ 2^n. Conversely, if |Y| = 2^n, we can assume that Y is the set of all strings of n digits each equal to 0 or 1. Let A_j be the set of strings for which the jth digit is 1. Then the A_j are independent. It follows that if S(C) ≥ 2^n, then I(C) ≥ n, while if |Y| = 2^n − 1 and C = 2^Y, then S(C) = 2^n − 1 while I(C) < n, as stated.
Conversely, if B_j ∈ C are independent as sets for j = 1, …, 2^n, let A(i) := A_i be independent subsets of {1, …, 2^n} for i = 1, …, n. Choose

x_i ∈ ⋂_{j∈A(i)} B_j ∩ ⋂_{j∉A(i)} (X \ B_j).

Then x_i ∈ B_j if and only if j ∈ A(i). For each set S ⊂ {1, …, n},

⋂_{i∈S} A_i ∩ ⋂_{i∉S} ({1, …, 2^n} \ A_i) ≠ ∅,

so it contains some j := j_S ∈ {1, …, 2^n}. Then j ∈ A_i if and only if i ∈ S, and B_j ∩ {x_1, …, x_n} = {x_i : i ∈ S}.
So C shatters {x_1, …, x_n} and S(C) ≥ n, as stated. If C consists of 2^n − 1 (independent) sets, then clearly S(C) < n. □
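The digit construction in the proof of Theorem 4.6.2 is easy to carry out explicitly. In this sketch (our own code), Y is coded as the integers 0, …, 2^n − 1 and A_j collects those whose j-th binary digit is 1; independence as sets means every Boolean atom is nonempty, and here each atom is exactly one bit string:

```python
from itertools import product

n = 3
Y = set(range(2 ** n))                  # all strings of n binary digits
A = [{y for y in Y if (y >> j) & 1} for j in range(n)]

for signs in product([True, False], repeat=n):
    atom = set(Y)
    for j, keep in enumerate(signs):
        atom &= A[j] if keep else Y - A[j]
    assert len(atom) == 1               # every one of the 2^n atoms is a singleton
```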
For any set X, C ⊂ 2^X and Y ⊂ X, recall that C_Y := Y ∩ C := {Y ∩ C : C ∈ C}. Let At(C|Y) be the set of atoms of the algebra of subsets of Y generated by C_Y, where in the cases to be considered, C_Y will be finite because C or Y is. Let Δ_C(Y) := |At(C|Y)| be the number of such atoms. Let m^C(n) := sup{Δ_A(X) : A ⊂ C, |A| ≤ n} ≤ 2^n. Let dens*(C) denote the infimum of all s > 0 such that for some C < ∞, m^C(n) ≤ C n^s for all n. For any x ∈ X let C_x := {A ∈ C : x ∈ A}, and let C^X := {C_x : x ∈ X}.
4.6.3 Theorem For any set X and A ⊂ C ⊂ 2^X, with A finite,

(a) Δ_A(X) = Δ_{C^X}(A);
(b) for n = 1, 2, …, m^C(n) = m_{C^X}(n);
(c) S(C^X) = I(C);
(d) dens*(C) = dens(C^X) ≤ I(C).

Proof For any B ⊂ A let

a(B) := ⋂_{B∈B} B ∩ ⋂_{A∈A\B} (X \ A).

Then a(B) is an atom, in At(A|X), if and only if it is nonempty. Now y ∈ a(B) if and only if A ∩ C_y = B, so (a) follows. Then taking the maximum over A with |A| = n on both sides of (a) gives (b), which then implies (c) and (d). The last inequality follows from Corollary 4.1.7. □
4.7 Vapnik-Červonenkis properties of classes of functions

The notion of VC class of sets has several extensions to classes of functions.

Definitions. Let X be a set and F a class of real-valued functions on X. Let C ⊂ 2^X. If f is any real-valued function, each set {f > t} for t ∈ ℝ will be called a major set of f. The class F will be called a major class for C if all the major sets of each f ∈ F are in C. If C is a Vapnik-Červonenkis class, then F will be called a VC major class (for C).

The subgraph of a real-valued function f is the set {(x, t) ∈ X × ℝ : 0 < t < f(x) or f(x) < t < 0}. If D is a class of subsets of X × ℝ, and for each f ∈ F the subgraph of f is in D, then F will be called a subgraph class for D. If D is a VC class in X × ℝ then F will be called a VC subgraph class.
The symmetric convex hull of F is the set of all functions Σ_{i=1}^{m} c_i f_i for f_i ∈ F, c_i ∈ ℝ, any finite m, and Σ_{i=1}^{m} |c_i| ≤ 1. If 0 < M < ∞, let H(F, M) denote M times the symmetric convex hull of F. Let H_S(F, M) be the smallest class G of functions including H(F, M) such that whenever g_n ∈ G for all n and g_n(x) → g(x) as n → ∞ for all x, we have g ∈ G. A class F of functions such that F ⊂ H_S(G, M) for some M < ∞ and a given G will be called a VC subgraph hull class if G is a VC subgraph class, and a VC hull class if G = {1_C : C ∈ C} where C is a VC class of sets. So there are at least four possible ways to extend the notion of VC class to classes of functions. Some implications hold between these different conditions, but no two of them are equivalent. The next theorem deals with some of the easier cases of implication or non-implication.
4.7.1 Theorem Let F be a uniformly bounded class of nonnegative real-valued functions on a set X. Then

(a) if F is the set of indicators of members of a VC class of sets, then F is also a VC major class, a VC subgraph class, and a VC hull class;
(b) if F is a VC major class then it is a VC hull class;
(c) there exist VC hull classes F which are not VC major;
(d) there exist VC subgraph classes F which are not VC major.

Proof (a) The indicators of a VC class C of sets clearly form a VC major class (for C ∪ {∅, X}) and a VC hull class. Also, the sets (C × [0, 1]) ∪ (X × {0}) for C ∈ C form a VC class, for example by Theorem 4.5.4. So (a) holds.

(b) Since F is uniformly bounded and the definition of VC hull class allows
multiplication by a constant, we can assume 0 ≤ f ≤ 1 for all f ∈ F. For such an f let

f_n := (1/n) Σ_{j=1}^{n−1} 1_{f > j/n} = Σ_{j=0}^{n−1} (j/n) 1_{j/n < f ≤ (j+1)/n}.
Then f_n(x) → f(x) as n → ∞ for all x. (In fact, f(x) − 1/n ≤ f_n(x) ≤ f(x) for all x, so f_n → f uniformly.) So F is a VC hull class.

(c) Let X = ℝ². Let C be the set of open lower left quadrants {(x, y) : x < a, y < b} for all a, b ∈ ℝ. Then S(C) = 2 by Corollary 4.5.11. Let F be the set of all sums Σ_{k=1}^{∞} 1_{C(k)}/2^k where C(k) ∈ C. Then clearly F is a VC hull class. The sets where f > 0 for f in F are exactly the countable unions of sets
in C. But such unions are not a VC class; for example, they shatter any finite subset of the line x + y = 1. So F is not a VC major class and (c) is proved.

(d) Let f_n := n^{−1} + n^{−2} 1_{B_n} for n = 1, 2, …, for any measurable sets B_n. Then f_n ↓ 0, so the subgraphs of the functions f_n are linearly ordered by inclusion and form a VC class of index 1 (Theorem 4.2.6(a)). Now {f_n > 1/n} = B_n for each n, and the sequence {B_n} need not form a VC class; for example, let B_n be a sequence of independent sets (see Section 4.6). Then {f_n} is not a VC major class, proving (d). □
4.8 Classes of functions and dual density

For a metric space (S, d) and ε > 0 recall D(ε, S, d), the maximum number of points more than ε apart. For a probability measure Q and 1 ≤ p < ∞ we have the L^p metric d_{p,Q}(f, g) := (∫ |f − g|^p dQ)^{1/p}. For a class F ⊂ L^p(Q) let D^{(p)}(ε, F, Q) := D(ε, F, d_{p,Q}). Let D^{(p)}(ε, F) be the supremum of D^{(p)}(ε, F, Q) over all laws Q concentrated in finite sets. If F is a class of measurable real-valued functions on a measurable space (X, A), let F_F(x) := sup_{f∈F} |f(x)|. Then a measurable function F will be called an envelope function for F if and only if F_F ≤ F. If F_F is measurable it will be called the envelope function of F. For any law P on (X, A), F_F* is an envelope function for F, which in general depends on P.

Given F, an envelope function F for it, ε > 0, and 1 ≤ p < ∞, let D_F^{(p)}(ε, F, Q) be the supremum of m such that there exist f_1, …, f_m ∈ F for which ∫ |f_i − f_j|^p dQ > ε^p ∫ F^p dQ for all i ≠ j.
The next fact extends part of Theorem 4.6.1 to families of functions. The proof is also similar.
4.8.1 Theorem Let 1 ≤ p < ∞. Let (X, A, Q) be a probability space and F a VC subgraph class of measurable real-valued functions on X. Let F have an envelope F ∈ L^p(X, A, Q) with 0 < ∫ F dQ and F ≥ 1. Let C be the collection of subgraphs in X × ℝ of functions in F. Then for any W > S(C) there is an A < ∞ depending only on W and S(C) such that

(4.8.2)   D_F^{(p)}(ε, F, Q) ≤ A (2^{p−1}/ε^p)^W for 0 < ε ≤ 1.
Proof Given 0 < ε ≤ 1, take a maximal m and f_1, …, f_m as in the definition of D_F^{(p)}. First suppose p = 1. For any measurable set B let (FQ)(B) := ∫_B F dQ. Let Q_F := FQ/QF where QF := ∫ F dQ, so Q_F is a probability measure. Let k = k(m, ε) be the smallest integer such that e^{kε/2} > (m choose 2). Then k ≤ 1 + (4 log m)/ε. Let X_1, …, X_k be i.i.d. (Q_F). Given X_i, let Y_i be uniformly distributed on the interval [−F(X_i), F(X_i)], and such that the vectors
(X_i, Y_i) ∈ X × ℝ are independent for i = 1, …, k. Let C_j be the subgraph of f_j, j = 1, …, m. Then for all i and j ≠ s,

Pr((X_i, Y_i) ∈ C_j Δ C_s) = ∫ [|f_j(x) − f_s(x)|/(2F(x))] dQ_F(x) = ∫ |f_j − f_s| dQ / (2 ∫ F dQ) > ε/2.

Let A_{jsk} be the event that (X_i, Y_i) ∉ C_j Δ C_s for all i = 1, …, k. Then by independence, for each j ≠ s,

Pr(A_{jsk}) ≤ (1 − ε/2)^k < e^{−kε/2},

and

Pr{A_{jsk} occurs for some j ≠ s} ≤ (m choose 2) e^{−kε/2} < 1.

So with positive probability, for all j ≠ s there is some (X_i, Y_i) ∈ C_j Δ C_s. Fix X_i and Y_i, i = 1, …, k, for which this happens. Then the sets C_j ∩ {(X_i, Y_i)}_{i=1}^{k} are distinct for all j, so m_C(k) ≥ m. Let S := S(C). By the Sauer and Vapnik-Červonenkis lemmas (4.1.2 and 4.1.5), m_C(k) ≤ 1.5 k^S/S! for k ≥ S + 2. It follows that for some constant C depending only on S, m_C(k) ≤ C k^S for all k ≥ 1, where C ≤ 2^{S+1} − 1. We can assume that C ≥ 1. So

m ≤ C k^S ≤ C((1 + 4 log m)/ε)^S.

For any α > 0 there is an m_0 such that 1 + 4 log m ≤ m^α for m ≥ m_0, and then m^{1−αS} ≤ C/ε^S. Choosing α small enough so that αS < 1, we have m ≤ C_1/ε^{S/(1−αS)} for m ≥ m_0 and a constant C_1. For any W > S, we can solve W = S/(1 − αS) by α = (W − S)/(WS). Then m ≤ A ε^{−W} for A := max(m_0, C_1), and then m_0 and A are functions of W and S, finishing the proof for p = 1.
Now suppose p > 1. Let Q_{F,p} := F^{p−1} Q / Q(F^{p−1}). Then for i ≠ j,

ε^p ∫ F^p dQ < ∫ |f_i − f_j|^p dQ ≤ ∫ |f_i − f_j| (2F)^{p−1} dQ = Q((2F)^{p−1}) ∫ |f_i − f_j| dQ_{2F,p},

and Q_{2F,p} = Q_{F,p}. Thus by the case p = 1,

D_F^{(p)}(ε, F, Q) ≤ D_F^{(1)}(δ, F, Q_{F,p}) ≤ A δ^{−W},
where δ := ε^p Q(F^p)/[Q(F) Q((2F)^{p−1})]. Now by Hölder's inequality,

Q(F) = Q(F·1) ≤ (Q(F^p))^{1/p} and Q(F^{p−1}) = Q(F^{p−1}·1) ≤ (Q(F^p))^{(p−1)/p}.

So δ ≥ ε^p/2^{p−1} and the conclusion follows. □

The following fact is a continuation of Theorem 4.7.1.
4.8.3 Theorem Let F be a uniformly bounded class of functions on a set X.

(a) If F is a VC subgraph class, then

(4.8.4)   for some r < ∞ and M < ∞, D^{(2)}(ε, F) ≤ M ε^{−r} for 0 < ε ≤ 1;

(b) there exist classes F satisfying (4.8.4) which are not VC hull;
(c) there exist VC subgraph classes which are not VC hull.
Proof (a) There is a finite constant envelope function K for F, so for any f, g ∈ F and any law γ, ∫(f − g)² dγ ≤ 2K ∫|f − g| dγ. Thus (a) follows from Theorem 4.8.1 with r = 2W.

(b) It will be shown that there exist sequences F = {b_n 1_{A_n}}_{n≥1}, where we can take b_n = 1/n^v for any positive integer v, such that F is not VC hull. Clearly such a sequence will satisfy (4.8.4). For this, two lemmas will be helpful:
4.8.5 Lemma Let A_1,…,A_n be jointly independent events in a probability space (X, P) with P(A_i) = 1/2, i = 1,…,n. Let B_1,…,B_n be any events. Let D := ∪_{1≤j≤n} A_j Δ B_j. Then the algebra B generated by B_1,…,B_n has at least 2^n(1 − P(D)) nonempty atoms.
Proof For any F ⊂ {1,…,n} let

A_F := ∩_{j∈F} A_j ∩ ∩_{j∉F, j≤n} (X \ A_j),

and define B_F likewise. The atoms of B are those B_F which are not empty. For each F, P(A_F) = 1/2^n. If a point of A_F is not in D then it is also in B_F, which then is nonempty. Since D can include at most 2^n P(D) of the 2^n events A_F, the lemma follows.
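Lemma 4.8.5 can be checked by brute force on a small finite space. In the sketch below (Python; the construction is ours, not from the text) the space is {0,…,2^n − 1} with uniform law, A_j is the event "bit j is set," each B_j perturbs A_j on a single point, and the nonempty atoms of the algebra generated by the B_j are the distinct membership patterns:

```python
n = 4
space = range(2 ** n)
# A_j = "bit j of x is set": jointly independent, each of probability 1/2
A = [{x for x in space if (x >> j) & 1} for j in range(n)]
# perturb each A_j on the single point j to get B_j
B = [Aj ^ {j} for j, Aj in enumerate(A)]
D = set().union(*(A[j] ^ B[j] for j in range(n)))  # union of A_j Δ B_j
P_D = len(D) / 2 ** n
# nonempty atoms of the algebra generated by B_0,...,B_{n-1}
atoms = {tuple(x in Bj for Bj in B) for x in space}
print(len(atoms), P_D, len(atoms) >= 2 ** n * (1 - P_D))
```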
4.8.6 Lemma Suppose A_j are independent events with P(A_j) = 1/2 for all j, and C is a class of events such that for some K < ∞ and u < ∞, for each j there is an event D_j such that P(A_j Δ D_j) ≤ η_j, where D_j is in an algebra generated by at most K j^u elements of C and Σ_j η_j < 1. Then C is not a VC class.
Proof Let a := 1 − Σ_{j≥1} η_j > 0. By Lemma 4.8.5, for each m = 1, 2,…, the algebra D_m generated by D_1,…,D_m has at least 2^m a atoms. On the other hand, D_m is generated by at most Σ_{j=1}^m K j^u ≤ K ∫_1^{m+1} x^u dx ≤ K(m + 1)^{u+1}/(u + 1) sets in C. If C were a VC class, then by Theorems 4.6.2 and 4.6.3(c), and by Theorem 4.6.3(b) and Corollary 4.1.7, there would be t < ∞ and C < ∞ such that the number of atoms of the algebra generated by k elements of C is at most C k^t. Then 2^m a would be bounded above by a polynomial in m of degree at most (u + 1)t, a contradiction, so Lemma 4.8.6 is proved.
Now to prove Theorem 4.8.3(b), let X := [0, 1], let P := U[0, 1] be the uniform (Lebesgue) law, and let A_m be independent sets with P(A_m) = 1/2. Let v be a positive integer and F := {1_{A_m}/m^v}_{m≥1}. Then (4.8.4) holds for F.

Suppose F is a VC hull class for some C. Recall that P(f) := ∫ f dP. Suppose P(A) = 1/2 and a + b 1_A ∈ H(C, M) for some a ≥ 0, b > 0, and 0 < M < ∞. We can take M = 1, replacing a by a/M and b by b/M. Let r := S(C) < ∞. Recall that d_P(C, D) := P(C Δ D) for measurable sets C, D. Then for any w > r, specifically for w := r + 1, there is some C < ∞ with D(ε, C, d_P) ≤ C ε^{−w} for 0 < ε ≤ 1, by Corollary 4.1.7 and Theorem 4.6.1. Given 0 < β < 1, take C_j ∈ C and t_j with Σ_j |t_j| ≤ 1 such that P(|a + b 1_A − Σ_j t_j 1_{C_j}|) < β. Choose D_i ∈ C, i = 1,…,m, where m ≤ C β^{−r−1}, such that for each j there is an i := i(j) with P(C_j Δ D_{i(j)}) < β. Let f := Σ_j t_j 1_{D_{i(j)}}. Then P(|a + b 1_A − f|) < 2β. Let B := {f > a + b/2}. Then B is in the algebra generated by D_1,…,D_m, and P(A Δ B) ≤ 4β/b.

Apply what has just been shown to A = A_k for k = 1, 2,…, with a = a_k = 0, b = b_k = 1/k^v, and β = β_k = 1/(8k^{2+v}). Then there are sets B = B_k and m = m_k ≤ N k^u for some N < ∞, where u := (r + 1)(2 + v) < ∞. To apply Lemma 4.8.6, let η_k = 1/(2k²), to get a contradiction. So (b) is proved.

For (c), let a = a_k = 1/k and b = b_k = 1/k², to get a VC subgraph class as in the proof of Theorem 4.7.1(d). The same argument as for (b), now with v = 2, again gives a contradiction, so (c) holds.

It will be shown in Proposition 10.1.2 that there are VC major (thus VC hull) classes which do not satisfy (4.8.4) and so are not VC subgraph classes.
**4.9 Further facts about VC classes

This section states facts without proofs.
4.9.1 Projections of semialgebraic sets

Let's consider whether the VC property is preserved by projection. Let C be a collection of subsets of a product space X × Y, say R². Let π_X(x, y) := x for each x ∈ X and y ∈ Y. For a set A ∈ C let π_X[A] := {π_X(u) : u ∈ A}. Let π_X[[C]] := {π_X[A] : A ∈ C}. It is possible for C to be a VC class, with S(C) = 1, while π_X[[C]] is not a VC class. In fact, if the sets in C are disjoint, then S(C) = 1 by Theorem 4.2.6. Let D be an arbitrary collection of no more than c (the cardinality of the continuum) subsets of R, say D = {D_y : y ∈ R}. For each y let A_y := {(x, y) : x ∈ D_y}. Then the sets A_y are all disjoint, with π_X[A_y] = D_y for each y, but the sets D_y need not form a VC class, even for a countably infinite sequence of sets D_y. So it is interesting that for certain sets of an algebraic nature, projection does preserve the VC property. A semialgebraic set in R^d is a set in the Boolean algebra generated by all sets pos(f) := {x : f(x) > 0}, where f is a polynomial in d variables. Stengle and Yukich (1989) proved the following:
Theorem Let P(·, ·, ·, ·) be a fixed real polynomial on R^{m+d+p+q}, P = P(x, y, t, u), where x ∈ R^m, y ∈ R^d, t ∈ R^p, and u ∈ R^q. Let T ⊂ R^p and U ⊂ R^q be fixed semialgebraic sets. Then the family of all subsets of R^m of the form W_y := {x ∈ R^m : sup_{t∈T} inf_{u∈U} P(x, y, t, u) > 0}, as y varies over R^d, is a Vapnik-Cervonenkis class.
For the proof, see Stengle and Yukich (1989). They and Laskowski (1992) also give other ways of generating Vapnik-Cervonenkis classes.
4.9.2 The bound of Haussler (1995)

For a class C of sets, and positive integers k and n, let M(k, n, C) be the largest m such that for some set F with |F| = n and some A_1,…,A_m in {A ∩ F : A ∈ C}, each symmetric difference A_i Δ A_j := (A_i \ A_j) ∪ (A_j \ A_i) for i ≠ j contains at least k points.

Theorem (Haussler, 1995) For any VC class C with S(C) = d, and integers 1 ≤ k ≤ n, we have

M(k, n, C) ≤ e(d + 1)(2e(n + 1)/(k + 2d + 2))^d ≤ e(d + 1)(2en/k)^d.
When the proof of Theorem 4.6.1 is applied to the given situation, it gives an additional logarithmic factor which Haussler's theorem shows to be unnecessary. Haussler's proof is partly based on a result of Alon and Tarsi (1992).
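The inequality chain in Haussler's bound is easy to sanity-check numerically; a sketch (Python; it checks only the arithmetic relation between the two displayed bounds, with function names of our choosing):

```python
from math import e

def haussler_bound(k, n, d):
    # e(d+1) * (2e(n+1)/(k+2d+2))^d, the sharper displayed bound
    return e * (d + 1) * (2 * e * (n + 1) / (k + 2 * d + 2)) ** d

def weaker_bound(k, n, d):
    # e(d+1) * (2en/k)^d, the weaker displayed bound
    return e * (d + 1) * (2 * e * n / k) ** d

ok = all(
    haussler_bound(k, n, d) <= weaker_bound(k, n, d) + 1e-9
    for d in range(1, 6)
    for n in range(1, 40)
    for k in range(1, n + 1)
)
print(ok)
```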
Problems

1. Let C be the class of all unions of two intervals in R. Evaluate S(C). Hint: Try it first directly; if you like, look at the more general Problem 11.
2. If S(C) = 3, find the upper bounds for m^C(n) given by Theorem 4.1.2 and by Proposition 4.1.5.

3. Show that for dens(C) = 0, S(C), which is finite by Corollary 4.1.7, can be arbitrarily large. Hint: Let C be finite.

4. Find the smallest n such that there is a set X with |X| = n and C ⊂ 2^X with S(C) = 1 where neither (a) nor (b) in Theorem 4.2.6 holds.

Hints for Problems 5-7: If C is a collection of convex sets in R^d and shatters a set F, then no point in F is in the convex hull of the other points. Thus the convex hull of F is a polyhedron of which each point of F is a vertex; in the plane, it's a polygon. To get a lower bound S(C) ≥ k, it's enough to find one set of k elements that is shattered; try the vertices of a regular k-gon. To get upper bounds, use facts such as Theorem 4.2.1 and Proposition 4.5.6.

5. Let C be the set of all interiors of ellipses in R², with arbitrary centers and semiaxes in any two perpendicular directions. Give upper and lower bounds for S(C).
6. A half-plane in R² is a set of the form {(x, y) : ax + by ≥ c} for real a, b, c with a and b not both 0. Define a wedge as an intersection of two half-planes. Let C be the collection of all wedges in R². Show that S(C) ≥ 5. Also find an upper bound for S(C).

7. Let C be the set of all interiors of triangles in R². Show that S(C) ≥ 7. Also give an upper bound for S(C).
8. Show that the lower bounds for S(C) in Problems 6 and 7 are the values of S(C). Hint: For a convex polygon, the set F of vertices can be arranged in cyclic order, say clockwise around the boundary of the polygon, v_1, v_2,…,v_n, v_1. Show that if a half-plane J contains v_i and v_j with i < j then it includes either {v_i, v_{i+1},…,v_j} or {v_j, v_{j+1},…,v_n, v_1,…,v_i}. Thus find what kind of subset of F the intersection of J with F must be. From that, find what occurs if two or three half-planes are intersected (or unioned, via complements).
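Problem 1, and the more general Problem 11, can be explored by brute force. The sketch below (Python) relies on an observation of ours, not stated in the text: the trace of a union of at most j intervals on a finite subset of R is exactly a subset with at most j maximal blocks of consecutive points.

```python
from itertools import combinations
from math import comb

def runs(subset):
    # number of maximal blocks of consecutive integers in subset
    return sum(1 for i in subset if i - 1 not in subset)

def trace_count(n, j):
    # subsets of F = {0,...,n-1} realizable as (union of <= j intervals) ∩ F
    return sum(
        1
        for size in range(n + 1)
        for sub in combinations(range(n), size)
        if runs(set(sub)) <= j
    )

# Problem 11: the count equals sum_{i <= 2j} C(n, i)
identity_ok = all(
    trace_count(n, j) == sum(comb(n, i) for i in range(2 * j + 1))
    for n in range(1, 9)
    for j in range(1, 4)
)

def shattered(n, j):
    # F is shattered iff every one of its 2^n subsets is a trace
    return trace_count(n, j) == 2 ** n

# Problem 1 (j = 2): four points are shattered, five are not, so S(C) = 4
print(identity_ok, shattered(4, 2), shattered(5, 2))
```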
9. In the example at the end of Section 4.3, for each set A ⊂ X with three elements, find a specific subset of A not in {A ∩ C : C ∈ C}.

10. Let F be the class of all probability distribution functions on R. Show that F is a VC major class but not a VC subgraph class. Hint: Show that the subgraphs of functions in F shatter all sets {(x_j, y_j) : 1 ≤ j ≤ n} with x_1 < ⋯ < x_n and y_1 < ⋯ < y_n.

11. Let C(j) be the class of all unions of at most j intervals in R. Show that for every finite F ⊂ R with |F| = n we have Δ^{C(j)}(F) = Σ_{i≤2j} (n choose i), the largest possible value by Sauer's lemma. Hints: One can take F = {1, 2,…,n}. For A ⊂ F and x, y ∈ F let x ≡_A y mean that A ∩ {x, y} = ∅ or {x, y}; otherwise x ≢_A y. If A ≠ ∅ let j_1 := j_1(A) be the least element of A. If j_1(A),…,j_k(A) are defined, let j_{k+1}(A) be the least j > j_k(A) such that j ≢_A j_k(A) and j ≤ n, if there is such a j. Show that there is a 1-1 correspondence between subsets A ⊂ F and finite increasing sequences r(A) := (j_1(A), j_2(A),…), where r(∅) is the empty sequence. Show that A ∈ {C ∩ F : C ∈ C(j)} if and only if r(A) has at most 2j terms.

Notes
Notes to Section 4.1. The definitions of Δ^C, m^C, and (in effect) V(C) appeared in the announcement by Vapnik and Cervonenkis (1968). In their 1971 paper they had a weaker form of Theorem 4.1.2.
Notes to Section 4.4. This section is also based on Dudley (1985). Theorem 4.4.6 and other characterizations of trees are given in Harary (1969, pp. 32-33).
Notes to Section 4.5. Some of the facts in this section are from Dudley (1984,
Section 9.2). Theorems 4.5.1 and 4.5.3 on the density are due to Assouad (1983). Assouad also told me Proposition 4.5.6 in a letter. Theorem 4.5.8 and Corollary 4.5.11 are from Wenocur and Dudley (1981).
Notes to Section 4.6. Theorem 4.6.1 is partly given in Dudley (1978, Section 7) and was stated in the current form by Assouad (1981, 1983). Haussler (1995) proved a sharper form. Assouad (1983) also proved Theorem 4.6.2, first defined dens*(.), and proved Theorem 4.6.3.
Notes to Section 4.7. VC subgraph classes have been called "VC graph" classes (Alexander 1984, 1987) or "polynomial classes" (Pollard 1984, pp. 17, 34; 1985). This section is based on part of Dudley (1987).

Notes to Section 4.8. Theorem 4.8.1 is essentially Lemma 25, p. 27, of Pollard (1984) and had appeared earlier in Pollard (1982). Theorem 4.8.3, parts (b) and (c), are based on Section 3 of Dudley (1987).
References

Alexander, Kenneth S. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm. Ann. Probab. 12, 1041-1067.
Alexander, K. S. (1987). The central limit theorem for empirical processes on Vapnik-Cervonenkis classes. Ann. Probab. 15, 178-203.
Alon, N., and Tarsi, M. (1992). Colorings and orientations of graphs. Combinatorica 12, 125-134.
Assouad, Patrice (1981). Sur les classes de Vapnik-Cervonenkis. C. R. Acad. Sci. Paris Sér. I 292, 921-924.
Assouad, P. (1983). Densité et dimension. Ann. Inst. Fourier (Grenoble) 33, no. 3, 233-282.
Danzer, L., Grünbaum, B., and Klee, V. L. (1963). Helly's theorem and its relatives. Proc. Symp. Pure Math. (Amer. Math. Soc.) 7, 101-180.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899-929; Correction 7 (1979), 909-911.
Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de St.-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Dudley, R. M. (1985). The structure of some Vapnik-Cervonenkis classes. In Proc. Berkeley Conf. in Honor of J. Neyman and J. Kiefer 2, 495-508. Wadsworth, Belmont, CA.
Dudley, R. M. (1987). Universal Donsker classes and metric entropy. Ann. Probab. 15, 1306-1326.
Harary, F. (1969). Graph Theory. Addison-Wesley, Reading, MA.
Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. J. Combin. Theory (A) 69, 217-232.
Laskowski, M. C. (1992). Vapnik-Chervonenkis classes of definable sets. J. London Math. Soc. (Ser. 2) 45, 377-384.
Pollard, David (1982). A central limit theorem for k-means clustering. Ann. Probab. 10, 919-926.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York.
Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295-314.
Radon, J. (1921). Mengen konvexer Körper, die einen gemeinsamen Punkt enthalten. Math. Ann. 83, 113-115.
Sauer, N. (1972). On the density of families of sets. J. Combin. Theory (A) 13, 145-147.
Shelah, S. (1972). A combinatorial problem: stability and order for models and theories in infinitary languages. Pacific J. Math. 41, 247-261.
Stengle, G., and Yukich, J. E. (1989). Some new Vapnik-Chervonenkis classes. Ann. Statist. 17, 1441-1446.
Vapnik, V. N., and Cervonenkis, A. Ya. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR 181, 781-783 (Russian) = Sov. Math. Doklady 9, 915-918 (English).
Vapnik, V. N., and Cervonenkis, A. Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theor. Probability Appls. 16, 264-280 = Teor. Verojatnost. i Primenen. 16, 264-279.
Vapnik, V. N., and Cervonenkis, A. Ya. (1974). Teoriya Raspoznavaniya Obrazov: Statisticheskie problemy obucheniya [Theory of Pattern Recognition: Statistical Problems of Learning; in Russian]. Nauka, Moscow. German ed.: Theorie der Zeichenerkennung, by W. N. Wapnik and A. J. Tscherwonenkis, transl. by K. G. Stöckel and B. Schneider, ed. S. Unger and K. Fritzsch. Akademie-Verlag, Berlin, 1979 (Elektronisches Rechnen und Regeln, Sonderband).
Wenocur, R. S., and Dudley, R. M. (1981). Some special Vapnik-Cervonenkis classes. Discrete Math. 33, 313-318.
5 Measurability
Let A be the set of all possible empirical distribution functions F_1 for one observation x ∈ [0, 1], namely F_1(t) = 0 for t < x and F_1(t) = 1 for t ≥ x. We noted previously that A in the supremum norm is nonseparable: it is an uncountable set in which any two points are at distance 1 apart. Thus A and all its subsets are closed. If x := X_1 has a continuous distribution, such as the uniform distribution U[0, 1] on [0, 1], then x ↦ (t ↦ 1_{t≥x}) takes [0, 1] onto A, but it is not continuous for the supremum norm. Also, it is not measurable for the Borel σ-algebra on the range space. So, in Chapter 3, functions f* and upper expectations E* were used to get around measurability problems.

Here is a different kind of example. It is related to the basic "ordinal triangle" counterexample in integration theory, showing why measurability is needed in the Tonelli-Fubini theorem on Cartesian product integrals. Let (Ω, <) be an uncountable well-ordered set such that for each x ∈ Ω, the initial segment I_x := {y : y < x} is countable. (In terms of ordinals, Ω is, or is order-isomorphic to, the least uncountable ordinal.) Let S be the σ-algebra of subsets of Ω consisting of sets that are countable or have countable complement. Let P be the probability measure on S which is 0 on countable sets and 1 on sets with countable complement. Then

∫∫ 1_{y<x} dP(y) dP(x) = 0 < ∫∫ 1_{y<x} dP(x) dP(y) = 1.

Since all other conditions of the Tonelli-Fubini theorem hold, the function (x, y) ↦ 1_{y<x} must not be measurable for the product σ-algebra, even if S is replaced by any larger σ-algebra of subsets of Ω to which P can be extended. For example, under the continuum hypothesis, we could take Ω to be [0, 1] (where the well-ordering is unrelated to the usual ordering), and P to be Lebesgue measure or any other nonatomic law on [0, 1].
Now consider the class C of sets I_x for x ∈ Ω. Each of these sets is countable by assumption. The sets are linearly ordered by inclusion, since < is a linear ordering. Thus S(C) = 1 by Theorem 4.2.6. But C is not a weak or strong Glivenko-Cantelli class as defined in Section 3.3 (still less a Donsker class), since for any possible values of X_1,…,X_n, a maximum x := max(X_1,…,X_n) for the well-ordering exists, and then for any z > x, P_n(I_z) = 1 while P(I_z) = 0, so

sup_{A∈C} |(P_n − P)(A)| = 1.

In Sections 5.2 and 5.3, some measurability conditions will be developed which hold for classes of sets encountered in practice. It will be shown in the next chapter that these conditions, together with the Vapnik-Cervonenkis or related properties, are enough to imply Glivenko-Cantelli and Donsker properties.
*5.1 Sufficiency

Sufficiency is a concept from mathematical statistics. Suppose that a probability measure P is known to be in a certain family P of laws and we have observed X_1,…,X_n i.i.d. (P), but nothing else is known about P. A statistic T, which is a measurable function of X_1,…,X_n, will, roughly speaking, be said to be sufficient for P if, given T, no further information about X_1,…,X_n is useful in making decisions or inferences about P ∈ P. A precise definition is given below. This section will show that the empirical measure P_n is sufficient even when P is the family of all probability measures on a measurable space. Note that P_n is a symmetric function of the X_i, in the sense that it is preserved by any permutation of the indices 1,…,n. Once P_n is given, knowing that the X_i were observed in a certain order will not help in making inferences about P.

Here is the formal definition of sufficiency: let (Y, S) be a measurable space (a set Y and a σ-algebra S of subsets of Y). Let Q be a set of probability laws on (Y, S). A sub-σ-algebra B of S is called sufficient for Q if for every C ∈ S there is some B-measurable function g_C such that for every Q ∈ Q, the conditional probability

(5.1.1) Q(C | B) = g_C almost surely for Q.

The essential point is that g_C does not depend on Q in Q. Most often, there will be some n ≥ 1, a measurable space (X, A), and a family P of laws on (X, A) such that Y is the n-fold Cartesian product X^n with the product σ-algebra S = A^n and Q = P^n := {P^n : P ∈ P}, where P^n is the n-fold Cartesian product P × P × ⋯ × P (RAP, Theorem 4.4.6).
The meaning of sufficiency is clarified by the factorization theorem, to be proved next. A family P of probability measures on a measurable space (S, B) is said to be dominated by a measure µ if every P ∈ P is absolutely continuous with respect to µ. Then we have the densities (Radon-Nikodym derivatives) dP/dµ (RAP, Section 5.5). If there is a nonatomic law on (X, A), the family P of all laws on (X, A) is not dominated. Factorization is still useful in that case, in the proof of Theorem 5.1.7.

5.1.2 Theorem (Factorization theorem) Let (S, B) be a measurable space, A a sub-σ-algebra of B, and P a family of probability measures on B, dominated by a σ-finite measure µ. Then A is sufficient for P if and only if there is a B-measurable function h ≥ 0 such that for all P ∈ P there is an A-measurable function f_P with dP/dµ = f_P h almost everywhere for µ. We can take h ∈ L¹(S, B, µ).
Proof Two measures are called equivalent if each is absolutely continuous with respect to the other. Two families P and Q of probability measures on a σ-algebra B are called equivalent if and only if for all B ∈ B, (P(B) = 0 for all P ∈ P) is equivalent to (Q(B) = 0 for all Q ∈ Q). Two lemmas and another theorem will be useful in the proof.

5.1.3 Lemma If a family P of probability measures on a measurable space (S, B) is dominated by a σ-finite measure µ, there is a countable subfamily of P equivalent to P.

Proof For each P ∈ P, let f_P be (a specific version of) the Radon-Nikodym derivative (density) dP/dµ and let K_P := {x : f_P(x) > 0}. A set K ∈ B such that for some P ∈ P, K ⊂ K_P and µ(K) > 0, will be called a kernel. A chain is any union of disjoint kernels. If µ is not finite, let S = ∪_{n≥1} A_n where the A_n are disjoint measurable sets with 0 < µ(A_n) < ∞. Let ν(A) := Σ_{n≥1} µ(A ∩ A_n)/[2^n µ(A_n)]. Then ν and µ are equivalent (mutually absolutely continuous), so replacing µ by ν, we can assume µ is finite.

Let C_1, C_2,… be chains such that sup_n µ(C_n) = sup{µ(C) : C a chain}. Let D_1 := C_1. Given a chain D_{n−1}, let C_n = ∪_{j≥1} K_{nj} where the K_{nj} are disjoint kernels (for n fixed). Let D_{nj} := K_{nj} \ D_{n−1} if it is a kernel, that is, if µ(K_{nj} \ D_{n−1}) > 0; otherwise let D_{nj} be empty. Let D_n := D_{n−1} ∪ ∪_{j≥1} D_{nj}, which is a chain. Let D := ∪_n D_n, a chain of maximal measure. Then D = ∪_n K_n for some disjoint kernels K_n. Choose P_n := P(n) in P
such that K_n ⊂ K_{P(n)} and P_n(K_n) > 0. To show that {P_n} is equivalent to P, suppose not. Then for some B ∈ B, P_n(B) = 0 for all n and P(B) > 0 for some P ∈ P. If P(B \ D) > 0 then some kernel is included in B \ D, contradicting the choice of D. So we can assume B ⊂ D, and for some n, P(B ∩ K_n) > 0. Then P_n(B ∩ K_n) > 0, since µ(B ∩ K_n) > 0 and K_n is a kernel for P_n. This contradiction completes the proof.

5.1.4 Lemma Any countable family of σ-finite measures µ_n is equivalent to one finite measure µ.

Proof First, as in the last proof, we can assume µ_n(S) ≤ 1 for all n. Then let µ := Σ_{n≥1} µ_n/2^n, which is finite and equivalent to {µ_n}_{n≥1}.

The next step in the proof of Theorem 5.1.2 is:
5.1.5 Theorem Under the hypotheses of Theorem 5.1.2, the following are equivalent:

(a) A is sufficient for P;
(b) there is a probability measure λ such that {λ} is equivalent to P and dP/dλ is equal P-a.s. to an A-measurable function for all P ∈ P;
(c) there is a probability measure λ which dominates P such that dP/dλ is A-measurable for all P ∈ P.
Proof First it will be shown that (c) implies (a). For any B ∈ B let f_B := E_λ(1_B | A). Then for any P ∈ P and A ∈ A, since dP/dλ is A-measurable, by RAP, Theorem 10.1.9,

∫_A f_B dP = ∫_A E_λ(1_B | A) (dP/dλ) dλ = ∫_A 1_B (dP/dλ) dλ = P(A ∩ B),

so f_B = E_P(1_B | A) and A is sufficient.

(a) implies (b): if A is sufficient, take a sequence {P_n} ⊂ P equivalent to P by Lemma 5.1.3. Let λ := Σ_{n≥1} P_n/2^n. For each B ∈ B take f_B = E_P(1_B | A) P-a.s. for all P ∈ P. Since 0 ≤ f_B ≤ 1 P-almost surely for all P in P, also 0 ≤ f_B ≤ 1 almost surely for λ. Since P_n(B ∩ A) = ∫_A f_B dP_n for all n and all A ∈ A, it follows that λ(B ∩ A) = ∫_A f_B dλ, so f_B = E_λ(1_B | A). To show that for all P ∈ P, dP/dλ is A-measurable, note that for each B in B,

∫_B (dP/dλ) dλ = P(B) = ∫ f_B dP = ∫ E_λ(1_B | A) (dP/dλ) dλ,
which by RAP, Theorem 10.1.9 again, equals

∫ E_λ(1_B | A) E_λ((dP/dλ) | A) dλ = ∫ E_λ(1_B E_λ((dP/dλ) | A) | A) dλ = ∫ 1_B E_λ((dP/dλ) | A) dλ.

Since dP/dλ and E_λ((dP/dλ) | A) have the same integral over all sets in B, they must be equal λ-a.s., so that dP/dλ is equal P-a.s. to an A-measurable function. So (a) implies (b). Clearly (b) implies (c).

Now to prove the factorization theorem (5.1.2): if A is sufficient, take λ from the last theorem and let h := dλ/dµ. Letting f_P := dP/dλ, we have the desired factorization, with h ∈ L¹(S, B, µ) (in fact ∫ h dµ = 1). Conversely, if factorization holds (with h not necessarily integrable), then by Lemma 5.1.3 take {P_n}_{n≥1} equivalent to P and let λ := Σ_{n≥1} P_n/2^n, so that {λ} is equivalent to P as in the proof of Theorem 5.1.5. Then λ is absolutely continuous with respect to µ and dλ/dµ = hk, where k(x) := Σ_{n≥1} f_{P_n}(x)/2^n for all x. Also, k is A-measurable. For each P ∈ P let g_P(x) := f_P(x)/k(x) if k(x) > 0, and g_P(x) := 0 otherwise. Then λ(k = 0) = 0, and for each P ∈ P, g_P is A-measurable, P is absolutely continuous with respect to λ, and dP/dλ = g_P. So A is sufficient for P by Theorem 5.1.5, and Theorem 5.1.2 is proved.

Given a statistic T, that is, a measurable function from S into Y for measurable spaces (S, B) and (Y, F), let A := T^{−1}(F) := {T^{−1}(A) : A ∈ F}, a σ-algebra. For a family Q of laws on (S, B), T is called a sufficient statistic for Q if A is sufficient for Q. If T is sufficient, we can write f_P = g_P ∘ T for some F-measurable function g_P by RAP, Theorem 4.2.8.

Sufficiency, defined in terms of conditional probabilities of measurable sets, can be extended to suitable conditional expectations:
5.1.6 Theorem Let A be sufficient for a family P of laws on a measurable space (S, B). Then for any measurable real-valued function f on (S, B) which is integrable for each P ∈ P, there is an A-measurable function g such that g = E_P(f | A) a.s. for all P ∈ P.

Proof When f is the indicator function of a set in B, the assertion is the definition of sufficiency. It then follows for any simple function, which is a finite linear combination of such indicators. If f is nonnegative, there is a sequence of nonnegative simple functions increasing up to f, and the conclusion
holds (RAP, Proposition 4.1.5 and Theorem 10.1.7). Then any f satisfying the hypothesis can be written as f = f⁺ − f⁻ with f⁺ and f⁻ nonnegative, and the result follows.
Let µ and ν be two probability measures on the same measurable space (V, U). Take the Lebesgue decomposition (RAP, Theorem 5.5.3) ν = ν_ac + ν_s, where ν_ac is absolutely continuous and ν_s is singular with respect to µ. Let A ∈ U with ν_s(A) = µ(V \ A) = 0, so ν_ac(V \ A) = 0. Then the likelihood ratio R_{ν/µ} is defined as the Radon-Nikodym derivative dν_ac/dµ on A and as +∞ on V \ A. By uniqueness of the Hahn decomposition of V for ν_s − µ (RAP, Theorem 5.6.1), R_{ν/µ} is defined up to equality (µ + ν_s)- and so (µ + ν)-almost everywhere.
5.1.7 Theorem For any family P of laws on a measurable space (S, B) and sub-σ-algebra A ⊂ B, A is sufficient for P if and only if for all P, Q ∈ P, R_{Q/P} can be taken to be A-measurable, that is, it is equal (P + Q)-almost everywhere to an A-measurable function.
Proof Only if: suppose A is sufficient for P. Then it is also sufficient for {P, Q}, which is dominated by µ := P + Q. So by factorization (Theorem 5.1.2) there are A-measurable functions f_P and f_Q and a B-measurable function h such that dP/dµ = f_P h and dQ/dµ = f_Q h. Then R_{Q/P} = f_Q h/(f_P h) = f_Q/f_P, where y/0 := +∞ if y > 0 and 0/0 := 0, is A-measurable (note that R_{Q/P} does not depend on the choice of dominating measure).

Conversely, given P, Q ∈ P, suppose R_{Q/P} is A-measurable. Again let µ := P + Q. Then dP/dµ = 1/(1 + R_{Q/P}) and dQ/dµ = R_{Q/P}/(1 + R_{Q/P}) almost everywhere for µ, so dP/dµ and dQ/dµ can be chosen to be A-measurable. Thus by factorization again (this time with h = 1), A is sufficient for {P, Q}. So for all B ∈ B, P(B | A) = Q(B | A) almost surely for P and Q. Taking P fixed and letting Q vary over P, we get that A is sufficient for P.

Suppose we observe X_1,…,X_n i.i.d. with law P or Q, but we do not know which and want to decide. Suppose we have no a priori reason to favor a choice of P or Q, only the data. Then it is natural to evaluate the likelihood ratio R_{Q^n/P^n} and choose Q if R_{Q^n/P^n} > 1 and P if R_{Q^n/P^n} < 1, while if R_{Q^n/P^n} = 1 we still have no basis to prefer P or Q. More generally, decisions between P and Q can be made optimally, in terms of minimizing error probabilities or expected losses, by way of the likelihood ratio R_{Q/P} or R_{Q^n/P^n} as appropriate (the Neyman-Pearson lemma; Lehmann, 1986, pp. 74, 125;
Ferguson, 1967, Section 5.1). By Theorem 5.1.7, if B is sufficient for P^n for some P ⊇ {P, Q}, then R_{Q^n/P^n} is B-measurable. Specifically, if T is a sufficient statistic, then by Theorem 5.1.7 and RAP, Theorem 4.2.8, R_{Q^n/P^n} is a measurable function of T. Thus no information in (X_1,…,X_n) beyond T is helpful in choosing P or Q. In this sense, the definition of sufficiency fits with the informal notion of sufficiency given at the beginning of the section.

It will be shown that empirical measures are sufficient in a sense to be defined. Let S_n be the sub-σ-algebra of A^n consisting of sets invariant under all permutations of the coordinates.
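The role of sufficiency in this decision problem can be illustrated with two discrete laws: the product likelihood ratio R_{Q^n/P^n} depends on the sample only through the count of each value, i.e., through the empirical measure, and so is invariant under permutations of the observations. A sketch (Python; the two laws are our toy choices):

```python
from collections import Counter
from math import prod

# two laws on {0, 1, 2}, mutually absolutely continuous
P = {0: 0.5, 1: 0.3, 2: 0.2}
Q = {0: 0.2, 1: 0.3, 2: 0.5}

def likelihood_ratio(sample):
    # R_{Q^n/P^n}(x) = prod_i Q({x_i}) / P({x_i})
    return prod(Q[v] / P[v] for v in sample)

def ratio_from_counts(sample):
    # the same ratio computed from the empirical counts alone
    return prod((Q[v] / P[v]) ** k for v, k in Counter(sample).items())

x = [0, 2, 2, 1, 0, 2]
y = [2, 0, 1, 2, 0, 2]  # a permutation of x
r = likelihood_ratio(x)
print(abs(r - likelihood_ratio(y)) < 1e-12, abs(r - ratio_from_counts(x)) < 1e-12)
```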
5.1.8 Theorem S_n is sufficient for P^n := {P^n : P ∈ P}, where P is the set of all laws on (X, A).

Proof Let 𝔖_n be the symmetric group of all permutations of {1, 2,…,n}. For each π ∈ 𝔖_n and x := (x_1,…,x_n) ∈ X^n, set f_π(x) := (x_{π(1)},…,x_{π(n)}). Then f_π is a 1-1 measurable transformation of X^n onto itself with measurable inverse, and it preserves the product law P^n for each law P on (X, A). For any C ∈ A^n, we have

P^n(C | S_n) = (1/n!) Σ_{π∈𝔖_n} 1_{f_π(C)}(·)

almost surely for P^n, since for any B ∈ S_n,

P^n(B ∩ C) = (1/n!) Σ_{π∈𝔖_n} P^n(f_π(B ∩ C)) = (1/n!) Σ_{π∈𝔖_n} P^n(f_π(C) ∩ B).

The conclusion follows.
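The displayed formula for P^n(C | S_n) can be verified exhaustively on a tiny space. The sketch below (Python; X = {0, 1}, n = 3, and the events C and B are our choices) checks that integrating the symmetrized indicator over a permutation-invariant set B gives P^n(B ∩ C):

```python
from itertools import permutations, product
from math import factorial

n = 3
p = 0.3  # P({1}); P({0}) = 1 - p, for a law P on X = {0, 1}
space = list(product([0, 1], repeat=n))
weight = {x: p ** sum(x) * (1 - p) ** (n - sum(x)) for x in space}  # P^n

C = {(1, 0, 0), (1, 1, 0)}             # an arbitrary event in X^n
B = {x for x in space if sum(x) == 1}  # a permutation-invariant event

perms = list(permutations(range(n)))

def f(pi, x):  # coordinate permutation f_pi
    return tuple(x[pi[i]] for i in range(n))

def cond(x):  # candidate version of P^n(C | S_n): average over permutations
    return sum(f(pi, x) in C for pi in perms) / factorial(n)

lhs = sum(cond(x) * weight[x] for x in B)   # integral of cond over B
rhs = sum(weight[x] for x in B if x in C)   # P^n(B ∩ C)
print(abs(lhs - rhs) < 1e-12)
```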
For example, if X has just two points, say X = {0, 1}, and S := Σ_{i=1}^n x_i, then S_n is the smallest σ-algebra for which S is measurable. In this case no σ-algebra strictly smaller than S_n is sufficient (S_n is "minimal sufficient"). For each B ∈ A and x = (x_1,…,x_n) ∈ X^n, let

P_n(B)(x) := (1/n) Σ_{j=1}^n 1_B(x_j).

So P_n is the usual empirical measure, except that in this section, x ↦ P_n(B)(x) is a measurable function, or statistic, on a measurable space rather than a
probability space, since no particular law P or P^n has been specified as yet. Here, P_n(B)(x) is just a function of B and x.

For a collection F of measurable functions on (X^n, A^n), let S_F be the smallest σ-algebra making all functions in F measurable. Then F will be called sufficient if and only if S_F is sufficient.
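For the two-point example above, sufficiency of S can also be seen directly: under Bernoulli(θ)^n, the conditional law of (X_1,…,X_n) given S = s is uniform on the (n choose s) arrangements, whatever θ is. A sketch (Python; the parametrization is ours):

```python
from itertools import product

n = 4

def conditional_given_S(theta, s):
    # conditional law of (X_1,...,X_n) given sum = s, under Bernoulli(theta)^n
    pts = [x for x in product([0, 1], repeat=n) if sum(x) == s]
    w = [theta ** s * (1 - theta) ** (n - s) for _ in pts]
    total = sum(w)
    return {x: wi / total for x, wi in zip(pts, w)}

c1 = conditional_given_S(0.3, 2)
c2 = conditional_given_S(0.8, 2)
# the conditional law does not depend on theta: it is uniform on C(4, 2) = 6 points
same = all(abs(c1[x] - c2[x]) < 1e-12 for x in c1)
print(same, len(c1))
```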
5.1.9 Theorem For any measurable space (X, A) and for each n = 1, 2,…, the empirical measure P_n is sufficient for P^n, where P is the set of all laws on (X, A). In other words, the set F of functions x ↦ P_n(B)(x), for all B ∈ A, is sufficient. In fact the σ-algebra S_F is exactly S_n.
Proof Clearly S_F ⊂ S_n. To prove the converse inclusion, for each set B ∈ A^n let S(B) := ∪_{π∈𝔖_n} f_π(B) ∈ S_n. Then if B ∈ S_n, S(B) = B. Let E := {C ∈ A^n : S(C) ∈ S_F}. We want to prove E = A^n. Now E is a monotone class: if C_m ∈ E and C_m ↑ C or C_m ↓ C, then C ∈ E. Also, since S commutes with finite unions, any finite union of sets in E is in E. So it will be enough to prove that

(5.1.10) A_1 × ⋯ × A_n ∈ E for any A_j ∈ A,

since the collection C of finite unions of such sets, which can be taken to be disjoint, is an algebra, and the smallest monotone class including C is A^n (RAP, Propositions 3.2.2 and 3.2.3 and Theorem 4.4.2). Here, by another finite union, the A_j can be replaced by sets B_{j(i)} where B_1,…,B_r are the atoms of the algebra generated by A_1,…,A_n, so r ≤ 2^n. So we just need to show that S(B_{j(1)} × ⋯ × B_{j(n)}) ∈ S_F for all j(1),…,j(n) with 1 ≤ j(i) ≤ r, i = 1,…,n. Now, x ∈ S(B_{j(1)} × ⋯ × B_{j(n)}) if and only if, for each i = 1,…,r, P_n(B_i) = k_i/n, where k_i is the number of values of s with j(s) = i, s = 1,…,n. So S_F = S_n, and by Theorem 5.1.8 the conclusion follows.
For some subclasses C ⊂ A, the restriction of P_n to C may be sufficient, and handier than the values of P_n on the whole σ-algebra A. Recall that a class C included in a σ-algebra A is called a determining class if any two measures on A, equal and finite on C, are equal on all of A. If C generates the σ-algebra A, C is not necessarily a determining class unless, for example, it is an algebra (RAP, Theorem 3.2.7 and the example after it). Sufficiency of P_n(A) for A ∈ C can depend on n. Let X = {1, 2, 3, 4, 5}, A = 2^X, and C = {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}}. Then C is sufficient for n = 1, but not for n = 2, since for example (δ_1 + δ_4)/2 = (δ_2 + δ_5)/2 on C. This is a case where C generates A but is not a determining class.
5.1.11 Theorem Let (X, d) be a separable metric space which is a Borel subset of its completion, with Borel σ-algebra A. Suppose C = {C_k}_{k≥1} is a countable determining class for A. Then for each n = 1, 2,…, the sequence {P_n(C_k)}_{k≥1} is sufficient for the class P^n of all laws P^n on (X^n, A^n), where P ∈ P, the class of all laws on (X, A).

Proof For n = 1, 2,…, let I_n be the finite set {j/n : j = 0, 1,…,n}. Let I_n^∞ be a countable product of copies of I_n, with the product σ-algebra defined from the σ-algebra of all subsets of I_n. We have:
5.1.12 Lemma Under the hypotheses of Theorem 5.1.11, for each n = 1, 2,… and A ∈ A, there is a Borel measurable function f_A on I_n^∞ such that for any x_1,…,x_n ∈ X and P_n := n^{−1} Σ_{j=1}^n δ_{x_j}, we have

P_n(A) = f_A({P_n(C_k)}_{k≥1}).
Proof Since C is a determining class, the function f_A exists. We need to show it is measurable. If X is uncountable, then by the Borel isomorphism theorem (RAP, Theorem 13.1.1) we can assume X = [0, 1]. Or if X is countable, then A is the σ-algebra of all its subsets and we can assume X ⊂ [0, 1]. Let

X^{(n)} := {x := (x_j)_{j=1}^n ∈ X^n : x_1 ≤ x_2 ≤ ⋯ ≤ x_n}.

The map x ↦ P_n^{(x)} is 1-1 from X^{(n)} into the set of all possible empirical measures P_n, since if Q := P_n^{(x)} = P_n^{(y)} with x, y ∈ X^{(n)}, then x_1 = y_1 = the smallest u such that Q({u}) > 0. Next, x_2 = y_2 = x_1 if and only if Q({x_1}) ≥ 2/n, while otherwise x_2 = y_2 = the next-smallest v such that Q({v}) > 0, and so on. Now, X^{(n)} is a Borel subset of X^n. The completion of X^n for any of the usual product metrics is isometric to S^n, where S is the completion of X. Clearly X^n is a Borel subset of S^n. It follows that X^{(n)} is a Borel subset of its completion. Here the following will be useful:
5.1.13 Lemma On I_n^∞, the product σ-algebra B_n^∞, the smallest σ-algebra for which all the coordinates are measurable, equals the Borel σ-algebra B of the product topology T.

Proof Clearly B_n^∞ ⊂ B since the coordinates are continuous. Conversely, T has a base R consisting of all sets Π_{j≥1} A_j where A_j = I_n for all but finitely many j; R is countable (since I_n is finite) and consists of sets in B_n^∞, and every
U ∈ T is a countable union of sets in R, so U ∈ B_n^∞ and B = B_n^∞. Lemma 5.1.13 is proved.

Now to continue the proof of Lemma 5.1.12, the map f : x ↦ {P_n^(x)(C_k)}_{k≥1} is Borel measurable from X^(n) into I_n^∞. Since {C_k}_{k≥1} are a determining class, f is one-to-one. Thus by Appendix G, Theorem G.6, f has Borel image f[X^(n)] in I_n^∞, and f^{-1} is Borel measurable from f[X^(n)] onto X^(n). Then f^{-1} extends to a Borel measurable function h on all of I_n^∞ into X^(n), since X^(n) is Borel-isomorphic to ℝ, or a countable subset, with Borel σ-algebra, by the Borel isomorphism theorem (RAP, Theorem 13.1.1 again), and thus the extension works as for real-valued functions (RAP, Theorem 4.2.5). For any A ∈ A, g_A : x ↦ P_n^(x)(A) is Borel measurable. Thus f_A ≡ g_A ∘ h is Borel measurable and Lemma 5.1.12 is proved.

Now to prove Theorem 5.1.11, we know from Theorem 5.1.9 that the smallest σ-algebra S_n making all functions x ↦ P_n^(x)(B) measurable for B ∈ A is sufficient. By Lemma 5.1.12, S_n is the same as the smallest σ-algebra making x ↦ P_n^(x)(C_k) measurable for all k, which finishes the proof.
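The injectivity step in the proof of Lemma 5.1.12 — reading off the ordered points x_1 ≤ ... ≤ x_n from the atoms of the empirical measure — is effectively an algorithm; here is a sketch (the function name is ours):

```python
def recover_sorted_sample(atoms, n):
    """Recover x_1 <= ... <= x_n from the empirical measure
    P_n = (1/n) * sum_j delta_{x_j}, given `atoms` mapping each
    support point u to its mass P_n({u}) (a multiple of 1/n).
    Each atom u is repeated n * P_n({u}) times, smallest first,
    mirroring the recursion in the proof of Lemma 5.1.12."""
    sample = []
    for u in sorted(atoms):
        sample.extend([u] * round(n * atoms[u]))
    assert len(sample) == n, "masses must sum to 1 in multiples of 1/n"
    return sample
```

For example, recover_sorted_sample({0.3: 2/3, 0.7: 1/3}, 3) returns [0.3, 0.3, 0.7], the unique ordered sample with that empirical measure.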
In the real line ℝ, the closed half-lines (−∞, x] form a determining class. In other words, as is well known, a probability measure P on the Borel σ-algebra of ℝ is uniquely determined by its distribution function F (RAP, Theorem 3.2.6). It follows that the half-lines (−∞, q] for q rational are a determining class: for any real x, take rational q_k ↓ x; then F(q_k) ↓ F(x). Thus we have:
5.1.14 Corollary In ℝ, the empirical distribution functions F_n(x) := P_n((−∞, x]) are sufficient for the family P^n of all laws P^n on ℝ^n where P varies over all laws on the Borel σ-algebra in ℝ.
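Concretely, the empirical distribution function of Corollary 5.1.14 can be sketched as follows (names ours):

```python
import bisect

def empirical_cdf(sample):
    """F_n(x) = P_n((-inf, x]): the fraction of sample points <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n
```

With F = empirical_cdf([0.2, 0.5, 0.5]) one gets F(0.4) = 1/3 and F(0.5) = 1; by the determining-class argument above, the values of F at the rationals already determine P_n.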
5.2 Admissibility

Let F be a family of real-valued functions on a set X, measurable for a σ-algebra A on X. Then there is a natural function, called here the evaluation map, F × X → ℝ, given by (f, x) ↦ f(x). It turns out that for general F there may not exist any σ-algebra of subsets of F for which the evaluation map is jointly measurable. The possible existence of such a σ-algebra and its uses will be the subject of this section. Let (X, B) be a measurable space. Then (X, B) will be called separable if B is generated by some countable subclass C ⊂ B and B contains all singletons
{x}, x ∈ X. In this section, (X, B) will be assumed to be such a space. Let F be a collection of real-valued functions on X. (The following definition is unrelated to the usage of "admissible" for estimators in statistics.)
Definition. F is called admissible if there is a σ-algebra T of subsets of F such that the evaluation map (f, x) ↦ f(x) is jointly measurable from (F, T) × (X, B) (with product σ-algebra) to ℝ with Borel sets. Then T will be called an admissible structure for F. F will be called image admissible via (Y, S, T) if (Y, S) is a measurable space and T is a function from Y onto F such that the map (y, x) ↦ T(y)(x) is jointly measurable from (Y, S) × (X, B) with product σ-algebra to ℝ with Borel sets. To apply these definitions to a family C of sets, let F := {1_A : A ∈ C}.
Remarks. For one example, let (K, d) be a compact metric space and let F be a set of continuous real-valued functions on K, compact for the supremum norm. Then the functions in F are uniformly equicontinuous on K by the Arzelà-Ascoli theorem (RAP, 2.4.7). It follows that the map (f, x) ↦ f(x) is jointly continuous for the supremum norm on f ∈ F and d on K. Since both spaces are separable metric spaces, the map is also jointly measurable, so that F is admissible. If a family F is admissible, then it is image admissible, taking T to be the identity. In regard to the converse direction, here is an example. Let X = [0, 1] with usual Borel σ-algebra B. Let (Y, S) be a countable product of copies of (X, B). For y = (y_n)_{n≥1} ∈ Y let T(y)(x) := 1_J(x, y) where J := {(x, y) : x = y_n for some n}. Let C be the class of all countable subsets of X and F the class of indicator functions of sets in C. Then it is easy to check that F is image admissible via (Y, S, T). If a σ-algebra T is defined on F by setting T := {E ⊂ F : T^{-1}(E) ∈ S}, then T is not countably generated (see problem 5(b)) although S is. This example shows how sometimes image admissibility may work better than admissibility.
5.2.1 Theorem For any separable measurable space (X, B), there is a subset Y of [0, 1] and a 1-1 function M from X onto Y which is a measurable isomorphism (is measurable and has measurable inverse) for the Borel σ-algebra on Y.

Remark. Note that Y is not necessarily a measurable subset of [0, 1]. On the other hand, if (X, B) is given as a separable metric space which is a Borel subset of its completion, with Borel σ-algebra, then (X, B) is measurably isomorphic
either to a countable set, with the σ-algebra of all its subsets, or to all of [0, 1], by the Borel isomorphism theorem (RAP, Theorem 13.1.1).
Proof Recall that the Borel subsets of Y as a metric space with usual metric are the same as the intersections with Y of Borel sets in [0, 1], since the same is true for open sets. Let A := {A_j}_{j≥1} be a countable set of generators of B. Consider the map f : x ↦ {1_{A_j}(x)}_{j≥1} from X into a countable product 2^∞ of copies of {0, 1} with product σ-algebra. Then f is 1-1 and onto its range Z. Thus it preserves all set operations, specifically countable unions and complements. So it is easily seen that f is a measurable isomorphism of X onto Z. Next consider the map g : {z_j} ↦ Σ_j 2 z_j / 3^j from 2^∞ into [0, 1], actually onto the Cantor set C (RAP, proof of Proposition 3.4.1). Then g is continuous from the compact space 2^∞ with product topology onto C. It is easily seen that g is 1-1. Thus g is a homeomorphism (RAP, Theorem 2.2.11) and a measurable isomorphism for Borel σ-algebras. The Borel σ-algebra on 2^∞ equals the product σ-algebra (Lemma 5.1.13). So the restriction of g to Z is a measurable isomorphism onto its range Y. It follows that the composition g ∘ f, called the Marczewski function,

M(x) := Σ_{n=1}^∞ 2·1_{A_n}(x)/3^n,

is one-to-one from X onto Y ⊂ I := [0, 1] and is a measurable isomorphism onto Y.
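A finite-precision sketch of the Marczewski function, with the generating sets supplied as predicates (names ours):

```python
def marczewski(x, generators):
    """Truncated Marczewski function M(x) = sum_n 2*1_{A_n}(x)/3^n,
    summed over the finitely many generators supplied.  Points that
    the generators separate receive distinct Cantor-set codes."""
    # A(x) is a bool, used as 0 or 1 in the ternary expansion
    return sum(2 * A(x) / 3 ** n for n, A in enumerate(generators, start=1))
```

For instance, with generators A_1 = {x < 0.5} and A_2 = {x < 0.7}, the point 0.6 is coded as 0 + 2/9 and the point 0.3 as 2/3 + 2/9, distinct because the generators separate them.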
Let (X, B) be a separable measurable space where B is generated by a sequence {A_j}_{j≥1}. By taking the union of the finite algebras generated by A_1, ..., A_n for each n, we can and do take A := {A_j}_{j≥1} to be an algebra. Let F_0 be the class of all finite sums Σ_{i=1}^n c_i 1_{A_i} for rational c_i ∈ ℝ and n = 1, 2, .... Then "Borel classes" or "Baire classes" are defined as follows by transfinite recursion (RAP, 1.3.2). Let (Ω, <) be an uncountable well-ordered set such that for each β ∈ Ω, {α ∈ Ω : α < β} is countable. (Specifically, one can take Ω to be the set of all countable ordinals with their usual ordering.) For any countable set A ⊂ Ω, {γ : γ ≤ x for some x ∈ A} is countable, so there is a z ∈ Ω with x < z for all x ∈ A. For each α ∈ Ω there is a next larger element called α + 1. Let 0 be the smallest element of Ω. For each α ∈ Ω, given F_α, let F_{α+1} be the set of all limits of everywhere pointwise convergent sequences of functions in F_α. If β ∈ Ω is not of the form α + 1 (β is a "limit ordinal"), β > 0, and F_α is defined for all α < β, let F_β
be the union of all F_α for α < β. Note that F_α ⊂ F_β whenever α < β. Let U := ∪_{α∈Ω} F_α.

5.2.2 Theorem U is the set of all measurable real functions on X.
Proof Clearly, each function in U is measurable. Conversely, the class of all sets B such that 1_B ∈ U is a monotone class and includes the generating algebra A, so it is the σ-algebra B of all measurable sets (RAP, Theorem 4.4.2). Likewise, for a fixed A ∈ A and constants c, d, the collection of all sets B such that c 1_A + d 1_B ∈ U is B. Then for a fixed B ∈ B, the set of all C ∈ B such that c 1_C + d 1_B ∈ U is all of B. By a similar proof for a sum of n terms, we get that all simple functions Σ_{i=1}^n c_i 1_{B(i)} are in U for any c_i ∈ ℝ and B(i) ∈ B. Since any measurable real function f is the limit of a sequence of simple functions f_n + g_n, where f_n ↑ max(f, 0) and g_n ↓ min(f, 0) (RAP, Proposition 4.1.5), every measurable real function on X is in U.
On admissibility there is the following main theorem:
5.2.3 Theorem (Aumann) Let I := [0, 1] with usual Borel σ-algebra. Given a separable measurable space (X, B) and a class F of measurable real-valued functions on X, the following are equivalent:

(i) F ⊂ F_α for some α ∈ Ω;
(ii) there is a jointly measurable function G : I × X → ℝ such that for each f ∈ F, f = G(t, ·) for some t ∈ I;
(iii) there is a separable admissible structure for F;
(iv) F is admissible;
(v) 2^F is an admissible structure for F;
(vi) F is image admissible via some (Y, S, T).
Remarks. The specific classes F_α depend on the choice of the countable family A of generators, but condition (i) does not: if C is another countable set of generators of B with corresponding classes G_α, then for any α ∈ Ω there are β ∈ Ω and γ ∈ Ω with F_α ⊂ G_β and G_α ⊂ F_γ.
Proof (ii) implies (iii): for each f ∈ F choose a unique t ∈ I and restrict the Borel σ-algebra to the set of t's chosen. Then (iii) follows. Clearly (iii) implies (iv), which is equivalent to (v). (iv) implies (iii): note that any real-valued measurable function f (for the Borel σ-algebra on ℝ as usual) is always measurable for some countably generated sub-σ-algebra, for example the one generated by {f > t}, t rational. A product
σ-algebra is the union of the σ-algebras generated by countable sets of rectangles A_i × B_i. So the evaluation map is measurable for such a sub-σ-algebra. Let D be the σ-algebra of subsets of F generated by the sets A_i. For any two distinct functions f, g in F, f(x) ≠ g(x) for some x. The map h ↦ h(x) is D-measurable. So {f} is the intersection of those A_i which contain f and the complements of the others, and (iii) follows. So (iii) through (v) are equivalent.
(iii) implies (ii): by Theorem 5.2.1, there is a subset Y ⊂ I := [0, 1] with Borel σ-algebra and a measurable isomorphism t ↦ G(t, ·) from Y onto F. The assumed admissibility implies that (t, x) ↦ G(t, x) is jointly measurable. By the general extension theorem for real-valued measurable functions (RAP, Theorem 4.2.5), although Y is not necessarily measurable, we can assume G is jointly measurable on I × X, proving (ii). So (ii) through (v) are equivalent. (ii) implies (i): G belongs to some F_α on [0, 1] × X, where we take generators of the form A_i × B_i, A_i Borel, B_i ∈ B, such that {A_i}_{i≥1} and {B_i}_{i≥1} are algebras. Hence the sections G(t, ·) on X all belong to F_α on X for the generators {B_i}.
(i) implies (ii): for this we need universal functions, defined as follows. A jointly measurable function G : I × X → ℝ will be called a universal class α function if every function f ∈ F_α on X is of the form G(t, ·) for some t ∈ I. (G itself will not necessarily be of class α on I × X.) Recall, by the way, that a universal open set in ℕ^∞ × X exists for any separable metric space X (RAP, Proposition 13.2.3).
5.2.4 Theorem (Lebesgue) For any α ∈ Ω there exists a universal class α function G : I × X → ℝ.
Proof For α = 0, F_0 is a countable sequence {f_k}_{k≥1} of functions. Let G(1/k, x) := f_k(x) and G(t, x) := 0 if t ≠ 1/k for all k. Then G is jointly measurable and a universal class 0 function. For general α > 0, by transfinite induction (RAP, Section 1.3), suppose there is a universal class β function for all β < α. First suppose α is a successor, that is, α = β + 1 for some β. Let H be a universal class β function on I × X. Let I^∞ be the countable product of copies of I with the product σ-algebra. The product topology on I^∞ is compact and metrizable (RAP, Theorem 2.2.8 and Proposition 2.4.4). Its Borel σ-algebra is the same as the product σ-algebra, by way of the usual base or subbase of the product topology (RAP, pp. 30-31). For t = (t_n)_{n≥1} ∈ I^∞ let G(t, x) := lim sup_n H(t_n, x) if the lim sup is finite, otherwise G(t, x) := 0. Then G is jointly measurable. Now I^∞, as a
Polish space, is Borel-isomorphic to I (RAP, Section 13.1), so we can replace I^∞ by I. Then G is a universal class α function. If α is not a successor, then there is a sequence β_k ↑ α, β_k < α, meaning that for every β < α, there is some k with β ≤ β_k. For each k let G_k be a universal class β_k function. Define G on I² × X by G(s, t, x) := G_k(t, x) if s = 1/k and G(s, t, x) := 0 otherwise. Then G is jointly measurable. Since I² is Borel-isomorphic to I, we again have a universal class α function, proving Theorem 5.2.4.
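The α = 0 step of the proof admits a direct sketch, for a finite list standing in for the sequence f_1, f_2, ... (names ours):

```python
def universal_class0(functions, tol=1e-9):
    """G(1/k, x) := f_k(x), G(t, x) := 0 for t not of the form 1/k,
    as in the first step of the proof of Theorem 5.2.4."""
    def G(t, x):
        for k, f in enumerate(functions, start=1):
            if abs(t - 1.0 / k) < tol:  # t == 1/k, up to float tolerance
                return f(x)
        return 0.0
    return G
```

With G = universal_class0([lambda x: x + 1, lambda x: 2 * x]), the parameter t = 1 selects the first function and t = 1/2 the second, while other t give 0; every member of the finite "class 0" family is a section G(t, ·).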
With Theorem 5.2.4, (i) in Theorem 5.2.3 clearly implies (ii), so (i) through (v) are equivalent. (iv) implies (vi) directly. If (vi) holds, let Z be a subset of Y on which the map z ↦ T(z) is one-to-one and onto F. Let S_Z and T_Z be the restrictions to Z of S and T respectively. For a function f and set B let f[B] := {f(x) : x ∈ B}. Then F remains image admissible via (Z, S_Z, T_Z), and {T_Z[A] : A ∈ S_Z} is an admissible structure for F, giving (iv).
For 0 ≤ p < ∞ and a probability law Q on (X, B) we have the space L^p(X, B, Q) of measurable real-valued functions f on X such that ∫|f|^p dQ < ∞, with the pseudometrics

d_{p,Q}(f, g) := (∫|f − g|^p dQ)^{1/p} for 1 ≤ p < ∞,
d_{p,Q}(f, g) := ∫|f − g|^p dQ for 0 < p < 1,
d_{0,Q}(f, g) := inf{ε > 0 : Q(|f − g| > ε) ≤ ε} for p = 0.

In admissible classes, d_{p,Q}-open sets are measurable, as follows:
5.2.5 Theorem Let (X, B) be a separable measurable space, 0 ≤ p < ∞, and F ⊂ L^p(X, B, Q). Then if F is image admissible via (Y, S, T), U ⊂ F, and U is relatively d_{p,Q}-open in F, we have T^{-1}(U) ∈ S.

Proof Since (X, B) is separable, the pseudometric spaces L^p(X, B, Q) are separable for 0 ≤ p < ∞, and so (e.g., RAP, Proposition 2.1.4), U is a countable union of balls {f : d_{p,Q}(f, g) < r}, g ∈ U, 0 < r < ∞. So for p > 0 it is enough to show that each function y ↦ ∫|T(y) − g|^p dQ is measurable. For g fixed, (y, x) ↦ |T(y)(x) − g(x)|^p is jointly measurable. So for 0 < p < ∞ we can reduce to the case p = 1 and g = 0 with T(y)(x) ≥ 0 for all x, y. Now, the Tonelli-Fubini theorem implies the desired measurability. For p = 0, given ε > 0, by the Tonelli-Fubini theorem again, the set A_ε of y such that Q(|T(y)(x) − g(x)| > ε) ≤ ε is
measurable, and {y : d_{0,Q}(T(y), g) < r} is the union of the A_ε for ε rational, 0 < ε < r.
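For an empirical law Q (so that integrals become sample averages), the pseudometrics d_{p,Q} defined before Theorem 5.2.5 can be sketched as follows; the case conventions follow the display above, and all names are ours:

```python
def d_pQ(f, g, sample, p):
    """d_{p,Q} for Q the empirical measure on `sample`:
    (integral |f-g|^p dQ)^(1/p) for p >= 1, the integral itself for
    0 < p < 1, and inf{eps > 0 : Q(|f-g| > eps) <= eps} for p = 0."""
    diffs = [abs(f(x) - g(x)) for x in sample]
    n = len(diffs)
    if p >= 1:
        return (sum(d ** p for d in diffs) / n) ** (1.0 / p)
    if p > 0:
        return sum(d ** p for d in diffs) / n
    # p == 0: Q(|f-g| > eps) only takes values k/n and is a right-
    # continuous step function of eps, so checking eps in {k/n} and
    # at the jump points (the diffs) finds the infimum.
    for eps in sorted(set(diffs) | {k / n for k in range(n + 1)}):
        if sum(d > eps for d in diffs) / n <= eps:
            return eps
    return max(diffs)
```

For instance, with f the identity, g ≡ 0, and the two-point sample {0, 2}, d_{2,Q} is the root-mean-square difference √2.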
5.2.6 Corollary If (X, B) is a separable measurable space, F ⊂ L^1(X, B, Q), and F is image admissible via (Y, S, T), then y ↦ ∫T(y) dQ is S-measurable.

Proof For any real u, {f : ∫f dQ > u} is open for d_{1,Q}.
For 1 ≤ p < ∞ and f, g ∈ L^p(X, B, Q), let ρ_{p,Q}(f, g) := d_{p,Q}(f_{0,Q}, g_{0,Q}) where for h ∈ L^1(X, B, Q), h_{0,Q} := h − ∫h dQ. Thus for ρ_Q as defined in Section 3.1, ρ_Q = ρ_{2,Q}.

5.2.7 Corollary If (X, B) is a separable measurable space, 1 ≤ p < ∞, F ⊂ L^p(X, B, Q), F is image admissible via (Y, S, T), S is separable, U ⊂ F, and U is ρ_{p,Q}-open, then T^{-1}(U) ∈ S.

Proof Let G be the set of functions f − ∫f dQ, f ∈ F. For y ∈ Y let S(y)(x) := T(y)(x) − ∫T(y) dQ. Then by Corollary 5.2.6, S is jointly measurable and G is image admissible via (Y, S, S). Now f ∈ U if and only if f_{0,Q} ∈ V, where V is d_{p,Q}-open in G. Then T^{-1}(U) = S^{-1}(V) ∈ S by Theorem 5.2.5.
5.3 Suslin properties, selection, and a counterexample

Here is another counterexample on measurability to add to the two at the beginning of the chapter. Let X = [0, 1] with Borel σ-algebra and uniform (Lebesgue) probability measure P := U[0, 1]. Let A be a non-Lebesgue-measurable subset of [0, 1] (e.g., RAP, Theorem 3.4.4). Let C := {{x} : x ∈ A}. Then C is a collection of disjoint sets, so S(C) = 1 by Theorem 4.2.6. Also C, being a class of singletons, is admissible, for example by Theorem 5.2.3(ii) with G(t, s) := 1 for t = s and G(t, s) := 0 otherwise. But ‖P_1‖_C is nonmeasurable, being 1 if and only if X_1 ∈ A, and likewise any ‖P_n‖_C and ‖P_n − P‖_C is nonmeasurable. So some measurability condition beyond admissibility is needed for ‖P_n − P‖_F to be measurable. A sufficient condition will be provided by Suslin properties, as follows. Recall that a Polish space is a topological space metrizable as a complete separable metric space. A separable measurable space (Y, S) will be called a Suslin space if there is a Polish space X and a Borel measurable map from X onto Y. If (Y, S) is a measurable space, a subset Z ⊂ Y will be called a Suslin set if it is a Suslin space with the relative σ-algebra {Z ∩ B : B ∈ S}.
Given a measurable space (X, B) and M ⊂ X, recall that M is called universally measurable or u.m. if for every probability law P on B, M is measurable for the completion of P, in other words for some A, B ∈ B, A ⊂ M ⊂ B and P(A) = P(B). In a Polish space, all Suslin sets are universally measurable (RAP, Theorems 13.2.1 and 13.2.6). A function f from X into Z, where (Z, A) is a measurable space, will be called universally measurable or u.m. if for each set B ∈ A, f^{-1}(B) is universally measurable.
If (Ω, A) is a measurable space and F a set, then a real-valued function X : (f, ω) ↦ X(f, ω) will be called image admissible Suslin via (Y, S, T) if (Y, S) is a Suslin measurable space, T is a function from Y onto F, and (y, ω) ↦ X(T(y), ω) is jointly measurable on Y × Ω. Equivalently, Y could be taken to be Polish with S as its Borel σ-algebra. As the notation suggests, a main case of interest will be where F is a set of functions on Ω and X(f, ω) = f(ω). Also, X or F will be called image admissible Suslin if X is image admissible Suslin via some (Y, S, T) as above. Recall the notion of separable measurable space defined in the last section. Note that any separable metric space with its Borel σ-algebra is a separable measurable space, as follows from RAP, Proposition 2.1.4. We have:
5.3.1 Theorem A measurable space (X, B), where B is countably generated, is separable if and only if B separates the points of X, so that for any x ≠ y in X, there is some A ∈ B containing just one of x, y.

Proof If (X, B) is separable then {x} ∈ B for each x ∈ X, so B separates points. The converse direction follows from the proof of Theorem 5.2.3, (iv) implies (iii).
5.3.2 Selection Theorem (Sainte-Beuve) Let (Ω, A) be any measurable space and let X : F × Ω → ℝ be image admissible Suslin via (Y, S, T). Then for any Borel set B ⊂ ℝ,

Π_X(B) := {ω : X(f, ω) ∈ B for some f ∈ F}

is u.m. in Ω, and there is a u.m. function H from Π_X(B) into Y such that X(T(H(ω)), ω) ∈ B for all ω ∈ Π_X(B).

Note. Here (Ω, A) need not be Suslin or even separable.
Proof As noted above, we can assume Y is Polish and S is its Borel σ-algebra. For any measurable set V in a product σ-algebra, here V = {(y, ω) : X(T(y), ω) ∈ B} ⊂ Y × Ω, there are countably many measurable sets A ⊂ Y
and E ⊂ Ω such that V is in the σ-algebra generated by the sets A × E (as noted in the proof of Theorem 5.2.3, (iv) implies (iii)). Let B vary over a countable set of generators of the Borel σ-algebra in ℝ. Thus (y, ω) ↦ X(T(y), ω) is jointly measurable for S in Y and a σ-algebra B in Ω generated by a sequence {B_n} of measurable sets. Define a Marczewski function b, as in the proof of Theorem 5.2.1, for the sequence {B_n} in place of {A_n}. Then b is a B-measurable function from Ω into the Cantor set C ⊂ [0, 1] (defined, e.g., in RAP, proof of Proposition 3.4.1). Define ω ≡_B ω′ if for all n, ω ∈ B_n if and only if ω′ ∈ B_n. Equivalently, for all B ∈ B, ω ∈ B if and only if ω′ ∈ B. Then b(ω) = b(ω′) if and only if ω ≡_B ω′. Let β be a B-measurable function. Then if b(ω) = b(ω′), we must have β(ω) = β(ω′). So β(ω) ≡ g(b(ω)) for some function g. Now, β is measurable for the σ-algebra of all sets b^{-1}(D) where D is a Borel subset of ℝ, since each B_n is such a set (by the proof of Theorem 5.2.1). So g can be taken to be Borel measurable (RAP, Theorem 4.2.8). Similarly, X(T(y), ω) ≡ F(y, b(ω)) for some function F on Y × C. The next step in the proof of Theorem 5.3.2 is the following fact:
5.3.3 Lemma Every S ⊗ B measurable real function ψ on Y × Ω can be written as ψ(y, ω) = F(y, b(ω)), where F is S ⊗ B_C measurable and B_C is the Borel σ-algebra in C.

Proof This is easily seen when ψ is the indicator of a set S × B, S ∈ S, B ∈ B, or a finite linear combination of such indicators. A finite union of sets S_i × B_i, S_i ∈ S, B_i ∈ B, can be written as a disjoint union (RAP, Proposition 3.2.2), so the lemma holds when ψ is the indicator of any such union. Since the finite unions of sets S_i × B_i form an algebra, the monotone class theorem (RAP, Theorem 4.4.2) proves the lemma when ψ is the indicator function of any set in the product σ-algebra S ⊗ B. It thus holds when ψ is any simple function (finite linear combination of such indicator functions). Next, suppose ψ_n are S ⊗ B measurable, ψ_n(y, ω) = F_n(y, b(ω)) with F_n S ⊗ B_C measurable, and ψ_n → ψ pointwise on Y × Ω. For any y ∈ Y and ω, ω′ ∈ Ω such that b(ω) = b(ω′), we have F_n(y, b(ω)) = F_n(y, b(ω′)) for all n, so ψ(y, ω) = ψ(y, ω′). Let F(y, c) := lim_n F_n(y, c) whenever the sequence converges, as it will if c = b(ω) for some ω ∈ Ω; otherwise set F(y, c) := 0. Then F is S ⊗ B_C measurable (RAP, proof of Theorem 4.2.5) and ψ(y, ω) = F(y, b(ω)) for all y ∈ Y and ω ∈ Ω. Since each measurable function is a pointwise limit of simple functions, the lemma follows.
So X(T(y), ω) ≡ F(y, b(ω)) for an S ⊗ B_C measurable function F on Y × C. Now, {(y, c) ∈ Y × C : F(y, c) ∈ B} is a Borel subset of Y × C (RAP, Proposition 4.1.7) and thus a Suslin set (RAP, Theorem 13.2.1). It follows that Π_F B := {c ∈ C : for some y ∈ Y, F(y, c) ∈ B} is a Suslin subset of C and therefore universally measurable (RAP, Theorem 13.2.6). By a selection theorem (RAP, Theorem 13.2.7) there is a u.m. function h from Π_F B into Y with F(h(c), c) ∈ B for all c ∈ Π_F B. Next we need:
5.3.4 Lemma Let (D, D) and (G, G) be two measurable spaces and g a measurable function from D into G. Let U ⊂ G be u.m. for G. Then g^{-1}(U) is u.m. for D.

Proof Let P be a law on (D, D), so P ∘ g^{-1} is a law on (G, G). Take A, B ∈ G with A ⊂ U ⊂ B and (P ∘ g^{-1})(B \ A) = 0. Then g^{-1}(A) and g^{-1}(B) are in D, g^{-1}(A) ⊂ g^{-1}(U) ⊂ g^{-1}(B), and P(g^{-1}(B) \ g^{-1}(A)) = P(g^{-1}(B \ A)) = 0.

Remark. The lemma includes the case where D ⊂ G, g is the identity, and D includes the relative σ-algebra {J ∩ D : J ∈ G}, while possibly D and U may not be u.m. for G.
Now to finish the proof of Theorem 5.3.2, Lemma 5.3.4 implies that

Π_X(B) = b^{-1}{c ∈ C : for some y ∈ Y, F(y, c) ∈ B}

is u.m. Let H := h ∘ b from Ω into Y. Then for any E ∈ S, H^{-1}(E) = b^{-1}(h^{-1}(E)). Here h^{-1}(E) is a u.m. set in Π_F B, and so by Lemma 5.3.4, b^{-1}(h^{-1}(E)) is u.m. in Ω and H is a u.m. function. By choice of h we have X(T(H(ω)), ω) ∈ B for all ω ∈ Π_X B. Some possibilities for the set B ⊂ ℝ are the sets {x : x > t} or {x : |x| > t} for any real t. These choices give:
5.3.5 Corollary Let (f, ω) ↦ X(f, ω) be real-valued and image admissible Suslin via some (Y, S, T). Then ω ↦ sup{X(f, ω) : f ∈ F} and ω ↦ sup{|X(f, ω)| : f ∈ F} are u.m. functions.

The image admissible Suslin property is preserved by composing with a measurable function:
5.3.6 Theorem Let X_1, ..., X_k be image admissible Suslin real-valued functions on F_i × Ω, i = 1, ..., k, for one measurable space (Ω, A), via (Y_i, S_i, T_i), i = 1, ..., k. Let g be a Borel measurable function from ℝ^k into ℝ. Then (ω, f_1, ..., f_k) ↦ g(X_1(f_1, ω), ..., X_k(f_k, ω)) is image admissible Suslin via some (Y, S, T). Specifically, we can let Y = Y_1 × ... × Y_k with product σ-algebra S = S_1 ⊗ ... ⊗ S_k and let T(y_1, ..., y_k) := (T_1(y_1), ..., T_k(y_k)).

Proof Clearly (Y, S) is Suslin and the joint measurability holds.

Next, here are examples showing that if the Suslin assumption on Y is removed from Theorem 5.3.2 it may fail. Let (Ω, <) and the class C of countable initial segments I_x be as in the second example at the beginning of this chapter. We can take Ω to be (in 1-1 correspondence with) a subset of [0, 1], by the axiom of choice (or, by the continuum hypothesis, all of [0, 1]). Here the ordering < has no relation to any usual structure on [0, 1]. Then C is admissible by Theorem 5.2.3, since the collection of all countable sets is of bounded Borel class: all finite unions of open intervals (q, r) with rational endpoints are in some F_α, so all finite sets are in F_{α+1} and all countable sets in F_{(α+1)+1} =: F_{α+2}. Let P be a law on Ω which is 0 on countable sets and 1 on sets with countable complement. Such a law, on the σ-algebra generated by singletons, exists on any uncountable set. Under the continuum hypothesis with Ω = [0, 1] we can take P to be Lebesgue measure or any nonatomic law. Let P_1 = δ_x, P_2 = (δ_x + δ_y)/2, and Q_1 = δ_z, where x, y, z are coordinates on Ω³ with law P³, so x, y, and z are i.i.d. (P). Then sup_{A∈C}(P_1 − Q_1)(A) = 1 if and only if x < z. Thus, as shown at the beginning of this chapter, sup_{A∈C}(P_1 − Q_1)(A) is nonmeasurable.
If we let B := {(x, y, z) : sup_{A∈C} |(P_2 − Q_1)(A)| = 1}, it can be seen likewise that B must not be measurable. So "Suslin" cannot simply be removed from Corollary 5.3.5 or Theorem 5.3.2. Returning to positive results, an admissible structure can be put on spaces of closed sets. Let (X, d) be a separable metric space and F_0 the collection of all nonempty closed subsets of X. Then F_0 is admissible: there is a countable base for the topology of X, so for some α, all finite unions of sets in the base are in F_α. Thus all open sets are in F_{α+1}. Then all closed sets are in F_{α+2}, since any closed set F is a countable intersection of open sets U_n := {x : d(x, F) := inf_{y∈F} d(x, y) < 1/n}. The topology of X can be metrized by a metric d for which (X, d) is totally
bounded (RAP, Theorem 2.8.2). Assume d is such a metric. Let h_d be the Hausdorff metric: h_d(A, B) := max{sup_{x∈A} d(x, B), sup_{y∈B} d(y, A)} for any two closed sets A, B. Since d is totally bounded, it's easily seen that (F_0, h_d) is separable, since the finite subsets of a countable dense set in X are dense in F_0 for h_d. The Borel σ-algebra of h_d will be called an Effros Borel structure on F_0. (Effros, 1965, proved that for d totally bounded this Borel structure is unique.)
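For finite sets the Hausdorff metric is directly computable; a sketch (names ours):

```python
def hausdorff(A, B, d=lambda x, y: abs(x - y)):
    """h_d(A, B) = max(sup_{x in A} d(x, B), sup_{y in B} d(y, A))
    for nonempty finite sets A, B, where d(x, S) = min over S."""
    def dist_to(x, S):
        return min(d(x, y) for y in S)
    return max(max(dist_to(x, B) for x in A),
               max(dist_to(y, A) for y in B))
```

For example, hausdorff({0.0, 1.0}, {0.0, 2.0}) is 1.0, and on singletons h_d reduces to d itself; this is the metric whose Borel σ-algebra gives the Effros Borel structure above.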
5.3.7 Proposition For any separable metric space X with totally bounded metric d, the Effros Borel structure (of h_d) is admissible on F_0. Also, for any law P on the Borel sets of X, P(·) is measurable for the Effros Borel structure.

Proof The set {(x, F) : x ∈ F ∈ F_0} is closed in X × F_0: if x_n ∈ F_n for all n and (x_n, F_n) → (x, F), then x_n → x in X and h_d(F_n, F) → 0. Then d(x_n, F) → 0 and so x ∈ F. Since both (X, d) and (F_0, h_d) are separable, a Borel set in their product is in the product σ-algebra (RAP, Proposition 4.1.7). For any law P and c ≥ 0, the set {F ∈ F_0 : P(F) ≥ c} is closed for h_d, as follows. If F_n → F for h_d, P(F_n) ≥ c, and ε > 0, let F^ε := {y : d(x, y) < ε for some x ∈ F}. Then F_n ⊂ F^ε for n large enough, so P(F^ε) ≥ c, and letting ε = 1/m ↓ 0 shows P(F) ≥ c. So P(·) is upper semicontinuous and thus Effros measurable. For families of functions, we have the following.
5.3.8 Theorem (a) Let S be a topological space and F a family of bounded real functions on S, equicontinuous at each point of S. Then (f, x) ↦ f(x) is jointly continuous from F × S into ℝ, with the supremum norm ‖f‖_∞ := sup_x |f(x)| on F.
(b) If in addition F is separable for ‖·‖_∞ and S is metrizable as a separable metric space (S, d), then (f, x) ↦ f(x) is jointly measurable for the Borel σ-algebras of ‖·‖_∞ on F and d on S. Thus F is admissible. So is G, the collection of all subgraphs {(x, y) : 0 < y < f(x) or f(x) < y < 0} for f ∈ F.
(c) If, moreover, F with the ‖·‖_∞ distance and its Borel σ-algebra is a Suslin set, then F and G are image admissible Suslin.

Proof Part (a) is immediate. For F, (b) follows from the fact that on a Cartesian product of two separable metric spaces, the Borel σ-algebra of the product topology equals the product σ-algebra of the Borel σ-algebras on each space (RAP, Theorem 4.1.7). For G, the set {(y, z) : 0 < y < z or z < y < 0} is
open and thus Borel in ℝ². Composing its indicator function with z := f(x) thus gives a measurable function. Then (c) follows directly. Strobl (1994, 1995) and Ziegler (1994, 1997a,b) have given special attention to measurability issues for empirical processes.
Example (Adamski and Gaenssler). Let H ⊂ [0, 1] be a nonmeasurable set with λ_*(H) = 0 and λ*(H) = 1, where λ is Lebesgue measure (e.g., RAP, Theorem 3.4.4). Then for each Borel set B ⊂ [0, 1], letting μ(B ∩ H) := λ*(B ∩ H) defines a countably additive probability measure μ on the Borel subsets of H, as a metric space with the usual metric from ℝ (RAP, Theorem 3.3.6). Likewise, λ* gives a probability measure ν on the Borel sets of H^c := [0, 1] \ H. Take the countable product (Ω, M, p) of probability spaces Π_{n=1}^∞ (A_n, M_n, p_n), where for n odd, A_n = H and p_n = μ, while for n even, A_n = H^c and p_n = ν. Such a product of probability spaces always exists (e.g., RAP, Theorem 8.2.2). Let X_n be the nth coordinate on Ω, viewed as a map from Ω into [0, 1]. Then each X_n is measurable and has law U[0, 1], the uniform distribution on [0, 1]. Thus, the X_i are i.i.d. U[0, 1]. Let C be the collection of all finite subsets of H. Let P_n be the empirical measures defined by the given X_i. Then ‖P_n‖_C = 1/2 for n even and ‖P_n‖_C = (n + 1)/(2n) for n odd. On the other hand, if X_i are coordinates on a countable product of copies of ([0, 1], λ), then ‖P_n‖_C is nonmeasurable. This illustrates that C is a pathological class of sets, but the pathology can be obscured if one doesn't use the standard model.
Problems

1. A law Q on ℝ^n is called exchangeable if it is invariant under all permutations of the coordinates. (a) Give an example of an exchangeable law which is not of the form P^n for a law P. (b) Show that the empirical measure P_n is sufficient for any set of exchangeable laws.

2. (continuation) Give an example of a class of non-exchangeable laws on ℝ² for which the empirical measure is still sufficient. Hint: Consider laws with X_2 = 2X_1.

3. An exponential family is a set {P_θ : θ ∈ Θ} of laws on a measurable space (X, A), all dominated by a σ-finite measure μ, having densities dP_θ/dμ = exp(⟨f(θ), g(x)⟩), where Θ is a set, f maps Θ into ℝ^k, g maps X into ℝ^k,
and ⟨·, ·⟩ is the usual inner product in ℝ^k. If P is such an exponential family and P^n := {P^n : P ∈ P} is the corresponding set of laws on X^n, show that Σ_{j=1}^n g(X_j) is a sufficient statistic for P^n.

4. (a) Show that the class C of initial segments I_x in the example at the beginning of Chapter 5 is admissible. Hint: The σ-algebra generated by the sets I_x is not separable, but the set Ω has the smallest possible uncountable cardinality;
thus we can assume Ω is in 1-1 correspondence with a subset of [0, 1] (cf. RAP, Appendix A.3). Use this to show that there exists a separable structure on Ω for which the sets I_x are all measurable. Then, see the discussion after Theorem 5.3.6. (b) A measure space (X, S, μ), where S contains all singletons {x}, is called nonatomic if μ({x}) = 0 for all x ∈ X. Let S be a σ-algebra of subsets of C from part (a) making it admissible, for some σ-algebra B of subsets of Ω containing all singletons. Show that there is no nonatomic probability measure on S.

5. (a) Let (X, E) be a measurable space such that E contains all singletons {x} for x ∈ X and X is uncountable. Suppose that for any nonatomic law P on (X, E), P(A) = 0 or 1 for all A ∈ E. Show that E cannot be countably generated. Hint: Suppose A_k are generators. If for all k there is a law Q_k on E with Q_k(A_k) = 1, find a law Q on E with Q(A_k) = 1 for all k and consider ∩_k A_k. Or, show that the union of those A_k such that Q(A_k) = 0 for all laws Q on E can be deleted. (b) In the remarks before Theorem 5.2.1, prove that T is not countably generated. Hint: Use part (a) and the Hewitt-Savage 0-1 law (RAP, Theorem 8.4.6).
6. Let (S, d) be a separable metric space. Let Y_1, Y_2, ... be Suslin subsets of S, for the Borel σ-algebra on each. Show that ∪_k Y_k and ∩_k Y_k are also Suslin, but the complement of Y_1 may not be, even if S is complete. Hint: See RAP, Section 13.2, Problem 13.2.1 (don't assume the result of the problem). 7. Let (X, A) be a measurable space and V a finite-dimensional vector space of measurable real-valued functions on X. (a) Show that V is image admissible Suslin. Hint: See Theorem 5.3.6. (b) Show that {1_A : A ∈ pos(V)} is also image admissible Suslin.
8. Let (X, A) be a measurable space and let C ⊂ A be image admissible Suslin, that is, {1_B : B ∈ C} is. Let C(k) be the union of all Boolean algebras generated by k or fewer sets in C, as in Theorem 4.2.4. Show that C(k) is also
image admissible Suslin. Hint: Apply Theorem 5.3.6.
9. Let (X, A) be a measurable space such that {x} ∈ A for all x ∈ X. If P is a law on A, call a set A ∈ A a soft atom if P(A) > 0, for all B ⊂ A
with B ∈ A either P(B) = P(A) or P(B) = 0, and P({x}) = 0 for all x ∈ A. (A "hard atom" would be an ordinary atom, namely a singleton {x}
with P({x}) > 0.) (a) If X is a separable metric space with Borel σ-algebra A, show that no
law on A has a soft atom. Hint: Show that a soft atom has a sequence of soft atom subsets, decreasing to a singleton. (b) Give an example of a soft atom. Hint: Let X be uncountable and let A consist of countable sets and their complements. Notes
Notes to Section 5.1. Fisher (1922) invented the idea of sufficiency and Neyman (1935) gave a first form of the factorization theorem. Halmos and Savage (1949) proved the factorization theorem (5.1.2) in the case h ∈ L^1(S, B, λ). Vivid ad hoc concepts such as "kernel" and "chain" in the proof of Lemma 5.1.3
are typical of Halmos's style of writing proofs. Bahadur (1954) removed the restriction on h, by the proof given above in the proof of Theorem 5.1.2. Theorems 5.1.8 and 5.1.9 (sufficiency of S_n) follow from facts given by Neveu (1977, pp. 267-268). I thank Sam Gutmann for showing me the proof actually given in the section and Don Cohn for telling me about Neveu's proof. I do not have references for Theorem 5.1.11, Lemma 5.1.12, or Corollary 5.1.14. Yet, in some sense, it is well known that the empirical distribution function incorporates all the information in an i.i.d. sample (Corollary 5.1.14). Notes to Section 5.2. Theorem 5.2.4, due to Lebesgue, is proved in Natanson (1957, p. 137). Aumann (1961) proved Theorem 5.2.3. The proof here is due to B. V. Rao (1971), except that the "image admissible" part was added in Dudley (1984). Freedman (1966, Lemma (5)) proved that the σ-algebra T mentioned before Theorem 5.2.1 is not countably generated. Notes to Section 5.3. There are several papers on selection theorems. Sainte-Beuve (1974, Theorem 3) gives a selection theorem close to Theorem 5.3.2, which covers Suslin topological spaces. Sainte-Beuve did not use the terminology "(image) admissible Suslin." Originally, as in Dudley (1984, Section 10.3), the definition of image admissible Suslin required that (Ω, A) also be a Suslin space. I thank Uwe Einmahl, who pointed out to me around 1988 that the Suslin assumption on Ω was unnecessary. I also thank Lucien Le Cam for stimulating discussions about this section. Durst and Dudley (1981) gave the example after Theorem 5.3.6.
Effros (1965) is the original paper on the Effros Borel structure. W. Adamski and P. Gaenssler told me in 1992 about the example at the end of the section.
References

*An asterisk indicates a work I have seen discussed in secondary sources but not in the original.
Aumann, R. J. (1961). Borel structures for function spaces. Illinois J. Math. 5, 614-630.
Bahadur, R. R. (1954). Sufficiency and statistical decision functions. Ann. Math. Statist. 25, 423-462.
Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de St.-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Durst, Mark, and Dudley, R. M. (1981). Empirical processes, Vapnik-Chervonenkis classes and Poisson processes. Probab. Math. Statist. (Wroclaw) 1, no. 2, 109-115.
Effros, E. G. (1965). Convergence of closed subsets in a topological space. Proc. Amer. Math. Soc. 16, 929-931.
Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York.
Fisher, Ronald Aylmer (1922). IX. On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London Ser. A 222, 309-368.
Freedman, David A. (1966). On two equivalence relations between measures. Ann. Math. Statist. 37, 686-689.
Gutmann, S. (1981). Unpublished manuscript.
Halmos, Paul R., and Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Ann. Math. Statist. 20, 225-241.
Lehmann, E. L. (1986). Testing Statistical Hypotheses, 2d ed. Repr. 1997, Springer, New York.
Natanson, I. P. (1957). Theory of Functions of a Real Variable, vol. 2, 2d ed. Transl. by L. F. Boron, Ungar, New York, 1961.
Neveu, Jacques (1977). Processus ponctuels. École d'été de probabilités de St.-Flour VI, 1976. Lecture Notes in Math. (Springer) 598, 249-445.
*Neyman, Jerzy (1935). Su un teorema concernente le cosidette statistiche sufficienti. Giorn. Ist. Ital. Attuari 6, 320-334.
Rao, B. V. (1971). Borel structures for function spaces. Colloq. Math. 23, 33-38.
Sainte-Beuve, M.-F. (1974). On the extension of von Neumann-Aumann's theorem. J. Funct. Anal. 17, 112-129.
Strobl, F. (1994). Zur Theorie empirischer Prozesse. Dissertation, Mathematik, Univ. München.
Strobl, F. (1995). On the reversed sub-martingale property of empirical discrepancies in arbitrary sample spaces. J. Theoret. Probab. 8, 825-831.
Ziegler, K. (1994). On functional central limit theorems and uniform laws of large numbers for sums of independent processes. Dissertation, Mathematik, Univ. München.
*Ziegler, K. (1997a). A maximal inequality and a functional central limit theorem for set-indexed empirical processes. Results Math. 31, 189-194.
*Ziegler, K. (1997b). On Hoffmann-Jørgensen-type inequalities for outer expectations with applications. Results Math. 32, 179-192.
6
Limit Theorems for Vapnik-Cervonenkis and Related Classes
6.1 Koltchinskii-Pollard entropy and Glivenko-Cantelli theorems

For central limit theorems over Vapnik-Cervonenkis and certain related classes some good sufficient conditions were proved by Pollard (1982), using the following form of "entropy" or "capacity." Let (X, A) be a measurable space and let F ⊂ L^0(X, A), the space of all real-valued measurable functions on X. Recall that F_F(x) := sup{|f(x)| : f ∈ F} (Section 4.8). Then F_F(x) ≡ ‖δ_x‖_F. A measurable function F ∈ L^0(X, A) with F ≥ F_F is called an envelope function for F. If F_F is A-measurable it is called the envelope function of F. If a law P is given on (X, A), then F_F* for P will be called the envelope function of F for P, defined up to equality P-a.s. Let Γ be the set of all laws on X of the form n^{-1} Σ_{j=1}^n δ_{x(j)} for some x(j) ∈ X, j = 1, ..., n, and n = 1, 2, ..., where the x(j) need not be
distinct. For δ > 0, 0 < p < ∞, and γ ∈ Γ recall (Section 4.8) that if F is an envelope function of F,

(6.1.1)  D_F^{(p)}(δ, F, γ) := sup{m : for some f_1, ..., f_m ∈ F and all i ≠ j, ∫ |f_i - f_j|^p dγ > δ^p ∫ F^p dγ}.

Then log D_F^{(p)}(δ, F, γ) will be called a Koltchinskii-Pollard entropy. Let D_F^{(p)}(δ, F) := sup_{γ∈Γ} D_F^{(p)}(δ, F, γ). Here D_F^{(p)}(δ, F, γ) is a kind of packing number, involving the envelope function F. Logarithms of D_F^{(p)}(δ, F) will appear in Section 6.3.
Let G := {f/F : f ∈ F}, where 0/0 is replaced by 0. Then |g(x)| ≤ 1 for all g ∈ G and x ∈ X. Given F, p, and γ ∈ Γ, let Q(B) := ∫_B F^p dγ / γ(F^p) if γ(F^p) > 0. Then Q := Q_γ is a law, and for 1 ≤ p < ∞,

D_F^{(p)}(δ, F, γ) = D(δ, G, d_{p,Q}),

where (as defined before Theorem 5.2.5) d_{p,Q}(f, g) := (∫ |f - g|^p dQ)^{1/p}. For example, if C is a collection of measurable sets whose union is all of X and F := {1_A : A ∈ C}, the envelope function is F ≡ 1 and D_F^{(p)}(δ, F, γ) = D(δ, F, d_{p,γ}). The next few results will connect D_F^{(p)}(δ, F) with other ways of measuring the size of certain classes F. First we have Vapnik-Cervonenkis classes C of sets with dens(C) ≤ S(C) < +∞ (Corollary 4.1.7):
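The packing quantity in (6.1.1) can be computed by brute force for a small class carried by an empirical measure γ ∈ Γ. Below is a hedged Python sketch (the function name `packing_number` and the half-interval example are ours, not from the text): for indicators of half-intervals with envelope F ≡ 1 and p = 1, pairwise L^1 distances are multiples of 1/n and the largest δ-separated subset is found by exhaustive search.

```python
import itertools
import numpy as np

def packing_number(fvals, env, delta, p=1):
    """D_F^(p)(delta, F, gamma) of (6.1.1), for gamma = the uniform empirical
    measure on the sample underlying the columns of fvals: the largest m such
    that some f_1,...,f_m have pairwise mean|f_i-f_j|^p > delta^p * mean(env^p)."""
    thresh = delta ** p * np.mean(env ** p)
    # pairwise "distances": mean of |f_i - f_j|^p over the sample points
    dist = np.mean(np.abs(fvals[:, None, :] - fvals[None, :, :]) ** p, axis=2)
    for m in range(len(fvals), 1, -1):          # try large subsets first
        for S in itertools.combinations(range(len(fvals)), m):
            if all(dist[i, j] > thresh
                   for i, j in itertools.combinations(S, 2)):
                return m
    return 1

# 10 sample points and the indicators f_k = 1_{x < k/10}, k = 0,...,10
x = (np.arange(10) + 0.5) / 10
fvals = np.array([(x < k / 10).astype(float) for k in range(11)])
env = np.ones_like(x)                            # envelope F == 1

print(packing_number(fvals, env, 0.35))          # -> 3
print(packing_number(fvals, env, 0.05))          # -> 11
```

Shrinking δ lets more functions be packed, in line with the δ^{-pw} growth bounded in Theorem 6.1.2 for VC classes.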
6.1.2 Theorem If C ⊂ A, dens(C) < +∞, 1 ≤ p < ∞, F ∈ L^p(X, A, P), F ≥ 0, and F := {F·1_A : A ∈ C}, then for any w > dens(C) there is a K < ∞ such that

D_F^{(p)}(δ, F) ≤ K δ^{-pw},  0 < δ ≤ 1.

Proof Let γ ∈ Γ and let G be the smallest set with γ(G) = 1. We may assume F(x) > 0 for some x ∈ G, since otherwise for any K ≥ 1,

D_F^{(p)}(δ, F, γ) = 1 ≤ K δ^{-wp}  for 0 < δ ≤ 1.

If C(1), ..., C(m) ∈ C are such that [d_{p,γ}(F·1_{C(i)}, F·1_{C(j)})]^p > δ^p ∫ F^p dγ for i ≠ j, with m maximal, then for Q = Q_γ,

Q(C(i) Δ C(j)) = ∫_{C(i) Δ C(j)} F^p dγ / ∫ F^p dγ > δ^p.

Then by maximality and Theorem 4.6.1, there is a K(w, C) < ∞ such that

D_F^{(p)}(δ, F, γ) ≤ m ≤ D(δ^p, C, d_Q) ≤ K(w, C) δ^{-pw}.

Next, here is a kind of converse to Theorem 6.1.2:
6.1.3 Proposition Suppose 1 ≤ p < ∞, F = {1_B : B ∈ C} for some collection C of sets, F ≡ 1, and for some δ with 0 < δ^p < 1/2, D_F^{(p)}(δ, F) < ∞. Then dens(C) ≤ S(C) < ∞.

Proof First, dens(C) ≤ S(C) by Corollary 4.1.7. Next, suppose C shatters a set G with card G = n = 2^m for some positive integer m. Let γ have mass 1/n at each point of G. Then G has m subsets A(i), i = 1, ..., m, independent for γ, with γ(A(i)) = 1/2, i = 1, ..., m (taking elements of G as strings of
m binary digits, let A(i) be the set where the ith digit is 1). Then for i ≠ k, ∫ |1_{A(i)} - 1_{A(k)}|^p dγ = 1/2. So for each j in (6.1.1) we can take f_j = 1_{A(j)}. Thus

D_F^{(p)}(δ, F) ≥ D_F^{(p)}(δ, F, γ) ≥ m.

For large m this is impossible, so S(C) < ∞.
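The digit construction in this proof can be checked numerically. A small Python sketch (variable names ours, assuming nothing beyond the proof's construction): code G = {0, 1}^m as the integers 0, ..., 2^m - 1, let A(i) be the set where the ith binary digit is 1, and verify γ(A(i)) = 1/2, pairwise independence, and ∫ |1_{A(i)} - 1_{A(k)}| dγ = 1/2 under the uniform law γ.

```python
import itertools
import numpy as np

m = 4
n = 2 ** m                      # card G = 2^m = 16
# row i of A is the indicator vector of A(i) = {x in G : i-th digit of x is 1}
A = np.array([[(x >> i) & 1 for x in range(n)] for i in range(m)], dtype=float)

for i in range(m):
    assert A[i].mean() == 0.5                    # gamma(A(i)) = 1/2
for i, k in itertools.combinations(range(m), 2):
    # digits i and k differ on exactly half of G
    assert np.abs(A[i] - A[k]).mean() == 0.5
    # independence: gamma(A(i) intersect A(k)) = 1/4
    assert (A[i] * A[k]).mean() == 0.25
print("checked m =", m)
```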
Next let us consider families of functions of the form F = {F·g : g ∈ G}, where G is a family of functions totally bounded in the supremum norm ‖g‖_sup := sup_{x∈X} |g(x)|, with the associated metric d_sup(g, h) := ‖g - h‖_sup, and with ‖g‖_sup ≤ 1 for all g ∈ G.

6.1.4 Proposition If 1 ≤ p < ∞ and 0 < ε ≤ 1 then for any such F, D_F^{(p)}(ε, F) ≤ D(ε, G, d_sup).

Proof If d_sup(g, h) ≤ δ then for all x, |Fg - Fh|^p(x) ≤ δ^p F(x)^p, so the result follows from the definition (6.1.1).
Next we come to the Koltchinskii-Pollard method of symmetrization of empirical measures. Given n = 1, 2, ..., let x_1, ..., x_{2n} be coordinates on (X^{2n}, A^{2n}, P^{2n}), hence i.i.d. P. Let σ(1), ..., σ(n) be random variables independent of each other and of the x_i with Pr(σ(i) = 2i) = Pr(σ(i) = 2i - 1) = 1/2, i = 1, ..., n. Let τ(i) := 2i if σ(i) = 2i - 1 and τ(i) := 2i - 1 if σ(i) = 2i. Let x(i) := x_i. Then the x(σ(j)) are i.i.d. P. Let

P_n' := n^{-1} Σ_{j=1}^n δ_{x(σ(j))},   P_n'' := n^{-1} Σ_{j=1}^n δ_{x(τ(j))},
ν_n' := n^{1/2}(P_n' - P),   ν_n'' := n^{1/2}(P_n'' - P),
P_n^0 := P_n' - P_n'',   ν_n^0 := n^{1/2} P_n^0.

Note that P_n'' = 2P_{2n} - P_n' and that ν_n' and ν_n'' are two independent copies of ν_n. Here is a symmetrization fact (a more general one is Lemma 11.2.4):
6.1.5 Symmetrization lemma Let ζ > 0 and F ⊂ L²(X, A, P) with ∫ |f|² dP ≤ ζ² for all f ∈ F. Assume F is image admissible Suslin via some (Y, S, T). Then for any η > 0,

Pr{‖ν_n^0‖_F > η} ≥ (1 - ζ²/η²) Pr{‖ν_n‖_F > 2η}.

Proof Note that the given events are measurable by Corollary 5.3.5. For x = (x_1, ..., x_{2n}) let H := {(x, f) : |ν_n''(f)| > 2η, f ∈ F}. Then by admissibility, H is a product measurable subset of X^{2n} × F. The Suslin property implies (Theorem 5.3.2) that there is a universally measurable selector h such that whenever (x, f) ∈ H for some f ∈ F, h(x) ∈ F and (x, h(x)) ∈ H. Let x_τ := (x_{τ(1)}, ..., x_{τ(n)}). Then for some function h_1, h(x) = h_1(x_τ), where h_1(y) is defined if and only if y ∈ J for some u.m. set J ⊂ X^n (by Theorem 5.3.2). Let T_n be the smallest σ-algebra with respect to which x_τ(·) is measurable. Then on the set where x_τ ∈ J, since ν_n^0 = ν_n' - ν_n'',

Pr(‖ν_n^0‖_F > η | T_n) ≥ Pr(|ν_n'(h_1(x_τ))| ≤ η | T_n).

Given T_n, that is, given x_τ, h_1(x_τ)(·) is a fixed function f ∈ F with ∫ |f|² dP ≤ ζ². Then since ν_n' is independent of T_n, we can apply Chebyshev's inequality to obtain

Pr(|ν_n'(f)| ≤ η) ≥ 1 - (ζ/η)².

Integrating gives Pr{‖ν_n^0‖_F > η} ≥ (1 - ζ²/η²) Pr{‖ν_n''‖_F > 2η}, which, since ν_n'' is a copy of ν_n, gives the result.
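The σ, τ pairing construction is easy to simulate. A hedged Python sketch (names ours): draw x_1, ..., x_{2n} i.i.d. from P = U[0, 1], flip a fair coin in each pair {2i - 1, 2i}, and check the identity P_n'' = 2P_{2n} - P_n' on a set of the form {x ≤ t}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=2 * n)             # x_1, ..., x_2n i.i.d. P = U[0,1]

# sigma(i) picks one index of the pair {2i-1, 2i} (0-based: {2i, 2i+1})
coin = rng.integers(0, 2, size=n)
sigma = 2 * np.arange(n) + coin
tau = 2 * np.arange(n) + 1 - coin       # the other index of each pair

t = 0.3                                  # test on the set A = {x <= t}
Pn1 = np.mean(x[sigma] <= t)             # P_n'(A)
Pn2 = np.mean(x[tau] <= t)               # P_n''(A)
P2n = np.mean(x <= t)                    # P_2n(A)

# P_n'' = 2 P_2n - P_n' as measures, hence on the set A
assert abs(Pn2 - (2 * P2n - Pn1)) < 1e-12
nu0 = np.sqrt(n) * (Pn1 - Pn2)           # nu_n^0(A), typically O(1)
assert abs(nu0) <= np.sqrt(n)
```

The identity holds exactly because σ and τ together enumerate all 2n indices, one pair at a time.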
Some reverse martingale and submartingale properties of the empirical measures P_n will be proved. Recall that Q(f) := ∫ f dQ for any f ∈ L^1(Q), and that in defining empirical measures P_n := n^{-1} Σ_{j≤n} δ_{X_j}, the X_j are always (in this book) taken as coordinates on a product of copies of a probability space (X, A, P), so that the underlying probability measure for P_n is P^n, a product of n copies of P. The definitions of (reversed) (sub)martingale (e.g., RAP, Sections 10.3 and 10.6), for random variables Y_n and σ-algebras A_n, will be slightly extended here by allowing Y_n to be measurable for the completion of A_n.

6.1.6 Theorem Let (X, A, P) be a probability space, let F ⊂ L^1(P), and let P_n be empirical measures for P. Let S_n be the smallest σ-algebra for which P_k(f) are measurable for all k ≥ n and f ∈ L^1(X, A, P). Then: (a) For any f ∈ F, {P_n(f), S_n}_{n≥1} is a reversed martingale; in other words,

E(P_{n-1}(f) | S_n) = P_n(f)  a.s. if n ≥ 2.
(b) (F. Strobl) Suppose that F has an envelope function F ∈ L^1(X, A, P) and that for each n, ‖P_n - P‖_F is measurable for the completion of P^n. Then {‖P_n - P‖_F, S_n}_{n≥1} is a reversed submartingale; in other words,

‖P_n - P‖_F ≤ E(‖P_k - P‖_F | S_n)  a.s. for k ≤ n.
Proof For each n, any set in S_n is invariant under permutations of the first n coordinates X_i, where P_n := n^{-1}(δ_{X_1} + ... + δ_{X_n}). So if 1 ≤ i ≤ j ≤ n, E(f(X_i) | S_n) = E(f(X_j) | S_n). Summing over i = 1, ..., k and dividing by k gives E(P_k(f) | S_n) = E(f(X_1) | S_n); taking k = n - 1 and n proves (a).

Proof of (b): clearly ‖P‖_F ≤ PF < ∞ and ‖P_n‖_F ≤ P_nF. Since by assumption ‖P_n - P‖_F is completion measurable, we have E‖P_n - P‖_F ≤ 2PF < ∞. For each rational q ≥ 0, the set A_q := {‖P_n - P‖_F > q} admits, by completion measurability, sets C_q and D_q which are A^n measurable with C_q ⊂ A_q ⊂ D_q and P^n(D_q \ C_q) = 0. Each image of C_q by an element of Π_n, the group of permutations of the first n coordinates, is also A^n measurable and included in A_q. The union U_q of these n! images, being A^n measurable and symmetric under all n! permutations of 1, ..., n, is S_n measurable by Theorem 5.1.9 and included in A_q. Likewise, the intersection F_q of all images of D_q by transformations in Π_n is S_n measurable and includes A_q. Clearly P^n(F_q \ U_q) = 0, so A_q and ‖P_n - P‖_F are measurable for the completion of S_n as desired.
For i = 1, ..., n + 1 let P_{n,i} := n^{-1} Σ{δ_{X_j} : j ≤ n + 1, j ≠ i}. Then since the X_j are coordinates on a product space, P_{n,i} has the same properties as P_n for each i, and P_{n,n+1} = P_n. Thus ‖P_{n,i} - P‖_F is completion measurable with respect to A^{n+1}, the product σ-algebra for the first n + 1 coordinates. Now, the conditional expectations E(‖P_{n,i} - P‖_F | S_{n+1}) are all the same (equal to each other almost surely) for i = 1, ..., n + 1, so all are a.s. equal to E(‖P_n - P‖_F | S_{n+1}), because for each i, a transformation defined by a permutation of the first n + 1 coordinates takes P_{n,i} into P_n while leaving all sets in the σ-algebra S_{n+1} invariant.
Then for any n = 1, 2, ..., we have

‖P_{n+1} - P‖_F = ‖(n+1)^{-1} Σ_{i=1}^{n+1} (P_{n,i} - P)‖_F ≤ (n+1)^{-1} Σ_{i=1}^{n+1} ‖P_{n,i} - P‖_F,

and

‖P_{n+1} - P‖_F = E(‖P_{n+1} - P‖_F | S_{n+1})
  ≤ (n+1)^{-1} Σ_{i=1}^{n+1} E(‖P_{n,i} - P‖_F | S_{n+1})
  = E(‖P_n - P‖_F | S_{n+1})

a.s., finishing the proof.
Here is a law of large numbers (generalized Glivenko-Cantelli theorem):

6.1.7 Theorem Let (X, A, P) be a probability space, let F ∈ L^1(X, A, P), and let F be a collection of measurable functions on X having F as an envelope function. Suppose F is image admissible Suslin via (Y, S, T). Assume that

(6.1.8)  D_F^{(1)}(δ, F) < ∞ for all δ > 0.

Then lim_{n→∞} ‖P_n - P‖_F = 0 a.s.

Proof By Theorem 6.1.6, {‖P_n - P‖_F, S_n}_{n≥1} is a reversed submartingale. Being nonnegative, it converges almost surely and in L^1 (RAP, Theorem 10.6.4). It will now be enough to show that ‖P_n - P‖_F → 0 in probability. Given ε > 0, take M large enough so that P(F·1_{F>M}) < ε/4. Then

‖(P_n - P)1_{F>M}‖_F ≤ ‖P_n 1_{F>M}‖_F + P(F·1_{F>M}) ≤ (P_n + P)(F·1_{F>M}) → 2P(F·1_{F>M}) < ε/2

almost surely. Replacing each f ∈ F by f·1_{F≤M} takes F onto another class of functions which is still image admissible Suslin, since (x, y) → T(y)(x)·1_{F(x)≤M} is jointly measurable while the Suslin space (Y, S) is unchanged. So we may assume F ≤ M. Next apply Lemma 6.1.5 with ζ = M and η = n^{1/2}ε/4, which is ≥ 2M for n large. Then it is enough to show that ‖P_n^0‖_F = ‖P_n' - P_n''‖_F → 0 in probability.
In the definition of D_F^{(1)}(δ, F, γ) take δ = ε/(9M) and γ = P_{2n}. Choose functions f_1, ..., f_m to satisfy (6.1.1), with

m = D_F^{(1)}(δ, F, γ) ≤ D_F^{(1)}(δ, F).

One can choose the f_i := f_i(x) to depend measurably on x := (x_1, ..., x_{2n}) as follows. For each k ≥ 1, the measurable space (Y, S)^k is Suslin. Let

B_k := {(x, y_1, ..., y_k) : P_{2n}(|T(y_i) - T(y_j)|) > δ P_{2n}(F), i ≠ j}.

Then B_k is product measurable by image admissibility. Let y^{(k)} := (y_1, ..., y_k) and let A_k denote the projection {x : (x, y^{(k)}) ∈ B_k for some y^{(k)}}, k ≥ 2; A_1 := X^{2n}. Then by selection (Theorem 5.3.2), A_k is a u.m. set and there is a universally measurable function η_k from A_k into Y^k such that (x, η_k(x)) ∈ B_k for all x ∈ A_k. Let η(x) := η_k(x) if and only if x ∈ A_k \ A_{k+1}, k ≥ 1. For each x let k(x) be the unique k such that x ∈ A_k \ A_{k+1}. Thus we have the following:
6.1.9 Lemma For each x ∈ X^{2n} and P_{2n} := (2n)^{-1} Σ_{i≤2n} δ_{x_i}, D_F^{(1)}(δ, F, P_{2n}) = k(x), where k(·) is universally measurable and, for k(x) = k, η_k is a universally measurable function from X^{2n} into Y^k such that the functions f_i in (6.1.1) can be taken as T(η_k(x)_i), i = 1, ..., k.
Now for all f ∈ F and m = D_F^{(1)}(δ, F, P_{2n}), min_{j≤m} P_{2n}(|f - f_j|) ≤ δ P_{2n}(F). For each j at which the minimum occurs,

|P_n^0(f) - P_n^0(f_j)| = |P_n'(f - f_j) - P_n''(f - f_j)| ≤ (P_n' + P_n'')(|f - f_j|) = 2P_{2n}(|f - f_j|) ≤ 2δ P_{2n}(F).

As n → ∞, almost surely 2δ P_{2n}(F) → 2δ P(F) < ε/4. Since m remains bounded as n → ∞, we have by Chebyshev's inequality

Pr{max_{j≤m} |P_n^0(f_j)| > ε/4} ≤ m sup{Pr{|P_n^0(f)| > ε/4} : |f| ≤ M} ≤ 2mM²(4/ε)²/n < ε/4
for n large enough, and then Pr{‖P_n^0‖_F > ε/2} < ε/2, proving Theorem 6.1.7.
6.1.10 Corollary Let (X, A) be a measurable space and C ⊂ A, where S(C) < ∞ and C is image admissible Suslin. Then for any probability law P on A, we have

lim_{n→∞} sup_{A∈C} |(P_n - P)(A)| = 0  a.s.

Proof In Theorem 6.1.7, take F ≡ 1 and apply Theorem 6.1.2 with p = 1.
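With C = {(-∞, t] : t ∈ R}, a VC class with S(C) = 1, Corollary 6.1.10 reduces to the classical Glivenko-Cantelli theorem. A hedged Monte Carlo sketch (names ours) computes sup_{A∈C} |(P_n - P)(A)| = sup_t |F_n(t) - t| exactly from order statistics for U[0, 1] samples and checks the roughly n^{-1/2} decay.

```python
import numpy as np

rng = np.random.default_rng(1)

def discrepancy(n):
    """sup_t |F_n(t) - t| for a U[0,1] sample, exact via order statistics."""
    xs = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - xs), np.max(xs - (i - 1) / n))

d50 = np.mean([discrepancy(50) for _ in range(200)])
d5000 = np.mean([discrepancy(5000) for _ in range(200)])
assert d5000 < d50 / 5        # a 100-fold sample gives roughly a 10-fold drop
```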
6.2 Vapnik-Cervonenkis-Steele laws of large numbers

Let (X, A, P) be a probability space and C ⊂ A. Let {x_n}_{n≥1} be coordinates in (X^∞, A^∞, P^∞), so that the x_j are i.i.d. P. For certain classes C with S(C) = +∞, one will have Δ^C({x_1, ..., x_n}) < 2^n with P^n-probability converging to 1. For such classes, with sufficient measurability properties, a law of large numbers will still hold. Here a main result is as follows:

6.2.1 Theorem If (X, A, P) is any probability space, C ⊂ A, and C is image admissible Suslin, then the following are equivalent:
(a) ‖P_n - P‖_C → 0 a.s. as n → ∞;
(b) ‖P_n - P‖_C → 0 in probability as n → ∞;
(c) lim_{n→∞} n^{-1} E log Δ^C({x_1, ..., x_n}) = 0.
For a finite set F ⊂ X and a collection C ⊂ 2^X, let k^C(F) := S({C ∩ F : C ∈ C}), the largest cardinality of a subset of F shattered by C.

6.2.2 Lemma Under the hypotheses of Theorem 6.2.1, Δ^C({x_1, ..., x_n}) and k^C({x_1, ..., x_n}) are universally measurable.

Proof Let C be image admissible Suslin via (Y, S, T). For any decomposition of {1, ..., n} into subsets, there is a measurable set of ordered n-tuples (x_1, ..., x_n) such that x_i = x_j if and only if i and j belong to the same subset. It is enough to prove the lemma on each such measurable set; specifically, on the set on which the x_i are all different. For each J ⊂ {1, ..., n}, the set U(J) := {(x, y) : x_j ∈ T(y) if and only if j ∈ J} is product measurable by image admissibility. Thus its projection Π U(J) into X^n is universally measurable by Theorem 5.3.2. Now

Δ^C({x_1, ..., x_n}) = Σ_J 1_{Π U(J)}(x)

and

(6.2.3)  {k^C({x_1, ..., x_n}) ≥ m} = ∪_{|G|=m} ∩_{H⊂G} ∪_{J: J∩G=H} Π U(J),

the union being over subsets G of {1, ..., n} with m elements.
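For a concrete class, Δ^C and k^C can be computed by brute force. A hedged Python sketch (names ours) for C = {(-∞, t] : t ∈ R}: n distinct points have Δ^C = n + 1, only singletons are shattered, and Sauer's bound Δ^C ≤ Σ_{j≤k^C} C(n, j) holds (here with equality).

```python
import itertools
from math import comb
import numpy as np

def traces(points):
    """distinct intersections {x_i : x_i <= t} with half-lines (-inf, t]"""
    xs = np.asarray(points, dtype=float)
    ts = np.concatenate([xs, [xs.max() + 1.0, xs.min() - 1.0]])
    return {tuple(bool(b) for b in (xs <= t)) for t in ts}

def delta_C(points):
    return len(traces(points))              # Delta^C({x_1,...,x_n})

def k_C(points):
    """largest cardinality of a subset shattered by half-lines"""
    xs = list(points)
    for m in range(len(xs), 0, -1):
        for G in itertools.combinations(xs, m):
            if len(traces(G)) == 2 ** m:    # G is shattered
                return m
    return 0

pts = [0.1, 0.4, 0.7, 0.9]
assert delta_C(pts) == 5                    # n + 1 traces on distinct points
assert k_C(pts) == 1                        # half-lines shatter only singletons
assert delta_C(pts) <= sum(comb(len(pts), j) for j in range(k_C(pts) + 1))
```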
The main step in Steele's proof uses Kingman's subadditive ergodic theorem (the equivalent superadditive ergodic theorem is Theorem 10.7.1 in RAP). To state it, here is some terminology. Let N denote the set of nonnegative integers. A subadditive process is a doubly indexed set {x_{mn}}_{0≤m<n}, m, n ∈ N, of random variables such that

(6.2.4)  x_{kn} ≤ x_{km} + x_{mn}  whenever k < m < n.

Let x_{nn} := 0 for all n ∈ N. If instead of (6.2.4),

(6.2.5)  x_{kn} ≥ x_{km} + x_{mn},  k < m < n,

then {x_{mn}}_{0≤m<n} is called superadditive. A process which is both subadditive and superadditive is additive and can clearly be written as x_{kn} = Σ_{k≤j<n} x_{j,j+1}. The process is called stationary if there is a measure-preserving transformation V of the underlying probability space (Ω, B, Pr) onto itself such that x_{m+1,n+1} = x_{mn} ∘ V for 0 ≤ m < n. Let S be the σ-algebra of all B ∈ B such that V^{-1}(B) = B. Another useful hypothesis for subadditive processes is:

(6.2.6)  For each n ∈ N, E|x_{0n}| < +∞, and K := inf_{n≥1} E x_{0n}/n > -∞.

A σ-algebra D ⊂ B will be called degenerate if Pr(D) = 0 or 1 for all D ∈ D.

6.2.7 Theorem (Kingman's subadditive ergodic theorem) Let {x_{mn}}_{0≤m<n} be a stationary subadditive process satisfying (6.2.6). Then as n → ∞, x_{0n}/n converges a.s. and in L^1 to a random variable y := lim_{n→∞} n^{-1} E(x_{0n} | S), with Ey = K. If S is degenerate, y = K a.s.
Proof Theorem 10.7.1 in RAP applies to the superadditive process there defined as -x_{mn}. From near the end of the proof (RAP, p. 296), we have Ey ≤ E x_{0n}/n for all n and E x_{0n}/n → Ey as n → ∞, so Ey = K. To apply Theorem 6.2.7 in proving Theorem 6.2.1 we have:
6.2.8 Theorem Let (Ω, T, Pr) be a probability space, (X, A) a measurable space, X_1 a measurable function from Ω into X with law P, and V a measure-preserving transformation of Ω onto itself. Let X_j := X_1 ∘ V^{j-1} for j = 2, 3, .... Let C be image admissible Suslin and ∅ ≠ C ⊂ A. Then each of the following is a stationary subadditive process satisfying (6.2.6):
(a) D_{mn} := sup_{A∈C} |Σ_{m<j≤n} (1_A(X_j) - P(A))|;
(b) log Δ^C({X_j : m < j ≤ n});
(c) k^C({X_j : m < j ≤ n}).
Proof We have measurability by Corollary 5.3.5 in (a) and by Lemma 6.2.2 in (b) and (c). Stationarity clearly holds for the same V in each case. Subadditivity is clear in (a) and not difficult for (b) and (c). All three processes are nonnegative; in (b), C nonempty implies Δ^C ≥ 1, so (6.2.6) holds.

Now let us prove Theorem 6.2.1. The quantities ‖P_n - P‖_C are completion measurable by Corollary 5.3.5. The probability space, in our standard model, is a countable Cartesian product of copies of (X, A, P), with coordinates x_1, x_2, .... The measure-preserving transformation will be V({x_j}_{j≥1}) := {x_{j+1}}_{j≥1}. By the Kolmogorov 0-1 law (RAP, 8.4.4), S is degenerate. Then by Theorems 6.2.7 and 6.2.8, we have almost sure limits

(6.2.9)  c_1 := lim_{n→∞} D_{0n}/n,  c_2 := lim_{n→∞} (log Δ^C_{0n})/n,  c_3 := lim_{n→∞} k^C_{0n}/n,

writing Δ^C_{0n} := Δ^C({x_1, ..., x_n}) and k^C_{0n} := k^C({x_1, ..., x_n}), for some constants c_i. Thus in Theorem 6.2.1, (a) and (b) are equivalent (as was also shown in the proof of Theorem 6.1.7 via the reversed submartingale property). It remains to prove (b) equivalent to (c). To show (c) implies (b), given 0 < ε < 1, first apply the symmetrization lemma (6.1.5) with ζ = 1 and η = εn^{1/2}/2 for n large enough so that η ≥ 2, giving

Pr{‖P_n - P‖_C > ε} ≤ 2 Pr{‖P_n' - P_n''‖_C > ε/2}.
Thus it suffices to show that ‖P_n' - P_n''‖_C → 0 in probability (these variables are universally measurable by Corollary 5.3.5). Let A_n be the smallest σ-algebra for which x_1, ..., x_n are measurable. For any fixed set A ∈ A, the conditional probability

Pr_{A,ε,n} := Pr{|(P_n' - P_n'')(A)| > ε | A_{2n}} = Pr{|Σ_{j=1}^n s_j (δ_{x_{2j}} - δ_{x_{2j-1}})(A)| > nε | A_{2n}},

where s_j := 2·1_{σ(j)=2j} - 1 = ±1 with probability 1/2 each, independently of each other (Rademacher variables) and of the x_i. Thus by Hoeffding's inequality, Proposition 1.3.5, with a_i = -1, 0, or 1, Pr_{A,ε,n} ≤ 2 exp(-nε²/2). For some n_0, E log Δ^C_{0,2n} ≤ 2nε⁴ for n ≥ n_0. Then by Markov's inequality,

(6.2.10)  Pr{log Δ^C_{0,2n} > 2nε³} ≤ ε.

Given A_{2n}, the event |(P_n' - P_n'')(A)| > ε is the same for any two sets A having the same intersection with {x_1, ..., x_{2n}}. Thus

Pr{‖P_n' - P_n''‖_C > ε | A_{2n}} ≤ 2 Δ^C_{0,2n} exp(-nε²/2).

Hence, for ε ≤ 1/8 and n ≥ n_0 large enough so that 2 exp(-nε²/4) < ε, we have by (6.2.10)

Pr{‖P_n' - P_n''‖_C > ε} ≤ ε + 2 exp(2nε³ - nε²/2) ≤ ε + 2 exp(-nε²/4) < 2ε,

so (c) implies (b).
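The exponential bound used here, Pr{|Σ_j s_j a_j| > nε} ≤ 2 exp(-nε²/2) for Rademacher signs s_j and coefficients |a_j| ≤ 1 (Hoeffding), can be sanity-checked by simulation; this Python sketch (names ours) compares an empirical tail with the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps, reps = 200, 0.2, 20000
a = rng.choice([-1.0, 0.0, 1.0], size=n)         # fixed coefficients, |a_j| <= 1
s = rng.choice([-1.0, 1.0], size=(reps, n))      # Rademacher signs
tail = np.mean(np.abs(s @ a) > n * eps)          # empirical tail probability
bound = 2 * np.exp(-n * eps ** 2 / 2)            # Hoeffding bound, about 0.037
assert tail <= bound
```

The bound is loose: the true tail here is two orders of magnitude smaller, which is why the union bound over Δ^C sets still leaves room.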
Now let us show (b) implies (c). By (6.2.9) we have n^{-1} log Δ^C_{0n} → c a.s. for some constant c := c_2 ≥ 0, with n^{-1} log Δ^C_{0n} ≤ log 2 ≤ 1. Thus n^{-1} E log Δ^C_{0n} → c and we want to prove c = 0. Suppose c > 0. Given ε > 0, for n large enough,

Pr{(2n)^{-1} log Δ^C_{0,2n} > c/2} = Pr{Δ^C_{0,2n} > e^{nc}} > 1 - ε.

Next, to symmetrize,

Pr{‖P_n' - P_n''‖_C > 2ε} ≤ Pr{‖P_n' - P‖_C > ε} + Pr{‖P_n'' - P‖_C > ε} = 2 Pr{‖P_n - P‖_C > ε}.

So, to contradict (b), it will suffice to prove that ‖P_n' - P_n''‖_C does not converge to 0 in probability. If 2 ≤ k := [αn], where [x] is the largest integer ≤ x and 0 < α < 1/2, then αn ≤ k + 1 ≤ 3k/2,
so by the inequality (4.1.5) and Stirling's formula (Theorem 1.3.13) we have Σ_{j≤k} C(2n, j) ≤ (3e/α)^{αn} whenever

(6.2.11)  n ≥ 2/α  and  (3e/α)^{αn} < e^{nc},

which holds for all large n once α is small enough that α log(3e/α) < c.
Hence by Sauer's lemma (4.1.2), if Δ^C_{0,2n} > e^{nc} then k^C_{0,2n} ≥ [αn]. Fix n satisfying (6.2.11) and let k := [αn]. Now on an event U with Pr(U) > 1 - ε, there is a subset T of the indices {1, ..., 2n} such that card T = k, C shatters {x_i : i ∈ T}, and x_i ≠ x_j for i ≠ j in T. If there is more than one such T, let us select each of the possible T's with equal probability, using for this a random variable η independent of the x_j and σ(j), 1 ≤ j ≤ 2n. Then since the x_j are i.i.d., T is uniformly distributed over its C(N, k) possible values, N := 2n. For any distinct j_1, j_2, ... ∈ {1, ..., 2n} we have, where the following equations are conditional on U,

(6.2.12)  Pr(j_1 ∈ T) = k/N,  Pr(j_1, j_2 ∈ T) = k(k-1)/(N(N-1)),
          Pr(j_i ∈ T, i = 1, 2, 3, 4) = C(N-4, k-4)/C(N, k) = k(k-1)(k-2)(k-3)/(N(N-1)(N-2)(N-3)).

Let M_n be the number of values of j ≤ n such that both 2j - 1 and 2j are in T. Then from (6.2.12),

E M_n = n·k(k-1)/(N(N-1)) = k(k-1)/(2(N-1)) = α²n/4 + O(1),  n → ∞;
E M_n² = E M_n + n(n-1)·k(k-1)(k-2)(k-3)/(N(N-1)(N-2)(N-3)) = (E M_n)² + O(n),

and a bit of algebra gives σ²(M_n) = E M_n² - (E M_n)² = O(n) as n → ∞. Thus for 0 < δ < α²/4, by Chebyshev's inequality, Pr(M_n ≥ δn) ≥ 1 - 2ε for n large. On the event U ∩ {M_n ≥ δn}, let us make a measurable selection (5.3.2)
of a sequence J of [δn] values of i such that J' := ∪_{i∈J} {2i - 1, 2i} ⊂ T. Here M_n and J are independent of the σ(j). Now measurably select, by Theorem 5.3.2 again, a set A = A(ω) = T(y(ω)) ∈ C such that

{j ∈ J' : x_j ∈ A} = {σ(i) : i ∈ J}.

Then Σ_{i∈J} (δ_{x_{σ(i)}} - δ_{x_{τ(i)}})(A) = [δn]. Here A(·) is measurable for the σ-algebra B_J generated by all the x_j, by η, and by σ(i) for i ∈ J. Conditional on B_J,
Σ_{i∉J} (δ_{x_{σ(i)}} - δ_{x_{τ(i)}})(A) = Σ_{i∉J} s_i a_i,
where the a_i are B_J-measurable functions with values -1, 0, and 1, and the s_i have values ±1 with probability 1/2 each, independently of each other and of B_J. Thus by Chebyshev's inequality,

Pr{|Σ_{i∉J} s_i a_i| > nδ/3 | B_J} ≤ 9/(nδ²)

on the event where J is defined. Thus for n large,

Pr{(P_n' - P_n'')(A(ω)) > δ/3} ≥ 1 - 3ε,

so ‖P_n' - P_n''‖_C does not converge to 0 in probability, the desired contradiction. So Theorem 6.2.1 is proved.
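The hypergeometric moments of M_n can be checked by simulation. A hedged Python sketch (names ours): draw T uniformly among the k-subsets of {1, ..., N}, N = 2n, count the complete pairs {2j - 1, 2j} ⊂ T, and compare with E M_n = k(k - 1)/(2(N - 1)).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, reps = 40, 16, 20000                # k = [alpha*n] with alpha = 0.4
N = 2 * n
M = np.empty(reps)
for r in range(reps):
    in_T = np.zeros(N, dtype=bool)
    in_T[rng.choice(N, size=k, replace=False)] = True
    # the pairs {2j-1, 2j} are (0,1), (2,3), ... in 0-based indexing
    M[r] = np.count_nonzero(in_T[0::2] & in_T[1::2])
exact = k * (k - 1) / (2 * (N - 1))       # = 240/158, about 1.52
assert abs(M.mean() - exact) < 0.05
assert M.var() < 3 * n                    # consistent with sigma^2(M_n) = O(n)
```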
6.2.13 Theorem In (6.2.9), if any c_i is 0, all three are 0.
Proof In the last proof, we saw that c_1 = 0 if and only if c_2 = 0, and, just after (6.2.11), that if c_2 > 0 then c_3 > 0. On the other hand, if c_3 > 0 then for some δ > 0, k^C_{0n} ≥ δn for n large enough a.s., and then Δ^C_{0n} ≥ 2^{δn}. Thus
c_2 ≥ c_3 log 2 > 0.

6.3 Pollard's central limit theorem

By way of the Koltchinskii-Pollard kind of entropy and law of large numbers (Section 6.1) the following will be proved:
6.3.1 Theorem (Pollard) Let (X, A, P) be a probability space and F ⊂ L²(X, A, P). Let F be image admissible Suslin via (Y, S, T) and have an envelope function F ∈ L²(X, A, P). Suppose that

(6.3.2)  ∫_0^1 (log D_F^{(2)}(x, F))^{1/2} dx < ∞.

Then F is a Donsker class for P.

The hypothesis (6.3.2) will be called Pollard's entropy condition. Before the proof of the theorem, here is a consequence:
6.3.3 Theorem (Jain and Marcus) Let (K, d) be a compact metric space. Let C(K) be the space of continuous real functions on K with supremum norm. Let X_1, X_2, ... be i.i.d. random variables in C(K). Suppose EX_1(t) = 0 and EX_1(t)² < ∞ for all t ∈ K. Assume that for some random variable M with EM² < ∞,

|X_1(s) - X_1(t)|(ω) ≤ M(ω) d(s, t)

for all ω and all s, t ∈ K. Suppose that

(6.3.4)  ∫_0^1 (log D(ε, K, d))^{1/2} dε < ∞.

Then the central limit theorem holds; that is, in C(K), L(n^{-1/2}(X_1 + ... + X_n)) converges to some Gaussian law.
Proof For a real-valued function h on K recall (RAP, Section 11.2) that

‖h‖_L := sup_{s≠t} |h(s) - h(t)|/d(s, t),  ‖h‖_sup := sup_t |h(t)|,
‖h‖_BL := ‖h‖_L + ‖h‖_sup,  BL(K) := {h ∈ C(K) : ‖h‖_BL < ∞}.

To apply Theorem 6.3.1, take as probability space X = BL(K), and let A be the σ-algebra induced by the Borel sets of ‖·‖_sup (or equivalently by the evaluations at points of K). Let P = L(X_1), F(h) := ‖h‖_BL, h ∈ X. Then for any s ∈ K,

E ‖X_1‖²_sup ≤ 2E(X_1(s)² + M(ω)² sup_t d(s, t)²) < ∞

and E ‖X_1‖²_L ≤ E M(ω)² < ∞, so EF² < ∞ (note that F is measurable, F ∈ L²(X, A, P)). Let G be the collection of functions δ_t/F : h ↦ h(t)/F(h), where we replace h(t)/F(h) by 0 if h = 0 in BL(K), and t runs through K. Then |g(h)| ≤ 1 for all g ∈ G and h ∈ X. Let

F := {Fg : g ∈ G} = {δ_t := (h ↦ h(t)) : t ∈ K}.

For any s, t ∈ K and h ∈ X, |(δ_s/F - δ_t/F)(h)| ≤ d(s, t). Then by Proposition 6.1.4, for 0 < x ≤ 1,

D_F^{(2)}(x, F) ≤ D(x, G, d_sup) ≤ D(x, K, d).

Thus (6.3.2) holds and Theorem 6.3.1 applies to give that F is a Donsker class. Since F is the set of evaluations at points of K, uniform convergence over F (as in the definition of Donsker class) implies uniform convergence of functions on K. Since BL(K) ⊂ C(K), which is complete for uniform convergence, the limiting Gaussian process G_P for our Donsker class must also have sample functions in C(K) (almost surely). Since C(K) is separable, the laws L(n^{-1/2}(X_1 + ... + X_n)) are defined on all Borel sets of C(K) and converge to L(G_P).
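For K = [0, 1] with the usual metric, D(ε, K, d) ≤ 1 + 1/ε, so the entropy condition (6.3.4) holds: the integrand grows only like (log(1/ε))^{1/2} as ε ↓ 0. A hedged numeric sketch (names ours) evaluates the integral by the trapezoid rule.

```python
import numpy as np

# packing numbers of K = [0,1]: at most 1 + 1/eps points pairwise > eps apart
eps = np.linspace(1e-4, 1.0, 200000)
f = np.sqrt(np.log(1.0 + 1.0 / eps))
# trapezoid rule for the entropy integral of (6.3.4)
integral = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(eps)))
assert np.isfinite(integral) and 0.5 < integral < 2.0
```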
Remark. In the situation of Theorem 6.3.3, K may be given originally with a metric e. The metric d may then be chosen, perhaps as a function d = f(e), where f(x) may approach 0 slowly as x ↓ 0, for example f(x) = x^ε for ε > 0 or f(x) = 1/max(|log x|, 2). Thus one can increase the possibilities for obtaining the Lipschitz property of X_1 with respect to d, so long as (6.3.4) holds for d. Now to prove Theorem 6.3.1 we first have:
6.3.5 Lemma Let (X, A, P) be a probability space, F ∈ L²(X, A, P), and F ⊂ L²(X, A, P) having envelope function F. Let H := 4F² and H := {(f - g)² : f, g ∈ F}. Then 0 ≤ φ(x) ≤ H(x) for all φ ∈ H and x ∈ X, and for any δ > 0, D_H^{(1)}(4δ, H) ≤ D_F^{(2)}(δ, F)².
Proof Clearly 0 ≤ φ ≤ H for φ ∈ H. Given any γ ∈ Γ, choose m ≤ D_F^{(2)}(δ, F) maximal and f_1, ..., f_m ∈ F such that (6.1.1) holds with p = 2. For any f, g ∈ F, take i and j such that

max(γ((f - f_i)²), γ((g - f_j)²)) ≤ δ² γ(F²).

Then by the Cauchy-Bunyakovsky-Schwarz inequality,

|γ((f - g)² - (f_i - f_j)²)| = |γ((f - g - f_i + f_j)(f - g + f_i - f_j))|
  ≤ γ((f - f_i - (g - f_j))²)^{1/2} · 4γ(F²)^{1/2} ≤ 8δ γ(F²) = 2δ γ(H).

Thus letting h_{k(i,j)} := (f_i - f_j)², where k(i, j) := mi - m + j, i, j = 1, ..., m, we get an approximation of all functions in H, in the L^1(γ) norm, within 2δ γ(H), by the functions h_k, k = 1, ..., m², which implies the lemma.
Lemma 6.3.5 gives in particular that if D_F^{(2)}(δ, F) < ∞ for all δ > 0 then D_H^{(1)}(ε, H) < ∞ for all ε > 0. Thus hypothesis (6.3.2) lets us apply Theorem 6.1.7, with the F there equal to H.
6.3.6 Proposition Let (X, A, P) be a probability space. Suppose an image admissible Suslin class F of measurable real-valued functions on X has an envelope function F ∈ L²(X, A, P). If D_F^{(2)}(δ, F) < ∞ for all δ > 0, then F is totally bounded in L²(X, A, P).
Proof We may assume ∫ F² dP > 0, as otherwise F is trivially totally bounded. By Lemma 6.3.5 and Theorem 6.1.7 we have

sup{|(P_{2n} - P)((f - g)²)| : f, g ∈ F} → 0  a.s.,  n → ∞.

Also, as n → ∞, ∫ F² dP_{2n} → ∫ F² dP a.s. Given ε > 0, take n_0 large enough and a value of P_{2n}, n ≥ n_0, such that ∫ F² dP_{2n} ≤ 2 ∫ F² dP and

sup{|(P_{2n} - P)((f - g)²)| : f, g ∈ F} < ε/2.

Take 0 < δ < (ε/(4P(F²)))^{1/2} and choose f_1, ..., f_m ∈ F to satisfy (6.1.1) for p = 2 and γ = P_{2n}. Then for each f ∈ F we have, for some j,

∫ (f - f_j)² dP ≤ ε/2 + ∫ (f - f_j)² dP_{2n} ≤ ε/2 + δ² ∫ F² dP_{2n} ≤ ε.
Now to continue the proof of Theorem 6.3.1: the total boundedness gives us condition 3.7.2(II)(a); it remains to check the asymptotic equicontinuity condition 3.7.2(II)(b). For this, given ε > 0, let us apply the symmetrization lemma (6.1.5) to the sets

F_{j,δ} := {f - f_j : f ∈ F, ∫ (f - f_j)² dP < δ²}

with δ > 0 small enough, η = ε/2, and ζ = ε/4. By Theorem 5.2.5 with p = 2, {y ∈ Y : T(y) - f_j ∈ F_{j,δ}} ∈ S, and since a measurable subset of a Suslin space is Suslin with its relative Borel structure (by RAP, Theorem 13.2.1), F_{j,δ} is image admissible Suslin. Thus Lemma 6.1.5 applies and we need only check 3.7.2(II)(b) for the symmetrized ν_n^0 in place of ν_n. To do this we will prove 3.7.2(II)(b) conditionally on P_{2n}, for P_{2n} in a set it belongs to with probability converging to 1 as n → ∞. Via Lemma 6.3.5 and Theorem 6.1.7 again, F_{j,δ} will be treated by way of

{f - f_j : f ∈ F, ∫ (f - f_j)² dP_{2n} < δ²},

which, for each fixed P_{2n}, is image admissible Suslin. Let A_{2n} be the smallest σ-algebra making x = (x_1, ..., x_{2n}) measurable. Then given ε, η > 0, it will be enough to prove

(6.3.7)  Pr{ Pr[ sup{|ν_n^0(f - g)| : f, g ∈ F, ∫ (f - g)² dP_{2n} < δ²} > 3η | A_{2n} ] > 3ε } < 3ε

for δ small enough and n large enough.
212
Limit Theorems for Vapnik-Cervonenkis and Related Classes
Given A_{2n}, that is, given x, let ‖f‖_{2n} := (P_{2n}(f²))^{1/2}. Let δ_i := 2^{−i}, i = 1, 2, …. Choose finite subsets ℱ(1, x), ℱ(2, x), … of ℱ such that for all i and all f ∈ ℱ,

(6.3.8) min{‖f − g‖_{2n} : g ∈ ℱ(i, x)} ≤ δ_i ‖F‖_{2n},

with card(ℱ(i, x)) ≤ D_F^{(2)}(δ_i, ℱ). We can write ℱ(i, x) = {g_{i1}, …, g_{i,k(i,x)}}, where by Lemma 6.1.9 we have g_{im} = T(y_{im}(x)), with k(i, ·) and y_{im}(·) universally measurable in x, and where y_{im}(x) is defined if and only if m ≤ k(i, x).

For each f ∈ ℱ and each i, let f_i := g_{im} ∈ ℱ(i, x) achieve the minimum in (6.3.8), with m minimal in case of a tie. Let A_u^k denote the σ-algebra of universally measurable sets for laws defined on A^k. For each f = T(y) ∈ ℱ and i, we have f_i = T(y)_i = g_{i,m(x,y,i)}, where m(·, ·, i) is A_u^{2n} ⊗ 𝒮 measurable. Thus (u, x, y) ↦ g_{i,m(x,y,i)}(u) is A_u^{2n+1} ⊗ 𝒮 measurable. Hence ν_n°(T(y) − T(y)_j) is A_u^{2n} ⊗ 𝒮 measurable and thus equals some A^{2n} ⊗ 𝒮 measurable function G(x, y) for x ∉ V, where P^{2n}(V) = 0. Then by the selection theorem (5.3.2), sup_y |G(x, y)| is universally measurable in x. Thus for each j,

(6.3.9) sup_y |ν_n°(T(y) − T(y)_j)| is P^{2n}-measurable in x.
Now ‖f_i − f‖_{2n} → 0 as i → ∞ by (6.3.8), and for any fixed r,

f − f_r = Σ_{j>r} (f_j − f_{j−1}) in ‖·‖_{2n}, for each fixed x = (x_1, …, x_{2n}).

Let H_j := log D_F^{(2)}(2^{−j}, ℱ), j = 1, 2, …. The integral condition (6.3.2) is equivalent to Σ_j 2^{−j}H_j^{1/2} < ∞. For all x, card ℱ(j, x) ≤ exp(H_j). Let

η_j := max(jδ_j, (576 P(F²) δ_j² H_j)^{1/2}) > 0.

Then

(6.3.10) Σ_{j≥1} η_j < ∞,

(6.3.11) η_j² ≥ 576 P(F²) δ_j² H_j,

and

(6.3.12) Σ_{j≥1} exp(−η_j²/(288 δ_j² P(F²))) ≤ Σ_{j≥1} exp(−j²/(288 P(F²))) < ∞.
Then

(6.3.13) Pr{ sup_{f∈ℱ} |ν_n°(f − f_r)| > Σ_{j>r} η_j | A_{2n} }
≤ Σ_{j>r} Pr{ sup_{f∈ℱ} |ν_n°(f_j − f_{j−1})| > η_j | A_{2n} }
≤ Σ_{j>r} exp(H_j) exp(H_{j−1}) sup_{f∈ℱ} Pr{ |ν_n°(f_j − f_{j−1})| > η_j | A_{2n} },

since there are at most exp(H_i) possibilities for f_i, i = j − 1, j. For a fixed j and f, let

z_i := (f_j − f_{j−1})(x_{2i}) − (f_j − f_{j−1})(x_{2i−1}).

Then

ν_n°(f_j − f_{j−1}) = n^{−1/2} Σ_{i=1}^n (−1)^{e(i)} z_i,

where e(i) := 1_{σ(i)=2i−1} are random variables taking values 0 and 1 with probability 1/2 each, independently of each other and of the z_i. Then by an inequality of Hoeffding (Proposition 1.3.5),

Pr{ |n^{−1/2} Σ_{i=1}^n (−1)^{e(i)} z_i| > η_j | A_{2n} } ≤ 2 exp(−nη_j²/(2 Σ_{i=1}^n z_i²)).
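Hoeffding's inequality for sums of independent random signs, as used in the display above, can be checked by simulation. The following sketch (an illustration only, with arbitrary fixed values z_i, not part of the proof) compares the empirical tail of n^{−1/2} Σ (−1)^{e(i)} z_i with the bound 2 exp(−nt²/(2Σz_i²)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)      # fixed values z_i, as if conditioned on A_2n
s2 = float(np.sum(z ** 2))
t = 1.5

# nu = n^{-1/2} sum_i (-1)^{e(i)} z_i with e(i) independent fair coin flips
trials = 50_000
signs = rng.choice([-1.0, 1.0], size=(trials, n))
nu = (signs @ z) / np.sqrt(n)

empirical = np.mean(np.abs(nu) > t)
hoeffding = 2.0 * np.exp(-n * t ** 2 / (2.0 * s2))
print(empirical, hoeffding)
assert empirical <= hoeffding
```

The bound is conservative (for Gaussian-sized z_i the true tail is roughly the normal tail), which is why the chaining argument can afford to apply it once per pair (f_j, f_{j−1}).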
Now

Σ_{i=1}^n z_i² ≤ 4n ∫(f_j − f_{j−1})² dP_{2n} ≤ 4n(‖f − f_j‖_{2n} + ‖f − f_{j−1}‖_{2n})²
≤ 4n‖F‖²_{2n}(δ_j + δ_{j−1})² ≤ 72nδ_j²P(F²)

(by (6.3.8) and the few lines after it) on B_n := {‖F‖²_{2n} ≤ 2P(F²)}. Then the last sum in (6.3.13) is less than

Σ_{j>r} exp(2H_j) · 2 exp(−η_j²/(144δ_j²P(F²))) ≤ 2 Σ_{j>r} exp(−η_j²/(288δ_j²P(F²)))
by (6.3.11), and this is < ε for r large enough by (6.3.12). If r is also large enough so that Σ_{j>r} η_j ≤ η, we obtain almost surely on B_n that, by (6.3.13),

(6.3.14) Pr{ sup_{f∈ℱ} |ν_n°(f − f_r)| > η | A_{2n} } < ε.

Next, if ‖f − g‖_{2n} ≤ δ ≤ δ_r P(F²)^{1/2} and ‖F‖²_{2n} ≤ 2P(F²), then by the choice of f_r (after (6.3.8)),

‖f_r − g_r‖_{2n} ≤ ‖f_r − f‖_{2n} + ‖f − g‖_{2n} + ‖g − g_r‖_{2n}
≤ δ_r P(F²)^{1/2} + 2δ_r‖F‖_{2n} ≤ 4δ_r(P(F²))^{1/2}.

Then by the same Hoeffding inequality (1.3.5), and with measurability as in (6.3.9), if δ² ≤ δ_r² P(F²),

(6.3.15) Pr{ sup{|ν_n°(f_r − g_r)| : ‖f − g‖_{2n} ≤ δ, f, g ∈ ℱ} > η | A_{2n} }
≤ (card ℱ(r, x))² · 2 exp(−η²/[8 sup ‖f_r − g_r‖²_{2n}])
≤ 2 exp(2H_r − η²/(128δ_r²P(F²))) ≤ 2 exp(−η²/(256δ_r²P(F²)))

if η² ≥ 512 H_r δ_r² P(F²), and the last bound is < ε for r large enough, on B_n, noting that H_r δ_r² → 0 as r → ∞ (see (6.3.10) and just before it). Clearly Pr(B_n) → 1 as n → ∞. Now for f and g ranging over ℱ,

sup{|ν_n°(f − g)| : ‖f − g‖_{2n} ≤ δ} ≤ sup{ 2|ν_n°(f − f_r)| + |ν_n°(f_r − g_r)| : ‖f − g‖_{2n} ≤ δ }.

Thus from (6.3.14) and (6.3.15) we get (6.3.7) for δ = δ_r P(F²)^{1/2} and r large enough. The event in (6.3.7) is measurable as follows: Y × Y is Suslin, and (x, y, z) ↦ ∫(T(y) − T(z))²dP_{2n} is jointly measurable by admissibility, so the supremum over a measurable subset is universally measurable as in Corollary 5.3.5.
6.3.16 Corollary (Pollard) Let (X, A, P) be a probability space and let ℱ be an image admissible Suslin Vapnik-Cervonenkis subgraph class of functions with envelope F ∈ ℒ²(X, A, P). Then ℱ is a Donsker class for P.
Proof This follows from Theorem 6.3.1 and Theorem 4.8.1 for p = 2.
6.3.17 Corollary Let (X, A, P) be a probability space, F ∈ ℒ²(X, A, P), and ℱ = {F·1_C : C ∈ 𝒞}, where 𝒞 is an image admissible Suslin Vapnik-Cervonenkis class of sets. Then ℱ is a Donsker class for P.

Proof Since F is measurable, the image admissible Suslin property of ℱ follows from that of 𝒞. By Theorem 6.1.2 for p = 2, (6.3.2) holds and Theorem 6.3.1 applies.
6.4 Necessary conditions for limit theorems
Theorems 6.1.2 and 6.3.1 imply that every class 𝒞 ⊂ A with S(𝒞) < +∞ which is image admissible Suslin is a Donsker class, for an arbitrary law P on A. In this section it will be shown that to obtain such a central limit theorem for all P (or even the pregaussian property) for a class 𝒞 of sets, the condition S(𝒞) < +∞ is necessary. Then it will be noted that some measurability, beyond that of ‖P_n − P‖_𝒞, is needed to obtain even a law of large numbers for S(𝒞) < +∞ (6.1.9). Lastly, it will be shown that S(𝒞) < ∞ is necessary for ‖P_n − P‖_𝒞 → 0 in outer probability as n → ∞, uniformly in P.
6.4.1 Theorem Let (X, A) be a measurable space and 𝒞 ⊂ A. Suppose that for all laws P on A, {1_A : A ∈ 𝒞} is a pregaussian class (as defined in Section 3.1). Then S(𝒞) < ∞.
Proof Suppose S(𝒞) = +∞. Then for each n ≥ 1, 𝒞 shatters some set F_n with card(F_n) = 4^n. Let G_n := F_n \ ∪_{j<n} F_j; the G_n are disjoint and card(G_n) ≥ 2^n. Choose E_n ⊂ G_n with card(E_n) = 2^n, and let P be the law with P({x}) := 6/(π²n²2^n) for each x ∈ E_n, n = 1, 2, …, so that P is a probability measure. Since E_n ⊂ F_n, 𝒞 shatters each E_n, and there is a countable 𝒟 ⊂ 𝒞 which shatters each E_n. Given 0 < K < ∞ and n, let

ε_1 := {|W_P(B)| ≤ 2K for all B ⊂ E_n},

ε_2 := {|W_P(B)| > 2K for some B ⊂ E_n, and for all such B and all C ∈ 𝒟 with C ∩ E_n = B, |W_P(C \ E_n)| > K}.
Then {|W_P(C)| ≤ K for all C ∈ 𝒟} ⊂ ε_1 ∪ ε_2. Let S_n := Σ_{x∈E_n} |W_P({x})|. Then sup{|W_P(B)| : B ⊂ E_n} ≥ S_n/2, so ε_1 ⊂ {S_n ≤ 4K}. For each x ∈ E_n, W_P({x}) has a Gaussian law with mean 0 and variance 6/(π²n²2^n) =: σ_n², so E|W_P({x})| = (2/π)^{1/2}σ_n and var(|W_P({x})|) = σ_n²(1 − 2/π). Thus ES_n = 2^n(2/π)^{1/2}σ_n = (2^n · 12/π³)^{1/2}/n. Since W_P has independent values on disjoint sets, var(S_n) = 6π^{−2}n^{−2}(1 − 2/π) ≤ 1/n². For n large, ES_n > 4K, and then by Chebyshev's inequality,

Pr{S_n ≤ 4K} ≤ Pr{|S_n − ES_n| ≥ ES_n − 4K} ≤ 1/(n²(ES_n − 4K)²)
= 1/(((2^n · 12/π³)^{1/2} − 4Kn)²) =: f(n, K) → 0 as n → ∞.

Turning to ε_2, let t(n) := 2^{2^n} and let the subsets of E_n be B_1, …, B_{t(n)}. Let M_0 := ∅ and recursively for j ≥ 1, M_j := {|W_P(B_j)| > 2K} \ ∪_{0≤i<j} M_i. Let

D_j := M_j ∩ {for all C ∈ 𝒟 such that C ∩ E_n = B_j, |W_P(C \ E_n)| > K}.

For any set A, W_P(A) has a Gaussian distribution with mean 0 and variance P(A). Let Φ(x) := (2π)^{−1/2} ∫_{−∞}^x exp(−t²/2) dt. Then since W_P has independent values on disjoint sets,

Pr(D_j) = Pr(M_j) Pr{for all C ∈ 𝒟 with C ∩ E_n = B_j, |W_P(C \ E_n)| > K} ≤ Pr(M_j) · 2Φ(−K).

Now ε_2 ⊂ ∪_{1≤j≤t(n)} D_j, so, the M_j being disjoint,

Pr(ε_2) ≤ Σ_{1≤j≤t(n)} Pr(M_j) · 2Φ(−K) ≤ 2Φ(−K).

Hence

Pr{|W_P(C)| ≤ K for all C ∈ 𝒟} ≤ f(n, K) + 2Φ(−K).

Letting n → +∞ and then K → +∞, we see that W_P is a.s. unbounded on 𝒟 ⊂ 𝒞. Note that G_P(·) can be written as W_P(·) − P(·)W_P(X), so G_P is also a.s. unbounded on 𝒞.
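The first step of the argument can be simulated (an illustration only, not part of the proof): with 2^n independent half-normal terms of variance σ_n² = 6/(π²n²2^n), as in the proof, S_n concentrates near ES_n = (2^n · 12/π³)^{1/2}/n, which already dwarfs 4K at moderate n:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, trials = 15, 1.0, 200

sigma_n = np.sqrt(6.0 / (np.pi ** 2 * n ** 2 * 2 ** n))
ES_n = np.sqrt(2 ** n * 12.0 / np.pi ** 3) / n   # = 2^n (2/pi)^{1/2} sigma_n

# S_n = sum over the 2^n points x of E_n of |W_P({x})|, with the W_P({x})
# independent centered Gaussians of variance sigma_n^2
S = np.abs(sigma_n * rng.standard_normal((trials, 2 ** n))).sum(axis=1)

print(ES_n, S.mean())
assert abs(S.mean() - ES_n) < 0.1 * ES_n
assert np.all(S > 4 * K)   # Pr{S_n <= 4K} is already negligible at n = 15
```

Since var(S_n) ≤ 1/n² while ES_n grows like 2^{n/2}/n, the Chebyshev bound f(n, K) is minute for such n.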
6.4.2 Remark Now let us recall the example at the beginning of Chapter 5, where 𝒞 is the collection of all countable initial segments of an uncountable well-ordered set (X, <) and P is a continuous law on some σ-algebra A containing all countable subsets of X. Then S(𝒞) = 1, but sup_{A∈𝒞} |(P_n − P)(A)| = 1 for all n; thus the latter random variable is measurable. For this class the weak law of large numbers, and hence the strong law and the central limit theorem, all fail as badly as possible. This shows that in Theorem 6.1.7 and Corollary 6.1.9, the "image admissible Suslin" condition cannot simply be removed, nor can it be replaced by simple measurability of the random variables appearing in the statements of the results. Further, for all A ∈ 𝒞, 1_A = 0 a.s. (P). So vanishing a.s. (P), even with S(𝒞) = 1, does not imply a law of large numbers.
6.4.3 Remark If X is a countably infinite set and A = 2^X, then for an arbitrary law P on A, lim_{n→∞} sup_{A∈A} |(P_n − P)(A)| = 0 a.s., but S(A) = +∞, so the hypothesis of 6.4.1 cannot be weakened to a law of large numbers for all P.

Next it will be seen that the Vapnik-Cervonenkis property is also necessary for a law of large numbers to hold uniformly in P, or for there to exist an estimator of P based on X_1, …, X_n (which might or might not equal P_n) converging to P uniformly over 𝒞 and uniformly in P. Here are some definitions.

Let (X, ℬ) be a measurable space and let its n-fold Cartesian product be (X^n, ℬ^n). Let 𝒫 be the class of all probability measures on (X, ℬ). Let 𝒞 ⊂ ℬ be any collection of measurable sets. A real-valued function T_n on X^n × 𝒞 will be called a 𝒞-estimator if it is a stochastic process indexed by 𝒞; in other words, for each A ∈ 𝒞, x ↦ T_n(x, A) is measurable on X^n. A 𝒞-estimator T_n will be called an estimator if for each x there is a probability measure on ℬ which equals T_n(x, ·) on 𝒞.

For any probability measure P on (X, ℬ), take the product law P^n on (X^n, ℬ^n) for x = (X_1, …, X_n), so that the X_i are i.i.d. (P). We would like T_n to be a good approximation to P with probability → 1 as n → ∞. The goodness of the approximation will be measured by the loss function L(T_n, P, 𝒞) := ‖T_n − P‖_𝒞. From it we get the risk r(T_n, P, 𝒞) := E_P L(T_n, P, 𝒞)*, where E_P denotes expectation with respect to P^n. For any class Q of laws, let r(T_n, Q, 𝒞) := sup{r(T_n, P, 𝒞) : P ∈ Q}, and let r_n(Q, 𝒞) be the minimax risk, that is, the infimum of r(T_n, Q, 𝒞) over all 𝒞-estimators T_n. In finding minimax risks for 𝒞-estimators we can assume that T_n takes values in [0, 1], since max(0, min(T_n, 1)) clearly has risks no larger than those of T_n.
For any 𝒞 and Q, clearly 0 ≤ r_n(Q, 𝒞) ≤ 1 and r_n is nonincreasing in n. The following theorem holds for 𝒞-estimators and so a fortiori for estimators, whose values are probability measures. It is mainly due to P. Assouad (see the Notes to this section).
6.4.4 Theorem Let 𝒫 be the class of all probability measures on a sample space (X, ℬ) and let 𝒞 ⊂ ℬ. If the minimax risk r_n(𝒫, 𝒞) < 1/2 for some n, then 𝒞 is a Vapnik-Cervonenkis class.
Proof Suppose S(C) = +oo. Given n, for any m > n let F be a set with 2m elements, shattered by C. Let D C C be a collection of 22m sets which shatters F. Then II Ilc > II Ilc > II 11-D, and since D is finite, II Tn - P11 E) will be measurable for any C-estimator Tn and any P. Let Tn be a C-estimator defined on Xn. Let r be a ("prior") probability distribution on the set (finitedimensional simplex) of all probability laws on F, with mass 1/( 2m) at each law uniformly distributed over an m-element subset of F. Then the maximum risk is bounded below by the risk for r and D,
suppEpllTn - Pllc > E,tEpIITn - PIID For each n-tuple x = (xl, , xn) E Fn, let ranx denote the range of x, ran x = [XI, , xn }, and let k = k(x) be the number of distinct xi, which is the cardinality of ran x.
Let µ be the distribution of x in Fn averaged with respect to 7r, s f Pndjr(P). Thenforeachx E Fn, letlr(x) := 7rx be the posterior distribution defined by n given x, namely the law 7rx giving mass 1/(2m k) to each law uniform on a set of m elements of F including ran x. Then for any C-estima-
tor T, PIID = EN,E,,(x)Il Tn - PIID
Let A:= 2F. For each A E A, there is a unique DA E D with DA fl F = A. For any P in the support of ir, P(DA) = P(A). Let V, (x, A) := Tn(x, DA), an A-estimator. Then II Tn - P 11 D > 11 Vn - P II A, and we want to find a lower
bound for E,r(x)11V - P11A for each P and x fixed where V (A) := Vn (x, A).
Let y(k) := 7rx. If PO is any fixed law in the support of y(k), and if r is uniformly distributed over the group G of all (2m - k)! permutations of F which equal the identity on ran x (and thus permute the other 2m - k elements of F), then PO o r-1 will have distribution y(k). Let T[A] :_ {r(y) : y E A} for any set A. Each permutation r defines a 1-1 transformation A i-+ T[A] of
𝒜 onto itself. Let (V ∘ τ)(A) := V(τ[A]). Then we have

E_{γ(k)} ‖V − P‖_𝒜 = (2m − k)!^{−1} Σ_{τ∈G} ‖V − P_0 ∘ τ^{−1}‖_𝒜 = (2m − k)!^{−1} Σ_{τ∈G} ‖V ∘ τ − P_0‖_𝒜,

which by the triangle inequality is bounded below by

‖V̄ − P_0‖_𝒜, where V̄ := (2m − k)!^{−1} Σ_{τ∈G} V ∘ τ.

Now let 𝒜(m) := {A ∈ 𝒜 : ran x ⊂ A and |A| = m}. For any A, B ∈ 𝒜(m) there is a τ ∈ G with τ[A] = B. It follows that V̄, restricted to 𝒜(m), is a constant, say v. The support of P_0 is a set in 𝒜(m), of P_0-measure 1, while for another, suitably chosen set C ∈ 𝒜(m), P_0(C) = k/m. Thus

‖V̄ − P_0‖_{𝒜(m)} ≥ max(|1 − v|, |k/m − v|) ≥ (1 − k/m)/2 ≥ 1/2 − n/(2m).

This last bound, uniform in k, holds after integration with respect to μ. For a given n, letting m → ∞, we see that the minimax risk for ‖·‖_𝒞 is at least 1/2.
A corollary of Theorem 6.4.4, taking T_n as the empirical measure P_n, is:

6.4.5 Theorem (Assouad) If (X, ℬ) is a measurable space and 𝒞 ⊂ ℬ is a uniform Glivenko-Cantelli class of sets, that is, sup_P E_P ‖P_n − P‖_𝒞* → 0 as n → ∞, where the supremum is over all probability laws on (X, ℬ), then 𝒞 is a Vapnik-Cervonenkis class.

Now we'll see that the constant 1/2 in Theorem 6.4.4 is sharp:

6.4.6 Proposition For the class 𝒫 of all probability measures on a sample space (X, ℬ) there is always a ℬ-estimator T, not depending on n or x, with r(T, 𝒫, ℬ) ≤ 1/2, so that for any Q ⊂ 𝒫, 𝒞 ⊂ ℬ, and n we have r_n(Q, 𝒞) ≤ 1/2. Moreover, when (X, ℬ) is the unit interval [0, 1] with the Borel σ-algebra, there is a class 𝒞 which is not a Vapnik-Cervonenkis class and for which T on 𝒞 is given by a probability measure, so that T is an estimator, not only a 𝒞-estimator.
Proof A 𝒞-estimator T (not depending on n or x) is defined by T(x, A) := 1/2 for all A ∈ ℬ. Then r(T, Q, 𝒞) ≤ 1/2 for any Q and 𝒞, so r_n(Q, 𝒞) ≤ 1/2 for all n.

For Lebesgue measure λ on [0, 1], let 𝒞 := {C_k}_{k≥1} be an infinite sequence of sets independent with probability 1/2: specifically, let C_k be the set where the kth binary digit is 1. Recall that sets C_i, i = 1, …, k, are called "independent as sets" if for any set J ⊂ {1, …, k}, the intersection of all the C_i for i ∈ J and of the complements of the C_i for i ∉ J is nonempty. In other words, the (Boolean) algebra generated by the C_i has the maximum number, 2^k, of possible atoms. Clearly the C_k are independent in [0, 1]. For any m = 1, 2, …, a collection of 2^m sets independent in a set X shatters some subset with m elements, by Theorem 4.6.2. So 𝒞 is not a Vapnik-Cervonenkis class. On the other hand, λ(C_k) = 1/2 for all k, so the estimator given by the probability measure λ agrees with T on 𝒞.
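The construction can be illustrated on a finite model (a sketch with hypothetical helper names, not from the text): on X = {0, …, 2^k − 1} the sets C_i of numbers whose ith binary digit is 1 are independent as sets, and already 2^m = 4 of them shatter some m = 2 element subset:

```python
from itertools import combinations

k = 4                      # 2^m independent sets, with m = 2
X = range(2 ** k)          # ground set {0, ..., 15}
C = [{x for x in X if (x >> i) & 1} for i in range(k)]  # C_i: ith binary digit is 1

# "Independent as sets": every membership pattern is realized, i.e. the
# algebra generated by the C_i has the maximum number 2^k of atoms.
atoms = {tuple(x in Ci for Ci in C) for x in X}
assert len(atoms) == 2 ** k

def shatters(sets, points):
    traces = {frozenset(s & set(points)) for s in sets}
    return len(traces) == 2 ** len(points)

shattered = [p for p in combinations(X, 2) if shatters(C, p)]
print(shattered[0])        # some 2-element subset shattered by C_1, ..., C_4
assert shattered
```

For instance {3, 5} is shattered: 3 has binary digits {0, 1} and 5 has {0, 2}, so C_0 cuts out both points, C_1 only 3, C_2 only 5, and C_3 neither.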
Next, it will be shown that one of the hypotheses of Theorem 6.1.7, namely the existence of an integrable envelope function, is essentially necessary for a law of large numbers (Glivenko-Cantelli property). This is related to the fact that for i.i.d. real random variables Y_1, …, Y_n, …, (Y_1 + … + Y_n)/n converges a.s. to a finite limit if and only if E|Y_1| < ∞ (RAP, Theorem 8.3.5).
6.4.7 Theorem If (S, ℬ, P) is a probability space, ℱ ⊂ ℒ¹(S, ℬ, P), ‖P‖_ℱ < ∞, and ℱ is a strong Glivenko-Cantelli class for P, that is, ‖P_n − P‖_ℱ* → 0 a.s., then ℱ has an integrable envelope function: E‖δ_{X_1}‖_ℱ* < ∞.

Proof We have, as n → ∞, ‖P_n − P‖_ℱ* → 0 a.s., ‖P_{n−1} − P‖_ℱ* → 0 a.s., ‖n^{−1}(P_{n−1} − P)‖_ℱ* → 0, and ‖P/n‖_ℱ → 0, so ‖δ_{X_n}/n‖_ℱ* → 0 a.s. Let F(x) := ‖δ_x‖_ℱ = sup_{f∈ℱ} |f(x)| for x ∈ S. Let S_j be copies of S whose Cartesian product S^∞ is taken in the standard model. For each n write S^∞ = S_n × Π_{j≠n} S_j. Then Lemma 3.2.4 implies that ‖δ_{X_n}/n‖*, where ‖·‖* is with respect to P^∞ on S^∞, equals F*(X_n)/n. The random variables F*(X_n) are i.i.d. Thus by the Borel-Cantelli lemma, Σ_{n=1}^∞ P^∞(F*(X_n) > n) < ∞, so Σ_{n=1}^∞ P(F* > n) < ∞, which implies EF*(X_1) < ∞ (RAP, Lemma 8.3.6).
**6.5 Inequalities for empirical processes

This section gives no proofs, but gives references. Recall that for the Brownian bridge y_t, 0 ≤ t ≤ 1, and 0 < M < ∞, P(sup_{0≤t≤1} y_t ≥ M) = exp(−2M²) (RAP, Proposition 12.3.3), and so

P(sup_{0≤t≤1} |y_t| > M) ≤ 2 exp(−2M²),

where as M → ∞ the probability on the left is asymptotic to the bound on the right (RAP, Proposition 12.3.4). Dvoretzky, Kiefer and Wolfowitz (1956) showed that for any law on ℝ with distribution function F and empirical distribution functions F_n,

Pr{sup_x n^{1/2} |(F_n − F)(x)| > M} ≤ K exp(−2M²)

for a constant K not depending on n, F, or M. Massart (1990) showed that the inequality holds with K = 2, which then is clearly best possible as n → ∞, since it is best for the Brownian bridge.

Let a VCM class 𝒞 of sets be a Vapnik-Cervonenkis class satisfying the image admissible Suslin measurability condition. Some measurability condition is needed, as shown by the examples at the beginning of Chapter 5. Results are stated here on VCM classes for convenience; the authors quoted used various measurability conditions, some weaker, some stronger. For a class ℱ of functions integrable for a law P, let

D_n(ℱ) := n^{1/2} ‖P_n − P‖_ℱ := sup_{f∈ℱ} n^{1/2} |(P_n − P)(f)|.

Any quantity defined for classes ℱ of measurable functions is also defined for classes 𝒞 of sets by letting ℱ := {1_A : A ∈ 𝒞}. If (X, A) is a measurable space and ℱ is a uniformly bounded class of real-valued measurable functions on X, so that ℱ has a finite constant envelope function C ≥ F, let

π(n, ℱ, M) := sup_P Pr(D_n(ℱ) > M),

where the supremum is over all laws P on (X, A). Vapnik and Cervonenkis (1974, Theorem 10.2) showed that for any VCM class 𝒞 of measurable sets,

π(n, 𝒞, M) ≤ 6 m^𝒞(2n) exp(−M²(1 − 1/n)/4).

Since 𝒞 is a VC class, m^𝒞(2n) is bounded above by a polynomial in n of degree S(𝒞) (Sauer's lemma, Theorem 4.1.2, and Proposition 4.1.5). Devroye (1982) proved that

π(n, 𝒞, M) ≤ 4 m^𝒞(n²) exp(−2M² + 4Mn^{−1/2} + 4M²n^{−1}).

Alexander (1984) and Massart (1983, 1986) showed that for any VCM class 𝒞 and ε > 0, π(n, 𝒞, M) ≤ K exp(−(2 − ε)M²) for a large enough constant K = K(ε, S(𝒞)). Here, unlike the case of the Dvoretzky-Kiefer-Wolfowitz inequality, 2 − ε cannot be replaced by 2. Let

π(∞, ℱ, M) := sup_P Pr(‖G_P‖_ℱ > M).
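The Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant K = 2 can be checked by simulation; here is a sketch (an illustration only; the sample size, trial count, and threshold are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, M = 100, 50_000, 1.0

# KS statistic sqrt(n) * sup_x |F_n - F|(x) for U[0,1] samples; the sup is
# attained at the order statistics U_(1) <= ... <= U_(n).
u = np.sort(rng.uniform(size=(trials, n)), axis=1)
grid = np.arange(1, n + 1) / n
ks = np.sqrt(n) * np.maximum(np.abs(grid - u),
                             np.abs(u - (grid - 1.0 / n))).max(axis=1)

empirical = np.mean(ks > M)
massart = 2.0 * np.exp(-2.0 * M ** 2)
print(empirical, massart)
assert empirical <= massart
```

For M = 1 the bound 2exp(−2) ≈ 0.271 is close to the limiting Brownian-bridge tail, so the empirical frequency comes out just below it, illustrating why the constant 2 cannot be improved as n → ∞.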
Any upper bound for π(n, 𝒞, M) that holds for all finite n, uniformly in n, will also hold for π(∞, ℱ, M) for any countable class ℱ, by the finite-dimensional central limit theorem, and thus for any class ℱ such that ‖·‖_ℱ = ‖·‖_𝒢 for a countable 𝒢 ⊂ ℱ, as is true for many classes ℱ. Consider inequalities of the form

(6.5.1) π(∞, ℱ, M) ≤ C M^γ exp(−2M²)

and

(6.5.2) sup_n π(n, ℱ, M) ≤ C M^γ exp(−2M²),

where C is a constant possibly depending on γ and ℱ. For the VCM class 𝒪(d) of orthants O_y := {x : x_j ≤ y_j, j = 1, …, d} in ℝ^d, for all y ∈ ℝ^d, the smallest possible value of γ in (6.5.1) is 2(d − 1), by results of Goodman (1976) for d = 2, Massart (1983, 1986, Theorem A.1), Cabaña (1984, Section 3.2), and Adler and Brown (1986). For d ≥ 2 it follows that γ in (6.5.1) and (6.5.2) cannot be taken to be 0. Adler and Samorodnitsky (1987, Example 3.2 p. 1347) found that for the VCM class ℬ(d) of rectangular blocks parallel to the axes in ℝ^d, one has γ = 2(2d − 1) as the precise value in (6.5.1). Here S(𝒪(d)) = d and S(ℬ(d)) = 2d (Corollary 4.5.11). So for these classes of sets, γ = 2(S(𝒞) − 1) is optimal in (6.5.1). For the VCM class ℋ(2) of all open half-planes in ℝ², where S(ℋ(2)) = 3 by Theorem 4.2.1, and for P uniform on the unit square, Adler and Samorodnitsky (1987, Example 3.3 p. 1349) showed that γ ≤ 2, so γ is smaller than the other examples would suggest.

On the other hand, for S(𝒞) = 1, γ can be 1 > 2(S(𝒞) − 1) = 0, as follows. In the unit circle S¹ := {(cos θ, sin θ) : 0 ≤ θ < 2π} ⊂ ℝ², let HC be the class of half-open half-circles H_t := {(cos θ, sin θ) : t ≤ θ < t + π} for 0 ≤ t < π. Then it is easy to check that S(HC) = 1. Let P be the uniform law on S¹, dP(θ) = dθ/(2π) for 0 ≤ θ < 2π. Let X_t := G_P(H_t). Then X_t, −∞ < t < ∞, is a stationary Gaussian process, periodic of period 2π. Here γ = 1 is the precise exponent in (6.5.1) by a result of Pickands (1969, Lemma 2.5). See also Leadbetter, Lindgren and Rootzén (1983, Theorem 12.2.9) and Adler (1990, pp. 117-118). The class HC does not contain the empty set, and HC ∪ {∅} shatters some 2-element sets. Smith and Dudley (1992) showed that there exists a VCM class 𝒞 with S(𝒞) = 1 and ∅ ∈ 𝒞 such that γ = 1 is optimal in (6.5.1). Here 𝒞 has a treelike partial ordering by inclusion, as in Section 4.4. Returning to general cases, Samorodnitsky (1991) showed that for any VCM class 𝒞, (6.5.1) holds for any γ > 2S(𝒞) − 1. By way of a theorem of Haussler stated in Section 4.9.2, (6.5.1) also holds for γ = 2S(𝒞) − 1.
For empirical processes, upper bounds for exponents γ have been harder to obtain than for Gaussian processes, while lower bounds in (6.5.1) also provide them for (6.5.2). Massart (1986, Theorem 3.3.1°(a)) proved (6.5.2) for any VCM class of sets, and also for VC subgraph classes ℱ of functions with values in [0, 1], for any γ > 6S(𝒞), where 𝒞 is the class of subgraphs of functions in ℱ and is VCM. Talagrand (1994) showed that for any VCM class 𝒞 one can take γ = 2S(𝒞) − 1 in (6.5.2). Talagrand also proved similar bounds for other, not necessarily VC, classes. The half-circle and tree examples show that for S(𝒞) = 1 Talagrand's bound is precise, so the best value of γ in (6.5.1) and in (6.5.2) for VCM classes 𝒞 with S(𝒞) = 1 is γ = 1. At this writing, the problem of finding optimal or useful constants C in (6.5.1) and (6.5.2) for general VCM classes with given S(𝒞) seems to be open.

Another direction for inequalities is what is called the concentration of measure phenomenon. Ledoux (1996) gives an exposition. Here is one of the many results. Let P be the standard normal measure N(0, I) on ℝ^d. Let f be a real-valued Lipschitz function on ℝ^d, so that

‖f‖_L := sup_{x≠y} |f(x) − f(y)|/|x − y| < ∞.

Then for any r > 0,

P(|f − ∫f dP| ≥ r) ≤ 2 exp(−r²/(2‖f‖_L²))

(Ledoux, 1996, (2.9) p. 181). Notably, the dimension d does not appear in the inequality. Concentration inequalities of similar form have been proved for spheres and some other Riemannian manifolds and for product spaces. Two major works for product spaces are Talagrand (1995, 1996a).
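The Gaussian concentration inequality can be illustrated numerically with the 1-Lipschitz function f(x) = |x| (the Euclidean norm); the sketch below (illustration only, parameters arbitrary) shows the bound holding with no dependence on d:

```python
import numpy as np

rng = np.random.default_rng(2)
d, trials, r = 50, 100_000, 0.8

# f(x) = |x| (Euclidean norm) is Lipschitz on R^d with ||f||_L = 1.
x = rng.standard_normal((trials, d))
f = np.linalg.norm(x, axis=1)
mean_f = f.mean()                    # stand-in for the integral of f dP

tail = np.mean(np.abs(f - mean_f) > r)
bound = 2.0 * np.exp(-r ** 2 / 2.0)  # 2 exp(-r^2 / (2 ||f||_L^2))
print(d, tail, bound)
assert tail <= bound                 # no dependence on the dimension d
```

In fact the fluctuations of |x| around its mean have variance about 1/2 for any large d, so the dimension-free bound is not far from the truth for this f.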
**6.6 Glivenko-Cantelli properties and random entropy

This section is also a survey, giving references rather than proofs. Let (X, A, P) be a probability space and ℱ a set of measurable real-valued functions on X. Recall that ℱ is called a weak (resp. strong) Glivenko-Cantelli class for P if both ℱ ⊂ ℒ¹(X, A, P) and ‖P_n − P‖_ℱ → 0 in outer probability (resp. almost uniformly) as n → ∞. Also, ℱ is called order bounded for P if it has an integrable envelope function, EF* < ∞. Let ℱ_{0,P} := {f − Pf : f ∈ ℱ}. Talagrand (1987a, Theorem 22) proved the following:

Theorem A (Talagrand) For a probability space (X, A, P) and a class ℱ ⊂ ℒ¹(X, A, P), the following are equivalent (as n → ∞):

(a) ℱ is a strong Glivenko-Cantelli class;
(b) the possibly nonmeasurable functions ‖P_n − P‖_ℱ converge to 0 a.s.;
(c) ℱ is a weak Glivenko-Cantelli class and ℱ_{0,P} is order bounded.

Proof Each of (a), (b), or (c) for ℱ is equivalent to the same statement for ℱ_{0,P}. Clearly (a) implies (b), which for ℱ_{0,P} is equivalent to (I) in Talagrand's theorem (1987a, Theorem 22), while (c) is intermediate between Talagrand's (VI) and (VII). Talagrand's (V) implies (a). □
A class ℱ satisfying any of the three equivalent conditions in Theorem A will be called a Glivenko-Cantelli class. There exist weak Glivenko-Cantelli classes which are not Glivenko-Cantelli classes (problem 12). Talagrand (1987a, 1996b) gave a kind of Vapnik-Cervonenkis criterion for the Glivenko-Cantelli property, as follows. For a finite set F ⊂ X and −∞ < α < β < ∞, the class ℱ is said to shatter F at the levels α, β if for every G ⊂ F there is an f ∈ ℱ such that f(x) ≤ α for all x ∈ G and f(y) ≥ β for all y ∈ F \ G. Thus, if ℱ is the class of indicators of sets in a class 𝒞, then ℱ shatters F at levels α, β, where 0 < α < β < 1, if and only if 𝒞 shatters F as defined in Section 4.1.

Suppose given a probability space (X, A, P). Let ℱ ⊂ ℒ¹(X, A, P), −∞ < α < β < ∞, and A ∈ A. Let X_1, X_2, …, X_n be coordinates on a product of copies of (X, A, P). Let W(ℱ, A, α, β, n) be the set of all x := {X_j}_{j=1}^n ∈ A^n such that the X_j are all distinct and ℱ shatters {X_1, X_2, …, X_n} at levels α, β. The triple (A, α, β) is called a witness of irregularity for ℱ if the following all hold: P(A) > 0; the restriction of P to A has no atoms; and for all n = 1, 2, …, (P^n)*(W(ℱ, A, α, β, n)) = P(A)^n. The terminology is as in Talagrand (1996b), except that in the 1996 paper measurability assumptions make the star unnecessary. Talagrand (1987a, Theorem 2) proved:
Theorem B (Talagrand) Given (X, A, P), an order bounded class ℱ ⊂ ℒ¹(X, A, P) fails to be a Glivenko-Cantelli class if and only if there exists a witness of irregularity for it. Thus, a class ℱ is a Glivenko-Cantelli class for P if and only if ℱ_{0,P} is order bounded and has no witness of irregularity.

Now, given a measurable space (X, A), a class ℱ of real-valued measurable functions on X is called a universal Glivenko-Cantelli class if it is a Glivenko-Cantelli class for every law (probability measure) on (X, A).
While a universal Donsker class 𝒞 of sets must be a VC class (Theorem 6.4.1), a universal Glivenko-Cantelli class need not be, even if X and 𝒞 are both countable (problems 14, 15). In any uncountable complete separable metric space X there exists an uncountable "universally null" set Z, which has outer measure 0 for every nonatomic law P on the Borel sets of X (Sierpinski and Szpilrajn, 1936). [As Shortt (1984) notes, Szpilrajn later changed his name to Marczewski.] The collection of all subsets of a universally null set is clearly a universal Glivenko-Cantelli class. Such examples and extensions of them show that the notion "universal Glivenko-Cantelli class" is surprisingly wide (Dudley, Gine and Zinn, 1991, Section 3), and some of the width is in seemingly pathological directions. More useful are classes ℱ such that ‖P_n − P‖_ℱ* → 0 uniformly in P, as follows. Given a measurable space (X, A) and a class ℱ of real-valued measurable functions on X, ℱ is a uniform Glivenko-Cantelli class as defined in Theorem 6.4.5 if and only if for every ε > 0 there is an n_0 such that for every law P on A and every n ≥ n_0, Pr*{‖P_n − P‖_ℱ > ε} < ε. Each function f in a universal (or uniform) Glivenko-Cantelli class must be bounded, and ℱ_0 := {f − inf f : f ∈ ℱ} must be uniformly bounded (problem 13). ℱ_0 is a universal or uniform Glivenko-Cantelli class if and only if ℱ has the same property.
For x := (x_1, …, x_n) ∈ X^n, n = 1, 2, …, and 0 < p < ∞, define on ℱ_0 the pseudometrics

e_{x,p}(f, g) := [n^{−1} Σ_{i=1}^n |f(x_i) − g(x_i)|^p]^{min(1,1/p)},
e_{x,∞}(f, g) := max_{i≤n} |f(x_i) − g(x_i)|.

Let H_{n,p}(ε, ℱ) := sup{log D(ε, ℱ_0, e_{x,p}) : x ∈ X^n}. Then we have

Theorem C For any measurable space (X, A) and image admissible Suslin class ℱ of bounded measurable functions on X, the following are equivalent:
(a) ℱ is a uniform Glivenko-Cantelli class;
(b) for some, or equivalently for all, p with 0 < p ≤ +∞, and for any ε > 0,
lim_{n→∞} H_{n,p}(ε, ℱ_0)/n = 0.

Proof This is Theorem 6 of Dudley, Gine and Zinn (1991). □

Thus, every image admissible Suslin VC class of sets is a uniform Glivenko-Cantelli class. Recall that conversely, every uniform Glivenko-Cantelli class of sets is a Vapnik-Cervonenkis class by Assouad's theorem (6.4.5). Theorem C
fails if no measurability condition is assumed (by the ordinal triangle example near the beginning of Chapter 5). Alon, Ben-David, Cesa-Bianchi and Haussler (1997) give an alternate characterization of uniform Glivenko-Cantelli classes. Also recall that, under a measurability condition, Vapnik and Cervonenkis (1971) and Steele (1978) gave a criterion (a necessary and sufficient condition) for a class 𝒞 of sets to be a Glivenko-Cantelli class for a given P, in terms of the "random entropy" log Δ^𝒞({X_1, …, X_n}) (Theorem 6.2.1). Vapnik and Cervonenkis (1981) gave an analogous criterion for uniformly bounded classes of functions. Talagrand (1987a) proved a kind of random entropy criterion (Theorem B in this section) for the Glivenko-Cantelli property without any measurability restrictions, with order boundedness in ℒ¹(P). There are also random entropy criteria for the Donsker property, which will not be stated here: Gine and Zinn (1986, Chapter 2, Theorem 3.4; Chapter 3, Theorem 1.3). Gine and Zinn refer in part to work that had been done, but was published later, by Talagrand (1987b, 1988).
**6.7 Classification problems and learning theory

Let P and Q be laws on ℝ^d, which are unknown except that X_1, …, X_m i.i.d. (P) and Y_1, …, Y_n i.i.d. (Q) have been observed, giving empirical measures P_m and Q_n. There is a new observation Z = X_{m+1} or Z = Y_{n+1}, but we do not know which, and we want to decide as well as possible. For example, for large d, each X_i and Y_j might be a vector of regional weather data for one day, where the X_i are for days when it rained the following day at a given station, and the Y_j for other days. Then Z is today's data vector and we want to predict whether it will rain tomorrow at the station.

(A) Let 𝒞 be a VCM class of Borel sets in ℝ^d. On some set A ∈ 𝒞, (P_m − Q_n)(A) is maximized over 𝒞. A decision rule is then to decide Z = X_{m+1} (it will rain tomorrow) if Z ∈ A, and otherwise Z = Y_{n+1}. Since P_m → P and Q_n → Q uniformly on 𝒞, also uniformly in P and Q, at rates which for some 𝒞 are given by inequalities in Section 6.5, for large m and n, knowing P_m and Q_n is nearly as good as knowing P and Q. Only VC classes can have such uniformity (Assouad's theorem (6.4.5)). Vapnik and Cervonenkis (1968, 1971) invented their classes of sets with applications to pattern recognition in view (Vapnik and Cervonenkis, 1974). There have been extensive other and further developments, sometimes called "learning theory."

There is at least one, unknown, measurable set B on which (P − Q)(B) is maximized. One wants to choose A, based on X_1, …, X_m; Y_1, …, Y_n, as a kind of estimator of B, so that there is a small error probability, for large m and n, in deciding that Z ∈ B just when one observes Z ∈ A. If so, one has what is called probably approximately correct (PAC) learning. In most problems, it seems that a successful application depends on a good choice of the data vectors to be measured ("feature sets") and on knowledge of the particular field of application. A few further references on learning theory are: Vapnik (1982), Valiant (1984), Blumer, Ehrenfeucht, Haussler and Warmuth (1989), Haussler (1992), Dudley, Kulkarni, Richardson and Zeitouni (1994), Vapnik (1995), Devroye, Gyorfi and Lugosi (1996), and Alon et al. (1997).
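The decision rule in (A) can be sketched in one dimension with the VC class of half-lines (a hypothetical toy example, not from the text; the Gaussian samples and the cut-point search are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
m = n = 500
X = rng.normal(0.0, 1.0, m)      # sample from P
Y = rng.normal(2.0, 1.0, n)      # sample from Q

# C = half-lines (-inf, c], a VC class with S(C) = 1.  Maximize the
# empirical difference (P_m - Q_n)((-inf, c]) over cut points at the data.
cuts = np.sort(np.concatenate([X, Y]))
scores = [np.mean(X <= c) - np.mean(Y <= c) for c in cuts]
c_hat = cuts[int(np.argmax(scores))]

# Decision rule: classify a new observation z as coming from P iff z <= c_hat.
Z_P = rng.normal(0.0, 1.0, 2000)
Z_Q = rng.normal(2.0, 1.0, 2000)
accuracy = 0.5 * (np.mean(Z_P <= c_hat) + np.mean(Z_Q > c_hat))
print(c_hat, accuracy)
assert 0.8 < accuracy < 0.95
```

Here the best set B is (−∞, 1], and since half-lines form a VC (hence uniform Glivenko-Cantelli) class, the empirically chosen A = (−∞, ĉ] approaches B uniformly in the underlying laws as m, n → ∞.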
Problems

1. On [0, 1] with Lebesgue (uniform) law P = U[0, 1], let F(a) := {k^a 1_{[0,1/k]}}_{k=1}^∞ for any a > 0. (a) Show that for a < 1, ‖P_n − P‖_{F(a)} is measurable and → 0 a.s. as n → ∞. Hint: This follows from the Chebyshev inequality only for some values of a. Instead, use some fact from Section 6.1. (b) Show that for a ≥ 1, ‖P_n − P‖_{F(a)} doesn't approach 0 in probability. (Prove this directly, without using facts from Chapter 6.)

2. If C is the set of tetrahedra in R^3 (bounded intersections of four half-spaces), and P is any law on the Borel sets of R^3, show that ‖P_n − P‖_C is measurable and goes to 0 a.s. as n → ∞. Hint: You can assume the result of Chapter 5, problem 8. (Actually, any bounded polyhedron with four faces in R^3 is convex and is an intersection of four half-spaces.)

Let Co(R^k) be the class of all convex sets in R^k.

3. Let C := Co(R^2) and F := {(0, 0), (1, 1), (2, 4), (3, 1), (4, 0)}. Evaluate Δ^C(F) and m^C(F).

4. For C = Co(R^k) show that for any law P on R^k (on the Borel σ-algebra, as usual) sup_{A∈C}(P_n − P)(A) is measurable.

5. Let C be the collection of all unions of three intervals in R. Show that Δ^C(F) and m^C(F) depend only on the cardinality m = |F| and evaluate each of them as a function of m = 1, 2, ….

6. Let F := {x^{−1/3} 1_{[0,x]} : 0 < x ≤ 1}. Show that F is a Donsker class for P = U[0, 1].

7. For a law P on R suppose F ∈ L^2(R, P). Let G be the class of functions g on R with total variation ≤ 1 and ‖g‖_sup ≤ 1. Show that {Fg : g ∈ G} is a Donsker class for P.
Limit Theorems for Vapnik-Cervonenkis and Related Classes
8. Prove Remark 6.4.3.

9. Let P({k}) := p_k := c k^{−5/4} for k = 1, 2, …, with c such that Σ_k p_k = 1. Let N^+ be the set of all positive integers and C the class of all subsets of N^+. Show that for some α > 0, ‖P_n − P‖_C ≥ α n^{−1/4} for all possible values of P_n, and so C is not Donsker for P.

10. In problem 9, if each p_k is replaced by c/k^3 for the appropriate constant c, show that C is a Donsker class for P.

11. If P is the uniform distribution on the unit square in R^2, show that Co(R^2) is a Donsker class for P. Note: This will be proved later in the book by different methods. It may be quite challenging to prove it with the methods available so far.

12. Let X = H be a separable Hilbert space with an orthonormal basis {e_j}_{j=1}^∞. Let k_j := 1/(1 + log j) and P({k_j e_j}) := P({−k_j e_j}) := 3/(π^2 j^2) for j = 1, 2, …. Then P is a law on H (Chapter 2, problem 19). Let F := {f_y : y ∈ H, ‖y‖ ≤ 1} with f_y(x) := (x, y). Show that F is a weak Glivenko-Cantelli class, but not a strong Glivenko-Cantelli class. Hint: F is not order bounded.

13. Let F be a universal Glivenko-Cantelli class.
(a) If f ∈ F, show that f is bounded. Hint: Otherwise, it is not in L^1(P) for some law P.
(b) Show that F_0 := {f − inf f : f ∈ F} is uniformly bounded.

14. Show that if X is countable and A = 2^X then A is a universal Glivenko-Cantelli class.

15. Let (X, A) be a measurable space and X = ∪_j A_j for some disjoint sets A_j ∈ A. Let C ⊂ A be image admissible Suslin and such that for each j, S(C ∩ A_j) < ∞ (but S(C ∩ A_j) may grow arbitrarily fast with j). Show that C is a universal Glivenko-Cantelli class.
Notes
Notes to Section 6.1. Koltchinskii (1981) defined D_F^(p)(δ, F, γ) when F ≡ 1 and γ = P_n. Pollard (1982) independently defined it for general F, for p = 2, and for distinct x(j). Both, in fact, used the minimal cardinality of an ε-net (see Section 1.2). Theorem 6.1.2 is due to Pollard (1982, proof of Theorem 9). I do not know references for Proposition 6.1.3 or 6.1.4. The symmetrization method is also due independently to Koltchinskii (1981) and Pollard (1982). Lemma 6.1.5 is adapted from Pollard (1982, Lemma 11), who proves it for a countable class F, thus avoiding measurability difficulties. From the countable case, one can infer the result if there is a countable subset H ⊂ F such that ‖ν_n‖_H = ‖ν_n‖_F almost surely. Thus it suffices for ν_n on F to be separable
Notes
229
in the sense of Doob, meaning that for some countable G ⊂ F and some set A with P(A) = 0, for any ω ∉ A, f ∈ F, ε > 0, and closed interval J ⊂ R, if ν_n(g)(ω) ∈ J for all g ∈ G with ρ_P(f, g) < ε, then ν_n(f)(ω) ∈ J. For most classes F arising in applications the empirical process is naturally separable. But the collection C of all singletons in [0, 1], for P = Lebesgue measure, appears as a rather regular collection for which the empirical process is not separable. In such a case it may appear unnatural to use a separable modification of the process, so that, for example, P_1({x}) = 0 for all x, even x = x_1! Strobl (1995) proved Theorem 6.1.6(b) without supplementary measurability assumptions. Theorem 6.1.7 extends Pollard (1982, Theorem 12) and Wolfowitz (1954) for half-spaces in R^d.
Notes to Section 6.2. In Theorem 6.2.1, Vapnik and Cervonenkis (1971, Theorem 4) proved equivalence of (b) and (c). The proof here was first given in Dudley (1984). The Koltchinskii-Pollard techniques allowed some simplification of the Vapnik-Cervonenkis proof. Steele (1978) proved equivalence of (a) and (b) using Kingman's theorem and proved Theorem 6.2.8 and (6.2.9). However, Steele's assumption that ‖P_n − P‖_C is measurable has been strengthened to "image admissible Suslin." Some such strengthening seems needed in the proof.

Notes to Section 6.3. Pollard (1982) proved Theorem 6.3.1 when the empirical processes ν_n are stochastically separable in the sense of Doob, as is usually true ab initio in cases of interest and can always be obtained by modifications of the process which may, occasionally, appear unnatural (see the Notes to Section 6.1). The "image admissible Suslin" formulation was given in Dudley (1984). The proof, based on Pollard's, is somewhat different, not only as regards measurability, but notably in that Proposition 6.3.6 extends a more specific result of Pollard.

The implication from Pollard's theorem (6.3.1) to Theorem 6.3.3, of Jain and Marcus (1975), was first done in Dudley (1984). Koltchinskii (1981) proved Theorem 6.3.1 for F uniformly bounded and satisfying a further entropy condition.
Notes to Section 6.4. Theorem 6.4.1 and Example 6.4.2 are from Durst and Dudley (1981). Assouad (1982, Proposition C) proved a form of Theorem 6.4.4 with 1/(8e2) in place of 1/2 and so proved Theorem 6.4.5. I thank David Pollard for a remark leading to Proposition 6.4.6. Theorem 6.4.4 and Proposition 6.4.6
were given in Assouad and Dudley (1990), which, like Assouad (1982), was not published. Assouad (1985) is a related work.
References

Adler, R. J. (1990). An Introduction to Continuity, Extrema and Related Topics for General Gaussian Processes. IMS Lecture Notes-Monograph Series 12.
Adler, R. J., and Brown, L. D. (1986). Tail behaviour for suprema of empirical processes. Ann. Probab. 14, 1-30.
Adler, R. J., and Samorodnitsky, G. (1987). Tail behaviour for the suprema of Gaussian processes with applications to empirical processes. Ann. Probab. 15, 1339-1351.
Alexander, K. S. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm. Ann. Probab. 12, 1041-1067; Correction, ibid. 15 (1987), 428-430.
Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM 44, 615-631.
Assouad, P. (1982). Classes de Vapnik-Cervonenkis et vitesse d'estimation (unpublished manuscript).
Assouad, P. (1985). Observations sur les classes de Vapnik-Cervonenkis et la dimension combinatoire de Blei. In Seminaire d'analyse harmonique, 1983-84, Publ. Math. Orsay, Univ. Paris XI, Orsay, pp. 92-112.
Assouad, P., and Dudley, R. M. (1990). Minimax nonparametric estimation over classes of sets (unpublished manuscript).
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929-965.
Cabana, E. M. (1984). On the transition density of a multidimensional parameter Wiener process with one barrier. J. Applied Prob. 21, 197-200.
Devroye, L. (1982). Bounds for the uniform deviation of empirical measures. J. Multivar. Analysis 12, 72-79.
Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
Dudley, R. M. (1984). A course on empirical processes. In Ecole d'ete de probabilites de St-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Dudley, R. M., Gine, E., and Zinn, J. (1991). Uniform and universal Glivenko-Cantelli classes. J. Theoret. Probab. 4, 485-510.
Dudley, R. M., Kulkarni, S. R., Richardson, T., and Zeitouni, O. (1994). A metric entropy bound is not sufficient for learnability. IEEE Trans. Inform. Theory 40, 883-885.
Durst, Mark, and Dudley, R. M. (1981). Empirical processes, Vapnik-Chervonenkis classes and Poisson processes. Probab. Math. Statist. (Wroclaw) 1, no. 2, 109-115.
Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist. 27, 642-669.
Gine, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces (Zaragoza, 1985). Lecture Notes in Math. (Springer) 1221, 50-113.
Goodman, V. (1976). Distribution estimates for functionals of the two-parameter Wiener process. Ann. Probab. 4, 977-982.
Haussler, David (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inform. and Comput. 100, 78-150.
Jain, Naresh, and Marcus, Michael B. (1975). Central limit theorems for C(S)-valued random variables. J. Funct. Analysis 19, 216-231.
Koltchinskii, V. I. (1981). On the central limit theorem for empirical measures. Theor. Probab. Math. Statist. 24, 71-82. Transl. from Teor. Verojatnost. i Mat. Statist. (1981) 24, 63-75.
Leadbetter, M. R., Lindgren, G., and Rootzen, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer, New York.
Ledoux, M. (1996). Isoperimetry and Gaussian analysis. In Ecole d'ete de probabilites de St-Flour, 1994. Lecture Notes in Math. (Springer) 1648, 165-294.
Massart, P. (1983). Vitesses de convergence dans le theoreme central limite pour des processus empiriques. C. R. Acad. Sci. Paris Ser. I 296, 937-940.
Massart, P. (1986). Rates of convergence in the central limit theorem for empirical processes. Ann. Inst. H. Poincare Probab. Statist. 22, 381-423. See also Lecture Notes in Math. (Springer) 1193, 73-109.
Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab. 18, 1269-1283.
Pickands, James III (1969). Upcrossing probabilities for stationary Gaussian processes. Trans. Amer. Math. Soc. 145, 51-73.
Pollard, David B. (1982). A central limit theorem for empirical processes. J. Austral. Math. Soc. Ser. A 33, 235-248.
Samorodnitsky, G. (1991). Probability tails of Gaussian extrema. Stochastic Process. Appl. 38, 55-84.
Shortt, Rae M. (1984). Universally measurable spaces: an invariance theorem and diverse characterizations. Fund. Math. 121, 169-176.
Sierpinski, W., and Szpilrajn, E. (1936). Remarque sur le probleme de la mesure. Fund. Math. 26, 56-61.
Smith, D. L., and Dudley, R. M. (1992). Exponential bounds in Vapnik-Cervonenkis classes of index 1. In Probability in Banach Spaces VIII, ed. R. M. Dudley, M. G. Hahn, J. Kuelbs. Birkhauser, Boston, pp. 451-465.
Steele, J. Michael (1978). Empirical discrepancies and subadditive processes. Ann. Probab. 6, 118-127.
Strobl, F. (1995). On the reversed sub-martingale property of empirical discrepancies in arbitrary sample spaces. J. Theoret. Probab. 8, 825-831.
Talagrand, M. (1987a). The Glivenko-Cantelli problem. Ann. Probab. 15, 837-870.
Talagrand, M. (1987b). Donsker classes and random geometry. Ann. Probab. 15, 1327-1338.
Talagrand, M. (1988). Donsker classes of sets. Probab. Theory Related Fields 78, 169-191.
Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22, 28-76.
Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math. I.H.E.S. 81, 73-205.
Talagrand, M. (1996a). New concentration inequalities in product spaces. Invent. Math. 126, 505-563.
Talagrand, M. (1996b). The Glivenko-Cantelli problem, ten years later. J. Theoret. Probab. 9, 371-384.
Valiant, L. G. (1984). A theory of the learnable. Comm. ACM 27, 1134-1142.
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer, New York.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Vapnik, V. N., and Cervonenkis, A. Ya. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR 181, 781-783. Engl. transl. Sov. Math. Doklady 9, 915-918.
Vapnik, V. N., and Cervonenkis, A. Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264-280.
Vapnik, V. N., and Cervonenkis, A. Ya. (1974). Teoriya Raspoznavaniya Obrazov (Theory of Pattern Recognition) (in Russian). Nauka, Moscow.
German ed.: Theorie der Zeichenerkennung, by W. N. Wapnik and A. Ja. Tscherwonenkis, transl. by K. G. Stockel and B. Schneider, eds. S. Unger and K. Fritzsch (Elektronisches Rechnen und Regeln, Sonderband). Akademie-Verlag, Berlin, 1979.
Vapnik, V. N., and Cervonenkis, A. Ya. (1981). Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory Probab. Appl. 26, 532-553.
Wolfowitz, J. (1954). Generalization of the theorem of Glivenko-Cantelli. Ann. Math. Statist. 25, 131-138.
Metric Entropy, with Inclusion and Bracketing
7.1 Definitions and the Blum-DeHardt law of large numbers

Definitions. Given a measurable space (A, A), let L^0(A, A) denote the set of all real-valued A-measurable functions on A. Given f, g ∈ L^0(A, A), let

[f, g] := {h ∈ L^0(A, A) : f ≤ h ≤ g} (empty unless f ≤ g).

A set [f, g] will be called a bracket. Given a probability space (A, A, P), 1 ≤ q < ∞, F ⊂ L^q(A, A, P) with usual seminorm ‖·‖_q, and ε > 0, let N_[]^(q)(ε, F, P) denote the smallest m such that for some f_1, …, f_m and g_1, …, g_m in L^q(A, A, P), with ‖g_i − f_i‖_q ≤ ε for i = 1, …, m,

(7.1.1)    F ⊂ ∪_{i=1}^m [f_i, g_i].

Here log N_[]^(q)(ε, F, P) will be called a metric entropy with bracketing. Note that the f_i and g_i are not required to be in F. For example, if F is the set of indicators of half-planes in R^2, then f ≤ h ≤ g for f, g, h in F would require the boundary lines of all three half-planes to be parallel. If instead we let f be the indicator of an intersection of two half-planes and g that of a union, then there can be a nondegenerate set of h ∈ F with f ≤ h ≤ g.

Also note that an individual bracket [f, g] has the envelope function max(−f, g) ≤ max(|f|, |g|), and so if (7.1.1) holds, then F has an envelope function max_{1≤j≤m} max(−f_j, g_j). The set of all differences h − H for h and H in [f, g] has an envelope function g − f. So, in this chapter, unlike the last, envelope functions will not be singled out for special attention.

If F ⊂ L^r then for 1 ≤ q ≤ r < ∞, F ⊂ L^q and

(7.1.2)    N_[]^(q)(ε, F, P) ≤ N_[]^(r)(ε, F, P) for all ε > 0.
7.1 Definitions and the Blum-DeHardt law of large numbers
235
For r = ∞, more is true. Let d_sup(f, g) := sup_x |(f − g)(x)| and ‖f‖_sup := d_sup(f, 0). It is easily seen, using brackets [f − ε, f + ε], that for any law P,

(7.1.3)    N_[]^(q)(2ε, F, P) ≤ D(ε, F, d_sup).

Thus, for example, a set F of continuous functions, totally bounded in the usual supremum norm with given bounds on D(ε, F, d_sup), will have the same bounds on all N_[]^(q)(2ε, F, P), 1 ≤ q < ∞.

If F consists of indicator functions of measurable sets then in finding brackets [f_i, g_i] to cover F, it is no loss to assume 0 ≤ f_i ≤ g_i ≤ 1 for all i. Next, if C(i) := {x : f_i(x) > 0}, D(i) := {x : g_i(x) = 1}, and f_i ≤ 1_C ≤ g_i, then

f_i ≤ 1_{C(i)} ≤ 1_C ≤ 1_{D(i)} ≤ g_i.

So [f_i, g_i] can be replaced by [1_{C(i)}, 1_{D(i)}]. If C is a collection of measurable sets and ε > 0, let N_I(ε, C, P) := inf{m : for some C_1, …, C_m and D_1, …, D_m in A, for all C ∈ C there is an i with C_i ⊂ C ⊂ D_i and P(D_i \ C_i) < ε}. Here the I in N_I indicates "inclusion." Then it follows that

(7.1.4)    N_I(ε, C, P) = N_[]^(1)(ε, F, P) where F = {1_C : C ∈ C}.
We have the following law of large numbers:

7.1.5 Theorem (Blum-DeHardt) Suppose F ⊂ L^1(A, A, P) and for all ε > 0, N_[]^(1)(ε, F, P) < ∞. Then F is a strong Glivenko-Cantelli class, that is, lim_{n→∞} ‖P_n − P‖*_F = 0 a.s.

Proof Given ε > 0, take f_1, …, f_m and g_1, …, g_m in L^1, m < ∞, to satisfy (7.1.1) for q = 1 with ‖g_i − f_i‖_1 ≤ ε. Let f_{m+j} := g_j for j = 1, …, m. By the ordinary strong law of large numbers there is an N such that

Pr{sup_{n≥N} max_{j≤2m} |(P_n − P)(f_j)| > ε} < ε.

For each f ∈ F let f_i ≤ f ≤ g_i with P(g_i − f_i) ≤ ε. Then if max_{j≤2m} |(P_n − P)(f_j)| ≤ ε we have

|(P_n − P)(f)| ≤ |(P_n − P)(f_i)| + |(P_n − P)(f − f_i)|
≤ ε + (P_n + P)(g_i − f_i) ≤ 3ε + (P_n − P)(g_i − f_i) ≤ 5ε.

Thus

Pr*{sup_{n≥N} ‖P_n − P‖_F > 5ε} ≤ ε,

and the conclusion follows from Lemma 3.2.6.
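A numerical illustration of these definitions, under assumptions chosen for the sketch rather than taken from the text: for C = {[0, t] : 0 ≤ t ≤ 1} and P = U[0, 1], the pairs C_i = [0, (i−1)ε] ⊂ [0, t] ⊂ D_i = [0, iε] witness N_I(ε, C, P) ≤ ⌈1/ε⌉, and the Glivenko-Cantelli conclusion of Theorem 7.1.5 can be watched empirically.

```python
import numpy as np

def NI_upper_bound(eps):
    # Brackets C_i = [0, (i-1)eps] and D_i = [0, i*eps] have
    # P(D_i \ C_i) = eps under U[0,1], so N_I(eps) <= ceil(1/eps).
    return int(np.ceil(1.0 / eps))

def sup_deviation(n, rng):
    # ||P_n - P||_C = sup_t |P_n([0, t]) - t|, attained at order statistics.
    u = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return float(max(np.max(i / n - u), np.max(u - (i - 1) / n)))

rng = np.random.default_rng(1)
print(NI_upper_bound(0.1))                 # at most 10 brackets of size 0.1
d_small, d_large = sup_deviation(100, rng), sup_deviation(10000, rng)
print(d_small, d_large)                    # deviation shrinks as n grows
```

The shrinking supremum is exactly the conclusion ‖P_n − P‖_F → 0 for this bracketed class.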
The sufficient condition in Theorem 7.1.5 is not necessary. In fact, the following holds.

7.1.6 Proposition There is a probability space (A, A, P) and a strong Glivenko-Cantelli class F := {1_C : C ∈ C} for P, where C ⊂ A is such that for all ε < 1/2, we have N_[]^(1)(ε, F, P) = +∞.

Proof Let A = [0, 1] with P = U[0, 1] the uniform (Lebesgue) law. Let C_m := C(m) be independent sets with P(C_m) = 1/m. Then to show that lim_{n→∞} sup_m |(P_n − P)(C_m)| = 0 a.s., let 0 < ε < 1. For m ≤ 3/ε we have by Bernstein's inequality (1.3.2),
Pr(|(P_n − P)(C_m)| > ε) ≤ 2 exp(−nε² / (2/m + 2ε/3)) ≤ 2 exp(−mnε²/4).
For m > 3/ε, inequality (1.3.11) gives

Pr(|(P_n − P)(C_m)| > ε) ≤ E(nε, n, 1/m) ≤ (e/(mε))^{nε}.

Since e/(mε) < 1 for mε > 3, we have for r = 1 or 2, summing a geometric series,

Σ_{n: nε ≥ r} (e/(mε))^{nε} ≤ (e/(mε))^r / [1 − (e/(mε))^ε].
Thus Σ_{n>2/ε} Σ_{m≥1} Pr(|(P_n − P)(C_m)| > ε) < ∞, and the Borel-Cantelli lemma gives the result.

Now, given functions f_1, …, f_k and g_1, …, g_k in L^1, suppose

{C_m}_{m≥1} ⊂ ∪_i {[f_i, g_i] : P(g_i − f_i) < 1/2}.

We may assume 0 ≤ f_i ≤ g_i ≤ 1 for all i. For each i with P(g_i − f_i) < 1/2, we have Σ{P(C_m) : f_i ≤ 1_{C(m)} ≤ g_i} < +∞, since if the series diverges then for a subsequence C_{m(r)} we have Σ_r P(C_{m(r)}) = +∞, and for C := ∪_r C_{m(r)} we have P(C) = 1 by the Borel-Cantelli lemma, so f_i ≤ 1_C ≤ g_i implies P(g_i) = 1 and P(f_i) ≥ 1 − P(g_i − f_i) > 1/2 > 0; but then, since P(C_m) = 1/m → 0, f_i ≤ 1_{C(m)} holds for only finitely many m, a contradiction. Thus Σ_m P(C_m) < +∞, contradicting Σ_m P(C_m) = Σ_m 1/m = +∞; so N_[]^(1)(ε, {1_{C_m}}, P) = +∞ for every ε < 1/2, which finishes the proof.
On the other hand, let C be the collection of all finite subsets of [0, 1] with Lebesgue law P. Then ‖P_n − P‖_C = 1, which does not approach 0, although (P_n − P)(A) = 0 a.s. for all A ∈ C. This shows that in Theorem 7.1.5, N_[]^(1) < ∞ cannot be replaced by N(ε, F, d_p) = 1 for any L^p distance d_p.

A Banach space (S, ‖·‖) has a dual space (S′, ‖·‖′) of continuous linear forms f : S → R with ‖f‖′ := sup{|f(x)| : x ∈ S, ‖x‖ ≤ 1} < ∞ (RAP, Section 6.1). One way to apply Theorem 7.1.5 is via the following:
7.1.7 Proposition Let (S, ‖·‖) be a separable Banach space and P a law on the Borel sets of S such that ∫ ‖x‖ dP(x) < ∞. Let F be the unit ball of the dual space S′, F := {f ∈ S′ : ‖f‖′ ≤ 1}. Then for every ε > 0, N_[]^(1)(ε, F, P) < ∞.

Proof By Ulam's theorem (RAP, Theorem 7.1.4) take a compact K ⊂ S such that ∫_{S\K} ‖x‖ dP(x) < ε/4. The elements of F, restricted to K, form a uniformly bounded, equicontinuous family, hence totally bounded for the sup norm ‖·‖_K on K by the Arzela-Ascoli theorem. Take f_1, …, f_m ∈ F, m < ∞, such that for all f ∈ F, ‖f − f_j‖_K < ε/4 for some j. Let g_j := f_j − ε/4 on K, g_j(x) := −‖x‖ for x ∉ K; h_j := f_j + ε/4 on K, h_j(x) := ‖x‖ for x ∉ K, all for j = 1, …, m. Then for any f ∈ F, if ‖f − f_j‖_K < ε/4, then g_j ≤ f ≤ h_j and P(h_j − g_j) < ε, so N_[]^(1)(ε, F, P) ≤ m < ∞.

7.1.8 Corollary (Mourier's strong law of large numbers) Let (S, ‖·‖) be a separable Banach space, P a law on S such that ∫ ‖x‖ dP(x) < ∞, and X_1, X_2, … i.i.d. P. Let S_n := X_1 + ⋯ + X_n. Then S_n/n converges a.s. in (S, ‖·‖) to some x_0 ∈ S.

Proof By Theorem 7.1.5 and Proposition 7.1.7, {S_n/n} is a Cauchy sequence a.s. for ‖·‖ and hence converges a.s. to some random variable Y ∈ S. For each f ∈ F, f(Y) = P(f) a.s., that is, Y ∈ f^{−1}({P(f)}) a.s. Let {f_m}_{m≥1} ⊂ F be a countable total set: if f_m(x) = 0 for all m, then x = 0. Such f_m exist by the Hahn-Banach theorem (RAP, Corollary 6.1.5). Let D := ∩_m f_m^{−1}({P(f_m)}). Then Y ∈ D a.s., so D is nonempty. But if y, z ∈ D, then ‖y − z‖ = sup_m |f_m(y − z)| = 0, so D = {x_0} for some x_0.
Direct proof Given ε > 0, there is a Borel measurable function g from S into a finite subset of itself such that P(‖x − g(x)‖) < ε. To show this, let {x_i}_{i=0}^∞ be dense in S, with x_0 := 0. For k = 1, 2, …, let g_k(x) := x_i for the smallest i ≤ k such that ‖x − x_i‖ = min_{r≤k} ‖x − x_r‖. Then ‖x − g_k(x)‖ ≤ ‖x‖ for all k and x, and ‖x − g_k(x)‖ ↓ 0 as k → ∞ for all x, so P(‖x − g_k(x)‖) → 0 by dominated or monotone convergence.

The strong law of large numbers holds for the finite-dimensional variables g(X_j), with some limit x_ε, and

n^{−1} ‖Σ_{j=1}^n (X_j − g(X_j))‖ ≤ n^{−1} Σ_{j=1}^n ‖X_j − g(X_j)‖ → E‖X_1 − g(X_1)‖ < ε

as n → ∞ by the one-dimensional strong law. Thus limsup_{n→∞} ‖x_ε − S_n/n‖ ≤ ε a.s. Letting ε ↓ 0 through some sequence, we get S_n/n converging a.s. to some x_0.
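The quantization step g_k in the direct proof can be illustrated numerically. This is only a sketch under illustrative assumptions: S = R^3 with the Euclidean norm, a finite sample standing in for P, and randomly drawn points playing the role of the dense sequence {x_i}.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3))  # i.i.d. sample standing in for P on S = R^3

# "Dense" sequence {x_i} with x_0 = 0, as in the proof.
centers = np.vstack([np.zeros(3), rng.normal(size=(200, 3))])

def g_k(X, k):
    # Nearest of the first k points x_0, ..., x_{k-1}; ties go to the
    # smallest index, which np.argmin does automatically.
    d = np.linalg.norm(X[:, None, :] - centers[None, :k, :], axis=2)
    return centers[:k][np.argmin(d, axis=1)]

# E||X - g_k(X)|| is nonincreasing in k (g_1 = 0 gives E||X||) and tends
# to 0 as the x_i fill out the space.
errs = [float(np.mean(np.linalg.norm(X - g_k(X, k), axis=1)))
        for k in (1, 10, 201)]
print(errs)
```

The monotone decrease of the mean quantization error is the dominated/monotone convergence step; the finite-range variables g(X_j) then obey the ordinary strong law.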
7.1.9 Corollary If F ⊂ L^1(A, A, P) and {δ_x : x ∈ A} is separable for ‖·‖_F, then ‖P_n − P‖_F = ‖P_n − P‖*_F → 0 a.s.

Proof This follows from Corollary 7.1.8, since finite linear combinations of δ_x, x ∈ A, with rational coefficients, are dense in their completion for ‖·‖_F, a Banach space.

The proof of Proposition 7.1.7 and Corollary 7.1.8 together from Theorem 7.1.5 is no shorter than the direct proof. On the other hand, if F = {1_{[0,t]} : 0 ≤ t ≤ 1} and P is Lebesgue measure on [0, 1], then Theorem 7.1.5 applies but Corollary 7.1.9 does not.
7.2 Central limit theorems with bracketing

In this section the bracketing will be in L^2. A bracket [f, h] will be called a δ-bracket if (∫(h − f)² dP)^{1/2} ≤ δ. The following main theorem will be proved. Then, Corollary 7.2.13 gives a hypothesis on N_[]^(1) for uniformly bounded classes of functions.
7.2 Central limit theorems with bracketing
239
7.2.1 Theorem (M. Ossiander) Let (X, A, P) be a probability space and let F ⊂ L^2(X, A, P) be such that

∫_0^1 (log N_[]^(2)(x, F, P))^{1/2} dx < ∞.

Then F is a P-Donsker class.

Proof (Arcones and Gine). First, a lemma will help. Recall that an envelope G for a class G of functions is a measurable function such that |g(x)| ≤ G(x) for all g ∈ G and all x.

7.2.2 Lemma Let (X, A, P) be a probability space and G a set of real-valued measurable functions on X, with an envelope G. Let B ∈ A and δ > 0. Suppose that P(G 1_B) ≤ δ/2 and that for each g ∈ G, A(g) is a measurable set with A(g) ⊂ B. Then

Pr*{‖(P_n − P)(g 1_{A(g)})‖_G > 2δ} ≤ Pr{|(P_n − P)(G 1_B)| > δ}.

Proof |P(g 1_{A(g)})| ≤ P(G 1_B) ≤ δ/2 for all g ∈ G, so

Pr*{‖(P_n − P)(g 1_{A(g)})‖_G > 2δ} ≤ Pr*{‖P_n(g 1_{A(g)})‖_G > 3δ/2}
≤ Pr{P_n(G 1_B) > 3δ/2} ≤ Pr{|(P_n − P)(G 1_B)| > δ}.

Now to prove the theorem, let N_k := N_[]^(2)(2^{−k}, F, P), k = 1, 2, …, and let γ_k := (log(k N_1 ⋯ N_k))^{1/2}. Then γ_k is increasing in k. By the integral test,
Σ_{k=1}^∞ (log N_k)^{1/2}/2^{k+1} < ∞, so

Σ_{k=1}^∞ 2^{−k} γ_k ≤ Σ_{k=1}^∞ 2^{−k} [(log k)^{1/2} + Σ_{j=1}^k (log N_j)^{1/2}]
≤ Σ_{k=1}^∞ (log k)^{1/2}/2^k + Σ_{j=1}^∞ (log N_j)^{1/2} Σ_{k=j}^∞ 2^{−k} < ∞.
Let β_k := Σ_{j=k}^∞ γ_j/2^j. Then β_k → 0 as k → ∞. Let S_{ki} := [f_{ki}, h_{ki}], i = 1, …, N_k, be a set of 2^{−k}-brackets covering F. Let T_{ki} := S_{ki} \ ∪_{i′<i} S_{ki′}. For each k and each s = (s_1, …, s_k) with 1 ≤ s_j ≤ N_j for each j, let A_{k,s} := ∩_{j=1}^k T_{j,s_j}. For each nonempty set A_{k,s}, choose f_{k,s} ∈ A_{k,s}. For each f ∈ F let A_k(f) := A_{k,s} and π_k f := f_{k,s} for the unique s := s(f) := s(f, k) such that f ∈ A_{k,s}.

Let Δ_k f := Δ_k(f) := sup{|g − h| : g, h ∈ A_k(f)}. Then for any f ∈ F, Δ_k(f) is nonincreasing in k and

(7.2.3)    E((Δ_k f)²) ≤ 2^{−2k}.

Note that Δ_k f depends on f only through s(f). For k ≥ 1, n ≥ 1, and f ∈ F let

(7.2.4)    B_k := B(k) := B(k, f, n) := {x ∈ X : Δ_k f(x) > n^{1/2}/(2^{k+1} γ_{k+1})}.

For any fixed j and n and x ∈ X let

τ_f := τ_{j,n}(f, x) := min{k ≥ j : x ∈ B(k, f, n)},

where min ∅ := +∞. Then {τ_f = j} = B(j, f, n), {τ_f > j} = {Δ_j f ≤ n^{1/2}/(2^{j+1} γ_{j+1})}, and for k > j,

(7.2.5)    {τ_f ≥ k} ⊂ X \ B_{k−1} ⊂ {Δ_k f ≤ Δ_{k−1} f ≤ n^{1/2}/(2^k γ_k)}
by (7.2.3), and {τ_f = k} ⊂ B_k \ B_{k−1}.

For any f, g ∈ F, let ρ_[](f, g) := 1/2^K for the largest K such that for some s, f and g are both in A_{K,s}, or ρ_[](f, g) := 0 if this holds for arbitrarily large K. Then F is totally bounded for ρ_[]. So by Theorem 3.7.2 it will be enough to prove the asymptotic equicontinuity condition for ρ_[], in other words that for every α > 0,

(7.2.6)    lim_{j→∞} limsup_{n→∞} Pr*{n^{1/2} ‖(P_n − P)(f − π_j f)‖_F > α} = 0.

We can assume 0 < α < 1. Then for any positive integers j < r, f − π_j f will be decomposed as follows:

(7.2.7)    f − π_j f = (f − π_j f) 1_{τ_f = j} + (f − π_r f) 1_{τ_f ≥ r}
           + Σ_{k=j+1}^{r−1} (f − π_k f) 1_{τ_f = k}
           + Σ_{k=j+1}^{r} (π_k f − π_{k−1} f) 1_{τ_f ≥ k};

this is easily seen for r = j + 1 and then by induction on r. The decomposition (7.2.7) will give a bound for the outer probability in (7.2.6) by a sum of four terms, labeled (I), (II), (III), and (IV) below.

Let ε := α/8. Fix j = j(ε) large enough so that

(7.2.8)    β_j < ε/24 and Σ_{k>j} k^{−12} < 2ε.
Then choose r = r(n) > j large enough so that

(7.2.9)    n^{1/2} 2^{−r} ≤ ε/4 and 2 exp(γ_r² − 2^{r−1} γ_r ε) < ε,

which is possible since γ_r increases with r. Lemma 7.2.2 will be applied to classes of functions

G := G(k, s) := G_{k,s} := {f − π_k f : f ∈ F, π_k f = f_{k,s}}

with envelope G := G_{k,s} := Δ_k f_{k,s}.

About (I): for any function ψ ≥ 0 and t > 0, 1_{ψ>t} ≤ ψ/t. So by (7.2.4),

n^{1/2} E(1_{B_j} Δ_j f) ≤ 2^{j+1} γ_{j+1} E((Δ_j f)²) ≤ 2^{1−j} γ_{j+1} ≤ 4β_j < ε/4

for all f ∈ F. Then since {τ_f = j} = B_j,

n^{1/2} |P((f − π_j f) 1_{τ_f = j})| ≤ n^{1/2} P(1_{B_j} Δ_j f) ≤ ε/4.

Apply Lemma 7.2.2 to G_{j,s} for each s with B := B_j := B(j, f_{j,s}, n), δ := ε/n^{1/2}, and A(f − π_j f) := B_j = {τ_f = j} in this case. Then

Pr*{‖n^{1/2}(P_n − P)(g 1_{B_j})‖_{G(j,s)} > 2ε} ≤ Pr{n^{1/2} |(P_n − P)((Δ_j f_{j,s}) 1_{B_j})| > ε}.

Then summing over s,

(I) := Pr*{‖n^{1/2}(P_n − P)((f − π_j f) 1_{τ_f = j})‖_F > 2ε}
    ≤ exp(γ_j²) max_s Pr{n^{1/2} |(P_n − P)(Δ_j f_{j,s} 1_{B_j})| > ε}
    ≤ exp(γ_j²) ε^{−2} max_s Var(Δ_j f_{j,s} 1_{B_j}).

As n → ∞, for fixed j and s, by (7.2.3) and (7.2.4), since P(B_j) = P(B(j, f_{j,s}, n)) → 0 for each j and s, we have (I) → 0, so (I) < ε for n large enough.

About (II): we have by (7.2.3) and (7.2.9),
(7.2.10)    n^{1/2} E(Δ_r f · 1_{Δ_r f ≤ n^{1/2}/(2^r γ_r)}) ≤ n^{1/2} 2^{−r} ≤ ε/4

and

(7.2.11)    E((Δ_r f)² 1_{Δ_r f ≤ n^{1/2}/(2^r γ_r)}) ≤ 2^{−2r}.

Now by (7.2.5) for k = r, and since |f − π_r f| ≤ Δ_r f, we have by (7.2.10), n^{1/2} |P((f − π_r f) 1_{τ_f ≥ r})| ≤ ε/4 for all f ∈ F. Apply Lemma 7.2.2 for each s with A(f − π_r f) := {τ_f ≥ r} and B := {Δ_r f_{r,s} ≤ n^{1/2}/(2^r γ_r)}, noting that Δ_r f ≤ Δ_{r−1} f for all f. Thus by (7.2.5) again,

(II) := Pr*{‖n^{1/2}(P_n − P)((f − π_r f) 1_{τ_f ≥ r})‖_F > 2ε}
     ≤ exp(γ_r²) max_s Pr{n^{1/2} |(P_n − P)(Δ_r f_{r,s} 1_{Δ_r f_{r,s} ≤ n^{1/2}/(2^r γ_r)})| > ε}.

Then by Bernstein's inequality (Proposition 1.3.2) and (7.2.5),

(II) ≤ 2 exp(γ_r² − ε² / [2ε/(2^{r+2} γ_r) + 2ε/(3 · 2^r γ_r)])
    = 2 exp(γ_r² − (6/7) 2^r γ_r ε) ≤ 2 exp(γ_r² − 2^{r−1} γ_r ε),

since 2^{−r} γ_r ≤ β_r ≤ β_j < ε/24 by the definition of β_r and (7.2.8). Then (II) < ε by (7.2.9).
About (III): for k = j + 1, …, r − 1, by (7.2.4) and (7.2.3),

(7.2.12)    n^{1/2} E((Δ_k f) 1_{τ_f = k}) ≤ 2^{k+1} γ_{k+1} E((Δ_k f)²) ≤ 2^{1−k} γ_{k+1}.

Then

(III) := Pr*{n^{1/2} ‖(P_n − P)(Σ_{k=j+1}^{r−1} (f − π_k f) 1_{τ_f = k})‖_F > 2ε}
      ≤ Σ_{k=j+1}^{r−1} Pr*{n^{1/2} ‖(P_n − P)((f − π_k f) 1_{τ_f = k})‖_F > 2^{−k} ε γ_{k+1}/β_j},

since Σ_{k>j} 2^{−k} ε γ_{k+1}/β_j ≤ 2ε. To apply Lemma 7.2.2 for each k = j + 1, …, r − 1 and for each s, let δ := 2^{−k−1} ε γ_{k+1}/β_j, and A(f − π_k f) := B := {τ_f = k} for f − π_k f ∈ G_{k,s}. The hypothesis of the lemma holds since β_j < ε/24 (7.2.8) implies 2^{1−k} γ_{k+1} ≤ 2^{−k−2} γ_{k+1} ε/β_j, and by (7.2.5). So

(III) ≤ Σ_{k=j+1}^{r−1} Pr{n^{1/2} ‖(P_n − P)((Δ_k f) 1_{τ_f = k})‖_F > 2^{−k−1} ε γ_{k+1}/β_j}.
Then by Bernstein's inequality again (applied to each of the at most exp(γ_k²) distinct functions (Δ_k f) 1_{τ_f = k}), (7.2.3), and (7.2.5),

(III) ≤ Σ_{k=j+1}^{r−1} exp(γ_k² − (2^{−k−1} ε γ_{k+1}/β_j)² / [2^{1−2k} + 2^{−k} ε γ_{k+1}/(3 · 2^k γ_k β_j)])
     = Σ_{k=j+1}^{r−1} exp(γ_k² − ε² γ_{k+1}² / (4β_j² (2 + ε γ_{k+1}/(3 γ_k β_j)))).

Now since γ_k is increasing with k,

−ε² γ_{k+1}² / (4β_j² (2 + ε γ_{k+1}/(3 γ_k β_j))) = −ε² γ_{k+1} / (8β_j²/γ_{k+1} + 4εβ_j/(3γ_k))
   ≤ −ε² γ_k γ_{k+1} / (β_j (8β_j + 4ε/3)).

The latter expression, since β_j < ε/24 by (7.2.8) and so 8β_j + 4ε/3 < 5ε/3, is bounded above by

−ε² γ_k γ_{k+1} / (5εβ_j/3) = −3ε γ_k γ_{k+1}/(5β_j) ≤ −3ε γ_k²/(5β_j).

It follows that

(III) ≤ Σ_{k=j+1}^{r−1} exp(γ_k² − 3ε γ_k²/(5β_j)) ≤ Σ_{k=j+1}^{r−1} exp(−ε γ_k²/(2β_j))
     ≤ Σ_{k=j+1}^{r−1} exp(−12 γ_k²) = Σ_{k=j+1}^{r−1} 1/(k N_1 ⋯ N_k)^{12} ≤ Σ_{k>j} k^{−12} < 2ε

by (7.2.8).
About (IV): if k = j + 1, …, r and g ∈ A_k(f), then π_k(g) = π_k(f), π_{k−1}(g) = π_{k−1}(f), and Δ_u g = Δ_u f for all u = 1, …, k, so {τ_f ≥ k} = {τ_g ≥ k}. Thus the number of distinct functions (π_k f − π_{k−1} f) 1_{τ_f ≥ k} is at most exp(γ_k²). Also, π_k f ∈ A_{k−1}(f) and so by (7.2.3), E((π_k f − π_{k−1} f)²) ≤ 2^{2−2k}. Now

(IV) := Pr*{n^{1/2} ‖(P_n − P)(Σ_{k=j+1}^{r} (π_k f − π_{k−1} f) 1_{τ_f ≥ k})‖_F > ε}
     ≤ Σ_{k=j+1}^{r} Pr{n^{1/2} ‖(P_n − P)((π_k f − π_{k−1} f) 1_{τ_f ≥ k})‖_F > 2^{−k} ε γ_k/β_j}

from the definition of β_j. Bernstein's inequality and (7.2.5) give

(IV) ≤ Σ_{k=j+1}^{r} exp(γ_k² − (2^{−k} ε γ_k/β_j)² / [2^{3−2k} + (2/3) 2^{−2k} ε/β_j])
    = Σ_{k=j+1}^{r} exp(γ_k² − ε² γ_k² / (β_j² (8 + 2ε/(3β_j)))).

Now since β_j < ε/24, 8 + 2ε/(3β_j) < ε/β_j, and

(IV) ≤ Σ_{k=j+1}^{r} exp(γ_k² (1 − ε/β_j)) ≤ Σ_{k=j+1}^{r} exp(−23 γ_k²) < 2ε

as in (III). Thus the expression in (7.2.6) is less than α. Letting α ↓ 0, the proof of Theorem 7.2.1 is complete.

Theorem 7.2.1 implies the following for L^1 entropy with bracketing:
7.2.13 Corollary Let (X, A, P) be a probability space and F a uniformly bounded set of measurable functions on X. Suppose that

∫_0^1 (log N_[]^(1)(x², F, P))^{1/2} dx < ∞.

Then F is a Donsker class for P.

Proof Suppose |f(x)| ≤ M < ∞ for all f ∈ F and x ∈ X. Since multiplication by a constant preserves the Donsker property (by Theorem 3.7.2), we can assume M = 1/2. Then for any f, g ∈ F, |f − g| ≤ 1 everywhere. So if ∫|f − g| dP ≤ ε² then (∫|f − g|² dP)^{1/2} ≤ ε. So

N_[]^(2)(ε, F, P) ≤ N_[]^(1)(ε², F, P)

and the result follows from Theorem 7.2.1.
It will be seen in the next section that Corollary 7.2.13, and thus Theorem 7.2.1, are best possible (provide characterizations of the Donsker property) in some cases.
7.3 The power set of a countable set: the Borisov-Durst theorem

Let P be a law on the set N of nonnegative integers. The next theorem gives a criterion for the Donsker property of the collection 2^N of all subsets of N, for P, in terms of the numbers p_m := P({m}) for m ≥ 0. We also find that the sufficient condition given in Corollary 7.2.13 is necessary for 2^N. Recall N_I as defined above Theorem 7.1.5.

7.3.1 Theorem The following are equivalent:

(a) 2^N is a Donsker class for P;
(b) Σ_m p_m^{1/2} < ∞;
(c) ∫_0^1 (log N_I(x², 2^N, P))^{1/2} dx < ∞.

Proof We have (c) ⟹ (a) by Corollary 7.2.13. Next, to prove (a) ⟹ (b), suppose Σ_m p_m^{1/2} = ∞. The random variables W(m) := W_P(1_{{m}}) (for the isonormal W_P on L^2(P) as defined in Section 2.5) are independent and Gaussian with mean 0 and variances p_m. We can write G_P(f) = W_P(f) − P(f) W_P(1) since the right side is Gaussian and has mean 0 and the covariances of G_P. Then Σ_m E|W_P({m})| = (2/π)^{1/2} Σ_m p_m^{1/2} diverges, while Σ_m Var(|W_P({m})|) ≤ Σ_m p_m < ∞. Thus for any M < ∞, by Chebyshev's inequality, Pr(Σ_{j=1}^m |W_P({j})| > M) → 1 as m → ∞. Thus Σ_j |W_P({j})| = +∞ almost surely. Now Σ_m p_m |W_P(1_N)| < ∞ a.s., so Σ_m |G_P({m})| = +∞ a.s. Hence sup_{A⊂N} G_P(1_A) = +∞ a.s. and 2^N is not a pregaussian class, so a fortiori not a Donsker class. Thus (a) implies (b).
Next, to prove (b) ⟹ (c): equivalently, let us prove

Σ_{k=1}^∞ 2^{−k} (log N_I(4^{−k}, 2^N, P))^{1/2} < ∞.

We can assume p_m ≥ p_r > 0 for m ≤ r. Let r_j be the number of values of m such that 4^{−j−1} < p_m^{1/2} ≤ 4^{−j}, j = 0, 1, 2, …, and C_j := r_j/4^j. Then Σ_j C_j ≤ 4 Σ_m p_m^{1/2} < ∞. For k ≥ k_0 large enough there is a unique j(k) such that

(7.3.2)    Σ_{j>j(k)} C_j/4^j ≤ 4^{−k} < Σ_{j≥j(k)} C_j/4^j.

Let m(k) := m_k := Σ_{j=0}^{j(k)} r_j. Then

Σ_{m>m(k)} p_m ≤ Σ_{j>j(k)} r_j/4^{2j} ≤ 4^{−k}.

Let A_i run over all subsets of {1, …, m(k)}, where i = 1, …, 2^{m(k)}. Let B_i := A_i ∪ {m ∈ N : m > m(k)}. Then for any C ⊂ N, C ∩ {1, …, m(k)} = A_i for some i. Then A_i ⊂ C ⊂ B_i and P(B_i \ A_i) ≤ 4^{−k}. So N_I(4^{−k}, 2^N, P) ≤ 2^{m(k)+1}. Thus it will be enough to prove

(7.3.3)    Σ_k m_k^{1/2}/2^k < ∞,
with >k restricted to k > k0. We have j(k)
mk/2/2k
> C j=0
k
j(k)
<
1/2
4lCJ)
00
2j-kC)l2 =
j=0
k
/2k
/I
C)12
j=0
1: 2j-k k j<<jj(k)
To prove this converges, since > j C j < oo, it is enough by Cauchy's inequality to prove that > j (>k. j< j(k) 2i k)2 < oo. Let k(j) be the smallest k such that
j(k) > j. Then
E 2j-k < 2j+1-k(J) k: j:5 jj (k)
To prove that Y j 4j-k(j) < oo, setting j (k0 - 1) := 0, we have
4J-k(J) <
T, k
j>11
41+J(k)-k
Y 4j-k < j: j(k-1)<j<j(k)
k
For each k, let K(k) be the smallest K such that j(K) = j(k), and let 𝒦 denote the range of K(·). Summing first over the k with a common value K = K(k), for which j(k) = j(K) and k ≥ K,

Σ_k 4^{j(k)−k} = Σ_{K∈𝒦} Σ_{k: K(k)=K} 4^{j(K)−k} ≤ 2 Σ_{K∈𝒦} 4^{j(K)−K}.

From (7.3.2) applied with K in place of k, 4^{−K} < Σ_{j≥j(K)} C_j/4^j for K ∈ 𝒦, so

Σ_{K∈𝒦} 4^{j(K)−K} < Σ_{K∈𝒦} 4^{j(K)} Σ_{j≥j(K)} C_j 4^{−j} = Σ_j C_j 4^{−j} Σ_{K∈𝒦: j(K)≤j} 4^{j(K)}.

Since K ↦ j(K) is one-to-one on 𝒦, the inner sum is at most Σ_{i≤j} 4^i ≤ 2·4^j, so the whole sum is at most 4 Σ_j C_j < ∞.
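The bracketing bound N_I(4^{−k}, 2^ℕ, P) ≤ 2^{m(k)+1} in the proof rests on cutting ℕ at a point m where the tail mass is small. A minimal numerical sketch of that construction (the function name and the geometric example are my own, not from the text):

```python
def n_i_upper_bound(p, eps):
    """Upper bound on the bracketing number N_I(eps, 2^N, P) for a law P on N with
    point masses p listed in nonincreasing order.  Brackets are [A, A ∪ tail] over
    all subsets A of {0, ..., m-1}, where m is smallest with P({m, m+1, ...}) <= eps,
    so each bracket difference has mass <= eps and N_I(eps, 2^N, P) <= 2^(m+1)."""
    tail = sum(p)
    m = 0
    while tail > eps:
        tail -= p[m]
        m += 1
    return 2 ** (m + 1)
```

For the geometric law p_n = 2^{−n−1} and ε = 1/4 this gives m = 2 and the bound 2³ = 8.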
Recall Remark 6.4.3, which implies that if C = 2^ℕ then we have

sup_{A∈C} |(P_n − P)(A)| → 0 a.s., n → ∞,

for any law P on ℕ.
**7.4 Bracketing and majorizing measures

Andersen, Giné, Ossiander and Zinn (1988) combine the ideas of majorizing measures (Section 2.7) and bracketing to get extended central limit theorems (Section 4 of their paper) and laws of the iterated logarithm (Section 5). They consider triangular arrays, so that variables are independent but not i.i.d.; they allow envelope functions not necessarily in L²; and they use truncation, so that bracketing need not hold everywhere but only on sets of high enough probability.
Problems

1. Let (K, d) be a compact metric space. For 0 < α ≤ 1 let L_α(K, d) be the set of all real-valued functions f on K such that |f(x) − f(y)| ≤ d(x, y)^α for all x, y ∈ K. Show that for any law P on K, L_α(K, d) is a strong Glivenko–Cantelli class for P.
2. Show that {f ∈ C[0, 1] : ||f||_sup ≤ 1} is not a weak Glivenko–Cantelli class for P = U[0, 1].

3. Let (X, A, P) be a probability space such that {x} ∈ A for all x ∈ X and P has no soft atoms, as defined in Problem 9 of Chapter 5. Let C ⊂ A be such that C is linearly ordered by inclusion and generates A. Give an upper bound on N_I(ε, C, P) for 0 < ε < 1.
4. If X = ℕ, A = 2^ℕ, and p_n := P({n}) := 1/2^{n+1} for n = 0, 1, 2, …, evaluate N_I(ε, A, P) for 0 < ε < 1.
5. If X = ℕ, A = 2^ℕ, and for some a ∈ ℝ, p_n := P({n}) := c_a/[(n+1)²(log(n+2))^a] for all n ∈ ℕ, where c_a is such that Σ_n p_n = 1, for what values of a is A a Donsker class for P?

6. Let (X, A, P) be a probability space. For j = 1, …, k let F_j ⊂ ℒ²(X, A, P) each satisfy the hypothesis of Theorem 7.2.1. Then show that the hypothesis also holds for:
(a) ∪_{j=1}^k F_j;
(b) {Σ_{j=1}^k f_j : f_j ∈ F_j for each j}.
Let C_j be collections of sets and F_j := {1_A : A ∈ C_j}. Show that the hypothesis of Theorem 7.2.1 also holds for F := {1_A : A ∈ C} where
(c) C := {∩_{j=1}^k A_j : A_j ∈ C_j for each j}.
7. Let C := {[0, a] × [0, b] : 0 ≤ a ≤ 1, 0 ≤ b ≤ 1}. Let P be the uniform distribution on the unit square [0, 1] × [0, 1]. Give an upper bound for N_I(ε, C, P) for 0 < ε < 1.

8. In the unit cube X := [0, 1]^d in ℝ^d, with uniform (Lebesgue) law P, let C be a collection of subsets A of X such that for some K < ∞ and c > 0, whenever X is decomposed as a union of n^d cubes of side 1/n, each with vertices having coordinates i/n, i = 0, 1, …, n, the boundary of A intersects at most Kn^{d−c} of these cubes. Show that N_I(ε, C, P) < ∞ for all ε > 0.
9. Show that the hypothesis of the previous problem holds if C is the collection of convex subsets of the unit square in ℝ².

10. Prove the same for the unit cube in ℝ^d for any finite dimension d.

11. If F is a class of indicators of measurable sets and ε > 0, show that N_{[]}^{(2)}(ε, F, P) = N_I(ε², F, P) (compare Corollary 7.2.13 and its proof).

12. Prove Corollary 7.2.13 directly by Bernstein's inequality (1.3.2).
Notes

Notes to Section 7.1. Theorem 7.1.5 is due to Blum (1955, Lemma 1) for families of (indicators of) sets and to DeHardt (1971, Lemma 1) for uniformly bounded families of functions. Mourier (1951; 1953, pp. 195–196) proved the law of large numbers in general separable Banach spaces, Corollary 7.1.8. I do not know a reference for Propositions 7.1.6 or 7.1.7. In the proof of Corollary 7.1.8, (Bochner or Pettis) integrals of Banach-valued functions (defined in Appendix E) were not assumed, so they had to be, in part, reconstructed. For a survey of laws of large numbers for empirical measures up to 1978, see Gaenssler and Stute (1979).

Notes to Section 7.2. Theorem 7.2.1 is due to Ossiander (1987). The shorter proof presented here is an expanded version of that of Arcones and Giné (1993, Theorem 4.10) and applies a technique from Andersen, Giné, Ossiander and Zinn (1988), who proved an extended bracketing central limit theorem. Corollary 7.2.13 was proved earlier, first for classes of sets in Dudley (1978), then for classes of functions in Dudley (1984).

Notes to Section 7.3. In Theorem 7.3.1, it is not hard to prove that (b) ⇒ (a). Durst and Dudley (1981) proved the equivalence of (a) and (b). Borisov (1981) discovered and announced the more difficult implication (b) ⇒ (c). I have not seen his proof.
References

Andersen, Niels Trolle, Giné, E., Ossiander, M., and Zinn, J. (1988). The central limit theorem and the law of the iterated logarithm for empirical processes under local conditions. Probab. Theory Related Fields 77, 271–305.
Arcones, Miguel A., and Giné, E. (1993). Limit theorems for U-processes. Ann. Probab. 21, 1494–1542.
Blum, J. R. (1955). On the convergence of empiric distribution functions. Ann. Math. Statist. 26, 527–529.
Borisov, I. S. (1981). Some limit theorems for empirical distributions (in Russian). Abstracts of Reports, Third Vilnius Conf. Probability Th. Math. Statist. 1, 71–72.
DeHardt, J. (1971). Generalizations of the Glivenko–Cantelli theorem. Ann. Math. Statist. 42, 2050–2055.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899–929; Correction, ibid. 7 (1979), 909–911.
Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de St.-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1–142.
Durst, Mark, and Dudley, R. M. (1981). Empirical processes, Vapnik–Červonenkis classes and Poisson processes. Probab. Math. Statist. (Wrocław) 1, no. 2, 109–115.
Gaenssler, Peter, and Stute, Winfried (1979). Empirical processes: a survey of results for independent and identically distributed random variables. Ann. Probab. 7, 193–243.
Mourier, Edith (1951). Lois de grands nombres et théorie ergodique. C. R. Acad. Sci. Paris 232, 923–925.
Mourier, E. (1953). Éléments aléatoires dans un espace de Banach. Ann. Inst. H. Poincaré 13, 161–244.
Ossiander, Mina (1987). A central limit theorem under metric entropy with L₂ bracketing. Ann. Probab. 15, 897–919.
8 Approximation of Functions and Sets
8.1 Introduction: the Hausdorff metric

In this chapter upper and lower bounds will be shown for the metric entropies of various concrete classes of functions on Euclidean spaces and sets in such spaces. Some metric entropies with and without bracketing are treated. Metrics for functions are those of ℒ^p, 1 ≤ p < ∞. For sets we use the d_P metrics, d_P(B, C) := P(BΔC), or the Hausdorff metric, defined as follows. For any metric space (S, d), x ∈ S, and a nonempty set B ⊂ S, let

d(x, B) := inf{d(x, y) : y ∈ B}.

For nonempty B, C ⊂ S, the Hausdorff pseudometric is defined by

h(B, C) := max(sup_{x∈B} d(x, C), sup_{x∈C} d(x, B)).
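For finite point sets the two suprema in this definition can be computed directly; a small illustrative sketch (the function names are my own):

```python
def hausdorff(B, C):
    """Hausdorff distance h(B, C) = max(sup_{x in B} d(x, C), sup_{x in C} d(x, B))
    for finite, nonempty point sets in R^d with the Euclidean metric."""
    def dist_to_set(x, A):
        # d(x, A) := inf{d(x, y) : y in A}
        return min(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) ** 0.5 for a in A)
    return max(max(dist_to_set(x, C) for x in B),
               max(dist_to_set(x, B) for x in C))
```

Note the asymmetry of the two inner terms: sup_{x∈B} d(x, C) alone is not a metric, which is why both directions are taken.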
Then h is a metric on the collection of bounded, closed, nonempty sets. Let h(∅, C) := h(C, ∅) := +∞ for C ≠ ∅, and h(∅, ∅) := 0.

On ℝ^d we have the usual Euclidean metric d(x, y) := |x − y| where |u| := (u₁² + ⋯ + u_d²)^{1/2}, u ∈ ℝ^d. For any set H ⊂ ℝ^{d−1} and function f from H into [0, ∞] let

J_f = J(f) := {x ∈ ℝ^d : 0 ≤ x_d ≤ f(x^{(d)}), x^{(d)} ∈ H},

where x^{(d)} := (x₁, …, x_{d−1}). Then J(f) is the subgraph of f. For any other function g ≥ 0 on H, clearly h(J_f, J_g) ≤ d_sup(f, g). Thus for any collection F of real functions ≥ 0 on H, and any ε > 0,

(8.1.1)  D(ε, {J_f : f ∈ F}, h) ≤ D(ε, F, d_sup).

If d_sup(f, g) ≤ ε and j := max(f − ε, 0), where g ≥ 0, then 0 ≤ j ≤ g ≤ f + ε, so J_j ⊂ J_g ⊂ J_{f+ε}. If P is a law on H × [0, ∞) having a density p
with respect to Lebesgue measure on ℝ^d with p(x) ≤ M < ∞ for all x, then

(8.1.2)  N_I(2Mε, {J_f : f ∈ F}, P) ≤ D(ε, F, d_sup).
In the converse direction there are corresponding estimates for Lipschitz functions. Recall that ||f||_L := sup_{x≠y} |f(x) − f(y)|/|x − y|. Then we have:

8.1.3 Lemma  If ||f||_L ≤ K and ||g||_L ≤ K on H ⊂ ℝ^{d−1}, with f ≥ 0 and g ≥ 0, then h(J_f, J_g) ≥ min(1, 1/K) d_sup(f, g)/2.

Proof  Let t := d_sup(f, g) and let 0 < s < t. By symmetry, assume that for some x ∈ H, f(x) > g(x) + s. Then for any y ∈ H, either |x − y| > s/(2K) or g(y) ≤ g(x) + s/2 < f(x) − s/2. In either case, if z ≤ g(y) then |(x, f(x)) − (y, z)| ≥ min(1, 1/K)s/2. Letting s ↑ t, the result follows.

Recall that a bounded number of Boolean operations preserves the Vapnik–Červonenkis property (Theorem 4.2.4). The same holds for classes of sets satisfying bounds on metric entropy (with inclusion). For any families C_j of subsets of a set X, extending the notation in Section 4.5, let

⊓_{j=1}^k C_j := {∩_{j=1}^k A_j : A_j ∈ C_j for all j},  ⊔_{j=1}^k C_j := {∪_{j=1}^k A_j : A_j ∈ C_j for all j}.
8.1.4 Theorem  Let (X, A, P) be a probability space and C_j ⊂ A for j = 1, …, k. Let f_{1,j}(ε) := log N_I(ε, C_j, P) and f_{2,j}(ε) := log D(ε, C_j, d_P) for j = 1, …, k. Let C_0 := ⊓_{j=1}^k C_j. If for i = 1 or 2 there are a γ > 0 and constants M_1, …, M_k such that f_{i,j}(ε) ≤ M_j ε^{−γ} for 0 < ε < 1 and j = 1, …, k, then the same holds for j = 0. The statements also hold, for i = 1 or 2, for C_0 := ⊔_{j=1}^k C_j.
Proof  For N_I, given 0 < ε < 1, for each j = 1, …, k take m_j ≤ exp(M_j(k/ε)^γ) brackets [A_{jr}, B_{jr}] covering C_j with P(B_{jr} \ A_{jr}) ≤ ε/k for r = 1, …, m_j. If A_{j,r(j)} ⊂ C_{j,r(j)} ⊂ B_{j,r(j)} for j = 1, …, k and some r(j), then

A(r) := ∩_{j=1}^k A_{j,r(j)} ⊂ C(r) := ∩_{j=1}^k C_{j,r(j)} ⊂ B(r) := ∩_{j=1}^k B_{j,r(j)}

and P(B(r) \ A(r)) ≤ k(ε/k) = ε. The result for N_I then follows with M_0 := k^γ(M_1 + M_2 + ⋯ + M_k). Without inclusions, and/or for ∪ instead of ∩, the proof is similar.
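The bracket-combination step of this proof can be imitated on a finite space. A sketch under my own illustrative setup (a ten-point space with interval-type brackets; names are hypothetical):

```python
from itertools import product

def combine_brackets(bracket_lists):
    """For classes C_1, ..., C_k with brackets (A, B), A ⊂ B (as frozensets),
    form the product brackets (∩_j A_j, ∩_j B_j) covering the class of
    intersections; the difference set satisfies ∩B_j \ ∩A_j ⊂ ∪_j (B_j \ A_j)."""
    combined = []
    for combo in product(*bracket_lists):
        A = frozenset.intersection(*(a for a, _ in combo))
        B = frozenset.intersection(*(b for _, b in combo))
        combined.append((A, B))
    return combined
```

With two lists of one-point-difference brackets, every combined bracket difference has at most two points, matching the bound P(B(r) \ A(r)) ≤ k(ε/k).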
8.2 Spaces of differentiable functions and sets with differentiable boundaries

For any α > 0, spaces of functions will be defined having "bounded derivatives through order α." If β is the largest integer < α, the functions will have partial derivatives through order β bounded, and the derivatives of order β will satisfy a uniform Hölder condition of order α − β. Still more specifically: for x := (x₁, …, x_d) ∈ ℝ^d and p = (p₁, …, p_d) ∈ ℕ^d (where ℕ is the set of nonnegative integers) let

[p] := p₁ + ⋯ + p_d,  D^p := ∂^{[p]}/∂x₁^{p₁} ⋯ ∂x_d^{p_d}.

For a function f on an open set U ⊂ ℝ^d having all partial derivatives D^p f of orders [p] ≤ β defined everywhere on U, let

||f||_α := ||f||_{α,U} := max_{[p]≤β} sup_{x∈U} |D^p f(x)| + max_{[p]=β} sup_{x≠y} |D^p f(x) − D^p f(y)|/|x − y|^{α−β}.

Here if 0 < α ≤ 1, so β = 0, D^0 f := f. Let I^d denote the unit cube {x ∈ ℝ^d : 0 ≤ x_j ≤ 1, j = 1, …, d}, and x^{(d)} := (x₁, …, x_{d−1}) for x ∈ ℝ^d. Let F ⊂ ℝ^d be a closed set which is the closure of its interior U. Let F_{α,K}(F) denote the set of all continuous f : F → ℝ with ||f||_{α,U} ≤ K.
For α = 1, F_{1,K}(F) is the set of bounded Lipschitz functions f on F with max(||f||_sup, ||f||_L) ≤ K. Then, recalling the bounded Lipschitz norm ||f||_BL := ||f||_L + ||f||_sup, we have

{f : ||f||_BL ≤ K} ⊂ F_{1,K}(F) ⊂ {f : ||f||_BL ≤ 2K}.

Let G_{α,K,d} := F_{α,K}(I^d), and for d ≥ 2 let C(α, K, d) be the class of subgraphs

J_f = J(f) = {x ∈ I^d : 0 ≤ x_d ≤ f(x^{(d)})},  f ∈ G_{α,K,d−1},  f ≥ 0.

If g and h are two functions defined for (small enough) y > 0, then g ≍ h (as y ↓ 0) means that

0 < liminf_{y↓0}(g/h)(y) ≤ limsup_{y↓0}(g/h)(y) < +∞.

Clearly, if f ∈ G_{α,K,d} and [p] < α then D^p f ∈ G_{α−[p],K,d}. Let B_d := {x ∈ ℝ^d : |x| < 1} and B̄_d := {x ∈ ℝ^d : |x| ≤ 1} be the open and closed unit balls, respectively, in ℝ^d. Here are some bounds on metric entropies, which
by Ossiander's theorem (7.2.1) will give that if α > d/2, G_{α,K,d} is a Donsker class for any law P, and C(α, K, d+1) is for laws with bounded densities.

8.2.1 Theorem  For 0 < K < ∞, 0 < α < ∞, and d ≥ 1, as ε ↓ 0,

log D(ε, G_{α,K,d}, d_sup) ≍ ε^{−d/α}.

For some T := T(α, K, d) and 0 < ε < 1,

log D(ε, C(α, K, d+1), h) ≤ Tε^{−d/α}.

If Q is a law on I^{d+1} having a density with respect to Lebesgue measure bounded by M, then for some M₁ = M₁(M, d, K, α),

log N_I(ε, C(α, K, d+1), Q) ≤ M₁ε^{−d/α},  0 < ε < 1.

The same statements all hold for B̄_d in place of I^d, and so for F_{α,K}(B̄_d) in place of G_{α,K,d}, with possibly larger constants T and M₁.
Notes. For α ≥ 1, in the statement about h, the order ε^{−d/α} is precise; see Corollary 8.2.13. The d_sup bound implies finiteness of L¹ bracketing numbers, so that G_{α,K,d} is a Glivenko–Cantelli class for any law, for any α > 0, K < ∞, and d, by the Blum–DeHardt theorem (7.1.5).
Proof  First, the proof will be done for the cube I^d. The upper bound for N_I follows from the one for d_sup by way of (7.1.2) and (7.1.3). The inequalities for classes of sets (in h and for N_I) follow directly from the one for functions in d_sup by (8.1.1) and (8.1.2). So only the bounds for d_sup need to be proved.

For each f ∈ G_{α,K,d}, x ∈ I^d, and x + h ∈ I^d write the Taylor series with remainder

(8.2.2)  f(x + h) − Σ_{k=0}^β Q_k(x, h) = R(x, h),

where for each x, Q_k(x, ·) is a homogeneous polynomial of degree k in h, and by the mean value theorem,

(8.2.3)  |R(x, h)| ≤ C|h|^α

for some constant C = C(d, K, α). Then C ≥ 1 will be taken large enough so
that whenever [p] ≤ β we also have

(8.2.4)  |R_p(x, h)| := |D^p f(x + h) − Σ_{k=0}^{β−[p]} Q_{k,p}(x, h)| ≤ C|h|^{α−[p]},

where for each k, p, and x, Q_{k,p}(x, ·) is also a homogeneous polynomial of degree k.
To prove one half of the first conclusion in Theorem 8.2.1, it needs to be shown that

(8.2.5)  limsup_{ε↓0} [log D(ε, G_{α,K,d}, d_sup)] ε^{d/α} < ∞.

Given 0 < ε < 1, let Δ := (ε/(4C))^{1/α} < 1. Let x^{(1)}, …, x^{(s)} be a Δ/2-net in I^d, that is, sup{inf_{j≤s} |x − x^{(j)}| : x ∈ I^d} ≤ Δ/2. Here we can take s ≤ M₂ε^{−d/α} for some constant M₂ = M₂(d, C); specifically, one can choose the x^{(j)} as the centers of the cubes of a decomposition of I^d into cubes of side 1/m, where m is the least integer ≥ d^{1/2}/Δ. For each multi-index p with [p] = k ≤ β let p! := ∏_{j=1}^d p_j!. Then for f ∈ G_{α,K,d} let Q^{(p)}(x, h) := (D^p f)(x)h^p/p!. Thus in (8.2.2),

(8.2.6)  Q_k(x, h) = Σ_{[p]=k} Q^{(p)}(x, h).

Let ε_k := ε/(2Δ^k e^d), k = 0, 1, …, and

A_{i,p} = A_{i,p}(f) := [D^p f(x^{(i)})/ε_{[p]}],  i = 1, …, s,  [p] = k ≤ β,

where [x] denotes the largest integer ≤ x for −∞ < x < ∞. Given some A := {A_{i,p} : i ≤ s, [p] ≤ β} let

G_{α,K,d}(A) := {f ∈ G_{α,K,d} : A_{i,p}(f) = A_{i,p} for all i ≤ s, [p] ≤ β}.
ga,K,d(A) := (f E Ga,K,d: Ai,p(f) = A1,p for all i < s, [p] 8.2.7 Lemma If f, g E cca,K,d(A) for some A then sup 11(f - g) (x) I
: x E id) < E.
Proof Let F := f - g. Whenever [p] < P and i < s, we will have (D'F)(x(t))I < E[P]. Also, by (8.2.4),
IDPF(x + h) - DPF(x)I <
2CIhIa-#
if [p] =
For each y E Id take an x(;) with I y - x(1) I < A /2. Then from (8.2.2), (8.2.3),
8.2 Spaces of differentiable functions and sets
255
and (8.2.6) with h := y - x(;),
2CIhla+sk T, IhPllp! k=0
[p]=k
0
2C0a + L EkEk I k=0
1/p! [p]=k
l
ed maxk<,8Ektk + E/2 < E, proving the lemma.
Now continuing the proof of Theorem 8.2.1, it follows that D(ε, G_{α,K,d}, d_sup) ≤ N_{α,K,d}, where N_{α,K,d} is the number of distinct nonempty sets G_{α,K,d}(A). Let the x^{(j)} be ordered so that for 1 < j ≤ s, |x^{(i)} − x^{(j)}| ≤ Δ for some i < j. Such an ordering clearly exists for d = 1. Then by induction on d, we enumerate sub-cubes beginning and ending with sub-cubes at vertices of I^d, where we first enumerate cubes on one face I^{d−1}, then on the adjoining level, and so on.

Now suppose f ∈ G_{α,K,d}(A) (so that this set is nonempty) and suppose given the values A_{i,r} for i < j, for some j ≤ s. Choose i < j such that |x^{(i)} − x^{(j)}| ≤ Δ. Take the Taylor expansion as in (8.2.2) with x = x^{(i)}, h = x^{(j)} − x^{(i)}. By (8.2.4) and (8.2.6),

|D^p f(x^{(j)}) − Σ_{k=0}^{β−[p]} Σ_{[q]=k} D^{p+q} f(x^{(i)}) h^q/q!| ≤ C|h|^{α−[p]}.
Now |D^{p+q} f(x^{(i)}) − A_{i,p+q} ε_{k+[p]}| ≤ ε_{[p+q]} for [q] = k = 0, 1, …, β − [p]. Let

D(j, f, p) := [D^p f(x^{(j)}) − Σ_{k=0}^{β−[p]} Σ_{[q]=k} A_{i,p+q} ε_{k+[p]} h^q/q!] / ε_{[p]}.

Then by the latter two inequalities, and since ε_{[p]+k}Δ^k = ε_{[p]},

|D(j, f, p)| ≤ (CΔ^{α−[p]} + Σ_{k=0}^{β−[p]} ε_{[p]+k} Δ^k Σ_{[q]=k} 1/q!)/ε_{[p]} ≤ 2Ce^dΔ^α/ε + e^d = e^d/2 + e^d = 3e^d/2.

So for f ∈ G_{α,K,d}, given the A_{i,r}(f) for i < j, there are at most 3e^d + 2 ≤ 4e^d possible values of A_{j,p} for a given p. The number of different p ∈ ℕ^d with
[p] ≤ β is bounded above by (β+1)^d. Thus, for j > 1, the number of possible sets of values {A_{j,p}}_{[p]≤β} is at most (4e^d)^{(β+1)^d}, while for j = 1, since |D^p f| ≤ K, it is at most (2(2K+1)e^d/ε)^{(β+1)^d}. Thus by Lemma 8.2.7,

N_{α,K,d} ≤ (2e^d(2K+1)(4e^d)^s/ε)^{(β+1)^d} ≤ exp((β+1)^d {(d + log 4)M₂ε^{−d/α} + log(e^d(4K+2)/ε)}) ≤ exp(Jε^{−d/α})

for some J = J(d, α, K) not depending on ε, proving (8.2.5). In the other direction we need to prove

(8.2.8)  liminf_{ε↓0} [log D(ε, G_{α,K,d}, d_sup)] ε^{d/α} > 0.
Let f be a C^∞ function on ℝ^d, 0 outside I^d and positive on its interior, such as f(x) := ∏_{j=1}^d g(x_j), where

(8.2.9)  g(t) := exp(−1/t) exp(−1/(1−t)) for 0 < t < 1, and g(t) := 0 otherwise.
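The bump function g of (8.2.9) is easy to check numerically; a minimal sketch (its maximum e^{−4}, attained at t = 1/2, matches the value s = e^{−4d} for f noted in the proof):

```python
import math

def g(t):
    """The C-infinity bump of (8.2.9): exp(-1/t) * exp(-1/(1-t)) on (0,1), 0 elsewhere."""
    if 0 < t < 1:
        return math.exp(-1.0 / t) * math.exp(-1.0 / (1.0 - t))
    return 0.0
```

Since g(t) = exp(−(1/t + 1/(1−t))) and 1/t + 1/(1−t) is minimized at t = 1/2, the maximum of g is g(1/2) = e^{−4}.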
For m = 1, 2, …, decompose I^d into m^d sub-cubes A_{mi} of side 1/m, i = 1, …, m^d. Let x^{(i)} be the vertex of A_{mi} closest to the origin, in other words the vertex of A_{mi} at which all coordinates are smallest. Given α > 0, set f_i(x) := m^{−α} f(m(x − x^{(i)})). Note that x ↦ m(x − x^{(i)}) is a 1–1 affine map of ℝ^d onto itself, taking the interior of A_{mi} onto that of I^d. Thus f_i(x) > 0 if and only if x is in the interior of A_{mi}. Let s := sup_x f(x) (= e^{−4d} for our f). For any S ⊂ {1, …, m^d} let f_S := Σ_{i∈S} f_i. Then ||f_S||_α ≤ ||f||_α =: B, while for S ≠ T, sup_x |(f_S − f_T)(x)| = m^{−α}s. Thus for ε_m := Km^{−α}s/(3B) we have D(ε_m, G_{α,K,d}, d_sup) ≥ 2^{m^d} ≥ exp(Cε_m^{−d/α}) for some C = C(K, d, α, s) not depending on m. Since ε_{m+1}/ε_m → 1 as m → ∞, this is enough to prove (8.2.8) and so to finish the proof of Theorem 8.2.1 for I^d.

Now it will be shown how to adapt the proof to the ball B̄_d in place of I^d. In S^{d−1} := {x ∈ ℝ^d : |x| = 1}, for 0 < ε ≤ 1, take D(ε, S^{d−1}, e) points at distances > ε apart, where e is the Euclidean metric on ℝ^d. The balls of radius
ε/2 with centers at these points are disjoint. Thus by volumes, then by the mean value theorem,

D(ε, S^{d−1}, e)(ε/2)^d ≤ (1 + ε/2)^d − (1 − ε/2)^d ≤ dε(1 + ε/2)^{d−1} ≤ dε(3/2)^{d−1},

and so D(ε, S^{d−1}, e) ≤ 2d(3/ε)^{d−1}. Take a maximal set S in S^{d−1} of points at distances > ε/2 apart, with |S| ≤ 2d(6/ε)^{d−1}. Let W consist of 0 and all points tx for x ∈ S and t = jε/2, j = 1, 2, …, [2/ε]. Then W is ε-dense in B̄_d and

|W| ≤ 1 + 2d(2/ε)(6/ε)^{d−1} ≤ 4d(6/ε)^d.
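The packing bound D(ε, S^{d−1}, e) ≤ 2d(3/ε)^{d−1} can be compared with an actual greedy packing in the case d = 2; an illustrative sketch (the greedy construction and grid resolution are my own choices, not from the text):

```python
import math

def greedy_packing_on_circle(eps, candidates=10000):
    """Greedily select points on the unit circle S^1 that are pairwise more than
    eps apart in the Euclidean metric; the result is an eps-packing, whose size
    is controlled by the volume bound 2d(3/eps)^{d-1} with d = 2."""
    pts = []
    for k in range(candidates):
        theta = 2 * math.pi * k / candidates
        p = (math.cos(theta), math.sin(theta))
        if all(math.dist(p, q) > eps for q in pts):
            pts.append(p)
    return pts
```

For ε = 0.5 the greedy packing has about a dozen points, comfortably below the bound 2·2·(3/0.5) = 24.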
Then, starting at 0 and moving outward along each segment {tx : 0 ≤ t ≤ 1}, x ∈ S, through points in W, one can do the same proof as for (8.2.5) in I^d, except for larger constants J, M₁. For a lower bound of the form (8.2.8), note that B̄_d includes a cube of side d^{−1/2} centered at 0. This finishes the proof of Theorem 8.2.1. Next, some lower bounds for metric entropies in the L¹ norm will be given.
For a collection F ⊂ ℒ¹(X, A, P), we use the L¹ distance d_{1,P}(f, g) := P(|f − g|).

8.2.10 Theorem  Let P be a law on I^d having a density with respect to Lebesgue measure bounded below by γ > 0. Then for some C = C(γ, α, K, d) > 0 and 1 ≤ r < ∞,

N_{[]}^{(r)}(ε, G_{α,K,d}, P) ≥ N_{[]}^{(1)}(ε, G_{α,K,d}, P) ≥ D(ε, G_{α,K,d}, d_{1,P}) ≥ exp(Cε^{−d/α})

for ε small enough, and if d ≥ 2, for small enough ε > 0 and M := C(γ, α, K, d−1),

N_I(ε, C(α, K, d), P) ≥ D(ε, C(α, K, d), d_P) ≥ exp(Mε^{−(d−1)/α}).

Proof  The following combinatorial fact will be used:
8.2.11 Lemma  Let B be a set with n elements, n = 0, 1, …. Then there exist subsets E_i ⊂ B, i = 1, …, k, where k ≥ e^{n/6}, such that for i ≠ j, the symmetric difference E_i Δ E_j has at least n/5 elements.

Proof  For any set E ⊂ B, the number of sets F ⊂ B such that card(E Δ F) ≤ n/5 is 2^n B(n/5, n, 1/2), where binomial probabilities B(k, n, p) are as defined before the Chernoff–Okamoto inequalities (1.3.8). If S_n is the sum of n independent Rademacher variables X_i taking values ±1 with probability 1/2 each, then by one of Hoeffding's inequalities (1.3.5), defining "success" as X_i = −1,

B(n/5, n, 1/2) = Pr(S_n ≥ 3n/5) ≤ exp(−9n/50) ≤ e^{−n/6}.

This implies the lemma.
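The Hoeffding step B(n/5, n, 1/2) ≤ e^{−n/6} can be checked against the exact binomial tail; a small sketch (the instance n = 50 is my own choice):

```python
import math

def binom_tail(m, n):
    """B(m, n, 1/2) = P(Bin(n, 1/2) <= m), the binomial probability of Lemma 8.2.11."""
    return sum(math.comb(n, k) for k in range(m + 1)) / 2 ** n
```

For n = 50 the exact tail at n/5 = 10 is about 1.2·10^{−5}, well below exp(−9n/50) ≈ 1.2·10^{−4} and exp(−n/6) ≈ 2.4·10^{−4}.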
Now to prove Theorem 8.2.10, the first inequality follows from (7.1.2) and the second is also straightforward. For the lower bound on D, let us use again the construction in the proof of (8.2.8). Let λ denote Lebesgue measure on I^d and δ := γ ∫ f dλ for the f in (8.2.9). Then for each i, ∫ f_i dP ≥ δm^{−α−d}. Applying Lemma 8.2.11 to B := {1, …, m^d} gives at least e^{m^d/6} sets S with card(S Δ T) ≥ m^d/5 for S ≠ T, and then ∫ |f_S − f_T| dP = ∫ f_{SΔT} dP ≥ δm^{−α}/5. So D(δm^{−α}/6, G_{α,K,d}, d_{1,P}) ≥ exp(m^d/6). Thus if 0 < ε ≤ δ/6, since [x] ≥ x/2 for x ≥ 1,

D(ε, G_{α,K,d}, d_{1,P}) ≥ exp([(δ/(6ε))^{1/α}]^d/6) ≥ exp(2^{−d}(δ/(6ε))^{d/α}/6),

proving the statement about G_{α,K,d}. If C = J(f) and D = J(g) then

d_P(C, D) := P(C Δ D) ≥ γλ(C Δ D) = γ ∫ |f − g| dλ,

so for ε > 0,

N_I(ε, C(α, K, d), P) ≥ D(ε, C(α, K, d), d_P) ≥ D(ε/γ, G_{α,K,d−1}, d_{1,λ}),

which finishes the proof of Theorem 8.2.10. To get lower bounds for the Hausdorff metric, the following will help:
8.2.12 Lemma  If α ≥ 1 and f, g ∈ G_{α,K,d}, then

h(J_f, J_g) ≥ d_sup(f, g)/(2 max(1, Kd)).

Proof  Note that for α ≥ 1, any g ∈ G_{α,K,d} is Lipschitz in each coordinate by the mean value theorem, with |g(x) − g(y)| ≤ K|x − y| if x_j = y_j for all but one value of j. In the cube, one can go from a general x to a general y by changing one coordinate at a time, so g is Lipschitz with ||g||_L ≤ Kd and Lemma 8.1.3 applies.
8.2.13 Corollary  If α ≥ 1 and d = 1, 2, …, then as ε ↓ 0,

log D(ε, C(α, K, d+1), h) ≍ ε^{−d/α}.

Proof  This follows from Lemma 8.2.12 and the first and third statements in Theorem 8.2.1.
8.2.14 Remark.  For m = 1, 2, …, let I^d be decomposed into a grid of m^d sub-cubes of side 1/m. Let E be the set of centers of the cubes. For any A ⊂ I^d let B ⊂ E be the set of centers of the cubes in the grid that A intersects. Then h(A, B) ≤ d^{1/2}/(2m), which includes the possibility that A = B = ∅. For 0 < ε ≤ 1 there is a least m = 1, 2, … such that d^{1/2}/m ≤ ε, namely m = ⌈d^{1/2}/ε⌉. It follows that

D(ε, 2^{I^d}, h) ≤ 2^{(1+d^{1/2}/ε)^d}.

Hence for α < d/(d+1), Corollary 8.2.13 cannot hold, nor can the upper bound for h in Theorem 8.2.1 be sharp.
The classes C(α, K, d) considered so far contain sets with flat faces except for one curved face. There are at least two ways to form more general classes of sets with piecewise differentiable boundaries, still satisfying the bounds in Theorem 8.2.1. One is to take a bounded number of Boolean operations. Let v₁, …, v_k be nonzero vectors in ℝ^d where d ≥ 2. For constants c₁, …, c_k let H_j := {x ∈ ℝ^d : (x, v_j) = c_j}, a hyperplane. Let π_j map each x ∈ ℝ^d to its nearest point in H_j, π_j(x) := x − ((x, v_j) − c_j)v_j/|v_j|². Let T_j be a cube in H_j and α, K > 0. Let f_j be a linear transformation taking T_j onto I^{d−1}. For g ∈ G_{α,K,d−1}, let

J_j(g) := {x ∈ ℝ^d : π_j(x) ∈ T_j, c_j ≤ (v_j, x) ≤ c_j + g(f_j(π_j(x)))},

and let C_j := {J_j(g) : g ∈ G_{α,K,d−1}, g ≥ 0}. Then Theorem 8.2.1 implies that if 𝕂 is a compact set in ℝ^d (e.g., a cube) including all sets in C_j, and P is a law on 𝕂 having bounded density with respect to Lebesgue measure λ, then for some M_j < ∞,

log D(ε, C_j, d_P) ≤ log N_I(ε, C_j, P) ≤ M_j ε^{−(d−1)/α}.

By Theorem 8.1.4 we then have the following:
8.2.15 Theorem  Let d ≥ 2 and let C₀ := ⊓_{j=1}^k C_j or C₀ := ⊔_{j=1}^k C_j, for C_j as just defined. Then for some M < ∞,

log D(ε, C₀, d_P) ≤ log N_I(ε, C₀, P) ≤ Mε^{−(d−1)/α}.
By intersections or unions of k sets in classes C_j (with k depending on d), one can obtain sets with smooth boundaries (through order α) such as ellipsoids; see Problem 5. One can also get more general sets since, for example, for α > 1, the minimum or maximum of two functions in G_{α,K,d} need not have first derivatives everywhere and then will not be in G_{γ,K,d} for any γ > 1 and K < ∞.

Recall that a C^∞ real-valued function on an open set in ℝ^d is one such that the partial derivatives D^p f exist for all p ∈ ℕ^d and are continuous. For a function f := (f₁, …, f_k) into ℝ^k to be C^∞ means that each f_j is. Another way to generate sets with boundaries differentiable of order α is as follows. The unit sphere S^{d−1} := {x ∈ ℝ^d : |x| = 1} is a C^∞ manifold, specifically as follows. S^{d−1} is the union of two sets

A := {x ∈ S^{d−1} : x₁ ≥ −1/2} and C := {x ∈ S^{d−1} : x₁ ≤ 1/2}.

There is a one-to-one C^∞ function ψ from {x ∈ ℝ^{d−1} : |x| < 9/8} into ℝ^d, with derivative matrix (∂ψ_i/∂x_j)_{i,j} of maximum rank d − 1 everywhere, such that ψ takes B̄_{d−1} := {x ∈ ℝ^{d−1} : |x| ≤ 1} onto A. Let η(y) := (−ψ₁(y), ψ₂(y), …, ψ_d(y)). Then the above statements for ψ and A also hold for η and C.

For 0 < α, K < ∞ let F_{α,K}(S^{d−1}) be the set of functions h : S^{d−1} → ℝ such that h∘ψ and h∘η ∈ F_{α,K}(B̄_{d−1}), recalling that f∘g(y) := f(g(y)). Let F^{(d)}_{α,K}(S^{d−1}) be the set of functions h = (h₁, …, h_d) into ℝ^d such that h_j ∈ F_{α,K}(S^{d−1}) for each j = 1, …, d.

Two continuous functions F, G from one topological space X to another, Y, are called homotopic if there exists a jointly continuous function H from X × [0, 1] into Y such that H(·, 0) = F and H(·, 1) = G. H is then called a homotopy of F and G. Let I(F) be the set of all y ∈ Y, not in the range of F, such that among mappings of X into Y \ {y}, F is not homotopic to any constant map G(x) ≡ z ≠ y. For a function F let R(F) := ran(F) := range(F) and C(F) := I(F) ∪ R(F). For example, if F is the identity from S^{d−1} onto itself in ℝ^d, then I(F) = {y : |y| < 1} by well-known facts in algebraic topology; see, for example, Eilenberg and Steenrod (1952, Chapter 11, Theorem 3.1).
Let I(d, α, K) := {I(F) : F ∈ F^{(d)}_{α,K}(S^{d−1})} and 𝒦(d, α, K) := {C(F) : F ∈ F^{(d)}_{α,K}(S^{d−1})}. Then I(d, α, K) is a collection of open sets and 𝒦(d, α, K) of compact sets, each of which, in a sense, has boundaries differentiable of order α. (For functions F that are not one-to-one, the boundaries may not be differentiable in some other senses.) For 𝒦(d, α, K) and, to some extent, for I(d, α, K) there are bounds as for other classes of sets with α times differentiable boundaries (Theorem 8.2.15):
8.2.16 Theorem  For each d = 2, 3, …, K ≥ 1, and α ≥ 1:

(a) There is a constant H_{d,α,K} < ∞ such that for 0 < ε ≤ 1 and the Hausdorff metric h,

log D(ε, 𝒦(d, α, K), h) ≤ H_{d,α,K}/ε^{(d−1)/α}.

(b) For any S < ∞ there is a constant A_{d,α,K,S} < ∞ such that for any law P on ℝ^d having density with respect to λ_d bounded above by S, for 0 < ε ≤ 1,

max(log N_I(ε, 𝒦(d, α, K), P), log N_I(ε, I(d, α, K), P)) ≤ A_{d,α,K,S}/ε^{(d−1)/α}.
The proof will follow from a sequence of lemmas.
8.2.17 Lemma  If H is a homotopy of F and G, then I(F) Δ I(G) ⊂ R(H).

Proof  Suppose y ∈ I(F) \ I(G) and y is not in the range of H. Then F and G are homotopic as maps into Y \ {y}. Homotopy is clearly a transitive relation. Since G is homotopic to a constant map into Y \ {y}, so is F, a contradiction; the case y ∈ I(G) \ I(F) is symmetric. The lemma is proved.
8.2.18 Lemma  If F is a continuous map from a compact Hausdorff topological space K into a Hausdorff space Y, then C(F) is closed.

Proof  If y ∉ C(F), then there is a homotopy H of F to a constant map G(x) ≡ z into Y \ {y}. Then R(H) is compact, thus closed (RAP, Theorem 2.2.3 and Proposition 2.2.9). Clearly I(G) = ∅, so by the previous lemma, I(F) ⊂ R(H). Then y is in the open set Y \ R(H) ⊂ Y \ C(F), so Y \ C(F) is open and C(F) is closed.

8.2.19 Lemma  If F is a continuous map from a compact Hausdorff topological space into some ℝ^d, then I(F) is open, and its boundary is included in R(F).

Proof  Let x ∈ I(F). Then for some δ > 0, d(x, y) < 2δ implies y ∉ R(F). Let d(x, y) < δ. Then there is a homeomorphism g of ℝ^d which leaves
{u : (u - xl > 28}, and so the values of F, fixed and takes x to y. T o define such a g we can assume x = 0 and 8 = 1/2. Let g(u) := a for Iul > 1 and g(u) := u + y(1 - Jul) for Jul < 1. Then g is the identity for Jul > 1 and is continuous, with g(0) = y. Also, gis 1-1 since Ig(u)l < 1 for Jul < 1,
andifg(u)=g(v)with lul,lvl < 1, then u-v=y(lul-lvl)andlu-vl < I(lul - Ivl)I/2 < lu - v1/2, so u = v. Thus Y E I(F), so I(F) is open. Since C(F) is closed by Lemma 8.2.18, it follows that the boundary of I (F) is included in R(F). Recall that for a metric space (S, d), a set A C S and 8 > 0, the 8-interior of A is defined by S A := Ix: d (x, y) < 8 implies y E A), and the S-neighborhood
by AS := {y: d(x, y) < S for some x E A}. 8.2.20 Lemma For continuous functions F, G from Sd-1 into Rd, if
ds p(F, G) := sup {I F(u) - G(u)J: u E Sd-1 } < 8, then
sI(F) C I(G) C C(G) C C(F)a. Proof If x E E 1 (F) and x E R(G), then d (x, y) < 8 for some y E R(F), so
y V I(F), a contradiction. Thus x V R(G). For 0< t < 1 and u E Sd-1 let H(u, t) := (1 - t)F(u) + tG(u). Then H is a homotopy of F and G, and R(H) C R(F)s, but I (F) n R(F) = 0, so x R(H). Thus by Lemma 8.2.17, x E I (G).
Next, let y E C(G). If Y E R(G) then y E R(F)s C C(F)s. Otherwise y E I (G). Then y E I (F) or by Lemma 8.2.17, y E R(H) C R(F)S.
8.2.21 Lemma For any continuous function F from a compact Hausdorff space K into Rd, C(F)8 \ sI (F) C R(F)s.
Proof Let x E C(F)' \ SI(F). Suppose d(x, R(F)) > S. Then Ix - yl < 8 for some y E I(F). Since the boundary of I(F) is included in R(F) by Lemma 8.2.19, the line segment {tx + (1 - t)y: 0 < t < 11 C I(F). It follows that x E I(F) and then likewise that z E I(F) whenever Ix - z1 < S. Thus x E SI(F), a contradiction. The Lipschitz seminorm II FII z, is defined for functions with values in R' just
as for real-valued functions. Let vk be the Lebesgue volume of the unit ball in Rk.
8.2 Spaces of differentiable functions and sets
263
8.2.22 Lemma For k = 1, 2, , if (T, d) is a metric space, 8 > 0, for some M < oo, D(8, T, d) < M81-x, and F is Lipschitz from T into 1[8x, with II FII L < K, then Ad
(C(F)s \ sI (F)) < vkM(K + 2)x8.
Proof For the usual metric e on 1[8x, we have D(K8, R(F), e) < It follows that D((K + 2)8, R(F)s, e) < M81-x. Lemma 8.2.21 gives the M81-k
conclusion.
Proof of Theorem 8.2.16. By the definitions, a function F E F(l) a, K (Sd-1) is given by a pair F(1), F(2) of functions F(j) :_ (F(j)1, F(j)d) where each
F(f)i E Fa,K(Bd-1) and Bj is the closed unit ball in R. Since a > 1, each F(j), is Lipschitz with II F(j)i II L < K, so each F(j) is Lipschitz with 11F(j)ft < M. Let T := T1 U T2 be a union of two disjoint copies Ti of Bd_1, with the Euclidean metric eon each and e(x, y) := 2 forx E Ti, y E Tj, i j. Letting G := F(j) on Tj, j = 1, 2, gives a function G := GF on T with (8.2.23)
IIGIIL < maxj IIF(j)IIL < dK.
Let 0 < s < 1. Then there are D(s, B, e) disjoint balls of radius s/2, included in a ball of radius 1 + (s/2). It follows by volumes that (8.2.24)
D(s, Bj,e) < [(2 + s)/2]i(2/s)i < (3/e)j.
By Theorem 8.2.1 for the ball case, for any K > 1, d > 2, and a > 1, there is
aC=C(K,d,a) < oo such that for 0 < 8 < 1, log D(8, Fa,K(Bd-1), dsup) <
C/8(d-1)/a.
It follows from the definitions with T := T1 U T2 that
log D(8,
.Fa,K(Sd-1psup)
<
2C/6(d-1)/a
and thus that for 0 < 8 < 1, log D(8, .Tadk(Sd-1dsuP < set
2C(d/8)(d-1)/a.
Given 0 < 8 < 1, take a
of functions f1, , fm E .Ta,K(Sd-1) with dsup(f , f) > 8/2 for i j and maximal m where m < and i d := 2C (2d)(d-1)/a. The brackets [sI (f ), C(f)s] for j = 1, ,m cover I (a, K, d) and 1C(a, K, d) by Lemma 8.2.20 (some sets s I (f) may be empty). If f) < 8/2 then by Lemma 8.2.20, C(g) C C(f )s and C(fj) C C(g)s,soh(C(g), C(f )) < 8andpart (a) of Theorem 8.2.16follows. Then Lemma 8.2.22 applies with M = 2 3d-1 by (8.2.24) and K = Kd by exp(Pd/8(d-1)/a)
264
Approximation of Functions and Sets
(8.2.23). It also holds for P in place of Ad with an additional factor of
Theorem 8.2.16(b) then follows with 8 := s/[2vd 3d-1(Kd + 2)d] and so
Ad,a,K, := I8d[2W 3d-1(Kd
+2)d](d-1)/a.
8.2.25 Corollary For any law P on Rd, d > 2, having bounded density with respect to Lebesgue measure, and K < oo,
(a) (Tze-Gong Sun) I (d, a, K) is a Donsker class for P if a > d - 1; (b) I (d, a, K) is a Glivenko-Cantelli class for P whenever a > 1.
Proof Apply 8.2.16(b) and, for part (a), Corollary 7.2.13; for part (b), the Blum-DeHardt theorem (7.1.5).
8.3 Lower layers A set B C Rd is called a lower layer if and only if for all x = (xl, , xd) E Bandy= (yl, , yd) with yj < xjfor j = 1, , d, we have y E B. Let LGd denote the collection of all nonempty lower layers in Rd with nonempty complement. Let 0 be the empty set and let
GGd,l := {L fl Id : L E GGd, L fl Id # 0}. Let ,l := )4 denote Lebesgue measure on Id. Recall that f
g means f/g -. 1. The size of GCd,l will be bounded first when d = 1 and 2. Let [xl be the smallest integer > x.
8.3.1 Theorem Ford = 1, D(e, GG1,1, h) = D(e,,Cr1, d)j = Nr(s, L 121 1, ),) = [1/c]. Ford = 2, any m = 1, 2,
, and 0 < t < 21/2/m, we have
max (Nj(2/m, GG2, A) , D(2 1/2/M, GG2,1, h))
< (2m - 2)
m-1
D(t GG2,1, h)
For 0 < s < 1, N, (E, GG2,1, )'I) < 42/E and
D(s, GG2,1, h) < exp ((21/2log4)
/s).
8.3 Lower layers
265
Proof For d = 1, sets in LL_{1,1} are intervals [0, t), 0 < t ≤ 1, or [0, t], 0 ≤ t ≤ 1. For any ε with 0 < ε ≤ 1, let m := m(ε) := ⌈1/ε⌉. Then the collection of m brackets [[0, (k-1)ε], [0, kε]], k = 1, ..., m, covers LL_{1,1} with minimal m for ε, showing that N_I(ε, LL_{1,1}, λ) = m. For any δ > 0, the points k(ε + δ) for k = 0, 1, ..., ⌊1/(ε + δ)⌋, are at distances at least ε + δ apart. For 0 ≤ x < y ≤ 1, h([0, x], [0, y]) = d_λ([0, x], [0, y]) = y - x, so
D(ε, LL_{1,1}, h) = D(ε, LL_{1,1}, d_λ) = D(ε, [0, 1], d) =: D(ε)
for the usual metric d. Letting δ ↓ 0 gives D(ε) = ⌈1/ε⌉ = m, finishing the proof for d = 1. For d = 2, decompose the unit square I² into a union of m² squares
S_{ij} := [(i-1)/m, i/m) × [(j-1)/m, j/m), i, j = 1, ..., m,
but for i = m or j = m, replace "i/m)" or "j/m)" respectively by "1]". For any L ∈ LL_{2,1}, let mL be the union of the squares in the grid included in L and let Lm be the union of the squares which intersect L. Then mL ⊂ L ⊂ Lm and both mL and Lm are in LL_{2,1} ∪ {∅}.
For each m and each function f from {2, 3, ..., 2m-1} into {0, 1} taking the value 1 exactly m-1 times, define a sequence S(f)(k), k = 1, ..., 2m-1, of squares in the grid as follows. Let S(f)(1) be the upper left square S_{1m}. Given S(f)(k-1) = S_{ij}, let S(f)(k) be the square S_{i+1,j} just to its right if f(k) = 1, otherwise the square S_{i,j-1} just below it, for k = 2, ..., 2m-1. Then S(f)(2m-1) is always the lower right square S_{m1}. Let B_m(f) := ∪_{k=1}^{2m-1} S(f)(k). Let A_m(f) be the union of the squares below and to the left of B_m(f), and C_m(f) := A_m(f) ∪ B_m(f). Here A_m(f) and C_m(f) belong to LL_{2,1} ∪ {∅}. Also, if f ≠ g then h(C_m(f), C_m(g)) ≥ 1/m. Let L ∈ LL_{2,1} ∪ {∅}. Let L̄ be its closure and M := M_L := L̄ ∪ {(0, y) : 0 ≤ y ≤ 1} ∪ {(x, 0) : 0 ≤ x ≤ 1}.
Let G be the boundary of M; for -1 ≤ t ≤ 1, G meets the line {(u, w) : u - w = t} in a single point g(t) = (x(t), y(t)), and y(·) is nonincreasing in t. For s ≤ t,
x(t) = y(t) + t ≤ y(s) + t = x(s) + t - s,
so x(·) is a Lipschitz function with ||x||_L ≤ 1. Likewise ||y||_L ≤ 1. In particular, x(·) and y(·) are continuous and the curve G is connected. We have x(-1) = 0, y(-1) = 1, x(1) = 1, and y(1) = 0. For t near -1, g(t) ∈ S_{1m}. If G intersects a square S_{ij} for i < m or j > 1 then it next
intersects, as t increases, one of the squares S_{i+1,j} or S_{i,j-1}. (It cannot go directly to S_{i+1,j-1} since the upper left vertex of that square is in S_{i+1,j}.) For t near 1, g(t) ∈ S_{m1}. Thus for some f as above, there is a sequence of squares S(f)(1) = S_{1m}, ..., S(f)(2m-1) = S_{m1} intersected by G, and no other squares S_{ij} are. It follows that
A_m(f) ⊂ mL ⊂ L ⊂ Lm ⊂ C_m(f),
so that Lm \ mL ⊂ B_m(f), and λ(B_m(f)) = (2m-1)/m² < 2/m, while h(A_m(f), C_m(f)) ≤ 2^{1/2}/m or A_m(f) = ∅. Also, {(0, 0)} is the smallest nonempty intersection of a lower layer with I², and using this set in place of A_m(f) if the latter is empty, which occurs for just one, say f_0, of the functions f, we have h({(0, 0)}, A) ≥ 2^{1/2}/m for any A ⊂ I² including S_{11}, and thus for C_m(f) for any of the functions f ≠ f_0. Since the number of such functions f is (2m-2 choose m-1), the first sentence of the Theorem for d = 2 is proved. Then since (2m-2 choose m-1) ≤ 4^{m-1}, the second sentence follows. □
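The combinatorics just used can be checked numerically. The sketch below (Python; the enumeration by index sets is a choice of this illustration) lists the staircase paths f from S_{1m} to S_{m1}, confirms that their number is the binomial coefficient (2m-2 choose m-1), and checks the bound (2m-2 choose m-1) ≤ 4^{m-1} used for the second sentence of the theorem.

```python
from itertools import combinations
from math import comb

def staircase_paths(m):
    """Enumerate the staircases of Theorem 8.3.1: a path of 2m-2 moves from
    the upper-left square S_{1,m} to the lower-right square S_{m,1} makes
    exactly m-1 'right' moves; encode a path by the set of indices k in
    {2, ..., 2m-1} where f(k) = 1 (a move to the right)."""
    moves = range(2, 2 * m)          # k = 2, ..., 2m-1
    return list(combinations(moves, m - 1))

for m in range(2, 8):
    paths = staircase_paths(m)
    assert len(paths) == comb(2 * m - 2, m - 1)
    # the bound used for the second sentence of the theorem:
    assert comb(2 * m - 2, m - 1) <= 4 ** (m - 1)
print("binomial counts and the 4^(m-1) bound check out for m = 2,...,7")
```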
For dimensions d > 2 we then have:
8.3.2 Theorem For each d ≥ 2, as ε ↓ 0,
log D(ε, LL_{d,1}, h) ≍ log D(ε, LL_d, d_λ) ≍ log D(ε, LL_{d,1}, d_λ) ≍ log N_I(ε, LL_{d,1}, λ) ≍ ε^{1-d}.
Proof First, for the Hausdorff metric h, it will be shown that for some constants c_d with 1 ≤ c_d < ∞,
(8.3.3) log D(ε, LL_{d,1}, h) ≤ c_d ε^{1-d}
for 0 < ε ≤ 1. For d = 2 this holds by the previous theorem. It will be proved for d > 2 by induction on d. Suppose it holds for d - 1, for d ≥ 3. Given 0 < ε ≤ 1, take a maximal number of sets L_1, ..., L_m ∈ LL_{d-1,1} such that h(L_i, L_j) > ε/4 for i ≠ j, where m ≤ exp(c_{d-1}(4/ε)^{d-2}). Let k := ⌈3/ε⌉ and A ∈ LL_{d,1}. For j = 1, ..., k let A_j := {x ∈ I^{d-1} : (x, j/k) ∈ A} and A_{(j)} := A_j × {j/k} ⊂ A. Then A_j = ∅ or A_j ∈ LL_{d-1,1}. In the latter case we can choose i := i(j, A) such that h(A_j, L_i) ≤ ε/4. Let L_0 := ∅ and i := i(j, A) := 0 if A_j = ∅, so h(A_j, L_i) = 0 ≤ ε/4 in that case also. Let A, B ∈ LL_{d,1} and suppose that i(j, A) = i(j, B) for j = 0, 1, ..., k-1. It will be shown that h(A, B) ≤ ε. Let x ∈ A. There is a j = 0, 1, ..., k-1 such that j/k ≤ x_d ≤ (j+1)/k. Let y := (x_1, ..., x_{d-1}, j/k) ∈ A_{(j)}. Then A_j ≠ ∅ and h(A_j, B_j) ≤ ε/2, so for some z ∈ B_{(j)} ⊂ B, we have
|y - z| ≤ 2ε/3 and |x - z| ≤ k^{-1} + 2ε/3 ≤ ε. So d(x, B) ≤ ε and by symmetry, h(A, B) ≤ ε. Thus
D(ε, LL_{d,1}, h) ≤ (m+1)^k ≤ [exp(2c_{d-1}(4/ε)^{d-2})]^{4/ε} ≤ exp(c_d/ε^{d-1})
for c_d := 2·4^{d-1}c_{d-1}, so (8.3.3) is proved. For the metrics in terms of λ we have the following:
8.3.4 Lemma Let δ > 0 and let A, B ∈ LL_{d,1} with h(A, B) ≤ δ. Then
λ_d(AΔB) ≤ 3·d^{d/2}·δ.
Proof Let U be a rotation of R^d which takes v := (1, 1, ..., 1) into (0, 0, ..., 0, d^{1/2}). Let π_d(y) := (y_1, ..., y_{d-1}, 0). Let C be the cube C := U[I^d] := {U(x) : x ∈ I^d}. Each point of I^d or C is within d^{1/2}/2 of its respective center. Thus each point z ∈ H := π_d[C] is within d^{1/2}/2 of 0. Also, for any z ∈ R^{d-1}, {t ∈ R : (z, t) ∈ C} is empty or a closed interval h(z) ≤ t ≤ j(z). Let C_z := {(w, t) ∈ C : w = z}, a line segment. The intersections of U[A] and U[B] with C_z are each either empty or line segments with the same lower endpoint (z, h(z)) as C_z, so the two sets are linearly ordered by inclusion. Thus the intersection of U[A]ΔU[B] = U[AΔB] with C_z is some line segment S_{A,B,z}. It will be shown that S_{A,B,z} has length ≤ 3d^{1/2}δ. Suppose not. Then by symmetry we can assume that there is some η > δ and a point x ∈ B \ A such that w := x + η(1, 1, ..., 1) ∈ B. The orthant O := {y : y_j ≥ x_j for all j = 1, ..., d} is disjoint from A. But the open ball of radius η and center w is included in O, contradicting h(A, B) ≤ δ. Now, H is included in a cube of side d^{1/2} in R^{d-1} with center at 0, so by the Tonelli-Fubini theorem,
λ_d(AΔB) ≤ (3d^{1/2}δ)(d^{1/2})^{d-1} = 3δ·d^{d/2},
proving the lemma.
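For d = 2 the lemma says λ_2(AΔB) ≤ 3·2·δ = 6δ whenever h(A, B) ≤ δ. A minimal numerical illustration (Python; the two particular lower layers, the grid resolution n, and the grid approximation of the Hausdorff distance are choices of this sketch, not from the text):

```python
import math

n = 40                            # grid resolution (a choice of this sketch)
f = lambda x: max(0.0, 0.8 - x)   # boundary of lower layer A = {y <= f(x)}
g = lambda x: max(0.0, 0.7 - x)   # boundary of lower layer B (so B is inside A)

pts = [((i + 0.5) / n, (j + 0.5) / n) for i in range(n) for j in range(n)]
A = [p for p in pts if p[1] <= f(p[0])]
B = [p for p in pts if p[1] <= g(p[0])]

# area of the symmetric difference = integral of |f - g| (Riemann sum)
area = sum(abs(f((i + 0.5) / n) - g((i + 0.5) / n)) for i in range(n)) / n

# Hausdorff distance approximated on the grid; since B is inside A,
# only the sup over points of A of the distance to B matters
def dist_to(p, S):
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in S)
h = max(dist_to(p, B) for p in A)

# Lemma 8.3.4 with d = 2: area(A delta B) <= 6 * delta, up to grid error
grid_err = 2 * math.sqrt(2) / n
assert area <= 6 * (h + grid_err)
print(f"area(A delta B) ~ {area:.4f}, h(A,B) ~ {h:.4f}, bound 6h ~ {6*h:.4f}")
```

Here area(AΔB) ≈ 0.075 while h(A, B) ≈ 0.1/√2, so the bound 6δ holds with room to spare; the constant 3d^{d/2} is not claimed to be sharp.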
Returning to the proof of Theorem 8.3.2, from the last lemma and (8.3.3) it follows that for each d = 2, 3, ..., and some C_d < ∞, for 0 < ε ≤ 1,
log D(ε, LL_d, d_λ) = log D(ε, LL_{d,1}, d_λ) ≤ C_d ε^{1-d}.
Next, consider the remaining upper bound, for N_I. The angle between v = (1, 1, ..., 1) and each hyperplane x_j = 0 is
cos^{-1}(((d-1)/d)^{1/2}) = sin^{-1}(d^{-1/2}) = tan^{-1}((d-1)^{-1/2}).
Thus for any nonempty lower layer B ⊂ R^d and point p on its boundary, U[B] includes the cone {x : |x^{(d)} - q^{(d)}| < (q_d - x_d)(d-1)^{-1/2}}, where x^{(d)} := (x_1, ..., x_{d-1}) and q := U(p). Hence the boundary of U[B] is the graph of a function f : R^{d-1} → R where for any s, t ∈ R^{d-1},
f(s) ≥ f(t) - K|s - t|, K := (d-1)^{1/2}.
Hence, interchanging s and t, |f(s) - f(t)| ≤ K|s - t|. So ||f||_L ≤ K. Let
J(f) := {x : -∞ < x_d < f(x^{(d)})}.
Thus for each B ∈ LL_{d,1} we have U[B] = J(f) ∩ U[I^d] for a function f = f_B on R^{d-1} with ||f||_L ≤ K. We can restrict the functions f to a cube T of side d^{1/2} centered at the origin in R^{d-1}, parallel to the axes, which includes the projection of U[I^d]. We can also assume that ||f_B||_sup ≤ d^{1/2} for each B, since replacing f by max(-d^{1/2}, min(f, d^{1/2})) doesn't change J(f) ∩ U[I^d], nor does it increase ||f||_L (RAP, Proposition 11.2.2(a), since ||g||_L = 0 if g is constant). Now, apply Theorem 8.2.1 for α = 1 and d-1 in place of d, where by a fixed linear transformation we have a correspondence between I^{d-1} and the cube T. Since f ≤ g implies J(f) ⊂ J(g) and J(f) ∩ U[I^d] ⊂ J(g) ∩ U[I^d], the bracketing parts of Theorem 8.2.1 imply the desired upper bound for log N_I(ε, LL_{d,1}, λ)
with -(d-1)/α = 1 - d. This finishes the proof for upper bounds.
Now for lower bounds, it will be enough to prove them for D(ε, LL_{d,1}, d_λ), in light of Lemma 8.3.4 and since N_I(ε, ·) ≥ D(2ε, ·). The angle between v = (1, 1, ..., 1) and each coordinate axis is
θ_d := cos^{-1}(d^{-1/2}) = sin^{-1}(((d-1)/d)^{1/2}) = tan^{-1}((d-1)^{1/2}).
Thus if f : R^{d-1} → R satisfies ||f||_L ≤ (d-1)^{-1/2}, then to see that L := U^{-1}(J(f)) is a lower layer, suppose not. For some x ∈ L and y ∉ L, x_i = y_i for all i except that y_j < x_j for some j. Then U transforms the line through x and y to a line ℓ forming an angle θ_d with the d-th coordinate axis. Writing ℓ as t_d = h(t^{(d)}), we have ||h||_L = cot θ_d = (d-1)^{-1/2}, which yields a contradiction. Recall (Section 8.2) that for δ > 0 and a metric space Q,
F_{1,δ}(Q) := {f : Q → R, max(||f||_L, ||f||_sup) ≤ δ}.
Let δ := (d-1)^{-1/2}d^{-1}. Let Q be a small enough cube in R^{d-1} with center at 0. Then for each f ∈ F_{1,δ}(Q), we have (1/2)v + U^{-1}(J(f)) ⊂ I^d. For such f, J(f) ↦ (1/2)v + U^{-1}(J(f)) is an isometry for h and for d_λ, and preserves
inclusion. Each f can be extended to R^{d-1}, preserving ||f||_L ≤ δ (RAP, Theorem 6.1.1). So the lower bound with d_sup in Theorem 8.2.1 gives, via Lemma 8.1.3, the lower bound with h in Theorem 8.3.2. Theorem 8.2.10, likewise adapted from I^{d-1} to Q, gives the lower bounds with λ, proving Theorem 8.3.2. □
8.4 Metric entropy of classes of convex sets

Let C_d denote the class of all nonempty closed convex subsets of the open unit ball B(0, 1) := {x : |x| < 1} in R^d. Let λ be Lebesgue measure on R^d. Upper and lower bounds will be given for the metric entropy of C_d for the metric d_λ and for the Hausdorff metric h.
8.4.1 Theorem (E. M. Bronstein) For each d ≥ 2 we have
log D(ε, C_d, d_λ) ≍ log D(ε, C_d, h) ≍ ε^{(1-d)/2} as ε ↓ 0.

Remark. For d = 1, C_1 is just the class of subintervals of the open interval (-1, 1). Then it's rather easy to see that D(ε, C_1, d_λ) ≍ D(ε, C_1, h) ≍ ε^{-2}. Before proving the theorem, let us prove from it:
8.4.2 Corollary In R^d, for d ≥ 2, if P is a law whose restriction to B(0, 1) has a bounded density f with respect to Lebesgue measure, then
(a) log N_I(ε, C_d, P) = O(ε^{(1-d)/2}) as ε ↓ 0;
(b) (E. Bolthausen) for d = 2, C_2 is a Donsker class for P;
(c) for any d, C_d is a Glivenko-Cantelli class for P;
(d) if also f ≥ v on B(0, 1) for some constant v > 0, then
log N_I(ε, C_d, P) ≍ ε^{(1-d)/2} as ε ↓ 0.
Proof Let 0 ≤ f(x) ≤ V < ∞ for all x. For B ⊂ R^d and δ > 0 recall that δB := {x : y ∈ B whenever |x - y| < δ} and B^δ := {x : d(x, B) < δ}. For a closed set B, we have δB = {x : y ∈ B whenever |x - y| ≤ δ}. Also, B^δ is always open and δB is always closed. We have δB ⊂ B ⊂ B^δ. For any two sets B, C, if h(B, C) < δ, then C ⊂ B^δ. It will be shown next that if B and C are closed and convex then also δB ⊂ C. If not, let x ∈ δB \ C. Take a closed half-space J ⊃ C with x ∉ J (RAP, Theorem 6.2.9). On the line through x
perpendicular to the hyperplane bounding J, on the side opposite J, there are points y ∈ B \ C^δ, a contradiction. So δB ⊂ C ⊂ B^δ. For d_λ, the following will help. For any set B, let ∂B denote the boundary of B.
8.4.3 Lemma For any d = 1, 2, ..., there is a K = K(d) < ∞ such that for any B ∈ C_d and 0 < δ ≤ 1, λ(B^δ \ δB) ≤ Kδ.

Proof We have
λ(B^ε) = λ(B) + C_1(B)ε + C_2(B)ε² + ... + C_d ε^d,
where the coefficients C_i are known as mixed volumes of B and B(0, 1) (e.g., Eggleston, 1958, pp. 82-89; Bonnesen and Fenchel, 1934, pp. 38, 49-50). Here C_1(B) = lim_{ε↓0}(λ(B^ε) - λ(B))/ε is the (d-1)-dimensional surface area of ∂B (Eggleston, 1958, p. 88), and C_d = λ(B(0, 1)). All the mixed volumes C_i(B) are nondecreasing functions of the convex set B (Eggleston, 1958, Theorem 42, p. 86; Bonnesen and Fenchel, 1934, p. 41). Thus, all the C_i(B) are maximized for B ⊂ B(0, 1) when B = B(0, 1). It follows that the derivative dλ(B^ε)/dε is bounded above uniformly for B ∈ C_d
and 0 < ε ≤ 1 by a constant K_2 = K_2(d), so that λ(B^ε \ B) ≤ K_2 ε for all B ∈ C_d and 0 < ε ≤ 1. Now suppose B has nonempty interior. Then B^δ \ δB ⊂ (∂B)^δ. For a set of points on ∂B more than δ apart, the balls of radius δ/2 with centers at the points are disjoint, and the outer halves of these balls, cut by support hyperplanes to B at the points, are outside of B. Thus for 0 < δ ≤ 1,
(δ/2)K_2 ≥ λ(B^{δ/2} \ B) ≥ (1/2)C_d(δ/2)^d D(δ, ∂B, ρ),
where ρ is the Euclidean distance. Thus
D(δ, ∂B, ρ) ≤ 2^d K_2 δ^{1-d}/C_d.
Then for 0 < δ ≤ 1,
λ((∂B)^δ) ≤ D(δ, ∂B, ρ) C_d (2δ)^d ≤ 4^d K_2 δ,
and the lemma follows. □
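For a disk B = B(0, R) ⊂ R² with R ≤ 1, both B^δ and δB are disks and the quantity in Lemma 8.4.3 can be computed exactly: λ(B^δ \ δB) = π((R+δ)² - (R-δ)²) = 4πRδ ≤ 4πδ, so K = 4π works for disks. A quick check (Python; the choice of radii is arbitrary):

```python
import math

def annulus_area(R, delta):
    """lambda(B^delta \\ deltaB) for a disk B(0, R) with 0 < delta <= R:
    B^delta = B(0, R + delta) and deltaB = B(0, R - delta)."""
    return math.pi * ((R + delta) ** 2 - (R - delta) ** 2)

for R in [0.2, 0.5, 1.0]:
    for delta in [0.01, 0.05, 0.1]:
        if delta <= R:
            area = annulus_area(R, delta)
            assert abs(area - 4 * math.pi * R * delta) < 1e-12
            assert area <= 4 * math.pi * delta + 1e-12  # K = 4*pi since R <= 1
print("lambda(B^d \\ dB) = 4*pi*R*d <= 4*pi*d for disks of radius <= 1")
```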
Now continuing the proof of Corollary 8.4.2, given 0 < δ ≤ 1, let N_d(δ) be a δ-net in C_d for h with cardinality at most exp(A_d δ^{(1-d)/2}), where A_d is a large enough constant. The brackets [δB, B^δ] for B ∈ N_d(δ) cover C_d: in other words, for any C ∈ C_d there is such a B with δB ⊂ C ⊂ B^δ, as seen just before Lemma 8.4.3, and λ(B^δ \ δB) ≤ Kδ, so P(B^δ \ δB) ≤ KVδ, and
N_I(ε, C_d, P) ≤ D(ε/(KV), C_d, h) ≤ exp(A_d(ε/(KV))^{(1-d)/2})
for ε small enough, proving part (a). Given part (a), part (b) follows from Corollary 7.2.13, and part (c) from the Blum-DeHardt theorem (7.1.5). For part (d), we have
N_I(ε, C_d, P) ≥ D(ε, C_d, d_P) ≥ D(ε/v, C_d, d_λ) ≥ exp(c_d(ε/v)^{(1-d)/2})
for some c_d > 0 and all ε small enough, which finishes the proof of Corollary 8.4.2 from Theorem 8.4.1. □

Now Theorem 8.4.1 will be proved. Let 0 < ε ≤ 1. For any set C ⊂ R^d and r > 0 let C^{r]} := {x ∈ R^d : d(x, C) ≤ r}. Then the open set C^r is included in the closed set C^{r]}, ∂C^r = ∂C^{r]}, a closed set, and h(C^r, C^{r]}) = 0.
8.4.4 Lemma For any C, D ∈ C_d and r > 0, h(C^{r]}, D^{r]}) = h(C, D); in other words, θ_r : E ↦ E^{r]} is an isometry for h.

Proof Let s > 0. It will be shown that D ⊂ C^s if and only if D^{r]} ⊂ (C^{r]})^s = C^{r+s}. "Only if" is straightforward. To prove "if," suppose not. Let a ∈ D \ C^s. There is a unique point q of C^{s]} closest to a: there is a nearest point q since C^{s]} is compact, and if b were another nearest point, then (q + b)/2 ∈ C^{s]} since C^{s]} is convex, and (q + b)/2 is nearer to a, a contradiction. (Possibly q = a.) Now q ∈ ∂(C^{s]}). If q = a, take a support hyperplane H to C^{s]} at q (RAP, Theorem 6.2.7). If q ≠ a, then the hyperplane H through q perpendicular to the line segment qa is a support hyperplane to C^{s]} at q (if there were a point c of C^{s]} on the same side of H as a, then on the line segment cq there would be a point of C^{s]} closer to a than q is, a contradiction). Let p be a point at distance r from a in the direction perpendicular to H and heading away from C^{s]}. Then p ∈ D^{r]} but p ∉ (C^{s]})^r = C^{r+s}, a contradiction. So "if" is proved. Since C and D can be interchanged, the lemma follows. □
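For Euclidean balls the Hausdorff distance has the closed form h(B(c_1, r_1), B(c_2, r_2)) = |c_1 - c_2| + |r_1 - r_2|, and θ_r just adds r to each radius, so the isometry of Lemma 8.4.4 can be verified directly in this special case (Python; balls stand in for general convex sets, an assumption of this sketch):

```python
import math, random

def hausdorff_balls(c1, r1, c2, r2):
    """h between two closed Euclidean balls: |c1 - c2| + |r1 - r2|."""
    return math.dist(c1, c2) + abs(r1 - r2)

random.seed(0)
for _ in range(1000):
    c1 = (random.uniform(-1, 1), random.uniform(-1, 1))
    c2 = (random.uniform(-1, 1), random.uniform(-1, 1))
    r1, r2, r = (random.uniform(0.1, 1) for _ in range(3))
    h0 = hausdorff_balls(c1, r1, c2, r2)
    hr = hausdorff_balls(c1, r1 + r, c2, r2 + r)  # apply theta_r to both sets
    assert abs(h0 - hr) < 1e-12                   # theta_r is an h-isometry
print("theta_r preserved the Hausdorff distance in all 1000 random trials")
```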
For r > 0, θ_r is a useful smoothing, as it takes a convex set D, which may have a sharply curved boundary (vertices, edges, etc.), to a convex set D^{r]}
whose boundary is no more curved than a sphere of radius r, and so will be easier to approximate. Now, for a given C ∈ C_d and ε > 0 let
N_ε(C) := {D ∈ C_d : h(C, D) ≤ ε},
N'_ε(C) := {D ∈ N_ε(C) : C ⊂ D}.
For any convex C, θ_{2+ε} is an isometry from N_ε(C) into N'_{2ε}(C^{2]}). A sequence of lemmas will be proved. Here |·| will denote the usual Euclidean norm on R^d. Let C_{d,1,3} denote the class of all closed convex sets C in R^d such that B(0, 1) ⊂ C ⊂ B(0, 3). Note that if C ∈ C_d, then clearly C^{2]} ∈ C_{d,1,3}.
8.4.5 Lemma Let E ∈ C_{d,1,3}. Let x ∈ ∂E and let H be a support hyperplane to E at x. Let p be the point of H closest to 0. Then |p| ≥ 1 and ∠0xp ≥ sin^{-1}(1/3).

Proof Existence of support hyperplanes is proved in RAP (Theorem 6.2.7). Clearly H is disjoint from B(0, 1), so |p| ≥ 1. Since |x| ≤ 3 and ∠0px = π/2, the lemma follows. □
8.4.6 Lemma If E ∈ C_{d,1,3}, r > 0, and z is a point such that z ∈ E^r \ E, let x(z) be the point on the half-line from 0 to z and in ∂E. Then |z - x(z)| ≤ 3r.

Proof Note that x = x(z) is uniquely determined since E ⊃ B(0, 1). Apply Lemma 8.4.5. Let z_1 be the point of H closest to z. Then |z - z_1| ≤ r. The vectors z - z_1 and p are parallel, so ∠p0z = ∠0zz_1, and |z - x| ≤ |z - z_1|·|x|/|p| ≤ 3r. □

8.4.7 Lemma Suppose C ∈ C_d, y ∈ ∂C, x ∈ ∂C^{2]}, and |x - y| = 2. Then for any two-dimensional subspace V containing x, V ∩ B(y, 2) is a disk containing 0 of radius at least 2/3.

Proof Clearly 0 ∈ B(y, 2) ⊂ C^{2]}. Apply Lemma 8.4.5 again. Then H is also a support hyperplane to B(y, 2) at x, so x - y is orthogonal to H and in the same direction as p. Let q be the point on the segment [0, x] closest to y. Then θ := ∠0xp = ∠qyx, sin θ ≥ 1/3, and so |x - q| ≥ 2/3. Let u be the closest point to y in V. Then V ∩ B(y, 2) ⊃ V ∩ B(u, 2/3). □

Now polyhedra to approximate convex sets will be constructed. Let W_d be the cube centered at 0 in R^d of side 2/d^{1/2}, parallel to the axes, so that the
coordinates of the vertices are ±1/d^{1/2}. Given ε > 0, decompose the 2d faces of W_d into equal (d-1)-cubes of side s_d := s_d(ε), where s_d ~ c(ε/d)^{1/2} as ε ↓ 0 and c := 10^{-4}/(d^{1/2}(d-1)), so s_d ~ ε^{1/2}10^{-4}/(d(d-1)). Specifically, let s_d := 2/(d^{1/2}k_d), where k_d := k_d(ε) is the smallest positive integer such that s_d ≤ ε^{1/2}10^{-4}/(d(d-1)). Then for 0 < ε ≤ 1,
ε^{1/2}10^{-4}/d² ≤ s_d ≤ ε^{1/2}10^{-4}/(d(d-1)).
Let L_d be the set of all (d-1)-cubes thus formed. The diameter of each cube in L_d is d^{1/2}s_d ≤ ε^{1/2}10^{-4}/(d^{1/2}(d-1)). The next fact follows directly, by the law of sines, since 0 < ε ≤ 1 and d ≥ 2, and
sin^{-1}x ≤ 1.1x for 0 ≤ x ≤ 10^{-4}.

8.4.8 Lemma (a) For any cube in L_d and any two vertices p and q of the cube,
∠p0q ≤ sin^{-1}(10^{-4}ε^{1/2}/(d-1)) ≤ (1.1)10^{-4}ε^{1/2}/(d-1).
(b) The total number of vertices of all the cubes in L_d is less than
2d(k_d(ε) + 1)^{d-1} ≤ K_d ε^{(1-d)/2},
where K_d := 2d((2·10^4 + 1)d^{3/2})^{d-1}.
Next, there is a triangulation of each cube in L_d, in other words a decomposition of the cube into (d-1)-simplices with disjoint interiors, where each simplex is a convex hull of some d of the 2^{d-1} vertices of the cube. That such a triangulation (without additional vertices) exists (a well-known fact to algebraic topologists) can be seen as follows. By induction, it will be enough to treat S × [0, 1] where S is a simplex with vertices v_0, ..., v_p. Let a_i := (v_i, 0) and b_i := (v_i, 1). Then for each i = 0, 1, ..., p, the points a_0, ..., a_i, b_i, ..., b_p are vertices of a (p+1)-dimensional simplex S_i. To see that these S_i give the desired decomposition of S × [0, 1], note first that each point of a simplex is a unique convex combination of the vertices. For each point z of S × [0, 1], z = (Σ_i λ_i v_i, x) for some unique x ∈ [0, 1] and λ_i ≥ 0 with Σ_i λ_i = 1. Then z = Σ_{i≤j} μ_i a_i + Σ_{i≥j} ρ_i b_i, where μ_i ≥ 0, ρ_i ≥ 0, and Σ_{i≤j} μ_i + Σ_{k≥j} ρ_k = 1, if and only if μ_i = λ_i for i < j, ρ_k = λ_k for k > j, λ_j = μ_j + ρ_j, and x = Σ_{k≥j} ρ_k. Thus z is in S_j if and only if Σ_{i>j} λ_i ≤ x ≤ Σ_{i≥j} λ_i. If both inequalities are strict, then j is unique. Every point of S × [0, 1] is in some S_j, and a point in more than one S_j is on the boundary of both. Let K be a convex set including a neighborhood of 0. For each vertex p_i of a cube in L_d, let H_i be the half-line starting at 0 passing through p_i and let v_i
be the unique point at which H_i passes through the boundary of K. For each simplex S_j in the triangulation of the cubes in L_d, let T_j be the corresponding simplex with vertices v_i in place of p_i. Let π_ε(K) be the polyhedron with faces T_j, in other words the union of the d-dimensional simplices which are convex hulls of T_j ∪ {0}. For d ≥ 3, π_ε(K) is not necessarily convex.
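The prism triangulation described above can be made concrete. The sketch below (Python; the particular base triangle is an arbitrary choice) builds the simplices S_i for S × [0, 1] with a 2-simplex S and checks, with exact rational arithmetic, that their volumes add up to the volume of the prism, as they must if they tile it with disjoint interiors.

```python
from fractions import Fraction as F

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def tet_volume(p0, p1, p2, p3):
    # volume of a tetrahedron = |det of edge vectors| / 6
    rows = [[a - b for a, b in zip(p, p0)] for p in (p1, p2, p3)]
    return abs(det3(rows)) / 6

# base 2-simplex S with vertices v0, v1, v2 (area 1/2), prism S x [0, 1]
v = [(F(0), F(0)), (F(1), F(0)), (F(0), F(1))]
a = [(*vi, F(0)) for vi in v]        # a_i = (v_i, 0)
b = [(*vi, F(1)) for vi in v]        # b_i = (v_i, 1)

# S_i has vertices a_0, ..., a_i, b_i, ..., b_p (here p = 2)
simplices = [a[:i + 1] + b[i:] for i in range(3)]
total = sum(tet_volume(*s) for s in simplices)
assert total == F(1, 2)              # = area(S) * height 1
print("3 tetrahedra with volumes", [tet_volume(*s) for s in simplices])
```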
8.4.9 Lemma Let E ∈ C_{d,1,3}, δ > 0, and ε > 0. For i = 1, 2 let z_i be points such that z_i ∈ E^{16ε} \ E. Assume that ∠z_1 0 z_2 ≤ δ ≤ 1/15. Then
|z_1 - z_2| ≤ 9δ + 96ε.

Proof Let x_i := x(z_i) be the point where the line segment from 0 to z_i intersects ∂E, i = 1, 2. By Lemma 8.4.6, |z_i - x_i| ≤ 48ε, i = 1, 2. We have |z_i| ≥ 1 and |x_i| ≥ 1 for all i. By symmetry we can assume |x_1| ≥ |x_2|. To get a bound for |x_1 - x_2|, we can assume x_1 and x_2 are not on the same line through 0, or they would be equal. Take a half-line L starting at x_1 which is tangent to the unit circle ∂B(0, 1) at a point v and crosses the half-line from 0 through z_2 at a point y.
If v is between x_1 and y, then |x_1 - v| ≤ tan δ ≤ 2δ since δ ≤ 1/15 < π/4. Likewise |y - v| ≤ 2δ. Also, 1 ≤ |x_2| ≤ |x_1| ≤ (1 + 4δ²)^{1/2} ≤ 1 + 2δ² ≤ 1 + δ, so |x_2 - y| ≤ δ and
|x_2 - v| ≤ |x_2 - y| + |y - v| ≤ δ + 2δ = 3δ and |x_1 - x_2| ≤ 5δ.
Otherwise, y is between x_1 and v. Then |x_1 - x_2| is maximized when |x_2| = |x_1| or x_2 = y, since x_2 must not be in the convex hull of {x_1} ∪ B(0, 1). If |x_2| = |x_1| then
|x_1 - x_2| ≤ 2·3 sin(δ/2) ≤ 3δ.
To bound |x_1 - y|, let ψ := ∠0x_1v. Then sin ψ ≥ 1/3, and
|x_1 - y| ≤ δ sec²(π/2 - ψ) = δ csc²ψ ≤ 9δ.
So |x_1 - x_2| ≤ 9δ in all cases, and |z_1 - z_2| ≤ 9δ + 96ε. □
8.4.10 Lemma Let E ∈ C_{2,1,3}. Suppose 2 ≤ K < ∞ and 0 < ε ≤ 10^{-12}/(K-1)². Let c := (1.1)10^{-4}/(K^{1/2}(K-1)). Let r ≥ 2/3 and let a, b, α be points of R² such that B(a, r) ⊂ E, b is on the boundary of E and of B(a, r), d(α, E) ≤ 16ε, and β := ∠α0b ≤ c(Kε)^{1/2}. Then
d(α, B(a, r)) ≤ 50ε. Also, |α - b| ≤ 0.0015ε^{1/2}/(K-1) and ∠αab ≤ D(K)ε^{1/2}, where D(K) := 0.004/(K-1).

Proof Apply Lemma 8.4.5 at x = b. The tangent line L to E at b is also tangent to the circle ∂B(a, r). Let γ := ∠0bp. Then γ ≥ sin^{-1}(1/3) > 1/3. We have
(8.4.11) β ≤ c(Kε)^{1/2} = (1.1)10^{-4}ε^{1/2}/(K-1) ≤ (1.1)10^{-10}.
Let H be the half-plane including E bounded by L. First suppose α ∉ H. Then d(α, H) ≤ 16ε. Let the line from 0 to α intersect L at a point η. If η is between p and b, then γ + β ≤ π/2 and
|η - b| = |p|(cot γ - cot(γ + β)) ≤ 3β csc²γ ≤ 27β.
If p is between η and b, then
|η - b| = |p|(cot γ + tan(γ + β - π/2)) = |p|(cot γ - cot(γ + β)),
where now π/2 < γ + β < π. Thus by (8.4.11),
|η - b| ≤ |p|β max(csc²γ, csc²(γ + β)) ≤ 3β max(csc²γ, csc²(π/2 + 10^{-9})) ≤ 27β ≤ 27c(Kε)^{1/2}.
The other possibility is that b is between η and p. Then
|η - b| = |p|(cot(γ - β) - cot γ).
Now β ≤ (1.1)10^{-10} and γ ≥ sin^{-1}(1/3) imply sin(γ - β) ≥ 0.333, so
|η - b| ≤ 3β/(0.333)² ≤ 28β ≤ 28c(Kε)^{1/2}
in all three cases. For any ordering of b, p, and η,
|α - η| ≤ 16ε/sin(γ - β) ≤ 49ε.
]a - ill < 16E/ sin(y - ,B) < 49e. Next, let x be the distance from a varying point on L to b. Then the distance y from to the circle 8B(a, r) satisfies y = (r2 + x2)1/2 - r. Now
(r2 + t)1/2 < r + t fort > 0 and r > 2/3, so 0 < y < x2 for all x. So the distance from a to B(a, r) is at most CE for C = 49 + 282c2K < 50, giving the first conclusion for a 0 H. A line W through a, orthogonal to the line V through 0 and b, meets V at
a point l;. Then Lbal; = y > sin-1(1/3), so Ia - I < r(1 -
-11)1/2. Let q
the point on the circle |q - a| = r and on the line W, on the same side of ξ as α. Then |q - ξ| ≥ 3(1 - (8/9)^{1/2}) > 0.03. By (8.4.11), β ≤ tan^{-1}(0.01), so the line Λ through 0, η, and α must intersect B(a, r). Then since E is convex, a ∈ E, 0 ∈ E, and B(a, r) ⊂ E, for α ∈ H, d(α, B(a, r)) is maximized when α = q, and then d(α, B(a, r)) ≤ Cε for the same C < 50 as before. So the first conclusion is proved. Lemma 8.4.9 with δ = β gives
|α - b| ≤ 0.0015ε^{1/2}/(K - 1),
so ∠αab ≤ sin^{-1}(|α - b|/(2/3)) ≤ 0.004ε^{1/2}/(K - 1). So Lemma 8.4.10 is proved. □
8.4.12 Lemma Let a ∈ R², B := B(a, r) ⊂ R², where r ≥ 2/3 and 0 ∈ B, so |a| ≤ r. Let S be a closed convex set including B, let a_1, a_2 ∈ ∂S be such that d(a_i, B) ≤ 50ε for i = 1, 2, |a_1 - a_2| ≤ D(K)ε^{1/2}, and ∠a_1 a a_2 ≤ 2D(K)ε^{1/2}, let W be the wedge spanned by the half-lines from 0 through a_1 and a_2, and let S_1 be the convex hull of (B ∩ W) ∪ {a_1, a_2}. Then
h(S ∩ W, S_1) ≤ ε/(4(K - 1)).
Proof Since S_1 ⊂ S ∩ W, it will be enough to show that for z_0 ∈ ∂S ∩ W,
d(z_0, [a_1, a_2]) ≤ ε/(4(K - 1)),
where [a_1, a_2] is the line segment joining a_1 to a_2.
It is easy to see that a_1, a_2, and a are not all on a line, unless a_1 = a_2, when the result clearly holds, so assume they are not on a line. Let L_i be a tangent line to B at a point b_i, through a_i, where of the two such tangent lines, L_i is the one for which ∠b_i a a_i ≤ ∠b_i a a_{3-i}, i = 1, 2, and b_i is not in W.
Now |a_i - a| ≤ r + 50ε, so
∠b_i a a_i ≤ cos^{-1}(r/(r + 50ε)) ≤ cos^{-1}(1 - 75ε).
By derivatives, cos x ≤ 1 - x²/2 + x⁴/24 for all x, so cos x ≤ 1 - 11x²/24 for 0 ≤ x ≤ 1. Thus cos^{-1}(1 - 11x²/24) ≤ x, and cos^{-1}(1 - 75ε) ≤ (164ε)^{1/2}. Now 2D(K)ε^{1/2} ≤ 10^{-8}, and (164ε)^{1/2} ≤ 10^{-4}, so the angles ∠b_i a a_i, i = 1, 2, and ∠a_1 a a_2 add up to
(8.4.13) ∠b_1 a b_2 ≤ (2(164)^{1/2} + 2D(K))ε^{1/2} ≤ 0.002 < π.
So the lines L_i intersect, in a unique point m.
Now, z_0 is in the triangle a_1 a_2 m, because: z_0 ∈ W and a_1, a_2 ∈ ∂S, b_1, b_2 ∈ S, so z_0 cannot be in the triangle a a_1 a_2 unless it is on [a_1, a_2]. Also, if z_0 were on the side of L_i away from a, then a_i ∉ ∂S, a contradiction, i = 1, 2. Next,
∠m a_1 a_2 + ∠m a_2 a_1 = π - ∠b_1 m b_2 = (π/2 - ∠a m b_1) + (π/2 - ∠a m b_2) = ∠b_1 a b_2.
We have d(z_0, [a_1, a_2]) ≤ d(m, [a_1, a_2]). Then
d(m, [a_1, a_2]) ≤ |a_1 - a_2| tan ∠m a_1 a_2,
and by (8.4.13),
tan ∠m a_1 a_2 ≤ 2∠m a_1 a_2 ≤ (52 + 4D(K))ε^{1/2},
and |a_1 - a_2| ≤ D(K)ε^{1/2} by assumption. So
d(m, [a_1, a_2]) ≤ (52D(K) + 4D(K)²)ε ≤ ε/(4(K - 1)). □

8.4.14 Lemma
Let d ≥ 2, C ∈ C_d, 0 < ε ≤ 10^{-12}/(d - 1)², and N' ∈ N'_{4ε}(C^{2]}). Let N be the polyhedron π_ε(N'). Then h(N, N') ≤ ε/4.
Proof For each j, the simplex S_j and the origin span a convex cone W_j, the union of all half-lines with endpoint 0 passing through S_j. It suffices to show that for each j,
h(N' ∩ W_j, N ∩ W_j) ≤ ε/4.
Fix such a cone W = W_j and simplex S = S_j. For s = 1, ..., d, let W^{(s)} be the union of the s-dimensional faces of W, so that W^{(d)} = W, W^{(1)} is the union of half-lines through 0 and vertices of S, and so on. It will be shown by induction on s that if z ∈ W^{(s)} ∩ N' then
(8.4.15) d(z, N) ≤ ε(s - 1)/(4(d - 1)).
For s = d this will give the desired conclusion. For s = 1, (8.4.15) holds by definition of N. Suppose it holds for a given s. Take any z ∈ N' ∩ W^{(s+1)}. Let x = x(z) be the point at which the half-line from 0 through z intersects ∂C^{2]}. There is a point β ∈ ∂C such that |x - β| = 2: to see this, let H be a support hyperplane to C^{2]} at x. Let H_1 be a hyperplane parallel to H at distance 2 in the direction toward C. Then it is easily seen that H_1 is a support hyperplane to C at a point β, the nearest point of C to x. Clearly B(β, 2) ⊂ C^{2]}.
Let F_{s+1} be an (s+1)-dimensional face of W containing z. Let π_2 be a two-dimensional subspace with z ∈ π_2 ∩ W ⊂ F_{s+1}. Then the two rays at the edges of the wedge π_2 ∩ W are included in W^{(s)}. Let these rays intersect ∂N' at points u, v. By Lemma 8.4.7, the disk π_2 ∩ B(β, 2) contains 0 and has radius at least 2/3. So it is a disk π_2 ∩ B(a, r), a ∈ π_2, with |x - a| = r ≥ 2/3. Then u, v ∈ ∂N' implies u, v are not in the interior of C^{2]} and so not in the open ball B(a, r). Now apply Lemma 8.4.10 to E = C^{2]} ∩ π_2, with K = d, first for (α, b) = (u, x), then for (α, b) = (v, x). To justify the application we need the following:
Claim. d(u, E) ≤ 16ε and d(v, E) ≤ 16ε.

Proof There is a point τ ∈ C^{2]} with |u - τ| ≤ 4ε. Let U be the two-dimensional subspace spanned by u and τ (if u and τ are on a line through 0 then τ ∈ π_2 and d(u, E) ≤ 4ε). Let M be the line in U through τ which crosses the half-line D from 0 through u at a point t and is tangent to the unit disk B(0, 1) in U at a point w. Let z' be the closest point to τ on D. Then |τ - z'| ≤ 4ε. Let ψ := ∠0tw = ∠z'tτ. Since t ∈ C^{2]}, |t| ≤ 3. Thus sin ψ ≥ 1/3 and |τ - t| ≤ 12ε, so |t - u| ≤ |t - τ| + |τ - u| ≤ 16ε, proving the claim for u and by symmetry for v. □ It will be shown that:
(a) if u and v are points of a simplex with vertices w_i, then ∠u0v ≤ max_{i,j} ∠w_i 0 w_j.
Here (a) will follow from
(b) if u is fixed and v is in a simplex with vertices w_i, then ∠u0v ≤ max_i ∠u0w_i,
since (b) could be applied in stages to prove (a). Further, it will be enough to prove (b) for a simplex reducing to a line segment from w_1 to w_2, since then the case of general v would reduce in stages to cases where v is on a boundary face of the simplex, then a lower-dimensional face, and so on until v is a vertex. So it will be enough to show that
(c) for all u, x, y in R³ and 0 ≤ λ ≤ 1, ∠u0(λx + (1-λ)y) ≤ max(∠u0x, ∠u0y).
Expressing (c) in terms of cosines, and then of scalar products and lengths, we can reduce to the case where |x| = |y| = |u| = 1, and then check the condition.
Now it is easily seen, using also Lemma 8.4.8, that all the hypotheses of Lemma 8.4.10 hold, hence so do its conclusions. Next, Lemma 8.4.12 will be applied with (again) K = d, and with a_1 = u, a_2 = v, and S := N' ∩ π_2. Let's check that the hypotheses of Lemma 8.4.12 hold. We have
∠a_1 a a_2 ≤ 0.008ε^{1/2}/(d - 1).
The angles ∠uax and ∠xav both have the upper bound 0.004ε^{1/2}/(d - 1). The other hypotheses of Lemma 8.4.12 follow from Lemma 8.4.10, so Lemma 8.4.12 does apply. Its conclusion gives d(z, [u, v]) ≤ ε/(4(d - 1)). By induction hypothesis,
d(y, N) ≤ ε(s - 1)/(4(d - 1))
for y = u, v, and then by convexity for any y ∈ [u, v]. It follows that d(z, N) ≤ εs/(4(d - 1)), completing the induction and the proof of Lemma 8.4.14. □
8.4.16 Lemma Let C ∈ C_d. Then for any ε > 0 with ε ≤ 10^{-12}/(d - 1)², there is an ε/2-net for N_{2ε}(C) containing at most g_d(ε) sets, where g_d(ε) := exp(a(d)ε^{(1-d)/2}) and a(d) := K_d log 24, with K_d as in Lemma 8.4.8.
Proof Recall that θ_{2+2ε} gives an isometry for h of N_{2ε}(C) into N'_{4ε}(C^{2]}). Let N' ∈ N'_{4ε}(C^{2]}) and let the vertices of the polyhedron N := π_ε(N') be y_i, i = 1, 2, .... By Lemma 8.4.14, h(N, N') ≤ ε/4. For each i, on the half-line from 0 through y_i, take the interval I_i := [y_i, y_i + 12ε y_i/|y_i|] of length 12ε starting at y_i. Then I_i has an ε/4-net J_i containing 24 points (midpoints of subintervals of length ε/2). By Lemma 8.4.6, for every M ∈ N'_{4ε}(C^{2]}), π_ε(M) has a vertex v_i in I_i, which is within ε/4 of some u_i ∈ J_i. The (d-1)-simplex with vertices v_i is within ε/4 for h of the one with vertices u_i. The same is true of the d-simplices where the vertex 0 is adjoined to both. Thus if L_ε is the set of all polyhedra with one vertex in J_i for each i, defined as in the definition of π_ε, then L_ε is ε/2-dense in N'_{4ε}(C^{2]}). The number of values of i is at most K_d ε^{(1-d)/2} by Lemma 8.4.8, and Lemma 8.4.16 follows. □
Now it will be shown that there are constants K_d, L_d < ∞ such that for 0 < ε ≤ 1, there is an ε-net for C_d containing at most f_d(ε) sets, where
(8.4.17) f_d(ε) := L_d exp(K_d ε^{(1-d)/2}).
This will clearly imply the upper bound for h in Theorem 8.4.1. It will be enough (possibly changing K_d and L_d) to prove (8.4.17) for ε small enough, specifically for 0 < ε ≤ ε_0 := 10^{-12}/(d - 1)². For such ε, and a(d)
from Lemma 8.4.16, let K_d := a(d)/(1 - 2^{(1-d)/2}). (L_d will be specified below.) We have the decomposition
(0, ε_0] = ∪_{k=1}^∞ I_k, I_k := (ε_0/2^k, ε_0/2^{k-1}].
Before getting more precise bounds, it will help to see that N(ε, C_d, h) < ∞ for all ε > 0. The cube [-1, 1]^d can be written as a finite union ∪_i C_i of cubes of side ≤ ε/d^{1/2} and so of diameter ≤ ε. For each nonempty C ∈ C_d, let B(C) be the union of all C_i which intersect C. Then C ⊂ B(C) and h(B(C), C) ≤ ε. It follows that N(ε, C_d, h) < ∞. So, taking ε = ε_0/2, L_d can be and hereby is chosen so that (8.4.17) holds for ε ∈ I_1. Then (8.4.17) will be proved for ε ∈ I_k by induction on k. Suppose it holds for all ε ∈ I_k. Let δ ∈ I_{k+1}. Then ε := 2δ ∈ I_k. By induction, there is an ε-net {C_i} for C_d with at most f_d(ε) values of i. By Lemma 8.4.16, there exist at most g_d(δ) sets K_j, not necessarily convex, such that each A ∈ N_{2δ}(C_i) is within δ/2 of some K_j. Thus there is a δ-net for N_{2δ}(C_i) containing at most g_d(δ) sets, and a δ-net for C_d containing at most g_d(δ)f_d(2δ) sets. Now a(d) + 2^{(1-d)/2}K_d ≤ K_d by choice of K_d, so g_d(δ)f_d(2δ) ≤ f_d(δ), and the upper bound in Theorem 8.4.1 is proved.
Now to prove the lower bound, let ρ be the Euclidean metric on R^d and recall that S^{d-1} denotes the unit sphere {x ∈ R^d : |x| = 1}. Let d ≥ 2. If v_d is the volume of B(0, 1) ⊂ R^d, then
λ((S^{d-1})^ε) ≥ v_d[(1 + ε)^d - 1] ≥ d v_d ε.
Then by the left side of the last displayed inequality in the proof of Lemma 8.4.3, there is an a_d > 0 such that
D(ε, S^{d-1}, ρ) ≥ a_d ε^{1-d} for 0 < ε ≤ 1.
Given ε, take a set {x_i}_{i=1}^m of points of S^{d-1} more than 2ε apart, of maximal cardinality m := D(2ε) := D(2ε, S^{d-1}, ρ). As above let ∠abc denote the angle at b in the triangle abc. Then for i ≠ j, θ := ∠x_i 0 x_j satisfies 2 sin(θ/2) > 2ε, so θ > 2ε. Let K_i be the half-line from 0 through x_i. Let C_i be the spherical cap cut from the unit ball B_d by a hyperplane orthogonal to K_i at distance cos ε from 0. Then the caps C_i are disjoint.
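The caps have volume of exact order ε^{d+1}. For d = 2 each cap C_i is a circular segment of the unit disk cut at distance cos ε from the center, with exact area ε - sin ε cos ε = (2/3)ε³ + O(ε⁵), matching b_d ε^{d+1} with d = 2. A quick check (Python):

```python
import math

def cap_area(eps):
    """Area of the segment of the unit disk beyond the chord at distance
    cos(eps) from the center (half-angle eps): eps - sin(eps)*cos(eps)."""
    return eps - math.sin(eps) * math.cos(eps)

for eps in [0.2, 0.1, 0.05, 0.01]:
    ratio = cap_area(eps) / eps ** 3
    # leading term (2/3) * eps^3, so the ratio approaches 2/3 from below
    assert 0.5 < ratio < 2 / 3 + 1e-9
print("cap area / eps^3 ->", cap_area(0.001) / 0.001 ** 3)
```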
For any set I ⊂ {1, ..., m} let D_I := B_d \ ∪_{i∈I} C_i. Then each D_I is convex. Let λ_d be d-dimensional Lebesgue measure (volume). Then for all i, λ_d(C_i) ≥ b_d ε^{d+1} for some constant b_d > 0. By Lemma 8.2.11 there are at least e^{m/6} sets I(j) such that for all j ≠ k the symmetric difference I(j)ΔI(k) contains at least m/5 elements. Then
λ_d(D_{I(j)} Δ D_{I(k)}) ≥ (m/5) b_d ε^{d+1} ≥ c_d ε^{1-d} ε^{d+1} = c_d ε²
for a constant c_d := a_d b_d 2^{1-d}/5. So for some β_d > 0,
D(δ, C_d, d_λ) ≥ exp(β_d δ^{(1-d)/2}) for 0 < δ ≤ 1.
This finishes the proof of the lower bound for d_λ, and so also for h. Theorem 8.4.1 is proved. □
Problems
1. Let (K, d) be a compact metric space. Show that the collection of all nonempty closed subsets of K, with the Hausdorff metric, is also compact.
2. In the proof of Theorem 8.2.1, just after Lemma 8.2.7, show that the cubes can be ordered so that i = j - 1.
3. If inequality (1.3.12) is used instead of (1.3.5) in the proof of Lemma 8.2.11, what is the result?
4. Show that in Theorem 8.2.16(a), 𝒦(d, α, K) cannot be replaced by I(d, α, K) if d = 2 and α > 1. Hint: Let f((cos θ, sin θ)) := (1 − cos θ)/2, so f takes S¹ onto [0, 1]. Then I((f, 0)) = 0, where (f, 0)(u, v) := (f(u, v), 0). For any interval (a, b) ⊂ R there is a C^∞ function g_{(a,b)} ≥ 0 with g_{(a,b)} > 0 just on (a, b). Show that for functions ψ := (f, Σ_{i=1}^k δ_i g_{(a_i,b_i)}), for disjoint (a_i, b_i) and small enough δ_i > 0, depending on k, ‖ψ‖_α can remain bounded as k increases while I(ψ) can approximate any finite subset of [0, 1] × {0} for h.
5. Show that the unit disk {(x, y) : x² + y² ≤ 1} belongs to a class 𝒞 in Theorem 8.2.15 for k = 4, any α ∈ (0, ∞), and some K = K(α) < ∞. Hint: The function g(x) := (1 − x²)^{1/2} is smooth on intervals |x| ≤ c for c < 1.
6. Find a constant c > 0 such that lim inf_{ε↓0} ε log N_I(ε, 𝓛𝓛₂, λ) ≥ c, and likewise for D(ε, 𝓛𝓛_{2,1}, dλ). Hint: Consider squares along a decreasing diagonal, S_j := S_{j,m+1−j}, j = 1, …, m, for the grid defined in the proof of Theorem 8.3.1. The union of ⋃_{i≤m−1−j} S_i and an arbitrary subset of the S_j gives a set in 𝓛𝓛_{2,1}. Apply Lemma 8.2.11.
7. (a) The proof of Theorem 8.3.1 used the inequality (2k choose k) ≤ 4^k. Show that, conversely, (2k choose k) ≥ 4^k k^{−1/2}/3. Hint: Use Stirling's formula (Theorem 1.3.13).
(b) Use this to give a lower bound for lim inf_{ε↓0} ε log D(ε, 𝓛𝓛_{2,1}, h).
8. In the proof of Lemma 8.3.2, after Lemma 8.3.4, a function f is defined with ‖f‖ ≤ (d − 1)^{1/2}. For d = 2, deduce this from monotonicity of x(·) in the proof of Theorem 8.3.1.
9. Show that each class 𝒢_{α,K,d} in Theorem 8.2.1 for α > 0 and K < ∞ is a uniform Glivenko–Cantelli class as defined in Section 6.6.
10. Show that the lower layers in R^d form a Glivenko–Cantelli class for any law having bounded support and a bounded density with respect to Lebesgue measure.
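The two bounds in Problem 7(a) lend themselves to a quick numerical spot check. The following Python sketch (an aside, not part of the text) verifies both inequalities for a range of k:

```python
from math import comb, sqrt

# Check of the bounds in Problem 7(a):
#   4**k / (3 * sqrt(k))  <=  C(2k, k)  <=  4**k,   for k >= 1.
for k in range(1, 200):
    central = comb(2 * k, k)          # the central binomial coefficient
    assert central <= 4 ** k
    assert 3 * sqrt(k) * central >= 4 ** k
print("both bounds hold for k = 1, ..., 199")
```

By Stirling's formula the ratio of C(2k, k) to 4^k k^{−1/2} tends to π^{−1/2} ≈ 0.564 > 1/3, which is why the lower bound has a comfortable margin.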
Notes

Note to Section 8.1. Hausdorff (1914, Section 28) defined his metric between closed, bounded subsets of a metric space.
Notes to Section 8.2. Kolmogorov (1955) gave the first statement in Theorem 8.2.1. The proof of that part as given is essentially that of Kolmogorov and Tikhomirov (1959). Lorentz (1966, p. 920) sketches another proof. Theorem 8.2.10 is essentially due to Clements (1963, Theorem 3); the proof here is adapted from Dudley (1974, Lemmas 3.5 and 3.6). Remark 8.2.14, for which I am very grateful to Joseph Fu, shows that there is an error in Dudley (1974, (3.2)), even with the correction (1979), which is proved only for a > 1. Theorem 8.2.16 and its proof by a sequence of lemmas are newly corrected and extended versions of results and proofs of Dudley (1974). Tze-Gong Sun and R. Pyke (1982), whose research began in 1974, proved Corollary 8.2.25(a) independently by a different method. See also Pyke (1983). The statement was apparently first published in Dudley (1978, Theorem 5.12), and attributed to Sun.
Notes to Section 8.3. In Theorem 8.3.2 and its proof for d ≥ 3, I am grateful to Lucien Birgé (1982, personal communication) for the idea of the transformation U. For Theorem 8.3.1, I am much indebted to earlier conversations with Mike Steele. Any errors, however, are mine. A lower bound for empirical processes on lower layers in the plane will be given in Section 12.4, where P is uniform on the unit square. For other, previous results, such as laws of large numbers uniformly over 𝓛𝓛_d for more general P, and on the statistical interest of lower layers (monotone regression), see Wright (1981) and references given there.
Notes to Section 8.4. Bolthausen (1978) proved the Donsker property of the class of convex sets in the plane for the uniform law on I² (cf. Corollary 8.4.2(b)). Theorem 8.4.1, as mentioned, is due to Bronshtein (1976). Specifically, the smoothing C ↦ C′ and the set of lemmas used follow mainly, but not entirely, his original proof. Perhaps most notably, it appears that the polyhedra used to approximate convex sets in Bronshtein's construction need not be convex themselves, and this required some adjustments in the proof. A more minor point is that if C ∈ 𝒞_d then C₁ doesn't necessarily include B(0, 1), although C₂ does. So Bronshtein's Lemma 1 seems incorrect as stated but could be repaired by changing various constants.
In an earlier result of Dudley (1974, Theorem 4.1), for d ≥ 2, in the upper bound, ε^{(1−d)/2} was multiplied by |log ε|. This bound, weaker than Bronshtein's, is easier to prove and suffices to give Corollary 8.4.2(b) and (c). The lower bound in Dudley (1974) was reproduced here. I thank James Munkres for telling me about the triangulation method used after Lemma 8.4.8. Gruber (1983) surveys other aspects of approximation of convex sets.
References

*An asterisk indicates a work I have seen discussed in secondary sources but not in the original.

Bolthausen, E. (1978). Weak convergence of an empirical process indexed by the closed convex subsets of I². Z. Wahrscheinlichkeitsth. verw. Gebiete 43, 173–181.
Bonnesen, Tommy, and Fenchel, Werner (1934). Theorie der konvexen Körper. Springer, Berlin; repub. Chelsea, New York, 1948.
Bronshtein [Bronšteĭn], E. M. (1976). ε-entropy of convex sets and functions. Siberian Math. J. 17, 393–398; transl. from Sibirsk. Mat. Zh. 17, 508–514.
Clements, G. F. (1963). Entropies of several sets of real valued functions. Pacific J. Math. 13, 1085–1095.
Dudley, R. M. (1974). Metric entropy of some classes of sets with differentiable boundaries. J. Approx. Theory 10, 227–236; Correction, ibid. 26 (1979), 192–193.
Eggleston, H. G. (1958). Convexity. Cambridge University Press, Cambridge; reprinted with corrections, 1969.
Eilenberg, S., and Steenrod, N. (1952). Foundations of Algebraic Topology. Princeton University Press, Princeton.
Gruber, P. M. (1983). Approximation of convex bodies. In Convexity and Its Applications, ed. P. M. Gruber and J. M. Wills. Birkhäuser, Basel, pp. 131–162.
Hausdorff, Felix (1914). Mengenlehre, transl. by J. P. Aumann et al. as Set Theory, 3d English ed. of transl. of 3d German ed. (1937). Chelsea, New York, 1978.
Kolmogorov, A. N. (1955). Bounds for the minimal number of elements of an ε-net in various classes of functions and their applications to the question of representability of functions of several variables by functions of fewer variables (in Russian). Uspekhi Mat. Nauk (N.S.) 10, no. 1 (63), 192–194.
Kolmogorov, A. N., and Tikhomirov, V. M. (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk 14, no. 2, 3–86 = Amer. Math. Soc. Transl. (Ser. 2) 17 (1961), 277–364.
Lorentz, George G. (1966). Metric entropy and approximation. Bull. Amer. Math. Soc. 72, 903–937.
Pyke, R. (1983). The Haar-function construction of Brownian motion indexed by sets. Z. Wahrscheinlichkeitsth. verw. Gebiete 64, 523–539.
*Sun, Tze-Gong, and Pyke, R. (1982). Weak convergence of empirical measures. Technical Report no. 19, Dept. of Statistics, Univ. of Washington, Seattle.
Wright, F. T. (1981). The empirical discrepancy over lower layers and a related law of large numbers. Ann. Probab. 9, 323–329.
9 Sums in General Banach Spaces and Invariance Principles
Let (S, ‖·‖) be a Banach space (in general nonseparable). A subset 𝓕 of the unit ball {f ∈ S′ : ‖f‖′ ≤ 1} is called a norming subset if and only if
‖s‖ = sup_{f∈𝓕} |f(s)|  for all s ∈ S.
The whole unit ball in S′ is always a norming subset by the Hahn–Banach theorem (RAP, Corollary 6.1.5). Conversely, given any set 𝓕, let S := ℓ^∞(𝓕) be the set of all bounded real functions on 𝓕, with the supremum norm
‖s‖ = ‖s‖_𝓕 := sup_{f∈𝓕} |s(f)|,  s ∈ S.
Then the natural map f ↦ (s ↦ s(f)) takes 𝓕 one-to-one onto a norming subset of S. So limit theorems for empirical measures, uniformly over a class 𝓕 of functions, can be viewed as limit theorems in a Banach space S with norm ‖·‖_𝓕. Conversely, limit theorems in a general Banach space S with norm ‖·‖ can be viewed as limit theorems for empirical measures on S, uniformly over a class 𝓕 of functions, such as the unit ball of S′, since for f ∈ S′ and x_1, …, x_n ∈ S,
(δ_{x_1} + ⋯ + δ_{x_n})(f) = f(x_1 + ⋯ + x_n).
Suppose that X_j are i.i.d. real random variables with mean 0 and variance 1, and let S_n := Σ_{j≤n} X_j.
One invariance principle says that on some probability space, there exist such X_j and also i.i.d. N(0, 1) variables Y_1, Y_2, …, with T_n := Σ_{j≤n} Y_j, such that (S_n − T_n)/n^{1/2} → 0 in probability. Since T_n/n^{1/2} also has an N(0, 1) distribution for each n, the invariance principle implies that S_n/n^{1/2} is close to T_n/n^{1/2}, which implies the central limit theorem. Although it is not as obvious, central limit theorems generally imply invariance principles. The phrase "invariance principle" means that such quantities as max_{k≤n} S_k/n^{1/2} have asymptotic distributions not depending on the distribution of the X_j.
The main result of the chapter will be Theorem 9.4.2, saying that the Donsker property is equivalent to an invariance principle for empirical processes.
9.1 Independent random elements and partial sums

We need a notion of independence for functions which may not be measurable.
Let (A_j, 𝒜_j, P_j), j = 1, 2, …, be probability spaces, and form a product Π_{j=1}^∞ (A_j, 𝒜_j, P_j) = (B, 𝓑, P) with points x = {x_j}_{j=1}^∞. If X_j are functions on B of the form X_j = h_j(x_j), j = 1, …, n, where each h_j is a function on A_j (not necessarily measurable), then we call the X_j independent random elements. If the h_j are measurable this implies independence in the usual sense. In the rest of this section, X_j = h_j(x_j), j = 1, …, n, are independent random elements with values in a real vector space S on which there is a seminorm ‖·‖, and S_k := X_1 + ⋯ + X_k, k = 1, 2, …, with S_0 := 0. The section collects some inequalities for later use.
9.1.1 Lemma. Suppose given independent random elements X_j with
t_n := Σ_{j=1}^n E‖X_j‖*² and
(9.1.2)  ‖X_j‖ ≤ M,  j = 1, …, n.
Then for 0 < γ ≤ 1/(2M),
(9.1.3)  E exp(γ‖S_n‖*) ≤ exp(3γ²t_n + γE‖S_n‖*).
Hence for any K > 0 and 0 < γ ≤ 1/(2M),
(9.1.4)  Pr{‖S_n‖* > K} ≤ exp(3γ²t_n − γ(K − E‖S_n‖*)).
Remarks. If K ≤ E‖S_n‖* then the infimum of the right side of (9.1.4) is attained at γ = 0 and the bound is trivial. If 0 < K − E‖S_n‖* ≤ 3t_n/M then the infimum is attained at γ = (K − E‖S_n‖*)/(6t_n) and equals exp(−(K − E‖S_n‖*)²/(12t_n)). If K − E‖S_n‖* > 3t_n/M then the infimum (again, for 0 < γ ≤ 1/(2M)) is at γ = 1/(2M) and equals
exp(3t_n/(4M²) − (K − E‖S_n‖*)/(2M)) ≤ exp(−(K − E‖S_n‖*)/(4M)).

Proof. First, (9.1.2) is equivalent to ‖X_j‖* ≤ M, 1 ≤ j ≤ n. Clearly (9.1.4) follows from (9.1.3). To prove (9.1.3), set Y_k := S_n − X_k. For any η ∈ 𝓛¹(B, 𝓑, P), let E_k η := E(η | x_1, …, x_{k−1}), k ≤ n; E_{n+1}η := η, E_1η := Eη.
Then E_{k+1}‖Y_k‖* = E_k‖Y_k‖* by Lemma 3.2.5 with ω_1 = (x_1, …, x_{k−1}), ω_2 = x_k, and ω_3 = (x_{k+1}, …, x_n). Let η_k := E_{k+1}‖S_n‖* − E_k‖S_n‖*. Then by RAP, Theorem 10.1.9,
(9.1.5)  E exp(γ‖S_n‖*) = E(E_n(exp[γE_n‖S_n‖* + γη_n])) = E(exp(γE_n‖S_n‖*) E_n exp(γη_n)).
By boundedness and Lemmas 3.2.2 and 3.2.3,
E_{k+1}‖S_n‖* − E_k‖S_n‖* ≤ E_{k+1}‖X_k‖* + E_{k+1}‖Y_k‖* − E_k‖Y_k‖* + E_k‖X_k‖* = ‖X_k‖* + E‖X_k‖*
for k = 1, …, n, and likewise
E_{k+1}‖S_n‖* − E_k‖S_n‖* ≥ −‖X_k‖* − E‖X_k‖*.
Thus by a binomial expansion and Jensen's inequality (RAP, 10.2.6), E_k|η_k|^j ≤ 2^j E‖X_k‖*^j for j = 1, 2, …. Clearly E_kη_k = 0. So for 0 < γ ≤ 1/(2M),
E_k(exp(γη_k)) = 1 + Σ_{j=2}^∞ γ^j E_k η_k^j / j!
 ≤ 1 + 2γ²E‖X_k‖*² [1 + 2γM/3 + (2γM)²/12 + ⋯]
 ≤ 1 + 2γ²E‖X_k‖*²/(1 − 2γM/3) ≤ exp(2γ²E‖X_k‖*²(1 + γM)),
since 1 + x ≤ e^x, x ∈ R. Combining this with (9.1.5) gives
E exp(γ‖S_n‖*) ≤ exp(2γ²E‖X_n‖*²(1 + γM)) E exp(γE_n‖S_n‖*).
Now as in (9.1.5), E exp(γE_n‖S_n‖*) = E[exp(γE_{n−1}‖S_n‖*) E_{n−1} exp(γη_{n−1})], and the argument can be iterated n times to give (9.1.3). □
Next, Ottaviani's inequality (1.3.15) also extends to starred norms.
9.1.6 Lemma. Suppose independent random elements X_j are such that, for some α > 0,
max_{j≤n} Pr{‖S_n − S_j‖* > α} =: c < 1.
Then
Pr(max_{j≤n} ‖S_j‖* > 2α) ≤ Pr(‖S_n‖* > α)/(1 − c).
Proof. Let j*(ω) be the least j ≤ n such that ‖S_j‖* > 2α (or, say, n + 1 if there is no such j). Then
Pr(‖S_n‖* > α) ≥ Pr(‖S_n‖* > α, max_{j≤n} ‖S_j‖* > 2α)
 = Σ_{j=1}^n Pr(‖S_n‖* > α, j* = j)
 ≥ Σ_{j=1}^n Pr(‖S_n − S_j‖* ≤ α, j* = j),
since ‖S_j‖* ≤ ‖S_n‖* + ‖S_j − S_n‖* (Lemma 3.2.3). Now ‖S_n − S_j‖* is a measurable function of (x_{j+1}, …, x_n) =: y_2, by Lemma 3.2.4 with n = 2 and y_1 := (x_1, …, x_j). Thus ‖S_n − S_j‖* is independent of the event {j* = j} in the usual sense, so the last sum equals
Σ_{j=1}^n Pr(‖S_n − S_j‖* ≤ α) Pr(j* = j) ≥ (1 − c) Σ_{j=1}^n Pr(j* = j) = (1 − c) Pr(max_{j≤n} ‖S_j‖* > 2α). □
Here is another form of the Hoffmann–Jørgensen inequality:

9.1.7 Lemma. Let K > 0 and suppose the independent random elements X_j are such that
Pr(‖S_j − S_i‖* > K) ≤ 1/2,  0 ≤ i < j ≤ n.
Then for any t ≥ K and s > 0,
Pr{‖S_n‖* > 4t + s} ≤ 4 Pr{‖S_n‖* > t}² + Pr{max_{m≤n} ‖X_m‖* > s}.
Proof. Let τ(ω) := min{j ≥ 1 : ‖S_j(ω)‖* > 2t}, or +∞ if there is no such j. Then
(9.1.8)  Pr{‖S_n‖* > 4t + s} = Σ_{m=1}^n Pr{τ = m, ‖S_n‖* > 4t + s}
 ≤ (Σ_{m=1}^n Pr{τ = m, ‖S_n − S_m‖* > 2t}) + Pr(max_{m≤n} ‖X_m‖* > s).
By Lemma 3.2.3, ‖S_n‖* ≤ ‖S_{m−1}‖* + ‖X_m‖* + ‖S_n − S_m‖*, so the last sum Σ′ in (9.1.8) satisfies
Σ′ = Σ_{m=1}^n Pr(τ = m, ‖S_n − S_m‖* > 2t)
 = Σ_{m=1}^n Pr(τ = m) Pr(‖S_n − S_m‖* > 2t)
 ≤ max_{m≤n} Pr(‖S_n − S_m‖* > 2t) Pr(max_{m≤n} ‖S_m‖* > 2t)
 ≤ 2 Pr(‖S_n‖* > t) Pr(max_{m≤n} ‖S_m‖* > 2t) ≤ 4 Pr(‖S_n‖* > t)²,
using Lemma 3.2.4 as in the proof of Lemma 9.1.6 for independence, then using Lemma 9.1.6 itself twice, the first time with the order of summation reversed. The lemma now follows from (9.1.8). □
The random elements X_j will be called symmetric if we can write x_j = (y_j, z_j) where for each j, y_j and z_j are independent and have the same distribution in some space B_j (A_j = B_j × B_j with P_j = Q_j × Q_j for some Q_j) and X_j = ψ_j(y_j) − ψ_j(z_j) for some function ψ_j from B_j into S. With this definition, the Lévy inequality (Lemma 1.3.16) extends to starred norms, with about the same proof:
9.1.9 Lemma. If X_j are independent symmetric random elements then for any r > 0, the following both hold:
(a) Pr(max_{j≤n} ‖S_j‖* > r) ≤ 2 Pr(‖S_n‖* > r);
(b) Pr(max_{j≤n} ‖X_j‖* > r) ≤ 2 Pr(‖S_n‖* > r).
Proof. For (a), let M_k(ω) := max_{j≤k} ‖S_j‖*, with M_0 := 0, and let C_m be the event {M_{m−1} ≤ r < ‖S_m‖*}, 1 ≤ m ≤ n. Since S_m = (S_n + (2S_m − S_n))/2,
‖S_m‖* ≤ (‖S_n‖* + ‖2S_m − S_n‖*)/2,
so if ‖S_m‖* > r, then either ‖S_n‖* > r or ‖2S_m − S_n‖* > r or both. The transformation which interchanges y_j and z_j for j > m preserves probabilities and interchanges S_n and 2S_m − S_n, so it interchanges ‖S_n‖* and ‖2S_m − S_n‖*, while preserving all X_j for j ≤ m. Thus
Pr(C_m ∩ {‖S_n‖* > r}) = Pr(C_m ∩ {‖2S_m − S_n‖* > r}) ≥ ½ Pr(C_m).
Thus Pr(M_n > r) = Σ_{m=1}^n Pr(C_m) ≤ 2 Pr(‖S_n‖* > r). For (b), the proof is similar: replace S_i for i = j, k, or m by X_i, and use the transformation interchanging y_j and z_j for j ≠ m, which does not change any ‖X_j‖*. □
9.1.10 Lemma. Let X_j be independent symmetric random elements and s, t > 0. Then for n = 1, 2, …,
Pr{‖S_n‖* > 2t + s} ≤ 4 (Pr{‖S_n‖* > t})² + Pr{max_{m≤n} ‖X_m‖* > s}.
Proof. Let N_n := max_{m≤n} ‖X_m‖* and let τ be the least j ≤ n such that ‖S_j‖* > t, or +∞ if there is no such j. Then
(9.1.11)  Pr{‖S_n‖* > 2t + s} = Σ_{j=1}^n Pr{τ = j, ‖S_n‖* > 2t + s}.
If τ = j then ‖S_j‖* ≤ t + N_n, so on the event {τ = j, ‖S_n‖* > 2t + s} we have ‖S_n − S_j‖* > t + s − N_n(ω). Thus
Pr{τ = j, ‖S_n‖* > 2t + s} ≤ Pr{τ = j, ‖S_n − S_j‖* > t + s − N_n}
 ≤ Pr{τ = j, ‖S_n − S_j‖* > t} + Pr{τ = j, N_n > s}
 = Pr(τ = j) Pr(‖S_n − S_j‖* > t) + Pr(τ = j, N_n > s),
where for the last step, Lemma 3.2.4 applies as in the proof of Lemma 9.1.6. Now summing on j yields by (9.1.11),
Pr(‖S_n‖* > 2t + s) ≤ Pr(N_n > s) + Σ_{j=1}^n Pr(τ = j) Pr(‖S_n − S_j‖* ≥ t).
The Lévy inequality (Lemma 9.1.9), with ≥ in place of >, gives
Pr(‖S_n − S_j‖* ≥ t) ≤ 2 Pr(‖S_n‖* > t)
for each j, and
Σ_{j=1}^n Pr(τ = j) = Pr(max_{k≤n} ‖S_k‖* > t) ≤ 2 Pr(‖S_n‖* > t).
Combining the last three facts gives the result. □
9.2 A CLT implies measurability in separable normed spaces

Here "CLT" abbreviates "central limit theorem." As noted in the Remarks at the end of Section 1.1, if F_n is an empirical distribution function, Pr(F_n ∈ A) need not be defined if A is complete and discrete, hence Borel, for the (nonseparable) supremum norm. Thus the variables X_j := δ_{x(j)} − P need not be Borel measurable in general for a norm ‖·‖_𝒞 or ‖·‖_𝓕. But in separable Banach spaces one usually assumes that variables are Borel measurable. It will be shown here that in a separable normed space, a form of central limit theorem can hold only if X_1 is measurable. After this is proved in R¹, it will be extended easily to separable normed spaces.
Recall the inner measure Pr_*(B) := sup{Pr(C) : C ⊂ B, C measurable} and that
f_* := −((−f)*) = ess sup{g : g ≤ f, g measurable}.

9.2.1 Theorem. Let (A, 𝒜, P) be a probability space and x_n coordinates on (A^∞, 𝒜^∞, P^∞), n = 1, 2, …. Let h : A → R (where h is not assumed measurable) and X_i := h(x_i). Let S_n := X_1 + ⋯ + X_n. If for all t,
lim_{n→∞} Pr_*(S_n/n^{1/2} ≤ t) = lim_{n→∞} Pr*(S_n/n^{1/2} ≤ t) = N(0, 1)((−∞, t]),
then h is measurable for the completion of P, so that the X_i are measurable, EX_i = 0 and EX_i² = 1.
Proof. First, for the measurable case we need the following converse of the usual one-dimensional central limit theorem (RAP, Theorem 9.5.6).

9.2.2 Theorem. Let X_1, X_2, … be i.i.d. real random variables and S_n := X_1 + ⋯ + X_n. If the law of S_n/n^{1/2} converges to N(0, 1) as n → ∞ then EX_1 = 0 and EX_1² = 1.
Proof. Let X_1, Y_1, X_2, Y_2, … all be i.i.d. and Z_n := (X_n − Y_n)/2^{1/2} for all n. Then the Z_n are i.i.d. and the law of (Z_1 + ⋯ + Z_n)/n^{1/2} also converges to N(0, 1). If h and f are the characteristic functions of X_1 and Z_1 respectively, then f(t) = |h(t/2^{1/2})|² = f(−t) and 0 ≤ f(t) ≤ 1 for all t. Let g(t) := 1 − f(t) for all t. As n → ∞,
(9.2.3)  (1 − g(t/n^{1/2}))^n → exp(−t²/2)
for all t, uniformly for |t| ≤ 1 (RAP, Theorem 9.4.5, Corollary 11.3.4). I claim that g(u)/u² → 1/2 as u → 0. If not, first suppose for some δ > 0, g(u_k) ≥ (½ + δ)u_k² for some u_k → 0. Taking a subsequence, we can assume that for all n, u_n = t_n/n^{1/2} where |t_n| ≤ 1. Then (1 − g(t_n/n^{1/2}))^n ≤ exp(−(½ + δ)t_n²), contradicting (9.2.3) as n → ∞. Or, if g(u_k) ≤ (½ − δ)u_k² for some u_k → 0, again we can assume u_n = t_n/n^{1/2} where |t_n| ≤ 1 for all n. Then (1 − g(t_n/n^{1/2}))^n ≥ exp(−(1 − δ)t_n²/2) for n large enough, again contradicting (9.2.3). So, as claimed, g(u)/u² → 1/2 as u → 0. Now for P := 𝓛(Z_1), and all u,
(9.2.4)  g(u)/u² = (g(u) + g(−u))/(2u²) = ∫ (1 − cos(ux))/u² dP(x).
Letting u → 0 and applying Fatou's Lemma (RAP, 4.3.3) gives that ∫ x² dP(x) ≤ 1. Then by the Tonelli–Fubini theorem, E(X_1²) < ∞, so E|X_1| < ∞, and (S_n − nEX_1)/n^{1/2} converges in law to some N(0, σ²). Since S_n/n^{1/2} also converges in law, it follows by tightness that EX_1 = 0 and
then EX_1² = σ² = 1. □

So, to continue the proof of Theorem 9.2.1, suppose h is nonmeasurable for P, so that the X_n are nonmeasurable. Let B := {x ∈ A : h*(x) = +∞}. Suppose P(B) > 0. Then for each j, Pr(X_j* = +∞) = P(B) by Lemma 3.2.4, case (b). By Lemma 3.2.4(c), S_n* = +∞ a.s. on the set where X_j* = +∞ for any j ≤ n. Thus
Pr((S_n/n^{1/2})* = +∞) ≥ Pr(X_j* = +∞ for some j ≤ n) = 1 − (1 − P(B))^n → 1
as n → ∞. By Lemma 3.2.6 this contradicts the hypothesis. Thus P(B) = Pr(X_1* = +∞) = 0.
Let B_j := B(j) := {X_j > X_j* − 2^{−j}}. Then Pr_*(B_j) = 1 for each j. Let C_n := ⋂_{j=1}^n B_j. Apply Lemma 3.2.4 with P_j = P and f = 1_{B(j)} to obtain Pr_*(C_n) = 1. On C_n,
S_n ≤ X_1* + ⋯ + X_n* ≤ S_n + 1.
Thus (X_1* + ⋯ + X_n*)/n^{1/2} also converges in law to N(0, 1). Hence by Theorem 9.2.2, EX_1* = 0. Likewise EX_{1*} = 0. Thus X_1* = X_1 = X_{1*} a.s., that is, X_1 is completion measurable. □
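The claim g(u)/u² → 1/2 in the proof of Theorem 9.2.2 can be illustrated in the simplest case. The following snippet (an aside, not part of the proof) takes X, Y to be ±1 with probability 1/2 each, so that Z = (X − Y)/2^{1/2} has EZ² = 1 and characteristic function f(t) = 1/2 + cos(2^{1/2}t)/2:

```python
import math

# With X, Y Rademacher and Z = (X - Y)/2**0.5, the characteristic function
# of Z is f(t) = 1/2 + cos(2**0.5 * t)/2.  Then g := 1 - f satisfies
# g(u)/u**2 -> 1/2 as u -> 0, matching E Z**2 = 1 (half the second moment).
def g(u):
    return 1.0 - (0.5 + 0.5 * math.cos(math.sqrt(2.0) * u))

for u in (1.0, 0.1, 0.01, 0.001):
    print(u, g(u) / u ** 2)   # tends to 0.5
```

The convergence is quadratic in u here, since the law of Z has all moments; the proof in the text of course needs no such assumption.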
9.2.5 Corollary. Suppose (S, ‖·‖) is a separable normed space and X_n = h(x_n) where the x_n are independent, identically distributed random variables with values in some measurable space (A, 𝒜) and h is any function from A into S (not assumed measurable). Suppose Y_n are i.i.d. Gaussian variables in S with mean 0 and
(9.2.6)  lim_{n→∞} ‖n^{−1/2} Σ_{j≤n} (X_j − Y_j)‖* = 0
in probability. Then the X_j are completion measurable for the Borel σ-algebra on S.

Proof. Let 𝓕 = S′, the dual Banach space. Then (9.2.6) implies that for each f ∈ 𝓕, the central limit theorem as in Theorem 9.2.1 holds for the f(h(x_j)), with some limit law N(0, σ_f²), σ_f² := σ²(f(Y_1)). Thus by that theorem, each f ∘ h, f ∈ 𝓕, is measurable for the completion of 𝓛(x_1) on A. So each f(X_j) is completion measurable, f ∈ 𝓕. Since S is separable, by the Hahn–Banach theorem there is a countable norming subset {f_n} ⊂ S′, so that ‖s‖ = sup_n |f_n(s)| for all s ∈ S, and the Borel σ-algebra of S is generated by the f_n. Thus the X_j are completion measurable. □

9.2.7 Corollary. If the hypotheses of Corollary 9.2.5 hold except that S is not necessarily separable, and if H is a bounded linear operator from S into a separable normed space, then the H(X_j) are completion measurable.
9.3 A finite-dimensional invariance principle

The invariance principles to be treated here are extensions of central limit theorems to joint distributions of partial sums for the sequence of values of n.
Specifically, for example, let P be a law on R with mean 0 and variance 1. One invariance principle says that on some probability space there exist random variables X_i, independent with distribution P, and random variables Y_i, independent with standard normal distribution N(0, 1) (but not independent of the X_j), such that
lim_{n→∞} max_{k≤n} |S_k − T_k|/n^{1/2} = 0 in probability, with T_k := Y_1 + ⋯ + Y_k.
Let S_k := X_1 + ⋯ + X_k as usual. Invariance principles say that the asymptotic behavior of quantities such as max(S_1, …, S_n)/n^{1/2} is the same as if the X_i had a standard normal distribution, or as in the "simple random walk" case where X_1 = +1 or −1 with probability 1/2 each. This invariance is helpful in computing asymptotic distributions, since for the normal or random walk case there are reflection principles (as in RAP, Section 12.3). This section will prove an invariance principle for R^d-valued random variables. Then, for d = 1, another form of the theorem, in C[0, 1], will be proved. The next section will prove a general invariance principle for empirical processes. Here is the main result of this section:
9.3.1 Theorem. Let P be a law on R^d with mean 0. Let ‖·‖ be the usual Euclidean norm on R^d. Assume that ∫‖x‖² dP(x) < ∞. Let C_{ij} := ∫ x_i x_j dP(x) where x = (x_1, …, x_d) ∈ R^d. Then on any nonatomic probability space there exist random variables X_i independent with law P, and Y_i independent with law N(0, C), such that for T_k := Y_1 + ⋯ + Y_k,
(9.3.2)  lim_{n→∞} max_{k≤n} ‖S_k − T_k‖/n^{1/2} = 0
in probability.
Proof. Let 0 < ε ≤ 0.01. For k = 1, 2, …, let t_k := t_k(ε) := t(k) := [(1 + ε⁴)^k], where [x] denotes the greatest integer ≤ x. Let t_0 := 0. Let n_k := n_k(ε) := n(k) := t_{k+1} − t_k, k = 0, 1, ….

9.3.3 Lemma. There is a δ > 0 such that for 0 < ε ≤ δ and k ≥ k_2(ε) large enough,
Pr{max_{r≤n(k)} ‖Σ_{t(k)<i≤t(k)+r} X_i‖ ≥ εt(k)^{1/2}} ≤ 4ε⁶.
Proof. For k = 1, 2, …, let
c_k := max_{0<j≤n(k)} Pr{‖S_{n(k)} − S_j‖ ≥ εt_k^{1/2}/2}.
Let K_n := 𝓛(S_n/n^{1/2}). Then by the central limit theorem (RAP, 9.5.6) the laws K_n converge to G := N(0, C) as n → ∞. Thus for the Prokhorov metric ρ (RAP, Section 11.3), ρ(K_n, G) → 0 as n → ∞. Then for N_0 large enough, and all n ≥ N_0,
Pr{‖S_n/n^{1/2}‖ > 1/(5ε)} ≤ G{x : ‖x‖ ≥ 1/(6ε)} + ε⁶.
Also, for δ > 0 small enough, G{x : ‖x‖ ≥ 1/(6ε)} ≤ ε⁶ for 0 < ε ≤ δ by the Landau–Shepp–Marcus–Fernique theorem (Lemma 2.2.5 suffices in this case). Take δ ≤ 1/2. Then for such an ε and k ≥ k_1(ε) large enough, we have ε⁴(1 + ε⁴)^k ≥ 256, so
(1 + ε⁴)^k − 1 ≥ (15/16)(1 + ε⁴)^k,  n(k) ≤ ε⁴(1 + ε⁴)^k + 1,
4ε²(1 + ε⁴)^{k/2} + 4 ≤ (17/4)ε²(1 + ε⁴)^{k/2},
and for N_0 ≤ n ≤ n_k,
1/(5ε) ≤ ε((1 + ε⁴)^k − 1)^{1/2}/(4ε²(1 + ε⁴)^{k/2} + 4) ≤ εt(k)^{1/2}/(4n(k)^{1/2}) ≤ εt(k)^{1/2}/(4n^{1/2}).
Thus
(9.3.4)  Pr{‖S_n‖ ≥ εt_k^{1/2}/4} ≤ 2ε⁶,
which holds also for n < N_0, for k ≥ k_2(ε) ≥ k_1(ε) large enough, and thus for 1 ≤ n ≤ n_k. Then in particular, c_k as defined above is less than 1/2. Applying Ottaviani's inequality (1.3.15) gives
Pr{max_{r≤n(k)} ‖Σ_{t(k)<i≤t(k)+r} X_i‖ ≥ εt_k^{1/2}} ≤ (1 − c_k)^{−1} Pr{‖S_{n(k)}‖ ≥ εt_k^{1/2}/2} ≤ 4ε⁶. □
Now continuing the proof of Theorem 9.3.1, let H_k := H(k) := {j : t_k < j ≤ t_{k+1}}. Then H_k has n_k elements. Define V_k by V_k := Σ_{j∈H(k)} X_j/n_k^{1/2}. Let F_k := 𝓛(V_k). Then F_k → G as k → ∞. Thus ρ_k := ρ(F_k, G) → 0 (by RAP, Theorem 11.3.3 again). By Strassen's theorem (RAP, 11.6.2), for each k there is a law μ_k on R^{2d} such that for 𝓛((y, z)) = μ_k we have 𝓛(y) = F_k, 𝓛(z) = G, and μ_k{(y, z) : ‖y − z‖ > ρ_k} ≤ ρ_k. The joint distribution of the X_j for j ∈ H_k and V_k gives a law β_k on R^{n(k)d} × R^d. So we can apply Lemma 1.1.10 (the Vorob'ev–Berkes–Philipp theorem) to get a law α_k on R^{n(k)d} × R^d × R^d which has marginals β_k on R^{n(k)d} × R^d and μ_k on R^d × R^d. Let Z_k be the z coordinate. If z_j for j ∈ H_k are independent, each with law N(0, C), and U_k := Σ_{j∈H(k)} z_j/n_k^{1/2}, then 𝓛(U_k) = N(0, C) = 𝓛(Z_k), where Z_k has its law under α_k (or μ_k). By the Vorob'ev–Berkes–Philipp lemma (1.1.10) again, there is a law γ_k on R^{n(k)d} × R^d × R^d × R^{n(k)d} for which the marginal on the latter R^d × R^{n(k)d} is the law of (U_k, {z_j}_{j∈H(k)}) and the law on the product of the first three factors is α_k. Taking the product over k of all these spaces and laws γ_k (RAP, Section 8.2), we can assume that the coordinates of the kth factor are {X_j}_{j∈H(k)}, V_k, U_k = Z_k, and {z_j}_{j∈H(k)}, since this gives the correct joint distributions for those variables which had them previously. Specifically, all the X_j remain independent and identically distributed, and likewise for the z_j. Then for each k,
(9.3.5)  Pr{‖V_k − U_k‖ > ρ_k} ≤ ρ_k.
Take k_0 large enough so that for all k ≥ k_0,
(9.3.6)  ρ_k ≤ ε⁶.
For each n let S(n) := S_n and T(n) := T_n := z_1 + ⋯ + z_n. Let
s := [log(3/ε⁴)/log(1 + ε⁴)] + 1, so that
(9.3.7)  (1 + ε⁴)^s > 3ε^{−4}.
Also note that for ε small enough, by L'Hospital's rule, log(1 + ε⁴) > ε⁴/2, log(3/ε⁴) < 1/(5ε), and 1 < ε^{−5}/10, so
(9.3.8)  s := s(ε) ≤ ε^{−5}/2.
9.3.9 Lemma. For ε > 0 small enough there is an M_0 := M_0(ε) < ∞ such that for M ≥ M_0 and m := M − s, we have
Pr{max_{m≤k≤M} ‖S(t_k) − T(t_k)‖ ≥ εt_M^{1/2}} < ε.
Proof. For M ≥ M_1(ε) large enough, ε⁴(1 + ε⁴)^{M−s} ≥ 3 and m ≥ k_0. Then
n_{M−1} ≥ ε⁴(1 + ε⁴)^{M−1} − 1 ≥ 2ε⁴(1 + ε⁴)^{M−1}/3 ≥ ε⁴(1 + ε⁴)^M/3 ≥ (1 + ε⁴)^{M−s}
by (9.3.7), so
(9.3.10)  t_m ≤ n_{M−1}.
Also, by the definitions of t_j and n_k and a geometric series,
(9.3.11)  Σ_{k=m}^M n_k^{1/2} ≤ 2ε² Σ_{k=m}^M (1 + ε⁴)^{(k+1)/2} ≤ 8ε^{−2}(1 + ε⁴)^{(M+1)/2} ≤ 9ε^{−2} t_M^{1/2}
for 0 < ε ≤ ε_0 small enough (where ε_0 is an absolute constant) and M large. Then the probability in the statement of the lemma is bounded above by P_1 + P_2 + P_3, where
P_1 := Pr{max_{m≤k≤M} ‖S(t_k) − S(t_m) − (T(t_k) − T(t_m))‖ ≥ εt_M^{1/2}/2},
P_2 := Pr{‖S(t_m)‖ ≥ εt_M^{1/2}/4},
P_3 := Pr{‖T(t_m)‖ ≥ εt_M^{1/2}/4}.
It follows from (9.3.10) and (9.3.4) for n ≤ n_k that P_2 ≤ 2ε⁶ for M ≥ M_2(ε) large enough. Since the z_j have just as good properties as the X_j do, we get also P_3 ≤ 2ε⁶ for M ≥ M_2(ε). Now S(t_k) − S(t_m) = Σ_{m≤i<k} n_i^{1/2} V_i and likewise T(t_k) − T(t_m) = Σ_{m≤i<k} n_i^{1/2} U_i. Thus by (9.3.11),
P_1 ≤ Pr{Σ_{m≤k<M} n_k^{1/2} ‖V_k − U_k‖ ≥ εt_M^{1/2}/2} ≤ Σ_{m≤k<M} Pr{‖V_k − U_k‖ ≥ ε³/18} ≤ sε⁶ ≤ ε/2
for ε small enough and m ≥ k_0, using (9.3.5), (9.3.6), and (9.3.8). The lemma follows. □
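The blocking facts (9.3.7), (9.3.10), and (9.3.11) can be spot-checked numerically; the sketch below (an aside, with hypothetical sample values ε = 0.2 and M = 9500) does so. Note that (9.3.8) needs a much smaller ε than is convenient to tabulate, so it is not checked here:

```python
import math

# Numerical check of the blocking facts (9.3.7), (9.3.10), (9.3.11) for a
# sample epsilon and a sample block index M (both chosen for convenience).
eps = 0.2
q = 1 + eps ** 4                    # growth ratio of the blocks
s = int(math.log(3 / eps ** 4) / math.log(q)) + 1
assert q ** s > 3 / eps ** 4        # (9.3.7)

M = 9500                            # large enough that eps**4 * q**(M-s) >= 3
m = M - s
assert eps ** 4 * q ** m >= 3
t = [int(q ** k) for k in range(M + 2)]       # t_k = [(1 + eps**4)**k]
n = [t[k + 1] - t[k] for k in range(M + 1)]   # n_k = t_{k+1} - t_k
assert t[m] <= n[M - 1]             # (9.3.10)
total = sum(math.sqrt(n[k]) for k in range(m, M + 1))
assert total <= 9 * eps ** -2 * math.sqrt(t[M])   # (9.3.11)
print("(9.3.7), (9.3.10), (9.3.11) hold for eps =", eps, "M =", M)
```

The point of (9.3.10) and (9.3.11) is visible in the numbers: the last single block is already longer than all of the first m blocks together, while the square roots of the block lengths sum to only a bounded multiple of t_M^{1/2}.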
Now to finish the proof of Theorem 9.3.1, given n, let M := M(n) be such that t_{M−1} < n ≤ t_M and again m := M − s. Let n be large enough so that m ≥ k_2(ε/2) in Lemma 9.3.3 and M ≥ max(M_0(ε), k_2(ε) + 1). Then
Pr{max_{j≤n} ‖S_j − T_j‖ ≥ 5εn^{1/2}} ≤ q_1 + q_2 + q_3 + q_4 + q_5, where
q_1 := Pr{max_{j≤t(m)} ‖S_j‖ ≥ εn^{1/2}},
q_2 := Pr{max_{j≤t(m)} ‖T_j‖ ≥ εn^{1/2}},
q_3 := Pr{max_{m≤k≤M} ‖S(t_k) − T(t_k)‖ ≥ 2εn^{1/2}},
q_4 := Pr{max_{m≤k<M} max_{t(k)<j≤t(k+1)} ‖S_j − S(t_k)‖ ≥ εn^{1/2}},
q_5 := Pr{max_{m≤k<M} max_{t(k)<j≤t(k+1)} ‖T_j − T(t_k)‖ ≥ εn^{1/2}}.
Since t_m ≤ n_{M−1} by (9.3.10) and n > t_{M−1}, Lemma 9.3.3 (the X_i being i.i.d.) gives
q_1 ≤ Pr{max_{r≤n(M−1)} ‖Σ_{t(M−1)<i≤t(M−1)+r} X_i‖ ≥ εt(M − 1)^{1/2}} ≤ 4ε⁶.
Then likewise, q_2 ≤ 4ε⁶. For q_3, we get an upper bound by replacing 2εn^{1/2} by εt(M)^{1/2}, then applying Lemma 9.3.9 to get q_3 < ε. Next,
q_4 ≤ Σ_{m≤k<M} Pr{max_{t(k)<j≤t(k+1)} ‖S_j − S(t_k)‖ ≥ εn^{1/2}}.
Another upper bound is obtained by replacing n by t_k for each k. Then applying Lemma 9.3.3 with k ≥ m ≥ k_2(ε/2) and (9.3.8) again gives a bound 4(ε/2)⁶s < ε. Likewise, q_5 < ε. So the sum of all five q_j is less than 5ε for n large enough.
The remaining difficulty is that the construction depends on ε. Now, for p = 1, 2, …, it has been shown that there are sequences {X_i^{(p)}, i ≥ 1} i.i.d. (P) with partial sums S_k^{(p)} and {Y_i^{(p)}, i ≥ 1} i.i.d. N(0, C) with partial sums T_k^{(p)} such that for n ≥ n_0(p),
(9.3.12)  Pr{n^{−1/2} max_{k≤n} ‖S_k^{(p)} − T_k^{(p)}‖ > 2^{−p}} < 2^{−p}.
We can assume that the X_i^{(p)} are independent for different p, as are the Y_i^{(p)}. Let r(p) := Σ_{q=1}^p n_0(q), p ≥ 2, and r(1) := 0. For each i = 1, 2, …, there is a unique p such that r(p) < i ≤ r(p + 1). Define
(9.3.13)  X_i := X_{i−r(p)}^{(p)},  Y_i := Y_{i−r(p)}^{(p)},  i ≥ 1.
Now (9.3.2) will be verified for the given X_i and Y_i. Clearly the X_i are i.i.d. (P) and the Y_i are i.i.d. N(0, C). Let ε > 0 and take p_0 := p(0) ≥ 2 such that 2^{−p(0)} < ε. Take N_0 large enough so that for all n ≥ N_0,
Pr{n^{−1/2} max_{k≤r(p(0))} ‖S_k‖ > ε} < ε  and  Pr{n^{−1/2} max_{k≤r(p(0))} ‖T_k‖ > ε} < ε.
Let n ≥ max(N_0, n_0(p_0)) and take the unique M such that r(M) < n ≤ r(M + 1). Then
n^{−1/2} max_{k≤n} ‖S_k − T_k‖ ≤ U_1 + U_2 + Σ_{p=p(0)}^{M−1} V_p + U_3,
where
U_1 := n^{−1/2} max_{k≤r(p(0))} ‖S_k‖,  U_2 := n^{−1/2} max_{k≤r(p(0))} ‖T_k‖,
V_p := max_{r(p)<j≤r(p+1)} n^{−1/2} ‖Σ_{i=r(p)+1}^j (X_i − Y_i)‖,
and
U_3 := max_{r(M)<j≤n} n^{−1/2} ‖Σ_{i=r(M)+1}^j (X_i − Y_i)‖.
Then Pr(U_i > ε) < ε for i = 1, 2. Now by (9.3.13),
V_p = n^{−1/2} max_{1≤k≤n_0(p+1)} ‖Σ_{i=1}^k (X_i^{(p)} − Y_i^{(p)})‖,
so for 2 ≤ p(0) ≤ p ≤ M − 1, since n ≥ r(M) ≥ n_0(p + 1) and n ≥ n_0(p), (9.3.12) gives Pr(V_p > 2^{−p}) < 2^{−p}. Further, since n ≥ r(M) ≥ n_0(M) and n ≥ n − r(M), we have Pr(U_3 > 2^{−M}) < 2^{−M}. Thus
Σ_{p=p(0)}^{M−1} V_p + U_3 ≤ 2^{1−p(0)} < 2ε
except on an event A with Pr(A) < 2ε. So
max_{k≤n} ‖S_k − T_k‖/n^{1/2} ≤ 4ε
except on a set of probability < 4ε, and Theorem 9.3.1 is proved. □
There is another form of the invariance principle which is older but in some ways more abstract. Let C[0, 1] be the space of all continuous real functions on [0, 1] with supremum norm. For a sequence X_1, X_2, … of real random variables, independent with distribution P, with partial sums S_n := X_1 + ⋯ + X_n, let μ_{n,P} be the law of the random function on [0, 1] which equals S_k/n^{1/2} at k/n for k = 0, 1, …, n (where S_0 := 0) and which is linear on each closed interval [(k − 1)/n, k/n], k = 1, …, n. Then μ_{n,P} is a law on C[0, 1]. Let μ be the law of the Wiener process (Brownian motion) restricted to [0, 1]. This can be chosen as a law on C[0, 1] by sample continuity (RAP, Theorem 12.1.5).

9.3.14 Corollary. For any law P with mean 0 and variance 1, the laws μ_{n,P} converge to μ on C[0, 1] as n → ∞.
Proof. First let P = N(0, 1) =: Q. For f ∈ C[0, 1] let T_n(f) be the function which equals f at k/n for k = 0, 1, …, n and is linear in between (on each interval [(k − 1)/n, k/n] for k = 1, …, n). Then T_n(f) → f in the supremum norm for each f ∈ C[0, 1]. On the other hand the law of T_n for μ is exactly μ_{n,Q} in this case, so μ_{n,Q} → μ. Thus for the Prokhorov metric ρ for laws on C[0, 1], ρ(μ_{n,Q}, μ) → 0 by RAP, Theorem 11.3.3. For two functions, each linear on an interval [(k − 1)/n, k/n], the supremum distance between the two functions on that interval is attained at one of its endpoints. So it follows from Theorem 9.3.1 that ρ(μ_{n,Q}, μ_{n,P}) → 0 as n → ∞. The triangle inequality gives ρ(μ_{n,P}, μ) → 0, so μ_{n,P} → μ. □

Corollary 9.3.14 can be applied to sets of functions falling between two given functions, as follows. Recall that a function f on a metric space (S, d) is said to satisfy a Hölder condition of order r, for 0 < r ≤ 1, if there is a constant K < ∞ such that for all x, y ∈ S, |f(x) − f(y)| ≤ K d(x, y)^r.

9.3.15 Theorem. Again let P have mean 0 and variance 1. Suppose a(·) and b(·) are two real-valued functions on [0, 1], each satisfying a Hölder condition of order r > 1/2, such that a(0) < 0 < b(0) and a(t) < b(t) for 0 ≤ t ≤ 1. Let (a, b) := {f : a(t) < f(t) < b(t) for 0 ≤ t ≤ 1}. Then μ_{n,P}((a, b)) → μ((a, b)) as n → ∞.

Proof. The boundary of (a, b) in C[0, 1] with supremum norm is easily seen
to be the set of all f ∈ C[0, 1] such that a(t) ≤ f(t) ≤ b(t) for 0 ≤ t ≤ 1 and for some t ∈ [0, 1], f(t) = a(t) or f(t) = b(t). The result follows from Corollary 9.3.14 and the portmanteau theorem (RAP, Theorem 11.1.1) if one shows that all the sets (a, b) are continuity sets for μ, in other words that their
boundaries ∂(a, b) have probability 0 for µ. To see that, let W_t, 0 ≤ t ≤ 1, be a sample-continuous Wiener process (Brownian motion). Let τ be the least t ∈ [0, 1], if one exists, such that W_t = a(t) or W_t = b(t). Clearly, the probability that τ = 1 is 0. If 0 ≤ τ < 1, we apply the strong Markov property (RAP, Theorem 12.2.3). Thus, conditional on the σ-algebra of events up to time τ, W_{τ+h} − W_τ is equal in distribution to W_h, h ≥ 0. Now, there is a local law of the iterated logarithm for W_t at t = 0, namely, for v(t) := (2t log log(1/t))^{1/2}, almost surely lim sup_{t↓0} W_t/v(t) = − lim inf_{t↓0} W_t/v(t) = 1. This follows from the law at infinity (RAP, Theorem 12.5.2) and the fact that the process {tW_{1/t}}_{t>0} has the same distribution as {W_t}_{t>0}, being Gaussian with mean 0 and the same covariances. So, suppose W_τ = b(τ). Then b(τ + h) − b(τ) < v(h)/2 for small enough h > 0 by the Hölder condition, so we will have almost surely W(τ + h) > b(τ + h) for some h with 0 < h < 1 − τ. So the path t ↦ W_t, 0 ≤ t ≤ 1, will not be in the boundary of (a, b). A similar argument holds if W_τ = a(τ), so the proof is done.
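The comparison used above, of the Hölder increment K h^r against half the local iterated-logarithm scale v(h) = (2h log log(1/h))^{1/2}, can be checked numerically. The constants K = 1 and r = 0.6 below are illustrative choices, not taken from the text; the point is only that h^r = o(v(h)) as h ↓ 0 whenever r > 1/2.

```python
import math

# v(h) = sqrt(2 h log log(1/h)), the local LIL scale at 0
def v(h):
    return math.sqrt(2.0 * h * math.log(math.log(1.0 / h)))

K, r = 1.0, 0.6          # sample Hölder constant and order r > 1/2
ratios = []
for k in range(3, 9):    # h = 1e-3, ..., 1e-8
    h = 10.0 ** (-k)
    assert K * h ** r < v(h) / 2.0    # Hölder increment stays below v(h)/2
    ratios.append(K * h ** r / v(h))
# the ratio tends to 0 as h -> 0, since h^r = o(h^{1/2}) for r > 1/2
assert all(a > b for a, b in zip(ratios, ratios[1:]))
```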
9.4 Invariance principles for empirical processes
Recall the notion of coherent G_P process (Section 3.1). Let (A, A, P) be a probability space and F ⊂ L²(A, A, P). Let Ω be the product of ([0, 1], B, λ), where B is the Borel σ-algebra and λ is Lebesgue measure, and a countable product of copies of (A, A, P), with coordinates x_j. Then F will be called a functional Donsker class for P if it is pregaussian for P and there are independent coherent G_P processes Y_j(f, ω), f ∈ F, ω ∈ Ω, such that f ↦ Y_j(f, ω) is bounded and ρ_P-uniformly continuous for each j and almost all ω, and such that, in outer probability,
(9.4.1)  n^{−1/2} max_{m≤n} || Σ_{j=1}^m (δ_{x_j} − P − Y_j) ||_F → 0.
Recalling the notion of Donsker class (as defined in Section 3.1), we have an equivalence:
9.4.2 Theorem Given any probability space (A, A, P) and F ⊂ L²(A, A, P), F is a functional Donsker class if and only if it is a Donsker class.
Proof The "only if" direction is easy and will be proved first. Let F be a functional Donsker class for P. Let x_j and Y_j be as in (9.4.1). Given ε > 0,
Sums in General Banach Spaces and Invariance Principles
take n_0 large enough so that for n ≥ n_0,
Pr*{ n^{−1/2} || Σ_{j=1}^n (δ_{x_j} − P − Y_j) ||_F > ε/3 } < ε/2.
By definition, F is pregaussian for P, so F is totally bounded for ρ_P by the Sudakov-Chevet theorem (2.3.5). Also, there is a δ > 0, by Theorem 3.1.1, such that for a coherent version of G_P,
Pr{ sup { |G_P(f) − G_P(g)| : f, g ∈ F, ρ_P(f, g) < δ } > ε/3 } < ε/2.
By assumption, each Y_j, and so each (Y_1 + ⋯ + Y_n)/n^{1/2}, is a coherent version of G_P. Combining gives an asymptotic equicontinuity condition for the empirical processes ν_n, proving "only if" by Theorem 3.7.2.
To prove "if," assume F is a P-Donsker class and again apply Theorem 3.7.2.
So F is totally bounded for ρ_P and we have an asymptotic equicontinuity condition: for each k = 1, 2, …, there is a δ_k > 0 and an N_k < ∞ such that
(9.4.3)  Pr*{ sup { |ν_n(f) − ν_n(g)| : f, g ∈ F, ρ_P(f, g) < δ_k } > 2^{−k} } < 2^{−k},  n ≥ N_k.
We may assume N_k increases with k. As in the proof of Theorem 3.7.2, we have for a coherent G_P that
(9.4.4)  Pr{ sup { |G_P(f) − G_P(g)| : f, g ∈ F, ρ_P(f, g) < δ_k } > 2^{−k} } < 2^{−k}.
Let U := UC(F) denote the set of all prelinear real-valued functions on F uniformly continuous for ρ_P. Then U is a separable subspace of S := ℓ^∞(F), as shown after the proof of Lemma 2.5.4. F is pregaussian for P, and G_P has a law µ defined on the Borel sets of U by Theorem 2.5.5. For k = 1, 2, …, let F_k be a finite subset of F such that
sup_{f∈F} min { ρ_P(f, g) : g ∈ F_k } < δ_k.
Let T_k denote the finite-dimensional space of all real functions on F_k, with supremum norm || ||_k. Let m(k) be the number of functions in F_k and let F_k = {g_1, …, g_{m(k)}}. For each f ∈ F let f_k := g_j for the least j such that ρ_P(f, g_j) < δ_k. For any φ ∈ S let φ_k(f) := φ(f_k), f ∈ F. Then φ_k ∈ S. Let A_k φ := φ_k and X_{kj} := X_j − A_k X_j, where X_j := δ_{x_j} − P. There is a complete separable subspace T of S which includes both U and the (finite-dimensional) ranges of all the A_k.
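The net projection f ↦ f_k and the operator A_k can be sketched concretely. The class, the net, and the functional below are toy choices, not from the text, with ρ_P replaced by an L² distance over finitely many sample points; the sketch only illustrates that (A_k φ)(f) := φ(f_k) is well defined given a δ-net and satisfies ||A_k φ|| ≤ ||φ||.

```python
import math

def rho(f, g):
    # stand-in for rho_P: L2 distance under the uniform measure on coordinates
    n = len(f)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / n)

def project(f, net, delta):
    # f_k: the first net element within delta of f
    for g in net:
        if rho(f, g) < delta:
            return g
    raise ValueError("net is not a delta-net for f")

# toy class F (functions listed by their values at 4 points) and a 0.3-net
F = [(0.0, 0.1, 0.2, 0.3), (0.05, 0.1, 0.2, 0.35), (1.0, 1.0, 0.9, 0.8)]
net = [(0.0, 0.1, 0.2, 0.3), (1.0, 1.0, 0.9, 0.8)]
delta = 0.3
phi = {g: max(g) for g in net}                       # a functional on the net
A_phi = {f: phi[project(f, net, delta)] for f in F}  # (A_k phi)(f) := phi(f_k)
assert max(abs(v) for v in A_phi.values()) <= max(abs(phi[g]) for g in net)
```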
Let || || := || ||_F from here on. Note that ||A_k φ|| ≤ ||φ|| for all k and all φ ∈ S. Then by (9.4.3) we have for n ≥ N_k,
(9.4.5)  Pr*{ || n^{−1/2} Σ_{j=1}^n X_{kj} || > 2^{−k} } = Pr*{ sup_{f∈F} |ν_n(f − f_k)| > 2^{−k} } < 2^{−k}.
Thus by Lemma 3.2.6,
Pr{ ( || n^{−1/2} Σ_{j=1}^n X_{kj} || )* > 2^{−k} } < 2^{−k},  n ≥ N_k.
Then if n ≥ 2N_k and m = 0, 1, …, n, we have by Lemma 3.2.3 a.s.
( || n^{−1/2} Σ_{j=1}^m X_{kj} || )* ≤ ( || n^{−1/2} Σ_{j=1}^n X_{kj} || )* + ( || n^{−1/2} Σ_{j=m+1}^n X_{kj} || )*,
and since either n − m ≥ N_k or m ≥ N_k,
Pr{ ( || n^{−1/2} Σ_{j=m+1}^n X_{kj} || )* > 2^{1−k} } < 2^{1−k}.
Thus for k ≥ 2, Ottaviani's inequality (Lemma 9.1.6) with c = 1/2 gives
(9.4.6)  Pr*{ n^{−1/2} max_{m≤n} || Σ_{j=1}^m X_{kj} || > 2^{2−k} } < 2^{2−k}
for n ≥ 2N_k. Let P_k be the law on T_k of f ↦ f(x_1) − ∫ f dP, f ∈ F_k. Then by Theorem 9.3.1 there exist V_{kj} i.i.d. P_k, W_{kj} i.i.d. with a Gaussian law Q_k, and some n_k := n(k) > 2N_k such that for all n ≥ n_k, k ≥ 1,
(9.4.7)  Pr{ n^{−1/2} max_{m≤n} || Σ_{j≤m} (V_{kj} − W_{kj}) ||_k > 2^{−k} } < 2^{−k}.
Also, Q_k must be the law of the restriction of G_P to F_k. We can assume that the sequences {(V_{kj}, W_{kj})}_{j≥1} are independent of each other for different k. It can and will be assumed that n_0 := 1 < n_1 < n_2 < ⋯. For each j = 1, 2, 3, …, if n_k ≤ j < n_{k+1}, write k := k(j) and set V_j := V_{kj}, W_j := W_{kj}. Then {V_j}_{j≥1} and {W_j}_{j≥1} are sequences of independent random variables. (Note: the notations n_k, V_k, W_k are all different from those in the previous section.) Each sequence has its values in a countable product of Polish spaces, which is itself Polish (RAP, Theorem 2.5.7). Let {Y_j, j ≥ 1} be i.i.d. with law µ on U. Then W_j has the law of the restriction Y_j ↾ F_k of Y_j to F_k
for n_k ≤ j < n_{k+1}. Let T_{kj} be a copy of T_k and U_j of U for each j, and let T(j) := T_{k(j)}, F(j) := F_{k(j)}. Let T_∞ := Π_{j≥1} T(j) and U_∞ := Π_{j≥1} U_j. Then apply the Vorob'ev-Berkes-Philipp theorem (1.1.10) to P := L({(V_j, W_j)}_{j≥1}) and L({(Y_j ↾ F(j), Y_j)}_{j≥1}). So we can assume W_j = Y_j ↾ F(j) for all j ≥ 1. Now for ((x_j)_{j≥1}, t) ∈ Ω := A^∞ × [0, 1], let
V(ω) := { f ↦ f(x_j) − ∫ f dP, f ∈ F(j) }_{j≥1} ∈ T_∞.
Let Q := L({(V_j, W_j, Y_j)}_{j≥1}) on T_∞ × T_∞ × U_∞. By Lemma 3.7.3, V_j and Y_j can be taken to be defined on Ω with V(ω)_j = V_j for all j. It remains to prove (9.4.1) for the given Y_j, with X_j(f) := f(x_j) − ∫ f dP, f ∈ F. Given 0 < ε < 1, take k large enough so that 2^{5−k} < ε. We have X_j ↾ F(j) = V_j, j ≥ 1. Let M_k ≥ n_k be large enough so that for all n ≥ M_k,
(9.4.8)  Pr*{ n^{−1/2} Σ_{j=1}^{n(k)} ( ||A_k X_j|| + ||Y_j|| ) > 2^{−k} } < 2^{−k}.
Fix n ≥ M_k. Then there is a unique r such that n_r ≤ n < n_{r+1}, and
(9.4.9)  max_{m≤n} || Σ_{j≤m} (X_j − Y_j) || ≤ max_{m<n(k)} || Σ_{j≤m} (X_j − Y_j) || + Σ_{i=k}^{r−1} max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m (X_j − Y_j) || + max_{n(r)≤m≤n} || Σ_{n(r)≤j≤m} (X_j − Y_j) ||.
Then almost surely
L_k := max_{m<n(k)} || Σ_{j≤m} (X_j − Y_j) || ≤ max_{m<n(k)} || Σ_{j≤m} X_{kj} || + Σ_{j<n(k)} ( ||A_k X_j|| + ||Y_j|| ).
Thus by (9.4.6) and (9.4.8),
(9.4.10)  Pr*( L_k > 2^{3−k} n^{1/2} ) < 2^{3−k}.
Next, restriction to F_i is a linear isometry from A_i S to (T_i, || ||_i), and for any f ∈ F and n(i) ≤ j < n(i + 1),
(X_j − Y_j)(f) = X_j(f) − X_j(f_i) + X_j(f_i) − Y_j(f_i) + Y_j(f_i) − Y_j(f) = X_{ij}(f) + (V_j − W_j)(f_i) + (A_i Y_j − Y_j)(f),
so we have for k ≤ i ≤ r,
Δ_i := max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m (X_j − Y_j) || ≤ max_{n(i)≤m<n(i+1)} { || Σ_{j=n(i)}^m X_{ij} || + || Σ_{j=n(i)}^m (V_j − W_j) ||_i + || Σ_{j=n(i)}^m (Y_j − A_i Y_j) || }.
Since n ≥ n_r > 2N_i, we have by (9.4.6), and the fact that i.i.d. variables have a joint distribution preserved by any shift of indices (stationarity), that if k ≤ i ≤ r,
Pr*{ max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m X_{ij} || > 2^{2−i} n^{1/2} } < 2^{2−i}.
Likewise by (9.4.7), if k ≤ i ≤ r,
Pr{ n^{−1/2} max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m (V_j − W_j) ||_i > 2^{−i} } < 2^{−i}.
Next, (n_{i+1} − n_i)^{−1/2} Σ_{j=n(i)}^{n(i+1)−1} Y_j is equal in law to a version of G_P (or Y_1) with uniformly continuous sample functions. Also, for i ≤ r, n ≥ n_{i+1} − n_i. By Lévy's inequality (Lemma 9.1.9) we have
Pr{ n^{−1/2} max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m (Y_j − A_i Y_j) || > 2^{−i} } ≤ 2 Pr{ n^{−1/2} || Σ_{j=n(i)}^{n(i+1)−1} (Y_j − A_i Y_j) || > 2^{−i} } ≤ 2 Pr{ || Y_1 − A_i Y_1 || > 2^{−i} } < 2^{1−i}
by (9.4.4). Collecting terms, we have
Pr( Δ_i > 2^{3−i} n^{1/2} ) < 2^{3−i},  k ≤ i < r.
For i = r, replacing n_{i+1} by n + 1 throughout yields the same result. From this, (9.4.9), and (9.4.10), we obtain, since 2^{5−k} < ε,
Pr{ max_{m≤n} || Σ_{j≤m} (X_j − Y_j) || > n^{1/2} ε } ≤ 2^{3−k} + Σ_{i=k}^r 2^{3−i} < ε,
proving (9.4.1) as desired and so Theorem 9.4.2.
**9.5 Log log laws and speeds of convergence
As usual with double-starred sections at the ends of chapters, statements in this section are provided with literature references but are not proved here.
9.5(A) Log log laws and strong invariance principles
A class F of functions in L² of a probability space (X, A, P) will be said to satisfy the bounded law of the iterated logarithm, or bounded log log law, for the empirical measures P_n of P if and only if lim sup_{n→∞} ||ν_n||_F / (log log n)^{1/2} < ∞ almost surely. Also, F will be said to satisfy a strong invariance principle if it satisfies the definition of functional Donsker class except that instead of (9.4.1) we have
(9.5.1)  (n log log n)^{−1/2} max_{m≤n} || Σ_{j=1}^m (δ_{x_j} − P − Y_j) ||_F → 0
almost uniformly; in other words, the expression in (9.5.1) is dominated by a measurable function g_n for each n such that g_n → 0 almost surely. Neither (9.4.1) nor (9.5.1) necessarily implies the other. Even if F is both a functional Donsker class and satisfies a strong invariance principle for P, I don't know whether the Y_j can be chosen the same way in both (9.4.1) and (9.5.1). Let Ly := max(1, log y) as usual. Under a condition on the envelope function, we do have an implication (Dudley and Philipp, 1983, Theorem 1.2):
9.5.2 Theorem Let F be a Donsker class for P with an envelope function F(x) := ||δ_x||_F^*. If ∫ F(x)²/LLF(x) dP(x) < ∞, then F satisfies a strong invariance principle.
The proof of Theorem 9.5.2 is rather long and technical. It is based on corresponding results in separable Banach spaces due to Heinkel (1979) and Goodman, Kuelbs and Zinn (1981), depending in turn on Kuelbs and Zinn (1979). Substantial further work, given in Dudley and Philipp (1983, Sections 4 and 5), seemed to be needed for the extension to the nonseparable case. If the condition on the envelope function in Theorem 9.5.2 is strengthened a
little to ∫ F² dP < ∞, then the proof is substantially simpler. Here a proof of the log log law in the separable case is due to Pisier (1976). Kuelbs and Dudley (1980) proved a log log law for Donsker classes of sets under a measurability condition, as did Koltchinskii (1981) for uniformly bounded classes of functions satisfying some entropy conditions. If the norm || ||_F is separable on the linear span of all δ_x for x ∈ X, in other words, if we consider i.i.d. random variables with values in a separable Banach space, and if the central limit theorem (Donsker property) holds, then finiteness of the integral in Theorem 9.5.2 is not only sufficient but also necessary for the law of the iterated logarithm to hold (e.g., Ledoux and Talagrand, 1991, Theorems 8.6 and 10.12).
9.5.3 Corollary If F = {1_A : A ∈ C} for a class C of measurable sets, and if F is a Donsker class, then F also satisfies a strong invariance principle.
In connection with Theorem 9.5.2 one can ask what conditions on envelope functions are necessary, or sufficient along with other conditions, for the Donsker property. An image admissible Suslin, VC subgraph or major class of functions will be called a VCSM class or VCMM class, respectively. On a probability space, the Lorentz space L^{2,1} is defined as the space of real random variables X such that ∫_0^∞ P(|X| > t)^{1/2} dt < ∞. A random variable X will be called weakly L², or X ∈ WL², if and only if t² P(|X| > t) → 0 as t → +∞. Recall that an envelope function of a class F of functions is F_1(x) := ||δ_x||_F.
In the next theorem, one can take F(x) := sup_{f∈F} |f(x)| := F_1(x) to be measurable itself and thus "the" envelope function. The following equivalences, due to several authors, including Alexander (1987) and Giné and Zinn (1986, Chapter 1, Proposition 2.7), are collected in Dudley and Koltchinskii (1994, Theorem B):
9.5.4 Theorem Let (X, A, P) be any nonatomic probability space and F a nonnegative measurable function on X. Then the following are equivalent:
(a) P(F < ∞) > 0 and F is the envelope of some P-Donsker class of functions on (X, A, P);
(b) P(F < ∞) > 0 and F is the envelope of some VC subgraph P-Donsker class of functions on (X, A, P);
(c) P(F < ∞) > 0 and F is the envelope of some VC major P-Donsker class of functions on (X, A, P);
(d) F ∈ WL²(P).
Notably, while in finite-dimensional spaces the finiteness of second moments (of the norm) is necessary for the CLT (Theorem 9.2.2), in the infinite-dimensional case it is not, as Jain (1976) and Pisier (1976, Section 7) first showed, but only weak L² is necessary. On the other hand, for every VCMM class with envelope F to be P-Donsker, a condition stronger than L², namely F ∈ L^{2,1}, is needed. Dudley and Koltchinskii (1994, Theorem 3.1 and Proposition 4.2) show that:
9.5.5 Theorem If (X, A, P) is a probability space and F ∈ L^{2,1}(P), then every VCMM class F with envelope F is P-Donsker. Conversely, if (X, A, P) is, for example, [0, 1] with Lebesgue measure, and if F is measurable but not in L^{2,1}(P), then there exists a VCMM class F with envelope F which is not P-Donsker.
If F is a VCSM class (not necessarily Donsker), F satisfies the condition in Theorem 9.5.2, and sup{σ_P(f) : f ∈ F} < ∞, then F satisfies a bounded log log law. For this and some related facts see Alexander and Talagrand (1989).
9.5(B) Speeds of convergence
Komlós, Major and Tusnády (1975, 1976) found the best possible speed of convergence in the invariance principle for i.i.d. real-valued random variables X_1, X_2, …, having a finite moment-generating function E exp(tX_1) < ∞ for |t| ≤ t_0 for some t_0 > 0, namely, that then in (9.3.2), the variables X_i and Y_i can be chosen so that ||S_n − T_n|| = O(log n) a.s. Let P be the uniform (Lebesgue) law on the unit cube I^d in R^d. Let
X(d) := { [0, x_1] × [0, x_2] × ⋯ × [0, x_d] : 0 ≤ x_1, x_2, …, x_d ≤ 1 },
a collection of subsets of I^d, and let B(d) be the collection of all intersections with I^d of d-dimensional closed balls {y : |x − y| ≤ r} for r ≤ 1 and x ∈ R^d. Komlós, Major and Tusnády (1975, 1976) for d = 1 (see also Bretagnolle and Massart, 1989), and Tusnády (1977) for d = 2, showed that on some probability space there exist G_P processes Y_n and empirical processes ν_n for each n such that
|| ν_n − Y_n ||_{X(d)} = O( (log n)^d / n^{1/2} )
almost surely as n → ∞. J. Beck (1985) showed that for some constants c_d depending only on the dimension d, for any possible choices of ν_n and G_P,
Pr( || ν_n − G_P ||_{B(d)} > c_d n^{−1/(2d)} ) > 1 − e^{−n}.
Thus the convergence in the central limit theorem uniformly over balls is relatively slow for d ≥ 2.
Massart (1989) considered the uniform law P = U(I^d) on the unit cube I^d in R^d and "UM" (uniform Minkowski) classes C of sets in R^d for which there is a constant K < ∞ such that for 0 < ε ≤ 1, sup_{A∈C} λ_d((∂A)^ε) ≤ Kε, where ∂A := Ā \ int A is the boundary of A and C^ε := {y : d(x, y) < ε for some x ∈ C}. Massart shows that if a VC class C of subsets of R^d is UM and image admissible Suslin, then there exist G_P processes B_n such that as n → ∞, ||ν_n − B_n||_C = O((Ln)^{3/2} n^{−1/(2d)}) a.s., where Ln := max(1, log n). Evidently the uniform Minkowski condition is important here, since the VC index S(C) doesn't appear in the bound except perhaps for the constant in the O. For the class of balls, Massart's upper bound is (Ln)^{3/2} times Beck's lower bound, so the exponent −1/(2d) on n is sharp. (Note: Massart uses a definition of "strong invariance principle" quite different from the one given early in this section.)
If C is a UM class of sets in R^d such that for some M < ∞ and 0 < ξ < 1 we have N_I(ε, C, P) ≤ exp(Mε^{−ξ}) for 0 < ε ≤ 1, Massart (1989) shows that, still for P = U(I^d), there exist G_P processes Y_n with ||ν_n − Y_n||_C = O((Ln) n^{−(1−ξ)/(2d)}) a.s. In particular, for d ≥ 2 and the class I(d, α, K) of sets in R^d with boundaries differentiable through order α (Section 8.2), one has ξ = (d − 1)/α and so, if α > d − 1, there exist G_P processes Y_n such that
||ν_n − Y_n||_{I(d,α,K)} = O((Ln) n^{−(α−d+1)/(2αd)})
a.s., improving on earlier results of Révész (1976). For the class C(2) := C_2 of convex sets in R², Massart's result implies that for some G_P processes Y_n, ||ν_n − Y_n||_{C(2)} = O((Ln) n^{−1/8}) a.s. Koltchinskii (1994) gives some general results on speed of convergence for empirical processes, implying some of the above particular facts, and surveys some other work done in the meantime, for example, Rio (1993a,b; 1994).
Problems
1. For i.i.d. real-valued bounded random variables with mean 0 and variance σ², compare the bounds for P(|S_n| > K) given by Lemma 9.1.1 and by Bernstein's inequality (1.3.2). Hint: Note that E|S_n| ≤ σ n^{1/2}.
2. Let X_1, X_2, … be real-valued i.i.d. random variables with mean 0 and variance 1. Show that for any t > 0, as n → ∞, P(max_{k≤n} S_k > tn^{1/2}) → 2(1 − Φ(t)), where Φ is the standard normal distribution function.
3. Let X_1, X_2, … be i.i.d. random variables with values in R^d, with mean 0 and identity covariance matrix, and let T_n be the random function equal to S_k/n^{1/2} at t = k/n for k = 0, 1, …, n and linear on each interval [(k − 1)/n, k/n] for k = 1, …, n. Let C([0, 1], d) be the space of continuous functions from [0, 1] into R^d, with the norm ||f|| := sup_{0≤t≤1} |f(t)|. Show that the laws of T_n converge on C([0, 1], d) to the law of x_t := (x_{t1}, …, x_{td}), 0 ≤ t ≤ 1, where the x_{tj} are i.i.d. Brownian motions, j = 1, …, d.
4. Let F(x) := x^α (1 + |log x|)^β for 0 < x < 1. Let P be the uniform U[0, 1] distribution. For what real values of α, β is it true that
(a) F is the envelope function of some P-Donsker class on (0, 1);
(b) every VCSM class (defined in Section 9.5) with envelope ≤ F is P-Donsker;
(c) every P-Donsker class with envelope F satisfies a strong invariance principle;
(d) the class {tF : −1 ≤ t ≤ 1} is a P-Donsker class. Does (a) imply (d)?
5. Answer the same questions as in the previous problem but now on 1 ≤ x < ∞, where P is the Pareto law with density 1/x².
6. Let X_1, X_2, … be i.i.d. real random variables and h a function from R to R, not assumed measurable. Suppose that n^{−1} Σ_{j=1}^n h(X_j) converges to 0 in outer probability as n → ∞. Prove or disprove: h(X_1) = g(X_1) a.s. for some Borel measurable function g. (Compare Section 9.2.)
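The limit in Problem 2 can be explored by simulation. The sketch below (with arbitrary choices of n, trial count, and tolerance) compares the empirical frequency of max_{k≤n} S_k ≥ t n^{1/2} with the reflection-principle limit 2(1 − Φ(t)); agreement is only approximate at finite n because the walk is monitored at discrete times.

```python
import math
import random

random.seed(0)
n, trials, t = 400, 4000, 1.0
hits = 0
for _ in range(trials):
    s = m = 0.0
    for _ in range(n):
        s += random.gauss(0.0, 1.0)     # mean 0, variance 1 increments
        m = max(m, s)
    if m >= t * math.sqrt(n):
        hits += 1
# Phi via the error function: Phi(t) = (1 + erf(t/sqrt(2)))/2
limit = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
assert abs(hits / trials - limit) < 0.06
```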
Notes
Notes to Section 9.1. All the lemmas in the section first appeared in their current forms in Dudley and Philipp (1983). Lemma 9.1.1 extends Lemma 2.1 of Kuelbs (1977). The proof of Lemma 9.1.9 is adapted from Kahane (1985, Section 1.3, Lemma 1). The proof of Lemma 9.1.10 in the measurable case appears in Hoffmann-Jørgensen (1974, p. 164); see also Jain and Marcus (1975, Lemma 3.4). Kahane (1968, Section 1.5) gave a precursor of Lemma 9.1.10.
Notes to Section 9.2. Theorem 9.2.2 appears in Gnedenko and Kolmogorov (1949, Section 35, Theorem 4). The other results of this section are essentially from Dudley and Philipp (1983, Section 8).
Notes to Section 9.3. Theorem 9.3.1, and most of the section, are based on a special case (α = 2) of Theorem 1 of Philipp (1980). The first invariance principles, when the X_i are real-valued, were in C[0, 1] as in Corollary 9.3.14. Kolmogorov (1931, 1933) proved Theorem 9.3.15 in case a(·) and b(·) have continuous derivatives. Kolmogorov did not prove convergence of laws µ_{n,P} → µ on C[0, 1], since convergence of laws on spaces such as C[0, 1] was not defined until the 1940s. Donsker (1951) first stated and proved Corollary 9.3.14. Kolmogorov's papers (1931, 1933) were overlooked to some degree in the 1950s. Then Borovkov and Korolyuk (1965) and Nagaev (1970) went further along the way Kolmogorov had started. Thanks to Lucien Le Cam for pointing out Kolmogorov's papers to me.
Theorem 9.3.1 for d = 1 emerged from proofs of Donsker's invariance principle in the books of Breiman (1968, Theorem 13.8, p. 279) and Freedman (1971, p. 83, (130)). Apparently Major (1976, p. 222) first formulated Theorem 9.3.1 for d = 1 explicitly.
Note to Section 9.4. The invariance principle for general empirical processes, Theorem 9.4.2, first appeared in Dudley and Philipp (1983).
References
Alexander, K. (1987). The central limit theorem for empirical processes on Vapnik-Červonenkis classes. Ann. Probab. 15, 178-203. Alexander, K., and Talagrand, M. (1989). The law of the iterated logarithm for empirical processes on Vapnik-Červonenkis classes. J. Multivariate Analysis 30, 155-166. Beck, József (1985). Lower bounds on the approximation of the multivariate empirical process. Z. Wahrscheinlichkeitsth. verw. Gebiete 70, 289-306. Borovkov, A. A., and Korolyuk, V. S. (1965). On the results of asymptotic analysis in problems with boundaries. Theory Probab. Appl. 10, 236-246 (English), 255-266 (Russian). Breiman, Leo (1968). Probability. Addison-Wesley, Reading, MA.
Bretagnolle, J., and Massart, P. (1989). Hungarian constructions from the nonasymptotic viewpoint. Ann. Probab. 17, 239-256. Donsker, Monroe D. (1951). An invariance principle for certain probability limit theorems. Memoirs Amer. Math. Soc. 6. Dudley, R. M., and Koltchinskii, V. I. (1994). Envelope moment conditions and Donsker classes. Theor. Probab. Math. Statist. (Kiev) 51, 39-48. Dudley, R. M., and Philipp, Walter (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Z. Wahrscheinlichkeitsth. verw. Gebiete 62, 509-552. Freedman, David (1971). Brownian Motion and Diffusion. Holden-Day, San Francisco. Giné, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces, Proc. Conf. Zaragoza, 1985. Lecture Notes in Math. (Springer) 1221, pp. 50-113. Gnedenko, B. V., and Kolmogorov, A. N. (1949). Limit Theorems for Sums of Independent Random Variables. Transl. and ed. by K. L. Chung, Addison-Wesley, Reading, MA, 1968.
Goodman, V., Kuelbs, J., and Zinn, J. (1981). Some results on the LIL in Banach space with applications to weighted empirical processes. Ann. Probab. 9, 713-752. Heinkel, B. (1979). Relation entre théorème central-limite et loi du logarithme itéré dans les espaces de Banach. Z. Wahrscheinlichkeitsth. verw. Gebiete 49, 211-220. Hoffmann-Jørgensen, Jørgen (1974). Sums of independent Banach space valued random elements. Studia Math. 52, 159-186. Jain, Naresh C. (1976). An example concerning CLT and LIL in Banach space.
Ann. Probab. 4, 690-694. Jain, Naresh C., and Marcus, Michael B. (1975). Integrability of infinite sums of independent vector-valued random elements. Trans. Amer. Math. Soc. 212, 1-36. Kahane, J.-P. (1985). Some Random Series of Functions, 2d ed. Cambridge University Press, Cambridge; 1st ed. D. C. Heath, Lexington, MA, 1968. Kolmogorov, A. N. (1931). Eine Verallgemeinerung des Laplace-Liapounoffschen Satzes. Izv. Akad. Nauk SSSR Otdel. Mat. Estest. Nauk (Ser. 7) no. 7, 959-962. Kolmogorov, A. N. (1933). Über die Grenzwertsätze der Wahrscheinlichkeitsrechnung. Izv. Akad. Nauk SSSR (Bull. Acad. Sci. URSS) (Ser. 7) no. 3,
363-372. Koltchinskii, V.I. (1981). On the law of the iterated logarithm in the Strassen form for empirical measures. Teor. Veroiatnost. i Mat. Statist. (Kiev)
1981, no. 25, 40-47 (Russian); Theory Probab. Math. Statist. no. 25, 43-49 (English). Koltchinskii, V. I. (1994). Komlós-Major-Tusnády approximation for the general empirical process and Haar expansions of classes of functions. J. Theoret. Probab. 7, 73-118. Komlós, J., Major, P., and Tusnády, G. (1975, 1976). An approximation of partial sums of independent RV's and the sample DF. I, II. Z. Wahrscheinlichkeitsth. verw. Gebiete 32, 111-131; 34, 33-58. Kuelbs, J. (1977). Kolmogorov's law of the iterated logarithm for Banach space valued random variables. Illinois J. Math. 21, 784-800.
Kuelbs, J., and Dudley, R. M. (1980). Log log laws for empirical measures. Ann. Probab. 8, 405-418. Kuelbs, J., and Zinn, J. (1979). Some stability results for vector valued random variables. Ann. Probab. 7, 75-84. Ledoux, M., and Talagrand, M. (1991). Probability in Banach Spaces. Springer, Berlin. Major, Peter (1976). Approximation of partial sums of i.i.d. r.v.s when the summands have only two moments. Z. Wahrscheinlichkeitsth. verw. Gebiete 35, 221-229. Massart, P. (1989). Strong approximations for multidimensional empirical and related processes, via KMT constructions. Ann. Probab. 17, 266-291. Nagaev, S. V. (1970). On the speed of convergence in a boundary problem I, II. Theor. Probab. Appl. 15, 163-186, 403-429 (English), 179-199, 419-441 (Russian). Philipp, Walter (1980). Weak and L^p-invariance principles for sums of B-valued random variables. Ann. Probab. 8, 68-82; Correction, ibid. 14 (1986), 1095-1101. Pisier, G. (1976). Le théorème de la limite centrale et la loi du logarithme itéré dans les espaces de Banach. Séminaire Maurey-Schwartz 1975-76, Exposés III, IV, École Polytechnique, Paris. Révész, P. (1976). On strong approximation of the multidimensional empirical process. Ann. Probab. 4, 729-743. Rio, Emmanuel (1993a,b). Strong approximation for set-indexed partial sum processes, via K.M.T. constructions, I, II. Ann. Probab. 21, 759-790, 1706-1727. Rio, Emmanuel (1994). Local invariance principles and their application to density estimation. Probab. Theory Related Fields 98, 21-45. Tusnády, G. (1977). A remark on the approximation of the sample DF in the multidimensional case. Period. Math. Hungar. 8, 53-55.
10
Universal and Uniform Central Limit Theorems
In this chapter we look at cases where a class F of measurable functions is a Donsker class for all probability laws on the underlying space (universal Donsker classes) and such classes where convergence in the central limit theorem holds uniformly in P (uniform Donsker classes).
10.1 Universal Donsker classes
Let X be a set and A a σ-algebra of subsets of X. Then a class F of measurable
functions on X will be called a universal Donsker class if it is a P-Donsker class for every probability measure P on (X, A). Recall that every universal Donsker class of sets is a Vapnik-Cervonenkis class (Theorem 6.4.1), and the converse holds under measurability conditions. For a real-valued function f let diam(f) := sup f - inf f. The following shows that a universal Donsker class is uniformly bounded up to additive constants:
10.1.1 Proposition If F is a universal Donsker class, then
sup_{f∈F} diam(f) < ∞.
Proof Suppose not. Then take x_n ∈ X, y_n ∈ X, and f_n ∈ F so that f_n(x_n) − f_n(y_n) > 2^n for n = 1, 2, …. Let
P := Σ_{n=1}^∞ (δ_{x_n} + δ_{y_n}) / 2^{n+1}.
Then P is a probability measure defined on all subsets of X and so on A. For P,
E(f_n − Ef_n)² ≥ 2^{−n−1} inf_{c∈R} { (f_n(x_n) − c)² + (f_n(y_n) − c)² }.
The infimum is attained when c = (f_n(x_n) + f_n(y_n))/2, so
E(f_n − Ef_n)² ≥ 2^{−n−2} (f_n(x_n) − f_n(y_n))² > 2^{n−2}
for all n = 1, 2, …. Thus F is unbounded in the ρ_P metric and hence not a Donsker class for P by Theorem 3.7.2, and not a universal Donsker class.
A function h on F will be said to ignore additive constants if h(f) = h(f + c) whenever f ∈ F, c is a constant, and f + c ∈ F. Recall (Section 3.1) that a G_P process on F is called coherent if each sample function is prelinear, bounded, and uniformly continuous on F with respect to ρ_P. Here F is P-pregaussian if and only if a coherent G_P process on it exists (Theorem 3.1.1). If Z is a coherent G_P process, then if f and f + c are both in F for a constant c, we have ρ_P(f, f + c) = 0, so Z(f) = Z(f + c) and Z ignores additive constants for all ω. Such a Z can be consistently extended to the set F + R of
all functions f + c, f ∈ F, c ∈ R, letting Z(f + c) := Z(f). Then Z is a coherent G_P process on F + R, so F + R is pregaussian. Next, suppose F is a Donsker class for P. Then each function P_n − P ignores additive constants on F and is well-defined on F + R. If ρ_P(f + c, g + d) < δ for some f, g ∈ F, constants c, d, and δ > 0, then ρ_P(f, g) < δ. Since the total boundedness for ρ_P and the asymptotic equicontinuity condition both extend directly from F to F + R, it follows by Theorem 3.7.2 that F + R is a Donsker class for P. Thus if F is a universal Donsker class, so is F + R.
For a given law P, any subset of a Donsker class for P is also a P-Donsker class, for example by asymptotic equicontinuity (Theorem 3.7.2), so any subset of a universal Donsker class is also a universal Donsker class. If c(·) is any real-valued function on F, then F is a universal Donsker class if and only if G := {f − c(f) : f ∈ F} is. Taking c(f) := inf f, we obtain from any class F of bounded functions a class G of nonnegative functions such that G is Donsker for a given P, or universal Donsker, if and only if F has the same property. If F is a universal Donsker class, then G is uniformly bounded by Proposition 10.1.1.
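The elementary minimization used in the proof of Proposition 10.1.1, namely that (a − c)² + (b − c)² is minimized over c at the midpoint c = (a + b)/2 with minimum value (a − b)²/2, can be checked directly; the numbers below are arbitrary.

```python
# minimize (a - c)^2 + (b - c)^2 over c: minimum (a - b)^2 / 2 at c = (a + b)/2
def q(a, b, c):
    return (a - c) ** 2 + (b - c) ** 2

a, b = 3.0, 11.0
c_star = (a + b) / 2.0
assert q(a, b, c_star) == (a - b) ** 2 / 2.0
# nearby values of c do no better
assert all(q(a, b, c_star + eps) >= q(a, b, c_star)
           for eps in (-1.0, -0.1, 0.1, 1.0))
```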
Since multiplication by a positive constant is easily seen to preserve the Donsker property, in finding conditions for a class to be universal Donsker it will be enough to consider classes F of functions f with 0 ≤ f ≤ 1.
The Vapnik-Červonenkis properties of classes of functions treated in Sections 4.7 and 4.8 (VC subgraph, VC major, VC hull) will all be shown to imply the universal Donsker property for uniformly bounded classes of functions. So the relations among these different VC properties are of interest here. Recall (Section 4.8) that D^{(p)}(ε, F, Q) is the largest m such that for some f_1, …, f_m ∈ F, ∫ |f_i − f_j|^p dQ > ε^p for all i ≠ j. Also, D^{(2)}(ε, F) is the supremum over all laws Q with finite support of D^{(2)}(ε, F, Q). The following is a continuation of Theorem 4.8.3.
10.1.2 Proposition There exist VC major (thus VC hull) classes which do not satisfy (4.8.4), thus are not VC subgraph classes.
Proof A VC major class is VC hull by Theorem 4.7.1(b). Let F be the set of all right-continuous nonincreasing functions f on R with 0 ≤ f ≤ 1. Then since the class C of open or closed half-lines ]−∞, x[ or ]−∞, x] is a VC class (with S(C) = 1), F is a VC major class. For any f, g in F and any law Q, (∫ |f − g|² dQ)^{1/2} ≥ ∫ |f − g| dQ. Thus D^{(2)}(ε, F, Q) ≥ D^{(1)}(ε, F, Q) for any ε > 0.
Let P be Lebesgue measure on [0, 1]. Then D^{(1)}(ε, F, P) = D(ε, LL_{2,1}, d_λ), where LL_{2,1} is the set of all lower layers (defined in Section 8.3) in the unit square I² in R², and d_λ is the Lebesgue measure of the symmetric difference of sets in I². For some c > 0, D(ε, LL_{2,1}, d_λ) ≥ e^{c/ε} as ε ↓ 0 by Theorem 8.3.1. For each ε > 0 small enough, by the law of large numbers, there is a law Q with finite support and D^{(1)}(ε, F, Q) ≥ e^{c/ε} − 1, so (4.8.4) fails and by Theorem 4.8.3, F is not a VC subgraph class.
Recall that if
(10.1.3)  ∫_0^1 (log D^{(2)}(ε, F))^{1/2} dε < ∞,
as in Theorem 6.3.1 for F ≡ 1, then F is said to satisfy Pollard's entropy condition.
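To see what Pollard's condition does and does not allow, one can compare the entropy integral for a growth rate log D^{(2)}(ε, F) of order ε^{−p}: with p = 1 the integral converges, while with p = 2 it diverges logarithmically near 0. The closed forms below are for the integral truncated to [a, 1]; the values of a are arbitrary.

```python
import math

def entropy_integral(p, a):
    # closed form of the integral over [a, 1] of (eps^{-p})^{1/2} d eps
    if p == 1:
        return 2.0 * (1.0 - math.sqrt(a))   # bounded by 2 as a -> 0
    if p == 2:
        return math.log(1.0 / a)            # -> infinity as a -> 0
    raise ValueError("only p = 1, 2 handled here")

for a in (1e-2, 1e-4, 1e-8):
    assert entropy_integral(1, a) < 2.0     # Pollard's condition holds
print(entropy_integral(2, 1e-8))            # grows like log(1/a)
assert entropy_integral(2, 1e-8) > entropy_integral(2, 1e-4)
```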
10.1.4 Theorem If F is a uniformly bounded, image admissible Suslin class of measurable functions and satisfies Pollard's entropy condition, then F is a universal Donsker class.
Proof F has a finite constant C as an envelope function. For a constant envelope, the hypotheses of Theorem 6.3.1 don't depend on the law P, so Theorem 10.1.4 is a corollary of Theorem 6.3.1. Here are some more details. Let F/C := {f/C : f ∈ F}, so that F/C has as an envelope the constant 1. It will be enough to show that F/C is a universal Donsker class. Then for δ > 0, D^{(2)}(δ, F/C) is defined as in Theorem 6.3.1. Make the substitution δ = ε/C and note that D^{(2)}(ε/C, F/C) = D^{(2)}(ε, F). It follows that F/C satisfies Pollard's entropy condition. So Theorem 6.3.1 applies, and F/C and F are universal Donsker classes.
10.1.5 Corollary A uniformly bounded, image admissible Suslin VC subgraph class is a universal Donsker class.
Proof This follows from Theorems 4.8.3 and 10.1.4.
Specializing further, the set of indicators of an image admissible Suslin VC class of sets is a universal Donsker class (Corollary 6.3.16 for F ≡ 1). For a class F of real-valued functions on a set X, recall from Section 4.7 the class H(F, M), which is M times the symmetric convex hull of F, and H_S(F, M), which is the closure of H(F, M) for sequential pointwise convergence. Note that for any uniformly bounded class F of measurable functions for a σ-algebra A and any law Q defined on A, H(F, M) is dense in H_S(F, M) for the L²(Q) distance (or any L^p(Q) distance, 1 ≤ p < ∞).
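The substitution δ = ε/C used in the proof of Theorem 10.1.4 reflects a general fact: packing counts are unchanged when both the class and the separation scale are divided by the same constant. A greedy packing count (a lower bound for D^{(2)}, computed here on a toy finite class under an empirical L² distance; all names are illustrative) exhibits the identity.

```python
import math

def dist(f, g):
    # L2(Q) distance, Q uniform on the coordinates
    n = len(f)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / n)

def greedy_packing(fs, eps):
    # greedy count of functions pairwise more than eps apart
    chosen = []
    for f in fs:
        if all(dist(f, g) > eps for g in chosen):
            chosen.append(f)
    return len(chosen)

F = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
C = 2.0
F_over_C = [tuple(v / C for v in f) for f in F]
eps = 0.5
# D(eps/C, F/C) = D(eps, F): scaling both class and separation cancels out
assert greedy_packing(F_over_C, eps / C) == greedy_packing(F, eps)
```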
10.1.6 Theorem If F is a Donsker class for a law P such that F has an envelope function in L²(P), or a universal Donsker class, then for any M < ∞, H_S(F, M) is a P-Donsker (resp. universal Donsker) class.
Proof First, for a given P, the limiting Gaussian process G_P can be viewed as an isonormal process, for the inner product
(f, g)_{0,P} := ∫ fg dP − ∫ f dP ∫ g dP.
So, by Corollary 2.5.9, F ∪ −F is P-pregaussian. (Or, by Theorem 3.8.1, F ∪ −F is P-Donsker, thus P-pregaussian.) Clearly, MF is an envelope function for H_S(F, M) and is in L²(P). Suppose h_k ∈ H_S(F, M) and h_k → h pointwise. Then for any empirical measure P_n, ∫ h_k dP_n → ∫ h dP_n as k → ∞. Also, ∫ h_k dP → ∫ h dP and ∫ (h_k − h)² dP → 0, both by dominated convergence. It follows that ρ_P(h_k, h) → 0. Then by Theorem 2.5.5, the set H_S(F, M) is pregaussian for P. If g is any prelinear function on H(F, M), then ||g||_{H(F,M)} = M ||g||_F. Next, ||g||_{H_S(F,M)} = M ||g||_F if g is any linear combination of P_n, P, and a sample function of a coherent G_P process. Then the Donsker property of H_S(F, M) follows from the invariance principle, Theorem 9.4.2, where the G_P processes are coherent.
Suppose F is a universal Donsker class. Let G := {f − inf f : f ∈ F}, as in the discussion before Proposition 10.1.2. Then functions in H(F, M)
Universal and Uniform Central Limit Theorems
differ from functions in H(G, M) by additive constants. If h_k ∈ H_s(F, M), h_k → h pointwise, and for all k, h_k = φ_k + c_k for some φ_k ∈ H_s(G, M) and constants c_k, then since the φ_k are uniformly bounded and h has finite values, the c_k are bounded. So, taking a subsequence, we can assume c_k converges to some c. Then the φ_k converge pointwise to some φ ∈ H_s(G, M) and h = φ + c. Thus all functions in H_s(F, M) differ by additive constants from functions in H_s(G, M) (the converse may not hold). Thus, if H_s(G, M) is a universal Donsker class, so is H_s(F, M). So we can assume F is uniformly bounded. Then it has an envelope function in L²(P) for all P, so by the first half of the proof, H_s(F, M) is a universal Donsker class. □
By Theorem 10.1.4, for any δ > 0, if log D^(2)(ε, F) = O(1/ε^{2−δ}) as ε ↓ 0, and if F satisfies a measurability condition (specifically, if F is image admissible Suslin), then F is a universal Donsker class. In the converse direction, we have:
10.1.7 Theorem For a uniformly bounded class F to be a universal Donsker class it is necessary that log D^(2)(ε, F) = O(ε^{−2}) as ε ↓ 0.
Proof Suppose not. Then there are a universal Donsker class F and ε_k ↓ 0 such that log D^(2)(ε_k, F) > k³/ε_k² for k = 1, 2, …, so there are probability laws P_k with finite support for which log D^(2)(ε_k, F, P_k) > k³/ε_k² for k = 2, 3, …. Let P be a law with P ≥ Σ_{k≥2} P_k/k². Then for any measurable f and g,

(∫ (f − g)² dP)^{1/2} ≥ (∫ (f − g)² dP_k)^{1/2} / k.

Let δ_k := ε_k/k. Then

log D^(2)(δ_k, F, P) ≥ log D^(2)(ε_k, F, P_k) > k³/ε_k² = k/δ_k².

So any isonormal process L on L²(P) is a.s. unbounded on F by Theorem 2.3.5.
We can write L(f) = G_P(f) + G·∫ f dP, where G is a standard normal variable independent of G_P. Since F is uniformly bounded, G_P is a.s. unbounded on (a countable ρ_P-dense subset of) F, so F is not P-pregaussian and so not a P-Donsker class. □

Theorem 10.1.7 is optimal, as the following shows:
10.1.8 Proposition There exists a universal Donsker class E such that lim inf_{δ↓0} δ² log D^(2)(δ, E) > 0.
10.1 Universal Donsker classes
Proof Let A_j := A(j) be disjoint, nonempty measurable sets for j = 1, 2, …. Let ‖·‖₂ be the ℓ² norm, ‖x‖₂ = (Σ_j x_j²)^{1/2} for x = {x_j}_{j≥1}. Let

E := {Σ_j x_j 1_{A(j)} : ‖x‖₂ ≤ 1}.

(So, E is an "ellipsoid" with center 0 and semiaxes 1_{A(j)}.) Let P be any probability measure defined on the A_j and let p_j := P(A_j) for j = 1, 2, ….
We can assume that p_j > 0 for all j, since if B is the union of all A_j such that p_j = 0, then P(B) = 0, G_P(B) = 0 a.s., and ν_n(B) = 0 a.s. for all n. Let ε > 0. For any k and n, let ‖ν_n‖_{2,k} := (Σ_{j≥k} ν_n(A_j)²)^{1/2}. Then for all n,

E‖ν_n‖²_{2,k} ≤ Σ_{j≥k} p_j → 0 as k → ∞.

Take k = k(ε) large enough so that Σ_{j≥k} p_j < ε³/18. Then

(10.1.9) Pr{‖ν_n‖_{2,k} > ε/3} < ε/2.
If ‖ν_n‖_{2,k} ≤ ε/3, ‖x‖₂ ≤ 1, and ‖y‖₂ ≤ 1, then by the Cauchy(–Schwarz) inequality,

(10.1.10) |ν_n(Σ_{j≥k} (x_j − y_j)1_{A(j)})| ≤ 2ε/3.
Also, E‖ν_n‖_{2,1} ≤ (E‖ν_n‖²_{2,1})^{1/2} ≤ 1, so

(10.1.11) Pr{‖ν_n‖_{2,1} > 2/ε} < ε/2.
Let δ := (min_{j<k} p_j)^{1/2} ε²/6, and for ‖x‖₂ ≤ 1 let

f_x := Σ_{j=1}^∞ x_j 1_{A(j)}.

If f_x and f_y ∈ E and e_P(f_x, f_y) := (∫ (f_x − f_y)² dP)^{1/2} < δ, then

(10.1.12) (Σ_{j<k} (x_j − y_j)²)^{1/2} < ε²/6.

By (10.1.11) and Cauchy's inequality, |ν_n(Σ_{j<k} (x_j − y_j)1_{A(j)})| ≤ ε/3 outside an event of probability less than ε/2.
Thus by (10.1.9) and (10.1.10), there is an event F with Pr(F) < ε such that if ω ∉ F then for all x and y with e_P(f_x, f_y) < δ,

(10.1.13) |ν_n(f_x − f_y)| < ε.
Given x with Σ_j x_j² ≤ 1, let y_j = x_j for j < k and y_j = 0 for j ≥ k. Then (∫ (f_x − f_y)² dP)^{1/2} < ε/4. Since x ↦ f_x is continuous from ℓ² into L²(P), the set of all f_x such that Σ_j x_j² ≤ 1 and x_j = 0 for j ≥ k is compact in L²(P). It follows that E is totally bounded in L²(P). Thus by (10.1.13) and Theorem 3.7.2 for e_P, E is a Donsker class for P. Since P was arbitrary, E is a universal Donsker class.
Next, given 0 < δ < 1/2, let m := [1/(4δ²)], where [x] is the largest integer ≤ x. Let P be a probability measure with P(A_j) = 1/m for j = 1, …, m. Then in L²(P), E is an m-dimensional ball with radius m^{−1/2}. Let g₁, …, g_r be a maximal set of elements of E more than δ apart in L²(P), so r = D^(2)(δ, E, P), and let B_i := {f_x : e_P(f_x, g_i) ≤ 2δ}. Then

∪_{i=1}^r B_i ⊇ {f_x : (Σ_{j=1}^m x_j²/m)^{1/2} ≤ m^{−1/2} + δ, x_j = 0 for j > m}.

Thus by comparing volumes of balls in ℝ^m, we get by the choice of m

D^(2)(δ, E, P) = r ≥ (m^{−1/2} + δ)^m/(2δ)^m = (½(1 + 1/(δm^{1/2})))^m ≥ (3/2)^m,

since δm^{1/2} ≤ 1/2. Hence δ² log D^(2)(δ, E) ≥ δ² m log(3/2), which stays bounded away from 0 as δ ↓ 0 since m = [1/(4δ²)]. Letting δ ↓ 0, Proposition 10.1.8 follows. □
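The volume-comparison step above is easy to check numerically. The following sketch (a hedged illustration, not from the book: it simply evaluates the packing lower bound r ≥ ((m^{−1/2} + δ)/(2δ))^m with m = [1/(4δ²)]) confirms that δ² log D^(2)(δ, E) stays bounded away from 0:

```python
import math

# Packing lower bound from the proof of Proposition 10.1.8:
# in L^2(P), E is an m-dimensional ball of radius m^(-1/2) with
# m = [1/(4 delta^2)], and comparing volumes gives
#   D^(2)(delta, E, P) >= ((m^(-1/2) + delta) / (2 delta))^m.
for delta in [0.2, 0.1, 0.05, 0.02, 0.01]:
    m = int(1 / (4 * delta ** 2))
    log_lower = m * math.log((m ** -0.5 + delta) / (2 * delta))
    print(delta, delta ** 2 * log_lower)  # roughly log(3/2)/4 ~ 0.10 throughout
```

The printed products hover near log(3/2)/4 ≈ 0.10 as δ shrinks, matching the lim inf claim.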
10.1.14 Proposition There is a uniformly bounded class F of measurable functions, which is not a universal Donsker class, such that log D^(2)(ε, F) = O(1/(ε² log(1/ε))) as ε ↓ 0.

Proof Let B_j := B(j) be disjoint nonempty measurable sets. Recall that Lx := max(1, log x). Let a_j := 1/(jLj)^{1/2}, j ≥ 1, and

F := {Σ_{j=1}^∞ x_j 1_{B(j)} : x_j = ±a_j for all j}.

Take c such that Σ_{j≥1} p_j = 1, where p_j := c(a_j/LLj)². Here Σ_j p_j < ∞ by
the integral test, since (d/dx)(1/LLx) = −1/(xLx(LLx)²) for x > e^e. Take a probability measure P with P(B_j) = p_j for all j. Let
α := (1 − p₁)^{1/2}/2 > 0. Then

E‖G_P‖_F = Σ_{j=1}^∞ a_j E|G_P(B_j)| = Σ_{j=1}^∞ a_j (2/π)^{1/2}(p_j(1 − p_j))^{1/2} ≥ α Σ_{j=1}^∞ a_j p_j^{1/2} = αc^{1/2} Σ_{j=1}^∞ 1/(jLjLLj) = +∞
by the integral test, since (d/dx)(LLLx) = 1/(xLxLLx) for x large enough. If F were P-pregaussian, then since G_P can be treated as an isonormal process (see Section 3.1), by Theorem 2.5.5 ((a) if and only if (b)) and the material just before it, G_P could be realized on a separable Banach space with norm ‖·‖_F. Then the norm would have finite expectation by the Landau–Shepp–Marcus–Fernique theorem (2.2.2), a contradiction. So F is not pregaussian for P and so not a universal Donsker class. For any probability measure Q and r = 1, 2, …,
if f = Σ_j x_j 1_{B(j)} ∈ F and g = Σ_j y_j 1_{B(j)} ∈ F with x_j = y_j for 1 ≤ j < r, then

(10.1.15) e_Q(f, g) = (∫ (Σ_{j≥r} (x_j − y_j)1_{B(j)})² dQ)^{1/2} ≤ 2a_r (Σ_{j≥r} Q(B_j))^{1/2} ≤ 2a_r.
Given ε > 0, let r := r(ε) be the smallest integer r ≥ 1 such that 2a_r ≤ ε. By (10.1.15), if x_j = y_j for 1 ≤ j < r, then e_Q(f, g) ≤ ε. Thus, since there are only 2^{r−1} possibilities for x_j = ±a_j, j < r, we have D^(2)(ε, F, Q) ≤ 2^{r−1}, and since r doesn't depend on Q, D^(2)(ε, F) ≤ 2^{r−1} and log D^(2)(ε, F) ≤ r log 2. As ε ↓ 0, we have 2a_{r(ε)} ~ ε, so

log(1/ε) ~ log(2/ε) ~ log(1/a_{r(ε)}) ~ ½ log(r(ε)),

and ε/2 ~ 1/(r(ε)·2 log(1/ε))^{1/2}, so r(ε) ~ 2/(ε² log(1/ε)). Since log D^(2)(ε, F) ≤ r(ε) log 2 ≤ r(ε), the conclusion follows. □
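The covering count r(ε) in this proof can be evaluated directly; the following hedged sketch (illustrative only, with the same a_j = 1/(jLj)^{1/2} as above) checks the asymptotic relation r(ε) ~ 2/(ε² log(1/ε)):

```python
import math

def L(x):
    # Lx := max(1, log x), as in the text
    return max(1.0, math.log(x))

def a(j):
    return 1.0 / math.sqrt(j * L(j))

def r_of(eps):
    # smallest integer r >= 1 with 2 a_r <= eps, so log D^(2)(eps, F) <= r log 2
    r = 1
    while 2 * a(r) > eps:
        r += 1
    return r

for eps in [0.2, 0.1, 0.05, 0.02]:
    r = r_of(eps)
    # the ratio r(eps) * eps^2 * log(1/eps) tends (slowly) to 2
    print(eps, r, r * eps ** 2 * math.log(1 / eps))
```

The printed ratios drift toward 2 very slowly, reflecting the iterated logarithms in p_j.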
Theorems 10.1.4 and 10.1.7 show that Pollard's entropy condition (10.1.3) comes close to characterizing the universal Donsker property, but Propositions 10.1.8 and 10.1.14 show that there is no characterization of the universal Donsker property in terms of D^(2).
10.2 Metric entropy of convex hulls in Hilbert space

Let H be a real Hilbert space and for any subset B of H let co(B) be its convex hull,

co(B) := {Σ_{j=1}^k t_j x_j : t_j ≥ 0, Σ_{j=1}^k t_j = 1, x_j ∈ B, k = 1, 2, …}.

Recall that D(ε, B) is the maximum number of points in B more than ε apart.
10.2.1 Theorem Suppose that B is an infinite subset of a Hilbert space H, ‖x‖ ≤ 1 for all x ∈ B, and for some K < ∞ and 0 < γ < ∞, we have D(ε, B) ≤ Kε^{−γ} for 0 < ε ≤ 1. Let s := 2γ/(2 + γ). Then for any t > s, there is a constant C, which depends only on K, γ, and t, such that

D(ε, co(B)) ≤ exp(Cε^{−t}) for 0 < ε ≤ 1.

Note. Carl (1997) gives the sharper bound with t = s.
Proof We may assume K ≥ 1. Choose any x₁ ∈ B. Given

B(n) := {x₁, …, x_{n−1}}, n ≥ 2,

let d(x, B(n)) := min_{y∈B(n)} ‖x − y‖ and δ_n := sup_{x∈B} d(x, B(n)). Since B is infinite, δ_n > 0 for all n. Choose x_n ∈ B with d(x_n, B(n)) > δ_n/2. Then for all n, K(2/δ_n)^γ ≥ D(δ_n/2, B) ≥ n, so δ_n ≤ Mn^{−1/γ} for all n, where M := 2K^{1/γ}.

Let 0 < ε < 1. Let N := N(ε) be the next integer larger than (4M/ε)^γ. Then δ_N ≤ ε/4. Let G := B(N). For each x ∈ B there is an i = i(x) ≤ N − 1 with ‖x − x_i‖ ≤ δ_N. For any convex combination z = Σ_{x∈B} η_x x, where η_x ≥ 0 and Σ_{x∈B} η_x = 1, with η_x = 0 except for finitely many x, let z_N := Σ_{x∈B} η_x x_{i(x)}. Then ‖z − z_N‖ ≤ δ_N ≤ ε/4, so

(10.2.2) D(ε, co(B)) ≤ D(ε/2, co(G)).
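The greedy choice of the points x_n (each new point nearly farthest from those already chosen) is a standard farthest-point construction. A hedged sketch on a finite random point set (illustrative data only; it uses the exact maximizer instead of the δ_n/2 relaxation in the proof):

```python
import math
import random

def farthest_point_sequence(points, steps):
    """Greedy construction as in the proof of Theorem 10.2.1: having chosen
    B(n) = {x_1, ..., x_{n-1}}, record delta_n = sup_x d(x, B(n)) and add
    a point attaining it."""
    chosen = [points[0]]
    deltas = []
    for _ in range(steps):
        best_p, best_d = max(
            ((p, min(math.dist(p, c) for c in chosen)) for p in points),
            key=lambda pair: pair[1])
        deltas.append(best_d)
        chosen.append(best_p)
    return chosen, deltas

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(400)]
_, deltas = farthest_point_sequence(pts, 40)
# the covering radii delta_n are nonincreasing, as the nested B(n) force
assert all(d1 >= d2 for d1, d2 in zip(deltas, deltas[1:]))
print(deltas[0], deltas[-1])
```

For a set B satisfying D(ε, B) ≤ Kε^{−γ}, the same argument as in the proof shows these radii decay like n^{−1/γ}.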
To bound D(ε/2, co(G)), let m := m(ε) be the largest integer ≤ ε^{−s}. Note that γ > s. Then for each i with m < i ≤ N, there is a j ≤ m such that

(10.2.3) ‖x_i − x_j‖ ≤ δ_{m+1} ≤ Mε^{s/γ}.

Let j(i) be the least such j. Let Λ_m := {{λ_j}_{1≤j≤m} : λ_j ≥ 0, Σ_{j=1}^m λ_j = 1}. On ℝ^m we have the ℓ^p metrics

ρ_p({x_j}, {y_j}) := (Σ_{j=1}^m |x_j − y_j|^p)^{1/p}.

By Cauchy's inequality, ρ₁ ≤ m^{1/2}ρ₂. Let β := ε/6 and δ := β/(2m^{1/2}). The δ-neighborhood of Λ_m for ρ₂ is included in a ball of radius 1 + δ ≤ 13/12. We have D(2δ, Λ_m, ρ₂) centers of disjoint balls of radius δ included in the neighborhood. Comparing volumes of balls and recalling that Lx := max(1, log x) gives

(10.2.4) D(β, Λ_m, ρ₁) ≤ D(2δ, Λ_m, ρ₂) ≤ 13^m m^{m/2} ε^{−m}
  ≤ exp(m{L(1/ε) + (Lm)/2 + log 13}) ≤ exp(C₃ε^{−s}L(1/ε)) ≤ exp(C₄ε^{−t}), 0 < ε < 1,

for some constants C₃, C₄.
For each j = 1, …, m, let A_j := A(j) consist of x_j and the set of all x_i, i = m + 1, …, N, such that j(i) = j. Then G = ∪_{j=1}^m A_j and the A_j are disjoint. Take a maximal set S = S(ε) ⊂ Λ_m with ρ₁(u, v) > β for any u ≠ v in S. For a given λ = (λ₁, …, λ_m) ∈ S, let

F_λ := {x ∈ co(G) : x = Σ_{y∈G} μ_{y,x} y, where μ_{y,x} ≥ 0 and Σ_{y∈A(j)} μ_{y,x} = λ_j for all j}.
For any x ∈ co(G), let x = Σ_{y∈G} μ_{y,x} y where μ_{y,x} ≥ 0, and for τ_j(x) := Σ_{y∈A(j)} μ_{y,x}, let τ(x) := {τ_j(x)}_{j=1}^m ∈ Λ_m. Take λ(x) ∈ S such that Σ_j |τ_j(x) − λ_j(x)| ≤ β. For y ∈ A(j), define ν_{y,x} := λ_j(x)μ_{y,x}/τ_j(x) if τ_j(x) > 0. If τ_j(x) = 0, choose a w ∈ A(j) and define ν_{w,x} := λ_j(x) and ν_{y,x} := 0 for y ≠ w, y ∈ A(j). If τ_j(x) > 0 then

Σ_{y∈A(j)} |μ_{y,x} − ν_{y,x}| = Σ_{y∈A(j)} μ_{y,x} |1 − λ_j(x)/τ_j(x)| = |τ_j(x) − λ_j(x)|.

If τ_j(x) = 0 then Σ_{y∈A(j)} |μ_{y,x} − ν_{y,x}| = ν_{w,x} = λ_j(x) = |τ_j(x) − λ_j(x)|. It follows that Σ_{y∈G} |μ_{y,x} − ν_{y,x}| = Σ_j |τ_j(x) − λ_j(x)| ≤ β. Set ℓ_x := Σ_{y∈G} ν_{y,x} y. Then ℓ_x ∈ F_{λ(x)}, λ(x) ∈ S, and ‖ℓ_x − x‖ ≤ β. Thus by (10.2.2) and (10.2.4),

(10.2.5) D(ε, co(B)) ≤ D(ε/2, co(G)) ≤ D(ε/6, ∪_{λ∈S} F_λ) ≤ card(S)·max_λ D(ε/6, F_λ) ≤ exp(C₄ε^{−t})·max_λ D(ε/6, F_λ).
To bound the latter factor, let λ ∈ S. We may assume λ_j > 0 for all j. For any j = 1, …, m and x ∈ F_λ, let x(j) := Σ_{y∈A(j)} μ_{y,x} y. Let Y_j be a random variable with values in A(j) and P(Y_j = y) = μ_{y,x}/λ_j for each y ∈ A(j). Then EY_j = x(j)/λ_j =: z_j. Take Y₁, …, Y_m to be independent and let Y := Σ_{j=1}^m λ_j Y_j. Then EY = x and

E‖Y − x‖² = E‖Σ_{j=1}^m λ_j(Y_j − z_j)‖² = Σ_{j=1}^m λ_j² E‖Y_j − z_j‖²,
since the Y_j − z_j are independent and have mean 0, and H is a Hilbert space. Now the diameter of A(j) is at most 2Mε^{s/γ} by (10.2.3), and z_j is a convex combination of elements of A(j). Thus

E‖Y_j − z_j‖² = λ_j^{−1} Σ_{y∈A(j)} μ_{y,x} ‖y − z_j‖² ≤ 4M²ε^{2s/γ},

and for any set F ⊂ {1, 2, …, m},

E‖Σ_{j∈F} λ_j(Y_j − z_j)‖² ≤ 4M²(max_{j∈F} λ_j) ε^{2s/γ}.
Next, an idea of B. Maurey will be applied. For k = 1, 2, …, let Y_{j1}, Y_{j2}, …, Y_{jk} be independent with the distribution of Y_j, and with the Y_{ji} also independent for different j. Then

E‖Σ_{j∈F} λ_j(k^{−1} Σ_{i=1}^k Y_{ji} − z_j)‖² ≤ 4M²(max_{j∈F} λ_j) ε^{2s/γ}/k,

so

E‖Σ_{j∈F} λ_j(k^{−1} Σ_{i=1}^k Y_{ji} − z_j)‖ ≤ 2M(max_{j∈F} λ_j)^{1/2} ε^{s/γ}/k^{1/2}.

Thus, there exist y_{ji} ∈ A(j), i = 1, …, k, j ∈ F, such that

(10.2.6) ‖Σ_{j∈F} λ_j(k^{−1} Σ_{i=1}^k y_{ji} − z_j)‖ ≤ 2M(max_{j∈F} λ_j)^{1/2} ε^{s/γ}/k^{1/2}.
Take v > 0 such that s + v < t. Let F(0) := {j ≤ m : λ_j > ε^v}. Let k(0) be the smallest integer k such that

k ≥ 6400M²ε^{−2+2s/γ} = 6400M²ε^{−s}.

For k ≥ k(0) and F = F(0) the expressions in (10.2.6) are at most ε/40. Let r be the smallest positive integer such that ε^v/4^r ≤ (ε^{1−s/γ}/(80M))². For u = 1, …, r let

F(u) := {j ≤ m : ε^v/4^u < λ_j ≤ ε^v/4^{u−1}},

and let k = k(u) be the smallest integer k such that 2^{2−u}Mε^{v/2}ε^{s/γ}/k^{1/2} ≤ ε/(40r), that is, k ≥ 100M²4^{4−u}r²ε^{v−2+2s/γ}. Thus, for some constant C₅, k(u) ≤ 1 + C₅4^{−u}ε^{v−s}(L(1/ε))². The y_{ji} for F = F(u) will be called y_{ji}^{(u)} (they also depend on x).
Let F(r+1) := {j ≤ m : λ_j ≤ ε^v/4^r}. Let k(r+1) := 1. For F = F(r+1) and k = k(r+1), (10.2.6) is bounded above by ε/40. We have y_{j1}^{(r+1)}, a single choice for each j. Let

ℓ_x := Σ_{u=0}^{r+1} Σ_{j∈F(u)} λ_j k(u)^{−1} Σ_{i=1}^{k(u)} y_{ji}^{(u)}.
Then by (10.2.6) and the results for u = 0 and u = r + 1,

‖ℓ_x − x‖ = ‖Σ_{u=0}^{r+1} Σ_{j∈F(u)} λ_j(k(u)^{−1} Σ_{i=1}^{k(u)} y_{ji}^{(u)} − z_j)‖
  ≤ ε/40 + Σ_{u=1}^r 2M(ε^v/4^{u−1})^{1/2} ε^{s/γ}/k(u)^{1/2} + ε/40
  ≤ ε/40 + r·(ε/(40r)) + ε/40 = 3ε/40 < ε/12.
Here ℓ_x is determined uniquely by the k(u)-tuples (y_{j1}^{(u)}, …, y_{jk(u)}^{(u)}), j ∈ F(u), u = 0, 1, …, r + 1. Each A(j) has at most N elements, so that for given u ≤ r and j ≤ m, there are at most N^{k(u)} ways of choosing the y_{ji}^{(u)}. Now card(F(u)) ≤ 4^u ε^{−v}, so the number of ways to choose the y_{ji}^{(u)} for given u with 1 ≤ u ≤ r is at most exp{(log N)(4^u ε^{−v} + C₅ε^{−s−v}L(1/ε)²)}. There are at most exp((log N)[ε^{−s−v}6400M² + ε^{−v}]) ways to choose the y^{(0)}, and at most N^m ways to choose the y^{(r+1)}, with m ≤ ε^{−s}. Thus, the total number of ways to choose all the y_{ji}^{(u)} gives

D(ε/6, F_λ) ≤ exp(C₆{ε^{−s−v}L(1/ε)⁴ + L(1/ε)4^r/ε^v})

for some C₆ := C₆(K). By definition of r, 4^r/ε^v ≤ C₇ε^{2s/γ−2} = C₇ε^{−s} for some C₇ := C₇(K). Thus, D(ε/6, F_λ) ≤ exp(C₈ε^{−t}) for some C₈. Then by (10.2.5), Theorem 10.2.1 is proved. □
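Maurey's empirical-measure argument used in the proof has a direct numerical counterpart: a convex combination z = Σ_y μ_y y of unit vectors in a Hilbert space can be approximated within about k^{−1/2} by an average of k i.i.d. draws from μ. A hedged sketch with arbitrary test data (dimensions, weights, and seed are illustrative only):

```python
import math
import random

random.seed(1)
dim, npts, k = 20, 100, 400

# random unit vectors y_i and convex weights mu_i
ys = []
for _ in range(npts):
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    ys.append([c / norm for c in v])
w = [random.random() for _ in range(npts)]
total = sum(w)
mu = [x / total for x in w]

# the target convex combination z = sum_i mu_i y_i
z = [sum(mu[i] * ys[i][j] for i in range(npts)) for j in range(dim)]

# Maurey: average k i.i.d. draws from mu; E||avg - z||^2 <= 1/k since ||y_i|| = 1
draws = random.choices(range(npts), weights=mu, k=k)
avg = [sum(ys[i][j] for i in draws) / k for j in range(dim)]
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(avg, z)))
print(err, 1 / math.sqrt(k))  # err is typically of the order 1/sqrt(k)
```

This is exactly the mechanism behind (10.2.6): replacing each conditional mean z_j by a k-point empirical average costs only a factor k^{−1/2}.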
Example. The exponent 2γ/(2 + γ) in Theorem 10.2.1 is sharp in the following example. Let {e_n}_{n≥1} be an orthonormal basis of H and for 0 < γ < ∞ let

B := {n^{−1/γ}e_n}_{n≥1} ∪ {−n^{−1/γ}e_n}_{n≥1}.

For any ε > 0 small enough we have D(ε, B) = 2n for the least n such that (n^{−2/γ} + (n + 1)^{−2/γ})^{1/2} ≤ ε: the points ±j^{−1/γ}e_j, 1 ≤ j ≤ n, are more than ε apart, so D(ε, B) ≥ 2n, while a set of points of B more than ε apart cannot contain j^{−1/γ}e_j or −j^{−1/γ}e_j for more than one value of j > n, so D(ε, B) ≤ 2n. Thus as ε → 0, ε² ~ 2/n^{2/γ}, so for a constant C = C_γ, 2n ~ C/ε^γ, and replacing C by a suitable larger K if necessary, the hypothesis of Theorem 10.2.1 holds. Let B_n be the intersection of B with the linear span of e₁, …, e_n. Let C_n := co(B_n). Then for any ε > 0, D(ε, co(B)) ≥ D(ε, C_n).
The n-dimensional volume v_n(C_n) is

v_n(C_n) = 2^n/(n!)^{1+1/γ}.

A ball B(x, r) := {y : |y − x| ≤ r} in ℝ^n has volume v_n(B(x, r)) = c_n r^n where c_n = π^{n/2}/Γ((n + 2)/2). (This is well known, especially for n = 1, 2, 3 since Γ(1/2) = π^{1/2}, and then can be proved by induction from n to n + 2.) If x₁, …, x_m are points of C_n more than ε apart, for a maximal m = D(ε, C_n), then the sets B(x_i, ε) cover C_n, so mc_nε^n ≥ v_n(C_n). By Stirling's formula (Theorem 1.3.13), as n → ∞,

v_n(C_n)/c_n ~ (e/n)^{n(2+γ)/(2γ)}(2/π)^{n/2} n^{−1/(2γ)} D_γ

for a constant D_γ.
Take any d such that d > (γ + 2)/(2γ). Then for n large enough, v_n(C_n)/c_n ≥ n^{−dn}. Let g_n(ε) := n^{−dn}/ε^n. Then m ≥ g_n(ε). The following paragraph is only for motivation. As ε ↓ 0, n = n(ε) will be chosen to make g_n(ε) about as large as possible. We have

g_{n+1}(ε)/g_n(ε) = ε^{−1}(n + 1)^{−d(n+1)} n^{dn} ~ ε^{−1}(ne)^{−d}.

This sequence is decreasing in n, so to maximize g_n(ε) for a given ε we want to take n such that the ratio is approximately 1.

At any rate, let n(ε) be the largest integer ≤ f(ε) := e^{−1}ε^{−1/d}. Then for ε small enough,

g_{n(ε)}(ε) ≥ f(ε)^{−df(ε)} ε^{1−f(ε)} = ε exp(df(ε)) = ε exp((d/e)ε^{−1/d}).

Now ε ≥ exp(−ε^{−δ}) as ε ↓ 0 for any δ > 0. Taking δ < 1/d and letting d ↓ (γ + 2)/(2γ), we have 1/d ↑ (2γ)/(2 + γ), showing that the exponent in Theorem 10.2.1 is indeed optimal.

Recall the definitions of D^(2) from Section 4.8 and H_s from after Corollary 10.1.5.
10.2.7 Corollary If G is a uniformly bounded class of measurable functions and for some K < ∞ and 0 < γ < ∞, D^(2)(ε, G) ≤ Kε^{−γ} for 0 < ε ≤ 1, then for any t > r := 2γ/(2 + γ), and for constants C₁, C₂ depending only on 2K, γ, and t as in Theorem 10.2.1,

D^(2)(ε, H_s(G, 1)) ≤ C₁ exp(C₂ε^{−t}) for 0 < ε ≤ 1.

Proof We have D^(2)(ε, G ∪ −G) ≤ 2Kε^{−γ}, 0 < ε ≤ 1. Thus for any law Q with finite support, D^(2)(ε, G ∪ −G, Q) ≤ 2Kε^{−γ}. By Theorem 10.2.1, D^(2)(ε, H(G, 1), Q) ≤ C₁ exp(C₂ε^{−t}) for C_i = C_i(2K, γ, t). It is easily seen that H(G, 1) is a dense subset of H_s(G, 1) in L²(Q). The conclusion follows. □
Now, recall the notions of VC subgraph and VC subgraph hull class from Section 4.7.

10.2.8 Corollary If G is a uniformly bounded VC subgraph class and M < ∞, then the VC subgraph hull class H_s(G, M) satisfies Pollard's entropy condition (10.1.3).
Proof Let MG := {Mg : g ∈ G}. Then MG is a uniformly bounded VC subgraph class. By Theorem 4.8.3(a), MG satisfies the hypothesis of Corollary 10.2.7; since H_s(G, M) = H_s(MG, 1), that corollary applies, and as r < 2 we can take t < 2. □
Remark. It follows by Theorem 6.3.1 that for a uniformly bounded VC subgraph class G, if F ⊂ H_s(G, M) and F satisfies the image admissible Suslin measurability condition, then F is a universal Donsker class. This also follows from Corollary 10.1.5 and Theorem 10.1.6.
Example. Let C be the set of all intervals (a, b] for 0 ≤ a ≤ b ≤ 1. Let G be the set of all real functions f on [0, 1] such that |f(x)| ≤ 1/2 for all x, |f(x) − f(y)| ≤ |x − y| for 0 ≤ x, y ≤ 1, and f(x) = 0 for x < 0 or x > 1. Each f in G has total variation at most 2 (at most 1 on the open interval 0 < x < 1 and 1/2 at each endpoint 0, 1). By the Jordan decomposition we have, for each f ∈ G, f = g − h where g and h are both nondecreasing functions, 0 for x < 0. Then g and h have equal total variations ≤ 1 and G ⊂ H_s(C, 2) by the proof of Theorem 4.7.1(b). Let P be Lebesgue measure on [0, 1]. By Theorem 8.2.10, and since (∫ |f|² dP)^{1/2} ≥ ∫ |f| dP, there is a c > 0 such that D^(2)(ε, G) ≥ e^{c/ε} as ε ↓ 0 (consider laws with finite support which approach P). Since S(C) = 2, the exponent γ can be any number larger than 2 by Corollary 4.1.7 and Theorem 4.6.1. Letting γ ↓ 2, t in Corollary 10.2.7 can be any number > 1, and we saw above that it cannot be < 1 in this case, so again the exponent is sharp.
**10.3 Uniform Donsker classes

This section will state some results with references but without proofs. A class F of measurable functions on a measurable space (X, A) is a uniform Donsker class if it is a universal Donsker class and the convergence in law of ν_n to G_P is also uniform in P. To formulate this notion precisely we will follow Giné and Zinn (1991) in using the dual-bounded-Lipschitz metric β as defined just before Theorem 3.6.4.

Let P(X) be the set of all probability measures on (X, A) and let P_f(X) be the set of all laws in P(X) with finite support. For δ > 0, a class F of measurable real-valued functions on X, and a pseudometric d on F, let

F(δ, d) := {f − g : f, g ∈ F, d(f, g) ≤ δ}.
Definitions. A class F is uniformly pregaussian if it is pregaussian for all P ∈ P(X), and if, for a coherent version of G_P for each P, we have both sup_{P∈P(X)} E‖G_P‖_F < ∞ and

lim_{δ↓0} sup_{P∈P(X)} E‖G_P‖_{F(δ,ρ_P)} = 0.

The class F is finitely uniformly pregaussian if the same holds with P_f(X) in place of P(X). The class F is a uniform Donsker class if it is uniformly pregaussian and

lim_{n→∞} sup_{P∈P(X)} β_F(ν_n, G_P) = 0,

where β_F is the dual-bounded-Lipschitz metric β based on ‖·‖_F as in Section 3.6. Giné and Zinn (1991, Theorem 2.3) proved:
Theorem Let (X, A) be a measurable space and F a class of real-valued measurable functions on X. Then F is a uniform Donsker class if and only if it is finitely uniformly pregaussian, and thus if and only if it is uniformly pregaussian.

The theorem is a very useful characterization, since it's easier to check the finitely uniformly pregaussian property than to check the uniform Donsker property directly.

A uniform Donsker class is clearly a universal Donsker class. Thus it is uniformly bounded up to additive constants (Proposition 10.1.1).
Giné and Zinn show (as a corollary of their Proposition 3.1) that the hypotheses of Theorem 10.1.4 (Pollard's entropy condition (10.1.3), together with suitable boundedness and measurability) actually imply that F is uniformly Donsker. Thus most of the examples of universal Donsker classes treated in Sections 10.1 and 10.2 are uniformly Donsker. An exception is the "ellipsoid" universal Donsker class of Proposition 10.1.8; see problem 2 below. Some uniform Donsker classes of functions on ℝ will be defined. For any function f from ℝ into ℝ and 0 < p < ∞, the p-variation v_p(f) is defined as the supremum of all sums Σ_{j=1}^n |f(x_j) − f(x_{j−1})|^p where −∞ < x₀ < x₁ < ⋯ < x_n < +∞ and n = 1, 2, …. For p = 1 this is the ordinary total variation. We have:

Theorem (Dudley, 1992) For each M < ∞ and p with 0 < p < 2, the class F := {f : ℝ → ℝ, v_p(f) ≤ M} is a uniform Donsker class.
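On a finite grid of function values, the supremum defining v_p(f), restricted to partitions through the grid points, can be computed by dynamic programming. A hedged sketch (for p ≥ 1; it gives a lower bound for the true v_p(f), which is a supremum over all partitions):

```python
def grid_p_variation(values, p):
    """Max of sum |f(x_j) - f(x_{j-1})|^p over increasing subsequences of
    the given grid of values of f (a lower bound for v_p(f); assumes p >= 1)."""
    n = len(values)
    best = [0.0] * n  # best[j]: optimal sum over partitions ending at index j
    for j in range(1, n):
        best[j] = max(best[i] + abs(values[j] - values[i]) ** p
                      for i in range(j))
    return max(best)

vals = [0.0, 1.0, -1.0, 0.5]
print(grid_p_variation(vals, 1))  # 4.5: ordinary total variation on the grid
print(grid_p_variation(vals, 2))  # 7.25: the optimal partition may skip points
```

The p = 2 case illustrates why v_p is subtler than total variation: the optimizing partition need not use every grid point.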
Problems

1. Prove the "easy half" of the Giné–Zinn theorem on uniform Donsker classes in Section 10.3, namely: if F is a uniform Donsker class of measurable functions on a measurable space (X, A), show that F is finitely uniformly pregaussian.

2. Let (X, A) be a measurable space and let A(j) be disjoint, nonempty measurable subsets of X. Let E := {Σ_j x_j 1_{A(j)} : Σ_j x_j² ≤ 1}. Show that E is not a uniform Donsker class. Hint: It is enough (applying only the easy direction of the Giné–Zinn theorem from the previous problem) to show E is not finitely uniformly pregaussian. Choose a point y(j) := y_j ∈ A(j) for each j and let Q(N) := N^{−1} Σ_{i=1}^N δ_{y(i)}.

3. Show that for p > 2, {f : ℝ → ℝ, v_p(f) ≤ 1} is not a universal Donsker class. (It is a uniform Donsker class for p < 2.) Hint: Show it is not a Donsker class for U[0, 1]. Consider functions c_n 1_F for F = {X₁, …, X_n} and suitable constants c_n.

4. If 0 < p < 1, f is continuous from ℝ into ℝ, and v_p(f) < ∞, show that f is constant.

5. Prove that the class F in the proof of Proposition 10.1.2 is not a VC subgraph class directly, by showing that the class of all subgraphs of functions in F shatters any finite subset of the diagonal {(x, x) : 0 < x < 1}.
Notes

Note to Section 10.1. This section is based on Dudley (1987).

Notes to Section 10.2. Theorem 10.2.1 is from Dudley (1987). The argument of Maurey used in its proof was published in the proofs of Pisier (1981, Lemma 2) and Carl (1982, Lemma 1). The example showing that 2γ/(2 + γ) is sharp is from Dudley (1967), Propositions 5.8 and 6.12. Carl (1997) showed that one can take t = s in Theorem 10.2.1.
References

Carl, B. (1982). On a characterization of operators from ℓ_q into a Banach space of type p with some applications to eigenvalue problems. J. Funct. Anal. 48, 394–407.

Carl, B. (1997). Metric entropy of convex hulls in Hilbert space. Bull. London Math. Soc. 29, 452–458.

Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Funct. Anal. 1, 290–330.
Dudley, R. M. (1987). Universal Donsker classes and metric entropy. Ann. Probab. 15, 1306–1326.

Dudley, R. M. (1992). Fréchet differentiability, p-variation and uniform Donsker classes. Ann. Probab. 20, 1968–1982.

Giné, Evarist, and Zinn, Joel (1991). Gaussian characterization of uniform Donsker classes of functions. Ann. Probab. 19, 758–782.

Pisier, G. (1981). Remarques sur un résultat non publié de B. Maurey. Séminaire d'Analyse Fonctionnelle 1980–1981, V.1–V.12. École Polytechnique, Centre de Mathématiques, Palaiseau.
11 The Two-Sample Case, the Bootstrap, and Confidence Sets

11.1 The two-sample case

Let X₁, …, X_m, …, Y₁, …, Y_n, … be some random variables taking values in a set A where (A, A) is a measurable space. Thus (X₁, …, X_m) and (Y₁, …, Y_n) are "samples," of which we have two. Let

P_m := m^{−1} Σ_{i=1}^m δ_{X_i},   Q_n := n^{−1} Σ_{j=1}^n δ_{Y_j}.
The object of two-sample tests in statistics is to decide whether P_m and Q_n are empirical measures from the same, but unknown, law (probability measure) on (A, A). Since P is unknown, we cannot directly compare P_m or Q_n to it by forming m^{1/2}(P_m − P) or n^{1/2}(Q_n − P). Instead, P_m and Q_n can be compared to each other, setting

ν_{m,n} := (mn/(m + n))^{1/2}(P_m − Q_n).
The basic hypothesis will be that there are two laws P, Q on (A, A) and a product of two countable products of copies of (A, A) with factor laws P and Q respectively, namely

(Ω, D, Pr) = (Ω₁, B₁, Pr₁) × (Ω₂, B₂, Pr₂),

where

(Ω₁, B₁, Pr₁) = Π_{i=1}^∞ (A_i, A, P),   (Ω₂, B₂, Pr₂) = Π_{j=1}^∞ (B_j, A, Q),

and each A_i and B_j is a copy of A. On these products let X_i be the A_i coordinate and Y_j the B_j coordinate. If P is a class of laws on (A, A), the (P) null hypothesis is that in addition, P = Q ∈ P. A class F of measurable functions
on (A, A) will be called a P-universal Donsker class if it is a P-Donsker class for every P ∈ P.
11.1.1 Theorem Suppose F is a P-universal Donsker class of functions on (A, A). Then for each P ∈ P, under the (P) null hypothesis, ν_{m,n} ⇒ G_P as m, n → ∞.
Proof Let P ∈ P. By Theorem 3.6.6, it will be enough to show that β*(ν_{m,n}, G_P) → 0 as m, n → ∞. Since F is P-Donsker, it is P-pregaussian, so that there exists a coherent G_P process on F. Since F is ρ_P-totally bounded, the functions f ↦ G_P(f)(ω) on F belong to the space UC(F, ρ_P) of all ρ_P-uniformly continuous and thus bounded functions on F. Here UC(F, ρ_P) is a separable subspace of the Banach space ℓ^∞(F). The map ω ↦ G_P(·)(ω) is Borel measurable from the underlying probability space into UC(F, ρ_P): to see this, it is enough by second-countability to show that the inverse image of an open ball {f : ‖f − f₀‖_F < r} is measurable, which is true since on UC(F, ρ_P) the supremum can be taken over a countable ρ_P-dense set in F. Thus G_P has a law (Borel probability measure) on UC(F, ρ_P), as in Theorem 2.5.5. It is easily seen that every coherent G_P process on F has the same law on UC(F, ρ_P) (consider finite subsets increasing up to a countable dense subset of F). Since F is a P-Donsker class, by Theorem 3.5.1 there exist probability spaces (V_i, S_i, μ_i), i = 1, 2, perfect measurable functions g_{ni} from V_i into Ω_i such that μ_i ∘ g_{ni}^{−1} = Pr_i for all n and i = 1, 2, and coherent G_P processes G_P^{(i)} on F × V_i for i = 1, 2 such that m^{1/2}(P_m − P) ∘ g_{m1} → G_P^{(1)} almost uniformly as m → ∞ and n^{1/2}(Q_n − P) ∘ g_{n2} → G_P^{(2)} almost uniformly for ‖·‖_F as n → ∞. Form the product (V, S, μ) := (V₁, S₁, μ₁) × (V₂, S₂, μ₂) and define h_{mn} : V → Ω₁ × Ω₂ by h_{mn}(u, v) := (g_{m1}(u), g_{n2}(v)). Then μ ∘ h_{mn}^{−1} = Pr₁ × Pr₂ for all m and n. We have
ν_{m,n} = (mn)^{1/2}/(m + n)^{1/2} (P_m − P − (Q_n − P))
  = (n/(m + n))^{1/2} m^{1/2}(P_m − P) − (m/(m + n))^{1/2} n^{1/2}(Q_n − P).

Let G_P^{(m,n)} := (n/(m + n))^{1/2} G_P^{(1)} − (m/(m + n))^{1/2} G_P^{(2)}. Now, if Y, Z are independent coherent G_P processes on F and a, b are any real numbers with a² + b² = 1, then aY + bZ is a coherent G_P process on F. To see this, note that aY + bZ is a Gaussian process indexed by F, has the covariances of G_P,
and is coherent. So G_P^{(m,n)} is a coherent G_P process. We have

‖ν_{m,n} ∘ h_{mn} − G_P^{(m,n)}‖_F ≤ (n/(m + n))^{1/2} ‖m^{1/2}(P_m − P) ∘ g_{m1} − G_P^{(1)}‖_F + (m/(m + n))^{1/2} ‖n^{1/2}(Q_n − P) ∘ g_{n2} − G_P^{(2)}‖_F → 0
almost uniformly as m, n → ∞. Let H be a function on ℓ^∞(F) with ‖H‖_BL ≤ 1, so that ‖H‖_∞ ≤ 1 and |H(f) − H(g)| ≤ ‖f − g‖_F for all f, g ∈ ℓ^∞(F). Given ε > 0, there are m₀ < ∞ and n₀ < ∞ such that for m ≥ m₀ and n ≥ n₀, ‖ν_{m,n} ∘ h_{mn} − G_P^{(m,n)}‖_F < ε on a set V_ε ⊂ V with μ(V_ε) > 1 − ε. It follows that d_{m,n} := |H(ν_{m,n} ∘ h_{mn}) − H(G_P^{(m,n)})| ≤ ε on V_ε, and d_{m,n} ≤ 2 everywhere. Since G_P^{(m,n)} has a law defined on the Borel sets of UC(F, ρ_P), thus of ℓ^∞(F) for ‖·‖_F, H(G_P^{(m,n)}) is measurable and

∫* H(ν_{m,n} ∘ h_{mn}) dμ ≤ EH(G_P^{(m,n)}) + 3ε.
It follows that β*(ν_{m,n} ∘ h_{mn}, G_P) → 0 as m, n → ∞. Since each h_{mn} is a measure-preserving, perfect function, it follows that ν_{m,n} ⇒ G_P as m, n → ∞ as in Theorem 3.4.3. □

The classical two-sample situation is the special case where A = ℝ, A is the Borel σ-algebra, F is the set of all indicator functions of half-lines (−∞, x], and P is the set of all continuous (nonatomic) laws on ℝ. Thus

m^{1/2}(P_m − P)((−∞, x]) = m^{1/2}(F_m − F)(x),

where F is the distribution function of P and F_m an empirical distribution function. Here the class F of indicators is a universal Donsker class by any of several previous results, for example Corollary 6.3.17, and G_P(1_{(−∞,x]}) = y_{F(x)} where y is the Brownian bridge (mentioned in Section 1.1). (Actually, F is a uniform Donsker class.) Since F is continuous, it takes all values in the open interval (0, 1), y₀ = y₁ = 0, and y_t → 0 as t ↓ 0 or t ↑ 1. So the distributions of sup_x y_{F(x)} and sup_x |y_{F(x)}| and the joint distribution of (inf_x y_{F(x)}, sup_x y_{F(x)}) do not depend on F, for P ∈ P. Let F_m and G_n be independent empirical distribution functions for the same F. Let H_{mn} := (mn/(m + n))^{1/2}(F_m − G_n). By Theorems 3.6.7 and 3.6.8, which extend straightforwardly to limits as m, n → ∞, the distributions of the supremum, supremum of absolute value,
and supremum minus infimum of H_{mn} converge to those of the same functionals for y_t. Thus we get, for u > 0 (about u = 0, see problem 3):

11.1.2 Corollary If F_m and G_n are independent empirical distribution functions for a continuous distribution on ℝ, then for any u > 0,

(a) lim_{m,n→∞} P(sup_x H_{mn}(x) > u) = exp(−2u²);
(b) lim_{m,n→∞} P(sup_x |H_{mn}(x)| > u) = 2 Σ_{k=1}^∞ (−1)^{k−1} exp(−2k²u²);
(c) lim_{m,n→∞} P(sup_x H_{mn}(x) − inf_y H_{mn}(y) > u) = 2 Σ_{k=1}^∞ (4k²u² − 1) exp(−2k²u²).

Proof The distributions of the given functionals of the Brownian bridge y_t are given in RAP, Propositions 12.3.3, 12.3.4, and 12.3.6. All three are continuous in u for u > 0. Thus convergence follows from RAP, Theorem 9.3.6. □

Note. The statistics sup_x H_{mn}(x) and sup_x |H_{mn}(x)| in (a) and (b) are called Kolmogorov–Smirnov statistics, while sup_x H_{mn}(x) − inf_y H_{mn}(y) is called a Kuiper statistic.
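Limit (b) is easy to probe by simulation. A hedged sketch (sample sizes, seed, and repetition counts are arbitrary choices) compares the empirical tail frequency of sup_x |H_{mn}(x)| with the series 2 Σ_{k≥1} (−1)^{k−1} e^{−2k²u²}:

```python
import math
import random

def ks_two_sample(xs, ys):
    """sup_x |F_m(x) - G_n(x)| for the two empirical distribution functions."""
    m, n = len(xs), len(ys)
    merged = sorted([(x, 0) for x in xs] + [(y, 1) for y in ys])
    fm = gn = 0
    d = 0.0
    for _, tag in merged:
        if tag == 0:
            fm += 1
        else:
            gn += 1
        d = max(d, abs(fm / m - gn / n))
    return d

def limit_tail(u, terms=100):
    # 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 u^2), the limit in (b)
    return 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * u * u)
                   for k in range(1, terms + 1))

random.seed(2)
m = n = 200
u = 1.0
reps = 400
hits = 0
for _ in range(reps):
    xs = [random.random() for _ in range(m)]
    ys = [random.random() for _ in range(n)]
    hmn = math.sqrt(m * n / (m + n)) * ks_two_sample(xs, ys)
    hits += hmn > u
print(hits / reps, limit_tail(u))  # both should be near 0.27
```

Since the limit distribution does not depend on the (continuous) underlying F, uniform samples suffice for the check.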
11.2 A bootstrap central limit theorem in probability

Iterating the operation by which we get an empirical measure P_n from a law P, we form the bootstrap empirical measure P_n^B by sampling n independent points whose distribution is the empirical measure P_n. The bootstrap was first introduced in nonparametric statistics, where the law P is unknown and we want to make inferences about it from the observed P_n. This can be done by way of bootstrap central limit theorems, which say that under some conditions, n^{1/2}(P_n^B − P_n) behaves like n^{1/2}(P_n − P) and both behave like G_P.
Let (S, S, P) be a probability space and F a class of real-valued measurable functions on S. Let as usual X₁, X₂, … be coordinates on a countable product of copies of (S, S, P). Then let X₁^B, …, X_n^B be independent with distribution P_n. Let

P_n^B := n^{−1} Σ_{j=1}^n δ_{X_j^B}.

Then P_n^B will be called a bootstrap empirical measure. A statistician has a data set, represented by a fixed P_n, and estimates the distribution of P_n^B by repeated resampling from the same P_n. So we are interested not so much in the unconditional distribution of P_n^B as P_n varies, but
rather in the conditional distribution of P_n^B given P_n or (X₁, …, X_n). Let ν_n^B := n^{1/2}(P_n^B − P_n). The limit theorems will be formulated in terms of the dual-bounded-Lipschitz "metric" β of Section 3.6, which metrizes convergence in distribution for not necessarily measurable random elements of a possibly nonseparable metric space (S, d), to a limit which is a measurable random variable with separable range. Let β_F be the β distance where d is the metric defined by the norm ‖·‖_F.

Definition. Let (S, S, P) be a probability space and F a class of measurable real-valued functions on S. Then the bootstrap central limit theorem holds in probability (resp. almost surely) for P and F if and only if F is pregaussian for P and β_F(ν_n^B, G_P), conditional on X₁, …, X_n, converges to 0 in outer probability (resp. almost uniformly) as n → ∞.

In other words, the bootstrap central limit theorem holds in probability if and only if for any ε > 0 there is an n₀ large enough so that for any n ≥ n₀, there is a set A of values of X^(n) := (X₁, …, X_n) with P^n(A) < ε such that if X^(n) ∉ A, and ν_n^B(X^(n))(·) is ν_n^B conditional on X^(n), we have β_F(ν_n^B(X^(n))(·), G_P) < ε. A main bootstrap limit theorem will be stated. The rest of this section will be devoted to proving it, with a number of lemmas and auxiliary theorems.
11.2.1 Theorem (Giné and Zinn) Let (X, A, P) be any probability space. Then the bootstrap central limit theorem holds in probability for P and F if F is a Donsker class for P.

Remarks. Giné and Zinn (1990), see also Giné (1997), also proved "only if" under a measurability condition and proved a corresponding almost sure form of the theorem where F has an L² envelope up to additive constants. See Section 11.4 for detailed statements (but not proofs) of these other theorems.
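Before the proof, the one-dimensional situation (Theorem 11.2.2 below) can be illustrated by direct resampling. A hedged sketch (the distribution, sample size, and tail point are arbitrary illustrative choices) compares a conditional bootstrap tail probability with its normal limit:

```python
import math
import random

random.seed(3)
n = 2000
sample = [random.expovariate(1.0) for _ in range(n)]  # Exp(1), so sigma^2 = 1
xbar = sum(sample) / n

# conditional distribution, given the sample, of n^(-1/2) sum_j (X^B_j - xbar),
# approximated by repeated resampling from the fixed empirical measure P_n
reps = 2000
boot = []
for _ in range(reps):
    resample = random.choices(sample, k=n)
    boot.append((sum(resample) - n * xbar) / math.sqrt(n))

frac = sum(b > 1.0 for b in boot) / reps
normal_tail = 0.5 * math.erfc(1.0 / math.sqrt(2))  # P(N(0,1) > 1) ~ 0.1587
print(frac, normal_tail)  # close for large n, as the theorem asserts
```

Here the whole computation is conditional on one fixed sample, which is exactly the conditional-distribution viewpoint of the definition above.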
Proof $E_B$, $\Pr_B$, and $\mathcal L_B$ will denote the conditional expectation, probability, and law given the sample $X^{(n)} := (X_1, \ldots, X_n)$. Given the sample, $\nu_n^B$ has only finitely many possible values. First, a finite-dimensional bootstrap central limit theorem is needed.
11.2.2 Theorem Let $X_1, X_2, \ldots$ be i.i.d. random variables with values in $\mathbb R^d$ and let $X^B_{n,j}$, $j = 1, \ldots, n$, be i.i.d. $(P_n)$, where $P_n := n^{-1}\sum_{j=1}^n \delta_{X_j}$. Let $\overline X_n := n^{-1}\sum_{j=1}^n X_j$. Assume that $E|X_1|^2 < \infty$. Let $C$ be the covariance matrix of $X_1$, $C_{rs} := E(X_{1r}X_{1s}) - E(X_{1r})E(X_{1s})$. Then for the usual convergence of laws in $\mathbb R^d$, almost surely as $n \to \infty$,

(11.2.3) $\qquad \mathcal L_B\Bigl(n^{-1/2}\sum_{j=1}^n \bigl(X^B_{n,j} - \overline X_n\bigr)\Bigr) \to N(0, C).$
Proof Note: $N(0,0)$ is defined as a point mass at 0. Suppose the theorem holds for $d = 1$. Then for each $t \in \mathbb R^d$, the theorem holds for the inner products $(t, X_j)$ and thus $(t, X^B_{n,j})$, and so on. This implies that the characteristic functions of the laws on the left in (11.2.3) converge to that of $N(0, C)$, and thus that the laws converge, by the Lévy continuity theorem (RAP, Theorem 9.8.2). (This last argument is sometimes called the "Cramér–Wold device.") So, we can assume $d = 1$, and on the right in (11.2.3) we have a law $N(0, \sigma^2)$ where $\sigma^2$ is the variance of $X_1$. If $\sigma^2 = 0$ there is no problem, so assume $\sigma^2 > 0$. We can assume that $EX_1 = 0$, since subtracting $EX_1$ from all $X_j$ doesn't change the expression on the left in (11.2.3). Let

$$s_n^2 := n^{-1}\sum_{j=1}^n (X_j - \overline X_n)^2, \qquad s_n := (s_n^2)^{1/2}.$$
Then by the strong law of large numbers for the $X_j$ and the $X_j^2$, $s_n^2$ converges a.s. as $n \to \infty$ to $\sigma^2$. For a given $n$ and sample $X^{(n)}$, $\overline X_n$ is fixed and the variables

$$Y_{nj} := (X^B_{n,j} - \overline X_n)/n^{1/2}$$

for $j = 1, \ldots, n$ are i.i.d. We have a triangular array and will apply the Lindeberg theorem (RAP, Theorem 9.6.1). For each $n$ and $j$, $E_B Y_{nj} = 0$. Each $Y_{nj}$ has variance $\mathrm{Var}_B(Y_{nj}) = s_n^2/n$. The sum from $j = 1$ to $n$ of these variances is $s_n^2$. Let $Z_{nj} := Y_{nj}/s_n$. Then the $Z_{nj}$ remain i.i.d. given $P_n$, and the sum of their variances is 1. We would like to show by the Lindeberg theorem that $\mathcal L_B(\sum_{j=1}^n Z_{nj})$ converges to $N(0,1)$ as $n \to \infty$, for almost all sequences $X_1, X_2, \ldots$. We have $E_B(Z_{nj}) = 0$. Let $1_{\{\cdots\}}$ denote the indicator function of an event $\{\cdots\}$. For $\varepsilon > 0$ let $E_{nj\varepsilon} := E_B(|Z_{nj}|^2 1_{\{|Z_{nj}| > \varepsilon\}})$. It remains to show that $\sum_{j=1}^n E_{nj\varepsilon} = n E_{n1\varepsilon} \to 0$ as $n \to \infty$, almost surely. Now

$$A_{nj} := \{|Z_{nj}| > \varepsilon\} = \bigl\{|X^B_{n,j} - \overline X_n| > n^{1/2}s_n\varepsilon\bigr\},$$

which for $n$ large enough, almost surely, is included in the set $C(n,j) := C_{nj} := \{|X^B_{n,j}| > n^{1/2}\sigma\varepsilon/2\}$. Also, $Z_{nj}^2 = (X^B_{n,j} - \overline X_n)^2/(ns_n^2) \le 2[(X^B_{n,j})^2 + \overline X_n^2]/(ns_n^2)$. Then $\overline X_n^2/s_n^2$, which is constant given the sample, approaches 0 a.s. as $n \to \infty$ since $EX_1 = 0$,
and so does its integral over $C_{nj}$. For the previous term,

$$E_B\bigl((X^B_{n,1})^2 1_{C(n,1)}\bigr)\big/(ns_n^2) = n^{-2}\sum_{i=1}^n X_i^2\, 1_{\{|X_i| > n^{1/2}\sigma\varepsilon/2\}}\big/s_n^2,$$

which, multiplied by $n$, still goes to 0 a.s. as $n \to \infty$, by the strong law of large numbers for the variables $X_i^2 1_{\{|X_i| > K\}}$ for any fixed $K$, then letting $K \to \infty$. Theorem 11.2.2 is proved.

Here is a desymmetrization fact:
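The content of Theorem 11.2.2 in dimension $d = 1$ can be checked by simulation: conditionally on a large sample, the bootstrap law of $n^{-1/2}\sum_j (X^B_{n,j} - \overline X_n)$ should be close to $N(0, \sigma^2)$, with $s_n^2 \to \sigma^2$ a.s. A hedged sketch (the distribution and sizes are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.exponential(size=n)      # i.i.d. with EX = 1 and Var X = sigma^2 = 1
xbar, s2 = x.mean(), x.var()     # s_n^2 = n^{-1} sum_j (X_j - Xbar_n)^2

# Conditional (bootstrap) law of n^{-1/2} sum_j (X^B_j - Xbar_n), given the sample:
reps = np.array([
    np.sqrt(n) * (rng.choice(x, size=n, replace=True).mean() - xbar)
    for _ in range(2000)
])
```

By the theorem, `reps` is approximately $N(0, s_n^2)$ conditionally on the data, and $s_n^2$ is close to $\sigma^2 = 1$ here.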
11.2.4 Lemma Let $T$ be a set, and for any real-valued function $f$ on $T$ let $\|f\|_T := \sup_{t\in T}|f(t)|$. Let $X$ and $Y$ be two stochastic processes indexed by $t \in T$, defined on a probability space $(\Omega\times\Omega', \mathcal S\otimes\mathcal S', P\times P')$, where $X(t)(\omega,\omega')$ depends only on $\omega \in \Omega$, and $Y(t)(\omega,\omega')$ only on $\omega' \in \Omega'$. Then

(a) for any $s > 0$ and any $u > 0$ such that $\sup_{t\in T}\Pr\{|Y(t)| > u\} < 1$, we have

$$\Pr{}^*(\|X\|_T > s) \le \Pr{}^*\{\|X - Y\|_T > s - u\}\Big/\Bigl[1 - \sup_{t\in T}\Pr\{|Y(t)| > u\}\Bigr];$$

(b) if $\theta \ge \sup_{t\in T}E(Y(t)^2)$ then for any $s > 0$,

$$\Pr{}^*(\|X\|_T > s) \le 2\Pr{}^*\bigl(\|X - Y\|_T > s - (2\theta)^{1/2}\bigr).$$

Proof Let $A(s) := \{\omega: \|X(\omega)\|_T > s\}$. For part (a) we have, by (3.2.8),

$$\Pr{}^*(\|X - Y\|_T > s - u) \ge E_P(P')^*(\|X - Y\|_T > s - u) \ge P^*(\|X\|_T > s)\inf_{\omega\in A(s)}(P')^*(\|X(\omega) - Y\|_T > s - u) \ge P^*(\|X\|_T > s)\inf_{t\in T}P'(|Y(t)| \le u),$$

since if $\|X(\omega)\|_T > s$ then for some $t \in T$, $|X(\omega,t)| > s$, and then $|Y(t)| \le u$ implies $\|X(\omega) - Y\|_T > s - u$. This proves part (a). Then part (b) follows by Chebyshev's inequality.
Now $\varepsilon_i$ will denote i.i.d. Rademacher variables, that is, variables with the distribution $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$, defined on a probability space $(\Omega_\varepsilon, \mathcal S_\varepsilon, P_\varepsilon)$, which we can take as $[0,1]$ with Lebesgue measure. We also take the product of $(\Omega_\varepsilon, \mathcal S_\varepsilon, P_\varepsilon)$ with the probability space on which other random variables and elements are defined. Next, some hypotheses will be given for later reference.
Let $(S, \mathcal S, P)$ be a probability space and $(S^n, \mathcal S^n, P^n)$ a Cartesian product of $n$ copies of $(S, \mathcal S, P)$. Let $\mathcal F \subset \mathcal L^2(S, \mathcal S, P)$. Let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher variables defined on a probability space $(\Omega', \mathcal A', P')$. Then take the probability space

(11.2.5) $\qquad (S^n \times \Omega',\ \mathcal S^n \otimes \mathcal A',\ P^n \times P').$

References to (11.2.5) will also be to the preceding paragraph. Here is another fact on symmetrization and desymmetrization.
11.2.6 Lemma Under (11.2.5), for any $t > 0$ and $n = 1, 2, \ldots$,

(a) $\Pr^*\bigl(\bigl\|\sum_{j=1}^n \varepsilon_j f(X_j)\bigr\|_{\mathcal F} > t\bigr) \le 2\max_{k\le n}\Pr^*\bigl(\bigl\|\sum_{i=1}^k f(X_i)\bigr\|_{\mathcal F} > t/2\bigr)$;

(b) if $\sigma^2 \ge \sup_{f\in\mathcal F}\mathrm{Var}(f(X_1))$, then

$$\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F} > t\Bigr) \le 4\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} > \bigl(t - (2n)^{1/2}\sigma\bigr)/2\Bigr).$$

Proof For part (a), let $E(n) := \{\tau = \{\tau_i\}_{i=1}^n:\ \tau_i = \pm 1$ for each $i\}$. Let $\varepsilon_{(n)} := \{\varepsilon_i\}_{i=1}^n$. Then

$$P_t := \Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} > t\Bigr) = \sum_{\tau\in E(n)}\Pr{}^*\Bigl(\{\varepsilon_{(n)} = \tau\}\times\Bigl\{\Bigl\|\sum_{i:\,\tau_i=1} f(X_i) - \sum_{i:\,\tau_i=-1} f(X_i)\Bigr\|_{\mathcal F} > t\Bigr\}\Bigr).$$

By Lemma 3.2.4 and Theorem 3.2.1, for laws $\mu, \nu$ and sets $A, B$, $(\mu\times\nu)^*(A\times B) = \mu^*(A)\nu^*(B) = \mu(A)\nu^*(B)$ if $A$ is measurable, so

$$P_t \le \sum_{\tau\in E(n)} 2^{-n}\Bigl[\Pr{}^*\Bigl(\Bigl\|\sum_{i:\,\tau_i=1} f(X_i)\Bigr\|_{\mathcal F} > t/2\Bigr) + \Pr{}^*\Bigl(\Bigl\|\sum_{i:\,\tau_i=-1} f(X_i)\Bigr\|_{\mathcal F} > t/2\Bigr)\Bigr] \le 2\max_{k\le n}\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^k f(X_i)\Bigr\|_{\mathcal F} > t/2\Bigr),$$

since the $X_i$ are i.i.d.
This proves (a). For (b), take a further product, so that we have (11.2.5) with $2n$ in place of $n$. Apply Lemma 11.2.4(b) to the process $f \mapsto \sum_{i=1}^n f(X_i)$, letting $Y_i := X_{n+i}$ for $i = 1, \ldots, n$. We get, for any $\tau \in E(n)$,

$$C_t := \Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F} > t\Bigr) \le 2\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \bigl(f(X_i) - f(Y_i)\bigr)\Bigr\|_{\mathcal F} > t - (2n)^{1/2}\sigma\Bigr) = 2(P^{2n})^*\Bigl(\Bigl\|\sum_{i=1}^n \tau_i\bigl(f(X_i) - f(Y_i)\bigr)\Bigr\|_{\mathcal F} > t - (2n)^{1/2}\sigma\Bigr)$$

by symmetry of the $f(X_i) - f(Y_i)$. Then by (3.2.9),

$$C_t \le 2E_{P_\varepsilon}(P^{2n})^*\Bigl(\Bigl\|\sum_{i=1}^n \varepsilon_i\bigl(f(X_i) - f(Y_i)\bigr)\Bigr\|_{\mathcal F} > t - (2n)^{1/2}\sigma\Bigr) \le 4\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} > \bigl[t - (2n)^{1/2}\sigma\bigr]/2\Bigr),$$

finishing the proof.
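For a finite class $\mathcal F$ (so that the suprema are measurable and no outer probabilities are needed), part (a) of the lemma can be probed by Monte Carlo. The setup below is my own illustration: $\mathcal F$ consists of two $P$-centered functions of a standard normal variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, t = 50, 3000, 10.0
c = np.exp(-0.5)                        # E cos(Z) for Z ~ N(0, 1)
fs = [np.sin, lambda u: np.cos(u) - c]  # two P-centered functions

lhs_events = np.zeros(trials, dtype=bool)
partial_sups = np.zeros((trials, n))    # ||sum_{i<=k} f(X_i)||_F for k = 1..n
for m in range(trials):
    x = rng.standard_normal(n)
    eps = rng.choice([-1.0, 1.0], size=n)
    vals = np.stack([f(x) for f in fs], axis=1)             # shape (n, 2)
    lhs_events[m] = np.abs((eps[:, None] * vals).sum(0)).max() > t
    partial_sups[m] = np.abs(np.cumsum(vals, axis=0)).max(axis=1)

lhs = lhs_events.mean()                              # Pr(||sum eps_j f(X_j)||_F > t)
rhs = 2 * (partial_sups > t / 2).mean(axis=0).max()  # 2 max_k Pr(||S_k||_F > t/2)
```

With these choices `lhs` is far below `rhs`, consistent with (a); the inequality is of course far from sharp here.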
Next, we have some consequences or forms of Jensen's inequality.
11.2.7 Lemma (a) Let $(S, \mathcal S, P)$ be a probability space and $\mathcal F \subset \mathcal L^1(S, \mathcal S, P)$. Then $\|Ef\|_{\mathcal F} \le E\|f\|_{\mathcal F}^*$.

(b) On a product space $(A, \mathcal A, P)\times(A, \mathcal A, Q)$ let $X$ and $Y$ be coordinate functions. Let $\mathcal F$ be a class of real-valued measurable functions on $(A, \mathcal A)$. If $Qf = 0$ for all $f \in \mathcal F$, then

(11.2.8) $\qquad E^*\|f(X)\|_{\mathcal F} \le E^*\|f(X) + f(Y)\|_{\mathcal F}.$

Proof (a) For each $g \in \mathcal F$, $|Eg| \le E|g|$, while $E|g| \le E\|f\|_{\mathcal F}^*$. Then, taking the supremum over $g \in \mathcal F$ proves (a).

For (b), recall that for any function of $X$, $E^*$ for $P$ and $P\times Q$ are the same by Lemma 3.2.4 and Theorem 3.2.1. For any value of $x$ and $f \in \mathcal F$, $z \mapsto |f(x) + z|$ is a convex function of $z$. Letting $Z = f(Y)$, we have for each $x$, by Jensen's inequality,

$$|f(x)| = |f(x) + Ef(Y)| \le E_Y|f(x) + f(Y)|,$$

and so

$$E_X^*\|f(X)\|_{\mathcal F} \le E_X^*\bigl(\sup_{f\in\mathcal F}E_Y|f(X) + f(Y)|\bigr).$$

For each $g \in \mathcal F$, $|g(X) + g(Y)| \le \|f(X) + f(Y)\|_{\mathcal F}^*$, and taking the supremum over $g \in \mathcal F$ gives

$$E_X^*\|f(X)\|_{\mathcal F} \le E_X^* E_Y\|f(X) + f(Y)\|_{\mathcal F}^* = E^*\|f(X) + f(Y)\|_{\mathcal F}$$

by the Tonelli–Fubini theorem, since $\|f(X) + f(Y)\|_{\mathcal F}^*$ is measurable. The following lemma and theorem are known as Hoffmann–Jørgensen inequalities.
11.2.9 Lemma Let $X_1, \ldots, X_n$ be coordinates on a product probability space $(S^n, \mathcal S^n, \prod_{j=1}^n P_j)$. Let $\mathcal F$ be a class of measurable real-valued functions on $(S, \mathcal S)$. Let $S_k(f) := \sum_{i=1}^k f(X_i)$ for $k = 1, \ldots, n$. Then for any $s > 0$ and $t > 0$,

(11.2.10) $\quad \Pr\bigl(\max_{k\le n}\|S_k(f)\|_{\mathcal F}^* > 3t + s\bigr) \le \bigl(\Pr\{\max_{k\le n}\|S_k(f)\|_{\mathcal F}^* > t\}\bigr)^2 + \Pr\bigl(\max_{k\le n}\|f(X_k)\|_{\mathcal F}^* > s\bigr).$

Proof Let $\|\cdot\| := \|\cdot\|_{\mathcal F}$. Let $\tau := \min\{j \le n:\ \|S_j\|^* > t\}$, or $\tau := n + 1$ if there is no such $j \le n$. Then for $j = 1, \ldots, n$, $\{\tau = j\}$ is, by Lemma 3.2.4, a measurable function of $X_1, \ldots, X_j$, and $\{\max_{k\le n}\|S_k\|^* > 3t + s\}$ is the union over $j \le n$ of the disjoint events $\{\tau = j,\ \max_{k\le n}\|S_k\|^* > 3t + s\}$. On $\{\tau = j\}$ we have $\|S_{j-1}\|^* \le t$, so for $k \ge j$,

$$\|S_k\|^* \le \|S_{j-1}\|^* + \|f(X_j)\|^* + \|S_k - S_j\|^* \le t + \|f(X_j)\|^* + \max_{j<k\le n}\|S_k - S_j\|^*.$$

For $j < k \le n$, by Lemma 3.2.4, $\|S_k - S_j\|^*$ is a measurable function of $X_{j+1}, \ldots, X_k$ and thus is independent of $\{\tau = j\}$. So

$$\Pr\bigl(\tau = j,\ \max_{k\le n}\|S_k\|^* > 3t + s\bigr) \le \Pr\bigl(\tau = j,\ \max_{k\le n}\|f(X_k)\|^* > s\bigr) + \Pr(\tau = j)\Pr\bigl(\max_{j<k\le n}\|S_k - S_j\|^* > 2t\bigr),$$

and since $\|S_k - S_j\|^* \le \|S_k\|^* + \|S_j\|^*$, the last factor is at most $\Pr(\max_{k\le n}\|S_k\|^* > t)$. Summing over $j \le n$, and noting $\sum_j\Pr(\tau = j) \le \Pr(\max_{k\le n}\|S_k\|^* > t)$, gives (11.2.10).
11.2.11 Theorem Let $0 < p < \infty$, $n = 1, 2, \ldots$, and let $X_1, \ldots, X_n$ be coordinates on a product probability space $(S^n, \mathcal S^n, \prod_{j=1}^n P_j)$. Let $\mathcal F$ be a class of measurable real-valued functions on $(S, \mathcal S)$ such that for $i = 1, \ldots, n$, $E\bigl((\|f(X_i)\|_{\mathcal F}^*)^p\bigr) < \infty$. Let

$$u := \inf\Bigl\{t > 0:\ \Pr\Bigl(\max_{k\le n}\Bigl\|\sum_{i=1}^k f(X_i)\Bigr\|_{\mathcal F}^* > t\Bigr) \le 1/(2\cdot 4^p)\Bigr\}.$$

Then

$$E\Bigl(\max_{k\le n}\Bigl\|\sum_{i=1}^k f(X_i)\Bigr\|_{\mathcal F}^*\Bigr)^p \le 2\cdot 4^p\, E\bigl(\max_{i\le n}(\|f(X_i)\|_{\mathcal F}^*)^p\bigr) + 2(4u)^p.$$
Proof Recall that for any random variable $Y \ge 0$ with law $Q$,

$$\int_0^\infty Q(Y > t)\,dt = \int_0^\infty\!\!\int_0^y dt\,dQ(y) = \int_0^\infty y\,dQ(y) = EY.$$

Let $\|\cdot\| := \|\cdot\|_{\mathcal F}$ and $T := \max_{k\le n}\bigl\|\sum_{i=1}^k f(X_i)\bigr\|^*$. Then

$$E(T^p) = 4^p\int_0^\infty \Pr(T > 4t)\,d(t^p) = 4^p\Bigl(\int_0^u + \int_u^\infty\Bigr)\Pr(T > 4t)\,d(t^p) \le (4u)^p + 4^p\int_u^\infty \Pr(T > 4t)\,d(t^p),$$

which by (11.2.10) with $s = t$ is less than or equal to

$$(4u)^p + 4^p\int_u^\infty \bigl(\Pr(T > t)\bigr)^2\,d(t^p) + 4^p\int_0^\infty \Pr\bigl(\max_{i\le n}\|f(X_i)\|^* > t\bigr)\,d(t^p)$$
$$\le (4u)^p + 4^p\Pr(T > u)\int_0^\infty \Pr(T > t)\,d(t^p) + 4^p E\max_{i\le n}\bigl(\|f(X_i)\|^*\bigr)^p \le (4u)^p + \tfrac12 E(T^p) + 4^p E\max_{i\le n}\bigl(\|f(X_i)\|^*\bigr)^p,$$

since

$$4^p\Pr(T > u) \le 1/2$$

by choice of $u$. Then since $E(T^p) \le E\bigl(\sum_{i=1}^n \|f(X_i)\|^*\bigr)^p < \infty$, subtracting $\frac12 E(T^p)$ from both sides and multiplying by 2 gives the conclusion. The theorem is proved. Next is another symmetrization-desymmetrization inequality.
11.2.12 Lemma Under (11.2.5),

(11.2.13) $\quad \tfrac12 E^*\Bigl\|\sum_{j=1}^n \varepsilon_j\bigl(f(X_j) - Pf\bigr)\Bigr\|_{\mathcal F} \le E^*\Bigl\|\sum_{j=1}^n \bigl(f(X_j) - Pf\bigr)\Bigr\|_{\mathcal F} \le 2E^*\Bigl\|\sum_{j=1}^n \varepsilon_j\bigl(f(X_j) - Pf\bigr)\Bigr\|_{\mathcal F},$

which also holds if the $Pf$ in the last expression is deleted.
Proof Replacing $\mathcal F$ by $\{f - Pf:\ f \in \mathcal F\}$, we can assume $Pf = 0$ for all $f \in \mathcal F$. Then by (3.2.9),

$$E^*\Bigl\|\sum_{j=1}^n \varepsilon_j f(X_j)\Bigr\|_{\mathcal F} = E_\varepsilon E_X^*\Bigl\|\sum_{i:\,\varepsilon_i=1} f(X_i) - \sum_{i:\,\varepsilon_i=-1} f(X_i)\Bigr\|_{\mathcal F} \le E_\varepsilon E_X^*\Bigl(\Bigl\|\sum_{i:\,\varepsilon_i=1} f(X_i)\Bigr\|_{\mathcal F} + \Bigl\|\sum_{i:\,\varepsilon_i=-1} f(X_i)\Bigr\|_{\mathcal F}\Bigr) \le 2E^*\Bigl\|\sum_{j=1}^n f(X_j)\Bigr\|_{\mathcal F}$$

by (11.2.8). The left side of (11.2.13) follows. For the right side, extend (11.2.5) by taking the Cartesian product with another copy of $(S^n, \mathcal S^n, P^n)$ and let the new coordinate functions be $\xi_1, \ldots, \xi_n$. Then if $\tau_i = \pm 1$ for each $i$,

$$E^*\Bigl\|\sum_{j=1}^n f(X_j)\Bigr\|_{\mathcal F} \le E^*\Bigl\|\sum_{j=1}^n \bigl(f(X_j) - f(\xi_j)\bigr)\Bigr\|_{\mathcal F} = E^*\Bigl\|\sum_{j=1}^n \tau_j\bigl(f(X_j) - f(\xi_j)\bigr)\Bigr\|_{\mathcal F},$$

since (11.2.8) holds as well if $+$ is replaced by $-$, and by symmetry of the $f(X_j) - f(\xi_j)$. So by (3.2.8),

$$E^*\Bigl\|\sum_{j=1}^n f(X_j)\Bigr\|_{\mathcal F} \le E_\varepsilon E_{X,\xi}^*\Bigl\|\sum_{i=1}^n \varepsilon_i\bigl(f(X_i) - f(\xi_i)\bigr)\Bigr\|_{\mathcal F} \le 2E_\varepsilon E_X^*\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} = 2E^*\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F},$$
which finishes the proof of the lemma. Next will be some Poissonization facts. Recall that any real-valued stochastic process $X(t)$, $t \in T$, indexed by a set $T$, has a law $P_X$ defined on the space $\mathbb R^T$ of all real-valued functions on $T$, with the smallest $\sigma$-algebra for which the coordinate projections $f \mapsto f(t)$ are measurable for each $t \in T$. The process is called centered if $EX(t) = 0$ for all $t \in T$. Recall that $Y$ has a Poisson distribution with parameter $\lambda > 0$ if and only if $P(Y = k) = e^{-\lambda}\lambda^k/k!$ for $k = 0, 1, 2, \ldots$. Let $x\wedge y := \min(x, y)$.

11.2.14 Lemma Let $T$ be any set. For each $i = 1, \ldots, n$, let $X(i) := X_i$ be a centered, real-valued stochastic process indexed by $T$. Let $X_{i,j}$, $j = 1, 2, \ldots$, be independent copies of $X_i$, taken as coordinates on a product, say $\Lambda$, of copies of $(\mathbb R^T, P_{X(i)})$. Let $N_i := N(i)$, $i = 1, 2, \ldots$, be i.i.d. Poisson variables with parameter $\lambda = 1$, defined on a probability space $(\Omega', P')$, and take the product of this space with $\Lambda$, so that $N_1, N_2, \ldots$ are independent of the $X_{i,j}$. Let $\|f\|_T := \sup_{t\in T}|f(t)|$. Then

(11.2.15) $\qquad E^*\Bigl\|\sum_{i=1}^n X_{i,1}\Bigr\|_T \le \dfrac{e}{e-1}\,E^*\Bigl\|\sum_{i=1}^n\sum_{j=1}^{N(i)} X_{i,j}\Bigr\|_T.$
Proof By Jensen's inequality, Lemma 11.2.7(a), letting $E_N$ be expectation with respect to the Poisson variables $N_i$, we have

$$M_n := \frac{e-1}{e}\,E^*\Bigl\|\sum_{i=1}^n X_{i,1}\Bigr\|_T = E_X^*\Bigl\|\sum_{i=1}^n E(N_i\wedge 1)X_{i,1}\Bigr\|_T \le E_X^* E_N\Bigl\|\sum_{i=1}^n (N_i\wedge 1)X_{i,1}\Bigr\|_T,$$

since $E(N_i\wedge 1) = P(N_i \ge 1) = 1 - e^{-1}$. Then by (3.2.9), $M_n \le E_N E_X^*\|\sum_{i=1}^n (N_i\wedge 1)X_{i,1}\|_T$. For each $i$ and given $N_i = N(i)$,

$$\sum_{j=1}^{N(i)} X_{i,j} - (N_i\wedge 1)X_{i,1} = \sum_{j=2}^{N(i)} X_{i,j}$$

($= 0$ for $N_i \le 1$), which is a centered process independent of $(N_i\wedge 1)X_{i,1}$. Thus for any values of $N_1, \ldots, N_n$, applying another form of Jensen's inequality, Lemma 11.2.7(b), we get $M_n \le E_N E_X^*\bigl\|\sum_{i=1}^n\sum_{j=1}^{N(i)} X_{i,j}\bigr\|_T$. Then by (3.2.9), $E_N E_X^*$ can be replaced by $E^*$ and the lemma is proved.

For any two finite signed measures $\mu$ and $\nu$ on the Borel sets of a separable Banach space $B$, recall that the convolution $\mu * \nu$ is defined by $(\mu * \nu)(A) := \int \mu(A - x)\,d\nu(x)$ for any Borel set $A$. Here convolution is commutative and associative. For any finite signed measure $\mu$ on $B$ and $k = 1, 2, \ldots$, let $\mu^k$ be the $k$th convolution power $\mu * \cdots * \mu$ with $k$ factors. Let $\mu^0 := \delta_0$. Let $e^\mu := \exp(\mu) := \sum_{k=0}^\infty \mu^k/k!$. If $\mu \ge 0$ let $\mathrm{Pois}(\mu) := e^{-\mu(B)}e^\mu$. If $\mu$ and $\nu$ are two finite measures on $B$ it is straightforward to check that $\mathrm{Pois}(\mu + \nu) = \mathrm{Pois}(\mu)*\mathrm{Pois}(\nu)$. If $X$ is a measurable function from a probability space $(\Omega, \mathcal S, P)$ into a measurable space $(S, \mathcal A)$, recall that the law of $X$ is the image measure $\mathcal L(X) := P\circ X^{-1}$ on $\mathcal A$. For any $c > 0$ and $x \in B$, $\mathrm{Pois}(c\delta_x) = \mathcal L(N_c x)$ where $N_c$ is a Poisson random variable with parameter $c$.
Also, if $\mathcal L(X_{i,j}) = \mu_i$ for $j = 1, 2, \ldots$, and $N_i := N(i)$ are Poisson with parameter 1, where all the $X_{i,j}$ and $N_r$ are jointly independent, $i, r = 1, \ldots, k$, $j = 1, 2, \ldots$, then by induction on $k$,

$$\mathcal L\Bigl(\sum_{i=1}^k\sum_{j=1}^{N(i)} X_{i,j}\Bigr) = \mathrm{Pois}\Bigl(\sum_{r=1}^k \mu_r\Bigr).$$

If $X_j$ are random variables with values in a separable Banach space with $EX_j = 0$ for each $j = 1, \ldots, n$, then the depoissonization inequality (11.2.15) gives

(11.2.16) $\qquad E\Bigl\|\sum_{j=1}^n X_j\Bigr\| \le \dfrac{e}{e-1}\displaystyle\int \|x\|\,d\,\mathrm{Pois}\Bigl(\sum_{j=1}^n \mathcal L(X_j)\Bigr)(x).$
Here is another depoissonization inequality.
11.2.17 Lemma Let $(B, \|\cdot\|)$ be a normed space. For each $n = 1, 2, \ldots$, let $v_1, \ldots, v_n$ be $n$ distinct points of $B$ and $\bar v := (v_1 + \cdots + v_n)/n$. Let $V_1, \ldots, V_n$ be i.i.d. $B$-valued random variables with $\Pr(V_i = v_j) = 1/n$ for $i, j = 1, \ldots, n$. Let $N_1, \ldots, N_n$ be Poisson variables with parameter 1. Let all the $V_i$ and $N_j$ be jointly independent. Then

(11.2.18) $\qquad E\Bigl\|\sum_{j=1}^n (V_j - \bar v)\Bigr\| \le \dfrac{e}{e-1}\,E\Bigl\|\sum_{j=1}^n (N_j - 1)(v_j - \bar v)\Bigr\|.$

Proof The variables $V_i$ and $N_i$ are discrete, so the norms in (11.2.18) are both measurable random variables. Next, with $v(j) := v_j$,

$$\sum_{j=1}^n \mathcal L(V_j - \bar v) = n\mathcal L(V_1 - \bar v) = \sum_{j=1}^n \delta_{v(j)-\bar v}.$$

It follows from the above properties of Poissonization that

(11.2.19) $\quad \mathrm{Pois}\Bigl(\sum_{j=1}^n \mathcal L(V_j - \bar v)\Bigr) = \mathrm{Pois}(\delta_{v(1)-\bar v})*\cdots*\mathrm{Pois}(\delta_{v(n)-\bar v}) = \mathcal L\Bigl(\sum_{j=1}^n N_j(v_j - \bar v)\Bigr) = \mathcal L\Bigl(\sum_{j=1}^n (N_j - 1)(v_j - \bar v)\Bigr),$

using $\sum_{j=1}^n (v_j - \bar v) = 0$. Then by (11.2.16) with $X_j = V_j - \bar v$, the lemma is proved.
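Inequality (11.2.18) is easy to probe numerically in $B = \mathbb R$ with $\|\cdot\|$ the absolute value. This is a sanity check, not part of the proof; the constants and sample sizes are my own choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 20000
v = np.arange(1.0, n + 1)       # n distinct points v_1, ..., v_n in B = R
vbar = v.mean()

lhs_draws = np.empty(trials)
rhs_draws = np.empty(trials)
for m in range(trials):
    V = rng.choice(v, size=n, replace=True)   # V_i i.i.d. uniform on {v_1..v_n}
    N = rng.poisson(1.0, size=n)              # N_j i.i.d. Poisson(1)
    lhs_draws[m] = abs(np.sum(V - vbar))
    rhs_draws[m] = abs(np.sum((N - 1) * (v - vbar)))

lhs = lhs_draws.mean()                         # E|sum_j (V_j - vbar)|
bound = np.e / (np.e - 1) * rhs_draws.mean()   # (e/(e-1)) E|sum_j (N_j-1)(v_j - vbar)|
```

Both sums have the same variance $\sum_j (v_j - \bar v)^2$ here, so the factor $e/(e-1) \approx 1.58$ gives the bound comfortable slack.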
The next proof will use the fact that for any real-valued function $f$ on a measure space, $1_{\{f^* > t\}} = (1_{\{f > t\}})^*$. This is true since $1_{\{f > t\}} \le 1_{\{f^* > t\}}$, where the latter is measurable, and Lemma 3.2.6(a) applies. The next fact is about triangular arrays with i.i.d. summands.
11.2.20 Theorem Let $(T, d)$ be a totally bounded pseudometric space. For each $n = 1, 2, \ldots$, suppose given a product of $n$ copies of a probability space, $\Omega_n := (S(n), \mathcal S(n), P(n))^n$. For each $n$ and $t \in T$, let $Y_n(t, \cdot) := Y_n(t)(\cdot)$ be a measurable real-valued function on $S(n)$. For $\omega := (\omega_1, \ldots, \omega_n) \in \Omega_n$ let $X_{n,j}(\cdot, \omega) := Y_n(\cdot, \omega_j)$. Thus the $X_{n,j}$ are i.i.d. copies of $Y_n$. Suppose that each $Y_n$ has bounded sample paths a.s. Assume that

(i) for all $t$ in a dense subset $D \subset T$ for $d$ and for all $\theta > 0$,

(11.2.21) $\qquad \lim_{n\to\infty} n\Pr\{|Y_n(t)| > \theta n^{1/2}\} = 0$;

(ii) for any $\delta > 0$, $\sup_{t\in T}\Pr\{|Y_n(t)| > \delta n^{1/2}\} \to 0$ as $n \to \infty$; and

(iii) for all $\varepsilon > 0$, as $\delta \downarrow 0$,

$$\limsup_{n\to\infty}\Pr{}^*\Bigl\{n^{-1/2}\sup_{d(s,t)<\delta}\Bigl|\sum_{i=1}^n X_{n,i}(t) - EX_{n,i}(t) - X_{n,i}(s) + EX_{n,i}(s)\Bigr| > \varepsilon\Bigr\} \to 0.$$

Then for any $\gamma > 0$,

(11.2.22) $\qquad \lim_{n\to\infty} n\Pr^*\{\|Y_n\|_T > \gamma n^{1/2}\} = 0.$

If the hypotheses are assumed only along a subsequence $n_k$, then the conclusion holds along the subsequence.
If the hypotheses are assumed only along a subsequence nk, then the conclusion holds along the subsequence. Proof Taking more Cartesian products, let {Vn, j} bean independent copy of easily (11.2.23)
Pr*{IIXn,1-Vn,111>u} <
2Pr*{IIXn,111 >u/2}.
Then, condition (iii) implies that as S 4. 0, n
(11.2.24)
lim sup,, Pr* I
n-112
supd(s,t)<s
Xn,i (t) - Vn,i (t) i=1
- Xn,i (s) + Vn,i (s)
>s}--*
0.
Let Un, j := Xn, j - Vn, j. For any t > 0, let {s1, , sN(,)} be a maximal subset of T with d(si, sj) > t/3 for i # j. Let Bi It E T : d (t, si) < r/3} and Ai := Ai,, := Bi \ U,
The Two-Sample Case, the Bootstrap, and Confidence Sets
348
t E Ai,T. Then
(11.2.25) Pr* {IIUn,111T > 2y/} < Pr* 1IIUn,1,TIIT > yln} + Pr* {IIUn,l - Un,1,1IIT > YV"} Then, as n (11.2.26)
oo, nPr* {IIUn,1,tIIT > Y-vfn}
-
N(T)
nPr{1Un,I(ti)I > y ln} -. 0
< i=1
by hypothesis (i) and (11.2.23). For an arbitrary set A in a probability space, let A* be a measurable cover, so that A* is a measurable set and (0A)* = lA*. k * k C* Then it is easily checked that for any sets C1, Ck, (Uk Uk up to sets of probability zero. Take Ci :_ { II Un,i - Un,i,r II T > yn 1/2}. Then by Theorem 3.2.1 and Lemma 3.2.4,
, G) =
1 - exp(n Pr*(C1)) < 1 - (1 - Pr*(Cl))n = Pr. (
Ii
U j_1
Pr* {maxl Y/"}
which by Levy's inequality with stars (9.1.9(b)) is n
E(Un,i
- Un,1,1)
i=1
T n
< 2 Pr* jn_1I2suPd(s,t)T
(Xn,i (t) - Vn,i (t) i=1
- Xn,i (s) + Vn,i (s)) Thus, applying (11.2.24), then (11.2.25) and (11.2.26), gives (11.2.27)
lim1,o (lim sup)n,,,n Pr* { II Un,1 - U,,,1,, II T > Y In}
= limn,oonPr* {IIUn,111T > Y In = 0. Next, by Lemma 3.1.4, II Xn,i II T and II Vn,i II T are independent. Applying also Lemma 3.2.6, and recalling that Xn,1 and Yn have the same distribution,
11.2 A bootstrap central limit theorem in probability
349
we get
nPr*{IIUn,111T>Y,/n} > nPr{IIYnIIT
By Lemma 11.2.4(a),
Pr{ IIYnIIT > ynl/2} < 1
Pr*{IIUn,1IIT > ynl/2/2}
-suptETPr{IYn(t)I > yn1/2/2}
Here the numerator -a 0 as n -+ oo by (11.2.26), and suplET Pr{IYn(t) > yn1/2/2} -* 0 as n -> oo by hypothesis (ii). So Pr{IIYnIIT < y J } 1 as n -+ oo and the conclusion follows. The proof for subsequences is the same.
Next will be a characterization of Donsker classes, in which the asymptotic equicontinuity condition (Theorem 3.7.2) is put into symmetrized forms. For a class $\mathcal F$ of functions let

$$\mathcal F_\delta := \{f - g:\ f, g \in \mathcal F,\ \rho_P(f, g) < \delta\}.$$
11.2.28 Theorem Let $(X, \mathcal A, P)$ be a probability space and $\mathcal F$ a class of functions with $\mathcal F \subset \mathcal L^2(X, \mathcal A, P)$. Assume (11.2.5), so that the $\varepsilon_i$ are i.i.d. Rademacher variables, independent of $X_1, X_2, \ldots$. Suppose that for each $x \in X$,

(11.2.29) $\qquad F_1(x) := \sup_{f\in\mathcal F}|f(x) - Pf| < \infty.$

Then the following are equivalent:

(a) $\mathcal F$ is a Donsker class for $P$;

(b) $\mathcal F$ is totally bounded for $\rho_P$ and for any $\varepsilon > 0$, as $\delta \to 0$,

$$\limsup_{n\to\infty}\Pr{}^*\Bigl\{n^{-1/2}\Bigl\|\sum_{i=1}^n \varepsilon_i\bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} > \varepsilon\Bigr\} \to 0;$$

(c) $(\mathcal F, \rho_P)$ is totally bounded and as $\delta \to 0$,

$$\limsup_{n\to\infty} n^{-1/2}E^*\Bigl\|\sum_{i=1}^n \varepsilon_i\bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} \to 0;$$

(d) $(\mathcal F, \rho_P)$ is totally bounded and for $\nu_n := n^{1/2}(P_n - P)$, as $\delta \to 0$, $\limsup_{n\to\infty} E^*\|\nu_n\|_{\mathcal F_\delta} \to 0$.
To prove the theorem, let's first show that (a) implies (b). Applying Lemma 11.2.6(a) to the class $\mathcal F_\delta$ (more precisely, to the centered functions $h - Ph$ for $h \in \mathcal F_\delta$), the outer probability in (b) is bounded above by

$$2\max_{k\le n}\Pr{}^*\Bigl\{n^{-1/2}\Bigl\|\sum_{i=1}^k \bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} > \varepsilon/2\Bigr\}.$$

For any fixed $N$, we can write for $n > N$ that $\max_{k\le n} = \max(\max_{k\le N}, \max_{N<k\le n})$. By the asymptotic equicontinuity condition (Theorem 3.7.2(b)), there are an $N < \infty$ and a $\delta > 0$ such that for $N \le k \le n$,

(11.2.30) $\qquad \Pr{}^*\Bigl\{n^{-1/2}\Bigl\|\sum_{i=1}^k \bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} > \varepsilon/4\Bigr\} < \varepsilon/4,$

since $n^{-1/2} \le k^{-1/2}$. Then for $1 \le j \le N$, note that

$$\sum_{i=1}^{j} = \sum_{i=1}^{N+j} - \sum_{i=j+1}^{N+j},$$

and inequalities for $\sum_{i=1}^N$ also hold for $\sum_{i=j+1}^{N+j}$. Thus (11.2.30) holds also, with $\varepsilon/4$ replaced by $\varepsilon/2$, for $1 \le k \le N$, and (b) holds.

Now to prove (b) implies (c), we will apply Theorem 11.2.20, with $D = T := \mathcal F$ and $X_{n,j}(f) := \varepsilon_j(f(X_j) - Pf)$. Then $EX_{n,j}(f) = 0$ for all $f \in \mathcal F$, and (b) gives hypothesis (iii). For hypothesis (i) (11.2.21), again writing $1_{\{\cdots\}}$ for an indicator, we have

$$n\Pr\{|f(X_j) - Pf| > \theta n^{1/2}\} \le \theta^{-2}E\bigl(|f(X_j) - Pf|^2\, 1_{\{|f(X_j) - Pf| > \theta n^{1/2}\}}\bigr) \to 0$$

as $n \to \infty$ by dominated convergence. Hypothesis (ii) holds since $\mathcal F$ is totally bounded for $\rho_P$. So all three hypotheses of Theorem 11.2.20, and hence its conclusion (11.2.22), hold for the given $X_{n,j}$. In proving (c) the following will be helpful.
11.2.31 Lemma Let $\xi_i$, $i = 1, 2, \ldots$, be i.i.d. nonnegative random variables such that

(11.2.32) $\qquad M := \sup_{t>0} t^2\Pr\{\xi_i > t\} < \infty.$

Then, for all $r$ such that $0 < r < 2$,

(11.2.33) $\qquad \sup_n n^{-r/2}E\bigl(\max_{i\le n}\xi_i^r\bigr) \le 1 + \dfrac{Mr}{2-r}.$

Proof We have, as in the proof of Theorem 11.2.11,

$$n^{-r/2}E\max_{i\le n}\xi_i^r = rn^{-r/2}\int_0^\infty t^{r-1}\Pr\bigl\{\max_{i\le n}\xi_i > t\bigr\}\,dt \le 1 + rn^{1-r/2}\int_{n^{1/2}}^\infty t^{r-1}\Pr\{\xi_1 > t\}\,dt \le 1 + Mrn^{1-r/2}\int_{n^{1/2}}^\infty t^{r-3}\,dt = 1 + \frac{Mr}{2-r},$$

proving (11.2.33).
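A quick Monte Carlo sanity check of (11.2.33), with a construction of my own: $\xi_i$ is a truncated Pareto variable with $t^2\Pr(\xi_i > t) \le 1$, so the lemma applies with $M = 1$.

```python
import numpy as np

rng = np.random.default_rng(4)
r, M = 1.5, 1.0              # M = 1 >= sup_t t^2 Pr(xi > t) for the variable below
bound = 1 + M * r / (2 - r)  # right side of (11.2.33); here = 4

for n in (10, 100):
    # xi = min(U^{-1/2}, 30) with U uniform: Pr(xi > t) = t^{-2} for 1 <= t < 30
    xi = np.minimum(rng.uniform(size=(20000, n)) ** -0.5, 30.0)
    lhs = n ** (-r / 2) * np.mean(np.max(xi, axis=1) ** r)
    assert lhs <= bound
```

The truncation keeps the simulation well behaved without increasing $M$, so the inequality is tested at a nontrivial margin.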
Now continuing the proof that (b) implies (c): for $\xi_i := \|f(X_i) - Pf\|_{\mathcal F}^*$, condition (11.2.32) is implied by equation (11.2.22). This implies, for example for $r = 3/2$, that the sequence $\{\max_{i\le n}\xi_i/\sqrt n\}_{n\ge 1}$ is bounded in $\mathcal L^r$ and so uniformly integrable. But, again by (11.2.22), we also have, letting $\gamma \to 0$, that $\max_{i\le n}\xi_i/\sqrt n \to 0$ in probability, so

(11.2.34) $\qquad E\max_{i\le n}\xi_i/\sqrt n \to 0 \quad\text{as } n \to \infty.$

Next, Theorem 11.2.11 will be applied, where the original sample space $S$ as in (11.2.5) is replaced by $S\times\{-1, 1\}$, the random variables are $(X_j, \varepsilon_j)$, and in place of $\mathcal F$, we take the class $\mathcal G$ of functions $g := g_h$ where, for each $h \in \mathcal F_\delta$,

$$g(x, s) := s(h(x) - Ph) \quad\text{for } s = \pm 1, \text{ so that } g(X_j, \varepsilon_j) = \varepsilon_j(h(X_j) - Ph).$$

Also let $p := 1$. Then by (11.2.34), the hypothesis of Theorem 11.2.11 holds, and in the conclusion, the first term on the right is $o(n^{1/2})$ as $n \to \infty$ and $\delta \to 0$. In the definition of $u$, by the Lévy inequality (9.1.9), we have $u \le 2t_1$ if

$$\Pr\Bigl(\Bigl\|\sum_{i=1}^n g(X_i, \varepsilon_i)\Bigr\|_{\mathcal G} > t_1\Bigr) \le 4^{-p-1},$$

and $t_1 = o(n^{1/2})$ also by (b). So (c) follows. (c) implies (d) by desymmetrization, Lemma 11.2.12. (d) implies (a) by Markov's inequality and the usual asymptotic equicontinuity condition, Theorem 3.7.2. The proof of Theorem 11.2.28 is complete.
Next, Theorem 11.2.28 will be extended to multipliers other than Rademacher variables. For any real random variable $Y$, recall that

$$\Lambda_{2,1}(Y) := \int_0^\infty \bigl[\Pr(|Y| > t)\bigr]^{1/2}\,dt.$$

Then $\Lambda_{2,1}(Y) < \infty$ implies $E(Y^2) < \infty$, and for any $\delta > 0$, $E|Y|^{2+\delta} < \infty$ implies $\Lambda_{2,1}(Y) < \infty$.
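When the tail is explicit, $\Lambda_{2,1}$ can be computed directly. For $Y$ uniform on $[0,1]$, $\Pr(|Y| > t) = 1 - t$ on $[0,1]$ and $\Lambda_{2,1}(Y) = \int_0^1 (1-t)^{1/2}\,dt = 2/3$. A small numeric check (my own illustration):

```python
import numpy as np

# Lambda_{2,1}(Y) = int_0^inf [Pr(|Y| > t)]^{1/2} dt for Y ~ Uniform[0, 1]:
# the tail is 1 - t on [0, 1], and the integral equals 2/3 exactly.
t = np.linspace(0.0, 1.0, 100001)
g = np.sqrt(1.0 - t)
lam21 = float(np.sum((g[:-1] + g[1:]) / 2 * np.diff(t)))   # trapezoid rule
```

The manual trapezoid sum avoids depending on any particular NumPy quadrature helper.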
11.2.35 Lemma Let $(\Omega, \mathcal S, P)$ be a probability space and $\mathcal F \subset \mathcal L^1(P)$. Assume the usual hypotheses (11.2.5) and that furthermore, by another product space, there are i.i.d. symmetric real random variables $\xi_j := \xi(j)$ independent of the $X_j$ and $\varepsilon_j$. Then for integers $0 \le m < n < \infty$ we have

(11.2.36) $\quad E|\xi_1|\,E^*\Bigl\|n^{-1/2}\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} \le E^*\Bigl\|n^{-1/2}\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F} \le mn^{-1/2}\bigl(E^*\|f(X_1)\|_{\mathcal F}\bigr)E\bigl(\max_{i\le n}|\xi_i|\bigr) + \Lambda_{2,1}(\xi_1)\max_{m<k\le n}E^*\Bigl\|k^{-1/2}\sum_{i=m+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}.$

If the variables $\xi_i$ have $E\xi_i = 0$ but are not necessarily symmetric, then (11.2.36) holds with the following changes: $E|\xi_1|$ at the left is replaced by $E|\xi_1|/2$, the first summand on the right is multiplied by 2, and the second by 3.
Proof By symmetry, the joint distribution of the $\xi_i$ is the same as that of the $\varepsilon_i|\xi_i|$. Thus, writing $\|\cdot\| := \|\cdot\|_{\mathcal F}$, we have by Theorem 3.2.7 (the one-sided Tonelli–Fubini theorem), then by Jensen's inequality, Lemma 11.2.7(a),

$$E^*\Bigl\|\sum_{i=1}^n \xi_i f(X_i)\Bigr\| = E^*\Bigl\|\sum_{i=1}^n \varepsilon_i|\xi_i|f(X_i)\Bigr\| \ge E^*\Bigl\|\sum_{i=1}^n \varepsilon_i E|\xi_1|\,f(X_i)\Bigr\| = E|\xi_1|\,E^*\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|.$$
The first inequality in the lemma follows. For the second, we have

$$E(n) := E^*\Bigl\|\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F} = E^*\Bigl\|\sum_{i=1}^n \varepsilon_i|\xi_i|f(X_i)\Bigr\|_{\mathcal F} = E^*\Bigl\|\sum_{i=1}^n \Bigl(\int_0^\infty 1_{\{t<|\xi_i|\}}\,dt\Bigr)\varepsilon_i f(X_i)\Bigr\|_{\mathcal F} \le E^*\int_0^\infty \Bigl\|\sum_{i=1}^n 1_{\{t<|\xi_i|\}}\varepsilon_i f(X_i)\Bigr\|_{\mathcal F}\,dt.$$

Let $F_t(X, \varepsilon, \xi) := \bigl\|\sum_{i=1}^n 1_{\{t<|\xi_i|\}}\varepsilon_i f(X_i)\bigr\|_{\mathcal F}$. Then

$$E^*\int_0^\infty F_t(X, \varepsilon, \xi)\,dt \le \int_0^\infty E^*F_t(X, \varepsilon, \xi)\,dt,$$

the latter inequality by Fatou's lemma with stars (3.2.12). For each set $G \subset \{1, 2, \ldots, n\}$ and $t > 0$ let

$$H(G, t) := \bigl\{\xi := \{\xi_i\}_{i=1}^n \in \mathbb R^n:\ t < |\xi_i| \text{ if and only if } i \in G\bigr\}.$$

Given $t$, the sets $H(G, t)$ are disjoint and measurable with union $\mathbb R^n$, so by independence (Lemma 3.2.6) and (3.2.9),

$$E^*F_t(X, \varepsilon, \xi) = \sum_G P\bigl(\xi \in H(G, t)\bigr)E^*\Bigl\|\sum_{i\in G}\varepsilon_i f(X_i)\Bigr\|_{\mathcal F} = \sum_{r=1}^n \Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} = r\Bigr)E^*\Bigl\|\sum_{i=1}^r \varepsilon_i f(X_i)\Bigr\|_{\mathcal F},$$

since the $(\varepsilon_i, X_i)$ are i.i.d. Thus $E(n)$ is bounded above by
$$\int_0^\infty \sum_{r=1}^n \Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} = r\Bigr)E^*\Bigl\|\sum_{i=1}^r \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}\,dt$$
$$\le \int_0^\infty \Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} \ge 1\Bigr)dt\ \max_{k\le m}E^*\Bigl\|\sum_{i=1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} + \int_0^\infty \sum_{k=m+1}^n \Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} = k\Bigr)k^{1/2}\,dt\ \max_{m<k\le n}k^{-1/2}E^*\Bigl\|\sum_{i=m+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}.$$

Now,

$$\sum_{k=m+1}^n k^{1/2}\Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} = k\Bigr) \le E\Bigl(\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}}\Bigr)^{1/2}\Bigr) \le \Bigl(E\sum_{i=1}^n 1_{\{|\xi_i|>t\}}\Bigr)^{1/2} = \bigl(n\Pr(|\xi_1| > t)\bigr)^{1/2}.$$

Thus

$$E(n) \le m\Bigl(\int_0^\infty \Pr\bigl\{\max_{i\le n}|\xi_i| > t\bigr\}\,dt\Bigr)E^*\|f(X_1)\|_{\mathcal F} + n^{1/2}\Lambda_{2,1}(\xi_1)\max_{m<k\le n}k^{-1/2}E^*\Bigl\|\sum_{i=m+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}$$
$$= mE^*\|f(X_1)\|_{\mathcal F}\,E\max_{i\le n}|\xi_i| + n^{1/2}\Lambda_{2,1}(\xi_1)\max_{m<k\le n}E^*\Bigl\|k^{-1/2}\sum_{i=m+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F};$$

dividing by $n^{1/2}$ gives (11.2.36),
finishing the proof for the symmetric case. When $\xi_1$ is not symmetric, but $E\xi_1 = 0$, take a product of $2n$ factors so that $\xi_1, \ldots, \xi_{2n}$ are i.i.d. Let $\xi'_j := \xi_j - \xi_{n+j}$. Then $\xi'_1, \ldots, \xi'_n$ are i.i.d. and symmetric. We have

$$E^*\Bigl\|\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F} \le E^*\Bigl\|\sum_{i=1}^n \xi'_i f(X_i)\Bigr\|_{\mathcal F}$$

by Jensen's inequality (11.2.8) applied to functions $g(X, \xi) := \xi f(X)$. The lemma holds for the $\xi'_j$. Clearly, $E(\max_{i\le n}|\xi'_i|) \le 2E(\max_{i\le n}|\xi_i|)$ and

$$\Lambda_{2,1}(\xi'_1) = \int_0^\infty \bigl(\Pr(|\xi'_1| > t)\bigr)^{1/2}dt \le \int_0^\infty \bigl(2\Pr(|\xi_1| > t/2)\bigr)^{1/2}dt \le 3\Lambda_{2,1}(\xi_1),$$

as claimed. For the lower bound, it is easily seen that

$$E^*\Bigl\|\sum_{i=1}^n \xi'_i f(X_i)\Bigr\|_{\mathcal F} \le 2E^*\Bigl\|\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F},$$

and the conclusion follows.
Next, here is a characterization of Donsker classes in terms of multipliers $\xi_i$.

11.2.37 Theorem Let $\mathcal F$ be a class such that the hypotheses (11.2.5) hold and $\{f - Pf:\ f \in \mathcal F\}$ has a finite envelope function (11.2.29). Let the $\xi_j$ be i.i.d. centered real random variables, independent of $X_1, X_2, \ldots$ (specifically, defined on a different factor of a product probability space), such that $E\xi_1 = 0$, $E|\xi_1| > 0$, and $\Lambda_{2,1}(\xi_1) < \infty$. Then $\mathcal F$ is Donsker for $P$ if and only if both $(\mathcal F, \rho_P)$ is totally bounded and, as $\delta \to 0$,

(11.2.38) $\qquad \limsup_{n\to\infty} E^*\Bigl\|n^{-1/2}\sum_{i=1}^n \xi_i\bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} \to 0.$

Proof We can assume that each $f \in \mathcal F$ is centered, replacing $f$ by $f - Pf$. Recall that (a) implies (c) in Theorem 11.2.28. Also, we have $\sup_{t>0} t\bigl(\Pr(|\xi_1| > t)\bigr)^{1/2} \le \Lambda_{2,1}(\xi_1) < \infty$, so (11.2.32) holds for $|\xi_1|$. Thus by (11.2.36), it follows that (11.2.38) holds. Conversely, assume that $\mathcal F$ is $\rho_P$-totally bounded and (11.2.38) holds. Then by the first displayed inequality in Lemma 11.2.35, applied to $\mathcal F_\delta$ for each $\delta > 0$, and since $E|\xi_1| > 0$, it follows that (c) in Theorem 11.2.28 holds, so by that theorem, $\mathcal F$ is Donsker for $P$.
Now, the proof of the bootstrap central limit theorem in probability, Theorem 11.2.1, will be finished. Let $\mathcal F$ be $P$-Donsker. We can, again, assume that $Pf = 0$ for all $f \in \mathcal F$, since each $f \in \mathcal F$ can be replaced by $f - Pf$ without changing $\nu_n(f)$ or $\nu_n^B(f)$. Recall that the conclusion is equivalent to a statement in terms of the metric $\beta_{\mathcal F}$. Since $\mathcal F$ is totally bounded for $\rho_P$ by Theorem 3.7.2, for any $\tau > 0$ there is an $N(\tau) < \infty$ and a map $\pi_\tau$ from $\mathcal F$ into a subset having $N(\tau)$ elements such that $\rho_P(\pi_\tau f, f) < \tau$ for all $f \in \mathcal F$. Let $\nu^B_{n,\tau}(f) := \nu_n^B(\pi_\tau f)$ and $G_{P,\tau}(f) := G_P(\pi_\tau f)$. Then for any bounded Lipschitz function $H$ on $\ell^\infty(\mathcal F)$ for $\|\cdot\|_{\mathcal F}$,

$$|E_B H(\nu_n^B) - EH(G_P)| \le |E_B H(\nu_n^B) - E_B H(\nu^B_{n,\tau})| + |E_B H(\nu^B_{n,\tau}) - EH(G_{P,\tau})| + |EH(G_{P,\tau}) - EH(G_P)| =: A_{n,\tau}(H) + B_{n,\tau}(H) + C_\tau(H).$$

Let $BL(1)$ denote the set of all functions $H$ on $\ell^\infty(\mathcal F)$ with bounded Lipschitz norm $\le 1$. Then $\sup_{H\in BL(1)}C_\tau(H) \to 0$ as $\tau \downarrow 0$ since, by definition of Donsker class, $G_P$ is a.s. uniformly continuous with respect to $\rho_P$, so $G_{P,\tau} \to G_P$ uniformly on $\mathcal F$ as $\tau \downarrow 0$. For fixed $\tau > 0$, the supremum for $H \in BL(1)$ of $B_{n,\tau}(H)$ is measurable and converges to 0 a.s. by Theorem 11.2.2, the finite-dimensional bootstrap central limit theorem, in $\mathbb R^{N(\tau)}$. Recall the definition of $\|\cdot\|_{\mathcal F_\delta}$ from before Theorem 11.2.28. For the $A_{n,\tau}$ term, we have $\sup_{H\in BL(1)}A_{n,\tau}(H) \le 2E_B\|\nu_n^B\|_{\mathcal F_\tau}$, so it will be enough to show that for all $\varepsilon > 0$, as $\delta \downarrow 0$,

(11.2.39) $\qquad \limsup_{n\to\infty}\Pr{}^*\bigl\{E_B\|\nu_n^B\|_{\mathcal F_\delta} > \varepsilon\bigr\} \to 0.$

Apply (11.2.18) with $B := \ell^\infty(\mathcal F_\delta)$ and $v_i := \delta_{X_i(\omega)}$ for each $\omega$, and the starred Tonelli–Fubini theorem where one coordinate is discrete (3.2.9). We get

$$E^*E_B\|\nu_n^B\|_{\mathcal F_\delta} = n^{-1/2}E^*E_B\Bigl\|\sum_{i=1}^n \bigl(\delta_{X_i^B} - P_n\bigr)\Bigr\|_{\mathcal F_\delta} \le n^{-1/2}\frac{e}{e-1}E^*\Bigl\|\sum_{i=1}^n (N_i - 1)\bigl(\delta_{X_i} - P_n\bigr)\Bigr\|_{\mathcal F_\delta}$$
$$\le n^{-1/2}\frac{e}{e-1}E^*\Bigl\|\sum_{i=1}^n (N_i - 1)\delta_{X_i}\Bigr\|_{\mathcal F_\delta} + n^{-1/2}\frac{e}{e-1}E^*\Bigl(\Bigl|\sum_{i=1}^n (N_i - 1)\Bigr|\,\|P_n\|_{\mathcal F_\delta}\Bigr) =: S + T.$$

Since $\mathcal F$ is Donsker for $P$, the limit as $\delta \downarrow 0$ of the $\limsup$ as $n \to \infty$ of $S$ is 0 by Theorem 11.2.37. For $T$ we can apply independence, Lemma 3.2.6, and (3.2.9). Then $E|\sum_{i=1}^n (N_i - 1)| \le n^{1/2}$, so $T \le (e/(e-1))E^*\|P_n\|_{\mathcal F_\delta}$.
Recalling that each $f \in \mathcal F$ is taken to be centered, we have $E^*\|P_n\|_{\mathcal F_\delta} = E^*\|n^{-1/2}\nu_n + P\|_{\mathcal F_\delta} \to 0$ as $n \to \infty$ and $\delta \to 0$ by Theorem 11.2.28(d). Thus Theorem 11.2.1 is proved.
11.3 Other aspects of the bootstrap

Efron (1979) invented the bootstrap and by now there is a very large literature about it. This section will address some aspects of the application of the Giné–Zinn theorems. These do not cover the entire field by any means. For example, some statistics of interest, such as $\max(X_1, \ldots, X_n)$, are not averages $(f(X_1) + \cdots + f(X_n))/n$ as $f$ ranges over a class $\mathcal F$. Some bootstrap limit theorems are stated in probability, and others for almost
sure convergence. To compare their usefulness, first note that almost sure convergence is not always preferable to convergence in probability:
Example. Let $X_n$ be a sequence of real-valued random variables converging to some $X_0$ in probability but not almost surely. Then some subsequences $X_{n_k}$ converge to $X_0$ almost surely. Suppose this occurs whenever $n_k \ge k^2$ for all $k$. Let $Y_n := X_{2^k}$ for $2^k \le n < 2^{k+1}$, where $k = 0, 1, \ldots$. Then $Y_n \to X_0$ almost surely, but in a sense $X_n \to X_0$ faster, although it converges only in probability. Another point is that almost sure convergence is applicable in statistics when
inferences will be made from data sets with increasing values of $n$; in other words, in the part of statistics called sequential analysis. But suppose one has a fixed value of the sample size $n$, as has generally been the case with the bootstrap. Then the probability of an error of a given size, for a given $n$, which relates to convergence in probability, may be more relevant than the question of what would happen for values of $n \to \infty$, as in almost sure convergence. The rest of this section will be devoted to confidence sets. A basic example of a confidence set is a confidence interval. As an example, suppose $X_1, \ldots, X_n$ are i.i.d. with distribution $N(\mu, \sigma^2)$, where $\sigma^2$ is known but $\mu$ is not. Then $\overline X := (X_1 + \cdots + X_n)/n$ has a distribution $N(\mu, \sigma^2/n)$. Thus

$$P\bigl(\overline X < \mu - 1.96\sigma/n^{1/2}\bigr) = 0.025 = P\bigl(\overline X > \mu + 1.96\sigma/n^{1/2}\bigr).$$

So we have 95% confidence that the unknown $\mu$ belongs to the interval $[\overline X - 1.96\sigma/n^{1/2},\ \overline X + 1.96\sigma/n^{1/2}]$, which is then called a 95% confidence interval for $\mu$.
Next, suppose $X_1, \ldots, X_n$ are i.i.d. in $\mathbb R^k$ with a normal (Gaussian) distribution $N(\mu, \sigma^2 I)$ where $I$ is the identity matrix. Suppose $\alpha > 0$ and $M_\alpha = M_\alpha(k)$ is such that $N(0, I)\{x:\ |x| > M_\alpha\} = \alpha$. Then $n^{1/2}(\overline X - \mu)/\sigma$ has distribution $N(0, I)$, so $P\bigl(|\overline X - \mu| > M_\alpha\sigma/n^{1/2}\bigr) = \alpha$. Thus, the ball with center $\overline X$ and radius $M_\alpha\sigma/n^{1/2}$ is called a $100(1-\alpha)$% confidence set for the unknown $\mu$. When the distribution of the $X_i$ is not necessarily normal, but has finite variance, then the distribution of $\overline X$ will be approximately normal by the central limit theorem for $n$ large, so we get some approximate confidence sets.

Now let's extend these ideas to the bootstrap. Let $X_1, \ldots, X_n$ be i.i.d. from an otherwise unknown distribution $P$. Let $P_n$ be the empirical measure formed from $X_1, \ldots, X_n$. Let $\mathcal F$ be a universal Donsker class (Section 10.1). Then we know from the Giné–Zinn theorem in the last section that $\nu_n^B$ and $\nu_n$ have asymptotically the same distribution on $\mathcal F$. By repeated resampling, given a small $\alpha > 0$ such as $\alpha = 0.05$, one can find $M = M(\alpha)$ such that approximately $P(\|\nu_n^B\|_{\mathcal F} > M) = \alpha$. Then

$$\{Q:\ \|Q - P_n\|_{\mathcal F} \le M/n^{1/2}\}$$

is an approximate $100(1-\alpha)$% confidence set for $P$.
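For the class $\mathcal F$ of half-line indicators $\{1_{(-\infty,s]}\}$ (a classical universal Donsker class), $\|\nu_n^B\|_{\mathcal F} = \sqrt n\,\sup_s|F_n^B(s) - F_n(s)|$, and since the resampled points are a subset of the data points, the supremum is attained at the order statistics. The quantile $M(\alpha)$ can then be estimated by resampling; this sketch and its names are my own.

```python
import numpy as np

rng = np.random.default_rng(6)
n, B, alpha = 300, 2000, 0.05
x = np.sort(rng.standard_normal(n))          # the fixed data set
Fn = np.arange(1, n + 1) / n                 # F_n evaluated at the order statistics

def sup_stat(xb, x, Fn):
    """sqrt(n) * sup_s |F_n^B(s) - F_n(s)|, evaluated at the data points."""
    Fb = np.searchsorted(np.sort(xb), x, side='right') / len(xb)
    return np.sqrt(len(x)) * np.max(np.abs(Fb - Fn))

stats = np.array([sup_stat(rng.choice(x, n, replace=True), x, Fn)
                  for _ in range(B)])
M = np.quantile(stats, 1 - alpha)
# {Q : ||Q - P_n||_F <= M / sqrt(n)} is then an approximate 95% confidence set for P.
```

For large $n$ the estimated $M$ should be near the 95% point of the Kolmogorov statistic's limiting distribution, about 1.36.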
**11.4 Further Giné–Zinn bootstrap central limit theorems

Giné and Zinn (1990) proved two theorems giving criteria for bootstrap central limit theorems, one for the almost sure bootstrap central limit property, the other in (outer) probability, both under some measurability conditions. Later, in work by Strobl (1994) and van der Vaart and Wellner (1996), the measurability conditions were removed from the direct parts of each theorem (that the Donsker property, together with an $L^2$ envelope in the almost sure case, implies the corresponding bootstrap central limit theorem). Giné (1997) gives an exposition, with proofs, of the resulting forms of the theorems. A law $\mu$ on the Borel $\sigma$-algebra of a metric space is said to be Radon if it is tight and thus concentrated in a separable subspace. The theorems then are, first (Giné, 1997, Theorem 2.2):

Theorem A If $\mathcal F$ is any Donsker class for a law $P$ then the bootstrap central limit theorem in probability also holds for $\mathcal F$ and $P$. Conversely, if $(S, \mathcal S, P)$ is any probability space, $\mathcal F$ is a class of real-valued measurable functions on $S$ such that $F_{\mathcal F}(x) := \sup_{f\in\mathcal F}|f(x)| < \infty$ for all $x \in S$, $\mathcal F$ is image admissible Suslin, and there exists a Gaussian process $G$ indexed by $\mathcal F$ with Radon law on $\ell^\infty(\mathcal F)$ such that $\beta_{\mathcal F}(\nu_n^B, G) \to 0$ in outer probability as $n \to \infty$, then $\mathcal F$ is Donsker for $P$ and $G = G_P$. Only the first (direct) statement in Theorem A is proved in Section 11.2. In the almost sure case, Giné (1997, Theorem 3.2) is:
Theorem B Let $\mathcal F$ be any Donsker class for a law $P$ and assume that $E\bigl(\|f(X_1) - Pf\|_{\mathcal F}^*\bigr)^2 < \infty$. Then the almost sure bootstrap central limit theorem also holds for $\mathcal F$ and $P$. Conversely, if a class $\mathcal F$ of measurable real-valued functions is image admissible Suslin, has an everywhere finite envelope $F_{\mathcal F}$, and there is a centered Gaussian process $G$ indexed by $\mathcal F$ having a Radon law in $\ell^\infty(\mathcal F)$ and such that $\beta_{\mathcal F}(\nu_n^B, G) \to 0$ almost uniformly as $n \to \infty$, then $\mathcal F$ is Donsker for $P$, $EF_{\mathcal F}^2 < \infty$, and $G = G_P$. Note that the image admissible Suslin property implies that $EF_{\mathcal F}^2$ is well-defined without any star.
The proof of a form of Theorem B in Giné and Zinn (1990) relies on results from Araujo and Giné (1980), Fernique (1975, 1985), Giné and Zinn (1984, 1986), Jain and Marcus (1978), and Ledoux and Talagrand (1988). Especially when the needed material from these references is included, the proof is quite long (over 50 pages in a version I prepared). The proof of the "in probability" version is shorter, but still rather long, as is the proof of it in one direction in Section 11.2.
Problems

1. If the sample space is the unit circle S¹ := {(cos θ, sin θ) : 0 ≤ θ < 2π}, and distribution functions are defined by evaluating laws on arcs {(cos θ, sin θ) : 0 ≤ θ ≤ x}, show that the quantity in Corollary 11.1.2(c), sup_x H_{mn}(x) − inf_y H_{mn}(y), is invariant under rotations of the circle.

2. (Continuation) For m = n = 20, find the approximation given by Corollary 11.1.2(c) to the probability that there exists an arc A := {(cos θ, sin θ) : a ≤ θ ≤ b} such that X_j ∈ A and Y_j ∉ A for j = 1, …, 20, if all 40 variables are i.i.d. from the same continuous distribution on the circle. Hint: Not many terms of the series should be needed.

3. In Corollary 11.1.2, (i) show that the series in (b) and (c) diverge when u = 0. (ii) Show that for u = 0, the probabilities equal 1 for any m ≥ 1 and n ≥ 1. (iii) Show that the three expressions on the right are all less than 1 for u > 0 and converge to 1 as u ↓ 0, so that the corresponding distribution functions are continuous everywhere. Hint: For (b) and (c) use the left sides.
The Two-Sample Case, the Bootstrap, and Confidence Sets
4. Let X₁, X₂, … be i.i.d. in R with a continuous distribution. Given n, arrange X₁, …, X_n in order as X_(1) < X_(2) < … < X_(n). Let X_j^B, j = 1, …, n, be an i.i.d. (P_n) sample, so that P(X_j^B = X_(i)) = 1/n for i, j = 1, …, n. Let X_(1)^B := min_{1≤j≤n} X_j^B.
?
In the next two problems, let (X, A, P) be a probability space and suppose that F ⊂ L²(X, A, P) is a Donsker class for P. Let X₁, X₂, … be i.i.d. (P). Convergence in law, conditional on P_n or P_{2n}, in outer probability as n → ∞, is defined just as in the definition of bootstrap Donsker class early in Section 11.2, except for some interchanges of n and 2n.

5. Given P_n formed from X₁, …, X_n, let 2n points X(j, ω) := X_j^B, j = 1, …, 2n, be i.i.d. conditional on P_n, and P_{2n}^B := (1/(2n)) Σ_{j=1}^{2n} δ_{X(j,ω)}. Show that (2n)^{1/2}(P_{2n}^B − P_n), conditional on P_n, converges in law to G_P, in outer probability.

6. Given P_{2n} formed from X₁, …, X_{2n}, let n points X(j, ω) := X_j^B, j = 1, …, n, be i.i.d. (P_{2n}), and P_n^B := (1/n) Σ_{j=1}^{n} δ_{X(j,ω)}. Show that n^{1/2}(P_n^B − P_{2n}), conditional on P_{2n}, converges in law to G_P, in outer probability.
7. Prove the statements before Lemma 11.2.35, that Λ_{2,1}(Y) < ∞ implies E(Y²) < ∞, and for any δ > 0, E|Y|^{2+δ} < ∞ implies Λ_{2,1}(Y) < ∞.
Notes

Notes to Section 11.1. I do not have a reference for Theorem 11.1.1. A form of the theorem for classes of sets was given in Dudley (1978). Corollary 11.1.2 is classical and gives asymptotic distributions of so-called Kolmogorov-Smirnov (parts (a), (b)) and Kuiper (part (c)) statistics. See the notes to RAP, Section 12.3. The results are sometimes stated under the unnecessary hypothesis that m/n converges to some c, 0 < c < ∞, as m, n → ∞.

Notes to Section 11.2. The section is based mainly on Giné (1997), an update (as regards measurability) of the fundamental theorems of Giné and Zinn (1990). The proofs also include some new elements. Giné (1997) gives a lot of references, not reproduced here, on different parts of the proof. Some special cases of the Giné-Zinn theorems were published earlier by Bickel and Freedman (1981) and Gaenssler (1986), among others.
Notes to Section 11.3. Efron (1979) discovered the bootstrap. Three books on it are Hall (1992), Efron and Tibshirani (1993), and Shao and Tu (1995).
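Efron's resampling idea mentioned in these notes is easy to try numerically. The following minimal Python sketch (not from the text; the function name and all parameter choices are hypothetical) forms a percentile bootstrap confidence interval by resampling the empirical measure P_n with replacement:

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for stat(data).

    Each bootstrap sample draws n points i.i.d. from the empirical
    measure P_n, i.e., from {X_1, ..., X_n} with replacement."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    reps = np.array([stat(rng.choice(data, size=n, replace=True))
                     for _ in range(n_boot)])
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example: interval for the mean of a skewed sample.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)
lo, hi = bootstrap_ci(x, rng=1)
print(f"95% bootstrap CI for the mean: [{lo:.3f}, {hi:.3f}]")
```

The percentile interval is only the simplest of the bootstrap confidence procedures treated in the books cited above.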
References
Araujo, A., and Giné, E. (1980). The Central Limit Theorem for Real and Banach Valued Random Variables. Wiley, New York.
Bickel, P. J., and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist. 9, 1196-1217.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899-929; Correction, ibid. 7 (1979), 909-911.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1-26.
Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.
Fernique, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. École d'été de probabilités de St-Flour, 1974. Lecture Notes in Math. (Springer) 480, 1-96.
Fernique, X. (1985). Sur la convergence étroite des mesures gaussiennes. Z. Wahrscheinlichkeitsth. verw. Gebiete 68, 331-336.
Gaenssler, P. (1986). Bootstrapping empirical measures indexed by Vapnik-Chervonenkis classes of sets. In Probability Theory and Mathematical Statistics (Vilnius, 1985), ed. Yu. V. Prohorov, V. A. Statulevičius, V. V. Sazonov, and B. Grigelionis. VNU Science Press, Utrecht, 467-481.
Giné, E. (1997). Lectures on some aspects of the bootstrap. In Lectures on Probability Theory and Statistics, École d'été de probabilités de Saint-Flour (1996), ed. P. Bernard. Lecture Notes in Math. (Springer) 1665, 37-151.
Giné, E., and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12, 929-989.
Giné, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces, Proc. Conf. Zaragoza, 1985. Lecture Notes in Math. (Springer) 1221, 50-113.
Giné, E., and Zinn, J. (1990). Bootstrapping general empirical measures. Ann. Probab. 18, 851-869.
Hall, Peter (1992). The Bootstrap and Edgeworth Expansion. Springer, New York.
Jain, N. C., and Marcus, M. B. (1978). Continuity of subgaussian processes. In Probability on Banach Spaces, ed. J. Kuelbs, Advances in Probability and Related Topics 4, 81-196. Dekker, New York.
Ledoux, M., and Talagrand, M. (1988). Characterization of the law of the iterated logarithm in Banach spaces. Ann. Probab. 16, 1242-1264.
Shao, Jun, and Tu, Dongsheng (1995). The Jackknife and Bootstrap. Springer, New York.
Strobl, F. (1994). Zur Theorie empirischer Prozesse. Dissertation, Mathematik, Univ. München.
van der Vaart, A. W., and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, Berlin.
12
Classes of Sets or Functions Too Large for Central Limit Theorems
12.1 Universal lower bounds

This chapter is primarily about asymptotic lower bounds for ‖P_n − P‖_F on certain classes F of functions as treated in Chapter 8, mainly classes of indicators of sets. Section 12.2 will give some upper bounds which indicate the sharpness of some of the lower bounds. Section 12.4 gives some relatively difficult lower bounds on classes such as the convex sets in R³ and lower layers in R². In preparation for this, Section 12.3 treats Poissonization and random "stopping sets" analogous to stopping times. The present section gives lower bounds in some cases which hold not only with probability converging to 1 but for all possible P_n. Definitions are as in Sections 3.1 and 8.2, with P := U(I^d) = λ_d = Lebesgue measure on I^d. Specifically, recall the classes G(α, K, d) := G_{α,K,d} of functions on the unit cube I^d ⊂ R^d with derivatives through αth order bounded by K, and the related families C(α, K, d) of sets, both defined early in Section 8.2.
12.1.1 Theorem (Bakhvalov) For P = U(I^d), any d = 1, 2, …, and α > 0, there is a γ = γ(d, α) > 0 such that for all n = 1, 2, … and all possible values of P_n, we have ‖P_n − P‖_{G(α,1,d)} ≥ γ n^{−α/d}.

Remarks. When α < d/2, this shows that G(α, K, d), K > 0, is not a Donsker class. For α ≥ d/2 the lower bound in Theorem 12.1.1 is not useful, since it is smaller than the average size of ‖P_n − P‖_{G(α,1,d)}, which is at least of order n^{−1/2}: even for one function f not constant a.e. P, E|(P_n − P)(f)| ≥ c n^{−1/2} for some c > 0. Theorem 12.1.1 gives information about the accuracy of possible methods of numerical integration in several dimensions, or "cubature," using the values of a function f ∈ G(α, K, d) at just n points chosen in advance (from the proof, it
will be seen that one has the same lower bound even if one can use any partial derivatives of f at the n points). It was in this connection that Bakhvalov (1959) proved the theorem.
Proof Given n let m := m(n) := ⌈(2n)^{1/d}⌉, where ⌈x⌉ is the smallest integer ≥ x. Decompose the unit cube I^d into m^d cubes C_i of side 1/m. Then m^d ≥ 2n.

For any P_n let S := {i : P_n(C_i) = 0}. Then card(S) ≥ n. For x^{(i)}, f_i, and f_S as defined after (8.2.9), we then have for each i,
P(f_i) = ∫_{C_i} m^{−α} f(m(x − x^{(i)})) dx = m^{−α−d} ∫_{I^d} f(y) dy = m^{−α−d} P(f).

Thus |(P_n − P)(f_S)| = P(f_S) ≥ m^{−α−d} n P(f) ≥ c n^{−α/d} for some constant c = c(d, α) > 0, while ‖f_S‖_α ≤ ‖f‖_α ≤ 1 (dividing the original f by some constant depending on d and α if necessary).
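The scaling in the proof above can be checked numerically in the simplest case d = 1, α = 1. This sketch (hypothetical code, not from the text) uses the particular bump f(y) = y(1 − y), places one rescaled copy in each cell missed by the sample, and verifies the resulting bound of order n^{−α/d} = n^{−1} for several point configurations:

```python
import numpy as np

# Bakhvalov's construction for d = 1, alpha = 1: a fixed bump f with
# f(0) = f(1) = 0, sup|f| <= 1 and sup|f'| <= 1, rescaled into each
# cell of a grid that the sample leaves empty.
def f(y):
    return y * (1.0 - y)          # |f| <= 1/4, |f'| <= 1 on [0, 1]

INT_F = 1.0 / 6.0                 # integral of f over [0, 1]

def discrepancy(points):
    """|(P_n - P)(f_S)| for f_S supported on the empty cells."""
    n = len(points)
    m = 2 * n                     # m = ceil((2n)^(1/d)) cells, d = 1
    occupied = set(np.floor(np.asarray(points) * m).astype(int))
    n_empty = m - len(occupied)   # at least m - n = n empty cells
    # P_n(f_S) = 0, so |(P_n - P)(f_S)| = P(f_S) = |S| m^{-2} int(f).
    return n_empty * INT_F / m ** 2

n = 100
rng = np.random.default_rng(0)
for pts in (rng.random(n), np.full(n, 0.5), np.linspace(0, 0.999, n)):
    # The bound gamma * n^{-alpha/d} with gamma = INT_F / 4 here.
    assert discrepancy(pts) >= INT_F / (4 * n) - 1e-12
print("lower bound", INT_F / (4 * n), "holds for all three samples")
```

Whatever the sample does, at least half the cells stay empty, which is what makes the bound independent of the point configuration.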
12.1.2 Theorem For P = U(I^d), any K > 0, and 0 < α ≤ d − 1, there is a δ = δ(α, K, d) > 0 such that for all n = 1, 2, … and all possible values of P_n, ‖P_n − P‖_{C(α,K,d)} ≥ δ n^{−α/(d−1+α)}.
Remark. Since α/(d − 1 + α) < 1/2 for α < d − 1, the classes C(α, K, d) are then not Donsker classes. For α > d − 1, C(α, K, d) is a Donsker class by Theorem 8.2.1 and Corollary 7.2.13. For α = d − 1, it is not a Donsker class (Theorem 12.4.1).

Proof Again, the construction and notation around (8.2.9) will be applied, now on I^{d−1}. Let c := Kλ_{d−1}(f)/‖f‖_α and m := m(n) := [(nc)^{1/(α+d−1)}]. Then m^{α+d−1} ≤ nc, so 1/n ≤ c m^{1−d−α}. Take n ≥ r (the result holds for n < r) for an r with M := sup_{n≥r} nc·m(n)^{1−d−α} < ∞. Let θ_n := m(n)^{α+d−1}/(nc). Then 1/M ≤ θ_n ≤ 1. Let

g_i := Kθ_n f_i/(2‖f‖_α).

Then λ_{d−1}(g_i) = 1/(2n) and ‖g_i‖_α ≤ K. Let B_i := {x ∈ I^d : 0 ≤ x_d ≤ g_i(x^{(d)})} where x^{(d)} := (x₁, …, x_{d−1}). Then for each i, λ_d(B_i) := P(B_i) = 1/(2n), and either P_n(B_i) = 0 or P_n(B_i) ≥ 1/n. Either at least half the B_i have
P_n(B_i) = 0, or at least half have P_n(B_i) ≥ 1/n. In either case,

‖P_n − P‖_{C(α,K,d)} ≥ m^{d−1}/(4n) ≥ c m^{−α}/(4M) ≥ δ n^{−α/(α+d−1)}

for some δ = δ(α, K, d) > 0.

The method of the last proof applies to convex sets in R^d, as follows.
12.1.3 Theorem (W. Schmidt) Let d = 2, 3, …. For the collection C_d of closed convex subsets of a bounded nonempty open set U in R^d, there is a constant b := b(d, U) > 0 such that for P = Lebesgue measure normalized on U, and all P_n,

sup{|(P_n − P)(C)| : C ∈ C_d} ≥ b n^{−2/(d+1)}.
Proof We can assume by a Euclidean transformation that U includes the unit ball. Take disjoint spherical caps as at the end of Section 8.4. Let each cap C have volume P(C) = 1/(2n). Then as n → ∞, the angular radius ε_n of such caps is asymptotic to c_d n^{−1/(d+1)} for some constant c_d. Thus the number of such disjoint caps is of the order of n^{(d−1)/(d+1)}. Either (P_n − P)(C) ≥ 1/(2n) for at least half of such caps, or (P_n − P)(C) = −1/(2n) for at least half of them. Thus for some constant η = η(U, d), there exist convex sets D, E, differing by a union of caps, such that

|(P_n − P)(D) − (P_n − P)(E)| ≥ η n^{(d−1)/(d+1)}/n = η n^{−2/(d+1)},

and the result follows with b = η/2.

Thus for d ≥ 4, C_d is not a Donsker class for P. If d = 3 it is not either; see Problem 3 and Dudley (1982). C₂ is a Donsker class for λ₂ on I² by Theorem 8.4.1 and Corollary 7.2.13.
12.2 An upper bound

Here, using bracketing numbers N_I as in Section 7.1, is an upper bound for sup_{B∈C} |ν_n(B)| which applies in many cases where the hypotheses of Corollary 7.2.13 on ‖ν_n‖_C fail. Let (X, A, Q) be a probability space, ν_n := n^{1/2}(Q_n − Q), and recall N_I as defined before (7.1.4).
12.2.1 Theorem Let C ⊂ A, 1 ≤ ξ < ∞, η > 2/(ξ + 1), and Θ := (ξ − 1)/(2(ξ + 1)). If for some K < ∞, N_I(ε, C, Q) ≤ exp(K ε^{−ξ}) for 0 < ε ≤ 1, then

lim_{n→∞} Pr*{‖ν_n‖_C > n^Θ (log n)^η} = 0.
The classes C = C(a, M, d) satisfy the hypothesis of Theorem
12.2.1 for
= (d - 1)/a > 1, that is, a < d - 1, by the last inequality in
Theorem 8.2.1. Then O = 2 - d-i+a Thus Theorem 12.1.2 shows that the exponent O is sharp for > 1. Conversely, Theorem 12.2.1 shows that the exponent on n in Theorem 12.1.2 cannot be improved. In Theorem 12.2.1 we cannot take < 1, for then O < 0, which is impossible even for a single set,
C = {C}, with 0 < P(C) < 1.

Proof The chaining method will be used as in Section 7.2. Let log₂ be logarithm to the base 2. For each n ≥ 3 let

(12.2.2) k(n) := [(1/2 − Θ) log₂ n − η log₂ log n].

Let N(k) := N_I(2^{−k}, C, Q), k = 1, 2, …. Then there exist A_{ki} and B_{ki}, i = 1, …, N(k), such that for any A ∈ C, there is an i ≤ N(k) with A_{ki} ⊂ A ⊂ B_{ki} and Q(B_{ki} \ A_{ki}) ≤ 2^{−k}. Let A_{01} := ∅ (the empty set) and B_{01} := X. Choose such i = i(k, A) for k = 0, 1, …. Then for each k ≥ 1,

Q(A_{k,i(k,A)} Δ A_{k−1,i(k−1,A)}) ≤ 2^{2−k},

where Δ := symmetric difference, C Δ D := (C \ D) ∪ (D \ C). For k ≥ 1 let B(k) be the collection of sets B, with Q(B) ≤ 2^{2−k}, of the form A_{ki} \ A_{k−1,l} or A_{k−1,l} \ A_{ki} or B_{ki} \ A_{ki}. Then
card(B(k)) ≤ 2N(k − 1)N(k) + N(k) ≤ 3 exp(2K 2^{kξ}).

For each B ∈ B(k), Bernstein's inequality (1.3.2) implies, for any t > 0,

(12.2.3) Pr{|ν_n(B)| > t} ≤ 2 exp(−t²/(2^{3−k} + t n^{−1/2})).

Choose δ > 0 such that δ < 1 and (2 + 2δ)/(ξ + 1) < η. Let c := δ/(1 + δ) and t := t_{n,k} := c n^Θ (log n)^η k^{−1−δ}. Then for each k = 1, …, k(n), 2^{3−k} ≥ t_{n,k} n^{−1/2}. Hence by (12.2.3), Pr{|ν_n(B)| > t_{n,k}} ≤ 2 exp(−t_{n,k}² 2^{k−4}), and

p_{nk} := Pr{sup_{B∈B(k)} |ν_n(B)| > t_{n,k}} ≤ 6 exp(2K 2^{kξ} − 2^{k−4} t_{n,k}²)
= 6 exp(2K 2^{kξ} − 2^{k−4} c² n^{2Θ} (log n)^{2η} k^{−2−2δ}).
For k ≤ k(n),

p_{nk} ≤ 6 exp(2^{kξ}(2K − 2^{−4} c² (log n)^γ)),

where γ := η(ξ + 1) − 2 − 2δ > 0 by choice of δ, using k(n) ≤ (log n)/log 2. For n large, (log n)^γ ≥ 64K/c². Then 2K ≤ 2^{−5}c²(log n)^γ and 2^{kξ} ≥ 1 for k ≥ 1, so

Σ_{k=1}^{k(n)} p_{nk} ≤ 6(log n) exp(−2^{−5} c² (log n)^γ) → 0 as n → ∞.
Let E_n := {ω : sup_{B∈B(r)} |ν_n(B)| ≤ t_{n,r}, r = 1, …, k(n)}.

Then lim_{n→∞} Pr(E_n) = 1. For any A ∈ C and n, let k := k(n) and i := i(k, A). Then for each ω ∈ E_n, |ν_n(A_{ki})| is bounded above by

Σ_{r=1}^{k} |ν_n(A_{r,i(r,A)} \ A_{r−1,i(r−1,A)})| + |ν_n(A_{r−1,i(r−1,A)} \ A_{r,i(r,A)})|
≤ 2 Σ_{r=1}^{k} t_{n,r} ≤ 2n^Θ (log n)^η Σ_{r≥1} c r^{−1−δ} ≤ 2n^Θ (log n)^η.

Also |ν_n(B_{ki} \ A_{ki})| ≤ n^Θ (log n)^η, and by (12.2.2),

n^{1/2} Q(B_{ki} \ A_{ki}) ≤ n^{1/2}/2^k ≤ 2n^Θ (log n)^η.

Hence n^{1/2} Q_n(B_{ki} \ A_{ki}) ≤ 3n^Θ (log n)^η, and |ν_n(A \ A_{ki})| ≤ 3n^Θ (log n)^η. So on E_n, |ν_n(A)| ≤ 5n^Θ (log n)^η. As η ↓ 2/(ξ + 1) the factor of 5 can be dropped. Since A ∈ C is arbitrary, the proof is done.
12.3 Poissonization and random sets

Section 12.4 will give some lower bounds ‖ν_n‖_C ≥ f(n) holding with probability converging to 1 as n → ∞, where f is a product of powers of logarithms or iterated logarithms. Such an f has the following property. A real-valued function f defined for large enough x > 0 is called slowly varying (in the sense of Karamata) if for every c > 0, f(cx)/f(x) → 1 as x → +∞.
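As a concrete instance (hypothetical code, not from the text), the function f(x) = (log x)^{1/2}/log log x, of the form appearing in Theorem 12.4.1 below, is continuous and slowly varying; numerically f(cx)/f(x) approaches 1 while f(x)/x tends to 0, two facts used at the start of the proof of Lemma 12.3.4:

```python
import math

def f(x):
    # A product of powers of (iterated) logarithms, as in Theorem 12.4.1.
    return math.log(x) ** 0.5 / math.log(math.log(x))

# Slow variation: f(cx)/f(x) -> 1 for each fixed c > 0, here c = 100,
# even though f(x) -> infinity; meanwhile f(x)/x -> 0.
for x in (1e4, 1e8, 1e16):
    print(x, f(100 * x) / f(x), f(x) / x)
```

The printed ratios move toward 1 as x grows, while the second column shrinks to 0.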
12.3.1 Lemma If f is continuous and slowly varying, then for every ε > 0 there is a δ = δ(ε) > 0 such that whenever x ≥ 1/δ and |1 − (u/x)| < δ, we have |1 − (f(u)/f(x))| < ε.

Proof For each c > 0 and ε > 0 there is an x(c, ε) such that for x ≥ x(c, ε), |f(cx)/f(x) − 1| < ε/4. Note that if x(c, ε) ≤ n, even for one c, then f(x) ≠ 0 for all x ≥ n. By the category theorem (e.g., RAP, Theorem 2.5.2), for fixed ε > 0 there is an n < ∞ such that x(c, ε) ≤ n for all c in a set dense in some interval [a, b], where 0 < a < b, and thus by continuity for all c in [a, b]. Then for c, d ∈ [a, b] and x ≥ n, |(f(cx) − f(dx))/f(x)| < ε/2. Fix c := (a + b)/2. Let u := cx. Then for u ≥ nc, we have

|f(ud/c)/f(u) − 1| < (ε/2) f(u/c)/f(u).

As u → +∞, f(u)/f(u/c) → 1. Then there is a δ > 0 with δ ≤ (b − a)/(b + a) such that for u ≥ 1/δ and all r with |r − 1| < δ, we have |f(ru)/f(u) − 1| < ε.
Recall the Poisson law P_c on N with parameter c > 0, so that P_c(k) := e^{−c} c^k/k! for k = 0, 1, …. Given a probability space (X, A, P), let U_c be a Poisson point process on (X, A) with intensity measure cP. That is, for any disjoint A₁, …, A_m in A, the U_c(A_j) are independent random variables, j = 1, …, m, and for any A ∈ A, U_c(A) has law P_{cP(A)}. Let Y_c(A) := (U_c − cP)(A), A ∈ A. Then Y_c has mean 0 on all A and still has independent values on disjoint sets.

Let x(1), x(2), … be coordinates for the product space (X^∞, A^∞, P^∞). For c > 0 let n(c) be a random variable with law P_c, independent of the x(i). Then for P_n := n^{−1}(δ_{x(1)} + … + δ_{x(n)}), n ≥ 1, and P₀ := 0, we have:

12.3.2 Lemma The process Z_c := n(c)P_{n(c)} is a Poisson process with intensity measure cP.

Proof We have laws L(U_c(X)) = L(Z_c(X)) = P_c. If X_i are independent Poisson variables with L(X_i) = P_{c(i)} and Σ_{i=1}^{m} c(i) = c, then given n := Σ_{i=1}^{m} X_i, the conditional distribution of (X₁, …, X_m) is multinomial with total n and probabilities p_i = c(i)/c. Thus U_c and Z_c have the same conditional distributions on disjoint sets given their values on X. This implies the lemma.
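Lemma 12.3.2 is also a simulation recipe: to realize a Poisson process with intensity measure cP, draw n(c) with law P_c and then n(c) i.i.d. (P) points. In the sketch below (hypothetical code, not from the text), P is taken to be the uniform law on [0, 1], and the counts on disjoint intervals come out empirically independent with Poisson means and variances cP(A_j):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 400.0   # intensity; P = uniform law on [0, 1]

def poisson_process_counts(bins, n_sims):
    """Simulate Z_c = n(c) P_{n(c)} as in Lemma 12.3.2 and return the
    counts Z_c(A_j) for the disjoint intervals determined by `bins`."""
    out = np.empty((n_sims, len(bins) - 1))
    for s in range(n_sims):
        n_c = rng.poisson(c)        # n(c) with law P_c
        pts = rng.random(n_c)       # n(c) i.i.d. P points
        out[s] = np.histogram(pts, bins=bins)[0]
    return out

counts = poisson_process_counts(bins=[0.0, 0.25, 0.5, 1.0], n_sims=4000)
# Each column should be Poisson(c * P(A_j)): mean = variance = c|A_j|.
print(counts.mean(axis=0))   # roughly [100, 100, 200]
print(counts.var(axis=0))    # roughly [100, 100, 200]
print(np.corrcoef(counts[:, 0], counts[:, 1])[0, 1])   # near 0
```

The near-zero sample correlation between counts on disjoint intervals reflects the independent-pieces property of U_c.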
From here on, the version U_c = Z_c will be used. Thus for each ω, U_c(·)(ω) is a countably additive integer-valued measure of total mass U_c(X)(ω) = n(c)(ω). Then

(12.3.3) Y_c = n(c)P_{n(c)} − cP = n(c)(P_{n(c)} − P) + (n(c) − c)P,
Y_c/c^{1/2} = (n(c)/c)^{1/2} ν_{n(c)} + (n(c) − c)c^{−1/2} P.

The following shows that the empirical process ν_n is asymptotically "as large" as a corresponding Poisson process.
12.3.4 Lemma Let (X, A, P) be a probability space and C ⊂ A. Assume that for each n and constant t, sup_{A∈C} |(P_n − tP)(A)| is measurable. Let f be a continuous, slowly varying function such that as x → +∞, f(x) → +∞. For b > 0 let

g(b) := liminf_{x→+∞} Pr{sup_{A∈C} |Y_x(A)| > b f(x) x^{1/2}}.

Then for any a < b,

liminf_{n→∞} Pr{sup_{A∈C} |ν_n(A)| > a f(n)} ≥ g(b).

Proof It follows from Lemma 12.3.1 that f(x)/x → 0 as x → ∞. From Lemma 12.3.2 and (12.3.3), sup_{A∈C} |Y_x(A)| is measurable. As x → +∞, Pr(n(x) > 0) → 1 and n(x)/x → 1 in probability. If the lemma is false, there are a θ < g(b) and a sequence m_k → +∞ with, for each m = m_k,

Pr{sup_{A∈C} |ν_m(A)| > a f(m)} ≤ θ.
Choose 0 < ε < 1/3 such that a(1 + 7ε) < b. Then let 0 < δ < 1/2 be such that δ ≤ δ(ε) in Lemma 12.3.1 and (1 + δ)(1 + 5ε) ≤ 1 + 6ε. We may assume that for all k, m = m_k ≥ 2/δ and (1 + 2ε)f(m)^{−1/2} ≤ aε.

Set δ_m := (f(m)/m)^{1/2}. Then since f(x)/x → 0 we may assume δ_m ≤ δ/2 for all m = m_k. Then for any m = m_k, if (1 − δ_m)m ≤ n ≤ m, we have mP_m(A) ≥ nP_n(A) for all A ∈ A, so m(P_m − P)(A) ≥ n(P_n − P)(A) − mδ_m. Conversely nP_n(A) ≥ mP_m(A) − mδ_m and

n(P_n − P)(A) ≥ m(P_m − P)(A) − mδ_m.

Thus

|ν_m(A)| ≥ (n/m)^{1/2} |ν_n(A)| − f(m)^{1/2} ≥ (1 + δ)^{−1} |ν_n(A)| − f(m)^{1/2},

since |1 − (n/m)| ≤ 2δ_m ≤ δ, so |ν_n(A)| ≤ (1 + δ)(|ν_m(A)| + f(m)^{1/2}). Next, |1 − (n/m)| ≤ δ implies |f(n)/f(m) − 1| < ε, so 1/f(n) ≤ (1 + 2ε)/f(m). Hence

|ν_n(A)|/f(n) ≤ (1 + 2ε)(1 + δ)(|ν_m(A)| f(m)^{−1} + f(m)^{−1/2}).

Thus, since (1 + 2ε)f(m)^{−1/2} ≤ aε,

Pr{sup_{A∈C} |ν_n(A)| > a f(n)(1 + 3ε)} ≤ θ.
For each m = m_k, set c = c_m := (1 − δ_m/2)m. Then as k → ∞, since f(m_k) → ∞, by Chebyshev's inequality and since P_c has variance c,

Pr{(1 − δ_m)m ≤ n(c) ≤ m} → 1.

Then for any γ with θ < γ < g := g(b) and k large enough, since the x(i) are independent of n(c),

Pr{sup_{A∈C} |ν_{n(c)}(A)| > a f(n(c))(1 + 3ε)} ≤ γ.

For k large enough we may assume

Pr{sup_{A∈C} |ν_{n(c)}(A)| > a f(c)(1 + 5ε)} ≤ γ.

For k large enough, applying Chebyshev's inequality to n(c) − c, we have, since c → ∞,

Pr{(n(c)/c)^{1/2} > (1 + 6ε)/(1 + 5ε)} ≤ Pr{(n(c) − c)/c^{1/2} > 2εc^{1/2}/(1 + 5ε)²} ≤ (g − γ)/4,

and since f(c) → ∞, Pr{[n(c) − c]/c^{1/2} > a f(c)ε} ≤ (g − γ)/4. Thus by (12.3.3),

Pr{sup_{A∈C} |Y_c(A)| > a c^{1/2} f(c)(1 + 7ε)} ≤ (γ + g)/2 < g,

a contradiction, proving Lemma 12.3.4.

Next, the Poisson process's independence property on disjoint sets will be extended to suitable random sets. Let (X, A) be a measurable space and (Ω, B, Pr)
a contradiction, proving Lemma 12.3.4. Next, the Poisson process's independence property on disjoint sets will be extended to suitable random sets. Let (X, A) be a measurable space and (S2, B, Pr)
a probability space. A collection (BA : A E A) of sub-a-algebras of B will be called a filtration if 8A C 13B whenever A C B in A. A stochastic process Y indexed by A, (A, c)) H Y(A)(w), will be called adapted to (BA : A E A) if for every A E A, Y(A)(.) is BA measurable. Then the process and filtration will be written {Y(A), BA)AE.A. A stochastic process Y: (A, w) - Y(A)(w), A E A, co E S2, will be said to have independent pieces if for any disjoint A1, , A. E A, Y(AK) are independent,
12.3 Poissonization and random sets
371
j = 1,
, m, and Y(A1 U A2) = Y(A1) + Y(A2) almost surely. Clearly each Y,, has independent pieces. If in addition the process is adapted to a filtration (BA : A E A), the process {Y(A),13A}AEA will be said to have independent pieces if for any disjoint sets A1, . , A in A, the random variables Y02), , Y(An) and any random variable measurable for the a-algebra BA, are jointly independent. For example, for any C E A let Bc be the smallest a-algebra for which every is measurable for A C C, A E A. This is clearly a filtration, and the smallest filtration to which Y is adapted. A function G from 7 into A will be called a stopping set for a filtration {BA :
A E A) if for all C E A, (w: G(w) C C) E 13c. Given a stopping
set let EG be the a-algebra of all sets B E 13 such that for every C E A, B fl {G C C} E Bc. (Note that if G is not a stopping set, then SZ ¢ 13G, so BG would not be a a-algebra.) If G (w) = H E A then it is easy to check that G is
a stopping set and BG = 13x 12.3.5 Lemma Suppose (Y (A), BA}AEA has independent pieces and for all
WES2,G(w)EA,A(W)EA,and E(w)EA. Assume that:
(i) G(.) is a stopping set; (ii) for all w, G(w) is disjoint from A(w) and from E(w); (iii) G (w), A(w), and E(w) each have just countably many possible val-
ues G(j) := Gj E A, C(i) := Ci E A, and D(j) := Dj E A, respectively;
(iv) for all i, j, {A(.) = Ci} E BG and
D1) E 13G.
Then the conditional probability law (joint distribution) of Y(A) and Y(E) given BG satisfies
C{(Y(A), Y(E))IBG} =
I{A(.)=C(i),E(.)=D(j)}C(Y(Ci), Y(Dj))
i,j where G(Y(Ci), Y(Dj)) is the unconditional joint distribution of Y(Ci) and Y(Dj). If this unconditional distribution is the same for all i, j, then (Y(A), Y(E)) is independent 0f 8G. rather Proof The proof will be given when there is only one random set than two, A0 and E(.). The proof for two is essentially the same. If for some
ω, A(ω) = C_i and G(ω) = G_j, then by (ii), C_i ∩ G_j = ∅ (the empty set). Thus Y(C_i) is independent of B_{G(j)}. Let B_i := B(i) := {A(·) = C_i} ∈ B_G by (iv). For each j,

{G = G_j} = {G ⊂ G_j} \ ∪{{G ⊂ G_i} : G_i ⊂ G_j, G_i ≠ G_j},

so by (i),

(12.3.6) {G = G_j} ∈ B_{G(j)}.

Let H_j := {G = G_j}. For any D ∈ A, H_j ∩ {G ⊂ D} = ∅ ∈ B_D if G_j ⊄ D. If G_j ⊂ D, then H_j ∩ {G ⊂ D} = H_j ∈ B_{G(j)} ⊂ B_D by (12.3.6). Thus

(12.3.7) H(j) := H_j ∈ B_G.

For any B ∈ B_G, by (12.3.6),

(12.3.8) B ∩ H_j = B ∩ {G ⊂ G_j} ∩ H_j ∈ B_{G(j)}.

We have for any real t almost surely, since B_i ∈ B_G,

Pr(Y(A) ≤ t | B_G) = Σ_{i,j} Pr(Y(A) ≤ t | B_G) 1_{B(i)} 1_{H(j)}
= Σ_{i,j} Pr(Y(C_i) ≤ t | B_G) 1_{B(i)} 1_{H(j)}.

The sums can be restricted to i, j such that B_i ∩ H_j ≠ ∅ and therefore C_i ∩ G_j = ∅. Now Pr(Y(C_i) ≤ t | B_G) is a B_G-measurable function f_i such that for any Γ ∈ B_G,

Pr({Y(C_i) ≤ t} ∩ Γ) = ∫_Γ f_i dPr.

Restricted to H_j, f_i equals a B_{G(j)}-measurable function by (12.3.8). By (12.3.7), Γ ∩ H_j ∈ B_G, so

(12.3.9) ∫_{Γ∩H(j)} f_i dPr = Pr({Y(C_i) ≤ t} ∩ Γ ∩ H_j) = Pr(Y(C_i) ≤ t) Pr(Γ ∩ H_j)

because by (12.3.8), Γ ∩ H(j) ∈ B_{G(j)}, which is independent of {Y(C_i) ≤ t}.

One solution f_i to (12.3.9) for all j is f_i ≡ Pr(Y(C_i) ≤ t). To show that this solution is unique, it will be enough to show that it is unique on H_j for each j. First, for a set A ⊂ H_j it will be shown that A ∈ B_{G(j)} if and only if A ∈ B_G. By (12.3.6) and (12.3.7), H_j itself is in both B_G and B_{G(j)}. We can write A = A ∩ H(j). Then "if" follows from (12.3.8). To prove "only if," let A ∈ B_{G(j)}. Then for C ∈ A, let

F := A ∩ {G ⊂ C} = A ∩ {G_j ⊂ C}.

If G_j ⊂ C then F = A ∈ B_{G(j)} ⊂ B_C. Otherwise, F is empty and in B_C. In either case F ∈ B_C, so A ∈ B_G as desired. Thus, if g is a function on H_j, measurable for B_{G(j)} there, and

∫_{D∩H(j)} g dPr = 0

for all D ∈ B_G, then the same holds for all D ∈ B_{G(j)}. So, as in the usual proof of uniqueness of conditional expectations,

∫_{{g>0}∩H(j)} g dPr = ∫_{{g<0}∩H(j)} g dPr = 0,

so g = 0 on H(j) a.s. and f_i = Pr(Y(C_i) ≤ t) a.s. on H(j) for all j. Thus Pr(Y(A) ≤ t | B_G) = Σ_i Pr(Y(C_i) ≤ t) 1_{B(i)}.
Here is another fact about stopping sets, which corresponds to a known fact about nonnegative real-valued stopping times or Markov times (e.g., RAP, Lemma 12.2.5):

12.3.10 Lemma If G and H are stopping sets and G ⊂ H, then B_G ⊂ B_H.

Proof For any measurable set D and A ∈ B_G, we have

A ∩ {H ⊂ D} = (A ∩ {G ⊂ D}) ∩ {H ⊂ D} ∈ B_D.
12.4 Lower bounds in borderline cases

Recall the classes C(α, K, d) of subgraphs of functions with bounded derivatives through order α in R^d, defined in Section 8.2. We had lower bounds for P_n − P on C(α, K, d) in Theorem 12.1.2 which imply that for α < d − 1, ‖ν_n‖_{C(α,K,d)} → ∞ surely as n → ∞. For α > d − 1, C(α, K, d) is a Donsker class by Theorem 8.2.1 and Corollary 7.2.13, so ‖ν_n‖_{C(α,K,d)} is bounded in probability. Thus α = d − 1 is a borderline case. Other such cases are given by the class LL₂ of lower layers in R² (Section 8.3) and the class C₃ of convex sets in R³ (Section 8.4), for λ_d = Lebesgue measure on the unit cube I^d, where I := [0, 1]. Any lower layer A has a closure Ā which is also a lower layer, with λ_d(Ā \ A) = 0, where in the present case d = 2. It is easily seen that suprema of our processes over all lower layers are equal to suprema over closed lower layers, so it will be enough to consider closed lower layers. Let LL₂ be the class of all closed lower layers in R².
Let P = λ_d and c > 0. Recall the centered Poisson process Y_c from Section 12.3. Let N_c := U_c − V_c where U_c and V_c are independent Poisson processes, each with intensity measure cP. Equivalently, we could take U_c and V_c to be centered. The following lower bound holds for all the above borderline cases:
12.4.1 Theorem For any K > 0 and δ > 0 there is a γ = γ(d, K, δ) > 0 such that

lim_{x→+∞} Pr{‖Y_x‖_C ≥ γ(x log x)^{1/2}(log log x)^{−δ−1/2}} = 1

and

lim_{n→∞} Pr{‖ν_n‖_C ≥ γ(log n)^{1/2}(log log n)^{−δ−1/2}} = 1,

where C = C(d − 1, K, d), d ≥ 2, or C = LL₂, or C = C₃.

For a proof, see the next section, except that here I will give a larger lower bound with probability close to 1, of order (log n)^{3/4}, in the lower layer case (C = LL₂, d = 2). Shor (1986) first showed that E‖Y_x‖_C ≥ γ x^{1/2}(log x)^{3/4} for some γ > 0 and x large enough. Shor's lower bound also applies to
C(1, K, 2) by a 45° rotation as in Section 8.3. For an upper bound with a 3/4 power of the log, also for convex subsets of a fixed bounded set in R³, see Talagrand (1994, Theorem 1.6).

To see that the supremum of N_c, Y_c, or an empirical process ν_n over LL₂ is measurable, note first for P_n that for each F ⊂ {1, …, n} and each ω, there is a smallest closed lower layer L_F(ω) containing the x_j for j ∈ F, with L_F(ω) := ∅ for F = ∅. For any c ≥ 0, ω ↦ (P_n − cP)(L_F(ω))(ω) is measurable. The supremum of P_n − cP over LL₂, as the maximum of these 2^n measurable functions, is measurable. Letting n = n(c) as in Lemma 12.3.2 and (12.3.3) then shows that sup{Y_c(A) : A ∈ LL₂} is measurable. Likewise, there is a largest open lower layer not containing x_j for any j ∈ F, so sup{|Y_c(A)| : A ∈ LL₂} and sup{|ν_n(A)| : A ∈ LL₂} are measurable. For N_c, taking non-centered Poisson processes U_c and V_c, their numbers of points m(ω) and n(ω) are measurable, as are the m-tuple and n-tuple of points occurring in each. For each i = 0, 1, …, m and k = 0, 1, …, n, it is a measurable event that there exists a lower layer containing exactly i of the m points and k of the n, and so the supremum of N_c over all lower layers, as a measurable function of the indicators of these finitely many events, is measurable.
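For small n, the representation just given of the supremum as a maximum over the 2^n layers L_F can be evaluated directly. In this sketch (hypothetical code, not from the text; P is the uniform law on I²), the smallest closed lower layer containing a finite set F of points is the union of the closed quadrants below and to the left of its points, and its area is read off the staircase of maximal points:

```python
import itertools
import numpy as np

def layer_area(gen_pts):
    """Area in I^2 of the smallest closed lower layer containing the
    given points: the union of the quadrants [0, x_j] x [0, y_j]."""
    if not gen_pts:
        return 0.0
    stair = []
    for x, y in sorted(gen_pts):       # x ascending
        while stair and stair[-1][1] <= y:
            stair.pop()                # dominated in both coordinates
        stair.append((x, y))
    area, prev_x = 0.0, 0.0
    for x, y in stair:                 # x ascending, y descending
        area += (x - prev_x) * y
        prev_x = x
    return area

def sup_over_lower_layers(points, c=1.0):
    """sup over closed lower layers L of (P_n - c P)(L), computed as a
    maximum over the 2^n layers L_F from the argument above."""
    n = len(points)
    best = 0.0                         # F empty gives the empty layer
    for r in range(1, n + 1):
        for F in itertools.combinations(points, r):
            # L_F is closed, so points on its boundary are counted.
            pn = sum(any(px <= x and py <= y for x, y in F)
                     for px, py in points) / n
            best = max(best, pn - c * layer_area(F))
    return best

rng = np.random.default_rng(0)
pts = [tuple(p) for p in rng.random((10, 2))]
print(round(sup_over_lower_layers(pts), 4))
```

The cost is exponential in n, in line with the 2^n layers in the measurability argument; this is only a check, not a practical algorithm.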
12.4.2 Theorem For every ε > 0 there is a δ > 0 such that for the uniform distribution P on the unit square I², and n large enough,

Pr{sup{|ν_n(A)| : A ∈ LL₂} > δ(log n)^{3/4}} > 1 − ε,

and the same holds for C(1, 2, 2) in place of LL₂. Also, ν_n can be replaced by N_c/c^{1/2} or Y_c/c^{1/2} if log n is replaced by log c, for c large enough.

Remark. The order (log n)^{3/4} of the lower bound is optimal, as there is an upper bound in expectation of the same order, not proved here; see Rhee and Talagrand (1988), Leighton and Shor (1989), and Coffman and Shor (1991).
Proof Let M₁ be the set of all functions f on [0, 1] with f(0) = f(1) = 1/2 and ‖f‖_L ≤ 1, that is, |f(u) − f(x)| ≤ |x − u| for 0 ≤ x ≤ u ≤ 1. Then 0 ≤ f(x) ≤ 1 for 0 ≤ x ≤ 1. For any f : [0, 1] → [0, ∞) let S_f be the subgraph of f,

S_f := S(f) := {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ f(x)}.

Then for each f ∈ M₁ we have S_f ∈ C(1, 2, 2). Let S₁ := {S_f : f ∈ M₁}. Let R be the counterclockwise rotation of R² by 45°, R = 2^{−1/2}[[1, −1], [1, 1]]. Then for each f ∈ M₁, R^{−1}(S_f) = M ∩ R^{−1}(I²) where M is a lower layer, I² := I × I, and I := [0, 1]. So it will be enough to prove the theorem for S₁ ⊂ C(1, 2, 2).
Let [x] denote the largest integer ≤ x and let 0 < δ < 1/3. For each ω and each c large enough so that δ²(log c) > 1, functions f₀ ≤ f₁ ≤ … ≤ f_L ≤ g_L ≤ … ≤ g₁ ≤ g₀ will be defined for L := [δ²(log c)], with f_i := f_i(t) := f_i(ω, t) for ω ∈ Ω and 0 ≤ t ≤ 1.

Each f_i and g_i will be continuous and piecewise linear in t. For each j = 1, …, 2^i, one of f_i and g_i will be linear on I_{ij} := [(j − 1)/2^i, j/2^i], and the other will be linear only on each half, I_{i+1,2j−1} and I_{i+1,2j}, with f_i = g_i at the endpoints of I_{ij}. Thus over I_{ij}, the region T_{ij} between the graphs of f_i and g_i is a triangle.

Let f₀(x) := 1/2 for 0 ≤ x ≤ 1. Let s := (log c)^{−1/2} and g₀(1/2) := (1 + s)/2. Given f_i and g_i, to define f_{i+1} and g_{i+1}, we have two cases.

Case I. Suppose g_i is linear on I_{ij}, as in Figure 12.1. Let p_k := (x_k, y_k) be the point labeled by k in Figure 12.1. Then x_k = (4j + k − 5)/2^{i+2} for k = 1, …, 5, x_k = x_{10−k} for k = 6, 7, 8, x₉ = x₂ + 1/(3·2^{i+2}), and x₁₀ = x₄ − 1/(3·2^{i+2}). Thus T := T_{ij} is the triangle p₁p₃p₅. Let Q := Q_{ij} be the quadrilateral p₃p₁₀p₇p₉, and V := V_{ij} the union p₂p₃p₉ ∪ p₃p₄p₁₀ of two triangles. The triangles p₁p₂p₇, p₁p₃p₈, p₃p₅p₆, and p₄p₅p₇ each have 1/4 the area of T, Q has 1/3, and V has 1/6.
Figure 12.1.
Let ρ_{ij}, i = 0, 1, 2, …, j = 1, …, 2^i, be i.i.d. Rademacher random variables, so that Pr(ρ_{ij} = 1) = Pr(ρ_{ij} = −1) = 1/2, independent of the N_c process.

If N_c(Q_{ij}) > 0, or if N_c(Q_{ij}) = 0 and ρ_{ij} = 1, say event A_{ij} occurs and set g_{i+1} := g_i on I_{ij}; then the graph of f_{i+1} on I_{ij} will consist of line segments joining the points p₁, p₂, p₇, p₄, p₅ in turn, so T_{i+1,2j−1} = p₁p₂p₇ and T_{i+1,2j} = p₇p₄p₅.

If A_{ij} doesn't occur, that is, N_c(Q_{ij}) < 0 or N_c(Q_{ij}) = 0 and ρ_{ij} = −1, set f_{i+1} := f_i on I_{ij} and let the graph of g_{i+1} consist of line segments joining the points p₁, p₈, p₃, p₆, p₅, so that T_{i+1,2j−1} = p₁p₈p₃ and T_{i+1,2j} = p₃p₆p₅. This finishes the recursive definition of f_i and g_i in Case I. The stated properties of f_i and g_i continue to hold.

Case II. In this case f_i is linear on I_{ij}; see Figure 12.2. Then T_{ij} is the triangle p₁p₅p₇. The other definitions remain the same as in Case I.

By induction, there is an integer m = m_{ij} = m_{ij}(ω) such that p₁p₅ has slope ms, p₁p₃ in Figure 12.1 and p₇p₅ in Figure 12.2 have slope (m − 1)s, while p₃p₅ in Figure 12.1 and p₁p₇ in Figure 12.2 have slope (m + 1)s. We have for each i and j that m_{i+1,2j−1} − m_{ij} and m_{i+1,2j} − m_{ij} have possible values 0 or ±1. Specifically, on A_{ij} in Case I, or on its complement A_{ij}^c in Case II, m_{i+1,2j} = m_{i+1,2j−1} = m_{ij}. On A_{ij}^c in Case I, m_{i+1,2j−1} − m_{ij} = −1 and m_{i+1,2j} − m_{ij} = 1. On A_{ij} in Case II, m_{i+1,2j−1} − m_{ij} = 1 and m_{i+1,2j} − m_{ij} = −1.

Let (Ω, μ) be a probability space on which N_c and all the ρ_{ij} are defined. Take Ω₁ := Ω × [0, 1] with product measure Pr := μ × λ, where λ is the uniform
Figure 12.2.
(Lebesgue) law on [0, 1]. For almost all t ∈ [0, 1], t is not a dyadic rational, so for each i = 1, 2, …, there is a unique j := j(i, t), j = 1, …, 2^i, such that t ∈ Int_{ij} := ((j − 1)/2^i, j/2^i). Thus Int_{ij} is the interior of I_{ij}. The derivatives f_i′ and g_i′ are step functions, constant on each interval Int_{i+1,j} and possibly undefined at the endpoints.

Some random subsets of I² are defined as follows. For m = 0, 1, 2, …, let

G_m := G(m) := G(m, ω) := ∪_{k=0}^{m} ∪_{i=1}^{2^k} Q_{ki}

and

G(m, j) := G(m − 1) ∪ ∪_{i=1}^{j} Q_{mi} for j = 1, …, 2^m.

For each C in the Borel σ-algebra B := B(I²), k = 0, 1, …, and i = 1, …, 2^k, let B_C^{(k,i)} be the smallest σ-algebra of subsets of Ω with respect to which all ρ_{mr}, for m = 0, 1, …, k − 1 or m = k, r = 1, …, i, and all N_c(A), for Borel A ⊂ C, are measurable. It is easily seen that for each fixed k and i, {B_C^{(k,i)} : C ∈ B} is a filtration and {N_c(A), B_A^{(k,i)}}_{A∈B} has independent pieces as defined before Lemma 12.3.5.
Classes Too Large for Central Limit Theorems

The initial triangle T_01 is fixed and has area s/2. For each k = 1, 2, ..., and each possible value of the triangle T_{k−1,i}, there are two possible values of the pair of triangles T_{k,2i−1}, T_{k,2i}, each with area s/2^{2k+1}. Thus for each m = 0, 1, 2, ..., there are finitely many possibilities for the values of T_ki and Q_ki for k = 0, 1, ..., m and i = 1, ..., 2^k, each on a measurable event, and so for whether A_ki holds or not. So ω ↦ N_c(Q_ij(ω))(ω) is measurable, and for each m and j, there are finitely many possible values of G(m, j). Next we have:
12.4.3 Lemma For each k = 0, 1, 2, ... and i = 1, ..., 2^k:
(a) G(k, i) is a stopping set for {B_A^{(k,i)}}_{A∈B};
(b) A_ki ∈ B_{G(k,i)}^{(k,i)};
(c) if k ≥ 1, then for each possible value Q of Q_ki, {Q_ki = Q} ∈ B^{(k−1)}, where B^{(r)} := B_{G(r)}^{(r)} for r = 0, 1, ....
Proof It will be shown by double induction on k, and on i for fixed k, that (a), (b), and (c) all hold. Let B_A^{(k)} := B_A^{(k,2^k)} for any A ∈ B(I²) and k = 0, 1, ....

For k = 0 we have G(0) = G(0, 1) = Q_01, a fixed set. So, as noted just before Lemma 12.3.5, G(0) is a stopping set, and B^{(0)} = B_{G(0)}, where B_{G(0)} is defined as for stopping sets. So (a) holds for k = 0. Since 1_{A_01} is a measurable function of N_c(Q_01) and ρ_01, we have A_01 ∈ B^{(0)}, so (b) holds for k = 0. In (c), each event {Q_ki = Q} is a Boolean combination of events A_jr for j = 0, ..., k − 1 and r = 1, ..., 2^j. Thus if (a) and (b) hold for k = 0, 1, ..., m − 1 for some m ≥ 1, then (c) holds for k = m. So (c) holds for k = 1.
To prove (a) for k = m, let J(m, j) be the finite set of possible values of G(m, j). If (c) holds for k = 1, ..., m, then from the definition of G(m, j), for each G ∈ J(m, j),

(12.4.4) {G(m, j) = G} ∈ B^{(m−1)}.

Thus, for any D ∈ B(I²), since G(m − 1) ⊂ G(m, j),

{G(m, j) ⊂ D} = {G(m, j) ⊂ D} ∩ {G(m − 1) ⊂ D} = ⋃_{G ∈ J(m,j), G ⊂ D} {G(m, j) = G} ∩ {G(m − 1) ⊂ D} ∈ B_D^{(m,j)},
so G(m, j) is a stopping set and (a) holds for k = m.

Then, to show (b) for k = m, for a set C ∈ B(I²) and i = 1, ..., 2^m let

C^{(m,i)} := {N_c(C) > 0} ∪ {N_c(C) = 0, ρ_mi = 1}. Then

(12.4.5) C^{(m,i)} ∈ B_C^{(m,i)}.
If C(m, i) is the finite set of possible values of Q_mi, then

A_mi = ⋃_{Q ∈ C(m,i)} {Q_mi = Q} ∩ Q^{(m,i)}.

Thus

A_mi ∩ {G(m, i) ⊂ D} = {G(m, i) ⊂ D} ∩ {G(m − 1) ⊂ D} ∩ ⋃_{Q ∈ C(m,i), Q ⊂ D} {Q_mi = Q} ∩ Q^{(m,i)}.

It follows from (12.4.4) that {G(m, i) ⊂ D} ∈ B^{(m−1)} for each m and i. By (c) for k = m, each {Q_mi = Q} ∈ B^{(m−1)}. Thus

{G(m, i) ⊂ D} ∩ {Q_mi = Q} ∈ B^{(m−1)},

and so

{G(m, i) ⊂ D} ∩ {Q_mi = Q} ∩ {G(m − 1) ⊂ D} ∈ B_D^{(m−1)}.

For Q ⊂ D, by (12.4.5), Q^{(m,i)} ∈ B_D^{(m,i)}. So, taking a union of intersections, A_mi ∩ {G(m, i) ⊂ D} ∈ B_D^{(m,i)}. Thus A_mi ∈ B_{G(m,i)}^{(m,i)} and (b) holds for k = m. This finishes the proof of the lemma.

Next, here is a fact about symmetrized Poisson variables.
12.4.6 Lemma There is a constant c_0 > 0 such that if X and Y are independent random variables, each having a Poisson distribution with parameter λ ≥ 1, then E|X − Y| ≥ c_0 λ^{1/2}.

Proof By a straightforward calculation, E((X − Y)^4) = 2λ + 12λ² ≤ 14λ² if λ ≥ 1. By the Cauchy-Bunyakovsky-Schwarz inequality,

2λ = E((X − Y)²) ≤ (E|X − Y|)^{1/2}(E(|X − Y|³))^{1/2},

and E(|X − Y|³) ≤ (E(|X − Y|^4))^{3/4}. The lemma follows with c_0 = 4/14^{3/4}.
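As a numerical sanity check of Lemma 12.4.6 (not part of the proof), the following sketch estimates E|X − Y| by Monte Carlo and compares it with c_0 λ^{1/2}, c_0 = 4/14^{3/4} ≈ 0.553; the sampler and sample sizes are illustrative choices.

```python
import math
import random

C0 = 4 / 14 ** 0.75  # the constant from Lemma 12.4.6, about 0.553

def poisson_sample(lam, rng):
    # Knuth's multiplication method; adequate for moderate lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def mean_abs_difference(lam, n_samples=20_000, seed=0):
    """Monte Carlo estimate of E|X - Y| for X, Y i.i.d. Poisson(lam)."""
    rng = random.Random(seed)
    total = sum(abs(poisson_sample(lam, rng) - poisson_sample(lam, rng))
                for _ in range(n_samples))
    return total / n_samples

for lam in (1.0, 4.0, 25.0):
    assert mean_abs_difference(lam) > C0 * math.sqrt(lam)
```

For large λ, X − Y is approximately N(0, 2λ), so E|X − Y| ≈ 2(λ/π)^{1/2} ≈ 1.13 λ^{1/2}, comfortably above the bound.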
Now continuing with the proof of Theorem 12.4.2, set G(m, 0) := G_{m−1}. We apply Lemma 12.4.3(a) and (c). For m = 1, 2, ... and j = 1, ..., 2^m, the random set Q_mj is disjoint from the stopping set G_{m,j−1}. The hypotheses of Lemma 12.3.5 hold for (G, A) = (G_{m,j−1}, Q_mj) and the process Y = N_c with the filtration {B_A^{(m,j−1)} : A ∈ B}. Since all possible values of Q_mj have the
same area, it follows that the random variable N_c(Q_mj) has the law of U − V for U, V i.i.d. Poisson with parameter

(12.4.7) cP(Q_mj) = cs/(6·4^m),

and is independent of B_{m,j−1}, which was defined in Lemma 12.4.3(b). Also, N_c(Q_ri) is B_{m,j−1} measurable for r < m, or for r = m and i < j. So the variables N_c(Q_mj), for m = 0, 1, 2, ... and j = 1, ..., 2^m, are all jointly independent with the given laws. Also, the i.i.d. Rademacher variables ρ_mj are independent of the process N_c.

For t ∈ Int_ij, let h_i(ω, t) be the slope of the longest side of T_ij, so h_i(ω, t) = g_i'(t) in Case I and f_i'(t) in Case II, while h_0(ω, t) ≡ 0. In Case I, if A_ij holds then h_{i+1} = h_i on Int_ij; if ω ∉ A_ij then h_{i+1} − h_i = −s on Int_{i+1,2j−1} and +s on Int_{i+1,2j}. In Case II, if ω ∉ A_ij then h_{i+1} = h_i on Int_ij; if ω ∈ A_ij then h_{i+1} − h_i = s on Int_{i+1,2j−1} and −s on Int_{i+1,2j}.
Let ξ_i := h_i − h_{i−1} for i = 1, ..., L. By Lemmas 12.3.5 and 12.4.3, and (12.4.7), any N_c(Q_mj) or ρ_mj for m > k is independent of B^{(k−1)}, where B^{(−1)} is defined as the trivial σ-algebra {∅, Ω_1}. So each of ξ_1, ..., ξ_k is B^{(k−1)} measurable, while any event A_kj and the random function ξ_{k+1} are independent of B^{(k−1)}. It follows that h_i(ω, t) = Σ_{r=1}^{i} ξ_r(ω, t), where the ξ_r are i.i.d. variables for Pr = P × λ on Ω × [0, 1] having distribution Pr(ξ_r = 0) = 1/2, Pr(ξ_r = s) = Pr(ξ_r = −s) = 1/4. Since the ξ_r are independent and symmetric, we have by the P. Lévy inequality (1.3.16) that for any M > 0,

(12.4.8) Pr(|h_r| > M for some r ≤ L) ≤ 2 Pr(|h_L| > M).

Now we can write ξ_r = η_{2r−1} + η_{2r}, where the η_j are i.i.d. variables with Pr(η_j = s/2) = Pr(η_j = −s/2) = 1/2. So by one of Hoeffding's inequalities (Proposition 1.3.5), we have Pr(|h_L| > M) ≤ exp(−M²/(Ls²)). For c large, (1 − 2s)² > 1/2. Thus

(12.4.9) Pr(|h_r| > 1 − 2s for some r ≤ L) ≤ 2 exp(−(1 − 2s)²/(Ls²)) ≤ 2 exp(−log c/(2δ² log c)) = 2 exp(−1/(2δ²)).

Let K(ω, t) := inf{k : |h_k(ω, t)| > 1 − 2s} ≤ +∞. Then |f_k'(t)| ≤ 1 for all k ≤ K(ω, t). Let Φ_k(ω, t) := f_{min(K,k)}(ω, t). Then Φ_k ∈ M_1 for each k. If |h_r(t)| ≤ 1 − 2s for all r ≤ L, then |f_r'(t)| ≤ 1 and |g_r'(t)| ≤ 1 for all r ≤ L, K(ω, t) > L, and Φ_L(ω, t) = f_L(ω, t).
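The Hoeffding bound used for (12.4.9) — Pr(|h_L| > M) ≤ exp(−M²/(Ls²)) for a sum of 2L i.i.d. ±s/2 steps — can be checked by exact enumeration for small L (an illustrative check, not part of the proof; the values of L, s, M below are arbitrary):

```python
from math import comb, exp

def tail_prob(L, s, M):
    """Exact P(|h_L| > M), where h_L is the sum of n = 2L i.i.d. +-s/2
    steps; with k positive steps out of n, h_L = (k - L) * s."""
    n = 2 * L
    hits = sum(comb(n, k) for k in range(n + 1) if abs((k - L) * s) > M)
    return hits / 2 ** n

L, s = 8, 0.5
for M in (0.5, 1.0, 1.5, 2.0):
    assert tail_prob(L, s, M) <= exp(-M * M / (L * s * s))
```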
Let X_ki(ω) := 1 if K(ω, t) > k for some, or equivalently all, t ∈ Int_ki; otherwise let X_ki(ω) := 0. For any real x let x⁺ := max(x, 0). For k = 0, 1, ..., L − 1 and i = 1, ..., 2^k, we have Q_ki ⊂ S(f_{k+1}) if and only if A_ki holds. So Q_ki ⊂ S(Φ_L(ω, ·)) if and only if both A_ki holds and X_ki = 1. Let α_ki := 1_{A_ki},

(12.4.10) S_{Q,L} := Σ_{k=0}^{L−1} Σ_{i=1}^{2^k} X_ki N_c(Q_ki)⁺,

(12.4.11) S_{V,L} := Σ_{k=0}^{L−1} Σ_{i=1}^{2^k} V̂_ki, where V̂_ki := X_ki α_ki N_c(V_ki).

Also, let U_ki := X_ki N_c(Q_ki)⁺ and W := N_c(S(f_0)), where S(f_0) = [0, 1] × [0, 1/2]. Then S(f_k) is the union, disjoint up to one-dimensional boundary line segments, of S(f_0) and of those Q_rj and V_rj with r < k such that A_rj holds. So

(12.4.12) N_c(S(Φ_L)) = S_{Q,L} + S_{V,L} + W.
From the definitions, we can see that the set where K(ω, t) ≤ k is B^{(k−1)} measurable for each t, and thus so is each X_ki. Also, Q_ki is a B^{(k−1)}-measurable random set, disjoint from G(k − 1). Since P(Q_ki) is fixed, by Lemma 12.3.5, N_c(Q_ki)⁺, a function of N_c(Q_ki), is independent of B^{(k−1)} and thus of X_ki. Similarly, α_ki is independent of X_ki by Lemma 12.3.5. Both are measurable with respect to B_{k,i} = B_{G(k,i)}^{(k,i)}, while V_ki is disjoint from G(k, i), and P(V_ki) is a constant. So by Lemma 12.3.5 again, N_c(V_ki) is independent of B_{G(k,i)}, and the three variables N_c(V_ki), α_ki, and X_ki are jointly independent.

Since L ≤ (log c)/9, in (12.4.7) the parameters of the Poisson variables for k < L are ≥ cs/(6·4^L) ≥ 1 for c large enough. By (12.4.9) and the Tonelli-Fubini theorem we have, for δ ≤ 1/4 and k = 0, 1, ..., L − 1, that

(12.4.13) 1/2 ≤ 1 − 2 exp(−1/(2δ²)) ≤ Pr(K > k) = 2^{−k} Σ_{i=1}^{2^k} E X_ki.

Let χ_ki := N_c(Q_ki)⁺ and p_ki := Pr(X_ki = 0). Then by (12.4.13),

Σ_{i=1}^{2^k} p_ki ≤ 2^{k+1} exp(−1/(2δ²)).

For each k and i, E[(1 − X_ki)χ_ki] ≤ p_ki^{1/2}(E χ_ki²)^{1/2}. Thus by (12.4.7), setting
S_1 := Σ_{k=0}^{L−1} Σ_{i=1}^{2^k} (1 − X_ki)χ_ki, we have

E S_1 ≤ Σ_{k=0}^{L−1} [2^{k+1} exp(−1/(2δ²))]^{1/2} [2^k cs/(6·4^k)]^{1/2}
≤ 6^{−1/2} L (cs)^{1/2} exp(−1/(4δ²))
≤ (δ² log c)(cs)^{1/2} exp(−1/(4δ²)),

so

(12.4.14) E S_1 ≤ c^{1/2}(log c)^{3/4} δ² exp(−1/(4δ²)).
Letting S_2 := Σ_{k=0}^{L−1} Σ_{i=1}^{2^k} χ_ki, we have, by Lemma 12.4.6 and since E N_c(Q_ki)⁺ = E|N_c(Q_ki)|/2 by symmetry,

E S_2 ≥ Σ_{k=0}^{L−1} 2^{k−1} c_0 (cs/(6·4^k))^{1/2} = 24^{−1/2} c_0 L (cs)^{1/2} ≥ c_1 δ² c^{1/2} (log c)^{3/4},
where c_1 := c_0/10. By (12.4.14) and Markov's inequality, we have for any a > 0,

(12.4.15) Pr(S_1 > a δ² c^{1/2}(log c)^{3/4}) ≤ a^{−1} exp(−1/(4δ²)).
If j < k, or j = k and i < r, then χ_ji is measurable for B_{G(k,r−1)}, while by Lemma 12.3.5, χ_kr is independent of it. Thus all the variables χ_ji are independent, and by (12.4.7),

Var(S_2) = Σ_{j=0}^{L−1} Σ_{i=1}^{2^j} Var(χ_ji) ≤ Σ_{j=0}^{L−1} 2^j cs/(6·4^j) < cs = c/(log c)^{1/2}.

Thus S_2 has standard deviation ≤ c^{1/2}/(log c)^{1/4}, and

(12.4.16) S_2 − E S_2 = o_p(E S_2) as c → ∞

by Chebyshev's inequality.
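Explicitly, with Var(S_2) ≤ c(log c)^{−1/2} and E S_2 ≥ c_1 δ² c^{1/2}(log c)^{3/4} as above, Chebyshev's inequality gives, for any ε > 0,

```latex
\Pr\bigl(|S_2 - E S_2| > \varepsilon\, E S_2\bigr)
 \;\le\; \frac{\operatorname{Var}(S_2)}{\varepsilon^2 (E S_2)^2}
 \;\le\; \frac{c(\log c)^{-1/2}}{\varepsilon^2 c_1^2 \delta^4\, c\,(\log c)^{3/2}}
 \;=\; \frac{1}{\varepsilon^2 c_1^2 \delta^4 (\log c)^2} \;\to\; 0,
```

which is (12.4.16).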
To find the covariance Cov(V̂_ji, V̂_kr) of two different terms of S_{V,L} in (12.4.11), we can again assume j < k, or j = k and i < r. Let Y_uv := N_c(V_uv) for each u and v. Since E V̂_uv = 0, we need to find

E_jikr := E(X_ji α_ji Y_ji X_kr α_kr Y_kr) = E(X_ji N_c(W_ji) X_kr α_kr Y_kr),

where W_ji := V_ji on A_ji and W_ji := ∅ otherwise. Then W_ji and V_kr are measurable random sets disjoint from the stopping set G(k, r), and each
of the three random sets has finitely many possible values. Thus Lemma 12.3.5
applies to G = G(k, r), A = W_ji, and B = V_kr. Also, A ∩ B = ∅. Thus we have

E_jikr = E[E(X_ji N_c(W_ji) X_kr α_kr Y_kr | B_{k,r})]
= E[X_ji X_kr α_kr E(N_c(W_ji) N_c(V_kr) | B_{k,r})]
= E[X_ji X_kr α_kr E(N_c(W_ji) | B_{k,r}) E(N_c(V_kr) | B_{k,r})] = 0,

since the last conditional expectation is 0, and E(V̂_ji V̂_kr) = 0 = E(V̂_ji) = E(V̂_kr). So the terms are orthogonal, and
Var(S_{V,L}) = Σ_{j=0}^{L−1} Σ_{i=1}^{2^j} Var(V̂_ji) ≤ Σ_{j=0}^{L−1} 2^j E((N_c(V_j1))²) ≤ Σ_{j=0}^{L−1} 2^{j+1} cs/(12·4^j) ≤ cs.

Also, for W in (12.4.12), Var(W) ≤ c. Thus by Chebyshev's inequality, S_{V,L} = o_p(E S_2) and W = o_p(E S_2) as c → ∞. Note that we have S_{Q,L} = S_2 − S_1. Then by (12.4.12) and (12.4.16),

N_c(S(Φ_L)) = E S_2 + (S_2 − E S_2) − S_1 + S_{V,L} + W = E S_2 − S_1 + o_p(E S_2).

Taking a := c_1/3 in (12.4.15), and then δ > 0 small enough so that (3/c_1) exp(−1/(4δ²)) < ε/2, we have for c large enough that

Pr{N_c(S(Φ_L)) > γ c^{1/2}(log c)^{3/4}} ≥ 1 − ε, where γ := c_1 δ²/3.

Since Φ_L ∈ M_1, the conclusion for N_c follows.

Now suppose the theorem fails for the centered Poisson process Y_c, for some ε > 0. Then for arbitrarily small δ > 0,

Pr[‖Y_c‖_{C(1,2,2)} < δ c^{1/2}(log c)^{3/4}/2] > ε.

Then, subtracting two independent versions of Y_c to get an N_c, we have

Pr[‖N_c‖_{C(1,2,2)} < δ c^{1/2}(log c)^{3/4}] > ε²,
a contradiction. The ν_n case follows from Lemma 12.3.4.
12.5 Proof of Theorem 12.4.1
Proof First, let's take care of measurability properties. For C(α, K, d), let α = β + γ, where 0 < γ ≤ 1 and β is an integer ≥ 0. The set C(α, K, d) of functions is compact in the ‖·‖_β norm: it is totally bounded by the Arzelà-Ascoli theorem (RAP, Theorem 2.4.7) and closed, since uniform (or pointwise) convergence of functions g preserves a Hölder condition |G(x) − G(y)| ≤ K|x − y|^γ. Recall that for x ∈ R^d we let x^{(d)} := (x_1, ..., x_{d−1}). The set {(x, g) : 0 ≤ x_d ≤ g(x^{(d)}), g ∈ C(α, K, d)} is compact, for the ‖·‖_β norm on g. So the class C(α, K, d) is image admissible Suslin.

Measurability for the lower layer case C = L_d follows as in the case d = 2, treated soon after the statement of Theorem 12.4.1. For the case C = C_3 of convex sets in R³, for any F ⊂ {X_1, ..., X_n} there is a smallest convex set including F, its convex hull (a polyhedron). The maximum of P_n − P will be found by maximizing over 2^n − 1 such sets. Conversely, to maximize (P − P_n)(C), or equivalently P(C), over all closed convex sets C such that C ∩ {X_1, ..., X_n} = F, recall that a closed convex set is an intersection of closed half-spaces (RAP, Theorem 6.2.9). If |F| = k, it suffices in the present case to take an intersection of no more than n − k closed half-spaces, one to exclude each X_j ∉ F, j ≤ n. Thus ‖P_n − P‖_C = ‖P_n − P‖_D for a finite-dimensional class D of sets for which an image admissible Suslin condition holds. It follows that the norms ‖Y_λ‖_C and ‖ν_n‖_C are measurable for the classes
C in the statement.

Theorem 12.4.1 for Y_λ implies it for ν_n by Lemma 12.3.4. By assumption, d ≥ 2. Let J := [0, 1), so that J^{d−1} = {x ∈ R^{d−1} : 0 ≤ x_j < 1 for j = 1, ..., d − 1}.

Let f_1 be a C^∞ function on R such that f_1(t) = 0 for t ≤ 0 or t ≥ 1, for some K with 0 < K < 1, f_1(t) = K for 1/3 ≤ t ≤ 2/3, and 0 < f_1(t) < K for t ∈ (0, 1/3) ∪ (2/3, 1). Specifically, we can let f_1(t) := (g * h)(t) = ∫ g(t − y)h(y) dy, where h := 1_{[1/6,5/6]} and g is a C^∞ function with g(t) > 0 for |t| < 1/6 and g(t) = 0 for |t| ≥ 1/6. For x ∈ R^{d−1} let f(x) := f_1(x_1) f_1(x_2) ··· f_1(x_{d−1}). Then f is a C^∞ function on R^{d−1} with f(x) = 0 outside the unit cube J^{d−1}, 0 < f(x) ≤ γ for x in the interior of J^{d−1}, and f = γ on the subcube [1/3, 2/3]^{d−1}, where 0 < γ := K^{d−1} < 1. Taking a small enough positive multiple of f_1, we can assume that sup_x sup_{|p|
Let B_ji be a cube of side 3^{−j−1}, concentric with and parallel to C_ji. Let x_ji be the point of C_ji closest to 0 (a vertex) and y_ji the point of B_ji closest to 0. For δ > 0 and j = 1, 2, ..., let c_j := Cj^{−1}(log(j + 1))^{−1−2δ}, where the constant C > 0 is chosen so that Σ_{j≥1} c_j ≤ 1. For x ∈ R^{d−1} let

f_ji(x) := c_j 3^{−j(d−1)} f(3^j(x − x_ji)),

g_ji(x) := c_j 3^{−(j+1)(d−1)} f(3^{j+1}(x − y_ji)).

Then f_ji(x) > 0 on the interior of C_ji while f_ji(x) = 0 outside C_ji, and likewise for g_ji and B_ji. Note that

(12.5.1) f_ji(x)/g_ji(x) ≥ 3^{d−1} > 1 for all x ∈ B_ji.
Let S_0 := 1/2. Sequences of random variables s_ji = ±1 and random functions on J^{d−1},

(12.5.2) S_k := S_0 + Σ_{j=1}^{k} Σ_{i=1}^{3^{j(d−1)}} s_ji(ω) f_ji,

will be defined recursively. Then since d ≥ 2, we have Σ_{j≥1} c_j 3^{−j(d−1)} ≤ 1/3, so 0 < S_k < 1 for all k. Given j ≥ 1 and S_{j−1}, let

(12.5.3) D_ji = D_ji(ω) := {x ∈ J^d : |x_d − S_{j−1}(x^{(d)})(ω)| ≤ g_ji(x^{(d)})}.
Let s_ji(ω) := +1 if Y_λ(D_ji(ω)) ≥ 0; otherwise s_ji(ω) := −1. This finishes the recursive definition of the s_ji and the S_k. By induction, each D_ji has only finitely many possible values, each on a measurable event, so the s_ji and S_k are all measurable random variables. Since the cubes C_ji do not overlap, and all derivatives D^p f_ji are 0 on the boundary of C_ji, it follows for any k ≥ 1 that

(12.5.4) sup_{|p|≤d−1} sup_x |D^p S_k(x)| ≤ Σ_{j=1}^{k} sup_{|p|≤d−1} sup_x |D^p f_ji(x)| ≤ Σ_{j=1}^{k} c_j ≤ 1.
The volume of D_ji(ω) is always

(12.5.5) P(D_ji(ω)) = 2 ∫_{B_ji} g_ji dx = 2μ c_j / 9^{(j+1)(d−1)},

where 0 < μ := ∫_{J^{d−1}} f(x) dx < 1.
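The middle equality in (12.5.5) is just the change of variables u = 3^{j+1}(x − y_ji) in R^{d−1}:

```latex
\int g_{ji}\,dx
  \;=\; c_j\,3^{-(j+1)(d-1)} \int f\bigl(3^{\,j+1}(x - y_{ji})\bigr)\,dx
  \;=\; c_j\,3^{-(j+1)(d-1)}\cdot 3^{-(j+1)(d-1)} \int_{J^{d-1}} f(u)\,du
  \;=\; \mu\, c_j\, 9^{-(j+1)(d-1)},
```

and the factor 2 comes from D_ji being the band of vertical half-width g_ji around the graph of S_{j−1}.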
Next, it will be shown that the D_ji(ω) for different i, j are disjoint. For 1 ≤ j ≤ k and any ω, i, and x ∈ B_ji,

|S_k(x) − S_{j−1}(x)|(ω) ≥ γ c_j 3^{−j(d−1)} − Σ_{r>j} γ c_r 3^{−r(d−1)} ≥ γ c_j 3^{−j(d−1)}(1 − 1/2) = γ c_j 3^{−j(d−1)}/2 ≥ sup_y (g_ji + sup_r g_{k+1,r})(y).

Thus if s_ji(ω) = +1, then for any r and any x ∈ B_ji,

S_k(x)(ω) − g_{k+1,r}(x) ≥ S_{j−1}(x)(ω) + g_ji(x).

It follows that D_ji(ω) is disjoint from D_{k+1,r}(ω). They are likewise disjoint if s_ji(ω) = −1, by a symmetrical argument. For the same j and different i, the sets D_ji(ω) are disjoint, since the projection x ↦ x^{(d)} from R^d onto R^{d−1} takes them into disjoint sets B_ji.
Given λ > 0, let r = r(λ) be the largest j, if one exists, such that 2λμ c_j ≥ 9^{(j+1)(d−1)}. Then

(12.5.6) r(λ) ~ (log λ)/((d − 1) log 9) as λ → +∞.
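To see (12.5.6), take logarithms in the defining inequality 2λμ c_j ≥ 9^{(j+1)(d−1)}; with c_j = Cj^{−1}(log(j+1))^{−1−2δ} this reads

```latex
\log(2\lambda\mu C) - \log j - (1+2\delta)\log\log(j+1) \;\ge\; (j+1)(d-1)\log 9 .
```

Since the j-dependent terms on the left grow only logarithmically, the largest such j satisfies r(λ)(d − 1) log 9 = log λ + O(log log λ), which gives (12.5.6).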
Let G_0(ω) := ∅. For m = 1, 2, ..., let

G(m)(ω) := G_m(ω) := ⋃{D_ji(ω) : j ≤ m}.

Let H_m(ω) be the subgraph of S_m, H_m(ω) := {x : 0 ≤ x_d ≤ S_m(x^{(d)})(ω)}. Then for all ω, by (12.5.4), H_m ∈ C(d − 1, 1, d). Let

A_m(ω) := H_m(ω) \ G_m(ω).

From the disjointness proof,

(12.5.7) H_m(ω) ∩ G_m(ω) = ⋃{D_ji(ω) : j ≤ m, s_ji = +1}.
For each m, each of the above sets has finitely many possible values, each on a measurable event. Each G_j is easily seen to be a stopping set, while A_j is B_{G(j)} measurable and, for each ω, A_j(ω) is disjoint from G_j(ω). So the hypotheses of Lemma 12.3.5 hold. Thus, conditional on B_{G(j)}, Y_λ(A_j) is centered Poisson with parameter λP(A_j(ω)). Also, for u⁺ := max(u, 0), by (12.5.7),

(12.5.8) Y_λ((H_m ∩ G_m)(ω)) = Σ_{j=1}^{m} Σ_{i=1}^{3^{j(d−1)}} Y_λ(D_ji)⁺.
Now, P(D_ji) does not depend on ω nor on i, and D_ji(·) is B_{G(j−1)} measurable. Thus by Lemma 12.3.5, applied to A := D_mi, replacing G_m by G_{m−1} ∪ ⋃_{r<i} D_mr, the variables Y_λ(D_ji) are jointly independent, with λP(D_ji) ≥ 1 for j ≤ r(λ), by (12.5.5). So by Lemma 1.3.14, for a constant c > 0,

λ^{−1/2} E Y_λ((H_{r(λ)} ∩ G_{r(λ)})(ω))
≥ c Σ_{j=1}^{r(λ)} Σ_{i=1}^{3^{j(d−1)}} P(D_ji)^{1/2}
= c Σ_{j=1}^{r(λ)} 3^{j(d−1)} (2μ c_j / 9^{(j+1)(d−1)})^{1/2}   (by (12.5.5))
= c 3^{1−d}(2μ)^{1/2} Σ_{j=1}^{r(λ)} (Cj^{−1}(log(j + 1))^{−1−2δ})^{1/2}   (by definition of c_j)
= c (2μC)^{1/2} 3^{1−d} Σ_{j=1}^{r(λ)} j^{−1/2}(log(j + 1))^{−0.5−δ}
≥ c (2μC)^{1/2} 3^{1−d} (log(r(λ) + 1))^{−0.5−δ} Σ_{j=1}^{r(λ)} j^{−1/2}
≥ 2 a_d (r(λ)^{1/2} − 1)(log(r(λ) + 1))^{−0.5−δ}
for some constant a_d > 0. For λ large, by (12.5.6), the latter expression is ≥ 3 b_d (log λ)^{1/2}(log log λ)^{−0.5−δ} for some b_d > 0.

By independence of the variables Y_λ(D_ji) and (12.5.5), Y_λ(H_{r(λ)} ∩ G_{r(λ)})(ω) has variance less than

Σ_{j=1}^{r(λ)} Σ_{i=1}^{3^{j(d−1)}} λ P(D_ji) = λ Σ_{j=1}^{r(λ)} 3^{j(d−1)} · 2μ c_j 9^{−(j+1)(d−1)} ≤ λ.

Thus by Chebyshev's inequality, as λ → +∞,

Pr{λ^{−1/2} Y_λ(H_{r(λ)} ∩ G_{r(λ)})(ω) > 2 b_d (log λ)^{1/2}(log log λ)^{−0.5−δ}} → 1.
As shown around (12.5.8), Y_λ(A_{r(λ)}(ω))(ω) has, given B_{G(r(λ))}, the conditional distribution of Y_λ(D) for P(D) = P(A_{r(λ)}(ω)), where since Y_λ is a centered Poisson process, E Y_λ(D)² ≤ λ for all D. Thus E Y_λ(A_{r(λ)})² ≤ λ, so Y_λ(A_{r(λ)})/λ^{1/2} is bounded in probability, and

Pr{Y_λ(H_{r(λ)}) > b_d (λ log λ)^{1/2}(log log λ)^{−0.5−δ}} → 1

as λ → +∞. This proves Theorem 12.4.1 for Y_λ, since given the result for K = 1, one can let γ(d, K, δ) := K γ(d, 1, δ). For ν_n, Lemma 12.3.4 then applies. This completes the proof for classes C(d − 1, K, d). For convex sets in R³, see Dudley (1982, Theorem 4, proof in Section 5). Lower layers were treated in Section 12.4.
Problems

1. Find a lower bound for the constant γ(d, α) in the proof of Theorem 12.1.1. Hint: γ(d, α) ≥ 3^{−d−α} P(f)/‖f‖_α.

2. In the proof of Theorem 12.1.2, show that r can be taken as n + 1 for the largest n = 0, 1, ... such that m(n) = 0.

3. From Theorem 12.1.3 for d = 3, show that C_3 cannot be a Donsker class for the uniform distribution on the unit cube I³. Hint: For a pregaussian class F, P(‖G_P‖_F < ε) > 0 for all ε > 0 (Theorem 2.5.5(b)).

4. What should replace n^ζ(log n)^η in Theorem 12.2.1 if 0 < ζ < 1? Hint: any sequence a_n → +∞.

5. To show that Poisson processes are well-defined, show that if X and Y are independent real random variables with L(X) = P_a and L(Y) = P_b, then L(X + Y) = P_{a+b}.

6. Let P be a law on R and U_c the Poisson process with intensity measure cP for some c > 0. For a given y > 0, let X be the least x ≥ 0 such that U_c([−x, x]) ≥ y. Show that the interval [−X, X] is a stopping set.
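For Problem 5, the identity follows from the binomial theorem: Σ_j P_a(j)P_b(n − j) = e^{−(a+b)}(a + b)^n/n!. A quick numerical check (the parameter values are arbitrary):

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    return exp(-lam) * lam ** k / factorial(k)

def convolved_pmf(a, b, n):
    """P(X + Y = n) for independent X ~ Poisson(a), Y ~ Poisson(b)."""
    return sum(poisson_pmf(a, j) * poisson_pmf(b, n - j) for j in range(n + 1))

a, b = 1.3, 2.1
for n in range(12):
    assert abs(convolved_pmf(a, b, n) - poisson_pmf(a + b, n)) < 1e-12
```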
Notes

Notes to Section 12.1. Bakhvalov (1959) proved Theorem 12.1.1. W. Schmidt (1975) proved Theorem 12.1.3. Theorem 12.1.2, proved by the same method as theirs, is also in Dudley (1982, Theorem 1).

Notes to Section 12.2. Walter Philipp (unpublished) proved Theorem 12.2.1 for ζ = 1, with a refinement (η = 1 and a power of log log n factor). The proof for ζ > 1 follows the same scheme. Theorem 12.2.1 is Theorem 2 of Dudley (1982).
Notes to Section 12.3. Lemma 12.3.1 is classical; see, for example, Feller (1971, VIII.8, Lemma 2). Lemma 12.3.2, and with it the main idea of Poissonization, are due to Kac (1949). Pyke (1968) gives relations between Poissonized and non-Poissonized cases along the line of Lemma 12.3.4. Evstigneev (1977, Theorem 1) proves a Markov property for random fields indexed by closed subsets of a Euclidean space, somewhat along the line of Lemma 12.3.5. This entire section is a revision of Section 3 of Dudley (1982), with proofs of Lemmas 12.3.1 and 12.3.2 supplied here.
Notes to Section 12.4. Theorem 12.4.1 was proved in Dudley (1982) for all d ≥ 2. Another proof was given for lower layers and d = 2 in Dudley (1984). Shor (1986) discovered the larger lower bound with (log n)^{3/4} in expectation. He also contributed to a later version of the proof (Coffman and Lueker, 1991, pp. 57-64). Shor gave the definition of the f_i and g_i. Theorem 12.4.2 here shows that the lower bound holds not only in expectation, but with probability going to 1, by the methods of Dudley (1982, 1984). I thank J. Yukich for providing very helpful advice about this section.

Note to Section 12.5. The proof is from Dudley (1982).
References

Bakhvalov, N. S. (1959). On approximate calculation of multiple integrals. Vestnik Moskov. Univ. Ser. Mat. Mekh. Astron. Fiz. Khim. 1959, no. 4, 3-18.

Coffman, E. G., and Lueker, G. S. (1991). Probabilistic Analysis of Packing and Partitioning Algorithms. Wiley, New York.

Coffman, E. G., and Shor, P. W. (1991). A simple proof of the O(√n log^{3/4} n) upright matching bound. SIAM J. Discrete Math. 4, 48-57.

Dudley, R. M. (1982). Empirical and Poisson processes on classes of sets or functions too large for central limit theorems. Z. Wahrscheinlichkeitsth. verw. Gebiete 61, 355-368.

Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de St.-Flour. Lecture Notes in Math. (Springer) 1097, 1-142.

Evstigneev, I. V. (1977). "Markov times" for random fields. Theor. Probab. Appl. 22, 563-569; Teor. Veroiatnost. i Primenen. 22, 575-581.

Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. II, 2nd ed. Wiley, New York.

Kac, M. (1949). On deviations between theoretical and empirical distributions. Proc. Nat. Acad. Sci. USA 35, 252-257.
Leighton, T., and Shor, P. (1989). Tight bounds for minimax grid matching, with applications to the average case analysis of algorithms. Combinatorica 9, 161-187.

Pyke, R. (1968). The weak convergence of the empirical process with random sample size. Proc. Cambridge Philos. Soc. 64, 155-160.

Rhee, W. T., and Talagrand, M. (1988). Exact bounds for the stochastic upward matching problem. Trans. Amer. Math. Soc. 307, 109-125.

Schmidt, W. M. (1975). Irregularities of distribution IX. Acta Arith. 27, 385-396.

Shor, P. W. (1986). The average-case analysis of some on-line algorithms for bin packing. Combinatorica 6, 179-200.

Talagrand, M. (1994). Matching theorems and empirical discrepancy computations using majorizing measures. J. Amer. Math. Soc. 7, 455-537.
Appendix A Differentiating under an Integral Sign
The object here is to give some sufficient conditions for the equation

(A.1) (d/dt) ∫ f(x, t) dμ(x) = ∫ (∂f(x, t)/∂t) dμ(x).

Here x ∈ X, where (X, S, μ) is a measure space, and t is real-valued. The derivatives with respect to t will be taken at some point t = t_0. The function f will be defined for x ∈ X and t in an interval J containing t_0. Assume for the time being that t_0 is in the interior of J. Let's write

f_2(x, t_0) := ∂f(x, t)/∂t |_{t=t_0}.

Some assumptions are clearly needed for (A.1) even to make sense. A set F ⊂ L¹(X, S, μ) is called L¹-bounded iff sup{∫ |f| dμ : f ∈ F} < ∞. Here are some basic assumptions:

(A.2) The following two conditions hold:
(A.2a) for some δ > 0, the functions f(·, t) for |t − t_0| ≤ δ are L¹-bounded;
(A.2b) f_2(x, t_0) exists for μ-almost all x, and f_2(·, t_0) is in L¹(X, S, μ).
Here is an example to show that conditions (A.2a,b) are not sufficient:

Example. Let (X, μ) be [0, 1] with Lebesgue measure. Let J = [−1, 1], t_0 = 0. Let f(x, t) := 1/t if t > 0 and 0 < x ≤ t; otherwise let f(x, t) := 0. Then for all x ≠ 0, f(x, t) = 0 for t in a neighborhood of 0, so f_2(x, 0) = 0 for x ≠ 0. For each t > 0, ∫_0^1 f(x, t) dx = 1, while ∫_0^1 f(x, 0) dx = 0. So the function t ↦ ∫_0^1 f(x, t) dx is not even continuous, and so not differentiable, at t = 0, so (A.1) fails.
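The failure in this example can be illustrated numerically (a sketch; the midpoint rule and grid size are arbitrary choices):

```python
def f(x, t):
    # f(x, t) = 1/t if t > 0 and 0 < x <= t, else 0
    return 1.0 / t if (t > 0 and 0.0 < x <= t) else 0.0

def integral_in_x(t, n=100_000):
    # composite midpoint rule on [0, 1]
    h = 1.0 / n
    return sum(f((i + 0.5) * h, t) for i in range(n)) * h

# the integral jumps: 1 for every t > 0, but 0 at t = 0
assert abs(integral_in_x(0.1) - 1.0) < 1e-9
assert integral_in_x(0.0) == 0.0
# yet for each fixed x > 0, f(x, t) = 0 once t < x, so f2(x, 0) = 0
assert all(f(0.25, t) == 0.0 for t in (0.0, 0.01, 0.1, 0.2))
```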
The function f in the last example behaves badly near (0, 0). Another possibility has to do with behavior where x is unbounded, while f may be very
regular at finite points. Recall that a function f of two real variables x, t is called C^∞ if the partial derivatives ∂^{p+q}f/∂x^p ∂t^q exist and are continuous for all nonnegative integers p, q.

Example. For J = (−1, 1), X = R with Lebesgue measure μ, and t_0 = 0, there exists a C^∞ function f on X × J such that for 0 < t < 1, ∫_{−∞}^{∞} f(x, t) dx = 1, while for any x, f(x, t) = 0 for t in a neighborhood of 0 (depending on x), so f(x, 0) = f_2(x, 0) ≡ 0. Thus ∫_{−∞}^{∞} f_2(x, 0) dx = 0, while ∫_{−∞}^{∞} f(x, t) dx is not continuous (and so not differentiable) at 0, and (A.1) fails. To define such an f, let g be a C^∞ function on R with compact support, specifically g(x) := c exp(−1/(x(1 − x))) for 0 < x < 1 and g(x) := 0 elsewhere, where c is chosen to make ∫_{−∞}^{∞} g(x) dx = ∫_0^1 g(x) dx = 1. It can be checked (by L'Hospital's rule) that g and all its derivatives approach 0 as x ↓ 0 or x ↑ 1, so g is C^∞. Now let f(x, t) := g(x − t^{−1}) for t > 0, and f(x, t) := 0 otherwise. The stated properties then follow, since f(x, t) = 0 if t ≤ 0 or x ≤ 0 or 0 < t ≤ 1/x.

For interchanging two integrals, there is a standard theorem, the Tonelli-Fubini theorem (e.g., RAP, Theorem 4.4.5). But for interchanging a derivative and an integral there is apparently no such handy single theorem yielding (A.1). Some sufficient conditions will be given, beginning with rather general ones and going on to more special but useful conditions. A set F of integrable functions on (X, S, μ) is called uniformly integrable or u.i. if for every ε > 0 there exist both

(A.3)
a set A with μ(A) < ∞ such that ∫_{X\A} |f| dμ < ε for all f ∈ F, and

(A.4) an M < ∞ such that ∫_{|f|>M} |f| dμ < ε for all f ∈ F.

In the presence of (A.3), condition (A.4) is equivalent to:

(A.5) F is L¹-bounded, and for any ε > 0 there is a δ > 0 such that for any C ∈ S with μ(C) < δ and any f ∈ F, ∫_C |f| dμ < ε.
The best-known sufficient condition for uniform integrability is known as domination. The following is straightforward to prove, via (A.4):
A.6 Theorem If f ∈ L¹(X, S, μ) and f ≥ 0, then {g ∈ L¹(X, S, μ) : |g| ≤ f} is uniformly integrable.
Here |g| ≤ f means |g(x)| ≤ f(x) for all x ∈ X. But domination is not necessary for uniform integrability, as the following shows:

Example. Let (X, S, μ) be [0, 1] with Lebesgue measure. Let g(t, x) := |t − x|^{−1/2}, 0 ≤ t ≤ 1, 0 ≤ x ≤ 1. Then the set of all functions g(t, ·), 0 ≤ t ≤ 1, is uniformly integrable but not dominated.

Uniform integrability provides a general sufficient condition for (A.1). Let

Δ_h f(x, t_0) := f(x, t_0 + h) − f(x, t_0).

A.7 Theorem Assume (A.2) and that for some neighborhood U of 0,

(A.8) the functions Δ_h f(·, t_0)/h for 0 ≠ h ∈ U are uniformly integrable.

Then (A.1) holds, and moreover

(A.9) ∫ |Δ_h f(x, t_0)/h − f_2(x, t_0)| dμ(x) → 0 as h → 0.

Conversely, (A.9) and (A.2) imply (A.8).
Proof (A.2) implies that the difference quotients Δ_h f(x, t_0)/h converge pointwise for almost all x, as h → 0, to ∂f(x, t)/∂t|_{t=t_0}. Pointwise convergence and uniform integrability (A.8) imply L¹ convergence (A.9), as was proved in RAP (Theorem 10.3.6) for a sequence of functions on a probability space. The implication holds in the case here since:

(a) by (A.3), we can assume that, up to a difference of at most ε in all the integrals, the functions are defined on a finite measure space, which reduces by a constant multiple to a probability space;
(b) if (A.9) fails, then it fails along some sequence h = h_n → 0, so a proof for sequences is enough.

Thus (A.9) follows from the cited fact.

Now suppose (A.9) and (A.2) hold. Then given ε > 0, there is a set A of finite measure with ∫_{X\A} |f_2(x, t_0)| dμ(x) < ε/2, and for h small enough, ∫ |Δ_h f(x, t_0)/h − f_2(x, t_0)| dμ(x) < ε/2, so (A.3) holds for the functions Δ_h f/h, h ∈ U, as desired. For sequences of integrable functions on a probability space, convergence in L¹ implies uniform integrability (RAP, Theorem 10.3.6). Here again, we can reduce to the case of probability spaces, and if (A.8) fails for a given ε, then it fails along some sequence, contradicting the cited fact.
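As an illustration, the family g(t, ·) = |t − ·|^{−1/2} from the example above satisfies (A.4) with explicit constants: {g(t, ·) > M} = {x : |t − x| < M^{−2}}, and ∫_0^a u^{−1/2} du = 2a^{1/2} gives the exact tail integral used below (a small sketch; the grid of t values is arbitrary):

```python
from math import sqrt

def tail_integral(t, M):
    """Exact value of the integral of g(t,x) = |t-x|**(-1/2) over
    {x in [0,1] : g(t,x) > M} = {|t-x| < M**-2}, using the
    antiderivative: integral of u**(-1/2) on [0,a] equals 2*sqrt(a)."""
    d = M ** -2
    return 2 * sqrt(min(t, d)) + 2 * sqrt(min(1 - t, d))

for M in (10, 100, 1000):
    assert all(tail_integral(i / 100, M) <= 4 / M + 1e-12 for i in range(101))
```

So the tail integrals are at most 4/M, uniformly in t, which is (A.4). By contrast, no dominating function exists: sup_{0≤t≤1} g(t, x) = +∞ for every x, since t may be taken equal to x.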
But given (A.2), the uniform integrability condition (A.8), necessary for (A.9), is not necessary for (A.1), as is shown by a modification of a previous example:

Example. Let (X, μ) = [−1, 1] with Lebesgue measure, J = [−1, 1], and t_0 = 0. If t > 0, let f(x, t) = 1/t if 0 < x ≤ t, f(x, t) = −1/t if −t ≤ x < 0, and f(x, t) = 0 otherwise. Then ∫_{−1}^{1} f(x, t) dx = 0 for all t, and ∫_{−1}^{1} f_2(x, 0) dx = 0 since f_2(x, 0) = 0, so (A.1) holds.

Domination of the difference quotients by an integrable function gives the classical "Weierstrass" sufficient condition for (A.1), which follows directly from Theorems A.6 and A.7:

A.10 Corollary If (A.2) holds and for some neighborhood U of 0 there is a g ∈ L¹(X, S, μ) such that |Δ_h f(x, t_0)/h| ≤ g(x) for 0 ≠ h ∈ U, then (A.1) holds.
Corollary A.10 is highly useful, but it is not as directly applicable as the Tonelli-Fubini theorem on interchanging integrals. In some cases, even when a dominating function g ∈ L¹ exists, it may not be easy to choose such a g for which integrability can be proved conveniently. We can take the neighborhoods U to be sets {h : |h| < δ}, for δ > 0. For a given δ > 0, the smallest possible dominating function would be

g_δ(x) := sup{|Δ_h f(x, t_0)/h| : 0 < |h| < δ}.

If (X, S) can be taken to be a complete separable metric space with its σ-algebra of Borel sets, and if f is jointly measurable, then each g_δ is measurable for the completion of μ (RAP, Section 13.2). If so, then the hypothesis of Corollary A.10, beside (A.2), is equivalent to saying that

(A.11) for some δ > 0, g_δ ∈ L¹(μ).

So far, nothing required the partial derivatives f_2(x, t) to exist for t ≠ t_0, but if they do, we get other sufficient conditions:

A.12 Corollary Suppose (A.2) holds, f(·, ·) is jointly measurable, and there is a neighborhood U of t_0 such that for all t ∈ U, f_2(x, t) exists for almost all x, and the functions f_2(·, t) for t ∈ U are uniformly integrable. Suppose also that for t ∈ U, f(x, t) − f(x, t_0) = ∫_{t_0}^{t} f_2(x, s) ds for μ-almost all x. Then (A.8), and so (A.9) and (A.1), hold.
Proof The functions f_2(·, t) for t ∈ U, being uniformly integrable, are L¹-bounded. We can take U to be a bounded interval. Then

∫_U ∫ |f_2(x, t)| dμ(x) dt < ∞.

Now for |h| small enough,

|Δ_h f(x, t_0)/h| = |h|^{−1} |∫_0^h f_2(x, t_0 + u) du| ≤ |h|^{−1} |∫_0^h |f_2(x, t_0 + u)| du|.

Thus the difference quotients Δ_h f(·, t_0)/h are also L¹-bounded for h in a neighborhood of 0. Condition (A.3) for the functions f_2(·, t) implies it for the functions Δ_h f(·, t_0)/h. Also, condition (A.5) for the functions f_2(·, t_0 + u) directly implies (A.5) and (A.8) for the functions Δ_h f(x, t_0)/h, as desired.

Three clear consequences of Corollary A.10 will be given.
A.13 Corollary Suppose f(x, t) ≡ G(x, t)H(x) for measurable functions G(·, ·) and H, where (A.2) holds for f. Suppose that for h in a neighborhood of 0, |Δ_h G(x, t_0)/h| ≤ α(x) for a function α such that ∫ |α(x)H(x)| dμ(x) < ∞. Then (A.8), (A.9), and (A.1) hold.
A.14 Corollary Let G(x, t) ≡ φ(x − t), where φ is bounded and has a bounded, continuous first derivative, X = R, μ = Lebesgue measure, and H ∈ L¹(R). Then the convolution integral (φ * H)(t) = ∫_{−∞}^{∞} φ(t − x)H(x) dx has a first derivative with respect to t given by (φ * H)'(t) = ∫_{−∞}^{∞} φ'(t − x)H(x) dx. If the jth derivative φ^{(j)} is continuous and bounded for j = 0, 1, ..., n, then iterating gives (φ * H)^{(n)}(t) = ∫_{−∞}^{∞} φ^{(n)}(t − x)H(x) dx.
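Corollary A.14 can be illustrated numerically; here φ(x) = e^{−x²} and H(x) = e^{−|x|} are arbitrary choices satisfying the hypotheses, and the quadrature grid and tolerance are ad hoc:

```python
from math import exp

phi = lambda x: exp(-x * x)            # bounded, bounded continuous derivative
dphi = lambda x: -2 * x * exp(-x * x)  # phi'
H = lambda x: exp(-abs(x))             # integrable

def quad(F, a=-12.0, b=12.0, n=24_000):
    # composite midpoint rule; integrands here decay fast, so truncation is safe
    h = (b - a) / n
    return sum(F(a + (i + 0.5) * h) for i in range(n)) * h

conv = lambda t: quad(lambda x: phi(t - x) * H(x))        # (phi * H)(t)
conv_prime = lambda t: quad(lambda x: dphi(t - x) * H(x))  # (phi' * H)(t)

t, h = 1.0, 1e-4
finite_diff = (conv(t + h) - conv(t - h)) / (2 * h)
assert abs(finite_diff - conv_prime(t)) < 1e-6
```

The central finite difference of the convolution matches convolving with φ' directly, as the corollary asserts.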
A.15 Corollary Let f(x, t) = e^{itψ(x)}H(x), where ψ and H are measurable real-valued functions on X such that H and ψH are integrable. Then (A.8), (A.9), and (A.1) hold.

Proof Let G(x, t) := e^{itψ(x)}. For any real t, u, and h, |e^{i(t+h)u} − e^{itu}| ≤ |uh| (RAP, proof of Theorem 9.4.4). So |Δ_h G(x, t)/h| ≤ |ψ(x)|. Since (∂/∂t)(e^{itψ(x)}H(x)) = iψ(x)e^{itψ(x)}H(x) and |iψ(x)e^{itψ(x)}| = |ψ(x)|, conditions (A.2) hold and Corollary A.13 applies.
Note. In Corollary A.15, if X = R and ψ(x) ≡ x, we have Fourier transform integrals.
A.16 Proposition Let i be a measurable function of x and
rl(t) :=
J
et*(x)dµ(x) < 00
for all t in an open interval U. Then rl is an analytic function having a power series expansion in t in a neighborhood of any to E U. Derivatives of n of all orders can be found by differentiating under the integral,
ri(")(t) =
f(x)net'd/2(x) , n = 1, 2,
,
for all t E U.
Proof Take $t_0 \in U$. Replacing $\mu$ by $\nu$ where $d\nu(x) = \exp(t_0\psi(x))\,d\mu(x)$, we can assume $t_0 = 0$. Then for some $s > 0$, $\{t\colon |t| \le s\} \subset U$, and for $|t| \le s$, $e^{t\psi(x)} \le e^{s\psi(x)} + e^{-s\psi(x)}$, an integrable function. Thus $\eta(t) = \sum_{n=0}^{\infty} t^n \int \psi(x)^n\,d\mu(x)/n!$, where the series converges absolutely by dominated convergence, since the corresponding series for $e^{|t\psi(x)|}$ does and is a series of positive terms dominating those for $e^{t\psi(x)}$. The derivatives of $\eta$ at 0 satisfy $\eta^{(n)}(0) = \int \psi(x)^n\,d\mu(x)$: this holds either by the theorem that for a function $\eta$ represented by a power series $\sum_{n=0}^{\infty} c_n t^n$ in a neighborhood of 0, we must have $c_n = \eta^{(n)}(0)/n!$ (converse of Taylor's theorem), or by the methods of this Appendix as follows. For any real $y$, $|e^y - 1| \le |y|(e^y + 1)$ by the mean value theorem, since $e^t \le e^y + 1$ for $t$ between 0 and $y$. Thus for any real $u$ and $h$, $|(e^{uh} - 1)/h| \le |u|(e^{uh} + 1)$. For any $\delta > 0$, there is a constant $C = C_\delta$ such that $|u| \le C(e^{\delta u} + e^{-\delta u})$ for all real $u$. If $|h| \le \delta := s/2$, then since $e^{uh} \le e^{\delta u} + e^{-\delta u}$ and $e^{cu} + e^{-cu}$ is increasing in $u \ge 0$ for any real $c$, we have $|(e^{uh} - 1)/h| \le 3C_\delta(e^{su} + e^{-su})$.
Letting $u = \psi(x)$, we have domination by an integrable function, so Corollary A.10 applies for the first derivative. The proof extends to higher derivatives since $|\psi(x)|^n \le K e^{\delta|\psi(x)|}$ for some $K = K_{n,\delta}$.
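For a concrete instance of Proposition A.16, take $\mu$ to be the standard Gaussian law (our own choice, not from the text): then $\eta(t) = e^{t^2/2}$, and $\eta^{(n)}(0)$ is the $n$th Gaussian moment, 0 for odd $n$ and $(n-1)!!$ for even $n$. The power series from the proof can be summed and compared with the closed form:

```python
import math

# Illustration of Proposition A.16 for a case chosen here: mu is the
# standard Gaussian, so eta(t) = exp(t^2 / 2) and eta^{(n)}(0) equals the
# n-th Gaussian moment: 0 for odd n, (n-1)!! for even n.
def gaussian_moment(n):
    # E X^n for X ~ N(0,1)
    return 0 if n % 2 else math.prod(range(1, n, 2))

def eta_series(t, terms=60):
    # power series sum_n eta^{(n)}(0) t^n / n!  (converse of Taylor's theorem)
    return sum(gaussian_moment(n) * t ** n / math.factorial(n)
               for n in range(terms))

t = 0.7
series_val = eta_series(t)
closed_form = math.exp(t * t / 2)
```

The truncated series matches $e^{t^2/2}$ to machine precision, illustrating the absolute convergence established in the proof.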
Notes
Perhaps the most classical result on interchange of integral and derivative is attributed to Leibniz: if $f$ and $\partial f/\partial y$ are continuous on a finite rectangle $[a, b] \times [c, d]$, then $\int_a^b f(x, y)\,dx$ can be differentiated with respect to $y$ under the integral sign.
Let $\mathcal{F}$ be a family of real-valued continuous functions defined on a half-line, say $[1, \infty)$. A function $f$ in $\mathcal{F}$ will be said to have an (improper Riemann) integral if the integrals $\int_1^M f(x)\,dx$ converge to a finite limit as $M \to \infty$, which may then be called $\int_1^\infty f(x)\,dx$. This integral is analogous to the (conditional) convergence of the sum of a series, where Lebesgue integrability corresponds to absolute convergence. An example of a non-Lebesgue-integrable function having a finite improper Riemann integral is $(\sin x)/x$. The family $\mathcal{F}$ is called uniformly (improper Riemann) integrable if the convergence is uniform for $f \in \mathcal{F}$. For a continuous function $f$ of two real variables having a continuous partial derivative $\partial f/\partial y$, uniform improper Riemann integrability with respect to $x$, say on a half-line $[a, \infty)$, of $\partial f(x, y)/\partial y$ for $y$ in some interval, and finiteness of $\int_a^\infty f(x, y)\,dx$, imply that the latter integral can be differentiated with respect to $y$ under the integral sign; see, for example, Il'in and Pozniak (1982, Theorem 9.10).
Il'in and Pozniak (1982, Theorem 9.8) prove what they call Dini's test: if $f$ is nonnegative and continuous on $[a, \infty) \times [c, d]$, where $-\infty < c < d < +\infty$, and for each $y \in [c, d]$, $I(y) := \int_a^\infty f(x, y)\,dx < \infty$ and $I(\cdot)$ is continuous, then the $f(\cdot, y)$ for $c \le y \le d$ are uniformly (improper Riemann) integrable; in this case they are also uniformly integrable in our sense. Of course, the same holds if $f$ is nonpositive, which would apply to the derivative of a positive, decreasing function.
The Weierstrass domination condition, Corollary A.10, is classical and appears in most of the listed references. Hobson (1926, p. 355) states it (for functions of real variables) and gives references to several works published between 1891 and 1910.
Lang (1993) mentions the convolution case, Corollary A.14. Lang also treats differentiation of integrals $\int f(t, x)\,dt$ where $x \in E$, $f$ has values in $F$, and $E$, $F$ are Banach spaces. Il'in and Pozniak (1982, Theorem 9.5) and Kartashev and Rozhdestvenskii (1984) treat differentiation of integrals $\int_{a(y)}^{b(y)} f(x, y)\,dx$ with limits of integration depending on $y$.
Brown (1986, Chapter 2) is a reference for Proposition A.16 and some multidimensional extensions, for application to exponential families in statistics.
References

Brown, Lawrence D. (1986). Fundamentals of Statistical Exponential Families. Inst. Math. Statist. Lecture Notes-Monograph Ser. 9.
Hobson, E. W. (1926). The Theory of Functions of a Real Variable and the Theory of Fourier's Series, vol. 2, 2d ed. Repr. Dover, New York, 1957.
Il'in, V. A., and Pozniak, E. G. (1982). Fundamentals of Mathematical Analysis. Transl. from the 1980 4th Russian ed. by V. Shokurov. Mir, Moscow.
Kartashev, A. P., and Rozhdestvenskii, B. L. (1984). Matematicheskii Analiz (Mathematical Analysis; in Russian). Nauka, Moscow. French transl.: Kartachev, A., and Rojdestvenski, B., Analyse mathématique, transl. by Djilali Embarek. Mir, Moscow, 1988.
Lang, Serge (1993). Real and Functional Analysis, 3d ed. of Real Analysis. Springer, New York.
Appendix B Multinomial Distributions
Consider an experiment whose outcome is given by a point $\omega$ of a set $\Omega$ where $(\Omega, \mathcal{A}, P)$ is a probability space. Let $A_i$, $i = 1, \ldots, m$, be disjoint measurable sets with union $\Omega$. Independent repetition of the experiment $n$ times is represented by taking the Cartesian product $\Omega^n$ of $n$ copies of $(\Omega, \mathcal{A}, P)$ (RAP, Theorem 4.4.6). Let $X = \{X_i\}_{i=1}^n \in \Omega^n$ and $P_n(A) := n^{-1}\sum_{i=1}^n \delta_{X_i}(A)$, $A \in \mathcal{A}$, so $P_n$ is an empirical measure for $P$, and $nP_n(A_i)$ is the number of times the event $A_i$ occurs in the $n$ repetitions.
A random vector $\{n_j\}_{j=1}^m$ is said to have a multinomial distribution for $n$ observations, or with sample size $n$, and $m$ bins or categories, with probabilities
$\{p_j\}_{j=1}^m$, if $p_j \ge 0$, $p_1 + \cdots + p_m = 1$, and for any nonnegative integers $k_j$ with $k_1 + \cdots + k_m = n$,

(B.1) $\quad \Pr\{n_j = k_j,\ j = 1, \ldots, m\} = \dfrac{n!}{k_1!\,k_2!\cdots k_m!}\,p_1^{k_1}\cdots p_m^{k_m}.$
Otherwise, that is, if the $k_j$ are not nonnegative integers or don't sum to $n$, the probabilities are 0. Note that in a multinomial distribution, since the probability is positive only when $k_m = n - \sum_{i<m} k_i$, we have $n_m = n - \sum_{i<m} n_i$. Also, $p_m = 1 - \sum_{i<m} p_i$. So if $p_1, \ldots, p_{m-1}$ are nonnegative and their sum is less than or equal to 1, it also makes sense to say that the $\{n_j\}_{j<m}$ have a multinomial distribution for $m$ (not $m - 1$) bins and probabilities $\{p_j\}_{j<m}$ if $\Pr\{n_j = k_j,\ j < m\}$ is given by the right side of (B.1) whenever the $k_j$ are nonnegative integers whose sum is at most $n$, and where $k_m$ and $p_m$ are determined as just described.
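As a quick sketch of (B.1) with the standard library (the values of $n$ and the $p_j$ below are arbitrary choices of ours), the probabilities can be computed directly and checked to sum to 1 over all vectors of bin counts summing to $n$:

```python
from math import factorial
from itertools import product

# The multinomial pmf of (B.1); parameters are illustrative.
def multinomial_pmf(ks, ps):
    # Pr{n_j = k_j} = n!/(k_1! ... k_m!) p_1^{k_1} ... p_m^{k_m}
    coef = factorial(sum(ks))
    for k in ks:
        coef //= factorial(k)  # each intermediate quotient is an integer
    prob = float(coef)
    for k, p in zip(ks, ps):
        prob *= p ** k
    return prob

n, ps = 5, (0.2, 0.3, 0.5)
# probabilities over all (k_1, k_2, k_3) with k_1 + k_2 + k_3 = n sum to 1
total = sum(multinomial_pmf(ks, ps)
            for ks in product(range(n + 1), repeat=3) if sum(ks) == n)
```

The total equals $(p_1 + p_2 + p_3)^n = 1$ by the multinomial theorem.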
B.2 Theorem For any probability space $(\Omega, \mathcal{A}, P)$ and any disjoint measurable sets $A_i$, $i = 1, \ldots, m$, with union $\Omega$, $\{nP_n(A_i)\}_{i=1}^m$ have a multinomial distribution for $n$ observations and $m$ categories with probabilities $\{P(A_i)\}_{i=1}^m$. The same holds if we take $i = 1, \ldots, m - 1$.
Proof Use induction on $m$. For $m = 2$, $nP_n(A_1)$ has a binomial distribution: specifically, $nP_n(A_1)$ is the number of successes, and $nP_n(A_2)$ the number of failures, in $n$ independent trials with probability $p = P(A_1)$ of success on each trial. To evaluate $\Pr(nP_n(A_1) = k)$, for $k = 0, \ldots, n$, there are $\binom{n}{k}$ ways to choose $k$ of the $n$ trials. For each way, the probability that just these trials are successes is $p^k(1-p)^{n-k}$. Then $nP_n(A_2) = n - k$ and the stated result holds.
For the induction step from $m$ to $m + 1$, apply the case of $m$ bins to $A_1 \cup A_2, A_3, \ldots, A_{m+1}$. Then, given that $nP_n(A_1 \cup A_2) = k_1 + k_2$, the conditional probability that $n_1 = k_1$ and $n_2 = k_2$ is binomial. Checking that

$$\binom{k_1+k_2}{k_1}\left(\frac{p_1}{p_1+p_2}\right)^{k_1}\left(\frac{p_2}{p_1+p_2}\right)^{k_2}\cdot\frac{(p_1+p_2)^{k_1+k_2}}{(k_1+k_2)!} \;=\; \frac{p_1^{k_1}\,p_2^{k_2}}{k_1!\,k_2!}$$

finishes the proof.
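The identity used in the induction step can be confirmed numerically for sample values of $p_1$, $p_2$, $k_1$, $k_2$ (chosen arbitrarily here):

```python
from math import comb, factorial

# Check of the induction-step identity; parameter values are arbitrary.
p1, p2 = 0.25, 0.4
k1, k2 = 3, 5

# binomial conditional probability times the merged-bin multinomial factor
lhs = (comb(k1 + k2, k1)
       * (p1 / (p1 + p2)) ** k1
       * (p2 / (p1 + p2)) ** k2
       * (p1 + p2) ** (k1 + k2) / factorial(k1 + k2))
# the two-bin factor of the (m+1)-bin multinomial probability
rhs = p1 ** k1 * p2 ** k2 / (factorial(k1) * factorial(k2))
```

The powers of $p_1 + p_2$ and the factorials $(k_1 + k_2)!$ cancel, leaving the right side exactly.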
B.3 Theorem Suppose $\{n_j\}_{j=1}^m$ have a multinomial distribution for $n$ observations and $m$ bins with probabilities $\{p_j\}_{j=1}^m$. Let $T$ be a subset of $\{1, \ldots, m\}$. Then
(a) $\{n_i\}_{i\in T}$ have a multinomial distribution for $n$ observations and $\mathrm{card}(T) + 1$ bins, with probabilities $\{p_i\}_{i\in T}$.
(b) Let $S$ be a subset of $\{1, \ldots, m\}$ disjoint from $T$. Then the conditional distribution, or law, $\mathcal{L}(\{n_i\}_{i\in S} \mid \{n_j\}_{j\in T})$ depends on $\{n_j\}_{j\in T}$ only through $\sum_{j\in T} n_j$; in other words, for any integers $k_j \ge 0$, $j \in T$,

$$\mathcal{L}\bigl(\{n_i\}_{i\in S} \bigm| n_j = k_j \text{ for all } j \in T\bigr) \;=\; \mathcal{L}\Bigl(\{n_i\}_{i\in S} \Bigm| \sum_{j\in T} n_j = \sum_{j\in T} k_j\Bigr).$$

(c) If $S \cup T = \{1, \ldots, m\}$, this distribution is multinomial for $n - \sum_{j\in T} k_j$ observations in $\mathrm{card}(S)$ bins with probabilities $\bigl\{p_j/\sum_{i\in S} p_i\colon j \in S\bigr\}$.
Proof We can assume that for some $r$, $T = \{r+1, \ldots, m\}$. The probability that $n_j = k_j$ for $r < j \le m$ is a sum of multinomial probabilities,

$$\sum\Bigl\{\, n!\Bigl(\prod_{j>r} p_j^{k_j}/k_j!\Bigr) \prod_{j\le r} p_j^{K_j}/K_j! \;\colon\; K_j \text{ integers} \ge 0,\ K := K_1 + \cdots + K_r = n - \sum_{j>r} k_j \Bigr\}.$$

The sum $\sum\bigl(\prod_{j\le r} p_j^{K_j}/K_j!\bigr)$ equals $(p_1 + \cdots + p_r)^K/K!$ by the multinomial theorem, and (a) follows.
(b) We can assume $S \cup T = \{1, \ldots, m\}$ since the joint distribution of $n_i$, $i \in S$, is determined by that of $n_j$, $j \notin T$. Thus $S = \{1, \ldots, r\}$. We need to evaluate probabilities of the form

$$\Pr(n_i = k_i,\ i = 1, \ldots, r \mid n_j = k_j,\ j > r) = \Pr(n_i = k_i,\ i = 1, \ldots, m)/\Pr(n_j = k_j,\ j > r).$$

The numerator is a simple multinomial probability as in (B.1). The denominator is a multinomial probability by part (a). Carrying out the division, the $k_j!$ terms for $j > r$ cancel. The resulting expression, as claimed, depends on the $k_j$ for $j > r$ only through their sum, and (c) it is of the stated multinomial form, by the proof of (a).
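Part (a) can be checked numerically in the smallest nontrivial case, $m = 3$ and $T = \{1\}$ (all numbers below are illustrative choices): summing the joint probabilities over the other bin count recovers the binomial law of $n_1$, i.e., the multinomial law with the two bins $A_1$ and its complement.

```python
from math import comb, factorial
from itertools import product

# Check of Theorem B.3(a) for m = 3, T = {1}; parameters are illustrative.
def multi_pmf(ks, ps):
    coef = factorial(sum(ks))
    for k in ks:
        coef //= factorial(k)
    out = float(coef)
    for k, p in zip(ks, ps):
        out *= p ** k
    return out

n, ps = 6, (0.2, 0.3, 0.5)
k1 = 2
# marginal of n_1: sum the trinomial pmf over all values of n_2
marginal = sum(multi_pmf((k1, k2, n - k1 - k2), ps)
               for k2 in range(n - k1 + 1))
# binomial(n, p_1) probability of k1 successes
binom = comb(n, k1) * ps[0] ** k1 * (1 - ps[0]) ** (n - k1)
```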
Appendix C Measures on Nonseparable Metric Spaces
Let $(S, d)$ be a metric space. Under fairly general conditions, to be given, any probability measure on the Borel sets will be concentrated in a separable subspace. The problem will be reduced to one about discrete spaces.
An open cover of $S$ will be a family $\{U_\alpha\}_{\alpha\in I}$ of open subsets $U_\alpha$ of $S$, where $I$ is any set, here called an index set, such that $S = \bigcup_{\alpha\in I} U_\alpha$. An open cover $\{V_\beta\}_{\beta\in J}$ of $S$, with some index set $J$, will be called a refinement of $\{U_\alpha\}_{\alpha\in I}$ if for all $\beta \in J$ there exists an $\alpha \in I$ with $V_\beta \subset U_\alpha$. An open cover $\{U_\alpha\}_{\alpha\in I}$ will be called $\sigma$-discrete if $I$ is the union of a sequence of sets $I_n$ such that for each $n$ and $\alpha \ne \beta$ in $I_n$, $U_\alpha$ and $U_\beta$ are disjoint. Recall that the ball $B(x, r)$ is defined as $\{y \in S\colon d(x, y) < r\}$. For two sets $A$, $B$ we have $d(A, B) := \inf\{d(x, y)\colon x \in A,\ y \in B\}$, and $d(y, B) := d(\{y\}, B)$ for a point $y$.

C.1 Theorem For any metric space $(S, d)$, any open cover $\{U_\alpha\}_{\alpha\in I}$ of $S$ has an open $\sigma$-discrete refinement.
Proof For any open set $U \subset S$ let $U_n := \{y\colon B(y, 2^{-n}) \subset U\}$. Then $y \in U_n$ if and only if $d(y, S \setminus U) \ge 2^{-n}$, and $d(U_n, S \setminus U_{n+1}) \ge 2^{-n} - 2^{-n-1} = 2^{-n-1}$ by the triangle inequality. (The sets $U_n$ are always closed and not usually open.) Let $U_{\alpha,n} := (U_\alpha)_n$. By well-ordering (e.g., RAP, Section 1.5), we can assume that the index set $I$ is well-ordered by a relation $<$. For each positive integer $n$ and each $\alpha \in I$ let $W_{\alpha,n} := U_{\alpha,n} \setminus \bigcup\{U_{\beta,n+1}\colon \beta < \alpha\}$. For each $\alpha \ne \beta$ and each $n$, either $W_{\alpha,n} \subset S \setminus U_{\beta,n+1}$ if $\beta < \alpha$, or $W_{\beta,n} \subset S \setminus U_{\alpha,n+1}$ if $\alpha < \beta$. In either case, $d(W_{\alpha,n}, W_{\beta,n}) \ge 2^{-n-1}$. Let $V_{\alpha,n} := \{x\colon d(x, W_{\alpha,n}) < 2^{-n-3}\}$. Then $d(V_{\alpha,n}, V_{\beta,n}) \ge 2^{-n-2}$. The sets $V_{\alpha,n}$ are all open. For a given $n$ and different values of $\alpha$, they are disjoint. Let $J := \mathbb{N} \times I$ and define $V_\gamma$, $\gamma \in J$, by $V_{(n,\alpha)} := V_{\alpha,n}$. To show that $\{V_\gamma\}_{\gamma\in J}$ is a cover of $S$, given any $x \in S$, take the least $\alpha$ such that $x \in U_\alpha$. Then for $n$ large enough, $x \in U_{\alpha,n}$, and
then $x \in W_{\alpha,n} \subset V_{\alpha,n}$ as desired. So $\{V_\gamma\}_{\gamma\in J}$ is a $\sigma$-discrete refinement of $\{U_\alpha\}_{\alpha\in I}$.
Cardinal numbers are defined in RAP, in the last part of Appendix A (as smallest ordinals with given cardinality). A cardinal number $\kappa$ is said to be measurable if for a set $S$ of cardinality $\kappa$, there exists a probability measure $P$ defined on all subsets of $S$ which has no point atoms; in other words, $P(\{x\}) = 0$ for all $x \in S$. If there is no such $P$, $\kappa$ is said to be of measure 0. The continuum hypothesis implies that the cardinality $c$ of the continuum (that is, of $[0, 1]$) is of measure 0 (RAP, Appendix C). The separability character of a metric space is the smallest cardinality of a dense subset. We have:
C.2 Theorem Let $(S, d)$ be a metric space. Let $P$ be a probability measure on the $\sigma$-algebra of Borel sets, generated by the open sets. Then either there is a separable subspace $T$ with $P(T) = 1$, or the separability character of $S$ is measurable.

Proof Let $\{x_\alpha\}_{\alpha\in I}$ be dense in $S$, where $I$ has cardinality $\kappa$, the separability character of $S$. For a fixed positive integer $m$, the balls $U_\alpha := B(x_\alpha, 1/m)$ form an open cover of $S$. Take an open, $\sigma$-discrete refinement. Then $S$ is the union of countably many open sets $V_n$, each of which is the union of open sets $V_{\gamma,n}$, $\gamma \in \Gamma_n$, disjoint for different values of $\gamma$, and each included in some ball of radius $2/m$. Here each $\Gamma_n$ has cardinality at most $\kappa$. If a cardinal has measure 0, then so, clearly, do all smaller cardinals. Define a measure $\mu_n$ on each $\Gamma_n$ by $\mu_n(A) := P(\bigcup_{\gamma\in A} V_{\gamma,n})$, where $P$ is defined on all open sets. So if $\kappa$ has measure 0, then for each $n$ there is a countable subset $G_n$ of $\Gamma_n$ such that $P(V_n) = \sum\{P(V_{\gamma,n})\colon \gamma \in G_n\}$. So $P(C_m) = 1$ where $C_m$ is a countable union (over $n$ and members of $G_n$ for each $n$) of balls of radius $2/m$. Taking an intersection over $m$, we see that $P(C) = 1$ for some separable subset $C$.
It is consistent with the usual axioms of set theory (including the axiom of choice) that there are no measurable cardinals, in other words all cardinals are of measure 0; see, for example, Drake (1974, pp. 67-68, 177-178). It is apparently unknown whether existence of measurable cardinals is consistent (Drake, 1974, pp. 185-186). So, for practical purposes, a probability measure defined on the Borel sets of a metric space is always concentrated in some separable subspace. Here is another fact giving separability:
C.3 Theorem Let f be a Borel measurable function from a separable metric space S into a metric space T. Then, assuming the continuum hypothesis, f has separable range.
Proof If the range of $f$ is nonseparable, then for some $r > 0$ there is an uncountable set $F$ of values of $f$, any two of which are more than $r$ apart. The continuum hypothesis implies that $F$ has cardinality at least $c$. All the $2^c$ or more subsets of $F$ are closed, so all their inverse images under $f$ are distinct Borel sets. On the other hand, in a separable metric space there are at most $c$ Borel sets: for this we can assume the space is complete. Then the larger collection of analytic sets has cardinality at most $c$ by the Borel isomorphism and universal analytic set theorems (RAP, Theorems 13.1.1, 13.2.1, and 13.2.4), since $\mathbb{N}^\infty$ has cardinality $c$. So there is a contradiction.
Notes
Theorem C.1 on $\sigma$-discrete refinements is from Kelley (1955, Theorem 4.21, p. 129), who attributes it to A. H. Stone (1948). Theorem C.2 is due to Marczewski and Sikorski (1948), who prove it by a somewhat different method.
References

Drake, F. R. (1974). Set Theory: An Introduction to Large Cardinals. North-Holland, Amsterdam.
Kelley, John L. (1955). General Topology. Van Nostrand, Princeton. Repr. Springer, New York, 1975.
Marczewski, E., and Sikorski, R. (1948). Measures in non-separable metric spaces. Colloq. Math. 1, 133-139.
Stone, Arthur H. (1948). Paracompactness and product spaces. Bull. Amer. Math. Soc. 54, 631-632.
Appendix D An Extension of Lusin's Theorem
Lusin's theorem says that for any measurable real-valued function $f$, on $[0, 1]$ with Lebesgue measure $\lambda$ for example, and $\varepsilon > 0$, there is a set $A$ with $\lambda(A) < \varepsilon$ such that restricted to the complement of $A$, $f$ is continuous. Here $[0, 1]$ can be replaced by any normal topological space and $\lambda$ by any finite measure $\mu$ which is closed regular, meaning that for each Borel measurable set $B$, $\mu(B) = \sup\{\mu(F)\colon F \text{ closed},\ F \subset B\}$ (RAP, Theorem 7.5.2). Recall that any finite Borel measure on a metric space is closed regular (RAP, Theorem 7.1.3).
Proofs of Lusin's theorem are often based on Egorov's theorem (RAP, Theorem 7.5.1), which says that if measurable functions $f_n$ from a finite measure space to a metric space converge pointwise, then for any $\varepsilon > 0$ there is a set of measure less than $\varepsilon$ outside of which the $f_n$ converge uniformly. Here, the aim will be to extend Lusin's theorem to functions having values in any separable metric space. The proof of Lusin's theorem in RAP, however, also relied on the Tietze-Urysohn extension theorem, which says that a continuous real-valued function on a closed subset of a normal space can be extended to be continuous on the whole space. Such an extension may not exist for some range spaces: for example, the identity from $\{0, 1\}$ onto itself doesn't extend to a continuous function from $[0, 1]$ onto $\{0, 1\}$; in fact there is no such function since $[0, 1]$ is connected. It turns out, however, that the Tietze-Urysohn extension and Egorov's theorem are both unnecessary in proving Lusin's theorem:
D.1 Theorem Let $(X, \mathcal{T})$ be a topological space and $\mu$ a finite, closed regular measure defined on the Borel sets of $X$. Let $f$ be a Borel measurable function from $X$ into $S$ where $(S, d)$ is a separable metric space. Then for any $\varepsilon > 0$ there is a closed set $F$ with $\mu(X \setminus F) < \varepsilon$ such that $f$ restricted to $F$ is continuous.
Proof We can assume $\mu(X) = 1$. Let $\{s_n\}_{n\ge 1}$ be a countable dense set in $S$. For $m = 1, 2, \ldots$, and any $x \in X$, let $f_m(x) = s_n$ for the least $n$ such that $d(f(x), s_n) < 1/m$. Then $f_m$ is measurable and defined on all of $X$. For each $m$, let $n(m)$ be large enough so that

$$\mu\{x\colon d(f(x), s_n) \ge 1/m \text{ for all } n \le n(m)\} < 2^{-m}.$$

For $n = 1, \ldots, n(m)$, take a closed set $F_{mn} \subset f_m^{-1}\{s_n\}$ with

$$\mu\bigl(f_m^{-1}\{s_n\} \setminus F_{mn}\bigr) < \frac{1}{2^m n(m)}$$

by closed regularity. For each fixed $m$, the sets $F_{mn}$ are disjoint for different values of $n$. Let $F_m := \bigcup_{n=1}^{n(m)} F_{mn}$. Then $f_m$ is continuous on $F_m$. By choice of $n(m)$ and the $F_{mn}$, $\mu(F_m) > 1 - 2/2^m$. Since $d(f_m, f) < 1/m$ everywhere, clearly $f_m \to f$ uniformly (so Egorov's theorem is not needed). For $r = 1, 2, \ldots$, let $H_r := \bigcap_{m\ge r} F_m$. Then $H_r$ is closed and $\mu(H_r) > 1 - 4/2^r$. Take $r$ large enough so that $4/2^r < \varepsilon$. Then $f$ restricted to $H_r$ is continuous as the uniform limit of continuous functions $f_m$ on $H_r \subset F_m$, $m \ge r$, so we can let $F = H_r$ to finish the proof.
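The approximating functions $f_m$ are easy to construct concretely. The following sketch (with $S = \mathbb{R}$, a dyadic-rational dense sequence, and an arbitrary $f$, all our own illustrative choices) exhibits the bound $d(f_m, f) < 1/m$ that makes $f_m \to f$ uniformly:

```python
# Sketch of the functions f_m from the proof, for S = R with the usual
# metric. The dense sequence, f, and the sample grid are choices of ours.
def dense_seq(count):
    # enumerate a dense-enough sequence s_1, s_2, ... in R: dyadic rationals
    out = [0.0]
    denom = 1
    while len(out) < count:
        for num in range(1, 8 * denom + 1):
            out.append(num / denom)
            out.append(-num / denom)
        denom *= 2
    return out[:count]

S_DENSE = dense_seq(5000)

def f(x):
    # an arbitrary measurable (here even continuous) function into S = R
    return x * x - 0.5

def f_m(x, m):
    # f_m(x) = s_n for the least n with d(f(x), s_n) < 1/m
    for s in S_DENSE:
        if abs(f(x) - s) < 1.0 / m:
            return s
    raise ValueError("dense sequence truncated too early for this m")

m = 20
# d(f_m, f) < 1/m at every sample point, by construction
max_err = max(abs(f_m(x / 100.0, m) - f(x / 100.0)) for x in range(-100, 101))
```

Each $f_m$ has countable range, and cutting the range down to the first $n(m)$ values, as in the proof, is what makes the closed-regularity step possible.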
Notes
Lusin's theorem for measurable functions with values in any separable metric space, on a space with a closed regular finite measure, was first proved as far as I know by Schaerf (1947), who proved it for f with values in any second-countable topological space (i.e., a space having a countable base for its topology), and for more general domain spaces ("neighborhood spaces"). See also Schaerf (1948) and Zakon (1965) for more extensions.
References

Schaerf, H. M. (1947). On the continuity of measurable functions in neighborhood spaces. Portugal. Math. 6, 33-44.
Schaerf, H. M. (1948). On the continuity of measurable functions in neighborhood spaces II. Portugal. Math. 7, 91-92.
Zakon, Elias (1965). On "essentially metrizable" spaces and on measurable functions with values in such spaces. Trans. Amer. Math. Soc. 119, 443-453.
Appendix E Bochner and Pettis Integrals
Let $(X, \mathcal{A}, \mu)$ be a measure space and $(S, \|\cdot\|)$ a separable Banach space. A function $f$ from $X$ into $S$ will be called simple or $\mu$-simple if it is of the form $f = \sum_{i=1}^k 1_{A_i} y_i$ for some $y_i \in S$, $k < \infty$, and measurable $A_i$ with $\mu(A_i) < \infty$. For a simple function, the Bochner integral is defined by

$$\int f\,d\mu := \sum_{i=1}^k \mu(A_i)\,y_i.$$
E.1 Theorem The Bochner integral is well-defined for $\mu$-simple functions, the $\mu$-simple functions form a real vector space, and for any $\mu$-simple functions $f$, $g$ and real constant $c$, $\int cf + g\,d\mu = c\int f\,d\mu + \int g\,d\mu$.

Proof These facts are proved just as they are for real-valued functions (RAP, Proposition 4.1.4).

For any measurable function $g$ from $X$ into $S$, where measurability is defined for the Borel $\sigma$-algebra generated by the open sets of $S$, $x \mapsto \|g(x)\|$ is a measurable, nonnegative real-valued function on $X$.
E.2 Lemma For any two $\mu$-simple functions $f$ and $g$ from $X$ into $S$, $\bigl\|\int f\,d\mu - \int g\,d\mu\bigr\| \le \int \|f - g\|\,d\mu$.

Proof The difference $f - g$ is $\mu$-simple, say $f - g = \sum_i 1_{A_i} u_i$ with the $A_i$ disjoint. Then $\bigl\|\int f - g\,d\mu\bigr\| = \bigl\|\sum_i \mu(A_i) u_i\bigr\| \le \sum_i \mu(A_i)\|u_i\| = \int \|f - g\|\,d\mu$.
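A finite-dimensional sketch may help fix ideas; here $\mathbb{R}^2$ stands in for $S$ (an illustrative choice of ours), a $\mu$-simple function is stored as pairs $(\mu(A_i), y_i)$ with the $A_i$ disjoint, and the inequality of Lemma E.2 (with $g = 0$) is checked:

```python
# Finite-dimensional sketch of the Bochner integral of a mu-simple
# function f = sum_i 1_{A_i} y_i, stored as pairs (mu(A_i), y_i) with the
# A_i disjoint and values y_i in R^2 standing in for the Banach space S.
def bochner_simple(pieces):
    # integral of f dmu = sum_i mu(A_i) y_i
    dim = len(pieces[0][1])
    return tuple(sum(m * y[j] for m, y in pieces) for j in range(dim))

def norm(v):
    # Euclidean norm on R^2
    return sum(c * c for c in v) ** 0.5

def l1_seminorm(pieces):
    # integral of ||f|| dmu = sum_i mu(A_i) ||y_i||  (A_i disjoint)
    return sum(m * norm(y) for m, y in pieces)

f_pieces = [(0.5, (1.0, -2.0)), (0.25, (4.0, 0.0))]
integral = bochner_simple(f_pieces)
# ||integral of f dmu|| <= integral of ||f|| dmu, as in Lemma E.2 / (E.6)
```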
Let $\mathcal{L}^1(X, \mathcal{A}, \mu, S) := \mathcal{L}^1(X, \mu, S)$ be the space of all measurable functions $f$ from $X$ into $S$ such that $\int \|f\|\,d\mu < \infty$. By the triangle inequality, it's easily seen that $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ is a vector space. Also, all $\mu$-simple functions belong to $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. Define $\|\cdot\|_1$ on $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ by $\|f\|_1 := \int \|f\|\,d\mu$. It is easily seen that $\|\cdot\|_1$ is a seminorm on $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. On the vector space of $\mu$-simple functions, the Bochner integral is linear (by Theorem E.1) and continuous (also Lipschitz) for $\|\cdot\|_1$ by Lemma E.2.
E.3 Theorem For any separable Banach space $(S, \|\cdot\|)$ and any measure space $(X, \mathcal{A}, \mu)$, the $\mu$-simple functions are dense in $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ for $\|\cdot\|_1$, and the Bochner integral extends uniquely to a linear function from $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ into $S$, continuous for $\|\cdot\|_1$.

Proof A first step in the proof will be the following.
E.4 Lemma For any $f \in \mathcal{L}^1(X, \mathcal{A}, \mu, S)$ there exist $\mu$-simple $f_k$ with $\|f - f_k\|_1 \to 0$ and $\|f - f_k\| \to 0$ almost everywhere for $\mu$ as $k \to \infty$, and such that $\|f_k\| \le \|f\|$ on $X$.
Proof Define a measure $\gamma$ on $(X, \mathcal{A})$ by $\gamma(B) := \int_B \|f(x)\|\,d\mu(x)$. Then $\gamma$ is a finite measure and has a finite image measure $\gamma \circ f^{-1}$ on $S$. If $f = 0$ a.e. ($\mu$), so that $\gamma \equiv 0$, set $f_k \equiv 0$. So we can assume $\gamma(X) > 0$. Given $m = 1, 2, \ldots$, by Ulam's theorem (RAP, Theorem 7.1.4), there is a compact $K := K_m \subset S$ with $\gamma(f^{-1}(S \setminus K)) < 1/2^m$. Then take a finite set $F \subset K$ such that each point of $K$ is within $1/m$ of some point of $F$. For each $y \in F$, let $s(y) := cy$ for a constant $c$ with $0 < c \le 1$ such that $\|y - cy\| = 1/m$, or $s(y) := 0$ if there is no such $c$ (that is, if $\|y\| \le 1/m$). Let $F := \{y_1, \ldots, y_k\}$ for some distinct $y_i$. We can assume $y_1 := 0 \in F \subset K$. For $y \in K$ let $h_m(y) := s(y_i)$ for the $y_i$ such that $\|y - y_i\|$ is minimized, choosing the smallest possible index $i$ in case of ties. For $y \notin K$ set $h_m(y) := 0$. Then for each $y \in S$, $\|h_m(y)\| \le \|y\|$, and $\|h_m(y) - y\| \le 2/m$ for all $y \in K$. Define a function $g_m$ by $g_m(x) := h_m(f(x))$. Then $g_m$ and $h_m$ are measurable and have finite range. Let $\delta := \min_{i\ge 2} \|y_i\|$. Then $\delta > 0$. The set where $g_m \ne 0$, and thus $f \in K$, is included in the set where $\|f\| > \delta/2$, which has finite measure for $\mu$. So $g_m$ is a $\mu$-simple function. The set where $\|f - g_m\| > 2/m$ has $\gamma$ measure at most $1/2^m$. Then, by the Borel-Cantelli lemma applied to the probability measure $\gamma/\gamma(X)$, it follows that $g_m \to f$ almost everywhere for $\gamma$. For any $x$, if $f(x) = 0$ then $g_m(x) = 0$ for all $m$. It follows that $\|g_m - f\| \to 0$ almost everywhere for $\mu$. Since $\|g_m\| \le \|f\|$ and $\|g_m - f\| \le 2\|f\|$, it follows by dominated convergence that $\|g_m - f\|_1 \to 0$ as $m \to \infty$, proving the lemma.
Then, to continue proving the theorem: the Bochner integral, a linear function, is continuous (and Lipschitz) for the seminorm $\|\cdot\|_1$. Thus it is uniformly continuous on its original domain, the simple functions, and so has a unique continuous extension to the closure of its domain, which by Lemma E.4 is $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$, and the extension is clearly also linear.
So the Bochner integral is well-defined for all functions in $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$, and only for such functions. A function from $X$ into $S$ will be called Bochner integrable if and only if it belongs to $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. The extension of the Bochner integral to $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ will also be written as $\int f\,d\mu$. Thus Theorem E.3 implies that

(E.5) $\quad \int cf + g\,d\mu = c\int f\,d\mu + \int g\,d\mu$

for any Bochner integrable functions $f$, $g$ and real constant $c$. Also, by taking limits in Lemma E.2 it follows that

(E.6) $\quad \bigl\|\int f\,d\mu\bigr\| \le \int \|f\|\,d\mu$

for any Bochner integrable function $f$. Although monotone convergence is not defined in general Banach spaces, a form of dominated convergence holds:
E.7 Theorem Let $(X, \mathcal{A}, \mu)$ be a measure space. Let $f_n$ be measurable functions from $X$ into a separable Banach space $S$ such that for all $n$, $\|f_n\| \le g$ where $g$ is an integrable real-valued function. Suppose the $f_n$ converge almost everywhere to a function $f$. Then $f$ is Bochner integrable and $\|\int f_n - f\,d\mu\| \le \int \|f_n - f\|\,d\mu \to 0$ as $n \to \infty$.

Proof First, $f$ is measurable (RAP, Theorem 4.2.2). Since $\|f\| \le g$, $f$ is Bochner integrable, and the rest follows from (E.6) for $f_n - f$ and dominated convergence for the real-valued functions $\|f_n - f\| \le 2g$.

A Bochner integral $\int g\,d\mu = \int f\,d\mu$ can be defined when $g$ is defined only almost everywhere for $\mu$, $f$ is Bochner integrable and $g = f$ where $g$ is defined, just as for real-valued functions (RAP, Section 4.3). It's easy to check that when $S = \mathbb{R}$, the Bochner integral equals the usual Lebesgue integral. A Tonelli-Fubini theorem holds for the Bochner integral:
E.8 Theorem Let $(X, \mathcal{A}, \mu)$ and $(Y, \mathcal{B}, \nu)$ be $\sigma$-finite measure spaces. Let $f$ be a measurable function from $X \times Y$ into a separable Banach space $S$ such that $\int\!\int \|f(x, y)\|\,d\mu(x)\,d\nu(y) < \infty$. Then for $\mu$-almost all $x$, $f(x, \cdot)$ is Bochner integrable from $Y$ into $S$; for $\nu$-almost all $y$, $f(\cdot, y)$ is Bochner integrable from $X$ into $S$; and

$$\int f\,d(\mu \times \nu) = \int\!\!\int f(x, y)\,d\mu(x)\,d\nu(y) = \int\!\!\int f(x, y)\,d\nu(y)\,d\mu(x).$$
Proof By the usual Tonelli-Fubini theorem for real-valued functions (RAP, 4.4.5), $\int\!\int \|f(x, y)\|\,d\mu(x)\,d\nu(y) < \infty$ implies that for $\nu$-almost all $y$, $\int \|f(x, y)\|\,d\mu(x) < \infty$, and for $\mu$-almost all $x$, $\int \|f(x, y)\|\,d\nu(y) < \infty$. Let

$$TF := \Bigl\{f \in \mathcal{L}^1(X \times Y,\ \mu \times \nu,\ S)\colon \int f\,d(\mu \times \nu) = \int\!\!\int f(x, y)\,d\mu(x)\,d\nu(y) = \int\!\!\int f(x, y)\,d\nu(y)\,d\mu(x)\Bigr\}.$$

Then by the usual Tonelli-Fubini theorem, all $(\mu \times \nu)$-simple functions are in $TF$. For any $f \in \mathcal{L}^1(X \times Y, \mu \times \nu, S)$, by Lemma E.4, take simple $f_n$ with $\|f_n\| \le \|f\|$ and $\|f_n - f\| \to 0$ almost everywhere for $\mu \times \nu$. Then for $\nu$-almost all $y$, $f_n(x, y) \to f(x, y)$ for $\mu$-almost all $x$, and $\|f_n(\cdot, y)\| \le \|f(\cdot, y)\| \in \mathcal{L}^1(X, \mu)$, so by dominated convergence (Theorem E.7), $\int f_n(x, y)\,d\mu(x) \to \int f(x, y)\,d\mu(x)$. Since the functions of $y$ on the left are $\nu$-measurable, so is the function on the right (RAP, Theorem 4.2.2, with a possible adjustment on a set of measure 0). For each $y$, $\|\int f_n(x, y)\,d\mu(x)\| \le \int \|f(x, y)\|\,d\mu(x)$, which is $\nu$-integrable with respect to $y$, so by dominated convergence (E.7) again,

$$\int\!\!\int f_n(x, y)\,d\mu(x)\,d\nu(y) \to \int\!\!\int f(x, y)\,d\mu(x)\,d\nu(y) \quad\text{as } n \to \infty.$$

The same holds for the iterated integral in the other order, $\int\!\int f\,d\nu\,d\mu$, and $\int f_n\,d(\mu \times \nu) \to \int f\,d(\mu \times \nu)$ by (E.7), so $f \in TF$.
Now let $(S, \mathcal{T})$ be any topological vector space; in other words, $S$ is a real vector space, $\mathcal{T}$ is a topology on $S$, and the operation $(c, f, g) \mapsto cf + g$ is jointly continuous from $\mathbb{R} \times S \times S$ into $S$. Then the dual space $S'$ is the set of all continuous linear functions from $S$ into $\mathbb{R}$. Let $(X, \mathcal{A}, \mu)$ be a measure space. Then a function $f$ from $X$ into $S$ is called Pettis integrable with Pettis integral $y \in S$ if and only if for every $t \in S'$, $\int t(f)\,d\mu$ is defined and finite and equals $t(y)$.
The Pettis integral is also due to Gelfand and Dunford and might be called the Gelfand-Dunford-Pettis integral (see the Notes). The Pettis integral may lack interest unless S' separates points of S, as is true for normed linear spaces by the Hahn-Banach theorem (RAP, Corollary 6.1.5).
E.9 Theorem For any measure space $(X, \mathcal{A}, \mu)$ and separable Banach space $(S, \|\cdot\|)$, each Bochner integrable function $f$ from $X$ into $S$ is also Pettis integrable, and the values of the integrals are the same.

Proof The equation $t(\int f\,d\mu) = \int t(f)\,d\mu$ is easily seen to hold for simple functions and then, by Theorem E.3, for Bochner integrable functions.
Example. A function can be Pettis integrable without being Bochner integrable. Let $H$ be a separable, infinite-dimensional Hilbert space with orthonormal basis $\{e_n\}$. Let $\mu(f = n e_n) := \mu(f = -n e_n) := p_n := n^{-7/4}$ and $f = 0$ otherwise. Then $\int \|f\|\,d\mu = 2\sum_n n^{-3/4} = +\infty$, so $f$ is not Bochner integrable. On the other hand, for any $x = \sum_n x_n e_n$ with $\sum_n x_n^2 < \infty$ we have $\sum_n 2n|x_n|p_n \le 2\bigl(\sum_n x_n^2\bigr)^{1/2}\bigl(\sum_n n^{-3/2}\bigr)^{1/2} < \infty$, and by symmetry the Pettis integral of $f$ is 0.

Example. The Tonelli-Fubini theorem, which holds for the Bochner integral (Theorem E.8), can fail for the Pettis integral. Let $H$ be an infinite-dimensional Hilbert space and let $e_{i,j}$ be orthonormal for all positive integers $i$ and $j$. Let $U(x, y) := 2^i e_{i,j}$ for $(j-1)/2^i \le x < j/2^i$ and $2^{-i} < y \le 2^{1-i}$, $j = 1, \ldots, 2^i$, and 0 elsewhere. Then it can be checked that $U$ is Pettis integrable on $[0, 1] \times [0, 1]$ for Lebesgue measure but is not integrable with respect to $y$ for fixed $x$.
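The two series in the first example can be illustrated by partial sums (the cutoff $N$ below is our own choice): the norm series $2\sum_n n^{-3/4}$ grows without bound, roughly like $8N^{1/4}$, while the factor $\sum_n n^{-3/2}$ in the Cauchy-Schwarz bound stays below $\zeta(3/2) \approx 2.612$.

```python
# Partial sums illustrating the first example; the cutoff N is arbitrary.
# 2 * sum_n n^(-3/4) (the Bochner norm integral) diverges, while
# sum_n n^(-3/2) (from the Cauchy-Schwarz bound) converges.
N = 10 ** 5
bochner_partial = sum(2.0 * n ** (-0.75) for n in range(1, N + 1))
cauchy_schwarz_partial = sum(n ** (-1.5) for n in range(1, N + 1))
```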
Notes
The treatment of the Bochner integral here is partly based on that of Cohn (1980). The following historical notes are mainly based on those of Dunford and Schwartz (1958).
Graves (1927) defined a Riemann integral for Banach-valued functions. Bochner (1933) defined his integral, a Lebesgue-type integral for Banachvalued functions. The definition of integral given by Pettis (1938) was published at the same time or earlier by Gelfand (1936, 1938) and is equivalent to one of
the definitions of Dunford (1936). Gelfand (1936) is one of the first, perhaps the first, of many distinguished publications by I. M. Gelfand. The integral
of Birkhoff (1935) strictly includes that of Bochner, and is strictly included in that of Gelfand-Dunford-Pettis. Price (1940), in a further extension, treats set-valued functions. Dunford and Schwartz (1958, Chapter 3) develop an integral when $\mu$ is only finitely additive. Facts such as the dominated convergence theorem still require countable additivity.
References

*An asterisk indicates a work I have seen discussed in secondary sources but have not seen in the original.

Birkhoff, Garrett (1935). Integration of functions with values in a Banach space. Trans. Amer. Math. Soc. 38, 357-378.
Bochner, Salomon (1933). Integration von Funktionen, deren Werte die Elemente eines Vektorraumes sind. Fund. Math. 20, 262-276.
Cohn, Donald L. (1980). Measure Theory. Birkhäuser, Boston.
Dunford, Nelson (1936). Integration and linear operations. Trans. Amer. Math. Soc. 40, 474-494.
Dunford, Nelson, and Schwartz, J. T. (1958). Linear Operators. Part I: General Theory. Interscience, New York. Repr. 1988.
*Gelfand, Izrail Moiseevich (1936). Sur un lemme de la théorie des espaces linéaires. Communications de l'Institut des Sciences Math. et Mécaniques de l'Université de Kharkoff et de la Société Math. de Kharkov (= Zapiski Khark. Mat. Obshchestva) (Ser. 4) 13, 35-40.
Gelfand, I. M. (1938). Abstrakte Funktionen und lineare Operatoren. Mat. Sbornik (N. S.) 4, 235-286.
Graves, L. M. (1927). Riemann integration and Taylor's theorem in general analysis. Trans. Amer. Math. Soc. 29, 163-177.
Pettis, Billy Joe (1938). On integration in vector spaces. Trans. Amer. Math. Soc. 44, 277-304.
Price, G. B. (1940). The theory of integration. Trans. Amer. Math. Soc. 47, 1-50.
Appendix F Nonexistence of Types of Linear Forms on Some Spaces
Recall that a real vector space $V$ with a topology $\mathcal{T}$ is called a topological vector space if addition is jointly continuous from $V \times V$ to $V$ for $\mathcal{T}$, and scalar multiplication $(c, v) \mapsto cv$ is jointly continuous from $\mathbb{R} \times V$ into $V$ for $\mathcal{T}$ on $V$ and the usual topology on $\mathbb{R}$. If $d$ is a metric on $V$, then $(V, d)$ is called a metric linear space if it is a topological vector space for the topology of $d$.
First, it will be shown that any measurable linear form on a complete metric linear space is continuous. Then, it will be seen that the only measurable (and thus, continuous) linear form on $L^p([0, 1], \lambda)$ for $0 < p < 1$ is zero. Recall that for a given $\sigma$-algebra $\mathcal{A}$ of measurable sets, in our case the Borel $\sigma$-algebra, a universally measurable set is one measurable for the completion of every probability measure on $\mathcal{A}$ (RAP, Section 11.5). The universally measurable sets form a $\sigma$-algebra, and a function $f$ is called universally measurable if for every Borel set $B$ in its range, $f^{-1}(B)$ is universally measurable.
F.1 Theorem Let $(E, d)$ be a complete metric real linear space. Let $u$ be a universally measurable linear form from $E$ into $\mathbb{R}$. Then $u$ is continuous.
Proof There exists a metric $\rho$, metrizing the $d$ topology on $E$, such that $\rho(x + v, y + v) = \rho(x, y)$ for all $x, y, v \in E$, and $|\lambda| \le 1$ implies $\rho(0, \lambda x) \le \rho(0, x)$ for all $x \in E$ (Schaefer, 1966, Theorem 1.6.1, p. 28). Then we have:

F.2 Lemma If $(E, d)$ is a complete metric linear space then it is also complete for $\rho$.
for p. Proof If not, let F be the completion of E for p. Because of the translation invariance property of p, the topological vector space structure of E extends to F. Then since (E, d) is complete, E is a Gs (a countable intersection of open
sets) in F (RAP, Theorem 2.5.4). So, if E is not complete for p, F \ E is of 413
Appendix F
414
first category in F, but F \ E includes a translate of E which is carried to E by a homeomorphism (by translation) of F, so F \ E is of second category in F, a contradiction, so the lemma is proved. Now to continue the proof of Theorem E.1, let (E, p) be a complete metric linear space where p has the stated properties. If u is not continuous, there
are $e_n \in E$ with $e_n \to 0$ and $|u(e_n)| \ge 1$ for all $n$. There are constants $c_n \to \infty$ such that $c_n e_n \to 0$: for each $k = 1, 2, \ldots$, there is a $\delta_k > 0$ such that if $\rho(x, 0) < \delta_k$ then $\rho(kx, 0) < 1/k$. We can assume that $\delta_k$ is decreasing as $k$ increases. For some $n_k$, $\rho(e_n, 0) < \delta_k$ for $n \ge n_k$. We can also assume that $n_k$ increases with $k$. Let $c_n = 1$ for $n < n_1$ and $c_n = k$ for $n_k \le n < n_{k+1}$, $k = 1, 2, \ldots$. Then the $c_n$ have the claimed properties. So we can assume $e_n \to 0$ and $|u(e_n)| \to \infty$. Taking a subsequence, we can assume $\sum_n \rho(0, e_n) < \infty$. So it will be enough to show that if $\sum_n \rho(0, e_n) < \infty$ then the $|u(e_n)|$ are bounded.
Let Q be the Cartesian product ∏_{n=1}^∞ I_n where for each n, I_n is a copy of [−1, 1]. (So Q is the closed unit ball of ℓ^∞.) For λ := {λ_n}_{n=1}^∞ ∈ Q, the series h(λ) := Σ_{n=1}^∞ λ_n e_n converges in E, since by the properties of ρ,

ρ(0, Σ_{j=k}^n λ_j e_j) ≤ Σ_{j=k}^n ρ(0, λ_j e_j) ≤ Σ_{j=k}^n ρ(0, e_j) → 0

as n ≥ k → ∞, and (E, ρ) is complete. Also, h is continuous for the product topology T on Q. For each n, let μ_n be the uniform law U[−1, 1] (one-half times Lebesgue measure on [−1, 1]) and let μ := ∏_{n=1}^∞ μ_n on Q. Then μ is a regular Borel measure on the compact metrizable space (Q, T). Thus the image measure μ ∘ h^{−1} is a regular Borel measure on (E, ρ). Since u is universally measurable, it is μ ∘ h^{−1} measurable (that is, measurable for the completion of μ ∘ h^{−1} on the Borel σ-algebra). So U := u ∘ h is measurable for the completion of μ on the Borel σ-algebra of (Q, T). For 0 < M < ∞ let Q_M := {x ∈ Q : |U(x)| ≤ M}. Then Q_M is μ-measurable and μ(Q_M) ↑ 1 as M ↑ +∞. For a given n write Q = I_n × Q^{(n)} where Q^{(n)} := ∏_{j≠n} I_j. Let μ^{(n)} := ∏_{j≠n} μ_j. Then μ = μ_n × μ^{(n)}. For y ∈ Q^{(n)} let Q_{M,n}(y) := {t ∈ I_n : (t, y) ∈ Q_M}. By the Tonelli-Fubini theorem,
μ(Q_M) = ∫_{Q^{(n)}} μ_n(Q_{M,n}(y)) dμ^{(n)}(y).
For s, t ∈ Q_{M,n}(y), U(s, y) − U(t, y) = (s − t) u(e_n). So |U(s, y)| ≤ M and |U(t, y)| ≤ M imply |s − t| ≤ 2M/|u(e_n)|. (We can assume u(e_n) ≠ 0 for all n.) So μ_n(Q_{M,n}(y)) ≤ M/|u(e_n)|, since μ_n has density (1/2)·1_{[−1,1]}. So μ(Q_M) ≤ M/|u(e_n)|, and |u(e_n)| ≤ M/μ(Q_M) < ∞ for all n if μ(Q_M) > 0,
Nonexistence of Types of Linear Forms on Some Spaces
which is true for some M large enough. So the |u(e_n)| are bounded, proving Theorem F.1.
For 0 < p < ∞ let 𝓛_p([0, 1], λ) be the space of all Lebesgue measurable real-valued functions f on [0, 1] with ∫_0^1 |f(x)|^p dx < ∞. Let L_p := L_p[0, 1] := L_p([0, 1], λ) be the space of all equivalence classes of functions in 𝓛_p([0, 1], λ) for equality a.e. (λ). For 1 ≤ p < ∞, recall that L_p is a Banach space with the norm ‖f‖_p := (∫_0^1 |f(x)|^p dx)^{1/p}. For 0 < p < 1, ‖·‖_p is not a norm. Instead, let ρ_p(f, g) := ∫_0^1 |f − g|^p(x) dx. It will be shown that for 0 < p < 1, ρ_p defines a metric on L_p. For any a ≥ 0 and b ≥ 0, (a + b)^p ≤ a^p + b^p: to see this, let r := 1/p > 1, u := a^p, v := b^p, and apply the Minkowski inequality in ℓ^r. The triangle inequality for ρ_p follows, and the other properties of a metric are clear. Also, ρ_p has the properties of the metric ρ in the proof of Theorem F.1. To see that L_p is complete for ρ_p, let {f_n} be a Cauchy sequence. As in the proof for p ≥ 1 (RAP, Theorem 5.2.1) there is a subsequence f_{n_k} converging for almost all x to some f(x), and
ρ_p(f_{n_k}, f) → 0, so ρ_p(f_n, f) → 0.

F.3 Theorem For 0 < p < 1 there are no nonzero, universally measurable linear forms on L_p[0, 1].
Proof Suppose u is such a linear form. Then by Theorem F.1, u is continuous. Now L_1[0, 1] ⊂ L_p[0, 1] and the injection h from L_1 into L_p is continuous. Thus u ∘ h is a continuous linear form on L_1. So, for some g ∈ L_∞[0, 1], u(h(f)) = ∫_0^1 (fg)(x) dx for all f ∈ L_1[0, 1] (e.g., RAP, Theorem 6.4.1). Then we can assume that g ≥ c > 0 on a set E with λ(E) > 0 (replacing u by −u if necessary). Let f_n := n on a set E_n ⊂ E with λ(E_n) = λ(E)/n, and f_n := 0 elsewhere. Then f_n → 0 in L_p for 0 < p < 1, while u(f_n) = ∫_0^1 (f_n g)(x) dx ≥ cλ(E) > 0 for all n, a contradiction.

Note that Theorem F.1 fails if the completeness assumption is omitted: let H be an infinite-dimensional Hilbert space and let h_n be an infinite orthonormal sequence in H. Let E be the set of all finite linear combinations of the h_n. Then there exists a linear form u on E with u(h_n) = n for all n, and u is Borel measurable on E but not continuous.
Notes The proof of Theorem F.1 is adapted from a proof for the case that E is a Banach space. The proof is due to A. Douady, according to L. Schwartz (1966),
who published it. The fact for Banach spaces extends easily to inductive limits of Banach spaces (for definitions see, e.g., Schaefer, 1966), the so-called ultrabornological locally convex topological linear spaces, and from linear forms to linear maps into general locally convex spaces. Among ultrabornological spaces are the spaces of Schwartz's theory of distributions (generalized functions).
References

Schaefer, H. H. (1966). Topological Vector Spaces. Macmillan, New York; 3d printing, corrected, Springer, New York, 1971.

Schwartz, L. (1966). Sur le théorème du graphe fermé. C. R. Acad. Sci. Paris Sér. A 263, A602-A605.
Appendix G Separation of Analytic Sets; Borel Injections
Recall that a Polish space is a topological space S metrizable by a metric for which S is complete and separable. Also, in any topological space X, the σ-algebra of Borel sets is generated by the open sets. Two disjoint subsets A, C of X are said to be separated by Borel sets if there is a Borel set B ⊂ X such that A ⊂ B and C ⊂ X \ B. Recall that a set A in a Polish space Y is called analytic if there is another Polish space X, a Borel subset B of X, and a Borel measurable function f from B into Y such that A = f[B] := {f(x) : x ∈ B} (e.g., RAP, Section 13.2). Equivalently, we can take f to be continuous and/or B = X and/or X = N^∞, where N is the set of nonnegative integers with discrete topology and N^∞ is the Cartesian product of an infinite sequence of copies of N, with product topology (RAP, Theorem 13.2.1).

G.1 Theorem (Separation theorem for analytic sets) Let X be a Polish space. Then any disjoint analytic subsets A, C of X can be separated by Borel sets.
Proof First, it will be shown that:
(G.2) If C_j, for j = 1, 2, …, and D are subsets of X such that for each j, C_j and D can be separated by Borel sets, then ∪_{j=1}^∞ C_j and D can be separated by Borel sets.
Proof For each j, take a Borel set B_j such that D ⊂ B_j and C_j ⊂ X \ B_j. Let B := ∩_{j=1}^∞ B_j. Then D ⊂ B and ∪_{j=1}^∞ C_j ⊂ X \ B, so (G.2) holds.
Next, it will be shown that:

(G.3) If C_j and D_j, for j = 1, 2, …, are subsets of X such that for every i and j, C_i and D_j can be separated by Borel sets, then ∪_{i=1}^∞ C_i and ∪_{j=1}^∞ D_j can be separated by Borel sets.
Proof By (G.2), for each j, ∪_{i=1}^∞ C_i and D_j can be separated by Borel sets. Applying (G.2) again gives (G.3).

In proving Theorem G.1, we can assume A and C are both nonempty. Then there exist continuous functions f, g from N^∞ into X such that f[N^∞] = A and g[N^∞] = C. Suppose A and C cannot be separated by Borel sets. For k = 1, 2, …, and m_i ∈ N, i = 1, …, k, let

N^∞(m_1, …, m_k) := {n = {n_i}_{i=1}^∞ ∈ N^∞ : n_i = m_i for i = 1, …, k}.

By (G.3), there exist m_1 and n_1 such that f[N^∞(m_1)] and g[N^∞(n_1)] cannot be separated by Borel sets. Inductively, for all k = 1, 2, …, there exist m_k and n_k such that f[N^∞(m_1, …, m_k)] and g[N^∞(n_1, …, n_k)] cannot be separated by Borel sets. Let m := {m_i}_{i=1}^∞, n := {n_j}_{j=1}^∞. If f(m) ≠ g(n), then there are disjoint open sets U, V with f(m) ∈ U and g(n) ∈ V. But then by continuity of f and g, for k large enough, f[N^∞(m_1, …, m_k)] ⊂ U and g[N^∞(n_1, …, n_k)] ⊂ V, which gives a contradiction. So f(m) = g(n). But f(m) ∈ A, g(n) ∈ C, and A ∩ C = ∅, another contradiction, which proves Theorem G.1.
G.4 Lemma If X is a Polish space and A_j, j = 1, 2, …, are analytic in X, then ∪_{j=1}^∞ A_j is analytic.

Proof For j = 1, 2, …, let f_j be a continuous function from N^∞ onto A_j. For n := {n_i}_{i=1}^∞ ∈ N^∞ define f(n) := f_{n_1}({n_{i+1}}_{i≥1}). Then f is continuous from N^∞ onto ∪_{j=1}^∞ A_j.
G.5 Corollary If X is a Polish space and A_j, j = 1, 2, …, are disjoint analytic subsets of X, then there exist disjoint Borel sets B_j such that A_j ⊂ B_j for all j.

Proof For each j such that A_j = ∅ we can take B_j = ∅. So we can assume all the A_j are nonempty. For each k = 1, 2, …, by Lemma G.4, A^{(k)} := ∪_{j≠k} A_j is analytic. Then by Theorem G.1, there is a Borel set C_k with A_k ⊂ C_k and A^{(k)} ⊂ X \ C_k. Let B_1 := C_1 and, for j ≥ 2, B_j := C_j \ ∪_{k<j} C_k. Then the sets B_j have the properties stated.
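The disjointification step at the end of this proof is purely set-theoretic and can be illustrated on toy data. In the sketch below (finite sets invented for illustration), each C_k contains A_k and excludes the other A_j, and B_1 := C_1, B_j := C_j \ (C_1 ∪ … ∪ C_{j−1}) produces disjoint sets still containing the A_j.

```python
# Toy illustration of the disjointification in Corollary G.5.
A = [{1}, {2}, {3}]                  # disjoint sets (playing the role of the A_j)
C = [{1, 9}, {2, 9}, {3, 9}]         # C_k contains A_k, excludes the other A_j

B = []
seen = set()
for Cj in C:
    B.append(Cj - seen)              # C_j minus the union of the earlier C_k
    seen |= Cj

print(B)                             # pairwise disjoint, and A_j <= B_j still holds
```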
G.6 Theorem Let S be a Polish space, Y a separable metric space, and A a Borel subset of S. Let f be a 1-1, Borel measurable function from A into Y. Then the range f[A] is a Borel subset of Y, and f^{−1} is a Borel measurable function from f[A] onto A.
Proof The following fact will be used.
G.7 Lemma Let S be a Polish space and Y a separable metric space. Let A be a nonempty Borel subset of S and f a 1-1, Borel measurable function from A into Y. Then there exists a Borel measurable function g from Y onto A such that g(f(x)) = x for all x in A.

Proof Let d metrize S, where (S, d) is complete and separable. Let {y_j}_{j=1}^∞ be dense in S. Choose x_0 ∈ A. Recall that B(x, r) := {y : d(x, y) < r}. For each k = 1, 2, …, and j = 1, 2, …, let

A_{kj} := A ∩ B(y_j, 1/k) \ ∪_{i<j} B(y_i, 1/k),

where the union is empty for j = 1. Then for each k, the A_{kj} for j = 1, 2, … are disjoint Borel sets whose union is A, and for each j, d(x, y) ≤ 2/k for any x, y ∈ A_{kj}. Since f is 1-1, for each k, the sets f[A_{kj}] are disjoint analytic sets in Y. Thus by Corollary G.5 there exist disjoint Borel subsets B_{kj} of Y with f[A_{kj}] ⊂ B_{kj} and with B_{kj} = ∅ if A_{kj} = ∅. For each j such that A_{kj} is nonempty, choose a point x_{kj} ∈ A_{kj}. Define a function g_k on Y by letting g_k(y) := x_{kj} if y ∈ B_{kj} for some j, and g_k(y) := x_0 if y ∉ ∪_{j≥1} B_{kj}. Then g_k is Borel measurable (e.g., RAP, Lemma 4.2.4). Since (S, d) is complete, the set G of y in Y such that g_k(y) converges in S is the same as the set on which {g_k(y)}_{k≥1} is a Cauchy sequence, so
G = ∩_{m≥1} ∪_{n≥1} ∩_{k≥n} {y : d(g_k(y), g_n(y)) < 1/m}.
Since S is separable, d(·, ·) is product measurable on S × S (e.g., RAP, Propositions 4.1.7 and 2.1.4). Thus G is a Borel set in Y. On G, let h(y) := lim_{k→∞} g_k(y). Then h is Borel measurable (RAP, Theorem 4.2.2). Let g(y) := h(y) if y ∈ G and h(y) ∈ A. Otherwise, let g(y) := x_0. Then g is a Borel function from Y into A. If x ∈ A, then for each k, x ∈ A_{kj} for some j, f(x) ∈ B_{kj}, and g_k(f(x)) = x_{kj}, so d(x, g_k(f(x))) ≤ 2/k. Thus g_k(f(x)) → x ∈ A, so g(f(x)) = h(f(x)) = x, proving Lemma G.7.

Now to prove Theorem G.6, we can assume A is nonempty. Take the function g from Lemma G.7. Then if y ∈ f[A], we have f(g(y)) = y, while if y ∉ f[A], then f(g(y)) ≠ y. So f[A] = {y ∈ Y : f(g(y)) = y}. If e is a metric for Y, then f[A] = {y ∈ Y : e(f(g(y)), y) = 0}. So by the product measurability of e (as for d above), f[A] is a Borel set in Y. Thus for any Borel
set B ⊂ A, (f^{−1})^{−1}(B) = f[B] is Borel in Y, so f^{−1} is Borel measurable, and Theorem G.6 is proved.
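The partition A_{kj} = A ∩ B(y_j, 1/k) \ ∪_{i<j} B(y_i, 1/k) from Lemma G.7 assigns each point of A to the first 1/k-ball containing it, so the pieces are disjoint, cover A, and have diameter at most 2/k. The following sketch checks this on invented data: random points in the unit square and a finite grid standing in for the dense sequence {y_j}.

```python
# First-ball partition from Lemma G.7, on toy planar data.
import math, random

random.seed(1)
A = [(random.random(), random.random()) for _ in range(200)]
Y = [(i / 10, j / 10) for i in range(11) for j in range(11)]  # "dense" centers
k = 5

def d(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])

pieces = {}
for x in A:
    j = next(j for j, y in enumerate(Y) if d(x, y) < 1 / k)   # first ball hit
    pieces.setdefault(j, []).append(x)

assert sum(len(p) for p in pieces.values()) == len(A)          # a partition of A
diam = max(d(u, v) for p in pieces.values() for u in p for v in p)
print(round(diam, 3))                                          # at most 2/k = 0.4
```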
Note Appendix G is entirely based on Cohn (1980), Section 8.3.
Reference

Cohn, D. (1980). Measure Theory. Birkhäuser, Boston.
Appendix H Young-Orlicz Spaces
A convex, increasing function g from [0, ∞) onto itself will be called a Young-Orlicz modulus. Then g is continuous since it is increasing and onto. Let (X, S, μ) be a measure space and g a Young-Orlicz modulus. Let 𝓛_g(X, S, μ) be the set of all real-valued measurable functions f on X such that

(H.1)    ‖f‖_g := inf{c > 0 : ∫ g(|f(x)|/c) dμ(x) ≤ 1} < ∞.

Let L_g be the set of equivalence classes of functions in 𝓛_g(X, S, μ) for equality almost everywhere (μ). By monotone convergence, we have
H.2 Proposition For any Young-Orlicz modulus g and any f ∈ 𝓛_g(X, S, μ), if 0 < c := ‖f‖_g < +∞, we have ∫ g(|f|/c) dμ ≤ 1; in other words, the infimum in the definition of ‖f‖_g is attained. Also, ‖f‖_g = 0 if and only if f = 0 almost everywhere for μ.

Next, we have
H.3 Lemma For any Young-Orlicz modulus g and any measurable functions f_n, if |f_n| ↑ |f|, then ‖f_n‖_g ↑ ‖f‖_g ≤ +∞.

Proof For any c > 0, ∫ g(|f_n|/c) dμ ↑ ∫ g(|f|/c) dμ by ordinary monotone convergence. Thus ‖f_n‖_g ↑ t for some t. Taking c = t and using Proposition H.2 we get ‖f‖_g ≤ t, while clearly ‖f‖_g ≥ t.

Next is a fact showing that (not surprisingly) convergence in ‖·‖_g norm implies convergence in measure (or probability):
H.4 Lemma For any Young-Orlicz modulus g and any ε > 0, there is a δ > 0 such that if ‖f‖_g < δ, then μ(|f| > ε) < ε.
Proof For any δ > 0, ‖f‖_g ≤ δ is equivalent to ∫ g(|f|/δ) dμ ≤ 1, by Proposition H.2. Since g is onto [0, ∞), there is some M < ∞ with g(M) ≥ 1/ε and M ≥ 1. Thus ‖f‖_g ≤ δ implies μ(|f| > Mδ) ≤ ε. So we can set δ := ε/M.

H.5 Theorem For any measure space (X, S, μ), L_g(X, S, μ) is a Banach space.
Proof For any real number λ ≠ 0 it's clear that a function f is in 𝓛_g(X, S, μ) if and only if λf is, with ‖λf‖_g = |λ| ‖f‖_g, by definition of ‖·‖_g. If f, h ∈ 𝓛_g(X, S, μ) and 0 ≤ λ ≤ 1 then, since g is convex, we have by definition of convexity, for any c, d > 0,

(H.6)    ∫ g(λ|f|/c + (1 − λ)|h|/d) dμ ≤ λ ∫ g(|f|/c) dμ + (1 − λ) ∫ g(|h|/d) dμ,

where some or all three integrals may be infinite. Applying this for c = 1 and h = 0 gives

(H.7)    ∫ g(λ|f|) dμ ≤ λ ∫ g(|f|) dμ.

Clearly, if the inequality in (H.1) holds for some c > 0 it also holds for any larger c. It follows that λf + (1 − λ)h ∈ 𝓛_g(X, S, μ). For the triangle inequality, let c := ‖f‖_g and d := ‖h‖_g. If c = 0 then f = 0 a.e. (μ), so ‖f + h‖_g = ‖h‖_g ≤ c + d, and likewise if d = 0. If c > 0 and d > 0, let λ := c/(c + d) in (H.6). Applying Proposition H.2 to both terms on the right in (H.6) gives ∫ g(|f + h|/(c + d)) dμ ≤ 1, and so ‖f + h‖_g ≤ c + d. So the triangle inequality holds and ‖·‖_g is a seminorm on 𝓛_g(X, S, μ). Clearly it becomes a norm on L_g(X, S, μ). To see that the latter space is complete for ‖·‖_g, let {f_k} be a Cauchy sequence. By Lemma H.4, take δ_j := δ for ε := ε_j := 1/2^j. Take a subsequence f_{k(j)} with ‖f_i − f_{k(j)}‖_g < δ_j for any i ≥ k(j) and j = 1, 2, …. Then f_{k(j)} converges μ-almost everywhere, by the proof of the Borel-Cantelli lemma, to some f. Then ‖f_{k(j)} − f‖_g → 0 as j → ∞ by Fatou's lemma, applied to the functions g(|f_{k(j)} − f_{k(r)}|/c) as j ≤ r → ∞ for c > δ_j. It follows that ‖f_i − f‖_g → 0 as i → ∞, completing the proof.
Let Φ be a Young-Orlicz modulus. Then it has one-sided derivatives as follows (RAP, Corollary 6.3.3):

φ(x) := Φ′(x+) := lim_{y↓x} (Φ(y) − Φ(x))/(y − x)
exists for all x ≥ 0, and φ(x−) := Φ′(x−) := lim_{y↑x} (Φ(x) − Φ(y))/(x − y) exists for all x > 0. As the notation suggests, for each x > 0, φ(x−) = lim_{y↑x} φ(y), and φ is a nondecreasing function on [0, ∞). Thus φ(x−) = φ(x) except for at most countably many values of x, where φ may have jumps with φ(x) > φ(x−). On any bounded interval, where φ is bounded, Φ is Lipschitz and so absolutely continuous. Thus, since Φ(0) = 0, we have Φ(x) = ∫_0^x φ(u) du for any x ≥ 0 (e.g., Rudin, 1974, Theorem 8.18). For any x > 0, φ(x) > 0 since Φ is strictly increasing.
If φ is unbounded, for 0 ≤ y < ∞ let ψ(y) := φ^←(y) := inf{x ≥ 0 : φ(x) > y}. Then ψ(0) = 0 and ψ is nondecreasing. Let Ψ(y) := ∫_0^y ψ(t) dt. Then Ψ is convex, and Ψ′ = ψ except on the at most countable set where ψ has jumps; if ψ(y) > 0 for each y > 0, then Ψ is also strictly increasing. For any nondecreasing function f from [0, ∞) into itself, it's easily seen that
for any x ≥ 0 and u ≥ 0, f^←(u) ≥ x if and only if f(t) ≤ u for all t < x. It follows that (f^←)^←(x) = f(x−) for all x > 0. Since a change in φ or ψ on the countable set of its jumps doesn't change its indefinite integral Φ or Ψ respectively, the relation between Φ and Ψ is symmetric.

A Young-Orlicz modulus Φ such that φ is unbounded and φ(x) ↓ 0 as x ↓ 0 will be called an Orlicz modulus. Then ψ is also unbounded and ψ(y) > 0 for all y > 0, so Ψ is also an Orlicz modulus. In that case, Φ and Ψ will be called dual Orlicz moduli. For such moduli we have a basic inequality due to W. H. Young:
H.8 Theorem (W. H. Young) Let Φ, Ψ be any two dual Orlicz moduli from [0, ∞) onto itself. Then for any x, y ≥ 0 we have

xy ≤ Φ(x) + Ψ(y),

with equality if x > 0 and φ(x−) ≤ y ≤ φ(x).

Proof If x = 0 or y = 0 there is no problem. Let x > 0 and y > 0. Then Φ(x) is the area of the region A : 0 < u ≤ x, 0 < v ≤ φ(u) in the (u, v) plane. Likewise, Ψ(y) is the area of the region B : 0 < v ≤ y, 0 < u ≤ ψ(v). By monotonicity and right-continuity of φ, u > ψ(v) is equivalent to φ(u) > v, so u ≤ ψ(v) is equivalent to φ(u) ≤ v, so A ∩ B = ∅. The rectangle R_{x,y} : 0 < u ≤ x, 0 < v ≤ y is included in A ∪ B ∪ C, where C has zero area, and if φ(x−) ≤ y ≤ φ(x), then R_{x,y} = A ∪ B up to a set of zero area, so the conclusions hold.
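A classical instance of Theorem H.8 (an illustrative special case, not the general proof) is the dual pair Φ(x) = x^p/p, Ψ(y) = y^q/q with 1/p + 1/q = 1: here φ(x) = x^{p−1} and ψ(y) = y^{q−1} are generalized inverses of each other since (p − 1)(q − 1) = 1, and equality holds exactly along y = φ(x). The sketch below checks this numerically.

```python
# Numerical check of xy <= Phi(x) + Psi(y) for Phi(x)=x^p/p, Psi(y)=y^q/q.
import random

p = 3.0
q = p / (p - 1.0)

Phi = lambda x: x ** p / p
Psi = lambda y: y ** q / q

random.seed(2)
worst = 0.0
for _ in range(10000):
    x, y = random.uniform(0, 5), random.uniform(0, 5)
    assert x * y <= Phi(x) + Psi(y) + 1e-9
    # equality exactly when y = phi(x) = x^(p-1):
    gap = Phi(x) + Psi(x ** (p - 1)) - x * x ** (p - 1)
    worst = max(worst, abs(gap))
print(worst)    # ~ 0: the inequality is tight along y = phi(x)
```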
One of the main uses of Young's inequality (Theorem H.8) is to prove an extension of the Rogers-Hölder inequality to Young-Orlicz spaces:

H.9 Theorem Let Φ and Ψ be dual Orlicz moduli, and for a measure space (X, S, μ) let f ∈ 𝓛_Φ(X, S, μ) and g ∈ 𝓛_Ψ(X, S, μ). Then fg ∈ L_1(X, S, μ) and ∫ |fg| dμ ≤ 2 ‖f‖_Φ ‖g‖_Ψ.

Proof By homogeneity we can assume ‖f‖_Φ = ‖g‖_Ψ = 1. Then applying Proposition H.2 with c = 1 and Theorem H.8 we get ∫ |fg| dμ ≤ 2, and the conclusion follows.
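The bound of Theorem H.9 can be tested numerically. The sketch below uses the self-dual pair Φ = Ψ = x²/2 (here φ(x) = x is its own generalized inverse) on a four-atom probability space; the functions are invented for illustration, and the Luxemburg-type norms are computed by bisection.

```python
# Check of integral |fg| dmu <= 2 ||f||_Phi ||g||_Psi for Phi = Psi = x^2/2.

def luxemburg(f, w, G):
    def I(c):
        return sum(wi * G(abs(fi) / c) for fi, wi in zip(f, w))
    lo, hi = 1e-12, 1.0
    while I(hi) > 1.0:                  # bracket the infimum in c
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if I(mid) > 1.0 else (lo, mid)
    return hi

G = lambda x: x * x / 2.0
w = [0.25] * 4                          # four atoms of mass 1/4
f = [1.0, -2.0, 0.5, 3.0]
g = [2.0, 1.0, -1.0, 0.5]

lhs = sum(wi * abs(fi * gi) for fi, gi, wi in zip(f, g, w))
rhs = 2.0 * luxemburg(f, w, G) * luxemburg(g, w, G)
print(lhs, rhs)                         # lhs <= rhs, as Theorem H.9 asserts
```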
Notes
W. H. Young (1912) proved his inequality (Theorem H.8) for smooth functions Φ. Birnbaum and Orlicz (1931) apparently began the theory of "Orlicz spaces," and W. A. J. Luxemburg defined the norms ‖·‖_Φ; see Luxemburg and Zaanen (1956). Krasnosel'skii and Rutitskii (1961) wrote a book on the topic.
References
Birnbaum, Z. W., and Orlicz, W. (1931). Über die Verallgemeinerung des Begriffes der zueinander konjugierten Potenzen. Studia Math. 3, 1-67.

Krasnosel'skii, M. A., and Rutitskii, Ia. B. (1961). Convex Functions and Orlicz Spaces. Transl. by L. F. Boron. Noordhoff, Groningen.
Luxemburg, W. A. J., and Zaanen, A. C. (1956). Conjugate spaces of Orlicz spaces. Akad. Wetensch. Amsterdam Proc. Ser. A 59 (= Indag. Math. 18), 217-228. Rudin, Walter (1974). Real and Complex Analysis, 2d ed. McGraw-Hill, New York.
Young, W. H. (1912). On classes of summable functions and their Fourier series. Proc. Roy. Soc. London Ser. A 87, 225-229.
Appendix I Modifications and Versions of Isonormal Processes
Let T be any set and (Ω, A, P) a probability space. Recall that a real-valued stochastic process indexed by T is a function (t, ω) ↦ X_t(ω) from T × Ω into ℝ such that for each t ∈ T, X_t(·) is measurable from Ω into ℝ. A modification of the process is another stochastic process Y_t, defined for the same T and Ω, such that for each t we have P(X_t = Y_t) = 1. A version of the process X_t is a process Z_t, t ∈ T, for the same T but defined on a possibly different probability space (Ω_1, B, Q), such that X_t and Z_t have the same laws; that is, for each finite subset F of T, 𝓛({X_t}_{t∈F}) = 𝓛({Z_t}_{t∈F}). Clearly, any modification of a process is also a version of the process, but a version, even if on the same probability space, may not be a modification. For example, for an isonormal process L on a Hilbert space H, the process M(x) := L(−x) is a version, but not a modification, of L. One may take a version or modification of a process in order to get better properties such as continuity. It turns out that for the isonormal process on subsets of Hilbert space, what can be done with a version can also be done by a modification, as follows.

I.1 Theorem Let L be an isonormal process restricted to a subset C of Hilbert space. For each of the following two properties, if there exists a version M of L with the property, there is also a modification N with the property. For each ω, x ↦ M(x)(ω) for x ∈ C is: (a) bounded; (b) uniformly continuous.

Also, if there is a version with (a) and another with (b), then there is a modification N(·) having both properties.
Proof Let A be a countable dense subset of C. For each x ∈ C, take x_n ∈ A with ‖x_n − x‖ < 1/n² for all n. Then L(x_n) → L(x) almost surely. Thus if we define N(x)(ω) := lim sup_{n→∞} L(x_n)(ω), or 0 on the set of probability 0 where the lim sup is infinite, then N is a modification of L. If (a) holds for M it will also hold for L on A and so for N, and likewise for (b), since a uniformly continuous function L(·)(ω) on A has a unique uniformly continuous extension to C, given by N(·)(ω). Since N(·) is the same in both cases, the last conclusion follows.
Subject Index
a.s., almost surely, 12 acyclic graph, 147 admissible (measurability), 180, 184 Alaoglu's theorem, 50 almost uniformly, see convergence analytic sets, 417-420 asymptotic equicontinuity, 117, 118, 121, 131, 315, 349 atom of a Boolean algebra, 141, 158, 159, 163 of a probability space, 92 soft, 192, 247
Banach space, 23, 50, 51, 285 binomial distribution, 1, 5, 16, 123, 400 Blum-DeHardt theorem, 235 Bochner integral, 51, 248, 407-412 Boolean algebras, 141, 158, 165 bootstrap, 335-362 bordered, 155 Borel classes, 181 Borel injections, 417-420 Borel-isomorphic, 119 Borisov-Durst theorem, 244 boundaries, differentiable, 252-264, 363 bracket, bracketing, 234-313, 365 and majorizing measures, 246 Brownian bridge, 1-3, 7, 18, 19, 91, 220, 334, 335 Brownian motion, 2, 300 in d dimensions, 310 C-estimator, 217 Cantor set, 181, 187 cardinals, 167, 403 ceiling function, 19 centered Gaussian, 25-28, 359 central limit theorems general, 94, 208, 238, 285
see also Donsker class, 94 triangular arrays, 246 in finite dimensions, 1, 23 in separable Banach spaces, 209, 291 characteristic function, 26, 31, 84, 337 classification problems, 226 closed regular measure, 405 coherent process, 93, 117, 315 compact operator, 81 comparable, 145 complemented, 144 composition of functions, 94 concentration of measure, 223 confidence sets, 357-358 continuity set, 107, 112 continuous mapping theorem, 116 convergence almost uniform, 100, 101, 107, 112, 223, 306, 336, 359 in law, 3, 93, 94 classical, 9, 91, 93, 128, 130 Dudley (1966), 130 Hoffmann-Jorgensen, 94, 99, 101, 104, 106,107,111-117,127,130,131, 291, 328, 360
in outer probability, 100-102, 104, 112, 127, 215, 223, 301, 310, 336, 358 convex combination, 46, 322 functions, 14, 24, 30, 58, 62, 70, 84, 421,423 hull, 46, 47, 84, 117, 127, 159, 166, 322 metric entropy of, 322-328 of a sequence, 82-83, 88 sets, 227, 384 and Gaussian laws, 40-43 approximation of, 269-283, 322-328 427
convex sets (cont.) classes of, 138, 166, 227, 248, 283, 309, 365, 373, 374, 384, 388
symmetric, 40,47 convolution, 345, 395, 397 correlation, 33 cotreelike ordering, 148, 149 coupling, see Strassen's theorem Vorob'ev form, 7, 10, 19, 119, 128, 296, 304 Cram6r-Wold device, 337 cubature, numerical integration, 363 cubes, unions of, 247, 364 cycle in a graph, 147 Darmois-Skitovic theorem, 25, 86 desymmetrization, 338, 339, 343 determining class, 177, 179 diameter, 61, 75 differentiability degrees for functions, and bounding sets, 252-264, 363 differentiable boundaries, 309 differentiating under integrals, 391-398 dominated family of laws, 172 Donsker classes, 94, 134 and invariance principles, 286, 301, 306, 307, 310 and log log laws, 307 criteria for, 349 asymptotic equicontinuity, 117, 118 multipliers, 355 envelopes of, 307, 308, 310 examples all subsets of N, 228, 244 bounded variation, 227 convex sets in 72, 228 convex sets in 72, 269, 283 half-lines, 127 unions of k intervals, 129 unit ball in Hilbert space, 128 weighted intervals, 227 functional, 301 non-Donsker classes, 129, 138, 228,
363-390 P-universal, 333 stability of convex hull, 127 subclass, 315 sums, 127 union of two, 121-122 sufficient conditions for bracketing, 238-248 differentiability classes, 253, 264 Koltchinskii-Pollard entropy, 208 random entropy, 226 sequences of functions, 122-126, 129 Vapnik-Cervonenkis classes, 214, 215
uniform, 328-329, 334 universal, 314-322, 328-330, 334 Donsker's invariance principle, 311 Donsker's theorem, 3, 9, 19, 129 dual Banach space, 50, 237 edge (of a graph), 147 Effros Borel structure, 190, 194 Egorov's theorem, 101, 405 ellipsoids, 80, 81, 84, 85, 128, 140, 319, 329 empirical measure, 1, 91, 94, 100, 138, 199, 285, 306, 399 as statistic, 176, 191, 226, 332, 335 bootstrap, 335 empirical process, 91-94, 117-121, 220-223, 286, 301, 308, 309, 311 in one dimension, 2, 3, 7, 18, 19 envelope function, 161, 196, 220, 223, 234,
306-308,310,316,317,358 e-net, 10, 228 equicontinuity, asymptotic, 117, 118, 121, 131, 315, 349 equivalent measures, 172 essential infimum or supremum, 44, 95, 96, 98, 130 estimator, 217-219, 226 exchangeable, 191 exponent of entropy, 79 exponent of volume, 79 exponential family, 191, 397 factorization theorem, 172, 174, 193 filtration, 370, 371 Gaussian law, see normal law Gaussian processes, 23-90, 222, 223 see also pregaussian, 92 GB-sets, 44, 52 and metric entropy, 54, 79 criteria for convex hull, 82 majorizing measures, 59, 60 mean width, 80 implied by GC-set, 45 necessary conditions, 45, 54, 64 special classes ellipsoids, 80, 84 sequences, 82 GC-sets, 44, 52, 93 and Gaussian processes, 77 and metric entropy, 79, 85 criteria for, 46, 47 convex hull, 82 other metrics, 78 implies GB-set, 45 special classes ellipsoids, 80, 81
Subject Index random Fourier series, 85 sequences, 84, 85 stability of convex hull, 47 union of two, 51 sufficient conditions majorizing measures, 74 metric entropy, 52, 54, 87 unit ball of dual space, 50 volumes, 79 geometric mean, 14 Glivenko-Cantelli class, 101, 223, 224,
234-238 criteria for, 223, 224, 226 envelope, integrability, 220 examples, 247 convex sets, 269 lower layers, 282 non-examples, 127, 138, 171, 247 strong, 100 sufficient conditions for bracketing (Blum-DeHardt), 235 differentiability classes, 253, 264 unit ball (Mourier), 237 Vapnik-Cervonenkis classes, 134 uniform, 219, 225, 282 universal, 224, 225, 228 weak, 101 not strong, 228 Glivenko-Cantelli theorem, 1 graph, 147 Hausdorff metric, 190, 250 Hilbert-Schmidt operator, 81 Holder condition, 252, 384 homotopy, 260 image admissible, 180, 182, 184, 185, 193 Suslin, 186, 189, 190, 192, 193, 217, 229,
316,317,359 incomparable, 145 independent as sets, 158, 220 events, sequences of, 123, 126, 129 pieces, process with, 370, 371 random elements, 286-291 inequalities, 12-18 Bernstein's, 12-14, 20, 124, 129 Chernoff-Okamoto, 16 Dvortezky-Kiefer-Wolfowitz, 221 for empirical processes, 220-223 Hoeffding's, 13, 14, 20 Hoffmann-Jorgensen, 288, 341 Jensen's, 340 conditional, 29 Komatsu's, 57, 87
Levy, 17, 21, 289, 310 Ottaviani's, 17, 21, 288 Slepian's, 31, 33, 34
with stars, 95-100, 285-286, 339-341, 343, 344, 352
Young, W. H., 423,424 inner measure, 104 integral Bochner, 51 Pettis, 51 integral operator, 81 intensity measure, 127, 368 invariance principles, 3, 285-313 isonormal process, 39, 51, 52, 59, 63, 74, 75, 78, 92, 425-426
Kolmogorov-Smirnov statistics, 335 Koltchinskii-Pollard entropy, 196, 228, 316 Kuiper statistic, 335 law, see convergence in law law, probability measure, 1, 23, 345 laws of the iterated logarithm, 306-308 learning theory, 226, 227 likelihood ratio, 175 linearly ordered by inclusion, 142, 143, 145, 154, 155, 161, 247 Lipschitz functions, seminorm, 112, 210, 223, 251, 356 log log laws, 306-308 loss function, 217 lower layers, 264-269, 282, 316, 373-384, 389 lower semilattice, 95 Lusin's theorem, 105, 405, 406
major set or class, 159 majorizing measures, 59-74, 81-85, 87, 246 Marczewski function, 181 marginals, laws with given, 8, 10, 119, 296 Markov property, 5, 10, 301, 389 for random sets, 373 measurability, 10, 19, 93, 170-194, 217, 221, 226, 229, 291-293, 402-406,
413-420 measurable cover functions, 95-100, 130,
285-286 cover of a set, 95, 348 universally, 20, 105, 186, 199, 202, 203,
413,415
vector space, 25-27 measurable cardinals, 403 metric dual-bounded-Lipschitz, 115, 328, 336 Prokhorov, 6, 19, 114, 119, 295 Skorokhod's, 19
metric entropy, 10-12 of convex hulls, 322-328 with bracketing, 234 metrization of convergence in law, 111, 112, 329, 336 Mills' ratio, 57 minimax risk, 217, 218 minoration, Sudakov, 33 mixed volume, 79, 270 modification, 43, 138, 229,425-426 monotone regression, 282 multinomial distribution, 4, 16, 368, 399-401 Neyman-Pearson lemma, 56, 175 node (of a graph), 147 nonatomic measure, 92, 192, 225, 403 norm, 23 bounded Lipschitz, 112, 252, 336 supremum, 3, 7, 10, 18, 19, 27, 170, 180, 190, 300 normal distribution or law, 357 in finite dimensions, 23, 24, 26, 36 in infinite dimensions, 23, 25 in one dimension, 23-25, 86 normed space, 23 norming set, 285 null hypothesis, 332 order bounded, 223 ordinal triangle, 170 orthants, as VC class, 155 outer measure, 100 outer probability, 100, 336, 358
P-Donsker class, see Donsker class P-perfect, 104 p-variation, 329 PAC learning, 227 Pascal's triangle, 135 pattern recognition, 226 perfect function, 104-107, 112, 333 perfect probability space, 106 Pettis integral, 51, 248, 407-412 Poisson distributions, 17, 19, 127, 344, 346, 368, 379 process, 127, 368-370, 374, 388 Poissonization, 344-346, 363, 367-373, 389 polar, of a set, 43, 84 Polish space, 7, 20, 119, 303 Pollard's entropy condition, 208, 316, 322, 327, 329 polyhedra, polytopes, 273, 283, 384 polynomials, 140 polytope, 141
portmanteau theorem, 111, 112, 131
pregaussian, 92-94, 117, 129, 215, 315, 388 uniformly, 329 prelinear, 45, 46, 93, 138 pseudo-seminorm, 27 pseudometric, 75 quasi-order, 145 quasiperfect, 106 Rademacher variables, 13, 206, 338 Radon law, 358 random element, 94, 100, 106, 111, 115, 127, 286-291, 336 random entropy, 223, 226 RAP, xiii realization a.s. convergent, 106-111 of a stochastic process, 46, 47 reflexive Banach space, 50 regular conditional probabilities, 120 reversed (sub)martingale, 199-201, 205 risk, 217, 218 sample, 332, 336 -bounded, 52, 59 -continuous, 2, 3, 7, 52, 55, 76, 85, 92 function, 44, 76-78 modulus, 52 space, 91, 351 Sauer's lemma, 135, 145, 149, 154, 162, 167, 207, 221 Schmidt ellipsoid, 80, 81 selection theorem, 186, 188, 193 semialgebraic sets, 165 seminorm, 23 separable measurable space, 179 shatter, 134, 330 for functions, 224 or-algebra Borel, 7, 23, 25, 28, 74 product, 91, 98, 181, 183
tail, 30,49 slowly varying function, 367 standard deviation, 33 standard model, 91, 191 standard normal law, 24, 39, 43, 53 stationary process, 204 statistic, 171, 174, 176, 357, 360 Stirling's formula, 16 stochastic process, 27, 43, 52, 74, 75, 78, 87, 217, 370, 425 stopping set, 363, 371, 373, 388 Strassen's theorem, 6, 19, 131 subadditive process, 204, 205 subgraph, 159, 250, 373 submartingale, 29, 199, 200
Sudakov-Chevet theorem, 33, 34, 39, 45, 79,81 sufficient a-algebra, 171-179 collection of functions, 177 statistic, 171, 174, 176, 191, 192 superadditive process, 204 supremum norm, 28 Suslin property, 185, 186, 190, 217, 229 symmetric random elements, 289 symmetrization, 198, 205, 206, 211, 228, 339, 343
universally measurable, 20, 105, 186, 199, 202, 203, 413
universally null, 225 upper integral, 93, 95 Vapnik-Cervonenkis classes of functions hull, 160, 163, 164, 315, 316, 327 major, 159, 160, 164, 167, 308, 315, 316
subgraph, 159, 160, 214, 223, 307, 308, 315-317,327,328,330 of sets, 134-159, 165, 169, 197, 215, 218,
tail event or a-algebra, 30, 49 tail probability, 50 tetrahedra, 227 tight, 105, 358 uniformly, 105 tightness, 20 topological vector space, 25, 27, 413 tree, 147 treelike ordering, 145, 222 triangle function, 84 triangular arrays, 247, 337, 346 triangulation, 273, 283 truncation, 13, 247 two-sample process, 128, 332-335 u.m., see universally measurable, 105 uniform distribution on [0, 1], 2, 119 uniformly integrable, 392, 393, 396, 397 uniformly pregaussian, 330 universal class a function, 183
219,221-223,225,226,314 VC, see Vapnik-Cervonenkis classes VCM class, 221 VCSM class, 307 vector space
measurable, 25-27 topological, 25, 27, 410 version, 43, 71, 74, 138, 425-426 version-continuous, 74, 75, 77, 78 volume, 78, 79, 85, 262, 270, 320, 365 weak* topology, 50 width, of a convex set, 80 Wiener process, 2 witness of irregularity, 224
Young-Orlicz norms, modulus, 55, 85, 86,
420-424 zero-one law for Gaussian measures, 26
Author Index
Adamski, W., 191,194 Adler, R. J., 222 Alexander, K. S., 121, 131, 168, 221, 307, 308 Alon, N., 166, 226, 227 Andersen, N. T., 88, 130, 131, 246, 248 Anderson, T. W., 86 Araujo, A., 359 Arcones, M., 121, 131, 239, 248 Assouad, P., 167, 168, 218, 219, 225, 229, 230 Aumann, R. J., 182, 193 Bahadur, R. R., 193 Bakhvalov, N. S., 363, 364, 388 Bauer, H., 120 Beck, J., 309 Ben-David, S., 226 Bennett, G., 20 Berger, E., 131 Berkes, I., 7, 20 Bernstein, S. N., 12, 20 Bickel, P., 360 Billingsley, P., 19 Birge, L., 154, 282 Birkhoff, Garrett, 411 Birnbaum, Z. W., 424 Blum, J. R., 234, 235, 248 Blumberg, H., 130 Blumer, A., 227 Bochner, S., 411 Bolthausen, E., 269, 283 Bonnesen, T., 79, 270 Borell, C., 86 Borisov, I. S., 244,248 Borovkov, A. A., 311 Breiman, L., 311 Bretagnolle, J., 308
Bronštein, E. M., 269, 282, 283 Brown, L. D., 222, 397
Cabana, E. M., 222 Carl, B., 322, 330 Červonenkis, A. Ya., 134, 135, 162, 167, 221, 226, 229 Cesa-Bianchi, N., 226 Chebyshev, P. L., 5, 12 Chernoff, H., 16, 20, 124 Chevet, S., 31, 33, 34, 86 Clements, G. F., 282 Coffman, E. G., 375, 389 Cohn, D. L., 101, 193, 411, 420 Danzer, L., 167 Darmois, G., 25, 86 DeHardt, J., 234, 235, 248 Devroye, L., 221 Dobric, V., 88, 130, 131 Donsker, M. D., 2, 3, 9, 19, 311 Doob, J. L., 2, 19, 229 Douady, A., 415 Drake, F. R., 403 Dudley, R. M., 20, 86, 87, 130, 131, 154, 167, 168, 193, 222, 225, 229, 248, 282, 307, 310, 311, 329, 330, 360,
365, 388, 389 Dunford, N., 50, 411, 412 Durst, M., 193, 229, 244, 248 Dvoretzky, A., 221 Eames, W., 130 Effros, E. G., 190, 194 Efron, B., 357, 361 Eggleston, H. G., 270 Ehrenfeucht, A., 227 Eilenberg, S., 260 Einmahl, U., 193
Ersov, M. P., 131 Evstigneev, I. V., 389
Feldman, J., 87 Feller, W., 17, 389 Fenchel, W., 79, 270 Ferguson, T. S., 176
Fernique, X., 24, 25, 27, 39, 40, 50, 58-60, 62, 63, 86, 87, 295, 321, 359 Fisher, R. A., 193 Freedman, D. A., 193, 311, 360 Fu, J., 282 Gaenssler, P., 191, 194, 248, 360 Gelfand, I. M., 411, 412 Giné, E., 20, 88, 131, 225, 226, 239, 246,
248, 307, 328-330, 336, 357-360 Gnedenko, B. V., 130, 311 Goffman, C., 130 Goodman, V., 222, 307 Gordon, Y., 86 Grünbaum, B., 167 Graves, L. M., 411 Gross, L., 86, 87 Gruber, P. M., 283 Gutmann, S., 193 Györfi, L., 227 Hadwiger, H., 80 Hall, P., 361 Halmos, P. R., 193 Harary, F., 167 Hausdorff, F., 282 Haussler, D., 165, 166, 168, 222, 226, 227 Heinkel, B., 307 Hobson, E. W., 397 Hoeffding, W., 13, 14, 16, 20 Hoffmann-Jørgensen, J., 130, 131, 310, 341
Il'in, V. A., 397 Itô, K., 87 Jain, N. C., 208, 229, 308, 310, 359
Kac, M., 389 Kahane, J.-P., 21, 86, 310 Kartashev, A. P., 397 Kelley, J. L., 404 Kiefer, J., 221 Kingman, J. F. C., 204, 229 Klee, V. L., 167 Kolmogorov, A. N., 2, 11, 19, 20, 130, 282, 311 Koltchinskii, V. I., 196, 198, 208, 228,
229, 307-309 Komatsu, Y., 57, 87 Komlós, J., 308
Korolyuk, V. S., 311 Krasnosel'skii, M. A., 424 Kuelbs, J., 307, 310 Kulkarni, S. R., 227
Landau, H. J., 24, 27, 50, 58, 59, 86, 295, 321 Lang, S., 397 Laskowski, M. C., 165 Le Cam, L., 193, 311 Leadbetter, M. R., 222 Lebesgue, H., 183, 193 Ledoux, M., 81, 87, 223, 307, 359 Lehmann, E. L., 56, 175 Leichtweiss, K., 79 Leighton, T., 375 Lévy, P., 17, 21 Lindgren, G., 222 Lorentz, G. G., 20, 282 Lueker, G. S., 389 Lugosi, G., 227 Luxemburg, W. A. J., 130, 424 Major, P., 308, 311 Marcus, M. B., 24, 50, 58, 59, 81, 86, 208, 229, 295, 310, 321, 359 Marczewski, E., 404 Massart, P., 221-223, 308, 309 Maurey, B., 324, 330 May, L. E., 130 McKean, H. P., 87 McMullen, P., 80 Milman, V. D., 79 Mityagin, B. S., 81 Mourier, E., 237, 248 Munkres, J. R., 283 Nagaev, S. V., 311 Natanson, I. P., 193 Neveu, J., 193 Neyman, J., 175, 193
Okamoto, M., 16, 20, 124 Orlicz, W., 423, 424 Ossiander, M., 239, 246, 248, 253
Pachl, J. K., 130 Pettis, B. J., 411 Philipp, W., 7, 20, 130, 307, 310, 311, 388 Pickands, J., 222 Pisier, G., 79, 307, 308, 330 Pollard, D. B., 168, 196, 198, 208, 214, 228, 229 Posner, E. C., 20 Pozniak, E. G., 397 Price, G. B., 412
Prokhorov, Yu. V., 6, 19 Pyke, R., 389
Radon, J., 140, 141, 167 Rao, B. V., 193 Rao, C. R., 86 Révész, P., 309 Rhee, W. T., 375 Richardson, T., 227 Rio, E., 309 Rodemich, E. R., 20 Rootzén, H., 222 Rozhdestvenskii, B. L., 397 Rudin, W., 423, 424 Rumsey, H., 20 Rutitskii, Ia. B., 424 Ryll-Nardzewski, C., 130
Sainte-Beuve, M.-F., 186, 193 Samorodnitsky, G., 222 Sauer, N., 135, 167 Savage, L. J., 193 Sazonov, V. V., 130 Schaefer, H. H., 413, 416 Schaerf, H. M., 406 Schmidt, W. M., 365, 388 Schwartz, J. T., 50, 411, 412 Schwartz, L., 37, 415, 416 Shao, Jun, 361 Shelah, S., 167 Shepp, L. A., 24, 27, 50, 58, 59, 86, 295, 321 Shor, P. W., 374, 375, 389 Shortt, R. M., 20, 130 Sikorski, R., 404 Skitovič, V. P., 25, 86 Skorokhod, A. V., 19, 106, 130, 131 Slepian, D., 31, 86 Smith, D. L., 222 Steele, J. M., 203, 204, 226, 229, 282 Steenrod, N., 260 Stengle, G., 165 Stone, A. H., 404 Strassen, V., 6 Strobl, F., 191, 200, 229, 358 Stute, W., 248 Sudakov, V. N., 31, 33, 34, 80, 81, 86, 87 Sun, Tze-Gong, 264, 282
Talagrand, M., 59, 60, 63, 81, 82, 87, 88, 223, 224, 226, 307, 308, 359, 374, 375 Tarsi, M., 166 Tibshirani, R. J., 361 Tikhomirov, V. M., 20, 282 Topsøe, F., 131 Tu, Dongsheng, 361 Tusnády, G., 308
Uspensky, J. V., 20
Valiant, L. G., 227 van der Vaart, A., 131, 358 Vapnik, V. N., 134, 135, 162, 167, 221, 226, 227, 229 Vorob'ev, N. N., 7, 20 Vulikh, B. Z., 130
Warmuth, M., 227 Wellner, J., 358 Wenocur, R. S., 167, 168 Wiener, N., 2 Wolfowitz, J., 221, 229 Wright, F. T., 282
Young, W. H., 423, 424 Yukich, J., 165, 389
Zaanen, A. C., 130, 424 Zakon, E., 406 Zeitouni, O., 227 Ziegler, K., 191 Zink, R. E., 130 Zinn, J., 88, 131, 225, 226, 246, 307,
328-330, 336, 357-360
Index of Notation
A := B means A is defined by B, xiv A =: B means B is defined by A, xiv ≍, of the same order, 252 ⊗, set of Cartesian products, 153
δA, δ-interior of A, 262
δ_x, point mass at x, 1 D(ε, A, d), packing number, 10 D_F^(1)(δ, F), 196
⇒, converges in law, 94 ⊗, product σ-algebra, 109 ∩, class of intersections, 134 ∩_{i=1}^k, class of intersections, 251 ∪, class of unions, 153 ∪_{k=1}, class of unions, 251
D_F^(1)(ε, F, Q), 161 D^(p)(ε, F), 161 D^(p)(ε, F, Q), 161 D^p, partial derivative, 252 dens, combinatorial density, 137 diam, diameter, 10
(·, ·)_{0,P}, covariance, 92
‖·‖_α, differentiability norm, 252 ‖·‖_BL, bounded Lipschitz norm, 112 ‖·‖_C, sup norm over C, 138 ‖·‖_F, sup norm over F, 94 ‖·‖_L, Lipschitz seminorm, 112 ‖·‖′, dual norm, 24
d_P, d_P(A, B) := P(A Δ B), 156 d_{P,Q}, L^p(Q) distance, 161 d_sup, supremum distance, 235
E(k, n, p), binomial probability, 16 E*, upper expectation, 95 e_λ, λ a signed measure, 345 ess.inf, essential infimum, 95 ess.sup, essential supremum, 43
[f, g] := {h: f ≤ h ≤ g}, 234 α_n := n^{1/2}(F_n − F), 1 AEC(P, r), asymptotic equicontinuity condition, 117 β(·, ·), bounded Lipschitz distance, 115 B(k, n, p), binomial probability, 16
C(F) := I(F) ∪ R(F), 260 C^t, polar of C, 43 card, cardinality, 134
cov_P, 92 C(α, K, d), subgraphs of α-smooth f's, 252 Δ, symmetric difference, 95 ΔA, set of symmetric differences, 143 Δ^C, number of induced subsets, 134
Φ, normal distribution function, 55 φ, normal density, 55 F, distribution function, 1 F_n, empirical distribution function, 1 F_F, envelope, 161 f_*, 98 f ∘ g, composition, 94 f*, measurable cover function, 95 F_{α,K}(F), α-differentiable functions, 252
γ(T) = inf γ_μ(T), 59 γ_μ(T), majorizing measure integral, 59 G_P, 91 G_{α,K,d}, differentiable functions, 252
H(ε, A, d) = log N(ε, A, d), 11 H_δ(F, M), hull, 160
P_B, bootstrap empirical measure, 335 P*, outer probability, 95
h(·, ·), Hausdorff metric, 250
pos(G) := {pos(g): g ∈ G}, 139 pos(g) := {x: g(x) > 0}, 139
I(F), inside F boundary, 260 I = [0, 1], unit interval, 19 I^n, unit cube in ℝ^n, 7 i.i.d., indep. identically dist., 1
L, isonormal process, 39 L(A)*, ess.sup_A L, 43 |L(A)|*, ess.sup_A |L|, 43
L²(P), 91 G0, 95 Go, 92 LL_d, lower layers in ℝ^d, 264 LL_{d,1}, lower layers in unit cube, 264
m^C(n), 134 ν_n, empirical process, 91 ν_B, bootstrap empirical process, 336 N_[ ](ε, F, P), 234
nn(g) = {x: g(x) > 0}, 139 π_0, centering projection, 92 P_n, empirical measure, 1
ρ(·, ·), Prokhorov distance, 115 ρ_P, cov_P distance, 92
R(F), range(F), 260 ℝ̄ = [−∞, ∞], 93 ℝ^X, all functions X → ℝ, 139
∫*, upper integral, 94 ∫_*, lower integral, 98 S¹, unit circle, 138
S(C) = V(C) − 1, VC index, 134 s̄co, symmetric closed convex hull, 49 sco, symmetric convex hull, 117 S(ℝ^n), L. Schwartz space, 37 U[0, 1], uniform law on [0, 1], 126
V(C) = S(C) + 1, VC index, 134
Var(X) = E((X − EX)²), 12 [x], least integer ≥ x, 19 x^p = x_1^{p(1)} x_2^{p(2)} ···, 252 x_t, Brownian motion, 2
y_t, Brownian bridge, 1