P*(X ≥ t) ≤ P(X* ≥ t) ≤ P*(X > t − ε).

Proof Clearly {X > t} ⊂ {X* > t} and {X ≥ t} ⊂ {X* ≥ t}, so we have "≤" in (a) and the first inequality in (b). Take a measurable cover A ⊃ {X > t}, so P*(X > t) = P(A). Then X* ≤ t a.s. outside A, so P(X* > t) ≤ P(A), proving (a). Thus for 0 < δ < ε, P(X* > t − δ) = P*(X > t − δ) ≤ P*(X > t − ε). Letting δ ↓ 0, so that P(X* > t − δ) ↓ P(X* ≥ t), proves (b). □

Let (Ω, A, P) be a probability space. For a function f from Ω into [−∞, ∞], let the lower integral be ∫_* f dP := sup{∫ g dP : g measurable, g ≤ f}, and let f_* be the essential supremum of all measurable functions g ≤ f. Then, just as for f*, f_* is well-defined up to a.s. equality, and ∫_* f dP = ∫ f_* dP whenever either side is defined, as in Theorem 3.2.1.
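On a finite probability space these objects are easy to compute, which makes for a useful sanity check: if the σ-algebra is generated by a partition into atoms, then f* takes the supremum of f over each atom and f_* the infimum. A minimal sketch (the atoms, weights, and f below are invented for illustration):

```python
# Atoms of a finite sigma-algebra on {0,...,5}: a function is measurable
# iff it is constant on each atom.  P puts its mass on whole atoms.
atoms = [[0, 1], [2, 3], [4, 5]]
P = [0.5, 0.3, 0.2]                    # P(atoms[i])
f = [3.0, 1.0, 0.0, 4.0, 2.0, 2.0]    # an arbitrary (non-measurable) function

def cover(f, agg):
    """Measurable cover f* (agg=max) or lower envelope f_* (agg=min)."""
    g = [0.0] * len(f)
    for A in atoms:
        v = agg(f[x] for x in A)
        for x in A:
            g[x] = v
    return g

f_star = cover(f, max)     # smallest measurable g >= f
f_low = cover(f, min)      # largest measurable g <= f

def E(g):                  # integral of a measurable function, atom by atom
    return sum(P[i] * g[A[0]] for i, A in enumerate(atoms))

upper, lower = E(f_star), E(f_low)   # upper and lower integrals of f
assert lower <= upper
# f_* = -((-f)*), as in the text
assert f_low == [-v for v in cover([-v for v in f], max)]
```

Here the identities ∫* f dP = ∫ f* dP and ∫_* f dP = ∫ f_* dP of Theorem 3.2.1 hold by construction.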
It is easy to check that f_* = −((−f)*) and that ∫_* f dP = −∫* (−f) dP. So the convergence in law Yₙ ⇒ Y₀ as defined in Section 3.1 also implies that ∫_* g(Yₙ) dQₙ → ∫ g(Y₀) dQ₀ for bounded continuous g. Next, here is a one-sided Tonelli-Fubini theorem for starred functions:
3.2.7 Theorem Let (X, A, P) × (Y, B, Q) be a product of two probability spaces. For a real-valued function f ≥ 0 on X × Y, define f* with respect to P × Q. For each x ∈ X let (E₂*f)(x) := ∫* f(x, y) dQ(y). Then

(3.2.8)    E₁*E₂*f(X, Y) ≤ ∫ f*(x, y) d(P × Q)(x, y),

where E₁* = E* with respect to P. Also, if Q is purely atomic, so that Σⱼ Q({yⱼ}) = 1 for some yⱼ ∈ Y, then

(3.2.9)    E₁*E₂*f(X, Y) ≤ E*f(X, Y) = E₂E₁*f(X, Y).
Proof By the usual Tonelli-Fubini theorem, we have

(3.2.10)    ∫ f*(x, y) d(P × Q)(x, y) = ∫∫ f*(x, y) dQ(y) dP(x),

and ∫ f*(x, y) dQ(y) is measurable in x. For each x ∈ X let fₓ(y) := f(x, y). Then f(x, y) ≤ f*(x, y), which is measurable in y, so fₓ*(y) ≤ f*(x, y) for Q-almost all y. Thus, since E₂*fₓ = Efₓ* by Theorem 3.2.1,

(E₂*f)(x) = ∫ fₓ*(y) dQ(y) ≤ ∫ f*(x, y) dQ(y),

and (3.2.8) follows by integrating in x and using (3.2.10). Next, to prove (3.2.9): on Y, since Q is purely atomic, all real-valued functions are measurable (for the completion of Q), so E₂* = E₂. The left inequality then follows from (3.2.8), and the equality from (3.2.10), so (3.2.9) is proved. □

We also have a one-sided monotone convergence theorem with stars:
3.2.11 Theorem Let (Ω, A, P) be a probability space and fⱼ real-valued functions on Ω such that fⱼ ↑ f, that is, fⱼ(x) ↑ f(x) for all x ∈ Ω. If E*f₁ > −∞ then E*fⱼ ↑ E*f as j → ∞.

Proof By Theorem 3.2.1, E*fⱼ = Efⱼ* for each j, and as j → ∞, fⱼ* increases a.s. up to some measurable function g, where clearly g ≤ f* almost surely. If g < f* on some set A with P(A) > 0, then fⱼ ≤ g a.s. for all j implies f ≤ g a.s., and so f* ≤ g a.s. on A, a contradiction. So g = f* a.s., and by monotone convergence (using E*f₁ > −∞), E*fⱼ = Efⱼ* ↑ Eg = Ef* = E*f. The theorem follows. □
Uniform Central Limit Theorems: Donsker Classes
Recall that for any measure space (X, A, μ) and any C ⊂ X, the outer measure is defined by μ*(C) := inf{μ(A) : A ∈ A, A ⊃ C}. Note that there exist subsets Aⱼ := A(j) of [0, 1] with outer measure λ*(Aⱼ) = 1 for all j and Aⱼ ↓ ∅ (RAP, Problem 3.4.2). Letting fⱼ := 1_{A(j)}, we have fⱼ ↓ 0 and E*fⱼ = 1 for all j, so the monotone convergence theorem fails for E* on decreasing sequences. Next is a Fatou lemma with stars.
3.2.12 Theorem Let (Ω, A, P) be a probability space and fⱼ any nonnegative real-valued functions on Ω. Then E* lim inf_{j→∞} fⱼ ≤ lim inf_{j→∞} E*fⱼ.

Proof For 1 ≤ k ≤ m we have inf_{n≥k} fₙ ≤ fₘ and so E* inf_{n≥k} fₙ ≤ E*fₘ, thus E* inf_{n≥k} fₙ ≤ inf_{m≥k} E*fₘ. Taking the supremum over k on both sides and using the monotone convergence theorem (3.2.11) gives the result. □
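The starred convergence theorems above can be checked by brute force on a toy finite space, modeling nonmeasurability by a σ-algebra with two atoms (all numbers here are invented; the decreasing-sequence counterexample in the text genuinely needs nonmeasurable subsets of [0, 1] and has no finite analogue):

```python
import random

atoms = [[0, 1], [2, 3]]
P = [0.6, 0.4]

def Estar(f):   # E* f = sum over atoms of P(atom) * sup of f on the atom
    return sum(P[i] * max(f[x] for x in A) for i, A in enumerate(atoms))

# Theorem 3.2.11: f_j increasing pointwise to f gives E* f_j increasing to E* f
f = [1.0, 3.0, 2.0, 5.0]
seq = [[f[x] - (x + 1) / j for x in range(4)] for j in range(1, 101)]
vals = [Estar(fj) for fj in seq]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
assert vals[-1] <= Estar(f) < vals[-1] + 0.05

# Theorem 3.2.12 (Fatou with stars), with the liminf over a finite family
random.seed(1)
fs = [[random.random() for _ in range(4)] for _ in range(50)]
inf_f = [min(fj[x] for fj in fs) for x in range(4)]
assert Estar(inf_f) <= min(Estar(fj) for fj in fs) + 1e-12
```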
3.3 Almost uniform convergence and convergence in outer probability

In Section 3.1 the definition of convergence of laws was adapted to define convergence in law for random elements which may not have laws defined. In this section the same will be done for convergence in probability and almost sure convergence. Let (Ω, A, P) be a probability space, (S, d) a metric space, and fₙ functions from Ω into S. Then fₙ will be said to converge to f₀ in outer probability if d(fₙ, f₀)* → 0 in probability as n → ∞, or equivalently, by Lemma 3.2.6, for every ε > 0, P*(d(fₙ, f₀) > ε) → 0 as n → ∞. Also, fₙ is said to converge to f₀ almost uniformly if, as n → ∞, d(fₙ, f₀)* → 0 almost surely. The following is immediate:
3.3.1 Proposition Almost uniform convergence always implies convergence in outer probability.
If fn are all measurable functions, then clearly convergence in outer probability is equivalent to the usual convergence in probability, and almost uniform convergence to almost sure convergence. Now some definitions will be given for Glivenko-Cantelli properties, which are laws of large numbers for empirical measures.
Definition. If (X, A, P) is a probability space and F is a class of integrable real-valued functions, F ⊂ 𝓛¹(X, A, P), then F will be called a strong (resp. weak) Glivenko-Cantelli class for P if, as n → ∞, ‖Pₙ − P‖_F → 0 almost uniformly (resp. in outer probability), where ‖Pₙ − P‖_F := sup{|∫ f dPₙ − ∫ f dP| : f ∈ F}. In the following proposition, part (C) is what is usually called "Egorov's theorem" for almost surely convergent sequences of measurable functions (RAP, Theorem 7.5.1).
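As a concrete instance of the Glivenko-Cantelli definition, take P uniform on [0, 1] and F the indicators of half-lines; then ‖Pₙ − P‖_F is the Kolmogorov-Smirnov statistic sup_t |Fₙ(t) − t|, and the classical Glivenko-Cantelli theorem says it goes to 0. A quick simulation (sample sizes and seed arbitrary):

```python
import random

random.seed(0)

def ks_uniform(n):
    """||P_n - P||_F = sup_t |F_n(t) - t| for an i.i.d. Uniform[0,1]
    sample of size n, F being the indicators of half-lines (-inf, t]."""
    xs = sorted(random.random() for _ in range(n))
    # the supremum is attained at the jump points of the empirical cdf F_n
    return max(max(abs((i + 1) / n - x), abs(x - i / n))
               for i, x in enumerate(xs))

d100, d100000 = ks_uniform(100), ks_uniform(100_000)
assert d100000 < d100 < 1.0   # the discrepancy shrinks as n grows
```

Here everything is measurable, so almost uniform convergence reduces to the usual almost sure convergence; the starred definitions matter for richer classes F where the supremum need not be measurable.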
3.3.2 Proposition Let (Ω, A, P) be a probability space, (S, d) a metric space, and fₙ any functions from Ω into S for n = 0, 1, .... Then the following are equivalent:

(A) fₙ → f₀ almost uniformly.
(B) For any ε > 0, P*{sup_{n≥m} d(fₙ, f₀) > ε} ↓ 0 as m → ∞.
(C) For any δ > 0 there is some B ∈ A with P(B) > 1 − δ such that fₙ → f₀ uniformly on B.
(D) There exist measurable hₙ ≥ d(fₙ, f₀) with hₙ → 0 a.s.

Proof (A) implies that

(sup_{n≥m} d(fₙ, f₀))* ≤ sup_{n≥m} (d(fₙ, f₀)*) ↓ 0 a.s. as m → ∞,

which implies (B).

Assuming (B), for k = 1, 2, ..., let Cₖ := {sup_{n≥m(k)} d(fₙ, f₀) > 1/k}, where m(k) is large enough so that P*(Cₖ) < 2⁻ᵏ. Take measurable covers Bₖ for Cₖ, so Cₖ ⊂ Bₖ, Bₖ ∈ A, and P(Bₖ) < 2⁻ᵏ. For r = 1, 2, ..., let Aᵣ := Ω \ ∪_{k≥r} Bₖ. Then P(Aᵣ) > 1 − 2^{1−r} and fₙ → f₀ uniformly on Aᵣ, so (C) holds.

Now assume (C). Take Cₖ ∈ A, k = 1, 2, ..., such that P(Cₖ) ↑ 1 and fₙ → f₀ uniformly on Cₖ; we can take C₁ ⊂ C₂ ⊂ .... Take mₖ such that d(fₙ, f₀) ≤ 1/k on Cₖ for all n ≥ mₖ. Then d(fₙ, f₀)* ≤ 1/k a.s. on Cₖ for n ≥ mₖ, so d(fₙ, f₀)* → 0 a.s., proving (A). Clearly, (A) and (D) are equivalent. □
Example. In [0, 1] with Lebesgue measure P, let A₁ ⊃ A₂ ⊃ ... be sets with P*(Aₙ) = 1 and ∩_{n=1}^∞ Aₙ = ∅ (e.g., RAP, Section 3.4, Problem 2; Cohn, 1980, p. 35). Then 1_{Aₙ} → 0 everywhere and, in that sense, almost surely, but not almost uniformly. Note also that 1_{Aₙ} doesn't converge to 0 in law as defined in Section 3.1. To avoid such pathology, almost uniform convergence is helpful.
3.3.3 Proposition Let (S, d) and (Y, e) be two metric spaces and (Ω, A, P) a probability space. Let fₙ be functions from Ω into S for n = 0, 1, ..., such that fₙ → f₀ in outer probability as n → ∞. Assume that f₀ has separable range and is measurable (for the Borel σ-algebra on S). Let g be a continuous function from S into Y. Then g(fₙ) → g(f₀) in outer probability.

Note. If ∫ G(f₀) dP is defined for all bounded continuous real G (as it must be if fₙ ⇒ f₀, by definition), then the image measure P∘f₀⁻¹ is defined on all Borel subsets of S (RAP, Theorem 7.1.1). Such a law does have a separable support except perhaps in some set-theoretically pathological cases (Appendix C).

Proof Given ε > 0, for k = 1, 2, ..., let Bₖ := {x ∈ S : for all y ∈ S, d(x, y) < 1/k implies e(g(x), g(y)) < ε}. Then each Bₖ is closed and Bₖ ↑ S as k → ∞. Fix k large enough so that P(f₀⁻¹(Bₖ)) > 1 − ε. Then

{e(g(fₙ), g(f₀)) > ε} ∩ f₀⁻¹(Bₖ) ⊂ {d(fₙ, f₀) ≥ 1/k}.

Thus

P*{e(g(fₙ), g(f₀)) > ε} ≤ ε + P*{d(fₙ, f₀) ≥ 1/k} < 2ε

for n large enough. □
3.3.4 Lemma Let (Ω, A, P) be a probability space and {gₙ}_{n≥0} a uniformly bounded sequence of real-valued functions on Ω such that g₀ is measurable. If gₙ → g₀ in outer probability then lim sup_{n→∞} ∫* gₙ dP ≤ ∫ g₀ dP.

Proof Let |gₙ(x)| ≤ M < ∞ for all n and all x ∈ Ω; we can assume M = 1. Given ε > 0, for n large enough P*(|gₙ − g₀| > ε) < ε, so there is a measurable set Aₙ on which |gₙ − g₀| ≤ ε with P(Ω \ Aₙ) < ε. Then

∫* gₙ dP ≤ ε + ∫*_{Aₙ} gₙ dP ≤ 2ε + ∫_{Aₙ} g₀ dP ≤ 3ε + ∫ g₀ dP.

Letting ε ↓ 0 completes the proof. □

On any metric space, the σ-algebra will be the Borel σ-algebra unless something is said to the contrary.
3.3.5 Corollary If fₙ are functions from a probability space into a metric space, fₙ → f₀ in outer probability, and f₀ is measurable with separable range, then fₙ ⇒ f₀.

Proof Apply Proposition 3.3.3 to g = G for any bounded continuous G, and Lemma 3.3.4 to gₙ := G∘fₙ and also to gₙ := −G∘fₙ. □
3.4 Perfect functions

For a function g defined on a set A let g[A] := {g(x) : x ∈ A}. It will be useful that, under some conditions on a measurable function g and general real-valued f, (f∘g)* = f*∘g. Here are some equivalent conditions:

3.4.1 Theorem Let (X, A, P) be a probability space, (Y, B) any measurable space, and g a measurable function from X to Y. Let Q be the restriction of P∘g⁻¹ to B. For any real-valued function f on Y, define f* for Q. Then the following are equivalent:

(a) For any A ∈ A there is a B ∈ B with B ⊂ g[A] and Q(B) ≥ P(A);
(b) for any A ∈ A with P(A) > 0 there is a B ∈ B with B ⊂ g[A] and Q(B) > 0;
(c) for every real function f on Y, (f∘g)* = f*∘g a.s.;
(d) for any D ⊂ Y, (1_D∘g)* = 1_D*∘g a.s.
Proof Clearly (a) implies (b).

To show (b) implies (c), note that always (f∘g)* ≤ f*∘g a.s., since f*∘g is a measurable function ≥ f∘g a.s. Suppose (f∘g)* < f*∘g on a set of positive probability. Then for some rational r, (f∘g)* < r < f*∘g on a set A ∈ A with P(A) > 0, and we may assume f∘g ≤ (f∘g)* everywhere on A. By (b), take B ∈ B with B ⊂ g[A] and Q(B) > 0. Each y ∈ B is of the form g(x) for some x ∈ A, so f(y) ≤ (f∘g)*(x) < r while f*(y) > r. Then the measurable function equal to min(f*, r) on B and to f* elsewhere is ≥ f but less than f* on B, with Q(B) > 0, contradicting the minimality of f*. So (b) implies (c), and clearly (c) implies (d).

To show (d) implies (a), let A ∈ A and D := Y \ g[A]. For the indicator 1_D we can take 1_D* = 1_C, where C ∈ B is a measurable cover of D. Since 1_D∘g = 0 on A and A is measurable, (1_D∘g)* = 0 a.s. on A. Let B := Y \ C. Then B ⊂ g[A], and

Q(B) = 1 − Q(C) = 1 − ∫ 1_D* d(P∘g⁻¹),

which by the image measure theorem (RAP, Theorem 4.1.11) and by (d) equals

1 − ∫ 1_D*∘g dP = 1 − ∫ (1_D∘g)* dP ≥ P(A),

so (a) holds. □

Note. In (a) or (b), if the direct image g[A] ∈ B, we could just set B := g[A]. But for any uncountable complete separable metric space Y, there exist a complete separable metric space S (for example, a countable product ℕ^∞ of copies of ℕ) and a continuous function f from S into Y such that the image f[S] is not a Borel set in Y (RAP, Theorem 13.2.1, Proposition 13.2.5). If f is only required to be Borel measurable, then S can also be any uncountable complete separable metric space (RAP, Theorem 13.1.1).
A function g satisfying any of the four equivalent conditions in Theorem 3.4.1 will be called perfect, or P-perfect. Coordinate projections on a product space are, as one would hope, perfect:

3.4.2 Proposition Suppose A = X × Y, P is a product probability ν × m, and g is the natural projection of A onto Y. Then g is P-perfect.

Proof Here P∘g⁻¹ = m. For any B ⊂ A let B_y := {x : (x, y) ∈ B}, y ∈ Y. If B is measurable, then by the Tonelli-Fubini theorem, C := {y : ν(B_y) > 0} is measurable, C ⊂ g[B], and P(B) ≤ m(C), so condition (a) of Theorem 3.4.1 holds. □
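Condition (a) for a coordinate projection can be verified exhaustively on a small finite product space (the spaces and weights below are made up; every subset is measurable here, so the check runs over all 2⁶ sets B):

```python
from itertools import product

X, Y = [0, 1, 2], [0, 1]
nu = {0: 0.2, 1: 0.3, 2: 0.5}       # law on X
m = {0: 0.4, 1: 0.6}                # law on Y;  P = nu x m,  g(x, y) = y

def check_condition_a(B):
    """Condition (a) of Theorem 3.4.1 for the projection g onto Y."""
    PB = sum(nu[x] * m[y] for (x, y) in B)
    # C := {y : nu(B_y) > 0} is measurable and contained in g[B]
    C = {y for y in Y if any((x, y) in B for x in X)}
    assert C == {y for (_, y) in B}                # here C equals g[B]
    assert sum(m[y] for y in C) >= PB - 1e-12      # Q(C) >= P(B)

pts = list(product(X, Y))
for mask in range(1 << len(pts)):
    check_condition_a({pts[i] for i in range(len(pts)) if mask >> i & 1})
```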
3.4.3 Theorem Let (Ω, A, P) be a probability space and (S, d) a metric space. Suppose that for each n = 0, 1, ..., (Yₙ, Bₙ) is a measurable space, gₙ a perfect measurable function from Ω into Yₙ, and fₙ a function from Yₙ into S, where f₀ has separable range and is measurable. Let Qₙ := P∘gₙ⁻¹ on Bₙ and suppose fₙ∘gₙ → f₀∘g₀ in outer probability as n → ∞. Then fₙ ⇒ f₀ as n → ∞, for fₙ on (Yₙ, Bₙ, Qₙ).
Before this is proved, an example will be given. Recall that for any measure space (X, S, μ) and C ⊂ X, the inner measure is defined by μ_*(C) := sup{μ(A) : A ∈ S, A ⊂ C}.

3.4.4 Proposition Theorem 3.4.3 can fail without the hypothesis that the gₙ be perfect.

Proof Let C ⊂ I := [0, 1] satisfy 0 = λ_*(C) < λ*(C) = 1 for Lebesgue measure λ (RAP, Theorem 3.4.4). Let P := λ*, giving a probability measure on the Borel sets of C (RAP, Theorem 3.3.6). Let Ω := C, f₀ ≡ 0, Yₙ := I, fₙ := 1_{I\C}, and let gₙ be the identity from C into Yₙ for all n. Then fₙ∘gₙ ≡ 0 for all n, so fₙ∘gₙ → f₀∘g₀ in outer probability (and in any other sense). Let Bₙ be the Borel σ-algebra on Yₙ = I for each n, and let G be the identity from I into ℝ. Then ∫* G(fₙ) dQₙ = ∫* fₙ dQₙ = 1 for n ≥ 1, while ∫ G(f₀) dQ₀ = 0, so fₙ does not converge to f₀ in law. □
After Theorem 3.4.3 is proved, it will follow that the gn in the last proof are not perfect, as can also be seen directly from condition (c) or (d) in Theorem 3.4.1.
Proof of Theorem 3.4.3 By Corollary 3.3.5, fₙ∘gₙ ⇒ f₀∘g₀. Let H be any bounded, continuous, real-valued function on S. Then by RAP, Theorem 4.1.11,

∫* H(fₙ(gₙ)) dP → ∫ H(f₀(g₀)) dP = ∫ H(f₀) dQ₀.

Also,

∫* H(fₙ(gₙ)) dP = ∫ H(fₙ(gₙ))* dP        by Theorem 3.2.1
                = ∫ (H∘fₙ)*(gₙ) dP        by Theorem 3.4.1
                = ∫ (H∘fₙ)* dQₙ           (RAP, Theorem 4.1.11)
                = ∫* H(fₙ) dQₙ            by Theorem 3.2.1,

and the theorem follows. □
In Proposition 3.4.2, X × Y could be an arbitrary product probability space, but projection is a rather special function. The following fact will show that all measurable functions on reasonable domain spaces are perfect.

Recall that a law P is called tight if sup{P(K) : K compact} = 1. A set 𝒫 of laws is called uniformly tight if for every ε > 0 there is a compact K such that P(K) > 1 − ε for all P ∈ 𝒫. Also, a metric space (S, d) is called universally measurable (u.m.) if for every law P on the completion of S, S is measurable for the completion of P (RAP, Section 11.5). So any complete metric space, or any Borel set in its completion, is u.m.
3.4.5 Theorem Let (S, d) be a u.m. separable metric space and P a probability measure on the Borel σ-algebra of S. Then any Borel measurable function g from S into a separable metric space Y is perfect for P.

Note. In view of Appendix C, the hypothesis that S be separable is not very restrictive.

Proof Let A be any Borel set in S with P(A) > 0, and let 0 < ε < P(A). By the extended Lusin theorem (Theorem D.1, Appendix D) there is a closed set F with P(F) > 1 − ε/2 such that g restricted to F is continuous. Since P is tight (RAP, Theorems 11.5.1 and 7.1.3), there is a compact set K ⊂ A with P(K) > ε. Then C := F ∩ K is compact, C ⊂ A, P(C) > ε/2 > 0, and g is continuous on C, so g[C] is compact, g[C] ⊂ g[A], and (P∘g⁻¹)(g[C]) ≥ P(C) > 0, so the conclusion follows from condition (b) of Theorem 3.4.1. □
Let (Ω, A, P) be a probability space and g a measurable function from Ω into Y, where (Y, B) is a measurable space. Let Q := P∘g⁻¹ on B. Call g quasiperfect for P, or P-quasiperfect, if for every C ⊂ Y with g⁻¹(C) ∈ A, C is measurable for the completion of Q. The probability space (Ω, A, P) is then called perfect if every real-valued function G on Ω, measurable for the usual Borel σ-algebra on ℝ, is quasiperfect.
Example. A measurable, quasiperfect function g on a finite set need not be perfect: let X := {a₁, a₂, a₃, a₄, a₅, a₆}, U := {a₁, a₂}, V := {a₃, a₄}, W := {a₅, a₆}, A := {∅, U, V, W, U∪V, U∪W, V∪W, X}, P(U) = P(V) = P(W) = 1/3, Y := {0, 1, 2}, g(a₁) := g(a₃) := 0, g(a₂) := g(a₅) := 1, g(a₄) := g(a₆) := 2. Let B := {∅, Y}. For C ⊂ Y, g⁻¹(C) ∈ A if and only if C ∈ B, so g is quasiperfect. But P(U) > 0 and g[U] does not include any nonempty set in B, so g is not perfect.
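This example is small enough to verify by brute force (the labels follow the text):

```python
from itertools import combinations

X = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
U, V, W = {'a1', 'a2'}, {'a3', 'a4'}, {'a5', 'a6'}
A = [set(), U, V, W, U | V, U | W, V | W, set(X)]   # sigma-algebra on X
g = {'a1': 0, 'a3': 0, 'a2': 1, 'a5': 1, 'a4': 2, 'a6': 2}
Y = {0, 1, 2}
B = [set(), Y]                                      # sigma-algebra on Y

def preimage(C):
    return {x for x in X if g[x] in C}

subsets_of_Y = [set(c) for r in range(4) for c in combinations(sorted(Y), r)]

# quasiperfect: g^{-1}(C) is in A exactly when C is in B (i.e. trivial)
assert all((preimage(C) in A) == (C in B) for C in subsets_of_Y)

# not perfect: P(U) = 1/3 > 0, but g[U] = {0, 1} contains no nonempty
# member of B, so condition (b) of Theorem 3.4.1 fails
gU = {g[x] for x in U}
assert gU == {0, 1}
assert all(not (S and S <= gU) for S in B)
```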
3.4.6 Proposition Any perfect function is quasiperfect.

Proof Let C ⊂ Y with A := g⁻¹(C) ∈ A. By Theorem 3.4.1 take B ⊂ g[A] with B ∈ B and Q(B) ≥ P(A). Then B ⊂ C, so Q(B) = P(g⁻¹(B)) ≤ P(g⁻¹(C)) = P(A), and thus Q(B) = P(A). So the inner measure Q_*(C) = (P∘g⁻¹)(C). Likewise Q_*(Y \ C) = (P∘g⁻¹)(Y \ C), so Q*(C) = (P∘g⁻¹)(C) and C is measurable for the completion of Q. □
3.5 Almost surely convergent realizations

First let's recall a theorem of Skorokhod (RAP, Theorem 11.7.2): if (S, d) is a complete separable metric space and Pₙ are laws on S converging to a law P₀, then on some probability space there exist S-valued measurable functions Xₙ such that Xₙ has law Pₙ for each n and Xₙ → X₀ almost surely. This section will prove an extension of Skorokhod's theorem to our current setup. Having almost uniformly convergent realizations shows that the definition of convergence in law for random elements is reasonable, and the realizations will be useful in some later proofs on convergence in law.

Suppose fₙ ⇒ f₀, where the fₙ are random elements, in other words functions, not necessarily measurable except for n = 0, defined on some probability spaces (Ωₙ, Aₙ, Qₙ) into a possibly nonseparable metric space S. We want to find random elements Yₙ "having the same laws" as the fₙ for each n such that Yₙ → Y₀ almost surely or, better, almost uniformly. At first look it isn't clear what "having the same laws" should mean for random elements fₙ, n ≥ 1, not
having laws defined on any nontrivial σ-algebra. A way that turns out to work is to define Yₙ := fₙ∘gₙ, where the gₙ are functions from some other probability space Ω with probability measure Q into Ωₙ such that each gₙ is measurable and Q∘gₙ⁻¹ = Qₙ for each n. Thus the argument of fₙ will have the same law Qₙ as before. It turns out, moreover, that the gₙ should be not only measurable but perfect. Before stating the theorem, here is an example to show that there may really be no way to define a σ-algebra on S on which laws could be defined and yield an equivalence as in the next theorem, even if S is a finite set.

Example. Let (Xₙ, Aₙ, Qₙ) = ([0, 1], B, λ) for all n (λ = Lebesgue measure, B = Borel σ-algebra). Take sets C(n) ⊂ [0, 1] with 0 = λ_*(C(n)) < λ*(C(n)) = 1/n² (RAP, Theorem 3.4.4). Let S be the two-point space {0, 1} with its usual metric. Then fₙ := 1_{C(n)} → 0 in law and almost uniformly, but each "law" Qₙ∘fₙ⁻¹ is defined only on the trivial σ-algebra {∅, S}. The only larger σ-algebra on S is 2^S, and no Qₙ∘fₙ⁻¹ for n ≥ 1 is defined on 2^S.
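On the line, the Skorokhod construction recalled at the start of this section can be realized explicitly by quantile transforms on Ω = (0, 1) with Lebesgue measure: Xₙ := Fₙ⁻¹(U) has law Pₙ and converges pointwise when the quantile functions do. A sketch with exponential laws of rate 1 + 1/n (the particular laws are invented for illustration):

```python
import math

def X(n, u):
    """Quantile transform of Exp(rate), rate = 1 + 1/n for n >= 1 and
    rate = 1 for n = 0; as a function of u in (0, 1) with Lebesgue
    measure, X(n, .) has exactly that exponential law."""
    rate = 1.0 + 1.0 / n if n >= 1 else 1.0
    return -math.log(1.0 - u) / rate

us = [i / 1000.0 for i in range(1, 1000)]
err = [max(abs(X(n, u) - X(0, u)) for u in us) for n in (1, 10, 100)]
assert err[0] > err[1] > err[2] > 0     # X_n -> X_0 pointwise on (0, 1)
```

Theorem 3.5.1 below generalizes this: the measurable, perfect maps gₙ play the role of the single uniform variable U here.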
3.5.1 Theorem Let (S, d) be any metric space, (Xₙ, Aₙ, Qₙ) any probability spaces, and fₙ a function from Xₙ into S for each n = 0, 1, .... Suppose f₀ has separable range S₀ and is measurable (for the Borel σ-algebra on S₀). Then fₙ ⇒ f₀ if and only if there exist a probability space (Ω, S, Q) and perfect measurable functions gₙ from (Ω, S) to (Xₙ, Aₙ) for each n = 0, 1, ..., such that Q∘gₙ⁻¹ = Qₙ on Aₙ for each n and fₙ∘gₙ → f₀∘g₀ almost uniformly as n → ∞.

Notes. Proposition 3.4.4 and the "if and only if" in Theorem 3.5.1 show that the hypothesis that the gₙ be perfect can't simply be dropped from the theorem.
Proof "If" follows from Proposition 3.3.1 and Theorem 3.4.3. "Only if" will be proved much as in RAP, Theorem 11.7.2. Let Ω be the Cartesian product Π_{n≥0} Xₙ × Iₙ, where each Iₙ is a copy of [0, 1]. Here gₙ will be the natural projection of Ω onto Xₙ for each n. Let P := Q₀∘f₀⁻¹ on the Borel σ-algebra of S, concentrated in the separable subset S₀. A set B ⊂ S will be called a continuity set (RAP, Section 11.1) for P if P(∂B) = 0, where ∂B is the boundary of B. Then:

3.5.2 Lemma For any ε > 0 there are disjoint open continuity sets Uⱼ, j = 1, ..., J, for some J < ∞, where for each j, diam Uⱼ := sup{d(x, y) : x, y ∈ Uⱼ} < ε, and with Σ_{j=1}^J P(Uⱼ) > 1 − ε.
Proof Let {xⱼ}_{j=1}^∞ be dense in S₀. Let B(x, r) := {y ∈ S : d(x, y) < r} for 0 < r < ∞ and x ∈ S₀. Then B(xⱼ, r) is a continuity set of P for all but at most countably many values of r. Choose rⱼ with ε/3 < rⱼ < ε/2 such that B(xⱼ, rⱼ) is a continuity set of P for each j. The continuity sets form an algebra (RAP, Proposition 11.1.4). Let

Uⱼ := B(xⱼ, rⱼ) \ ∪_{i<j} {y : d(xᵢ, y) ≤ rᵢ}.

Then the Uⱼ are disjoint open continuity sets of diameter < ε with Σ_{j=1}^∞ P(Uⱼ) = 1, so there is a J < ∞ with Σ_{j=1}^J P(Uⱼ) > 1 − ε. □
Now to continue the proof of Theorem 3.5.1: for each k = 1, 2, ..., by the last lemma take disjoint open continuity sets U_{kj} := U(k, j) of P, for j = 1, ..., J_k := J(k) < ∞, with diam(U_{kj}) < 1/k, P(U_{kj}) > 0, and

(3.5.3)    Σ_{j=1}^{J(k)} P(U_{kj}) > 1 − 2⁻ᵏ.

For any open set U in S with complement F, let d(x, F) := inf{d(x, y) : y ∈ F}. For r = 1, 2, ..., let F_r := {x : d(x, F) ≥ 1/r}. Then F_r is closed and F_r ↑ U as r → ∞. There is a continuous h_r on S with 0 ≤ h_r ≤ 1, h_r = 1 on F_r, and h_r = 0 outside F_{2r}: let h_r(x) := min(1, max(0, 2r·d(x, F) − 1)).

For each j and k, let F(k, j) := S \ U_{kj}. Take r := r(k, j) large enough so that P(F(k, j)_r) > (1 − 2⁻ᵏ)P(U_{kj}), where F(k, j)_r := {x : d(x, F(k, j)) ≥ 1/r} ⊂ U_{kj}. Let h_{kj} be the h_r as defined above for such an r, and H_{kj} the h_{2r}. For n large enough, say for n ≥ n_k, we have

∫_* h_{kj}(fₙ) dQₙ > (1 − 2⁻ᵏ)P(U_{kj})    and    ∫* H_{kj}(fₙ) dQₙ < (1 + 2⁻ᵏ)P(U_{kj})

for all j = 1, ..., J_k. We may assume n₁ < n₂ < .... For every n = 0, 1, ..., let f_{kjn} := (h_{kj}∘fₙ)_* for Qₙ, so that by Theorem 3.2.1,

∫_* h_{kj}(fₙ) dQₙ = ∫ f_{kjn} dQₙ,    0 ≤ f_{kjn} ≤ h_{kj}(fₙ),

and f_{kjn} is Aₙ-measurable. For n ≥ 1 let B_{kjn} := {f_{kjn} > 0} ∈ Aₙ, and let B_{kj0} := f₀⁻¹(U_{kj}) ∈ A₀. For each k and n, the B_{kjn} ⊂ fₙ⁻¹(U_{kj}) are disjoint for j = 1, ..., J_k, and H_{kj}(fₙ) = 1 on B_{kjn}, so for n ≥ n_k,

(3.5.4)    (1 − 2⁻ᵏ)P(U_{kj}) < Qₙ(B_{kjn}) < (1 + 2⁻ᵏ)P(U_{kj}),
and Q₀(B_{kj0}) = P(U_{kj}). Let Tₙ := Xₙ × Iₙ, and let μₙ be the product law Qₙ × λ on Tₙ, where λ is Lebesgue measure on the Borel σ-algebra B of Iₙ. For each k ≥ 1, let

C_{kjn} := B_{kjn} × [0, F(k, j, n)] ⊂ Tₙ,
D_{kjn} := B_{kj0} × [0, G(k, j, n)] ⊂ T₀,

where F and G are defined so that

μₙ(C_{kjn}) = μ₀(D_{kjn}) = min(Qₙ(B_{kjn}), Q₀(B_{kj0})).

Then for each k, j, and n ≥ n_k, we have by (3.5.4), since 1/(1 + 2⁻ᵏ) > 1 − 2⁻ᵏ,

(3.5.5)    1 − 2⁻ᵏ < min(F, G)(k, j, n) ≤ max(F, G)(k, j, n) = 1.

Let

C_{k0n} := Tₙ \ ∪_{j=1}^{J(k)} C_{kjn},    D_{k0n} := T₀ \ ∪_{j=1}^{J(k)} D_{kjn}.
For k = 0 let J₀ := J(0) := 0, C₀₀ₙ := Tₙ, D₀₀ₙ := T₀, and n₀ := 0. For each n = 1, 2, ..., let k(n) be the unique k such that n_k ≤ n < n_{k+1}. Then for n ≥ 1, Tₙ is the disjoint union of the sets W_{nj} := C_{k(n)jn}, j = 0, 1, ..., J_{k(n)}. We also have:

(3.5.6)    if j ≥ 1 and (v, s) ∈ W_{nj}, then v ∈ B_{k(n)jn}, so fₙ(v) ∈ U_{k(n)j}.

Next, T₀ is the disjoint union of the sets E_{nj} := D_{k(n)jn}. Then μₙ(W_{nj}) = μ₀(E_{nj}) for each n and j, and if j ≥ 1 or k(n) = 0, then μ₀(E_{nj}) > 0. For x in T₀ and each n,

(3.5.7)    let j(n, x) be the j such that x ∈ E_{nj}.

Let L := {x ∈ T₀ : μ₀(E_{nj(n,x)}) > 0 for all n}. Then T₀ \ L ⊂ ∪ᵢ E_{n(i)0} for some (possibly empty or finite) sequence n(i) such that μ₀(E_{n(i)0}) = 0 for all i. Thus μ₀(L) = 1. For x ∈ L and any measurable set B ⊂ Tₙ, in other words B in the product σ-algebra Aₙ ⊗ B, let j := j(n, x), recall that μₙ(W_{nj}) = μ₀(E_{nj}), and let

(3.5.8)    P_{nj}(B) := μₙ(B ∩ W_{nj}) / μ₀(E_{nj}),    P_{nx} := P_{nj(n,x)}.

Then P_{nx} is a probability measure on Aₙ ⊗ B. Let p_x be the product measure Π_{n=1}^∞ P_{nx} on T := Π_{n=1}^∞ Tₙ (RAP, Theorem 8.2.2).
3.5.9 Lemma For any measurable set H ⊂ T (for the infinite product σ-algebra with Aₙ ⊗ B on each factor), x ↦ p_x(H) is measurable on (T₀, A₀ ⊗ B).

Proof Let 𝓗 be the collection of all H for which the assertion holds. Given n, P_{nx} is one of finitely many laws, each obtained for x in a measurable subset E_{nj}. Thus if Yₙ is the natural projection of T onto Tₙ and H = Yₙ⁻¹(B) for some B ∈ Aₙ ⊗ B, then H ∈ 𝓗. If H = ∩_{i∈M} Y_{m(i)}⁻¹(Bᵢ), where Bᵢ ∈ A_{m(i)} ⊗ B and M is finite, we may assume the m(i) are distinct; then p_x(H) = Π_{i∈M} p_x(Y_{m(i)}⁻¹(Bᵢ)), so H ∈ 𝓗. Then any finite disjoint union of such intersections is in 𝓗, and such unions form an algebra. If Hₙ ∈ 𝓗 and Hₙ ↑ H or Hₙ ↓ H, then H ∈ 𝓗. Since the smallest monotone class containing an algebra is a σ-algebra (RAP, Theorem 4.4.2), the lemma follows. □
Now returning to the proof of Theorem 3.5.1, Ω = T₀ × T. For any set C ⊂ Ω, measurable for the product σ-algebra, and x ∈ T₀, let C_x := {y ∈ T : (x, y) ∈ C}, and Q(C) := ∫ p_x(C_x) dμ₀(x). Here x ↦ p_x(C_x) is measurable if C is a finite union of products Aᵢ × Fᵢ, where Aᵢ ∈ A₀ ⊗ B and Fᵢ is product measurable in T; such a union equals a disjoint one (RAP, Proposition 3.2.2). Thus by monotone classes again, x ↦ p_x(C_x) is measurable on T₀ for any product-measurable set C ⊂ Ω, so Q is defined. It is then clearly a countably additive probability measure.
Let p be the natural projection of Tₙ onto Xₙ. Recall that P_{nx} = P_{nj} for all x ∈ E_{nj}, by (3.5.7) and (3.5.8). The marginal of Q on Xₙ, in other words Q∘gₙ⁻¹, is by (3.5.8) again

Σ_{j=0}^{J(k(n))} μ₀(E_{nj}) P_{nj}∘p⁻¹ = μₙ∘p⁻¹ = Qₙ.

Thus Q has marginal Qₙ on Xₙ for each n, as desired. By (3.5.3),

Σ_{k=1}^∞ Q₀(X₀ \ ∪_{j=1}^{J(k)} f₀⁻¹(U_{kj})) ≤ Σₖ 2⁻ᵏ < ∞.

So Q₀-almost every y ∈ X₀ belongs to ∪ⱼ f₀⁻¹(U_{kj}) for all large enough k. Also, if t ∈ I₀ and t < 1, then by (3.5.5), t < G(k, j, n) for all j ≥ 1 as soon
as 1 − 2⁻ᵏ ≥ t and n ≥ n_k. Thus for μ₀-almost all (y, t), there is an m such that (y, t) ∈ ∪_{j=1}^{J(k(n))} E_{nj} for all n ≥ m. If x := (y, t) ∈ E_{nj} for j ≥ 1, then y ∈ B_{k(n)j0}, so f₀(y) ∈ U_{k(n)j}. Also, by (3.5.8), P_{nx} = P_{nj} is concentrated in W_{nj}, and for (v, s) ∈ W_{nj}, fₙ(v) ∈ U_{k(n)j} by (3.5.6). Since diam(U_{kj}) < 1/k for each j ≥ 1,

Q*(d(fₙ(gₙ), f₀(g₀)) > 1/k(n) for some n ≥ m) ≤ μ₀({(y, t) : (y, t) ∈ E_{n0} for some n ≥ m}) → 0

as m → ∞, so fₙ(gₙ) → f₀(g₀) almost uniformly.

Lastly, let's show that the gₙ are perfect. Suppose Q(A) > 0 for some measurable A ⊂ Ω; then Q(A) = ∫ p_x(A_x) dμ₀(x). First let n ≥ 1. Then for some x, p_x(A_x) > 0; if μ₀(E_{n0}) = 0, we take x ∉ E_{n0}. Now T = Tₙ × T⁽ⁿ⁾, where T⁽ⁿ⁾ := Π_{m≠n} Tₘ. Then on T, p_x = P_{nx} × Q_{nx} for the law Q_{nx} := Π_{m≠n} P_{mx} on T⁽ⁿ⁾. Let A(x) := A_x. By the Tonelli-Fubini theorem, p_x(A_x) = ∫∫ 1_{A(x)}(u, v) dP_{nx}(u) dQ_{nx}(v). Thus for some v, ∫ 1_{A(x)}(u, v) dP_{nx}(u) > 0. Choose and fix such a v as well as x. Now P_{nx} = P_{nj} for j = j(n, x) with μ₀(E_{nj}) > 0. Let u = (s, t), s ∈ Xₙ, t ∈ Iₙ. Then, since P_{nj} is Qₙ × λ restricted to a set of positive measure and normalized,

0 < ∫∫ 1_{A(x)}(s, t, v) dQₙ(s) dt.

Choose and fix a t with 0 < ∫ 1_{A(x)}(s, t, v) dQₙ(s), and let C := {s ∈ Xₙ : (s, t, v) ∈ A_x}. Then Qₙ(C) > 0. Clearly C ⊂ gₙ[A], so gₙ is perfect for n ≥ 1 by Theorem 3.4.1.

To show g₀ is perfect, note that μ₀ = Q₀ × λ and that p_x(A_x) > 0 for x = (y, t) in a set X with μ₀(X) > 0. There is a t ∈ I₀ such that Q₀(C) > 0, where C := {y : (y, t) ∈ X}. Then C ⊂ g₀[A], so g₀ is perfect, finishing the proof of Theorem 3.5.1. □
3.6 Conditions equivalent to convergence in law

Conditions equivalent to convergence of laws on separable metric spaces are given in the portmanteau theorem (RAP, Theorem 11.1.1) and the metrization theorem (RAP, Theorem 11.3.3). Here, the conditions will be extended to general random elements for the theory being developed in this chapter. For any probability space (Ω, A, P) and real-valued function f on Ω let E*f := ∫* f dP and E_*f := ∫_* f dP. If (S, d) is a metric space and f is a real-valued function on S, recall (RAP, Section 11.2) that the Lipschitz seminorm of f is defined by

‖f‖_L := sup{|f(x) − f(y)| / d(x, y) : x ≠ y},

and f is called a Lipschitz function if ‖f‖_L < ∞. The bounded Lipschitz norm is defined by ‖f‖_BL := ‖f‖_L + ‖f‖_∞, where ‖f‖_∞ := sup_x |f(x)|. Then f is called a bounded Lipschitz function if ‖f‖_BL < ∞, and ‖·‖_BL is a norm on the space of all such functions.
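The cutoff functions g(x) = max(0, 1 − d(x, K)/ε), used repeatedly in the proofs below, are the standard examples: ‖g‖_∞ ≤ 1 and ‖g‖_L ≤ 1/ε, so ‖g‖_BL ≤ 1 + 1/ε. A grid check on the line with K = [0, 1] (the choices of K and ε are arbitrary):

```python
eps = 0.25

def dist_K(x):                 # d(x, K) for K = [0, 1] in the real line
    return max(0.0, -x, x - 1.0)

def g(x):                      # 1 on K, 0 outside the eps-neighborhood of K
    return max(0.0, 1.0 - dist_K(x) / eps)

grid = [i / 100.0 for i in range(-300, 401)]
sup_norm = max(abs(g(x)) for x in grid)
lip = max(abs(g(x) - g(y)) / (y - x)       # slopes over adjacent grid points
          for x, y in zip(grid, grid[1:]))
assert sup_norm <= 1.0
assert lip <= 1.0 / eps + 1e-9             # ||g||_L <= 1/eps
assert g(0.5) == 1.0 and g(1.0 + eps) == 0.0
```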
The extended portmanteau theorem about to be proved is an adaptation of RAP, Theorem 11.1.1, plus some further facts based on the last section (Theorem 3.5.1). The proof to be given includes relatively easy implications, some of which consist of putting in stars at appropriate places in the proofs in RAP.

3.6.1 Theorem Let (S, d) be any metric space. For n = 0, 1, 2, ..., let (Xₙ, Aₙ, Qₙ) be a probability space and fₙ a function from Xₙ into S. Suppose f₀ has separable range S₀ and is measurable. Let P := Q₀∘f₀⁻¹ on S. Then the following are equivalent:

(a) fₙ ⇒ f₀;
(a′) lim sup_{n→∞} E*G(fₙ) ≤ EG(f₀) for each bounded continuous real-valued G on S;
(b) E*G(fₙ) → EG(f₀) as n → ∞ for every bounded Lipschitz function G on S;
(b′) (a′) holds for all bounded Lipschitz G on S;
(c) sup{|E*G(fₙ) − EG(f₀)| : ‖G‖_BL ≤ 1} → 0 as n → ∞;
(d) for any closed F ⊂ S, P(F) ≥ lim sup_{n→∞} Qₙ*(fₙ ∈ F);
(e) for any open U ⊂ S, P(U) ≤ lim inf_{n→∞} (Qₙ)_*(fₙ ∈ U);
(f) for any continuity set A of P in S, Qₙ*(fₙ ∈ A) → P(A) and (Qₙ)_*(fₙ ∈ A) → P(A) as n → ∞;
(g) there exist a probability space (Ω, S, Q) and measurable functions gₙ from Ω into Xₙ and hₙ from Ω into S such that the gₙ are perfect, Q∘gₙ⁻¹ = Qₙ and Q∘hₙ⁻¹ = P for all n, and d(fₙ∘gₙ, hₙ) → 0 almost uniformly.

Moreover, (g) remains equivalent if any of the following changes are made in it: "almost uniformly" can be replaced by "in outer probability"; we can take hₙ = f₀∘γₙ for some measurable functions γₙ from Ω into X₀, which can be taken to be perfect; and we can take the γₙ to be all the same, γₙ = γ₁ for all n.
Proof Clearly (a) implies (a′). Conversely, applying (a′) to −G gives

lim inf_{n→∞} E*G(fₙ) ≥ lim inf_{n→∞} E_*G(fₙ) ≥ EG(f₀),

and (a) follows, so (a) and (a′) are equivalent. Clearly (a) implies (b), which is equivalent to (b′) just as (a) is to (a′).

To show (b) implies (c), let T be the completion of S; all the fₙ take values in S ⊂ T. Each bounded Lipschitz function G on S extends uniquely to such a function on T, and the functions G∘fₙ on Xₙ stay exactly the same, so in this step we can assume S and S₀ are complete. Let ε > 0. By Ulam's theorem (RAP, Theorem 7.1.4), take a compact K ⊂ S₀ with P(K) = Q₀(f₀ ∈ K) > 1 − ε. Recall that d(x, K) := inf{d(x, y) : y ∈ K} and K^ε := {x : d(x, K) < ε}. Let g(x) := max(0, 1 − d(x, K)/ε) for x ∈ S. Then g is a bounded Lipschitz function with ‖g‖_BL ≤ 1 + 1/ε < ∞, and clearly 1_K ≤ g ≤ 1_{K^ε}. Since, by (b) applied to ±g, E_*g(fₙ) → Eg(f₀) as n → ∞, it follows that for n large enough

(3.6.2)    (Qₙ)_*(fₙ ∈ K^ε) ≥ E_*g(fₙ) ≥ Eg(f₀) − ε > 1 − 2ε.

Let B be the set of all G on S with ‖G‖_BL ≤ 1. The functions in B are uniformly bounded and equicontinuous, so their restrictions to K are totally bounded for the supremum distance over K by the Arzelà-Ascoli theorem (RAP, Theorem 2.4.7). Let G₁, ..., G_J, for some finite J, be functions in B such that for each G ∈ B, sup_{x∈K} |(G − Gⱼ)(x)| < ε for some j = 1, ..., J. Next, for any G ∈ B, choose such a j. Then

(3.6.3)    |E*G(fₙ) − EG(f₀)| ≤ |E*G(fₙ) − E*Gⱼ(fₙ)| + |E*Gⱼ(fₙ) − EGⱼ(f₀)| + |E(Gⱼ(f₀) − G(f₀))|,

a sum of three terms. For the last term, splitting the integral into two parts according as f₀ ∈ K or not, the first part is bounded by ε and the second by 2ε, since |Gⱼ − G| ≤ 2 everywhere and Q₀(f₀ ∉ K) < ε. The middle term on the right side of (3.6.3) is bounded above by max_{j≤J} |E*Gⱼ(fₙ) − EGⱼ(f₀)|, which is less than ε for n large enough, by (b). The first term on the right in (3.6.3) is a sum of two parts, one over a measurable subset Aₙ of Xₙ on which fₙ ∈ K^ε, with Qₙ(Aₙ) > 1 − 2ε by (3.6.2), and the other over the complement of Aₙ. Since G, Gⱼ ∈ B and |G − Gⱼ| < ε on K, we have |G − Gⱼ| < 3ε on K^ε. It follows that |(G∘fₙ)* − (Gⱼ∘fₙ)*| ≤ 3ε
on Aₙ, so the first part is bounded above by 3ε. The second part is bounded by 2(1 − Qₙ(Aₙ)) < 4ε. So for n large enough the left side of (3.6.3) is less than 11ε, uniformly for G ∈ B, and (c) follows. Clearly (c) implies (b), so (b) and (c) are equivalent.

Next, (b) implies (d) by taking bounded Lipschitz functions gₖ decreasing down to 1_F as k → ∞, specifically the functions g in the proof that (b) implies (c), with F in place of K and ε = 1/k. (d) and (e) are equivalent by taking complements, and (d) and (e) together easily imply (f). For any bounded continuous function G, all but countably many of the sets {G < t} are continuity sets of P. It follows that G can be approximated uniformly within ε by a simple function Σᵢ tᵢ 1_{Aᵢ}, where the Aᵢ are continuity sets {tᵢ₋₁ ≤ G < tᵢ}. So (f) implies (a), and (a) through (f) are all equivalent.

Theorem 3.5.1 says that (a) implies (g), in the strongest form, where hₙ = h₁ = f₀∘γ₁ and γ₁ is perfect. The weakest form of (g), with convergence in outer probability and hₙ depending on n and not necessarily a composition of f₀ with any (perfect) function, will be shown to imply (b′). Given ε > 0, take n large enough so that

Q*{d(fₙ∘gₙ, hₙ) > ε} < ε.

So on a measurable set Aₙ with Q(Aₙ) > 1 − ε, we have d(fₙ∘gₙ, hₙ) ≤ ε, and thus for any G ∈ B (in other words, ‖G‖_BL ≤ 1),

|G(fₙ(gₙ)) − G(hₙ)| ≤ ε·1_{Aₙ} + 2·1_{Ω\Aₙ}.

It follows that

E*G(fₙ(gₙ)) ≤ EG(hₙ) + 3ε = EG(f₀) + 3ε,

since hₙ has the same distribution as f₀. Also,

E*G(fₙ(gₙ)) = E(((G∘fₙ)∘gₙ)*) = E((G∘fₙ)*∘gₙ)

by Theorem 3.2.1 and since gₙ is perfect, where all these expectations are with respect to Q. The latter integral, by the image measure theorem (RAP, Theorem 4.1.11), equals E*G(fₙ) (for Qₙ). Letting ε ↓ 0, (b′) follows, so all the forms of (g) are equivalent to any of (a) through (f). □

Theorem 1.1.1, in a special case, proved a form of convergence which in that case is now easily seen to imply the conditions in Theorem 3.6.1: the one in (c), for example. It will be shown that convergence in law is also equivalent to convergence in some analogues of the Prokhorov and dual-bounded-Lipschitz metrics, which
3.6 Conditions equivalent to convergence in law
metrize convergence of laws on separable metric spaces, as shown in RAP, Theorem 11.3.3. We will have analogues of metrics, rather than actual metrics, because of the asymmetry between the nonmeasurable random elements fn and the limiting measurable random variable f0.
Definitions. Let (Xm, Am, Qm) be probability spaces, m = 0, 1, and let (S, d) be a metric space. Let fm be functions from Xm into S, m = 0, 1, such that f0 is measurable and has separable range. Let P := Q0 ∘ f0^{−1}. Then let
    β(f1, f0) := sup{|E*G(f1) − EG(f0)| : ‖G‖_BL ≤ 1},
and for F ranging over nonempty closed subsets of S,
    ρ(f1, f0) := inf{ε > 0 : sup_F [P(F) − (Q1)_*(f1 ∈ F^ε)] ≤ ε}.

3.6.4 Theorem For any metric space (S, d), probability spaces (Xm, Am, Qm), and functions fm from Xm into S, where f0 has separable range and is measurable, the following are equivalent:

(i) fm ⇒ f0;
(ii) β(fm, f0) → 0 as m → ∞;
(iii) ρ(fm, f0) → 0 as m → ∞.
Proof (i) and (ii) are equivalent just as (a) and (c) are in Theorem 3.6.1. To show that (ii) implies (iii), we have:
3.6.5 Lemma For any f1 and f0 as in the statement of Theorem 3.6.4,

    ρ(f1, f0) ≤ min(1, (2β(f1, f0))^{1/2}).
Proof Let F be any closed set in S and P = Q0 ∘ f0^{−1}. Given 0 < ε < 1, let g(x) := max(0, 1 − d(x, F)/ε) as before. Then ‖g‖_BL ≤ 1 + 1/ε, and

    (Q1)_*(f1 ∈ F^ε) ≥ ∫_* g(f1) dQ1 ≥ ∫ g(f0) dQ0 − β(f1, f0)(1 + 1/ε) ≥ P(F) − ε

if β(f1, f0) ≤ ε²/2. Clearly ρ(f1, f0) ≤ 1 in all cases, so letting ε ↓ (2β(f1, f0))^{1/2} when the latter is < 1, the lemma follows.
Now to continue the proof of Theorem 3.6.4, applying the lemma to fm in place of f1 for each m shows that (ii) implies (iii). To show that (iii) implies (i), let U be an open set in S. For k = 1, 2, …, let Fk := {x : d(x, y) < 1/k
implies y ∈ U}. Then the Fk are closed sets and Fk ↑ U as k → ∞. For n large enough, ρ(fn, f0) ≤ 1/k, and then P(Fk) ≤ (Qn)_*(fn ∈ Fk^{1/k}) + 1/k ≤ (Qn)_*(fn ∈ U) + 1/k, so P(Fk) ≤ liminf_{n→∞} (Qn)_*(fn ∈ U) + 1/k. Then letting k → ∞ gives P(U) ≤ liminf_{n→∞} (Qn)_*(fn ∈ U). This is (e) of Theorem 3.6.1, so the proof of Theorem 3.6.4 is complete.

Next is a version of Theorem 3.6.4 where we have two indices.
3.6.6 Theorem Let (S, d) be a metric space and (Ω, S, Q) a probability space. Suppose that for each m, n = 1, 2, …, fmn is a function from Ω into S, and f0 is a measurable function from Ω into S with separable range. Then the following are equivalent:

(i) fmn ⇒ f0, that is, for every bounded continuous real function G on S,
    ∫* G(fmn) dQ → ∫ G(f0) dQ as m, n → ∞;
(ii) β(fmn, f0) → 0 as m, n → ∞;
(iii) ρ(fmn, f0) → 0 as m, n → ∞.

Proof For a double sequence amn of real numbers, let

    lim sup_{m,n→∞} amn := inf_k sup_{m≥k, n≥k} amn,
    lim inf_{m,n→∞} amn := sup_k inf_{m≥k, n≥k} amn.
Then the steps in the proof of Theorem 3.6.1 showing that (a) through (f) are equivalent extend directly to convergence as m → ∞ and n → ∞ if lim sup_{n→∞} is replaced by lim sup_{m,n→∞} (not by lim sup_{m→∞} lim sup_{n→∞} or the iterated lim sup in the reverse order, which may be different), and likewise for lim inf. Thus, the proof of Theorem 3.6.4 also extends.
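The warning that the joint lim sup may differ from either iterated lim sup can be seen in a small numerical experiment; the double sequence amn = mn/(m² + n²) below is an illustrative choice, not from the text (each row and column tends to 0, yet ann = 1/2 for every n):

```python
# Joint vs. iterated lim sup for the double sequence a_{mn} = mn/(m^2 + n^2).

def a(m, n):
    return m * n / (m * m + n * n)

K = 300  # truncation level standing in for "k large"

# joint lim sup: inf_k sup_{m >= k, n >= k} a_{mn}, approximated on a window;
# the diagonal terms a_{nn} = 1/2 keep it at 1/2
joint = max(a(m, n) for m in range(K, 2 * K) for n in range(K, 2 * K))

# iterated lim sup: first n -> oo at fixed m (each row limit is 0), then m -> oo
N = 10**7
iterated = max(a(m, N) for m in range(K, 2 * K))

print(joint, iterated)  # joint stays at 1/2; iterated is near 0
```

So replacing lim sup_{m,n→∞} by an iterated lim sup would genuinely change the statements being proved.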
We have the following:
3.6.7 Continuous mapping theorem Under the conditions of Theorem 3.6.4, if (T, e) is another metric space, G is a continuous function from S into T, and fm ⇒ f0, then G(fm) ⇒ G(f0).

Proof This follows directly from the definition of convergence in law ⇒.
We also have
3.6.8 Theorem Suppose fm, m ≥ 0, are measurable random variables taking values in a separable metric space S, so that the laws L(fm) exist on the Borel σ-algebra of S. Then convergence fm ⇒ f0 is equivalent to convergence of the laws L(fm) → L(f0) in the usual sense.

Proof For any bounded continuous real-valued function G on S, the functions G(fm) are measurable, so upper integrals reduce to integrals, and the result follows from the definitions.
3.7 Asymptotic equicontinuity and Donsker classes

Recall from Section 3.1 the definitions of the empirical measures Pn, the empirical process νn, the pseudometric ρP, and the Gaussian process GP. Recall that a set F ⊂ L²(P) is called pregaussian if a GP process restricted to F exists whose sample functions f ↦ GP(f)(ω) are (almost) all bounded and uniformly continuous for ρP on F. Such a GP process is called coherent if, in addition, its sample functions are prelinear on F as in Lemma 2.5.3.
3.7.1 Proposition For any probability measure P and pregaussian set F ⊂ L²(P), the symmetric convex hull sco(F) of F is also pregaussian, and there exists a GP process defined on the linear span of F and the constant functions which is 0 on the constant functions and coherent on sco(F).

Proof A GP process is isonormal on the space L²(P) for the semi-inner product (f, g)_{0,P} := P(fg) − (Pf)(Pg), and GP is 0 on constant functions. So Theorem 2.5.5 on isonormal processes applies and gives the result.

For a signed measure ν and a measurable function f such that ∫ f dν is defined, let ν(f) := ∫ f dν.
A class F of functions will be said to satisfy the asymptotic equicontinuity condition for P and a pseudometric τ on F, or F ∈ AEC(P, τ) for short, if for every ε > 0 there are a δ > 0 and an n0 large enough such that for n ≥ n0,

    Pr*{sup{|νn(f − g)| : f, g ∈ F, τ(f, g) < δ} > ε} < ε.

Then F ∈ AEC(P) will mean F ∈ AEC(P, ρP).
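The probability in this condition can be estimated by simulation for a concrete class. The sketch below (all choices — the class, grid, sample sizes, and tolerances — are illustrative, not from the text) uses F = {1_{[0,t]} : t on a finite grid} under P = U[0, 1], where ρP(1_{[0,s]}, 1_{[0,t]})² = |t − s| − (t − s)²:

```python
import bisect, math, random

def aec_probability(n, delta, eps, trials=100):
    """Monte Carlo estimate of Pr*{sup{|nu_n(f - g)| : rho_P(f,g) < delta} > eps}
    for F = {1_[0,t] : t on a grid} under P = U[0,1] (illustrative sketch)."""
    grid = [k / 100 for k in range(101)]
    exceed = 0
    for _ in range(trials):
        x = sorted(random.random() for _ in range(n))
        # nu_n(1_[0,t]) = sqrt(n) * (P_n[0,t] - t)
        nu = [math.sqrt(n) * (bisect.bisect_right(x, t) / n - t) for t in grid]
        sup = max((abs(nu[i] - nu[j])
                   for i, s in enumerate(grid) for j, t in enumerate(grid)
                   if 0 < abs(t - s) - (t - s) ** 2 < delta ** 2),
                  default=0.0)
        exceed += sup > eps
    return exceed / trials

random.seed(0)
print(aec_probability(1000, 0.2, 1.0))
```

For this class the estimated probability shrinks as δ decreases at fixed ε, consistent with the condition holding (by Donsker's theorem, Section 1.1).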
3.7.2 Theorem Let F ⊂ L²(X, A, P). Then the following are equivalent:

(I) F is a Donsker class for P, in other words F is P-pregaussian and νn ⇒ GP in ℓ∞(F);
(II) (a) F is totally bounded for ρP, and (b) F satisfies the asymptotic equicontinuity condition for P, F ∈ AEC(P);
(III) there is a pseudometric τ on F such that F is totally bounded for τ and F ∈ AEC(P, τ).
Proof If F is a Donsker class and ε > 0, then since F is pregaussian, it is totally bounded for ρP by Theorem 2.3.5, so (a) holds. Take 0 < δ < ε/3 such that for any coherent GP process,

    Pr{sup{|GP(f) − GP(g)| : ρP(f, g) < δ} > ε/3} < ε/2.

By almost surely convergent realizations (Theorem 3.5.1), for n ≥ n0 large enough we can assume Pr*{‖νn − GP‖_F > ε/3} < ε/2. If ‖νn − GP‖_F ≤ ε/3 and |GP(f) − GP(g)| ≤ ε/3, then |νn(f) − νn(g)| ≤ ε. So the asymptotic equicontinuity condition holds with the δ and n0 chosen, and (b) holds, so (I) implies (II).
(II) implies (III) directly, with τ = ρP.
To show (III) implies (I), suppose F is τ-totally bounded and F ∈ AEC(P, τ). Let UC := UC(F) denote the set of all real-valued functions on F uniformly continuous for τ. Then UC is a separable subspace of ℓ∞(F) for ‖·‖_F, since F is totally bounded and UC equals, in effect, the space of continuous functions on the compact completion (RAP, Corollary 11.2.5). For any finite subset G of F, by the finite-dimensional central limit theorem (RAP, Theorem 9.5.6) we can let n → ∞ in the asymptotic equicontinuity condition and so replace νn(f − g) by GP(f) − GP(g) for f, g ∈ G. Given ε > 0, take δ := δ(ε) > 0 and n0 := n0(ε) < ∞ from the definition of the asymptotic equicontinuity condition. Then δ and n0 don't depend on G ⊂ F. So we can let G increase up to a countable dense set H ⊂ F. GP has sample functions almost surely uniformly continuous on H.
If f0 ∈ F and fk ∈ H are such that τ(fk, f0) → 0, then by applying the finite-dimensional central limit theorem to the finite set G = {f0, f1, …, fk} for each k, applying the asymptotic equicontinuity condition, and letting k → ∞, we see that GP is almost surely uniformly continuous for τ on the set {fj}_{j≥0}. So GP(fk) → GP(f0) almost surely. Thus for each f ∈ F, the almost sure limit of GP(hk) as hk → f through H, which exists by uniform continuity, equals GP(f) a.s., so the almost sure limit defines a GP process. This GP has uniformly continuous sample functions for τ on F. So by Theorem 2.8.6, since
GP is isonormal for (·, ·)_{0,P}, it follows that F is pregaussian. So GP has a law µ3 defined on the Borel sets of the separable Banach space UC.

Given ε > 0, take δ > 0 from the asymptotic equicontinuity condition and a finite set G ⊂ F such that for each f ∈ F there is a g ∈ G such that τ(f, g) < δ. Then R^G is the set of all real-valued functions on G. Let µ2 be the law of GP on R^G and let µ23 be the law on R^G × UC where GP on G in R^G is just the restriction of GP on UC. So µ23 has marginals µ2 and µ3. Let µ1,n be the law of νn on G, so µ1,n is also defined on R^G. Then by the finite-dimensional central limit theorem again, the laws µ1,n converge to µ2 on R^G. So for the Prokhorov metric ρ, since it metrizes convergence of laws (RAP, Theorem 11.3.3), ρ(µ1,n, µ2) < ε for n large enough. Take n ≥ n0 also, then fix n. By Strassen's theorem (RAP, Corollary 11.6.4), there is a law µ12 on R^G × R^G with marginals µ1,n and µ2 such that µ12{(x, y) : |x − y| > ε} ≤ ε. By the Vorob'ev–Berkes–Philipp theorem (1.1.10), there is a Borel measure µ123 on R^G × R^G × UC having marginals µ12 and µ23 on the appropriate spaces.

The next step is to link up νn with its restriction to the finite set G. The Vorob'ev–Berkes–Philipp theorem may not apply here since νn on F may not be in a Polish space, at least not one that seems apparent. (About nonmeasurability on nonseparable spaces see the remarks at the end of Section 1.1.) Here we can use instead:
3.7.3 Lemma Let S and T be Polish spaces and (Ω, A, P) a probability space. Let Q be a law on S × T with marginal q on S. Let V be a random variable on Ω with values in S and law L(V) = q. Suppose there is a real random variable U on Ω independent of V with continuous distribution function F_U. Then there is a random variable W : Ω → T such that the joint law L(V, W) of (V, W) is Q.

Proof Two metric spaces X, Y are called Borel-isomorphic if there is a one-to-one Borel measurable map of X onto Y with Borel measurable inverse. Every Polish space is Borel-isomorphic to some compact subset of [0, 1]: either the whole interval, a finite set, or a convergent sequence and its limit (RAP, Theorem 13.1.1). Since the lemma involves only measurability and not topological properties of the Polish spaces, we can assume S = T = [0, 1]. Here we need the fact that for any real-valued random variable X with continuous distribution function F, F(X) has a uniform distribution on [0, 1], which can be seen as follows. For −∞ < t < ∞, the probability that X ≤ t equals F(t). Now X ≤ t implies F(X) ≤ F(t). On the other hand, the probability that F(X) ≤ F(t) is the supremum of the probabilities that X ≤ y for y such that F(y) ≤ F(t), and this supremum is at most F(t). So the probability that X ≤ t equals the probability
that F(X) ≤ F(t). Since F is continuous, for any s with 0 < s < 1 there is a t ∈ R with F(t) = s, so the probability that F(X) ≤ s equals F(t) = s, and F(X) is uniformly distributed on [0, 1] as claimed. So, taking F_U(U), we can assume U is uniformly distributed on [0, 1].

By way of regular conditional probabilities (RAP, Section 10.2; Bauer, 1981) we can write Q = ∫ Q_x dq(x), where for each x, Q_x is a probability measure on T, so that for any measurable set A in (the square) S × T, Q(A) = ∫∫ 1_A(x, y) dQ_x(y) dq(x) (RAP, Theorems 10.2.1 and 10.2.2). Let F_x be the distribution function of Q_x and

    F_x^{−1}(t) := inf{u : F_x(u) ≥ t},    0 < t < 1.

Then for any real z and 0 < t < 1, F_x^{−1}(t) ≤ z if and only if F_x(z) ≥ t. Now x ↦ F_x(z) is measurable for any fixed z. It follows that x ↦ F_x^{−1}(t) is measurable for each t, 0 < t < 1. For each x, F_x^{−1} is left-continuous and nondecreasing in t. It follows that (x, t) ↦ F_x^{−1}(t) is jointly measurable. Thus for W(ω) := F_{V(ω)}^{−1}(U(ω)), ω ↦ W(ω) is measurable. For each x we have the image measure λ ∘ (F_x^{−1})^{−1} = Q_x (RAP, Proposition 9.1.2). So for any bounded Borel function g,
    ∫ g dQ = ∫₀¹ ∫₀¹ g(x, y) dQ_x(y) dq(x)        (by Theorem 10.2.1 of RAP)
           = ∫₀¹ ∫₀¹ g(x, F_x^{−1}(y)) dy dq(x)    (by the image measure theorem)
           = ∫ g(x, F_x^{−1}(y)) d(q × λ)(x, y)    (by the Tonelli–Fubini theorem)
           = E(g(V, F_V^{−1}(U))) = Eg(V, W),

since U is independent of V and L(V) = q, and by the image measure theorem again. So L(V, W) = Q, proving Lemma 3.7.3.

Now let (Ω, S, Q) be a probability space on which all the empirical processes νn and an independent U are defined, specifically a countable product of copies of the probability space (X, A, P) times one copy of [0, 1] with Lebesgue measure λ. Then Lemma 3.7.3 applies to Q with S = R^G and T = R^G × UC, where V is νn restricted to G, and Q = µ123 on S × T. On Ω we then have processes νn and GP defined on F, which by construction and the asymptotic equicontinuity condition are within 3ε of each other uniformly on F except with probability at most 3ε (as in the proof of Donsker's theorem in Section 1.1).
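The coupling W(ω) := F_{V(ω)}^{−1}(U(ω)) in the proof of Lemma 3.7.3 can be sketched numerically. In this hedged example the target law is an illustrative choice, not from the text: Q is the joint law of (V, V + Z) with V uniform on [0, 1] and Z standard normal, so Q_x = N(x, 1) and F_x^{−1}(t) = x + Φ^{−1}(t):

```python
import math, random

def phi_inv(t):
    """Crude bisection inverse of the standard normal distribution function."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(40):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if Phi(mid) < t else (lo, mid)
    return (lo + hi) / 2.0

random.seed(0)
diffs = []
for _ in range(10000):
    v = random.random()          # V with law q = U[0,1]
    u = random.random()          # U uniform on [0,1], independent of V
    w = v + phi_inv(u)           # W = F_V^{-1}(U): conditional quantile transform
    diffs.append(w - v)

# If L(V, W) = Q, then W - V should be approximately N(0, 1).
mean = sum(diffs) / len(diffs)
var = sum(d * d for d in diffs) / len(diffs) - mean ** 2
print(round(mean, 3), round(var, 3))
```

The same recipe works for any target Q whose conditional distribution functions F_x can be inverted, which is exactly what the joint measurability argument in the proof guarantees.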
Let ε ↓ 0 through the sequence ε = 1/k, k = 1, 2, …. Let the approximation just shown hold for n ≥ nk on a probability space (Ωk, Sk, Qk). We can assume nk is nondecreasing in k. Let Akn be the νn process defined on Ωk and let Gkn be the corresponding GP process on Ωk. Let n0 := 1 and let (Ω0, S0, Q0) be a probability space on which νn processes A0n and GP processes G0n are defined, where the G0n are independent of the A0n processes. Let An := Akn and Gn := Gkn if and only if nk ≤ n < nk+1, for k = 0, 1, …. Then for all n, An is a νn process and Gn is a GP process. Also, ‖An − Gn‖_F → 0 in outer probability, so by Theorem 3.6.1, νn ⇒ GP on F, and (III) implies (I), proving Theorem 3.7.2.
3.8 Unions of Donsker classes

It will be shown in this section that the union of any two Donsker classes F and G is a Donsker class. This is not surprising: one might think it was enough, given the asymptotic equicontinuity conditions for the separate classes, for a given ε > 0, to take the larger of the two n0's and the smaller of the two δ's. But it is not so easy as that. For example, F and G could both be finite sets, with distinct elements of F at distance, say, more than 0.2 apart for ρP, and likewise for G, but there may be some element of F very close to an element of G. So the equicontinuity condition on the union won't just follow from the conditions on the separate families.

Given a probability measure P, F ⊂ L²(P), ε > 0, δ > 0, and a positive integer n0, say that AE(F, n0, ε, δ) holds if for all n ≥ n0,

    Pr*{sup{|νn(f − g)| : f, g ∈ F, ρP(f, g) < δ} > ε} < ε.

Then the asymptotic equicontinuity condition, as in the previous section, holds for F and P if and only if for every ε > 0 there are a δ > 0 and an n0 such that AE(F, n0, ε, δ) holds. The asymptotic equicontinuity condition, together with total boundedness of F for ρP, is equivalent to the Donsker property of F for P (Theorem 3.7.2).
3.8.1 Theorem (K. Alexander) Let (Ω, A, P) be a probability space and let F1 and F2 be two Donsker classes for P. Then F := F1 ∪ F2 is also a Donsker class for P.

Proof (M. Arcones) Given ε > 0, take δi > 0 and ni < ∞ so that AE(Fi, ni, ε/3, δi) holds, i = 1, 2. F1 and F2, being Donsker classes, are pregaussian, and GP is an isonormal process for (·, ·)_{0,P}, so F is pregaussian by Corollary
2.5.9. So there is an α > 0 such that for a suitable version of GP,

    Pr{sup{|GP(f − g)| : ρP(f, g) ≤ α} > ε/3} < ε/3.

Let δ := min(δ1, δ2, α/3). Take finite sets Hi ⊂ Fi, i = 1, 2, such that for each i and f ∈ Fi there is an h := πi f ∈ Hi with ρP(f, h) < δ. Since H := H1 ∪ H2 is finite, by the finite-dimensional central limit theorem (RAP, Theorem 9.5.6), νn restricted to H converges in law to GP restricted to H. Let F(H, α, ε/3) be the set of all y ∈ R^H such that |y(f) − y(g)| ≥ ε/3 for some f, g ∈ H with ρP(f, g) ≤ α. Then F(H, α, ε/3) is closed and has probability less than ε/3 for the law of GP on R^H. Thus by the portmanteau theorem (RAP, Theorem 11.1.1), there is an m such that

(3.8.2)    AE(H, m, ε/3, α) holds.
Let n0 := max(n1, n2, m). It will be shown that AE(F, n0, ε, δ) holds. By the asymptotic equicontinuity conditions in each Fi, and since n0 ≥ max(n1, n2) and δ ≤ min(δ1, δ2), there is a set of probability less than ε/3 for each i = 1, 2 such that outside these sets, |νn(f) − νn(g)| < ε/3 for any f, g in the same Fi with ρP(f, g) < δ. For pairs f, g with ρP(f, g) < δ, with f ∈ F1 and g ∈ F2, we have ρP(π1 f, π2 g) ≤ α since ρP(f, π1 f) < δ, ρP(g, π2 g) < δ, and 3δ ≤ α. Thus by (3.8.2), |νn(π1 f) − νn(π2 g)| < ε/3 for all such f, g, outside of another set of probability at most ε/3. Thus

    |νn(f) − νn(g)| ≤ |νn(f) − νn(π1 f)| + |νn(π1 f) − νn(π2 g)| + |νn(π2 g) − νn(g)|
                   < ε/3 + ε/3 + ε/3 = ε

for all f ∈ F1 and g ∈ F2 except on a set of probability at most ε/3 + ε/3 + ε/3 = ε.
3.9 Sequences of sets and functions

This section will show how the asymptotic equicontinuity condition in Theorem 3.7.2 can be applied to prove that some sequences of sets and functions are Donsker classes. In Chapters 6 and 7, other sufficient conditions for the Donsker property will be given that will apply to uncountable families of sets and functions.

3.9.1 Theorem Let (X, A, P) be a probability space and {Cm}_{m≥1} a sequence of measurable sets. If

(3.9.2)    Σ_{m=1}^∞ (P(Cm)(1 − P(Cm)))^r < ∞    for some r < ∞,
then the sequence {Cm}_{m≥1} is a Donsker class for P. Conversely, if the sets Cm are independent for P, then the sequence is a Donsker class only if (3.9.2) holds.

Proof Suppose (3.9.2) holds. Then the positive integers can be decomposed into two subsequences, over one of which P(Cm) → 0 and over the other P(Cm) → 1. It's enough to prove the Donsker property separately for each subsequence by Theorem 3.8.1. For any measurable set A with complement A^c, νn(A^c) ≡ −νn(A) and GP(A^c) ≡ −GP(A). The transformation of these processes into their negatives preserves convergence in law if it holds. So we can assume P(Cm) ↓ 0 as m → ∞, and then Σm P(Cm)^r < ∞. Also, {Cm}_{m≥1} is totally bounded for ρP. By Theorem 3.8.1 we can assume pm := P(Cm) ≤ 1/2 for all m. For any i and m such that P(Ci Δ Cm) = 0, we will have almost surely for any n that Pn(Ci) = Pn(Cm). So we can assume that P(Ci Δ Cm) > 0 for all i ≠ m. For any m such that P(Cm) = 0 we will have Pn(Cm) = 0 almost surely for any n, and then νn(Cm) = 0, so we can assume P(Cm) > 0 for all m. Let 0 < ε < 1. Suppose we can find M and N such that for all n ≥ N,

(3.9.3)    Pr{sup_{m≥M} |νn(Cm)| > ε} < ε.
Then for J large enough, pm ≤ pM/2 for m > J. Let γ := min{P(Ci Δ Cj) : 1 ≤ i < j ≤ J} and α := min(γ, pM)/2. Then

    sup{|νn(Ci) − νn(Cj)| : P(Ci Δ Cj) < α} ≤ sup{|νn(Ci) − νn(Cj)| : i, j > M}
        ≤ 2 sup{|νn(Cj)| : j > M} ≤ 2ε

with probability at least 1 − ε, proving the asymptotic equicontinuity condition. So it will be enough to prove (3.9.3). For that, recalling the binomial probabilities defined in Section 1.3, it will suffice to find M and N such that for
n ≥ N,

(3.9.4)    Σ_{m=M}^∞ E(npm + εn^{1/2}, n, pm) < ε/2

and

(3.9.5)    Σ_{m=M}^∞ B(npm − εn^{1/2}, n, pm) < ε/2.
Let qm := 1 − pm. For (3.9.5), by the Chernoff–Okamoto inequality (1.3.10),

    B(npm − εn^{1/2}, n, pm) ≤ exp(−ε²/(2pmqm)).

For some K, 1 ≤ K < ∞, pm ≤ K m^{−1/r} for all m. Choose M large enough so that

    Σ_{m=M}^∞ exp(−m^{1/r} ε²/(2K)) < ε/2,

and so (3.9.5) holds for all n. The other side, (3.9.4), is harder. Bernstein's inequality (1.3.2) gives, if 2pn^{1/2} ≥ ε and 0 < p ≤ 1/2, with q := 1 − p, that

    E(np + εn^{1/2}, n, p) ≤ exp(−ε²/(2pq + εn^{−1/2})) ≤ exp(−ε²/(6pq)). Then
(3.9.6)    Σ_{m=M}^∞ {E(npm + εn^{1/2}, n, pm) : 2pm n^{1/2} ≥ ε}
               ≤ Σ_{m=M}^∞ {exp(−ε²/(6pm)) : 2pm n^{1/2} ≥ ε}
               ≤ Σ_{m=M}^∞ exp(−ε² m^{1/r}/(6K)) < ε/4

for all n if M is large enough.
for all n if M is large enough. It remains to treat the sum, which will be called S2, of (3.9.4) restricted to values of m with 2pmn1/2 < E.
(3.9.7)
For p:= pm, inequality (1.3.11) implies
E(np + en 1/2, n, p) < (npl (np +
En1/2))np+En1/2eenl/2
Lety:= y(n,m,e) :=n1/2pm/e,so y> 0. Let f(x) - (1+x)log(1+x Then
- (np + en1/2) log (1 + e/(pn1/2)) = En1/2 (1 - f(y)). Forx > 0, f(x) < 0. By (3.9.7), y < 1/2, so f(y) > f(1/2) > 3/2. Thus En1/2
1 < 2f(y)/3, En1/2 (1 - f(y)) < -En 1/2 f (y) /3, and 00
{exp(-(En1/2+npm) [log (1+E/(n1/2pm)}]/3): 2pmn112<E1
S2 m=1 00
JJ ((
M=1
{(pmnl/2/e)En1/2l3:
2pmn1/2 < E}
since 1/y ≤ (1 + 1/y)^{1+y}, so (1 + 1/y)^{−(1+y)} ≤ y. Now since pm ≤ K m^{−1/r} we have S2 ≤ S3 + S4, where
    S3 := Σ_{m=1}^∞ {(K m^{−1/r} n^{1/2}/ε)^{εn^{1/2}/3} : 2K m^{−1/r} n^{1/2} < ε}
       ≤ (K n^{1/2}/ε)^{εn^{1/2}/3} Σ_{m>G} m^{−εn^{1/2}/(3r)},

where G := (2K n^{1/2}/ε)^r ≥ 2. Then for any n1 ≥ (3r/ε)² and n ≥ n1,
    S3 ≤ (K n^{1/2}/ε)^{εn^{1/2}/3} ∫_{G−1}^∞ x^{−εn^{1/2}/(3r)} dx
       = (K n^{1/2}/ε)^{εn^{1/2}/3} (εn^{1/2}(3r)^{−1} − 1)^{−1} (G − 1)^{1 − εn^{1/2}/(3r)}.

For K, ε and r fixed, as n → ∞ the logarithm of the latter expression is asymptotic to −εn^{1/2}(log 2)/3 → −∞, so S3 → 0. Thus S3 < ε/8 for n ≥ n2 for some n2. Finally, define S4 by
    S4 := Σ_{m=1}^∞ {(pm n^{1/2}/ε)^{εn^{1/2}/3} : ε ≤ 2K m^{−1/r} n^{1/2}}
       ≤ (2K n^{1/2}/ε)^r 2^{−εn^{1/2}/3} → 0
as n → ∞, so S4 < ε/8 for n ≥ n3 for some n3. Thus for n ≥ max(n1, n2, n3), S2 < ε/4. This and (3.9.6) give (3.9.4). So (3.9.2) implies that {Cm}_{m≥1} is a Donsker class.

For the converse, if the measurable sets {Cm}_{m≥1} are independent for P and form a Donsker class for P, it will be shown that (3.9.2) holds. Note that for each n, the Pn(Cm) are independent random variables for m = 1, 2, …. First suppose lim sup_{m→∞} pm =: p < 1 and, for all n, Σ_{m=1}^∞ pm^n = +∞. Then for each n, Pr{Pn(Cm) = 1 for infinitely many m} = 1 by the Borel–Cantelli lemma. Now Pn(Cm) = 1 implies νn(Cm) = n^{1/2}(1 − pm) ≥ n^{1/2}(1 − p)/2 for m large. From Theorem 3.7.2, {1_{Cm} : m ≥ 1} must be totally bounded for ρP and satisfy the asymptotic equicontinuity condition. It follows that pm → 0 as m → ∞, and by the asymptotic equicontinuity condition, we must have Σ_{m=1}^∞ pm^n < ∞ for some n. A symmetrical argument applies if lim inf_{m→∞} pm > 0. We can write {Cm}_{m≥1} as a union of two subsequences, one for which P(Cm) ≤ 1/2 and another for which P(Cm) > 1/2, both Donsker classes. Thus by the argument just given, for some n < ∞, the first subsequence satisfies Σm P(Cm)^n < ∞ and the second Σm (1 − P(Cm))^n < ∞. So for the original sequence, (3.9.2) holds.
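The two binomial tail bounds driving the proof can be checked against exact binomial tails. In this sketch the parameter values are illustrative; B(k, n, p) := Pr(Sn ≤ k) and E(k, n, p) := Pr(Sn ≥ k) for Sn binomial(n, p), as in Section 1.3:

```python
import math

def pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def B(x, n, p):   # lower tail Pr(S_n <= x)
    return sum(pmf(k, n, p) for k in range(0, math.floor(x) + 1))

def E(x, n, p):   # upper tail Pr(S_n >= x)
    return sum(pmf(k, n, p) for k in range(math.ceil(x), n + 1))

n, p, eps = 400, 0.3, 0.5
q = 1.0 - p
assert 2 * p * math.sqrt(n) >= eps        # condition used in the Bernstein step
lower = B(n * p - eps * math.sqrt(n), n, p)
upper = E(n * p + eps * math.sqrt(n), n, p)
chernoff_okamoto = math.exp(-eps ** 2 / (2 * p * q))   # as in (1.3.10)
bernstein = math.exp(-eps ** 2 / (6 * p * q))          # weakened form of (1.3.2)
print(lower <= chernoff_okamoto, upper <= bernstein)   # True True
```

The bounds are loose at this scale, which is all the proof needs: only their exponential decay in m (through pm ≤ K m^{−1/r}) makes the tail sums (3.9.4) and (3.9.5) small.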
Next, let's consider sequences of functions. For a probability space (X, A, P) and f ∈ L²(X, A, P) let σ_P²(f) := ∫ f² dP − (∫ f dP)² (the variance of f). Here is a sufficient condition for the Donsker property of a sequence {fm} which is easy to prove, yet turns out to be optimal of its kind:

3.9.8 Theorem If {fm}_{m≥1} ⊂ L²(P) and Σ_{m=1}^∞ σ_P²(fm) < ∞, then {fm}_{m≥1} is a Donsker class for P.
Proof Since νn and GP are the same on fm − Efm as on fm a.s. for all m, we can assume Efm = 0 for all m. Then fm → 0 in L², so the sequence {fm} is totally bounded for ρP. For any 0 < ε < 1, n ≥ 1 and m ≥ 1, by Chebyshev's inequality,

    Pr{|νn(fj)| > ε/2 for some j > m} ≤ Σ_{j>m} 4σ_P²(fj)/ε² < ε

for m ≥ m0 for some m0 < ∞. We have a.s. for all n that νn(fj) = 0 for all j such that σ_P(fj) = 0, and νn(fj) = νn(fk) for all j and k with σ_P(fj − fk) = 0. Let α be the infimum of σ_P(fj − fk) over all j and k such that σ_P(fj − fk) > 0 and j ≤ m0. Then α > 0, since in some ρP-neighborhood of each fj there are only finitely many fk. Let δ := min(α, 1). Then the probability that |νn(fj) − νn(fk)| > ε for some j, k with σ_P(fj − fk) < δ is less than ε, implying the asymptotic equicontinuity condition and so finishing the proof by Theorem 3.7.2.

The following shows that Theorem 3.9.8, although it does not imply the first half of Theorem 3.9.1, is sharp in one sense:
3.9.9 Proposition Let X := [0, 1] and P := U[0, 1] := Lebesgue measure on [0, 1]. Let am > 0 satisfy Σ_{m=1}^∞ am = +∞. Then there is a sequence {fm} ⊂ L²(X, A, P) with σ_P²(fm) ≤ am for all m such that {fm} is not a Donsker class.

Proof We can assume am ↓ 0. There exist cm ↓ 0 such that Σm am cm = +∞ (see problem 12). In X let Cm be independent sets with P(Cm) = am cm for each m (see problem 13). Let fm := cm^{−1/2} 1_{Cm}. Then σ_P²(fm) ≤ ∫ fm² dP = am. For each n, almost surely Pn(Cm) ≥ 1/n for infinitely many m. Then νn(Cm) ≥ n^{−1/2}/2 for infinitely many m, and sup_m νn(fm) = +∞ a.s., so the asymptotic equicontinuity condition fails and {fm} is not a Donsker class for P.
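The Chebyshev step in the proof of Theorem 3.9.8 can be illustrated by simulation. All choices below are illustrative, not from the text: fj := 1_{[0, 2^{−j}]} under P = U[0, 1], so the variances σ_P²(fj) = 2^{−j}(1 − 2^{−j}) are summable, and since Var(νn(fj)) = σ_P²(fj) for every n, the union bound Σj 4σ_P²(fj)/ε² dominates the exceedance probability:

```python
import math, random

random.seed(1)
n, trials, eps = 200, 1000, 1.0
levels = list(range(3, 10))   # j = 3, ..., 9
# Chebyshev union bound: sum_j 4 sigma_P^2(f_j) / eps^2
bound = sum(4 * 2 ** -j * (1 - 2 ** -j) / eps ** 2 for j in levels)

exceed = 0
for _ in range(trials):
    x = [random.random() for _ in range(n)]
    # sup_j |nu_n(f_j)| over the chosen levels
    sup = max(abs(math.sqrt(n) * (sum(xi <= 2.0 ** -j for xi in x) / n - 2.0 ** -j))
              for j in levels)
    exceed += sup > eps / 2
rate = exceed / trials
print(rate, "<=", round(bound, 3))
```

In the counterexample of Proposition 3.9.9 the variances are not summable, and the corresponding supremum blows up instead of being controlled by such a bound.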
Problems

1. Let (Ω, A, P) be a probability space, let (S, d) be a (possibly nonseparable) metric space, and let xn, n = 0, 1, 2, …, be points of S. Let fn(ω) := xn for all ω. Show that fn ⇒ f0 if and only if xn → x0.

2. In a general (possibly nonseparable) metric space show that if X0 ≡ p ∈ S is a constant random variable, then random elements Xn ⇒ X0 if and only if Xn → X0 in outer probability.
3. Let (T, d) be any metric space and S ⊂ T a separable subspace. Let Xn, n ≥ 0, be S-valued measurable random variables on some probability space such that Xn → X0 in law as n → ∞ in S. Show that Xn ⇒ X0 as T-valued random elements.

4. Show that C := {(−∞, t] : t ∈ R} is a Donsker class for any law P on R. Hint: Use Section 1.1.
5. Let Ak be independent sets in a probability space (Ω, A, P) such that Σk [P(Ak)(1 − P(Ak))]^n = +∞ for all n = 1, 2, …. Show that {Ak}_{k≥1} is not a Glivenko–Cantelli class, that is, sup_k |(Pn − P)(Ak)| doesn't converge to 0 in probability as n → ∞. Hint: See the last part of the proof of Theorem 3.9.1.
6. Let (X, A, P) be a probability space and let F ⊂ L²(X, A, P) be a Donsker class for P.
(a) Show that the convex hull of F, namely

    co(F) := {Σ_{j=1}^k λj fj : fj ∈ F, λj ≥ 0 for j = 1, …, k, Σ_{j=1}^k λj = 1},

is a Donsker class. Hint: Use Theorem 2.5.5 to get that co(F) is pregaussian and that GP can be taken to be prelinear on co(F). Then use almost surely convergent realizations (3.5.1).
(b) For any fixed k < ∞, show that {f1 + ⋯ + fk : fj ∈ F for j = 1, …, k} is also a Donsker class. Hints: Use induction and Theorem 3.8.1 on unions. Thus take k = 2. It is easy to show that 2F := {2f : f ∈ F} is Donsker. Then apply part (a).

7. Let c > 0. For Lebesgue measure λ on [0, 1], the Poisson process with intensity measure cλ is defined by first choosing n at random having a Poisson distribution with parameter c, namely P(n = k) = e^{−c} c^k/k! for k = 0, 1, …, then setting Yc := Σ_{j≤n} δ_{Xj}, where the Xj are i.i.d. with law λ on [0, 1]. In the Banach space of bounded functions on [0, 1] with supremum
norm, prove that as c → ∞ (along any sequence) the random functions t ↦ (Yc − cλ)c^{−1/2}[0, t], 0 ≤ t ≤ 1, converge in law to the Brownian motion process xt, 0 ≤ t ≤ 1. (Recall that xt = L(1_{[0,t]}).) Hints: For c = ck → ∞ let n = nk be Poisson (ck). For F(t) ≡ t, 0 ≤ t ≤ 1, write Yc(t) := Yc([0, t]) = n Fn(t), so

    c^{−1/2}(Yc(t) − ct) = (n/c)^{1/2}[n^{1/2}(Fn − F)(t)] + c^{−1/2}(n − c)t.

By Donsker's theorem take Brownian bridges y(n) such that n^{1/2}(Fn − F) is close to y(n) for n large. Also, c^{−1/2}(n − c) is close in law by the central limit theorem, and so in probability (Strassen's theorem), to some N(0, 1) random variable Zc. Show that one can take n(ω) independent of X1, X2, …; then yt(ω) := y(n(ω))_t(ω) is a Brownian bridge; apply the Vorob'ev–Berkes–Philipp theorem to take Zc independent of yt, and then yt + Zc t is Brownian.

8. Let P be a law on a separable, infinite-dimensional Hilbert space H such that ∫ ‖x‖² dP(x) < ∞ and with mean 0, so that ∫ (x, h) dP(x) = 0 for all h ∈ H. Let X1, X2, … be i.i.d. in H with law P and Sn := X1 + ⋯ + Xn.
(a) Show that the central limit theorem holds in H, that is, Sn/n^{1/2} converges in law to some normal measure on H. Hint: Prove, using variances, that the laws of Sn/n^{1/2} are uniformly tight.
(b) Show that the class of functions x ↦ (x, h) for h ∈ H with ‖h‖ ≤ 1 (the unit ball of the dual space of H) is a Donsker class of functions on H for P.
Hints: For part (a) let {en} be an orthonormal basis, so that for any x ∈ H, x = Σn xn en with ‖x‖² = Σn xn². Thus Σn Exn² is finite. Show that for some cn ↓ 0 slowly enough, Σn Exn²/cn is still finite. For any K let CK be the set of x such that Σn xn²/cn ≤ K² (an infinite-dimensional ellipsoid). Show that each CK is compact and that the laws of Sn/n^{1/2} are uniformly tight by way of the sets CK. Thus subsequences of these laws converge. Show that they all converge to the same Gaussian limit law, since by the Stone–Weierstrass theorem, functions depending on only finitely many xj are dense in the continuous functions on each CK.

9. The two-sample empirical process. Let X1, …, Xm, Y1, …, Yn be i.i.d. with the uniform distribution U[0, 1] on [0, 1] (Lebesgue measure restricted to [0, 1]). Let Fm be the empirical distribution function based on X1, …, Xm, and Gn likewise based on Y1, …, Yn. Show that in the space of all bounded functions on [0, 1] with supremum norm,

    (mn/(m + n))^{1/2} (Fm − Gn) ⇒ y
as m, n → ∞, where t ↦ yt, 0 ≤ t ≤ 1, is a Brownian bridge process. Hints: As in Donsker's theorem (1.1.1), for m, n large, m^{1/2}(Fm − F) is close to a Brownian bridge Y(m), and n^{1/2}(Gn − F) is close to an independent Brownian bridge Z(n). It follows that (mn/(m + n))^{1/2}(Fm − Gn) is close to (n/(m + n))^{1/2} Y(m) − (m/(m + n))^{1/2} Z(n), which is a Brownian bridge.
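The normalization in problem 9 can be checked at a single time point by simulation: at fixed t the statistic has mean 0 and variance exactly t(1 − t) for all m and n, the Brownian bridge variance. The parameters below are illustrative choices, not from the text:

```python
import random

random.seed(2)
m, n, t, trials = 300, 200, 0.3, 2000
c = (m * n / (m + n)) ** 0.5
vals = []
for _ in range(trials):
    fm = sum(random.random() <= t for _ in range(m)) / m   # F_m(t)
    gn = sum(random.random() <= t for _ in range(n)) / n   # G_n(t)
    vals.append(c * (fm - gn))

mean = sum(vals) / trials
var = sum(v * v for v in vals) / trials - mean ** 2
print(round(mean, 2), round(var, 2))  # compare with 0 and t*(1 - t) = 0.21
```

This is only the one-dimensional marginal check; the content of the problem is the convergence of the whole process in supremum norm.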
10. Let fn(x) := cos(2πnx) for 0 ≤ x ≤ 1, with law U[0, 1] on [0, 1]. For real c let Fc be the sequence of functions n^{−c} fn for n = 1, 2, …. For what values of c is Fc pregaussian? A Donsker class? Hints: If c ≤ 0, show easily that the class is not pregaussian and so not Donsker. If c > 0, then it is pregaussian, by metric entropy. If c > 1/2, it is Donsker by Theorem 3.9.8. Show that it is Donsker for 0 < c ≤ 1/2 by the Bernstein inequality and the asymptotic equicontinuity condition.

11. For any law P on R, show that for any fixed k < ∞,

    C := {⋃_{j=1}^k (aj, bj] : aj ≤ bj for all j}

is a Donsker class, that is, F := {1_C : C ∈ C} is a Donsker class. Hint: Apply Donsker's theorem (1.1.1) and take differences (and sums). For k = 2 reduce to the case of disjoint intervals and apply problem 6. Do induction to get general k.
12. Show that, as stated in the proof of Proposition 3.9.9, for any am > 0 with Σm am = +∞ there are cm ↓ 0 with Σm cm am = +∞. Hint: Take a sequence mk such that Σ{am : mk ≤ m < mk+1} ≥ k for each k = 1, 2, …. Let cm have the same value for mk ≤ m < mk+1.

13. Show that, as also stated in the proof of Proposition 3.9.9, in [0, 1] for P = U[0, 1] there exist independent sets Cm with any given probabilities. Hint: Use binary expansions. By decomposing the set of positive integers into a countable union of countably infinite sets, show that ([0, 1], P) is isomorphic as a probability space to a countable Cartesian product of copies of itself.

14. Suppose that {Cm}_{m≥1} are independent for P,

    Σ_{m=1}^∞ P(Cm)(1 − P(Cm)) = +∞,

and cm → ∞. Show that {cm 1_{Cm}}_{m≥1} is not a Donsker class.
Notes
Notes to Section 3.1. For any metric space (S, d), let Bb(S, d) be the σ-algebra generated by all balls B(x, r) := {y : d(x, y) < r}, x ∈ S, r > 0. Then Bb(S, d) is always included in the Borel σ-algebra B(S, d) generated by all the open sets, with Bb(S, d) = B(S, d) if (S, d) is separable. Suppose the Yn are functions from a probability space (Ω, P) into S, measurable for Bb(S, d). Then each Yn has a law µn = P ∘ Yn^{−1} on Bb(S, d). Dudley (1966, 1967) defined convergence in law of Yn to Y0 to mean that ∫* H dµn → ∫ H dµ0 for every bounded continuous real-valued function H on S. Hoffmann-Jørgensen (1984) gave the newer definition adopted generally and here, where the upper integrals and integral are taken over Ω, not S, so that the laws µn are not necessarily defined on any particular σ-algebra in S. Hoffmann-Jørgensen's monograph was published in 1991. Andersen (1985a,b), Andersen and Dobrić (1987, 1988), and Dudley (1985) developed further the theory based on Hoffmann-Jørgensen's definition.

Notes to Section 3.2. Blumberg (1935) defined the measurable cover function f*; see also Goffman and Zink (1960). (Thanks to Rae M. Shortt for pointing out Blumberg's paper.) Later, Eames and May (1967) also defined f*. Lemmas 3.2.2 through 3.2.6 are more or less as in Dudley and Philipp (1982, Section 2), except that Lemma 3.2.4(c) is new here. Theorem 3.2.1 and its proof are as in Vulikh (1967, pp. 78–79) for Lebesgue measure on an interval (the proof needs no change). Luxemburg and Zaanen (1983, Lemma 94.4, p. 222) also prove existence of essential suprema and infima of families of measurable (extended) real functions.
Note to Section 3.3. This section is based on parts of Dudley (1985).

Notes to Section 3.4. Perfect probability spaces were apparently first defined by Gnedenko and Kolmogorov (1949, Section 3), and their theory was carried on among others by Ryll-Nardzewski (1953), Sazonov (1962), and Pachl (1979).
Perfect functions are defined and treated in Hoffmann-Jorgensen (1984, 1985) and Andersen (1985a,b); see also Dudley (1985).
Notes to Section 3.5. The existence of almost surely convergent random variables with a given converging sequence of laws was first proved by Skorokhod (1956) for complete separable metric spaces, then by Dudley (1968) for any separable metric space, with a re-exposition in RAP, Section 11.7, and by Wichura (1970) for laws on the σ-algebra generated by balls in an arbitrary metric space
as mentioned in the notes to Section 3.1. The current version was given in Dudley (1985).
Notes to Section 3.6. Hoffmann-Jorgensen (1984), who defined convergence in law in the sense adopted in this chapter, also developed the theory of it as in this section, and partly in a more general form (with nets instead of sequences, and other classes of functions in place of the bounded Lipschitz functions). Andersen and Dobric (1987, Remark 2.13) pointed out that the portmanteau theorem (as in Topsoe, 1970, Theorem 8.1) "can be extended to the nonmeasurable case. The proof of this extension is the same as the ordinary proof." Much the same might be said of other equivalences in this section. Dudley (1990, Theorem A) gave a form of the portmanteau theorem and (Theorem B) of the metrization theorem (3.6.4).
But not all facts or proofs from the separable case extend so easily: for example, in the separable case, there is an inequality for the two metrics, 0 < 2p, in the opposite direction to Lemma 3.6.5, which follows from Strassen's theorem on nearby variables with nearby laws (RAP, Corollary 11.6.5), but Strassen's theorem seems not to extend well to the nonmeasurable case (Dudley, 1994).
Notes to Section 3.7. An early form of the asymptotic equicontinuity condition appeared in Dudley (1966, Proposition 2) and a later form in Dudley (1978).
The equivalence with a different pseudometric r in Theorem 3.7.2 is due to Gine and Zinn (1986, p. 58). Lemma 3.7.3 is essentially contained in the proof of Skorokhod (1976, Theorem 1), as Erich Berger kindly pointed out. See also Ershov (1975).

Notes to Section 3.8. Alexander (1987, Corollary 2.7) stated Theorem 3.8.1 but didn't publish a proof of it, although he had written out an unpublished proof several years earlier. He says that the result is "an extension of a slightly weaker result of Dudley (1981)," where F2 is finite, but this author himself doesn't think his 1981 result was only "slightly weaker"! The proof presented was suggested by Miguel Arcones in Berkeley during the fall of 1991, but I take responsibility for any possible errors in it. Apparently van der Vaart (1996, Theorem A.3) first published a proof.
Notes to Section 3.9. Theorem 3.9.1 first appeared in Dudley (1978, Section 2), Theorem 3.9.8 and Proposition 3.9.10 in Dudley (1981), and Proposition 3.9.9 in Dudley (1984).
References

Alexander, K. S. (1987). The central limit theorem for empirical processes on Vapnik-Cervonenkis classes. Ann. Probab. 15, 178-203.
Andersen, N. T. (1985a). The central limit theorem for non-separable valued functions. Z. Wahrscheinlichkeitstheorie verw. Gebiete 70, 445-455.
Andersen, N. T. (1985b). The calculus of non-measurable functions and sets. Various Publication Series no. 36, Matematisk Institut, Aarhus Universitet.
Andersen, N. T., and Dobric, V. (1987). The central limit theorem for stochastic processes. Ann. Probab. 15, 164-177.
Andersen, N. T., and Dobric, V. (1988). The central limit theorem for stochastic processes II. J. Theoret. Probab. 1, 287-303.
Bauer, H. (1981). Probability Theory and Elements of Measure Theory, 2d ed. Academic Press, London.
Blumberg, Henry (1935). The measurable boundaries of an arbitrary function. Acta Math. (Uppsala) 65, 263-282.
Cohn, D. L. (1980). Measure Theory. Birkhauser, Boston.
Dudley, R. M. (1966). Weak convergence of probabilities on nonseparable metric spaces and empirical measures on Euclidean spaces. Illinois J. Math. 10, 109-126.
Dudley, R. M. (1967). Measures on non-separable metric spaces. Illinois J. Math. 11, 449-453.
Dudley, R. M. (1968). Distances of probability measures and random variables. Ann. Math. Statist. 39, 1563-1572.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899-929; Correction 7 (1979), 909-911.
Dudley, R. M. (1981). Donsker classes of functions. In Statistics and Related Topics (Proc. Symp. Ottawa, 1980), North-Holland, New York, 341-352.
Dudley, R. M. (1984). A course on empirical processes. Ecole d'ete de probabilites de St.-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Dudley, R. M. (1985). An extended Wichura theorem, definitions of Donsker class, and weighted empirical distributions. In Probability in Banach Spaces V (Proc. Conf. Medford, 1984), Lecture Notes in Math. (Springer) 1153, 141-178.
Dudley, R. M. (1990). Nonlinear functionals of empirical measures and the bootstrap. In Probability in Banach Spaces VII (Proc. Conf. Oberwolfach, 1988), Progress in Probability 21, Birkhauser, Boston, 63-82.
Dudley, R. M. (1994). Metric marginal problems for set-valued or non-measurable variables. Probab. Theory Related Fields 100, 175-189.
Dudley, R. M., and Philipp, Walter (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Z. Wahrscheinlichkeitsth. verw. Gebiete 62, 509-552.
Eames, W., and May, L. E. (1967). Measurable cover functions. Canad. Math. Bull. 10, 519-523.
Ershov, M. P. (1975). The Choquet theorem and stochastic equations. Analysis Math. 1, 259-271.
Gine, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces (Proc. Conf. Zaragoza, 1985), Lecture Notes in Math. (Springer) 1221, 50-113.
Gnedenko, B. V., and Kolmogorov, A. N. (1949). Limit Distributions for Sums of Independent Random Variables. Moscow. Transl. and ed. by K. L. Chung, Addison-Wesley, Reading, MA, 1954; rev. ed. 1968.
Goffman, C., and Zink, R. E. (1960). Concerning the measurable boundaries of a real function. Fund. Math. 48, 105-111.
Hoffmann-Jorgensen, Jorgen (1984). Stochastic processes on Polish spaces. Published (1991): Various Publication Series no. 39, Matematisk Institut, Aarhus Universitet.
Hoffmann-Jorgensen, Jorgen (1985). The law of large numbers for non-measurable and non-separable random elements. Asterisque 131, 299-356.
Luxemburg, W. A. J., and Zaanen, A. C. (1983). Riesz Spaces, vol. 2. North-Holland, Amsterdam.
Pachl, Jan K. (1979). Two classes of measures. Colloq. Math. 42, 331-340.
Ryll-Nardzewski, C. (1953). On quasi-compact measures. Fund. Math. 40, 125-130.
Sazonov, V. V. (1962). On perfect measures (in Russian). Izv. Akad. Nauk SSSR 26, 391-414.
Skorokhod, Anatolii Vladimirovich (1956). Limit theorems for stochastic processes. Theor. Probab. Appl. 1, 261-290.
Skorokhod, A. V. (1976). On a representation of random variables. Theor. Probab. Appl. 21, 628-632 (English), 645-648 (Russian).
Topsoe, Flemming (1970). Topology and Measure. Lecture Notes in Math. (Springer) 133.
van der Vaart, Aad (1996). New Donsker classes. Ann. Probab. 24, 2128-2140.
Vulikh, B. Z. (1961). Introduction to the Theory of Partially Ordered Spaces (transl. by L. F. Boron, 1967). Wolters-Noordhoff, Groningen.
Wichura, Michael J. (1970). On the construction of almost uniformly convergent random variables with given weakly convergent image laws. Ann. Math. Statist. 41, 284-291.
4
Vapnik-Cervonenkis Combinatorics
This chapter will treat some classes of sets satisfying a combinatorial condition. In Chapter 6, it will be shown that under a mild measurability condition to be treated in Chapter 5, these classes have the Donsker property, for all probability measures P on the sample space, and satisfy a law of large numbers (Glivenko-Cantelli property) uniformly in P. Moreover, for either of these limit-theorem properties of a class of sets (without assuming any measurability), the Vapnik-Cervonenkis property is necessary (Section 6.4). The present chapter will be self-contained, not depending on anything earlier in this book, except in some examples.
4.1 Vapnik-Cervonenkis classes

Let X be any set and C a collection of subsets of X. For A ⊂ X let C_A := C ⊓ A := A ⊓ C := {C ∩ A : C ∈ C}. Let card(A) := |A| denote the cardinality (number of elements) of A and 2^A := {B : B ⊂ A}. Let Δ^C(A) := |C_A|. If A ⊓ C = 2^A, then C is said to shatter A. If A is finite, then C shatters A if and only if Δ^C(A) = 2^{|A|}.
Let m^C(n) := max{Δ^C(F) : F ⊂ X, |F| = n} for n = 0, 1, …, or if |X| < n let m^C(n) := m^C(|X|). Then m^C(n) ≤ 2^n for all n. Let

V(C) := inf{n : m^C(n) < 2^n} if this is finite, or +∞ if m^C(n) = 2^n for all n;

S(C) := sup{n : m^C(n) = 2^n}, or −1 if C is empty.

Then S(C) = V(C) − 1, and S(C) is the largest cardinality of a set shattered by C, or +∞ if arbitrarily large finite sets are shattered. So, V(C) is the smallest n
such that no set of cardinality n is shattered by C. If V(C) < ∞, or equivalently if S(C) < ∞, C will be called a Vapnik-Cervonenkis class or VC class. If X is finite, with n elements, then clearly 2^X is a VC class, with S(2^X) = n. Let

\binom{N}{j} := N!/(j!(N − j)!) for j = 0, 1, …, N, and \binom{N}{j} := 0 for j > N,

and let \binom{N}{\le k} := Σ_{j=0}^{k} \binom{N}{j}. Then \binom{N}{\le k} = 2^N whenever k ≥ N, and we have:

4.1.1 Proposition \binom{N}{\le k} = \binom{N−1}{\le k} + \binom{N−1}{\le k−1} for k ≥ 1 and N = 1, 2, ….

Proof For each j = 1, 2, …, N, we have by the classical Pascal's triangle identity

\binom{N}{j} = \binom{N−1}{j} + \binom{N−1}{j−1},  \binom{N}{0} = \binom{N−1}{0} = 1.

Summing over j = 0, 1, …, k, the conclusion follows. If N ≤ k, the two sides give 2^N = 2^{N−1} + 2^{N−1}.
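The combinatorial definitions above (the trace class C ⊓ A, the shatter coefficient Δ^C, m^C(n), and the index S(C)) can be computed directly for small finite examples. The following Python sketch is ours, not part of the text; the helper names `subsets`, `delta`, `m`, and `S` are ours. It checks the chain of initial segments of {0, 1, 2, 3} (index 1) against the full power set (index 4).

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of s, as frozensets (the class 2^s)."""
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def delta(C, F):
    """Delta^C(F): the number of distinct traces A & F for A in C."""
    return len({A & F for A in C})

def m(C, X, n):
    """m^C(n): max of Delta^C(F) over n-element subsets F of X."""
    return max(delta(C, frozenset(F)) for F in combinations(X, n))

def S(C, X):
    """Largest n with m^C(n) = 2^n: the index S(C), over the finite set X."""
    return max(n for n in range(len(X) + 1) if m(C, X, n) == 2 ** n)

X = tuple(range(4))
chains = [frozenset(range(k)) for k in range(5)]  # linearly ordered by inclusion
print(S(chains, X), S(subsets(X), X))             # prints: 1 4
```

A chain shatters every singleton but no pair (a trace can never contain the larger point without the smaller), while the full power set shatters X itself.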
For a non-VC class C we have m^C(n) = 2^n for all n. For a VC class, the next fact, which is fundamental in the Vapnik-Cervonenkis theory, will imply that m^C(n) grows only as a polynomial rather than exponentially in n.

4.1.2 Theorem (Sauer's lemma) If m^C(n) > \binom{n}{\le k−1} for some n, then C shatters some set of cardinality k. Thus if S(C) = k < ∞, then m^C(n) ≤ \binom{n}{\le k} for all n.

Proof The proof is by induction on k and n. For k = 1, \binom{n}{\le 0} = 1, so the hypothesis gives two sets in C with different traces on some finite set F; then some x ∈ F belongs to exactly one of the two traces, and C shatters {x}.

Suppose the theorem holds up to some k = K ≥ 1 and fix k := K + 1. For n < k we have \binom{n}{\le k−1} = 2^n ≥ m^C(n), so the statement holds vacuously. For n = k, the hypothesis m^C(k) > \binom{k}{\le k−1} = 2^k − 1 gives m^C(k) = 2^k, so C shatters some set of cardinality k. Now suppose the statement holds for n = N ≥ k and let n := N + 1. Take H_n = {x_1, …, x_n} with Δ^C(H_n) > \binom{n}{\le K}, and let H_N := {x_1, …, x_N}. If

(4.1.3) Δ^C(H_N) > \binom{N}{\le K},

then by the induction hypothesis on n, C shatters some set of cardinality k, so assume Δ^C(H_N) ≤ \binom{N}{\le K}. Let C_n := H_n ⊓ C := {A ∩ H_n : A ∈ C}. Call a set E ⊂ H_N full if both E and E ∪ {x_n} belong to C_n. Let f be the number of full sets. Then the map A ↦ A ∩ H_N takes C ⊓ H_n onto C ⊓ H_N and is two-to-one onto full sets and one-to-one onto non-full sets in C ⊓ H_N. Thus

(4.1.4) Δ^C(H_n) = Δ^C(H_N) + f.

Let F be the collection of all full sets. Suppose f = Δ^F(H_N) > \binom{N}{\le K−1}. Then by the induction hypothesis on k, F shatters some set E ⊂ H_N of cardinality K. Then C shatters E ∪ {x_n}, of cardinality k = K + 1: for any G ⊂ E, take a full set E′ with E′ ∩ E = G; then E′ and E′ ∪ {x_n} are both traces of C on H_n, inducing G and G ∪ {x_n} on E ∪ {x_n}.

In the remaining case, f ≤ \binom{N}{\le K−1}, so by (4.1.4) and Proposition 4.1.1,

Δ^C(H_n) ≤ \binom{N}{\le K} + \binom{N}{\le K−1} = \binom{n}{\le K},

contrary to hypothesis, completing the induction. The second sentence follows, since if m^C(n) > \binom{n}{\le k} for some n, C would shatter a set of cardinality k + 1.
4.1.5 Proposition For any nonnegative integers k and n with n ≥ k + 2,

(4.1.5) \binom{n}{\le k} ≤ 1.5n^k/k!.

Proof The inequality clearly holds for k = 0, so assume k ≥ 1. By the binomial theorem, n^{k−1}(k + n) ≤ (n + 1)^k, so

(4.1.6) n^{k−1}/(k − 1)! + n^k/k! ≤ (n + 1)^k/k!.

The proof will be done by induction on n and k. For k = 1 the inequality is n + 1 ≤ 1.5n, which holds for all n ≥ 2. For n = k + 2, the desired inequality is

2^n − n − 1 ≤ 1.5n^{n−2}/(n − 2)! = 1.5(n − 1)n^{n−1}/n!.

This can be checked directly for n = 3, 4, 5, and 6. Stirling's formula with an error bound (Theorem 1.3.13) gives n! ≤ (n/e)^n (2πn)^{1/2} e^{1/(12n)}. So it will be enough to prove

2^n (n/e)^n (2πn)^{1/2} e^{1/(12n)} ≤ 1.5(n − 1)n^{n−1},  n ≥ 7,

which follows from (e/2)^n > 2n^{1/2}, n ≥ 7; for f(x) := (e/2)^x and g(x) := 2x^{1/2} it is straightforward to check that f(7) > g(7), f′(7) > g′(7), f″ > 0, and g″ < 0, so f(x) > g(x) for all x ≥ 7.

Now suppose the proposition has been proved for n = k + i, i = 2, …, j, for all k, and for n = k + J, J := j + 1, for k = 1, …, K, as we have done for j = 2 and for K = 1. We need to prove (4.1.5) for n = k + J and k = K + 1. We have k + j = K + J = n − 1, and

\binom{n}{\le k} = \binom{n−1}{\le k} + \binom{n−1}{\le K}  by Proposition 4.1.1
 ≤ 1.5(k + j)^k/k! + 1.5(K + J)^K/K!  by the induction hypotheses
 ≤ 1.5n^k/k!  by (4.1.6),

completing the proof.
Combining Theorem 4.1.2 and Proposition 4.1.5 gives m^C(n) ≤ 1.5n^k/k! for n ≥ k + 2 where k := S(C). To see that Theorem 4.1.2 is sharp, let X be an infinite set and C the collection of all subsets of X with cardinality k. Then S(C) = k and the inequality in the second sentence of the theorem becomes an equality for all n.
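The sharpness of Sauer's lemma in this example, and the bound of Proposition 4.1.5, can be verified numerically. The sketch below is ours, not part of the text; it imitates an infinite X by a ground set with k extra points outside F.

```python
from itertools import combinations
from math import comb, factorial

k, n = 2, 6
X = range(n + k)            # k extra points play the role of X \ F
C = [frozenset(c) for c in combinations(X, k)]   # all subsets of cardinality k
F = frozenset(range(n))

m_n = len({A & F for A in C})                    # Delta^C(F), which equals m^C(n) here
bound = sum(comb(n, j) for j in range(k + 1))    # binom(n, <= k)
assert m_n == bound                              # Sauer's inequality is an equality
assert bound <= 1.5 * n ** k / factorial(k)      # Proposition 4.1.5, since n >= k + 2
```

Any subset of F with at most k elements extends to a k-element subset of X using points outside F, so every such subset occurs as a trace; that is why equality holds.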
Let dens(C) be the infimum of all r > 0 such that for some K < ∞, m^C(n) ≤ Kn^r for all n ≥ 1. Then we have

4.1.7 Corollary For any set X and C ⊂ 2^X, dens(C) ≤ S(C). Conversely, if dens(C) < ∞ then S(C) < ∞.

Proof By Theorem 4.1.2 and Proposition 4.1.5, there is a K such that m^C(n) ≤ Kn^{S(C)} for all n ≥ S(C) + 2. The same holds for all n ≥ 1, possibly with a larger K, so dens(C) ≤ S(C). Conversely, if dens(C) < ∞ then since m^C(n) ≤ Kn^r < 2^n for n large we have S(C) < ∞.

Note that S(C) can be determined by one large shattered set while dens(C) has to do with the behavior of C on arbitrarily large finite sets. For example, if X is a set with card(X) = n and C = 2^X then S(C) = n while dens(C) = 0.
For any set X, it is immediate that if C ⊂ D ⊂ 2^X then S(C) ≤ S(D) and dens(C) ≤ dens(D). The following is straightforward since for any set X, the map A ↦ X \ A is one-to-one from 2^X onto itself, and for any A, B, C ⊂ X, A ∩ B ≠ C ∩ B if and only if (X \ A) ∩ B ≠ (X \ C) ∩ B:

4.1.8 Proposition If X is any set, C ⊂ 2^X and D := {X \ A : A ∈ C}, then for all B ⊂ X, Δ^C(B) = Δ^D(B), so m^C(n) = m^D(n) for all n, S(D) = S(C) and dens(D) = dens(C).
4.2 Generating Vapnik-Cervonenkis classes

Let's begin with some examples of non-VC classes for which some limit theorems for empirical measures will fail. First, let X = [0, 1] and let C be the class of all finite subsets of X. Let P be the uniform (Lebesgue) law on [0, 1]. Clearly, S(C) = +∞, and C is not a VC class. Also, for any possible value of P_n, we will have P_n(A) = 1 for some A = {X_1, …, X_n} ∈ C while P(A) = 0. Thus sup_{A∈C}(P_n − P)(A) = 1 for all n, so C is not a Glivenko-Cantelli class for P; in other words,

‖P_n − P‖_C := sup_{A∈C} |(P_n − P)(A)|

doesn't approach 0 as n → ∞ in outer probability or in any sense, since it is identically 1. It follows that C is also not a Donsker class for P. Note that all functions 1_A for A ∈ C equal 0 almost surely for P. Thus, the whole class F := {1_A : A ∈ C} reduces to the one point 0 in the space L²(P) of equivalence classes, for equality almost everywhere, of functions in ℒ²(P), that is, measurable, square-integrable functions. Thus for purposes of empirical processes, functions equal a.s. P are not the same, and we need to deal with classes F ⊂ ℒ²(P) of actual real-valued functions, not equivalence classes. Then the integral ∫ f d(P_n − P) will be well-defined for any f ∈ ℒ²(P). This integral is linear in f and thus prelinear for f ∈ F for any set F ⊂ ℒ²(P). For the empirical process ν_n = n^{1/2}(P_n − P) we will not be taking versions or modifications as was done for Gaussian processes (Appendix J).

Next, let C₂ be the collection of all closed, convex subsets of ℝ². Let S¹ be the unit circle {(x, y) : x² + y² = 1}. For any finite subset F of S¹, the convex polygon with vertices in F (a singleton if |F| = 1, or a line segment if |F| = 2) is in C₂, and its intersection with S¹ is F. Thus S(C₂) = +∞ and C₂ is not a VC class. Let P be the uniform law dP(θ) = dθ/(2π) on S¹. Then the Glivenko-Cantelli and Donsker properties fail for P just as in the previous example.

Classes with S(C) finite, in other words Vapnik-Cervonenkis classes, can be formed in various ways. Here is one. Let G be a collection of real-valued
functions on a set X. Let

pos(g) := {x : g(x) > 0},  nn(g) := {x : g(x) ≥ 0},  g ∈ G,

pos(G) := {pos(g) : g ∈ G},  nn(G) := {nn(g) : g ∈ G},

U(G) := pos(G) ∪ nn(G).
4.2.1 Theorem Let H be an m-dimensional real vector space of functions on a set X, let f be any real function on X, and let H₁ := {f + h : h ∈ H}. Then S(pos(H₁)) = S(nn(H₁)) = m. If H contains the constants then also S(U(H₁)) = m.
Proof First it will be shown that S(pos(H₁)) = m. Clearly card(X) ≥ m. If card(X) = m, then H = H₁ is the set ℝ^X of all real-valued functions on X, so the result holds.

Otherwise, let A ⊂ X with card(A) = m + 1. Let G be the vector space {af + h : a ∈ ℝ, h ∈ H}. Let r_A : G → ℝ^A be the restriction of functions in G to A. If r_A is not onto, take v ≠ 0 in ℝ^A where v is orthogonal to r_A(G) for the usual inner product (·, ·)_A. Let A⁺ := {x ∈ A : v(x) > 0}. We can assume A⁺ is nonempty, replacing v by −v if necessary. If A⁺ = A ∩ pos(g) for some g ∈ G, then (r_A(g), v)_A > 0, a contradiction. So pos(G) doesn't shatter A.

Suppose instead that r_A(G) = ℝ^A. Then r_A is 1-1 on G, f ∉ H, and r_A(H₁) is a hyperplane in ℝ^A not containing 0, so that for some v ∈ ℝ^A, (j, v)_A = −1 for all j ∈ r_A(H₁). Again let A⁺ := {x ∈ A : v(x) > 0}. If A⁺ = A ∩ pos(f + h) for some h ∈ H, then (r_A(f + h), v)_A ≥ 0, a contradiction (here A⁺ may be empty). Thus pos(H₁) never shatters A, so S(pos(H₁)) ≤ m.

For each x ∈ X, a linear form δ_x is defined on H by δ_x(h) := h(x), h ∈ H. Let H′ be the vector space of all real linear forms on H. Then H′ is m-dimensional. Let H′_X be the linear span in H′ of the set of all δ_x, x ∈ X:

(4.2.2) H′_X := {Σ_{j=1}^r a_j δ_{x_j} : x_j ∈ X, a_j ∈ ℝ, r = 1, 2, …}.

The map h ↦ (τ ↦ τ(h)), h ∈ H, τ ∈ H′_X, is 1-1 and linear from H into (H′_X)′, so H′_X is m-dimensional. Take B = {x_1, …, x_m} ⊂ X such that the δ_{x_j} are linearly independent in H′_X. So r_B(H) = ℝ^B, r_B(H₁) = ℝ^B, and pos(r_B(H₁)) = 2^B, so S(pos(H₁)) = m.

Then S(nn(H₁)) = m by taking complements (Proposition 4.1.8). If H contains the constant functions, then the sets nn(f), f ∈ H₁, are the same as
the sets {f ≥ t}, f ∈ H₁, t ∈ ℝ, and the sets pos(f), f ∈ H₁, are the same as the sets {f > t}, f ∈ H₁, t ∈ ℝ. Now for any finite subset A of X, f ∈ H₁ and t ∈ ℝ, since f takes only finitely many values on A, there exist s and u such that A ∩ {f ≥ t} = A ∩ {f > s} and A ∩ {f > t} = A ∩ {f ≥ u}. So in this case S(U(H₁)) = m.

Examples. (I) Let H := P_{d,k} be the space of all polynomials of degree at most k on ℝ^d. Then for each d and k, H is a finite-dimensional vector space of functions, so pos(H) is a Vapnik-Cervonenkis class. For k = 2, it follows specifically that the set of all ellipsoids in ℝ^d is included in a Vapnik-Cervonenkis class and thus is one.
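For instance, with d = 1 and k = 2 (quadratics on the line, a 3-dimensional space), Theorem 4.2.1 gives S(pos(H)) = 3, and explicit quadratics realizing all 8 traces on a 3-point set are easy to exhibit. This check is ours, not from the text; the particular coefficient triples are illustrative choices.

```python
points = (-1.0, 0.0, 1.0)

def pos_trace(a, b, c):
    """Trace on `points` of pos(p) = {x : p(x) > 0} for p(x) = a + b*x + c*x**2."""
    return frozenset(x for x in points if a + b * x + c * x * x > 0)

coeffs = [(-1, 0, 0), (1, 0, 0), (-0.5, -1, 0), (-0.5, 1, 0),
          (0.5, 0, -1), (-0.5, 0, 1), (0.5, -1, 0), (0.5, 1, 0)]
traces = {pos_trace(a, b, c) for (a, b, c) in coeffs}
assert len(traces) == 8   # all subsets of the 3-point set are realized
```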
(II) Let X = ℝ. Let H be the 1-dimensional space of linear functions f(x) = cx, x ∈ ℝ, c ∈ ℝ. Then S(pos(H)) = S(nn(H)) = 1 by Theorem 4.2.1, but U(H) shatters {0, 1}. Since sets in U(H) are convex (half-lines), it follows that S(U(H)) = 2. So the condition that H contains the constants can't just be dropped from Theorem 4.2.1 for U(H₁).

Let X be a real vector space of dimension m. Let H be the space of all real affine functions on X, in other words functions of the form h + c where h is real linear and c is any real constant. Then H has dimension m + 1 and pos(H) is the set of all open half-spaces of X. Letting f = 0 in Theorem 4.2.1 for this H gives a special case known as Radon's theorem. On the other hand, Theorem 4.2.1 for f = 0 with general X and H follows from Radon's theorem via the following stability fact.
4.2.3 Theorem If X and Y are sets, F is a function from X into Y, C ⊂ 2^Y, and F⁻¹(C) := {F⁻¹(A) : A ∈ C}, then S(F⁻¹(C)) ≤ S(C). If F is onto Y, then S(F⁻¹(C)) = S(C).

Proof Let F⁻¹(C) shatter {x_1, …, x_m} where x_i ≠ x_j for i ≠ j. Then F(x_i) ≠ F(x_j) for i ≠ j and C shatters {F(x_1), …, F(x_m)}. So S(F⁻¹(C)) ≤ S(C). If F is onto Y and H ⊂ Y with card(H) = m, choose G ⊂ X such that F takes G 1-1 onto H. Then if C shatters H, F⁻¹(C) shatters G, so S(F⁻¹(C)) = S(C).

Now let X be any set and G a finite-dimensional real vector space of real functions on X. Then there is a natural map F : x ↦ δ_x from X into the space of linear functions on G. Then by Theorem 4.2.3 one could deduce Theorem 4.2.1 from its special case where X is an m- or (m + 1)-dimensional real vector space and f and all functions in H are affine, so that sets in pos(H₁) are open half-spaces.
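Theorem 4.2.3 can be illustrated on small finite sets. In this sketch (ours, not from the text), a map from a 6-point set onto a 3-point set pulls a class back without changing its index.

```python
from itertools import combinations

def S(C, X):
    """Largest cardinality of a subset of the finite set X shattered by C."""
    def shatters(F):
        return len({A & frozenset(F) for A in C}) == 2 ** len(F)
    return max((n for n in range(len(X) + 1)
                if any(shatters(F) for F in combinations(X, n))), default=-1)

X, Y = tuple(range(6)), tuple(range(3))
F = {x: x % 3 for x in X}                                    # F maps X onto Y
C = [frozenset({0}), frozenset({0, 1}), frozenset({2})]      # subsets of Y
C_pull = [frozenset(x for x in X if F[x] in A) for A in C]   # F^{-1}(C)

assert S(C_pull, X) == S(C, Y)   # equality, since F is onto
```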
Next it will be seen how a bounded number of Boolean operations preserves the Vapnik-Cervonenkis property.
4.2.4 Theorem Let X be a set, C ⊂ 2^X, and for k = 1, 2, …, let C(k) be the union of all (Boolean) algebras generated by k or fewer elements of C. Then dens(C(k)) ≤ k·dens(C), so if S(C) < ∞ then S(C(k)) < ∞.

Proof Let dens(C) = r, so that for any ε > 0 there is some M < ∞ such that m^C(n) ≤ Mn^{r+ε} for all n. For any A ⊂ X we have A ⊓ C(k) = (A ⊓ C)(k). An algebra A with k generators A_1, …, A_k has at most 2^k atoms, which are those nonempty sets that are intersections of some of the A_i and the complements of the rest. Sets in A are unions of atoms, so |A| ≤ 2^{2^k}. Thus |A ⊓ C(k)| ≤ 2^{2^k}|A ⊓ C|^k ≤ 2^{2^k}M^k|A|^{k(r+ε)}. Letting ε ↓ 0 gives dens(C(k)) ≤ k·dens(C). If S(C) < ∞ then by Corollary 4.1.7, S(C(k)) < ∞.

The constant 2^{2^k} is very large if k is at all large. Let C(⊓k) be the class of all intersections of at most k sets in C. Then C(⊓k) ⊂ C(k). For C(⊓k) the constant 2^{2^k} is not needed. Theorems 4.2.1 and 4.2.4 can be combined to generate Vapnik-Cervonenkis classes. For example, half-spaces in ℝ^d form a VC class. Intersections of at most k half-spaces give convex polytopes with at most k faces, so these form a VC class.
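As a small illustration of closure under bounded Boolean operations (our sketch, not from the text): on a 5-point line, left and right half-lines each form a class of index 1, and the class of intersections of at most two of them (discrete intervals) has index 2, still finite.

```python
from itertools import combinations

X = tuple(range(5))
halves = [frozenset(x for x in X if x <= t) for t in X] + \
         [frozenset(x for x in X if x >= t) for t in X]
C2 = {A & B for A in halves for B in halves} | set(halves)   # C(⊓2): intervals

def shatters(C, F):
    return len({A & frozenset(F) for A in C}) == 2 ** len(F)

assert any(shatters(C2, F) for F in combinations(X, 2))      # some pair is shattered
assert not any(shatters(C2, F) for F in combinations(X, 3))  # no triple: index 2
```

An interval's trace on x < y < z can never be {x, z}, which is why no 3-point set is shattered.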
Remarks. Let X be an infinite set, r = 1, 2, …, and let C_r be the collection of all subsets of X with at most r elements. Then clearly dens(C_r) = S(C_r) = r. It's easy to check that D := C_r(k) consists of all sets B such that either B or X \ B has at most kr elements. Thus m^D(n) ≤ 2\binom{n}{\le kr} for all n, while D ⊃ C_{kr}, so dens(D) = kr = k·dens(C_r), showing that the bound in Theorem 4.2.4 is sharp.

Classes with V(C) = 0 or 1 are easily characterized:
4.2.5 Proposition A class C of subsets of a set X has V (C) = 0, or equivalently S(C) = -1, if and only if C is empty. Also, V (C) = 1, or equivalently
S(C) = 0, if and only if C contains exactly one set. Thus S(C) ≥ 1 if and only if C contains at least two sets.

Proof Clearly C shatters the empty set if and only if C contains at least one set. If C contains at least two sets, then for some A, B ∈ C and x ∈ X, x ∈ A \ B. Then C shatters {x}, so S(C) ≥ 1. Conversely, if S(C) ≥ 1, then clearly C contains at least two sets.
Here are two sufficient conditions for S(C) = 1:
4.2.6 Theorem If C is a collection of at least two subsets of a set X, then S(C) = 1 if either (a) C is linearly ordered by inclusion, or (b) any two sets in C are disjoint.

Proof In any case, S(C) ≥ 1. If C is linearly ordered by inclusion, suppose it shatters {x, y}. Let A, B ∈ C with A ∩ {x, y} = {x} and B ∩ {x, y} = {y}. But A ⊂ B or B ⊂ A, giving a contradiction. If the sets in C are disjoint, then we can argue as in part (a), taking also C ∈ C with {x, y} ⊂ C; but C cannot be disjoint from A or B, a contradiction.

Section 4.4 will go more into detail about classes of index 1.
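Both cases of Theorem 4.2.6 are easy to check exhaustively on small examples; the sketch below is ours, not from the text.

```python
from itertools import combinations

def shatters(C, F):
    return len({A & frozenset(F) for A in C}) == 2 ** len(F)

X = tuple(range(6))
chain    = [frozenset(range(k)) for k in range(1, 7)]   # linearly ordered by inclusion
disjoint = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]

for C in (chain, disjoint):
    assert any(shatters(C, F) for F in combinations(X, 1))      # S(C) >= 1
    assert not any(shatters(C, F) for F in combinations(X, 2))  # so S(C) = 1
```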
*4.3 Maximal classes
Let C ⊂ A be classes of subsets of a set X. Then C will be called (A, n)-maximal if S(C) = n while if C ⊂ D strictly and D ⊂ A, then S(D) > n. If A is the class 2^X of all subsets of X, then C will be called n-maximal. If C is n-maximal, then clearly C is (A, n)-maximal for any A such that C ⊂ A ⊂ 2^X. In view of Proposition 4.2.5, classes C with S(C) = i, i = −1 or 0, are empty or contain just one set, respectively, and so are always i-maximal. Thus n-maximality becomes interesting only for n ≥ 1.
Examples. 1. For any set X, let C consist of ∅ (the empty set) and all singletons {x} for x ∈ X. Then C is clearly 1-maximal.

2. Let X = ℝ. Let LH consist of ∅, ℝ, and all left half-lines, closed (−∞, x] or open (−∞, x), for x ∈ ℝ. In other words, LH is the collection of all subsets A ⊂ ℝ such that whenever x < y ∈ A then also x ∈ A. Then clearly S(LH) = 1 since for x < y and A ∈ LH, A ∩ {x, y} ≠ {y}. On the
other hand if any subset of ℝ not in LH is adjoined, then some 2-element set is shattered, so LH is 1-maximal.
3. Let X = ℝ and let C₀ consist of all subintervals of ℝ, namely ∅, ℝ, any closed or open, left or right half-line, and any bounded interval, open or closed at either end. In other words, C₀ is the class of all convex subsets of ℝ. Then S(C₀) = 2; in fact C₀ shatters every 2-element subset of ℝ, while if x < y < z and A ∈ C₀, then A ∩ {x, y, z} ≠ {x, z}. On the other hand if any set not in C₀ is adjoined to it, its index becomes 3, so C₀ is 2-maximal.

Here is an existence theorem for maximal classes:
4.3.1 Theorem Let X be a set and D ⊂ 2^X. Suppose that C ⊂ D and S(C) = n. Then there exists a (D, n)-maximal class B with C ⊂ B.

Proof Zorn's lemma (RAP, Section 1.5) will be applied. Let (B_α)_{α∈I} be such that C ⊂ B_α ⊂ D for all α in the index set I and such that the B_α are linearly ordered (i.e., form a chain) by inclusion, with S(B_α) = n for all α. Let A := ⋃_α B_α. To show that S(A) = n, suppose A shatters a set F with |F| = n + 1. Each of the 2^{n+1} subsets of F is induced by a set in some B_α. Since there are only finitely many of these sets and the B_α are linearly ordered by inclusion, there is some α such that B_α shatters F, a contradiction. So the chain {B_α}_{α∈I} has an upper bound. So by Zorn's lemma the collection of all B with C ⊂ B ⊂ D and S(B) = n has a maximal element (for inclusion).

The following fact is straightforward:
4.3.2 Proposition For any set X, Y ⊂ X, C ⊂ 2^X, and C_Y := C ⊓ Y, we have S(C_Y) ≤ S(C).

Recall that Z₂ := {0, 1} with addition mod 2, in other words the usual addition except that 1 + 1 = 0. For any set X, the group Z₂^X of all functions from X into Z₂, with the natural addition (f + g)(x) := f(x) + g(x) in Z₂, provides a group structure for the collection 2^X of all subsets of X. Addition of indicator functions mod 2 corresponds to the symmetric difference AΔB := (A \ B) ∪ (B \ A), so that 1_A + 1_B = 1_{AΔB} mod 2. For any fixed set A ⊂ X, the translation 1_B ↦ 1_A + 1_B takes Z₂^X one-to-one and onto itself. If the functions are restricted to a subset Y ⊂ X, translation still takes Z₂^Y one-to-one and onto itself. For any A ⊂ X and C ⊂ 2^X, let AΔC := {AΔC : C ∈ C}. Then for any finite F ⊂ X, C shatters F if and only if AΔC does. It follows that:
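The invariance of shattering under the translation C ↦ AΔC can be checked directly; the sketch below is ours, not from the text, with an arbitrary small class and translating set.

```python
from itertools import combinations

def S(C, X):
    """Index of C over the finite set X (largest shattered cardinality)."""
    def shatters(F):
        return len({B & frozenset(F) for B in C}) == 2 ** len(F)
    return max((n for n in range(len(X) + 1)
                if any(shatters(F) for F in combinations(X, n))), default=-1)

X = tuple(range(4))
C = [frozenset({0}), frozenset({0, 1}), frozenset({1, 2, 3})]
A = frozenset({1, 3})
C_delta = [A ^ B for B in C]      # AΔC: symmetric difference with the fixed set A

assert S(C, X) == S(C_delta, X)   # translation preserves the index
```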
144
Vapnik-Cervonenkis Combinatorics
4.3.3 Proposition For any fixed set A ⊂ X and class C ⊂ 2^X, S(C) = S(AΔC). Also, C is n-maximal if and only if AΔC is.

Next, we have:
4.3.4 Proposition If C is an n-maximal class of subsets of a set X, and n ≥ 1, then ⋃{A : A ∈ C} = X and ⋂{A : A ∈ C} = ∅.

Proof Suppose there is an x such that x ∉ A for all A ∈ C. Take any B ∈ C and let D := C ∪ {B ∪ {x}}. Then by n-maximality, D must shatter some set F of cardinality n + 1 ≥ 2, and evidently x ∈ F. Thus D contains at least 2^n ≥ 2 sets which contain x, but it only contains one, a contradiction. This proves the first conclusion. The second follows on taking complements, setting A = X in Proposition 4.3.3.
On Z₂^X = 2^X there is a product topology coming from the discrete topology on Z₂. The product topology is compact by Tychonoff's theorem (RAP, Theorem 2.2.8).

4.3.5 Proposition For any set X, any n-maximal class C ⊂ 2^X is closed and therefore compact in 2^X.

Proof Suppose C_α → C is a convergent net in 2^X with C_α ∈ C for all α. Then for any finite set F ⊂ X, there is some α with C_α ∩ F = C ∩ F, so S(C ∪ {C}) = S(C) and C ∈ C. So C is closed (RAP, Theorem 2.1.3).

A class C of subsets of a set X will be called complemented if X \ A ∈ C for every A ∈ C.
4.3.6 Theorem If S(C) = n, C ⊂ A strictly, and C is complemented, then C is not (A, n)-maximal.

Proof For any finite set F ⊂ X and G ⊂ F, G ∈ C ⊓ F if and only if F \ G ∈ C ⊓ F. Thus if |F| = n + 1, then |C ⊓ F| ≤ 2^{n+1} − 2. So, for any A ∈ A \ C, S(C ∪ {A}) = n.

If 𝓕 is a k-dimensional real vector space of real-valued functions on a set X containing the constants, and C is the collection U(𝓕) of all sets {x : f(x) > 0} or {x : f(x) ≥ 0} for all f ∈ 𝓕, then S(C) = k by Theorem 4.2.1. Since C is complemented, if U(𝓕) ≠ 2^X, then C is not k-maximal.
Let X be any set and C_k the collection of all subsets of X with at most k elements. Then clearly S(C_k) = k. Also, C_k is k-maximal since if A ∉ C_k, A ⊂ X, then |A| > k, and if B is any subset of A with |B| = k + 1 then B is shattered by C_k ∪ {A}. For C = C_k we have m^C(n) = \binom{n}{\le k} for all n.
*4.4 Classes of index 1

In this section the structure of classes C with S(C) = 1 will be treated. Recall that for classes of two or more sets, disjoint classes and classes linearly ordered by inclusion have S(C) = 1 (Theorem 4.2.6). A common extension of these two kinds of classes is given by treelike partial orderings, defined as follows. A binary relation ≤ on a set X will be called a quasi-order if it is transitive: x ≤ y and y ≤ z imply x ≤ z; and reflexive: x ≤ x for all x ∈ X. The quasi-order is called a partial order if also x ≤ y and y ≤ x imply x = y. For any set S, inclusion (⊂) is a partial order on 2^S or any subset of 2^S. Let ≤ be a quasi-order on a set X. Then two elements x and y of X are called comparable if at least one of x ≤ y and y ≤ x holds, or incomparable if neither holds. A quasi-order ≤ on X will be called linear if any two elements of X are comparable. A quasi-order ≤ will be called treelike if for any y ∈ X and L_y := {x : x ≤ y}, the restriction of ≤ to L_y is linear.

4.4.1 Theorem Let C ⊂ 2^X contain at least two sets and satisfy, for any x ≠ y in X,

(4.4.2) A ∩ {x, y} = ∅ for some A ∈ C.

Then the following are equivalent:

(a) S(C) = 1;
(b) for every Y ⊂ X, the partial ordering of C_Y := Y ⊓ C by inclusion is treelike;
(c) for every Y ⊂ X with |Y| = 2, the partial ordering of C_Y := Y ⊓ C by inclusion is treelike.

Note. Clearly, (4.4.2) holds if ∅ ∈ C.
Proof In proving (a) implies (b), since S(C_Y) ≤ S(C) and classes with S ≤ 0 contain at most one set and trivially have a treelike ordering by inclusion, we can assume Y = X. If C doesn't have a treelike ordering by inclusion, there are a set D ∈ C and B ⊂ D, C ⊂ D with B, C ∈ C such that B and C are not comparable, so there exist some x ∈ B \ C and y ∈ C \ B. Take A from assumption (4.4.2). But then the sets A, B, C, and D shatter {x, y} and S(C) ≥ 2, a contradiction. So (b) holds. Now (b) implies (c) directly. If (c) holds and |Y| = 2, since 2^Y doesn't have a treelike ordering by inclusion, C must not shatter Y, so (a) follows.
4.4.3 Proposition Let X be a set and A ⊂ 2^X where ∅ ∈ A and for any B and C in A, B ∩ C ∈ A. If C is (A, 1)-maximal and satisfies (4.4.2) for any x ≠ y in X, then ∅ ∈ C and B ∩ C ∈ C for any B and C in C.

Proof If x ≠ y and (4.4.2) holds for A, then A ∩ {x, y} = ∅ = ∅ ∩ {x, y}, so adjoining ∅ to C doesn't induce any additional subsets of sets with two elements, and S(C ∪ {∅}) = 1, so by maximality ∅ ∈ C. Suppose B, C ∈ C and S(C ∪ {B ∩ C}) > 1. Then for some x ≠ y in X, B ∩ C ∩ {x, y} ≠ D ∩ {x, y} for all D ∈ C. Then by (4.4.2), we can assume x ∈ B ∩ C. If {x, y} ⊂ B ∩ C ⊂ B then taking D = B would give a contradiction, so y ∉ B ∩ C. Now B ∩ C ∩ {x, y} = {x} ≠ D ∩ {x, y} for D = B or C implies y ∈ B and y ∈ C, again a contradiction. So B ∩ C ∈ C.
4.4.4 Proposition Let X be a set and C a finite class of subsets of X with S(C) = 1 such that for any x ≠ y in X, (4.4.2) holds. Let D := D(C) consist of ∅ and all intersections of nonempty subclasses of C. Then S(D) = 1. For each nonempty set D ∈ D there is a C := C(D) ∈ D such that C ⊂ D strictly (C ≠ D) and if B is any set in D with B ⊂ D strictly, then B ⊂ C.

Proof By Theorem 4.3.1, let C ⊂ E where E is 1-maximal. Then by Proposition 4.4.3 for A = 2^X and induction, C ⊂ D ⊂ E, so S(D) = 1. Clearly, D is finite. For each nonempty D ∈ D, by Theorem 4.4.1, {B ∈ D : B ⊂ D} is linearly ordered by inclusion and contains ∅, so it has a largest element C(D) other than D itself.
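Propositions 4.4.4 and 4.4.5 can be seen concretely. In this sketch (ours, not from the text; the class C is an arbitrary small example with ∅ ∈ C, so that (4.4.2) holds), D(C) is built by brute force and the pieces D \ C(D) come out nonempty and pairwise disjoint.

```python
from itertools import combinations

# A small class with S(C) = 1 containing the empty set.
C = [frozenset(), frozenset({0}), frozenset({0, 1}), frozenset({0, 2})]

# D(C): the empty set together with all intersections of nonempty subclasses.
D = {frozenset()}
for r in range(1, len(C) + 1):
    for sub in combinations(C, r):
        inter = sub[0]
        for A in sub[1:]:
            inter = inter & A
        D.add(inter)

def C_of(Dset):
    """C(D): the largest member of D strictly contained in Dset."""
    return max((B for B in D if B < Dset), key=len)

pieces = [Dset - C_of(Dset) for Dset in D if Dset]
assert all(pieces)                      # each D \ C(D) is nonempty
for P, Q in combinations(pieces, 2):
    assert not (P & Q)                  # and they are pairwise disjoint
```

Taking the largest proper member by size is justified here because, by Proposition 4.4.4, the members of D below a given D are linearly ordered by inclusion.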
4.4.5 Proposition Under the hypotheses of Proposition 4.4.4, the sets D \ C(D) for distinct nonempty D ∈ D are all disjoint and are nonempty.

Proof Let A ≠ B in D. If B ⊂ A, then B ⊂ C(A), so B and hence B \ C(B) are disjoint from A \ C(A). Otherwise, A ∩ B ⊂ B strictly, and then A ∩ B ⊂ C(B), so again A \ C(A) is disjoint from B \ C(B). That A \ C(A) ≠ ∅ for A ≠ ∅ follows from the definitions.
A graph is a nonempty set S together with a set E of unordered pairs {x, y} for some x ≠ y in S. Then S will be called the set of nodes and E the set of edges of the graph. The graph (S, E) is called a tree if (a) it is connected, in other words for any x and y in S there are a finite n and x_i ∈ S, i = 0, 1, …, n, such that x_0 = x, x_n = y, and {x_{k−1}, x_k} ∈ E for k = 1, …, n; and (b) it is acyclic, which means that there is no cycle, where a cycle is a set of distinct x_1, …, x_n ∈ S such that n ≥ 3 and, letting x_0 := x_n, {x_{k−1}, x_k} ∈ E for k = 1, …, n.

4.4.6 Theorem (a) For any positive integer m, there exist connected graphs with m nodes and m − 1 edges. (b) A connected graph with m nodes cannot have fewer than m − 1 edges. (c) A connected graph with m nodes has exactly m − 1 edges if and only if it is a tree.
Proof (a) is clear. (b) will be proved by induction. It is clearly true for m = 1, 2. Suppose (S, E) is a connected graph with |S| = m, |E| ≤ m − 2, and m ≥ 3. The edges in E contain at most 2m − 4 nodes, counted with multiplicity, so at least four nodes appear in only one edge each, or some node is in no edge, a contradiction. Select a node in only one edge and delete it and the edge that contains it. The remaining graph, with m − 1 nodes and at most m − 3 edges, must be connected, but by induction assumption it is not, a contradiction, so (b) holds.
For (c), let (S, E) be a connected graph with |S| = m and |E| = m − 1. If the graph contains a cycle, we can delete any one edge in the cycle and the graph remains connected, contradicting (b). So (S, E) is a tree. Conversely, let (S, E) be a tree with |S| = m. It will be proved by induction that |E| ≤ m − 1.
This is clearly true for m = 1, 2. Suppose |E| ≥ m ≥ 3. Take a maximal set C := {x_1, …, x_k} ⊂ S such that the x_i are distinct and {x_{j−1}, x_j} ∈ E for j = 2, …, k. Then there is no y ≠ x_2 with {x_1, y} ∈ E: y cannot be any x_j, j ≥ 3, or there would be a cycle, and there is no such y ∉ C since C and k are maximal. So we can delete the node x_1 and the edge {x_1, x_2} from the graph, leaving a graph which is still a tree with m − 1 nodes and at least m − 1 edges, contradicting the induction hypothesis and so proving (c). □

Let the class D in Propositions 4.4.4 and 4.4.5 form the nodes of a graph G whose edges are the pairs {C(D), D} for D ∈ D, D ≠ ∅.
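Theorem 4.4.6(b) can also be confirmed exhaustively for a small node set. The sketch below (our code, with a hypothetical helper `connected`) checks that no graph on m = 5 nodes with fewer than m − 1 edges is connected, while a path on m nodes, with exactly m − 1 edges, is:

```python
from itertools import combinations

def connected(nodes, edges):
    """Graph-search reachability from the first node."""
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for a, b in edges:
            v = b if u == a else a if u == b else None
            if v is not None and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

m = 5
nodes = list(range(m))
for k in range(m - 1):                              # any edge count below m − 1
    for E in combinations(combinations(nodes, 2), k):
        assert not connected(nodes, E)              # Theorem 4.4.6(b)
path = [(i, i + 1) for i in range(m - 1)]           # a tree with m − 1 edges
assert connected(nodes, path)
```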
4.4.7 Proposition The graph G is a tree.

Proof If D has m elements, then there are exactly m − 1 pairs {C(D), D} for D ∈ D, D ≠ ∅. Starting with any D ∈ D, we have a decreasing sequence of sets D ⊃ C(D) ⊃ C(C(D)) ⊃ …, which must end with the empty set, so all sets in D are connected in G via the empty set, and G is connected. Then by Theorem 4.4.6 it is a tree. □

4.4.8 Proposition Let X be a finite set. Let C be 1-maximal in X and suppose (4.4.2) holds for all x ≠ y in X. Then C = D(C) as defined in Proposition 4.4.4. The sets D \ C(D) for nonempty D ∈ C are all the singletons {x}, x ∈ X. If |X| = m then |C| = m + 1.

Proof C = D(C) by Proposition 4.4.4. Suppose that for some D ∈ C, D \ C(D) has two or more elements. Then for some B, C := C(D) ⊂ B ⊂ D where both inclusions are strict. It will be shown that S(C ∪ {B}) = 1. If not, then for some x ≠ y, C ∪ {B} shatters {x, y}, so B ∩ {x, y} ≠ F ∩ {x, y} for all F ∈ C. Letting F = C shows that B ∩ {x, y} ≠ ∅. Likewise, letting F = D shows that B ∩ {x, y} ≠ {x, y}. So we can assume B ∩ {x, y} = {x}. Taking F = D shows that y ∈ D. If G ∩ {x, y} = {y} for some G ∈ C, then y ∈ G ∩ D ∈ C, and G ∩ D ⊂ D strictly, so G ∩ D ⊂ C and y ∈ C ⊂ B, giving a contradiction. So C ∪ {B} does not shatter {x, y}, and S(C ∪ {B}) = 1, contradicting 1-maximality of C.

So, each set D \ C(D) for ∅ ≠ D ∈ C is a singleton. Each singleton {x} equals D \ C(D) for at most one D ∈ C by Proposition 4.4.5. By Proposition 4.3.4, X = ⋃_{D∈C} D. For any x ∈ X take D_1 ∈ C with x ∈ D_1. Let D_{n+1} := C(D_n) for n = 1, 2, …. For some m, D_m = ∅, and {x} = D_j \ C(D_j) for some j < m. So all singletons are of the form D \ C(D), D ∈ C. This gives a 1-1 correspondence between singletons and nonempty sets in C, so if |X| = m there are exactly m such sets and |C| = m + 1. □

Suppose throughout this paragraph that C is a class of two or more sets such that (4.4.2) holds with ∅ replaced by {x, y}. Then the class of complements, N := {X \ C : C ∈ C}, satisfies the original hypotheses of Theorem 4.4.1. If C is 1-maximal, so is N by Proposition 4.3.3. So Theorem 4.4.1 and the facts numbered 4.4.3 through 4.4.7 apply to N, and so does Proposition 4.4.8 if X is finite. Then C itself has a "cotreelike" ordering, where for each C ∈ C, {D ∈ C : C ⊂ D} is linearly ordered by inclusion. Propositions 4.4.3 and 4.4.4 apply to C if ∅ is replaced by X and intersections by unions; in Proposition 4.4.4, we will have an immediate successor D(C) ⊃ C instead of a
predecessor; and sets D(C) \ C instead of D \ C(D) in Propositions 4.4.5 and 4.4.8. The resulting tree (Proposition 4.4.7) then branches out as sets become smaller rather than larger. Next will be several facts in the general case, that is, without the hypothesis (4.4.2).
4.4.9 Theorem Let X be any set and C any collection of subsets with S(C) = 1. Then for any C ∈ C, the collection C_{X\C} := {B \ C : B ∈ C} satisfies (4.4.2) for any x ≠ y as a collection of subsets of X \ C. Likewise, C'_C := {C \ B : B ∈ C} satisfies (4.4.2) for any x ≠ y as a collection of subsets of C. Also, S(C'_C) ≤ 1 and S(C_{X\C}) ≤ 1.

Proof Letting B = C shows that both classes C'_C and C_{X\C} contain ∅, so (4.4.2) holds for them. Both have index S ≤ 1 by Propositions 4.3.2 and 4.3.3. □
So, for an arbitrary class C with S(C) = 1, we have by Theorem 4.4.1 a treelike partial ordering by inclusion in one part X \ C of X and a cotreelike ordering in the complementary part C, for any C ∈ C. If X \ C also happens to be in C, both orderings are linear. To see how the two orderings fit together in general, Proposition 4.3.3 gives:
4.4.10 Corollary Let C be any class of sets with S(C) = 1 and A ∈ C. Let D := A Δ C := {A Δ C : C ∈ C}. Then S(D) = 1 and ∅ ∈ D. If C is 1-maximal, so is D. Then Theorem 4.4.1, Proposition 4.4.3, and, if C is finite, Propositions 4.4.4, 4.4.5, 4.4.7, and, if X is finite, Proposition 4.4.8, apply to D.

The last sentence in Proposition 4.4.8 has a converse and extension:
4.4.11 Proposition Let X be finite with m elements and C ⊂ 2^X with S(C) = 1. Then C is 1-maximal if and only if |C| = m + 1.
Proof For any fixed C ∈ C, we can replace C by {C Δ B : B ∈ C} without loss of generality by Proposition 4.3.3. So we can assume ∅ ∈ C, and then (4.4.2) holds for all x, y. Then Proposition 4.4.8 implies "only if."

Conversely, let S(C) = 1 and |C| = m + 1. Let C ⊂ D strictly. Then |D| ≥ m + 2, so by Sauer's lemma (Theorem 4.1.2), S(D) ≥ 2. So C is 1-maximal. □
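Proposition 4.4.11 is easy to confirm by exhaustion on a small X. In this sketch (our code; `shatters` and `shatters_some_pair` are helper names of our choosing) the class is a maximal chain, which has m + 1 members and index 1, and adjoining any further subset forces a shattered pair:

```python
from itertools import combinations

def shatters(classes, pts):
    traces = {frozenset(set(C) & set(pts)) for C in classes}
    return len(traces) == 2 ** len(pts)

def shatters_some_pair(classes, X):
    return any(shatters(classes, p) for p in combinations(X, 2))

m = 4
X = list(range(m))
C = [set(X[:k]) for k in range(m + 1)]     # ∅ ⊂ {0} ⊂ … ⊂ X, so S(C) = 1
assert len(C) == m + 1
assert not shatters_some_pair(C, X)        # index 1
# adjoining any other subset of X pushes the index to 2:
for r in range(m + 1):
    for B in map(set, combinations(X, r)):
        if B not in C:
            assert shatters_some_pair(C + [B], X)
```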
Now m + 1 = (m choose 0) + (m choose 1), which is the maximum value of m_C(m) for S(C) = 1 by Sauer's lemma (Theorem 4.1.2). The example at the end of Section 4.3
shows that Proposition 4.4.11, in the form |C| = (m choose 0) + (m choose 1), does not extend to indices larger than 1. For a set X, a subset Y ⊂ X, and a class C ⊂ 2^X, recall that C_Y := C ∩ Y := {A ∩ Y : A ∈ C}.
4.4.12 Theorem If C is 1-maximal and ∅ ≠ Y ⊂ X, then C_Y is a 1-maximal class of subsets of Y.
Proof Let A ∈ C. Without changing 1-maximality, C can be replaced by {A Δ C : C ∈ C} (Proposition 4.3.3). So we can assume ∅ ∈ C. We can also assume that |X| ≥ 2.

Case 1. Suppose X is finite, |X| = m < ∞. Then by Proposition 4.4.8 or 4.4.11, |C| = m + 1. Let x ∈ X and Y := X \ {x}. Let B := {y ∈ Y : {x, y} ∈ C_{x,y} and {y} ∈ C_{x,y}}, where C_{x,y} denotes the traces of C on {x, y}. A set C ⊂ Y will be called full if C ∈ C and C ∪ {x} ∈ C.

4.4.13 Lemma If A ⊂ Y and A is full then A = B.

Proof Suppose A is full. First suppose y ∈ A \ B. Then A ∪ {x} ∈ C implies {x, y} ∈ C_{x,y}, and A ∈ C implies {y} ∈ C_{x,y}, contradicting y ∉ B. Next, suppose y ∈ B \ A. Then {x} = (A ∪ {x}) ∩ {x, y}, ∅ = A ∩ {x, y}, and y ∈ B imply that C shatters {x, y}, a contradiction. So A = B, proving the lemma. □
Continuing the proof of Theorem 4.4.12, any two distinct sets in C have different intersections with Y, except possibly for B and B ∪ {x}. Now |C| = m + 1 by Proposition 4.4.11, so |C_Y| ≥ m, and |C_Y| = m since S(C_Y) ≤ 1 and |Y| = m − 1. So if m ≥ 2, C_Y is a 1-maximal class of subsets of Y by Proposition 4.4.11 again. Then by downward induction, C_Y is 1-maximal in Y for any Y ⊂ X, Y ≠ ∅, proving the theorem if X is finite.
Case 2. Let X be general and ∅ ≠ Y finite. Suppose C_Y is not 1-maximal in Y. Then by Case 1, for any finite Z ⊃ Y, C_Z is not 1-maximal in Z. Let E be the class of all B ⊂ Y such that B ∉ C_Y and S({B} ∪ C_Y) ≤ 1. Suppose for each B ∈ E there is some finite Z(B) ⊃ Y such that there is no A ⊂ Z(B) with S({A} ∪ C_{Z(B)}) ≤ 1 and A ∩ Y = B. Let Z := ⋃_{B∈E} Z(B). Then Z is a finite set including Y. So C_Z is strictly included in some class D which is 1-maximal in Z, by Theorem 4.3.1 (with A = 2^Z), and D_Y is 1-maximal
by Case 1. So B ∈ D_Y for some B ∈ E, say D ∩ Y = B for some D ∈ D. Then A := D ∩ Z(B) gives a contradiction.

So, for some B ∈ E and any finite Z ⊃ Y, there is an A ⊂ Z such that A ∩ Y = B and S(C_Z ∪ {A}) ≤ 1. Let D_Z be the class of all subsets D of X for which D ∩ Z is such a set A. Then D_Z is compact in the product topology of 2^X (which was treated in Proposition 4.3.5). For any finite U ⊃ Y and V ⊃ Y, D_U ∩ D_V ⊃ D_{U∪V} ≠ ∅. So the intersection of all the D_U is nonempty. Let D ∈ D_U for all finite U ⊃ Y. Then S(C ∪ {D}) = 1, so D ∈ C, but D ∩ Y = B ∉ C_Y, a contradiction. This finishes the proof in Case 2.
Case 3. Let X be infinite and U any infinite subset. Let F(U) := {D ⊂ U : D ∩ Y ∈ C_Y for all finite Y ⊂ U}. Then clearly C_U ⊂ F(U). Conversely, if D ∈ F(U), then for each finite Z ⊂ X, let Y := Z ∩ U and let G(Z) := {A ∈ C : A ∩ Y = D ∩ Y}. Then G(Z) is compact by Proposition 4.3.5, nonempty since D ∈ F(U), and decreases as Z increases. Thus there is some C ∈ ⋂_Z G(Z). Then for any finite Z, C ∩ Z ∩ U = D ∩ Z ∩ U, so C ∩ U = D ∩ U = D. So D ∈ C_U and C_U = F(U).

For each finite, nonempty Y ⊂ U, C_Y is 1-maximal in Y by Case 2. Suppose F(U) is not 1-maximal in U. Let A ⊂ U, A ∉ F(U), and S(E) ≤ 1 where E := {A} ∪ F(U). For each finite Y ⊂ U, C_Y ⊂ E_Y, so C_Y = E_Y. Let R_Y := {B ∈ C : B ∩ Y = A ∩ Y}. Then R_Y ≠ ∅ and, as shown above, ⋂_Y R_Y ≠ ∅. Let B ∈ ⋂_Y R_Y. Then A = B ∩ U ∈ C_U = F(U), a contradiction. So C_U is 1-maximal in U. □
For any set X and C ⊂ 2^X, let x ≤ y if and only if x ∈ B for every B ∈ C with y ∈ B, and for each y ∈ X let L_y := {x : x ≤ y}.

4.4.14 Theorem If S(C) = 1, ⋃_{B∈C} B = X, C satisfies (4.4.2) for all x ≠ y, and C is the class of all subsets of X linearly ordered by ≤ and closed downward for ≤, then S(C) = 1 and C is 1-maximal.

Proof Suppose S(C) = 1 and (4.4.2) holds for all x ≠ y. Let x and y be incomparable for ≤, so that there are C, D ∈ C with x ∈ C \ D and y ∈ D \ C. If some member of C contained both x and y, then C would shatter {x, y}, a contradiction. So no set of C contains two incomparable points; in particular each set {x : x ≤ y} = L_y ∈ C, and ∅ ∈ C, so S(C) = 1.

Let A ⊂ X and suppose A is not linearly ordered by ≤. Take x, y ∈ A which are not comparable for ≤. Since ∅, L_x and L_y are all in C, C ∪ {A} shatters {x, y}, and S(C ∪ {A}) ≥ 2.

Let B ⊂ X and suppose there exist x ≤ y with y ∈ B and x ∉ B. Then L_x ∈ C, L_y ∈ C, and B ∩ {x, y} = {y} imply that S(C ∪ {B}) ≥ 2. This has now been shown for any B ∉ C, so C is 1-maximal. □

Example. Let X = {1, 2, 3, 4, 5} and
G := {∅, {1}, {5}, {2, 5}, {1, 2, 4}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}}. Let C be the complement of G in 2^X. Then it can be checked that C is 3-maximal, but for Y = {1, 2, 3, 4}, C_Y is not 3-maximal in Y. So in Theorem 4.4.12, "1-maximal" and "Y ≠ ∅" cannot be replaced by "3-maximal" and "|Y| ≥ 3" respectively.
*4.5 Combining VC classes

Recalling the density as in Corollary 4.1.7, the following is clear:
4.5.1 Theorem For any set X, if A ⊂ 2^X and C ⊂ 2^X then dens(A ∪ C) = max(dens(A), dens(C)).

For the Vapnik-Červonenkis index we have instead:
4.5.2 Proposition For any set X, A ⊂ 2^X and C ⊂ 2^X, S(A ∪ C) ≤ S(A) + S(C) + 1. This bound is optimal: for any nonnegative integers k and m there exist X and A, C ⊂ 2^X with S(A) = k, S(C) = m, and S(A ∪ C) = k + m + 1.
Proof By Theorem 4.1.2, if k := S(A), m := S(C), and n ≥ k + m + 2, then

m_{A∪C}(n) ≤ m_A(n) + m_C(n) ≤ Σ_{j=0}^{k} (n choose j) + Σ_{j=n−m}^{n} (n choose j) < 2^n,

so S(A ∪ C) < n. It follows that S(A ∪ C) ≤ k + m + 1 as claimed. Conversely, given k and m, let n = k + m + 1 and let X be a set with n members. Let A be the set of all subsets of X with at most k members and let C be the set of subsets of X with at least n − m = k + 1 members. Then S(A) = k, S(C) = m, and A ∪ C = 2^X, so S(A ∪ C) = n. □

Let X be a set and C, D any two collections of subsets of X. Let
C ⊓ D := {C ∩ D : C ∈ C, D ∈ D},
C ⊔ D := {C ∪ D : C ∈ C, D ∈ D}.

If A is a class of subsets of another set Y let

C ⊗ A := {C × A : C ∈ C, A ∈ A}.

For such classes, we have:
4.5.3 Theorem For any C ⊂ 2^X and D ⊂ 2^X or 2^Y, let k := dens(C) and m := dens(D). Then dens(C □ D) ≤ k + m for □ = ⊓, ⊔, or ⊗.

Proof For any of the three operations, we have m_{C□D}(n) ≤ m_C(n) m_D(n), and the conclusions follow. □
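The optimality construction in the proof of Proposition 4.5.2 can be verified directly for small k and m. The code below is our own illustration (helper names are ours): it checks that on n = k + m + 1 points, A (sets of at most k points) and C (sets of at least n − m points) have indices k and m and together fill out 2^X, so S(A ∪ C) = n.

```python
from itertools import combinations

def shatters(classes, pts):
    traces = {frozenset(set(C) & set(pts)) for C in classes}
    return len(traces) == 2 ** len(pts)

def vc_index(classes, X):
    n = 0
    while n < len(X) and any(shatters(classes, c)
                             for c in combinations(X, n + 1)):
        n += 1
    return n

k, m = 1, 2
n = k + m + 1
X = list(range(n))
A = [set(s) for r in range(k + 1) for s in combinations(X, r)]           # ≤ k points
C = [set(X) - set(s) for r in range(m + 1) for s in combinations(X, r)]  # ≥ n − m points
assert vc_index(A, X) == k and vc_index(C, X) == m
union = {frozenset(s) for s in A} | {frozenset(s) for s in C}
assert len(union) == 2 ** n      # A ∪ C = 2^X, so S(A ∪ C) = n = k + m + 1
```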
For the Vapnik-Červonenkis index the behavior of the □ operations is not so simple. For k, m = 0, 1, 2, …, and □ = ⊔, ⊓, or ⊗, let □(k, m) := max{S(C □ D) : S(C) = k, S(D) = m}. Here the maximum is taken where X and Y may be infinite sets. Then we have:
4.5.4 Theorem For any k = 0, 1, 2, …, m = 0, 1, 2, …, and □ = ⊔, ⊓, or ⊗, we have □(k, m) < ∞.
Proof By Theorem 4.1.2 and Proposition 4.1.5, if S(C) = k and S(D) = m, then

m_{C□D}(n) ≤ m_C(n) m_D(n) ≤ 9 n^{k+m}/(4 k! m!) < 2^n

for n ≥ n_0(k, m) large enough. □
4.5.5 Theorem For any k, m = 0, 1, 2, …, ⊓(k, m) = ⊔(k, m) = ⊗(k, m).
Proof The first equation follows from taking complements (Proposition 4.1.8). For the second, for given k and m, by Theorem 4.5.4 it is enough to consider large enough finite sets in place of X and Y, and then we can assume X = Y. We have ⊓(k, m) ≤ ⊗(k, m) by restricting to the diagonal in X × X and applying Proposition 4.3.2. In the other direction, let Π_X and Π_Y be the projections of X × Y onto X and Y respectively. Let C ⊂ 2^X and A ⊂ 2^Y. Let F := {Π_X^{-1}(C) : C ∈ C} and B := {Π_Y^{-1}(A) : A ∈ A}. By Theorem 4.2.3, S(F) = S(C) and S(B) = S(A). Since Π_X^{-1}(C) ∩ Π_Y^{-1}(A) = C × A it follows that S(F ⊓ B) ≥ S(C ⊗ A), and the theorem is proved. □

Let S(k, m) be the common value of the quantities in Theorem 4.5.5. Theorem 4.5.4 can be improved as follows. For any nonnegative integers j, k let
θ(j, k) := sup{r ∈ ℕ : (Σ_{i=0}^{j} (r choose i))(Σ_{i=0}^{k} (r choose i)) ≥ 2^r}. Then θ(j, k) < ∞ for each j, k by Proposition 4.1.5, and we have:
4.5.6 Proposition S(j, k) ≤ θ(j, k) for any j, k ∈ ℕ.
Proof Let S(C) = j and S(D) = k. Then for any n > θ(j, k), by Sauer's lemma (Theorem 4.1.2) and Proposition 4.1.5 again,

m_{C⊔D}(n) ≤ m_C(n) m_D(n) ≤ (Σ_{i=0}^{j} (n choose i))(Σ_{i=0}^{k} (n choose i)) < 2^n. □
Can the values S(k, m) be computed? It is shown in Dudley (1984, Proposition 9.2.8) that S(1, 1) = 3 (which is not a corollary of Proposition 4.5.2, apparently). Also, after Proposition 9.2.10 there, an example of L. Birgé is given showing that S(1, 2) ≥ 5. For classes satisfying stronger conditions, more can be proved:
4.5.7 Theorem Let X and Y be sets, C ⊂ 2^X and D ⊂ 2^Y. If C is linearly ordered by inclusion, then S(C ⊗ D) ≤ S(D) + 1.

Proof Let m := S(D). We can assume that m < ∞ and X ∈ C. Let F ⊂ X × Y with |F| = m + 2. Suppose C ⊗ D shatters F. The subsets of Π_X F ⊂ X induced by C are linearly ordered by inclusion. Let G be the next largest such subset after Π_X F itself, and p ∈ Π_X F \ G. Take y such that (p, y) ∈ F. Then all subsets of F containing (p, y) are induced by sets of the form Π_X F × D, D ∈ D. Thus H := F \ {(p, y)} is shattered by such sets. Now |H| = m + 1, and Π_Y must be one-to-one on H or it could not be shattered by sets of the given form. But then D shatters Π_Y H, giving a contradiction. □
4.5.8 Theorem For any set X and C, D ⊂ 2^X, if C is linearly ordered by inclusion, then S(C □ D) ≤ S(D) + 1 for □ = ⊓ or ⊔.

Proof First consider □ = ⊓. Again let m := S(D), and we can assume m < ∞. Let F ⊂ X with |F| = m + 2. Suppose C ⊓ D shatters F. Now C_F is linearly ordered by inclusion. We can assume that X ∈ C, so F ∈ C_F. Let G be the next largest element of C_F and p ∈ F \ G. Each set A ⊂ F containing p is of the form C ∩ D ∩ F, C ∈ C, D ∈ D, so we must have C ∩ F = F, and A = D ∩ F. So D shatters F \ {p}, a contradiction. Since the complements of a class linearly ordered by inclusion are again linearly ordered by inclusion, the case of unions follows by taking complements (Proposition 4.1.8). □
Then by Theorem 4.2.6 and induction we have:
4.5.9 Corollary Let C_i be classes of subsets of a set X, each linearly ordered by inclusion, and let C := {⋂_{i=1}^{n} C_i : C_i ∈ C_i, i = 1, …, n}. Then S(C) ≤ n.

Definition. For any set X and Vapnik-Červonenkis class C ⊂ 2^X, C will be called bordered if for some F ⊂ X with |F| = S(C) and some x ∈ X \ F, F is shattered by sets in C all containing x.
4.5.10 Theorem Let C_i ⊂ 2^{X(i)} be bordered Vapnik-Červonenkis classes for i = 1, 2. Then S(C_1 ⊗ C_2) ≥ S(C_1) + S(C_2).

Proof Take F_i ⊂ X(i) and x_i as in the definition of "bordered." Let H := ({x_1} × F_2) ∪ (F_1 × {x_2}). Then |H| = S(C_1) + S(C_2). For any V_i ⊂ F_i, i = 1, 2, take C_i ∈ C_i with C_i ∩ F_i = V_i and x_i ∈ C_i. Then (C_1 × C_2) ∩ H = ({x_1} × V_2) ∪ (V_1 × {x_2}), so C_1 ⊗ C_2 shatters H. □
Theorem 4.5.10 extends by induction to any number of factors. One consequence is:
4.5.11 Corollary In ℝ let J be the set of all intervals, which may be open or closed, bounded or unbounded on each side; in other words, J is the set of all convex subsets of ℝ. In ℝ^m let C be the collection of all rectangles parallel to the axes, C := {∏_{i=1}^{m} J_i : J_i ∈ J, i = 1, …, m}. Then S(C) = 2m. Let D be the set of all left half-lines (−∞, x] or (−∞, x) for x ∈ ℝ. Let T := {∏_{i=1}^{m} H_i : H_i ∈ D, i = 1, …, m}, so T is the class of lower orthants parallel to the given axes. Then S(T) = m.

Proof The class D is linearly ordered by inclusion and is bordered with S(D) = 1, and so is the class of half-lines [a, ∞) or (a, ∞), a ∈ ℝ. The class J of all intervals in ℝ is bordered with S(J) = 2. The results now follow from Corollary 4.5.9 with n = 2m and n = m, and from Theorem 4.5.10 and induction. □
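The claim S(C) = 2m of Corollary 4.5.11 can be checked numerically for m = 2: axis-parallel rectangles shatter a 4-point "diamond," while no 5 points of a small grid are shattered (the general upper bound comes from Corollary 4.5.9). In this sketch (our code, with helper name `rect_traces`), it suffices to try rectangles with corner coordinates drawn from the points themselves, since any trace is also cut out by the minimal enclosing rectangle of that trace.

```python
from itertools import combinations

def rect_traces(pts):
    """All subsets of pts cut out by closed axis-parallel rectangles."""
    xs = sorted({p[0] for p in pts})
    ys = sorted({p[1] for p in pts})
    traces = {frozenset()}
    for x0 in xs:
        for x1 in xs:
            for y0 in ys:
                for y1 in ys:
                    traces.add(frozenset(p for p in pts
                                         if x0 <= p[0] <= x1 and y0 <= p[1] <= y1))
    return traces

diamond = [(0, 1), (1, 0), (2, 1), (1, 2)]
assert len(rect_traces(diamond)) == 16        # 4 points shattered, so S(C) ≥ 4
grid = [(i, j) for i in range(3) for j in range(3)]
for pts in combinations(grid, 5):
    assert len(rect_traces(list(pts))) < 32   # no 5 grid points are shattered
```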
4.5.12 Proposition Let J be the set of all intervals in ℝ. Let Y be any set and C ⊂ 2^Y with Y ∈ C. Then in ℝ × Y, S(J ⊗ C) ≤ 2 + S(C).

Proof If S(C) = +∞ the result is clear, so suppose m := S(C) < ∞. Let F ⊂ ℝ × Y with |F| = 3 + m and suppose J ⊗ C shatters F. Let (x_i, y_i), i = 1, …, m + 3, be the points of F. Let u := min_i x_i and v := max_i x_i, and let p := (u, y_i) ∈ F and q := (v, y_j) ∈ F be points attaining them. All subsets of F which include {p, q} must be induced by sets of the form ℝ × C, C ∈ C. So Π_Y must be one-to-one on F \ {p, q}, and C shatters Π_Y(F \ {p, q}) of cardinality m + 1, a contradiction. □
4.6 Probability laws and independence

Let (X, A, P) be a probability space. Recall the pseudometric d_P(A, B) := P(A Δ B) on A, where A Δ B := (A \ B) ∪ (B \ A). Recall also that for any (pseudo)metric space (S, d) and ε > 0, D(ε, S, d) denotes the maximum number of points more than ε apart (Section 1.2).
Definition. For a measurable space (X, A) and C ⊂ A let s(C) := inf{w : there is a K = K(w, C) < ∞ such that for every law P on A and 0 < ε ≤ 1, D(ε, C, d_P) ≤ K ε^{−w}}.
This index s(C) turns out to equal the density:
4.6.1 Theorem For any measurable space (X, A) and C ⊂ A, dens(C) = s(C).
Proof Let P be a probability measure on A. Suppose A_1, …, A_m ∈ C and d_P(A_i, A_j) > ε > 0 for 1 ≤ i < j ≤ m, with m ≥ 2. Let X_1, X_2, … be i.i.d. (P), specifically coordinates on a countable product of copies of (X, A, P). Note that P may have atoms. For n = 1, 2, …,

Pr{for some i ≠ j, X_k ∉ A_i Δ A_j for all k ≤ n} ≤ m(m − 1)(1 − ε)^n/2 < 1

for n large enough, namely n > −log(m(m − 1)/2)/log(1 − ε). Let P_n := n^{−1} Σ_{i=1}^{n} δ_{X_i} (as usual) be empirical measures for P. For such n, there is positive probability that P_n(A_i Δ A_j) > 0 for all i ≠ j, and so A_i and A_j induce different subsets of {X_1, …, X_n}, and m_C(n) ≥ m. For any r > dens(C) there is an M = M(r, C) < ∞, where we can take M ≥ 2, such that m_C(n) ≤ M n^r for all n.
Note that −log(1 − ε) > ε. Thus for m ≥ 2, m ≤ M(2 log(m²))^r ε^{−r}, or m (log m)^{−r} ≤ M_1 ε^{−r} for some M_1 = M_1(r, C) = 4^r M. For any δ > 0 and C large enough, (log m)^r ≤ C m^δ for all m > 1. Thus for all m ≥ 2, m^{1−δ} ≤ M_2 ε^{−r} for 0 < ε < 1 and some M_2 = M_2(r, C, δ), so m ≤ (M_2 ε^{−r})^{1/(1−δ)}. Letting r ↓ dens(C) and δ ↓ 0 gives dens(C) ≥ s(C). In the converse direction, let |A| := card(A). Since it is not the case that
m_C(n) ≤ k n^t for all n ≥ 1, for r < t < dens(C) and each k = 1, 2, …, let A_k ⊂ X with A_k ≠ ∅ and |A_k ∩ C| > k |A_k|^t, where A ∩ C denotes the class of traces {A ∩ C : C ∈ C}. Then |A_k| → ∞ as k → ∞. Let B_0 := A_1. Other sets B_n will be defined recursively. Let B_0, …, B_{n−1} be disjoint subsets of X and let C(n) := ⋃_{0≤j<n} B_j. Choose k(n) large enough that |A_{k(n)} ∩ C| ≥ 2^{|C(n)|} 2^{2^n} |A_{k(n)}|^t, and let B_n := A_{k(n)} \ C(n). So all the B_n are defined and disjoint. Each set in B_n ∩ C is induced by at most 2^{|C(n)|} different sets in A_{k(n)} ∩ C. Thus

|B_n ∩ C| ≥ |A_{k(n)} ∩ C| / 2^{|C(n)|} ≥ 2^{2^n} |A_{k(n)}|^t ≥ 2^{2^n} |B_n|^t.

Since B_n is nonempty, it follows that |B_n ∩ C| ≥ 2^{2^n}, and hence |B_n| ≥ 2^n. Let

a_n := |B_n|^{−t/r},   S := Σ_{n=0}^{∞} |B_n|^{1−t/r} < ∞.

Let P be the probability measure on ⋃_{n≥0} B_n ⊂ X giving mass a_n/S to each point of B_n for each n. The distinct sets in B_n ∩ C are at d_P-distance at least a_n/S apart, and so are elements of C which induce the subsets of B_n. So for all n,

D(a_n/(2S), C, d_P) ≥ |B_n ∩ C| ≥ 2^{2^n} |B_n|^t = 2^{2^n} a_n^{−r} = 2^{2^n} (2S)^{−r} (a_n/(2S))^{−r}.

For ε := a_n/(2S) → 0 as n → ∞, this implies D(ε, C, d_P) ≥ 2^{2^n} (2S)^{−r} ε^{−r}.
Since 2^{2^n} → ∞ as a_n ↓ 0, this implies r ≤ s(C). Then letting r ↑ dens(C), the proof is complete. □

There is a notion of independence for sets without probability. To define it, for any set X and subset A ⊂ X let A^1 := A and A^{−1} := X \ A. Sets A_1, …, A_m from 2^X are called independent, or independent as sets, if for every function s from {1, …, m} into {−1, +1}, ⋂_{j=1}^{m} A_j^{s(j)} ≠ ∅. Such intersections, when they are nonempty, are called atoms of the Boolean algebra generated by A_1, …, A_m. Thus for A_1, …, A_m to be independent as sets means that the Boolean algebra they generate has the maximum possible number, 2^m, of atoms. If A_1, …, A_m are independent as sets, then one can define a probability law on the algebra they generate for which they are jointly independent in the usual probability sense and for which P(A_i) = 1/2, i = 1, …, m. For example, choose a point in each atom and put mass 1/2^m at each point chosen. Or, if desired, given any q_i, 0 < q_i < 1, one can define a probability measure Q for which the A_i are jointly independent and have Q(A_i) = q_i, i = 1, …, m. For a set X and C ⊂ 2^X let I(C) be the supremum of m such that A_1, …, A_m are independent as sets for some A_i ∈ C, i = 1, …, m.
4.6.2 Theorem For any set X, C ⊂ 2^X, and n = 1, 2, …, if S(C) ≥ 2^n then I(C) ≥ n. Conversely, if I(C) ≥ 2^n then S(C) ≥ n. So I(C) < ∞ if and only if S(C) < ∞. In both cases, 2^n cannot be replaced by 2^n − 1.
Proof Clearly, if a set Y has n or more independent subsets, then |Y| ≥ 2^n. Conversely, if |Y| = 2^n, we can assume that Y is the set of all strings of n digits each equal to 0 or 1. Let A_j be the set of strings for which the jth digit is 1. Then the A_j are independent. It follows that if S(C) ≥ 2^n, then I(C) ≥ n, while if |Y| = 2^n − 1 and C = 2^Y, then S(C) = 2^n − 1 while I(C) < n, as stated.
Conversely, if B_j ∈ C are independent as sets for j = 1, …, 2^n, let A(i) := A_i be independent subsets of {1, …, 2^n} for i = 1, …, n. Choose

x_i ∈ ⋂_{j∈A(i)} B_j ∩ ⋂_{j∉A(i)} (X \ B_j).

Then x_i ∈ B_j if and only if j ∈ A(i). For each set S ⊂ {1, …, n},

⋂_{i∈S} A_i ∩ ⋂_{i∉S} ({1, …, 2^n} \ A_i) ≠ ∅,

so it contains some j := j_S ∈ {1, …, 2^n}. Then j ∈ A_i if and only if i ∈ S, and B_j ∩ {x_1, …, x_n} = {x_i : i ∈ S}.
So C shatters {x_1, …, x_n} and S(C) ≥ n, as stated. If C consists of 2^n − 1 (independent) sets, then clearly S(C) < n. □
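The digit construction in the proof of Theorem 4.6.2 is easy to carry out explicitly. In this sketch (our own code), Y is coded as the integers 0, …, 2^n − 1 and A_j collects those whose j-th binary digit is 1; independence as sets means every Boolean atom is nonempty, and here each atom is exactly one bit string:

```python
from itertools import product

n = 3
Y = set(range(2 ** n))                  # all strings of n binary digits
A = [{y for y in Y if (y >> j) & 1} for j in range(n)]

for signs in product([True, False], repeat=n):
    atom = set(Y)
    for j, keep in enumerate(signs):
        atom &= A[j] if keep else Y - A[j]
    assert len(atom) == 1               # every one of the 2^n atoms is a singleton
```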
For any set X, C ⊂ 2^X and Y ⊂ X, recall that C_Y := Y ∩ C := {Y ∩ C : C ∈ C}. Let At(C|Y) be the set of atoms of the algebra of subsets of Y generated by C_Y, where in the cases to be considered, C_Y will be finite because C or Y is. Let Δ_C(Y) := |At(C|Y)| be the number of such atoms. Let m^C(n) := sup{Δ_A(X) : A ⊂ C, |A| ≤ n} ≤ 2^n. Let dens*(C) denote the infimum of all s > 0 such that for some C < ∞, m^C(n) ≤ C n^s for all n. For any x ∈ X let C_x := {A ∈ C : x ∈ A}, and let C^X := {C_x : x ∈ X}.
4.6.3 Theorem For any set X and A ⊂ C ⊂ 2^X, with A finite,

(a) Δ_A(X) = Δ_{C^X}(A);
(b) for n = 1, 2, …, m^C(n) = m_{C^X}(n);
(c) S(C^X) = I(C);
(d) dens*(C) = dens(C^X) ≤ I(C).

Proof For any B ⊂ A let

a(B) := ⋂_{B∈B} B ∩ ⋂_{A∈A\B} (X \ A).

Then a(B) is an atom, in At(A|X), if and only if it is nonempty. Now y ∈ a(B) if and only if A ∩ C_y = B, so (a) follows. Then taking the maximum over A with |A| = n on both sides of (a) gives (b), which then implies (c) and (d). The last inequality follows from Corollary 4.1.7. □
4.7 Vapnik-Červonenkis properties of classes of functions

The notion of VC class of sets has several extensions to classes of functions.

Definitions. Let X be a set and F a class of real-valued functions on X. Let C ⊂ 2^X. If f is any real-valued function, each set {f > t} for t ∈ ℝ will be called a major set of f. The class F will be called a major class for C if all the major sets of each f ∈ F are in C. If C is a Vapnik-Červonenkis class, then F will be called a VC major class (for C).

The subgraph of a real-valued function f is the set {(x, t) ∈ X × ℝ : 0 < t < f(x) or f(x) < t < 0}. If D is a class of subsets of X × ℝ, and for each f ∈ F the subgraph of f is in D, then F will be called a subgraph class for D. If D is a VC class in X × ℝ then F will be called a VC subgraph class.
The symmetric convex hull of F is the set of all functions Σ_{i=1}^{m} c_i f_i for f_i ∈ F, c_i ∈ ℝ, any finite m, and Σ_{i=1}^{m} |c_i| ≤ 1. If 0 < M < ∞, let H(F, M) denote M times the symmetric convex hull of F. Let H_S(F, M) be the smallest class G of functions including H(F, M) such that whenever g_n ∈ G for all n and g_n(x) → g(x) as n → ∞ for all x, we have g ∈ G. A class F of functions such that F ⊂ H_S(G, M) for some M < ∞ and a given G will be called a VC subgraph hull class if G is a VC subgraph class, and a VC hull class if G = {1_C : C ∈ C} where C is a VC class of sets. So there are at least four possible ways to extend the notion of VC class to classes of functions. Some implications hold between these different conditions, but no two of them are equivalent. The next theorem deals with some of the easier cases of implication or non-implication.
4.7.1 Theorem Let F be a uniformly bounded class of nonnegative real-valued functions on a set X. Then

(a) if F is the set of indicators of members of a VC class of sets, then F is also a VC major class, a VC subgraph class, and a VC hull class;
(b) if F is a VC major class then it is a VC hull class;
(c) there exist VC hull classes F which are not VC major;
(d) there exist VC subgraph classes F which are not VC major.

Proof (a) The indicators of a VC class C of sets clearly form a VC major class (for C ∪ {∅, X}) and a VC hull class. Also, the sets (C × [0, 1]) ∪ (X × {0}) for C ∈ C form a VC class, for example by Theorem 4.5.4. So (a) holds.

(b) Since F is uniformly bounded and the definition of VC hull class allows
multiplication by a constant, we can assume 0 ≤ f ≤ 1 for all f ∈ F. For such an f let

f_n := (1/n) Σ_{j=1}^{n−1} 1_{f > j/n} = Σ_{j=0}^{n−1} (j/n) 1_{j/n < f ≤ (j+1)/n}.
Then f_n(x) → f(x) as n → ∞ for all x. (In fact, f(x) − 1/n ≤ f_n(x) ≤ f(x) for all x, so f_n → f uniformly.) So F is a VC hull class.

(c) Let X = ℝ². Let C be the set of open lower left quadrants {(x, y) : x < a, y < b} for all a, b ∈ ℝ. Then S(C) = 2 by Corollary 4.5.11. Let F be the set of all sums Σ_{k=1}^{∞} 1_{C(k)}/2^k where C(k) ∈ C. Then clearly F is a VC hull class. The sets where f > 0 for f in F are exactly the countable unions of sets
in C. But such unions are not a VC class; for example, they shatter any finite subset of the line x + y = 1. So F is not a VC major class and (c) is proved.

(d) Let f_n := n^{−1} + n^{−2} 1_{B_n} for n = 1, 2, …, for any measurable sets B_n. Then f_n ↓ 0, so the subgraphs of the functions f_n are linearly ordered by inclusion and form a VC class of index 1 (Theorem 4.2.6(a)). Now {f_n > 1/n} = B_n for each n, and the sequence {B_n} need not form a VC class; for example, let B_n be a sequence of independent sets (see Section 4.6). Then {f_n} is not a VC major class, proving (d). □
4.8 Classes of functions and dual density

For a metric space (S, d) and ε > 0 recall D(ε, S, d), the maximum number of points more than ε apart. For a probability measure Q and 1 ≤ p < ∞ we have the L^p metric d_{p,Q}(f, g) := (∫ |f − g|^p dQ)^{1/p}. For a class F ⊂ L^p(Q) let D^{(p)}(ε, F, Q) := D(ε, F, d_{p,Q}). Let D^{(p)}(ε, F) be the supremum of D^{(p)}(ε, F, Q) over all laws Q concentrated in finite sets. If F is a class of measurable real-valued functions on a measurable space (X, A), let F_F(x) := sup_{f∈F} |f(x)|. Then a measurable function F will be called an envelope function for F if and only if F_F ≤ F. If F_F is measurable it will be called the envelope function of F. For any law P on (X, A), F_F* is an envelope function for F, which in general depends on P.

Given F, an envelope function F for it, ε > 0, and 1 ≤ p < ∞, let D_F^{(p)}(ε, F, Q) be the supremum of m such that there exist f_1, …, f_m ∈ F for which ∫ |f_i − f_j|^p dQ > ε^p ∫ F^p dQ for all i ≠ j.
The next fact extends part of Theorem 4.6.1 to families of functions. The proof is also similar.
4.8.1 Theorem Let 1 ≤ p < ∞. Let (X, A, Q) be a probability space and F a VC subgraph class of measurable real-valued functions on X. Let F have an envelope F ∈ L^p(X, A, Q) with 0 < ∫ F dQ and F ≥ 1. Let C be the collection of subgraphs in X × ℝ of functions in F. Then for any W > S(C) there is an A < ∞ depending only on W and S(C) such that

(4.8.2)   D_F^{(p)}(ε, F, Q) ≤ A (2^{p−1}/ε^p)^W for 0 < ε ≤ 1.
Proof Given 0 < ε ≤ 1, take a maximal m and f_1, …, f_m as in the definition of D_F^{(p)}. First suppose p = 1. For any measurable set B let (FQ)(B) := ∫_B F dQ. Let Q_F := FQ/QF where QF := ∫ F dQ, so Q_F is a probability measure. Let k = k(m, ε) be the smallest integer such that e^{kε/2} > (m choose 2). Then k ≤ 1 + (4 log m)/ε. Let X_1, …, X_k be i.i.d. (Q_F). Given X_i, let Y_i be uniformly distributed on the interval [−F(X_i), F(X_i)], and such that the vectors
(X_i, Y_i) ∈ X × ℝ are independent for i = 1, …, k. Let C_j be the subgraph of f_j, j = 1, …, m. Then for all i and j ≠ s,

Pr((X_i, Y_i) ∈ C_j Δ C_s) = ∫ [|f_j(x) − f_s(x)|/(2F(x))] dQ_F(x) = ∫ |f_j − f_s| dQ / (2 ∫ F dQ) > ε/2.

Let A_{jsk} be the event that (X_i, Y_i) ∉ C_j Δ C_s for all i = 1, …, k. Then by independence, for each j ≠ s,

Pr(A_{jsk}) ≤ (1 − ε/2)^k < e^{−kε/2},

and

Pr{A_{jsk} occurs for some j ≠ s} ≤ (m choose 2) e^{−kε/2} < 1.

So with positive probability, for all j ≠ s there is some (X_i, Y_i) ∈ C_j Δ C_s. Fix X_i and Y_i, i = 1, …, k, for which this happens. Then the sets C_j ∩ {(X_i, Y_i)}_{i=1}^{k} are distinct for all j, so m_C(k) ≥ m. Let S := S(C). By the Sauer and Vapnik-Červonenkis lemmas (4.1.2 and 4.1.5), m_C(k) ≤ 1.5 k^S/S! for k ≥ S + 2. It follows that for some constant C depending only on S, m_C(k) ≤ C k^S for all k ≥ 1, where C ≤ 2^{S+1} − 1. We can assume that C ≥ 1. So

m ≤ C k^S ≤ C((1 + 4 log m)/ε)^S.

For any α > 0 there is an m_0 such that 1 + 4 log m ≤ m^α for m ≥ m_0, and then m^{1−αS} ≤ C/ε^S. Choosing α small enough so that αS < 1, we have m ≤ C_1/ε^{S/(1−αS)} for m ≥ m_0 and a constant C_1. For any W > S, we can solve W = S/(1 − αS) by α = (W − S)/(WS). Then m ≤ A ε^{−W} for A := max(m_0, C_1), and then m_0 and A are functions of W and S, finishing the proof for p = 1.
Now suppose p > 1. Let Q_{F,p} := F^{p−1} Q / Q(F^{p−1}). Then for i ≠ j,

ε^p ∫ F^p dQ < ∫ |f_i − f_j|^p dQ ≤ ∫ |f_i − f_j| (2F)^{p−1} dQ = Q((2F)^{p−1}) ∫ |f_i − f_j| dQ_{2F,p},

and Q_{2F,p} = Q_{F,p}. Thus by the case p = 1,

D_F^{(p)}(ε, F, Q) ≤ D_F^{(1)}(δ, F, Q_{F,p}) ≤ A δ^{−W},
where δ := ε^p Q(F^p)/[Q(F) Q((2F)^{p−1})]. Now by Hölder's inequality,

Q(F) = Q(F·1) ≤ (Q(F^p))^{1/p} and Q(F^{p−1}) = Q(F^{p−1}·1) ≤ (Q(F^p))^{(p−1)/p}.

So δ ≥ ε^p/2^{p−1} and the conclusion follows. □

The following fact is a continuation of Theorem 4.7.1.
4.8.3 Theorem Let F be a uniformly bounded class of functions on a set X.

(a) If F is a VC subgraph class, then

(4.8.4)   for some r < ∞ and M < ∞, D^{(2)}(ε, F) ≤ M ε^{−r} for 0 < ε ≤ 1;

(b) there exist classes F satisfying (4.8.4) which are not VC hull;
(c) there exist VC subgraph classes which are not VC hull.
Proof (a) There is a finite constant envelope function K for F, so for any f, g ∈ F and any law γ, ∫(f − g)² dγ ≤ 2K ∫|f − g| dγ. Thus (a) follows from Theorem 4.8.1 with r = 2W.

(b) It will be shown that there exist sequences F = {b_n 1_{A_n}}_{n≥1}, where we can take b_n = 1/n^v for any positive integer v, such that F is not VC hull. Clearly such a sequence will satisfy (4.8.4). For this, two lemmas will be helpful:
4.8.5 Lemma Let A_1,…,A_n be jointly independent events in a probability space (X, P) with P(A_i) = 1/2, i = 1,…,n. Let B_1,…,B_n be any events. Let D := ∪_{1≤j≤n} A_j Δ B_j. Then the algebra B generated by B_1,…,B_n has at least 2^n(1 − P(D)) nonempty atoms.
Proof For any F ⊂ {1,…,n} let

A_F := ∩_{j∈F} A_j ∩ ∩_{j∉F, j≤n} (X \ A_j),

and define B_F likewise. The atoms of B are those B_F which are not empty. For each F, P(A_F) = 1/2^n. If a point of A_F is not in D then it is also in B_F, which then is nonempty. Since D can include at most 2^n P(D) of the 2^n events A_F, the lemma follows.
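Lemma 4.8.5 can be checked by brute force on a small finite space. In the sketch below (Python; the construction is ours, not from the text) the space is {0,…,2^n − 1} with uniform law, A_j is the event "bit j is set," each B_j perturbs A_j on a single point, and the nonempty atoms of the algebra generated by the B_j are the distinct membership patterns:

```python
n = 4
space = range(2 ** n)
# A_j = "bit j of x is set": jointly independent, each of probability 1/2
A = [{x for x in space if (x >> j) & 1} for j in range(n)]
# perturb each A_j on the single point j to get B_j
B = [Aj ^ {j} for j, Aj in enumerate(A)]
D = set().union(*(A[j] ^ B[j] for j in range(n)))  # union of A_j Δ B_j
P_D = len(D) / 2 ** n
# nonempty atoms of the algebra generated by B_0,...,B_{n-1}
atoms = {tuple(x in Bj for Bj in B) for x in space}
print(len(atoms), P_D, len(atoms) >= 2 ** n * (1 - P_D))
```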
4.8.6 Lemma Suppose A_j are independent events with P(A_j) = 1/2 for all j, and C is a class of events such that for some K < ∞ and u < ∞, for each j there is an event D_j such that P(A_j Δ D_j) ≤ η_j, where D_j is in an algebra generated by at most K j^u elements of C and Σ_j η_j < 1. Then C is not a VC class.
Proof Let a := 1 − Σ_{j≥1} η_j > 0. By Lemma 4.8.5, for each m = 1, 2,…, the algebra D_m generated by D_1,…,D_m has at least 2^m a atoms. On the other hand, D_m is generated by at most Σ_{j=1}^m K j^u ≤ K ∫_1^{m+1} x^u dx ≤ K(m + 1)^{u+1}/(u + 1) sets in C. If C were a VC class, then by Theorems 4.6.2 and 4.6.3(c), and by Theorem 4.6.3(b) and Corollary 4.1.7, there would be t < ∞ and C < ∞ such that the number of atoms of the algebra generated by k elements of C is at most C k^t. Then 2^m a would be bounded above by a polynomial in m of degree at most (u + 1)t, a contradiction, so Lemma 4.8.6 is proved.
Now to prove Theorem 4.8.3(b), let X := [0, 1], let P := U[0, 1] be the uniform (Lebesgue) law, and let A_m be independent sets with P(A_m) = 1/2. Let v be a positive integer and F := {1_{A_m}/m^v}_{m≥1}. Then (4.8.4) holds for F.

Suppose F is a VC hull class for some C. Recall that P(f) := ∫ f dP. Suppose P(A) = 1/2 and a + b 1_A ∈ H(C, M) for some a ≥ 0, b > 0, and 0 < M < ∞. We can take M = 1, replacing a by a/M and b by b/M. Let r := S(C) < ∞. Recall that d_P(C, D) := P(C Δ D) for measurable sets C, D. Then for any w > r, specifically for w := r + 1, there is some C < ∞ with D(ε, C, d_P) ≤ C ε^{−w} for 0 < ε ≤ 1, by Corollary 4.1.7 and Theorem 4.6.1. Given 0 < β < 1, take C_j ∈ C and t_j with Σ_j |t_j| ≤ 1 such that P(|a + b 1_A − Σ_j t_j 1_{C_j}|) < β. Choose D_i ∈ C, i = 1,…,m, where m ≤ C β^{−r−1}, such that for each j there is an i := i(j) with P(C_j Δ D_{i(j)}) < β. Let f := Σ_j t_j 1_{D_{i(j)}}. Then P(|a + b 1_A − f|) < 2β. Let B := {f > a + b/2}. Then B is in the algebra generated by D_1,…,D_m, and P(A Δ B) ≤ 4β/b.

Apply what has just been shown to A = A_k for k = 1, 2,…, with a = a_k = 0, b = b_k = 1/k^v, and β = β_k = 1/(8k^{2+v}). Then there are sets B = B_k and m = m_k ≤ N k^u for some N < ∞, where u := (r + 1)(2 + v) < ∞. To apply Lemma 4.8.6, let η_k = 1/(2k²), to get a contradiction. So (b) is proved.

For (c), let a = a_k = 1/k and b = b_k = 1/k², to get a VC subgraph class as in the proof of Theorem 4.7.1(d). The same argument as for (b), now with v = 2, again gives a contradiction, so (c) holds.

It will be shown in Proposition 10.1.2 that there are VC major (thus VC hull) classes which do not satisfy (4.8.4) and so are not VC subgraph classes.
**4.9 Further facts about VC classes

This section states facts without proofs.
4.9.1 Projections of semialgebraic sets

Let's consider whether the VC property is preserved by projection. Let C be a collection of subsets of a product space X × Y, say R². Let π_X(x, y) := x for each x ∈ X and y ∈ Y. For a set A ∈ C let π_X[A] := {π_X(u) : u ∈ A}. Let π_X[[C]] := {π_X[A] : A ∈ C}. It is possible for C to be a VC class, with S(C) = 1, while π_X[[C]] is not a VC class. In fact, if the sets in C are disjoint, then S(C) = 1 by Theorem 4.2.6. Let D be an arbitrary collection of no more than c (the cardinality of the continuum) subsets of R, say D = {D_y : y ∈ R}. For each y let A_y := {(x, y) : x ∈ D_y}. Then the sets A_y are all disjoint, with π_X[A_y] = D_y for each y, but the sets D_y need not form a VC class, even for a countably infinite sequence of sets D_y. So it is interesting that for certain sets of an algebraic nature, projection does preserve the VC property. A semialgebraic set in R^d is a set in the Boolean algebra generated by all sets pos(f) := {x : f(x) > 0}, where f is a polynomial in d variables. Stengle and Yukich (1989) proved the following:
Theorem Let P(·, ·, ·, ·) be a fixed real polynomial on R^{m+d+p+q}, P = P(x, y, t, u), where x ∈ R^m, y ∈ R^d, t ∈ R^p, and u ∈ R^q. Let T ⊂ R^p and U ⊂ R^q be fixed semialgebraic sets. Then the family of all subsets of R^m of the form W_y := {x ∈ R^m : sup_{t∈T} inf_{u∈U} P(x, y, t, u) > 0}, as y varies over R^d, is a Vapnik-Cervonenkis class.
For the proof, see Stengle and Yukich (1989). They and Laskowski (1992) also give other ways of generating Vapnik-Cervonenkis classes.
4.9.2 The bound of Haussler (1995)

For a class C of sets, and positive integers k and n, let M(k, n, C) be the largest m such that for some set F with |F| = n and some A_1,…,A_m in {A ∩ F : A ∈ C}, each symmetric difference A_i Δ A_j := (A_i \ A_j) ∪ (A_j \ A_i) for i ≠ j contains at least k points.

Theorem (Haussler, 1995) For any VC class C with S(C) = d, and integers 1 ≤ k ≤ n, we have

M(k, n, C) ≤ e(d + 1)(2e(n + 1)/(k + 2d + 2))^d ≤ e(d + 1)(2en/k)^d.
When the proof of Theorem 4.6.1 is applied to the given situation, it gives an additional logarithmic factor which Haussler's theorem shows to be unnecessary. Haussler's proof is partly based on a result of Alon and Tarsi (1992).
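The inequality chain in Haussler's bound is easy to sanity-check numerically; a sketch (Python; it checks only the arithmetic relation between the two displayed bounds, with function names of our choosing):

```python
from math import e

def haussler_bound(k, n, d):
    # e(d+1) * (2e(n+1)/(k+2d+2))^d, the sharper displayed bound
    return e * (d + 1) * (2 * e * (n + 1) / (k + 2 * d + 2)) ** d

def weaker_bound(k, n, d):
    # e(d+1) * (2en/k)^d, the weaker displayed bound
    return e * (d + 1) * (2 * e * n / k) ** d

ok = all(
    haussler_bound(k, n, d) <= weaker_bound(k, n, d) + 1e-9
    for d in range(1, 6)
    for n in range(1, 40)
    for k in range(1, n + 1)
)
print(ok)
```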
Problems

1. Let C be the class of all unions of two intervals in R. Evaluate S(C). Hint: Try it first directly; if you like, look at the more general Problem 11.
2. If S(C) = 3, find the upper bounds for m^C(n) given by Theorem 4.1.2 and by Proposition 4.1.5.

3. Show that for dens(C) = 0, S(C), which is finite by Corollary 4.1.7, can be arbitrarily large. Hint: Let C be finite.

4. Find the smallest n such that there is a set X with |X| = n and C ⊂ 2^X with S(C) = 1 where neither (a) nor (b) in Theorem 4.2.6 holds.

Hints for Problems 5-7: If C is a collection of convex sets in R^d and shatters a set F, then no point in F is in the convex hull of the other points. Thus the convex hull of F is a polyhedron of which each point of F is a vertex; in the plane, it's a polygon. To get a lower bound S(C) ≥ k, it's enough to find one set of k elements that is shattered; try the vertices of a regular k-gon. To get upper bounds, use facts such as Theorem 4.2.1 and Proposition 4.5.6.

5. Let C be the set of all interiors of ellipses in R², with arbitrary centers and semiaxes in any two perpendicular directions. Give upper and lower bounds for S(C).
6. A half-plane in R² is a set of the form {(x, y) : ax + by ≥ c} for real a, b, c with a and b not both 0. Define a wedge as an intersection of two half-planes. Let C be the collection of all wedges in R². Show that S(C) ≥ 5. Also find an upper bound for S(C).

7. Let C be the set of all interiors of triangles in R². Show that S(C) ≥ 7. Also give an upper bound for S(C).
8. Show that the lower bounds for S(C) in Problems 6 and 7 are the values of S(C). Hint: For a convex polygon, the set F of vertices can be arranged in cyclic order, say clockwise around the boundary of the polygon, v_1, v_2,…,v_n, v_1. Show that if a half-plane J contains v_i and v_j with i < j then it includes either {v_i, v_{i+1},…,v_j} or {v_j, v_{j+1},…,v_n, v_1,…,v_i}. Thus find what kind of subset of F the intersection of J with F must be. From that, find what occurs if two or three half-planes are intersected (or unioned, via complements).
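Problem 1, and the more general Problem 11, can be explored by brute force. The sketch below (Python) relies on an observation of ours, not stated in the text: the trace of a union of at most j intervals on a finite subset of R is exactly a subset with at most j maximal blocks of consecutive points.

```python
from itertools import combinations
from math import comb

def runs(subset):
    # number of maximal blocks of consecutive integers in subset
    return sum(1 for i in subset if i - 1 not in subset)

def trace_count(n, j):
    # subsets of F = {0,...,n-1} realizable as (union of <= j intervals) ∩ F
    return sum(
        1
        for size in range(n + 1)
        for sub in combinations(range(n), size)
        if runs(set(sub)) <= j
    )

# Problem 11: the count equals sum_{i <= 2j} C(n, i)
identity_ok = all(
    trace_count(n, j) == sum(comb(n, i) for i in range(2 * j + 1))
    for n in range(1, 9)
    for j in range(1, 4)
)

def shattered(n, j):
    # F is shattered iff every one of its 2^n subsets is a trace
    return trace_count(n, j) == 2 ** n

# Problem 1 (j = 2): four points are shattered, five are not, so S(C) = 4
print(identity_ok, shattered(4, 2), shattered(5, 2))
```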
9. In the example at the end of Section 4.3, for each set A ⊂ X with three elements, find a specific subset of A not in {A ∩ C : C ∈ C}.

10. Let F be the class of all probability distribution functions on R. Show that F is a VC major class but not a VC subgraph class. Hint: Show that the subgraphs of functions in F shatter all sets {(x_j, y_j) : 1 ≤ j ≤ n} with x_1 < ⋯ < x_n and y_1 < ⋯ < y_n.

11. Let C(j) be the class of all unions of at most j intervals in R. Show that for every finite F ⊂ R with |F| = n we have Δ^{C(j)}(F) = Σ_{i≤2j} (n choose i), the largest possible value by Sauer's lemma. Hints: One can take F = {1, 2,…,n}. For A ⊂ F and x, y ∈ F let x ≡_A y mean that A ∩ {x, y} = ∅ or {x, y}; otherwise x ≢_A y. If A ≠ ∅ let j_1 := j_1(A) be the least element of A. If j_1(A),…,j_k(A) are defined, let j_{k+1}(A) be the least j > j_k(A) such that j ≢_A j_k(A) and j ≤ n, if there is such a j. Show that there is a 1-1 correspondence between subsets A ⊂ F and finite increasing sequences r(A) := (j_1(A), j_2(A),…), where r(∅) is the empty sequence. Show that A ∈ {C ∩ F : C ∈ C(j)} if and only if r(A) has at most 2j terms.

Notes
Notes to Section 4.1. The definitions of Δ^C, m^C, and (in effect) V(C) appeared in the announcement by Vapnik and Cervonenkis (1968). In their 1971 paper they had a weaker form of Theorem 4.1.2.
Notes to Section 4.4. This section is also based on Dudley (1985). Theorem 4.4.6 and other characterizations of trees are given in Harary (1969, pp. 32-33).
Notes to Section 4.5. Some of the facts in this section are from Dudley (1984,
Section 9.2). Theorems 4.5.1 and 4.5.3 on the density are due to Assouad (1983). Assouad also told me Proposition 4.5.6 in a letter. Theorem 4.5.8 and Corollary 4.5.11 are from Wenocur and Dudley (1981).
Notes to Section 4.6. Theorem 4.6.1 is partly given in Dudley (1978, Section 7) and was stated in the current form by Assouad (1981, 1983). Haussler (1995) proved a sharper form. Assouad (1983) also proved Theorem 4.6.2, first defined dens*(.), and proved Theorem 4.6.3.
Notes to Section 4.7. VC subgraph classes have been called "VC graph" classes (Alexander 1984, 1987) or "polynomial classes" (Pollard 1984, pp. 17, 34; 1985). This section is based on part of Dudley (1987).

Notes to Section 4.8. Theorem 4.8.1 is essentially Lemma 25, p. 27, of Pollard (1984) and had appeared earlier in Pollard (1982). Theorem 4.8.3, parts (b) and (c), are based on Section 3 of Dudley (1987).
References

Alexander, Kenneth S. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm. Ann. Probab. 12, 1041-1067.
Alexander, K. S. (1987). The central limit theorem for empirical processes on Vapnik-Cervonenkis classes. Ann. Probab. 15, 178-203.
Alon, N., and Tarsi, M. (1992). Colorings and orientations of graphs. Combinatorica 12, 125-134.
Assouad, Patrice (1981). Sur les classes de Vapnik-Cervonenkis. C. R. Acad. Sci. Paris Sér. I 292, 921-924.
Assouad, P. (1983). Densité et dimension. Ann. Inst. Fourier (Grenoble) 33, no. 3, 233-282.
Danzer, L., Grünbaum, B., and Klee, V. L. (1963). Helly's theorem and its relatives. Proc. Symp. Pure Math. (Amer. Math. Soc.) 7, 101-180.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899-929; Correction 7 (1979), 909-911.
Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de St.-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Dudley, R. M. (1985). The structure of some Vapnik-Cervonenkis classes. In Proc. Berkeley Conf. in Honor of J. Neyman and J. Kiefer 2, 495-508. Wadsworth, Belmont, CA.
Dudley, R. M. (1987). Universal Donsker classes and metric entropy. Ann. Probab. 15, 1306-1326.
Harary, F. (1969). Graph Theory. Addison-Wesley, Reading, MA.
Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. J. Combin. Theory (A) 69, 217-232.
Laskowski, M. C. (1992). Vapnik-Chervonenkis classes of definable sets. J. London Math. Soc. (Ser. 2) 45, 377-384.
Pollard, David (1982). A central limit theorem for k-means clustering. Ann. Probab. 10, 919-926.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York.
Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295-314.
Radon, J. (1921). Mengen konvexer Körper, die einen gemeinsamen Punkt enthalten. Math. Ann. 83, 113-115.
Sauer, N. (1972). On the density of families of sets. J. Combin. Theory (A) 13, 145-147.
Shelah, S. (1972). A combinatorial problem: stability and order for models and theories in infinitary languages. Pacific J. Math. 41, 247-261.
Stengle, G., and Yukich, J. E. (1989). Some new Vapnik-Chervonenkis classes. Ann. Statist. 17, 1441-1446.
Vapnik, V. N., and Cervonenkis, A. Ya. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR 181, 781-783 (Russian) = Sov. Math. Doklady 9, 915-918 (English).
Vapnik, V. N., and Cervonenkis, A. Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theor. Probability Appls. 16, 264-280 = Teor. Verojatnost. i Primenen. 16, 264-279.
Vapnik, V. N., and Cervonenkis, A. Ya. (1974). Teoriya Raspoznavaniya Obrazov: Statisticheskie problemy obucheniya [Theory of Pattern Recognition: Statistical Problems of Learning; in Russian]. Nauka, Moscow. German ed.: Theorie der Zeichenerkennung, by W. N. Wapnik and A. J. Tscherwonenkis, transl. by K. G. Stöckel and B. Schneider, ed. S. Unger and K. Fritzsch. Akademie-Verlag, Berlin, 1979 (Elektronisches Rechnen und Regeln, Sonderband).
Wenocur, R. S., and Dudley, R. M. (1981). Some special Vapnik-Cervonenkis classes. Discrete Math. 33, 313-318.
5 Measurability
Let A be the set of all possible empirical distribution functions F_1 for one observation x ∈ [0, 1], namely F_1(t) = 0 for t < x and F_1(t) = 1 for t ≥ x. We noted previously that A in the supremum norm is nonseparable: it is an uncountable set in which any two points are at distance 1 apart. Thus A and all its subsets are closed. If x := X_1 has a continuous distribution, such as the uniform distribution U[0, 1] on [0, 1], then x ↦ (t ↦ 1_{t≥x}) takes [0, 1] onto A, but it is not continuous for the supremum norm. Also, it is not measurable for the Borel σ-algebra on the range space. So, in Chapter 3, functions f* and upper expectations E* were used to get around measurability problems.

Here is a different kind of example. It is related to the basic "ordinal triangle" counterexample in integration theory, showing why measurability is needed in the Tonelli-Fubini theorem on Cartesian product integrals. Let (Ω, <) be an uncountable well-ordered set such that for each x ∈ Ω, the initial segment I_x := {y : y < x} is countable. (In terms of ordinals, Ω is, or is order-isomorphic to, the least uncountable ordinal.) Let S be the σ-algebra of subsets of Ω consisting of sets that are countable or have countable complement. Let P be the probability measure on S which is 0 on countable sets and 1 on sets with countable complement. Then

∫∫ 1_{y<x} dP(y) dP(x) = 0 < ∫∫ 1_{y<x} dP(x) dP(y) = 1.

Since all other conditions of the Tonelli-Fubini theorem hold, the function (x, y) ↦ 1_{y<x} must not be measurable for the product σ-algebra, even if S is replaced by any larger σ-algebra of subsets of Ω to which P can be extended. For example, under the continuum hypothesis, we could take Ω to be [0, 1] (where the well-ordering is unrelated to the usual ordering), and P to be Lebesgue measure or any other nonatomic law on [0, 1].
Now consider the class C of sets I_x for x ∈ Ω. Each of these sets is countable by assumption. The sets are linearly ordered by inclusion, since < is a linear ordering. Thus S(C) = 1 by Theorem 4.2.6. But C is not a weak or strong Glivenko-Cantelli class as defined in Section 3.3 (still less a Donsker class), since for any possible values of X_1,…,X_n, a maximum x := max(X_1,…,X_n) for the well-ordering exists, and then for any z > x, P_n(I_z) = 1 while P(I_z) = 0, so

sup_{A∈C} |(P_n − P)(A)| = 1.

In Sections 5.2 and 5.3, some measurability conditions will be developed which hold for classes of sets encountered in practice. It will be shown in the next chapter that these conditions, together with the Vapnik-Cervonenkis or related properties, are enough to imply Glivenko-Cantelli and Donsker properties.
*5.1 Sufficiency

Sufficiency is a concept from mathematical statistics. Suppose that a probability measure P is known to be in a certain family P of laws and we have observed X_1,…,X_n i.i.d. (P), but nothing else is known about P. A statistic T, which is a measurable function of X_1,…,X_n, will, roughly speaking, be said to be sufficient for P if, given T, no further information about X_1,…,X_n is useful in making decisions or inferences about P ∈ P. A precise definition is given below. This section will show that the empirical measure P_n is sufficient even when P is the family of all probability measures on a measurable space. Note that P_n is a symmetric function of the X_i, in the sense that it is preserved by any permutation of the indices 1,…,n. Once P_n is given, knowing that the X_i were observed in a certain order will not help in making inferences about P.

Here is the formal definition of sufficiency: let (Y, S) be a measurable space (a set Y and a σ-algebra S of subsets of Y). Let Q be a set of probability laws on (Y, S). A sub-σ-algebra B of S is called sufficient for Q if for every C ∈ S there is some B-measurable function g_C such that for every Q ∈ Q, the conditional probability

(5.1.1) Q(C | B) = g_C almost surely for Q.

The essential point is that g_C does not depend on Q in Q. Most often, there will be some n ≥ 1, a measurable space (X, A), and a family P of laws on (X, A) such that Y is the n-fold Cartesian product X^n with the product σ-algebra S = A^n and Q = P^n := {P^n : P ∈ P}, where P^n is the n-fold Cartesian product P × P × ⋯ × P (RAP, Theorem 4.4.6).
The meaning of sufficiency is clarified by the factorization theorem, to be proved next. A family P of probability measures on a measurable space (S, B) is said to be dominated by a measure µ if every P ∈ P is absolutely continuous with respect to µ. Then we have the densities (Radon-Nikodym derivatives) dP/dµ (RAP, Section 5.5). If there is a nonatomic law on (X, A), the family P of all laws on (X, A) is not dominated. Factorization is still useful in that case, in the proof of Theorem 5.1.7.

5.1.2 Theorem (Factorization theorem) Let (S, B) be a measurable space, A a sub-σ-algebra of B, and P a family of probability measures on B, dominated by a σ-finite measure µ. Then A is sufficient for P if and only if there is a B-measurable function h ≥ 0 such that for all P ∈ P there is an A-measurable function f_P with dP/dµ = f_P h almost everywhere for µ. We can take h ∈ L¹(S, B, µ).
Proof Two measures are called equivalent if each is absolutely continuous with respect to the other. Two families P and Q of probability measures on a σ-algebra B are called equivalent if and only if for all B ∈ B, (P(B) = 0 for all P ∈ P) is equivalent to (Q(B) = 0 for all Q ∈ Q). Two lemmas and another theorem will be useful in the proof.

5.1.3 Lemma If a family P of probability measures on a measurable space (S, B) is dominated by a σ-finite measure µ, there is a countable subfamily of P equivalent to P.

Proof For each P ∈ P, let f_P be (a specific version of) the Radon-Nikodym derivative (density) dP/dµ and let K_P := {x : f_P(x) > 0}. A set K ∈ B such that for some P ∈ P, K ⊂ K_P and µ(K) > 0, will be called a kernel. A chain is any union of disjoint kernels. If µ is not finite, let S = ∪_{n≥1} A_n where the A_n are disjoint measurable sets with 0 < µ(A_n) < ∞. Let ν(A) := Σ_{n≥1} µ(A ∩ A_n)/[2^n µ(A_n)]. Then ν and µ are equivalent (mutually absolutely continuous), so replacing µ by ν, we can assume µ is finite.

Let C_1, C_2,… be chains such that sup_n µ(C_n) = sup{µ(C) : C a chain}. Let D_1 := C_1. Given a chain D_{n−1}, let C_n = ∪_{j≥1} K_{nj} where the K_{nj} are disjoint kernels (for n fixed). Let D_{nj} := K_{nj} \ D_{n−1} if it is a kernel, that is, if µ(K_{nj} \ D_{n−1}) > 0; otherwise let D_{nj} be empty. Let D_n := D_{n−1} ∪ ∪_{j≥1} D_{nj}, which is a chain. Let D := ∪_n D_n, a chain of maximal measure. Then D = ∪_n K_n for some disjoint kernels K_n. Choose P_n := P(n) in P
such that K_n ⊂ K_{P(n)} and P_n(K_n) > 0. To show that {P_n} is equivalent to P, suppose not. Then for some B ∈ B, P_n(B) = 0 for all n and P(B) > 0 for some P ∈ P. If P(B \ D) > 0 then some kernel is included in B \ D, contradicting the choice of D. So we can assume B ⊂ D, and for some n, P(B ∩ K_n) > 0. Then P_n(B ∩ K_n) > 0, since µ(B ∩ K_n) > 0 and K_n is a kernel for P_n. This contradiction completes the proof.

5.1.4 Lemma Any countable family of σ-finite measures µ_n is equivalent to one finite measure µ.

Proof First, as in the last proof, we can assume µ_n(S) ≤ 1 for all n. Then let µ := Σ_{n≥1} µ_n/2^n, which is finite and equivalent to {µ_n}_{n≥1}.

The next step in the proof of Theorem 5.1.2 is:
5.1.5 Theorem Under the hypotheses of Theorem 5.1.2, the following are equivalent:

(a) A is sufficient for P;
(b) there is a probability measure λ such that {λ} is equivalent to P and dP/dλ is equal P-a.s. to an A-measurable function for all P ∈ P;
(c) there is a probability measure λ which dominates P such that dP/dλ is A-measurable for all P ∈ P.
Proof First it will be shown that (c) implies (a). For any B ∈ B let f_B := E_λ(1_B | A). Then for any P ∈ P and A ∈ A, since dP/dλ is A-measurable, by RAP, Theorem 10.1.9,

∫_A f_B dP = ∫_A E_λ(1_B | A) (dP/dλ) dλ = ∫_A 1_B (dP/dλ) dλ = P(A ∩ B),

so f_B = E_P(1_B | A) and A is sufficient.

(a) implies (b): if A is sufficient, take a sequence {P_n} ⊂ P equivalent to P by Lemma 5.1.3. Let λ := Σ_{n≥1} P_n/2^n. For each B ∈ B take f_B = E_P(1_B | A) P-a.s. for all P ∈ P. Since 0 ≤ f_B ≤ 1 P-almost surely for all P in P, also 0 ≤ f_B ≤ 1 almost surely for λ. Since P_n(B ∩ A) = ∫_A f_B dP_n for all n and all A ∈ A, it follows that λ(B ∩ A) = ∫_A f_B dλ, so f_B = E_λ(1_B | A). To show that for all P ∈ P, dP/dλ is A-measurable, note that for each B in B,

∫_B (dP/dλ) dλ = P(B) = ∫ f_B dP = ∫ E_λ(1_B | A) (dP/dλ) dλ,
which by RAP, Theorem 10.1.9 again, equals

∫ E_λ(1_B | A) E_λ((dP/dλ) | A) dλ = ∫ E_λ(1_B E_λ((dP/dλ) | A) | A) dλ = ∫ 1_B E_λ((dP/dλ) | A) dλ.

Since dP/dλ and E_λ((dP/dλ) | A) have the same integral over all sets in B, they must be equal λ-a.s., so that dP/dλ is equal P-a.s. to an A-measurable function. So (a) implies (b). Clearly (b) implies (c).

Now to prove the factorization theorem (5.1.2): if A is sufficient, take λ from the last theorem and let h := dλ/dµ. Letting f_P := dP/dλ, we have the desired factorization, with h ∈ L¹(S, B, µ) (in fact ∫ h dµ = 1). Conversely, if factorization holds (with h not necessarily integrable), then by Lemma 5.1.3 take {P_n}_{n≥1} equivalent to P and let λ := Σ_{n≥1} P_n/2^n, so that {λ} is equivalent to P as in the proof of Theorem 5.1.5. Then λ is absolutely continuous with respect to µ and dλ/dµ = hk, where k(x) := Σ_{n≥1} f_{P_n}(x)/2^n for all x. Also, k is A-measurable. For each P ∈ P let g_P(x) := f_P(x)/k(x) if k(x) > 0, and g_P(x) := 0 otherwise. Then λ(k = 0) = 0, and for each P ∈ P, g_P is A-measurable, P is absolutely continuous with respect to λ, and dP/dλ = g_P. So A is sufficient for P by Theorem 5.1.5, and Theorem 5.1.2 is proved.

Given a statistic T, that is, a measurable function from S into Y for measurable spaces (S, B) and (Y, F), let A := T^{−1}(F) := {T^{−1}(A) : A ∈ F}, a σ-algebra. For a family Q of laws on (S, B), T is called a sufficient statistic for Q if A is sufficient for Q. If T is sufficient, we can write f_P = g_P ∘ T for some F-measurable function g_P by RAP, Theorem 4.2.8.

Sufficiency, defined in terms of conditional probabilities of measurable sets, can be extended to suitable conditional expectations:
5.1.6 Theorem Let A be sufficient for a family P of laws on a measurable space (S, B). Then for any measurable real-valued function f on (S, B) which is integrable for each P ∈ P, there is an A-measurable function g such that g = E_P(f | A) a.s. for all P ∈ P.

Proof When f is the indicator function of a set in B, the assertion is the definition of sufficiency. It then follows for any simple function, which is a finite linear combination of such indicators. If f is nonnegative, there is a sequence of nonnegative simple functions increasing up to f, and the conclusion
holds (RAP, Proposition 4.1.5 and Theorem 10.1.7). Then any f satisfying the hypothesis can be written as f = f⁺ − f⁻ with f⁺ and f⁻ nonnegative, and the result follows.
Let µ and ν be two probability measures on the same measurable space (V, U). Take the Lebesgue decomposition (RAP, Theorem 5.5.3) ν = ν_ac + ν_s, where ν_ac is absolutely continuous and ν_s is singular with respect to µ. Let A ∈ U with ν_s(A) = µ(V \ A) = 0, so ν_ac(V \ A) = 0. Then the likelihood ratio R_{ν/µ} is defined as the Radon-Nikodym derivative dν_ac/dµ on A and as +∞ on V \ A. By uniqueness of the Hahn decomposition of V for ν_s − µ (RAP, Theorem 5.6.1), R_{ν/µ} is defined up to equality (µ + ν_s)- and so (µ + ν)-almost everywhere.
5.1.7 Theorem For any family P of laws on a measurable space (S, B) and sub-σ-algebra A ⊂ B, A is sufficient for P if and only if for all P, Q ∈ P, R_{Q/P} can be taken to be A-measurable, that is, it is equal (P + Q)-almost everywhere to an A-measurable function.
Proof Only if: suppose A is sufficient for P. Then it is also sufficient for {P, Q}, which is dominated by µ := P + Q. So by factorization (Theorem 5.1.2) there are A-measurable functions f_P and f_Q and a B-measurable function h such that dP/dµ = f_P h and dQ/dµ = f_Q h. Then R_{Q/P} = f_Q h/(f_P h) = f_Q/f_P, where y/0 := +∞ if y > 0 and 0/0 := 0, is A-measurable (note that R_{Q/P} does not depend on the choice of dominating measure).

Conversely, given P, Q ∈ P, suppose R_{Q/P} is A-measurable. Again let µ := P + Q. Then dP/dµ = 1/(1 + R_{Q/P}) and dQ/dµ = R_{Q/P}/(1 + R_{Q/P}) almost everywhere for µ, so dP/dµ and dQ/dµ can be chosen to be A-measurable. Thus by factorization again (this time with h = 1), A is sufficient for {P, Q}. So for all B ∈ B, P(B | A) = Q(B | A) almost surely for P and Q. Taking P fixed and letting Q vary over P, we get that A is sufficient for P.

Suppose we observe X_1,…,X_n i.i.d. with law P or Q, but we do not know which and want to decide. Suppose we have no a priori reason to favor a choice of P or Q, only the data. Then it is natural to evaluate the likelihood ratio R_{Q^n/P^n} and choose Q if R_{Q^n/P^n} > 1 and P if R_{Q^n/P^n} < 1, while if R_{Q^n/P^n} = 1 we still have no basis to prefer P or Q. More generally, decisions between P and Q can be made optimally, in terms of minimizing error probabilities or expected losses, by way of the likelihood ratio R_{Q/P} or R_{Q^n/P^n} as appropriate (the Neyman-Pearson lemma; Lehmann, 1986, pp. 74, 125;
Ferguson, 1967, Section 5.1). By Theorem 5.1.7, if B is sufficient for P^n for some P ⊇ {P, Q}, then R_{Q^n/P^n} is B-measurable. Specifically, if T is a sufficient statistic, then by Theorem 5.1.7 and RAP, Theorem 4.2.8, R_{Q^n/P^n} is a measurable function of T. Thus no information in (X_1,…,X_n) beyond T is helpful in choosing P or Q. In this sense, the definition of sufficiency fits with the informal notion of sufficiency given at the beginning of the section.

It will be shown that empirical measures are sufficient in a sense to be defined. Let S_n be the sub-σ-algebra of A^n consisting of sets invariant under all permutations of the coordinates.
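The role of sufficiency in this decision problem can be illustrated with two discrete laws: the product likelihood ratio R_{Q^n/P^n} depends on the sample only through the count of each value, i.e., through the empirical measure, and so is invariant under permutations of the observations. A sketch (Python; the two laws are our toy choices):

```python
from collections import Counter
from math import prod

# two laws on {0, 1, 2}, mutually absolutely continuous
P = {0: 0.5, 1: 0.3, 2: 0.2}
Q = {0: 0.2, 1: 0.3, 2: 0.5}

def likelihood_ratio(sample):
    # R_{Q^n/P^n}(x) = prod_i Q({x_i}) / P({x_i})
    return prod(Q[v] / P[v] for v in sample)

def ratio_from_counts(sample):
    # the same ratio computed from the empirical counts alone
    return prod((Q[v] / P[v]) ** k for v, k in Counter(sample).items())

x = [0, 2, 2, 1, 0, 2]
y = [2, 0, 1, 2, 0, 2]  # a permutation of x
r = likelihood_ratio(x)
print(abs(r - likelihood_ratio(y)) < 1e-12, abs(r - ratio_from_counts(x)) < 1e-12)
```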
5.1.8 Theorem S_n is sufficient for P^n := {P^n : P ∈ P}, where P is the set of all laws on (X, A).

Proof Let 𝔖_n be the symmetric group of all permutations of {1, 2,…,n}. For each π ∈ 𝔖_n and x := (x_1,…,x_n) ∈ X^n, set f_π(x) := (x_{π(1)},…,x_{π(n)}). Then f_π is a 1-1 measurable transformation of X^n onto itself with measurable inverse, and it preserves the product law P^n for each law P on (X, A). For any C ∈ A^n, we have

P^n(C | S_n) = (1/n!) Σ_{π∈𝔖_n} 1_{f_π(C)}(·)

almost surely for P^n, since for any B ∈ S_n,

P^n(B ∩ C) = (1/n!) Σ_{π∈𝔖_n} P^n(f_π(B ∩ C)) = (1/n!) Σ_{π∈𝔖_n} P^n(f_π(C) ∩ B).

The conclusion follows.
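The displayed formula for P^n(C | S_n) can be verified exhaustively on a tiny space. The sketch below (Python; X = {0, 1}, n = 3, and the events C and B are our choices) checks that integrating the symmetrized indicator over a permutation-invariant set B gives P^n(B ∩ C):

```python
from itertools import permutations, product
from math import factorial

n = 3
p = 0.3  # P({1}); P({0}) = 1 - p, for a law P on X = {0, 1}
space = list(product([0, 1], repeat=n))
weight = {x: p ** sum(x) * (1 - p) ** (n - sum(x)) for x in space}  # P^n

C = {(1, 0, 0), (1, 1, 0)}             # an arbitrary event in X^n
B = {x for x in space if sum(x) == 1}  # a permutation-invariant event

perms = list(permutations(range(n)))

def f(pi, x):  # coordinate permutation f_pi
    return tuple(x[pi[i]] for i in range(n))

def cond(x):  # candidate version of P^n(C | S_n): average over permutations
    return sum(f(pi, x) in C for pi in perms) / factorial(n)

lhs = sum(cond(x) * weight[x] for x in B)   # integral of cond over B
rhs = sum(weight[x] for x in B if x in C)   # P^n(B ∩ C)
print(abs(lhs - rhs) < 1e-12)
```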
For example, if X has just two points, say X = {0, 1}, and S := Σ_{i=1}^n x_i, then S_n is the smallest σ-algebra for which S is measurable. In this case no σ-algebra strictly smaller than S_n is sufficient (S_n is "minimal sufficient"). For each B ∈ A and x = (x_1,…,x_n) ∈ X^n, let

P_n(B)(x) := (1/n) Σ_{j=1}^n 1_B(x_j).

So P_n is the usual empirical measure, except that in this section, x ↦ P_n(B)(x) is a measurable function, or statistic, on a measurable space rather than a
probability space, since no particular law P or P^n has been specified as yet. Here, P_n(B)(x) is just a function of B and x.

For a collection F of measurable functions on (X^n, A^n), let S_F be the smallest σ-algebra making all functions in F measurable. Then F will be called sufficient if and only if S_F is sufficient.
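For the two-point example above, sufficiency of S can also be seen directly: under Bernoulli(θ)^n, the conditional law of (X_1,…,X_n) given S = s is uniform on the (n choose s) arrangements, whatever θ is. A sketch (Python; the parametrization is ours):

```python
from itertools import product

n = 4

def conditional_given_S(theta, s):
    # conditional law of (X_1,...,X_n) given sum = s, under Bernoulli(theta)^n
    pts = [x for x in product([0, 1], repeat=n) if sum(x) == s]
    w = [theta ** s * (1 - theta) ** (n - s) for _ in pts]
    total = sum(w)
    return {x: wi / total for x, wi in zip(pts, w)}

c1 = conditional_given_S(0.3, 2)
c2 = conditional_given_S(0.8, 2)
# the conditional law does not depend on theta: it is uniform on C(4, 2) = 6 points
same = all(abs(c1[x] - c2[x]) < 1e-12 for x in c1)
print(same, len(c1))
```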
5.1.9 Theorem For any measurable space (X, A) and for each n = 1, 2,…, the empirical measure P_n is sufficient for P^n, where P is the set of all laws on (X, A). In other words, the set F of functions x ↦ P_n(B)(x), for all B ∈ A, is sufficient. In fact the σ-algebra S_F is exactly S_n.
Proof Clearly S_F ⊂ S_n. To prove the converse inclusion, for each set B ∈ A^n let S(B) := ∪_{π∈𝔖_n} f_π(B) ∈ S_n. Then if B ∈ S_n, S(B) = B. Let E := {C ∈ A^n : S(C) ∈ S_F}. We want to prove E = A^n. Now E is a monotone class: if C_m ∈ E and C_m ↑ C or C_m ↓ C, then C ∈ E. Also, since S commutes with finite unions, any finite union of sets in E is in E. So it will be enough to prove that

(5.1.10) A_1 × ⋯ × A_n ∈ E for any A_j ∈ A,

since the collection C of finite unions of such sets, which can be taken to be disjoint, is an algebra, and the smallest monotone class including C is A^n (RAP, Propositions 3.2.2 and 3.2.3 and Theorem 4.4.2). Here, by another finite union, the A_j can be replaced by sets B_{j(i)} where B_1,…,B_r are the atoms of the algebra generated by A_1,…,A_n, so r ≤ 2^n. So we just need to show that S(B_{j(1)} × ⋯ × B_{j(n)}) ∈ S_F for all j(1),…,j(n) with 1 ≤ j(i) ≤ r, i = 1,…,n. Now, x ∈ S(B_{j(1)} × ⋯ × B_{j(n)}) if and only if, for each i = 1,…,r, P_n(B_i) = k_i/n, where k_i is the number of values of s with j(s) = i, s = 1,…,n. So S_F = S_n, and by Theorem 5.1.8 the conclusion follows.
For some subclasses C ⊂ A, the restriction of P_n to C may be sufficient, and handier than the values of P_n on the whole σ-algebra A. Recall that a class C included in a σ-algebra A is called a determining class if any two measures on A, equal and finite on C, are equal on all of A. If C generates the σ-algebra A, C is not necessarily a determining class unless, for example, it is an algebra (RAP, Theorem 3.2.7 and the example after it). Sufficiency of P_n(A) for A ∈ C can depend on n. Let X = {1, 2, 3, 4, 5}, A = 2^X, and C = {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}}. Then C is sufficient for n = 1, but not for n = 2, since for example (δ_1 + δ_4)/2 = (δ_2 + δ_5)/2 on C. This is a case where C generates A but is not a determining class.
5.1.11 Theorem Let (X, d) be a separable metric space which is a Borel subset of its completion, with Borel σ-algebra A. Suppose C = {C_k}_{k≥1} is a countable determining class for A. Then for each n = 1, 2,…, the sequence {P_n(C_k)}_{k≥1} is sufficient for the class P^n of all laws P^n on (X^n, A^n), where P ∈ P, the class of all laws on (X, A).

Proof For n = 1, 2,…, let I_n be the finite set {j/n : j = 0, 1,…,n}. Let I_n^∞ be a countable product of copies of I_n, with the product σ-algebra defined from the σ-algebra of all subsets of I_n. We have:
5.1.12 Lemma Under the hypotheses of Theorem 5.1.11, for each n = 1, 2,… and A ∈ A, there is a Borel measurable function f_A on I_n^∞ such that for any x_1,…,x_n ∈ X and P_n := n^{−1} Σ_{j=1}^n δ_{x_j}, we have

P_n(A) = f_A({P_n(C_k)}_{k≥1}).
Proof Since C is a determining class, the function f_A exists. We need to show it is measurable. If X is uncountable, then by the Borel isomorphism theorem (RAP, Theorem 13.1.1) we can assume X = [0, 1]. Or if X is countable, then A is the σ-algebra of all its subsets and we can assume X ⊂ [0, 1]. Let

X^{(n)} := {x := (x_j)_{j=1}^n ∈ X^n : x_1 ≤ x_2 ≤ ⋯ ≤ x_n}.

The map x ↦ P_n^{(x)} is 1-1 from X^{(n)} into the set of all possible empirical measures P_n, since if Q := P_n^{(x)} = P_n^{(y)} with x, y ∈ X^{(n)}, then x_1 = y_1 = the smallest u such that Q({u}) > 0. Next, x_2 = y_2 = x_1 if and only if Q({x_1}) ≥ 2/n, while otherwise x_2 = y_2 = the next-smallest v such that Q({v}) > 0, and so on. Now, X^{(n)} is a Borel subset of X^n. The completion of X^n for any of the usual product metrics is isometric to S^n, where S is the completion of X. Clearly X^n is a Borel subset of S^n. It follows that X^{(n)} is a Borel subset of its completion. Here the following will be useful:
5.1.13 Lemma On I_n^∞, the product σ-algebra B_n^∞, the smallest σ-algebra for which all the coordinates are measurable, equals the Borel σ-algebra B of the product topology T.

Proof Clearly B_n^∞ ⊂ B since the coordinates are continuous. Conversely, T has a base R consisting of all sets Π_{j≥1} A_j where A_j = I_n for all but finitely many j; R is countable (since I_n is finite) and consists of sets in B_n^∞, and every
U ∈ T is a countable union of sets in R, so U ∈ B_n^∞ and B = B_n^∞. Lemma 5.1.13 is proved.

Now to continue the proof of Lemma 5.1.12, the map f : x ↦ {P_n^(x)(C_k)}_{k≥1} is Borel measurable from X^(n) into I_n^∞. Since {C_k}_{k≥1} are a determining class, f is one-to-one. Thus by Appendix G, Theorem G.6, f has Borel image f[X^(n)] in I_n^∞, and f^{-1} is Borel measurable from f[X^(n)] onto X^(n). Then f^{-1} extends to a Borel measurable function h on all of I_n^∞ into X^(n), since X^(n) is Borel-isomorphic to ℝ, or a countable subset, with Borel σ-algebra, by the Borel isomorphism theorem (RAP, Theorem 13.1.1 again), and thus the extension works as for real-valued functions (RAP, Theorem 4.2.5). For any A ∈ A, g_A : x ↦ P_n^(x)(A) is Borel measurable. Thus f_A ≡ g_A ∘ h is Borel measurable and Lemma 5.1.12 is proved.

Now to prove Theorem 5.1.11, we know from Theorem 5.1.9 that the smallest σ-algebra S_n making all functions x ↦ P_n^(x)(B) measurable for B ∈ A is sufficient. By Lemma 5.1.12, S_n is the same as the smallest σ-algebra making x ↦ P_n^(x)(C_k) measurable for all k, which finishes the proof.
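The injectivity step in the proof of Lemma 5.1.12 — reading off the ordered points x_1 ≤ ... ≤ x_n from the atoms of the empirical measure — is effectively an algorithm; here is a sketch (the function name is ours):

```python
def recover_sorted_sample(atoms, n):
    """Recover x_1 <= ... <= x_n from the empirical measure
    P_n = (1/n) * sum_j delta_{x_j}, given `atoms` mapping each
    support point u to its mass P_n({u}) (a multiple of 1/n).
    Each atom u is repeated n * P_n({u}) times, smallest first,
    mirroring the recursion in the proof of Lemma 5.1.12."""
    sample = []
    for u in sorted(atoms):
        sample.extend([u] * round(n * atoms[u]))
    assert len(sample) == n, "masses must sum to 1 in multiples of 1/n"
    return sample
```

For example, recover_sorted_sample({0.3: 2/3, 0.7: 1/3}, 3) returns [0.3, 0.3, 0.7], the unique ordered sample with that empirical measure.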
In the real line ℝ, the closed half-lines (−∞, x] form a determining class. In other words, as is well known, a probability measure P on the Borel σ-algebra of ℝ is uniquely determined by its distribution function F (RAP, Theorem 3.2.6). It follows that the half-lines (−∞, q] for q rational are a determining class: for any real x, take rational q_k ↓ x; then F(q_k) ↓ F(x). Thus we have:
5.1.14 Corollary In ℝ, the empirical distribution functions F_n(x) := P_n((−∞, x]) are sufficient for the family P^n of all laws P^n on ℝ^n where P varies over all laws on the Borel σ-algebra in ℝ.
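Concretely, the empirical distribution function of Corollary 5.1.14 can be sketched as follows (names ours):

```python
import bisect

def empirical_cdf(sample):
    """F_n(x) = P_n((-inf, x]): the fraction of sample points <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n
```

With F = empirical_cdf([0.2, 0.5, 0.5]) one gets F(0.4) = 1/3 and F(0.5) = 1; by the determining-class argument above, the values of F at the rationals already determine P_n.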
5.2 Admissibility

Let F be a family of real-valued functions on a set X, measurable for a σ-algebra A on X. Then there is a natural function, called here the evaluation map, F × X → ℝ, given by (f, x) ↦ f(x). It turns out that for general F there may not exist any σ-algebra of subsets of F for which the evaluation map is jointly measurable. The possible existence of such a σ-algebra and its uses will be the subject of this section. Let (X, B) be a measurable space. Then (X, B) will be called separable if B is generated by some countable subclass C ⊂ B and B contains all singletons
{x}, x ∈ X. In this section, (X, B) will be assumed to be such a space. Let F be a collection of real-valued functions on X. (The following definition is unrelated to the usage of "admissible" for estimators in statistics.)
Definition. F is called admissible if there is a σ-algebra T of subsets of F such that the evaluation map (f, x) ↦ f(x) is jointly measurable from (F, T) × (X, B) (with product σ-algebra) to ℝ with Borel sets. Then T will be called an admissible structure for F. F will be called image admissible via (Y, S, T) if (Y, S) is a measurable space and T is a function from Y onto F such that the map (y, x) ↦ T(y)(x) is jointly measurable from (Y, S) × (X, B) with product σ-algebra to ℝ with Borel sets. To apply these definitions to a family C of sets, let F := {1_A : A ∈ C}.
Remarks. For one example, let (K, d) be a compact metric space and let F be a set of continuous real-valued functions on K, compact for the supremum norm. Then the functions in F are uniformly equicontinuous on K by the Arzelà-Ascoli theorem (RAP, 2.4.7). It follows that the map (f, x) ↦ f(x) is jointly continuous for the supremum norm on f ∈ F and d on K. Since both spaces are separable metric spaces, the map is also jointly measurable, so that F is admissible. If a family F is admissible, then it is image admissible, taking T to be the identity. In regard to the converse direction, here is an example. Let X = [0, 1] with usual Borel σ-algebra B. Let (Y, S) be a countable product of copies of (X, B). For y = (y_n)_{n≥1} ∈ Y let T(y)(x) := 1_J(x, y) where J := {(x, y) : x = y_n for some n}. Let C be the class of all countable subsets of X and F the class of indicator functions of sets in C. Then it is easy to check that F is image admissible via (Y, S, T). If a σ-algebra T is defined on F by setting T := {E ⊂ F : T^{-1}(E) ∈ S}, then T is not countably generated (see problem 5(b)) although S is. This example shows how sometimes image admissibility may work better than admissibility.
5.2.1 Theorem For any separable measurable space (X, B), there is a subset Y of [0, 1] and a 1-1 function M from X onto Y which is a measurable isomorphism (is measurable and has measurable inverse) for the Borel σ-algebra on Y.

Remark. Note that Y is not necessarily a measurable subset of [0, 1]. On the other hand, if (X, B) is given as a separable metric space which is a Borel subset of its completion, with Borel σ-algebra, then (X, B) is measurably isomorphic
either to a countable set, with the σ-algebra of all its subsets, or to all of [0, 1], by the Borel isomorphism theorem (RAP, Theorem 13.1.1).
Proof Recall that the Borel subsets of Y as a metric space with usual metric are the same as the intersections with Y of Borel sets in [0, 1], since the same is true for open sets. Let A := {A_j}_{j≥1} be a countable set of generators of B. Consider the map f : x ↦ {1_{A_j}(x)}_{j≥1} from X into a countable product 2^∞ of copies of {0, 1} with product σ-algebra. Then f is 1-1 and onto its range Z. Thus it preserves all set operations, specifically countable unions and complements. So it is easily seen that f is a measurable isomorphism of X onto Z. Next consider the map g : {z_j} ↦ Σ_j 2 z_j / 3^j from 2^∞ into [0, 1], actually onto the Cantor set C (RAP, proof of Proposition 3.4.1). Then g is continuous from the compact space 2^∞ with product topology onto C. It is easily seen that g is 1-1. Thus g is a homeomorphism (RAP, Theorem 2.2.11) and a measurable isomorphism for Borel σ-algebras. The Borel σ-algebra on 2^∞ equals the product σ-algebra (Lemma 5.1.13). So the restriction of g to Z is a measurable isomorphism onto its range Y. It follows that the composition g ∘ f, called the Marczewski function,

M(x) := Σ_{n=1}^∞ 2·1_{A_n}(x)/3^n,

is one-to-one from X onto Y ⊂ I := [0, 1] and is a measurable isomorphism onto Y.
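A finite-precision sketch of the Marczewski function, with the generating sets supplied as predicates (names ours):

```python
def marczewski(x, generators):
    """Truncated Marczewski function M(x) = sum_n 2*1_{A_n}(x)/3^n,
    summed over the finitely many generators supplied.  Points that
    the generators separate receive distinct Cantor-set codes."""
    # A(x) is a bool, used as 0 or 1 in the ternary expansion
    return sum(2 * A(x) / 3 ** n for n, A in enumerate(generators, start=1))
```

For instance, with generators A_1 = {x < 0.5} and A_2 = {x < 0.7}, the point 0.6 is coded as 0 + 2/9 and the point 0.3 as 2/3 + 2/9, distinct because the generators separate them.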
Let (X, B) be a separable measurable space where B is generated by a sequence {A_j}_{j≥1}. By taking the union of the finite algebras generated by A_1, ..., A_n for each n, we can and do take A := {A_j}_{j≥1} to be an algebra. Let F_0 be the class of all finite sums Σ_{i=1}^n c_i 1_{A_i} for rational c_i ∈ ℝ and n = 1, 2, .... Then "Borel classes" or "Baire classes" are defined as follows by transfinite recursion (RAP, 1.3.2). Let (Ω, <) be an uncountable well-ordered set such that for each β ∈ Ω, {α ∈ Ω : α < β} is countable. (Specifically, one can take Ω to be the set of all countable ordinals with their usual ordering.) For any countable set A ⊂ Ω, {γ : γ ≤ x for some x ∈ A} is countable, so there is a z ∈ Ω with x < z for all x ∈ A. For each α ∈ Ω there is a next larger element called α + 1. Let 0 be the smallest element of Ω. For each α ∈ Ω, given F_α, let F_{α+1} be the set of all limits of everywhere pointwise convergent sequences of functions in F_α. If β ∈ Ω is not of the form α + 1 (β is a "limit ordinal"), β > 0, and F_α is defined for all α < β, let F_β
be the union of all F_α for α < β. Note that F_α ⊂ F_β whenever α < β. Let U := ∪_{α∈Ω} F_α.

5.2.2 Theorem U is the set of all measurable real functions on X.
Proof Clearly, each function in U is measurable. Conversely, the class of all sets B such that 1_B ∈ U is a monotone class and includes the generating algebra A, so it is the σ-algebra B of all measurable sets (RAP, Theorem 4.4.2). Likewise, for a fixed A ∈ A and constants c, d, the collection of all sets B such that c 1_A + d 1_B ∈ U is B. Then for a fixed B ∈ B, the set of all C ∈ B such that c 1_C + d 1_B ∈ U is all of B. By a similar proof for a sum of n terms, we get that all simple functions Σ_{i=1}^n c_i 1_{B(i)} are in U for any c_i ∈ ℝ and B(i) ∈ B. Since any measurable real function f is the limit of a sequence of simple functions f_n + g_n, where f_n ↑ max(f, 0) and g_n ↓ min(f, 0) (RAP, Proposition 4.1.5), every measurable real function on X is in U.
On admissibility there is the following main theorem:
5.2.3 Theorem (Aumann) Let I := [0, 1] with usual Borel σ-algebra. Given a separable measurable space (X, B) and a class F of measurable real-valued functions on X, the following are equivalent:

(i) F ⊂ F_α for some α ∈ Ω;
(ii) there is a jointly measurable function G : I × X → ℝ such that for each f ∈ F, f = G(t, ·) for some t ∈ I;
(iii) there is a separable admissible structure for F;
(iv) F is admissible;
(v) 2^F is an admissible structure for F;
(vi) F is image admissible via some (Y, S, T).
Remarks. The specific classes F_α depend on the choice of the countable family A of generators, but condition (i) does not: if C is another countable set of generators of B with corresponding classes G_α, then for any α ∈ Ω there are β ∈ Ω and γ ∈ Ω with F_α ⊂ G_β and G_α ⊂ F_γ.
Proof (ii) implies (iii): for each f ∈ F choose a unique t ∈ I and restrict the Borel σ-algebra to the set of t's chosen. Then (iii) follows. Clearly (iii) implies (iv), which is equivalent to (v). (iv) implies (iii): note that any real-valued measurable function f (for the Borel σ-algebra on ℝ as usual) is always measurable for some countably generated sub-σ-algebra, for example the one generated by {f > t}, t rational. A product
σ-algebra is the union of the σ-algebras generated by countable sets of rectangles A_i × B_i. So the evaluation map is measurable for such a sub-σ-algebra. Let D be the σ-algebra of subsets of F generated by the sets A_i. For any two distinct functions f, g in F, f(x) ≠ g(x) for some x. The map h ↦ h(x) is D-measurable. So {f} is the intersection of those A_i which contain f and the complements of the others, and (iii) follows. So (iii) through (v) are equivalent.
(iii) implies (ii): by Theorem 5.2.1, there is a subset Y ⊂ I := [0, 1] with Borel σ-algebra and a measurable isomorphism t ↦ G(t, ·) from Y onto F. The assumed admissibility implies that (t, x) ↦ G(t, x) is jointly measurable. By the general extension theorem for real-valued measurable functions (RAP, Theorem 4.2.5), although Y is not necessarily measurable, we can assume G is jointly measurable on I × X, proving (ii). So (ii) through (v) are equivalent. (ii) implies (i): G belongs to some F_α on [0, 1] × X, where we take generators of the form A_i × B_i, A_i Borel, B_i ∈ B, such that {A_i}_{i≥1} and {B_i}_{i≥1} are algebras. Hence the sections G(t, ·) on X all belong to F_α on X for the generators {B_i}.
(i) implies (ii): for this we need universal functions, defined as follows. A jointly measurable function G : I × X → ℝ will be called a universal class α function if every function f ∈ F_α on X is of the form G(t, ·) for some t ∈ I. (G itself will not necessarily be of class α on I × X.) Recall, by the way, that a universal open set in ℕ^∞ × X exists for any separable metric space X (RAP, Proposition 13.2.3).
5.2.4 Theorem (Lebesgue) For any α ∈ Ω there exists a universal class α function G : I × X → ℝ.
Proof For α = 0, F_0 is a countable sequence {f_k}_{k≥1} of functions. Let G(1/k, x) := f_k(x) and G(t, x) := 0 if t ≠ 1/k for all k. Then G is jointly measurable and a universal class 0 function. For general α > 0, by transfinite induction (RAP, Section 1.3), suppose there is a universal class β function for all β < α. First suppose α is a successor, that is, α = β + 1 for some β. Let H be a universal class β function on I × X. Let I^∞ be the countable product of copies of I with the product σ-algebra. The product topology on I^∞ is compact and metrizable (RAP, Theorem 2.2.8 and Proposition 2.4.4). Its Borel σ-algebra is the same as the product σ-algebra, by way of the usual base or subbase of the product topology (RAP, pp. 30-31). For t = (t_n)_{n≥1} ∈ I^∞ let G(t, x) := lim sup_n H(t_n, x) if the lim sup is finite, otherwise G(t, x) := 0. Then G is jointly measurable. Now I^∞, as a
Polish space, is Borel-isomorphic to I (RAP, Section 13.1), so we can replace I^∞ by I. Then G is a universal class α function. If α is not a successor, then there is a sequence β_k ↑ α, β_k < α, meaning that for every β < α, there is some k with β ≤ β_k. For each k let G_k be a universal class β_k function. Define G on I² × X by G(s, t, x) := G_k(t, x) if s = 1/k and G(s, t, x) := 0 otherwise. Then G is jointly measurable. Since I² is Borel-isomorphic to I, we again have a universal class α function, proving Theorem 5.2.4.
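The α = 0 step of the proof admits a direct sketch, for a finite list standing in for the sequence f_1, f_2, ... (names ours):

```python
def universal_class0(functions, tol=1e-9):
    """G(1/k, x) := f_k(x), G(t, x) := 0 for t not of the form 1/k,
    as in the first step of the proof of Theorem 5.2.4."""
    def G(t, x):
        for k, f in enumerate(functions, start=1):
            if abs(t - 1.0 / k) < tol:  # t == 1/k, up to float tolerance
                return f(x)
        return 0.0
    return G
```

With G = universal_class0([lambda x: x + 1, lambda x: 2 * x]), the parameter t = 1 selects the first function and t = 1/2 the second, while other t give 0; every member of the finite "class 0" family is a section G(t, ·).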
With Theorem 5.2.4, (i) in Theorem 5.2.3 clearly implies (ii), so (i) through (v) are equivalent. (iv) implies (vi) directly. If (vi) holds, let Z be a subset of Y on which the map z ↦ T(z) is one-to-one and onto F. Let S_Z and T_Z be the restrictions to Z of S and T respectively. For a function f and set B let f[B] := {f(x) : x ∈ B}. Then F remains image admissible via (Z, S_Z, T_Z), and {T_Z[A] : A ∈ S_Z} is an admissible structure for F, giving (iv).
For 0 ≤ p < ∞ and a probability law Q on (X, B) we have the space L^p(X, B, Q) of measurable real-valued functions f on X such that ∫|f|^p dQ < ∞, with the pseudometrics

d_{p,Q}(f, g) := (∫|f − g|^p dQ)^{1/p} for 1 ≤ p < ∞,
d_{p,Q}(f, g) := ∫|f − g|^p dQ for 0 < p < 1,
d_{0,Q}(f, g) := inf{ε > 0 : Q(|f − g| > ε) ≤ ε} for p = 0.

In admissible classes, d_{p,Q}-open sets are measurable, as follows:
5.2.5 Theorem Let (X, B) be a separable measurable space, 0 ≤ p < ∞, and F ⊂ L^p(X, B, Q). Then if F is image admissible via (Y, S, T), U ⊂ F, and U is relatively d_{p,Q}-open in F, we have T^{-1}(U) ∈ S.

Proof Since (X, B) is separable, the pseudometric spaces L^p(X, B, Q) are separable for 0 ≤ p < ∞, and so (e.g., RAP, Proposition 2.1.4), U is a countable union of balls {f : d_{p,Q}(f, g) < r}, g ∈ U, 0 < r < ∞. So for p > 0 it is enough to show that each function y ↦ ∫|T(y) − g|^p dQ is measurable. For g fixed, (y, x) ↦ |T(y)(x) − g(x)|^p is jointly measurable. So for 0 < p < ∞ we can reduce to the case p = 1 and g = 0 with T(y)(x) ≥ 0 for all x, y. Now, the Tonelli-Fubini theorem implies the desired measurability. For p = 0, given ε > 0, by the Tonelli-Fubini theorem again, the set A_ε of y such that Q(|T(y)(x) − g(x)| > ε) ≤ ε is
measurable, and {y : d_{0,Q}(T(y), g) < r} is the union of the A_ε for ε rational, 0 < ε < r.
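For an empirical law Q (so that integrals become sample averages), the pseudometrics d_{p,Q} defined before Theorem 5.2.5 can be sketched as follows; the case conventions follow the display above, and all names are ours:

```python
def d_pQ(f, g, sample, p):
    """d_{p,Q} for Q the empirical measure on `sample`:
    (integral |f-g|^p dQ)^(1/p) for p >= 1, the integral itself for
    0 < p < 1, and inf{eps > 0 : Q(|f-g| > eps) <= eps} for p = 0."""
    diffs = [abs(f(x) - g(x)) for x in sample]
    n = len(diffs)
    if p >= 1:
        return (sum(d ** p for d in diffs) / n) ** (1.0 / p)
    if p > 0:
        return sum(d ** p for d in diffs) / n
    # p == 0: Q(|f-g| > eps) only takes values k/n and is a right-
    # continuous step function of eps, so checking eps in {k/n} and
    # at the jump points (the diffs) finds the infimum.
    for eps in sorted(set(diffs) | {k / n for k in range(n + 1)}):
        if sum(d > eps for d in diffs) / n <= eps:
            return eps
    return max(diffs)
```

For instance, with f the identity, g ≡ 0, and the two-point sample {0, 2}, d_{2,Q} is the root-mean-square difference √2.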
5.2.6 Corollary If (X, B) is a separable measurable space, F ⊂ L^1(X, B, Q), and F is image admissible via (Y, S, T), then y ↦ ∫T(y) dQ is S-measurable.

Proof For any real u, {f : ∫f dQ > u} is open for d_{1,Q}.
For 1 ≤ p < ∞ and f, g ∈ L^p(X, B, Q), let ρ_{p,Q}(f, g) := d_{p,Q}(f_{0,Q}, g_{0,Q}) where for h ∈ L^1(X, B, Q), h_{0,Q} := h − ∫h dQ. Thus for ρ_Q as defined in Section 3.1, ρ_Q = ρ_{2,Q}.

5.2.7 Corollary If (X, B) is a separable measurable space, 1 ≤ p < ∞, F ⊂ L^p(X, B, Q), F is image admissible via (Y, S, T), S is separable, U ⊂ F, and U is ρ_{p,Q}-open, then T^{-1}(U) ∈ S.

Proof Let G be the set of functions f − ∫f dQ, f ∈ F. For y ∈ Y let S(y)(x) := T(y)(x) − ∫T(y) dQ. Then by Corollary 5.2.6, S is jointly measurable and G is image admissible via (Y, S, S). Now f ∈ U if and only if f_{0,Q} ∈ V, where V is d_{p,Q}-open in G. Then T^{-1}(U) = S^{-1}(V) ∈ S by Theorem 5.2.5.
5.3 Suslin properties, selection, and a counterexample

Here is another counterexample on measurability to add to the two at the beginning of the chapter. Let X = [0, 1] with Borel σ-algebra and uniform (Lebesgue) probability measure P := U[0, 1]. Let A be a non-Lebesgue-measurable subset of [0, 1] (e.g., RAP, Theorem 3.4.4). Let C := {{x} : x ∈ A}. Then C is a collection of disjoint sets, so S(C) = 1 by Theorem 4.2.6. Also C, being a class of singletons, is admissible, for example by Theorem 5.2.3(ii) with G(t, s) := 1 for t = s and G(t, s) := 0 otherwise. But ‖P_1‖_C is nonmeasurable, being 1 if and only if X_1 ∈ A, and likewise any ‖P_n‖_C and ‖P_n − P‖_C is nonmeasurable. So some measurability condition beyond admissibility is needed for ‖P_n − P‖_F to be measurable. A sufficient condition will be provided by Suslin properties, as follows. Recall that a Polish space is a topological space metrizable as a complete separable metric space. A separable measurable space (Y, S) will be called a Suslin space if there is a Polish space X and a Borel measurable map from X onto Y. If (Y, S) is a measurable space, a subset Z ⊂ Y will be called a Suslin set if it is a Suslin space with the relative σ-algebra {Z ∩ B : B ∈ S}.
Given a measurable space (X, B) and M ⊂ X, recall that M is called universally measurable or u.m. if for every probability law P on B, M is measurable for the completion of P, in other words for some A, B ∈ B, A ⊂ M ⊂ B and P(A) = P(B). In a Polish space, all Suslin sets are universally measurable (RAP, Theorems 13.2.1 and 13.2.6). A function f from X into Z, where (Z, A) is a measurable space, will be called universally measurable or u.m. if for each set B ∈ A, f^{-1}(B) is universally measurable.
If (Ω, A) is a measurable space and F a set, then a real-valued function X : (f, ω) ↦ X(f, ω) will be called image admissible Suslin via (Y, S, T) if (Y, S) is a Suslin measurable space, T is a function from Y onto F, and (y, ω) ↦ X(T(y), ω) is jointly measurable on Y × Ω. Equivalently, Y could be taken to be Polish with S as its Borel σ-algebra. As the notation suggests, a main case of interest will be where F is a set of functions on Ω and X(f, ω) = f(ω). Also, X or F will be called image admissible Suslin if X is image admissible Suslin via some (Y, S, T) as above. Recall the notion of separable measurable space defined in the last section. Note that any separable metric space with its Borel σ-algebra is a separable measurable space, as follows from RAP, Proposition 2.1.4. We have:
5.3.1 Theorem A measurable space (X, B), where B is countably generated, is separable if and only if B separates the points of X, so that for any x ≠ y in X, there is some A ∈ B containing just one of x, y.

Proof If (X, B) is separable then {x} ∈ B for each x ∈ X, so B separates points. The converse direction follows from the proof of Theorem 5.2.3, (iv) implies (iii).
5.3.2 Selection Theorem (Sainte-Beuve) Let (Ω, A) be any measurable space and let X : F × Ω → ℝ be image admissible Suslin via (Y, S, T). Then for any Borel set B ⊂ ℝ,

Π_X(B) := {ω : X(f, ω) ∈ B for some f ∈ F}

is u.m. in Ω, and there is a u.m. function H from Π_X(B) into Y such that X(T(H(ω)), ω) ∈ B for all ω ∈ Π_X(B).

Note. Here (Ω, A) need not be Suslin or even separable.
Proof As noted above, we can assume Y is Polish and S is its Borel σ-algebra. For any measurable set V in a product σ-algebra, here V = {(y, ω) : X(T(y), ω) ∈ B} ⊂ Y × Ω, there are countably many measurable sets A ⊂ Y
and E ⊂ Ω such that V is in the σ-algebra generated by the sets A × E (as noted in the proof of Theorem 5.2.3, (iv) implies (iii)). Let B vary over a countable set of generators of the Borel σ-algebra in ℝ. Thus (y, ω) ↦ X(T(y), ω) is jointly measurable for S in Y and a σ-algebra B in Ω generated by a sequence {B_n} of measurable sets. Define a Marczewski function b, as in the proof of Theorem 5.2.1, for the sequence {B_n} in place of {A_n}. Then b is a B-measurable function from Ω into the Cantor set C ⊂ [0, 1] (defined, e.g., in RAP, proof of Proposition 3.4.1). Define ω ≡_B ω′ if for all n, ω ∈ B_n if and only if ω′ ∈ B_n. Equivalently, for all B ∈ B, ω ∈ B if and only if ω′ ∈ B. Then b(ω) = b(ω′) if and only if ω ≡_B ω′. Let β be a B-measurable function. Then if b(ω) = b(ω′), we must have β(ω) = β(ω′). So β(ω) ≡ g(b(ω)) for some function g. Now, β is measurable for the σ-algebra of all sets b^{-1}(D) where D is a Borel subset of ℝ, since each B_n is such a set (by the proof of Theorem 5.2.1). So g can be taken to be Borel measurable (RAP, Theorem 4.2.8). Similarly, X(T(y), ω) ≡ F(y, b(ω)) for some function F on Y × C. The next step in the proof of Theorem 5.3.2 is the following fact:
5.3.3 Lemma Every S ⊗ B measurable real function ψ on Y × Ω can be written as ψ(y, ω) = F(y, b(ω)), where F is S ⊗ B_C measurable and B_C is the Borel σ-algebra in C.

Proof This is easily seen when ψ is the indicator of a set S × B, S ∈ S, B ∈ B, or a finite linear combination of such indicators. A finite union of sets S_i × B_i, S_i ∈ S, B_i ∈ B, can be written as a disjoint union (RAP, Proposition 3.2.2), so the lemma holds when ψ is the indicator of any such union. Since the finite unions of sets S_i × B_i form an algebra, the monotone class theorem (RAP, Theorem 4.4.2) proves the lemma when ψ is the indicator function of any set in the product σ-algebra S ⊗ B. It thus holds when ψ is any simple function (finite linear combination of such indicator functions). Next, suppose ψ_n are S ⊗ B measurable, ψ_n(y, ω) = F_n(y, b(ω)) with F_n S ⊗ B_C measurable, and ψ_n → ψ pointwise on Y × Ω. For any y ∈ Y and ω, ω′ ∈ Ω such that b(ω) = b(ω′), we have F_n(y, b(ω)) = F_n(y, b(ω′)) for all n, so ψ(y, ω) = ψ(y, ω′). Let F(y, c) := lim_n F_n(y, c) whenever the sequence converges, as it will if c = b(ω) for some ω ∈ Ω; otherwise set F(y, c) := 0. Then F is S ⊗ B_C measurable (RAP, proof of Theorem 4.2.5) and ψ(y, ω) = F(y, b(ω)) for all y ∈ Y and ω ∈ Ω. Since each measurable function is a pointwise limit of simple functions, the lemma follows.
So X(T(y), ω) ≡ F(y, b(ω)) for an S ⊗ B_C measurable function F on Y × C. Now, {(y, c) ∈ Y × C : F(y, c) ∈ B} is a Borel subset of Y × C (RAP, Proposition 4.1.7) and thus a Suslin set (RAP, Theorem 13.2.1). It follows that Π_F B := {c ∈ C : for some y ∈ Y, F(y, c) ∈ B} is a Suslin subset of C and therefore universally measurable (RAP, Theorem 13.2.6). By a selection theorem (RAP, Theorem 13.2.7) there is a u.m. function h from Π_F B into Y with F(h(c), c) ∈ B for all c ∈ Π_F B. Next we need:
5.3.4 Lemma Let (D, D) and (G, G) be two measurable spaces and g a measurable function from D into G. Let U ⊂ G be u.m. for G. Then g^{-1}(U) is u.m. for D.

Proof Let P be a law on (D, D), so P ∘ g^{-1} is a law on (G, G). Take A, B ∈ G with A ⊂ U ⊂ B and (P ∘ g^{-1})(B \ A) = 0. Then g^{-1}(A) and g^{-1}(B) are in D, g^{-1}(A) ⊂ g^{-1}(U) ⊂ g^{-1}(B), and P(g^{-1}(B) \ g^{-1}(A)) = P(g^{-1}(B \ A)) = 0.

Remark. The lemma includes the case where D ⊂ G, g is the identity, and D includes the relative σ-algebra {J ∩ D : J ∈ G}, while possibly D and U may not be u.m. for G.
Now to finish the proof of Theorem 5.3.2, Lemma 5.3.4 implies that

Π_X(B) = b^{-1}{c ∈ C : for some y ∈ Y, F(y, c) ∈ B}

is u.m. Let H := h ∘ b from Ω into Y. Then for any E ∈ S, H^{-1}(E) = b^{-1}(h^{-1}(E)). Here h^{-1}(E) is a u.m. set in Π_F B, and so by Lemma 5.3.4, b^{-1}(h^{-1}(E)) is u.m. in Ω and H is a u.m. function. By choice of h we have X(T(H(ω)), ω) ∈ B for all ω ∈ Π_X B. Some possibilities for the set B ⊂ ℝ are the sets {x : x > t} or {x : |x| > t} for any real t. These choices give:
5.3.5 Corollary Let (f, ω) ↦ X(f, ω) be real-valued and image admissible Suslin via some (Y, S, T). Then ω ↦ sup{X(f, ω) : f ∈ F} and ω ↦ sup{|X(f, ω)| : f ∈ F} are u.m. functions.

The image admissible Suslin property is preserved by composing with a measurable function:
5.3.6 Theorem Let X_1, ..., X_k be image admissible Suslin real-valued functions on F_i × Ω, i = 1, ..., k, for one measurable space (Ω, A), via (Y_i, S_i, T_i), i = 1, ..., k. Let g be a Borel measurable function from ℝ^k into ℝ. Then (ω, f_1, ..., f_k) ↦ g(X_1(f_1, ω), ..., X_k(f_k, ω)) is image admissible Suslin via some (Y, S, T). Specifically, we can let Y = Y_1 × ... × Y_k with product σ-algebra S = S_1 ⊗ ... ⊗ S_k and let T(y_1, ..., y_k) := (T_1(y_1), ..., T_k(y_k)).

Proof Clearly (Y, S) is Suslin and the joint measurability holds.

Next, here are examples showing that if the Suslin assumption on Y is removed from Theorem 5.3.2 it may fail. Let (Ω, <) and the class C of countable initial segments I_x be as in the second example at the beginning of this chapter. We can take Ω to be (in 1-1 correspondence with) a subset of [0, 1], by the axiom of choice (or, by the continuum hypothesis, all of [0, 1]). Here the ordering < has no relation to any usual structure on [0, 1]. Then C is admissible by Theorem 5.2.3, since the collection of all countable sets is of bounded Borel class: all finite unions of open intervals (q, r) with rational endpoints are in some F_α, so all finite sets are in F_{α+1} and all countable sets in F_{(α+1)+1} =: F_{α+2}. Let P be a law on Ω which is 0 on countable sets and 1 on sets with countable complement. Such a law, on the σ-algebra generated by singletons, exists on any uncountable set. Under the continuum hypothesis with Ω = [0, 1] we can take P to be Lebesgue measure or any nonatomic law. Let P_1 = δ_x, P_2 = (δ_x + δ_y)/2, and Q_1 = δ_z, where x, y, z are coordinates on Ω³ with law P³, so x, y, and z are i.i.d. (P). Then sup_{A∈C}(P_1 − Q_1)(A) = 1 if and only if x < z. Thus, as shown at the beginning of this chapter, sup_{A∈C}(P_1 − Q_1)(A) is nonmeasurable.
If we let B := {(x, y, z) : sup_{A∈C} |(P_2 − Q_1)(A)| = 1}, it can be seen likewise that B must not be measurable. So "Suslin" cannot simply be removed from Corollary 5.3.5 or Theorem 5.3.2. Returning to positive results, an admissible structure can be put on spaces of closed sets. Let (X, d) be a separable metric space and F_0 the collection of all nonempty closed subsets of X. Then F_0 is admissible: there is a countable base for the topology of X, so for some α, all finite unions of sets in the base are in F_α. Thus all open sets are in F_{α+1}. Then all closed sets are in F_{α+2}, since any closed set F is a countable intersection of open sets U_n := {x : d(x, F) := inf_{y∈F} d(x, y) < 1/n}. The topology of X can be metrized by a metric d for which (X, d) is totally
bounded (RAP, Theorem 2.8.2). Assume d is such a metric. Let h_d be the Hausdorff metric: h_d(A, B) := max{sup_{x∈A} d(x, B), sup_{y∈B} d(y, A)} for any two closed sets A, B. Since d is totally bounded, it's easily seen that (F_0, h_d) is separable, since the finite subsets of a countable dense set in X are dense in F_0 for h_d. The Borel σ-algebra of h_d will be called an Effros Borel structure on F_0. (Effros, 1965, proved that for d totally bounded this Borel structure is unique.)
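For finite sets the Hausdorff metric is directly computable; a sketch (names ours):

```python
def hausdorff(A, B, d=lambda x, y: abs(x - y)):
    """h_d(A, B) = max(sup_{x in A} d(x, B), sup_{y in B} d(y, A))
    for nonempty finite sets A, B, where d(x, S) = min over S."""
    def dist_to(x, S):
        return min(d(x, y) for y in S)
    return max(max(dist_to(x, B) for x in A),
               max(dist_to(y, A) for y in B))
```

For example, hausdorff({0.0, 1.0}, {0.0, 2.0}) is 1.0, and on singletons h_d reduces to d itself; this is the metric whose Borel σ-algebra gives the Effros Borel structure above.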
5.3.7 Proposition For any separable metric space X with totally bounded metric d, the Effros Borel structure (of h_d) is admissible on F_0. Also, for any law P on the Borel sets of X, P(·) is measurable for the Effros Borel structure.

Proof The set {(x, F) : x ∈ F ∈ F_0} is closed in X × F_0: if x_n ∈ F_n for all n and (x_n, F_n) → (x, F), then x_n → x in X and h_d(F_n, F) → 0. Then d(x_n, F) → 0 and so x ∈ F. Since both (X, d) and (F_0, h_d) are separable, a Borel set in their product is in the product σ-algebra (RAP, Proposition 4.1.7). For any law P and c ≥ 0, the set {F ∈ F_0 : P(F) ≥ c} is closed for h_d, as follows. If F_n → F for h_d, P(F_n) ≥ c, and ε > 0, let F^ε := {y : d(x, y) < ε for some x ∈ F}. Then F_n ⊂ F^ε for n large enough, so P(F^ε) ≥ c, and letting ε = 1/m ↓ 0 shows P(F) ≥ c. So P(·) is upper semicontinuous and thus Effros measurable. For families of functions, we have the following.
5.3.8 Theorem (a) Let S be a topological space and F a family of bounded real functions on S, equicontinuous at each point of S. Then (f, x) ↦ f(x) is jointly continuous from F × S into ℝ, with the supremum norm ‖f‖_∞ := sup_x |f(x)| on F.
(b) If in addition F is separable for ‖·‖_∞ and S is metrizable as a separable metric space (S, d), then (f, x) ↦ f(x) is jointly measurable for the Borel σ-algebras of ‖·‖_∞ on F and d on S. Thus F is admissible. So is G, the collection of all subgraphs {(x, y) : 0 < y < f(x) or f(x) < y < 0} for f ∈ F.
(c) If, moreover, F with the ‖·‖_∞ distance and its Borel σ-algebra is a Suslin set, then F and G are image admissible Suslin.

Proof Part (a) is immediate. For F, (b) follows from the fact that on a Cartesian product of two separable metric spaces, the Borel σ-algebra of the product topology equals the product σ-algebra of the Borel σ-algebras on each space (RAP, Theorem 4.1.7). For G, the set {(y, z) : 0 < y < z or z < y < 0} is
open and thus Borel in ℝ². Composing its indicator function with z := f(x) thus gives a measurable function. Then (c) follows directly. Strobl (1994, 1995) and Ziegler (1994, 1997a,b) have given special attention to measurability issues for empirical processes.
Example (Adamski and Gaenssler). Let H ⊂ [0, 1] be a nonmeasurable set with λ_*(H) = 0 and λ*(H) = 1, where λ is Lebesgue measure (e.g., RAP, Theorem 3.4.4). Then for each Borel set B ⊂ [0, 1], letting μ(B ∩ H) := λ*(B ∩ H) defines a countably additive probability measure μ on the Borel subsets of H, as a metric space with the usual metric from ℝ (RAP, Theorem 3.3.6). Likewise, λ* gives a probability measure ν on the Borel sets of H^c := [0, 1] \ H. Take the countable product (Ω, M, p) of probability spaces Π_{n=1}^∞ (A_n, M_n, p_n), where for n odd, A_n = H and p_n = μ, while for n even, A_n = H^c and p_n = ν. Such a product of probability spaces always exists (e.g., RAP, Theorem 8.2.2). Let X_n be the nth coordinate on Ω, viewed as a map from Ω into [0, 1]. Then each X_n is measurable and has law U[0, 1], the uniform distribution on [0, 1]. Thus, the X_i are i.i.d. U[0, 1]. Let C be the collection of all finite subsets of H. Let P_n be the empirical measures defined by the given X_i. Then ‖P_n‖_C = 1/2 for n even and ‖P_n‖_C = (n + 1)/(2n) for n odd. On the other hand, if X_i are coordinates on a countable product of copies of ([0, 1], λ), then ‖P_n‖_C is nonmeasurable. This illustrates that C is a pathological class of sets, but the pathology can be obscured if one doesn't use the standard model.
Problems

1. A law Q on ℝ^n is called exchangeable if it is invariant under all permutations of the coordinates. (a) Give an example of an exchangeable law which is not of the form P^n for a law P. (b) Show that the empirical measure P_n is sufficient for any set of exchangeable laws.

2. (continuation) Give an example of a class of non-exchangeable laws on ℝ² for which the empirical measure is still sufficient. Hint: Consider laws with X_2 = 2X_1.

3. An exponential family is a set {P_θ : θ ∈ Θ} of laws on a measurable space (X, A), all dominated by a σ-finite measure μ, having densities dP_θ/dμ = exp(⟨f(θ), g(x)⟩), where Θ is a set, f maps Θ into ℝ^k, g maps X into ℝ^k,
and ⟨·, ·⟩ is the usual inner product in ℝ^k. If P is such an exponential family and P^n := {P^n : P ∈ P} is the corresponding set of laws on X^n, show that Σ_{j=1}^n g(X_j) is a sufficient statistic for P^n.

4. (a) Show that the class C of initial segments I_x in the example at the beginning of Chapter 5 is admissible. Hint: The σ-algebra generated by the sets I_x is not separable, but the set Ω has the smallest possible uncountable cardinality;
thus we can assume Ω is in 1-1 correspondence with a subset of [0, 1] (cf. RAP, Appendix A.3). Use this to show that there exists a separable structure on Ω for which the sets I_x are all measurable. Then, see the discussion after Theorem 5.3.6. (b) A measure space (X, S, μ), where S contains all singletons {x}, is called nonatomic if μ({x}) = 0 for all x ∈ X. Let S be a σ-algebra of subsets of C from part (a) making it admissible, for some σ-algebra B of subsets of Ω containing all singletons. Show that there is no nonatomic probability measure on S.

5. (a) Let (X, E) be a measurable space such that E contains all singletons {x} for x ∈ X and X is uncountable. Suppose that for any nonatomic law P on (X, E), P(A) = 0 or 1 for all A ∈ E. Show that E cannot be countably generated. Hint: Suppose A_k are generators. If for all k there is a law Q_k on E with Q_k(A_k) = 1, find a law Q on E with Q(A_k) = 1 for all k and consider ∩_k A_k. Or, show that the union of those A_k such that Q(A_k) = 0 for all laws Q on E can be deleted. (b) In the remarks before Theorem 5.2.1, prove that T is not countably generated. Hint: Use part (a) and the Hewitt-Savage 0-1 law (RAP, Theorem 8.4.6).
6. Let (S, d) be a separable metric space. Let Y_1, Y_2, ... be Suslin subsets of S, for the Borel σ-algebra on each. Show that ∪_k Y_k and ∩_k Y_k are also Suslin, but the complement of Y_1 may not be, even if S is complete. Hint: See RAP, Section 13.2, Problem 13.2.1 (don't assume the result of the problem). 7. Let (X, A) be a measurable space and V a finite-dimensional vector space of measurable real-valued functions on X. (a) Show that V is image admissible Suslin. Hint: See Theorem 5.3.6. (b) Show that {1_A : A ∈ pos(V)} is also image admissible Suslin.
8. Let (X, A) be a measurable space and let C ⊂ A be image admissible Suslin, that is, {1_B : B ∈ C} is. Let C(k) be the union of all Boolean algebras generated by k or fewer sets in C, as in Theorem 4.2.4. Show that C(k) is also
image admissible Suslin. Hint: Apply Theorem 5.3.6.
9. Let (X, A) be a measurable space such that {x} ∈ A for all x ∈ X. If P is a law on A, call a set A ∈ A a soft atom if P(A) > 0, for all B ⊂ A
with B ∈ A either P(B) = P(A) or P(B) = 0, and P({x}) = 0 for all x ∈ A. (A "hard atom" would be an ordinary atom, namely a singleton {x}
with P({x}) > 0.) (a) If X is a separable metric space with Borel σ-algebra A, show that no
law on A has a soft atom. Hint: Show that a soft atom has a sequence of soft atom subsets, decreasing to a singleton. (b) Give an example of a soft atom. Hint: Let X be uncountable and let A consist of countable sets and their complements. Notes
Notes to Section 5.1. Fisher (1922) invented the idea of sufficiency and Neyman (1935) gave a first form of the factorization theorem. Halmos and Savage (1949) proved the factorization theorem (5.1.2) in the case h ∈ L^1(S, B, λ). Vivid ad hoc concepts such as "kernel" and "chain" in the proof of Lemma 5.1.3
are typical of Halmos's style of writing proofs. Bahadur (1954) removed the restriction on h, by the proof given above in the proof of Theorem 5.1.2. Theorems 5.1.8 and 5.1.9 (sufficiency of S_n) follow from facts given by Neveu (1977, pp. 267-268). I thank Sam Gutmann for showing me the proof actually given in the section and Don Cohn for telling me about Neveu's proof. I do not have references for Theorem 5.1.11, Lemma 5.1.12, or Corollary 5.1.14. Yet, in some sense, it is well known that the empirical distribution function incorporates all the information in an i.i.d. sample (Corollary 5.1.14). Notes to Section 5.2. Theorem 5.2.4, due to Lebesgue, is proved in Natanson (1957, p. 137). Aumann (1961) proved Theorem 5.2.3. The proof here is due to B. V. Rao (1971), except that the "image admissible" part was added in Dudley (1984). Freedman (1966, Lemma (5)) proved that the σ-algebra T mentioned before Theorem 5.2.1 is not countably generated. Notes to Section 5.3. There are several papers on selection theorems. Sainte-Beuve (1974, Theorem 3) gives a selection theorem close to Theorem 5.3.2, which covers Suslin topological spaces. Sainte-Beuve did not use the terminology "(image) admissible Suslin." Originally, as in Dudley (1984, Section 10.3), the definition of image admissible Suslin required that (Ω, A) also be a Suslin space. I thank Uwe Einmahl, who pointed out to me around 1988 that the Suslin assumption on Ω was unnecessary. I also thank Lucien Le Cam for stimulating discussions about this section. Durst and Dudley (1981) gave the example after Theorem 5.3.6.
Effros (1965) is the original paper on the Effros Borel structure. W. Adamski and P. Gaenssler told me in 1992 about the example at the end of the section.
References

*An asterisk indicates a work I have seen discussed in secondary sources but not in the original.
Aumann, R. J. (1961). Borel structures for function spaces. Illinois J. Math. 5, 614-630.
Bahadur, R. R. (1954). Sufficiency and statistical decision functions. Ann. Math. Statist. 25, 423-462.
Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de St.-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Durst, Mark, and Dudley, R. M. (1981). Empirical processes, Vapnik-Chervonenkis classes and Poisson processes. Probab. Math. Statist. (Wroclaw) 1, no. 2, 109-115.
Effros, E. G. (1965). Convergence of closed subsets in a topological space. Proc. Amer. Math. Soc. 16, 929-931.
Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York.
Fisher, Ronald Aylmer (1922). IX. On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London Ser. A 222, 309-368.
Freedman, David A. (1966). On two equivalence relations between measures. Ann. Math. Statist. 37, 686-689.
Gutmann, S. (1981). Unpublished manuscript.
Halmos, Paul R., and Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Ann. Math. Statist. 20, 225-241.
Lehmann, E. L. (1986). Testing Statistical Hypotheses, 2d ed. Repr. 1997, Springer, New York.
Natanson, I. P. (1957). Theory of Functions of a Real Variable, vol. 2, 2d ed. Transl. by L. F. Boron, Ungar, New York, 1961.
Neveu, Jacques (1977). Processus ponctuels. École d'été de probabilités de St.-Flour VI, 1976. Lecture Notes in Math. (Springer) 598, 249-445.
*Neyman, Jerzy (1935). Su un teorema concernente le cosidette statistiche sufficienti. Giorn. Ist. Ital. Attuari 6, 320-334.
Rao, B. V. (1971). Borel structures for function spaces. Colloq. Math. 23, 33-38.
Sainte-Beuve, M.-F. (1974). On the extension of von Neumann-Aumann's theorem. J. Funct. Anal. 17, 112-129.
Strobl, F. (1994). Zur Theorie empirischer Prozesse. Dissertation, Mathematik, Univ. München.
Strobl, F. (1995). On the reversed sub-martingale property of empirical discrepancies in arbitrary sample spaces. J. Theoret. Probab. 8, 825-831.
Ziegler, K. (1994). On functional central limit theorems and uniform laws of large numbers for sums of independent processes. Dissertation, Mathematik, Univ. München.
*Ziegler, K. (1997a). A maximal inequality and a functional central limit theorem for set-indexed empirical processes. Results Math. 31, 189-194.
*Ziegler, K. (1997b). On Hoffmann-Jørgensen-type inequalities for outer expectations with applications. Results Math. 32, 179-192.
6
Limit Theorems for Vapnik-Cervonenkis and Related Classes
6.1 Koltchinskii-Pollard entropy and Glivenko-Cantelli theorems

For central limit theorems over Vapnik-Cervonenkis and certain related classes some good sufficient conditions were proved by Pollard (1982), using the following form of "entropy" or "capacity." Let (X, A) be a measurable space and let F ⊂ L^0(X, A), the space of all real-valued measurable functions on X. Recall that F_F(x) := sup{|f(x)| : f ∈ F} (Section 4.8). Then F_F(x) ≡ ‖δ_x‖_F. A measurable function F ∈ L^0(X, A) with F ≥ F_F is called an envelope function for F. If F_F is A-measurable it is called the envelope function of F. If a law P is given on (X, A), then F_F* for P will be called the envelope function of F for P, defined up to equality P-a.s. Let Γ be the set of all laws on X of the form n^{-1} Σ_{j=1}^n δ_{x(j)} for some x(j) ∈ X, j = 1, ..., n, and n = 1, 2, ..., where the x(j) need not be
distinct. For δ > 0, 0 < p < ∞, and γ ∈ Γ recall (Section 4.8) that if F is an envelope function of F,

(6.1.1)  D_F^{(p)}(δ, F, γ) := sup{m : for some f_1, ..., f_m ∈ F and all i ≠ j, ∫ |f_i - f_j|^p dγ > δ^p ∫ F^p dγ}.

Then log D_F^{(p)}(δ, F, γ) will be called a Koltchinskii-Pollard entropy. Let D_F^{(p)}(δ, F) := sup_{γ∈Γ} D_F^{(p)}(δ, F, γ). Here D_F^{(p)}(δ, F, γ) is a kind of packing number, involving the envelope function F. Logarithms of D_F^{(p)}(δ, F) will appear in Section 6.3.
Let G := {f/F : f ∈ F}, where 0/0 is replaced by 0. Then |g(x)| ≤ 1 for all g ∈ G and x ∈ X. Given F, p, and γ ∈ Γ, let Q(B) := ∫_B F^p dγ / γ(F^p) if γ(F^p) > 0. Then Q := Q_γ is a law, and for 1 ≤ p < ∞,

D_F^{(p)}(δ, F, γ) = D(δ, G, d_{p,Q}),

where (as defined before Theorem 5.2.5) d_{p,Q}(f, g) := (∫ |f - g|^p dQ)^{1/p}. For example, if C is a collection of measurable sets whose union is all of X and F := {1_A : A ∈ C}, the envelope function is F ≡ 1 and D_F^{(p)}(δ, F, γ) = D(δ, F, d_{p,γ}). The next few results will connect D_F^{(p)}(δ, F) with other ways of measuring the size of certain classes F. First we have Vapnik-Cervonenkis classes C of sets with dens(C) ≤ S(C) < +∞ (Corollary 4.1.7):
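The packing quantity in (6.1.1) can be computed by brute force for a small class carried by an empirical measure γ ∈ Γ. Below is a hedged Python sketch (the function name `packing_number` and the half-interval example are ours, not from the text): for indicators of half-intervals with envelope F ≡ 1 and p = 1, pairwise L^1 distances are multiples of 1/n and the largest δ-separated subset is found by exhaustive search.

```python
import itertools
import numpy as np

def packing_number(fvals, env, delta, p=1):
    """D_F^(p)(delta, F, gamma) of (6.1.1), for gamma = the uniform empirical
    measure on the sample underlying the columns of fvals: the largest m such
    that some f_1,...,f_m have pairwise mean|f_i-f_j|^p > delta^p * mean(env^p)."""
    thresh = delta ** p * np.mean(env ** p)
    # pairwise "distances": mean of |f_i - f_j|^p over the sample points
    dist = np.mean(np.abs(fvals[:, None, :] - fvals[None, :, :]) ** p, axis=2)
    for m in range(len(fvals), 1, -1):          # try large subsets first
        for S in itertools.combinations(range(len(fvals)), m):
            if all(dist[i, j] > thresh
                   for i, j in itertools.combinations(S, 2)):
                return m
    return 1

# 10 sample points and the indicators f_k = 1_{x < k/10}, k = 0,...,10
x = (np.arange(10) + 0.5) / 10
fvals = np.array([(x < k / 10).astype(float) for k in range(11)])
env = np.ones_like(x)                            # envelope F == 1

print(packing_number(fvals, env, 0.35))          # -> 3
print(packing_number(fvals, env, 0.05))          # -> 11
```

Shrinking δ lets more functions be packed, in line with the δ^{-pw} growth bounded in Theorem 6.1.2 for VC classes.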
6.1.2 Theorem If C ⊂ A, dens(C) < +∞, 1 ≤ p < ∞, F ∈ L^p(X, A, P), F ≥ 0, and F := {F·1_A : A ∈ C}, then for any w > dens(C) there is a K < ∞ such that

D_F^{(p)}(δ, F) ≤ K δ^{-pw},  0 < δ ≤ 1.

Proof Let γ ∈ Γ and let G be the smallest set with γ(G) = 1. We may assume F(x) > 0 for some x ∈ G, since otherwise for any K ≥ 1,

D_F^{(p)}(δ, F, γ) = 1 ≤ K δ^{-wp}  for 0 < δ ≤ 1.

If C(1), ..., C(m) ∈ C are such that [d_{p,γ}(F·1_{C(i)}, F·1_{C(j)})]^p > δ^p ∫ F^p dγ for i ≠ j, with m maximal, then for Q = Q_γ,

Q(C(i) Δ C(j)) = ∫_{C(i) Δ C(j)} F^p dγ / ∫ F^p dγ > δ^p.

Then by maximality and Theorem 4.6.1, there is a K(w, C) < ∞ such that

D_F^{(p)}(δ, F, γ) ≤ m ≤ D(δ^p, C, d_Q) ≤ K(w, C) δ^{-pw}.

Next, here is a kind of converse to Theorem 6.1.2:
6.1.3 Proposition Suppose 1 ≤ p < ∞, F = {1_B : B ∈ C} for some collection C of sets, F ≡ 1, and for some δ with 0 < δ^p < 1/2, D_F^{(p)}(δ, F) < ∞. Then dens(C) ≤ S(C) < ∞.

Proof First, dens(C) ≤ S(C) by Corollary 4.1.7. Next, suppose C shatters a set G with card G = n = 2^m for some positive integer m. Let γ have mass 1/n at each point of G. Then G has m subsets A(i), i = 1, ..., m, independent for γ, with γ(A(i)) = 1/2, i = 1, ..., m (taking elements of G as strings of
m binary digits, let A(i) be the set where the ith digit is 1). Then for i ≠ k, ∫ |1_{A(i)} - 1_{A(k)}|^p dγ = 1/2. So for each j in (6.1.1) we can take f_j = 1_{A(j)}. Thus

D_F^{(p)}(δ, F) ≥ D_F^{(p)}(δ, F, γ) ≥ m.

For large m this is impossible, so S(C) < ∞.
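The digit construction in this proof can be checked numerically. A small Python sketch (variable names ours, assuming nothing beyond the proof's construction): code G = {0, 1}^m as the integers 0, ..., 2^m - 1, let A(i) be the set where the ith binary digit is 1, and verify γ(A(i)) = 1/2, pairwise independence, and ∫ |1_{A(i)} - 1_{A(k)}| dγ = 1/2 under the uniform law γ.

```python
import itertools
import numpy as np

m = 4
n = 2 ** m                      # card G = 2^m = 16
# row i of A is the indicator vector of A(i) = {x in G : i-th digit of x is 1}
A = np.array([[(x >> i) & 1 for x in range(n)] for i in range(m)], dtype=float)

for i in range(m):
    assert A[i].mean() == 0.5                    # gamma(A(i)) = 1/2
for i, k in itertools.combinations(range(m), 2):
    # digits i and k differ on exactly half of G
    assert np.abs(A[i] - A[k]).mean() == 0.5
    # independence: gamma(A(i) intersect A(k)) = 1/4
    assert (A[i] * A[k]).mean() == 0.25
print("checked m =", m)
```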
Next let us consider families of functions of the form F = {F·g : g ∈ G}, where G is a family of functions totally bounded in the supremum norm ‖g‖_sup := sup_{x∈X} |g(x)|, with the associated metric d_sup(g, h) := ‖g - h‖_sup, and with ‖g‖_sup ≤ 1 for all g ∈ G.

6.1.4 Proposition If 1 ≤ p < ∞ and 0 < ε ≤ 1 then for any such F, D_F^{(p)}(ε, F) ≤ D(ε, G, d_sup).

Proof If d_sup(g, h) ≤ δ then for all x, |Fg - Fh|^p(x) ≤ δ^p F(x)^p, so the result follows from the definition (6.1.1).
Next we come to the Koltchinskii-Pollard method of symmetrization of empirical measures. Given n = 1, 2, ..., let x_1, ..., x_{2n} be coordinates on (X^{2n}, A^{2n}, P^{2n}), hence i.i.d. P. Let σ(1), ..., σ(n) be random variables independent of each other and of the x_i with Pr(σ(i) = 2i) = Pr(σ(i) = 2i - 1) = 1/2, i = 1, ..., n. Let τ(i) := 2i if σ(i) = 2i - 1 and τ(i) := 2i - 1 if σ(i) = 2i. Let x(i) := x_i. Then the x(σ(j)) are i.i.d. P. Let

P_n' := n^{-1} Σ_{j=1}^n δ_{x(σ(j))},   P_n'' := n^{-1} Σ_{j=1}^n δ_{x(τ(j))},
ν_n' := n^{1/2}(P_n' - P),   ν_n'' := n^{1/2}(P_n'' - P),
P_n^0 := P_n' - P_n'',   ν_n^0 := n^{1/2} P_n^0.

Note that P_n'' = 2P_{2n} - P_n' and that ν_n' and ν_n'' are two independent copies of ν_n. Here is a symmetrization fact (a more general one is Lemma 11.2.4):
6.1.5 Symmetrization lemma Let ζ > 0 and F ⊂ L²(X, A, P) with ∫ |f|² dP ≤ ζ² for all f ∈ F. Assume F is image admissible Suslin via some (Y, S, T). Then for any η > 0,

Pr{‖ν_n^0‖_F > η} ≥ (1 - ζ²/η²) Pr{‖ν_n‖_F > 2η}.

Proof Note that the given events are measurable by Corollary 5.3.5. For x = (x_1, ..., x_{2n}) let H := {(x, f) : |ν_n''(f)| > 2η, f ∈ F}. Then by admissibility, H is a product measurable subset of X^{2n} × F. The Suslin property implies (Theorem 5.3.2) that there is a universally measurable selector h such that whenever (x, f) ∈ H for some f ∈ F, h(x) ∈ F and (x, h(x)) ∈ H. Let x_τ := (x_{τ(1)}, ..., x_{τ(n)}). Then for some function h_1, h(x) = h_1(x_τ), where h_1(y) is defined if and only if y ∈ J for some u.m. set J ⊂ X^n (by Theorem 5.3.2). Let T_n be the smallest σ-algebra with respect to which x_τ(·) is measurable. Then on the set where x_τ ∈ J, since ν_n^0 = ν_n' - ν_n'',

Pr(‖ν_n^0‖_F > η | T_n) ≥ Pr(|ν_n'(h_1(x_τ))| ≤ η | T_n).

Given T_n, that is, given x_τ, h_1(x_τ)(·) is a fixed function f ∈ F with ∫ |f|² dP ≤ ζ². Then since ν_n' is independent of T_n, we can apply Chebyshev's inequality to obtain

Pr(|ν_n'(f)| ≤ η) ≥ 1 - (ζ/η)².

Integrating gives Pr{‖ν_n^0‖_F > η} ≥ (1 - ζ²/η²) Pr{‖ν_n''‖_F > 2η}, which, since ν_n'' is a copy of ν_n, gives the result.
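The σ, τ pairing construction is easy to simulate. A hedged Python sketch (names ours): draw x_1, ..., x_{2n} i.i.d. from P = U[0, 1], flip a fair coin in each pair {2i - 1, 2i}, and check the identity P_n'' = 2P_{2n} - P_n' on a set of the form {x ≤ t}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=2 * n)             # x_1, ..., x_2n i.i.d. P = U[0,1]

# sigma(i) picks one index of the pair {2i-1, 2i} (0-based: {2i, 2i+1})
coin = rng.integers(0, 2, size=n)
sigma = 2 * np.arange(n) + coin
tau = 2 * np.arange(n) + 1 - coin       # the other index of each pair

t = 0.3                                  # test on the set A = {x <= t}
Pn1 = np.mean(x[sigma] <= t)             # P_n'(A)
Pn2 = np.mean(x[tau] <= t)               # P_n''(A)
P2n = np.mean(x <= t)                    # P_2n(A)

# P_n'' = 2 P_2n - P_n' as measures, hence on the set A
assert abs(Pn2 - (2 * P2n - Pn1)) < 1e-12
nu0 = np.sqrt(n) * (Pn1 - Pn2)           # nu_n^0(A), typically O(1)
assert abs(nu0) <= np.sqrt(n)
```

The identity holds exactly because σ and τ together enumerate all 2n indices, one pair at a time.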
Some reverse martingale and submartingale properties of the empirical measures P_n will be proved. Recall that Q(f) := ∫ f dQ for any f ∈ L^1(Q), and that in defining empirical measures P_n := n^{-1} Σ_{j≤n} δ_{X_j}, the X_j are always (in this book) taken as coordinates on a product of copies of a probability space (X, A, P), so that the underlying probability measure for P_n is P^n, a product of n copies of P. The definitions of (reversed) (sub)martingale (e.g., RAP, Sections 10.3 and 10.6), for random variables Y_n and σ-algebras A_n, will be slightly extended here by allowing Y_n to be measurable for the completion of A_n.

6.1.6 Theorem Let (X, A, P) be a probability space, let F ⊂ L^1(P), and let P_n be empirical measures for P. Let S_n be the smallest σ-algebra for which P_k(f) are measurable for all k ≥ n and f ∈ L^1(X, A, P). Then: (a) For any f ∈ F, {P_n(f), S_n}_{n≥1} is a reversed martingale; in other words,

E(P_{n-1}(f) | S_n) = P_n(f)  a.s. if n ≥ 2.
(b) (F. Strobl) Suppose that F has an envelope function F ∈ L^1(X, A, P) and that for each n, ‖P_n - P‖_F is measurable for the completion of P^n. Then {‖P_n - P‖_F, S_n}_{n≥1} is a reversed submartingale; in other words,

‖P_n - P‖_F ≤ E(‖P_k - P‖_F | S_n)  a.s. for k ≤ n.
Proof For each n, any set in S_n is invariant under permutations of the first n coordinates X_i, where P_n := n^{-1}(δ_{X_1} + ... + δ_{X_n}). So if 1 ≤ i ≤ j ≤ n, E(f(X_i) | S_n) = E(f(X_j) | S_n). Summing over i = 1, ..., k and dividing by k gives E(P_k(f) | S_n) = E(f(X_1) | S_n); taking k = n - 1 and n proves (a).

Proof of (b): clearly ‖P‖_F ≤ PF < ∞ and ‖P_n‖_F ≤ P_nF. Since by assumption ‖P_n - P‖_F is completion measurable, we have E‖P_n - P‖_F ≤ 2PF < ∞. For each rational q ≥ 0, the set A_q := {‖P_n - P‖_F > q} admits, by completion measurability, sets C_q and D_q which are A^n measurable with C_q ⊂ A_q ⊂ D_q and P^n(D_q \ C_q) = 0. Each image of C_q by an element of Π_n, the group of permutations of the first n coordinates, is also A^n measurable and included in A_q. The union U_q of these n! images, being A^n measurable and symmetric under all n! permutations of 1, ..., n, is S_n measurable by Theorem 5.1.9 and included in A_q. Likewise, the intersection F_q of all images of D_q by transformations in Π_n is S_n measurable and includes A_q. Clearly P^n(F_q \ U_q) = 0, so A_q and ‖P_n - P‖_F are measurable for the completion of S_n as desired.
For i = 1, ..., n + 1 let P_{n,i} := n^{-1} Σ{δ_{X_j} : j ≤ n + 1, j ≠ i}. Then since the X_j are coordinates on a product space, P_{n,i} has the same properties as P_n for each i, and P_{n,n+1} = P_n. Thus ‖P_{n,i} - P‖_F is completion measurable with respect to A^{n+1}, the product σ-algebra for the first n + 1 coordinates. Now, the conditional expectations E(‖P_{n,i} - P‖_F | S_{n+1}) are all the same (equal to each other almost surely) for i = 1, ..., n + 1, so all are a.s. equal to E(‖P_n - P‖_F | S_{n+1}), because for each i, a transformation defined by a permutation of the first n + 1 coordinates takes P_{n,i} into P_n while leaving all sets in the σ-algebra S_{n+1} invariant.
Then for any n = 1, 2, ..., we have

‖P_{n+1} - P‖_F = ‖(n+1)^{-1} Σ_{i=1}^{n+1} (P_{n,i} - P)‖_F ≤ (n+1)^{-1} Σ_{i=1}^{n+1} ‖P_{n,i} - P‖_F,

and

‖P_{n+1} - P‖_F = E(‖P_{n+1} - P‖_F | S_{n+1})
  ≤ (n+1)^{-1} Σ_{i=1}^{n+1} E(‖P_{n,i} - P‖_F | S_{n+1})
  = E(‖P_n - P‖_F | S_{n+1})

a.s., finishing the proof.
Here is a law of large numbers (generalized Glivenko-Cantelli theorem):

6.1.7 Theorem Let (X, A, P) be a probability space, let F ∈ L^1(X, A, P), and let F be a collection of measurable functions on X having F as an envelope function. Suppose F is image admissible Suslin via (Y, S, T). Assume that

(6.1.8)  D_F^{(1)}(δ, F) < ∞ for all δ > 0.

Then lim_{n→∞} ‖P_n - P‖_F = 0 a.s.

Proof By Theorem 6.1.6, {‖P_n - P‖_F, S_n}_{n≥1} is a reversed submartingale. Being nonnegative, it converges almost surely and in L^1 (RAP, Theorem 10.6.4). It will now be enough to show that ‖P_n - P‖_F → 0 in probability. Given ε > 0, take M large enough so that P(F·1_{F>M}) < ε/4. Then

‖(P_n - P)1_{F>M}‖_F ≤ ‖P_n 1_{F>M}‖_F + P(F·1_{F>M}) ≤ (P_n + P)(F·1_{F>M}) → 2P(F·1_{F>M}) < ε/2

almost surely. Replacing each f ∈ F by f·1_{F≤M} takes F onto another class of functions which is still image admissible Suslin, since (x, y) → T(y)(x)·1_{F(x)≤M} is jointly measurable while the Suslin space (Y, S) is unchanged. So we may assume F ≤ M. Next apply Lemma 6.1.5 with ζ = M and η = n^{1/2}ε/4, which is ≥ 2M for n large. Then it is enough to show that ‖P_n^0‖_F = ‖P_n' - P_n''‖_F → 0 in probability.
In the definition of D_F^{(1)}(δ, F, γ) take δ = ε/(9M) and γ = P_{2n}. Choose functions f_1, ..., f_m to satisfy (6.1.1), with

m = D_F^{(1)}(δ, F, γ) ≤ D_F^{(1)}(δ, F).

One can choose the f_i := f_i(x) to depend measurably on x := (x_1, ..., x_{2n}) as follows. For each k ≥ 1, the measurable space (Y, S)^k is Suslin. Let

B_k := {(x, y_1, ..., y_k) : P_{2n}(|T(y_i) - T(y_j)|) > δ P_{2n}(F), i ≠ j}.

Then B_k is product measurable by image admissibility. Let y^{(k)} := (y_1, ..., y_k) and let A_k denote the projection {x : (x, y^{(k)}) ∈ B_k for some y^{(k)}}, k ≥ 2; A_1 := X^{2n}. Then by selection (Theorem 5.3.2), A_k is a u.m. set and there is a universally measurable function η_k from A_k into Y^k such that (x, η_k(x)) ∈ B_k for all x ∈ A_k. Let η(x) := η_k(x) if and only if x ∈ A_k \ A_{k+1}, k ≥ 1. For each x let k(x) be the unique k such that x ∈ A_k \ A_{k+1}. Thus we have the following:
6.1.9 Lemma For each x ∈ X^{2n} and P_{2n} := (2n)^{-1} Σ_{i≤2n} δ_{x_i}, D_F^{(1)}(δ, F, P_{2n}) = k(x), where k(·) is universally measurable and, for k(x) = k, η_k is a universally measurable function from X^{2n} into Y^k such that the functions f_i in (6.1.1) can be taken as T(η_k(x)_i), i = 1, ..., k.
Now for all f ∈ F and m = D_F^{(1)}(δ, F, P_{2n}), min_{j≤m} P_{2n}(|f - f_j|) ≤ δ P_{2n}(F). For each j at which the minimum occurs,

|P_n^0(f) - P_n^0(f_j)| = |P_n'(f - f_j) - P_n''(f - f_j)| ≤ (P_n' + P_n'')(|f - f_j|) = 2P_{2n}(|f - f_j|) ≤ 2δ P_{2n}(F).

As n → ∞, almost surely 2δ P_{2n}(F) → 2δ P(F) < ε/4. Since m remains bounded as n → ∞, we have by Chebyshev's inequality

Pr{max_{j≤m} |P_n^0(f_j)| > ε/4} ≤ m sup{Pr{|P_n^0(f)| > ε/4} : |f| ≤ M} ≤ 2mM²(4/ε)²/n < ε/4
for n large enough, and then Pr{‖P_n^0‖_F > ε/2} < ε/2, proving Theorem 6.1.7.
6.1.10 Corollary Let (X, A) be a measurable space and C ⊂ A, where S(C) < ∞ and C is image admissible Suslin. Then for any probability law P on A, we have

lim_{n→∞} sup_{A∈C} |(P_n - P)(A)| = 0  a.s.

Proof In Theorem 6.1.7, take F ≡ 1 and apply Theorem 6.1.2 with p = 1.
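With C = {(-∞, t] : t ∈ R}, a VC class with S(C) = 1, Corollary 6.1.10 reduces to the classical Glivenko-Cantelli theorem. A hedged Monte Carlo sketch (names ours) computes sup_{A∈C} |(P_n - P)(A)| = sup_t |F_n(t) - t| exactly from order statistics for U[0, 1] samples and checks the roughly n^{-1/2} decay.

```python
import numpy as np

rng = np.random.default_rng(1)

def discrepancy(n):
    """sup_t |F_n(t) - t| for a U[0,1] sample, exact via order statistics."""
    xs = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - xs), np.max(xs - (i - 1) / n))

d50 = np.mean([discrepancy(50) for _ in range(200)])
d5000 = np.mean([discrepancy(5000) for _ in range(200)])
assert d5000 < d50 / 5        # a 100-fold sample gives roughly a 10-fold drop
```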
6.2 Vapnik-Cervonenkis-Steele laws of large numbers

Let (X, A, P) be a probability space and C ⊂ A. Let {x_n}_{n≥1} be coordinates in (X^∞, A^∞, P^∞), so that the x_j are i.i.d. P. For certain classes C with S(C) = +∞, one will have Δ^C({x_1, ..., x_n}) < 2^n with P^n-probability converging to 1. For such classes, with sufficient measurability properties, a law of large numbers will still hold. Here a main result is as follows:

6.2.1 Theorem If (X, A, P) is any probability space, C ⊂ A, and C is image admissible Suslin, then the following are equivalent:
(a) ‖P_n - P‖_C → 0 a.s. as n → ∞;
(b) ‖P_n - P‖_C → 0 in probability as n → ∞;
(c) lim_{n→∞} n^{-1} E log Δ^C({x_1, ..., x_n}) = 0.
For a finite set F ⊂ X and a collection C ⊂ 2^X, let k^C(F) := S({C ∩ F : C ∈ C}), the largest cardinality of a subset of F shattered by C.

6.2.2 Lemma Under the hypotheses of Theorem 6.2.1, Δ^C({x_1, ..., x_n}) and k^C({x_1, ..., x_n}) are universally measurable.

Proof Let C be image admissible Suslin via (Y, S, T). For any decomposition of {1, ..., n} into subsets, there is a measurable set of ordered n-tuples (x_1, ..., x_n) such that x_i = x_j if and only if i and j belong to the same subset. It is enough to prove the lemma on each such measurable set; specifically, on the set on which the x_i are all different. For each J ⊂ {1, ..., n}, the set U(J) := {(x, y) : x_j ∈ T(y) if and only if j ∈ J} is product measurable by image admissibility. Thus its projection Π U(J) into X^n is universally measurable by Theorem 5.3.2. Now

Δ^C({x_1, ..., x_n}) = Σ_J 1_{Π U(J)}(x)

and

(6.2.3)  {k^C({x_1, ..., x_n}) ≥ m} = ∪_{|G|=m} ∩_{H⊂G} ∪_{J: J∩G=H} Π U(J),

the union being over subsets G of {1, ..., n} with m elements.
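For a concrete class, Δ^C and k^C can be computed by brute force. A hedged Python sketch (names ours) for C = {(-∞, t] : t ∈ R}: n distinct points have Δ^C = n + 1, only singletons are shattered, and Sauer's bound Δ^C ≤ Σ_{j≤k^C} C(n, j) holds (here with equality).

```python
import itertools
from math import comb
import numpy as np

def traces(points):
    """distinct intersections {x_i : x_i <= t} with half-lines (-inf, t]"""
    xs = np.asarray(points, dtype=float)
    ts = np.concatenate([xs, [xs.max() + 1.0, xs.min() - 1.0]])
    return {tuple(bool(b) for b in (xs <= t)) for t in ts}

def delta_C(points):
    return len(traces(points))              # Delta^C({x_1,...,x_n})

def k_C(points):
    """largest cardinality of a subset shattered by half-lines"""
    xs = list(points)
    for m in range(len(xs), 0, -1):
        for G in itertools.combinations(xs, m):
            if len(traces(G)) == 2 ** m:    # G is shattered
                return m
    return 0

pts = [0.1, 0.4, 0.7, 0.9]
assert delta_C(pts) == 5                    # n + 1 traces on distinct points
assert k_C(pts) == 1                        # half-lines shatter only singletons
assert delta_C(pts) <= sum(comb(len(pts), j) for j in range(k_C(pts) + 1))
```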
The main step in Steele's proof uses Kingman's subadditive ergodic theorem (the equivalent superadditive ergodic theorem is Theorem 10.7.1 in RAP). To state it, here is some terminology. Let N denote the set of nonnegative integers. A subadditive process is a doubly indexed set {x_{mn}}_{0≤m<n}, m, n ∈ N, of random variables such that

(6.2.4)  x_{kn} ≤ x_{km} + x_{mn}  whenever k < m < n.

Let x_{nn} := 0 for all n ∈ N. If instead of (6.2.4),

(6.2.5)  x_{kn} ≥ x_{km} + x_{mn},  k < m < n,

then {x_{mn}}_{0≤m<n} is called superadditive. A process which is both subadditive and superadditive is additive and can clearly be written as x_{kn} = Σ_{k≤j<n} x_{j,j+1}. The process is called stationary if there is a measure-preserving transformation V of the underlying probability space (Ω, B, Pr) onto itself such that x_{m+1,n+1} = x_{mn} ∘ V for 0 ≤ m < n. Let S be the σ-algebra of all B ∈ B such that V^{-1}(B) = B. Another useful hypothesis for subadditive processes is:

(6.2.6)  For each n ∈ N, E|x_{0n}| < +∞, and K := inf_{n≥1} E x_{0n}/n > -∞.

A σ-algebra D ⊂ B will be called degenerate if Pr(D) = 0 or 1 for all D ∈ D.

6.2.7 Theorem (Kingman's subadditive ergodic theorem) Let {x_{mn}}_{0≤m<n} be a stationary subadditive process satisfying (6.2.6). Then as n → ∞, x_{0n}/n converges a.s. and in L^1 to a random variable y := lim_{n→∞} n^{-1} E(x_{0n} | S), with Ey = K. If S is degenerate, y = K a.s.
Proof Theorem 10.7.1 in RAP applies to the superadditive process there defined as -x_{mn}. From near the end of the proof (RAP, p. 296), we have Ey ≤ E x_{0n}/n for all n and E x_{0n}/n → Ey as n → ∞, so Ey = K. To apply Theorem 6.2.7 in proving Theorem 6.2.1 we have:
6.2.8 Theorem Let (Ω, T, Pr) be a probability space, (X, A) a measurable space, X_1 a measurable function from Ω into X with law P, and V a measure-preserving transformation of Ω onto itself. Let X_j := X_1 ∘ V^{j-1} for j = 2, 3, .... Let C be image admissible Suslin and ∅ ≠ C ⊂ A. Then each of the following is a stationary subadditive process satisfying (6.2.6):
(a) D_{mn} := sup_{A∈C} |Σ_{m<j≤n} (1_A(X_j) - P(A))|;
(b) log Δ^C({X_j : m < j ≤ n});
(c) k^C({X_j : m < j ≤ n}).
Proof We have measurability by Corollary 5.3.5 in (a) and by Lemma 6.2.2 in (b) and (c). Stationarity clearly holds for the same V in each case. Subadditivity is clear in (a) and not difficult for (b) and (c). All three processes are nonnegative; in (b), C nonempty implies Δ^C ≥ 1, so (6.2.6) holds.

Now let us prove Theorem 6.2.1. The quantities ‖P_n - P‖_C are completion measurable by Corollary 5.3.5. The probability space, in our standard model, is a countable Cartesian product of copies of (X, A, P), with coordinates x_1, x_2, .... The measure-preserving transformation will be V({x_j}_{j≥1}) := {x_{j+1}}_{j≥1}. By the Kolmogorov 0-1 law (RAP, 8.4.4), S is degenerate. Then by Theorems 6.2.7 and 6.2.8, we have almost sure limits

(6.2.9)  c_1 := lim_{n→∞} D_{0n}/n,  c_2 := lim_{n→∞} (log Δ^C_{0n})/n,  c_3 := lim_{n→∞} k^C_{0n}/n,

writing Δ^C_{0n} := Δ^C({x_1, ..., x_n}) and k^C_{0n} := k^C({x_1, ..., x_n}), for some constants c_i. Thus in Theorem 6.2.1, (a) and (b) are equivalent (as was also shown in the proof of Theorem 6.1.7 via the reversed submartingale property). It remains to prove (b) equivalent to (c). To show (c) implies (b), given 0 < ε < 1, first apply the symmetrization lemma (6.1.5) with ζ = 1 and η = εn^{1/2}/2 for n large enough so that η ≥ 2, giving

Pr{‖P_n - P‖_C > ε} ≤ 2 Pr{‖P_n' - P_n''‖_C > ε/2}.
Thus it suffices to show that ‖P_n' - P_n''‖_C → 0 in probability (these variables are universally measurable by Corollary 5.3.5). Let A_n be the smallest σ-algebra for which x_1, ..., x_n are measurable. For any fixed set A ∈ A, the conditional probability

Pr_{A,ε,n} := Pr{|(P_n' - P_n'')(A)| > ε | A_{2n}} = Pr{|Σ_{j=1}^n s_j (δ_{x_{2j}} - δ_{x_{2j-1}})(A)| > nε | A_{2n}},

where s_j := 2·1_{σ(j)=2j} - 1 = ±1 with probability 1/2 each, independently of each other (Rademacher variables) and of the x_i. Thus by Hoeffding's inequality, Proposition 1.3.5, with a_i = -1, 0, or 1, Pr_{A,ε,n} ≤ 2 exp(-nε²/2). For some n_0, E log Δ^C_{0,2n} ≤ 2nε⁴ for n ≥ n_0. Then by Markov's inequality,

(6.2.10)  Pr{log Δ^C_{0,2n} > 2nε³} ≤ ε.

Given A_{2n}, the event |(P_n' - P_n'')(A)| > ε is the same for any two sets A having the same intersection with {x_1, ..., x_{2n}}. Thus

Pr{‖P_n' - P_n''‖_C > ε | A_{2n}} ≤ 2 Δ^C_{0,2n} exp(-nε²/2).

Hence, for ε ≤ 1/8 and n ≥ n_0 large enough so that 2 exp(-nε²/4) < ε, we have by (6.2.10)

Pr{‖P_n' - P_n''‖_C > ε} ≤ ε + 2 exp(2nε³ - nε²/2) ≤ ε + 2 exp(-nε²/4) < 2ε,

so (c) implies (b).
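The exponential bound used here, Pr{|Σ_j s_j a_j| > nε} ≤ 2 exp(-nε²/2) for Rademacher signs s_j and coefficients |a_j| ≤ 1 (Hoeffding), can be sanity-checked by simulation; this Python sketch (names ours) compares an empirical tail with the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps, reps = 200, 0.2, 20000
a = rng.choice([-1.0, 0.0, 1.0], size=n)         # fixed coefficients, |a_j| <= 1
s = rng.choice([-1.0, 1.0], size=(reps, n))      # Rademacher signs
tail = np.mean(np.abs(s @ a) > n * eps)          # empirical tail probability
bound = 2 * np.exp(-n * eps ** 2 / 2)            # Hoeffding bound, about 0.037
assert tail <= bound
```

The bound is loose: the true tail here is two orders of magnitude smaller, which is why the union bound over Δ^C sets still leaves room.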
Now let us show (b) implies (c). By (6.2.9) we have n^{-1} log Δ^C_{0n} → c a.s. for some constant c := c_2 ≥ 0, with n^{-1} log Δ^C_{0n} ≤ log 2 ≤ 1. Thus n^{-1} E log Δ^C_{0n} → c and we want to prove c = 0. Suppose c > 0. Given ε > 0, for n large enough,

Pr{(2n)^{-1} log Δ^C_{0,2n} > c/2} = Pr{Δ^C_{0,2n} > e^{nc}} > 1 - ε.

Next, to symmetrize,

Pr{‖P_n' - P_n''‖_C > 2ε} ≤ Pr{‖P_n' - P‖_C > ε} + Pr{‖P_n'' - P‖_C > ε} = 2 Pr{‖P_n - P‖_C > ε}.

So, to contradict (b), it will suffice to prove that ‖P_n' - P_n''‖_C does not converge to 0 in probability. If 2 ≤ k := [αn], where [x] is the largest integer ≤ x and 0 < α < 1/2, then αn ≤ k + 1 ≤ 3k/2,
so by the inequality (4.1.5) and Stirling's formula (Theorem 1.3.13) we have Σ_{j≤k} C(2n, j) ≤ (3e/α)^{αn} whenever

(6.2.11)  n ≥ 2/α  and  (3e/α)^{αn} < e^{nc},

which holds for all large n once α is small enough that α log(3e/α) < c.
Hence by Sauer's lemma (4.1.2), if Δ^C_{0,2n} > e^{nc} then k^C_{0,2n} ≥ [αn]. Fix n satisfying (6.2.11) and let k := [αn]. Now on an event U with Pr(U) > 1 - ε, there is a subset T of the indices {1, ..., 2n} such that card T = k, C shatters {x_i : i ∈ T}, and x_i ≠ x_j for i ≠ j in T. If there is more than one such T, let us select each of the possible T's with equal probability, using for this a random variable η independent of the x_j and σ(j), 1 ≤ j ≤ 2n. Then since the x_j are i.i.d., T is uniformly distributed over its C(N, k) possible values, N := 2n. For any distinct j_1, j_2, ... ∈ {1, ..., 2n} we have, where the following equations are conditional on U,

(6.2.12)  Pr(j_1 ∈ T) = k/N,  Pr(j_1, j_2 ∈ T) = k(k-1)/(N(N-1)),
          Pr(j_i ∈ T, i = 1, 2, 3, 4) = C(N-4, k-4)/C(N, k) = k(k-1)(k-2)(k-3)/(N(N-1)(N-2)(N-3)).

Let M_n be the number of values of j ≤ n such that both 2j - 1 and 2j are in T. Then from (6.2.12),

E M_n = n·k(k-1)/(N(N-1)) = k(k-1)/(2(N-1)) = α²n/4 + O(1),  n → ∞;
E M_n² = E M_n + n(n-1)·k(k-1)(k-2)(k-3)/(N(N-1)(N-2)(N-3)) = (E M_n)² + O(n),

and a bit of algebra gives σ²(M_n) = E M_n² - (E M_n)² = O(n) as n → ∞. Thus for 0 < δ < α²/4, by Chebyshev's inequality, Pr(M_n ≥ δn) ≥ 1 - 2ε for n large. On the event U ∩ {M_n ≥ δn}, let us make a measurable selection (5.3.2)
of a sequence J of [δn] values of i such that J' := ∪_{i∈J} {2i - 1, 2i} ⊂ T. Here M_n and J are independent of the σ(j). Now measurably select, by Theorem 5.3.2 again, a set A = A(ω) = T(y(ω)) ∈ C such that

{j ∈ J' : x_j ∈ A} = {σ(i) : i ∈ J}.

Then Σ_{i∈J} (δ_{x_{σ(i)}} - δ_{x_{τ(i)}})(A) = [δn]. Here A(·) is measurable for the σ-algebra B_J generated by all the x_j, by η, and by σ(i) for i ∈ J. Conditional on B_J,
Σ_{i∉J} (δ_{x_{σ(i)}} - δ_{x_{τ(i)}})(A) = Σ_{i∉J} s_i a_i,
where the a_i are B_J-measurable functions with values -1, 0, and 1, and the s_i have values ±1 with probability 1/2 each, independently of each other and of B_J. Thus by Chebyshev's inequality,

Pr{|Σ_{i∉J} s_i a_i| > nδ/3 | B_J} ≤ 9/(nδ²)

on the event where J is defined. Thus for n large,

Pr{(P_n' - P_n'')(A(ω)) > δ/3} ≥ 1 - 3ε,

so ‖P_n' - P_n''‖_C does not converge to 0 in probability, the desired contradiction. So Theorem 6.2.1 is proved.
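The hypergeometric moments of M_n can be checked by simulation. A hedged Python sketch (names ours): draw T uniformly among the k-subsets of {1, ..., N}, N = 2n, count the complete pairs {2j - 1, 2j} ⊂ T, and compare with E M_n = k(k - 1)/(2(N - 1)).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, reps = 40, 16, 20000                # k = [alpha*n] with alpha = 0.4
N = 2 * n
M = np.empty(reps)
for r in range(reps):
    in_T = np.zeros(N, dtype=bool)
    in_T[rng.choice(N, size=k, replace=False)] = True
    # the pairs {2j-1, 2j} are (0,1), (2,3), ... in 0-based indexing
    M[r] = np.count_nonzero(in_T[0::2] & in_T[1::2])
exact = k * (k - 1) / (2 * (N - 1))       # = 240/158, about 1.52
assert abs(M.mean() - exact) < 0.05
assert M.var() < 3 * n                    # consistent with sigma^2(M_n) = O(n)
```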
6.2.13 Theorem In (6.2.9), if any c_i is 0, all three are 0.
Proof In the last proof, we saw that c_1 = 0 if and only if c_2 = 0, and, just after (6.2.11), that if c_2 > 0 then c_3 > 0. On the other hand, if c_3 > 0 then for some δ > 0, k^C_{0n} ≥ δn for n large enough a.s., and then Δ^C_{0n} ≥ 2^{δn}. Thus
c_2 ≥ c_3 log 2 > 0.

6.3 Pollard's central limit theorem

By way of the Koltchinskii-Pollard kind of entropy and law of large numbers (Section 6.1) the following will be proved:
6.3.1 Theorem (Pollard) Let (X, A, P) be a probability space and F ⊂ L²(X, A, P). Let F be image admissible Suslin via (Y, S, T) and have an envelope function F ∈ L²(X, A, P). Suppose that

(6.3.2)  ∫_0^1 (log D_F^{(2)}(x, F))^{1/2} dx < ∞.

Then F is a Donsker class for P.

The hypothesis (6.3.2) will be called Pollard's entropy condition. Before the proof of the theorem, here is a consequence:
6.3.3 Theorem (Jain and Marcus) Let (K, d) be a compact metric space. Let C(K) be the space of continuous real functions on K with supremum norm. Let X_1, X_2, ... be i.i.d. random variables in C(K). Suppose EX_1(t) = 0 and EX_1(t)² < ∞ for all t ∈ K. Assume that for some random variable M with EM² < ∞,

|X_1(s) - X_1(t)|(ω) ≤ M(ω) d(s, t)

for all ω and all s, t ∈ K. Suppose that

(6.3.4)  ∫_0^1 (log D(ε, K, d))^{1/2} dε < ∞.

Then the central limit theorem holds; that is, in C(K), L(n^{-1/2}(X_1 + ... + X_n)) converges to some Gaussian law.
Proof For a real-valued function h on K recall (RAP, Section 11.2) that

‖h‖_L := sup_{s≠t} |h(s) - h(t)|/d(s, t),  ‖h‖_sup := sup_t |h(t)|,
‖h‖_BL := ‖h‖_L + ‖h‖_sup,  BL(K) := {h ∈ C(K) : ‖h‖_BL < ∞}.

To apply Theorem 6.3.1, take as probability space X = BL(K), and let A be the σ-algebra induced by the Borel sets of ‖·‖_sup (or equivalently by the evaluations at points of K). Let P = L(X_1), F(h) := ‖h‖_BL, h ∈ X. Then for any s ∈ K,

E ‖X_1‖²_sup ≤ 2E(X_1(s)² + M(ω)² sup_t d(s, t)²) < ∞

and E ‖X_1‖²_L ≤ E M(ω)² < ∞, so EF² < ∞ (note that F is measurable, F ∈ L²(X, A, P)). Let G be the collection of functions δ_t/F : h ↦ h(t)/F(h), where we replace h(t)/F(h) by 0 if h = 0 in BL(K), and t runs through K. Then |g(h)| ≤ 1 for all g ∈ G and h ∈ X. Let

F := {Fg : g ∈ G} = {δ_t := (h ↦ h(t)) : t ∈ K}.

For any s, t ∈ K and h ∈ X, |(δ_s/F - δ_t/F)(h)| ≤ d(s, t). Then by Proposition 6.1.4, for 0 < x ≤ 1,

D_F^{(2)}(x, F) ≤ D(x, G, d_sup) ≤ D(x, K, d).

Thus (6.3.2) holds and Theorem 6.3.1 applies to give that F is a Donsker class. Since F is the set of evaluations at points of K, uniform convergence over F (as in the definition of Donsker class) implies uniform convergence of functions on K. Since BL(K) ⊂ C(K), which is complete for uniform convergence, the limiting Gaussian process G_P for our Donsker class must also have sample functions in C(K) (almost surely). Since C(K) is separable, the laws L(n^{-1/2}(X_1 + ... + X_n)) are defined on all Borel sets of C(K) and converge to L(G_P).
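For K = [0, 1] with the usual metric, D(ε, K, d) ≤ 1 + 1/ε, so the entropy condition (6.3.4) holds: the integrand grows only like (log(1/ε))^{1/2} as ε ↓ 0. A hedged numeric sketch (names ours) evaluates the integral by the trapezoid rule.

```python
import numpy as np

# packing numbers of K = [0,1]: at most 1 + 1/eps points pairwise > eps apart
eps = np.linspace(1e-4, 1.0, 200000)
f = np.sqrt(np.log(1.0 + 1.0 / eps))
# trapezoid rule for the entropy integral of (6.3.4)
integral = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(eps)))
assert np.isfinite(integral) and 0.5 < integral < 2.0
```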
Remark. In the situation of Theorem 6.3.3, K may be given originally with a metric e. The metric d may then be chosen, perhaps as a function d = f(e), where f(x) may approach 0 slowly as x ↓ 0, for example f(x) = x^ε for ε > 0 or f(x) = 1/max(|log x|, 2). Thus one can increase the possibilities for obtaining the Lipschitz property of X_1 with respect to d, so long as (6.3.4) holds for d. Now to prove Theorem 6.3.1 we first have:
6.3.5 Lemma Let (X, A, P) be a probability space, F ∈ L²(X, A, P), and F ⊂ L²(X, A, P) having envelope function F. Let H := 4F² and H := {(f - g)² : f, g ∈ F}. Then 0 ≤ φ(x) ≤ H(x) for all φ ∈ H and x ∈ X, and for any δ > 0, D_H^{(1)}(4δ, H) ≤ D_F^{(2)}(δ, F)².
Proof Clearly 0 ≤ φ ≤ H for φ ∈ H. Given any γ ∈ Γ, choose m ≤ D_F^{(2)}(δ, F) maximal and f_1, ..., f_m ∈ F such that (6.1.1) holds with p = 2. For any f, g ∈ F, take i and j such that

max(γ((f - f_i)²), γ((g - f_j)²)) ≤ δ² γ(F²).

Then by the Cauchy-Bunyakovsky-Schwarz inequality,

|γ((f - g)² - (f_i - f_j)²)| = |γ((f - g - f_i + f_j)(f - g + f_i - f_j))|
  ≤ γ((f - f_i - (g - f_j))²)^{1/2} · 4γ(F²)^{1/2} ≤ 8δ γ(F²) = 2δ γ(H).

Thus letting h_{k(i,j)} := (f_i - f_j)², where k(i, j) := mi - m + j, i, j = 1, ..., m, we get an approximation of all functions in H, in the L^1(γ) norm, within 2δ γ(H), by the functions h_k, k = 1, ..., m², which implies the lemma.
Lemma 6.3.5 gives in particular that if D_F^{(2)}(δ, F) < ∞ for all δ > 0 then D_H^{(1)}(ε, H) < ∞ for all ε > 0. Thus hypothesis (6.3.2) lets us apply Theorem 6.1.7, with the F there equal to H.
6.3.6 Proposition Let (X, A, P) be a probability space. Suppose an image admissible Suslin class F of measurable real-valued functions on X has an envelope function F ∈ L²(X, A, P). If D_F^{(2)}(δ, F) < ∞ for all δ > 0, then F is totally bounded in L²(X, A, P).
Proof We may assume ∫ F² dP > 0, as otherwise F is trivially totally bounded. By Lemma 6.3.5 and Theorem 6.1.7 we have

sup{|(P_{2n} - P)((f - g)²)| : f, g ∈ F} → 0  a.s.,  n → ∞.

Also, as n → ∞, ∫ F² dP_{2n} → ∫ F² dP a.s. Given ε > 0, take n_0 large enough and a value of P_{2n}, n ≥ n_0, such that ∫ F² dP_{2n} ≤ 2 ∫ F² dP and

sup{|(P_{2n} - P)((f - g)²)| : f, g ∈ F} < ε/2.

Take 0 < δ < (ε/(4P(F²)))^{1/2} and choose f_1, ..., f_m ∈ F to satisfy (6.1.1) for p = 2 and γ = P_{2n}. Then for each f ∈ F we have, for some j,

∫ (f - f_j)² dP ≤ ε/2 + ∫ (f - f_j)² dP_{2n} ≤ ε/2 + δ² ∫ F² dP_{2n} ≤ ε.
Now to continue the proof of Theorem 6.3.1: the total boundedness gives us condition 3.7.2(II)(a); it remains to check the asymptotic equicontinuity condition 3.7.2(II)(b). For this, given ε > 0, let us apply the symmetrization lemma (6.1.5) to the sets

F_{j,δ} := {f - f_j : f ∈ F, ∫ (f - f_j)² dP < δ²}

with δ > 0 small enough, η = ε/2, and ζ = ε/4. By Theorem 5.2.5 with p = 2, {y ∈ Y : T(y) - f_j ∈ F_{j,δ}} ∈ S, and since a measurable subset of a Suslin space is Suslin with its relative Borel structure (by RAP, Theorem 13.2.1), F_{j,δ} is image admissible Suslin. Thus Lemma 6.1.5 applies and we need only check 3.7.2(II)(b) for the symmetrized ν_n^0 in place of ν_n. To do this we will prove 3.7.2(II)(b) conditionally on P_{2n}, for P_{2n} in a set it belongs to with probability converging to 1 as n → ∞. Via Lemma 6.3.5 and Theorem 6.1.7 again, F_{j,δ} will be treated by way of

{f - f_j : f ∈ F, ∫ (f - f_j)² dP_{2n} < δ²},

which, for each fixed P_{2n}, is image admissible Suslin. Let A_{2n} be the smallest σ-algebra making x = (x_1, ..., x_{2n}) measurable. Then given ε, η > 0, it will be enough to prove

(6.3.7)  Pr{ Pr[ sup{|ν_n^0(f - g)| : f, g ∈ F, ∫ (f - g)² dP_{2n} < δ²} > 3η | A_{2n} ] > 3ε } < 3ε

for δ small enough and n large enough.
212
Limit Theorems for Vapnik-Cervonenkis and Related Classes
Given A_{2n}, that is, given x, let ‖f‖_{2n} := (P_{2n}(f²))^{1/2}. Let δ_i := 2^{−i}, i = 1, 2, …. Choose finite subsets ℱ(1, x), ℱ(2, x), … of ℱ such that for all i and all f ∈ ℱ,

(6.3.8) min{‖f − g‖_{2n} : g ∈ ℱ(i, x)} ≤ δ_i ‖F‖_{2n},

with card(ℱ(i, x)) ≤ D_F^{(2)}(δ_i, ℱ). We can write ℱ(i, x) = {g_{i1}, …, g_{i,k(i,x)}}, where by Lemma 6.1.9 we have g_{im} = T(y_{im}(x)), with k(i, ·) and y_{im}(·) universally measurable in x, and where y_{im}(x) is defined if and only if m ≤ k(i, x).

For each f ∈ ℱ and each i, let f_i := g_{im} ∈ ℱ(i, x) achieve the minimum in (6.3.8), with m minimal in case of a tie. Let A_u^k denote the σ-algebra of universally measurable sets for laws defined on A^k. For each f = T(y) ∈ ℱ and i, we have f_i = T(y)_i = g_{i,m(x,y,i)}, where m(·, ·, i) is A_u^{2n} ⊗ 𝒮 measurable. Thus (u, x, y) ↦ g_{i,m(x,y,i)}(u) is A_u^{2n+1} ⊗ 𝒮 measurable. Hence ν_n°(T(y) − T(y)_j) is A_u^{2n} ⊗ 𝒮 measurable and thus equals some A^{2n} ⊗ 𝒮 measurable function G(x, y) for x ∉ V, where P^{2n}(V) = 0. Then by the selection theorem (5.3.2), sup_y |G(x, y)| is universally measurable in x. Thus for each j,

(6.3.9) sup_y |ν_n°(T(y) − T(y)_j)| is P^{2n}-measurable in x.
Now ‖f_i − f‖_{2n} → 0 as i → ∞ by (6.3.8), and for any fixed r,

f − f_r = Σ_{j>r} (f_j − f_{j−1}) in ‖·‖_{2n}, for each fixed x = (x_1, …, x_{2n}).

Let H_j := log D_F^{(2)}(2^{−j}, ℱ), j = 1, 2, …. The integral condition (6.3.2) is equivalent to Σ_j 2^{−j}H_j^{1/2} < ∞. For all x, card ℱ(j, x) ≤ exp(H_j). Let

η_j := max(jδ_j, (576 P(F²) δ_j² H_j)^{1/2}) > 0.

Then

(6.3.10) Σ_{j≥1} η_j < ∞,

(6.3.11) η_j² ≥ 576 P(F²) δ_j² H_j,

and

(6.3.12) Σ_{j≥1} exp(−η_j²/(288 δ_j² P(F²))) ≤ Σ_{j≥1} exp(−j²/(288 P(F²))) < ∞.
Then

(6.3.13) Pr{ sup_{f∈ℱ} |ν_n°(f − f_r)| > Σ_{j>r} η_j | A_{2n} }
≤ Σ_{j>r} Pr{ sup_{f∈ℱ} |ν_n°(f_j − f_{j−1})| > η_j | A_{2n} }
≤ Σ_{j>r} exp(H_j) exp(H_{j−1}) sup_{f∈ℱ} Pr{ |ν_n°(f_j − f_{j−1})| > η_j | A_{2n} },

since there are at most exp(H_i) possibilities for f_i, i = j − 1, j. For a fixed j and f, let

z_i := (f_j − f_{j−1})(x_{2i}) − (f_j − f_{j−1})(x_{2i−1}).

Then

ν_n°(f_j − f_{j−1}) = n^{−1/2} Σ_{i=1}^n (−1)^{e(i)} z_i,

where e(i) := 1_{σ(i)=2i−1} are random variables taking values 0 and 1 with probability 1/2 each, independently of each other and of the z_i. Then by an inequality of Hoeffding (Proposition 1.3.5),

Pr{ |n^{−1/2} Σ_{i=1}^n (−1)^{e(i)} z_i| > η_j | A_{2n} } ≤ 2 exp(−nη_j²/(2 Σ_{i=1}^n z_i²)).
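Hoeffding's inequality for sums of independent random signs, as used in the display above, can be checked by simulation. The following sketch (an illustration only, with arbitrary fixed values z_i, not part of the proof) compares the empirical tail of n^{−1/2} Σ (−1)^{e(i)} z_i with the bound 2 exp(−nt²/(2Σz_i²)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)      # fixed values z_i, as if conditioned on A_2n
s2 = float(np.sum(z ** 2))
t = 1.5

# nu = n^{-1/2} sum_i (-1)^{e(i)} z_i with e(i) independent fair coin flips
trials = 50_000
signs = rng.choice([-1.0, 1.0], size=(trials, n))
nu = (signs @ z) / np.sqrt(n)

empirical = np.mean(np.abs(nu) > t)
hoeffding = 2.0 * np.exp(-n * t ** 2 / (2.0 * s2))
print(empirical, hoeffding)
assert empirical <= hoeffding
```

The bound is conservative (for Gaussian-sized z_i the true tail is roughly the normal tail), which is why the chaining argument can afford to apply it once per pair (f_j, f_{j−1}).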
Now

Σ_{i=1}^n z_i² ≤ 4n ∫(f_j − f_{j−1})² dP_{2n} ≤ 4n(‖f − f_j‖_{2n} + ‖f − f_{j−1}‖_{2n})²
≤ 4n‖F‖²_{2n}(δ_j + δ_{j−1})² ≤ 72nδ_j²P(F²)

(by (6.3.8) and the few lines after it) on B_n := {‖F‖²_{2n} ≤ 2P(F²)}. Then the last sum in (6.3.13) is less than

Σ_{j>r} exp(2H_j) · 2 exp(−η_j²/(144δ_j²P(F²))) ≤ 2 Σ_{j>r} exp(−η_j²/(288δ_j²P(F²)))
by (6.3.11), and this is < ε for r large enough by (6.3.12). If r is also large enough so that Σ_{j>r} η_j ≤ η, we obtain almost surely on B_n that, by (6.3.13),

(6.3.14) Pr{ sup_{f∈ℱ} |ν_n°(f − f_r)| > η | A_{2n} } < ε.

Next, if ‖f − g‖_{2n} ≤ δ ≤ δ_r P(F²)^{1/2} and ‖F‖²_{2n} ≤ 2P(F²), then by the choice of f_r (after (6.3.8)),

‖f_r − g_r‖_{2n} ≤ ‖f_r − f‖_{2n} + ‖f − g‖_{2n} + ‖g − g_r‖_{2n}
≤ δ_r P(F²)^{1/2} + 2δ_r‖F‖_{2n} ≤ 4δ_r(P(F²))^{1/2}.

Then by the same Hoeffding inequality (1.3.5), and with measurability as in (6.3.9), if δ² ≤ δ_r² P(F²),

(6.3.15) Pr{ sup{|ν_n°(f_r − g_r)| : ‖f − g‖_{2n} ≤ δ, f, g ∈ ℱ} > η | A_{2n} }
≤ (card ℱ(r, x))² · 2 exp(−η²/[8 sup ‖f_r − g_r‖²_{2n}])
≤ 2 exp(2H_r − η²/(128δ_r²P(F²))) ≤ 2 exp(−η²/(256δ_r²P(F²)))

if η² ≥ 512 H_r δ_r² P(F²), and the last bound is < ε for r large enough, on B_n, noting that H_r δ_r² → 0 as r → ∞ (see (6.3.10) and just before it). Clearly Pr(B_n) → 1 as n → ∞. Now for f and g ranging over ℱ,

sup{|ν_n°(f − g)| : ‖f − g‖_{2n} ≤ δ} ≤ sup{ 2|ν_n°(f − f_r)| + |ν_n°(f_r − g_r)| : ‖f − g‖_{2n} ≤ δ }.

Thus from (6.3.14) and (6.3.15) we get (6.3.7) for δ = δ_r P(F²)^{1/2} and r large enough. The event in (6.3.7) is measurable as follows: Y × Y is Suslin, and (x, y, z) ↦ ∫(T(y) − T(z))²dP_{2n} is jointly measurable by admissibility, so the supremum over a measurable subset is universally measurable as in Corollary 5.3.5.
6.3.16 Corollary (Pollard) Let (X, A, P) be a probability space and let ℱ be an image admissible Suslin Vapnik-Cervonenkis subgraph class of functions with envelope F ∈ ℒ²(X, A, P). Then ℱ is a Donsker class for P.
Proof This follows from Theorem 6.3.1 and Theorem 4.8.1 for p = 2.
6.3.17 Corollary Let (X, A, P) be a probability space, F ∈ ℒ²(X, A, P), and ℱ = {F·1_C : C ∈ 𝒞}, where 𝒞 is an image admissible Suslin Vapnik-Cervonenkis class of sets. Then ℱ is a Donsker class for P.

Proof Since F is measurable, the image admissible Suslin property of ℱ follows from that of 𝒞. By Theorem 6.1.2 for p = 2, (6.3.2) holds and Theorem 6.3.1 applies.
6.4 Necessary conditions for limit theorems
Theorems 6.1.2 and 6.3.1 imply that every class 𝒞 ⊂ A with S(𝒞) < +∞ which is image admissible Suslin is a Donsker class, for an arbitrary law P on A. In this section it will be shown that to obtain such a central limit theorem for all P (or even the pregaussian property) for a class 𝒞 of sets, the condition S(𝒞) < +∞ is necessary. Then it will be noted that some measurability, beyond that of ‖P_n − P‖_𝒞, is needed to obtain even a law of large numbers for S(𝒞) < +∞ (6.1.9). Lastly, it will be shown that S(𝒞) < ∞ is necessary for ‖P_n − P‖_𝒞 → 0 in outer probability as n → ∞, uniformly in P.
6.4.1 Theorem Let (X, A) be a measurable space and 𝒞 ⊂ A. Suppose that for all laws P on A, {1_A : A ∈ 𝒞} is a pregaussian class (as defined in Section 3.1). Then S(𝒞) < ∞.
Proof Suppose S(𝒞) = +∞. Then for each n ≥ 1, 𝒞 shatters some set F_n with card(F_n) = 4^n. Let G_n := F_n \ ∪_{j<n} F_j; the G_n are disjoint and card(G_n) ≥ 2^n. Choose E_n ⊂ G_n with card(E_n) = 2^n, and let P be the law with P({x}) := 6/(π²n²2^n) for each x ∈ E_n, n = 1, 2, …, so that P is a probability measure. Since E_n ⊂ F_n, 𝒞 shatters each E_n, and there is a countable 𝒟 ⊂ 𝒞 which shatters each E_n. Given 0 < K < ∞ and n, let

ε_1 := {|W_P(B)| ≤ 2K for all B ⊂ E_n},

ε_2 := {|W_P(B)| > 2K for some B ⊂ E_n, and for all such B and all C ∈ 𝒟 with C ∩ E_n = B, |W_P(C \ E_n)| > K}.
Then {|W_P(C)| ≤ K for all C ∈ 𝒟} ⊂ ε_1 ∪ ε_2. Let S_n := Σ_{x∈E_n} |W_P({x})|. Then sup{|W_P(B)| : B ⊂ E_n} ≥ S_n/2, so ε_1 ⊂ {S_n ≤ 4K}. For each x ∈ E_n, W_P({x}) has a Gaussian law with mean 0 and variance 6/(π²n²2^n) =: σ_n², so E|W_P({x})| = (2/π)^{1/2}σ_n and var(|W_P({x})|) = σ_n²(1 − 2/π). Thus ES_n = 2^n(2/π)^{1/2}σ_n = (2^n · 12/π³)^{1/2}/n. Since W_P has independent values on disjoint sets, var(S_n) = 6π^{−2}n^{−2}(1 − 2/π) ≤ 1/n². For n large, ES_n > 4K, and then by Chebyshev's inequality,

Pr{S_n ≤ 4K} ≤ Pr{|S_n − ES_n| ≥ ES_n − 4K} ≤ 1/(n²(ES_n − 4K)²)
= 1/(((2^n · 12/π³)^{1/2} − 4Kn)²) =: f(n, K) → 0 as n → ∞.

Turning to ε_2, let t(n) := 2^{2^n} and let the subsets of E_n be B_1, …, B_{t(n)}. Let M_0 := ∅ and recursively for j ≥ 1, M_j := {|W_P(B_j)| > 2K} \ ∪_{0≤i<j} M_i. Let

D_j := M_j ∩ {for all C ∈ 𝒟 such that C ∩ E_n = B_j, |W_P(C \ E_n)| > K}.

For any set A, W_P(A) has a Gaussian distribution with mean 0 and variance P(A). Let Φ(x) := (2π)^{−1/2} ∫_{−∞}^x exp(−t²/2) dt. Then since W_P has independent values on disjoint sets,

Pr(D_j) = Pr(M_j) Pr{for all C ∈ 𝒟 with C ∩ E_n = B_j, |W_P(C \ E_n)| > K} ≤ Pr(M_j) · 2Φ(−K).

Now ε_2 ⊂ ∪_{1≤j≤t(n)} D_j, so, the M_j being disjoint,

Pr(ε_2) ≤ Σ_{1≤j≤t(n)} Pr(M_j) · 2Φ(−K) ≤ 2Φ(−K).

Hence

Pr{|W_P(C)| ≤ K for all C ∈ 𝒟} ≤ f(n, K) + 2Φ(−K).

Letting n → +∞ and then K → +∞, we see that W_P is a.s. unbounded on 𝒟 ⊂ 𝒞. Note that G_P(·) can be written as W_P(·) − P(·)W_P(X), so G_P is also a.s. unbounded on 𝒞.
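The first step of the argument can be simulated (an illustration only, not part of the proof): with 2^n independent half-normal terms of variance σ_n² = 6/(π²n²2^n), as in the proof, S_n concentrates near ES_n = (2^n · 12/π³)^{1/2}/n, which already dwarfs 4K at moderate n:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, trials = 15, 1.0, 200

sigma_n = np.sqrt(6.0 / (np.pi ** 2 * n ** 2 * 2 ** n))
ES_n = np.sqrt(2 ** n * 12.0 / np.pi ** 3) / n   # = 2^n (2/pi)^{1/2} sigma_n

# S_n = sum over the 2^n points x of E_n of |W_P({x})|, with the W_P({x})
# independent centered Gaussians of variance sigma_n^2
S = np.abs(sigma_n * rng.standard_normal((trials, 2 ** n))).sum(axis=1)

print(ES_n, S.mean())
assert abs(S.mean() - ES_n) < 0.1 * ES_n
assert np.all(S > 4 * K)   # Pr{S_n <= 4K} is already negligible at n = 15
```

Since var(S_n) ≤ 1/n² while ES_n grows like 2^{n/2}/n, the Chebyshev bound f(n, K) is minute for such n.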
6.4.2 Remark Now let us recall the example at the beginning of Chapter 5, where 𝒞 is the collection of all countable initial segments of an uncountable well-ordered set (X, <) and P is a continuous law on some σ-algebra A containing all countable subsets of X. Then S(𝒞) = 1, but sup_{A∈𝒞} |(P_n − P)(A)| = 1 for all n; thus the latter random variable is measurable. For this class the weak law of large numbers, and hence the strong law and the central limit theorem, all fail as badly as possible. This shows that in Theorem 6.1.7 and Corollary 6.1.9, the "image admissible Suslin" condition cannot simply be removed, nor can it be replaced by simple measurability of the random variables appearing in the statements of the results. Further, for all A ∈ 𝒞, 1_A = 0 a.s. (P). So vanishing a.s. (P), even with S(𝒞) = 1, does not imply a law of large numbers.
6.4.3 Remark If X is a countably infinite set and A = 2^X, then for an arbitrary law P on A, lim_{n→∞} sup_{A∈A} |(P_n − P)(A)| = 0 a.s., but S(A) = +∞, so the hypothesis of 6.4.1 cannot be weakened to a law of large numbers for all P.

Next it will be seen that the Vapnik-Cervonenkis property is also necessary for a law of large numbers to hold uniformly in P, or for there to exist an estimator of P based on X_1, …, X_n (which might or might not equal P_n) converging to P uniformly over 𝒞 and uniformly in P. Here are some definitions.

Let (X, ℬ) be a measurable space and let its n-fold Cartesian product be (X^n, ℬ^n). Let 𝒫 be the class of all probability measures on (X, ℬ). Let 𝒞 ⊂ ℬ be any collection of measurable sets. A real-valued function T_n on X^n × 𝒞 will be called a 𝒞-estimator if it is a stochastic process indexed by 𝒞; in other words, for each A ∈ 𝒞, x ↦ T_n(x, A) is measurable on X^n. A 𝒞-estimator T_n will be called an estimator if for each x there is a probability measure on ℬ which equals T_n(x, ·) on 𝒞.

For any probability measure P on (X, ℬ), take the product law P^n on (X^n, ℬ^n) for x = (X_1, …, X_n), so that the X_i are i.i.d. (P). We would like T_n to be a good approximation to P with probability → 1 as n → ∞. The goodness of the approximation will be measured by the loss function L(T_n, P, 𝒞) := ‖T_n − P‖_𝒞. From it we get the risk r(T_n, P, 𝒞) := E_P L(T_n, P, 𝒞)*, where E_P denotes expectation with respect to P^n. For any class Q of laws, let r(T_n, Q, 𝒞) := sup{r(T_n, P, 𝒞) : P ∈ Q}, and let r_n(Q, 𝒞) be the minimax risk, that is, the infimum of r(T_n, Q, 𝒞) over all 𝒞-estimators T_n. In finding minimax risks for 𝒞-estimators we can assume that T_n takes values in [0, 1], since max(0, min(T_n, 1)) clearly has risks no larger than those of T_n.
For any 𝒞 and Q, clearly 0 ≤ r_n(Q, 𝒞) ≤ 1 and r_n is nonincreasing in n. The following theorem holds for 𝒞-estimators and so a fortiori for estimators, whose values are probability measures. It is mainly due to P. Assouad (see the Notes to this section).
6.4.4 Theorem Let 𝒫 be the class of all probability measures on a sample space (X, ℬ) and let 𝒞 ⊂ ℬ. If the minimax risk r_n(𝒫, 𝒞) < 1/2 for some n, then 𝒞 is a Vapnik-Cervonenkis class.
Proof Suppose S(C) = +oo. Given n, for any m > n let F be a set with 2m elements, shattered by C. Let D C C be a collection of 22m sets which shatters F. Then II Ilc > II Ilc > II 11-D, and since D is finite, II Tn - P11 E) will be measurable for any C-estimator Tn and any P. Let Tn be a C-estimator defined on Xn. Let r be a ("prior") probability distribution on the set (finitedimensional simplex) of all probability laws on F, with mass 1/( 2m) at each law uniformly distributed over an m-element subset of F. Then the maximum risk is bounded below by the risk for r and D,
suppEpllTn - Pllc > E,tEpIITn - PIID For each n-tuple x = (xl, , xn) E Fn, let ranx denote the range of x, ran x = [XI, , xn }, and let k = k(x) be the number of distinct xi, which is the cardinality of ran x.
Let µ be the distribution of x in Fn averaged with respect to 7r, s f Pndjr(P). Thenforeachx E Fn, letlr(x) := 7rx be the posterior distribution defined by n given x, namely the law 7rx giving mass 1/(2m k) to each law uniform on a set of m elements of F including ran x. Then for any C-estima-
tor T, PIID = EN,E,,(x)Il Tn - PIID
Let A:= 2F. For each A E A, there is a unique DA E D with DA fl F = A. For any P in the support of ir, P(DA) = P(A). Let V, (x, A) := Tn(x, DA), an A-estimator. Then II Tn - P 11 D > 11 Vn - P II A, and we want to find a lower
bound for E,r(x)11V - P11A for each P and x fixed where V (A) := Vn (x, A).
Let y(k) := 7rx. If PO is any fixed law in the support of y(k), and if r is uniformly distributed over the group G of all (2m - k)! permutations of F which equal the identity on ran x (and thus permute the other 2m - k elements of F), then PO o r-1 will have distribution y(k). Let T[A] :_ {r(y) : y E A} for any set A. Each permutation r defines a 1-1 transformation A i-+ T[A] of
𝒜 onto itself. Let (V ∘ τ)(A) := V(τ[A]). Then we have

E_{γ(k)} ‖V − P‖_𝒜 = (2m − k)!^{−1} Σ_{τ∈G} ‖V − P_0 ∘ τ^{−1}‖_𝒜 = (2m − k)!^{−1} Σ_{τ∈G} ‖V ∘ τ − P_0‖_𝒜,

which by the triangle inequality is bounded below by

‖V̄ − P_0‖_𝒜, where V̄ := (2m − k)!^{−1} Σ_{τ∈G} V ∘ τ.

Now let 𝒜(m) := {A ∈ 𝒜 : ran x ⊂ A and |A| = m}. For any A, B ∈ 𝒜(m) there is a τ ∈ G with τ[A] = B. It follows that V̄, restricted to 𝒜(m), is a constant, say v. The support of P_0 is a set in 𝒜(m), of P_0-measure 1, while for another, suitably chosen set C ∈ 𝒜(m), P_0(C) = k/m. Thus

‖V̄ − P_0‖_{𝒜(m)} ≥ max(|1 − v|, |k/m − v|) ≥ (1 − k/m)/2 ≥ 1/2 − n/(2m).

This last bound, uniform in k, holds after integration with respect to μ. For a given n, letting m → ∞, we see that the minimax risk for ‖·‖_𝒞 is at least 1/2.
A corollary of Theorem 6.4.4, taking T_n as the empirical measure P_n, is:

6.4.5 Theorem (Assouad) If (X, ℬ) is a measurable space and 𝒞 ⊂ ℬ is a uniform Glivenko-Cantelli class of sets, that is, sup_P E_P ‖P_n − P‖_𝒞* → 0 as n → ∞, where the supremum is over all probability laws on (X, ℬ), then 𝒞 is a Vapnik-Cervonenkis class.

Now we'll see that the constant 1/2 in Theorem 6.4.4 is sharp:

6.4.6 Proposition For the class 𝒫 of all probability measures on a sample space (X, ℬ) there is always a ℬ-estimator T, not depending on n or x, with r(T, 𝒫, ℬ) ≤ 1/2, so that for any Q ⊂ 𝒫, 𝒞 ⊂ ℬ, and n we have r_n(Q, 𝒞) ≤ 1/2. Moreover, when (X, ℬ) is the unit interval [0, 1] with the Borel σ-algebra, there is a class 𝒞 which is not a Vapnik-Cervonenkis class and for which T on 𝒞 is given by a probability measure, so that T is an estimator, not only a 𝒞-estimator.
Proof A 𝒞-estimator T (not depending on n or x) is defined by T(x, A) := 1/2 for all A ∈ ℬ. Then r(T, Q, 𝒞) ≤ 1/2 for any Q and 𝒞, so r_n(Q, 𝒞) ≤ 1/2 for all n.

For Lebesgue measure λ on [0, 1], let 𝒞 := {C_k}_{k≥1} be an infinite sequence of sets independent with probability 1/2: specifically, let C_k be the set where the kth binary digit is 1. Recall that sets C_i, i = 1, …, k, are called "independent as sets" if for any set J ⊂ {1, …, k}, the intersection of all the C_i for i ∈ J and of the complements of the C_i for i ∉ J is nonempty. In other words, the (Boolean) algebra generated by the C_i has the maximum number, 2^k, of possible atoms. Clearly the C_k are independent in [0, 1]. For any m = 1, 2, …, a collection of 2^m sets independent in a set X shatters some subset with m elements, by Theorem 4.6.2. So 𝒞 is not a Vapnik-Cervonenkis class. On the other hand, λ(C_k) = 1/2 for all k, so the estimator given by the probability measure λ agrees with T on 𝒞.
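The construction can be illustrated on a finite model (a sketch with hypothetical helper names, not from the text): on X = {0, …, 2^k − 1} the sets C_i of numbers whose ith binary digit is 1 are independent as sets, and already 2^m = 4 of them shatter some m = 2 element subset:

```python
from itertools import combinations

k = 4                      # 2^m independent sets, with m = 2
X = range(2 ** k)          # ground set {0, ..., 15}
C = [{x for x in X if (x >> i) & 1} for i in range(k)]  # C_i: ith binary digit is 1

# "Independent as sets": every membership pattern is realized, i.e. the
# algebra generated by the C_i has the maximum number 2^k of atoms.
atoms = {tuple(x in Ci for Ci in C) for x in X}
assert len(atoms) == 2 ** k

def shatters(sets, points):
    traces = {frozenset(s & set(points)) for s in sets}
    return len(traces) == 2 ** len(points)

shattered = [p for p in combinations(X, 2) if shatters(C, p)]
print(shattered[0])        # some 2-element subset shattered by C_1, ..., C_4
assert shattered
```

For instance {3, 5} is shattered: 3 has binary digits {0, 1} and 5 has {0, 2}, so C_0 cuts out both points, C_1 only 3, C_2 only 5, and C_3 neither.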
Next, it will be shown that one of the hypotheses of Theorem 6.1.7, namely the existence of an integrable envelope function, is essentially necessary for a law of large numbers (Glivenko-Cantelli property). This is related to the fact that for i.i.d. real random variables Y_1, …, Y_n, …, (Y_1 + … + Y_n)/n converges a.s. to a finite limit if and only if E|Y_1| < ∞ (RAP, Theorem 8.3.5).
6.4.7 Theorem If (S, ℬ, P) is a probability space, ℱ ⊂ ℒ¹(S, ℬ, P), ‖P‖_ℱ < ∞, and ℱ is a strong Glivenko-Cantelli class for P, that is, ‖P_n − P‖_ℱ* → 0 a.s., then ℱ has an integrable envelope function: E‖δ_{X_1}‖_ℱ* < ∞.

Proof We have, as n → ∞, ‖P_n − P‖_ℱ* → 0 a.s., ‖P_{n−1} − P‖_ℱ* → 0 a.s., ‖n^{−1}(P_{n−1} − P)‖_ℱ* → 0, and ‖P/n‖_ℱ → 0, so ‖δ_{X_n}/n‖_ℱ* → 0 a.s. Let F(x) := ‖δ_x‖_ℱ = sup_{f∈ℱ} |f(x)| for x ∈ S. Let S_j be copies of S whose Cartesian product S^∞ is taken in the standard model. For each n write S^∞ = S_n × Π_{j≠n} S_j. Then Lemma 3.2.4 implies that ‖δ_{X_n}/n‖*, where ‖·‖* is with respect to P^∞ on S^∞, equals F*(X_n)/n. The random variables F*(X_n) are i.i.d. Thus by the Borel-Cantelli lemma, Σ_{n=1}^∞ P^∞(F*(X_n) > n) < ∞, so Σ_{n=1}^∞ P(F* > n) < ∞, which implies EF*(X_1) < ∞ (RAP, Lemma 8.3.6).
**6.5 Inequalities for empirical processes

This section gives no proofs, but gives references. Recall that for the Brownian bridge y_t, 0 ≤ t ≤ 1, and 0 < M < ∞, P(sup_{0≤t≤1} y_t ≥ M) = exp(−2M²) (RAP, Proposition 12.3.3), and so

P(sup_{0≤t≤1} |y_t| > M) ≤ 2 exp(−2M²),

where as M → ∞ the probability on the left is asymptotic to the bound on the right (RAP, Proposition 12.3.4). Dvoretzky, Kiefer and Wolfowitz (1956) showed that for any law on ℝ with distribution function F and empirical distribution functions F_n,

Pr{sup_x n^{1/2} |(F_n − F)(x)| > M} ≤ K exp(−2M²)

for a constant K not depending on n, F, or M. Massart (1990) showed that the inequality holds with K = 2, which then is clearly best possible as n → ∞, since it is best for the Brownian bridge.

Let a VCM class 𝒞 of sets be a Vapnik-Cervonenkis class satisfying the image admissible Suslin measurability condition. Some measurability condition is needed, as shown by the examples at the beginning of Chapter 5. Results are stated here on VCM classes for convenience; the authors quoted used various measurability conditions, some weaker, some stronger. For a class ℱ of functions integrable for a law P, let

D_n(ℱ) := n^{1/2} ‖P_n − P‖_ℱ := sup_{f∈ℱ} n^{1/2} |(P_n − P)(f)|.

Any quantity defined for classes ℱ of measurable functions is also defined for classes 𝒞 of sets by letting ℱ := {1_A : A ∈ 𝒞}. If (X, A) is a measurable space and ℱ is a uniformly bounded class of real-valued measurable functions on X, so that ℱ has a finite constant envelope function C ≥ F, let

π(n, ℱ, M) := sup_P Pr(D_n(ℱ) > M),

where the supremum is over all laws P on (X, A). Vapnik and Cervonenkis (1974, Theorem 10.2) showed that for any VCM class 𝒞 of measurable sets,

π(n, 𝒞, M) ≤ 6 m^𝒞(2n) exp(−M²(1 − 1/n)/4).

Since 𝒞 is a VC class, m^𝒞(2n) is bounded above by a polynomial in n of degree S(𝒞) (Sauer's lemma, Theorem 4.1.2, and Proposition 4.1.5). Devroye (1982) proved that

π(n, 𝒞, M) ≤ 4 m^𝒞(n²) exp(−2M² + 4Mn^{−1/2} + 4M²n^{−1}).

Alexander (1984) and Massart (1983, 1986) showed that for any VCM class 𝒞 and ε > 0, π(n, 𝒞, M) ≤ K exp(−(2 − ε)M²) for a large enough constant K = K(ε, S(𝒞)). Here, unlike the case of the Dvoretzky-Kiefer-Wolfowitz inequality, 2 − ε cannot be replaced by 2. Let

π(∞, ℱ, M) := sup_P Pr(‖G_P‖_ℱ > M).
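The Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant K = 2 can be checked by simulation; here is a sketch (an illustration only; the sample size, trial count, and threshold are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, M = 100, 50_000, 1.0

# KS statistic sqrt(n) * sup_x |F_n - F|(x) for U[0,1] samples; the sup is
# attained at the order statistics U_(1) <= ... <= U_(n).
u = np.sort(rng.uniform(size=(trials, n)), axis=1)
grid = np.arange(1, n + 1) / n
ks = np.sqrt(n) * np.maximum(np.abs(grid - u),
                             np.abs(u - (grid - 1.0 / n))).max(axis=1)

empirical = np.mean(ks > M)
massart = 2.0 * np.exp(-2.0 * M ** 2)
print(empirical, massart)
assert empirical <= massart
```

For M = 1 the bound 2exp(−2) ≈ 0.271 is close to the limiting Brownian-bridge tail, so the empirical frequency comes out just below it, illustrating why the constant 2 cannot be improved as n → ∞.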
Any upper bound for π(n, 𝒞, M) that holds for all finite n, uniformly in n, will also hold for π(∞, ℱ, M) for any countable class ℱ, by the finite-dimensional central limit theorem, and thus for any class ℱ such that ‖·‖_ℱ = ‖·‖_𝒢 for a countable 𝒢 ⊂ ℱ, as is true for many classes ℱ. Consider inequalities of the form

(6.5.1) π(∞, ℱ, M) ≤ C M^γ exp(−2M²)

and

(6.5.2) sup_n π(n, ℱ, M) ≤ C M^γ exp(−2M²),

where C is a constant possibly depending on γ and ℱ. For the VCM class 𝒪(d) of orthants O_y := {x : x_j ≤ y_j, j = 1, …, d} in ℝ^d, for all y ∈ ℝ^d, the smallest possible value of γ in (6.5.1) is 2(d − 1), by results of Goodman (1976) for d = 2, Massart (1983, 1986, Theorem A.1), Cabaña (1984, Section 3.2), and Adler and Brown (1986). For d ≥ 2 it follows that γ in (6.5.1) and (6.5.2) cannot be taken to be 0. Adler and Samorodnitsky (1987, Example 3.2 p. 1347) found that for the VCM class ℬ(d) of rectangular blocks parallel to the axes in ℝ^d, one has γ = 2(2d − 1) as the precise value in (6.5.1). Here S(𝒪(d)) = d and S(ℬ(d)) = 2d (Corollary 4.5.11). So for these classes of sets, γ = 2(S(𝒞) − 1) is optimal in (6.5.1). For the VCM class ℋ(2) of all open half-planes in ℝ², where S(ℋ(2)) = 3 by Theorem 4.2.1, and for P uniform on the unit square, Adler and Samorodnitsky (1987, Example 3.3 p. 1349) showed that γ ≤ 2, so γ is smaller than the other examples would suggest.

On the other hand, for S(𝒞) = 1, γ can be 1 > 2(S(𝒞) − 1) = 0, as follows. In the unit circle S¹ := {(cos θ, sin θ) : 0 ≤ θ < 2π} ⊂ ℝ², let HC be the class of half-open half-circles H_t := {(cos θ, sin θ) : t ≤ θ < t + π} for 0 ≤ t < π. Then it is easy to check that S(HC) = 1. Let P be the uniform law on S¹, dP(θ) = dθ/(2π) for 0 ≤ θ < 2π. Let X_t := G_P(H_t). Then X_t, −∞ < t < ∞, is a stationary Gaussian process, periodic of period 2π. Here γ = 1 is the precise exponent in (6.5.1) by a result of Pickands (1969, Lemma 2.5). See also Leadbetter, Lindgren and Rootzén (1983, Theorem 12.2.9) and Adler (1990, pp. 117-118). The class HC does not contain the empty set, and HC ∪ {∅} shatters some 2-element sets. Smith and Dudley (1992) showed that there exists a VCM class 𝒞 with S(𝒞) = 1 and ∅ ∈ 𝒞 such that γ = 1 is optimal in (6.5.1). Here 𝒞 has a treelike partial ordering by inclusion, as in Section 4.4. Returning to general cases, Samorodnitsky (1991) showed that for any VCM class 𝒞, (6.5.1) holds for any γ > 2S(𝒞) − 1. By way of a theorem of Haussler stated in Section 4.9.2, (6.5.1) also holds for γ = 2S(𝒞) − 1.
For empirical processes, upper bounds for exponents γ have been harder to obtain than for Gaussian processes, while lower bounds in (6.5.1) also provide them for (6.5.2). Massart (1986, Theorem 3.3.1°(a)) proved (6.5.2) for any VCM class of sets, and also for VC subgraph classes ℱ of functions with values in [0, 1], for any γ > 6S(𝒞), where 𝒞 is the class of subgraphs of functions in ℱ and is VCM. Talagrand (1994) showed that for any VCM class 𝒞 one can take γ = 2S(𝒞) − 1 in (6.5.2). Talagrand also proved similar bounds for other, not necessarily VC, classes. The half-circle and tree examples show that for S(𝒞) = 1 Talagrand's bound is precise, so the best value of γ in (6.5.1) and in (6.5.2) for VCM classes 𝒞 with S(𝒞) = 1 is γ = 1. At this writing, the problem of finding optimal or useful constants C in (6.5.1) and (6.5.2) for general VCM classes with given S(𝒞) seems to be open.

Another direction for inequalities is what is called the concentration of measure phenomenon. Ledoux (1996) gives an exposition. Here is one of the many results. Let P be the standard normal measure N(0, I) on ℝ^d. Let f be a real-valued Lipschitz function on ℝ^d, so that

‖f‖_L := sup_{x≠y} |f(x) − f(y)|/|x − y| < ∞.

Then for any r > 0,

P(|f − ∫f dP| ≥ r) ≤ 2 exp(−r²/(2‖f‖_L²))

(Ledoux, 1996, (2.9) p. 181). Notably, the dimension d does not appear in the inequality. Concentration inequalities of similar form have been proved for spheres and some other Riemannian manifolds and for product spaces. Two major works for product spaces are Talagrand (1995, 1996a).
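The Gaussian concentration inequality can be illustrated numerically with the 1-Lipschitz function f(x) = |x| (the Euclidean norm); the sketch below (illustration only, parameters arbitrary) shows the bound holding with no dependence on d:

```python
import numpy as np

rng = np.random.default_rng(2)
d, trials, r = 50, 100_000, 0.8

# f(x) = |x| (Euclidean norm) is Lipschitz on R^d with ||f||_L = 1.
x = rng.standard_normal((trials, d))
f = np.linalg.norm(x, axis=1)
mean_f = f.mean()                    # stand-in for the integral of f dP

tail = np.mean(np.abs(f - mean_f) > r)
bound = 2.0 * np.exp(-r ** 2 / 2.0)  # 2 exp(-r^2 / (2 ||f||_L^2))
print(d, tail, bound)
assert tail <= bound                 # no dependence on the dimension d
```

In fact the fluctuations of |x| around its mean have variance about 1/2 for any large d, so the dimension-free bound is not far from the truth for this f.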
**6.6 Glivenko-Cantelli properties and random entropy

This section is also a survey, giving references rather than proofs. Let (X, A, P) be a probability space and ℱ a set of measurable real-valued functions on X. Recall that ℱ is called a weak (resp. strong) Glivenko-Cantelli class for P if both ℱ ⊂ ℒ¹(X, A, P) and ‖P_n − P‖_ℱ → 0 in outer probability (resp. almost uniformly) as n → ∞. Also, ℱ is called order bounded for P if it has an integrable envelope function, EF* < ∞. Let ℱ_{0,P} := {f − Pf : f ∈ ℱ}. Talagrand (1987a, Theorem 22) proved the following:

Theorem A (Talagrand) For a probability space (X, A, P) and a class ℱ ⊂ ℒ¹(X, A, P), the following are equivalent (as n → ∞):

(a) ℱ is a strong Glivenko-Cantelli class;
(b) the possibly nonmeasurable functions ‖P_n − P‖_ℱ converge to 0 a.s.;
(c) ℱ is a weak Glivenko-Cantelli class and ℱ_{0,P} is order bounded.

Proof Each of (a), (b), or (c) for ℱ is equivalent to the same statement for ℱ_{0,P}. Clearly (a) implies (b), which for ℱ_{0,P} is equivalent to (I) in Talagrand's theorem (1987a, Theorem 22), while (c) is intermediate between Talagrand's (VI) and (VII). Talagrand's (V) implies (a). □
A class ℱ satisfying any of the three equivalent conditions in Theorem A will be called a Glivenko-Cantelli class. There exist weak Glivenko-Cantelli classes which are not Glivenko-Cantelli classes (problem 12). Talagrand (1987a, 1996b) gave a kind of Vapnik-Cervonenkis criterion for the Glivenko-Cantelli property, as follows. For a finite set F ⊂ X and −∞ < α < β < ∞, the class ℱ is said to shatter F at the levels α, β if for every G ⊂ F there is an f ∈ ℱ such that f(x) ≤ α for all x ∈ G and f(y) ≥ β for all y ∈ F \ G. Thus, if ℱ is the class of indicators of sets in a class 𝒞, then ℱ shatters F at levels α, β, where 0 < α < β < 1, if and only if 𝒞 shatters F as defined in Section 4.1.

Suppose given a probability space (X, A, P). Let ℱ ⊂ ℒ¹(X, A, P), −∞ < α < β < ∞, and A ∈ A. Let X_1, X_2, …, X_n be coordinates on a product of copies of (X, A, P). Let W(ℱ, A, α, β, n) be the set of all x := {X_j}_{j=1}^n ∈ A^n such that the X_j are all distinct and ℱ shatters {X_1, X_2, …, X_n} at levels α, β. The triple (A, α, β) is called a witness of irregularity for ℱ if the following all hold: P(A) > 0; the restriction of P to A has no atoms; and for all n = 1, 2, …, (P^n)*(W(ℱ, A, α, β, n)) = P(A)^n. The terminology is as in Talagrand (1996b), except that in the 1996 paper measurability assumptions make the star unnecessary. Talagrand (1987a, Theorem 2) proved:
Theorem B (Talagrand) Given (X, A, P), an order bounded class ℱ ⊂ ℒ¹(X, A, P) fails to be a Glivenko-Cantelli class if and only if there exists a witness of irregularity for it. Thus, a class ℱ is a Glivenko-Cantelli class for P if and only if ℱ_{0,P} is order bounded and has no witness of irregularity.

Now, given a measurable space (X, A), a class ℱ of real-valued measurable functions on X is called a universal Glivenko-Cantelli class if it is a Glivenko-Cantelli class for every law (probability measure) on (X, A).
While a universal Donsker class 𝒞 of sets must be a VC class (Theorem 6.4.1), a universal Glivenko-Cantelli class need not be, even if X and 𝒞 are both countable (problems 14, 15). In any uncountable complete separable metric space X there exists an uncountable "universally null" set Z, which has outer measure 0 for every nonatomic law P on the Borel sets of X (Sierpinski and Szpilrajn, 1936). [As Shortt (1984) notes, Szpilrajn later changed his name to Marczewski.] The collection of all subsets of a universally null set is clearly a universal Glivenko-Cantelli class. Such examples and extensions of them show that the notion "universal Glivenko-Cantelli class" is surprisingly wide (Dudley, Gine and Zinn, 1991, Section 3), and some of the width is in seemingly pathological directions. More useful are classes ℱ such that ‖P_n − P‖_ℱ* → 0 uniformly in P, as follows. Given a measurable space (X, A) and a class ℱ of real-valued measurable functions on X, ℱ is a uniform Glivenko-Cantelli class as defined in Theorem 6.4.5 if and only if for every ε > 0 there is an n_0 such that for every law P on A and every n ≥ n_0, Pr*{‖P_n − P‖_ℱ > ε} < ε. Each function f in a universal (or uniform) Glivenko-Cantelli class must be bounded, and ℱ_0 := {f − inf f : f ∈ ℱ} must be uniformly bounded (problem 13). ℱ_0 is a universal or uniform Glivenko-Cantelli class if and only if ℱ has the same property.
For x := (x_1, …, x_n) ∈ X^n, n = 1, 2, …, and 0 < p < ∞, define on ℱ_0 the pseudometrics

e_{x,p}(f, g) := [n^{−1} Σ_{i=1}^n |f(x_i) − g(x_i)|^p]^{min(1,1/p)},
e_{x,∞}(f, g) := max_{i≤n} |f(x_i) − g(x_i)|.

Let H_{n,p}(ε, ℱ) := sup{log D(ε, ℱ_0, e_{x,p}) : x ∈ X^n}. Then we have

Theorem C For any measurable space (X, A) and image admissible Suslin class ℱ of bounded measurable functions on X, the following are equivalent:
(a) ℱ is a uniform Glivenko-Cantelli class;
(b) for some, or equivalently for all, p with 0 < p ≤ +∞, and for any ε > 0,
lim_{n→∞} H_{n,p}(ε, ℱ_0)/n = 0.

Proof This is Theorem 6 of Dudley, Gine and Zinn (1991). □

Thus, every image admissible Suslin VC class of sets is a uniform Glivenko-Cantelli class. Recall that conversely, every uniform Glivenko-Cantelli class of sets is a Vapnik-Cervonenkis class by Assouad's theorem (6.4.5). Theorem C
fails if no measurability condition is assumed (by the ordinal triangle example near the beginning of Chapter 5). Alon, Ben-David, Cesa-Bianchi and Haussler (1997) give an alternate characterization of uniform Glivenko-Cantelli classes. Also recall that, under a measurability condition, Vapnik and Cervonenkis (1971) and Steele (1978) gave a criterion (a necessary and sufficient condition) for a class 𝒞 of sets to be a Glivenko-Cantelli class for a given P, in terms of the "random entropy" log Δ^𝒞({X_1, …, X_n}) (Theorem 6.2.1). Vapnik and Cervonenkis (1981) gave an analogous criterion for uniformly bounded classes of functions. Talagrand (1987a) proved a kind of random entropy criterion (Theorem B in this section) for the Glivenko-Cantelli property without any measurability restrictions, with order boundedness in ℒ¹(P). There are also random entropy criteria for the Donsker property, which will not be stated here: Gine and Zinn (1986, Chapter 2, Theorem 3.4; Chapter 3, Theorem 1.3). Gine and Zinn refer in part to work that had been done, but was published later, by Talagrand (1987b, 1988).
**6.7 Classification problems and learning theory

Let P and Q be laws on ℝ^d, which are unknown except that X_1, …, X_m i.i.d. (P) and Y_1, …, Y_n i.i.d. (Q) have been observed, giving empirical measures P_m and Q_n. There is a new observation Z = X_{m+1} or Z = Y_{n+1}, but we do not know which, and we want to decide as well as possible. For example, for large d, each X_i and Y_j might be a vector of regional weather data for one day, where the X_i are for days when it rained the following day at a given station, and the Y_j for other days. Then Z is today's data vector and we want to predict whether it will rain tomorrow at the station.

(A) Let 𝒞 be a VCM class of Borel sets in ℝ^d. On some set A ∈ 𝒞, (P_m − Q_n)(A) is maximized over 𝒞. A decision rule is then to decide Z = X_{m+1} (it will rain tomorrow) if Z ∈ A, and otherwise Z = Y_{n+1}. Since P_m → P and Q_n → Q uniformly on 𝒞, also uniformly in P and Q, at rates which for some 𝒞 are given by inequalities in Section 6.5, for large m and n, knowing P_m and Q_n is nearly as good as knowing P and Q. Only VC classes can have such uniformity (Assouad's theorem (6.4.5)). Vapnik and Cervonenkis (1968, 1971) invented their classes of sets with applications to pattern recognition in view (Vapnik and Cervonenkis, 1974). There have been extensive other and further developments, sometimes called "learning theory."

There is at least one, unknown, measurable set B on which (P − Q)(B) is maximized. One wants to choose A, based on X_1, …, X_m; Y_1, …, Y_n, as a kind of estimator of B, so that there is a small error probability, for large m and n, in deciding that Z ∈ B just when one observes Z ∈ A. If so, one has what is called probably approximately correct (PAC) learning. In most problems, it seems that a successful application depends on a good choice of the data vectors to be measured ("feature sets") and on knowledge of the particular field of application. A few further references on learning theory are: Vapnik (1982), Valiant (1984), Blumer, Ehrenfeucht, Haussler and Warmuth (1989), Haussler (1992), Dudley, Kulkarni, Richardson and Zeitouni (1994), Vapnik (1995), Devroye, Gyorfi and Lugosi (1996), and Alon et al. (1997).
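The decision rule in (A) can be sketched in one dimension with the VC class of half-lines (a hypothetical toy example, not from the text; the Gaussian samples and the cut-point search are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
m = n = 500
X = rng.normal(0.0, 1.0, m)      # sample from P
Y = rng.normal(2.0, 1.0, n)      # sample from Q

# C = half-lines (-inf, c], a VC class with S(C) = 1.  Maximize the
# empirical difference (P_m - Q_n)((-inf, c]) over cut points at the data.
cuts = np.sort(np.concatenate([X, Y]))
scores = [np.mean(X <= c) - np.mean(Y <= c) for c in cuts]
c_hat = cuts[int(np.argmax(scores))]

# Decision rule: classify a new observation z as coming from P iff z <= c_hat.
Z_P = rng.normal(0.0, 1.0, 2000)
Z_Q = rng.normal(2.0, 1.0, 2000)
accuracy = 0.5 * (np.mean(Z_P <= c_hat) + np.mean(Z_Q > c_hat))
print(c_hat, accuracy)
assert 0.8 < accuracy < 0.95
```

Here the best set B is (−∞, 1], and since half-lines form a VC (hence uniform Glivenko-Cantelli) class, the empirically chosen A = (−∞, ĉ] approaches B uniformly in the underlying laws as m, n → ∞.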
Problems

1. On [0, 1] with Lebesgue (uniform) law P = U[0, 1], let F(a) := {k^a 1_{[0,1/k]}}_{k=1}^∞ for any a > 0. (a) Show that for a < 1, ‖P_n − P‖_{F(a)} is measurable and → 0 a.s. as n → ∞. Hint: This follows from the Chebyshev inequality only for some values of a. Instead, use some fact from Section 6.1. (b) Show that for a ≥ 1, ‖P_n − P‖_{F(a)} doesn't approach 0 in probability. (Prove this directly, without using facts from Chapter 6.)

2. If C is the set of tetrahedra in R^3 (bounded intersections of four half-spaces), and P is any law on the Borel sets of R^3, show that ‖P_n − P‖_C is measurable and goes to 0 a.s. as n → ∞. Hint: You can assume the result of Chapter 5, problem 8. (Actually, any bounded polyhedron with four faces in R^3 is convex and is an intersection of four half-spaces.)

Let Co(R^k) be the class of all convex sets in R^k.

3. Let C := Co(R^2) and F := {(0, 0), (1, 1), (2, 4), (3, 1), (4, 0)}. Evaluate Δ^C(F) and m^C(F).

4. For C = Co(R^k) show that for any law P on R^k (on the Borel σ-algebra, as usual) sup_{A∈C}(P_n − P)(A) is measurable.

5. Let C be the collection of all unions of three intervals in R. Show that Δ^C(F) and m^C(F) depend only on the cardinality m = |F| and evaluate each of them as a function of m = 1, 2, ….

6. Let F := {x^{−1/3} 1_{[0,x]} : 0 < x ≤ 1}. Show that F is a Donsker class for P = U[0, 1].

7. For a law P on R suppose F ∈ L^2(R, P). Let G be the class of functions g on R with total variation ≤ 1 and ‖g‖_sup ≤ 1. Show that {Fg : g ∈ G} is a Donsker class for P.
Limit Theorems for Vapnik-Cervonenkis and Related Classes
8. Prove Remark 6.4.3.

9. Let P({k}) := p_k := c k^{−5/4} for k = 1, 2, …, with c such that Σ_k p_k = 1. Let N^+ be the set of all positive integers and C the class of all subsets of N^+. Show that for some α > 0, ‖P_n − P‖_C ≥ α n^{−1/4} for all possible values of P_n, and so C is not Donsker for P.

10. In problem 9, if each p_k is replaced by c/k^3 for the appropriate constant c, show that C is a Donsker class for P.

11. If P is the uniform distribution on the unit square in R^2, show that Co(R^2) is a Donsker class for P. Note: This will be proved later in the book by different methods. It may be quite challenging to prove it with the methods available so far.

12. Let X = H be a separable Hilbert space with an orthonormal basis {e_j}_{j=1}^∞. Let k_j := 1/(1 + log j) and P({k_j e_j}) := P({−k_j e_j}) := 3/(π^2 j^2) for j = 1, 2, …. Then P is a law on H (Chapter 2, problem 19). Let F := {f_y : y ∈ H, ‖y‖ ≤ 1} with f_y(x) := (x, y). Show that F is a weak Glivenko-Cantelli class, but not a strong Glivenko-Cantelli class. Hint: F is not order bounded.

13. Let F be a universal Glivenko-Cantelli class.
(a) If f ∈ F, show that f is bounded. Hint: Otherwise, it is not in L^1(P) for some law P.
(b) Show that F_0 := {f − inf f : f ∈ F} is uniformly bounded.

14. Show that if X is countable and A = 2^X then A is a universal Glivenko-Cantelli class.

15. Let (X, A) be a measurable space and X = ∪_j A_j for some disjoint sets A_j ∈ A. Let C ⊂ A be image admissible Suslin and such that for each j, S(C ∩ A_j) < ∞ (but S(C ∩ A_j) may grow arbitrarily fast with j). Show that C is a universal Glivenko-Cantelli class.
Notes
Notes to Section 6.1. Koltchinskii (1981) defined D_F^(p)(δ, F, γ) when F ≡ 1 and γ = P_n. Pollard (1982) independently defined it for general F, for p = 2, and for distinct x(j). Both, in fact, used the minimal cardinality of an ε-net (see Section 1.2). Theorem 6.1.2 is due to Pollard (1982, proof of Theorem 9). I do not know references for Proposition 6.1.3 or 6.1.4. The symmetrization method is also due independently to Koltchinskii (1981) and Pollard (1982). Lemma 6.1.5 is adapted from Pollard (1982, Lemma 11), who proves it for a countable class F, thus avoiding measurability difficulties. From the countable case, one can infer the result if there is a countable subset H ⊂ F such that ‖ν_n‖_H = ‖ν_n‖_F almost surely. Thus it suffices for ν_n on F to be separable
Notes
229
in the sense of Doob, meaning that for some countable G ⊂ F and some set A with P(A) = 0, for any ω ∉ A, f ∈ F, ε > 0, and closed interval J ⊂ R, if ν_n(g)(ω) ∈ J for all g ∈ G with ρ_P(f, g) < ε, then ν_n(f)(ω) ∈ J. For most classes F arising in applications the empirical process is naturally separable. But the collection C of all singletons in [0, 1], for P = Lebesgue measure, appears as a rather regular collection for which the empirical process is not separable. In such a case it may appear unnatural to use a separable modification of the process, so that, for example, P_1({x}) = 0 for all x, even x = x_1! Strobl (1995) proved Theorem 6.1.6(b) without supplementary measurability assumptions. Theorem 6.1.7 extends Pollard (1982, Theorem 12) and Wolfowitz (1954) for half-spaces in R^d.
Notes to Section 6.2. In Theorem 6.2.1, Vapnik and Cervonenkis (1971, Theorem 4) proved equivalence of (b) and (c). The proof here was first given in Dudley (1984). The Koltchinskii-Pollard techniques allowed some simplification of the Vapnik-Cervonenkis proof. Steele (1978) proved equivalence of (a) and (b) using Kingman's theorem and proved Theorem 6.2.8 and (6.2.9). However, Steele's assumption that ‖P_n − P‖_C is measurable has been strengthened to "image admissible Suslin." Some such strengthening seems needed in the proof.

Notes to Section 6.3. Pollard (1982) proved Theorem 6.3.1 when the empirical processes ν_n are stochastically separable in the sense of Doob, as is usually true ab initio in cases of interest and can always be obtained by modifications of the process which may, occasionally, appear unnatural (see the Notes to Section 6.1). The "image admissible Suslin" formulation was given in Dudley (1984). The proof, based on Pollard's, is somewhat different, not only as regards measurability, but notably in that Proposition 6.3.6 extends a more specific result of Pollard.

The implication from Pollard's theorem (6.3.1) to Theorem 6.3.3, of Jain and Marcus (1975), was first done in Dudley (1984). Koltchinskii (1981) proved Theorem 6.3.1 for F uniformly bounded and satisfying a further entropy condition.
Notes to Section 6.4. Theorem 6.4.1 and Example 6.4.2 are from Durst and Dudley (1981). Assouad (1982, Proposition C) proved a form of Theorem 6.4.4 with 1/(8e2) in place of 1/2 and so proved Theorem 6.4.5. I thank David Pollard for a remark leading to Proposition 6.4.6. Theorem 6.4.4 and Proposition 6.4.6
were given in Assouad and Dudley (1990), which, like Assouad (1982), was not published. Assouad (1985) is a related work.
References

Adler, R. J. (1990). An Introduction to Continuity, Extrema and Related Topics for General Gaussian Processes. IMS Lecture Notes-Monograph Series 12.
Adler, R. J., and Brown, L. D. (1986). Tail behaviour for suprema of empirical processes. Ann. Probab. 14, 1-30.
Adler, R. J., and Samorodnitsky, G. (1987). Tail behaviour for the suprema of Gaussian processes with applications to empirical processes. Ann. Probab. 15, 1339-1351.
Alexander, K. S. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm. Ann. Probab. 12, 1041-1067; Correction, ibid. 15 (1987), 428-430.
Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM 44, 615-631.
Assouad, P. (1982). Classes de Vapnik-Cervonenkis et vitesse d'estimation (unpublished manuscript).
Assouad, P. (1985). Observations sur les classes de Vapnik-Cervonenkis et la dimension combinatoire de Blei. In Seminaire d'analyse harmonique, 1983-84, Publ. Math. Orsay, Univ. Paris XI, Orsay, pp. 92-112.
Assouad, P., and Dudley, R. M. (1990). Minimax nonparametric estimation over classes of sets (unpublished manuscript).
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929-965.
Cabana, E. M. (1984). On the transition density of a multidimensional parameter Wiener process with one barrier. J. Applied Prob. 21, 197-200.
Devroye, L. (1982). Bounds for the uniform deviation of empirical measures. J. Multivar. Analysis 12, 72-79.
Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
Dudley, R. M. (1984). A course on empirical processes. In Ecole d'ete de probabilites de St-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Dudley, R. M., Gine, E., and Zinn, J. (1991). Uniform and universal Glivenko-Cantelli classes. J. Theoret. Probab. 4, 485-510.
Dudley, R. M., Kulkarni, S. R., Richardson, T., and Zeitouni, O. (1994). A metric entropy bound is not sufficient for learnability. IEEE Trans. Inform. Theory 40, 883-885.
Durst, Mark, and Dudley, R. M. (1981). Empirical processes, Vapnik-Chervonenkis classes and Poisson processes. Probab. Math. Statist. (Wroclaw) 1, no. 2, 109-115.
Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist. 27, 642-669.
Gine, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces (Zaragoza, 1985). Lecture Notes in Math. (Springer) 1221, 50-113.
Goodman, V. (1976). Distribution estimates for functionals of the two-parameter Wiener process. Ann. Probab. 4, 977-982.
Haussler, David (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inform. and Comput. 100, 78-150.
Jain, Naresh, and Marcus, Michael B. (1975). Central limit theorems for C(S)-valued random variables. J. Funct. Analysis 19, 216-231.
Koltchinskii, V. I. (1981). On the central limit theorem for empirical measures. Theor. Probab. Math. Statist. 24, 71-82. Transl. from Teor. Verojatnost. i Mat. Statist. (1981) 24, 63-75.
Leadbetter, M. R., Lindgren, G., and Rootzen, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer, New York.
Ledoux, M. (1996). Isoperimetry and Gaussian analysis. In Ecole d'ete de probabilites de St-Flour, 1994. Lecture Notes in Math. (Springer) 1648, 165-294.
Massart, P. (1983). Vitesses de convergence dans le theoreme central limite pour des processus empiriques. C. R. Acad. Sci. Paris Ser. I 296, 937-940.
Massart, P. (1986). Rates of convergence in the central limit theorem for empirical processes. Ann. Inst. H. Poincare Probab. Statist. 22, 381-423. See also Lecture Notes in Math. (Springer) 1193, 73-109.
Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab. 18, 1269-1283.
Pickands, James III (1969). Upcrossing probabilities for stationary Gaussian processes. Trans. Amer. Math. Soc. 145, 51-73.
Pollard, David B. (1982). A central limit theorem for empirical processes. J. Austral. Math. Soc. Ser. A 33, 235-248.
Samorodnitsky, G. (1991). Probability tails of Gaussian extrema. Stochastic Process. Appl. 38, 55-84.
Shortt, Rae M. (1984). Universally measurable spaces: an invariance theorem and diverse characterizations. Fund. Math. 121, 169-176.
Sierpinski, W., and Szpilrajn, E. (1936). Remarque sur le probleme de la mesure. Fund. Math. 26, 56-61.
Smith, D. L., and Dudley, R. M. (1992). Exponential bounds in Vapnik-Cervonenkis classes of index 1. In Probability in Banach Spaces VIII, ed. R. M. Dudley, M. G. Hahn, J. Kuelbs. Birkhauser, Boston, pp. 451-465.
Steele, J. Michael (1978). Empirical discrepancies and subadditive processes. Ann. Probab. 6, 118-127.
Strobl, F. (1995). On the reversed sub-martingale property of empirical discrepancies in arbitrary sample spaces. J. Theoret. Probab. 8, 825-831.
Talagrand, M. (1987a). The Glivenko-Cantelli problem. Ann. Probab. 15, 837-870.
Talagrand, M. (1987b). Donsker classes and random geometry. Ann. Probab. 15, 1327-1338.
Talagrand, M. (1988). Donsker classes of sets. Probab. Theory Related Fields 78, 169-191.
Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22, 28-76.
Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math. I.H.E.S. 81, 73-205.
Talagrand, M. (1996a). New concentration inequalities in product spaces. Invent. Math. 126, 505-563.
Talagrand, M. (1996b). The Glivenko-Cantelli problem, ten years later. J. Theoret. Probab. 9, 371-384.
Valiant, L. G. (1984). A theory of the learnable. Comm. ACM 27, 1134-1142.
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer, New York.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Vapnik, V. N., and Cervonenkis, A. Ya. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR 181, 781-783. Engl. transl. Sov. Math. Doklady 9, 915-918.
Vapnik, V. N., and Cervonenkis, A. Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264-280.
Vapnik, V. N., and Cervonenkis, A. Ya. (1974). Teoriya Raspoznavaniya Obrazov (Theory of Pattern Recognition) (in Russian). Nauka, Moscow.
German ed.: Theorie der Zeichenerkennung, by W. N. Wapnik and A. Ja. Tscherwonenkis, transl. by K. G. Stockel and B. Schneider, eds. S. Unger and K. Fritzsch (Elektronisches Rechnen und Regeln, Sonderband). Akademie-Verlag, Berlin, 1979.
Vapnik, V. N., and Cervonenkis, A. Ya. (1981). Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory Probab. Appl. 26, 532-553.
Wolfowitz, J. (1954). Generalization of the theorem of Glivenko-Cantelli. Ann. Math. Statist. 25, 131-138.
Metric Entropy, with Inclusion and Bracketing
7.1 Definitions and the Blum-DeHardt law of large numbers

Definitions. Given a measurable space (A, A), let L^0(A, A) denote the set of all real-valued A-measurable functions on A. Given f, g ∈ L^0(A, A), let

[f, g] := {h ∈ L^0(A, A) : f ≤ h ≤ g} (empty unless f ≤ g).

A set [f, g] will be called a bracket. Given a probability space (A, A, P), 1 ≤ q < ∞, F ⊂ L^q(A, A, P) with usual seminorm ‖·‖_q, and ε > 0, let N_[]^(q)(ε, F, P) denote the smallest m such that for some f_1, …, f_m and g_1, …, g_m in L^q(A, A, P), with ‖g_i − f_i‖_q ≤ ε for i = 1, …, m,

(7.1.1)    F ⊂ ∪_{i=1}^m [f_i, g_i].

Here log N_[]^(q)(ε, F, P) will be called a metric entropy with bracketing. Note that the f_i and g_i are not required to be in F. For example, if F is the set of indicators of half-planes in R^2, then f ≤ h ≤ g for f, g, h in F would require the boundary lines of all three half-planes to be parallel. If instead we let f be the indicator of an intersection of two half-planes and g that of a union, then there can be a nondegenerate set of h ∈ F with f ≤ h ≤ g.

Also note that an individual bracket [f, g] has the envelope function max(−f, g) ≤ max(|f|, |g|), and so if (7.1.1) holds, then F has an envelope function max_{1≤j≤m} max(−f_j, g_j). The set of all differences h − H for h and H in [f, g] has an envelope function g − f. So, in this chapter, unlike the last, envelope functions will not be singled out for special attention.

If F ⊂ L^r then for 1 ≤ q ≤ r < ∞, F ⊂ L^q and

(7.1.2)    N_[]^(q)(ε, F, P) ≤ N_[]^(r)(ε, F, P) for all ε > 0.
7.1 Definitions and the Blum-DeHardt law of large numbers
235
For r = ∞, more is true. Let d_sup(f, g) := sup_x |(f − g)(x)| and ‖f‖_sup := d_sup(f, 0). It is easily seen, using brackets [f − ε, f + ε], that for any law P,

(7.1.3)    N_[]^(q)(2ε, F, P) ≤ D(ε, F, d_sup).

Thus, for example, a set F of continuous functions, totally bounded in the usual supremum norm with given bounds on D(ε, F, d_sup), will have the same bounds on all N_[]^(q)(2ε, F, P), 1 ≤ q < ∞.

If F consists of indicator functions of measurable sets then in finding brackets [f_i, g_i] to cover F, it is no loss to assume 0 ≤ f_i ≤ g_i ≤ 1 for all i. Next, if C(i) := {x : f_i(x) > 0}, D(i) := {x : g_i(x) = 1}, and f_i ≤ 1_C ≤ g_i, then

f_i ≤ 1_{C(i)} ≤ 1_C ≤ 1_{D(i)} ≤ g_i.

So [f_i, g_i] can be replaced by [1_{C(i)}, 1_{D(i)}]. If C is a collection of measurable sets and ε > 0, let N_I(ε, C, P) := inf{m : for some C_1, …, C_m and D_1, …, D_m in A, for all C ∈ C there is an i with C_i ⊂ C ⊂ D_i and P(D_i \ C_i) < ε}. Here the I in N_I indicates "inclusion." Then it follows that

(7.1.4)    N_I(ε, C, P) = N_[]^(1)(ε, F, P) where F = {1_C : C ∈ C}.
We have the following law of large numbers:

7.1.5 Theorem (Blum-DeHardt) Suppose F ⊂ L^1(A, A, P) and for all ε > 0, N_[]^(1)(ε, F, P) < ∞. Then F is a strong Glivenko-Cantelli class, that is, lim_{n→∞} ‖P_n − P‖*_F = 0 a.s.

Proof Given ε > 0, take f_1, …, f_m and g_1, …, g_m in L^1, m < ∞, to satisfy (7.1.1) for q = 1 with ‖g_i − f_i‖_1 ≤ ε. Let f_{m+j} := g_j for j = 1, …, m. By the ordinary strong law of large numbers there is an N such that

Pr{sup_{n≥N} max_{j≤2m} |(P_n − P)(f_j)| > ε} < ε.

For each f ∈ F let f_i ≤ f ≤ g_i with P(g_i − f_i) ≤ ε. Then if max_{j≤2m} |(P_n − P)(f_j)| ≤ ε we have

|(P_n − P)(f)| ≤ |(P_n − P)(f_i)| + |(P_n − P)(f − f_i)|
≤ ε + (P_n + P)(g_i − f_i) ≤ 3ε + (P_n − P)(g_i − f_i) ≤ 5ε.

Thus

Pr*{sup_{n≥N} ‖P_n − P‖_F > 5ε} ≤ ε,

and the conclusion follows from Lemma 3.2.6.
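A numerical illustration of these definitions, under assumptions chosen for the sketch rather than taken from the text: for C = {[0, t] : 0 ≤ t ≤ 1} and P = U[0, 1], the pairs C_i = [0, (i−1)ε] ⊂ [0, t] ⊂ D_i = [0, iε] witness N_I(ε, C, P) ≤ ⌈1/ε⌉, and the Glivenko-Cantelli conclusion of Theorem 7.1.5 can be watched empirically.

```python
import numpy as np

def NI_upper_bound(eps):
    # Brackets C_i = [0, (i-1)eps] and D_i = [0, i*eps] have
    # P(D_i \ C_i) = eps under U[0,1], so N_I(eps) <= ceil(1/eps).
    return int(np.ceil(1.0 / eps))

def sup_deviation(n, rng):
    # ||P_n - P||_C = sup_t |P_n([0, t]) - t|, attained at order statistics.
    u = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return float(max(np.max(i / n - u), np.max(u - (i - 1) / n)))

rng = np.random.default_rng(1)
print(NI_upper_bound(0.1))                 # at most 10 brackets of size 0.1
d_small, d_large = sup_deviation(100, rng), sup_deviation(10000, rng)
print(d_small, d_large)                    # deviation shrinks as n grows
```

The shrinking supremum is exactly the conclusion ‖P_n − P‖_F → 0 for this bracketed class.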
The sufficient condition in Theorem 7.1.5 is not necessary. In fact, the following holds.

7.1.6 Proposition There is a probability space (A, A, P) and a strong Glivenko-Cantelli class F := {1_C : C ∈ C} for P, where C ⊂ A is such that for all ε < 1/2, we have N_[]^(1)(ε, F, P) = +∞.

Proof Let A = [0, 1] with P = U[0, 1] the uniform (Lebesgue) law. Let C_m := C(m) be independent sets with P(C_m) = 1/m. Then to show that lim_{n→∞} sup_m |(P_n − P)(C_m)| = 0 a.s., let 0 < ε < 1. For m ≤ 3/ε we have by Bernstein's inequality (1.3.2),
Pr(|(P_n − P)(C_m)| > ε) ≤ 2 exp(−nε² / (2/m + 2ε/3)) ≤ 2 exp(−mnε²/4).
For m > 3/ε, inequality (1.3.11) gives

Pr(|(P_n − P)(C_m)| > ε) ≤ E(nε, n, 1/m) ≤ (e/(mε))^{nε}.

Since e/(mε) < 1 for mε > 3, we have for r = 1 or 2, summing a geometric series,

Σ_{n: nε ≥ r} (e/(mε))^{nε} ≤ (e/(mε))^r / [1 − (e/(mε))^ε].
Thus Σ_{n>2/ε} Σ_{m≥1} Pr(|(P_n − P)(C_m)| > ε) < ∞, and the Borel-Cantelli lemma gives the result.

Now, given functions f_1, …, f_k and g_1, …, g_k in L^1, suppose

{C_m}_{m≥1} ⊂ ∪_i {[f_i, g_i] : P(g_i − f_i) < 1/2}.

We may assume 0 ≤ f_i ≤ g_i ≤ 1 for all i. For each i with P(g_i − f_i) < 1/2, we have Σ{P(C_m) : f_i ≤ 1_{C(m)} ≤ g_i} < +∞, since if the series diverges then for a subsequence C_{m(r)} we have Σ_r P(C_{m(r)}) = +∞, and for C := ∪_r C_{m(r)} we have P(C) = 1 by the Borel-Cantelli lemma, so f_i ≤ 1_C ≤ g_i implies P(g_i) = 1 and P(f_i) ≥ 1 − P(g_i − f_i) > 1/2 > 0; but then, since P(C_m) = 1/m → 0, f_i ≤ 1_{C(m)} holds for only finitely many m, a contradiction. Thus Σ_m P(C_m) < +∞, contradicting Σ_m P(C_m) = Σ_m 1/m = +∞; so N_[]^(1)(ε, {1_{C_m}}, P) = +∞ for every ε < 1/2, which finishes the proof.
On the other hand, let C be the collection of all finite subsets of [0, 1] with Lebesgue law P. Then ‖P_n − P‖_C = 1, which does not approach 0, although (P_n − P)(A) = 0 a.s. for all A ∈ C. This shows that in Theorem 7.1.5, N_[]^(1) < ∞ cannot be replaced by N(ε, F, d_p) = 1 for any L^p distance d_p.

A Banach space (S, ‖·‖) has a dual space (S′, ‖·‖′) of continuous linear forms f : S → R with ‖f‖′ := sup{|f(x)| : x ∈ S, ‖x‖ ≤ 1} < ∞ (RAP, Section 6.1). One way to apply Theorem 7.1.5 is via the following:
7.1.7 Proposition Let (S, ‖·‖) be a separable Banach space and P a law on the Borel sets of S such that ∫ ‖x‖ dP(x) < ∞. Let F be the unit ball of the dual space S′, F := {f ∈ S′ : ‖f‖′ ≤ 1}. Then for every ε > 0, N_[]^(1)(ε, F, P) < ∞.

Proof By Ulam's theorem (RAP, Theorem 7.1.4) take a compact K ⊂ S such that ∫_{S\K} ‖x‖ dP(x) < ε/4. The elements of F, restricted to K, form a uniformly bounded, equicontinuous family, hence totally bounded for the sup norm ‖·‖_K on K by the Arzela-Ascoli theorem. Take f_1, …, f_m ∈ F, m < ∞, such that for all f ∈ F, ‖f − f_j‖_K < ε/4 for some j. Let g_j := f_j − ε/4 on K, g_j(x) := −‖x‖ for x ∉ K; h_j := f_j + ε/4 on K, h_j(x) := ‖x‖ for x ∉ K, all for j = 1, …, m. Then for any f ∈ F, if ‖f − f_j‖_K < ε/4, then g_j ≤ f ≤ h_j and P(h_j − g_j) < ε, so N_[]^(1)(ε, F, P) ≤ m < ∞.

7.1.8 Corollary (Mourier's strong law of large numbers) Let (S, ‖·‖) be a separable Banach space, P a law on S such that ∫ ‖x‖ dP(x) < ∞, and X_1, X_2, … i.i.d. P. Let S_n := X_1 + ⋯ + X_n. Then S_n/n converges a.s. in (S, ‖·‖) to some x_0 ∈ S.

Proof By Theorem 7.1.5 and Proposition 7.1.7, {S_n/n} is a Cauchy sequence a.s. for ‖·‖ and hence converges a.s. to some random variable Y ∈ S. For each f ∈ F, f(Y) = P(f) a.s., that is, Y ∈ f^{−1}({P(f)}) a.s. Let {f_m}_{m≥1} ⊂ F be a countable total set: if f_m(x) = 0 for all m, then x = 0. Such f_m exist by the Hahn-Banach theorem (RAP, Corollary 6.1.5). Let D := ∩_m f_m^{−1}({P(f_m)}). Then Y ∈ D a.s., so D is nonempty. But if y, z ∈ D, then ‖y − z‖ = sup_m |f_m(y − z)| = 0, so D = {x_0} for some x_0.
Direct proof Given ε > 0, there is a Borel measurable function g from S into a finite subset of itself such that P(‖x − g(x)‖) < ε. To show this, let {x_i}_{i=0}^∞ be dense in S, with x_0 := 0. For k = 1, 2, …, let g_k(x) := x_i for the smallest i ≤ k such that ‖x − x_i‖ = min_{r≤k} ‖x − x_r‖. Then ‖x − g_k(x)‖ ≤ ‖x‖ for all k and x, and ‖x − g_k(x)‖ ↓ 0 as k → ∞ for all x, so P(‖x − g_k(x)‖) → 0 by dominated or monotone convergence.

The strong law of large numbers holds for the finite-dimensional variables g(X_j), with some limit x_ε, and

n^{−1} ‖Σ_{j=1}^n (X_j − g(X_j))‖ ≤ n^{−1} Σ_{j=1}^n ‖X_j − g(X_j)‖ → E‖X_1 − g(X_1)‖ < ε

as n → ∞ by the one-dimensional strong law. Thus limsup_{n→∞} ‖x_ε − S_n/n‖ ≤ ε a.s. Letting ε ↓ 0 through some sequence, we get S_n/n converging a.s. to some x_0.
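The quantization step g_k in the direct proof can be illustrated numerically. This is only a sketch under illustrative assumptions: S = R^3 with the Euclidean norm, a finite sample standing in for P, and randomly drawn points playing the role of the dense sequence {x_i}.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3))  # i.i.d. sample standing in for P on S = R^3

# "Dense" sequence {x_i} with x_0 = 0, as in the proof.
centers = np.vstack([np.zeros(3), rng.normal(size=(200, 3))])

def g_k(X, k):
    # Nearest of the first k points x_0, ..., x_{k-1}; ties go to the
    # smallest index, which np.argmin does automatically.
    d = np.linalg.norm(X[:, None, :] - centers[None, :k, :], axis=2)
    return centers[:k][np.argmin(d, axis=1)]

# E||X - g_k(X)|| is nonincreasing in k (g_1 = 0 gives E||X||) and tends
# to 0 as the x_i fill out the space.
errs = [float(np.mean(np.linalg.norm(X - g_k(X, k), axis=1)))
        for k in (1, 10, 201)]
print(errs)
```

The monotone decrease of the mean quantization error is the dominated/monotone convergence step; the finite-range variables g(X_j) then obey the ordinary strong law.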
7.1.9 Corollary If F ⊂ L^1(A, A, P) and {δ_x : x ∈ A} is separable for ‖·‖_F, then ‖P_n − P‖_F = ‖P_n − P‖*_F → 0 a.s.

Proof This follows from Corollary 7.1.8, since finite linear combinations of δ_x, x ∈ A, with rational coefficients, are dense in their completion for ‖·‖_F, a Banach space.

The proof of Proposition 7.1.7 and Corollary 7.1.8 together from Theorem 7.1.5 is no shorter than the direct proof. On the other hand, if F = {1_{[0,t]} : 0 ≤ t ≤ 1} and P is Lebesgue measure on [0, 1], then Theorem 7.1.5 applies but Corollary 7.1.9 does not.
7.2 Central limit theorems with bracketing

In this section the bracketing will be in L^2. A bracket [f, h] will be called a δ-bracket if (∫(h − f)² dP)^{1/2} ≤ δ. The following main theorem will be proved. Then, Corollary 7.2.13 gives a hypothesis on N_[]^(1) for uniformly bounded classes of functions.
7.2 Central limit theorems with bracketing
239
7.2.1 Theorem (M. Ossiander) Let (X, A, P) be a probability space and let F ⊂ L^2(X, A, P) be such that

∫_0^1 (log N_[]^(2)(x, F, P))^{1/2} dx < ∞.

Then F is a P-Donsker class.

Proof (Arcones and Gine). First, a lemma will help. Recall that an envelope G for a class G of functions is a measurable function such that |g(x)| ≤ G(x) for all g ∈ G and all x.

7.2.2 Lemma Let (X, A, P) be a probability space and G a set of real-valued measurable functions on X, with an envelope G. Let B ∈ A and δ > 0. Suppose that P(G 1_B) ≤ δ/2 and that for each g ∈ G, A(g) is a measurable set with A(g) ⊂ B. Then

Pr*{‖(P_n − P)(g 1_{A(g)})‖_G > 2δ} ≤ Pr{|(P_n − P)(G 1_B)| > δ}.

Proof |P(g 1_{A(g)})| ≤ P(G 1_B) ≤ δ/2 for all g ∈ G, so

Pr*{‖(P_n − P)(g 1_{A(g)})‖_G > 2δ} ≤ Pr*{‖P_n(g 1_{A(g)})‖_G > 3δ/2}
≤ Pr{P_n(G 1_B) > 3δ/2} ≤ Pr{|(P_n − P)(G 1_B)| > δ}.

Now to prove the theorem, let N_k := N_[]^(2)(2^{−k}, F, P), k = 1, 2, …, and let γ_k := (log(k N_1 ⋯ N_k))^{1/2}. Then γ_k is increasing in k. By the integral test,
Σ_{k=1}^∞ (log N_k)^{1/2}/2^{k+1} < ∞, so

Σ_{k=1}^∞ 2^{−k} γ_k ≤ Σ_{k=1}^∞ 2^{−k} [(log k)^{1/2} + Σ_{j=1}^k (log N_j)^{1/2}]
≤ Σ_{k=1}^∞ (log k)^{1/2}/2^k + Σ_{j=1}^∞ (log N_j)^{1/2} Σ_{k=j}^∞ 2^{−k} < ∞.
Let β_k := Σ_{j=k}^∞ γ_j/2^j. Then β_k → 0 as k → ∞. Let S_{ki} := [f_{ki}, h_{ki}], i = 1, …, N_k, be a set of 2^{−k}-brackets covering F. Let T_{ki} := S_{ki} \ ∪_{i′<i} S_{ki′}. For each k and each s = (s_1, …, s_k) with 1 ≤ s_j ≤ N_j for each j, let A_{k,s} := ∩_{j=1}^k T_{j,s_j}. For each nonempty set A_{k,s}, choose f_{k,s} ∈ A_{k,s}. For each f ∈ F let A_k(f) := A_{k,s} and π_k f := f_{k,s} for the unique s := s(f) := s(f, k) such that f ∈ A_{k,s}.

Let Δ_k f := Δ_k(f) := sup{|g − h| : g, h ∈ A_k(f)}. Then for any f ∈ F, Δ_k(f) is nonincreasing in k and

(7.2.3)    E((Δ_k f)²) ≤ 2^{−2k}.

Note that Δ_k f depends on f only through s(f). For k ≥ 1, n ≥ 1, and f ∈ F let

(7.2.4)    B_k := B(k) := B(k, f, n) := {x ∈ X : Δ_k f(x) > n^{1/2}/(2^{k+1} γ_{k+1})}.

For any fixed j and n and x ∈ X let

τ_f := τ_{j,n}(f, x) := min{k ≥ j : x ∈ B(k, f, n)},

where min ∅ := +∞. Then {τ_f = j} = B(j, f, n), {τ_f > j} = {Δ_j f ≤ n^{1/2}/(2^{j+1} γ_{j+1})}, and for k > j,

(7.2.5)    {τ_f ≥ k} ⊂ X \ B_{k−1} ⊂ {Δ_k f ≤ Δ_{k−1} f ≤ n^{1/2}/(2^k γ_k)}
by (7.2.3), and {τ_f = k} ⊂ B_k \ B_{k−1}.

For any f, g ∈ F, let ρ_[](f, g) := 1/2^K for the largest K such that for some s, f and g are both in A_{K,s}, or ρ_[](f, g) := 0 if this holds for arbitrarily large K. Then F is totally bounded for ρ_[]. So by Theorem 3.7.2 it will be enough to prove the asymptotic equicontinuity condition for ρ_[], in other words that for every α > 0,

(7.2.6)    lim_{j→∞} limsup_{n→∞} Pr*{n^{1/2} ‖(P_n − P)(f − π_j f)‖_F > α} = 0.

We can assume 0 < α < 1. Then for any positive integers j < r, f − π_j f will be decomposed as follows:

(7.2.7)    f − π_j f = (f − π_j f) 1_{τ_f = j} + (f − π_r f) 1_{τ_f ≥ r}
           + Σ_{k=j+1}^{r−1} (f − π_k f) 1_{τ_f = k}
           + Σ_{k=j+1}^{r} (π_k f − π_{k−1} f) 1_{τ_f ≥ k};

this is easily seen for r = j + 1 and then by induction on r. The decomposition (7.2.7) will give a bound for the outer probability in (7.2.6) by a sum of four terms, labeled (I), (II), (III), and (IV) below.

Let ε := α/8. Fix j = j(ε) large enough so that

(7.2.8)    β_j < ε/24 and Σ_{k>j} k^{−12} < 2ε.
Then choose r = r(n) > j large enough so that

(7.2.9)    n^{1/2} 2^{−r} ≤ ε/4 and 2 exp(γ_r² − 2^{r−1} γ_r ε) < ε,

which is possible since γ_r increases with r. Lemma 7.2.2 will be applied to classes of functions

G := G(k, s) := G_{k,s} := {f − π_k f : f ∈ F, π_k f = f_{k,s}}

with envelope G := G_{k,s} := Δ_k f_{k,s}.

About (I): for any function ψ ≥ 0 and t > 0, 1_{ψ>t} ≤ ψ/t. So by (7.2.4),

n^{1/2} E(1_{B_j} Δ_j f) ≤ 2^{j+1} γ_{j+1} E((Δ_j f)²) ≤ 2^{1−j} γ_{j+1} ≤ 4β_j < ε/4

for all f ∈ F. Then since {τ_f = j} = B_j,

n^{1/2} |P((f − π_j f) 1_{τ_f = j})| ≤ n^{1/2} P(1_{B_j} Δ_j f) ≤ ε/4.

Apply Lemma 7.2.2 to G_{j,s} for each s with B := B_j := B(j, f_{j,s}, n), δ := ε/n^{1/2}, and A(f − π_j f) := B_j = {τ_f = j} in this case. Then

Pr*{‖n^{1/2}(P_n − P)(g 1_{B_j})‖_{G(j,s)} > 2ε} ≤ Pr{n^{1/2} |(P_n − P)((Δ_j f_{j,s}) 1_{B_j})| > ε}.

Then summing over s,

(I) := Pr*{‖n^{1/2}(P_n − P)((f − π_j f) 1_{τ_f = j})‖_F > 2ε}
    ≤ exp(γ_j²) max_s Pr{n^{1/2} |(P_n − P)(Δ_j f_{j,s} 1_{B_j})| > ε}
    ≤ exp(γ_j²) ε^{−2} max_s Var(Δ_j f_{j,s} 1_{B_j}).

As n → ∞, for fixed j and s, by (7.2.3) and (7.2.4), since P(B_j) = P(B(j, f_{j,s}, n)) → 0 for each j and s, we have (I) → 0, so (I) < ε for n large enough.

About (II): we have by (7.2.3) and (7.2.9),
(7.2.10)    n^{1/2} E(Δ_r f · 1_{Δ_r f ≤ n^{1/2}/(2^r γ_r)}) ≤ n^{1/2} 2^{−r} ≤ ε/4

and

(7.2.11)    E((Δ_r f)² 1_{Δ_r f ≤ n^{1/2}/(2^r γ_r)}) ≤ 2^{−2r}.

Now by (7.2.5) for k = r, and since |f − π_r f| ≤ Δ_r f, we have by (7.2.10), n^{1/2} |P((f − π_r f) 1_{τ_f ≥ r})| ≤ ε/4 for all f ∈ F. Apply Lemma 7.2.2 for each s with A(f − π_r f) := {τ_f ≥ r} and B := {Δ_r f_{r,s} ≤ n^{1/2}/(2^r γ_r)}, noting that Δ_r f ≤ Δ_{r−1} f for all f. Thus by (7.2.5) again,

(II) := Pr*{‖n^{1/2}(P_n − P)((f − π_r f) 1_{τ_f ≥ r})‖_F > 2ε}
     ≤ exp(γ_r²) max_s Pr{n^{1/2} |(P_n − P)(Δ_r f_{r,s} 1_{Δ_r f_{r,s} ≤ n^{1/2}/(2^r γ_r)})| > ε}.

Then by Bernstein's inequality (Proposition 1.3.2) and (7.2.5),

(II) ≤ 2 exp(γ_r² − ε² / [2ε/(2^{r+2} γ_r) + 2ε/(3 · 2^r γ_r)])
    = 2 exp(γ_r² − (6/7) 2^r γ_r ε) ≤ 2 exp(γ_r² − 2^{r−1} γ_r ε),

since 2^{−r} γ_r ≤ β_r ≤ β_j < ε/24 by the definition of β_r and (7.2.8). Then (II) < ε by (7.2.9).
About (III): for k = j + 1, …, r − 1, by (7.2.4) and (7.2.3),

(7.2.12)    n^{1/2} E((Δ_k f) 1_{τ_f = k}) ≤ 2^{k+1} γ_{k+1} E((Δ_k f)²) ≤ 2^{1−k} γ_{k+1}.

Then

(III) := Pr*{n^{1/2} ‖(P_n − P)(Σ_{k=j+1}^{r−1} (f − π_k f) 1_{τ_f = k})‖_F > 2ε}
      ≤ Σ_{k=j+1}^{r−1} Pr*{n^{1/2} ‖(P_n − P)((f − π_k f) 1_{τ_f = k})‖_F > 2^{−k} ε γ_{k+1}/β_j},

since Σ_{k>j} 2^{−k} ε γ_{k+1}/β_j ≤ 2ε. To apply Lemma 7.2.2 for each k = j + 1, …, r − 1 and for each s, let δ := 2^{−k−1} ε γ_{k+1}/β_j, and A(f − π_k f) := B := {τ_f = k} for f − π_k f ∈ G_{k,s}. The hypothesis of the lemma holds since β_j < ε/24 (7.2.8) implies 2^{1−k} γ_{k+1} ≤ 2^{−k−2} γ_{k+1} ε/β_j, and by (7.2.5). So

(III) ≤ Σ_{k=j+1}^{r−1} Pr{n^{1/2} ‖(P_n − P)((Δ_k f) 1_{τ_f = k})‖_F > 2^{−k−1} ε γ_{k+1}/β_j}.
Then by Bernstein's inequality again (applied to each of the at most exp(γ_k²) distinct functions (Δ_k f) 1_{τ_f = k}), (7.2.3), and (7.2.5),

(III) ≤ Σ_{k=j+1}^{r−1} exp(γ_k² − (2^{−k−1} ε γ_{k+1}/β_j)² / [2^{1−2k} + 2^{−k} ε γ_{k+1}/(3 · 2^k γ_k β_j)])
     = Σ_{k=j+1}^{r−1} exp(γ_k² − ε² γ_{k+1}² / (4β_j² (2 + ε γ_{k+1}/(3 γ_k β_j)))).

Now since γ_k is increasing with k,

−ε² γ_{k+1}² / (4β_j² (2 + ε γ_{k+1}/(3 γ_k β_j))) = −ε² γ_{k+1} / (8β_j²/γ_{k+1} + 4εβ_j/(3γ_k))
   ≤ −ε² γ_k γ_{k+1} / (β_j (8β_j + 4ε/3)).

The latter expression, since β_j < ε/24 by (7.2.8) and so 8β_j + 4ε/3 < 5ε/3, is bounded above by

−ε² γ_k γ_{k+1} / (5εβ_j/3) = −3ε γ_k γ_{k+1}/(5β_j) ≤ −3ε γ_k²/(5β_j).

It follows that

(III) ≤ Σ_{k=j+1}^{r−1} exp(γ_k² − 3ε γ_k²/(5β_j)) ≤ Σ_{k=j+1}^{r−1} exp(−ε γ_k²/(2β_j))
     ≤ Σ_{k=j+1}^{r−1} exp(−12 γ_k²) = Σ_{k=j+1}^{r−1} 1/(k N_1 ⋯ N_k)^{12} ≤ Σ_{k>j} k^{−12} < 2ε

by (7.2.8).
About (IV): if k = j + 1, …, r and g ∈ A_k(f), then π_k(g) = π_k(f), π_{k−1}(g) = π_{k−1}(f), and Δ_u g = Δ_u f for all u = 1, …, k, so {τ_f ≥ k} = {τ_g ≥ k}. Thus the number of distinct functions (π_k f − π_{k−1} f) 1_{τ_f ≥ k} is at most exp(γ_k²). Also, π_k f ∈ A_{k−1}(f) and so by (7.2.3), E((π_k f − π_{k−1} f)²) ≤ 2^{2−2k}. Now

(IV) := Pr*{n^{1/2} ‖(P_n − P)(Σ_{k=j+1}^{r} (π_k f − π_{k−1} f) 1_{τ_f ≥ k})‖_F > ε}
     ≤ Σ_{k=j+1}^{r} Pr{n^{1/2} ‖(P_n − P)((π_k f − π_{k−1} f) 1_{τ_f ≥ k})‖_F > 2^{−k} ε γ_k/β_j}

from the definition of β_j. Bernstein's inequality and (7.2.5) give

(IV) ≤ Σ_{k=j+1}^{r} exp(γ_k² − (2^{−k} ε γ_k/β_j)² / [2^{3−2k} + (2/3) 2^{−2k} ε/β_j])
    = Σ_{k=j+1}^{r} exp(γ_k² − ε² γ_k² / (β_j² (8 + 2ε/(3β_j)))).

Now since β_j < ε/24, 8 + 2ε/(3β_j) < ε/β_j, and

(IV) ≤ Σ_{k=j+1}^{r} exp(γ_k² (1 − ε/β_j)) ≤ Σ_{k=j+1}^{r} exp(−23 γ_k²) < 2ε

as in (III). Thus the expression in (7.2.6) is less than α. Letting α ↓ 0, the proof of Theorem 7.2.1 is complete.

Theorem 7.2.1 implies the following for L^1 entropy with bracketing:
7.2.13 Corollary Let (X, A, P) be a probability space and F a uniformly bounded set of measurable functions on X. Suppose that

∫_0^1 (log N_[]^(1)(x², F, P))^{1/2} dx < ∞.

Then F is a Donsker class for P.

Proof Suppose |f(x)| ≤ M < ∞ for all f ∈ F and x ∈ X. Since multiplication by a constant preserves the Donsker property (by Theorem 3.7.2), we can assume M = 1/2. Then for any f, g ∈ F, |f − g| ≤ 1 everywhere. So if ∫|f − g| dP ≤ ε² then (∫|f − g|² dP)^{1/2} ≤ ε. So

N_[]^(2)(ε, F, P) ≤ N_[]^(1)(ε², F, P)

and the result follows from Theorem 7.2.1.
It will be seen in the next section that Corollary 7.2.13, and thus Theorem 7.2.1, are best possible (provide characterizations of the Donsker property) in some cases.
7.3 The power set of a countable set: the Borisov-Durst theorem

Let P be a law on the set N of nonnegative integers. The next theorem gives a criterion for the Donsker property of the collection 2^N of all subsets of N, for P, in terms of the numbers p_m := P({m}) for m ≥ 0. We also find that the sufficient condition given in Corollary 7.2.13 is necessary for 2^N. Recall N_I as defined above Theorem 7.1.5.

7.3.1 Theorem The following are equivalent:

(a) 2^N is a Donsker class for P;
(b) Σ_m p_m^{1/2} < ∞;
(c) ∫_0^1 (log N_I(x², 2^N, P))^{1/2} dx < ∞.

Proof We have (c) ⟹ (a) by Corollary 7.2.13. Next, to prove (a) ⟹ (b), suppose Σ_m p_m^{1/2} = ∞. The random variables W(m) := W_P(1_{{m}}) (for the isonormal W_P on L^2(P) as defined in Section 2.5) are independent and Gaussian with mean 0 and variances p_m. We can write G_P(f) = W_P(f) − P(f) W_P(1) since the right side is Gaussian and has mean 0 and the covariances of G_P. Then Σ_m E|W_P({m})| = (2/π)^{1/2} Σ_m p_m^{1/2} diverges, while Σ_m Var(|W_P({m})|) ≤ Σ_m p_m < ∞. Thus for any M < ∞, by Chebyshev's inequality, Pr(Σ_{j=1}^m |W_P({j})| > M) → 1 as m → ∞. Thus Σ_j |W_P({j})| = +∞ almost surely. Now Σ_m p_m |W_P(1_N)| < ∞ a.s., so Σ_m |G_P({m})| = +∞ a.s. Hence sup_{A⊂N} G_P(1_A) = +∞ a.s. and 2^N is not a pregaussian class, so a fortiori not a Donsker class. Thus (a) implies (b).
Next, to prove (b) ⟹ (c): equivalently, let us prove

Σ_{k=1}^∞ 2^{−k} (log N_I(4^{−k}, 2^N, P))^{1/2} < ∞.

We can assume p_m ≥ p_r > 0 for m ≤ r. Let r_j be the number of values of m such that 4^{−j−1} < p_m^{1/2} ≤ 4^{−j}, j = 0, 1, 2, …, and C_j := r_j/4^j. Then Σ_j C_j ≤ 4 Σ_m p_m^{1/2} < ∞. For k ≥ k_0 large enough there is a unique j(k) such that

(7.3.2)    Σ_{j>j(k)} C_j/4^j ≤ 4^{−k} < Σ_{j≥j(k)} C_j/4^j.

Let m(k) := m_k := Σ_{j=0}^{j(k)} r_j. Then

Σ_{m>m(k)} p_m ≤ Σ_{j>j(k)} r_j/4^{2j} ≤ 4^{−k}.

Let A_i run over all subsets of {1, …, m(k)}, where i = 1, …, 2^{m(k)}. Let B_i := A_i ∪ {m ∈ N : m > m(k)}. Then for any C ⊂ N, C ∩ {1, …, m(k)} = A_i for some i. Then A_i ⊂ C ⊂ B_i and P(B_i \ A_i) ≤ 4^{−k}. So N_I(4^{−k}, 2^N, P) ≤ 2^{m(k)+1}. Thus it will be enough to prove

(7.3.3)    Σ_k m_k^{1/2}/2^k < ∞,
with >k restricted to k > k0. We have j(k)
mk/2/2k
> C j=0
k
j(k)
<
1/2
4lCJ)
00
2j-kC)l2 =
j=0
k
/2k
/I
C)12
j=0
1: 2j-k k j<<jj(k)
To prove this converges, since > j C j < oo, it is enough by Cauchy's inequality to prove that > j (>k. j< j(k) 2i k)2 < oo. Let k(j) be the smallest k such that
j(k) > j. Then
E 2j-k < 2j+1-k(J) k: j:5 jj (k)
To prove that Y j 4j-k(j) < oo, setting j (k0 - 1) := 0, we have
4J-k(J) <
T, k
j>11
41+J(k)-k
Y 4j-k < j: j(k-1)<j<j(k)
k
For each k, let K(k) be the smallest K such that j(K) = j(k), and let 𝒦 denote the range of K(·). Summing first over the k with a common value K = K(k), for which j(k) = j(K) and k ≥ K,

Σ_k 4^{j(k)−k} = Σ_{K∈𝒦} Σ_{k: K(k)=K} 4^{j(K)−k} ≤ 2 Σ_{K∈𝒦} 4^{j(K)−K}.

From (7.3.2) applied with K in place of k, 4^{−K} < Σ_{j≥j(K)} C_j/4^j for K ∈ 𝒦, so

Σ_{K∈𝒦} 4^{j(K)−K} < Σ_{K∈𝒦} 4^{j(K)} Σ_{j≥j(K)} C_j 4^{−j} = Σ_j C_j 4^{−j} Σ_{K∈𝒦: j(K)≤j} 4^{j(K)}.

Since K ↦ j(K) is one-to-one on 𝒦, the inner sum is at most Σ_{i≤j} 4^i ≤ 2·4^j, so the whole sum is at most 4 Σ_j C_j < ∞.
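The bracketing bound N_I(4^{−k}, 2^ℕ, P) ≤ 2^{m(k)+1} in the proof rests on cutting ℕ at a point m where the tail mass is small. A minimal numerical sketch of that construction (the function name and the geometric example are my own, not from the text):

```python
def n_i_upper_bound(p, eps):
    """Upper bound on the bracketing number N_I(eps, 2^N, P) for a law P on N with
    point masses p listed in nonincreasing order.  Brackets are [A, A ∪ tail] over
    all subsets A of {0, ..., m-1}, where m is smallest with P({m, m+1, ...}) <= eps,
    so each bracket difference has mass <= eps and N_I(eps, 2^N, P) <= 2^(m+1)."""
    tail = sum(p)
    m = 0
    while tail > eps:
        tail -= p[m]
        m += 1
    return 2 ** (m + 1)
```

For the geometric law p_n = 2^{−n−1} and ε = 1/4 this gives m = 2 and the bound 2³ = 8.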
Recall Remark 6.4.3, which implies that if C = 2^ℕ then we have

sup_{A∈C} |(P_n − P)(A)| → 0 a.s., n → ∞,

for any law P on ℕ.
**7.4 Bracketing and majorizing measures

Andersen, Giné, Ossiander and Zinn (1988) combine the ideas of majorizing measures (Section 2.7) and bracketing to get extended central limit theorems (Section 4 of their paper) and laws of the iterated logarithm (Section 5). They consider triangular arrays, so that variables are independent but not i.i.d.; they allow envelope functions not necessarily in L²; and they use truncation, so that bracketing need not hold everywhere but only on sets of high enough probability.
Problems

1. Let (K, d) be a compact metric space. For 0 < α ≤ 1 let L_α(K, d) be the set of all real-valued functions f on K such that |f(x) − f(y)| ≤ d(x, y)^α for all x, y ∈ K. Show that for any law P on K, L_α(K, d) is a strong Glivenko–Cantelli class for P.
2. Show that {f ∈ C[0, 1] : ||f||_sup ≤ 1} is not a weak Glivenko–Cantelli class for P = U[0, 1].

3. Let (X, A, P) be a probability space such that {x} ∈ A for all x ∈ X and P has no soft atoms, as defined in Problem 9 of Chapter 5. Let C ⊂ A be such that C is linearly ordered by inclusion and generates A. Give an upper bound on N_I(ε, C, P) for 0 < ε < 1.
4. If X = ℕ, A = 2^ℕ, and p_n := P({n}) := 1/2^{n+1} for n = 0, 1, 2, …, evaluate N_I(ε, A, P) for 0 < ε < 1.
5. If X = ℕ, A = 2^ℕ, and for some a ∈ ℝ, p_n := P({n}) := c_a/[(n+1)²(log(n+2))^a] for all n ∈ ℕ, where c_a is such that Σ_n p_n = 1, for what values of a is A a Donsker class for P?

6. Let (X, A, P) be a probability space. For j = 1, …, k let F_j ⊂ ℒ²(X, A, P) each satisfy the hypothesis of Theorem 7.2.1. Then show that the hypothesis also holds for:
(a) ∪_{j=1}^k F_j;
(b) {Σ_{j=1}^k f_j : f_j ∈ F_j for each j}.
Let C_j be collections of sets and F_j := {1_A : A ∈ C_j}. Show that the hypothesis of Theorem 7.2.1 also holds for F := {1_A : A ∈ C} where
(c) C := {∩_{j=1}^k A_j : A_j ∈ C_j for each j}.
7. Let C := {[0, a] × [0, b] : 0 ≤ a ≤ 1, 0 ≤ b ≤ 1}. Let P be the uniform distribution on the unit square [0, 1] × [0, 1]. Give an upper bound for N_I(ε, C, P) for 0 < ε < 1.

8. In the unit cube X := [0, 1]^d in ℝ^d, with uniform (Lebesgue) law P, let C be a collection of subsets A of X such that for some K < ∞ and c > 0, whenever X is decomposed as a union of n^d cubes of side 1/n, each with vertices having coordinates i/n, i = 0, 1, …, n, the boundary of A intersects at most Kn^{d−c} of these cubes. Show that N_I(ε, C, P) < ∞ for all ε > 0.
9. Show that the hypothesis of the previous problem holds if C is the collection of convex subsets of the unit square in ℝ².

10. Prove the same for the unit cube in ℝ^d for any finite dimension d.

11. If F is a class of indicators of measurable sets and ε > 0, show that N_{[]}^{(2)}(ε, F, P) = N_I(ε², F, P) (compare Corollary 7.2.13 and its proof).

12. Prove Corollary 7.2.13 directly by Bernstein's inequality (1.3.2).
Notes

Notes to Section 7.1. Theorem 7.1.5 is due to Blum (1955, Lemma 1) for families of (indicators of) sets and to DeHardt (1971, Lemma 1) for uniformly bounded families of functions. Mourier (1951; 1953, pp. 195–196) proved the law of large numbers in general separable Banach spaces, Corollary 7.1.8. I do not know a reference for Propositions 7.1.6 or 7.1.7. In the proof of Corollary 7.1.8, (Bochner or Pettis) integrals of Banach-valued functions (defined in Appendix E) were not assumed, so they had to be, in part, reconstructed. For a survey of laws of large numbers for empirical measures up to 1978, see Gaenssler and Stute (1979).

Notes to Section 7.2. Theorem 7.2.1 is due to Ossiander (1987). The shorter proof presented here is an expanded version of that of Arcones and Giné (1993, Theorem 4.10) and applies a technique from Andersen, Giné, Ossiander and Zinn (1988), who proved an extended bracketing central limit theorem. Corollary 7.2.13 was proved earlier, first for classes of sets in Dudley (1978), then for classes of functions in Dudley (1984).

Notes to Section 7.3. In Theorem 7.3.1, it is not hard to prove that (b) ⇒ (a). Durst and Dudley (1981) proved the equivalence of (a) and (b). Borisov (1981) discovered and announced the more difficult implication (b) ⇒ (c). I have not seen his proof.
References

Andersen, Niels Trolle, Giné, E., Ossiander, M., and Zinn, J. (1988). The central limit theorem and the law of the iterated logarithm for empirical processes under local conditions. Probab. Theory Related Fields 77, 271–305.
Arcones, Miguel A., and Giné, E. (1993). Limit theorems for U-processes. Ann. Probab. 21, 1494–1542.
Blum, J. R. (1955). On the convergence of empiric distribution functions. Ann. Math. Statist. 26, 527–529.
Borisov, I. S. (1981). Some limit theorems for empirical distributions (in Russian). Abstracts of Reports, Third Vilnius Conf. Probability Th. Math. Statist. 1, 71–72.
DeHardt, J. (1971). Generalizations of the Glivenko–Cantelli theorem. Ann. Math. Statist. 42, 2050–2055.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899–929; Correction, ibid. 7 (1979), 909–911.
Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de St.-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1–142.
Durst, Mark, and Dudley, R. M. (1981). Empirical processes, Vapnik–Červonenkis classes and Poisson processes. Probab. Math. Statist. (Wrocław) 1, no. 2, 109–115.
Gaenssler, Peter, and Stute, Winfried (1979). Empirical processes: a survey of results for independent and identically distributed random variables. Ann. Probab. 7, 193–243.
Mourier, Edith (1951). Lois de grands nombres et théorie ergodique. C. R. Acad. Sci. Paris 232, 923–925.
Mourier, E. (1953). Éléments aléatoires dans un espace de Banach. Ann. Inst. H. Poincaré 13, 161–244.
Ossiander, Mina (1987). A central limit theorem under metric entropy with L₂ bracketing. Ann. Probab. 15, 897–919.
8 Approximation of Functions and Sets
8.1 Introduction: the Hausdorff metric

In this chapter upper and lower bounds will be shown for the metric entropies of various concrete classes of functions on Euclidean spaces and sets in such spaces. Some metric entropies with and without bracketing are treated. Metrics for functions are those of ℒ^p, 1 ≤ p < ∞. For sets we use the d_P metrics, d_P(B, C) := P(BΔC), or the Hausdorff metric, defined as follows. For any metric space (S, d), x ∈ S, and a nonempty set B ⊂ S, let

d(x, B) := inf{d(x, y) : y ∈ B}.

For nonempty B, C ⊂ S, the Hausdorff pseudometric is defined by

h(B, C) := max(sup_{x∈B} d(x, C), sup_{x∈C} d(x, B)).
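For finite point sets the two suprema in this definition can be computed directly; a small illustrative sketch (the function names are my own):

```python
def hausdorff(B, C):
    """Hausdorff distance h(B, C) = max(sup_{x in B} d(x, C), sup_{x in C} d(x, B))
    for finite, nonempty point sets in R^d with the Euclidean metric."""
    def dist_to_set(x, A):
        # d(x, A) := inf{d(x, y) : y in A}
        return min(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) ** 0.5 for a in A)
    return max(max(dist_to_set(x, C) for x in B),
               max(dist_to_set(x, B) for x in C))
```

Note the asymmetry of the two inner terms: sup_{x∈B} d(x, C) alone is not a metric, which is why both directions are taken.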
Then h is a metric on the collection of bounded, closed, nonempty sets. Let h(∅, C) := h(C, ∅) := +∞ for C ≠ ∅, and h(∅, ∅) := 0.

On ℝ^d we have the usual Euclidean metric d(x, y) := |x − y| where |u| := (u₁² + ⋯ + u_d²)^{1/2}, u ∈ ℝ^d. For any set H ⊂ ℝ^{d−1} and function f from H into [0, ∞] let

J_f = J(f) := {x ∈ ℝ^d : 0 ≤ x_d ≤ f(x^{(d)}), x^{(d)} ∈ H},

where x^{(d)} := (x₁, …, x_{d−1}). Then J(f) is the subgraph of f. For any other function g ≥ 0 on H, clearly h(J_f, J_g) ≤ d_sup(f, g). Thus for any collection F of real functions ≥ 0 on H, and any ε > 0,

(8.1.1)  D(ε, {J_f : f ∈ F}, h) ≤ D(ε, F, d_sup).

If d_sup(f, g) ≤ ε and j := max(f − ε, 0), where g ≥ 0, then 0 ≤ j ≤ g ≤ f + ε, so J_j ⊂ J_g ⊂ J_{f+ε}. If P is a law on H × [0, ∞) having a density p
with respect to Lebesgue measure on ℝ^d with p(x) ≤ M < ∞ for all x, then

(8.1.2)  N_I(2Mε, {J_f : f ∈ F}, P) ≤ D(ε, F, d_sup).
In the converse direction there are corresponding estimates for Lipschitz functions. Recall that ||f||_L := sup_{x≠y} |f(x) − f(y)|/|x − y|. Then we have:

8.1.3 Lemma  If ||f||_L ≤ K and ||g||_L ≤ K on H ⊂ ℝ^{d−1}, with f ≥ 0 and g ≥ 0, then h(J_f, J_g) ≥ min(1, 1/K) d_sup(f, g)/2.

Proof  Let t := d_sup(f, g) and let 0 < s < t. By symmetry, assume that for some x ∈ H, f(x) > g(x) + s. Then for any y ∈ H, either |x − y| > s/(2K) or g(y) ≤ g(x) + s/2 < f(x) − s/2. In either case, if z ≤ g(y) then |(x, f(x)) − (y, z)| ≥ min(1, 1/K)s/2. Letting s ↑ t, the result follows.

Recall that a bounded number of Boolean operations preserves the Vapnik–Červonenkis property (Theorem 4.2.4). The same holds for classes of sets satisfying bounds on metric entropy (with inclusion). For any families C_j of subsets of a set X, extending the notation in Section 4.5, let

⊓_{j=1}^k C_j := {∩_{j=1}^k A_j : A_j ∈ C_j for all j},  ⊔_{j=1}^k C_j := {∪_{j=1}^k A_j : A_j ∈ C_j for all j}.
8.1.4 Theorem  Let (X, A, P) be a probability space and C_j ⊂ A for j = 1, …, k. Let f_{1,j}(ε) := log N_I(ε, C_j, P) and f_{2,j}(ε) := log D(ε, C_j, d_P) for j = 1, …, k. Let C_0 := ⊓_{j=1}^k C_j. If for i = 1 or 2 there are a γ > 0 and constants M_1, …, M_k such that f_{i,j}(ε) ≤ M_j ε^{−γ} for 0 < ε < 1 and j = 1, …, k, then the same holds for j = 0. The statements also hold, for i = 1 or 2, for C_0 := ⊔_{j=1}^k C_j.
Proof  For N_I, given 0 < ε < 1, for each j = 1, …, k take m_j ≤ exp(M_j(k/ε)^γ) brackets [A_{jr}, B_{jr}] covering C_j with P(B_{jr} \ A_{jr}) ≤ ε/k for r = 1, …, m_j. If A_{j,r(j)} ⊂ C_{j,r(j)} ⊂ B_{j,r(j)} for j = 1, …, k and some r(j), then

A(r) := ∩_{j=1}^k A_{j,r(j)} ⊂ C(r) := ∩_{j=1}^k C_{j,r(j)} ⊂ B(r) := ∩_{j=1}^k B_{j,r(j)}

and P(B(r) \ A(r)) ≤ k(ε/k) = ε. The result for N_I then follows with M_0 := k^γ(M_1 + M_2 + ⋯ + M_k). Without inclusions, and/or for ∪ instead of ∩, the proof is similar.
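The bracket-combination step of this proof can be imitated on a finite space. A sketch under my own illustrative setup (a ten-point space with interval-type brackets; names are hypothetical):

```python
from itertools import product

def combine_brackets(bracket_lists):
    """For classes C_1, ..., C_k with brackets (A, B), A ⊂ B (as frozensets),
    form the product brackets (∩_j A_j, ∩_j B_j) covering the class of
    intersections; the difference set satisfies ∩B_j \ ∩A_j ⊂ ∪_j (B_j \ A_j)."""
    combined = []
    for combo in product(*bracket_lists):
        A = frozenset.intersection(*(a for a, _ in combo))
        B = frozenset.intersection(*(b for _, b in combo))
        combined.append((A, B))
    return combined
```

With two lists of one-point-difference brackets, every combined bracket difference has at most two points, matching the bound P(B(r) \ A(r)) ≤ k(ε/k).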
8.2 Spaces of differentiable functions and sets with differentiable boundaries

For any α > 0, spaces of functions will be defined having "bounded derivatives through order α." If β is the largest integer < α, the functions will have partial derivatives through order β bounded, and the derivatives of order β will satisfy a uniform Hölder condition of order α − β. Still more specifically: for x := (x₁, …, x_d) ∈ ℝ^d and p = (p₁, …, p_d) ∈ ℕ^d (where ℕ is the set of nonnegative integers) let

[p] := p₁ + ⋯ + p_d,  D^p := ∂^{[p]}/∂x₁^{p₁} ⋯ ∂x_d^{p_d}.

For a function f on an open set U ⊂ ℝ^d having all partial derivatives D^p f of orders [p] ≤ β defined everywhere on U, let

||f||_α := ||f||_{α,U} := max_{[p]≤β} sup_{x∈U} |D^p f(x)| + max_{[p]=β} sup_{x≠y} |D^p f(x) − D^p f(y)|/|x − y|^{α−β}.

Here if 0 < α ≤ 1, so β = 0, D^0 f := f. Let I^d denote the unit cube {x ∈ ℝ^d : 0 ≤ x_j ≤ 1, j = 1, …, d}, and x^{(d)} := (x₁, …, x_{d−1}) for x ∈ ℝ^d. Let F ⊂ ℝ^d be a closed set which is the closure of its interior U. Let F_{α,K}(F) denote the set of all continuous f : F → ℝ with ||f||_{α,U} ≤ K.
For α = 1, F_{1,K}(F) is the set of bounded Lipschitz functions f on F with max(||f||_sup, ||f||_L) ≤ K. Then, recalling the bounded Lipschitz norm ||f||_BL := ||f||_L + ||f||_sup, we have

{f : ||f||_BL ≤ K} ⊂ F_{1,K}(F) ⊂ {f : ||f||_BL ≤ 2K}.

Let G_{α,K,d} := F_{α,K}(I^d), and for d ≥ 2 let C(α, K, d) be the class of subgraphs

J_f = J(f) = {x ∈ I^d : 0 ≤ x_d ≤ f(x^{(d)})},  f ∈ G_{α,K,d−1},  f ≥ 0.

If g and h are two functions defined for (small enough) y > 0, then g ≍ h (as y ↓ 0) means that

0 < liminf_{y↓0}(g/h)(y) ≤ limsup_{y↓0}(g/h)(y) < +∞.

Clearly, if f ∈ G_{α,K,d} and [p] < α then D^p f ∈ G_{α−[p],K,d}. Let B_d := {x ∈ ℝ^d : |x| < 1} and B̄_d := {x ∈ ℝ^d : |x| ≤ 1} be the open and closed unit balls, respectively, in ℝ^d. Here are some bounds on metric entropies, which
by Ossiander's theorem (7.2.1) will give that if α > d/2, G_{α,K,d} is a Donsker class for any law P, and C(α, K, d+1) is for laws with bounded densities.

8.2.1 Theorem  For 0 < K < ∞, 0 < α < ∞, and d ≥ 1, as ε ↓ 0,

log D(ε, G_{α,K,d}, d_sup) ≍ ε^{−d/α}.

For some T := T(α, K, d) and 0 < ε < 1,

log D(ε, C(α, K, d+1), h) ≤ Tε^{−d/α}.

If Q is a law on I^{d+1} having a density with respect to Lebesgue measure bounded by M, then for some M₁ = M₁(M, d, K, α),

log N_I(ε, C(α, K, d+1), Q) ≤ M₁ε^{−d/α},  0 < ε < 1.

The same statements all hold for B̄_d in place of I^d, and so for F_{α,K}(B̄_d) in place of G_{α,K,d}, with possibly larger constants T and M₁.
Notes. For α ≥ 1, in the statement about h, the order ε^{−d/α} is precise; see Corollary 8.2.13. The d_sup bound implies finiteness of L¹ bracketing numbers, so that G_{α,K,d} is a Glivenko–Cantelli class for any law, for any α > 0, K < ∞, and d, by the Blum–DeHardt theorem (7.1.5).
Proof  First, the proof will be done for the cube I^d. The upper bound for N_I follows from the one for d_sup by way of (7.1.2) and (7.1.3). The inequalities for classes of sets (in h and for N_I) follow directly from the one for functions in d_sup by (8.1.1) and (8.1.2). So only the bounds for d_sup need to be proved.

For each f ∈ G_{α,K,d}, x ∈ I^d, and x + h ∈ I^d write the Taylor series with remainder

(8.2.2)  f(x + h) − Σ_{k=0}^β Q_k(x, h) = R(x, h),

where for each x, Q_k(x, ·) is a homogeneous polynomial of degree k in h, and by the mean value theorem,

(8.2.3)  |R(x, h)| ≤ C|h|^α

for some constant C = C(d, K, α). Then C ≥ 1 will be taken large enough so
that whenever [p] ≤ β we also have

(8.2.4)  |R_p(x, h)| := |D^p f(x + h) − Σ_{k=0}^{β−[p]} Q_{k,p}(x, h)| ≤ C|h|^{α−[p]},

where for each k, p, and x, Q_{k,p}(x, ·) is also a homogeneous polynomial of degree k.
To prove one half of the first conclusion in Theorem 8.2.1, it needs to be shown that

(8.2.5)  limsup_{ε↓0} [log D(ε, G_{α,K,d}, d_sup)] ε^{d/α} < ∞.

Given 0 < ε < 1, let Δ := (ε/(4C))^{1/α} < 1. Let x^{(1)}, …, x^{(s)} be a Δ/2-net in I^d, that is, sup{inf_{j≤s} |x − x^{(j)}| : x ∈ I^d} ≤ Δ/2. Here we can take s ≤ M₂ε^{−d/α} for some constant M₂ = M₂(d, C); specifically, one can choose the x^{(j)} as the centers of the cubes of a decomposition of I^d into cubes of side 1/m, where m is the least integer ≥ d^{1/2}/Δ. For each multi-index p with [p] = k ≤ β let p! := ∏_{j=1}^d p_j!. Then for f ∈ G_{α,K,d} let Q^{(p)}(x, h) := (D^p f)(x)h^p/p!. Thus in (8.2.2),

(8.2.6)  Q_k(x, h) = Σ_{[p]=k} Q^{(p)}(x, h).

Let ε_k := ε/(2Δ^k e^d), k = 0, 1, …, and

A_{i,p} = A_{i,p}(f) := [D^p f(x^{(i)})/ε_{[p]}],  i = 1, …, s,  [p] = k ≤ β,

where [x] denotes the largest integer ≤ x for −∞ < x < ∞. Given some A := {A_{i,p} : i ≤ s, [p] ≤ β} let

G_{α,K,d}(A) := {f ∈ G_{α,K,d} : A_{i,p}(f) = A_{i,p} for all i ≤ s, [p] ≤ β}.
ga,K,d(A) := (f E Ga,K,d: Ai,p(f) = A1,p for all i < s, [p] 8.2.7 Lemma If f, g E cca,K,d(A) for some A then sup 11(f - g) (x) I
: x E id) < E.
Proof Let F := f - g. Whenever [p] < P and i < s, we will have (D'F)(x(t))I < E[P]. Also, by (8.2.4),
IDPF(x + h) - DPF(x)I <
2CIhIa-#
if [p] =
For each y E Id take an x(;) with I y - x(1) I < A /2. Then from (8.2.2), (8.2.3),
8.2 Spaces of differentiable functions and sets
255
and (8.2.6) with h := y - x(;),
2CIhla+sk T, IhPllp! k=0
[p]=k
0
2C0a + L EkEk I k=0
1/p! [p]=k
l
ed maxk<,8Ektk + E/2 < E, proving the lemma.
Now continuing the proof of Theorem 8.2.1, it follows that D(ε, G_{α,K,d}, d_sup) ≤ N_{α,K,d}, where N_{α,K,d} is the number of distinct nonempty sets G_{α,K,d}(A). Let the x^{(j)} be ordered so that for 1 < j ≤ s, |x^{(i)} − x^{(j)}| ≤ Δ for some i < j. Such an ordering clearly exists for d = 1. Then by induction on d, we enumerate sub-cubes beginning and ending with sub-cubes at vertices of I^d, where we first enumerate cubes on one face I^{d−1}, then on the adjoining level, and so on.

Now suppose f ∈ G_{α,K,d}(A) (so that this set is nonempty) and suppose given the values A_{i,r} for i < j, for some j ≤ s. Choose i < j such that |x^{(i)} − x^{(j)}| ≤ Δ. Take the Taylor expansion as in (8.2.2) with x = x^{(i)}, h = x^{(j)} − x^{(i)}. By (8.2.4) and (8.2.6),

|D^p f(x^{(j)}) − Σ_{k=0}^{β−[p]} Σ_{[q]=k} D^{p+q} f(x^{(i)}) h^q/q!| ≤ C|h|^{α−[p]}.
Now |D^{p+q} f(x^{(i)}) − A_{i,p+q} ε_{k+[p]}| ≤ ε_{[p+q]} for [q] = k = 0, 1, …, β − [p]. Let

D(j, f, p) := [D^p f(x^{(j)}) − Σ_{k=0}^{β−[p]} Σ_{[q]=k} A_{i,p+q} ε_{k+[p]} h^q/q!] / ε_{[p]}.

Then by the latter two inequalities, and since ε_{[p]+k}Δ^k = ε_{[p]},

|D(j, f, p)| ≤ (CΔ^{α−[p]} + Σ_{k=0}^{β−[p]} ε_{[p]+k} Δ^k Σ_{[q]=k} 1/q!)/ε_{[p]} ≤ 2Ce^dΔ^α/ε + e^d = e^d/2 + e^d = 3e^d/2.

So for f ∈ G_{α,K,d}, given the A_{i,r}(f) for i < j, there are at most 3e^d + 2 ≤ 4e^d possible values of A_{j,p} for a given p. The number of different p ∈ ℕ^d with
[p] ≤ β is bounded above by (β+1)^d. Thus, for j > 1, the number of possible sets of values {A_{j,p}}_{[p]≤β} is at most (4e^d)^{(β+1)^d}, while for j = 1, since |D^p f| ≤ K, it is at most (2(2K+1)e^d/ε)^{(β+1)^d}. Thus by Lemma 8.2.7,

N_{α,K,d} ≤ (2e^d(2K+1)(4e^d)^s/ε)^{(β+1)^d} ≤ exp((β+1)^d {(d + log 4)M₂ε^{−d/α} + log(e^d(4K+2)/ε)}) ≤ exp(Jε^{−d/α})

for some J = J(d, α, K) not depending on ε, proving (8.2.5). In the other direction we need to prove

(8.2.8)  liminf_{ε↓0} [log D(ε, G_{α,K,d}, d_sup)] ε^{d/α} > 0.
Let f be a C^∞ function on ℝ^d, 0 outside I^d and positive on its interior, such as f(x) := ∏_{j=1}^d g(x_j), where

(8.2.9)  g(t) := exp(−1/t) exp(−1/(1−t)) for 0 < t < 1, and g(t) := 0 otherwise.
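The bump function g of (8.2.9) is easy to check numerically; a minimal sketch (its maximum e^{−4}, attained at t = 1/2, matches the value s = e^{−4d} for f noted in the proof):

```python
import math

def g(t):
    """The C-infinity bump of (8.2.9): exp(-1/t) * exp(-1/(1-t)) on (0,1), 0 elsewhere."""
    if 0 < t < 1:
        return math.exp(-1.0 / t) * math.exp(-1.0 / (1.0 - t))
    return 0.0
```

Since g(t) = exp(−(1/t + 1/(1−t))) and 1/t + 1/(1−t) is minimized at t = 1/2, the maximum of g is g(1/2) = e^{−4}.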
For m = 1, 2, …, decompose I^d into m^d sub-cubes A_{mi} of side 1/m, i = 1, …, m^d. Let x^{(i)} be the vertex of A_{mi} closest to the origin, in other words the vertex of A_{mi} at which all coordinates are smallest. Given α > 0, set f_i(x) := m^{−α} f(m(x − x^{(i)})). Note that x ↦ m(x − x^{(i)}) is a 1–1 affine map of ℝ^d onto itself, taking the interior of A_{mi} onto that of I^d. Thus f_i(x) > 0 if and only if x is in the interior of A_{mi}. Let s := sup_x f(x) (= e^{−4d} for our f). For any S ⊂ {1, …, m^d} let f_S := Σ_{i∈S} f_i. Then ||f_S||_α ≤ ||f||_α =: B, while for S ≠ T, sup_x |(f_S − f_T)(x)| = m^{−α}s. Thus for ε_m := Km^{−α}s/(3B) we have D(ε_m, G_{α,K,d}, d_sup) ≥ 2^{m^d} ≥ exp(Cε_m^{−d/α}) for some C = C(K, d, α, s) not depending on m. Since ε_{m+1}/ε_m → 1 as m → ∞, this is enough to prove (8.2.8) and so to finish the proof of Theorem 8.2.1 for I^d.

Now it will be shown how to adapt the proof to the ball B̄_d in place of I^d. In S^{d−1} := {x ∈ ℝ^d : |x| = 1}, for 0 < ε ≤ 1, take D(ε, S^{d−1}, e) points at distances > ε apart, where e is the Euclidean metric on ℝ^d. The balls of radius
ε/2 with centers at these points are disjoint. Thus by volumes, then by the mean value theorem,

D(ε, S^{d−1}, e)(ε/2)^d ≤ (1 + ε/2)^d − (1 − ε/2)^d ≤ dε(1 + ε/2)^{d−1} ≤ dε(3/2)^{d−1},

and so D(ε, S^{d−1}, e) ≤ 2d(3/ε)^{d−1}. Take a maximal set S in S^{d−1} of points at distances > ε/2 apart, with |S| ≤ 2d(6/ε)^{d−1}. Let W consist of 0 and all points tx for x ∈ S and t = jε/2, j = 1, 2, …, [2/ε]. Then W is ε-dense in B̄_d and

|W| ≤ 1 + 2d(2/ε)(6/ε)^{d−1} ≤ 4d(6/ε)^d.
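The packing bound D(ε, S^{d−1}, e) ≤ 2d(3/ε)^{d−1} can be compared with an actual greedy packing in the case d = 2; an illustrative sketch (the greedy construction and grid resolution are my own choices, not from the text):

```python
import math

def greedy_packing_on_circle(eps, candidates=10000):
    """Greedily select points on the unit circle S^1 that are pairwise more than
    eps apart in the Euclidean metric; the result is an eps-packing, whose size
    is controlled by the volume bound 2d(3/eps)^{d-1} with d = 2."""
    pts = []
    for k in range(candidates):
        theta = 2 * math.pi * k / candidates
        p = (math.cos(theta), math.sin(theta))
        if all(math.dist(p, q) > eps for q in pts):
            pts.append(p)
    return pts
```

For ε = 0.5 the greedy packing has about a dozen points, comfortably below the bound 2·2·(3/0.5) = 24.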
Then, starting at 0 and moving outward along each segment {tx : 0 ≤ t ≤ 1}, x ∈ S, through points in W, one can do the same proof as for (8.2.5) in I^d, except for larger constants J, M₁. For a lower bound of the form (8.2.8), note that B̄_d includes a cube of side d^{−1/2} centered at 0. This finishes the proof of Theorem 8.2.1. Next, some lower bounds for metric entropies in the L¹ norm will be given.
For a collection F ⊂ ℒ¹(X, A, P), we use the L¹ distance d_{1,P}(f, g) := P(|f − g|).

8.2.10 Theorem  Let P be a law on I^d having a density with respect to Lebesgue measure bounded below by γ > 0. Then for some C = C(γ, α, K, d) > 0 and 1 ≤ r < ∞,

N_{[]}^{(r)}(ε, G_{α,K,d}, P) ≥ N_{[]}^{(1)}(ε, G_{α,K,d}, P) ≥ D(ε, G_{α,K,d}, d_{1,P}) ≥ exp(Cε^{−d/α})

for ε small enough, and if d ≥ 2, for small enough ε > 0 and M := C(γ, α, K, d−1),

N_I(ε, C(α, K, d), P) ≥ D(ε, C(α, K, d), d_P) ≥ exp(Mε^{−(d−1)/α}).

Proof  The following combinatorial fact will be used:
8.2.11 Lemma  Let B be a set with n elements, n = 0, 1, …. Then there exist subsets E_i ⊂ B, i = 1, …, k, where k ≥ e^{n/6}, such that for i ≠ j, the symmetric difference E_i Δ E_j has at least n/5 elements.

Proof  For any set E ⊂ B, the number of sets F ⊂ B such that card(E Δ F) ≤ n/5 is 2^n B(n/5, n, 1/2), where binomial probabilities B(k, n, p) are as defined before the Chernoff–Okamoto inequalities (1.3.8). If S_n is the sum of n independent Rademacher variables X_i taking values ±1 with probability 1/2 each, then by one of Hoeffding's inequalities (1.3.5), defining "success" as X_i = −1,

B(n/5, n, 1/2) = Pr(S_n ≥ 3n/5) ≤ exp(−9n/50) ≤ e^{−n/6}.

This implies the lemma.
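The Hoeffding step B(n/5, n, 1/2) ≤ e^{−n/6} can be checked against the exact binomial tail; a small sketch (the instance n = 50 is my own choice):

```python
import math

def binom_tail(m, n):
    """B(m, n, 1/2) = P(Bin(n, 1/2) <= m), the binomial probability of Lemma 8.2.11."""
    return sum(math.comb(n, k) for k in range(m + 1)) / 2 ** n
```

For n = 50 the exact tail at n/5 = 10 is about 1.2·10^{−5}, well below exp(−9n/50) ≈ 1.2·10^{−4} and exp(−n/6) ≈ 2.4·10^{−4}.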
Now to prove Theorem 8.2.10, the first inequality follows from (7.1.2) and the second is also straightforward. For the lower bound on D, let us use again the construction in the proof of (8.2.8). Let λ denote Lebesgue measure on I^d and δ := γ ∫ f dλ for the f in (8.2.9). Then for each i, ∫ f_i dP ≥ δm^{−α−d}. Applying Lemma 8.2.11 to B := {1, …, m^d} gives at least e^{m^d/6} sets S with card(S Δ T) ≥ m^d/5 for S ≠ T, and then ∫ |f_S − f_T| dP = ∫ f_{SΔT} dP ≥ δm^{−α}/5. So D(δm^{−α}/6, G_{α,K,d}, d_{1,P}) ≥ exp(m^d/6). Thus if 0 < ε ≤ δ/6, since [x] ≥ x/2 for x ≥ 1,

D(ε, G_{α,K,d}, d_{1,P}) ≥ exp([(δ/(6ε))^{1/α}]^d/6) ≥ exp(2^{−d}(δ/(6ε))^{d/α}/6),

proving the statement about G_{α,K,d}. If C = J(f) and D = J(g) then

d_P(C, D) := P(C Δ D) ≥ γλ(C Δ D) = γ ∫ |f − g| dλ,

so for ε > 0,

N_I(ε, C(α, K, d), P) ≥ D(ε, C(α, K, d), d_P) ≥ D(ε/γ, G_{α,K,d−1}, d_{1,λ}),

which finishes the proof of Theorem 8.2.10. To get lower bounds for the Hausdorff metric, the following will help:
8.2.12 Lemma  If α ≥ 1 and f, g ∈ G_{α,K,d}, then

h(J_f, J_g) ≥ d_sup(f, g)/(2 max(1, Kd)).

Proof  Note that for α ≥ 1, any g ∈ G_{α,K,d} is Lipschitz in each coordinate by the mean value theorem, with |g(x) − g(y)| ≤ K|x − y| if x_j = y_j for all but one value of j. In the cube, one can go from a general x to a general y by changing one coordinate at a time, so g is Lipschitz with ||g||_L ≤ Kd and Lemma 8.1.3 applies.
8.2.13 Corollary  If α ≥ 1 and d = 1, 2, …, then as ε ↓ 0,

log D(ε, C(α, K, d+1), h) ≍ ε^{−d/α}.

Proof  This follows from Lemma 8.2.12 and the first and third statements in Theorem 8.2.1.
8.2.14 Remark.  For m = 1, 2, …, let I^d be decomposed into a grid of m^d sub-cubes of side 1/m. Let E be the set of centers of the cubes. For any A ⊂ I^d let B ⊂ E be the set of centers of the cubes in the grid that A intersects. Then h(A, B) ≤ d^{1/2}/(2m), which includes the possibility that A = B = ∅. For 0 < ε ≤ 1 there is a least m = 1, 2, … such that d^{1/2}/m ≤ ε, namely m = ⌈d^{1/2}/ε⌉. It follows that

D(ε, 2^{I^d}, h) ≤ 2^{(1+d^{1/2}/ε)^d}.

Hence for α < d/(d+1), Corollary 8.2.13 cannot hold, nor can the upper bound for h in Theorem 8.2.1 be sharp.
The classes C(α, K, d) considered so far contain sets with flat faces except for one curved face. There are at least two ways to form more general classes of sets with piecewise differentiable boundaries, still satisfying the bounds in Theorem 8.2.1. One is to take a bounded number of Boolean operations. Let v₁, …, v_k be nonzero vectors in ℝ^d where d ≥ 2. For constants c₁, …, c_k let H_j := {x ∈ ℝ^d : (x, v_j) = c_j}, a hyperplane. Let π_j map each x ∈ ℝ^d to its nearest point in H_j, π_j(x) := x − ((x, v_j) − c_j)v_j/|v_j|². Let T_j be a cube in H_j and α, K > 0. Let f_j be a linear transformation taking T_j onto I^{d−1}. For g ∈ G_{α,K,d−1}, let

J_j(g) := {x ∈ ℝ^d : π_j(x) ∈ T_j, c_j ≤ (v_j, x) ≤ c_j + g(f_j(π_j(x)))},

and let C_j := {J_j(g) : g ∈ G_{α,K,d−1}, g ≥ 0}. Then Theorem 8.2.1 implies that if 𝕂 is a compact set in ℝ^d (e.g., a cube) including all sets in C_j, and P is a law on 𝕂 having bounded density with respect to Lebesgue measure λ, then for some M_j < ∞,

log D(ε, C_j, d_P) ≤ log N_I(ε, C_j, P) ≤ M_j ε^{−(d−1)/α}.

By Theorem 8.1.4 we then have the following:
8.2.15 Theorem  Let d ≥ 2 and let C₀ := ⊓_{j=1}^k C_j or C₀ := ⊔_{j=1}^k C_j, for C_j as just defined. Then for some M < ∞,

log D(ε, C₀, d_P) ≤ log N_I(ε, C₀, P) ≤ Mε^{−(d−1)/α}.
By intersections or unions of k sets in classes C_j (with k depending on d), one can obtain sets with smooth boundaries (through order α) such as ellipsoids; see Problem 5. One can also get more general sets since, for example, for α > 1, the minimum or maximum of two functions in G_{α,K,d} need not have first derivatives everywhere and then will not be in G_{γ,K,d} for any γ > 1 and K < ∞.

Recall that a C^∞ real-valued function on an open set in ℝ^d is one such that the partial derivatives D^p f exist for all p ∈ ℕ^d and are continuous. For a function f := (f₁, …, f_k) into ℝ^k to be C^∞ means that each f_j is. Another way to generate sets with boundaries differentiable of order α is as follows. The unit sphere S^{d−1} := {x ∈ ℝ^d : |x| = 1} is a C^∞ manifold, specifically as follows. S^{d−1} is the union of two sets

A := {x ∈ S^{d−1} : x₁ ≥ −1/2} and C := {x ∈ S^{d−1} : x₁ ≤ 1/2}.

There is a one-to-one C^∞ function ψ from {x ∈ ℝ^{d−1} : |x| < 9/8} into ℝ^d, with derivative matrix (∂ψ_i/∂x_j)_{i,j} of maximum rank d − 1 everywhere, such that ψ takes B̄_{d−1} := {x ∈ ℝ^{d−1} : |x| ≤ 1} onto A. Let η(y) := (−ψ₁(y), ψ₂(y), …, ψ_d(y)). Then the above statements for ψ and A also hold for η and C.

For 0 < α, K < ∞ let F_{α,K}(S^{d−1}) be the set of functions h : S^{d−1} → ℝ such that h∘ψ and h∘η ∈ F_{α,K}(B̄_{d−1}), recalling that f∘g(y) := f(g(y)). Let F^{(d)}_{α,K}(S^{d−1}) be the set of functions h = (h₁, …, h_d) into ℝ^d such that h_j ∈ F_{α,K}(S^{d−1}) for each j = 1, …, d.

Two continuous functions F, G from one topological space X to another, Y, are called homotopic if there exists a jointly continuous function H from X × [0, 1] into Y such that H(·, 0) = F and H(·, 1) = G. H is then called a homotopy of F and G. Let I(F) be the set of all y ∈ Y, not in the range of F, such that among mappings of X into Y \ {y}, F is not homotopic to any constant map G(x) ≡ z ≠ y. For a function F let R(F) := ran(F) := range(F) and C(F) := I(F) ∪ R(F). For example, if F is the identity from S^{d−1} onto itself in ℝ^d, then I(F) = {y : |y| < 1} by well-known facts in algebraic topology; see, for example, Eilenberg and Steenrod (1952, Chapter 11, Theorem 3.1).
Let I(d, α, K) := {I(F) : F ∈ F^{(d)}_{α,K}(S^{d−1})} and 𝒦(d, α, K) := {C(F) : F ∈ F^{(d)}_{α,K}(S^{d−1})}. Then I(d, α, K) is a collection of open sets and 𝒦(d, α, K) of compact sets, each of which, in a sense, has boundaries differentiable of order α. (For functions F that are not one-to-one, the boundaries may not be differentiable in some other senses.) For 𝒦(d, α, K) and, to some extent, for I(d, α, K) there are bounds as for other classes of sets with α times differentiable boundaries (Theorem 8.2.15):
8.2.16 Theorem  For each d = 2, 3, …, K ≥ 1, and α ≥ 1:

(a) There is a constant H_{d,α,K} < ∞ such that for 0 < ε ≤ 1 and the Hausdorff metric h,

log D(ε, 𝒦(d, α, K), h) ≤ H_{d,α,K}/ε^{(d−1)/α}.

(b) For any S < ∞ there is a constant A_{d,α,K,S} < ∞ such that for any law P on ℝ^d having density with respect to λ_d bounded above by S, for 0 < ε ≤ 1,

max(log N_I(ε, 𝒦(d, α, K), P), log N_I(ε, I(d, α, K), P)) ≤ A_{d,α,K,S}/ε^{(d−1)/α}.
The proof will follow from a sequence of lemmas.
8.2.17 Lemma  If H is a homotopy of F and G, then I(F) Δ I(G) ⊂ R(H).

Proof  Suppose y ∈ I(F) \ I(G) and y is not in the range of H. Then F and G are homotopic as maps into Y \ {y}. Homotopy is clearly a transitive relation. Since G is homotopic to a constant map into Y \ {y}, so is F, a contradiction; the case y ∈ I(G) \ I(F) is symmetric. The lemma is proved.
8.2.18 Lemma  If F is a continuous map from a compact Hausdorff topological space K into a Hausdorff space Y, then C(F) is closed.

Proof  If y ∉ C(F), then there is a homotopy H of F to a constant map G(x) ≡ z into Y \ {y}. Then R(H) is compact, thus closed (RAP, Theorem 2.2.3 and Proposition 2.2.9). Clearly I(G) = ∅, so by the previous lemma, I(F) ⊂ R(H). Then y is in the open set Y \ R(H) ⊂ Y \ C(F), so Y \ C(F) is open and C(F) is closed.

8.2.19 Lemma  If F is a continuous map from a compact Hausdorff topological space into some ℝ^d, then I(F) is open, and its boundary is included in R(F).

Proof  Let x ∈ I(F). Then for some δ > 0, d(x, y) < 2δ implies y ∉ R(F). Let d(x, y) < δ. Then there is a homeomorphism g of ℝ^d which leaves
{u : (u - xl > 28}, and so the values of F, fixed and takes x to y. T o define such a g we can assume x = 0 and 8 = 1/2. Let g(u) := a for Iul > 1 and g(u) := u + y(1 - Jul) for Jul < 1. Then g is the identity for Jul > 1 and is continuous, with g(0) = y. Also, gis 1-1 since Ig(u)l < 1 for Jul < 1,
andifg(u)=g(v)with lul,lvl < 1, then u-v=y(lul-lvl)andlu-vl < I(lul - Ivl)I/2 < lu - v1/2, so u = v. Thus Y E I(F), so I(F) is open. Since C(F) is closed by Lemma 8.2.18, it follows that the boundary of I (F) is included in R(F). Recall that for a metric space (S, d), a set A C S and 8 > 0, the 8-interior of A is defined by S A := Ix: d (x, y) < 8 implies y E A), and the S-neighborhood
by AS := {y: d(x, y) < S for some x E A}. 8.2.20 Lemma For continuous functions F, G from Sd-1 into Rd, if
ds p(F, G) := sup {I F(u) - G(u)J: u E Sd-1 } < 8, then
sI(F) C I(G) C C(G) C C(F)a. Proof If x E E 1 (F) and x E R(G), then d (x, y) < 8 for some y E R(F), so
y V I(F), a contradiction. Thus x V R(G). For 0< t < 1 and u E Sd-1 let H(u, t) := (1 - t)F(u) + tG(u). Then H is a homotopy of F and G, and R(H) C R(F)s, but I (F) n R(F) = 0, so x R(H). Thus by Lemma 8.2.17, x E I (G).
Next, let y E C(G). If Y E R(G) then y E R(F)s C C(F)s. Otherwise y E I (G). Then y E I (F) or by Lemma 8.2.17, y E R(H) C R(F)S.
8.2.21 Lemma For any continuous function F from a compact Hausdorff space K into Rd, C(F)8 \ sI (F) C R(F)s.
Proof Let x E C(F)' \ SI(F). Suppose d(x, R(F)) > S. Then Ix - yl < 8 for some y E I(F). Since the boundary of I(F) is included in R(F) by Lemma 8.2.19, the line segment {tx + (1 - t)y: 0 < t < 11 C I(F). It follows that x E I(F) and then likewise that z E I(F) whenever Ix - z1 < S. Thus x E SI(F), a contradiction. The Lipschitz seminorm II FII z, is defined for functions with values in R' just
as for real-valued functions. Let vk be the Lebesgue volume of the unit ball in Rk.
8.2 Spaces of differentiable functions and sets
263
8.2.22 Lemma For k = 1, 2, , if (T, d) is a metric space, 8 > 0, for some M < oo, D(8, T, d) < M81-x, and F is Lipschitz from T into 1[8x, with II FII L < K, then Ad
(C(F)s \ sI (F)) < vkM(K + 2)x8.
Proof For the usual metric e on 1[8x, we have D(K8, R(F), e) < It follows that D((K + 2)8, R(F)s, e) < M81-x. Lemma 8.2.21 gives the M81-k
conclusion.
Proof of Theorem 8.2.16. By the definitions, a function F E F(l) a, K (Sd-1) is given by a pair F(1), F(2) of functions F(j) :_ (F(j)1, F(j)d) where each
F(f)i E Fa,K(Bd-1) and Bj is the closed unit ball in R. Since a > 1, each F(j), is Lipschitz with II F(j)i II L < K, so each F(j) is Lipschitz with 11F(j)ft < M. Let T := T1 U T2 be a union of two disjoint copies Ti of Bd_1, with the Euclidean metric eon each and e(x, y) := 2 forx E Ti, y E Tj, i j. Letting G := F(j) on Tj, j = 1, 2, gives a function G := GF on T with (8.2.23)
IIGIIL < maxj IIF(j)IIL < dK.
Let 0 < s < 1. Then there are D(s, B, e) disjoint balls of radius s/2, included in a ball of radius 1 + (s/2). It follows by volumes that (8.2.24)
D(s, Bj,e) < [(2 + s)/2]i(2/s)i < (3/e)j.
By Theorem 8.2.1 for the ball case, for any K > 1, d > 2, and a > 1, there is
aC=C(K,d,a) < oo such that for 0 < 8 < 1, log D(8, Fa,K(Bd-1), dsup) <
C/8(d-1)/a.
It follows from the definitions with T := T1 U T2 that
log D(8,
.Fa,K(Sd-1psup)
<
2C/6(d-1)/a
and thus that for 0 < 8 < 1, log D(8, .Tadk(Sd-1dsuP < set
2C(d/8)(d-1)/a.
Given 0 < 8 < 1, take a
of functions f1, , fm E .Ta,K(Sd-1) with dsup(f , f) > 8/2 for i j and maximal m where m < and i d := 2C (2d)(d-1)/a. The brackets [sI (f ), C(f)s] for j = 1, ,m cover I (a, K, d) and 1C(a, K, d) by Lemma 8.2.20 (some sets s I (f) may be empty). If f) < 8/2 then by Lemma 8.2.20, C(g) C C(f )s and C(fj) C C(g)s,soh(C(g), C(f )) < 8andpart (a) of Theorem 8.2.16follows. Then Lemma 8.2.22 applies with M = 2 3d-1 by (8.2.24) and K = Kd by exp(Pd/8(d-1)/a)
264
Approximation of Functions and Sets
(8.2.23). It also holds for P in place of Ad with an additional factor of
Theorem 8.2.16(b) then follows with 8 := s/[2vd 3d-1(Kd + 2)d] and so
Ad,a,K, := I8d[2W 3d-1(Kd
+2)d](d-1)/a.
8.2.25 Corollary For any law P on Rd, d > 2, having bounded density with respect to Lebesgue measure, and K < oo,
(a) (Tze-Gong Sun) I (d, a, K) is a Donsker class for P if a > d - 1; (b) I (d, a, K) is a Glivenko-Cantelli class for P whenever a > 1.
Proof Apply 8.2.16(b) and, for part (a), Corollary 7.2.13; for part (b), the Blum-DeHardt theorem (7.1.5).
8.3 Lower layers A set B C Rd is called a lower layer if and only if for all x = (xl, , xd) E Bandy= (yl, , yd) with yj < xjfor j = 1, , d, we have y E B. Let LGd denote the collection of all nonempty lower layers in Rd with nonempty complement. Let 0 be the empty set and let
GGd,l := {L fl Id : L E GGd, L fl Id # 0}. Let ,l := )4 denote Lebesgue measure on Id. Recall that f
g means f/g -. 1. The size of GCd,l will be bounded first when d = 1 and 2. Let [xl be the smallest integer > x.
8.3.1 Theorem Ford = 1, D(e, GG1,1, h) = D(e,,Cr1, d)j = Nr(s, L 121 1, ),) = [1/c]. Ford = 2, any m = 1, 2,
, and 0 < t < 21/2/m, we have
max (Nj(2/m, GG2, A) , D(2 1/2/M, GG2,1, h))
< (2m - 2)
m-1
D(t GG2,1, h)
For 0 < s < 1, N, (E, GG2,1, )'I) < 42/E and
D(s, GG2,1, h) < exp ((21/2log4)
/s).
8.3 Lower layers
265
Proof For d = 1, sets in LL_{1,1} are intervals [0, t), 0 < t ≤ 1, or [0, t], 0 ≤ t ≤ 1. For any ε with 0 < ε ≤ 1, let m := m(ε) := ⌈1/ε⌉. Then the collection of m brackets [[0, (k-1)ε], [0, kε]], k = 1, ..., m, covers LL_{1,1} with minimal m for ε, showing that N_I(ε, LL_{1,1}, λ) = m. For any δ > 0, the points k(ε + δ) for k = 0, 1, ..., ⌊1/(ε + δ)⌋, are at distances at least ε + δ apart. For 0 ≤ x < y ≤ 1, h([0, x], [0, y]) = d_λ([0, x], [0, y]) = y - x, so
D(ε, LL_{1,1}, h) = D(ε, LL_{1,1}, d_λ) = D(ε, [0, 1], d) =: D(ε)
for the usual metric d. Letting δ ↓ 0 gives D(ε) = ⌈1/ε⌉ = m, finishing the proof for d = 1. For d = 2, decompose the unit square I² into a union of m² squares
S_{ij} := [(i-1)/m, i/m) × [(j-1)/m, j/m), i, j = 1, ..., m,
but for i = m or j = m, replace "i/m)" or "j/m)" respectively by "1]". For any L ∈ LL_{2,1}, let mL be the union of the squares in the grid included in L and let Lm be the union of the squares which intersect L. Then mL ⊂ L ⊂ Lm and both mL and Lm are in LL_{2,1} ∪ {∅}.
For each m and each function f from {2, 3, ..., 2m-1} into {0, 1} taking the value 1 exactly m-1 times, define a sequence S(f)(k), k = 1, ..., 2m-1, of squares in the grid as follows. Let S(f)(1) be the upper left square S_{1m}. Given S(f)(k-1) = S_{ij}, let S(f)(k) be the square S_{i+1,j} just to its right if f(k) = 1, otherwise the square S_{i,j-1} just below it, for k = 2, ..., 2m-1. Then S(f)(2m-1) is always the lower right square S_{m1}. Let B_m(f) := ∪_{k=1}^{2m-1} S(f)(k). Let A_m(f) be the union of the squares below and to the left of B_m(f), and C_m(f) := A_m(f) ∪ B_m(f). Here A_m(f) and C_m(f) belong to LL_{2,1} ∪ {∅}. Also, if f ≠ g then h(C_m(f), C_m(g)) ≥ 1/m. Let L ∈ LL_{2,1} ∪ {∅}. Let L̄ be its closure and M := M_L := L̄ ∪ {(0, y) : 0 ≤ y ≤ 1} ∪ {(x, 0) : 0 ≤ x ≤ 1}.
Let G be the boundary of M; for -1 ≤ t ≤ 1, G meets the line {(u, w) : u - w = t} in a single point g(t) = (x(t), y(t)), and y(·) is nonincreasing in t. For s ≤ t,
x(t) = y(t) + t ≤ y(s) + t = x(s) + t - s,
so x(·) is a Lipschitz function with ||x||_L ≤ 1. Likewise ||y||_L ≤ 1. In particular, x(·) and y(·) are continuous and the curve G is connected. We have x(-1) = 0, y(-1) = 1, x(1) = 1, and y(1) = 0. For t near -1, g(t) ∈ S_{1m}. If G intersects a square S_{ij} for i < m or j > 1 then it next
intersects, as t increases, one of the squares S_{i+1,j} or S_{i,j-1}. (It cannot go directly to S_{i+1,j-1} since the upper left vertex of that square is in S_{i+1,j}.) For t near 1, g(t) ∈ S_{m1}. Thus for some f as above, there is a sequence of squares S(f)(1) = S_{1m}, ..., S(f)(2m-1) = S_{m1} intersected by G, and no other squares S_{ij} are. It follows that
A_m(f) ⊂ mL ⊂ L ⊂ Lm ⊂ C_m(f),
so that Lm \ mL ⊂ B_m(f), and λ(B_m(f)) = (2m-1)/m² < 2/m, while h(A_m(f), C_m(f)) ≤ 2^{1/2}/m or A_m(f) = ∅. Also, {(0, 0)} is the smallest nonempty intersection of a lower layer with I², and using this set in place of A_m(f) if the latter is empty, which occurs for just one, say f_0, of the functions f, we have h({(0, 0)}, A) ≥ 2^{1/2}/m for any A ⊂ I² including S_{11}, and thus for C_m(f) for any of the functions f ≠ f_0. Since the number of such functions f is (2m-2 choose m-1), the first sentence of the Theorem for d = 2 is proved. Then since (2m-2 choose m-1) ≤ 4^{m-1}, the second sentence follows. □
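The combinatorics just used can be checked numerically. The sketch below (Python; the enumeration by index sets is a choice of this illustration) lists the staircase paths f from S_{1m} to S_{m1}, confirms that their number is the binomial coefficient (2m-2 choose m-1), and checks the bound (2m-2 choose m-1) ≤ 4^{m-1} used for the second sentence of the theorem.

```python
from itertools import combinations
from math import comb

def staircase_paths(m):
    """Enumerate the staircases of Theorem 8.3.1: a path of 2m-2 moves from
    the upper-left square S_{1,m} to the lower-right square S_{m,1} makes
    exactly m-1 'right' moves; encode a path by the set of indices k in
    {2, ..., 2m-1} where f(k) = 1 (a move to the right)."""
    moves = range(2, 2 * m)          # k = 2, ..., 2m-1
    return list(combinations(moves, m - 1))

for m in range(2, 8):
    paths = staircase_paths(m)
    assert len(paths) == comb(2 * m - 2, m - 1)
    # the bound used for the second sentence of the theorem:
    assert comb(2 * m - 2, m - 1) <= 4 ** (m - 1)
print("binomial counts and the 4^(m-1) bound check out for m = 2,...,7")
```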
For dimensions d > 2 we then have:
8.3.2 Theorem For each d ≥ 2, as ε ↓ 0,
log D(ε, LL_{d,1}, h) ≍ log D(ε, LL_d, d_λ) ≍ log D(ε, LL_{d,1}, d_λ) ≍ log N_I(ε, LL_{d,1}, λ) ≍ ε^{1-d}.
Proof First, for the Hausdorff metric h, it will be shown that for some constants c_d with 1 ≤ c_d < ∞,
(8.3.3) log D(ε, LL_{d,1}, h) ≤ c_d ε^{1-d}
for 0 < ε ≤ 1. For d = 2 this holds by the previous theorem. It will be proved for d > 2 by induction on d. Suppose it holds for d - 1, for d ≥ 3. Given 0 < ε ≤ 1, take a maximal number of sets L_1, ..., L_m ∈ LL_{d-1,1} such that h(L_i, L_j) > ε/4 for i ≠ j, where m ≤ exp(c_{d-1}(4/ε)^{d-2}). Let k := ⌈3/ε⌉ and A ∈ LL_{d,1}. For j = 1, ..., k let A_j := {x ∈ I^{d-1} : (x, j/k) ∈ A} and A_{(j)} := A_j × {j/k} ⊂ A. Then A_j = ∅ or A_j ∈ LL_{d-1,1}. In the latter case we can choose i := i(j, A) such that h(A_j, L_i) ≤ ε/4. Let L_0 := ∅ and i := i(j, A) := 0 if A_j = ∅, so h(A_j, L_i) = 0 ≤ ε/4 in that case also. Let A, B ∈ LL_{d,1} and suppose that i(j, A) = i(j, B) for j = 0, 1, ..., k-1. It will be shown that h(A, B) ≤ ε. Let x ∈ A. There is a j = 0, 1, ..., k-1 such that j/k ≤ x_d ≤ (j+1)/k. Let y := (x_1, ..., x_{d-1}, j/k) ∈ A_{(j)}. Then A_j ≠ ∅ and h(A_j, B_j) ≤ ε/2, so for some z ∈ B_{(j)} ⊂ B, we have
|y - z| ≤ 2ε/3 and |x - z| ≤ k^{-1} + 2ε/3 ≤ ε. So d(x, B) ≤ ε and by symmetry, h(A, B) ≤ ε. Thus
D(ε, LL_{d,1}, h) ≤ (m+1)^k ≤ [exp(2c_{d-1}(4/ε)^{d-2})]^{4/ε} ≤ exp(c_d/ε^{d-1})
for c_d := 2·4^{d-1}c_{d-1}, so (8.3.3) is proved. For the metrics in terms of λ we have the following:
8.3.4 Lemma Let δ > 0 and let A, B ∈ LL_{d,1} with h(A, B) ≤ δ. Then
λ_d(AΔB) ≤ 3·d^{d/2}·δ.
Proof Let U be a rotation of R^d which takes v := (1, 1, ..., 1) into (0, 0, ..., 0, d^{1/2}). Let π_d(y) := (y_1, ..., y_{d-1}, 0). Let C be the cube C := U[I^d] := {U(x) : x ∈ I^d}. Each point of I^d or C is within d^{1/2}/2 of its respective center. Thus each point z ∈ H := π_d[C] is within d^{1/2}/2 of 0. Also, for any z ∈ R^{d-1}, {t ∈ R : (z, t) ∈ C} is empty or a closed interval h(z) ≤ t ≤ j(z). Let C_z := {(w, t) ∈ C : w = z}, a line segment. The intersections of U[A] and U[B] with C_z are each either empty or line segments with the same lower endpoint (z, h(z)) as C_z, so the two sets are linearly ordered by inclusion. Thus the intersection of U[A]ΔU[B] = U[AΔB] with C_z is some line segment S_{A,B,z}. It will be shown that S_{A,B,z} has length ≤ 3d^{1/2}δ. Suppose not. Then by symmetry we can assume that there is some η > δ and a point x ∈ B \ A such that w := x + η(1, 1, ..., 1) ∈ B. The orthant O := {y : y_j ≥ x_j for all j = 1, ..., d} is disjoint from A. But the open ball of radius η and center w is included in O, contradicting h(A, B) ≤ δ. Now, H is included in a cube of side d^{1/2} in R^{d-1} with center at 0, so by the Tonelli-Fubini theorem,
λ_d(AΔB) ≤ (3d^{1/2}δ)(d^{1/2})^{d-1} = 3δ·d^{d/2},
proving the lemma.
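For d = 2 the lemma says λ_2(AΔB) ≤ 3·2·δ = 6δ whenever h(A, B) ≤ δ. A minimal numerical illustration (Python; the two particular lower layers, the grid resolution n, and the grid approximation of the Hausdorff distance are choices of this sketch, not from the text):

```python
import math

n = 40                            # grid resolution (a choice of this sketch)
f = lambda x: max(0.0, 0.8 - x)   # boundary of lower layer A = {y <= f(x)}
g = lambda x: max(0.0, 0.7 - x)   # boundary of lower layer B (so B is inside A)

pts = [((i + 0.5) / n, (j + 0.5) / n) for i in range(n) for j in range(n)]
A = [p for p in pts if p[1] <= f(p[0])]
B = [p for p in pts if p[1] <= g(p[0])]

# area of the symmetric difference = integral of |f - g| (Riemann sum)
area = sum(abs(f((i + 0.5) / n) - g((i + 0.5) / n)) for i in range(n)) / n

# Hausdorff distance approximated on the grid; since B is inside A,
# only the sup over points of A of the distance to B matters
def dist_to(p, S):
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in S)
h = max(dist_to(p, B) for p in A)

# Lemma 8.3.4 with d = 2: area(A delta B) <= 6 * delta, up to grid error
grid_err = 2 * math.sqrt(2) / n
assert area <= 6 * (h + grid_err)
print(f"area(A delta B) ~ {area:.4f}, h(A,B) ~ {h:.4f}, bound 6h ~ {6*h:.4f}")
```

Here area(AΔB) ≈ 0.075 while h(A, B) ≈ 0.1/√2, so the bound 6δ holds with room to spare; the constant 3d^{d/2} is not claimed to be sharp.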
Returning to the proof of Theorem 8.3.2, from the last lemma and (8.3.3) it follows that for each d = 2, 3, ..., and some C_d < ∞, for 0 < ε ≤ 1,
log D(ε, LL_d, d_λ) = log D(ε, LL_{d,1}, d_λ) ≤ C_d ε^{1-d}.
Next, consider the remaining upper bound, for N_I. The angle between v = (1, 1, ..., 1) and each hyperplane x_j = 0 is
cos^{-1}(((d-1)/d)^{1/2}) = sin^{-1}(d^{-1/2}) = tan^{-1}((d-1)^{-1/2}).
Thus for any nonempty lower layer B ⊂ R^d and point p on its boundary, U[B] includes the cone {x : |x^{(d)} - q^{(d)}| < (q_d - x_d)(d-1)^{-1/2}}, where x^{(d)} := (x_1, ..., x_{d-1}) and q := U(p). Hence the boundary of U[B] is the graph of a function f : R^{d-1} → R where for any s, t ∈ R^{d-1},
f(s) ≥ f(t) - K|s - t|, K := (d-1)^{1/2}.
Hence, interchanging s and t, |f(s) - f(t)| ≤ K|s - t|. So ||f||_L ≤ K. Let
J(f) := {x : -∞ < x_d < f(x^{(d)})}.
Thus for each B ∈ LL_{d,1} we have U[B] = J(f) ∩ U[I^d] for a function f = f_B on R^{d-1} with ||f||_L ≤ K. We can restrict the functions f to a cube T of side d^{1/2} centered at the origin in R^{d-1}, parallel to the axes, which includes the projection of U[I^d]. We can also assume that ||f_B||_sup ≤ d^{1/2} for each B, since replacing f by max(-d^{1/2}, min(f, d^{1/2})) doesn't change J(f) ∩ U[I^d], nor does it increase ||f||_L (RAP, Proposition 11.2.2(a), since ||g||_L = 0 if g is constant). Now, apply Theorem 8.2.1 for α = 1 and d-1 in place of d, where by a fixed linear transformation we have a correspondence between I^{d-1} and the cube T. Since f ≤ g implies J(f) ⊂ J(g) and J(f) ∩ U[I^d] ⊂ J(g) ∩ U[I^d], the bracketing parts of Theorem 8.2.1 imply the desired upper bound for log N_I(ε, LL_{d,1}, λ)
with -(d-1)/α = 1 - d. This finishes the proof for upper bounds.
Now for lower bounds, it will be enough to prove them for D(ε, LL_{d,1}, d_λ), in light of Lemma 8.3.4 and since N_I(ε, ·) ≥ D(2ε, ·). The angle between v = (1, 1, ..., 1) and each coordinate axis is
θ_d := cos^{-1}(d^{-1/2}) = sin^{-1}(((d-1)/d)^{1/2}) = tan^{-1}((d-1)^{1/2}).
Thus if f : R^{d-1} → R satisfies ||f||_L ≤ (d-1)^{-1/2}, then to see that L := U^{-1}(J(f)) is a lower layer, suppose not. For some x ∈ L and y ∉ L, x_i = y_i for all i except that y_j < x_j for some j. Then U transforms the line through x and y to a line ℓ forming an angle θ_d with the d-th coordinate axis. Writing ℓ as t_d = h(t^{(d)}), we have ||h||_L = cot θ_d = (d-1)^{-1/2}, which yields a contradiction. Recall (Section 8.2) that for δ > 0 and a metric space Q,
F_{1,δ}(Q) := {f : Q → R, max(||f||_L, ||f||_sup) ≤ δ}.
Let δ := (d-1)^{-1/2}d^{-1}. Let Q be a small enough cube in R^{d-1} with center at 0. Then for each f ∈ F_{1,δ}(Q), we have (1/2)v + U^{-1}(J(f)) ⊂ I^d. For such f, J(f) ↦ (1/2)v + U^{-1}(J(f)) is an isometry for h and for d_λ, and preserves
inclusion. Each f can be extended to R^{d-1}, preserving ||f||_L ≤ δ (RAP, Theorem 6.1.1). So the lower bound with d_sup in Theorem 8.2.1 gives, via Lemma 8.1.3, the lower bound with h in Theorem 8.3.2. Theorem 8.2.10, likewise adapted from I^{d-1} to Q, gives the lower bounds with λ, proving Theorem 8.3.2. □
8.4 Metric entropy of classes of convex sets

Let C_d denote the class of all nonempty closed convex subsets of the open unit ball B(0, 1) := {x : |x| < 1} in R^d. Let λ be Lebesgue measure on R^d. Upper and lower bounds will be given for the metric entropy of C_d for the metric d_λ and for the Hausdorff metric h.
8.4.1 Theorem (E. M. Bronstein) For each d ≥ 2 we have
log D(ε, C_d, d_λ) ≍ log D(ε, C_d, h) ≍ ε^{(1-d)/2} as ε ↓ 0.

Remark. For d = 1, C_1 is just the class of subintervals of the open interval (-1, 1). Then it's rather easy to see that D(ε, C_1, d_λ) ≍ D(ε, C_1, h) ≍ ε^{-2}. Before proving the theorem, let us prove from it:
8.4.2 Corollary In R^d, for d ≥ 2, if P is a law whose restriction to B(0, 1) has a bounded density f with respect to Lebesgue measure, then
(a) log N_I(ε, C_d, P) = O(ε^{(1-d)/2}) as ε ↓ 0;
(b) (E. Bolthausen) for d = 2, C_2 is a Donsker class for P;
(c) for any d, C_d is a Glivenko-Cantelli class for P;
(d) if also f ≥ v on B(0, 1) for some constant v > 0, then
log N_I(ε, C_d, P) ≍ ε^{(1-d)/2} as ε ↓ 0.
Proof Let 0 ≤ f(x) ≤ V < ∞ for all x. For B ⊂ R^d and δ > 0 recall that δB := {x : y ∈ B whenever |x - y| < δ} and B^δ := {x : d(x, B) < δ}. For a closed set B, we have δB = {x : y ∈ B whenever |x - y| ≤ δ}. Also, B^δ is always open and δB is always closed. We have δB ⊂ B ⊂ B^δ. For any two sets B, C, if h(B, C) < δ, then C ⊂ B^δ. It will be shown next that if B and C are closed and convex then also δB ⊂ C. If not, let x ∈ δB \ C. Take a closed half-space J ⊃ C with x ∉ J (RAP, Theorem 6.2.9). On the line through x
perpendicular to the hyperplane bounding J, on the side opposite J, there are points y ∈ B \ C^δ, a contradiction. So δB ⊂ C ⊂ B^δ. For d_λ, the following will help. For any set B, let ∂B denote the boundary of B.
8.4.3 Lemma For any d = 1, 2, ..., there is a K = K(d) < ∞ such that for any B ∈ C_d and 0 < δ ≤ 1, λ(B^δ \ δB) ≤ Kδ.

Proof We have
λ(B^ε) = λ(B) + C_1(B)ε + C_2(B)ε² + ... + C_d ε^d,
where the coefficients C_i are known as mixed volumes of B and B(0, 1) (e.g., Eggleston, 1958, pp. 82-89; Bonnesen and Fenchel, 1934, pp. 38, 49-50). Here C_1(B) = lim_{ε↓0}(λ(B^ε) - λ(B))/ε is the (d-1)-dimensional surface area of ∂B (Eggleston, 1958, p. 88), and C_d = λ(B(0, 1)). All the mixed volumes C_i(B) are nondecreasing functions of the convex set B (Eggleston, 1958, Theorem 42, p. 86; Bonnesen and Fenchel, 1934, p. 41). Thus, all the C_i(B) are maximized for B ⊂ B(0, 1) when B = B(0, 1). It follows that the derivative dλ(B^ε)/dε is bounded above uniformly for B ∈ C_d
and 0 < ε ≤ 1 by a constant K_2 = K_2(d), so that λ(B^ε \ B) ≤ K_2 ε for all B ∈ C_d and 0 < ε ≤ 1. Now suppose B has nonempty interior. Then B^δ \ δB ⊂ (∂B)^δ. For a set of points on ∂B more than δ apart, the balls of radius δ/2 with centers at the points are disjoint, and the outer halves of these balls, cut by support hyperplanes to B at the points, are outside of B. Thus for 0 < δ ≤ 1,
(δ/2)K_2 ≥ λ(B^{δ/2} \ B) ≥ (1/2)C_d(δ/2)^d D(δ, ∂B, ρ),
where ρ is the Euclidean distance. Thus
D(δ, ∂B, ρ) ≤ 2^d K_2 δ^{1-d}/C_d.
Then for 0 < δ ≤ 1,
λ((∂B)^δ) ≤ D(δ, ∂B, ρ) C_d (2δ)^d ≤ 4^d K_2 δ,
and the lemma follows. □
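For a disk B = B(0, R) ⊂ R² with R ≤ 1, both B^δ and δB are disks and the quantity in Lemma 8.4.3 can be computed exactly: λ(B^δ \ δB) = π((R+δ)² - (R-δ)²) = 4πRδ ≤ 4πδ, so K = 4π works for disks. A quick check (Python; the choice of radii is arbitrary):

```python
import math

def annulus_area(R, delta):
    """lambda(B^delta \\ deltaB) for a disk B(0, R) with 0 < delta <= R:
    B^delta = B(0, R + delta) and deltaB = B(0, R - delta)."""
    return math.pi * ((R + delta) ** 2 - (R - delta) ** 2)

for R in [0.2, 0.5, 1.0]:
    for delta in [0.01, 0.05, 0.1]:
        if delta <= R:
            area = annulus_area(R, delta)
            assert abs(area - 4 * math.pi * R * delta) < 1e-12
            assert area <= 4 * math.pi * delta + 1e-12  # K = 4*pi since R <= 1
print("lambda(B^d \\ dB) = 4*pi*R*d <= 4*pi*d for disks of radius <= 1")
```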
Now continuing the proof of Corollary 8.4.2, given 0 < δ ≤ 1, let N_d(δ) be a δ-net in C_d for h with cardinality at most exp(A_d δ^{(1-d)/2}), where A_d is a large enough constant. The brackets [δB, B^δ] for B ∈ N_d(δ) cover C_d: in other words, for any C ∈ C_d there is such a B with δB ⊂ C ⊂ B^δ, as seen just before Lemma 8.4.3, and λ(B^δ \ δB) ≤ Kδ, so P(B^δ \ δB) ≤ KVδ, and
N_I(ε, C_d, P) ≤ D(ε/(KV), C_d, h) ≤ exp(A_d(ε/(KV))^{(1-d)/2})
for ε small enough, proving part (a). Given part (a), part (b) follows from Corollary 7.2.13, and part (c) from the Blum-DeHardt theorem (7.1.5). For part (d), we have
N_I(ε, C_d, P) ≥ D(ε, C_d, d_P) ≥ D(ε/v, C_d, d_λ) ≥ exp(c_d(ε/v)^{(1-d)/2})
for some c_d > 0 and all ε small enough, which finishes the proof of Corollary 8.4.2 from Theorem 8.4.1. □

Now Theorem 8.4.1 will be proved. Let 0 < ε ≤ 1. For any set C ⊂ R^d and r > 0 let C^{r]} := {x ∈ R^d : d(x, C) ≤ r}. Then the open set C^r is included in the closed set C^{r]}, ∂C^r = ∂C^{r]}, a closed set, and h(C^r, C^{r]}) = 0.
8.4.4 Lemma For any C, D ∈ C_d and r > 0, h(C^{r]}, D^{r]}) = h(C, D); in other words, θ_r : E ↦ E^{r]} is an isometry for h.

Proof Let s > 0. It will be shown that D ⊂ C^s if and only if D^{r]} ⊂ (C^{r]})^s = C^{r+s}. "Only if" is straightforward. To prove "if," suppose not. Let a ∈ D \ C^s. There is a unique point q of C^{s]} closest to a: there is a nearest point q since C^{s]} is compact, and if b were another nearest point, then (q + b)/2 ∈ C^{s]} since C^{s]} is convex, and (q + b)/2 is nearer to a, a contradiction. (Possibly q = a.) Now q ∈ ∂(C^{s]}). If q = a, take a support hyperplane H to C^{s]} at q (RAP, Theorem 6.2.7). If q ≠ a, then the hyperplane H through q perpendicular to the line segment qa is a support hyperplane to C^{s]} at q (if there were a point c of C^{s]} on the same side of H as a, then on the line segment cq there would be a point of C^{s]} closer to a than q is, a contradiction). Let p be a point at distance r from a in the direction perpendicular to H and heading away from C^{s]}. Then p ∈ D^{r]} but p ∉ (C^{s]})^r = C^{r+s}, a contradiction. So "if" is proved. Since C and D can be interchanged, the lemma follows. □
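For Euclidean balls the Hausdorff distance has the closed form h(B(c_1, r_1), B(c_2, r_2)) = |c_1 - c_2| + |r_1 - r_2|, and θ_r just adds r to each radius, so the isometry of Lemma 8.4.4 can be verified directly in this special case (Python; balls stand in for general convex sets, an assumption of this sketch):

```python
import math, random

def hausdorff_balls(c1, r1, c2, r2):
    """h between two closed Euclidean balls: |c1 - c2| + |r1 - r2|."""
    return math.dist(c1, c2) + abs(r1 - r2)

random.seed(0)
for _ in range(1000):
    c1 = (random.uniform(-1, 1), random.uniform(-1, 1))
    c2 = (random.uniform(-1, 1), random.uniform(-1, 1))
    r1, r2, r = (random.uniform(0.1, 1) for _ in range(3))
    h0 = hausdorff_balls(c1, r1, c2, r2)
    hr = hausdorff_balls(c1, r1 + r, c2, r2 + r)  # apply theta_r to both sets
    assert abs(h0 - hr) < 1e-12                   # theta_r is an h-isometry
print("theta_r preserved the Hausdorff distance in all 1000 random trials")
```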
For r > 0, θ_r is a useful smoothing, as it takes a convex set D, which may have a sharply curved boundary (vertices, edges, etc.), to a convex set D^{r]}
whose boundary is no more curved than a sphere of radius r, and so will be easier to approximate. Now, for a given C ∈ C_d and ε > 0 let
N_ε(C) := {D ∈ C_d : h(C, D) ≤ ε},
N'_ε(C) := {D ∈ N_ε(C) : C ⊂ D}.
For any convex C, θ_{2+ε} is an isometry from N_ε(C) into N'_{2ε}(C^{2]}). A sequence of lemmas will be proved. Here |·| will denote the usual Euclidean norm on R^d. Let C_{d,1,3} denote the class of all closed convex sets C in R^d such that B(0, 1) ⊂ C ⊂ B(0, 3). Note that if C ∈ C_d, then clearly C^{2]} ∈ C_{d,1,3}.
8.4.5 Lemma Let E ∈ C_{d,1,3}. Let x ∈ ∂E and let H be a support hyperplane to E at x. Let p be the point of H closest to 0. Then |p| ≥ 1 and ∠0xp ≥ sin^{-1}(1/3).

Proof Existence of support hyperplanes is proved in RAP (Theorem 6.2.7). Clearly H is disjoint from B(0, 1), so |p| ≥ 1. Since |x| ≤ 3 and ∠0px = π/2, the lemma follows. □
8.4.6 Lemma If E ∈ C_{d,1,3}, r > 0, and z is a point such that z ∈ E^r \ E, let x(z) be the point on the half-line from 0 to z and in ∂E. Then |z - x(z)| ≤ 3r.

Proof Note that x = x(z) is uniquely determined since E ⊃ B(0, 1). Apply Lemma 8.4.5. Let z_1 be the point of H closest to z. Then |z - z_1| ≤ r. The vectors z - z_1 and p are parallel, so ∠p0z = ∠0zz_1, and |z - x| ≤ |z - z_1|·|x|/|p| ≤ 3r. □

8.4.7 Lemma Suppose C ∈ C_d, y ∈ ∂C, x ∈ ∂C^{2]}, and |x - y| = 2. Then for any two-dimensional subspace V containing x, V ∩ B(y, 2) is a disk containing 0 of radius at least 2/3.

Proof Clearly 0 ∈ B(y, 2) ⊂ C^{2]}. Apply Lemma 8.4.5 again. Then H is also a support hyperplane to B(y, 2) at x, so x - y is orthogonal to H and in the same direction as p. Let q be the point on the segment [0, x] closest to y. Then θ := ∠0xp = ∠qyx, sin θ ≥ 1/3, and so |x - q| ≥ 2/3. Let u be the closest point to y in V. Then V ∩ B(y, 2) ⊃ V ∩ B(u, 2/3). □

Now polyhedra to approximate convex sets will be constructed. Let W_d be the cube centered at 0 in R^d of side 2/d^{1/2}, parallel to the axes, so that the
coordinates of the vertices are ±1/d^{1/2}. Given ε > 0, decompose the 2d faces of W_d into equal (d-1)-cubes of side s_d := s_d(ε), where s_d ~ c(ε/d)^{1/2} as ε ↓ 0 and c := 10^{-4}/(d^{1/2}(d-1)), so s_d ~ ε^{1/2}10^{-4}/(d(d-1)). Specifically, let s_d := 2/(d^{1/2}k_d), where k_d := k_d(ε) is the smallest positive integer such that s_d ≤ ε^{1/2}10^{-4}/(d(d-1)). Then for 0 < ε ≤ 1,
ε^{1/2}10^{-4}/d² ≤ s_d ≤ ε^{1/2}10^{-4}/(d(d-1)).
Let L_d be the set of all (d-1)-cubes thus formed. The diameter of each cube in L_d is d^{1/2}s_d ≤ ε^{1/2}10^{-4}/(d^{1/2}(d-1)). The next fact follows directly, by the law of sines, since 0 < ε ≤ 1 and d ≥ 2, and
sin^{-1}x ≤ 1.1x for 0 ≤ x ≤ 10^{-4}.

8.4.8 Lemma (a) For any cube in L_d and any two vertices p and q of the cube,
∠p0q ≤ sin^{-1}(10^{-4}ε^{1/2}/(d-1)) ≤ (1.1)10^{-4}ε^{1/2}/(d-1).
(b) The total number of vertices of all the cubes in L_d is less than
2d(k_d(ε) + 1)^{d-1} ≤ K_d ε^{(1-d)/2},
where K_d := 2d((2·10^4 + 1)d^{3/2})^{d-1}.
Next, there is a triangulation of each cube in L_d, in other words a decomposition of the cube into (d-1)-simplices with disjoint interiors, where each simplex is a convex hull of some d of the 2^{d-1} vertices of the cube. That such a triangulation (without additional vertices) exists (a well-known fact to algebraic topologists) can be seen as follows. By induction, it will be enough to treat S × [0, 1] where S is a simplex with vertices v_0, ..., v_p. Let a_i := (v_i, 0) and b_i := (v_i, 1). Then for each i = 0, 1, ..., p, the points a_0, ..., a_i, b_i, ..., b_p are vertices of a (p+1)-dimensional simplex S_i. To see that these S_i give the desired decomposition of S × [0, 1], note first that each point of a simplex is a unique convex combination of the vertices. For each point z of S × [0, 1], z = (Σ_i λ_i v_i, x) for some unique x ∈ [0, 1] and λ_i ≥ 0 with Σ_i λ_i = 1. Then z = Σ_{i≤j} μ_i a_i + Σ_{i≥j} ρ_i b_i, where μ_i ≥ 0, ρ_i ≥ 0, and Σ_{i≤j} μ_i + Σ_{k≥j} ρ_k = 1, if and only if μ_i = λ_i for i < j, ρ_k = λ_k for k > j, λ_j = μ_j + ρ_j, and x = Σ_{k≥j} ρ_k. Thus z is in S_j if and only if Σ_{i>j} λ_i ≤ x ≤ Σ_{i≥j} λ_i. If both inequalities are strict, then j is unique. Every point of S × [0, 1] is in some S_j, and a point in more than one S_j is on the boundary of both. Let K be a convex set including a neighborhood of 0. For each vertex p_i of a cube in L_d, let H_i be the half-line starting at 0 passing through p_i and let v_i
be the unique point at which H_i passes through the boundary of K. For each simplex S_j in the triangulation of the cubes in L_d, let T_j be the corresponding simplex with vertices v_i in place of p_i. Let π_ε(K) be the polyhedron with faces T_j, in other words the union of the d-dimensional simplices which are convex hulls of T_j ∪ {0}. For d ≥ 3, π_ε(K) is not necessarily convex.
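The prism triangulation described above can be made concrete. The sketch below (Python; the particular base triangle is an arbitrary choice) builds the simplices S_i for S × [0, 1] with a 2-simplex S and checks, with exact rational arithmetic, that their volumes add up to the volume of the prism, as they must if they tile it with disjoint interiors.

```python
from fractions import Fraction as F

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def tet_volume(p0, p1, p2, p3):
    # volume of a tetrahedron = |det of edge vectors| / 6
    rows = [[a - b for a, b in zip(p, p0)] for p in (p1, p2, p3)]
    return abs(det3(rows)) / 6

# base 2-simplex S with vertices v0, v1, v2 (area 1/2), prism S x [0, 1]
v = [(F(0), F(0)), (F(1), F(0)), (F(0), F(1))]
a = [(*vi, F(0)) for vi in v]        # a_i = (v_i, 0)
b = [(*vi, F(1)) for vi in v]        # b_i = (v_i, 1)

# S_i has vertices a_0, ..., a_i, b_i, ..., b_p (here p = 2)
simplices = [a[:i + 1] + b[i:] for i in range(3)]
total = sum(tet_volume(*s) for s in simplices)
assert total == F(1, 2)              # = area(S) * height 1
print("3 tetrahedra with volumes", [tet_volume(*s) for s in simplices])
```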
8.4.9 Lemma Let E ∈ C_{d,1,3}, δ > 0, and ε > 0. For i = 1, 2 let z_i be points such that z_i ∈ E^{16ε} \ E. Assume that ∠z_1 0 z_2 ≤ δ ≤ 1/15. Then
|z_1 - z_2| ≤ 9δ + 96ε.

Proof Let x_i := x(z_i) be the point where the line segment from 0 to z_i intersects ∂E, i = 1, 2. By Lemma 8.4.6, |z_i - x_i| ≤ 48ε, i = 1, 2. We have |z_i| ≥ 1 and |x_i| ≥ 1 for all i. By symmetry we can assume |x_1| ≥ |x_2|. To get a bound for |x_1 - x_2|, we can assume x_1 and x_2 are not on the same line through 0, or they would be equal. Take a half-line L starting at x_1 which is tangent to the unit circle ∂B(0, 1) at a point v and crosses the half-line from 0 through z_2 at a point y.
If v is between x_1 and y, then |x_1 - v| ≤ tan δ ≤ 2δ since δ ≤ 1/15 < π/4. Likewise |y - v| ≤ 2δ. Also, 1 ≤ |x_2| ≤ |x_1| ≤ (1 + 4δ²)^{1/2} ≤ 1 + 2δ² ≤ 1 + δ, so |x_2 - y| ≤ δ and
|x_2 - v| ≤ |x_2 - y| + |y - v| ≤ δ + 2δ = 3δ and |x_1 - x_2| ≤ 5δ.
Otherwise, y is between x_1 and v. Then |x_1 - x_2| is maximized when |x_2| = |x_1| or x_2 = y, since x_2 must not be in the convex hull of {x_1} ∪ B(0, 1). If |x_2| = |x_1| then
|x_1 - x_2| ≤ 2·3 sin(δ/2) ≤ 3δ.
To bound |x_1 - y|, let ψ := ∠0x_1v. Then sin ψ ≥ 1/3, and
|x_1 - y| ≤ δ sec²(π/2 - ψ) = δ csc²ψ ≤ 9δ.
So |x_1 - x_2| ≤ 9δ in all cases, and |z_1 - z_2| ≤ 9δ + 96ε. □
8.4.10 Lemma Let E ∈ C_{2,1,3}. Suppose 2 ≤ K < ∞ and 0 < ε ≤ 10^{-12}/(K-1)². Let c := (1.1)10^{-4}/(K^{1/2}(K-1)). Let r ≥ 2/3 and let a, b, α be points of R² such that B(a, r) ⊂ E, b is on the boundary of E and of B(a, r), d(α, E) ≤ 16ε, and β := ∠α0b ≤ c(Kε)^{1/2}. Then
d(α, B(a, r)) ≤ 50ε. Also, |α - b| ≤ 0.0015ε^{1/2}/(K-1) and ∠αab ≤ D(K)ε^{1/2}, where D(K) := 0.004/(K-1).

Proof Apply Lemma 8.4.5 at x = b. The tangent line L to E at b is also tangent to the circle ∂B(a, r). Let γ := ∠0bp. Then γ ≥ sin^{-1}(1/3) > 1/3. We have
(8.4.11) β ≤ c(Kε)^{1/2} = (1.1)10^{-4}ε^{1/2}/(K-1) ≤ (1.1)10^{-10}.
Let H be the half-plane including E bounded by L. First suppose α ∉ H. Then d(α, H) ≤ 16ε. Let the line from 0 to α intersect L at a point η. If η is between p and b, then γ + β ≤ π/2 and
|η - b| = |p|(cot γ - cot(γ + β)) ≤ 3β csc²γ ≤ 27β.
If p is between η and b, then
|η - b| = |p|(cot γ + tan(γ + β - π/2)) = |p|(cot γ - cot(γ + β)),
where now π/2 < γ + β < π. Thus by (8.4.11),
|η - b| ≤ |p|β max(csc²γ, csc²(γ + β)) ≤ 3β max(csc²γ, csc²(π/2 + 10^{-9})) ≤ 27β ≤ 27c(Kε)^{1/2}.
The other possibility is that b is between η and p. Then
|η - b| = |p|(cot(γ - β) - cot γ).
Now β ≤ (1.1)10^{-10} and γ ≥ sin^{-1}(1/3) imply sin(γ - β) ≥ 0.333, so
|η - b| ≤ 3β/(0.333)² ≤ 28β ≤ 28c(Kε)^{1/2}
in all three cases. For any ordering of b, p, and η,
|α - η| ≤ 16ε/sin(γ - β) ≤ 49ε.
]a - ill < 16E/ sin(y - ,B) < 49e. Next, let x be the distance from a varying point on L to b. Then the distance y from to the circle 8B(a, r) satisfies y = (r2 + x2)1/2 - r. Now
(r2 + t)1/2 < r + t fort > 0 and r > 2/3, so 0 < y < x2 for all x. So the distance from a to B(a, r) is at most CE for C = 49 + 282c2K < 50, giving the first conclusion for a 0 H. A line W through a, orthogonal to the line V through 0 and b, meets V at
a point l;. Then Lbal; = y > sin-1(1/3), so Ia - I < r(1 -
-11)1/2. Let q
the point on the circle |q - a| = r and on the line W, on the same side of ξ as α. Then |q - ξ| ≥ 3(1 - (8/9)^{1/2}) > 0.03. By (8.4.11), β ≤ tan^{-1}(0.01), so the line Λ through 0, η, and α must intersect B(a, r). Then since E is convex, a ∈ E, 0 ∈ E, and B(a, r) ⊂ E, for α ∈ H, d(α, B(a, r)) is maximized when α = q, and then d(α, B(a, r)) ≤ Cε for the same C < 50 as before. So the first conclusion is proved. Lemma 8.4.9 with δ = β gives
|α - b| ≤ 0.0015ε^{1/2}/(K - 1),
so ∠αab ≤ sin^{-1}(|α - b|/(2/3)) ≤ 0.004ε^{1/2}/(K - 1). So Lemma 8.4.10 is proved. □
8.4.12 Lemma Let a ∈ R², B := B(a, r) ⊂ R², where r ≥ 2/3 and 0 ∈ B, so |a| ≤ r. Let S be a closed convex set including B, let a_1, a_2 ∈ ∂S be such that d(a_i, B) ≤ 50ε for i = 1, 2, |a_1 - a_2| ≤ D(K)ε^{1/2}, and ∠a_1 a a_2 ≤ 2D(K)ε^{1/2}, let W be the wedge spanned by the half-lines from 0 through a_1 and a_2, and let S_1 be the convex hull of (B ∩ W) ∪ {a_1, a_2}. Then
h(S ∩ W, S_1) ≤ ε/(4(K - 1)).
Proof Since S_1 ⊂ S ∩ W, it will be enough to show that for z_0 ∈ ∂S ∩ W,
d(z_0, [a_1, a_2]) ≤ ε/(4(K - 1)),
where [a_1, a_2] is the line segment joining a_1 to a_2.
It is easy to see that a_1, a_2, and a are not all on a line, unless a_1 = a_2, when the result clearly holds, so assume they are not on a line. Let L_i be a tangent line to B at a point b_i, through a_i, where of the two such tangent lines, L_i is the one for which ∠b_i a a_i ≤ ∠b_i a a_{3-i}, i = 1, 2, and b_i is not in W.
Now |a_i - a| ≤ r + 50ε, so
∠b_i a a_i ≤ cos^{-1}(r/(r + 50ε)) ≤ cos^{-1}(1 - 75ε).
By derivatives, cos x ≤ 1 - x²/2 + x⁴/24 for all x, so cos x ≤ 1 - 11x²/24 for 0 ≤ x ≤ 1. Thus cos^{-1}(1 - 11x²/24) ≤ x, and cos^{-1}(1 - 75ε) ≤ (164ε)^{1/2}. Now 2D(K)ε^{1/2} ≤ 10^{-8}, and (164ε)^{1/2} ≤ 10^{-4}, so the angles ∠b_i a a_i, i = 1, 2, and ∠a_1 a a_2 add up to
(8.4.13) ∠b_1 a b_2 ≤ (2(164)^{1/2} + 2D(K))ε^{1/2} ≤ 0.002 < π.
So the lines L_i intersect, in a unique point m.
Now, z_0 is in the triangle a_1 a_2 m, because: z_0 ∈ W and a_1, a_2 ∈ ∂S, b_1, b_2 ∈ S, so z_0 cannot be in the triangle a a_1 a_2 unless it is on [a_1, a_2]. Also, if z_0 were on the side of L_i away from a, then a_i ∉ ∂S, a contradiction, i = 1, 2. Next,
∠m a_1 a_2 + ∠m a_2 a_1 = π - ∠b_1 m b_2 = (π/2 - ∠a m b_1) + (π/2 - ∠a m b_2) = ∠b_1 a b_2.
We have d(z_0, [a_1, a_2]) ≤ d(m, [a_1, a_2]). Then
d(m, [a_1, a_2]) ≤ |a_1 - a_2| tan ∠m a_1 a_2,
and by (8.4.13),
tan ∠m a_1 a_2 ≤ 2∠m a_1 a_2 ≤ (52 + 4D(K))ε^{1/2},
and |a_1 - a_2| ≤ D(K)ε^{1/2} by assumption. So
d(m, [a_1, a_2]) ≤ (52D(K) + 4D(K)²)ε ≤ ε/(4(K - 1)). □

8.4.14 Lemma
Let d ≥ 2, C ∈ C_d, 0 < ε ≤ 10^{-12}/(d - 1)², and N' ∈ N'_{4ε}(C^{2]}). Let N be the polyhedron π_ε(N'). Then h(N, N') ≤ ε/4.
Proof For each j, the simplex S_j and the origin span a convex cone W_j, the union of all half-lines with endpoint 0 passing through S_j. It suffices to show that for each j,
h(N' ∩ W_j, N ∩ W_j) ≤ ε/4.
Fix such a cone W = W_j and simplex S = S_j. For s = 1, ..., d, let W^{(s)} be the union of the s-dimensional faces of W, so that W^{(d)} = W, W^{(1)} is the union of half-lines through 0 and vertices of S, and so on. It will be shown by induction on s that if z ∈ W^{(s)} ∩ N' then
(8.4.15) d(z, N) ≤ ε(s - 1)/(4(d - 1)).
For s = d this will give the desired conclusion. For s = 1, (8.4.15) holds by definition of N. Suppose it holds for a given s. Take any z ∈ N' ∩ W^{(s+1)}. Let x = x(z) be the point at which the half-line from 0 through z intersects ∂C^{2]}. There is a point β ∈ ∂C such that |x - β| = 2: to see this, let H be a support hyperplane to C^{2]} at x. Let H_1 be a hyperplane parallel to H at distance 2 in the direction toward C. Then it is easily seen that H_1 is a support hyperplane to C at a point β, the nearest point of C to x. Clearly B(β, 2) ⊂ C^{2]}.
Let F_{s+1} be an (s+1)-dimensional face of W containing z. Let π_2 be a two-dimensional subspace with z ∈ π_2 ∩ W ⊂ F_{s+1}. Then the two rays at the edges of the wedge π_2 ∩ W are included in W^{(s)}. Let these rays intersect ∂N' at points u, v. By Lemma 8.4.7, the disk π_2 ∩ B(β, 2) contains 0 and has radius at least 2/3. So it is a disk π_2 ∩ B(a, r), a ∈ π_2, with |x - a| = r ≥ 2/3. Then u, v ∈ ∂N' implies u, v are not in the interior of C^{2]} and so not in the open ball B(a, r). Now apply Lemma 8.4.10 to E = C^{2]} ∩ π_2, with K = d, first for (α, b) = (u, x), then for (α, b) = (v, x). To justify the application we need the following:
Claim. d(u, E) ≤ 16ε and d(v, E) ≤ 16ε.

Proof There is a point τ ∈ C^{2]} with |u - τ| ≤ 4ε. Let U be the two-dimensional subspace spanned by u and τ (if u and τ are on a line through 0 then τ ∈ π_2 and d(u, E) ≤ 4ε). Let M be the line in U through τ which crosses the half-line D from 0 through u at a point t and is tangent to the unit disk B(0, 1) in U at a point w. Let z' be the closest point to τ on D. Then |τ - z'| ≤ 4ε. Let ψ := ∠0tw = ∠z'tτ. Since t ∈ C^{2]}, |t| ≤ 3. Thus sin ψ ≥ 1/3 and |τ - t| ≤ 12ε, so |t - u| ≤ |t - τ| + |τ - u| ≤ 16ε, proving the claim for u and by symmetry for v. □ It will be shown that:
(a) if u and v are points of a simplex with vertices w_i, then ∠u0v ≤ max_{i,j} ∠w_i 0 w_j.
Here (a) will follow from
(b) if u is fixed and v is in a simplex with vertices w_i, then ∠u0v ≤ max_i ∠u0w_i,
since (b) could be applied in stages to prove (a). Further, it will be enough to prove (b) for a simplex reducing to a line segment from w_1 to w_2, since then the case of general v would reduce in stages to cases where v is on a boundary face of the simplex, then a lower-dimensional face, and so on until v is a vertex. So it will be enough to show that
(c) for all u, x, y in R³ and 0 ≤ λ ≤ 1, ∠u0(λx + (1-λ)y) ≤ max(∠u0x, ∠u0y).
Expressing (c) in terms of cosines, and then of scalar products and lengths, we can reduce to the case where |x| = |y| = |u| = 1, and then check the condition.
Now it is easily seen, using also Lemma 8.4.8, that all the hypotheses of Lemma 8.4.10 hold, hence so do its conclusions. Next, Lemma 8.4.12 will be applied with (again) K = d, and with a_1 = u, a_2 = v, and S := N' ∩ π_2. Let's check that the hypotheses of Lemma 8.4.12 hold. We have
∠a_1 a a_2 ≤ 0.008ε^{1/2}/(d - 1).
The angles ∠uax and ∠xav both have the upper bound 0.004ε^{1/2}/(d - 1). The other hypotheses of Lemma 8.4.12 follow from Lemma 8.4.10, so Lemma 8.4.12 does apply. Its conclusion gives d(z, [u, v]) ≤ ε/(4(d - 1)). By induction hypothesis,
d(y, N) ≤ ε(s - 1)/(4(d - 1))
for y = u, v, and then by convexity for any y ∈ [u, v]. It follows that d(z, N) ≤ εs/(4(d - 1)), completing the induction and the proof of Lemma 8.4.14. □
8.4.16 Lemma Let C ∈ C_d. Then for any ε > 0 with ε ≤ 10^{-12}/(d - 1)², there is an ε/2-net for N_{2ε}(C) containing at most g_d(ε) sets, where g_d(ε) := exp(a(d)ε^{(1-d)/2}) and a(d) := K_d log 24, with K_d as in Lemma 8.4.8.
Proof Recall that θ_{2+2ε} gives an isometry for h of N_{2ε}(C) into N'_{4ε}(C^{2]}). Let N' ∈ N'_{4ε}(C^{2]}) and let the vertices of the polyhedron N := π_ε(N') be y_i, i = 1, 2, .... By Lemma 8.4.14, h(N, N') ≤ ε/4. For each i, on the half-line from 0 through y_i, take the interval I_i := [y_i, y_i + 12ε y_i/|y_i|] of length 12ε starting at y_i. Then I_i has an ε/4-net J_i containing 24 points (midpoints of subintervals of length ε/2). By Lemma 8.4.6, for every M ∈ N'_{4ε}(C^{2]}), π_ε(M) has a vertex v_i in I_i, which is within ε/4 of some u_i ∈ J_i. The (d-1)-simplex with vertices v_i is within ε/4 for h of the one with vertices u_i. The same is true of the d-simplices where the vertex 0 is adjoined to both. Thus if L_ε is the set of all polyhedra with one vertex in J_i for each i, defined as in the definition of π_ε, then L_ε is ε/2-dense in N'_{4ε}(C^{2]}). The number of values of i is at most K_d ε^{(1-d)/2} by Lemma 8.4.8, and Lemma 8.4.16 follows. □
Now it will be shown that there are constants K_d, L_d < ∞ such that for 0 < ε ≤ 1, there is an ε-net for C_d containing at most f_d(ε) sets, where
(8.4.17) f_d(ε) := L_d exp(K_d ε^{(1-d)/2}).
This will clearly imply the upper bound for h in Theorem 8.4.1. It will be enough (possibly changing K_d and L_d) to prove (8.4.17) for ε small enough, specifically for 0 < ε ≤ ε_0 := 10^{-12}/(d - 1)². For such ε, and a(d)
from Lemma 8.4.16, let K_d := a(d)/(1 - 2^{(1-d)/2}). (L_d will be specified below.) We have the decomposition
(0, ε_0] = ∪_{k=1}^∞ I_k, I_k := (ε_0/2^k, ε_0/2^{k-1}].
Before getting more precise bounds, it will help to see that N(ε, C_d, h) < ∞ for all ε > 0. The cube [-1, 1]^d can be written as a finite union ∪_i C_i of cubes of side ≤ ε/d^{1/2} and so of diameter ≤ ε. For each nonempty C ∈ C_d, let B(C) be the union of all C_i which intersect C. Then C ⊂ B(C) and h(B(C), C) ≤ ε. It follows that N(ε, C_d, h) < ∞. So, taking ε = ε_0/2, L_d can be and hereby is chosen so that (8.4.17) holds for ε ∈ I_1. Then (8.4.17) will be proved for ε ∈ I_k by induction on k. Suppose it holds for all ε ∈ I_k. Let δ ∈ I_{k+1}. Then ε := 2δ ∈ I_k. By induction, there is an ε-net {C_i} for C_d with at most f_d(ε) values of i. By Lemma 8.4.16, there exist at most g_d(δ) sets K_j, not necessarily convex, such that each A ∈ N_{2δ}(C_i) is within δ/2 of some K_j. Thus there is a δ-net for N_{2δ}(C_i) containing at most g_d(δ) sets, and a δ-net for C_d containing at most g_d(δ)f_d(2δ) sets. Now a(d) + 2^{(1-d)/2}K_d ≤ K_d by choice of K_d, so g_d(δ)f_d(2δ) ≤ f_d(δ), and the upper bound in Theorem 8.4.1 is proved.
Now to prove the lower bound, let ρ be the Euclidean metric on R^d and recall that S^{d-1} denotes the unit sphere {x ∈ R^d : |x| = 1}. Let d ≥ 2. If v_d is the volume of B(0, 1) ⊂ R^d, then
λ((S^{d-1})^ε) ≥ v_d[(1 + ε)^d - 1] ≥ d v_d ε.
Then by the left side of the last displayed inequality in the proof of Lemma 8.4.3, there is an a_d > 0 such that
D(ε, S^{d-1}, ρ) ≥ a_d ε^{1-d} for 0 < ε ≤ 1.
Given ε, take a set {x_i}_{i=1}^m of points of S^{d-1} more than 2ε apart, of maximal cardinality m := D(2ε) := D(2ε, S^{d-1}, ρ). As above let ∠abc denote the angle at b in the triangle abc. Then for i ≠ j, θ := ∠x_i 0 x_j satisfies 2 sin(θ/2) > 2ε, so θ > 2ε. Let K_i be the half-line from 0 through x_i. Let C_i be the spherical cap cut from the unit ball B_d by a hyperplane orthogonal to K_i at distance cos ε from 0. Then the caps C_i are disjoint.
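The caps have volume of exact order ε^{d+1}. For d = 2 each cap C_i is a circular segment of the unit disk cut at distance cos ε from the center, with exact area ε - sin ε cos ε = (2/3)ε³ + O(ε⁵), matching b_d ε^{d+1} with d = 2. A quick check (Python):

```python
import math

def cap_area(eps):
    """Area of the segment of the unit disk beyond the chord at distance
    cos(eps) from the center (half-angle eps): eps - sin(eps)*cos(eps)."""
    return eps - math.sin(eps) * math.cos(eps)

for eps in [0.2, 0.1, 0.05, 0.01]:
    ratio = cap_area(eps) / eps ** 3
    # leading term (2/3) * eps^3, so the ratio approaches 2/3 from below
    assert 0.5 < ratio < 2 / 3 + 1e-9
print("cap area / eps^3 ->", cap_area(0.001) / 0.001 ** 3)
```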
For any set I ⊂ {1, ..., m} let D_I := B_d \ ∪_{i∈I} C_i. Then each D_I is convex. Let λ_d be d-dimensional Lebesgue measure (volume). Then for all i, λ_d(C_i) ≥ b_d ε^{d+1} for some constant b_d > 0. By Lemma 8.2.11 there are at least e^{m/6} sets I(j) such that for all j ≠ k the symmetric difference I(j)ΔI(k) contains at least m/5 elements. Then
λ_d(D_{I(j)} Δ D_{I(k)}) ≥ (m/5) b_d ε^{d+1} ≥ c_d ε^{1-d} ε^{d+1} = c_d ε²
for a constant c_d := a_d b_d 2^{1-d}/5. So for some β_d > 0,
D(δ, C_d, d_λ) ≥ exp(β_d δ^{(1-d)/2}) for 0 < δ ≤ 1.
This finishes the proof of the lower bound for d_λ, and so also for h. Theorem 8.4.1 is proved. □
Problems
1. Let (K, d) be a compact metric space. Show that the collection of all nonempty closed subsets of K, with the Hausdorff metric, is also compact.
2. In the proof of Theorem 8.2.1, just after Lemma 8.2.7, show that the cubes can be ordered so that i = j - 1.
3. If inequality (1.3.12) is used instead of (1.3.5) in the proof of Lemma 8.2.11, what is the result?
4. Show that in Theorem 8.2.16(a), 𝒦(d, α, K) cannot be replaced by I(d, α, K) if d = 2 and α > 1. Hint: Let f((cos θ, sin θ)) := (1 − cos θ)/2, so f takes S¹ onto [0, 1]. Then I((f, 0)) = 0, where (f, 0)(u, v) := (f(u, v), 0). For any interval (a, b) ⊂ R there is a C^∞ function g_{(a,b)} ≥ 0 with g_{(a,b)} > 0 just on (a, b). Show that for functions ψ := (f, Σ_{i=1}^k δ_i g_{(a_i,b_i)}), for disjoint (a_i, b_i) and small enough δ_i > 0, depending on k, ‖ψ‖_α can remain bounded as k increases while I(ψ) can approximate any finite subset of [0, 1] × {0} for h.
5. Show that the unit disk {(x, y) : x² + y² ≤ 1} belongs to a class 𝒞 in Theorem 8.2.15 for k = 4, any α ∈ (0, ∞), and some K = K(α) < ∞. Hint: The function g(x) := (1 − x²)^{1/2} is smooth on intervals |x| ≤ c for c < 1.
6. Find a constant c > 0 such that lim inf_{ε↓0} ε log N_I(ε, 𝓛𝓛₂, λ) ≥ c, and likewise for D(ε, 𝓛𝓛_{2,1}, dλ). Hint: Consider squares along a decreasing diagonal, S_j := S_{j,m+1−j}, j = 1, …, m, for the grid defined in the proof of Theorem 8.3.1. The union of ⋃_{i≤m−1−j} S_i and an arbitrary subset of the S_j gives a set in 𝓛𝓛_{2,1}. Apply Lemma 8.2.11.
7. (a) The proof of Theorem 8.3.1 used the inequality (2k choose k) ≤ 4^k. Show that, conversely, (2k choose k) ≥ 4^k k^{−1/2}/3. Hint: Use Stirling's formula (Theorem 1.3.13).
(b) Use this to give a lower bound for lim inf_{ε↓0} ε log D(ε, 𝓛𝓛_{2,1}, h).
8. In the proof of Lemma 8.3.2, after Lemma 8.3.4, a function f is defined with ‖f‖ ≤ (d − 1)^{1/2}. For d = 2, deduce this from monotonicity of x(·) in the proof of Theorem 8.3.1.
9. Show that each class 𝒢_{α,K,d} in Theorem 8.2.1 for α > 0 and K < ∞ is a uniform Glivenko–Cantelli class as defined in Section 6.6.
10. Show that the lower layers in R^d form a Glivenko–Cantelli class for any law having bounded support and a bounded density with respect to Lebesgue measure.
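The two bounds in Problem 7(a) lend themselves to a quick numerical spot check. The following Python sketch (an aside, not part of the text) verifies both inequalities for a range of k:

```python
from math import comb, sqrt

# Check of the bounds in Problem 7(a):
#   4**k / (3 * sqrt(k))  <=  C(2k, k)  <=  4**k,   for k >= 1.
for k in range(1, 200):
    central = comb(2 * k, k)          # the central binomial coefficient
    assert central <= 4 ** k
    assert 3 * sqrt(k) * central >= 4 ** k
print("both bounds hold for k = 1, ..., 199")
```

By Stirling's formula the ratio of C(2k, k) to 4^k k^{−1/2} tends to π^{−1/2} ≈ 0.564 > 1/3, which is why the lower bound has a comfortable margin.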
Notes

Note to Section 8.1. Hausdorff (1914, Section 28) defined his metric between closed, bounded subsets of a metric space.
Notes to Section 8.2. Kolmogorov (1955) gave the first statement in Theorem 8.2.1. The proof of that part as given is essentially that of Kolmogorov and Tikhomirov (1959). Lorentz (1966, p. 920) sketches another proof. Theorem 8.2.10 is essentially due to Clements (1963, Theorem 3); the proof here is adapted from Dudley (1974, Lemmas 3.5 and 3.6). Remark 8.2.14, for which I am very grateful to Joseph Fu, shows that there is an error in Dudley (1974, (3.2)), even with the correction (1979), which is proved only for a > 1. Theorem 8.2.16 and its proof by a sequence of lemmas are newly corrected and extended versions of results and proofs of Dudley (1974). Tze-Gong Sun and R. Pyke (1982), whose research began in 1974, proved Corollary 8.2.25(a) independently by a different method. See also Pyke (1983). The statement was apparently first published in Dudley (1978, Theorem 5.12), and attributed to Sun.
Notes to Section 8.3. In Theorem 8.3.2 and its proof for d ≥ 3, I am grateful to Lucien Birgé (1982, personal communication) for the idea of the transformation U. For Theorem 8.3.1, I am much indebted to earlier conversations with Mike Steele. Any errors, however, are mine. A lower bound for empirical processes on lower layers in the plane will be given in Section 12.4, where P is uniform on the unit square. For other, previous results, such as laws of large numbers uniformly over 𝓛𝓛_d for more general P, and on the statistical interest of lower layers (monotone regression), see Wright (1981) and references given there.
Notes to Section 8.4. Bolthausen (1978) proved the Donsker property of the class of convex sets in the plane for the uniform law on I² (cf. Corollary 8.4.2(b)). Theorem 8.4.1, as mentioned, is due to Bronshtein (1976). Specifically, the smoothing C ↦ C′ and the set of lemmas used follow mainly, but not entirely, his original proof. Perhaps most notably, it appears that the polyhedra used to approximate convex sets in Bronshtein's construction need not be convex themselves, and this required some adjustments in the proof. A more minor point is that if C ∈ 𝒞_d then C₁ doesn't necessarily include B(0, 1), although C₂ does. So Bronshtein's Lemma 1 seems incorrect as stated but could be repaired by changing various constants.
In an earlier result of Dudley (1974, Theorem 4.1), for d ≥ 2, in the upper bound, ε^{(1−d)/2} was multiplied by |log ε|. This bound, weaker than Bronshtein's, is easier to prove and suffices to give Corollary 8.4.2(b) and (c). The lower bound in Dudley (1974) was reproduced here. I thank James Munkres for telling me about the triangulation method used after Lemma 8.4.8. Gruber (1983) surveys other aspects of approximation of convex sets.
References

*An asterisk indicates a work I have seen discussed in secondary sources but not in the original.

Bolthausen, E. (1978). Weak convergence of an empirical process indexed by the closed convex subsets of I². Z. Wahrscheinlichkeitsth. verw. Gebiete 43, 173–181.
Bonnesen, Tommy, and Fenchel, Werner (1934). Theorie der konvexen Körper. Springer, Berlin; repub. Chelsea, New York, 1948.
Bronshtein [Bronšteĭn], E. M. (1976). ε-entropy of convex sets and functions. Siberian Math. J. 17, 393–398; transl. from Sibirsk. Mat. Zh. 17, 508–514.
Clements, G. F. (1963). Entropies of several sets of real valued functions. Pacific J. Math. 13, 1085–1095.
Dudley, R. M. (1974). Metric entropy of some classes of sets with differentiable boundaries. J. Approx. Theory 10, 227–236; Correction, ibid. 26 (1979), 192–193.
Eggleston, H. G. (1958). Convexity. Cambridge University Press, Cambridge; reprinted with corrections, 1969.
Eilenberg, S., and Steenrod, N. (1952). Foundations of Algebraic Topology. Princeton University Press, Princeton.
Gruber, P. M. (1983). Approximation of convex bodies. In Convexity and Its Applications, ed. P. M. Gruber and J. M. Wills. Birkhäuser, Basel, pp. 131–162.
Hausdorff, Felix (1914). Mengenlehre, transl. by J. P. Aumann et al. as Set Theory, 3d English ed. of transl. of 3d German ed. (1937). Chelsea, New York, 1978.
Kolmogorov, A. N. (1955). Bounds for the minimal number of elements of an ε-net in various classes of functions and their applications to the question of representability of functions of several variables by functions of fewer variables (in Russian). Uspekhi Mat. Nauk (N.S.) 10, no. 1 (63), 192–194.
Kolmogorov, A. N., and Tikhomirov, V. M. (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk 14, no. 2, 3–86 = Amer. Math. Soc. Transl. (Ser. 2) 17 (1961), 277–364.
Lorentz, George G. (1966). Metric entropy and approximation. Bull. Amer. Math. Soc. 72, 903–937.
Pyke, R. (1983). The Haar-function construction of Brownian motion indexed by sets. Z. Wahrscheinlichkeitsth. verw. Gebiete 64, 523–539.
*Sun, Tze-Gong, and Pyke, R. (1982). Weak convergence of empirical measures. Technical Report no. 19, Dept. of Statistics, Univ. of Washington, Seattle.
Wright, F. T. (1981). The empirical discrepancy over lower layers and a related law of large numbers. Ann. Probab. 9, 323–329.
9 Sums in General Banach Spaces and Invariance Principles
Let (S, ‖·‖) be a Banach space (in general nonseparable). A subset 𝓕 of the unit ball {f ∈ S′ : ‖f‖′ ≤ 1} is called a norming subset if and only if
‖s‖ = sup_{f∈𝓕} |f(s)|  for all s ∈ S.
The whole unit ball in S′ is always a norming subset by the Hahn–Banach theorem (RAP, Corollary 6.1.5). Conversely, given any set 𝓕, let S := ℓ^∞(𝓕) be the set of all bounded real functions on 𝓕, with the supremum norm
‖s‖ = ‖s‖_𝓕 := sup_{f∈𝓕} |s(f)|,  s ∈ S.
Then the natural map f ↦ (s ↦ s(f)) takes 𝓕 one-to-one onto a norming subset of S. So limit theorems for empirical measures, uniformly over a class 𝓕 of functions, can be viewed as limit theorems in a Banach space S with norm ‖·‖_𝓕. Conversely, limit theorems in a general Banach space S with norm ‖·‖ can be viewed as limit theorems for empirical measures on S, uniformly over a class 𝓕 of functions, such as the unit ball of S′, since for f ∈ S′ and x_1, …, x_n ∈ S,
(δ_{x_1} + ⋯ + δ_{x_n})(f) = f(x_1 + ⋯ + x_n).
Suppose that X_j are i.i.d. real random variables with mean 0 and variance 1, and let S_n := Σ_{j≤n} X_j.
One invariance principle says that on some probability space, there exist such X_j and also i.i.d. N(0, 1) variables Y_1, Y_2, …, with T_n := Σ_{j≤n} Y_j, such that (S_n − T_n)/n^{1/2} → 0 in probability. Since T_n/n^{1/2} also has an N(0, 1) distribution for each n, the invariance principle implies that S_n/n^{1/2} is close to T_n/n^{1/2}, which implies the central limit theorem. Although it is not as obvious, central limit theorems generally imply invariance principles. The phrase "invariance principle" means that such quantities as max_{k≤n} S_k/n^{1/2} have asymptotic distributions not depending on the distribution of the X_j.
The main result of the chapter will be Theorem 9.4.2, saying that the Donsker property is equivalent to an invariance principle for empirical processes.
9.1 Independent random elements and partial sums

We need a notion of independence for functions which may not be measurable.
Let (A_j, 𝒜_j, P_j), j = 1, 2, …, be probability spaces, and form a product Π_{j=1}^∞ (A_j, 𝒜_j, P_j) = (B, 𝓑, P) with points x = {x_j}_{j=1}^∞. If X_j are functions on B of the form X_j = h_j(x_j), j = 1, …, n, where each h_j is a function on A_j (not necessarily measurable), then we call the X_j independent random elements. If the h_j are measurable this implies independence in the usual sense. In the rest of this section, X_j = h_j(x_j), j = 1, …, n, are independent random elements with values in a real vector space S on which there is a seminorm ‖·‖, and S_k := X_1 + ⋯ + X_k, k = 1, 2, …, with S_0 := 0. The section collects some inequalities for later use.
9.1.1 Lemma. Suppose given independent random elements X_j with
t_n := Σ_{j=1}^n E‖X_j‖*² and
(9.1.2)  ‖X_j‖ ≤ M,  j = 1, …, n.
Then for 0 < γ ≤ 1/(2M),
(9.1.3)  E exp(γ‖S_n‖*) ≤ exp(3γ²t_n + γE‖S_n‖*).
Hence for any K > 0 and 0 < γ ≤ 1/(2M),
(9.1.4)  Pr{‖S_n‖* > K} ≤ exp(3γ²t_n − γ(K − E‖S_n‖*)).
Remarks. If K ≤ E‖S_n‖* then the infimum of the right side of (9.1.4) is attained at γ = 0 and the bound is trivial. If 0 < K − E‖S_n‖* ≤ 3t_n/M then the infimum is attained at γ = (K − E‖S_n‖*)/(6t_n) and equals exp(−(K − E‖S_n‖*)²/(12t_n)). If K − E‖S_n‖* > 3t_n/M then the infimum (again, for 0 < γ ≤ 1/(2M)) is at γ = 1/(2M) and equals
exp(3t_n/(4M²) − (K − E‖S_n‖*)/(2M)) ≤ exp(−(K − E‖S_n‖*)/(4M)).

Proof. First, (9.1.2) is equivalent to ‖X_j‖* ≤ M, 1 ≤ j ≤ n. Clearly (9.1.4) follows from (9.1.3). To prove (9.1.3), set Y_k := S_n − X_k. For any η ∈ 𝓛¹(B, 𝓑, P), let E_k η := E(η | x_1, …, x_{k−1}), k ≤ n; E_{n+1}η := η, E_1η := Eη.
Then E_{k+1}‖Y_k‖* = E_k‖Y_k‖* by Lemma 3.2.5 with ω_1 = (x_1, …, x_{k−1}), ω_2 = x_k, and ω_3 = (x_{k+1}, …, x_n). Let η_k := E_{k+1}‖S_n‖* − E_k‖S_n‖*. Then by RAP, Theorem 10.1.9,
(9.1.5)  E exp(γ‖S_n‖*) = E(E_n(exp[γE_n‖S_n‖* + γη_n])) = E(exp(γE_n‖S_n‖*) E_n exp(γη_n)).
By boundedness and Lemmas 3.2.2 and 3.2.3,
E_{k+1}‖S_n‖* − E_k‖S_n‖* ≤ E_{k+1}‖X_k‖* + E_{k+1}‖Y_k‖* − E_k‖Y_k‖* + E_k‖X_k‖* = ‖X_k‖* + E‖X_k‖*
for k = 1, …, n, and likewise
E_{k+1}‖S_n‖* − E_k‖S_n‖* ≥ −‖X_k‖* − E‖X_k‖*.
Thus by a binomial expansion and Jensen's inequality (RAP, 10.2.6), E_k|η_k|^j ≤ 2^j E‖X_k‖*^j for j = 1, 2, …. Clearly E_kη_k = 0. So for 0 < γ ≤ 1/(2M),
E_k(exp(γη_k)) = 1 + Σ_{j=2}^∞ γ^j E_k η_k^j / j!
 ≤ 1 + 2γ²E‖X_k‖*² [1 + 2γM/3 + (2γM)²/12 + ⋯]
 ≤ 1 + 2γ²E‖X_k‖*²/(1 − 2γM/3) ≤ exp(2γ²E‖X_k‖*²(1 + γM)),
since 1 + x ≤ e^x, x ∈ R. Combining this with (9.1.5) gives
E exp(γ‖S_n‖*) ≤ exp(2γ²E‖X_n‖*²(1 + γM)) E exp(γE_n‖S_n‖*).
Now as in (9.1.5), E exp(γE_n‖S_n‖*) = E[exp(γE_{n−1}‖S_n‖*) E_{n−1} exp(γη_{n−1})], and the argument can be iterated n times to give (9.1.3). □
Next, Ottaviani's inequality (1.3.15) also extends to starred norms.
9.1.6 Lemma. Suppose independent random elements X_j are such that, for some α > 0,
max_{j≤n} Pr{‖S_n − S_j‖* > α} =: c < 1.
Then
Pr(max_{j≤n} ‖S_j‖* > 2α) ≤ Pr(‖S_n‖* > α)/(1 − c).
Proof. Let j*(ω) be the least j ≤ n such that ‖S_j‖* > 2α (or, say, n + 1 if there is no such j). Then
Pr(‖S_n‖* > α) ≥ Pr(‖S_n‖* > α, max_{j≤n} ‖S_j‖* > 2α)
 = Σ_{j=1}^n Pr(‖S_n‖* > α, j* = j)
 ≥ Σ_{j=1}^n Pr(‖S_n − S_j‖* ≤ α, j* = j),
since ‖S_j‖* ≤ ‖S_n‖* + ‖S_j − S_n‖* (Lemma 3.2.3). Now ‖S_n − S_j‖* is a measurable function of (x_{j+1}, …, x_n) =: y_2, by Lemma 3.2.4 with n = 2 and y_1 := (x_1, …, x_j). Thus ‖S_n − S_j‖* is independent of the event {j* = j} in the usual sense, so the last sum equals
Σ_{j=1}^n Pr(‖S_n − S_j‖* ≤ α) Pr(j* = j) ≥ (1 − c) Σ_{j=1}^n Pr(j* = j) = (1 − c) Pr(max_{j≤n} ‖S_j‖* > 2α). □
Here is another form of the Hoffmann–Jørgensen inequality:

9.1.7 Lemma. Let K > 0 and suppose the independent random elements X_j are such that
Pr(‖S_j − S_i‖* > K) ≤ 1/2,  0 ≤ i < j ≤ n.
Then for any t ≥ K and s > 0,
Pr{‖S_n‖* > 4t + s} ≤ 4 Pr{‖S_n‖* > t}² + Pr{max_{m≤n} ‖X_m‖* > s}.
Proof. Let τ(ω) := min{j ≥ 1 : ‖S_j(ω)‖* > 2t}, or +∞ if there is no such j. Then
(9.1.8)  Pr{‖S_n‖* > 4t + s} = Σ_{m=1}^n Pr{τ = m, ‖S_n‖* > 4t + s}
 ≤ (Σ_{m=1}^n Pr{τ = m, ‖S_n − S_m‖* > 2t}) + Pr(max_{m≤n} ‖X_m‖* > s).
By Lemma 3.2.3, ‖S_n‖* ≤ ‖S_{m−1}‖* + ‖X_m‖* + ‖S_n − S_m‖*, so the last sum Σ′ in (9.1.8) satisfies
Σ′ = Σ_{m=1}^n Pr(τ = m, ‖S_n − S_m‖* > 2t)
 = Σ_{m=1}^n Pr(τ = m) Pr(‖S_n − S_m‖* > 2t)
 ≤ max_{m≤n} Pr(‖S_n − S_m‖* > 2t) Pr(max_{m≤n} ‖S_m‖* > 2t)
 ≤ 2 Pr(‖S_n‖* > t) Pr(max_{m≤n} ‖S_m‖* > 2t) ≤ 4 Pr(‖S_n‖* > t)²,
using Lemma 3.2.4 as in the proof of Lemma 9.1.6 for independence, then using Lemma 9.1.6 itself twice, the first time with the order of summation reversed. The lemma now follows from (9.1.8). □
The random elements X_j will be called symmetric if we can write x_j = (y_j, z_j) where for each j, y_j and z_j are independent and have the same distribution in some space B_j (A_j = B_j × B_j with P_j = Q_j × Q_j for some Q_j) and X_j = ψ_j(y_j) − ψ_j(z_j) for some function ψ_j from B_j into S. With this definition, the Lévy inequality (Lemma 1.3.16) extends to starred norms, with about the same proof:
9.1.9 Lemma. If X_j are independent symmetric random elements then for any r > 0, the following both hold:
(a) Pr(max_{j≤n} ‖S_j‖* > r) ≤ 2 Pr(‖S_n‖* > r);
(b) Pr(max_{j≤n} ‖X_j‖* > r) ≤ 2 Pr(‖S_n‖* > r).
Proof. For (a), let M_k(ω) := max_{j≤k} ‖S_j‖*, with M_0 := 0, and let C_m be the event {M_{m−1} ≤ r < ‖S_m‖*}, 1 ≤ m ≤ n. Since S_m = (S_n + (2S_m − S_n))/2,
‖S_m‖* ≤ (‖S_n‖* + ‖2S_m − S_n‖*)/2,
so if ‖S_m‖* > r, then either ‖S_n‖* > r or ‖2S_m − S_n‖* > r or both. The transformation which interchanges y_j and z_j for j > m preserves probabilities and interchanges S_n and 2S_m − S_n, so it interchanges ‖S_n‖* and ‖2S_m − S_n‖*, while preserving all X_j for j ≤ m. Thus
Pr(C_m ∩ {‖S_n‖* > r}) = Pr(C_m ∩ {‖2S_m − S_n‖* > r}) ≥ ½ Pr(C_m).
Thus Pr(M_n > r) = Σ_{m=1}^n Pr(C_m) ≤ 2 Pr(‖S_n‖* > r). For (b), the proof is similar: replace S_i for i = j, k, or m by X_i, and use the transformation interchanging y_j and z_j for j ≠ m, which does not change any ‖X_j‖*. □
9.1.10 Lemma. Let X_j be independent symmetric random elements and s, t > 0. Then for n = 1, 2, …,
Pr{‖S_n‖* > 2t + s} ≤ 4 (Pr{‖S_n‖* > t})² + Pr{max_{m≤n} ‖X_m‖* > s}.
Proof. Let N_n := max_{m≤n} ‖X_m‖* and let τ be the least j ≤ n such that ‖S_j‖* > t, or +∞ if there is no such j. Then
(9.1.11)  Pr{‖S_n‖* > 2t + s} = Σ_{j=1}^n Pr{τ = j, ‖S_n‖* > 2t + s}.
If τ = j then ‖S_j‖* ≤ t + N_n, so on the event {τ = j, ‖S_n‖* > 2t + s} we have ‖S_n − S_j‖* > t + s − N_n(ω). Thus
Pr{τ = j, ‖S_n‖* > 2t + s} ≤ Pr{τ = j, ‖S_n − S_j‖* > t + s − N_n}
 ≤ Pr{τ = j, ‖S_n − S_j‖* > t} + Pr{τ = j, N_n > s}
 = Pr(τ = j) Pr(‖S_n − S_j‖* > t) + Pr(τ = j, N_n > s),
where for the last step, Lemma 3.2.4 applies as in the proof of Lemma 9.1.6. Now summing on j yields by (9.1.11),
Pr(‖S_n‖* > 2t + s) ≤ Pr(N_n > s) + Σ_{j=1}^n Pr(τ = j) Pr(‖S_n − S_j‖* ≥ t).
The Lévy inequality (Lemma 9.1.9), with ≥ in place of >, gives
Pr(‖S_n − S_j‖* ≥ t) ≤ 2 Pr(‖S_n‖* > t)
for each j, and
Σ_{j=1}^n Pr(τ = j) = Pr(max_{k≤n} ‖S_k‖* > t) ≤ 2 Pr(‖S_n‖* > t).
Combining the last three facts gives the result. □
9.2 A CLT implies measurability in separable normed spaces

Here "CLT" abbreviates "central limit theorem." As noted in the Remarks at the end of Section 1.1, if F_n is an empirical distribution function, Pr(F_n ∈ A) need not be defined if A is complete and discrete, hence Borel, for the (nonseparable) supremum norm. Thus the variables X_j := δ_{x(j)} − P need not be Borel measurable in general for a norm ‖·‖_𝒞 or ‖·‖_𝓕. But in separable Banach spaces one usually assumes that variables are Borel measurable. It will be shown here that in a separable normed space, a form of central limit theorem can hold only if X_1 is measurable. After this is proved in R¹, it will be extended easily to separable normed spaces.
Recall the inner measure Pr_*(B) := sup{Pr(C) : C ⊂ B, C measurable} and that
f_* := −((−f)*) = ess sup{g : g ≤ f, g measurable}.

9.2.1 Theorem. Let (A, 𝒜, P) be a probability space and x_n coordinates on (A^∞, 𝒜^∞, P^∞), n = 1, 2, …. Let h : A → R (where h is not assumed measurable) and X_i := h(x_i). Let S_n := X_1 + ⋯ + X_n. If for all t,
lim_{n→∞} Pr_*(S_n/n^{1/2} ≤ t) = lim_{n→∞} Pr*(S_n/n^{1/2} ≤ t) = N(0, 1)((−∞, t]),
then h is measurable for the completion of P, so that the X_i are measurable, EX_i = 0 and EX_i² = 1.
Proof. First, for the measurable case we need the following converse of the usual one-dimensional central limit theorem (RAP, Theorem 9.5.6).

9.2.2 Theorem. Let X_1, X_2, … be i.i.d. real random variables and S_n := X_1 + ⋯ + X_n. If the law of S_n/n^{1/2} converges to N(0, 1) as n → ∞ then EX_1 = 0 and EX_1² = 1.
Proof. Let X_1, Y_1, X_2, Y_2, … all be i.i.d. and Z_n := (X_n − Y_n)/2^{1/2} for all n. Then the Z_n are i.i.d. and the law of (Z_1 + ⋯ + Z_n)/n^{1/2} also converges to N(0, 1). If h and f are the characteristic functions of X_1 and Z_1 respectively, then f(t) = |h(t/2^{1/2})|² = f(−t) and 0 ≤ f(t) ≤ 1 for all t. Let g(t) := 1 − f(t) for all t. As n → ∞,
(9.2.3)  (1 − g(t/n^{1/2}))^n → exp(−t²/2)
for all t, uniformly for |t| ≤ 1 (RAP, Theorem 9.4.5, Corollary 11.3.4). I claim that g(u)/u² → 1/2 as u → 0. If not, first suppose for some δ > 0, g(u_k) ≥ (½ + δ)u_k² for some u_k → 0. Taking a subsequence, we can assume that for all n, u_n = t_n/n^{1/2} where |t_n| ≤ 1. Then (1 − g(t_n/n^{1/2}))^n ≤ exp(−(½ + δ)t_n²), contradicting (9.2.3) as n → ∞. Or, if g(u_k) ≤ (½ − δ)u_k² for some u_k → 0, again we can assume u_n = t_n/n^{1/2} where |t_n| ≤ 1 for all n. Then (1 − g(t_n/n^{1/2}))^n ≥ exp(−(1 − δ)t_n²/2) for n large enough, again contradicting (9.2.3). So, as claimed, g(u)/u² → 1/2 as u → 0. Now for P := 𝓛(Z_1), and all u,
(9.2.4)  g(u)/u² = (g(u) + g(−u))/(2u²) = ∫ (1 − cos(ux))/u² dP(x).
Letting u → 0 and applying Fatou's Lemma (RAP, 4.3.3) gives that ∫ x² dP(x) ≤ 1. Then by the Tonelli–Fubini theorem, E(X_1²) < ∞, so E|X_1| < ∞, and (S_n − nEX_1)/n^{1/2} converges in law to some N(0, σ²). Since S_n/n^{1/2} also converges in law, it follows by tightness that EX_1 = 0 and
then EX_1² = σ² = 1. □

So, to continue the proof of Theorem 9.2.1, suppose h is nonmeasurable for P, so that the X_n are nonmeasurable. Let B := {x ∈ A : h*(x) = +∞}. Suppose P(B) > 0. Then for each j, Pr(X_j* = +∞) = P(B) by Lemma 3.2.4, case (b). By Lemma 3.2.4(c), S_n* = +∞ a.s. on the set where X_j* = +∞ for any j ≤ n. Thus
Pr((S_n/n^{1/2})* = +∞) ≥ Pr(X_j* = +∞ for some j ≤ n) = 1 − (1 − P(B))^n → 1
as n → ∞. By Lemma 3.2.6 this contradicts the hypothesis. Thus P(B) = Pr(X_1* = +∞) = 0.
Let B_j := B(j) := {X_j > X_j* − 2^{−j}}. Then Pr_*(B_j) = 1 for each j. Let C_n := ⋂_{j=1}^n B_j. Apply Lemma 3.2.4 with P_j = P and f = 1_{B(j)} to obtain Pr_*(C_n) = 1. On C_n,
S_n ≤ X_1* + ⋯ + X_n* ≤ S_n + 1.
Thus (X_1* + ⋯ + X_n*)/n^{1/2} also converges in law to N(0, 1). Hence by Theorem 9.2.2, EX_1* = 0. Likewise EX_{1*} = 0. Thus X_1* = X_1 = X_{1*} a.s., that is, X_1 is completion measurable. □
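The claim g(u)/u² → 1/2 in the proof of Theorem 9.2.2 can be illustrated in the simplest case. The following snippet (an aside, not part of the proof) takes X, Y to be ±1 with probability 1/2 each, so that Z = (X − Y)/2^{1/2} has EZ² = 1 and characteristic function f(t) = 1/2 + cos(2^{1/2}t)/2:

```python
import math

# With X, Y Rademacher and Z = (X - Y)/2**0.5, the characteristic function
# of Z is f(t) = 1/2 + cos(2**0.5 * t)/2.  Then g := 1 - f satisfies
# g(u)/u**2 -> 1/2 as u -> 0, matching E Z**2 = 1 (half the second moment).
def g(u):
    return 1.0 - (0.5 + 0.5 * math.cos(math.sqrt(2.0) * u))

for u in (1.0, 0.1, 0.01, 0.001):
    print(u, g(u) / u ** 2)   # tends to 0.5
```

The convergence is quadratic in u here, since the law of Z has all moments; the proof in the text of course needs no such assumption.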
9.2.5 Corollary. Suppose (S, ‖·‖) is a separable normed space and X_n = h(x_n) where the x_n are independent, identically distributed random variables with values in some measurable space (A, 𝒜) and h is any function from A into S (not assumed measurable). Suppose Y_n are i.i.d. Gaussian variables in S with mean 0 and
(9.2.6)  lim_{n→∞} ‖n^{−1/2} Σ_{j≤n} (X_j − Y_j)‖* = 0
in probability. Then the X_j are completion measurable for the Borel σ-algebra on S.

Proof. Let 𝓕 = S′, the dual Banach space. Then (9.2.6) implies that for each f ∈ 𝓕, the central limit theorem as in Theorem 9.2.1 holds for the f(h(x_j)), with some limit law N(0, σ_f²), σ_f² := σ²(f(Y_1)). Thus by that theorem, each f ∘ h, f ∈ 𝓕, is measurable for the completion of 𝓛(x_1) on A. So each f(X_j) is completion measurable, f ∈ 𝓕. Since S is separable, by the Hahn–Banach theorem there is a countable norming subset {f_n} ⊂ S′, so that ‖s‖ = sup_n |f_n(s)| for all s ∈ S, and the Borel σ-algebra of S is generated by the f_n. Thus the X_j are completion measurable. □

9.2.7 Corollary. If the hypotheses of Corollary 9.2.5 hold except that S is not necessarily separable, and if H is a bounded linear operator from S into a separable normed space, then the H(X_j) are completion measurable.
9.3 A finite-dimensional invariance principle

The invariance principles to be treated here are extensions of central limit theorems to joint distributions of partial sums for the sequence of values of n.
Specifically, for example, let P be a law on R with mean 0 and variance 1. One invariance principle says that on some probability space there exist random variables X_i, independent with distribution P, and random variables Y_i, independent with standard normal distribution N(0, 1) (but not independent of the X_j), such that
lim_{n→∞} max_{k≤n} |S_k − T_k|/n^{1/2} = 0 in probability, with T_k := Y_1 + ⋯ + Y_k.
Let S_k := X_1 + ⋯ + X_k as usual. Invariance principles say that the asymptotic behavior of quantities such as max(S_1, …, S_n)/n^{1/2} is the same as if the X_i had a standard normal distribution, or as in the "simple random walk" case where X_1 = +1 or −1 with probability 1/2 each. This invariance is helpful in computing asymptotic distributions, since for the normal or random walk case there are reflection principles (as in RAP, Section 12.3). This section will prove an invariance principle for R^d-valued random variables. Then, for d = 1, another form of the theorem, in C[0, 1], will be proved. The next section will prove a general invariance principle for empirical processes. Here is the main result of this section:
9.3.1 Theorem. Let P be a law on R^d with mean 0. Let ‖·‖ be the usual Euclidean norm on R^d. Assume that ∫‖x‖² dP(x) < ∞. Let C_{ij} := ∫ x_i x_j dP(x) where x = (x_1, …, x_d) ∈ R^d. Then on any nonatomic probability space there exist random variables X_i independent with law P, and Y_i independent with law N(0, C), such that for T_k := Y_1 + ⋯ + Y_k,
(9.3.2)  lim_{n→∞} max_{k≤n} ‖S_k − T_k‖/n^{1/2} = 0
in probability.
Proof. Let 0 < ε ≤ 0.01. For k = 1, 2, …, let t_k := t_k(ε) := t(k) := [(1 + ε⁴)^k], where [x] denotes the greatest integer ≤ x. Let t_0 := 0. Let n_k := n_k(ε) := n(k) := t_{k+1} − t_k, k = 0, 1, ….

9.3.3 Lemma. There is a δ > 0 such that for 0 < ε ≤ δ and k ≥ k_2(ε) large enough,
Pr{max_{r≤n(k)} ‖Σ_{t(k)<i≤t(k)+r} X_i‖ ≥ εt(k)^{1/2}} ≤ 4ε⁶.
Proof. For k = 1, 2, …, let
c_k := max_{0<j≤n(k)} Pr{‖S_{n(k)} − S_j‖ ≥ εt_k^{1/2}/2}.
Let K_n := 𝓛(S_n/n^{1/2}). Then by the central limit theorem (RAP, 9.5.6) the laws K_n converge to G := N(0, C) as n → ∞. Thus for the Prokhorov metric ρ (RAP, Section 11.3), ρ(K_n, G) → 0 as n → ∞. Then for N_0 large enough, and all n ≥ N_0,
Pr{‖S_n/n^{1/2}‖ > 1/(5ε)} ≤ G{x : ‖x‖ ≥ 1/(6ε)} + ε⁶.
Also, for δ > 0 small enough, G{x : ‖x‖ ≥ 1/(6ε)} ≤ ε⁶ for 0 < ε ≤ δ by the Landau–Shepp–Marcus–Fernique theorem (Lemma 2.2.5 suffices in this case). Take δ ≤ 1/2. Then for such an ε and k ≥ k_1(ε) large enough, we have ε⁴(1 + ε⁴)^k ≥ 256, so
(1 + ε⁴)^k − 1 ≥ (15/16)(1 + ε⁴)^k,  n(k) ≤ ε⁴(1 + ε⁴)^k + 1,
4ε²(1 + ε⁴)^{k/2} + 4 ≤ (17/4)ε²(1 + ε⁴)^{k/2},
and for N_0 ≤ n ≤ n_k,
1/(5ε) ≤ ε((1 + ε⁴)^k − 1)^{1/2}/(4ε²(1 + ε⁴)^{k/2} + 4) ≤ εt(k)^{1/2}/(4n(k)^{1/2}) ≤ εt(k)^{1/2}/(4n^{1/2}).
Thus
(9.3.4)  Pr{‖S_n‖ ≥ εt_k^{1/2}/4} ≤ 2ε⁶,
which holds also for n < N_0, for k ≥ k_2(ε) ≥ k_1(ε) large enough, and thus for 1 ≤ n ≤ n_k. Then in particular, c_k as defined above is less than 1/2. Applying Ottaviani's inequality (1.3.15) gives
Pr{max_{r≤n(k)} ‖Σ_{t(k)<i≤t(k)+r} X_i‖ ≥ εt_k^{1/2}} ≤ (1 − c_k)^{−1} Pr{‖S_{n(k)}‖ ≥ εt_k^{1/2}/2} ≤ 4ε⁶. □
Now continuing the proof of Theorem 9.3.1, let H_k := H(k) := {j : t_k < j ≤ t_{k+1}}. Then H_k has n_k elements. Define V_k by V_k := Σ_{j∈H(k)} X_j/n_k^{1/2}. Let F_k := 𝓛(V_k). Then F_k → G as k → ∞. Thus ρ_k := ρ(F_k, G) → 0 (by RAP, Theorem 11.3.3 again). By Strassen's theorem (RAP, 11.6.2), for each k there is a law μ_k on R^{2d} such that for 𝓛((y, z)) = μ_k we have 𝓛(y) = F_k, 𝓛(z) = G, and μ_k{(y, z) : ‖y − z‖ > ρ_k} ≤ ρ_k. The joint distribution of the X_j for j ∈ H_k and V_k gives a law β_k on R^{n(k)d} × R^d. So we can apply Lemma 1.1.10 (the Vorob'ev–Berkes–Philipp theorem) to get a law α_k on R^{n(k)d} × R^d × R^d which has marginals β_k on R^{n(k)d} × R^d and μ_k on R^d × R^d. Let Z_k be the z coordinate. If z_j for j ∈ H_k are independent, each with law N(0, C), and U_k := Σ_{j∈H(k)} z_j/n_k^{1/2}, then 𝓛(U_k) = N(0, C) = 𝓛(Z_k), where Z_k has its law under α_k (or μ_k). By the Vorob'ev–Berkes–Philipp lemma (1.1.10) again, there is a law γ_k on R^{n(k)d} × R^d × R^d × R^{n(k)d} for which the marginal on the latter R^d × R^{n(k)d} is the law of (U_k, {z_j}_{j∈H(k)}) and the law on the product of the first three factors is α_k. Taking the product over k of all these spaces and laws γ_k (RAP, Section 8.2), we can assume that the coordinates of the kth factor are {X_j}_{j∈H(k)}, V_k, U_k = Z_k, and {z_j}_{j∈H(k)}, since this gives the correct joint distributions for those variables which had them previously. Specifically, all the X_j remain independent and identically distributed, and likewise for the z_j. Then for each k,
(9.3.5)  Pr{‖V_k − U_k‖ > ρ_k} ≤ ρ_k.
Take k_0 large enough so that for all k ≥ k_0,
(9.3.6)  ρ_k ≤ ε⁶.
For each n let S(n) := S_n and T(n) := T_n := z_1 + ⋯ + z_n. Let
s := [log(3/ε⁴)/log(1 + ε⁴)] + 1, so that
(9.3.7)  (1 + ε⁴)^s > 3ε^{−4}.
Also note that for ε small enough, by L'Hospital's rule, log(1 + ε⁴) > ε⁴/2, log(3/ε⁴) < 1/(5ε), and 1 < ε^{−5}/10, so
(9.3.8)  s := s(ε) ≤ ε^{−5}/2.
9.3.9 Lemma. For ε > 0 small enough there is an M_0 := M_0(ε) < ∞ such that for M ≥ M_0 and m := M − s, we have
Pr{max_{m≤k≤M} ‖S(t_k) − T(t_k)‖ ≥ εt_M^{1/2}} < ε.
Proof. For M ≥ M_1(ε) large enough, ε⁴(1 + ε⁴)^{M−s} ≥ 3 and m ≥ k_0. Then
n_{M−1} ≥ ε⁴(1 + ε⁴)^{M−1} − 1 ≥ 2ε⁴(1 + ε⁴)^{M−1}/3 ≥ ε⁴(1 + ε⁴)^M/3 ≥ (1 + ε⁴)^{M−s}
by (9.3.7), so
(9.3.10)  t_m ≤ n_{M−1}.
Also, by the definitions of t_j and n_k and a geometric series,
(9.3.11)  Σ_{k=m}^M n_k^{1/2} ≤ 2ε² Σ_{k=m}^M (1 + ε⁴)^{(k+1)/2} ≤ 8ε^{−2}(1 + ε⁴)^{(M+1)/2} ≤ 9ε^{−2} t_M^{1/2}
for 0 < ε ≤ ε_0 small enough (where ε_0 is an absolute constant) and M large. Then the probability in the statement of the lemma is bounded above by P_1 + P_2 + P_3, where
P_1 := Pr{max_{m≤k≤M} ‖S(t_k) − S(t_m) − (T(t_k) − T(t_m))‖ ≥ εt_M^{1/2}/2},
P_2 := Pr{‖S(t_m)‖ ≥ εt_M^{1/2}/4},
P_3 := Pr{‖T(t_m)‖ ≥ εt_M^{1/2}/4}.
It follows from (9.3.10) and (9.3.4) for n ≤ n_k that P_2 ≤ 2ε⁶ for M ≥ M_2(ε) large enough. Since the z_j have just as good properties as the X_j do, we get also P_3 ≤ 2ε⁶ for M ≥ M_2(ε). Now S(t_k) − S(t_m) = Σ_{m≤i<k} n_i^{1/2} V_i and likewise T(t_k) − T(t_m) = Σ_{m≤i<k} n_i^{1/2} U_i. Thus by (9.3.11),
P_1 ≤ Pr{Σ_{m≤k<M} n_k^{1/2} ‖V_k − U_k‖ ≥ εt_M^{1/2}/2} ≤ Σ_{m≤k<M} Pr{‖V_k − U_k‖ ≥ ε³/18} ≤ sε⁶ ≤ ε/2
for ε small enough and m ≥ k_0, using (9.3.5), (9.3.6), and (9.3.8). The lemma follows. □
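The blocking facts (9.3.7), (9.3.10), and (9.3.11) can be spot-checked numerically; the sketch below (an aside, with hypothetical sample values ε = 0.2 and M = 9500) does so. Note that (9.3.8) needs a much smaller ε than is convenient to tabulate, so it is not checked here:

```python
import math

# Numerical check of the blocking facts (9.3.7), (9.3.10), (9.3.11) for a
# sample epsilon and a sample block index M (both chosen for convenience).
eps = 0.2
q = 1 + eps ** 4                    # growth ratio of the blocks
s = int(math.log(3 / eps ** 4) / math.log(q)) + 1
assert q ** s > 3 / eps ** 4        # (9.3.7)

M = 9500                            # large enough that eps**4 * q**(M-s) >= 3
m = M - s
assert eps ** 4 * q ** m >= 3
t = [int(q ** k) for k in range(M + 2)]       # t_k = [(1 + eps**4)**k]
n = [t[k + 1] - t[k] for k in range(M + 1)]   # n_k = t_{k+1} - t_k
assert t[m] <= n[M - 1]             # (9.3.10)
total = sum(math.sqrt(n[k]) for k in range(m, M + 1))
assert total <= 9 * eps ** -2 * math.sqrt(t[M])   # (9.3.11)
print("(9.3.7), (9.3.10), (9.3.11) hold for eps =", eps, "M =", M)
```

The point of (9.3.10) and (9.3.11) is visible in the numbers: the last single block is already longer than all of the first m blocks together, while the square roots of the block lengths sum to only a bounded multiple of t_M^{1/2}.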
Now to finish the proof of Theorem 9.3.1, given n, let M := M(n) be such that t_{M−1} < n ≤ t_M and again m := M − s. Let n be large enough so that m ≥ k_2(ε/2) in Lemma 9.3.3 and M ≥ max(M_0(ε), k_2(ε) + 1). Then
Pr{max_{j≤n} ‖S_j − T_j‖ ≥ 5εn^{1/2}} ≤ q_1 + q_2 + q_3 + q_4 + q_5, where
q_1 := Pr{max_{j≤t(m)} ‖S_j‖ ≥ εn^{1/2}},
q_2 := Pr{max_{j≤t(m)} ‖T_j‖ ≥ εn^{1/2}},
q_3 := Pr{max_{m≤k≤M} ‖S(t_k) − T(t_k)‖ ≥ 2εn^{1/2}},
q_4 := Pr{max_{m≤k<M} max_{t(k)<j≤t(k+1)} ‖S_j − S(t_k)‖ ≥ εn^{1/2}},
q_5 := Pr{max_{m≤k<M} max_{t(k)<j≤t(k+1)} ‖T_j − T(t_k)‖ ≥ εn^{1/2}}.
Since t_m ≤ n_{M−1} by (9.3.10) and n > t_{M−1}, Lemma 9.3.3 (the X_i being i.i.d.) gives
q_1 ≤ Pr{max_{r≤n(M−1)} ‖Σ_{t(M−1)<i≤t(M−1)+r} X_i‖ ≥ εt(M − 1)^{1/2}} ≤ 4ε⁶.
Then likewise, q_2 ≤ 4ε⁶. For q_3, we get an upper bound by replacing 2εn^{1/2} by εt(M)^{1/2}, then applying Lemma 9.3.9 to get q_3 < ε. Next,
q_4 ≤ Σ_{m≤k<M} Pr{max_{t(k)<j≤t(k+1)} ‖S_j − S(t_k)‖ ≥ εn^{1/2}}.
Another upper bound is obtained by replacing n by t_k for each k. Then applying Lemma 9.3.3 with k ≥ m ≥ k_2(ε/2) and (9.3.8) again gives a bound 4(ε/2)⁶s < ε. Likewise, q_5 < ε. So the sum of all five q_j is less than 5ε for n large enough.
The remaining difficulty is that the construction depends on ε. Now, for p = 1, 2, …, it has been shown that there are sequences {X_i^{(p)}, i ≥ 1} i.i.d. (P) with partial sums S_k^{(p)} and {Y_i^{(p)}, i ≥ 1} i.i.d. N(0, C) with partial sums T_k^{(p)} such that for n ≥ n_0(p),
(9.3.12)  Pr{n^{−1/2} max_{k≤n} ‖S_k^{(p)} − T_k^{(p)}‖ > 2^{−p}} < 2^{−p}.
We can assume that the X_i^{(p)} are independent for different p, as are the Y_i^{(p)}. Let r(p) := Σ_{q=1}^p n_0(q), p ≥ 2, and r(1) := 0. For each i = 1, 2, …, there is a unique p such that r(p) < i ≤ r(p + 1). Define
(9.3.13)  X_i := X_{i−r(p)}^{(p)},  Y_i := Y_{i−r(p)}^{(p)},  i ≥ 1.
Now (9.3.2) will be verified for the given X_i and Y_i. Clearly the X_i are i.i.d. (P) and the Y_i are i.i.d. N(0, C). Let ε > 0 and take p_0 := p(0) ≥ 2 such that 2^{−p(0)} < ε. Take N_0 large enough so that for all n ≥ N_0,
Pr{n^{−1/2} max_{k≤r(p(0))} ‖S_k‖ > ε} < ε  and  Pr{n^{−1/2} max_{k≤r(p(0))} ‖T_k‖ > ε} < ε.
Let n ≥ max(N_0, n_0(p_0)) and take the unique M such that r(M) < n ≤ r(M + 1). Then
n^{−1/2} max_{k≤n} ‖S_k − T_k‖ ≤ U_1 + U_2 + Σ_{p=p(0)}^{M−1} V_p + U_3,
where
U_1 := n^{−1/2} max_{k≤r(p(0))} ‖S_k‖,  U_2 := n^{−1/2} max_{k≤r(p(0))} ‖T_k‖,
V_p := max_{r(p)<j≤r(p+1)} n^{−1/2} ‖Σ_{i=r(p)+1}^j (X_i − Y_i)‖,
and
U_3 := max_{r(M)<j≤n} n^{−1/2} ‖Σ_{i=r(M)+1}^j (X_i − Y_i)‖.
Then Pr(U_i > ε) < ε for i = 1, 2. Now by (9.3.13),
V_p = n^{−1/2} max_{1≤k≤n_0(p+1)} ‖Σ_{i=1}^k (X_i^{(p)} − Y_i^{(p)})‖,
so for 2 ≤ p(0) ≤ p ≤ M − 1, since n ≥ r(M) ≥ n_0(p + 1) and n ≥ n_0(p), (9.3.12) gives Pr(V_p > 2^{−p}) < 2^{−p}. Further, since n ≥ r(M) ≥ n_0(M) and n ≥ n − r(M), we have Pr(U_3 > 2^{−M}) < 2^{−M}. Thus
Σ_{p=p(0)}^{M−1} V_p + U_3 ≤ 2^{1−p(0)} < 2ε
except on an event A with Pr(A) < 2ε. So
max_{k≤n} ‖S_k − T_k‖/n^{1/2} ≤ 4ε
except on a set of probability < 4ε, and Theorem 9.3.1 is proved. □
There is another form of the invariance principle which is older but in some ways more abstract. Let C[0, 1] be the space of all continuous real functions on [0, 1] with supremum norm. For a sequence X_1, X_2, … of real random variables, independent with distribution P, with partial sums S_n := X_1 + ⋯ + X_n, let μ_{n,P} be the law of the random function on [0, 1] which equals S_k/n^{1/2} at k/n for k = 0, 1, …, n (where S_0 := 0) and which is linear on each closed interval [(k − 1)/n, k/n], k = 1, …, n. Then μ_{n,P} is a law on C[0, 1]. Let μ be the law of the Wiener process (Brownian motion) restricted to [0, 1]. This can be chosen as a law on C[0, 1] by sample continuity (RAP, Theorem 12.1.5).

9.3.14 Corollary. For any law P with mean 0 and variance 1, the laws μ_{n,P} converge to μ on C[0, 1] as n → ∞.
Proof. First let P = N(0, 1) =: Q. For f ∈ C[0, 1] let T_n(f) be the function which equals f at k/n for k = 0, 1, …, n and is linear in between (on each interval [(k − 1)/n, k/n] for k = 1, …, n). Then T_n(f) → f in the supremum norm for each f ∈ C[0, 1]. On the other hand the law of T_n for μ is exactly μ_{n,Q} in this case, so μ_{n,Q} → μ. Thus for the Prokhorov metric ρ for laws on C[0, 1], ρ(μ_{n,Q}, μ) → 0 by RAP, Theorem 11.3.3. For two functions, each linear on an interval [(k − 1)/n, k/n], the supremum distance between the two functions on that interval is attained at one of its endpoints. So it follows from Theorem 9.3.1 that ρ(μ_{n,Q}, μ_{n,P}) → 0 as n → ∞. The triangle inequality gives ρ(μ_{n,P}, μ) → 0, so μ_{n,P} → μ. □

Corollary 9.3.14 can be applied to sets of functions falling between two given functions, as follows. Recall that a function f on a metric space (S, d) is said to satisfy a Hölder condition of order r, for 0 < r ≤ 1, if there is a constant K < ∞ such that for all x, y ∈ S, |f(x) − f(y)| ≤ K d(x, y)^r.

9.3.15 Theorem. Again let P have mean 0 and variance 1. Suppose a(·) and b(·) are two real-valued functions on [0, 1], each satisfying a Hölder condition of order r > 1/2, such that a(0) < 0 < b(0) and a(t) < b(t) for 0 ≤ t ≤ 1. Let (a, b) := {f : a(t) < f(t) < b(t) for 0 ≤ t ≤ 1}. Then μ_{n,P}((a, b)) → μ((a, b)) as n → ∞.

Proof. The boundary of (a, b) in C[0, 1] with supremum norm is easily seen
to be the set of all f ∈ C[0, 1] such that a(t) ≤ f(t) ≤ b(t) for 0 ≤ t ≤ 1 and for some t ∈ [0, 1], f(t) = a(t) or f(t) = b(t). The result follows from Corollary 9.3.14 and the portmanteau theorem (RAP, Theorem 11.1.1) if one shows that all the sets (a, b) are continuity sets for μ, in other words that their
boundaries ∂(a, b) have probability 0 for µ. To see that, let W_t, 0 ≤ t ≤ 1, be a sample-continuous Wiener process (Brownian motion). Let τ be the least t ∈ [0, 1], if one exists, such that W_t = a(t) or W_t = b(t). Clearly, the probability that τ = 1 is 0. If 0 ≤ τ < 1, we apply the strong Markov property (RAP, Theorem 12.2.3). Thus, conditional on the σ-algebra of events up to time τ, W_{τ+h} − W_τ is equal in distribution to W_h, h ≥ 0. Now, there is a local law of the iterated logarithm for W_t at t = 0, namely, for v(t) := (2t log log(1/t))^{1/2}, almost surely lim sup_{t↓0} W_t/v(t) = − lim inf_{t↓0} W_t/v(t) = 1. This follows from the law at infinity (RAP, Theorem 12.5.2) and the fact that the process {tW_{1/t}}_{t>0} has the same distribution as {W_t}_{t>0}, being Gaussian with mean 0 and the same covariances. So, suppose W_τ = b(τ). Then b(τ + h) − b(τ) < v(h)/2 for small enough h > 0 by the Hölder condition, so we will have almost surely W(τ + h) > b(τ + h) for some h with 0 < h < 1 − τ. So the path t ↦ W_t, 0 ≤ t ≤ 1, will not be in the boundary of (a, b). A similar argument holds if W_τ = a(τ), so the proof is done.
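The comparison used above, of the Hölder increment K h^r against half the local iterated-logarithm scale v(h) = (2h log log(1/h))^{1/2}, can be checked numerically. The constants K = 1 and r = 0.6 below are illustrative choices, not taken from the text; the point is only that h^r = o(v(h)) as h ↓ 0 whenever r > 1/2.

```python
import math

# v(h) = sqrt(2 h log log(1/h)), the local LIL scale at 0
def v(h):
    return math.sqrt(2.0 * h * math.log(math.log(1.0 / h)))

K, r = 1.0, 0.6          # sample Hölder constant and order r > 1/2
ratios = []
for k in range(3, 9):    # h = 1e-3, ..., 1e-8
    h = 10.0 ** (-k)
    assert K * h ** r < v(h) / 2.0    # Hölder increment stays below v(h)/2
    ratios.append(K * h ** r / v(h))
# the ratio tends to 0 as h -> 0, since h^r = o(h^{1/2}) for r > 1/2
assert all(a > b for a, b in zip(ratios, ratios[1:]))
```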
9.4 Invariance principles for empirical processes
Recall the notion of coherent G_P process (Section 3.1). Let (A, A, P) be a probability space and F ⊂ L²(A, A, P). Let Ω be the product of ([0, 1], B, λ), where B is the Borel σ-algebra and λ is Lebesgue measure, and a countable product of copies of (A, A, P), with coordinates x_j. Then F will be called a functional Donsker class for P if it is pregaussian for P and there are independent coherent G_P processes Y_j(f, ω), f ∈ F, ω ∈ Ω, such that f ↦ Y_j(f, ω) is bounded and ρ_P-uniformly continuous for each j and almost all ω, and such that, in outer probability,
(9.4.1)  n^{−1/2} max_{m≤n} || Σ_{j=1}^m (δ_{x_j} − P − Y_j) ||_F → 0.
Recalling the notion of Donsker class (as defined in Section 3.1), we have an equivalence:
9.4.2 Theorem Given any probability space (A, A, P) and F ⊂ L²(A, A, P), F is a functional Donsker class if and only if it is a Donsker class.
Proof The "only if" direction is easy and will be proved first. Let F be a functional Donsker class for P. Let x_j and Y_j be as in (9.4.1). Given ε > 0,
Sums in General Banach Spaces and Invariance Principles
take n_0 large enough so that for n ≥ n_0,
Pr*{ n^{−1/2} || Σ_{j=1}^n (δ_{x_j} − P − Y_j) ||_F > ε/3 } < ε/2.
By definition, F is pregaussian for P, so F is totally bounded for ρ_P by the Sudakov-Chevet theorem (2.3.5). Also, there is a δ > 0, by Theorem 3.1.1, such that for a coherent version of G_P,
Pr{ sup { |G_P(f) − G_P(g)| : f, g ∈ F, ρ_P(f, g) < δ } > ε/3 } < ε/2.
By assumption, each Y_j, and so each (Y_1 + ⋯ + Y_n)/n^{1/2}, is a coherent version of G_P. Combining gives an asymptotic equicontinuity condition for the empirical processes ν_n, proving "only if" by Theorem 3.7.2.
To prove "if," assume F is a P-Donsker class and again apply Theorem 3.7.2.
So F is totally bounded for ρ_P and we have an asymptotic equicontinuity condition: for each k = 1, 2, …, there is a δ_k > 0 and an N_k < ∞ such that
(9.4.3)  Pr*{ sup { |ν_n(f) − ν_n(g)| : f, g ∈ F, ρ_P(f, g) < δ_k } > 2^{−k} } < 2^{−k},  n ≥ N_k.
We may assume N_k increases with k. As in the proof of Theorem 3.7.2, we have for a coherent G_P that
(9.4.4)  Pr{ sup { |G_P(f) − G_P(g)| : f, g ∈ F, ρ_P(f, g) < δ_k } > 2^{−k} } < 2^{−k}.
Let U := UC(F) denote the set of all prelinear real-valued functions on F uniformly continuous for ρ_P. Then U is a separable subspace of S := ℓ^∞(F), as shown after the proof of Lemma 2.5.4. F is pregaussian for P, and G_P has a law µ defined on the Borel sets of U by Theorem 2.5.5. For k = 1, 2, …, let F_k be a finite subset of F such that
sup_{f∈F} min { ρ_P(f, g) : g ∈ F_k } < δ_k.
Let T_k denote the finite-dimensional space of all real functions on F_k, with supremum norm || ||_k. Let m(k) be the number of functions in F_k and let F_k = {g_1, …, g_{m(k)}}. For each f ∈ F let f_k := g_j for the least j such that ρ_P(f, g_j) < δ_k. For any φ ∈ S let φ_k(f) := φ(f_k), f ∈ F. Then φ_k ∈ S. Let A_k φ := φ_k and X_{kj} := X_j − A_k X_j, where X_j := δ_{x_j} − P. There is a complete separable subspace T of S which includes both U and the (finite-dimensional) ranges of all the A_k.
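The net projection f ↦ f_k and the operator A_k can be sketched concretely. The class, the net, and the functional below are toy choices, not from the text, with ρ_P replaced by an L² distance over finitely many sample points; the sketch only illustrates that (A_k φ)(f) := φ(f_k) is well defined given a δ-net and satisfies ||A_k φ|| ≤ ||φ||.

```python
import math

def rho(f, g):
    # stand-in for rho_P: L2 distance under the uniform measure on coordinates
    n = len(f)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / n)

def project(f, net, delta):
    # f_k: the first net element within delta of f
    for g in net:
        if rho(f, g) < delta:
            return g
    raise ValueError("net is not a delta-net for f")

# toy class F (functions listed by their values at 4 points) and a 0.3-net
F = [(0.0, 0.1, 0.2, 0.3), (0.05, 0.1, 0.2, 0.35), (1.0, 1.0, 0.9, 0.8)]
net = [(0.0, 0.1, 0.2, 0.3), (1.0, 1.0, 0.9, 0.8)]
delta = 0.3
phi = {g: max(g) for g in net}                       # a functional on the net
A_phi = {f: phi[project(f, net, delta)] for f in F}  # (A_k phi)(f) := phi(f_k)
assert max(abs(v) for v in A_phi.values()) <= max(abs(phi[g]) for g in net)
```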
Let || || := || ||_F from here on. Note that ||A_k φ|| ≤ ||φ|| for all k and all φ ∈ S. Then by (9.4.3) we have for n ≥ N_k,
(9.4.5)  Pr*{ || n^{−1/2} Σ_{j=1}^n X_{kj} || > 2^{−k} } = Pr*{ sup_{f∈F} |ν_n(f − f_k)| > 2^{−k} } < 2^{−k}.
Thus by Lemma 3.2.6,
Pr{ ( || n^{−1/2} Σ_{j=1}^n X_{kj} || )* > 2^{−k} } < 2^{−k},  n ≥ N_k.
Then if n ≥ 2N_k and m = 0, 1, …, n, we have by Lemma 3.2.3 a.s.
( || n^{−1/2} Σ_{j=1}^m X_{kj} || )* ≤ ( || n^{−1/2} Σ_{j=1}^n X_{kj} || )* + ( || n^{−1/2} Σ_{j=m+1}^n X_{kj} || )*,
and since either n − m ≥ N_k or m ≥ N_k,
Pr{ ( || n^{−1/2} Σ_{j=m+1}^n X_{kj} || )* > 2^{1−k} } < 2^{1−k}.
Thus for k ≥ 2, Ottaviani's inequality (Lemma 9.1.6) with c = 1/2 gives
(9.4.6)  Pr*{ n^{−1/2} max_{m≤n} || Σ_{j=1}^m X_{kj} || > 2^{2−k} } < 2^{2−k}
for n ≥ 2N_k. Let P_k be the law on T_k of f ↦ f(x_1) − ∫ f dP, f ∈ F_k. Then by Theorem 9.3.1 there exist V_{kj} i.i.d. P_k, W_{kj} i.i.d. with a Gaussian law Q_k, and some n_k := n(k) > 2N_k such that for all n ≥ n_k, k ≥ 1,
(9.4.7)  Pr{ n^{−1/2} max_{m≤n} || Σ_{j≤m} (V_{kj} − W_{kj}) ||_k > 2^{−k} } < 2^{−k}.
Also, Q_k must be the law of the restriction of G_P to F_k. We can assume that the sequences {(V_{kj}, W_{kj})}_{j≥1} are independent of each other for different k. It can and will be assumed that n_0 := 1 < n_1 < n_2 < ⋯. For each j = 1, 2, 3, …, if n_k ≤ j < n_{k+1}, write k := k(j) and set V_j := V_{kj}, W_j := W_{kj}. Then {V_j}_{j≥1} and {W_j}_{j≥1} are sequences of independent random variables. (Note: the notations n_k, V_k, W_k are all different from those in the previous section.) Each sequence has its values in a countable product of Polish spaces, which is itself Polish (RAP, Theorem 2.5.7). Let {Y_j, j ≥ 1} be i.i.d. with law µ on U. Then W_j has the law of the restriction Y_j ↾ F_k of Y_j to F_k
for n_k ≤ j < n_{k+1}. Let T_{kj} be a copy of T_k and U_j of U for each j, and let T(j) := T_{k(j)}, F(j) := F_{k(j)}. Let T_∞ := Π_{j≥1} T(j) and U_∞ := Π_{j≥1} U_j. Then apply the Vorob'ev-Berkes-Philipp theorem (1.1.10) to P := L({(V_j, W_j)}_{j≥1}) and L({(Y_j ↾ F(j), Y_j)}_{j≥1}). So we can assume W_j = Y_j ↾ F(j) for all j ≥ 1. Now for ((x_j)_{j≥1}, t) ∈ Ω := A^∞ × [0, 1], let
V(ω) := { f ↦ f(x_j) − ∫ f dP, f ∈ F(j) }_{j≥1} ∈ T_∞.
Let Q := L({(V_j, W_j, Y_j)}_{j≥1}) on T_∞ × T_∞ × U_∞. By Lemma 3.7.3, V_j and Y_j can be taken to be defined on Ω with V(ω)_j = V_j for all j. It remains to prove (9.4.1) for the given Y_j, with X_j(f) := f(x_j) − ∫ f dP, f ∈ F. Given 0 < ε < 1, take k large enough so that 2^{5−k} < ε. We have X_j ↾ F(j) = V_j, j ≥ 1. Let M_k ≥ n_k be large enough so that for all n ≥ M_k,
(9.4.8)  Pr*{ n^{−1/2} Σ_{j=1}^{n(k)} ( ||A_k X_j|| + ||Y_j|| ) > 2^{−k} } < 2^{−k}.
Fix n ≥ M_k. Then there is a unique r such that n_r ≤ n < n_{r+1}, and
(9.4.9)  max_{m≤n} || Σ_{j≤m} (X_j − Y_j) || ≤ max_{m<n(k)} || Σ_{j≤m} (X_j − Y_j) || + Σ_{i=k}^{r−1} max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m (X_j − Y_j) || + max_{n(r)≤m≤n} || Σ_{n(r)≤j≤m} (X_j − Y_j) ||.
Then almost surely
L_k := max_{m<n(k)} || Σ_{j≤m} (X_j − Y_j) || ≤ max_{m<n(k)} || Σ_{j≤m} X_{kj} || + Σ_{j<n(k)} ( ||A_k X_j|| + ||Y_j|| ).
Thus by (9.4.6) and (9.4.8),
(9.4.10)  Pr*( L_k > 2^{3−k} n^{1/2} ) < 2^{3−k}.
Next, restriction to F_i is a linear isometry from A_i S to (T_i, || ||_i), and for any f ∈ F and n(i) ≤ j < n(i + 1),
(X_j − Y_j)(f) = X_j(f) − X_j(f_i) + X_j(f_i) − Y_j(f_i) + Y_j(f_i) − Y_j(f) = X_{ij}(f) + (V_j − W_j)(f_i) + (A_i Y_j − Y_j)(f),
so we have for k ≤ i ≤ r,
Δ_i := max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m (X_j − Y_j) || ≤ max_{n(i)≤m<n(i+1)} { || Σ_{j=n(i)}^m X_{ij} || + || Σ_{j=n(i)}^m (V_j − W_j) ||_i + || Σ_{j=n(i)}^m (Y_j − A_i Y_j) || }.
Since n ≥ n_r > 2N_i, we have by (9.4.6), and the fact that i.i.d. variables have a joint distribution preserved by any shift of indices (stationarity), that if k ≤ i ≤ r,
Pr*{ max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m X_{ij} || > 2^{2−i} n^{1/2} } < 2^{2−i}.
Likewise by (9.4.7), if k ≤ i ≤ r,
Pr{ n^{−1/2} max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m (V_j − W_j) ||_i > 2^{−i} } < 2^{−i}.
Next, (n_{i+1} − n_i)^{−1/2} Σ_{j=n(i)}^{n(i+1)−1} Y_j is equal in law to a version of G_P (or Y_1) with uniformly continuous sample functions. Also, for i ≤ r, n ≥ n_{i+1} − n_i. By Lévy's inequality (Lemma 9.1.9) we have
Pr{ n^{−1/2} max_{n(i)≤m<n(i+1)} || Σ_{j=n(i)}^m (Y_j − A_i Y_j) || > 2^{−i} } ≤ 2 Pr{ n^{−1/2} || Σ_{j=n(i)}^{n(i+1)−1} (Y_j − A_i Y_j) || > 2^{−i} } ≤ 2 Pr{ || Y_1 − A_i Y_1 || > 2^{−i} } < 2^{1−i}
by (9.4.4). Collecting terms, we have
Pr( Δ_i > 2^{3−i} n^{1/2} ) < 2^{3−i},  k ≤ i < r.
For i = r, replacing n_{i+1} by n + 1 throughout yields the same result. From this, (9.4.9), and (9.4.10), we obtain, since 2^{5−k} < ε,
Pr{ max_{m≤n} || Σ_{j≤m} (X_j − Y_j) || > n^{1/2} ε } ≤ 2^{3−k} + Σ_{i=k}^r 2^{3−i} < ε,
proving (9.4.1) as desired and so Theorem 9.4.2.
**9.5 Log log laws and speeds of convergence
As usual with double-starred sections at the ends of chapters, statements in this section are provided with literature references but are not proved here.
9.5(A) Log log laws and strong invariance principles
A class F of functions in L² of a probability space (X, A, P) will be said to satisfy the bounded law of the iterated logarithm, or bounded log log law, for the empirical measures P_n of P if and only if lim sup_{n→∞} ||ν_n||_F / (log log n)^{1/2} < ∞ almost surely. Also, F will be said to satisfy a strong invariance principle if it satisfies the definition of functional Donsker class except that instead of (9.4.1) we have
(9.5.1)  (n log log n)^{−1/2} max_{m≤n} || Σ_{j=1}^m (δ_{x_j} − P − Y_j) ||_F → 0
almost uniformly; in other words, the expression in (9.5.1) is dominated by a measurable function g_n for each n such that g_n → 0 almost surely. Neither (9.4.1) nor (9.5.1) necessarily implies the other. Even if F is both a functional Donsker class and satisfies a strong invariance principle for P, I don't know whether the Y_j can be chosen the same way in both (9.4.1) and (9.5.1). Let Ly := max(1, log y) as usual. Under a condition on the envelope function, we do have an implication (Dudley and Philipp, 1983, Theorem 1.2):
9.5.2 Theorem Let F be a Donsker class for P with an envelope function F(x) := ||δ_x||_F^*. If ∫ F(x)²/LLF(x) dP(x) < ∞, then F satisfies a strong invariance principle.
The proof of Theorem 9.5.2 is rather long and technical. It is based on corresponding results in separable Banach spaces due to Heinkel (1979) and Goodman, Kuelbs and Zinn (1981), depending in turn on Kuelbs and Zinn (1979). Substantial further work, given in Dudley and Philipp (1983, Sections 4 and 5), seemed to be needed for the extension to the nonseparable case. If the condition on the envelope function in Theorem 9.5.2 is strengthened a
little to ∫ F² dP < ∞, then the proof is substantially simpler. Here a proof of the log log law in the separable case is due to Pisier (1976). Kuelbs and Dudley (1980) proved a log log law for Donsker classes of sets under a measurability condition, as did Koltchinskii (1981) for uniformly bounded classes of functions satisfying some entropy conditions. If the norm || ||_F is separable on the linear span of all δ_x for x ∈ X, in other words, if we consider i.i.d. random variables with values in a separable Banach space, and if the central limit theorem (Donsker property) holds, then finiteness of the integral in Theorem 9.5.2 is not only sufficient but also necessary for the law of the iterated logarithm to hold (e.g., Ledoux and Talagrand, 1991, Theorems 8.6 and 10.12).
9.5.3 Corollary If F = {1_A : A ∈ C} for a class C of measurable sets, and if F is a Donsker class, then F also satisfies a strong invariance principle.
In connection with Theorem 9.5.2 one can ask what conditions on envelope functions are necessary, or sufficient along with other conditions, for the Donsker property. An image admissible Suslin, VC subgraph or major class of functions will be called a VCSM class or VCMM class, respectively. On a probability space, the Lorentz space L^{2,1} is defined as the space of real random variables X such that ∫_0^∞ P(|X| > t)^{1/2} dt < ∞. A random variable X will be called weakly L², or X ∈ WL², if and only if t² P(|X| > t) → 0 as t → +∞. Recall that an envelope function of a class F of functions is F_1(x) := ||δ_x||_F.
In the next theorem, one can take F(x) := sup_{f∈F} |f(x)| := F_1(x) to be measurable itself and thus "the" envelope function. The following equivalences, due to several authors, including Alexander (1987) and Giné and Zinn (1986, Chapter 1, Proposition 2.7), are collected in Dudley and Koltchinskii (1994, Theorem B):
9.5.4 Theorem Let (X, A, P) be any nonatomic probability space and F a nonnegative measurable function on X. Then the following are equivalent:
(a) P(F < ∞) > 0 and F is the envelope of some P-Donsker class of functions on (X, A, P);
(b) P(F < ∞) > 0 and F is the envelope of some VC subgraph P-Donsker class of functions on (X, A, P);
(c) P(F < ∞) > 0 and F is the envelope of some VC major P-Donsker class of functions on (X, A, P);
(d) F ∈ WL²(P).
Notably, while in finite-dimensional spaces the finiteness of second moments (of the norm) is necessary for the CLT (Theorem 9.2.2), in the infinite-dimensional case it is not, as Jain (1976) and Pisier (1976, Section 7) first showed, but only weak L² is necessary. On the other hand, for every VCMM class with envelope F to be P-Donsker, a condition stronger than L², namely F ∈ L^{2,1}, is needed. Dudley and Koltchinskii (1994, Theorem 3.1 and Proposition 4.2) show that:
9.5.5 Theorem If (X, A, P) is a probability space and F ∈ L^{2,1}(P), then every VCMM class F with envelope F is P-Donsker. Conversely, if (X, A, P) is, for example, [0, 1] with Lebesgue measure, and if F is measurable but not in L^{2,1}(P), then there exists a VCMM class F with envelope F which is not P-Donsker.
If F is a VCSM class (not necessarily Donsker), F satisfies the condition in Theorem 9.5.2, and sup{σ_P(f) : f ∈ F} < ∞, then F satisfies a bounded log log law. For this and some related facts see Alexander and Talagrand (1989).
9.5(B) Speeds of convergence
Komlós, Major and Tusnády (1975, 1976) found the best possible speed of convergence in the invariance principle for i.i.d. real-valued random variables X_1, X_2, …, having a finite moment-generating function E exp(tX_1) < ∞ for |t| ≤ t_0 for some t_0 > 0, namely, that then in (9.3.2), the variables X_i and Y_i can be chosen so that ||S_n − T_n|| = O(log n) a.s. Let P be the uniform (Lebesgue) law on the unit cube I^d in R^d. Let
X(d) := { [0, x_1] × [0, x_2] × ⋯ × [0, x_d] : 0 ≤ x_1, x_2, …, x_d ≤ 1 },
a collection of subsets of I^d, and let B(d) be the collection of all intersections with I^d of d-dimensional closed balls {y : |x − y| ≤ r} for r ≤ 1 and x ∈ R^d. Komlós, Major and Tusnády (1975, 1976) for d = 1 (see also Bretagnolle and Massart, 1989), and Tusnády (1977) for d = 2, showed that on some probability space there exist G_P processes Y_n and empirical processes ν_n for each n such that
|| ν_n − Y_n ||_{X(d)} = O( (log n)^d / n^{1/2} )
almost surely as n → ∞. J. Beck (1985) showed that for some constants c_d depending only on the dimension d, for any possible choices of ν_n and G_P,
Pr( || ν_n − G_P ||_{B(d)} > c_d n^{−1/(2d)} ) > 1 − e^{−n}.
Thus the convergence in the central limit theorem uniformly over balls is relatively slow for d ≥ 2.
Massart (1989) considered the uniform law P = U(I^d) on the unit cube I^d in R^d and "UM" (uniform Minkowski) classes C of sets in R^d for which there is a constant K < ∞ such that for 0 < ε ≤ 1, sup_{A∈C} λ_d((∂A)^ε) ≤ Kε, where ∂A := Ā \ int A is the boundary of A and C^ε := {y : d(x, y) < ε for some x ∈ C}. Massart shows that if a VC class C of subsets of R^d is UM and image admissible Suslin, then there exist G_P processes B_n such that as n → ∞, ||ν_n − B_n||_C = O((Ln)^{3/2} n^{−1/(2d)}) a.s., where Ln := max(1, log n). Evidently the uniform Minkowski condition is important here, since the VC index S(C) doesn't appear in the bound except perhaps for the constant in the O. For the class of balls, Massart's upper bound is (Ln)^{3/2} times Beck's lower bound, so the exponent −1/(2d) on n is sharp. (Note: Massart uses a definition of "strong invariance principle" quite different from the one given early in this section.)
If C is a UM class of sets in R^d such that for some M < ∞ and 0 < ξ < 1 we have N_I(ε, C, P) ≤ exp(Mε^{−ξ}) for 0 < ε ≤ 1, Massart (1989) shows that, still for P = U(I^d), there exist G_P processes Y_n with ||ν_n − Y_n||_C = O((Ln) n^{−(1−ξ)/(2d)}) a.s. In particular, for d ≥ 2 and the class I(d, α, K) of sets in R^d with boundaries differentiable through order α (Section 8.2), one has ξ = (d − 1)/α and so, if α > d − 1, there exist G_P processes Y_n such that
||ν_n − Y_n||_{I(d,α,K)} = O((Ln) n^{−(α−d+1)/(2αd)})
a.s., improving on earlier results of Révész (1976). For the class C(2) := C_2 of convex sets in R², Massart's result implies that for some G_P processes Y_n, ||ν_n − Y_n||_{C(2)} = O((Ln) n^{−1/8}) a.s. Koltchinskii (1994) gives some general results on speed of convergence for empirical processes, implying some of the above particular facts, and surveys some other work done in the meantime, for example, Rio (1993a,b; 1994).
Problems
1. For i.i.d. real-valued bounded random variables with mean 0 and variance σ², compare the bounds for P(|S_n| > K) given by Lemma 9.1.1 and by Bernstein's inequality (1.3.2). Hint: Note that E|S_n| ≤ σ n^{1/2}.
2. Let X_1, X_2, … be real-valued i.i.d. random variables with mean 0 and variance 1. Show that for any t > 0, as n → ∞, P(max_{k≤n} S_k > tn^{1/2}) → 2(1 − Φ(t)), where Φ is the standard normal distribution function.
3. Let X_1, X_2, … be i.i.d. random variables with values in R^d, with mean 0 and identity covariance matrix, and let T_n be the random function equal to S_k/n^{1/2} at t = k/n for k = 0, 1, …, n and linear on each interval [(k − 1)/n, k/n] for k = 1, …, n. Let C([0, 1], d) be the space of continuous functions from [0, 1] into R^d, with the norm ||f|| := sup_{0≤t≤1} |f(t)|. Show that the laws of T_n converge on C([0, 1], d) to the law of x_t := (x_{t1}, …, x_{td}), 0 ≤ t ≤ 1, where the x_{tj} are i.i.d. Brownian motions, j = 1, …, d.
4. Let F(x) := x^α (1 + |log x|)^β for 0 < x < 1. Let P be the uniform U[0, 1] distribution. For what real values of α, β is it true that
(a) F is the envelope function of some P-Donsker class on (0, 1);
(b) every VCSM class (defined in Section 9.5) with envelope ≤ F is P-Donsker;
(c) every P-Donsker class with envelope F satisfies a strong invariance principle;
(d) the class {tF : −1 ≤ t ≤ 1} is a P-Donsker class. Does (a) imply (d)?
5. Answer the same questions as in the previous problem but now on 1 ≤ x < ∞, where P is the Pareto law with density 1/x².
6. Let X_1, X_2, … be i.i.d. real random variables and h a function from R to R, not assumed measurable. Suppose that n^{−1} Σ_{j=1}^n h(X_j) converges to 0 in outer probability as n → ∞. Prove or disprove: h(X_1) = g(X_1) a.s. for some Borel measurable function g. (Compare Section 9.2.)
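The limit in Problem 2 can be explored by simulation. The sketch below (with arbitrary choices of n, trial count, and tolerance) compares the empirical frequency of max_{k≤n} S_k ≥ t n^{1/2} with the reflection-principle limit 2(1 − Φ(t)); agreement is only approximate at finite n because the walk is monitored at discrete times.

```python
import math
import random

random.seed(0)
n, trials, t = 400, 4000, 1.0
hits = 0
for _ in range(trials):
    s = m = 0.0
    for _ in range(n):
        s += random.gauss(0.0, 1.0)     # mean 0, variance 1 increments
        m = max(m, s)
    if m >= t * math.sqrt(n):
        hits += 1
# Phi via the error function: Phi(t) = (1 + erf(t/sqrt(2)))/2
limit = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
assert abs(hits / trials - limit) < 0.06
```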
Notes
Notes to Section 9.1. All the lemmas in the section first appeared in their current forms in Dudley and Philipp (1983). Lemma 9.1.1 extends Lemma 2.1 of Kuelbs (1977). The proof of Lemma 9.1.9 is adapted from Kahane (1985, Section 1.3, Lemma 1). The proof of Lemma 9.1.10 in the measurable case appears in Hoffmann-Jørgensen (1974, p. 164); see also Jain and Marcus (1975, Lemma 3.4). Kahane (1968, Section 1.5) gave a precursor of Lemma 9.1.10.
Notes to Section 9.2. Theorem 9.2.2 appears in Gnedenko and Kolmogorov (1949, Section 35, Theorem 4). The other results of this section are essentially from Dudley and Philipp (1983, Section 8).
Notes to Section 9.3. Theorem 9.3.1, and most of the section, are based on a special case (α = 2) of Theorem 1 of Philipp (1980). The first invariance principles, when the X_i are real-valued, were in C[0, 1] as in Corollary 9.3.14. Kolmogorov (1931, 1933) proved Theorem 9.3.15 in case a(·) and b(·) have continuous derivatives. Kolmogorov did not prove convergence of laws µ_{n,P} → µ on C[0, 1], since convergence of laws on spaces such as C[0, 1] was not defined until the 1940s. Donsker (1951) first stated and proved Corollary 9.3.14. Kolmogorov's papers (1931, 1933) were overlooked to some degree in the 1950s. Then Borovkov and Korolyuk (1965) and Nagaev (1970) went further along the way Kolmogorov had started. Thanks to Lucien Le Cam for pointing out Kolmogorov's papers to me.
Theorem 9.3.1 for d = 1 emerged from proofs of Donsker's invariance principle in the books of Breiman (1968, Theorem 13.8, p. 279) and Freedman (1971, p. 83, (130)). Apparently Major (1976, p. 222) first formulated Theorem 9.3.1 for d = 1 explicitly.
Note to Section 9.4. The invariance principle for general empirical processes, Theorem 9.4.2, first appeared in Dudley and Philipp (1983).
References
Alexander, K. (1987). The central limit theorem for empirical processes on Vapnik-Červonenkis classes. Ann. Probab. 15, 178-203. Alexander, K., and Talagrand, M. (1989). The law of the iterated logarithm for empirical processes on Vapnik-Červonenkis classes. J. Multivariate Analysis 30, 155-166. Beck, József (1985). Lower bounds on the approximation of the multivariate empirical process. Z. Wahrscheinlichkeitsth. verw. Gebiete 70, 289-306. Borovkov, A. A., and Korolyuk, V. S. (1965). On the results of asymptotic analysis in problems with boundaries. Theory Probab. Appl. 10, 236-246 (English), 255-266 (Russian). Breiman, Leo (1968). Probability. Addison-Wesley, Reading, MA.
Bretagnolle, J., and Massart, P. (1989). Hungarian constructions from the nonasymptotic viewpoint. Ann. Probab. 17, 239-256. Donsker, Monroe D. (1951). An invariance principle for certain probability limit theorems. Memoirs Amer. Math. Soc. 6. Dudley, R. M., and Koltchinskii, V. I. (1994). Envelope moment conditions and Donsker classes. Theor. Probab. Math. Statist. (Kiev) 51, 39-48. Dudley, R. M., and Philipp, Walter (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Z. Wahrscheinlichkeitsth. verw. Gebiete 62, 509-552. Freedman, David (1971). Brownian Motion and Diffusion. Holden-Day, San Francisco. Giné, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces, Proc. Conf. Zaragoza, 1985. Lecture Notes in Math. (Springer) 1221, pp. 50-113. Gnedenko, B. V., and Kolmogorov, A. N. (1949). Limit Theorems for Sums of Independent Random Variables. Transl. and ed. by K. L. Chung, Addison-Wesley, Reading, MA, 1968.
Goodman, V., Kuelbs, J., and Zinn, J. (1981). Some results on the LIL in Banach space with applications to weighted empirical processes. Ann. Probab. 9, 713-752. Heinkel, B. (1979). Relation entre théorème central-limite et loi du logarithme itéré dans les espaces de Banach. Z. Wahrscheinlichkeitsth. verw. Gebiete 49, 211-220. Hoffmann-Jørgensen, Jørgen (1974). Sums of independent Banach space valued random elements. Studia Math. 52, 159-186. Jain, Naresh C. (1976). An example concerning CLT and LIL in Banach space.
Ann. Probab. 4, 690-694. Jain, Naresh C., and Marcus, Michael B. (1975). Integrability of infinite sums of independent vector-valued random elements. Trans. Amer. Math. Soc. 212, 1-36. Kahane, J.-P. (1985). Some Random Series of Functions, 2d ed. Cambridge University Press, Cambridge; 1st ed. D. C. Heath, Lexington, MA, 1968. Kolmogorov, A. N. (1931). Eine Verallgemeinerung des Laplace-Liapounoffschen Satzes. Izv. Akad. Nauk SSSR Otdel. Mat. Estest. Nauk (Ser. 7) no. 7, 959-962. Kolmogorov, A. N. (1933). Über die Grenzwertsätze der Wahrscheinlichkeitsrechnung. Izv. Akad. Nauk SSSR (Bull. Acad. Sci. URSS) (Ser. 7) no. 3,
363-372. Koltchinskii, V.I. (1981). On the law of the iterated logarithm in the Strassen form for empirical measures. Teor. Veroiatnost. i Mat. Statist. (Kiev)
1981, no. 25, 40-47 (Russian); Theory Probab. Math. Statist. no. 25, 43-49 (English). Koltchinskii, V. I. (1994). Komlós-Major-Tusnády approximation for the general empirical process and Haar expansions of classes of functions. J. Theoret. Probab. 7, 73-118. Komlós, J., Major, P., and Tusnády, G. (1975, 1976). An approximation of partial sums of independent RV's and the sample DF. I, II. Z. Wahrscheinlichkeitsth. verw. Gebiete 32, 111-131; 34, 33-58. Kuelbs, J. (1977). Kolmogorov's law of the iterated logarithm for Banach space valued random variables. Illinois J. Math. 21, 784-800.
Kuelbs, J., and Dudley, R. M. (1980). Log log laws for empirical measures. Ann. Probab. 8, 405-418. Kuelbs, J., and Zinn, J. (1979). Some stability results for vector valued random variables. Ann. Probab. 7, 75-84. Ledoux, M., and Talagrand, M. (1991). Probability in Banach Spaces. Springer, Berlin. Major, Peter (1976). Approximation of partial sums of i.i.d. r.v.s when the summands have only two moments. Z. Wahrscheinlichkeitsth. verw. Gebiete 35, 221-229. Massart, P. (1989). Strong approximations for multidimensional empirical and related processes, via KMT constructions. Ann. Probab. 17, 266-291. Nagaev, S. V. (1970). On the speed of convergence in a boundary problem I, II. Theor. Probab. Appl. 15, 163-186, 403-429 (English), 179-199, 419-441 (Russian). Philipp, Walter (1980). Weak and L^p-invariance principles for sums of B-valued random variables. Ann. Probab. 8, 68-82; Correction, ibid. 14 (1986), 1095-1101. Pisier, G. (1976). Le théorème de la limite centrale et la loi du logarithme itéré dans les espaces de Banach. Séminaire Maurey-Schwartz 1975-76, Exposés III, IV, École Polytechnique, Paris. Révész, P. (1976). On strong approximation of the multidimensional empirical process. Ann. Probab. 4, 729-743. Rio, Emmanuel (1993a,b). Strong approximation for set-indexed partial sum processes, via K.M.T. constructions, I, II. Ann. Probab. 21, 759-790, 1706-1727. Rio, Emmanuel (1994). Local invariance principles and their application to density estimation. Probab. Theory Related Fields 98, 21-45. Tusnády, G. (1977). A remark on the approximation of the sample DF in the multidimensional case. Period. Math. Hungar. 8, 53-55.
10
Universal and Uniform Central Limit Theorems
In this chapter we look at cases where a class F of measurable functions is a Donsker class for all probability laws on the underlying space (universal Donsker classes) and such classes where convergence in the central limit theorem holds uniformly in P (uniform Donsker classes).
10.1 Universal Donsker classes
Let X be a set and A a σ-algebra of subsets of X. Then a class F of measurable
functions on X will be called a universal Donsker class if it is a P-Donsker class for every probability measure P on (X, A). Recall that every universal Donsker class of sets is a Vapnik-Cervonenkis class (Theorem 6.4.1), and the converse holds under measurability conditions. For a real-valued function f let diam(f) := sup f - inf f. The following shows that a universal Donsker class is uniformly bounded up to additive constants:
10.1.1 Proposition If F is a universal Donsker class, then
sup_{f∈F} diam(f) < ∞.
Proof Suppose not. Then take x_n ∈ X, y_n ∈ X, and f_n ∈ F so that f_n(x_n) − f_n(y_n) > 2^n for n = 1, 2, …. Let
P := Σ_{n=1}^∞ (δ_{x_n} + δ_{y_n}) / 2^{n+1}.
Then P is a probability measure defined on all subsets of X and so on A. For P,
E(f_n − Ef_n)² ≥ 2^{−n−1} inf_{c∈R} { (f_n(x_n) − c)² + (f_n(y_n) − c)² }.
The infimum is attained when c = (f_n(x_n) + f_n(y_n))/2, so
E(f_n − Ef_n)² ≥ 2^{−n−2} (f_n(x_n) − f_n(y_n))² > 2^{n−2}
for all n = 1, 2, …. Thus F is unbounded in the ρ_P metric and hence not a Donsker class for P by Theorem 3.7.2, and not a universal Donsker class.
A function h on F will be said to ignore additive constants if h(f) = h(f + c) whenever f ∈ F, c is a constant, and f + c ∈ F. Recall (Section 3.1) that a G_P process on F is called coherent if each sample function is prelinear, bounded, and uniformly continuous on F with respect to ρ_P. Here F is P-pregaussian if and only if a coherent G_P process on it exists (Theorem 3.1.1). If Z is a coherent G_P process, then if f and f + c are both in F for a constant c, we have ρ_P(f, f + c) = 0, so Z(f) = Z(f + c) and Z ignores additive constants for all ω. Such a Z can be consistently extended to the set F + R of
all functions f + c, f ∈ F, c ∈ R, letting Z(f + c) := Z(f). Then Z is a coherent G_P process on F + R, so F + R is pregaussian. Next, suppose F is a Donsker class for P. Then each function P_n − P ignores additive constants on F and is well-defined on F + R. If ρ_P(f + c, g + d) < δ for some f, g ∈ F, constants c, d, and δ > 0, then ρ_P(f, g) < δ. Since the total boundedness for ρ_P and the asymptotic equicontinuity condition both extend directly from F to F + R, it follows by Theorem 3.7.2 that F + R is a Donsker class for P. Thus if F is a universal Donsker class, so is F + R.
For a given law P, any subset of a Donsker class for P is also a P-Donsker class, for example by asymptotic equicontinuity (Theorem 3.7.2), so any subset of a universal Donsker class is also a universal Donsker class. If c(·) is any real-valued function on F, then F is a universal Donsker class if and only if G := {f − c(f) : f ∈ F} is. Taking c(f) := inf f, we obtain from any class F of bounded functions a class G of nonnegative functions such that G is Donsker for a given P, or universal Donsker, if and only if F has the same property. If F is a universal Donsker class, then G is uniformly bounded by Proposition 10.1.1.
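The elementary minimization used in the proof of Proposition 10.1.1, namely that (a − c)² + (b − c)² is minimized over c at the midpoint c = (a + b)/2 with minimum value (a − b)²/2, can be checked directly; the numbers below are arbitrary.

```python
# minimize (a - c)^2 + (b - c)^2 over c: minimum (a - b)^2 / 2 at c = (a + b)/2
def q(a, b, c):
    return (a - c) ** 2 + (b - c) ** 2

a, b = 3.0, 11.0
c_star = (a + b) / 2.0
assert q(a, b, c_star) == (a - b) ** 2 / 2.0
# nearby values of c do no better
assert all(q(a, b, c_star + eps) >= q(a, b, c_star)
           for eps in (-1.0, -0.1, 0.1, 1.0))
```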
Since multiplication by a positive constant is easily seen to preserve the Donsker property, in finding conditions for a class to be universal Donsker it will be enough to consider classes F of functions f with 0 ≤ f ≤ 1.
The Vapnik-Červonenkis properties of classes of functions treated in Sections 4.7 and 4.8 (VC subgraph, VC major, VC hull) will all be shown to imply the universal Donsker property for uniformly bounded classes of functions. So the relations among these different VC properties are of interest here. Recall (Section 4.8) that D^{(p)}(ε, F, Q) is the largest m such that for some f_1, …, f_m ∈ F, ∫ |f_i − f_j|^p dQ > ε^p for all i ≠ j. Also, D^{(2)}(ε, F) is the supremum over all laws Q with finite support of D^{(2)}(ε, F, Q). The following is a continuation of Theorem 4.8.3.
10.1.2 Proposition There exist VC major (thus VC hull) classes which do not satisfy (4.8.4), thus are not VC subgraph classes.
Proof A VC major class is VC hull by Theorem 4.7.1(b). Let F be the set of all right-continuous nonincreasing functions f on R with 0 ≤ f ≤ 1. Then since the class C of open or closed half-lines ]−∞, x[ or ]−∞, x] is a VC class (with S(C) = 1), F is a VC major class. For any f, g in F and any law Q, (∫ |f − g|² dQ)^{1/2} ≥ ∫ |f − g| dQ. Thus D^{(2)}(ε, F, Q) ≥ D^{(1)}(ε, F, Q) for any ε > 0.
Let P be Lebesgue measure on [0, 1]. Then D^{(1)}(ε, F, P) = D(ε, LL_{2,1}, d_λ), where LL_{2,1} is the set of all lower layers (defined in Section 8.3) in the unit square I² in R², and d_λ is the Lebesgue measure of the symmetric difference of sets in I². For some c > 0, D(ε, LL_{2,1}, d_λ) ≥ e^{c/ε} as ε ↓ 0 by Theorem 8.3.1. For each ε > 0 small enough, by the law of large numbers, there is a law Q with finite support and D^{(1)}(ε, F, Q) ≥ e^{c/ε} − 1, so (4.8.4) fails and by Theorem 4.8.3, F is not a VC subgraph class.
Recall that if
(10.1.3)  ∫_0^1 (log D^{(2)}(ε, F))^{1/2} dε < ∞,
as in Theorem 6.3.1 for F ≡ 1, then F is said to satisfy Pollard's entropy condition.
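To see what Pollard's condition does and does not allow, one can compare the entropy integral for a growth rate log D^{(2)}(ε, F) of order ε^{−p}: with p = 1 the integral converges, while with p = 2 it diverges logarithmically near 0. The closed forms below are for the integral truncated to [a, 1]; the values of a are arbitrary.

```python
import math

def entropy_integral(p, a):
    # closed form of the integral over [a, 1] of (eps^{-p})^{1/2} d eps
    if p == 1:
        return 2.0 * (1.0 - math.sqrt(a))   # bounded by 2 as a -> 0
    if p == 2:
        return math.log(1.0 / a)            # -> infinity as a -> 0
    raise ValueError("only p = 1, 2 handled here")

for a in (1e-2, 1e-4, 1e-8):
    assert entropy_integral(1, a) < 2.0     # Pollard's condition holds
print(entropy_integral(2, 1e-8))            # grows like log(1/a)
assert entropy_integral(2, 1e-8) > entropy_integral(2, 1e-4)
```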
10.1.4 Theorem If F is a uniformly bounded, image admissible Suslin class of measurable functions and satisfies Pollard's entropy condition, then F is a universal Donsker class.
Proof F has a finite constant C as an envelope function. For a constant envelope, the hypotheses of Theorem 6.3.1 don't depend on the law P, so Theorem 10.1.4 is a corollary of Theorem 6.3.1. Here are some more details. Let F/C := {f/C : f ∈ F}, so that F/C has as an envelope the constant 1. It will be enough to show that F/C is a universal Donsker class. Then for δ > 0, D^{(2)}(δ, F/C) is defined as in Theorem 6.3.1. Make the substitution δ = ε/C and note that D^{(2)}(ε/C, F/C) = D^{(2)}(ε, F). It follows that F/C satisfies Pollard's entropy condition. So Theorem 6.3.1 applies, and F/C and F are universal Donsker classes.
10.1.5 Corollary A uniformly bounded, image admissible Suslin VC subgraph class is a universal Donsker class.
Proof This follows from Theorems 4.8.3 and 10.1.4.
Specializing further, the set of indicators of an image admissible Suslin VC class of sets is a universal Donsker class (Corollary 6.3.16 for F ≡ 1). For a class F of real-valued functions on a set X, recall from Section 4.7 the class H(F, M), which is M times the symmetric convex hull of F, and H_S(F, M), which is the closure of H(F, M) for sequential pointwise convergence. Note that for any uniformly bounded class F of measurable functions for a σ-algebra A and any law Q defined on A, H(F, M) is dense in H_S(F, M) for the L²(Q) distance (or any L^p(Q) distance, 1 ≤ p < ∞).
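The substitution δ = ε/C used in the proof of Theorem 10.1.4 reflects a general fact: packing counts are unchanged when both the class and the separation scale are divided by the same constant. A greedy packing count (a lower bound for D^{(2)}, computed here on a toy finite class under an empirical L² distance; all names are illustrative) exhibits the identity.

```python
import math

def dist(f, g):
    # L2(Q) distance, Q uniform on the coordinates
    n = len(f)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / n)

def greedy_packing(fs, eps):
    # greedy count of functions pairwise more than eps apart
    chosen = []
    for f in fs:
        if all(dist(f, g) > eps for g in chosen):
            chosen.append(f)
    return len(chosen)

F = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
C = 2.0
F_over_C = [tuple(v / C for v in f) for f in F]
eps = 0.5
# D(eps/C, F/C) = D(eps, F): scaling both class and separation cancels out
assert greedy_packing(F_over_C, eps / C) == greedy_packing(F, eps)
```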
10.1.6 Theorem If F is a Donsker class for a law P such that F has an envelope function in L²(P), or a universal Donsker class, then for any M < ∞, H_S(F, M) is a P-Donsker (resp. universal Donsker) class.
Proof First, for a given P, the limiting Gaussian process G_P can be viewed as an isonormal process, for the inner product
(f, g)_{0,P} := ∫ fg dP − ∫ f dP ∫ g dP.
So, by Corollary 2.5.9, F ∪ −F is P-pregaussian. (Or, by Theorem 3.8.1, F ∪ −F is P-Donsker, thus P-pregaussian.) Clearly, MF is an envelope function for H_S(F, M) and is in L²(P). Suppose h_k ∈ H_S(F, M) and h_k → h pointwise. Then for any empirical measure P_n, ∫ h_k dP_n → ∫ h dP_n as k → ∞. Also, ∫ h_k dP → ∫ h dP and ∫ (h_k − h)² dP → 0, both by dominated convergence. It follows that ρ_P(h_k, h) → 0. Then by Theorem 2.5.5, the set H_S(F, M) is pregaussian for P. If g is any prelinear function on H(F, M), then ||g||_{H(F,M)} = M ||g||_F. Next, ||g||_{H_S(F,M)} = M ||g||_F if g is any linear combination of P_n, P, and a sample function of a coherent G_P process. Then the Donsker property of H_S(F, M) follows from the invariance principle, Theorem 9.4.2, where the G_P processes are coherent.
Suppose F is a universal Donsker class. Let G := {f − inf f : f ∈ F}, as in the discussion before Proposition 10.1.2. Then functions in H(F, M)
Universal and Uniform Central Limit Theorems
differ from functions in H(G, M) by additive constants. If h_k ∈ H_s(F, M), h_k → h pointwise, and for all k, h_k = φ_k + c_k for some φ_k ∈ H_s(G, M) and constants c_k, then since the φ_k are uniformly bounded and h has finite values, the c_k are bounded. So, taking a subsequence, we can assume c_k converges to some c. Then the φ_k converge pointwise to some φ ∈ H_s(G, M) and h = φ + c. Thus all functions in H_s(F, M) differ by additive constants from functions in H_s(G, M) (the converse may not hold). Thus, if H_s(G, M) is a universal Donsker class, so is H_s(F, M). So we can assume F is uniformly bounded. Then it has an envelope function in L²(P) for all P, so by the first half of the proof, H_s(F, M) is a universal Donsker class. □
By Theorem 10.1.4, for any δ > 0, if log D^(2)(ε, F) = O(1/ε^{2−δ}) as ε ↓ 0, and if F satisfies a measurability condition (specifically, if F is image admissible Suslin), then F is a universal Donsker class. In the converse direction, we have:
10.1.7 Theorem For a uniformly bounded class F to be a universal Donsker class it is necessary that log D^(2)(ε, F) = O(ε^{−2}) as ε ↓ 0.
Proof Suppose not. Then there are a universal Donsker class F and ε_k ↓ 0 such that log D^(2)(ε_k, F) > k³/ε_k² for k = 1, 2, …, so there are probability laws P_k with finite support for which log D^(2)(ε_k, F, P_k) > k³/ε_k² for k = 2, 3, …. Let P be a law with P ≥ Σ_{k≥2} P_k/k². Then for any measurable f and g,

(∫ (f − g)² dP)^{1/2} ≥ (∫ (f − g)² dP_k)^{1/2} / k.

Let δ_k := ε_k/k. Then

log D^(2)(δ_k, F, P) ≥ log D^(2)(ε_k, F, P_k) > k³/ε_k² = k/δ_k².

So any isonormal process L on L²(P) is a.s. unbounded on F by Theorem 2.3.5.
We can write L(f) = G_P(f) + G·∫ f dP, where G is a standard normal variable independent of G_P. Since F is uniformly bounded, G_P is a.s. unbounded on (a countable ρ_P-dense subset of) F, so F is not P-pregaussian and so not a P-Donsker class. □

Theorem 10.1.7 is optimal, as the following shows:
10.1.8 Proposition There exists a universal Donsker class E such that lim inf_{δ↓0} δ² log D^(2)(δ, E) > 0.
10.1 Universal Donsker classes
Proof Let A_j := A(j) be disjoint, nonempty measurable sets for j = 1, 2, …. Let ‖·‖₂ be the ℓ² norm, ‖x‖₂ = (Σ_j x_j²)^{1/2} for x = {x_j}_{j≥1}. Let

E := {Σ_j x_j 1_{A(j)} : ‖x‖₂ ≤ 1}.

(So, E is an "ellipsoid" with center 0 and semiaxes 1_{A(j)}.) Let P be any probability measure defined on the A_j and let p_j := P(A_j) for j = 1, 2, ….
We can assume that p_j > 0 for all j, since if B is the union of all A_j such that p_j = 0, then P(B) = 0, G_P(B) = 0 a.s., and ν_n(B) = 0 a.s. for all n. Let ε > 0. For any k and n, let ‖ν_n‖_{2,k} := (Σ_{j≥k} ν_n(A_j)²)^{1/2}. Then for all n,

E‖ν_n‖²_{2,k} ≤ Σ_{j≥k} p_j → 0 as k → ∞.

Take k = k(ε) large enough so that Σ_{j≥k} p_j < ε³/18. Then

(10.1.9) Pr{‖ν_n‖_{2,k} > ε/3} < ε/2.
If ‖ν_n‖_{2,k} ≤ ε/3, ‖x‖₂ ≤ 1, and ‖y‖₂ ≤ 1, then by the Cauchy(–Schwarz) inequality,

(10.1.10) |ν_n(Σ_{j≥k} (x_j − y_j)1_{A(j)})| ≤ 2ε/3.
Also, E‖ν_n‖_{2,1} ≤ (E‖ν_n‖²_{2,1})^{1/2} ≤ 1, so

(10.1.11) Pr{‖ν_n‖_{2,1} > 2/ε} < ε/2.
Let δ := (min_{j<k} p_j)^{1/2} ε²/6, and for ‖x‖₂ ≤ 1 let

f_x := Σ_{j=1}^∞ x_j 1_{A(j)}.

If f_x and f_y ∈ E and e_P(f_x, f_y) := (∫ (f_x − f_y)² dP)^{1/2} < δ, then

(10.1.12) (Σ_{j<k} (x_j − y_j)²)^{1/2} < ε²/6.

By (10.1.11) and Cauchy's inequality, |ν_n(Σ_{j<k} (x_j − y_j)1_{A(j)})| ≤ ε/3 outside an event of probability less than ε/2.
Thus by (10.1.9) and (10.1.10), there is an event F with Pr(F) < ε such that if ω ∉ F then for all x and y with e_P(f_x, f_y) < δ,

(10.1.13) |ν_n(f_x − f_y)| < ε.
Given x with Σ_j x_j² ≤ 1, let y_j = x_j for j < k and y_j = 0 for j ≥ k. Then (∫ (f_x − f_y)² dP)^{1/2} < ε/4. Since x ↦ f_x is continuous from ℓ² into L²(P), the set of all f_x such that Σ_j x_j² ≤ 1 and x_j = 0 for j ≥ k is compact in L²(P). It follows that E is totally bounded in L²(P). Thus by (10.1.13) and Theorem 3.7.2 for e_P, E is a Donsker class for P. Since P was arbitrary, E is a universal Donsker class.
Next, given 0 < δ < 1/2, let m := [1/(4δ²)], where [x] is the largest integer ≤ x. Let P be a probability measure with P(A_j) = 1/m for j = 1, …, m. Then in L²(P), E is an m-dimensional ball with radius m^{−1/2}. Let g₁, …, g_r be a maximal set of elements of E more than δ apart in L²(P), so r = D^(2)(δ, E, P), and let B_i := {f_x : e_P(f_x, g_i) ≤ 2δ}. Then

∪_{i=1}^r B_i ⊇ {f_x : (Σ_{j=1}^m x_j²/m)^{1/2} ≤ m^{−1/2} + δ, x_j = 0 for j > m}.

Thus by comparing volumes of balls in ℝ^m, we get by the choice of m

D^(2)(δ, E, P) = r ≥ (m^{−1/2} + δ)^m/(2δ)^m = (½(1 + 1/(δm^{1/2})))^m ≥ (3/2)^m,

since δm^{1/2} ≤ 1/2. Hence δ² log D^(2)(δ, E) ≥ δ² m log(3/2), which stays bounded away from 0 as δ ↓ 0 since m = [1/(4δ²)]. Letting δ ↓ 0, Proposition 10.1.8 follows. □
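The volume-comparison step above is easy to check numerically. The following sketch (a hedged illustration, not from the book: it simply evaluates the packing lower bound r ≥ ((m^{−1/2} + δ)/(2δ))^m with m = [1/(4δ²)]) confirms that δ² log D^(2)(δ, E) stays bounded away from 0:

```python
import math

# Packing lower bound from the proof of Proposition 10.1.8:
# in L^2(P), E is an m-dimensional ball of radius m^(-1/2) with
# m = [1/(4 delta^2)], and comparing volumes gives
#   D^(2)(delta, E, P) >= ((m^(-1/2) + delta) / (2 delta))^m.
for delta in [0.2, 0.1, 0.05, 0.02, 0.01]:
    m = int(1 / (4 * delta ** 2))
    log_lower = m * math.log((m ** -0.5 + delta) / (2 * delta))
    print(delta, delta ** 2 * log_lower)  # roughly log(3/2)/4 ~ 0.10 throughout
```

The printed products hover near log(3/2)/4 ≈ 0.10 as δ shrinks, matching the lim inf claim.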
10.1.14 Proposition There is a uniformly bounded class F of measurable functions, which is not a universal Donsker class, such that log D^(2)(ε, F) = O(1/(ε² log(1/ε))) as ε ↓ 0.

Proof Let B_j := B(j) be disjoint nonempty measurable sets. Recall that Lx := max(1, log x). Let a_j := 1/(jLj)^{1/2}, j ≥ 1, and

F := {Σ_{j=1}^∞ x_j 1_{B(j)} : x_j = ±a_j for all j}.

Take c such that Σ_{j≥1} p_j = 1, where p_j := c(a_j/LLj)². Here Σ_j p_j < ∞ by
the integral test, since (d/dx)(1/LLx) = −1/(xLx(LLx)²) for x > e^e. Take a probability measure P with P(B_j) = p_j for all j. Let
α := (1 − p₁)^{1/2}/2 > 0. Then

E‖G_P‖_F = Σ_{j=1}^∞ a_j E|G_P(B_j)| = Σ_{j=1}^∞ a_j (2/π)^{1/2}(p_j(1 − p_j))^{1/2} ≥ α Σ_{j=1}^∞ a_j p_j^{1/2} = αc^{1/2} Σ_{j=1}^∞ 1/(jLjLLj) = +∞
by the integral test, since (d/dx)(LLLx) = 1/(xLxLLx) for x large enough. If F were P-pregaussian, then since G_P can be treated as an isonormal process (see Section 3.1), by Theorem 2.5.5 ((a) if and only if (b)) and the material just before it, G_P could be realized on a separable Banach space with norm ‖·‖_F. Then the norm would have finite expectation by the Landau–Shepp–Marcus–Fernique theorem (2.2.2), a contradiction. So F is not pregaussian for P and so not a universal Donsker class. For any probability measure Q and r = 1, 2, …,
if f = Σ_j x_j 1_{B(j)} ∈ F and g = Σ_j y_j 1_{B(j)} ∈ F with x_j = y_j for 1 ≤ j < r, then

(10.1.15) e_Q(f, g) = (∫ (Σ_{j≥r} (x_j − y_j)1_{B(j)})² dQ)^{1/2} ≤ 2a_r (Σ_{j≥r} Q(B_j))^{1/2} ≤ 2a_r.
Given ε > 0, let r := r(ε) be the smallest integer r ≥ 1 such that 2a_r ≤ ε. By (10.1.15), if x_j = y_j for 1 ≤ j < r, then e_Q(f, g) ≤ ε. Thus, since there are only 2^{r−1} possibilities for x_j = ±a_j, j < r, we have D^(2)(ε, F, Q) ≤ 2^{r−1}, and since r doesn't depend on Q, D^(2)(ε, F) ≤ 2^{r−1} and log D^(2)(ε, F) ≤ r log 2. As ε ↓ 0, we have 2a_{r(ε)} ~ ε, so

log(1/ε) ~ log(2/ε) ~ log(1/a_{r(ε)}) ~ ½ log(r(ε)),

and ε/2 ~ 1/(r(ε)·2 log(1/ε))^{1/2}, so r(ε) ~ 2/(ε² log(1/ε)). Since log D^(2)(ε, F) ≤ r(ε) log 2 ≤ r(ε), the conclusion follows. □
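The covering count r(ε) in this proof can be evaluated directly; the following hedged sketch (illustrative only, with the same a_j = 1/(jLj)^{1/2} as above) checks the asymptotic relation r(ε) ~ 2/(ε² log(1/ε)):

```python
import math

def L(x):
    # Lx := max(1, log x), as in the text
    return max(1.0, math.log(x))

def a(j):
    return 1.0 / math.sqrt(j * L(j))

def r_of(eps):
    # smallest integer r >= 1 with 2 a_r <= eps, so log D^(2)(eps, F) <= r log 2
    r = 1
    while 2 * a(r) > eps:
        r += 1
    return r

for eps in [0.2, 0.1, 0.05, 0.02]:
    r = r_of(eps)
    # the ratio r(eps) * eps^2 * log(1/eps) tends (slowly) to 2
    print(eps, r, r * eps ** 2 * math.log(1 / eps))
```

The printed ratios drift toward 2 very slowly, reflecting the iterated logarithms in p_j.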
Theorems 10.1.4 and 10.1.7 show that Pollard's entropy condition (10.1.3) comes close to characterizing the universal Donsker property, but Propositions 10.1.8 and 10.1.14 show that there is no characterization of the universal Donsker property in terms of D^(2).
10.2 Metric entropy of convex hulls in Hilbert space

Let H be a real Hilbert space and for any subset B of H let co(B) be its convex hull,

co(B) := {Σ_{j=1}^k t_j x_j : t_j ≥ 0, Σ_{j=1}^k t_j = 1, x_j ∈ B, k = 1, 2, …}.

Recall that D(ε, B) is the maximum number of points in B more than ε apart.
10.2.1 Theorem Suppose that B is an infinite subset of a Hilbert space H, ‖x‖ ≤ 1 for all x ∈ B, and for some K < ∞ and 0 < γ < ∞, we have D(ε, B) ≤ Kε^{−γ} for 0 < ε ≤ 1. Let s := 2γ/(2 + γ). Then for any t > s, there is a constant C, which depends only on K, γ, and t, such that

D(ε, co(B)) ≤ exp(Cε^{−t}) for 0 < ε ≤ 1.

Note. Carl (1997) gives the sharper bound with t = s.
Proof We may assume K ≥ 1. Choose any x₁ ∈ B. Given

B(n) := {x₁, …, x_{n−1}}, n ≥ 2,

let d(x, B(n)) := min_{y∈B(n)} ‖x − y‖ and δ_n := sup_{x∈B} d(x, B(n)). Since B is infinite, δ_n > 0 for all n. Choose x_n ∈ B with d(x_n, B(n)) > δ_n/2. Then for all n, K(2/δ_n)^γ ≥ D(δ_n/2, B) ≥ n, so δ_n ≤ Mn^{−1/γ} for all n, where M := 2K^{1/γ}.

Let 0 < ε < 1. Let N := N(ε) be the next integer larger than (4M/ε)^γ. Then δ_N ≤ ε/4. Let G := B(N). For each x ∈ B there is an i = i(x) ≤ N − 1 with ‖x − x_i‖ ≤ δ_N. For any convex combination z = Σ_{x∈B} η_x x, where η_x ≥ 0 and Σ_{x∈B} η_x = 1, with η_x = 0 except for finitely many x, let z_N := Σ_{x∈B} η_x x_{i(x)}. Then ‖z − z_N‖ ≤ δ_N ≤ ε/4, so

(10.2.2) D(ε, co(B)) ≤ D(ε/2, co(G)).
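The greedy choice of the points x_n (each new point nearly farthest from those already chosen) is a standard farthest-point construction. A hedged sketch on a finite random point set (illustrative data only; it uses the exact maximizer instead of the δ_n/2 relaxation in the proof):

```python
import math
import random

def farthest_point_sequence(points, steps):
    """Greedy construction as in the proof of Theorem 10.2.1: having chosen
    B(n) = {x_1, ..., x_{n-1}}, record delta_n = sup_x d(x, B(n)) and add
    a point attaining it."""
    chosen = [points[0]]
    deltas = []
    for _ in range(steps):
        best_p, best_d = max(
            ((p, min(math.dist(p, c) for c in chosen)) for p in points),
            key=lambda pair: pair[1])
        deltas.append(best_d)
        chosen.append(best_p)
    return chosen, deltas

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(400)]
_, deltas = farthest_point_sequence(pts, 40)
# the covering radii delta_n are nonincreasing, as the nested B(n) force
assert all(d1 >= d2 for d1, d2 in zip(deltas, deltas[1:]))
print(deltas[0], deltas[-1])
```

For a set B satisfying D(ε, B) ≤ Kε^{−γ}, the same argument as in the proof shows these radii decay like n^{−1/γ}.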
To bound D(ε/2, co(G)), let m := m(ε) be the largest integer ≤ ε^{−s}. Note that γ > s. Then for each i with m < i ≤ N, there is a j ≤ m such that

(10.2.3) ‖x_i − x_j‖ ≤ δ_{m+1} ≤ Mε^{s/γ}.

Let j(i) be the least such j. Let Λ_m := {{λ_j}_{1≤j≤m} : λ_j ≥ 0, Σ_{j=1}^m λ_j = 1}. On ℝ^m we have the ℓ^p metrics

ρ_p({x_j}, {y_j}) := (Σ_{j=1}^m |x_j − y_j|^p)^{1/p}.

By Cauchy's inequality, ρ₁ ≤ m^{1/2}ρ₂. Let β := ε/6 and δ := β/(2m^{1/2}). The δ-neighborhood of Λ_m for ρ₂ is included in a ball of radius 1 + δ ≤ 13/12. We have D(2δ, Λ_m, ρ₂) centers of disjoint balls of radius δ included in the neighborhood. Comparing volumes of balls and recalling that Lx := max(1, log x) gives

(10.2.4) D(β, Λ_m, ρ₁) ≤ D(2δ, Λ_m, ρ₂) ≤ 13^m m^{m/2} ε^{−m}
  ≤ exp(m{L(1/ε) + (Lm)/2 + log 13}) ≤ exp(C₃ε^{−s}L(1/ε)) ≤ exp(C₄ε^{−t}), 0 < ε < 1,

for some constants C₃, C₄.
For each j = 1, …, m, let A_j := A(j) consist of x_j and the set of all x_i, i = m + 1, …, N, such that j(i) = j. Then G = ∪_{j=1}^m A_j and the A_j are disjoint. Take a maximal set S = S(ε) ⊂ Λ_m with ρ₁(u, v) > β for any u ≠ v in S. For a given λ = (λ₁, …, λ_m) ∈ S, let

F_λ := {x ∈ co(G) : x = Σ_{y∈G} μ_{y,x} y, where μ_{y,x} ≥ 0 and Σ_{y∈A(j)} μ_{y,x} = λ_j for all j}.
For any x ∈ co(G), let x = Σ_{y∈G} μ_{y,x} y where μ_{y,x} ≥ 0, and for τ_j(x) := Σ_{y∈A(j)} μ_{y,x}, let τ(x) := {τ_j(x)}_{j=1}^m ∈ Λ_m. Take λ(x) ∈ S such that Σ_j |τ_j(x) − λ_j(x)| ≤ β. For y ∈ A(j), define ν_{y,x} := λ_j(x)μ_{y,x}/τ_j(x) if τ_j(x) > 0. If τ_j(x) = 0, choose a w ∈ A(j) and define ν_{w,x} := λ_j(x) and ν_{y,x} := 0 for y ≠ w, y ∈ A(j). If τ_j(x) > 0 then

Σ_{y∈A(j)} |μ_{y,x} − ν_{y,x}| = Σ_{y∈A(j)} μ_{y,x} |1 − λ_j(x)/τ_j(x)| = |τ_j(x) − λ_j(x)|.

If τ_j(x) = 0 then Σ_{y∈A(j)} |μ_{y,x} − ν_{y,x}| = ν_{w,x} = λ_j(x) = |τ_j(x) − λ_j(x)|. It follows that Σ_{y∈G} |μ_{y,x} − ν_{y,x}| = Σ_j |τ_j(x) − λ_j(x)| ≤ β. Set ℓ_x := Σ_{y∈G} ν_{y,x} y. Then ℓ_x ∈ F_{λ(x)}, λ(x) ∈ S, and ‖ℓ_x − x‖ ≤ β. Thus by (10.2.2) and (10.2.4),

(10.2.5) D(ε, co(B)) ≤ D(ε/2, co(G)) ≤ D(ε/6, ∪_{λ∈S} F_λ) ≤ card(S)·max_λ D(ε/6, F_λ) ≤ exp(C₄ε^{−t})·max_λ D(ε/6, F_λ).
To bound the latter factor, let λ ∈ S. We may assume λ_j > 0 for all j. For any j = 1, …, m and x ∈ F_λ, let x(j) := Σ_{y∈A(j)} μ_{y,x} y. Let Y_j be a random variable with values in A(j) and P(Y_j = y) = μ_{y,x}/λ_j for each y ∈ A(j). Then EY_j = x(j)/λ_j =: z_j. Take Y₁, …, Y_m to be independent and let Y := Σ_{j=1}^m λ_j Y_j. Then EY = x and

E‖Y − x‖² = E‖Σ_{j=1}^m λ_j(Y_j − z_j)‖² = Σ_{j=1}^m λ_j² E‖Y_j − z_j‖²,
since the Y_j − z_j are independent and have mean 0, and H is a Hilbert space. Now the diameter of A(j) is at most 2Mε^{s/γ} by (10.2.3), and z_j is a convex combination of elements of A(j). Thus

E‖Y_j − z_j‖² = λ_j^{−1} Σ_{y∈A(j)} μ_{y,x} ‖y − z_j‖² ≤ 4M²ε^{2s/γ},

and for any set F ⊂ {1, 2, …, m},

E‖Σ_{j∈F} λ_j(Y_j − z_j)‖² ≤ 4M²(max_{j∈F} λ_j) ε^{2s/γ}.
Next, an idea of B. Maurey will be applied. For k = 1, 2, …, let Y_{j1}, Y_{j2}, …, Y_{jk} be independent with the distribution of Y_j, and with the Y_{ji} also independent for different j. Then

E‖Σ_{j∈F} λ_j(k^{−1} Σ_{i=1}^k Y_{ji} − z_j)‖² ≤ 4M²(max_{j∈F} λ_j) ε^{2s/γ}/k,

so

E‖Σ_{j∈F} λ_j(k^{−1} Σ_{i=1}^k Y_{ji} − z_j)‖ ≤ 2M(max_{j∈F} λ_j)^{1/2} ε^{s/γ}/k^{1/2}.

Thus, there exist y_{ji} ∈ A(j), i = 1, …, k, j ∈ F, such that

(10.2.6) ‖Σ_{j∈F} λ_j(k^{−1} Σ_{i=1}^k y_{ji} − z_j)‖ ≤ 2M(max_{j∈F} λ_j)^{1/2} ε^{s/γ}/k^{1/2}.
Take v > 0 such that s + v < t. Let F(0) := {j ≤ m : λ_j > ε^v}. Let k(0) be the smallest integer k such that

k ≥ 6400M²ε^{−2+2s/γ} = 6400M²ε^{−s}.

For k ≥ k(0) and F = F(0) the expressions in (10.2.6) are at most ε/40. Let r be the smallest positive integer such that ε^v/4^r ≤ (ε^{1−s/γ}/(80M))². For u = 1, …, r let

F(u) := {j ≤ m : ε^v/4^u < λ_j ≤ ε^v/4^{u−1}},

and let k = k(u) be the smallest integer k such that 2^{2−u}Mε^{v/2}ε^{s/γ}/k^{1/2} ≤ ε/(40r), that is, k ≥ 100M²4^{4−u}r²ε^{v−2+2s/γ}. Thus, for some constant C₅, k(u) ≤ 1 + C₅4^{−u}ε^{v−s}(L(1/ε))². The y_{ji} for F = F(u) will be called y_{ji}^{(u)} (they also depend on x).
Let F(r+1) := {j ≤ m : λ_j ≤ ε^v/4^r}. Let k(r+1) := 1. For F = F(r+1) and k = k(r+1), (10.2.6) is bounded above by ε/40. We have y_{j1}^{(r+1)}, a single choice for each j. Let

ℓ_x := Σ_{u=0}^{r+1} Σ_{j∈F(u)} λ_j k(u)^{−1} Σ_{i=1}^{k(u)} y_{ji}^{(u)}.
Then by (10.2.6) and the results for u = 0 and u = r + 1,

‖ℓ_x − x‖ = ‖Σ_{u=0}^{r+1} Σ_{j∈F(u)} λ_j(k(u)^{−1} Σ_{i=1}^{k(u)} y_{ji}^{(u)} − z_j)‖
  ≤ ε/40 + Σ_{u=1}^r 2M(ε^v/4^{u−1})^{1/2} ε^{s/γ}/k(u)^{1/2} + ε/40
  ≤ ε/40 + r·(ε/(40r)) + ε/40 = 3ε/40 < ε/12.
Here ℓ_x is determined uniquely by the k(u)-tuples (y_{j1}^{(u)}, …, y_{jk(u)}^{(u)}), j ∈ F(u), u = 0, 1, …, r + 1. Each A(j) has at most N elements, so that for given u ≤ r and j ≤ m, there are at most N^{k(u)} ways of choosing the y_{ji}^{(u)}. Now card(F(u)) ≤ 4^u ε^{−v}, so the number of ways to choose the y_{ji}^{(u)} for given u with 1 ≤ u ≤ r is at most exp{(log N)(4^u ε^{−v} + C₅ε^{−s−v}L(1/ε)²)}. There are at most exp((log N)[ε^{−s−v}6400M² + ε^{−v}]) ways to choose the y^{(0)}, and at most N^m ways to choose the y^{(r+1)}, with m ≤ ε^{−s}. Thus, the total number of ways to choose all the y_{ji}^{(u)} gives

D(ε/6, F_λ) ≤ exp(C₆{ε^{−s−v}L(1/ε)⁴ + L(1/ε)4^r/ε^v})

for some C₆ := C₆(K). By definition of r, 4^r/ε^v ≤ C₇ε^{2s/γ−2} = C₇ε^{−s} for some C₇ := C₇(K). Thus, D(ε/6, F_λ) ≤ exp(C₈ε^{−t}) for some C₈. Then by (10.2.5), Theorem 10.2.1 is proved. □
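Maurey's empirical-measure argument used in the proof has a direct numerical counterpart: a convex combination z = Σ_y μ_y y of unit vectors in a Hilbert space can be approximated within about k^{−1/2} by an average of k i.i.d. draws from μ. A hedged sketch with arbitrary test data (dimensions, weights, and seed are illustrative only):

```python
import math
import random

random.seed(1)
dim, npts, k = 20, 100, 400

# random unit vectors y_i and convex weights mu_i
ys = []
for _ in range(npts):
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    ys.append([c / norm for c in v])
w = [random.random() for _ in range(npts)]
total = sum(w)
mu = [x / total for x in w]

# the target convex combination z = sum_i mu_i y_i
z = [sum(mu[i] * ys[i][j] for i in range(npts)) for j in range(dim)]

# Maurey: average k i.i.d. draws from mu; E||avg - z||^2 <= 1/k since ||y_i|| = 1
draws = random.choices(range(npts), weights=mu, k=k)
avg = [sum(ys[i][j] for i in draws) / k for j in range(dim)]
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(avg, z)))
print(err, 1 / math.sqrt(k))  # err is typically of the order 1/sqrt(k)
```

This is exactly the mechanism behind (10.2.6): replacing each conditional mean z_j by a k-point empirical average costs only a factor k^{−1/2}.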
Example. The exponent 2γ/(2 + γ) in Theorem 10.2.1 is sharp in the following example. Let {e_n}_{n≥1} be an orthonormal basis of H and for 0 < γ < ∞ let

B := {n^{−1/γ}e_n}_{n≥1} ∪ {−n^{−1/γ}e_n}_{n≥1}.

For any ε > 0 small enough we have D(ε, B) = 2n for the least n such that (n^{−2/γ} + (n + 1)^{−2/γ})^{1/2} ≤ ε: the points ±j^{−1/γ}e_j, 1 ≤ j ≤ n, are more than ε apart, so D(ε, B) ≥ 2n, while a set of points of B more than ε apart cannot contain j^{−1/γ}e_j or −j^{−1/γ}e_j for more than one value of j > n, so D(ε, B) ≤ 2n. Thus as ε → 0, ε² ~ 2/n^{2/γ}, so for a constant C = C_γ, 2n ~ C/ε^γ, and replacing C by a suitable larger K if necessary, the hypothesis of Theorem 10.2.1 holds. Let B_n be the intersection of B with the linear span of e₁, …, e_n. Let C_n := co(B_n). Then for any ε > 0, D(ε, co(B)) ≥ D(ε, C_n).
The n-dimensional volume v_n(C_n) is

v_n(C_n) = 2^n/(n!)^{1+1/γ}.

A ball B(x, r) := {y : |y − x| ≤ r} in ℝ^n has volume v_n(B(x, r)) = c_n r^n where c_n = π^{n/2}/Γ((n + 2)/2). (This is well known, especially for n = 1, 2, 3 since Γ(1/2) = π^{1/2}, and then can be proved by induction from n to n + 2.) If x₁, …, x_m are points of C_n more than ε apart, for a maximal m = D(ε, C_n), then the sets B(x_i, ε) cover C_n, so mc_nε^n ≥ v_n(C_n). By Stirling's formula (Theorem 1.3.13), as n → ∞,

v_n(C_n)/c_n ~ (e/n)^{n(2+γ)/(2γ)}(2/π)^{n/2} n^{−1/(2γ)} D_γ

for a constant D_γ.
Take any d such that d > (γ + 2)/(2γ). Then for n large enough, v_n(C_n)/c_n ≥ n^{−dn}. Let g_n(ε) := n^{−dn}/ε^n. Then m ≥ g_n(ε). The following paragraph is only for motivation. As ε ↓ 0, n = n(ε) will be chosen to make g_n(ε) about as large as possible. We have

g_{n+1}(ε)/g_n(ε) = ε^{−1}(n + 1)^{−d(n+1)} n^{dn} ~ ε^{−1}(ne)^{−d}.

This sequence is decreasing in n, so to maximize g_n(ε) for a given ε we want to take n such that the ratio is approximately 1.

At any rate, let n(ε) be the largest integer ≤ f(ε) := e^{−1}ε^{−1/d}. Then for ε small enough,

g_{n(ε)}(ε) ≥ f(ε)^{−df(ε)} ε^{1−f(ε)} = ε exp(df(ε)) = ε exp((d/e)ε^{−1/d}).

Now ε ≥ exp(−ε^{−δ}) as ε ↓ 0 for any δ > 0. Taking δ < 1/d and letting d ↓ (γ + 2)/(2γ), we have 1/d ↑ (2γ)/(2 + γ), showing that the exponent in Theorem 10.2.1 is indeed optimal.

Recall the definitions of D^(2) from Section 4.8 and H_s from after Corollary 10.1.5.
10.2.7 Corollary If G is a uniformly bounded class of measurable functions and for some K < ∞ and 0 < γ < ∞, D^(2)(ε, G) ≤ Kε^{−γ} for 0 < ε ≤ 1, then for any t > r := 2γ/(2 + γ), and for constants C₁, C₂ depending only on 2K, γ, and t as in Theorem 10.2.1,

D^(2)(ε, H_s(G, 1)) ≤ C₁ exp(C₂ε^{−t}) for 0 < ε ≤ 1.

Proof We have D^(2)(ε, G ∪ −G) ≤ 2Kε^{−γ}, 0 < ε ≤ 1. Thus for any law Q with finite support, D^(2)(ε, G ∪ −G, Q) ≤ 2Kε^{−γ}. By Theorem 10.2.1, D^(2)(ε, H(G, 1), Q) ≤ C₁ exp(C₂ε^{−t}) for C_i = C_i(2K, γ, t). It is easily seen that H(G, 1) is a dense subset of H_s(G, 1) in L²(Q). The conclusion follows. □
Now, recall the notions of VC subgraph and VC subgraph hull class from Section 4.7.

10.2.8 Corollary If G is a uniformly bounded VC subgraph class and M < ∞, then the VC subgraph hull class H_s(G, M) satisfies Pollard's entropy condition (10.1.3).
Proof Let MG := {Mg : g ∈ G}. Then MG is a uniformly bounded VC subgraph class. By Theorem 4.8.3(a), MG satisfies the hypothesis of Corollary 10.2.7; since H_s(G, M) = H_s(MG, 1), that corollary applies, and as r < 2 we can take t < 2. □
Remark. It follows by Theorem 6.3.1 that for a uniformly bounded VC subgraph class G, if F ⊂ H_s(G, M) and F satisfies the image admissible Suslin measurability condition, then F is a universal Donsker class. This also follows from Corollary 10.1.5 and Theorem 10.1.6.
Example. Let C be the set of all intervals (a, b] for 0 ≤ a ≤ b ≤ 1. Let G be the set of all real functions f on [0, 1] such that |f(x)| ≤ 1/2 for all x, |f(x) − f(y)| ≤ |x − y| for 0 ≤ x, y ≤ 1, and f(x) = 0 for x < 0 or x > 1. Each f in G has total variation at most 2 (at most 1 on the open interval 0 < x < 1 and 1/2 at each endpoint 0, 1). By the Jordan decomposition we have, for each f ∈ G, f = g − h where g and h are both nondecreasing functions, 0 for x < 0. Then g and h have equal total variations ≤ 1 and G ⊂ H_s(C, 2) by the proof of Theorem 4.7.1(b). Let P be Lebesgue measure on [0, 1]. By Theorem 8.2.10, and since (∫ |f|² dP)^{1/2} ≥ ∫ |f| dP, there is a c > 0 such that D^(2)(ε, G) ≥ e^{c/ε} as ε ↓ 0 (consider laws with finite support which approach P). Since S(C) = 2, the exponent γ can be any number larger than 2 by Corollary 4.1.7 and Theorem 4.6.1. Letting γ ↓ 2, t in Corollary 10.2.7 can be any number > 1, and we saw above that it cannot be < 1 in this case, so again the exponent is sharp.
**10.3 Uniform Donsker classes

This section will state some results with references but without proofs. A class F of measurable functions on a measurable space (X, A) is a uniform Donsker class if it is a universal Donsker class and the convergence in law of ν_n to G_P is also uniform in P. To formulate this notion precisely we will follow Giné and Zinn (1991) in using the dual-bounded-Lipschitz metric β as defined just before Theorem 3.6.4.

Let P(X) be the set of all probability measures on (X, A) and let P_f(X) be the set of all laws in P(X) with finite support. For δ > 0, a class F of measurable real-valued functions on X, and a pseudometric d on F, let

F(δ, d) := {f − g : f, g ∈ F, d(f, g) ≤ δ}.
Definitions. A class F is uniformly pregaussian if it is pregaussian for all P ∈ P(X), and if, for a coherent version of G_P for each P, we have both sup_{P∈P(X)} E‖G_P‖_F < ∞ and

lim_{δ↓0} sup_{P∈P(X)} E‖G_P‖_{F(δ,ρ_P)} = 0.

The class F is finitely uniformly pregaussian if the same holds with P_f(X) in place of P(X). The class F is a uniform Donsker class if it is uniformly pregaussian and

lim_{n→∞} sup_{P∈P(X)} β_F(ν_n, G_P) = 0,

where β_F is the dual-bounded-Lipschitz metric β based on ‖·‖_F as in Section 3.6. Giné and Zinn (1991, Theorem 2.3) proved:
Theorem Let (X, A) be a measurable space and F a class of real-valued measurable functions on X. Then F is a uniform Donsker class if and only if it is finitely uniformly pregaussian, and thus if and only if it is uniformly pregaussian.

The theorem is a very useful characterization, since it's easier to check the finitely uniformly pregaussian property than to check the uniform Donsker property directly.

A uniform Donsker class is clearly a universal Donsker class. Thus it is uniformly bounded up to additive constants (Proposition 10.1.1).
Giné and Zinn show (as a corollary of their Proposition 3.1) that the hypotheses of Theorem 10.1.4 (Pollard's entropy condition (10.1.3), together with suitable boundedness and measurability) actually imply that F is uniformly Donsker. Thus most of the examples of universal Donsker classes treated in Sections 10.1 and 10.2 are uniformly Donsker. An exception is the "ellipsoid" universal Donsker class of Proposition 10.1.8; see problem 2 below. Some uniform Donsker classes of functions on ℝ will be defined. For any function f from ℝ into ℝ and 0 < p < ∞, the p-variation v_p(f) is defined as the supremum of all sums Σ_{j=1}^n |f(x_j) − f(x_{j−1})|^p where −∞ < x₀ < x₁ < ⋯ < x_n < +∞ and n = 1, 2, …. For p = 1 this is the ordinary total variation. We have:

Theorem (Dudley, 1992) For each M < ∞ and p with 0 < p < 2, the class F := {f : ℝ → ℝ, v_p(f) ≤ M} is a uniform Donsker class.
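On a finite grid of function values, the supremum defining v_p(f), restricted to partitions through the grid points, can be computed by dynamic programming. A hedged sketch (for p ≥ 1; it gives a lower bound for the true v_p(f), which is a supremum over all partitions):

```python
def grid_p_variation(values, p):
    """Max of sum |f(x_j) - f(x_{j-1})|^p over increasing subsequences of
    the given grid of values of f (a lower bound for v_p(f); assumes p >= 1)."""
    n = len(values)
    best = [0.0] * n  # best[j]: optimal sum over partitions ending at index j
    for j in range(1, n):
        best[j] = max(best[i] + abs(values[j] - values[i]) ** p
                      for i in range(j))
    return max(best)

vals = [0.0, 1.0, -1.0, 0.5]
print(grid_p_variation(vals, 1))  # 4.5: ordinary total variation on the grid
print(grid_p_variation(vals, 2))  # 7.25: the optimal partition may skip points
```

The p = 2 case illustrates why v_p is subtler than total variation: the optimizing partition need not use every grid point.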
Problems

1. Prove the "easy half" of the Giné–Zinn theorem on uniform Donsker classes in Section 10.3, namely: if F is a uniform Donsker class of measurable functions on a measurable space (X, A), show that F is finitely uniformly pregaussian.

2. Let (X, A) be a measurable space and let A(j) be disjoint, nonempty measurable subsets of X. Let E := {Σ_j x_j 1_{A(j)} : Σ_j x_j² ≤ 1}. Show that E is not a uniform Donsker class. Hint: It is enough (applying only the easy direction of the Giné–Zinn theorem from the previous problem) to show E is not finitely uniformly pregaussian. Choose a point y(j) := y_j ∈ A(j) for each j and let Q(N) := N^{−1} Σ_{i=1}^N δ_{y(i)}.

3. Show that for p > 2, {f : ℝ → ℝ, v_p(f) ≤ 1} is not a universal Donsker class. (It is a uniform Donsker class for p < 2.) Hint: Show it is not a Donsker class for U[0, 1]. Consider functions c_n 1_F for F = {X₁, …, X_n} and suitable constants c_n.

4. If 0 < p < 1, f is continuous from ℝ into ℝ, and v_p(f) < ∞, show that f is constant.

5. Prove that the class F in the proof of Proposition 10.1.2 is not a VC subgraph class directly, by showing that the class of all subgraphs of functions in F shatters any finite subset of the diagonal {(x, x) : 0 < x < 1}.
Notes

Note to Section 10.1. This section is based on Dudley (1987).

Notes to Section 10.2. Theorem 10.2.1 is from Dudley (1987). The argument of Maurey used in its proof was published in the proofs of Pisier (1981, Lemma 2) and Carl (1982, Lemma 1). The example showing that 2γ/(2 + γ) is sharp is from Dudley (1967), Propositions 5.8 and 6.12. Carl (1997) showed that one can take t = s in Theorem 10.2.1.
References

Carl, B. (1982). On a characterization of operators from ℓ_q into a Banach space of type p with some applications to eigenvalue problems. J. Funct. Anal. 48, 394–407.

Carl, B. (1997). Metric entropy of convex hulls in Hilbert space. Bull. London Math. Soc. 29, 452–458.

Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Funct. Anal. 1, 290–330.
Dudley, R. M. (1987). Universal Donsker classes and metric entropy. Ann. Probab. 15, 1306–1326.

Dudley, R. M. (1992). Fréchet differentiability, p-variation and uniform Donsker classes. Ann. Probab. 20, 1968–1982.

Giné, Evarist, and Zinn, Joel (1991). Gaussian characterization of uniform Donsker classes of functions. Ann. Probab. 19, 758–782.

Pisier, G. (1981). Remarques sur un résultat non publié de B. Maurey. Séminaire d'Analyse Fonctionnelle 1980–1981, V.1–V.12. École Polytechnique, Centre de Mathématiques, Palaiseau.
11 The Two-Sample Case, the Bootstrap, and Confidence Sets

11.1 The two-sample case

Let X₁, …, X_m, …, Y₁, …, Y_n, … be some random variables taking values in a set A where (A, A) is a measurable space. Thus (X₁, …, X_m) and (Y₁, …, Y_n) are "samples," of which we have two. Let

P_m := m^{−1} Σ_{i=1}^m δ_{X_i},   Q_n := n^{−1} Σ_{j=1}^n δ_{Y_j}.
The object of two-sample tests in statistics is to decide whether P_m and Q_n are empirical measures from the same, but unknown, law (probability measure) on (A, A). Since P is unknown, we cannot directly compare P_m or Q_n to it by forming m^{1/2}(P_m − P) or n^{1/2}(Q_n − P). Instead, P_m and Q_n can be compared to each other, setting

ν_{m,n} := (mn/(m + n))^{1/2}(P_m − Q_n).
The basic hypothesis will be that there are two laws P, Q on (A, A) and a product of two countable products of copies of (A, A) with factor laws P and Q respectively, namely

(Ω, D, Pr) = (Ω₁, B₁, Pr₁) × (Ω₂, B₂, Pr₂),

where

(Ω₁, B₁, Pr₁) = Π_{i=1}^∞ (A_i, A, P),   (Ω₂, B₂, Pr₂) = Π_{j=1}^∞ (B_j, A, Q),

and each A_i and B_j is a copy of A. On these products let X_i be the A_i coordinate and Y_j the B_j coordinate. If P is a class of laws on (A, A), the (P) null hypothesis is that in addition, P = Q ∈ P. A class F of measurable functions
on (A, A) will be called a P-universal Donsker class if it is a P-Donsker class for every P ∈ P.
11.1.1 Theorem Suppose F is a P-universal Donsker class of functions on (A, A). Then for each P ∈ P, under the (P) null hypothesis, ν_{m,n} ⇒ G_P as m, n → ∞.
Proof Let P ∈ P. By Theorem 3.6.6, it will be enough to show that β*(ν_{m,n}, G_P) → 0 as m, n → ∞. Since F is P-Donsker, it is P-pregaussian, so that there exists a coherent G_P process on F. Since F is ρ_P-totally bounded, the functions f ↦ G_P(f)(ω) on F belong to the space UC(F, ρ_P) of all ρ_P-uniformly continuous and thus bounded functions on F. Here UC(F, ρ_P) is a separable subspace of the Banach space ℓ^∞(F). The map ω ↦ G_P(·)(ω) is Borel measurable from the underlying probability space into UC(F, ρ_P): to see this, it is enough by second-countability to show that the inverse image of an open ball {f : ‖f − f₀‖_F < r} is measurable, which is true since on UC(F, ρ_P) the supremum can be taken over a countable ρ_P-dense set in F. Thus G_P has a law (Borel probability measure) on UC(F, ρ_P), as in Theorem 2.5.5. It is easily seen that every coherent G_P process on F has the same law on UC(F, ρ_P) (consider finite subsets increasing up to a countable dense subset of F). Since F is a P-Donsker class, by Theorem 3.5.1 there exist probability spaces (V_i, S_i, μ_i), i = 1, 2, perfect measurable functions g_{ni} from V_i into Ω_i such that μ_i ∘ g_{ni}^{−1} = Pr_i for all n and i = 1, 2, and coherent G_P processes G_P^{(i)} on F × V_i for i = 1, 2 such that m^{1/2}(P_m − P) ∘ g_{m1} → G_P^{(1)} almost uniformly as m → ∞ and n^{1/2}(Q_n − P) ∘ g_{n2} → G_P^{(2)} almost uniformly for ‖·‖_F as n → ∞. Form the product (V, S, μ) := (V₁, S₁, μ₁) × (V₂, S₂, μ₂) and define h_{mn} : V → Ω₁ × Ω₂ by h_{mn}(u, v) := (g_{m1}(u), g_{n2}(v)). Then μ ∘ h_{mn}^{−1} = Pr₁ × Pr₂ for all m and n. We have
ν_{m,n} = (mn)^{1/2}/(m + n)^{1/2} (P_m − P − (Q_n − P))
  = (n/(m + n))^{1/2} m^{1/2}(P_m − P) − (m/(m + n))^{1/2} n^{1/2}(Q_n − P).

Let G_P^{(m,n)} := (n/(m + n))^{1/2} G_P^{(1)} − (m/(m + n))^{1/2} G_P^{(2)}. Now, if Y, Z are independent coherent G_P processes on F and a, b are any real numbers with a² + b² = 1, then aY + bZ is a coherent G_P process on F. To see this, note that aY + bZ is a Gaussian process indexed by F, has the covariances of G_P,
and is coherent. So G_P^{(m,n)} is a coherent G_P process. We have

‖ν_{m,n} ∘ h_{mn} − G_P^{(m,n)}‖_F ≤ (n/(m + n))^{1/2} ‖m^{1/2}(P_m − P) ∘ g_{m1} − G_P^{(1)}‖_F + (m/(m + n))^{1/2} ‖n^{1/2}(Q_n − P) ∘ g_{n2} − G_P^{(2)}‖_F → 0
almost uniformly as m, n → ∞. Let H be a function on ℓ^∞(F) with ‖H‖_BL ≤ 1, so that ‖H‖_∞ ≤ 1 and |H(f) − H(g)| ≤ ‖f − g‖_F for all f, g ∈ ℓ^∞(F). Given ε > 0, there are m₀ < ∞ and n₀ < ∞ such that for m ≥ m₀ and n ≥ n₀, ‖ν_{m,n} ∘ h_{mn} − G_P^{(m,n)}‖_F < ε on a set V_ε ⊂ V with μ(V_ε) > 1 − ε. It follows that d_{m,n} := |H(ν_{m,n} ∘ h_{mn}) − H(G_P^{(m,n)})| ≤ ε on V_ε, and d_{m,n} ≤ 2 everywhere. Since G_P^{(m,n)} has a law defined on the Borel sets of UC(F, ρ_P), thus of ℓ^∞(F) for ‖·‖_F, H(G_P^{(m,n)}) is measurable and

∫* H(ν_{m,n} ∘ h_{mn}) dμ ≤ EH(G_P^{(m,n)}) + 3ε.
It follows that β*(ν_{m,n} ∘ h_{mn}, G_P) → 0 as m, n → ∞. Since each h_{mn} is a measure-preserving, perfect function, it follows that ν_{m,n} ⇒ G_P as m, n → ∞ as in Theorem 3.4.3. □

The classical two-sample situation is the special case where A = ℝ, A is the Borel σ-algebra, F is the set of all indicator functions of half-lines (−∞, x], and P is the set of all continuous (nonatomic) laws on ℝ. Thus

m^{1/2}(P_m − P)((−∞, x]) = m^{1/2}(F_m − F)(x),

where F is the distribution function of P and F_m an empirical distribution function. Here the class F of indicators is a universal Donsker class by any of several previous results, for example Corollary 6.3.17, and G_P(1_{(−∞,x]}) = y_{F(x)} where y is the Brownian bridge (mentioned in Section 1.1). (Actually, F is a uniform Donsker class.) Since F is continuous, it takes all values in the open interval (0, 1), y₀ = y₁ = 0, and y_t → 0 as t ↓ 0 or t ↑ 1. So the distributions of sup_x y_{F(x)} and sup_x |y_{F(x)}| and the joint distribution of (inf_x y_{F(x)}, sup_x y_{F(x)}) do not depend on F, for P ∈ P. Let F_m and G_n be independent empirical distribution functions for the same F. Let H_{mn} := (mn/(m + n))^{1/2}(F_m − G_n). By Theorems 3.6.7 and 3.6.8, which extend straightforwardly to limits as m, n → ∞, the distributions of the supremum, supremum of absolute value,
and supremum minus infimum of H_{mn} converge to those of the same functionals for y_t. Thus we get, for u > 0 (about u = 0, see problem 3):

11.1.2 Corollary If F_m and G_n are independent empirical distribution functions for a continuous distribution on ℝ, then for any u > 0,

(a) lim_{m,n→∞} P(sup_x H_{mn}(x) > u) = exp(−2u²);
(b) lim_{m,n→∞} P(sup_x |H_{mn}(x)| > u) = 2 Σ_{k=1}^∞ (−1)^{k−1} exp(−2k²u²);
(c) lim_{m,n→∞} P(sup_x H_{mn}(x) − inf_y H_{mn}(y) > u) = 2 Σ_{k=1}^∞ (4k²u² − 1) exp(−2k²u²).

Proof The distributions of the given functionals of the Brownian bridge y_t are given in RAP, Propositions 12.3.3, 12.3.4, and 12.3.6. All three are continuous in u for u > 0. Thus convergence follows from RAP, Theorem 9.3.6. □

Note. The statistics sup_x H_{mn}(x) and sup_x |H_{mn}(x)| in (a) and (b) are called Kolmogorov–Smirnov statistics, while sup_x H_{mn}(x) − inf_y H_{mn}(y) is called a Kuiper statistic.
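Limit (b) is easy to probe by simulation. A hedged sketch (sample sizes, seed, and repetition counts are arbitrary choices) compares the empirical tail frequency of sup_x |H_{mn}(x)| with the series 2 Σ_{k≥1} (−1)^{k−1} e^{−2k²u²}:

```python
import math
import random

def ks_two_sample(xs, ys):
    """sup_x |F_m(x) - G_n(x)| for the two empirical distribution functions."""
    m, n = len(xs), len(ys)
    merged = sorted([(x, 0) for x in xs] + [(y, 1) for y in ys])
    fm = gn = 0
    d = 0.0
    for _, tag in merged:
        if tag == 0:
            fm += 1
        else:
            gn += 1
        d = max(d, abs(fm / m - gn / n))
    return d

def limit_tail(u, terms=100):
    # 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 u^2), the limit in (b)
    return 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * u * u)
                   for k in range(1, terms + 1))

random.seed(2)
m = n = 200
u = 1.0
reps = 400
hits = 0
for _ in range(reps):
    xs = [random.random() for _ in range(m)]
    ys = [random.random() for _ in range(n)]
    hmn = math.sqrt(m * n / (m + n)) * ks_two_sample(xs, ys)
    hits += hmn > u
print(hits / reps, limit_tail(u))  # both should be near 0.27
```

Since the limit distribution does not depend on the (continuous) underlying F, uniform samples suffice for the check.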
11.2 A bootstrap central limit theorem in probability

Iterating the operation by which we get an empirical measure P_n from a law P, we form the bootstrap empirical measure P_n^B by sampling n independent points whose distribution is the empirical measure P_n. The bootstrap was first introduced in nonparametric statistics, where the law P is unknown and we want to make inferences about it from the observed P_n. This can be done by way of bootstrap central limit theorems, which say that under some conditions, n^{1/2}(P_n^B − P_n) behaves like n^{1/2}(P_n − P) and both behave like G_P.
Let (S, S, P) be a probability space and F a class of real-valued measurable functions on S. Let as usual X₁, X₂, … be coordinates on a countable product of copies of (S, S, P). Then let X₁^B, …, X_n^B be independent with distribution P_n. Let

P_n^B := n^{−1} Σ_{j=1}^n δ_{X_j^B}.

Then P_n^B will be called a bootstrap empirical measure. A statistician has a data set, represented by a fixed P_n, and estimates the distribution of P_n^B by repeated resampling from the same P_n. So we are interested not so much in the unconditional distribution of P_n^B as P_n varies, but
rather in the conditional distribution of P_n^B given P_n or (X₁, …, X_n). Let ν_n^B := n^{1/2}(P_n^B − P_n). The limit theorems will be formulated in terms of the dual-bounded-Lipschitz "metric" β of Section 3.6, which metrizes convergence in distribution for not necessarily measurable random elements of a possibly nonseparable metric space (S, d), to a limit which is a measurable random variable with separable range. Let β_F be the β distance where d is the metric defined by the norm ‖·‖_F.

Definition. Let (S, S, P) be a probability space and F a class of measurable real-valued functions on S. Then the bootstrap central limit theorem holds in probability (resp. almost surely) for P and F if and only if F is pregaussian for P and β_F(ν_n^B, G_P), conditional on X₁, …, X_n, converges to 0 in outer probability (resp. almost uniformly) as n → ∞.

In other words, the bootstrap central limit theorem holds in probability if and only if for any ε > 0 there is an n₀ large enough so that for any n ≥ n₀, there is a set A of values of X^(n) := (X₁, …, X_n) with P^n(A) < ε such that if X^(n) ∉ A, and ν_n^B(X^(n))(·) is ν_n^B conditional on X^(n), we have β_F(ν_n^B(X^(n))(·), G_P) < ε. A main bootstrap limit theorem will be stated. The rest of this section will be devoted to proving it, with a number of lemmas and auxiliary theorems.
11.2.1 Theorem (Giné and Zinn) Let (X, A, P) be any probability space. Then the bootstrap central limit theorem holds in probability for P and F if F is a Donsker class for P.

Remarks. Giné and Zinn (1990), see also Giné (1997), also proved "only if" under a measurability condition and proved a corresponding almost sure form of the theorem where F has an L² envelope up to additive constants. See Section 11.4 for detailed statements (but not proofs) of these other theorems.
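Before the proof, the one-dimensional situation (Theorem 11.2.2 below) can be illustrated by direct resampling. A hedged sketch (the distribution, sample size, and tail point are arbitrary illustrative choices) compares a conditional bootstrap tail probability with its normal limit:

```python
import math
import random

random.seed(3)
n = 2000
sample = [random.expovariate(1.0) for _ in range(n)]  # Exp(1), so sigma^2 = 1
xbar = sum(sample) / n

# conditional distribution, given the sample, of n^(-1/2) sum_j (X^B_j - xbar),
# approximated by repeated resampling from the fixed empirical measure P_n
reps = 2000
boot = []
for _ in range(reps):
    resample = random.choices(sample, k=n)
    boot.append((sum(resample) - n * xbar) / math.sqrt(n))

frac = sum(b > 1.0 for b in boot) / reps
normal_tail = 0.5 * math.erfc(1.0 / math.sqrt(2))  # P(N(0,1) > 1) ~ 0.1587
print(frac, normal_tail)  # close for large n, as the theorem asserts
```

Here the whole computation is conditional on one fixed sample, which is exactly the conditional-distribution viewpoint of the definition above.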
Proof $E_B$, $\Pr_B$, and $\mathcal L_B$ will denote the conditional expectation, probability, and law given the sample $X^{(n)} := (X_1, \ldots, X_n)$. Given the sample, $\nu_n^B$ has only finitely many possible values. First, a finite-dimensional bootstrap central limit theorem is needed.
11.2.2 Theorem Let $X_1, X_2, \ldots$ be i.i.d. random variables with values in $\mathbb R^d$ and let $X^B_{n,j}$, $j = 1, \ldots, n$, be i.i.d. $(P_n)$, where $P_n := n^{-1}\sum_{j=1}^n \delta_{X_j}$. Let $\overline X_n := n^{-1}\sum_{j=1}^n X_j$. Assume that $E|X_1|^2 < \infty$. Let $C$ be the covariance matrix of $X_1$, $C_{rs} := E(X_{1r}X_{1s}) - E(X_{1r})E(X_{1s})$. Then for the usual convergence of laws in $\mathbb R^d$, almost surely as $n \to \infty$,

(11.2.3) $\qquad \mathcal L_B\Bigl(n^{-1/2}\sum_{j=1}^n \bigl(X^B_{n,j} - \overline X_n\bigr)\Bigr) \to N(0, C).$
Proof Note: $N(0,0)$ is defined as a point mass at 0. Suppose the theorem holds for $d = 1$. Then for each $t \in \mathbb R^d$, the theorem holds for the inner products $(t, X_j)$ and thus $(t, X^B_{n,j})$, and so on. This implies that the characteristic functions of the laws on the left in (11.2.3) converge to that of $N(0, C)$, and thus that the laws converge, by the Lévy continuity theorem (RAP, Theorem 9.8.2). (This last argument is sometimes called the "Cramér–Wold device.") So, we can assume $d = 1$, and on the right in (11.2.3) we have a law $N(0, \sigma^2)$ where $\sigma^2$ is the variance of $X_1$. If $\sigma^2 = 0$ there is no problem, so assume $\sigma^2 > 0$. We can assume that $EX_1 = 0$, since subtracting $EX_1$ from all $X_j$ doesn't change the expression on the left in (11.2.3). Let

$$s_n^2 := n^{-1}\sum_{j=1}^n (X_j - \overline X_n)^2, \qquad s_n := (s_n^2)^{1/2}.$$
Then by the strong law of large numbers for the $X_j$ and the $X_j^2$, $s_n^2$ converges a.s. as $n \to \infty$ to $\sigma^2$. For a given $n$ and sample $X^{(n)}$, $\overline X_n$ is fixed and the variables

$$Y_{nj} := (X^B_{n,j} - \overline X_n)/n^{1/2}$$

for $j = 1, \ldots, n$ are i.i.d. We have a triangular array and will apply the Lindeberg theorem (RAP, Theorem 9.6.1). For each $n$ and $j$, $E_B Y_{nj} = 0$. Each $Y_{nj}$ has variance $\mathrm{Var}_B(Y_{nj}) = s_n^2/n$. The sum from $j = 1$ to $n$ of these variances is $s_n^2$. Let $Z_{nj} := Y_{nj}/s_n$. Then the $Z_{nj}$ remain i.i.d. given $P_n$, and the sum of their variances is 1. We would like to show by the Lindeberg theorem that $\mathcal L_B(\sum_{j=1}^n Z_{nj})$ converges to $N(0,1)$ as $n \to \infty$, for almost all sequences $X_1, X_2, \ldots$. We have $E_B(Z_{nj}) = 0$. Let $1_{\{\cdots\}}$ denote the indicator function of an event $\{\cdots\}$. For $\varepsilon > 0$ let $E_{nj\varepsilon} := E_B(|Z_{nj}|^2 1_{\{|Z_{nj}| > \varepsilon\}})$. It remains to show that $\sum_{j=1}^n E_{nj\varepsilon} = n E_{n1\varepsilon} \to 0$ as $n \to \infty$, almost surely. Now

$$A_{nj} := \{|Z_{nj}| > \varepsilon\} = \bigl\{|X^B_{n,j} - \overline X_n| > n^{1/2}s_n\varepsilon\bigr\},$$

which for $n$ large enough, almost surely, is included in the set $C(n,j) := C_{nj} := \{|X^B_{n,j}| > n^{1/2}\sigma\varepsilon/2\}$. Also, $Z_{nj}^2 = (X^B_{n,j} - \overline X_n)^2/(ns_n^2) \le 2[(X^B_{n,j})^2 + \overline X_n^2]/(ns_n^2)$. Then $\overline X_n^2/s_n^2$, which is constant given the sample, approaches 0 a.s. as $n \to \infty$ since $EX_1 = 0$,
and so does its integral over $C_{nj}$. For the previous term,

$$E_B\bigl((X^B_{n,1})^2 1_{C(n,1)}\bigr)\big/(ns_n^2) = n^{-2}\sum_{i=1}^n X_i^2\, 1_{\{|X_i| > n^{1/2}\sigma\varepsilon/2\}}\big/s_n^2,$$

which, multiplied by $n$, still goes to 0 a.s. as $n \to \infty$, by the strong law of large numbers for the variables $X_i^2 1_{\{|X_i| > K\}}$ for any fixed $K$, then letting $K \to \infty$. Theorem 11.2.2 is proved.

Here is a desymmetrization fact:
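The content of Theorem 11.2.2 in dimension $d = 1$ can be checked by simulation: conditionally on a large sample, the bootstrap law of $n^{-1/2}\sum_j (X^B_{n,j} - \overline X_n)$ should be close to $N(0, \sigma^2)$, with $s_n^2 \to \sigma^2$ a.s. A hedged sketch (the distribution and sizes are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.exponential(size=n)      # i.i.d. with EX = 1 and Var X = sigma^2 = 1
xbar, s2 = x.mean(), x.var()     # s_n^2 = n^{-1} sum_j (X_j - Xbar_n)^2

# Conditional (bootstrap) law of n^{-1/2} sum_j (X^B_j - Xbar_n), given the sample:
reps = np.array([
    np.sqrt(n) * (rng.choice(x, size=n, replace=True).mean() - xbar)
    for _ in range(2000)
])
```

By the theorem, `reps` is approximately $N(0, s_n^2)$ conditionally on the data, and $s_n^2$ is close to $\sigma^2 = 1$ here.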
11.2.4 Lemma Let $T$ be a set, and for any real-valued function $f$ on $T$ let $\|f\|_T := \sup_{t\in T}|f(t)|$. Let $X$ and $Y$ be two stochastic processes indexed by $t \in T$, defined on a probability space $(\Omega\times\Omega', \mathcal S\otimes\mathcal S', P\times P')$, where $X(t)(\omega,\omega')$ depends only on $\omega \in \Omega$, and $Y(t)(\omega,\omega')$ only on $\omega' \in \Omega'$. Then

(a) for any $s > 0$ and any $u > 0$ such that $\sup_{t\in T}\Pr\{|Y(t)| > u\} < 1$, we have

$$\Pr{}^*(\|X\|_T > s) \le \Pr{}^*\{\|X - Y\|_T > s - u\}\Big/\Bigl[1 - \sup_{t\in T}\Pr\{|Y(t)| > u\}\Bigr];$$

(b) if $\theta \ge \sup_{t\in T}E(Y(t)^2)$ then for any $s > 0$,

$$\Pr{}^*(\|X\|_T > s) \le 2\Pr{}^*\bigl(\|X - Y\|_T > s - (2\theta)^{1/2}\bigr).$$

Proof Let $A(s) := \{\omega: \|X(\omega)\|_T > s\}$. For part (a) we have, by (3.2.8),

$$\Pr{}^*(\|X - Y\|_T > s - u) \ge E_P(P')^*(\|X - Y\|_T > s - u) \ge P^*(\|X\|_T > s)\inf_{\omega\in A(s)}(P')^*(\|X(\omega) - Y\|_T > s - u) \ge P^*(\|X\|_T > s)\inf_{t\in T}P'(|Y(t)| \le u),$$

since if $\|X(\omega)\|_T > s$ then for some $t \in T$, $|X(\omega,t)| > s$, and then $|Y(t)| \le u$ implies $\|X(\omega) - Y\|_T > s - u$. This proves part (a). Then part (b) follows by Chebyshev's inequality.
Now $\varepsilon_i$ will denote i.i.d. Rademacher variables, that is, variables with the distribution $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$, defined on a probability space $(\Omega_\varepsilon, \mathcal S_\varepsilon, P_\varepsilon)$, which we can take as $[0,1]$ with Lebesgue measure. We also take the product of $(\Omega_\varepsilon, \mathcal S_\varepsilon, P_\varepsilon)$ with the probability space on which other random variables and elements are defined. Next, some hypotheses will be given for later reference.
Let $(S, \mathcal S, P)$ be a probability space and $(S^n, \mathcal S^n, P^n)$ a Cartesian product of $n$ copies of $(S, \mathcal S, P)$. Let $\mathcal F \subset \mathcal L^2(S, \mathcal S, P)$. Let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher variables defined on a probability space $(\Omega', \mathcal A', P')$. Then take the probability space

(11.2.5) $\qquad (S^n \times \Omega',\ \mathcal S^n \otimes \mathcal A',\ P^n \times P').$

References to (11.2.5) will also be to the preceding paragraph. Here is another fact on symmetrization and desymmetrization.
11.2.6 Lemma Under (11.2.5), for any $t > 0$ and $n = 1, 2, \ldots$,

(a) $\Pr^*\bigl(\bigl\|\sum_{j=1}^n \varepsilon_j f(X_j)\bigr\|_{\mathcal F} > t\bigr) \le 2\max_{k\le n}\Pr^*\bigl(\bigl\|\sum_{i=1}^k f(X_i)\bigr\|_{\mathcal F} > t/2\bigr)$;

(b) if $\sigma^2 \ge \sup_{f\in\mathcal F}\mathrm{Var}(f(X_1))$, then

$$\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F} > t\Bigr) \le 4\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} > \bigl(t - (2n)^{1/2}\sigma\bigr)/2\Bigr).$$

Proof For part (a), let $E(n) := \{\tau = \{\tau_i\}_{i=1}^n:\ \tau_i = \pm 1$ for each $i\}$. Let $\varepsilon_{(n)} := \{\varepsilon_i\}_{i=1}^n$. Then

$$P_t := \Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} > t\Bigr) = \sum_{\tau\in E(n)}\Pr{}^*\Bigl(\{\varepsilon_{(n)} = \tau\}\times\Bigl\{\Bigl\|\sum_{i:\,\tau_i=1} f(X_i) - \sum_{i:\,\tau_i=-1} f(X_i)\Bigr\|_{\mathcal F} > t\Bigr\}\Bigr).$$

By Lemma 3.2.4 and Theorem 3.2.1, for laws $\mu, \nu$ and sets $A, B$, $(\mu\times\nu)^*(A\times B) = \mu^*(A)\nu^*(B) = \mu(A)\nu^*(B)$ if $A$ is measurable, so

$$P_t \le \sum_{\tau\in E(n)} 2^{-n}\Bigl[\Pr{}^*\Bigl(\Bigl\|\sum_{i:\,\tau_i=1} f(X_i)\Bigr\|_{\mathcal F} > t/2\Bigr) + \Pr{}^*\Bigl(\Bigl\|\sum_{i:\,\tau_i=-1} f(X_i)\Bigr\|_{\mathcal F} > t/2\Bigr)\Bigr] \le 2\max_{k\le n}\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^k f(X_i)\Bigr\|_{\mathcal F} > t/2\Bigr),$$

since the $X_i$ are i.i.d.
This proves (a). For (b), take a further product, so that we have (11.2.5) with $2n$ in place of $n$. Apply Lemma 11.2.4(b) to the process $f \mapsto \sum_{i=1}^n f(X_i)$, letting $Y_i := X_{n+i}$ for $i = 1, \ldots, n$. We get, for any $\tau \in E(n)$,

$$C_t := \Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F} > t\Bigr) \le 2\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \bigl(f(X_i) - f(Y_i)\bigr)\Bigr\|_{\mathcal F} > t - (2n)^{1/2}\sigma\Bigr) = 2(P^{2n})^*\Bigl(\Bigl\|\sum_{i=1}^n \tau_i\bigl(f(X_i) - f(Y_i)\bigr)\Bigr\|_{\mathcal F} > t - (2n)^{1/2}\sigma\Bigr)$$

by symmetry of the $f(X_i) - f(Y_i)$. Then by (3.2.9),

$$C_t \le 2E_{P_\varepsilon}(P^{2n})^*\Bigl(\Bigl\|\sum_{i=1}^n \varepsilon_i\bigl(f(X_i) - f(Y_i)\bigr)\Bigr\|_{\mathcal F} > t - (2n)^{1/2}\sigma\Bigr) \le 4\Pr{}^*\Bigl(\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} > \bigl[t - (2n)^{1/2}\sigma\bigr]/2\Bigr),$$

finishing the proof.
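For a finite class $\mathcal F$ (so that the suprema are measurable and no outer probabilities are needed), part (a) of the lemma can be probed by Monte Carlo. The setup below is my own illustration: $\mathcal F$ consists of two $P$-centered functions of a standard normal variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, t = 50, 3000, 10.0
c = np.exp(-0.5)                        # E cos(Z) for Z ~ N(0, 1)
fs = [np.sin, lambda u: np.cos(u) - c]  # two P-centered functions

lhs_events = np.zeros(trials, dtype=bool)
partial_sups = np.zeros((trials, n))    # ||sum_{i<=k} f(X_i)||_F for k = 1..n
for m in range(trials):
    x = rng.standard_normal(n)
    eps = rng.choice([-1.0, 1.0], size=n)
    vals = np.stack([f(x) for f in fs], axis=1)             # shape (n, 2)
    lhs_events[m] = np.abs((eps[:, None] * vals).sum(0)).max() > t
    partial_sups[m] = np.abs(np.cumsum(vals, axis=0)).max(axis=1)

lhs = lhs_events.mean()                              # Pr(||sum eps_j f(X_j)||_F > t)
rhs = 2 * (partial_sups > t / 2).mean(axis=0).max()  # 2 max_k Pr(||S_k||_F > t/2)
```

With these choices `lhs` is far below `rhs`, consistent with (a); the inequality is of course far from sharp here.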
Next, we have some consequences or forms of Jensen's inequality.
11.2.7 Lemma (a) Let $(S, \mathcal S, P)$ be a probability space and $\mathcal F \subset \mathcal L^1(S, \mathcal S, P)$. Then $\|Ef\|_{\mathcal F} \le E\|f\|_{\mathcal F}^*$.

(b) On a product space $(A, \mathcal A, P)\times(A, \mathcal A, Q)$ let $X$ and $Y$ be coordinate functions. Let $\mathcal F$ be a class of real-valued measurable functions on $(A, \mathcal A)$. If $Qf = 0$ for all $f \in \mathcal F$, then

(11.2.8) $\qquad E^*\|f(X)\|_{\mathcal F} \le E^*\|f(X) + f(Y)\|_{\mathcal F}.$

Proof (a) For each $g \in \mathcal F$, $|Eg| \le E|g|$, while $E|g| \le E\|f\|_{\mathcal F}^*$. Then, taking the supremum over $g \in \mathcal F$ proves (a).

For (b), recall that for any function of $X$, $E^*$ for $P$ and $P\times Q$ are the same by Lemma 3.2.4 and Theorem 3.2.1. For any value of $x$ and $f \in \mathcal F$, $z \mapsto |f(x) + z|$ is a convex function of $z$. Letting $Z = f(Y)$, we have for each $x$, by Jensen's inequality,

$$|f(x)| = |f(x) + Ef(Y)| \le E_Y|f(x) + f(Y)|,$$

and so

$$E_X^*\|f(X)\|_{\mathcal F} \le E_X^*\bigl(\sup_{f\in\mathcal F}E_Y|f(X) + f(Y)|\bigr).$$

For each $g \in \mathcal F$, $|g(X) + g(Y)| \le \|f(X) + f(Y)\|_{\mathcal F}^*$, and taking the supremum over $g \in \mathcal F$ gives

$$E_X^*\|f(X)\|_{\mathcal F} \le E_X^* E_Y\|f(X) + f(Y)\|_{\mathcal F}^* = E^*\|f(X) + f(Y)\|_{\mathcal F}$$

by the Tonelli–Fubini theorem, since $\|f(X) + f(Y)\|_{\mathcal F}^*$ is measurable. The following lemma and theorem are known as Hoffmann–Jørgensen inequalities.
11.2.9 Lemma Let $X_1, \ldots, X_n$ be coordinates on a product probability space $(S^n, \mathcal S^n, \prod_{j=1}^n P_j)$. Let $\mathcal F$ be a class of measurable real-valued functions on $(S, \mathcal S)$. Let $S_k(f) := \sum_{i=1}^k f(X_i)$ for $k = 1, \ldots, n$. Then for any $s > 0$ and $t > 0$,

(11.2.10) $\quad \Pr\bigl(\max_{k\le n}\|S_k(f)\|_{\mathcal F}^* > 3t + s\bigr) \le \bigl(\Pr\{\max_{k\le n}\|S_k(f)\|_{\mathcal F}^* > t\}\bigr)^2 + \Pr\bigl(\max_{k\le n}\|f(X_k)\|_{\mathcal F}^* > s\bigr).$

Proof Let $\|\cdot\| := \|\cdot\|_{\mathcal F}$. Let $\tau := \min\{j \le n:\ \|S_j\|^* > t\}$, or $\tau := n + 1$ if there is no such $j \le n$. Then for $j = 1, \ldots, n$, $\{\tau = j\}$ is, by Lemma 3.2.4, a measurable function of $X_1, \ldots, X_j$, and $\{\max_{k\le n}\|S_k\|^* > 3t + s\}$ is the union over $j \le n$ of the disjoint events $\{\tau = j,\ \max_{k\le n}\|S_k\|^* > 3t + s\}$. On $\{\tau = j\}$ we have $\|S_{j-1}\|^* \le t$, so for $k \ge j$,

$$\|S_k\|^* \le \|S_{j-1}\|^* + \|f(X_j)\|^* + \|S_k - S_j\|^* \le t + \|f(X_j)\|^* + \max_{j<k\le n}\|S_k - S_j\|^*.$$

For $j < k \le n$, by Lemma 3.2.4, $\|S_k - S_j\|^*$ is a measurable function of $X_{j+1}, \ldots, X_k$ and thus is independent of $\{\tau = j\}$. So

$$\Pr\bigl(\tau = j,\ \max_{k\le n}\|S_k\|^* > 3t + s\bigr) \le \Pr\bigl(\tau = j,\ \max_{k\le n}\|f(X_k)\|^* > s\bigr) + \Pr(\tau = j)\Pr\bigl(\max_{j<k\le n}\|S_k - S_j\|^* > 2t\bigr),$$

and since $\|S_k - S_j\|^* \le \|S_k\|^* + \|S_j\|^*$, the last factor is at most $\Pr(\max_{k\le n}\|S_k\|^* > t)$. Summing over $j \le n$, and noting $\sum_j\Pr(\tau = j) \le \Pr(\max_{k\le n}\|S_k\|^* > t)$, gives (11.2.10).
11.2.11 Theorem Let $0 < p < \infty$, $n = 1, 2, \ldots$, and let $X_1, \ldots, X_n$ be coordinates on a product probability space $(S^n, \mathcal S^n, \prod_{j=1}^n P_j)$. Let $\mathcal F$ be a class of measurable real-valued functions on $(S, \mathcal S)$ such that for $i = 1, \ldots, n$, $E\bigl((\|f(X_i)\|_{\mathcal F}^*)^p\bigr) < \infty$. Let

$$u := \inf\Bigl\{t > 0:\ \Pr\Bigl(\max_{k\le n}\Bigl\|\sum_{i=1}^k f(X_i)\Bigr\|_{\mathcal F}^* > t\Bigr) \le 1/(2\cdot 4^p)\Bigr\}.$$

Then

$$E\Bigl(\max_{k\le n}\Bigl\|\sum_{i=1}^k f(X_i)\Bigr\|_{\mathcal F}^*\Bigr)^p \le 2\cdot 4^p\, E\bigl(\max_{i\le n}(\|f(X_i)\|_{\mathcal F}^*)^p\bigr) + 2(4u)^p.$$
Proof Recall that for any random variable $Y \ge 0$ with law $Q$,

$$\int_0^\infty Q(Y > t)\,dt = \int_0^\infty\!\!\int_0^y dt\,dQ(y) = \int_0^\infty y\,dQ(y) = EY.$$

Let $\|\cdot\| := \|\cdot\|_{\mathcal F}$ and $T := \max_{k\le n}\bigl\|\sum_{i=1}^k f(X_i)\bigr\|^*$. Then

$$E(T^p) = 4^p\int_0^\infty \Pr(T > 4t)\,d(t^p) = 4^p\Bigl(\int_0^u + \int_u^\infty\Bigr)\Pr(T > 4t)\,d(t^p) \le (4u)^p + 4^p\int_u^\infty \Pr(T > 4t)\,d(t^p),$$

which by (11.2.10) with $s = t$ is less than or equal to

$$(4u)^p + 4^p\int_u^\infty \bigl(\Pr(T > t)\bigr)^2\,d(t^p) + 4^p\int_0^\infty \Pr\bigl(\max_{i\le n}\|f(X_i)\|^* > t\bigr)\,d(t^p)$$
$$\le (4u)^p + 4^p\Pr(T > u)\int_0^\infty \Pr(T > t)\,d(t^p) + 4^p E\max_{i\le n}\bigl(\|f(X_i)\|^*\bigr)^p \le (4u)^p + \tfrac12 E(T^p) + 4^p E\max_{i\le n}\bigl(\|f(X_i)\|^*\bigr)^p,$$

since

$$4^p\Pr(T > u) \le 1/2$$

by choice of $u$. Then since $E(T^p) \le E\bigl(\sum_{i=1}^n \|f(X_i)\|^*\bigr)^p < \infty$, subtracting $\frac12 E(T^p)$ from both sides and multiplying by 2 gives the conclusion. The theorem is proved. Next is another symmetrization-desymmetrization inequality.
11.2.12 Lemma Under (11.2.5),

(11.2.13) $\quad \tfrac12 E^*\Bigl\|\sum_{j=1}^n \varepsilon_j\bigl(f(X_j) - Pf\bigr)\Bigr\|_{\mathcal F} \le E^*\Bigl\|\sum_{j=1}^n \bigl(f(X_j) - Pf\bigr)\Bigr\|_{\mathcal F} \le 2E^*\Bigl\|\sum_{j=1}^n \varepsilon_j\bigl(f(X_j) - Pf\bigr)\Bigr\|_{\mathcal F},$

which also holds if the $Pf$ in the last expression is deleted.
Proof Replacing $\mathcal F$ by $\{f - Pf:\ f \in \mathcal F\}$, we can assume $Pf = 0$ for all $f \in \mathcal F$. Then by (3.2.9),

$$E^*\Bigl\|\sum_{j=1}^n \varepsilon_j f(X_j)\Bigr\|_{\mathcal F} = E_\varepsilon E_X^*\Bigl\|\sum_{i:\,\varepsilon_i=1} f(X_i) - \sum_{i:\,\varepsilon_i=-1} f(X_i)\Bigr\|_{\mathcal F} \le E_\varepsilon E_X^*\Bigl(\Bigl\|\sum_{i:\,\varepsilon_i=1} f(X_i)\Bigr\|_{\mathcal F} + \Bigl\|\sum_{i:\,\varepsilon_i=-1} f(X_i)\Bigr\|_{\mathcal F}\Bigr) \le 2E^*\Bigl\|\sum_{j=1}^n f(X_j)\Bigr\|_{\mathcal F}$$

by (11.2.8). The left side of (11.2.13) follows. For the right side, extend (11.2.5) by taking the Cartesian product with another copy of $(S^n, \mathcal S^n, P^n)$ and let the new coordinate functions be $\xi_1, \ldots, \xi_n$. Then if $\tau_i = \pm 1$ for each $i$,

$$E^*\Bigl\|\sum_{j=1}^n f(X_j)\Bigr\|_{\mathcal F} \le E^*\Bigl\|\sum_{j=1}^n \bigl(f(X_j) - f(\xi_j)\bigr)\Bigr\|_{\mathcal F} = E^*\Bigl\|\sum_{j=1}^n \tau_j\bigl(f(X_j) - f(\xi_j)\bigr)\Bigr\|_{\mathcal F},$$

since (11.2.8) holds as well if $+$ is replaced by $-$, and by symmetry of the $f(X_j) - f(\xi_j)$. So by (3.2.8),

$$E^*\Bigl\|\sum_{j=1}^n f(X_j)\Bigr\|_{\mathcal F} \le E_\varepsilon E_{X,\xi}^*\Bigl\|\sum_{i=1}^n \varepsilon_i\bigl(f(X_i) - f(\xi_i)\bigr)\Bigr\|_{\mathcal F} \le 2E_\varepsilon E_X^*\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} = 2E^*\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F},$$
which finishes the proof of the lemma. Next will be some Poissonization facts. Recall that any real-valued stochastic process $X(t)$, $t \in T$, indexed by a set $T$, has a law $P_X$ defined on the space $\mathbb R^T$ of all real-valued functions on $T$, with the smallest $\sigma$-algebra for which the coordinate projections $f \mapsto f(t)$ are measurable for each $t \in T$. The process is called centered if $EX(t) = 0$ for all $t \in T$. Recall that $Y$ has a Poisson distribution with parameter $\lambda > 0$ if and only if $P(Y = k) = e^{-\lambda}\lambda^k/k!$ for $k = 0, 1, 2, \ldots$. Let $x\wedge y := \min(x, y)$.

11.2.14 Lemma Let $T$ be any set. For each $i = 1, \ldots, n$, let $X(i) := X_i$ be a centered, real-valued stochastic process indexed by $T$. Let $X_{i,j}$, $j = 1, 2, \ldots$, be independent copies of $X_i$, taken as coordinates on a product, say $\Lambda$, of copies of $(\mathbb R^T, P_{X(i)})$. Let $N_i := N(i)$, $i = 1, 2, \ldots$, be i.i.d. Poisson variables with parameter $\lambda = 1$, defined on a probability space $(\Omega', P')$, and take the product of this space with $\Lambda$, so that $N_1, N_2, \ldots$ are independent of the $X_{i,j}$. Let $\|f\|_T := \sup_{t\in T}|f(t)|$. Then

(11.2.15) $\qquad E^*\Bigl\|\sum_{i=1}^n X_{i,1}\Bigr\|_T \le \dfrac{e}{e-1}\,E^*\Bigl\|\sum_{i=1}^n\sum_{j=1}^{N(i)} X_{i,j}\Bigr\|_T.$
Proof By Jensen's inequality, Lemma 11.2.7(a), letting $E_N$ be expectation with respect to the Poisson variables $N_i$, we have

$$M_n := \frac{e-1}{e}\,E^*\Bigl\|\sum_{i=1}^n X_{i,1}\Bigr\|_T = E_X^*\Bigl\|\sum_{i=1}^n E(N_i\wedge 1)X_{i,1}\Bigr\|_T \le E_X^* E_N\Bigl\|\sum_{i=1}^n (N_i\wedge 1)X_{i,1}\Bigr\|_T,$$

since $E(N_i\wedge 1) = P(N_i \ge 1) = 1 - e^{-1}$. Then by (3.2.9), $M_n \le E_N E_X^*\|\sum_{i=1}^n (N_i\wedge 1)X_{i,1}\|_T$. For each $i$ and given $N_i = N(i)$,

$$\sum_{j=1}^{N(i)} X_{i,j} - (N_i\wedge 1)X_{i,1} = \sum_{j=2}^{N(i)} X_{i,j}$$

($= 0$ for $N_i \le 1$), which is a centered process independent of $(N_i\wedge 1)X_{i,1}$. Thus for any values of $N_1, \ldots, N_n$, applying another form of Jensen's inequality, Lemma 11.2.7(b), we get $M_n \le E_N E_X^*\bigl\|\sum_{i=1}^n\sum_{j=1}^{N(i)} X_{i,j}\bigr\|_T$. Then by (3.2.9), $E_N E_X^*$ can be replaced by $E^*$ and the lemma is proved.

For any two finite signed measures $\mu$ and $\nu$ on the Borel sets of a separable Banach space $B$, recall that the convolution $\mu * \nu$ is defined by $(\mu * \nu)(A) := \int \mu(A - x)\,d\nu(x)$ for any Borel set $A$. Here convolution is commutative and associative. For any finite signed measure $\mu$ on $B$ and $k = 1, 2, \ldots$, let $\mu^k$ be the $k$th convolution power $\mu * \cdots * \mu$ with $k$ factors. Let $\mu^0 := \delta_0$. Let $e^\mu := \exp(\mu) := \sum_{k=0}^\infty \mu^k/k!$. If $\mu \ge 0$ let $\mathrm{Pois}(\mu) := e^{-\mu(B)}e^\mu$. If $\mu$ and $\nu$ are two finite measures on $B$ it is straightforward to check that $\mathrm{Pois}(\mu + \nu) = \mathrm{Pois}(\mu)*\mathrm{Pois}(\nu)$. If $X$ is a measurable function from a probability space $(\Omega, \mathcal S, P)$ into a measurable space $(S, \mathcal A)$, recall that the law of $X$ is the image measure $\mathcal L(X) := P\circ X^{-1}$ on $\mathcal A$. For any $c > 0$ and $x \in B$, $\mathrm{Pois}(c\delta_x) = \mathcal L(N_c x)$ where $N_c$ is a Poisson random variable with parameter $c$.
Also, if $\mathcal L(X_{i,j}) = \mu_i$ for $j = 1, 2, \ldots$, and $N_i := N(i)$ are Poisson with parameter 1, where all the $X_{i,j}$ and $N_r$ are jointly independent, $i, r = 1, \ldots, k$, $j = 1, 2, \ldots$, then by induction on $k$,

$$\mathcal L\Bigl(\sum_{i=1}^k\sum_{j=1}^{N(i)} X_{i,j}\Bigr) = \mathrm{Pois}\Bigl(\sum_{r=1}^k \mu_r\Bigr).$$

If $X_j$ are random variables with values in a separable Banach space with $EX_j = 0$ for each $j = 1, \ldots, n$, then the depoissonization inequality (11.2.15) gives

(11.2.16) $\qquad E\Bigl\|\sum_{j=1}^n X_j\Bigr\| \le \dfrac{e}{e-1}\displaystyle\int \|x\|\,d\,\mathrm{Pois}\Bigl(\sum_{j=1}^n \mathcal L(X_j)\Bigr)(x).$
Here is another depoissonization inequality.
11.2.17 Lemma Let $(B, \|\cdot\|)$ be a normed space. For each $n = 1, 2, \ldots$, let $v_1, \ldots, v_n$ be $n$ distinct points of $B$ and $\bar v := (v_1 + \cdots + v_n)/n$. Let $V_1, \ldots, V_n$ be i.i.d. $B$-valued random variables with $\Pr(V_i = v_j) = 1/n$ for $i, j = 1, \ldots, n$. Let $N_1, \ldots, N_n$ be Poisson variables with parameter 1. Let all the $V_i$ and $N_j$ be jointly independent. Then

(11.2.18) $\qquad E\Bigl\|\sum_{j=1}^n (V_j - \bar v)\Bigr\| \le \dfrac{e}{e-1}\,E\Bigl\|\sum_{j=1}^n (N_j - 1)(v_j - \bar v)\Bigr\|.$

Proof The variables $V_i$ and $N_i$ are discrete, so the norms in (11.2.18) are both measurable random variables. Next, with $v(j) := v_j$,

$$\sum_{j=1}^n \mathcal L(V_j - \bar v) = n\mathcal L(V_1 - \bar v) = \sum_{j=1}^n \delta_{v(j)-\bar v}.$$

It follows from the above properties of Poissonization that

(11.2.19) $\quad \mathrm{Pois}\Bigl(\sum_{j=1}^n \mathcal L(V_j - \bar v)\Bigr) = \mathrm{Pois}(\delta_{v(1)-\bar v})*\cdots*\mathrm{Pois}(\delta_{v(n)-\bar v}) = \mathcal L\Bigl(\sum_{j=1}^n N_j(v_j - \bar v)\Bigr) = \mathcal L\Bigl(\sum_{j=1}^n (N_j - 1)(v_j - \bar v)\Bigr),$

using $\sum_{j=1}^n (v_j - \bar v) = 0$. Then by (11.2.16) with $X_j = V_j - \bar v$, the lemma is proved.
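Inequality (11.2.18) is easy to probe numerically in $B = \mathbb R$ with $\|\cdot\|$ the absolute value. This is a sanity check, not part of the proof; the constants and sample sizes are my own choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 20000
v = np.arange(1.0, n + 1)       # n distinct points v_1, ..., v_n in B = R
vbar = v.mean()

lhs_draws = np.empty(trials)
rhs_draws = np.empty(trials)
for m in range(trials):
    V = rng.choice(v, size=n, replace=True)   # V_i i.i.d. uniform on {v_1..v_n}
    N = rng.poisson(1.0, size=n)              # N_j i.i.d. Poisson(1)
    lhs_draws[m] = abs(np.sum(V - vbar))
    rhs_draws[m] = abs(np.sum((N - 1) * (v - vbar)))

lhs = lhs_draws.mean()                         # E|sum_j (V_j - vbar)|
bound = np.e / (np.e - 1) * rhs_draws.mean()   # (e/(e-1)) E|sum_j (N_j-1)(v_j - vbar)|
```

Both sums have the same variance $\sum_j (v_j - \bar v)^2$ here, so the factor $e/(e-1) \approx 1.58$ gives the bound comfortable slack.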
The next proof will use the fact that for any real-valued function $f$ on a measure space, $1_{\{f^* > t\}} = (1_{\{f > t\}})^*$. This is true since $1_{\{f > t\}} \le 1_{\{f^* > t\}}$, where the latter is measurable, and Lemma 3.2.6(a) applies. The next fact is about triangular arrays with i.i.d. summands.
11.2.20 Theorem Let $(T, d)$ be a totally bounded pseudometric space. For each $n = 1, 2, \ldots$, suppose given a product of $n$ copies of a probability space, $\Omega_n := (S(n), \mathcal S(n), P(n))^n$. For each $n$ and $t \in T$, let $Y_n(t, \cdot) := Y_n(t)(\cdot)$ be a measurable real-valued function on $S(n)$. For $\omega := (\omega_1, \ldots, \omega_n) \in \Omega_n$ let $X_{n,j}(\cdot, \omega) := Y_n(\cdot, \omega_j)$. Thus the $X_{n,j}$ are i.i.d. copies of $Y_n$. Suppose that each $Y_n$ has bounded sample paths a.s. Assume that

(i) for all $t$ in a dense subset $D \subset T$ for $d$ and for all $\theta > 0$,

(11.2.21) $\qquad \lim_{n\to\infty} n\Pr\{|Y_n(t)| > \theta n^{1/2}\} = 0$;

(ii) for any $\delta > 0$, $\sup_{t\in T}\Pr\{|Y_n(t)| > \delta n^{1/2}\} \to 0$ as $n \to \infty$; and

(iii) for all $\varepsilon > 0$, as $\delta \downarrow 0$,

$$\limsup_{n\to\infty}\Pr{}^*\Bigl\{n^{-1/2}\sup_{d(s,t)<\delta}\Bigl|\sum_{i=1}^n X_{n,i}(t) - EX_{n,i}(t) - X_{n,i}(s) + EX_{n,i}(s)\Bigr| > \varepsilon\Bigr\} \to 0.$$

Then for any $\gamma > 0$,

(11.2.22) $\qquad \lim_{n\to\infty} n\Pr^*\{\|Y_n\|_T > \gamma n^{1/2}\} = 0.$

If the hypotheses are assumed only along a subsequence $n_k$, then the conclusion holds along the subsequence.
If the hypotheses are assumed only along a subsequence nk, then the conclusion holds along the subsequence. Proof Taking more Cartesian products, let {Vn, j} bean independent copy of easily (11.2.23)
Pr*{IIXn,1-Vn,111>u} <
2Pr*{IIXn,111 >u/2}.
Then, condition (iii) implies that as S 4. 0, n
(11.2.24)
lim sup,, Pr* I
n-112
supd(s,t)<s
Xn,i (t) - Vn,i (t) i=1
- Xn,i (s) + Vn,i (s)
>s}--*
0.
Let Un, j := Xn, j - Vn, j. For any t > 0, let {s1, , sN(,)} be a maximal subset of T with d(si, sj) > t/3 for i # j. Let Bi It E T : d (t, si) < r/3} and Ai := Ai,, := Bi \ U,
The Two-Sample Case, the Bootstrap, and Confidence Sets
348
t E Ai,T. Then
(11.2.25) Pr* {IIUn,111T > 2y/} < Pr* 1IIUn,1,TIIT > yln} + Pr* {IIUn,l - Un,1,1IIT > YV"} Then, as n (11.2.26)
oo, nPr* {IIUn,1,tIIT > Y-vfn}
-
N(T)
nPr{1Un,I(ti)I > y ln} -. 0
< i=1
by hypothesis (i) and (11.2.23). For an arbitrary set A in a probability space, let A* be a measurable cover, so that A* is a measurable set and (0A)* = lA*. k * k C* Then it is easily checked that for any sets C1, Ck, (Uk Uk up to sets of probability zero. Take Ci :_ { II Un,i - Un,i,r II T > yn 1/2}. Then by Theorem 3.2.1 and Lemma 3.2.4,
, G) =
1 - exp(n Pr*(C1)) < 1 - (1 - Pr*(Cl))n = Pr. (
Ii
U j_1
Pr* {maxl Y/"}
which by Levy's inequality with stars (9.1.9(b)) is n
E(Un,i
- Un,1,1)
i=1
T n
< 2 Pr* jn_1I2suPd(s,t)T
(Xn,i (t) - Vn,i (t) i=1
- Xn,i (s) + Vn,i (s)) Thus, applying (11.2.24), then (11.2.25) and (11.2.26), gives (11.2.27)
lim1,o (lim sup)n,,,n Pr* { II Un,1 - U,,,1,, II T > Y In}
= limn,oonPr* {IIUn,111T > Y In = 0. Next, by Lemma 3.1.4, II Xn,i II T and II Vn,i II T are independent. Applying also Lemma 3.2.6, and recalling that Xn,1 and Yn have the same distribution,
11.2 A bootstrap central limit theorem in probability
349
we get
nPr*{IIUn,111T>Y,/n} > nPr{IIYnIIT
By Lemma 11.2.4(a),
Pr{ IIYnIIT > ynl/2} < 1
Pr*{IIUn,1IIT > ynl/2/2}
-suptETPr{IYn(t)I > yn1/2/2}
Here the numerator -a 0 as n -+ oo by (11.2.26), and suplET Pr{IYn(t) > yn1/2/2} -* 0 as n -> oo by hypothesis (ii). So Pr{IIYnIIT < y J } 1 as n -+ oo and the conclusion follows. The proof for subsequences is the same.
Next will be a characterization of Donsker classes, in which the asymptotic equicontinuity condition (Theorem 3.7.2) is put into symmetrized forms. For a class $\mathcal F$ of functions let

$$\mathcal F_\delta := \{f - g:\ f, g \in \mathcal F,\ \rho_P(f, g) < \delta\}.$$
11.2.28 Theorem Let $(X, \mathcal A, P)$ be a probability space and $\mathcal F$ a class of functions with $\mathcal F \subset \mathcal L^2(X, \mathcal A, P)$. Assume (11.2.5), so that the $\varepsilon_i$ are i.i.d. Rademacher variables, independent of $X_1, X_2, \ldots$. Suppose that for each $x \in X$,

(11.2.29) $\qquad F_1(x) := \sup_{f\in\mathcal F}|f(x) - Pf| < \infty.$

Then the following are equivalent:

(a) $\mathcal F$ is a Donsker class for $P$;

(b) $\mathcal F$ is totally bounded for $\rho_P$ and for any $\varepsilon > 0$, as $\delta \to 0$,

$$\limsup_{n\to\infty}\Pr{}^*\Bigl\{n^{-1/2}\Bigl\|\sum_{i=1}^n \varepsilon_i\bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} > \varepsilon\Bigr\} \to 0;$$

(c) $(\mathcal F, \rho_P)$ is totally bounded and as $\delta \to 0$,

$$\limsup_{n\to\infty} n^{-1/2}E^*\Bigl\|\sum_{i=1}^n \varepsilon_i\bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} \to 0;$$

(d) $(\mathcal F, \rho_P)$ is totally bounded and for $\nu_n := n^{1/2}(P_n - P)$, as $\delta \to 0$, $\limsup_{n\to\infty} E^*\|\nu_n\|_{\mathcal F_\delta} \to 0$.
To prove the theorem, let's first show that (a) implies (b). Applying Lemma 11.2.6(a) to the class $\mathcal F_\delta$ (more precisely, to the centered functions $h - Ph$ for $h \in \mathcal F_\delta$), the outer probability in (b) is bounded above by

$$2\max_{k\le n}\Pr{}^*\Bigl\{n^{-1/2}\Bigl\|\sum_{i=1}^k \bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} > \varepsilon/2\Bigr\}.$$

For any fixed $N$, we can write for $n > N$ that $\max_{k\le n} = \max(\max_{k\le N}, \max_{N<k\le n})$. By the asymptotic equicontinuity condition (Theorem 3.7.2(b)), there are an $N < \infty$ and a $\delta > 0$ such that for $N \le k \le n$,

(11.2.30) $\qquad \Pr{}^*\Bigl\{n^{-1/2}\Bigl\|\sum_{i=1}^k \bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} > \varepsilon/4\Bigr\} < \varepsilon/4,$

since $n^{-1/2} \le k^{-1/2}$. Then for $1 \le j \le N$, note that

$$\sum_{i=1}^{j} = \sum_{i=1}^{N+j} - \sum_{i=j+1}^{N+j},$$

and inequalities for $\sum_{i=1}^N$ also hold for $\sum_{i=j+1}^{N+j}$. Thus (11.2.30) holds also, with $\varepsilon/4$ replaced by $\varepsilon/2$, for $1 \le k \le N$, and (b) holds.

Now to prove (b) implies (c), we will apply Theorem 11.2.20, with $D = T := \mathcal F$ and $X_{n,j}(f) := \varepsilon_j(f(X_j) - Pf)$. Then $EX_{n,j}(f) = 0$ for all $f \in \mathcal F$, and (b) gives hypothesis (iii). For hypothesis (i) (11.2.21), again writing $1_{\{\cdots\}}$ for an indicator, we have

$$n\Pr\{|f(X_j) - Pf| > \theta n^{1/2}\} \le \theta^{-2}E\bigl(|f(X_j) - Pf|^2\, 1_{\{|f(X_j) - Pf| > \theta n^{1/2}\}}\bigr) \to 0$$

as $n \to \infty$ by dominated convergence. Hypothesis (ii) holds since $\mathcal F$ is totally bounded for $\rho_P$. So all three hypotheses of Theorem 11.2.20, and hence its conclusion (11.2.22), hold for the given $X_{n,j}$. In proving (c) the following will be helpful.
11.2.31 Lemma Let $\xi_i$, $i = 1, 2, \ldots$, be i.i.d. nonnegative random variables such that

(11.2.32) $\qquad M := \sup_{t>0} t^2\Pr\{\xi_i > t\} < \infty.$

Then, for all $r$ such that $0 < r < 2$,

(11.2.33) $\qquad \sup_n n^{-r/2}E\bigl(\max_{i\le n}\xi_i^r\bigr) \le 1 + \dfrac{Mr}{2-r}.$

Proof We have, as in the proof of Theorem 11.2.11,

$$n^{-r/2}E\max_{i\le n}\xi_i^r = rn^{-r/2}\int_0^\infty t^{r-1}\Pr\bigl\{\max_{i\le n}\xi_i > t\bigr\}\,dt \le 1 + rn^{1-r/2}\int_{n^{1/2}}^\infty t^{r-1}\Pr\{\xi_1 > t\}\,dt \le 1 + Mrn^{1-r/2}\int_{n^{1/2}}^\infty t^{r-3}\,dt = 1 + \frac{Mr}{2-r},$$

proving (11.2.33).
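A quick Monte Carlo sanity check of (11.2.33), with a construction of my own: $\xi_i$ is a truncated Pareto variable with $t^2\Pr(\xi_i > t) \le 1$, so the lemma applies with $M = 1$.

```python
import numpy as np

rng = np.random.default_rng(4)
r, M = 1.5, 1.0              # M = 1 >= sup_t t^2 Pr(xi > t) for the variable below
bound = 1 + M * r / (2 - r)  # right side of (11.2.33); here = 4

for n in (10, 100):
    # xi = min(U^{-1/2}, 30) with U uniform: Pr(xi > t) = t^{-2} for 1 <= t < 30
    xi = np.minimum(rng.uniform(size=(20000, n)) ** -0.5, 30.0)
    lhs = n ** (-r / 2) * np.mean(np.max(xi, axis=1) ** r)
    assert lhs <= bound
```

The truncation keeps the simulation well behaved without increasing $M$, so the inequality is tested at a nontrivial margin.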
Now continuing the proof that (b) implies (c): for $\xi_i := \|f(X_i) - Pf\|_{\mathcal F}^*$, condition (11.2.32) is implied by equation (11.2.22). This implies, for example for $r = 3/2$, that the sequence $\{\max_{i\le n}\xi_i/\sqrt n\}_{n\ge 1}$ is bounded in $\mathcal L^r$ and so uniformly integrable. But, again by (11.2.22), we also have, letting $\gamma \to 0$, that $\max_{i\le n}\xi_i/\sqrt n \to 0$ in probability, so

(11.2.34) $\qquad E\max_{i\le n}\xi_i/\sqrt n \to 0 \quad\text{as } n \to \infty.$

Next, Theorem 11.2.11 will be applied, where the original sample space $S$ as in (11.2.5) is replaced by $S\times\{-1, 1\}$, the random variables are $(X_j, \varepsilon_j)$, and in place of $\mathcal F$, we take the class $\mathcal G$ of functions $g := g_h$ where, for each $h \in \mathcal F_\delta$,

$$g(x, s) := s(h(x) - Ph) \quad\text{for } s = \pm 1, \text{ so that } g(X_j, \varepsilon_j) = \varepsilon_j(h(X_j) - Ph).$$

Also let $p := 1$. Then by (11.2.34), the hypothesis of Theorem 11.2.11 holds, and in the conclusion, the first term on the right is $o(n^{1/2})$ as $n \to \infty$ and $\delta \to 0$. In the definition of $u$, by the Lévy inequality (9.1.9), we have $u \le 2t_1$ if

$$\Pr\Bigl(\Bigl\|\sum_{i=1}^n g(X_i, \varepsilon_i)\Bigr\|_{\mathcal G} > t_1\Bigr) \le 4^{-p-1},$$

and $t_1 = o(n^{1/2})$ also by (b). So (c) follows. (c) implies (d) by desymmetrization, Lemma 11.2.12. (d) implies (a) by Markov's inequality and the usual asymptotic equicontinuity condition, Theorem 3.7.2. The proof of Theorem 11.2.28 is complete.
Next, Theorem 11.2.28 will be extended to multipliers other than Rademacher variables. For any real random variable $Y$, recall that

$$\Lambda_{2,1}(Y) := \int_0^\infty \bigl[\Pr(|Y| > t)\bigr]^{1/2}\,dt.$$

Then $\Lambda_{2,1}(Y) < \infty$ implies $E(Y^2) < \infty$, and for any $\delta > 0$, $E|Y|^{2+\delta} < \infty$ implies $\Lambda_{2,1}(Y) < \infty$.
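When the tail is explicit, $\Lambda_{2,1}$ can be computed directly. For $Y$ uniform on $[0,1]$, $\Pr(|Y| > t) = 1 - t$ on $[0,1]$ and $\Lambda_{2,1}(Y) = \int_0^1 (1-t)^{1/2}\,dt = 2/3$. A small numeric check (my own illustration):

```python
import numpy as np

# Lambda_{2,1}(Y) = int_0^inf [Pr(|Y| > t)]^{1/2} dt for Y ~ Uniform[0, 1]:
# the tail is 1 - t on [0, 1], and the integral equals 2/3 exactly.
t = np.linspace(0.0, 1.0, 100001)
g = np.sqrt(1.0 - t)
lam21 = float(np.sum((g[:-1] + g[1:]) / 2 * np.diff(t)))   # trapezoid rule
```

The manual trapezoid sum avoids depending on any particular NumPy quadrature helper.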
11.2.35 Lemma Let $(\Omega, \mathcal S, P)$ be a probability space and $\mathcal F \subset \mathcal L^1(P)$. Assume the usual hypotheses (11.2.5) and that furthermore, by another product space, there are i.i.d. symmetric real random variables $\xi_j := \xi(j)$ independent of the $X_j$ and $\varepsilon_j$. Then for integers $0 \le m < n < \infty$ we have

(11.2.36) $\quad E|\xi_1|\,E^*\Bigl\|n^{-1/2}\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} \le E^*\Bigl\|n^{-1/2}\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F} \le mn^{-1/2}\bigl(E^*\|f(X_1)\|_{\mathcal F}\bigr)E\bigl(\max_{i\le n}|\xi_i|\bigr) + \Lambda_{2,1}(\xi_1)\max_{m<k\le n}E^*\Bigl\|k^{-1/2}\sum_{i=m+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}.$

If the variables $\xi_i$ have $E\xi_i = 0$ but are not necessarily symmetric, then (11.2.36) holds with the following changes: $E|\xi_1|$ at the left is replaced by $E|\xi_1|/2$, the first summand on the right is multiplied by 2, and the second by 3.
Proof By symmetry, the joint distribution of the $\xi_i$ is the same as that of the $\varepsilon_i|\xi_i|$. Thus, writing $\|\cdot\| := \|\cdot\|_{\mathcal F}$, we have by Theorem 3.2.7 (the one-sided Tonelli–Fubini theorem), then by Jensen's inequality, Lemma 11.2.7(a),

$$E^*\Bigl\|\sum_{i=1}^n \xi_i f(X_i)\Bigr\| = E^*\Bigl\|\sum_{i=1}^n \varepsilon_i|\xi_i|f(X_i)\Bigr\| \ge E^*\Bigl\|\sum_{i=1}^n \varepsilon_i E|\xi_1|\,f(X_i)\Bigr\| = E|\xi_1|\,E^*\Bigl\|\sum_{i=1}^n \varepsilon_i f(X_i)\Bigr\|.$$
The first inequality in the lemma follows. For the second, we have

$$E(n) := E^*\Bigl\|\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F} = E^*\Bigl\|\sum_{i=1}^n \varepsilon_i|\xi_i|f(X_i)\Bigr\|_{\mathcal F} = E^*\Bigl\|\sum_{i=1}^n \Bigl(\int_0^\infty 1_{\{t<|\xi_i|\}}\,dt\Bigr)\varepsilon_i f(X_i)\Bigr\|_{\mathcal F} \le E^*\int_0^\infty \Bigl\|\sum_{i=1}^n 1_{\{t<|\xi_i|\}}\varepsilon_i f(X_i)\Bigr\|_{\mathcal F}\,dt.$$

Let $F_t(X, \varepsilon, \xi) := \bigl\|\sum_{i=1}^n 1_{\{t<|\xi_i|\}}\varepsilon_i f(X_i)\bigr\|_{\mathcal F}$. Then

$$E^*\int_0^\infty F_t(X, \varepsilon, \xi)\,dt \le \int_0^\infty E^*F_t(X, \varepsilon, \xi)\,dt,$$

the latter inequality by Fatou's lemma with stars (3.2.12). For each set $G \subset \{1, 2, \ldots, n\}$ and $t > 0$ let

$$H(G, t) := \bigl\{\xi := \{\xi_i\}_{i=1}^n \in \mathbb R^n:\ t < |\xi_i| \text{ if and only if } i \in G\bigr\}.$$

Given $t$, the sets $H(G, t)$ are disjoint and measurable with union $\mathbb R^n$, so by independence (Lemma 3.2.6) and (3.2.9),

$$E^*F_t(X, \varepsilon, \xi) = \sum_G P\bigl(\xi \in H(G, t)\bigr)E^*\Bigl\|\sum_{i\in G}\varepsilon_i f(X_i)\Bigr\|_{\mathcal F} = \sum_{r=1}^n \Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} = r\Bigr)E^*\Bigl\|\sum_{i=1}^r \varepsilon_i f(X_i)\Bigr\|_{\mathcal F},$$

since the $(\varepsilon_i, X_i)$ are i.i.d. Thus $E(n)$ is bounded above by
$$\int_0^\infty \sum_{r=1}^n \Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} = r\Bigr)E^*\Bigl\|\sum_{i=1}^r \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}\,dt$$
$$\le \int_0^\infty \Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} \ge 1\Bigr)dt\ \max_{k\le m}E^*\Bigl\|\sum_{i=1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F} + \int_0^\infty \sum_{k=m+1}^n \Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} = k\Bigr)k^{1/2}\,dt\ \max_{m<k\le n}k^{-1/2}E^*\Bigl\|\sum_{i=m+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}.$$

Now,

$$\sum_{k=m+1}^n k^{1/2}\Pr\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}} = k\Bigr) \le E\Bigl(\Bigl(\sum_{i=1}^n 1_{\{|\xi_i|>t\}}\Bigr)^{1/2}\Bigr) \le \Bigl(E\sum_{i=1}^n 1_{\{|\xi_i|>t\}}\Bigr)^{1/2} = \bigl(n\Pr(|\xi_1| > t)\bigr)^{1/2}.$$

Thus

$$E(n) \le m\Bigl(\int_0^\infty \Pr\bigl\{\max_{i\le n}|\xi_i| > t\bigr\}\,dt\Bigr)E^*\|f(X_1)\|_{\mathcal F} + n^{1/2}\Lambda_{2,1}(\xi_1)\max_{m<k\le n}k^{-1/2}E^*\Bigl\|\sum_{i=m+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F}$$
$$= mE^*\|f(X_1)\|_{\mathcal F}\,E\max_{i\le n}|\xi_i| + n^{1/2}\Lambda_{2,1}(\xi_1)\max_{m<k\le n}E^*\Bigl\|k^{-1/2}\sum_{i=m+1}^k \varepsilon_i f(X_i)\Bigr\|_{\mathcal F};$$

dividing by $n^{1/2}$ gives (11.2.36),
finishing the proof for the symmetric case. When $\xi_1$ is not symmetric, but $E\xi_1 = 0$, take a product of $2n$ factors so that $\xi_1, \ldots, \xi_{2n}$ are i.i.d. Let $\xi'_j := \xi_j - \xi_{n+j}$. Then $\xi'_1, \ldots, \xi'_n$ are i.i.d. and symmetric. We have

$$E^*\Bigl\|\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F} \le E^*\Bigl\|\sum_{i=1}^n \xi'_i f(X_i)\Bigr\|_{\mathcal F}$$

by Jensen's inequality (11.2.8) applied to functions $g(X, \xi) := \xi f(X)$. The lemma holds for the $\xi'_j$. Clearly, $E(\max_{i\le n}|\xi'_i|) \le 2E(\max_{i\le n}|\xi_i|)$ and

$$\Lambda_{2,1}(\xi'_1) = \int_0^\infty \bigl(\Pr(|\xi'_1| > t)\bigr)^{1/2}dt \le \int_0^\infty \bigl(2\Pr(|\xi_1| > t/2)\bigr)^{1/2}dt \le 3\Lambda_{2,1}(\xi_1),$$

as claimed. For the lower bound, it is easily seen that

$$E^*\Bigl\|\sum_{i=1}^n \xi'_i f(X_i)\Bigr\|_{\mathcal F} \le 2E^*\Bigl\|\sum_{i=1}^n \xi_i f(X_i)\Bigr\|_{\mathcal F},$$

and the conclusion follows.
Next, here is a characterization of Donsker classes in terms of multipliers $\xi_i$.

11.2.37 Theorem Let $\mathcal F$ be a class such that the hypotheses (11.2.5) hold and $\{f - Pf:\ f \in \mathcal F\}$ has a finite envelope function (11.2.29). Let the $\xi_j$ be i.i.d. centered real random variables, independent of $X_1, X_2, \ldots$ (specifically, defined on a different factor of a product probability space), such that $E\xi_1 = 0$, $E|\xi_1| > 0$, and $\Lambda_{2,1}(\xi_1) < \infty$. Then $\mathcal F$ is Donsker for $P$ if and only if both $(\mathcal F, \rho_P)$ is totally bounded and, as $\delta \to 0$,

(11.2.38) $\qquad \limsup_{n\to\infty} E^*\Bigl\|n^{-1/2}\sum_{i=1}^n \xi_i\bigl(f(X_i) - Pf\bigr)\Bigr\|_{\mathcal F_\delta} \to 0.$

Proof We can assume that each $f \in \mathcal F$ is centered, replacing $f$ by $f - Pf$. Recall that (a) implies (c) in Theorem 11.2.28. Also, we have $\sup_{t>0} t\bigl(\Pr(|\xi_1| > t)\bigr)^{1/2} \le \Lambda_{2,1}(\xi_1) < \infty$, so (11.2.32) holds for $|\xi_1|$. Thus by (11.2.36), it follows that (11.2.38) holds. Conversely, assume that $\mathcal F$ is $\rho_P$-totally bounded and (11.2.38) holds. Then by the first displayed inequality in Lemma 11.2.35, applied to $\mathcal F_\delta$ for each $\delta > 0$, and since $E|\xi_1| > 0$, it follows that (c) in Theorem 11.2.28 holds, so by that theorem, $\mathcal F$ is Donsker for $P$.
Now, the proof of the bootstrap central limit theorem in probability, Theorem 11.2.1, will be finished. Let $\mathcal F$ be $P$-Donsker. We can, again, assume that $Pf = 0$ for all $f \in \mathcal F$, since each $f \in \mathcal F$ can be replaced by $f - Pf$ without changing $\nu_n(f)$ or $\nu_n^B(f)$. Recall that the conclusion is equivalent to a statement in terms of the metric $\beta_{\mathcal F}$. Since $\mathcal F$ is totally bounded for $\rho_P$ by Theorem 3.7.2, for any $\tau > 0$ there is an $N(\tau) < \infty$ and a map $\pi_\tau$ from $\mathcal F$ into a subset having $N(\tau)$ elements such that $\rho_P(\pi_\tau f, f) < \tau$ for all $f \in \mathcal F$. Let $\nu^B_{n,\tau}(f) := \nu_n^B(\pi_\tau f)$ and $G_{P,\tau}(f) := G_P(\pi_\tau f)$. Then for any bounded Lipschitz function $H$ on $\ell^\infty(\mathcal F)$ for $\|\cdot\|_{\mathcal F}$,

$$|E_B H(\nu_n^B) - EH(G_P)| \le |E_B H(\nu_n^B) - E_B H(\nu^B_{n,\tau})| + |E_B H(\nu^B_{n,\tau}) - EH(G_{P,\tau})| + |EH(G_{P,\tau}) - EH(G_P)| =: A_{n,\tau}(H) + B_{n,\tau}(H) + C_\tau(H).$$

Let $BL(1)$ denote the set of all functions $H$ on $\ell^\infty(\mathcal F)$ with bounded Lipschitz norm $\le 1$. Then $\sup_{H\in BL(1)}C_\tau(H) \to 0$ as $\tau \downarrow 0$ since, by definition of Donsker class, $G_P$ is a.s. uniformly continuous with respect to $\rho_P$, so $G_{P,\tau} \to G_P$ uniformly on $\mathcal F$ as $\tau \downarrow 0$. For fixed $\tau > 0$, the supremum for $H \in BL(1)$ of $B_{n,\tau}(H)$ is measurable and converges to 0 a.s. by Theorem 11.2.2, the finite-dimensional bootstrap central limit theorem, in $\mathbb R^{N(\tau)}$. Recall the definition of $\|\cdot\|_{\mathcal F_\delta}$ from before Theorem 11.2.28. For the $A_{n,\tau}$ term, we have $\sup_{H\in BL(1)}A_{n,\tau}(H) \le 2E_B\|\nu_n^B\|_{\mathcal F_\tau}$, so it will be enough to show that for all $\varepsilon > 0$, as $\delta \downarrow 0$,

(11.2.39) $\qquad \limsup_{n\to\infty}\Pr{}^*\bigl\{E_B\|\nu_n^B\|_{\mathcal F_\delta} > \varepsilon\bigr\} \to 0.$

Apply (11.2.18) with $B := \ell^\infty(\mathcal F_\delta)$ and $v_i := \delta_{X_i(\omega)}$ for each $\omega$, and the starred Tonelli–Fubini theorem where one coordinate is discrete (3.2.9). We get

$$E^*E_B\|\nu_n^B\|_{\mathcal F_\delta} = n^{-1/2}E^*E_B\Bigl\|\sum_{i=1}^n \bigl(\delta_{X_i^B} - P_n\bigr)\Bigr\|_{\mathcal F_\delta} \le n^{-1/2}\frac{e}{e-1}E^*\Bigl\|\sum_{i=1}^n (N_i - 1)\bigl(\delta_{X_i} - P_n\bigr)\Bigr\|_{\mathcal F_\delta}$$
$$\le n^{-1/2}\frac{e}{e-1}E^*\Bigl\|\sum_{i=1}^n (N_i - 1)\delta_{X_i}\Bigr\|_{\mathcal F_\delta} + n^{-1/2}\frac{e}{e-1}E^*\Bigl(\Bigl|\sum_{i=1}^n (N_i - 1)\Bigr|\,\|P_n\|_{\mathcal F_\delta}\Bigr) =: S + T.$$

Since $\mathcal F$ is Donsker for $P$, the limit as $\delta \downarrow 0$ of the $\limsup$ as $n \to \infty$ of $S$ is 0 by Theorem 11.2.37. For $T$ we can apply independence, Lemma 3.2.6, and (3.2.9). Then $E|\sum_{i=1}^n (N_i - 1)| \le n^{1/2}$, so $T \le (e/(e-1))E^*\|P_n\|_{\mathcal F_\delta}$.
Recalling that each $f \in \mathcal F$ is taken to be centered, we have $E^*\|P_n\|_{\mathcal F_\delta} = E^*\|n^{-1/2}\nu_n + P\|_{\mathcal F_\delta} \to 0$ as $n \to \infty$ and $\delta \to 0$ by Theorem 11.2.28(d). Thus Theorem 11.2.1 is proved.
11.3 Other aspects of the bootstrap

Efron (1979) invented the bootstrap and by now there is a very large literature about it. This section will address some aspects of the application of the Giné–Zinn theorems. These do not cover the entire field by any means. For example, some statistics of interest, such as $\max(X_1, \ldots, X_n)$, are not averages $(f(X_1) + \cdots + f(X_n))/n$ as $f$ ranges over a class $\mathcal F$. Some bootstrap limit theorems are stated in probability, and others for almost
sure convergence. To compare their usefulness, first note that almost sure convergence is not always preferable to convergence in probability:
Example. Let $X_n$ be a sequence of real-valued random variables converging to some $X_0$ in probability but not almost surely. Then some subsequences $X_{n_k}$ converge to $X_0$ almost surely. Suppose this occurs whenever $n_k \ge k^2$ for all $k$. Let $Y_n := X_{2^k}$ for $2^k \le n < 2^{k+1}$, where $k = 0, 1, \ldots$. Then $Y_n \to X_0$ almost surely, but in a sense $X_n \to X_0$ faster, although it converges only in probability. Another point is that almost sure convergence is applicable in statistics when
inferences will be made from data sets with increasing values of $n$; in other words, in the part of statistics called sequential analysis. But suppose one has a fixed value of the sample size $n$, as has generally been the case with the bootstrap. Then the probability of an error of a given size, for a given $n$, which relates to convergence in probability, may be more relevant than the question of what would happen for values of $n \to \infty$, as in almost sure convergence. The rest of this section will be devoted to confidence sets. A basic example of a confidence set is a confidence interval. As an example, suppose $X_1, \ldots, X_n$ are i.i.d. with distribution $N(\mu, \sigma^2)$, where $\sigma^2$ is known but $\mu$ is not. Then $\overline X := (X_1 + \cdots + X_n)/n$ has a distribution $N(\mu, \sigma^2/n)$. Thus

$$P\bigl(\overline X < \mu - 1.96\sigma/n^{1/2}\bigr) = 0.025 = P\bigl(\overline X > \mu + 1.96\sigma/n^{1/2}\bigr).$$

So we have 95% confidence that the unknown $\mu$ belongs to the interval $[\overline X - 1.96\sigma/n^{1/2},\ \overline X + 1.96\sigma/n^{1/2}]$, which is then called a 95% confidence interval for $\mu$.
Next, suppose $X_1, \ldots, X_n$ are i.i.d. in $\mathbb R^k$ with a normal (Gaussian) distribution $N(\mu, \sigma^2 I)$ where $I$ is the identity matrix. Suppose $\alpha > 0$ and $M_\alpha = M_\alpha(k)$ is such that $N(0, I)\{x:\ |x| > M_\alpha\} = \alpha$. Then $n^{1/2}(\overline X - \mu)/\sigma$ has distribution $N(0, I)$, so $P\bigl(|\overline X - \mu| > M_\alpha\sigma/n^{1/2}\bigr) = \alpha$. Thus, the ball with center $\overline X$ and radius $M_\alpha\sigma/n^{1/2}$ is called a $100(1-\alpha)$% confidence set for the unknown $\mu$. When the distribution of the $X_i$ is not necessarily normal, but has finite variance, then the distribution of $\overline X$ will be approximately normal by the central limit theorem for $n$ large, so we get some approximate confidence sets.

Now let's extend these ideas to the bootstrap. Let $X_1, \ldots, X_n$ be i.i.d. from an otherwise unknown distribution $P$. Let $P_n$ be the empirical measure formed from $X_1, \ldots, X_n$. Let $\mathcal F$ be a universal Donsker class (Section 10.1). Then we know from the Giné–Zinn theorem in the last section that $\nu_n^B$ and $\nu_n$ have asymptotically the same distribution on $\mathcal F$. By repeated resampling, given a small $\alpha > 0$ such as $\alpha = 0.05$, one can find $M = M(\alpha)$ such that approximately $P(\|\nu_n^B\|_{\mathcal F} > M) = \alpha$. Then

$$\{Q:\ \|Q - P_n\|_{\mathcal F} \le M/n^{1/2}\}$$

is an approximate $100(1-\alpha)$% confidence set for $P$.
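For the class $\mathcal F$ of half-line indicators $\{1_{(-\infty,s]}\}$ (a classical universal Donsker class), $\|\nu_n^B\|_{\mathcal F} = \sqrt n\,\sup_s|F_n^B(s) - F_n(s)|$, and since the resampled points are a subset of the data points, the supremum is attained at the order statistics. The quantile $M(\alpha)$ can then be estimated by resampling; this sketch and its names are my own.

```python
import numpy as np

rng = np.random.default_rng(6)
n, B, alpha = 300, 2000, 0.05
x = np.sort(rng.standard_normal(n))          # the fixed data set
Fn = np.arange(1, n + 1) / n                 # F_n evaluated at the order statistics

def sup_stat(xb, x, Fn):
    """sqrt(n) * sup_s |F_n^B(s) - F_n(s)|, evaluated at the data points."""
    Fb = np.searchsorted(np.sort(xb), x, side='right') / len(xb)
    return np.sqrt(len(x)) * np.max(np.abs(Fb - Fn))

stats = np.array([sup_stat(rng.choice(x, n, replace=True), x, Fn)
                  for _ in range(B)])
M = np.quantile(stats, 1 - alpha)
# {Q : ||Q - P_n||_F <= M / sqrt(n)} is then an approximate 95% confidence set for P.
```

For large $n$ the estimated $M$ should be near the 95% point of the Kolmogorov statistic's limiting distribution, about 1.36.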
**11.4 Further Giné–Zinn bootstrap central limit theorems

Giné and Zinn (1990) proved two theorems giving criteria for bootstrap central limit theorems, one for the almost sure bootstrap central limit property, the other in (outer) probability, both under some measurability conditions. Later, in work by Strobl (1994) and van der Vaart and Wellner (1996), the measurability conditions were removed from the direct parts of each theorem (that the Donsker property, together with an $L^2$ envelope in the almost sure case, implies the corresponding bootstrap central limit theorem). Giné (1997) gives an exposition, with proofs, of the resulting forms of the theorems. A law $\mu$ on the Borel $\sigma$-algebra of a metric space is said to be Radon if it is tight and thus concentrated in a separable subspace. The theorems then are, first (Giné, 1997, Theorem 2.2):

Theorem A If $\mathcal F$ is any Donsker class for a law $P$ then the bootstrap central limit theorem in probability also holds for $\mathcal F$ and $P$. Conversely, if $(S, \mathcal S, P)$ is any probability space, $\mathcal F$ is a class of real-valued measurable functions on $S$ such that $F_{\mathcal F}(x) := \sup_{f\in\mathcal F}|f(x)| < \infty$ for all $x \in S$, $\mathcal F$ is image admissible Suslin, and there exists a Gaussian process $G$ indexed by $\mathcal F$ with Radon law on $\ell^\infty(\mathcal F)$ such that $\beta_{\mathcal F}(\nu_n^B, G) \to 0$ in outer probability as $n \to \infty$, then $\mathcal F$ is Donsker for $P$ and $G = G_P$. Only the first (direct) statement in Theorem A is proved in Section 11.2. In the almost sure case, Giné (1997, Theorem 3.2) is:
Theorem B Let $\mathcal F$ be any Donsker class for a law $P$ and assume that $E\bigl(\|f(X_1) - Pf\|_{\mathcal F}^*\bigr)^2 < \infty$. Then the almost sure bootstrap central limit theorem also holds for $\mathcal F$ and $P$. Conversely, if a class $\mathcal F$ of measurable real-valued functions is image admissible Suslin, has an everywhere finite envelope $F_{\mathcal F}$, and there is a centered Gaussian process $G$ indexed by $\mathcal F$ having a Radon law in $\ell^\infty(\mathcal F)$ and such that $\beta_{\mathcal F}(\nu_n^B, G) \to 0$ almost uniformly as $n \to \infty$, then $\mathcal F$ is Donsker for $P$, $EF_{\mathcal F}^2 < \infty$, and $G = G_P$. Note that the image admissible Suslin property implies that $EF_{\mathcal F}^2$ is well-defined without any star.
The proof of a form of Theorem B in Giné and Zinn (1990) relies on results from Araujo and Giné (1980), Fernique (1975, 1985), Giné and Zinn (1984, 1986), Jain and Marcus (1978), and Ledoux and Talagrand (1988). Especially when the needed material from these references is included, the proof is quite long (over 50 pages in a version I prepared). The proof of the "in probability" version is shorter, but still rather long, as is the proof of it in one direction in Section 11.2.
Problems

1. If the sample space is the unit circle S¹ := {(cos θ, sin θ) : 0 ≤ θ < 2π}, and distribution functions are defined by evaluating laws on arcs {(cos θ, sin θ) : 0 ≤ θ ≤ x}, show that the quantity in Corollary 11.1.2(c), sup_x H_{mn}(x) − inf_y H_{mn}(y), is invariant under rotations of the circle.

2. (Continuation) For m = n = 20, find the approximation given by Corollary 11.1.2(c) to the probability that there exists an arc A := {(cos θ, sin θ) : a ≤ θ ≤ b} such that X_j ∈ A and Y_j ∉ A for j = 1, …, 20, if all 40 variables are i.i.d. from the same continuous distribution on the circle. Hint: Not many terms of the series should be needed.

3. In Corollary 11.1.2, (i) show that the series in (b) and (c) diverge when u = 0. (ii) Show that for u = 0, the probabilities equal 1 for any m ≥ 1 and n ≥ 1. (iii) Show that the three expressions on the right are all less than 1 for u > 0 and converge to 1 as u ↓ 0, so that the corresponding distribution functions are continuous everywhere. Hint: For (b) and (c) use the left sides.
The Two-Sample Case, the Bootstrap, and Confidence Sets
4. Let X₁, X₂, … be i.i.d. in R with a continuous distribution. Given n, arrange X₁, …, X_n in order as X_(1) < X_(2) < … < X_(n). Let X_j^B, j = 1, …, n, be an i.i.d. (P_n) sample, so that P(X_j^B = X_(i)) = 1/n for i, j = 1, …, n. Let X_(1)^B := min_{1≤j≤n} X_j^B.
?
In the next two problems, let (X, A, P) be a probability space and suppose that F ⊂ L²(X, A, P) is a Donsker class for P. Let X₁, X₂, … be i.i.d. (P). Convergence in law, conditional on P_n or P_{2n}, in outer probability as n → ∞, is defined just as in the definition of bootstrap Donsker class early in Section 11.2, except for some interchanges of n and 2n.

5. Given P_n formed from X₁, …, X_n, let 2n points X(j, ω) := X_j^B, j = 1, …, 2n, be i.i.d. conditional on P_n, and P_{2n}^B := (1/(2n)) Σ_{j=1}^{2n} δ_{X(j,ω)}. Show that (2n)^{1/2}(P_{2n}^B − P_n), conditional on P_n, converges in law to G_P, in outer probability.

6. Given P_{2n} formed from X₁, …, X_{2n}, let n points X(j, ω) := X_j^B, j = 1, …, n, be i.i.d. (P_{2n}), and P_n^B := (1/n) Σ_{j=1}^{n} δ_{X(j,ω)}. Show that n^{1/2}(P_n^B − P_{2n}), conditional on P_{2n}, converges in law to G_P, in outer probability.
7. Prove the statements before Lemma 11.2.35, that Λ_{2,1}(Y) < ∞ implies E(Y²) < ∞, and for any δ > 0, E|Y|^{2+δ} < ∞ implies Λ_{2,1}(Y) < ∞.
Notes

Notes to Section 11.1. I do not have a reference for Theorem 11.1.1. A form of the theorem for classes of sets was given in Dudley (1978). Corollary 11.1.2 is classical and gives asymptotic distributions of so-called Kolmogorov-Smirnov (parts (a), (b)) and Kuiper (part (c)) statistics. See the notes to RAP, Section 12.3. The results are sometimes stated under the unnecessary hypothesis that m/n converges to some c, 0 < c < ∞, as m, n → ∞.

Notes to Section 11.2. The section is based mainly on Giné (1997), an update (as regards measurability) of the fundamental theorems of Giné and Zinn (1990). The proofs also include some new elements. Giné (1997) gives a lot of references, not reproduced here, on different parts of the proof. Some special cases of the Giné-Zinn theorems were published earlier by Bickel and Freedman (1981) and Gaenssler (1986), among others.
Notes to Section 11.3. Efron (1979) discovered the bootstrap. Three books on it are Hall (1992), Efron and Tibshirani (1993), and Shao and Tu (1995).
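Efron's resampling idea mentioned in these notes is easy to try numerically. The following minimal Python sketch (not from the text; the function name and all parameter choices are hypothetical) forms a percentile bootstrap confidence interval by resampling the empirical measure P_n with replacement:

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for stat(data).

    Each bootstrap sample draws n points i.i.d. from the empirical
    measure P_n, i.e., from {X_1, ..., X_n} with replacement."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    reps = np.array([stat(rng.choice(data, size=n, replace=True))
                     for _ in range(n_boot)])
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example: interval for the mean of a skewed sample.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)
lo, hi = bootstrap_ci(x, rng=1)
print(f"95% bootstrap CI for the mean: [{lo:.3f}, {hi:.3f}]")
```

The percentile interval is only the simplest of the bootstrap confidence procedures treated in the books cited above.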
References
Araujo, A., and Giné, E. (1980). The Central Limit Theorem for Real and Banach Valued Random Variables. Wiley, New York.
Bickel, P. J., and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist. 9, 1196-1217.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899-929; Correction, ibid. 7 (1979), 909-911.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1-26.
Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.
Fernique, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. École d'été de probabilités de St-Flour, 1974. Lecture Notes in Math. (Springer) 480, 1-96.
Fernique, X. (1985). Sur la convergence étroite des mesures gaussiennes. Z. Wahrscheinlichkeitsth. verw. Gebiete 68, 331-336.
Gaenssler, P. (1986). Bootstrapping empirical measures indexed by Vapnik-Chervonenkis classes of sets. In Probability Theory and Mathematical Statistics (Vilnius, 1985), ed. Yu. V. Prohorov, V. A. Statulevičius, V. V. Sazonov, and B. Grigelionis. VNU Science Press, Utrecht, 467-481.
Giné, E. (1997). Lectures on some aspects of the bootstrap. In Lectures on Probability Theory and Statistics, École d'été de probabilités de Saint-Flour (1996), ed. P. Bernard. Lecture Notes in Math. (Springer) 1665, 37-151.
Giné, E., and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12, 929-989.
Giné, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces, Proc. Conf. Zaragoza, 1985. Lecture Notes in Math. (Springer) 1221, 50-113.
Giné, E., and Zinn, J. (1990). Bootstrapping general empirical measures. Ann. Probab. 18, 851-869.
Hall, Peter (1992). The Bootstrap and Edgeworth Expansion. Springer, New York.
Jain, N. C., and Marcus, M. B. (1978). Continuity of subgaussian processes. In Probability on Banach Spaces, ed. J. Kuelbs, Advances in Probability and Related Topics 4, 81-196. Dekker, New York.
Ledoux, M., and Talagrand, M. (1988). Characterization of the law of the iterated logarithm in Banach spaces. Ann. Probab. 16, 1242-1264.
Shao, Jun, and Tu, Dongsheng (1995). The Jackknife and Bootstrap. Springer, New York.
Strobl, F. (1994). Zur Theorie empirischer Prozesse. Dissertation, Mathematik, Univ. München.
van der Vaart, A. W., and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, Berlin.
12
Classes of Sets or Functions Too Large for Central Limit Theorems
12.1 Universal lower bounds

This chapter is primarily about asymptotic lower bounds for ‖P_n − P‖_F on certain classes F of functions as treated in Chapter 8, mainly classes of indicators of sets. Section 12.2 will give some upper bounds which indicate the sharpness of some of the lower bounds. Section 12.4 gives some relatively difficult lower bounds on classes such as the convex sets in R³ and lower layers in R². In preparation for this, Section 12.3 treats Poissonization and random "stopping sets" analogous to stopping times. The present section gives lower bounds in some cases which hold not only with probability converging to 1 but for all possible P_n. Definitions are as in Sections 3.1 and 8.2, with P := U(I^d) = λ_d = Lebesgue measure on I^d. Specifically, recall the classes G(α, K, d) := G_{α,K,d} of functions on the unit cube I^d ⊂ R^d with derivatives through αth order bounded by K, and the related families C(α, K, d) of sets, both defined early in Section 8.2.
12.1.1 Theorem (Bakhvalov) For P = U(I^d), any d = 1, 2, …, and α > 0, there is a γ = γ(d, α) > 0 such that for all n = 1, 2, … and all possible values of P_n, we have ‖P_n − P‖_{G(α,1,d)} ≥ γ n^{−α/d}.

Remarks. When α < d/2, this shows that G(α, K, d), K > 0, is not a Donsker class. For α ≥ d/2 the lower bound in Theorem 12.1.1 is not useful, since it is smaller than the average size of ‖P_n − P‖_{G(α,1,d)}, which is at least of order n^{−1/2}: even for one function f not constant a.e. P, E|(P_n − P)(f)| ≥ c n^{−1/2} for some c > 0. Theorem 12.1.1 gives information about the accuracy of possible methods of numerical integration in several dimensions, or "cubature," using the values of a function f ∈ G(α, K, d) at just n points chosen in advance (from the proof, it
will be seen that one has the same lower bound even if one can use any partial derivatives of f at the n points). It was in this connection that Bakhvalov (1959) proved the theorem.
Proof Given n let m := m(n) := ⌈(2n)^{1/d}⌉, where ⌈x⌉ is the smallest integer ≥ x. Decompose the unit cube I^d into m^d cubes C_i of side 1/m. Then m^d ≥ 2n.

For any P_n let S := {i : P_n(C_i) = 0}. Then card(S) ≥ n. For x^{(i)}, f_i, and f_S as defined after (8.2.9), we then have for each i,
P(f_i) = ∫_{C_i} m^{−α} f(m(x − x^{(i)})) dx = m^{−α−d} ∫_{I^d} f(y) dy = m^{−α−d} P(f).

Thus |(P_n − P)(f_S)| = P(f_S) ≥ m^{−α−d} n P(f) ≥ c n^{−α/d} for some constant c = c(d, α) > 0, while ‖f_S‖_α ≤ ‖f‖_α ≤ 1 (dividing the original f by some constant depending on d and α if necessary).
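The scaling in the proof above can be checked numerically in the simplest case d = 1, α = 1. This sketch (hypothetical code, not from the text) uses the particular bump f(y) = y(1 − y), places one rescaled copy in each cell missed by the sample, and verifies the resulting bound of order n^{−α/d} = n^{−1} for several point configurations:

```python
import numpy as np

# Bakhvalov's construction for d = 1, alpha = 1: a fixed bump f with
# f(0) = f(1) = 0, sup|f| <= 1 and sup|f'| <= 1, rescaled into each
# cell of a grid that the sample leaves empty.
def f(y):
    return y * (1.0 - y)          # |f| <= 1/4, |f'| <= 1 on [0, 1]

INT_F = 1.0 / 6.0                 # integral of f over [0, 1]

def discrepancy(points):
    """|(P_n - P)(f_S)| for f_S supported on the empty cells."""
    n = len(points)
    m = 2 * n                     # m = ceil((2n)^(1/d)) cells, d = 1
    occupied = set(np.floor(np.asarray(points) * m).astype(int))
    n_empty = m - len(occupied)   # at least m - n = n empty cells
    # P_n(f_S) = 0, so |(P_n - P)(f_S)| = P(f_S) = |S| m^{-2} int(f).
    return n_empty * INT_F / m ** 2

n = 100
rng = np.random.default_rng(0)
for pts in (rng.random(n), np.full(n, 0.5), np.linspace(0, 0.999, n)):
    # The bound gamma * n^{-alpha/d} with gamma = INT_F / 4 here.
    assert discrepancy(pts) >= INT_F / (4 * n) - 1e-12
print("lower bound", INT_F / (4 * n), "holds for all three samples")
```

Whatever the sample does, at least half the cells stay empty, which is what makes the bound independent of the point configuration.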
12.1.2 Theorem For P = U(I^d), any K > 0, and 0 < α ≤ d − 1, there is a δ = δ(α, K, d) > 0 such that for all n = 1, 2, … and all possible values of P_n, ‖P_n − P‖_{C(α,K,d)} ≥ δ n^{−α/(d−1+α)}.
Remark. Since α/(d − 1 + α) < 1/2 for α < d − 1, the classes C(α, K, d) are then not Donsker classes. For α > d − 1, C(α, K, d) is a Donsker class by Theorem 8.2.1 and Corollary 7.2.13. For α = d − 1, it is not a Donsker class (Theorem 12.4.1).

Proof Again, the construction and notation around (8.2.9) will be applied, now on I^{d−1}. Let c := Kλ_{d−1}(f)/‖f‖_α and m := m(n) := [(nc)^{1/(α+d−1)}]. Then m^{α+d−1} ≤ nc, so 1/n ≤ c m^{1−d−α}. Take n ≥ r (the result holds for n < r) for an r with M := sup_{n≥r} nc·m(n)^{1−d−α} < ∞. Let θ_n := m(n)^{α+d−1}/(nc). Then 1/M ≤ θ_n ≤ 1. Let

g_i := Kθ_n f_i/(2‖f‖_α).

Then λ_{d−1}(g_i) = 1/(2n) and ‖g_i‖_α ≤ K. Let B_i := {x ∈ I^d : 0 ≤ x_d ≤ g_i(x^{(d)})} where x^{(d)} := (x₁, …, x_{d−1}). Then for each i, λ_d(B_i) := P(B_i) = 1/(2n), and either P_n(B_i) = 0 or P_n(B_i) ≥ 1/n. Either at least half the B_i have
P_n(B_i) = 0, or at least half have P_n(B_i) ≥ 1/n. In either case,

‖P_n − P‖_{C(α,K,d)} ≥ m^{d−1}/(4n) ≥ c m^{−α}/(4M) ≥ δ n^{−α/(α+d−1)}

for some δ = δ(α, K, d) > 0.

The method of the last proof applies to convex sets in R^d, as follows.
12.1.3 Theorem (W. Schmidt) Let d = 2, 3, …. For the collection C_d of closed convex subsets of a bounded nonempty open set U in R^d, there is a constant b := b(d, U) > 0 such that for P = Lebesgue measure normalized on U, and all P_n,

sup{|(P_n − P)(C)| : C ∈ C_d} ≥ b n^{−2/(d+1)}.
Proof We can assume by a Euclidean transformation that U includes the unit ball. Take disjoint spherical caps as at the end of Section 8.4. Let each cap C have volume P(C) = 1/(2n). Then as n → ∞, the angular radius ε_n of such caps is asymptotic to c_d n^{−1/(d+1)} for some constant c_d. Thus the number of such disjoint caps is of the order of n^{(d−1)/(d+1)}. Either (P_n − P)(C) ≥ 1/(2n) for at least half of such caps, or (P_n − P)(C) = −1/(2n) for at least half of them. Thus for some constant η = η(U, d), there exist convex sets D, E, differing by a union of caps, such that

|(P_n − P)(D) − (P_n − P)(E)| ≥ η n^{(d−1)/(d+1)}/n = η n^{−2/(d+1)},

and the result follows with b = η/2.

Thus for d ≥ 4, C_d is not a Donsker class for P. If d = 3 it is not either; see Problem 3 and Dudley (1982). C₂ is a Donsker class for λ₂ on I² by Theorem 8.4.1 and Corollary 7.2.13.
12.2 An upper bound

Here, using bracketing numbers N_I as in Section 7.1, is an upper bound for sup_{B∈C} |ν_n(B)| which applies in many cases where the hypotheses of Corollary 7.2.13 on ‖ν_n‖_C fail. Let (X, A, Q) be a probability space, ν_n := n^{1/2}(Q_n − Q), and recall N_I as defined before (7.1.4).
12.2.1 Theorem Let C ⊂ A, 1 ≤ ξ < ∞, η > 2/(ξ + 1), and Θ := (ξ − 1)/(2(ξ + 1)). If for some K < ∞, N_I(ε, C, Q) ≤ exp(K ε^{−ξ}) for 0 < ε ≤ 1, then

lim_{n→∞} Pr*{‖ν_n‖_C > n^Θ (log n)^η} = 0.
The classes C = C(a, M, d) satisfy the hypothesis of Theorem
12.2.1 for
= (d - 1)/a > 1, that is, a < d - 1, by the last inequality in
Theorem 8.2.1. Then O = 2 - d-i+a Thus Theorem 12.1.2 shows that the exponent O is sharp for > 1. Conversely, Theorem 12.2.1 shows that the exponent on n in Theorem 12.1.2 cannot be improved. In Theorem 12.2.1 we cannot take < 1, for then O < 0, which is impossible even for a single set,
C = {C}, with 0 < P(C) < 1.

Proof The chaining method will be used as in Section 7.2. Let log₂ be logarithm to the base 2. For each n ≥ 3 let

(12.2.2) k(n) := [(1/2 − Θ) log₂ n − η log₂ log n].

Let N(k) := N_I(2^{−k}, C, Q), k = 1, 2, …. Then there exist A_{ki} and B_{ki}, i = 1, …, N(k), such that for any A ∈ C, there is an i ≤ N(k) with A_{ki} ⊂ A ⊂ B_{ki} and Q(B_{ki} \ A_{ki}) ≤ 2^{−k}. Let A_{01} := ∅ (the empty set) and B_{01} := X. Choose such i = i(k, A) for k = 0, 1, …. Then for each k ≥ 1,

Q(A_{k,i(k,A)} Δ A_{k−1,i(k−1,A)}) ≤ 2^{2−k},

where Δ := symmetric difference, C Δ D := (C \ D) ∪ (D \ C). For k ≥ 1 let B(k) be the collection of sets B, with Q(B) ≤ 2^{2−k}, of the form A_{ki} \ A_{k−1,l} or A_{k−1,l} \ A_{ki} or B_{ki} \ A_{ki}. Then
card(B(k)) ≤ 2N(k − 1)N(k) + N(k) ≤ 3 exp(2K 2^{kξ}).

For each B ∈ B(k), Bernstein's inequality (1.3.2) implies, for any t > 0,

(12.2.3) Pr{|ν_n(B)| > t} ≤ 2 exp(−t²/(2^{3−k} + t n^{−1/2})).

Choose δ > 0 such that δ < 1 and (2 + 2δ)/(ξ + 1) < η. Let c := δ/(1 + δ) and t := t_{n,k} := c n^Θ (log n)^η k^{−1−δ}. Then for each k = 1, …, k(n), 2^{3−k} ≥ t_{n,k} n^{−1/2}. Hence by (12.2.3), Pr{|ν_n(B)| > t_{n,k}} ≤ 2 exp(−t_{n,k}² 2^{k−4}), and

p_{nk} := Pr{sup_{B∈B(k)} |ν_n(B)| > t_{n,k}} ≤ 6 exp(2K 2^{kξ} − 2^{k−4} t_{n,k}²)
= 6 exp(2K 2^{kξ} − 2^{k−4} c² n^{2Θ} (log n)^{2η} k^{−2−2δ}).
For k ≤ k(n),

p_{nk} ≤ 6 exp(2^{kξ}(2K − 2^{−4} c² (log n)^γ)),

where γ := η(ξ + 1) − 2 − 2δ > 0 by choice of δ, using k(n) ≤ (log n)/log 2. For n large, (log n)^γ ≥ 64K/c². Then 2K ≤ 2^{−5}c²(log n)^γ and 2^{kξ} ≥ 1 for k ≥ 1, so

Σ_{k=1}^{k(n)} p_{nk} ≤ 6(log n) exp(−2^{−5} c² (log n)^γ) → 0 as n → ∞.
Let E_n := {ω : sup_{B∈B(r)} |ν_n(B)| ≤ t_{n,r}, r = 1, …, k(n)}.

Then lim_{n→∞} Pr(E_n) = 1. For any A ∈ C and n, let k := k(n) and i := i(k, A). Then for each ω ∈ E_n, |ν_n(A_{ki})| is bounded above by

Σ_{r=1}^{k} |ν_n(A_{r,i(r,A)} \ A_{r−1,i(r−1,A)})| + |ν_n(A_{r−1,i(r−1,A)} \ A_{r,i(r,A)})|
≤ 2 Σ_{r=1}^{k} t_{n,r} ≤ 2n^Θ (log n)^η Σ_{r≥1} c r^{−1−δ} ≤ 2n^Θ (log n)^η.

Also |ν_n(B_{ki} \ A_{ki})| ≤ n^Θ (log n)^η, and by (12.2.2),

n^{1/2} Q(B_{ki} \ A_{ki}) ≤ n^{1/2}/2^k ≤ 2n^Θ (log n)^η.

Hence n^{1/2} Q_n(B_{ki} \ A_{ki}) ≤ 3n^Θ (log n)^η, and |ν_n(A \ A_{ki})| ≤ 3n^Θ (log n)^η. So on E_n, |ν_n(A)| ≤ 5n^Θ (log n)^η. As η ↓ 2/(ξ + 1) the factor of 5 can be dropped. Since A ∈ C is arbitrary, the proof is done.
12.3 Poissonization and random sets

Section 12.4 will give some lower bounds ‖ν_n‖_C ≥ f(n) holding with probability converging to 1 as n → ∞, where f is a product of powers of logarithms or iterated logarithms. Such an f has the following property. A real-valued function f defined for large enough x > 0 is called slowly varying (in the sense of Karamata) if for every c > 0, f(cx)/f(x) → 1 as x → +∞.
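As a concrete instance (hypothetical code, not from the text), the function f(x) = (log x)^{1/2}/log log x, of the form appearing in Theorem 12.4.1 below, is continuous and slowly varying; numerically f(cx)/f(x) approaches 1 while f(x)/x tends to 0, two facts used at the start of the proof of Lemma 12.3.4:

```python
import math

def f(x):
    # A product of powers of (iterated) logarithms, as in Theorem 12.4.1.
    return math.log(x) ** 0.5 / math.log(math.log(x))

# Slow variation: f(cx)/f(x) -> 1 for each fixed c > 0, here c = 100,
# even though f(x) -> infinity; meanwhile f(x)/x -> 0.
for x in (1e4, 1e8, 1e16):
    print(x, f(100 * x) / f(x), f(x) / x)
```

The printed ratios move toward 1 as x grows, while the second column shrinks to 0.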
12.3.1 Lemma If f is continuous and slowly varying, then for every ε > 0 there is a δ = δ(ε) > 0 such that whenever x ≥ 1/δ and |1 − (u/x)| < δ, we have |1 − (f(u)/f(x))| < ε.

Proof For each c > 0 and ε > 0 there is an x(c, ε) such that for x ≥ x(c, ε), |f(cx)/f(x) − 1| < ε/4. Note that if x(c, ε) ≤ n, even for one c, then f(x) ≠ 0 for all x ≥ n. By the category theorem (e.g., RAP, Theorem 2.5.2), for fixed ε > 0 there is an n < ∞ such that x(c, ε) ≤ n for all c in a set dense in some interval [a, b], where 0 < a < b, and thus by continuity for all c in [a, b]. Then for c, d ∈ [a, b] and x ≥ n, |(f(cx) − f(dx))/f(x)| < ε/2. Fix c := (a + b)/2. Let u := cx. Then for u ≥ nc, we have

|f(ud/c)/f(u) − 1| < (ε/2) f(u/c)/f(u).

As u → +∞, f(u)/f(u/c) → 1. Then there is a δ > 0 with δ ≤ (b − a)/(b + a) such that for u ≥ 1/δ and all r with |r − 1| < δ, we have |f(ru)/f(u) − 1| < ε.
Recall the Poisson law P_c on N with parameter c > 0, so that P_c(k) := e^{−c} c^k/k! for k = 0, 1, …. Given a probability space (X, A, P), let U_c be a Poisson point process on (X, A) with intensity measure cP. That is, for any disjoint A₁, …, A_m in A, the U_c(A_j) are independent random variables, j = 1, …, m, and for any A ∈ A, U_c(A) has law P_{cP(A)}. Let Y_c(A) := (U_c − cP)(A), A ∈ A. Then Y_c has mean 0 on all A and still has independent values on disjoint sets.

Let x(1), x(2), … be coordinates for the product space (X^∞, A^∞, P^∞). For c > 0 let n(c) be a random variable with law P_c, independent of the x(i). Then for P_n := n^{−1}(δ_{x(1)} + … + δ_{x(n)}), n ≥ 1, and P₀ := 0, we have:

12.3.2 Lemma The process Z_c := n(c)P_{n(c)} is a Poisson process with intensity measure cP.

Proof We have laws L(U_c(X)) = L(Z_c(X)) = P_c. If X_i are independent Poisson variables with L(X_i) = P_{c(i)} and Σ_{i=1}^{m} c(i) = c, then given n := Σ_{i=1}^{m} X_i, the conditional distribution of (X₁, …, X_m) is multinomial with total n and probabilities p_i = c(i)/c. Thus U_c and Z_c have the same conditional distributions on disjoint sets given their values on X. This implies the lemma.
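Lemma 12.3.2 is also a simulation recipe: to realize a Poisson process with intensity measure cP, draw n(c) with law P_c and then n(c) i.i.d. (P) points. In the sketch below (hypothetical code, not from the text), P is taken to be the uniform law on [0, 1], and the counts on disjoint intervals come out empirically independent with Poisson means and variances cP(A_j):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 400.0   # intensity; P = uniform law on [0, 1]

def poisson_process_counts(bins, n_sims):
    """Simulate Z_c = n(c) P_{n(c)} as in Lemma 12.3.2 and return the
    counts Z_c(A_j) for the disjoint intervals determined by `bins`."""
    out = np.empty((n_sims, len(bins) - 1))
    for s in range(n_sims):
        n_c = rng.poisson(c)        # n(c) with law P_c
        pts = rng.random(n_c)       # n(c) i.i.d. P points
        out[s] = np.histogram(pts, bins=bins)[0]
    return out

counts = poisson_process_counts(bins=[0.0, 0.25, 0.5, 1.0], n_sims=4000)
# Each column should be Poisson(c * P(A_j)): mean = variance = c|A_j|.
print(counts.mean(axis=0))   # roughly [100, 100, 200]
print(counts.var(axis=0))    # roughly [100, 100, 200]
print(np.corrcoef(counts[:, 0], counts[:, 1])[0, 1])   # near 0
```

The near-zero sample correlation between counts on disjoint intervals reflects the independent-pieces property of U_c.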
From here on, the version U_c = Z_c will be used. Thus for each ω, U_c(·)(ω) is a countably additive integer-valued measure of total mass U_c(X)(ω) = n(c)(ω). Then

(12.3.3) Y_c = n(c)P_{n(c)} − cP = n(c)(P_{n(c)} − P) + (n(c) − c)P,
Y_c/c^{1/2} = (n(c)/c)^{1/2} ν_{n(c)} + (n(c) − c)c^{−1/2} P.

The following shows that the empirical process ν_n is asymptotically "as large" as a corresponding Poisson process.
12.3.4 Lemma Let (X, A, P) be a probability space and C ⊂ A. Assume that for each n and constant t, sup_{A∈C} |(P_n − tP)(A)| is measurable. Let f be a continuous, slowly varying function such that as x → +∞, f(x) → +∞. For b > 0 let

g(b) := liminf_{x→+∞} Pr{sup_{A∈C} |Y_x(A)| > b f(x) x^{1/2}}.

Then for any a < b,

liminf_{n→∞} Pr{sup_{A∈C} |ν_n(A)| > a f(n)} ≥ g(b).

Proof It follows from Lemma 12.3.1 that f(x)/x → 0 as x → ∞. From Lemma 12.3.2 and (12.3.3), sup_{A∈C} |Y_x(A)| is measurable. As x → +∞, Pr(n(x) > 0) → 1 and n(x)/x → 1 in probability. If the lemma is false, there are a θ < g(b) and a sequence m_k → +∞ with, for each m = m_k,

Pr{sup_{A∈C} |ν_m(A)| > a f(m)} ≤ θ.
Choose 0 < ε < 1/3 such that a(1 + 7ε) < b. Then let 0 < δ < 1/2 be such that δ ≤ δ(ε) in Lemma 12.3.1 and (1 + δ)(1 + 5ε) ≤ 1 + 6ε. We may assume that for all k, m = m_k ≥ 2/δ and (1 + 2ε)f(m)^{−1/2} ≤ aε.

Set δ_m := (f(m)/m)^{1/2}. Then since f(x)/x → 0 we may assume δ_m ≤ δ/2 for all m = m_k. Then for any m = m_k, if (1 − δ_m)m ≤ n ≤ m, we have mP_m(A) ≥ nP_n(A) for all A ∈ A, so m(P_m − P)(A) ≥ n(P_n − P)(A) − mδ_m. Conversely nP_n(A) ≥ mP_m(A) − mδ_m and

n(P_n − P)(A) ≥ m(P_m − P)(A) − mδ_m.

Thus

|ν_m(A)| ≥ (n/m)^{1/2} |ν_n(A)| − f(m)^{1/2} ≥ (1 + δ)^{−1} |ν_n(A)| − f(m)^{1/2},

since |1 − (n/m)| ≤ 2δ_m ≤ δ, so |ν_n(A)| ≤ (1 + δ)(|ν_m(A)| + f(m)^{1/2}). Next, |1 − (n/m)| ≤ δ implies |f(n)/f(m) − 1| < ε, so 1/f(n) ≤ (1 + 2ε)/f(m). Hence

|ν_n(A)|/f(n) ≤ (1 + 2ε)(1 + δ)(|ν_m(A)| f(m)^{−1} + f(m)^{−1/2}).

Thus, since (1 + 2ε)f(m)^{−1/2} ≤ aε,

Pr{sup_{A∈C} |ν_n(A)| > a f(n)(1 + 3ε)} ≤ θ.
For each m = m_k, set c = c_m := (1 − δ_m/2)m. Then as k → ∞, since f(m_k) → ∞, by Chebyshev's inequality and since P_c has variance c,

Pr{(1 − δ_m)m ≤ n(c) ≤ m} → 1.

Then for any γ with θ < γ < g := g(b) and k large enough, since the x(i) are independent of n(c),

Pr{sup_{A∈C} |ν_{n(c)}(A)| > a f(n(c))(1 + 3ε)} ≤ γ.

For k large enough we may assume

Pr{sup_{A∈C} |ν_{n(c)}(A)| > a f(c)(1 + 5ε)} ≤ γ.

For k large enough, applying Chebyshev's inequality to n(c) − c, we have, since c → ∞,

Pr{(n(c)/c)^{1/2} > (1 + 6ε)/(1 + 5ε)} ≤ Pr{(n(c) − c)/c^{1/2} > 2εc^{1/2}/(1 + 5ε)²} ≤ (g − γ)/4,

and since f(c) → ∞, Pr{[n(c) − c]/c^{1/2} > a f(c)ε} ≤ (g − γ)/4. Thus by (12.3.3),

Pr{sup_{A∈C} |Y_c(A)| > a c^{1/2} f(c)(1 + 7ε)} ≤ (γ + g)/2 < g,

a contradiction, proving Lemma 12.3.4.

Next, the Poisson process's independence property on disjoint sets will be extended to suitable random sets. Let (X, A) be a measurable space and (Ω, B, Pr)
a contradiction, proving Lemma 12.3.4. Next, the Poisson process's independence property on disjoint sets will be extended to suitable random sets. Let (X, A) be a measurable space and (S2, B, Pr)
a probability space. A collection (BA : A E A) of sub-a-algebras of B will be called a filtration if 8A C 13B whenever A C B in A. A stochastic process Y indexed by A, (A, c)) H Y(A)(w), will be called adapted to (BA : A E A) if for every A E A, Y(A)(.) is BA measurable. Then the process and filtration will be written {Y(A), BA)AE.A. A stochastic process Y: (A, w) - Y(A)(w), A E A, co E S2, will be said to have independent pieces if for any disjoint A1, , A. E A, Y(AK) are independent,
12.3 Poissonization and random sets
371
j = 1,
, m, and Y(A1 U A2) = Y(A1) + Y(A2) almost surely. Clearly each Y,, has independent pieces. If in addition the process is adapted to a filtration (BA : A E A), the process {Y(A),13A}AEA will be said to have independent pieces if for any disjoint sets A1, . , A in A, the random variables Y02), , Y(An) and any random variable measurable for the a-algebra BA, are jointly independent. For example, for any C E A let Bc be the smallest a-algebra for which every is measurable for A C C, A E A. This is clearly a filtration, and the smallest filtration to which Y is adapted. A function G from 7 into A will be called a stopping set for a filtration {BA :
A E A) if for all C E A, (w: G(w) C C) E 13c. Given a stopping
set let EG be the a-algebra of all sets B E 13 such that for every C E A, B fl {G C C} E Bc. (Note that if G is not a stopping set, then SZ ¢ 13G, so BG would not be a a-algebra.) If G (w) = H E A then it is easy to check that G is
a stopping set and BG = 13x 12.3.5 Lemma Suppose (Y (A), BA}AEA has independent pieces and for all
WES2,G(w)EA,A(W)EA,and E(w)EA. Assume that:
(i) G(.) is a stopping set; (ii) for all w, G(w) is disjoint from A(w) and from E(w); (iii) G (w), A(w), and E(w) each have just countably many possible val-
ues G(j) := Gj E A, C(i) := Ci E A, and D(j) := Dj E A, respectively;
(iv) for all i, j, {A(.) = Ci} E BG and
D1) E 13G.
Then the conditional probability law (joint distribution) of Y(A) and Y(E) given BG satisfies
C{(Y(A), Y(E))IBG} =
I{A(.)=C(i),E(.)=D(j)}C(Y(Ci), Y(Dj))
i,j where G(Y(Ci), Y(Dj)) is the unconditional joint distribution of Y(Ci) and Y(Dj). If this unconditional distribution is the same for all i, j, then (Y(A), Y(E)) is independent 0f 8G. rather Proof The proof will be given when there is only one random set than two, A0 and E(.). The proof for two is essentially the same. If for some
ω, A(ω) = C_i and G(ω) = G_j, then by (ii), C_i ∩ G_j = ∅ (the empty set). Thus Y(C_i) is independent of B_{G(j)}. Let B_i := B(i) := {A(·) = C_i} ∈ B_G by (iv). For each j,

{G = G_j} = {G ⊂ G_j} \ ∪{{G ⊂ G_i} : G_i ⊂ G_j, G_i ≠ G_j},

so by (i),

(12.3.6) {G = G_j} ∈ B_{G(j)}.

Let H_j := {G = G_j}. For any D ∈ A, H_j ∩ {G ⊂ D} = ∅ ∈ B_D if G_j ⊄ D. If G_j ⊂ D, then H_j ∩ {G ⊂ D} = H_j ∈ B_{G(j)} ⊂ B_D by (12.3.6). Thus

(12.3.7) H(j) := H_j ∈ B_G.

For any B ∈ B_G, by (12.3.6),

(12.3.8) B ∩ H_j = B ∩ {G ⊂ G_j} ∩ H_j ∈ B_{G(j)}.

We have for any real t almost surely, since B_i ∈ B_G,

Pr(Y(A) ≤ t | B_G) = Σ_{i,j} Pr(Y(A) ≤ t | B_G) 1_{B(i)} 1_{H(j)}
= Σ_{i,j} Pr(Y(C_i) ≤ t | B_G) 1_{B(i)} 1_{H(j)}.

The sums can be restricted to i, j such that B_i ∩ H_j ≠ ∅ and therefore C_i ∩ G_j = ∅. Now Pr(Y(C_i) ≤ t | B_G) is a B_G-measurable function f_i such that for any Γ ∈ B_G,

Pr({Y(C_i) ≤ t} ∩ Γ) = ∫_Γ f_i dPr.

Restricted to H_j, f_i equals a B_{G(j)}-measurable function by (12.3.8). By (12.3.7), Γ ∩ H_j ∈ B_G, so

(12.3.9) ∫_{Γ∩H(j)} f_i dPr = Pr({Y(C_i) ≤ t} ∩ Γ ∩ H_j) = Pr(Y(C_i) ≤ t) Pr(Γ ∩ H_j)

because by (12.3.8), Γ ∩ H(j) ∈ B_{G(j)}, which is independent of {Y(C_i) ≤ t}.

One solution f_i to (12.3.9) for all j is f_i ≡ Pr(Y(C_i) ≤ t). To show that this solution is unique, it will be enough to show that it is unique on H_j for each j. First, for a set A ⊂ H_j it will be shown that A ∈ B_{G(j)} if and only if A ∈ B_G. By (12.3.6) and (12.3.7), H_j itself is in both B_G and B_{G(j)}. We can write A = A ∩ H(j). Then "if" follows from (12.3.8). To prove "only if," let A ∈ B_{G(j)}. Then for C ∈ A, let

F := A ∩ {G ⊂ C} = A ∩ {G_j ⊂ C}.

If G_j ⊂ C then F = A ∈ B_{G(j)} ⊂ B_C. Otherwise, F is empty and in B_C. In either case F ∈ B_C, so A ∈ B_G as desired. Thus, if g is a function on H_j, measurable for B_{G(j)} there, and

∫_{D∩H(j)} g dPr = 0

for all D ∈ B_G, then the same holds for all D ∈ B_{G(j)}. So, as in the usual proof of uniqueness of conditional expectations,

∫_{{g>0}∩H(j)} g dPr = ∫_{{g<0}∩H(j)} g dPr = 0,

so g = 0 on H(j) a.s. and f_i = Pr(Y(C_i) ≤ t) a.s. on H(j) for all j. Thus Pr(Y(A) ≤ t | B_G) = Σ_i Pr(Y(C_i) ≤ t) 1_{B(i)}.
Here is another fact about stopping sets, which corresponds to a known fact about nonnegative real-valued stopping times or Markov times (e.g., RAP, Lemma 12.2.5):

12.3.10 Lemma If G and H are stopping sets and G ⊂ H, then B_G ⊂ B_H.

Proof For any measurable set D and A ∈ B_G, we have

A ∩ {H ⊂ D} = (A ∩ {G ⊂ D}) ∩ {H ⊂ D} ∈ B_D.
12.4 Lower bounds in borderline cases

Recall the classes C(α, K, d) of subgraphs of functions with bounded derivatives through order α in R^d, defined in Section 8.2. We had lower bounds for P_n − P on C(α, K, d) in Theorem 12.1.2 which imply that for α < d − 1, ‖ν_n‖_{C(α,K,d)} → ∞ surely as n → ∞. For α > d − 1, C(α, K, d) is a Donsker class by Theorem 8.2.1 and Corollary 7.2.13, so ‖ν_n‖_{C(α,K,d)} is bounded in probability. Thus α = d − 1 is a borderline case. Other such cases are given by the class LL₂ of lower layers in R² (Section 8.3) and the class C₃ of convex sets in R³ (Section 8.4), for λ_d = Lebesgue measure on the unit cube I^d, where I := [0, 1]. Any lower layer A has a closure Ā which is also a lower layer, with λ_d(Ā \ A) = 0, where in the present case d = 2. It is easily seen that suprema of our processes over all lower layers are equal to suprema over closed lower layers, so it will be enough to consider closed lower layers. Let LL₂ be the class of all closed lower layers in R².
Let P = λ_d and c > 0. Recall the centered Poisson process Y_c from Section 12.3. Let N_c := U_c − V_c where U_c and V_c are independent Poisson processes, each with intensity measure cP. Equivalently, we could take U_c and V_c to be centered. The following lower bound holds for all the above borderline cases:
12.4.1 Theorem For any K > 0 and δ > 0 there is a γ = γ(d, K, δ) > 0 such that

lim_{x→+∞} Pr{‖Y_x‖_C ≥ γ(x log x)^{1/2}(log log x)^{−δ−1/2}} = 1

and

lim_{n→∞} Pr{‖ν_n‖_C ≥ γ(log n)^{1/2}(log log n)^{−δ−1/2}} = 1,

where C = C(d − 1, K, d), d ≥ 2, or C = LL₂, or C = C₃.

For a proof, see the next section, except that here I will give a larger lower bound with probability close to 1, of order (log n)^{3/4}, in the lower layer case (C = LL₂, d = 2). Shor (1986) first showed that E‖Y_x‖_C ≥ γ x^{1/2}(log x)^{3/4} for some γ > 0 and x large enough. Shor's lower bound also applies to
C(1, K, 2) by a 45° rotation as in Section 8.3. For an upper bound with a 3/4 power of the log, also for convex subsets of a fixed bounded set in R³, see Talagrand (1994, Theorem 1.6).

To see that the supremum of N_c, Y_c, or an empirical process ν_n over LL₂ is measurable, note first for P_n that for each F ⊂ {1, …, n} and each ω, there is a smallest closed lower layer L_F(ω) containing the x_j for j ∈ F, with L_F(ω) := ∅ for F = ∅. For any c ≥ 0, ω ↦ (P_n − cP)(L_F(ω))(ω) is measurable. The supremum of P_n − cP over LL₂, as the maximum of these 2^n measurable functions, is measurable. Letting n = n(c) as in Lemma 12.3.2 and (12.3.3) then shows that sup{Y_c(A) : A ∈ LL₂} is measurable. Likewise, there is a largest open lower layer not containing x_j for any j ∈ F, so sup{|Y_c(A)| : A ∈ LL₂} and sup{|ν_n(A)| : A ∈ LL₂} are measurable. For N_c, taking non-centered Poisson processes U_c and V_c, their numbers of points m(ω) and n(ω) are measurable, as are the m-tuple and n-tuple of points occurring in each. For each i = 0, 1, …, m and k = 0, 1, …, n, it is a measurable event that there exists a lower layer containing exactly i of the m points and k of the n, and so the supremum of N_c over all lower layers, as a measurable function of the indicators of these finitely many events, is measurable.
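For small n, the representation just given of the supremum as a maximum over the 2^n layers L_F can be evaluated directly. In this sketch (hypothetical code, not from the text; P is the uniform law on I²), the smallest closed lower layer containing a finite set F of points is the union of the closed quadrants below and to the left of its points, and its area is read off the staircase of maximal points:

```python
import itertools
import numpy as np

def layer_area(gen_pts):
    """Area in I^2 of the smallest closed lower layer containing the
    given points: the union of the quadrants [0, x_j] x [0, y_j]."""
    if not gen_pts:
        return 0.0
    stair = []
    for x, y in sorted(gen_pts):       # x ascending
        while stair and stair[-1][1] <= y:
            stair.pop()                # dominated in both coordinates
        stair.append((x, y))
    area, prev_x = 0.0, 0.0
    for x, y in stair:                 # x ascending, y descending
        area += (x - prev_x) * y
        prev_x = x
    return area

def sup_over_lower_layers(points, c=1.0):
    """sup over closed lower layers L of (P_n - c P)(L), computed as a
    maximum over the 2^n layers L_F from the argument above."""
    n = len(points)
    best = 0.0                         # F empty gives the empty layer
    for r in range(1, n + 1):
        for F in itertools.combinations(points, r):
            # L_F is closed, so points on its boundary are counted.
            pn = sum(any(px <= x and py <= y for x, y in F)
                     for px, py in points) / n
            best = max(best, pn - c * layer_area(F))
    return best

rng = np.random.default_rng(0)
pts = [tuple(p) for p in rng.random((10, 2))]
print(round(sup_over_lower_layers(pts), 4))
```

The cost is exponential in n, in line with the 2^n layers in the measurability argument; this is only a check, not a practical algorithm.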
12.4.2 Theorem For every ε > 0 there is a δ > 0 such that for the uniform distribution P on the unit square I², and n large enough,

Pr{sup{|ν_n(A)| : A ∈ LL₂} > δ(log n)^{3/4}} > 1 − ε,

and the same holds for C(1, 2, 2) in place of LL₂. Also, ν_n can be replaced by N_c/c^{1/2} or Y_c/c^{1/2} if log n is replaced by log c, for c large enough.

Remark. The order (log n)^{3/4} of the lower bound is optimal, as there is an upper bound in expectation of the same order, not proved here; see Rhee and Talagrand (1988), Leighton and Shor (1989), and Coffman and Shor (1991).
Proof Let M₁ be the set of all functions f on [0, 1] with f(0) = f(1) = 1/2 and ‖f‖_L ≤ 1, that is, |f(u) − f(x)| ≤ |x − u| for 0 ≤ x ≤ u ≤ 1. Then 0 ≤ f(x) ≤ 1 for 0 ≤ x ≤ 1. For any f : [0, 1] → [0, ∞) let S_f be the subgraph of f,

S_f := S(f) := {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ f(x)}.

Then for each f ∈ M₁ we have S_f ∈ C(1, 2, 2). Let S₁ := {S_f : f ∈ M₁}. Let R be the counterclockwise rotation of R² by 45°, R = 2^{−1/2}[[1, −1], [1, 1]]. Then for each f ∈ M₁, R^{−1}(S_f) = M ∩ R^{−1}(I²) where M is a lower layer, I² := I × I, and I := [0, 1]. So it will be enough to prove the theorem for S₁ ⊂ C(1, 2, 2).
Let [x] denote the largest integer ≤ x and let 0 < δ < 1/3. For each ω and each c large enough so that δ²(log c) > 1, functions f₀ ≤ f₁ ≤ … ≤ f_L ≤ g_L ≤ … ≤ g₁ ≤ g₀ will be defined for L := [δ²(log c)], with f_i := f_i(t) := f_i(ω, t) for ω ∈ Ω and 0 ≤ t ≤ 1.

Each f_i and g_i will be continuous and piecewise linear in t. For each j = 1, …, 2^i, one of f_i and g_i will be linear on I_{ij} := [(j − 1)/2^i, j/2^i], and the other will be linear only on each half, I_{i+1,2j−1} and I_{i+1,2j}, with f_i = g_i at the endpoints of I_{ij}. Thus over I_{ij}, the region T_{ij} between the graphs of f_i and g_i is a triangle.

Let f₀(x) := 1/2 for 0 ≤ x ≤ 1. Let s := (log c)^{−1/2} and g₀(1/2) := (1 + s)/2. Given f_i and g_i, to define f_{i+1} and g_{i+1}, we have two cases.

Case I. Suppose g_i is linear on I_{ij}, as in Figure 12.1. Let p_k := (x_k, y_k) be the point labeled by k in Figure 12.1. Then x_k = (4j + k − 5)/2^{i+2} for k = 1, …, 5, x_k = x_{10−k} for k = 6, 7, 8, x₉ = x₂ + 1/(3·2^{i+2}), and x₁₀ = x₄ − 1/(3·2^{i+2}). Thus T := T_{ij} is the triangle p₁p₃p₅. Let Q := Q_{ij} be the quadrilateral p₃p₁₀p₇p₉, and V := V_{ij} the union p₂p₃p₉ ∪ p₃p₄p₁₀ of two triangles. The triangles p₁p₂p₇, p₁p₃p₈, p₃p₅p₆, and p₄p₅p₇ each have 1/4 the area of T, Q has 1/3, and V has 1/6.
Figure 12.1.
Let ρ_{ij}, i = 0, 1, 2, …, j = 1, …, 2^i, be i.i.d. Rademacher random variables, so that Pr(ρ_{ij} = 1) = Pr(ρ_{ij} = −1) = 1/2, independent of the N_c process.

If N_c(Q_{ij}) > 0, or if N_c(Q_{ij}) = 0 and ρ_{ij} = 1, say event A_{ij} occurs and set g_{i+1} := g_i on I_{ij}; then the graph of f_{i+1} on I_{ij} will consist of line segments joining the points p₁, p₂, p₇, p₄, p₅ in turn, so T_{i+1,2j−1} = p₁p₂p₇ and T_{i+1,2j} = p₇p₄p₅.

If A_{ij} doesn't occur, that is, N_c(Q_{ij}) < 0 or N_c(Q_{ij}) = 0 and ρ_{ij} = −1, set f_{i+1} := f_i on I_{ij} and let the graph of g_{i+1} consist of line segments joining the points p₁, p₈, p₃, p₆, p₅, so that T_{i+1,2j−1} = p₁p₈p₃ and T_{i+1,2j} = p₃p₆p₅. This finishes the recursive definition of f_i and g_i in Case I. The stated properties of f_i and g_i continue to hold.

Case II. In this case f_i is linear on I_{ij}; see Figure 12.2. Then T_{ij} is the triangle p₁p₅p₇. The other definitions remain the same as in Case I.

By induction, there is an integer m = m_{ij} = m_{ij}(ω) such that p₁p₅ has slope ms, p₁p₃ in Figure 12.1 and p₇p₅ in Figure 12.2 have slope (m − 1)s, while p₃p₅ in Figure 12.1 and p₁p₇ in Figure 12.2 have slope (m + 1)s. We have for each i and j that m_{i+1,2j−1} − m_{ij} and m_{i+1,2j} − m_{ij} have possible values 0 or ±1. Specifically, on A_{ij} in Case I, or on its complement A_{ij}^c in Case II, m_{i+1,2j} = m_{i+1,2j−1} = m_{ij}. On A_{ij}^c in Case I, m_{i+1,2j−1} − m_{ij} = −1 and m_{i+1,2j} − m_{ij} = 1. On A_{ij} in Case II, m_{i+1,2j−1} − m_{ij} = 1 and m_{i+1,2j} − m_{ij} = −1.

Let (Ω, μ) be a probability space on which N_c and all the ρ_{ij} are defined. Take Ω₁ := Ω × [0, 1] with product measure Pr := μ × λ, where λ is the uniform
Figure 12.2.
(Lebesgue) law on [0, 1]. For almost all t ∈ [0, 1], t is not a dyadic rational, so for each i = 1, 2, …, there is a unique j := j(i, t), j = 1, …, 2^i, such that t ∈ Int_{ij} := ((j − 1)/2^i, j/2^i). Thus Int_{ij} is the interior of I_{ij}. The derivatives f_i′ and g_i′ are step functions, constant on each interval Int_{i+1,j} and possibly undefined at the endpoints.

Some random subsets of I² are defined as follows. For m = 0, 1, 2, …, let

G_m := G(m) := G(m, ω) := ∪_{k=0}^{m} ∪_{i=1}^{2^k} Q_{ki}

and

G(m, j) := G(m − 1) ∪ ∪_{i=1}^{j} Q_{mi} for j = 1, …, 2^m.

For each C in the Borel σ-algebra B := B(I²), k = 0, 1, …, and i = 1, …, 2^k, let B_C^{(k,i)} be the smallest σ-algebra of subsets of Ω with respect to which all ρ_{mr}, for m = 0, 1, …, k − 1 or m = k, r = 1, …, i, and all N_c(A), for Borel A ⊂ C, are measurable. It is easily seen that for each fixed k and i, {B_C^{(k,i)} : C ∈ B} is a filtration and {N_c(A), B_A^{(k,i)}}_{A∈B} has independent pieces as defined before Lemma 12.3.5.
Classes Too Large for Central Limit Theorems

The initial triangle T_01 is fixed and has area s/2. For each k = 1, 2, ..., and each possible value of the triangle T_{k−1,i}, there are two possible values of the pair of triangles T_{k,2i−1}, T_{k,2i}, each with area s/2^{2k+1}. Thus for each m = 0, 1, 2, ..., there are finitely many possibilities for the values of T_ki and Q_ki for k = 0, 1, ..., m and i = 1, ..., 2^k, each on a measurable event, and so for whether A_ki holds or not. So ω ↦ N_c(Q_ij(ω))(ω) is measurable, and for each m and j, there are finitely many possible values of G(m, j). Next we have:
12.4.3 Lemma For each k = 0, 1, 2, ... and i = 1, ..., 2^k:
(a) G(k, i) is a stopping set for {B_A^{(k,i)}}_{A∈B};
(b) A_ki ∈ B_{G(k,i)}^{(k,i)};
(c) if k ≥ 1, then for each possible value Q of Q_ki, {Q_ki = Q} ∈ B^{(k−1)}, where B^{(r)} := B_{G(r)}^{(r)} for r = 0, 1, ....
Proof It will be shown by double induction on k, and on i for fixed k, that (a), (b), and (c) all hold. Let B_A^{(k)} := B_A^{(k,2^k)} for any A ∈ B(I²) and k = 0, 1, ....

For k = 0 we have G(0) = G(0, 1) = Q_01, a fixed set. So, as noted just before Lemma 12.3.5, G(0) is a stopping set, and B^{(0)} = B_{G(0)}, where B_{G(0)} is defined as for stopping sets. So (a) holds for k = 0. Since 1_{A_01} is a measurable function of N_c(Q_01) and ρ_01, we have A_01 ∈ B^{(0)}, so (b) holds for k = 0. In (c), each event {Q_ki = Q} is a Boolean combination of events A_jr for j = 0, ..., k − 1 and r = 1, ..., 2^j. Thus if (a) and (b) hold for k = 0, 1, ..., m − 1 for some m ≥ 1, then (c) holds for k = m. So (c) holds for k = 1.
To prove (a) for k = m, let J(m, j) be the finite set of possible values of G(m, j). If (c) holds for k = 1, ..., m, then from the definition of G(m, j), for each G ∈ J(m, j),

(12.4.4) {G(m, j) = G} ∈ B^{(m−1)}.

Thus, for any D ∈ B(I²), since G(m − 1) ⊂ G(m, j),

{G(m, j) ⊂ D} = {G(m, j) ⊂ D} ∩ {G(m − 1) ⊂ D} = ⋃_{G ∈ J(m,j), G ⊂ D} {G(m, j) = G} ∩ {G(m − 1) ⊂ D} ∈ B_D^{(m,j)},
so G(m, j) is a stopping set and (a) holds for k = m.

Then, to show (b) for k = m, for a set C ∈ B(I²) and i = 1, ..., 2^m let

C^{(m,i)} := {N_c(C) > 0} ∪ {N_c(C) = 0, ρ_mi = 1}. Then

(12.4.5) C^{(m,i)} ∈ B_C^{(m,i)}.
If C(m, i) is the finite set of possible values of Q_mi, then

A_mi = ⋃_{Q ∈ C(m,i)} {Q_mi = Q} ∩ Q^{(m,i)}.

Thus

A_mi ∩ {G(m, i) ⊂ D} = {G(m, i) ⊂ D} ∩ {G(m − 1) ⊂ D} ∩ ⋃_{Q ∈ C(m,i), Q ⊂ D} {Q_mi = Q} ∩ Q^{(m,i)}.

It follows from (12.4.4) that {G(m, i) ⊂ D} ∈ B^{(m−1)} for each m and i. By (c) for k = m, each {Q_mi = Q} ∈ B^{(m−1)}. Thus

{G(m, i) ⊂ D} ∩ {Q_mi = Q} ∈ B^{(m−1)},

and so

{G(m, i) ⊂ D} ∩ {Q_mi = Q} ∩ {G(m − 1) ⊂ D} ∈ B_D^{(m−1)}.

For Q ⊂ D, by (12.4.5), Q^{(m,i)} ∈ B_D^{(m,i)}. So, taking a union of intersections, A_mi ∩ {G(m, i) ⊂ D} ∈ B_D^{(m,i)}. Thus A_mi ∈ B_{G(m,i)}^{(m,i)} and (b) holds for k = m. This finishes the proof of the lemma.

Next, here is a fact about symmetrized Poisson variables.
12.4.6 Lemma There is a constant c_0 > 0 such that if X and Y are independent random variables, each having a Poisson distribution with parameter λ ≥ 1, then E|X − Y| ≥ c_0 λ^{1/2}.

Proof By a straightforward calculation, E((X − Y)^4) = 2λ + 12λ² ≤ 14λ² if λ ≥ 1. By the Cauchy-Bunyakovsky-Schwarz inequality,

2λ = E((X − Y)²) ≤ (E|X − Y|)^{1/2}(E(|X − Y|³))^{1/2},

and E(|X − Y|³) ≤ (E(|X − Y|^4))^{3/4}. The lemma follows with c_0 = 4/14^{3/4}.
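As a numerical sanity check of Lemma 12.4.6 (not part of the proof), the following sketch estimates E|X − Y| by Monte Carlo and compares it with c_0 λ^{1/2}, c_0 = 4/14^{3/4} ≈ 0.553; the sampler and sample sizes are illustrative choices.

```python
import math
import random

C0 = 4 / 14 ** 0.75  # the constant from Lemma 12.4.6, about 0.553

def poisson_sample(lam, rng):
    # Knuth's multiplication method; adequate for moderate lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def mean_abs_difference(lam, n_samples=20_000, seed=0):
    """Monte Carlo estimate of E|X - Y| for X, Y i.i.d. Poisson(lam)."""
    rng = random.Random(seed)
    total = sum(abs(poisson_sample(lam, rng) - poisson_sample(lam, rng))
                for _ in range(n_samples))
    return total / n_samples

for lam in (1.0, 4.0, 25.0):
    assert mean_abs_difference(lam) > C0 * math.sqrt(lam)
```

For large λ, X − Y is approximately N(0, 2λ), so E|X − Y| ≈ 2(λ/π)^{1/2} ≈ 1.13 λ^{1/2}, comfortably above the bound.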
Now continuing with the proof of Theorem 12.4.2, set G(m, 0) := G_{m−1}. We apply Lemma 12.4.3(a) and (c). For m = 1, 2, ... and j = 1, ..., 2^m, the random set Q_mj is disjoint from the stopping set G_{m,j−1}. The hypotheses of Lemma 12.3.5 hold for (G, A) = (G_{m,j−1}, Q_mj) and the process Y = N_c with the filtration {B_A^{(m,j−1)} : A ∈ B}. Since all possible values of Q_mj have the
same area, it follows that the random variable N_c(Q_mj) has the law of U − V for U, V i.i.d. Poisson with parameter

(12.4.7) cP(Q_mj) = cs/(6·4^m),

and is independent of B_{m,j−1}, which was defined in Lemma 12.4.3(b). Also, N_c(Q_ri) is B_{m,j−1} measurable for r < m, or for r = m and i < j. So the variables N_c(Q_mj), for m = 0, 1, 2, ... and j = 1, ..., 2^m, are all jointly independent with the given laws. Also, the i.i.d. Rademacher variables ρ_mj are independent of the process N_c.

For t ∈ Int_ij, let h_i(ω, t) be the slope of the longest side of T_ij, so h_i(ω, t) = g_i'(t) in Case I and f_i'(t) in Case II, while h_0(ω, t) ≡ 0. In Case I, if A_ij holds then h_{i+1} = h_i on Int_ij; if ω ∉ A_ij then h_{i+1} − h_i = −s on Int_{i+1,2j−1} and +s on Int_{i+1,2j}. In Case II, if ω ∉ A_ij then h_{i+1} = h_i on Int_ij; if ω ∈ A_ij then h_{i+1} − h_i = s on Int_{i+1,2j−1} and −s on Int_{i+1,2j}.
Let ξ_i := h_i − h_{i−1} for i = 1, ..., L. By Lemmas 12.3.5 and 12.4.3, and (12.4.7), any N_c(Q_mj) or ρ_mj for m > k is independent of B^{(k−1)}, where B^{(−1)} is defined as the trivial σ-algebra {∅, Ω_1}. So each of ξ_1, ..., ξ_k is B^{(k−1)} measurable, while any event A_kj and the random function ξ_{k+1} are independent of B^{(k−1)}. It follows that h_i(ω, t) = Σ_{r=1}^{i} ξ_r(ω, t), where the ξ_r are i.i.d. variables for Pr = P × λ on Ω × [0, 1] having distribution Pr(ξ_r = 0) = 1/2, Pr(ξ_r = s) = Pr(ξ_r = −s) = 1/4. Since the ξ_r are independent and symmetric, we have by the P. Lévy inequality (1.3.16) that for any M > 0,

(12.4.8) Pr(|h_r| > M for some r ≤ L) ≤ 2 Pr(|h_L| > M).

Now we can write ξ_r = η_{2r−1} + η_{2r}, where the η_j are i.i.d. variables with Pr(η_j = s/2) = Pr(η_j = −s/2) = 1/2. So by one of Hoeffding's inequalities (Proposition 1.3.5), we have Pr(|h_L| > M) ≤ exp(−M²/(Ls²)). For c large, (1 − 2s)² > 1/2. Thus

(12.4.9) Pr(|h_r| > 1 − 2s for some r ≤ L) ≤ 2 exp(−(1 − 2s)²/(Ls²)) ≤ 2 exp(−log c/(2δ² log c)) = 2 exp(−1/(2δ²)).

Let K(ω, t) := inf{k : |h_k(ω, t)| > 1 − 2s} ≤ +∞. Then |f_k'(t)| ≤ 1 for all k ≤ K(ω, t). Let Φ_k(ω, t) := f_{min(K,k)}(ω, t). Then Φ_k ∈ M_1 for each k. If |h_r(t)| ≤ 1 − 2s for all r ≤ L, then |f_r'(t)| ≤ 1 and |g_r'(t)| ≤ 1 for all r ≤ L, K(ω, t) > L, and Φ_L(ω, t) = f_L(ω, t).
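The Hoeffding bound used for (12.4.9) — Pr(|h_L| > M) ≤ exp(−M²/(Ls²)) for a sum of 2L i.i.d. ±s/2 steps — can be checked by exact enumeration for small L (an illustrative check, not part of the proof; the values of L, s, M below are arbitrary):

```python
from math import comb, exp

def tail_prob(L, s, M):
    """Exact P(|h_L| > M), where h_L is the sum of n = 2L i.i.d. +-s/2
    steps; with k positive steps out of n, h_L = (k - L) * s."""
    n = 2 * L
    hits = sum(comb(n, k) for k in range(n + 1) if abs((k - L) * s) > M)
    return hits / 2 ** n

L, s = 8, 0.5
for M in (0.5, 1.0, 1.5, 2.0):
    assert tail_prob(L, s, M) <= exp(-M * M / (L * s * s))
```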
Let X_ki(ω) := 1 if K(ω, t) > k for some, or equivalently all, t ∈ Int_ki; otherwise let X_ki(ω) := 0. For any real x let x⁺ := max(x, 0). For k = 0, 1, ..., L − 1 and i = 1, ..., 2^k, we have Q_ki ⊂ S(f_{k+1}) if and only if A_ki holds. So Q_ki ⊂ S(Φ_L(ω, ·)) if and only if both A_ki holds and X_ki = 1. Let α_ki := 1_{A_ki},

(12.4.10) S_{Q,L} := Σ_{k=0}^{L−1} Σ_{i=1}^{2^k} X_ki N_c(Q_ki)⁺,

(12.4.11) S_{V,L} := Σ_{k=0}^{L−1} Σ_{i=1}^{2^k} V̂_ki, where V̂_ki := X_ki α_ki N_c(V_ki).

Also, let U_ki := X_ki N_c(Q_ki)⁺ and W := N_c(S(f_0)), where S(f_0) = [0, 1] × [0, 1/2]. Then S(f_k) is the union, disjoint up to one-dimensional boundary line segments, of S(f_0) and of those Q_rj and V_rj with r < k such that A_rj holds. So

(12.4.12) N_c(S(Φ_L)) = S_{Q,L} + S_{V,L} + W.
From the definitions, we can see that the set where K(ω, t) ≤ k is B^{(k−1)} measurable for each t, and thus so is each X_ki. Also, Q_ki is a B^{(k−1)}-measurable random set, disjoint from G(k − 1). Since P(Q_ki) is fixed, by Lemma 12.3.5, N_c(Q_ki)⁺, a function of N_c(Q_ki), is independent of B^{(k−1)} and thus of X_ki. Similarly, α_ki is independent of X_ki by Lemma 12.3.5. Both are measurable with respect to B_{k,i} = B_{G(k,i)}^{(k,i)}, while V_ki is disjoint from G(k, i), and P(V_ki) is a constant. So by Lemma 12.3.5 again, N_c(V_ki) is independent of B_{G(k,i)}, and the three variables N_c(V_ki), α_ki, and X_ki are jointly independent.

Since L ≤ (log c)/9, in (12.4.7) the parameters of the Poisson variables for k < L are ≥ cs/(6·4^L) ≥ 1 for c large enough. By (12.4.9) and the Tonelli-Fubini theorem we have, for δ ≤ 1/4 and k = 0, 1, ..., L − 1, that

(12.4.13) 1/2 ≤ 1 − 2 exp(−1/(2δ²)) ≤ Pr(K > k) = 2^{−k} Σ_{i=1}^{2^k} E X_ki.

Let χ_ki := N_c(Q_ki)⁺ and p_ki := Pr(X_ki = 0). Then by (12.4.13),

Σ_{i=1}^{2^k} p_ki ≤ 2^{k+1} exp(−1/(2δ²)).

For each k and i, E[(1 − X_ki)χ_ki] ≤ p_ki^{1/2}(E χ_ki²)^{1/2}. Thus by (12.4.7), setting
S_1 := Σ_{k=0}^{L−1} Σ_{i=1}^{2^k} (1 − X_ki)χ_ki, we have

E S_1 ≤ Σ_{k=0}^{L−1} [2^{k+1} exp(−1/(2δ²))]^{1/2} [2^k cs/(6·4^k)]^{1/2}
≤ 6^{−1/2} L (cs)^{1/2} exp(−1/(4δ²))
≤ (δ² log c)(cs)^{1/2} exp(−1/(4δ²)),

so

(12.4.14) E S_1 ≤ c^{1/2}(log c)^{3/4} δ² exp(−1/(4δ²)).
Letting S_2 := Σ_{k=0}^{L−1} Σ_{i=1}^{2^k} χ_ki, we have, by Lemma 12.4.6 and since E N_c(Q_ki)⁺ = E|N_c(Q_ki)|/2 by symmetry,

E S_2 ≥ Σ_{k=0}^{L−1} 2^{k−1} c_0 (cs/(6·4^k))^{1/2} = 24^{−1/2} c_0 L (cs)^{1/2} ≥ c_1 δ² c^{1/2} (log c)^{3/4},
where c_1 := c_0/10. By (12.4.14) and Markov's inequality, we have for any a > 0,

(12.4.15) Pr(S_1 > a δ² c^{1/2}(log c)^{3/4}) ≤ a^{−1} exp(−1/(4δ²)).
If j < k, or j = k and i < r, then χ_ji is measurable for B_{G(k,r−1)}, while by Lemma 12.3.5, χ_kr is independent of it. Thus all the variables χ_ji are independent, and by (12.4.7),

Var(S_2) = Σ_{j=0}^{L−1} Σ_{i=1}^{2^j} Var(χ_ji) ≤ Σ_{j=0}^{L−1} 2^j cs/(6·4^j) < cs = c/(log c)^{1/2}.

Thus S_2 has standard deviation ≤ c^{1/2}/(log c)^{1/4}, and

(12.4.16) S_2 − E S_2 = o_p(E S_2) as c → ∞

by Chebyshev's inequality.
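Explicitly, with Var(S_2) ≤ c(log c)^{−1/2} and E S_2 ≥ c_1 δ² c^{1/2}(log c)^{3/4} as above, Chebyshev's inequality gives, for any ε > 0,

```latex
\Pr\bigl(|S_2 - E S_2| > \varepsilon\, E S_2\bigr)
 \;\le\; \frac{\operatorname{Var}(S_2)}{\varepsilon^2 (E S_2)^2}
 \;\le\; \frac{c(\log c)^{-1/2}}{\varepsilon^2 c_1^2 \delta^4\, c\,(\log c)^{3/2}}
 \;=\; \frac{1}{\varepsilon^2 c_1^2 \delta^4 (\log c)^2} \;\to\; 0,
```

which is (12.4.16).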
To find the covariance Cov(V̂_ji, V̂_kr) of two different terms of S_{V,L} in (12.4.11), we can again assume j < k, or j = k and i < r. Let Y_uv := N_c(V_uv) for each u and v. Since E V̂_uv = 0, we need to find

E_jikr := E(X_ji α_ji Y_ji X_kr α_kr Y_kr) = E(X_ji N_c(W_ji) X_kr α_kr Y_kr),

where W_ji := V_ji on A_ji and W_ji := ∅ otherwise. Then W_ji and V_kr are measurable random sets disjoint from the stopping set G(k, r), and each
of the three random sets has finitely many possible values. Thus Lemma 12.3.5
applies to G = G(k, r), A = W_ji, and B = V_kr. Also, A ∩ B = ∅. Thus we have

E_jikr = E[E(X_ji N_c(W_ji) X_kr α_kr Y_kr | B_{k,r})]
= E[X_ji X_kr α_kr E(N_c(W_ji) N_c(V_kr) | B_{k,r})]
= E[X_ji X_kr α_kr E(N_c(W_ji) | B_{k,r}) E(N_c(V_kr) | B_{k,r})] = 0,

since the last conditional expectation is 0, and E(V̂_ji V̂_kr) = 0 = E(V̂_ji) = E(V̂_kr). So the terms are orthogonal, and
Var(S_{V,L}) = Σ_{j=0}^{L−1} Σ_{i=1}^{2^j} Var(V̂_ji) ≤ Σ_{j=0}^{L−1} 2^j E((N_c(V_j1))²) ≤ Σ_{j=0}^{L−1} 2^{j+1} cs/(12·4^j) ≤ cs.

Also, for W in (12.4.12), Var(W) ≤ c. Thus by Chebyshev's inequality, S_{V,L} = o_p(E S_2) and W = o_p(E S_2) as c → ∞. Note that we have S_{Q,L} = S_2 − S_1. Then by (12.4.12) and (12.4.16),

N_c(S(Φ_L)) = E S_2 + (S_2 − E S_2) − S_1 + S_{V,L} + W = E S_2 − S_1 + o_p(E S_2).

Taking a := c_1/3 in (12.4.15), and then δ > 0 small enough so that (3/c_1) exp(−1/(4δ²)) < ε/2, we have for c large enough that

Pr{N_c(S(Φ_L)) > γ c^{1/2}(log c)^{3/4}} ≥ 1 − ε, where γ := c_1 δ²/3.

Since Φ_L ∈ M_1, the conclusion for N_c follows.

Now suppose the theorem fails for the centered Poisson process Y_c, for some ε > 0. Then for arbitrarily small δ > 0,

Pr[‖Y_c‖_{C(1,2,2)} < δ c^{1/2}(log c)^{3/4}/2] > ε.

Then, subtracting two independent versions of Y_c to get an N_c, we have

Pr[‖N_c‖_{C(1,2,2)} < δ c^{1/2}(log c)^{3/4}] > ε²,
a contradiction. The ν_n case follows from Lemma 12.3.4.
12.5 Proof of Theorem 12.4.1
Proof First, let's take care of measurability properties. For C(α, K, d), let α = β + γ, where 0 < γ ≤ 1 and β is an integer ≥ 0. The set C(α, K, d) of functions is compact in the ‖·‖_β norm: it is totally bounded by the Arzelà-Ascoli theorem (RAP, Theorem 2.4.7) and closed, since uniform (or pointwise) convergence of functions g preserves a Hölder condition |G(x) − G(y)| ≤ K|x − y|^γ. Recall that for x ∈ R^d we let x^{(d)} := (x_1, ..., x_{d−1}). The set {(x, g) : 0 ≤ x_d ≤ g(x^{(d)}), g ∈ C(α, K, d)} is compact, for the ‖·‖_β norm on g. So the class C(α, K, d) is image admissible Suslin.

Measurability for the lower layer case C = L_d follows as in the case d = 2, treated soon after the statement of Theorem 12.4.1. For the case C = C_3 of convex sets in R³, for any F ⊂ {X_1, ..., X_n} there is a smallest convex set including F, its convex hull (a polyhedron). The maximum of P_n − P will be found by maximizing over 2^n − 1 such sets. Conversely, to maximize (P − P_n)(C), or equivalently P(C), over all closed convex sets C such that C ∩ {X_1, ..., X_n} = F, recall that a closed convex set is an intersection of closed half-spaces (RAP, Theorem 6.2.9). If |F| = k, it suffices in the present case to take an intersection of no more than n − k closed half-spaces, one to exclude each X_j ∉ F, j ≤ n. Thus ‖P_n − P‖_C = ‖P_n − P‖_D for a finite-dimensional class D of sets for which an image admissible Suslin condition holds. It follows that the norms ‖Y_λ‖_C and ‖ν_n‖_C are measurable for the classes
C in the statement.

Theorem 12.4.1 for Y_λ implies it for ν_n by Lemma 12.3.4. By assumption, d ≥ 2. Let J := [0, 1), so that J^{d−1} = {x ∈ R^{d−1} : 0 ≤ x_j < 1 for j = 1, ..., d − 1}.

Let f_1 be a C^∞ function on R such that f_1(t) = 0 for t ≤ 0 or t ≥ 1, for some K with 0 < K < 1, f_1(t) = K for 1/3 ≤ t ≤ 2/3, and 0 < f_1(t) < K for t ∈ (0, 1/3) ∪ (2/3, 1). Specifically, we can let f_1(t) := (g * h)(t) = ∫ g(t − y)h(y) dy, where h := 1_{[1/6,5/6]} and g is a C^∞ function with g(t) > 0 for |t| < 1/6 and g(t) = 0 for |t| ≥ 1/6. For x ∈ R^{d−1} let f(x) := f_1(x_1) f_1(x_2) ··· f_1(x_{d−1}). Then f is a C^∞ function on R^{d−1} with f(x) = 0 outside the unit cube J^{d−1}, 0 < f(x) ≤ γ for x in the interior of J^{d−1}, and f = γ on the subcube [1/3, 2/3]^{d−1}, where 0 < γ := K^{d−1} < 1. Taking a small enough positive multiple of f_1, we can assume that sup_x sup_{|p|
Let B_ji be a cube of side 3^{−j−1}, concentric with and parallel to C_ji. Let x_ji be the point of C_ji closest to 0 (a vertex) and y_ji the point of B_ji closest to 0. For δ > 0 and j = 1, 2, ..., let c_j := Cj^{−1}(log(j + 1))^{−1−2δ}, where the constant C > 0 is chosen so that Σ_{j≥1} c_j ≤ 1. For x ∈ R^{d−1} let

f_ji(x) := c_j 3^{−j(d−1)} f(3^j(x − x_ji)),

g_ji(x) := c_j 3^{−(j+1)(d−1)} f(3^{j+1}(x − y_ji)).

Then f_ji(x) > 0 on the interior of C_ji while f_ji(x) = 0 outside C_ji, and likewise for g_ji and B_ji. Note that

(12.5.1) f_ji(x)/g_ji(x) ≥ 3^{d−1} > 1 for all x ∈ B_ji.
Let S_0 := 1/2. Sequences of random variables s_ji = ±1 and random functions on J^{d−1},

(12.5.2) S_k := S_0 + Σ_{j=1}^{k} Σ_{i=1}^{3^{j(d−1)}} s_ji(ω) f_ji,

will be defined recursively. Then since d ≥ 2, we have Σ_{j≥1} c_j 3^{−j(d−1)} ≤ 1/3, so 0 < S_k < 1 for all k. Given j ≥ 1 and S_{j−1}, let

(12.5.3) D_ji = D_ji(ω) := {x ∈ J^d : |x_d − S_{j−1}(x^{(d)})(ω)| ≤ g_ji(x^{(d)})}.
Let s_ji(ω) := +1 if Y_λ(D_ji(ω)) ≥ 0; otherwise s_ji(ω) := −1. This finishes the recursive definition of the s_ji and the S_k. By induction, each D_ji has only finitely many possible values, each on a measurable event, so the s_ji and S_k are all measurable random variables. Since the cubes C_ji do not overlap, and all derivatives D^p f_ji are 0 on the boundary of C_ji, it follows for any k ≥ 1 that

(12.5.4) sup_{|p|≤d−1} sup_x |D^p S_k(x)| ≤ Σ_{j=1}^{k} sup_{|p|≤d−1} sup_x |D^p f_ji(x)| ≤ Σ_{j=1}^{k} c_j ≤ 1.
The volume of D_ji(ω) is always

(12.5.5) P(D_ji(ω)) = 2 ∫_{B_ji} g_ji dx = 2μ c_j / 9^{(j+1)(d−1)},

where 0 < μ := ∫_{J^{d−1}} f(x) dx < 1.
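The middle equality in (12.5.5) is just the change of variables u = 3^{j+1}(x − y_ji) in R^{d−1}:

```latex
\int g_{ji}\,dx
  \;=\; c_j\,3^{-(j+1)(d-1)} \int f\bigl(3^{\,j+1}(x - y_{ji})\bigr)\,dx
  \;=\; c_j\,3^{-(j+1)(d-1)}\cdot 3^{-(j+1)(d-1)} \int_{J^{d-1}} f(u)\,du
  \;=\; \mu\, c_j\, 9^{-(j+1)(d-1)},
```

and the factor 2 comes from D_ji being the band of vertical half-width g_ji around the graph of S_{j−1}.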
Next, it will be shown that the D_ji(ω) for different i, j are disjoint. For 1 ≤ j ≤ k and any ω, i, and x ∈ B_ji,

|S_k(x) − S_{j−1}(x)|(ω) ≥ γ c_j 3^{−j(d−1)} − Σ_{r>j} γ c_r 3^{−r(d−1)} ≥ γ c_j 3^{−j(d−1)}(1 − 1/2) = γ c_j 3^{−j(d−1)}/2 ≥ sup_y (g_ji + sup_r g_{k+1,r})(y).

Thus if s_ji(ω) = +1, then for any r and any x ∈ B_ji,

S_k(x)(ω) − g_{k+1,r}(x) ≥ S_{j−1}(x)(ω) + g_ji(x).

It follows that D_ji(ω) is disjoint from D_{k+1,r}(ω). They are likewise disjoint if s_ji(ω) = −1, by a symmetrical argument. For the same j and different i, the sets D_ji(ω) are disjoint, since the projection x ↦ x^{(d)} from R^d onto R^{d−1} takes them into disjoint sets B_ji.
Given λ > 0, let r = r(λ) be the largest j, if one exists, such that 2λμ c_j ≥ 9^{(j+1)(d−1)}. Then

(12.5.6) r(λ) ~ (log λ)/((d − 1) log 9) as λ → +∞.
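To see (12.5.6), take logarithms in the defining inequality 2λμ c_j ≥ 9^{(j+1)(d−1)}; with c_j = Cj^{−1}(log(j+1))^{−1−2δ} this reads

```latex
\log(2\lambda\mu C) - \log j - (1+2\delta)\log\log(j+1) \;\ge\; (j+1)(d-1)\log 9 .
```

Since the j-dependent terms on the left grow only logarithmically, the largest such j satisfies r(λ)(d − 1) log 9 = log λ + O(log log λ), which gives (12.5.6).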
Let G_0(ω) := ∅. For m = 1, 2, ..., let

G(m)(ω) := G_m(ω) := ⋃{D_ji(ω) : j ≤ m}.

Let H_m(ω) be the subgraph of S_m, H_m(ω) := {x : 0 ≤ x_d ≤ S_m(x^{(d)})(ω)}. Then for all ω, by (12.5.4), H_m ∈ C(d − 1, 1, d). Let

A_m(ω) := H_m(ω) \ G_m(ω).

From the disjointness proof,

(12.5.7) H_m(ω) ∩ G_m(ω) = ⋃{D_ji(ω) : j ≤ m, s_ji = +1}.
For each m, each of the above sets has finitely many possible values, each on a measurable event. Each G_j is easily seen to be a stopping set, while A_j is B_{G(j)} measurable and, for each ω, A_j(ω) is disjoint from G_j(ω). So the hypotheses of Lemma 12.3.5 hold. Thus, conditional on B_{G(j)}, Y_λ(A_j) is centered Poisson with parameter λP(A_j(ω)). Also, for u⁺ := max(u, 0), by (12.5.7),

(12.5.8) Y_λ((H_m ∩ G_m)(ω)) = Σ_{j=1}^{m} Σ_{i=1}^{3^{j(d−1)}} Y_λ(D_ji)⁺.
Now, P(D_ji) does not depend on ω nor on i, and D_ji(·) is B_{G(j−1)} measurable. Thus by Lemma 12.3.5, applied to A := D_mi, replacing G_m by G_{m−1} ∪ ⋃_{r<i} D_mr, the variables Y_λ(D_ji) are jointly independent, with λP(D_ji) ≥ 1 for j ≤ r(λ), by (12.5.5). So by Lemma 1.3.14, for a constant c > 0,

λ^{−1/2} E Y_λ((H_{r(λ)} ∩ G_{r(λ)})(ω))
≥ c Σ_{j=1}^{r(λ)} Σ_{i=1}^{3^{j(d−1)}} P(D_ji)^{1/2}
= c Σ_{j=1}^{r(λ)} 3^{j(d−1)} (2μ c_j / 9^{(j+1)(d−1)})^{1/2}   (by (12.5.5))
= c 3^{1−d}(2μ)^{1/2} Σ_{j=1}^{r(λ)} (Cj^{−1}(log(j + 1))^{−1−2δ})^{1/2}   (by definition of c_j)
= c (2μC)^{1/2} 3^{1−d} Σ_{j=1}^{r(λ)} j^{−1/2}(log(j + 1))^{−0.5−δ}
≥ c (2μC)^{1/2} 3^{1−d} (log(r(λ) + 1))^{−0.5−δ} Σ_{j=1}^{r(λ)} j^{−1/2}
≥ 2 a_d (r(λ)^{1/2} − 1)(log(r(λ) + 1))^{−0.5−δ}
for some constant a_d > 0. For λ large, by (12.5.6), the latter expression is ≥ 3 b_d (log λ)^{1/2}(log log λ)^{−0.5−δ} for some b_d > 0.

By independence of the variables Y_λ(D_ji) and (12.5.5), Y_λ(H_{r(λ)} ∩ G_{r(λ)})(ω) has variance less than

Σ_{j=1}^{r(λ)} Σ_{i=1}^{3^{j(d−1)}} λ P(D_ji) = λ Σ_{j=1}^{r(λ)} 3^{j(d−1)} · 2μ c_j 9^{−(j+1)(d−1)} ≤ λ.

Thus by Chebyshev's inequality, as λ → +∞,

Pr{λ^{−1/2} Y_λ(H_{r(λ)} ∩ G_{r(λ)})(ω) > 2 b_d (log λ)^{1/2}(log log λ)^{−0.5−δ}} → 1.
As shown around (12.5.8), Y_λ(A_{r(λ)}(ω))(ω) has, given B_{G(r(λ))}, the conditional distribution of Y_λ(D) for P(D) = P(A_{r(λ)}(ω)), where since Y_λ is a centered Poisson process, E Y_λ(D)² ≤ λ for all D. Thus E Y_λ(A_{r(λ)})² ≤ λ, so Y_λ(A_{r(λ)})/λ^{1/2} is bounded in probability, and

Pr{Y_λ(H_{r(λ)}) > b_d (λ log λ)^{1/2}(log log λ)^{−0.5−δ}} → 1

as λ → +∞. This proves Theorem 12.4.1 for Y_λ, since given the result for K = 1, one can let γ(d, K, δ) := K γ(d, 1, δ). For ν_n, Lemma 12.3.4 then applies. This completes the proof for classes C(d − 1, K, d). For convex sets in R³, see Dudley (1982, Theorem 4, proof in Section 5). Lower layers were treated in Section 12.4.
Problems

1. Find a lower bound for the constant γ(d, α) in the proof of Theorem 12.1.1. Hint: γ(d, α) ≥ 3^{−d−α} P(f)/‖f‖_α.

2. In the proof of Theorem 12.1.2, show that r can be taken as n + 1 for the largest n = 0, 1, ... such that m(n) = 0.

3. From Theorem 12.1.3 for d = 3, show that C_3 cannot be a Donsker class for the uniform distribution on the unit cube I³. Hint: For a pregaussian class F, P(‖G_P‖_F < ε) > 0 for all ε > 0 (Theorem 2.5.5(b)).

4. What should replace n^ζ(log n)^η in Theorem 12.2.1 if 0 < ζ < 1? Hint: any sequence a_n → +∞.

5. To show that Poisson processes are well-defined, show that if X and Y are independent real random variables with L(X) = P_a and L(Y) = P_b, then L(X + Y) = P_{a+b}.

6. Let P be a law on R and U_c the Poisson process with intensity measure cP for some c > 0. For a given y > 0, let X be the least x ≥ 0 such that U_c([−x, x]) ≥ y. Show that the interval [−X, X] is a stopping set.
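For Problem 5, the identity follows from the binomial theorem: Σ_j P_a(j)P_b(n − j) = e^{−(a+b)}(a + b)^n/n!. A quick numerical check (the parameter values are arbitrary):

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    return exp(-lam) * lam ** k / factorial(k)

def convolved_pmf(a, b, n):
    """P(X + Y = n) for independent X ~ Poisson(a), Y ~ Poisson(b)."""
    return sum(poisson_pmf(a, j) * poisson_pmf(b, n - j) for j in range(n + 1))

a, b = 1.3, 2.1
for n in range(12):
    assert abs(convolved_pmf(a, b, n) - poisson_pmf(a + b, n)) < 1e-12
```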
Notes

Notes to Section 12.1. Bakhvalov (1959) proved Theorem 12.1.1. W. Schmidt (1975) proved Theorem 12.1.3. Theorem 12.1.2, proved by the same method as theirs, is also in Dudley (1982, Theorem 1).

Notes to Section 12.2. Walter Philipp (unpublished) proved Theorem 12.2.1 for ζ = 1, with a refinement (η = 1 and a power of log log n factor). The proof for ζ > 1 follows the same scheme. Theorem 12.2.1 is Theorem 2 of Dudley (1982).
Notes to Section 12.3. Lemma 12.3.1 is classical; see, for example, Feller (1971, VIII.8, Lemma 2). Lemma 12.3.2, and with it the main idea of Poissonization, are due to Kac (1949). Pyke (1968) gives relations between Poissonized and non-Poissonized cases along the line of Lemma 12.3.4. Evstigneev (1977, Theorem 1) proves a Markov property for random fields indexed by closed subsets of a Euclidean space, somewhat along the line of Lemma 12.3.5. This entire section is a revision of Section 3 of Dudley (1982), with proofs of Lemmas 12.3.1 and 12.3.2 supplied here.
Notes to Section 12.4. Theorem 12.4.1 was proved in Dudley (1982) for all d ≥ 2. Another proof was given for lower layers and d = 2 in Dudley (1984). Shor (1986) discovered the larger lower bound with (log n)^{3/4} in expectation. He also contributed to a later version of the proof (Coffman and Lueker, 1991, pp. 57-64). Shor gave the definition of the f_i and g_i. Theorem 12.4.2 here shows that the lower bound holds not only in expectation, but with probability going to 1, by the methods of Dudley (1982, 1984). I thank J. Yukich for providing very helpful advice about this section.

Note to Section 12.5. The proof is from Dudley (1982).
References

Bakhvalov, N. S. (1959). On approximate calculation of multiple integrals. Vestnik Moskov. Univ. Ser. Mat. Mekh. Astron. Fiz. Khim. 1959, no. 4, 3-18.

Coffman, E. G., and Lueker, G. S. (1991). Probabilistic Analysis of Packing and Partitioning Algorithms. Wiley, New York.

Coffman, E. G., and Shor, P. W. (1991). A simple proof of the O(√n log^{3/4} n) upright matching bound. SIAM J. Discrete Math. 4, 48-57.

Dudley, R. M. (1982). Empirical and Poisson processes on classes of sets or functions too large for central limit theorems. Z. Wahrscheinlichkeitsth. verw. Gebiete 61, 355-368.

Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de St.-Flour. Lecture Notes in Math. (Springer) 1097, 1-142.

Evstigneev, I. V. (1977). "Markov times" for random fields. Theor. Probab. Appl. 22, 563-569; Teor. Veroiatnost. i Primenen. 22, 575-581.

Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. II, 2nd ed. Wiley, New York.

Kac, M. (1949). On deviations between theoretical and empirical distributions. Proc. Nat. Acad. Sci. USA 35, 252-257.
Leighton, T., and Shor, P. (1989). Tight bounds for minimax grid matching, with applications to the average case analysis of algorithms. Combinatorica 9, 161-187.

Pyke, R. (1968). The weak convergence of the empirical process with random sample size. Proc. Cambridge Philos. Soc. 64, 155-160.

Rhee, W. T., and Talagrand, M. (1988). Exact bounds for the stochastic upward matching problem. Trans. Amer. Math. Soc. 307, 109-125.

Schmidt, W. M. (1975). Irregularities of distribution IX. Acta Arith. 27, 385-396.

Shor, P. W. (1986). The average-case analysis of some on-line algorithms for bin packing. Combinatorica 6, 179-200.

Talagrand, M. (1994). Matching theorems and empirical discrepancy computations using majorizing measures. J. Amer. Math. Soc. 7, 455-537.
Appendix A Differentiating under an Integral Sign
The object here is to give some sufficient conditions for the equation

(A.1) (d/dt) ∫ f(x, t) dμ(x) = ∫ (∂f(x, t)/∂t) dμ(x).

Here x ∈ X, where (X, S, μ) is a measure space, and t is real-valued. The derivatives with respect to t will be taken at some point t = t_0. The function f will be defined for x ∈ X and t in an interval J containing t_0. Assume for the time being that t_0 is in the interior of J. Let's write

f_2(x, t_0) := ∂f(x, t)/∂t |_{t=t_0}.

Some assumptions are clearly needed for (A.1) even to make sense. A set F ⊂ L¹(X, S, μ) is called L¹-bounded iff sup{∫ |f| dμ : f ∈ F} < ∞. Here are some basic assumptions:

(A.2) The following two conditions hold:
(A.2a) for some δ > 0, the functions f(·, t) for |t − t_0| ≤ δ are L¹-bounded;
(A.2b) f_2(x, t_0) exists for μ-almost all x, and f_2(·, t_0) is in L¹(X, S, μ).
Here is an example to show that conditions (A.2a,b) are not sufficient:

Example. Let (X, μ) be [0, 1] with Lebesgue measure. Let J = [−1, 1], t_0 = 0. Let f(x, t) := 1/t if t > 0 and 0 < x ≤ t; otherwise let f(x, t) := 0. Then for all x ≠ 0, f(x, t) = 0 for t in a neighborhood of 0, so f_2(x, 0) = 0 for x ≠ 0. For each t > 0, ∫_0^1 f(x, t) dx = 1, while ∫_0^1 f(x, 0) dx = 0. So the function t ↦ ∫_0^1 f(x, t) dx is not even continuous, and so not differentiable, at t = 0, so (A.1) fails.
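The failure in this example can be illustrated numerically (a sketch; the midpoint rule and grid size are arbitrary choices):

```python
def f(x, t):
    # f(x, t) = 1/t if t > 0 and 0 < x <= t, else 0
    return 1.0 / t if (t > 0 and 0.0 < x <= t) else 0.0

def integral_in_x(t, n=100_000):
    # composite midpoint rule on [0, 1]
    h = 1.0 / n
    return sum(f((i + 0.5) * h, t) for i in range(n)) * h

# the integral jumps: 1 for every t > 0, but 0 at t = 0
assert abs(integral_in_x(0.1) - 1.0) < 1e-9
assert integral_in_x(0.0) == 0.0
# yet for each fixed x > 0, f(x, t) = 0 once t < x, so f2(x, 0) = 0
assert all(f(0.25, t) == 0.0 for t in (0.0, 0.01, 0.1, 0.2))
```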
The function f in the last example behaves badly near (0, 0). Another possibility has to do with behavior where x is unbounded, while f may be very
regular at finite points. Recall that a function f of two real variables x, t is called C^∞ if the partial derivatives ∂^{p+q}f/∂x^p ∂t^q exist and are continuous for all nonnegative integers p, q.

Example. For J = (−1, 1), X = R with Lebesgue measure μ, and t_0 = 0, there exists a C^∞ function f on X × J such that for 0 < t < 1, ∫_{−∞}^{∞} f(x, t) dx = 1, while for any x, f(x, t) = 0 for t in a neighborhood of 0 (depending on x), so f(x, 0) = f_2(x, 0) ≡ 0. Thus ∫_{−∞}^{∞} f_2(x, 0) dx = 0, while ∫_{−∞}^{∞} f(x, t) dx is not continuous (and so not differentiable) at 0, and (A.1) fails. To define such an f, let g be a C^∞ function on R with compact support, specifically g(x) := c exp(−1/(x(1 − x))) for 0 < x < 1 and g(x) := 0 elsewhere, where c is chosen to make ∫_{−∞}^{∞} g(x) dx = ∫_0^1 g(x) dx = 1. It can be checked (by L'Hospital's rule) that g and all its derivatives approach 0 as x ↓ 0 or x ↑ 1, so g is C^∞. Now let f(x, t) := g(x − t^{−1}) for t > 0, and f(x, t) := 0 otherwise. The stated properties then follow, since f(x, t) = 0 if t ≤ 0 or x ≤ 0 or 0 < t ≤ 1/x.

For interchanging two integrals, there is a standard theorem, the Tonelli-Fubini theorem (e.g., RAP, Theorem 4.4.5). But for interchanging a derivative and an integral there is apparently no such handy single theorem yielding (A.1). Some sufficient conditions will be given, beginning with rather general ones and going on to more special but useful conditions. A set F of integrable functions on (X, S, μ) is called uniformly integrable or u.i. if for every ε > 0 there exist both

(A.3)
a set A with μ(A) < ∞ such that ∫_{X\A} |f| dμ < ε for all f ∈ F, and

(A.4) an M < ∞ such that ∫_{|f|>M} |f| dμ < ε for all f ∈ F.

In the presence of (A.3), condition (A.4) is equivalent to:

(A.5) F is L¹-bounded, and for any ε > 0 there is a δ > 0 such that for any C ∈ S with μ(C) < δ and any f ∈ F, ∫_C |f| dμ < ε.
The best-known sufficient condition for uniform integrability is known as domination. The following is straightforward to prove, via (A.4):
A.6 Theorem If f ∈ L¹(X, S, μ) and f ≥ 0, then {g ∈ L¹(X, S, μ) : |g| ≤ f} is uniformly integrable.
Here |g| ≤ f means |g(x)| ≤ f(x) for all x ∈ X. But domination is not necessary for uniform integrability, as the following shows:

Example. Let (X, S, μ) be [0, 1] with Lebesgue measure. Let g(t, x) := |t − x|^{−1/2}, 0 ≤ t ≤ 1, 0 ≤ x ≤ 1. Then the set of all functions g(t, ·), 0 ≤ t ≤ 1, is uniformly integrable but not dominated.

Uniform integrability provides a general sufficient condition for (A.1). Let

Δ_h f(x, t_0) := f(x, t_0 + h) − f(x, t_0).

A.7 Theorem Assume (A.2) and that for some neighborhood U of 0,

(A.8) the functions Δ_h f(·, t_0)/h for 0 ≠ h ∈ U are uniformly integrable.

Then (A.1) holds, and moreover

(A.9) ∫ |Δ_h f(x, t_0)/h − f_2(x, t_0)| dμ(x) → 0 as h → 0.

Conversely, (A.9) and (A.2) imply (A.8).
Proof (A.2) implies that the difference quotients Δ_h f(x, t_0)/h converge pointwise for almost all x, as h → 0, to ∂f(x, t)/∂t|_{t=t_0}. Pointwise convergence and uniform integrability (A.8) imply L¹ convergence (A.9), as was proved in RAP (Theorem 10.3.6) for a sequence of functions on a probability space. The implication holds in the case here since:

(a) by (A.3), we can assume that, up to a difference of at most ε in all the integrals, the functions are defined on a finite measure space, which reduces by a constant multiple to a probability space;
(b) if (A.9) fails, then it fails along some sequence h = h_n → 0, so a proof for sequences is enough.

Thus (A.9) follows from the cited fact.

Now suppose (A.9) and (A.2) hold. Then given ε > 0, there is a set A of finite measure with ∫_{X\A} |f_2(x, t_0)| dμ(x) < ε/2, and for h small enough, ∫ |Δ_h f(x, t_0)/h − f_2(x, t_0)| dμ(x) < ε/2, so (A.3) holds for the functions Δ_h f/h, h ∈ U, as desired. For sequences of integrable functions on a probability space, convergence in L¹ implies uniform integrability (RAP, Theorem 10.3.6). Here again, we can reduce to the case of probability spaces, and if (A.8) fails for a given ε, then it fails along some sequence, contradicting the cited fact.
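As an illustration, the family g(t, ·) = |t − ·|^{−1/2} from the example above satisfies (A.4) with explicit constants: {g(t, ·) > M} = {x : |t − x| < M^{−2}}, and ∫_0^a u^{−1/2} du = 2a^{1/2} gives the exact tail integral used below (a small sketch; the grid of t values is arbitrary):

```python
from math import sqrt

def tail_integral(t, M):
    """Exact value of the integral of g(t,x) = |t-x|**(-1/2) over
    {x in [0,1] : g(t,x) > M} = {|t-x| < M**-2}, using the
    antiderivative: integral of u**(-1/2) on [0,a] equals 2*sqrt(a)."""
    d = M ** -2
    return 2 * sqrt(min(t, d)) + 2 * sqrt(min(1 - t, d))

for M in (10, 100, 1000):
    assert all(tail_integral(i / 100, M) <= 4 / M + 1e-12 for i in range(101))
```

So the tail integrals are at most 4/M, uniformly in t, which is (A.4). By contrast, no dominating function exists: sup_{0≤t≤1} g(t, x) = +∞ for every x, since t may be taken equal to x.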
But given (A.2), the uniform integrability condition (A.8), necessary for (A.9), is not necessary for (A.1), as is shown by a modification of a previous example:

Example. Let (X, μ) = [−1, 1] with Lebesgue measure, J = [−1, 1], and t_0 = 0. If t > 0, let f(x, t) = 1/t if 0 < x ≤ t, f(x, t) = −1/t if −t ≤ x < 0, and f(x, t) = 0 otherwise. Then ∫_{−1}^{1} f(x, t) dx = 0 for all t, and ∫_{−1}^{1} f_2(x, 0) dx = 0 since f_2(x, 0) = 0, so (A.1) holds.

Domination of the difference quotients by an integrable function gives the classical "Weierstrass" sufficient condition for (A.1), which follows directly from Theorems A.6 and A.7:

A.10 Corollary If (A.2) holds and for some neighborhood U of 0 there is a g ∈ L¹(X, S, μ) such that |Δ_h f(x, t_0)/h| ≤ g(x) for 0 ≠ h ∈ U, then (A.1) holds.
Corollary A.10 is highly useful, but it is not as directly applicable as the Tonelli-Fubini theorem on interchanging integrals. In some cases, even when a dominating function g ∈ L¹ exists, it may not be easy to choose such a g for which integrability can be proved conveniently. We can take the neighborhoods U to be sets {h : |h| < δ}, for δ > 0. For a given δ > 0, the smallest possible dominating function would be

g_δ(x) := sup{|Δ_h f(x, t_0)/h| : 0 < |h| < δ}.

If (X, S) can be taken to be a complete separable metric space with its σ-algebra of Borel sets, and if f is jointly measurable, then each g_δ is measurable for the completion of μ (RAP, Section 13.2). If so, then the hypothesis of Corollary A.10, beside (A.2), is equivalent to saying that

(A.11) for some δ > 0, g_δ ∈ L¹(μ).

So far, nothing required the partial derivatives f_2(x, t) to exist for t ≠ t_0, but if they do, we get other sufficient conditions:

A.12 Corollary Suppose (A.2) holds, f(·, ·) is jointly measurable, and there is a neighborhood U of t_0 such that for all t ∈ U, f_2(x, t) exists for almost all x, and the functions f_2(·, t) for t ∈ U are uniformly integrable. Suppose also that for t ∈ U, f(x, t) − f(x, t_0) = ∫_{t_0}^{t} f_2(x, s) ds for μ-almost all x. Then (A.8), and so (A.9) and (A.1), hold.
Proof The functions f_2(·, t) for t ∈ U, being uniformly integrable, are L¹-bounded. We can take U to be a bounded interval. Then

∫_U ∫ |f_2(x, t)| dμ(x) dt < ∞.

Now for |h| small enough,

|Δ_h f(x, t_0)/h| = |h|^{−1} |∫_0^h f_2(x, t_0 + u) du| ≤ |h|^{−1} |∫_0^h |f_2(x, t_0 + u)| du|.

Thus the difference quotients Δ_h f(·, t_0)/h are also L¹-bounded for h in a neighborhood of 0. Condition (A.3) for the functions f_2(·, t) implies it for the functions Δ_h f(·, t_0)/h. Also, condition (A.5) for the functions f_2(·, t_0 + u) directly implies (A.5) and (A.8) for the functions Δ_h f(x, t_0)/h, as desired.

Three clear consequences of Corollary A.10 will be given.
A.13 Corollary Suppose f(x, t) ≡ G(x, t)H(x) for measurable functions G(·, ·) and H, where (A.2) holds for f. Suppose that for h in a neighborhood of 0, |Δ_h G(x, t_0)/h| ≤ α(x) for a function α such that ∫ |α(x)H(x)| dμ(x) < ∞. Then (A.8), (A.9), and (A.1) hold.
A.14 Corollary Let G(x, t) ≡ φ(x − t), where φ is bounded and has a bounded, continuous first derivative, X = R, μ = Lebesgue measure, and H ∈ L¹(R). Then the convolution integral (φ * H)(t) = ∫_{−∞}^{∞} φ(t − x)H(x) dx has a first derivative with respect to t given by (φ * H)'(t) = ∫_{−∞}^{∞} φ'(t − x)H(x) dx. If the jth derivative φ^{(j)} is continuous and bounded for j = 0, 1, ..., n, then iterating gives (φ * H)^{(n)}(t) = ∫_{−∞}^{∞} φ^{(n)}(t − x)H(x) dx.
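Corollary A.14 can be illustrated numerically; here φ(x) = e^{−x²} and H(x) = e^{−|x|} are arbitrary choices satisfying the hypotheses, and the quadrature grid and tolerance are ad hoc:

```python
from math import exp

phi = lambda x: exp(-x * x)            # bounded, bounded continuous derivative
dphi = lambda x: -2 * x * exp(-x * x)  # phi'
H = lambda x: exp(-abs(x))             # integrable

def quad(F, a=-12.0, b=12.0, n=24_000):
    # composite midpoint rule; integrands here decay fast, so truncation is safe
    h = (b - a) / n
    return sum(F(a + (i + 0.5) * h) for i in range(n)) * h

conv = lambda t: quad(lambda x: phi(t - x) * H(x))        # (phi * H)(t)
conv_prime = lambda t: quad(lambda x: dphi(t - x) * H(x))  # (phi' * H)(t)

t, h = 1.0, 1e-4
finite_diff = (conv(t + h) - conv(t - h)) / (2 * h)
assert abs(finite_diff - conv_prime(t)) < 1e-6
```

The central finite difference of the convolution matches convolving with φ' directly, as the corollary asserts.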
A.15 Corollary Let f(x, t) = e^{itψ(x)}H(x), where ψ and H are measurable real-valued functions on X such that H and ψH are integrable. Then (A.8), (A.9), and (A.1) hold.

Proof Let G(x, t) := e^{itψ(x)}. For any real t, u, and h, |e^{i(t+h)u} − e^{itu}| ≤ |uh| (RAP, proof of Theorem 9.4.4). So |Δ_h G(x, t)/h| ≤ |ψ(x)|. Since (∂/∂t)(e^{itψ(x)}H(x)) = iψ(x)e^{itψ(x)}H(x) and |iψ(x)e^{itψ(x)}| = |ψ(x)|, conditions (A.2) hold and Corollary A.13 applies.
Note. In Corollary A.15, if X = R and ψ(x) ≡ x, we have Fourier transform integrals.
A.16 Proposition Let i be a measurable function of x and
rl(t) :=
J
et*(x)dµ(x) < 00
for all t in an open interval U. Then rl is an analytic function having a power series expansion in t in a neighborhood of any to E U. Derivatives of n of all orders can be found by differentiating under the integral,
ri(")(t) =
f(x)net'd/2(x) , n = 1, 2,
,
for all t E U.
Proof Take $t_0 \in U$. Replacing $\mu$ by $\nu$ where $d\nu(x) = \exp(t_0\psi(x))\,d\mu(x)$, we can assume $t_0 = 0$. Then for some $s > 0$, $\{t\colon |t| \le s\} \subset U$, and for $|t| \le s$, $e^{t\psi(x)} \le e^{s\psi(x)} + e^{-s\psi(x)}$, an integrable function. Thus $\eta(t) = \sum_{n=0}^{\infty} t^n \int \psi(x)^n\,d\mu(x)/n!$, where the series converges absolutely by dominated convergence, since the corresponding series for $e^{|t\psi(x)|}$ does and is a series of positive terms dominating those for $e^{t\psi(x)}$. The derivatives of $\eta$ at 0 satisfy $\eta^{(n)}(0) = \int \psi(x)^n\,d\mu(x)$: this holds either by the theorem that for a function $\eta$ represented by a power series $\sum_{n=0}^{\infty} c_n t^n$ in a neighborhood of 0, we must have $c_n = \eta^{(n)}(0)/n!$ (converse of Taylor's theorem), or by the methods of this Appendix as follows. For any real $y$, $|e^y - 1| \le |y|(e^y + 1)$ by the mean value theorem, since $e^t \le e^y + 1$ for $t$ between 0 and $y$. Thus for any real $u$ and $h$, $|(e^{uh} - 1)/h| \le |u|(e^{uh} + 1)$. For any $\delta > 0$, there is a constant $C = C_\delta$ such that $|u| \le C(e^{\delta u} + e^{-\delta u})$ for all real $u$. If $|h| \le \delta := s/2$, then since $e^{uh} \le e^{\delta u} + e^{-\delta u}$ and $e^{cu} + e^{-cu}$ is increasing in $u \ge 0$ for any real $c$, we have $|(e^{uh} - 1)/h| \le 3C_\delta(e^{su} + e^{-su})$.
Letting $u = \psi(x)$, we have domination by an integrable function, so Corollary A.10 applies for the first derivative. The proof extends to higher derivatives since $|\psi(x)|^n \le K e^{\delta|\psi(x)|}$ for some $K = K_{n,\delta}$.
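For a concrete instance of Proposition A.16, take $\mu$ to be the standard Gaussian law (our own choice, not from the text): then $\eta(t) = e^{t^2/2}$, and $\eta^{(n)}(0)$ is the $n$th Gaussian moment, 0 for odd $n$ and $(n-1)!!$ for even $n$. The power series from the proof can be summed and compared with the closed form:

```python
import math

# Illustration of Proposition A.16 for a case chosen here: mu is the
# standard Gaussian, so eta(t) = exp(t^2 / 2) and eta^{(n)}(0) equals the
# n-th Gaussian moment: 0 for odd n, (n-1)!! for even n.
def gaussian_moment(n):
    # E X^n for X ~ N(0,1)
    return 0 if n % 2 else math.prod(range(1, n, 2))

def eta_series(t, terms=60):
    # power series sum_n eta^{(n)}(0) t^n / n!  (converse of Taylor's theorem)
    return sum(gaussian_moment(n) * t ** n / math.factorial(n)
               for n in range(terms))

t = 0.7
series_val = eta_series(t)
closed_form = math.exp(t * t / 2)
```

The truncated series matches $e^{t^2/2}$ to machine precision, illustrating the absolute convergence established in the proof.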
Notes
Perhaps the most classical result on interchange of integral and derivative is attributed to Leibniz: if $f$ and $\partial f/\partial y$ are continuous on a finite rectangle $[a, b] \times [c, d]$, then $\int_a^b f(x, y)\,dx$ can be differentiated with respect to $y$ under the integral sign.
Let $\mathcal{F}$ be a family of real-valued continuous functions defined on a half-line, say $[1, \infty)$. A function $f$ in $\mathcal{F}$ will be said to have an (improper Riemann) integral if the integrals $\int_1^M f(x)\,dx$ converge to a finite limit as $M \to \infty$, which may then be called $\int_1^\infty f(x)\,dx$. This integral is analogous to the (conditional) convergence of the sum of a series, where Lebesgue integrability corresponds to absolute convergence. An example of a non-Lebesgue-integrable function having a finite improper Riemann integral is $(\sin x)/x$. The family $\mathcal{F}$ is called uniformly (improper Riemann) integrable if the convergence is uniform for $f \in \mathcal{F}$. For a continuous function $f$ of two real variables having a continuous partial derivative $\partial f/\partial y$, uniform improper Riemann integrability with respect to $x$, say on a half-line $[a, \infty)$, of $\partial f(x, y)/\partial y$ for $y$ in some interval, and finiteness of $\int_a^\infty f(x, y)\,dx$, imply that the latter integral can be differentiated with respect to $y$ under the integral sign; see, for example, Il'in and Pozniak (1982, Theorem 9.10).
Il'in and Pozniak (1982, Theorem 9.8) prove what they call Dini's test: if $f$ is nonnegative and continuous on $[a, \infty) \times [c, d]$, where $-\infty < c < d < +\infty$, and for each $y \in [c, d]$, $I(y) := \int_a^\infty f(x, y)\,dx < \infty$ and $I(\cdot)$ is continuous, then the $f(\cdot, y)$ for $c \le y \le d$ are uniformly (improper Riemann) integrable; in this case they are also uniformly integrable in our sense. Of course, the same holds if $f$ is nonpositive, which would apply to the derivative of a positive, decreasing function.
The Weierstrass domination condition, Corollary A.10, is classical and appears in most of the listed references. Hobson (1926, p. 355) states it (for functions of real variables) and gives references to several works published between 1891 and 1910.
Lang (1993) mentions the convolution case, Corollary A.14. Lang also treats differentiation of integrals $\int f(t, x)\,dt$ where $x \in E$, $f$ has values in $F$, and $E$, $F$ are Banach spaces. Il'in and Pozniak (1982, Theorem 9.5) and Kartashev and Rozhdestvenskii (1984) treat differentiation of integrals $\int_{a(y)}^{b(y)} f(x, y)\,dx$ with limits of integration depending on $y$.
Brown (1986, Chapter 2) is a reference for Proposition A.16 and some multidimensional extensions, for application to exponential families in statistics.
References

Brown, Lawrence D. (1986). Fundamentals of Statistical Exponential Families. Inst. Math. Statist. Lecture Notes-Monograph Ser. 9.
Hobson, E. W. (1926). The Theory of Functions of a Real Variable and the Theory of Fourier's Series, vol. 2, 2d ed. Repr. Dover, New York, 1957.
Il'in, V. A., and Pozniak, E. G. (1982). Fundamentals of Mathematical Analysis. Transl. from the 1980 4th Russian ed. by V. Shokurov. Mir, Moscow.
Kartashev, A. P., and Rozhdestvenskii, B. L. (1984). Matematicheskii Analiz (Mathematical Analysis; in Russian). Nauka, Moscow. French transl.: Kartachev, A., and Rojdestvenski, B., Analyse mathématique, transl. by Djilali Embarek. Mir, Moscow, 1988.
Lang, Serge (1993). Real and Functional Analysis, 3d ed. of Real Analysis. Springer, New York.
Appendix B Multinomial Distributions
Consider an experiment whose outcome is given by a point $\omega$ of a set $\Omega$ where $(\Omega, \mathcal{A}, P)$ is a probability space. Let $A_i$, $i = 1, \ldots, m$, be disjoint measurable sets with union $\Omega$. Independent repetition of the experiment $n$ times is represented by taking the Cartesian product $\Omega^n$ of $n$ copies of $(\Omega, \mathcal{A}, P)$ (RAP, Theorem 4.4.6). Let $X = \{X_i\}_{i=1}^n \in \Omega^n$ and $P_n(A) := n^{-1}\sum_{i=1}^n \delta_{X_i}(A)$, $A \in \mathcal{A}$, so $P_n$ is an empirical measure for $P$, and $nP_n(A_i)$ is the number of times the event $A_i$ occurs in the $n$ repetitions.
A random vector $\{n_j\}_{j=1}^m$ is said to have a multinomial distribution for $n$ observations, or with sample size $n$, and $m$ bins or categories, with probabilities
$\{p_j\}_{j=1}^m$, if $p_j \ge 0$, $p_1 + \cdots + p_m = 1$, and for any nonnegative integers $k_j$ with $k_1 + \cdots + k_m = n$,

(B.1) $\quad \Pr\{n_j = k_j,\ j = 1, \ldots, m\} = \dfrac{n!}{k_1!\,k_2!\cdots k_m!}\,p_1^{k_1}\cdots p_m^{k_m}.$
Otherwise, that is, if the $k_j$ are not nonnegative integers or don't sum to $n$, the probabilities are 0. Note that in a multinomial distribution, since the probability is positive only when $k_m = n - \sum_{i<m} k_i$, we have $n_m = n - \sum_{i<m} n_i$. Also, $p_m = 1 - \sum_{i<m} p_i$. So if $p_1, \ldots, p_{m-1}$ are nonnegative and their sum is less than or equal to 1, it also makes sense to say that the $\{n_j\}_{j<m}$ have a multinomial distribution for $m$ (not $m - 1$) bins and probabilities $\{p_j\}_{j<m}$ if $\Pr\{n_j = k_j,\ j < m\}$ is given by the right side of (B.1) whenever the $k_j$ are nonnegative integers whose sum is at most $n$, and where $k_m$ and $p_m$ are determined as just described.
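As a quick sketch of (B.1) with the standard library (the values of $n$ and the $p_j$ below are arbitrary choices of ours), the probabilities can be computed directly and checked to sum to 1 over all vectors of bin counts summing to $n$:

```python
from math import factorial
from itertools import product

# The multinomial pmf of (B.1); parameters are illustrative.
def multinomial_pmf(ks, ps):
    # Pr{n_j = k_j} = n!/(k_1! ... k_m!) p_1^{k_1} ... p_m^{k_m}
    coef = factorial(sum(ks))
    for k in ks:
        coef //= factorial(k)  # each intermediate quotient is an integer
    prob = float(coef)
    for k, p in zip(ks, ps):
        prob *= p ** k
    return prob

n, ps = 5, (0.2, 0.3, 0.5)
# probabilities over all (k_1, k_2, k_3) with k_1 + k_2 + k_3 = n sum to 1
total = sum(multinomial_pmf(ks, ps)
            for ks in product(range(n + 1), repeat=3) if sum(ks) == n)
```

The total equals $(p_1 + p_2 + p_3)^n = 1$ by the multinomial theorem.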
B.2 Theorem For any probability space $(\Omega, \mathcal{A}, P)$ and any disjoint measurable sets $A_i$, $i = 1, \ldots, m$, with union $\Omega$, $\{nP_n(A_i)\}_{i=1}^m$ have a multinomial distribution for $n$ observations and $m$ categories with probabilities $\{P(A_i)\}_{i=1}^m$. The same holds if we take $i = 1, \ldots, m - 1$.
Proof Use induction on $m$. For $m = 2$, $nP_n(A_1)$ has a binomial distribution: specifically, $nP_n(A_1)$ is the number of successes, and $nP_n(A_2)$ the number of failures, in $n$ independent trials with probability $p = P(A_1)$ of success on each trial. To evaluate $\Pr(nP_n(A_1) = k)$, for $k = 0, \ldots, n$, there are $\binom{n}{k}$ ways to choose $k$ of the $n$ trials. For each way, the probability that just these trials are successes is $p^k(1-p)^{n-k}$. Then $nP_n(A_2) = n - k$ and the stated result holds.
For the induction step from $m$ to $m + 1$, apply the case of $m$ bins to $A_1 \cup A_2, A_3, \ldots, A_{m+1}$. Then, given that $nP_n(A_1 \cup A_2) = k_1 + k_2$, the conditional probability that $n_1 = k_1$ and $n_2 = k_2$ is binomial. Checking that

$$\binom{k_1+k_2}{k_1}\left(\frac{p_1}{p_1+p_2}\right)^{k_1}\left(\frac{p_2}{p_1+p_2}\right)^{k_2}\cdot\frac{(p_1+p_2)^{k_1+k_2}}{(k_1+k_2)!} \;=\; \frac{p_1^{k_1}\,p_2^{k_2}}{k_1!\,k_2!}$$

finishes the proof.
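The identity used in the induction step can be confirmed numerically for sample values of $p_1$, $p_2$, $k_1$, $k_2$ (chosen arbitrarily here):

```python
from math import comb, factorial

# Check of the induction-step identity; parameter values are arbitrary.
p1, p2 = 0.25, 0.4
k1, k2 = 3, 5

# binomial conditional probability times the merged-bin multinomial factor
lhs = (comb(k1 + k2, k1)
       * (p1 / (p1 + p2)) ** k1
       * (p2 / (p1 + p2)) ** k2
       * (p1 + p2) ** (k1 + k2) / factorial(k1 + k2))
# the two-bin factor of the (m+1)-bin multinomial probability
rhs = p1 ** k1 * p2 ** k2 / (factorial(k1) * factorial(k2))
```

The powers of $p_1 + p_2$ and the factorials $(k_1 + k_2)!$ cancel, leaving the right side exactly.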
B.3 Theorem Suppose $\{n_j\}_{j=1}^m$ have a multinomial distribution for $n$ observations and $m$ bins with probabilities $\{p_j\}_{j=1}^m$. Let $T$ be a subset of $\{1, \ldots, m\}$. Then
(a) $\{n_i\}_{i\in T}$ have a multinomial distribution for $n$ observations and $\mathrm{card}(T) + 1$ bins, with probabilities $\{p_i\}_{i\in T}$.
(b) Let $S$ be a subset of $\{1, \ldots, m\}$ disjoint from $T$. Then the conditional distribution, or law, $\mathcal{L}(\{n_i\}_{i\in S} \mid \{n_j\}_{j\in T})$ depends on $\{n_j\}_{j\in T}$ only through $\sum_{j\in T} n_j$; in other words, for any integers $k_j \ge 0$, $j \in T$,

$$\mathcal{L}\bigl(\{n_i\}_{i\in S} \bigm| n_j = k_j \text{ for all } j \in T\bigr) \;=\; \mathcal{L}\Bigl(\{n_i\}_{i\in S} \Bigm| \sum_{j\in T} n_j = \sum_{j\in T} k_j\Bigr).$$

(c) If $S \cup T = \{1, \ldots, m\}$, this distribution is multinomial for $n - \sum_{j\in T} k_j$ observations in $\mathrm{card}(S)$ bins with probabilities $\bigl\{p_j/\sum_{i\in S} p_i\colon j \in S\bigr\}$.
Proof We can assume that for some $r$, $T = \{r+1, \ldots, m\}$. The probability that $n_j = k_j$ for $r < j \le m$ is a sum of multinomial probabilities,

$$\sum\Bigl\{\, n!\Bigl(\prod_{j>r} p_j^{k_j}/k_j!\Bigr) \prod_{j\le r} p_j^{K_j}/K_j! \;\colon\; K_j \text{ integers} \ge 0,\ K := K_1 + \cdots + K_r = n - \sum_{j>r} k_j \Bigr\}.$$

The sum $\sum\bigl(\prod_{j\le r} p_j^{K_j}/K_j!\bigr)$ equals $(p_1 + \cdots + p_r)^K/K!$ by the multinomial theorem, and (a) follows.
(b) We can assume $S \cup T = \{1, \ldots, m\}$ since the joint distribution of $n_i$, $i \in S$, is determined by that of $n_j$, $j \notin T$. Thus $S = \{1, \ldots, r\}$. We need to evaluate probabilities of the form

$$\Pr(n_i = k_i,\ i = 1, \ldots, r \mid n_j = k_j,\ j > r) = \Pr(n_i = k_i,\ i = 1, \ldots, m)/\Pr(n_j = k_j,\ j > r).$$

The numerator is a simple multinomial probability as in (B.1). The denominator is a multinomial probability by part (a). Carrying out the division, the $k_j!$ terms for $j > r$ cancel. The resulting expression, as claimed, depends on the $k_j$ for $j > r$ only through their sum, and (c) it is of the stated multinomial form, by the proof of (a).
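Part (a) can be checked numerically in the smallest nontrivial case, $m = 3$ and $T = \{1\}$ (all numbers below are illustrative choices): summing the joint probabilities over the other bin count recovers the binomial law of $n_1$, i.e., the multinomial law with the two bins $A_1$ and its complement.

```python
from math import comb, factorial
from itertools import product

# Check of Theorem B.3(a) for m = 3, T = {1}; parameters are illustrative.
def multi_pmf(ks, ps):
    coef = factorial(sum(ks))
    for k in ks:
        coef //= factorial(k)
    out = float(coef)
    for k, p in zip(ks, ps):
        out *= p ** k
    return out

n, ps = 6, (0.2, 0.3, 0.5)
k1 = 2
# marginal of n_1: sum the trinomial pmf over all values of n_2
marginal = sum(multi_pmf((k1, k2, n - k1 - k2), ps)
               for k2 in range(n - k1 + 1))
# binomial(n, p_1) probability of k1 successes
binom = comb(n, k1) * ps[0] ** k1 * (1 - ps[0]) ** (n - k1)
```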
Appendix C Measures on Nonseparable Metric Spaces
Let $(S, d)$ be a metric space. Under fairly general conditions, to be given, any probability measure on the Borel sets will be concentrated in a separable subspace. The problem will be reduced to one about discrete spaces.
An open cover of $S$ will be a family $\{U_\alpha\}_{\alpha\in I}$ of open subsets $U_\alpha$ of $S$, where $I$ is any set, here called an index set, such that $S = \bigcup_{\alpha\in I} U_\alpha$. An open cover $\{V_\beta\}_{\beta\in J}$ of $S$, with some index set $J$, will be called a refinement of $\{U_\alpha\}_{\alpha\in I}$ if for all $\beta \in J$ there exists an $\alpha \in I$ with $V_\beta \subset U_\alpha$. An open cover $\{U_\alpha\}_{\alpha\in I}$ will be called $\sigma$-discrete if $I$ is the union of a sequence of sets $I_n$ such that for each $n$ and $\alpha \ne \beta$ in $I_n$, $U_\alpha$ and $U_\beta$ are disjoint. Recall that the ball $B(x, r)$ is defined as $\{y \in S\colon d(x, y) < r\}$. For two sets $A$, $B$ we have $d(A, B) := \inf\{d(x, y)\colon x \in A,\ y \in B\}$, and $d(y, B) := d(\{y\}, B)$ for a point $y$.

C.1 Theorem For any metric space $(S, d)$, any open cover $\{U_\alpha\}_{\alpha\in I}$ of $S$ has an open $\sigma$-discrete refinement.
Proof For any open set $U \subset S$ let $U_n := \{y\colon B(y, 2^{-n}) \subset U\}$. Then $y \in U_n$ if and only if $d(y, S \setminus U) \ge 2^{-n}$, and $d(U_n, S \setminus U_{n+1}) \ge 2^{-n} - 2^{-n-1} = 2^{-n-1}$ by the triangle inequality. (The sets $U_n$ are always closed and not usually open.) Let $U_{\alpha,n} := (U_\alpha)_n$. By well-ordering (e.g., RAP, Section 1.5), we can assume that the index set $I$ is well-ordered by a relation $<$. For each positive integer $n$ and each $\alpha \in I$ let $W_{\alpha,n} := U_{\alpha,n} \setminus \bigcup\{U_{\beta,n+1}\colon \beta < \alpha\}$. For each $\alpha \ne \beta$ and each $n$, either $W_{\alpha,n} \subset S \setminus U_{\beta,n+1}$ if $\beta < \alpha$, or $W_{\beta,n} \subset S \setminus U_{\alpha,n+1}$ if $\alpha < \beta$. In either case, $d(W_{\alpha,n}, W_{\beta,n}) \ge 2^{-n-1}$. Let $V_{\alpha,n} := \{x\colon d(x, W_{\alpha,n}) < 2^{-n-3}\}$. Then $d(V_{\alpha,n}, V_{\beta,n}) \ge 2^{-n-2}$. The sets $V_{\alpha,n}$ are all open. For a given $n$ and different values of $\alpha$, they are disjoint. Let $J := \mathbb{N} \times I$ and define $V_\gamma$, $\gamma \in J$, by $V_{(n,\alpha)} := V_{\alpha,n}$. To show that $\{V_\gamma\}_{\gamma\in J}$ is a cover of $S$, given any $x \in S$, take the least $\alpha$ such that $x \in U_\alpha$. Then for $n$ large enough, $x \in U_{\alpha,n}$, and
then $x \in W_{\alpha,n} \subset V_{\alpha,n}$ as desired. So $\{V_\gamma\}_{\gamma\in J}$ is a $\sigma$-discrete refinement of $\{U_\alpha\}_{\alpha\in I}$.
Cardinal numbers are defined in RAP, in the last part of Appendix A (as smallest ordinals with given cardinality). A cardinal number $\kappa$ is said to be measurable if for a set $S$ of cardinality $\kappa$, there exists a probability measure $P$ defined on all subsets of $S$ which has no point atoms; in other words, $P(\{x\}) = 0$ for all $x \in S$. If there is no such $P$, $\kappa$ is said to be of measure 0. The continuum hypothesis implies that the cardinality $c$ of the continuum (that is, of $[0, 1]$) is of measure 0 (RAP, Appendix C). The separability character of a metric space is the smallest cardinality of a dense subset. We have:
C.2 Theorem Let $(S, d)$ be a metric space. Let $P$ be a probability measure on the $\sigma$-algebra of Borel sets, generated by the open sets. Then either there is a separable subspace $T$ with $P(T) = 1$, or the separability character of $S$ is measurable.

Proof Let $\{x_\alpha\}_{\alpha\in I}$ be dense in $S$, where $I$ has cardinality $\kappa$, the separability character of $S$. For a fixed positive integer $m$, the balls $U_\alpha := B(x_\alpha, 1/m)$ form an open cover of $S$. Take an open, $\sigma$-discrete refinement. Then $S$ is the union of countably many open sets $V_n$, each of which is the union of open sets $V_{\gamma,n}$, $\gamma \in \Gamma_n$, disjoint for different values of $\gamma$, and each included in some ball of radius $2/m$. Here each $\Gamma_n$ has cardinality at most $\kappa$. If a cardinal has measure 0, then so, clearly, do all smaller cardinals. Define a measure $\mu_n$ on each $\Gamma_n$ by $\mu_n(A) := P(\bigcup_{\gamma\in A} V_{\gamma,n})$, where $P$ is defined on all open sets. So if $\kappa$ has measure 0, then for each $n$ there is a countable subset $G_n$ of $\Gamma_n$ such that $P(V_n) = \sum\{P(V_{\gamma,n})\colon \gamma \in G_n\}$. So $P(C_m) = 1$ where $C_m$ is a countable union (over $n$ and members of $G_n$ for each $n$) of balls of radius $2/m$. Taking an intersection over $m$, we see that $P(C) = 1$ for some separable subset $C$.
It is consistent with the usual axioms of set theory (including the axiom of choice) that there are no measurable cardinals, in other words all cardinals are of measure 0; see, for example, Drake (1974, pp. 67-68, 177-178). It is apparently unknown whether existence of measurable cardinals is consistent (Drake, 1974, pp. 185-186). So, for practical purposes, a probability measure defined on the Borel sets of a metric space is always concentrated in some separable subspace. Here is another fact giving separability:
C.3 Theorem Let f be a Borel measurable function from a separable metric space S into a metric space T. Then, assuming the continuum hypothesis, f has separable range.
Proof If the range of $f$ is nonseparable, then for some $r > 0$ there is an uncountable set $F$ of values of $f$, any two of which are more than $r$ apart. The continuum hypothesis implies that $F$ has cardinality at least $c$. All the $2^c$ or more subsets of $F$ are closed, so all their inverse images under $f$ are distinct Borel sets. On the other hand, in a separable metric space there are at most $c$ Borel sets: for this we can assume the space is complete. Then the larger collection of analytic sets has cardinality at most $c$ by the Borel isomorphism and universal analytic set theorems (RAP, Theorems 13.1.1, 13.2.1, and 13.2.4), since $\mathbb{N}^\infty$ has cardinality $c$. So there is a contradiction.
Notes
Theorem C.1 on $\sigma$-discrete refinements is from Kelley (1955, Theorem 4.21, p. 129), who attributes it to A. H. Stone (1948). Theorem C.2 is due to Marczewski and Sikorski (1948), who prove it by a somewhat different method.
References

Drake, F. R. (1974). Set Theory: An Introduction to Large Cardinals. North-Holland, Amsterdam.
Kelley, John L. (1955). General Topology. Van Nostrand, Princeton. Repr. Springer, New York, 1975.
Marczewski, E., and Sikorski, R. (1948). Measures in non-separable metric spaces. Colloq. Math. 1, 133-139.
Stone, Arthur H. (1948). Paracompactness and product spaces. Bull. Amer. Math. Soc. 54, 631-632.
Appendix D An Extension of Lusin's Theorem
Lusin's theorem says that for any measurable real-valued function $f$, on $[0, 1]$ with Lebesgue measure $\lambda$ for example, and $\varepsilon > 0$, there is a set $A$ with $\lambda(A) < \varepsilon$ such that restricted to the complement of $A$, $f$ is continuous. Here $[0, 1]$ can be replaced by any normal topological space and $\lambda$ by any finite measure $\mu$ which is closed regular, meaning that for each Borel measurable set $B$, $\mu(B) = \sup\{\mu(F)\colon F \text{ closed},\ F \subset B\}$ (RAP, Theorem 7.5.2). Recall that any finite Borel measure on a metric space is closed regular (RAP, Theorem 7.1.3).
Proofs of Lusin's theorem are often based on Egorov's theorem (RAP, Theorem 7.5.1), which says that if measurable functions $f_n$ from a finite measure space to a metric space converge pointwise, then for any $\varepsilon > 0$ there is a set of measure less than $\varepsilon$ outside of which the $f_n$ converge uniformly. Here, the aim will be to extend Lusin's theorem to functions having values in any separable metric space. The proof of Lusin's theorem in RAP, however, also relied on the Tietze-Urysohn extension theorem, which says that a continuous real-valued function on a closed subset of a normal space can be extended to be continuous on the whole space. Such an extension may not exist for some range spaces: for example, the identity from $\{0, 1\}$ onto itself doesn't extend to a continuous function from $[0, 1]$ onto $\{0, 1\}$; in fact there is no such function since $[0, 1]$ is connected. It turns out, however, that the Tietze-Urysohn extension and Egorov's theorem are both unnecessary in proving Lusin's theorem:
D.1 Theorem Let $(X, \mathcal{T})$ be a topological space and $\mu$ a finite, closed regular measure defined on the Borel sets of $X$. Let $f$ be a Borel measurable function from $X$ into $S$ where $(S, d)$ is a separable metric space. Then for any $\varepsilon > 0$ there is a closed set $F$ with $\mu(X \setminus F) < \varepsilon$ such that $f$ restricted to $F$ is continuous.
Proof We can assume $\mu(X) = 1$. Let $\{s_n\}_{n\ge 1}$ be a countable dense set in $S$. For $m = 1, 2, \ldots$, and any $x \in X$, let $f_m(x) = s_n$ for the least $n$ such that $d(f(x), s_n) < 1/m$. Then $f_m$ is measurable and defined on all of $X$. For each $m$, let $n(m)$ be large enough so that

$$\mu\{x\colon d(f(x), s_n) \ge 1/m \text{ for all } n \le n(m)\} < 2^{-m}.$$

For $n = 1, \ldots, n(m)$, take a closed set $F_{mn} \subset f_m^{-1}\{s_n\}$ with

$$\mu\bigl(f_m^{-1}\{s_n\} \setminus F_{mn}\bigr) < \frac{1}{2^m n(m)}$$

by closed regularity. For each fixed $m$, the sets $F_{mn}$ are disjoint for different values of $n$. Let $F_m := \bigcup_{n=1}^{n(m)} F_{mn}$. Then $f_m$ is continuous on $F_m$. By choice of $n(m)$ and the $F_{mn}$, $\mu(F_m) > 1 - 2/2^m$. Since $d(f_m, f) < 1/m$ everywhere, clearly $f_m \to f$ uniformly (so Egorov's theorem is not needed). For $r = 1, 2, \ldots$, let $H_r := \bigcap_{m\ge r} F_m$. Then $H_r$ is closed and $\mu(H_r) > 1 - 4/2^r$. Take $r$ large enough so that $4/2^r < \varepsilon$. Then $f$ restricted to $H_r$ is continuous as the uniform limit of continuous functions $f_m$ on $H_r \subset F_m$, $m \ge r$, so we can let $F = H_r$ to finish the proof.
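The approximating functions $f_m$ are easy to construct concretely. The following sketch (with $S = \mathbb{R}$, a dyadic-rational dense sequence, and an arbitrary $f$, all our own illustrative choices) exhibits the bound $d(f_m, f) < 1/m$ that makes $f_m \to f$ uniformly:

```python
# Sketch of the functions f_m from the proof, for S = R with the usual
# metric. The dense sequence, f, and the sample grid are choices of ours.
def dense_seq(count):
    # enumerate a dense-enough sequence s_1, s_2, ... in R: dyadic rationals
    out = [0.0]
    denom = 1
    while len(out) < count:
        for num in range(1, 8 * denom + 1):
            out.append(num / denom)
            out.append(-num / denom)
        denom *= 2
    return out[:count]

S_DENSE = dense_seq(5000)

def f(x):
    # an arbitrary measurable (here even continuous) function into S = R
    return x * x - 0.5

def f_m(x, m):
    # f_m(x) = s_n for the least n with d(f(x), s_n) < 1/m
    for s in S_DENSE:
        if abs(f(x) - s) < 1.0 / m:
            return s
    raise ValueError("dense sequence truncated too early for this m")

m = 20
# d(f_m, f) < 1/m at every sample point, by construction
max_err = max(abs(f_m(x / 100.0, m) - f(x / 100.0)) for x in range(-100, 101))
```

Each $f_m$ has countable range, and cutting the range down to the first $n(m)$ values, as in the proof, is what makes the closed-regularity step possible.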
Notes
Lusin's theorem for measurable functions with values in any separable metric space, on a space with a closed regular finite measure, was first proved as far as I know by Schaerf (1947), who proved it for f with values in any second-countable topological space (i.e., a space having a countable base for its topology), and for more general domain spaces ("neighborhood spaces"). See also Schaerf (1948) and Zakon (1965) for more extensions.
References

Schaerf, H. M. (1947). On the continuity of measurable functions in neighborhood spaces. Portugal. Math. 6, 33-44.
Schaerf, H. M. (1948). On the continuity of measurable functions in neighborhood spaces II. Portugal. Math. 7, 91-92.
Zakon, Elias (1965). On "essentially metrizable" spaces and on measurable functions with values in such spaces. Trans. Amer. Math. Soc. 119, 443-453.
Appendix E Bochner and Pettis Integrals
Let $(X, \mathcal{A}, \mu)$ be a measure space and $(S, \|\cdot\|)$ a separable Banach space. A function $f$ from $X$ into $S$ will be called simple or $\mu$-simple if it is of the form $f = \sum_{i=1}^k 1_{A_i} y_i$ for some $y_i \in S$, $k < \infty$, and measurable $A_i$ with $\mu(A_i) < \infty$. For a simple function, the Bochner integral is defined by

$$\int f\,d\mu := \sum_{i=1}^k \mu(A_i)\,y_i.$$
E.1 Theorem The Bochner integral is well-defined for $\mu$-simple functions, the $\mu$-simple functions form a real vector space, and for any $\mu$-simple functions $f$, $g$ and real constant $c$, $\int cf + g\,d\mu = c\int f\,d\mu + \int g\,d\mu$.

Proof These facts are proved just as they are for real-valued functions (RAP, Proposition 4.1.4).

For any measurable function $g$ from $X$ into $S$, where measurability is defined for the Borel $\sigma$-algebra generated by the open sets of $S$, $x \mapsto \|g(x)\|$ is a measurable, nonnegative real-valued function on $X$.
E.2 Lemma For any two $\mu$-simple functions $f$ and $g$ from $X$ into $S$, $\bigl\|\int f\,d\mu - \int g\,d\mu\bigr\| \le \int \|f - g\|\,d\mu$.

Proof The difference $f - g$ is $\mu$-simple, say $f - g = \sum_i 1_{A_i} u_i$ with the $A_i$ disjoint. Then $\bigl\|\int f - g\,d\mu\bigr\| = \bigl\|\sum_i \mu(A_i) u_i\bigr\| \le \sum_i \mu(A_i)\|u_i\| = \int \|f - g\|\,d\mu$.
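A finite-dimensional sketch may help fix ideas; here $\mathbb{R}^2$ stands in for $S$ (an illustrative choice of ours), a $\mu$-simple function is stored as pairs $(\mu(A_i), y_i)$ with the $A_i$ disjoint, and the inequality of Lemma E.2 (with $g = 0$) is checked:

```python
# Finite-dimensional sketch of the Bochner integral of a mu-simple
# function f = sum_i 1_{A_i} y_i, stored as pairs (mu(A_i), y_i) with the
# A_i disjoint and values y_i in R^2 standing in for the Banach space S.
def bochner_simple(pieces):
    # integral of f dmu = sum_i mu(A_i) y_i
    dim = len(pieces[0][1])
    return tuple(sum(m * y[j] for m, y in pieces) for j in range(dim))

def norm(v):
    # Euclidean norm on R^2
    return sum(c * c for c in v) ** 0.5

def l1_seminorm(pieces):
    # integral of ||f|| dmu = sum_i mu(A_i) ||y_i||  (A_i disjoint)
    return sum(m * norm(y) for m, y in pieces)

f_pieces = [(0.5, (1.0, -2.0)), (0.25, (4.0, 0.0))]
integral = bochner_simple(f_pieces)
# ||integral of f dmu|| <= integral of ||f|| dmu, as in Lemma E.2 / (E.6)
```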
Let $\mathcal{L}^1(X, \mathcal{A}, \mu, S) := \mathcal{L}^1(X, \mu, S)$ be the space of all measurable functions $f$ from $X$ into $S$ such that $\int \|f\|\,d\mu < \infty$. By the triangle inequality, it's easily seen that $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ is a vector space. Also, all $\mu$-simple functions belong to $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. Define $\|\cdot\|_1$ on $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ by $\|f\|_1 := \int \|f\|\,d\mu$. It is easily seen that $\|\cdot\|_1$ is a seminorm on $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. On the vector space of $\mu$-simple functions, the Bochner integral is linear (by Theorem E.1) and continuous (also Lipschitz) for $\|\cdot\|_1$ by Lemma E.2.
E.3 Theorem For any separable Banach space $(S, \|\cdot\|)$ and any measure space $(X, \mathcal{A}, \mu)$, the $\mu$-simple functions are dense in $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ for $\|\cdot\|_1$, and the Bochner integral extends uniquely to a linear function from $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ into $S$, continuous for $\|\cdot\|_1$.

Proof A first step in the proof will be the following.
E.4 Lemma For any $f \in \mathcal{L}^1(X, \mathcal{A}, \mu, S)$ there exist $\mu$-simple $f_k$ with $\|f - f_k\|_1 \to 0$ and $\|f - f_k\| \to 0$ almost everywhere for $\mu$ as $k \to \infty$, and such that $\|f_k\| \le \|f\|$ on $X$.
Proof Define a measure $\gamma$ on $(X, \mathcal{A})$ by $\gamma(B) := \int_B \|f(x)\|\,d\mu(x)$. Then $\gamma$ is a finite measure and has a finite image measure $\gamma \circ f^{-1}$ on $S$. If $f = 0$ a.e. ($\mu$), so that $\gamma \equiv 0$, set $f_k \equiv 0$. So we can assume $\gamma(X) > 0$. Given $m = 1, 2, \ldots$, by Ulam's theorem (RAP, Theorem 7.1.4), there is a compact $K := K_m \subset S$ with $\gamma(f^{-1}(S \setminus K)) < 1/2^m$. Then take a finite set $F \subset K$ such that each point of $K$ is within $1/m$ of some point of $F$. For each $y \in F$, let $s(y) := cy$ for a constant $c$ with $0 < c \le 1$ such that $\|y - cy\| = 1/m$, or $s(y) := 0$ if there is no such $c$ (that is, if $\|y\| \le 1/m$). Let $F := \{y_1, \ldots, y_k\}$ for some distinct $y_i$. We can assume $y_1 := 0 \in F \subset K$. For $y \in K$ let $h_m(y) := s(y_i)$ for the $y_i$ such that $\|y - y_i\|$ is minimized, choosing the smallest possible index $i$ in case of ties. For $y \notin K$ set $h_m(y) := 0$. Then for each $y \in S$, $\|h_m(y)\| \le \|y\|$, and $\|h_m(y) - y\| \le 2/m$ for all $y \in K$. Define a function $g_m$ by $g_m(x) := h_m(f(x))$. Then $g_m$ and $h_m$ are measurable and have finite range. Let $\delta := \min_{i\ge 2} \|y_i\|$. Then $\delta > 0$. The set where $g_m \ne 0$, and thus $f \in K$, is included in the set where $\|f\| > \delta/2$, which has finite measure for $\mu$. So $g_m$ is a $\mu$-simple function. The set where $\|f - g_m\| > 2/m$ has $\gamma$ measure at most $1/2^m$. Then, by the Borel-Cantelli lemma applied to the probability measure $\gamma/\gamma(X)$, it follows that $g_m \to f$ almost everywhere for $\gamma$. For any $x$, if $f(x) = 0$ then $g_m(x) = 0$ for all $m$. It follows that $\|g_m - f\| \to 0$ almost everywhere for $\mu$. Since $\|g_m\| \le \|f\|$ and $\|g_m - f\| \le 2\|f\|$, it follows by dominated convergence that $\|g_m - f\|_1 \to 0$ as $m \to \infty$, proving the lemma.
Then, to continue proving the theorem: the Bochner integral, a linear function, is continuous (and Lipschitz) for the seminorm $\|\cdot\|_1$. Thus it is uniformly continuous on its original domain, the simple functions, and so has a unique continuous extension to the closure of its domain, which by Lemma E.4 is $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$, and the extension is clearly also linear.
So the Bochner integral is well-defined for all functions in $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$, and only for such functions. A function from $X$ into $S$ will be called Bochner integrable if and only if it belongs to $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$. The extension of the Bochner integral to $\mathcal{L}^1(X, \mathcal{A}, \mu, S)$ will also be written as $\int f\,d\mu$. Thus Theorem E.3 implies that

(E.5) $\quad \int cf + g\,d\mu = c\int f\,d\mu + \int g\,d\mu$

for any Bochner integrable functions $f$, $g$ and real constant $c$. Also, by taking limits in Lemma E.2 it follows that

(E.6) $\quad \bigl\|\int f\,d\mu\bigr\| \le \int \|f\|\,d\mu$

for any Bochner integrable function $f$. Although monotone convergence is not defined in general Banach spaces, a form of dominated convergence holds:
E.7 Theorem Let $(X, \mathcal{A}, \mu)$ be a measure space. Let $f_n$ be measurable functions from $X$ into a separable Banach space $S$ such that for all $n$, $\|f_n\| \le g$ where $g$ is an integrable real-valued function. Suppose the $f_n$ converge almost everywhere to a function $f$. Then $f$ is Bochner integrable and $\|\int f_n - f\,d\mu\| \le \int \|f_n - f\|\,d\mu \to 0$ as $n \to \infty$.

Proof First, $f$ is measurable (RAP, Theorem 4.2.2). Since $\|f\| \le g$, $f$ is Bochner integrable, and the rest follows from (E.6) for $f_n - f$ and dominated convergence for the real-valued functions $\|f_n - f\| \le 2g$.

A Bochner integral $\int g\,d\mu = \int f\,d\mu$ can be defined when $g$ is defined only almost everywhere for $\mu$, $f$ is Bochner integrable and $g = f$ where $g$ is defined, just as for real-valued functions (RAP, Section 4.3). It's easy to check that when $S = \mathbb{R}$, the Bochner integral equals the usual Lebesgue integral. A Tonelli-Fubini theorem holds for the Bochner integral:
E.8 Theorem Let $(X, \mathcal{A}, \mu)$ and $(Y, \mathcal{B}, \nu)$ be $\sigma$-finite measure spaces. Let $f$ be a measurable function from $X \times Y$ into a separable Banach space $S$ such that $\int\!\int \|f(x, y)\|\,d\mu(x)\,d\nu(y) < \infty$. Then for $\mu$-almost all $x$, $f(x, \cdot)$ is Bochner integrable from $Y$ into $S$; for $\nu$-almost all $y$, $f(\cdot, y)$ is Bochner integrable from $X$ into $S$; and

$$\int f\,d(\mu \times \nu) = \int\!\!\int f(x, y)\,d\mu(x)\,d\nu(y) = \int\!\!\int f(x, y)\,d\nu(y)\,d\mu(x).$$
Proof By the usual Tonelli-Fubini theorem for real-valued functions (RAP, 4.4.5), $\int\!\int \|f(x, y)\|\,d\mu(x)\,d\nu(y) < \infty$ implies that for $\nu$-almost all $y$, $\int \|f(x, y)\|\,d\mu(x) < \infty$, and for $\mu$-almost all $x$, $\int \|f(x, y)\|\,d\nu(y) < \infty$. Let

$$TF := \Bigl\{f \in \mathcal{L}^1(X \times Y,\ \mu \times \nu,\ S)\colon \int f\,d(\mu \times \nu) = \int\!\!\int f(x, y)\,d\mu(x)\,d\nu(y) = \int\!\!\int f(x, y)\,d\nu(y)\,d\mu(x)\Bigr\}.$$

Then by the usual Tonelli-Fubini theorem, all $(\mu \times \nu)$-simple functions are in $TF$. For any $f \in \mathcal{L}^1(X \times Y, \mu \times \nu, S)$, by Lemma E.4, take simple $f_n$ with $\|f_n\| \le \|f\|$ and $\|f_n - f\| \to 0$ almost everywhere for $\mu \times \nu$. Then for $\nu$-almost all $y$, $f_n(x, y) \to f(x, y)$ for $\mu$-almost all $x$, and $\|f_n(\cdot, y)\| \le \|f(\cdot, y)\| \in \mathcal{L}^1(X, \mu)$, so by dominated convergence (Theorem E.7), $\int f_n(x, y)\,d\mu(x) \to \int f(x, y)\,d\mu(x)$. Since the functions of $y$ on the left are $\nu$-measurable, so is the function on the right (RAP, Theorem 4.2.2, with a possible adjustment on a set of measure 0). For each $y$, $\|\int f_n(x, y)\,d\mu(x)\| \le \int \|f(x, y)\|\,d\mu(x)$, which is $\nu$-integrable with respect to $y$, so by dominated convergence (E.7) again,

$$\int\!\!\int f_n(x, y)\,d\mu(x)\,d\nu(y) \to \int\!\!\int f(x, y)\,d\mu(x)\,d\nu(y) \quad\text{as } n \to \infty.$$

The same holds for the iterated integral in the other order, $\int\!\int f\,d\nu\,d\mu$, and $\int f_n\,d(\mu \times \nu) \to \int f\,d(\mu \times \nu)$ by (E.7), so $f \in TF$.
Now let $(S, \mathcal{T})$ be any topological vector space; in other words, $S$ is a real vector space, $\mathcal{T}$ is a topology on $S$, and the operation $(c, f, g) \mapsto cf + g$ is jointly continuous from $\mathbb{R} \times S \times S$ into $S$. Then the dual space $S'$ is the set of all continuous linear functions from $S$ into $\mathbb{R}$. Let $(X, \mathcal{A}, \mu)$ be a measure space. Then a function $f$ from $X$ into $S$ is called Pettis integrable with Pettis integral $y \in S$ if and only if for every $t \in S'$, $\int t(f)\,d\mu$ is defined and finite and equals $t(y)$.
The Pettis integral is also due to Gelfand and Dunford and might be called the Gelfand-Dunford-Pettis integral (see the Notes). The Pettis integral may lack interest unless S' separates points of S, as is true for normed linear spaces by the Hahn-Banach theorem (RAP, Corollary 6.1.5).
E.9 Theorem For any measure space $(X, \mathcal{A}, \mu)$ and separable Banach space $(S, \|\cdot\|)$, each Bochner integrable function $f$ from $X$ into $S$ is also Pettis integrable, and the values of the integrals are the same.

Proof The equation $t(\int f\,d\mu) = \int t(f)\,d\mu$ is easily seen to hold for simple functions and then, by Theorem E.3, for Bochner integrable functions.
Example. A function can be Pettis integrable without being Bochner integrable. Let $H$ be a separable, infinite-dimensional Hilbert space with orthonormal basis $\{e_n\}$. Let $\mu(f = n e_n) := \mu(f = -n e_n) := p_n := n^{-7/4}$ and $f = 0$ otherwise. Then $\int \|f\|\,d\mu = 2\sum_n n^{-3/4} = +\infty$, so $f$ is not Bochner integrable. On the other hand, for any $x = \sum_n x_n e_n$ with $\sum_n x_n^2 < \infty$ we have $\sum_n 2n|x_n|p_n \le 2\bigl(\sum_n x_n^2\bigr)^{1/2}\bigl(\sum_n n^{-3/2}\bigr)^{1/2} < \infty$, and by symmetry the Pettis integral of $f$ is 0.

Example. The Tonelli-Fubini theorem, which holds for the Bochner integral (Theorem E.8), can fail for the Pettis integral. Let $H$ be an infinite-dimensional Hilbert space and let $e_{i,j}$ be orthonormal for all positive integers $i$ and $j$. Let $U(x, y) := 2^i e_{i,j}$ for $(j-1)/2^i \le x < j/2^i$ and $2^{-i} < y \le 2^{1-i}$, $j = 1, \ldots, 2^i$, and 0 elsewhere. Then it can be checked that $U$ is Pettis integrable on $[0, 1] \times [0, 1]$ for Lebesgue measure but is not integrable with respect to $y$ for fixed $x$.
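The two series in the first example can be illustrated by partial sums (the cutoff $N$ below is our own choice): the norm series $2\sum_n n^{-3/4}$ grows without bound, roughly like $8N^{1/4}$, while the factor $\sum_n n^{-3/2}$ in the Cauchy-Schwarz bound stays below $\zeta(3/2) \approx 2.612$.

```python
# Partial sums illustrating the first example; the cutoff N is arbitrary.
# 2 * sum_n n^(-3/4) (the Bochner norm integral) diverges, while
# sum_n n^(-3/2) (from the Cauchy-Schwarz bound) converges.
N = 10 ** 5
bochner_partial = sum(2.0 * n ** (-0.75) for n in range(1, N + 1))
cauchy_schwarz_partial = sum(n ** (-1.5) for n in range(1, N + 1))
```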
Notes
The treatment of the Bochner integral here is partly based on that of Cohn (1980). The following historical notes are mainly based on those of Dunford and Schwartz (1958).
Graves (1927) defined a Riemann integral for Banach-valued functions. Bochner (1933) defined his integral, a Lebesgue-type integral for Banachvalued functions. The definition of integral given by Pettis (1938) was published at the same time or earlier by Gelfand (1936, 1938) and is equivalent to one of
the definitions of Dunford (1936). Gelfand (1936) is one of the first, perhaps the first, of many distinguished publications by I. M. Gelfand. The integral
of Birkhoff (1935) strictly includes that of Bochner, and is strictly included in that of Gelfand-Dunford-Pettis. Price (1940), in a further extension, treats set-valued functions. Dunford and Schwartz (1958, Chapter 3) develop an integral when $\mu$ is only finitely additive. Facts such as the dominated convergence theorem still require countable additivity.
References

*An asterisk indicates a work I have seen discussed in secondary sources but have not seen in the original.

Birkhoff, Garrett (1935). Integration of functions with values in a Banach space. Trans. Amer. Math. Soc. 38, 357-378.
Bochner, Salomon (1933). Integration von Funktionen, deren Werte die Elemente eines Vektorraumes sind. Fund. Math. 20, 262-276.
Cohn, Donald L. (1980). Measure Theory. Birkhäuser, Boston.
Dunford, Nelson (1936). Integration and linear operations. Trans. Amer. Math. Soc. 40, 474-494.
Dunford, Nelson, and Schwartz, J. T. (1958). Linear Operators. Part I: General Theory. Interscience, New York. Repr. 1988.
*Gelfand, Izrail Moiseevich (1936). Sur un lemme de la théorie des espaces linéaires. Communications de l'Institut des Sciences Math. et Mécaniques de l'Université de Kharkoff et de la Société Math. de Kharkov (= Zapiski Khark. Mat. Obshchestva) (Ser. 4) 13, 35-40.
Gelfand, I. M. (1938). Abstrakte Funktionen und lineare Operatoren. Mat. Sbornik (N. S.) 4, 235-286.
Graves, L. M. (1927). Riemann integration and Taylor's theorem in general analysis. Trans. Amer. Math. Soc. 29, 163-177.
Pettis, Billy Joe (1938). On integration in vector spaces. Trans. Amer. Math. Soc. 44, 277-304.
Price, G. B. (1940). The theory of integration. Trans. Amer. Math. Soc. 47, 1-50.
Appendix F Nonexistence of Types of Linear Forms on Some Spaces
Recall that a real vector space $V$ with a topology $\mathcal{T}$ is called a topological vector space if addition is jointly continuous from $V \times V$ to $V$ for $\mathcal{T}$, and scalar multiplication $(c, v) \mapsto cv$ is jointly continuous from $\mathbb{R} \times V$ into $V$ for $\mathcal{T}$ on $V$ and the usual topology on $\mathbb{R}$. If $d$ is a metric on $V$, then $(V, d)$ is called a metric linear space if it is a topological vector space for the topology of $d$.
First, it will be shown that any measurable linear form on a complete metric linear space is continuous. Then, it will be seen that the only measurable (and thus, continuous) linear form on $L^p([0, 1], \lambda)$ for $0 < p < 1$ is zero. Recall that for a given $\sigma$-algebra $\mathcal{A}$ of measurable sets, in our case the Borel $\sigma$-algebra, a universally measurable set is one measurable for the completion of every probability measure on $\mathcal{A}$ (RAP, Section 11.5). The universally measurable sets form a $\sigma$-algebra, and a function $f$ is called universally measurable if for every Borel set $B$ in its range, $f^{-1}(B)$ is universally measurable.
F.1 Theorem Let $(E, d)$ be a complete metric real linear space. Let $u$ be a universally measurable linear form from $E$ into $\mathbb{R}$. Then $u$ is continuous.
Proof There exists a metric $\rho$, metrizing the $d$ topology on $E$, such that $\rho(x + v, y + v) = \rho(x, y)$ for all $x, y, v \in E$, and $|\lambda| \le 1$ implies $\rho(0, \lambda x) \le \rho(0, x)$ for all $x \in E$ (Schaefer, 1966, Theorem 1.6.1, p. 28). Then we have:

F.2 Lemma If $(E, d)$ is a complete metric linear space then it is also complete for $\rho$.
for p. Proof If not, let F be the completion of E for p. Because of the translation invariance property of p, the topological vector space structure of E extends to F. Then since (E, d) is complete, E is a Gs (a countable intersection of open
sets) in F (RAP, Theorem 2.5.4). So, if E is not complete for p, F \ E is of 413
Appendix F
414
first category in F, but F \ E includes a translate of E which is carried to E by a homeomorphism (by translation) of F, so F \ E is of second category in F, a contradiction, so the lemma is proved. Now to continue the proof of Theorem E.1, let (E, p) be a complete metric linear space where p has the stated properties. If u is not continuous, there
are $e_n \in E$ with $e_n \to 0$ and $|u(e_n)| \ge 1$ for all $n$. There are constants $c_n \to \infty$ such that $c_n e_n \to 0$: for each $k = 1, 2, \ldots$, there is a $\delta_k > 0$ such that if $\rho(x, 0) < \delta_k$ then $\rho(kx, 0) < 1/k$. We can assume that $\delta_k$ is decreasing as $k$ increases. For some $n_k$, $\rho(e_n, 0) < \delta_k$ for $n \ge n_k$. We can also assume that $n_k$ increases with $k$. Let $c_n = 1$ for $n < n_1$ and $c_n = k$ for $n_k \le n < n_{k+1}$, $k = 1, 2, \ldots$. Then the $c_n$ have the claimed properties. So we can assume $e_n \to 0$ and $|u(e_n)| \to \infty$. Taking a subsequence, we can assume $\sum_n \rho(0, e_n) < \infty$. So it will be enough to show that if $\sum_n \rho(0, e_n) < \infty$ then the $|u(e_n)|$ are bounded.
Let Q be the Cartesian product ∏_{n=1}^∞ I_n where for each n, I_n is a copy of [−1, 1]. (So Q is the closed unit ball of ℓ^∞.) For λ := {λ_n}_{n=1}^∞ ∈ Q, the series h(λ) := Σ_{n=1}^∞ λ_n e_n converges in E, since by the properties of ρ,

ρ(0, Σ_{j=k}^n λ_j e_j) ≤ Σ_{j=k}^n ρ(0, λ_j e_j) ≤ Σ_{j=k}^n ρ(0, e_j) → 0

as n ≥ k → ∞, and (E, ρ) is complete. Also, h is continuous for the product topology T on Q. For each n, let μ_n be the uniform law U[−1, 1] (one-half times Lebesgue measure on [−1, 1]) and let μ := ∏_{n=1}^∞ μ_n on Q. Then μ is a regular Borel measure on the compact metrizable space (Q, T). Thus the image measure μ ∘ h^{−1} is a regular Borel measure on (E, ρ). Since u is universally measurable, it is μ ∘ h^{−1} measurable (that is, measurable for the completion of μ ∘ h^{−1} on the Borel σ-algebra). So U := u ∘ h is measurable for the completion of μ on the Borel σ-algebra of (Q, T). For 0 < M < ∞ let Q_M := {x ∈ Q : |U(x)| ≤ M}. Then Q_M is μ-measurable and μ(Q_M) ↑ 1 as M ↑ +∞. For a given n write Q = I_n × Q^{(n)} where Q^{(n)} := ∏_{j≠n} I_j. Let μ^{(n)} := ∏_{j≠n} μ_j. Then μ = μ_n × μ^{(n)}. For y ∈ Q^{(n)} let Q_{M,n}(y) := {t ∈ I_n : (t, y) ∈ Q_M}. By the Tonelli-Fubini theorem,
μ(Q_M) = ∫_{Q^{(n)}} μ_n(Q_{M,n}(y)) dμ^{(n)}(y).
For s, t ∈ Q_{M,n}(y), U(s, y) − U(t, y) = (s − t) u(e_n). So |U(s, y)| ≤ M and |U(t, y)| ≤ M imply |s − t| ≤ 2M/|u(e_n)|. (We can assume u(e_n) ≠ 0 for all n.) So μ_n(Q_{M,n}(y)) ≤ M/|u(e_n)|, since μ_n has density (1/2)·1_{[−1,1]}. So μ(Q_M) ≤ M/|u(e_n)|, and |u(e_n)| ≤ M/μ(Q_M) < ∞ for all n if μ(Q_M) > 0,
Nonexistence of Types of Linear Forms on Some Spaces
which is true for some M large enough. So the |u(e_n)| are bounded, proving Theorem F.1.
For 0 < p < ∞ let 𝓛_p([0, 1], λ) be the space of all Lebesgue measurable real-valued functions f on [0, 1] with ∫_0^1 |f(x)|^p dx < ∞. Let L_p := L_p[0, 1] := L_p([0, 1], λ) be the space of all equivalence classes of functions in 𝓛_p([0, 1], λ) for equality a.e. (λ). For 1 ≤ p < ∞, recall that L_p is a Banach space with the norm ‖f‖_p := (∫_0^1 |f(x)|^p dx)^{1/p}. For 0 < p < 1, ‖·‖_p is not a norm. Instead, let ρ_p(f, g) := ∫_0^1 |f − g|^p(x) dx. It will be shown that for 0 < p < 1, ρ_p defines a metric on L_p. For any a ≥ 0 and b ≥ 0, (a + b)^p ≤ a^p + b^p: to see this, let r := 1/p > 1, u := a^p, v := b^p, and apply the Minkowski inequality in ℓ^r. The triangle inequality for ρ_p follows, and the other properties of a metric are clear. Also, ρ_p has the properties of the metric ρ in the proof of Theorem F.1. To see that L_p is complete for ρ_p, let {f_n} be a Cauchy sequence. As in the proof for p ≥ 1 (RAP, Theorem 5.2.1) there is a subsequence f_{n_k} converging for almost all x to some f(x), and
ρ_p(f_{n_k}, f) → 0, so ρ_p(f_n, f) → 0.

F.3 Theorem For 0 < p < 1 there are no nonzero, universally measurable linear forms on L_p[0, 1].
Proof Suppose u is such a linear form. Then by Theorem F.1, u is continuous. Now L_1[0, 1] ⊂ L_p[0, 1] and the injection h from L_1 into L_p is continuous. Thus u ∘ h is a continuous linear form on L_1. So, for some g ∈ L_∞[0, 1], u(h(f)) = ∫_0^1 (fg)(x) dx for all f ∈ L_1[0, 1] (e.g., RAP, Theorem 6.4.1). Then we can assume that g ≥ c > 0 on a set E with λ(E) > 0 (replacing u by −u if necessary). Let f_n := n on a set E_n ⊂ E with λ(E_n) = λ(E)/n, and f_n := 0 elsewhere. Then f_n → 0 in L_p for 0 < p < 1, while u(f_n) = ∫_0^1 (f_n g)(x) dx ≥ cλ(E) > 0 for all n, a contradiction.

Note that Theorem F.1 fails if the completeness assumption is omitted: let H be an infinite-dimensional Hilbert space and let h_n be an infinite orthonormal sequence in H. Let E be the set of all finite linear combinations of the h_n. Then there exists a linear form u on E with u(h_n) = n for all n, and u is Borel measurable on E but not continuous.
Notes The proof of Theorem F.1 is adapted from a proof for the case that E is a Banach space. The proof is due to A. Douady, according to L. Schwartz (1966),
who published it. The fact for Banach spaces extends easily to inductive limits of Banach spaces (for definitions see, e.g., Schaefer, 1966), the so-called ultrabornological locally convex topological linear spaces, and from linear forms to linear maps into general locally convex spaces. Among ultrabornological spaces are the spaces of Schwartz's theory of distributions (generalized functions).
References

Schaefer, H. H. (1966). Topological Vector Spaces. Macmillan, New York; 3d printing, corrected, Springer, New York, 1971.

Schwartz, L. (1966). Sur le théorème du graphe fermé. C. R. Acad. Sci. Paris Sér. A 263, A602-A605.
Appendix G Separation of Analytic Sets; Borel Injections
Recall that a Polish space is a topological space S metrizable by a metric for which S is complete and separable. Also, in any topological space X, the σ-algebra of Borel sets is generated by the open sets. Two disjoint subsets A, C of X are said to be separated by Borel sets if there is a Borel set B ⊂ X such that A ⊂ B and C ⊂ X \ B. Recall that a set A in a Polish space Y is called analytic if there is another Polish space X, a Borel subset B of X, and a Borel measurable function f from B into Y such that A = f[B] := {f(x) : x ∈ B} (e.g., RAP, Section 13.2). Equivalently, we can take f to be continuous and/or B = X and/or X = N^∞, where N is the set of nonnegative integers with discrete topology and N^∞ is the Cartesian product of an infinite sequence of copies of N, with product topology (RAP, Theorem 13.2.1).

G.1 Theorem (Separation theorem for analytic sets) Let X be a Polish space. Then any disjoint analytic subsets A, C of X can be separated by Borel sets.
Proof First, it will be shown that:
(G.2) If C_j, for j = 1, 2, …, and D are subsets of X such that for each j, C_j and D can be separated by Borel sets, then ∪_{j=1}^∞ C_j and D can be separated by Borel sets.
Proof For each j, take a Borel set B_j such that D ⊂ B_j and C_j ⊂ X \ B_j. Let B := ∩_{j=1}^∞ B_j. Then D ⊂ B and ∪_{j=1}^∞ C_j ⊂ X \ B, so (G.2) holds.
Next, it will be shown that:

(G.3) If C_j and D_j, for j = 1, 2, …, are subsets of X such that for every i and j, C_i and D_j can be separated by Borel sets, then ∪_{i=1}^∞ C_i and ∪_{j=1}^∞ D_j can be separated by Borel sets.
Proof By (G.2), for each j, ∪_{i=1}^∞ C_i and D_j can be separated by Borel sets. Applying (G.2) again gives (G.3).

In proving Theorem G.1, we can assume A and C are both nonempty. Then there exist continuous functions f, g from N^∞ into X such that f[N^∞] = A and g[N^∞] = C. Suppose A and C cannot be separated by Borel sets. For k = 1, 2, …, and m_i ∈ N, i = 1, …, k, let

N^∞(m_1, …, m_k) := {n = {n_i}_{i=1}^∞ ∈ N^∞ : n_i = m_i for i = 1, …, k}.

By (G.3), there exist m_1 and n_1 such that f[N^∞(m_1)] and g[N^∞(n_1)] cannot be separated by Borel sets. Inductively, for all k = 1, 2, …, there exist m_k and n_k such that f[N^∞(m_1, …, m_k)] and g[N^∞(n_1, …, n_k)] cannot be separated by Borel sets. Let m := {m_i}_{i=1}^∞, n := {n_j}_{j=1}^∞. If f(m) ≠ g(n), then there are disjoint open sets U, V with f(m) ∈ U and g(n) ∈ V. But then by continuity of f and g, for k large enough, f[N^∞(m_1, …, m_k)] ⊂ U and g[N^∞(n_1, …, n_k)] ⊂ V, which gives a contradiction. So f(m) = g(n). But f(m) ∈ A, g(n) ∈ C, and A ∩ C = ∅, another contradiction, which proves Theorem G.1.
G.4 Lemma If X is a Polish space and A_j, j = 1, 2, …, are analytic in X, then ∪_{j=1}^∞ A_j is analytic.

Proof For j = 1, 2, …, let f_j be a continuous function from N^∞ onto A_j. For n := {n_i}_{i=1}^∞ ∈ N^∞ define f(n) := f_{n_1}({n_{i+1}}_{i≥1}). Then f is continuous from N^∞ onto ∪_{j=1}^∞ A_j.
G.5 Corollary If X is a Polish space and A_j, j = 1, 2, …, are disjoint analytic subsets of X, then there exist disjoint Borel sets B_j such that A_j ⊂ B_j for all j.

Proof For each j such that A_j = ∅ we can take B_j = ∅. So we can assume all the A_j are nonempty. For each k = 1, 2, …, by Lemma G.4, A^{(k)} := ∪_{j≠k} A_j is analytic. Then by Theorem G.1, there is a Borel set C_k with A_k ⊂ C_k and A^{(k)} ⊂ X \ C_k. Let B_1 := C_1 and, for j ≥ 2, B_j := C_j \ ∪_{k<j} C_k. Then the sets B_j have the properties stated.
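The disjointification step at the end of this proof is purely set-theoretic and can be illustrated on toy data. In the sketch below (finite sets invented for illustration), each C_k contains A_k and excludes the other A_j, and B_1 := C_1, B_j := C_j \ (C_1 ∪ … ∪ C_{j−1}) produces disjoint sets still containing the A_j.

```python
# Toy illustration of the disjointification in Corollary G.5.
A = [{1}, {2}, {3}]                  # disjoint sets (playing the role of the A_j)
C = [{1, 9}, {2, 9}, {3, 9}]         # C_k contains A_k, excludes the other A_j

B = []
seen = set()
for Cj in C:
    B.append(Cj - seen)              # C_j minus the union of the earlier C_k
    seen |= Cj

print(B)                             # pairwise disjoint, and A_j <= B_j still holds
```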
G.6 Theorem Let S be a Polish space, Y a separable metric space, and A a Borel subset of S. Let f be a 1-1, Borel measurable function from A into Y. Then the range f[A] is a Borel subset of Y, and f^{−1} is a Borel measurable function from f[A] onto A.
Proof The following fact will be used.
G.7 Lemma Let S be a Polish space and Y a separable metric space. Let A be a nonempty Borel subset of S and f a 1-1, Borel measurable function from A into Y. Then there exists a Borel measurable function g from Y onto A such that g(f(x)) = x for all x in A.

Proof Let d metrize S, where (S, d) is complete and separable. Let {y_j}_{j=1}^∞ be dense in S. Choose x_0 ∈ A. Recall that B(x, r) := {y : d(x, y) < r}. For each k = 1, 2, …, and j = 1, 2, …, let

A_{kj} := A ∩ B(y_j, 1/k) \ ∪_{i<j} B(y_i, 1/k),

where the union is empty for j = 1. Then for each k, the A_{kj} for j = 1, 2, … are disjoint Borel sets whose union is A, and for each j, d(x, y) ≤ 2/k for any x, y ∈ A_{kj}. Since f is 1-1, for each k, the sets f[A_{kj}] are disjoint analytic sets in Y. Thus by Corollary G.5 there exist disjoint Borel subsets B_{kj} of Y with f[A_{kj}] ⊂ B_{kj} and with B_{kj} = ∅ if A_{kj} = ∅. For each j such that A_{kj} is nonempty, choose a point x_{kj} ∈ A_{kj}. Define a function g_k on Y by letting g_k(y) := x_{kj} if y ∈ B_{kj} for some j, and g_k(y) := x_0 if y ∉ ∪_{j≥1} B_{kj}. Then g_k is Borel measurable (e.g., RAP, Lemma 4.2.4). Since (S, d) is complete, the set G of y in Y such that g_k(y) converges in S is the same as the set on which {g_k(y)}_{k≥1} is a Cauchy sequence, so
G = ∩_{m≥1} ∪_{n≥1} ∩_{k≥n} {y : d(g_k(y), g_n(y)) < 1/m}.
Since S is separable, d(·, ·) is product measurable on S × S (e.g., RAP, Propositions 4.1.7 and 2.1.4). Thus G is a Borel set in Y. On G, let h(y) := lim_{k→∞} g_k(y). Then h is Borel measurable (RAP, Theorem 4.2.2). Let g(y) := h(y) if y ∈ G and h(y) ∈ A. Otherwise, let g(y) := x_0. Then g is a Borel function from Y into A. If x ∈ A, then for each k, x ∈ A_{kj} for some j, f(x) ∈ B_{kj}, and g_k(f(x)) = x_{kj}, so d(x, g_k(f(x))) ≤ 2/k. Thus g_k(f(x)) → x ∈ A, so g(f(x)) = h(f(x)) = x, proving Lemma G.7.

Now to prove Theorem G.6, we can assume A is nonempty. Take the function g from Lemma G.7. Then if y ∈ f[A], we have f(g(y)) = y, while if y ∉ f[A], then f(g(y)) ≠ y. So f[A] = {y ∈ Y : f(g(y)) = y}. If e is a metric for Y, then f[A] = {y ∈ Y : e(f(g(y)), y) = 0}. So by the product measurability of e (as for d above), f[A] is a Borel set in Y. Thus for any Borel
set B ⊂ A, (f^{−1})^{−1}(B) = f[B] is Borel in Y, so f^{−1} is Borel measurable, and Theorem G.6 is proved.
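The partition A_{kj} = A ∩ B(y_j, 1/k) \ ∪_{i<j} B(y_i, 1/k) from Lemma G.7 assigns each point of A to the first 1/k-ball containing it, so the pieces are disjoint, cover A, and have diameter at most 2/k. The following sketch checks this on invented data: random points in the unit square and a finite grid standing in for the dense sequence {y_j}.

```python
# First-ball partition from Lemma G.7, on toy planar data.
import math, random

random.seed(1)
A = [(random.random(), random.random()) for _ in range(200)]
Y = [(i / 10, j / 10) for i in range(11) for j in range(11)]  # "dense" centers
k = 5

def d(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])

pieces = {}
for x in A:
    j = next(j for j, y in enumerate(Y) if d(x, y) < 1 / k)   # first ball hit
    pieces.setdefault(j, []).append(x)

assert sum(len(p) for p in pieces.values()) == len(A)          # a partition of A
diam = max(d(u, v) for p in pieces.values() for u in p for v in p)
print(round(diam, 3))                                          # at most 2/k = 0.4
```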
Note Appendix G is entirely based on Cohn (1980), Section 8.3.
Reference

Cohn, D. (1980). Measure Theory. Birkhäuser, Boston.
Appendix H Young-Orlicz Spaces
A convex, increasing function g from [0, ∞) onto itself will be called a Young-Orlicz modulus. Then g is continuous since it is increasing and onto. Let (X, S, μ) be a measure space and g a Young-Orlicz modulus. Let 𝓛_g(X, S, μ) be the set of all real-valued measurable functions f on X such that

(H.1)    ‖f‖_g := inf{c > 0 : ∫ g(|f(x)|/c) dμ(x) ≤ 1} < ∞.

Let L_g be the set of equivalence classes of functions in 𝓛_g(X, S, μ) for equality almost everywhere (μ). By monotone convergence, we have
H.2 Proposition For any Young-Orlicz modulus g and any f ∈ 𝓛_g(X, S, μ), if 0 < c := ‖f‖_g < +∞, we have ∫ g(|f|/c) dμ ≤ 1; in other words, the infimum in the definition of ‖f‖_g is attained. Also, ‖f‖_g = 0 if and only if f = 0 almost everywhere for μ.

Next, we have
H.3 Lemma For any Young-Orlicz modulus g and any measurable functions f_n, if |f_n| ↑ |f|, then ‖f_n‖_g ↑ ‖f‖_g ≤ +∞.

Proof For any c > 0, ∫ g(|f_n|/c) dμ ↑ ∫ g(|f|/c) dμ by ordinary monotone convergence. Thus ‖f_n‖_g ↑ t for some t. Taking c = t and using Proposition H.2 we get ‖f‖_g ≤ t, while clearly ‖f‖_g ≥ t.

Next is a fact showing that (not surprisingly) convergence in ‖·‖_g norm implies convergence in measure (or probability):
H.4 Lemma For any Young-Orlicz modulus g and any ε > 0, there is a δ > 0 such that if ‖f‖_g < δ, then μ(|f| > ε) < ε.
Proof For any δ > 0, ‖f‖_g ≤ δ is equivalent to ∫ g(|f|/δ) dμ ≤ 1, by Proposition H.2. Since g is onto [0, ∞), there is some M < ∞ with g(M) ≥ 1/ε and M ≥ 1. Thus ‖f‖_g ≤ δ implies μ(|f| > Mδ) ≤ ε. So we can set δ := ε/M.

H.5 Theorem For any measure space (X, S, μ), L_g(X, S, μ) is a Banach space.
Proof For any real number λ ≠ 0 it's clear that a function f is in 𝓛_g(X, S, μ) if and only if λf is, with ‖λf‖_g = |λ| ‖f‖_g, by definition of ‖·‖_g. If f, h ∈ 𝓛_g(X, S, μ) and 0 ≤ λ ≤ 1 then, since g is convex, we have by definition of convexity, for any c, d > 0,

(H.6)    ∫ g(λ|f|/c + (1 − λ)|h|/d) dμ ≤ λ ∫ g(|f|/c) dμ + (1 − λ) ∫ g(|h|/d) dμ,

where some or all three integrals may be infinite. Applying this for c = 1 and h = 0 gives

(H.7)    ∫ g(λ|f|) dμ ≤ λ ∫ g(|f|) dμ.

Clearly, if the inequality in (H.1) holds for some c > 0 it also holds for any larger c. It follows that λf + (1 − λ)h ∈ 𝓛_g(X, S, μ). For the triangle inequality, let c := ‖f‖_g and d := ‖h‖_g. If c = 0 then f = 0 a.e. (μ), so ‖f + h‖_g = ‖h‖_g ≤ c + d, and likewise if d = 0. If c > 0 and d > 0, let λ := c/(c + d) in (H.6). Applying Proposition H.2 to both terms on the right in (H.6) gives ∫ g(|f + h|/(c + d)) dμ ≤ 1, and so ‖f + h‖_g ≤ c + d. So the triangle inequality holds and ‖·‖_g is a seminorm on 𝓛_g(X, S, μ). Clearly it becomes a norm on L_g(X, S, μ). To see that the latter space is complete for ‖·‖_g, let {f_k} be a Cauchy sequence. By Lemma H.4, take δ_j := δ for ε := ε_j := 1/2^j. Take a subsequence f_{k(j)} with ‖f_i − f_{k(j)}‖_g < δ_j for any i ≥ k(j) and j = 1, 2, …. Then f_{k(j)} converges μ-almost everywhere, by the proof of the Borel-Cantelli lemma, to some f. Then ‖f_{k(j)} − f‖_g → 0 as j → ∞ by Fatou's lemma, applied to the functions g(|f_{k(j)} − f_{k(r)}|/c) as j ≤ r → ∞ for c > δ_j. It follows that ‖f_i − f‖_g → 0 as i → ∞, completing the proof.
Let Φ be a Young-Orlicz modulus. Then it has one-sided derivatives as follows (RAP, Corollary 6.3.3):

φ(x) := Φ′(x+) := lim_{y↓x} (Φ(y) − Φ(x))/(y − x)
exists for all x ≥ 0, and φ(x−) := Φ′(x−) := lim_{y↑x} (Φ(x) − Φ(y))/(x − y) exists for all x > 0. As the notation suggests, for each x > 0, φ(x−) = lim_{y↑x} φ(y), and φ is a nondecreasing function on [0, ∞). Thus φ(x−) = φ(x) except for at most countably many values of x, where φ may have jumps with φ(x) > φ(x−). On any bounded interval, where φ is bounded, Φ is Lipschitz and so absolutely continuous. Thus, since Φ(0) = 0, we have Φ(x) = ∫_0^x φ(u) du for any x ≥ 0 (e.g., Rudin, 1974, Theorem 8.18). For any x > 0, φ(x) > 0 since Φ is strictly increasing.
If φ is unbounded, for 0 ≤ y < ∞ let ψ(y) := φ^←(y) := inf{x ≥ 0 : φ(x) > y}. Then ψ(0) = 0 and ψ is nondecreasing. Let Ψ(y) := ∫_0^y ψ(t) dt. Then Ψ is convex, and Ψ′ = ψ except on the at most countable set where ψ has jumps; if ψ(y) > 0 for each y > 0, then Ψ is also strictly increasing. For any nondecreasing function f from [0, ∞) into itself, it's easily seen that
for any x ≥ 0 and u ≥ 0, f^←(u) ≥ x if and only if f(t) ≤ u for all t < x. It follows that (f^←)^←(x) = f(x−) for all x > 0. Since a change in φ or ψ on the countable set of its jumps doesn't change its indefinite integral Φ or Ψ respectively, the relation between Φ and Ψ is symmetric.

A Young-Orlicz modulus Φ such that φ is unbounded and φ(x) ↓ 0 as x ↓ 0 will be called an Orlicz modulus. Then ψ is also unbounded and ψ(y) > 0 for all y > 0, so Ψ is also an Orlicz modulus. In that case, Φ and Ψ will be called dual Orlicz moduli. For such moduli we have a basic inequality due to W. H. Young:
H.8 Theorem (W. H. Young) Let Φ, Ψ be any two dual Orlicz moduli from [0, ∞) onto itself. Then for any x, y ≥ 0 we have

xy ≤ Φ(x) + Ψ(y),

with equality if x > 0 and φ(x−) ≤ y ≤ φ(x).

Proof If x = 0 or y = 0 there is no problem. Let x > 0 and y > 0. Then Φ(x) is the area of the region A : 0 < u ≤ x, 0 < v ≤ φ(u) in the (u, v) plane. Likewise, Ψ(y) is the area of the region B : 0 < v ≤ y, 0 < u ≤ ψ(v). By monotonicity and right-continuity of φ, u > ψ(v) is equivalent to φ(u) > v, so u ≤ ψ(v) is equivalent to φ(u) ≤ v, so A ∩ B = ∅. The rectangle R_{x,y} : 0 < u ≤ x, 0 < v ≤ y is included in A ∪ B ∪ C, where C has zero area, and if φ(x−) ≤ y ≤ φ(x), then R_{x,y} = A ∪ B up to a set of zero area, so the conclusions hold.
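A classical instance of Theorem H.8 (an illustrative special case, not the general proof) is the dual pair Φ(x) = x^p/p, Ψ(y) = y^q/q with 1/p + 1/q = 1: here φ(x) = x^{p−1} and ψ(y) = y^{q−1} are generalized inverses of each other since (p − 1)(q − 1) = 1, and equality holds exactly along y = φ(x). The sketch below checks this numerically.

```python
# Numerical check of xy <= Phi(x) + Psi(y) for Phi(x)=x^p/p, Psi(y)=y^q/q.
import random

p = 3.0
q = p / (p - 1.0)

Phi = lambda x: x ** p / p
Psi = lambda y: y ** q / q

random.seed(2)
worst = 0.0
for _ in range(10000):
    x, y = random.uniform(0, 5), random.uniform(0, 5)
    assert x * y <= Phi(x) + Psi(y) + 1e-9
    # equality exactly when y = phi(x) = x^(p-1):
    gap = Phi(x) + Psi(x ** (p - 1)) - x * x ** (p - 1)
    worst = max(worst, abs(gap))
print(worst)    # ~ 0: the inequality is tight along y = phi(x)
```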
One of the main uses of Young's inequality (Theorem H.8) is to prove an extension of the Rogers-Hölder inequality to Young-Orlicz spaces:

H.9 Theorem Let Φ and Ψ be dual Orlicz moduli, and for a measure space (X, S, μ) let f ∈ 𝓛_Φ(X, S, μ) and g ∈ 𝓛_Ψ(X, S, μ). Then fg ∈ L_1(X, S, μ) and ∫ |fg| dμ ≤ 2 ‖f‖_Φ ‖g‖_Ψ.

Proof By homogeneity we can assume ‖f‖_Φ = ‖g‖_Ψ = 1. Then applying Proposition H.2 with c = 1 and Theorem H.8 we get ∫ |fg| dμ ≤ 2, and the conclusion follows.
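The bound of Theorem H.9 can be tested numerically. The sketch below uses the self-dual pair Φ = Ψ = x²/2 (here φ(x) = x is its own generalized inverse) on a four-atom probability space; the functions are invented for illustration, and the Luxemburg-type norms are computed by bisection.

```python
# Check of integral |fg| dmu <= 2 ||f||_Phi ||g||_Psi for Phi = Psi = x^2/2.

def luxemburg(f, w, G):
    def I(c):
        return sum(wi * G(abs(fi) / c) for fi, wi in zip(f, w))
    lo, hi = 1e-12, 1.0
    while I(hi) > 1.0:                  # bracket the infimum in c
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if I(mid) > 1.0 else (lo, mid)
    return hi

G = lambda x: x * x / 2.0
w = [0.25] * 4                          # four atoms of mass 1/4
f = [1.0, -2.0, 0.5, 3.0]
g = [2.0, 1.0, -1.0, 0.5]

lhs = sum(wi * abs(fi * gi) for fi, gi, wi in zip(f, g, w))
rhs = 2.0 * luxemburg(f, w, G) * luxemburg(g, w, G)
print(lhs, rhs)                         # lhs <= rhs, as Theorem H.9 asserts
```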
Notes
W. H. Young (1912) proved his inequality (Theorem H.8) for smooth functions Φ. Birnbaum and Orlicz (1931) apparently began the theory of "Orlicz spaces," and W. A. J. Luxemburg defined the norms ‖·‖_Φ; see Luxemburg and Zaanen (1956). Krasnosel'skii and Rutitskii (1961) wrote a book on the topic.
References
Birnbaum, Z. W., and Orlicz, W. (1931). Über die Verallgemeinerung des Begriffes der zueinander konjugierten Potenzen. Studia Math. 3, 1-67.

Krasnosel'skii, M. A., and Rutitskii, Ia. B. (1961). Convex Functions and Orlicz Spaces. Transl. by L. F. Boron. Noordhoff, Groningen.
Luxemburg, W. A. J., and Zaanen, A. C. (1956). Conjugate spaces of Orlicz spaces. Akad. Wetensch. Amsterdam Proc. Ser. A 59 (= Indag. Math. 18), 217-228. Rudin, Walter (1974). Real and Complex Analysis, 2d ed. McGraw-Hill, New York.
Young, W. H. (1912). On classes of summable functions and their Fourier series. Proc. Roy. Soc. London Ser. A 87, 225-229.
Appendix I Modifications and Versions of Isonormal Processes
Let T be any set and (Ω, A, P) a probability space. Recall that a real-valued stochastic process indexed by T is a function (t, ω) ↦ X_t(ω) from T × Ω into ℝ such that for each t ∈ T, X_t(·) is measurable from Ω into ℝ. A modification of the process is another stochastic process Y_t, defined for the same T and Ω, such that for each t we have P(X_t = Y_t) = 1. A version of the process X_t is a process Z_t, t ∈ T, for the same T but defined on a possibly different probability space (Ω_1, B, Q), such that X_t and Z_t have the same laws; that is, for each finite subset F of T, 𝓛({X_t}_{t∈F}) = 𝓛({Z_t}_{t∈F}). Clearly, any modification of a process is also a version of the process, but a version, even if on the same probability space, may not be a modification. For example, for an isonormal process L on a Hilbert space H, the process M(x) := L(−x) is a version, but not a modification, of L. One may take a version or modification of a process in order to get better properties such as continuity. It turns out that for the isonormal process on subsets of Hilbert space, what can be done with a version can also be done by a modification, as follows.

I.1 Theorem Let L be an isonormal process restricted to a subset C of Hilbert space. For each of the following two properties, if there exists a version M of L with the property, there is also a modification N with the property. For each ω, x ↦ M(x)(ω) for x ∈ C is: (a) bounded; (b) uniformly continuous.

Also, if there is a version with (a) and another with (b), then there is a modification N(·) having both properties.
Proof Let A be a countable dense subset of C. For each x ∈ C, take x_n ∈ A with ‖x_n − x‖ < 1/n² for all n. Then L(x_n) → L(x) almost surely. Thus if we define N(x)(ω) := lim sup_{n→∞} L(x_n)(ω), or 0 on the set of probability 0 where the lim sup is infinite, then N is a modification of L. If (a) holds for M it will also hold for L on A and so for N, and likewise for (b), since a uniformly continuous function L(·)(ω) on A has a unique uniformly continuous extension to C, given by N(·)(ω). Since N(·) is the same in both cases, the last conclusion follows.
Subject Index
a.s., almost surely, 12 acyclic graph, 147 admissible (measurability), 180, 184 Alaoglu's theorem, 50 almost uniformly, see convergence analytic sets, 417-420 asymptotic equicontinuity, 117, 118, 121, 131, 315, 349 atom of a Boolean algebra, 141, 158, 159, 163 of a probability space, 92 soft, 192, 247
Banach space, 23, 50, 51, 285 binomial distribution, 1, 5, 16, 123, 400 Blum-DeHardt theorem, 235 Bochner integral, 51, 248, 407-412 Boolean algebras, 141, 158, 165 bootstrap, 335-362 bordered, 155 Borel classes, 181 Borel injections, 417-420 Borel-isomorphic, 119 Borisov-Durst theorem, 244 boundaries, differentiable, 252-264, 363 bracket, bracketing, 234-313, 365 and majorizing measures, 246 Brownian bridge, 1-3, 7, 18, 19, 91, 220, 334, 335 Brownian motion, 2, 300 in d dimensions, 310 C-estimator, 217 Cantor set, 181, 187 cardinals, 167, 403 ceiling function, 19 centered Gaussian, 25-28, 359 central limit theorems general, 94, 208, 238, 285
see also Donsker class, 94 triangular arrays, 246 in finite dimensions, 1, 23 in separable Banach spaces, 209, 291 characteristic function, 26, 31, 84, 337 classification problems, 226 closed regular measure, 405 coherent process, 93, 117, 315 compact operator, 81 comparable, 145 complemented, 144 composition of functions, 94 concentration of measure, 223 confidence sets, 357-358 continuity set, 107, 112 continuous mapping theorem, 116 convergence almost uniform, 100, 101, 107, 112, 223, 306, 336, 359 in law, 3, 93, 94 classical, 9, 91, 93, 128, 130 Dudley (1966), 130 Hoffmann-Jorgensen, 94, 99, 101, 104, 106,107,111-117,127,130,131, 291, 328, 360
in outer probability, 100-102, 104, 112, 127, 215, 223, 301, 310, 336, 358 convex combination, 46, 322 functions, 14, 24, 30, 58, 62, 70, 84, 421,423 hull, 46, 47, 84, 117, 127, 159, 166, 322 metric entropy of, 322-328 of a sequence, 82-83, 88 sets, 227, 384 and Gaussian laws, 40-43 approximation of, 269-283, 322-328 427
convex sets (cont.) classes of, 138, 166, 227, 248, 283, 309, 365, 373, 374, 384, 388
symmetric, 40,47 convolution, 345, 395, 397 correlation, 33 cotreelike ordering, 148, 149 coupling, see Strassen's theorem Vorob'ev form, 7, 10, 19, 119, 128, 296, 304 Cram6r-Wold device, 337 cubature, numerical integration, 363 cubes, unions of, 247, 364 cycle in a graph, 147 Darmois-Skitovic theorem, 25, 86 desymmetrization, 338, 339, 343 determining class, 177, 179 diameter, 61, 75 differentiability degrees for functions, and bounding sets, 252-264, 363 differentiable boundaries, 309 differentiating under integrals, 391-398 dominated family of laws, 172 Donsker classes, 94, 134 and invariance principles, 286, 301, 306, 307, 310 and log log laws, 307 criteria for, 349 asymptotic equicontinuity, 117, 118 multipliers, 355 envelopes of, 307, 308, 310 examples all subsets of N, 228, 244 bounded variation, 227 convex sets in 72, 228 convex sets in 72, 269, 283 half-lines, 127 unions of k intervals, 129 unit ball in Hilbert space, 128 weighted intervals, 227 functional, 301 non-Donsker classes, 129, 138, 228,
363-390 P-universal, 333 stability of convex hull, 127 subclass, 315 sums, 127 union of two, 121-122 sufficient conditions for bracketing, 238-248 differentiability classes, 253, 264 Koltchinskii-Pollard entropy, 208 random entropy, 226 sequences of functions, 122-126, 129 Vapnik-Cervonenkis classes, 214, 215
uniform, 328-329, 334 universal, 314-322, 328-330, 334 Donsker's invariance principle, 311 Donsker's theorem, 3, 9, 19, 129 dual Banach space, 50, 237 edge (of a graph), 147 Effros Borel structure, 190, 194 Egorov's theorem, 101, 405 ellipsoids, 80, 81, 84, 85, 128, 140, 319, 329 empirical measure, 1, 91, 94, 100, 138, 199, 285, 306, 399 as statistic, 176, 191, 226, 332, 335 bootstrap, 335 empirical process, 91-94, 117-121, 220-223, 286, 301, 308, 309, 311 in one dimension, 2, 3, 7, 18, 19 envelope function, 161, 196, 220, 223, 234,
306-308,310,316,317,358 e-net, 10, 228 equicontinuity, asymptotic, 117, 118, 121, 131, 315, 349 equivalent measures, 172 essential infimum or supremum, 44, 95, 96, 98, 130 estimator, 217-219, 226 exchangeable, 191 exponent of entropy, 79 exponent of volume, 79 exponential family, 191, 397 factorization theorem, 172, 174, 193 filtration, 370, 371 Gaussian law, see normal law Gaussian processes, 23-90, 222, 223 see also pregaussian, 92 GB-sets, 44, 52 and metric entropy, 54, 79 criteria for convex hull, 82 majorizing measures, 59, 60 mean width, 80 implied by GC-set, 45 necessary conditions, 45, 54, 64 special classes ellipsoids, 80, 84 sequences, 82 GC-sets, 44, 52, 93 and Gaussian processes, 77 and metric entropy, 79, 85 criteria for, 46, 47 convex hull, 82 other metrics, 78 implies GB-set, 45 special classes ellipsoids, 80, 81
Subject Index random Fourier series, 85 sequences, 84, 85 stability of convex hull, 47 union of two, 51 sufficient conditions majorizing measures, 74 metric entropy, 52, 54, 87 unit ball of dual space, 50 volumes, 79 geometric mean, 14 Glivenko-Cantelli class, 101, 223, 224,
234-238 criteria for, 223, 224, 226 envelope, integrability, 220 examples, 247 convex sets, 269 lower layers, 282 non-examples, 127, 138, 171, 247 strong, 100 sufficient conditions for bracketing (Blum-DeHardt), 235 differentiability classes, 253, 264 unit ball (Mourier), 237 Vapnik-Cervonenkis classes, 134 uniform, 219, 225, 282 universal, 224, 225, 228 weak, 101 not strong, 228 Glivenko-Cantelli theorem, 1 graph, 147 Hausdorff metric, 190, 250 Hilbert-Schmidt operator, 81 Holder condition, 252, 384 homotopy, 260 image admissible, 180, 182, 184, 185, 193 Suslin, 186, 189, 190, 192, 193, 217, 229,
316,317,359 incomparable, 145 independent as sets, 158, 220 events, sequences of, 123, 126, 129 pieces, process with, 370, 371 random elements, 286-291 inequalities, 12-18 Bernstein's, 12-14, 20, 124, 129 Chernoff-Okamoto, 16 Dvortezky-Kiefer-Wolfowitz, 221 for empirical processes, 220-223 Hoeffding's, 13, 14, 20 Hoffmann-Jorgensen, 288, 341 Jensen's, 340 conditional, 29 Komatsu's, 57, 87
Levy, 17, 21, 289, 310 Ottaviani's, 17, 21, 288 Slepian's, 31, 33, 34
with stars, 95-100, 285-286, 339-341, 343, 344, 352
Young, W. H., 423,424 inner measure, 104 integral Bochner, 51 Pettis, 51 integral operator, 81 intensity measure, 127, 368 invariance principles, 3, 285-313 isonormal process, 39, 51, 52, 59, 63, 74, 75, 78, 92, 425-426
Kolmogorov-Smirnov statistics, 335 Koltchinskii-Pollard entropy, 196, 228, 316 Kuiper statistic, 335 law, see convergence in law law, probability measure, 1, 23, 345 laws of the iterated logarithm, 306-308 learning theory, 226, 227 likelihood ratio, 175 linearly ordered by inclusion, 142, 143, 145, 154, 155, 161, 247 Lipschitz functions, seminorm, 112, 210, 223, 251, 356 log log laws, 306-308 loss function, 217 lower layers, 264-269, 282, 316, 373-384, 389 lower semilattice, 95 Lusin's theorem, 105, 405, 406
major set or class, 159 majorizing measures, 59-74, 81-85, 87, 246 Marczewski function, 181 marginals, laws with given, 8, 10, 119, 296 Markov property, 5, 10, 301, 389 for random sets, 373 measurability, 10, 19, 93, 170-194, 217, 221, 226, 229, 291-293, 402-406,
413-420 measurable cover functions, 95-100, 130,
285-286 cover of a set, 95, 348 universally, 20, 105, 186, 199, 202, 203,
413,415
vector space, 25-27 measurable cardinals, 403 metric dual-bounded-Lipschitz, 115, 328, 336 Prokhorov, 6, 19, 114, 119, 295 Skorokhod's, 19
metric entropy, 10-12 of convex hulls, 322-328 with bracketing, 234 metrization of convergence in law, 111, 112, 329, 336 Mills' ratio, 57 minimax risk, 217, 218 minoration, Sudakov, 33 mixed volume, 79, 270 modification, 43, 138, 229,425-426 monotone regression, 282 multinomial distribution, 4, 16, 368, 399-401 Neyman-Pearson lemma, 56, 175 node (of a graph), 147 nonatomic measure, 92, 192, 225, 403 norm, 23 bounded Lipschitz, 112, 252, 336 supremum, 3, 7, 10, 18, 19, 27, 170, 180, 190, 300 normal distribution or law, 357 in finite dimensions, 23, 24, 26, 36 in infinite dimensions, 23, 25 in one dimension, 23-25, 86 normed space, 23 norming set, 285 null hypothesis, 332 order bounded, 223 ordinal triangle, 170 orthants, as VC class, 155 outer measure, 100 outer probability, 100, 336, 358
P-Donsker class, see Donsker class P-perfect, 104 p-variation, 329 PAC learning, 227 Pascal's triangle, 135 pattern recognition, 226 perfect function, 104-107, 112, 333 perfect probability space, 106 Pettis integral, 51, 248, 407-412 Poisson distributions, 17, 19, 127, 344, 346, 368, 379 process, 127, 368-370, 374, 388 Poissonization, 344-346, 363, 367-373, 389 polar, of a set, 43, 84 Polish space, 7, 20, 119, 303 Pollard's entropy condition, 208, 316, 322, 327, 329 polyhedra, polytopes, 273, 283, 384 polynomials, 140 polytope, 141
portmanteau theorem, 111, 112, 131
pregaussian, 92-94, 117, 129, 215, 315, 388 uniformly, 329 prelinear, 45, 46, 93, 138 pseudo-seminorm, 27 pseudometric, 75 quasi-order, 145 quasiperfect, 106 Rademacher variables, 13, 206, 338 Radon law, 358 random element, 94, 100, 106, 111, 115, 127, 286-291, 336 random entropy, 223, 226 RAP, xiii realization a.s. convergent, 106-111 of a stochastic process, 46, 47 reflexive Banach space, 50 regular conditional probabilities, 120 reversed (sub)martingale, 199-201, 205 risk, 217, 218 sample, 332, 336 -bounded, 52, 59 -continuous, 2, 3, 7, 52, 55, 76, 85, 92 function, 44, 76-78 modulus, 52 space, 91, 351 Sauer's lemma, 135, 145, 149, 154, 162, 167, 207, 221 Schmidt ellipsoid, 80, 81 selection theorem, 186, 188, 193 semialgebraic sets, 165 seminorm, 23 separable measurable space, 179 shatter, 134, 330 for functions, 224 or-algebra Borel, 7, 23, 25, 28, 74 product, 91, 98, 181, 183
tail, 30,49 slowly varying function, 367 standard deviation, 33 standard model, 91, 191 standard normal law, 24, 39, 43, 53 stationary process, 204 statistic, 171, 174, 176, 357, 360 Stirling's formula, 16 stochastic process, 27, 43, 52, 74, 75, 78, 87, 217, 370, 425 stopping set, 363, 371, 373, 388 Strassen's theorem, 6, 19, 131 subadditive process, 204, 205 subgraph, 159, 250, 373 submartingale, 29, 199, 200
Sudakov-Chevet theorem, 33, 34, 39, 45, 79,81 sufficient a-algebra, 171-179 collection of functions, 177 statistic, 171, 174, 176, 191, 192 superadditive process, 204 supremum norm, 28 Suslin property, 185, 186, 190, 217, 229 symmetric random elements, 289 symmetrization, 198, 205, 206, 211, 228, 339, 343
universally measurable, 20, 105, 186, 199, 202, 203, 413
universally null, 225 upper integral, 93, 95 Vapnik-Cervonenkis classes of functions hull, 160, 163, 164, 315, 316, 327 major, 159, 160, 164, 167, 308, 315, 316
subgraph, 159, 160, 214, 223, 307, 308, 315-317,327,328,330 of sets, 134-159, 165, 169, 197, 215, 218,
tail event or a-algebra, 30, 49 tail probability, 50 tetrahedra, 227 tight, 105, 358 uniformly, 105 tightness, 20 topological vector space, 25, 27, 413 tree, 147 treelike ordering, 145, 222 triangle function, 84 triangular arrays, 247, 337, 346 triangulation, 273, 283 truncation, 13, 247 two-sample process, 128, 332-335 u.m., see universally measurable, 105 uniform distribution on [0, 1], 2, 119 uniformly integrable, 392, 393, 396, 397 uniformly pregaussian, 330 universal class a function, 183
219,221-223,225,226,314 VC, see Vapnik-Cervonenkis classes VCM class, 221 VCSM class, 307 vector space
measurable, 25-27 topological, 25, 27, 410 version, 43, 71, 74, 138, 425-426 version-continuous, 74, 75, 77, 78 volume, 78, 79, 85, 262, 270, 320, 365 weak* topology, 50 width, of a convex set, 80 Wiener process, 2 witness of irregularity, 224
Young-Orlicz norms, modulus, 55, 85, 86,
420-424 zero-one law for Gaussian measures, 26
Author Index
Adamski, W., 191,194 Adler, R. J., 222 Alexander, K. S., 121, 131, 168, 221, 307, 308 Alon, N., 166, 226, 227 Andersen, N. T., 88, 130, 131, 246, 248 Anderson, T. W., 86 Araujo, A., 359 Arcones, M., 121, 131, 239, 248 Assouad, P., 167, 168, 218, 219, 225, 229, 230 Aumann, R. J., 182, 193 Bahadur, R. R., 193 Bakhvalov, N. S., 363, 364, 388 Bauer, H., 120 Beck, J., 309 Ben-David, S., 226 Bennett, G., 20 Berger, E., 131 Berkes, I., 7, 20 Bernstein, S. N., 12, 20 Bickel, P., 360 Billingsley, P., 19 Birge, L., 154, 282 Birkhoff, Garrett, 411 Birnbaum, Z. W., 424 Blum, J. R., 234, 235, 248 Blumberg, H., 130 Blumer, A., 227 Bochner, S., 411 Bolthausen, E., 269, 283 Bonnesen, T., 79, 270 Borell, C., 86 Borisov, I. S., 244,248 Borovkov, A. A., 311 Breiman, L., 311 Bretagnolle, J., 308
Bronštein, E. M., 269, 282, 283 Brown, L. D., 222, 397
Cabana, E. M., 222 Carl, B., 322, 330 Červonenkis, A. Ya., 134, 135, 162, 167, 221, 226, 229 Cesa-Bianchi, N., 226 Chebyshev, P. L., 5, 12 Chernoff, H., 16, 20, 124 Chevet, S., 31, 33, 34, 86 Clements, G. F., 282 Coffman, E. G., 375, 389 Cohn, D. L., 101, 193, 411, 420 Danzer, L., 167 Darmois, G., 25, 86 DeHardt, J., 234, 235, 248 Devroye, L., 221 Dobric, V., 88, 130, 131 Donsker, M. D., 2, 3, 9, 19, 311 Doob, J. L., 2, 19, 229 Douady, A., 415 Drake, F. R., 403 Dudley, R. M., 20, 86, 87, 130, 131, 154, 167, 168, 193, 222, 225, 229, 248, 282, 307, 310, 311, 329, 330, 360,
365, 388, 389 Dunford, N., 50, 411, 412 Durst, M., 193, 229, 244, 248 Dvoretzky, A., 221 Eames, W., 130 Effros, E. G., 190, 194 Efron, B., 357, 361 Eggleston, H. G., 270 Ehrenfeucht, A., 227 Eilenberg, S., 260 Einmahl, U., 193
Ersov, M. P., 131 Evstigneev, I. V., 389
Feldman, J., 87 Feller, W., 17, 389 Fenchel, W., 79, 270 Ferguson, T. S., 176
Fernique, X., 24, 25, 27, 39, 40, 50, 58-60, 62, 63, 86, 87, 295, 321, 359 Fisher, R. A., 193 Freedman, D. A., 193, 311, 360 Fu, J., 282 Gaenssler, P., 191, 194, 248, 360 Gelfand, I. M., 411, 412 Giné, E., 20, 88, 131, 225, 226, 239, 246,
248, 307, 328-330, 336, 357-360 Gnedenko, B. V., 130, 311 Goffman, C., 130 Goodman, V., 222, 307 Gordon, Y., 86 Grünbaum, B., 167 Graves, L. M., 411 Gross, L., 86, 87 Gruber, P. M., 283 Gutmann, S., 193 Györfi, L., 227 Hadwiger, H., 80 Hall, P., 361 Halmos, P. R., 193 Harary, F., 167 Hausdorff, F., 282 Haussler, D., 165, 166, 168, 222, 226, 227 Heinkel, B., 307 Hobson, E. W., 397 Hoeffding, W., 13, 14, 16, 20 Hoffmann-Jørgensen, J., 130, 131, 310, 341
Il'in, V. A., 397 Itô, K., 87 Jain, N. C., 208, 229, 308, 310, 359
Kac, M., 389 Kahane, J.-P., 21, 86, 310 Kartashev, A. P., 397 Kelley, J. L., 404 Kiefer, J., 221 Kingman, J. F. C., 204, 229 Klee, V. L., 167 Kolmogorov, A. N., 2, 11, 19, 20, 130, 282, 311 Koltchinskii, V. I., 196, 198, 208, 228,
229, 307-309 Komatsu, Y., 57, 87 Komlós, J., 308
Korolyuk, V. S., 311 Krasnosel'skii, M. A., 424 Kuelbs, J., 307, 310 Kulkarni, S. R., 227
Landau, H. J., 24, 27, 50, 58, 59, 86, 295, 321 Lang, S., 397 Laskowski, M. C., 165 Le Cam, L., 193, 311 Leadbetter, M. R., 222 Lebesgue, H., 183, 193 Ledoux, M., 81, 87, 223, 307, 359 Lehmann, E. L., 56, 175 Leichtweiss, K., 79 Leighton, T., 375 Lévy, P., 17, 21 Lindgren, G., 222 Lorentz, G. G., 20, 282 Lueker, G. S., 389 Lugosi, G., 227 Luxemburg, W. A. J., 130, 424 Major, P., 308, 311 Marcus, M. B., 24, 50, 58, 59, 81, 86, 208, 229, 295, 310, 321, 359 Marczewski, E., 404 Massart, P., 221-223, 308, 309 Maurey, B., 324, 330 May, L. E., 130 McKean, H. P., 87 McMullen, P., 80 Milman, V. D., 79 Mityagin, B. S., 81 Mourier, E., 237, 248 Munkres, J. R., 283 Nagaev, S. V., 311 Natanson, I. P., 193 Neveu, J., 193 Neyman, J., 175, 193
Okamoto, M., 16, 20, 124 Orlicz, W., 423, 424 Ossiander, M., 239, 246, 248, 253
Pachl, J. K., 130 Pettis, B. J., 411 Philipp, W., 7, 20, 130, 307, 310, 311, 388 Pickands, J., 222 Pisier, G., 79, 307, 308, 330 Pollard, D. B., 168, 196, 198, 208, 214, 228, 229 Posner, E. C., 20 Pozniak, E. G., 397 Price, G. B., 412
Prokhorov, Yu. V., 6, 19 Pyke, R., 389
Radon, J., 140, 141, 167 Rao, B. V., 193 Rao, C. R., 86 Révész, P., 309 Rhee, W. T., 375 Richardson, T., 227 Rio, E., 309 Rodemich, E. R., 20 Rootzén, H., 222 Rozhdestvenskii, B. L., 397 Rudin, W., 423, 424 Rumsey, H., 20 Rutitskii, Ia. B., 424 Ryll-Nardzewski, C., 130
Sainte-Beuve, M.-F., 186, 193 Samorodnitsky, G., 222 Sauer, N., 135, 167 Savage, L. J., 193 Sazonov, V. V., 130 Schaefer, H. H., 413, 416 Schaerf, H. M., 406 Schmidt, W. M., 365, 388 Schwartz, J. T., 50, 411, 412 Schwartz, L., 37, 415, 416 Shao, Jun, 361 Shelah, S., 167 Shepp, L. A., 24, 27, 50, 58, 59, 86, 295, 321 Shor, P. W., 374, 375, 389 Shortt, R. M., 20, 130 Sikorski, R., 404 Skitovič, V. P., 25, 86 Skorokhod, A. V., 19, 106, 130, 131 Slepian, D., 31, 86 Smith, D. L., 222 Steele, J. M., 203, 204, 226, 229, 282 Steenrod, N., 260 Stengle, G., 165 Stone, A. H., 404 Strassen, V., 6 Strobl, F., 191, 200, 229, 358 Stute, W., 248 Sudakov, V. N., 31, 33, 34, 80, 81, 86, 87 Sun, Tze-Gong, 264, 282
Talagrand, M., 59, 60, 63, 81, 82, 87, 88, 223, 224, 226, 307, 308, 359, 374, 375 Tarsi, M., 166 Tibshirani, R. J., 361 Tikhomirov, V. M., 20, 282 Topsøe, F., 131 Tu, Dongsheng, 361 Tusnády, G., 308
Uspensky, J. V., 20
Valiant, L. G., 227 van der Vaart, A., 131, 358 Vapnik, V. N., 134, 135, 162, 167, 221, 226, 227, 229 Vorob'ev, N. N., 7, 20 Vulikh, B. Z., 130
Warmuth, M., 227 Wellner, J., 358 Wenocur, R. S., 167, 168 Wiener, N., 2 Wolfowitz, J., 221, 229 Wright, F. T., 282
Young, W. H., 423, 424 Yukich, J., 165, 389
Zaanen, A. C., 130, 424 Zakon, E., 406 Zeitouni, O., 227 Ziegler, K., 191 Zink, R. E., 130 Zinn, J., 88, 131, 225, 226, 246, 307,
328-330, 336, 357-360
Index of Notation
A := B means A is defined by B, xiv A =: B means B is defined by A, xiv ≍, of the same order, 252 ⊗, set of Cartesian products, 153
δA, δ-interior of A, 262
δ_x, point mass at x, 1 D(ε, A, d), packing number, 10 D_F^(1)(δ, F), 196
⇒, converges in law, 94 ⊗, product σ-algebra, 109 ∩, class of intersections, 134 ∩_{i=1}^k, class of intersections, 251 ∪, class of unions, 153 ∪_{k=1}, class of unions, 251
D_F^(1)(ε, F, Q), 161 D^(p)(ε, F), 161 D^(p)(ε, F, Q), 161 D^p, partial derivative, 252 dens, combinatorial density, 137 diam, diameter, 10
(·, ·)_{0,P}, covariance, 92
‖·‖_α, differentiability norm, 252 ‖·‖_BL, bounded Lipschitz norm, 112 ‖·‖_C, sup norm over C, 138 ‖·‖_F, sup norm over F, 94 ‖·‖_L, Lipschitz seminorm, 112 ‖·‖′, dual norm, 24
d_P, d_P(A, B) := P(A Δ B), 156 d_{P,Q}, L^p(Q) distance, 161 d_sup, supremum distance, 235
E(k, n, p), binomial probability, 16 E*, upper expectation, 95 e_λ, λ a signed measure, 345 ess.inf, essential infimum, 95 ess.sup, essential supremum, 43
[f, g] := {h: f ≤ h ≤ g}, 234 α_n := n^{1/2}(F_n − F), 1 AEC(P, r), asymptotic equicontinuity condition, 117 β(·, ·), bounded Lipschitz distance, 115 B(k, n, p), binomial probability, 16
C(F) := I(F) ∪ R(F), 260 C^t, polar of C, 43 card, cardinality, 134
cov_P, 92 C(α, K, d), subgraphs of α-smooth f's, 252 Δ, symmetric difference, 95 ΔA, set of symmetric differences, 143 Δ^C, number of induced subsets, 134
Φ, normal distribution function, 55 φ, normal density, 55 F, distribution function, 1 F_n, empirical distribution function, 1 F_F, envelope, 161 f_*, 98 f ∘ g, composition, 94 f*, measurable cover function, 95 F_{α,K}(F), α-differentiable functions, 252
γ(T) = inf γ_μ(T), 59 γ_μ(T), majorizing measure integral, 59 G_P, 91 G_{α,K,d}, differentiable functions, 252
H(ε, A, d) = log N(ε, A, d), 11 H_δ(F, M), hull, 160
P_B, bootstrap empirical measure, 335 P*, outer probability, 95
h(·, ·), Hausdorff metric, 250
pos(G) := {pos(g): g ∈ G}, 139 pos(g) := {x: g(x) > 0}, 139
I(F), inside F boundary, 260 I = [0, 1], unit interval, 19 I^n, unit cube in ℝ^n, 7 i.i.d., indep. identically dist., 1
L, isonormal process, 39 L(A)*, ess.sup_A L, 43 |L(A)|*, ess.sup_A |L|, 43
L²(P), 91 G0, 95 Go, 92 LL_d, lower layers in ℝ^d, 264 LL_{d,1}, lower layers in unit cube, 264
m^C(n), 134 ν_n, empirical process, 91 ν_B, bootstrap empirical process, 336 N_[ ](ε, F, P), 234
nn(g) = {x: g(x) > 0}, 139 π_0, centering projection, 92 P_n, empirical measure, 1
ρ(·, ·), Prokhorov distance, 115 ρ_P, cov_P distance, 92
R(F), range(F), 260 ℝ̄ = [−∞, ∞], 93 ℝ^X, all functions X → ℝ, 139
∫*, upper integral, 94 ∫_*, lower integral, 98 S¹, unit circle, 138
S(C) = V(C) − 1, VC index, 134 s̄co, symmetric closed convex hull, 49 sco, symmetric convex hull, 117 S(ℝ^n), L. Schwartz space, 37 U[0, 1], uniform law on [0, 1], 126
V(C) = S(C) + 1, VC index, 134
Var(X) = E((X − EX)²), 12 [x], least integer ≥ x, 19 x^p = x_1^{p(1)} x_2^{p(2)} ···, 252 x_t, Brownian motion, 2
y_t, Brownian bridge, 1