Probability and Its Applications Published in association with the Applied Probability Trust
Editors: S. Asmussen, J. Gani, P. Jagers, T.G. Kurtz
Photo of Charles Stein, in front, with, from left to right in the rear, Qi-Man Shao, Louis Chen and Larry Goldstein, taken at a conference at Stanford University held in honor of Charles Stein’s 90th birthday on March 22nd, 2010
For further titles published in this series, go to www.springer.com/series/1560
Louis H.Y. Chen Larry Goldstein Qi-Man Shao
Normal Approximation by Stein’s Method
Louis H.Y. Chen Department of Mathematics National University of Singapore 10 Lower Kent Ridge Road Singapore 119076 Republic of Singapore
Larry Goldstein Department of Mathematics KAP 108 University of Southern California Los Angeles, CA 90089-2532 USA
Qi-Man Shao Department of Mathematics Hong Kong University of Science and Technology Clear Water Bay, Kowloon Hong Kong China
Series Editors: Søren Asmussen Department of Mathematical Sciences Aarhus University Ny Munkegade 8000 Aarhus C Denmark
Peter Jagers Mathematical Statistics Chalmers University of Technology and University of Gothenburg 412 96 Göteborg Sweden
Joe Gani Centre for Mathematics and its Applications Mathematical Sciences Institute Australian National University Canberra, ACT 0200 Australia
Thomas G. Kurtz Department of Mathematics University of Wisconsin - Madison 480 Lincoln Drive Madison, WI 53706-1388 USA
ISSN 1431-7028 ISBN 978-3-642-15006-7 e-ISBN 978-3-642-15007-4 DOI 10.1007/978-3-642-15007-4 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2010938379 Mathematics Subject Classification (2010): 60F05, 60B12, 62E17 © Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Cover design: VTEX, Vilnius Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
This book is dedicated to Charles Stein. We also dedicate this book to our families. Annabelle, Yitian, Yipei Nancy Jiena and Wenqi
Preface
Stein’s method has developed considerably since its first appearance in 1972, and presently shows every sign that its range in theory and applications will continue to expand. Nevertheless, there must be some point along this continuing path when the method reaches a level of maturity at which a thorough, self-contained treatment, highlighted with a sampling of its many successes, is warranted. The authors of this book believe that now is this time. In the years since Stein’s method for the normal was introduced, the recognition of its power has only slowly begun to percolate throughout the probability community, helped along, no doubt, by the main references in the field over the last many years: the monograph of Stein (1986), the compilation of Diaconis and Holmes (2004), and the series of Barbour and Chen (2005a, 2005b). Nevertheless, to use one barometer, to date there exist only a small number of books or monographs, targeted generally and accessible at the graduate or undergraduate level, that make any mention of Stein’s method for the normal at all, in particular, the texts of Stroock (2000) and Ross and Peköz (2007). With a thorough building up of the fundamentals necessary to cover the many forms that Stein’s method for the normal can take to date, and the inclusion of a large number of recent developments in both theory and applications, we hope this book on normal approximation will continue to accelerate the appreciation, understanding, and use of Stein’s method. Indeed, as interest in the method has steadily grown, this book was partly written to add to the list we can give in response to the many queries we have received over the years regarding sources where one can go to learn more about the method, and, moreover, to get a sense of whether it can be applied to new situations.
We have many to thank for this book’s existence.
The first author would like to thank Charles Stein for his ideas, which the former learned from him as a student and which have been a rich source of inspiration to him over the years. He would also like to thank his co-authors, Andrew Barbour, Kwok-Pui Choi, Xiao Fang, Yu-Kiang Leong, Qi-Man Shao and Aihua Xia, from whom he has benefited substantially through many stimulating discussions. The second author first heard about Stein’s method, for the Poisson case, in a lecture by Persi Diaconis, and he thanks his first teachers in that area, Richard Arratia and Louis Gordon, for conveying a real sense of the use of the Stein equation, and
Michael Waterman for providing a fountain of wonderful applications. He learned the most about the normal approximation version of the method, and about its applications, from his work with Yosi Rinott, to whom he is most grateful. He has also benefited greatly through all his other collaborations where Stein’s method played a role, most notably those with Gesine Reinert, as well as with Aihua Xia, Mathew Penrose, and Haimeng Zhang. The third author would like to thank Louis Chen for introducing him to Stein’s method, and for the inspiration and insight he has provided. All the authors would like to thank the Institute for Mathematical Sciences, at the National University of Singapore, for their support of the many Singapore conferences, which served as a nexus for the dissemination of the most recent discoveries by the participants, and for the creation of a perfect environment for the invention of new ideas. For comments and suggestions regarding the preparation of this work the authors would particularly like to thank Jason Fulman, Ivan Nourdin and Giovanni Peccati for their guidance on the material in Chap. 14. Additionally, we thank Subhankar Ghosh and Wenxin Zhou for their help in various stages of the preparation of this book, and proofreading, and Xiao Fang for his assistance and help in writing parts of Chap. 7, on Discretized Normal Approximation. The first author was partially supported by the Tan Chin Tuan Centennial Professorship Grant C-389-000-010101 at the National University of Singapore during the time this manuscript was prepared, the second author acknowledges the grant support of NSA-AMS 091026, and the third author acknowledges grant support from Hong Kong Research Grants Council (CERG-602608 and 603710). For updates and further information on this book, please visit: http://mizar.usc.edu/~larry/nabsm.html.
Contents

1 Introduction
  1.1 The Central Limit Theorem
  1.2 A Brief History of Stein’s Method
  1.3 The Basic Idea of Stein’s Method
  1.4 Outline and Summary

2 Fundamentals of Stein’s Method
  2.1 Stein’s Equation
  2.2 Properties of the Solutions
  2.3 Construction of Stein Identities
    2.3.1 Sums of Independent Random Variables
    2.3.2 Exchangeable Pairs
    2.3.3 Zero Bias
    2.3.4 Size Bias
  2.4 A General Framework for Stein Identities and Normal Approximation for Lipschitz Functions
  Appendix

3 Berry–Esseen Bounds for Independent Random Variables
  3.1 Normal Approximation with Lipschitz Functions
  3.2 The Lindeberg Central Limit Theorem
  3.3 Berry–Esseen Inequality: The Bounded Case
  3.4 The Berry–Esseen Inequality for Unbounded Variables
    3.4.1 The Concentration Inequality Approach
    3.4.2 An Inductive Approach
  3.5 A Lower Berry–Esseen Bound

4 $L^1$ Bounds
  4.1 Sums of Independent Variables
    4.1.1 $L^1$ Berry–Esseen Bounds
    4.1.2 Contraction Principle
  4.2 Hierarchical Structures
    4.2.1 Bounds to the Normal for Approximately Linear Recursions
    4.2.2 Normal Bounds for Hierarchical Sequences
    4.2.3 Convergence Rates for the Diamond Lattice
  4.3 Cone Measure Projections
    4.3.1 Coupling Constructions for Coordinate Symmetric Variables and Their Projections
    4.3.2 Construction and Bounds for Cone Measure
  4.4 Combinatorial Central Limit Theorems
    4.4.1 Use of the Exchangeable Pair
    4.4.2 Construction and Bounds for the Combinatorial Central Limit Theorem
  4.5 Simple Random Sampling
  4.6 Chatterjee’s $L^1$ Theorem
  4.7 Locally Dependent Random Variables
  4.8 Smooth Function Bounds
    4.8.1 Fast Rates for Smooth Functions
  Appendix

5 $L^\infty$ by Bounded Couplings
  5.1 Bounded Zero Bias Couplings
  5.2 Exchangeable Pairs, Kolmogorov Distance
  5.3 Size Biasing, Kolmogorov Bounds
  5.4 Size Biasing and Smoothing Inequalities

6 $L^\infty$: Applications
  6.1 Combinatorial Central Limit Theorem
    6.1.1 Uniform Distribution on the Symmetric Group
    6.1.2 Distribution Constant on Conjugacy Classes
    6.1.3 Doubly Indexed Permutation Statistics
  6.2 Patterns in Graphs and Permutations
  6.3 The Lightbulb Process
  6.4 Anti-voter Model
  6.5 Binary Expansion of a Random Integer

7 Discretized Normal Approximation
  7.1 Poisson Binomial
  7.2 Sum of Independent Integer Valued Random Variables

8 Non-uniform Bounds for Independent Random Variables
  8.1 A Non-uniform Concentration Inequality
  8.2 Non-uniform Berry–Esseen Bounds

9 Uniform and Non-uniform Bounds Under Local Dependence
  9.1 Uniform and Non-uniform Berry–Esseen Bounds
  9.2 Outline of Proofs
  9.3 Applications

10 Uniform and Non-uniform Bounds for Non-linear Statistics
  10.1 Introduction and Main Results
  10.2 Applications
    10.2.1 U-statistics
    10.2.2 Multi-sample U-statistics
    10.2.3 L-statistics
    10.2.4 Random Sums of Independent Random Variables with Non-random Centering
    10.2.5 Functions of Non-linear Statistics
  10.3 Uniform and Non-uniform Randomized Concentration Inequalities
  Appendix

11 Moderate Deviations
  11.1 A Cramér Type Moderate Deviation Theorem
  11.2 Applications
  11.3 Preliminary Lemmas
  11.4 Proofs of Main Results
  Appendix

12 Multivariate Normal Approximation
  12.1 Multivariate Normal Approximation via Size Bias Couplings
  12.2 Degrees of Random Graphs
  12.3 Multivariate Exchangeable Pairs
  12.4 Local Dependence, and Bounds in Kolmogorov Distance

13 Non-normal Approximation
  13.1 Stein’s Method via the Density Approach
    13.1.1 The Stein Characterization and Equation
    13.1.2 Properties of the Stein Solution
  13.2 $L^1$ and $L^\infty$ Bounds via Exchangeable Pairs
  13.3 The Curie–Weiss Model
  13.4 Exponential Approximation
    13.4.1 Spectrum of the Bernoulli–Laplace Markov Chain
    13.4.2 First Passage Times
  Appendix

14 Group Characters and Malliavin Calculus
  14.1 Normal Approximation for Group Characters
    14.1.1 O(2n, R)
    14.1.2 SO(2n + 1, R)
    14.1.3 USp(2n, C)
  14.2 Stein’s Method and Malliavin Calculus

Appendix
  Notation
References
Author Index
Subject Index
Chapter 1
Introduction
1.1 The Central Limit Theorem

The Central Limit Theorem is one of the most striking and useful results in probability and statistics, and explains why the normal distribution appears in areas as diverse as gambling, measurement error, sampling, and statistical mechanics. In essence, the Central Limit Theorem in its classical form states that a normal approximation applies to the distribution of quantities that can be modeled as the sum of many independent contributions, all of which are roughly the same size. Thus mathematically justified, at least asymptotically, the normal law may be used in practice to approximate quantities ranging from the p-value of a hypothesis test, to the probability that a manufacturing process will remain in control, to the chance of observing an unusual conductance reading in a laboratory experiment. However, even though in practice sample sizes may be large, or may appear to be sufficient for the purposes at hand, the normal approximation may or may not be accurate, depending on the sample size and other factors. It is here that the need for the evaluation of the quality of the normal approximation arises, which is the topic of this book.

The seeds of the Central Limit Theorem, or CLT, lie in the work of Abraham de Moivre, who, around the year 1733, not being able to secure himself an academic appointment, supported himself consulting on problems of probability and gambling. He approximated the limiting probabilities of the binomial distribution, the one which governs the behavior of the number
\[ S_n = X_1 + \cdots + X_n \tag{1.1} \]
of successes in an experiment which consists of $n$ independent Bernoulli trials, each one having the same probability $p \in (0, 1)$ of success. de Moivre realized that even though the sum
\[ P(S_n \le m) = \sum_{k \le m} \binom{n}{k} p^k (1 - p)^{n-k} \]
that yields the cumulative probability of $m$ or fewer successes becomes unwieldy for even moderate values of $n$, there exists an easily computable, normal approximation to such probabilities that can be quite accurate even for moderate values of $n$.

L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_1, © Springer-Verlag Berlin Heidelberg 2011
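The comparison de Moivre had in mind can be tried numerically. The following is a minimal sketch, using only the Python standard library, that evaluates the exact binomial sum for $P(S_n \le m)$ next to a normal approximation. The function names are illustrative, and the half-unit continuity correction is a common refinement assumed here, not something taken from the text.

```python
import math


def binomial_cdf(n, p, m):
    """Exact P(S_n <= m): sum the binomial probabilities for k = 0, ..., m."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m + 1))


def normal_cdf(x):
    """Standard normal distribution function Phi(x), written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_approx(n, p, m):
    """Normal approximation to P(S_n <= m).

    The +0.5 is a continuity correction (an assumption of this sketch).
    """
    return normal_cdf((m + 0.5 - n * p) / math.sqrt(n * p * (1 - p)))


if __name__ == "__main__":
    n, p = 100, 0.3
    for m in (20, 25, 30, 35, 40):
        print(m, round(binomial_cdf(n, p, m), 4), round(normal_approx(n, p, m), 4))
```

The exact sum needs all binomial coefficients up to $m$, while the approximation needs a single evaluation of $\Phi$, which is the practical point of the passage above.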
Only many years later, with the work of Laplace around 1820, did it begin to be systematically realized that the normal limit holds in much greater generality. The result was the classical Central Limit Theorem, which states that $W_n \to_d Z$, that is, $W_n$ converges in distribution to $Z$, whenever
\[ W_n = (S_n - n\mu)/\sqrt{n\sigma^2} \tag{1.2} \]
is the standardization of a sum $S_n$, as in (1.1), of independent and identically distributed random variables each with mean $\mu$ and variance $\sigma^2$. Here, $Z$ denotes a standard normal variable, that is, one with distribution function $P(Z \le x) = \Phi(x)$ given by
\[ \Phi(x) = \int_{-\infty}^{x} \varphi(u)\,du \quad \text{where} \quad \varphi(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}u^2\right), \]
and a sequence of random variables $Y_n$ is said to converge in distribution to $Y$, written $Y_n \to_d Y$, if
\[ \lim_{n \to \infty} P(Y_n \le x) = P(Y \le x) \quad \text{for all continuity points } x \text{ of } P(Y \le x). \tag{1.3} \]
Generalizing further, but still keeping the variables independent, the question of when a sum of independent but not necessarily identically distributed random variables is asymptotically normal is essentially completely answered by the Lindeberg–Feller–Lévy Theorem (see Feller 1968b), which shows that the Lindeberg condition is sufficient, and nearly necessary, for the normal limit to hold. For a more detailed, and delightful, account of the history of the CLT, we refer the reader to LeCam (1986).

When the quantity $W_n$ given by (1.2) is a normalized sum of i.i.d. variables $X_1, \ldots, X_n$ with finite third moment, the works of Berry (1941) and Esseen (1942) were the first to give a bound on the normal approximation error, in terms of some universal constant $C$, of the form
\[ \sup_{z \in \mathbb{R}} \left| P(W_n \le z) - P(Z \le z) \right| \le \frac{C E|X_1|^3}{\sqrt{n}}. \]
This prototype bound has since been well studied, generalized and applied in practice, and it appears in many guises in the pages that follow. Esseen’s original upper bound on $C$ of magnitude 7.59 has been markedly decreased over the years, the record currently held by Tyurin (2010), who proved $C \le 0.4785$.

With the independent case tending toward resolution, attention can now turn to situations where the variables exhibit dependence. However, as there are countless ways variables can fail to be independent, no single technique can be used to address all situations, and no theorem parallel to the Lindeberg–Feller–Lévy theorem is ever to be expected in this greater generality. Consequently, the literature for validating the normal approximation in the presence of dependence now fragments somewhat into various techniques which can handle certain specific structures, or assumptions, two notable examples being central limit theorems proved under mixing conditions, and those results that can be applied to martingales.
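The Berry–Esseen bound stated above can be compared with the true approximation error in a case where both are computable. The sketch below, in standard-library Python, takes Bernoulli($p$) trials, reads the bound with summands standardized to mean zero and variance one (an assumption about how $E|X_1|^3$ is meant), and uses the constant 0.4785 quoted from Tyurin (2010). All function names are illustrative.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binomial_pmf(n, p, k):
    """P(S_n = k), computed on the log scale to avoid overflow for larger n."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def kolmogorov_distance(n, p):
    """sup_z |P(W_n <= z) - Phi(z)| for the standardized Binomial(n, p).

    The distribution function of W_n is a step function, so the supremum is
    attained at an atom, approached either from the left or from the right.
    """
    sd = math.sqrt(n * p * (1 - p))
    cdf_left, dist = 0.0, 0.0
    for k in range(n + 1):
        cdf_right = cdf_left + binomial_pmf(n, p, k)
        phi = normal_cdf((k - n * p) / sd)
        dist = max(dist, abs(cdf_left - phi), abs(cdf_right - phi))
        cdf_left = cdf_right
    return dist


def berry_esseen_bound(n, p, constant=0.4785):
    """C * E|X|^3 / sqrt(n), with X a Bernoulli(p) variable standardized to variance one."""
    q = 1 - p
    third_abs_moment = (p * p + q * q) / math.sqrt(p * q)
    return constant * third_abs_moment / math.sqrt(n)


if __name__ == "__main__":
    for n in (10, 100, 400):
        print(n, round(kolmogorov_distance(n, 0.3), 4), round(berry_esseen_bound(n, 0.3), 4))
```

Both columns shrink at the rate $1/\sqrt{n}$, which suggests that for lattice summands such as these the order of the bound cannot be improved.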
Characteristic function methods have proved essential in making progress in the analysis of dependence, and though they are quite powerful, they rely on handling distributions through their transforms. In doing so, some probabilistic intuition is doubtless lost. In essence, the Stein method replaces the complex-valued characteristic function with a real characterizing equation through which the random variable, in its original domain, may be manipulated, and in particular, coupled.
1.2 A Brief History of Stein’s Method

Stein’s method for normal approximation made its first appearance in the groundbreaking work of Stein (1972), and it was here that the characterization of the normal distribution on which this book is based was first presented. That is, the fact that $Z \sim N(0, \sigma^2)$ if and only if
\[ E[Z f(Z)] = \sigma^2 E[f'(Z)], \tag{1.4} \]
for all absolutely continuous functions $f$ for which the above expectations exist. Very soon thereafter the work of Chen (1975) followed, applying the characterizing equation method to the Poisson distribution based on the parallel fact that $X \sim \mathcal{P}(\lambda)$, a Poisson variable with parameter $\lambda$, if and only if
\[ E[X f(X)] = \lambda E[f(X + 1)], \]
for all functions $f$ for which the expectations above exist. From this point it seemed to take a number of years for the power of the method in both the normal and Poisson cases to become fully recognized; for Poisson approximation using Stein’s method, see, for instance, the work of Arratia et al. (1989), and Barbour et al. (1992).

The key identity (1.4) for the normal was, however, put to good use in the meantime. In another landmark paper, Stein (1981) applied the characterization that he had proved earlier for the purpose of normal approximation to derive minimax estimates for the mean of a multivariate normal distribution in dimensions three or larger. In particular, he shows, using the multivariate version of (1.4), that when $X$ has the normal distribution with mean $\theta$ and identity covariance matrix, then the mean squared error risk of the estimate $X + g(X)$, for an almost everywhere differentiable function $g : \mathbb{R}^p \to \mathbb{R}^p$, is unbiasedly estimated by
\[ p + \|g(X)\|^2 + 2\nabla \cdot g(X). \]
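The two characterizing identities above lend themselves to a quick numerical check. The sketch below, in standard-library Python, evaluates both sides of the normal identity by the trapezoidal rule and both sides of the Poisson identity by a truncated series, and repeats the normal check for a uniform variable with variance one, where the identity is expected to fail. The test functions $f = \sin$ and $f = \sqrt{\cdot}$, the truncation points, and all function names are choices made for this illustration only.

```python
import math


def integrate(g, a, b, grid=20001):
    """Trapezoidal rule for the integral of g over [a, b]."""
    step = (b - a) / (grid - 1)
    total = 0.5 * (g(a) + g(b))
    for j in range(1, grid - 1):
        total += g(a + j * step)
    return total * step


def normal_stein_gap(sigma, f=math.sin, f_prime=math.cos):
    """E[Z f(Z)] - sigma^2 E[f'(Z)] for Z ~ N(0, sigma^2); should be near zero."""
    def density(x):
        return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    lim = 10 * sigma  # the normal mass beyond ten standard deviations is negligible
    lhs = integrate(lambda x: x * f(x) * density(x), -lim, lim)
    rhs = sigma ** 2 * integrate(lambda x: f_prime(x) * density(x), -lim, lim)
    return lhs - rhs


def uniform_stein_gap(f=math.sin, f_prime=math.cos):
    """Same difference for U uniform on [-sqrt(3), sqrt(3)], which has variance 1."""
    a = math.sqrt(3.0)
    lhs = integrate(lambda x: x * f(x) / (2 * a), -a, a)
    rhs = integrate(lambda x: f_prime(x) / (2 * a), -a, a)
    return lhs - rhs


def poisson_stein_gap(lam, f=math.sqrt, terms=200):
    """E[X f(X)] - lam E[f(X + 1)] for X ~ Poisson(lam), series truncated at `terms`."""
    lhs = rhs = 0.0
    pmf = math.exp(-lam)  # P(X = 0)
    for k in range(terms):
        lhs += k * f(k) * pmf
        rhs += lam * f(k + 1) * pmf
        pmf *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return lhs - rhs


if __name__ == "__main__":
    print("normal :", normal_stein_gap(1.5))
    print("poisson:", poisson_stein_gap(2.5))
    print("uniform:", uniform_stein_gap())
```

A single test function cannot establish the "if and only if", but the nonzero value for the uniform case shows how the identity separates the normal from another law with the same mean and variance.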
This 1981 work builds on the earlier and rather remarkable and surprising result of Stein (1956), which shows that the usual sample mean estimate $X$ for the true mean $\theta$ of a multivariate normal distribution $N_p(\theta, I)$ is not admissible in dimensions three and greater; the multivariate normal characterization given in Stein (1981) provides a rather streamlined proof of this very counterintuitive fact. Returning to normal approximation, by 1986 Stein’s method was sufficiently cohesive that its foundations and some illustrative examples could be laid out in the manuscript of Stein (1986), with the exchangeable pair approach being one notable cornerstone. This manuscript also considers approximations using the binomial and the Poisson, and other probability estimates related to but not directly concerning the normal. In the realm of normal approximation, this work rather convincingly
demonstrated the potential of the method under dependence by showing how it could be used to assess the quality of approximations for the distribution of the number of empty cells in an allocation model, and the number of isolated trees in the Erdős–Rényi random graph. For a personal history up to this time from the viewpoint of Charles Stein, see his recollections in DeGroot (1986). The period following the publication of Stein’s 1986 manuscript saw a veritable explosion in the number of ideas and applications in the area, a fact well illustrated by the wide range of topics covered here, as well as in the two volumes of Barbour and Chen (2005b, 2005c), and those referred to in the bibliographies thereof. Up to the present day, powerful extensions and applications of the method continue to be discovered that were, at the time of its invention, completely unanticipated.
1.3 The Basic Idea of Stein’s Method

To show a random variable $W$ has a distribution close to that of a target distribution, say that of the random variable $Z$, one can compare the values of the expectations of the two distributions on some class of functions. For instance, one can compare the characteristic function $\phi(u) = E e^{iuW}$ of $W$ to that of $Z$, thus encapsulating all expectations of the family of functions $e^{iuz}$ for $u \in \mathbb{R}$. And indeed, as this family of functions is rich enough, closeness of the characteristic functions implies closeness of the distributions. When studying the sum of random variables, and independent random variables in particular, the characteristic function is a natural choice, as convolution in the space of measures becomes multiplication in the realm of characteristic functions. Powerful as they may be, one may lose contact with probabilistic intuition when handling complex functions in the transform domain. Stein’s method, based instead on a direct, random variable characterization of a distribution, allows the manipulation of the distribution through constructions involving the basic random quantities of which $W$ is composed, and coupling can begin to play a large role.

Consider, then, testing for the closeness of the distributions of $W$ and $Z$ by evaluating the difference between the expectations $Eh(W)$ and $Eh(Z)$ over some collection of functions $h$. At first there appears to be no handle that we can apply, the task as stated being perhaps overly general. Nevertheless, it seems clear that if the distribution of $W$ is close to the distribution of $Z$ then the difference $Eh(W) - Eh(Z)$ should be small for many functions $h$. Specializing the problem, for a specific distribution, we may evaluate the difference by relying on a characterization of $Z$. For instance, by (1.4), the distribution of a random variable $Z$ is $N(0, 1)$ if and only if
\[ E[f'(Z) - Z f(Z)] = 0 \tag{1.5} \]
for all absolutely continuous functions $f$ for which the expectation above exists. Again, if the distribution of $W$ is close to that of $Z$, then evaluating the left hand side of (1.5) when $Z$ is replaced by $W$ should result in something small. Putting these two differences together, from the Stein characterization (1.5) we arrive at the Stein equation
\[ f'(w) - w f(w) = h(w) - Eh(Z). \tag{1.6} \]
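To make the Stein equation (1.6) concrete, it can be solved numerically for a given $h$. The sketch below assumes the standard integrating-factor solution $f(w) = e^{w^2/2} \int_{-\infty}^{w} (h(x) - Eh(Z)) e^{-x^2/2}\,dx$, which is not derived in this section, and checks by a central finite difference that it satisfies (1.6). The quadrature rule, the truncation of the lower limit at $-10$, and the function names are all choices of this illustration, written in standard-library Python.

```python
import math


def simpson(g, a, b, n=4000):
    """Composite Simpson rule for the integral of g over [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    step = (b - a) / n
    total = g(a) + g(b)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * g(a + j * step)
    return total * step / 3


def normal_mean(h, cutoff=10.0):
    """Eh(Z) for Z standard normal, integrating over [-cutoff, cutoff]."""
    integral = simpson(lambda x: h(x) * math.exp(-0.5 * x * x), -cutoff, cutoff)
    return integral / math.sqrt(2 * math.pi)


def stein_solution(h, w, cutoff=10.0):
    """Assumed solution of (1.6): exp(w^2/2) times the integral of
    (h(x) - Eh(Z)) exp(-x^2/2) over (-infinity, w], truncated at -cutoff."""
    mean_h = normal_mean(h, cutoff)
    integral = simpson(lambda x: (h(x) - mean_h) * math.exp(-0.5 * x * x), -cutoff, w)
    return math.exp(0.5 * w * w) * integral


def stein_residual(h, w, eps=1e-3):
    """f'(w) - w f(w) - (h(w) - Eh(Z)), with f' taken as a central difference."""
    derivative = (stein_solution(h, w + eps) - stein_solution(h, w - eps)) / (2 * eps)
    left_side = derivative - w * stein_solution(h, w)
    return left_side - (h(w) - normal_mean(h))


if __name__ == "__main__":
    for w in (-1.5, 0.0, 0.7, 2.0):
        print(w, stein_solution(math.cos, w), stein_residual(math.cos, w))
```

The residuals should be at the level of the finite-difference error. The formula is only evaluated for moderate $|w|$ here, since the factor $e^{w^2/2}$ amplifies quadrature error for large arguments.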
Now, given $h$, one solves (1.6) for $f$, evaluates the left hand side of (1.6) at $W$ and takes the expectation, obtaining $Eh(W) - Eh(Z)$. Perhaps at first glance the problem has not been made any easier, as the evaluation of $Eh(W) - Eh(Z)$ has been replaced by the need to compute $E(f'(W) - W f(W))$. Yet the form of what must be evaluated is based on the normal characterization, and, somehow, for this reason, the expectation lends itself to calculation for $W$ for which approximation by the normal is appropriate. Borrowing, essentially, the following ‘leave one out’ idea from Stein’s original 1972 paper, let $\xi_1, \ldots, \xi_n$ be independent mean zero random variables with variances $\sigma_1^2, \ldots, \sigma_n^2$ summing to one, and set
\[ W = \sum_{i=1}^{n} \xi_i. \]
Then, with $W^{(i)} = W - \xi_i$, for some given $f$, we have
\[ E[W f(W)] = E \sum_{i=1}^{n} \xi_i f(W) = E \sum_{i=1}^{n} \xi_i f\left(W^{(i)} + \xi_i\right). \]
If $f$ is differentiable, then the summand may be expanded as
\[ \xi_i f\left(W^{(i)} + \xi_i\right) = \xi_i f\left(W^{(i)}\right) + \xi_i^2 \int_0^1 f'\left(W^{(i)} + u\xi_i\right) du, \]
and, since $W^{(i)}$ and $\xi_i$ are independent, the first term on the right hand side vanishes when taking expectation, yielding
\[ E[W f(W)] = E \sum_{i=1}^{n} \xi_i^2 \int_0^1 f'\left(W^{(i)} + u\xi_i\right) du. \]
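On the other hand, again with reference to the left hand side of (1.6), since $\sigma_1^2, \ldots, \sigma_n^2$ sum to 1, and $\xi_i$ and $W^{(i)}$ are independent, we may write
\[ \begin{aligned} E f'(W) &= E \sum_{i=1}^{n} \sigma_i^2 f'(W) \\ &= E \sum_{i=1}^{n} \sigma_i^2 f'\left(W^{(i)}\right) + E \sum_{i=1}^{n} \sigma_i^2 \left( f'(W) - f'\left(W^{(i)}\right) \right) \\ &= E \sum_{i=1}^{n} \xi_i^2 f'\left(W^{(i)}\right) + E \sum_{i=1}^{n} \sigma_i^2 \left( f'(W) - f'\left(W^{(i)}\right) \right). \end{aligned} \]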
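Taking the difference we obtain the expectation of the left hand side of (1.6) at $W$,
\[ E[f'(W) - W f(W)] = E \sum_{i=1}^{n} \xi_i^2 \int_0^1 \left( f'\left(W^{(i)}\right) - f'\left(W^{(i)} + u\xi_i\right) \right) du + E \sum_{i=1}^{n} \sigma_i^2 \left( f'(W) - f'\left(W^{(i)}\right) \right). \tag{1.7} \]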
When $n$ is large, as $\xi_1, \ldots, \xi_n$ are random variables of comparable size, it now becomes apparent why this expectation is small, no matter the distribution of the summands. Indeed, $W$ and $W^{(i)}$ only differ by the single variable $\xi_i$, which accounts for roughly $1/n$ of the total variance and is therefore of order $1/\sqrt{n}$, so the differences in both terms above are small. To make the case more convincingly, when $f$ has a bounded second derivative, then for all $u \in [0, 1]$, with $\|g\|$ denoting the supremum norm of a function $g$, the mean value theorem yields
\[ \left| f'\left(W^{(i)}\right) - f'\left(W^{(i)} + u\xi_i\right) \right| \le |\xi_i| \, \|f''\|. \]
As this bound applies as well to the second term in (1.7), it being the case $u = 1$, when $\xi_i$ has third moments we obtain
\[ \left| E[f'(W) - W f(W)] \right| \le \|f''\| \sum_{i=1}^{n} \left( E|\xi_i|^3 + \sigma_i^2 E|\xi_i| \right) \le 2 \|f''\| \sum_{i=1}^{n} E|\xi_i|^3, \tag{1.8} \]
by Hölder’s inequality.

The calculation reveals the need to understand the smoothness relation between the solution $f$ and the given function $h$. For starters, we see directly from (1.6) that $f$ always has one more degree of smoothness than $h$, which, naturally, helps. However, as the original question concerned the evaluation of the difference of expectations $Eh(W) - Eh(Z)$ expressed in terms of $h$, we see that, in order to answer using (1.8), bounds on quantities such as $\|f''\|$ must be provided in terms of some corresponding bound involving $h$. It is also worth noting that this illustration, and therefore also the original paper of Stein, contains the germ of several of the couplings which we will develop and apply later on, the present one bearing the most similarity to the analysis of local dependence.

The resemblance between Stein’s ‘leave one out’ approach and the method of Lindeberg (see, for instance, Section 8.6 of Breiman 1986) is worth some exploration. Let $X_1, X_2, \ldots$ be i.i.d. mean zero random variables with variance 1, and for each $n$ let
\[ \xi_{i,n} = \frac{X_i}{\sqrt{n}}, \quad i = 1, \ldots, n, \tag{1.9} \]
the elements of a triangular array. The basic idea of Lindeberg is to compare the sum $W_n = \xi_{1,n} + \cdots + \xi_{n,n}$ to the sum $Z_n = Z_{1,n} + \cdots + Z_{n,n}$ of mean zero, i.i.d. normals $Z_{1,n}, \ldots, Z_{n,n}$ with $\mathrm{Var}(Z_n) = 1$. Let $h$ be a twice differentiable bounded function on $\mathbb{R}$ such that $h''$ is uniformly continuous and
\[ M = \sup_{x \in \mathbb{R}} |h''(x)| < \infty. \tag{1.10} \]
For such an $h$, the quantity
\[ \delta(\epsilon) = \sup_{|x - y| \le \epsilon} |h''(x) - h''(y)| \]
is bounded over $\epsilon \in \mathbb{R}$ and satisfies $\lim_{\epsilon \downarrow 0} \delta(\epsilon) = 0$. Write the difference $Eh(W_n) - Eh(Z_n)$ as the telescoping sum
\[ Eh(W_n) - Eh(Z_n) = E \sum_{i=1}^{n} \left( h(V_{i,n}) - h(V_{i-1,n}) \right), \tag{1.11} \]
where
\[ V_{i,n} = \sum_{j=1}^{i} \xi_{j,n} + \sum_{j=i+1}^{n} Z_{j,n}, \]
with the usual convention that an empty sum is zero. In this way, the variables $V_{i,n}$ interpolate between $W_n = V_{n,n}$ and $Z_n = V_{0,n}$. Writing
\[ U_{i,n} = \sum_{j=1}^{i-1} \xi_{j,n} + \sum_{j=i+1}^{n} Z_{j,n}, \]
a Taylor expansion on the summands in (1.11) yields
\[ \begin{aligned} h(V_{i,n}) - h(V_{i-1,n}) &= h(U_{i,n} + \xi_{i,n}) - h(U_{i,n} + Z_{i,n}) \\ &= (\xi_{i,n} - Z_{i,n}) h'(U_{i,n}) + \frac{1}{2} \xi_{i,n}^2 h''(U_{i,n} + u\xi_{i,n}) - \frac{1}{2} Z_{i,n}^2 h''(U_{i,n} + vZ_{i,n}), \end{aligned} \]
for some $u, v \in [0, 1]$. Since $h'$ can grow at most linearly the expectation of the first term exists, and, as $\xi_{i,n}$ and $Z_{i,n}$ are independent of $U_{i,n}$, equals zero. Considering the expectation of the remaining second order terms, write
\[ E\left[ \xi_{i,n}^2 h''(U_{i,n} + u\xi_{i,n}) \right] = E\left[ \xi_{i,n}^2 h''(U_{i,n}) \right] + \alpha E\left[ \xi_{i,n}^2 \delta(|\xi_{i,n}|) \right], \]
for some $\alpha \in [-1, 1]$, with a similar equality holding for the expectation of the last term. As $E\xi_{i,n}^2 = EZ_{i,n}^2$, taking the difference of the second order terms, using independence, and that $\xi_{i,n}$ and $Z_{i,n}$ are identically distributed, respectively, for $i = 1, \ldots, n$, yields
\[ \left| E\left[ h(V_{i,n}) - h(V_{i-1,n}) \right] \right| \le \frac{1}{2} \left( E\left[ \xi_{1,n}^2 \delta(|\xi_{1,n}|) \right] + E\left[ Z_{1,n}^2 \delta(|Z_{1,n}|) \right] \right). \tag{1.12} \]
Recalling (1.9), we have
\[ E\left[ \xi_{1,n}^2 \delta(|\xi_{1,n}|) \right] = \frac{1}{n} E\left[ X_1^2 \delta\left(n^{-1/2} |X_1|\right) \right], \]
8
1
Introduction
with a similar equality holding for the second term of (1.12). Hence, by (1.11), with Z now denoting a standard normal variable, summing yields
Eh(Wn ) − Eh(Z) ≤ 1 E X 2 δ n−1/2 |X1 | + E Z 2 δ n−1/2 |Z| . 1 2 By (1.10), δ( ) ≤ 2M for all ∈ R, so X12 δ(n−1/2 |X1 |) ≤ 2MX12 . As X12 δ(n−1/2 |X1 |) → 0 almost surely as n → ∞, the dominated convergence theorem implies the first term above tends to zero. Applying the same reasoning to the second term we obtain lim Eh(Wn ) − Eh(Z) = 0. (1.13) n→∞
As the class of functions h for which we have obtained Eh(Wn ) → Eh(Z) is rich enough, we have shown Wn →d Z. Both the Stein and Lindeberg approaches proceed through calculations that ‘leave one out.’ However, the Stein approach seems more finely tuned to the target distribution, using the solution of a differential equation tailored to the normal. Moreover, use of the Stein differential equation provides that the functions f being evaluated on the variables of interest have one degree of smoothness over that of the basic test functions h which are used to gauge the distance between W and Z. However, the main practical difference between Stein’s method and that of Lindeberg, as far as outcome, is the former’s additional benefit of providing a bound on the distance to the target, and not only convergence in distribution; witness the difference between conclusions (1.8) and (1.13). Furthermore, Stein’s method allows for a variety of ways in which variables can be handled in the Stein equation, the ‘leave one out’ approach being just the beginning.
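The convergence in (1.13) can be observed exactly in a toy case. The sketch below is only an illustration under our own assumptions: the $X_i$ are taken to be symmetric $\pm 1$ signs and the test function is $h = \cos$, for which $Eh(Z) = e^{-1/2}$; the function names are hypothetical and not from the text.

```python
import math

def expected_h(h, n):
    """Exact E h(W_n) for W_n = (X_1 + ... + X_n) / sqrt(n), X_i = +1 or -1 with probability 1/2."""
    total = 0.0
    for k in range(n + 1):  # k counts the +1 outcomes, so the sum of signs is 2k - n
        w = (2 * k - n) / math.sqrt(n)
        total += math.comb(n, k) * 0.5 ** n * h(w)
    return total

# E cos(Z) = exp(-1/2) for a standard normal Z (its characteristic function at 1).
TARGET = math.exp(-0.5)

def gap(n):
    """|E h(W_n) - E h(Z)| for h = cos (an illustrative choice of smooth bounded h)."""
    return abs(expected_h(math.cos, n) - TARGET)

sizes = (1, 10, 100, 1000)
gaps = [gap(n) for n in sizes]
for n, g in zip(sizes, gaps):
    print(n, g)
```

The printed gaps shrink as $n$ grows, which is all that (1.13) asserts; the Lindeberg argument by itself does not give the rate.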
1.4 Outline and Summary

We begin in Chap. 2 by introducing and working with the fundamentals of Stein’s method. First we prove the Stein characterization (1.4) for the normal, and develop bounds on the Stein equation (1.6) that will be required throughout our treatment; the multivariate Stein equation for the normal, and its solution by the generator method, is also introduced here. The ‘leave one out’ coupling considered in Sect. 1.3 is but one variation on the many ways in which variables close to the one of interest can enter the Stein equation, and is in particular related to some of the couplings we consider later on to handle locally dependent variables. Four additional, and somewhat overlapping, basic methods for handling variables in the Stein equation are introduced in Chap. 2: the K-function approach, the original exchangeable pair method of Stein, and the zero bias and size bias transformations. Illustrations of how these methods allow for various manipulations in the Stein equation are provided, as well as a number of examples, some of which will continue as themes and illustrations for the remainder of the book. The independent case, of course, serves as one important testing ground
throughout. A framework that includes some of our approaches is considered in Sect. 2.4. Some technical calculations for bounds to the Stein equation appear in the Appendix to Chap. 2, as do other such calculations in subsequent chapters. Chapter 3 focuses on the independent case. The goal is to demonstrate a version of the classical Berry–Esseen theorem using Stein’s method. Along the way techniques are developed for obtaining L1 bounds, and the Lindeberg central limit theorem is shown as well. The Berry–Esseen theorem is first demonstrated for the case where the random variables are bounded. The boundedness condition is then relaxed in two ways, first by concentration inequalities, then by induction. This chapter concludes with a lower bound for the Berry–Esseen inequality. As seen in the chapter dependency diagram that follows, Chaps. 2 and 3 form much of the basis of this work. Chapter 4 develops a theory for obtaining L1 bounds using the zero bias coupling, and a main result is obtained which can be applied in non-independent settings. A number of examples are presented for illustration. The case of independence is considered first, with an L1 Berry–Esseen bound followed by the demonstration of a type of contraction principle satisfied by sums of independent variables which implies, or even in a way explains, normal convergence. Bounds in L1 are then proved for hierarchical structures, that is, self similar, fractal type objects whose scale at small levels is replicated on the larger. Then, making our first departure from independence we prove L1 bounds for the projections of random vectors having distribution concentrated on regular convex sets in Euclidean space. Next, illustrating a different coupling, L1 bounds to the normal for the combinatorial central limit theorem are given. 
Though the combinatorial central limit theorem contains simple random sampling as a particular case, somewhat better bounds may be obtained by applying specifics in the special case; hence, an L1 bound is given for the case of simple random sampling alone. Next we present Chatterjee’s L1 theorem for functions of independent random variables, and apply it to the approximation of the distribution of the volume covered by randomly placed spheres in the Euclidean torus. Results are then given for sums of locally dependent random variables, with applications including the number of local maxima on a graph. Chapter 4 concludes with a consideration of a class of smooth functions, contained in the one which may be used to determine the L1 distance, for which convergence to the normal is at the accelerated rate of 1/n, subject to a vanishing third moment assumption. The theme of Chap. 5 is to provide upper bounds in the L∞, or Kolmogorov, distance that can be applied when certain bounded couplings can be constructed. Various bounds to the normal for a random variable W are formed by constructing an auxiliary random variable on the same space as W. We have in mind here, in particular, the cases where the auxiliary variable has the same distribution as W, or the zero bias or size bias distribution of W. The resulting bound is often interpretable, sometimes directly, as a distance between W and the auxiliary variable, a small bound being a reflection of a small distance. Heuristically, being able to make a close coupling to W shows, in a sense, that perturbing W has only a weak effect. Being able to make a close coupling shows the dependence making up W is weak, and, as a random variable has an approximate normal distribution when it depends on many small weakly dependent
factors, such a W should be approximately normal. The bounded couplings studied in this chapter, ones where the auxiliary variable differs from W by at most δ in absolute value with probability one for some δ, are often much easier to manage than unbounded ones. Chapter 5 provides results when bounded zero bias, exchangeable pair, or size bias couplings can be constructed. The chapter concludes with the use of smoothing inequalities to obtain distances between W and the normal over general function classes, one special case being the derivation of Kolmogorov distance bounds when bounded size bias couplings exist. Chapter 6 applies the L∞ results of Chap. 5 to a number of applications, all of which involve dependence. Dependence can loosely be classified into two types, first, the local type, such as when each variable has a small neighborhood outside of which the remaining variables are independent, and second, dependence with a global nature. Chapter 6 deals mainly with global dependence but begins to also touch upon local dependence, a topic more thoroughly explored in Chap. 9. Regarding global dependence, the analysis of the combinatorial central limit theorem, studied in L1 in Chap. 4, is continued here with the goal of obtaining L∞ results. Results for the classical case are given, where the permutation is uniformly chosen over the symmetric group, as well as for the case where the permutation is chosen with distribution constant over some conjugacy class, such as the class of involutions. Two approaches are considered, one using the zero bias coupling and one using induction. Normal approximation bounds for the so called lightbulb process are also given in this chapter, again an example of handling global dependence, this time using the size bias coupling. The anti-voter model is also studied, handled by the exchangeable pair technique, as is the binary expansion of a random integer.
Results for the occurrences of patterns in graphs and permutations, an example of local dependence, are handled using the size bias method. Returning to the independent case, and inspired by use of the continuity correction for the normal approximation of the binomial, in Chap. 7 we consider the approximation of independent sums of integer valued random variables by the discretized normal distribution, in the total variation metric. The main result is shown by obtaining bounds between the zero biased distribution of the sum and the normal, and then treating the coupled zero biased variable as a type of perturbation. Continuing our consideration of the independent case, in Chap. 8 we derive non-uniform bounds for sums of independent random variables. In particular, by use of non-uniform concentration inequalities and the Bennett–Hoeffding inequality we provide bounds for the absolute difference between the distribution function $F(z)$ of a sum of independent variables and the normal distribution function $\Phi(z)$, which may depend on $z \in \mathbb{R}$. Non-uniform bounds serve as a counterpoint to the earlier derived supremum norm bounds that are not allowed to vary with $z$, and give information on how the quality of the normal approximation varies over $\mathbb{R}$. In Chap. 9 we consider local dependence using the K-function approach, and obtain both uniform and non-uniform Berry–Esseen bounds. The results are applied to certain scan statistics, and yield a general theorem when the local dependence can be expressed in terms of a dependency graph whose vertices are the underlying variables, and where two non-intersecting subsets of variables are independent anytime there is no edge in the graph connecting an element of one subset with the other.
In Chap. 10 we develop uniform and non-uniform bounds for non-linear functions T (X1 , . . . , Xn ), of independent random variables X1 , . . . , Xn , that can be well approximated by a linear term plus a non-linear remainder. Applications include U statistics, L-statistics and random sums. Randomized concentration inequalities are established in order to develop the theory necessary to cover these examples. In previous chapters we have measured the accuracy of approximations using differences between two distributions. For the most part, the resulting measures are sensitive to the variations between distributions in their bulk, that is, measures like the L1 or L∞ norm typically compare two distributions in the region where most of their mass is concentrated. In contrast, in Chap. 11, we consider moderate deviations of distributions, and rather than consider a difference, compare the ratio of the distribution function of the variable W of interest to that of the normal. Information on small probabilities in the tail become available in this way. Applications of the results of this chapter include the combinatorial central limit theorem, the anti-voter model, the binary expansion of a random integer, and the Curie–Weiss model. In Chap. 12 we consider multivariate normal approximation, extending both the size bias and exchangeable pair methods to this setting. In the latter case we show how in some cases the exchangeable pair ‘linearity condition’ can be achieved by embedding the problem in a higher dimension. Applications of both methods are applied to problems in random graphs. We momentarily depart from normal approximation in Chap. 13. We confine ourselves to approximations by continuous distributions for which the methods of the previous chapters may be extended. 
As one application, we approximate the distribution of the total spin of the Curie–Weiss model from statistical physics, at the critical inverse temperature, by a distribution with density proportional to $\exp(-x^4/12)$ using the method of exchangeable pairs. We also develop bounds for approximation by the exponential distribution, and apply them to the spectrum of the Bernoulli Laplace Markov chain, and first passage times for Markov chains. In Chap. 14 we consider two applications of Stein’s method, each of which goes well beyond the confines of the method’s originally intended uses: the approximation of the distribution of characters of elements chosen uniformly from compact Lie groups, and of random variables in a fixed Wiener chaos of Brownian motion, using the tools of Malliavin calculus. Regarding the first topic, the study of random characters is in some sense a generalization to abstract groups of the study of traces of random matrices, a framework into which the combinatorial central limit theorem can be made to fit. As for the second, joining Stein’s method to Malliavin calculus shows that the underlying fundamentals of Stein’s method, in particular the basic characterization of the normal which can be shown by integration by parts, can be extended, with great benefit, to abstract Wiener spaces. As for what this book fails to include, narrowing in as it does on what can be shown in the realm of normal approximation by Stein’s method, we do not consider, most notably, transform methods, mixing, or martingales. For these topics, having more history than the one presently considered, sources already abound. We stress to the reader that this book need not at all be read in a linear fashion, especially if one is interested in applications and is willing to forgo the proofs of the
theorems on which the applications are based. The following diagram reflects the dependence of each chapter on the others.
[Figure: chapter dependency diagram with nodes (2), (3), (4), (5), (6), (7), (8), (9, 10, 11), (12), (13), (14)]
Chapter 2
Fundamentals of Stein’s Method
We begin by giving a detailed account of the fundamentals of Stein’s method, starting with Stein’s characterization of the normal distribution and the basic properties of the solution to the Stein equation. Then we provide an outline of the basic Stein identities and distributional transformations which play a large role in coupling constructions, introducing first the construction of the K function for independent random variables, the exchangeable pair approach due to Stein, the zero bias transformation for random variables with mean zero and variance one, and lastly the size bias transformation for non-negative random variables with finite mean. We conclude the chapter with a framework under which a number of Stein identities can be placed, and a proposition for normal approximation using Lipschitz functions. Some of the more technical results on bounds to the Stein equation can be found in the Appendix to this chapter.
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_2, © Springer-Verlag Berlin Heidelberg 2011

2.1 Stein’s Equation

Stein’s method rests on the following characterization of the distribution of a standard normal variable $Z$, given in Stein (1972).

Lemma 2.1 If $W$ has a standard normal distribution, then
$$Ef'(W) = E[Wf(W)], \tag{2.1}$$
for all absolutely continuous functions $f : \mathbb{R} \to \mathbb{R}$ with $E|f'(Z)| < \infty$. Conversely, if (2.1) holds for all bounded, continuous and piecewise continuously differentiable functions $f$ with $E|f'(Z)| < \infty$, then $W$ has a standard normal distribution.

Though there is no known definitive method for the construction of a characterizing identity, of the type given in Lemma 2.1, for the distribution of a random variable $Y$ in general, two main contenders emerge. The first one we might call the ‘density approach.’ If $W$ has density $p(w)$ then in many cases one can replace the coefficient $W$ on the right hand side of (2.1) by $-p'(W)/p(W)$; this approach is pursued in
Chap. 13 to study approximations by non-normal distributions. In another avenue, one which we might call the ‘generator approach’, we seek a Markov process that has as its stationary distribution the one of interest. In this case, the generator, or some variation thereof, of such a process has expectation zero when applied to sufficiently smooth functions, giving the difference between the two sides of (2.1). In Sect. 2.3.2 we discuss the relation between the generator method and exchangeable pairs, and in Sect. 2.2 its relation to the solution of the Stein equation, the differential equation motivated by the characterization (2.1). In fact, we now prove one direction of Lemma 2.1 using the Stein equation (2.2).

Lemma 2.2 For fixed $z \in \mathbb{R}$ and $\Phi(z) = P(Z \le z)$, the cumulative distribution function of $Z$, the unique bounded solution $f(w) := f_z(w)$ of the equation
$$f'(w) - wf(w) = \mathbf{1}_{\{w \le z\}} - \Phi(z) \tag{2.2}$$
is given by
$$f_z(w) = \begin{cases} \sqrt{2\pi}\, e^{w^2/2}\, \Phi(w)[1 - \Phi(z)] & \text{if } w \le z, \\ \sqrt{2\pi}\, e^{w^2/2}\, \Phi(z)[1 - \Phi(w)] & \text{if } w > z. \end{cases} \tag{2.3}$$

Proof Multiplying both sides of (2.2) by the integrating factor $e^{-w^2/2}$ yields
$$\bigl( e^{-w^2/2} f(w) \bigr)' = e^{-w^2/2} \bigl( \mathbf{1}_{\{w \le z\}} - \Phi(z) \bigr).$$
Integration now yields
$$\begin{aligned}
f_z(w) &= e^{w^2/2} \int_{-\infty}^{w} \bigl( \mathbf{1}_{\{x \le z\}} - \Phi(z) \bigr) e^{-x^2/2}\, dx \\
&= -e^{w^2/2} \int_{w}^{\infty} \bigl( \mathbf{1}_{\{x \le z\}} - \Phi(z) \bigr) e^{-x^2/2}\, dx,
\end{aligned}$$
which is equivalent to (2.3). Lemma 2.3 below shows $f_z(w)$ is bounded. The general solution to (2.2) is given by $f_z(w)$ plus some constant multiple, say $ce^{w^2/2}$, of the solution to the homogeneous equation. Hence the only bounded solution is obtained by taking $c = 0$.

Proof of Lemma 2.1 Necessity. Let $f$ be an absolutely continuous function satisfying $E|f'(Z)| < \infty$. If $W$ has a standard normal distribution then
$$\begin{aligned}
Ef'(W) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f'(w) e^{-w^2/2}\, dw \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0} f'(w) \left( \int_{-\infty}^{w} (-x) e^{-x^2/2}\, dx \right) dw + \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} f'(w) \left( \int_{w}^{\infty} x e^{-x^2/2}\, dx \right) dw.
\end{aligned}$$
By Fubini’s theorem, it thus follows that
$$\begin{aligned}
Ef'(W) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0} \left( \int_{x}^{0} f'(w)\, dw \right) (-x) e^{-x^2/2}\, dx + \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} \left( \int_{0}^{x} f'(w)\, dw \right) x e^{-x^2/2}\, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \bigl[ f(x) - f(0) \bigr] x e^{-x^2/2}\, dx \\
&= E[Wf(W)].
\end{aligned}$$
Sufficiency. The function $f_z$ as given in (2.3) is clearly continuous and piecewise continuously differentiable; Lemma 2.3 below shows $f_z$ is bounded as well. Hence, if (2.1) holds for all bounded, continuous and piecewise continuously differentiable functions, then by (2.2)
$$0 = E\bigl[ f_z'(W) - Wf_z(W) \bigr] = E\bigl[ \mathbf{1}_{\{W \le z\}} - \Phi(z) \bigr] = P(W \le z) - \Phi(z).$$
Thus $W$ has a standard normal distribution.
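The necessity half of Lemma 2.1 is easy to probe numerically. The following is a minimal sketch, assuming a plain trapezoidal rule on a truncated interval is accurate enough for the smooth functions tried; the helper names are ours and purely illustrative.

```python
import math

def normal_expectation(g, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal approximation of E g(Z) for a standard normal Z."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, steps) else 1.0  # end points get half weight
        total += weight * g(x) * math.exp(-0.5 * x * x)
    return total * dx / math.sqrt(2.0 * math.pi)

def stein_gap(f, f_prime):
    """E f'(Z) - E[Z f(Z)], which identity (2.1) says is zero."""
    return normal_expectation(f_prime) - normal_expectation(lambda x: x * f(x))

print(stein_gap(math.sin, math.cos))                       # f = sin
print(stein_gap(lambda x: x ** 3, lambda x: 3 * x * x))    # f(x) = x^3
```

Both gaps come out at the level of rounding error, as (2.1) predicts for a standard normal variable.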
When $f$ is an absolutely continuous and bounded function, one can prove (2.1) holds for a standard normal $W$ using integration by parts, as in this case
$$\begin{aligned}
E[Wf(W)] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} wf(w) e^{-w^2/2}\, dw \\
&= -\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(w)\, d\bigl( e^{-w^2/2} \bigr) \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f'(w) e^{-w^2/2}\, dw \\
&= Ef'(W).
\end{aligned}$$
For a given real valued measurable function $h$ with $E|h(Z)| < \infty$ we denote $Eh(Z)$ by $Nh$ and call
$$f'(w) - wf(w) = h(w) - Nh \tag{2.4}$$
the Stein equation for $h$, or simply the Stein equation. Note that (2.2) is the special case of (2.4) for $h(w) = \mathbf{1}_{\{w \le z\}}$. By the same method of integrating factors that produced (2.3) one may show that the unique bounded solution of (2.4) is given by
$$\begin{aligned}
f_h(w) &= e^{w^2/2} \int_{-\infty}^{w} \bigl( h(x) - Nh \bigr) e^{-x^2/2}\, dx \\
&= -e^{w^2/2} \int_{w}^{\infty} \bigl( h(x) - Nh \bigr) e^{-x^2/2}\, dx. 
\end{aligned} \tag{2.5}$$
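To see the closed form (2.3) at work, one can check numerically that it satisfies the Stein equation (2.2) away from the kink at $w = z$. This is a sketch only: the derivative is approximated by a central difference, and the names `Phi`, `f_z` and `residual` are our own.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def f_z(w, z):
    """The solution (2.3); Phi(-x) stands in for 1 - Phi(x)."""
    if w <= z:
        return SQRT_2PI * math.exp(0.5 * w * w) * Phi(w) * Phi(-z)
    return SQRT_2PI * math.exp(0.5 * w * w) * Phi(z) * Phi(-w)

def residual(w, z, eps=1e-5):
    """f'(w) - w f(w) - (1{w <= z} - Phi(z)), with f' from a central difference."""
    derivative = (f_z(w + eps, z) - f_z(w - eps, z)) / (2.0 * eps)
    indicator = 1.0 if w <= z else 0.0
    return derivative - w * f_z(w, z) - (indicator - Phi(z))

for w in (-2.0, -0.5, 0.3, 1.5, 2.5):
    print(w, residual(w, 0.7))
```

The residuals are at the level of the finite difference error, on both sides of $z = 0.7$.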
2.2 Properties of the Solutions

We now list some properties of the solutions (2.3) and (2.5) to the Stein equations (2.2) and (2.4), respectively, that are required to determine error bounds in our various approximations to come. We defer the detailed proofs of Lemmas 2.3 and 2.4 to an Appendix since they are somewhat technical. As the arguments used to prove these bounds do not figure in the methods themselves, the reader may skip them if they so choose. We begin with the solution $f_z$ to (2.2).

Lemma 2.3 Let $z \in \mathbb{R}$ and let $f_z$ be given by (2.3). Then
$$wf_z(w) \text{ is an increasing function of } w. \tag{2.6}$$
Moreover, for all real $w$, $u$ and $v$,
$$|wf_z(w)| \le 1, \qquad |wf_z(w) - uf_z(u)| \le 1, \tag{2.7}$$
$$|f_z'(w)| \le 1, \qquad |f_z'(w) - f_z'(u)| \le 1, \tag{2.8}$$
$$0 < f_z(w) \le \min\bigl( \sqrt{2\pi}/4,\ 1/|z| \bigr) \tag{2.9}$$
and
$$\bigl| (w + u)f_z(w + u) - (w + v)f_z(w + v) \bigr| \le \bigl( |w| + \sqrt{2\pi}/4 \bigr) \bigl( |u| + |v| \bigr). \tag{2.10}$$
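The bounds of Lemma 2.3 can be spot-checked on a grid. The sketch below is a numerical illustration, not a proof: it evaluates (2.3) directly, reads $f_z'$ off the Stein equation (2.2), and uses function names of our own choosing.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def Phi(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def f_z(w, z):
    # (2.3); Phi(-x) is used for 1 - Phi(x) to limit cancellation in the tails
    if w <= z:
        return SQRT_2PI * math.exp(0.5 * w * w) * Phi(w) * Phi(-z)
    return SQRT_2PI * math.exp(0.5 * w * w) * Phi(z) * Phi(-w)

def f_z_prime(w, z):
    # derivative taken from the Stein equation (2.2): f' = w f + 1{w <= z} - Phi(z)
    return w * f_z(w, z) + (1.0 if w <= z else 0.0) - Phi(z)

def check_bounds(z, grid, tol=1e-12):
    """True if the sampled values respect the sup bounds on f_z, w f_z and f_z'."""
    cap = SQRT_2PI / 4.0 if z == 0 else min(SQRT_2PI / 4.0, 1.0 / abs(z))
    for w in grid:
        f = f_z(w, z)
        if not (0.0 < f <= cap + tol):
            return False
        if abs(w * f) > 1.0 + tol:
            return False
        if abs(f_z_prime(w, z)) > 1.0 + tol:
            return False
    return True

grid = [k / 10.0 for k in range(-60, 61)]
print(all(check_bounds(z, grid) for z in (-3.0, -1.0, 0.0, 0.5, 2.0)))
```

The value $\sqrt{2\pi}/4$ is attained at $w = z = 0$, which is why a small tolerance is allowed in the comparison.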
We mostly use (2.8) and (2.9) for our approximations. If one does not care much about constants, the bounds
$$|f_z'(w)| \le 2 \quad \text{and} \quad 0 < f_z(w) \le \sqrt{\pi/2}$$
may be easily obtained by using the well-known inequality
$$1 - \Phi(w) \le \min\left( \frac{1}{2},\ \frac{1}{w\sqrt{2\pi}} \right) e^{-w^2/2}, \qquad w > 0. \tag{2.11}$$
Next, we consider (2.5), the solution $f_h$ to the Stein equation (2.4). For any real valued function $h$ on $\mathbb{R}^p$ let
$$\|h\| = \sup_{x \in \mathbb{R}^p} |h(x)|.$$
Lemma 2.4 For a given function $h : \mathbb{R} \to \mathbb{R}$, let $f_h$ be the solution (2.5) to the Stein equation (2.4). If $h$ is bounded, then
$$\|f_h\| \le \sqrt{\pi/2}\, \|h(\cdot) - Nh\| \quad \text{and} \quad \|f_h'\| \le 2 \|h(\cdot) - Nh\|. \tag{2.12}$$
If $h$ is absolutely continuous, then
$$\|f_h\| \le 2\|h'\|, \qquad \|f_h'\| \le \sqrt{2/\pi}\, \|h'\| \quad \text{and} \quad \|f_h''\| \le 2\|h'\|. \tag{2.13}$$

Some of the results that follow are shown by letting $h(w)$ be the indicator of $(-\infty, z]$ with a linear decay to zero over an interval of length $\alpha > 0$, that is, the function
$$h(w) = \begin{cases} 1 & w \le z, \\ 1 + (z - w)/\alpha & z < w \le z + \alpha, \\ 0 & w > z + \alpha. \end{cases} \tag{2.14}$$
The following bounds for the solution to the Stein equation for the smoothed indicator appear in Chen and Shao (2004).
Lemma 2.5 For $z \in \mathbb{R}$ and $\alpha > 0$, let $f$ be the solution (2.5) to the Stein equation (2.4) for the smoothed indicator function (2.14). Then, for all $w, v \in \mathbb{R}$,
$$0 \le f(w) \le 1, \qquad |f'(w)| \le 1, \qquad |f'(w) - f'(v)| \le 1 \tag{2.15}$$
and
$$|f'(w + v) - f'(w)| \le |v| \left( 1 + |w| + \frac{1}{\alpha} \int_0^1 \mathbf{1}_{[z, z+\alpha]}(w + rv)\, dr \right). \tag{2.16}$$
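For concreteness, the smoothed indicator (2.14) that enters Lemma 2.5 can be written out directly. This is a minimal sketch with a hypothetical function name; it only illustrates the piecewise definition and the fact that the function is sandwiched between the indicators of $(-\infty, z]$ and $(-\infty, z + \alpha]$.

```python
def smoothed_indicator(w, z, alpha):
    """The function h of (2.14): equal to 1 up to z, then linear decay to 0 over [z, z + alpha]."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if w <= z:
        return 1.0
    if w <= z + alpha:
        return 1.0 + (z - w) / alpha
    return 0.0

# values at the two corners, the midpoint of the ramp, and beyond it
print([smoothed_indicator(w, 0.0, 0.5) for w in (0.0, 0.25, 0.5, 1.0)])
```

The slope of the ramp is $-1/\alpha$, so $\|h'\| = 1/\alpha$ is the quantity that enters bounds such as (2.13).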
For multivariate approximations we consider an extension of the Stein equation (2.4) to $\mathbb{R}^p$. For a twice differentiable function $g : \mathbb{R}^p \to \mathbb{R}$ let $\nabla g$ and $D^2 g$ denote the gradient and second derivative, or Hessian matrix, of $g$ respectively and let $\mathrm{Tr}(A)$ be the trace of a matrix $A$. Let $Z$ be a multivariate normal vector in $\mathbb{R}^p$ with mean zero and identity covariance matrix. For a test function $h : \mathbb{R}^p \to \mathbb{R}$ and for $u \ge 0$ define
$$(T_u h)(w) = E\Bigl\{ h\bigl( we^{-u} + \sqrt{1 - e^{-2u}}\, Z \bigr) \Bigr\}. \tag{2.17}$$
Letting $Nh = Eh(Z)$, the following lemma provides bounds on the solution of the ‘multivariate generator’ method for solutions to the Stein equation
$$(\mathcal{A}g)(w) = h(w) - Nh \quad \text{where} \quad (\mathcal{A}g)(w) = \mathrm{Tr}\, D^2 g(w) - w \cdot \nabla g(w). \tag{2.18}$$
We note that in one dimension (2.18) reduces to (2.4) with one extra derivative, that is, to
$$g''(w) - wg'(w) = h(w) - Nh. \tag{2.19}$$
For a vector $k = (k_1, \ldots, k_p)$ of nonnegative integers and a function $h : \mathbb{R}^p \to \mathbb{R}$, let
$$h^{(k)}(w) = \frac{\partial^{|k|} h(w)}{\prod_{j=1}^{p} \partial w_j^{k_j}} \quad \text{where} \quad |k| = \sum_{j=1}^{p} k_j,$$
and for a matrix $A \in \mathbb{R}^{p \times p}$, let
$$\|A\| = \max_{1 \le i,j \le p} |a_{ij}|.$$
Lemma 2.6 If $h : \mathbb{R}^p \to \mathbb{R}$ has three bounded derivatives then
$$g(w) = -\int_0^{\infty} \bigl[ T_u h(w) - Nh \bigr]\, du \tag{2.20}$$
solves (2.18), and if the $k$th partial derivative of $h$ exists then
$$\|g^{(k)}\| \le \frac{1}{k} \|h^{(k)}\|.$$
Further, for any $\mu \in \mathbb{R}^p$ and positive definite $p \times p$ matrix $\Sigma$, $f$ defined by the change of variable
$$f(w) = g\bigl( \Sigma^{-1/2}(w - \mu) \bigr) \tag{2.21}$$
solves
$$\mathrm{Tr}\, \Sigma D^2 f(w) - (w - \mu) \cdot \nabla f(w) = h\bigl( \Sigma^{-1/2}(w - \mu) \bigr) - Nh, \tag{2.22}$$
and satisfies
$$\|f^{(k)}\| \le \frac{p^k}{k} \|\Sigma^{-1/2}\|^k \|h^{(k)}\|. \tag{2.23}$$
The operator $\mathcal{A}$ in (2.18) is the generator of the Ornstein–Uhlenbeck process in $\mathbb{R}^p$, whose stationary distribution is the standard normal. The operator $(T_u h)(w)$ in (2.17) is the expected value of $h$ evaluated at the position of the Ornstein–Uhlenbeck process at time $u$, when it has initial position $w$ at time 0. Equations of the form $\mathcal{A}g = h - Eh(Z)$ may be solved more generally by (2.20) when $\mathcal{A}$ is the generator of a Markov process with stationary distribution $Z$, see Ethier and Kurtz (1986). Indeed, the generator method may be employed to solve the Stein equation for distributions other than the normal, see, for instance, Barbour et al. (1992) for the Poisson, and Luk (1994) for the Gamma distribution. For the specific case at hand, the proof of Lemma 2.6 can be found in Barbour (1990), see equations (2.23) and (2.5), and also in Götze (1991). Essentially, following Barbour (1990) one shows that $g$ is a solution, and that under the assumptions above, differentiating (2.20) and applying the dominated convergence theorem yields
$$g^{(k)}(w) = -\int_0^{\infty} e^{-ku} E\Bigl[ h^{(k)}\bigl( we^{-u} + \sqrt{1 - e^{-2u}}\, Z \bigr) \Bigr]\, du.$$
The bounds then follow by straightforward calculations.
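As a small worked illustration of (2.17) through (2.20), take $p = 1$ and $h(w) = w^2$. This choice is ours, made only because every step is explicit; since $h'$ is unbounded it falls outside the stated hypotheses of Lemma 2.6, so the computation should be read as a formal check rather than an application of the lemma.
$$\begin{aligned}
(T_u h)(w) &= E\bigl( we^{-u} + \sqrt{1 - e^{-2u}}\, Z \bigr)^2 = w^2 e^{-2u} + 1 - e^{-2u}, \qquad Nh = EZ^2 = 1, \\
g(w) &= -\int_0^{\infty} (w^2 - 1) e^{-2u}\, du = -\tfrac{1}{2}(w^2 - 1), \\
g''(w) - wg'(w) &= -1 + w^2 = h(w) - Nh.
\end{aligned}$$
So this $g$ solves (2.19), and $|g''| = 1 = \tfrac{1}{2}\|h''\|$, in agreement with the derivative bound of Lemma 2.6 for $k = 2$.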
2.3 Construction of Stein Identities

Stein’s equation (2.4) is the starting point of Stein’s method. To prove that a mean zero, variance one random variable $W$ can be approximated by a standard normal distribution, that is, to show that $Eh(W) - Eh(Z)$ is small for some large class of functions $h$, rather than estimating this difference directly, we solve (2.4) for a given $h$ and show that $E[f'(W) - Wf(W)]$ is small instead. As we shall see, this latter quantity is often much easier to deal with than the former, as various identities and couplings may be applied to handle it. In essence Stein’s method shows that the distributions of two random variables are close by using the fact that they satisfy similar identities. For example, in Sect. 2.3.1, we demonstrate that when $W$ is the sum of independent mean zero random variables $\xi_1, \ldots, \xi_n$ whose variances sum to 1, then
$$E[Wf(W)] = Ef'\bigl( W^{(I)} + \xi_I^* \bigr)$$
where $W^{(I)}$ is the sum $W$ with a random summand $\xi_I$ removed, and $\xi_I^*$ is a random variable independent of $W^{(I)}$. Hence $W$ satisfies an identity very much like the characterization (2.1) for the normal.
We present four different approaches, or variations, for handling the Stein equation. Sect. 2.3.1 introduces the K function method when W is a sum of independent random variables. In Sect. 2.3.2 we present the exchangeable pair approach of Stein, which works well when W has a certain dependency structure. We then discuss the zero bias distribution and the associated transformation, which, in principle, may be applied for arbitrary mean zero random variables having finite variance. We note that the K function method of Sect. 2.3.1 and the zero bias method of Sect. 2.3.3 are essentially identical in the simple context of sums of independent random variables, but these approaches will later diverge. Size bias transformations, and some associated couplings, presented in Sect. 2.3.4, are closely related to those for zero biasing; the size bias method is most naturally applied to non-negative variables such as counts.
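2.3.1 Sums of Independent Random Variables

In this subsection we consider the most elementary case and apply Stein’s method to justify the normal approximation of the sum $W$ of independent random variables $\xi_1, \xi_2, \ldots, \xi_n$ satisfying
$$E\xi_i = 0, \quad 1 \le i \le n \qquad \text{and} \qquad \sum_{i=1}^{n} E\xi_i^2 = 1.$$
Set
$$W = \sum_{i=1}^{n} \xi_i \qquad \text{and} \qquad W^{(i)} = W - \xi_i,$$
and define
$$K_i(t) = E\bigl\{ \xi_i \bigl( \mathbf{1}_{\{0 \le t \le \xi_i\}} - \mathbf{1}_{\{\xi_i \le t < 0\}} \bigr) \bigr\}. \tag{2.24}$$
It is easy to check that $K_i(t) \ge 0$ for all real $t$, and that
$$\int_{-\infty}^{\infty} K_i(t)\, dt = E\xi_i^2 \qquad \text{and} \qquad \int_{-\infty}^{\infty} |t| K_i(t)\, dt = \frac{1}{2} E|\xi_i|^3. \tag{2.25}$$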
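The two integral identities for $K_i$ can be verified by hand for a discrete summand, and the following sketch does so numerically. The two-point distribution used (values $-2$ and $1$ with probabilities $1/3$ and $2/3$, which has mean zero) and all function names are illustrative assumptions of ours.

```python
def K(t, values, probs):
    """K(t) = E[xi (1{0 <= t <= xi} - 1{xi <= t < 0})] for a discrete xi."""
    total = 0.0
    for x, p in zip(values, probs):
        if 0.0 <= t <= x:
            total += p * x
        elif x <= t < 0.0:
            total -= p * x
    return total

def integrate(g, lo, hi, steps):
    """Midpoint rule on [lo, hi]."""
    dt = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * dt) for i in range(steps)) * dt

values, probs = (-2.0, 1.0), (1.0 / 3.0, 2.0 / 3.0)  # a mean zero two-point law
second_moment = sum(p * x ** 2 for x, p in zip(values, probs))
third_abs_moment = sum(p * abs(x) ** 3 for x, p in zip(values, probs))

mass = integrate(lambda t: K(t, values, probs), -2.0, 1.0, 3000)
first_abs = integrate(lambda t: abs(t) * K(t, values, probs), -2.0, 1.0, 3000)

print(mass, second_moment)              # integral of K against E xi^2
print(first_abs, 0.5 * third_abs_moment)  # integral of |t| K against E|xi|^3 / 2
```

Here $K$ is piecewise constant, so the midpoint rule is exact up to rounding; for a general distribution the same check would carry a discretization error.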
Let $h$ be a measurable function with $E|h(Z)| < \infty$, and let $f = f_h$ be the corresponding solution of the Stein equation (2.4). Our goal is to estimate
$$Eh(W) - Nh = E\bigl[ f'(W) - Wf(W) \bigr]. \tag{2.26}$$
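The argument below is fundamental to the K function approach, with many of the following tricks reappearing repeatedly in the sequel. Since $\xi_i$ and $W^{(i)}$ are independent for each $1 \le i \le n$, we have
$$E[Wf(W)] = \sum_{i=1}^{n} E[\xi_i f(W)] = \sum_{i=1}^{n} E\bigl\{ \xi_i \bigl[ f(W) - f(W^{(i)}) \bigr] \bigr\},$$
where the last equality follows because $E\xi_i = 0$. Writing the final difference in integral form, we thus have
$$\begin{aligned}
E[Wf(W)] &= \sum_{i=1}^{n} E\left\{ \xi_i \int_0^{\xi_i} f'(W^{(i)} + t)\, dt \right\} \\
&= \sum_{i=1}^{n} E\left\{ \int_{-\infty}^{\infty} f'(W^{(i)} + t)\, \xi_i \bigl( \mathbf{1}_{\{0 \le t \le \xi_i\}} - \mathbf{1}_{\{\xi_i \le t < 0\}} \bigr)\, dt \right\} \\
&= \sum_{i=1}^{n} \int_{-\infty}^{\infty} E\bigl[ f'(W^{(i)} + t) \bigr] K_i(t)\, dt,
\end{aligned} \tag{2.27}$$
from the definition of $K_i$ and again using independence. However, from
$$\sum_{i=1}^{n} \int_{-\infty}^{\infty} K_i(t)\, dt = \sum_{i=1}^{n} E\xi_i^2 = 1, \tag{2.28}$$
it follows that
$$Ef'(W) = \sum_{i=1}^{n} \int_{-\infty}^{\infty} E\bigl[ f'(W) \bigr] K_i(t)\, dt. \tag{2.29}$$
Thus, by (2.27) and (2.29),
$$E\bigl[ f'(W) - Wf(W) \bigr] = \sum_{i=1}^{n} \int_{-\infty}^{\infty} E\bigl[ f'(W) - f'(W^{(i)} + t) \bigr] K_i(t)\, dt. \tag{2.30}$$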
Since $K_i(t)$ is non-negative and $\int_{-\infty}^{\infty} K_i(t)\, dt = E\xi_i^2$, the ratio $K_i(t)/E\xi_i^2$ can be regarded as a probability density function. Let $\xi_i^*$, $i = 1, \ldots, n$ be independent random variables, independent of $\xi_j$ for $j \ne i$, having density function $K_i(t)/E\xi_i^2$ for each $i$. Let $I$ be a random index, independent of $\{\xi_i, \xi_i^*, i = 1, \ldots, n\}$ with distribution $P(I = i) = E\xi_i^2$. Then we may rewrite (2.27) as
$$E[Wf(W)] = Ef'\bigl( W^{(I)} + \xi_I^* \bigr)$$
and (2.30) as
$$E\bigl[ f'(W) - Wf(W) \bigr] = E\bigl[ f'(W) - f'\bigl( W^{(I)} + \xi_I^* \bigr) \bigr]. \tag{2.31}$$
Equations (2.27) and (2.30) play a key role in proving good normal approximations. Note in particular that (2.30) is an equality, and that (2.27) and (2.30) hold for all bounded absolutely continuous functions $f$. It is easy to see that bounds on the solution $f$ such as those furnished by Lemma 2.4 can now come into play to bound the expected difference in (2.30), and therefore the left hand side of (2.26).
2.3.2 Exchangeable Pairs

Suppose now that $W$ is an arbitrary random variable, in particular, not necessarily a sum. A number of variations of Stein’s method introduce an auxiliary random variable coupled to $W$ possessing certain properties. In the exchangeable pair approach (see Stein 1986), one constructs $W'$ on the same probability space as $W$ in such a way that $(W, W')$ is an exchangeable pair, that is, such that $(W, W') =_d (W', W)$, where $=_d$ signifies equality in distribution. The exchangeable pair approach makes essential use of the elementary fact that, if $(W, W')$ is an exchangeable pair, then
$$Eg(W, W') = 0 \tag{2.32}$$
for all antisymmetric measurable functions $g(x, y)$ such that the expected value above exists. The key identities applied in the exchangeable pair approach are given in Lemma 2.7, for which we require the following definition.

Definition 2.1 If the pair $(W, W')$ is exchangeable and satisfies the ‘linear regression condition’
$$E(W' \mid W) = (1 - \lambda)W \tag{2.33}$$
with $\lambda \in (0, 1)$, then we call $(W, W')$ a $\lambda$-Stein pair, or more simply, a Stein pair.

One heuristic explanation of why property (2.33) should be of any importance in normal approximation is that it is parallel to the conditional expectation property enjoyed by the bivariate normal distribution. That is, if $Z, Z'$ have the bivariate normal distribution then the conditional expectation of $Z'$ given $Z$ is linear, specifically
$$E(Z' \mid Z) = \mu_1 + \sigma_1 \rho\, \frac{Z - \mu_2}{\sigma_2},$$
where $\mu_1$ and $\mu_2$ are the means and $\sigma_1^2$ and $\sigma_2^2$ are the variances of $Z'$ and $Z$, respectively, and $\rho$ is the correlation coefficient. Hence, when $Z'$ and $Z$ have mean zero and equal variance, we obtain (2.33), $E(Z' \mid Z) = (1 - \lambda)Z$, with $\lambda = 1 - \rho$.
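Lemma 2.7 Let $(W, W')$ be a Stein pair and $\Delta = W - W'$. Then
$$EW = 0 \quad \text{and} \quad E\Delta^2 = 2\lambda EW^2 \quad \text{if } EW^2 < \infty. \tag{2.34}$$
Furthermore, when $EW^2 < \infty$, for every absolutely continuous function $f$ satisfying $|f(w)| \le C(1 + |w|)$, we have
$$E[Wf(W)] = \frac{1}{2\lambda} E\bigl\{ (W - W') \bigl( f(W) - f(W') \bigr) \bigr\}, \tag{2.35}$$
$$E[Wf(W)] = E\left\{ \int_{-\infty}^{\infty} f'(W + t) \hat{K}(t)\, dt \right\}, \tag{2.36}$$
and
$$Ef'(W) - E[Wf(W)] = E\left\{ f'(W) \left( 1 - \frac{\Delta^2}{2\lambda} \right) \right\} + E\left\{ \int_{-\infty}^{\infty} \bigl( f'(W) - f'(W + t) \bigr) \hat{K}(t)\, dt \right\}, \tag{2.37}$$
where
$$\hat{K}(t) = \frac{\Delta}{2\lambda} \bigl( \mathbf{1}_{\{-\Delta \le t \le 0\}} - \mathbf{1}_{\{0 < t \le -\Delta\}} \bigr) \tag{2.38}$$
satisfies
$$\int_{-\infty}^{\infty} \hat{K}(t)\, dt = \frac{\Delta^2}{2\lambda}. \tag{2.39}$$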
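Proof Taking expectation in (2.33) yields, by exchangeability, $EW = EW' = (1 - \lambda)EW$ so $EW = 0$. Furthermore, as
$$EW'W = E\bigl\{ E(W'W \mid W) \bigr\} = E\bigl\{ W E(W' \mid W) \bigr\} = (1 - \lambda)EW^2,$$
we have
$$E\Delta^2 = E(W' - W)^2 = 2EW^2 - 2EW'W = 2\lambda EW^2.$$
Next we exploit (2.32) with the antisymmetric function $g(x, y) = (x - y)(f(y) + f(x))$, for which $Eg(W, W')$ exists, because of the growth assumption on $f$. Identity (2.32) yields
$$\begin{aligned}
0 &= E\bigl\{ (W - W') \bigl( f(W') + f(W) \bigr) \bigr\} \\
&= E\bigl\{ (W - W') \bigl( f(W') - f(W) \bigr) \bigr\} + 2E\bigl\{ f(W)(W - W') \bigr\} \\
&= E\bigl\{ (W - W') \bigl( f(W') - f(W) \bigr) \bigr\} + 2E\bigl\{ f(W) E(W - W' \mid W) \bigr\} \\
&= E\bigl\{ (W - W') \bigl( f(W') - f(W) \bigr) \bigr\} + 2\lambda E\bigl\{ Wf(W) \bigr\},
\end{aligned}$$
this last by (2.33). Rearranging this equality yields (2.35), and now
$$\begin{aligned}
E[Wf(W)] &= \frac{1}{2\lambda} E\bigl\{ (W - W') \bigl( f(W) - f(W') \bigr) \bigr\} \\
&= \frac{1}{2\lambda} E\bigl\{ \Delta \bigl( f(W) - f(W - \Delta) \bigr) \bigr\} \\
&= \frac{1}{2\lambda} E\left\{ \Delta \int_{-\Delta}^{0} f'(W + t)\, dt \right\} \\
&= E\left\{ \int_{-\infty}^{\infty} f'(W + t) \hat{K}(t)\, dt \right\}.
\end{aligned} \tag{2.40}$$
This proves (2.36). Now note that integrating (2.38) yields (2.39) and so to prove (2.37), we need only observe that
$$Ef'(W) = E\left\{ f'(W) \left( 1 - \frac{\Delta^2}{2\lambda} \right) \right\} + E\left\{ \int_{-\infty}^{\infty} f'(W) \hat{K}(t)\, dt \right\}$$
and subtract using (2.40).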
As the linear regression condition (2.33) may at times be too restrictive, it can be replaced by
$$E(W - W' \mid W) = \lambda(W - R), \tag{2.41}$$
where $R$ is a random variable of small order. Following the proof of (2.36), if $W$ and $W'$ are mean zero exchangeable random variables with finite second moments, and (2.41) holds for some $\lambda \in (0, 1)$ and random variable $R$, then
$$E[Wf(W)] = E\left\{ \int_{-\infty}^{\infty} f'(W + t) \hat{K}(t)\, dt \right\} + E[Rf(W)], \tag{2.42}$$
with $\hat{K}(t)$ given by (2.38). We present three examples that give the flavor of the construction of exchangeable pairs; sometimes we will denote the pair by $(W', W'')$, instead of by $(W, W')$.

Example 2.1 (Independent random variables) Let $\{\xi_i, 1 \le i \le n\}$ be independent random variables with zero means and $\sum_{i=1}^{n} E\xi_i^2 = 1$, and put $W = \sum_{i=1}^{n} \xi_i$. Let $\{\xi_i', i = 1, \ldots, n\}$ be an independent copy of $\{\xi_i, i = 1, \ldots, n\}$, and let $I$ have uniform distribution on $\{1, 2, \ldots, n\}$, independent of $\{\xi_i, \xi_i', i = 1, \ldots, n\}$. Define $W' = W - \xi_I + \xi_I'$. Then $(W, W')$ is an exchangeable pair, and it is easy to verify
$$E(W' \mid W) = \left( 1 - \frac{1}{n} \right) W,$$
so that (2.33) is satisfied with $\lambda = 1/n$. The exchangeable pair above is a special case of the following general construction.
Example 2.2 (Exchangeable pair by substitution) Let $W = g(\xi_1, \ldots, \xi_n)$, and let $\xi_i'$ have the conditional distribution of $\xi_i$ given $\xi_j$, $1 \le j \ne i \le n$. Let $I$ be a random index uniformly distributed over $\{1, \ldots, n\}$, independent of $\{\xi_i, \xi_i', i = 1, \ldots, n\}$. Define $W' = g(\xi_1, \ldots, \xi_{I-1}, \xi_I', \xi_{I+1}, \ldots, \xi_n)$. That is, in the definition of $W'$, the variable $\xi_I$ is replaced by $\xi_I'$ while the other variables remain the same. Then $(W, W')$ is an exchangeable pair. We note that, unlike in Example 2.1, the linearity condition (2.33) is not automatically satisfied.

Example 2.3 (Combinatorial Central Limit Theorem) For a given array $\{a_{ij}\}_{1 \le i,j \le n}$ of real numbers and $\pi = \pi'$ a random permutation, let
\[ Y' = \sum_{i=1}^n a_{i,\pi'(i)}. \tag{2.43} \]
Classically, $\pi$ is taken to be uniformly distributed over the symmetric group $S_n$; we specialize to that case here, and study it in Sects. 4.4 and 6.1.1, but also consider alternative permutation distributions in Sect. 6.1.2. Let
\[ a_{\cdot\cdot} = \frac{1}{n^2} \sum_{i,j=1}^n a_{ij}, \qquad a_{i\cdot} = \frac{1}{n} \sum_{j=1}^n a_{ij} \qquad \text{and} \qquad a_{\cdot j} = \frac{1}{n} \sum_{i=1}^n a_{ij}. \tag{2.44} \]
Using that $\pi'$ is uniform one easily obtains $EY' = n a_{\cdot\cdot} = \sum_i a_{i\cdot} = \sum_i a_{\cdot\pi'(i)}$, and therefore
\[ Y' - EY' = \sum_{i=1}^n (a_{i,\pi'(i)} - a_{\cdot\cdot}) = \sum_{i=1}^n (a_{i,\pi'(i)} - a_{i\cdot} - a_{\cdot\pi'(i)} + a_{\cdot\cdot}). \tag{2.45} \]
As our goal is to derive bounds to the normal for the standardized variable $(Y' - EY')/\sqrt{\mathrm{Var}(Y')}$, without loss of generality we may replace $a_{ij}$ by $a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}$, and assume
\[ a_{i\cdot} = a_{\cdot j} = a_{\cdot\cdot} = 0. \tag{2.46} \]
Let $\tau_{ij}$ be the permutation that transposes $i$ and $j$, let $\pi'' = \pi'\tau_{ij}$, and let $Y''$ be given by (2.43) with $\pi'$ replaced by $\pi''$. Since $\pi''(k) = \pi'(k)$ for $k \notin \{i, j\}$ while $\pi''(i) = \pi'(j)$ and $\pi''(j) = \pi'(i)$, we have
\[ Y'' - Y' = b\bigl(i, j, \pi'(i), \pi'(j)\bigr), \tag{2.47} \]
where $b(i, j, k, l) = a_{il} + a_{jk} - (a_{ik} + a_{jl})$.

Taking $(I, J)$ to be independent of $\pi'$, with the uniform distribution over all pairs satisfying $1 \le I \ne J \le n$, the permutations $\pi'$ and $\pi'' = \pi'\tau_{IJ}$ are exchangeable, and hence so are $Y'$ and $Y''$. To prove that the linear regression property (2.33) is satisfied, write
\[ Y'' - Y' = (a_{I,\pi'(J)} + a_{J,\pi'(I)}) - (a_{I,\pi'(I)} + a_{J,\pi'(J)}). \tag{2.48} \]
Taking the conditional expectation given $\pi'$, using (2.46), we obtain
\begin{align*}
E(Y'' - Y' \mid \pi') &= 2\left(-\frac{1}{n} \sum_{i=1}^n a_{i,\pi'(i)} + \frac{1}{n(n-1)} \sum_{i \ne j} a_{i,\pi'(j)}\right) \\
 &= -2\left(\frac{1}{n} \sum_{i=1}^n a_{i,\pi'(i)} + \frac{1}{n(n-1)} \sum_{i=1}^n a_{i,\pi'(i)}\right) = -\frac{2}{n-1} Y'.
\end{align*}
As the right hand side is measurable with respect to $Y'$, we conclude that
\[ E(Y'' \mid Y') = \left(1 - \frac{2}{n-1}\right) Y', \]
demonstrating that $Y', Y''$ is a $2/(n-1)$-Stein pair.

One particular special case of note is when $a_{ij} = b_i c_j$, where $b_1, \ldots, b_n$ are any real numbers and the values $c_j \in \{0, 1\}$, $j = 1, \ldots, n$ satisfy
\[ \sum_{j=1}^n c_j = m. \]
In this case, as any set of $m$ values from $\{b_1, \ldots, b_n\}$ is as likely to be summed to yield $Y'$ as any other set of that same size, $Y'$ is the sum of a simple random sample of size $m$ from a population whose numerical characteristics are given by $\{b_i, i = 1, \ldots, n\}$.

It is worth mentioning a connection between the exchangeable pair and the generator approach which gave the solutions and bounds to the Stein equation in Lemma 2.6. To see the connection, let $(W, W')$ be a $\lambda$-Stein pair and rewrite
\[ E(W' \mid W) = (1 - \lambda)W \quad \text{as} \quad E(W' - W \mid W) = -\lambda W. \]
If one can construct a sequence $W_1, W_2, \ldots$ such that
\[ (W_t, W_{t+1}) =_d (W, W') \quad \text{for } t = 1, 2, \ldots, \]
then $E(W_{t+1} - W_t \mid W_t) = -\lambda W_t$, and so, with $\Delta W_t = W_{t+1} - W_t$, we have
\[ \Delta W_t = -\lambda W_t + \epsilon_t \quad \text{where } E[\epsilon_t \mid W_t] = 0, \]
a recursion reminiscent of the stochastic differential equation for the Ornstein–Uhlenbeck process,
\[ dW_t = -\lambda W_t\,dt + \sigma\,dB_t, \]
where $B_t$ is a Brownian motion. It is sometimes possible to produce the sequence $W_1, W_2, \ldots$ as the successive states of a reversible Markov chain in stationarity. Or, looking at this construction in another way, for a given $W$ of interest, one may be able to create a Stein pair by constructing a reversible Markov chain with stationary distribution that of $W$.

As an illustration, consider the sum $Y$ of a simple random sample $S = \{X_1, \ldots, X_n\}$ of size $n$ of $N$ population characteristics $A = \{a_1, \ldots, a_N\}$ which have been centered to satisfy
\[ \sum_{i=1}^N a_i = 0. \tag{2.49} \]
Given a simple random sample $S_0$, one may construct a Markov chain $S_0, S_1, \ldots$, whose state space consists of all size $n$ subsets of $A$, by interchanging at time step $k$ a randomly chosen element of $S_k$ with one from the complement of $S_k$ to form $S_{k+1}$. The chain is in equilibrium and is reversible, hence the sets $S_k$, $k = 0, 1, \ldots$ are identically distributed, and $(S_k, S_{k+1})$ is exchangeable. In particular, the sums $Y_k$ and $Y_{k+1}$ of $S_k$ and $S_{k+1}$, respectively, are exchangeable and have the same distribution as $Y$. This construction is, essentially, the one used in Theorem 4.10, and it is shown there that the linearity condition (2.33) holds under the centering (2.49). This method for the construction of exchangeable pairs features prominently in the analysis of the anti-voter model in Sect. 6.4.
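The linearity of the swap chain can be illustrated by exact enumeration on a small population. This sketch assumes that both the removed element and its replacement are chosen uniformly; the value $\lambda = N/(n(N - n))$ used below is derived for this sketch from that assumption and is not quoted from the text, and the population values and the name swap_step_regression are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def swap_step_regression(population, n):
    """For every size-n subset S, compare E(Y' | S) after one uniform swap with (1 - lam) * Y."""
    N = len(population)
    assert sum(population) == 0, "population must be centered as in (2.49)"
    lam = Fraction(N, n * (N - n))      # assumption of this sketch, see lead-in
    indices = range(N)
    linear = True
    for subset in combinations(indices, n):
        inside = set(subset)
        outside = [j for j in indices if j not in inside]
        y = sum(population[i] for i in subset)
        # Average Y' over the n * (N - n) equally likely swaps (i out, j in).
        total = Fraction(0)
        for i in subset:
            for j in outside:
                total += y - population[i] + population[j]
        cond_mean = total / (n * (N - n))
        linear = linear and (cond_mean == (1 - lam) * y)
    return lam, linear


population = [Fraction(v) for v in (-5, -2, 0, 1, 2, 4)]
lam, linear = swap_step_regression(population, n=3)
print(lam, linear)
```

Conditioning on the whole sample is finer than conditioning on its sum, so linearity given the sample implies the regression property given $Y$.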
2.3.3 Zero Bias

Stein's characterization (2.1) of the standard normal $Z$ can be easily extended to the mean zero normal family in general. In particular, a simple change of variable in Lemma 2.1 shows that $X$ is $N(0, \sigma^2)$ if and only if
\[ \sigma^2 Ef'(X) = E\{Xf(X)\} \tag{2.50} \]
for all absolutely continuous functions $f$ for which these expectations exist. Though the left and right hand sides of (2.50) will only be equal at the normal, one can create an identity in the same spirit that holds more generally. In particular, as introduced in Goldstein and Reinert (1997), given $X$ with mean zero and variance $\sigma^2$, we say that $X^*$ has the $X$-zero bias distribution if
\[ \sigma^2 Ef'(X^*) = E\{Xf(X)\} \tag{2.51} \]
for all absolutely continuous functions $f$ for which these expectations exist. It is convenient to regard (2.51) as giving rise to a transformation mapping the distribution of $X$ to that of $X^*$. Indeed, the characterization in Lemma 2.1 can be restated as saying that the normal distribution is the unique fixed point of the zero bias transformation.

It is the uniqueness of the fixed point of the zero bias transformation, that is, the fact that $X^*$ has the same distribution as $X$ only when $X$ is normal, that provides a probabilistic reason for a normal approximation to hold. If the distribution of a random variable $X$ gets mapped to an $X^*$ which is close in distribution to $X$, then $X$ is close to the zero bias transformation's unique fixed point, that is, close to the normal. This same reasoning indicates that not only should a normal approximation be justified whenever the distribution of $X$ is close to that of $X^*$, but that the quality of the approximation can be measured in terms of their distance. Though this claim will later be made precise in a number of ways, for now one can see how it might
be formalized by observing that a coupling of a mean zero, variance one $W$ to such a $W^*$ can be used in the Stein equation (2.4) as
\[ Eh(W) - Nh = E\{f'(W) - Wf(W)\} = E\{f'(W) - f'(W^*)\}. \]
Hence, when $W$ and $W^*$ are close, the right hand side, and so also the left hand side, will be small.

While the zero bias transformation fixes the mean zero normal, for non-normal distributions, in some sense, the transformation moves them closer to normality. For example, let $\xi \in \{0, 1\}$ be a Bernoulli random variable with success probability $p \in (0, 1)$. Centering $\xi$ to form the mean zero discrete random variable $X = \xi - p$ having variance $\sigma^2 = p(1 - p)$, substitution into the right hand side of (2.51) yields
\begin{align*}
E\{Xf(X)\} &= E\{(\xi - p)f(\xi - p)\} \\
 &= p(1 - p)f(1 - p) - (1 - p)pf(-p) \\
 &= \sigma^2\bigl(f(1 - p) - f(-p)\bigr) \\
 &= \sigma^2 \int_{-p}^{1-p} f'(u)\,du \\
 &= \sigma^2 Ef'(U),
\end{align*}
for $U$ uniformly distributed over $[-p, 1 - p]$. Hence, with $=_d$ indicating the equality of two random variables in distribution, and $\mathcal{U}[a, b]$ denoting the uniform distribution on the finite interval $[a, b]$,
\[ (\xi - p)^* =_d U \quad \text{where } U \sim \mathcal{U}[-p, 1 - p]. \tag{2.52} \]
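The Bernoulli computation leading to (2.52) can be replayed in exact arithmetic for polynomial test functions. The sketch below is illustrative only: the helper names uniform_moment and check_bernoulli_zero_bias, the value p = 3/10 and the polynomial coefficients are choices made here, not taken from the text.

```python
from fractions import Fraction


def uniform_moment(a, b, m):
    """E[U**m] for U uniform on [a, b]."""
    return (b ** (m + 1) - a ** (m + 1)) / ((m + 1) * (b - a))


def check_bernoulli_zero_bias(p, coeffs):
    """Return (E[X f(X)], sigma^2 E f'(U)) for X = xi - p and f(x) = sum_j coeffs[j] * x**j."""
    q = 1 - p
    sigma2 = p * q
    f = lambda x: sum(c * x ** j for j, c in enumerate(coeffs))
    # X equals 1 - p with probability p and -p with probability 1 - p.
    lhs = p * q * f(q) + q * (-p) * f(-p)
    # f'(x) = sum_j j * coeffs[j] * x**(j - 1), integrated against U ~ U[-p, 1 - p].
    rhs = sigma2 * sum(j * c * uniform_moment(-p, q, j - 1)
                       for j, c in enumerate(coeffs) if j >= 1)
    return lhs, rhs


lhs, rhs = check_bernoulli_zero_bias(Fraction(3, 10), [2, -1, 0, 5, 1])
print(lhs, rhs)
```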
As hinted at by the Bernoulli example, the following proposition shows that the zero bias distribution exists and is absolutely continuous for every $X$ having mean zero and some finite, positive variance.

Proposition 2.1 Let $X$ be a random variable with mean zero and finite positive variance $\sigma^2$. Then there exists a unique distribution for $X^*$ such that
\[ \sigma^2 Ef'(X^*) = E\{Xf(X)\} \tag{2.53} \]
for every absolutely continuous function $f$ for which $E|Xf(X)| < \infty$. Moreover, the distribution of $X^*$ is absolutely continuous with density
\[ p^*(x) = E\{X 1(X > x)\}/\sigma^2 = -E\{X 1(X \le x)\}/\sigma^2 \tag{2.54} \]
and distribution function
\[ G^*(x) = E\{X(X - x)1(X \le x)\}/\sigma^2. \tag{2.55} \]

Proof We prove the claims assuming $\sigma^2 = 1$, the extension to the general case being straightforward. First, regarding (2.54), we note that the second equality holds since $EX = 0$. It follows that $p^*(x)$ is nonnegative, using the first form for $x \ge 0$, and the second for $x < 0$.
To prove that we may write $E[Xf(X)]$ as the expectation on the left hand side of (2.53), in terms of an absolutely continuous variable $X^*$ with density $p^*(x)$, let $f(x) = \int_0^x g$ with $g$ a nonnegative function which is integrable on compact domains. Then by Fubini's theorem,
\begin{align*}
\int_0^\infty f'(u)E\{X 1(X > u)\}\,du &= \int_0^\infty g(u)E\{X 1(X > u)\}\,du \\
 &= E\left\{X \int_0^\infty g(u)1(X > u)\,du\right\} \\
 &= E\left\{X \int_0^{X \vee 0} g(u)\,du\right\} \\
 &= E\{Xf(X)1(X \ge 0)\}.
\end{align*}
A similar argument over $(-\infty, 0]$ yields
\[ \int_{-\infty}^\infty f'(u)E\{X 1(X > u)\}\,du = E\{Xf(X)\}, \tag{2.56} \]
where both sides may be $+\infty$. If $f(x) = \int_0^x g$ with $E|Xf(X)| < \infty$, then taking the difference of the contributions from the positive and negative parts of $g$ shows that (2.56) continues to hold over this larger class of functions, as it does for $f$ satisfying the conditions of the proposition by writing $f(x) = \int_0^x g + f(0)$ and using that the mean of $X$ is zero. Taking $f(x) = x$ shows that $p^*(x)$ integrates to one and is therefore a density, whence the left hand side of (2.56) may be written as $Ef'(X^*)$ for $X^*$ with density $p^*(x)$. The distribution of $X^*$ is clearly unique, as $Ef'(X^*) = Ef'(Y^*)$ for all, say, continuously differentiable functions $f$ with compact support, implies $X^* =_d Y^*$.

Integrating the density $p^*$ to obtain the distribution function $G^*$, we have
\begin{align*}
G^*(x) &= -E\left\{X \int_{-\infty}^x 1(X \le u)\,du\right\} \\
 &= -E\left\{X \int_X^x du\, 1(X \le x)\right\} \\
 &= E\{X(X - x)1(X \le x)\}.
\end{align*}

The characterization (2.51) also specifies a relationship between the moments of $X$ and $X^*$. One of the most useful of these relations is the one which results from applying (2.51) with $f(x) = (1/2)x^2\,\mathrm{sgn}(x)$, for which $f'(x) = |x|$, yielding
\[ \sigma^2 E|X^*| = \frac{1}{2} E|X|^3 \quad \text{where } \sigma^2 = \mathrm{Var}(X). \tag{2.57} \]
In particular, we see that $E|X|^3 < \infty$ if and only if $E|X^*| < \infty$.

We have observed that the zero bias distribution of a mean zero Bernoulli variable with support $\{-p, 1 - p\}$ is uniform on $[-p, 1 - p]$, and it is easy to see from (2.54) that, more generally, if $x$ is such that $P(X > x) = 0$, then the same holds for all $y > x$, and $p^*(y) = 0$ for all such $y$, while if $x$ is such that $P(X > x) > 0$ then $p^*(x) > 0$. As similar statements hold when considering $x$ for which $P(X \le x) = 0$, letting $\mathrm{support}(X)$ be the support of the distribution of $X$, if
\[ a = \inf \mathrm{support}(X) \quad \text{and} \quad b = \sup \mathrm{support}(X) \]
are finite then $\mathrm{support}(X^*) = [a, b]$. One can verify that the support continues to be given by this relation, with any closed endpoint replaced by the corresponding open one, when any of the values of $a$ or $b$ are infinite. One consequence of this fact is that if $X$ is bounded by some constant then $X^*$ is also bounded by the same constant, that is,
\[ |X| \le C \quad \text{implies} \quad |X^*| \le C. \tag{2.58} \]
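For a discrete mean zero $X$, the density (2.54) is piecewise constant between consecutive atoms, which makes the normalization, the moment identity (2.57) and the support claim easy to verify exactly. The following is a sketch only; the function name zero_bias_density and the four-point distribution are hypothetical choices made for illustration.

```python
from fractions import Fraction


def zero_bias_density(atoms, probs):
    """Pieces (left, right, height) of p*(x) = E[X 1(X > x)] / sigma^2 for a discrete mean zero X."""
    pairs = sorted(zip(atoms, probs))
    assert sum(p for _, p in pairs) == 1
    assert sum(x * p for x, p in pairs) == 0, "X must have mean zero"
    sigma2 = sum(x * x * p for x, p in pairs)
    pieces = []
    for j in range(len(pairs) - 1):
        # For x in [x_j, x_{j+1}), E[X 1(X > x)] sums over the atoms above x_j.
        tail = sum(x * p for x, p in pairs[j + 1:])
        pieces.append((pairs[j][0], pairs[j + 1][0], tail / sigma2))
    return pieces, sigma2


atoms = [Fraction(-2), Fraction(-1), Fraction(0), Fraction(3)]
probs = [Fraction(1, 4)] * 4
pieces, sigma2 = zero_bias_density(atoms, probs)

total_mass = sum(h * (right - left) for left, right, h in pieces)
# The antiderivative of |u| is u * |u| / 2.
mean_abs_star = sum(h * (right * abs(right) - left * abs(left)) / 2 for left, right, h in pieces)
third_abs_moment = sum(abs(x) ** 3 * p for x, p in zip(atoms, probs))
print(pieces, total_mass, mean_abs_star)
```

The pieces cover exactly the interval between the smallest and largest atoms, in line with the statement on the support of $X^*$.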
The zero bias transformation enjoys the following scaling, or linearity property. If $X$ is a mean zero random variable with finite variance, and $X^*$ has the $X$-zero biased distribution, then for all $a \ne 0$
\[ (aX)^* =_d aX^*. \tag{2.59} \]
The verification of this claim follows directly from (2.51), as letting $\sigma^2 = \mathrm{Var}(X)$ and $g(x) = f(ax)$, we find
\[ (a\sigma)^2 Ef'(aX^*) = a\sigma^2 Eg'(X^*) = aE\{Xg(X)\} = E\{(aX)f(aX)\} = (a\sigma)^2 Ef'\bigl((aX)^*\bigr). \]
But by far the most important properties of the zero bias transformation are those like the ones given in the following lemma.

Lemma 2.8 Let $\xi_i$, $i = 1, \ldots, n$ be independent mean zero random variables with $\mathrm{Var}(\xi_i) = \sigma_i^2$ summing to 1. Let $\xi_i^*$ have the $\xi_i$-zero bias distribution with $\xi_i^*$, $i = 1, \ldots, n$ mutually independent, and $\xi_i^*$ independent of $\xi_j$ for all $j \ne i$. Further, let $I$ be a random index, independent of $\xi_i, \xi_i^*$, $i = 1, \ldots, n$, with distribution
\[ P(I = i) = \sigma_i^2. \tag{2.60} \]
Then
\[ W^* =_d W - \xi_I + \xi_I^*, \tag{2.61} \]
where $W^*$ has the $W$-zero bias distribution.

In other words, upon replacing the variable $\xi_I$ by $\xi_I^*$ in the sum $W = \sum_{i=1}^n \xi_i$ we obtain a variable with the $W$-zero bias distribution. The distributional identity (2.61) indicates that a normal approximation is justified when the difference $\xi_I - \xi_I^*$ is small, since then the distribution of $W$ will be close to that of $W^*$. To prepare for the proof, note that we may write the variables $\xi_I$ and $\xi_I^*$ selected by $I$ using indicators as follows
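\[ \xi_I = \sum_{i=1}^n 1\{I = i\}\xi_i \quad \text{and} \quad \xi_I^* = \sum_{i=1}^n 1\{I = i\}\xi_i^*, \]
from which it is clear, writing $\mathcal{L}$ for the distribution, or law, of a random variable, that the distributions of $\xi_I$ and $\xi_I^*$ are the mixtures
\[ \mathcal{L}(\xi_I) = \sum_{i=1}^n \mathcal{L}(\xi_i)\sigma_i^2 \quad \text{and} \quad \mathcal{L}(\xi_I^*) = \sum_{i=1}^n \mathcal{L}(\xi_i^*)\sigma_i^2. \]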
Proof Let $W^*$ have the $W$-zero bias distribution. Then for all absolutely continuous functions $f$ for which the following expectations exist,
\begin{align*}
Ef'(W^*) &= E\{Wf(W)\} \\
 &= E \sum_{i=1}^n \xi_i f(W) \\
 &= \sum_{i=1}^n E\bigl\{\xi_i f\bigl((W - \xi_i) + \xi_i\bigr)\bigr\} \\
 &= \sum_{i=1}^n E\bigl\{\sigma_i^2 f'(W - \xi_i + \xi_i^*)\bigr\} \\
 &= E \sum_{i=1}^n f'(W - \xi_i + \xi_i^*)1(I = i) \\
 &= Ef'(W - \xi_I + \xi_I^*),
\end{align*}
where independence is used in the fourth and fifth equalities. The equality of the expectations of $W^*$ and $W - \xi_I + \xi_I^*$ over this class of functions is sufficient to guarantee (2.61), that is, that these two random variables have the same distribution, as in the proof of Proposition 2.1.

When handling the sum of independent random variables, the zero bias method and the $K$ function approach of Sect. 2.3.1 are essentially equivalent, with the former providing a probabilistic formulation of the latter. To begin to see the connection, note that by (2.54) and (2.24) the zero bias density $p^*(t)$ and the $K(t)$ function are almost sure multiples, $p^*(t) = K(t)/\sigma^2$. In particular, by Lemma 2.8, with $K_i(t)$ the function (2.24) corresponding to $\xi_i$, integrating against the density of $\xi_i^*$ yields
\[ Ef'(W^*) = \sum_{i=1}^n Ef'\bigl(W^{(i)} + \xi_i^*\bigr)\sigma_i^2 = \sum_{i=1}^n \int_{-\infty}^\infty Ef'\bigl(W^{(i)} + t\bigr)K_i(t)\,dt. \tag{2.62} \]
Likewise, that $p^*(x)$ is a density function, and the moment identity (2.57), are probabilistic interpretations of the two equalities in (2.25), respectively, in terms of
random variables. In addition, we note the correspondence between Lemma 2.8 and identity (2.31).

To later explore the relationship between the zero bias method and the general Stein identity in Sect. 2.4, note now that if $W$ and $W^*$ are defined on the same space then trivially from the defining zero bias identity (2.51) we have
\[ E\{Wf(W)\} = Ef'(W + \Delta) \quad \text{where } \Delta = W^* - W. \]
Though the $K$ function approach and zero biasing are essentially completely parallel when dealing with sums of independent variables, these two views each give rise to useful, and separate, ways of handling different classes of examples. In addition to its ties to the $K$ function approach, we will see in Proposition 4.6 that zero biasing is also connected to the exchangeable pair.
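Lemma 2.8 can be verified exactly for a sum of scaled, centered Bernoulli variables, since by (2.52) and (2.59) each $\xi_i^*$ is then uniform on an interval. The sketch below assumes summands of the form $\xi = c(B - p)$ with $c > 0$ and variances summing to one; the name lemma_2_8_sides, the two summands and the test function $f(w) = w^3$ are illustrative choices. It uses the fact that $Ef'(v + U) = (f(v + b) - f(v + a))/(b - a)$ for $U$ uniform on $[a, b]$.

```python
from fractions import Fraction
from itertools import product


def lemma_2_8_sides(summands, f):
    """summands: list of (c, p) with c > 0, meaning xi = c * (B - p) for B ~ Bernoulli(p).

    Returns (E[W f(W)], E f'(W - xi_I + xi_I^*)), the second computed by mixing over I.
    """
    n = len(summands)
    variances = [c * c * p * (1 - p) for c, p in summands]
    assert sum(variances) == 1

    def outcomes(skip=None):
        """Yield (value, probability) for the sum of all summands except index `skip`."""
        idx = [i for i in range(n) if i != skip]
        for bits in product((0, 1), repeat=len(idx)):
            value, prob = Fraction(0), Fraction(1)
            for i, b in zip(idx, bits):
                c, p = summands[i]
                value += c * (b - p)
                prob *= p if b else 1 - p
            yield value, prob

    lhs = sum(prob * w * f(w) for w, prob in outcomes())
    rhs = Fraction(0)
    for i, (c, p) in enumerate(summands):
        a, b = -c * p, c * (1 - p)          # xi_i^* ~ U[a, b] by (2.52) and (2.59)
        inner = sum(prob * (f(v + b) - f(v + a)) / (b - a)
                    for v, prob in outcomes(skip=i))
        rhs += variances[i] * inner          # P(I = i) = sigma_i^2 as in (2.60)
    return lhs, rhs


summands = [(Fraction(1), Fraction(1, 2)), (Fraction(2), Fraction(1, 4))]
lhs, rhs = lemma_2_8_sides(summands, lambda w: w ** 3)
print(lhs, rhs)
```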
2.3.4 Size Bias

The size bias and zero bias transformations are close relatives, and as such, size bias and zero bias couplings can be used in the Stein equation in somewhat similar manners. The size bias transformation is defined on the class of non-negative random variables $X$ with finite non-zero means. For such an $X$ with mean $EX = \mu$, we say $X^s$ has the $X$-size biased distribution if for all functions $f$ for which $E[Xf(X)]$ exists,
\[ E\{Xf(X)\} = \mu Ef(X^s). \tag{2.63} \]
We note that this characterization for size biasing is of the same form as (2.51) for zero biasing, but with the mean replacing the variance, and $f$ replacing $f'$ for the evaluation of the biased variable.

To place size biasing in the framework of Sect. 2.4 to follow, we note that when $\mathrm{Var}(X) = \sigma^2$ and $W = (X - \mu)/\sigma$, and, with a slight abuse of notation, $W^s = (X^s - \mu)/\sigma$, if $X$ and $X^s$ are defined on the same space, identity (2.63) can be written
\[ E\{Wf(W)\} = \frac{\mu}{\sigma} E\{f(W^s) - f(W)\} = E \int_{-\infty}^\infty f'(W + t)\hat{K}(t)\,dt, \tag{2.64} \]
where
\[ \hat{K}(t) = \frac{\mu}{\sigma}\bigl(1_{\{0 \le t \le W^s - W\}} - 1_{\{W^s - W \le t < 0\}}\bigr). \tag{2.65} \]
The characterization (2.63) is easily seen to be the same as the more common specification of the size bias distribution $F^s(x)$ as the one which is absolutely continuous with respect to the distribution $F(x)$ of $X$ with Radon–Nikodym derivative
\[ \frac{dF^s(x)}{dF(x)} = \frac{x}{\mu}. \tag{2.66} \]
Hence, parallel to property (2.58) for zero bias, here we have
\[ 0 \le X \le C \quad \text{implies} \quad 0 \le X^s \le C. \tag{2.67} \]
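For a discrete nonnegative $X$, the Radon–Nikodym form (2.66) says that the size biased probability mass at $x$ is $x P(X = x)/\mu$, and (2.63) can then be checked directly. A minimal sketch follows; the function name size_bias_pmf, the three-point distribution and the test function are hypothetical choices made for illustration.

```python
from fractions import Fraction


def size_bias_pmf(pmf):
    """pmf: dict value -> probability for a nonnegative X with finite nonzero mean.

    Returns the probability mass function of X^s, reweighted by x / mu as in (2.66), and mu.
    """
    assert all(x >= 0 for x in pmf)
    mu = sum(x * p for x, p in pmf.items())
    assert mu > 0
    return {x: x * p / mu for x, p in pmf.items() if x > 0}, mu


pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}
biased, mu = size_bias_pmf(pmf)

f = lambda x: x * x + 1
lhs = sum(x * f(x) * p for x, p in pmf.items())      # E[X f(X)]
rhs = mu * sum(f(x) * q for x, q in biased.items())  # mu * E f(X^s)
print(biased, lhs, rhs)
```

Since the reweighting only moves mass among the atoms of $X$, the bound in (2.67) is immediate in this discrete setting.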
Moreover, if $X$ is absolutely continuous with density $p(x)$, then $X^s$ is also absolutely continuous, and has density $xp(x)/\mu$. Size biasing also enjoys a scaling property. If $X^s$ has the $X$-size bias distribution, then for $a > 0$
\[ (aX)^s =_d aX^s \]
by an argument nearly identical to the one that proves (2.59).

Size biasing can occur, possibly unwanted, when applying various sampling designs where items associated with larger outcomes are more likely to be chosen. For instance, when sampling an individual in a population at random, their report of the number of siblings in their family is size biased. Size biasing is also responsible for the well known waiting time paradox (see Feller 1968b), but can also be used to advantage, in particular, to form unbiased ratio estimates (Midzuno 1951).

Lemma 2.8 carries over with only minor changes when replacing zero biasing by size biasing, though the variable replaced is now selected proportional to its mean, rather than its variance. Moreover, the size bias construction generalizes easily to the case where the sum is of dependent random variables. In particular, let $\mathbf{X} = \{X_\alpha, \alpha \in A\}$ be a collection of nonnegative random variables with finite, nonzero means $\mu_\alpha = EX_\alpha$. For $\alpha \in A$, we say that $\mathbf{X}^\alpha$ has the $\mathbf{X}$ distribution biased in direction, or coordinate, $\alpha$ if
\[ E\{X_\alpha f(\mathbf{X})\} = \mu_\alpha Ef(\mathbf{X}^\alpha) \tag{2.68} \]
for all real valued functions $f$ for which the expectation of the left hand side exists. Parallel to (2.66), if $F(\mathbf{x})$ is the distribution of $\mathbf{X}$, then the distribution $F^\alpha(\mathbf{x})$ of $\mathbf{X}^\alpha$ satisfies
\[ \frac{dF^\alpha(\mathbf{x})}{dF(\mathbf{x})} = \frac{x_\alpha}{\mu_\alpha}. \tag{2.69} \]
By considering functions $f$ which depend only on $x_\alpha$, it is easy to verify that $X_\alpha^\alpha =_d X_\alpha^s$, that is, that $X_\alpha^\alpha$ has the $X_\alpha$-size biased distribution. A consequence of the following proposition is a method for size biasing sums of dependent variables.

Proposition 2.2 Let $A$ be an arbitrary index set, and let $\mathbf{X} = \{X_\alpha, \alpha \in A\}$ be a collection of nonnegative random variables with finite means. For any subset $B \subset A$, set
\[ X_B = \sum_{\beta \in B} X_\beta \quad \text{and} \quad \mu_B = EX_B. \]
Suppose $B \subset A$ with $0 < \mu_B < \infty$, and for $\beta \in B$ let $\mathbf{X}^\beta$ have the $\mathbf{X}$-size biased distribution in coordinate $\beta$ as in (2.68). Let $I$ be a random index, independent of $\mathbf{X}$, with distribution
\[ P(I = \beta) = \frac{\mu_\beta}{\mu_B}. \]
Then $\mathbf{X}^B = \mathbf{X}^I$, that is, the collection $\mathbf{X}^B$ which is equal to $\mathbf{X}^\beta$ with probability $\mu_\beta/\mu_B$, satisfies
\[ E\{X_B f(\mathbf{X})\} = \mu_B Ef(\mathbf{X}^B) \tag{2.70} \]
for all real valued functions $f$ for which these expectations exist. If $f$ is a function of $X_A = \sum_{\alpha \in A} X_\alpha$ only, then
\[ E\{X_B f(X_A)\} = \mu_B Ef(X_A^B), \quad \text{where } X_A^B = \sum_{\alpha \in A} X_\alpha^B, \]
and when $A = B$ we have $E\{X_A f(X_A)\} = \mu_A Ef(X_A^A)$, and that $X_A^A$ has the $X_A$-size biased distribution.

Proof Without loss of generality, assume $\mu_\beta > 0$ for all $\beta \in A$. By (2.68) we have
\[ E\{X_\beta f(\mathbf{X})\}/\mu_\beta = Ef(\mathbf{X}^\beta). \]
Multiplying by $\mu_\beta/\mu_B$, summing over $\beta \in B$ and recalling that $\mathbf{X}^B$ is a mixture yields (2.70). The remainder of the proposition now follows as special cases.

By the last claim of the proposition, to achieve the size bias distribution of the sum $X_A = \sum_{\alpha \in A} X_\alpha$ of all the variables in the collection, one mixes over the distributions of $X_A^\beta = \sum_{\alpha \in A} X_\alpha^\beta$ using the random index with distribution
\[ P(I = \beta) = \frac{\mu_\beta}{\sum_{\alpha \in A} \mu_\alpha}. \tag{2.71} \]
Hence, by randomization over $A$, a construction of $\mathbf{X}^\beta$ for every coordinate $\beta$ leads to a construction of $X_A^s$.

We may size bias in coordinates by applying the following procedure. Let $A = \{1, \ldots, n\}$ for notational ease. For given $i \in \{1, \ldots, n\}$, write the joint distribution of $\mathbf{X}$ as a product of the marginal distribution of $X_i$ times the conditional distribution of the remaining variables given $X_i$,
\[ dF(\mathbf{x}) = dF_i(x_i)\,dF(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \mid x_i), \tag{2.72} \]
which gives a factorization of (2.69) as
\[ dF^i(\mathbf{x}) = dF_i^i(x_i)\,dF(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \mid x_i), \tag{2.73} \]
where
\[ dF_i^i(x_i) = (x_i/\mu_i)\,dF_i(x_i). \]
The representation (2.73) says that one may form $\mathbf{X}^i$ by first generating $X_i^i$ having the $X_i$-size biased distribution, and then the remaining variables from their original distribution, conditioned on $x_i$ taking on its newly chosen size biased value. For $\mathbf{X}$ already given, a coupling between the sum $Y = X_1 + \cdots + X_n$ and $Y^s$ can be generated by first constructing, for every $i$, the biased variable $X_i^i$ and then 'adjusting' the remaining variables $X_j$, $j \ne i$, as necessary so that they have the correct conditional distribution. Mixing then yields $Y^s$. Typically the goal
is to adjust the variables as little as possible in order to have the resulting bounds to normality small.

The following important corollary of Proposition 2.2 handles the case where the variables $X_1, \ldots, X_n$ are independent, so that (2.72) reduces to
\[ dF(\mathbf{x}) = dF_i(x_i)\,dF(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n). \]
The following result is parallel to Lemma 2.8.

Corollary 2.1 Let $Y = \sum_{i=1}^n X_i$, where $X_1, \ldots, X_n$ are independent, nonnegative random variables with means $EX_i = \mu_i$, $i = 1, \ldots, n$. Let $I$ be a random index with distribution given by (2.71), independent of all other variables. Then, upon replacing the summand $X_I$ selected by $I$ with a variable $X_I^s$ having its size biased distribution, independent of $X_j$ for $j \ne I$, we obtain
\[ Y^I =_d Y - X_I + X_I^s, \]
a variable having the $Y$-size bias distribution.

Proof Letting $\mathbf{X} = (X_1, \ldots, X_n)$, the vector
\[ \mathbf{X}^i = (X_1, \ldots, X_{i-1}, X_i^s, X_{i+1}, \ldots, X_n) \]
has the $\mathbf{X}$-size biased distribution in coordinate $i$, as the conditional distribution in (2.73) is the same as the unconditional one. Now apply Proposition 2.2.

In other words, when the variables are independent and $X_i$ is replaced by its size biased version, there is no need to change any of the remaining variables $X_j$, $j \ne i$, in order for them to have their original conditional distribution given the new value $X_i^s$.

As shown in Goldstein and Reinert (2005), size biasing and zero biasing are both special cases of a general form of distributional biasing, where given a 'biasing function' $P(x)$ with $m \in \{0, 1, \ldots\}$ sign changes, and a distribution $X$ which satisfies the $m$ orthogonality relations $EX^i P(X) = 0$, $i = 0, \ldots, m - 1$, there exists a distribution $X^{(P)}$ satisfying
\[ E\{P(X)f(X)\} = \alpha Ef^{(m)}\bigl(X^{(P)}\bigr) \tag{2.74} \]
when $\alpha = EP(X)X^m/m! > 0$.

For example, for zero biasing the function $P(x) = x$ has $m = 1$ sign change, so the identity involves the first derivative $f'$, and we require that the distribution of $X$ satisfies the single orthogonality relation $E(1 \cdot X) = EX = 0$, and set $\alpha = EX^2 = \sigma^2$. For size biasing $P(x) = \max\{x, 0\}$, which has $m = 0$ sign changes, so no derivatives of $f$ are involved, and neither are there any orthogonality relations to be satisfied, and $\alpha = EX$.

Letting $X^\square$ be characterized by (2.74) with $P(x) = x^2$, since $P(x)$ has no sign changes the distribution $X^\square$ exists for any distribution $X$ with finite second moment, and $\alpha = EX^2$. In this particular case, where
\[ E\{X^2 f(X)\} = EX^2\, Ef(X^\square) \tag{2.75} \]
for all functions $f$ for which $E|Xf(X)| < \infty$, we say that $X^\square$ has the $X$-square bias distribution. As in (2.69) for the size biased distribution, the distribution of $X^\square$ can also be characterized by its Radon–Nikodym derivative with respect to the distribution of $X$, as we do in Proposition 2.3, below.

By comparing Lemma 2.8 with Corollary 2.1 one can already see that zero and size biasing are closely related. Another relation between the two is given by the following proposition.

Proposition 2.3 Let $X$ be a symmetric random variable with finite, non-zero variance $\sigma^2$, and let $X^\square$ have the $X$-square bias distribution, that is,
\[ dF^\square(x) = \frac{x^2\,dF(x)}{\sigma^2}. \]
Then, with $U \sim \mathcal{U}[-1, 1]$ independent of $X^\square$, the variable
\[ X^* =_d UX^\square \]
has the $X$-zero bias distribution.

Proof Since $X$ is symmetric with finite second moment, $EX = 0$ and $EX^2 = \sigma^2$. For an absolutely continuous function $f$ with derivative $g \in C_c$, the collection of continuous functions having compact support, using the characterization (2.75) for the fourth equality below, we have
\begin{align*}
\sigma^2 Eg(UX^\square) &= \sigma^2 Ef'(UX^\square) \\
 &= \sigma^2 E\left\{\frac{1}{2} \int_{-1}^{1} f'(uX^\square)\,du\right\} \\
 &= \sigma^2 E\left\{\frac{f(X^\square) - f(-X^\square)}{2X^\square}\right\} \\
 &= E\left\{X^2\, \frac{f(X) - f(-X)}{2X}\right\} \\
 &= \frac{1}{2} E\bigl\{X\bigl(f(X) - f(-X)\bigr)\bigr\} \\
 &= \frac{1}{2}\bigl(E\{Xf(X)\} + E\{(-X)f(-X)\}\bigr) \\
 &= E\{Xf(X)\}.
\end{align*}
Hence, if $X^*$ has the $X$-zero bias distribution,
\[ \sigma^2 Eg(UX^\square) = E\{Xf(X)\} = \sigma^2 Ef'(X^*) = \sigma^2 Eg(X^*). \]
As the expectations of $g(UX^\square)$ and $g(X^*)$ agree for all $g \in C_c$, the random variables $UX^\square$ and $X^*$ must be equal in distribution.
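Proposition 2.3 can be checked exactly for a symmetric discrete distribution, since for $U$ uniform on $[-1, 1]$ and fixed $x \ne 0$ one has $Ef'(Ux) = (f(x) - f(-x))/(2x)$. The sketch below is illustrative only: the name square_bias_check, the five-point distribution and the test function are choices made here, not taken from the text.

```python
from fractions import Fraction


def square_bias_check(pmf, f):
    """pmf: symmetric dict value -> probability. Returns (sigma^2 E f'(U X_sq), E[X f(X)])."""
    assert all(pmf.get(-x) == p for x, p in pmf.items()), "X must be symmetric"
    sigma2 = sum(x * x * p for x, p in pmf.items())
    # Square bias: reweight each atom by x^2 / sigma^2, so the atom at zero drops out.
    square_biased = {x: x * x * p / sigma2 for x, p in pmf.items() if x != 0}
    assert sum(square_biased.values()) == 1
    lhs = sigma2 * sum(q * (f(x) - f(-x)) / (2 * x) for x, q in square_biased.items())
    rhs = sum(x * f(x) * p for x, p in pmf.items())
    return lhs, rhs


pmf = {
    Fraction(-2): Fraction(1, 8),
    Fraction(-1): Fraction(1, 4),
    Fraction(0): Fraction(1, 4),
    Fraction(1): Fraction(1, 4),
    Fraction(2): Fraction(1, 8),
}
lhs, rhs = square_bias_check(pmf, lambda x: x ** 3 + x ** 2)
print(lhs, rhs)
```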
2.4 A General Framework for Stein Identities and Normal Approximation for Lipschitz Functions

Identity (2.42),
\[ E\{Wf(W)\} = E \int_{-\infty}^\infty f'(W + t)\hat{K}(t)\,dt + E\{Rf(W)\}, \tag{2.76} \]
arose when allowing for the possibility that a given exchangeable pair may not satisfy the linearity condition (2.33) exactly. The function $\hat{K}(t)$ may be random, and, to obtain a good bound, $R$ should be a random variable so that the second term $E[Rf(W)]$ is of smaller order than the first. The exchangeable pair and size bias identities, (2.36) and (2.64), respectively, are both the special case of (2.76) when $R = 0$. For the first case, the function $\hat{K}(t)$ is given by (2.38), and by (2.65) in the second. Though the zero bias identity (2.51) with $\sigma^2 = 1$ does not fit the mold of (2.76) precisely, in somewhat the same spirit, with $\Delta = W^* - W$ we have
\[ E\{Wf(W)\} = Ef'(W + \Delta), \tag{2.77} \]
holding for all absolutely continuous functions $f$ for which the expectations above exist. The following proposition provides a general bound for normal approximation for smooth functions when (2.76) or (2.77) holds.

Proposition 2.4 Let $h$ be an absolutely continuous function with $\|h'\| < \infty$ and $\mathcal{F}$ any $\sigma$-algebra containing $\sigma\{W\}$.
(i) If (2.76) holds, then
\[ |Eh(W) - Nh| \le \|h'\| \left(\sqrt{\frac{2}{\pi}}\, E|1 - \hat{K}_1| + 2E\hat{K}_2 + 2E|R|\right), \tag{2.78} \]
where
\[ \hat{K}_1 = E\left\{\int_{-\infty}^\infty \hat{K}(t)\,dt \,\Big|\, \mathcal{F}\right\} \quad \text{and} \quad \hat{K}_2 = \int_{-\infty}^\infty |t\hat{K}(t)|\,dt. \]
(ii) If (2.77) holds, then
\[ |Eh(W) - Nh| \le 2\|h'\| E|\Delta|. \tag{2.79} \]
Proof Let $f_h$ be the solution (2.5) to the Stein equation (2.4). We note that by (2.13), both $f_h$ and $f_h'$ are bounded. We may assume the expectations on the right hand side of (2.78) are finite, as otherwise the result is trivial. By (2.4) and (2.76),
\begin{align*}
Eh(W) - Nh &= E\{f_h'(W) - Wf_h(W)\} \\
 &= Ef_h'(W) - E \int_{-\infty}^\infty f_h'(W + t)\hat{K}(t)\,dt - E\{Rf_h(W)\} \\
 &= E\{f_h'(W)(1 - \hat{K}_1)\} + E \int_{-\infty}^\infty \bigl(f_h'(W) - f_h'(W + t)\bigr)\hat{K}(t)\,dt - E\{Rf_h(W)\}.
\end{align*}
By the properties of the Stein solution $f_h$ given in (2.13) and the mean value theorem, we have
\[ \bigl|E\{f_h'(W)(1 - \hat{K}_1)\}\bigr| \le \|h'\| \sqrt{\frac{2}{\pi}}\, E|1 - \hat{K}_1|, \]
\[ \left|E \int_{-\infty}^\infty \bigl(f_h'(W) - f_h'(W + t)\bigr)\hat{K}(t)\,dt\right| \le E \int_{-\infty}^\infty 2\|h'\|\, |t\hat{K}(t)|\,dt = 2\|h'\| E\hat{K}_2 \]
and
\[ \bigl|E\{Rf_h(W)\}\bigr| \le 2\|h'\| E|R|. \]
This proves (2.78). Next, (2.79) follows from (2.13) and
\begin{align*}
|Eh(W) - Nh| &= \bigl|E\{f_h'(W) - Wf_h(W)\}\bigr| \\
 &= \bigl|E\{f_h'(W) - f_h'(W + \Delta)\}\bigr| \\
 &\le \|f_h''\| E|\Delta|.
\end{align*}
We will explore smooth function bounds extensively in Chap. 4.
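As a small numerical illustration of part (ii) of Proposition 2.4, take $W$ to be a normalized sum of $n$ independent signs and $h(w) = |w|$, so that $\|h'\| = 1$ and $Nh = \sqrt{2/\pi}$. The sketch assumes the coupling $W^* = W - \xi_I + \xi_I^*$ of Lemma 2.8 with $\xi_I^*$ uniform on $[-n^{-1/2}, n^{-1/2}]$ and drawn independently of all the summands, for which $E|W^* - W| = n^{-1/2}$; that coupling and the function name rademacher_l1_check are assumptions of this example, not statements from the text.

```python
import math


def rademacher_l1_check(n):
    """Return (|E h(W) - N h|, bound from (2.79)) for h(w) = |w| and W a normalized sign sum."""
    # W = (2 * Bin(n, 1/2) - n) / sqrt(n), so E|W| is a finite binomial sum.
    e_abs_w = sum(math.comb(n, k) * abs(2 * k - n) for k in range(n + 1)) / (2 ** n * math.sqrt(n))
    n_h = math.sqrt(2 / math.pi)          # E|Z| for a standard normal Z
    # With xi = eps / sqrt(n) and xi* = U / sqrt(n), U ~ U[-1, 1] independent of eps,
    # E|U - eps| = 1, hence E|W* - W| = 1 / sqrt(n) under the assumed coupling.
    e_abs_delta = 1 / math.sqrt(n)
    bound = 2 * 1.0 * e_abs_delta         # 2 * ||h'|| * E|Delta| with ||h'|| = 1
    return abs(e_abs_w - n_h), bound


for n in (1, 4, 25, 100):
    error, bound = rademacher_l1_check(n)
    print(n, error, bound)
```

The bound is of order $n^{-1/2}$ and is not expected to be sharp for this particular $h$; the point is only that the inequality holds with an explicit constant.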
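Appendix

Here we prove Lemmas 2.3 and 2.4, giving the basic properties of the solutions to the Stein equations (2.2) and (2.4). The proof of Lemma 2.3, and part of Lemma 2.4, follow Stein (1986), while parts of the proof of Lemma 2.4 are due to Stroock (2000) and Raič (2004) (see also Chatterjee 2008).

Before beginning, note that from (2.2) and (2.3) it follows that
\begin{align*}
f_z'(w) &= wf_z(w) + 1_{\{w \le z\}} - \Phi(z) \\
 &= \begin{cases} wf_z(w) + 1 - \Phi(z) & \text{for } w < z, \\ wf_z(w) - \Phi(z) & \text{for } w > z, \end{cases} \\
 &= \begin{cases} \bigl(\sqrt{2\pi}\, w e^{w^2/2}\Phi(w) + 1\bigr)\bigl(1 - \Phi(z)\bigr) & \text{for } w < z, \\ \bigl(\sqrt{2\pi}\, w e^{w^2/2}(1 - \Phi(w)) - 1\bigr)\Phi(z) & \text{for } w > z, \end{cases} \tag{2.80}
\end{align*}
and
\[ \bigl(wf_z(w)\bigr)' = \begin{cases} \sqrt{2\pi}\,\bigl(1 - \Phi(z)\bigr)\Bigl((1 + w^2)e^{w^2/2}\Phi(w) + \dfrac{w}{\sqrt{2\pi}}\Bigr) & \text{if } w < z, \\[2mm] \sqrt{2\pi}\,\Phi(z)\Bigl((1 + w^2)e^{w^2/2}\bigl(1 - \Phi(w)\bigr) - \dfrac{w}{\sqrt{2\pi}}\Bigr) & \text{if } w > z. \end{cases} \tag{2.81} \]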
Proof of Lemma 2.3 Since $f_z(w) = f_{-z}(-w)$, we need only consider the case $z \ge 0$. Note that for $w > 0$
\[ \int_w^\infty e^{-x^2/2}\,dx \le \int_w^\infty \frac{x}{w}\, e^{-x^2/2}\,dx = \frac{e^{-w^2/2}}{w}, \]
and that
\[ \int_w^\infty e^{-x^2/2}\,dx \ge \frac{w e^{-w^2/2}}{1 + w^2}, \]
by comparing the derivatives of the two functions and their values at $w = 0$. Thus
\[ \frac{w e^{-w^2/2}}{(1 + w^2)\sqrt{2\pi}} \le 1 - \Phi(w) \le \frac{e^{-w^2/2}}{w\sqrt{2\pi}}. \tag{2.82} \]
Applying the lower bound in inequality (2.82) to the form of $(wf_z(w))'$ for $w > z$ in (2.81), we see that this derivative is nonnegative, thus yielding (2.6). Now, in view of (2.82) and the fact that $wf_z(w)$ is increasing, taking limits using (2.3) we have
\[ \lim_{w \to -\infty} wf_z(w) = \Phi(z) - 1 \quad \text{and} \quad \lim_{w \to \infty} wf_z(w) = \Phi(z), \tag{2.83} \]
and (2.7) follows. Now, using that $wf_z(w)$ is an increasing function of $w$, (2.83) and (2.80),
\[ 0 < f_z'(w) \le zf_z(z) + 1 - \Phi(z) < 1 \quad \text{for } w < z \tag{2.84} \]
and
\[ -1 < zf_z(z) - \Phi(z) \le f_z'(w) < 0 \quad \text{for } w > z, \tag{2.85} \]
proving the first inequality of (2.8). For the second, note that for any $w$ and $u$ we therefore have
\[ |f_z'(w) - f_z'(u)| \le zf_z(z) + 1 - \Phi(z) - \bigl(zf_z(z) - \Phi(z)\bigr) = 1. \]
Next, observe that by (2.84) and (2.85), $f_z(w)$ attains its maximum at $z$. Thus
\[ 0 < f_z(w) \le f_z(z) = \sqrt{2\pi}\, e^{z^2/2}\Phi(z)\bigl(1 - \Phi(z)\bigr). \]
By (2.82), $f_z(z) \le 1/z$. To finish the proof of (2.9), let
\[ g(z) = \Phi(z)\bigl(1 - \Phi(z)\bigr) - \frac{e^{-z^2/2}}{4} \quad \text{and} \quad g_1(z) = \frac{1}{\sqrt{2\pi}} + \frac{z}{4} - \frac{2\Phi(z)}{\sqrt{2\pi}}. \]
Observe that $g'(z) = e^{-z^2/2} g_1(z)$ and that
\[ g_1(0) = 0, \quad g_1'(0) < 0, \quad g_1''(z) = \frac{z}{\pi}\, e^{-z^2/2} \quad \text{and} \quad \lim_{z \to \infty} g_1(z) = \infty. \]
Hence $g_1$ is convex on $[0, \infty)$, and there exists $z_1 > 0$ such that $g_1(z) < 0$ for $z < z_1$ and $g_1(z) > 0$ for $z > z_1$. In particular, on $[0, \infty)$ the function $g(z)$ decreases for $z < z_1$ and increases for $z > z_1$, so its supremum must be attained at either $z = 0$ or $z = \infty$, that is,
\[ g(z) \le \max\bigl(g(0), g(\infty)\bigr) = 0 \quad \text{for all } z \in [0, \infty), \]
which is equivalent to $f_z(z) \le \sqrt{2\pi}/4$. This completes the proof of (2.9).
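The tail bounds (2.82) and the two bounds on $f_z(z)$ used for (2.9) can be spot-checked numerically on a grid. This is only a floating point sanity check, not a proof; the grid and the helper names upper_tail, mills_bounds and f_z_at_z are choices made here, and the normal tail is computed with the standard library function math.erfc.

```python
import math


def upper_tail(w):
    """1 - Phi(w) for the standard normal, via the complementary error function."""
    return 0.5 * math.erfc(w / math.sqrt(2))


def mills_bounds(w):
    """Return (lower bound, 1 - Phi(w), upper bound) from (2.82), for w > 0."""
    phi = math.exp(-w * w / 2) / math.sqrt(2 * math.pi)
    return w * phi / (1 + w * w), upper_tail(w), phi / w


def f_z_at_z(z):
    """f_z(z) = sqrt(2 pi) * exp(z^2 / 2) * Phi(z) * (1 - Phi(z))."""
    return math.sqrt(2 * math.pi) * math.exp(z * z / 2) * (1 - upper_tail(z)) * upper_tail(z)


grid = [0.05 * k for k in range(1, 161)]          # points in (0, 8]
tail_ok = all(lo <= mid <= hi for lo, mid, hi in map(mills_bounds, grid))
sup_ok = all(f_z_at_z(z) <= math.sqrt(2 * math.pi) / 4 for z in grid)
inv_ok = all(f_z_at_z(z) <= 1 / z for z in grid)
print(tail_ok, sup_ok, inv_ok)
```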
To verify the last inequality (2.10), write
\[ (w + u)f_z(w + u) - (w + v)f_z(w + v) = w\bigl(f_z(w + u) - f_z(w + v)\bigr) + uf_z(w + u) - vf_z(w + v) \]
and apply the mean value theorem and (2.8) on the first term, and (2.9) on the second.

Proof of Lemma 2.4 Let $\tilde{h}(w) = h(w) - Nh$ and put $c_0 = \|\tilde{h}\|$, and let $c_1 = \|h'\|$ if $h$ is absolutely continuous, and $c_1 = \infty$ otherwise. Since $\tilde{h}$ and $f_h$ are unchanged when $h$ is replaced by $h - h(0)$, we may assume that $h(0) = 0$. Therefore $|h(t)| \le c_1|t|$ and $|Nh| \le c_1 E|Z| = c_1\sqrt{2/\pi}$.

We first prove the two bounds on $f_h$ itself. From the expression (2.5) for $f_h$ it follows that
\begin{align*}
|f_h(w)| &\le \begin{cases} e^{w^2/2} \int_{-\infty}^w |\tilde{h}(x)| e^{-x^2/2}\,dx & \text{if } w \le 0, \\ e^{w^2/2} \int_w^\infty |\tilde{h}(x)| e^{-x^2/2}\,dx & \text{if } w \ge 0 \end{cases} \\
 &\le e^{w^2/2} \min\left(c_0 \int_{|w|}^\infty e^{-x^2/2}\,dx,\; c_1 \int_{|w|}^\infty \bigl(|x| + \sqrt{2/\pi}\bigr) e^{-x^2/2}\,dx\right) \\
 &\le \min\bigl(\sqrt{\pi/2}\, c_0,\; 2c_1\bigr),
\end{align*}
where in the last inequality we obtain
\[ e^{w^2/2} \int_{|w|}^\infty e^{-x^2/2}\,dx \le \sqrt{\pi/2} \]
by applying (2.82) to show that the function on the left hand side above has a negative derivative for $w \ge 0$, and therefore that its maximum is achieved at $w = 0$. We note that the first bound in the minimum applies if $h$ is only bounded, thus yielding the first claim in (2.12), while if $h$ is only absolutely continuous the second bound holds, yielding the first claim in (2.13).

Moving to bounds on $f_h'$, by (2.4) for $w \ge 0$,
\begin{align*}
|f_h'(w)| &\le |h(w) - Nh| + w e^{w^2/2} \int_w^\infty |h(x) - Nh|\, e^{-x^2/2}\,dx \\
 &\le c_0 + c_0 w e^{w^2/2} \int_w^\infty e^{-x^2/2}\,dx \le 2c_0,
\end{align*}
using (2.82). A similar argument may be applied for $w < 0$, proving the remaining claim in (2.12).

To prove the second claim in (2.13), when $h$ is absolutely continuous write
\begin{align*}
h(x) - Nh &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty \bigl(h(x) - h(u)\bigr) e^{-u^2/2}\,du \\
 &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \int_u^x h'(t) e^{-u^2/2}\,dt\,du - \frac{1}{\sqrt{2\pi}} \int_x^\infty \int_x^u h'(t) e^{-u^2/2}\,dt\,du \\
 &= \int_{-\infty}^x h'(t)\Phi(t)\,dt - \int_x^\infty h'(t)\bigl(1 - \Phi(t)\bigr)\,dt, \tag{2.86}
\end{align*}
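from which it follows that
$$\begin{aligned}
f_h(w) &= e^{w^2/2}\int_{-\infty}^{w}\bigl(h(x) - Nh\bigr)e^{-x^2/2}\,dx\\
&= e^{w^2/2}\int_{-\infty}^{w}\Bigl(\int_{-\infty}^{x} h'(t)\Phi(t)\,dt - \int_{x}^{\infty} h'(t)\bigl(1 - \Phi(t)\bigr)\,dt\Bigr)e^{-x^2/2}\,dx\\
&= -\sqrt{2\pi}\,e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\int_{-\infty}^{w} h'(t)\Phi(t)\,dt - \sqrt{2\pi}\,e^{w^2/2}\Phi(w)\int_{w}^{\infty} h'(t)\bigl(1 - \Phi(t)\bigr)\,dt.
\end{aligned}\tag{2.87}$$
Now, from (2.4), (2.87) and (2.86),
$$\begin{aligned}
f_h'(w) &= w f_h(w) + h(w) - Nh\\
&= \Bigl(1 - \sqrt{2\pi}\,w e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\Bigr)\int_{-\infty}^{w} h'(t)\Phi(t)\,dt - \Bigl(1 + \sqrt{2\pi}\,w e^{w^2/2}\Phi(w)\Bigr)\int_{w}^{\infty} h'(t)\bigl(1 - \Phi(t)\bigr)\,dt.
\end{aligned}$$
Hence
$$\|f_h'\| \le \|h'\| \sup_{w\in\mathbb{R}}\Bigl(\bigl|1 - \sqrt{2\pi}\,w e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\bigr|\int_{-\infty}^{w}\Phi(t)\,dt + \bigl|1 + \sqrt{2\pi}\,w e^{w^2/2}\Phi(w)\bigr|\int_{w}^{\infty}\bigl(1 - \Phi(t)\bigr)\,dt\Bigr).$$
By integration by parts,
$$\int_{-\infty}^{w}\Phi(t)\,dt = w\Phi(w) + \frac{e^{-w^2/2}}{\sqrt{2\pi}} \quad\text{and}\quad \int_{w}^{\infty}\bigl(1 - \Phi(t)\bigr)\,dt = -w\bigl(1 - \Phi(w)\bigr) + \frac{e^{-w^2/2}}{\sqrt{2\pi}}.\tag{2.88}$$
Thus,
$$\|f_h'\| \le \|h'\| \sup_{w\in\mathbb{R}}\Bigl(\bigl|1 - \sqrt{2\pi}\,w e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\bigr|\Bigl(w\Phi(w) + \frac{e^{-w^2/2}}{\sqrt{2\pi}}\Bigr) + \bigl|1 + \sqrt{2\pi}\,w e^{w^2/2}\Phi(w)\bigr|\Bigl(-w\bigl(1 - \Phi(w)\bigr) + \frac{e^{-w^2/2}}{\sqrt{2\pi}}\Bigr)\Bigr).$$
One may now verify that the term inside the brackets attains its maximum value of $\sqrt{2/\pi}$ at $w = 0$.

Now we prove the final claim of (2.13). Differentiating (2.4) gives
$$f_h''(w) = w f_h'(w) + f_h(w) + h'(w) = (1 + w^2) f_h(w) + w\bigl(h(w) - Nh\bigr) + h'(w).\tag{2.89}$$
From (2.89), (2.87), (2.86), (2.82) and (2.88) we obtain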
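$$\begin{aligned}
|f_h''(w)| &\le |h'(w)| + \bigl|(1 + w^2) f_h(w) + w\bigl(h(w) - Nh\bigr)\bigr|\\
&\le |h'(w)| + \Bigl|\Bigl(w - \sqrt{2\pi}(1 + w^2)e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\Bigr)\int_{-\infty}^{w} h'(t)\Phi(t)\,dt\Bigr|\\
&\qquad + \Bigl|\Bigl(-w - \sqrt{2\pi}(1 + w^2)e^{w^2/2}\Phi(w)\Bigr)\int_{w}^{\infty} h'(t)\bigl(1 - \Phi(t)\bigr)\,dt\Bigr|\\
&\le |h'(w)| + c_1\Bigl(-w + \sqrt{2\pi}(1 + w^2)e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\Bigr)\int_{-\infty}^{w}\Phi(t)\,dt\\
&\qquad + c_1\Bigl(w + \sqrt{2\pi}(1 + w^2)e^{w^2/2}\Phi(w)\Bigr)\int_{w}^{\infty}\bigl(1 - \Phi(t)\bigr)\,dt\\
&= |h'(w)| + c_1\Bigl(-w + \sqrt{2\pi}(1 + w^2)e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\Bigr)\Bigl(w\Phi(w) + \frac{e^{-w^2/2}}{\sqrt{2\pi}}\Bigr)\\
&\qquad + c_1\Bigl(w + \sqrt{2\pi}(1 + w^2)e^{w^2/2}\Phi(w)\Bigr)\Bigl(-w\bigl(1 - \Phi(w)\bigr) + \frac{e^{-w^2/2}}{\sqrt{2\pi}}\Bigr)\\
&= |h'(w)| + c_1 \le 2c_1,
\end{aligned}$$
as desired.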
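The two closing claims of this proof, that the bracketed term in the bound on $\|f_h'\|$ is maximized at $w = 0$ with value $\sqrt{2/\pi}$, and that the two products in the final display sum to one, can be checked numerically. The following is a minimal sketch and not part of the original argument; the function names `bracket` and `final_factor`, the grid and the tolerances are illustrative assumptions.

```python
import math

def Phi(w):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-w / math.sqrt(2.0))

def phi(w):
    """Standard normal density."""
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)

def bracket(w):
    """Term inside the brackets in the bound on ||f_h'||, with (2.88) substituted."""
    scale = math.sqrt(2.0 * math.pi) * w * math.exp(0.5 * w * w)
    left = abs(1.0 - scale * Phi(-w)) * (w * Phi(w) + phi(w))
    right = abs(1.0 + scale * Phi(w)) * (-w * Phi(-w) + phi(w))
    return left + right

def final_factor(w):
    """Coefficient of c_1 in the last display of the proof; it should equal 1."""
    a = math.sqrt(2.0 * math.pi) * (1.0 + w * w) * math.exp(0.5 * w * w)
    first = (-w + a * Phi(-w)) * (w * Phi(w) + phi(w))
    second = (w + a * Phi(w)) * (-w * Phi(-w) + phi(w))
    return first + second

grid = [k / 100.0 for k in range(-600, 601)]        # w in [-6, 6]
values = [bracket(w) for w in grid]
w_star = grid[values.index(max(values))]            # grid point where the maximum occurs
print(w_star, max(values), math.sqrt(2.0 / math.pi))
print(max(abs(final_factor(k / 10.0) - 1.0) for k in range(-30, 31)))
```

Here $1 - \Phi(w)$ is evaluated as $\Phi(-w)$ to avoid cancellation in the tails.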
We now present the proof of Lemma 2.5 for bounds on the solution $f(w)$ to the Stein equation for the linearly smoothed indicator function (2.14). For this case Bolthausen (1984) proved the inequalities $|f(w)| \le 1$, $|f'(w)| \le 2$, and, through use of the latter, the bound (2.16) with the factor of $|w|$ replaced by $2|w|$.

Proof of Lemma 2.5 As in (2.87) in the proof of Lemma 2.3, letting
$$\eta(w) = \sqrt{2\pi}\,e^{w^2/2}\Phi(w),$$
we have
$$f(w) = -\eta(-w)\int_{-\infty}^{w} h'(t)\Phi(t)\,dt - \eta(w)\int_{w}^{\infty} h'(t)\Phi(-t)\,dt.\tag{2.90}$$
For $z \le w \le z + \alpha$, we therefore have
$$\begin{aligned}
f(w) &= \eta(-w)\int_{z}^{w}\frac{\Phi(t)}{\alpha}\,dt + \eta(w)\int_{w}^{z+\alpha}\frac{\Phi(-t)}{\alpha}\,dt\\
&\le \frac{\eta(-w)\Phi(w)(w - z)}{\alpha} + \frac{\eta(w)\Phi(-w)(z + \alpha - w)}{\alpha}\\
&= \eta(w)\Phi(-w) = \sqrt{2\pi}\,e^{w^2/2}\Phi(w)\Phi(-w).
\end{aligned}\tag{2.91}$$
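By symmetry we may take $w \ge 0$ without loss of generality. Then, using the fact that $\Phi(w)/w$ is decreasing, and straightforward inequalities, we derive
$$\sqrt{2\pi}\,e^{w^2/2}\Phi(w)\Phi(-w) \le \min\Bigl(\frac{\sqrt{2\pi}}{2}, \frac{1}{w}\Bigr)\Phi(w) \le \frac{\sqrt{2\pi}}{2}\,\Phi\Bigl(\sqrt{\frac{2}{\pi}}\Bigr) < 1,$$
showing $f(w) \le 1$ for $w \in [z, z + \alpha]$.

Next, note that $\Phi(z) \le Nh \le \Phi(z + \alpha)$, and let $f_z(w)$ be the solution to the Stein equation for the function $1_{\{w \le z\}}$. For $w < z$, since $e^{w^2/2}\Phi(w)$ is increasing, we obtain
$$\begin{aligned}
f(w) &= \sqrt{2\pi}(1 - Nh)e^{w^2/2}\Phi(w)\\
&\le \sqrt{2\pi}\bigl(1 - \Phi(z)\bigr)e^{w^2/2}\Phi(w)\\
&\le \sqrt{2\pi}\bigl(1 - \Phi(z)\bigr)e^{z^2/2}\Phi(z)\\
&= f_z(z) \le \sqrt{2\pi}/4,
\end{aligned}\tag{2.92}$$
using (2.9). Similarly, for $w > z + \alpha$,
$$\begin{aligned}
f(w) &= \sqrt{2\pi}\,Nh\,e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\\
&\le \sqrt{2\pi}\,\Phi(z + \alpha)e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\\
&\le \sqrt{2\pi}\,\Phi(w)e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\\
&= f_w(w) \le \sqrt{2\pi}/4,
\end{aligned}\tag{2.93}$$
showing that $f(w) \le 1$ for all $w \in \mathbb{R}$. The proof of the first claim of (2.15) is completed by showing the lower bound, which follows from the three expressions (2.92), (2.91) and (2.93), proving that $f(w) \ge 0$ over the three intervals $(-\infty, z)$, $[z, z + \alpha]$ and $(z + \alpha, \infty)$, respectively.

For the second claim, starting again from (2.5), we have
$$\begin{aligned}
e^{-w^2/2} f(w) &= \int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx - \int_{-\infty}^{w} e^{-x^2/2}\,Nh\,dx\\
&= \int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{w} e^{-x^2/2}\int_{-\infty}^{\infty} h(t)e^{-t^2/2}\,dt\,dx\\
&= \int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx - \Phi(w)\int_{-\infty}^{\infty} h(t)e^{-t^2/2}\,dt\\
&= \bigl(1 - \Phi(w)\bigr)\int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx - \Phi(w)\int_{w}^{\infty} h(t)e^{-t^2/2}\,dt.
\end{aligned}$$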
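Hence,
$$\begin{aligned}
f(w) &= e^{w^2/2}\bigl(1 - \Phi(w)\bigr)\int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx - e^{w^2/2}\Phi(w)\int_{w}^{\infty} h(t)e^{-t^2/2}\,dt\\
&= \frac{1}{\sqrt{2\pi}}\,\eta(-w)\int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx - \frac{1}{\sqrt{2\pi}}\,\eta(w)\int_{w}^{\infty} h(t)e^{-t^2/2}\,dt,
\end{aligned}$$
and taking the derivative, we obtain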
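$$\begin{aligned}
f'(w) &= -\frac{1}{\sqrt{2\pi}}\,\eta'(-w)\int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx + \frac{1}{\sqrt{2\pi}}\,\eta(-w)e^{-w^2/2}h(w)\\
&\qquad - \frac{1}{\sqrt{2\pi}}\,\eta'(w)\int_{w}^{\infty} h(t)e^{-t^2/2}\,dt + \frac{1}{\sqrt{2\pi}}\,\eta(w)e^{-w^2/2}h(w)\\
&= h(w)\bigl(\Phi(w) + \Phi(-w)\bigr) - \frac{1}{\sqrt{2\pi}}\Bigl(\eta'(-w)\int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx + \eta'(w)\int_{w}^{\infty} h(t)e^{-t^2/2}\,dt\Bigr)\\
&= h(w) - g(w),
\end{aligned}$$
where we have set
$$g(w) = \frac{1}{\sqrt{2\pi}}\Bigl(\eta'(-w)\int_{-\infty}^{w} h(x)e^{-x^2/2}\,dx + \eta'(w)\int_{w}^{\infty} h(x)e^{-x^2/2}\,dx\Bigr).$$
Since $\eta'(w) \ge 0$, we have
$$\inf_x h(x)\bigl(\eta'(-w)\Phi(w) + \eta'(w)\Phi(-w)\bigr) \le g(w) \le \sup_x h(x)\bigl(\eta'(-w)\Phi(w) + \eta'(w)\Phi(-w)\bigr).$$
However, noting
$$\eta'(-w)\Phi(w) + \eta'(w)\Phi(-w) = 1,\tag{2.94}$$
it follows that
$$\inf_x h(x) - \sup_x h(x) \le f'(w) \le \sup_x h(x) - \inf_x h(x),$$
that is, $|f'(w)| \le 1$, proving the second claim in (2.15).

For the third claim in (2.15), differentiating (2.90) yields
$$f'(w) = \eta'(-w)\int_{-\infty}^{w} h'(t)\Phi(t)\,dt - \eta'(w)\int_{w}^{\infty} h'(t)\Phi(-t)\,dt.$$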
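For $w < z$ we have
$$f'(w) = \eta'(w)\int_{z}^{z+\alpha}\frac{\Phi(-t)}{\alpha}\,dt,$$
for $w \in [z, z + \alpha]$,
$$f'(w) = -\eta'(-w)\int_{z}^{w}\frac{\Phi(t)}{\alpha}\,dt + \eta'(w)\int_{w}^{z+\alpha}\frac{\Phi(-t)}{\alpha}\,dt,$$
and for $w > z + \alpha$
$$f'(w) = -\eta'(-w)\int_{z}^{z+\alpha}\frac{\Phi(t)}{\alpha}\,dt.$$
Hence, we may write
$$f'(w) = -\frac{1}{\alpha}\int_{z}^{z+\alpha} G(w, t)\,dt,$$
where
$$G(w, t) = \begin{cases} -\eta'(w)\Phi(-t) & \text{when } w \le t,\\ \eta'(-w)\Phi(t) & \text{when } w > t. \end{cases}$$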
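Now writing
$$\eta(w) = \sqrt{2\pi}\,e^{w^2/2}\Phi(w) = \int_{-\infty}^{0} e^{-s^2/2 - sw}\,ds,$$
applying the dominated convergence theorem to differentiate under the integral, we obtain
$$\eta''(w) = \int_{-\infty}^{0} s^2 e^{-s^2/2 - sw}\,ds,$$
and therefore
$$\frac{\partial G(w, t)}{\partial w} = \begin{cases} -\eta''(w)\Phi(-t) & \text{when } w < t,\\ -\eta''(-w)\Phi(t) & \text{when } w > t. \end{cases}$$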
Hence, for any fixed $t$, the function $G(w, t)$ is decreasing in $w$ for $w < t$ and $w > t$, and, moreover, satisfies
$$\lim_{w\to-\infty} G(w, t) = 0, \qquad \lim_{w\to\infty} G(w, t) = 0,$$
and
$$\lim_{w\uparrow t} G(w, t) = -\eta'(t)\Phi(-t) < 0 \quad\text{and}\quad \lim_{w\downarrow t} G(w, t) = \eta'(-t)\Phi(t) > 0.$$
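Now, from (2.94), it follows that
$$|G(w, t) - G(v, t)| \le \eta'(t)\Phi(-t) + \eta'(-t)\Phi(t) = 1,$$
and hence
$$\begin{aligned}
|f'(w) - f'(v)| &= \Bigl|\frac{1}{\alpha}\int_{z}^{z+\alpha}\bigl(G(w, t) - G(v, t)\bigr)\,dt\Bigr|\\
&\le \frac{1}{\alpha}\int_{z}^{z+\alpha}|G(w, t) - G(v, t)|\,dt \le \frac{1}{\alpha}\int_{z}^{z+\alpha} 1\,dt = 1.
\end{aligned}$$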
Lastly, to demonstrate (2.16), we apply the mean value theorem and the first two bounds in (2.15) to write
$$\begin{aligned}
|f'(w + v) - f'(w)| &= \bigl|v f(w + v) + w\bigl(f(w + v) - f(w)\bigr) + h(w + v) - h(w)\bigr|\\
&\le |v|\Bigl(1 + |w| + \frac{1}{\alpha}\int_{0}^{1} 1_{[z, z+\alpha]}(w + rv)\,dr\Bigr).
\end{aligned}$$
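Two ingredients of the proof of Lemma 2.5 lend themselves to a quick numerical check: the identity (2.94), and the fact that the right hand side $\eta(w)\Phi(-w)$ of (2.91) stays below one (its largest value is $\sqrt{2\pi}/4$, by (2.9)). The sketch below is an illustration only; it assumes the product-rule formula $\eta'(w) = w\eta(w) + 1$, which is not written out in the text, and the grid and tolerances are arbitrary choices.

```python
import math

def Phi(w):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-w / math.sqrt(2.0))

def eta(w):
    """eta(w) = sqrt(2*pi) * exp(w^2/2) * Phi(w), as in the proof of Lemma 2.5."""
    return math.sqrt(2.0 * math.pi) * math.exp(0.5 * w * w) * Phi(w)

def eta_prime(w):
    """Derivative of eta by the product rule (an assumption of this sketch): w*eta(w) + 1."""
    return w * eta(w) + 1.0

def identity_2_94(w):
    """Left hand side of (2.94)."""
    return eta_prime(-w) * Phi(w) + eta_prime(w) * Phi(-w)

def bound_2_91(w):
    """Right hand side of (2.91)."""
    return eta(w) * Phi(-w)

grid = [k / 50.0 for k in range(-250, 251)]              # w in [-5, 5]
max_identity_error = max(abs(identity_2_94(w) - 1.0) for w in grid)
max_bound = max(bound_2_91(w) for w in grid)
print(max_identity_error, max_bound, math.sqrt(2.0 * math.pi) / 4.0)
```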
Chapter 3
Berry–Esseen Bounds for Independent Random Variables
In this chapter we illustrate some of the main ideas of the Stein method by proving the classical Lindeberg central limit theorem and the Berry–Esseen inequality for sums of independent random variables. We begin with Lipschitz functions, which suffice to prove the Lindeberg theorem. We then prove the Berry–Esseen inequality by developing a concentration inequality. Throughout this chapter we assume that $W = \xi_1 + \cdots + \xi_n$ where $\xi_1, \ldots, \xi_n$ are independent random variables satisfying
$$E\xi_i = 0, \quad 1 \le i \le n \quad\text{and}\quad \sum_{i=1}^{n}\operatorname{Var}(\xi_i) = 1.\tag{3.1}$$
Though we focus on the independent case, the ideas developed here provide a basis for handling more general situations, see, for instance, Theorem 3.5 and its consequence, Theorem 5.2. Recall that the supremum, $L^\infty$, or Kolmogorov distance between two distribution functions $F$ and $G$ is given by
$$\|F - G\|_\infty = \sup_{z\in\mathbb{R}}|F(z) - G(z)|.$$
The main goal of this chapter is to prove the Berry–Esseen inequality, first shown by Berry (1941), and Esseen (1942), which gives a uniform bound between $F$, the distribution function of $W$, and $\Phi$, that of the standard normal $Z$, of the form
$$\|F - \Phi\|_\infty \le C\sum_{i=1}^{n} E|\xi_i|^3\tag{3.2}$$
where $C$ is an absolute constant. The upper bound on the smallest possible value of $C$ has decreased from Esseen’s original estimate of 7.59 to its current value of 0.4785 by Tyurin (2010). After proving the Lindeberg and Berry–Esseen theorems, the latter using both the concentration inequality and inductive approaches, we end the chapter with a lower bound on $\|F - \Phi\|_\infty$.
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_3, © Springer-Verlag Berlin Heidelberg 2011
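To make (3.2) concrete, the Kolmogorov distance can be computed exactly when $W$ is a normalized sum of fair random signs, since the distribution of $W$ is then binomial. The sketch below is an illustration under that assumed example (the function name and the chosen values of $n$ are hypothetical); it compares the exact distance with $0.4785\gamma$, where $\gamma = \sum_i E|\xi_i|^3 = n^{-1/2}$ for $\xi_i = \pm n^{-1/2}$.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def kolmogorov_distance_rademacher(n):
    """sup_z |P(W <= z) - Phi(z)| for W = (eps_1 + ... + eps_n) / sqrt(n), fair signs eps_i."""
    dist, cdf_left = 0.0, 0.0
    for k in range(n + 1):                       # k = number of +1 signs
        atom = (2 * k - n) / math.sqrt(n)        # value taken by W
        cdf_right = cdf_left + math.comb(n, k) / 2 ** n
        # F is a step function and Phi is increasing, so the supremum is
        # approached at a jump point, from the left or from the right.
        dist = max(dist, abs(cdf_left - Phi(atom)), abs(cdf_right - Phi(atom)))
        cdf_left = cdf_right
    return dist

for n in (1, 4, 25, 100, 400):
    gamma = 1.0 / math.sqrt(n)                   # sum of third absolute moments
    d = kolmogorov_distance_rademacher(n)
    print(n, round(d, 4), round(0.4785 * gamma, 4), d <= 0.4785 * gamma)
```

In this example the ratio of the distance to $\gamma$ stays bounded as $n$ grows, consistent with the $n^{-1/2}$ rate in (3.2).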
3.1 Normal Approximation with Lipschitz Functions
We recall that $h : \mathbb{R} \to \mathbb{R}$ is a Lipschitz continuous function if there exists a constant $K$ such that
$$|h(x) - h(y)| \le K|x - y| \quad\text{for all } x, y \in \mathbb{R}.$$
Equivalently, $h$ is Lipschitz continuous if and only if $h$ is absolutely continuous with $\|h'\| < \infty$.

Theorem 3.1 Let $W = \sum_{i=1}^{n}\xi_i$ be the sum of mean zero independent random variables $\xi_i$, $1 \le i \le n$ with $\sum_{i=1}^{n}\operatorname{Var}(\xi_i) = 1$, and $h$ a Lipschitz continuous function. If $E|\xi_i|^3 < \infty$ for $i = 1, \ldots, n$, then
$$|Eh(W) - Nh| \le 3\|h'\|\gamma,\tag{3.3}$$
where
$$\gamma = \sum_{i=1}^{n} E|\xi_i|^3.\tag{3.4}$$
Proof By Lemma 2.8, (2.77) holds with $\Delta = \xi_I^* - \xi_I$, where $\xi_i^*$ has the $\xi_i$-zero bias distribution and is independent of $\xi_j$, $j \ne i$, and $I$ is a random index with distribution (2.60), independent of all other variables. Invoking (2.79) of Proposition 2.4,
$$\begin{aligned}
|Eh(W) - Nh| &\le 2\|h'\|\,E|\xi_I^* - \xi_I|\\
&= 2\|h'\|\sum_{i=1}^{n} E|\xi_i^* - \xi_i|\,E\xi_i^2\\
&\le 2\|h'\|\sum_{i=1}^{n}\bigl(E|\xi_i^*|\,E\xi_i^2 + E|\xi_i|\,E\xi_i^2\bigr)\\
&= 2\|h'\|\sum_{i=1}^{n}\Bigl(\frac{1}{2}E|\xi_i|^3 + E|\xi_i|\,E\xi_i^2\Bigr)\\
&\le 3\|h'\|\sum_{i=1}^{n} E|\xi_i|^3,
\end{aligned}$$
where we have invoked (2.57) to obtain the second equality, followed by Hölder’s inequality.

The constant 3 is improved to 1 in Corollary 4.2. The following theorem shows that one can bound $|Eh(W) - Nh|$ in terms of sums of the truncated second and third moments
$$\beta_2 = \sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > 1\}} \quad\text{and}\quad \beta_3 = \sum_{i=1}^{n} E|\xi_i|^3 1_{\{|\xi_i| \le 1\}},\tag{3.5}$$
without the need to assume the existence of third moments as in Theorem 3.1.
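Theorem 3.2 If $W = \sum_{i=1}^{n}\xi_i$ is the sum of mean zero independent random variables $\xi_i$, $1 \le i \le n$ with $\sum_{i=1}^{n}\operatorname{Var}(\xi_i) = 1$, then for any Lipschitz function $h$
$$|Eh(W) - Nh| \le \|h'\|(4\beta_2 + 3\beta_3).\tag{3.6}$$
Proof We adopt the same notation as in the proof of Theorem 3.1. The key observation is that we can follow the proof of (2.79) in Proposition 2.4, but instead of applying $|f_h'(W) - f_h'(W + \Delta)| \le 2\|h'\||\Delta|$, we instead use
$$|f_h'(W) - f_h'(W + \Delta)| \le \min\bigl(2\|f_h'\|, \|f_h''\||\Delta|\bigr) \le 2\|h'\|\bigl(1 \wedge |\Delta|\bigr),\tag{3.7}$$
which holds by (2.13), where $a \wedge b$ denotes $\min(a, b)$. Hence
$$\begin{aligned}
|Eh(W) - Nh| &\le 2\|h'\|\,E\bigl(1 \wedge |\xi_I^* - \xi_I|\bigr)\\
&\le 2\|h'\|\,E\bigl(1 \wedge (|\xi_I^*| + |\xi_I|)\bigr)\\
&\le 2\|h'\|\bigl(E(1 \wedge |\xi_I^*|) + E(1 \wedge |\xi_I|)\bigr).
\end{aligned}\tag{3.8}$$
Letting $\operatorname{sign}(x)$ be $+1$ for $x > 0$, $-1$ for $x < 0$ and $0$ for $x = 0$, setting
$$f(x) = x 1_{|x| > 1} + \frac{1}{2}x^2\operatorname{sign}(x) 1_{|x| \le 1}$$
we have $f'(x) = 1 \wedge |x|$. Hence, (2.60) and (2.51) now yield
$$E\bigl(1 \wedge |\xi_I^*|\bigr) = \sum_{i=1}^{n} E\bigl(1 \wedge |\xi_i^*|\bigr)E\xi_i^2 = \sum_{i=1}^{n} E\Bigl(\xi_i^2 1_{|\xi_i| > 1} + \frac{1}{2}|\xi_i|^3 1_{|\xi_i| \le 1}\Bigr) = \beta_2 + \frac{1}{2}\beta_3.\tag{3.9}$$
We recall the fact that if $g$ and $h$ are increasing functions, then $Eg(\xi)Eh(\xi) \le Eg(\xi)h(\xi)$. Now, regarding the second term in (3.8), since both $1 \wedge |x|$ and $x^2$ are increasing functions of $|x|$, again applying (2.60),
$$\begin{aligned}
E\bigl(1 \wedge |\xi_I|\bigr) &= \sum_{i=1}^{n} E\bigl(1 \wedge |\xi_i|\bigr)E\xi_i^2 \le E\sum_{i=1}^{n}\bigl(1 \wedge |\xi_i|\bigr)\xi_i^2\\
&\le E\sum_{i=1}^{n}\bigl(\xi_i^2 1_{\{|\xi_i| > 1\}} + |\xi_i|^3 1_{\{|\xi_i| \le 1\}}\bigr) = \beta_2 + \beta_3.
\end{aligned}\tag{3.10}$$
Substituting the bounds (3.9) and (3.10) into (3.8) now gives the result.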
One cannot derive a sharp Berry–Esseen bound for $W$ using the smooth function bounds (3.3) or (3.6). Nevertheless, as noted by Erickson (1974), these smooth function bounds imply a weak $L^\infty$ bound, as highlighted in the following theorem.

Theorem 3.3 Assume that there exists a $\delta$ such that, for any Lipschitz function $h$,
$$|Eh(W) - Nh| \le \delta\|h'\|.\tag{3.11}$$
Then
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \le 2\delta^{1/2}.\tag{3.12}$$
Proposition 2.4 shows that (3.11) is satisfied under conditions (2.76) or (2.77). Though the resulting bound $\delta$ has the optimal rate in many applications, see, for example, (3.3) and (3.6), the rate of a Berry–Esseen bound of the type (3.12) may not be optimal.

Proof We can assume that $\delta \le 1/4$, since otherwise (3.12) is trivial. Let $\alpha = \delta^{1/2}(2\pi)^{1/4}$, and for some fixed $z \in \mathbb{R}$ define
$$h_\alpha(w) = \begin{cases} 1 & \text{if } w \le z,\\ 0 & \text{if } w \ge z + \alpha,\\ \text{linear} & \text{if } z < w < z + \alpha. \end{cases}$$
Then $h_\alpha$ is Lipschitz continuous with $\|h_\alpha'\| = 1/\alpha$, and hence, by (3.11),
$$\begin{aligned}
P(W \le z) - \Phi(z) &\le Eh_\alpha(W) - Nh_\alpha + Nh_\alpha - \Phi(z)\\
&\le \frac{\delta}{\alpha} + P(z \le Z \le z + \alpha)\\
&\le \frac{\delta}{\alpha} + \frac{\alpha}{\sqrt{2\pi}}.
\end{aligned}$$
Therefore
$$P(W \le z) - \Phi(z) \le 2(2\pi)^{-1/4}\delta^{1/2} \le 2\delta^{1/2}.$$
Similarly, we have
$$P(W \le z) - \Phi(z) \ge -2\delta^{1/2},$$
proving (3.12).
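The choice $\alpha = \delta^{1/2}(2\pi)^{1/4}$ in this proof balances the two terms $\delta/\alpha$ and $\alpha/\sqrt{2\pi}$. The following sketch, with a hypothetical value of $\delta$ and illustrative function names, confirms that this choice minimizes the sum and reproduces the constant $2(2\pi)^{-1/4}$.

```python
import math

def smoothing_bound(alpha, delta):
    """Upper bound delta/alpha + alpha/sqrt(2*pi) from the proof of Theorem 3.3."""
    return delta / alpha + alpha / math.sqrt(2.0 * math.pi)

def optimal_alpha(delta):
    """The choice alpha = delta^{1/2} (2*pi)^{1/4} made in the proof."""
    return math.sqrt(delta) * (2.0 * math.pi) ** 0.25

delta = 0.01                                        # hypothetical value of delta in (3.11)
alpha_star = optimal_alpha(delta)
best = smoothing_bound(alpha_star, delta)
closed_form = 2.0 * (2.0 * math.pi) ** -0.25 * math.sqrt(delta)

# Any other alpha gives a bound that is at least as large.
others = [smoothing_bound(alpha_star * r, delta) for r in (0.25, 0.5, 0.9, 1.1, 2.0, 4.0)]
print(best, closed_form, min(others), 2.0 * math.sqrt(delta))
```

The two terms are equal at the optimum, which is the usual signature of such a trade-off between smoothing error and approximation error.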
3.2 The Lindeberg Central Limit Theorem
Let $\xi_1, \ldots, \xi_n$ be independent random variables satisfying $E\xi_i = 0$, $1 \le i \le n$ and $\sum_{i=1}^{n}\operatorname{Var}(\xi_i) = 1$, and let $W = \sum_{i=1}^{n}\xi_i$. The classical Lindeberg central limit theorem states that
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \to 0 \quad\text{as } n \to \infty$$
if the Lindeberg condition is satisfied, that is, if for all $\varepsilon > 0$
$$\sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > \varepsilon\}} \to 0 \quad\text{as } n \to \infty.\tag{3.13}$$
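With $\beta_2$ and $\beta_3$ as in (3.5), observe that for any $0 < \varepsilon < 1$,
$$\begin{aligned}
\beta_2 + \beta_3 &= \sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > 1\}} + \sum_{i=1}^{n} E|\xi_i|^3 1_{\{|\xi_i| \le 1\}}\\
&\le \sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > 1\}} + \sum_{i=1}^{n} E\xi_i^2 1_{\{\varepsilon < |\xi_i| \le 1\}} + \sum_{i=1}^{n}\varepsilon E\xi_i^2 1_{\{|\xi_i| \le \varepsilon\}}\\
&\le \sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > \varepsilon\}} + \varepsilon.
\end{aligned}\tag{3.14}$$
Hence, if the Lindeberg condition (3.13) holds, then (3.14) implies $\beta_2 + \beta_3 \to 0$ as $n \to \infty$, since $\varepsilon$ is arbitrary. Therefore, by Theorems 3.2 and 3.3,
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \le 4(\beta_2 + \beta_3)^{1/2} \to 0 \quad\text{as } n \to \infty,\tag{3.15}$$
thus proving the Lindeberg central limit theorem. In Sect. 3.5 we prove the partial converse, that if $\max_{1\le i\le n} E\xi_i^2 \to 0$, then the Lindeberg condition (3.13) is necessary for normal convergence.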
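The inequality (3.14) can be seen at work on a simple triangular array. The sketch below assumes, purely for illustration, summands $\xi_i = X_i/\sqrt{n}$ with i.i.d. $X_i$ taking the values $-2, 0, 2$ with probabilities $1/8, 3/4, 1/8$; the distribution, function names and values of $n$ and $\varepsilon$ are hypothetical choices, not taken from the text.

```python
import math

# Hypothetical example: EX = 0 and Var(X) = 1 for this three-point distribution.
ATOMS = [(-2.0, 0.125), (0.0, 0.75), (2.0, 0.125)]

def truncated_moments(n):
    """Return (beta2, beta3) from (3.5) for W = (X_1 + ... + X_n) / sqrt(n)."""
    s = math.sqrt(n)
    beta2 = n * sum(p * (x / s) ** 2 for x, p in ATOMS if abs(x / s) > 1)
    beta3 = n * sum(p * abs(x / s) ** 3 for x, p in ATOMS if abs(x / s) <= 1)
    return beta2, beta3

def lindeberg_sum(n, eps):
    """Left hand side of the Lindeberg condition (3.13)."""
    s = math.sqrt(n)
    return n * sum(p * (x / s) ** 2 for x, p in ATOMS if abs(x / s) > eps)

for n in (1, 4, 16, 64, 256, 1024):
    b2, b3 = truncated_moments(n)
    rhs = [round(lindeberg_sum(n, e) + e, 4) for e in (0.1, 0.5, 0.9)]
    print(n, b2 + b3, rhs)
```

For each $n$ the quantity $\beta_2 + \beta_3$ is dominated by every entry of the printed list, and it tends to zero as $n$ grows, as (3.15) requires.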
3.3 Berry–Esseen Inequality: The Bounded Case
In the previous section, the smooth function bounds in Theorem 3.3 (see also Proposition 2.4) are of order $O(\delta)$, while the $L^\infty$ bounds are only of the larger order $O(\delta^{1/2})$. Here, we turn to deriving $L^\infty$ bounds which are of comparable order to those of the smooth function bounds. We will use the notation introduced in Sect. 2.3.1,
$$W = \sum_{i=1}^{n}\xi_i, \qquad W^{(i)} = W - \xi_i,\tag{3.16}$$
and $K_i(t) = E\xi_i(1_{\{0 \le t \le \xi_i\}} - 1_{\{\xi_i \le t < 0\}})$. For bounded $\xi_i$, we are ready to apply (2.27) to obtain the following Berry–Esseen bound.

Theorem 3.4 Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent random variables with zero means satisfying $\sum_{i=1}^{n}\operatorname{Var}(\xi_i) = 1$, and $W^{(i)}$ and $K_i(t)$ as in (3.16). Then
$$\Bigl|\sum_{i=1}^{n}\int_{-\infty}^{\infty} P\bigl(W^{(i)} + t \le z\bigr)K_i(t)\,dt - \Phi(z)\Bigr| \le 2.44\gamma\tag{3.17}$$
where $\gamma$ is given in (3.4). If in addition $|\xi_i| \le \delta_0$ for $1 \le i \le n$, then
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \le 3.3\delta_0.\tag{3.18}$$
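Before starting the proof, we note that by (2.62), we may write (3.17) as
$$|P(W^* \le z) - \Phi(z)| \le 2.44\gamma.\tag{3.19}$$
Proof For $z \in \mathbb{R}$, let $f = f_z$ be the solution of the Stein equation (2.2). From (2.27) and (2.2),
$$\begin{aligned}
E\bigl[Wf(W)\bigr] &= \sum_{i=1}^{n} E\int_{-\infty}^{\infty} f'\bigl(W^{(i)} + t\bigr)K_i(t)\,dt\\
&= \sum_{i=1}^{n}\int_{-\infty}^{\infty} E\Bigl[\bigl(W^{(i)} + t\bigr)f\bigl(W^{(i)} + t\bigr) + 1_{\{W^{(i)} + t \le z\}} - \Phi(z)\Bigr]K_i(t)\,dt.
\end{aligned}$$
Reorganizing this equality, using $\sum_{i=1}^{n}\int_{-\infty}^{\infty} K_i(t)\,dt = 1$ from (2.28), and recalling $K_i(t)$ is real yields
$$\sum_{i=1}^{n}\int_{-\infty}^{\infty} P\bigl(W^{(i)} + t \le z\bigr)K_i(t)\,dt - \Phi(z) = \sum_{i=1}^{n}\int_{-\infty}^{\infty} E\Bigl[Wf(W) - \bigl(W^{(i)} + t\bigr)f\bigl(W^{(i)} + t\bigr)\Bigr]K_i(t)\,dt.\tag{3.20}$$
Now, by (2.10), we may bound the absolute value of (3.20) by
$$\begin{aligned}
&\sum_{i=1}^{n} E\int_{-\infty}^{\infty}\Bigl|Wf(W) - \bigl(W^{(i)} + t\bigr)f\bigl(W^{(i)} + t\bigr)\Bigr|K_i(t)\,dt\\
&\quad= \sum_{i=1}^{n} E\int_{-\infty}^{\infty}\Bigl|\bigl(W^{(i)} + \xi_i\bigr)f\bigl(W^{(i)} + \xi_i\bigr) - \bigl(W^{(i)} + t\bigr)f\bigl(W^{(i)} + t\bigr)\Bigr|K_i(t)\,dt\\
&\quad\le \sum_{i=1}^{n}\int_{-\infty}^{\infty} E\bigl(|W^{(i)}| + \sqrt{2\pi}/4\bigr)\bigl(|\xi_i| + |t|\bigr)K_i(t)\,dt\\
&\quad\le \bigl(1 + \sqrt{2\pi}/4\bigr)\sum_{i=1}^{n}\int_{-\infty}^{\infty}\bigl(E|\xi_i| + |t|\bigr)K_i(t)\,dt,
\end{aligned}$$
since $E(W^{(i)})^2 \le 1$ and $\xi_i$ and $W^{(i)}$ are independent. Hence, recalling (2.25), we have
$$\begin{aligned}
\Bigl|\sum_{i=1}^{n}\int_{-\infty}^{\infty} P\bigl(W^{(i)} + t \le z\bigr)K_i(t)\,dt - \Phi(z)\Bigr| &\le \bigl(1 + \sqrt{2\pi}/4\bigr)\sum_{i=1}^{n}\Bigl(E|\xi_i|\,E\xi_i^2 + \frac{1}{2}E|\xi_i|^3\Bigr)\\
&\le \frac{3}{2}\bigl(1 + \sqrt{2\pi}/4\bigr)\gamma \le 2.44\gamma,
\end{aligned}$$
proving (3.17).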
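The proof uses two properties of $K_i$ as they are applied above from (2.28) and (2.25): $\int K_i(t)\,dt = E\xi_i^2$ and $\int |t| K_i(t)\,dt = \frac{1}{2}E|\xi_i|^3$. The sketch below checks both numerically for one hypothetical mean zero summand, and also evaluates the constant $\frac{3}{2}(1 + \sqrt{2\pi}/4)$. The two-point distribution, the helper names and the integration grid are assumptions made for illustration.

```python
import math

# Hypothetical mean zero summand: xi = -1 with probability 2/3, xi = 2 with probability 1/3.
ATOMS = [(-1.0, 2.0 / 3.0), (2.0, 1.0 / 3.0)]

def K(t):
    """K_i(t) = E xi (1{0 <= t <= xi} - 1{xi <= t < 0}), as in (3.16)."""
    total = 0.0
    for x, p in ATOMS:
        if 0.0 <= t <= x:
            total += p * x
        elif x <= t < 0.0:
            total -= p * x
    return total

def integrate(g, lo, hi, steps=60_000):
    """Midpoint rule; the breakpoints -1, 0, 2 of K fall on cell boundaries for this grid."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (j + 0.5) * h) for j in range(steps))

second_moment = sum(p * x ** 2 for x, p in ATOMS)            # E xi^2 = 2
third_abs_moment = sum(p * abs(x) ** 3 for x, p in ATOMS)    # E |xi|^3 = 10/3
mass = integrate(K, -3.0, 3.0)                               # compare with E xi^2
first_abs = integrate(lambda t: abs(t) * K(t), -3.0, 3.0)    # compare with E|xi|^3 / 2
constant = 1.5 * (1.0 + math.sqrt(2.0 * math.pi) / 4.0)      # compare with 2.44
print(mass, second_moment, first_abs, third_abs_moment / 2.0, constant)
```

The example summand is not normalized to have the variances sum to one; only the two integral identities for a single $K_i$ are being illustrated.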
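The proof would be finished if $P(W^{(i)} + t \le z)$ could be replaced by $P(W \le z)$, since $\sum_{i=1}^{n}\int_{-\infty}^{\infty} K_i(t)\,dt = 1$. Note that $|\xi_i| \le \delta_0$ implies that $K_i(t) = 0$ for $|t| > \delta_0$, and when both $|t|$ and $|\xi_i|$ are bounded by $\delta_0$ then
$$P\bigl(W^{(i)} + t \le z\bigr) = P(W - \xi_i + t \le z) \ge P(W \le z - 2\delta_0).\tag{3.21}$$
Replacing $z$ by $z + 2\delta_0$ in (3.17) and (3.21) we obtain
$$\begin{aligned}
2.44\gamma &\ge \sum_{i=1}^{n}\int_{-\infty}^{\infty} P\bigl(W^{(i)} + t \le z + 2\delta_0\bigr)K_i(t)\,dt - \Phi(z + 2\delta_0)\\
&\ge \sum_{i=1}^{n}\int_{-\infty}^{\infty} P(W \le z)K_i(t)\,dt - \Phi(z + 2\delta_0)\\
&\ge P(W \le z) - \Phi(z) - \frac{2\delta_0}{\sqrt{2\pi}},
\end{aligned}$$
where we have applied (2.28) followed by an elementary inequality. Next, as $|\xi_i| \le \delta_0$ for all $i = 1, \ldots, n$,
$$\gamma = \sum_{i=1}^{n} E|\xi_i|^3 \le \delta_0\sum_{i=1}^{n} E|\xi_i|^2 = \delta_0,$$
from which we now obtain
$$P(W \le z) - \Phi(z) \le 2.44\gamma + \frac{2\delta_0}{\sqrt{2\pi}} \le 3.3\delta_0.\tag{3.22}$$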
The proof is completed by proving the corresponding lower bound using similar reasoning.

The key ingredient in the proof of Theorem 3.4 is to rewrite $E[Wf(W)]$ in terms of a functional of $f'$. We now formulate a result along these same lines, taking as our basis the Stein identity (2.76).

Theorem 3.5 For $W$ any random variable, suppose that for every $z \in \mathbb{R}$ there exist a random variable $R_1$ and random function $\hat K(t) \ge 0$, $t \in \mathbb{R}$, and constants $\delta_0$ and $\delta_1$ not depending on $z$, such that $|ER_1| \le \delta_1$ and
$$EWf_z(W) = E\int_{|t| \le \delta_0} f_z'(W + t)\hat K(t)\,dt + ER_1,\tag{3.23}$$
where $f_z$ is the solution of the Stein equation (2.2). Then
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \le \delta_0\bigl(1.1 + E(|W|\hat K_1)\bigr) + 2.7E|1 - \hat K_1| + \delta_1,\tag{3.24}$$
where $\hat K_1 = E\bigl(\int_{|t| \le \delta_0}\hat K(t)\,dt \mid W\bigr)$.

Proof We can assume $\delta_0 \le 1$ because (3.24) is trivial otherwise. Using that $f_z$ satisfies the Stein equation (2.2), and the nonnegativity of $\hat K(t)$, we have
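$$\begin{aligned}
E\int_{|t| \le \delta_0} f_z'(W + t)\hat K(t)\,dt &= E\int_{|t| \le \delta_0}\bigl(1_{\{W + t \le z\}} - \Phi(z)\bigr)\hat K(t)\,dt + E\int_{|t| \le \delta_0}(W + t)f_z(W + t)\hat K(t)\,dt\\
&\le E\int_{|t| \le \delta_0}\bigl(1_{\{W \le z + \delta_0\}} - \Phi(z)\bigr)\hat K(t)\,dt + E\int_{|t| \le \delta_0}(W + t)f_z(W + t)\hat K(t)\,dt\\
&= E\bigl(1_{\{W \le z + \delta_0\}} - \Phi(z)\bigr)\hat K_1 + E\int_{|t| \le \delta_0}(W + t)f_z(W + t)\hat K(t)\,dt,
\end{aligned}$$
where the inequality holds because $-t \le \delta_0$. Now, writing $\hat K_1 = 1 - (1 - \hat K_1)$, we find that
$$\begin{aligned}
E\int_{|t| \le \delta_0} f_z'(W + t)\hat K(t)\,dt &\le P(W \le z + \delta_0) - \Phi(z) + E|1 - \hat K_1| + E\int_{|t| \le \delta_0}(W + t)f_z(W + t)\hat K(t)\,dt\\
&\le P(W \le z + \delta_0) - \Phi(z + \delta_0) + \frac{\delta_0}{\sqrt{2\pi}} + E|1 - \hat K_1| + E\int_{|t| \le \delta_0}(W + t)f_z(W + t)\hat K(t)\,dt.
\end{aligned}$$
Thus, rearranging and using (3.23) to obtain the first equality,
$$\begin{aligned}
P(W \le z + \delta_0) - \Phi(z + \delta_0) &\ge -\frac{\delta_0}{\sqrt{2\pi}} - E|1 - \hat K_1| + E\int_{|t| \le \delta_0} f_z'(W + t)\hat K(t)\,dt - E\int_{|t| \le \delta_0}(W + t)f_z(W + t)\hat K(t)\,dt\\
&= -\frac{\delta_0}{\sqrt{2\pi}} - E|1 - \hat K_1| + EWf_z(W) - ER_1 - E\int_{|t| \le \delta_0}(W + t)f_z(W + t)\hat K(t)\,dt\\
&= -\frac{\delta_0}{\sqrt{2\pi}} - E|1 - \hat K_1| + E\bigl[Wf_z(W)(1 - \hat K_1)\bigr] - ER_1 + E\int_{|t| \le \delta_0}\bigl[Wf_z(W) - (W + t)f_z(W + t)\bigr]\hat K(t)\,dt\\
&\ge -\frac{\delta_0}{\sqrt{2\pi}} - 2E|1 - \hat K_1| - \delta_1 - E\int_{|t| \le \delta_0}\bigl(|W| + \sqrt{2\pi}/4\bigr)|t|\hat K(t)\,dt,
\end{aligned}$$
this last by (2.7), the hypotheses $|ER_1| \le \delta_1$, and (2.10). Hence,
$$\begin{aligned}
P(W \le z + \delta_0) - \Phi(z + \delta_0) &\ge -\frac{\delta_0}{\sqrt{2\pi}} - 2E|1 - \hat K_1| - \delta_1 - E\int_{|t| \le \delta_0}\bigl(|W| + 0.7\bigr)\delta_0\hat K(t)\,dt\\
&= -\frac{\delta_0}{\sqrt{2\pi}} - 2E|1 - \hat K_1| - \delta_1 - \delta_0 E\bigl(|W| + 0.7\bigr)\hat K_1\\
&\ge -\frac{\delta_0}{\sqrt{2\pi}} - 2E|1 - \hat K_1| - \delta_1 - \delta_0\bigl(E|W|\hat K_1 + 0.7 + 0.7E|1 - \hat K_1|\bigr)\\
&\ge -\delta_0\bigl(1.1 + E|W|\hat K_1\bigr) - 2.7E|1 - \hat K_1| - \delta_1,
\end{aligned}\tag{3.25}$$
recalling that $\delta_0 \le 1$. A similar argument gives
$$P(W \le z - \delta_0) - \Phi(z - \delta_0) \le \delta_0\bigl(1.1 + E|W|\hat K_1\bigr) + 2.7E|1 - \hat K_1| + \delta_1,\tag{3.26}$$
completing the proof of (3.24).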
In Chap. 5 we illustrate how to use Theorem 3.5 to obtain Berry–Esseen bounds in various applications.
3.4 The Berry–Esseen Inequality for Unbounded Variables
Theorem 3.4 demonstrates the Berry–Esseen inequality when $W$ is a sum of uniformly bounded, mean zero, independent random variables $\xi_1, \ldots, \xi_n$ with variances summing to one. Here we drop the boundedness restriction and prove, using two different methods, that there exists a universal constant $C$ such that
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \le C\gamma \quad\text{where}\quad \gamma = \sum_{i=1}^{n} E|\xi_i|^3.\tag{3.27}$$
Tyurin (2010) has shown that $C$ can be taken to be 0.4785. Both of our two approaches, using concentration inequalities in Sect. 3.4.1, and an inductive method in Sect. 3.4.2, lead to somewhat larger constants, but as the sequel shows, these approaches generalize to many cases where the independence condition can be dropped.
3.4.1 The Concentration Inequality Approach
Noting that (3.17) in Theorem 3.4 holds without the uniform boundedness restriction, with $W^{(i)}$ as in (3.16) we see that one can prove the Berry–Esseen inequality more generally by showing that $P(W^{(i)} + t \le z)$ is close to $P(W \le z) = P(W^{(i)} + \xi_i \le z)$, for which it suffices to have a good bound for $P(a \le W^{(i)} \le b)$. Intuitively, the distribution of $W^{(i)}$ is close to the standard normal, and hence we should be able to bound $P(a \le W^{(i)} \le b)$ using some multiple of $b - a$. This heuristic is made precise by the following concentration inequality.

Lemma 3.1 For all real $a < b$, and for every $1 \le i \le n$,
$$P\bigl(a \le W^{(i)} \le b\bigr) \le \sqrt{2}(b - a) + 2(\sqrt{2} + 1)\gamma\tag{3.28}$$
where $\gamma$ is as in (3.27).

We remark that Chen (1998) was the first to apply the concentration inequality approach to independent but non-identically distributed variables. Postponing the proof of (3.28) to the end of this section, we demonstrate the following Berry–Esseen bound with a constant of 9.4.

Theorem 3.6 Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent random variables with zero means, satisfying $\sum_{i=1}^{n}\operatorname{Var}(\xi_i) = 1$. Then $W = \sum_{i=1}^{n}\xi_i$ satisfies
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \le 9.4\gamma \quad\text{where}\quad \gamma = \sum_{i=1}^{n} E|\xi_i|^3.\tag{3.29}$$
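Proof With $W^{(i)}$ and $K_i(t)$ as in (3.16), by (2.25) and (3.28) we have
$$\begin{aligned}
\Bigl|\sum_{i=1}^{n}\int_{-\infty}^{\infty} P\bigl(W^{(i)} + t \le z\bigr)K_i(t)\,dt - P(W \le z)\Bigr| &= \Bigl|\sum_{i=1}^{n}\int_{-\infty}^{\infty}\bigl(P\bigl(W^{(i)} + t \le z\bigr) - P(W \le z)\bigr)K_i(t)\,dt\Bigr|\\
&\le \sum_{i=1}^{n}\int_{-\infty}^{\infty}\bigl|P\bigl(W^{(i)} + t \le z\bigr) - P(W \le z)\bigr|K_i(t)\,dt\\
&= \sum_{i=1}^{n}\int_{-\infty}^{\infty}\bigl|P\bigl(W^{(i)} + t \le z\bigr) - P\bigl(W^{(i)} + \xi_i \le z\bigr)\bigr|K_i(t)\,dt\\
&= \sum_{i=1}^{n}\int_{-\infty}^{\infty} E\,P\bigl(z - t \vee \xi_i \le W^{(i)} \le z - t \wedge \xi_i \mid \xi_i\bigr)K_i(t)\,dt\\
&\le \sum_{i=1}^{n}\int_{-\infty}^{\infty} E\bigl(\sqrt{2}(|t| + |\xi_i|) + 2(\sqrt{2} + 1)\gamma\bigr)K_i(t)\,dt\\
&= \sqrt{2}\sum_{i=1}^{n}\Bigl(\frac{1}{2}E|\xi_i|^3 + E|\xi_i|\,E\xi_i^2\Bigr) + 2(\sqrt{2} + 1)\gamma\\
&\le (3.5\sqrt{2} + 2)\gamma \le 6.95\gamma,
\end{aligned}\tag{3.30}$$
where we have again applied (2.25). Invoking (3.17) now yields the claim.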
As in Theorem 3.2, one can dispense with the third moment assumption in Theorem 3.6 and replace $\gamma$ in (3.29) by $\beta_2 + \beta_3$, defined in (3.5); we leave the details to the reader. Additionally, with a more refined concentration inequality, the constant can be reduced further, resulting in
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \le 4.1(\beta_2 + \beta_3),\tag{3.31}$$
see Chen and Shao (2001).

We now prove the concentration inequality (3.28). The idea is to use the fact that if $f'$ equals the indicator $1_{[a,b]}$ of some interval, then $Ef'(W) = P(a \le W \le b)$. This fixes $f$ up to a constant, and choosing $f((a + b)/2) = 0$ the norm $\|f\| = (b - a)/2$ takes on its minimal value, yielding the smallest factor in the right hand side of the inequality
$$|EWf(W)| \le \frac{1}{2}(b - a)E|W| \le \frac{1}{2}(b - a),$$
which holds whenever $EW^2 \le 1$.

Proof of Lemma 3.1 Define $\delta = \gamma$ and take
$$f(w) = \begin{cases} -\frac{1}{2}(b - a) - \delta & \text{if } w < a - \delta,\\ w - \frac{1}{2}(b + a) & \text{if } a - \delta \le w \le b + \delta,\\ \frac{1}{2}(b - a) + \delta & \text{for } w > b + \delta, \end{cases}\tag{3.32}$$
so that $f' = 1_{[a - \delta, b + \delta]}$, and $\|f\| = \frac{1}{2}(b - a) + \delta$. Set
$$\hat K_j(t) = \xi_j\bigl(1_{\{-\xi_j \le t \le 0\}} - 1_{\{0 < t \le -\xi_j\}}\bigr) \quad\text{and}\quad \hat K(t) = \sum_{j=1}^{n}\hat K_j(t).\tag{3.33}$$
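Since $\xi_j$ and $W^{(i)} - \xi_j$ are independent for $j \ne i$, $\xi_i$ is independent of $W^{(i)}$, and $E\xi_j = 0$ for all $j$, similarly to (2.27), we have
$$\begin{aligned}
EW^{(i)}f\bigl(W^{(i)}\bigr) - E\xi_i f\bigl(W^{(i)} - \xi_i\bigr) &= \sum_{j=1}^{n} E\xi_j\bigl(f\bigl(W^{(i)}\bigr) - f\bigl(W^{(i)} - \xi_j\bigr)\bigr)\\
&= \sum_{j=1}^{n} E\xi_j\int_{-\xi_j}^{0} f'\bigl(W^{(i)} + t\bigr)\,dt\\
&= \sum_{j=1}^{n} E\int_{-\infty}^{\infty} f'\bigl(W^{(i)} + t\bigr)\hat K_j(t)\,dt\\
&= E\int_{-\infty}^{\infty} f'\bigl(W^{(i)} + t\bigr)\hat K(t)\,dt.
\end{aligned}\tag{3.34}$$
Noting that $f'(t) \ge 0$ and $\hat K(t) \ge 0$, we have by the definition of $f$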
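$$\begin{aligned}
E\int_{-\infty}^{\infty} f'\bigl(W^{(i)} + t\bigr)\hat K(t)\,dt &\ge E\int_{|t| \le \delta} f'\bigl(W^{(i)} + t\bigr)\hat K(t)\,dt\\
&\ge E\,1_{\{a \le W^{(i)} \le b\}}\int_{|t| \le \delta}\hat K(t)\,dt.
\end{aligned}$$
Letting $K(t) = E\hat K(t)$, we may write this last expression as
$$E\,1_{\{a \le W^{(i)} \le b\}}\int_{|t| \le \delta}\bigl(\hat K(t) - K(t)\bigr)\,dt + P\bigl(a \le W^{(i)} \le b\bigr)\int_{|t| \le \delta} K(t)\,dt.\tag{3.35}$$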
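As in (2.28) and (2.25), respectively, the function $K(t)$ is a density and $E|T| = \gamma/2$ for $T$ so distributed. Hence, for the integral in the second term of (3.35), recalling $\delta = \gamma$,
$$\int_{|t| \le \delta} K(t)\,dt = P\bigl(|T| \le \delta\bigr) = 1 - P\bigl(|T| > \delta\bigr) \ge 1 - \frac{\gamma}{2\delta} = 1/2.$$
For the first term of (3.35), applying the Cauchy–Schwarz inequality and integrating yields the bound
$$\begin{aligned}
\Bigl(\operatorname{Var}\Bigl(\int_{|t| \le \delta}\hat K(t)\,dt\Bigr)\Bigr)^{1/2} &\le \Bigl(\operatorname{Var}\Bigl(\sum_{j=1}^{n}|\xi_j|\min\bigl(\delta, |\xi_j|\bigr)\Bigr)\Bigr)^{1/2}\\
&\le \Bigl(\sum_{j=1}^{n} E\xi_j^2\min\bigl(\delta, |\xi_j|\bigr)^2\Bigr)^{1/2}\\
&\le \delta\Bigl(\sum_{j=1}^{n} E\xi_j^2\Bigr)^{1/2} = \delta.
\end{aligned}$$
Hence, from (3.34) and (3.35) we obtain
$$EW^{(i)}f\bigl(W^{(i)}\bigr) - E\xi_i f\bigl(W^{(i)} - \xi_i\bigr) \ge \frac{1}{2}P\bigl(a \le W^{(i)} \le b\bigr) - \delta.\tag{3.36}$$
On the other hand, recalling that $\|f\| \le \frac{1}{2}(b - a) + \delta$, we have
$$\begin{aligned}
EW^{(i)}f\bigl(W^{(i)}\bigr) - E\xi_i f\bigl(W^{(i)} - \xi_i\bigr) &\le \Bigl(\frac{1}{2}(b - a) + \delta\Bigr)\bigl(E|W^{(i)}| + E|\xi_i|\bigr)\\
&\le \frac{1}{\sqrt{2}}(b - a + 2\delta)\Bigl(\bigl(E|W^{(i)}|\bigr)^2 + \bigl(E|\xi_i|\bigr)^2\Bigr)^{1/2}\\
&\le \frac{1}{\sqrt{2}}(b - a + 2\delta)\Bigl(E|W^{(i)}|^2 + E|\xi_i|^2\Bigr)^{1/2}\\
&= \frac{1}{\sqrt{2}}(b - a + 2\delta).
\end{aligned}\tag{3.37}$$
Combining (3.36) and (3.37) thus gives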
$$P\bigl(a \le W^{(i)} \le b\bigr) \le \sqrt{2}(b - a) + (2\sqrt{2} + 2)\delta = \sqrt{2}(b - a) + 2(\sqrt{2} + 1)\gamma$$
as desired.
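For a sum of fair random signs the probability $P(a \le W^{(i)} \le b)$ is an exact binomial sum, so the concentration inequality (3.28) can be compared with the truth. The sketch below does this for the assumed example $\xi_j = \pm n^{-1/2}$ with $n = 100$; the function names, the grid of intervals and the value of $n$ are illustrative choices.

```python
import math

def interval_probability(m, scale, a, b):
    """P(a <= S <= b) for S = scale * (eps_1 + ... + eps_m) with fair signs eps_j."""
    total = 0.0
    for k in range(m + 1):
        value = scale * (2 * k - m)
        if a <= value <= b:
            total += math.comb(m, k) / 2 ** m
    return total

n = 100                                   # hypothetical number of summands
gamma = 1.0 / math.sqrt(n)                # sum of E|xi_j|^3 for xi_j = eps_j / sqrt(n)
scale = 1.0 / math.sqrt(n)

def rhs_3_28(a, b):
    """Right hand side of the concentration inequality (3.28)."""
    return math.sqrt(2.0) * (b - a) + 2.0 * (math.sqrt(2.0) + 1.0) * gamma

# W^(i) is a sum of n - 1 signs; scan a grid of intervals [a, a + width].
worst_slack = min(
    rhs_3_28(a, a + width) - interval_probability(n - 1, scale, a, a + width)
    for a in [k / 10.0 for k in range(-30, 31)]
    for width in (0.05, 0.2, 0.5, 1.0)
)
print(worst_slack)
```

The slack is positive throughout, and fairly large, which reflects that (3.28) is designed for generality rather than for sharp constants.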
By reasoning as above, and as in the proofs of Theorem 8.1 and Propositions 10.1 and 10.2, one can prove the following stronger concentration inequality.

Proposition 3.1 If $W$ is the sum of the independent mean zero random variables $\xi_1, \ldots, \xi_n$, then for all real $a < b$
$$P(a \le W \le b) \le b - a + 2(\beta_2 + \beta_3)\tag{3.38}$$
where $\beta_2$ and $\beta_3$ are defined in (3.5). In addition, if $W^{(i)} = W - \xi_i$, then
$$P\bigl(a \le W^{(i)} \le b\bigr) \le \sqrt{2}(b - a) + (\sqrt{2} + 1)(\beta_2 + \beta_3)\tag{3.39}$$
for every $1 \le i \le n$.

We leave the proof to the reader. Clearly, $\beta_2 + \beta_3 \le \gamma$, so Proposition 3.1 not only relaxes the moment assumption required by (3.28) but improves the constant as well.
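The comparison $\beta_2 + \beta_3 \le \gamma$ holds because $\xi_i^2 \le |\xi_i|^3$ on the event $\{|\xi_i| > 1\}$. As a small illustration, the sketch below computes the three quantities for a hypothetical collection of discrete summands whose variances sum to one; the helper name `moments` and the two distributions are assumptions of this example.

```python
def moments(summands):
    """summands: list of distributions, each a list of (value, probability) pairs.

    Returns (beta2, beta3, gamma) as defined in (3.5) and (3.27).
    """
    beta2 = sum(p * x ** 2 for dist in summands for x, p in dist if abs(x) > 1)
    beta3 = sum(p * abs(x) ** 3 for dist in summands for x, p in dist if abs(x) <= 1)
    gamma = sum(p * abs(x) ** 3 for dist in summands for x, p in dist)
    return beta2, beta3, gamma

# Hypothetical example: one large summand and eight small ones.
big = [(-1.2, 0.25), (1.2, 0.25), (0.0, 0.5)]        # mean 0, variance 0.72
small = [(-0.5, 0.07), (0.5, 0.07), (0.0, 0.86)]     # mean 0, variance 0.035
summands = [big] + [small] * 8

total_variance = sum(p * x ** 2 for dist in summands for x, p in dist)
beta2, beta3, gamma = moments(summands)
print(total_variance, beta2, beta3, gamma)
```

Only the large summand contributes to $\beta_2$, and it does so through its second moment rather than its larger third moment, which is where the gain over $\gamma$ comes from.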
3.4.2 An Inductive Approach
In this section we prove the following Berry–Esseen inequality by induction.

Theorem 3.7 Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent random variables with zero means, satisfying $\sum_{i=1}^{n}\operatorname{Var}(\xi_i) = 1$. Then $W = \sum_{i=1}^{n}\xi_i$ satisfies
$$\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| \le 10\gamma \quad\text{where}\quad \gamma = \sum_{i=1}^{n} E|\xi_i|^3.\tag{3.40}$$
Though the constant produced is not optimal, the inductive approach is quite useful in more general settings when the removal of some variables leaves a structure similar to the original one; see Theorem 6.2 in Sect. 6.1.1 for one example involving dependence where the inductive method succeeds, and references to other such examples. Use of induction in the independent case appears in the text of Stroock (2000).

Proof Without loss of generality we may assume $E\xi_i^2 > 0$ for all $i = 1, \ldots, n$. Let
$$\tau_i^2 = E\bigl(W^{(i)}\bigr)^2 \quad\text{and}\quad \tau = \min_{1 \le i \le n}\tau_i.$$
Since (3.40) is trivial if $\gamma \ge 1/10$, we can assume $\gamma < 1/10$. Since
$$1 = EW^2 = E\bigl(W^{(i)} + \xi_i\bigr)^2 = E\bigl(W^{(i)}\bigr)^2 + E\xi_i^2 \le E\bigl(W^{(i)}\bigr)^2 + \bigl(E|\xi_i|^3\bigr)^{2/3},$$
we have
$$\tau^2 \ge 1 - \gamma^{2/3} \ge 0.7845.\tag{3.41}$$
When $n = 1$, since $\gamma = E|\xi_1|^3 \ge (E\xi_1^2)^{3/2} = 1$, inequality (3.40) is trivially true. Now take $n \ge 2$ and assume that (3.40) has been established for a sum composed of fewer than $n$ summands. Then for all $i = 1, \ldots, n$ and $a < b$, with $C = 10$ we have
$$\begin{aligned}
P\bigl(a < W^{(i)} \le b\bigr) &= \Phi(b/\tau_i) - \Phi(a/\tau_i) + P\bigl(W^{(i)} \le b\bigr) - \Phi(b/\tau_i) - \bigl(P\bigl(W^{(i)} \le a\bigr) - \Phi(a/\tau_i)\bigr)\\
&\le \frac{b - a}{\sqrt{2\pi}\,\tau_i} + \frac{2C}{\tau_i^3}\sum_{j \ne i} E|\xi_j|^3\\
&\le 2.88C\gamma + (b - a)/2,
\end{aligned}\tag{3.42}$$
using (3.41) twice in the final inequality. Let $\xi_i^*$ have the $\xi_i$-zero bias distribution and be independent of $\xi_j$, $j \ne i$, and let $I$ be a random index, independent of all other variables, with distribution (2.60). Then, by Lemma 2.8, letting $\delta = 2\gamma$, we have
$$\begin{aligned}
P(W^* \le z) - P(W \le z - 2\delta) &= P\bigl(W^{(I)} + \xi_I^* \le z\bigr) - P\bigl(W^{(I)} + \xi_I \le z - 2\delta\bigr)\\
&\ge -E\,P\bigl(z - \xi_I^* \le W^{(I)} \le z - \xi_I - 2\delta \mid \xi_I, \xi_I^*\bigr)1\bigl(\xi_I^* \ge \xi_I + 2\delta\bigr)\\
&\ge -E\bigl(2.88C\gamma + (\xi_I^* - \xi_I)/2 - \delta\bigr)1\bigl(\xi_I^* \ge \xi_I + 2\delta\bigr)\\
&\ge -2.88C\gamma\,P\bigl(\xi_I^* - \xi_I \ge 2\delta\bigr) - E\bigl(\xi_I^* - \xi_I\bigr)1\bigl(\xi_I^* \ge \xi_I + 2\delta\bigr)/2 - \delta,
\end{aligned}$$
where we have invoked (3.42) to obtain the second inequality. By Theorem 4.3, $\xi_i$ and $\xi_i^*$ may be coupled so that
$$E|\xi_i^* - \xi_i| \le \frac{E|\xi_i|^3}{2E\xi_i^2} \quad\text{so, by (2.60),}\quad E|\xi_I^* - \xi_I| \le \gamma/2.$$
But now
$$P\bigl(\xi_I^* - \xi_I \ge 2\delta\bigr) \le \gamma/(4\delta) \quad\text{and}\quad E\bigl(\xi_I^* - \xi_I\bigr)1\bigl(\xi_I^* \ge \xi_I + 2\delta\bigr) \le \gamma/2.$$
Hence, recalling $\delta = 2\gamma$,
$$P(W^* \le z) - P(W \le z - 2\delta) \ge -2.88C\gamma/8 - \gamma/4 - 2\gamma = -5.85\gamma.$$
Thus, by (3.19),
$$\begin{aligned}
P(W \le z - 2\delta) - \Phi(z - 2\delta) &\le P(W^* \le z) - \Phi(z - 2\delta) + 5.85\gamma\\
&\le 2.44\gamma + \frac{4\gamma}{\sqrt{2\pi}} + 5.85\gamma < 10\gamma.
\end{aligned}$$
Similarly, we may obtain
$$P(W \le z + 2\delta) - \Phi(z + 2\delta) \ge -10\gamma,$$
thus completing the proof.
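The numerical constants in this induction (0.7845, 2.88, the factor $1/2$, 5.85 and the final comparison with 10) are simple arithmetic, and the short sketch below recomputes them. The variable names are illustrative and the script only restates the bookkeeping of the proof under $C = 10$ and $\gamma < 1/10$.

```python
import math

C = 10.0
gamma_max = 1.0 / C                                    # the proof assumes gamma < 1/10
tau_sq = 1.0 - gamma_max ** (2.0 / 3.0)                # lower bound on tau^2, see (3.41)
coef_gamma = 2.0 / tau_sq ** 1.5                       # multiplies C * gamma in (3.42)
coef_width = 1.0 / (math.sqrt(2.0 * math.pi) * math.sqrt(tau_sq))   # multiplies (b - a)
loss = 2.88 * C / 8.0 + 0.25 + 2.0                     # constant multiplying gamma in the coupling step
total = 2.44 + 4.0 / math.sqrt(2.0 * math.pi) + loss   # final constant, compared with 10
print(tau_sq, coef_gamma, coef_width, loss, total)
```

The final constant comes out just below 10, so the induction closes with little room to spare for this choice of $C$.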
3.5 A Lower Berry–Esseen Bound
Again, let $\xi_1, \ldots, \xi_n$ be independent random variables with zero means satisfying $\sum_{i=1}^{n}\operatorname{Var}(\xi_i) = 1$. Feller (1935) and Lévy (1935) proved independently (see LeCam 1986) that if the Feller–Lévy condition
$$\max_{1 \le i \le n} E\xi_i^2 \to 0,\tag{3.43}$$
is satisfied, then the Lindeberg condition (3.13) is necessary for the central limit theorem. The theorem below is due to Hall and Barbour (1984) who used Stein’s method to provide not only a nice proof of the necessity, but also a lower bound for the $L^\infty$ distance between the distribution of $W$ and the normal.

Theorem 3.8 Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent random variables with zero means and finite variances $E\xi_i^2 = \sigma_i^2$, $1 \le i \le n$, satisfying $\sum_{i=1}^{n}\sigma_i^2 = 1$, and let $W = \sum_{i=1}^{n}\xi_i$. Then there exists an absolute constant $C$ such that for all $\varepsilon > 0$,
$$\bigl(1 - e^{-\varepsilon^2/4}\bigr)\sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > \varepsilon\}} \le C\Bigl(\sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)| + \sum_{i=1}^{n}\sigma_i^4\Bigr).\tag{3.44}$$
Clearly, the Feller–Lévy condition (3.43) implies that $\sum_{i=1}^{n}\sigma_i^4 \le \max_{1 \le i \le n}\sigma_i^2 \to 0$ as $n \to \infty$. Therefore, if $W$ is asymptotically normal,
$$\sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > \varepsilon\}} \to 0$$
as $n \to \infty$ for every $\varepsilon > 0$, that is, the Lindeberg condition is satisfied.

Proof Once again, the argument starts with the Stein equation
$$E\bigl(f_h'(W) - Wf_h(W)\bigr) = Eh(W) - Nh,\tag{3.45}$$
for a function $h$ yet to be chosen. Taking $h$ absolutely continuous with $\int_{-\infty}^{\infty}|h'(w)|\,dw < \infty$, we may integrate by parts and obtain the bound
$$|Eh(W) - Nh| = \Bigl|\int_{-\infty}^{\infty} h'(w)\bigl(P(W \le w) - \Phi(w)\bigr)\,dw\Bigr| \le \delta\int_{-\infty}^{\infty}|h'(w)|\,dw,\tag{3.46}$$
where $\delta = \sup_{z\in\mathbb{R}}|P(W \le z) - \Phi(z)|$. For the left hand side of (3.45), in the usual way, because $\xi_i$ and $W^{(i)} = W - \xi_i$ are independent, and $E\xi_i = 0$, we have
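$$EWf_h(W) = \sum_{i=1}^{n} E\xi_i^2 f_h'\bigl(W^{(i)}\bigr) + \sum_{i=1}^{n} E\,\xi_i\bigl(f_h\bigl(W^{(i)} + \xi_i\bigr) - f_h\bigl(W^{(i)}\bigr) - \xi_i f_h'\bigl(W^{(i)}\bigr)\bigr),$$
and, because $\sum_{i=1}^{n}\sigma_i^2 = 1$,
$$Ef_h'(W) = \sum_{i=1}^{n}\sigma_i^2 Ef_h'\bigl(W^{(i)}\bigr) + \sum_{i=1}^{n}\sigma_i^2 E\bigl(f_h'(W) - f_h'\bigl(W^{(i)}\bigr)\bigr),$$
with the last term easily bounded by $\frac{1}{2}\|f_h'''\|\sum_{i=1}^{n}\sigma_i^4$. Hence
$$\Bigl|E\bigl(f_h'(W) - Wf_h(W)\bigr) - \sum_{i=1}^{n} E\xi_i^2 g\bigl(W^{(i)}, \xi_i\bigr)\Bigr| \le \frac{1}{2}\|f_h'''\|\sum_{i=1}^{n}\sigma_i^4,\tag{3.47}$$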
where $g(w, y) = g_h(w, y) = -y^{-1}\bigl(f_h(w + y) - f_h(w) - yf_h'(w)\bigr)$. Intuitively, if the distribution of $W$ is close to that of the standard normal $Z$, taken to be independent of the $\xi_i$’s, then
$$R_1 := \sum_{i=1}^{n} E\xi_i^2 g\bigl(W^{(i)}, \xi_i\bigr) \quad\text{and}\quad R := \sum_{i=1}^{n} E\xi_i^2 g(Z, \xi_i),$$
should be close to one another. Taking (3.46) and (3.47) together, we will be able to compute a lower bound for $\delta$, if we can produce an absolutely continuous function $h$ satisfying $\int_{-\infty}^{\infty}|h'(w)|\,dw < \infty$ for which $Eg_h(Z, y)$ is of constant sign, provided also that $\|f_h'''\| < \infty$. In practice, it is easier to look for a suitable $f$, and then define $h(w) = f'(w) - wf(w)$. The function $g$ is zero for any linear function $f$, and when $f$ is an even function then $Eg(Z, y)$ is odd. Choosing $f$ to be the odd function $f(y) = y^3$ yields $Eg(Z, y) = -y^2$, of constant sign. Unfortunately, this $f$ fails to yield an $h$ satisfying $\int_{-\infty}^{\infty}|h'(w)|\,dw < \infty$.

A good choice is $f(w) = we^{-w^2/2}$, which behaves much like the sum of a linear and a cubic function for those values of $w$ where $Z$ puts most of its mass, yet decays to zero quickly when $|w|$ is large. Making the computations, we have
$$\begin{aligned}
Eg(Z, y) &= -\frac{y^{-1}}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\Bigl((w + y)e^{-(w + y)^2/2} - we^{-w^2/2} - ye^{-w^2/2}\bigl(1 - w^2\bigr)\Bigr)e^{-w^2/2}\,dw\\
&= \frac{1}{2\sqrt{2}}\bigl(1 - e^{-y^2/4}\bigr),
\end{aligned}\tag{3.48}$$
a nonnegative function which satisfies
$$Eg(Z, y) \ge \frac{1}{2\sqrt{2}}\bigl(1 - e^{-\varepsilon^2/4}\bigr) \quad\text{whenever } |y| \ge \varepsilon$$
for all $\varepsilon > 0$. Hence, for this choice of $f$ we have
$$R \ge \frac{1}{2\sqrt{2}}\bigl(1 - e^{-\varepsilon^2/4}\bigr)\sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > \varepsilon\}}.\tag{3.49}$$
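The closed form (3.48) can be confirmed by numerical integration. The following sketch evaluates $Eg(Z, y)$ for $f(w) = we^{-w^2/2}$ with the trapezoidal rule and compares it with $(1 - e^{-y^2/4})/(2\sqrt{2})$; the integration range, the step count and the test values of $y$ are arbitrary choices made for illustration.

```python
import math

def f(w):
    return w * math.exp(-0.5 * w * w)

def f_prime(w):
    return (1.0 - w * w) * math.exp(-0.5 * w * w)

def g(w, y):
    """g(w, y) = -(f(w + y) - f(w) - y f'(w)) / y, as defined after (3.47)."""
    return -(f(w + y) - f(w) - y * f_prime(w)) / y

def expected_g(y, half_width=12.0, steps=24_000):
    """E g(Z, y) for standard normal Z, by the trapezoidal rule on [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        w = -half_width + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * g(w, y) * math.exp(-0.5 * w * w)
    return total * h / math.sqrt(2.0 * math.pi)

def closed_form(y):
    """Right hand side of (3.48)."""
    return (1.0 - math.exp(-0.25 * y * y)) / (2.0 * math.sqrt(2.0))

for y in (-2.0, -0.5, 0.5, 1.0, 3.0):
    print(y, expected_g(y), closed_form(y))
```

The agreement for negative as well as positive $y$ illustrates that $Eg(Z, y)$ is even in $y$ when $f$ is odd.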
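It thus remains to show that $R$ and $R_1$ are close enough, after which (3.1), (3.47) and (3.49) complete the proof.

For this step, note that for $f(w) = we^{-w^2/2}$ and $h(w) = f'(w) - wf(w)$ we have
$$c_1 := \int_{-\infty}^{\infty}|h'(w)|\,dw \le 7, \qquad c_2 := \int_{-\infty}^{\infty}|f''(w)|\,dw \le 4; \quad\text{and}\quad c_3 := \sup_{w}|f'''(w)| = 3.$$
Now define an intermediate quantity $R_2$ between $R_1$ and $R$, by
$$R_2 := \sum_{i=1}^{n} E\xi_i^2 g(W', \xi_i),$$
where $W'$ has the same distribution as $W$, but is independent of the $\xi_i$’s. Then
$$\begin{aligned}
R_1 &= -\sum_{i=1}^{n} E\Bigl[\xi_i^2\int_{0}^{1}\bigl(f'\bigl(W^{(i)} + t\xi_i\bigr) - f'\bigl(W^{(i)}\bigr)\bigr)\,dt\Bigr]\\
&= R_2 + \sum_{i=1}^{n} E\Bigl[\xi_i^2\int_{0}^{1}\bigl(f'(W' + t\xi_i) - f'\bigl(W^{(i)} + t\xi_i\bigr)\bigr)\,dt\Bigr] - \sum_{i=1}^{n} E\Bigl[\xi_i^2\int_{0}^{1}\bigl(f'(W') - f'\bigl(W^{(i)}\bigr)\bigr)\,dt\Bigr].
\end{aligned}\tag{3.50}$$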
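Now, for any $\theta$, using that $W'$ and $W$ have the same distribution, that $\xi_i$ and $W^{(i)}$ are independent, and that $E\xi_i = 0$,
$$\begin{aligned}
\bigl|E\bigl(f'(W' + \theta) - f'\bigl(W^{(i)} + \theta\bigr)\bigr)\bigr| &= \bigl|E\bigl(f'\bigl(W^{(i)} + \xi_i + \theta\bigr) - f'\bigl(W^{(i)} + \theta\bigr)\bigr)\bigr|\\
&= \bigl|E\bigl(f'\bigl(W^{(i)} + \xi_i + \theta\bigr) - f'\bigl(W^{(i)} + \theta\bigr) - \xi_i f''\bigl(W^{(i)} + \theta\bigr)\bigr)\bigr|\\
&\le \frac{1}{2}c_3\sigma_i^2,
\end{aligned}$$
by Taylor’s theorem. Hence, from (3.50),
$$R_1 \ge R_2 - c_3\sum_{i=1}^{n}\sigma_i^4.\tag{3.51}$$
Similarly,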
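$$R_2 = R + \sum_{i=1}^{n} E\Bigl[\xi_i^2\int_{0}^{1}\bigl(f'(Z + t\xi_i) - f'(W' + t\xi_i)\bigr)\,dt\Bigr] - \sum_{i=1}^{n} E\Bigl[\xi_i^2\int_{0}^{1}\bigl(f'(Z) - f'(W')\bigr)\,dt\Bigr],$$
and, for any $\theta$, as $\int_{-\infty}^{\infty}|f''(w)|\,dw = c_2 < \infty$,
$$\bigl|Ef'(W' + \theta) - Ef'(Z + \theta)\bigr| = \Bigl|\int_{-\infty}^{\infty} f''(w)\bigl(P(W \le w - \theta) - \Phi(w - \theta)\bigr)\,dw\Bigr| \le c_2\delta,$$
so that
$$R_2 \ge R - 2c_2\delta.\tag{3.52}$$
Combining (3.46) and (3.47) with (3.51) and (3.52), it follows that
$$c_1\delta \ge R_1 - \frac{1}{2}c_3\sum_{i=1}^{n}\sigma_i^4 \ge R - \frac{3}{2}c_3\sum_{i=1}^{n}\sigma_i^4 - 2c_2\delta.$$
In view of (3.49), collecting terms, it follows that
$$\delta(c_1 + 2c_2) + \frac{3}{2}c_3\sum_{i=1}^{n}\sigma_i^4 \ge \frac{1}{2\sqrt{2}}\bigl(1 - e^{-\varepsilon^2/4}\bigr)\sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > \varepsilon\}}\tag{3.53}$$
for any $\varepsilon > 0$. This proves (3.44), with $C \le 43$.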
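The constants $c_1 \le 7$, $c_2 \le 4$, $c_3 = 3$ and the resulting value $C \le 43$ can be checked numerically. The sketch below uses the derivatives of $f(w) = we^{-w^2/2}$ computed by hand, namely $h'(w) = (2w^3 - 5w)e^{-w^2/2}$, $f''(w) = (w^3 - 3w)e^{-w^2/2}$ and $f'''(w) = (-w^4 + 6w^2 - 3)e^{-w^2/2}$; these formulas, the integration grid and the reading of (3.53) as $C = 2\sqrt{2}\max(c_1 + 2c_2, \tfrac{3}{2}c_3)$ are assumptions of this illustration rather than statements from the text.

```python
import math

def h_prime(w):
    """h'(w) for h(w) = f'(w) - w f(w) = (1 - 2 w^2) exp(-w^2/2)."""
    return (2.0 * w ** 3 - 5.0 * w) * math.exp(-0.5 * w * w)

def f_second(w):
    """f''(w) for f(w) = w exp(-w^2/2)."""
    return (w ** 3 - 3.0 * w) * math.exp(-0.5 * w * w)

def f_third(w):
    """f'''(w) for f(w) = w exp(-w^2/2)."""
    return (-w ** 4 + 6.0 * w ** 2 - 3.0) * math.exp(-0.5 * w * w)

def l1_norm(fn, half_width=12.0, steps=48_000):
    """Integral of |fn| over [-half_width, half_width] by the midpoint rule."""
    step = 2.0 * half_width / steps
    return step * sum(abs(fn(-half_width + (j + 0.5) * step)) for j in range(steps))

c1 = l1_norm(h_prime)
c2 = l1_norm(f_second)
c3 = max(abs(f_third(k / 1000.0)) for k in range(-12_000, 12_001))
C_from_stated_bounds = 2.0 * math.sqrt(2.0) * max(7.0 + 2.0 * 4.0, 1.5 * 3.0)
C_from_computed_values = 2.0 * math.sqrt(2.0) * max(c1 + 2.0 * c2, 1.5 * c3)
print(c1, c2, c3, C_from_stated_bounds, C_from_computed_values)
```

Using the rounded bounds $7$ and $4$ gives a value just under 43, while the numerically computed $c_1$ and $c_2$ give a slightly smaller constant.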
Chapter 4
L1 Bounds
In this chapter we focus on normal approximation using smooth functions, and the $L^1$ norm in particular. We begin with a discussion of distances induced by function classes. Any class of functions $\mathcal{H}$ mapping $\mathbb{R}$ to $\mathbb{R}$ induces a measure of the separation between the distributions $\mathcal{L}(X)$ and $\mathcal{L}(Y)$ of the random variables $X$ and $Y$ by
$$\|\mathcal{L}(X) - \mathcal{L}(Y)\|_{\mathcal{H}} = \sup_{h\in\mathcal{H}}|Eh(X) - Eh(Y)|.\tag{4.1}$$
Certain choices of $\mathcal{H}$ lead to classical distances, for instance, taking
$$\mathcal{H} = \bigl\{1(x \le z),\ z \in \mathbb{R}\bigr\}\tag{4.2}$$
leads to the Kolmogorov, $L^\infty$, or supremum norm distance, while the class of measurable functions
$$\mathcal{H} = \bigl\{h : 0 \le h(x) \le 1,\ \forall x \in \mathbb{R}\bigr\}\tag{4.3}$$
leads to the total variation distance. Calculations with smooth functions are typically simpler than those with functions such as the discontinuous indicators in (4.2), or the bounded measurable functions in (4.3).

Our main focus in this chapter is the $L^1$ distance, given by (4.1) with $\mathcal{H} = \mathcal{L}$, the collection of Lipschitz functions in (4.7). In Sect. 4.8 we move to the distance $\|\mathcal{L}(W) - \mathcal{L}(Z)\|_{\mathcal{H}_{m,\infty}}$, produced by taking $\mathcal{H}$ to be the collection of functions $\mathcal{H}_{m,\infty}$ defined in (4.183), a class including functions allowed to possess some small number of additional higher order derivatives. Our $L^1$ examples include: the sums of independent random variables and an associated contraction principle, hierarchical structures, cone measure on the sphere, combinatorial central limit theorems, simple random sampling, coverage processes, and locally dependent random variables. To illustrate our approach for the smooth functions $\mathcal{H}_{m,\infty}$ we show how fast convergence rates may result under a vanishing third moment assumption.
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_4, © Springer-Verlag Berlin Heidelberg 2011

The use of Stein’s method for $L^1$ approximation was pioneered by Erickson (1974). We begin now by recalling that the $L^1$ distance between distribution functions $F$ and $G$ is defined by
63
64
4
F − G1 =
∞ −∞
F (t) − G(t)dt.
L1 Bounds
(4.4)
This distance has a number of equivalent forms, and, perhaps for that reason, is known by many names, including Gini’s measure of discrepancy, the Kantorovich metric (see Rachev 1984), and the Wasserstein, Dudley, and the Fortet–Mourier distance (see e.g., Barbour et al. 1992). In addition to writing the $L^1$ distance as in (4.4), we will also let $\|\mathcal{L}(X) - \mathcal{L}(Y)\|_1$ denote the $L^1$ distance between the distributions of random variables $X$ and $Y$. That zero biasing seems to be particularly suited to produce $L^1$ bounds is evidenced in the following theorem from Goldstein (2004).
Theorem 4.1 Let $W$ be a mean zero, variance 1 random variable with distribution function $F$ and let $W^*$ have the $W$-zero biased distribution and be defined on the same space as $W$. Then, with $\Phi$ the cumulative distribution function of the standard normal,
$$\|F - \Phi\|_1 \le 2E|W^* - W|. \tag{4.5}$$
As there may exist many couplings of $W$ and $W^*$ on a joint space, the challenge in producing good $L^1$ bounds is to find one in which the variables are close. Before proving Theorem 4.1, we recall some facts about the $L^1$ norm which can be found in Rachev (1984). First, the ‘dual form’ of the $L^1$ distance is given by
$$\|F - G\|_1 = \inf E|X - Y|, \tag{4.6}$$
where the infimum is over all couplings of $X$ and $Y$ on a joint space with marginal distributions $F$ and $G$, respectively. As $\mathbb{R}$ is a Polish space, this infimum is achieved. A yet equivalent form of the $L^1$ distance is given by (4.1) with $\mathbb{L}$ the collection of Lipschitz functions
$$\mathbb{L} = \{h : \mathbb{R} \to \mathbb{R} : |h(y) - h(x)| \le |y - x|\}, \tag{4.7}$$
that is,
$$\|\mathcal{L}(Y) - \mathcal{L}(X)\|_1 = \sup_{h \in \mathbb{L}} |Eh(Y) - Eh(X)|. \tag{4.8}$$
We will also make use of the fact that the elements in $\mathbb{L}$ are exactly those absolutely continuous functions whose derivatives are (a.e.) bounded by 1 in absolute value. Though the $L^1$ distance is, therefore, just one example of a metric induced by a collection of smooth functions such as those we will study in Sect. 4.8, its many equivalent forms lead to a rich theory which accommodates numerous examples. Part (ii) of Proposition 2.4 leads directly to the following proof of Theorem 4.1.
Proof First, let $(W, W^*)$ achieve the infimum $\|W - W^*\|_1$ in (4.6). As (2.77) holds with $\Delta = W^* - W$, (2.79) yields
$$|Eh(W) - Nh| \le 2\|h'\|\, E|W - W^*| = 2\|h'\|\, \|W - W^*\|_1.$$
Taking supremum over $h \in \mathbb{L}$, for which $\|h'\| \le 1$, and using (4.8) shows
$$\|F - \Phi\|_1 \le 2\|W - W^*\|_1. \tag{4.9}$$
Now for $(W, W^*)$ any coupling of $W$ to a variable $W^*$ with the $W$-zero bias distribution, inequality (4.6) shows that the right hand side of (4.9) can be no greater than that of (4.5), and the result follows.
The majority of this chapter is devoted to the exploration of various consequences of this bound, starting with sums of independent random variables.
4.1 Sums of Independent Variables
4.1.1 L1 Berry–Esseen Bounds
Continuing the discussion in Sect. 3.1, and Theorem 3.1 in particular, in this section we elaborate on the theme of $L^1$ bounds for sums of independent random variables. In particular, we demonstrate the application of Theorem 4.1 and the construction (2.61) in Lemma 2.8 to produce $L^1$ bounds with small, explicit, and distributionally specific constants for the distance between the distribution of a sum of independent variables and the normal. The utility of Theorem 4.2 below is reflected by the fact that the $L^1$ distance on the left hand side of (4.13) is that of a convolution to the normal, but is bounded on the right by terms which require only the calculation of integrals of the form (4.4) involving marginal distributions. The proof of Theorem 4.2 requires the following simple proposition. For $H$ a distribution function on $\mathbb{R}$ let
$$H^{-1}(u) = \sup\{x : H(x) < u\} \quad \text{for } u \in (0, 1),$$
and let $\mathcal{U}(a, b)$ denote the uniform distribution on $(a, b)$. It is well known that when $U \sim \mathcal{U}[0, 1]$ then $H^{-1}(U)$ has distribution function $H$.
Proposition 4.1 For $F$ and $G$ distribution functions and $U \sim \mathcal{U}(0, 1)$,
$$\|F - G\|_1 = E|F^{-1}(U) - G^{-1}(U)|.$$
Further, for any $a \ge 0$ and $b \in \mathbb{R}$, with $F_{a,b}$ and $G_{a,b}$ the distribution functions of $aX + b$ and $aY + b$, respectively, we have
$$\|F_{a,b} - G_{a,b}\|_1 = a\|F - G\|_1. \tag{4.10}$$
Proof The first claim is stated in (iii), Sect. 2.3 of Rachev (1984); the second follows immediately from the dual form (4.6) of the L1 distance. Note that one consequence of the proposition is a representation of a pair of variables which achieve the infimum in (4.6).
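As an illustration of Proposition 4.1, the following sketch compares the two forms of the $L^1$ distance for a pair of empirical distributions. It is not taken from the text: the function names and the sample values are hypothetical, and the sketch assumes both samples have the same number of equally weighted atoms, in which case the quantile coupling $(F^{-1}(U), G^{-1}(U))$ simply pairs order statistics.

```python
def l1_by_quantile_coupling(xs, ys):
    """E|F^{-1}(U) - G^{-1}(U)| for two empirical laws with equally many atoms.

    With n equally weighted atoms each, the quantile coupling pairs the
    i-th smallest value of xs with the i-th smallest value of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


def l1_by_cdf_integral(xs, ys):
    """Integral of |F(t) - G(t)| dt for the empirical distribution functions, as in (4.4)."""
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        f = sum(x <= left for x in xs) / len(xs)  # F is constant on [left, right)
        g = sum(y <= left for y in ys) / len(ys)  # G is constant on [left, right)
        total += abs(f - g) * (right - left)
    return total


# Hypothetical sample values, chosen only for illustration.
xs = [0.0, 1.0, 4.0, 6.0]
ys = [0.5, 2.0, 3.0, 9.0]
print(l1_by_quantile_coupling(xs, ys), l1_by_cdf_integral(xs, ys))
```

Applying the same two functions to the transformed samples $ax + b$ and $ay + b$ with $a \ge 0$ gives a numerical check of the scaling relation (4.10).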
For $X$ a random variable with finite third absolute moment let
$$B(X) = \frac{2\operatorname{Var}(X)\,\|\mathcal{L}(X^*) - \mathcal{L}(X)\|_1}{E|X|^3}. \tag{4.11}$$
Applying (4.10) we have
$$B(aX) = B(X) \quad \text{for } a \ne 0. \tag{4.12}$$
Theorem 4.2 Let $\xi_i$, $i = 1, \ldots, n$ be independent mean zero random variables with variances $\sigma_i^2 = \operatorname{Var}(\xi_i)$ satisfying $\sum_{i=1}^n \sigma_i^2 = 1$. Then for $F$ the distribution function of
$$W = \sum_{i=1}^n \xi_i$$
and $\Phi$ that of the standard normal,
$$\|F - \Phi\|_1 \le \sum_{i=1}^n B(\xi_i) E|\xi_i|^3. \tag{4.13}$$
Additionally, when $W = \sum_{i=1}^n X_i/(\sigma\sqrt{n})$ with $X, X_1, \ldots, X_n$ i.i.d. mean zero, variance $\sigma^2$ random variables, then
$$\|F - \Phi\|_1 \le \frac{1}{\sigma^3\sqrt{n}} B(X) E|X|^3. \tag{4.14}$$
Proof Let $U_1, \ldots, U_n$ be mutually independent $\mathcal{U}(0, 1)$ variables and set
$$(\xi_i, \xi_i^*) = \big(G_i^{-1}(U_i), (G_i^*)^{-1}(U_i)\big), \quad i = 1, \ldots, n,$$
where $G_1^*, \ldots, G_n^*$ are the distribution functions of $\xi_1^*, \ldots, \xi_n^*$, respectively. Then $\xi_i$ and $\xi_i^*$ have distribution functions $G_i$ and $G_i^*$, respectively, and by Proposition 4.1,
$$E|\xi_i^* - \xi_i| = \|G_i^* - G_i\|_1.$$
Constructing $W^*$ as in Lemma 2.8 yields $W^* - W = \xi_I^* - \xi_I$, with $I$ having distribution $P(I = i) = \sigma_i^2$, so applying Theorem 4.1 we have
$$\|F - \Phi\|_1 \le 2E|W^* - W| = 2E|\xi_I^* - \xi_I| = 2\sum_{i=1}^n \sigma_i^2 E|\xi_i^* - \xi_i| = 2\sum_{i=1}^n \sigma_i^2 \|G_i^* - G_i\|_1 = \sum_{i=1}^n B(\xi_i) E|\xi_i|^3,$$
thus proving (4.13).
If $X, X_1, \ldots, X_n$ are i.i.d. with mean zero and variance $\sigma^2$ then applying (4.13) with $\xi_i = X_i/(\sigma\sqrt{n})$, and (4.12), yields the bound
$$\|F - \Phi\|_1 \le \sum_{i=1}^n B\Big(\frac{X_i}{\sigma\sqrt{n}}\Big) \frac{E|X_i|^3}{\sigma^3 n^{3/2}} = \frac{1}{\sigma^3\sqrt{n}} B(X) E|X|^3,$$
proving (4.14).
Specializing (4.14) to particular cases leads to the following corollary.
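Corollary 4.1 When $X = (\xi - p)/\sqrt{pq}$ where $\xi$ has the Bernoulli distribution with success probability $1 - q = p \in (0, 1)$,
$$B(X) = 1 \quad \text{and} \quad \|F - \Phi\|_1 \le \frac{E|X|^3}{\sqrt{n}} = \frac{p^2 + q^2}{\sqrt{npq}} \quad \text{for all } n = 1, 2, \ldots.$$
When $X$ has the uniform distribution $\mathcal{U}[-\sqrt{3}, \sqrt{3}]$, then
$$B(X) = 1/3 \quad \text{and} \quad \|F - \Phi\|_1 \le \frac{E|X|^3}{3\sqrt{n}} = \frac{\sqrt{3}}{4\sqrt{n}} \quad \text{for all } n = 1, 2, \ldots.$$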
Proof In the Bernoulli case, by (2.55), $X^*$ has the uniform distribution function
$$G^*(x) = \sqrt{pq}\,x + p \quad \text{for } x \in \Big[\frac{-p}{\sqrt{pq}}, \frac{q}{\sqrt{pq}}\Big],$$
that is, $X^* =_d (U - p)/\sqrt{pq}$, where $U \sim \mathcal{U}[0, 1]$. Hence, by Proposition 4.1,
$$\|G^* - G\|_1 = \Big\|\frac{U - p}{\sqrt{pq}} - \frac{\xi - p}{\sqrt{pq}}\Big\|_1 = \frac{1}{\sqrt{pq}}\|U - \xi\|_1 = \frac{p^2 + q^2}{2\sqrt{pq}}.$$
Calculating $E|X|^3 = (p^2 + q^2)/\sqrt{pq}$ and using $\operatorname{Var}(X) = 1$ gives $B(X) = 1$, and the claimed bound.
For the uniform distribution $\mathcal{U}[-\sqrt{3}, \sqrt{3}]$, (2.55) yields
$$G^*(x) = -\frac{\sqrt{3}x^3}{36} + \frac{\sqrt{3}x}{4} + \frac{1}{2} \quad \text{for } x \in [-\sqrt{3}, \sqrt{3}],$$
and from (4.4) we obtain
$$\|G^* - G\|_1 = \frac{\sqrt{3}}{8}.$$
Calculating $E|X|^3 = 3\sqrt{3}/4$ now gives $B(X) = 1/3$, and the claimed bound.
Constants $B(X)$ and bounds for other distributions may be calculated in a similar fashion. A universal $L^1$ constant over a class of distributions $\mathcal{F}$, by Theorem 4.2, is given by
$$B(\mathcal{F}) = \sup_{\mathcal{L}(X) \in \mathcal{F}} B(X).$$
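The two constants in Corollary 4.1 can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the zero biased distribution functions are copied from the proof above, and the integrals in (4.4) and the third moments are approximated with a simple midpoint rule rather than computed exactly.

```python
import math


def integrate(func, lo, hi, steps=20_000):
    """Midpoint rule; adequate for the piecewise smooth integrands used here."""
    width = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * width) for i in range(steps)) * width


def b_uniform():
    """B(X) of (4.11) for X uniform on [-sqrt(3), sqrt(3)], which has variance 1."""
    r = math.sqrt(3.0)
    g = lambda x: (x + r) / (2 * r)                        # distribution function of X
    g_star = lambda x: -r * x**3 / 36 + r * x / 4 + 0.5    # zero biased distribution function
    l1 = integrate(lambda x: abs(g_star(x) - g(x)), -r, r)
    third_moment = integrate(lambda x: abs(x) ** 3 / (2 * r), -r, r)
    return 2 * 1.0 * l1 / third_moment


def b_bernoulli(p):
    """B(X) of (4.11) for the standardized Bernoulli X = (xi - p)/sqrt(pq)."""
    q = 1.0 - p
    s = math.sqrt(p * q)
    # On [-p/s, q/s) the distribution function of X equals q, and G*(x) = s*x + p.
    l1 = integrate(lambda x: abs(s * x + p - q), -p / s, q / s)
    third_moment = (q * p**3 + p * q**3) / s**3
    return 2 * 1.0 * l1 / third_moment


print(b_uniform(), b_bernoulli(0.3))
```

Up to quadrature error, the printed values should agree with the constants $1/3$ and $1$ stated in the corollary.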
The following result, by Goldstein (2010a) and Tyurin (2010), shows that the Bernoulli distribution achieves the worst case B(X).
Theorem 4.3 For $\sigma > 0$ let $\mathcal{F}_\sigma$ be the collection of all mean zero distributions with variance $\sigma^2$ and finite absolute third moment. Then
$$B(\mathcal{F}) = 1 \quad \text{where} \quad \mathcal{F} = \bigcup_{\sigma > 0} \mathcal{F}_\sigma.$$
Theorems 4.3 and 4.2 immediately give
Corollary 4.2 If $\xi_i$, $i = 1, \ldots, n$ are independent mean zero random variables with variances $\sigma_i^2 = \operatorname{Var}(\xi_i)$ satisfying $\sum_{i=1}^n \sigma_i^2 = 1$ and $W = \xi_1 + \cdots + \xi_n$, then
$$\|F - \Phi\|_1 \le \sum_{i=1}^n E|\xi_i|^3.$$
In particular, if $W = n^{-1/2}\sum_{i=1}^n X_i$ with $X, X_1, \ldots, X_n$ i.i.d. variables with mean zero and variance 1, then
$$\|F - \Phi\|_1 \le \frac{E|X|^3}{\sqrt{n}}.$$
Though it may be difficult to achieve the optimal $L^1$ coupling between $X$ and $X^*$ in particular applications, especially those involving dependence, the following proposition shows how to construct a coupling which results in a constant bounded by 1 when $X$ is symmetric. Proposition 4.2 is applied in Theorem 4.7 to improve the leading constant in Goldstein (2007) for projections of cone measure.
Proposition 4.2 Let $\chi$ be a random variable with a symmetric distribution, variance $\sigma^2 \in (0, \infty)$ and finite third absolute moment. Let $X$ and $Y$ be constructed on a joint space with $0 \le X \le Y$ a.s. having marginal distributions given by $X =_d |\chi|$ and $Y =_d |\chi^\square|$, where $\chi^\square$ is as defined in Proposition 2.3. Let $V \sim \mathcal{U}[0, 1]$ and $\epsilon$ take the values 1 and $-1$ with equal probability, and be independent of each other and of $X$ and $Y$. Then $\mathcal{X} = \epsilon X$ has distribution $\chi$, the variable $\mathcal{X}^* = \epsilon V Y$ has the $\chi$-zero biased distribution, and
$$\frac{2\sigma^2 E|\mathcal{X}^* - \mathcal{X}|}{E|\mathcal{X}|^3} \le 1. \tag{4.15}$$
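Proof That $\mathcal{X} =_d \chi$ follows by the symmetry of $\chi$. Again, by the symmetry of $\chi$,
$$\sigma^2 Ef(\chi^\square) = E\chi^2 f(\chi) = E(-\chi)^2 f(-\chi) = E\chi^2 f(-\chi) = \sigma^2 Ef(-\chi^\square).$$
Hence $\chi^\square$ is symmetric, and as $\epsilon V \sim \mathcal{U}[-1, 1]$ and is independent of $Y$, by Proposition 2.3,
$$\mathcal{X}^* = \epsilon V Y =_d V\chi^\square =_d \chi^*.$$
Now,
$$E|\mathcal{X}^* - \mathcal{X}| = E|\epsilon V Y - \epsilon X| = E|V Y - X| = \int_{x \ge 0, y > 0} \int_0^1 |vy - x|\,dv\,dF(x, y),$$
where $dF(x, y)$ is the joint distribution of $(X, Y)$. Since $dF(x, y)$ is zero on sets where $x > y$, we may decompose the integral above as
$$\int_{x \ge 0, y > 0} \int_{x/y}^1 (vy - x)\,dv\,dF(x, y) + \int_{x \ge 0, y > 0} \int_0^{x/y} (x - vy)\,dv\,dF(x, y)$$
$$= \int_{x \ge 0, y > 0} \Big[\frac{y}{2}\Big(1 - \Big(\frac{x}{y}\Big)^2\Big) - x\Big(1 - \frac{x}{y}\Big)\Big] dF(x, y) + \int_{x \ge 0, y > 0} \Big[\frac{x^2}{y} - \frac{y}{2}\Big(\frac{x}{y}\Big)^2\Big] dF(x, y)$$
$$= E\Big(\frac{1}{2}Y - X + \frac{X^2}{Y}\Big).$$
As $X/Y \le 1$, we have $X^2/Y \le X$, and therefore
$$E|\mathcal{X}^* - \mathcal{X}| \le \frac{1}{2}EY = \frac{1}{2}E|\chi^\square| = \frac{1}{2\sigma^2}E|\mathcal{X}|^3.$$
Substituting into (4.15) yields the desired inequality.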
Let $X$ be any random variable with mean zero and variance $\sigma^2$, and let $\phi$ be an increasing function on $[0, \infty)$. Since $x^2$ is an increasing function on $[0, \infty)$, $X^2$ will be positively correlated with $\phi(|X|)$, that is,
$$\sigma^2 E\phi(|X^\square|) = EX^2\phi(|X|) \ge EX^2\, E\phi(|X|) = \sigma^2 E\phi(|X|),$$
showing $|X^\square|$ is stochastically larger than $|X|$. Hence there always exists a coupling where $|X^\square| \ge |X|$ a.s., even when $X$ is not symmetric. Though an optimal $L^1$ coupling is similarly assured, in principle, by Proposition 4.1, couplings constructed by following Proposition 4.2 seem to be of more practical use; see in particular where this proposition is applied for cone measure in item 3 of Proposition 4.5.
4.1.2 Contraction Principle
In this section we show that the distribution of a standardized sum of i.i.d. variables is closer in $L^1$ to the normal, in a zero bias sense, than the distribution of the summands themselves. This result leads to a type of $L^1$ contraction principle for the CLT. For some additional generality we will consider weighted averages of i.i.d. random variables. Let $\|\alpha\|$ denote the Euclidean norm of a vector $\alpha \in \mathbb{R}^k$, and when $\alpha$ is nonzero let
$$\varphi(\alpha) = \frac{\sum_{i=1}^k |\alpha_i|^3}{(\sum_{i=1}^k \alpha_i^2)^{3/2}}. \tag{4.16}$$
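Inequality (4.17) of Lemma 4.1 says that taking weighted averages of i.i.d. variables is a contraction in the $L^1$ distance to normal in a zero biased sense.
Lemma 4.1 For $\alpha \in \mathbb{R}^k$ with $\lambda = \|\alpha\| \ne 0$, let
$$Y = \sum_{i=1}^k \frac{\alpha_i}{\lambda} W_i,$$
where $W_i$ are mean zero, variance one, independent random variables distributed as $W$. Then
$$\|\mathcal{L}(Y^*) - \mathcal{L}(Y)\|_1 \le \varphi\,\|\mathcal{L}(W^*) - \mathcal{L}(W)\|_1 \tag{4.17}$$
with $\varphi = \varphi(\alpha)$ as in (4.16), and $\varphi < 1$ if and only if $\alpha$ is not a multiple of a standard basis vector. If $W_0$ is any mean zero, variance one random variable with finite absolute third moment, $\alpha_n$, $n = 0, 1, \ldots$ a sequence of nonzero vectors in $\mathbb{R}^k$, $\lambda_n = \|\alpha_n\|$, $\varphi_n = \varphi(\alpha_n)$, and
$$W_{n+1} = \sum_{i=1}^k \frac{\alpha_{n,i}}{\lambda_n} W_{n,i} \quad \text{for } n = 0, 1, \ldots \tag{4.18}$$
where $W_{n,i}$ are i.i.d. copies of $W_n$, then
$$\|\mathcal{L}(W_n^*) - \mathcal{L}(W_n)\|_1 \le \prod_{j=0}^{n-1} \varphi_j. \tag{4.19}$$
If $\limsup_n \varphi_n = \varphi < 1$, then for any $\gamma \in (\varphi, 1)$ there exists $C$ such that
$$\|\mathcal{L}(W_n) - \mathcal{L}(Z)\|_1 \le C\gamma^n \quad \text{for all } n, \tag{4.20}$$
while if $\alpha_n = \alpha$ for some $\alpha$ and all $n$, then
$$\|\mathcal{L}(W_n) - \mathcal{L}(Z)\|_1 \le 2\varphi^n \quad \text{for all } n, \tag{4.21}$$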
with $\varphi = \varphi(\alpha)$.
We begin the proof of the lemma by studying how $\varphi$ behaves in terms of $\alpha$, and prove a bit more than we need now, saving the additional results for use in Sect. 4.2.
Lemma 4.2 For $\alpha \in \mathbb{R}^k$ with $\lambda = \|\alpha\| \ne 0$,
$$\sum_{i=1}^k \frac{|\alpha_i|^p}{\lambda^p} \le 1 \quad \text{for all } p > 2, \tag{4.22}$$
with equality if and only if $\alpha$ is a multiple of a standard basis vector. With $\varphi$ as in (4.16),
$$\frac{1}{\sqrt{k}} \le \varphi \le 1, \tag{4.23}$$
with equality to the upper bound if and only if $\alpha$ is a multiple of a standard basis vector, and equality to the lower bound if and only if $|\alpha_i| = |\alpha_j|$ for all $i, j$. In addition, when $\alpha_i \ge 0$ and $\sum_{i=1}^k \alpha_i = 1$ then
$$\lambda \le \varphi, \tag{4.24}$$
with equality if and only if $\alpha$ is equal to a standard basis vector.
Proof Since $|\alpha_i|/\lambda \le 1$ we have $|\alpha_i|^{p-2}/\lambda^{p-2} \le 1$, yielding
$$\sum_{i=1}^k \frac{|\alpha_i|^p}{\lambda^p} = \sum_{i=1}^k \frac{|\alpha_i|^{p-2}}{\lambda^{p-2}} \frac{\alpha_i^2}{\lambda^2} \le \sum_{i=1}^k \frac{\alpha_i^2}{\lambda^2} = 1,$$
with equality if and only if $|\alpha_i| = \lambda$ for some $i$ and $\alpha_j = 0$ for all $j \ne i$. Specializing to the case $p = 3$ yields the claims about the upper bound in (4.23). By Hölder’s inequality with $p = 3$, $q = 3/2$, we have
$$\Big(\sum_{i=1}^k \alpha_i^2\Big)^{3/2} = \Big(\sum_{i=1}^k 1 \cdot \alpha_i^2\Big)^{3/2} \le \sqrt{k} \sum_{i=1}^k |\alpha_i|^3,$$
giving the lower bound (4.23), with equality if and only if $\alpha_i^2$ is proportional to 1 for all $i$. The claim (4.24) follows from the inequality
$$(EY)^2 \le EY^2 \quad \text{when } P(Y = \alpha_i) = \alpha_i,$$
which is an equality if and only if the variable $Y$ is constant.
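The bounds of Lemma 4.2 are easy to probe numerically. The following is a minimal sketch, with a hypothetical function name, that evaluates $\varphi(\alpha)$ from (4.16) and lets one check (4.23) and (4.24) on sample vectors; it illustrates the statements and is not part of the proof.

```python
import math


def phi(alpha):
    """phi(alpha) = sum |a_i|^3 / (sum a_i^2)^(3/2), as in (4.16)."""
    norm_sq = sum(a * a for a in alpha)
    if norm_sq == 0:
        raise ValueError("alpha must be nonzero")
    return sum(abs(a) ** 3 for a in alpha) / norm_sq ** 1.5


# A multiple of a standard basis vector attains the upper bound in (4.23),
# while equal components attain the lower bound 1/sqrt(k).
print(phi([2.0, 0.0, 0.0]), phi([1.0, 1.0, 1.0, 1.0]))

# For a probability vector, (4.24) states that the Euclidean norm is at most phi.
weights = [0.2, 0.3, 0.5]
norm = math.sqrt(sum(a * a for a in weights))
print(norm, phi(weights))
```

For equal weights $\alpha_i = 1/\sqrt{n}$ the function returns $1/\sqrt{n}$, the factor that appears in the classical i.i.d. rate.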
We may now proceed to the proof of the lemma.
Proof of Lemma 4.1 Let $F_{W^*}$ and $F_W$ be the distribution functions of $W^*$ and $W$, respectively, and with $U_1, \ldots, U_k$ independent $\mathcal{U}[0, 1]$ variables let
$$(W_i^*, W_i) = \big(F_{W^*}^{-1}(U_i), F_W^{-1}(U_i)\big), \quad i = 1, \ldots, k.$$
By Proposition 4.1, $E|W_i^* - W_i| = \|\mathcal{L}(W^*) - \mathcal{L}(W)\|_1$ for all $i = 1, \ldots, k$. By Lemma 2.8 and (2.59), with $I$ a random index independent of all other variables with distribution
$$P(I = i) = \frac{\alpha_i^2}{\lambda^2},$$
the variable
$$Y^* = Y - \frac{\alpha_I}{\lambda}(W_I - W_I^*) \tag{4.25}$$
has the $Y$-zero biased distribution. Using (4.6) for the first inequality, we now obtain (4.17) by
$$\|\mathcal{L}(Y^*) - \mathcal{L}(Y)\|_1 \le E|Y^* - Y| = E\sum_{i=1}^k \frac{|\alpha_i|}{\lambda}|W_i^* - W_i|\mathbf{1}(I = i) = \sum_{i=1}^k \frac{|\alpha_i|^3}{\lambda^3} E|W_i^* - W_i| = \varphi\,\|\mathcal{L}(W) - \mathcal{L}(W^*)\|_1.$$
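That $\varphi < 1$ if and only if $\alpha$ is not a multiple of a standard basis vector was shown in Lemma 4.2. To obtain (4.19), note that induction and (4.17) yield
$$\|\mathcal{L}(W_n^*) - \mathcal{L}(W_n)\|_1 \le \prod_{j=0}^{n-1} \varphi_j\, \|\mathcal{L}(W_0^*) - \mathcal{L}(W_0)\|_1,$$
and $\|\mathcal{L}(W_0^*) - \mathcal{L}(W_0)\|_1 \le 1$ by Theorem 4.3. When $\limsup_n \varphi_n = \varphi < \gamma < 1$ there exists $n_0$ such that
$$\varphi_j \le \gamma \quad \text{for all } j \ge n_0.$$
Hence, for all $n \ge n_0$,
$$\prod_{j=0}^{n-1} \varphi_j = \Big(\prod_{j=0}^{n_0-1} \frac{\varphi_j}{\gamma}\Big) \gamma^{n_0} \prod_{j=n_0}^{n-1} \varphi_j \le \Big(\prod_{j=0}^{n_0-1} \frac{\varphi_j}{\gamma}\Big) \gamma^n.$$
The bound (4.20) now follows from this inequality and Theorem 4.1. The last claim (4.21) is immediate from (4.19) and Theorem 4.1.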
We note that the standardized, classical case (4.14) is recovered from (4.17) and Theorem 4.1 when $\alpha_i = 1/\sqrt{n}$. In Sect. 4.2 we study nonlinear versions of recursion (4.18) with applications to physical models.
4.2 Hierarchical Structures
For $k \ge 2$ an integer, $D \subset \mathbb{R}$, and $F : D^k \to D$ a given function, every distribution for a random variable $X_0$ with $P(X_0 \in D) = 1$ generates the sequence of ‘hierarchical’ distributions through the recursion
$$X_{n+1} = F(\mathbf{X}_n), \quad n \ge 0, \tag{4.26}$$
where $\mathbf{X}_n = (X_{n,1}, \ldots, X_{n,k})^{\mathsf{T}}$ with $X_{n,i}$ independent, each with distribution $X_n$. Such hierarchical variables have been considered extensively in the physics literature (see Li and Rogers 1999 and the references therein), in particular to model conductivity of random media.
The special case where the function $F$ is determined by the conductivity properties of the diamond lattice has been considered in Griffiths and Kaufman (1982) and Schlösser and Spohn (1992). Figure 4.1 shows the progression of the diamond lattice from large to small scale. At the large scale (a), the conductivity of the system can be measured along the bond connecting its top and bottom nodes. Inspection of the lattice on a finer scale reveals that this bond is actually comprised of four smaller bonds, each similar to (a), connected as shown in (b). Inspection on an even finer scale reveals that each of the four bonds in (b) is constructed in a self-similar way from bonds at a smaller level, giving the successive diagram (c), and so on.
To determine the conductivity function $F$ associated with a given lattice, first recall that conductances add in parallel, that is, if two components with conductances $x_1$ and $x_2$ are placed in parallel, then the net conductance of the system is
$$L_1(x_1, x_2) = x_1 + x_2. \tag{4.27}$$
Similarly, resistances add for components placed in series. Hence, for these same two components in series, as resistance and conductance are inverses, the resulting conductance of the system is
$$L_{-1}(x_1, x_2) = \big(x_1^{-1} + x_2^{-1}\big)^{-1}. \tag{4.28}$$
For the diamond lattice in particular, assume that each bond has a fixed ‘baseline’ conductivity characteristic $w \ge 0$ such that when a component with conductivity $x \ge 0$ is present along the bond its net conductivity is $wx$. For bonds in the diamond lattice as in (b), we associate conductivity characteristics $\mathbf{w} = (w_1, w_2, w_3, w_4)^{\mathsf{T}}$, numbering bonds from the top and proceeding counter-clockwise. Hence, if $\mathbf{x} = (x_1, x_2, x_3, x_4)^{\mathsf{T}}$ are the conductances of four elements each as in (a) which are present along the bonds in (b), then the two components in series on the left side have conductance $L_{-1}(w_1x_1, w_2x_2)$, and similarly, the conductance for the two components in series on the right is $L_{-1}(w_3x_3, w_4x_4)$. Combining these two subsystems in parallel gives
$$F(\mathbf{x}) = L_1\big(L_{-1}(w_1x_1, w_2x_2), L_{-1}(w_3x_3, w_4x_4)\big), \tag{4.29}$$
that is,
$$F(\mathbf{x}) = \Big(\frac{1}{w_1x_1} + \frac{1}{w_2x_2}\Big)^{-1} + \Big(\frac{1}{w_3x_3} + \frac{1}{w_4x_4}\Big)^{-1}. \tag{4.30}$$
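The composition (4.29) and the closed form (4.30) can be written out directly. The sketch below uses hypothetical function names and arbitrary illustrative inputs; it assumes strictly positive weights and conductances so that all reciprocals are defined.

```python
def parallel(x1, x2):
    """Conductances add in parallel, as in (4.27)."""
    return x1 + x2


def series(x1, x2):
    """Resistances add in series, so conductances combine as in (4.28)."""
    return 1.0 / (1.0 / x1 + 1.0 / x2)


def diamond(x, w):
    """Diamond lattice conductance via the composition (4.29)."""
    left = series(w[0] * x[0], w[1] * x[1])
    right = series(w[2] * x[2], w[3] * x[3])
    return parallel(left, right)


def diamond_closed_form(x, w):
    """The same function written out as in (4.30)."""
    return (1.0 / (1.0 / (w[0] * x[0]) + 1.0 / (w[1] * x[1]))
            + 1.0 / (1.0 / (w[2] * x[2]) + 1.0 / (w[3] * x[3])))


# Illustrative values only.
x = (2.0, 3.0, 4.0, 5.0)
w = (1.0, 2.0, 0.5, 1.0)
print(diamond(x, w), diamond_closed_form(x, w))
print(diamond((1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0)))
```

With unit weights and unit conductances each series pair contributes $1/2$, so the last call illustrates a choice of weights for which $F(\mathbf{1}_4) = 1$.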
Returning to the sequence of distributions generated by the recursion (4.26), conditions on $F$ which imply the weak law
$$X_n \to_p c \tag{4.31}$$
for some constant $c$ have been considered by various authors. Recall that we say $F$ is homogeneous, or positively homogeneous, if $F(ax_1, \ldots, ax_k) = a^k F(x_1, \ldots, x_k)$ holds for all $a \in \mathbb{R}$, or all $a > 0$, respectively. Shneiberg (1986) proves that (4.31) holds if $D = [a, b]$ and $F$ is continuous, monotonically increasing, positively homogeneous, convex and satisfies the normalization condition $F(\mathbf{1}_k) = 1$ where $\mathbf{1}_k$ is the vector of all ones in $\mathbb{R}^k$. Li and Rogers (1999) provide rather weak conditions under which (4.31) holds for closed $D \subset (-\infty, \infty)$. See also Wehr (1997) and Wehr (2001), and Jordan (2002) for an extension of the model to random $F$ and applications of hierarchical structures to computer science.
Fig. 4.1 The diamond lattice
Letting $X_0$ have mean $c$ and variance $\sigma^2$, the classical central limit theorem can be set in the framework of hierarchical sequences by letting
$$F(x_1, x_2) = \frac{1}{2}(x_1 + x_2), \tag{4.32}$$
which gives
$$X_n =_d \frac{X_{0,1} + \cdots + X_{0,2^n}}{2^n} \tag{4.33}$$
where $X_{0,m}$, $m = 1, \ldots, 2^n$ are independent and identically distributed as $X_0$. Hence, $X_n \to_p c$ by the weak law of large numbers, and since $X_n$ is the average of $N = 2^n$ i.i.d. variables with finite variance, we have additionally that
$$W_n = \sqrt{N}\,\frac{X_n - c}{\sigma} \to_d N(0, 1).$$
Moreover, when $X_0$ has a bounded absolute third moment (4.21) yields
$$\|\mathcal{L}(W_n) - \mathcal{L}(Z)\|_1 \le C\gamma^n \tag{4.34}$$
with $C = 2$ and $\gamma = 1/\sqrt{2}$. The function $F$ in (4.32) is a simple average, and one would, therefore, expect normal limiting behavior more generally when the function $F$ averages its inputs in some sense.
Definition 4.1 We say that $F : D^k \to D$ is an averaging function when it satisfies the following three properties on its domain:
1. $\min_i x_i \le F(\mathbf{x}) \le \max_i x_i$.
2. $F(\mathbf{x}) \le F(\mathbf{y})$ whenever $x_i \le y_i$.
3. For all $x < y$ and for any two distinct indices $i_1 \ne i_2$, there exist $x_i \in \{x, y\}$, $i = 1, \ldots, k$ such that $x_{i_1} = x$, $x_{i_2} = y$ and $x < F(\mathbf{x}) < y$.
We say $F$ is strictly averaging if $F$ satisfies Properties 1 and 2 with strict inequality when $\min_i x_i < \max_i x_i$, and when $x_i < y_i$ for some $i$, respectively.
Properties 1 and 2 say that the ‘average’ returned by $F$ should lie in between the values being ‘averaged’ and that this ‘average’ increases with those values. Note that Property 1 says that for $F$ to be an averaging function it is necessary that $F(\mathbf{1}_k) = 1$. Property 3 says that $F$ is sensitive to, that is, depends on, all of its coordinates. We note that if $F$ is strictly averaging then $F$ satisfies Property 3 as follows: if $x < y$ and $x_{i_1} = x$, $x_{i_2} = y$, then any assignment of the values $x, y$ to the remaining coordinates gives $x < F(\mathbf{x}) < y$ by the strict form of Property 1. Hence all strictly averaging functions are averaging. We note that the function $F(\mathbf{x}) = \min_i x_i$ satisfies the first two properties but not the third, and it gives rise to extreme value, rather than normal, limiting behavior.
Normal limits are proved by Wehr and Woo (2001) for sequences $X_n$, $n = 0, 1, \ldots$ determined by the recursion (4.26) when the function $F(\mathbf{x})$ is averaging by showing that such recursions can be treated as the approximate linear recursion around the mean $c_n = EX_n$ with small perturbation $Z_n$,
$$X_{n+1} = \alpha_n \cdot \mathbf{X}_n + Z_n, \quad n \ge 0, \tag{4.35}$$
where $\alpha_n = F'(\mathbf{c}_n)$, the gradient of $F$ at $\mathbf{c}_n$ where $\mathbf{c}_n = (c_n, \ldots, c_n)^{\mathsf{T}} \in \mathbb{R}^k$. In Sect. 4.2.1 we prove Theorem 4.6, which gives the bound (4.34) for the $L^1$ distance to the normal for sequences generated by the approximate linear recursion (4.35) under Conditions 4.1 and 4.2, which guarantee that $Z_n$ is small relative to $X_n$. In Sect. 4.2.2 we prove Theorem 4.4 which shows that the normal convergence of the hierarchical sequence $X_n$, $n = 0, 1, \ldots$ holds with bound (4.34) under mild conditions, and specifies the exponential rate $\gamma$ in an explicit range. Theorem 4.4 is proved by invoking Theorem 4.6 after showing that the required moment conditions are satisfied for a linearization of $X_{n+1} = F(\mathbf{X}_n)$.
Theorem 4.4 For some $a < b$ let $X_0$ be a non constant random variable with $P(X_0 \in [a, b]) = 1$ and let
$$X_{n+1} = F(\mathbf{X}_n), \quad n \ge 0,$$
where $\mathbf{X}_n = (X_{n,1}, \ldots, X_{n,k})^{\mathsf{T}}$ with $X_{n,i}$ independent, each with distribution $X_n$ and $F : [a, b]^k \to [a, b]$, twice continuously differentiable. Suppose $F$ is averaging and that $X_n \to_p c$, with $\alpha = F'(\mathbf{1}_k c)$ not a scalar multiple of a standard basis vector. Then with $W_n = (X_n - c_n)/\sqrt{\operatorname{Var}(X_n)}$ and $Z$ a standard normal variable, for all $\gamma \in (\varphi, 1)$ there exists $C$ such that
$$\|\mathcal{L}(W_n) - \mathcal{L}(Z)\|_1 \le C\gamma^n \quad \text{for all } n \ge 0,$$
where $\varphi$, given by $\alpha$ through (4.16), is a positive number strictly less than 1. The value $\varphi$ achieves a minimum of $1/\sqrt{k}$ if and only if the components of $\alpha$ are equal.
As in (4.33), the variable $X_n$ is a function of $N = k^n$ variables, so achieving the rate $\varphi^n$ exactly corresponds to a ‘classical rate’ of $N^{-\theta}$ where
$$\varphi^n = N^{-\theta} = k^{-n\theta} \quad \text{or} \quad \theta = -\log_k \varphi. \tag{4.36}$$
Hence when $\varphi$ achieves its minimum value $1/\sqrt{k}$ we have $\theta = 1/2$ and the rate $N^{-1/2}$, and achieving this rate for all $\gamma > \varphi$ therefore corresponds to the rate $N^{-1/2+\epsilon}$ for every $\epsilon > 0$. Further, when $\alpha$ is close to a standard basis vector, $\varphi$ is close to 1, so the bound can have rate $N^{-\theta}$ for $\theta$ arbitrarily close to zero. This behavior is anticipated: for the simple hierarchical sequence generated by the function $F(x_1, x_2) = (1 - \epsilon)x_1 + \epsilon x_2$, convergence to the normal will be slow indeed for small $\epsilon > 0$. The condition in Theorem 4.4 that the gradient $\alpha = F'(\mathbf{c})$ of $F$ at the limiting value $\mathbf{c}$ not be a scalar multiple of a standard basis vector rules out cases which behave in the limit degenerately as $F(x_1, x_2) = x_1$.
The function (4.32), and (4.30) when $F(\mathbf{1}_4) = 1$, are examples of averaging functions. To handle multiples, we say that $G(\mathbf{y})$ with $G(\mathbf{1}_k) \ne 0$ is a scaled averaging function if $G(\mathbf{y})/G(\mathbf{1}_k)$ is averaging. Now suppose that $G(\mathbf{y})$ is scaled averaging and homogeneous, and that
$$Y_{n+1} = G(\mathbf{Y}_n) \quad \text{for } n \ge 0,$$
where $Y_0$ is a given random variable and $\mathbf{Y}_n \in \mathbb{R}^k$ is a vector of independent copies of $Y_n$. Then letting $a_{n+1} = ka_n + 1$ for all $n \ge 0$, and $a_0 = 0$, and setting $F(\mathbf{y}) = G(\mathbf{y})/G(\mathbf{1}_k)$, which is an averaging function, and $X_n = Y_n/G(\mathbf{1}_k)^{a_n}$ and likewise for $\mathbf{X}_n$, we have
$$X_{n+1} = Y_{n+1}/G(\mathbf{1}_k)^{a_{n+1}} = G(\mathbf{Y}_n)/G(\mathbf{1}_k)^{a_{n+1}} = F(\mathbf{Y}_n)/G(\mathbf{1}_k)^{ka_n} = F(\mathbf{X}_n).$$
As the scaled and centered variables $(X_n - EX_n)/\sqrt{\operatorname{Var}(X_n)}$ and $(Y_n - EY_n)/\sqrt{\operatorname{Var}(Y_n)}$ are equal, the conclusion of Theorem 4.4 holds for $Y_n$ when it holds for $X_n$.
Theorem 4.4 is applied in Sect. 4.2.3 to the specific hierarchical variables generated by the diamond lattice conductivity function (4.30), and, in (4.67), the value $\varphi$ determining the range of $\gamma$ is given as an explicit function of the weights $\mathbf{w}$; for the diamond lattice all rates $N^{-\theta}$ for $\theta \in (0, 1/2)$ are exhibited. Interestingly, there appears to be no such formula, simple or otherwise, for the limiting mean or variance of the sequence $X_n$.
To proceed we introduce another equivalent formulation of the $L^1$ distance. With $\mathbb{L}$ as in (4.7), let
$$\mathcal{F} = \{f : f \text{ absolutely continuous},\ f(0) = f'(0) = 0,\ f' \in \mathbb{L}\}. \tag{4.37}$$
Clearly, if $f \in \mathcal{F}$ then $h \in \mathbb{L}$ for $h = f'$. On the other hand, if $h \in \mathbb{L}$ then
$$f \in \mathcal{F} \quad \text{and} \quad f'(y) - f'(x) = h(y) - h(x) \quad \text{for} \quad f(x) = \int_0^x \big(h(u) - h(0)\big)\,du.$$
Then, from (4.8),
$$\|\mathcal{L}(Y) - \mathcal{L}(X)\|_1 = \sup_{f \in \mathcal{F}} \big|E\big(f'(Y) - f'(X)\big)\big|. \tag{4.38}$$
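Returning briefly to the rate correspondence (4.36), the exponent $\theta = -\log_k \varphi$ can be evaluated directly. The short sketch below, with a hypothetical function name, shows the two regimes discussed earlier: the best case $\varphi = 1/\sqrt{k}$ and a value of $\varphi$ close to 1.

```python
import math


def classical_rate_exponent(phi_value, k):
    """theta solving phi^n = N^(-theta) with N = k^n, that is theta = -log_k(phi)."""
    if not 0.0 < phi_value <= 1.0 or k < 2:
        raise ValueError("need 0 < phi <= 1 and k >= 2")
    return -math.log(phi_value) / math.log(k)


# Equal weights with k = 2: phi = 1/sqrt(2), the classical i.i.d. averaging case.
print(classical_rate_exponent(1.0 / math.sqrt(2.0), 2))
# phi close to 1 (alpha near a standard basis vector) gives an exponent near zero.
print(classical_rate_exponent(0.99, 2))
```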
For the application of Theorem 4.4, it is necessary to verify that the function $F(\mathbf{x})$ in (4.26) is averaging. Proposition 3 of Wehr and Woo (2001) shows that the effective conductance of a resistor network is an averaging function of the conductances of its individual components. Theorem 4.5, which shows that strict averaging is preserved under certain compositions, yields an independent proof that, for instance, (4.30) is strictly averaging under natural scaling and positivity conditions on the weights. In addition, Theorem 4.5 provides an additional source of averaging functions to which Theorem 4.4 may be applied.
Theorem 4.5 Let $k \ge 1$ and set $I_0 = \{1, \ldots, k\}$. Suppose subsets $I_i \subset I_0$, $i \in I_0$ satisfy $\bigcup_{i \in I_0} I_i = I_0$. For $\mathbf{x} \in \mathbb{R}^k$ and $i \in I_0$ let $\mathbf{x}_i = (x_{j_1}, \ldots, x_{j_{|I_i|}})$ where $\{j_1, \ldots, j_{|I_i|}\} = I_i$ with $j_1 < \cdots < j_{|I_i|}$. Let $F_i : \mathbb{R}^{|I_i|} \to \mathbb{R}$ (or $F_i : [0, \infty)^{|I_i|} \to [0, \infty)$), $i = 0, \ldots, k$. If $F_0, F_1, \ldots, F_k$ are strictly averaging and $F_0$ is (positively) homogeneous, then the composition
$$F_{\mathbf{s}}(\mathbf{x}) = F_0\big(s_1F_1(\mathbf{x}_1), \ldots, s_kF_k(\mathbf{x}_k)\big)$$
is strictly averaging for any $\mathbf{s}$ which satisfies $F_0(\mathbf{s}) = 1$ and $s_i > 0$ for all $i$. If $F_0, F_1, \ldots, F_k$ are scaled, strictly averaging and $F_0$ is (positively) homogeneous, then
$$F_{\mathbf{1}}(\mathbf{x}) = F_0\big(F_1(\mathbf{x}_1), \ldots, F_k(\mathbf{x}_k)\big)$$
is a scaled strictly averaging function.
Note that the parallel and series combination rules (4.27) and (4.28) are the $p = 1$ and $p = -1$ special cases, respectively, with $w_i = 1$, of the weighted $L^p$ norm functions
$$L_p^{\mathbf{w}}(\mathbf{x}) = \Big(\sum_{i=1}^k (w_ix_i)^p\Big)^{1/p}, \quad \mathbf{w} = (w_1, \ldots, w_k)^{\mathsf{T}}, \quad w_i \in (0, \infty),$$
which are scaled, strictly averaging, and positively homogeneous on $[0, \infty)^k$ for $p > 0$ and on $(0, \infty)^k$ for $p < 0$. Since $F(\mathbf{x})$ in (4.30) is represented by the composition (4.29), Theorem 4.5 applies to show that $F$ is a scaled, strictly averaging function on $(0, \infty)^4$ for any choice of positive weights. In particular, for positive weights such that $F(\mathbf{1}) = 1$, the function $F$ is strictly averaging on $(0, \infty)^4$. Theorem 4.4 requires $F$ to have domain $[a, b]^k$. However, if $F$ is an averaging function on, say, $(0, \infty)^4$, then Property 1 implies that $F : [a, b]^k \to [a, b]$ for all $[a, b] \subset (0, \infty)$, and hence $F$ will be averaging on this smaller domain. Note lastly that Theorem 4.5 shows the same conclusion holds when the resistor parallel $L_1$ and series $L_{-1}$ combination rules in this network are replaced by, say, $L_2$ and $L_{-2}$ respectively.
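Properties 1 and 2 of Definition 4.1 can be spot-checked numerically for the diamond lattice function. The sketch below is an informal check, not a proof: the helper names, the weights and the test points are all hypothetical, and the weights are rescaled by a common factor so that $F(\mathbf{1}) = 1$, as the discussion above requires.

```python
def diamond(x, w):
    """Diamond lattice conductance (4.30): two series pairs combined in parallel."""
    left = 1.0 / (1.0 / (w[0] * x[0]) + 1.0 / (w[1] * x[1]))
    right = 1.0 / (1.0 / (w[2] * x[2]) + 1.0 / (w[3] * x[3]))
    return left + right


def normalize_weights(w):
    """Rescale positive weights by a common factor so that F(1, 1, 1, 1) = 1."""
    base = diamond((1.0, 1.0, 1.0, 1.0), w)
    return tuple(wi / base for wi in w)


def satisfies_property_1(x, w, tol=1e-12):
    """Check min_i x_i <= F(x) <= max_i x_i at a single point x."""
    value = diamond(x, w)
    return min(x) - tol <= value <= max(x) + tol


def satisfies_property_2(x, y, w, tol=1e-12):
    """Check F(x) <= F(y) for a pair with x_i <= y_i in every coordinate."""
    if any(xi > yi for xi, yi in zip(x, y)):
        raise ValueError("need x_i <= y_i for all i")
    return diamond(x, w) <= diamond(y, w) + tol


# Illustrative weights and points only.
w = normalize_weights((1.0, 2.0, 0.5, 3.0))
points = [(0.5, 2.0, 1.0, 3.0), (1.0, 1.0, 100.0, 100.0), (4.0, 0.1, 0.1, 4.0)]
print([satisfies_property_1(x, w) for x in points])
print(satisfies_property_2((0.5, 2.0, 1.0, 3.0), (0.6, 2.0, 1.5, 3.0), w))
```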
4.2.1 Bounds to the Normal for Approximately Linear Recursions
In this section we study sequences $\{X_n\}_{n \ge 0}$ generated by the approximate linear recursion
$$X_{n+1} = \alpha_n \cdot \mathbf{X}_n + Z_n, \quad n \ge 0, \tag{4.39}$$
where $X_0$ is a given nontrivial random variable and the components $X_{n,1}, \ldots, X_{n,k}$ of $\mathbf{X}_n$ are independent copies of $X_n$. We present Theorem 4.6 which shows the exponential bound (4.34) holds when the perturbation term $Z_n$, which measures the departure from linearity, is small. The effective size of $Z_n$ is measured by the quantity $\beta_n$ of (4.42), which will be small when the moment bounds in Conditions 4.1 and 4.2 are satisfied. When the recursion is nearly linear, $X_{n+1}$ will be approximately equal to $\alpha_n \cdot \mathbf{X}_n$, and therefore its variance $\sigma_{n+1}^2$ will be close to $\sigma_n^2\lambda_n^2$ where $\lambda_n = \|\alpha_n\|$. Iterating, the variance of $X_n$ will grow like some constant $C^2$ times $\lambda_{n-1}^2 \cdots \lambda_0^2$, so when $\alpha_n \to \alpha$, like $C^2\lambda^{2n}$. Condition 4.1 assures that $Z_n$ is small relative to $X_n$ in that its variance grows at a slower rate. This condition was assumed in Wehr and Woo (2001) for deriving a normal limiting law for the standardized sequence generated by (4.39).
Condition 4.1 The nonzero sequence of vectors $\alpha_n \in \mathbb{R}^k$, $k \ge 2$, converges to $\alpha$, not equal to any multiple of a standard basis vector. With $\lambda = \|\alpha\|$, there exist $0 < \delta_1 < \delta_2 < 1$ and positive constants $C_{X,2}$, $C_{Z,2}$ such that for all $n$,
$$\operatorname{Var}(X_n) \ge C_{X,2}^2\lambda^{2n}(1 - \delta_1)^{2n},$$
$$\operatorname{Var}(Z_n) \le C_{Z,2}^2\lambda^{2n}(1 - \delta_2)^{2n}.$$
Bounds on the distance between $X_n$ and the normal can be provided under the following additional conditions on the fourth order moments of $X_n$ and $Z_n$. Condition 4.2 on the higher order moments is satisfied under the same averaging assumption on $F$ used in Wehr and Woo (2001) to guarantee Condition 4.1 for weak convergence to the normal.
Condition 4.2 With $\delta_1$ and $\delta_2$ as in Condition 4.1, there exist $\delta_3 \ge 0$ and $\delta_4 \ge 0$ such that
$$\phi_1 = \frac{(1 - \delta_2)(1 + \delta_3)^3}{(1 - \delta_1)^4} < 1 \quad \text{and} \quad \phi_2 = \Big(\frac{1 - \delta_4}{1 - \delta_1}\Big)^2 < 1,$$
and constants $C_{X,4}$, $C_{Z,4}$ such that
$$E(X_n - EX_n)^4 \le C_{X,4}^4\lambda^{4n}(1 + \delta_3)^{4n},$$
$$E(Z_n - EZ_n)^4 \le C_{Z,4}^4\lambda^{4n}(1 - \delta_4)^{4n}.$$
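The two rate parameters of Condition 4.2 are simple functions of $\delta_1, \ldots, \delta_4$. As a small illustration, with a hypothetical function name and arbitrary numeric values for the $\delta_i$, the following sketch computes $\phi_1$ and $\phi_2$ and reports whether both are below 1.

```python
def condition_4_2_rates(d1, d2, d3, d4):
    """Return (phi1, phi2) from Condition 4.2 for given delta_1, ..., delta_4."""
    if not 0.0 < d1 < d2 < 1.0:
        raise ValueError("Condition 4.1 requires 0 < delta_1 < delta_2 < 1")
    if d3 < 0.0 or d4 < 0.0:
        raise ValueError("Condition 4.2 requires delta_3 >= 0 and delta_4 >= 0")
    phi1 = (1.0 - d2) * (1.0 + d3) ** 3 / (1.0 - d1) ** 4
    phi2 = ((1.0 - d4) / (1.0 - d1)) ** 2
    return phi1, phi2


# Arbitrary illustrative values of the deltas.
phi1, phi2 = condition_4_2_rates(0.1, 0.5, 0.05, 0.4)
print(phi1, phi2, max(phi1, phi2) < 1.0)
```

The larger of the two values plays the role of $\beta$ in the final claim of Theorem 4.6 below.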
The following is our main result on L1 bounds for approximately linear recursions.
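Theorem 4.6 Let $X_0$ be a random variable with variance $\sigma_0^2 \in (0, \infty)$ and
$$X_{n+1} = \alpha_n \cdot \mathbf{X}_n + Z_n \quad \text{for } n \ge 0 \tag{4.40}$$
with $\alpha_n \in \mathbb{R}^k$, $\lambda_n = \|\alpha_n\| \ne 0$ and $\mathbf{X}_n$ a vector in $\mathbb{R}^k$ with independent components distributed as $X_n$ with mean $c_n$ and finite, non-zero variance $\sigma_n^2$. Set $Y_0 = 0$ and $W_n = (X_n - EX_n)/\sigma_n$, and for $n \ge 0$ let
$$\mathbf{W}_n = \frac{\mathbf{X}_n - \mathbf{c}_n}{\sigma_n}, \quad Y_{n+1} = \frac{\alpha_n}{\lambda_n} \cdot \mathbf{W}_n, \tag{4.41}$$
and
$$\beta_n = E|W_n - Y_n| + \frac{1}{2}E\big|W_n^3 - Y_n^3\big|. \tag{4.42}$$
If there exist $(\beta, \varphi) \in (0, 1)^2$ such that
$$\limsup_{n \to \infty} \frac{\beta_n}{\beta^n} < \infty \tag{4.43}$$
and $\varphi_n = \varphi(\alpha_n)$ in (4.16) satisfies
$$\limsup_{n \to \infty} \varphi_n = \varphi, \tag{4.44}$$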
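then with $\gamma = \beta$ when $\beta > \varphi$, and for any $\gamma \in (\varphi, 1)$ when $\beta \le \varphi$, there exists $C$ such that
$$\|\mathcal{L}(W_n) - \mathcal{L}(Z)\|_1 \le C\gamma^n \quad \text{for all } n \ge 0. \tag{4.45}$$
Under Conditions 4.1 and 4.2, the bound (4.45) holds for all $\gamma \in (\max(\beta, \varphi), 1)$ with $\beta = \max\{\phi_1, \phi_2\} < 1$ and $\varphi = \sum_{i=1}^k |\alpha_i|^3/\lambda^3 < 1$ where $\alpha$ and $\lambda$ are the limiting values of $\alpha_n$ and $\lambda_n$, respectively.
Proof Let $f \in \mathcal{F}$ with $\mathcal{F}$ given by (4.37). Then $f'$ is absolutely continuous with $|f''(w)| \le 1$, and in addition $|f'(w)| \le |w|$ and $|f(w)| \le w^2/2$. Letting $h$ be given by
$$h(w) = f'(w) - wf(w) \tag{4.46}$$
we have $Nh = 0$ by Lemma 2.1. Differentiation yields $h'(w) = f''(w) - wf'(w) - f(w)$, and therefore
$$|h'(w)| \le 1 + \frac{3}{2}w^2. \tag{4.47}$$
Letting
$$r_n = \frac{\lambda_n\sigma_n}{\sigma_{n+1}} \quad \text{and} \quad T_n = \frac{\sigma_n}{\sigma_{n+1}}\Big(\frac{Z_n - EZ_n}{\sigma_n}\Big) \tag{4.48}$$
and using (4.41), write the recursion (4.40) as
$$W_{n+1} = \frac{X_{n+1} - EX_{n+1}}{\sigma_{n+1}} = \frac{\sigma_n}{\sigma_{n+1}}\Big(\alpha_n \cdot \frac{\mathbf{X}_n - E\mathbf{X}_n}{\sigma_n} + \frac{Z_n - EZ_n}{\sigma_n}\Big) = \frac{\sigma_n}{\sigma_{n+1}}\Big(\alpha_n \cdot \mathbf{W}_n + \frac{Z_n - EZ_n}{\sigma_n}\Big) = r_nY_{n+1} + T_n. \tag{4.49}$$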
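Now by (4.47) and the definition of $\beta_n$ in (4.42),
$$\big|E\big(h(W_n) - h(Y_n)\big)\big| = \Big|E\int_{Y_n}^{W_n} h'(u)\,du\Big| \le \beta_n.$$
Now by (2.51), that $\operatorname{Var}(W_{n+1}) = 1$, (4.46) and $Nh = 0$, we have
$$\big|Ef'(W_{n+1}) - Ef'(W_{n+1}^*)\big| = \big|E\big(f'(W_{n+1}) - W_{n+1}f(W_{n+1})\big)\big| = |Eh(W_{n+1}) - Nh|$$
$$= \big|E\big(h(W_{n+1}) - h(Y_{n+1}) + h(Y_{n+1})\big) - Nh\big| \le \beta_{n+1} + |Eh(Y_{n+1}) - Nh|$$
$$= \beta_{n+1} + \big|E\big(f'(Y_{n+1}^*) - f'(Y_{n+1})\big)\big| \le \beta_{n+1} + \|Y_{n+1}^* - Y_{n+1}\|_1 \quad \text{by (4.38)}$$
$$\le \beta_{n+1} + \varphi_n\|W_n^* - W_n\|_1 \quad \text{by Lemma 4.1.}$$
Taking supremum over $f \in \mathcal{F}$ on the left hand side, using (4.38) again and letting $d_n = \|W_n^* - W_n\|_1$ we obtain, for all $n \ge 0$,
$$d_{n+1} \le \beta_{n+1} + \varphi_nd_n.$$
Iteration yields that for all $n, n_0 \ge 0$,
$$d_{n_0+n} \le \sum_{j=n_0+1}^{n_0+n}\Big(\prod_{i=j}^{n_0+n-1}\varphi_i\Big)\beta_j + \Big(\prod_{i=n_0}^{n_0+n-1}\varphi_i\Big)d_{n_0}. \tag{4.50}$$
Now suppose the bounds (4.43) and (4.44) hold on $\beta_n$ and $\varphi_n$, respectively, and recall the choice of $\gamma$. When $\beta > \varphi$ take $\bar{\varphi} \in (\varphi, \beta)$ so that $\varphi < \bar{\varphi} < \beta = \gamma$; when $\beta \le \varphi$ take $\bar{\varphi} \in (\varphi, \gamma)$ so that $\beta \le \varphi < \bar{\varphi} < \gamma$. Then for any $B > \limsup_n \beta_n/\beta^n$ there exists $n_0$ such that for all $n \ge n_0$
$$\beta_n \le B\beta^n \quad \text{and} \quad \varphi_n \le \bar{\varphi}.$$
Applying these inequalities in (4.50) and summing yields, for all $n \ge 0$,
$$d_{n+n_0} \le B\beta^{n_0+1}\Big(\frac{\beta^n - \bar{\varphi}^n}{\beta - \bar{\varphi}}\Big) + \bar{\varphi}^nd_{n_0}.$$
Since $\max(\beta, \bar{\varphi}) \le \gamma$, for some $C$ we have that $d_n \le C\gamma^n$ for all $n \ge n_0$, and by enlarging $C$ if necessary, for all $n \ge 0$. Now (4.45) follows from Theorem 4.1.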
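The summation step after (4.50) can be checked by running the extremal recursion directly. The sketch below is a hedged illustration with hypothetical function names and arbitrary parameter values: it iterates $d_{m+1} = B\beta^{m+1} + \bar{\varphi}d_m$, the case of equality in the two bounds, and compares the result with the closed form obtained by summing the geometric series. It assumes $\beta \ne \bar{\varphi}$.

```python
def iterate_recursion(beta, phi_bar, big_b, d_start, n0, steps):
    """Run d_{m+1} = B * beta^(m+1) + phi_bar * d_m from m = n0; return d_{n0+steps}."""
    d = d_start
    for m in range(n0, n0 + steps):
        d = big_b * beta ** (m + 1) + phi_bar * d
    return d


def summed_bound(beta, phi_bar, big_b, d_start, n0, steps):
    """Closed form B*beta^(n0+1)*(beta^n - phi_bar^n)/(beta - phi_bar) + phi_bar^n * d_{n0}."""
    geometric = (beta ** steps - phi_bar ** steps) / (beta - phi_bar)
    return big_b * beta ** (n0 + 1) * geometric + phi_bar ** steps * d_start


# Arbitrary illustrative parameters with phi_bar < beta, so gamma = beta.
beta, phi_bar, big_b, d_start, n0 = 0.9, 0.75, 2.0, 1.0, 3
for steps in (1, 5, 20):
    print(iterate_recursion(beta, phi_bar, big_b, d_start, n0, steps),
          summed_bound(beta, phi_bar, big_b, d_start, n0, steps))
```

Dividing either quantity by $\beta^{n_0+n}$ gives values that stay bounded in $n$, which is the geometric decay asserted in (4.45) for this choice of parameters.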
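To prove the final claim under Conditions 4.1 and 4.2 it suffices to show that (4.43) and (4.44) hold with $\beta = \max\{\phi_1, \phi_2\}$ and $\varphi = \sum_{i=1}^k |\alpha_i|^3/\lambda^3 < 1$ where $\alpha$ is the limiting value of $\alpha_n$. Lemma 6 of Wehr and Woo (2001) gives that the limit as $n \to \infty$ of $\sigma_n/(\lambda_0 \cdots \lambda_{n-1})$ exists in $(0, \infty)$, and therefore that
$$\lim_{n \to \infty} r_n = 1 \quad \text{and} \quad \lim_{n \to \infty} \frac{\sigma_{n+1}}{\sigma_n} = \lambda. \tag{4.51}$$
Referring to the definition of $T_n$ in (4.48) and using (4.51) and Conditions 4.1 and 4.2, there exist positive constants $C_{T,2}$, $C_{T,4}$ such that
$$\big(E|T_n|\big)^2 \le ET_n^2 = \operatorname{Var}(T_n) = \Big(\frac{\sigma_n}{\sigma_{n+1}}\Big)^2\frac{\operatorname{Var}(Z_n)}{\operatorname{Var}(X_n)} \le C_{T,2}^2\Big(\frac{1 - \delta_2}{1 - \delta_1}\Big)^{2n},$$
$$\text{and} \quad ET_n^4 = \Big(\frac{\sigma_n}{\sigma_{n+1}}\Big)^4E\Big(\frac{Z_n - EZ_n}{\sigma_n}\Big)^4 \le C_{T,4}^4\Big(\frac{1 - \delta_4}{1 - \delta_1}\Big)^{4n}.$$
By independence, a simple bound and Condition 4.2 for the second inequality we have
$$\big(E|Y_n|\big)^2 \le EY_n^2 = \operatorname{Var}(Y_n) = 1, \quad \text{and} \quad EY_{n+1}^4 \le 6E\Big(\frac{X_n - c_n}{\sigma_n}\Big)^4 \le 6C_{X,4}^4\Big(\frac{1 + \delta_3}{1 - \delta_1}\Big)^{4n}.$$
Using the recursion (4.39) and writing $\sigma_{Z_n}^2 = \operatorname{Var}(Z_n)$, we have $\sigma_{n+1} \le \lambda_n\sigma_n + \sigma_{Z_n}$ and $\lambda_n\sigma_n \le \sigma_{n+1} + \sigma_{Z_n}$, hence with $C_{r,1} = C_{T,2}$ we have
$$|\lambda_n\sigma_n - \sigma_{n+1}| \le \sigma_{Z_n} \quad \text{so} \quad |r_n - 1| \le C_{r,1}\Big(\frac{1 - \delta_2}{1 - \delta_1}\Big)^n.$$
Now, since $|r_n^p - 1| = |(r_n - 1 + 1)^p - 1| \le \sum_{j=1}^p \binom{p}{j}|r_n - 1|^j$ and $0 < \delta_1 < \delta_2 < 1$, there are constants $C_{r,p}$ such that
$$\big|r_n^p - 1\big| \le C_{r,p}\Big(\frac{1 - \delta_2}{1 - \delta_1}\Big)^n, \quad p = 1, 2, \ldots.$$
Now considering the first term of $\beta_n$ in (4.42), recalling (4.49),
$$E|W_{n+1} - Y_{n+1}| = E|(r_n - 1)Y_{n+1} + T_n| \le |r_n - 1|E|Y_{n+1}| + E|T_n| \le (C_{r,1} + C_{T,2})\Big(\frac{1 - \delta_2}{1 - \delta_1}\Big)^n,$$
which is upper bounded by a constant times $\phi_1^{n+1}$. For the second term of (4.42) we have
$$E\big|W_{n+1}^3 - Y_{n+1}^3\big| = E\big|(r_n^3 - 1)Y_{n+1}^3 + 3r_n^2Y_{n+1}^2T_n + 3r_nY_{n+1}T_n^2 + T_n^3\big|.$$
Applying the triangle inequality, the first term which results may be bounded as
$$\big|r_n^3 - 1\big|E\big|Y_{n+1}^3\big| \le \big|r_n^3 - 1\big|\big(EY_{n+1}^4\big)^{3/4} \le 6^{3/4}C_{r,3}C_{X,4}^3\Big(\frac{(1 - \delta_2)(1 + \delta_3)^3}{(1 - \delta_1)^4}\Big)^n,$$
which is smaller than some constant times $\phi_1^{n+1}$.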
Since $r_n \to 1$ by (4.51), it suffices to bound the next two terms without the factor of $r_n$. Thus,
\[
E\bigl|Y_{n+1}^2 T_n\bigr| \le \bigl(EY_{n+1}^4\, ET_n^2\bigr)^{1/2} \le 6^{1/2} C_{X,4}^2 C_{T,2} \left(\frac{(1-\delta_2)(1+\delta_3)^2}{(1-\delta_1)^3}\right)^n,
\]
which is less than a constant times $\phi_1^{n+1}$. Lastly,
\[
E\bigl|Y_{n+1} T_n^2\bigr| \le \bigl(EY_{n+1}^2\, ET_n^4\bigr)^{1/2} \le C_{T,4}^2 \left(\frac{1-\delta_4}{1-\delta_1}\right)^{2n} = C_{T,4}^2\, \phi_2^n
\]
and
\[
E\bigl|T_n^3\bigr| \le \bigl(ET_n^4\bigr)^{3/4} \le C_{T,4}^3 \left(\frac{1-\delta_4}{1-\delta_1}\right)^{3n} \le C_{T,4}^3\, \phi_2^{3n/2}.
\]
Hence (4.43) holds with the given $\beta$. Since $\boldsymbol{\alpha}_n \to \boldsymbol{\alpha}$, we have $\varphi_n \to \varphi$, verifying (4.44). Under Condition 4.1, $\boldsymbol{\alpha}$ is not a scalar multiple of a standard basis vector and hence $\varphi < 1$ by Lemma 4.1. As the first part of the theorem shows that (4.43) and (4.44) imply that (4.45) holds for all $\gamma \in (\max(\beta, \varphi), 1)$, the last claim is shown.

We note that this proof reverses the way in which the Stein equation is typically applied, where $h$ is given and the properties of $f$ are dependent on those assumed for $h$. In particular, in the proof of Theorem 4.6 the function $f \in \mathcal{F}$ is taken as given, and the function $h$, whose properties are determined by $f$ through (2.4), plays only an auxiliary role.
4.2.2 Normal Bounds for Hierarchical Sequences

The following result, extending Proposition 9 of Wehr and Woo (2001) to higher orders, is used to show that the moment bounds of Conditions 4.1 and 4.2 are satisfied under the hypotheses of Theorem 4.4, allowing Theorem 4.6 to be invoked. The dependence of the constants in (4.53) and (4.54) on $\epsilon$ is suppressed for notational simplicity.

Lemma 4.3 Let the hypotheses of Theorem 4.4 be satisfied for the recursion
\[
X_{n+1} = F(\mathbf{X}_n) \quad \text{for } n \ge 0.
\]
With $c_n = EX_n$ and $\boldsymbol{\alpha}_n = F'(\mathbf{c}_n)$, define
\[
Z_n = F(\mathbf{X}_n) - \boldsymbol{\alpha}_n \cdot \mathbf{X}_n. \tag{4.52}
\]
Then with $\boldsymbol{\alpha}$ the limit of $\boldsymbol{\alpha}_n$ and $\lambda = \|\boldsymbol{\alpha}\|$, for any integer $p \ge 1$ and $\epsilon > 0$, there exist constants $C_{X,p}, C_{Z,p}$ such that
\[
E|Z_n - EZ_n|^p \le C_{Z,p}^p (\lambda + \epsilon)^{2pn} \quad \text{for all } n \ge 0, \tag{4.53}
\]
and
\[
E|X_n - c_n|^p \le C_{X,p}^p (\lambda + \epsilon)^{pn} \quad \text{for all } n \ge 0. \tag{4.54}
\]
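Proof Expanding $F(\mathbf{X}_n)$ around the mean $\mathbf{c}_n = \mathbf{1}_k c_n$ of $\mathbf{X}_n$ yields
\[
F(\mathbf{X}_n) = F(\mathbf{c}_n) + \sum_{i=1}^k \alpha_{n,i}(X_{n,i} - c_n) + R_2(\mathbf{c}_n, \mathbf{X}_n), \tag{4.55}
\]
where
\[
R_2(\mathbf{c}_n, \mathbf{X}_n) = \sum_{i,j=1}^k \int_0^1 (1-t)\, \frac{\partial^2 F}{\partial x_i \partial x_j}\bigl(\mathbf{c}_n + t(\mathbf{X}_n - \mathbf{c}_n)\bigr)(X_{n,i} - c_n)(X_{n,j} - c_n)\,dt.
\]
Since the second partials of $F$ are continuous on the compact set $D = [a,b]^k$, with $\|\cdot\|$ the supremum norm on $D$ we have
\[
B = \frac{1}{2} \max_{i,j} \left\| \frac{\partial^2 F}{\partial x_i \partial x_j} \right\| < \infty,
\]
and therefore
\[
\bigl| R_2(\mathbf{c}_n, \mathbf{X}_n) \bigr| \le B \sum_{i,j=1}^k \bigl| (X_{n,i} - c_n)(X_{n,j} - c_n) \bigr|. \tag{4.56}
\]
Using (4.52), (4.55) and (4.56), we have for all $p \ge 1$
\[
\begin{aligned}
E|Z_n - EZ_n|^p
&= E\Bigl| F(\mathbf{X}_n) - EF(\mathbf{X}_n) - \sum_{i=1}^k \alpha_{n,i}(X_{n,i} - c_n) \Bigr|^p \\
&= E\bigl| F(\mathbf{c}_n) - EF(\mathbf{X}_n) + R_2(\mathbf{c}_n, \mathbf{X}_n) \bigr|^p \\
&\le 2^{p-1} \Biggl( \bigl| F(\mathbf{c}_n) - EF(\mathbf{X}_n) \bigr|^p + B^p E\Bigl( \sum_{i,j} \bigl| (X_{n,i} - c_n)(X_{n,j} - c_n) \bigr| \Bigr)^p \Biggr).
\end{aligned} \tag{4.57}
\]
For the first term of (4.57), again using (4.56),
\[
\begin{aligned}
\bigl| F(\mathbf{c}_n) - EF(\mathbf{X}_n) \bigr|^p = \bigl| ER_2(\mathbf{c}_n, \mathbf{X}_n) \bigr|^p
&\le B^p \Bigl( E\sum_{i,j} \bigl| (X_{n,i} - c_n)(X_{n,j} - c_n) \bigr| \Bigr)^p \\
&\le B^p \Bigl( \sum_{i,j} E(X_n - c_n)^2 \Bigr)^p \\
&= B^p k^{2p} \bigl( E(X_n - c_n)^2 \bigr)^p \le B^p k^{2p} E(X_n - c_n)^{2p},
\end{aligned} \tag{4.58}
\]
using Jensen's inequality for the final step.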
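Similarly, for the second term in (4.57),
\[
\begin{aligned}
E\Bigl( \sum_{i,j} \bigl| (X_{n,i} - c_n)(X_{n,j} - c_n) \bigr| \Bigr)^p
&\le k^p E\Bigl( \sum_{i=1}^k (X_{n,i} - c_n)^2 \Bigr)^p \\
&\le k^{2p-1} E \sum_{i=1}^k (X_{n,i} - c_n)^{2p} \\
&= k^{2p} E(X_n - c_n)^{2p}.
\end{aligned} \tag{4.59}
\]
Applying the bounds (4.58) and (4.59) in (4.57) we obtain for all $p \ge 1$, with $C_p = 2^p B^p k^{2p}$,
\[
E|Z_n - EZ_n|^p \le C_p E(X_n - c_n)^{2p}. \tag{4.60}
\]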
To demonstrate the lemma it therefore suffices to prove (4.54). Note that since $X_n \to_p c$ for $X_n \in [a,b]$ the bounded convergence theorem implies that $c_n = EX_n \to c$. Lemma 8 of Wehr and Woo (2001) shows that if $F : [a,b]^k \to [a,b]$ is an averaging function and there exists $c \in [a,b]$ such that $X_n \to_p c$, then
\[
\forall \epsilon \in (0,1)\ \exists M \text{ such that for all } n \ge 0, \quad P\bigl(|X_n - c| > \epsilon\bigr) \le M\epsilon^n. \tag{4.61}
\]
In particular the large deviation estimate (4.61) holds under the given assumptions, and therefore also with $c$ replaced by $c_n$.

We now show that if $a_n$, $n = 0, 1, \ldots$ is a sequence such that for every $\epsilon > 0$ there exist $M$ and $n_0 \ge 0$ such that
\[
a_{n+1} \le (\lambda + \epsilon)^p a_n + M(\lambda + \epsilon)^{p(n+1)} \quad \text{for all } n \ge n_0, \tag{4.62}
\]
then for all $\epsilon > 0$ there exists $C$ such that
\[
a_n \le C(\lambda + \epsilon)^{pn} \quad \text{for all } n \ge 0. \tag{4.63}
\]
Let $\epsilon > 0$ be given, and let $M$ and $n_0$ be such that (4.62) holds with $\epsilon$ replaced by $\epsilon/2$. Setting
\[
C = \max\left\{ \frac{a_{n_0}}{(\lambda + \epsilon)^{pn_0}},\ \frac{M}{1 - \bigl(\frac{\lambda + \epsilon/2}{\lambda + \epsilon}\bigr)^p} \right\},
\]
it is trivial that (4.63) holds for $n = n_0$, and a direct induction shows (4.63) holds for all $n \ge n_0$. By increasing $C$ if necessary, we have that (4.63) holds for all $n \ge 0$.

Unqualified statements in the remainder of the proof below involving $\epsilon$ and $M$ are to be read to mean that for every $\epsilon > 0$ there exists $M$ such that the statement holds for all $n$; the values of $\epsilon$ and $M$ are not necessarily the same at each occurrence, even from line to line. By (4.61) and that $X_n \in [a,b]$ we have
\[
E(X_n - c_n)^{2p} = E\bigl[(X_n - c_n)^{2p};\ |X_n - c_n| \le \epsilon\bigr] + E\bigl[(X_n - c_n)^{2p};\ |X_n - c_n| > \epsilon\bigr] \le \epsilon^p E|X_n - c_n|^p + M\epsilon^n.
\]
From (4.60), this inequality gives that
\[
E|Z_n - EZ_n|^p \le \epsilon^p E|X_n - c_n|^p + M\epsilon^n. \tag{4.64}
\]
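The passage from the recursive bound (4.62) to the geometric bound (4.63) can be checked numerically. The sketch below takes the extreme case of (4.62) with $\epsilon$ replaced by $\epsilon/2$ and tests (4.63) along the sequence. It assumes the start-up term in the constant is normalized by $(\lambda+\epsilon)^{pn_0}$; the function names and parameter values are hypothetical choices made only for illustration.

```python
def geometric_bound_constant(a_n0, M, lam, eps, p, n0):
    """Constant C: the larger of the start-up term and M / (1 - rho**p)."""
    rho = (lam + eps / 2) / (lam + eps)
    # Assumption of this sketch: start-up term normalized by (lam + eps)**(p * n0).
    return max(a_n0 / (lam + eps) ** (p * n0), M / (1 - rho ** p))

def check(lam=0.5, eps=0.2, p=3, M=4.0, n0=2, a_n0=1.0, steps=200):
    """Return True if a_n <= C (lam + eps)^(p n) for n = n0, ..., n0 + steps - 1."""
    C = geometric_bound_constant(a_n0, M, lam, eps, p, n0)
    a, ok = a_n0, True
    for n in range(n0, n0 + steps):
        # small relative slack absorbs floating-point rounding
        ok = ok and a <= C * (lam + eps) ** (p * n) * (1 + 1e-12)
        # extreme case of (4.62) with eps replaced by eps / 2
        a = (lam + eps / 2) ** p * a + M * (lam + eps / 2) ** (p * (n + 1))
    return ok

print(check())
```

The induction behind the check is that, after dividing by $(\lambda+\epsilon)^{p(n+1)}$, the recursion contributes at most $C\rho^p + M\rho^{p(n+1)} \le C$ with $\rho = (\lambda+\epsilon/2)/(\lambda+\epsilon)$.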
Since for all $\epsilon > 0$ we have
\[
\lim_{x\to\infty} \bigl[ (x+1)^p - (1+\epsilon)x^p \bigr] = -\infty \quad \text{and therefore} \quad \sup_{x \ge 0} \bigl[ (x+1)^p - (1+\epsilon)x^p \bigr] < \infty,
\]
substituting $x = |w|/|z|$ when $z \ne 0$ we see that there exists $M$ such that for all $w, z$ we have
\[
|w + z|^p \le (1+\epsilon)|w|^p + M|z|^p,
\]
noting that the inequality holds trivially with $M = 1$ for $z = 0$. Now applying definition (4.52),
\[
E|X_{n+1} - c_{n+1}|^p \le (1+\epsilon)\, E\Bigl| \sum_{i=1}^k \alpha_{n,i}(X_{n,i} - c_n) \Bigr|^p + M E|Z_n - EZ_n|^p. \tag{4.65}
\]
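To make the constant $M$ in the elementary inequality concrete, the following sketch computes $\sup_{x\ge 0}[(x+1)^p - (1+\epsilon)x^p]$ in closed form for integer $p \ge 1$ by locating the critical point, and then tests $|w+z|^p \le (1+\epsilon)|w|^p + M|z|^p$ on given inputs. The function names and the tolerance are assumptions of this illustration; the text only asserts that some finite $M$ exists.

```python
def sup_constant(p, eps):
    """sup over x >= 0 of (x + 1)**p - (1 + eps) * x**p, for integer p >= 1 and eps > 0."""
    if p == 1:
        return 1.0  # 1 - eps * x is maximized at x = 0
    # derivative vanishes where (1 + 1/x)**(p - 1) = 1 + eps
    x_star = 1.0 / ((1.0 + eps) ** (1.0 / (p - 1)) - 1.0)
    return (x_star + 1.0) ** p - (1.0 + eps) * x_star ** p

def inequality_holds(w, z, p, eps):
    """Check |w + z|^p <= (1 + eps)|w|^p + M|z|^p with M = sup_constant(p, eps)."""
    M = sup_constant(p, eps)
    lhs = abs(w + z) ** p
    rhs = (1.0 + eps) * abs(w) ** p + M * abs(z) ** p
    tol = 1e-9 * (1.0 + abs(w) ** p + abs(z) ** p)
    return lhs <= rhs + tol

print(sup_constant(3, 0.1) >= 1.0, inequality_holds(5.0, -2.0, 3, 0.1))
```

As $\epsilon \downarrow 0$ the critical point moves off to infinity and the constant grows without bound, which is why $M$ must be allowed to depend on $\epsilon$.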
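Specializing (4.65) to the case $p = 2$ gives
\[
E(X_{n+1} - c_{n+1})^2 \le (\lambda + \epsilon)^2 E(X_n - c_n)^2 + M E(Z_n - EZ_n)^2.
\]
Applying (4.64) with $p = 2$ to this inequality yields
\[
E(X_{n+1} - c_{n+1})^2 \le (\lambda + \epsilon)^2 E(X_n - c_n)^2 + M\epsilon^{2n+2} \le (\lambda + \epsilon)^2 E(X_n - c_n)^2 + M(\lambda + \epsilon)^{2(n+1)}.
\]
Hence inequality (4.62), and therefore (4.63), are true for $a_n = E(X_n - c_n)^2$ and $p = 2$, yielding (4.54) for $p = 2$. Now Hölder's inequality shows that (4.54) is also true for $p = 1$.

Now let $p > 2$ be an integer and suppose that (4.54) is true for all integers $q$, $1 \le q < p$. In expanding the first term in (4.65) we let $\mathbf{p} = (p_1, \ldots, p_k)$ denote a multi-index and $|\mathbf{p}| = \sum_i p_i$. Use the induction hypotheses, and (4.22) of Lemma 4.2 in (4.66), to obtain, with $A_{X,p} = \max_{q<p} C_{X,q}$ and a suitable constant $B_{X,p}$,
\[
\begin{aligned}
E\Bigl| \sum_{i=1}^k \alpha_{n,i}(X_{n,i} - c_n) \Bigr|^p
&\le \sum_{i=1}^k |\alpha_{n,i}|^p E|X_{n,i} - c_n|^p + \sum_{|\mathbf{p}|=p,\ 0 \le p_i < p} \binom{p}{\mathbf{p}} E\prod_{i=1}^k |\alpha_{n,i}|^{p_i} |X_{n,i} - c_n|^{p_i} \\
&\le E|X_n - c_n|^p \sum_{i=1}^k |\alpha_{n,i}|^p + \sum_{|\mathbf{p}|=p,\ 0 \le p_i < p} \binom{p}{\mathbf{p}} \prod_{i=1}^k |\alpha_{n,i}|^{p_i} C_{X,p_i}^{p_i} (\lambda + \epsilon)^{p_i n} \\
&\le E|X_n - c_n|^p \sum_{i=1}^k |\alpha_{n,i}|^p + A_{X,p}^p (\lambda + \epsilon)^{pn} \sum_{|\mathbf{p}|=p} \binom{p}{\mathbf{p}} \prod_{i=1}^k |\alpha_{n,i}|^{p_i} \\
&= E|X_n - c_n|^p \sum_{i=1}^k |\alpha_{n,i}|^p + A_{X,p}^p (\lambda + \epsilon)^{pn} \Bigl( \sum_{i=1}^k |\alpha_{n,i}| \Bigr)^p \\
&\le \sum_{i=1}^k |\alpha_{n,i}|^p\, E|X_n - c_n|^p + B_{X,p}^p (\lambda + \epsilon)^{pn} \\
&\le (\lambda + \epsilon)^p E|X_n - c_n|^p + B_{X,p}^p (\lambda + \epsilon)^{p(n+1)}.
\end{aligned} \tag{4.66}
\]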
Applying (4.64) and (4.66) in (4.65) gives
\[
E|X_{n+1} - c_{n+1}|^p \le (\lambda + \epsilon)^p E|X_n - c_n|^p + M(\lambda + \epsilon)^{p(n+1)},
\]
from which we can conclude that (4.63) holds for $a_n = E|X_n - c_n|^p$, completing the induction on $p$.

Proof of Theorem 4.4 By Theorem 4.6 it suffices to show that Conditions 4.1 and 4.2 are satisfied for some $\delta_i$, $i = 1, 2, 3, 4$ satisfying $\beta < \varphi$. By Property 1 of averaging functions, $F(\mathbf{1}_k c) = c$, and differentiation with respect to $c$ yields $\sum_{i=1}^k \alpha_i = 1$. By Property 2, monotonicity, $\alpha_i \ge 0$, and (4.24) of Lemma 4.2 yields $0 < \lambda < \varphi < 1$, using that $\boldsymbol{\alpha}$ is not a multiple of a standard basis vector. Let $\delta_4 \in (1 - \varphi, 1 - \lambda)$. Since $\delta_4 < 1 - \lambda$ we have $\lambda^2 < \lambda(1 - \delta_4)$, and therefore there exists $\epsilon > 0$ such that $(\lambda + \epsilon)^2 < \lambda(1 - \delta_4)$. By Lemma 4.3, for $p = 2$ and $p = 4$, for this $\epsilon$ there exists $C_{Z,p}^p$ such that
\[
E(Z_n - EZ_n)^p \le C_{Z,p}^p (\lambda + \epsilon)^{2pn} \le C_{Z,p}^p \lambda^{pn} (1 - \delta_4)^{pn}.
\]
Hence the fourth and second moment bounds in Conditions 4.1 and 4.2 on $Z_n$ are satisfied with $\delta_4$ and $\delta_2 = \delta_4$, respectively. Since $1 - \delta_4 < \varphi$ there exist $\delta_1 \in (0, \delta_2)$ and $\delta_3 > 0$ such that $\eta < \varphi$ where
\[
\eta = \frac{(1 - \delta_4)(1 + \delta_3)^3}{(1 - \delta_1)^4}.
\]
Proposition 10 of Wehr and Woo (2001) shows that under the assumptions of Theorem 4.4, for every $\epsilon > 0$ there exists $C_{X,2}^2$ such that
\[
\mathrm{Var}(X_n) \ge C_{X,2}^2 (\lambda - \epsilon)^{2n}.
\]
Taking $\epsilon = \lambda\delta_1$, we have that $\mathrm{Var}(X_n)$ satisfies the lower bound in Condition 4.1. Applying Lemma 4.3 with $p = 4$ and $\epsilon = \lambda\delta_3$ we see the fourth moment bound on $X_n$ in Condition 4.2 is satisfied. With these choices for $\delta_i$, $i = 1, \ldots, 4$, as $\eta < \varphi < 1$, we have $\phi_2 < \eta < 1$ and $\phi_1 = \eta < 1$, hence Conditions 4.1 and 4.2 are satisfied. Noting that $\beta = \max\{\phi_1, \phi_2\} = \eta < \varphi$ now completes the proof.
4.2.3 Convergence Rates for the Diamond Lattice

We now apply Theorem 4.4 to hierarchical sequences generated by the diamond lattice conductivity function $F(\mathbf{x})$ in (4.30). We have already argued that Theorem 4.5 implies that $F(\mathbf{x})$ is strictly averaging on, say $[a,b]^4$, for any $0 < a < b$ and choice of positive weights satisfying $F(\mathbf{1}_4) = 1$, and on this domain such an $F(\mathbf{x})$ is easily seen to be twice continuously differentiable. For all such $F(\mathbf{x})$ the result of Shneiberg (1986) quoted in Sect. 4.2 shows that $X_n$ satisfies a weak law.

We now study the quantity $\varphi$ which determines the exponential decay rate of the upper bound of Theorem 4.4 to zero. The first partial derivative $\partial F(\mathbf{x})/\partial x_1$ has the form
\[
\frac{\partial F(\mathbf{x})}{\partial x_1} = \frac{(w_1 x_1^2)^{-1}}{\bigl((w_1 x_1)^{-1} + (w_2 x_2)^{-1}\bigr)^2},
\]
and similarly for the other partials. Hence $F'(t\mathbf{x}) = F'(\mathbf{x})$ for all $t \ne 0$. As $X_n$ is a random variable on $[a,b]$ we have $c_n = EX_n \ne 0$, and therefore $\boldsymbol{\alpha}_n = F'(c_n \mathbf{1}_4) = F'(\mathbf{1}_4)$ for all $n \ge 0$. In particular, $\boldsymbol{\alpha} = \lim_{n\to\infty} \boldsymbol{\alpha}_n$ is given by
\[
\boldsymbol{\alpha} = \left( \frac{w_1^{-1}}{(w_1^{-1} + w_2^{-1})^2},\ \frac{w_2^{-1}}{(w_1^{-1} + w_2^{-1})^2},\ \frac{w_3^{-1}}{(w_3^{-1} + w_4^{-1})^2},\ \frac{w_4^{-1}}{(w_3^{-1} + w_4^{-1})^2} \right).
\]
Since we are considering the case where all the weights are positive, the vector $\boldsymbol{\alpha}$ is not a scalar multiple of a standard basis vector. Now from (4.16) we compute
\[
\varphi = \lambda^{-3} \left( \frac{w_1^{-3} + w_2^{-3}}{(w_1^{-1} + w_2^{-1})^6} + \frac{w_3^{-3} + w_4^{-3}}{(w_3^{-1} + w_4^{-1})^6} \right), \tag{4.67}
\]
where
\[
\lambda = \left( \frac{w_1^{-2} + w_2^{-2}}{(w_1^{-1} + w_2^{-1})^4} + \frac{w_3^{-2} + w_4^{-2}}{(w_3^{-1} + w_4^{-1})^4} \right)^{1/2}.
\]
As an illustration of the bounds provided by Theorem 4.4, first consider the 'side equally weighted network', the one with $\mathbf{w} = (w, w, 2-w, 2-w)$ for $w \in [1, 2)$; we recall the weights $\mathbf{w}$ refer to the bonds in the lattice traversed counterclockwise from the top in Fig. 4.1(c). The weights in the vector $\mathbf{w}$ for $w$ in this range are positive and satisfy $F(\mathbf{1}_4) = 1$. For $w = 1$ all weights are equal and $\boldsymbol{\alpha} = 4^{-1}\mathbf{1}_4$, so $\varphi$ achieves its minimum value $1/2 = 1/\sqrt{k}$ with $k = 4$. By Theorem 4.4, for all $\gamma \in (1/2, 1)$ there exists a constant $C$ such that $\|W_n - Z\|_1 \le C\gamma^n$. The values of $\gamma$ just above $1/2$ correspond, in view of (4.36), to the rate $N^{-\theta}$ for $\theta$ just below $-\log_4 1/2 = 1/2$, that is, $N^{-1/2+\epsilon}$ for small $\epsilon > 0$, where $N = 4^n$, the number of variables at stage $n$. As $w$ increases from 1 to 2, $\varphi$ increases continuously from $1/2$ to $1/\sqrt{2}$, with $w$ approaching 2 from below corresponding to the least favorable rate for the side equally weighted network of $\theta$ just under $-\log_4 1/\sqrt{2} = 1/4$, that is, of $N^{-1/4+\epsilon}$ for any $\epsilon > 0$.
With only the restriction that the weights are positive and satisfy $F(\mathbf{1}_4) = 1$ consider
\[
\mathbf{w} = (1 + 1/t,\ s,\ t,\ 1/t) \quad \text{where } s = \Bigl[ \bigl( 1 - (1/t + t)^{-1} \bigr)^{-1} - (1 + 1/t)^{-1} \Bigr]^{-1},\ t > 0.
\]
When $t = 1$ we have $s = 2/3$ and $\varphi = 11\sqrt{2}/27$. As $t \to \infty$, $s/t \to 1/2$ and $\boldsymbol{\alpha}$ tends to the standard basis vector $(1, 0, 0, 0)$, so $\varphi \to 1$. Since $11\sqrt{2}/27 \in (1/2, 1/\sqrt{2})$, the above two examples show that the value of $\gamma$ given by Theorem 4.4 for the diamond lattice can take any value in the range $(1/2, 1)$, corresponding to $N^{-\theta}$ for any $\theta \in (0, 1/2)$.
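The two diamond lattice examples can be reproduced numerically. The sketch below evaluates the gradient $\boldsymbol{\alpha} = F'(\mathbf{1}_4)$, its Euclidean norm $\lambda$ and the quantity $\varphi = \sum_i |\alpha_i|^3/\lambda^3$ for a given weight vector, together with the exponent $\theta = -\log_4 \varphi$. The function names are illustrative, and the formula for $s$ is the one obtained by solving $F(\mathbf{1}_4) = 1$ for the second weight.

```python
import math

def diamond_alpha(w):
    """Gradient of the diamond lattice conductivity function at (1, 1, 1, 1)."""
    w1, w2, w3, w4 = w
    top = (1 / w1 + 1 / w2) ** 2
    bottom = (1 / w3 + 1 / w4) ** 2
    return [1 / w1 / top, 1 / w2 / top, 1 / w3 / bottom, 1 / w4 / bottom]

def diamond_rate(w):
    """Return (alpha, lambda, phi) with lambda = ||alpha|| and phi = sum |alpha_i|^3 / lambda^3."""
    alpha = diamond_alpha(w)
    lam = math.sqrt(sum(a * a for a in alpha))
    phi = sum(abs(a) ** 3 for a in alpha) / lam ** 3
    return alpha, lam, phi

def weights_from_t(t):
    """Weights (1 + 1/t, s, t, 1/t) with s chosen so that F(1, 1, 1, 1) = 1."""
    s = 1 / (1 / (1 - 1 / (1 / t + t)) - 1 / (1 + 1 / t))
    return (1 + 1 / t, s, t, 1 / t)

def rate_exponent(phi):
    """theta such that phi**n = N**(-theta) with N = 4**n."""
    return -math.log(phi, 4)

for w in [(1, 1, 1, 1), (1.999, 1.999, 0.001, 0.001), weights_from_t(1.0), weights_from_t(1e4)]:
    alpha, lam, phi = diamond_rate(w)
    print(w, round(phi, 6), round(rate_exponent(phi), 6))
```

In each case the components of $\boldsymbol{\alpha}$ sum to one, as they must for an averaging function, and $\varphi$ moves from $1/2$ for equal weights toward $1$ as one bond comes to dominate.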
4.3 Cone Measure Projections

In this section we use Stein's method to obtain $L^1$ bounds for the normal approximation of one dimensional projections of the form
\[
Y = \boldsymbol{\theta} \cdot \mathbf{X}, \tag{4.68}
\]
where for some $p > 0$, the vector $\mathbf{X} \in \mathbb{R}^n$ has the cone measure distribution $C_p^n$ given in (4.71) below, and $\boldsymbol{\theta} \in \mathbb{R}^n$ is of unit length. The normal approximation of projections of random vectors in lesser and greater generality has been studied by many authors, and under a variety of metrics. In the case $p = 2$, when cone measure is uniform on the surface of the unit Euclidean sphere in $\mathbb{R}^n$, Diaconis and Freedman (1987) show that the low dimensional projections of $\mathbf{X}$ are close to normal in total variation. It is particularly easy to see in this case, and true in general, that cone measure $C_p^n$ is coordinate symmetric, that is,
\[
(X_1, \ldots, X_n) =_d (e_1 X_1, \ldots, e_n X_n) \quad \text{for all } (e_1, \ldots, e_n) \in \{-1, 1\}^n. \tag{4.69}
\]
Meckes and Meckes (2007) derive bounds using Stein's method for the normal approximation of random vectors with symmetries in general, including coordinate-symmetry, considering the supremum and total variation norm. Goldstein and Shao (2009) give $L^\infty$ bounds on the projections of coordinate symmetric random vectors of order $1/\sqrt{n}$ without applying Stein's method. Klartag (2009) proves bounds of order $1/n$ on the $L^\infty$ distance under additional conditions on the distribution of $\mathbf{X}$, including that its density be log concave.

One special case of note where $\mathbf{X}$ is coordinate symmetric is when its distribution is uniform over a convex set which has symmetry with respect to all coordinate planes. For general results on the projections of vectors sampled uniformly from convex sets, see Klartag (2007) and references therein. Studying here the specific instance of the projections of cone measure allows, naturally, for the sharpening of general results about projections of coordinate symmetric vectors to this particular case.

To define cone measure let
\[
S(\ell_p^n) = \Bigl\{ \mathbf{x} \in \mathbb{R}^n : \sum_{i=1}^n |x_i|^p = 1 \Bigr\} \quad \text{and} \quad B(\ell_p^n) = \Bigl\{ \mathbf{x} \in \mathbb{R}^n : \sum_{i=1}^n |x_i|^p \le 1 \Bigr\}. \tag{4.70}
\]
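Then with $\mu^n$ Lebesgue measure in $\mathbb{R}^n$, the cone measure of $A \subset S(\ell_p^n)$ is given by
\[
C_p^n(A) = \frac{\mu^n([0,1]A)}{\mu^n(B(\ell_p^n))} \quad \text{where } [0,1]A = \{ t\mathbf{a} : \mathbf{a} \in A,\ 0 \le t \le 1 \}. \tag{4.71}
\]
The main result in this section on the projections of $C_p^n$ is the following.

Theorem 4.7 Let $\mathbf{X}$ have cone measure $C_p^n$ on the sphere $S(\ell_p^n)$ for some $p > 0$ and let
\[
Y = \sum_{i=1}^n \theta_i X_i
\]
be the one-dimensional projection of $\mathbf{X}$ along the direction $\boldsymbol{\theta} \in \mathbb{R}^n$ with $\|\boldsymbol{\theta}\| = 1$. Then with $\sigma_{n,p}^2 = \mathrm{Var}(X_1)$ and $m_{n,p} = E|X_1|^3/\sigma_{n,p}^2$, given in (4.84) and (4.87), respectively, and $F$ the distribution function of the normalized sum $W = Y/\sigma_{n,p}$, we have
\[
\|F - \Phi\|_1 \le \frac{m_{n,p}}{\sigma_{n,p}} \sum_{i=1}^n |\theta_i|^3 + \left( \frac{1}{p} \vee 1 \right) \frac{4}{n+2}, \tag{4.72}
\]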
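where $\Phi$ is the cumulative distribution function of the standard normal.

We note that by the limits in (4.84) and (4.88), the constant $m_{n,p}/\sigma_{n,p}$ that multiplies the sum in the bound (4.72) is of the order of a constant with asymptotic value
\[
\lim_{n\to\infty} \frac{m_{n,p}}{\sigma_{n,p}} = \frac{\Gamma(4/p)\sqrt{\Gamma(1/p)}}{\Gamma(3/p)^{3/2}}.
\]
Since, for $\boldsymbol{\theta} \in \mathbb{R}^n$ with $\|\boldsymbol{\theta}\| = 1$, we have
\[
\sum_{i=1}^n |\theta_i|^3 \ge \frac{1}{\sqrt{n}},
\]
the second term in (4.72) is always of smaller order than the first, so the decay rate of the bound to zero is determined by $\sum_i |\theta_i|^3$. The minimal rate $1/\sqrt{n}$ is achieved when $\theta_i = 1/\sqrt{n}$.

In the special cases $p = 1$ and $p = 2$, $C_p^n$ is uniform on the simplex $\sum_{i=1}^n |x_i| = 1$ and the unit Euclidean sphere $\sum_{i=1}^n x_i^2 = 1$, respectively. By (4.84) and (4.87) for $p = 1$,
\[
\sigma_{n,1}^2 = \frac{2}{n(n+1)} \quad \text{and} \quad m_{n,1} = \frac{3}{n+2},
\]
and, using also (4.88) for $p = 2$,
\[
\sigma_{n,2}^2 = \frac{1}{n} \quad \text{and} \quad m_{n,2} \le \sqrt{\frac{3}{n+2}};
\]
these relations yield
\[
\frac{m_{n,1}}{\sigma_{n,1}} = 3\sqrt{\frac{n(n+1)}{2(n+2)^2}} \le \frac{3}{\sqrt{2}} \quad \text{and} \quad \frac{m_{n,2}}{\sigma_{n,2}} \le \sqrt{\frac{3n}{n+2}} \le \sqrt{3}.
\]
Substituting into (4.72) now gives
\[
\|F - \Phi\|_1 \le \frac{3}{\sqrt{p+1}} \sum_{i=1}^n |\theta_i|^3 + \frac{4}{n+2} \quad \text{for } p \in \{1, 2\}. \tag{4.73}
\]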
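The constants in the bound can be evaluated directly from Gamma functions. The sketch below reads (4.84) and (4.87) as $\sigma_{n,p}^2 = \Gamma(3/p)\Gamma(n/p)/(\Gamma(1/p)\Gamma((n+2)/p))$ and $m_{n,p} = \Gamma(4/p)\Gamma((n+2)/p)/(\Gamma(3/p)\Gamma((n+3)/p))$, and then evaluates the right-hand side of (4.72) for a unit vector. The function names are hypothetical, and log-Gamma is used only to avoid overflow for large $n$.

```python
import math

def log_gamma_ratio(num, den):
    """log of prod Gamma(num) / prod Gamma(den)."""
    return sum(math.lgamma(x) for x in num) - sum(math.lgamma(x) for x in den)

def sigma2(n, p):
    """Var(X_1) under cone measure, read from (4.84)."""
    return math.exp(log_gamma_ratio([3 / p, n / p], [1 / p, (n + 2) / p]))

def m(n, p):
    """E|X_1|^3 / Var(X_1) under cone measure, read from (4.87)."""
    return math.exp(log_gamma_ratio([4 / p, (n + 2) / p], [3 / p, (n + 3) / p]))

def l1_bound(theta, p):
    """Right-hand side of (4.72) for a unit vector theta."""
    n = len(theta)
    cubes = sum(abs(t) ** 3 for t in theta)
    return m(n, p) / math.sqrt(sigma2(n, p)) * cubes + max(1 / p, 1) * 4 / (n + 2)

n = 100
theta = [1 / math.sqrt(n)] * n  # equal weights, the direction with the fastest rate
for p in (1, 2):
    print(p, l1_bound(theta, p), 3 / math.sqrt(p + 1) / math.sqrt(n) + 4 / (n + 2))
```

For $p \in \{1, 2\}$ the first printed value should not exceed the second, which is the simplified bound (4.73) for this direction.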
4.3.1 Coupling Constructions for Coordinate Symmetric Variables and Their Projections

We generalize the construction in Proposition 2.3 to coordinate symmetric vectors, beginning by generalizing the notion of square biasing, given there, to square biasing in coordinates. To begin, note that if $\mathbf{Y}$ is a coordinate symmetric random vector in $\mathbb{R}^n$ and $EY_i^2 < \infty$ for $i = 1, \ldots, n$, then the symmetry condition (4.69) implies
\[
EY_i = -EY_i \quad \text{and} \quad EY_iY_j = -EY_iY_j \quad \text{for all } i \ne j,
\]
and hence
\[
EY_i = 0 \quad \text{and} \quad EY_iY_j = \sigma_i^2 \delta_{ij} \quad \text{for all } i, j, \tag{4.74}
\]
where $\sigma_i^2 = \mathrm{Var}(Y_i) = EY_i^2$. By removing any component which has zero variance, and lowering the dimension accordingly, we may assume without loss of generality that $\sigma_i^2 > 0$ for all $i = 1, \ldots, n$. For such $\mathbf{Y}$, for all $i = 1, \ldots, n$, we claim there exists a distribution $\mathbf{Y}^i$ such that for all functions $f : \mathbb{R}^n \to \mathbb{R}$ for which the expectation of the left hand side below exists,
\[
EY_i^2 f(\mathbf{Y}) = \sigma_i^2 Ef\bigl(\mathbf{Y}^i\bigr), \tag{4.75}
\]
and say that $\mathbf{Y}^i$ has the $\mathbf{Y}$-square bias distribution in direction $i$. In particular, the distribution of $\mathbf{Y}^i$ is absolutely continuous with respect to $\mathbf{Y}$ with
\[
dF^i(\mathbf{y}) = \frac{y_i^2}{\sigma_i^2}\, dF(\mathbf{y}). \tag{4.76}
\]
By specializing (4.75) to the case where $f$ depends only on $Y_i$, we see, in the language of Proposition 2.3, that $Y_i^i =_d Y_i^\square$, that is, that $Y_i^i$ has the $Y_i$-square bias distribution.

Proposition 4.3 shows how to construct the zero bias distribution $Y^*$ for the sum $Y$ of the components of a coordinate-symmetric vector in terms of $\mathbf{Y}^i$ and a random index in a way that parallels the construction for size biasing given in Proposition 2.2. Again we let $\mathcal{U}[a,b]$ denote the uniform distribution on $[a,b]$.
Proposition 4.3 Let $\mathbf{Y} \in \mathbb{R}^n$ be a coordinate-symmetric random vector with $\mathrm{Var}(Y_i) = \sigma_i^2 \in (0, \infty)$ for all $i = 1, 2, \ldots, n$, and
\[
Y = \sum_{i=1}^n Y_i.
\]
Let $\mathbf{Y}^i$, $i = 1, \ldots, n$, have the square bias distribution given in (4.75), $I$ a random index with distribution
\[
P(I = i) = \frac{\sigma_i^2}{\sum_{j=1}^n \sigma_j^2} \tag{4.77}
\]
and $U_i \sim \mathcal{U}[-1, 1]$, with $\mathbf{Y}^i$, $I$ and $U_i$ mutually independent for all $i = 1, \ldots, n$. Then
\[
Y^* = U_I Y_I^I + \sum_{j \ne I} Y_j^I \tag{4.78}
\]
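has the $Y$-zero bias distribution.

Proof Let $f$ be an absolutely continuous function with $E|Yf(Y)| < \infty$. Starting with the given form of $Y^*$, then averaging over the index $I$, integrating out the uniform variable $U_i$ and applying (4.75) and (4.69), we obtain
\[
\begin{aligned}
\sigma^2 Ef'(Y^*)
&= \sigma^2 Ef'\Bigl( U_I Y_I^I + \sum_{j \ne I} Y_j^I \Bigr) \\
&= \sigma^2 \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2}\, Ef'\Bigl( U_i Y_i^i + \sum_{j \ne i} Y_j^i \Bigr) \\
&= \sum_{i=1}^n \sigma_i^2\, E\left[ \frac{f\bigl(Y_i^i + \sum_{j \ne i} Y_j^i\bigr) - f\bigl(-Y_i^i + \sum_{j \ne i} Y_j^i\bigr)}{2Y_i^i} \right] \\
&= \sum_{i=1}^n E\, Y_i \left[ \frac{f\bigl(Y_i + \sum_{j \ne i} Y_j\bigr) - f\bigl(-Y_i + \sum_{j \ne i} Y_j\bigr)}{2} \right] \\
&= \sum_{i=1}^n E\, Y_i f\Bigl( Y_i + \sum_{j \ne i} Y_j \Bigr) \\
&= EYf(Y).
\end{aligned}
\]
Thus, $Y^*$ has the $Y$-zero bias distribution.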
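The identity $\sigma^2 Ef'(Y^*) = EYf(Y)$ can be verified exactly in a small discrete case. The sketch below makes the simplifying assumption that the coordinates are independent and symmetric (a special case of coordinate symmetry in which $\mathbf{Y}^i$ square-biases coordinate $i$ and leaves the others unchanged), uses rational arithmetic, and integrates out the uniform variable in closed form. The distributions and the test function $f(y) = y^3$ are hypothetical choices.

```python
from fractions import Fraction
from itertools import product

def square_bias(pmf):
    """Square-biased pmf: mass proportional to y^2 * P(Y = y)."""
    var = sum(y * y * q for y, q in pmf.items())  # the mean is zero by symmetry
    return {y: y * y * q / var for y, q in pmf.items() if y != 0}

def zero_bias_side(pmfs, f):
    """sigma^2 * E f'(Y*) for Y* built as in (4.78), with U integrated out exactly."""
    variances = [sum(y * y * q for y, q in pmf.items()) for pmf in pmfs]
    total = Fraction(0)
    for i, pmf in enumerate(pmfs):
        others = [list(other.items()) for j, other in enumerate(pmfs) if j != i]
        for y, qy in square_bias(pmf).items():
            for combo in product(*others):
                s = sum(v for v, _ in combo)
                prob = qy
                for _, q in combo:
                    prob *= q
                # E f'(U y + s) = (f(y + s) - f(-y + s)) / (2 y) for U ~ U[-1, 1]
                total += variances[i] * prob * Fraction(f(y + s) - f(-y + s)) / (2 * y)
    return total

def direct_side(pmfs, f):
    """E[Y f(Y)] for Y the sum of the independent coordinates."""
    total = Fraction(0)
    for combo in product(*[list(pmf.items()) for pmf in pmfs]):
        y = sum(v for v, _ in combo)
        prob = Fraction(1)
        for _, q in combo:
            prob *= q
        total += prob * y * f(y)
    return total

pmfs = [
    {-1: Fraction(1, 2), 1: Fraction(1, 2)},
    {-2: Fraction(1, 4), 0: Fraction(1, 2), 2: Fraction(1, 4)},
]

def cube(y):
    return y ** 3

print(zero_bias_side(pmfs, cube), direct_side(pmfs, cube))
```

Both sides agree exactly here; with $f(y) = y$ the identity reduces to $\sum_i \sigma_i^2 = \mathrm{Var}(Y)$.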
Factoring (4.76) as
\[
dF^i(\mathbf{y}) = dF_i^i(y_i)\, dF(y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n \mid y_i) \quad \text{where } dF_i^i(y_i) = \frac{y_i^2}{\sigma_i^2}\, dF_i(y_i) \tag{4.79}
\]
provides an alternate way of seeing that $Y_i^i =_d Y_i^\square$. Moreover, it suggests a coupling between $Y$ and $Y^*$ where, given $\mathbf{Y}$, an index $I = i$ is chosen with weight proportional to the variance $\sigma_i^2$, the summand $Y_i$ is replaced by $Y_i^i$ having that summand's 'square bias' distribution and then multiplied by $U$, and, finally, the remaining variables of $\mathbf{Y}$ are perturbed, so that they achieve their original distribution conditional on the $i$th variable now taking on the value $Y_i^i$. Typically the remaining variables are changed as little as possible in order to make the coupling between $Y$ and $Y^*$ close.

Now let $\mathbf{X} \in \mathbb{R}^n$ be an exchangeable coordinate-symmetric random vector with components having finite second moments and let $\boldsymbol{\theta} \in \mathbb{R}^n$ have unit length. Then, by (4.74), the projection $Y$ of $\mathbf{X}$ along the direction $\boldsymbol{\theta}$,
\[
Y = \sum_{i=1}^n \theta_i X_i,
\]
has mean zero and variance $\sigma^2$ equal to the common variance of the components of $\mathbf{X}$. To form $Y^*$ using the construction just outlined, in view of (4.79) in particular, requires a vector of random variables to be 'adjusted' according to their original distribution, conditional on one coordinate taking on a newly chosen, biased, value. Random vectors which have the 'scaling-conditional' property in Definition 4.2 can easily be so adjusted. Let $\mathcal{L}(V)$ and $\mathcal{L}(V \mid X = x)$ denote the distribution of $V$, and the conditional distribution of $V$ given $X = x$, respectively.

Definition 4.2 Let $\mathbf{X} = (X_1, \ldots, X_n)$ be an exchangeable random vector and $D \subset \mathbb{R}$ the support of the distribution of $X_1$. If there exists a function $g : D \to \mathbb{R}$ such that $P(g(X_1) = 0) = 0$ and
\[
\mathcal{L}(X_2, \ldots, X_n \mid X_1 = a) = \mathcal{L}\left( \frac{g(a)}{g(X_1)} (X_2, \ldots, X_n) \right) \quad \text{for all } a \in D, \tag{4.80}
\]
we say that $\mathbf{X}$ is scaling $g$-conditional, or simply scaling-conditional.

Proposition 4.4 is an application of Theorem 4.1 and Proposition 4.3 to projections of coordinate symmetric, scaling-conditional vectors.

Proposition 4.4 Let $\mathbf{X} \in \mathbb{R}^n$ be an exchangeable, coordinate symmetric and scaling $g$-conditional random vector with finite second moment. For $\boldsymbol{\theta} \in \mathbb{R}^n$ of unit length set
\[
Y = \sum_{i=1}^n \theta_i X_i, \quad \sigma^2 = \mathrm{Var}(Y), \quad \text{and} \quad F(x) = P(Y/\sigma \le x).
\]
Then any construction of $(\mathbf{X}, X_i^i)$ on a joint space for each $i = 1, \ldots, n$ with $X_i^i$ having the $X_i$-square biased distribution provides the upper bound
\[
\|F - \Phi\|_1 \le \frac{2}{\sigma}\, E\left| \theta_I\bigl( U_I X_I^I - X_I \bigr) + \left( \frac{g(X_I^I)}{g(X_I)} - 1 \right) \sum_{j \ne I} \theta_j X_j \right|, \tag{4.81}
\]
where $P(I = i) = \theta_i^2$ and $U_i \sim \mathcal{U}[-1, 1]$ with $\{X_i^i, X_j, j \ne i\}$, $I$ and $U_i$ mutually independent for $i = 1, 2, \ldots, n$.
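Proof For all $i = 1, \ldots, n$, since $\mathbf{X}$ is scaling $g$-conditional, given $\mathbf{X}$ and $X_i^i$ with the $X_i$-square bias distribution, by (4.79) and (4.80) the vector
\[
\mathbf{X}^i = \left( \frac{g(X_i^i)}{g(X_i)} X_1, \ldots, \frac{g(X_i^i)}{g(X_i)} X_{i-1},\ X_i^i,\ \frac{g(X_i^i)}{g(X_i)} X_{i+1}, \ldots, \frac{g(X_i^i)}{g(X_i)} X_n \right)
\]
has the $\mathbf{X}$-square bias distribution in direction $i$ as given in (4.75), that is, for every $h$ for which the expectation on the left-hand side below exists,
\[
EX_i^2 h(\mathbf{X}) = EX_i^2\, Eh\bigl(\mathbf{X}^i\bigr). \tag{4.82}
\]
We now apply Proposition 4.3 to $\mathbf{Y} = (\theta_1 X_1, \ldots, \theta_n X_n)$. First, the coordinate symmetry of $\mathbf{Y}$ follows from that of $\mathbf{X}$. Next, we claim
\[
\mathbf{Y}^i = \bigl( \theta_1 X_1^i, \ldots, \theta_n X_n^i \bigr)
\]
has the $\mathbf{Y}$-square bias distribution in direction $i$. Given $f$, let $h(\mathbf{X}) = f(\theta_1 X_1, \ldots, \theta_n X_n)$. Applying (4.82) we obtain
\[
EY_i^2 f(\mathbf{Y}) = E\theta_i^2 X_i^2 f(\mathbf{Y}) = \theta_i^2 EX_i^2 h(\mathbf{X}) = \theta_i^2 EX_i^2\, Eh\bigl(\mathbf{X}^i\bigr) = E\theta_i^2 X_i^2\, Ef\bigl(\mathbf{Y}^i\bigr) = EY_i^2\, Ef\bigl(\mathbf{Y}^i\bigr).
\]
Since $\mathbf{X}$ is exchangeable, the variance of $Y_i$ is proportional to $\theta_i^2$ and the distribution of $I$ in (4.77) specializes to the one claimed. Lastly, as $\mathbf{Y}^i$, $I$ and $U_i$ are mutually independent for $i = 1, \ldots, n$, Proposition 4.3 yields that
\[
Y^* = U_I Y_I^I + \sum_{j \ne I} Y_j^I
\]
has the $Y$-zero bias distribution. The difference $Y^* - Y$ is given by
\[
\begin{aligned}
Y^* - Y &= U_I Y_I^I + \sum_{j \ne I} Y_j^I - \sum_{i=1}^n Y_i \\
&= U_I \theta_I X_I^I + \sum_{j \ne I} \theta_j X_j^I - \sum_{j=1}^n \theta_j X_j \\
&= \theta_I\bigl( U_I X_I^I - X_I \bigr) + \sum_{j \ne I} \theta_j \bigl( X_j^I - X_j \bigr) \\
&= \theta_I\bigl( U_I X_I^I - X_I \bigr) + \sum_{j \ne I} \theta_j \left( \frac{g(X_I^I)}{g(X_I)} - 1 \right) X_j \\
&= \theta_I\bigl( U_I X_I^I - X_I \bigr) + \left( \frac{g(X_I^I)}{g(X_I)} - 1 \right) \sum_{j \ne I} \theta_j X_j.
\end{aligned}
\]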
The proof is completed by dividing both sides by σ , applying (2.59) to yield Y ∗ /σ = (Y/σ )∗ , and invoking Theorem 4.1.
4.3.2 Construction and Bounds for Cone Measure

Proposition 4.5 below shows that Proposition 4.4 can be applied to cone measure. We denote the Gamma and Beta distributions with parameters $\alpha, \beta$ as $\Gamma(\alpha, \beta)$ and $B(\alpha, \beta)$, respectively. That is, with the Gamma function at $\alpha > 0$ given by
\[
\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx,
\]
with $\beta > 0$, the density of the $\Gamma(\alpha, \beta)$ distribution is
\[
\frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}\, \mathbf{1}_{\{x > 0\}};
\]
the density of the Beta distribution $B(\alpha, \beta)$ is given in (4.90).

Proposition 4.5 Let $C_p^n$ denote cone measure as given in (4.71) for some $n \in \mathbb{N}$ and $p > 0$.

1. Cone measure $C_p^n$ is exchangeable and coordinate-symmetric. For $\{G_j, \epsilon_j, j = 1, \ldots, n\}$ independent variables with $G_j \sim \Gamma(1/p, 1)$ and $\epsilon_j$ taking values $-1$ and $+1$ with equal probability, setting $G_{a,b} = \sum_{i=a}^b G_i$ we have
\[
\mathbf{X} = \left( \epsilon_1 \left( \frac{G_1}{G_{1,n}} \right)^{1/p}, \ldots, \epsilon_n \left( \frac{G_n}{G_{1,n}} \right)^{1/p} \right) \sim C_p^n. \tag{4.83}
\]

2. The common marginal distribution $X_i$ of cone measure is characterized by
\[
X_i =_d -X_i \quad \text{and} \quad |X_i|^p \sim B\bigl( 1/p, (n-1)/p \bigr),
\]
and the variance $\sigma_{n,p}^2 = \mathrm{Var}(X_i)$ is given by
\[
\sigma_{n,p}^2 = \frac{\Gamma(3/p)\Gamma(n/p)}{\Gamma(1/p)\Gamma((n+2)/p)} \tag{4.84}
\]
and satisfies
\[
\lim_{n\to\infty} n^{2/p} \sigma_{n,p}^2 = p^{2/p}\, \frac{\Gamma(3/p)}{\Gamma(1/p)}.
\]

3. The square bias distribution $X_i^i$ of $X_i$ is characterized by
\[
X_i^i =_d -X_i^i \quad \text{and} \quad \bigl| X_i^i \bigr|^p \sim B\bigl( 3/p, (n-1)/p \bigr). \tag{4.85}
\]
Letting $\{G_j, G_j', \epsilon_j, j = 1, \ldots, n\}$ be independent variables with $G_j \sim \Gamma(1/p, 1)$, $G_j' \sim \Gamma(2/p, 1)$ and $\epsilon_j$ taking values $-1$ and $+1$ with equal probability, for each $i = 1, \ldots, n$, a construction of $(\mathbf{X}, X_i^i)$ on a joint space is given by the representation of $\mathbf{X}$ in (4.83) along with
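\[
X_i^i = \epsilon_i \left( \frac{G_i + G_i'}{G_{1,n} + G_i'} \right)^{1/p}. \tag{4.86}
\]
The mean $m_{n,p} = E|X_i^i| = E|X_i^3|/\sigma_{n,p}^2$ for all $i = 1, \ldots, n$ is given by
\[
m_{n,p} = \frac{\Gamma(4/p)\Gamma((n+2)/p)}{\Gamma(3/p)\Gamma((n+3)/p)} \tag{4.87}
\]
and satisfies
\[
\lim_{n\to\infty} n^{1/p} m_{n,p} = p^{1/p}\, \frac{\Gamma(4/p)}{\Gamma(3/p)} \quad \text{and} \quad m_{n,p} \le \left( \frac{3}{n+2} \right)^{1/(p \vee 1)}. \tag{4.88}
\]

4. Cone measure $C_p^n$ is scaling $(1 - |x|^p)^{1/p}$ conditional.

The proof of Proposition 4.5 is deferred to the end of this section. Before proceeding to Theorem 4.7, we remind the reader of the following known facts about the Gamma and Beta distributions; see Bickel and Doksum (1977), Theorem 1.2.3 for the case $n = 2$ of the first claim, the extension to general $n$ and the following claim being straightforward. For $\gamma_i \sim \Gamma(\alpha_i, \beta)$, $i = 1, \ldots, n$, independent with $\alpha_i > 0$ and $\beta > 0$,
\[
\gamma_1 + \gamma_2 \sim \Gamma(\alpha_1 + \alpha_2, \beta), \quad \frac{\gamma_1}{\gamma_1 + \gamma_2} \sim B(\alpha_1, \alpha_2), \quad \text{and} \quad \left( \frac{\gamma_1}{\sum_{i=1}^n \gamma_i}, \ldots, \frac{\gamma_n}{\sum_{i=1}^n \gamma_i} \right) \text{ and } \sum_{i=1}^n \gamma_i \text{ are independent;} \tag{4.89}
\]
and the Beta distribution $B(\alpha, \beta)$ has density
\[
p_{\alpha,\beta}(u) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha-1} (1 - u)^{\beta-1}\, \mathbf{1}_{u \in [0,1]} \quad \text{and } \kappa > 0 \text{ moment} \quad \frac{\Gamma(\alpha + \kappa)\Gamma(\alpha + \beta)}{\Gamma(\alpha + \beta + \kappa)\Gamma(\alpha)}. \tag{4.90}
\]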
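The Gamma representation of cone measure lends itself to a simple sampler. The sketch below draws $\mathbf{X}$ as independent random signs times $(G_i/G_{1,n})^{1/p}$ with $G_i \sim \Gamma(1/p, 1)$, and checks two consequences: every draw lies on the sphere $\sum_i |x_i|^p = 1$, and for $p = 2$ the second moment of a coordinate is close to $1/n$. The function name, seed, sample size and tolerance are illustrative assumptions.

```python
import random

def sample_cone_measure(n, p, rng):
    """One draw of X: independent signs times (G_i / G_{1,n})^(1/p), G_i ~ Gamma(1/p, 1)."""
    g = [rng.gammavariate(1.0 / p, 1.0) for _ in range(n)]
    total = sum(g)
    return [rng.choice((-1.0, 1.0)) * (gi / total) ** (1.0 / p) for gi in g]

rng = random.Random(1)

# Every draw lies on the l_p sphere by construction.
x = sample_cone_measure(5, 1.5, rng)
norm_p = sum(abs(v) ** 1.5 for v in x)

# For p = 2 cone measure is uniform on the Euclidean sphere, so E X_1^2 = 1/n.
draws = [sample_cone_measure(4, 2.0, rng)[0] for _ in range(20000)]
second_moment = sum(v * v for v in draws) / len(draws)

print(norm_p, second_moment)
```

The same sampler, combined with an extra independent $\Gamma(2/p, 1)$ variable, would give the joint construction of a coordinate and its square-biased version described in item 3.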
Proof of Theorem 4.7 Using Proposition 4.5, we apply Proposition 4.4 for X with g(x) = (1 − |x|p )1/p and the joint construction of (X, Xii ) given in item 3. Note that Proposition 4.2 applies, using the notation there, with V ∼ U[0, 1], independent of all other variables, Ui = i V , and Gi + G i 1/p Gi 1/p Xi = and Y i = . G1,n G1,n + G i Applying the triangle inequality on (4.81) yields the bound on the L1 norm F − 1 of g(XII )
2 I θj Xj . −1 E θI UI XI − XI + E (4.91) σn,p g(XI ) j =I
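We begin by averaging the first term over $I$. Note that
\[
|X_1| = \left( \frac{G_1}{G_{1,n}} \right)^{1/p} \le \left( \frac{G_1 + G_1'}{G_{1,n} + G_1'} \right)^{1/p} = \bigl| X_1^1 \bigr|,
\]
and therefore, recalling $P(I = i) = \theta_i^2$, we may invoke Proposition 4.2 to conclude
\[
\begin{aligned}
E\bigl| \theta_I\bigl( U_I X_I^I - X_I \bigr) \bigr|
&= \sum_{i=1}^n |\theta_i|^3\, E\bigl| U_i X_i^i - X_i \bigr| \\
&= E\bigl| U_1 X_1^1 - X_1 \bigr| \sum_{i=1}^n |\theta_i|^3 \\
&\le \frac{E|X_1|^3}{2\sigma_{n,p}^2} \sum_{i=1}^n |\theta_i|^3 = \frac{m_{n,p}}{2} \sum_{i=1}^n |\theta_i|^3.
\end{aligned} \tag{4.92}
\]
Now, averaging the second term in (4.91) over the distribution of $I$ yields
\[
E\left| \left( \frac{g(X_I^I)}{g(X_I)} - 1 \right) \sum_{j \ne I} \theta_j X_j \right| = \sum_{i=1}^n E\left| \left( \frac{g(X_i^i)}{g(X_i)} - 1 \right) \sum_{j \ne i} \theta_j X_j \right| \theta_i^2. \tag{4.93}
\]
Using (4.83), (4.86) and $g(x) = (1 - |x|^p)^{1/p}$, we have
\[
\frac{g(X_i^i)}{g(X_i)} - 1 = \left( \frac{G_{1,n}}{G_{1,n} + G_i'} \right)^{1/p} - 1. \tag{4.94}
\]
Applying (4.89) we have that $\{G_{1,n}, G_i'\}$ are independent of $X_1, \ldots, X_n$; hence, the term (4.94) is independent of the sum it multiplies in (4.93) and therefore (4.93) equals
\[
\sum_{i=1}^n E\left| \frac{g(X_i^i)}{g(X_i)} - 1 \right| E\Bigl| \sum_{j \ne i} \theta_j X_j \Bigr|\, \theta_i^2. \tag{4.95}
\]
To bound the first expectation in (4.95), since $G_{1,n}/(G_{1,n} + G_i') \sim B(n/p, 2/p)$, we have
\[
E\left| \frac{g(X_i^i)}{g(X_i)} - 1 \right| = E\left( 1 - \left( \frac{G_{1,n}}{G_{1,n} + G_i'} \right)^{1/p} \right) \le \left( \frac{1}{p} \vee 1 \right) \frac{2}{n+2} \tag{4.96}
\]
since for $p \ge 1$, using (4.90) with $\kappa = 1$,
\[
E\left( 1 - \left( \frac{G_{1,n}}{G_{1,n} + G_i'} \right)^{1/p} \right) \le E\left( 1 - \frac{G_{1,n}}{G_{1,n} + G_i'} \right) = 1 - \frac{n/p}{(n+2)/p} = \frac{2}{n+2},
\]
while for $0 < p < 1$, using Jensen's inequality and the fact that $(1 - x)^{1/p} \ge 1 - x/p$ for $x \le 1$, we have
\[
E\left( 1 - \left( \frac{G_{1,n}}{G_{1,n} + G_i'} \right)^{1/p} \right) \le 1 - \left( E\, \frac{G_{1,n}}{G_{1,n} + G_i'} \right)^{1/p} = 1 - \left( \frac{n}{n+2} \right)^{1/p} \le \frac{2}{p(n+2)}.
\]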
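We may bound the second expectation in (4.95) by $\sigma_{n,p}$ since
\[
\Bigl( E\Bigl| \sum_{j \ne i} \theta_j X_j \Bigr| \Bigr)^2 \le E\Bigl( \sum_{j \ne i} \theta_j X_j \Bigr)^2 = \mathrm{Var}\Bigl( \sum_{j \ne i} \theta_j X_j \Bigr) = \sigma_{n,p}^2 \sum_{j \ne i} \theta_j^2 \le \sigma_{n,p}^2.
\]
Neither this bound nor the bound (4.96) depends on $i$, so substituting them into (4.95) and summing over $i$, again using $\sum_i \theta_i^2 = 1$, yields
\[
\sum_{i=1}^n E\left| \frac{g(X_i^i)}{g(X_i)} - 1 \right| E\Bigl| \sum_{j \ne i} \theta_j X_j \Bigr|\, \theta_i^2 \le \sigma_{n,p} \left( \frac{1}{p} \vee 1 \right) \frac{2}{n+2}. \tag{4.97}
\]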
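The bound (4.96) on the scaling term can be checked without simulation, since the expectation is a Beta moment. The sketch below uses the moment formula from (4.90) with $\kappa = 1/p$ to evaluate $E(1 - B^{1/p})$ for $B \sim B(n/p, 2/p)$ and compares it with $(1/p \vee 1) \cdot 2/(n+2)$. Function names and the grid of $(n, p)$ values are hypothetical.

```python
import math

def beta_moment(a, b, kappa):
    """E[B^kappa] for B ~ Beta(a, b): Gamma(a+kappa)Gamma(a+b) / (Gamma(a+b+kappa)Gamma(a))."""
    return math.exp(math.lgamma(a + kappa) + math.lgamma(a + b)
                    - math.lgamma(a + b + kappa) - math.lgamma(a))

def scaling_gap(n, p):
    """E(1 - B^(1/p)) with B ~ Beta(n/p, 2/p), the left side of (4.96)."""
    return 1.0 - beta_moment(n / p, 2 / p, 1 / p)

def scaling_gap_bound(n, p):
    """Right side of (4.96)."""
    return max(1 / p, 1) * 2 / (n + 2)

for p in (0.5, 1, 2, 3):
    print(p, scaling_gap(10, p), scaling_gap_bound(10, p))
```

For $p = 1$ the two quantities coincide, which shows that the factor $2/(n+2)$ cannot be improved in that case.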
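Adding (4.92) and (4.97) and multiplying by $2/\sigma_{n,p}$ in accordance with (4.81) yields (4.72).

Proof of Proposition 4.5 1. For $A \subset S(\ell_p^n)$, $\mathbf{e} = (e_1, \ldots, e_n) \in \{-1, 1\}^n$ and a permutation $\pi \in S_n$, let
\[
A_{\mathbf{e}} = \bigl\{ \mathbf{x} : (e_1 x_1, \ldots, e_n x_n) \in A \bigr\} \quad \text{and} \quad A_\pi = \bigl\{ \mathbf{x} : (x_{\pi(1)}, \ldots, x_{\pi(n)}) \in A \bigr\}.
\]
By the properties of Lebesgue measure, $\mu^n([0,1]A_{\mathbf{e}}) = \mu^n([0,1]A_\pi) = \mu^n([0,1]A)$, so by (4.71), cone measure is coordinate symmetric and exchangeable. The coordinate symmetry of $\mathbf{X}$ implies that
\[
P(\mathbf{X} \in A) = P(\mathbf{X} \in A_{\mathbf{e}}) \quad \text{for all } \mathbf{e} \in \{-1, 1\}^n,
\]
so with $\epsilon_i$, $i = 1, \ldots, n$, i.i.d. variables taking the values $1$ and $-1$ with probability $1/2$ and independent of $\mathbf{X}$,
\[
P\bigl( (\epsilon_1 X_1, \ldots, \epsilon_n X_n) \in A \bigr) = P(\mathbf{X} \in A_{\boldsymbol{\epsilon}}) = \frac{1}{2^n} \sum_{\mathbf{e} \in \{-1,1\}^n} P(\mathbf{X} \in A_{\mathbf{e}}) = P(\mathbf{X} \in A),
\]
and hence $(\epsilon_1 X_1, \ldots, \epsilon_n X_n) =_d (X_1, \ldots, X_n)$. Note that for any $(s_1, \ldots, s_n) \in \{-1, 1\}^n$,
\[
(\epsilon_1 s_1, \ldots, \epsilon_n s_n) =_d (\epsilon_1, \ldots, \epsilon_n),
\]
and is independent of $\mathbf{X}$.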
Hence, since $P(X_i = 0) = 0$, with $s_i = X_i/|X_i|$, the sign of $X_i$, we have
\[
P\bigl( (\epsilon_1 |X_1|, \ldots, \epsilon_n |X_n|) \in A \bigr) = P\bigl( (\epsilon_1 s_1 X_1, \ldots, \epsilon_n s_n X_n) \in A \bigr) = P\bigl( (\epsilon_1 X_1, \ldots, \epsilon_n X_n) \in A \bigr) = P\bigl( (X_1, \ldots, X_n) \in A \bigr).
\]
We thus obtain (4.83) applying that $\mathbf{X} \sim C_p^n$ satisfies
\[
\bigl( |X_1|, \ldots, |X_n| \bigr) =_d \left( \left( \frac{G_1}{G_{1,n}} \right)^{1/p}, \ldots, \left( \frac{G_n}{G_{1,n}} \right)^{1/p} \right) \tag{4.98}
\]
shown, for instance, by Schechtman and Zinn (1990).

2. Applying the coordinate symmetry of $\mathbf{X}$ coordinatewise gives $X_i =_d -X_i$ and (4.98) yields $|X_i|^p =_d G_i/G_{1,n}$, which has the claimed Beta distribution, by (4.89). As $EX_i = 0$, we have
\[
\mathrm{Var}(X_i) = EX_i^2 = E\bigl( |X_i|^p \bigr)^{2/p} \tag{4.99}
\]
and the variance claim in (4.84) follows from (4.90) for $\alpha = 1/p$, $\beta = (n-1)/p$ and $\kappa = 2/p$. From Stirling's formula, for all $x > 0$,
\[
\lim_{m\to\infty} \frac{m^x \Gamma(m)}{\Gamma(m + x)} = 1,
\]
so letting $m = n/p$ and $x = k/p$,
\[
\lim_{n\to\infty} \frac{n^{k/p}\, \Gamma(n/p)}{\Gamma((n+k)/p)} = p^{k/p}. \tag{4.100}
\]
The limit in (4.84) now follows.

3. If $X$ is symmetric with variance $\sigma^2 > 0$ and $X^\square$ has the $X$-square bias distribution, then for all bounded continuous functions $f$
\[
\sigma^2 Ef\bigl(X^\square\bigr) = EX^2 f(X) = E(-X)^2 f(-X) = EX^2 f(-X) = \sigma^2 Ef\bigl(-X^\square\bigr),
\]
showing $X^\square$ is symmetric. From (4.90) and a change of variable, a random variable $X$ satisfies $|X|^p \sim B(\alpha/p, \beta/p)$ if and only if the density $p_{|X|}(u)$ of $|X|$ is
\[
p_{|X|}(u) = \frac{p\,\Gamma((\alpha + \beta)/p)}{\Gamma(\alpha/p)\Gamma(\beta/p)}\, u^{\alpha-1} \bigl( 1 - u^p \bigr)^{\beta/p - 1}\, \mathbf{1}_{u \in [0,1]}. \tag{4.101}
\]
Hence, since $|X_i|^p \sim B(1/p, (n-1)/p)$ by item 2, the density $p_{|X_i|}(u)$ of $|X_i|$ is
\[
p_{|X_i|}(u) = \frac{p\,\Gamma(n/p)}{\Gamma(1/p)\Gamma((n-1)/p)} \bigl( 1 - u^p \bigr)^{(n-1)/p - 1}\, \mathbf{1}_{u \in [0,1]}.
\]
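Multiplying by $u^2$ and renormalizing produces the $|X_i^i|$ density
\[
p_{|X_i^i|}(u) = \frac{u^2 p_{|X_i|}(u)}{EX_i^2} = \frac{p\,\Gamma((n+2)/p)}{\Gamma(3/p)\Gamma((n-1)/p)}\, u^2 \bigl( 1 - u^p \bigr)^{(n-1)/p - 1}\, \mathbf{1}_{u \in [0,1]}, \tag{4.102}
\]
and comparing (4.102) to (4.101) shows the second claim in (4.85). The representation (4.86) now follows from (4.89) and the symmetry of $X_i^i$. The moment formula (4.87) for $m_{n,p}$ follows from (4.90) for $\alpha = 3/p$, $\beta = (n-1)/p$ and $\kappa = 1/p$, and the limit in (4.88) follows from (4.100). Regarding the last claim in (4.88), for $p \ge 1$ Hölder's inequality gives
\[
m_{n,p} = E\bigl| X_i^i \bigr| \le \bigl( E\bigl| X_i^i \bigr|^p \bigr)^{1/p} = \left( \frac{3}{n+2} \right)^{1/p},
\]
while for $0 < p < 1$, we have
\[
m_{n,p} = E\bigl| X_i^i \bigr| = E\left( \frac{G_i + G_i'}{G_{1,n} + G_i'} \right)^{1/p} \le E\, \frac{G_i + G_i'}{G_{1,n} + G_i'} = \frac{3}{n+2}.
\]

4. We consider the conditional distribution on the left-hand side of (4.80), and use the representation, and notation $G_{a,b}$, given in (4.83). The second equality below follows from the coordinate-symmetry of $\mathbf{X}$, and the fourth follows since we may replace $G_{1,n}$ by $G_{2,n}/(1 - |a|^p)$ on the conditioning event. Using the notation $a\mathcal{L}(V)$ for the distribution of $aV$, we have
\[
\begin{aligned}
\mathcal{L}(X_2, \ldots, X_n \mid X_1 = a)
&= \mathcal{L}\left( \epsilon_2 \Bigl( \tfrac{G_2}{G_{1,n}} \Bigr)^{1/p}, \ldots, \epsilon_n \Bigl( \tfrac{G_n}{G_{1,n}} \Bigr)^{1/p} \,\Big|\, \epsilon_1 \Bigl( \tfrac{G_1}{G_{1,n}} \Bigr)^{1/p} = a \right) \\
&= \mathcal{L}\left( \epsilon_2 \Bigl( \tfrac{G_2}{G_{1,n}} \Bigr)^{1/p}, \ldots, \epsilon_n \Bigl( \tfrac{G_n}{G_{1,n}} \Bigr)^{1/p} \,\Big|\, \Bigl( \tfrac{G_1}{G_{1,n}} \Bigr)^{1/p} = |a| \right) \\
&= \mathcal{L}\left( \epsilon_2 \Bigl( \tfrac{G_2}{G_{1,n}} \Bigr)^{1/p}, \ldots, \epsilon_n \Bigl( \tfrac{G_n}{G_{1,n}} \Bigr)^{1/p} \,\Big|\, \tfrac{G_{2,n}}{G_{1,n}} = 1 - |a|^p \right) \\
&= \bigl( 1 - |a|^p \bigr)^{1/p} \mathcal{L}\left( \epsilon_2 \Bigl( \tfrac{G_2}{G_{2,n}} \Bigr)^{1/p}, \ldots, \epsilon_n \Bigl( \tfrac{G_n}{G_{2,n}} \Bigr)^{1/p} \,\Big|\, \tfrac{G_{2,n}}{G_{1,n}} = 1 - |a|^p \right) \\
&= \bigl( 1 - |a|^p \bigr)^{1/p} \mathcal{L}\left( \epsilon_2 \Bigl( \tfrac{G_2}{G_{2,n}} \Bigr)^{1/p}, \ldots, \epsilon_n \Bigl( \tfrac{G_n}{G_{2,n}} \Bigr)^{1/p} \,\Big|\, \tfrac{G_1}{G_{1,n}} = |a|^p \right) \\
&= \bigl( 1 - |a|^p \bigr)^{1/p} \mathcal{L}\left( \epsilon_2 \Bigl( \tfrac{G_2}{G_{2,n}} \Bigr)^{1/p}, \ldots, \epsilon_n \Bigl( \tfrac{G_n}{G_{2,n}} \Bigr)^{1/p} \right) \\
&= g(a)\, \mathcal{L}\left( \epsilon_2 \Bigl( \tfrac{G_2}{G_{2,n}} \Bigr)^{1/p}, \ldots, \epsilon_n \Bigl( \tfrac{G_n}{G_{2,n}} \Bigr)^{1/p} \right).
\end{aligned} \tag{4.103}
\]
In the penultimate step we may remove the conditioning on $G_1/G_{1,n}$ since (4.89) and the independence of $G_1$ from all other variables gives that
\[
\left( \frac{G_2}{G_{2,n}}, \ldots, \frac{G_n}{G_{2,n}} \right) \text{ is independent of } (G_1, G_{2,n})
\]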
and therefore independent of $G_1/(G_1 + G_{2,n}) = G_1/G_{1,n}$. Regarding the right-hand side of (4.80), using $1 - |X_1|^p = \sum_{i=2}^n |X_i|^p$ and the representation (4.83), we obtain
\[
\begin{aligned}
g(a)\, \frac{(X_2, \ldots, X_n)}{g(X_1)}
&= g(a)\, \frac{(X_2, \ldots, X_n)}{(|X_2|^p + \cdots + |X_n|^p)^{1/p}} \\
&= g(a)\, \frac{\bigl( \epsilon_2 (G_2/G_{1,n})^{1/p}, \ldots, \epsilon_n (G_n/G_{1,n})^{1/p} \bigr)}{\bigl( (G_2/G_{1,n}) + \cdots + (G_n/G_{1,n}) \bigr)^{1/p}} \\
&= g(a)\, \frac{\bigl( \epsilon_2 G_2^{1/p}, \ldots, \epsilon_n G_n^{1/p} \bigr)}{(G_2 + \cdots + G_n)^{1/p}} \\
&= g(a) \left( \epsilon_2 \Bigl( \frac{G_2}{G_{2,n}} \Bigr)^{1/p}, \ldots, \epsilon_n \Bigl( \frac{G_n}{G_{2,n}} \Bigr)^{1/p} \right),
\end{aligned}
\]
matching the distribution (4.103).

In principle, Proposition 4.3 and Theorem 4.1 may be applied to compute bounds to the normal for projections of other coordinate-symmetric vectors when the required couplings, and conditioning, are as tractable as here.
4.4 Combinatorial Central Limit Theorems

In this section we apply Theorem 4.1 to derive $L^1$ bounds in the combinatorial central limit theorem, that is, for random variables $Y$ of the form
\[
Y=\sum_{i=1}^n a_{i,\pi(i)}, \tag{4.104}
\]
where $\pi$ is a permutation distributed uniformly over the symmetric group $S_n$, and $\{a_{ij}\}_{1\le i,j\le n}$ are the components of a matrix $A\in\mathbb{R}^{n\times n}$. Random variables of this form are of interest in permutation tests. In particular, given a function $d(x,y)$ which in some sense measures the closeness of two observations $x$ and $y$, given values $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ and a putative 'matching' permutation $\tau$ that associates $x_i$ to $y_{\tau(i)}$, one can test whether the level of matching given by $\tau$, as measured by
\[
y_\tau=\sum_{i=1}^n a_{i\tau(i)} \quad\text{where } a_{ij}=d(x_i,y_j),
\]
is unusually high by seeing how large the matching level $y_\tau$ is relative to that provided by a random matching, that is, by seeing whether $P(Y\ge y_\tau)$ is significantly small. Motivated by these considerations, Wald and Wolfowitz (1944) proved the central limit theorem as $n\to\infty$ when the factorization $a_{ij}=b_ic_j$ holds; Hoeffding (1951) later generalized this result to arrays $\{a_{ij}\}_{1\le i,j\le n}$. Motoo (1957) gave
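Lindeberg-type sufficient conditions for the normal limit to hold. In Sect. 6.1 the $L^\infty$ distance to the normal is considered for the case where $\pi$ is uniformly distributed, and also when its distribution is constant on conjugacy classes of $S_n$. Letting
\[
a_{\cdot\cdot}=\frac{1}{n^2}\sum_{i,j=1}^n a_{ij},\qquad
a_{i\cdot}=\frac{1}{n}\sum_{j=1}^n a_{ij},\qquad\text{and}\qquad
a_{\cdot j}=\frac{1}{n}\sum_{i=1}^n a_{ij},
\]
straightforward calculations show that when $\pi$ is uniform over $S_n$ the mean $\mu_A$ and variance $\sigma_A^2$ of $Y$ are given by
\[
\mu_A=na_{\cdot\cdot}
\]
and
\begin{align*}
\sigma_A^2&=\frac{1}{n-1}\sum_{i,j}\big(a_{ij}^2-a_{i\cdot}^2-a_{\cdot j}^2+a_{\cdot\cdot}^2\big)\\
&=\frac{1}{n-1}\sum_{i,j}\big(a_{ij}-a_{i\cdot}-a_{\cdot j}+a_{\cdot\cdot}\big)^2. \tag{4.105}
\end{align*}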
For simplicity, writing $\mu$ and $\sigma^2$ for $\mu_A$ and $\sigma_A^2$, respectively, we prove in (4.124) the following equivalent representation for $\sigma^2$,
\[
\sigma^2=\frac{1}{4n^2(n-1)}\sum_{i,j,k,l}\big[(a_{ik}+a_{jl})-(a_{il}+a_{jk})\big]^2, \tag{4.106}
\]
and assume in what follows that $\sigma^2>0$ to rule out trivial cases. By (4.106), $\sigma^2=0$ if and only if $a_{il}-a_{i\cdot}$ does not depend on $i$, that is, if and only if the difference between any two rows $\mathbf{a}_{i\cdot}$ and $\mathbf{a}_{j\cdot}$ of $A$ satisfies $\mathbf{a}_{i\cdot}-\mathbf{a}_{j\cdot}=(a_{i\cdot}-a_{j\cdot})(1,\ldots,1)$.

For each $n\ge 3$, Theorem 4.8 provides an $L^1$ bound between the standardized version of the variable $Y$ given in (4.104) and the normal, with an explicit constant depending on the third-moment-type quantity $\gamma=\gamma_A$, where
\[
\gamma_A=\sum_{i,j=1}^n\big|a_{ij}-a_{i\cdot}-a_{\cdot j}+a_{\cdot\cdot}\big|^3. \tag{4.107}
\]
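The moment formulas (4.105) and (4.106) can be checked numerically on a small array. The following is a minimal sketch, not part of the original text: the function names and the matrix `A` are illustrative assumptions, and the check enumerates all of $S_n$, so it is only feasible for small $n$.

```python
from fractions import Fraction
from itertools import permutations, product

def moments_by_enumeration(a):
    """Exact mean and variance of Y = sum_i a[i][pi(i)] for pi uniform on S_n."""
    n = len(a)
    values = [sum(a[i][p[i]] for i in range(n)) for p in permutations(range(n))]
    mean = Fraction(sum(values), len(values))
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def moments_by_formula(a):
    """Mean n*a.. and the variance computed from (4.105) and from (4.106)."""
    n = len(a)
    row = [Fraction(sum(a[i]), n) for i in range(n)]                        # a_{i.}
    col = [Fraction(sum(a[i][j] for i in range(n)), n) for j in range(n)]   # a_{.j}
    tot = Fraction(sum(map(sum, a)), n * n)                                 # a_{..}
    mean = n * tot
    var_105 = sum((a[i][j] - row[i] - col[j] + tot) ** 2
                  for i, j in product(range(n), repeat=2)) / (n - 1)
    var_106 = Fraction(
        sum(((a[i][k] + a[j][l]) - (a[i][l] + a[j][k])) ** 2
            for i, j, k, l in product(range(n), repeat=4)),
        4 * n * n * (n - 1))
    return mean, var_105, var_106

# Hypothetical 4 x 4 integer array used only for illustration.
A = [[1, 4, 2, 0], [3, 1, 5, 2], [0, 2, 2, 7], [6, 1, 0, 3]]
```

Exact rational arithmetic is used so that the three variance expressions can be compared with equality rather than a tolerance.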
When the elements of $A$ are all of comparable order, $\sigma^2$ is of order $n$ and $\gamma$ of order $n^2$, resulting in a bound of order $n^{-1/2}$.

Theorem 4.8 For $n\ge 3$, let $\{a_{ij}\}_{i,j=1}^n$ be the components of a matrix $A\in\mathbb{R}^{n\times n}$, let $\pi$ be a random permutation uniformly distributed over $S_n$, and let $Y$ be given by (4.104). Then, with $\mu$, $\sigma^2$ given in (4.105), and $\gamma$ given in (4.107), $F$ the distribution function of $W=(Y-\mu)/\sigma$ and $\Phi$ that of the standard normal,
\[
\|F-\Phi\|_1\le\frac{\gamma}{(n-1)\sigma^3}\Big(16+\frac{56}{(n-1)}+\frac{8}{(n-1)^2}\Big).
\]

The proof of this theorem depends on a construction of the zero bias variable using an exchangeable pair, which we now describe.
4.4.1 Use of the Exchangeable Pair

We recall that the exchangeable variables $Y', Y''$ form a $\lambda$-Stein pair if
\[
E(Y''\mid Y')=(1-\lambda)Y' \tag{4.108}
\]
for some $0<\lambda<1$. When $\operatorname{Var}(Y')=\sigma^2\in(0,\infty)$, Lemma 2.7 yields
\[
EY'=0\quad\text{and}\quad E(Y'-Y'')^2=2\lambda\sigma^2. \tag{4.109}
\]
The following proposition is in some sense a two variable version of Proposition 2.3.

Proposition 4.6 Let $Y', Y''$ be a $\lambda$-Stein pair with $\operatorname{Var}(Y')=\sigma^2\in(0,\infty)$ and distribution $F(y',y'')$. Then when $Y^\dagger, Y^\ddagger$ have distribution
\[
dF^\dagger(y',y'')=\frac{(y'-y'')^2}{2\lambda\sigma^2}\,dF(y',y''), \tag{4.110}
\]
and $U\sim\mathcal{U}[0,1]$ is independent of $Y^\dagger, Y^\ddagger$, the variable
\[
Y^*=UY^\dagger+(1-U)Y^\ddagger \tag{4.111}
\]
has the $Y'$-zero biased distribution.
Proof For all absolutely continuous functions $f$ for which the expectations below exist,
\begin{align*}
\sigma^2Ef'(Y^*)&=\sigma^2Ef'\big(UY^\dagger+(1-U)Y^\ddagger\big)\\
&=\sigma^2E\Big(\frac{f(Y^\dagger)-f(Y^\ddagger)}{Y^\dagger-Y^\ddagger}\Big)\\
&=\frac{1}{2\lambda}E\Big(\frac{f(Y')-f(Y'')}{Y'-Y''}(Y'-Y'')^2\Big)\\
&=\frac{1}{2\lambda}E\big(\big(f(Y')-f(Y'')\big)(Y'-Y'')\big)\\
&=\frac{1}{\lambda}E\big(Y'f(Y')-Y''f(Y')\big)\\
&=\frac{1}{\lambda}E\big(Y'f(Y')-(1-\lambda)Y'f(Y')\big)\\
&=EY'f(Y').
\end{align*}

The following lemma, leading toward the construction of zero bias variables, is motivated by generalizing the framework of Example 2.3, where the Stein pair is a function of some underlying random variables $\xi_\alpha$, $\alpha\in\chi$ and a random index $\mathbf{I}$.

Lemma 4.4 Let $F(y',y'')$ be the distribution of a Stein pair and suppose there exists a distribution
\[
F(\mathbf{i},\xi_\alpha,\alpha\in\chi) \tag{4.112}
\]
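and an $\mathbb{R}^2$ valued function $(y',y'')=\psi(\mathbf{i},\xi_\alpha,\alpha\in\chi)$ such that when $\mathbf{I}$ and $\{\Xi_\alpha,\alpha\in\chi\}$ have distribution (4.112) then $(Y',Y'')=\psi(\mathbf{I},\Xi_\alpha,\alpha\in\chi)$ has distribution $F(y',y'')$. If $\mathbf{I}^\dagger$, $\{\Xi^\dagger_\alpha,\alpha\in\chi\}$ have distribution
\[
dF^\dagger(\mathbf{i},\xi_\alpha,\alpha\in\chi)=\frac{(y'-y'')^2}{E(Y'-Y'')^2}\,dF(\mathbf{i},\xi_\alpha,\alpha\in\chi) \tag{4.113}
\]
then the pair
\[
(Y^\dagger,Y^\ddagger)=\psi\big(\mathbf{I}^\dagger,\Xi^\dagger_\alpha,\alpha\in\chi\big)
\]
has distribution $F^\dagger(y^\dagger,y^\ddagger)$ satisfying
\[
dF^\dagger(y',y'')=\frac{(y'-y'')^2}{2\lambda\sigma^2}\,dF(y',y'').
\]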
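Proof For any bounded measurable function $f$,
\begin{align*}
Ef(Y^\dagger,Y^\ddagger)&=Ef\big(\psi\big(\mathbf{I}^\dagger,\Xi^\dagger_\alpha,\alpha\in\chi\big)\big)\\
&=\int f\big(\psi(\mathbf{i},\xi_\alpha,\alpha\in\chi)\big)\,dF^\dagger(\mathbf{i},\xi_\alpha,\alpha\in\chi)\\
&=\int f(y',y'')\frac{(y'-y'')^2}{2\lambda\sigma^2}\,dF(\mathbf{i},\xi_\alpha,\alpha\in\chi)\\
&=E\Big(\frac{(Y'-Y'')^2}{2\lambda\sigma^2}f(Y',Y'')\Big),
\end{align*}
where $(Y',Y'')$ has distribution $F(y',y'')$.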
We continue building a general framework around Example 2.3, where the random index is chosen independently of the permutation, so their joint distribution factors, leading to
\[
dF(\mathbf{i},\xi_\alpha,\alpha\in\chi)=P(\mathbf{I}=\mathbf{i})\,dF(\xi_\alpha,\alpha\in\chi). \tag{4.114}
\]
Moreover, in view of (2.47), that is, that
\[
Y'-Y''=b\big(i,j,\pi(i),\pi(j)\big)\quad\text{where } b(i,j,k,l)=(a_{ik}+a_{jl})-(a_{il}+a_{jk}),
\]
we will pay special attention to situations where
\[
Y'-Y''=b(\mathbf{I},\Xi_\alpha,\alpha\in\chi_{\mathbf{I}}) \tag{4.115}
\]
where $\mathbf{I}$ and $\chi_{\mathbf{I}}$ are vectors of small dimensions with components in $\mathcal{I}$ and $\chi$, respectively. In other words, we consider situations where the difference between $Y'$ and $Y''$ depends on only a few variables. In such cases, it will be convenient to further decompose $dF(\mathbf{i},\xi_\alpha,\alpha\in\chi)$ as
\[
dF(\mathbf{i},\xi_\alpha,\alpha\in\chi)=P(\mathbf{I}=\mathbf{i})\,dF_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})\,dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha,\alpha\notin\chi_{\mathbf{i}}\mid\xi_\alpha,\alpha\in\chi_{\mathbf{i}}), \tag{4.116}
\]
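where $dF_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})$ is the marginal distribution of $\xi_\alpha$ for $\alpha\in\chi_{\mathbf{i}}$, and $dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha,\alpha\notin\chi_{\mathbf{i}}\mid\xi_\alpha,\alpha\in\chi_{\mathbf{i}})$ the conditional distribution of $\xi_\alpha$ for $\alpha\notin\chi_{\mathbf{i}}$ given $\xi_\alpha$ for $\alpha\in\chi_{\mathbf{i}}$. One notes, however, that the factorization (4.114) guarantees that the marginal distribution of any $\xi_\alpha$ does not depend on $\mathbf{i}$. In terms of generating variables having the specified distributions for the purposes of coupling, the decomposition (4.116) corresponds to first generating $\mathbf{I}$, then $\{\xi_\alpha,\alpha\in\chi_{\mathbf{I}}\}$, and lastly $\{\xi_\alpha,\alpha\notin\chi_{\mathbf{I}}\}$ conditional on $\{\xi_\alpha,\alpha\in\chi_{\mathbf{I}}\}$. In what follows we will continue the slight abuse of notation of letting $\{\alpha:\alpha\in\chi_{\mathbf{i}}\}$ denote the set of components of the vector $\chi_{\mathbf{i}}$.

We now consider the square bias distribution $F^\dagger$ in (4.113) when the factorization (4.116) of $F$ holds. Letting $\mathbf{I}$ and $\{\Xi_\alpha:\alpha\in\chi\}$ have distribution (4.114), by (4.109), (4.115) and independence we obtain
\[
2\lambda\sigma^2=E(Y'-Y'')^2=Eb^2(\mathbf{I},\Xi_\alpha,\alpha\in\chi_{\mathbf{I}})=\sum_{\mathbf{i}\subset\mathcal{I}}P(\mathbf{I}=\mathbf{i})\,Eb^2(\mathbf{i},\Xi_\alpha,\alpha\in\chi_{\mathbf{i}}).
\]
In particular, we may define a distribution for a vector of indices $\mathbf{I}^\dagger$ with components in $\mathcal{I}$ by
\[
P(\mathbf{I}^\dagger=\mathbf{i})=\frac{r_{\mathbf{i}}}{2\lambda\sigma^2}\quad\text{with } r_{\mathbf{i}}=P(\mathbf{I}=\mathbf{i})\,Eb^2(\mathbf{i},\Xi_\alpha,\alpha\in\chi_{\mathbf{i}}). \tag{4.117}
\]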
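Hence, substituting (4.115) and (4.116) into (4.113),
\begin{align*}
dF^\dagger(\mathbf{i},\xi_\alpha,\alpha\in\chi)
&=\frac{P(\mathbf{I}=\mathbf{i})\,b^2(\mathbf{i},\xi_\alpha,\alpha\in\chi_{\mathbf{i}})}{2\lambda\sigma^2}\,dF_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})\,dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha,\alpha\notin\chi_{\mathbf{i}}\mid\xi_\alpha,\alpha\in\chi_{\mathbf{i}})\\
&=\frac{r_{\mathbf{i}}}{2\lambda\sigma^2}\,\frac{b^2(\mathbf{i},\xi_\alpha,\alpha\in\chi_{\mathbf{i}})}{Eb^2(\mathbf{i},\Xi_\alpha,\alpha\in\chi_{\mathbf{i}})}\,dF_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})\,dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha,\alpha\notin\chi_{\mathbf{i}}\mid\xi_\alpha,\alpha\in\chi_{\mathbf{i}})\\
&=P(\mathbf{I}^\dagger=\mathbf{i})\,dF^\dagger_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})\,dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha,\alpha\notin\chi_{\mathbf{i}}\mid\xi_\alpha,\alpha\in\chi_{\mathbf{i}}), \tag{4.118}
\end{align*}
where
\[
dF^\dagger_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})=\frac{b^2(\mathbf{i},\xi_\alpha,\alpha\in\chi_{\mathbf{i}})}{Eb^2(\mathbf{i},\Xi_\alpha,\alpha\in\chi_{\mathbf{i}})}\,dF_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}}). \tag{4.119}
\]
Definition (4.119) represents $dF^\dagger(\mathbf{i},\xi_\alpha,\alpha\in\chi)$ in a manner parallel to (4.116) for $dF(\mathbf{i},\xi_\alpha,\alpha\in\chi)$. This representation gives the parallel construction of variables $\mathbf{I}^\dagger$, $\{\Xi^\dagger_\alpha,\alpha\in\chi\}$ with distribution $dF^\dagger(\mathbf{i},\xi_\alpha,\alpha\in\chi)$ as follows. First generate $\mathbf{I}^\dagger$ according to the distribution $P(\mathbf{I}^\dagger=\mathbf{i})$. Then, when $\mathbf{I}^\dagger=\mathbf{i}$, generate $\{\Xi^\dagger_\alpha,\alpha\in\chi_{\mathbf{i}}\}$ according to $dF^\dagger_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})$ and then $\{\Xi^\dagger_\alpha,\alpha\notin\chi_{\mathbf{i}}\}$ according to $dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha,\alpha\notin\chi_{\mathbf{i}}\mid\xi_\alpha,\alpha\in\chi_{\mathbf{i}})$. As this last factor is the same as the last factor in (4.116), an opportunity for coupling is presented. In particular, it may be possible to set $\Xi^\dagger_\alpha$ equal to $\Xi_\alpha$ for many $\alpha\notin\chi_{\mathbf{i}}$, thus making the pair $Y^\dagger, Y^\ddagger$ close to $Y', Y''$.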
4.4.2 Construction and Bounds for the Combinatorial Central Limit Theorem

In this section we prove Theorem 4.8 by specializing the construction given in Sect. 4.4.1 to handle the combinatorial central limit theorem, and then applying Theorem 4.1. Recall that by (2.45) we may, without loss of generality, replace $a_{ij}$ by $a_{ij}-a_{i\cdot}-a_{\cdot j}+a_{\cdot\cdot}$, and assume
\[
a_{i\cdot}=a_{\cdot j}=a_{\cdot\cdot}=0, \tag{4.120}
\]
noting that by doing so we may now write
\[
W=Y/\sigma, \tag{4.121}
\]
and that (4.107) becomes $\gamma=\sum_{ij}|a_{ij}|^3$.

Now, denoting $Y$ and $\pi$ by $Y'$ and $\pi'$, respectively, when convenient, the construction given in Example 2.3 applies. That is, given $\pi'$, uniform over $S_n$, take $(I,J)$ independent of $\pi'$ with a uniform distribution over all distinct pairs in $\{1,\ldots,n\}$, in other words, with distribution
\[
p_1(i,j)=\frac{1}{(n)_2}\mathbf{1}(i\ne j). \tag{4.122}
\]
Letting $\tau_{ij}$ be the permutation which transposes $i$ and $j$, set $\pi''=\pi'\tau_{I,J}$ and let $Y''$ be given by (4.104) with $\pi''$ replacing $\pi$. Example 2.3 shows that $(Y',Y'')$ is a $2/(n-1)$-Stein pair, and (2.48) gives
\[
Y'-Y''=(a_{I,\pi(I)}+a_{J,\pi(J)})-(a_{I,\pi(J)}+a_{J,\pi(I)}). \tag{4.123}
\]
In particular, averaging over $I$, $J$, $\pi(I)$ and $\pi(J)$ we now obtain (4.106) as follows, using (4.109) for the second equality,
\begin{align*}
\frac{1}{n^2(n-1)^2}\sum_{i,j,k,l}\big[(a_{ik}+a_{jl})-(a_{il}+a_{jk})\big]^2&=E(Y'-Y'')^2\\
&=2\lambda\sigma^2\\
&=\frac{4\sigma^2}{n-1}. \tag{4.124}
\end{align*}
We first demonstrate an intermediate result before presenting a coupling construction of $Y', Y''$ to $Y^\dagger, Y^\ddagger$, leading to a coupling of $Y$ and $Y^*$.

Lemma 4.5 Let $\pi$ be chosen uniformly from $S_n$ and suppose $i\ne j$ and $k\ne l$ are elements of $\{1,\ldots,n\}$. Then
\[
\pi^\dagger=\begin{cases}
\pi\tau_{\pi^{-1}(k),j} & \text{if } l=\pi(i),\ k\ne\pi(j),\\
\pi\tau_{\pi^{-1}(l),i} & \text{if } l\ne\pi(i),\ k=\pi(j),\\
\pi\tau_{\pi^{-1}(k),i}\tau_{\pi^{-1}(l),j} & \text{otherwise,}
\end{cases} \tag{4.125}
\]
is a permutation that satisfies
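\begin{align*}
\pi^\dagger(m)&=\pi(m)\quad\text{for all } m\notin\big\{i,j,\pi^{-1}(k),\pi^{-1}(l)\big\}, \tag{4.126}\\
\big\{\pi^\dagger(i),\pi^\dagger(j)\big\}&=\{k,l\}, \tag{4.127}
\end{align*}
and
\[
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\}\big)=\frac{1}{(n-2)!} \tag{4.128}
\]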
for all distinct $\xi_m^\dagger$, $m\notin\{i,j\}$ with $\xi_m^\dagger\notin\{k,l\}$.

Proof That $\pi^\dagger$ satisfies (4.126) is clear from its definition. To show (4.127) and that $\pi^\dagger$ is a permutation, let $A_1$, $A_2$ and $A_3$ denote the three cases of (4.125) in their respective order. Clearly under $A_1$ we have $\pi^\dagger(t)=\pi(t)$ for all $t\notin\{j,\pi^{-1}(k)\}$. Hence, as $i\ne j$ and $i=\pi^{-1}(l)\ne\pi^{-1}(k)$, we have $\pi^\dagger(i)=\pi(i)=l$. Also,
\[
\pi^\dagger(j)=\pi\tau_{\pi^{-1}(k),j}(j)=\pi\big(\pi^{-1}(k)\big)=k,
\]
showing (4.127) holds on $A_1$. As $\pi^\dagger(\pi^{-1}(k))=\pi(j)$, both $\pi$ and $\pi^\dagger$ map the set $\{j,\pi^{-1}(k)\}$ to $\{\pi(j),k\}$, and, as their images agree on $\{j,\pi^{-1}(k)\}^c$, we conclude that $\pi^\dagger$ is a permutation on $A_1$. As $A_2$ becomes $A_1$ upon interchanging $i$ with $j$ and $k$ with $l$, these conclusions hold also on $A_2$.

Under $A_3$, either $l=\pi(i)$, $k=\pi(j)$ or $l\ne\pi(i)$, $k\ne\pi(j)$. In the first instance $\pi^\dagger=\pi$, so $\pi^\dagger$ is a permutation, and (4.127) is immediate. Otherwise, as $i\ne j$ and $i\ne\pi^{-1}(l)$, we have
\[
\pi^\dagger(i)=\pi\tau_{\pi^{-1}(k),i}\tau_{\pi^{-1}(l),j}(i)=\pi\tau_{\pi^{-1}(k),i}(i)=\pi\big(\pi^{-1}(k)\big)=k
\]
and similarly, as $j\ne i$ and $j\ne\pi^{-1}(k)$,
\[
\pi^\dagger(j)=\pi\tau_{\pi^{-1}(k),i}\tau_{\pi^{-1}(l),j}(j)=\pi\tau_{\pi^{-1}(k),i}\big(\pi^{-1}(l)\big), \tag{4.129}
\]
and now, as $l\ne k$ and $l\ne\pi(i)$,
\[
\pi\tau_{\pi^{-1}(k),i}\big(\pi^{-1}(l)\big)=\pi\big(\pi^{-1}(l)\big)=l,
\]
so (4.127) holds under $A_3$. As both $\pi$ and $\pi^\dagger$ map $\{i,j,\pi^{-1}(k),\pi^{-1}(l)\}$ to $\{\pi(i),\pi(j),k,l\}$, and agree on $\{i,j,\pi^{-1}(k),\pi^{-1}(l)\}^c$, we conclude that $\pi^\dagger$ is a permutation on $A_3$.

We now turn our attention to (4.128). Let $\xi_m^\dagger$, $m\notin\{i,j\}$ be distinct and satisfy $\xi_m^\dagger\notin\{k,l\}$. Under $A_1$ we have $k\ne\pi(j)$, and have shown that $i\ne\pi^{-1}(k)$. Hence $\pi^{-1}(k)\notin\{i,j\}$ and therefore $\xi^\dagger_{\pi^{-1}(k)}\notin\{k,l\}$. Setting $\xi_i^\dagger=l$, we have
\begin{align*}
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\},\ A_1\big)
&=P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\},\ \pi(i)=l,\ \pi(j)\ne k\big)\\
&=P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{j\},\ \pi(j)\ne k\big)\\
&=P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{j,\pi^{-1}(k)\},\ \pi(j)\ne k,\ \pi^\dagger\big(\pi^{-1}(k)\big)=\xi^\dagger_{\pi^{-1}(k)}\big)
\end{align*}
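\begin{align*}
&=P\big(\pi(m)=\xi_m^\dagger,\ m\notin\{j,\pi^{-1}(k)\},\ \pi(j)\ne k,\ \pi(j)=\xi^\dagger_{\pi^{-1}(k)}\big)\\
&=P\big(\pi(m)=\xi_m^\dagger,\ m\notin\{j,\pi^{-1}(k)\},\ \pi(j)=\xi^\dagger_{\pi^{-1}(k)}\big)\\
&=\sum_{q\notin\{i,j\}}P\big(\pi(m)=\xi_m^\dagger,\ m\notin\{j,q\},\ \pi(j)=\xi_q^\dagger,\ \pi(q)=k\big)\\
&=\frac{(n-2)}{n!}.
\end{align*}
Case $A_2$ being the same upon interchanging $i$ with $j$ and $k$ with $l$, we obtain
\[
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\},\ A_1\cup A_2\big)=\frac{2(n-2)}{n!}. \tag{4.130}
\]
Under $A_3$ there are subcases depending on
\[
R=\big|\{\pi(i),\pi(j)\}\cap\{k,l\}\big|,
\]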
and we let $A_{3,r}=A_3\cap\{R=r\}$ for $r=0,1,2$. When $R=0$ the elements $\pi(i),\pi(j),k,l$ are distinct, and so $A_{3,0}=\{R=0\}$. Additionally $R=0$ if and only if the inverse images $i,j,\pi^{-1}(k),\pi^{-1}(l)$ under $\pi$ are also distinct, and so
\begin{align*}
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\},\ A_{3,0}\big)
&=P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j,\pi^{-1}(k),\pi^{-1}(l)\},\\
&\qquad\ \pi^\dagger\big(\pi^{-1}(k)\big)=\xi^\dagger_{\pi^{-1}(k)},\ \pi^\dagger\big(\pi^{-1}(l)\big)=\xi^\dagger_{\pi^{-1}(l)},\ A_{3,0}\big)\\
&=P\big(\pi(m)=\xi_m^\dagger,\ m\notin\{i,j,\pi^{-1}(k),\pi^{-1}(l)\},\\
&\qquad\ \pi(i)=\xi^\dagger_{\pi^{-1}(k)},\ \pi(j)=\xi^\dagger_{\pi^{-1}(l)},\ A_{3,0}\big)\\
&=\sum_{\{q,r\}:\ |\{q,r,i,j\}|=4}P\big(\pi(m)=\xi_m^\dagger,\ m\notin\{i,j,q,r\},\\
&\qquad\ \pi(i)=\xi_q^\dagger,\ \pi(j)=\xi_r^\dagger,\ \pi(q)=k,\ \pi(r)=l\big)\\
&=\frac{(n-2)(n-3)}{n!}. \tag{4.131}
\end{align*}
Considering the case $R=1$, in view of (4.125) we find
\[
A_{3,1}=A_3\cap\{R=1\}=A_{3,1a}\cup A_{3,1b},
\]
where
\[
A_{3,1a}=\big\{\pi(i)=k,\ \pi(j)\ne l\big\}\quad\text{and}\quad A_{3,1b}=\big\{\pi(i)\ne k,\ \pi(j)=l\big\}.
\]
Since by appropriate relabeling each of these cases becomes $A_1$, we have
\[
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\},\ A_{3,1}\big)=\frac{2(n-2)}{n!}. \tag{4.132}
\]
For $R=2$ we have $A_{3,2}=A_{3,2a}\cup A_{3,2b}$ where
\[
A_{3,2a}=\big\{\pi(i)=l,\ \pi(j)=k\big\}\quad\text{and}\quad A_{3,2b}=\big\{\pi(j)=l,\ \pi(i)=k\big\}.
\]
Under $A_{3,2a}$,
\[
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\},\ A_{3,2a}\big)
=P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\},\ \pi(i)=l,\ \pi(j)=k\big)=\frac{1}{n!},
\]
and the same holding for $A_{3,2b}$, by symmetry, yields
\[
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\},\ A_{3,2}\big)=\frac{2}{n!}. \tag{4.133}
\]
Summing the contributions from (4.130), (4.131), (4.132) and (4.133) we obtain
\[
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{i,j\}\big)=\frac{4(n-2)}{n!}+\frac{(n-2)(n-3)}{n!}+\frac{2}{n!}=\frac{1}{(n-2)!}
\]
as claimed.
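The conclusions of Lemma 4.5 can also be confirmed by exhaustive enumeration for small $n$. The sketch below is illustrative only: the helper names are assumptions, indices and images are 0-based, and a permutation is stored as a list `perm` with `perm[m]` playing the role of $\pi(m)$, so that right multiplication by a transposition $\tau_{a,b}$ swaps the entries in positions `a` and `b`.

```python
from collections import Counter
from itertools import permutations, product
from math import factorial

def transpose(perm, a, b):
    """Return perm composed with tau_{a,b}, i.e. m -> perm(tau_{a,b}(m))."""
    out = list(perm)
    out[a], out[b] = out[b], out[a]
    return out

def dagger(perm, i, j, k, l):
    """Construct pi-dagger from pi following the three cases of (4.125)."""
    inv = {v: m for m, v in enumerate(perm)}   # inverse permutation of pi
    if perm[i] == l and perm[j] != k:          # case A1
        return transpose(perm, inv[k], j)
    if perm[i] != l and perm[j] == k:          # case A2
        return transpose(perm, inv[l], i)
    # case A3: pi tau_{inv(k), i} tau_{inv(l), j}
    return transpose(transpose(perm, inv[k], i), inv[l], j)

def check_lemma(n):
    """Check (4.126), (4.127) and (4.128) for every admissible i, j, k, l."""
    for i, j, k, l in product(range(n), repeat=4):
        if i == j or k == l:
            continue
        counts = Counter()
        for perm in permutations(range(n)):
            inv = {v: m for m, v in enumerate(perm)}
            pd = dagger(perm, i, j, k, l)
            assert sorted(pd) == list(range(n))          # pi-dagger is a permutation
            assert {pd[i], pd[j]} == {k, l}              # (4.127)
            moved = {i, j, inv[k], inv[l]}
            assert all(pd[m] == perm[m] for m in range(n) if m not in moved)  # (4.126)
            counts[tuple(pd[m] for m in range(n) if m not in (i, j))] += 1
        # (4.128): each of the (n-2)! configurations has probability 1/(n-2)!
        assert len(counts) == factorial(n - 2)
        assert set(counts.values()) == {factorial(n) // factorial(n - 2)}
    return True
```

Calling `check_lemma(4)` runs through all $144$ admissible index choices and all $24$ permutations; the uniformity in (4.128) shows up as every configuration of the remaining images being hit exactly $n!/(n-2)!$ times.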
The following lemma shows how to choose the 'special' indices in Lemma 4.5 to form the square bias, and hence, zero bias, distributions. In addition, as values of the $\pi^\dagger$ permutation can be made to coincide with those of a given $\pi$ using (4.125), a coupling of these variables on the same space is achieved. Before stating the lemma we note that (4.134) is a distribution by virtue of (4.106).

Lemma 4.6 Let
\[
Y=\sum_{i=1}^n a_{i,\pi(i)}
\]
with $\pi$ chosen uniformly from $S_n$, and let $(I^\dagger,J^\dagger,K^\dagger,L^\dagger)$ be independent of $\pi$ with distribution
\[
p_2(i,j,k,l)=\frac{[(a_{ik}+a_{jl})-(a_{il}+a_{jk})]^2}{4n^2(n-1)\sigma^2}. \tag{4.134}
\]
Further, let $\pi^\dagger$ be constructed from $\pi$ as in (4.125) with $I^\dagger$, $J^\dagger$, $K^\dagger$ and $L^\dagger$ replacing $i$, $j$, $k$ and $l$, respectively, and $\pi^\ddagger=\pi^\dagger\tau_{I^\dagger,J^\dagger}$. Then
\[
\pi(i)=\pi^\dagger(i)=\pi^\ddagger(i)\quad\text{for all } i\notin\mathcal{I} \tag{4.135}
\]
where $\mathcal{I}=\{I^\dagger,J^\dagger,\pi^{-1}(K^\dagger),\pi^{-1}(L^\dagger)\}$, the variables
\[
Y^\dagger=\sum_{i=1}^n a_{i,\pi^\dagger(i)}\quad\text{and}\quad Y^\ddagger=\sum_{i=1}^n a_{i,\pi^\ddagger(i)} \tag{4.136}
\]
have the square bias distribution (4.113), and with $U$ a uniform variable on $[0,1]$, independent of all other variables, $Y^*=UY^\dagger+(1-U)Y^\ddagger$ has the $Y$-zero bias distribution.
Proof The claim (4.135) follows from (4.126) and the definition of $\pi^\ddagger$. When $\mathbf{I}=(I,J)$ is independent of $\pi$ with distribution (4.122), $\chi=\{1,\ldots,n\}$ and $\Xi_\alpha=\pi(\alpha)$ for $\alpha\in\chi$, let $\psi$ be the $\mathbb{R}^2$ valued function of $\{\mathbf{I},\Xi_\alpha,\alpha\in\chi\}$ which yields the exchangeable pair $Y', Y''$ in Example 2.3. In view of Proposition 4.6, to prove the remainder of the claims it suffices to verify the hypotheses of Lemma 4.4, that is, with $\mathbf{I}^\dagger=(I^\dagger,J^\dagger)$, that $\{\mathbf{I}^\dagger,\Xi^\dagger_\alpha,\alpha\in\chi\}$, or equivalently $\{\mathbf{I}^\dagger,\pi^\dagger(\alpha),\alpha\in\chi\}$, has distribution (4.113). Relying on the discussion following Lemma 4.4, we prove this latter claim by considering the factorization (4.116) of $dF(\mathbf{i},\xi_\alpha,\alpha\in\chi)$ and showing that $\{\mathbf{I}^\dagger,\pi^\dagger(\alpha),\alpha\in\chi\}$ follows the corresponding square bias distribution (4.118).

With $\mathbf{i}=(i,j)$ and $P(\mathbf{I}=\mathbf{i})$ already specified by (4.122), we identify the remaining parts of the factorization (4.116) by noting that the distribution $dF_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})=dF_{\mathbf{i}}(\xi_i,\xi_j)$ of the images of $i$ and $j$ under $\pi$ is uniform over all $\xi_i\ne\xi_j$, and, for such $\xi_i,\xi_j$, $dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha,\alpha\notin\{i,j\}\mid\xi_i,\xi_j)$ is uniform over all distinct elements $\xi_\alpha$, $\alpha\in\chi$ that do not intersect $\{\xi_i,\xi_j\}$, that is, for such values
\[
dF_{\mathbf{i}^c|\mathbf{i}}\big(\xi_\alpha,\alpha\notin\{i,j\}\mid\xi_i,\xi_j\big)=\frac{1}{(n-2)!}. \tag{4.137}
\]
Now consider the corresponding factorization (4.118). First, this expression specifies the joint distribution of the values $\mathbf{I}^\dagger$ and their images $\Xi^\dagger_\alpha$, $\alpha\in\mathbf{I}^\dagger$ under $\pi^\dagger$ by
\[
P(\mathbf{I}^\dagger=\mathbf{i})\,dF^\dagger_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})=\frac{P(\mathbf{I}=\mathbf{i})}{2\lambda\sigma^2}\,b^2(\mathbf{i},\xi_\alpha,\alpha\in\chi_{\mathbf{i}})\,dF_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}}), \tag{4.138}
\]
where from (2.47) for the difference $Y'-Y''$ we have
\[
b(i,j,\xi_i,\xi_j)=(a_{i,\xi_i}+a_{j,\xi_j})-(a_{i,\xi_j}+a_{j,\xi_i}). \tag{4.139}
\]
Since the distribution (4.122) of $\mathbf{I}$ is uniform over the range where $i\ne j$, and for such distinct $i$ and $j$ the distribution $dF_{\mathbf{i}}(\xi_\alpha,\alpha\in\chi_{\mathbf{i}})$ is uniform over all distinct choices of images $\xi_i$ and $\xi_j$, we conclude that the joint distribution (4.138) of $\mathbf{I}^\dagger$ and their 'biased permutation images' $(\Xi^\dagger_{I^\dagger},\Xi^\dagger_{J^\dagger})$ is proportional to $\mathbf{1}_{i\ne j,\,k\ne l}\,b^2(i,j,k,l)$. This is exactly the distribution $p_2(i,j,k,l)$ from which $I^\dagger,J^\dagger,K^\dagger,L^\dagger$ is chosen. In addition, the values $\{K^\dagger,L^\dagger\}$ are the images of $\{I^\dagger,J^\dagger\}$ under the permutation $\pi^\dagger$ constructed as specified in the statement of the lemma, as follows. By (4.134), $I^\dagger\ne J^\dagger$ and $K^\dagger\ne L^\dagger$ with probability one. As $\{I^\dagger,J^\dagger,K^\dagger,L^\dagger\}$ and $\pi$ are independent, the construction and conclusions of Lemma 4.5 apply, conditional on these indices. Invoking Lemma 4.5, $\pi^\dagger$ is a permutation that maps $\{I^\dagger,J^\dagger\}$ to $\{K^\dagger,L^\dagger\}$.

To show that the remaining values are distributed according to $dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha,\alpha\notin\chi_{\mathbf{i}}\mid\xi_\alpha,\alpha\in\chi_{\mathbf{i}})$, if $\xi_m^\dagger$, $m\notin\{I^\dagger,J^\dagger\}$ are distinct values not lying in $\{K^\dagger,L^\dagger\}$ then, again by Lemma 4.5,
\[
P\big(\pi^\dagger(m)=\xi_m^\dagger,\ m\notin\{I^\dagger,J^\dagger\}\mid I^\dagger,J^\dagger,K^\dagger,L^\dagger\big)=\frac{1}{(n-2)!}. \tag{4.140}
\]
As (4.140) agrees with (4.137), the proof of the lemma is complete.
Note that in general even when $\mathbf{I}$ is uniformly distributed, the index $\mathbf{I}^\dagger$ need not be. In fact, from (4.117) it is clear that when $\mathbf{I}$ is uniform the distribution of $\mathbf{I}^\dagger$ is given by $P(\mathbf{I}^\dagger=\mathbf{i})=0$ for all $\mathbf{i}$ such that $P(\mathbf{I}=\mathbf{i})=0$, and otherwise
\[
P(\mathbf{I}^\dagger=\mathbf{i})=\frac{Eb^2(\mathbf{i},\Xi_\alpha,\alpha\in\chi_{\mathbf{i}})}{\sum_{\mathbf{i}}Eb^2(\mathbf{i},\Xi_\alpha,\alpha\in\chi_{\mathbf{i}})}. \tag{4.141}
\]
In particular, the distribution (4.134) selects the indices $\mathbf{I}^\dagger=(I^\dagger,J^\dagger)$ jointly with their 'biased permutation' images $(K^\dagger,L^\dagger)$ with probability that preferentially makes the squared difference large. One can see this effect directly by calculating the marginal distribution of $I^\dagger,J^\dagger$, which, by (4.141), is proportional to $\sum_{k,l}[(a_{ik}+a_{jl})-(a_{il}+a_{jk})]^2$, by expanding and applying (4.120), yielding
\begin{align*}
\sum_{k,l}\big[(a_{ik}+a_{jl})-(a_{il}+a_{jk})\big]^2
&=2\sum_{k,l}\big(a_{ik}^2+a_{jl}^2-a_{ik}a_{jk}-a_{jl}a_{il}\big)\\
&=2n\sum_{k=1}^n(a_{ik}-a_{jk})^2,
\end{align*}
and hence the generally nonuniform distribution
\[
P(I^\dagger=i,J^\dagger=j)=\frac{\sum_{k=1}^n(a_{ik}-a_{jk})^2}{2n(n-1)\sigma^2}.
\]
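As a numerical illustration of the two facts just used, the sketch below checks on a small centered array that (4.134) sums to one and that its $(i,j)$ marginal agrees with the displayed formula. The function names and the test matrix are assumptions made for this example; the array is first centered so that (4.120) holds.

```python
from fractions import Fraction
from itertools import product

def center(a):
    """Replace a_ij by a_ij - a_i. - a_.j + a.. so that (4.120) holds."""
    n = len(a)
    row = [Fraction(sum(r), n) for r in a]
    col = [Fraction(sum(a[i][j] for i in range(n)), n) for j in range(n)]
    tot = Fraction(sum(map(sum, a)), n * n)
    return [[a[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]

def sigma2(c):
    """Variance (4.105) for an array c that is already centered."""
    n = len(c)
    return sum(x * x for r in c for x in r) / (n - 1)

def p2(c):
    """The distribution (4.134) on index quadruples (i, j, k, l)."""
    n, s2 = len(c), sigma2(c)
    return {(i, j, k, l):
            ((c[i][k] + c[j][l]) - (c[i][l] + c[j][k])) ** 2 / (4 * n * n * (n - 1) * s2)
            for i, j, k, l in product(range(n), repeat=4)}

def marginal_ij(c, i, j):
    """Closed form for P(I-dagger = i, J-dagger = j) displayed above."""
    n = len(c)
    return sum((c[i][k] - c[j][k]) ** 2 for k in range(n)) / (2 * n * (n - 1) * sigma2(c))

# Hypothetical array, centered before use.
C = center([[1, 4, 2, 0], [3, 1, 5, 2], [0, 2, 2, 7], [6, 1, 0, 3]])
P2 = p2(C)
```

Summing `P2` over `k` and `l` for a fixed pair `(i, j)` reproduces `marginal_ij(C, i, j)`, which differs from pair to pair unless all row differences have the same squared length.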
With the construction of the zero bias variable now in hand, Theorem 4.8 follows from Lemma 4.6, Theorem 4.1, (4.10) of Proposition 4.1, and the following lemma.

Lemma 4.7 For $Y$ and $Y^*$ constructed as in Lemma 4.6,
\[
\big\|\mathcal{L}(Y^*)-\mathcal{L}(Y)\big\|_1\le\frac{\gamma}{(n-1)\sigma^2}\Big(8+\frac{28}{(n-1)}+\frac{4}{(n-1)^2}\Big).
\]

With $\pi$ and the indices $\{I^\dagger,J^\dagger,K^\dagger,L^\dagger\}$ constructed as in Lemma 4.6, the calculation of the bound proceeds by decomposing
\[
V=Y^*-Y\quad\text{as}\quad V=V\mathbf{1}_2+V\mathbf{1}_1+V\mathbf{1}_0
\]
where $\mathbf{1}_k=\mathbf{1}(R=k)$ with $R=|\{\pi(I^\dagger),\pi(J^\dagger)\}\cap\{K^\dagger,L^\dagger\}|$. The three factors give rise to the three terms of the bound. The proof of the lemma, though not difficult, requires some attention to detail, and can be found in the Appendix to this chapter.
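4.5 Simple Random Sampling

Theorem 4.9 gives an $L^1$ bound for the exchangeable pair coupling. After proving the theorem, we will record a corollary and use it to prove an $L^1$ bound for simple random sampling. Recall that $(Y',Y'')$ is a $\lambda$-Stein pair for $\lambda\in(0,1)$ if $(Y',Y'')$ are exchangeable and satisfy the linear regression condition
\[
E(Y''\mid Y')=(1-\lambda)Y'. \tag{4.142}
\]

Theorem 4.9 Let $W', W''$ be a mean zero, variance 1, $\lambda$-Stein pair. Then if $F$ is the distribution function of $W'$,
\[
\|F-\Phi\|_1\le\sqrt{\frac{2}{\pi}}\,E\Big|E\Big(1-\frac{(W''-W')^2}{2\lambda}\,\Big|\,W'\Big)\Big|+\frac{1}{2\lambda}E|W''-W'|^3.
\]

Proof Letting $\Delta=W'-W''$, the result follows directly from Proposition 2.4 and Lemma 2.7, the latter of which shows that identity (2.76) is satisfied with $R=0$, $\hat K(t)$ given by (2.38), $\hat K_1=E(\Delta^2\mid W')/2\lambda$ by (2.39), and
\begin{align*}
\hat K_2&=\frac{|\Delta|}{2\lambda}\Big(\mathbf{1}_{\{-\Delta\le 0\}}\int_{-\Delta}^0(-t)\,dt+\mathbf{1}_{\{-\Delta>0\}}\int_0^{-\Delta}t\,dt\Big)\\
&=\frac{|\Delta|}{2\lambda}\Big(\frac{\Delta^2}{2}\mathbf{1}_{\{-\Delta\le 0\}}+\frac{\Delta^2}{2}\mathbf{1}_{\{-\Delta>0\}}\Big)=\frac{|\Delta|^3}{4\lambda}.
\end{align*}

In many applications calculation of the expectation of the absolute value of the conditional expectation may be difficult. However, by (2.34) we have
\[
E\Big(\frac{(W''-W')^2}{2\lambda}\Big)=1\quad\text{so that}\quad E\Big(E\Big(1-\frac{(W''-W')^2}{2\lambda}\,\Big|\,W'\Big)\Big)=0.
\]
Hence, by the Cauchy–Schwarz inequality,
\begin{align*}
E\Big|E\Big(1-\frac{(W''-W')^2}{2\lambda}\,\Big|\,W'\Big)\Big|
&\le\sqrt{\operatorname{Var}\Big(E\Big(1-\frac{(W''-W')^2}{2\lambda}\,\Big|\,W'\Big)\Big)}\\
&=\frac{1}{2\lambda}\sqrt{\operatorname{Var}\big(E\big((W''-W')^2\mid W'\big)\big)}.
\end{align*}
Though the variance of the conditional expectation $E((W''-W')^2\mid W')$ may still be troublesome, the inequality
\[
\operatorname{Var}\big(E(Y\mid W)\big)\le\operatorname{Var}\big(E(Y\mid\mathcal{F})\big)\quad\text{when }\sigma\{W\}\subset\mathcal{F} \tag{4.143}
\]
often leads to the computation of a tractable bound, and provides estimates which result in the optimal rate. To show (4.143), first note that the conditional variance formula, for any $X$, yields
\[
\operatorname{Var}\big(E(X\mid W)\big)\le E\operatorname{Var}(X\mid W)+\operatorname{Var}\big(E(X\mid W)\big)=\operatorname{Var}(X).
\]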
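However, for $X=E(Y\mid\mathcal{F})$ we have
\[
E(X\mid W)=E\big(E(Y\mid\mathcal{F})\mid W\big)=E(Y\mid W),
\]
and substituting yields (4.143). Hence we arrive at the following corollary to Theorem 4.9.

Corollary 4.3 Under the assumptions of Theorem 4.9, when $\mathcal{F}$ is any $\sigma$-algebra containing $\sigma\{W'\}$,
\[
\|F-\Phi\|_1\le\frac{1}{\lambda}\Big(\frac{1}{\sqrt{2\pi}}\Psi+\frac{1}{2}E|W''-W'|^3\Big),
\]
where
\[
\Psi=\sqrt{\operatorname{Var}\big(E\big((W''-W')^2\mid\mathcal{F}\big)\big)}. \tag{4.144}
\]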
We use Corollary 4.3 to prove an $L^1$ bound for the sum of numerical characteristics of a simple random sample, that is, for a sample of a population $\{1,\ldots,N\}$ drawn so that all subsets of size $n$, with $0<n<N$, are equally likely. The limiting normal distribution for simple random sampling was obtained by Wald and Wolfowitz (1944) (see also Madow 1948; Erdös and Rényi 1959a; and Hájek 1960). Let $a_i\in\mathbb{R}$, $i=1,2,\ldots,N$ denote the characteristic of interest associated with individual $i$, and let $Y$ be the sum of the characteristics $\{X_1,\ldots,X_n\}$ of the sampled individuals. One can easily verify that the mean $\mu$ and variance $\sigma^2$ of $Y$ are given by
\[
\mu=n\bar a\quad\text{and}\quad\sigma^2=\frac{n(N-n)}{N(N-1)}\sum_{i=1}^N(a_i-\bar a)^2\quad\text{where}\quad\bar a=\frac{1}{N}\sum_{i=1}^N a_i. \tag{4.145}
\]
As we are interested in bounds to the normal for the standardized variable $(Y-\mu)/\sigma$, by replacing $a$ by $(a-\bar a)/\sqrt{\sum_{b\in\mathcal{A}}(b-\bar a)^2}$ we may assume in what follows without loss of generality that
\[
\bar a=0\quad\text{and}\quad\sum_{i=1}^N a_i^2=1. \tag{4.146}
\]
For $m=1,\ldots,n$ let $(n)_m=n(n-1)\cdots(n-m+1)$, the falling factorial of $n$, and
\[
f_m=\frac{(n)_m}{(N)_m}. \tag{4.147}
\]
Theorem 4.10 Let the numerical characteristics $\mathcal{A}=\{a_i,\ i=1,2,\ldots,N\}$ of a population of size $N$ satisfy (4.146), and let $Y$ be the sum of characteristics in a simple random sample of size $n$ from $\mathcal{A}$ with $1<n<N$. Let
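\[
\sigma^2=\frac{n(N-n)}{N(N-1)},\qquad A_4=\sum_{a\in\mathcal{A}}a^4,\qquad\lambda=\frac{N}{n(N-n)}\qquad\text{and}\qquad\gamma=\sum_{a\in\mathcal{A}}|a|^3. \tag{4.148}
\]
Then with $F$ the distribution function of $Y/\sigma$,
\[
\|F-\Phi\|_1\le\frac{1}{\lambda}\Big(\frac{R_1}{\sqrt{2\pi}}+\frac{R_2}{2}\Big),
\]
where
\[
R_1=\frac{1}{n}\sqrt{\frac{2}{\sigma^2}S_1+\frac{8}{\sigma^4(N-n)^2}S_2}
\]
with
\[
S_1=A_4-\frac{1}{N},\qquad S_2=A_4(f_1-7f_2+6f_3-6f_4)+3(f_2-f_3+f_4)-\sigma^4
\]
and
\[
R_2=8f_1\gamma/\sigma^3.
\]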
In the usual asymptotic, $n$ and $N$ tend to infinity together with the sampling fraction $f_1=n/N$ bounded away from zero and one; in such cases $\lambda=O(1/n)$ and $f_m=O(1)$. Additionally, if the $a\in\mathcal{A}$ satisfy $\sum_{a\in\mathcal{A}}a^2=1$ and are of comparable size then $a=O(1/\sqrt N)$, which implies $A_4=O(1/n)$ and $\gamma=O(1/\sqrt n)$. Overall then the bound provided by the theorem in such an asymptotic, which has main contribution from $R_2$, is $O(1/\sqrt n)$.

Since distinct labels may be appended to $a_i$, $i=1,\ldots,N$, say as a second coordinate which is neglected when taking sums, we may assume in what follows that elements of $\mathcal{A}=\{a_i,\ i=1,\ldots,N\}$ are distinct.

The first main point of attention is the construction of a Stein pair, which can be achieved as follows. Let $X_1,X_2,\ldots,X_{n+1}$ be a simple random sample of size $n+1$ from the population and let $I'$ and $I''$ be two distinct indices drawn uniformly from $\{1,\ldots,n+1\}$. Now set
\[
Y'=X_{I'}+T\quad\text{and}\quad Y''=X_{I''}+T\quad\text{where}\quad T=\sum_{i\in\{1,\ldots,n+1\}\setminus\{I',I''\}}X_i.
\]
As $(X_{I'},X_{I''},T)=_d(X_{I''},X_{I'},T)$ the variables $Y'$ and $Y''$ are exchangeable. By exchangeability and the first condition in (4.146) we have
\[
E(X_{I'}\mid Y')=\frac{1}{n}Y'\quad\text{and}\quad E(X_{I''}\mid Y')=-\frac{1}{N-n}Y',
\]
and therefore
\[
E(Y''\mid Y')=E(Y'-X_{I'}+X_{I''}\mid Y')=(1-\lambda)Y'
\]
where $\lambda\in(0,1)$ is given by (4.148); the linearity condition (4.142) is satisfied.
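The linearity condition and the variance formula (4.145) can be checked exactly on a small population. In the sketch below the function names and the population `pop` are illustrative assumptions. The check conditions on the whole sample rather than on $Y'$ alone; since linearity holds given the sample, it holds a fortiori given $Y'$. Only the centering $\bar a=0$ is used, not the normalization $\sum a_i^2=1$.

```python
from fractions import Fraction
from itertools import combinations

def linearity_holds(pop, n):
    """Check E(Y'' | sample) = (1 - lam) Y' for every sample of size n.

    Y'' is obtained from Y' by removing a uniformly chosen sampled value and
    adding a uniformly chosen unsampled one; pop must sum to zero.
    """
    N = len(pop)
    lam = Fraction(N, n * (N - n))
    for S in combinations(range(N), n):
        inside = [pop[i] for i in S]
        outside = [pop[i] for i in range(N) if i not in S]
        y1 = sum(inside)
        e_y2 = y1 - Fraction(sum(inside), n) + Fraction(sum(outside), N - n)
        if e_y2 != (1 - lam) * y1:
            return False
    return True

def variance_by_enumeration(pop, n):
    """Exact variance of the sample sum over all equally likely samples."""
    sums = [sum(c) for c in combinations(pop, n)]
    mean = Fraction(sum(sums), len(sums))
    return sum((s - mean) ** 2 for s in sums) / len(sums)

def variance_by_formula(pop, n):
    """The variance in (4.145)."""
    N = len(pop)
    abar = Fraction(sum(pop), N)
    return Fraction(n * (N - n), N * (N - 1)) * sum((a - abar) ** 2 for a in pop)

# Hypothetical centered population of size N = 5.
pop = [-3, -1, 0, 1, 3]
```

For this population and $n=2$ both variance computations return 6.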
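Before starting the proof we pause to simplify the required moment calculations for $\mathcal{X}=\{X_1,\ldots,X_n\}$, a simple random sample of $\mathcal{A}$. For $m\in\mathbb{N}$, $\{k_1,\ldots,k_m\}\subset\mathbb{N}$ and $\mathbf{k}=(k_1,\ldots,k_m)$ let
\[
[\mathbf{k}]=E\Big(\sum_{\{a,b,\ldots,c\}\subset\mathcal{X},\ |\{a,b,\ldots,c\}|=m}a^{k_1}b^{k_2}\cdots c^{k_m}\Big)
\]
and
\[
\langle\mathbf{k}\rangle=\sum_{\{y_1,\ldots,y_m\}\subset\mathcal{A},\ |\{y_1,\ldots,y_m\}|=m}y_1^{k_1}y_2^{k_2}\cdots y_m^{k_m}.
\]
Now observe that, with $f_m$ given in (4.147),
\[
[\mathbf{k}]=f_m\langle\mathbf{k}\rangle. \tag{4.149}
\]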
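Identity (4.149) can be checked by brute force on a small population. The following sketch is an illustration under stated assumptions: the function names are hypothetical, and the sums defining $[\mathbf{k}]$ and $\langle\mathbf{k}\rangle$ are taken over ordered tuples of distinct elements, the reading under which $\langle 1,1\rangle=-\langle 2\rangle$ for a centered population.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import prod

def angle(values, k):
    """<k>: sum of y1^k1 * ... * ym^km over ordered tuples of distinct members."""
    return sum(prod(y ** e for y, e in zip(ys, k))
               for ys in permutations(values, len(k)))

def bracket(pop, n, k):
    """[k]: expectation of the same sum over a simple random sample of size n."""
    samples = list(combinations(pop, n))
    return Fraction(sum(angle(s, k) for s in samples), len(samples))

def f(n, N, m):
    """f_m = (n)_m / (N)_m, the ratio of falling factorials in (4.147)."""
    falling = lambda x: prod(x - r for r in range(m))
    return Fraction(falling(n), falling(N))

# Hypothetical population of size N = 5; no standardization is needed for (4.149).
pop = [-2, -1, 0, 1, 3]
```

The identity reflects that each ordered $m$-tuple of distinct population members lies in the sample with probability $(n)_m/(N)_m$.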
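As $[\mathbf{k}]$ and $\langle\mathbf{k}\rangle$ are invariant under any permutation of the components of $\mathbf{k}$, we may always use the canonical representation where $k_1\ge\cdots\ge k_m$. Let $e_j^m$ be the $j$th unit vector in $\mathbb{R}^m$. When the population characteristics satisfy (4.146) we have
\[
\langle k_1,\ldots,k_{m-1},1\rangle=-\sum_{j=1}^{m-1}\big\langle(k_1,\ldots,k_{m-1})+e_j^{m-1}\big\rangle
\]
and
\[
\langle k_1,\ldots,k_{m-1},2\rangle=\langle k_1,\ldots,k_{m-1}\rangle-\sum_{j=1}^{m-1}\big\langle(k_1,\ldots,k_{m-1})+2e_j^{m-1}\big\rangle.
\]
Note then that
\begin{align*}
\langle 2\rangle&=1\\
\langle 3,1\rangle&=-\langle 4\rangle\\
\langle 2,2\rangle&=\langle 2\rangle-\langle 4\rangle \tag{4.150}\\
\langle 2,1,1\rangle&=-\langle 3,1\rangle-\langle 2,2\rangle=\langle 4\rangle-\langle 2\rangle+\langle 4\rangle=2\langle 4\rangle-\langle 2\rangle\\
\langle 1,1,1,1\rangle&=-3\langle 2,1,1\rangle=-6\langle 4\rangle+3\langle 2\rangle.
\end{align*}

Proof of Theorem 4.10 We may assume $n\le N/2$, as otherwise we may replace $Y$, a sample of size $n$ from $\mathcal{A}$, by $-Y$, a sample of size $N-n$; this assumption is used in (4.151). We apply Corollary 4.3, beginning with the first term in the bound. Letting $\mathcal{X}=\{X_j,\ j\ne I''\}$ and $\mathcal{F}=\sigma(\mathcal{X})$, applying inequality (4.143) yields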
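\begin{align*}
\operatorname{Var}\big(E\big((Y'-Y'')^2\mid Y'\big)\big)&\le\operatorname{Var}\big(E\big((Y'-Y'')^2\mid\mathcal{F}\big)\big)\\
&=\operatorname{Var}\big(E\big((X_{I'}-X_{I''})^2\mid\mathcal{F}\big)\big)\\
&=\operatorname{Var}\big(E\big(X_{I'}^2-2X_{I'}X_{I''}+X_{I''}^2\mid\mathcal{F}\big)\big).
\end{align*}
For these three conditional expectations,
\[
E\big(X_{I''}^2\mid\mathcal{F}\big)=\frac{1}{N-n}\sum_{b\notin\mathcal{X}}b^2,\qquad
E(X_{I'}X_{I''}\mid\mathcal{F})=\frac{1}{n(N-n)}\sum_{a\in\mathcal{X},\,b\notin\mathcal{X}}ab
\qquad\text{and}\qquad
E\big(X_{I'}^2\mid\mathcal{F}\big)=\frac{1}{n}\sum_{a\in\mathcal{X}}a^2.
\]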
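By the standardization (4.146) we have
\[
\frac{1}{N-n}\sum_{b\notin\mathcal{X}}b^2=\frac{1}{N-n}\Big(1-\sum_{a\in\mathcal{X}}a^2\Big)
\quad\text{and}\quad
\frac{1}{n(N-n)}\sum_{a\in\mathcal{X}}\sum_{b\notin\mathcal{X}}ab=-\frac{1}{n(N-n)}\Big(\sum_{a\in\mathcal{X}}a\Big)^2.
\]
Hence, using $\operatorname{Var}(U+V)\le 2(\operatorname{Var}(U)+\operatorname{Var}(V))$,
\begin{align*}
\operatorname{Var}\big(E\big((Y'-Y'')^2\mid Y'\big)\big)
&\le\operatorname{Var}\Big(\frac{N-2n}{n(N-n)}\sum_{a\in\mathcal{X}}a^2+\frac{2}{n(N-n)}\Big(\sum_{a\in\mathcal{X}}a\Big)^2\Big)\\
&\le 2\Big(\frac{1}{n^2}\operatorname{Var}\Big(\sum_{a\in\mathcal{X}}a^2\Big)+\Big(\frac{2}{n(N-n)}\Big)^2\operatorname{Var}\Big(\Big(\sum_{a\in\mathcal{X}}a\Big)^2\Big)\Big). \tag{4.151}
\end{align*}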
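Calculating the first variance in (4.151), using (4.149), we begin with
\[
\Big(E\sum_{a\in\mathcal{X}}a^2\Big)^2=[2]^2=\big(f_1\langle 2\rangle\big)^2=f_1^2.
\]
Next, note
\begin{align*}
E\Big(\sum_{a\in\mathcal{X}}a^2\Big)^2&=[4]+[2,2]=f_1\langle 4\rangle+f_2\langle 2,2\rangle\\
&=f_1\langle 4\rangle+f_2\big(\langle 2\rangle-\langle 4\rangle\big)=\frac{n(N-n)}{N(N-1)}\langle 4\rangle+f_2,
\end{align*}
and therefore
\[
\operatorname{Var}\Big(\sum_{a\in\mathcal{X}}a^2\Big)=\frac{n(N-n)}{N(N-1)}\Big(\langle 4\rangle-\frac{1}{N}\Big)=\sigma^2S_1.
\]
For the second variance in (4.151), using (4.149) and (4.150) we first obtain the expectation
\[
E\Big(\sum_{a\in\mathcal{X}}a\Big)^2=[2]+[1,1]=f_1-f_2=\sigma^2. \tag{4.152}
\]
Similarly, for the second moment we compute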
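\begin{align*}
E\Big(\sum_{a\in\mathcal{X}}a\Big)^4&=[4]+4[3,1]+3[2,2]+3[2,1,1]+[1,1,1,1]\\
&=f_1\langle 4\rangle+f_2\big(4\langle 3,1\rangle+3\langle 2,2\rangle\big)+f_3\,3\langle 2,1,1\rangle+f_4\langle 1,1,1,1\rangle\\
&=\langle 4\rangle(f_1-7f_2+6f_3-6f_4)+3(f_2-f_3+f_4).
\end{align*}
The variance of this term is now obtained by subtracting the square of the expectation (4.152), resulting in the quantity $S_2$. Hence, from (4.151),
\[
\operatorname{Var}\big(E\big((Y'-Y'')^2\mid Y'\big)\big)\le\frac{1}{n^2}\Big(2\sigma^2S_1+\frac{8}{(N-n)^2}S_2\Big),
\]
and therefore, with $W'=Y'/\sigma$ and $W''=Y''/\sigma$, we have
\[
\sqrt{\operatorname{Var}\big(E\big((W'-W'')^2\mid W'\big)\big)}=\sqrt{\operatorname{Var}\big(E\big((Y'-Y'')^2\mid Y'\big)\big)/\sigma^4}\le R_1.
\]
Regarding the second term in Corollary 4.3, as
\[
E|Y'-Y''|^3=E|X_{I'}-X_{I''}|^3\le 8E|X_{I'}|^3=8\frac{n}{N}\sum_{a\in\mathcal{A}}|a|^3=8f_1\gamma,
\]
we obtain $E|W'-W''|^3\le 8f_1\gamma/\sigma^3=R_2$.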
4.6 Chatterjee's $L^1$ Theorem

The basis of all normal Stein identities is that $Z\sim\mathcal{N}(0,1)$ if and only if
\[
E\big[Zf(Z)\big]=E\big[f'(Z)\big] \tag{4.153}
\]
for all absolutely continuous functions $f$ for which these expectations exist. For a mean zero, variance one random variable $W$ which may be close to normal, (4.153) may hold approximately, and there may therefore be a related identity which holds exactly for $W$. One way the identity (4.153) may be altered to hold exactly for some given $W$ is to no longer insist that the same variable, $W$, appear on the right hand side as on the left, thus leading to the zero bias identity (2.51)
\[
E\big[Wf(W)\big]=E\big[f'(W^*)\big], \tag{4.154}
\]
as discussed in Sect. 2.3.3. Insisting that $W$ appear on both sides, one may be led instead to consider identities of the form
\[
E\big[Wf(W)\big]=E\big[f'(W)T\big], \tag{4.155}
\]
for some random variable $T$, defined on the same space as $W$. When such a $T$ exists, by conditioning we obtain
\[
E\big[f'(W^*)\big]=E\big[Wf(W)\big]=E\big[f'(W)T\big]=E\big[f'(W)E(T\mid W)\big],
\]
which reveals that
\[
E(T\mid W=w)=\frac{dF^*(w)}{dF(w)}
\]
is the Radon–Nikodym derivative of the zero bias distribution of $W$ with respect to the distribution of $W$. In particular, as $W^*$ always has an absolutely continuous distribution, for there to exist a $T$ such that (4.155) holds it is necessary for $W$ to be absolutely continuous; naturally, in other cases, considering approximations allows the equality to become relaxed.

Identities of the form (4.155), in some generality, were considered in Cacoullos and Papathanasiou (1992), but $T$ was constrained to be a function of $W$. As we will see, much more flexibility is provided by removing this restriction. Theorem 4.11, of Chatterjee (2008), gives bounds to the normal, in the $L^1$ norm, for a mean zero function $\psi(\mathbf{X})$ of a vector of independent random variables $\mathbf{X}=(X_1,\ldots,X_n)$ taking values in some space $\mathcal{X}$.

For the identity (4.155), or an approximate form thereof, to be useful, a viable $T$ must be produced. Towards this goal, with $\mathbf{X}'$ an independent copy of $\mathbf{X}$, and $A\subset\{1,\ldots,n\}$, let $\mathbf{X}^A$ be the random vector with components
\[
X_j^A=\begin{cases}X_j' & j\in A,\\ X_j & j\notin A.\end{cases} \tag{4.156}
\]
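For $i\in\{1,\ldots,n\}$, writing $i$ for $\{i\}$ when notationally convenient, let
\[
\Delta_i\psi(\mathbf{X})=\psi(\mathbf{X})-\psi\big(\mathbf{X}^i\big), \tag{4.157}
\]
which measures the sensitivity of the function $\psi$ to the values in its $i$th coordinate. Now, for any $A\subset\{1,\ldots,n\}$, let
\[
T_A=\sum_{i\notin A}\Delta_i\psi(\mathbf{X})\,\Delta_i\psi\big(\mathbf{X}^A\big)
\quad\text{and}\quad
T=\frac{1}{2}\sum_{A\subset\{1,\ldots,n\},\ |A|\ne n}\frac{T_A}{\binom{n}{|A|}(n-|A|)}. \tag{4.158}
\]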
Theorem 4.11 Let $W=\psi(\mathbf{X})$ be a function of a vector of independent random variables $\mathbf{X}=(X_1,\ldots,X_n)$, and have mean zero and variance 1. Then, with $\Delta_i$ as defined in (4.157) and $T$ given in (4.158) we have that $ET=1$ and
\[
\big\|\mathcal{L}(W)-\mathcal{L}(Z)\big\|_1\le\sqrt{\frac{2}{\pi}\operatorname{Var}\big(E(T\mid W)\big)}+\frac{1}{2}\sum_{i=1}^nE\big|\Delta_i\psi(\mathbf{X})\big|^3.
\]
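We present the proof, from Chatterjee (2008), at the end of this section. To explore a simple application, let $\psi(\mathbf{X})=\sum_{i=1}^nX_i$ where $X_1,\ldots,X_n$ are independent with mean zero, variances $\sigma_1^2,\ldots,\sigma_n^2$ summing to one, and fourth moments $\tau_1,\ldots,\tau_n$. For $A\subset\{1,\ldots,n\}$ and $i\notin A$,
\begin{align*}
\Delta_i\psi\big(\mathbf{X}^A\big)&=\psi\big(\mathbf{X}^A\big)-\psi\big(\mathbf{X}^{A\cup i}\big)\\
&=\Big(\sum_{j\notin A}X_j+\sum_{j\in A}X_j'\Big)-\Big(\sum_{j\notin A\cup i}X_j+\sum_{j\in A\cup i}X_j'\Big)=X_i-X_i'. \tag{4.159}
\end{align*}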
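Hence,
\[
T_A=\sum_{i\notin A}\Delta_i\psi(\mathbf{X})\,\Delta_i\psi\big(\mathbf{X}^A\big)=\sum_{i\notin A}\big(X_i-X_i'\big)^2,
\]
and
\begin{align*}
T&=\frac{1}{2}\sum_{A\subset\{1,\ldots,n\},\ |A|\ne n}\frac{T_A}{\binom{n}{|A|}(n-|A|)}\\
&=\frac{1}{2}\sum_{a=0}^{n-1}\frac{1}{\binom{n}{a}(n-a)}\sum_{A\subset\{1,\ldots,n\},\ |A|=a}T_A\\
&=\frac{1}{2}\sum_{a=0}^{n-1}\frac{1}{\binom{n}{a}(n-a)}\sum_{A\subset\{1,\ldots,n\},\ |A|=a}\ \sum_{i\notin A}\big(X_i-X_i'\big)^2\\
&=\frac{1}{2}\sum_{a=0}^{n-1}\frac{1}{\binom{n}{a}(n-a)}\sum_{i=1}^n\ \sum_{A\subset\{1,\ldots,n\},\ |A|=a,\ A\not\ni i}\big(X_i-X_i'\big)^2.
\end{align*}
As for each $i\in\{1,\ldots,n\}$ there are $\binom{n-1}{a}$ subsets $A$ of size $a$ that do not contain $i$, we obtain
\begin{align*}
T&=\frac{1}{2}\sum_{a=0}^{n-1}\frac{1}{\binom{n}{a}(n-a)}\sum_{i=1}^n\ \sum_{A\subset\{1,\ldots,n\},\ |A|=a,\ A\not\ni i}\big(X_i-X_i'\big)^2\\
&=\frac{1}{2}\sum_{i=1}^n\big(X_i-X_i'\big)^2\sum_{a=0}^{n-1}\frac{\binom{n-1}{a}}{\binom{n}{a}(n-a)}\\
&=\frac{1}{2}\sum_{i=1}^n\big(X_i-X_i'\big)^2.
\end{align*}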
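The reduction of $T$ to $\frac12\sum_i(X_i-X_i')^2$ can be confirmed numerically by evaluating definition (4.158) directly. The sketch below is an illustration only: the function names are assumptions, indices are 0-based, and the full sum over subsets $A$ costs on the order of $n2^n$ evaluations of $\psi$, so it is meant for small $n$.

```python
from itertools import combinations
from math import comb

def swap_in(x, x_prime, A):
    """The vector X^A of (4.156): coordinates in A come from the independent copy."""
    return [x_prime[j] if j in A else x[j] for j in range(len(x))]

def delta(psi, x, x_prime, i):
    """Delta_i psi of (4.157) evaluated at x: psi(x) - psi(x with coordinate i swapped)."""
    return psi(x) - psi(swap_in(x, x_prime, {i}))

def chatterjee_T(psi, x, x_prime):
    """T of (4.158), summing over all proper subsets A of {0, ..., n-1}."""
    n = len(x)
    total = 0.0
    for size in range(n):
        weight = 1.0 / (comb(n, size) * (n - size))
        for A in combinations(range(n), size):
            A = set(A)
            xA = swap_in(x, x_prime, A)
            for i in range(n):
                if i not in A:
                    # Delta_i psi(X) * Delta_i psi(X^A); swapping i in X^A gives X^{A u i}
                    total += weight * delta(psi, x, x_prime, i) * delta(psi, xA, x_prime, i)
    return total / 2

# Hypothetical realizations of X and its independent copy X'.
x = [0.5, -1.0, 2.0, 0.25]
x_prime = [1.5, 0.0, -0.5, 1.0]
T_sum = chatterjee_T(sum, x, x_prime)
```

Passing Python's built-in `sum` as `psi` gives the linear case treated above; any other function of a list can be substituted to explore how $T$ behaves for nonlinear $\psi$.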
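For the first term in the theorem, applying the bound (4.143) with $\mathcal{F}$ the $\sigma$-algebra generated by $\mathbf{X}$ we obtain
\[
\operatorname{Var}\big(E(T\mid W)\big)\le\operatorname{Var}(T)=\frac{1}{4}\sum_{i=1}^n\operatorname{Var}\big(\big(X_i-X_i'\big)^2\big)\le\frac{1}{2}\sum_{i=1}^n\big(\tau_i+3\sigma_i^4\big).
\]
From (4.159),
\[
\frac{1}{2}\sum_{i=1}^nE\big|\Delta_i\psi(\mathbf{X})\big|^3=\frac{1}{2}\sum_{i=1}^nE\big|X_i-X_i'\big|^3\le\frac{1}{2}\sum_{i=1}^n\Big(E\big(X_i-X_i'\big)^4\Big)^{3/4}=\frac{1}{2^{1/4}}\sum_{i=1}^n\big(\tau_i+3\sigma_i^4\big)^{3/4}.
\]
Invoking Theorem 4.11 yields,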
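\[
\big\|\mathcal{L}(W)-\mathcal{L}(Z)\big\|_1\le\sqrt{\frac{1}{\pi}\sum_{i=1}^n\big(\tau_i+3\sigma_i^4\big)}+\frac{1}{2^{1/4}}\sum_{i=1}^n\big(\tau_i+3\sigma_i^4\big)^{3/4}.
\]
When $X_1,\ldots,X_n$ are independent, mean zero variables having common second and fourth moments, say, $\sigma^2$ and $\tau$, respectively, then applying this result to $W=(X_1+\cdots+X_n)/\sqrt n$ yields
\[
\big\|\mathcal{L}(W)-\mathcal{L}(Z)\big\|_1\le n^{-1/2}\Big(\sqrt{\frac{1}{\pi}\big(\tau+3\sigma^4\big)}+\frac{1}{2^{1/4}}\big(\tau+3\sigma^4\big)^{3/4}\Big).
\]
For a different application of Theorem 4.11 we consider normal approximation of quadratic forms. Let $\operatorname{Tr}(A)$ denote the trace of $A$.

Proposition 4.7 Let $\mathbf{X}=(X_1,\ldots,X_n)$ be a vector of independent variables taking the values $+1$, $-1$ with equal probability, $A$ a real symmetric matrix and $Y=\sum_{i\le j}a_{ij}X_iX_j$. Then the mean $\mu$ and variance $\sigma^2$ of $Y$ are given by
\[
\mu=\operatorname{Tr}(A)\quad\text{and}\quad\sigma^2=\frac{1}{2}\operatorname{Tr}\big(A^2\big), \tag{4.160}
\]
and $W=(Y-\mu)/\sigma$ satisfies
\[
\big\|\mathcal{L}(W)-\mathcal{L}(Z)\big\|_1\le\Big(\frac{1}{\pi\sigma^4}\operatorname{Tr}\big(A^4\big)\Big)^{1/2}+\frac{7}{2\sigma^3}\sum_{i=1}^n\Big(\sum_{j=1}^na_{ij}^2\Big)^{3/2}.
\]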
Proof The mean and variance formulas (4.160) can be obtained by specializing Theorems 1.5 and 1.6 of Seber and Lee (2003) to $X$ with the given distribution. By subtracting the mean and then replacing $a_{ij}$ by $a_{ij}/\sigma$ it suffices to prove the result when $a_{ii} = 0$ and $\sigma^2 = 1$. Letting
\[
\psi(x) = \sum_{i<j} a_{ij} x_i x_j
\]
for $x \in \mathbb{R}^n$, with $x^i$ the vector $x$ with $x_i'$ replacing $x_i$, and using the symmetry of $A$, we have
\begin{align*}
\Delta_i \psi(x) = \psi(x) - \psi(x^i)
&= \sum_{j:\, i<j} a_{ij} x_i x_j + \sum_{j:\, j<i} a_{ji} x_j x_i - \sum_{j:\, i<j} a_{ij} x_i' x_j - \sum_{j:\, j<i} a_{ji} x_j x_i' \\
&= (x_i - x_i') \sum_{j=1}^n a_{ij} x_j.
\end{align*}
By replacing $x$ above by $X^A$, for $i \notin A$ we have
\[
\Delta_i \psi(X^A) = (X_i - X_i') \Bigl( \sum_{j \notin A} a_{ij} X_j + \sum_{j \in A} a_{ij} X_j' \Bigr).
\]
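We apply the bound $\operatorname{Var}(E(T|W)) \le \operatorname{Var}(E(T|X))$, from (4.143). For the calculation of $E(T|X)$, with $A \subset \{1, \ldots, n\}$ and $i \notin A$, using that $X_i, X_i'$ take values in $\{-1, 1\}$, we have
\begin{align*}
E\bigl(\Delta_i \psi(X) \Delta_i \psi(X^A) \,\big|\, X\bigr)
&= E\Bigl( (X_i - X_i')^2 \Bigl( \sum_{j=1}^n a_{ij} X_j \Bigr) \Bigl( \sum_{j \notin A} a_{ij} X_j + \sum_{j \in A} a_{ij} X_j' \Bigr) \,\Big|\, X \Bigr) \\
&= \Bigl( \sum_{j=1}^n a_{ij} X_j \Bigr) E\Bigl( (X_i - X_i')^2 \Bigl( \sum_{j \notin A} a_{ij} X_j + \sum_{j \in A} a_{ij} X_j' \Bigr) \,\Big|\, X \Bigr) \\
&= 2 \Bigl( \sum_{j=1}^n a_{ij} X_j \Bigr) E\Bigl( (1 - X_i X_i') \Bigl( \sum_{j \notin A} a_{ij} X_j + \sum_{j \in A} a_{ij} X_j' \Bigr) \,\Big|\, X \Bigr) \\
&= 2 \Bigl( \sum_{j=1}^n a_{ij} X_j \Bigr) \Bigl( \sum_{j \notin A} a_{ij} X_j \Bigr),
\end{align*}
where, since $i \notin A$, all the remaining terms have conditional mean zero. Hence we may write
\[
E\bigl(\Delta_i \psi(X) \Delta_i \psi(X^A) \,\big|\, X\bigr) = 2 \sum_{j \in \{1,\ldots,n\},\, k \notin A} a_{ij} a_{ik} X_j X_k.
\]
Summing over all $i \notin A$, (4.158) yields
\[
E(T_A | X) = 2 \sum_{i \notin A} \; \sum_{j \in \{1,\ldots,n\},\, k \notin A} a_{ij} a_{ik} X_j X_k.
\]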
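From the definition of $T$, again from (4.158),
\begin{align*}
E(T|X) &= \frac{1}{2} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{E(T_A|X)}{\binom{n}{|A|}(n - |A|)} \\
&= \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{i \notin A} \; \sum_{j \in \{1,\ldots,n\},\, k \notin A} a_{ij} a_{ik} X_j X_k \\
&= \sum_{j=1}^n \; \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{i \notin A} \sum_{k \notin A} a_{ij} a_{ik} X_j X_k \\
&= \sum_{1 \le i,j,k \le n} a_{ij} a_{ik} X_j X_k \sum_{A \cap \{i,k\} = \emptyset} \frac{1}{\binom{n}{|A|}(n - |A|)} \\
&= \sum_{1 \le i,j,k \le n} a_{ij} a_{ik} X_j X_k \sum_{a=0}^{n-2} \; \sum_{A \cap \{i,k\} = \emptyset,\, |A| = a} \frac{1}{\binom{n}{a}(n - a)} \\
&= \sum_{1 \le i,j,k \le n} a_{ij} a_{ik} X_j X_k \sum_{a=0}^{n-2} \frac{\binom{n-2}{a}}{\binom{n}{a}(n - a)} \\
&= \sum_{1 \le i,j,k \le n} a_{ij} a_{ik} X_j X_k \sum_{a=0}^{n-2} \frac{n - a - 1}{n(n-1)} \\
&= \sum_{1 \le i,j,k \le n} a_{ij} a_{ik} X_j X_k \, \frac{1}{n(n-1)} \sum_{a=1}^{n-1} a \\
&= \frac{1}{2} \sum_{1 \le i,j,k \le n} a_{ji} a_{ik} X_j X_k = \frac{1}{2} X^{\mathsf{T}} A^2 X.
\end{align*}
Letting $b_{jk} = \sum_{1 \le i \le n} a_{ji} a_{ik}$, the $jk$th element of $A^2$, again using $X_i^2 = 1$,
\[
\operatorname{Var}\bigl(E(T|X)\bigr) = \operatorname{Var}\Bigl( \frac{1}{2} \sum_{1 \le j,k \le n} b_{jk} X_j X_k \Bigr) = \sum_{j<k} b_{jk}^2 \le \frac{1}{2} \operatorname{Tr}\bigl(A^4\bigr).
\]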
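The variance computation for the quadratic form can be checked by brute force for a small matrix, since the expectation over independent symmetric signs is a finite average. The following is a sketch under stated assumptions: the helper names `matmul` and `variance_of_half_quadratic_form` are hypothetical, the matrix is assumed symmetric with zero diagonal and integer entries, and the check enumerates all $2^n$ sign vectors, so it is only practical for small $n$.

```python
from fractions import Fraction
from itertools import product


def matmul(P, Q):
    """Product of two square matrices given as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def variance_of_half_quadratic_form(A):
    """Return (Var((1/2) X^T A^2 X), sum_{j<k} b_jk^2, (1/2) Tr(A^4)),
    where X is uniform on {-1, +1}^n and B = A^2, all computed exactly."""
    n = len(A)
    B = matmul(A, A)
    values = [
        Fraction(1, 2) * sum(B[j][k] * x[j] * x[k]
                             for j in range(n) for k in range(n))
        for x in product((-1, 1), repeat=n)
    ]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    off_diagonal = sum(B[j][k] ** 2 for j in range(n) for k in range(j + 1, n))
    # B is symmetric, so Tr(A^4) = Tr(B^2) is the sum of all squared entries.
    half_trace = Fraction(1, 2) * sum(B[j][k] ** 2
                                      for j in range(n) for k in range(n))
    return variance, off_diagonal, half_trace


if __name__ == "__main__":
    A = [[0, 1, 2, 0], [1, 0, -1, 3], [2, -1, 0, 1], [0, 3, 1, 0]]
    print(variance_of_half_quadratic_form(A))
```

The first two returned values agree exactly, and the third dominates them, the gap being half the sum of the squared diagonal entries of $A^2$.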
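To bound the final term in Theorem 4.11, we apply Khintchine's inequality, see Haagerup (1982), which yields
\[
E\Bigl| \sum_{j=1}^n a_j X_j \Bigr|^p \le B_p^p \Bigl( \sum_{j=1}^n a_j^2 \Bigr)^{p/2}
\quad \text{where} \quad
B_p = \begin{cases} 1 & 0 < p \le 2, \\ 2^{1/2} \bigl( \Gamma((p+1)/2)/\sqrt{\pi} \bigr)^{1/p} & 2 < p < \infty. \end{cases}
\]
In particular $B_3^3 \le 1.6$, and using the fact that $X_i$ is independent of the event $\{X_i \neq X_i'\}$, we obtain
\[
E\bigl|\Delta_i \psi(X)\bigr|^3 = 4 E\Bigl| \sum_{j=1}^n a_{ij} X_j \Bigr|^3 \le 7 \Bigl( \sum_{j=1}^n a_{ij}^2 \Bigr)^{3/2}.
\]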
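The numerical constants in this step can be reproduced from the stated form of the Khintchine constant. The sketch below is an illustration only; the function name `khintchine_constant` is hypothetical, and the formula coded is the one quoted above, with the gamma function evaluated by the standard library.

```python
import math


def khintchine_constant(p: float) -> float:
    """B_p as quoted in the text: 1 for 0 < p <= 2, and
    sqrt(2) * (Gamma((p + 1) / 2) / sqrt(pi))^(1/p) for p > 2."""
    if p <= 0:
        raise ValueError("p must be positive")
    if p <= 2:
        return 1.0
    return math.sqrt(2.0) * (math.gamma((p + 1) / 2) / math.sqrt(math.pi)) ** (1.0 / p)


if __name__ == "__main__":
    b33 = khintchine_constant(3.0) ** 3
    # B_3^3 = 2^{3/2} / sqrt(pi); the factor 4 comes from E|X_i - X_i'|^3 = 4.
    print(b33, 4 * b33)
```

Since $\Gamma(2) = 1$, the cube of $B_3$ is $2^{3/2}/\sqrt{\pi}$, which is below 1.6, and four times this value is below 7, matching the constant used in the display.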
To consider some further examples, we make the following definition. With $\mathcal{X}$ the space in which our random variables take values, given $n \in \mathbb{N}$ suppose there is a map $G$, or 'graphical rule', which to every $x \in \mathcal{X}^n$ assigns an undirected graph, that is, a collection of edges $G(x)$ on the vertices $\{1, \ldots, n\}$. We will say the map $G$ is symmetric if it respects the action of permutations, that is, if for every permutation $\pi$ of $\{1, \ldots, n\}$ and any $(x_1, \ldots, x_n) \in \mathcal{X}^n$,
\[
\bigl\{ \{i, j\} : \{i, j\} \in G(x_{\pi(1)}, \ldots, x_{\pi(n)}) \bigr\} = \bigl\{ \{\pi(i), \pi(j)\} : \{i, j\} \in G(x_1, \ldots, x_n) \bigr\}.
\]
Now fixing $m > n$, we say the vector $x \in \mathcal{X}^n$ is embedded in the vector $y \in \mathcal{X}^m$ if there exist distinct indices $i_1, \ldots, i_n$ in $\{1, \ldots, m\}$ with $x_k = y_{i_k}$ for $1 \le k \le n$. A graphical rule $G'$ on $\mathcal{X}^m$ will be called an extension of the rule $G$ if whenever the vector $x \in \mathcal{X}^n$ is embedded in $y \in \mathcal{X}^m$ the graph $G(x)$ on $\{1, \ldots, n\}$ is the naturally induced subgraph of $G'(y)$ on $\{1, \ldots, m\}$.

Now let $x$ and $x'$ be any two elements of $\mathcal{X}^n$. For every $i \in \{1, \ldots, n\}$, let $x^i$ be the vector obtained by replacing $x_i$ by $x_i'$ in $x$, and, for $i$ and $j$ distinct elements of $\{1, \ldots, n\}$, let $x^{ij}$ be similarly obtained by replacing $x_i$ and $x_j$ in $x$ by $x_i'$ and $x_j'$, respectively. With $\psi : \mathcal{X}^n \to \mathbb{R}$, we say the coordinates $i$ and $j$ are non-interacting with respect to the triple $(\psi, x, x')$ if
\[
\psi(x) - \psi(x^j) = \psi(x^i) - \psi(x^{ij}).
\]
We will say that $G$ is an interaction rule for a function $\psi$ if for any choice of $x, x'$ and $i, j$, the event that $\{i, j\}$ is not an edge in the graphs $G(x), G(x^i), G(x^j), G(x^{ij})$ implies that $i$ and $j$ are non-interacting vertices with respect to $(\psi, x, x')$. With these definitions in hand, we can now state the following theorem; we present the proof, from Chatterjee (2008), at the end of this section.

Theorem 4.12 Let the symmetric map $G$ be an interaction rule for $\psi : \mathcal{X}^n \to \mathbb{R}$, and $X = (X_1, \ldots, X_n)$ a vector of i.i.d. $\mathcal{X}$-valued variates such that $W = \psi(X)$ has mean zero and variance 1. For each $i \in \{1, \ldots, n\}$ define
\[
\Delta_i \psi(X) = \psi(X) - \psi(X^i),
\]
where $X'$ is an independent copy of $X$, and let
\[
M = \max_{i=1,\ldots,n} \bigl|\Delta_i \psi(X)\bigr|. \tag{4.161}
\]
Let $G'$ be any extension of $G$ on $\mathcal{X}^{n+4}$, and set
\[
\delta = 1 + \text{degree of vertex 1 in } G'(X_1, \ldots, X_{n+4}). \tag{4.162}
\]
Then for some universal constant $C$,
\[
\bigl\|\mathcal{L}(W) - \mathcal{L}(Z)\bigr\|_1 \le C n^{1/2} \bigl(E M^8\bigr)^{1/4} \bigl(E \delta^4\bigr)^{1/4} + \frac{1}{2} \sum_{i=1}^n E\bigl|\Delta_i \psi(X)\bigr|^3.
\]
Following Chatterjee (2008), we apply Theorem 4.12 to prove an $L^1$ bound to the normal for two problems which stem from the theory of coverage processes: the volume of the region covered by the union of $n$ balls with random centers and some radius, and the number of such centers that are isolated at some radius; see Hall (1988) and Penrose (2003) for more background. Generally, we may work in a separable metric space $(\mathcal{X}, \rho)$, and for the first case, we take as given one endowed with measure $\lambda$. Let the components of $X = (X_1, \ldots, X_n)$ be i.i.d. with values in $\mathcal{X}$. For some fixed radius $r > 0$, let $R$ be given by
\[
R = R(X) \quad \text{where} \quad R(x) = \bigcup_{i=1}^n B(x_i, r), \tag{4.163}
\]
with $B(x, r)$ the closed ball of radius $r$ centered at $x$. Proposition 4.8 gives an $L^1$ bound to the normal for the 'covered volume' $\lambda(R(X))$ in terms of
\[
K_V = \sup_{u \in \mathcal{X}} \lambda\bigl(B(u, r)\bigr), \tag{4.164}
\]
an upper bound to the volume of any ball of radius $r$.
By very similar reasoning, we also derive an $L^1$ bound to the normal for the number $S$ of isolated points, or singletons, given by
\[
S = S(X) \quad \text{where} \quad S(x) = \sum_{i=1}^n \mathbf{1}\bigl( \{x_1, \ldots, x_n\} \cap B(x_i, 2r) = \{x_i\} \bigr), \tag{4.165}
\]
that is, the number of points $X_i$ of $X$ such that the ball $B(X_i, r)$ has empty intersection with $B(X_j, r)$ for all $j \neq i$. Proposition 4.8 gives an $L^1$ bound to the normal for $S$ in terms of
\[
K_S = \sup\bigl\{ k : \exists\, x \in \mathcal{X}^{k+1} \text{ such that } B(x_{k+1}, r) \cap B(x_i, r) \neq \emptyset, \text{ and } B(x_i, r) \cap B(x_j, r) = \emptyset \text{ for all distinct } 1 \le i, j \le k \bigr\}, \tag{4.166}
\]
which is an upper bound to the number of points in any collection from $\mathcal{X}$ which may become isolated upon the removal of a single point. In Euclidean space, the number $K_S$ is a lower bound to the so called kissing number, the maximum number of spheres of radius 1 that can simultaneously touch the unit sphere at the origin; see Zong (1999), Conway and Sloane (1999), and Leech and Sloane (1971) for estimates on the kissing number. For example, in two dimensions $K_S = 5$, since at most five unit circles can intersect another unit circle without intersecting each other, while the kissing number in two dimensions is 6.

Proposition 4.8 With $p = P(\rho(X_1, X_2) \le 2r)$ we have
\[
\bigl\|\mathcal{L}(W_V) - \mathcal{L}(Z)\bigr\|_1 \le C n^{1/2} \frac{K_V^2 (1 + np)}{\sigma_V^2} + \frac{n K_V^3}{2 \sigma_V^3}
\]
for some universal constant $C$, with $\mu_V = E Y_V$, $\sigma_V^2 = \operatorname{Var}(Y_V)$ and $W_V = (Y_V - \mu_V)/\sigma_V$, when $Y_V = \lambda(R)$ with $R$ as given in (4.163) and $K_V$ as in (4.164). The same bound holds for $Y_S = S$ in (4.165) and $W_S = (Y_S - \mu_S)/\sigma_S$ where $\mu_S = E Y_S$ and $\sigma_S^2 = \operatorname{Var}(Y_S)$, with the same constant $C$, upon replacing $\sigma_V$ and $K_V$ by $\sigma_S$ and $K_S$, respectively.

Proof It suffices to prove the result when the variables are standardized to have mean zero and variance one; we apply Theorem 4.12. First we consider $R$, and let $\psi(x) = \lambda(R(x))$ for $x \in \mathcal{X}^n$. Let $G(x)$ be the graph on $\{1, \ldots, n\}$ with edges between points $i$ and $j$ if and only if $\rho(x_i, x_j) \le 2r$. Clearly the graphical rule $G$ is symmetric, as distances are unchanged by relabeling. We verify that $G$ is an interaction rule as follows. With $x$ and $x'$ any points in $\mathcal{X}^n$, let $x^i$ and $x^{ij}$ be obtained by replacing the $i$th, or both the $i$th and $j$th, coordinate respectively of $x$ by those of $x'$. Writing $B_j$ and $B_j'$ for $B(x_j, r)$ and $B(x_j', r)$ respectively, we let
\[
R_i = \bigcup_{j \neq i} B_j \quad \text{so that} \quad \psi(x) = \lambda(R_i \cup B_i) \quad \text{and} \quad \psi(x^i) = \lambda(R_i \cup B_i').
\]
Hence,
\begin{align*}
\psi(x) - \psi(x^i) &= \lambda(R_i \cup B_i) - \lambda(R_i \cup B_i') \\
&= \lambda\bigl( (R_i \cup B_i) \cap (R_i \cup B_i')^c \bigr) - \lambda\bigl( (R_i \cup B_i') \cap (R_i \cup B_i)^c \bigr) \\
&= \lambda\bigl( B_i \cap (B_i')^c \cap R_i^c \bigr) - \lambda\bigl( B_i' \cap B_i^c \cap R_i^c \bigr) \\
&= \lambda\bigl( B_i \cap R_i^c \bigr) - \lambda\bigl( B_i' \cap R_i^c \bigr),
\end{align*}
where we obtain the last equality by adding and subtracting $\lambda(B_i \cap B_i' \cap R_i^c)$. Hence, with $N_i(x)$ being the set of indices $j \neq i$ of the neighbors of $x_i$ in the graph $G(x)$,
\[
\psi(x) - \psi(x^i) = \lambda\Bigl( B_i \cap \bigcap_{j \in N_i(x)} B_j^c \Bigr) - \lambda\Bigl( B_i' \cap \bigcap_{j \in N_i(x^i)} B_j^c \Bigr). \tag{4.167}
\]
The pair $\{i, j\}$ fails to be an edge in the graphs $G(x), G(x^i), G(x^j), G(x^{ij})$ if and only if no member of $\{x_i, x_i'\}$ is a neighbor of a member of $\{x_j, x_j'\}$, in which case $N_i(x) = N_i(x^j)$ and $N_i(x^i) = N_i(x^{ij})$, and hence, by (4.167), $\psi(x) - \psi(x^i) = \psi(x^j) - \psi(x^{ij})$, which is a rearrangement of the non-interaction condition. Thus $G$ is an interaction rule. In addition, (4.167) shows that for all $x, x' \in \mathcal{X}^n$ and all $i = 1, \ldots, n$
\[
\bigl|\Delta_i \psi(x)\bigr| = \bigl|\psi(x) - \psi(x^i)\bigr| \le \max\bigl\{ \lambda\bigl(B(x_i, r)\bigr), \lambda\bigl(B(x_i', r)\bigr) \bigr\} \le K_V. \tag{4.168}
\]
Hence we may take $M = K_V$ in the first term in the bound of Theorem 4.12, and also apply this same estimate to the second term. Defining the graph $G'$ on $(x_1, \ldots, x_{n+4}) \in \mathcal{X}^{n+4}$ by placing edges between any two points using the same rule as for $G$, the rule $G'$ clearly extends $G$. As each of the $n + 3$ points $x_2, \ldots, x_{n+4}$ is independently a neighbor of $x_1$ with probability $p$, we have that $\delta - 1 \sim \operatorname{Bin}(n + 3, p)$. As $E(\delta - 1)^4 = n^4 p^4 + O(n^3)$, we may bound $(E\delta^4)^{1/4}$ by some constant times $1 + np$, completing the argument for $R$.

The calculation for $S$ is similar. Let $\psi(x) = S(x)$ and take $G$ to be the same graphical rule as the one used for $R$. As the removal of a point from $x \in \mathcal{X}^n$ can cause at most $K_S$ points to become isolated,
\[
\bigl|\Delta_i \psi(x)\bigr| = \bigl|\psi(x) - \psi(x^i)\bigr| \le K_S.
\]
As the graph for $S$ is the same as for $R$, the distribution and bounds for the degree $\delta$ are the same as for $R$.

To test the quality of the bounds, we specialize to Euclidean space, and in the case of $V$, let $\lambda$ be the Lebesgue measure. Specializing a bit further, we take the points $X_1, \ldots, X_n$ uniformly and independently in the cube $C_n = [0, n^{1/d})^d$ in $\mathbb{R}^d$, with periodic boundary conditions. Then letting $v_\rho = \rho^d \pi^{d/2} / \Gamma(1 + d/2)$, the volume of the radius $\rho$ ball in dimension $d$, we have $K_V = v_r$. Now assuming $r \le n^{1/d}/2$ we have $p = v_{2r}/n$. By Goldstein and Penrose (2010),
\[
\lim_{n \to \infty} n^{-1} \sigma_V^2 = g_V \tag{4.169}
\]
with an explicit $g_V > 0$, showing the bound of Proposition 4.8 to be of order $n^{-1/2}$. Similar remarks apply to $S$. In particular, $K_S$, as a lower bound on the kissing number, is bounded in any dimension as $n \to \infty$, and (4.169) holds for some $g_S > 0$ when $\sigma_V^2$ is replaced by $\sigma_S^2$. At the cost of considerably more effort, Goldstein and Penrose (2010) apply Theorem 5.6 to obtain bounds of order $n^{-1/2}$ for the Kolmogorov distance for both the standardized $V$ and $S$, with explicit constants.

Though Chatterjee's approach might at first glance seem to bear little connection to the methods already presented, and (4.158) indeed appears a bit mysterious, Chen and Röllin (2010) have an interpretation which fits it into a general framework that contains a number of previous techniques mentioned, the exchangeable pair and size bias methods in particular. Chen and Röllin (2010) consider an identity of the form
\[
E\bigl\{ G f(W') - G f(W) \bigr\} = E\bigl\{ W f(W) \bigr\}, \tag{4.170}
\]
for some triple $(W, W', G)$ of square integrable random variables. If $W, W'$ is a $\lambda$-Stein pair then by (2.35) identity (4.170) is satisfied with
\[
G = \frac{1}{2\lambda}(W' - W).
\]
If $Y^s$ is on the same space as $Y$ and has the $Y$-size biased distribution, and if $EY = \mu$ and $\operatorname{Var}(Y) = \sigma^2$, then by (2.64) the variables $W = (Y - \mu)/\sigma$ and $W' = (Y^s - \mu)/\sigma$ satisfy (4.170) with $G = \mu/\sigma$.

Chatterjee's approach is also included in the framework of Chen and Röllin (2010), by the method of 'interpolation to independence', as follows. Suppose $W$ is a mean zero, variance 1 random variable, and for each $i \in \{1, \ldots, n\}$ we have a random variable $W_i'$ which is close in some sense to $W$. Suppose there exists a sequence of random variables $V_0, \ldots, V_n$ such that $V_0 = W$, that $V_0$ and $V_n$ are independent, and that
\[
\bigl( (W, V_{i-1}), (W_i', V_i) \bigr) =_d \bigl( (W_i', V_i), (W, V_{i-1}) \bigr) \quad \text{for all } i = 1, \ldots, n.
\]
Note in particular we must therefore have $W =_d W_i'$ and $V_i =_d V_{i-1}$, so all elements of the sequence $V_0, \ldots, V_n$ are equal in distribution, and have mean $E[V_0] = EW = 0$. Given such variables, letting $I$ be uniform over $\{1, \ldots, n\}$ and independent of the remaining variables and
\[
G = \frac{n}{2}(V_I - V_{I-1}),
\]
we have, by telescoping the sum, using the independence of $V_n$ and $W$ on the first term and taking conditional expectation with respect to $W$ on the second, that
\[
E\bigl\{ G f(W) \bigr\} = \frac{1}{2} E \sum_{i=1}^n (V_i - V_{i-1}) f(W) = \frac{1}{2} E (V_n - V_0) f(W) = -\frac{1}{2} E\bigl\{ W f(W) \bigr\},
\]
while
\[
E\bigl\{ G f(W') \bigr\} = \frac{1}{2} E \sum_{i=1}^n (V_i - V_{i-1}) f(W_i') = -\frac{1}{2} E \sum_{i=1}^n (V_i - V_{i-1}) f(W) = -E\bigl\{ G f(W) \bigr\} = \frac{1}{2} E\bigl\{ W f(W) \bigr\}.
\]
Hence (4.170) is satisfied with $W' = W_I'$.

Now when $W = \psi(X)$, a mean zero, variance one function of i.i.d. variables $X_1, \ldots, X_n$, one can construct the required sequence $V_0, \ldots, V_n$ by setting $V_i$ to be the function $\psi$ evaluated on $X_1', \ldots, X_i', X_{i+1}, \ldots, X_n$ where $X_i'$ is an independent copy of $X_i$. Let also $W_i' = \psi(X^i)$, where $X^i$ is the vector $X$ with $X_i'$ replacing $X_i$. It is clear that $V_0 = W$, and is independent of $V_n$. In the notation of (4.156) we have
\[
W_i' = \psi(X^i) \quad \text{and} \quad V_i = \psi\bigl(X^{\{1,\ldots,i\}}\bigr).
\]
Now consider the variation where $\pi$ is a random permutation independent of the remaining variables, and we interpolate to independence in the order determined by $\pi$, that is,
\[
W_i' = \psi\bigl(X^{\pi(i)}\bigr) \quad \text{and} \quad V_i = \psi\bigl(X^{\{\pi(1),\ldots,\pi(i)\}}\bigr).
\]
Then (4.170) is satisfied with G=
1 Wπ(I ) − Wπ(I −1) , 2n
where $I$ is an independent index chosen uniformly from $\{1, \ldots, n\}$. Moreover, bounds to the normal in this framework involve conditional expectations such as (4.144), and in particular $E(G(W' - W) | X, X')$ is the expression (4.158), see Chen and Röllin (2010) for details.

We now present the proof of Theorems 4.11 and 4.12, starting with some preliminary lemmas.

Lemma 4.8 Let $X = (X_1, \ldots, X_n)$ be a random vector with independent $\mathcal{X}$-valued components. Then, for any functions $\phi, \psi : \mathcal{X}^n \to \mathbb{R}$ such that $E\phi(X)^2$ and $E\psi(X)^2$ are both finite,
\[
\operatorname{Cov}\bigl(\phi(X), \psi(X)\bigr) = \frac{1}{2} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\bigl\{ \Delta_j \phi(X)\, \Delta_j \psi(X^A) \bigr\}.
\]
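Proof First, we claim that
\begin{align*}
\sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} \Delta_j \psi(X^A)
&= \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} \bigl( \psi(X^A) - \psi(X^{A \cup j}) \bigr) \\
&= \psi(X) - \psi(X'). \tag{4.171}
\end{align*}
In particular, note that for any set $A \subset \{1, \ldots, n\}$, except $A = \{1, \ldots, n\}$, as there are $n - |A|$ elements $j \notin A$, these sets appear in (4.171) with a positive sign a total of
\[
\frac{1}{\binom{n}{|A|}(n - |A|)} \times (n - |A|) = \frac{1}{\binom{n}{|A|}}
\]
times. Similarly, any set $B \subset \{1, \ldots, n\}$, except $B = \emptyset$, can be represented as $B = A \cup j$ for $|B|$ different sets $A$, so these sets appear with a negative sign a total of
\[
\frac{1}{\binom{n}{|B|-1}(n - |B| + 1)} \times |B| = \frac{1}{\binom{n}{|B|}}
\]
times. Hence only the terms $A = \emptyset$ and $A \cup j = \{1, \ldots, n\}$ do not cancel out, the first one appearing with a coefficient of $1/\binom{n}{0} = 1$, and the latter with coefficient $-1/\binom{n}{n} = -1$.

Now, for a fixed $A$ and $j \notin A$ let $U = \phi(X) \Delta_j \psi(X^A)$, a function of the random vectors $X$ and $X'$. Note that upon interchanging $X_j$ and $X_j'$ the joint distribution of $(X, X')$ is unchanged, while $U$ becomes $U' = -\phi(X^j) \Delta_j \psi(X^A)$. Thus,
\[
EU = EU' = \frac{1}{2} E(U + U') = \frac{1}{2} E\bigl\{ \Delta_j \phi(X)\, \Delta_j \psi(X^A) \bigr\}.
\]
Combining these observations yields
\begin{align*}
\operatorname{Cov}\bigl(\phi(X), \psi(X)\bigr) &= E\bigl\{ \phi(X)\psi(X) \bigr\} - E\phi(X)\, E\psi(X) \\
&= E\bigl\{ \phi(X) \bigl( \psi(X) - \psi(X') \bigr) \bigr\} \\
&= \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\bigl\{ \phi(X)\, \Delta_j \psi(X^A) \bigr\} \\
&= \frac{1}{2} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\bigl\{ \Delta_j \phi(X)\, \Delta_j \psi(X^A) \bigr\},
\end{align*}
as desired.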
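The covariance identity of Lemma 4.8 can be verified exactly in a small discrete case by enumerating all values of $(X, X')$. The sketch below assumes, purely for illustration, that the coordinates are i.i.d. uniform on a finite set of integers; the helper names `swap` and `covariance_two_ways`, and the test functions used in the example, are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def swap(x, xp, idx):
    """The vector X^A: coordinates of x in idx replaced by those of xp."""
    return tuple(xp[i] if i in idx else x[i] for i in range(len(x)))


def covariance_two_ways(phi, psi, support, n):
    """Return (Cov(phi(X), psi(X)), right hand side of Lemma 4.8), exactly,
    for X with i.i.d. coordinates uniform on `support` (an assumption)."""
    points = list(product(support, repeat=n))
    p = Fraction(1, len(points))
    e_phi = sum(phi(x) for x in points) * p
    e_psi = sum(psi(x) for x in points) * p
    direct = sum(phi(x) * psi(x) for x in points) * p - e_phi * e_psi

    total = Fraction(0)
    for x in points:
        for xp in points:
            for a in range(n):  # subsets A with |A| != n
                weight = Fraction(1, comb(n, a) * (n - a))
                for subset in combinations(range(n), a):
                    A = set(subset)
                    x_A = swap(x, xp, A)
                    for j in range(n):
                        if j in A:
                            continue
                        d_phi = phi(x) - phi(swap(x, xp, {j}))
                        d_psi = psi(x_A) - psi(swap(x, xp, A | {j}))
                        total += weight * d_phi * d_psi
    return direct, total * p * p / 2


if __name__ == "__main__":
    phi = lambda x: x[0] * x[1] + x[2]
    psi = lambda x: max(x) + x[0] * x[2]
    print(covariance_two_ways(phi, psi, (0, 1, 3), 3))
```

The two returned values coincide for any choice of integer valued test functions; the enumeration grows like the square of the number of points, so it is meant only for very small $n$.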
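Lemma 4.9 Let $W = \psi(X)$ with $EW = 0$ and $\operatorname{Var}(W) = 1$ where $X = (X_1, \ldots, X_n)$ is a vector of $\mathcal{X}$-valued, independent components, and let $T$ be given by (4.158). Then, for any twice continuously differentiable function $f$ with bounded second derivative, we have
\[
\bigl| E f(W) W - E f'(W) T \bigr| \le \frac{\|f''\|}{4} \sum_{j=1}^n E\bigl|\Delta_j \psi(X)\bigr|^3.
\]

Proof For each $A \subset \{1, \ldots, n\}$ and $j \notin A$, let
\[
R_{A,j} = \Delta_j (f \circ \psi)(X)\, \Delta_j \psi(X^A) \quad \text{and} \quad \widetilde{R}_{A,j} = f'\bigl(\psi(X)\bigr) \Delta_j \psi(X)\, \Delta_j \psi(X^A).
\]
By Lemma 4.8 with $\phi = f \circ \psi$, we have
\[
E f(W) W = \frac{1}{2} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E R_{A,j}. \tag{4.172}
\]
By the mean value theorem, and Hölder's inequality, we have
\[
E\bigl| R_{A,j} - \widetilde{R}_{A,j} \bigr| \le \frac{\|f''\|}{2} E\bigl| \bigl(\Delta_j \psi(X)\bigr)^2 \Delta_j \psi(X^A) \bigr| \le \frac{\|f''\|}{2} E\bigl|\Delta_j \psi(X)\bigr|^3, \tag{4.173}
\]
where the second inequality also uses that $\Delta_j \psi(X^A)$ and $\Delta_j \psi(X)$ are equal in distribution. From the definition of $T$,
\[
f'(W) T = \frac{1}{2} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} \widetilde{R}_{A,j}. \tag{4.174}
\]
Combining (4.172), (4.174) and (4.173), we obtain
\begin{align*}
\bigl| E f(W) W - E f'(W) T \bigr|
&\le \frac{1}{2} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\bigl| R_{A,j} - \widetilde{R}_{A,j} \bigr| \\
&\le \frac{\|f''\|}{4} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\bigl|\Delta_j \psi(X)\bigr|^3 \\
&= \frac{\|f''\|}{4} \sum_{j=1}^n E\bigl|\Delta_j \psi(X)\bigr|^3,
\end{align*}
as claimed.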
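Proof of Theorem 4.11 Let $h$ be any absolutely continuous function with $\|h'\| \le 1$, and let $f$ be the solution to the Stein equation for $h$,
\[
E h(W) - N h = E\bigl\{ f'(W) - W f(W) \bigr\}.
\]
By (2.13) of Lemma 2.4, we have that $\|f'\| \le \sqrt{2/\pi}$ and $\|f''\| \le 2$. Setting $\phi = \psi$ in Lemma 4.8, we obtain $ET = EW^2 = 1$. Therefore
\begin{align*}
\bigl| E h(W) - N h \bigr| &= \bigl| E\bigl\{ f'(W) - W f(W) \bigr\} \bigr| \\
&\le \bigl| E\bigl\{ f'(W) - f'(W) T \bigr\} \bigr| + \bigl| E\bigl\{ f'(W) T - W f(W) \bigr\} \bigr| \\
&\le \sqrt{2/\pi}\, E\bigl| E(T|W) - 1 \bigr| + \bigl| E\bigl\{ f'(W) T - W f(W) \bigr\} \bigr| \\
&\le \sqrt{2/\pi}\, \bigl[ \operatorname{Var}\bigl(E(T|W)\bigr) \bigr]^{1/2} + \frac{1}{2} \sum_{j=1}^n E\bigl|\Delta_j \psi(X)\bigr|^3,
\end{align*}
by the Cauchy–Schwarz inequality, and Lemma 4.9. The proof is completed by taking supremum over $h$, noting (4.8).

We now proceed to the proof of Theorem 4.12. By Theorem 4.11, it suffices to bound $\operatorname{Var}(E(T|X))$. For this reason, the proof of Theorem 4.12 follows quickly from the following upper bound.

Lemma 4.10 Let $X$ be a vector of i.i.d. variates, $A \subset \{1, \ldots, n\}$ with $|A| \neq n$, and $T_A$, $M$ and $\delta$ given by (4.158), (4.161) and (4.162), respectively. Then there exists a constant $C$ such that
\[
\operatorname{Var}\bigl(E(T_A|X)\bigr) \le C \bigl(E M^8\bigr)^{1/2} \bigl(E \delta^4\bigr)^{1/2} \sqrt{n(n - |A|)}.
\]
For the remainder of this section, we make the convention that constants $C$ need not be the same at each occurrence. Deferring the proof of Lemma 4.10, we present the proof of Theorem 4.12.

Proof By the definition of $T$ and Minkowski's inequality, we obtain
\[
\bigl[ \operatorname{Var}\bigl(E(T|X)\bigr) \bigr]^{1/2} \le \frac{1}{2} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{\bigl[ \operatorname{Var}\bigl(E(T_A|X)\bigr) \bigr]^{1/2}}{\binom{n}{|A|}(n - |A|)}.
\]
Substituting the bound from Lemma 4.10 yields
\begin{align*}
\bigl[ \operatorname{Var}\bigl(E(T|X)\bigr) \bigr]^{1/2}
&\le C \bigl(E M^8\bigr)^{1/4} \bigl(E \delta^4\bigr)^{1/4} \sum_{A \subset \{1,\ldots,n\},\, |A| \neq n} \frac{n^{1/4} (n - |A|)^{1/4}}{\binom{n}{|A|}(n - |A|)} \\
&= C \bigl(E M^8\bigr)^{1/4} \bigl(E \delta^4\bigr)^{1/4} n^{1/4} \sum_{k=1}^n k^{-3/4} \\
&\le C \bigl(E M^8\bigr)^{1/4} \bigl(E \delta^4\bigr)^{1/4} n^{1/2}.
\end{align*}
Now invoking Theorem 4.11 completes the proof.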
It remains to prove Lemma 4.10. We proceed by way of the following preliminary result.

Lemma 4.11 Suppose that $G$ is a symmetric graphical rule on $\mathcal{X}^n$ and $X = (X_1, \ldots, X_n)$ is a vector of i.i.d. $\mathcal{X}$-valued random variables. Let $d_1$ be the degree of vertex 1 in $G(X)$, and, for any $k \le n - 1$, let $i, i_1, \ldots, i_k$ be any collection of $k + 1$ distinct elements of $\{1, \ldots, n\}$. Then
\[
P\bigl( \{i, i_l\} \in G(X) \text{ for all } 1 \le l \le k \bigr) = \frac{E(d_1)_k}{(n-1)_k}, \tag{4.175}
\]
where $(r)_k$ stands for the falling factorial $r(r-1) \cdots (r-k+1)$.

Proof Since $G$ is a symmetric rule and $X_1, \ldots, X_n$ are i.i.d., the probability
\[
P\bigl( \{i, i_l\} \in G(X) \text{ for all } 1 \le l \le k \bigr)
\]
does not depend on $i, i_1, \ldots, i_k$. Hence
\[
P\bigl( \{i, i_l\} \in G(X) \text{ for all } 1 \le l \le k \bigr)
= \frac{1}{(n-1)_k} \sum_{\substack{j_1, \ldots, j_k \in \{1,\ldots,n\} \setminus \{i\} \\ |\{j_1,\ldots,j_k\}| = k}} P\bigl( \{i, j_l\} \in G(X) \text{ for all } 1 \le l \le k \bigr).
\]
Lastly, note that
\[
\sum_{\substack{j_1, \ldots, j_k \in \{1,\ldots,n\} \setminus \{i\} \\ |\{j_1,\ldots,j_k\}| = k}} \mathbf{1}\bigl( \{i, j_l\} \in G(X) \text{ for all } 1 \le l \le k \bigr) = (d_i)_k,
\]
where $d_i$ is the degree of vertex $i$. As $d_i$ and $d_1$ have the same distribution, the argument is complete.

To prove Lemma 4.10 we require the following result, the Efron–Stein inequality, see Efron and Stein (1981), and Steele (1986).

Lemma 4.12 Let $U = g(Y_1, \ldots, Y_m)$ be a function of independent random objects $Y_1, \ldots, Y_m$, and let $Y_i'$ be an independent copy of $Y_i$ for $i = 1, \ldots, m$. Then
\[
\operatorname{Var}(U) \le \frac{1}{2} \sum_{i=1}^m E\bigl( g(Y_1, \ldots, Y_{i-1}, Y_i', Y_{i+1}, \ldots, Y_m) - g(Y_1, \ldots, Y_m) \bigr)^2.
\]
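The Efron–Stein inequality of Lemma 4.12 can be illustrated by an exact computation when the $Y_i$ are discrete. The sketch below assumes, for illustration only, that the coordinates are i.i.d. uniform on a finite set of integers; the function name `efron_stein_sides` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def efron_stein_sides(g, support, m):
    """Return (Var(g(Y)), Efron-Stein upper bound), computed exactly,
    for Y with i.i.d. coordinates uniform on `support` (an assumption)."""
    points = list(product(support, repeat=m))
    p = Fraction(1, len(points))       # probability of each value of Y
    q = Fraction(1, len(support))      # probability of each value of Y_i'
    mean = sum(Fraction(g(y)) for y in points) * p
    variance = sum((g(y) - mean) ** 2 for y in points) * p

    bound = Fraction(0)
    for y in points:
        for i in range(m):
            for replacement in support:
                # Replace the i-th coordinate by an independent copy.
                y_new = y[:i] + (replacement,) + y[i + 1:]
                bound += p * q * (g(y_new) - g(y)) ** 2
    return variance, bound / 2


if __name__ == "__main__":
    print(efron_stein_sides(max, (0, 1, 2), 3))
    print(efron_stein_sides(sum, (0, 1, 2), 3))
```

For a sum of the coordinates the two sides agree, which shows the inequality cannot be improved in general, while for the maximum the bound is strict.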
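Proof of Lemma 4.10 Fix $A \subset \{1, \ldots, n\}$ with $|A| \neq n$. For each $j \notin A$, let
\[
R_j = \Delta_j \psi(X)\, \Delta_j \psi(X^A) = \bigl( \psi(X) - \psi(X^j) \bigr) \bigl( \psi(X^A) - \psi(X^{A \cup j}) \bigr).
\]
Let $Y = (Y_1, \ldots, Y_n)$ be a copy of $X$, which is independent of both $X$ and $X'$. For a fixed $i \in \{1, \ldots, n\}$ let
\[
\widetilde{X} = (X_1, \ldots, X_{i-1}, Y_i, X_{i+1}, \ldots, X_n).
\]
Similarly, for each $B \subset \{1, \ldots, n\}$, let
\[
\widetilde{X}^B = \begin{cases} (X_1^B, \ldots, X_{i-1}^B, Y_i, X_{i+1}^B, \ldots, X_n^B) & \text{if } i \notin B, \\ X^B & \text{if } i \in B. \end{cases}
\]
Now let
\[
R_{ji} = \bigl( \psi(\widetilde{X}) - \psi(\widetilde{X}^j) \bigr) \bigl( \psi(\widetilde{X}^A) - \psi(\widetilde{X}^{A \cup j}) \bigr),
\]
and put
\[
h_i = E\Bigl( \sum_{j \notin A} (R_j - R_{ji}) \Bigr)^2.
\]
It follows from inequality (4.143) and Lemma 4.12 that
\[
\operatorname{Var}\bigl(E(T_A|X)\bigr) \le \operatorname{Var}(T_A) \le \frac{1}{2} \sum_{i=1}^n h_i. \tag{4.176}
\]
Hence, we turn our attention to bounding $h_i$, and note that we need only consider $j \notin A$. When $j \neq i$ let
\[
d_{ji}^1 = \mathbf{1}\bigl( \{i, j\} \in G(X) \bigr), \quad
d_{ji}^2 = \mathbf{1}\bigl( \{i, j\} \in G(X^j) \bigr), \quad
d_{ji}^3 = \mathbf{1}\bigl( \{i, j\} \in G(\widetilde{X}) \bigr) \quad \text{and} \quad
d_{ji}^4 = \mathbf{1}\bigl( \{i, j\} \in G(\widetilde{X}^j) \bigr).
\]
Suppose in a particular realization we have $d_{ji}^1 = d_{ji}^2 = d_{ji}^3 = d_{ji}^4 = 0$. Since $G$ is an interaction rule for $\psi$, on this event we have
\[
\psi(X) - \psi(X^j) = \psi(\widetilde{X}) - \psi(\widetilde{X}^j).
\]
If we now take $X^A$ and $\widetilde{X}^A$ in place of $X$ and $\widetilde{X}$, and define $e_{ji}^1, e_{ji}^2, e_{ji}^3$ and $e_{ji}^4$ analogously, then when $e_{ji}^1 = e_{ji}^2 = e_{ji}^3 = e_{ji}^4 = 0$ we have
\[
\psi(X^A) - \psi(X^{A \cup j}) = \psi(\widetilde{X}^A) - \psi(\widetilde{X}^{A \cup j}),
\]
whether $i \in A$ or not. Now, let
\[
L_i = \max_{j \notin A} \bigl| \Delta_j \psi(X) \Delta_j \psi(X^A) - \Delta_j \psi(\widetilde{X}) \Delta_j \psi(\widetilde{X}^A) \bigr|.
\]
From the preceding considerations, when $j \neq i$
\[
|R_j - R_{ji}| \le L_i \sum_{k=1}^4 \bigl( d_{ji}^k + e_{ji}^k \bigr).
\]
When $j = i$ then $i \notin A$ and we have $|R_j - R_{ji}| \le L_i$. The Cauchy–Schwarz inequality now yields
\[
h_i \le \bigl( E L_i^4 \bigr)^{1/2} \Bigl[ E\Bigl( \mathbf{1}(i \notin A) + \sum_{j \notin A \cup i} \sum_{k=1}^4 \bigl( d_{ji}^k + e_{ji}^k \bigr) \Bigr)^4 \Bigr]^{1/2}. \tag{4.177}
\]
Applying the inequality $(\sum_{i=1}^r a_i)^4 \le r^3 \sum_{i=1}^r a_i^4$, we obtain
\[
E\Bigl( \mathbf{1}(i \notin A) + \sum_{j \notin A \cup i} \sum_{k=1}^4 \bigl( d_{ji}^k + e_{ji}^k \bigr) \Bigr)^4
\le 9^3\, \mathbf{1}(i \notin A) + 9^3 \sum_{k=1}^4 E\Bigl( \sum_{j \notin A \cup i} d_{ji}^k \Bigr)^4 + 9^3 \sum_{k=1}^4 E\Bigl( \sum_{j \notin A \cup i} e_{ji}^k \Bigr)^4.
\]
To handle the first term in the first sum, from Lemma 4.11, for any $j, k, l$ and $m$,
\[
E\bigl( d_{ji}^1 d_{ki}^1 d_{li}^1 d_{mi}^1 \bigr) \le C \frac{E \delta_1^r}{n^r},
\]
where $r$ is the number of distinct indices among $j, k, l, m$, and $\delta_1$ is the degree of vertex 1 in $G(X)$. Recall the definition of $\delta$ from (4.162), and observe that $\delta \ge \delta_1 + 1$. It follows easily that
\[
E\Bigl( \sum_{j \notin A \cup i} d_{ji}^1 \Bigr)^4 \le C E\bigl(\delta^4\bigr) \frac{n - |A|}{n}.
\]
Now we consider bounding $E(d_{ji}^2 d_{ki}^2 d_{li}^2 d_{mi}^2)$. First suppose that $j, k, l, m$ are distinct. Now let $\widetilde{X}$ be the random vector in $\mathcal{X}^{n+4}$ given by
\[
\widetilde{X} = (X_1, \ldots, X_n, X_j', X_k', X_l', X_m').
\]
Note that if $d_{ji}^2 = d_{ki}^2 = d_{li}^2 = d_{mi}^2 = 1$ then $\{i, n+1\}, \{i, n+2\}, \{i, n+3\}$ and $\{i, n+4\}$ are all edges in the extended graph $G'(\widetilde{X})$. Since $G'$ is a symmetric rule and the components of $\widetilde{X}$ are i.i.d., it follows from Lemma 4.11 that
\[
E\bigl( d_{ji}^2 d_{ki}^2 d_{li}^2 d_{mi}^2 \bigr) \le C \frac{E \delta^4}{n^4}.
\]
Now, suppose $j, k, l$ are distinct, and that $m = l$. Let $s \in \{1, \ldots, n\}$ be distinct from $j, k$ and $l$, and define
\[
\widetilde{X} = (X_1, \ldots, X_n, X_j', X_k', X_l', X_s')
\]
and argue as before to conclude that in this case
\[
E\bigl( d_{ji}^2 d_{ki}^2 d_{li}^2 d_{mi}^2 \bigr) = E\bigl( d_{ji}^2 d_{ki}^2 d_{li}^2 \bigr) \le C \frac{E \delta^3}{n^3}.
\]
In general, if $r$ is the number of distinct elements among $j, k, l, m$, then
\[
E\bigl( d_{ji}^2 d_{ki}^2 d_{li}^2 d_{mi}^2 \bigr) \le C \frac{E \delta^r}{n^r}.
\]
From this inequality we obtain as before that
\[
E\Bigl( \sum_{j \notin A \cup i} d_{ji}^2 \Bigr)^4 \le C E\bigl(\delta^4\bigr) \frac{n - |A|}{n}.
\]
The $d^3$, $e^1$ and $e^3$ terms can be bounded as the $d^1$ term, while the $d^4$, $e^2$ and $e^4$ terms like the $d^2$ term. Combining, we conclude
\[
E\Bigl( \mathbf{1}(i \notin A) + \sum_{j \notin A \cup i} \sum_{k=1}^4 \bigl( d_{ji}^k + e_{ji}^k \bigr) \Bigr)^4 \le C E\bigl(\delta^4\bigr) \Bigl( \mathbf{1}(i \notin A) + \frac{n - |A|}{n} \Bigr).
\]
As $M = \max_j |\Delta_j \psi(X)|$ we have $E L_i^4 \le C E M^8$, and applying these bounds in (4.177), along with the inequality $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ for nonnegative $x$ and $y$, we obtain
\[
h_i \le C \bigl(E M^8\bigr)^{1/2} \bigl(E \delta^4\bigr)^{1/2} \Bigl( \mathbf{1}(i \notin A) + \sqrt{\frac{n - |A|}{n}} \Bigr).
\]
Substituting this bound in (4.176), we obtain
\begin{align*}
\operatorname{Var}\bigl(E(T_A|X)\bigr) &\le C \bigl(E M^8\bigr)^{1/2} \bigl(E \delta^4\bigr)^{1/2} \bigl( n - |A| + \sqrt{n} \sqrt{n - |A|} \bigr) \\
&\le C \bigl(E M^8\bigr)^{1/2} \bigl(E \delta^4\bigr)^{1/2} \sqrt{n(n - |A|)},
\end{align*}
completing the proof.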
4.7 Locally Dependent Random Variables

In this section we consider $L^1$ bounds for sums of locally dependent random variables. We begin by recalling that an $m$-dependent sequence of random variables $\xi_i$, $i \in \mathbb{N}$, is one with the property that, for each $i$, the sets of random variables $\{\xi_j, j \le i\}$ and $\{\xi_j, j > i + m\}$ are independent. Independent random variables are the special case of $m$-dependence when $m = 0$. Local dependence generalizes the notion of $m$-dependence to collections of random variables indexed more generally. The concept of local dependence is applicable, for example, to random variables indexed by the vertices of a graph such that the collections $\{\xi_i, i \in I\}$ and $\{\xi_j, j \in J\}$ are independent whenever $I \cap J = \emptyset$ and the graph contains no edges $\{i, j\}$ with $i \in I$ and $j \in J$.

Let $\mathcal{J}$ be a finite index set of cardinality $n$, and let $\{\xi_i, i \in \mathcal{J}\}$ be a random field, that is, an indexed collection of random variables, with zero means and finite variances. Define $W = \sum_{i \in \mathcal{J}} \xi_i$, and assume that $\operatorname{Var}(W) = 1$. For any $A \subset \mathcal{J}$ let
\[
A^c = \{ j \in \mathcal{J} : j \notin A \} \quad \text{and} \quad \xi_A = \{ \xi_i : i \in A \}.
\]
We introduce the following two conditions, corresponding to different degrees of local dependence.

(LD1) For each $i \in \mathcal{J}$ there exists $A_i \subset \mathcal{J}$ such that $\xi_i$ and $\xi_{A_i^c}$ are independent.

(LD2) For each $i \in \mathcal{J}$ there exist $A_i \subset B_i \subset \mathcal{J}$ such that $\xi_i$ is independent of $\xi_{A_i^c}$ and $\xi_{A_i}$ is independent of $\xi_{B_i^c}$.

Clearly (LD2) implies (LD1). Whenever (LD1) or (LD2) hold we set
\[
\eta_i = \sum_{j \in A_i} \xi_j \quad \text{and} \quad \tau_i = \sum_{j \in B_i} \xi_j \tag{4.178}
\]
respectively. Note that when $\{\xi_i, i \in \mathcal{J}\}$ are independent (LD2) holds with $A_i = B_i = \{i\}$, in which case $\eta_i = \tau_i = \xi_i$.

Theorem 4.13 Let $\{\xi_i, i \in \mathcal{J}\}$ be a random field with mean zero and $\operatorname{Var}(W) = 1$ where $W = \sum_{i \in \mathcal{J}} \xi_i$. If (LD1) holds, then with $\eta_i$ as in (4.178),
\[
\bigl\|\mathcal{L}(W) - \mathcal{L}(Z)\bigr\|_1 \le \sqrt{\frac{2}{\pi}}\, E\Bigl| \sum_{i \in \mathcal{J}} \bigl\{ \xi_i \eta_i - E(\xi_i \eta_i) \bigr\} \Bigr| + \sum_{i \in \mathcal{J}} E\bigl| \xi_i \eta_i^2 \bigr|, \tag{4.179}
\]
and if (LD2) holds, then with $\eta_i$ and $\tau_i$ as in (4.178),
\[
\bigl\|\mathcal{L}(W) - \mathcal{L}(Z)\bigr\|_1 \le 2 \sum_{i \in \mathcal{J}} \bigl( |E(\xi_i \eta_i)|\, E|\tau_i| + E|\xi_i \eta_i \tau_i| \bigr) + \sum_{i \in \mathcal{J}} E\bigl| \xi_i \eta_i^2 \bigr|. \tag{4.180}
\]
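We remark that for independent random variables, applying Hölder's inequality to the bound in (4.180) yields $5 \sum_{i \in \mathcal{J}} E|\xi_i|^3$, somewhat larger than the constant of 1 given by Corollary 4.2.

Proof Assume (LD1) holds and let $f = f_h$ be the solution of the Stein equation (2.4) for an absolutely continuous function $h$ satisfying $\|h'\| \le 1$. By the independence of $\xi_i$ and $W - \eta_i$, and that $E\xi_i = 0$, we have
\[
E\bigl\{ W f(W) \bigr\} = \sum_{i \in \mathcal{J}} E\bigl\{ \xi_i f(W) \bigr\} = \sum_{i \in \mathcal{J}} E\bigl\{ \xi_i \bigl( f(W) - f(W - \eta_i) \bigr) \bigr\}.
\]
Now adding and subtracting yields
\[
E\bigl\{ W f(W) \bigr\} = \sum_{i \in \mathcal{J}} E\bigl\{ \xi_i \bigl( f(W) - f(W - \eta_i) - \eta_i f'(W) \bigr) \bigr\} + E\Bigl\{ \sum_{i \in \mathcal{J}} \xi_i \eta_i f'(W) \Bigr\}. \tag{4.181}
\]
Now, using again that $E\xi_i = 0$ for all $i$, from (LD1) it follows that
\[
1 = E W^2 = \sum_{i \in \mathcal{J}} \sum_{j \in \mathcal{J}} E\{\xi_i \xi_j\} = \sum_{i \in \mathcal{J}} E\{\xi_i \eta_i\},
\]
and so
\[
E\bigl\{ f'(W) - W f(W) \bigr\} = -E\Bigl\{ \sum_{i \in \mathcal{J}} \bigl( \xi_i \eta_i - E(\xi_i \eta_i) \bigr) f'(W) \Bigr\} - \sum_{i \in \mathcal{J}} E\bigl\{ \xi_i \bigl( f(W) - f(W - \eta_i) - \eta_i f'(W) \bigr) \bigr\}. \tag{4.182}
\]
By (2.13), $\|f'\| \le \sqrt{2/\pi}$ and $\|f''\| \le 2$. Therefore it follows from (4.182) and a Taylor expansion that
\[
\bigl| E h(W) - E h(Z) \bigr| \le \sqrt{\frac{2}{\pi}}\, E\Bigl| \sum_{i \in \mathcal{J}} \bigl\{ \xi_i \eta_i - E(\xi_i \eta_i) \bigr\} \Bigr| + \sum_{i \in \mathcal{J}} E\bigl| \xi_i \eta_i^2 \bigr|.
\]
Now (4.179) follows from (4.8). When (LD2) is satisfied, $f'(W - \tau_i)$ and $\xi_i \eta_i$ are independent for each $i \in \mathcal{J}$. Hence, using (4.182), we can write
\begin{align*}
\bigl| E h(W) - E h(Z) \bigr|
&\le \Bigl| E \sum_{i \in \mathcal{J}} \bigl( \xi_i \eta_i - E(\xi_i \eta_i) \bigr) \bigl( f'(W) - f'(W - \tau_i) \bigr) \Bigr| + \sum_{i \in \mathcal{J}} E\bigl| \xi_i \eta_i^2 \bigr| \\
&\le 2 \sum_{i \in \mathcal{J}} \bigl( E|\xi_i \eta_i \tau_i| + |E(\xi_i \eta_i)|\, E|\tau_i| \bigr) + \sum_{i \in \mathcal{J}} E\bigl| \xi_i \eta_i^2 \bigr|,
\end{align*}
as desired.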
We provide two examples of locally dependent random variables. We refer to Baldi and Rinott (1989), Rinott (1994), Baldi et al. (1989), Dembo and Rinott (1996), and Chen and Shao (2004) for more details.

Example 4.1 (Graphical dependence) Consider a set of random variables $\{\xi_i, i \in \mathcal{V}\}$ indexed by the vertices of a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The graph $\mathcal{G}$ is said to be a dependency graph if, for any pair of disjoint sets $\Gamma_1$ and $\Gamma_2$ in $\mathcal{V}$ such that no edge in $\mathcal{E}$ has one endpoint in $\Gamma_1$ and the other in $\Gamma_2$, the sets of random variables $\{\xi_i, i \in \Gamma_1\}$ and $\{\xi_i, i \in \Gamma_2\}$ are independent. Let $A_i = \{i\} \cup \{ j \in \mathcal{V} : \{i, j\} \in \mathcal{E} \}$ and $B_i = \bigcup_{j \in A_i} A_j$. Then $\{\xi_i, i \in \mathcal{V}\}$ satisfies (LD2). Hence (4.180) holds.

Example 4.2 (The number of local maxima on a graph) Consider a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ (which is not necessarily a dependency graph) and independent and identically distributed continuous random variables $\{Y_i, i \in \mathcal{V}\}$. For $i \in \mathcal{V}$ define the indicator variable
\[
\xi_i = \begin{cases} 1 & \text{if } Y_i > Y_j \text{ for all } j \in N_i, \\ 0 & \text{otherwise,} \end{cases}
\]
where $N_i = \{ j \in \mathcal{V} : \{i, j\} \in \mathcal{E} \}$. Hence $\xi_i = 1$ indicates that $Y_i$ is a local maximum and $W = \sum_{i \in \mathcal{V}} \xi_i$ is the total number of local maxima. Letting
\[
A_i = \{i\} \cup N_i \cup \bigcup_{j \in N_i} N_j \quad \text{and} \quad B_i = \bigcup_{j \in A_i} A_j
\]
we find that $\{\xi_i, i \in \mathcal{V}\}$ satisfies (LD2), and therefore (4.180) holds. Bounds in $L^\infty$ for this problem are considered in Example 6.4.
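The neighborhoods $A_i$ and $B_i$ of the two examples are simple to build from an edge list, which can help when checking condition (LD2) on a concrete graph. The following is a minimal sketch; the function names `dependency_neighborhoods` and `local_maxima_neighborhoods`, and the representation of the graph as a vertex list and a list of edge pairs, are assumptions made for illustration.

```python
def _neighbors(vertices, edges):
    """Adjacency sets N_i of an undirected graph given as an edge list."""
    nbrs = {v: set() for v in vertices}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def dependency_neighborhoods(vertices, edges):
    """Example 4.1: A_i = {i} union N_i and B_i = union of A_j over j in A_i."""
    nbrs = _neighbors(vertices, edges)
    A = {v: {v} | nbrs[v] for v in vertices}
    B = {v: set().union(*(A[j] for j in A[v])) for v in vertices}
    return A, B


def local_maxima_neighborhoods(vertices, edges):
    """Example 4.2: A_i = {i} union N_i union (union of N_j over j in N_i),
    and B_i = union of A_j over j in A_i."""
    nbrs = _neighbors(vertices, edges)
    A = {v: {v} | nbrs[v] | set().union(*(nbrs[j] for j in nbrs[v]))
         for v in vertices}
    B = {v: set().union(*(A[j] for j in A[v])) for v in vertices}
    return A, B


if __name__ == "__main__":
    path_vertices = [0, 1, 2, 3, 4]
    path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
    print(dependency_neighborhoods(path_vertices, path_edges))
    print(local_maxima_neighborhoods(path_vertices, path_edges))
```

On a path, the sets for the local maxima problem reach twice as far as those for graphical dependence, reflecting that each indicator depends on the values at a vertex and at all of its neighbors.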
4.8 Smooth Function Bounds

In defining a distance $\|\mathcal{L}(X) - \mathcal{L}(Y)\|_{\mathcal{H}}$ through (4.1) one typically chooses $\mathcal{H}$ to be a convergence determining class of functions, that is, a collection of functions such that if $\{X_n\}_{n \ge 0}$ is any sequence of random variables then
\[
E h(X_n) \to E h(X_0) \text{ for all } h \in \mathcal{H} \quad \text{implies} \quad X_n \to_d X_0.
\]
A convergence determining class can consist of functions all of which are very smooth, such as the collection of all infinitely differentiable functions with compact support. To describe the collection of functions we consider in this section, following E.M. Stein (1970), let $L_m^\infty(\mathbb{R})$ be all functions $h : \mathbb{R} \to \mathbb{R}$ satisfying $\|h\|_{L_m^\infty(\mathbb{R})} < \infty$ where
\[
\|h\|_{L_m^\infty(\mathbb{R})} = \max_{0 \le k \le m} \bigl\| h^{(k)} \bigr\|_\infty.
\]
That is, $L_m^\infty(\mathbb{R})$ consists of all functions possessing $m$ bounded derivatives. Now let $\|\mathcal{L}(W) - \mathcal{L}(Z)\|_{\mathcal{H}_{m,\infty}}$ be the distance which is obtained through (4.1) by setting
\[
\mathcal{H}_{m,\infty} = \bigl\{ h \in L_m^\infty(\mathbb{R}) : \|h\|_{L_m^\infty(\mathbb{R})} \le 1 \bigr\}. \tag{4.183}
\]
In the following section we show how fast rates of convergence can be obtained under a vanishing third moment assumption when inducing our distance by $\mathcal{H}_{4,\infty}$. In Chap. 12 we prove a smooth function theorem in $\mathbb{R}^p$ using a multidimensional generalization of the distances defined here, and produce bounds in that distance for the problem of counting the number of vertices in a random graph that have specified degree counts.

4.8.1 Fast Rates for Smooth Functions

In this section we first prove Theorem 4.14, a smooth function theorem parallel to Theorem 4.9, for the zero bias coupling as discussed in Sect. 2.3.3. Comparing Theorems 4.9 and 4.14, we see that the latter requires the computation of a conditional expectation of a difference, rather than of a difference squared, and that the second, or remainder, term is of a square, rather than a cube. Lastly, Theorem 4.9 requires the linearity condition (4.108) to be satisfied, whereas Theorem 4.14 does not. After the proof we apply Theorem 4.14 in an independent case to show that fast rates of convergence for smooth functions are obtained when fourth moments exist and the third moment vanishes.

Theorem 4.14 Let $W$ be a mean zero, variance 1 random variable and suppose that the pair $(W, W^*)$ is given on a joint probability space so that $W^*$ has the $W$-zero biased distribution. Then
\[
\bigl\|\mathcal{L}(W) - \mathcal{L}(Z)\bigr\|_{\mathcal{H}_{4,\infty}} \le \frac{1}{3} E\bigl| E(W^* - W | W) \bigr| + \frac{1}{8} E(W^* - W)^2.
\]
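Proof Let $g$ be the solution to (2.19) for a given $h \in \mathcal{H}_{4,\infty}$. By the bounds in Lemma 2.6
\[
\bigl\| g^{(3)} \bigr\| \le \frac{1}{3} \quad \text{and} \quad \bigl\| g^{(4)} \bigr\| \le \frac{1}{4}. \tag{4.184}
\]
By (2.19), (2.51), and Taylor expansion,
\begin{align*}
\bigl| E h(W) - N h \bigr| &= \bigl| E\bigl( g''(W) - W g'(W) \bigr) \bigr| \\
&= \bigl| E\bigl( g''(W^*) - g''(W) \bigr) \bigr| \\
&\le \bigl| E g^{(3)}(W)(W^* - W) \bigr| + \Bigl| E \int_W^{W^*} g^{(4)}(t)(W^* - t)\, dt \Bigr|.
\end{align*}
Conditioning on $W$ we may bound the first term as
\[
\bigl| E\bigl\{ g^{(3)}(W) E(W^* - W | W) \bigr\} \bigr| \le \bigl\| g^{(3)} \bigr\|\, E\bigl| E(W^* - W | W) \bigr|.
\]
For the second term
\[
\Bigl| E \int_W^{W^*} g^{(4)}(t)(W^* - t)\, dt \Bigr| \le \frac{1}{2} \bigl\| g^{(4)} \bigr\|\, E(W^* - W)^2.
\]
Applying (4.184) completes the proof.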
We now apply Theorem 4.14 to the sum of independent identically distributed variables and show how the zero bias transformation leads to an error bound for smooth functions of order n−1 , under additional moment assumptions which include a vanishing third moment. Corollary 4.4 Let X1 , X2 , . . . , Xn be independent and identically distributed mean 4 zero, variance one random n variables with vanishing third moment and EX < ∞. −1/2 Then, for W = n i=1 Xi ,
1 4 L(W ) − L(Z) H4,∞ ≤ 24n 11 + EX . Proof For i = 1, . . . , n let Xi∗ have the Xi -zero biased distribution and be independent of Xj , j = 1, . . . , n, and I a random index independent of Xi , Xi∗ , i = 1, . . . , n with distribution P (I = i) = 1/n. Then, by Lemma 2.8 and the scaling property (2.59), √ √ W ∗ = W − XI / n + XI∗ / n has the W -zero biased distribution. From substituting f (x) = x 2 /2 into (2.51), for every i = 1, . . . , n we have EXi∗ = (1/2)EXi3 = 0
and therefore EXI∗ = 0.
(4.185)
Next, √ using that X1 , . . . , Xn ’s are i.i.d., and therefore exchangeable, E(XI |W ) = W/ n.
138
4
L1 Bounds
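Now, by the independence of $X_I^*$ and $W$, and (4.185), we obtain
\[ \begin{aligned}
E(W^* - W \mid W) &= n^{-1/2} E\bigl(X_I^* - X_I \mid W\bigr) \\
&= n^{-1/2}\bigl(EX_I^* - E(X_I \mid W)\bigr) = -n^{-1/2} E(X_I \mid W) = -n^{-1} W.
\end{aligned} \]
Therefore
\[ E\bigl|E(W^* - W \mid W)\bigr| = n^{-1} E|W| \le \frac{1}{n}. \]
For the second term in Theorem 4.14, application of (2.51) with $f(x) = x^3/3$ yields
\[ E\bigl(X_I^*\bigr)^2 = E\bigl(X_i^*\bigr)^2 = \frac{1}{3} EX_i^4. \]
Since $X_I^*$ and $X_I$ are independent, and the latter variable has mean zero,
\[ E(W^* - W)^2 = \frac{1}{n} E\bigl(X_I^* - X_I\bigr)^2 = \frac{1}{n}\Bigl(E\bigl(X_I^*\bigr)^2 + EX_I^2\Bigr) = \frac{1}{n}\Bigl(\frac{EX^4}{3} + 1\Bigr). \]
Applying Theorem 4.14 now yields the claim.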
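The arithmetic that combines the two terms of Theorem 4.14 into the constant of Corollary 4.4 can be checked exactly. The sketch below is only an illustration: the function names are hypothetical, and the choice of symmetric ±1 summands (for which $EX^4 = 1$) is an assumed example, not one taken from the text.

```python
from fractions import Fraction


def smooth_bound_terms(n, fourth_moment):
    """Return the two terms of the Theorem 4.14 bound for W = n^{-1/2} * sum X_i."""
    # First term: (1/3) * E|E(W* - W | W)|, using E|E(W* - W | W)| <= 1/n.
    first = Fraction(1, 3) * Fraction(1, n)
    # Second term: (1/8) * E(W* - W)^2, using E(W* - W)^2 = (EX^4/3 + 1)/n.
    second = Fraction(1, 8) * (Fraction(fourth_moment) / 3 + 1) / n
    return first, second


def corollary_bound(n, fourth_moment):
    """Closed form (11 + EX^4) / (24 n) as stated in Corollary 4.4."""
    return (11 + Fraction(fourth_moment)) / (24 * n)


# Illustrative example (assumption): symmetric +/-1 summands, so EX^4 = 1.
n, m4 = 50, 1
first, second = smooth_bound_terms(n, m4)
print(first + second, corollary_bound(n, m4))
```

Both printed values agree, since $\frac{1}{3n} + \frac{1}{8n}\bigl(\frac{EX^4}{3} + 1\bigr) = \frac{8 + 3 + EX^4}{24n}$.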
Under more special assumptions, fast rates may also be obtained for distances induced by classes of non-smooth functions. In particular, Klartag (2009) demonstrates a bound of order $1/n$ for cases which include the sum of independent symmetric random variables whose density is log concave.
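Appendix

Proof of Lemma 4.7 Let $\pi$ be uniform on $S_n$ and $I^\dagger, J^\dagger, K^\dagger, L^\dagger$ be independent of $\pi$ with distribution (4.134). Constructing $Y$ from $\pi$, and $Y^\dagger$ and $Y^\ddagger$ from $\pi^\dagger$ and $\pi^\ddagger$ respectively, as in Lemma 4.6, we have
\[ \begin{aligned}
Y^* - Y &= U Y^\dagger + (1 - U) Y^\ddagger - Y \\
&= U \sum_{i=1}^n a_{i,\pi^\dagger(i)} + (1 - U) \sum_{i=1}^n a_{i,\pi^\ddagger(i)} - \sum_{i=1}^n a_{i,\pi(i)}.
\end{aligned} \]
With
\[ \mathcal{I} = \bigl\{I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger)\bigr\}, \tag{4.186} \]
we see from (4.126) in Lemma 4.5, and from $\pi^\ddagger = \pi^\dagger \tau_{I^\dagger, J^\dagger}$, that if $m \notin \mathcal{I}$, then $\pi(m) = \pi^\dagger(m) = \pi^\ddagger(m)$. Hence, setting $V = Y^* - Y$, we have
\[ V = \sum_{i \in \mathcal{I}} \bigl(U a_{i,\pi^\dagger(i)} + (1 - U) a_{i,\pi^\ddagger(i)} - a_{i,\pi(i)}\bigr). \tag{4.187} \]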
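Further, letting $R = \bigl|\{\pi(I^\dagger), \pi(J^\dagger)\} \cap \{K^\dagger, L^\dagger\}\bigr|$ and $\mathbf{1}_k = \mathbf{1}(R = k)$, since $P(R \le 2) = 1$, we have $V = V\mathbf{1}_2 + V\mathbf{1}_1 + V\mathbf{1}_0$, and therefore
\[ E|V| \le E|V|\mathbf{1}_2 + E|V|\mathbf{1}_1 + E|V|\mathbf{1}_0. \tag{4.188} \]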
The three terms on the right hand side of (4.188) give rise to the three components of the bound in the theorem. For notational simplicity, the following summations in this section are performed over all indices which appear, whether in the summands or in a (possibly empty) collection of restrictions. In what follows, we will apply equalities and bounds such as
\[ \sum |a_{il}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 = \sum |a_{il}| \bigl(a_{ik}^2 + a_{jl}^2 + a_{il}^2 + a_{jk}^2\bigr) \le 4n^2\gamma. \tag{4.189} \]
Due to the form of the terms being squared on the left-hand side, if the factors in a cross term agree in their first index, they will have differing second indices, and likewise if their second indices agree. This gives cross terms which are zero by virtue of (4.120), since there will be at least one unpaired index outside the absolute value over which to sum, for instance, the index $k$ in the term $|a_{il}| a_{ik} a_{il}$. Hence the equality. To obtain the inequality, for each of the four terms we argue as for the first,
\[ \sum_{i,j,k,l} |a_{i,l}|\, a_{i,k}^2 \le \Bigl(\sum_{j,k} |a_{il}|^3\Bigr)^{1/3} \Bigl(\sum_{j,l} |a_{ik}|^3\Bigr)^{2/3} = n^2\gamma. \tag{4.190} \]
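The identity and inequality in (4.189) can be checked numerically on a small array. The sketch below rests on two assumptions that are not restated in this passage: that (4.120) is the condition that every row and every column of $(a_{ij})$ sums to zero, and that $\gamma = \sum_{i,j} |a_{ij}|^3$ (which is what the final equality in (4.190) suggests). The helper names are hypothetical.

```python
import random


def double_center(m):
    """Subtract row and column means so that every row and column sums to zero."""
    n = len(m)
    row = [sum(r) / n for r in m]
    col = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[m[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def check_4189(a):
    """Return (left side, middle expression, 4 n^2 gamma) of (4.189)."""
    n = len(a)
    idx = range(n)
    quads = [(i, j, k, l) for i in idx for j in idx for k in idx for l in idx]
    lhs = sum(abs(a[i][l]) * ((a[i][k] + a[j][l]) - (a[i][l] + a[j][k])) ** 2
              for i, j, k, l in quads)
    mid = sum(abs(a[i][l]) * (a[i][k] ** 2 + a[j][l] ** 2 + a[i][l] ** 2 + a[j][k] ** 2)
              for i, j, k, l in quads)
    gamma = sum(abs(x) ** 3 for r in a for x in r)  # assumed definition of gamma
    return lhs, mid, 4 * n * n * gamma


random.seed(0)
n = 5
a = double_center([[random.gauss(0, 1) for _ in range(n)] for _ in range(n)])
lhs, mid, bound = check_4189(a)
print(lhs, mid, bound)
```

Up to floating point error the first two printed numbers coincide, reflecting the vanishing of the cross terms, and neither exceeds the third.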
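Generally, the power of $n$ in such an inequality, in this case 2, will be 2 less than the number of indices of summation, in this case 4.

Calculation on $R = 2$ On $\mathbf{1}_2$ we have $\{\pi(I^\dagger), \pi(J^\dagger)\} = \{K^\dagger, L^\dagger\}$ and therefore $\mathcal{I} = \{I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger)\} = \{I^\dagger, J^\dagger\}$. As the intersection which gives $R = 2$ can occur in two different ways, we make the further decomposition $V\mathbf{1}_2 = V\mathbf{1}_{2,1} + V\mathbf{1}_{2,2}$, where
\[ \mathbf{1}_{2,1} = \mathbf{1}\bigl(\pi(I^\dagger) = K^\dagger, \pi(J^\dagger) = L^\dagger\bigr) \quad\text{and}\quad \mathbf{1}_{2,2} = \mathbf{1}\bigl(\pi(I^\dagger) = L^\dagger, \pi(J^\dagger) = K^\dagger\bigr). \]
Since $\pi^\dagger = \pi$ on $\mathbf{1}_{2,1}$ by (4.125), following (4.187) we have
\[ \begin{aligned}
V\mathbf{1}_{2,1} &= \sum_{i \in \{I^\dagger, J^\dagger\}} \bigl(U a_{i,\pi^\dagger(i)} + (1 - U) a_{i,\pi^\ddagger(i)} - a_{i,\pi(i)}\bigr) \mathbf{1}_{2,1} \\
&= \bigl[U (a_{I^\dagger,\pi^\dagger(I^\dagger)} + a_{J^\dagger,\pi^\dagger(J^\dagger)}) + (1 - U)(a_{I^\dagger,\pi^\ddagger(I^\dagger)} + a_{J^\dagger,\pi^\ddagger(J^\dagger)}) - (a_{I^\dagger,\pi(I^\dagger)} + a_{J^\dagger,\pi(J^\dagger)})\bigr] \mathbf{1}_{2,1} \\
&= \bigl[U (a_{I^\dagger,\pi(I^\dagger)} + a_{J^\dagger,\pi(J^\dagger)}) + (1 - U)(a_{I^\dagger,\pi(J^\dagger)} + a_{J^\dagger,\pi(I^\dagger)}) - (a_{I^\dagger,\pi(I^\dagger)} + a_{J^\dagger,\pi(J^\dagger)})\bigr] \mathbf{1}_{2,1} \\
&= (1 - U)\bigl(a_{I^\dagger,\pi(J^\dagger)} + a_{J^\dagger,\pi(I^\dagger)} - a_{I^\dagger,\pi(I^\dagger)} - a_{J^\dagger,\pi(J^\dagger)}\bigr) \mathbf{1}_{2,1} \\
&= (1 - U)\bigl(a_{I^\dagger,L^\dagger} + a_{J^\dagger,K^\dagger} - a_{I^\dagger,K^\dagger} - a_{J^\dagger,L^\dagger}\bigr) \mathbf{1}_{2,1}.
\end{aligned} \tag{4.191} \]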
Due to the presence of the indicator $\mathbf{1}_{2,1}$, taking the expectation of (4.191) requires a joint distribution which includes the values taken on by $\pi$ at $I^\dagger$ and $J^\dagger$, say $s$ and $t$, respectively. Since these images can be any two distinct values, and are independent of $I^\dagger, J^\dagger, K^\dagger$ and $L^\dagger$, we have, with $p_1$ and $p_2$ given in (4.122) and (4.134), respectively,
\[ \begin{aligned}
p_3(i, j, k, l, s, t) &= P\bigl((I^\dagger, J^\dagger, K^\dagger, L^\dagger, \pi(I^\dagger), \pi(J^\dagger)) = (i, j, k, l, s, t)\bigr) \\
&= p_2(i, j, k, l)\, p_1(s, t) \\
&= \frac{[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})]^2}{4n^3(n - 1)^2\sigma^2}\, \mathbf{1}(s \ne t).
\end{aligned} \tag{4.192} \]
Now bounding the absolute value of the first term in (4.191) using (4.189), we obtain
\[ \begin{aligned}
E\bigl|(1 - U) a_{I^\dagger,L^\dagger}\bigr| \mathbf{1}_{2,1} &= \frac{1}{2} \sum |a_{il}|\, \mathbf{1}(s = k, t = l)\, p_3(i, j, k, l, s, t) \\
&= \frac{1}{2} \sum |a_{il}|\, p_3(i, j, k, l, k, l) \\
&= \frac{1}{8n^3(n - 1)^2\sigma^2} \sum |a_{il}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{\gamma}{2n(n - 1)^2\sigma^2}.
\end{aligned} \]
Using the triangle inequality in (4.191) and applying the same reasoning to the remaining three terms shows that $E|V|\mathbf{1}_{2,1} \le 2\gamma/(n(n - 1)^2\sigma^2)$. Since by symmetry the term $V\mathbf{1}_{2,2}$ can be handled the same way, we obtain
\[ E|V|\mathbf{1}_2 \le \frac{4\gamma}{n(n - 1)^2\sigma^2} \le \frac{4\gamma}{(n - 1)^3\sigma^2}. \tag{4.193} \]
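Calculation on $R = 1$ As the event $R = 1$ can occur in four different ways, depending on which element of $\{\pi(I^\dagger), \pi(J^\dagger)\}$ equals an element of $\{K^\dagger, L^\dagger\}$, we decompose $\mathbf{1}_1$ to yield
\[ V\mathbf{1}_1 = V\mathbf{1}_{1,1} + V\mathbf{1}_{1,2} + V\mathbf{1}_{1,3} + V\mathbf{1}_{1,4}, \tag{4.194} \]
where $\mathbf{1}_{1,1} = \mathbf{1}(\pi(I^\dagger) = K^\dagger, \pi(J^\dagger) \ne L^\dagger)$, specifying the remaining three indicators in (4.194) similarly.

On $\mathbf{1}_{1,1}$ we have, from (4.186), that $\mathcal{I} = \{I^\dagger, J^\dagger, \pi^{-1}(L^\dagger)\}$, and from (4.125) that $\pi^\dagger = \pi\tau_{\pi^{-1}(L^\dagger),J^\dagger}$ and so $\pi^\ddagger = \pi\tau_{\pi^{-1}(L^\dagger),J^\dagger}\tau_{J^\dagger,I^\dagger}$, yielding $\pi^\ddagger(\pi^{-1}(L^\dagger)) = \pi^\dagger(\pi^{-1}(L^\dagger)) = \pi(J^\dagger)$. Now, using (4.187),
\[ \begin{aligned}
V\mathbf{1}_{1,1} &= \sum_{i \in \{I^\dagger, J^\dagger, \pi^{-1}(L^\dagger)\}} \bigl(U a_{i,\pi^\dagger(i)} + (1 - U) a_{i,\pi^\ddagger(i)} - a_{i,\pi(i)}\bigr) \mathbf{1}_{1,1} \\
&= \bigl[U (a_{I^\dagger,\pi^\dagger(I^\dagger)} + a_{J^\dagger,\pi^\dagger(J^\dagger)} + a_{\pi^{-1}(L^\dagger),\pi^\dagger(\pi^{-1}(L^\dagger))}) \\
&\qquad + (1 - U)(a_{I^\dagger,\pi^\ddagger(I^\dagger)} + a_{J^\dagger,\pi^\ddagger(J^\dagger)} + a_{\pi^{-1}(L^\dagger),\pi^\ddagger(\pi^{-1}(L^\dagger))}) \\
&\qquad - (a_{I^\dagger,\pi(I^\dagger)} + a_{J^\dagger,\pi(J^\dagger)} + a_{\pi^{-1}(L^\dagger),\pi(\pi^{-1}(L^\dagger))})\bigr] \mathbf{1}_{1,1} \\
&= \bigl[U (a_{I^\dagger,K^\dagger} + a_{J^\dagger,L^\dagger} + a_{\pi^{-1}(L^\dagger),\pi(J^\dagger)}) + (1 - U)(a_{I^\dagger,L^\dagger} + a_{J^\dagger,K^\dagger} + a_{\pi^{-1}(L^\dagger),\pi(J^\dagger)}) \\
&\qquad - (a_{I^\dagger,K^\dagger} + a_{J^\dagger,\pi(J^\dagger)} + a_{\pi^{-1}(L^\dagger),L^\dagger})\bigr] \mathbf{1}_{1,1} \\
&= \bigl[U a_{J^\dagger,L^\dagger} + (1 - U)(a_{I^\dagger,L^\dagger} + a_{J^\dagger,K^\dagger} - a_{I^\dagger,K^\dagger}) - a_{J^\dagger,\pi(J^\dagger)} - a_{\pi^{-1}(L^\dagger),L^\dagger} + a_{\pi^{-1}(L^\dagger),\pi(J^\dagger)}\bigr] \mathbf{1}_{1,1}.
\end{aligned} \tag{4.195} \]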
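For the first term in (4.195), dropping the restriction $t \ne l$ and summing over $t$ to obtain the first inequality, and then applying (4.189) with $|a_{il}|$ replaced by $|a_{jl}|$, we obtain
\[ \begin{aligned}
E\,U |a_{J^\dagger,L^\dagger}| \mathbf{1}_{1,1} &= \frac{1}{2} \sum |a_{jl}|\, \mathbf{1}(s = k, t \ne l)\, p_3(i, j, k, l, s, t) \\
&\le \frac{1}{8n^2(n - 1)^2\sigma^2} \sum |a_{jl}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{\gamma}{2(n - 1)^2\sigma^2}.
\end{aligned} \tag{4.196} \]
The second, third and fourth terms in (4.195) also may be bounded by (4.196) upon replacing $|a_{jl}|$ by $|a_{il}|$, $|a_{jk}|$ and $|a_{ik}|$, respectively, yielding
\[ E\bigl|U a_{J^\dagger,L^\dagger} + (1 - U)(a_{I^\dagger,L^\dagger} + a_{J^\dagger,K^\dagger} - a_{I^\dagger,K^\dagger})\bigr| \mathbf{1}_{1,1} \le \frac{2\gamma}{(n - 1)^2\sigma^2}. \tag{4.197} \]
For the fifth term in (4.195), that is, for $-a_{J^\dagger,\pi(J^\dagger)}$, reasoning similarly,
\[ \begin{aligned}
E|a_{J^\dagger,\pi(J^\dagger)}| \mathbf{1}_{1,1} &= \sum |a_{jt}|\, \mathbf{1}(s = k, t \ne l)\, p_3(i, j, k, l, s, t) \\
&\le \frac{1}{4n^3(n - 1)^2\sigma^2} \sum |a_{jt}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{\gamma}{(n - 1)^2\sigma^2}.
\end{aligned} \tag{4.198} \]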
Note that for the final inequality, though the sum being bounded is not of the form (4.189), having the index t , the same reasoning applies and that, moreover, the five indices of summation require that n2 in (4.190) be replaced by n3 .
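To handle the sixth term in (4.195), $-a_{\pi^{-1}(L^\dagger),L^\dagger}$, we need the joint distribution
\[ p_4(i, j, k, l, s, t, u) = P\bigl((I^\dagger, J^\dagger, K^\dagger, L^\dagger, \pi(I^\dagger), \pi(J^\dagger), \pi^{-1}(L^\dagger)) = (i, j, k, l, s, t, u)\bigr), \]
accounting for the value $u$ taken on by $\pi^{-1}(L^\dagger)$. If $l$ equals $s$ or $t$, then $u$ is already fixed at $i$ or $j$, respectively; otherwise, $\pi^{-1}(L^\dagger)$ is free to take any of the remaining available $n - 2$ values, with equal probability. Hence, with $p_3$ given by (4.192), we deduce that
\[ p_4(i, j, k, l, s, t, u) = \begin{cases}
p_3(i, j, k, l, s, t), & \text{if } (l, u) \in \{(s, i), (t, j)\}, \\
p_3(i, j, k, l, s, t)\,\dfrac{1}{n - 2}, & \text{if } l \notin \{s, t\} \text{ and } u \notin \{i, j\}, \\
0, & \text{otherwise.}
\end{cases} \]
Note, for example, that on $\mathbf{1}_{1,1}$, where $\pi(I^\dagger) = K^\dagger$ and $\pi(J^\dagger) \ne L^\dagger$, the value $u$ of $\pi^{-1}(L^\dagger)$ is neither $I^\dagger$ nor $J^\dagger$, so the second case above is the relevant one and the vanishing of the first sum on the third line of the following display is to be expected. Now, applying the density $p_4$ we may bound the sixth term in (4.195) as follows,
\[ \begin{aligned}
E|a_{\pi^{-1}(L^\dagger),L^\dagger}| \mathbf{1}_{1,1} &= \sum |a_{ul}|\, \mathbf{1}(s = k, t \ne l)\, p_4(i, j, k, l, s, t, u) \\
&= \sum_{t \ne l} |a_{ul}|\, p_4(i, j, k, l, k, t, u) \\
&= \sum |a_{ik}|\, p_3(i, j, k, k, k, t) + \frac{1}{n - 2} \sum_{l \notin \{k, t\},\, u \notin \{i, j\}} |a_{ul}|\, p_3(i, j, k, l, k, t) \\
&= \frac{1}{n - 2} \sum_{l \ne t,\, u \notin \{i, j\}} |a_{ul}|\, p_2(i, j, k, l)\, p_1(k, t) \\
&= \frac{1}{(n)_3} \sum_{t \notin \{l, k\},\, u \notin \{i, j\}} |a_{ul}|\, p_2(i, j, k, l)
\end{aligned} \tag{4.199} \]
\[ \begin{aligned}
\phantom{E|a_{\pi^{-1}(L^\dagger),L^\dagger}| \mathbf{1}_{1,1}} &= \frac{1}{(n)_2} \sum_{u \notin \{i, j\}} |a_{ul}|\, p_2(i, j, k, l) \\
&\le \frac{1}{4n^3(n - 1)^2\sigma^2} \sum |a_{ul}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{\gamma}{(n - 1)^2\sigma^2},
\end{aligned} \tag{4.200} \]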
where the final inequality is achieved using (4.189) in the same way as for (4.198). The computation for the seventh term in (4.195) begins as that for the sixth, yielding (4.199) with $a_{ut}$ replacing $a_{ul}$, so that
\[ \begin{aligned}
E|a_{\pi^{-1}(L^\dagger),\pi(J^\dagger)}| \mathbf{1}_{1,1} &= \frac{1}{(n)_3} \sum_{t \notin \{l, k\},\, u \notin \{i, j\}} |a_{ut}|\, p_2(i, j, k, l) \\
&\le \frac{1}{4(n)_3 n^2(n - 1)\sigma^2} \sum |a_{ut}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{n^2\gamma}{(n)_3(n - 1)\sigma^2} \\
&\le \frac{3\gamma}{(n - 1)^2\sigma^2},
\end{aligned} \tag{4.201} \]
where we have applied reasoning as in (4.189), replaced $n^2$ by $n^4$ in (4.190) due to the sum over six indices, and recalled our assumption that $n \ge 3$. Returning to (4.195) and adding the contribution (4.197) from the first four terms together with (4.198), (4.200) and (4.201) from the fifth, sixth and seventh, respectively, we obtain $E|V|\mathbf{1}_{1,1} \le 7\gamma/((n - 1)^2\sigma^2)$. Since, by symmetry, all four terms on the right-hand side of (4.194) can be handled in the same way as the first, we obtain the following bound on the event $R = 1$:
\[ E|V|\mathbf{1}_1 \le \frac{28\gamma}{(n - 1)^2\sigma^2}. \tag{4.202} \]
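Calculation on $R = 0$ We may write the indicator of the event that $R = 0$ as
\[ \mathbf{1}_0 = \mathbf{1}\bigl(\pi(I^\dagger) \notin \{K^\dagger, L^\dagger\}, \pi(J^\dagger) \notin \{K^\dagger, L^\dagger\}\bigr), \]
and we see from (4.186) that $\mathcal{I} = \{I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger)\}$, a set of size 4, on $R = 0$. Hence, from (4.187),
\[ \begin{aligned}
V\mathbf{1}_0 &= \sum_{i \in \{I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger)\}} \bigl(U a_{i,\pi^\dagger(i)} + (1 - U) a_{i,\pi^\ddagger(i)} - a_{i,\pi(i)}\bigr) \mathbf{1}_0 \\
&= \bigl[U (a_{I^\dagger,K^\dagger} + a_{J^\dagger,L^\dagger}) + (1 - U)(a_{I^\dagger,L^\dagger} + a_{J^\dagger,K^\dagger}) + a_{\pi^{-1}(K^\dagger),\pi(I^\dagger)} + a_{\pi^{-1}(L^\dagger),\pi(J^\dagger)} \\
&\qquad - (a_{I^\dagger,\pi(I^\dagger)} + a_{J^\dagger,\pi(J^\dagger)} + a_{\pi^{-1}(K^\dagger),K^\dagger} + a_{\pi^{-1}(L^\dagger),L^\dagger})\bigr] \mathbf{1}_0.
\end{aligned} \tag{4.203} \]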
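Since the first four terms in (4.203) have the same distribution, we bound their contribution to $E|V|\mathbf{1}_0$, using (4.189), by
\[ \begin{aligned}
4E\,U|a_{I^\dagger,K^\dagger}| \mathbf{1}_0 \le 4E\,U|a_{I^\dagger,K^\dagger}| &= 2 \sum |a_{ik}|\, p_2(i, j, k, l) \\
&= \frac{1}{2n^2(n - 1)\sigma^2} \sum |a_{ik}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{2\gamma}{(n - 1)\sigma^2}.
\end{aligned} \tag{4.204} \]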
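The sum of the contributions from the fifth and sixth terms of (4.203) can be bounded as
\[ \begin{aligned}
2E|a_{\pi^{-1}(L^\dagger),\pi(J^\dagger)}| \mathbf{1}_0 &= 2 \sum_{s \notin \{k, l\},\, t \notin \{k, l\}} |a_{ut}|\, p_4(i, j, k, l, s, t, u) \\
&= \frac{2}{n - 2} \sum_{s \notin \{k, l\},\, t \notin \{k, l\},\, u \notin \{i, j\},\, s \ne t} |a_{ut}|\, p_3(i, j, k, l, s, t) \\
&\le \frac{n - 3}{2(n - 2)n^3(n - 1)^2\sigma^2} \sum |a_{ut}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2
\end{aligned} \tag{4.205} \]
\[ \begin{aligned}
\phantom{2E|a_{\pi^{-1}(L^\dagger),\pi(J^\dagger)}| \mathbf{1}_0} &\le \frac{2n(n - 3)\gamma}{(n - 2)(n - 1)^2\sigma^2} \\
&\le \frac{2\gamma}{(n - 1)\sigma^2},
\end{aligned} \tag{4.206} \]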
where the second equality follows from the form of $p_4$ and that $l \notin \{s, t\}$ implies $(l, u) \notin \{(s, i), (t, j)\}$, inequality (4.205) is obtained by summing over the $n - 3$ choices of $s$ and dropping the remaining restrictions, and the next inequality by following the reasoning of (4.189). Similarly, for the sum of the contributions from the seventh and eighth terms of (4.203), summing over the $n - 3$ choices of $t$ and then dropping the remaining restrictions to obtain the first inequality, we have
\[ \begin{aligned}
2E|a_{I^\dagger,\pi(I^\dagger)}| \mathbf{1}_0 &= 2 \sum_{s \notin \{k, l\},\, t \notin \{k, l\}} |a_{is}|\, p_3(i, j, k, l, s, t) \\
&= \frac{1}{2n^3(n - 1)^2\sigma^2} \sum_{s \notin \{k, l\},\, t \notin \{k, l\},\, s \ne t} |a_{is}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{n - 3}{2n^3(n - 1)^2\sigma^2} \sum |a_{is}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{2(n - 3)\gamma}{(n - 1)^2\sigma^2} \\
&\le \frac{2\gamma}{(n - 1)\sigma^2}.
\end{aligned} \tag{4.207} \]
The total contribution of the ninth and tenth terms together can be bounded like the sum of the fifth and sixth, yielding (4.205) with $|a_{ul}|$ replacing $|a_{ut}|$, and then summing over the $n$ choices of $t$ to give
\[ \begin{aligned}
2E|a_{\pi^{-1}(L^\dagger),L^\dagger}| \mathbf{1}_0 &\le \frac{n - 3}{2(n - 2)n^2(n - 1)^2\sigma^2} \sum |a_{ul}| \bigl[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\bigr]^2 \\
&\le \frac{2n(n - 3)\gamma}{(n - 2)(n - 1)^2\sigma^2} \\
&\le \frac{2\gamma}{(n - 1)\sigma^2}.
\end{aligned} \tag{4.208} \]
Adding up the bounds for the first four terms (4.204), the fifth and sixth terms (4.206), the seventh and eighth terms (4.207) and the ninth and tenth terms (4.208) yields
\[ E|V|\mathbf{1}_0 \le \frac{8\gamma}{(n - 1)\sigma^2}. \tag{4.209} \]
Now, from (4.188), adding up the contributions from (4.193), (4.202) and (4.209) from $R = 2$, $R = 1$, and $R = 0$, respectively, for this coupling of $Y^*$ and $Y$ we find that
\[ E|Y^* - Y| \le \frac{\gamma}{(n - 1)\sigma^2}\Bigl(8 + \frac{28}{n - 1} + \frac{4}{(n - 1)^2}\Bigr). \]
The proof of the lemma may now be completed by noting that $E|Y^* - Y|$ is an upper bound on the $L^1$ norm $\|\mathcal{L}(Y^*) - \mathcal{L}(Y)\|_1$, by the dual form of the $L^1$ norm 4.6.
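As a quick sanity check of the bookkeeping, the three contributions (4.193), (4.202) and (4.209) can be added in exact rational arithmetic and compared with the factored form displayed above. This is a minimal sketch; the function names and the sample values of $n$, $\gamma$ and $\sigma^2$ are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def sum_of_contributions(n, gamma, sigma2):
    """Add the bounds (4.193), (4.202), (4.209) for the events R = 2, 1, 0."""
    r2 = Fraction(4) * gamma / ((n - 1) ** 3 * sigma2)
    r1 = Fraction(28) * gamma / ((n - 1) ** 2 * sigma2)
    r0 = Fraction(8) * gamma / ((n - 1) * sigma2)
    return r2 + r1 + r0


def factored_form(n, gamma, sigma2):
    """gamma / ((n-1) sigma^2) * (8 + 28/(n-1) + 4/(n-1)^2)."""
    bracket = 8 + Fraction(28, n - 1) + Fraction(4, (n - 1) ** 2)
    return Fraction(gamma) / ((n - 1) * sigma2) * bracket


# Illustrative integer inputs (assumptions, not values from the text).
print(sum_of_contributions(10, 3, 2), factored_form(10, 3, 2))
```

The leading term is $8\gamma/((n-1)\sigma^2)$, coming from the event $R = 0$; the other two events contribute at lower order in $1/(n-1)$.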
Chapter 5
L∞ by Bounded Couplings
In this chapter we prove a number of Berry–Esseen type theorems, for a random variable $W$, which may be applied when certain couplings are bounded. For example, the first result here, Theorem 5.1, requires the construction of a variable $W^*$, having the $W$-zero bias distribution, on the same space as $W$ such that $|W^* - W| \le \delta$. The theorem is shown by the use of concentration inequalities. Similar results are shown for the exchangeable pair and size bias couplings. In addition to the Kolmogorov distance, we use smoothing inequalities to derive bounds which hold more generally, for distances given in terms of the supremum over classes of non-smooth functions. We illustrate some of our bounded coupling results on sums of independent bounded random variables, and apply them in more general situations starting in Chap. 6. In addition, some results given in this chapter can handle situations where the couplings are not bounded. Theorem 5.3, for the exchangeable pair, can be applied in the unbounded case when the term $E(W' - W)^2 \mathbf{1}_{\{|W' - W| > a\}}$ can be usefully upper bounded, with similar remarks applying to Theorem 5.7 for the size biased coupling.
5.1 Bounded Zero Bias Couplings

The calculation here is greatly simplified due to the assumption of boundedness. For $W$ a mean zero random variable with variance one, recall definition (2.51) of $W^*$, the $W$-zero biased variable. Theorem 4.1 shows that when $W$ and $W^*$ are close then $W$ is close to normal in the $L^1$ sense. Theorem 5.1 below, the bounded zero bias coupling theorem, provides a corresponding result in $L^\infty$.

Theorem 5.1 Let $W$ be a mean zero and variance 1 random variable, and suppose that there exists $W^*$, having the $W$-zero bias distribution, defined on the same space as $W$, satisfying $|W^* - W| \le \delta$. Then
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le c\delta \]
where $c = 1 + 1/\sqrt{2\pi} + \sqrt{2\pi}/4 \le 2.03$.

L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_5, © Springer-Verlag Berlin Heidelberg 2011
We note that the application of this theorem does not require the sometimes difficult calculation of a variance of a conditional expectation, such as the term in (4.144) of Corollary 4.3, for the exchangeable pair. Theorem 5.1 may be directly applied to the sum of independent, mean zero random variables $X_1, \ldots, X_n$ all bounded by some $C$, yielding a bound with an explicit constant that has the order of the inverse of the standard deviation of the sum. In particular, let $\sigma_i^2 = \mathrm{Var}(X_i)$, $B_n^2 = \sum_{i=1}^n \sigma_i^2$ and
\[ W = \sum_{i=1}^n \xi_i \quad\text{where } \xi_i = X_i/B_n. \]
Then applying Lemma 2.8, (2.58) to yield $|X_i^*| \le C$, and the scaling property (2.59), we obtain, for $I$ an index independent of $X_1, \ldots, X_n$ with $P(I = i) = \sigma_i^2/B_n^2$,
\[ W^* - W = \xi_I^* - \xi_I, \quad\text{so in particular } |W^* - W| \le 2C/B_n. \]
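Hence the conclusion of Theorem 5.1 holds with $\delta = 2C/B_n$. The proof of Theorem 5.1 is similar to that of Theorem 3.5, even though the relation between $W$ and a coupled $W^*$ having the $W$-zero bias distribution cannot be expressed as in (3.23).

Proof Let $z \in \mathbb{R}$ and $f$ be the solution to the Stein equation (2.2) with $z$ replaced by $z - \delta$. Then
\[ \begin{aligned}
f'(W^*) &= \mathbf{1}_{\{W^* \le z - \delta\}} - \Phi(z - \delta) + W^* f(W^*) \\
&\le \mathbf{1}_{\{W \le z\}} - \Phi(z - \delta) + W^* f(W^*).
\end{aligned} \]
By taking expectation in this inequality, applying a simple bound on the normal distribution function and the zero bias definition (2.51), we obtain
\[ \begin{aligned}
P(W \le z) - \Phi(z) &= \Phi(z - \delta) - \Phi(z) + P(W \le z) - \Phi(z - \delta) \\
&\ge -\frac{\delta}{\sqrt{2\pi}} + P(W \le z) - \Phi(z - \delta) \\
&\ge -\frac{\delta}{\sqrt{2\pi}} + E\bigl[f'(W^*) - W^* f(W^*)\bigr] \\
&= -\frac{\delta}{\sqrt{2\pi}} + E\bigl[W f(W) - W^* f(W^*)\bigr].
\end{aligned} \tag{5.1} \]
Writing $\Delta = W^* - W$ and applying the bound (2.10) yields
\[ \begin{aligned}
\bigl|E[W f(W) - W^* f(W^*)]\bigr| &= \bigl|E[W f(W) - (W + \Delta) f(W + \Delta)]\bigr| \\
&\le E\bigl[\bigl(|W| + \sqrt{2\pi}/4\bigr)|\Delta|\bigr] \\
&\le \delta\bigl(1 + \sqrt{2\pi}/4\bigr).
\end{aligned} \]
Using this inequality in (5.1) yields
\[ P(W \le z) - \Phi(z) \ge -\delta\Bigl(\frac{1}{\sqrt{2\pi}} + 1 + \frac{\sqrt{2\pi}}{4}\Bigr) \ge -2.03\delta. \]
A similar argument yields the reverse inequality.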
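To get a feel for the constant in Theorem 5.1, the sketch below evaluates $c$ and compares the bound $c\delta$, with $\delta = 2C/B_n$, against the exact Kolmogorov distance for a sum of $n$ fair $\pm 1$ signs. The example distribution (for which $C = 1$ and $B_n = \sqrt{n}$) and all function names are illustrative assumptions, not taken from the text.

```python
import math

# Constant c = 1 + 1/sqrt(2 pi) + sqrt(2 pi)/4 from Theorem 5.1.
C_ZERO_BIAS = 1 + 1 / math.sqrt(2 * math.pi) + math.sqrt(2 * math.pi) / 4


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def kolmogorov_distance_signs(n):
    """Exact sup_z |P(W <= z) - Phi(z)| for W = n^{-1/2} * (sum of n fair +/-1 signs)."""
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        z = (2 * k - n) / math.sqrt(n)       # atom of W when k signs equal +1
        phi = normal_cdf(z)
        worst = max(worst, abs(cdf - phi))   # left limit of the step function at z
        cdf += math.comb(n, k) / 2 ** n
        worst = max(worst, abs(cdf - phi))   # value of the step function at z
    return worst


def theorem_5_1_bound(n, bound_c=1.0):
    """c * delta with delta = 2C/B_n; here C = bound_c and B_n = sqrt(n)."""
    return C_ZERO_BIAS * 2 * bound_c / math.sqrt(n)


for n in (25, 100, 400):
    print(n, kolmogorov_distance_signs(n), theorem_5_1_bound(n))
```

Both columns decay like $n^{-1/2}$; the bound is not sharp in its constant, but it has the correct order for this example.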
5.2 Exchangeable Pairs, Kolmogorov Distance

In this section we provide results that give a bound on the Kolmogorov distance when we can construct $(W, W')$, an exact or approximate Stein pair, whose difference $|W - W'|$ is bounded. Theorem 5.3 can also be applied when $W - W'$ is not bounded if the term $E(W' - W)^2 \mathbf{1}_{\{|W' - W| > a\}}$ can be handled. For a pair $(W, W')$ and a given $\delta$, some of the results in this section are expressed in terms of $\Delta = W - W'$,
\[ \hat{K}_1 = E\Bigl(\int_{|t| \le \delta} \hat{K}(t)\,dt \Bigm| W\Bigr) \quad\text{with}\quad \hat{K}(t) = \frac{\Delta}{2\lambda}\bigl(\mathbf{1}_{\{-\Delta \le t \le 0\}} - \mathbf{1}_{\{0 < t \le -\Delta\}}\bigr), \tag{5.2} \]
and additionally
\[ B = E\Bigl|1 - E\Bigl(\frac{(W' - W)^2}{2\lambda} \Bigm| W\Bigr)\Bigr| \quad\text{and}\quad \Psi = \sqrt{\mathrm{Var}\bigl(E\bigl((W' - W)^2 \mid W\bigr)\bigr)}. \tag{5.3} \]
When $(W, W')$ is a $\lambda$-Stein pair with variance 1, then by (2.34) we have that $E\Delta^2 = 2\lambda$, and the Cauchy–Schwarz inequality yields
\[ B \le \frac{\Psi}{2\lambda}, \tag{5.4} \]
so in such cases $B$ may be replaced by $\Psi/2\lambda$ in all the upper bounds of this section.

We first present a result for the exchangeable pair technique which can handle situations where the linear regression condition (2.33),
\[ E(W' \mid W) = (1 - \lambda)W, \]
is satisfied only approximately. Indeed, given any $W$ and $W'$ with mean zero and variance one, one can always express the conditional expectation of $W'$ given $W$ as the linear regression of $W'$ on $W$, with coefficient $1 - \lambda$ equal to the correlation coefficient, plus the difference. If the difference is small then the methods which apply when the conditional expectation is linear should apply approximately. Rinott and Rotar (1997) proved such a result, and applied it to the case of non-degenerate $U$-statistics, obtaining a bound of rate $n^{-1/2}$. The result below, along similar lines, is a consequence of Theorem 3.5.

Theorem 5.2 If $W$, $W'$ are mean zero, variance 1 exchangeable random variables satisfying
\[ E(W - W' \mid W) = \lambda(W - R) \tag{5.5} \]
for some $\lambda \in (0, 1)$ and random variable $R$, and if $|W' - W| \le \delta$ for some $\delta$, then
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le \delta\bigl(1.1 + E\bigl(|W| \hat{K}_1\bigr)\bigr) + 2.7B + \frac{\sqrt{2\pi}}{4} E|R|, \]
where $\hat{K}_1$ is given by (5.2).
When $(W, W')$ is a mean zero, variance one $\lambda$-Stein pair as in Definition 2.1, then with $B$ as in (5.4)
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le 1.1\delta + \frac{\delta^3}{2\lambda} + 2.7B. \]

Proof As (5.5) and (2.41) are equivalent, by (2.42) identity (3.23) holds with $R_1 = R f_z(W)$. Further, as $|W' - W| \le \delta$ we have
\[ \hat{K}_1 = E\Bigl(\frac{(W' - W)^2}{2\lambda} \Bigm| W\Bigr). \]
Now applying (2.9) we invoke Theorem 3.5 with $\delta_0 = \delta$ and $\delta_1 = (\sqrt{2\pi}/4)E|R|$ to obtain the first conclusion. When $(W, W')$ is a Stein pair then $R = 0$, and the bound on $|W' - W|$ and that $E|W| \le 1$ yields the second conclusion.

One significant difference between Theorems 5.1 and 5.2 is that the latter bound, for the exchangeable pair coupling, requires the calculation of $B$, or $\Psi$, in (5.3), which may be difficult in particular cases, whereas such a computation is not required for zero bias couplings. Similar remarks apply to the computation of terms appearing in the bounds for the size biased coupling constructions given in Sect. 5.3. However, exchangeable pair and size bias couplings can be constructed for a broader range of examples than zero bias couplings, generally speaking.

We also present a bound for the exchangeable pair coupling from Shao and Su (2005), depending on the following concentration inequality.

Lemma 5.1 Let $(W, W')$ be a $\lambda$-Stein pair with variance 1. Then for any $z \in \mathbb{R}$ and $a > 0$,
\[ E(W - W')^2 \mathbf{1}_{\{-a \le W - W' \le 0\}} \mathbf{1}_{\{z - a \le W \le z\}} \le 3\lambda a. \]

Proof Let
\[ f(w) = \begin{cases}
-3a/2, & w \le z - 2a, \\
w - z + a/2, & z - 2a \le w \le z + a, \\
3a/2, & w \ge z + a.
\end{cases} \]
Then using (2.35),
\[ \begin{aligned}
3a\lambda \ge 2\lambda E\bigl[W f(W)\bigr] &= E\bigl[(W - W')\bigl(f(W) - f(W')\bigr)\bigr] \\
&= E\Bigl[(W - W') \int_{W' - W}^{0} f'(W + t)\,dt\Bigr] \\
&\ge E\Bigl[(W - W') \int_{W' - W}^{0} \mathbf{1}_{\{|t| \le a\}} \mathbf{1}_{\{z - a \le W \le z\}} f'(W + t)\,dt\Bigr].
\end{aligned} \]
Noting that
\[ f'(w + t) = \mathbf{1}_{\{z - 2a \le w + t \le z + a\}}, \]
we have
\[ \mathbf{1}_{\{|t| \le a\}} \mathbf{1}_{\{z - a \le W \le z\}} f'(W + t) = \mathbf{1}_{\{|t| \le a\}} \mathbf{1}_{\{z - a \le W \le z\}}, \]
and hence
\[ \begin{aligned}
3a\lambda &\ge E\Bigl[(W - W') \int_{W' - W}^{0} \mathbf{1}_{\{|t| \le a\}}\,dt\; \mathbf{1}_{\{z - a \le W \le z\}}\Bigr] \\
&= E\bigl[|W' - W| \min\bigl(a, |W' - W|\bigr) \mathbf{1}_{\{z - a \le W \le z\}}\bigr] \\
&\ge E\bigl[(W' - W)^2 \mathbf{1}_{\{0 \le W' - W \le a\}} \mathbf{1}_{\{z - a \le W \le z\}}\bigr].
\end{aligned} \]
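Theorem 5.3 If $W$, $W'$ are mean zero, variance 1 exchangeable random variables satisfying
\[ E(W - W' \mid W) = \lambda(W - R) \]
for some $\lambda \in (0, 1)$ and some random variable $R$, then for any $a \ge 0$,
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le B + \frac{0.41a^3}{\lambda} + 1.5a + \frac{E(W' - W)^2 \mathbf{1}_{\{|W' - W| > a\}}}{2\lambda} + \frac{\sqrt{2\pi}}{4} E|R|, \]
where $B$ is as in (5.3). If $(W, W')$ is a variance one $\lambda$-Stein pair satisfying $|W' - W| \le \delta$, then
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le B + \frac{0.41\delta^3}{\lambda} + 1.5\delta. \]

Proof Let $f$ be the solution to the Stein equation (2.2) for some arbitrary $z \in \mathbb{R}$. Following the reasoning in the derivation of (2.35), we find that
\[ E\bigl[W f(W)\bigr] = \frac{1}{2\lambda} E\bigl[(W - W')\bigl(f(W) - f(W')\bigr)\bigr] + E\bigl[f(W)R\bigr]. \]
Hence,
\[ \begin{aligned}
P(W \le z) - \Phi(z) &= E\bigl[f'(W) - W f(W)\bigr] \\
&= E\Bigl[f'(W) - \frac{(W' - W)(f(W') - f(W))}{2\lambda} - f(W)R\Bigr] \\
&= E\Bigl[f'(W)\Bigl(1 - \frac{(W' - W)^2}{2\lambda}\Bigr) + \frac{f'(W)(W' - W)^2 - (f(W') - f(W))(W' - W)}{2\lambda} - f(W)R\Bigr] \\
&:= E(J_1 + J_2 + J_3), \text{ say,} \\
&\le |EJ_1| + |EJ_2| + |EJ_3|.
\end{aligned} \tag{5.6} \]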
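For the first term, by conditioning and then taking expectation, using (2.8) we obtain
\[ |EJ_1| = \Bigl|E\Bigl[f'(W)\, E\Bigl(1 - \frac{(W' - W)^2}{2\lambda} \Bigm| W\Bigr)\Bigr]\Bigr| \le B. \tag{5.7} \]
For the third term, applying (2.9) we have
\[ |EJ_3| \le \frac{\sqrt{2\pi}}{4} E|R|. \]
To bound $|EJ_2|$, with $a \ge 0$ write
\[ \begin{aligned}
f'(W)(W' - W)^2 &- \bigl(f(W') - f(W)\bigr)(W' - W) \\
&= (W' - W) \int_0^{W' - W} \bigl(f'(W) - f'(W + t)\bigr)\,dt \\
&= (W' - W) \mathbf{1}_{\{|W' - W| > a\}} \int_0^{W' - W} \bigl(f'(W) - f'(W + t)\bigr)\,dt \\
&\quad + (W' - W) \mathbf{1}_{\{|W' - W| \le a\}} \int_0^{W' - W} \bigl(f'(W) - f'(W + t)\bigr)\,dt \\
&:= J_{21} + J_{22}, \text{ say.}
\end{aligned} \tag{5.8} \]
By (2.8), $|EJ_{21}| \le E(W' - W)^2 \mathbf{1}_{\{|W' - W| > a\}}$, yielding the second to last term in the bound of the theorem. Now express $J_{22}$, using (2.2), as the sum
\[ \begin{aligned}
&(W' - W) \mathbf{1}_{\{|W' - W| \le a\}} \int_0^{W' - W} \bigl(W f(W) - (W + t) f(W + t)\bigr)\,dt \\
&\quad + (W' - W) \mathbf{1}_{\{|W' - W| \le a\}} \int_0^{W' - W} \bigl(\mathbf{1}_{\{W \le z\}} - \mathbf{1}_{\{W + t \le z\}}\bigr)\,dt.
\end{aligned} \tag{5.9} \]
Applying (2.10) to the first term in (5.9) shows that the absolute value of its expectation is bounded by
\[ \begin{aligned}
E\Bigl|(W' - W) \mathbf{1}_{\{|W' - W| \le a\}} \int_0^{W' - W} \Bigl(|W| + \frac{\sqrt{2\pi}}{4}\Bigr)|t|\,dt\Bigr| &\le \frac{1}{2} E\Bigl[|W' - W|^3 \mathbf{1}_{\{|W' - W| \le a\}} \Bigl(|W| + \frac{\sqrt{2\pi}}{4}\Bigr)\Bigr] \\
&\le \frac{1}{2} a^3 \Bigl(1 + \frac{\sqrt{2\pi}}{4}\Bigr) \\
&\le 0.82a^3.
\end{aligned} \]
We break up the expectation of the second term in (5.9) according to the sign of $W - W'$. When $W - W' \le 0$, we have
\[ \begin{aligned}
E\Bigl[(W' - W) \mathbf{1}_{\{-a \le W - W' \le 0\}} &\int_0^{W' - W} \bigl(\mathbf{1}_{\{W \le z\}} - \mathbf{1}_{\{W + t \le z\}}\bigr)\,dt\Bigr] \\
&= E\Bigl[(W' - W) \mathbf{1}_{\{-a \le W - W' \le 0\}} \int_0^{W' - W} \mathbf{1}_{\{z - t < W \le z\}}\,dt\Bigr] \\
&\le E\bigl[(W' - W)^2 \mathbf{1}_{\{-a \le W - W' \le 0\}} \mathbf{1}_{\{z - a < W \le z\}}\bigr] \\
&\le 3a\lambda.
\end{aligned} \]
As a bound may be similarly produced for the case $W - W' \ge 0$, the proof of the first claim is complete. The second claim follows by choosing $a = \delta$ in the first bound, and noting that $R = 0$ for a $\lambda$-Stein pair.

Next we present two results that are obtained by using the linearly smoothed indicator function $h_{z,\alpha}(w)$, as given in (2.14) for $z \in \mathbb{R}$ and $\alpha > 0$, which equals the indicator $h_z(w) = \mathbf{1}_{(-\infty,z]}(w)$ over $(-\infty, z]$, decays to zero linearly over $[z, z + \alpha]$, and equals zero on $(z + \alpha, \infty)$. Let
\[ \kappa = \sup\bigl\{|Eh_z(W) - Nh_z| : z \in \mathbb{R}\bigr\}, \tag{5.10} \]
the Kolmogorov distance, and for $\alpha > 0$ set
\[ \kappa_\alpha = \sup\bigl\{|Eh_{z,\alpha}(W) - Nh_{z,\alpha}| : z \in \mathbb{R}\bigr\}. \tag{5.11} \]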
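Theorem 5.4 If $(W, W')$ is a variance one $\lambda$-Stein pair that satisfies $|W' - W| \le \delta$ for some $\delta$ then
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{3\delta^3}{\lambda} + 2B \tag{5.12} \]
where $B$ is given by (5.3).

If $\delta$ is of order $1/\sigma$, $B$ of order $1/\sigma$, and $\lambda$ of order $1/\sigma^2$, then the bound has order $1/\sigma$. A more careful optimization in the proof leads to the improved bound
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{\bigl(\sqrt{11\delta^3 + 10\lambda B} + 2\delta^{3/2}\bigr)^2}{10\lambda}. \tag{5.13} \]
The bound (5.12) follows from (5.13) and the fact that $(\sqrt{a} + \sqrt{b})^2 \le 2(a + b)$.

Proof For $z \in \mathbb{R}$ arbitrary and $\alpha > 0$ let $f$ be the solution (2.4) to the Stein equation for the function $h_{z,\alpha}$ given in (2.14). Decompose $Eh_{z,\alpha}(W) - Nh_{z,\alpha}$ into $E(J_1 + J_2)$ as in the proof of Theorem 5.3, noting that here the term $R$ is zero. By the second inequality in (2.15) of Lemma 2.5 we may again bound $|EJ_1|$ by $B$ as in (5.7). From (5.8) with $a = \delta$ we obtain
\[ \begin{aligned}
|EJ_2| &\le \frac{1}{2\lambda} \Bigl|E\Bigl[(W' - W) \int_0^{W' - W} \bigl(f'(W) - f'(W + v)\bigr)\,dv\Bigr]\Bigr| \\
&\le \frac{1}{2\lambda} E\Bigl[|W' - W| \int_{0 \wedge (W' - W)}^{0 \vee (W' - W)} \bigl|f'(W) - f'(W + v)\bigr|\,dv\Bigr].
\end{aligned} \]
By applying $|W' - W| \le \delta$ and a simple change of variable in (2.16) of Lemma 2.5, we may bound $|EJ_2|$ by
\[ \frac{1}{2\lambda} E\Bigl[\delta\bigl(1 + |W|\bigr) \int_{0 \wedge (W' - W)}^{0 \vee (W' - W)} |v|\,dv + \frac{\delta}{\alpha} \int_{-\delta}^{\delta} \int_{v \wedge 0}^{v \vee 0} \mathbf{1}_{\{z \le W + u \le z + \alpha\}}\,du\,dv\Bigr]. \]
As
\[ \int_{0 \wedge (W' - W)}^{0 \vee (W' - W)} |v|\,dv = \frac{1}{2}(W' - W)^2 \le \frac{\delta^2}{2} \quad\text{and}\quad E|W| \le 1, \]
we obtain
\[ |EJ_2| \le \frac{1}{2\lambda}\Bigl(\delta^3 + \frac{\delta}{\alpha} \int_{-\delta}^{\delta} \int_{v \wedge 0}^{v \vee 0} P(z \le W + u \le z + \alpha)\,du\,dv\Bigr). \tag{5.14} \]
Now, recalling the definitions of $\kappa$ and $\kappa_\alpha$ in (5.10) and (5.11) respectively, as
\[ \begin{aligned}
P(a \le W \le b) &= \bigl(P(W \le b) - \Phi(b)\bigr) - \bigl(P(W < a) - \Phi(a)\bigr) + \bigl(\Phi(b) - \Phi(a)\bigr) \\
&\le 2\kappa + (b - a)/\sqrt{2\pi},
\end{aligned} \]
we bound (5.14) by
\[ \frac{1}{2\lambda}\Bigl(\delta^3 + \delta\alpha^{-1} \int_{-\delta}^{\delta} \int_{v \wedge 0}^{v \vee 0} (2\kappa + 0.4\alpha)\,du\,dv\Bigr) \le \frac{1}{2\lambda}\bigl(1.4\delta^3 + 2\delta^3\alpha^{-1}\kappa\bigr). \]
Combining the bounds for $|EJ_1|$ and $|EJ_2|$ and taking supremum over $z \in \mathbb{R}$ we obtain
\[ \kappa_\alpha \le B + \frac{1}{2\lambda}\bigl(1.4\delta^3 + 2\delta^3\alpha^{-1}\kappa\bigr). \tag{5.15} \]
As
\[ \begin{aligned}
P(W \le z) - \Phi(z) &\le Eh_{z,\alpha}(W) - \Phi(z) \\
&= \bigl(Eh_{z,\alpha}(W) - Nh_{z,\alpha}\bigr) - \bigl(\Phi(z) - Nh_{z,\alpha}\bigr) \\
&\le \kappa_\alpha + \alpha/\sqrt{2\pi},
\end{aligned} \]
with similar reasoning providing a corresponding lower bound, taking supremum over $z \in \mathbb{R}$ we obtain
\[ \kappa \le \kappa_\alpha + 0.4\alpha. \]
Now applying the bound (5.15) yields
\[ \kappa \le \frac{a\alpha + b}{1 - c/\alpha}, \quad\text{where } a = 0.4,\ b = B + \frac{0.7\delta^3}{\lambda} \text{ and } c = \frac{\delta^3}{\lambda}. \]
Now setting $\alpha = 2c$ yields $4ac + 2b$, the right hand side of (5.12).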
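The last step of the proof can be explored numerically. The sketch below evaluates $(a\alpha + b)/(1 - c/\alpha)$ at $\alpha = 2c$, which reproduces (5.12), and at the value $\alpha^\ast = c + \sqrt{c^2 + bc/a}$, which a short calculus exercise (our own computation, not quoted from the text) identifies as the minimizer and which reproduces (5.13). Function names and the sample inputs are hypothetical.

```python
import math


def kappa_bound(alpha, delta, lam, B):
    """(a*alpha + b) / (1 - c/alpha) with a = 0.4, b = B + 0.7 d^3/lam, c = d^3/lam."""
    a = 0.4
    b = B + 0.7 * delta ** 3 / lam
    c = delta ** 3 / lam
    if alpha <= c:
        raise ValueError("alpha must exceed c for the bound to be meaningful")
    return (a * alpha + b) / (1 - c / alpha)


def bound_5_12(delta, lam, B):
    return 3 * delta ** 3 / lam + 2 * B


def bound_5_13(delta, lam, B):
    root = math.sqrt(11 * delta ** 3 + 10 * lam * B) + 2 * delta ** 1.5
    return root ** 2 / (10 * lam)


def optimal_alpha(delta, lam, B):
    """Minimizer of alpha -> kappa_bound(alpha, ...), derived by elementary calculus."""
    a = 0.4
    b = B + 0.7 * delta ** 3 / lam
    c = delta ** 3 / lam
    return c + math.sqrt(c * c + b * c / a)


# Illustrative inputs (assumptions): delta = 0.1, lambda = 0.01, B = 0.05.
delta, lam, B = 0.1, 0.01, 0.05
c = delta ** 3 / lam
print(kappa_bound(2 * c, delta, lam, B), bound_5_12(delta, lam, B))
print(kappa_bound(optimal_alpha(delta, lam, B), delta, lam, B), bound_5_13(delta, lam, B))
```

The choice $\alpha = 2c$ trades a slightly larger constant for a much simpler expression.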
Lastly we present a result of Stein (1986), with an improved constant and slightly extended to allow a nonlinear remainder term; this result has the advantage of not requiring the coupling to be bounded. However, the bound supplied by the theorem is typically not of the best order due to its final term. In particular, if $W$ is the sum of i.i.d. variables taking the values $1/\sqrt{n}$ and $-1/\sqrt{n}$ with equal probability and $W'$ is formed from $W$ by replacing a uniformly chosen variable by an independent copy, then $\lambda = 1/n$ and $E|W' - W|^3 = 4/n^{3/2}$, so that the final term in the bound of the theorem below becomes of order $n^{-1/4}$. Nevertheless, in Sect. 14.1 we present a number of important examples where this final term makes no contribution.

Theorem 5.5 If $W$, $W'$ are mean zero, variance 1 exchangeable random variables satisfying
\[ E(W - W' \mid W) = \lambda(W - R) \tag{5.16} \]
for some $\lambda \in (0, 1)$ and some random variable $R$, then
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le B + (2\pi)^{-1/4} \sqrt{\frac{E|W' - W|^3}{\lambda}} + E|R|, \]
where $B$ is given by (5.3).

Proof For $z \in \mathbb{R}$ and $\alpha > 0$ let $f$ be the solution to the Stein equation for $h_{z,\alpha}$, the smoothed indicator given by (2.14). Decompose $f'(W) - W f(W)$ into $J_1 + J_2 + J_3$ as in the proof of Theorem 5.3. Applying the first inequality in (2.15) of Lemma 2.5, we may bound the contribution from $|EJ_3|$ by $E|R|$, and from $|EJ_1|$ by $B$ as in (5.7). Next we claim that for $J_2$, the second term of (5.6), we have
\[ J_2 = \frac{1}{2\lambda}(W' - W) \int_W^{W'} \bigl(f'(W) - f'(t)\bigr)\,dt \]
\[ \phantom{J_2} = \frac{1}{2\lambda}(W' - W) \int_W^{W'} \int_t^{W} f''(u)\,du\,dt \tag{5.17} \]
\[ \phantom{J_2} = \frac{1}{2\lambda}(W - W') \int_W^{W'} (W' - u) f''(u)\,du. \tag{5.18} \]
We obtain (5.18) by first considering $W \le W'$ and rewriting (5.17) as
\[ \begin{aligned}
-\frac{1}{2\lambda}(W' - W) \int_W^{W'} \int_W^{t} f''(u)\,du\,dt &= -\frac{1}{2\lambda}(W' - W) \int_W^{W'} \int_u^{W'} f''(u)\,dt\,du \\
&= -\frac{1}{2\lambda}(W' - W) \int_W^{W'} (W' - u) f''(u)\,du,
\end{aligned} \]
which equals (5.18). When $W' \le W$, similarly we have
\[ \begin{aligned}
\frac{1}{2\lambda}(W' - W) \int_W^{W'} \int_t^{W} f''(u)\,du\,dt &= -\frac{1}{2\lambda}(W' - W) \int_{W'}^{W} \int_t^{W} f''(u)\,du\,dt \\
&= -\frac{1}{2\lambda}(W' - W) \int_{W'}^{W} \int_{W'}^{u} f''(u)\,dt\,du \\
&= -\frac{1}{2\lambda}(W' - W) \int_{W'}^{W} (u - W') f''(u)\,du,
\end{aligned} \]
which is again (5.18).

Since $W$ and $W'$ are exchangeable, the expectation of (5.18) is the same as that of
\[ \frac{1}{2\lambda}(W - W') \int_W^{W'} \Bigl(\frac{W + W'}{2} - u\Bigr) f''(u)\,du, \]
which we bound by the expectation of
\[ \begin{aligned}
\frac{1}{2\lambda} \|f''\|\, |W' - W| \int_{W \wedge W'}^{W \vee W'} \Bigl|\frac{W + W'}{2} - u\Bigr|\,du &= \frac{1}{2\lambda} \|f''\| \frac{|W' - W|^3}{4} \\
&\le \frac{|W' - W|^3}{4\alpha\lambda},
\end{aligned} \]
where for the inequality we used the fact that $|h'_{z,\alpha}(x)| \le 1/\alpha$ for all $x \in \mathbb{R}$, and then applied (2.13). Collecting the bounds, we obtain
\[ \begin{aligned}
P(W \le z) &\le Eh_{z,\alpha}(W) \\
&\le Nh_{z,\alpha} + B + \frac{E|W' - W|^3}{4\alpha\lambda} + E|R| \\
&\le \Phi(z) + \frac{\alpha}{\sqrt{2\pi}} + B + \frac{E|W' - W|^3}{4\alpha\lambda} + E|R|.
\end{aligned} \]
Evaluating the expression at the minimizer
\[ \alpha = \frac{(2\pi)^{1/4}}{2} \sqrt{\frac{E|W' - W|^3}{\lambda}} \]
yields the inequality
\[ P(W \le z) - \Phi(z) \le B + (2\pi)^{-1/4} \sqrt{\frac{E|W' - W|^3}{\lambda}} + E|R|. \]
Proving the corresponding lower bound in a similar manner completes the proof of the theorem.
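The choice of $\alpha$ at the end of the proof, and the $n^{-1/4}$ rate claimed for the $\pm 1/\sqrt{n}$ example preceding Theorem 5.5, can both be checked with a few lines. This is a sketch under the stated example only; the function names are hypothetical, and `m3` stands for $E|W' - W|^3$.

```python
import math


def alpha_part(alpha, m3, lam):
    """The alpha-dependent part of the bound: alpha/sqrt(2 pi) + m3/(4 alpha lam)."""
    return alpha / math.sqrt(2 * math.pi) + m3 / (4 * alpha * lam)


def minimizing_alpha(m3, lam):
    """alpha = ((2 pi)^{1/4} / 2) * sqrt(m3 / lam), the minimizer used in the proof."""
    return (2 * math.pi) ** 0.25 / 2 * math.sqrt(m3 / lam)


def optimized_term(m3, lam):
    """(2 pi)^{-1/4} * sqrt(m3 / lam), the resulting term in Theorem 5.5."""
    return (2 * math.pi) ** (-0.25) * math.sqrt(m3 / lam)


def sign_sum_term(n):
    """Same term for the +/- 1/sqrt(n) example: lam = 1/n, m3 = 4 / n^{3/2}."""
    return optimized_term(4 / n ** 1.5, 1 / n)


for n in (16, 256, 4096):
    # The product with n^{1/4} stays constant, confirming the n^{-1/4} rate.
    print(n, sign_sum_term(n), sign_sum_term(n) * n ** 0.25)
```

In this example the term equals $2(2\pi)^{-1/4} n^{-1/4}$, which is why the theorem does not recover the $n^{-1/2}$ Berry–Esseen rate here.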
5.3 Size Biasing, Kolmogorov Bounds

We now present two results employing size biased couplings, Theorems 5.6 and 5.7, which parallel Theorems 5.4 and 5.3, respectively, for the exchangeable pair. In particular, in Theorem 5.6 we focus on deriving bounds in the Kolmogorov distance in situations where bounded size bias couplings exist, that is, where one can couple the nonnegative variable $Y$ to $Y^s$ having the $Y$-size biased distribution, so that $|Y^s - Y|$ is bounded. In Theorem 5.7 we require the bounded coupling to satisfy an additional monotonicity condition. In principle, Theorem 5.7, like Theorem 5.3, may be applied in situations where $Y^s - Y$ is not bounded.
For $Y$ a nonnegative random variable with positive mean $\mu$, recall that $Y^s$ has the $Y$-size bias distribution if
\[ E\bigl[Y f(Y)\bigr] = \mu E f\bigl(Y^s\bigr) \tag{5.19} \]
for all functions $f$ for which the expectations above exist. When $Y$ has finite positive variance $\sigma^2$, we consider the normalized variables
\[ W = (Y - \mu)/\sigma \quad\text{and, with some abuse of notation,}\quad W^s = \bigl(Y^s - \mu\bigr)/\sigma. \tag{5.20} \]
Given a size bias coupling of $Y$ to $Y^s$, the resulting bounds will be expressed in terms of the quantities $D$ and $\Psi$ given by
\[ D = E\Bigl|E\Bigl(1 - \frac{\mu}{\sigma}\bigl(W^s - W\bigr) \Bigm| W\Bigr)\Bigr| \quad\text{and}\quad \Psi = \sqrt{\mathrm{Var}\bigl(E\bigl(Y^s - Y \mid Y\bigr)\bigr)} \tag{5.21} \]
which obey
\[ D \le \frac{\mu\Psi}{\sigma^2}. \]
To demonstrate the inequality, note that $EY^s = EY^2/\mu$ by (5.19), hence
\[ \frac{\mu}{\sigma} E\bigl(W^s - W\bigr) = \frac{\mu}{\sigma^2}\Bigl(\frac{EY^2}{\mu} - \mu\Bigr) = 1, \]
so the Cauchy–Schwarz inequality yields
\[ D \le \frac{\mu}{\sigma} \sqrt{\mathrm{Var}\bigl(E\bigl(W^s - W \mid W\bigr)\bigr)} = \frac{\mu\Psi}{\sigma^2}. \tag{5.22} \]
Therefore $D$ may be replaced by $\mu\Psi/\sigma^2$ in all the upper bounds in this section and the one following. Note that we cannot apply Theorem 3.5 here, as for a size biased coupling in general there is no guarantee that the function $\hat{K}(t)$ will be non-negative.

Theorem 5.6 Let $Y$ be a nonnegative random variable with finite mean $\mu$ and positive, finite variance $\sigma^2$, and suppose $Y^s$, having the $Y$-size biased distribution, may be coupled to $Y$ so that $|Y^s - Y| \le A$ for some $A$. Then with $W = (Y - \mu)/\sigma$ and $D$ as in (5.21),
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{6\mu A^2}{\sigma^3} + 2D. \tag{5.23} \]
Following Goldstein and Penrose (2010), a more careful optimization in the proof yields the improved bound
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{\mu}{5\sigma^2}\Bigl(\sqrt{\frac{11A^2}{\sigma} + \frac{5\sigma^2}{\mu} D} + \frac{2A}{\sqrt{\sigma}}\Bigr)^2. \]
Again, as for the bound in Theorem 5.4, inequality (5.23) follows from the one above and the fact that $(\sqrt{a} + \sqrt{b})^2 \le 2(a + b)$.
Usually the mean $\mu$ and variance $\sigma^2$ of $Y$ will grow at the same rate, typically $n$, so the bound will asymptotically have order $O(\sigma^{-1})$ when $D$ is of this same order. In Chap. 6, Theorem 5.6 is applied to counting the occurrences of fixed relatively ordered sub-sequences in a random permutation, such as rising sequences, and to counting the occurrences of color patterns, local maxima, and sub-graphs in finite random graphs.

Here we consider a simple application of Theorem 5.6 when $Y$ is the sum of the i.i.d. variables $X_1, \ldots, X_n$ with mean $\theta$ and variance $v^2$, satisfying $0 \le X_i \le A$. In this case $\mu = n\theta$ and $\sigma^2 = nv^2$ so $\mu/\sigma^2 = \theta/v^2$, a constant. Next, applying the construction in Corollary 2.1 we have $Y^s - Y = X_I^s - X_I$, and now using (2.67) and the fact that $X_i$ and $X_i^s$ are nonnegative we obtain
\[ \bigl|Y^s - Y\bigr| = \bigl|X_I^s - X_I\bigr| \le A. \]
Lastly, by independence and exchangeability
\[ \mathrm{Var}\bigl(E\bigl(Y^s - Y \mid Y\bigr)\bigr) = \mathrm{Var}\bigl(E\bigl(X_I^s - X_I \mid Y\bigr)\bigr) = \mathrm{Var}\bigl(EX_I^s - Y/n\bigr) = v^2/n, \]
so $\Psi$ in (5.21), and therefore the resulting bound, is of order $1/\sqrt{n}$, with an explicit constant.

Proof of Theorem 5.6 Fix $z \in \mathbb{R}$ and $\alpha > 0$, and let $f$ solve the Stein equation (2.4) for the linearly smoothed indicator $h_{z,\alpha}(w)$ given in (2.14). Then, letting $W^s = (Y^s - \mu)/\sigma$, applying (5.19) we have
\[ \begin{aligned}
E h_{z,\alpha}(W) - N h_{z,\alpha} &= E\bigl[f'(W) - W f(W)\bigr] \\
&= E\Bigl[f'(W) - \frac{\mu}{\sigma}\bigl(f\bigl(W^s\bigr) - f(W)\bigr)\Bigr] \\
&= E\Bigl[f'(W)\Bigl(1 - \frac{\mu}{\sigma}\bigl(W^s - W\bigr)\Bigr) - \frac{\mu}{\sigma} \int_0^{W^s - W} \bigl(f'(W + t) - f'(W)\bigr)\,dt\Bigr].
\end{aligned} \tag{5.24} \]
For the first term, taking expectation by conditioning and then applying the second inequality in (2.15) of Lemma 2.5, we have
\[ \Bigl|E\Bigl[f'(W)\, E\Bigl(1 - \frac{\mu}{\sigma}\bigl(W^s - W\bigr) \Bigm| W\Bigr)\Bigr]\Bigr| \le D \]
where $D$ is given by (5.21). Hence, letting $\delta = A/\sigma$ so that $|W^s - W| \le \delta$, applying a change of variable on (2.16) of Lemma 2.5 for the second inequality, and then proceeding as in the proof of Theorem 5.4 yields
\[ \begin{aligned}
\bigl|E h_{z,\alpha}(W) - N h_{z,\alpha}\bigr| &\le D + \frac{\mu}{\sigma} E \int_{(W^s - W) \wedge 0}^{(W^s - W) \vee 0} \bigl|f'(W + t) - f'(W)\bigr|\,dt \\
&\le D + \frac{\mu}{\sigma} E \int_{(W^s - W) \wedge 0}^{(W^s - W) \vee 0} \Bigl(\bigl(1 + |W|\bigr)|t| + \alpha^{-1} \int_{t \wedge 0}^{t \vee 0} \mathbf{1}_{\{z \le W + u \le z + \alpha\}}\,du\Bigr)\,dt \\
&\le D + \frac{\mu}{2\sigma}\bigl(1 + E|W|\bigr)\delta^2 + \frac{\mu}{\sigma}\alpha^{-1} \int_{-\delta}^{\delta} \int_{t \wedge 0}^{t \vee 0} (2\kappa + 0.4\alpha)\,du\,dt \\
&\le D + 1.4\frac{\mu}{\sigma}\delta^2 + 2\frac{\mu}{\sigma}\delta^2\alpha^{-1}\kappa.
\end{aligned} \tag{5.25} \]
Now, with $\kappa$ and $\kappa_\alpha$ given in (5.10) and (5.11), respectively, continuing to parallel the proof of Theorem 5.4, taking supremum we see that $\kappa_\alpha$ is bounded by (5.25), and since $\kappa \le 0.4\alpha + \kappa_\alpha$, substitution yields
\[ \kappa \le \frac{a\alpha + b}{1 - c/\alpha}, \quad\text{where } a = 0.4,\ b = D + 1.4\frac{\mu}{\sigma}\delta^2, \text{ and } c = \frac{2\mu\delta^2}{\sigma}. \]
Now setting $\alpha = 2c$ yields $4ac + 2b$, the right hand side of (5.23).
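Returning to the i.i.d. example that precedes the proof, the bound (5.23) can be made fully explicit by replacing $D$ with its upper bound from (5.22) and using $\mathrm{Var}(E(Y^s - Y \mid Y)) = v^2/n$. The sketch below does this; the function name is hypothetical, and the Bernoulli($p$) summands (for which $\theta = p$, $v^2 = p(1-p)$ and $A = 1$) are an assumed illustration.

```python
import math


def size_bias_bound_iid(n, theta, v, A):
    """Bound (5.23) for a sum of n i.i.d. variables in [0, A] with mean theta, sd v."""
    mu = n * theta
    sigma = math.sqrt(n) * v
    # Square root of Var(E(Y^s - Y | Y)) = v^2 / n, as computed in the example.
    psi = v / math.sqrt(n)
    # D is replaced by its upper bound mu * psi / sigma^2 from (5.22).
    d_upper = mu * psi / sigma ** 2
    return 6 * mu * A ** 2 / sigma ** 3 + 2 * d_upper


# Illustrative choice (assumption): Bernoulli(p) summands with p = 0.3.
p = 0.3
theta, v, A = p, math.sqrt(p * (1 - p)), 1.0
for n in (100, 10_000, 1_000_000):
    print(n, size_bias_bound_iid(n, theta, v, A))
```

The result equals $(6\theta A^2/v^3 + 2\theta/v)/\sqrt{n}$, so the bound is informative only once $n$ is large relative to this explicit constant.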
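We also present Theorem 5.7 which may be applied when the size bias coupling is monotone, that is, when $Y^s \ge Y$ almost surely. The proof depends on the following concentration inequality, which is in some sense the 'size bias' version of Lemma 5.1.

Lemma 5.2 Let $Y$ be a nonnegative random variable with mean $\mu$ and finite positive variance $\sigma^2$, and let $Y^s$ be given on the same space as $Y$, with the $Y$-size biased distribution, satisfying $Y^s \ge Y$. Then with $W = (Y - \mu)/\sigma$ and $W^s = (Y^s - \mu)/\sigma$, for any $z \in \mathbb{R}$ and $a \ge 0$,
\[ \frac{\mu}{\sigma}\,E\big[(W^s - W)\,\mathbf{1}_{\{W^s - W \le a\}}\,\mathbf{1}_{\{z \le W \le z+a\}}\big] \le a. \]

Proof Let
\[ f(w) = \begin{cases} -a, & w \le z, \\ w - z - a, & z < w \le z + 2a, \\ a, & w > z + 2a. \end{cases} \]
Then
\begin{align*}
a \ge E\,W f(W) &= \frac{1}{\sigma}\,E\,(Y - \mu)\,f\Big(\frac{Y - \mu}{\sigma}\Big) \\
&= \frac{\mu}{\sigma}\,E\big[f(W^s) - f(W)\big] \\
&= \frac{\mu}{\sigma}\,E\int_0^{W^s - W} f'(W + t)\,dt \\
&\ge \frac{\mu}{\sigma}\,E\int_0^{W^s - W}\mathbf{1}_{\{0 \le t \le a\}}\,\mathbf{1}_{\{z \le W \le z+a\}}\,f'(W + t)\,dt.
\end{align*}
Noting that $f'(w + t) = \mathbf{1}_{\{z \le w + t \le z + 2a\}}$, we have
\[ \mathbf{1}_{\{0 \le t \le a\}}\,\mathbf{1}_{\{z \le W \le z+a\}}\,f'(W + t) = \mathbf{1}_{\{0 \le t \le a\}}\,\mathbf{1}_{\{z \le W \le z+a\}}, \]
and therefore
\begin{align*}
a &\ge \frac{\mu}{\sigma}\,E\int_0^{W^s - W}\mathbf{1}_{\{0 \le t \le a\}}\,\mathbf{1}_{\{z \le W \le z+a\}}\,dt \\
&= \frac{\mu}{\sigma}\,E\big[\min(a, W^s - W)\,\mathbf{1}_{\{z \le W \le z+a\}}\big] \\
&\ge \frac{\mu}{\sigma}\,E\big[(W^s - W)\,\mathbf{1}_{\{W^s - W \le a\}}\,\mathbf{1}_{\{z \le W \le z+a\}}\big].
\end{align*}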
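With the use of Lemma 5.2 we present the following result for monotone size bias couplings, from Goldstein and Zhang (2010).

Theorem 5.7 Let $Y$ be a nonnegative random variable with mean $\mu$ and finite positive variance $\sigma^2$, and let $Y^s$ be given on the same space as $Y$, with the $Y$-size biased distribution, satisfying $Y^s \ge Y$. Then with $W = (Y - \mu)/\sigma$ and $W^s = (Y^s - \mu)/\sigma$, for any $a \ge 0$,
\[ \sup_{z \in \mathbb{R}}\big|P(W \le z) - P(Z \le z)\big| \le D + 0.82\,\frac{a^2\mu}{\sigma} + a + \frac{\mu}{\sigma}\,E\big[(W^s - W)\,\mathbf{1}_{\{W^s - W > a\}}\big], \]
where $D$ is as in (5.21). If $W^s - W \le \delta$ with probability 1,
\[ \sup_{z \in \mathbb{R}}\big|P(W \le z) - P(Z \le z)\big| \le D + 0.82\,\frac{\delta^2\mu}{\sigma} + \delta. \]

Proof Let $z \in \mathbb{R}$ and let $f$ be the solution to the Stein equation (2.4) for $h(w) = \mathbf{1}_{\{w \le z\}}$. Decompose $Eh(W) - Nh$ as in (5.24) in the proof of Theorem 5.6, and bound, as there, the first term by $D$, noting that (2.8) applies in the present case. For the remaining term of (5.24) we write
\begin{align*}
\frac{\mu}{\sigma}\int_0^{W^s - W}\big(f'(W+t) - f'(W)\big)\,dt
&= \frac{\mu}{\sigma}\,\mathbf{1}_{\{W^s - W > a\}}\int_0^{W^s - W}\big(f'(W+t) - f'(W)\big)\,dt \\
&\quad + \frac{\mu}{\sigma}\,\mathbf{1}_{\{W^s - W \le a\}}\int_0^{W^s - W}\big(f'(W+t) - f'(W)\big)\,dt \\
&:= J_1 + J_2, \text{ say.}
\end{align*}
By (2.8),
\[ |EJ_1| \le \frac{\mu}{\sigma}\,E\big[(W^s - W)\,\mathbf{1}_{\{W^s - W > a\}}\big], \]
yielding the last term in the first bound of the theorem. Now express $J_2$ using (2.4) as the sum
\begin{align*}
&\frac{\mu}{\sigma}\,\mathbf{1}_{\{W^s - W \le a\}}\int_0^{W^s - W}\big((W+t)f(W+t) - Wf(W)\big)\,dt \\
&\quad + \frac{\mu}{\sigma}\,\mathbf{1}_{\{W^s - W \le a\}}\int_0^{W^s - W}\big(\mathbf{1}_{\{W + t \le z\}} - \mathbf{1}_{\{W \le z\}}\big)\,dt. \tag{5.26}
\end{align*}
Applying (2.10) to the first term in (5.26) shows that the absolute value of its expectation is bounded by
\begin{align*}
\frac{\mu}{\sigma}\,E\Big[\mathbf{1}_{\{W^s - W \le a\}}\int_0^{W^s - W}\Big(|W| + \frac{\sqrt{2\pi}}{4}\Big)t\,dt\Big]
&\le \frac{\mu}{2\sigma}\,E\Big[(W^s - W)^2\,\mathbf{1}_{\{W^s - W \le a\}}\Big(|W| + \frac{\sqrt{2\pi}}{4}\Big)\Big] \\
&\le \frac{\mu}{2\sigma}\,a^2\Big(1 + \frac{\sqrt{2\pi}}{4}\Big) \\
&\le 0.82\,\frac{a^2\mu}{\sigma}.
\end{align*}
Taking the expectation of the absolute value of the second term in (5.26), we have
\begin{align*}
\frac{\mu}{\sigma}\,E\Big|\mathbf{1}_{\{W^s - W \le a\}}\int_0^{W^s - W}\big(\mathbf{1}_{\{W + t \le z\}} - \mathbf{1}_{\{W \le z\}}\big)\,dt\Big|
&= \frac{\mu}{\sigma}\,E\Big[\mathbf{1}_{\{W^s - W \le a\}}\int_0^{W^s - W}\mathbf{1}_{\{z - t < W \le z\}}\,dt\Big] \\
&\le \frac{\mu}{\sigma}\,E\big[(W^s - W)\,\mathbf{1}_{\{W^s - W \le a\}}\,\mathbf{1}_{\{z - a < W \le z\}}\big],
\end{align*}
which is bounded by $a$, by Lemma 5.2, completing the proof of the first claim. The second claim follows immediately by letting $a = \delta$.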
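The constant $0.82$ in Theorem 5.7 is a rounding up of $\frac12\big(1 + \sqrt{2\pi}/4\big)$, as the proof shows. The sketch below, with hypothetical function names and user-supplied inputs `mu`, `sigma`, `D` and `delta`, computes that constant and evaluates the second bound of the theorem for a coupling satisfying $0 \le W^s - W \le \delta$.

```python
import math


def exact_constant():
    """The unrounded constant (1/2) * (1 + sqrt(2*pi)/4) from the proof."""
    return 0.5 * (1.0 + math.sqrt(2.0 * math.pi) / 4.0)


def monotone_coupling_bound(mu, sigma, D, delta):
    """Second bound of Theorem 5.7: D + 0.82 * delta^2 * mu / sigma + delta.

    Assumes the monotone coupling satisfies 0 <= W^s - W <= delta a.s.
    and that D has been computed separately from (5.21).
    """
    if delta < 0:
        raise ValueError("delta must be nonnegative")
    return D + 0.82 * delta ** 2 * mu / sigma + delta
```

The unrounded constant is roughly $0.8133$, so using $0.82$ loses very little.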
5.4 Size Biasing and Smoothing Inequalities

In this section we present one further result which may be applied in situations where bounded size bias couplings exist. The method here, using smoothing inequalities, yields bounds in terms of suprema over function classes $\mathcal{H}$, and is more general than methods which only produce bounds in the Kolmogorov distance. Naturally, we may pay the price in larger constants. We follow the approach of Rinott and Rotar (1997), itself stemming from Götze (1991) and Bhattacharya and Rao (1986).
In order to state our results we now introduce conditions on the function classes $\mathcal{H}$ we consider. Since in Sect. 12.4 we will consider approximation in $\mathbb{R}^p$, we state Condition 5.1 in this generality, and in particular we take $Z$ in (iii) to be a standard normal variable with mean zero and identity covariance matrix in this space. In the present chapter we consider only the one dimensional case $p = 1$.

Condition 5.1 $\mathcal{H}$ is a class of real valued measurable functions on $\mathbb{R}^p$ such that
(i) The functions $h \in \mathcal{H}$ are uniformly bounded in absolute value by a constant, which we take to be 1 without loss of generality.
(ii) For any real numbers $c$ and $d$, and for any $h(x) \in \mathcal{H}$, the function $h(cx + d) \in \mathcal{H}$.
(iii) For any $\epsilon > 0$ and $h \in \mathcal{H}$, the functions $h_\epsilon^+$, $h_\epsilon^-$ are also in $\mathcal{H}$, and
\[ E\,\tilde{h}_\epsilon(Z) \le a\epsilon \tag{5.27} \]
for some constant $a$ that depends only on the class $\mathcal{H}$, where
\[ h_\epsilon^+(x) = \sup_{|y| \le \epsilon} h(x + y), \qquad h_\epsilon^-(x) = \inf_{|y| \le \epsilon} h(x + y) \qquad\text{and}\qquad \tilde{h}_\epsilon(x) = h_\epsilon^+(x) - h_\epsilon^-(x). \tag{5.28} \]

Given a function class $\mathcal{H}$ and random variables $X$ and $Y$, let
\[ \big\|\mathcal{L}(X) - \mathcal{L}(Y)\big\|_{\mathcal{H}} = \sup_{h \in \mathcal{H}}\big|Eh(X) - Eh(Y)\big|. \tag{5.29} \]
In one dimension, the collection of indicators of all half lines, and indicators of all intervals, each form classes $\mathcal{H}$ that satisfy Condition 5.1 with $a = \sqrt{2/\pi}$ and $a = 2\sqrt{2/\pi}$ respectively (see e.g. Rinott and Rotar 1997); clearly, in the first case the distance (5.29) specializes to the Kolmogorov metric.

Theorem 5.8 Let $Y$ be a nonnegative random variable with finite, nonzero mean $\mu$ and variance $\sigma^2 \in (0, \infty)$, and suppose there exists a variable $Y^s$, having the $Y$-size biased distribution, defined on the same space as $Y$, satisfying $|Y^s - Y| \le A$ for some $A \le \sigma^{3/2}/\sqrt{9\mu}$. Then, when $\mathcal{H}$ satisfies Condition 5.1 for some constant $a$,
\[ \big\|\mathcal{L}(W) - \mathcal{L}(Z)\big\|_{\mathcal{H}} \le 0.21\,\frac{aA}{\sigma} + \frac{\mu}{\sigma}\Big((12.4 + 58.1a)\,\frac{A^2}{\sigma^2} + 2.5\,\frac{A^3}{\sigma^3}\Big) + 15D, \]
where $W = (Y - \mu)/\sigma$, $Z$ is a standard normal, and $D$ is given by (5.21).

Specializing to the case where $\mathcal{H}$ is the collection of indicators of half lines and $a = \sqrt{2/\pi}$ yields the bound
\[ \sup_{z \in \mathbb{R}}\big|P(W \le z) - P(Z \le z)\big| \le \frac{0.17A}{\sigma} + \frac{\mu}{\sigma}\Big(\frac{58.8A^2}{\sigma^2} + \frac{2.5A^3}{\sigma^3}\Big) + 15D, \]
demonstrating, by comparison with, say, the bound in Theorem 5.6, that the consideration of general function classes $\mathcal{H}$ comes at some expense. One reason for the increase in the magnitude of the constants is that general bounds on the solution to the Stein equation, as given by Lemma 2.4, must be applied here, and not, say, the more specialized bounds of Lemma 2.3 which require that the function $h$ be an indicator.

Let $\phi(y)$ denote the standard normal density and for $h \in \mathcal{H}$ and $t \in (0, 1)$, define
\[ h_t(w) = \int h(w + ty)\,\phi(y)\,dy. \tag{5.30} \]
The function $h_t(w)$ is a smoothed version of $h(w)$, with smoothing parameter $t$, and clearly $\|h_t\| \le \|h\|$. Furthermore, in this section, let
\[ \kappa = \sup\big\{|Eh(W) - Nh| : h \in \mathcal{H}\big\} \tag{5.31} \]
and for $t \in (0, 1)$ set
\[ \kappa_t = \sup\big\{|Eh_t(W) - Nh_t| : h \in \mathcal{H}\big\}. \]

Lemma 5.3 Let $\mathcal{H}$ be a class of functions satisfying Condition 5.1 with constant $a$. Then, for any random variable $W$,
\[ \kappa \le 2.8\kappa_t + 4.7at \quad\text{for all } t \in (0, 1). \tag{5.32} \]
Furthermore, for all $\delta > 0$, $t \in (0, 1)$ and $\tilde{h}_\epsilon$ as in Condition 5.1,
\[ E\int \tilde{h}_{\delta + t|y|}(W)\,|\phi'(y)|\,dy \le 1.6\kappa + a(\delta + t). \tag{5.33} \]
Proof Inequality (5.32) is Lemma 4.1 of Rinott and Rotar (1997), following Lemma 2.11 of Götze (1991) from Bhattacharya and Rao (1986). As in Rinott and Rotar (1997), adding and subtracting to the left hand side of (5.33) we have
\begin{align*}
&\Big|E\Big\{\int\big(\tilde{h}_{\delta + t|y|}(W) - \tilde{h}_{\delta + t|y|}(Z)\big)|\phi'(y)|\,dy + \int\tilde{h}_{\delta + t|y|}(Z)\,|\phi'(y)|\,dy\Big\}\Big| \\
&\quad\le \int\big|E\tilde{h}_{\delta + t|y|}(W) - E\tilde{h}_{\delta + t|y|}(Z)\big|\,|\phi'(y)|\,dy + \int E\tilde{h}_{\delta + t|y|}(Z)\,|\phi'(y)|\,dy \\
&\quad\le 1.6\kappa + \int a(\delta + t|y|)\,|\phi'(y)|\,dy \le 1.6\kappa + a(\delta + t),
\end{align*}
where we have used the definitions of $\tilde{h}_\epsilon$ and $\kappa$ and that $\int|\phi'(y)|\,dy = \sqrt{2/\pi}$ for the first term, and then additionally (5.27) and $\int|y|\,|\phi'(y)|\,dy = 1$ for the second.

Lemma 5.4 Let $Y \ge 0$ be a random variable with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$, and let $Y^s$ be defined on the same space as $Y$, with the $Y$-size biased distribution, satisfying $|Y^s - Y|/\sigma \le \delta$ for some $\delta$. Then for all $t \in (0, 1)$,
\[ \kappa_t \le 4D + \frac{\mu}{\sigma}\Big[\Big(3.3 + \frac{1}{2}a\Big)\delta^2 + \frac{2}{3}\delta^3 + \frac{1}{2t}\big(1.6\kappa\delta^2 + a\delta^3\big)\Big], \tag{5.34} \]
with $D$ as in (5.21).
Proof With $h \in \mathcal{H}$ and $t \in (0, 1)$ let $f$ be the solution to the Stein equation (2.4) for $h_t$. Letting $W = (Y - \mu)/\sigma$ and $W^s = (Y^s - \mu)/\sigma$ we have $|W^s - W| \le \delta$. From (5.19) we obtain
\[ E\,Wf(W) = \frac{\mu}{\sigma}\,E\big[f(W^s) - f(W)\big], \tag{5.35} \]
and so, letting $V = W^s - W$,
\begin{align*}
Eh_t(W) - Nh_t &= E\big[f'(W) - Wf(W)\big] \\
&= E\Big[f'(W) - \frac{\mu}{\sigma}\big(f(W^s) - f(W)\big)\Big] \\
&= E\Big[f'(W) - \frac{\mu}{\sigma}\int_W^{W^s} f'(w)\,dw\Big] \\
&= E\Big[f'(W) - \frac{\mu}{\sigma}\,V\int_0^1 f'(W + uV)\,du\Big] \\
&= E\Big[f'(W)\Big(1 - \frac{\mu}{\sigma}V\Big)\Big] + E\Big[\frac{\mu}{\sigma}\,Vf'(W) - \frac{\mu}{\sigma}\,V\int_0^1 f'(W + uV)\,du\Big]. \tag{5.36}
\end{align*}
Bounding the first term in (5.36), by (2.12) and that $\|h_t\| \le 1$, and definition (5.21), we have
\[ \Big|E\Big\{f'(W)\,E\Big(1 - \frac{\mu}{\sigma}V\,\Big|\,W\Big)\Big\}\Big| \le 4D. \tag{5.37} \]
By (5.30) and a change of variable we may write
\[ h_t(w + s) - h_t(w) = \int h(w + ty)\big(\phi(y - s/t) - \phi(y)\big)\,dy, \tag{5.38} \]
so, for the second term in (5.36), applying the dominated convergence theorem in (5.38) and differentiating the Stein equation (2.4),
\[ f''(w) = f(w) + wf'(w) + h_t'(w) \quad\text{with}\quad h_t'(w) = -\frac{1}{t}\int h(w + ty)\,\phi'(y)\,dy. \tag{5.39} \]
Hence, we may write the second term in (5.36) as the expectation of
\begin{align*}
\frac{\mu}{\sigma}\,V\Big(f'(W) - \int_0^1 f'(W + uV)\,du\Big)
&= \frac{\mu}{\sigma}\,V\int_0^1\big(f'(W) - f'(W + uV)\big)\,du \\
&= -\frac{\mu}{\sigma}\,V\int_0^1\int_W^{W + uV} f''(v)\,dv\,du \\
&= -\frac{\mu}{\sigma}\,V\int_0^1\int_W^{W + uV}\big(f(v) + vf'(v) + h_t'(v)\big)\,dv\,du. \tag{5.40}
\end{align*}
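We apply the triangle inequality and bound the three resulting terms separately. For the expectation arising from the first term on the right-hand side of (5.40), by (2.12) and that $\|h_t\| \le 1$ we have
\[ \Big|E\Big\{\frac{\mu}{\sigma}\,V\int_0^1\int_W^{W + uV} f(v)\,dv\,du\Big\}\Big| \le \sqrt{2\pi}\,\frac{\mu}{\sigma}\,E\Big(|V|\int_0^1 u|V|\,du\Big) \le 1.3\,\frac{\mu}{\sigma}\,\delta^2. \tag{5.41} \]
For the second term in (5.40), again applying (2.12),
\begin{align*}
\Big|E\Big\{\frac{\mu}{\sigma}\,V\int_0^1\int_W^{W + uV} vf'(v)\,dv\,du\Big\}\Big|
&\le \frac{2\mu}{\sigma}\,E\Big(|V|\int_0^1\Big|\int_W^{W + uV} 2|v|\,dv\Big|\,du\Big) \\
&\le \frac{2\mu}{\sigma}\,E\Big(|V|\int_0^1\big(2u|WV| + u^2V^2\big)\,du\Big) \\
&\le \frac{2\mu}{\sigma}\,\delta\int_0^1\big(2\delta uE|W| + u^2\delta^2\big)\,du \\
&\le \frac{2\mu}{\sigma}\,\delta\big(\delta + \delta^2/3\big). \tag{5.42}
\end{align*}
For the last term in (5.40), beginning with the inner integral, we have
\[ \int_W^{W + uV} h_t'(v)\,dv = uV\int_0^1 h_t'(W + xuV)\,dx, \]
and using (5.39), $\int\phi'(y)\,dy = 0$, and Lemma 5.3 we have
\begin{align*}
\Big|\frac{\mu}{\sigma}\,E\,V^2\int_0^1\int_0^1 u\,h_t'(W + xuV)\,dx\,du\Big|
&= \Big|\frac{\mu}{\sigma t}\,E\,V^2\int_0^1\int_0^1\int u\,h(W + xuV + ty)\,\phi'(y)\,dy\,dx\,du\Big| \\
&= \Big|\frac{\mu}{\sigma t}\,E\,V^2\int_0^1\int_0^1\int u\big(h(W + xuV + ty) - h(W + xuV)\big)\phi'(y)\,dy\,dx\,du\Big| \\
&\le \frac{\mu}{\sigma t}\,E\Big(V^2\int\int_0^1 u\big(h^+_{|V| + t|y|}(W) - h^-_{|V| + t|y|}(W)\big)|\phi'(y)|\,du\,dy\Big) \\
&= \frac{\mu}{2\sigma t}\,E\Big(V^2\int\big(h^+_{|V| + t|y|}(W) - h^-_{|V| + t|y|}(W)\big)|\phi'(y)|\,dy\Big) \\
&\le \frac{\mu}{2\sigma t}\,\delta^2\,E\int\tilde{h}_{\delta + t|y|}(W)\,|\phi'(y)|\,dy \\
&\le \frac{\mu}{2\sigma t}\,\delta^2\big(1.6\kappa + a(\delta + t)\big) \\
&= \frac{\mu}{2\sigma t}\big(1.6\kappa\delta^2 + a\delta^3\big) + \frac{\mu}{2\sigma}\,a\delta^2. \tag{5.43}
\end{align*}
Combining (5.37), (5.41), (5.42), and (5.43) completes the proof.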
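Proof of Theorem 5.8 Substituting (5.34) into (5.32) of Lemma 5.3 we obtain
\[ \kappa \le 2.8\Big(4D + \frac{\mu}{\sigma}\Big[\Big(3.3 + \frac{1}{2}a\Big)\delta^2 + \frac{2}{3}\delta^3 + \frac{1}{2t}\big(1.6\kappa\delta^2 + a\delta^3\big)\Big]\Big) + 4.7at, \]
or,
\[ \kappa \le \frac{2.8\big(4D + (\mu/\sigma)\big((3.3 + \tfrac{1}{2}a)\delta^2 + \tfrac{2}{3}\delta^3 + a\delta^3/(2t)\big)\big) + 4.7at}{1 - 2.24\mu\delta^2/(\sigma t)}. \tag{5.44} \]
Setting $t = 4 \times 2.24\mu\delta^2/\sigma$, which is a number in $(0, 1)$ since $\delta \le (\sigma/(9\mu))^{1/2}$, we obtain
\begin{align*}
\kappa &\le \frac{4}{3} \times 2.8\Big(4D + \frac{\mu}{\sigma}\Big[\Big(3.3 + \frac{1}{2}a\Big)\delta^2 + \frac{2}{3}\delta^3 + \frac{\sigma a\delta^3}{2(8.96\mu\delta^2)}\Big]\Big) + \frac{4}{3} \times 4.7a\Big(\frac{8.96\mu\delta^2}{\sigma}\Big) \\
&\le 0.21a\delta + \frac{\mu}{\sigma}\big((12.4 + 58.1a)\delta^2 + 2.5\delta^3\big) + 15D.
\end{align*}
Substituting $\delta = A/\sigma$ now completes the proof.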
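The rounded constants in Theorem 5.8 can be traced back to the choice $t = 8.96\mu\delta^2/\sigma$, which makes the denominator in (5.44) equal to $3/4$. The following sketch recomputes the unrounded coefficients and evaluates the bound; the function names are hypothetical and the value of `D` is assumed to be supplied from (5.21).

```python
import math


def smoothing_coefficients():
    """Unrounded coefficients after setting t = 8.96*mu*delta^2/sigma in (5.44)."""
    inv_denominator = 4.0 / 3.0          # 1 / (1 - 2.24/8.96)
    prefactor = inv_denominator * 2.8
    return {
        "D": prefactor * 4.0,                                        # rounded to 15
        "a_delta": prefactor / (2.0 * 8.96),                         # rounded to 0.21
        "delta2": prefactor * 3.3,                                   # rounded to 12.4
        "a_delta2": prefactor * 0.5 + inv_denominator * 4.7 * 8.96,  # rounded to 58.1
        "delta3": prefactor * 2.0 / 3.0,                             # rounded to 2.5
    }


def theorem_5_8_bound(mu, sigma, A, D, a):
    """Bound of Theorem 5.8 with the rounded constants of the statement."""
    if A > sigma ** 1.5 / math.sqrt(9.0 * mu):
        raise ValueError("requires A <= sigma^(3/2) / sqrt(9*mu)")
    delta = A / sigma
    return (0.21 * a * delta
            + (mu / sigma) * ((12.4 + 58.1 * a) * delta ** 2 + 2.5 * delta ** 3)
            + 15.0 * D)


HALF_LINE_CONSTANT = math.sqrt(2.0 / math.pi)  # the constant a for half-line indicators
```

With `a = HALF_LINE_CONSTANT` the products `0.21 * a` and `12.4 + 58.1 * a` are roughly $0.168$ and $58.76$, which round up to the constants $0.17$ and $58.8$ of the Kolmogorov specialization.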
Chapter 6
L∞ : Applications
In this chapter we consider the application of the results of Chap. 5 to obtain L∞ bounds for the combinatorial central limit theorem, counting the occurrences of patterns, the anti-voter model, and for the binary expansion of a random integer.
6.1 Combinatorial Central Limit Theorem

Recall that in the combinatorial central limit theorem we study the distribution of
\[ Y = \sum_{i=1}^n a_{i,\pi(i)} \tag{6.1} \]
where $A = \{a_{ij}\}_{i,j=1}^n$ is a given array of real numbers and $\pi$ a random permutation. This setting was introduced in Example 2.3, and $L^1$ bounds to the normal were derived in Sect. 4.4 for the case where $\pi$ is chosen uniformly from the symmetric group $S_n$; some further background, motivation, applications, references and history on the combinatorial CLT were also presented in that section.

For $\pi$ chosen uniformly, von Bahr (1976) and Ho and Chen (1978) obtained $L^\infty$ bounds to the normal when the matrix $A$ is random, which yield the correct rate $O(n^{-1/2})$ only under some boundedness conditions. Here we focus on the case where $A$ is non-random. In Sect. 6.1.1 we present the result of Bolthausen (1984), which gives a bound of the correct order in terms of a third-moment quantity of the type (4.107), but with an unspecified constant. In this same section, based on Goldstein (2005), we give bounds of the correct order and with an explicit constant, but in terms of the maximum absolute array value. In Sect. 6.1.2 we also give $L^\infty$ bounds when the distribution of the permutation $\pi$ is constant on cycle type and has no fixed points, expressing the bounds again in terms of the maximum array value.

For the last two results mentioned we make use of Lemma 4.6, which, given $\pi$, constructs permutations $\pi^\dagger$ and $\pi^\ddagger$ on the same space as $\pi$ such that
\[ Y^\dagger = \sum_{i=1}^n a_{i\pi^\dagger(i)} \quad\text{and}\quad Y^\ddagger = \sum_{i=1}^n a_{i\pi^\ddagger(i)} \]
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_6, © Springer-Verlag Berlin Heidelberg 2011
have the square bias distribution as in Proposition 4.6. As noted in Sect. 4.4 for $L^1$ bounds in the uniform case, the permutations $\pi$, $\pi^\dagger$ and $\pi^\ddagger$ agree on the complement of some small index set $\mathcal{I}$, and hence we may write
\[ Y = S + T, \qquad Y^\dagger = S + T^\dagger \qquad\text{and}\qquad Y^\ddagger = S + T^\ddagger, \tag{6.2} \]
where
\[ S = \sum_{i \notin \mathcal{I}} a_{i,\pi(i)}, \qquad T = \sum_{i \in \mathcal{I}} a_{i,\pi(i)}, \qquad T^\dagger = \sum_{i \in \mathcal{I}} a_{i,\pi^\dagger(i)} \qquad\text{and}\qquad T^\ddagger = \sum_{i \in \mathcal{I}} a_{i,\pi^\ddagger(i)}. \tag{6.3} \]
Now, as $Y^* = UY^\dagger + (1 - U)Y^\ddagger$ has the $Y$-zero bias distribution by Proposition 4.6, we have
\[ |Y^* - Y| = \big|UT^\dagger + (1 - U)T^\ddagger - T\big| \le U|T^\dagger| + (1 - U)|T^\ddagger| + |T|. \tag{6.4} \]
Hence when $\mathcal{I}$ is almost surely bounded (6.4) gives an upper bound on $|Y^* - Y|$ equal to the largest size of $\mathcal{I}$ times twice the largest absolute array value. Now Theorem 5.1 for bounded zero bias couplings yields an $L^\infty$ norm bound in any instance where such constructions can be achieved.

In the remainder of this section, to avoid trivial cases we assume that $\mathrm{Var}(Y) = \sigma^2 > 0$, and for ease of notation we will write $Y$ and $\pi$ interchangeably for $Y'$ and $\pi'$, respectively.
6.1.1 Uniform Distribution on the Symmetric Group

We approach the uniform permutation case in two different ways, first using zero biasing, then by an inductive method. Using zero biasing, combining the coupling given in Sect. 4.4 with Theorem 5.1 quickly leads to the following result.

Theorem 6.1 Let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers and let $\pi$ be a random permutation with uniform distribution over $S_n$. Then, with $Y$ as in (6.1) and $W = (Y - \mu)/\sigma$,
\[ \sup_{z \in \mathbb{R}}\big|P(W \le z) - P(Z \le z)\big| \le 16.3\,C/\sigma \quad\text{for } n \ge 3, \]
where $\mu$ and $\sigma^2 = \mathrm{Var}(Y)$ are given by (4.105), and
\[ C = \max_{1 \le i,j \le n}\big|a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}\big|, \]
with the row, column and overall array averages $a_{i\cdot}$, $a_{\cdot j}$ and $a_{\cdot\cdot}$ as in (2.44).

Proof By (2.45) we may first replace $a_{ij}$ by $a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}$, and in particular assume $EY = 0$. Following the construction in Lemma 4.6, we obtain the variable $Y^* = UY^\dagger + (1 - U)Y^\ddagger$ with the $Y$-zero biased distribution, where $Y$, $Y^\dagger$ and $Y^\ddagger$ may be written as in (6.2) and (6.3) with
\[ |\mathcal{I}| = \big|\{I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger)\}\big| \le 4 \]
by (4.135). As $W^* = Y^*/\sigma$ has the $W$-zero bias distribution by (2.59), applying (6.4), in which each of $|T^\dagger|$, $|T^\ddagger|$ and $|T|$ is at most $4C$, we obtain
\[ |W^* - W| = |Y^* - Y|/\sigma \le 8C/\sigma. \]
Our claim now follows from Theorem 5.1 by taking $\delta = 8C/\sigma$.
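To make the quantities in Theorem 6.1 concrete, the sketch below computes the doubly centered array, the constant $C$ and the bound $16.3\,C/\sigma$ for a small array. It is an illustration under stated assumptions: the function names are hypothetical, and the variance is computed from the standard expression $\sigma^2 = \frac{1}{n-1}\sum_{i,j}(a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot})^2$, which is assumed here to be what (4.105) provides. A brute-force enumeration over all permutations is included as a cross-check for small $n$.

```python
import itertools
import math


def centered(a):
    """Return the array with entries a_ij - a_i. - a_.j + a_.. (row, column, overall means)."""
    n = len(a)
    row = [sum(a[i]) / n for i in range(n)]
    col = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[a[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def clt_variance(a):
    """Variance of Y = sum_i a[i][pi(i)] for uniform pi (assumed form of (4.105))."""
    n = len(a)
    return sum(x * x for r in centered(a) for x in r) / (n - 1)


def exact_variance(a):
    """Brute-force variance of Y over all n! permutations; only for small n."""
    n = len(a)
    values = [sum(a[i][p[i]] for i in range(n))
              for p in itertools.permutations(range(n))]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def theorem_6_1_bound(a):
    """Kolmogorov distance bound 16.3 * C / sigma of Theorem 6.1 (n >= 3)."""
    n = len(a)
    if n < 3:
        raise ValueError("Theorem 6.1 is stated for n >= 3")
    var = clt_variance(a)
    if var <= 0:
        raise ValueError("the variance must be positive")
    C = max(abs(x) for r in centered(a) for x in r)
    return 16.3 * C / math.sqrt(var)
```

The bound is informative only when $C/\sigma$ is small, which for bounded arrays with variance of order $n$ happens as $n$ grows; for a small array such as a $4 \times 4$ example it will typically exceed 1.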
With a bit more work, we can use the zero bias variation of Ghosh (2009) on the inductive method in Bolthausen (1984) to prove an $L^\infty$ bound depending on a third moment type quantity of the array, like the $L^1$ bound in Theorem 4.8. On the other hand, the bound in Theorem 6.2 depends on an unspecified constant, whereas the constant in Theorem 6.1 is explicit. Though induction was used in Sect. 3.4.2 for the independent case, the inductive approach taken here has a somewhat different flavor. Bolthausen's inductive method has also been put to use by Fulman (2006) for character ratios, and by Goldstein (2010b) for addressing questions about random graphs.

Theorem 6.2 Let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers and let $\pi$ be a random permutation with the uniform distribution over $S_n$. Let $Y$ be as in (6.1) and $\mu_A$ and $\sigma_A^2 = \mathrm{Var}(Y)$ be given by (4.105). Then, with $W = (Y - \mu_A)/\sigma_A$, there exists a constant $c$ such that
\[ \sup_{z \in \mathbb{R}}\big|P(W \le z) - P(Z \le z)\big| \le c\,\gamma_A/\big(\sigma_A^3\,n\big) \quad\text{for all } n \ge 2, \]
where $\gamma_A$ is given in (4.107).

To prepare for the proof we need some additional notation. For $n \in \mathbb{N}$ and an array $E \in \mathbb{R}^{n \times n}$ let
\[ W_E = \sum_{i=1}^n e_{i,\pi(i)}, \]
and let $E^0$ be the centered array with components
\[ e^0_{ij} = e_{ij} - e_{i\cdot} - e_{\cdot j} + e_{\cdot\cdot} \tag{6.5} \]
where the array averages are given by (2.44). In addition, when $\sigma_E^2 > 0$, let $\widetilde{E}$ be the array given by
\[ \tilde{e}_{ij} = e^0_{ij}/\sigma_E, \tag{6.6} \]
and set $\beta_E = \gamma_E/\sigma_E^3$. Clearly, if $E$ is an array with $\sigma_E^2 > 0$ then
\[ \beta_E = \beta_{E^0} = \beta_{\widetilde{E}}. \tag{6.7} \]
For any $E \in \mathbb{R}^{n \times n}$ let $E'$ be the truncated array whose components are given by
\[ e'_{ij} = e_{ij}\,\mathbf{1}\big(|e_{ij}| \le 1/2\big). \tag{6.8} \]
For $\beta > 0$ let
\begin{align*}
M_n(\beta) &= \big\{E \in \mathbb{R}^{n \times n} : e_{\cdot j} = e_{i\cdot} = 0 \text{ for all } i, j = 1, \ldots, n,\ \sigma_E^2 = 1,\ \beta_E \le \beta\big\}, \\
M_n^1(\beta) &= \big\{E \in M_n(\beta) : |e_{ij}| \le 1 \text{ for all } i, j = 1, \ldots, n\big\},
\end{align*}
and
\[ M_n = \bigcup_{\beta > 0} M_n(\beta) \quad\text{and}\quad M_n^1 = \bigcup_{\beta > 0} M_n^1(\beta). \]
We note that if $E$ is any $n \times n$ array with $\sigma_E^2 > 0$ then $\widetilde{E} \in M_n(\beta)$ for all $\beta \ge \beta_E$, and if $E \in M_n$ then $\widetilde{E} = E$. Let
\[ \delta^1(\beta, n) = \sup\big\{|P(W_E \le z) - \Phi(z)| : z \in \mathbb{R},\ E \in M_n^1(\beta)\big\}. \tag{6.9} \]
The proof of the theorem depends on the following four lemmas, whose proofs are deferred to the end of this section. The first two lemmas are used to control the effects of truncation and scaling.

Lemma 6.1 For $n \ge 2$ and $E \in M_n$ let $E'$ be the truncated $E$ array given by (6.8). Then there exists $c_1 \ge 1$ such that
\[ P(W_E \ne W_{E'}) \le c_1\beta_E/n \quad\text{and}\quad |\mu_{E'}| \le c_1\beta_E/n. \tag{6.10} \]
In addition, there exist constants $\epsilon_1$ and $c_2$ such that when $\beta_E/n \le \epsilon_1$
\[ \big|\sigma_{E'}^2 - 1\big| \le c_2\beta_E/n, \qquad \widetilde{E'} \in M_n^1 \qquad\text{and}\qquad \beta_{E'} \le c_1\beta_E. \tag{6.11} \]

Lemma 6.2 There exist constants $\epsilon_2$ and $c_3$ such that if $E \in M_n$ for some $n \ge 2$ and $\widetilde{E'}$ is as in (6.8), (6.6) and (6.5), then whenever $\beta_E/n \le \epsilon_2$
\[ \sup_{z \in \mathbb{R}}\big|P(W_E \le z) - \Phi(z)\big| \le \sup_{z \in \mathbb{R}}\big|P(W_{\widetilde{E'}} \le z) - \Phi(z)\big| + c_3\beta_E/n. \]

The following lemma handles the effects of deleting rows and columns from an array in $M_n^1$.

Lemma 6.3 There exist $n_0 \ge 16$, $\epsilon_3 > 0$ and $c_4 \ge 1$ such that if $n \ge n_0$, $l \le 4$ and $C \in M_n^1$, when $D$ is the $(n - l) \times (n - l)$ array formed by removing the $l$ rows $\mathcal{R} \subset \{1, 2, \ldots, n\}$ and $l$ columns $\mathcal{C} \subset \{1, 2, \ldots, n\}$ from $C$, we have $|\mu_D| \le 8$, and if $\beta_C/n \le \epsilon_3$ then
\[ \big|\sigma_D^2 - 1\big| \le 3/4 \quad\text{and}\quad \beta_D \le c_4\beta_C. \]

The proof, being inductive in nature, expresses the distance to normality for a problem of a given size in terms of the distances to normality for the same problem, but of smaller sizes. This last lemma is used to handle the resulting recursion for the relation between these distances.
Lemma 6.4 Let $\{s_n\}_{n \ge 1}$ be a sequence of nonnegative numbers and $m \ge 5$ a positive integer such that
\[ s_n \le d + \alpha\max_{l \in \{2,3,4\}} s_{n-l} \quad\text{for all } n \ge m, \tag{6.12} \]
with $d \ge 0$ and $\alpha \in (0, 1)$. Then
\[ \sup_{n \ge 1} s_n < \infty. \]
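The content of Lemma 6.4 can be illustrated numerically by looking at the extremal sequence that satisfies (6.12) with equality. The sketch below is only an illustration, with hypothetical function names and zero-based list indexing in place of the lemma's indexing from 1. A direct induction shows that such a sequence never exceeds the larger of its initial maximum and $d/(1 - \alpha)$, since $d + \alpha M \le M$ whenever $M \ge d/(1 - \alpha)$.

```python
def extremal_sequence(initial, d, alpha, length):
    """Extend `initial` by s_n = d + alpha * max(s_{n-2}, s_{n-3}, s_{n-4}).

    `initial` must hold at least four nonnegative starting values.
    """
    if len(initial) < 4:
        raise ValueError("need at least four initial values")
    if not (0.0 < alpha < 1.0) or d < 0.0:
        raise ValueError("requires d >= 0 and alpha in (0, 1)")
    s = list(initial)
    while len(s) < length:
        s.append(d + alpha * max(s[-2], s[-3], s[-4]))
    return s


def uniform_bound(initial, d, alpha):
    """A bound valid for every term: max(initial values, d / (1 - alpha))."""
    return max(max(initial), d / (1.0 - alpha))
```

For instance, with initial values `[5, 1, 7, 3]`, `d = 2` and `alpha = 2/3`, every term of the extended sequence stays below `uniform_bound(...)`, which is 7 in this case.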
Proof of Theorem 6.2 In view of (2.45) and (6.7) it suffices to prove the theorem for $W_B$ with $B \in M_n$. Let $\epsilon_1$, $c_1$ and $c_2$ be as in Lemma 6.1, $\epsilon_2$ and $c_3$ as in Lemma 6.2, and $n_0$, $\epsilon_3$ and $c_4$ as in Lemma 6.3. Noting that from the lemmas we have $n_0 \ge 16$ and $c_1 \ge 1$ and $c_4 \ge 1$, set
\[ \epsilon_0 = \min\big\{1/(2n_0),\ \epsilon_1/c_1,\ \epsilon_3/c_1,\ 3\epsilon_1/(4c_4c_1),\ 3\epsilon_2/(4c_4c_1)\big\}. \tag{6.13} \]
We first demonstrate that it suffices to prove the theorem for $\beta_B/n < \epsilon_0$ and $n > n_0$. By Hölder's inequality and (4.105), for all $n \in \mathbb{N}$,
\[ \frac{(n-1)^{1/2}}{n^{1/3}} = \frac{1}{n^{1/3}}\Big(\sum_{i,j=1}^n b_{ij}^2\Big)^{1/2} \le \Big(\sum_{i,j=1}^n |b_{ij}|^3\Big)^{1/3} = \beta_B^{1/3}. \tag{6.14} \]
As inequality (6.14) implies that $\beta_B \ge 1/2$ for all $n \ge 2$, we have $\beta_B/n \ge \epsilon_0$ for all $2 \le n \le n_0$. Hence, taking $c \ge 1/\epsilon_0$ the theorem holds if either $2 \le n \le n_0$ or $B$ satisfies $\beta_B/n \ge \epsilon_0$. We may therefore assume $n \ge n_0$ and $\beta_B/n \le \epsilon_0$.

As $\beta_B/n \le \epsilon_0$, setting $C = \widetilde{B'}$ as in (6.8), (6.6) and (6.5), Lemma 6.2 yields
\[ \sup_{z \in \mathbb{R}}\big|P(W_B \le z) - \Phi(z)\big| \le \sup_{z \in \mathbb{R}}\big|P(W_C \le z) - \Phi(z)\big| + c_3\beta_B/n. \tag{6.15} \]
By (6.7) and (6.11) of Lemma 6.1 we have that
\[ \beta_C/n = \beta_{B'}/n \le c_1\beta_B/n \le c_1\epsilon_0, \tag{6.16} \]
and also that $C \in M_n^1$. Hence, by (6.15) and (6.16) it suffices to prove that there exists a constant $c_5$ such that
\[ \delta^1(\beta, n) \le c_5\beta/n \quad\text{for all } n \ge n_0 \text{ and } \beta/n \le c_1\epsilon_0. \tag{6.17} \]
For $z \in \mathbb{R}$ and $\alpha > 0$ let $h_{z,\alpha}(w)$ be the smoothed indicator function of $(-\infty, z]$, which decays linearly to zero over the interval $[z, z + \alpha]$, as given by (2.14), and set
\[ \delta^1(\alpha, \beta, n) = \sup\big\{|Eh_{z,\alpha}(W_C) - Nh_{z,\alpha}| : z \in \mathbb{R},\ C \in M_n^1(\beta)\big\}. \tag{6.18} \]
Also, define $h_{z,0}(x) = \mathbf{1}(x \le z)$. As the collection of arrays $M_n^1(\beta)$ increases in $\beta$, so therefore does $\delta^1(\alpha, \beta, n)$.
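Now, since for any $z, w \in \mathbb{R}$ and $\alpha > 0$, $h_{z,0}(w) \le h_{z,\alpha}(w) \le h_{z+\alpha,0}(w)$, for all $C \in M_n^1(\beta)$ and all $\alpha > 0$ we have
\[ \sup_{z \in \mathbb{R}}\big|P(W_C \le z) - \Phi(z)\big| \le \sup_{z \in \mathbb{R}}\big|Eh_{z,\alpha}(W_C) - Eh_{z,\alpha}(Z)\big| + \frac{\alpha}{\sqrt{2\pi}}, \]
and taking supremum yields
\[ \delta^1(\beta, n) \le \delta^1(\alpha, \beta, n) + \frac{\alpha}{\sqrt{2\pi}}. \tag{6.19} \]
To prove (6.17), for $n \ge n_0$ let $C \in M_n^1$ satisfy $\beta_C/n \le c_1\epsilon_0$, and let $f$ be the solution to the Stein equation (2.4) with $h = h_{z,\alpha}$ as in (2.14), for some fixed $z \in \mathbb{R}$. Following the construction in Lemma 4.6, we obtain the variable $W_C^* = UW_C^\dagger + (1 - U)W_C^\ddagger$ with the $W_C$-zero biased distribution. Now, using the bound (2.16) from Lemma 2.5 on the differences of the derivative of $f$, write
\begin{align*}
\big|Eh(W_C) - Nh\big| &= \big|E\big[f'(W_C) - W_Cf(W_C)\big]\big| \\
&= \big|E\big[f'(W_C) - f'(W_C^*)\big]\big| \\
&\le E\big|f'(W_C^*) - f'(W_C)\big| \le A_1 + A_2 + A_3, \tag{6.20}
\end{align*}
where
\[ A_1 = E\big|W_C^* - W_C\big|, \qquad A_2 = E\big|W_C\big(W_C^* - W_C\big)\big| \]
and
\[ A_3 = \frac{1}{\alpha}\,E\Big(\big|W_C^* - W_C\big|\int_0^1\mathbf{1}_{[z,z+\alpha]}\big(W_C + r(W_C^* - W_C)\big)\,dr\Big). \]
First, from the $L^1$ bound in Lemma 4.7, noting that $\gamma$ in the lemma equals $\beta_C$ as $\sigma_C^2 = 1$, we obtain
\[ A_1 = E\big|W_C^* - W_C\big| \le c_6\beta_C/n. \tag{6.21} \]
Next, to estimate $A_2$, note that by (4.135) of Lemma 4.6 we may write $W_C^\dagger$ and $W_C^\ddagger$ as in (6.2) and (6.3) with $\mathcal{I} = \{I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger)\}$ and
\begin{align*}
W_C^* - W_C &= UW_C^\dagger + (1 - U)W_C^\ddagger - W_C \\
&= U(S + T^\dagger) + (1 - U)(S + T^\ddagger) - (S + T) \\
&= UT^\dagger + (1 - U)T^\ddagger - T. \tag{6.22}
\end{align*}
Now let $\mathbf{I} = (I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger), \pi(I^\dagger), \pi(J^\dagger), K^\dagger, L^\dagger)$. By the construction in Lemma 4.5 the right hand side of (6.22), and hence $W_C^* - W_C$, is measurable with respect to $\mathfrak{I} = \{\mathbf{I}, U\}$. Furthermore, since $C \in M_n^1$ and $|\mathcal{I}| \le 4$, we have
\[ |W_C| = |S + T| \le |S| + |T| \le |S| + \sum_{i \in \mathcal{I}}|c_{i\pi(i)}| \le |S| + 4. \]
Now, using the definition of $A_2$, and that $U$ is independent of $\{S, \mathbf{I}\}$, we obtain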
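\begin{align*}
A_2 = E\big|W_C\big(W_C^* - W_C\big)\big| &= E\Big(\big|W_C^* - W_C\big|\,E\big(|W_C|\,\big|\,\mathfrak{I}\big)\Big) \\
&\le E\Big(\big|W_C^* - W_C\big|\,E\big(|S| + 4\,\big|\,\mathfrak{I}\big)\Big) \\
&\le E\Big(\big|W_C^* - W_C\big|\sqrt{E\big(S^2\,\big|\,\mathbf{I}\big)}\Big) + 4E\big|W_C^* - W_C\big|. \tag{6.23}
\end{align*}
In the following, for $\mathbf{i}$ a realization of $\mathbf{I}$, let $l$ denote the number of distinct elements of the corresponding realization of $\mathcal{I}$. Since $S = \sum_{i \notin \mathcal{I}} c_{i\pi(i)}$ and $\pi$ is chosen uniformly from $S_n$, we have that
\[ \mathcal{L}(S \mid \mathbf{I} = \mathbf{i}) = \mathcal{L}(W_D), \tag{6.24} \]
where $W_D = \sum_{1 \le i \le n-l} d_{i\theta(i)}$ with $D$ the $(n - l) \times (n - l)$ array formed by removing from $C$ the rows $\{I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger)\}$ and columns $\{\pi(I^\dagger), \pi(J^\dagger), K^\dagger, L^\dagger\}$, and $\theta$ chosen uniformly from $S_{n-l}$. Using $l \in \{2, 3, 4\}$, that $n \ge n_0$ and $\beta_C/n \le c_1\epsilon_0 \le \epsilon_3$, Lemma 6.3 yields $|\mu_D| \le 8$ and that $|\sigma_D^2 - 1| \le 3/4$, so that
\[ EW_D^2 \le c_7. \tag{6.25} \]
In particular
\[ E\big(S^2 \mid \mathbf{I} = \mathbf{i}\big) = EW_D^2 \le c_7 \quad\text{for all } \mathbf{i}, \text{ and hence}\quad E\big(S^2 \mid \mathbf{I}\big) \le c_7. \]
Now using (6.23) and (6.21), we obtain
\[ A_2 \le c_8\beta_C/n. \tag{6.26} \]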
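Finally, we are left with bounding $A_3$. First we note that for any $r \in \mathbb{R}$,
\begin{align*}
W_C + r\big(W_C^* - W_C\big) &= rW_C^* + (1 - r)W_C \\
&= r\big(S + UT^\dagger + (1 - U)T^\ddagger\big) + (1 - r)(S + T) \\
&= S + rUT^\dagger + r(1 - U)T^\ddagger + (1 - r)T \\
&= S + g_r,
\end{align*}
where $g_r = rUT^\dagger + r(1 - U)T^\ddagger + (1 - r)T$. Now, from the definition of $A_3$, again using that $W_C - W_C^*$ is $\mathfrak{I}$ measurable, where $\mathfrak{I} = \{\mathbf{I}, U\}$,
\begin{align*}
A_3 &= \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\int_0^1\mathbf{1}_{[z,z+\alpha]}\big(W_C + r(W_C^* - W_C)\big)\,dr\Big) \\
&= \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\,E\Big(\int_0^1\mathbf{1}_{[z,z+\alpha]}\big(W_C + r(W_C^* - W_C)\big)\,dr\,\Big|\,\mathfrak{I}\Big)\Big) \\
&= \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\int_0^1 P\big(W_C + r(W_C^* - W_C) \in [z, z+\alpha]\,\big|\,\mathfrak{I}\big)\,dr\Big) \\
&= \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\int_0^1 P\big(S + g_r \in [z, z+\alpha]\,\big|\,\mathfrak{I}\big)\,dr\Big) \\
&= \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\int_0^1 P\big(S \in [z - g_r, z + \alpha - g_r]\,\big|\,\mathfrak{I}\big)\,dr\Big) \\
&\le \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\int_0^1\sup_{z \in \mathbb{R}} P\big(S \in [z - g_r, z + \alpha - g_r]\,\big|\,\mathfrak{I}\big)\,dr\Big) \\
&= \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\int_0^1\sup_{z \in \mathbb{R}} P\big(S \in [z, z+\alpha]\,\big|\,\mathfrak{I}\big)\,dr\Big) \tag{6.27} \\
&= \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\sup_{z \in \mathbb{R}} P\big(S \in [z, z+\alpha]\,\big|\,\mathfrak{I}\big)\Big) \\
&= \frac{1}{\alpha}\,E\Big(\big|W_C - W_C^*\big|\sup_{z \in \mathbb{R}} P\big(S \in [z, z+\alpha]\,\big|\,\mathbf{I}\big)\Big), \tag{6.28}
\end{align*}
where to obtain equality in (6.27) we have used the fact that $g_r$ is measurable with respect to $\mathfrak{I}$ for all $r$, and the equality in (6.28) follows from the independence of $U$ from $\{S, \mathbf{I}\}$.

Regarding $P(S \in [z, z+\alpha] \mid \mathbf{I})$, we claim that
\begin{align*}
\sup_{z \in \mathbb{R}} P\big(S \in [z, z+\alpha]\,\big|\,\mathbf{I} = \mathbf{i}\big) &= \sup_{z \in \mathbb{R}} P\big(W_D \in [z, z+\alpha]\big) \\
&= \sup_{z \in \mathbb{R}} P\big(W_{D^0} \in [z, z+\alpha]\big) \\
&= \sup_{z \in \mathbb{R}} P\Big(W_{\widetilde{D}} \in \Big[\frac{z}{\sigma_D}, \frac{z + \alpha}{\sigma_D}\Big]\Big) \\
&\le \sup_{z \in \mathbb{R}} P\big(W_{\widetilde{D}} \in [z, z + 2\alpha]\big). \tag{6.29}
\end{align*}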
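The first equality is (6.24), the second follows from (6.5) and that $\sum_{i=1}^{n-l} d_{i\cdot}$, $\sum_{i=1}^{n-l} d_{\cdot\theta(i)}$ and $\sum_{i=1}^{n-l} d_{\cdot\cdot}$ do not depend on $\theta$, and the next is by definition (6.6) of $\widetilde{D}$. The inequality follows from (6.25), which implies $\sigma_D \ge 1/2$.

Let $E = \widetilde{D}$. Using that $\beta_C/n \le c_1\epsilon_0 \le \epsilon_3$ and Lemma 6.3, we have $\beta_D \le c_4\beta_C$ so that by (6.7)
\[ \frac{\beta_E}{n - l} = \frac{\beta_D}{n - l} \le \frac{c_4\beta_C}{n - l} \le \frac{nc_4c_1\epsilon_0}{n - l} \le \frac{3}{4}\,\frac{n}{n - 4}\min\{\epsilon_1, \epsilon_2\} \le \min\{\epsilon_1, \epsilon_2\}, \]
since $n_0 \ge 16$. Now Lemma 6.1 and (6.7) yield
\[ \widetilde{E'} \in M_{n-l}^1 \quad\text{and}\quad \beta_{\widetilde{E'}} = \beta_{E'} \le c_1\beta_E = c_1\beta_D. \tag{6.30} \]
Furthermore, Lemma 6.2 and (6.7) may be invoked to yield
\begin{align*}
P\big(W_{\widetilde{D}} \in [z, z + 2\alpha]\big) &= P\big(W_E \in [z, z + 2\alpha]\big) \\
&\le \big|P(W_E \le z + 2\alpha) - \Phi(z + 2\alpha)\big| + \big|\Phi(z + 2\alpha) - \Phi(z)\big| + \big|\Phi(z) - P(W_E < z)\big| \\
&\le 2\delta^1(c_1\beta_D, n - l) + \frac{2c_3\beta_D}{n - l} + \frac{2\alpha}{\sqrt{2\pi}} \\
&\le 2\max_{l \in \{2,3,4\}}\delta^1(c_1\beta_D, n - l) + \frac{2c_3\beta_D}{n - l} + \frac{2\alpha}{\sqrt{2\pi}} \\
&\le 2\max_{l \in \{2,3,4\}}\delta^1(c_9\beta_C, n - l) + \frac{c_{10}\beta_C}{n} + \frac{2\alpha}{\sqrt{2\pi}}, \tag{6.31}
\end{align*}
where in the final inequality we have again invoked (6.30) and set $c_9 = c_1c_4 \ge 1$, by (6.13). As (6.31) does not depend on $z$ or $\mathbf{i}$, by (6.29), it bounds $\sup_{z \in \mathbb{R}} P(S \in [z, z+\alpha] \mid \mathbf{I})$. Now using (6.28), (6.31) and (6.21), we obtain
\begin{align*}
A_3 &\le \frac{1}{\alpha}\,E\big|W_C - W_C^*\big|\Big(2\max_{l \in \{2,3,4\}}\delta^1(c_9\beta_C, n - l) + \frac{c_{10}\beta_C}{n} + \frac{2\alpha}{\sqrt{2\pi}}\Big) \\
&\le \frac{c_6\beta_C}{n\alpha}\Big(2\max_{l \in \{2,3,4\}}\delta^1(c_9\beta_C, n - l) + \frac{c_{10}\beta_C}{n} + \frac{2\alpha}{\sqrt{2\pi}}\Big). \tag{6.32}
\end{align*}
Recalling $h = h_{z,\alpha}$, as the bound on $A_3$ does not depend on $z \in \mathbb{R}$, combining (6.21), (6.26) and (6.32), then taking supremum over $z \in \mathbb{R}$ on the left hand side of (6.20), we obtain
\[ \sup_{z \in \mathbb{R}}\big|Eh_{z,\alpha}(W_C) - Nh_{z,\alpha}\big| \le \frac{c_{11}\beta_C}{n} + \frac{c_6\beta_C}{n\alpha}\Big(2\max_{l \in \{2,3,4\}}\delta^1(c_9\beta_C, n - l) + \frac{c_{10}\beta_C}{n} + \frac{2\alpha}{\sqrt{2\pi}}\Big), \]
and now taking supremum over $C \in M_n^1(\beta)$ with $\beta/n \le c_1\epsilon_0$ we have
\[ \delta^1(\alpha, \beta, n) \le \frac{c_{11}\beta}{n} + \frac{c_6\beta}{n\alpha}\Big(2\max_{l \in \{2,3,4\}}\delta^1(c_9\beta, n - l) + \frac{c_{10}\beta}{n} + \frac{2\alpha}{\sqrt{2\pi}}\Big). \]
Recalling (6.19), we obtain
\[ \delta^1(\beta, n) \le \frac{c_{11}\beta}{n} + \frac{c_6\beta}{n\alpha}\Big(2\max_{l \in \{2,3,4\}}\delta^1(c_9\beta, n - l) + \frac{c_{10}\beta}{n} + \frac{2\alpha}{\sqrt{2\pi}}\Big) + \frac{\alpha}{\sqrt{2\pi}}. \]
Setting $\alpha = 4c_6c_9\beta/n$ yields
\[ \delta^1(\beta, n) \le \frac{c_{12}\beta}{n} + \frac{1}{2}\max_{l \in \{2,3,4\}}\frac{\delta^1(c_9\beta, n - l)}{c_9}. \]
Multiplying by $n/\beta$ we obtain
\[ \frac{n\delta^1(\beta, n)}{\beta} \le c_{12} + \frac{1}{2}\max_{l \in \{2,3,4\}}\frac{n\delta^1(c_9\beta, n - l)}{c_9\beta}. \]
Taking supremum over positive $\beta$ satisfying $\beta/n \le c_1\epsilon_0$, and using $n_0 \ge 16$ we obtain
\[ \sup_{0 < \beta/n \le c_1\epsilon_0}\frac{n\delta^1(\beta, n)}{\beta} \le c_{12} + \frac{2}{3}\max_{l \in \{2,3,4\}}\ \sup_{0 < \beta/n \le c_1c_9\epsilon_0}\frac{(n - l)\delta^1(\beta, n - l)}{\beta}. \tag{6.33} \]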
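Clearly
\[ \sup_{\beta/n > c_1\epsilon_0}\frac{(n - l)\delta^1(\beta, n - l)}{\beta} \le 1/(c_1\epsilon_0), \]
so letting
\[ s_n = \sup_{0 < \beta/n \le c_1\epsilon_0}\frac{n\delta^1(\beta, n)}{\beta} \]
and, recalling $c_9 \ge 1$, decomposing the supremum on the right hand side of (6.33) over $\beta/n \le c_1\epsilon_0$ and $c_1\epsilon_0 < \beta/n \le c_1c_9\epsilon_0$ we obtain
\[ s_n \le c_{13} + \frac{2}{3}\max_{l \in \{2,3,4\}} s_{n-l} \quad\text{for all } n \ge n_0. \]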
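Lemma 6.4 now yields $\sup_n s_n < \infty$. Taking the value of this supremum to be $c_5$, we obtain (6.17), as desired.

We now present the proof of the four technical lemmas that were used in the proof of the theorem.

Proof of Lemma 6.1 Let $\Gamma = \{(i, j) : |e_{ij}| > 1/2\}$ and $\Gamma_i = \{j : (i, j) \in \Gamma\}$ for $i = 1, \ldots, n$. By a Chebyshev type argument we may bound the size of $\Gamma$ by
\[ |\Gamma| = \sum_{i,j}\mathbf{1}\big(|e_{ij}| > 1/2\big) \le 8\sum_{i,j}|e_{ij}|^3 = 8\beta_E. \tag{6.34} \]
Now the inclusion
\[ \{W_E \ne W_{E'}\} \subset \bigcup_{i=1}^n\big\{(i, \pi(i)) \in \Gamma\big\} \]
implies
\[ P(W_E \ne W_{E'}) \le E\sum_{i=1}^n\mathbf{1}\big((i, \pi(i)) \in \Gamma\big) = \sum_{i=1}^n|\Gamma_i|/n = |\Gamma|/n \le 8\beta_E/n, \]
proving the first claim of (6.10) taking $c_1 = 8$. Hölder's inequality and (6.34) yield that for all $r \in (0, 3]$
\[ \sum_{(i,j) \in \Gamma}|e_{ij}|^r \le |\Gamma|^{1 - r/3}\Big(\sum_{i,j}|e_{ij}|^3\Big)^{r/3} \le c_1\beta_E. \tag{6.35} \]
Similarly, as
\[ |\Gamma_i| = \sum_j\mathbf{1}\big(|e_{ij}| > 1/2\big) \le 8\sum_j|e_{ij}|^3, \]
we have
\[ \Big|\sum_{j \in \Gamma_i} e_{ij}\Big| \le |\Gamma_i|^{2/3}\Big(\sum_{j \in \Gamma_i}|e_{ij}|^3\Big)^{1/3} \le 4\sum_j|e_{ij}|^3 \le c_1\beta_E, \tag{6.36} \]
with the same bound holding when interchanging the roles of $i$ and $j$. Regarding the mean $\mu_{E'}$, since $\sum_{ij} e_{ij} = 0$, we have
\[ |\mu_{E'}| = \frac{1}{n}\Big|\sum_{i,j} e'_{ij}\Big| = \frac{1}{n}\Big|\sum_{(i,j) \in \Gamma^c} e_{ij}\Big| = \frac{1}{n}\Big|\sum_{(i,j) \in \Gamma} e_{ij}\Big| \le \frac{1}{n}\sum_{(i,j) \in \Gamma}|e_{ij}| \le c_1\beta_E/n, \]
by (6.35) with $r = 1$, proving the second claim in (6.10).

To prove the bound on $\sigma_{E'}^2$, recalling the form of the variance in (4.105) we have
\begin{align*}
\big|\sigma_{E'}^2 - 1\big| &= \frac{1}{n - 1}\Big|\sum_{i,j}\big(e_{ij}'^{\,2} - e_{i\cdot}'^{\,2} - e_{\cdot j}'^{\,2} + e_{\cdot\cdot}'^{\,2}\big) - \sum_{i,j} e_{ij}^2\Big| \\
&\le \frac{1}{n - 1}\Big(\sum_{(i,j) \in \Gamma} e_{ij}^2 + \sum_{i,j} e_{i\cdot}'^{\,2} + \sum_{i,j} e_{\cdot j}'^{\,2} + \sum_{i,j} e_{\cdot\cdot}'^{\,2}\Big).
\end{align*}
Since $n \ge 2$ the first term is bounded by $2c_1\beta_E/n$ using (6.35) with $r = 2$. By (6.36), we have that
\[ |e'_{i\cdot}| = \frac{1}{n}\Big|\sum_j e'_{ij}\Big| = \frac{1}{n}\Big|\sum_{j \notin \Gamma_i} e_{ij}\Big| = \frac{1}{n}\Big|\sum_{j \in \Gamma_i} e_{ij}\Big| \le \frac{4}{n}\sum_j|e_{ij}|^3 \le \frac{4\beta_E}{n}. \tag{6.37} \]
Hence, for $n \ge 2$,
\[ \frac{1}{n - 1}\sum_{i,j} e_{i\cdot}'^{\,2} \le \frac{4\beta_E}{n - 1}\sum_i|e'_{i\cdot}| \le \frac{16\beta_E}{n(n - 1)}\sum_{i,j}|e_{ij}|^3 \le 32\beta_E^2/n^2, \]
with the same bound holding when $i$ and $j$ are interchanged. In addition, by the second claim in (6.10),
\[ |e'_{\cdot\cdot}| = |\mu_{E'}|/n \le c_1\beta_E/n^2, \tag{6.38} \]
and so
\[ \frac{1}{n - 1}\sum_{i,j} e_{\cdot\cdot}'^{\,2} \le \frac{n^2}{n - 1}\,\frac{c_1^2\beta_E^2}{n^4} \le 2c_1^2\beta_E^2/n^3. \]
Hence
\[ \big|\sigma_{E'}^2 - 1\big| \le \frac{\beta_E}{n}\big(2c_1 + 64\beta_E/n + 2c_1^2\beta_E/n^2\big). \]
Now the first claim of (6.11) holds with $c_2 = 2c_1 + 64 + 2c_1^2$ taking any $\epsilon_1 \in (0, 1)$. Requiring further that $\epsilon_1 \in (0, 1/(3c_2))$, when $\beta_E/n \le \epsilon_1$ then
\[ \big|\sigma_{E'}^2 - 1\big| \le 1/3, \]
so that $\sigma_{E'}^2 > 2/3$, implying $\sigma_{E'} > 2/3$. Therefore, when $\beta_E/n \le \epsilon_1$ the elements of $\widetilde{E'}$ satisfy
\[ \big|e'_{ij} - e'_{i\cdot} - e'_{\cdot j} + e'_{\cdot\cdot}\big|/\sigma_{E'} \le \frac{3}{4} + \frac{3}{2}\big(|e'_{i\cdot}| + |e'_{\cdot j}| + |e'_{\cdot\cdot}|\big), \]
and by (6.37) and (6.38) there exists $\epsilon_1$ sufficiently small such that the elements of $\widetilde{E'}$ are all bounded by 1, thus showing the second claim of (6.11). Lastly, by the lower bound on $\sigma_{E'}$ we have
\[ \beta_{E'} = \frac{\sum_{ij}|e'^{\,0}_{ij}|^3}{\sigma_{E'}^3} \le \frac{\sum_{ij}|e_{ij}|^3}{\sigma_{E'}^3} \le c_1\beta_E, \]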
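completing the proof of the lemma.

Proof of Lemma 6.2 With $\epsilon_1$, $c_1$ and $c_2$ as in Lemma 6.1, set
\[ \epsilon_2 = \min\big\{\epsilon_1, 1/(9c_2)\big\} \]
and assume $\beta_E/n \le \epsilon_2$. The first inequality in (6.10) of Lemma 6.1 yields
\begin{align*}
\sup_{z \in \mathbb{R}}\big|P(W_E \le z) - \Phi(z)\big| &\le \sup_{z \in \mathbb{R}}\big|P(W_{E'} \le z) - \Phi(z)\big| + c_1\beta_E/n \\
&\le \sup_{z \in \mathbb{R}}\Big|P(W_{E'} \le z) - \Phi\Big(\frac{z - \mu_{E'}}{\sigma_{E'}}\Big)\Big| + \sup_{z \in \mathbb{R}}\Big|\Phi\Big(\frac{z - \mu_{E'}}{\sigma_{E'}}\Big) - \Phi(z)\Big| + c_1\beta_E/n \\
&\le \sup_{z \in \mathbb{R}}\big|P(W_{\widetilde{E'}} \le z) - \Phi(z)\big| + \sup_{z \in \mathbb{R}}\Big|\Phi\Big(\frac{z - \mu_{E'}}{\sigma_{E'}}\Big) - \Phi(z)\Big| + c_1\beta_E/n.
\end{align*}
Hence we need only show that there exists some $c_{14}$ such that
\[ \sup_{z \in \mathbb{R}}\Big|\Phi\Big(\frac{z - \mu_{E'}}{\sigma_{E'}}\Big) - \Phi(z)\Big| \le c_{14}\beta_E/n. \tag{6.39} \]
From the first inequality in (6.11) of Lemma 6.1, since $\beta_E/n \le 1/(9c_2)$ we have $|\sigma_{E'}^2 - 1| \le 1/9$ and so $\sigma_{E'} \in [2/3, 4/3]$. First consider the case where $|z| \ge c_1\beta_E/n$. It is easy to show that
\[ \big|z\exp\big(-az^2/2\big)\big| \le 1/\sqrt{a} \quad\text{for all } a > 0,\ z \in \mathbb{R}. \tag{6.40} \]
Hence
\begin{align*}
\Big|z\exp\Big(-\frac{9}{32}(z - \mu_{E'})^2\Big)\Big| &\le \Big|(z - \mu_{E'})\exp\Big(-\frac{9}{32}(z - \mu_{E'})^2\Big)\Big| + |\mu_{E'}| \\
&\le \frac{4}{3} + |\mu_{E'}| \\
&\le \frac{4}{3}\big(1 + |\mu_{E'}|\big). \tag{6.41}
\end{align*}
Since $\sigma_{E'} \ge 2/3$ and (6.11) gives that $|\sigma_{E'}^2 - 1| \le c_2\beta_E/n$, we find that
\[ |\sigma_{E'} - 1| = \frac{|\sigma_{E'}^2 - 1|}{\sigma_{E'} + 1} \le c_2\beta_E/n. \tag{6.42} \]
Letting $\tilde{z} = (z - \mu_{E'})/\sigma_{E'}$, since $|\mu_{E'}| \le c_1\beta_E/n$ by (6.10) of Lemma 6.1, $\tilde{z}$ and $z$ will be on the same side of the origin. Now, using the mean value theorem, that $\sigma_{E'} \in [2/3, 4/3]$, and Lemma 6.1, we obtain
\begin{align*}
\big|\Phi(\tilde{z}) - \Phi(z)\big| &\le \max\Big\{\phi\Big(\frac{z - \mu_{E'}}{\sigma_{E'}}\Big), \phi(z)\Big\}\,\Big|\frac{z - \mu_{E'}}{\sigma_{E'}} - z\Big| \\
&\le \frac{1}{\sqrt{2\pi}}\max\Big\{\exp\Big(-\frac{9}{32}(z - \mu_{E'})^2\Big), \exp\Big(-\frac{z^2}{2}\Big)\Big\}\,\Big|\frac{z(1 - \sigma_{E'})}{\sigma_{E'}}\Big| + \frac{1}{\sqrt{2\pi}}\Big|\frac{\mu_{E'}}{\sigma_{E'}}\Big| \\
&\le \frac{3}{2\sqrt{2\pi}}\,|\sigma_{E'} - 1|\max\Big\{\Big|z\exp\Big(-\frac{9}{32}(z - \mu_{E'})^2\Big)\Big|, \Big|z\exp\Big(-\frac{z^2}{2}\Big)\Big|\Big\} + \frac{1}{\sqrt{2\pi}}\Big|\frac{\mu_{E'}}{\sigma_{E'}}\Big| \\
&\le \frac{2}{\sqrt{2\pi}}\,|\sigma_{E'} - 1|\big(1 + |\mu_{E'}|\big) + \frac{3}{4}|\mu_{E'}|,
\end{align*}
this last inequality using (6.41), and (6.40) with $a = 1$. But now, using (6.10) and (6.42), we have
\begin{align*}
\frac{2}{\sqrt{2\pi}}\,|\sigma_{E'} - 1|\big(1 + |\mu_{E'}|\big) + \frac{3}{4}|\mu_{E'}| &\le \frac{2c_2\beta_E}{\sqrt{2\pi}\,n}\Big(1 + \frac{c_1\beta_E}{n}\Big) + \frac{3c_1\beta_E}{4n} \\
&\le \Big(\frac{2c_2}{\sqrt{2\pi}}(1 + c_1\epsilon_2) + \frac{3c_1}{4}\Big)\frac{\beta_E}{n}, \quad\text{since } \beta_E/n \le \epsilon_2. \tag{6.43}
\end{align*}
When $|z| < c_1\beta_E/n$, the bound is easier. Since $\tilde{z}$ lies in the interval with boundary points $3(z - \mu_{E'})/2$ and $3(z - \mu_{E'})/4$, we have
\[ |\tilde{z}| \le \frac{3\big(|z| + |\mu_{E'}|\big)}{2}. \tag{6.44} \]
Now using that $|z| < c_1\beta_E/n$, and $|\mu_{E'}| \le c_1\beta_E/n$ by (6.10), from (6.44) we obtain
\begin{align*}
\big|\Phi(\tilde{z}) - \Phi(z)\big| &\le \frac{1}{\sqrt{2\pi}}\,|\tilde{z} - z| \\
&\le \frac{1}{\sqrt{2\pi}}\big(3|z| + 2|\mu_{E'}|\big) \\
&\le \frac{5c_1\beta_E}{\sqrt{2\pi}\,n}. \tag{6.45}
\end{align*}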
The proof of (6.39), and therefore of the lemma, is now completed by letting c14 be the maximum of the constants that multiply βE /n in (6.43) and (6.45). Proof of Lemma 6.3 Let m = n − l. Since ci = 0, |cij | ≤ 1 and l ≤ 4 we have m 1 1 1 |C| 4 dij = cij = cij ≤ |di | = = , (6.46) m m m m m j∈ /C
j =1
j ∈C
with the same bound holding when the roles of i and j are interchanged. Similarly, as c = 0, m 1 1 |d | = 2 dij = 2 cij m m i,j =1 {i ∈ / R}∩{j ∈ /C} 1 cij = 2 m {i∈R}∪{j ∈C }
≤
8 |R| + |C| = , m m
(6.47)
and the first claim now follows, since μD = md . To handle σD2 , recalling σC2 = 1, by (4.105) there exists some n2 ≥ 16 such that for all l ≤ 4 n 1 2 n−1 3 cij = (6.48) ∈ 1, 1 when n ≥ n2 . m−1 m−1 8 i,j =1
Again from (4.105), σD2
1 = m−1
m
i,j =1
dij2
−m
m
di2
−m
i=1
m j =1
2 dj
+ m2 d2
.
Applying (6.46) and (6.47), when βC /n ≤ 3 , a value yet to be specified, n 1 2 2 cij σD − m−1 i,j =1 n m m 1 2 2 2 2 2 2 cij − cij + m di + m dj + m d ≤ m−1 i,j =1 i=1 j =1 {i ∈ / R}∩{j ∈ /C}
1 2 ≤ cij + 96 m−1 {i∈R}∪{j ∈C }
≤
1 2/3 8 n + 96 , m−1 3
where for (6.49), we have, by Hölder’s inequality
(6.49)
6.1 Combinatorial Central Limit Theorem n
2 cij ≤n
1 3
181
n
j =1
2 3
|cij |3
1
2
≤ n 3 βC3 ,
j =1
with the same inequality holding when the roles of i and j are reversed, and so, when βC /n ≤ 3 ,
2 cij ≤
{i∈R}∪{j ∈C }
n
2 cij +
i∈R j =1
n
1
2
2/3
2 cij ≤ 2ln 3 βC3 ≤ 83 n.
j ∈C i=1
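Now choosing $n_3 \ge n_2$ such that $96/(n_3 - 5) \le 3/16$, and then choosing $\epsilon_3$ such that $8\epsilon_3^{2/3} n_3/(n_3 - 5) \le 3/16$, by (6.48) and (6.49) we obtain $|\sigma_D^2 - 1| \le 3/4$ for all $n \ge n_3$, proving the second claim in the lemma for any $n_0 \ge n_3$.

To prove the final claim, first note
$$\sum_{i=1}^m |d_{i\cdot}|^3 = \frac{1}{m^3}\sum_{i=1}^m \left|\sum_{j=1}^m d_{ij}\right|^3 = \frac{1}{m^3}\sum_{i=1}^m \left|\sum_{j \notin C} c_{ij}\right|^3 = \frac{1}{m^3}\sum_{i=1}^m \left|\sum_{j \in C} c_{ij}\right|^3 \le \frac{l^2}{m^3}\sum_{i=1}^m \sum_{j \in C} |c_{ij}|^3 \le \frac{l^2 \beta_C}{m^3}, \tag{6.50}$$
with the same bound holding when i and j are interchanged. Now, since
$$\sum_{\{i \in R\}\cup\{j \in C\}} c_{ij} = \sum_{i=1}^n \sum_{j \in C} c_{ij} + \sum_{j=1}^n \sum_{i \in R} c_{ij} - \sum_{\{i \in R\}\cap\{j \in C\}} c_{ij},$$
we obtain
$$|d_{\cdot\cdot}|^3 = \frac{1}{m^6}\left|\sum_{i,j=1}^m d_{ij}\right|^3 = \frac{1}{m^6}\left|\sum_{\{i \notin R\}\cap\{j \notin C\}} c_{ij}\right|^3 = \frac{1}{m^6}\left|\sum_{\{i \in R\}\cup\{j \in C\}} c_{ij}\right|^3$$
$$\le \frac{9}{m^6}\left(\left|\sum_{i \notin R}\sum_{j \in C} c_{ij}\right|^3 + \left|\sum_{i \in R}\sum_{j \notin C} c_{ij}\right|^3 + \left|\sum_{\{i \in R\}\cap\{j \in C\}} c_{ij}\right|^3\right).$$
Hence, using that $|\sum_{i \notin R}\sum_{j \in C} c_{ij}|^3 \le (nl)^2 \sum_{i=1}^n \sum_{j \in C} |c_{ij}|^3$, with the same bound holding for the second term and a similar one for the last, we find that for some $n_0 \ge n_3$, for all $n \ge n_0$ we have
$$|d_{\cdot\cdot}|^3 \le \frac{9}{m^6}\left(2(nl)^2\beta_C + l^4\beta_C\right) \le \frac{28 l^2}{m^4}\beta_C. \tag{6.51}$$
Now, when $\beta_D/n \le \epsilon_3$ and $n \ge n_0$, since $\sigma_D^2 \ge 1/4$, by (6.50) and (6.51),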
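$$\beta_D = \sigma_D^{-3}\sum_{i,j=1}^m \left|d^0_{ij}\right|^3 \le 8\sum_{i,j=1}^m \left|d^0_{ij}\right|^3 = 8\sum_{i,j=1}^m |d_{ij} - d_{i\cdot} - d_{\cdot j} + d_{\cdot\cdot}|^3$$
$$\le 8 \times 4^2\left(\sum_{i,j=1}^n |c_{ij}|^3 + \sum_{i,j=1}^m \left(|d_{i\cdot}|^3 + |d_{\cdot j}|^3 + |d_{\cdot\cdot}|^3\right)\right) \le 128\left(1 + \frac{30 l^2}{m^2}\right)\beta_C \le c_4 \beta_C,$$
thus proving the final claim of the lemma.

Proof of Lemma 6.4 Let the sequence $\{t_n\}_{n \ge m}$ be given by
$$t_m = \max_{0 \le k \le 3} s_{m-k} \quad \text{and} \quad t_{n+1} = d + \alpha t_n \quad \text{for } n \ge m. \tag{6.52}$$
Explicitly solving the recursion yields $t_n = d_1\alpha^n + d_2$, where
$$d_1 = \frac{(1-\alpha)t_m - d}{\alpha^m(1-\alpha)} \quad \text{and} \quad d_2 = \frac{d}{1-\alpha}.$$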
We note that since $\lim_{n\to\infty} t_n = d_2$ the sequence $\{t_n\}_{n \ge m}$ is bounded, and it suffices to prove $s_n \le t_n$ for all $n \ge m$. We consider the two cases, (a) $d_1 < 0$ and (b) $d_1 \ge 0$.

(a) When $d_1 < 0$ the sequence $\{t_n\}_{n \ge m}$ is increasing. By (6.52) we have $s_m \le t_m$. In addition,
$$s_{m+1} \le d + \alpha\max\{s_{m-1}, s_{m-2}, s_{m-3}\} \le d + \alpha t_m = t_{m+1}, \tag{6.53}$$
$$s_{m+2} \le d + \alpha\max\{s_m, s_{m-1}, s_{m-2}\} \le d + \alpha t_m = t_{m+1} \le t_{m+2},$$
and hence
$$s_{m+3} \le d + \alpha\max\{s_{m+1}, s_m, s_{m-1}\} \le d + \alpha\max\{t_{m+1}, t_m\} = d + \alpha t_{m+1} = t_{m+2} \le t_{m+3}.$$
Hence, for k = 3,
$$s_n \le t_n \quad \text{for } m \le n \le m + k. \tag{6.54}$$
Assuming now that (6.54) holds for some k ≥ 3, for n = m + k we have
$$s_{n+1} \le d + \alpha\max\{s_{n-1}, s_{n-2}, s_{n-3}\} \le d + \alpha\max\{t_{n-1}, t_{n-2}, t_{n-3}\} = d + \alpha t_{n-1} = t_n \le t_{n+1},$$
thus completing the inductive step showing that (6.54) holds for all k ≥ 0 in case (a).

(b) When $d_1 \ge 0$ the sequence $\{t_n\}_{n \ge m}$ is non-increasing. In a similar way we can show that for k = 5,
$$s_n \le t_m \quad \text{for } m \le n \le m + k. \tag{6.55}$$
Assuming now that (6.55) holds for some k ≥ 5, for n = m + k we have
$$s_{n+1} \le d + \alpha\max\{s_{n-1}, s_{n-2}, s_{n-3}\} \le d + \alpha t_m = t_{m+1} \le t_m,$$
thus completing the inductive step showing that (6.55) holds for all k ≥ 0 in case (b).
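The domination argument in this proof can be illustrated numerically. The following is a minimal sketch, assuming (as the proof suggests) that the hypothesis of Lemma 6.4 is $s_{n+1} \le d + \alpha\max\{s_{n-1}, s_{n-2}, s_{n-3}\}$ with $0 < \alpha < 1$; the function names and the test values are illustrative and not part of the text. It builds the extremal sequence that attains equality in the hypothesis, checks the closed form for $t_n$ against the recursion (6.52), and checks that $s_n \le \max(t_m, d_2)$, which covers both case (a) and case (b).

```python
def dominating_sequence(t_m, d, alpha, steps):
    """Return [t_m, ..., t_{m+steps}] from t_{n+1} = d + alpha * t_n, as in (6.52)."""
    t = [t_m]
    for _ in range(steps):
        t.append(d + alpha * t[-1])
    return t


def closed_form(t_m, d, alpha, m, n):
    """Evaluate t_n = d1 * alpha**n + d2 with d1, d2 as given after (6.52)."""
    d2 = d / (1 - alpha)
    d1 = ((1 - alpha) * t_m - d) / (alpha ** m * (1 - alpha))
    return d1 * alpha ** n + d2


def extremal_sequence(initial, d, alpha, steps):
    """Extend [s_{m-3}, s_{m-2}, s_{m-1}, s_m] using equality in the
    assumed hypothesis s_{n+1} <= d + alpha * max(s_{n-1}, s_{n-2}, s_{n-3})."""
    s = list(initial)
    for _ in range(steps):
        s.append(d + alpha * max(s[-2], s[-3], s[-4]))
    return s


def check(initial, d, alpha, m=4, steps=30):
    """Return (closed form matches recursion, s_n <= max(t_m, d2) for n >= m)."""
    t = dominating_sequence(max(initial), d, alpha, steps)
    s = extremal_sequence(initial, d, alpha, steps)[3:]  # s_m, s_{m+1}, ...
    closed_ok = all(
        abs(closed_form(t[0], d, alpha, m, m + k) - t[k]) < 1e-9
        for k in range(steps + 1)
    )
    bound = max(t[0], d / (1 - alpha))  # max(t_m, d2)
    return closed_ok, all(x <= bound + 1e-12 for x in s)


case_a = check([0.1, 0.2, 0.0, 0.3], d=1.0, alpha=0.5)  # d1 < 0, t_n increasing
case_b = check([5.0, 4.0, 6.0, 3.0], d=1.0, alpha=0.5)  # d1 >= 0, t_n non-increasing
```

Using the equality case is a deliberate choice: any sequence satisfying the inequality is dominated termwise by it when started from the same four values, so it is the hardest case for the bound.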
6.1.2 Distribution Constant on Conjugacy Classes

In this section we focus on the normal approximation of Y in (6.1) when the distribution of π is a function only of its cycle type. This framework includes two special cases of note, one where π is a uniformly chosen fixed point free involution, considered by Goldstein and Rinott (2003) and Ghosh (2009), and the other where π has the uniform distribution over permutations with a single cycle, considered by Kolchin and Chistyakov (1973) with the additional restriction that $a_{ij} = b_i c_j$. Both Goldstein and Rinott (2003) and Ghosh (2009) obtained an explicit constant, the latter in terms of third moment quantities on the array $a_{ij}$, rather than on its maximum, as in the former. Kolchin and Chistyakov (1973) considered normal convergence for the long cycle case, but did not provide bounds on the error.

As discussed in Sect. 4.4, being able to approximate the distribution of Y is important for performing permutation tests in statistics. In particular, the case where π is a fixed point free involution arises when testing if a given pairing of n = 2m observations shows an unusually high level of similarity, as in Schiffman et al. (1978). In this case, the test statistic $y_\tau$ is of the form (6.1) with π replaced by a given pairing τ, and where $a_{ij} = d(x_i, x_j)$ measures the similarity between observations $x_i$ and $x_j$. Under the null hypothesis that no pairing is distinguished, the value of $y_\tau$ will tend to lie near the center of the distribution of Y when π is an involution having no fixed points, chosen uniformly. This instance is the particular case where π is constant on conjugacy classes, as defined below in (6.57), where the probability of any π with m 2-cycles is constant, and has probability zero otherwise.

In the involution case Goldstein and Rinott (2003) used an exchangeable pair construction in which π″ is obtained from π′ by a transformation which preserves the m 2-cycle structure. The construction in Theorem 6.3 preserves the cycle structure in general, and when there are m 2-cycles, specializes to a construction similar, but not equivalent, to that of Goldstein and Rinott (2003).

We note that in the case where π is a fixed point free involution the sum Y contains both $a_{i\pi(i)}$ and $a_{\pi(i)i}$, making the symmetry assumption of Theorem 6.3 without loss of generality. This assumption is also satisfied in many statistical applications where one wishes to test the equality of the two distributions generating the samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$, and $a_{ij} = d(X_i, Y_j)$, a symmetric 'distance' function evaluated at the observed data points $X_i$ and $Y_j$.

Consider a permutation $\pi \in S_n$ represented in cycle form; in $S_7$ for example, π = ((1, 3, 7, 5), (2, 6, 4)) is the permutation consisting of one 4-cycle in which 1 → 3 → 7 → 5 → 1, and one 3-cycle where 2 → 6 → 4 → 2. For q = 1, ..., n, let $c_q(\pi)$ be the number of q-cycles of π, and let $c(\pi) = (c_1(\pi), \ldots, c_n(\pi))$. We say the permutations π and σ are of the same cycle type if c(π) = c(σ), and that a distribution P on $S_n$ is constant on cycle type if P(π) depends only on c(π), that is,
$$P(\pi) = P(\sigma) \quad \text{whenever } c(\pi) = c(\sigma). \tag{6.56}$$
Equivalently, see Sagan (1991) for instance, π and σ are of the same cycle type if and only if π and σ are conjugate, that is, if and only if there exists a permutation ρ such that $\pi = \rho^{-1}\sigma\rho$. Hence, a probability measure P on $S_n$ is constant over cycle type if and only if
$$P(\pi) = P\left(\rho^{-1}\pi\rho\right) \quad \text{for all } \pi, \rho \in S_n. \tag{6.57}$$
A special case of a distribution constant on cycle type is one uniformly distributed over all permutations of some fixed type. Letting
$$\mathcal{N}_n = \left\{(c_1, \ldots, c_n) \in \mathbb{N}_0^n : \sum_{i=1}^n i c_i = n\right\},$$
the set of possible cycle types for a permutation $\pi \in S_n$, the number N(c) of permutations in $S_n$ having cycle type c is given by Cauchy's formula
$$N(c) = n! \prod_{j=1}^n \left(\frac{1}{j}\right)^{c_j} \frac{1}{c_j!} \quad \text{for } c \in \mathcal{N}_n. \tag{6.58}$$
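The cycle type c(π) and Cauchy's formula (6.58) are easy to check by brute force for small n. The sketch below is illustrative only: permutations are stored 0-indexed as tuples with `perm[i]` the image of `i`, and the helper names `cycle_type`, `cauchy_count` and `counts_by_type` are not from the text. It computes the cycle type of the $S_7$ example above and compares N(c) with a direct enumeration of $S_5$.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def cycle_type(perm):
    """Return (c_1, ..., c_n), where c_q is the number of q-cycles of perm."""
    n = len(perm)
    seen = [False] * n
    counts = [0] * n
    for start in range(n):
        if not seen[start]:
            length, k = 0, start
            while not seen[k]:  # walk the cycle containing start
                seen[k] = True
                k = perm[k]
                length += 1
            counts[length - 1] += 1
    return tuple(counts)


def cauchy_count(c):
    """Number of permutations with cycle type c, by Cauchy's formula (6.58)."""
    n = sum(j * cj for j, cj in enumerate(c, start=1))
    denominator = 1
    for j, cj in enumerate(c, start=1):
        denominator *= j ** cj * factorial(cj)
    return factorial(n) // denominator


def counts_by_type(n):
    """Tabulate cycle types over all of S_n by direct enumeration."""
    return Counter(cycle_type(p) for p in permutations(range(n)))


# The example ((1, 3, 7, 5), (2, 6, 4)) in S_7, rewritten with labels 0..6.
example = (2, 5, 6, 1, 0, 3, 4)
example_type = cycle_type(example)        # one 3-cycle and one 4-cycle
example_count = cauchy_count(example_type)
```

Enumeration is only feasible for small n, which is the reason a closed form such as (6.58) is needed when working with U(c) for general n.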
For $c \in \mathcal{N}_n$ let U(c) denote the distribution over $S_n$ which is uniform on cycle type c, that is, the distribution P given by
$$P(\pi) = \begin{cases} 1/N(c) & \text{if } c(\pi) = c, \\ 0 & \text{otherwise.} \end{cases} \tag{6.59}$$
The situations where π is chosen uniformly from the set of all fixed point free involutions, and where π is chosen uniformly from all permutations having a single cycle, are both distributions of type U(c), the first with c = (0, n/2, 0, ..., 0), the second with c = (0, ..., 0, 1). The following lemma shows that every distribution P that is constant on cycle type is a mixture of U(c) distributions.

Lemma 6.5 If the distribution P on $S_n$ is constant on cycle type then
$$P = \sum_{c \in \mathcal{N}_n} \rho_c\, U(c) \quad \text{where } \rho_c = P\left(c(\pi) = c\right). \tag{6.60}$$

Proof If $c \in \mathcal{N}_n$ is such that $\rho_c \ne 0$ then by (6.56),
$$1 = \sum_{\gamma : c(\gamma) = c} P\left(\gamma \mid c(\gamma) = c\right) = N(c)\, P\left(\pi \mid c(\pi) = c\right),$$
and therefore $P(\pi \mid c(\pi) = c) = 1/N(c)$. Hence, for any $\pi \in S_n$, with c = c(π),
$$P(\pi) = P\left(\pi \mid c(\pi) = c\right) P\left(c(\pi) = c\right) = \rho_c/N(c),$$
that is, P is the mixture (6.60).

For {i, j, k} ⊂ {1, ..., n} distinct, let
$$A = \{\pi : \pi(k) = j\} \quad \text{and} \quad B = \{\pi : \pi(i) = j\},$$
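and let $\tau_{ik}$ be the transposition of i and k. Then
$$\pi \in A \quad \text{if and only if} \quad \tau_{ik}^{-1}\pi\tau_{ik} \in B.$$
Hence, if the distribution of π is constant on conjugacy classes,
$$P(A) = P\left(\tau_{ik}^{-1} A \tau_{ik}\right) = P(B),$$
so if in addition π has no fixed points,
$$1 = \sum_{k : k \ne j} P\left(\pi(k) = j\right) = \sum_{k : k \ne j} P\left(\pi(i) = j\right),$$
and hence
$$P\left(\pi(i) = j\right) = \frac{1}{n-1} \quad \text{for } i \ne j. \tag{6.61}$$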
If π has no fixed points with probability one then no $a_{ii}$ appears in the sum (6.1), and we may take $a_{ii} = 0$ for all i for convenience. In this case, letting
$$a_{io} = \frac{1}{n-2}\sum_{j=1}^n a_{ij}, \quad a_{oj} = \frac{1}{n-2}\sum_{i=1}^n a_{ij} \quad \text{and} \quad a_{oo} = \frac{1}{(n-1)(n-2)}\sum_{ij} a_{ij},$$
by (6.61) we have
$$EY = \sum_{i=1}^n E a_{i,\pi(i)} = \frac{1}{n-1}\sum_{i=1}^n \sum_{j : j \ne i} a_{ij} = \frac{1}{n-1}\sum_{i=1}^n \sum_{j=1}^n a_{ij} = (n-2)a_{oo}.$$
Now note that
$$\sum_{i=1}^n a_{io} = (n-1)a_{oo} \quad \text{and} \quad \sum_{i=1}^n a_{o\pi(i)} = \sum_{j=1}^n a_{oj} = (n-1)a_{oo},$$
the latter equality holding since π is a permutation. Letting
$$\hat{a}_{ij} = \begin{cases} a_{ij} - a_{io} - a_{oj} + a_{oo} & \text{for } i \ne j, \\ 0 & \text{for } i = j, \end{cases} \tag{6.62}$$
where the choice of $\hat{a}_{ii}$ is arbitrary, when π has no fixed points, using that {π(j) : j = 1, ..., n} = {1, ..., n}, we have
$$\sum_{i=1}^n \hat{a}_{i\pi(i)} = \sum_{i=1}^n a_{i\pi(i)} - 2(n-1)a_{oo} + n a_{oo} = \sum_{i=1}^n a_{i\pi(i)} - (n-2)a_{oo} = \sum_{i=1}^n a_{i\pi(i)} - EY. \tag{6.63}$$
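Additionally, noting
$$\sum_{i : i \ne j} a_{ij} = \sum_{i=1}^n a_{ij} = (n-2)a_{oj}$$
and
$$\sum_{i : i \ne j} a_{io} = \sum_{i=1}^n a_{io} - a_{oj} = (n-1)a_{oo} - a_{oj},$$
we have
$$\sum_{i=1}^n \hat{a}_{ij} = \sum_{i : i \ne j} \hat{a}_{ij} = (n-2)a_{oj} - \left((n-1)a_{oo} - a_{oj}\right) - (n-1)a_{oj} + (n-1)a_{oo} = 0. \tag{6.64}$$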
In summary, in view of (6.63), (6.64), and the corresponding identity when the roles of i and j are reversed, when π has no fixed points, by replacing $a_{ij}$ by $\hat{a}_{ij}$, we may without loss of generality assume that EY = 0, and in particular, that
$$a_{io} = 0, \quad a_{oj} = 0, \quad \text{and} \quad \sum_{ij} a_{ij} = 0. \tag{6.65}$$
Lastly note that if $a_{ij}$ is symmetric then so is $\hat{a}_{ij}$, and in this case
$$\hat{a}_{ij} = \begin{cases} a_{ij} - 2a_{io} + a_{oo} & \text{for } i \ne j, \\ 0 & \text{for } i = j. \end{cases} \tag{6.66}$$
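The centering step can be checked numerically. The sketch below is a hedged illustration, not part of the text: it applies (6.62) in the form $a_{ij} - a_{io} - a_{oj} + a_{oo}$ to a symmetric array with zero diagonal (the array, the size n = 5 and the function names are arbitrary choices), and verifies with exact rational arithmetic that the centered rows and columns sum to zero and that (6.63) holds for every fixed point free permutation.

```python
from fractions import Fraction
from itertools import permutations


def center(a):
    """Apply (6.62): subtract a_{io} and a_{oj}, add a_{oo}, zero the diagonal."""
    n = len(a)
    row = [Fraction(sum(a[i]), n - 2) for i in range(n)]                       # a_{io}
    col = [Fraction(sum(a[i][j] for i in range(n)), n - 2) for j in range(n)]  # a_{oj}
    tot = Fraction(sum(map(sum, a)), (n - 1) * (n - 2))                        # a_{oo}
    return [
        [Fraction(0) if i == j else a[i][j] - row[i] - col[j] + tot for j in range(n)]
        for i in range(n)
    ]


def total(a, perm):
    """The sum (6.1): sum over i of a[i][perm[i]]."""
    return sum(a[i][perm[i]] for i in range(len(perm)))


n = 5
# Illustrative symmetric array with a_ii = 0.
a = [[0 if i == j else (i + 1) * (j + 1) for j in range(n)] for i in range(n)]
a_hat = center(a)
mean_Y = Fraction(sum(map(sum, a)), n - 1)  # EY = (n - 2) a_{oo}, by (6.61)
derangements = [p for p in permutations(range(n)) if all(p[i] != i for i in range(n))]
shifted_ok = all(total(a_hat, p) == total(a, p) - mean_Y for p in derangements)
```

Exact fractions are used so that the identities can be tested with equality rather than a floating point tolerance.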
Regarding the variance of Y, Lemma 6.7 below shows that when π is chosen uniformly over a fixed cycle type c without fixed points, n ≥ 4 and $a_{ij} = a_{ji}$, the variance $\sigma_c^2 = \mathrm{Var}(Y)$ is given by
$$\sigma_c^2 = \left(\frac{1}{n-1} + \frac{2c_2}{n(n-3)}\right)\sum_{i \ne j} (a_{ij} - 2a_{io} + a_{oo})^2. \tag{6.67}$$
Remarkably, for a given n the variance $\sigma_c^2$ depends on the vector c of cycle types only through $c_2$, the number of 2-cycles. When π is uniform over the set of fixed point free involutions, n is even and $c_2 = n/2$, (6.67) yields
$$\sigma_c^2 = \frac{2(n-2)}{(n-1)(n-3)}\sum_{i \ne j} (a_{ij} - 2a_{io} + a_{oo})^2. \tag{6.68}$$
On the other hand, if π has no 2-cycles, $c_2 = 0$ and
$$\sigma_c^2 = \frac{1}{n-1}\sum_{i \ne j} (a_{ij} - 2a_{io} + a_{oo})^2. \tag{6.69}$$
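The dependence of the variance on c only through $c_2$ can be confirmed by exhaustive enumeration for small n. The following sketch is an illustration under stated assumptions: the array is symmetric with zero diagonal, the centered entries are computed as $a_{ij} - a_{io} - a_{oj} + a_{oo}$ from (6.62), and the helper names are hypothetical. It compares the exact variance of Y over all permutations of a given fixed point free cycle type in $S_6$ with the right-hand side of (6.67).

```python
from fractions import Fraction
from itertools import permutations


def cycle_type(perm):
    """Return (c_1, ..., c_n), where c_q is the number of q-cycles of perm."""
    n, seen, counts = len(perm), [False] * len(perm), [0] * len(perm)
    for start in range(n):
        length, k = 0, start
        while not seen[k]:
            seen[k], k, length = True, perm[k], length + 1
        if length:
            counts[length - 1] += 1
    return tuple(counts)


def centered_square_sum(a):
    """Sum over i != j of the squared centered entries, centering as in (6.62)."""
    n = len(a)
    row = [Fraction(sum(a[i]), n - 2) for i in range(n)]
    col = [Fraction(sum(a[i][j] for i in range(n)), n - 2) for j in range(n)]
    tot = Fraction(sum(map(sum, a)), (n - 1) * (n - 2))
    return sum(
        (a[i][j] - row[i] - col[j] + tot) ** 2
        for i in range(n) for j in range(n) if i != j
    )


def exact_variance(a, c):
    """Variance of Y when pi is uniform over permutations of cycle type c."""
    n = len(a)
    values = [
        Fraction(sum(a[i][p[i]] for i in range(n)))
        for p in permutations(range(n)) if cycle_type(p) == tuple(c)
    ]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def formula_variance(a, c):
    """Right-hand side of (6.67); c[1] is c_2, the number of 2-cycles."""
    n = len(a)
    return (Fraction(1, n - 1) + Fraction(2 * c[1], n * (n - 3))) * centered_square_sum(a)


n = 6
a = [[0 if i == j else (i + 1) * (j + 1) for j in range(n)] for i in range(n)]
types = [(0, 3, 0, 0, 0, 0), (0, 1, 0, 1, 0, 0), (0, 0, 2, 0, 0, 0), (0, 0, 0, 0, 0, 1)]
results = {c: (exact_variance(a, c), formula_variance(a, c)) for c in types}
```

The last two types both have $c_2 = 0$, so their variances should coincide even though one consists of two 3-cycles and the other of a single 6-cycle.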
When normal approximations hold for Y when π has distribution U(c) for some $c \in \mathcal{N}_n$, it is clear upon comparing (6.68) with (6.69) that the variance of the approximating normal variable depends on c. More generally, when the distribution of π is constant on cycle type, the mixture property (6.60) allows for an approximation of Y in terms of mixtures of mean zero normal variables, as in the following theorem.

Theorem 6.3 Let n ≥ 5 and let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers satisfying
$$a_{ij} = a_{ji}. \tag{6.70}$$
Let $\pi \in S_n$ be a random permutation with distribution constant on cycle type, having no fixed points. Then, with Y given by (6.1) and $W = (Y - EY)/\sigma_\rho$, we have
$$\sup_{z \in \mathbb{R}} \left|P(W \le z) - P(Z_\rho \le z)\right| \le 40C\left(1 + \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2\pi}}{4}\right)\sum_{c \in \mathcal{N}_n} \frac{\rho_c}{\sigma_c},$$
where
$$\sigma_\rho^2 = \sum_{c \in \mathcal{N}_n} \rho_c \sigma_c^2 \quad \text{and} \quad \mathcal{L}(Z_\rho) = \sum_{c \in \mathcal{N}_n} \rho_c\, \mathcal{L}(Z_c/\sigma_\rho), \tag{6.71}$$
with $Z_c \sim N(0, \sigma_c^2)$, $\sigma_c^2$ given by (6.67), $\rho_c = P(c(\pi) = c)$, and $C = \max_{i \ne j} |a_{ij} - 2a_{io} + a_{oo}|$.

In the special case where π is uniformly distributed on fixed point free involutions, with $W = (Y - EY)/\sigma_c$ and $\sigma_c^2$ given by (6.68),
$$\sup_{z \in \mathbb{R}} \left|P(W \le z) - P(Z \le z)\right| \le 24C\left(1 + \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2\pi}}{4}\right)\Big/\sigma_c,$$
where Z ∼ N(0, 1).

We note the numerical value of the coefficient of C in the general, and the involution case, are approximately equal to 125.07 and 75.04, respectively.

The proof of the theorem follows fairly quickly from Lemma 6.10, which considers the special case where π has distribution U(c), and the mixture property in Lemma 6.5. The proof of Lemma 6.10 is preceded by a sequence of lemmas. Lemma 6.6 provides a helpful decomposition. Lemma 6.7 gives the variance of Y in (6.1) when π has distribution U(c) for some $c \in \mathcal{N}_n$. Lemma 6.8 records some properties of the difference of the pair y′, y″, given as functions of two fixed permutations π′ and π″, related by transpositions. Lemma 6.9 constructs a Stein pair (Y′, Y″). Then, Lemma 6.10 is shown by following the outline in Sect. 4.4.1 to construct the appropriate square bias variables, followed by applying Theorem 5.1 to the resulting zero bias coupling.

To better highlight the reason for the imposition of the symmetry condition (6.70) and the exclusion of fixed points, in Lemmas 6.6, 6.8 and the proof of Lemma 6.9 we consider an array satisfying only (6.65) and allow fixed points. For a given permutation π and i, j ∈ {1, ..., n}, write i ∼ j if i and j are in the same cycle of π, and let |i| denote the length of the cycle of π containing i.
Lemma 6.6 Let π be a fixed permutation. For any i ≠ j, distinct elements of {1, ..., n}, the sets $A_0, \ldots, A_5$ form a partition of the space, where
$$A_0 = \{|\{i, j, \pi(i), \pi(j)\}| = 2\}, \quad A_1 = \{|i| = 1, |j| \ge 2\}, \quad A_2 = \{|i| \ge 2, |j| = 1\},$$
$$A_3 = \{|i| \ge 3, \pi(i) = j\}, \quad A_4 = \{|j| \ge 3, \pi(j) = i\}, \quad A_5 = \{|\{i, j, \pi(i), \pi(j)\}| = 4\}.$$
Additionally, the sets $A_{0,1}$ and $A_{0,2}$ partition $A_0$, where
$$A_{0,1} = \{\pi(i) = i, \pi(j) = j\}, \quad A_{0,2} = \{\pi(i) = j, \pi(j) = i\},$$
and we may also write
$$A_1 = \{\pi(i) = i, \pi(j) \ne j\}, \quad A_2 = \{\pi(i) \ne i, \pi(j) = j\},$$
$$A_3 = \{\pi(i) = j, \pi(j) \ne i\}, \quad A_4 = \{\pi(j) = i, \pi(i) \ne j\},$$
and membership in $A_m$, m = 0, ..., 5 depends only on i, j, π(i), π(j). Lastly, the sets $A_{5,m}$, m = 1, ..., 4 partition $A_5$, where
$$A_{5,1} = \{|i| = 2, |j| = 2, i \not\sim j\}, \quad A_{5,2} = \{|i| = 2, |j| \ge 3\},$$
$$A_{5,3} = \{|i| \ge 3, |j| = 2\}, \quad A_{5,4} = \{|i| \ge 3, |j| \ge 3\} \cap A_5,$$
and membership in $A_{5,m}$, m = 1, ..., 4 depends only on i, j, $\pi^{-1}(j)$, $\pi^{-1}(i)$, π(i), π(j).

Proof The sets $A_m$, m = 0, ..., 5 are clearly disjoint, so we need only demonstrate that they are exhaustive. Let s = |{i, j, π(i), π(j)}|. Since i ≠ j, we have 2 ≤ s ≤ 4. The case $A_0$ is exactly the case s = 2. There are four cases when s = 3. Either exactly one of i or j is a fixed point, that is, either we are in case $A_1$ or $A_2$, or neither i nor j is a fixed point, and so i ≠ π(i) and j ≠ π(j). As i ≠ j, and therefore π(i) ≠ π(j), the only equalities among i, π(i), j, π(j) which are yet possible are π(i) = j or π(j) = i. Both equalities cannot hold at once, as then s = 2. The case where only the first equality is satisfied is $A_3$, and only the second is $A_4$. Clearly what remains now is exactly $A_5$.

The sets $A_{0,1}$ and $A_{0,2}$ are clearly disjoint and union to $A_0$, and the alternative ways to express $A_1$, $A_2$, $A_3$ and $A_4$ are clear, as, therefore, are the claims about what values are sufficient to determine membership in these sets.

The sets $A_{5,m}$, m = 1, ..., 4 are also clearly disjoint. If either i or j is a fixed point then s ≤ 3, so on $A_5$ we must have |i| ≥ 2 and |j| ≥ 2. The set $A_{5,1}$ is where both i and j are in 2-cycles, in which case these cycles must be distinct in order that s = 4. The sets $A_{5,2}$ and $A_{5,3}$ are the cases where exactly one of i or j is in a 2-cycle, and these are already subsets of $A_5$. The remaining case in $A_5$ is when i and j are both in cycles of length at least 3, yielding $A_{5,4}$.

We now calculate the variance of Y when π in (6.1) is chosen uniformly over all permutations of some fixed cycle type with no fixed points.
Lemma 6.7 For n ≥ 4 let $c \in \mathcal{N}_n$ with $c_1 = 0$, and let π be uniformly chosen from all permutations with cycle type c. Assume that $a_{ij} = a_{ji}$. Then the variance of Y in (6.1) is given by
$$\sigma_c^2 = \left(\frac{1}{n-1} + \frac{2c_2}{n(n-3)}\right)\sum_{i \ne j} (a_{ij} - 2a_{io} + a_{oo})^2.$$

Proof Without loss of generality we may take $a_{ii} = 0$ and then replace $a_{ij}$ by $a_{ij} - 2a_{io} + a_{oo}$ for i ≠ j as in (6.66), and so, in particular, we may assume (6.65) holds. In particular EY = 0 and Var(Y) = $EY^2$. Expanding,
$$EY^2 = E\sum_{ij} a_{i,\pi(i)} a_{j,\pi(j)} = E\sum_i a_{i,\pi(i)}^2 + E\sum_{i \ne j} a_{i,\pi(i)} a_{j,\pi(j)}.$$
For the first term, by (6.61), we have
$$\sum_i E a_{i,\pi(i)}^2 = \frac{1}{n-1}\sum_{ij} a_{ij}^2. \tag{6.72}$$
It is helpful to write the second term as
$$E\sum_{i \ne j} a_{i,\pi(i)} a_{j,\pi(j)} = n(n-1)\, E a_{I,\pi(I)} a_{J,\pi(J)}, \tag{6.73}$$
where I and J are chosen uniformly from all distinct pairs, independently of π. We evaluate this expectation with the help of the decomposition in Lemma 6.6, starting with $A_0$. Noting $A_{0,1}$ is null as $c_1 = 0$, from $A_{0,2}$ we have
$$E a_{I,\pi(I)} a_{J,\pi(J)} 1_{A_{0,2}} = E a_{I,J}^2 1_{\{\pi(I) = J, \pi(J) = I\}} = \frac{2c_2}{n^2(n-1)^2}\sum_{i \ne j} a_{ij}^2, \tag{6.74}$$
noting that there are n(n − 1) possibilities for I and J, another factor of n(n − 1) for the possible values of π(i) and π(j), and $c_2$ ways that (i, j) can be placed as a 2-cycle, with the same holding for (j, i). As $A_1$ and $A_2$ are null, moving on to $A_3$ we have, by similar reasoning,
$$E a_{I,\pi(I)} a_{J,\pi(J)} 1_{A_3} = E a_{I,J} a_{J,\pi(J)} 1_{\{|I| \ge 3, \pi(I) = J\}} = \frac{\sum_{b \ge 3} b c_b}{n^2(n-1)^2(n-2)}\sum_{|\{i,j,k\}| = 3} a_{ij} a_{jk}. \tag{6.75}$$
By symmetry the event $A_4$ contributes the same. Lastly, consider the contributions from $A_5$. Starting with $A_{5,1}$, we have
$$E a_{I,\pi(I)} a_{J,\pi(J)} 1_{A_{5,1}} = E a_{I,\pi(I)} a_{J,\pi(J)} 1_{\{|I| = 2, |J| = 2, I \not\sim J\}} = \frac{4c_2(c_2 - 1)}{n^2(n-1)^2(n-2)(n-3)}\sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl}, \tag{6.76}$$
and
$$E a_{I,\pi(I)} a_{J,\pi(J)} 1_{A_{5,2}} = E a_{I,\pi(I)} a_{J,\pi(J)} 1_{\{|I| = 2, |J| \ge 3\}} = \frac{2c_2\sum_{b \ge 3} b c_b}{n^2(n-1)^2(n-2)(n-3)}\sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl}. \tag{6.77}$$
The contribution from $A_{5,3}$ is the same as that from $A_{5,2}$. We break $A_{5,4}$ into two subcases, depending on whether or not I and J are in a common cycle. When they are, we obtain
$$E a_{I,\pi(I)} a_{J,\pi(J)} 1_{\{A_{5,4}, I \sim J\}} = E a_{I,\pi(I)} a_{J,\pi(J)} 1_{\{|I| \ge 3, I \sim J, A_5\}} = \frac{\sum_{b \ge 3} b c_b (b - 3)}{n^2(n-1)^2(n-2)(n-3)}\sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl}, \tag{6.78}$$
where the term b − 3 accounts for the fact that on $A_5$ the value of π(j) in the cycle of length b cannot lie in {i, j, π(i)}. When I and J are in disjoint cycles we have
$$E a_{I,\pi(I)} a_{J,\pi(J)} 1_{\{A_{5,4}, I \not\sim J\}} = E a_{I,\pi(I)} a_{J,\pi(J)} 1_{\{|I| \ge 3, |J| \ge 3, I \not\sim J\}} = \frac{1}{n^2(n-1)^2(n-2)(n-3)}\sum_{b \ge 3} b c_b\left(\sum_{d \ge 3} d c_d - b\right)\sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl}, \tag{6.79}$$
where the term −b accounts for the fact that j must lie in a cycle of length at least three, different from the one of length b ≥ 3 that contains i.

To simplify the sums, using that $a_{io} = 0$ we obtain
$$\sum_{|\{i,j,k\}| = 3} a_{ij} a_{jk} = -\sum_{i \ne j} a_{ij}^2$$
and therefore
$$\sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl} = -\sum_{|\{i,j,k\}| = 3} a_{ik} a_{ji} - \sum_{|\{i,j,k\}| = 3} a_{ik} a_{jk} = \sum_{i \ne j} a_{ij}^2 + \sum_{i \ne k} a_{ik}^2 = 2\sum_{i \ne j} a_{ij}^2.$$
Summing the contributions to (6.73) from the events $A_0, \ldots, A_4$, that is, (6.74) and twice (6.75), using $\sum_{b \ge 2} b c_b = n$ and letting $(n)_k = n(n-1)\cdots(n-k+1)$ denote the falling factorial, yields $\sum_{i \ne j} a_{ij}^2/(n)_4$ times
$$(n)_4\left(\frac{2c_2}{n(n-1)} - \frac{2\sum_{b \ge 3} b c_b}{n(n-1)(n-2)}\right) = (n-3)\left((n-2)2c_2 - 2(n - 2c_2)\right) = (n-3)2n(c_2 - 1). \tag{6.80}$$
Adding up the contributions to (6.73) from $A_5$, that is, (6.76), twice (6.77), (6.78) and (6.79), yields $\sum_{i \ne j} a_{ij}^2/(n)_4$ times
$$8c_2(c_2 - 1) + 8c_2\sum_{b \ge 3} b c_b + 2\sum_{b \ge 3} b c_b(b - 3) + 2\sum_{b \ge 3} b c_b\left(\sum_{d \ge 3} d c_d - b\right) \tag{6.81}$$
$$= 8c_2(c_2 - 1) + 8c_2\sum_{b \ge 3} b c_b - 6\sum_{b \ge 3} b c_b + 2\sum_{b \ge 3} b c_b\sum_{d \ge 3} d c_d$$
$$= 8c_2(c_2 - 1) + 8c_2(n - 2c_2) - 6(n - 2c_2) + 2(n - 2c_2)^2 = 2n^2 - 6n + 4c_2. \tag{6.82}$$
Now totalling all contributions, adding (6.80) to (6.82) we obtain
$$(n-3)2n(c_2 - 1) + 2n^2 - 6n + 4c_2 = 2c_2(n-1)(n-2).$$
Dividing by $(n)_4$ gives the second term in the expression for $\sigma_c^2$. The first term is (6.72).

We will use Y′, y′ and π′ interchangeably for Y, y and π, respectively. Again, for i ∈ {1, ..., n} we let |i| denote the number of elements in the cycle of π that contains i. Due to the way that π″ is formed from π′ in Lemma 6.8 using two distinct indices i and j, the various cases for expressing the difference Y′ − Y″ depend only on i and j and their pre and post images under π.

Lemma 6.8 Let π be a fixed permutation and i and j distinct elements of {1, ..., n}. Letting $\pi(-\alpha) = \pi^{-1}(\alpha)$ for α ∈ {1, ..., n}, set $\chi_{i,j} = \{-j, -i, i, j\}$, so that
$$\{\pi(\alpha), \alpha \in \chi_{i,j}\} = \{\pi^{-1}(j), \pi^{-1}(i), \pi(i), \pi(j)\}.$$
Then, for $\pi'' = \tau_{ij}\pi\tau_{ij}$ with $\tau_{ij}$ the transposition of i and j, and y′ and y″ given by (6.1) with π′ and π″ replacing π, respectively,
$$y' - y'' = b\left(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\right),$$
where
$$b\left(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\right) = \sum_{m=0}^5 b_m\left(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\right) 1_{A_m} \tag{6.83}$$
with $A_m$, m = 0, ..., 5 as in Lemma 6.6, $b_0(i, j, \pi(\alpha), \alpha \in \chi_{i,j}) = 0$,
$$b_1 = a_{ii} + a_{\pi^{-1}(j),j} + a_{j,\pi(j)} - \left(a_{jj} + a_{\pi^{-1}(j),i} + a_{i,\pi(j)}\right),$$
$$b_2 = a_{jj} + a_{\pi^{-1}(i),i} + a_{i,\pi(i)} - \left(a_{ii} + a_{\pi^{-1}(i),j} + a_{j,\pi(i)}\right),$$
$$b_3 = a_{\pi^{-1}(i),i} + a_{ij} + a_{j,\pi(j)} - \left(a_{\pi^{-1}(i),j} + a_{ji} + a_{i,\pi(j)}\right),$$
$$b_4 = a_{\pi^{-1}(j),j} + a_{ji} + a_{i,\pi(i)} - \left(a_{\pi^{-1}(j),i} + a_{ij} + a_{j,\pi(i)}\right),$$
and
$$b_5 = a_{\pi^{-1}(i),i} + a_{i,\pi(i)} + a_{\pi^{-1}(j),j} + a_{j,\pi(j)} - \left(a_{\pi^{-1}(i),j} + a_{j,\pi(i)} + a_{\pi^{-1}(j),i} + a_{i,\pi(j)}\right),$$
where each $b_m$ is evaluated at $(i, j, \pi(\alpha), \alpha \in \chi_{i,j})$.
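The case analysis of Lemmas 6.6 and 6.8 can be checked exhaustively for small n. The sketch below is illustrative (0-indexed permutations, an arbitrary integer array that is neither symmetric nor zero on the diagonal, hypothetical helper names): for every permutation of $S_5$ and every ordered pair of distinct indices it classifies the triple into one of $A_0, \ldots, A_5$, evaluates the corresponding $b_m$, and compares with the difference y′ − y″ computed directly from $\pi'' = \tau_{ij}\pi\tau_{ij}$.

```python
from itertools import permutations


def conjugate(perm, i, j):
    """Return tau_ij * perm * tau_ij, with tau_ij the transposition of i and j."""
    t = list(range(len(perm)))
    t[i], t[j] = j, i
    return tuple(t[perm[t[k]]] for k in range(len(perm)))


def total(a, perm):
    """The sum (6.1) for a fixed permutation."""
    return sum(a[k][perm[k]] for k in range(len(perm)))


def case(perm, i, j):
    """Index m of the set A_m of Lemma 6.6 containing (perm, i, j)."""
    if len({i, j, perm[i], perm[j]}) == 2:
        return 0
    if perm[i] == i:
        return 1
    if perm[j] == j:
        return 2
    if perm[i] == j:
        return 3
    if perm[j] == i:
        return 4
    return 5


def b(a, perm, i, j):
    """Evaluate (6.83) from i, j and their pre and post images under perm."""
    inv = [0] * len(perm)
    for k, image in enumerate(perm):
        inv[image] = k
    pi, pj, qi, qj = perm[i], perm[j], inv[i], inv[j]
    m = case(perm, i, j)
    if m == 0:
        return 0
    if m == 1:
        return a[i][i] + a[qj][j] + a[j][pj] - (a[j][j] + a[qj][i] + a[i][pj])
    if m == 2:
        return a[j][j] + a[qi][i] + a[i][pi] - (a[i][i] + a[qi][j] + a[j][pi])
    if m == 3:
        return a[qi][i] + a[i][j] + a[j][pj] - (a[qi][j] + a[j][i] + a[i][pj])
    if m == 4:
        return a[qj][j] + a[j][i] + a[i][pi] - (a[qj][i] + a[i][j] + a[j][pi])
    return (a[qi][i] + a[i][pi] + a[qj][j] + a[j][pj]
            - (a[qi][j] + a[j][pi] + a[qj][i] + a[i][pj]))


n = 5
a = [[(3 * i * i + 5 * j + i * j) % 11 for j in range(n)] for i in range(n)]
triples = [(p, i, j) for p in permutations(range(n))
           for i in range(n) for j in range(n) if i != j]
mismatches = [x for x in triples
              if total(a, x[0]) - total(a, conjugate(x[0], x[1], x[2])) != b(a, *x)]
cases_seen = {case(*x) for x in triples}
```

The order of the tests in `case` mirrors the proof of Lemma 6.6: first s = 2, then the fixed point cases, then π(i) = j or π(j) = i, with $A_5$ as the remainder.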
Proof First we note that equality (6.83) defines a function, as Lemma 6.6 shows that $1_{A_m}$, m = 0, ..., 5 depend only on the given variables. Now considering the difference, under $A_0$ either π(i) = i and π(j) = j, or π(i) = j and π(j) = i; in both cases π″ = π′ and therefore y″ = y′, and their difference is zero, corresponding to the claimed form for $b_0$.

When $A_1$ is true, since |i| = 1 and π(j) ≠ j we have
$$\pi''(j) = \tau_{ij}\pi\tau_{ij}(j) = \tau_{ij}\pi(i) = \tau_{ij}(i) = j$$
and
$$\pi''\left(\pi^{-1}(j)\right) = \tau_{ij}\pi\tau_{ij}\left(\pi^{-1}(j)\right) = \tau_{ij}\pi\left(\pi^{-1}(j)\right) = \tau_{ij}(j) = i$$
and
$$\pi''(i) = \tau_{ij}\pi\tau_{ij}(i) = \tau_{ij}\pi(j) = \pi(j).$$
If $k \notin \{i, \pi^{-1}(j), j\}$ then $\tau_{ij}(k) = k$ and $\pi(k) \notin \{i, j\}$, and therefore
$$\pi''(k) = \tau_{ij}\pi\tau_{ij}(k) = \tau_{ij}\pi(k) = \pi(k) \quad \text{for all } k \notin \{i, \pi^{-1}(j), j\}.$$
That is, on $A_1$ the permutations π′ and π″ only differ in that where π′ has the action $i \to i$, $\pi^{-1}(j) \to j \to \pi(j)$, leading to the terms $a_{ii} + a_{\pi^{-1}(j),j} + a_{j,\pi(j)}$, the permutation π″ has the action $j \to j$, $\pi^{-1}(j) \to i \to \pi(j)$, leading to the terms $a_{jj} + a_{\pi^{-1}(j),i} + a_{i,\pi(j)}$.

Taking the difference now leads to the form claimed for $b_1$ when $A_1$ is true. By symmetry, on $A_2$ we have the same result as for $A_1$ upon interchanging i and j. Similarly, when $A_3$ is true the only difference between π′ and π″ is that the former has the action $\pi^{-1}(i) \to i \to j \to \pi(j)$, leading to the terms $a_{\pi^{-1}(i),i} + a_{ij} + a_{j,\pi(j)}$, while the latter has $\pi^{-1}(i) \to j \to i \to \pi(j)$, leading to $a_{\pi^{-1}(i),j} + a_{ji} + a_{i,\pi(j)}$. Again, $A_4$ is the same as $A_3$ with the roles of i and j interchanged. Lastly, when |{i, j, π(i), π(j)}| = 4 the permutation π′ has the action $\pi^{-1}(i) \to i \to \pi(i)$ and $\pi^{-1}(j) \to j \to \pi(j)$ while π″ has $\pi^{-1}(i) \to j \to \pi(i)$ and $\pi^{-1}(j) \to i \to \pi(j)$, making the form of $b_5$ clear.

Our next task is the construction of a Stein pair Y′, Y″, which we accomplish in the following lemma in a manner similar to that in Sect. 4.4.2. We remind the reader that we consider the symbols π′ and Y′ interchangeable with π and Y, respectively.

Lemma 6.9 For n ≥ 5 let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers satisfying
$$a_{ij} = a_{ji} \quad \text{and} \quad a_{ii} = 0.$$
Let $\pi \in S_n$ be a random permutation with distribution constant on cycle type, having no fixed points, and let Y be given by (6.1). Further, let I, J be chosen independently of π, uniformly from all pairs of distinct elements of {1, ..., n}. Then, letting $\pi'' = \tau_{IJ}\pi\tau_{IJ}$ and Y″ be given by (6.1) with π″ replacing π, (Y, Y″) is a 4/n-Stein pair.
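The linearity condition behind the Stein pair property can be checked directly for small n. The sketch below is a hedged illustration with hypothetical helper names: it takes a symmetric array with zero diagonal, centers it as in (6.62) so that (6.65) holds (the proof that follows uses $\sum_{ij} a_{ij} = 0$), and verifies with exact arithmetic that averaging Y′ − Y″ over the n(n − 1) ordered pairs (I, J) gives (4/n)Y′ for every fixed point free permutation of $S_5$.

```python
from fractions import Fraction
from itertools import permutations


def center(a):
    """Center a symmetric array with zero diagonal as in (6.62)."""
    n = len(a)
    row = [Fraction(sum(a[i]), n - 2) for i in range(n)]
    col = [Fraction(sum(a[i][j] for i in range(n)), n - 2) for j in range(n)]
    tot = Fraction(sum(map(sum, a)), (n - 1) * (n - 2))
    return [
        [Fraction(0) if i == j else a[i][j] - row[i] - col[j] + tot for j in range(n)]
        for i in range(n)
    ]


def conjugate(perm, i, j):
    """Return tau_ij * perm * tau_ij."""
    t = list(range(len(perm)))
    t[i], t[j] = j, i
    return tuple(t[perm[t[k]]] for k in range(len(perm)))


def total(a, perm):
    """The sum (6.1) for a fixed permutation."""
    return sum(a[k][perm[k]] for k in range(len(perm)))


def mean_difference(a, perm):
    """E(Y' - Y'' | pi' = perm), averaging over ordered pairs of distinct (I, J)."""
    n = len(perm)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    diffs = sum(total(a, perm) - total(a, conjugate(perm, i, j)) for i, j in pairs)
    return Fraction(diffs) / len(pairs)


n = 5
a = [[0 if i == j else (i + 1) * (j + 1) for j in range(n)] for i in range(n)]
a_hat = center(a)
derangements = [p for p in permutations(range(n)) if all(p[k] != k for k in range(n))]
linear = all(mean_difference(a_hat, p) == Fraction(4, n) * total(a_hat, p)
             for p in derangements)
```

Because the check conditions on each permutation separately, it applies to any distribution on fixed point free permutations, in particular to every distribution constant on cycle type.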
Proof First we show that the pair of permutations π′, π″ is exchangeable. For fixed permutations σ′, σ″, if $\sigma'' \ne \tau_{IJ}\sigma'\tau_{IJ}$ then
$$P(\pi' = \sigma', \pi'' = \sigma'') = 0 = P(\pi' = \sigma'', \pi'' = \sigma').$$
Otherwise $\sigma'' = \tau_{IJ}\sigma'\tau_{IJ}$, and using (6.57) for the second equality followed by $\tau_{ij}^{-1} = \tau_{ij}$, we have
$$P(\pi' = \sigma', \pi'' = \sigma'') = P(\pi' = \sigma') = P(\pi' = \tau_{IJ}\sigma'\tau_{IJ}) = P(\pi' = \sigma'') = P(\pi' = \sigma'', \pi'' = \sigma').$$
Consequently, π′ and π″, and therefore Y′ and Y″, given by (6.1) with permutations π′ and π″, respectively, are exchangeable.

It remains to demonstrate that Y′, Y″ satisfies the linearity condition (4.108) with λ = 4/n, for which it suffices to show
$$E(Y' - Y'' \mid \pi) = \frac{4}{n} Y'. \tag{6.84}$$
We prove (6.84) by computing the conditional expectation given π of the sum $\sum_{m=0}^5 b_m(i, j, \pi(\alpha), \alpha \in \chi_{i,j}) 1_{A_m}$ in (6.83) of Lemma 6.8, with $A_0, \ldots, A_5$ given in Lemma 6.6, with i, j replaced by I, J. First we have that $b_0 = 0$. Next, we claim that the contribution to n(n − 1)E(Y′ − Y″ | π) from $b_1$ and $b_2$ totals to
$$2\left(n - c_1(\pi)\right)\sum_{|i| = 1} a_{ii} + 4c_1(\pi)\sum_{|i| \ge 2} a_{i,\pi(i)} - 2c_1(\pi)\sum_{|i| \ge 2} a_{ii} - 2\sum_{|i| = 1, |j| \ge 2} a_{ij} - 2\sum_{|i| \ge 2, |j| = 1} a_{ij}. \tag{6.85}$$
In particular, for the first term $a_{II}$ in the function $b_1$, by summing below over j we obtain
$$E(a_{II} 1_{A_1} \mid \pi) = \frac{1}{n(n-1)}\sum_{i,j} a_{ii} 1_{\{|i| = 1, |j| \ge 2\}} = \frac{n - c_1(\pi)}{n(n-1)}\sum_{|i| = 1} a_{ii}. \tag{6.86}$$
For the next two terms of $b_1$, noting that the sum of $a_{j,\pi(j)}$ over a given cycle of π equals the sum of $a_{\pi^{-1}(j),j}$ over that same cycle, we obtain
$$E(a_{\pi^{-1}(J),J} 1_{A_1} \mid \pi) + E(a_{J,\pi(J)} 1_{A_1} \mid \pi) = 2E(a_{J,\pi(J)} 1_{A_1} \mid \pi) = \frac{2}{n(n-1)}\sum_{i,j=1}^n a_{j,\pi(j)} 1_{\{|i| = 1, |j| \ge 2\}} = \frac{2c_1(\pi)}{n(n-1)}\sum_{|j| \ge 2} a_{j,\pi(j)}. \tag{6.87}$$
Moving to the final three terms of $b_1$, we have similarly that
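$$E(a_{JJ} 1_{A_1} \mid \pi) = \frac{c_1(\pi)}{n(n-1)}\sum_{|j| \ge 2} a_{jj},$$
$$E(a_{\pi^{-1}(J),I} 1_{A_1} \mid \pi) = \frac{1}{n(n-1)}\sum_{|i| = 1, |j| \ge 2} a_{\pi^{-1}(j),i} = \frac{1}{n(n-1)}\sum_{|i| = 1, |j| \ge 2} a_{ji}$$
and
$$E(a_{I,\pi(J)} 1_{A_1} \mid \pi) = \frac{1}{n(n-1)}\sum_{|i| = 1, |j| \ge 2} a_{i,\pi(j)} = \frac{1}{n(n-1)}\sum_{|i| = 1, |j| \ge 2} a_{ij}.$$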
Summing (6.86) and (6.87) and subtracting these last three contributions, and then using the fact that the contribution from $b_2$ is the same as that from $b_1$ by symmetry, we obtain (6.85).

Next, it is easy to see that the first three contributions to n(n − 1)E(Y′ − Y″ | π) from $b_3$, on the event $A_3 = \{\pi(I) = J, |I| \ge 3\}$, all equal $\sum_{|i| \ge 3} a_{i,\pi(i)}$, that the fourth and sixth both equal $-\sum_{|i| \ge 3} a_{\pi^{-1}(i),\pi(i)}$, and that the fifth equals $-\sum_{|i| \ge 3} a_{\pi(i),i}$. Combining this quantity with the equal amount from $b_4$ yields
$$6\sum_{|i| \ge 3} a_{i,\pi(i)} - 4\sum_{|i| \ge 3} a_{\pi^{-1}(i),\pi(i)} - 2\sum_{|i| \ge 3} a_{\pi(i),i}. \tag{6.88}$$
Next, write $A_5 = \{|I| \ge 2, |J| \ge 2, I \ne J, \pi(I) \ne J, \pi(J) \ne I\}$. The first term in $b_5$, $a_{\pi^{-1}(I),I}$, has conditional expectation given π of $(n(n-1))^{-1}$ times
$$\sum_{i,j} a_{\pi^{-1}(i),i}\, 1\left(|i| \ge 2, |j| \ge 2, i \ne j, \pi(i) \ne j, \pi(j) \ne i\right). \tag{6.89}$$
Write i ∼ j when i and j are elements of the same cycle. When i ∼ j and {i, j, π(i), π(j)} are distinct, then |i| ≥ 4 and there are |i| − 3 possible choices for j ∼ i that satisfy the conditions in the indicator in (6.89). Hence, the case i ∼ j contributes
$$\sum_{|i| \ge 4} a_{\pi^{-1}(i),i}\sum_{j \sim i} 1\left(i \ne j, \pi(i) \ne j, \pi(j) \ne i\right) = \sum_{|i| \ge 4} a_{\pi^{-1}(i),i}\left(|i| - 3\right) = \sum_{|i| \ge 3}\left(|i| - 3\right) a_{i,\pi(i)}.$$
When $i \not\sim j$ the conditions in the indicator function in (6.89) are satisfied if and only if |i| ≥ 2, |j| ≥ 2. For |i| ≥ 2 there are $n - |i| - c_1(\pi)$ choices for j, so the case $i \not\sim j$ contributes
$$\sum_{|i| \ge 2} a_{\pi^{-1}(i),i}\sum_{j \not\sim i, |j| \ge 2} 1 = \sum_{|i| \ge 2}\left(n - |i| - c_1(\pi)\right) a_{i,\pi(i)} = \left(n - 2 - c_1(\pi)\right)\sum_{|i| = 2} a_{i,\pi(i)} + \sum_{|i| \ge 3}\left(n - |i| - c_1(\pi)\right) a_{i,\pi(i)}.$$
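As the first four terms of $b_5$ all yield the same contribution, they account for a total of
$$4\left(n - 2 - c_1(\pi)\right)\sum_{|i| = 2} a_{i,\pi(i)} + 4\left(n - 3 - c_1(\pi)\right)\sum_{|i| \ge 3} a_{i,\pi(i)}. \tag{6.90}$$
Decomposing the contribution from the fifth term $-a_{\pi^{-1}(I),J}$ of $b_5$, according to whether i ∼ j or $i \not\sim j$, gives
$$-\sum_{|i| \ge 2, |j| \ge 2} a_{\pi^{-1}(i),j}\, 1\left(i \ne j, \pi(i) \ne j, \pi(j) \ne i\right)$$
$$= -\sum_{|i| \ge 4}\sum_{j \sim i} a_{\pi^{-1}(i),j}\, 1\left(i \ne j, \pi(i) \ne j, \pi(j) \ne i\right) - \sum_{|i| \ge 2, |j| \ge 2,\, j \not\sim i} a_{\pi^{-1}(i),j}$$
$$= -\sum_{|i| \ge 4}\sum_{j \sim i} a_{\pi^{-1}(i),j} + \sum_{|i| \ge 4}\left(a_{\pi^{-1}(i),i} + a_{\pi^{-1}(i),\pi(i)} + a_{\pi^{-1}(i),\pi^{-1}(i)}\right) - \sum_{|i| \ge 2, |j| \ge 2,\, j \not\sim i} a_{\pi^{-1}(i),j}$$
$$= -\sum_{|i| \ge 4}\sum_{j \sim i} a_{ij} + \sum_{|i| \ge 4}\left(a_{i,\pi(i)} + a_{\pi^{-1}(i),\pi(i)} + a_{ii}\right) - \sum_{|i| \ge 2, |j| \ge 2,\, j \not\sim i} a_{ij}. \tag{6.91}$$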
To simplify (6.91), let a ∧ b = min(a, b) and consider a decomposition of the sum $\sum_{ij} a_{ij}$ first by whether i ∼ j or not, and then according to cycle sizes, and in the first case further as to whether the length of the common cycle of i and j is at least 4, and in the second case as to whether the distinct cycles of i and j both have size at least 2. That is, write
$$\sum_{i,j=1}^n a_{ij} = \sum_{|i| \ge 4}\sum_{j \sim i} a_{ij} + \sum_{|i| \le 3}\sum_{j \sim i} a_{ij} + \sum_{|i| \ge 2, |j| \ge 2,\, j \not\sim i} a_{ij} + \sum_{|i| \wedge |j| = 1,\, j \not\sim i} a_{ij}. \tag{6.92}$$
Since $\sum_{i,j} a_{ij} = 0$ by (6.65), we may replace the sum of the first and third terms in (6.91) by the sum of the second and fourth terms on the right hand side of (6.92). Hence, the contribution from $a_{\pi^{-1}(I),J}$ on $A_5$ equals
$$\sum_{|i| \le 3}\sum_{j \sim i} a_{ij} + \sum_{|i| \wedge |j| = 1,\, j \not\sim i} a_{ij} + \sum_{|i| \ge 4}\left(a_{i,\pi(i)} + a_{\pi^{-1}(i),\pi(i)} + a_{ii}\right)$$
$$= \sum_{|i| \le 2}\sum_{j \sim i} a_{ij} + \sum_{|i| \wedge |j| = 1,\, j \not\sim i} a_{ij} + \sum_{|i| \ge 3}\left(a_{i,\pi(i)} + a_{\pi^{-1}(i),\pi(i)} + a_{ii}\right),$$
where to obtain the equality we used the fact that $\pi^2(i) = \pi^{-1}(i)$ when |i| = 3. Dealing similarly with the |i| = 2, j ∼ i term we obtain
$$\sum_{|i| = 1} a_{ii} + \sum_{|i| \wedge |j| = 1,\, j \not\sim i} a_{ij} + \sum_{|i| \ge 2}\left(a_{i,\pi(i)} + a_{ii}\right) + \sum_{|i| \ge 3} a_{\pi^{-1}(i),\pi(i)}$$
$$= \sum_{|i| \wedge |j| = 1,\, j \not\sim i} a_{ij} + \sum_{|i| \ge 2} a_{i,\pi(i)} + \sum_{|i| \ge 1} a_{ii} + \sum_{|i| \ge 3} a_{\pi^{-1}(i),\pi(i)}.$$
Combining this contribution with the next three terms of $b_5$, each of which yields the same amount, gives the total
$$4\sum_{|i| \ge 2} a_{i,\pi(i)} + 4\sum_{|i| \ge 3} a_{\pi^{-1}(i),\pi(i)} + 4\sum_{|i| \ge 1} a_{ii} + 4\sum_{|i| \wedge |j| = 1,\, j \not\sim i} a_{ij}. \tag{6.93}$$
Combining (6.93) with the contribution (6.90) from the first four terms in $b_5$, the $b_1$ and $b_2$ terms in (6.85) and the $b_3$ and $b_4$ terms (6.88), yields n(n − 1)E(Y′ − Y″ | π′), which, canceling the terms involving $a_{\pi^{-1}(i),\pi(i)}$ and rearranging to group like terms, can be written
$$4(n-1)\sum_{|i| = 2} a_{i,\pi(i)} + (4n - 2)\sum_{|i| \ge 3} a_{i,\pi(i)} - 2\sum_{|i| \ge 3} a_{\pi(i),i} \tag{6.94}$$
$$+\, 2\left(n - c_1(\pi) + 2\right)\sum_{|i| = 1} a_{ii} - 2\left(c_1(\pi) - 2\right)\sum_{|i| \ge 2} a_{ii} \tag{6.95}$$
$$+\, 4\sum_{|i| \wedge |j| = 1,\, j \not\sim i} a_{ij} - 2\sum_{|i| = 1, |j| \ge 2} a_{ij} - 2\sum_{|i| \ge 2, |j| = 1} a_{ij}. \tag{6.96}$$
The assumption that $a_{ii} = 0$ causes the contribution from (6.95) to vanish, the assumption that there are no 1-cycles causes the contribution from (6.96) to vanish, and the assumption that $a_{ij} = a_{ji}$ allows the combination of the second and third terms in (6.94) to yield
$$E(Y' - Y'' \mid \pi') = \frac{1}{n(n-1)}\left(4(n-1)\sum_{|i| = 2} a_{i,\pi'(i)} + (4n - 4)\sum_{|i| \ge 3} a_{i,\pi'(i)}\right) = \frac{4}{n}\sum_{i=1}^n a_{i,\pi'(i)} = \frac{4}{n} Y'.$$
Hence, the linearity condition (4.108) is satisfied with λ = 4/n, completing the argument that Y′, Y″ is a 4/n-Stein pair.

We now prove the special case of Theorem 6.3 when π is uniform over cycle type.

Lemma 6.10 Let n ≥ 5 and let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers satisfying $a_{ij} = a_{ji}$. Let $\pi \in S_n$ be a random permutation with distribution U(c), uniform on cycle type $c \in \mathcal{N}_n$, having no fixed points. Then, letting Y be the sum in (6.1), $\sigma_c^2$ given by (6.67) and $W = (Y - EY)/\sigma_c$,
$$\sup_{z \in \mathbb{R}} \left|P(W \le z) - P(Z \le z)\right| \le 40C\left(1 + \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2\pi}}{4}\right)\Big/\sigma_c, \tag{6.97}$$
where $C = \max_{i \ne j} |a_{ij} - 2a_{io} + a_{oo}|$ and Z is a standard normal variable. When π is uniform over involutions without fixed points, then 40 in (6.97) may be replaced by 24, and $\sigma_c^2$ specializes to the form given in (6.68).
Proof We may set $a_{ii} = 0$, and then by replacing $a_{ij}$ by $a_{ij} - 2a_{io} + a_{oo}$ when i ≠ j, assume without loss of generality that $a_{io} = a_{oj} = EY = 0$. We write Y′ and π′ interchangeably for Y and π, respectively.

We follow the outline in Sect. 4.4.1 to produce a coupling of Y to a pair Y†, Y‡ with the square bias distribution as in Proposition 4.6, satisfying (6.2) and (6.3). We then produce a coupling of Y to Y* having the Y-zero bias distribution using the uniform interpolation as in that proposition, and lastly invoke Theorem 5.1 to obtain the bound.

First construct the Stein pair Y′, Y″ as in Lemma 6.9. Let π be a permutation with distribution U(c). Then, with I and J having distribution
$$P(I = i, J = j) = \frac{1}{n(n-1)} \quad \text{for } i \ne j,$$
set $\pi'' = \tau_{IJ}\pi'\tau_{IJ}$ where $\tau_{ij}$ is the transposition of i and j. Now Y′ and Y″ are given by (6.1) with π replaced by π′ and π″, respectively.

To specialize the outline in Sect. 4.4.1 to this case, we let $\mathbf{I} = (I, J)$ and $\Xi_\alpha = \pi(\alpha)$. In keeping with the notation of Lemma 6.8, with χ = {1, ..., n} we let $\pi(-j) = \pi^{-1}(j)$ for j ∈ χ, and with i and j distinct elements of χ we set $\chi_{i,j} = \{-j, -i, i, j\}$ and
$$p_{i,j}(\xi_\alpha, \alpha \in \chi_{i,j}) = P\left(\pi(\alpha) = \xi_\alpha, \alpha \in \chi_{i,j}\right),$$
the distribution of the pre and post images of i and j under π. Equality (4.116) gives the factorization of the variables from which π′ and π″ are constructed as
$$P(\mathbf{i}, \xi_\alpha, \alpha \in \chi) = P(\mathbf{I} = \mathbf{i})\, P_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, P_{\mathbf{i}^c | \mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}}).$$
The factorization can be interpreted as saying that first we choose I, J, then construct the pre and post images of I and J under π, and then, conditional on what has already been chosen, the values of π on the remaining variables. For the distribution of the pair with the square bias distribution, equality (4.118) gives the parallel factorization,
$$P^\dagger(\mathbf{i}, \xi_\alpha, \alpha \in \chi) = P^\dagger(\mathbf{I} = \mathbf{i})\, P^\dagger_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, P_{\mathbf{i}^c | \mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}}), \tag{6.98}$$
where $P^\dagger(\mathbf{I} = \mathbf{i})$, the distribution of indices we will label $\mathbf{I}^\dagger$, is given by (4.117) and $P^\dagger_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})$ by (4.119). Let σ†, σ‡ have distribution given by (6.98), that is, with I†, J† and $\Xi^\dagger_\alpha$, α ∈ χ having distribution (6.98), $\sigma^\dagger(\alpha) = \Xi^\dagger_\alpha$ and $\sigma^\ddagger = \tau_{I^\dagger,J^\dagger}\sigma^\dagger\tau_{I^\dagger,J^\dagger}$. These permutations do not need to be constructed; we only introduce them so that we can conveniently refer to their distribution, which is the one targeted for π†, π‡.

We construct π†, π‡, of which Y†, Y‡ will be a function, in stages, beginning with the indices I†, J†, and their pre and post images under π†. Following (4.117), with λ = 4/n, let I†, J† have distribution
$$P\left(I^\dagger = i, J^\dagger = j\right) = \frac{r_{i,j}}{2\lambda\sigma_c^2} \quad \text{where} \quad r_{i,j} = P(I = i, J = j)\, E b^2\left(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\right), \tag{6.99}$$
198
6
L∞ : Applications
with b(i, j, ξα , α ∈ χi,j ) as in Lemma 6.8. Next, given I † = i and J † = j , from (4.119), let the pre and post images π −† (J † ), π −† (I † ), π † (I † ), π † (J † ) have distribution † pi,j (ξα , α ∈ χi,j ) =
b2 (i, j, ξα , α ∈ χi,j ) pi,j (ξα , α ∈ χi,j ). Eb2 (i, j, π(α), α ∈ χi,j )
(6.100)
We will place $I^\dagger$ and $J^\dagger$, along with these generated pre and post images, into cycles of appropriate length. The conditional distribution of the remaining values of $\pi^\dagger$, given $I^\dagger, J^\dagger$ and their pre and post images, by (4.118), is the same as the corresponding conditional distribution for $\pi$, which is the uniform distribution over all permutations of cycle type $c$ where $I^\dagger$ and $J^\dagger$ have the specified pre and post images. Hence, to complete the specification of $\pi^\dagger$ we fill in the remaining values of $\pi^\dagger$ uniformly. For this last step we will use the values of $\pi$ to construct $\pi^\dagger$ in a way that makes $\pi^\dagger$ and $\pi$ close.

Lemma 6.6 gives that, for $\pi^\dagger$, membership in $A_0, \ldots, A_4$ and $A_{5,1}, \ldots, A_{5,4}$ is determined by
$$I^\dagger,\ J^\dagger,\ \pi^{-\dagger}(J^\dagger),\ \pi^{-\dagger}(I^\dagger),\ \pi^\dagger(I^\dagger),\ \pi^\dagger(J^\dagger). \tag{6.101}$$
As $b_0 = 0$ from Lemma 6.8 the case $A_0$ has probability zero. Note that the distribution of $\sigma^\dagger, \sigma^\ddagger$ is absolutely continuous with respect to that of $\pi', \pi''$, and therefore the permutations $\sigma^\dagger, \sigma^\ddagger$ have the same cycle structure, namely $c$, as $\pi', \pi''$. In particular, since $\pi$ has no fixed points, $A_2$ is eliminated and we need only consider the events $A_3$, $A_4$ and $A_5$. For the purpose of conditioning on the values in (6.101), for ease of notation we will write
$$(\alpha, \beta) = \bigl(I^\dagger, J^\dagger\bigr) \quad \text{and} \quad (\gamma, \delta, \epsilon, \zeta) = \bigl(\pi^{-\dagger}(J^\dagger), \pi^{-\dagger}(I^\dagger), \pi^\dagger(I^\dagger), \pi^\dagger(J^\dagger)\bigr).$$
The specification of $\pi^\dagger$ depends on which case, or subcase, of the events $A_3, A_4, A_5$ is determined by the variables (6.101). In every subcase, however, $\pi^\dagger$ will be specified in terms of $\pi$ by conjugating with transpositions as
$$\pi^\dagger = \tau_{\iota,\iota^\dagger}\,\pi\,\tau_{\iota,\iota^\dagger} \quad \text{where} \quad \tau_{\iota,\iota^\dagger} = \prod_{k=1}^{\kappa} \tau_{i_k, i_k^\dagger}, \tag{6.102}$$
for $\iota = (i_1, \ldots, i_\kappa)$ and $\iota^\dagger = (i_1^\dagger, \ldots, i_\kappa^\dagger)$, vectors of disjoint indices of some length $\kappa$. Note that when $\pi^\dagger$ is given by $\pi$ through (6.102) then
$$\pi^\dagger(k) = \pi(k) \ \text{for all } k \notin \mathcal{I}_{\iota,\iota^\dagger}, \quad \text{where} \quad \mathcal{I}_{\iota,\iota^\dagger} = \bigl\{\pi^{-1}(i_k),\ i_k,\ \pi^{-1}(i_k^\dagger),\ i_k^\dagger : k = 1, \ldots, \kappa\bigr\}. \tag{6.103}$$
Consider first the case where the generated values determine an outcome in $A_3$, that is, when $J^\dagger = \pi^\dagger(I^\dagger)$ and $\{\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger)\}$ are distinct. If $\pi^\dagger(J^\dagger) \in \{\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger)\}$ then $\pi^\dagger(J^\dagger) = \pi^{-\dagger}(I^\dagger)$ and the generated values form a 3-cycle. By the symmetry of $a_{ij}$ we have that $b_3 = 0$ if $I^\dagger, J^\dagger$ are consecutive elements of a 3-cycle, so $A_3$ has probability zero unless $\sum_{b \ge 4} c_b \ge 1$, that is, unless the cycle type $c$ has cycles of length at least 4. Hence, if so, under $A_3$ the elements $\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger), \pi^\dagger(J^\dagger)$ must be distinct and form part of a cycle of $\pi^\dagger$ of length at least 4. Conditioning on the values in (6.101), and letting $c(\sigma^\dagger, \alpha)$ be the length of the cycle in $\sigma^\dagger$ containing $\alpha$, select a cycle length $b$ according to the distribution
$$P\bigl(c(\sigma^\dagger, \alpha) = b \mid \sigma^{-\dagger}(\alpha) = \delta,\ \sigma^\dagger(\alpha) = \epsilon,\ \sigma^{\dagger 2}(\alpha) = \zeta\bigr)$$
and let $\mathcal{I}$ be chosen uniformly, and independently, from the $b$-cycles of $\pi$. Now let $\pi^\dagger$ be given by (6.102) with
$$\iota = \bigl(\pi^{-1}(\mathcal{I}), \mathcal{I}, \pi(\mathcal{I}), \pi^2(\mathcal{I})\bigr) \quad \text{and} \quad \iota^\dagger = \bigl(\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger), \pi^\dagger(J^\dagger)\bigr).$$
As the inverse images under $\pi$ of the components in $\iota$ are all again components of this vector, with the possible exception of $\pi^{-1}(\mathcal{I})$, the set (6.103) can have size at most $(4 + 1) + 2 \times 4 = 13$ in this case. The construction on $A_4$ is analogous, with the roles of $I^\dagger$ and $J^\dagger$ reversed.

Moving on to $A_5$, consider $A_{5,1}$, where if $c_2 \ge 2$, the elements $I^\dagger$ and $J^\dagger$ are to be placed in distinct 2-cycles. Choosing $\mathcal{I}$ and $\mathcal{J}$ from pairs of indices in distinct 2-cycles, let $\pi^\dagger$ be given by (6.102) with
$$\iota = \bigl(\mathcal{I}, \pi(\mathcal{I}), \mathcal{J}, \pi(\mathcal{J})\bigr) \quad \text{and} \quad \iota^\dagger = \bigl(I^\dagger, \pi^\dagger(I^\dagger), J^\dagger, \pi^\dagger(J^\dagger)\bigr).$$
As $\mathcal{I}$ and $\mathcal{J}$ are members of 2-cycles of $\pi$, the vector $\iota$ already contains all of its inverse images under $\pi$, and therefore the set (6.103) can have size at most $4 + 2 \times 4 = 12$. When $\pi$ is an involution without fixed points, this is the only case.

Similarly, if $c_2$ and $\sum_{b \ge 3} c_b$ are both nonzero, then the probability of $A_{5,2}$ is positive, and we let $\mathcal{I}$ and $\mathcal{J}$ be chosen independently, the first uniformly from the 2-cycles of $\pi$, the second uniformly from elements of the $b$-cycles of $\pi$ where $b$ has distribution
$$P\bigl(c(\sigma^\dagger, \beta) = b \mid \sigma^{-\dagger}(\beta) = \gamma,\ \sigma^\dagger(\beta) = \zeta\bigr). \tag{6.104}$$
Now let $\pi^\dagger$ be given by (6.102) with
$$\iota = \bigl(\mathcal{I}, \pi(\mathcal{I}), \pi^{-1}(\mathcal{J}), \mathcal{J}, \pi(\mathcal{J})\bigr) \quad \text{and} \quad \iota^\dagger = \bigl(I^\dagger, \pi^\dagger(I^\dagger), \pi^{-\dagger}(J^\dagger), J^\dagger, \pi^\dagger(J^\dagger)\bigr).$$
Arguing as above, as $\mathcal{I}$ is in a 2-cycle, the set (6.103) can have size at most $(5 + 1) + 2 \times 5 = 16$. The argument is analogous on $A_{5,3}$.

Before beginning our consideration of the final case, $A_{5,4}$, we note that though the generated values (6.101) are placed in $\pi^\dagger$ according to the correct conditional distributions, such as (6.104), as we are considering a worst case analysis, the actual values of these probabilities never enter our considerations. Hence, on $A_{5,4}$, no matter how $\mathcal{I}$ and $\mathcal{J}$ are selected to be consistent with $A_{5,4}$, the result will be that $\pi^\dagger$ will be given by (6.102) with
$$\iota = \bigl(\pi^{-1}(\mathcal{I}), \mathcal{I}, \pi(\mathcal{I}), \pi^{-1}(\mathcal{J}), \mathcal{J}, \pi(\mathcal{J})\bigr) \quad \text{and} \quad \iota^\dagger = \bigl(\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger), \pi^{-\dagger}(J^\dagger), J^\dagger, \pi^\dagger(J^\dagger)\bigr).$$
In this case the set (6.103) can have size at most $(6 + 2) + 2 \times 6 = 20$.
As $A_0, \ldots, A_5$ is a partition, the construction of $\pi^\dagger$ has been specified in every case. By arguments similar to those in Lemma 4.5, the conditional distribution $P_{\{i,j\}^c|\{i,j\}}(\xi_\alpha, \alpha \notin \chi_{i,j} \mid \xi_\alpha, \alpha \in \chi_{i,j})$ of the remaining values, given the ones now determined, is uniform, so specifying $\pi^\dagger$ by (6.102) and setting $\pi^\ddagger = \tau_{I^\dagger,J^\dagger}\,\pi^\dagger\,\tau_{I^\dagger,J^\dagger}$ results in a collection of variables $I^\dagger, J^\dagger$ and a pair of permutations with the square bias distribution (4.113). Hence, letting $Y, Y^\dagger$ and $Y^\ddagger$ be given by (6.1) with $\pi, \pi^\dagger$ and $\pi^\ddagger$, respectively, results in a coupling of $Y$ to the variables $Y^\dagger, Y^\ddagger$ with the square bias distribution. Now with $T, T^\dagger$ and $T^\ddagger$ given by (6.3), we have
$$\bigl|U T^\dagger + (1 - U) T^\ddagger\bigr| + |T| \le 2 \max_{\iota, \iota^\dagger} |\mathcal{I}_{\iota,\iota^\dagger}|\, C$$
where the maximum is over the values of $\iota, \iota^\dagger$ appearing in the possible cases. For fixed point free involutions, $A_{5,1}$ is the only case, giving the coefficient $2 \times 12 = 24$ on $C$. In general, the coefficient is bounded by $2 \times 20 = 40$, determined by the worst case on $A_{5,4}$. Now (6.4) gives $|Y^* - Y| \le 40C$ in general, and the bound $24C$ for involutions. As $|W^* - W| = |Y^* - Y|/\sigma_c$ by (2.59), invoking Theorem 5.1 with $\delta = 40C/\sigma_c$ and $\delta = 24C/\sigma_c$, respectively, now completes the proof.

Lemma 6.10 and the mixing property of Lemma 6.5 are the key ingredients of the following argument.

Proof of Theorem 6.3 First, note that the claim in Theorem 6.3 regarding involutions is part of Lemma 6.10. Otherwise, by replacing $a_{ij}$ by the array given in (6.66) we may without loss of generality assume that $EY = 0$ whenever $Y$ is given by (6.1) with $\pi$ having distribution constant on cycle type. In this case, writing $Y_c$ for the variable given by (6.1) when $\pi \sim \mathcal{U}(c)$, the mixture property of Lemma 6.5 yields
$$P(Y \le z) = \sum_{c \in N_n} \rho_c P(Y_c \le z),$$
with $\rho_c = P(c(\pi) = c)$, and in addition, from (6.71),
$$P(Z_\rho \le z) = \sum_{c \in N_n} \rho_c P(Z_c/\sigma_\rho \le z).$$
Hence, with $W = Y/\sigma_\rho$, by changes of variable,
$$\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z_\rho \le z)\bigr| = \sup_{z \in \mathbb{R}} \bigl|P(Y \le z) - P(\sigma_\rho Z_\rho \le z)\bigr| \le \sum_{c \in N_n} \rho_c \sup_{z \in \mathbb{R}} \bigl|P(Y_c \le z) - P(Z_c \le z)\bigr| = \sum_{c \in N_n} \rho_c \sup_{z \in \mathbb{R}} \bigl|P(W_c \le z) - P(Z \le z)\bigr|$$
where $W_c = Y_c/\sigma_c$. Now applying the uniform bound in Lemma 6.10 completes the proof.
6.1.3 Doubly Indexed Permutation Statistics

In Sect. 4.4 we observed how the distribution of the permutation statistic $Y$ in (4.104), that is,
$$Y = \sum_{i=1}^{n} a_{i\pi(i)},$$
can be used to test whether there is an unusually high degree of similarity in a particular matching between the observations $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$. In particular, if, say, $d(x, y)$ is a function which reflects the similarity between $x$ and $y$ and $a_{ij} = d(x_i, y_j)$, one compares the 'overall similarity' score
$$y_\tau = \sum_{i=1}^{n} a_{i\tau(i)}$$
of the distinguished matching $\tau$ to the distribution of $Y$, that is, to this same similarity score for random matchings. In spatial or spatio-temporal association, two dimensional generalizations of the permutation test statistic $Y$ become of interest. In particular, if $a_{ij}$ and $b_{ij}$ are two different measures of closeness of $x_i$ and $y_j$, which may or may not be related, then the relevant null distribution is that of
$$W = \sum_{(i,j):\, i \ne j} a_{ij}\, b_{\pi(i),\pi(j)} \tag{6.105}$$
where the permutation $\pi$ is chosen uniformly from $S_n$; see, for instance, Moran (1948) and Geary (1954) for applications in geography, Knox (1964) and Mantel (1967) in epidemiology, as well as the book of Hubert (1987). Following some initial results which yield the asymptotic normality of $W$, see Barbour and Chen (2005a) for history and references, much less restrictive conditions were given in Barbour and Eagleson (1986). Theorem 6.4 of Barbour and Chen (2005a) provides a Berry–Esseen bound for this convergence; to state it we first need to introduce some notation. As the diagonal elements play no role, we may set $a_{ii} = b_{ii} = 0$. For such an array $\{a_{ij}\}_{i,j=1}^{n}$, let
$$A_0 = \frac{1}{n(n-1)} \sum_{(i,j):\, i \ne j} a_{ij}, \qquad A_{12} = n^{-1} \sum_{i=1}^{n} (a_i^*)^2,$$
$$A_{22} = \frac{1}{n(n-1)} \sum_{(i,j):\, i \ne j} \tilde{a}_{ij}^2 \qquad \text{and} \qquad A_{13} = \frac{1}{n} \sum_{i=1}^{n} |a_i^*|^3,$$
where
$$a_i^* = \frac{1}{n-2} \sum_{j:\, j \ne i} (a_{ij} - A_0) \qquad \text{and} \qquad \tilde{a}_{ij} = a_{ij} - a_i^* - a_j^* - A_0,$$
and let the analogous definitions hold for $\{b_{ij}\}$. In addition, let
$$\mu = \frac{1}{n(n-1)} \sum_{(i,j),(l,m):\, i \ne j,\, l \ne m} a_{ij} b_{lm} \qquad \text{and} \qquad \sigma^2 = \frac{4n^2(n-2)^2}{n-1} A_{12} B_{12}.$$

Theorem 6.4 For $W$ as given in (6.105) with $A$ and $B$ symmetric arrays, we have
$$\sup_{z \in \mathbb{R}} \bigl|P(W - \mu \le \sigma z) - \Phi(z)\bigr| \le (2 + c)\delta + 12\delta^2 + (1 + \sqrt{2})\tilde{\delta}_2,$$
where
$$\delta = 128 n^4 \sigma^{-3} A_{13} B_{13}, \qquad \tilde{\delta}_2^2 = \frac{(n-1)^3 A_{22} B_{22}}{2n(n-2)^2(n-3) A_{12} B_{12}}$$
and $c$ is the constant in Theorem 6.2.

It turns out that statistics such as $W$ can be expressed as a singly indexed permutation statistic upon which known bounds may be applied, plus a remainder term which may be handled using concentration inequalities and exploiting exchangeability, somewhat similar to the way that some non-linear statistics are handled in Chap. 10. The bounds of Theorem 6.4 compare favorably with those of Zhao et al. (1997).
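As an illustration of the statistic (6.105) and its null mean $\mu$, the following Python sketch evaluates $W$ for a given permutation and estimates a permutation-test p-value by Monte Carlo. The function names, the one-sided p-value convention, and the use of nested lists for the arrays are our own assumptions, made only for illustration; they are not taken from the references above.

```python
import itertools
import random


def doubly_indexed_statistic(a, b, perm):
    """W = sum over i != j of a[i][j] * b[perm[i]][perm[j]], as in (6.105)."""
    n = len(a)
    return sum(
        a[i][j] * b[perm[i]][perm[j]]
        for i in range(n)
        for j in range(n)
        if i != j
    )


def null_mean(a, b):
    """mu = (sum_{i != j} a_ij) * (sum_{l != m} b_lm) / (n (n - 1))."""
    n = len(a)
    sum_a = sum(a[i][j] for i in range(n) for j in range(n) if i != j)
    sum_b = sum(b[i][j] for i in range(n) for j in range(n) if i != j)
    return sum_a * sum_b / (n * (n - 1))


def monte_carlo_p_value(a, b, observed_perm, draws, rng):
    """Fraction of uniform random permutations whose W is at least the observed W."""
    n = len(a)
    observed = doubly_indexed_statistic(a, b, observed_perm)
    hits = 0
    for _ in range(draws):
        perm = list(range(n))
        rng.shuffle(perm)  # uniform over S_n
        if doubly_indexed_statistic(a, b, perm) >= observed:
            hits += 1
    return hits / draws
```

For small $n$ one can enumerate $S_n$ with `itertools.permutations` and check that the average of $W$ over all permutations agrees with `null_mean`; for larger $n$ the normal approximation of Theorem 6.4 replaces the Monte Carlo step.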
6.2 Patterns in Graphs and Permutations

In this section we will prove and apply corollaries of Theorem 5.6 to evaluate the quality of the normal approximation for various counts that arise in graphs and permutations, in particular, coloring patterns, local maxima, and the occurrence of subgraphs of finite random graphs, and for the number of occurrences of fixed, relatively ordered sub-sequences, such as rising sequences, of random permutations.

We explore the consequences of Theorem 5.6 under a local dependence condition on a collection of random variables $\mathbf{X} = \{X_\alpha, \alpha \in \mathcal{A}\}$, over some arbitrary, finite, index set $\mathcal{A}$. In particular, we consider situations where for every $\alpha \in \mathcal{A}$ there exists a dependency neighborhood $B_\alpha \subset \mathcal{A}$ of $X_\alpha$, containing $\alpha$, such that
$$X_\alpha \ \text{and} \ \{X_\beta : \beta \notin B_\alpha\} \ \text{are independent.} \tag{6.106}$$
First recalling the definition of size biasing in a coordinate direction given in (2.68) in Sect. 2.3.4, we begin with the following corollary of Theorem 5.6.
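Corollary 6.1 Let $\mathbf{X} = \{X_\alpha, \alpha \in \mathcal{A}\}$ be a finite collection of random variables with values in $[0, M]$ and let
$$Y = \sum_{\alpha \in \mathcal{A}} X_\alpha.$$
Let $\mu = \sum_{\alpha \in \mathcal{A}} EX_\alpha$ denote the mean of $Y$ and assume that the variance $\sigma^2 = \mathrm{Var}(Y)$ is positive and finite. Let
$$p_\alpha = EX_\alpha \Big/ \sum_{\beta \in \mathcal{A}} EX_\beta \quad \text{and} \quad p = \max_{\alpha \in \mathcal{A}} p_\alpha. \tag{6.107}$$
Next, for each $\alpha \in \mathcal{A}$ let $B_\alpha \subset \mathcal{A}$ be a dependency neighborhood of $X_\alpha$ such that (6.106) holds, and let
$$b = \max_{\alpha \in \mathcal{A}} |B_\alpha|. \tag{6.108}$$
For each $\alpha \in \mathcal{A}$, let $(\mathbf{X}, \mathbf{X}^\alpha)$ be a coupling of $\mathbf{X}$ to a collection of random variables $\mathbf{X}^\alpha$ having the $\mathbf{X}$-size biased distribution in direction $\alpha$ such that for some $\mathcal{F} \supset \sigma\{Y\}$ and $\mathcal{D} \subset \mathcal{A} \times \mathcal{A}$, if $(\alpha_1, \alpha_2) \notin \mathcal{D}$, then for all $(\beta_1, \beta_2) \in B_{\alpha_1} \times B_{\alpha_2}$
$$\mathrm{Cov}\bigl(E\bigl(X^{\alpha_1}_{\beta_1} - X_{\beta_1} \mid \mathcal{F}\bigr),\ E\bigl(X^{\alpha_2}_{\beta_2} - X_{\beta_2} \mid \mathcal{F}\bigr)\bigr) = 0. \tag{6.109}$$
Then with $W = (Y - \mu)/\sigma$,
$$\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{6\mu b^2 M^2}{\sigma^3} + \frac{2\mu p b M \sqrt{|\mathcal{D}|}}{\sigma^2}.$$

Proof In view of Theorem 5.6 and (5.21), it suffices to couple $Y^s$, with the $Y$-size biased distribution, to $Y$ such that
$$|Y^s - Y| \le bM \quad \text{and} \quad \Delta \le pbM\sqrt{|\mathcal{D}|}. \tag{6.110}$$
Assume without loss of generality that $EX_\alpha > 0$ for each $\alpha \in \mathcal{A}$. Note that for every $\alpha \in \mathcal{A}$ the distribution $dF(\mathbf{x})$ of $\mathbf{X}$ factors as
$$dF_\alpha(x_\alpha)\, dF_{B_\alpha^c|\alpha}(x_\beta, \beta \notin B_\alpha \mid x_\alpha) \times dF_{B_\alpha \setminus \{\alpha\}|\{\alpha\} \cup B_\alpha^c}\bigl(x_\gamma, \gamma \in B_\alpha \setminus \{\alpha\} \mid x_\alpha, x_\beta, \beta \notin B_\alpha\bigr),$$
which, by the independence condition (6.106), we may write as
$$dF_\alpha(x_\alpha)\, dF_{B_\alpha^c}(x_\beta, \beta \notin B_\alpha) \times dF_{B_\alpha \setminus \{\alpha\}|\{\alpha\} \cup B_\alpha^c}\bigl(x_\gamma, \gamma \in B_\alpha \setminus \{\alpha\} \mid x_\alpha, x_\beta, \beta \notin B_\alpha\bigr).$$
Hence, as in (2.73), the coordinate size biased distribution $dF^\alpha(\mathbf{x})$ may be factored as
$$dF^\alpha(\mathbf{x}) = dF^\alpha_\alpha(x_\alpha)\, dF_{B_\alpha^c}(x_\beta, \beta \notin B_\alpha) \times dF_{B_\alpha \setminus \{\alpha\}|\{\alpha\} \cup B_\alpha^c}\bigl(x_\gamma, \gamma \in B_\alpha \setminus \{\alpha\} \mid x_\alpha, x_\beta, \beta \notin B_\alpha\bigr),$$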
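where
$$dF^\alpha_\alpha(x_\alpha) = \frac{x_\alpha\, dF_\alpha(x_\alpha)}{EX_\alpha}. \tag{6.111}$$
Given a realization of $\mathbf{X}$, this factorization shows that we can construct $\mathbf{X}^\alpha$ by first choosing $X^\alpha_\alpha$ from the $X_\alpha$-size bias distribution (6.111), then the variables $X_\beta$ for $\beta \in B_\alpha^c$ according to their original distribution, and so in particular set
$$X^\alpha_\beta = X_\beta \quad \text{for all } \beta \in B_\alpha^c,$$
and finally the variables $X^\alpha_\beta$, $\beta \in B_\alpha \setminus \{\alpha\}$, using their original conditional distribution given the variables $\{X^\alpha_\alpha, X_\beta, \beta \in B_\alpha^c\}$. As the distribution of $\mathbf{X}^\alpha$ is absolutely continuous with respect to that of $\mathbf{X}$, we have $X^\alpha_\beta \in [0, M]$ for all $\alpha, \beta$, and therefore
$$\bigl|X^\alpha_\beta - X_\beta\bigr| \le M \quad \text{for all } \alpha, \beta \in \mathcal{A}. \tag{6.112}$$
By Proposition 2.2, $Y^s = \sum_{\beta \in \mathcal{A}} X^I_\beta$ has the $Y$-size biased distribution, where the random index $I$ has distribution $P(I = \alpha) = p_\alpha$ and is chosen independently of $\{(\mathbf{X}, \mathbf{X}^\alpha), \alpha \in \mathcal{A}\}$ and $\mathcal{F}$. In particular
$$Y^s - Y = \sum_{\beta \in B_I} \bigl(X^I_\beta - X_\beta\bigr), \tag{6.113}$$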
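yielding the first inequality in (6.110). Recalling the definition of $\Delta$ in (5.21), since $\sigma\{Y\} \subset \mathcal{F}$, by (4.143),
$$\Delta^2 = \mathrm{Var}\bigl(E(Y^s - Y \mid Y)\bigr) \le \mathrm{Var}\bigl(E(Y^s - Y \mid \mathcal{F})\bigr).$$
Taking conditional expectation with respect to $\mathcal{F}$ in (6.113) yields
$$E(Y^s - Y \mid \mathcal{F}) = \sum_{\alpha \in \mathcal{A}} p_\alpha \sum_{\beta \in B_\alpha} E\bigl(X^\alpha_\beta - X_\beta \mid \mathcal{F}\bigr),$$
and therefore,
$$\mathrm{Var}\bigl(E(Y^s - Y \mid \mathcal{F})\bigr) = \sum_{(\alpha_1,\alpha_2) \in \mathcal{A} \times \mathcal{A}}\ \sum_{(\beta_1,\beta_2) \in B_{\alpha_1} \times B_{\alpha_2}} p_{\alpha_1} p_{\alpha_2} \mathrm{Cov}\bigl(E\bigl(X^{\alpha_1}_{\beta_1} - X_{\beta_1} \mid \mathcal{F}\bigr),\ E\bigl(X^{\alpha_2}_{\beta_2} - X_{\beta_2} \mid \mathcal{F}\bigr)\bigr)$$
$$= \sum_{(\alpha_1,\alpha_2) \in \mathcal{D}}\ \sum_{(\beta_1,\beta_2) \in B_{\alpha_1} \times B_{\alpha_2}} p_{\alpha_1} p_{\alpha_2} \mathrm{Cov}\bigl(E\bigl(X^{\alpha_1}_{\beta_1} - X_{\beta_1} \mid \mathcal{F}\bigr),\ E\bigl(X^{\alpha_2}_{\beta_2} - X_{\beta_2} \mid \mathcal{F}\bigr)\bigr),$$
where we have applied (6.109) to obtain the last equality. By (6.112), the covariances are bounded by $M^2$, hence
$$\Delta^2 \le \mathrm{Var}\bigl(E(Y^s - Y \mid \mathcal{F})\bigr) \le M^2 \sum_{(\alpha_1,\alpha_2) \in \mathcal{D}}\ \sum_{(\beta_1,\beta_2) \in B_{\alpha_1} \times B_{\alpha_2}} p_{\alpha_1} p_{\alpha_2} = M^2 \sum_{(\alpha_1,\alpha_2) \in \mathcal{D}} p_{\alpha_1} p_{\alpha_2} |B_{\alpha_1}||B_{\alpha_2}| \le M^2 \sum_{(\alpha_1,\alpha_2) \in \mathcal{D}} p^2 b^2 = p^2 b^2 M^2 |\mathcal{D}|,$$
by (6.107) and (6.108), thus yielding the second inequality in (6.110).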
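The bound of Corollary 6.1, $6\mu b^2 M^2/\sigma^3 + 2\mu p b M\sqrt{|\mathcal{D}|}/\sigma^2$, is straightforward to evaluate once $\mu$, $\sigma^2$, $b$, $M$, $p$ and $|\mathcal{D}|$ are known. The following Python sketch does so; the function names are our own, and the inputs used in `illustrative_bound` are hypothetical values chosen only so that $\mu/\sigma^2$, $b$ and $M$ stay bounded while $p$ is of order $1/n$ and $|\mathcal{D}|$ is of order $n$.

```python
import math


def corollary_6_1_bound(mu, sigma2, b, M, p, size_D):
    """Evaluate 6*mu*b^2*M^2/sigma^3 + 2*mu*p*b*M*sqrt(|D|)/sigma^2."""
    sigma = math.sqrt(sigma2)
    first = 6.0 * mu * b**2 * M**2 / sigma**3
    second = 2.0 * mu * p * b * M * math.sqrt(size_D) / sigma2
    return first + second


def illustrative_bound(n):
    """Hypothetical inputs (not from the text): mu = sigma^2 = n/4,
    b = 3, M = 1, p = 1/n and |D| = 7n."""
    return corollary_6_1_bound(
        mu=n / 4, sigma2=n / 4, b=3, M=1, p=1 / n, size_D=7 * n
    )
```

With these inputs both terms scale like $n^{-1/2}$, so that `illustrative_bound(n) * math.sqrt(n)` does not depend on `n`. The constants are large, so a bound of this form becomes informative only for correspondingly large problems.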
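Though Corollary 6.1 provides bounds for finite problems, asymptotically, when the mean and variance of $Y$ grow such that $\mu/\sigma^2$ is bounded, and when $b$ and $M$ stay bounded, then the first term in the bound of the corollary is of order $1/\sigma$. Additionally, if the $X_\alpha$ have comparable expectations, so that $p$ is of order $1/|\mathcal{A}|$, and if the 'dependence diagonal' $\mathcal{D} \subset \mathcal{A} \times \mathcal{A}$ has size comparable to that of $\mathcal{A}$, then the second term will also be of order $1/\sigma$.

We next specialize to the case where the summand variables $\{X_\alpha, \alpha \in \mathcal{A}\}$ are functions of independent random variables.

Corollary 6.2 With $\mathcal{G}$ and $\mathcal{A}$ index sets, let $\{C_g, g \in \mathcal{G}\}$ be a collection of independent random elements taking values in an arbitrary set $\mathcal{C}$, let $\{\mathcal{G}_\alpha, \alpha \in \mathcal{A}\}$ be a finite collection of subsets of $\mathcal{G}$, and, for $\alpha \in \mathcal{A}$, let $X_\alpha = X_\alpha(C_g : g \in \mathcal{G}_\alpha)$ be a real valued function of the variables $\{C_g, g \in \mathcal{G}_\alpha\}$, taking values in $[0, M]$. Then for $Y = \sum_\alpha X_\alpha$ with mean $\mu$ and finite, positive variance $\sigma^2$, the variable $W = (Y - \mu)/\sigma$ satisfies
$$\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{6\mu b^2 M^2}{\sigma^3} + \frac{2\mu p b M \sqrt{|\mathcal{D}|}}{\sigma^2},$$
where $p$ and $b$ are given in (6.107) and (6.108), respectively, for any
$$B_\alpha \supset \{\beta \in \mathcal{A} : \mathcal{G}_\beta \cap \mathcal{G}_\alpha \ne \emptyset\}, \tag{6.114}$$
and any $\mathcal{D}$ for which
$$\mathcal{D} \supset \bigl\{(\alpha_1, \alpha_2) : \text{there exists } (\beta_1, \beta_2) \in B_{\alpha_1} \times B_{\alpha_2} \text{ with } \mathcal{G}_{\beta_1} \cap \mathcal{G}_{\beta_2} \ne \emptyset\bigr\}. \tag{6.115}$$

Proof We apply Corollary 6.1. Since $X_\alpha$ and $X_\beta$ are functions of disjoint sets of independent variables whenever $\mathcal{G}_\alpha \cap \mathcal{G}_\beta = \emptyset$, the independence condition (6.106) holds when the dependency neighborhoods satisfy (6.114). To verify the remaining conditions of Corollary 6.1, for each $\alpha \in \mathcal{A}$ we consider the following coupling of $\mathbf{X}$ and $\mathbf{X}^\alpha$. We may assume without loss of generality that $EX_\alpha > 0$. Given $\{C_g, g \in \mathcal{G}\}$ upon which $\mathbf{X}$ depends, for every $\alpha \in \mathcal{A}$ let $\{C_g^{(\alpha)}, g \in \mathcal{G}_\alpha\}$ be independent of $\{C_g, g \in \mathcal{G}\}$ and have distribution
$$dF^\alpha(c_g, g \in \mathcal{G}_\alpha) = \frac{X_\alpha(c_g, g \in \mathcal{G}_\alpha)}{EX_\alpha(C_g, g \in \mathcal{G}_\alpha)}\, dF(c_g, g \in \mathcal{G}_\alpha),$$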
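so that the random variables $\{C_g^{(\alpha)}, g \in \mathcal{G}_\alpha\} \cup \{C_g, g \notin \mathcal{G}_\alpha\}$ have distribution $dF^\alpha(c_g, g \in \mathcal{G}_\alpha)\, dF(c_g, g \notin \mathcal{G}_\alpha)$. Thus, letting $\mathbf{X}^\alpha$ have coordinates given by
$$X^\alpha_\beta = X_\beta\bigl(C_g, g \in \mathcal{G}_\beta \cap \mathcal{G}_\alpha^c,\ C_g^{(\alpha)}, g \in \mathcal{G}_\beta \cap \mathcal{G}_\alpha\bigr), \quad \beta \in \mathcal{A},$$
for any bounded continuous function $f$ we find
$$EX_\alpha f(\mathbf{X}) = \int x_\alpha f(\mathbf{x})\, dF(c_g, g \in \mathcal{G}) = EX_\alpha \int f(\mathbf{x})\, \frac{x_\alpha\, dF(c_g, g \in \mathcal{G}_\alpha)}{EX_\alpha(C_g, g \in \mathcal{G}_\alpha)}\, dF(c_g, g \notin \mathcal{G}_\alpha)$$
$$= EX_\alpha \int f(\mathbf{x})\, dF^\alpha(c_g, g \in \mathcal{G}_\alpha)\, dF(c_g, g \notin \mathcal{G}_\alpha) = EX_\alpha\, Ef(\mathbf{X}^\alpha).$$
That is, $\mathbf{X}^\alpha$ has the $\mathbf{X}$ distribution biased in direction $\alpha$, as defined in (2.68).

Lastly, taking $\mathcal{F} = \{C_g : g \in \mathcal{G}\}$, so that $Y$ is $\mathcal{F}$ measurable, we verify (6.109). Since $X^\alpha_\beta$ and $\{C_g, g \in \mathcal{G}_\beta\}$ are independent of $\{C_g, g \notin \mathcal{G}_\beta\}$,
$$E\bigl(X^\alpha_\beta \mid \mathcal{F}\bigr) = E\bigl(X^\alpha_\beta \mid C_g, g \in \mathcal{G}_\beta,\ C_g, g \notin \mathcal{G}_\beta\bigr) = E\bigl(X^\alpha_\beta \mid C_g, g \in \mathcal{G}_\beta\bigr),$$
and, since $E(X_\beta \mid \mathcal{F}) = X_\beta = E(X_\beta \mid C_g, g \in \mathcal{G}_\beta)$, the difference $E(X^\alpha_\beta - X_\beta \mid \mathcal{F})$ is a function of $\{C_g, g \in \mathcal{G}_\beta\}$ only. By choice of $\mathcal{D}$, if $(\alpha_1, \alpha_2) \notin \mathcal{D}$ then for all $\beta_1 \in B_{\alpha_1}$ and $\beta_2 \in B_{\alpha_2}$ we have $\mathcal{G}_{\beta_1} \cap \mathcal{G}_{\beta_2} = \emptyset$, and so $E(X^{\alpha_1}_{\beta_1} - X_{\beta_1} \mid \mathcal{F})$ and $E(X^{\alpha_2}_{\beta_2} - X_{\beta_2} \mid \mathcal{F})$ are independent, yielding (6.109). The verification of the conditions of Corollary 6.1 is now complete.

With the exception of Example 6.2, in the remainder of this section we consider graphs $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ having random elements $\{C_g\}_{g \in \mathcal{V} \cup \mathcal{E}}$ assigned to their vertices $\mathcal{V}$ and edges $\mathcal{E}$, and applications of Corollary 6.2 to the sum $Y = \sum_{\alpha \in \mathcal{A}} X_\alpha$ of bounded functions $X_\alpha = X_\alpha(C_g, g \in \mathcal{V}_\alpha \cup \mathcal{E}_\alpha)$, where $\mathcal{G}_\alpha = (\mathcal{V}_\alpha, \mathcal{E}_\alpha)$, $\alpha \in \mathcal{A}$, is a given finite family of subgraphs of $\mathcal{G}$. We abuse notation slightly in that a graph $\mathcal{G}$ is replaced by $\mathcal{V} \cup \mathcal{E}$ when used as an index set for the underlying variables $C_g$. When applying Corollary 6.2 in this setting, in (6.114) and (6.115) the intersection of the two graphs $(\mathcal{V}_1, \mathcal{E}_1)$ and $(\mathcal{V}_2, \mathcal{E}_2)$ is the graph $(\mathcal{V}_1 \cap \mathcal{V}_2, \mathcal{E}_1 \cap \mathcal{E}_2)$.

Given a metric $d$ on $\mathcal{V}$, for every $v \in \mathcal{V}$ and $r \ge 0$ we can consider the restriction $\mathcal{G}_{v,r}$ of $\mathcal{G}$ to the vertices at most a distance $r$ from $v$, that is, the graph with vertex and edge sets
$$\mathcal{V}_{v,r} = \bigl\{w \in \mathcal{V} : d(v, w) \le r\bigr\} \quad \text{and} \quad \mathcal{E}_{v,r} = \bigl\{\{w, u\} \in \mathcal{E} : w, u \in \mathcal{V}_{v,r}\bigr\} \tag{6.116}$$
respectively. We say that a graph $\mathcal{G}$ is distance $r$-regular if $\mathcal{G}_{v,r}$ is isomorphic to some graph $(\mathcal{V}_r, \mathcal{E}_r)$ for all $v$. This notion of distance $r$-regular is related to, but not the same as, the notion of a distance-regular graph as given in Biggs (1993) and Brouwer et al. (1989). A graph of constant degree with no cliques of size 3 is distance 1-regular.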
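When $\mathcal{V}_\alpha$, $\alpha \in \mathcal{V}$, is given by (6.116) for some fixed $r$, regarding the choice of the dependency neighborhoods $B_\alpha$, $\alpha \in \mathcal{A}$, we note that if $d(\alpha_1, \alpha_2) > 2r$ and $(\beta_1, \beta_2) \in \mathcal{V}_{\alpha_1} \times \mathcal{V}_{\alpha_2}$, then rearranging yields
$$2r < d(\alpha_1, \alpha_2) \le d(\alpha_1, \beta_1) + d(\beta_1, \beta_2) + d(\beta_2, \alpha_2),$$
and using that $d(\alpha_i, \beta_i) \le r$ implies $d(\beta_1, \beta_2) > 0$, hence
$$d(\alpha_1, \alpha_2) > 2r \quad \text{implies} \quad \mathcal{V}_{\alpha_1} \cap \mathcal{V}_{\alpha_2} = \emptyset. \tag{6.117}$$
Natural families of graphs in $\mathbb{R}^p$ can be generated using the vertex set $\mathcal{V} = \{1, \ldots, n\}^p$ with componentwise addition modulo $n$, and $d(\alpha, \beta)$ given by e.g. some $L^p$ distance between $\alpha$ and $\beta$.

We apply the following result when the subgraphs are indexed by some subset of the vertices only, in which case we take $\mathcal{A} \subset \mathcal{V}$.

Corollary 6.3 Let $\mathcal{G}$ be a finite graph with a family of isomorphic subgraphs $\{\mathcal{G}_\alpha, \alpha \in \mathcal{A}\}$ for some $\mathcal{A} \subset \mathcal{V}$, let $d$ be a metric on $\mathcal{A}$, and set
$$\rho = \min\bigl\{\varrho : d(\alpha, \beta) > \varrho \text{ implies } \mathcal{V}_\alpha \cap \mathcal{V}_\beta = \emptyset\bigr\}. \tag{6.118}$$
For each $\alpha \in \mathcal{A}$, let $X_\alpha$ be given by $X_\alpha = X(C_g, g \in \mathcal{G}_\alpha)$ for a fixed function $X$ taking values in $[0, M]$, and let $\{C_g, g \in \mathcal{G}\}$ be a collection of independent variables such that the distribution of $\{C_g : g \in \mathcal{G}_\alpha\}$ is the same for all $\alpha \in \mathcal{A}$. If $\mathcal{G}$ is a distance-$3\rho$-regular graph, then with $Y = \sum_{\alpha \in \mathcal{A}} X_\alpha$ having mean $\mu$ and finite, positive variance $\sigma^2$, the variable $W = (Y - \mu)/\sigma$ satisfies
$$\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{6\mu V(\rho)^2 M^2}{\sigma^3} + \frac{2\mu V(\rho) M}{\sigma^2 |\mathcal{A}|^{1/2}} \sqrt{V(3\rho)},$$
where
$$V(r) = |\mathcal{V}_r|. \tag{6.119}$$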
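Proof We verify that conditions (6.114) and (6.115) of Corollary 6.2 are satisfied with
$$B_\alpha = \bigl\{\beta : d(\alpha, \beta) \le \rho\bigr\} \quad \text{and} \quad \mathcal{D} = \bigl\{(\alpha_1, \alpha_2) : d(\alpha_1, \alpha_2) \le 3\rho\bigr\}. \tag{6.120}$$
First note that to show the intersection of two graphs is empty it suffices to show that the vertex sets of the graphs do not intersect. Since for any $\alpha \in \mathcal{A}$, by (6.118),
$$B_\alpha^c = \bigl\{\beta : d(\beta, \alpha) > \rho\bigr\} \subset \{\beta : \mathcal{V}_\beta \cap \mathcal{V}_\alpha = \emptyset\},$$
we see that condition (6.114) is satisfied. To verify (6.115), note that rearranging
$$d(\alpha_1, \alpha_2) \le d(\alpha_1, \beta_1) + d(\beta_1, \beta_2) + d(\beta_2, \alpha_2)$$
gives, for $(\alpha_1, \alpha_2) \notin \mathcal{D}$ and $(\beta_1, \beta_2) \in B_{\alpha_1} \times B_{\alpha_2}$,
$$d(\beta_1, \beta_2) \ge d(\alpha_1, \alpha_2) - \bigl(d(\alpha_1, \beta_1) + d(\alpha_2, \beta_2)\bigr) \ge d(\alpha_1, \alpha_2) - 2\rho > \rho,$$
and hence $\mathcal{V}_{\beta_1} \cap \mathcal{V}_{\beta_2} = \emptyset$. As $EX_\alpha$ is constant we have $p = \max_\alpha p_\alpha = 1/|\mathcal{A}|$, and in addition, that
$$b = \max_{\alpha \in \mathcal{A}} |B_\alpha| = V(\rho) \quad \text{and} \quad |\mathcal{D}| = |\mathcal{A}| V(3\rho).$$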
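Substituting these quantities into the bound of Corollary 6.2 now yields the result.

Example 6.1 (Sliding m-window) For $n \ge m \ge 1$, let $\mathcal{A} = \mathcal{V} = \{1, \ldots, n\}$ with addition modulo $n$, $\{C_g : g \in \mathcal{G}\}$ i.i.d. real valued random variables, and for each $\alpha \in \mathcal{A}$ set $\mathcal{G}_\alpha = (\mathcal{V}_\alpha, \mathcal{E}_\alpha)$ where
$$\mathcal{V}_\alpha = \{v \in \mathcal{V} : \alpha \le v \le \alpha + m - 1\} \quad \text{and} \quad \mathcal{E}_\alpha = \emptyset. \tag{6.121}$$
Then for $X : \mathbb{R}^m \to [0, 1]$, Corollary 6.3 may be applied to the sum $Y = \sum_{\alpha \in \mathcal{A}} X_\alpha$ of the $m$-dependent sequence $X_\alpha = X(C_\alpha, \ldots, C_{\alpha+m-1})$, formed by applying the function $X$ to the variables in the '$m$-window' $\mathcal{V}_\alpha$. In this example, taking $d(\alpha, \beta) = |\alpha - \beta|$, the bound of Corollary 6.3 obtains with $\rho = m - 1$ by (6.118) and $V(r) \le 2r + 1$ by (6.119).

In Example 6.2 the underlying variables are not independent, so we turn to Corollary 6.1.

Example 6.2 (Relatively ordered sub-sequences of a random permutation) For $n \ge m \ge 1$, let $\mathcal{V}$ and $(\mathcal{G}_\alpha, \mathcal{V}_\alpha)$, $\alpha \in \mathcal{V}$, be as specified in (6.121). For $\pi$ and $\tau$ permutations of $\{1, \ldots, n\}$ and $\{1, \ldots, m\}$, respectively, we say the pattern $\tau$ appears at location $\alpha$ if the values $\{\pi(v)\}_{v \in \mathcal{V}_\alpha}$ and $\{\tau(v)\}_{v \in \mathcal{V}_1}$ are in the same relative order. Equivalently, the pattern $\tau$ appears at $\alpha$ if and only if $\pi(\tau^{-1}(v) + \alpha - 1)$, $v \in \mathcal{V}_1$, is an increasing sequence. Letting $\pi$ be chosen uniformly from all permutations of $\{1, \ldots, n\}$, and setting $X_\alpha$ to be the indicator that $\tau$ appears at $\alpha$, we may write
$$X_\alpha\bigl(\pi(v), v \in \mathcal{V}_\alpha\bigr) = \mathbf{1}\bigl(\pi(\tau^{-1}(1) + \alpha - 1) < \cdots < \pi(\tau^{-1}(m) + \alpha - 1)\bigr),$$
and the sum $Y = \sum_{\alpha \in \mathcal{V}} X_\alpha$ counts the number of $m$-element-long segments of $\pi$ that have the same relative order as $\tau$.

For $\alpha \in \mathcal{V}$ we may generate $\mathbf{X}^\alpha = \{X^\alpha_\beta, \beta \in \mathcal{V}\}$ with the $\mathbf{X} = \{X_\beta, \beta \in \mathcal{V}\}$ distribution biased in direction $\alpha$ as follows. Let $\sigma_\alpha$ be the permutation of $\{1, \ldots, m\}$ for which
$$\pi(\sigma_\alpha(1) + \alpha - 1) < \cdots < \pi(\sigma_\alpha(m) + \alpha - 1),$$
and set
$$\pi^\alpha(v) = \begin{cases} \pi\bigl(\sigma_\alpha(\tau(v - \alpha + 1)) + \alpha - 1\bigr) & v \in \mathcal{V}_\alpha, \\ \pi(v) & v \notin \mathcal{V}_\alpha. \end{cases}$$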
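In other words $\pi^\alpha$ is the permutation $\pi$ with values $\pi(v)$, $v \in \mathcal{V}_\alpha$, reordered so that the values of $\pi^\alpha(\gamma)$ for $\gamma \in \mathcal{V}_\alpha$ are in the same relative order as $\tau$. Now let
$$X^\alpha_\beta = X_\beta\bigl(\pi^\alpha(v), v \in \mathcal{V}_\beta\bigr),$$
the indicator that $\tau$ appears at position $\beta$ in the reordered permutation $\pi^\alpha$. Since the relative orders of non-overlapping segments of the values of $\pi$ are independent, (6.106) holds for $X_\alpha$, $\alpha \in \mathcal{V}$, with
$$B_\alpha = \bigl\{\beta : |\beta - \alpha| \le m - 1\bigr\}.$$
Next, note that with $\mathcal{F} = \sigma\{\pi\}$, for $\beta \in B_\alpha$ the random variables $E(X^\alpha_\beta \mid \mathcal{F})$ and $X_\beta$ depend only on the relative order of $\pi(v)$ for $v \in \bigcup_{\beta \in B_\alpha} \mathcal{V}_\beta$. Since
$$\Bigl(\bigcup_{\beta_1 \in B_{\alpha_1}} \mathcal{V}_{\beta_1}\Bigr) \cap \Bigl(\bigcup_{\beta_2 \in B_{\alpha_2}} \mathcal{V}_{\beta_2}\Bigr) = \emptyset \quad \text{when } |\alpha_1 - \alpha_2| > 3(m - 1),$$
for such $\alpha_1, \alpha_2$, and $(\beta_1, \beta_2) \in B_{\alpha_1} \times B_{\alpha_2}$, the variables $E(X^{\alpha_1}_{\beta_1} \mid \mathcal{F}) - X_{\beta_1}$ and $E(X^{\alpha_2}_{\beta_2} \mid \mathcal{F}) - X_{\beta_2}$ are independent. Hence (6.109) holds with
$$\mathcal{D} = \bigl\{(\alpha_1, \alpha_2) : |\alpha_1 - \alpha_2| \le 3(m - 1)\bigr\},$$
and Corollary 6.1 gives bounds of the same form as for Example 6.1.

When $\tau = \iota_m$, the identity permutation of length $m$, we say that $\pi$ has a rising sequence of length $m$ at position $\alpha$ if $X_\alpha = 1$. Rising sequences were studied in Bayer and Diaconis (1992) in connection with card tricks and card shuffling. Due to the regular-self-overlap property of rising sequences, namely that a non-empty intersection of two rising sequences is again a rising sequence, some improvement on the constant in the bound can be obtained by a more careful consideration of the conditional variance.

Example 6.3 (Coloring patterns and subgraph occurrences on a finite graph $\mathcal{G}$) With $n, p \in \mathbb{N}$, let $\mathcal{V} = \mathcal{A} = \{1, \ldots, n\}^p$, again with addition modulo $n$, and for $\alpha, \beta \in \mathcal{V}$ let $d(\alpha, \beta) = \|\alpha - \beta\|$ where $\|\cdot\|$ denotes the supremum norm. Further, let $\mathcal{E} = \{\{w, v\} : d(w, v) = 1\}$, and, for each $\alpha \in \mathcal{A}$, let $\mathcal{G}_\alpha = (\mathcal{V}_\alpha, \mathcal{E}_\alpha)$ where
$$\mathcal{V}_\alpha = \bigl\{v : d(v, \alpha) \le 1\bigr\} \quad \text{and} \quad \mathcal{E}_\alpha = \bigl\{\{v, w\} : v, w \in \mathcal{V}_\alpha,\ d(w, v) = 1\bigr\}.$$
Let $\mathcal{C}$ be a set (of e.g. colors) from which is formed a given pattern $\{c_g : g \in \mathcal{G}_0\}$, let $\{C_g, g \in \mathcal{G}\}$ be independent variables in $\mathcal{C}$ with $\{C_g : g \in \mathcal{G}_\alpha\}_{\alpha \in \mathcal{A}}$ identically distributed, and let
$$X(C_g, g \in \mathcal{G}_0) = \prod_{g \in \mathcal{G}_0} \mathbf{1}(C_g = c_g), \tag{6.122}$$
and $X_\alpha = X(C_g, g \in \mathcal{G}_\alpha)$. Then $Y = \sum_{\alpha \in \mathcal{A}} X_\alpha$ counts the number of times the pattern appears in the subgraphs $\mathcal{G}_\alpha$. Taking $\rho = 2$ by (6.117), the conclusion of Corollary 6.3 holds with $M = 1$, $V(r) = (2r + 1)^p$ and $|\mathcal{A}| = n^p$.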
Such multi-dimensional pattern occurrences are a generalization of the well-studied case in which one-dimensional sequences are scanned for pattern occurrences; see, for instance, Glaz et al. (2001) and Naus (1982) for scan and window statistics, see Huang (2002) for applications of the normal approximation in this context to molecular sequence data, and see also Darling and Waterman (1985, 1986), where higher-dimensional extensions are considered.

Occurrences of subgraphs can be handled as a special case. For example, with $(\mathcal{V}, \mathcal{E})$ the graph above, let $G$ be the random subgraph with vertex set $\mathcal{V}$ and random edge set $\{e \in \mathcal{E} : C_e = 1\}$ where $\{C_e\}_{e \in \mathcal{E}}$ are independent and identically distributed Bernoulli variables. Then letting the function $X(C_g, g \in \mathcal{G}_0)$ in (6.122) be the indicator of the occurrence of a distinguished subgraph of $\mathcal{G}_0$, the sum $Y = \sum_{\alpha \in \mathcal{A}} X_\alpha$ counts the number of times that copies of the subgraph appear in the random graph $G$; the same bounds hold as above.

Example 6.4 (Local extremes) For a given graph $\mathcal{G}$, let $\mathcal{G}_\alpha$, $\alpha \in \mathcal{A}$, be a collection of subgraphs of $\mathcal{G}$ isomorphic to some subgraph $\mathcal{G}_0$ of $\mathcal{G}$, and let $v \in \mathcal{V}_0$ be a distinguished vertex in $\mathcal{G}_0$. Let $\{C_g, g \in \mathcal{V}\}$ be a collection of independent and identically distributed random variables, and let $X_\alpha = X(C_\beta, \beta \in \mathcal{V}_\alpha)$ where
$$X(C_\beta, \beta \in \mathcal{V}_0) = \mathbf{1}(C_v \ge C_\beta,\ \beta \in \mathcal{V}_0).$$
Then the sum $Y = \sum_{\alpha \in \mathcal{A}} X_\alpha$ counts the number of times the vertex in $\mathcal{G}_\alpha$, the one corresponding under the isomorphism to the distinguished vertex $v \in \mathcal{V}_0$, is a local maximum. Corollary 6.3 holds with $M = 1$; the other quantities determining the bound are dependent on the structure of $\mathcal{G}$.

Consider, for example, the hypercube $\mathcal{V} = \{0, 1\}^n$ and $\mathcal{E} = \{\{v, w\} : \|v - w\| = 1\}$, where $\|\cdot\|$ is the Hamming distance (see also Baldi et al. 1989 and Baldi and Rinott 1989). Let $v = 0$ be the distinguished vertex, $\mathcal{A} = \mathcal{V}$, and, for each $\alpha \in \mathcal{A}$, let $\mathcal{V}_\alpha = \{\beta : \|\beta - \alpha\| \le 1\}$ and $\mathcal{E}_\alpha = \{\{v, w\} : v, w \in \mathcal{V}_\alpha,\ \|v - w\| = 1\}$. Corollary 6.3 applies with $\rho = 2$ by (6.117), $V(r) = \sum_{j=0}^{r} \binom{n}{j}$, and $|\mathcal{A}| = 2^n$.
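The hypercube case of Example 6.4 is easy to explore numerically. The following Python sketch computes the ball size $V(r) = \sum_{j=0}^{r}\binom{n}{j}$ and counts local maxima for one draw of i.i.d. values on $\{0,1\}^n$. Encoding vertices as the integers $0, \ldots, 2^n - 1$ and taking the $C_g$ uniform on $[0,1]$ are our own illustrative choices; for continuous values each vertex is a local maximum with probability $1/(n+1)$, so that $\mu = 2^n/(n+1)$.

```python
import math
import random


def ball_size(n, r):
    """V(r): number of vertices of {0,1}^n within Hamming distance r of a vertex."""
    return sum(math.comb(n, j) for j in range(r + 1))


def count_local_maxima(n, rng):
    """Draw i.i.d. uniform values on {0,1}^n and count the vertices whose
    value is at least that of each of their n neighbours."""
    values = [rng.random() for _ in range(2**n)]
    count = 0
    for v in range(2**n):
        # Flipping bit k of v gives the neighbour of v at Hamming distance 1.
        if all(values[v] >= values[v ^ (1 << k)] for k in range(n)):
            count += 1
    return count
```

Averaging `count_local_maxima(n, rng)` over many draws gives an estimate of $\mu$, and `ball_size(n, 2)` and `ball_size(n, 6)` supply $V(\rho)$ and $V(3\rho)$ for the bound of Corollary 6.3 with $\rho = 2$.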
6.3 The Lightbulb Process

The following problem arises from a study in the pharmaceutical industry on the effects of dermal patches designed to activate targeted receptors. An active receptor will become inactive, and an inactive one active, if it receives a dose of medicine released from the dermal patch. Let the number of receptors, all initially inactive, be denoted by $n$. On study day $i$ over a period of $n$ days, exactly $i$ randomly selected receptors each will receive one dose of medicine, thus changing their status between inactive and active.

The problem has the following, somewhat more colorful, though equivalent, formulation. Consider $n$ toggle switches, each being connected to a lightbulb. Pressing the toggle switch connected to a bulb changes its status from off to on and vice versa. At each stage $i = 1, \ldots, n$, exactly $i$ of the $n$ switches are randomly pressed.

Interest centers on the random variable $Y$, which records the number of lightbulbs that are on at the terminal time $n$. The problem of determining the properties of $Y$ was first considered in Rao et al. (2007) where the following expressions for the mean $\mu = EY$ and variance $\sigma^2 = \mathrm{Var}(Y)$ were derived,
$$\mu = \frac{n}{2}\left[1 - \prod_{i=1}^{n}\left(1 - \frac{2i}{n}\right)\right], \tag{6.123}$$
and
$$\sigma^2 = \frac{n}{4}\left[1 - \prod_{i=1}^{n}\left(1 - \frac{4i}{n} + \frac{4i(i-1)}{n(n-1)}\right)\right] + \frac{n^2}{4}\left[\prod_{i=1}^{n}\left(1 - \frac{4i}{n} + \frac{4i(i-1)}{n(n-1)}\right) - \prod_{i=1}^{n}\left(1 - \frac{2i}{n}\right)^2\right]. \tag{6.124}$$
Other results, for instance, recursions for determining the exact finite sample distribution of $Y$, are derived in Rao et al. (2007). In addition, approximations to the distribution of $Y$, including by the normal, are also considered there, though the question of the asymptotic normality of $Y$ was left open. Note that when $n$ is even then $\mu = n/2$ exactly, as the product in (6.123), containing the term $i = n/2$, is zero. By results in Rao et al. (2007), in the odd case $\mu = (n/2)(1 + O(e^{-n}))$, and in both the even and odd cases $\sigma^2 = (n/4)(1 + O(e^{-n}))$.

The following theorem of Goldstein and Zhang (2010) provides a bound to the normal which holds for all finite $n$, and which tends to zero as $n$ tends to infinity at the rate $n^{-1/2}$, thus showing the asymptotic distribution of $Y$ is normal as $n \to \infty$. Though the results of Goldstein and Zhang (2010) provide a bound no matter the parity of $n$, for simplicity we only consider the case where $n$ is even.

Theorem 6.5 With $Y$ the number of bulbs on at the terminal time $n$ and $W = (Y - \mu)/\sigma$ where $\mu = n/2$ and $\sigma^2$ is given by (6.124), for all $n$ even
$$\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{n}{2\sigma^2}\Delta + 1.64\frac{n}{\sigma^3} + \frac{2}{\sigma}$$
where
$$\Delta \le \frac{1}{2\sqrt{n}} + \frac{1}{2n} + e^{-n/2} \quad \text{for } n \ge 6. \tag{6.125}$$
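The mean (6.123), $\mu = \frac{n}{2}[1 - \prod_{i}(1 - 2i/n)]$, the variance (6.124), and the bound of Theorem 6.5 with $\Delta$ replaced by the right hand side of (6.125) can all be evaluated directly. The Python sketch below does so, and also simulates the process as informally described above; the function names are our own, and the simulation is included only as a sanity check on the formulas, not as part of the argument of Goldstein and Zhang (2010).

```python
import math
import random


def lightbulb_mean(n):
    """mu = (n/2) * [1 - prod_{i=1}^n (1 - 2i/n)], as in (6.123)."""
    prod = 1.0
    for i in range(1, n + 1):
        prod *= 1.0 - 2.0 * i / n
    return n / 2.0 * (1.0 - prod)


def lightbulb_variance(n):
    """sigma^2 as in (6.124); the formula requires n >= 2."""
    lam1 = 1.0  # prod_i (1 - 2i/n)
    lam2 = 1.0  # prod_i (1 - 4i/n + 4i(i-1)/(n(n-1)))
    for i in range(1, n + 1):
        lam1 *= 1.0 - 2.0 * i / n
        lam2 *= 1.0 - 4.0 * i / n + 4.0 * i * (i - 1) / (n * (n - 1))
    return n / 4.0 * (1.0 - lam2) + n * n / 4.0 * (lam2 - lam1**2)


def theorem_6_5_bound(n):
    """n*Delta/(2 sigma^2) + 1.64 n/sigma^3 + 2/sigma, with Delta replaced
    by its upper bound (6.125); stated for n even and n >= 6."""
    if n % 2 != 0 or n < 6:
        raise ValueError("n must be even and at least 6")
    sigma2 = lightbulb_variance(n)
    sigma = math.sqrt(sigma2)
    delta = 1 / (2 * math.sqrt(n)) + 1 / (2 * n) + math.exp(-n / 2)
    return n / (2 * sigma2) * delta + 1.64 * n / sigma**3 + 2 / sigma


def simulate_lightbulb(n, rng):
    """Run the n stages once and return the number of bulbs on at the end."""
    state = [0] * n
    for r in range(1, n + 1):
        for k in rng.sample(range(n), r):  # stage r toggles r distinct bulbs
            state[k] ^= 1
    return sum(state)
```

Since $\sigma^2$ is close to $n/4$, the three terms of the bound behave like $1/\sqrt{n}$, $13.12/\sqrt{n}$ and $4/\sqrt{n}$ respectively, so the bound is roughly $18/\sqrt{n}$ for large even $n$.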
We now more formally describe the random variable $Y$. Let $\mathbf{Y} = \{Y_{ri} : r, i = 1, \ldots, n\}$ be the Bernoulli 'switch' variables which have the interpretation
$$Y_{ri} = \begin{cases} 1 & \text{if the status of bulb } i \text{ is changed at stage } r, \\ 0 & \text{otherwise.} \end{cases}$$
We continue to suppress the dependence of $Y$, and also of $Y_{ri}$, on $n$. As the set of $r$ bulbs which have their status changed at stage $r$ is chosen uniformly over all sets of size $r$, and as the stages are independent of each other, with $e_1, \ldots, e_n \in \{0, 1\}$ the joint distribution of $Y_{r1}, \ldots, Y_{rn}$ is given by
$$P(Y_{r1} = e_1, \ldots, Y_{rn} = e_n) = \begin{cases} \binom{n}{r}^{-1} & \text{if } e_1 + \cdots + e_n = r, \\ 0 & \text{otherwise,} \end{cases}$$
with the collections $\{Y_{r1}, \ldots, Y_{rn}\}$ independent for $r = 1, \ldots, n$. Clearly, for each stage $r$, the variables $(Y_{r1}, \ldots, Y_{rn})$ are exchangeable, and the marginal distribution for each $r, i = 1, \ldots, n$ is given by
$$P(Y_{ri} = 1) = \frac{r}{n} \quad \text{and} \quad P(Y_{ri} = 0) = 1 - \frac{r}{n}.$$
For $r, i = 1, 2, \ldots, n$ the quantity $(\sum_{s=1}^{r} Y_{si}) \bmod 2$ is the indicator that bulb $i$ is on at time $r$, and therefore
$$Y = \sum_{i=1}^{n} Y_i \quad \text{where} \quad Y_i = \Bigl(\sum_{r=1}^{n} Y_{ri}\Bigr) \bmod 2 \tag{6.126}$$
is the number of bulbs on at the terminal time.

The lightbulb process, where the $n$ individual states evolve according to the same marginal Markov chain, is a special case of a certain class of multivariate chains studied in Zhou and Lange (2009), termed 'Composition Markov chains of multinomial type.' As shown there, such chains admit explicit full spectral decompositions, and in particular, each transition matrix of the lightbulb process can be simultaneously diagonalized by a Hadamard matrix. These properties were, in fact, put to use in Rao et al. (2007) for the calculation of the moments needed for (6.123) and (6.124).

We now describe the coupling given by Goldstein and Zhang (2010), which shows that when $n$ is even, $Y$ may be coupled monotonically to a variable $Y^s$ having the $Y$-size bias distribution, in particular, such that
$$Y \le Y^s \le Y + 2. \tag{6.127}$$
For every $i \in \{1, \ldots, n\}$ construct the collection of variables $\mathbf{Y}^i$ from $\mathbf{Y}$ as follows. If $Y_i = 1$, that is, if bulb $i$ is on, let $\mathbf{Y}^i = \mathbf{Y}$. Otherwise, with $J^i$ a uniformly chosen index over the set $\{j : Y_{n/2,j} = 1 - Y_{n/2,i}\}$, let $\mathbf{Y}^i = \{Y^i_{rk} : r, k = 1, \ldots, n\}$ where
$$Y^i_{rk} = \begin{cases} Y_{rk} & r \ne n/2, \\ Y_{n/2,k} & r = n/2,\ k \notin \{i, J^i\}, \\ Y_{n/2,J^i} & r = n/2,\ k = i, \\ Y_{n/2,i} & r = n/2,\ k = J^i, \end{cases}$$
and let $Y^i = \sum_{k=1}^{n} Y^i_k$ where
$$Y^i_k = \Bigl(\sum_{j=1}^{n} Y^i_{jk}\Bigr) \bmod 2.$$
In other words, if bulb $i$ is off, then the switch variable $Y_{n/2,i}$ of bulb $i$ at stage $n/2$ is interchanged with that of a bulb whose switch variable at this stage has the opposite status. With $I$ uniformly chosen from $\{1, \ldots, n\}$ and independent of all other variables, it is shown in Goldstein and Zhang (2010) that the mixture $Y^s = Y^I$ has the $Y$-size biased distribution, essentially due to the fact that
$$\mathcal{L}\bigl(\mathbf{Y}^i\bigr) = \mathcal{L}(\mathbf{Y} \mid Y_i = 1) \quad \text{for all } i = 1, \ldots, n.$$
It is not difficult to see that $Y^s$ satisfies (6.127). If $Y_I = 1$ then $\mathbf{Y}^I = \mathbf{Y}$, and so in this case $Y^s = Y$. Otherwise $Y_I = 0$, and we obtain $\mathbf{Y}^I$ by interchanging, at stage $n/2$, the unequal switch variables $Y_{n/2,I}$ and $Y_{n/2,J^I}$, which changes the status of both bulbs $I$ and $J^I$. If bulb $J^I$ was on, that is, if $Y_{J^I} = 1$, then after the interchange $Y^I_I = 1$ and $Y^I_{J^I} = 0$, in which case $Y^s = Y$. Otherwise bulb $J^I$ was off, that is, $Y_{J^I} = 0$, in which case after the interchange we have $Y^I_I = 1$ and $Y^I_{J^I} = 1$, yielding $Y^s = Y + 2$.

As the coupling is both monotone and bounded, by (6.127) Theorem 5.7 may be invoked with $\delta = 2/\sigma$. In fact, the last two terms of the bound in Theorem 6.5, $1.64\,n/\sigma^3$ and $2/\sigma$, arise directly from Theorem 5.7 with this $\delta$. The bound (6.125) is calculated in Goldstein and Zhang (2010), making heavy use of the spectral decomposition provided by Zhou and Lange (2009) to determine various joint probabilities of fourth, but no higher, order.
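The coupling just described can be sketched in a few lines of Python. The code below builds the switch variables, picks a uniform index $I$, and, if bulb $I$ is off, interchanges its stage $n/2$ switch variable with that of a uniformly chosen bulb of opposite switch status at that stage. The function names and the list-of-rows representation are our own assumptions; the sketch illustrates the construction and the inequality $Y \le Y^s \le Y + 2$, and does not by itself verify that $Y^s$ has the size biased distribution.

```python
import random


def switch_matrix(n, rng):
    """Row r-1 marks the r bulbs toggled at stage r, chosen uniformly (r = 1..n)."""
    rows = []
    for r in range(1, n + 1):
        chosen = set(rng.sample(range(n), r))
        rows.append([1 if k in chosen else 0 for k in range(n)])
    return rows


def bulbs_on(switches):
    """Final states Y_k = (sum over stages of the switch variables) mod 2."""
    n = len(switches)
    return [sum(row[k] for row in switches) % 2 for k in range(n)]


def size_bias_coupling(n, rng):
    """Return (Y, Y^s) constructed from a common switch matrix; n must be even."""
    if n % 2 != 0:
        raise ValueError("the coupling described here needs n even")
    switches = switch_matrix(n, rng)
    states = bulbs_on(switches)
    y = sum(states)
    i = rng.randrange(n)  # the uniform index I
    if states[i] == 1:
        return y, y  # bulb I is already on: nothing is changed
    mid = n // 2 - 1  # zero-based row holding stage n/2
    row = switches[mid]
    # Exactly n/2 bulbs are toggled at stage n/2, so this list is never empty.
    candidates = [j for j in range(n) if row[j] == 1 - row[i]]
    j = rng.choice(candidates)
    coupled = [r[:] for r in switches]
    coupled[mid][i], coupled[mid][j] = row[j], row[i]
    return y, sum(bulbs_on(coupled))
```

In repeated calls such as `size_bias_coupling(6, random.Random(3))` the difference $Y^s - Y$ only ever takes the values 0 and 2, in agreement with the case analysis above.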
6.4 Anti-voter Model
The anti-voter model was introduced by Matloff (1977) on infinite lattices. Donnelly and Welsh (1984), and Aldous and Fill (1994) consider, as we do here, the case of finite graphs; see also Liggett (1985), and references there. The treatment below closely follows Rinott and Rotar (1997), who deal with a discrete time version. Let $G = (V, E)$ be a graph with $n$ vertices $V$ and edges $E$, which we assume to be $r$-regular, that is, all vertices $v \in V$ have degree $r$. Consider the following transition rule for a Markov chain $\{X^{(t)}, t = 0, 1, \ldots\}$ with state space $\{-1, 1\}^V$. At each time $t$, a vertex $v$ is chosen uniformly from $V$, and then a different vertex $w$ is chosen uniformly from the set $N_v = \{w : \{v, w\} \in E\}$ of neighbors of $v$, and then we let
\[
X_u^{(t+1)} = \begin{cases} X_u^{(t)} & u \neq v,\\ -X_w^{(t)} & u = v. \end{cases}
\]
That is, the configuration at time $t + 1$ is the same as at time $t$, except that vertex $v$ takes the sign opposite to that of its randomly chosen neighbor $w$. Following Donnelly and Welsh (1984), and Aldous and Fill (1994), when $G$ is neither an $n$ cycle nor bipartite, the chain is irreducible on the state space consisting of the $2^n - 2$ configurations which exclude those where all $X_v^{(t)}$ are identical, and has a stationary distribution supported on this set. We suppose the distribution of $X^{(0)}$, the chain at time zero, is this stationary distribution. The exchangeable pair coupling yields the following result on the quality of the normal approximation to the distribution of the standardized net sign of the stationary configuration.

Theorem 6.6 Let $X$ have the stationary distribution of the anti-voter chain on an $n$ vertex, $r$-regular graph $G$, neither an $n$ cycle nor bipartite, and let $W$ be the standardized net sign of the configuration $X$, that is, with $\sigma^2 = \mathrm{Var}(U)$ let
\[
W = U/\sigma \quad\text{where}\quad U = \sum_{v \in V} X_v. \tag{6.128}
\]
Then, if $U'$ is the net sign obtained by applying the one step transition to the configuration $X$, $(U, U')$ is a $2/n$-Stein pair that satisfies
\[
|U' - U| \le 2 \quad\text{and}\quad E\bigl[(U' - U)^2 \mid X\bigr] = \frac{8(a + b)}{rn}, \tag{6.129}
\]
where $a$ and $b$ are the number of edges that are incident on vertices both of which are in state $+1$, or $-1$, respectively. In addition,
\[
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{12n}{\sigma^3} + \frac{\sqrt{\mathrm{Var}(Q)}}{r\sigma^2}
\]
where
\[
Q = \sum_{v \in V} \sum_{w \in N_v} X_v X_w. \tag{6.130}
\]
When $\sigma^2$ and $\mathrm{Var}(Q)$ are of order $n$, the bound given in the theorem has order $1/\sqrt{n}$. Use of (5.13) results in a somewhat more complex, but superior bound. The first order of business in proving Theorem 6.6 is the construction of an exchangeable pair. It is immediate that if $X^{(t)}$ is a reversible Markov chain in stationarity then for any measurable function $f$ on the chain, $(f(X^{(s)}), f(X^{(t)}))$ is exchangeable, for any $s$ and $t$. Even when a chain is not reversible, as is the case for the anti-voter model, the following lemma may be invoked for functions of chains whose increments are the same as those of a birth and death process.

Lemma 6.11 Let $\{X^{(t)}, t = 0, 1, \ldots\}$ be a stationary process, and suppose that $T(X^{(t)})$ assumes nonnegative integer values such that
\[
T\bigl(X^{(t+1)}\bigr) - T\bigl(X^{(t)}\bigr) \in \{-1, 0, 1\} \quad\text{for all } t = 0, 1, \ldots. \tag{6.131}
\]
Then for any measurable function $f$,
\[
(W, W') = \bigl(f\bigl(T(X^{(t)})\bigr), f\bigl(T(X^{(t+1)})\bigr)\bigr)
\]
is an exchangeable pair.
Proof The process $T^{(t)} = T(X^{(t)})$ is stationary and has values in the nonnegative integers. For integers $i, j$ in the range of $T(\cdot)$, set $\pi_i = P(T^{(t)} = i)$ and $p_{ij} = P(T^{(t+1)} = j \mid T^{(t)} = i)$. By stationarity, these probabilities do not depend on $t$. Using stationarity to obtain the second equality, and setting $\pi_i$ and $p_{ij}$ equal to 0 for all $i < 0$, we have for all nonnegative integers $j$,
\[
\pi_j = P\bigl(T^{(t)} = j\bigr) = P\bigl(T^{(t+1)} = j\bigr) = \sum_{i \in \mathbb{N}_0} P\bigl(T^{(t+1)} = j, T^{(t)} = i\bigr)
= \sum_{i \in \mathbb{N}_0} P\bigl(T^{(t+1)} = j \mid T^{(t)} = i\bigr) P\bigl(T^{(t)} = i\bigr) = \sum_{i:\, |i-j| \le 1} \pi_i p_{ij},
\]
where we have restricted the sum in the last equality due to the condition imposed by (6.131). This same system of equations arises in birth and death chains and it is well-known that if it has a solution then it is unique, can be written explicitly, and satisfies $\pi_i p_{ij} = \pi_j p_{ji}$ (which implies reversibility for birth and death chains). Here, the latter relation is equivalent to
\[
P\bigl(T^{(t)} = i, T^{(t+1)} = j\bigr) = P\bigl(T^{(t)} = j, T^{(t+1)} = i\bigr),
\]
implying that $(T^{(t)}, T^{(t+1)})$ is an exchangeable pair.

With this result in hand, we may now proceed to the proof of Theorem 6.6.

Proof of Theorem 6.6 We apply Theorem 5.4. By Lemma 6.11,
\[
(W, W') = \bigl(W(X^{(t)}), W(X^{(t+1)})\bigr)
\]
is an exchangeable pair when $W(X)$ is the standardized net sign of the configuration $X$, as in (6.128). With $U$ and $U'$ the net signs of $X^{(t)}$ and $X^{(t+1)}$, respectively, since at most a single $1$ becomes $-1$, or a $-1$ becomes a $1$, in a one step transition, clearly the first claim of (6.129) holds. We next verify that $(W, W')$ satisfies the linearity condition (2.33) with $\lambda = 2/n$. Let
\[
T = \sum_{v \in V} 1_{\{X_v = 1\}},
\]
the number of vertices with sign 1, and let
\[
a = \sum_{\{u,v\} \in E} 1_{\{X_u = X_v = 1\}}, \quad b = \sum_{\{u,v\} \in E} 1_{\{X_u = X_v = -1\}} \quad\text{and}\quad c = \sum_{\{u,v\} \in E} 1_{\{X_u \neq X_v\}},
\]
the number of edges both of whose incident vertices take the value 1, the value $-1$, or both these values, respectively. For an $r$-regular graph,
\[
r 1_{\{X_v = 1\}} = \sum_{w \in N_v} 1_{\{X_v = 1, X_w = 1\}} + \sum_{w \in N_v} 1_{\{X_v = 1, X_w = -1\}},
\]
hence summing over $v \in V$ yields
\[
rT = \sum_{v \in V,\, w \in N_v} 1_{\{X_v = 1, X_w = 1\}} + \sum_{v \in V,\, w \in N_v} 1_{\{X_v = 1, X_w = -1\}} = 2a + c,
\]
and so
\[
T = (2a + c)/r \quad\text{and likewise}\quad n - T = (2b + c)/r. \tag{6.132}
\]
Note $U = 2T - n$ and $U' = 2T' - n$ are the net signs of the configurations $X^{(t)}$ and $X^{(t+1)}$, respectively. When making a transition one first chooses a vertex uniformly, then one of its neighbors, uniformly, and so since the graph is regular the edge so chosen is uniform over the $rn/2$ edges of the graph. As the net sign $U$ decreases by 2 in a transition if and only if a $1$ becomes a $-1$, and this event occurs if and only if one of the edges counted by $a$ is chosen, we have
\[
P(U' - U = -2 \mid X) = \frac{2a}{rn} \quad\text{and likewise}\quad P(U' - U = 2 \mid X) = \frac{2b}{rn}, \tag{6.133}
\]
and therefore, by (6.132),
\[
E(U' - U \mid X) = \frac{4b}{rn} - \frac{4a}{rn} = \frac{2(n - 2T)}{n} = -\frac{2}{n} U.
\]
Hence, (2.33) is satisfied for $W = U/\sigma$ with $\lambda = 2/n$, and Theorem 5.4 obtains with this value of $\lambda$ and, as $|U' - U| \le 2$, with $\delta = 2/\sigma$. Next we bound the quantity in (5.3). By (6.133) we have
\[
E\bigl[(U' - U)^2 \mid X\bigr] = 4\Bigl(\frac{2a}{rn} + \frac{2b}{rn}\Bigr),
\]
proving the second claim in (6.129). Next, recalling the definition of $Q$ in (6.130), note the relations
\[
2a + 2b + 2c = rn \quad\text{and}\quad 2a + 2b - 2c = Q
\]
imply $4(a + b) = Q + rn$. Hence $E[(U' - U)^2 \mid X] = 2(Q + rn)/(rn)$, and therefore, using that $W$ is a function of $X$ and (4.143), the quantity in (5.3) is bounded by
\[
\sqrt{\mathrm{Var}\bigl(E[(W' - W)^2 \mid X]\bigr)} = \sqrt{\mathrm{Var}\Bigl(\frac{2Q}{rn\sigma^2}\Bigr)} = \frac{2}{rn\sigma^2} \sqrt{\mathrm{Var}(Q)}.
\]
Applying Theorem 5.4 along with (5.4) and this upper bound, and using that $\lambda = 2/n$ and $\delta = 2/\sigma$, the proof of the theorem is complete.

The quantities $\sigma$ and $\mathrm{Var}(Q)$ depend heavily on the particular graph under consideration. For details on how these quantities may be bounded for graphs having certain regularity properties, and examples which include the Hamming graph and the k-bipartite graph, see Rinott and Rotar (1997).
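The two conditional moment identities used in the proof, $E(U' - U \mid X) = -2U/n$ and $E[(U' - U)^2 \mid X] = 8(a+b)/(rn)$, can be checked by brute force on a small regular graph. The following Python sketch is an illustration only: the helper names and the choice of a complete graph are assumptions made here, not part of the original treatment. It enumerates every ordered choice $(v, w)$ of a vertex and a neighbor, with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def complete_graph(n):
    # Adjacency lists of K_n, which is (n-1)-regular (illustrative choice of graph).
    return [[w for w in range(n) if w != v] for v in range(n)]

def one_step_moments(adj, x):
    # Exact E[U'-U | X=x] and E[(U'-U)^2 | X=x]: vertex v is uniform on V,
    # then neighbor w is uniform on N_v, and x_v is replaced by -x_w.
    n, r = len(adj), len(adj[0])
    prob = Fraction(1, n * r)
    first = second = Fraction(0)
    for v in range(n):
        for w in adj[v]:
            d = -x[w] - x[v]          # change in the net sign U
            first += prob * d
            second += prob * d * d
    return first, second

def same_sign_edges(adj, x):
    # a = number of edges with both ends +1, b = number with both ends -1.
    a = b = 0
    for v in range(len(adj)):
        for w in adj[v]:
            if v < w and x[v] == x[w]:
                if x[v] == 1:
                    a += 1
                else:
                    b += 1
    return a, b

def check_identities(adj):
    # True if both identities hold for every configuration in {-1, 1}^V.
    n, r = len(adj), len(adj[0])
    for x in product((-1, 1), repeat=n):
        u = sum(x)
        a, b = same_sign_edges(adj, x)
        first, second = one_step_moments(adj, x)
        if first != Fraction(-2 * u, n):
            return False
        if second != Fraction(8 * (a + b), r * n):
            return False
    return True
```

Calling `check_identities(complete_graph(5))` runs the check over all 32 configurations of the complete graph on five vertices; any other regular graph supplied as adjacency lists can be substituted.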
6.5 Binary Expansion of a Random Integer
Let $n \ge 2$ be a natural number and $x$ an integer in the set $\{0, 1, \ldots, n - 1\}$. For $m = [\log_2(n - 1)] + 1$, consider the binary expansion of $x$,
\[
x = \sum_{i=1}^m x_i 2^{m-i}.
\]
Clearly any leading zeros contribute nothing to the sum $x$. With $X$ uniformly chosen from $\{0, 1, \ldots, n - 1\}$, the sum $S = X_1 + \cdots + X_m$ is the number of ones in the expansion of $X$. When $n = 2^m$ a uniform random integer between 0 and $2^m - 1$ may be constructed by choosing its $m$ binary digits to be zeros and ones with equal probability, and independently, so the distribution of $S$ in this case is the symmetric binomial distribution with $m$ trials, which of course can be well approximated by the normal. Theorem 6.7 shows that the same is true for any large $n$, and provides an explicit bound. We follow Diaconis (1977). We approach the problem using exchangeable pairs. For $x$ an integer in $\{0, \ldots, n - 1\}$ let $Q(x, n)$ be the number of zeros in the $m$ long expansion of $x$ which, when changed to 1, result in an integer $n$ or larger, that is,
\[
Q(x, n) = |J_x| \quad\text{where}\quad J_x = \bigl\{i \in \{1, \ldots, m\} : x_i = 0,\ x + 2^{m-i} \ge n\bigr\}. \tag{6.134}
\]
For example, $Q(5, 10) = 1$. With $I$ a random index on $\{1, \ldots, m\}$ let
\[
X' = \begin{cases} X + (1 - 2X_I)2^{m-I} & \text{if } I \notin J_X,\\ X & \text{otherwise.} \end{cases}
\]
That is, the $I$th digit of $X$ is changed from $X_I$ to $1 - X_I$, if doing so produces a number in $\{0, \ldots, n - 1\}$. Clearly $(X, X')$ are exchangeable, and $S'$, the number of ones in the expansion of $X'$, is given by
\[
S' = \begin{cases} S + 1 - 2X_I & \text{if } I \notin J_X,\\ S & \text{otherwise.} \end{cases}
\]
As we see from the following lemma $(S, S')$ is not a Stein pair, as it fails to satisfy the linearity condition. Nevertheless, Theorem 3.5 applies. The lemma also provides the mean and variance of $S$.

Lemma 6.12 For $n \ge 2$ let $X$ be uniformly chosen from $\{0, 1, \ldots, n - 1\}$ and $Q = Q(X, n)$. Then
\[
E(S' - S \mid X) = 1 - \frac{2S}{m} - \frac{Q}{m}, \qquad E\bigl[(S' - S)^2 \mid X\bigr] = 1 - \frac{Q}{m},
\]
and
\[
ES = \frac{1}{2}(m - EQ) \quad\text{and}\quad \mathrm{Var}(S) = \frac{m}{4}\Bigl(1 - \frac{EQ + 2\,\mathrm{Cov}(S, Q)}{m}\Bigr).
\]
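Proof To derive the first identity, write
\[
\begin{aligned}
E(S' - S \mid X) &= P(S' - S = 1 \mid X) - P(S' - S = -1 \mid X)\\
&= P(X_I = 0, X'_I = 1 \mid X) - P(X_I = 1 \mid X)\\
&= P(X_I = 0 \mid X) - P(X_I = 0, X'_I = 0 \mid X) - P(X_I = 1 \mid X)\\
&= 1 - P(X_I = 0, X'_I = 0 \mid X) - 2P(X_I = 1 \mid X)\\
&= 1 - \frac{1}{m} \sum_{i=1}^m 1_{\{X_i = 0,\, X + 2^{m-i} \ge n\}} - \frac{2}{m} \sum_{i=1}^m X_i\\
&= 1 - \frac{2S}{m} - \frac{Q}{m}.
\end{aligned} \tag{6.135}
\]
The expectation of $S$ can now be calculated using that $E(S' - S) = 0$. Similarly, since $S' - S \in \{-1, 0, 1\}$,
\[
\begin{aligned}
E\bigl[(S' - S)^2 \mid X\bigr] &= P(S' - S = 1 \mid X) + P(S' - S = -1 \mid X)\\
&= P(X_I = 0, X'_I = 1 \mid X) + P(X_I = 1 \mid X)\\
&= P(X_I = 0 \mid X) - P(X_I = 0, X'_I = 0 \mid X) + P(X_I = 1 \mid X)\\
&= 1 - P(X_I = 0, X'_I = 0 \mid X)\\
&= 1 - \frac{1}{m} \sum_{i=1}^m 1_{\{X_i = 0,\, X + 2^{m-i} \ge n\}}\\
&= 1 - \frac{Q}{m}.
\end{aligned} \tag{6.136}
\]
To calculate the variance, note first
\[
0 = E(S' - S)(S' + S) = E\bigl[2(S' - S)S + (S' - S)^2\bigr].
\]
Now taking expectation in (6.136), using identity (6.135) and that the quantities involved have mean zero, we obtain
\[
E\Bigl(1 - \frac{Q}{m}\Bigr) = E(S' - S)^2 = 2E\bigl[S(S - S')\bigr] = 2E\Bigl[S\Bigl(\frac{2S}{m} + \frac{Q}{m} - 1\Bigr)\Bigr] = \frac{4\,\mathrm{Var}(S)}{m} + 2\,\mathrm{Cov}\Bigl(S, \frac{Q}{m}\Bigr).
\]
Solving for the variance now completes the proof.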
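Because $X$ is uniform on a finite set, the mean and variance formulas of Lemma 6.12 can be confirmed exactly by enumeration. The sketch below is illustrative only (the function names are chosen here and do not come from the text). It computes $Q(x, n)$ directly from definition (6.134) and compares both sides of the two identities using exact fractions.

```python
from fractions import Fraction

def num_digits(n):
    # m = [log2(n - 1)] + 1, the number of binary digits of n - 1 (n >= 2).
    return (n - 1).bit_length()

def ones(x):
    # S: number of ones in the binary expansion of x.
    return bin(x).count("1")

def Q(x, n):
    # Number of zero digits of x (in its m-digit expansion) that cannot be
    # flipped to one without leaving {0, ..., n-1}; see (6.134).
    m = num_digits(n)
    return sum(1 for i in range(1, m + 1)
               if ((x >> (m - i)) & 1) == 0 and x + 2 ** (m - i) >= n)

def moments(n):
    # Exact ES, EQ, Var(S), Cov(S, Q) for X uniform on {0, ..., n-1}.
    s = [ones(x) for x in range(n)]
    q = [Q(x, n) for x in range(n)]
    es, eq = Fraction(sum(s), n), Fraction(sum(q), n)
    var_s = Fraction(sum(v * v for v in s), n) - es ** 2
    cov_sq = Fraction(sum(a * b for a, b in zip(s, q)), n) - es * eq
    return es, eq, var_s, cov_sq

def lemma_6_12_holds(n):
    m = num_digits(n)
    es, eq, var_s, cov_sq = moments(n)
    mean_ok = es == (m - eq) / 2
    var_ok = var_s == Fraction(m, 4) * (1 - (eq + 2 * cov_sq) / m)
    return mean_ok and var_ok
```

For instance, `lemma_6_12_holds(n)` can be evaluated for every `n` in a range such as 2 to 200, and `moments(n)[1]` gives $EQ$, which can be compared with the bound $EQ \le 2$ used in the proof of Theorem 6.7.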
Theorem 6.7 For $n \ge 2$ let $X$ be a random integer chosen uniformly from $\{0, \ldots, n - 1\}$. Then with $S$ the number of ones in the binary expansion of $X$ and $m = [\log_2(n - 1)] + 1$, the random variable
\[
W = \frac{S - m/2}{\sqrt{m/4}} \tag{6.137}
\]
satisfies
\[
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{6.9}{\sqrt{m}} + \frac{5.4}{m}.
\]
Proof With $W'$ given by (6.137) with $S$ replaced by $S'$, the pair $(W, W')$ is exchangeable, and, with $\lambda = 2/m$, Lemma 6.12 yields
\[
E(W - W' \mid W) = \lambda\Bigl(W + \frac{E(Q \mid W)}{\sqrt{m}}\Bigr) \tag{6.138}
\]
and
\[
\frac{1}{2\lambda} E\bigl[(W - W')^2 \mid W\bigr] = 1 - \frac{E(Q \mid W)}{m}. \tag{6.139}
\]
Further, we have
\[
EQ = \sum_{i=1}^m P\bigl(X_i = 0,\ X + 2^{m-i} \ge n\bigr) \le \sum_{i=1}^m P\bigl(X \ge n - 2^{m-i}\bigr) = \sum_{i=1}^m \frac{2^{m-i}}{n} = \frac{2^m - 1}{n} \le \frac{2^m - 1}{2^{m-1}} \le 2. \tag{6.140}
\]
Since $W$, $W'$ are exchangeable, for any function $f$ for which the expectations below exist, identity (2.32) yields
\[
\begin{aligned}
0 &= E(W - W')\bigl(f(W') + f(W)\bigr)\\
&= E(W - W')\bigl(f(W') - f(W)\bigr) + 2E\bigl[f(W)(W - W')\bigr]\\
&= E(W - W')\bigl(f(W') - f(W)\bigr) + 2E\bigl[f(W) E(W - W' \mid W)\bigr]\\
&= E(W - W')\bigl(f(W') - f(W)\bigr) + 2\lambda E\Bigl[\Bigl(W + \frac{E(Q \mid W)}{\sqrt{m}}\Bigr) f(W)\Bigr].
\end{aligned}
\]
Solving for $EWf(W)$ and then reasoning as in the proof of Lemma 2.7 we obtain
\[
\begin{aligned}
EWf(W) &= \frac{1}{2\lambda} E(W' - W)\bigl(f(W') - f(W)\bigr) - \frac{1}{\sqrt{m}} E\bigl[E(Q \mid W) f(W)\bigr]\\
&= E\int_{-\infty}^{\infty} f'(W + t) \hat{K}(t)\,dt - \frac{1}{\sqrt{m}} E\bigl[E(Q \mid W) f(W)\bigr]
\end{aligned}
\]
where ˆ = (1{−≤t≤0} − 1{0
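where
\[
R_1 = -\frac{1}{\sqrt{m}} E(Q \mid W) f_z(W).
\]
Now, using $Q \ge 0$, inequality (2.9) and (6.140) we have $|ER_1| \le \sqrt{\pi/(2m)}$. We invoke Theorem 3.5 with
\[
\delta_0 = 2/\sqrt{m} \quad\text{and}\quad \delta_1 = \sqrt{\pi/(2m)}. \tag{6.141}
\]
First, by (6.139),
\[
\hat{K}_1 = E\Bigl[\int_{|t| \le \delta_0} \hat{K}(t)\,dt \Bigm| W\Bigr] = E\Bigl[\frac{(W - W')^2}{2\lambda} \Bigm| W\Bigr] = 1 - \frac{E(Q \mid W)}{m}.
\]
Rewriting (6.138),
\[
E(W' \mid W) = (1 - \lambda)W - \lambda \frac{E(Q \mid W)}{\sqrt{m}} \quad\text{so}\quad E(WW') = (1 - \lambda)EW^2 - \lambda \frac{E(QW)}{\sqrt{m}}.
\]
Therefore, taking expectation in (6.139) yields
\[
1 - \frac{E(Q)}{m} = \frac{1}{2\lambda} E(W - W')^2 = \frac{1}{\lambda} E\bigl(W^2 - WW'\bigr) = EW^2 + \frac{E(QW)}{\sqrt{m}},
\]
or
\[
EW^2 = 1 - \frac{E(Q)}{m} - \frac{E(QW)}{\sqrt{m}} \le 1 - \frac{E(QW)}{\sqrt{m}}.
\]
Now, since $0 \le S \le m$, we have $|W| \le \sqrt{m}$, and therefore
\[
EW^2 \le 1 + \frac{E|W|Q}{\sqrt{m}} \le 1 + EQ \le 3.
\]
Applying in addition the fact that $E|1 - \hat{K}_1| = EQ/m \le 2/m$, and that $0 \le \hat{K}_1 \le 1$ so that $E|W|\hat{K}_1 \le E|W| \le \sqrt{EW^2}$, Theorem 3.5 along with (6.141) yields the bound
\[
\begin{aligned}
\delta_0\bigl(1.1 + E(|W|\hat{K}_1)\bigr) + 2.7E|1 - \hat{K}_1| + \delta_1 &\le \delta_0(1.1 + \sqrt{3}) + \frac{5.4}{m} + \delta_1\\
&\le 2.8\delta_0 + \frac{5.4}{m} + \delta_1\\
&\le \frac{6.9}{\sqrt{m}} + \frac{5.4}{m}
\end{aligned}
\]
as claimed.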
Chapter 7
Discretized Normal Approximation
The very first use of normal approximation was for the binomial. However, glancing at the histogram depicting the mass function of, say, the binomial $S \sim B(16, 1/2)$, with the normal density drawn on the same scale, one sees that integer boundaries split the mass of the binomial into halves. Hence, to make the approximation more accurate, it is often recommended that continuity correction be used, that is, that 1/2 be added at upper limits, and 1/2 subtracted at lower ones. For instance, for $S \sim B(16, 1/2)$, one has $P(S \le 4) = 0.0384$. The straightforward normal approximation yields $\Phi((4 - 8)/2) = \Phi(-2) = 0.023$, while continuity correction gives the much more accurate value, $\Phi((4.5 - 8)/2) = \Phi(-1.75) = 0.040$. For a discussion of continuity correction, see Feller (1968a), page 185, and for some additional justification, also Cox (1970). In this chapter, we provide a total variation bound between the Poisson binomial distribution and the integer valued distribution obtained by discretizing the normal according to the continuity correction rule, and then, a more general result for the discretized normal approximation of an independent sum of integer valued random variables. The results in this chapter are due to Chen and Leong (2010). To study approximation using continuity correction, for $\mu \in \mathbb{R}$ and $\sigma^2 > 0$, let $Z_{\mu,\sigma^2}$ be the discretized normal with distribution given by
\[
P(Z_{\mu,\sigma^2} = k) = P\Bigl(\frac{k - \mu - 1/2}{\sigma} < Z \le \frac{k - \mu + 1/2}{\sigma}\Bigr) \quad\text{for } k \in \mathbb{Z}, \tag{7.1}
\]
where $Z$ is a standard normal variable. Closeness is measured in the total variation distance between the laws of random variables $X$ and $Y$,
\[
\bigl\|\mathcal{L}(X) - \mathcal{L}(Y)\bigr\|_{TV} = \sup_{A \subset \mathbb{R}} \bigl|P(X \in A) - P(Y \in A)\bigr| = \frac{1}{2} \sup_{\|h\| \le 1} \bigl|Eh(X) - Eh(Y)\bigr|. \tag{7.2}
\]
One can verify that if both $X$ and $Y$ are integer valued random variables, then
\[
\bigl\|\mathcal{L}(X) - \mathcal{L}(Y)\bigr\|_{TV} = \frac{1}{2} \sum_{k=-\infty}^{\infty} \bigl|P(X = k) - P(Y = k)\bigr|. \tag{7.3}
\]
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_7, © Springer-Verlag Berlin Heidelberg 2011
Similarly, if $X$ and $Y$ have densities $p_X$ and $p_Y$, respectively, then
\[
\bigl\|\mathcal{L}(X) - \mathcal{L}(Y)\bigr\|_{TV} = \frac{1}{2} \int_{-\infty}^{\infty} \bigl|p_X(u) - p_Y(u)\bigr|\,du. \tag{7.4}
\]
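The continuity correction figures quoted at the start of this chapter for $S \sim B(16, 1/2)$ are easy to reproduce. The short Python sketch below is an illustration added here (the helper names are not from the text); it uses only the standard library, with $\Phi$ computed from the error function.

```python
from math import comb, erf, sqrt

def Phi(z):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def binom_cdf(k, n, p):
    # P(S <= k) for S ~ B(n, p), summed directly from the mass function.
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

exact = binom_cdf(4, 16, 0.5)          # P(S <= 4)
plain = Phi((4 - 8) / 2)               # normal approximation, no correction
corrected = Phi((4.5 - 8) / 2)         # with continuity correction

print(f"exact={exact:.4f} plain={plain:.4f} corrected={corrected:.4f}")
```

Here the mean and standard deviation of $S$ are $8$ and $2$, so the standardized arguments are $-2$ and $-1.75$, as in the text.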
7.1 Poisson Binomial
The following theorem gives a total variation bound on the approximation of a Poisson binomial random variable $S$ by a continuity corrected discretized normal. Without loss of generality we assume in what follows that the summands $X_1, \ldots, X_n$ of $S$ are non-trivial.

Theorem 7.1 Let $X_1, \ldots, X_n$ be independent Bernoulli variables with distribution $P(X_i = 1) = p_i$ and $P(X_i = 0) = 1 - p_i$, $i = 1, \ldots, n$, and let $S = \sum_{i=1}^n X_i$, $\mu = \sum_{i=1}^n p_i$ and $\sigma^2 = \sum_{i=1}^n p_i(1 - p_i)$. Then
\[
\bigl\|\mathcal{L}(S) - \mathcal{L}(Z_{\mu,\sigma^2})\bigr\|_{TV} \le \frac{7.6}{\sigma}.
\]
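To get a feel for Theorem 7.1, one can compute the total variation distance between a Poisson binomial law and the discretized normal of (7.1) numerically, using the sum formula (7.3), and compare it with $7.6/\sigma$. The following Python sketch is illustrative only: the function names and the choice of success probabilities are assumptions made here, and floating point arithmetic is used, so the computed distance is approximate.

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal distribution function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def poisson_binomial_pmf(ps):
    # Mass function of S = X_1 + ... + X_n by repeated convolution.
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf

def tv_and_bound(ps):
    # Returns (total variation distance to Z_{mu,sigma^2}, 7.6 / sigma).
    mu = sum(ps)
    sigma = sqrt(sum(p * (1.0 - p) for p in ps))
    n = len(ps)
    pmf = poisson_binomial_pmf(ps)

    def z_mass(k):
        # P(Z_{mu,sigma^2} = k) as in (7.1).
        return Phi((k - mu + 0.5) / sigma) - Phi((k - mu - 0.5) / sigma)

    inside = sum(abs(pmf[k] - z_mass(k)) for k in range(n + 1))
    # Mass that the discretized normal places outside {0, ..., n}.
    outside = Phi((-0.5 - mu) / sigma) + (1.0 - Phi((n + 0.5 - mu) / sigma))
    return 0.5 * (inside + outside), 7.6 / sigma

tv, bound = tv_and_bound([0.3] * 400)
```

The bound is only informative once $\sigma > 7.6$, which is why the example uses 400 summands; for such cases the computed distance is typically far smaller than the bound.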
We begin with a general result which gives the total variation distance between a zero biased variable $W^*$ and the normal, when $W$ and $W^*$ can be constructed on a joint space.

Theorem 7.2 Let $W$ be a mean zero, variance one random variable and suppose $W^*$, on the same space as $W$, has the $W$-zero bias distribution. Then for all functions $h$ with $\|h\| \le 1$,
\[
\bigl|Eh(W^*) - Nh\bigr| \le 4E\bigl|W(W - W^*)\bigr| + \sqrt{2\pi}\,E|W - W^*|.
\]
Proof With $f$ the solution of the Stein equation for the given $h$, we have
\[
\begin{aligned}
\bigl|Eh(W^*) - Nh\bigr| &= \bigl|E\bigl[f'(W^*) - W^* f(W^*)\bigr]\bigr|\\
&= \bigl|E\bigl[Wf(W) - W^* f(W^*)\bigr]\bigr|\\
&\le \bigl|E\,W\bigl(f(W) - f(W^*)\bigr)\bigr| + \bigl|E f(W^*)(W - W^*)\bigr|\\
&\le \|f'\|\,E\bigl|W(W - W^*)\bigr| + \|f\|\,E|W - W^*|.
\end{aligned}
\]
Applying the bounds in Lemma 2.4 yields the inequality.

We will also consider $\bar{Z}_{\mu,\sigma^2}$, the discretized version of the distribution of $Z$ given by
\[
P(\bar{Z}_{\mu,\sigma^2} = k) = P\Bigl(\frac{k - \mu}{\sigma} < Z \le \frac{k - \mu + 1}{\sigma}\Bigr), \tag{7.5}
\]
and $W^*_{\mu,\sigma^2}$, the similarly discretized version of $W^*$ with distribution
\[
P\bigl(W^*_{\mu,\sigma^2} = k\bigr) = P\Bigl(\frac{k - \mu}{\sigma} < W^* \le \frac{k + 1 - \mu}{\sigma}\Bigr). \tag{7.6}
\]
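When $W$ is the sum of independent variables, Theorem 7.2 has the following corollary.

Corollary 7.1 Let $X_1, \ldots, X_n$ be independent random variables with $EX_i = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2$ and finite third absolute central moments
\[
\gamma_i = E|X_i - \mu_i|^3/\sigma^3,
\]
where $\sigma^2 = \sum_{i=1}^n \sigma_i^2$. Suppose that
\[
W = \sum_{i=1}^n \xi_i \quad\text{where}\quad \xi_i = \frac{X_i - \mu_i}{\sigma} \tag{7.7}
\]
and that $W^*$ has the $W$ zero bias distribution. Then
\[
\bigl\|\mathcal{L}(W^*) - \mathcal{L}(Z)\bigr\|_{TV} \le \bigl(5 + 3\sqrt{\pi/8}\bigr)\gamma
\]
where $\gamma = \sum_{i=1}^n \gamma_i$. If $\bar{Z}_{\mu,\sigma^2}$ and $W^*_{\mu,\sigma^2}$ have distributions (7.5) and (7.6), respectively, then
\[
\bigl\|\mathcal{L}(W^*_{\mu,\sigma^2}) - \mathcal{L}(\bar{Z}_{\mu,\sigma^2})\bigr\|_{TV} \le \bigl(5 + 3\sqrt{\pi/8}\bigr)\gamma. \tag{7.8}
\]
Proof By Lemma 2.8 we may construct $W$ and $W^*$ on a joint space by letting $W^* = W - \xi_I + \xi_I^*$ where $\xi_i^*$ has the $\xi_i$ zero bias distribution and $I$ is a random index with distribution (2.60), with $\xi_i^*$ independent of $\xi_i$ and $I$ independent of all other variables. We apply Theorem 7.2. For the first term, by independence and the Cauchy–Schwarz inequality,
\[
E\bigl|W(W^* - W)\bigr| = E\bigl|W(\xi_I^* - \xi_I)\bigr| \le E|\xi_I^*| + E|W\xi_I|.
\]
Noting that $\mathrm{Var}(\xi_i) = \sigma_i^2/\sigma^2$ and using the distribution (2.60) of $I$ and (2.57), we have
\[
E|\xi_I^*| = \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2} E|\xi_i^*| = \frac{1}{2} \sum_{i=1}^n E|\xi_i|^3 = \gamma/2 \quad\text{and}\quad E|\xi_I| = \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2} E|\xi_i| \le \gamma.
\]
Next,
\[
E|W\xi_I| \le \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2} E\bigl(|W - \xi_i||\xi_i| + \xi_i^2\bigr) \le \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2}\Bigl(E|\xi_i| + \frac{\sigma_i^2}{\sigma^2}\Bigr) \le \gamma + \sum_{i=1}^n \frac{\sigma_i^3}{\sigma^3} \le 2\gamma.
\]
For the second term, by the triangle inequality,
\[
E|W^* - W| = E|\xi_I^* - \xi_I| \le \frac{3}{2}\gamma.
\]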
Collecting terms in the bound, taking supremum over $\|h\| \le 1$, and applying the definition (7.2) now yields the first conclusion. Applying (7.7), (7.3), and the definition of total variation distance in (7.2) with
\[
h(w) = \sum_{k \in \mathbb{Z}} h_k\, 1\Bigl(\frac{k - \mu}{\sigma} < w \le \frac{k + 1 - \mu}{\sigma}\Bigr)
\]
where
\[
h_k = \begin{cases} +1 & P(W^*_{\mu,\sigma^2} = k) \ge P(\bar{Z}_{\mu,\sigma^2} = k),\\ -1 & P(W^*_{\mu,\sigma^2} = k) < P(\bar{Z}_{\mu,\sigma^2} = k), \end{cases}
\]
yields the second conclusion.

We will also make use of the following fact.
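Lemma 7.1 If $\bar{Z}_{\mu,\sigma^2}$ and $Z_{\mu,\sigma^2}$ have distributions (7.5) and (7.1), respectively, then
\[
\bigl\|\mathcal{L}(\bar{Z}_{\mu,\sigma^2} + 1) - \mathcal{L}(Z_{\mu,\sigma^2})\bigr\|_{TV} \le \frac{1}{2\sqrt{2\pi}\,\sigma}.
\]
Proof For $Z$ a standard Gaussian random variable, from (7.5),
\[
P(\bar{Z}_{\mu,\sigma^2} + 1 = k) = P\Bigl(k - 1/2 < \sigma\Bigl(Z + \frac{1}{2\sigma}\Bigr) + \mu < k + 1/2\Bigr),
\]
while from (7.1),
\[
P(Z_{\mu,\sigma^2} = k) = P(k - 1/2 < \sigma Z + \mu < k + 1/2).
\]
Therefore,
\[
\begin{aligned}
\bigl\|\mathcal{L}(\bar{Z}_{\mu,\sigma^2} + 1) - \mathcal{L}(Z_{\mu,\sigma^2})\bigr\|_{TV} &\le \Bigl\|\mathcal{L}\Bigl(Z + \frac{1}{2\sigma}\Bigr) - \mathcal{L}(Z)\Bigr\|_{TV}\\
&= 1 - 2\Bigl(1 - \Phi\Bigl(\frac{1}{4\sigma}\Bigr)\Bigr)\\
&= \Phi\Bigl(\frac{1}{4\sigma}\Bigr) - \Phi\Bigl(-\frac{1}{4\sigma}\Bigr)\\
&\le \frac{1}{2\sqrt{2\pi}\,\sigma},
\end{aligned}
\]
where we have applied (7.4) to obtain the first equality.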
To prove Theorem 7.1 we require the following result.

Lemma 7.2 Let $S = \sum_{i=1}^n X_i$ where $X_1, \ldots, X_n$ are independent indicator random variables with $P(X_i = 1) = p_i$ for $i = 1, \ldots, n$, having variance $\sigma^2 = \mathrm{Var}(S)$. If $I$ is a random index, independent of $X_1, \ldots, X_n$, with distribution
\[
P(I = i) = \frac{p_i(1 - p_i)}{\sigma^2},
\]
then
\[
Ef(S) = Ef\bigl(S^{(I)} + 1\bigr) - \frac{1}{\sigma^2} E\Bigl[\sum_{i=1}^n (1 - p_i)(X_i - p_i) f(S)\Bigr],
\]
where $S^{(i)} = S - X_i$ for $i = 1, \ldots, n$.
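Assuming Lemma 7.2, the proof of Theorem 7.1 is somewhat direct.

Proof of Theorem 7.1 For the application of Corollary 7.1 when the variables $X_1, \ldots, X_n$ are indicators, note that
\[
\gamma = \sum_{i=1}^n \frac{E|X_i - p_i|^3}{\sigma^3} \le \sum_{i=1}^n \frac{E(X_i - p_i)^2}{\sigma^3} = \frac{1}{\sigma}. \tag{7.9}
\]
Standardizing $S$, write $W = (S - \mu)/\sigma$, that is,
\[
W = \sum_{i=1}^n \xi_i \quad\text{where}\quad \xi_i = \frac{X_i - p_i}{\sigma}.
\]
With $I$ a random index with distribution $P(I = i) = \sigma_i^2/\sigma^2$, and $U$ a uniform variable on $[0, 1]$, each independent of the other and of the remaining variables, by Lemma 2.8 and (2.52), the variable
\[
W^* = W^{(I)} + \xi_I^* \quad\text{with}\quad \xi_i^* = \frac{U - p_i}{\sigma}
\]
has the $W$ zero bias distribution, where $W^{(i)} = W - \xi_i$. But now, by (7.6),
\[
\begin{aligned}
P\bigl(W^*_{\mu,\sigma^2} = k\bigr) &= P\Bigl(\frac{k - \mu}{\sigma} < W^* \le \frac{k + 1 - \mu}{\sigma}\Bigr)\\
&= P\bigl(k - \mu < S^{(I)} - \mu + p_I + U - p_I \le k + 1 - \mu\bigr)\\
&= P\bigl(k < S^{(I)} + U \le k + 1\bigr)\\
&= P\bigl(S^{(I)} = k\bigr),
\end{aligned}
\]
that is, $W^*_{\mu,\sigma^2} =_d S^{(I)}$. Now, by (7.8) of Corollary 7.1 and (7.9), we obtain
\[
\bigl\|\mathcal{L}(S^{(I)}) - \mathcal{L}(\bar{Z}_{\mu,\sigma^2})\bigr\|_{TV} \le \bigl(5 + 3\sqrt{\pi/8}\bigr)/\sigma. \tag{7.10}
\]
Applying Lemma 7.2 with $f(s) = 1(s = k)$ yields
\[
P(S = k) = P\bigl(S^{(I)} = k - 1\bigr) - \frac{1}{\sigma^2} E\Bigl[\sum_{i=1}^n (1 - p_i)(X_i - p_i)\,1(S = k)\Bigr],
\]
and in particular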
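\[
\begin{aligned}
\sum_{k=0}^{\infty} \bigl|P(S = k) - P\bigl(S^{(I)} + 1 = k\bigr)\bigr| &\le \frac{1}{\sigma^2} \sum_{k=0}^{\infty} E\Bigl[\Bigl|\sum_{i=1}^n (1 - p_i)(X_i - p_i)\Bigr|\,1(S = k)\Bigr]\\
&\le \frac{1}{\sigma^2} \Bigl(\sum_{i=1}^n (1 - p_i)^2 E(X_i - p_i)^2\Bigr)^{1/2}\\
&\le \frac{1}{\sigma}.
\end{aligned}
\]
Hence, summing and applying (7.3) we have
\[
\bigl\|\mathcal{L}(S) - \mathcal{L}\bigl(S^{(I)} + 1\bigr)\bigr\|_{TV} \le \frac{1}{2\sigma}. \tag{7.11}
\]
The proof is now completed by using (7.11), (7.10) and Lemma 7.1, along with the triangle inequality, to yield
\[
\begin{aligned}
\bigl\|\mathcal{L}(S) - \mathcal{L}(Z_{\mu,\sigma^2})\bigr\|_{TV} &\le \bigl\|\mathcal{L}(S) - \mathcal{L}\bigl(S^{(I)} + 1\bigr)\bigr\|_{TV} + \bigl\|\mathcal{L}\bigl(S^{(I)} + 1\bigr) - \mathcal{L}(\bar{Z}_{\mu,\sigma^2} + 1)\bigr\|_{TV}\\
&\quad + \bigl\|\mathcal{L}(\bar{Z}_{\mu,\sigma^2} + 1) - \mathcal{L}(Z_{\mu,\sigma^2})\bigr\|_{TV}\\
&\le \frac{1}{\sigma}\Bigl(\frac{1}{2} + 5 + 3\sqrt{\frac{\pi}{8}} + \frac{1}{2\sqrt{2\pi}}\Bigr) \le 7.6/\sigma.
\end{aligned}
\]
It remains to prove Lemma 7.2.

Proof of Lemma 7.2 Using the fact that $X_1, \ldots, X_n$ are independent indicator variables,
\[
\begin{aligned}
ESf(S) = E\sum_{i=1}^n X_i f(S) &= E\sum_{i=1}^n p_i f\bigl(S^{(i)} + 1\bigr)\\
&= E\sum_{i=1}^n p_i(1 - p_i) f\bigl(S^{(i)} + 1\bigr) + E\sum_{i=1}^n p_i^2 f\bigl(S^{(i)} + 1\bigr)\\
&= E\sum_{i=1}^n p_i(1 - p_i) f\bigl(S^{(i)} + 1\bigr) + E\sum_{i=1}^n p_i X_i f\bigl(S^{(i)} + 1\bigr)\\
&= \sigma^2 Ef\bigl(S^{(I)} + 1\bigr) + E\sum_{i=1}^n p_i X_i f(S),
\end{aligned}
\]
while also
\[
ESf(S) = E\bigl[(S - \mu)f(S) + \mu f(S)\bigr] = E(S - \mu)f(S) + E\sum_{i=1}^n p_i(1 - p_i)f(S) + E\sum_{i=1}^n p_i^2 f(S).
\]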
Equating these expressions yields
\[
\begin{aligned}
\sum_{i=1}^n p_i(1 - p_i) Ef(S) &= \sigma^2 Ef\bigl(S^{(I)} + 1\bigr) + E\sum_{i=1}^n p_i X_i f(S) - E(S - \mu)f(S) - \sum_{i=1}^n p_i^2 Ef(S)\\
&= \sigma^2 Ef\bigl(S^{(I)} + 1\bigr) - E\sum_{i=1}^n \bigl(-p_i X_i + X_i - p_i + p_i^2\bigr) f(S)\\
&= \sigma^2 Ef\bigl(S^{(I)} + 1\bigr) - E\Bigl[\sum_{i=1}^n (1 - p_i)(X_i - p_i)\Bigr] f(S),
\end{aligned}
\]
and now dividing by $\sigma^2$ yields the result.
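Since Lemma 7.2 is an exact identity, it can be verified for small $n$ by enumerating all outcomes of $(X_1, \ldots, X_n)$. The Python sketch below is an illustration written for this purpose (the function name and the test function $f$ are choices made here); it treats $I$ as independent of the $X_i$ with $P(I = i) = p_i(1 - p_i)/\sigma^2$ and works in exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def lemma_7_2_sides(ps, f):
    # Returns (E f(S), E f(S^(I) + 1) - sigma^{-2} E[sum_i (1-p_i)(X_i-p_i) f(S)]).
    n = len(ps)
    var = sum(p * (1 - p) for p in ps)          # sigma^2 = Var(S)
    lhs = shifted = correction = Fraction(0)
    for x in product((0, 1), repeat=n):
        prob = Fraction(1)
        for xi, p in zip(x, ps):
            prob *= p if xi else 1 - p          # P(X = x) by independence
        s = sum(x)
        lhs += prob * f(s)
        # Average f(S^(i) + 1) over the independent index I.
        for i in range(n):
            shifted += prob * ps[i] * (1 - ps[i]) / var * f(s - x[i] + 1)
        correction += prob * sum((1 - p) * (xi - p) for xi, p in zip(x, ps)) * f(s)
    return lhs, shifted - correction / var

ps = [Fraction(1, 3), Fraction(1, 2), Fraction(3, 4)]
lhs, rhs = lemma_7_2_sides(ps, lambda s: s ** 3 + 1)
```

With exact fractions the two returned values agree exactly, not merely to rounding error, for any choice of $f$ on $\{0, \ldots, n\}$.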
7.2 Sum of Independent Integer Valued Random Variables
We now consider the more general situation where $S$ is the sum of independent integer-valued random variables $X_1, \ldots, X_n$ with distribution
\[
P(X_i = k) = p_{ik} \quad\text{for } i = 1, \ldots, n \text{ and } k \in \mathbb{Z}.
\]
The following result plays the role of Corollary 7.1 in the more general setting.

Theorem 7.3 Let $X_1, \ldots, X_n$ be independent integer valued random variables with $EX_i = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2$, and finite third absolute central moments $\gamma_i = E|X_i - \mu_i|^3/\sigma^3$. Further, let $S = \sum_{i=1}^n X_i$, $\mu = \sum_{i=1}^n \mu_i$, $\sigma^2 = \sum_{i=1}^n \sigma_i^2$ and $\gamma = \sum_{i=1}^n \gamma_i$. Then for each $i \in \{1, 2, \ldots, n\}$, the values $\alpha_{il}$, $l \in \mathbb{Z}$, given by
\[
\alpha_{il} = \begin{cases} \sum_{k=l+1}^{\infty} (k - \mu_i)p_{ik}/\sigma_i^2, & l \ge 0,\\ \sum_{k=-\infty}^{l} (\mu_i - k)p_{ik}/\sigma_i^2, & l \le -1 \end{cases} \tag{7.12}
\]
are nonnegative and sum to one, and for $J(i)$ a random integer such that $J(1), J(2), \ldots, J(n), X_1, \ldots, X_n$ are independent with distribution
\[
P\bigl(J(i) = l\bigr) = \alpha_{il},
\]
when $I$ is a random index independent of $J(1), J(2), \ldots, J(n), X_1, \ldots, X_n$ with distribution
\[
P(I = i) = \frac{\sigma_i^2}{\sigma^2},
\]
we have
\[
\bigl\|\mathcal{L}\bigl(S^{(I)} + J(I)\bigr) - \mathcal{L}(\bar{Z}_{\mu,\sigma^2})\bigr\|_{TV} \le \bigl(5 + 3\sqrt{\pi/8}\bigr)\gamma,
\]
where $S^{(i)} = S - X_i$ for $i = 1, 2, \ldots, n$.

Proof Standardizing, write $W = (S - \mu)/\sigma$, so
\[
W = \sum_{i=1}^n \xi_i \quad\text{where}\quad \xi_i = \frac{X_i - \mu_i}{\sigma}.
\]
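Note that as $E(X_i - \mu_i) = 0$, for any integer $l$ we have
\[
\sum_{k=-\infty}^{l} (\mu_i - k)p_{ik} = -E(X_i - \mu_i)1(X_i \le l) = E(X_i - \mu_i)1(X_i \ge l + 1) = \sum_{k=l+1}^{\infty} (k - \mu_i)p_{ik}. \tag{7.13}
\]
For $i = 1, \ldots, n$ let $\xi_i^*$ have the $\xi_i$-zero biased distribution. From the formula (2.54) we see that the density of $\xi_i^*$ is constant over intervals of the form $((l - \mu_i)/\sigma, (l - \mu_i + 1)/\sigma)$ for any integer $l$, and so, again in view of (2.54), and applying (7.13) also, we have
\[
\begin{aligned}
P\bigl(l < \sigma\xi_i^* + \mu_i < l + 1\bigr) &= P\Bigl(\frac{l - \mu_i}{\sigma} < \xi_i^* < \frac{l - \mu_i + 1}{\sigma}\Bigr)\\
&= \frac{1}{\sigma} \frac{E\xi_i 1(\xi_i > (l - \mu_i)/\sigma)}{\sigma_i^2/\sigma^2}\\
&= \frac{1}{\sigma} \frac{\sum_{k=l+1}^{\infty} ((k - \mu_i)/\sigma) P(\xi_i = (k - \mu_i)/\sigma)}{\sigma_i^2/\sigma^2}\\
&= \alpha_{il}.
\end{aligned}
\]
Hence the sequence $\alpha_{il}$ is nonnegative, sums to one over $l \in \mathbb{Z}$ for every $i = 1, \ldots, n$, and
\[
P\bigl(l < \sigma\xi_i^* + \mu_i < l + 1\bigr) = P\bigl(J(i) = l\bigr). \tag{7.14}
\]
By (7.8) of Corollary 7.1,
\[
\bigl\|\mathcal{L}(W^*_{\mu,\sigma^2}) - \mathcal{L}(\bar{Z}_{\mu,\sigma^2})\bigr\|_{TV} \le \bigl(5 + 3\sqrt{\pi/8}\bigr)\gamma,
\]
where the discretized distributions $W^*_{\mu,\sigma^2}$ and $\bar{Z}_{\mu,\sigma^2}$ are given in (7.6) and (7.5), respectively. Hence, the theorem is proved upon noting that $W^* = W^{(I)} + \xi_I^*$ by Lemma 2.8, and therefore, by (7.14),
\[
\begin{aligned}
P\bigl(W^*_{\mu,\sigma^2} = k\bigr) &= P\Bigl(\frac{k - \mu}{\sigma} < W^{(I)} + \xi_I^* \le \frac{k + 1 - \mu}{\sigma}\Bigr)\\
&= P\bigl(k \le S^{(I)} + \sigma\xi_I^* + \mu_I < k + 1\bigr)\\
&= P\bigl(S^{(I)} + J(I) = k\bigr).
\end{aligned}
\]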
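The claim that the weights $\alpha_{il}$ of (7.12) form a probability distribution on $\mathbb{Z}$ can be checked directly for any finitely supported integer distribution. The sketch below is illustrative only; the function name and the example mass function are assumptions introduced here. It evaluates (7.12) with exact fractions.

```python
from fractions import Fraction

def alpha(pmf, l):
    # alpha_l from (7.12) for a single summand with mass function
    # pmf: dict mapping integer k -> P(X = k), with finite support.
    mu = sum(k * p for k, p in pmf.items())
    var = sum((k - mu) ** 2 * p for k, p in pmf.items())
    if l >= 0:
        return sum((k - mu) * p for k, p in pmf.items() if k >= l + 1) / var
    return sum((mu - k) * p for k, p in pmf.items() if k <= l) / var

# Hypothetical example distribution on {-2, 0, 1, 3}.
pmf = {-2: Fraction(1, 6), 0: Fraction(1, 2), 1: Fraction(1, 12), 3: Fraction(1, 4)}
weights = [alpha(pmf, l) for l in range(-5, 6)]   # zero outside -2..2

# For a Bernoulli(p) summand all the mass sits at l = 0, so J(i) = 0,
# matching the identity W*_{mu,sigma^2} =_d S^(I) in the proof of Theorem 7.1.
p = Fraction(3, 10)
bernoulli_alpha0 = alpha({0: 1 - p, 1: p}, 0)
```

The range $-5, \ldots, 5$ covers every $l$ for which $\alpha_l$ can be nonzero in this example, since the support lies in $\{-2, \ldots, 3\}$.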
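The next theorem gives a total variation bound between $\mathcal{L}(S)$ and $\mathcal{L}(Z_{\mu,\sigma^2})$.

Theorem 7.4 Let $X_1, \ldots, X_n$ be independent integer valued random variables with $EX_i = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2$, and finite third absolute central moments $\gamma_i = E|X_i - \mu_i|^3/\sigma^3$. Further, let $S = \sum_{i=1}^n X_i$, $\mu = \sum_{i=1}^n \mu_i$, $\sigma^2 = \sum_{i=1}^n \sigma_i^2$ and $\gamma = \sum_{i=1}^n \gamma_i$. Then
\[
\bigl\|\mathcal{L}(S) - \mathcal{L}(Z_{\mu,\sigma^2})\bigr\|_{TV} \le \frac{3}{2}\sigma \sum_{i=1}^n \Bigl(\gamma_i + \frac{2\sigma_i^2}{3\sigma^3}\Bigr) \bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV} + \bigl(5 + 3\sqrt{\pi/8}\bigr)\gamma + \frac{1}{2\sqrt{2\pi}\,\sigma}. \tag{7.15}
\]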
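Proof First, we compute a total variation bound between the distributions of $S$ and $S^{(I)} + J(I)$ as follows,
\[
\begin{aligned}
2\bigl\|\mathcal{L}(S) - \mathcal{L}\bigl(S^{(I)} + J(I)\bigr)\bigr\|_{TV} &= \sum_{k \in \mathbb{Z}} \bigl|P\bigl(S^{(I)} + X_I = k\bigr) - P\bigl(S^{(I)} + J(I) = k\bigr)\bigr|\\
&\le \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2} \sum_{k \in \mathbb{Z}} \bigl|P\bigl(S^{(i)} + X_i = k\bigr) - P\bigl(S^{(i)} + J(i) = k\bigr)\bigr|\\
&\le \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2} \sum_{k, k_1, k_2 \in \mathbb{Z}} \bigl|P\bigl(S^{(i)} + k_1 = k\bigr) - P\bigl(S^{(i)} + k_2 = k\bigr)\bigr| P\bigl(X_i = k_1, J(i) = k_2\bigr)\\
&\le \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2}\, 2\bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV} \sum_{k_1, k_2 \in \mathbb{Z}} |k_1 - k_2| P\bigl(X_i = k_1, J(i) = k_2\bigr)\\
&= \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2}\, 2\bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV}\, E\bigl|X_i - J(i)\bigr|.
\end{aligned} \tag{7.16}
\]
Now, by (7.14), with $\xi_i^*$ a variable with the $\xi_i$-zero bias distribution,
\[
\begin{aligned}
E\bigl|X_i - J(i)\bigr| &= \sigma E\Bigl|\xi_i - \frac{J(i) - \mu_i}{\sigma}\Bigr|\\
&\le \sigma E|\xi_i| + \sigma E\Bigl|\frac{J(i) - \mu_i}{\sigma}\Bigr|\\
&= \sigma E|\xi_i| + \sigma \sum_{l \in \mathbb{Z}} \Bigl|\frac{l - \mu_i}{\sigma}\Bigr| P\bigl(J(i) = l\bigr)\\
&= \sigma E|\xi_i| + \sigma E\sum_{l \in \mathbb{Z}} \Bigl|\frac{l - \mu_i}{\sigma}\Bigr| 1\Bigl(\frac{l - \mu_i}{\sigma} < \xi_i^* < \frac{l + 1 - \mu_i}{\sigma}\Bigr)\\
&\le \sigma E|\xi_i| + \sigma E\sum_{l \in \mathbb{Z}} \Bigl(|\xi_i^*| + \frac{1}{\sigma}\Bigr) 1\Bigl(\frac{l - \mu_i}{\sigma} < \xi_i^* < \frac{l + 1 - \mu_i}{\sigma}\Bigr)\\
&= \sigma E|\xi_i| + \sigma E|\xi_i^*| + 1,
\end{aligned}
\]
where we obtain the last inequality by noting that $(l - \mu_i)/\sigma < \xi_i^* < (l + 1 - \mu_i)/\sigma$ implies
\[
\frac{l - \mu_i}{\sigma} < \xi_i^* \quad\text{and}\quad -\frac{l - \mu_i}{\sigma} < -\xi_i^* + \frac{1}{\sigma},
\]
so that $|(l - \mu_i)/\sigma| \le |\xi_i^*| + 1/\sigma$. Hence, from (7.16), we have
\[
\begin{aligned}
2\bigl\|\mathcal{L}(S) - \mathcal{L}\bigl(S^{(I)} + J(I)\bigr)\bigr\|_{TV} &\le \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2}\bigl(\sigma E|\xi_i| + \sigma E|\xi_i^*| + 1\bigr)\, 2\bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV}\\
&\le 3\sigma \sum_{i=1}^n \gamma_i \bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV} + \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2}\, 2\bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV},
\end{aligned} \tag{7.17}
\]
where, for the final inequality, we note that
\[
\frac{\sigma_i^2}{\sigma^2} E|\xi_i| = E\xi_i^2\, E|\xi_i| \le \gamma_i,
\]
while from (2.57),
\[
\frac{\sigma_i^2}{\sigma^2} E|\xi_i^*| = \mathrm{Var}(\xi_i) E|\xi_i^*| = \frac{1}{2} E|\xi_i|^3 = \frac{1}{2}\gamma_i.
\]
The proof may now be completed by applying the bound (7.17), along with Theorem 7.3, Lemma 7.1 and the triangle inequality, to yield
\[
\bigl\|\mathcal{L}(S) - \mathcal{L}(Z_{\mu,\sigma^2})\bigr\|_{TV} \le \bigl\|\mathcal{L}(S) - \mathcal{L}\bigl(S^{(I)} + J(I)\bigr)\bigr\|_{TV} + \bigl\|\mathcal{L}\bigl(S^{(I)} + J(I)\bigr) - \mathcal{L}(\bar{Z}_{\mu,\sigma^2})\bigr\|_{TV} + \bigl\|\mathcal{L}(\bar{Z}_{\mu,\sigma^2}) - \mathcal{L}(Z_{\mu,\sigma^2})\bigr\|_{TV},
\]
and hence the terms in (7.15).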
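In order to apply Theorem 7.4, a bound on $\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\|_{TV}$ is required. If $\mathcal{L}(S^{(i)})$ is unimodal, then
\[
\begin{aligned}
\bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV} &\le 2\max_{k \in \mathbb{Z}} P\bigl(S^{(i)} = k\bigr)\\
&\le 2\max_{k \in \mathbb{Z}} \Bigl\{\Bigl|P\bigl(S^{(i)} \le k\bigr) - \Phi\Bigl(\frac{k - ES^{(i)}}{\sqrt{\sigma^2 - \sigma_i^2}}\Bigr)\Bigr| + \Bigl|P\bigl(S^{(i)} \le k - 1\bigr) - \Phi\Bigl(\frac{k - 1 - ES^{(i)}}{\sqrt{\sigma^2 - \sigma_i^2}}\Bigr)\Bigr|\\
&\qquad\qquad + \Bigl|\Phi\Bigl(\frac{k - ES^{(i)}}{\sqrt{\sigma^2 - \sigma_i^2}}\Bigr) - \Phi\Bigl(\frac{k - 1 - ES^{(i)}}{\sqrt{\sigma^2 - \sigma_i^2}}\Bigr)\Bigr|\Bigr\}\\
&\le 16.4 \sum_{j \ne i} \gamma_j \frac{\sigma^3}{(\sigma^2 - \sigma_i^2)^{3/2}} + \frac{2}{\sqrt{2\pi}\sqrt{\sigma^2 - \sigma_i^2}},
\end{aligned} \tag{7.18}
\]
applying the Berry–Esseen theorem, twice, with the constant 4.1 from Chen and Shao (2001). Recall that we say that the distribution $\mathcal{L}(X)$ is strongly unimodal if the convolution of $\mathcal{L}(X)$ and any unimodal distribution is unimodal. Theorem 4.8 in Dharmadhikari and Joag-Dev (1998) states that the distribution $\mathcal{L}(X)$ of an integer valued random variable $X$ is strongly unimodal if and only if
\[
P(X = k)^2 \ge P(X = k - 1)P(X = k + 1) \quad\text{for all integers } k.
\]
Clearly then, the Bernoulli distributions are strongly unimodal, and hence also the Poisson binomial. In particular, for $S$ the sum of indicators $X_1, \ldots, X_n$ as in Theorem 7.1, and $S^{(i)} = S - X_i$, by inequality (7.18) and the fact that $\gamma \le 1/\sigma$, from (7.9) in the proof of Theorem 7.1, we have
\[
\bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV} \le \frac{16.4\sigma^2}{(\sigma^2 - 1)^{3/2}} + \frac{2}{\sqrt{2\pi}\sqrt{\sigma^2 - 1}},
\]
which is of order $O(1/\sigma)$. Applying this inequality in the bound (7.15) of Theorem 7.4, we obtain a bound for $\|\mathcal{L}(S) - \mathcal{L}(Z_{\mu,\sigma^2})\|_{TV}$ of the same order as in Theorem 7.1. By Proposition 4.6 of Barbour and Xia (1999), for $S$ the sum of independent integer valued random variables,
\[
\max_{1 \le i \le n} \bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV} \le \Bigl(\sum_{j \ne i} \min\bigl\{1 - \bigl\|\mathcal{L}(X_j) - \mathcal{L}(X_j + 1)\bigr\|_{TV},\ 1/2\bigr\}\Bigr)^{-1/2},
\]
which is of order $1/\sqrt{n}$ if $\|\mathcal{L}(X_i) - \mathcal{L}(X_i + 1)\|_{TV} \le \alpha < 1$ for $i = 1, \ldots, n$. A slightly better bound, for this same case, is given by Corollary 1.6 of Mattner and Roos (2007), which states that
\[
\bigl\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\bigr\|_{TV} \le \sqrt{\frac{2}{\pi}} \Bigl(\frac{1}{4} + \sum_{j \ne i} \bigl(1 - \bigl\|\mathcal{L}(X_j) - \mathcal{L}(X_j + 1)\bigr\|_{TV}\bigr)\Bigr)^{-1/2}.
\]
Under the hypotheses of Theorem 7.1,
\[
1 - \bigl\|\mathcal{L}(X_j) - \mathcal{L}(X_j + 1)\bigr\|_{TV} = \frac{1 - |2p_j - 1|}{2} \ge \sigma_j^2,
\]
so $\|\mathcal{L}(S^{(i)}) - \mathcal{L}(S^{(i)} + 1)\|_{TV}$ has an upper bound of order $O(1/\sigma)$, which again leads to a bound on $\|\mathcal{L}(S) - \mathcal{L}(Z_{\mu,\sigma^2})\|_{TV}$ of the same order as in Theorem 7.1.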
Chapter 8
Non-uniform Bounds for Independent Random Variables
Throughout this chapter $\xi_1, \xi_2, \ldots, \xi_n$ will denote independent random variables with zero means satisfying $\sum_{i=1}^n E\xi_i^2 = 1$, and $W$ will be their sum $\sum_{i=1}^n \xi_i$. Our goal is to prove non-uniform Berry–Esseen bounds of the form
\[
\bigl|P(W \le z) - \Phi(z)\bigr| \le C\bigl(1 + |z|\bigr)^{-3}\gamma \quad\text{for all } z \in \mathbb{R}, \tag{8.1}
\]
where $C$ is an absolute constant and $\gamma = \sum_{i=1}^n E|\xi_i|^3$. Non-uniform bounds were first obtained by Esseen (1945) for i.i.d. random variables, and the bound was later improved to $CnE|\xi_1|^3$ by Nagaev (1965) for the identically distributed case. That (8.1) holds for independent random variables was proved by Bikjalis (1966) using Fourier methods. In this chapter, we follow Chen and Shao (2001, 2005) and approach (8.1) using Stein's method. To use ideas similar to those in the proof of Theorem 3.6 we first need to develop a non-uniform concentration inequality.

8.1 A Non-uniform Concentration Inequality
To prove our non-uniform concentration inequality we consider the truncated variables and the sums
\[
\bar{\xi}_i = \xi_i 1_{\{\xi_i \le 1\}}, \quad \bar{W} = \sum_{i=1}^n \bar{\xi}_i \quad\text{and}\quad \bar{W}^{(i)} = \bar{W} - \bar{\xi}_i \tag{8.2}
\]
and recall the quantities given in (3.5),
\[
\beta_2 = \sum_{i=1}^n E\xi_i^2 1_{\{|\xi_i| > 1\}} \quad\text{and}\quad \beta_3 = \sum_{i=1}^n E|\xi_i|^3 1_{\{|\xi_i| \le 1\}}. \tag{8.3}
\]
Proposition 8.1 For all real $a < b$ and $i = 1, \ldots, n$,
\[
P\bigl(a \le \bar{W}^{(i)} \le b\bigr) \le 6\bigl(\min(1, b - a) + \beta_2 + \beta_3\bigr)e^{-a/2}. \tag{8.4}
\]
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_8, © Springer-Verlag Berlin Heidelberg 2011
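We prove Proposition 8.1 using the Bennett–Hoeffding inequality.

Lemma 8.1 For some $\alpha > 0$ let $\eta_1, \ldots, \eta_n$ be independent random variables satisfying $E\eta_i \le 0$, $\eta_i \le \alpha$ for all $1 \le i \le n$, and $\sum_{i=1}^n E\eta_i^2 \le B^2$. Then, with $S_n = \sum_{i=1}^n \eta_i$,
\[
Ee^{tS_n} \le \exp\bigl(\alpha^{-2}\bigl(e^{t\alpha} - 1 - t\alpha\bigr)B^2\bigr) \quad\text{for } t > 0, \tag{8.5}
\]
\[
P(S_n \ge x) \le \exp\Bigl(-\frac{B^2}{\alpha^2}\Bigl[\Bigl(1 + \frac{\alpha x}{B^2}\Bigr)\log\Bigl(1 + \frac{\alpha x}{B^2}\Bigr) - \frac{\alpha x}{B^2}\Bigr]\Bigr), \tag{8.6}
\]
and
\[
P(S_n \ge x) \le \exp\Bigl(-\frac{x^2}{2(B^2 + \alpha x)}\Bigr) \tag{8.7}
\]
for $x > 0$.

Proof First, one may prove that $(e^s - 1 - s)/s^2$ is an increasing function for $s \in \mathbb{R}$, from which it follows that
\[
e^{ts} \le 1 + ts + (ts)^2\bigl(e^{t\alpha} - 1 - t\alpha\bigr)/(t\alpha)^2 \tag{8.8}
\]
for $s \le \alpha$, and $t > 0$. Using the properties of the $\eta_i$'s, for $t > 0$ we have
\[
\begin{aligned}
Ee^{tS_n} &= \prod_{i=1}^n Ee^{t\eta_i}\\
&\le \prod_{i=1}^n \bigl(1 + tE\eta_i + \alpha^{-2}\bigl(e^{t\alpha} - 1 - t\alpha\bigr)E\eta_i^2\bigr)\\
&\le \prod_{i=1}^n \bigl(1 + \alpha^{-2}\bigl(e^{t\alpha} - 1 - t\alpha\bigr)E\eta_i^2\bigr)\\
&\le \exp\bigl(\alpha^{-2}\bigl(e^{t\alpha} - 1 - t\alpha\bigr)B^2\bigr),
\end{aligned}
\]
proving (8.5). To prove inequality (8.6), with $x > 0$ let
\[
t = \frac{1}{\alpha}\log\Bigl(1 + \frac{\alpha x}{B^2}\Bigr).
\]
Then by (8.5),
\[
\begin{aligned}
P(S_n \ge x) \le e^{-tx}Ee^{tS_n} &\le \exp\bigl(-tx + \alpha^{-2}\bigl(e^{t\alpha} - 1 - t\alpha\bigr)B^2\bigr)\\
&= \exp\Bigl(-\frac{B^2}{\alpha^2}\Bigl[\Bigl(1 + \frac{\alpha x}{B^2}\Bigr)\log\Bigl(1 + \frac{\alpha x}{B^2}\Bigr) - \frac{\alpha x}{B^2}\Bigr]\Bigr),
\end{aligned}
\]
demonstrating (8.6). Lastly, in view of the fact that
\[
(1 + s)\log(1 + s) - s \ge \frac{s^2}{2(1 + s)}
\]
for $s > 0$, (8.7) now follows.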
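The relationship between the two tail bounds of Lemma 8.1 can be seen numerically. The Python sketch below is an illustration only, with names and parameters chosen here: it evaluates the right hand sides of (8.6) and (8.7) and compares them with the exact upper tail of a centered binomial sum, for which $\eta_i = X_i - p$ satisfies $E\eta_i = 0$, $\eta_i \le \alpha = 1 - p$ and $\sum_i E\eta_i^2 = B^2 = np(1 - p)$.

```python
from math import comb, exp, log

def bound_8_6(x, alpha, b2):
    # Right hand side of (8.6), with b2 standing for B^2.
    s = alpha * x / b2
    return exp(-(b2 / alpha ** 2) * ((1 + s) * log(1 + s) - s))

def bound_8_7(x, alpha, b2):
    # Right hand side of (8.7).
    return exp(-x * x / (2 * (b2 + alpha * x)))

def centered_binomial_tail(n, p, x):
    # Exact P(sum_i (X_i - p) >= x) for independent Bernoulli(p) variables X_i.
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if k - n * p >= x)

n, p = 40, 0.25                      # n * p = 10 is exact in floating point
alpha, b2 = 1 - p, n * p * (1 - p)
rows = [(x,
         centered_binomial_tail(n, p, x),
         bound_8_6(x, alpha, b2),
         bound_8_7(x, alpha, b2)) for x in range(1, 16)]
```

Each row lists $x$, the exact tail, and the two bounds; in every row the exact tail is at most the value from (8.6), which in turn is at most the value from (8.7), as the final step of the proof indicates.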
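Although the hypotheses of Lemma 8.1 require that $\eta_i$ be bounded above for all $i = 1, \ldots, n$, the following result shows how the lemma may nevertheless be applied to unbounded variables.

Lemma 8.2 Let $\eta_1, \eta_2, \ldots, \eta_n$ be independent random variables satisfying $E\eta_i \le 0$ for $1 \le i \le n$ and $\sum_{i=1}^n E\eta_i^2 \le B^2$. Then, with $S_n = \sum_{i=1}^n \eta_i$,
\[
P(S_n \ge x) \le P\Bigl(\max_{1 \le i \le n} \eta_i > \alpha\Bigr) + \exp\Bigl(-\frac{B^2}{\alpha^2}\Bigl[\Bigl(1 + \frac{\alpha x}{B^2}\Bigr)\log\Bigl(1 + \frac{\alpha x}{B^2}\Bigr) - \frac{\alpha x}{B^2}\Bigr]\Bigr), \tag{8.9}
\]
for all $\alpha > 0$ and $x > 0$. In particular,
\[
P(S_n \ge x) \le P\Bigl(\max_{1 \le i \le n} \eta_i > \frac{x \vee B}{p}\Bigr) + e^p\Bigl(1 + \frac{x^2}{pB^2}\Bigr)^{-p} \tag{8.10}
\]
for $x > 0$ and $p \ge 1$.

Proof Letting $\bar{\eta}_i = \eta_i 1_{\{\eta_i \le \alpha\}}$ we have
\[
P(S_n \ge x) \le P\Bigl(\max_{1 \le i \le n} \eta_i > \alpha\Bigr) + P\Bigl(\sum_{i=1}^n \bar{\eta}_i \ge x\Bigr).
\]
As $\bar{\eta}_i \le \alpha$, $E\bar{\eta}_i = E\eta_i - E\eta_i 1_{\{\eta_i > \alpha\}} \le 0$ and $\sum_{i=1}^n E\bar{\eta}_i^2 \le \sum_{i=1}^n E\eta_i^2 \le B^2$ we may now apply Lemma 8.1 to yield (8.9). Inequality (8.10) is trivial when $0 < x < B$ and $p \ge 1$, since then $e^p(1 + x^2/(pB^2))^{-p} \ge 1$. For $x > B$, taking $\alpha = x/p$ in the second term in (8.9) yields
\[
\exp\Bigl(-\frac{B^2}{\alpha^2}\Bigl[\Bigl(1 + \frac{\alpha x}{B^2}\Bigr)\log\Bigl(1 + \frac{\alpha x}{B^2}\Bigr) - \frac{\alpha x}{B^2}\Bigr]\Bigr) \le \exp\Bigl(-\frac{x}{\alpha}\log\Bigl(1 + \frac{\alpha x}{B^2}\Bigr) + p\Bigr) = e^p\Bigl(1 + \frac{x^2}{pB^2}\Bigr)^{-p}.
\]
This proves (8.10).

We now proceed to the proof of Proposition 8.1, the non-uniform concentration inequality.

Proof As $\bar{\xi}_i$, $i = 1, \ldots, n$, satisfy the hypotheses of Lemma 8.1 with $\alpha = 1$ and $B^2 = 1$, it follows from (8.5) with $t = 1/2$ that
\[
P\bigl(a \le \bar{W}^{(i)} \le b\bigr) \le P\bigl(0 \le (\bar{W}^{(i)} - a)/2\bigr) \le e^{-a/2}Ee^{\bar{W}^{(i)}/2} \le e^{-a/2}\exp\Bigl(e^{1/2} - \frac{3}{2}\Bigr) \le 1.2\,e^{-a/2}.
\]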
236
8
Non-uniform Bounds for Independent Random Variables
Thus, (8.4) holds if 6(β2 + β3 ) ≥ 1.2 or b − a ≥ 1. Hence, it suffices to prove the claim assuming that (β2 + β3 ) ≤ 1.2/6 = 0.2 and b − a < 1. Similar to the proof of the concentration inequality in Lemma 3.1, define δ = (β2 + β3 )/2 and set ⎧ if w < a − δ, ⎨0 w/2 f (w) = e (w − a + δ) if a − δ ≤ w ≤ b + δ, (8.11) ⎩ w/2 e (b − a + 2δ) if w > b + δ. Let ¯ (i) (t) = M¯ i (t) = ξi (1{−xi ¯ i ≤t≤0} − 1{0
M¯ j (t).
j =i
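Clearly, $\bar M^{(i)}(t) \ge 0$, $f'(w) \ge 0$ and $f'(w) \ge e^{w/2}$ for $a - \delta < w < b + \delta$. Using these inequalities and the independence of $\xi_j$ and $\bar W^{(i)} - \bar\xi_j$,
\[
\begin{aligned}
E\bigl[W^{(i)} f(\bar W^{(i)})\bigr]
&= \sum_{j \ne i} E\,\xi_j\bigl[f(\bar W^{(i)}) - f(\bar W^{(i)} - \bar\xi_j)\bigr]\\
&= \sum_{j \ne i} E \int_{-\infty}^{\infty} f'(\bar W^{(i)} + t)\,\bar M_j(t)\,dt\\
&= E \int_{-\infty}^{\infty} f'(\bar W^{(i)} + t)\,\bar M^{(i)}(t)\,dt\\
&\ge E\,1_{\{a \le \bar W^{(i)} \le b\}} \int_{|t| \le \delta} f'(\bar W^{(i)} + t)\,\bar M^{(i)}(t)\,dt\\
&\ge E\,e^{(\bar W^{(i)} - \delta)/2}\,1_{\{a \le \bar W^{(i)} \le b\}} \int_{|t| \le \delta} \bar M^{(i)}(t)\,dt\\
&= E\,e^{(\bar W^{(i)} - \delta)/2}\,1_{\{a \le \bar W^{(i)} \le b\}} \sum_{j \ne i} |\xi_j| \min(\delta, |\bar\xi_j|)\\
&\ge e^{-\delta/2}(H_1 - H_2),
\end{aligned} \tag{8.12}
\]
where
\[
H_1 = E\,e^{\bar W^{(i)}/2}\,1_{\{a \le \bar W^{(i)} \le b\}} \sum_{j \ne i} E|\xi_j| \min(\delta, |\bar\xi_j|)
\]
and
\[
H_2 = E\,e^{\bar W^{(i)}/2}\,\Bigl|\sum_{j \ne i}\bigl(|\xi_j| \min(\delta, |\bar\xi_j|) - E|\xi_j| \min(\delta, |\bar\xi_j|)\bigr)\Bigr|.
\]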
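Applying the inequality $\min(x, y) \ge y - y^2/(4x)$ for $x > 0$, $y > 0$, we obtain
\[
\begin{aligned}
\sum_{j \ne i} E|\xi_j| \min(\delta, |\bar\xi_j|)
&\ge \sum_{j \ne i} E|\xi_j| 1_{\{|\xi_j| \le 1\}} \min\bigl(\delta, |\xi_j| 1_{\{|\xi_j| \le 1\}}\bigr)\\
&\ge \sum_{j=1}^n E|\xi_j| 1_{\{|\xi_j| \le 1\}} \min\bigl(\delta, |\xi_j| 1_{\{|\xi_j| \le 1\}}\bigr) - \delta E|\xi_i| 1_{\{|\xi_i| \le 1\}}\\
&\ge \sum_{j=1}^n \Bigl\{E\xi_j^2 1_{\{|\xi_j| \le 1\}} - \frac{E|\xi_j|^3 1_{\{|\xi_j| \le 1\}}}{4\delta}\Bigr\} - \delta\beta_3^{1/3}\\
&= 1 - \frac{4\delta\beta_2 + \beta_3}{4\delta} - \delta\beta_3^{1/3}\\
&\ge 0.5 - 0.1(0.2)^{1/3} \ge 0.44,
\end{aligned} \tag{8.13}
\]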
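where we have used $\delta = (\beta_2 + \beta_3)/2$, $\delta \le 0.1$ and $\beta_2 + \beta_3 \le 0.2$ in the final inequality. Hence
\[
H_1 \ge 0.44\, e^{a/2} P\bigl(a \le \bar W^{(i)} \le b\bigr). \tag{8.14}
\]
Turning now to $H_2$, by the Bennett–Hoeffding inequality (8.5) with $\alpha = t = B = 1$,
\[
E e^{\bar W^{(i)}} \le \exp(e - 2). \tag{8.15}
\]
Hence, by the Cauchy–Schwarz inequality,
\[
\begin{aligned}
H_2 &\le \bigl(E e^{\bar W^{(i)}}\bigr)^{1/2} \Bigl(\mathrm{Var}\Bigl(\sum_{j \ne i} |\xi_j| \min(\delta, |\bar\xi_j|)\Bigr)\Bigr)^{1/2}\\
&\le \exp(e/2 - 1) \Bigl(\sum_{j \ne i} E\xi_j^2 \min(\delta, |\bar\xi_j|)^2\Bigr)^{1/2}\\
&\le \exp(e/2 - 1)\,\delta\,\Bigl(\sum_{j \ne i} E\xi_j^2\Bigr)^{1/2}\\
&\le \exp(e/2 - 1)\,\delta \le 1.44\,\delta.
\end{aligned} \tag{8.16}
\]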
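As to the left hand side of (8.12), we have
\[
\begin{aligned}
E\bigl[W^{(i)} f(\bar W^{(i)})\bigr] &\le (b - a + 2\delta)\,E\bigl|W^{(i)}\bigr| e^{\bar W^{(i)}/2}\\
&\le (b - a + 2\delta)\bigl(E(W^{(i)})^2\bigr)^{1/2}\bigl(E e^{\bar W^{(i)}}\bigr)^{1/2}\\
&\le (b - a + 2\delta)\exp(e/2 - 1) \le 1.44(b - a + 2\delta).
\end{aligned}
\]
Combining the above inequalities and applying the bound $\delta \le 0.1$ yields
\[
\begin{aligned}
P\bigl(a \le \bar W^{(i)} \le b\bigr) &\le e^{-a/2} e^{\delta/2}\bigl(1.44(b - a + 2\delta) + 1.44\delta\bigr)/0.44\\
&\le e^{-a/2}\bigl(4(b - a) + 12\delta\bigr) = e^{-a/2}\bigl(4(b - a) + 6(\beta_2 + \beta_3)\bigr),
\end{aligned}
\]
completing the proof of (8.4).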
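The numerical constants appearing in (8.13), (8.16) and in the final combination can be verified with a few lines of arithmetic. The sketch below is only a sanity check under the standing assumptions $\delta \le 0.1$ and $\beta_2 + \beta_3 \le 0.2$; the variable names are illustrative and do not come from the text.

```python
import math

delta_max = 0.1        # delta = (beta2 + beta3)/2 <= 0.1
beta_sum_max = 0.2     # beta2 + beta3 <= 0.2, so beta3^(1/3) <= 0.2^(1/3)

# Last line of (8.13): 0.5 - delta * beta3^(1/3) at the worst case.
lower_813 = 0.5 - delta_max * beta_sum_max ** (1.0 / 3.0)

# Constant exp(e/2 - 1) from (8.15)-(8.16), rounded up to 1.44 in the text.
cs_const = math.exp(math.e / 2 - 1)

# Factor multiplying (b - a) in the final bound: e^{delta/2} * 1.44 / 0.44.
ratio = math.exp(delta_max / 2) * 1.44 / 0.44
```

With these values, `ratio` bounds the coefficient of $b - a$ and `3 * ratio` bounds the coefficient of $\delta$, which is where the constants 4 and 12 in the last display come from.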
8.2 Non-uniform Berry–Esseen Bounds

We begin our development of non-uniform bounds with the following lemma.
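Lemma 8.3 Let $\xi_1, \ldots, \xi_n$ be independent random variables with mean zero satisfying $\sum_{i=1}^n \mathrm{Var}(\xi_i) = 1$ and let $W = \sum_{i=1}^n \xi_i$. Then for $z \ge 2$ and $p \ge 2$,
\[
P\Bigl(W \ge z, \max_{1 \le i \le n} \xi_i > 1\Bigr) \le 2 \sum_{1 \le i \le n} P\Bigl(|\xi_i| > \frac{z}{2p}\Bigr) + e^p\bigl(1 + z^2/(4p)\bigr)^{-p}\beta_2
\]
whenever $\beta_2$, given by (8.3), is bounded by 1.

Proof Beginning with the left hand side, we have
\[
\begin{aligned}
P\Bigl(W \ge z, \max_{1 \le i \le n} \xi_i > 1\Bigr)
&\le \sum_{i=1}^n P(W \ge z, \xi_i > 1)\\
&\le P\Bigl(\max_{1 \le i \le n} \xi_i > z/2\Bigr) + \sum_{i=1}^n P\bigl(W^{(i)} \ge z/2, \xi_i > 1\bigr)\\
&\le \sum_{i=1}^n P\bigl(|\xi_i| > z/p\bigr) + \sum_{i=1}^n P\bigl(W^{(i)} \ge z/2\bigr) P(\xi_i > 1)\\
&\le \sum_{i=1}^n P\bigl(|\xi_i| > z/p\bigr) + \sum_{i=1}^n \Bigl[P\Bigl(\max_{1 \le j \le n} |\xi_j| > z/(2p)\Bigr) + e^p\bigl(1 + z^2/(4p)\bigr)^{-p}\Bigr] P(\xi_i > 1) \quad \text{[by (8.10)]}\\
&\le \sum_{i=1}^n P\bigl(|\xi_i| > z/p\bigr) + P\Bigl(\max_{1 \le i \le n} |\xi_i| > z/(2p)\Bigr) + e^p\bigl(1 + z^2/(4p)\bigr)^{-p}\beta_2\\
&\le 2 \sum_{1 \le i \le n} P\bigl(|\xi_i| > z/(2p)\bigr) + e^p\bigl(1 + z^2/(4p)\bigr)^{-p}\beta_2,
\end{aligned} \tag{8.17}
\]
since $\sum_{i=1}^n P(\xi_i > 1) \le \beta_2 \le 1$.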
We are now ready to prove the following non-uniform Berry–Esseen inequality, generalizing (8.1).

Theorem 8.1 For every $p \ge 2$ there exists a finite constant $C_p$ depending only on $p$ such that for all $z \in \mathbb{R}$
\[
\bigl|P(W \le z) - \Phi(z)\bigr| \le 2 \sum_{i=1}^n P\Bigl(|\xi_i| > \frac{1 \vee |z|}{2p}\Bigr) + C_p\bigl(1 + |z|\bigr)^{-p}(\beta_2 + \beta_3), \tag{8.18}
\]
where $\beta_2$ and $\beta_3$ are given in (8.3).
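Inequality (8.1) follows from Theorem 8.1 using $(1 + |z|)/2 \le 1 \vee |z|$, and that we may bound the first sum by
\[
\sum_{i=1}^n P\Bigl(|\xi_i| > \frac{1 + |z|}{4p}\Bigr) \le \Bigl(\frac{4p}{1 + |z|}\Bigr)^p \sum_{i=1}^n E|\xi_i|^p,
\]
and the remaining terms, since $p \ge 2$, using
\[
\beta_2 = \sum_{i=1}^n E\xi_i^2 1_{\{|\xi_i| > 1\}} \le \sum_{i=1}^n E|\xi_i|^p 1_{\{|\xi_i| > 1\}} \le \sum_{i=1}^n E|\xi_i|^p,
\]
and
\[
\beta_3 = \sum_{i=1}^n E|\xi_i|^3 1_{\{|\xi_i| \le 1\}} \le \sum_{i=1}^n E|\xi_i|^{3 \wedge p} 1_{\{|\xi_i| \le 1\}} \le \sum_{i=1}^n E|\xi_i|^{3 \wedge p}.
\]
Hence, Theorem 8.1 implies that there exists $C_p$ such that
\[
\bigl|P(W \le z) - \Phi(z)\bigr| \le C_p\bigl(1 + |z|\bigr)^{-p} \sum_{i=1}^n \bigl(E|\xi_i|^p + E|\xi_i|^{3 \wedge p}\bigr). \tag{8.19}
\]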
Proof By replacing $W$ by $-W$ it suffices to consider $z \ge 0$. As $W$ is the sum of independent variables with mean zero and $\mathrm{Var}(W) \le 1$, by (8.10) with $B = 1$, for all $p \ge 2$ we obtain
\[
P(W > z) \le P\Bigl(\max_{1 \le i \le n} |\xi_i| > \frac{z \vee 1}{p}\Bigr) + e^p\Bigl(1 + \frac{z^2}{p}\Bigr)^{-p}.
\]
Thus (8.18) holds if $\beta_2 + \beta_3 \ge 1$, and it suffices to prove the claim assuming that $\beta_2 + \beta_3 < 1$. We may also assume the lower bound $z \ge 2$ holds, as the fact that (8.18) holds for $z$ over any bounded range follows from the uniform bound (3.31) by choosing $C_p$ sufficiently large.

Let $\bar\xi_i$, $\bar W$ and $\bar W^{(i)}$ be defined as in (8.2). We first prove that $P(W > z)$ and $P(\bar W > z)$ are close and then prove a non-uniform bound for $\bar W$. Observing that
\[
\{W > z\} = \Bigl\{W > z, \max_{1 \le i \le n} \xi_i > 1\Bigr\} \cup \Bigl\{W > z, \max_{1 \le i \le n} \xi_i \le 1\Bigr\}
\subset \Bigl\{W > z, \max_{1 \le i \le n} \xi_i > 1\Bigr\} \cup \{\bar W > z\}, \tag{8.20}
\]
we obtain the upper bound
\[
P(W > z) \le P(\bar W > z) + P\Bigl(W > z, \max_{1 \le i \le n} \xi_i > 1\Bigr). \tag{8.21}
\]
Clearly $W \ge \bar W$, yielding the lower bound,
\[
P(\bar W > z) \le P(W > z). \tag{8.22}
\]
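Hence, in view of Lemma 8.3, to prove (8.18) it suffices to show that
\[
\bigl|P(\bar W \le z) - \Phi(z)\bigr| \le C e^{-z/2}(\beta_2 + \beta_3) \tag{8.23}
\]
for some absolute constant $C$. For $z \in \mathbb{R}$ let $f_z$ be the solution to the Stein equation (2.2), and define
\[
\bar K_i(t) = E\bigl[\bar\xi_i\bigl(1_{\{0 \le t \le \bar\xi_i\}} - 1_{\{\bar\xi_i \le t < 0\}}\bigr)\bigr].
\]
Reasoning as in the proof of (2.27), noting here that $\bar\xi_i \le 1$, and that $E\bar\xi_i \le 0$ and does not equal zero in general, using the independence of $\bar\xi_i$ and $\bar W^{(i)}$, we obtain
\[
E\bigl[\bar W f_z(\bar W)\bigr] = \sum_{i=1}^n \int_{-\infty}^1 E f_z'\bigl(\bar W^{(i)} + t\bigr)\bar K_i(t)\,dt + \sum_{i=1}^n E\bar\xi_i\, E f_z\bigl(\bar W^{(i)}\bigr).
\]
Additionally, from
\[
\sum_{i=1}^n \int_{-\infty}^1 \bar K_i(t)\,dt = \sum_{i=1}^n E\bar\xi_i^2 = 1 - \sum_{i=1}^n E\xi_i^2 1_{\{\xi_i > 1\}}, \tag{8.24}
\]
recalling $E\xi_i = 0$ we have
\[
\begin{aligned}
P(\bar W \le z) - \Phi(z) &= E f_z'(\bar W) - E\bar W f_z(\bar W)\\
&= \sum_{i=1}^n E\bigl[\xi_i^2 1_{\{\xi_i > 1\}}\bigr] E f_z'(\bar W)\\
&\quad + \sum_{i=1}^n \int_{-\infty}^1 E\bigl[f_z'\bigl(\bar W^{(i)} + \bar\xi_i\bigr) - f_z'\bigl(\bar W^{(i)} + t\bigr)\bigr]\bar K_i(t)\,dt\\
&\quad + \sum_{i=1}^n E\bigl\{\xi_i 1_{\{\xi_i > 1\}}\bigr\} E f_z\bigl(\bar W^{(i)}\bigr)\\
&:= R_1 + R_2 + R_3.
\end{aligned} \tag{8.25}
\]
By (2.80), (2.8) and (8.5),
\[
\begin{aligned}
E\bigl|f_z'(\bar W)\bigr| &= E\bigl|f_z'(\bar W)\bigr| 1_{\{\bar W \le z/2\}} + E\bigl|f_z'(\bar W)\bigr| 1_{\{\bar W > z/2\}}\\
&\le \bigl(\sqrt{2\pi}(z/2)e^{z^2/8} + 1\bigr)\bigl(1 - \Phi(z)\bigr) + P(\bar W > z/2)\\
&\le \bigl(\sqrt{2\pi}(z/2)e^{z^2/8} + 1\bigr)\bigl(1 - \Phi(z)\bigr) + e^{-z/2} E e^{\bar W}\\
&\le C e^{-z/2}
\end{aligned}
\]
by a standard tail bound on $1 - \Phi(z)$, and hence
\[
|R_1| \le C\beta_2 e^{-z/2}. \tag{8.26}
\]
Similarly, using (2.3) we obtain $E f_z(\bar W^{(i)}) \le C e^{-z/2}$ and
\[
|R_3| \le C\beta_2 e^{-z/2}. \tag{8.27}
\]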
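To estimate $R_2$, use (2.2) to write $R_2 = R_{2,1} + R_{2,2}$, where
\[
R_{2,1} = \sum_{i=1}^n \int_{-\infty}^1 E\bigl(1_{\{\bar W^{(i)} + \bar\xi_i \le z\}} - 1_{\{\bar W^{(i)} + t \le z\}}\bigr)\bar K_i(t)\,dt
\]
and
\[
R_{2,2} = \sum_{i=1}^n \int_{-\infty}^1 E\bigl[\bigl(\bar W^{(i)} + \bar\xi_i\bigr) f_z\bigl(\bar W^{(i)} + \bar\xi_i\bigr) - \bigl(\bar W^{(i)} + t\bigr) f_z\bigl(\bar W^{(i)} + t\bigr)\bigr]\bar K_i(t)\,dt.
\]
By Proposition 8.1, with $C$ not necessarily the same at each occurrence,
\[
\begin{aligned}
R_{2,1} &\le \sum_{i=1}^n \int_{-\infty}^1 E\bigl[1_{\{\bar\xi_i \le t\}} P\bigl(z - t < \bar W^{(i)} \le z - \bar\xi_i \,\big|\, \bar\xi_i\bigr)\bigr]\bar K_i(t)\,dt\\
&\le C \sum_{i=1}^n \int_{-\infty}^1 e^{-(z - t)/2} E\bigl(\min(1, |\bar\xi_i| + |t|) + \beta_2 + \beta_3\bigr)\bar K_i(t)\,dt\\
&\le C e^{-z/2} \sum_{i=1}^n \int_{-\infty}^1 E\bigl(\min(1, |\bar\xi_i| + |t|) + \beta_2 + \beta_3\bigr)\bar K_i(t)\,dt.
\end{aligned}
\]
From (8.24),
\[
C e^{-z/2} \sum_{i=1}^n \int_{-\infty}^1 (\beta_2 + \beta_3)\bar K_i(t)\,dt \le C e^{-z/2}(\beta_2 + \beta_3).
\]
Hence, to prove
\[
R_{2,1} \le C e^{-z/2}(\beta_2 + \beta_3) \tag{8.28}
\]
it suffices to show that
\[
\sum_{i=1}^n \int_{-\infty}^1 E\min(1, |\bar\xi_i| + |t|)\,\bar K_i(t)\,dt \le C(\beta_2 + \beta_3). \tag{8.29}
\]
As $1_{\{0 \le t \le \bar\xi_i\}} + 1_{\{\bar\xi_i \le t < 0\}} \le 1_{\{|t| \le |\bar\xi_i|\}}$ we have
\[
\bar K_i(t) \le E\bigl[|\bar\xi_i| 1_{\{|t| \le |\bar\xi_i|\}}\bigr].
\]
Since both $\min(1, |\bar\xi_i| + |t|)$ and $|\bar\xi_i| 1_{\{|t| \le |\bar\xi_i|\}}$ are increasing functions of $|\bar\xi_i|$ they are positively correlated, therefore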
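\[
\begin{aligned}
E\min(1, |\bar\xi_i| + |t|)\,\bar K_i(t) &\le E\min(1, |\bar\xi_i| + |t|)\; E\bigl[|\bar\xi_i| 1_{\{|t| \le |\bar\xi_i|\}}\bigr]\\
&\le E\bigl[\min(1, |\bar\xi_i| + |t|)\,|\bar\xi_i| 1_{\{|t| \le |\bar\xi_i|\}}\bigr]\\
&\le 2E\bigl[\min(1, |\bar\xi_i|)\,|\bar\xi_i| 1_{\{|t| \le |\bar\xi_i|\}}\bigr].
\end{aligned}
\]
Hence
\[
\begin{aligned}
\sum_{i=1}^n \int_{-\infty}^1 E\min(1, |\bar\xi_i| + |t|)\,\bar K_i(t)\,dt
&\le 4 \sum_{i=1}^n E\min(1, |\bar\xi_i|)\,|\bar\xi_i|^2\\
&\le 4 \sum_{i=1}^n E\min(1, |\xi_i|)\,|\xi_i|^2\\
&= 4\Bigl(\sum_{i=1}^n E\xi_i^2 1_{\{|\xi_i| > 1\}} + \sum_{i=1}^n E|\xi_i|^3 1_{\{|\xi_i| \le 1\}}\Bigr)\\
&= 4(\beta_2 + \beta_3),
\end{aligned}
\]
proving (8.29), and therefore (8.28). Similarly we may show $R_{2,1} \ge -C e^{-z/2}(\beta_2 + \beta_3)$.

Using Lemma 8.4 below for the second inequality, by the monotonicity of $w f_z(w)$ provided by (2.6), it follows that
\[
\begin{aligned}
R_{2,2} &\le \sum_{i=1}^n \int_{-\infty}^1 E\Bigl[1_{\{t \le \bar\xi_i\}}\Bigl(E\bigl[\bigl(\bar W^{(i)} + \bar\xi_i\bigr) f_z\bigl(\bar W^{(i)} + \bar\xi_i\bigr) \,\big|\, \bar\xi_i\bigr] - E\bigl[\bigl(\bar W^{(i)} + t\bigr) f_z\bigl(\bar W^{(i)} + t\bigr)\bigr]\Bigr)\Bigr]\bar K_i(t)\,dt\\
&\le C e^{-z/2} \sum_{i=1}^n \int_{-\infty}^1 E\min(1, |\bar\xi_i| + |t|)\,\bar K_i(t)\,dt\\
&\le C e^{-z/2}(\beta_2 + \beta_3),
\end{aligned} \tag{8.30}
\]
where we have applied (8.29) for the last inequality. Therefore
\[
R_2 \le C e^{-z/2}(\beta_2 + \beta_3). \tag{8.31}
\]
Similarly, we may demonstrate the lower bound $R_2 \ge -C e^{-z/2}(\beta_2 + \beta_3)$, thus proving the theorem.

It remains to prove the following lemma.

Lemma 8.4 With $\bar W^{(i)}$ as in (8.2) and $f_z$ the solution to the Stein equation (2.2) for $z > 0$, for all $s \le t \le 1$
\[
E\bigl[\bigl(\bar W^{(i)} + t\bigr) f_z\bigl(\bar W^{(i)} + t\bigr) - \bigl(\bar W^{(i)} + s\bigr) f_z\bigl(\bar W^{(i)} + s\bigr)\bigr] \le C e^{-z/2}\min(1, |s| + |t|). \tag{8.32}
\]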
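Proof Let $g(w) = (w f_z(w))'$. Then
\[
E\bigl[\bigl(\bar W^{(i)} + t\bigr) f_z\bigl(\bar W^{(i)} + t\bigr) - \bigl(\bar W^{(i)} + s\bigr) f_z\bigl(\bar W^{(i)} + s\bigr)\bigr] = \int_s^t E g\bigl(\bar W^{(i)} + u\bigr)\,du. \tag{8.33}
\]
From the formula (2.81) for $g(w)$ and the bound
\[
\sqrt{2\pi}\bigl(1 + w^2\bigr)e^{w^2/2}\Phi(w) + w \le \frac{2}{1 + |w|^3} \quad \text{for } w \le 0, \tag{8.34}
\]
from (5.4) of Chen and Shao (2001), we obtain the $w \le 0$ case of the first inequality in
\[
g(w) \le \begin{cases}
\dfrac{4(1 + z^2)(1 + z^3)}{1 + |w|^3}\, e^{z^2/8}\bigl(1 - \Phi(z)\bigr) & \text{if } w \le z/2,\\[2mm]
8(1 + z^2)\, e^{z^2/2}\bigl(1 - \Phi(z)\bigr) & \text{if } z/2 < w < z \text{ or } w > z.
\end{cases}
\]
For $0 < w \le z/2$ we apply the simpler inequality
\[
\sqrt{2\pi}\bigl(1 + w^2\bigr)e^{w^2/2}\Phi(w) + w \le 3\bigl(1 + z^2\bigr)e^{z^2/8} + z \le 4\bigl(1 + z^2\bigr)e^{z^2/8},
\]
and the same reasoning yields the case $z/2 \le w < z$. For $w > z$ we apply (8.34) with $-w$ replacing $w$, and the inequality
\[
1 - \Phi(z) \ge \frac{e^{-z^2/2}}{4(1 + z^2)} \quad \text{for } z > 0.
\]
Hence, for any $u \in [s, t]$, we have
\[
\begin{aligned}
E g\bigl(\bar W^{(i)} + u\bigr) &= E\bigl\{g\bigl(\bar W^{(i)} + u\bigr) 1_{\{\bar W^{(i)} + u \le z/2\}}\bigr\} + E\bigl\{g\bigl(\bar W^{(i)} + u\bigr) 1_{\{\bar W^{(i)} + u > z/2\}}\bigr\}\\
&\le E\Bigl[\frac{4(1 + z^2)(1 + z^3)}{1 + |\bar W^{(i)} + u|^3}\Bigr] e^{z^2/8}\bigl(1 - \Phi(z)\bigr) + 8\bigl(1 + z^2\bigr)e^{z^2/2}\bigl(1 - \Phi(z)\bigr) P\bigl(\bar W^{(i)} + u > z/2\bigr)\\
&\le C e^{-z/2} E\bigl(1 + |\bar W^{(i)} + u|^3\bigr)^{-1} + C(1 + z)e^{-z + 2u} E e^{2\bar W^{(i)}}\\
&\le C e^{-z/2} E\bigl(1 + |\bar W^{(i)} + u|^3\bigr)^{-1} + C(1 + z)e^{-z + 2u} \quad \text{[by (8.5)]}\\
&\le C e^{-z/2}
\end{aligned} \tag{8.35}
\]
since $u \le t \le 1$. Hence, for all $s \le t \le 1$ we have
\[
\int_s^t E g\bigl(\bar W^{(i)} + u\bigr)\,du \le C e^{-z/2}(t - s) \le C e^{-z/2}\bigl(|t| + |s|\bigr), \tag{8.36}
\]
while also, now using that $g(w) \ge 0$ by (2.6), from (8.35),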
\[
\begin{aligned}
\int_s^t E g\bigl(\bar W^{(i)} + u\bigr)\,du &\le \int_{-\infty}^1 E g\bigl(\bar W^{(i)} + u\bigr)\,du\\
&\le C e^{-z/2} E \int_{-\infty}^1 \frac{1}{1 + |\bar W^{(i)} + u|^3}\,du + C e^{-z/2} \int_{-\infty}^1 e^{2u}\,du\\
&\le C e^{-z/2}.
\end{aligned} \tag{8.37}
\]
Using (8.36) and (8.37) in (8.33) now completes the proof.
Chapter 9
Uniform and Non-uniform Bounds Under Local Dependence
In this chapter we continue the study of Stein's method under the types of local dependence that were first considered in Sect. 4.7 for the $L^1$ distance, and also in Sect. 6.2 for the $L^\infty$ distance. We follow the work of Chen and Shao (2004), with the aim of establishing both uniform and non-uniform Berry–Esseen bounds having optimal asymptotic rates under various local dependence conditions.

Throughout this chapter, $J$ will denote an index set of cardinality $n$ and $\{\xi_i, i \in J\}$ a random field, that is, an indexed collection of random variables, with zero means and finite variances. Define $W = \sum_{i \in J} \xi_i$ and assume that $\mathrm{Var}(W) = 1$. For $A \subset J$, let $\xi_A = \{\xi_i, i \in A\}$, $A^c = \{j \in J : j \notin A\}$ and $|A|$ the cardinality of $A$.

We introduce the following four local dependence conditions, the first two of which appeared in Sect. 4.7. In each, the set $A_i$ can be thought of as a neighborhood of dependence for $\xi_i$.

(LD1) For each $i \in J$ there exists $A_i \subset J$ such that $\xi_i$ and $\xi_{A_i^c}$ are independent.
(LD2) For each $i \in J$ there exist $A_i \subset B_i \subset J$ such that $\xi_i$ is independent of $\xi_{A_i^c}$ and $\xi_{A_i}$ is independent of $\xi_{B_i^c}$.
(LD3) For each $i \in J$ there exist $A_i \subset B_i \subset C_i \subset J$ such that $\xi_i$ is independent of $\xi_{A_i^c}$, $\xi_{A_i}$ is independent of $\xi_{B_i^c}$, and $\xi_{B_i}$ is independent of $\xi_{C_i^c}$.
(LD4$^*$) For each $i \in J$ there exist $A_i \subset B_i \subset B_i^* \subset C_i^* \subset D_i^* \subset J$ such that $\xi_i$ is independent of $\xi_{A_i^c}$, $\xi_{A_i}$ is independent of $\xi_{B_i^c}$, $\xi_{A_i}$ is independent of $\{\xi_{A_j}, j \in B_i^{*c}\}$, $\{\xi_{A_j}, j \in B_i^*\}$ is independent of $\{\xi_{A_j}, j \in C_i^{*c}\}$, and $\{\xi_{A_j}, j \in C_i^*\}$ is independent of $\{\xi_{A_j}, j \in D_i^{*c}\}$.

It is clear that each condition is implied by the one that follows it, that is, that (LD4$^*$) $\Rightarrow$ (LD3) $\Rightarrow$ (LD2) $\Rightarrow$ (LD1). Roughly speaking, (LD4$^*$) is a version of (LD3) obtained by considering $\{\xi_{A_i}, i \in J\}$ as the basic random variables in the field.
Though the conditions listed are increasingly more restrictive, in many cases the weakest one, (LD1), actually implies that (LD2), (LD3) or (LD4$^*$) hold upon taking $B_i = \bigcup_{j \in A_i} A_j$, $C_i = \bigcup_{j \in B_i} A_j$, $B_i^* = \bigcup_{j \in A_i} B_j$, $C_i^* = \bigcup_{j \in B_i^*} B_j$ and $D_i^* = \bigcup_{j \in C_i^*} B_j$. For example, (LD1) implies (LD4$^*$) when $\{\xi_i, i \in J\}$ is the $m$-dependent random field considered at the end of the next section. We note that
Bulinski and Suquet (2002) obtain results for random fields having both negative and positive dependence by Stein’s method.
9.1 Uniform and Non-uniform Berry–Esseen Bounds

We first present a general uniform Berry–Esseen bound under assumption (LD2). Recall that $|J| = n$.

Theorem 9.1 Let $p \in (2, 4]$ and assume that there exists some $\kappa$ such that (LD2) is satisfied with $|N(B_i)| \le \kappa$ for all $i \in J$, where $N(B_i) = \{j \in J : B_j \cap B_i \ne \emptyset\}$. Then
\[
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le (13 + 11\kappa) \sum_{i \in J} \bigl(E|\xi_i|^{3 \wedge p} + E|\eta_i|^{3 \wedge p}\bigr) + 2.5\Bigl(\kappa \sum_{i \in J} \bigl(E|\xi_i|^p + E|\eta_i|^p\bigr)\Bigr)^{1/2},
\]
where $\eta_i = \sum_{j \in A_i} \xi_j$. In particular, if there is some $\theta > 0$ such that $E|\xi_i|^p + E|\eta_i|^p \le \theta^p$ for all $i \in J$, then
\[
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le (13 + 11\kappa)\, n\theta^{3 \wedge p} + 2.5\,\theta^{p/2}\sqrt{\kappa n}.
\]
In typical asymptotic regimes $\kappa$ is bounded and $\theta$ is of order $n^{-1/2}$, yielding the order $\kappa n\theta^{3 \wedge p} + \theta^{p/2}\sqrt{\kappa n} = O(n^{-(p-2)/4})$. When fourth moments exist we may take $p = 4$ and obtain the best possible order of $n^{-1/2}$. Assuming the stronger local dependence condition (LD3) allows us to relax the moment assumptions.

Theorem 9.2 Let $p \in (2, 3]$ and assume that there exists some $\kappa$ such that (LD3) is satisfied with $|N(C_i)| \le \kappa$ for all $i \in J$, where $N(C_i) = \{j \in J : C_i \cap B_j \ne \emptyset\}$. Then
\[
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le 75\kappa^{p-1} \sum_{i \in J} E|\xi_i|^p.
\]
Under the strongest condition (LD4$^*$) we have the following general non-uniform bound.

Theorem 9.3 Let $p \in (2, 3]$ and assume that (LD4$^*$) is satisfied with $\kappa = \max_{i \in J} \max(|D_i^*|, |\{j : i \in D_j^*\}|)$. Then for all $z \in \mathbb{R}$,
\[
\bigl|P(W \le z) - \Phi(z)\bigr| \le C\kappa^p\bigl(1 + |z|\bigr)^{-p} \sum_{i \in J} E|\xi_i|^p,
\]
where $C$ is an absolute constant.
The above results can immediately be applied to $m$-dependent random fields, indexed, for example, by elements of $\mathbb{N}^d$, the $d$-dimensional space of positive integers. Letting $|i - j|$ denote the $L^\infty$ distance
\[
|i - j| = \max_{1 \le l \le d} |i_l - j_l|
\]
between two points $i = (i_1, \ldots, i_d)$ and $j = (j_1, \ldots, j_d)$ in $\mathbb{N}^d$, define the distance $\rho(A, B)$ between two subsets $A$ and $B$ of $\mathbb{N}^d$ by
\[
\rho(A, B) = \inf\bigl\{|i - j| : i \in A, j \in B\bigr\}.
\]
For a given subset $J \subset \mathbb{N}^d$, a set of random variables $\{\xi_i, i \in J\}$ is said to be an $m$-dependent random field if $\{\xi_i, i \in A\}$ and $\{\xi_j, j \in B\}$ are independent whenever $\rho(A, B) > m$, for any subsets $A$ and $B$ of $J$. It is readily verified that if $\{\xi_i, i \in J\}$ is an $m$-dependent random field then (LD3) and (LD4$^*$) are satisfied by choosing $A_i = \{j \in J : |j - i| \le m\}$, $B_i = \{j \in J : |j - i| \le 2m\}$, $C_i = \{j \in J : |j - i| \le 3m\}$, $B_i^* = \{j \in J : |j - i| \le 3m\}$, $C_i^* = \{j \in J : |j - i| \le 5m\}$, and $D_i^* = \{j \in J : |j - i| \le 7m\}$. Hence, Theorems 9.2 and 9.3 yield the following uniform and non-uniform bounds.

Theorem 9.4 If $\{\xi_i, i \in J\}$ is a zero mean $m$-dependent random field then for all $p \in (2, 3]$
\[
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le 75(10m + 1)^{(p-1)d} \sum_{i \in J} E|\xi_i|^p \tag{9.1}
\]
and for all $z \in \mathbb{R}$,
\[
\bigl|P(W \le z) - \Phi(z)\bigr| \le C\bigl(1 + |z|\bigr)^{-p}(14m + 1)^{pd} \sum_{i \in J} E|\xi_i|^p \tag{9.2}
\]
where $C$ is an absolute constant.
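The factors $(10m + 1)^d$ and $(14m + 1)^d$ in (9.1) and (9.2) are the sizes of $N(C_i)$ and $D_i^*$ for an interior point of the lattice. The following sketch counts them by brute force on a finite cube; it is an illustration only, and the helper names `ball` and `kappa_values`, as well as the choice of a cube $\{0, \ldots, \text{side} - 1\}^d$ as index set, are assumptions made here.

```python
from itertools import product


def ball(center, radius, side):
    """Points of {0,...,side-1}^d within sup-norm distance `radius` of `center`."""
    ranges = [range(max(0, c - radius), min(side, c + radius + 1)) for c in center]
    return set(product(*ranges))


def kappa_values(m, d, side):
    """Return (|N(C_i)|, |D_i^*|) for the central point i of the cube.

    C_i is the ball of radius 3m, B_j the ball of radius 2m and
    D_i^* the ball of radius 7m, as in the choice preceding Theorem 9.4.
    """
    center = tuple(side // 2 for _ in range(d))
    grid = product(range(side), repeat=d)
    c_i = ball(center, 3 * m, side)
    # N(C_i) collects every j whose neighborhood B_j intersects C_i.
    n_ci = sum(1 for j in grid if c_i & ball(j, 2 * m, side))
    d_star = len(ball(center, 7 * m, side))
    return n_ci, d_star
```

For example, with $m = 1$, $d = 2$ and a cube of side 15 the central point has no boundary effects, and the counts agree with $(10m + 1)^d = 121$ and $(14m + 1)^d = 225$. Points near the boundary of $J$ give smaller counts, so these values serve as upper bounds for $\kappa$.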
9.2 Outline of Proofs

The main ideas behind the proofs of the results in Sect. 9.1 are similar to those in Sects. 3.4.1 and 8.2. First a Stein identity is derived, followed by uniform and non-uniform concentration inequalities. We outline the main steps under the local dependence condition (LD1), referring the reader to Chen and Shao (2004) for further details.

Assume that (LD1) is satisfied and let $\eta_i = \sum_{j \in A_i} \xi_j$. Define
\[
\hat K_i(t) = \xi_i\bigl\{1(-\eta_i \le t < 0) - 1(0 \le t \le -\eta_i)\bigr\}, \quad K_i(t) = E\hat K_i(t), \quad \hat K(t) = \sum_{i \in J} \hat K_i(t), \quad \text{and} \quad K(t) = E\hat K(t). \tag{9.3}
\]
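We first derive a Stein identity for $W$. Let $f$ be a bounded absolutely continuous function. Then, by the independence of $\xi_i$ and $W - \eta_i$,
\[
\begin{aligned}
E\bigl\{W f(W)\bigr\} &= \sum_{i \in J} E\bigl\{\xi_i\bigl(f(W) - f(W - \eta_i)\bigr)\bigr\}\\
&= \sum_{i \in J} E\Bigl\{\xi_i \int_{-\eta_i}^0 f'(W + t)\,dt\Bigr\}\\
&= \sum_{i \in J} E\Bigl\{\int_{-\infty}^{\infty} f'(W + t)\hat K_i(t)\,dt\Bigr\}\\
&= E\Bigl\{\int_{-\infty}^{\infty} f'(W + t)\hat K(t)\,dt\Bigr\}.
\end{aligned} \tag{9.4}
\]
Now, by virtue of the fact that
\[
\int_{-\infty}^{\infty} K(t)\,dt = E\sum_{i \in J} \xi_i\eta_i = E\sum_{i \in J,\, j \in A_i} \xi_i\xi_j = E\sum_{i \in J,\, j \in J} \xi_i\xi_j = EW^2 = 1, \tag{9.5}
\]
we have
\[
E f'(W) - E W f(W) = E\int_{-\infty}^{\infty} f'(W)K(t)\,dt - E\int_{-\infty}^{\infty} f'(W + t)\hat K(t)\,dt.
\]
Let
\[
r_1 = \sum_{i \in J} E|\xi_i\eta_i| 1_{\{|\eta_i| > 1\}}, \qquad r_2 = \sum_{i \in J} E|\xi_i|\bigl(\eta_i^2 \wedge 1\bigr), \qquad \text{and} \qquad r_3 = \int_{|t| \le 1} \mathrm{Var}\bigl(\hat K(t)\bigr)\,dt. \tag{9.6}
\]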
We record some useful inequalities involving integrals of the functions $K(t)$ and $\hat K(t)$ in the following lemma, the verification of which follows by simple computations, and is therefore omitted.

Lemma 9.1 Let $K(t)$ and $\hat K(t)$ be given by (9.3). Then
\[
\Bigl|\int_{|t| > 1} K(t)\,dt\Bigr| \le \int_{|t| > 1} |K(t)|\,dt \le E\int_{|t| > 1} |\hat K(t)|\,dt \le r_1
\]
and
\[
\int_{|t| \le 1} |tK(t)|\,dt \le E\int_{|t| \le 1} |t\hat K(t)|\,dt \le 0.5\,r_2.
\]
The concentration inequality given by Proposition 9.1 is used in the proof of Theorem 9.1. Similar ideas are applied to prove Theorems 9.2 and 9.3, requiring conditional and non-uniform concentration inequalities, respectively. In the following, sometimes without mention, we will make use of the inequality
\[
ab \le \frac{1}{2}\bigl(ca^2 + b^2/c\bigr) \quad \text{for all } c > 0. \tag{9.7}
\]
Inequality (9.7) is an immediate consequence of the inequality resulting from replacing $a$ and $b$ in the simpler special case when $c = 1$ by $\sqrt{c}\,a$ and $b/\sqrt{c}$, respectively.

Proposition 9.1 Assume that (LD1) is satisfied. Then for any real numbers $a < b$,
\[
P(a \le W \le b) \le 0.625(b - a) + 4r_1 + 2.125r_2 + 4r_3, \tag{9.8}
\]
where $r_1$, $r_2$ and $r_3$ are given in (9.6).

Proof Since $\hat K(t)$ is not necessarily non-negative we cannot use the function defined in (3.32) and must consider a modification. For $a < b$ arbitrary and $\alpha = r_2$ define
\[
f(w) = \begin{cases}
-(b - a + \alpha)/2 & \text{for } w \le a - \alpha,\\
\frac{1}{2\alpha}(w - a + \alpha)^2 - (b - a + \alpha)/2 & \text{for } a - \alpha < w \le a,\\
w - (a + b)/2 & \text{for } a < w \le b,\\
-\frac{1}{2\alpha}(w - b - \alpha)^2 + (b - a + \alpha)/2 & \text{for } b < w \le b + \alpha,\\
(b - a + \alpha)/2 & \text{for } w > b + \alpha.
\end{cases}
\]
Then $f'$ is the continuous function given by
\[
f'(w) = \begin{cases}
1 & \text{for } a \le w \le b,\\
0 & \text{for } w \le a - \alpha \text{ or } w \ge b + \alpha,\\
\text{linear} & \text{for } a - \alpha \le w \le a \text{ or } b \le w \le b + \alpha.
\end{cases} \tag{9.9}
\]
Clearly $|f(w)| \le (b - a + \alpha)/2$. With this choice of $f$, and $\eta_i$, $\hat K(t)$ and $K(t)$ as defined in (9.3), by the Cauchy–Schwarz inequality, $EW^2 = 1$ and (9.4),
\[
\begin{aligned}
(b - a + \alpha)/2 \ge EWf(W) &= E\Bigl\{\int_{-\infty}^{\infty} f'(W + t)\hat K(t)\,dt\Bigr\}\\
&= Ef'(W)\int_{|t| \le 1} K(t)\,dt + E\int_{|t| \le 1} \bigl(f'(W + t) - f'(W)\bigr)K(t)\,dt\\
&\quad + E\int_{|t| > 1} f'(W + t)\hat K(t)\,dt + E\int_{|t| \le 1} f'(W + t)\bigl(\hat K(t) - K(t)\bigr)\,dt\\
&:= H_1 + H_2 + H_3 + H_4.
\end{aligned} \tag{9.10}
\]
From (9.5), (9.9) and Lemma 9.1 we obtain
\[
H_1 \ge Ef'(W)(1 - r_1) \ge P(a \le W \le b) - r_1 \quad \text{and} \quad |H_3| \le r_1. \tag{9.11}
\]
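Moving on to $H_4$, we have
\[
|H_4| \le (1/8)E\int_{|t| \le 1} \bigl[f'(W + t)\bigr]^2\,dt + 2E\int_{|t| \le 1} \bigl(\hat K(t) - K(t)\bigr)^2\,dt \le (b - a + 2\alpha)/8 + 2r_3. \tag{9.12}
\]
Lastly to bound $H_2$, let
\[
L(\alpha) = \sup_{x \in \mathbb{R}} P(x \le W \le x + \alpha).
\]
Then, noting that $f''(w) = \alpha^{-1}\bigl(1_{[a - \alpha, a]}(w) - 1_{[b, b + \alpha]}(w)\bigr)$ a.s., write
\[
H_2 = E\int_0^1 \int_0^t f''(W + s)\,ds\,K(t)\,dt - E\int_{-1}^0 \int_t^0 f''(W + s)\,ds\,K(t)\,dt
\]
as
\[
\begin{aligned}
&\alpha^{-1}\int_0^1 \int_0^t \bigl\{P(a - \alpha \le W + s \le a) - P(b \le W + s \le b + \alpha)\bigr\}\,ds\,K(t)\,dt\\
&\quad - \alpha^{-1}\int_{-1}^0 \int_t^0 \bigl\{P(a - \alpha \le W + s \le a) - P(b \le W + s \le b + \alpha)\bigr\}\,ds\,K(t)\,dt.
\end{aligned}
\]
Now, by Lemma 9.1 and that $\alpha = r_2$,
\[
\begin{aligned}
|H_2| &\le \alpha^{-1}\int_0^1 \int_0^t L(\alpha)\,ds\,|K(t)|\,dt + \alpha^{-1}\int_{-1}^0 \int_t^0 L(\alpha)\,ds\,|K(t)|\,dt\\
&= \alpha^{-1}L(\alpha)\int_{|t| \le 1} |tK(t)|\,dt\\
&\le \frac{1}{2}\alpha^{-1}r_2 L(\alpha) = \frac{1}{2}L(\alpha).
\end{aligned} \tag{9.13}
\]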
It follows from (9.10)–(9.13) that for all a < b P (a ≤ W ≤ b) ≤ 0.625(b − a) + 0.75α + 2r1 + 2r3 + 0.5L(α).
(9.14)
Substituting a = x and b = x + α in (9.14) and taking supremum over x we obtain L(α) ≤ 1.375α + 2r1 + 2r3 + 0.5L(α), and hence L(α) ≤ 2.75α + 4r1 + 4r3 .
(9.15)
Finally combining (9.14) and (9.15), and again recalling α = r2 , we obtain (9.8). Using Proposition 9.1 we prove the following Berry–Esseen bound for random fields satisfying (LD1), which enables one to derive Theorem 9.1. We leave details to the reader.
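The bookkeeping that leads from (9.10)–(9.13) to (9.14), then to (9.15) and finally to (9.8) is pure arithmetic on coefficients, and can be replayed exactly with rational numbers. The sketch below does so; the variable names are illustrative assumptions, and each coefficient is annotated with the term of the proof it comes from.

```python
from fractions import Fraction as F

half = F(1, 2)

# Coefficients in (9.14), collected from (9.10)-(9.13):
c_len = half + F(1, 8)    # (b-a)/2 from the left side of (9.10), (b-a)/8 from H4
c_alpha = half + F(2, 8)  # alpha/2 from the left side of (9.10), 2*alpha/8 from H4
c_r1 = F(2)               # r1 from H1 plus r1 from H3
c_r3 = F(2)               # 2*r3 from H4
c_L = half                # L(alpha)/2 from H2

# (9.15): put b - a = alpha in (9.14) and solve L <= (...) + L/2 for L.
L_alpha = (c_len + c_alpha) / (1 - c_L)
L_r1 = c_r1 / (1 - c_L)
L_r3 = c_r3 / (1 - c_L)

# (9.8): substitute (9.15) back into (9.14) and recall alpha = r2.
final = {
    "b-a": c_len,
    "r1": c_r1 + c_L * L_r1,
    "r2": c_alpha + c_L * L_alpha,
    "r3": c_r3 + c_L * L_r3,
}
```

Using `Fraction` rather than floating point makes the agreement with the decimals 0.625, 2.75 and 2.125 in the text exact rather than approximate.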
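Theorem 9.5 Under (LD1) we have
\[
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le 3.9r_1 + 5.8r_2 + 4.6r_3 + r_4 + 0.5r_5 + 1.5r_6
\]
where $r_1$, $r_2$ and $r_3$ are defined in (9.6), and
\[
r_4 = E\Bigl|\sum_{i \in J} (\xi_i\eta_i - E\xi_i\eta_i)\Bigr|, \qquad r_5 = \sum_{i \in J} E|W\xi_i|\bigl(\eta_i^2 \wedge 1\bigr) \qquad \text{and} \qquad r_6 = \Bigl(\int_{|t| \le 1} |t|\,\mathrm{Var}\bigl(\hat K(t)\bigr)\,dt\Bigr)^{1/2}.
\]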
Proof For $z \in \mathbb{R}$ and $\alpha > 0$ let $f$ be the solution of the Stein equation (2.4) for the smoothed indicator function $h_{z,\alpha}(w)$ given in (2.14). Substituting $f$ into identity (9.4) and using (9.5) we obtain
\[
\begin{aligned}
E\bigl\{f'(W) - Wf(W)\bigr\} &= Ef'(W)\int_{-\infty}^{\infty} \bigl(K(t) - \hat K(t)\bigr)\,dt + E\int_{|t| > 1} \bigl(f'(W) - f'(W + t)\bigr)\hat K(t)\,dt\\
&\quad + E\int_{|t| \le 1} \bigl(f'(W) - f'(W + t)\bigr)\bigl(\hat K(t) - K(t)\bigr)\,dt\\
&\quad + E\int_{|t| \le 1} \bigl(f'(W) - f'(W + t)\bigr)K(t)\,dt\\
&:= R_1 + R_2 + R_3 + R_4.
\end{aligned}
\]
By calculating as in (9.5), and applying the second inequality in (2.15) of Lemma 2.5 we obtain
\[
|R_1| = \Bigl|Ef'(W)\sum_{i \in J} (\xi_i\eta_i - E\xi_i\eta_i)\Bigr| \le r_4,
\]
and by the final inequality in (2.15), and Lemma 9.1, we have
\[
|R_2| \le E\int_{|t| > 1} \bigl|f'(W) - f'(W + t)\bigr|\,|\hat K(t)|\,dt \le E\int_{|t| > 1} |\hat K(t)|\,dt \le r_1.
\]
Applying the simple change of variable $u = rt$ to the bound (2.16) of Lemma 2.5 on the smoothed indicator solution, we have
\[
\begin{aligned}
\bigl|f'(w + t) - f'(w)\bigr| &\le |t|\Bigl(1 + |w| + \frac{1}{\alpha}\int_0^1 1_{[z, z + \alpha]}(w + rt)\,dr\Bigr)\\
&= \bigl(1 + |w|\bigr)|t| + \frac{1}{\alpha}\int_0^t 1(z \le w + u \le z + \alpha)\,du
\end{aligned} \tag{9.16}
\]
\[
\le \bigl(1 + |w|\bigr)|t| + 1(z - 0 \vee t \le w \le z - 0 \wedge t + \alpha). \tag{9.17}
\]
For $R_3$, the bound (9.17) will produce two terms. For the first,
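\[
E\int_{|t| \le 1} \bigl(1 + |W|\bigr)|t|\,\bigl|\hat K(t) - K(t)\bigr|\,dt = E\int_{|t| \le 1} |t|\,\bigl|\hat K(t) - K(t)\bigr|\,dt + E|W|\int_{|t| \le 1} |t|\,\bigl|\hat K(t) - K(t)\bigr|\,dt.
\]
Applying the triangle inequality and the bounds from Lemma 9.1, the first term above is bounded by $r_2$. Similarly, the second term may be bounded by $0.5r_5 + 0.5r_2$. Hence
\[
|R_3| \le 1.5r_2 + 0.5r_5 + R_{3,1} + R_{3,2},
\]
where
\[
R_{3,1} = E\int_0^1 1(z - t \le W \le z + \alpha)\,\bigl|\hat K(t) - K(t)\bigr|\,dt
\]
and
\[
R_{3,2} = E\int_{-1}^0 1(z \le W \le z - t + \alpha)\,\bigl|\hat K(t) - K(t)\bigr|\,dt.
\]
Let $\delta = 0.625\alpha + 4r_1 + 2.125r_2 + 4r_3$. Then by Proposition 9.1,
\[
P(z - t \le W \le z + \alpha) \le \delta + 0.625t \tag{9.18}
\]
for $t \ge 0$. Hence,
\[
\begin{aligned}
R_{3,1} &\le E\int_0^1 \Bigl\{0.5\alpha(\delta + 0.625t)^{-1} 1(z - t \le W \le z + \alpha) + 0.5\alpha^{-1}(\delta + 0.625t)\bigl|\hat K(t) - K(t)\bigr|^2\Bigr\}\,dt\\
&\le 0.5\alpha + 0.5\alpha^{-1}\delta\int_0^1 \mathrm{Var}\bigl(\hat K(t)\bigr)\,dt + 0.32\alpha^{-1}\int_0^1 t\,\mathrm{Var}\bigl(\hat K(t)\bigr)\,dt.
\end{aligned}
\]
As a corresponding upper bound holds for $R_{3,2}$, we arrive at
\[
|R_3| \le \alpha + 0.5\alpha^{-1}\delta r_3 + 0.32\alpha^{-1}r_6^2 + 1.5r_2 + 0.5r_5.
\]
By (9.16), (9.18) with $t = 0$, and Lemma 9.1 we have
\[
\begin{aligned}
|R_4| &\le E\bigl(1 + |W|\bigr)\int_{|t| \le 1} |tK(t)|\,dt + \alpha^{-1}\int_{|t| \le 1} \Bigl|\int_0^t P(z \le W + u \le z + \alpha)\,du\Bigr|\,|K(t)|\,dt\\
&\le r_2 + \alpha^{-1}\delta\int_{|t| \le 1} |tK(t)|\,dt \le r_2 + 0.5\alpha^{-1}\delta r_2.
\end{aligned}
\]
Combining the above inequalities yields
\[
\begin{aligned}
\bigl|Eh_{z,\alpha}(W) - Nh_{z,\alpha}\bigr| &\le r_4 + r_1 + 2.5r_2 + 0.5r_5 + \alpha + \alpha^{-1}\bigl\{\delta(0.5r_3 + 0.5r_2) + 0.32r_6^2\bigr\}\\
&\le r_4 + r_1 + 2.82r_2 + 0.5r_5 + 0.32r_3 + \alpha\\
&\quad + \alpha^{-1}\bigl\{(4r_1 + 2.125r_2 + 4r_3)(0.5r_3 + 0.5r_2) + 0.32r_6^2\bigr\}.
\end{aligned} \tag{9.19}
\]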
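Using the fact that $Eh_{z - \alpha,\alpha}(W) \le P(W \le z) \le Eh_{z,\alpha}(W)$ and that $|\Phi(z + \alpha) - \Phi(z)| \le (2\pi)^{-1/2}\alpha$, we have
\[
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le \sup_{z \in \mathbb{R}} \bigl|Eh_{z,\alpha}(W) - Nh_{z,\alpha}\bigr| + 0.5\alpha.
\]
Letting
\[
\alpha = \bigl\{(4r_1 + 2.125r_2 + 4r_3)(0.5r_3 + 0.5r_2) + 0.32r_6^2\bigr\}^{1/2}
\]
and applying the inequality $(a + b)^{1/2} \le a^{1/2} + b^{1/2}$ yields
\[
\begin{aligned}
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| &\le r_4 + r_1 + 2.82r_2 + 0.5r_5 + 0.32r_3 + 2.5\bigl\{(4r_1 + 2.125r_2 + 4r_3)(0.5r_3 + 0.5r_2) + 0.32r_6^2\bigr\}^{1/2}\\
&\le r_4 + r_1 + 2.82r_2 + 0.5r_5 + 0.32r_3 + 1.5r_6 + 2\bigl\{(4r_1 + 2.125r_2 + 4r_3)(r_3 + r_2)\bigr\}^{1/2}.
\end{aligned}
\]
Now, applying inequality (9.7) on the last term, we obtain
\[
\begin{aligned}
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| &\le r_4 + r_1 + 2.82r_2 + 0.5r_5 + 0.32r_3 + 1.5r_6 + \sqrt{2}\bigl\{0.5(4r_1 + 2.125r_2 + 4r_3) + 2(0.5r_3 + 0.5r_2)\bigr\}\\
&\le r_4 + 3.9r_1 + 5.8r_2 + 0.5r_5 + 4.6r_3 + 1.5r_6,
\end{aligned}
\]
completing the proof of Theorem 9.5.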
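The rounded constants in the last steps of the proof can be recomputed as a consistency check. The following sketch, with illustrative variable names, evaluates the factors that are rounded up to 1.5 and 2 after taking the square root, and then the coefficients of $r_1$, $r_2$ and $r_3$ produced by inequality (9.7) with $c = 0.5$.

```python
import math

sqrt2 = math.sqrt(2.0)

# 2.5 * alpha with alpha^2 = X * (0.5 r3 + 0.5 r2) + 0.32 r6^2 and sqrt(a+b) <= sqrt(a) + sqrt(b):
c_r6 = 2.5 * math.sqrt(0.32)     # multiplies r6, rounded up to 1.5
c_prod = 2.5 * math.sqrt(0.5)    # multiplies sqrt(X * (r3 + r2)), rounded up to 2

# Final coefficients: earlier terms plus sqrt(2) * {0.5 * X + (r3 + r2)},
# where X = 4 r1 + 2.125 r2 + 4 r3.
coef_r1 = 1.0 + sqrt2 * 0.5 * 4.0
coef_r2 = 2.82 + sqrt2 * (0.5 * 2.125 + 1.0)
coef_r3 = 0.32 + sqrt2 * (0.5 * 4.0 + 1.0)
```

The computed values sit slightly below 3.9, 5.8 and 4.6, so the constants stated in Theorem 9.5 are obtained by rounding up.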
We remark that if we use the Stein solution for the indicator hz (w) = 1(−∞,z] (w) instead of the one for the smoothed indicator hz,α (w), then the final integral in (9.19) can be no more than δ |t|≤1 |K(t)|dt, a term which is not clearly bounded by ≤ 1 + r1 . |t|≤1 |K(t)|dt, though |t|≤1 K(t)dt Under (LD2), letting τi = j ∈Bi ξj , the proof of Theorem 9.2 is based on a conditionalconcentration inequality for P (aτi ≤ W ≤ bτi |τi ), where τi = (ξ, ηi , ζi ), ζi = j ∈Bi ξj and aτi ≤ bτi are measurable functions of τi , while the proof of Theorem 9.3 relies on a non-uniform concentration inequality for E((1 + W )3 1{aτi ≤W ≤bτi } |τi ). We refer to Chen and Shao (2004) for details.
9.3 Applications The following three applications of our local dependence results were considered in Chen and Shao (2004).
Example 9.1 (Dependency Graphs) This example was discussed in Baldi and Rinott (1989) and Rinott (1994) where some results on uniform bounds were obtained. Consider a set of random variables $\{X_i, i \in \mathcal{V}\}$ indexed by the vertices of a graph $G = (\mathcal{V}, \mathcal{E})$. $G$ is said to be a dependency graph if for any pair of disjoint sets $\Gamma_1$ and $\Gamma_2$ in $\mathcal{V}$ such that no edge in $\mathcal{E}$ has one endpoint in $\Gamma_1$ and the other in $\Gamma_2$, the sets of random variables $\{X_i, i \in \Gamma_1\}$ and $\{X_i, i \in \Gamma_2\}$ are independent. Let $D$ denote the maximal degree of $G$, i.e., the maximal number of edges incident to a single vertex. Let $A_i = \{j \in \mathcal{V} : \text{there is an edge connecting } j \text{ and } i\}$, $B_i = \bigcup_{j \in A_i} A_j$, $C_i = \bigcup_{j \in B_i} A_j$, $B_i^* = \bigcup_{j \in A_i} B_j$, $C_i^* = \bigcup_{j \in B_i^*} B_j$ and $D_i^* = \bigcup_{j \in C_i^*} B_j$. Noting that
\[
|A_i| \le D, \quad |B_i| \le D^2, \quad |C_i| \le D^3, \quad |B_i^*| \le D^3, \quad |C_i^*| \le D^5 \quad \text{and} \quad |D_i^*| \le D^7,
\]
we have that
\[
\kappa_1 = \bigl|\{j \in J : C_i \cap B_j \ne \emptyset\}\bigr| \le D^5
\]
and
\[
\kappa_2 = \max_{i \in J} \max\bigl(|D_i^*|, |\{j : i \in D_j^*\}|\bigr) \le D^7.
\]
Hence, applying Theorem 9.2 with $\kappa = \kappa_1$, and Theorem 9.3 with $\kappa = \kappa_2$, yields the following theorem.

Theorem 9.6 Let $\{X_i, i \in \mathcal{V}\}$ be random variables indexed by the vertices of a dependency graph. Put $W = \sum_{i \in \mathcal{V}} X_i$. Assume that $EW^2 = 1$, $EX_i = 0$ and $E|X_i|^p \le \theta^p$ for $i \in \mathcal{V}$ and for some $\theta > 0$. Then
\[
\sup_z \bigl|P(W \le z) - \Phi(z)\bigr| \le 75D^{5(p-1)}|\mathcal{V}|\theta^p \tag{9.20}
\]
and for $z \in \mathbb{R}$,
\[
\bigl|P(W \le z) - \Phi(z)\bigr| \le C\bigl(1 + |z|\bigr)^{-p}D^{7p}|\mathcal{V}|\theta^p.
\]
The bound (9.20) compares favorably with those of Baldi and Rinott (1989).

Example 9.2 (Exceedances of the m-scans process) Let $X_1, X_2, \ldots,$ be i.i.d. random variables and let $R_i = \sum_{k=0}^{m-1} X_{i+k}$, $i = 1, 2, \ldots, n$ be the $m$-scans process. For $a \in \mathbb{R}$ consider the number of exceedances of $a$ by $\{R_i : i = 1, \ldots, n\}$,
\[
Y = \sum_{i=1}^n 1\{R_i > a\}.
\]
Assessing the statistical significance of exceedances of scan statistics in one and higher dimensions plays a key role in many areas of applied statistics, and is a well studied problem, see, for example Glaz et al. (2001) and Naus (1982). Scan statistics have been used, for example, for the evaluation of the significance of observed inhomogeneities in the distribution of markers along the length of long DNA sequences, see Dembo and Karlin (1992), and Karlub and Brede (1992). Dembo and Rinott
(1996) obtain a uniform Berry–Esseen bound for $Y$, of the best possible order, as $n \to \infty$. Let $p = P(R_1 > a)$ and $\sigma^2 = \mathrm{Var}(Y)$. From Dembo and Rinott (1996) we have $\sigma^2 \ge np(1 - p)$, and that $\{1\{R_i > a\}, 1 \le i \le n\}$ are $m$-dependent. Let
\[
W = \frac{Y - np}{\sigma} = \sum_{i=1}^n \xi_i \quad \text{where } \xi_i = \bigl(1(R_i > a) - p\bigr)/\sigma.
\]
Since $\sigma^2 \ge np(1 - p)$, we have
\[
\sum_{i=1}^n E|\xi_i|^3 = n\,\frac{p(1 - p)^3 + p^3(1 - p)}{\sigma^3} \le \frac{np(1 - p)}{\sigma^3} \le \frac{1}{\sqrt{np(1 - p)}}.
\]
Hence the following non-uniform bound is a consequence of Theorem 9.4.

Theorem 9.7 There exists a universal constant $C$ such that for all $z \in \mathbb{R}$,
\[
\bigl|P(W \le z) - \Phi(z)\bigr| \le \frac{Cm^3}{(1 + |z|)^3\sqrt{np(1 - p)}}.
\]
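To make the objects in Example 9.2 concrete, the following sketch counts exceedances of the $m$-scans process for a given data sequence and evaluates the third-moment sum at the extreme case $\sigma^2 = np(1 - p)$ allowed by the variance lower bound. The function names are hypothetical, and the data in the usage note are made up purely for illustration.

```python
import math


def scan_exceedances(xs, m, a):
    """Y = number of windows R_i = x_i + ... + x_{i+m-1} that exceed a."""
    n = len(xs) - m + 1
    return sum(1 for i in range(n) if sum(xs[i:i + m]) > a)


def third_moment_bound(n, p):
    """Return (sum_i E|xi_i|^3, 1/sqrt(n p (1-p))) at sigma^2 = n p (1 - p).

    Each centered indicator 1(R_i > a) - p equals 1 - p with probability p
    and -p with probability 1 - p, so E|1(R_i > a) - p|^3 = p(1-p)^3 + p^3(1-p).
    """
    sigma = math.sqrt(n * p * (1 - p))
    total = n * (p * (1 - p) ** 3 + p ** 3 * (1 - p)) / sigma ** 3
    return total, 1.0 / sigma
```

For instance, `scan_exceedances([1, 0, 2, 3, 0, 1], 2, 2)` forms the windows 1, 2, 5, 3, 1 and counts the two that exceed 2. When $\sigma^2$ is larger than $np(1 - p)$ the third-moment sum only decreases, so the displayed bound remains valid.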
Chapter 10
Uniform and Non-uniform Bounds for Non-linear Statistics
In this chapter we consider uniform and non-uniform Berry–Esseen bounds for nonlinear statistics that can be written as a linear statistic plus an error term. We apply our results to U -statistics, multi-sample U -statistics, L-statistics, random sums, and functions of non-linear statistics, obtaining bounds with optimal asymptotic rates. The main tools are uniform and non-uniform randomized concentration inequalities. The work of Chen and Shao (2007) forms the basis of this chapter.
10.1 Introduction and Main Results

Let $X_1, X_2, \ldots, X_n$ be independent random variables and let $T := T(X_1, \ldots, X_n)$ be a general sampling statistic. In many cases of interest $T$ can be written as a linear statistic plus a manageable error term, that is, as $T = W + \Delta$ where
\[
W = \sum_{i=1}^n g_{n,i}(X_i), \quad \text{and} \quad \Delta := \Delta(X_1, \ldots, X_n) = T - W,
\]
for some functions $g_{n,i}$. Let $\xi_i = g_{n,i}(X_i)$. We assume that
\[
E\xi_i = 0 \text{ for } i = 1, 2, \ldots, n, \quad \text{and} \quad \sum_{i=1}^n \mathrm{Var}(\xi_i) = 1, \tag{10.1}
\]
and also that $\Delta$ depends on $X_i$ only through $g_{n,i}(X_i)$, that is, with slight abuse of notation, $\Delta = \Delta(\xi_1, \ldots, \xi_n)$.

It is clear that if $\Delta \to 0$ in probability as $n \to \infty$ then the central limit theorem holds for $W$ provided the Lindeberg condition is satisfied. If in addition, $E|\Delta|^p < \infty$ for some $p > 0$, then by the Chebyshev inequality followed by a simple minimization, one can obtain the following uniform bound,
\[
\sup_{z \in \mathbb{R}} \bigl|P(T \le z) - \Phi(z)\bigr| \le \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| + 2\bigl(E|\Delta|^p\bigr)^{1/(1+p)}, \tag{10.2}
\]
where the first term on the right hand side of (10.2) may be readily estimated by the Berry–Esseen inequality. However, after the addition of the second term the resulting bound will not generally be sharp for many commonly used statistics. Taking a different approach, by developing randomized versions of the concentration inequalities in Sects. 3.4.1 and 8.1, we can establish uniform and non-uniform Berry–Esseen bounds for $T$ with optimal asymptotic rates.

Let $\delta > 0$ satisfy
\[
\sum_{i=1}^n E|\xi_i| \min\bigl(\delta, |\xi_i|\bigr) \ge 1/2 \tag{10.3}
\]
and recall that
\[
\beta_2 = \sum_{i=1}^n E\xi_i^2 1_{\{|\xi_i| > 1\}} \quad \text{and} \quad \beta_3 = \sum_{i=1}^n E|\xi_i|^3 1_{\{|\xi_i| \le 1\}}. \tag{10.4}
\]
The following approximation of $T$ by $W$ provides our uniform Berry–Esseen bound for $T$.

Theorem 10.1 Let $\xi_1, \ldots, \xi_n$ be independent random variables satisfying (10.1), $W = \sum_{i=1}^n \xi_i$ and $T = W + \Delta$. For each $i = 1, \ldots, n$, let $\Delta_i$ be a random variable such that $\xi_i$ and $(W - \xi_i, \Delta_i)$ are independent. Then for any $\delta$ satisfying (10.3),
\[
\sup_{z \in \mathbb{R}} \bigl|P(T \le z) - P(W \le z)\bigr| \le 4\delta + E|W\Delta| + \sum_{i=1}^n E\bigl|\xi_i(\Delta - \Delta_i)\bigr|. \tag{10.5}
\]
In particular,
\[
\sup_{z \in \mathbb{R}} \bigl|P(T \le z) - P(W \le z)\bigr| \le 2(\beta_2 + \beta_3) + E|W\Delta| + \sum_{i=1}^n E\bigl|\xi_i(\Delta - \Delta_i)\bigr| \tag{10.6}
\]
and
\[
\sup_{z \in \mathbb{R}} \bigl|P(T \le z) - \Phi(z)\bigr| \le 6.1(\beta_2 + \beta_3) + E|W\Delta| + \sum_{i=1}^n E\bigl|\xi_i(\Delta - \Delta_i)\bigr|. \tag{10.7}
\]
With $\|X\|_2$ denoting the $L^2$ norm $\|X\|_2 = (EX^2)^{1/2}$ of a random variable $X$, we now provide a corresponding non-uniform bound.

Theorem 10.2 Let $\xi_1, \ldots, \xi_n$ be independent random variables satisfying (10.1), $W = \sum_{i=1}^n \xi_i$ and $T = W + \Delta$. For each $1 \le i \le n$, let $\Delta_i$ be a random variable such that $\xi_i$ and $(W - \xi_i, \Delta_i)$ are independent. Then for $\delta$ satisfying (10.3), and any $p \ge 2$,
\[
\bigl|P(T \le z) - P(W \le z)\bigr| \le \gamma_{z,p} + e^{-|z|/3}\tau \quad \text{for all } z \in \mathbb{R}, \tag{10.8}
\]
where
\[
\gamma_{z,p} = P\bigl(|\Delta| > (|z| + 1)/3\bigr) + 2\sum_{i=1}^n P\bigl(|\xi_i| > (|z| + 1)/(6p)\bigr) + e^p\bigl(1 + z^2/(36p)\bigr)^{-p}\beta_2
\]
and
\[
\tau = 22\delta + 8.6\|\Delta\|_2 + 3.6\sum_{i=1}^n \|\xi_i\|_2\|\Delta - \Delta_i\|_2. \tag{10.9}
\]
If $E|\xi_i|^p < \infty$ for some $p > 2$, then for some constant $C_p$ depending on $p$ only,
\[
\bigl|P(T \le z) - \Phi(z)\bigr| \le P\bigl(|\Delta| > (|z| + 1)/3\bigr) + \frac{C_p}{(|z| + 1)^p}\Bigl(\|\Delta\|_2 + \sum_{i=1}^n \|\xi_i\|_2\|\Delta - \Delta_i\|_2 + \sum_{i=1}^n E|\xi_i|^p + \sum_{i=1}^n E|\xi_i|^{3 \wedge p}\Bigr). \tag{10.10}
\]
The following remark shows how to choose $\delta$ so that (10.3) is satisfied.

Remark 10.1 (i) When $E|\xi_i|^p < \infty$ for $p > 2$ then one may verify that
\[
\delta = \Bigl(\frac{2(p - 2)^{p-2}}{(p - 1)^{p-1}}\sum_{i=1}^n E|\xi_i|^p\Bigr)^{1/(p-2)} \tag{10.11}
\]
satisfies (10.3) using the inequality
\[
\min(x, y) \ge y - \frac{(p - 2)^{p-2}\,y^{p-1}}{(p - 1)^{p-1}\,x^{p-2}} \quad \text{for } x > 0,\ y \ge 0. \tag{10.12}
\]
Inequality (10.12) is trivial when $y \le x$. For $y > x$ the inequality follows by replacing $x$ and $y$ by $x/(p - 1)$ and $y/(p - 2)$, respectively, resulting in the inequality
\[
1 \le \frac{p - 2}{p - 1}\Bigl(\frac{x}{y}\Bigr) + \frac{1}{p - 1}\Bigl(\frac{y}{x}\Bigr)^{p-2},
\]
which holds as the function
\[
\frac{p - 2}{p - 1}\,a + \frac{1}{p - 1}\,a^{2-p} \quad \text{for } a > 0
\]
has a minimum of 1 at $a = 1$.

(ii) If $\beta_2 + \beta_3 \le 1/2$, then (10.3) holds with $\delta = (\beta_2 + \beta_3)/2$. In fact, as (10.12) for $p = 3$ yields $\min(x, y) \ge y - y^2/(4x)$, we have
\[
\begin{aligned}
\sum_{i=1}^n E|\xi_i| \min\bigl(\delta, |\xi_i|\bigr) &\ge \sum_{i=1}^n E|\xi_i| 1_{\{|\xi_i| \le 1\}} \min\bigl(\delta, |\xi_i|\bigr)\\
&\ge \sum_{i=1}^n \Bigl\{E\xi_i^2 1_{\{|\xi_i| \le 1\}} - \frac{E|\xi_i|^3 1_{\{|\xi_i| \le 1\}}}{4\delta}\Bigr\}\\
&= 1 - \frac{4\delta\beta_2 + \beta_3}{4\delta} \ge 1 - \frac{\beta_2 + \beta_3}{4\delta} = 1/2.
\end{aligned}
\]
(iii) Recalling (10.1), we see that if $\delta > 0$ satisfies
\[
\sum_{i=1}^n E\xi_i^2 1_{\{|\xi_i| \ge \delta\}} < 1/2,
\]
then (10.3) holds. In particular, when $\xi_i$, $1 \le i \le n$ are standardized i.i.d. random variables, then $\delta$ may be taken to be of the order $1/\sqrt{n}$, which may be much smaller than $\beta_2 + \beta_3$.

We turn now to our applications, deferring the proofs of Theorems 10.1 and 10.2 to Sect. 10.3.
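Remark 10.1 (i) can be illustrated numerically. The sketch below checks inequality (10.12) on a grid, computes $\delta$ from (10.11), and evaluates the left side of (10.3) in the simplest case of symmetric $\pm 1/\sqrt{n}$ summands. The function names and the Rademacher example are assumptions chosen here for illustration, not taken from the text.

```python
import math


def rhs_10_12(x, y, p):
    """Right side of (10.12): y - (p-2)^(p-2) y^(p-1) / ((p-1)^(p-1) x^(p-2))."""
    c = (p - 2) ** (p - 2) / (p - 1) ** (p - 1)
    return y - c * y ** (p - 1) / x ** (p - 2)


def holds_10_12(p, grid_points=60, top=3.0):
    """Check min(x, y) >= rhs_10_12(x, y, p) on a grid of x > 0, y >= 0."""
    for i in range(1, grid_points + 1):
        x = top * i / grid_points
        for j in range(grid_points + 1):
            y = top * j / grid_points
            if min(x, y) < rhs_10_12(x, y, p) - 1e-12:
                return False
    return True


def delta_10_11(moment_sum, p):
    """delta from (10.11), given moment_sum = sum_i E|xi_i|^p."""
    c = (p - 2) ** (p - 2) / (p - 1) ** (p - 1)
    return (2 * c * moment_sum) ** (1 / (p - 2))


def lhs_10_3_rademacher(n, delta):
    """Left side of (10.3) when each xi_i is +1/sqrt(n) or -1/sqrt(n) with probability 1/2."""
    v = 1 / math.sqrt(n)
    return n * v * min(delta, v)
```

In the Rademacher case with $p = 3$ one has $\sum E|\xi_i|^3 = n^{-1/2}$, so (10.11) gives $\delta = 0.5/\sqrt{n}$ and (10.3) holds with equality. This matches the order $1/\sqrt{n}$ mentioned in part (iii).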
10.2 Applications Theorems 10.1 and 10.2 can be applied to a wide range of different statistics, providing bounds of the best possible order in many instances. To illustrate the usefulness and generality of these results we present the following five applications.
10.2.1 U-statistics

Let $X_1, X_2, \ldots, X_n$ be a sequence of i.i.d. random variables, and for some $m \ge 2$ let $h(x_1, \ldots, x_m)$ be a symmetric, real-valued function, where $m < n/2$ may depend on $n$. Introduced by Hoeffding (1948), the class of U-statistics are those random variables that can be written as
\[
U_n = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} h(X_{i_1}, \ldots, X_{i_m}). \tag{10.13}
\]
Special cases include (i) the sample mean, when $h(x_1, x_2) = \frac{1}{2}(x_1 + x_2)$, (ii) the sample variance, where $h(x_1, x_2) = \frac{1}{2}(x_1 - x_2)^2$, and (iii) the one-sample Wilcoxon statistic, when $h(x_1, x_2) = 1(x_1 + x_2 \le 0)$. We refer the reader to Koroljuk and Borovskich (1994) for a systematic treatment of U-statistics, and note that Rinott and Rotar (1997) also handle a variety of U-statistics using Stein's method. The Hoeffding decomposition (see (10.19) below) allows us to write $U_n$ as a linear statistic plus an error term, allowing for the application of Theorems 10.1 and 10.2, yielding the following result.
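Definition (10.13) translates directly into code. The following sketch evaluates a U-statistic by enumerating all index subsets, and applies it to the first two special cases listed above; the function names are illustrative, and the brute-force enumeration is only practical for small $n$ and $m$.

```python
from itertools import combinations
from math import comb


def u_statistic(xs, h, m):
    """U_n of (10.13): average of the kernel h over all m-subsets of the sample."""
    n = len(xs)
    total = sum(h(*(xs[i] for i in idx)) for idx in combinations(range(n), m))
    return total / comb(n, m)


def mean_kernel(x1, x2):
    """Kernel (i): h(x1, x2) = (x1 + x2)/2, giving the sample mean."""
    return 0.5 * (x1 + x2)


def variance_kernel(x1, x2):
    """Kernel (ii): h(x1, x2) = (x1 - x2)^2 / 2, giving the unbiased sample variance."""
    return 0.5 * (x1 - x2) ** 2
```

For the sample `[1.0, 2.0, 4.0, 7.0]` the first kernel returns the mean 3.5 and the second returns 7.0, which is the sample variance with divisor $n - 1$.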
Theorem 10.3 Let $X_1, \ldots, X_n$ be i.i.d. random variables and let $U_n$ be given by (10.13) with $Eh(X_1, \ldots, X_m) = 0$, $\sigma^2 = Eh^2(X_1, \ldots, X_m) < \infty$ and $\sigma_1^2 = Eg^2(X_1) > 0$ where $g(x) = E(h(X_1, X_2, \ldots, X_m) \mid X_1 = x)$. Then
\[
\sup_{z \in \mathbb{R}} \left| P\left( \frac{\sqrt{n}}{m\sigma_1} U_n \le z \right) - P\left( \frac{1}{\sqrt{n}\sigma_1} \sum_{i=1}^n g(X_i) \le z \right) \right|
\le \frac{(1 + \sqrt{2})(m-1)\sigma}{(m(n-m+1))^{1/2}\sigma_1} + \frac{4c_0}{\sqrt{n}}, \tag{10.14}
\]
where $c_0$ is any constant such that $Eg^2(X_1) 1(|g(X_1)| > c_0\sigma_1) \le \sigma_1^2/2$. If in addition $E|g(X_1)|^p < \infty$ for some $2 < p \le 3$, then
\[
\sup_{z \in \mathbb{R}} \left| P\left( \frac{\sqrt{n}}{m\sigma_1} U_n \le z \right) - \Phi(z) \right|
\le \frac{6.1\,E|g(X_1)|^p}{n^{(p-2)/2}\sigma_1^p} + \frac{(1 + \sqrt{2})(m-1)\sigma}{(m(n-m+1))^{1/2}\sigma_1}, \tag{10.15}
\]
and there exists a universal constant $C$ such that for all $z \in \mathbb{R}$
\[
\left| P\left( \frac{\sqrt{n}}{m\sigma_1} U_n \le z \right) - \Phi(z) \right|
\le \frac{9m\sigma^2}{(|z|+1)^2 (n-m+1)\sigma_1^2} + \frac{C}{(|z|+1)^p} \left( \frac{m^{1/2}\sigma}{(n-m+1)^{1/2}\sigma_1} + \frac{E|g(X_1)|^p}{n^{(p-2)/2}\sigma_1^p} \right). \tag{10.16}
\]

Note that for the error bound in (10.14) to be of order $O(n^{-1/2})$ it is necessary that $\sigma^2$, the second moment of $h$, be finite. However, requiring $\sigma^2 < \infty$ is not the weakest assumption under which the uniform bound at this rate is known to hold; Friedrich (1989) obtained the order $O(n^{-1/2})$ when $E|h|^{5/3} < \infty$. It would be interesting to use Stein's method to obtain this same result. We refer to Bentkus et al. (1994) and Jing and Zhou (2005) for a discussion regarding the necessity of the moment condition.

For $1 \le k \le m$, let
\[
h_k(x_1, \ldots, x_k) = E\bigl( h(X_1, \ldots, X_m) \mid X_1 = x_1, \ldots, X_k = x_k \bigr),
\qquad
\bar h_k(x_1, \ldots, x_k) = h_k(x_1, \ldots, x_k) - \sum_{i=1}^k g(x_i),
\]
\[
\Delta = \frac{\sqrt{n}}{m\sigma_1} \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} \bar h_m(X_{i_1}, \ldots, X_{i_m}), \tag{10.17}
\]
and for $l \in \{1, \ldots, n\}$,
\[
\Delta_l = \frac{\sqrt{n}}{m\sigma_1} \binom{n}{m}^{-1} \sum_{\substack{1 \le i_1 < \cdots < i_m \le n \\ i_j \ne l \text{ for all } j}} \bar h_m(X_{i_1}, \ldots, X_{i_m}). \tag{10.18}
\]
We now prove Theorem 10.3 by applying Theorems 10.1 and 10.2.

Proof Observing that
\[
U_n = \frac{m}{n} \sum_{i=1}^n g(X_i) + \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} \bar h_m(X_{i_1}, \ldots, X_{i_m}), \tag{10.19}
\]
we have
\[
\frac{\sqrt{n}}{m\sigma_1} U_n = W + \Delta,
\]
where
\[
W = \sum_{i=1}^n \xi_i \quad \text{with} \quad \xi_i = \frac{1}{\sqrt{n}\sigma_1} g(X_i).
\]
For each $l \in \{1, \ldots, n\}$ the random variables $W - \xi_l$ and $\Delta_l$ are functions of $X_j$, $j \ne l$, and therefore $\xi_i$ is independent of $(W - \xi_i, \Delta_i)$. Hence, by Theorems 10.1 and 10.2, and (iii) of Remark 10.1, the result follows by Lemma 10.1.

Lemma 10.1 Let $\Delta$ and $\Delta_l$ for $l = 1, \ldots, n$ be as in (10.17) and (10.18), respectively. Then
\[
E\Delta^2 \le \frac{(m-1)^2\sigma^2}{m(n-m+1)\sigma_1^2} \tag{10.20}
\]
and
\[
E(\Delta - \Delta_l)^2 \le \frac{2(m-1)^2\sigma^2}{nm(n-m+1)\sigma_1^2}. \tag{10.21}
\]
The reader is encouraged to prove Lemma 10.1 for the case m = 2, and refer to the Appendix for the general case.
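As a small numerical companion to definition (10.13), the following sketch evaluates a U-statistic by brute force over all size-$m$ subsets. The function names `u_statistic`, `variance_kernel` and `wilcoxon_kernel` and the sample `data` are illustrative choices, not notation from the text; the code only checks the special cases listed after (10.13) and says nothing about the bounds of Theorem 10.3.

```python
from itertools import combinations
from math import comb
from statistics import variance


def u_statistic(sample, kernel, m):
    """Average of the kernel over all size-m subsets of the sample, as in (10.13)."""
    n = len(sample)
    total = sum(kernel(*subset) for subset in combinations(sample, m))
    return total / comb(n, m)


def variance_kernel(x1, x2):
    # h(x1, x2) = (x1 - x2)^2 / 2, the kernel of the sample variance
    return 0.5 * (x1 - x2) ** 2


def wilcoxon_kernel(x1, x2):
    # h(x1, x2) = 1(x1 + x2 <= 0), the one-sample Wilcoxon kernel
    return 1.0 if x1 + x2 <= 0 else 0.0


# Hypothetical data, chosen only for illustration.
data = [2.0, -1.0, 0.5, 3.0, -2.5]
u_var = u_statistic(data, variance_kernel, 2)
u_wil = u_statistic(data, wilcoxon_kernel, 2)
```

With the variance kernel, `u_var` should agree with the usual unbiased sample variance `statistics.variance(data)` up to rounding. The brute-force sum has $\binom{n}{m}$ terms, so this is only practical for small samples.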
10.2.2 Multi-sample U-statistics

Consider $k$ independent sequences, $X_{j1}, \ldots, X_{jn_j}$, $j = 1, \ldots, k$, of i.i.d. random variables, of lengths $n_1, \ldots, n_k$. With $m_j \ge 1$ for $j = 1, \ldots, k$, let $h(x_{jl}, l = 1, \ldots, m_j; j = 1, \ldots, k)$ be a function which is symmetric with respect to the $m_j$ arguments of the $j$-th set, that is, invariant under permutations of $x_{jl}$, $l = 1, \ldots, m_j$. Let $\theta = Eh(X_{jl}, j = 1, \ldots, k, l = 1, \ldots, m_j)$. The multi-sample U-statistic is defined as
\[
U_n = \left\{ \prod_{j=1}^k \binom{n_j}{m_j}^{-1} \right\} \sum h(X_{jl}, j = 1, \ldots, k, l = i_{j1}, \ldots, i_{jm_j}), \tag{10.22}
\]
where $n = (n_1, \ldots, n_k)$ and the summation is carried out over all indices satisfying $1 \le i_{j1} < \cdots < i_{jm_j} \le n_j$. Clearly, $U_n$ is an unbiased estimate of $\theta$. The two-sample Wilcoxon statistic, where $h(x_{11}; x_{21}) = 1_{\{x_{11} < x_{21}\}}$, and the two-sample $\omega^2$-statistic, where
\[
h(x_{11}, x_{12}; x_{21}, x_{22}) =
\begin{cases}
1/3 & \text{if } \max(x_{11}, x_{12}) < \min(x_{21}, x_{22}) \text{ or } \min(x_{11}, x_{12}) > \max(x_{21}, x_{22}), \\
-1/6 & \text{otherwise,}
\end{cases}
\]
are both special cases of the general multi-sample U-statistic; see Koroljuk and Borovskich (1994), pp. 36–37, for additional examples.

For multi-sample U-statistics of the form (10.22), Helmers and van Zwet (1982) and Borovskich (1983) (see Koroljuk and Borovskich 1994, pp. 304–311) obtain a uniform Berry–Esseen bound of order $O((\min_{1 \le j \le k} n_j)^{-1/2})$. Theorem 10.4 not only refines their results but also gives an optimal non-uniform bound. To state the theorem, first note that we may assume $\theta = 0$ without loss of generality. Next, let
\[
\sigma^2 = Eh^2(X_{11}, \ldots, X_{1m_1}; \ldots; X_{k1}, \ldots, X_{km_k}),
\]
for $j = 1, \ldots, k$, define
\[
h_j(x) = E\bigl( h(X_{11}, \ldots, X_{1m_1}; \ldots; X_{k1}, \ldots, X_{km_k}) \mid X_{j1} = x \bigr),
\]
let $\sigma_j^2 = Eh_j^2(X_{j1})$, and set
\[
\sigma_n^2 = \sum_{j=1}^k \frac{m_j^2}{n_j} \sigma_j^2.
\]
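Theorem 10.4 Assume that $\theta = 0$, $\sigma^2 < \infty$, $\max_{1 \le j \le k} \sigma_j^2 > 0$ and let $n_j \ge 2m_j$ for all $j = 1, \ldots, k$. Then for $2 < p \le 3$
\[
\sup_{z \in \mathbb{R}} \left| P\left( \sigma_n^{-1} U_n \le z \right) - \Phi(z) \right|
\le \frac{6.1}{\sigma_n^p} \sum_{j=1}^k \frac{m_j^p}{n_j^{p-1}} E|h_j(X_{j1})|^p + \frac{(1 + \sqrt{2})\sigma}{\sigma_n} \sum_{j=1}^k \frac{m_j^2}{n_j}, \tag{10.23}
\]
and for $z \in \mathbb{R}$
\[
\left| P\left( \sigma_n^{-1} U_n \le z \right) - \Phi(z) \right|
\le \frac{9\sigma^2}{(1 + |z|)^2 \sigma_n^2} \left( \sum_{j=1}^k \frac{m_j^2}{n_j} \right)^2
+ \frac{C}{(1 + |z|)^p} \left\{ \frac{\sigma}{\sigma_n} \sum_{j=1}^k \frac{m_j^2}{n_j} + \frac{1}{\sigma_n^p} \sum_{j=1}^k \frac{m_j^p E|h_j(X_{j1})|^p}{n_j^{p-1}} \right\}. \tag{10.24}
\]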
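Proof We follow an argument similar to the one used for the proof of Theorem 10.3. For $1 \le j \le k$, let $X_j = (X_{j1}, \ldots, X_{jm_j})$, $x_j = (x_{j1}, \ldots, x_{jm_j})$ and define
\[
\bar h(x_1, \ldots, x_k) = h(x_1, \ldots, x_k) - \sum_{j=1}^k \sum_{l=1}^{m_j} h_j(x_{jl}). \tag{10.25}
\]
For the given U-statistic $U_n$, define the projection
\[
\hat U_n = \sum_{j=1}^k \sum_{l=1}^{n_j} E(U_n \mid X_{jl}).
\]
Since
\[
\binom{n_j - 1}{m_j - 1} \Big/ \binom{n_j}{m_j} = \frac{m_j}{n_j},
\]
we have
\[
\hat U_n = \sum_{j=1}^k \sum_{l=1}^{n_j} \frac{m_j}{n_j} h_j(X_{jl}),
\]
and the difference $U_n - \hat U_n$ can be expressed as
\[
U_n - \hat U_n = \left\{ \prod_{j=1}^k \binom{n_j}{m_j}^{-1} \right\} \sum \bar h(X_{1 i_1}, \ldots, X_{k i_k}),
\]
where $X_{j i_j} = (X_{j i_{j1}}, \ldots, X_{j i_{jm_j}})$ and the summation is carried out over all indices $1 \le i_{j1} < i_{j2} < \cdots < i_{jm_j} \le n_j$, $j = 1, 2, \ldots, k$. Thus, we obtain $\sigma_n^{-1} U_n = W + \Delta$ with
\[
W = \sigma_n^{-1} \sum_{j=1}^k \sum_{l=1}^{n_j} \frac{m_j}{n_j} h_j(X_{jl})
\quad \text{and} \quad
\Delta = \sigma_n^{-1} \left\{ \prod_{j=1}^k \binom{n_j}{m_j}^{-1} \right\} \sum \bar h(X_{1 i_1}, \ldots, X_{k i_k}).
\]
To apply Theorems 10.1 and 10.2, let $\xi_{jl} = \sigma_n^{-1} \frac{m_j}{n_j} h_j(X_{jl})$ and
\[
\Delta_{jl} = \sigma_n^{-1} \left\{ \prod_{v=1}^k \binom{n_v}{m_v}^{-1} \right\} {\sum}^{(jl)} \bar h(X_{1 i_1}, \ldots, X_{k i_k}),
\]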
where the sum $\sum^{(jl)}$ excludes the variable $X_{jl}$, that is, the summation is carried out over all indices $1 \le i_{v1} < i_{v2} < \cdots < i_{vm_v} \le n_v$, $1 \le v \le k$, $v \ne j$, and $1 \le i_{j1} < i_{j2} < \cdots < i_{jm_j} \le n_j$ with $i_{js} \ne l$ for $1 \le s \le m_j$. Clearly, $\xi_{jl}$ and $(W - \xi_{jl}, \Delta_{jl})$ are independent. Theorem 10.4 now follows from Theorems 10.1 and 10.2 and the lemma below.

Lemma 10.2 We have
\[
E\Delta^2 \le \frac{\sigma^2}{\sigma_n^2} \left( \sum_{j=1}^k \frac{m_j^2}{n_j} \right)^2 \tag{10.26}
\]
and
\[
E(\Delta - \Delta_{jl})^2 \le \frac{2m_j^2\sigma^2}{n_j^2\sigma_n^2} \sum_{v=1}^k \frac{m_v^2}{n_v}. \tag{10.27}
\]

We refer the reader to the Appendix for a proof.
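To make definition (10.22) concrete, here is a minimal brute-force sketch for the two-sample Wilcoxon kernel $h(x_{11}; x_{21}) = 1_{\{x_{11} < x_{21}\}}$. The helper names `multi_sample_u` and `wilcoxon_two_sample`, and the tiny samples `x` and `y`, are hypothetical and serve only as an illustration; nothing here evaluates the bounds of Theorem 10.4.

```python
from itertools import combinations, product
from math import comb


def multi_sample_u(samples, sizes, kernel):
    """Multi-sample U-statistic in the sense of (10.22).

    samples: list of k samples; sizes: list of the block sizes m_1, ..., m_k.
    The kernel receives one tuple of length m_j per sample.
    """
    # All m_j-subsets of the j-th sample (index order preserved).
    subset_lists = [list(combinations(s, m)) for s, m in zip(samples, sizes)]
    count = 1
    for s, m in zip(samples, sizes):
        count *= comb(len(s), m)  # product of binomial coefficients
    total = sum(kernel(*blocks) for blocks in product(*subset_lists))
    return total / count


def wilcoxon_two_sample(block1, block2):
    # h(x11; x21) = 1{x11 < x21}
    return 1.0 if block1[0] < block2[0] else 0.0


x = [1.0, 3.0]
y = [2.0, 4.0]
u = multi_sample_u([x, y], [1, 1], wilcoxon_two_sample)
```

In this example three of the four pairs $(x_i, y_j)$ satisfy $x_i < y_j$, so `u` is the proportion 3/4.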
10.2.3 L-statistics

Suppose that one wishes to estimate a measure of spread of an unknown distribution $F$ based on a sample $X_1, \ldots, X_n$ of independent observations. Under a second moment assumption, one could form the unbiased estimator
\[
\hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 \quad \text{where} \quad \bar X = \frac{1}{n} \sum_{i=1}^n X_i
\]
of the variance of the underlying $F$. A little algebra shows that the variance estimator $\hat\sigma^2$ may be written more 'symmetrically' as
\[
\hat\sigma^2 = \frac{1}{n(n-1)} \sum_{i<j} (X_i - X_j)^2.
\]
Indeed, from this second formula, the unbiasedness of $\hat\sigma^2$ is clearly seen. Without the need for assuming the existence of second moments, one could instead compute the estimator
\[
T = \binom{n}{2}^{-1} \sum_{i<j} |X_i - X_j| \tag{10.28}
\]
to obtain an idea of the underlying spread, in this case unbiasedly estimating Gini's mean difference $E|X_1 - X_2|$. The estimate $T$ of Gini's mean difference can actually be written as a linear combination
\[
T = \sum_{i=1}^n c_{ni} X_{ni} \tag{10.29}
\]
of the order statistics $X_{n1} \le X_{n2} \le \cdots \le X_{nn}$ of the sample, as follows. To begin, as $|x - y| = |\max(x, y) - \min(x, y)|$, we have
\begin{align*}
\frac{2}{n(n-1)} \sum_{i<j} |X_i - X_j|
&= \frac{2}{n(n-1)} \sum_{i<j} |X_{nj} - X_{ni}| \\
&= \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (X_{nj} - X_{ni}) \\
&= \frac{2}{n(n-1)} \left( \sum_{j=2}^{n} (j-1) X_{nj} - \sum_{i=1}^{n-1} (n-i) X_{ni} \right) \\
&= \frac{2}{n(n-1)} \sum_{i=1}^{n} (2i - n - 1) X_{ni},
\end{align*}
and hence is of the form (10.29). A subclass of estimators of this form that includes many typical applications may be written using the empirical distribution function
\[
F_n(x) = n^{-1} \sum_{i=1}^n 1(X_i \le x)
\]
and a real-valued function $J(t)$ on $[0, 1]$. Indeed, letting
\[
T(G) = \int_{-\infty}^{\infty} x J\bigl(G(x)\bigr)\, dG(x)
\]
for non-decreasing functions $G$, we have that
\[
T(F_n) = \frac{1}{n} \sum_{i=1}^n J(i/n) X_{ni},
\]
a linear combination of the order statistics of the sample, with coefficients, or weights, determined for all sample sizes by the function $J$. Estimators of the form $T(F_n)$ are known as L-statistics, referring to their formation as linear combinations of order statistics; the trimmed mean and smoothly trimmed mean are two special cases. We refer to Serfling (1980), Chap. 8 for additional examples and some asymptotic properties of L-statistics. As $T(F_n)$ estimates some parameter of interest of the underlying, unknown distribution $F$, a natural question is to determine its variation about its asymptotic value $T(F)$ as a function of $n$. Uniform Berry–Esseen bounds for L-statistics for smooth functions $J$ were given by Helmers (1977), and Helmers et al. (1990). In order to apply Theorems 10.1 and 10.2 to yield uniform and non-uniform bounds for L-statistics, let
\[
g(x) = \int_{-\infty}^{\infty} \bigl( 1(x \le s) - F(s) \bigr) J\bigl(F(s)\bigr)\, ds \tag{10.30}
\]
and
\[
\sigma^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} J\bigl(F(s)\bigr) J\bigl(F(t)\bigr) F\bigl(\min(s, t)\bigr) \bigl( 1 - F(\max(s, t)) \bigr)\, ds\, dt,
\]
which is easily checked to be the variance of $g(X)$ when $X$ has distribution function $F(x)$.

Theorem 10.5 Let $n \ge 2$ and assume that $EX_1^2 < \infty$ and $E|g(X_1)|^p < \infty$ for some $2 < p \le 3$. If the weight function $J(t)$ is Lipschitz of order 1 on $[0, 1]$, that is, there exists a constant $c_0$ such that
\[
|J(t) - J(s)| \le c_0 |t - s| \quad \text{for } 0 \le s, t \le 1, \tag{10.31}
\]
then
\[
\sup_{z \in \mathbb{R}} \left| P\left( \sqrt{n}\sigma^{-1} \bigl( T(F_n) - T(F) \bigr) \le z \right) - \Phi(z) \right|
\le \frac{6.1\,E|g(X_1)|^p}{n^{(p-2)/2}\sigma^p} + \frac{(1 + \sqrt{2}) c_0 (EX_1^2)^{1/2}}{\sqrt{n}\sigma} \tag{10.32}
\]
and for all $z \in \mathbb{R}$,
\[
\left| P\left( \sqrt{n}\sigma^{-1} \bigl( T(F_n) - T(F) \bigr) \le z \right) - \Phi(z) \right|
\le \frac{9c_0^2 EX_1^2}{(|z|+1)^2 n\sigma^2} + \frac{C}{(1 + |z|)^p} \left( \frac{c_0 (EX_1^2)^{1/2}}{\sqrt{n}\sigma} + \frac{E|g(X_1)|^p}{n^{(p-2)/2}\sigma^p} \right). \tag{10.33}
\]

Proof Let $\psi(t) = \int_0^t J(s)\, ds$. Using integration by parts (see, e.g., Serfling 1980, p. 265), we have
\[
T(F_n) - T(F) = -\int_{-\infty}^{\infty} \bigl[ \psi\bigl(F_n(x)\bigr) - \psi\bigl(F(x)\bigr) \bigr]\, dx.
\]
Therefore, letting
\[
g_{n,i}(X_i) = -\frac{1}{\sqrt{n}\sigma} \int_{-\infty}^{\infty} \bigl( 1(X_i \le x) - F(x) \bigr) J\bigl(F(x)\bigr)\, dx
\]
we may write
\[
\sqrt{n}\sigma^{-1} \bigl( T(F_n) - T(F) \bigr) = W + \Delta
\]
with
\[
W = \sum_{i=1}^n g_{n,i}(X_i)
\]
and
\[
\Delta = -\sqrt{n}\sigma^{-1} \int_{-\infty}^{\infty} \bigl[ \psi\bigl(F_n(x)\bigr) - \psi\bigl(F(x)\bigr) - \bigl( F_n(x) - F(x) \bigr) J\bigl(F(x)\bigr) \bigr]\, dx.
\]
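As for every $i = 1, \ldots, n$, the variable $g_{n,i}(X_i)$ is independent of $W - g_{n,i}(X_i)$ and of
\[
\Delta_i = -\sqrt{n}\sigma^{-1} \int_{-\infty}^{\infty} \bigl[ \psi\bigl(F_{n,i}(x)\bigr) - \psi\bigl(F(x)\bigr) - \bigl( F_{n,i}(x) - F(x) \bigr) J\bigl(F(x)\bigr) \bigr]\, dx,
\]
where
\[
F_{n,i}(x) = \frac{1}{n} \Bigl\{ F(x) + \sum_{1 \le j \le n,\, j \ne i} 1(X_j \le x) \Bigr\},
\]
we may apply Theorems 10.1 and 10.2. Hence, to prove Theorem 10.5 it suffices to show
\[
\sigma^2 E\Delta^2 \le c_0^2 n^{-1} EX_1^2 \tag{10.34}
\]
and
\[
\sigma^2 E|\Delta - \Delta_i|^2 \le 2c_0^2 n^{-2} EX_1^2. \tag{10.35}
\]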
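Observe that the Lipschitz condition (10.31) implies
\[
\bigl| \psi(t) - \psi(s) - (t - s) J(s) \bigr| = \left| \int_0^{t-s} \bigl( J(u + s) - J(s) \bigr)\, du \right| \le 0.5 c_0 (t - s)^2 \tag{10.36}
\]
for $0 \le s \le t \le 1$. Hence with $\eta_i(x) = 1(X_i \le x) - F(x)$, we have
\begin{align*}
\sigma^2 E\Delta^2
&\le 0.25 c_0^2\, n\, E\left( \int_{-\infty}^{\infty} \bigl( F_n(x) - F(x) \bigr)^2 dx \right)^2 \\
&= 0.25 c_0^2 n^{-3} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} E\Bigl( \sum_{i=1}^n \sum_{j=1}^n \eta_i(x) \eta_j(y) \Bigr)^2 dx\, dy \\
&\le 0.25 c_0^2 n^{-3} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \bigl[ 3n^2 E\eta_1^2(x) E\eta_1^2(y) + n E\bigl( \eta_1^2(x) \eta_1^2(y) \bigr) \bigr]\, dx\, dy.
\end{align*}
For the first term in the integral, we observe that
\[
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} E\eta_1^2(x) E\eta_1^2(y)\, dx\, dy
= \left( \int_{-\infty}^{\infty} F(x) \bigl( 1 - F(x) \bigr)\, dx \right)^2 \le \bigl( E|X_1| \bigr)^2 \le EX_1^2,
\]
while for the second term,
\begin{align*}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} E\bigl( \eta_1^2(x) \eta_1^2(y) \bigr)\, dx\, dy
&= 2 \iint_{x \le y} E\bigl( \eta_1^2(x) \eta_1^2(y) \bigr)\, dx\, dy \\
&= 2 \iint_{x \le y} \Bigl[ \bigl( 1 - F(x) \bigr)^2 \bigl( 1 - F(y) \bigr)^2 F(x) + F^2(x) \bigl( 1 - F(y) \bigr)^2 \bigl( F(y) - F(x) \bigr) \\
&\qquad\qquad + F^2(x) F^2(y) \bigl( 1 - F(y) \bigr) \Bigr]\, dx\, dy \\
&\le 2 \iint_{x \le y} F(x) \bigl( 1 - F(y) \bigr)\, dx\, dy \\
&= 2 \left\{ \iint_{x \le y \le 0} + \iint_{x \le 0,\, y > 0} + \iint_{0 < x \le y} \right\} F(x) \bigl( 1 - F(y) \bigr)\, dx\, dy \\
&\le 2 \left\{ \int_{x \le 0} |x| F(x)\, dx + \int_{x \le 0} F(x)\, dx \int_{y > 0} \bigl( 1 - F(y) \bigr)\, dy + \int_{y \ge 0} y \bigl( 1 - F(y) \bigr)\, dy \right\} \\
&\le E\bigl( X_1^- \bigr)^2 + E\bigl( X_1^+ \bigr)^2 + 2 EX_1^- EX_1^+ \le 2EX_1^2. \tag{10.37}
\end{align*}
Recalling $n \ge 2$, the proof of inequality (10.34) is complete. To prove (10.35), first observe that
\begin{align*}
\frac{\sigma}{\sqrt{n}} |\Delta - \Delta_i|
&= \biggl| \int_{-\infty}^{\infty} \bigl[ \psi\bigl(F_n(x)\bigr) - \psi\bigl(F_{n,i}(x)\bigr) - \bigl( F_n(x) - F_{n,i}(x) \bigr) J\bigl(F_{n,i}(x)\bigr) \bigr]\, dx \\
&\qquad + \int_{-\infty}^{\infty} \bigl( F_n(x) - F_{n,i}(x) \bigr) \bigl[ J\bigl(F_{n,i}(x)\bigr) - J\bigl(F(x)\bigr) \bigr]\, dx \biggr| \\
&\le 0.5 c_0 \int_{-\infty}^{\infty} \bigl( F_n(x) - F_{n,i}(x) \bigr)^2 dx + c_0 \int_{-\infty}^{\infty} \bigl| F_n(x) - F_{n,i}(x) \bigr| \bigl| F_{n,i}(x) - F(x) \bigr|\, dx \\
&= 0.5 c_0 n^{-2} \int_{-\infty}^{\infty} \bigl( 1(X_i \le x) - F(x) \bigr)^2 dx
+ c_0 n^{-2} \int_{-\infty}^{\infty} \bigl| 1(X_i \le x) - F(x) \bigr| \Bigl| \sum_{j \ne i} \bigl( 1(X_j \le x) - F(x) \bigr) \Bigr|\, dx \\
&= 0.5 c_0 n^{-2} \int_{-\infty}^{\infty} \eta_i^2(x)\, dx + c_0 n^{-2} \int_{-\infty}^{\infty} |\eta_i(x)| \Bigl| \sum_{j \ne i} \eta_j(x) \Bigr|\, dx.
\end{align*}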
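Now, to handle the first term above, from (10.37) we have
\[
E\left( \int_{-\infty}^{\infty} \eta_i^2(x)\, dx \right)^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} E\eta_1^2(x) \eta_1^2(y)\, dx\, dy \le 4EX_1^2,
\]
while for the second term,
\begin{align*}
E\left( \int_{-\infty}^{\infty} |\eta_i(x)| \Bigl| \sum_{j \ne i} \eta_j(x) \Bigr|\, dx \right)^2
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} E\Bigl[ |\eta_i(x)| \Bigl| \sum_{j \ne i} \eta_j(x) \Bigr| |\eta_i(y)| \Bigl| \sum_{j \ne i} \eta_j(y) \Bigr| \Bigr]\, dx\, dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} E\bigl| \eta_i(x) \eta_i(y) \bigr|\, E\Bigl| \sum_{j \ne i} \eta_j(x) \sum_{j \ne i} \eta_j(y) \Bigr|\, dx\, dy \\
&\le \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \|\eta_i(x)\|_2 \|\eta_i(y)\|_2 \Bigl\| \sum_{j \ne i} \eta_j(x) \Bigr\|_2 \Bigl\| \sum_{j \ne i} \eta_j(y) \Bigr\|_2\, dx\, dy \\
&= (n-1) \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \|\eta_i(x)\|_2^2 \|\eta_i(y)\|_2^2\, dx\, dy \\
&\le (n-1) \bigl( E|X_1| \bigr)^2 \le (n-1) EX_1^2.
\end{align*}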
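Hence, applying (9.7) and recalling $n \ge 4$ we obtain
\begin{align*}
\sigma^2 E|\Delta - \Delta_i|^2
&\le n^{-3} c_0^2\, E\left( 0.5 \int_{-\infty}^{\infty} \eta_i^2(x)\, dx + \int_{-\infty}^{\infty} |\eta_i(x)| \Bigl| \sum_{j \ne i} \eta_j(x) \Bigr|\, dx \right)^2 \\
&\le n^{-3} c_0^2 \left[ 0.75\, E\left( \int_{-\infty}^{\infty} \eta_i^2(x)\, dx \right)^2 + 1.5\, E\left( \int_{-\infty}^{\infty} |\eta_i(x)| \Bigl| \sum_{j \ne i} \eta_j(x) \Bigr|\, dx \right)^2 \right] \\
&\le n^{-3} c_0^2 \bigl[ 3EX_1^2 + 1.5(n-1) EX_1^2 \bigr] \\
&\le 2 n^{-2} c_0^2 EX_1^2.
\end{align*}
This proves (10.35), and hence the theorem.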
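The representation of Gini's mean difference as a linear combination of order statistics, with $c_{ni} = 2(2i - n - 1)/(n(n-1))$ in (10.29), can be checked numerically. The sketch below compares the pairwise form (10.28) with the order-statistic form; the function names and the sample `data` are illustrative assumptions only.

```python
from itertools import combinations


def gini_pairwise(sample):
    """T from (10.28): average of |X_i - X_j| over all pairs i < j."""
    n = len(sample)
    pairs = n * (n - 1) / 2
    return sum(abs(a - b) for a, b in combinations(sample, 2)) / pairs


def gini_order_statistics(sample):
    """T written as sum_i c_ni X_ni with c_ni = 2(2i - n - 1) / (n(n - 1))."""
    n = len(sample)
    ordered = sorted(sample)  # order statistics X_n1 <= ... <= X_nn
    return sum(
        2 * (2 * i - n - 1) / (n * (n - 1)) * x
        for i, x in enumerate(ordered, start=1)
    )


# Hypothetical sample, including a tie, chosen only for illustration.
data = [4.0, -1.5, 2.0, 7.0, 0.5, 2.0]
```

The two functions should agree up to floating-point rounding on any sample with at least two observations. The order-statistic form needs only a sort, while the pairwise form sums over all $\binom{n}{2}$ pairs.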
10.2.4 Random Sums of Independent Random Variables with Non-random Centering

Let $\{X_i, i \ge 1\}$ be i.i.d. random variables with $EX_i = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$, and let $\{N_n, n \ge 1\}$ be a sequence of non-negative integer-valued random variables that are independent of $\{X_i, i \ge 1\}$. Assume for each $n = 1, 2, \ldots$ that $EN_n^2 < \infty$ and
\[
\frac{N_n - EN_n}{\sqrt{\mathrm{Var}(N_n)}} \xrightarrow{d} N(0, 1).
\]
Then, by Robbins (1948),
\[
\frac{\sum_{i=1}^{N_n} X_i - (EN_n)\mu}{\sqrt{\sigma^2 EN_n + \mu^2 \mathrm{Var}(N_n)}} \xrightarrow{d} N(0, 1). \tag{10.38}
\]
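This result is a special case of limit theorems for random sums with non-random centering. Such problems arise, for example, in the study of Galton–Watson branching processes. We refer to Finkelstein et al. (1994), and references therein, for recent developments in this area. As another application of Theorem 10.1, we give the following uniform bound for the convergence in (10.38) when $N_n$ is the sum of i.i.d. random variables.

Theorem 10.6 Let $\{X_i, i \ge 1\}$ be i.i.d. random variables with $EX_i = \mu$, $\mathrm{Var}(X_i) = \sigma^2$ and $E|X_i|^3 < \infty$, and $\{Y_i, i \ge 1\}$ i.i.d. non-negative integer-valued random variables with $EY_i = \nu$, $\mathrm{Var}(Y_i) = \tau^2$, $EY_1^3 < \infty$, with $\{X_i, i \ge 1\}$ and $\{Y_i, i \ge 1\}$ independent. If $N_n = \sum_{i=1}^n Y_i$, then there exists a universal constant $C$ such that
\[
\sup_{z \in \mathbb{R}} \left| P\left( \frac{\sum_{i=1}^{N_n} X_i - n\mu\nu}{\sqrt{n(\nu\sigma^2 + \tau^2\mu^2)}} \le z \right) - \Phi(z) \right|
\le C n^{-1/2} \left( \frac{\tau^2}{\nu^2} + \frac{EY_1^3}{\tau^3} + \frac{E|X_1|^3}{\nu^{1/2}\sigma^3} + \frac{\sigma}{\mu\sqrt{\nu}} \right). \tag{10.39}
\]

Proof Let $Z_1$ and $Z_2$ be independent standard normal random variables that are independent of $\{X_i, i \ge 1\}$ and $\{Y_i, i \ge 1\}$. Put
\[
b = \sqrt{\nu\sigma^2 + \tau^2\mu^2}, \qquad
T_n = \frac{\sum_{i=1}^{N_n} X_i - n\mu\nu}{\sqrt{n}\, b} \qquad \text{and} \qquad
H_n = \frac{\sum_{i=1}^{N_n} X_i - N_n\mu}{\sqrt{N_n}\, \sigma},
\]
and write
\[
T_n = \frac{\sqrt{N_n}\, \sigma}{\sqrt{n}\, b} H_n + \frac{(N_n - n\nu)\mu}{\sqrt{n}\, b}
\qquad \text{and} \qquad
T_n(Z_1) = \frac{\sqrt{N_n}\, \sigma}{\sqrt{n}\, b} Z_1 + \frac{(N_n - n\nu)\mu}{\sqrt{n}\, b}.
\]
Applying the Berry–Esseen inequality (3.27) to $H_n$ by first conditioning on $N_n$ yields, with $C$ not necessarily the same at each occurrence, that
\begin{align*}
\sup_{z \in \mathbb{R}} \bigl| P(T_n \le z) - P\bigl( T_n(Z_1) \le z \bigr) \bigr|
&\le P\bigl( |N_n - n\nu| > n\nu/2 \bigr) + C E\left( \frac{|X_1|^3}{\sqrt{N_n}\, \sigma^3}\, 1\bigl( |N_n - n\nu| \le n\nu/2 \bigr) \right) \\
&\le \frac{4\tau^2}{n\nu^2} + C n^{-1/2} \frac{E|X_1|^3}{\nu^{1/2}\sigma^3}. \tag{10.40}
\end{align*}
Now, letting the truncation $\bar x$ for any $x \in \mathbb{R}$ be given by
\[
\bar x =
\begin{cases}
n\nu/2 & \text{for } x < n\nu/2, \\
x & \text{for } n\nu/2 \le x \le 3n\nu/2, \\
3n\nu/2 & \text{for } x > 3n\nu/2,
\end{cases}
\]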
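we may write
\[
\bar T_n(Z_1) = \frac{\sqrt{\bar N_n}\, \sigma}{\sqrt{n}\, b} Z_1 + \frac{(N_n - n\nu)\mu}{\sqrt{n}\, b}
= \frac{\tau\mu}{b} \left( W + \Delta + \frac{\sigma\sqrt{\nu}}{\tau\mu} Z_1 \right),
\]
where
\[
W = \frac{N_n - n\nu}{\sqrt{n}\, \tau} \qquad \text{and} \qquad \Delta = \frac{\bigl( \sqrt{\bar N_n} - \sqrt{n\nu} \bigr) \sigma Z_1}{\sqrt{n}\, \tau\mu}.
\]
As $Y_i$ is independent of $N_n - Y_i$ for all $i = 1, \ldots, n$, we may apply Theorem 10.1 to $W + \Delta$ setting
\[
\Delta_i = \frac{\bigl( \sqrt{\overline{N_n - Y_i + \nu}} - \sqrt{n\nu} \bigr) \sigma Z_1}{\sqrt{n}\, \tau\mu} \quad \text{for } i = 1, \ldots, n. \tag{10.41}
\]
To evaluate the second term in the bound (10.7), condition on $Z_1$ and apply the identity
\[
\sqrt{x} - \sqrt{y} = \frac{x - y}{\sqrt{x} + \sqrt{y}}
\]
to obtain
\[
E\bigl( |W\Delta| \,\big|\, Z_1 \bigr) \le \frac{\sigma |Z_1|}{\sqrt{n}\, \tau\mu}\, E\left( \frac{|W (N_n - n\nu)|}{\sqrt{n\nu}} \right) \le \frac{\sigma |Z_1|}{\sqrt{n}\, \mu\sqrt{\nu}},
\]
while for the third term, apply
\[
\frac{1}{\sqrt{n}\, \tau} E\bigl( |(Y_i - \nu)(\Delta - \Delta_i)| \,\big|\, Z_1 \bigr) \le \frac{\sqrt{2}\, \sigma |Z_1|}{n^{3/2} \tau^2 \mu \sqrt{\nu}} E(Y_i - \nu)^2 = \frac{\sqrt{2}\, \sigma |Z_1|}{n^{3/2} \mu \sqrt{\nu}}.
\]
Now letting
\[
T_n(Z_1, Z_2) = \frac{\tau\mu}{b} \left( Z_2 + \frac{\sigma\sqrt{\nu}}{\tau\mu} Z_1 \right)
\]
and applying Theorem 10.1 for given $Z_1$ yields
\begin{align*}
\sup_{z \in \mathbb{R}} \bigl| P\bigl( T_n(Z_1) \le z \bigr) - P\bigl( T_n(Z_1, Z_2) \le z \bigr) \bigr|
&\le P\bigl( |N_n - n\nu| > 0.5 n\nu \bigr) + \sup_{z \in \mathbb{R}} \bigl| P\bigl( \bar T_n(Z_1) \le z \bigr) - P\bigl( T_n(Z_1, Z_2) \le z \bigr) \bigr| \\
&\le \frac{4\tau^2}{n\nu^2} + C \left( \frac{EY_1^3}{n^{1/2}\tau^3} + \frac{\sigma E|Z_1|}{n^{1/2}\mu\sqrt{\nu}} \right) \\
&\le C n^{-1/2} \left( \frac{\tau^2}{\nu^2} + \frac{EY_1^3}{\tau^3} + \frac{\sigma}{\mu\sqrt{\nu}} \right). \tag{10.42}
\end{align*}
As $T_n(Z_1, Z_2)$ has a standard normal distribution, (10.39) follows from (10.40) and (10.42).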
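The standardization used in Theorem 10.6 can be explored by simulation. The sketch below is a rough Monte Carlo illustration under assumed distributions that are not taken from the text: $X_i$ uniform on $(0, 2)$, so $\mu = 1$ and $\sigma^2 = 1/3$, and $Y_i$ uniform on $\{0, 1, 2\}$, so $\nu = 1$ and $\tau^2 = 2/3$. It only checks that the standardized random sum has sample mean near 0 and sample variance near 1; it does not estimate the constant $C$ in (10.39).

```python
import math
import random


def standardized_random_sum(n, rng):
    """One draw of (sum_{i <= N_n} X_i - n*mu*nu) / sqrt(n (nu sigma^2 + tau^2 mu^2)).

    Assumed (illustrative) distributions: X_i ~ Uniform(0, 2), Y_i uniform on {0, 1, 2}.
    """
    mu, sigma2 = 1.0, 1.0 / 3.0
    nu, tau2 = 1.0, 2.0 / 3.0
    n_total = sum(rng.randrange(3) for _ in range(n))       # N_n = Y_1 + ... + Y_n
    s = sum(rng.uniform(0.0, 2.0) for _ in range(n_total))  # X_1 + ... + X_{N_n}
    return (s - n * mu * nu) / math.sqrt(n * (nu * sigma2 + tau2 * mu ** 2))


def simulate_moments(reps, n, seed=0):
    """Sample mean and sample variance of `reps` independent draws."""
    rng = random.Random(seed)
    draws = [standardized_random_sum(n, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    return mean, var
```

For example, `simulate_moments(4000, 50, seed=12345)` should return a mean close to 0 and a variance close to 1, since the variance of the random sum is exactly $n(\nu\sigma^2 + \tau^2\mu^2)$.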
10.2.5 Functions of Non-linear Statistics

Let $X_1, X_2, \ldots, X_n$ be a random sample and $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ a weakly consistent estimator of an unknown parameter $\theta$. Assume that $\hat\theta_n$ can be written as
\[
\hat\theta_n = \theta + \frac{1}{\sqrt{n}} \left( \sum_{i=1}^n \xi_i + \Delta \right) \tag{10.43}
\]
where $\xi_i = g_{n,i}(X_i)$ are functions of $X_i$ satisfying $E\xi_i = 0$ and $\sum_{i=1}^n E\xi_i^2 = 1$, and $\Delta := \Delta_n(X_1, \ldots, X_n) \to 0$ in probability. Under these conditions,
\[
\sqrt{n}(\hat\theta_n - \theta) \to_d N(0, 1)
\]
as $n \to \infty$. The classes of U-statistics, multi-sample U-statistics and L-statistics discussed in previous subsections fit into this setting. The so-called 'Delta Method' in statistics (see Theorem 7 of Ferguson 1996, for instance) allows us to determine the asymptotic distribution of functions of the estimator $\hat\theta_n$. In particular, if $h$ is differentiable in a neighborhood of $\theta$ with $h'$ continuous at $\theta$ and $h'(\theta) \ne 0$, then
\[
\frac{\sqrt{n}\bigl( h(\hat\theta_n) - h(\theta) \bigr)}{h'(\theta)} \to_d N(0, 1). \tag{10.44}
\]
Of course, results that give some idea as to the accuracy of the Delta Method are of interest. When $\hat\theta_n$ is the sample mean, the Berry–Esseen bound and Edgeworth expansion have been well studied (see Bhattacharya and Ghosh 1978). The next theorem shows that the results in Sect. 10.1 can be extended to functions of non-linear statistics.

Theorem 10.7 Suppose the statistic $\hat\theta_n$ may be expressed in the form (10.43) where $\xi_1, \ldots, \xi_n$ are independent random variables with mean zero and satisfy $\mathrm{Var}(W) = 1$ where $W = \sum_{i=1}^n \xi_i$. Assume that $h'(\theta) \ne 0$ and $\delta(c_0) = \sup_{|x - \theta| \le c_0} |h''(x)| < \infty$ for some $c_0 > 0$. Then for all $2 < p \le 3$,
\begin{align*}
\sup_{z \in \mathbb{R}} \left| P\left( \frac{\sqrt{n}\bigl( h(\hat\theta_n) - h(\theta) \bigr)}{h'(\theta)} \le z \right) - \Phi(z) \right|
&\le \left( 1 + \frac{2c_0\delta(c_0)}{|h'(\theta)|} \right) \left( E|W\Delta| + \sum_{i=1}^n E\bigl| \xi_i (\Delta - \Delta_i) \bigr| \right) \\
&\quad + 6.1 \sum_{i=1}^n E|\xi_i|^p + \frac{4}{c_0^2 n} + \frac{2E|\Delta|}{c_0 n^{1/2}} + \frac{3.4\, c_0^{3-p}\, \delta(c_0)}{|h'(\theta)|\, n^{(p-2)/2}} \\
&\quad + \frac{n^{-1/2}\delta(c_0)}{|h'(\theta)|} \sum_{i=1}^n E\xi_i^2\, E|\Delta_i|, \tag{10.45}
\end{align*}
for any $\Delta_i$, $i = 1, \ldots, n$, such that $\xi_i$ and $(W - \xi_i, \Delta_i)$ are independent.
Naturally, Theorem 10.7 may be applied as well to functions of linear statistics, and before turning to the proof we present the following simple example. Suppose one is making inference based on a random sample $X_1, \ldots, X_n$ from the Poisson distribution with unknown parameter $\lambda > 0$. Letting
\[
\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i
\]
be the sample mean, the central limit theorem yields
\[
\sqrt{n}(\bar X_n - \lambda) \to_d N(0, \lambda). \tag{10.46}
\]
As the limiting distribution depends on the very parameter one is trying to estimate, a confidence interval based directly on (10.46) would depend on the estimate $\bar X_n$ of $\lambda$ not only, naturally, for its centering, but also for its length, thus contributing some additional, unwanted, uncertainty. However, here, as in other important cases of interest, there exists a variance stabilizing transformation, that is, a function $g$ such that the standardized asymptotic distribution of $g(\bar X_n)$ does not depend on $\lambda$. For the case at hand, a direct application of the Delta method with $g(x) = 2\sqrt{x}$ yields
\[
\sqrt{n}\bigl( 2\sqrt{\bar X_n} - 2\sqrt{\lambda} \bigr) \to_d N(0, 1). \tag{10.47}
\]
In (10.47) the limiting distribution is known even though $\lambda$ is not, and the resulting confidence interval, for $2\sqrt{\lambda}$, will use an estimate of $\lambda$ only for centering. To apply Theorem 10.7 to calculate a bound on the error in the normal approximation justified by (10.47), we note that (10.43) holds with $\Delta = 0$ upon letting
\[
\hat\theta_n = \frac{\bar X_n}{\sqrt{\lambda}}, \qquad \theta = \sqrt{\lambda} \qquad \text{and} \qquad \xi_i = \frac{X_i - \lambda}{\sqrt{\lambda n}}.
\]
With these choices (10.47) is equivalent to (10.44) when $h(x) = 2\lambda^{1/4}\sqrt{x}$; in particular, note that $h'(\sqrt{\lambda}) = 1$. With $\Delta_i = 0$, Theorem 10.7 yields for, say, $p = 3$ and $c_0 = \sqrt{\lambda}/2$, that
\[
\sup_{z \in \mathbb{R}} \left| P\left( \sqrt{n}\bigl( 2\sqrt{\bar X_n} - 2\sqrt{\lambda} \bigr) \le z \right) - \Phi(z) \right|
\le \frac{6.1\, E|X_1 - \lambda|^3}{\lambda^{3/2}\sqrt{n}} + \frac{4.9}{\sqrt{\lambda n}} + \frac{16}{\lambda n}.
\]
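The right-hand side of the Poisson bound above, $6.1\,E|X_1 - \lambda|^3/(\lambda^{3/2}\sqrt{n}) + 4.9/\sqrt{\lambda n} + 16/(\lambda n)$, is easy to evaluate numerically. The following sketch does so under the assumption that the absolute third central moment is obtained by truncated summation of the Poisson probability mass function; the function names and the truncation point are illustrative choices, suitable for moderate values of $\lambda$ only.

```python
import math


def poisson_abs_third_central_moment(lam):
    """E|X - lam|^3 for X ~ Poisson(lam), by summing the pmf over a wide range.

    The truncation point is an ad hoc choice far in the upper tail.
    """
    upper = int(lam + 40.0 * math.sqrt(lam) + 50.0)
    pmf = math.exp(-lam)  # P(X = 0)
    total = 0.0
    for k in range(upper + 1):
        total += abs(k - lam) ** 3 * pmf
        pmf *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return total


def poisson_sqrt_bound(lam, n):
    """Evaluate 6.1 E|X - lam|^3 / (lam^{3/2} sqrt(n)) + 4.9 / sqrt(lam n) + 16 / (lam n)."""
    m3 = poisson_abs_third_central_moment(lam)
    return (6.1 * m3 / (lam ** 1.5 * math.sqrt(n))
            + 4.9 / math.sqrt(lam * n)
            + 16.0 / (lam * n))
```

Since the bound is of order $n^{-1/2}$ with explicit constants, a call such as `poisson_sqrt_bound(4.0, 10_000)` gives a feel for the sample sizes at which it becomes informative, that is, smaller than 1.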
We now proceed to the proof of the theorem.

Proof Let $p \in (2, 3]$. Since (10.45) is trivial if $\sum_{i=1}^n E|\xi_i|^p > 1/6$, we assume
\[
\sum_{i=1}^n E|\xi_i|^p \le 1/6. \tag{10.48}
\]
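Similar to the proof of Theorem 10.6, let
\[
\bar x =
\begin{cases}
-c_0/2 & \text{for } x < -c_0/2, \\
x & \text{for } -c_0/2 \le x \le c_0/2, \\
c_0/2 & \text{for } x > c_0/2.
\end{cases}
\]
Observe that
\begin{align*}
\frac{\sqrt{n}\bigl( h(\hat\theta_n) - h(\theta) \bigr)}{h'(\theta)}
&= \frac{\sqrt{n}}{h'(\theta)} \left\{ h'(\theta)(\hat\theta_n - \theta) + \int_0^{\hat\theta_n - \theta} \bigl[ h'(\theta + t) - h'(\theta) \bigr]\, dt \right\} \\
&= W + \Delta + \frac{\sqrt{n}}{h'(\theta)} \int_0^{n^{-1/2}(W + \Delta)} \bigl[ h'(\theta + t) - h'(\theta) \bigr]\, dt \\
&:= W + \tilde\Delta + R,
\end{align*}
where
\[
\tilde\Delta = \Delta + \frac{\sqrt{n}}{h'(\theta)} \int_0^{\overline{n^{-1/2}W} + \overline{n^{-1/2}\Delta}} \bigl[ h'(\theta + t) - h'(\theta) \bigr]\, dt
\]
and
\[
R = \frac{\sqrt{n}}{h'(\theta)} \int_{\overline{n^{-1/2}W} + \overline{n^{-1/2}\Delta}}^{n^{-1/2}(W + \Delta)} \bigl[ h'(\theta + t) - h'(\theta) \bigr]\, dt.
\]
We will apply Theorem 10.1 with $\tilde\Delta$ and $\tilde\Delta_i$, defined in (10.53), playing the role of $\Delta$ and $\Delta_i$, respectively. But first, in order to handle the remainder term $R$, note that if $|n^{-1/2}W| \le c_0/2$ and $|n^{-1/2}\Delta| \le c_0/2$ then $R = 0$. Hence
\[
P\bigl( |R| > 0 \bigr) \le P\bigl( |W| > c_0 n^{1/2}/2 \bigr) + P\bigl( |\Delta| > c_0 n^{1/2}/2 \bigr) \le \frac{4}{c_0^2 n} + \frac{2E|\Delta|}{c_0 n^{1/2}}. \tag{10.49}
\]
Recall $W = \sum_{i=1}^n \xi_i$. We prove in the Appendix that under (10.48), for all $2 < p \le 3$,
\[
E|W|^p \le 2\bigl( EW^2 \bigr)^{p/2} + \sum_{i=1}^n E|\xi_i|^p \le 2.2. \tag{10.50}
\]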
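With $W^{(i)}$ denoting $W - \xi_i$ as usual, we have
\begin{align*}
\left| \int_0^{\overline{n^{-1/2}W} + \overline{n^{-1/2}\Delta}} \bigl[ h'(\theta + t) - h'(\theta) \bigr]\, dt \right|
&\le 0.5\,\delta(c_0) \bigl( \overline{n^{-1/2}W} + \overline{n^{-1/2}\Delta} \bigr)^2 \\
&\le \delta(c_0) \Bigl( \bigl( \overline{n^{-1/2}W} \bigr)^2 + \bigl( \overline{n^{-1/2}\Delta} \bigr)^2 \Bigr) \\
&\le \delta(c_0) \Bigl( (c_0/2)^{3-p} \bigl( n^{-1/2}|W| \bigr)^{p-1} + (c_0/2)\, n^{-1/2} |\Delta| \Bigr), \tag{10.51}
\end{align*}
and therefore
\begin{align*}
E|W\tilde\Delta|
&\le E|W\Delta| + \frac{(c_0/2)^{3-p}\, \delta(c_0)}{|h'(\theta)|\, n^{(p-2)/2}} E|W|^p + \frac{c_0\delta(c_0)}{|h'(\theta)|} E|W\Delta| \\
&\le \left( 1 + \frac{c_0\delta(c_0)}{|h'(\theta)|} \right) E|W\Delta| + \frac{2.2\, c_0^{3-p}\, \delta(c_0)}{|h'(\theta)|\, n^{(p-2)/2}}, \tag{10.52}
\end{align*}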
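where for the last term we have applied inequality (10.50). Now introducing
\[
\tilde\Delta_i = \Delta_i + \frac{\sqrt{n}}{h'(\theta)} \int_0^{\overline{n^{-1/2}W^{(i)}} + \overline{n^{-1/2}\Delta_i}} \bigl[ h'(\theta + t) - h'(\theta) \bigr]\, dt, \tag{10.53}
\]
the difference $\tilde\Delta - \tilde\Delta_i$ will equal $\Delta - \Delta_i$ plus $\sqrt{n}/h'(\theta)$ times the term in the absolute value of (10.54), which we bound in a manner similar to (10.51). In particular, applying the bound
\[
\left| \int_a^b \bigl[ h'(\theta + t) - h'(\theta) \bigr]\, dt \right| \le \frac{\delta(c_0)}{2} |b - a| \bigl( |a| + |b| \bigr) \quad \text{for } a, b \in [-c_0, c_0],
\]
we obtain
\begin{align*}
&\left| \int_{\overline{n^{-1/2}W^{(i)}} + \overline{n^{-1/2}\Delta_i}}^{\overline{n^{-1/2}W} + \overline{n^{-1/2}\Delta}} \bigl[ h'(\theta + t) - h'(\theta) \bigr]\, dt \right| \\
&\quad \le \frac{\delta(c_0)}{2} \Bigl\{ \bigl| \overline{n^{-1/2}W} - \overline{n^{-1/2}W^{(i)}} \bigr| \Bigl( \bigl| \overline{n^{-1/2}W} \bigr| + \bigl| \overline{n^{-1/2}W^{(i)}} \bigr| + \bigl| \overline{n^{-1/2}\Delta} \bigr| + \bigl| \overline{n^{-1/2}\Delta_i} \bigr| \Bigr) + 2c_0 \bigl| \overline{n^{-1/2}\Delta} - \overline{n^{-1/2}\Delta_i} \bigr| \Bigr\} \\
&\quad \le \frac{\delta(c_0)}{2} \Bigl\{ \bigl| n^{-1/2}W - n^{-1/2}W^{(i)} \bigr| \Bigl( c_0^{3-p} \bigl| n^{-1/2}W \bigr|^{p-2} + c_0^{3-p} \bigl| n^{-1/2}W^{(i)} \bigr|^{p-2} + n^{-1/2}|\Delta| + n^{-1/2}|\Delta_i| \Bigr) + 2c_0 n^{-1/2} |\Delta - \Delta_i| \Bigr\} \\
&\quad \le \delta(c_0) \Bigl\{ c_0^{3-p}\, n^{-(p-1)/2} |\xi_i| \bigl( |W^{(i)}|^{p-2} + |\xi_i|^{p-2} \bigr) + n^{-1} |\xi_i| |\Delta_i| + 2c_0 n^{-1/2} |\Delta - \Delta_i| \Bigr\}. \tag{10.54}
\end{align*}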
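Now, to attend to the final term in the bound (10.7), where, again, $\tilde\Delta$ and $\tilde\Delta_i$ are playing the role of $\Delta$ and $\Delta_i$, from the inequality above we obtain
\begin{align*}
\sum_{i=1}^n E\bigl| \xi_i (\tilde\Delta - \tilde\Delta_i) \bigr|
&\le \sum_{i=1}^n E\bigl| \xi_i (\Delta - \Delta_i) \bigr|
+ \frac{\sqrt{n}\,\delta(c_0)}{|h'(\theta)|} \Bigl\{ c_0^{3-p}\, n^{-(p-1)/2} \sum_{i=1}^n E|\xi_i|^2 \bigl( |W^{(i)}|^{p-2} + |\xi_i|^{p-2} \bigr) \\
&\qquad + n^{-1} \sum_{i=1}^n E\xi_i^2 |\Delta_i| + 2c_0 n^{-1/2} \sum_{i=1}^n E\bigl| \xi_i (\Delta - \Delta_i) \bigr| \Bigr\} \\
&\le \left( 1 + \frac{2c_0\delta(c_0)}{|h'(\theta)|} \right) \sum_{i=1}^n E\bigl| \xi_i (\Delta - \Delta_i) \bigr|
+ \frac{c_0^{3-p}\,\delta(c_0)}{|h'(\theta)|\, n^{(p-2)/2}} \left( \sum_{i=1}^n E\xi_i^2 + \sum_{i=1}^n E|\xi_i|^p \right) \\
&\qquad + \frac{n^{-1/2}\delta(c_0)}{|h'(\theta)|} \sum_{i=1}^n E\xi_i^2\, E|\Delta_i| \\
&\le \left( 1 + \frac{2c_0\delta(c_0)}{|h'(\theta)|} \right) \sum_{i=1}^n E\bigl| \xi_i (\Delta - \Delta_i) \bigr|
+ \frac{1.2\, c_0^{3-p}\,\delta(c_0)}{|h'(\theta)|\, n^{(p-2)/2}}
+ \frac{n^{-1/2}\delta(c_0)}{|h'(\theta)|} \sum_{i=1}^n E\xi_i^2\, E|\Delta_i|, \tag{10.55}
\end{align*}
recalling (10.48) for the last inequality. The theorem now follows by combining (10.7), (10.49), (10.52) and (10.55).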
10.3 Uniform and Non-uniform Randomized Concentration Inequalities

As the previous chapters have demonstrated, the concentration inequality approach is a powerful tool for deriving sharp Berry–Esseen bounds for independent random variables. In this section we develop uniform and non-uniform randomized concentration inequalities which we will use to prove Theorems 10.1 and 10.2. Let $\xi_1, \ldots, \xi_n$ be independent random variables satisfying (10.1), $W = \sum_{i=1}^n \xi_i$ and $T = W + \Delta$. The simple inequality
\[
-P\bigl( z - |\Delta| \le W \le z \bigr) \le P(T \le z) - P(W \le z) \le P\bigl( z \le W \le z + |\Delta| \bigr) \tag{10.56}
\]
provides lower and upper bounds for the difference between the distribution functions of $T$ and its approximation $W$, and involves the probability that $W$ lies in an interval of random length. Hence, we are led to consider concentration inequalities that bound quantities of the form $P(\Delta_1 \le W \le \Delta_2)$.

Proposition 10.1 Let $\delta > 0$ satisfy (10.3). Then
\[
P(\Delta_1 \le W \le \Delta_2) \le 4\delta + E\bigl| W(\Delta_2 - \Delta_1) \bigr| + \sum_{i=1}^n \Bigl\{ E\bigl| \xi_i (\Delta_1 - \Delta_{1,i}) \bigr| + E\bigl| \xi_i (\Delta_2 - \Delta_{2,i}) \bigr| \Bigr\}, \tag{10.57}
\]
whenever $\xi_i$ is independent of $(W - \xi_i, \Delta_{1,i}, \Delta_{2,i})$ for all $i = 1, \ldots, n$.

When both $\Delta_1$ and $\Delta_2$ are not random, say, $\Delta_1 = a$ and $\Delta_2 = b$ with $a \le b$, then, by (ii) of Remark 10.1, whenever $\beta_2 + \beta_3 \le 1/2$ Proposition 10.1 recovers (3.38) by letting $\Delta_{1,i} = a$ and $\Delta_{2,i} = b$ for each $i = 1, \ldots, n$.
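Proof As the probability $P(\Delta_1 \le W \le \Delta_2)$ is zero if $\Delta_1 > \Delta_2$ we may assume without loss of generality that $\Delta_1 \le \Delta_2$ a.s. We follow the proof of (3.28). For $a \le b$ let
\[
f_{a,b}(w) =
\begin{cases}
-\frac{1}{2}(b - a) - \delta & \text{for } w < a - \delta, \\
w - \frac{1}{2}(a + b) & \text{for } a - \delta \le w \le b + \delta, \\
\frac{1}{2}(b - a) + \delta & \text{for } w > b + \delta,
\end{cases}
\]
and set
\[
\hat K_i(t) = \xi_i \bigl\{ 1(-\xi_i \le t \le 0) - 1(0 < t \le -\xi_i) \bigr\} \qquad \text{and} \qquad \hat K(t) = \sum_{i=1}^n \hat K_i(t).
\]
Since $\xi_i$ and $f_{\Delta_{1,i}, \Delta_{2,i}}(W - \xi_i)$ are independent for $1 \le i \le n$ and $E\xi_i = 0$, we have
\begin{align*}
EW f_{\Delta_1, \Delta_2}(W)
&= \sum_{i=1}^n E\bigl\{ \xi_i \bigl( f_{\Delta_1, \Delta_2}(W) - f_{\Delta_1, \Delta_2}(W - \xi_i) \bigr) \bigr\} \\
&\quad + \sum_{i=1}^n E\bigl\{ \xi_i \bigl( f_{\Delta_1, \Delta_2}(W - \xi_i) - f_{\Delta_{1,i}, \Delta_{2,i}}(W - \xi_i) \bigr) \bigr\} \\
&:= H_1 + H_2. \tag{10.58}
\end{align*}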
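Using the fact that $\hat K(t) \ge 0$ and $f'_{\Delta_1, \Delta_2}(w) \ge 0$, we have
\begin{align*}
H_1 &= \sum_{i=1}^n E\left\{ \xi_i \int_{-\xi_i}^0 f'_{\Delta_1, \Delta_2}(W + t)\, dt \right\} \\
&= \sum_{i=1}^n E\left\{ \int_{-\infty}^{\infty} f'_{\Delta_1, \Delta_2}(W + t) \hat K_i(t)\, dt \right\} \\
&= E\left\{ \int_{-\infty}^{\infty} f'_{\Delta_1, \Delta_2}(W + t) \hat K(t)\, dt \right\} \\
&\ge E\left\{ \int_{|t| \le \delta} f'_{\Delta_1, \Delta_2}(W + t) \hat K(t)\, dt \right\} \\
&\ge E\left\{ 1_{\{\Delta_1 \le W \le \Delta_2\}} \int_{|t| \le \delta} \hat K(t)\, dt \right\} \\
&= E\left\{ 1_{\{\Delta_1 \le W \le \Delta_2\}} \sum_{i=1}^n |\xi_i| \min(\delta, |\xi_i|) \right\} \\
&\ge H_{1,1} - H_{1,2}, \tag{10.59}
\end{align*}
where
\[
H_{1,1} = P(\Delta_1 \le W \le \Delta_2) \sum_{i=1}^n E|\xi_i| \min(\delta, |\xi_i|) \ge \frac{1}{2} P(\Delta_1 \le W \le \Delta_2), \tag{10.60}
\]
by (10.3), and
\[
H_{1,2} = E\left| \sum_{i=1}^n \bigl\{ |\xi_i| \min(\delta, |\xi_i|) - E|\xi_i| \min(\delta, |\xi_i|) \bigr\} \right|
\le \left( \mathrm{Var}\Bigl( \sum_{i=1}^n |\xi_i| \min(\delta, |\xi_i|) \Bigr) \right)^{1/2} \le \delta. \tag{10.61}
\]
As to $H_2$, first, one verifies
\[
\bigl| f_{\Delta_1, \Delta_2}(w) - f_{\Delta_{1,i}, \Delta_{2,i}}(w) \bigr| \le |\Delta_1 - \Delta_{1,i}|/2 + |\Delta_2 - \Delta_{2,i}|/2,
\]
which then yields
\[
|H_2| \le \frac{1}{2} \sum_{i=1}^n \Bigl\{ E\bigl| \xi_i (\Delta_1 - \Delta_{1,i}) \bigr| + E\bigl| \xi_i (\Delta_2 - \Delta_{2,i}) \bigr| \Bigr\}. \tag{10.62}
\]
It follows from the definition of $f_{\Delta_1, \Delta_2}$ that
\[
\bigl| f_{\Delta_1, \Delta_2}(w) \bigr| \le \frac{1}{2}(\Delta_2 - \Delta_1) + \delta.
\]
Hence, by (10.58)–(10.62)
\begin{align*}
P(\Delta_1 \le W \le \Delta_2)
&\le 2EW f_{\Delta_1, \Delta_2}(W) + 2\delta + \sum_{i=1}^n \Bigl\{ E\bigl| \xi_i (\Delta_1 - \Delta_{1,i}) \bigr| + E\bigl| \xi_i (\Delta_2 - \Delta_{2,i}) \bigr| \Bigr\} \\
&\le E\bigl| W(\Delta_2 - \Delta_1) \bigr| + 2\delta E|W| + 2\delta + \sum_{i=1}^n \Bigl\{ E\bigl| \xi_i (\Delta_1 - \Delta_{1,i}) \bigr| + E\bigl| \xi_i (\Delta_2 - \Delta_{2,i}) \bigr| \Bigr\} \\
&\le E\bigl| W(\Delta_2 - \Delta_1) \bigr| + 4\delta + \sum_{i=1}^n \Bigl\{ E\bigl| \xi_i (\Delta_1 - \Delta_{1,i}) \bigr| + E\bigl| \xi_i (\Delta_2 - \Delta_{2,i}) \bigr| \Bigr\},
\end{align*}
as desired.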
Proof of Theorem 10.1 Claim (10.5) follows from applying (10.56) and Proposition 10.1 with
\[
(\Delta_1, \Delta_2, \Delta_{1,i}, \Delta_{2,i}) =
\begin{cases}
(z + \Delta, z, z + \Delta_i, z) & \Delta < 0, \\
(z, z + \Delta, z, z + \Delta_i) & \Delta \ge 0.
\end{cases}
\]
Next, (10.6) is trivial when $\beta_2 + \beta_3 > 1/2$, and otherwise follows from (10.5) and (ii) of Remark 10.1. Lastly, (10.7) is a direct corollary of (10.6) and (3.31).

Theorem 10.2 is based on the following non-uniform randomized concentration inequality.

Proposition 10.2 Let $\delta > 0$ satisfy (10.3). If $\xi_i$ is independent of $(W - \xi_i, \Delta_{1,i}, \Delta_{2,i})$ for all $i = 1, \ldots, n$, then for all $a \in \mathbb{R}$ and $p \ge 2$,
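\[
P(\Delta_1 \le W \le \Delta_2, \Delta_1 \ge a)
\le 2 \sum_{1 \le i \le n} P\bigl( |\xi_i| > (1 \vee a)/(2p) \bigr) + e^p \bigl( 1 + a^2/(4p) \bigr)^{-p} \beta_2 + e^{-a/2} \tau_1, \tag{10.63}
\]
where $\beta_2$ is given in (10.4) and
\[
\tau_1 = 18\delta + 7.2 \|\Delta_2 - \Delta_1\|_2 + 3 \sum_{i=1}^n \|\xi_i\|_2 \bigl( \|\Delta_1 - \Delta_{1,i}\|_2 + \|\Delta_2 - \Delta_{2,i}\|_2 \bigr). \tag{10.64}
\]

Proof When $a \le 2$, (10.63) follows from Proposition 10.1. For $a > 2$, without loss of generality assume that
\[
a \le \Delta_1 \le \Delta_2, \tag{10.65}
\]
as otherwise we may consider $\bar\Delta_1 = \max(a, \Delta_1)$ and $\bar\Delta_2 = \max(a, \Delta_1, \Delta_2)$ and use the fact that $|\bar\Delta_2 - \bar\Delta_1| \le |\Delta_2 - \Delta_1|$. We follow the lines of argument in the proofs of Propositions 8.1 and 10.1. Let
\[
\bar\xi_i = \xi_i 1_{\{\xi_i \le 1\}}, \qquad \bar W = \sum_{i=1}^n \bar\xi_i, \qquad \text{and} \qquad \bar W^{(i)} = \bar W - \bar\xi_i.
\]
As in (8.20), we have
\begin{align*}
\{\Delta_1 \le W \le \Delta_2\}
&\subset \{\Delta_1 \le \bar W \le \Delta_2\} \cup \Bigl\{ \Delta_1 \le W \le \Delta_2, \max_{1 \le i \le n} \xi_i > 1 \Bigr\} \\
&\subset \{\Delta_1 \le \bar W \le \Delta_2\} \cup \Bigl\{ W \ge a, \max_{1 \le i \le n} \xi_i > 1 \Bigr\}
\end{align*}
by (10.65). Invoking Lemma 8.3 for the second term above, it only remains to show
\[
P(\Delta_1 \le \bar W \le \Delta_2) \le e^{-a/2} \tau_1. \tag{10.66}
\]
We can assume that $\delta \le 0.065$ since otherwise, by (8.5) of Lemma 8.1 with $\alpha = 1$,
\[
P(\Delta_1 \le \bar W \le \Delta_2) \le P(\bar W \ge a) \le e^{-a/2} E e^{\bar W/2} \le e^{-a/2} \exp\bigl( e^{0.5} - 1.5 \bigr) \le 1.17 e^{-a/2} \le 18\delta e^{-a/2},
\]
implying (10.66). For $\alpha, \beta \in \mathbb{R}$ let
\[
f_{\alpha,\beta}(w) =
\begin{cases}
0 & \text{for } w < \alpha - \delta, \\
e^{w/2}(w - \alpha + \delta) & \text{for } \alpha - \delta \le w \le \beta + \delta, \\
e^{w/2}(\beta - \alpha + 2\delta) & \text{for } w > \beta + \delta,
\end{cases}
\]
and set
\[
\bar M_i(t) = \xi_i \bigl( 1_{\{-\bar\xi_i \le t \le 0\}} - 1_{\{0 < t \le -\bar\xi_i\}} \bigr), \qquad \bar M(t) = \sum_{i=1}^n \bar M_i(t). \tag{10.67}
\]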
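Similarly to (10.58), we may write
\[
EW f_{\Delta_1, \Delta_2}(\bar W) = H_3 + H_4, \tag{10.68}
\]
where
\[
H_3 = E \int_{-\infty}^{\infty} f'_{\Delta_1, \Delta_2}(\bar W + t) \bar M(t)\, dt
\qquad \text{and} \qquad
H_4 = \sum_{i=1}^n E\bigl\{ \xi_i \bigl( f_{\Delta_1, \Delta_2}(\bar W - \bar\xi_i) - f_{\Delta_{1,i}, \Delta_{2,i}}(\bar W - \bar\xi_i) \bigr) \bigr\}.
\]
It follows from the fact that $\bar M(t) \ge 0$, $f'_{\Delta_1, \Delta_2}(w) \ge e^{w/2}$ for $\Delta_1 - \delta \le w \le \Delta_2 + \delta$ and $f'_{\Delta_1, \Delta_2}(w) \ge 0$ for all $w$ that
\begin{align*}
H_3 &\ge E \int_{|t| \le \delta} f'_{\Delta_1, \Delta_2}(\bar W + t) \bar M(t)\, dt \\
&\ge E\left\{ e^{(\bar W - \delta)/2} 1_{\{\Delta_1 \le \bar W \le \Delta_2\}} \int_{|t| \le \delta} \bar M(t)\, dt \right\} \\
&= E\left\{ e^{(\bar W - \delta)/2} 1_{\{\Delta_1 \le \bar W \le \Delta_2\}} \sum_{i=1}^n |\xi_i| \min(\delta, |\bar\xi_i|) \right\} \\
&\ge E\bigl\{ e^{(\bar W - \delta)/2} 1_{\{\Delta_1 \le \bar W \le \Delta_2\}} \bigr\} \sum_{i=1}^n E|\xi_i| \min(\delta, |\bar\xi_i|) \\
&\quad - E\left\{ e^{(\bar W - \delta)/2} \left| \sum_{i=1}^n \bigl\{ |\xi_i| \min(\delta, |\bar\xi_i|) - E|\xi_i| \min(\delta, |\bar\xi_i|) \bigr\} \right| \right\} \\
&\ge H_{3,1} - H_{3,2}, \tag{10.69}
\end{align*}
where, as in the proof of Proposition 10.1,
\[
H_{3,1} \ge e^{(a - \delta)/2} P(\Delta_1 \le \bar W \le \Delta_2) \sum_{i=1}^n E|\xi_i| \min(\delta, |\bar\xi_i|)
\]
and
\[
H_{3,2} \le \bigl( E e^{\bar W} \bigr)^{1/2} \left( \mathrm{Var}\Bigl( \sum_{i=1}^n |\xi_i| \min(\delta, |\bar\xi_i|) \Bigr) \right)^{1/2}.
\]
Since $\delta$ satisfies (10.3),
\[
\sum_{i=1}^n E|\xi_i| \min(\delta, |\bar\xi_i|) = \sum_{i=1}^n E|\xi_i| \min(\delta, |\xi_i|) \ge 1/2
\]
and now $\delta \le 0.065$ yields
\[
H_{3,1} \ge 0.48\, e^{a/2} P(\Delta_1 \le \bar W \le \Delta_2). \tag{10.70}
\]
As in (8.15), using (8.5) to obtain $E e^{\bar W^{(i)}} \le \exp(e - 2) \le 2.06$ we have
\[
H_{3,2} \le 1.44\delta. \tag{10.71}
\]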
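As to $H_4$, it is easy to see that
\[
\bigl| f_{\Delta_1, \Delta_2}(w) - f_{\Delta_{1,i}, \Delta_{2,i}}(w) \bigr| \le e^{w/2} \bigl( |\Delta_1 - \Delta_{1,i}| + |\Delta_2 - \Delta_{2,i}| \bigr).
\]
Hence, by the Hölder inequality and the independence of $\xi_i$ and $\bar W - \bar\xi_i$,
\begin{align*}
|H_4| &\le \sum_{i=1}^n E|\xi_i| e^{(\bar W - \bar\xi_i)/2} \bigl( |\Delta_1 - \Delta_{1,i}| + |\Delta_2 - \Delta_{2,i}| \bigr) \\
&\le \sum_{i=1}^n \bigl( E\xi_i^2 e^{\bar W - \bar\xi_i} \bigr)^{1/2} \bigl( \|\Delta_1 - \Delta_{1,i}\|_2 + \|\Delta_2 - \Delta_{2,i}\|_2 \bigr) \\
&= \sum_{i=1}^n \bigl( E\xi_i^2\, E e^{\bar W - \bar\xi_i} \bigr)^{1/2} \bigl( \|\Delta_1 - \Delta_{1,i}\|_2 + \|\Delta_2 - \Delta_{2,i}\|_2 \bigr) \\
&\le 1.44 \sum_{i=1}^n \|\xi_i\|_2 \bigl( \|\Delta_1 - \Delta_{1,i}\|_2 + \|\Delta_2 - \Delta_{2,i}\|_2 \bigr). \tag{10.72}
\end{align*}
Now, recalling $\bar\xi_i \le 1$ for all $i$, we have
\begin{align*}
EW^2 e^{\bar W}
&= \sum_{i=1}^n E\xi_i^2 e^{\bar\xi_i}\, E e^{\bar W - \bar\xi_i} + \sum_{1 \le i \ne j \le n} E\xi_i \bigl( e^{\bar\xi_i} - 1 \bigr)\, E\xi_j \bigl( e^{\bar\xi_j} - 1 \bigr)\, E e^{\bar W - \bar\xi_i - \bar\xi_j} \\
&\le 2.06\, e \sum_{i=1}^n E\xi_i^2 + 2.06 (e - 1)^2 \sum_{1 \le i \ne j \le n} E\xi_i^2\, E\xi_j^2 \\
&\le 2.06\, e + 2.06 (e - 1)^2 < 3.42^2.
\end{align*}
Thus, we obtain
\begin{align*}
EW f_{\Delta_1, \Delta_2}(\bar W)
&\le E|W| e^{\bar W/2} \bigl( |\Delta_2 - \Delta_1| + 2\delta \bigr) \\
&\le \bigl( EW^2 e^{\bar W} \bigr)^{1/2} \bigl( \|\Delta_2 - \Delta_1\|_2 + 2\delta \bigr) \\
&\le 3.42 \bigl( \|\Delta_2 - \Delta_1\|_2 + 2\delta \bigr). \tag{10.73}
\end{align*}
Combining (10.68)–(10.73) yields
\[
P(\Delta_1 \le \bar W \le \Delta_2)
\le e^{-a/2} (0.48)^{-1} \left\{ 8.28\delta + 3.42 \|\Delta_2 - \Delta_1\|_2 + 1.44 \sum_{i=1}^n \|\xi_i\|_2 \bigl( \|\Delta_1 - \Delta_{1,i}\|_2 + \|\Delta_2 - \Delta_{2,i}\|_2 \bigr) \right\},
\]
and collecting terms completes the verification of (10.66).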
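Proof of Theorem 10.2 Without loss of generality, assume that $z \ge 0$. Since for $0 \le z \le 2$ inequality (10.8) follows from (10.5), we may assume $z > 2$. Applying Proposition 10.2 with
\[
(\Delta_1, \Delta_2, \Delta_{1,i}, \Delta_{2,i}) = \bigl( z - |\Delta|, z, z - |\Delta_i|, z \bigr) \qquad \text{and} \qquad a = (2z - 1)/3
\]
yields
\begin{align*}
&P\bigl( z - |\Delta| \le W \le z, |\Delta| \le (z + 1)/3 \bigr) \\
&\quad \le 2 \sum_{1 \le i \le n} P\Bigl( |\xi_i| > \bigl( 1 \vee ((2z - 1)/3) \bigr)/(2p) \Bigr) + e^p \bigl( 1 + (2z - 1)^2/(36p) \bigr)^{-p} \beta_2 \\
&\qquad + e^{-(2z - 1)/6} \Bigl( 18\delta + 7.2 \|\Delta\|_2 + 3 \sum_{i=1}^n \|\xi_i\|_2 \|\Delta - \Delta_i\|_2 \Bigr) \\
&\quad \le 2 \sum_{1 \le i \le n} P\bigl( |\xi_i| > (z + 1)/(6p) \bigr) + e^p \bigl( 1 + z^2/(36p) \bigr)^{-p} \beta_2 + e^{-z/3} \tau.
\end{align*}
Now combining the bound above with (10.56) and the inequality
\[
P\bigl( z - |\Delta| \le W \le z \bigr) \le P\bigl( |\Delta| > (z + 1)/3 \bigr) + P\bigl( z - |\Delta| \le W \le z, |\Delta| \le (z + 1)/3 \bigr)
\]
yields
\[
-\gamma_{z,p} - e^{-z/3} \tau \le P(T \le z) - P(W \le z).
\]
Similarly showing the corresponding upper bound completes the proof of (10.8).

When $\beta_2 + \beta_3 \le 1/2$, in light of (ii) of Remark 10.1, choosing $\delta = (\beta_2 + \beta_3)/2$ and noting that $\beta_2 \le \sum_{i=1}^n E|\xi_i|^p$, $\beta_3 \le \sum_{i=1}^n E|\xi_i|^{3 \wedge p}$ and
\[
\sum_{i=1}^n P\left( |\xi_i| > \frac{|z| + 1}{6p} \right) \le \frac{(6p)^p}{(|z| + 1)^p} \sum_{i=1}^n E|\xi_i|^p,
\]
we see (10.10) holds by (10.8) and Theorem 8.1. If $\beta_2 + \beta_3 > 1/2$, then
\[
\sum_{i=1}^n \bigl( E|\xi_i|^p + E|\xi_i|^{3 \wedge p} \bigr) \ge 1/2
\]
and
\begin{align*}
P(T \ge z) &\le P\bigl( W \ge (2z - 1)/3 \bigr) + P\bigl( |\Delta| > (z + 1)/3 \bigr) \\
&\le \frac{C_p}{(1 + z)^p} \left( 1 + \sum_{i=1}^n E|\xi_i|^p \right) + P\bigl( |\Delta| > (z + 1)/3 \bigr)
\end{align*}
by (8.10). Therefore (10.10) remains valid.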
Appendix

Proof of Lemma 10.1 It is known (see, e.g., Koroljuk and Borovskich 1994, p. 271) that
\[
E\left( \sum_{1 \le i_1 < \cdots < i_m \le n} \bar h_m(X_{i_1}, \ldots, X_{i_m}) \right)^2
= \binom{n}{m} \sum_{j=2}^m \binom{m}{j} \binom{n-m}{m-j} E\bar h_j^2(X_1, \ldots, X_j). \tag{10.74}
\]
Using that the variables are i.i.d., the symmetry of $h(x_1, \ldots, x_m)$, and that $Eh(X_1, \ldots, X_m) = 0$ implies $Eg(X_i) = 0$, we have
\begin{align*}
E\bar h_j^2(X_1, \ldots, X_j)
&= Eh_j^2(X_1, \ldots, X_j) - 2 \sum_{i=1}^j E\bigl[ g(X_i) h_j(X_1, \ldots, X_j) \bigr] + E\Bigl( \sum_{i=1}^j g(X_i) \Bigr)^2 \\
&= Eh_j^2(X_1, \ldots, X_j) - 2j E\bigl[ g(X_1) E\bigl( h(X_1, \ldots, X_m) \mid X_1, \ldots, X_j \bigr) \bigr] + j Eg^2(X_1) \\
&= Eh_j^2(X_1, \ldots, X_j) - 2j E\bigl[ g(X_1) h(X_1, \ldots, X_m) \bigr] + j Eg^2(X_1) \\
&= Eh_j^2(X_1, \ldots, X_j) - 2j Eg^2(X_1) + j Eg^2(X_1) \\
&= Eh_j^2(X_1, \ldots, X_j) - j Eg^2(X_1), \tag{10.75}
\end{align*}
so in particular
\[
E\bar h_j^2(X_1, \ldots, X_j) \le Eh_j^2(X_1, \ldots, X_j). \tag{10.76}
\]
We next prove by induction that for $2 \le j \le m$
\[
Eh_{j-1}^2(X_1, \ldots, X_{j-1}) \le \frac{j-1}{j} Eh_j^2(X_1, \ldots, X_j). \tag{10.77}
\]
Since $E\bar h_2^2(X_1, X_2) \ge 0$ and $g(x) = h_1(x)$, (10.77) holds for $j = 2$ by (10.75). Assume that (10.77) is true for $j \ge 2$. Then
\begin{align*}
&E\bigl( h_{j+1}(X_1, \ldots, X_{j+1}) - h_j(X_1, \ldots, X_j) - h_j(X_2, \ldots, X_{j+1}) \bigr)^2 \\
&\quad = Eh_{j+1}^2(X_1, \ldots, X_{j+1}) - 4E\bigl[ h_{j+1}(X_1, \ldots, X_{j+1}) h_j(X_1, \ldots, X_j) \bigr] \\
&\qquad + 2Eh_j^2(X_1, \ldots, X_j) + 2E h_j(X_1, \ldots, X_j) h_j(X_2, \ldots, X_{j+1}) \\
&\quad = Eh_{j+1}^2(X_1, \ldots, X_{j+1}) - 2Eh_j^2(X_1, \ldots, X_j) + 2E\bigl\{ E\bigl[ h_j(X_1, \ldots, X_j) h_j(X_2, \ldots, X_{j+1}) \mid X_2, \ldots, X_j \bigr] \bigr\} \\
&\quad = Eh_{j+1}^2(X_1, \ldots, X_{j+1}) - 2Eh_j^2(X_1, \ldots, X_j) + 2Eh_{j-1}^2(X_1, \ldots, X_{j-1}). \tag{10.78}
\end{align*}
On the other hand, by (4.143)
\[ \begin{aligned} &E\bigl(h_{j+1}(X_1,\dots,X_{j+1}) - h_j(X_1,\dots,X_j) - h_j(X_2,\dots,X_{j+1})\bigr)^2 \\ &\quad\ge E\Bigl\{E\bigl(h_{j+1}(X_1,\dots,X_{j+1}) - h_j(X_1,\dots,X_j) - h_j(X_2,\dots,X_{j+1})\mid X_1,\dots,X_j\bigr)\Bigr\}^2 \\ &\quad= Eh_{j-1}^2(X_1,\dots,X_{j-1}). \end{aligned} \tag{10.79} \]
Now (10.78), (10.79) and the induction hypothesis yield
\[ \begin{aligned} 2Eh_j^2(X_1,\dots,X_j) &\le Eh_{j+1}^2(X_1,\dots,X_{j+1}) + Eh_{j-1}^2(X_1,\dots,X_{j-1}) \\ &\le Eh_{j+1}^2(X_1,\dots,X_{j+1}) + \frac{j-1}{j}Eh_j^2(X_1,\dots,X_j), \end{aligned} \]
which simplifies to (10.77) with $j$ replaced by $j+1$, completing the inductive step. Now iterating (10.77) we obtain
\[ Eh_j^2(X_1,\dots,X_j) \le \frac{j}{m}Eh_m^2(X_1,\dots,X_m) = \frac{j}{m}\sigma^2. \tag{10.80} \]
In order to demonstrate the bounds (10.20) and (10.21), respectively, we prove that
\[ \sum_{j=2}^m\binom{m}{j}\binom{n-m}{m-j}\frac{j}{m} \le \frac{m(m-1)^2}{n(n-m+1)}\binom{n}{m} \tag{10.81} \]
and
\[ \sum_{j=1}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j}\frac{j+1}{m} \le \frac{2(m-1)^2}{n(n-m+1)}\binom{n}{m} \tag{10.82} \]
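hold for all $n/2>m\ge 2$. Regarding (10.81),
\[ \begin{aligned} \sum_{j=2}^m\binom{m}{j}\binom{n-m}{m-j}\frac{j}{m} &= \sum_{j=2}^m\binom{m-1}{j-1}\binom{n-m}{m-j} \\ &= \sum_{j=0}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j} - \binom{n-m}{m-1} \\ &= \binom{n-1}{m-1} - \binom{n-m}{m-1} \\ &= \binom{n-1}{m-1}\Bigl(1 - \frac{(n-m)!/(n-2m+1)!}{(n-1)!/(n-m)!}\Bigr) \\ &= \binom{n-1}{m-1}\Bigl(1 - \prod_{j=n-m+1}^{n-1}\Bigl(1-\frac{m-1}{j}\Bigr)\Bigr) \\ &\le \binom{n-1}{m-1}\sum_{j=n-m+1}^{n-1}\frac{m-1}{j} \\ &\le \binom{n-1}{m-1}\frac{(m-1)^2}{n-m+1} = \frac{m(m-1)^2}{n(n-m+1)}\binom{n}{m}. \end{aligned} \tag{10.83} \]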
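As for (10.82),
\[ \begin{aligned} \sum_{j=1}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j}\frac{j+1}{m} &= \frac{m-1}{m}\sum_{j=1}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j}\frac{j}{m-1} + \frac1m\sum_{j=1}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j} \\ &= \frac{m-1}{m}\sum_{j=0}^{m-2}\binom{m-2}{j}\binom{n-m}{m-2-j} + \frac1m\sum_{j=1}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j} \\ &= \frac{m-1}{m}\binom{n-2}{m-2} + \frac1m\Bigl(\binom{n-1}{m-1}-\binom{n-m}{m-1}\Bigr) \\ &\le \binom{n-1}{m-1}\Bigl(\frac{(m-1)^2}{m(n-1)} + \frac{(m-1)^2}{m(n-m+1)}\Bigr) \\ &\le \frac{2(m-1)^2}{n(n-m+1)}\binom{n}{m}, \end{aligned} \]
where in the second to last inequality we have applied (10.83). From (10.17), (10.74), (10.76), (10.80) and (10.81) we obtain that
\[ \begin{aligned} E\Delta^2 &= \frac{n}{m^2\sigma_1^2}\binom{n}{m}^{-2}E\Bigl(\sum_{1\le i_1<\cdots<i_m\le n}\bar h_m(X_{i_1},\dots,X_{i_m})\Bigr)^2 \\ &\le \frac{n}{m^2\sigma_1^2}\binom{n}{m}^{-1}\sum_{j=2}^m\binom{m}{j}\binom{n-m}{m-j}E\bar h_j^2(X_1,\dots,X_j) \\ &\le \frac{n}{m^2\sigma_1^2}\binom{n}{m}^{-1}\sum_{j=2}^m\binom{m}{j}\binom{n-m}{m-j}Eh_j^2(X_1,\dots,X_j) \\ &\le \frac{n\sigma^2}{m^2\sigma_1^2}\binom{n}{m}^{-1}\sum_{j=2}^m\binom{m}{j}\binom{n-m}{m-j}\frac{j}{m} \\ &\le \frac{(m-1)^2\sigma^2}{m(n-m+1)\sigma_1^2}. \end{aligned} \tag{10.84} \]
This proves (10.20).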
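The two combinatorial bounds (10.81) and (10.82) used above can be spot-checked with exact rational arithmetic. The following is a minimal sketch, not part of the original proof; the function names are hypothetical and the range of $(n,m)$ tested is an arbitrary choice.

```python
from fractions import Fraction
from math import comb


def lhs_10_81(n, m):
    # sum_{j=2}^m C(m,j) C(n-m,m-j) j/m
    return sum(Fraction(j, m) * comb(m, j) * comb(n - m, m - j)
               for j in range(2, m + 1))


def rhs_10_81(n, m):
    return Fraction(m * (m - 1) ** 2, n * (n - m + 1)) * comb(n, m)


def lhs_10_82(n, m):
    # sum_{j=1}^{m-1} C(m-1,j) C(n-m,m-1-j) (j+1)/m
    return sum(Fraction(j + 1, m) * comb(m - 1, j) * comb(n - m, m - 1 - j)
               for j in range(1, m))


def rhs_10_82(n, m):
    return Fraction(2 * (m - 1) ** 2, n * (n - m + 1)) * comb(n, m)


def bounds_hold(max_n=30):
    # Check both inequalities for every pair with n/2 > m >= 2 and n <= max_n.
    for m in range(2, max_n):
        for n in range(2 * m + 1, max_n + 1):
            if lhs_10_81(n, m) > rhs_10_81(n, m):
                return False
            if lhs_10_82(n, m) > rhs_10_82(n, m):
                return False
    return True
```

The left side of (10.81) can also be compared with the closed form $\binom{n-1}{m-1}-\binom{n-m}{m-1}$ obtained in (10.83).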
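Similarly, using (10.18), a slightly modified form of (10.74), (10.76), (10.80) and (10.82), we obtain
\[ \begin{aligned} E(\Delta-\Delta_l)^2 &= \frac{n}{m^2\sigma_1^2}\binom{n}{m}^{-2}E\Bigl(\sum_{1\le i_1<\cdots<i_m\le n}\bar h_m(X_{i_1},\dots,X_{i_m}) - \sum_{1\le i_1<\cdots<i_m\le n-1}\bar h_m(X_{i_1},\dots,X_{i_m})\Bigr)^2 \\ &= \frac{n}{m^2\sigma_1^2}\binom{n}{m}^{-2}E\Bigl(\sum_{1\le i_1<\cdots<i_{m-1}\le n-1}\bar h_m(X_{i_1},\dots,X_{i_{m-1}},X_n)\Bigr)^2 \\ &= \frac{n}{m^2\sigma_1^2}\binom{n}{m}^{-2}\binom{n-1}{m-1}\sum_{j=1}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j}E\bar h_{j+1}^2(X_1,\dots,X_j,X_n) \\ &\le \frac{n}{m^2\sigma_1^2}\binom{n}{m}^{-2}\binom{n-1}{m-1}\sum_{j=1}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j}Eh_{j+1}^2(X_1,\dots,X_j,X_n) \\ &\le \frac{\sigma^2}{m\sigma_1^2}\binom{n}{m}^{-1}\sum_{j=1}^{m-1}\binom{m-1}{j}\binom{n-m}{m-1-j}\frac{j+1}{m} \\ &\le \frac{2(m-1)^2\sigma^2}{nm(n-m+1)\sigma_1^2}. \end{aligned} \tag{10.85} \]
This proves (10.21) and hence completes the proof of Lemma 10.1.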
Proof of Lemma 10.2 For $0\le d_j\le m_j$, $1\le j\le k$ and $\bar h$ as in (10.25), let
\[ Y_{d_1,\dots,d_k}(x_{j1},\dots,x_{jd_j},\,1\le j\le k) = E\bar h(x_{j1},\dots,x_{jd_j},X_{j\,d_j+1},\dots,X_{jm_j},\,1\le j\le k) \]
and
\[ y_{d_1,\dots,d_k} = EY^2_{d_1,\dots,d_k}(X_{j1},\dots,X_{jd_j},\,1\le j\le k). \]
Noting that
\[ E\bigl(\bar h(X_{1i_1},\dots,X_{ki_k})\mid X_{jl}\bigr) = 0 \quad\text{for every } 1\le l\le m_j,\ 1\le j\le k, \]
using (4.5.8) in Koroljuk and Borovskich (1994) for the first equality and the fact that $y_{d_1,\dots,d_k}\le\sigma^2$, we have
\[ \begin{aligned} E(U_n-\hat U_n)^2 &= \Bigl\{\prod_{j=1}^k\binom{n_j}{m_j}^{-1}\Bigr\}\sum_{\substack{d_1+\cdots+d_k\ge 2\\ 0\le d_j\le m_j,\,1\le j\le k}}\ \prod_{j=1}^k\binom{m_j}{d_j}\binom{n_j-m_j}{m_j-d_j}\,y_{d_1,\dots,d_k} \\ &\le \sigma^2\Bigl\{\prod_{j=1}^k\binom{n_j}{m_j}^{-1}\Bigr\}\sum_{\substack{d_1+\cdots+d_k\ge 2\\ 0\le d_j\le m_j,\,1\le j\le k}}\ \prod_{j=1}^k\binom{m_j}{d_j}\binom{n_j-m_j}{m_j-d_j}. \end{aligned} \tag{10.86} \]
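The inequality
\[ \sum_{\substack{d_1+\cdots+d_k\ge 2\\ 0\le d_j\le m_j,\,1\le j\le k}}\ \prod_{j=1}^k\binom{m_j}{d_j}\binom{n_j-m_j}{m_j-d_j} \le \Bigl(\sum_{j=1}^k\frac{m_j^2}{n_j}\Bigr)^2\prod_{j=1}^k\binom{n_j}{m_j}, \tag{10.87} \]
which we prove later, now completes the proof of (10.26) by noting that $\Delta=(U_n-\hat U_n)\sigma_n^{-1}$. As to (10.27) it suffices to consider the case $j=1$. By analogy with (10.85) and (10.86), setting $\mathbf X_{1\mathbf i_1}=(X_{1i_{1,1}},\dots,X_{1i_{1,m_1-1}},X_{1,n_1})$ we have
\[ \begin{aligned} \sigma_n^2E(\Delta-\Delta_{1l})^2 &= \Bigl\{\prod_{v=1}^k\binom{n_v}{m_v}^{-2}\Bigr\}E\Bigl(\sum_{1\le i_{v1}<\cdots}\bar h(\mathbf X_{1\mathbf i_1},\mathbf X_{2\mathbf i_2},\dots,\mathbf X_{k\mathbf i_k})\Bigr)^2 \\ &\le \sigma^2\Bigl\{\prod_{v=1}^k\binom{n_v}{m_v}^{-2}\Bigr\}\binom{n_1-1}{m_1-1}\prod_{v=2}^k\binom{n_v}{m_v} \\ &\qquad\times\sum_{\substack{d_1+\cdots+d_k\ge 1\\ 0\le d_1\le m_1-1,\ 0\le d_v\le m_v,\,2\le v\le k}}\binom{m_1-1}{d_1}\binom{n_1-m_1}{m_1-1-d_1}\prod_{v=2}^k\binom{m_v}{d_v}\binom{n_v-m_v}{m_v-d_v} \\ &= \frac{\sigma^2m_1}{n_1}\Bigl\{\prod_{v=1}^k\binom{n_v}{m_v}\Bigr\}^{-1}\Bigl(\sum_{0\le d_1\le m_1-1,\ 0\le d_v\le m_v,\,2\le v\le k} - \sum_{d_j=0,\,1\le j\le k}\Bigr)\binom{m_1-1}{d_1}\binom{n_1-m_1}{m_1-1-d_1}\prod_{v=2}^k\binom{m_v}{d_v}\binom{n_v-m_v}{m_v-d_v} \\ &= \frac{\sigma^2m_1}{n_1}\Bigl\{\prod_{v=1}^k\binom{n_v}{m_v}\Bigr\}^{-1}\Bigl\{\binom{n_1-1}{m_1-1}\prod_{v=2}^k\binom{n_v}{m_v} - \binom{n_1-m_1}{m_1-1}\prod_{v=2}^k\binom{n_v-m_v}{m_v}\Bigr\} \\ &= \frac{\sigma^2m_1}{n_1}\Bigl\{\prod_{v=1}^k\binom{n_v}{m_v}\Bigr\}^{-1}\Bigl\{\Bigl[\binom{n_1-1}{m_1-1}-\binom{n_1-m_1}{m_1-1}\Bigr]\prod_{v=2}^k\binom{n_v}{m_v} \\ &\qquad + \binom{n_1-m_1}{m_1-1}\Bigl[\prod_{v=2}^k\binom{n_v}{m_v}-\prod_{v=2}^k\binom{n_v-m_v}{m_v}\Bigr]\Bigr\}. \end{aligned} \]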
Now, using inequality (10.83) for the difference of binomial coefficients, and induction to prove that for aj ≥ bj ≥ 1, k
aj −
j =1
k
k k (av − bv ) aj ,
bj ≤
j =1
j =1
v=1
we obtain σn2 E( − 1l )2 k −1 k σ 2 m1 nv (m1 − 1)2 n1 − 1 nv ≤ n1 mv n1 − m1 + 1 m1 − 1 mv v=1 v=2 k (mv − 1)2 n1 − m1 nj + nv − mv + 1 m1 − 1 mj j =2
2≤v≤k
≤
σ 2 m21 n21 1≤v≤k
m2v nv − mv + 1
≤
2σ 2 m21 n21 1≤v≤k
m2v nv
for n1 ≥ 2m1 , proving (10.27). Now we prove (10.87). Consider two cases in the summation: Case 1 At least one of dj ≥ 2, say d1 ≥ 2. In this case, we have k mj nj − mj dj mj − dj 2≤d1 ≤m1 j =1 1≤dl ≤ml ,1≤l≤k
k nj
≤
j =2
≤
k
j =2
≤
mj nj mj
m21 (m1
2≤d1 ≤m1
m1 2
for n1 ≥ 2m1 .
2≤d1 ≤m1
− 1)2
2n1 (n1 − m1 + 1)
k m41 nj ≤ 2 n1 j =1 mj
m1 n1 − m1 d1 m1 − d1 m1 d1
k nj
j =1
mj
n1 − m1 d1 m1 − d1 m1
by (10.81)
Case 2 At least two of {dj } are equal to 1, say d1 = d2 = 1. Then k mj nj − mj dj mj − dj
d1 =d2 =1 j =1 1≤dl ≤ml ,3≤l≤k
≤ m1 m2
≤
n1 − m1 m1 − 1
k m21 m22 nj
n1 , n2
k nj mj j =3
mj
j =1
n2 − m2 m2 − 1
.
Thus, we have k mj nj − mj dj mj − dj
d1 +···+dk ≥2 j =1 1≤dj ≤mj ,1≤j ≤k
≤
k m4 j
n2j
j =1
=
k m2 j j =1
nj
+
2
m2i m2j
1≤i =j ≤k
ni nj
k nj j =1
mj
k nj j =1
mj
.
This proves (10.87) and completes the proof of Lemma 10.2.
Proof of (10.50) We prove the stronger inequality
\[ E|W|^p \le (p-1)B_n^p + \sum_{i=1}^n E|\xi_i|^p \tag{10.88} \]
for $2<p\le 3$, where $B_n^2=EW^2$ is not necessarily equal to 1. Let $W^{(i)}=W-\xi_i$. Then
\[ \begin{aligned} E|W|^p &= \sum_{i=1}^n E\xi_iW|W|^{p-2} \\ &= \sum_{i=1}^n E\xi_i\bigl(W|W|^{p-2}-W^{(i)}|W|^{p-2}\bigr) + \sum_{i=1}^n E\xi_i\bigl(W^{(i)}|W|^{p-2}-W^{(i)}|W^{(i)}|^{p-2}\bigr), \end{aligned} \]
because $\xi_i$ and $W^{(i)}$ are independent, and $E\xi_i=0$. Thus we have
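\[ \begin{aligned} E|W|^p &\le \sum_{i=1}^n E\xi_i^2|W|^{p-2} + \sum_{i=1}^n E|\xi_i|\,|W^{(i)}|\Bigl\{\bigl(|W^{(i)}|+|\xi_i|\bigr)^{p-2}-|W^{(i)}|^{p-2}\Bigr\} \\ &\le \sum_{i=1}^n E\xi_i^2\bigl(|\xi_i|^{p-2}+|W^{(i)}|^{p-2}\bigr) + \sum_{i=1}^n E|\xi_i|\,|W^{(i)}|^{p-1}\Bigl\{\bigl(1+|\xi_i|/|W^{(i)}|\bigr)^{p-2}-1\Bigr\}. \end{aligned} \]
Since $(1+x)^{p-2}-1\le (p-2)x$ for $x\ge 0$, we have
\[ \begin{aligned} E|W|^p &\le \sum_{i=1}^n E|\xi_i|^p + \sum_{i=1}^n E\xi_i^2\,E|W^{(i)}|^{p-2} + \sum_{i=1}^n E|\xi_i|\,|W^{(i)}|^{p-1}(p-2)|\xi_i|/|W^{(i)}| \\ &= \sum_{i=1}^n E|\xi_i|^p + (p-1)\sum_{i=1}^n E\xi_i^2\,E|W^{(i)}|^{p-2}, \end{aligned} \]
and Hölder's inequality now gives
\[ \begin{aligned} E|W|^p &\le \sum_{i=1}^n E|\xi_i|^p + (p-1)\sum_{i=1}^n E\xi_i^2\bigl(E|W^{(i)}|^2\bigr)^{(p-2)/2} \\ &\le \sum_{i=1}^n E|\xi_i|^p + (p-1)B_n^p, \end{aligned} \]
as desired.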
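As an illustration of (10.88), the inequality can be evaluated exactly in one simple special case. The sketch below assumes $\xi_i=\varepsilon_i/\sqrt n$ with $\varepsilon_i$ i.i.d. uniform on $\{-1,+1\}$, so that $B_n=1$ and $\sum_i E|\xi_i|^p=n^{1-p/2}$; this choice of distribution and the function names are assumptions made only for the example.

```python
from math import comb, sqrt


def abs_moment_rademacher_sum(n, p):
    """E|W|^p for W = (e_1 + ... + e_n)/sqrt(n), e_i i.i.d. uniform on {-1, +1}."""
    total = 0.0
    for k in range(n + 1):           # k = number of e_i equal to +1
        w = (2 * k - n) / sqrt(n)
        total += comb(n, k) / 2 ** n * abs(w) ** p
    return total


def rhs_10_88(n, p):
    # B_n = 1 here, and sum_i E|xi_i|^p = n * n^(-p/2).
    return (p - 1) * 1.0 + n * n ** (-p / 2)


def inequality_holds(n, p, tol=1e-12):
    return abs_moment_rademacher_sum(n, p) <= rhs_10_88(n, p) + tol
```

For large $n$ the left side approaches the corresponding absolute moment of the standard normal, which is well below $p-1$ for $p$ near 3, so the bound is far from tight in this case.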
Chapter 11
Moderate Deviations
The Berry–Esseen inequality, which gives a bound on the absolute error in the normal approximation, may not be very informative when $\Phi(x)$ is close to 0 or 1. Cramér's theory of moderate deviations, on the other hand, provides relative errors. Let $X_1,X_2,\dots$ be i.i.d. random variables with $E(X_1)=0$, $EX_1^2=1$ and $Ee^{t_0|X_1|}<\infty$ for some $t_0>0$, and set $W=\sum_{i=1}^nX_i/\sqrt n$. Then, see Petrov (1995) for instance, for $z\ge 0$ with $z=o(n^{1/2})$,
\[ \frac{P(W\ge z)}{1-\Phi(z)} = \exp\Bigl\{z^2\lambda\Bigl(\frac{z}{\sqrt n}\Bigr)\Bigr\}\Bigl(1+O\Bigl(\frac{1+z}{\sqrt n}\Bigr)\Bigr), \]
for a function $\lambda(t)$ known as the Cramér series. In particular,
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)\frac{(1+z^3)E|X_1|^3}{\sqrt n} \tag{11.1} \]
for $0\le z\le n^{1/6}/(E|X_1|^3)^{1/3}$, where $O(1)$ is a sequence of real numbers bounded by a universal constant for $n\in\mathbb N$. As a consequence of (11.1),
\[ \frac{P(W\ge z)}{1-\Phi(z)} \to 1 \tag{11.2} \]
as $n\to\infty$, uniformly in $z\in[0,o(n^{1/6}))$. It is known in general that $o(n^{1/6})$ is the largest possible value for the range of $z$ such that (11.2) holds. In this chapter, following Chen et al. (2009), we first establish a Cramér type moderate deviation theorem in the mold of (11.1) by use of the Stein identity (2.42) for approximate exchangeable pairs, and then apply the result to four examples: the combinatorial central limit theorem, the anti-voter model on complete graphs, the binary expansion of a random integer, and the Curie–Weiss model. Related ideas appear in Raič (2007); see also Chatterjee (2007).
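The difference between absolute and relative error in (11.1) and (11.2) can be seen in a small exact computation. The sketch below assumes the $X_i$ are symmetric $\pm1$ variables, so that $W=(2K-n)/\sqrt n$ with $K$ binomial; this distribution, the sample size and the function names are illustrative assumptions only.

```python
from math import ceil, comb, erfc, sqrt


def normal_tail(z):
    """1 - Phi(z) for the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2.0))


def rademacher_tail(n, z):
    """Exact P(W >= z) for W = (e_1 + ... + e_n)/sqrt(n), e_i uniform on {-1, +1}."""
    # W >= z  is the event  K >= (n + z*sqrt(n))/2  with K ~ Binomial(n, 1/2).
    k_min = max(ceil((n + z * sqrt(n)) / 2 - 1e-12), 0)
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n


def tail_ratio(n, z):
    """Relative quantity P(W >= z) / (1 - Phi(z)) appearing in (11.1)."""
    return rademacher_tail(n, z) / normal_tail(z)
```

For instance, with `n = 400` the absolute difference between `rademacher_tail(n, 3.0)` and `normal_tail(3.0)` is tiny, while `tail_ratio(n, 3.0)` still differs visibly from 1; it is this ratio that the moderate deviation results of this chapter control.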
11.1 A Cramér Type Moderate Deviation Theorem

The following theorem gives moderate deviation bounds of Cramér type for $W$ in the context of the Stein identity (2.42).

L.H.Y. Chen et al., Normal Approximation by Stein's Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_11, © Springer-Verlag Berlin Heidelberg 2011

Theorem 11.1 Suppose that for a given random variable $W$ there exist a constant $\delta>0$, a random function $\hat K(t)$ and a random variable $R$ such that
\[ EWf(W) = E\int_{|t|\le\delta}f'(W+t)\hat K(t)\,dt + E\bigl(Rf(W)\bigr) \tag{11.3} \]
for all absolutely continuous functions $f$ for which the expectations on either side exist, and let
\[ \hat K_1 = \int_{|t|\le\delta}\hat K(t)\,dt. \tag{11.4} \]
If $\hat K(t)\ge 0$ and there exist constants $\delta_1$, $\delta_2$ and $d_0\ge 1$ such that
\[ \bigl|E(\hat K_1\mid W)-1\bigr| \le \delta_1\bigl(1+W^2\bigr), \tag{11.5} \]
\[ \bigl|E(R\mid W)\bigr| \le \delta_2\bigl(1+|W|\bigr) \tag{11.6} \]
and
\[ E(\hat K_1\mid W) \le d_0, \tag{11.7} \]
then
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)\Bigl(d_0\bigl(1+z^3\bigr)\delta + \bigl(1+z^4\bigr)\delta_1 + \bigl(1+z^2\bigr)\delta_2\Bigr) \tag{11.8} \]
for $0\le z\le d_0^{-1}\min(\delta^{-1/3},\delta_1^{-1/4},\delta_2^{-1/3})$, where $O(1)$ is bounded by a universal constant.
Additionally, a moderate deviation bound holds when a bounded zero bias coupling exists.

Theorem 11.2 Let $W$ and $W^*$ be defined on the same space, with $W^*$ having the $W$-zero biased distribution, such that $|W^*-W|\le\delta$. Then
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)\bigl(1+z^3\bigr)\delta \]
for $0\le z\le\delta^{-1/3}$, where $O(1)$ is bounded by a universal constant.

The following remarks may be useful.

Remark 11.1 Let $(W,W')$ be an exchangeable pair satisfying
\[ E(W-W'\mid W) = \lambda\bigl(W-R(W)\bigr) \tag{11.9} \]
and $|W-W'|\le\delta$. Then (11.3) is satisfied with $\hat K_1=\Delta^2/(2\lambda)$ where $\Delta=W-W'$; see Sect. 2.3.2, and (2.41) and (2.42) in particular.

Remark 11.2 One can show that if condition (11.5) in Theorem 11.1 is replaced by
\[ \bigl|E(\hat K_1\mid W)-1\bigr| \le \delta_1\bigl(1+|W|\bigr), \tag{11.10} \]
then
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)d_0\bigl(1+z^3\bigr)(\delta+\delta_1) + O(1)\bigl(1+z^2\bigr)\delta_2 \]
for $0\le z\le d_0^{-1}\min(\delta^{-1/3},\delta_1^{-1/3},\delta_2^{-1/3})$.
11.2 Applications

We apply Theorems 11.1 and 11.2 to four cases, all involving dependent random variables; namely, the combinatorial central limit theorem, the binary expansion of a random integer, the anti-voter model on a complete graph, and the Curie–Weiss model.

Example 11.1 (Combinatorial central limit theorem) With $n\ge 3$ let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers and $\pi$ a random permutation with the uniform distribution over $S_n$, the symmetric group on $\{1,\dots,n\}$, and
\[ Y = \sum_{i=1}^n a_{i\pi(i)}. \]
$L^1$ and Berry–Esseen bounds for the standardized variable $W=(Y-EY)/\sigma$ are obtained in Sects. 4.4 and 6.1, respectively, where $\sigma^2$ is given in (4.105). With
\[ c_0 = \max_{1\le i,j\le n}\bigl|a_{ij}-a_{i\cdot}-a_{\cdot j}+a_{\cdot\cdot}\bigr|, \]
the construction of $W^*$ with the $W$-zero biased distribution in Lemma 4.6, following the proof of Theorem 6.1, satisfies $|W^*-W|\le 8c_0/\sigma$. Therefore, by Theorem 11.2
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)\bigl(1+z^3\bigr)c_0/\sigma \]
for $0\le z\le(\sigma/c_0)^{1/3}$. Similar remarks apply to the combinatorial central limit theorem when $\pi$ is chosen with a distribution uniform, or constant, on cycle type; the bounded coupling for the former instance was applied to prove Theorem 6.3.

Example 11.2 (The anti-voter model) The discrete time anti-voter model was described in Sect. 6.4, where a bound to the normal was shown using the method of exchangeable pairs; some references and history may also be found there. We recall
that in this model each vertex of a graph is in state $+1$ or $-1$, and at each time step a randomly chosen vertex adopts the opposite state of one of its randomly chosen neighbors. Adopting notation from Sect. 6.4, here we prove a moderate deviation result for $W$ given by
\[ W = U/\sigma \quad\text{where}\quad U=\sum_{v\in V}X_v \quad\text{with}\quad \sigma^2=\operatorname{Var}(U), \]
the net standardized sign of the stationary distribution of the anti-voter chain, run on the complete graph with $n\ge 3$ vertices. Theorem 6.6 yields that if the process is run in equilibrium for a single time step to obtain $U'$, then $(U,U')$ is a $2/n$-Stein pair which satisfies
\[ |U-U'|\le 2 \quad\text{and}\quad E\bigl[(U'-U)^2\mid X\bigr] = \frac{8(a+b)}{n(n-1)}, \]
where $a$ and $b$ are the number of edges that are incident on vertices both of which are in state $+1$, or $-1$, respectively. In particular, according to Remark 11.1, identity (11.3) is satisfied with $\delta=2/\sigma$ and, since $R=0$, with $\delta_2=0$ in (11.6). Given $U$, there are $(n+U)/2$ vertices in state $+1$ and $(n-U)/2$ vertices in state $-1$. Since the graph is complete, we see that $a=(n+U)(n+U-2)/8$ and $b=(n-U)(n-U-2)/8$. Therefore
\[ E\bigl[(W'-W)^2\mid X\bigr] = \frac{1}{\sigma^2}E\bigl[(U'-U)^2\mid X\bigr] = \frac{8(a+b)}{\sigma^2n(n-1)} = \frac{2U^2+2n^2-4n}{\sigma^2n(n-1)} = \frac{2\sigma^2W^2+2n^2-4n}{\sigma^2n(n-1)}, \]
and so
\[ E(\hat K_1\mid W)-1 = \frac n4E\bigl[(W'-W)^2\mid W\bigr]-1 = \frac{W^2}{2(n-1)} - \frac{2\sigma^2(n-1)-(n^2-2n)}{2\sigma^2(n-1)}. \]
As $E(E(\hat K_1\mid W)-1)=0$ and $EW^2=1$, we conclude that $\sigma^2=(n^2-2n)/(2n-3)$. Hence,
\[ E(\hat K_1\mid W)-1 = \frac{W^2}{2(n-1)} - \frac{1}{2(n-1)}, \tag{11.11} \]
implying that (11.5) is satisfied with $\delta_1=1/(2(n-1))$. Additionally, as $\sigma^2=n(n-2)/(2n-3)\ge n/3$ for $n\ge 3$ and $|U|\le n$, we find that $W^2\le 3n$ and the quantity in (11.11) can be at most 2; hence we may take $d_0=3$ in (11.7). Hence, Theorem 11.1 yields the moderate deviation result
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)\bigl(1+z^3\bigr)/\sqrt n \]
for $0\le z\le n^{1/6}/(3\cdot 12^{1/6})$, noting that $z^4/n$ is dominated by $z^3/\sqrt n$ over the given range.
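The algebra in Example 11.2 can be verified with exact rational arithmetic. The following sketch (function names are hypothetical) recomputes the conditional second moment from $a$ and $b$, checks that it leads to (11.11) when $\sigma^2=(n^2-2n)/(2n-3)$, and confirms that the right side of (11.11) never exceeds 2 for $|U|\le n$, which is what permits $d_0=3$.

```python
from fractions import Fraction


def sigma2(n):
    """Variance of U on the complete graph, as derived in Example 11.2."""
    return Fraction(n * n - 2 * n, 2 * n - 3)


def cond_second_moment(n, u):
    """E[(U' - U)^2 | X] = 8(a + b) / (n(n - 1)) when the net sign is u."""
    a = Fraction((n + u) * (n + u - 2), 8)
    b = Fraction((n - u) * (n - u - 2), 8)
    return 8 * (a + b) / (n * (n - 1))


def k1_minus_one(n, u):
    """E(K_1 | W) - 1 = (n/4) E[(W' - W)^2 | W] - 1, with W = u / sigma."""
    return Fraction(n, 4) * cond_second_moment(n, u) / sigma2(n) - 1


def rhs_11_11(n, u):
    w2 = Fraction(u * u) / sigma2(n)
    return (w2 - 1) / (2 * (n - 1))
```

The largest value of `rhs_11_11(n, u)` over $|u|\le n$ is $(n-1)/(n-2)$, attained at $|u|=n$, which equals 2 when $n=3$.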
Example 11.3 (Binary expansion of a random integer) Following the conventions in Sect. 6.5, let $X$ be a random variable uniformly distributed over the set $\{0,1,\dots,n-1\}$ for some $n\ge 2$, and with $m=[\log_2(n-1)]+1$, let $S=X_1+\cdots+X_m$, the number of ones in the binary expansion of $X$. Set
\[ W = \frac{S-m/2}{\sqrt{m/4}} \tag{11.12} \]
and recall that for $(S,S')$ and $(W,W')$, the exchangeable pairs constructed there, $|S-S'|\le 1$, so
\[ |W-W'| \le \frac{2}{\sqrt m} \tag{11.13} \]
and equalities (6.138) and (6.139) hold, that is,
\[ E(W-W'\mid W) = \lambda\Bigl(W+\frac{E(Q\mid W)}{\sqrt m}\Bigr) \tag{11.14} \]
and
\[ \frac{1}{2\lambda}E\bigl[(W-W')^2\mid W\bigr] = 1-\frac{E(Q\mid W)}{m}, \tag{11.15} \]
where $\lambda=2/m$ and $Q=Q(X,n)$, as defined in (6.134). We prove a moderate deviation bound for $W$ in (11.12) with the help of the following result.

Lemma 11.1 There exists a constant $C$ such that with $W$ and $Q=Q(X,n)$ defined in (11.12) and (6.134), respectively,
\[ E(Q\mid W) \le C\bigl(1+|W|\bigr). \]

As the proof of this lemma is quite involved we defer it to the Appendix. By (11.13) we may take $\delta=2/\sqrt m$. Equality (11.14), (2.41) and Lemma 11.1 yield
\[ \bigl|E(R\mid W)\bigr| = E(Q\mid W)/\sqrt m \le (C/\sqrt m)\bigl(1+|W|\bigr), \]
so we may take $\delta_2=C/\sqrt m$ to satisfy (11.6). Similarly, (11.15) and Lemma 11.1 yield that (11.10) in Remark 11.2 is satisfied with $\delta_1=C/m$. Lastly, as $Q$ is nonnegative, by (11.15) we may take $d_0=1$ in (11.7). Hence, by Remark 11.2,
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)\bigl(1+z^3\bigr)/\sqrt m \]
for $0\le z\le m^{1/6}/(\max(2,C))^{1/3}$.

Example 11.4 (The Curie–Weiss model) The Curie–Weiss model is a simple statistical mechanical model of ferromagnetic interaction, where for $n\in\mathbb N$, a vector $\boldsymbol\sigma=(\sigma_1,\dots,\sigma_n)$ of 'spins' in $\{-1,1\}^n$ has joint probability mass function
\[ p(\boldsymbol\sigma) = C_\beta\exp\Bigl(\frac{\beta}{n}\sum_{i<j}\sigma_i\sigma_j\Bigr) \tag{11.16} \]
where $C_\beta$ is a normalizing constant and $\beta>0$ is known as the inverse temperature.
Let $\sigma^2$ be the variance of the total sum of spins $\sum_{i=1}^n\sigma_i$, and $W=\sum_{i=1}^n\sigma_i/\sigma$, the normalized sum. It is known that $W$ converges to a standard normal distribution as $n\to\infty$ when $0<\beta<1$, see, e.g., Ellis and Newman (1978a, 1978b). However, when $\beta=1$ the limiting distribution is non-normal; see Sect. 13.3 for a Berry–Esseen type bound in this instance. Here, focusing on the case $\beta\in(0,1)$, we prove
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)\bigl(1+z^3\bigr)/\sqrt n \tag{11.17} \]
for $0\le z\le n^{1/6}$, and $O(1)$ a constant that may depend on $\beta$. The proof is postponed to Sect. 11.4.
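For small systems the law of the total spin under (11.16) can be computed exactly, since $\sum_{i<j}\sigma_i\sigma_j=(M^2-n)/2$ depends on the configuration only through $M=\sum_i\sigma_i$. The sketch below (hypothetical function names, illustrative parameter values) uses this to compute $\sigma^2=\operatorname{Var}(M)$, which can then be compared with $n/(1-\beta)$, the order of magnitude that emerges in the proof of (11.17).

```python
from math import comb, exp


def magnetization_pmf(n, beta):
    """Exact law of M = sum_i sigma_i under (11.16), as a dict {M: probability}."""
    weights = {}
    for k in range(n + 1):                 # k = number of spins equal to +1
        m = 2 * k - n
        pair_sum = (m * m - n) / 2         # sum_{i<j} sigma_i sigma_j
        weights[m] = comb(n, k) * exp(beta * pair_sum / n)
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def spin_variance(n, beta):
    """sigma^2 = Var(M), computed from the exact law."""
    pmf = magnetization_pmf(n, beta)
    mean = sum(m * p for m, p in pmf.items())
    return sum((m - mean) ** 2 * p for m, p in pmf.items())
```

For example, `200 / ((1 - 0.5) * spin_variance(200, 0.5))` is close to 1, in line with $\sigma/\sqrt n$ being bounded away from zero and infinity when $0<\beta<1$.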
11.3 Preliminary Lemmas

To prove Theorem 11.1 we first develop two preliminary lemmas. Our first lemma gives a bound for the moment generating function of $W$.

Lemma 11.2 Let $W$ be a random variable satisfying the hypotheses of Theorem 11.1 for some $\delta>0$, $\delta_1\ge 0$, $0\le\delta_2\le 1/4$ and $d_0\ge 1$. Then for all $0<t\le 1/(2\delta)$ satisfying
\[ 8td_0(t\delta_1+2\delta_2) \le 1 \tag{11.18} \]
we have
\[ Ee^{tW} \le \exp\bigl(t^2/2+c_0(t)\bigr) \tag{11.19} \]
where
\[ c_0(t) = 30d_0\bigl(\delta_2t+\delta_1t^2+(\delta_2+\delta)t^3+\delta_1t^4\bigr). \tag{11.20} \]
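Proof Fix $a>0$, $t\in(0,1/(2\delta)]$ and $s\in(0,t]$, and let $f(w)=e^{s(w\wedge a)}$. Letting $h(s)=Ee^{s(W\wedge a)}$, firstly we prove that $h'(s)$ can be bounded by an expression in $h(s)$ and $EW^2f(W)$. Applying the bounded convergence theorem to differentiate under the expectation, and (11.3),
\[ \begin{aligned} h'(s) &= E(W\wedge a)e^{s(W\wedge a)} \le E\bigl(Wf(W)\bigr) \\ &= E\int_{|u|\le\delta}f'(W+u)\hat K(u)\,du + E\bigl(Rf(W)\bigr) \\ &= sE\int_{|u|\le\delta}e^{s(W+u)}\mathbf 1(W+u\le a)\hat K(u)\,du + E\bigl(e^{s(W\wedge a)}E(R\mid W)\bigr) \\ &\le sE\int_{|u|\le\delta}e^{s[(W+u)\wedge a]}\hat K(u)\,du + E\bigl(e^{s(W\wedge a)}E(R\mid W)\bigr) \\ &\le sE\int_{|u|\le\delta}e^{s(W\wedge a+\delta)}\hat K(u)\,du + E\bigl(e^{s(W\wedge a)}E(R\mid W)\bigr) \\ &= sE\int_{|u|\le\delta}e^{s(W\wedge a)}\hat K(u)\,du + sE\int_{|u|\le\delta}e^{s(W\wedge a)}\bigl(e^{s\delta}-1\bigr)\hat K(u)\,du + E\bigl(e^{s(W\wedge a)}E(R\mid W)\bigr) \\ &\le sEe^{s(W\wedge a)}\hat K_1 + sEe^{s(W\wedge a)}\bigl|e^{s\delta}-1\bigr|\hat K_1 + \delta_2E\bigl(1+|W|\bigr)e^{s(W\wedge a)}, \end{aligned} \]
where we have applied (11.4) and (11.6) to obtain the last inequality. Now, applying the simple inequality
\[ |e^x-1| \le 2|x| \quad\text{for } |x|\le 1 \]
followed by (11.7), the fact that $1+|w|\le 2(1+w^2)$ and then (11.5), we find that
\[ \begin{aligned} h'(s) &\le sEe^{s(W\wedge a)}\hat K_1 + sEe^{s(W\wedge a)}2s\delta\hat K_1 + \delta_2E\bigl(1+|W|\bigr)e^{s(W\wedge a)} \\ &\le sE\bigl(e^{s(W\wedge a)}E(\hat K_1\mid W)\bigr) + 2s^2d_0\delta Ee^{s(W\wedge a)} + 2\delta_2E\bigl(1+W^2\bigr)e^{s(W\wedge a)} \\ &= sEe^{s(W\wedge a)} + sEe^{s(W\wedge a)}\bigl[E(\hat K_1\mid W)-1\bigr] + 2s^2d_0\delta Ee^{s(W\wedge a)} + 2\delta_2E\bigl(1+W^2\bigr)e^{s(W\wedge a)} \\ &\le sEe^{s(W\wedge a)} + s\delta_1Ee^{s(W\wedge a)}\bigl(1+W^2\bigr) + 2s^2d_0\delta Ee^{s(W\wedge a)} + 2\delta_2E\bigl(1+W^2\bigr)e^{s(W\wedge a)}. \end{aligned} \tag{11.21} \]
Collecting terms and recalling $0<s\le t$ we obtain
\[ h'(s) \le \bigl(s(1+\delta_1+2td_0\delta)+2\delta_2\bigr)h(s) + (s\delta_1+2\delta_2)EW^2f(W). \tag{11.22} \]
Secondly, we show that $EW^2f(W)$ can be bounded by a function of $h(s)$ and $h'(s)$. Letting $g(w)=we^{s(w\wedge a)}$, and then arguing as for (11.22) and recalling $0<s\delta\le t\delta\le 1/2$,
\[ \begin{aligned} EW^2f(W) &= EWg(W) \\ &= E\int_{|u|\le\delta}\bigl(e^{s[(W+u)\wedge a]} + s(W+u)e^{s[(W+u)\wedge a]}\mathbf 1(W+u\le a)\bigr)\hat K(u)\,du + E\bigl(RWf(W)\bigr) \\ &\le E\int_{|u|\le\delta}\bigl(e^{s(W\wedge a)}e^{s\delta} + s\bigl[(W+u)\wedge a\bigr]e^{s(W\wedge a)}e^{s\delta}\bigr)\hat K(u)\,du + E\bigl(RWf(W)\bigr) \\ &\le e^{s\delta}E\bigl\{f(W)+sf(W)\bigl((W\wedge a)+\delta\bigr)\bigr\}\hat K_1 + \delta_2Ef(W)\bigl(1+|W|\bigr)|W| \\ &\le d_0e^{0.5}(1+0.5)Ef(W) + sd_0e^{0.5}E(W\wedge a)f(W) + \delta_2Ef(W)\bigl(1+2W^2\bigr) \\ &\le 3d_0h(s) + 2sd_0h'(s) + \delta_2h(s) + 2\delta_2EW^2f(W). \end{aligned} \tag{11.23} \]
Thus, recalling $\delta_2\le 1/4$, we have
\[ EW^2f(W) \le (6d_0+2\delta_2)h(s) + 4sd_0h'(s). \tag{11.24} \]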
We are now ready to prove (11.19). Substituting (11.24) into (11.22) yields
\[ \begin{aligned} h'(s) &\le \bigl(s(1+\delta_1+2td_0\delta)+2\delta_2\bigr)h(s) + (s\delta_1+2\delta_2)\bigl((6d_0+2\delta_2)h(s)+4sd_0h'(s)\bigr) \\ &= \bigl(s\bigl(1+\delta_1(1+6d_0+2\delta_2)+2td_0\delta\bigr) + 2\delta_2(1+6d_0+2\delta_2)\bigr)h(s) + 4sd_0(s\delta_1+2\delta_2)h'(s) \\ &\le \bigl(s\bigl(1+\delta_1(1+6d_0+2\delta_2)+2td_0\delta\bigr) + 2\delta_2(1+6d_0+2\delta_2)\bigr)h(s) + 4td_0(s\delta_1+2\delta_2)h'(s). \end{aligned} \]
Solving for $h'(s)$, we obtain
\[ h'(s) \le \bigl(sc_1(t)+c_2(t)\bigr)h(s), \tag{11.25} \]
where
\[ c_1(t) = \frac{1+\delta_1(1+6d_0+2\delta_2)+2td_0\delta}{1-c_3(t)}, \qquad c_2(t) = \frac{2\delta_2(1+6d_0+2\delta_2)}{1-c_3(t)} \qquad\text{with}\qquad c_3(t) = 4td_0(t\delta_1+2\delta_2). \tag{11.26} \]
Taking $t$ to satisfy (11.18) yields $c_3(t)\le 1/2$. Solving (11.25) we obtain
\[ h(t) \le \exp\Bigl(\frac{t^2}{2}c_1(t)+tc_2(t)\Bigr). \tag{11.27} \]
Now, using $c_3(t)\le 1/2$, (11.26), $\delta_2\le 1/4$, $d_0\ge 1$ and (11.20),
\[ \begin{aligned} \frac{t^2}{2}\bigl(c_1(t)-1\bigr)+tc_2(t) &= \frac{t^2}{2}\cdot\frac{\delta_1(1+6d_0+2\delta_2)+2td_0\delta+c_3(t)}{1-c_3(t)} + \frac{2t\delta_2(1+6d_0+2\delta_2)}{1-c_3(t)} \\ &\le t^2\bigl(\delta_1(1+6d_0+2\delta_2)+2td_0\delta+4td_0(t\delta_1+2\delta_2)\bigr) + 4t\delta_2(1+6d_0+2\delta_2) \\ &\le c_0(t) \end{aligned} \]
and hence
\[ \frac{t^2}{2}c_1(t)+tc_2(t) \le \frac{t^2}{2}+c_0(t). \]
Hence letting $a\to\infty$ in (11.27) we obtain (11.19), as desired.
Lemma 11.3 Let $W$ be a random variable satisfying the hypotheses of Theorem 11.1 for some nonnegative $\delta$, $\delta_1$ and $\delta_2$ with $\max(\delta,\delta_1,\delta_2)\le 1/256$ and $d_0\ge 1$. Then for all
\[ t\in\bigl[0,\ d_0^{-1}\min\bigl(\delta^{-1/3},\delta_1^{-1/4},\delta_2^{-1/3}\bigr)\bigr] \tag{11.28} \]
and integers $k\ge 1$, there exists a finite universal constant $C$ such that
\[ \int_0^tu^ke^{u^2/2}P(W\ge u)\,du \le C\bigl(1+t^k\bigr). \]

Proof Recalling (11.20), for $t$ satisfying (11.28) it is easy to see that $c_0(t)\le 93$ and (11.18) is satisfied. Hence, by Lemma 11.2, inequality (11.19) holds. Write
\[ \int_0^tu^ke^{u^2/2}P(W\ge u)\,du = \int_0^{[t]}u^ke^{u^2/2}P(W\ge u)\,du + \int_{[t]}^tu^ke^{u^2/2}P(W\ge u)\,du. \]
For the first integral, noting that for $j\ge 1$ we have $\sup_{j-1\le u\le j}e^{u^2/2-ju}=e^{(j-1)^2/2-j(j-1)}$,
\[ \begin{aligned} \int_0^{[t]}u^ke^{u^2/2}P(W\ge u)\,du &\le \sum_{j=1}^{[t]}j^k\int_{j-1}^je^{u^2/2-ju}e^{ju}P(W\ge u)\,du \\ &\le \sum_{j=1}^{[t]}j^ke^{(j-1)^2/2-j(j-1)}\int_{j-1}^je^{ju}P(W\ge u)\,du \\ &\le 2\sum_{j=1}^{[t]}j^ke^{-j^2/2}\int_{j-1}^je^{ju}P(W\ge u)\,du \\ &\le 2\sum_{j=1}^{[t]}j^ke^{-j^2/2}\int_{-\infty}^\infty e^{ju}P(W\ge u)\,du \\ &= 2\sum_{j=1}^{[t]}j^ke^{-j^2/2}(1/j)Ee^{jW} \\ &\le 2\sum_{j=1}^{[t]}j^{k-1}\exp\bigl(-j^2/2+j^2/2+c_0(j)\bigr) \\ &\le 2e^{c_0(t)}\sum_{j=1}^{[t]}j^{k-1} \le C\bigl(1+t^k\bigr). \end{aligned} \]
Similarly, we have
\[ \begin{aligned} \int_{[t]}^tu^ke^{u^2/2}P(W\ge u)\,du &\le t^k\int_{[t]}^te^{u^2/2-tu}e^{tu}P(W\ge u)\,du \\ &\le t^ke^{[t]^2/2-t[t]}\int_{[t]}^te^{tu}P(W\ge u)\,du \\ &\le 2t^ke^{-t^2/2}\int_{-\infty}^\infty e^{tu}P(W\ge u)\,du \\ &\le C\bigl(1+t^k\bigr), \end{aligned} \]
completing the proof.
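The claim at the start of the proof of Lemma 11.3, that $c_0(t)\le 93$ and (11.18) holds throughout the range (11.28), can be checked numerically at the right endpoint, where $c_0$ is largest. The sketch below is an illustration only; it assumes strictly positive $\delta,\delta_1,\delta_2$ (so that the negative powers are finite), and the function names and the parameter grid are arbitrary choices.

```python
def c0(t, d0, delta, delta1, delta2):
    """The function c_0(t) of (11.20)."""
    return 30 * d0 * (delta2 * t + delta1 * t ** 2
                      + (delta2 + delta) * t ** 3 + delta1 * t ** 4)


def t_max(d0, delta, delta1, delta2):
    """Right endpoint of the range (11.28); all deltas assumed > 0."""
    return min(delta ** (-1 / 3), delta1 ** (-1 / 4), delta2 ** (-1 / 3)) / d0


def condition_11_18(t, d0, delta1, delta2):
    """Condition (11.18): 8 t d0 (t delta1 + 2 delta2) <= 1."""
    return 8 * t * d0 * (t * delta1 + 2 * delta2) <= 1


def lemma_11_3_preconditions(d0, delta, delta1, delta2):
    """True if, at t = t_max, c_0(t) <= 93, (11.18) holds and t <= 1/(2 delta)."""
    t = t_max(d0, delta, delta1, delta2)
    return (c0(t, d0, delta, delta1, delta2) <= 93
            and condition_11_18(t, d0, delta1, delta2)
            and t <= 1 / (2 * delta))
```

Since $c_0$ and the left side of (11.18) are increasing in $t$, a check at `t_max` covers the whole interval for the given parameters.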
11.4 Proofs of Main Results

Proof of Theorem 11.1 For $z\ge 0$ the factor $d_0(1+z^3)\delta+(1+z^4)\delta_1+(1+z^2)\delta_2$ in the error term of (11.8) is bounded below by $d_0\delta$. Note also that $1/(1-\Phi(z))\le 1/(1-\Phi(z_0))$ for $0\le z\le z_0$. Therefore, (11.8) is trivial if the range of $z$ is bounded, by, say 7. Hence, we can assume
\[ d_0^{-1}\min\bigl(\delta^{-1/3},\delta_1^{-1/4},\delta_2^{-1/3}\bigr) \ge 7. \tag{11.29} \]
Let $f=f_z$ be the solution of the Stein equation (2.2) for $0\le z\le d_0^{-1}\min(\delta^{-1/3},\delta_1^{-1/4},\delta_2^{-1/3})$. By (11.3), (2.2) and (11.4),
\[ \begin{aligned} EWf(W)-ERf(W) &= E\int_{|t|\le\delta}f'(W+t)\hat K(t)\,dt \\ &= E\int_{|t|\le\delta}\bigl\{(W+t)f(W+t)+1-\Phi(z)-\mathbf 1(W+t>z)\bigr\}\hat K(t)\,dt \\ &= E\int_{|t|\le\delta}\bigl\{(W+t)f(W+t)-Wf(W)\bigr\}\hat K(t)\,dt + EWf(W)\hat K_1 + E\int_{|t|\le\delta}\bigl\{1-\Phi(z)-\mathbf 1(W+t>z)\bigr\}\hat K(t)\,dt \\ &\le E\int_{|t|\le\delta}\bigl\{(W+\delta)f(W+\delta)-Wf(W)\bigr\}\hat K(t)\,dt + EWf(W)\hat K_1 + E\int_{|t|\le\delta}\bigl\{1-\Phi(z)-\mathbf 1(W>z+\delta)\bigr\}\hat K(t)\,dt, \end{aligned} \]
where, in the final inequality, we have applied (2.6), that is, the monotonicity of $wf(w)$, and the assumption that $\hat K(t)$ is non-negative. Again applying (11.4), the expression above can be written
\[ \begin{aligned} &E\bigl\{(W+\delta)f(W+\delta)-Wf(W)\bigr\}\hat K_1 + EWf(W)\hat K_1 + E\bigl\{1-\Phi(z)-\mathbf 1(W>z+\delta)\bigr\}\hat K_1 \\ &\quad= 1-\Phi(z)-P(W>z+\delta) + E\bigl\{(W+\delta)f(W+\delta)-Wf(W)\bigr\}\hat K_1 + EWf(W)\hat K_1 \\ &\qquad + E\bigl\{1-\Phi(z)-\mathbf 1(W>z+\delta)\bigr\}(\hat K_1-1). \end{aligned} \]
Therefore, we have
\[ \begin{aligned} P(W>z+\delta)-\bigl(1-\Phi(z)\bigr) &\le E\bigl\{(W+\delta)f(W+\delta)-Wf(W)\bigr\}\hat K_1 + EWf(W)(\hat K_1-1) \\ &\quad + E\bigl\{1-\Phi(z)-\mathbf 1(W>z+\delta)\bigr\}(\hat K_1-1) + ERf(W) \\ &\le d_0E\bigl\{(W+\delta)f(W+\delta)-Wf(W)\bigr\} + \delta_1E\bigl(|W|\bigl(1+W^2\bigr)f(W)\bigr) \\ &\quad + \delta_1E\bigl|1-\Phi(z)-\mathbf 1(W>z+\delta)\bigr|\bigl(1+W^2\bigr) + \delta_2E\bigl(1+|W|\bigr)f(W) \end{aligned} \]
where we have again applied the monotonicity of $wf(w)$, inequality (2.9) giving that $f(w)\ge 0$, as well as (11.5), (11.6) and (11.7). Rewriting, we have that
\[ P(W>z+\delta)-\bigl(1-\Phi(z)\bigr) \le d_0I_1+\delta_1I_2+\delta_1I_3+\delta_2I_4, \tag{11.30} \]
where
\[ \begin{aligned} I_1 &= E\bigl\{(W+\delta)f(W+\delta)-Wf(W)\bigr\}, & I_2 &= E\bigl(|W|\bigl(1+W^2\bigr)f(W)\bigr), \\ I_3 &= E\bigl|1-\Phi(z)-\mathbf 1(W>z+\delta)\bigr|\bigl(1+W^2\bigr) \quad\text{and} & I_4 &= E\bigl(1+|W|\bigr)f(W). \end{aligned} \]
We will consider $I_2$ first, and again using $f(w)\ge 0$, apply
\[ |w|\bigl(1+w^2\bigr)f(w) \le 2\bigl(1+|w|^3\bigr)f(w). \tag{11.31} \]
Recalling inequality (2.11),
\[ e^{z^2/2}\bigl(1-\Phi(z)\bigr) \le \min\Bigl(\frac12,\frac{1}{z\sqrt{2\pi}}\Bigr) \quad\text{for } z>0, \tag{11.32} \]
and the form (2.3) of the solution $f=f_z$ from Lemma 2.2, to bound the first term arising from the expectation of (11.31) we have
\[ \begin{aligned} Ef(W) &\le \sqrt{\pi/2}\,P(W>z) + \sqrt{\pi/2}\bigl(1-\Phi(z)\bigr)P(W\le 0) + \sqrt{2\pi}\bigl(1-\Phi(z)\bigr)Ee^{W^2/2}\mathbf 1(0<W\le z) \\ &\le \sqrt{\pi/2}\,P(W>z) + \sqrt{\pi/2}\bigl(1-\Phi(z)\bigr) + \sqrt{2\pi}\bigl(1-\Phi(z)\bigr)Ee^{W^2/2}\mathbf 1(0<W\le z). \end{aligned} \tag{11.33} \]
Note that (11.29) implies $\max(\delta,\delta_1,\delta_2)\le 1/256$. Hence the hypotheses of Lemma 11.2 are satisfied, and therefore also the conclusion of Lemma 11.3. Now note that since $c_0$ in (11.20) is bounded over the given range of $z$, it follows from Lemma 11.2 that
\[ Ee^{zW} \le Ce^{z^2/2} \quad\text{and hence}\quad P(W>z) \le e^{-z^2}Ee^{zW} \le Ce^{-z^2/2}, \tag{11.34} \]
where $C$ denotes an absolute constant, not necessarily the same at each occurrence. This last inequality handles the first term in (11.33). We will apply the identities, for any absolutely continuous function $g$, that
\[ \int_0^zg'(y)P(W>y)\,dy = g(z)P(W>z) - g(0)P(W>0) + Eg(W)\mathbf 1(0<W\le z), \]
and
\[ \int_z^\infty g'(y)P(W>y)\,dy = -g(z)P(W>z) + Eg(W)\mathbf 1(W>z). \tag{11.35} \]
Now, to handle the last term in (11.33), by Lemma 11.3,
\[ Ee^{W^2/2}\mathbf 1(0<W\le z) \le P(0<W\le z) + \int_0^zye^{y^2/2}P(W>y)\,dy \le C(1+z). \]
For the second term in (11.31), similarly, by (2.7), (11.32) and (2.3),
\[ E|W|^3f(W) \le EW^2\mathbf 1(W>z) + \bigl(1-\Phi(z)\bigr)EW^2\mathbf 1(W<0) + \sqrt{2\pi}\bigl(1-\Phi(z)\bigr)EW^3e^{W^2/2}\mathbf 1(0<W\le z). \]
The second term is clearly bounded by $2(1-\Phi(z))$, and we may bound the last expectation as
\[ EW^3e^{W^2/2}\mathbf 1(0<W\le z) \le \int_0^z\bigl(y^4+3y^2\bigr)e^{y^2/2}P(W>y)\,dy \le C\bigl(1+z^4\bigr), \tag{11.36} \]
applying Lemma 11.3 again. As to $EW^2\mathbf 1(W>z)$, first, using (11.34),
\[ \int_z^\infty yP(W>y)\,dy \le Ee^{zW}\int_z^\infty ye^{-zy}\,dy = Ee^{zW}z^{-2}\bigl(1+z^2\bigr)e^{-z^2} \le Ce^{-z^2/2}z^{-2}\bigl(1+z^2\bigr) \]
for $z>1$. Thus, for all such $z$, by (11.35) and (11.34),
\[ EW^2\mathbf 1(W>z) = z^2P(W>z) + \int_z^\infty 2yP(W>y)\,dy \le C\bigl(1+z^2\bigr)e^{-z^2/2} \le C\bigl(1+z^3\bigr)\bigl(1-\Phi(z)\bigr). \tag{11.37} \]
Now, by (11.3) with $f(w)=w$ and (11.6) and (11.5), we have
\[ \begin{aligned} EW^2 &= E\int_{|t|\le\delta}\hat K(t)\,dt + E(RW) \\ &\le E(\hat K_1) + \delta_2E\bigl(|W|+W^2\bigr) \\ &\le E(\hat K_1) + \delta_2E\bigl(1+2W^2\bigr) \\ &\le (1+\delta_1+\delta_2) + (\delta_1+2\delta_2)EW^2 \\ &\le 5/4 + EW^2/4, \end{aligned} \]
yielding $EW^2\le 2$. Hence (11.37) remains valid for $0\le z\le 1$ since $EW^2\mathbf 1(W>z)\le EW^2\le 2$. Summarizing, we have
\[ I_2 \le C\bigl(1+z^4\bigr)\bigl(1-\Phi(z)\bigr), \]
and in a similar fashion one may demonstrate
\[ I_4 \le C\bigl(1+z^2\bigr)\bigl(1-\Phi(z)\bigr) \tag{11.38} \]
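and
\[ I_3 \le 3\bigl(1-\Phi(z)\bigr) + E\mathbf 1(W\ge\delta+z)\bigl(1+W^2\bigr) \le C\bigl(1+z^3\bigr)\bigl(1-\Phi(z)\bigr). \]
Lastly, to handle $I_1$, letting $g(w)=(wf(w))'$ and recalling (2.81),
\[ g(w) = \begin{cases} \bigl(\sqrt{2\pi}(1+w^2)e^{w^2/2}(1-\Phi(w))-w\bigr)\Phi(z), & w>z, \\ \bigl(\sqrt{2\pi}(1+w^2)e^{w^2/2}\Phi(w)+w\bigr)\bigl(1-\Phi(z)\bigr), & w<z, \end{cases} \]
and the inequality
\[ 0 \le \sqrt{2\pi}\bigl(1+w^2\bigr)e^{w^2/2}\bigl(1-\Phi(w)\bigr)-w \le \frac{2}{1+w^3} \quad\text{for } w\ge 0 \]
from (5.4) of Chen and Shao (2001), we have for $0\le t\le\delta$,
\[ \begin{aligned} Eg(W+t) &= Eg(W+t)\mathbf 1\{W+t\ge z\} + Eg(W+t)\mathbf 1\{W+t\le 0\} + Eg(W+t)\mathbf 1\{0<W+t<z\} \\ &\le \frac{2}{1+z^3}P(W+t\ge z) + 2\bigl(1-\Phi(z)\bigr)P(W+t\le 0) \\ &\quad + \sqrt{2\pi}\bigl(1-\Phi(z)\bigr)E\bigl\{\bigl[1+(W+t)^2+(W+t)\bigr]e^{(W+t)^2/2}\bigr\}\mathbf 1\{0<W+t<z\} \\ &\le C\bigl(1+z^3\bigr)\bigl(1-\Phi(z)\bigr), \end{aligned} \]
by arguing as in (11.36) for the final term. Now writing $I_1=\int_0^\delta Eg(W+t)\,dt$, putting everything together and using the continuity of the right hand side in $z$ to replace the strict inequality in (11.30) by a non-strict one, we obtain
\[ P(W\ge z+\delta)-\bigl(1-\Phi(z)\bigr) \le C\bigl(1-\Phi(z)\bigr)\Bigl(d_0\bigl(1+z^3\bigr)\delta + \bigl(1+z^4\bigr)\delta_1 + \bigl(1+z^2\bigr)\delta_2\Bigr). \tag{11.39} \]
Now note that for $\delta z\le 1$ and $z\ge 0$,
\[ \begin{aligned} \bigl(1-\Phi(z-\delta)\bigr)-\bigl(1-\Phi(z)\bigr) &= \frac{1}{\sqrt{2\pi}}\int_{z-\delta}^ze^{-t^2/2}\,dt \\ &\le \frac{1}{\sqrt{2\pi}}\delta e^{-(z-\delta)^2/2} \le \frac{1}{\sqrt{2\pi}}\delta e^{-z^2/2+z\delta} \le \frac{1}{\sqrt{2\pi}}\delta e^{-z^2/2+1} \\ &\le e\delta(1+z)\bigl(1-\Phi(z)\bigr) \le 3(1+z)\delta\bigl(1-\Phi(z)\bigr) \le 6\bigl(1+z^3\bigr)\delta\bigl(1-\Phi(z)\bigr). \end{aligned} \]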
11
Moderate Deviations
For the third to last inequality we have used the fact that g(z) ≥ 0 for all z ≥ 0, where 1 2 g(z) = 1 − (z) − √ e−z /2 , 2π(1 + z) which can be shown by verifying g (z) ≤ 0 for all z ≥ 0, and limz→∞ g(z) = 0. Hence P (W ≥ z) − 1 − (z) = P (W ≥ z) − 1 − (z − δ) + 1 − (z − δ) − 1 − (z) ≤ P (W ≥ z) − 1 − (z − δ) + 6 1 + z3 δ 1 − (z) . Now, from (11.39), with C not necessarily the same at the occurrence, P (W ≥ z) − 1 − (z) ≤ C 1 − (z) d0 1 + z3 δ + 1 + z4 δ1 + 1 + z2 δ2 . As a corresponding lower bound may be shown in the same manner, the proof of Theorem 11.1 is complete. The proof of Theorem 11.2 follows the lines same as the proof of Theorem 11.1, with Kˆ 1 = 1, δ1 = δ2 = 0 and d0 = 1; we omit the details. We now prove our moderate deviation result for the Curie–Weiss model. Proof of (11.17) For each i ∈ {1, . . . , n} let σi be a random sample from the conditional distribution of σi given {σj , j = i, 1 ≤ j ≤ n}. Let I be a random index uniformly distributed over {1, . . . , n} independent of {σi , σi : 1 ≤ i ≤ n}. Recalling that σ 2 is the variance of the total spin ni=1 σi , and that W = ni=1 σi /σ , define W = W − (σI − σI )/σ . Then (W, W ) is an exchangeable pair. Let A(w) =
exp(−βσ w/n + β/n) , exp(βσ w/n − β/n) + exp(−βσ w/n + β/n)
and exp(βσ w/n + β/n) . exp(βσ w/n + β/n) + exp(−βσ w/n − β/n) With σ = (σ1 , . . . , σn ), from (11.16) we obtain B(w) =
n 1 E σi − σi |σ nσ i=1 1 = 2P σi = −1|σ + (−2)P σi = 1|σ nσ
E(W − W |σ ) =
i: σi =1
i: σi =−1
1 (n + σ W )A(W ) − (n − σ W )B(W ) = nσ A(W ) − B(W ) A(W ) + B(W ) W+ , = n σ
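and hence
\[ E(W-W'\mid W) = \frac{A(W)+B(W)}{n}W + \frac{A(W)-B(W)}{\sigma}. \]
Similarly,
\[ \begin{aligned} E\bigl[(W-W')^2\mid W\bigr] &= E\bigl[E\bigl((W-W')^2\mid\boldsymbol\sigma\bigr)\mid W\bigr] \\ &= E\Bigl[\frac{1}{n\sigma^2}\sum_{i=1}^nE\bigl((\sigma_i'-\sigma_i)^2\mid\boldsymbol\sigma\bigr)\Bigm| W\Bigr] \\ &= \frac{1}{n\sigma^2}\bigl((n+\sigma W)2A(W)+(n-\sigma W)2B(W)\bigr) \\ &= \frac{2(A(W)+B(W))}{\sigma^2} + \frac{2(A(W)-B(W))}{n\sigma}W. \end{aligned} \]
It is easy to see that
\[ \frac{e^{-\beta\sigma w/n}}{e^{-\beta\sigma w/n}+e^{\beta\sigma w/n}} \le A(w) = \frac{1}{1+\exp(2\beta\sigma w/n-2\beta/n)} \le \frac{e^{2\beta/n}}{1+\exp(2\beta\sigma w/n)} = \frac{e^{-\beta\sigma w/n}e^{2\beta/n}}{e^{-\beta\sigma w/n}+e^{\beta\sigma w/n}} \]
and similarly,
\[ \frac{e^{\beta\sigma w/n}}{e^{-\beta\sigma w/n}+e^{\beta\sigma w/n}} \le B(w) = \frac{1}{1+\exp(-2\beta\sigma w/n-2\beta/n)} \le \frac{e^{\beta\sigma w/n}e^{2\beta/n}}{e^{-\beta\sigma w/n}+e^{\beta\sigma w/n}}. \]
Therefore
\[ A(W)+B(W) = 1+O(1)\frac1n \quad\text{and}\quad A(W)-B(W) = -\tanh(\beta\sigma W/n)+O(1)\frac1n. \]
Hence we have
\[ \begin{aligned} E(W-W'\mid W) &= \frac1n\Bigl(1+O(1)\frac1n\Bigr)W + \frac1\sigma\Bigl(-\tanh(\beta\sigma W/n)+O(1)\frac1n\Bigr) \\ &= \frac Wn + O(1)\frac{W}{n^2} + O(1)\frac{1}{n\sigma} + \frac1\sigma\Bigl(-\beta\sigma W/n-\frac{\tanh''(\xi)}{2}(\beta\sigma W/n)^2\Bigr) \\ &= \frac{1-\beta}{n}W - \frac{\tanh''(\xi)\beta^2\sigma}{2n^2}W^2 + O(1)\frac{1}{n\sigma}, \end{aligned} \tag{11.40} \]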
using the fact that $|\sigma W|\le n$, and likewise
\[ \begin{aligned} E\bigl[(W-W')^2\mid W\bigr] &= \frac{2}{\sigma^2} + O(1)\frac{1}{n\sigma^2} + O(1)\frac{W}{n^2\sigma} + \frac{2W}{n\sigma}\bigl(-\tanh'(\eta)\beta\sigma W/n\bigr) \\ &= \frac{2}{\sigma^2} - \frac{2\tanh'(\eta)\beta W^2}{n^2} + O(1)\frac{1}{n\sigma^2}, \end{aligned} \tag{11.41} \]
where $\xi$ and $\eta$ lie between 0 and $\beta\sigma W/n$. From (11.40) and Remark 11.1, $W$ satisfies (11.3) with $\lambda=(1-\beta)/n$, $\hat K_1=(W-W')^2/(2\lambda)$ and
\[ R = \frac{\tanh''(\xi)\beta^2\sigma}{2n(1-\beta)}W^2 + O(1)\frac1\sigma. \tag{11.42} \]
Further, from (11.41),
\[ E[\hat K_1\mid W]-1 = \frac{1}{2\lambda}E\bigl[(W-W')^2\mid W\bigr]-1 = \frac{n}{(1-\beta)\sigma^2} - 1 - \frac{\tanh'(\eta)\beta W^2}{n(1-\beta)} + O(1)\frac{1}{\sigma^2}. \tag{11.43} \]
Since (11.9) holds, the expected value of the left hand side of (11.43) is $-E[RW]$. Hence, using that $EW=0$, making the second term in (11.42) vanish after multiplying by $W$ and taking expectation, we obtain
\[ \frac{n}{(1-\beta)\sigma^2} - 1 - \frac{E(\tanh'(\eta)\beta W^2)}{n(1-\beta)} + O(1)\frac{1}{\sigma^2} = -E\Bigl(\frac{\tanh''(\xi)\beta^2\sigma W^3}{2n(1-\beta)}\Bigr). \tag{11.44} \]
On the left hand side, since $\tanh'(x)$ is bounded on $\mathbb R$ and $EW^2=1$, the third term is $O(1/n)$, and the last term is of smaller order than the first. On the right hand side, as $\tanh''(x)$ has sign opposite that of $x$, we conclude $\tanh''(\xi)W^3\le 0$, as $\xi$ lies between 0 and $\beta\sigma W/n$. Hence the right hand side above is nonnegative. As $\tanh''(x)$ is bounded on $\mathbb R$, $|W^3|\le nW^2/\sigma$ and $EW^2=1$, the right hand side is also bounded. Hence $n/((1-\beta)\sigma^2)$ is of order 1, and $\sigma/\sqrt n$ is bounded away from 0 and infinity. Note now from (11.44) that if $E|W^3|\le C$ then
\[ \frac{n}{(1-\beta)\sigma^2}-1 = O(1/\sqrt n), \]
implying, by (11.43), that
\[ \Bigl|\frac{1}{2\lambda}E\bigl[(W-W')^2\mid W\bigr]-1\Bigr| \le Cn^{-1/2}\bigl(1+|W|\bigr). \tag{11.45} \]
Next we prove $E|W^3|\le C$. Letting $f(w)=w|w|$, for which $f'(w)=2|w|$, substitution into (11.3), and (11.43) and (11.42) yield
\[ \begin{aligned} E|W|^3 &= EWf(W) = E\int_{|t|\le\delta}2|W|\hat K(t)\,dt + E\bigl(Rf(W)\bigr) \\ &= 2E|W| + 2E\bigl(|W|\bigl(E[\hat K_1\mid W]-1\bigr)\bigr) + E\bigl(Rf(W)\bigr) \\ &= 2E|W| + 2E|W|\Bigl(\frac{n}{(1-\beta)\sigma^2}-1-\frac{\tanh'(\eta)\beta W^2}{n(1-\beta)}+O(1)\frac{1}{\sigma^2}\Bigr) + E\Bigl(\Bigl(\frac{\tanh''(\xi)\beta^2\sigma W^2}{2n(1-\beta)}+O(1)\frac1\sigma\Bigr)f(W)\Bigr) \\ &= 2E|W| + 2E|W|\Bigl(\frac{n}{(1-\beta)\sigma^2}-1\Bigr) + O\Bigl(\frac1n\Bigr)E|W|^3 + O(1)\frac{1}{\sigma^2}E|W| \\ &\quad + E\Bigl(\frac{\tanh''(\xi)\beta^2\sigma W^3}{2n(1-\beta)}|W|\Bigr) + O(1)\frac1\sigma Ef(W). \end{aligned} \]
As $\tanh''(\xi)W^3\le 0$, and $n/((1-\beta)\sigma^2)-1=O(1)$, the right hand side is $O(1)+O(1/n)E|W^3|$, hence $E|W^3|=O(1)$, as desired. By (11.42) and the fact that $\sigma/\sqrt n$ is bounded away from zero and infinity, in place of Condition (11.6) we have instead that
\[ \bigl|E(R\mid W)\bigr| \le \delta_2\bigl(1+W^2\bigr) \quad\text{where } \delta_2 = \frac{C}{\sqrt n}. \tag{11.46} \]
However, simple modifications can be made in the proofs of Lemma 11.2 and Theorem 11.1 so that (11.17) holds. First, note that the inequality $(1+|w|)\le 2(1+w^2)$ is used in (11.21) to bound the first application of (11.6) in Lemma 11.2. Next, since $\xi$ is between 0 and $\beta\sigma W/n$, the terms $\tanh''(\xi)$ and $W$ have opposite signs. Hence, in the display (11.23) in Lemma 11.2, for the first term of the remainder $R$ in (11.42) we have
\[ E\Bigl(\frac{\tanh''(\xi)\beta^2\sigma W^2}{2n(1-\beta)}We^{s(W\wedge a)}\Bigr) \le 0, \]
while the second term, of order $1/\sigma$, that is, order $1/\sqrt n$, can be absorbed after the indicated multiplication by $W$ in the existing term $\delta_2Ef(W)(1+|W|)|W|$, with $\delta_2$ of order $1/\sqrt n$. Hence (11.24), and Lemma 11.2 remain valid. In the proof of Theorem 11.1, the present case can be handled by replacing $I_4=E(1+|W|)f(W)$ by $I_4=E(1+W^2)f(W)$, resulting in the bound
\[ I_4 \le C\bigl(1+z^3\bigr)\bigl(1-\Phi(z)\bigr) \]
in place of (11.38). By (11.43) we may take $d_0=O(1)$, and since $|W-W'|=|\sigma_I-\sigma_I'|/\sigma$, we have $\delta=O(1/\sqrt n)$. Likewise, by (11.46) and (11.45) we may take $\delta_2$ and $\delta_1$ respectively, both of order $O(1/\sqrt n)$. Hence, in view of (11.45) and Remark 11.2, we have the following moderate deviation result for $W$
\[ \frac{P(W\ge z)}{1-\Phi(z)} = 1 + O(1)d_0\bigl(1+z^3\bigr)\delta + O(1)\bigl(1+z^3\bigr)\delta_1 + O(1)\bigl(1+z^3\bigr)\delta_2. \]
This completes the proof of (11.17).
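The two estimates for $A(W)\pm B(W)$ on which the proof of (11.17) rests can be checked numerically. In the sketch below, `x` stands for $\beta\sigma w/n$ (which lies in $[-\beta,\beta]$ because $|\sigma W|\le n$) and `c` for $\beta/n$; these abbreviations, the function names and the explicit constants $\beta$ and $\beta/2$ in the error terms are assumptions introduced for the illustration, obtained from the elementary fact that each of $A$ and $B$ changes by at most $c/2$ when $c$ is switched on.

```python
from math import exp, tanh


def A(x, c):
    """A(w) of the proof, written as 1 / (1 + exp(2x - 2c)) with x = beta*sigma*w/n, c = beta/n."""
    return 1.0 / (1.0 + exp(2.0 * x - 2.0 * c))


def B(x, c):
    """B(w) of the proof, written as 1 / (1 + exp(-2x - 2c))."""
    return 1.0 / (1.0 + exp(-2.0 * x - 2.0 * c))


def sum_error(x, n, beta):
    """n * |A + B - 1|, which should stay bounded (here by beta)."""
    c = beta / n
    return n * abs(A(x, c) + B(x, c) - 1.0)


def diff_error(x, n, beta):
    """n * |A - B + tanh(x)|, which should stay bounded (here by beta / 2)."""
    c = beta / n
    return n * abs(A(x, c) - B(x, c) + tanh(x))
```

With `c = 0` the identities $A+B=1$ and $A-B=-\tanh(x)$ hold exactly, so the $O(1)/n$ terms in the proof come entirely from the shift $\beta/n$.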
Appendix

Proof of Lemma 11.1 Write
\[
n - 1 = \sum_{i \ge 1} 2^{m - p_i},
\]
with $1 = p_1 < p_2 < \cdots \le m_1$ the positions of the ones in the binary expansion of $n - 1$, where $m_1 \le m$. Recall that $X$ is uniformly distributed over $\{0, 1, \ldots, n - 1\}$, and that
\[
X = \sum_{i=1}^{m} X_i 2^{m-i},
\]
with exactly $S$ of the indicator variables $X_1, \ldots, X_m$ equal to 1. We say that $X$ falls in category $i$, $i = 1, \ldots, m_1$, when
\[
X_{p_1} = 1, \quad X_{p_2} = 1, \quad \ldots, \quad X_{p_{i-1}} = 1 \quad \text{and} \quad X_{p_i} = 0, \tag{11.47}
\]
and in category $m_1 + 1$ if $X = n - 1$. This last category is nonempty only when $S = m_1$ and in this case, $Q = m - m_1$, which gives the last term in (11.48). Note that if $X$ is in category $i$ for $i \le m_1$, then, since $X$ can be no greater than $n - 1$, the digits of $X$ and $n - 1$ match up to the $p_i$th, except for the digit in place $p_i$, where $n - 1$ has a one, and $X$ a zero. Further, up to this digit, $n - 1$ has $p_i - i$ zeros, and so $X$ has $a_i = p_i - i + 1$ zeros. Changing any of these $a_i$ zeros of $X$, except the zero in position $p_i$, to one results in a number $n - 1$ or greater, while changing any other zeros, since digit $p_i$ of $n - 1$ is one and that same digit of $X$ is zero, does not. Hence $Q$ is at most $a_i$ when $X$ falls in category $i$. Since $X$ has $S$ ones in its expansion, $i - 1$ of which are accounted for by (11.47), conditional on $S$ the remaining $S - (i - 1)$ ones are uniformly distributed over the $m - p_i = m - (i - 1) - a_i$ remaining digits $\{X_{p_i + 1}, \ldots, X_m\}$. Thus, we have the inequality
\[
E(Q \mid S) \le \frac{1}{A} \sum_{i \ge 1} \binom{m - (i-1) - a_i}{S - (i-1)} a_i + \frac{I(S = m_1)}{A} (m - m_1) \tag{11.48}
\]
where
\[
A = \sum_{i \ge 1} \binom{m - (i-1) - a_i}{S - (i-1)} + I(S = m_1),
\]
and $1 = a_1 \le a_2 \le a_3 \le \cdots$. Note that if $m_1 = m$, the last term of (11.48) equals 0. When $m_1 < m$, we have
\[
\frac{I(S = m_1)}{A} (m - m_1) \le \binom{m-1}{m_1}^{-1} (m - m_1) \le 1,
\]
so we may consider only the remaining terms of (11.48) in the following argument. We consider two cases; constants $C$ may not necessarily be the same at each occurrence.

Case 1 $S \ge m/2$. As $a_i \ge 1$ for all $i$, there are at most $m + 1$ nonzero terms in the sum (11.48). Divide the summands into two groups, those for which $a_i \le 2 \log_2 m$ and those with $a_i > 2 \log_2 m$. The first group can sum to no more than $2 \log_2 m$, as the sum is a weighted average of the $a_i$ terms, with weights summing to less than 1. For the second group, note that
\[
\begin{aligned}
\binom{m - (i-1) - a_i}{S - (i-1)} \Big/ A
&\le \binom{m - (i-1) - a_i}{S - (i-1)} \Big/ \binom{m-1}{S} \\
&= \prod_{j=1}^{a_i - 1} \frac{m - S - j}{m - j} \; \prod_{j=0}^{i-2} \frac{S - j}{m - (a_i - 1) - 1 - j} \\
&\le \frac{1}{2^{a_i - 1}} \le \frac{2}{m^2},
\end{aligned} \tag{11.49}
\]
where the second to last inequality follows from $S \ge m/2$ and the fact that the term considered is nonzero only when $S \le m - a_i$, and the last from $a_i > 2 \log_2 m$. As $a_i \le m$ and there are at most $m + 1$ terms in the sum, the terms in the second group can sum to no more than 4.

Case 2 $S < m/2$. Divide the sum in (11.48) into two groups according as to whether $i > 2 \log_2 m$ or $i \le 2 \log_2 m$. Reordering the product in (11.49),
\[
\binom{m - (i-1) - a_i}{S - (i-1)} \Big/ A
\le \prod_{j=0}^{i-2} \frac{S - j}{m - 1 - j} \; \prod_{j=1}^{a_i - 1} \frac{m - S - j}{m - (i-1) - j}
\le \frac{1}{2^{i-1}}
\]
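using the assumption $S < m/2$, and noting that the term considered is zero unless $S \ge i - 1$. The above inequality is true for all $i$, so in particular the summation over $i$ satisfying $i > 2 \log_2 m$ is bounded by 4. Next consider $i \le 2 \log_2 m$. For $a_i \ge 2$ the inequality
\[
S \ge m \, \frac{\log a_i}{a_i - 1} + 2 \log_2 m \tag{11.50}
\]
implies
\[
S \ge \bigl(1 - e^{-\frac{\log a_i}{a_i - 1}}\bigr) m - 1 + e^{-\frac{\log a_i}{a_i - 1}} \, i,
\]
which is equivalent to
\[
a_i \Bigl( \frac{m - S - 1}{m - (i-1) - 1} \Bigr)^{a_i - 1} \le 1,
\]
which clearly holds also for $a_i = 1$. Hence,
\[
\begin{aligned}
a_i \binom{m - (i-1) - a_i}{S - (i-1)} \Big/ A
&\le a_i \binom{m - (i-1) - a_i}{S - (i-1)} \Big/ \binom{m-1}{S} \\
&= a_i \prod_{j=0}^{i-2} \frac{S - j}{m - 1 - j} \; \prod_{j=1}^{a_i - 1} \frac{m - S - j}{m - (i-1) - j} \\
&\le \frac{1}{2^{i-1}} \, a_i \Bigl( \frac{m - S - 1}{m - (i-1) - 1} \Bigr)^{a_i - 1} \le \frac{1}{2^{i-1}}
\end{aligned}
\]
using the fact that $a_i \bigl( \frac{m - S - 1}{m - (i-1) - 1} \bigr)^{a_i - 1} \le 1$.

On the other hand, if $S < m \bigl( \frac{\log a_i}{a_i - 1} \bigr) + 2 \log_2 m$ then $a_i S / (m - 1) \le C \log_2 m$, which implies
\[
\begin{aligned}
a_i \binom{m - (i-1) - a_i}{S - (i-1)} \Big/ A
&\le \frac{a_i S}{m - 1} \prod_{j=1}^{i-2} \frac{S - j}{m - 1 - j} \; \prod_{j=1}^{a_i - 1} \frac{m - S - j}{m - (i-1) - j} \\
&\le C \log_2 m / 2^{i-2}.
\end{aligned}
\]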
Hence the sum over $i$ is bounded by some constant times $\log_2 m$. Combining the two cases we have that the right hand side of (11.48), and therefore $E(Q \mid S)$, is bounded by $C \log_2 m$. To complete the proof of the lemma, that is, to prove $E(Q \mid W) \le C(1 + |W|)$, we only need to show
\[
E(Q \mid S) \le C \quad \text{when } |W| \le \log_2 m, \tag{11.51}
\]
as when $|W| > \log_2 m$ we already have $E(Q \mid W) \le C \log_2 m \le C|W|$. In case 1 we have shown $E(Q \mid S)$ is bounded, and in case 2 that the contribution of the summands where $i > 2 \log_2 m$ is bounded. Hence we need only consider summands where $i \le 2 \log_2 m$. Note that $|W| \le \log_2 m$ implies $S \ge m/2 - \sqrt{m/4} \, \log_2 m$. When $a_i$, $m$ are bigger than some universal constant, $m/2 - \sqrt{m/4} \, \log_2 m \ge m \bigl( \frac{\log a_i}{a_i - 1} \bigr) + 2 \log_2 m$, which implies that (11.50) holds. Hence, as in case 2, we have that
\[
\Bigl( \frac{m - S - 1}{m - (i-1) - 1} \Bigr)^{a_i - 1} \times a_i \le 1 \quad \text{and} \quad \binom{m - (i-1) - a_i}{S - (i-1)} \times a_i / A \le 1/2^{i-1}.
\]
Summing, we see the contribution from the remaining terms is also bounded, completing the proof of (11.51), and the lemma.
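The factorization of the ratio of binomial coefficients in (11.49), on which both cases of the proof rest, can be checked with exact rational arithmetic. The following is only an illustrative sketch: the function names and the range of parameters are assumptions made here, with `a` standing for $a_i$ and `s` for $S$, and binomial coefficients with an out-of-range lower index taken to be zero.

```python
from fractions import Fraction
from math import comb


def binom(n, k):
    """Binomial coefficient, taken to be zero outside 0 <= k <= n."""
    return comb(n, k) if 0 <= k <= n else 0


def ratio(m, i, a, s):
    """C(m-(i-1)-a, s-(i-1)) / C(m-1, s), the ratio bounded in (11.49)."""
    return Fraction(binom(m - (i - 1) - a, s - (i - 1)), comb(m - 1, s))


def product_form(m, i, a, s):
    """The same ratio written as the two products over j in (11.49)."""
    value = Fraction(1)
    for j in range(1, a):            # j = 1, ..., a - 1
        value *= Fraction(m - s - j, m - j)
    for j in range(0, i - 1):        # j = 0, ..., i - 2
        value *= Fraction(s - j, m - (a - 1) - 1 - j)
    return value


def check(max_m=9):
    """Compare both forms over all admissible (m, i, a, s) with m <= max_m."""
    count = 0
    for m in range(2, max_m + 1):
        for i in range(1, m + 1):
            for a in range(1, m - (i - 1) + 1):   # keeps m-(i-1)-a >= 0
                for s in range(0, m):             # keeps C(m-1, s) > 0
                    assert ratio(m, i, a, s) == product_form(m, i, a, s)
                    count += 1
    return count
```

Calling `check()` runs the comparison for every admissible parameter choice with $m \le 9$ and returns the number of cases examined.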
Chapter 12
Multivariate Normal Approximation
In this chapter we consider multivariate normal approximation. We begin with the extension of the ideas in Sect. 4.8 on bounds for smooth functions, using the results in Sect. 2.3.4 which may be applied in the multivariate setting. The first goal is to develop smooth function bounds in $\mathbb{R}^p$. In Sect. 12.1 we obtain such bounds using multivariate size bias couplings, and in Sect. 12.3 by multivariate exchangeable pairs. In Sect. 12.4 we turn to local dependence, and bounds in the Kolmogorov distance. We consider applications of these results to questions in random graphs.

Generalizing notions from Sect. 4.8, for
\[
k = (k_1, \ldots, k_p) \in \mathbb{N}_0^p \quad \text{let} \quad |k| = \sum_{i=1}^{p} k_i,
\]
and for functions $h : \mathbb{R}^p \to \mathbb{R}$ whose partial derivatives
\[
h^{(k)}(x) = \frac{\partial^{k_1 + \cdots + k_p} h}{\partial^{k_1} x_1 \cdots \partial^{k_p} x_p}
\]
exist for all $0 \le |k| \le m$, and $\| \cdot \|$ the supremum norm, recall that $L_m^\infty(\mathbb{R}^p)$ is the collection of all functions $h : \mathbb{R}^p \to \mathbb{R}$ with
\[
\| h \|_{L_m^\infty(\mathbb{R}^p)} = \max_{0 \le |k| \le m} \bigl\| h^{(k)} \bigr\| \tag{12.1}
\]
finite. Now, for random vectors $X$ and $Y$ in $\mathbb{R}^p$, letting
\[
\mathcal{H}_{m,\infty,p} = \bigl\{ h \in L_m^\infty(\mathbb{R}^p) : \| h \|_{L_m^\infty(\mathbb{R}^p)} \le 1 \bigr\} \tag{12.2}
\]
define
\[
\bigl\| \mathcal{L}(X) - \mathcal{L}(Y) \bigr\|_{\mathcal{H}_{m,\infty,p}} = \sup_{h \in \mathcal{H}_{m,\infty,p}} \bigl| E h(X) - E h(Y) \bigr| .
\]
For a vector, matrix, or more generally, any array $A = (a_\alpha)_{\alpha \in \mathcal{A}}$ with $\mathcal{A}$ finite, let
\[
\| A \| = \max_{\alpha \in \mathcal{A}} |a_\alpha| . \tag{12.3}
\]

L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_12, © Springer-Verlag Berlin Heidelberg 2011
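12.1 Multivariate Normal Approximation via Size Bias Couplings

The following theorem gives a smooth function bound via multivariate size bias couplings.

Theorem 12.1 Let $Y$ be a random vector in $\mathbb{R}^p$ with nonnegative components, mean $\mu = EY$, and invertible covariance matrix $\mathrm{Var}(Y) = \Sigma$. For each $i = 1, \ldots, p$ let $(Y, Y^i)$ be random vectors defined on a joint probability space such that $Y^i$ has the $Y$-size biased distribution in direction $i$, as in (2.68). Then, with $Z$ a mean zero, covariance $I$ normal vector in $\mathbb{R}^p$,
\[
\begin{aligned}
\bigl\| \mathcal{L}\bigl(\Sigma^{-1/2}(Y - \mu)\bigr) - \mathcal{L}(Z) \bigr\|_{\mathcal{H}_{3,\infty,p}}
&\le \frac{p^2}{2} \bigl\| \Sigma^{-1/2} \bigr\|^2 \sum_{i=1}^{p} \sum_{j=1}^{p} \mu_i \sqrt{\mathrm{Var}\, E\bigl[ Y_j^i - Y_j \mid Y \bigr]} \\
&\quad + \frac{1}{2} \, \frac{p^3}{3} \bigl\| \Sigma^{-1/2} \bigr\|^3 \sum_{i=1}^{p} \sum_{j=1}^{p} \sum_{k=1}^{p} \mu_i E \bigl| \bigl(Y_j^i - Y_j\bigr)\bigl(Y_k^i - Y_k\bigr) \bigr| .
\end{aligned} \tag{12.4}
\]
Note that the theorem does not require the joint construction of $(Y^1, \ldots, Y^p)$.

Proof Given $h$ with $\| h \|_{L_3^\infty(\mathbb{R}^p)} \le 1$, let $f$ be the solution of (2.22) given by (2.21) and (2.20). Writing out the expressions in (2.22),
\[
E\bigl[ h\bigl(\Sigma^{-1/2}(Y - \mu)\bigr) - Nh \bigr]
= E\Bigl[ \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma_{ij} \frac{\partial^2}{\partial y_i \partial y_j} f(Y) - \sum_{i=1}^{p} (Y_i - \mu_i) \frac{\partial}{\partial y_i} f(Y) \Bigr]. \tag{12.5}
\]
Recall from (2.68) that $Y^i$ is characterized by the fact that
\[
E Y_i G(Y) = \mu_i E G\bigl(Y^i\bigr) \tag{12.6}
\]
for all functions $G : \mathbb{R}^p \to \mathbb{R}$ for which the expectations exist. For the coordinate function $G(y) = y_j$, (12.6) gives
\[
\sigma_{ij} = \mathrm{Cov}(Y_i, Y_j) = E Y_i Y_j - \mu_i \mu_j = E \mu_i \bigl( Y_j^i - Y_j \bigr). \tag{12.7}
\]
Subtracting $\mu_i E G(Y)$ from both sides of (12.6), we obtain
\[
E (Y_i - \mu_i) G(Y) = \mu_i E\bigl[ G\bigl(Y^i\bigr) - G(Y) \bigr]. \tag{12.8}
\]
Equation (12.5), and (12.8) with $G = (\partial / \partial y_i) f$, yield
\[
E\bigl[ h\bigl(\Sigma^{-1/2}(Y - \mu)\bigr) - Nh \bigr]
= E\Bigl[ \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma_{ij} \frac{\partial^2}{\partial y_i \partial y_j} f(Y) - \sum_{i=1}^{p} \mu_i \Bigl( \frac{\partial}{\partial y_i} f\bigl(Y^i\bigr) - \frac{\partial}{\partial y_i} f(Y) \Bigr) \Bigr]. \tag{12.9}
\]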
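Taylor expanding $(\partial / \partial y_i) f(Y^i)$ about $Y$, with remainder in integral form, and simple calculations show that the right hand side of (12.9) equals
\[
\begin{aligned}
& - E \sum_{i=1}^{p} \sum_{j=1}^{p} \bigl[ \mu_i \bigl( Y_j^i - Y_j \bigr) - \sigma_{ij} \bigr] \frac{\partial^2}{\partial y_i \partial y_j} f(Y) \\
& - E \sum_{i=1}^{p} \sum_{j=1}^{p} \sum_{k=1}^{p} \mu_i \int_0^1 (1 - t) \frac{\partial^3}{\partial y_i \partial y_j \partial y_k} f\bigl( Y + t (Y^i - Y) \bigr) \bigl( Y_j^i - Y_j \bigr) \bigl( Y_k^i - Y_k \bigr) \, dt .
\end{aligned} \tag{12.10}
\]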
In the first term, we condition on Y, apply the Cauchy–Schwarz inequality and use (12.7), and then apply the bound (2.23) with k = 2 to obtain the first term in (12.4). The second term in (12.10) gives the second term in (12.4) by applying (2.23) with k = 3.
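The identity (12.7), which converts the covariance $\sigma_{ij}$ into a mean difference under the size biased law, can be confirmed exactly on a small discrete example. The sketch below is illustrative only: the joint distribution `PMF` is a made-up example, and the size biased law is taken to be $P(Y^i = y) = y_i P(Y = y) / \mu_i$, which is the distribution that the characterization (12.6) forces in the discrete case. Indices in the code are zero-based.

```python
from fractions import Fraction

# Hypothetical joint pmf of Y = (Y_1, Y_2) with nonnegative components.
PMF = {
    (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 2): Fraction(1, 3),
    (3, 1): Fraction(1, 6),
}


def mean(pmf, j):
    """Mean of coordinate j under the given pmf."""
    return sum(prob * y[j] for y, prob in pmf.items())


def covariance(pmf, i, j):
    """Cov(Y_i, Y_j) under the given pmf."""
    e_ij = sum(prob * y[i] * y[j] for y, prob in pmf.items())
    return e_ij - mean(pmf, i) * mean(pmf, j)


def size_bias(pmf, i):
    """pmf of Y^i, with P(Y^i = y) = y_i P(Y = y) / mu_i as forced by (12.6)."""
    mu_i = mean(pmf, i)
    return {y: prob * y[i] / mu_i for y, prob in pmf.items() if y[i] > 0}


def check_12_7(pmf, i, j):
    """True when sigma_ij = mu_i (E Y^i_j - E Y_j), that is, (12.7)."""
    biased = size_bias(pmf, i)
    rhs = mean(pmf, i) * (mean(biased, j) - mean(pmf, j))
    return covariance(pmf, i, j) == rhs
```

Only the marginal law of $Y^i$ enters (12.7), so the check does not depend on how $Y$ and $Y^i$ are coupled; the coupling matters for the size of the two terms in (12.4).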
12.2 Degrees of Random Graphs

In the classical Erdős and Rényi (1959b) random graph model (see also Bollobás 1985) for $n \in \mathbb{N}$ and $\varrho \in (0, 1)$, $K = K_{n,\varrho}$ is the random graph on the vertex set $\mathcal{V} = \{1, \ldots, n\}$ with random edge set $\mathcal{E}$ where each pair of vertices has probability $\varrho$ of being connected, independently of all other such pairs. For $v \in \mathcal{V}$ let
\[
D(v) = \sum_{w \in \mathcal{V}} \mathbf{1}_{\{v,w\} \in \mathcal{E}},
\]
the degree of vertex $v$, and for $d \in \{0, 1, 2, \ldots\}$ let
\[
Y = \sum_{v \in \mathcal{V}} X_v \quad \text{where } X_v = \mathbf{1}_{\{D(v) = d\}},
\]
the number of vertices with degree $d$. Karoński and Ruciński (1987) proved asymptotic normality of $Y$ when $n^{(d+1)/d} \varrho \to \infty$ and $n \varrho \to 0$, or $n \varrho \to \infty$ and $n \varrho - \log n - d \log \log n \to -\infty$; see also Palka (1984) and Bollobás (1985). Asymptotic normality when $n \varrho \to c > 0$ was obtained by Barbour et al. (1989); see also Kordecki (1990) for the case $d = 0$, for nonsmooth $h$. Goldstein (2010b) gives a Berry–Esseen theorem for $Y$ for all $d$ by applying the size bias coupling in Bolthausen’s (1984) inductive method. Other univariate results on asymptotic normality of counts on random graphs, including counts of the type discussed in Theorem 12.2, are given in Janson and Nowicki (1991), and references therein.

Based on the work of Goldstein and Rinott (1996) we consider the joint asymptotic normality of a vector of degree counts. For $p \in \mathbb{N}$ let $d_i$ for $i = 1, \ldots, p$ be distinct, fixed nonnegative integers, and let $Y \in \mathbb{R}^p$ have $i$th coordinate
\[
Y_i = \sum_{v \in \mathcal{V}} X_{vi} \quad \text{where } X_{vi} = \mathbf{1}_{\{D(v) = d_i\}},
\]
the number of vertices of the graph with degree $d_i$. For simplicity we assume $0 < \varrho = c/(n - 1) < 1$ in what follows, though the results below can be weakened to
cover the case nn → c > 0 as n → ∞. To keep track of asymptotic constants, for a sequence an and a sequence of positive numbers bn write an = (bn ) if lim supn→∞ |an |/bn ≤ 1. Theorem 12.2 If = n = c/(n − 1) for some c > 0 and Z ∈ Rp is a mean zero normal vector with identity covariance matrix, then L( −1/2 (Y − μ) − L(Z) ≤ n−1/2 (r1 + r2 ), (12.11) H 3,∞,p
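where
\[
r_1 = \frac{p^3 b}{2} \sum_{i=1}^{p} \beta_i \sqrt{24c + 48c^2 + 144c^3 + 48 d_i^2 + 144 c d_i^2 + 12}
\quad \text{and} \quad
r_2 = \frac{p^5 b^{3/2}}{3} \sum_{i=1}^{p} \beta_i \bigl( c + c^2 + (d_i + 1)^2 \bigr),
\]
where the components $\mu_i$, $\sigma_{ij}$, $i, j = 1, \ldots, p$ of the mean vector $\mu = EY$ and covariance matrix $\Sigma = \mathrm{Var}(Y)$ respectively, are given by
\[
\mu_i = n \beta_i \quad \text{and} \quad
\sigma_{ij} = n \beta_i \beta_j \Bigl[ \frac{(d_i - c)(d_j - c)}{c (1 - c/(n-1))} - 1 \Bigr] + \mathbf{1}_{\{i = j\}} n \beta_i, \tag{12.12}
\]
and
\[
\beta_i = \binom{n-1}{d_i} \varrho^{d_i} (1 - \varrho)^{n - 1 - d_i}
\quad \text{and} \quad
b = \frac{1}{\min_j \beta_j \bigl( 1 - \sum_{i=1}^{p} \beta_i \bigr)} . \tag{12.13}
\]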
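Note that $\sum_{i=1}^{p} \beta_i < 1$ when $\{d_1, \ldots, d_p\} \ne \{0, 1, \ldots, n - 1\}$, and then the quantities $r_1$ and $r_2$ are both of order $O(1)$.

Proof As for any $v \in \mathcal{V}$ the degree $D(v)$ is the sum of $n - 1$ independent Bernoulli variables with success probability $\varrho$, we have $D(v) \sim \mathrm{Bin}(n - 1, \varrho)$. In particular, $\beta_i$ in (12.13) equals $P(D(v) = d_i) = E X_{vi}$, yielding the expression for $\mu_i$ in (12.12). To calculate the covariance $\sigma_{ij}$ for $i \ne j$, with $v \ne u$ write
\[
E X_{vi} X_{uj} = E\bigl[ X_{vi} X_{uj} \mid \{v, u\} \in \mathcal{E} \bigr] \varrho + E\bigl[ X_{vi} X_{uj} \mid \{v, u\} \notin \mathcal{E} \bigr] (1 - \varrho). \tag{12.14}
\]
Given that there is an edge connecting $v$ and $u$, $X_{vi} X_{uj} = 1$ if and only if $v$ is connected to $d_i - 1$ vertices in $\mathcal{V} \setminus \{u\}$, and $u$ to $d_j - 1$ vertices in $\mathcal{V} \setminus \{v\}$, which are functions of independent Bernoulli variables. Hence
\[
\begin{aligned}
E\bigl[ X_{vi} X_{uj} \mid \{v, u\} \in \mathcal{E} \bigr]
&= \binom{n-2}{d_i - 1} \binom{n-2}{d_j - 1} \varrho^{d_i + d_j - 2} (1 - \varrho)^{2n - 2 - d_i - d_j} \\
&= \beta_i \beta_j \frac{d_i d_j}{\varrho^2 (n-1)^2} = \beta_i \beta_j \frac{d_i d_j}{c^2} .
\end{aligned}
\]
Likewise, given that there is no edge between $v$ and $u$, $X_{vi} X_{uj} = 1$ if and only if $v$ is connected to $d_i$ vertices in $\mathcal{V} \setminus \{u\}$, and $u$ to $d_j$ vertices in $\mathcal{V} \setminus \{v\}$, and so
\[
\begin{aligned}
E\bigl[ X_{vi} X_{uj} \mid \{v, u\} \notin \mathcal{E} \bigr]
&= \binom{n-2}{d_i} \binom{n-2}{d_j} \varrho^{d_i + d_j} (1 - \varrho)^{2n - 4 - d_i - d_j} \\
&= \beta_i \beta_j \frac{(n - 1 - d_i)(n - 1 - d_j)}{(1 - \varrho)^2 (n-1)^2} = \beta_i \beta_j \frac{(n - 1 - d_i)(n - 1 - d_j)}{(n - 1 - c)^2} .
\end{aligned}
\]
Adding these expressions according to (12.14) yields
\[
E X_{vi} X_{uj} = \beta_i \beta_j \frac{d_i d_j + c(n-1) - c d_i - c d_j}{c (n - 1 - c)} .
\]
Now, multiplying by $n^2 - n$, as $X_{vi} X_{vj} = 0$ for $d_i \ne d_j$, we have
\[
E Y_i Y_j = n \beta_i \beta_j \frac{d_i d_j + c(n-1) - c d_i - c d_j}{c (1 - c/(n-1))},
\]
and subtracting $n^2 \beta_i \beta_j$ yields (12.12) for $i \ne j$. When $i = j$ the calculation is the same, but for the addition in the second moment of the expectation of $n$ diagonal terms of the form $X_{vi}^2 = X_{vi}$.

We may write the covariance matrix more compactly as follows. Let
\[
\mathbf{b} = \bigl( \beta_1^{1/2}, \ldots, \beta_p^{1/2} \bigr)^\top, \qquad
\mathbf{g} = \Bigl( \frac{\beta_1^{1/2} (d_1 - c)}{\sqrt{c (1 - c/(n-1))}}, \ldots, \frac{\beta_p^{1/2} (d_p - c)}{\sqrt{c (1 - c/(n-1))}} \Bigr)^\top,
\]
and
\[
D = \mathrm{diag}\bigl( \beta_1^{1/2}, \ldots, \beta_p^{1/2} \bigr),
\]
that is, the diagonal matrix whose diagonal elements are the components of $\mathbf{b}$. Then it is not difficult to see that
\[
n^{-1} \Sigma = D \bigl( I + \mathbf{g} \mathbf{g}^\top - \mathbf{b} \mathbf{b}^\top \bigr) D .
\]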
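The mean and covariance formulas in (12.12) can be verified exactly for small $n$ by enumerating all graphs on $n$ labelled vertices. The following sketch uses exact rational arithmetic; the function names and the argument name `rho` for the edge probability are choices made for this illustration, and the enumeration visits $2^{n(n-1)/2}$ graphs, so it is only practical for very small $n$.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def exact_moments(n, rho, degrees):
    """Mean vector and covariance matrix of the degree counts, by enumeration."""
    pairs = list(combinations(range(n), 2))
    p = len(degrees)
    mean = [Fraction(0)] * p
    second = [[Fraction(0)] * p for _ in range(p)]
    for edges in product((0, 1), repeat=len(pairs)):
        k = sum(edges)
        weight = rho**k * (1 - rho) ** (len(pairs) - k)   # probability of this graph
        deg = [0] * n
        for (u, v), present in zip(pairs, edges):
            deg[u] += present
            deg[v] += present
        y = [sum(1 for x in deg if x == d) for d in degrees]
        for i in range(p):
            mean[i] += weight * y[i]
            for j in range(p):
                second[i][j] += weight * y[i] * y[j]
    cov = [[second[i][j] - mean[i] * mean[j] for j in range(p)] for i in range(p)]
    return mean, cov


def formula_moments(n, rho, degrees):
    """The same quantities from (12.12) and (12.13), with c = rho (n - 1)."""
    c = rho * (n - 1)
    beta = [comb(n - 1, d) * rho**d * (1 - rho) ** (n - 1 - d) for d in degrees]
    p = len(degrees)
    mean = [n * b for b in beta]
    denom = c * (1 - c / (n - 1))
    cov = [[n * beta[i] * beta[j] * ((degrees[i] - c) * (degrees[j] - c) / denom - 1)
            + (n * beta[i] if i == j else 0)
            for j in range(p)] for i in range(p)]
    return mean, cov
```

For instance, `exact_moments(4, Fraction(1, 3), (0, 1, 2))` and `formula_moments(4, Fraction(1, 3), (0, 1, 2))` can be compared directly, since both return exact fractions.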
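For nonnegative definite matrices $A$ and $B$, write
\[
A \preceq B \quad \text{when } x^\top A x \le x^\top B x \text{ for all } x. \tag{12.15}
\]
It is clear that $D (I - \mathbf{b} \mathbf{b}^\top) D \preceq n^{-1} \Sigma$. Letting $\lambda_1(A) \le \cdots \le \lambda_p(A)$ be the eigenvalues of $A$ in non-decreasing order, then, see e.g. Horn and Johnson (1985),
\[
\lambda_k \bigl( D (I - \mathbf{b} \mathbf{b}^\top) D \bigr) \le \lambda_k \bigl( n^{-1} \Sigma \bigr).
\]
It is simple to verify that the eigenvalues of $B = I - \mathbf{b} \mathbf{b}^\top$ are 1, with multiplicity $p - 1$, and, corresponding to the eigenvector $\mathbf{b}$, $\lambda_1(B) = 1 - \mathbf{b}^\top \mathbf{b}$. Now, by the Rayleigh–Ritz characterization of eigenvalues we obtain
\[
\lambda_1(DBD) = \min_{x \in \mathbb{R}^p, \, x \ne 0} \frac{x^\top D B D x}{x^\top x}
= \min_{y \in \mathbb{R}^p, \, y \ne 0} \frac{y^\top B y}{y^\top D^{-2} y}
\ge \frac{\lambda_1(B)}{\lambda_p(D^{-2})} .
\]
Hence
\[
\lambda_1 \bigl( n^{-1} \Sigma \bigr) \ge \min_j \beta_j \Bigl( 1 - \sum_{i=1}^{p} \beta_i \Bigr) = b^{-1},
\]
and
\[
\bigl\| \Sigma^{-1/2} \bigr\| \le \lambda_p \bigl( \Sigma^{-1/2} \bigr) = \frac{n^{-1/2}}{\lambda_1 \bigl( (n^{-1} \Sigma)^{1/2} \bigr)} \le n^{-1/2} b^{1/2} . \tag{12.16}
\]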
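The Rayleigh–Ritz step can be probed numerically without any linear algebra library: for every nonzero $x$, the quotient $x^\top D (I - \mathbf{b}\mathbf{b}^\top) D x / x^\top x$ should be at least $\min_j \beta_j (1 - \sum_i \beta_i) = b^{-1}$. The sketch below is a floating point illustration only; the function names and the sample values of $\beta$ used with it are assumptions made here.

```python
import random


def quadratic_form_ratio(beta, x):
    """x' D (I - b b') D x / x' x with D = diag(sqrt(beta)) and b = sqrt(beta)."""
    y = [bj ** 0.5 * xj for bj, xj in zip(beta, x)]            # y = D x
    b_dot_y = sum(bj ** 0.5 * yj for bj, yj in zip(beta, y))   # b' y
    numerator = sum(yj * yj for yj in y) - b_dot_y ** 2        # y' (I - b b') y
    return numerator / sum(xj * xj for xj in x)


def lower_bound(beta):
    """min_j beta_j (1 - sum_i beta_i), the reciprocal of b in (12.13)."""
    return min(beta) * (1 - sum(beta))


def smallest_ratio(beta, trials=1000, seed=0):
    """Smallest Rayleigh quotient seen over random Gaussian directions x."""
    rng = random.Random(seed)
    return min(
        quadratic_form_ratio(beta, [rng.gauss(0.0, 1.0) for _ in beta])
        for _ in range(trials)
    )
```

With, say, `beta = [0.1, 0.25, 0.3]`, the value returned by `smallest_ratio(beta)` should not fall below `lower_bound(beta)`, in line with the displayed bound on $\lambda_1(n^{-1}\Sigma)$.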
To apply Theorem 12.1, for all $i \in \{1, \ldots, p\}$ we need to couple $Y$ to a vector $Y^i$ having the size bias distribution of $Y$ in direction $i$. Let $\mathcal{A} = \{vi, v \in \mathcal{V}, i = 1, \ldots, p\}$ so that $X = \{X_{vi}, v \in \mathcal{V}, i = 1, \ldots, p\} = \{X_\alpha, \alpha \in \mathcal{A}\}$. We will apply Proposition 2.2 to yield $Y^i$ from $X^\alpha$ for $\alpha \in \mathcal{A}$. To achieve $X^\alpha$ for $\alpha \in \mathcal{A}$, we follow the outline given after Proposition 2.2. First we generate $X_\alpha^\alpha$ from the $X_\alpha$-size bias distribution. Since $X_\alpha$ is a nontrivial Bernoulli variable, we have $X_\alpha^\alpha = 1$. Then we must generate the remaining variables with distribution $\mathcal{L}(X_\beta^\alpha \mid X_\alpha^\alpha = 1)$. That is, for $\alpha = vi$, say, we need to have $D(v) = d_i$, the degree of $v$ equal to $d_i$, and the remaining variables so conditioned. We can achieve such variables as follows. If $D(v) > d_i$ let $K^{vi}$ be the graph obtained by removing $D(v) - d_i$ edges from $K$, selected uniformly from the $D(v)$ edges of $v$. If $D(v) < d_i$ let $K^{vi}$ be the graph obtained by adding $d_i - D(v)$ edges of the form $\{v, u\}$ to $K$, where the vertices $u$ are selected uniformly from the $n - 1 - D(v)$ vertices not connected to $v$. If $D(v) = d_i$ let $K^{vi} = K$. Using exchangeability, it is easy to see that the distribution of the graph $K^{vi}$ is the conditional distribution of $K$ given that the degree of $v$ is $d_i$. Now, for $j = 1, \ldots, p$ letting
\[
B_j = \{vj : v \in \mathcal{V}\} \quad \text{we may write} \quad Y_j = \sum_{\alpha \in B_j} X_\alpha .
\]
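The construction of $K^{vi}$ described above can be sketched in a few lines. The code below is an illustration under stated assumptions rather than a reference implementation: graphs are stored as dictionaries mapping each vertex to its set of neighbours, and the names `force_degree`, `degree_counts` and `random_graph` are hypothetical helpers introduced here.

```python
import random


def force_degree(adj, v, target, rng):
    """Return a copy of the graph in which vertex v has degree `target`."""
    new_adj = {u: set(nbrs) for u, nbrs in adj.items()}
    current = len(new_adj[v])
    if current > target:
        # remove D(v) - target edges, chosen uniformly among the edges at v
        for u in rng.sample(sorted(new_adj[v]), current - target):
            new_adj[v].discard(u)
            new_adj[u].discard(v)
    elif current < target:
        # add target - D(v) edges to uniformly chosen vertices not joined to v
        candidates = [u for u in new_adj if u != v and u not in new_adj[v]]
        for u in rng.sample(candidates, target - current):
            new_adj[v].add(u)
            new_adj[u].add(v)
    return new_adj


def degree_counts(adj, degrees):
    """Number of vertices having each of the listed degrees."""
    return [sum(1 for u in adj if len(adj[u]) == d) for d in degrees]


def random_graph(n, edge_prob, rng):
    """Graph on {0, ..., n-1} with each edge present independently."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < edge_prob:
                adj[u].add(v)
                adj[v].add(u)
    return adj
```

Applying `force_degree` to a vertex chosen uniformly and independently of the graph, and then calling `degree_counts` on the result, gives one draw of the coupled vector of degree counts in the sense of the construction; only the edges at the chosen vertex are modified.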
By Proposition 2.2, to construct $Y_j^i$, we first choose a summand of $Y_j$ according to the distribution given in (2.71), that is, proportional to its expectation. As $E X_{vj}$ is constant and $|B_j| = n$, we set $P(V = v) = 1/n$, so that $V$ is uniform over $\mathcal{V}$, with $V$ independent of $K$. Then letting $X_{vj}^{Vi}$ be the indicator that vertex $v$ has degree $d_j$ in $K^{Vi}$, Proposition 2.2 yields that the vector $Y^i$ with components
\[
Y_j^i = \sum_{v=1}^{n} X_{vj}^{Vi}, \quad j = 1, \ldots, p
\]
has the $Y$-size biased distribution in direction $i$. In other words, for the given $i$, one vertex of $K$ is chosen uniformly to have edges added or removed as necessary in order for it to have degree $d_i$, and then $Y_j^i$ counts the number of vertices of degree $d_j$ in the graph that results.

We now proceed to obtain a bound for the last term in (12.4) of Theorem 12.1. Note that since exactly $|D(V) - d_i|$ edges are either added or removed from $K$ to form $K^{Vi}$, and that the vertex degrees can only change on vertices incident to these edges and on vertex $V$ itself, we have
\[
\bigl| Y_j^i - Y_j \bigr| \le \bigl| D(V) - d_i \bigr| + 1 .
\]
This upper bound is achieved, for example, when $i \ne j$, $d_i < d_j$ and $V$ and all the $D(V)$ vertices connected to $V$ have degree $d_j$. Hence, as $D(V) \sim \mathrm{Bin}(n - 1, \varrho)$, and $\varrho = c/(n - 1)$,
\[
\begin{aligned}
E \bigl| \bigl( Y_j^i - Y_j \bigr) \bigl( Y_k^i - Y_k \bigr) \bigr|
&\le E \bigl( |D(V) - d_i| + 1 \bigr)^2 \\
&\le 2 E \bigl( D^2(V) + (d_i + 1)^2 \bigr) \\
&\le 2 \bigl( (n-1) \varrho + (n-1)^2 \varrho^2 + (d_i + 1)^2 \bigr) \\
&= 2 \bigl( c + c^2 + (d_i + 1)^2 \bigr).
\end{aligned}
\]
Now, considering the last term in (12.4), since the bound above depends only on $i$, applying (12.16) and that $\mu_i = n \beta_i$ from (12.12), we obtain
\[
\frac{1}{2} \, \frac{p^3}{3} \bigl\| \Sigma^{-1/2} \bigr\|^3 \sum_{i=1}^{p} \sum_{j=1}^{p} \sum_{k=1}^{p} \mu_i E \bigl| \bigl( Y_j^i - Y_j \bigr) \bigl( Y_k^i - Y_k \bigr) \bigr|
\le n^{-1/2} \, \frac{p^5}{3} \, b^{3/2} \sum_{i=1}^{p} \beta_i \bigl( c + c^2 + (d_i + 1)^2 \bigr),
\]
yielding the term $r_2$ in the bound (12.11). Since $Y$ is measurable with respect to $K$, following (4.143) we obtain the upper bound
Var E Yji − Yj |Y ≤ Var E Yji − Yj |K , and will demonstrate
Var E Yji − Yj |K = n−1 24c + 48c2 + 144c3 + 48di2 + 144cdi2 + 12
(12.17)
Then, for the first term in (12.4), again applying (12.12) to give μi = nβi , and (12.16), we obtain p p
p2 −1/2 2 μi Var E Yji − Yj | Y 2
≤ n−1/2
i=1 j =1 n
p3 b 2
βi
24c + 48c2 + 144c3 + 48di2 + 144cdi2 + 12 ,
i=1
yielding r1 . To obtain (12.17) we first condition on V = v. Recalling V is uniform and letting | · | denote cardinality, in this way we obtain
E Yji − Yj |K 1 D(u) = dj + 1 − D(u) = dj D(v) − di = n D(v) u: {u,v}∈E v: D(v)>di
+
1 n
D(u) = dj − 1 − D(u) = dj
u=v, {u,v}∈ / E , D(v)
di − D(v) n − 1 − D(v) 1 1 + v: D(v) = di δi,j − v: D(v) = dj (1 − δi,j ). (12.18) n n To understand the first term, for example, note that if V = v and D(v) > di , then i − X = 1 if {u, v} ∈ E , D(u) = d + 1, and {u, v} is one of the d − D(v) Xuj uj j i edges removed at v at random, chosen with probability (D(v) − di )/D(v). Note that the factor 1/n multiplies all terms in (12.18), which provides a factor of 1/n2 in the variance. Breaking the two sums into two separate sums, so that six terms result, we will bound the variance of each term separately and then apply the bound k k Var Uj ≤ k Var(Uj ) (12.19) ×
j =1
j =1
for k = 6. The first term of (12.18) yields two sums, both of the form D(v) − di 1A (u, v) D(v) u,v for A = (u, v): {u, v} ∈ E: D(v) > di , D(u) = dj + a , the first with a = 1, and the second with a = 0. We show D(v) − di Var 1A (u, v) = 2cn + 4c2 n + 12c3 n . D(v) u,v
(12.20)
(12.21)
To calculate this variance requires the consideration of terms all of the form D(v ) − di D(v) − di Cov 1A (u, v) , 1A u , v . (12.22) D(v) D(v ) Let N be the number of distinct vertices among u, u , v, v . From the definition of A in (12.20), and that no edge connects a vertex to itself, we see that we need only consider cases where u = v and u = v . Hence N may only take on the values 2, 3 and 4, leading to the three terms in (12.21). There are two cases for N = 2. The n(n − 1) diagonal variance terms with (u, v) = (u , v ) can be bounded by their second moments as D(v) − di D(v) − di 2 ≤ E 1A (u, v) Var 1A (u, v) D(v) D(v) ≤ P {u, v} ∈ E c , = n−1 leading to a factor of n(c). Handling the case (u, v) = (v , u ) in the same manner gives an overall contribution of 2n(c) for the case N = 2, and the first term in (12.21). For N = 3 there are four subcases, all of which may be handled in a similar way. Consider, for example, the case u = u , v = v . Using the inequality Cov(X, Y ) ≤ EXY , valid for nonnegative X and Y , we obtain D(v ) − di D(v) − di Cov 1A (u, v) , 1A u, v D(v) D(v ) ≤ P {u, v} ∈ E, {u, v } ∈ E = c2 /(n − 1)2 . Handling the three other cases similarly and noting that the total number of N = 3 terms is no more than 4n3 leads to a contribution of 4n(c2 ) from the case N = 3 and the second term in (12.21). In the case N = 4 the vertices u, u , v, v are distinct, and we have D(v) − di D(v ) − di , 1A (u , v ) Cov 1A (u, v) D(v) D(v ) D(v) − di D(v ) − di (12.23) 1A (u , v ) − β 2, = E 1A (u, v) D(v) D(v ) where
D(v) − di β = E 1A (u, v) D(v) (D(v) − di )+ . = E 1{{u,v}∈E } 1{D(u)=dj +a} D(v)
With C the event that
{u, u }, {v, v }, {u, v }, {u , v} ∩ E = ∅
(12.24)
we have P (C) = (1 − c/(n − 1))4 = 1 − 4(c/n). This estimate implies, noting that the events {u, v}, {u , v } ∈ E each have probability c/(n − 1) and are independent of C, that D(v) − di D(v ) − di E 1A (u, v) 1A (u , v ) D(v) D(v ) 3 D(v) − di c D(v ) − di = E 1A (u, v) 1A (u , v ) C P (C) + 4 3 D(v) D(v ) n 3 c 2 ≤ α + 4 3 , (12.25) n with
(D(v) − di )+ α = E 1{{u,v}∈E } 1{D(u)=dj +a} C , D(v)
where in the last inequality we used the conditional independence given C of the events indicated, for (u, v) and (u , v ). Bounding both α and β by the probability that {u, v} ∈ E , an event independent of C, we bound the covariance term (12.23) as 3 3 2 α + 4 c − β 2 = (α + β)(α − β) + 4 c n3 n3 3 c c + 4 3 . (12.26) ≤ 2|α − β| n n To handle α − β, letting R = {u, v} ∈ E ,
S = 1{D(u)=dj +a}
and T =
(D(v) − di )+ , D(v)
we have α − β = E[1R ST |C] − E[1R ST ] = E[ST |CR]
P (RC) − E[ST |R]P (R). P (C)
As R and C are independent and P (R) = c/(n − 1), |α − β| = E[ST |CR] − E[ST |R](c/n). Since S and T are conditionally independent given R or given CR, we have |α − β| = E[S|CR]E[T |CR] − E[S|R]E[T |R](c/n). Let X, Y ∼ Binomial(n − 4, ) and X , Y ∼ Binomial(n − 2, ), all independent. In α, conditioning on CR, D(u) − 1 and D(v) − 1 are equal in distribution to X and Y respectively; in β, conditioning on R, the same variables are distributed as X , Y . Hence,
(Y + 1 − di )+ |α − β| = E1{X=dj +a−1} E Y +1 (Y + 1 − di )+ c − E1{X =dj +a−1} E . Y +1 n
(12.27)
Next, note |E1{X=dj +a−1} − E1{X =dj +a−1} | = 2(c/n) and E (Y + 1 − di )+ − E (Y + 1 − di )+ = 2(c/n), Y +1 Y +1 which can be easily understood by defining X and X jointly, with X = X + ξ , with ξ ∼ Binomial(2, ), independently of X, so that P (X = X ) = 2(c/n), and constructing Y and Y similarly. Hence, by (12.27), 2 c |α − β| = 4 2 , n 3
and the N = 4 covariance term (12.26) is 12( nc 3 ). As there are no more than n4 where u, u , v, v are all distinct, their total contribution is (12c3 n), yielding the final term in (12.21). We apply a similar argument to the third and fourth terms arising from (12.18), both of the form di − D(v) 1D (u, v) n − 1 − D(v) u=v (12.28) for D = (u, v): {u, v} ∈ / E: D(v) < di , D(u) = dj − a , the first with a = 1, the second with a = 0, and show di − D(v) 1D (u, v) = 2di2 + 4di2 n + 12cdi2 n . Var n − 1 − D(v)
(12.29)
u=v
With N again counting the number of distinct indices among u, u , v, v , for the cases N = 2 and N = 3 it will suffice to apply the inequality di di − D(v) . ≤ 1{D(v)
handling the other cases similarly and, again, noting that the total number of N = 3 terms is no more than 4n3 leads to a contribution of 4(di2 n). For N = 4, write di − D(v) di − D(v ) , 1D (u , v ) Cov 1D (u, v) n − 1 − D(v) n − 1 − D(v ) di − D(v) di − D(v ) = E 1D (u, v) 1D (u , v ) − δ 2 , (12.30) n − 1 − D(v) n − 1 − D(v ) where
(di − D(v))+ δ = E 1{{u,v}∈/ E ,D(u)=dj −a} . n − 1 − D(v)
With C as in (12.24), for the first term in (12.30), as for (12.25), we have 2 cdi di − D(v) di − D(v ) E 1D (u, v) 1D (u , v ) C P (C) + 4 n − 1 − D(v) n − 1 − D(v ) n3 2 cdi ≤ γ 2 + 4 n3 where
(di − D(v))+ γ = E 1{{u,v}∈/ E , D(u)=dj −a} C . n − 1 − D(v)
Now we may bound the covariance (12.30) as 2 2 2 γ + 4 cdi − δ 2 ≤ |γ − δ||γ + δ| + 4 cdi n3 n3 2 cdi di = 2|γ − δ| + 4 . n n3
(12.31)
Let X, Y ∼ Binomial(n − 4, ) and X , Y ∼ Binomial(n − 2, ), all independent. By arguments similar to those which yield (12.27), with R denoting R complement, noting that in γ , conditioning on CR, D(u), D(v) are equal in distribution to X, Y , and in δ, conditioning on R, these same variables are distributed as X , Y , we have (di − Y )+ (di − Y )+ |γ − δ| ≤ E1{X=dj −a} E − E1{X =dj −a} E , (12.32) n−1−Y n − 1 − Y with |E1{X=dj −a} − E1{X =dj −a} | = 2(c/n) and E (di − Y )+ − E (di − Y )+ = 2 cdi , n−1−Y n − 1 − Y n2 yielding
|γ − δ| = 4
cdi . n2
Hence the N = 4 covariance term (12.30), by (12.31), is 12(cdi2 /n3 ). As there are at most n4 such terms the N = 4 contribution is (12cdi2 n). Summing this amount to the results of the N = 2 and N = 3 computations yields (12.29). For the last two terms in (12.18), note that v: D(v) = di = n − Yi and v: D(v) = dj = Yj , which respectively have variances σii and σjj , given in (12.12). Now, for instance, n βk2 (dk − c)2 c(1 − c/(n − 1)) p
σii ≤
k=1
≤ n(1/c)
p
βk (dk − c)2
k=1
≤ n(1/c)(n − 1)(1 − ) = (n),
(12.33)
with the same result holding, clearly, for σjj . Adding together (12.21), (12.29) and (12.33), and then multiplying by 2 to match each term of the first given type to the following one of the same type yields the sum of the six variance terms. Applying (12.19), which mandates an additional factor of 6, now yields (12.17).
12.3 Multivariate Exchangeable Pairs

To construct a Stein pair for a given application in order to apply, say, Theorem 5.4, one must create mean zero, variance 1, exchangeable variables $W, W'$, which satisfy the linearity condition
\[
E(W' \mid W) = (1 - \lambda) W \quad \text{for some } \lambda \in (0, 1). \tag{12.34}
\]
Typically it is easy to construct an exchangeable pair, by, say, sampling some variables from their conditional distribution given the others, but the linearity condition (12.34) is never guaranteed to hold. However, even when (12.34) fails to hold there are at least two remedies available, each of which may allow a bound to the normal to be computed. One remedy, already mentioned, is to work with an approximate version of (12.34), such as (2.41), which has a remainder term, and bounds may then be computed using Theorem 3.5; see also the work of Rinott and Rotar (1997).

In this section, we introduce a second technique, due to Reinert and Röllin (2009), which may be used when the linearity condition fails. The success of this method depends on being able to ‘embed’ the given collection of random variables in a larger one so that linearity holds in some multivariate sense. The price to pay is the extra complication of the higher dimensional setting, and often an accompanying loss in rates.

The one dimensional linearity condition (12.34) may be rephrased as saying that the conditional expectation agrees with the linear regression of $W'$ on $W$. Extending to the multivariate case, we recall that when $Y'$, $Y$ are random vectors with covariance matrix
\[
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
\]
with $\Sigma_{22}$ invertible, centralizing by letting
\[
W = Y - EY \quad \text{and} \quad W' = Y' - EY', \tag{12.35}
\]
the linear regression $L$ of $W'$ on $W$ is given by
\[
L(W' \mid W) = \Sigma_{12} \Sigma_{22}^{-1} W . \tag{12.36}
\]
Driven by (12.34) and these considerations, we are led to search for an exchangeable pair of vectors $W, W'$ such that
\[
E(W' \mid W) = (I - \Lambda) W, \quad \text{or} \quad E(W' - W \mid W) = -\Lambda W \tag{12.37}
\]
for some matrix $\Lambda$. Multivariate exchangeable pairs were first studied by Chatterjee and Meckes (2008), under a condition somewhat more restrictive than (12.37). Following Reinert and Röllin (2010), we show how in some cases a random quantity may be embedded in a higher dimensional vector so that (12.37) is satisfied, even if linearity does not hold for the originally given problem.

We take again as our example some characteristic of $K = K_{n,\varrho}$, the random graph considered in Sect. 12.2. In this case, say the quantity of interest is $T$, the total number of triangles. Let the vertices of $K$ be labeled by $\{1, 2, \ldots, n\}$, and for distinct $i, j \in \{1, \ldots, n\}$ let $\mathbf{1}_{i,j}$ be the indicator of the event that $\{i, j\}$ is in the edge set of $K$. We assume $n \ge 4$. With these conventions, we may write the number of triangles $T$ as
\[
T = \sum_{i < j < k} \mathbf{1}_{i,j} \mathbf{1}_{j,k} \mathbf{1}_{k,i} .
\]
To construct an exchangeable pair select vertices $1 \le I < J \le n$ uniformly from all such pairs, that is, with probability
\[
P(I = i, J = j) = \binom{n}{2}^{-1} \quad \text{for all } 1 \le i < j \le n,
\]
independently of $K$. Form a new graph $K'$ with edge indicators $\mathbf{1}'_{i,j} = \mathbf{1}_{i,j}$ for $\{i, j\} \ne \{I, J\}$, and $\mathbf{1}_{I,J}$ replaced by an independent copy $\mathbf{1}'_{I,J}$. The change in the number of triangles is given by
\[
T' - T = \sum_{k : \, k \notin \{I, J\}} \bigl( \mathbf{1}'_{I,J} \mathbf{1}_{J,k} \mathbf{1}_{k,I} - \mathbf{1}_{I,J} \mathbf{1}_{J,k} \mathbf{1}_{k,I} \bigr).
\]
Calculating the expectation of the change $T' - T$, conditional on $T$, by averaging over the $\binom{n}{2}$ choices of $i$ and $j$ we have
\[
\begin{aligned}
E(T' - T \mid T)
&= \binom{n}{2}^{-1} E \Bigl[ \sum_{i < j} \sum_{k : \, k \notin \{i, j\}} \bigl( \mathbf{1}'_{i,j} \mathbf{1}_{j,k} \mathbf{1}_{i,k} - \mathbf{1}_{i,j} \mathbf{1}_{j,k} \mathbf{1}_{i,k} \bigr) \Bigm| T \Bigr] \\
&= \binom{n}{2}^{-1} \Bigl( E \Bigl[ \varrho \sum_{i < j, \, k \notin \{i, j\}} \mathbf{1}_{j,k} \mathbf{1}_{i,k} \Bigm| T \Bigr] - 3T \Bigr),
\end{aligned}
\]
where for the first term we have used that $\mathbf{1}'_{i,j}$ is an indicator independent of $K$ with success probability $\varrho$, while for the second term the factor of three accounts for the number of ways $k$ can appear relative to $i < j$. Hence we see that the conditional expectation depends on the number of 2-stars $S$, given by
\[
S = \sum_{i < j : \, k \notin \{i, j\}} \mathbf{1}_{i,k} \mathbf{1}_{j,k},
\]
and hence by conditioning in addition on $S$ we obtain
\[
E(T' - T \mid S, T) = \binom{n}{2}^{-1} (\varrho S - 3T). \tag{12.38}
\]
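However, if we are to form an enlarged vector by appending $S$ to $T$ we must now also verify the linearity condition for $S$ as well. For the difference in the number of two stars we have
\[
S' - S = \sum_{k : \, k \notin \{I, J\}} \bigl( \mathbf{1}'_{I,J} - \mathbf{1}_{I,J} \bigr) \bigl( \mathbf{1}_{J,k} + \mathbf{1}_{I,k} \bigr),
\]
and hence
\[
\begin{aligned}
E(S' - S \mid S, T)
&= \binom{n}{2}^{-1} E \Bigl[ \sum_{i < j} \sum_{k : \, k \notin \{i, j\}} \bigl( \mathbf{1}'_{i,j} - \mathbf{1}_{i,j} \bigr) \bigl( \mathbf{1}_{j,k} + \mathbf{1}_{i,k} \bigr) \Bigm| S, T \Bigr] \\
&= \frac{\varrho}{\binom{n}{2}} E \Bigl[ \sum_{i < j, \, k \notin \{i, j\}} \bigl( \mathbf{1}_{j,k} + \mathbf{1}_{i,k} \bigr) \Bigm| S, T \Bigr]
 - \frac{1}{\binom{n}{2}} E \Bigl[ \sum_{i < j, \, k \notin \{i, j\}} \bigl( \mathbf{1}_{i,j} \mathbf{1}_{j,k} + \mathbf{1}_{i,j} \mathbf{1}_{i,k} \bigr) \Bigm| S, T \Bigr] \\
&= 2 \binom{n}{2}^{-1} \bigl( (n - 2) \varrho \, E(E \mid S, T) - S \bigr)
\end{aligned} \tag{12.39}
\]
where $E$ is the total number of edges in $K_{n,\varrho}$,
\[
E = \sum_{i < j} \mathbf{1}_{i,j} .
\]
Continuing the process, so now appending $E$ to the vector $S, T$, taking the difference, $E' - E = \mathbf{1}'_{I,J} - \mathbf{1}_{I,J}$, we find
\[
E(E' - E \mid E) = \binom{n}{2}^{-1} \sum_{i < j} E \bigl( \mathbf{1}'_{i,j} - \mathbf{1}_{i,j} \mid E \bigr) = \binom{n}{2}^{-1} \Bigl( \binom{n}{2} \varrho - E \Bigr), \tag{12.40}
\]
a linear function of $E$.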
Hence, letting $Y = (E, S, T)^\top$, the conditional expectation $E(Y' \mid Y)$ will be a linear function of $Y$. In particular, centralizing by letting $W$ and $W'$ be given by (12.35), from (12.38), (12.39) and (12.40) with the additional conditioning on $E$, we conclude that (12.37) holds with
\[
E(W' - W \mid W) = - \binom{n}{2}^{-1}
\begin{bmatrix}
1 & 0 & 0 \\
-2(n-2)\varrho & 2 & 0 \\
0 & -\varrho & 3
\end{bmatrix} W, \tag{12.41}
\]
thus achieving the multivariate version of (12.34).

In this particular example, the different powers of $n$ in the matrix in equality (12.41) indicate the presence of a scaling issue. In particular, from Reinert and Röllin (2010), and as one can easily confirm,
\[
EE = \binom{n}{2} \varrho, \quad ES = 3 \binom{n}{3} \varrho^2 \quad \text{and} \quad ET = \binom{n}{3} \varrho^3,
\]
and in addition they show
\[
\mathrm{Var}(E) = 3 \binom{n}{3} \frac{1}{n-2} \varrho (1 - \varrho), \qquad
\mathrm{Var}(S) = 3 \binom{n}{3} \varrho^2 (1 - \varrho) \bigl( 1 - \varrho + 4(n-2)\varrho \bigr)
\]
and
\[
\mathrm{Var}(T) = \binom{n}{3} \varrho^3 (1 - \varrho) \bigl( (1 - \varrho)^2 + 3\varrho(1 - \varrho) + 3(n-2)\varrho^2 \bigr).
\]
In particular, we see that the variance of the number of edges grows at the rate $n^2$, while the variances of the numbers of 2-stars and triangles grow like $n^4$. Considering, then, the standardized variables
\[
E_1 = \frac{n-2}{n^2} E, \quad S_1 = \frac{1}{n^2} S \quad \text{and} \quad T_1 = \frac{1}{n^2} T,
\]
which have limiting variances, letting
\[
W_1 = (E_1 - EE_1, \, S_1 - ES_1, \, T_1 - ET_1), \tag{12.42}
\]
and defining $W_1'$ likewise, one obtains $E(W_1' \mid W_1) = (I - \Lambda) W_1$ with
\[
\Lambda = \binom{n}{2}^{-1}
\begin{bmatrix}
1 & 0 & 0 \\
-2\varrho & 2 & 0 \\
0 & -\varrho & 3
\end{bmatrix}.
\]
Calculating the remaining covariance terms, Reinert and Röllin (2010) find that the covariance matrix 1 of W1 is given by ⎤ ⎡ 1 2 2 n ⎢ 2 (1−) ⎥ (n − 2) 3 ⎥ . (12.43) 2 42 + (1−) 23 + n−2 1 = 3 (1 − ) ⎢ n−2 ⎦ ⎣ 4 n 2 (1−) 2 (1+−22 ) 2 3 4 2 + n−2 + 3(n−2)
As the variables have been scaled so that their covariance matrix 1 has a non-trivial 1/2 limit, convergence in distribution of W1 to 1 Z is a consequence of the following result. Theorem 12.3 Let W1 and 1 be given by (12.42) and (12.43) respectively. Then with Z a standard normal vector in R3 , 1 35 8 −1 L(W1 ) − L 1/2 Z 1 + n−1 + n−2 . + 1 H3,∞,3 ≤ n 4 + 9n 3n Reinert and Röllin (2010) prove Theorem 12.3 by applying Theorem 12.4 for multivariate exchangeable pairs, from Reinert and Röllin (2009), and calculating the quantities that appear in the bounds below; we refer the reader there for these latter details. In these two works they also consider identity (12.37) generalized to have remainder, E(W − W|W) = − W + R,
(12.44)
and supply further applications of their embedding technique to runs on the line, complete $U$-statistics, and doubly indexed permutation statistics.

In Theorem 12.4, rather than compare the distribution of a random vector with covariance matrix $I$ to that of a standard normal $Z$, as in Theorem 12.2, one compares the distribution of a vector with covariance $\Sigma$ to that of $\Sigma^{1/2} Z$. In this case, one considers the variation on the multivariate Stein equation (2.22) given by
\[
\mathrm{Tr}\, \Sigma D^2 f(w) - w^\top \nabla f(w) = h(w) - E h\bigl( \Sigma^{1/2} Z \bigr), \tag{12.45}
\]
where $Z$ is a standard normal vector in $\mathbb{R}^p$. Equation (12.45) is (2.22) for $\mu = 0$, with the function $h(w)$ evaluated at $\Sigma^{1/2} w$. Using the change of variable arguments in Lemma 2.6 it is straightforward to construct a solution $f$ to (12.45) and show that it satisfies
\[
\bigl\| f_{i_1 \cdots i_k}(w) \bigr\| \le \frac{1}{k} \bigl\| h_{i_1 \cdots i_k}(w) \bigr\|, \tag{12.46}
\]
where $f_{i_1 \cdots i_k}(w)$ denotes the partial derivative of $f$ with respect to $w_{i_1}, \ldots, w_{i_k}$, and likewise for $h$, whenever the partial of $h$ on the right hand side exists. As in Sect. 12.2, we adopt the supremum norm (12.3) for matrices.

Theorem 12.4 Let $(W, W')$ be an exchangeable pair of $\mathbb{R}^p$ valued random vectors satisfying
\[
EW = 0, \quad E W W^\top = \Sigma \tag{12.47}
\]
with $\Sigma$ positive definite. Suppose further that (12.44) holds with $\Lambda$ an invertible matrix and $R$ some $\mathbb{R}^p$ valued random vector. Then, if $Z$ has the standard $p$-dimensional normal distribution,
\[
\bigl\| \mathcal{L}(W) - \mathcal{L}\bigl( \Sigma^{1/2} Z \bigr) \bigr\|_{\mathcal{H}_{3,\infty,p}} \le \frac{A}{4} + \frac{B}{12} + \Bigl( 1 + \frac{p}{2} \| \Sigma \|^{1/2} \Bigr) C,
\]
where, with $\gamma_i = \sum_{m=1}^{p} |(\Lambda^{-1})_{m,i}|$,
\[
\begin{aligned}
A &= \sum_{i,j=1}^{p} \gamma_i \sqrt{\mathrm{Var}\, E\bigl[ (W_i' - W_i)(W_j' - W_j) \mid W \bigr]}, \\
B &= \sum_{i,j,k=1}^{p} \gamma_i E \bigl| (W_i' - W_i)(W_j' - W_j)(W_k' - W_k) \bigr| \quad \text{and} \\
C &= \sum_{i=1}^{p} \gamma_i \sqrt{\mathrm{Var}(R_i)} .
\end{aligned}
\]
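Proof Recalling (12.2), let $f$ be the solution to the multivariate Stein equation (12.45) for a given $h \in \mathcal{H}_{3,\infty,p}$. Note that the function $F: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ given by

$$F(w', w) = \tfrac{1}{2}\,(w' - w)^\top \Lambda^{-\top} \bigl(\nabla f(w') + \nabla f(w)\bigr)$$

is anti-symmetric, and therefore, by exchangeability, $EF(W', W) = 0$. Thus

$$\begin{aligned}
0 &= \tfrac{1}{2}\, E\,(W' - W)^\top \Lambda^{-\top} \bigl(\nabla f(W') + \nabla f(W)\bigr) \\
&= E\,(W' - W)^\top \Lambda^{-\top} \nabla f(W) + \tfrac{1}{2}\, E\,(W' - W)^\top \Lambda^{-\top} \bigl(\nabla f(W') - \nabla f(W)\bigr) \\
&= E\,R^\top \Lambda^{-\top} \nabla f(W) - E\,W^\top \nabla f(W) + \tfrac{1}{2}\, E\,(W' - W)^\top \Lambda^{-\top} \bigl(\nabla f(W') - \nabla f(W)\bigr),
\end{aligned} \tag{12.48}$$

where we have applied (12.44). Focusing on the final expression in (12.48), Taylor expansion yields

$$\begin{aligned}
E\,(W' - W)^\top \Lambda^{-\top} \bigl(\nabla f(W') - \nabla f(W)\bigr)
&= \sum_{m,i,j} (\Lambda^{-1})_{m,i}\, E\,(W_i' - W_i)(W_j' - W_j) f_{m,j}(W) \\
&\quad + \sum_{m,i,j,k} (\Lambda^{-1})_{m,i}\, E\,(W_i' - W_i)(W_j' - W_j)(W_k' - W_k) \tilde{R}_{mjk},
\end{aligned} \tag{12.49}$$

where, by (12.46),

$$|\tilde{R}_{mjk}| \le \frac{1}{2} \sup_{w \in \mathbb{R}^p} |f_{m,j,k}(w)| \le \frac{1}{6}. \tag{12.50}$$

By exchangeability, (12.44) and (12.47),

$$E\,(W' - W)(W' - W)^\top = E\,W(W - W')^\top + E\,W'(W' - W)^\top = 2E\,W(\Lambda W - R)^\top = 2\Sigma\Lambda^\top - 2E\,WR^\top =: Q,$$

say. Solving for $\Sigma$ we obtain

$$\Sigma = \tfrac{1}{2}\, Q \Lambda^{-\top} + E(WR^\top)\Lambda^{-\top},$$

and therefore

$$\begin{aligned}
\operatorname{Tr}\,\Sigma D^2 f(W) &= \tfrac{1}{2} \operatorname{Tr}\bigl(Q\Lambda^{-\top} D^2 f(W)\bigr) + \operatorname{Tr}\bigl(E(WR^\top)\Lambda^{-\top} D^2 f(W)\bigr) \\
&= \tfrac{1}{2} \sum_{m,i,j} (\Lambda^{-1})_{m,i}\, Q_{j,i}\, f_{m,j}(W) + \sum_{m,i,j} (\Lambda^{-1})_{m,i}\, E(W_j R_i)\, f_{m,j}(W).
\end{aligned}$$

Combining this identity with (12.48) and (12.49), we obtain

$$\begin{aligned}
\bigl|E\bigl[\operatorname{Tr}\,\Sigma D^2 f(W) - W^\top \nabla f(W)\bigr]\bigr|
&\le \frac{1}{2} \Bigl| E \sum_{m,i,j} (\Lambda^{-1})_{m,i} \bigl(Q_{j,i} - E\{(W_i' - W_i)(W_j' - W_j) \mid W\}\bigr) f_{m,j}(W) \Bigr| \\
&\quad + \frac{1}{2} \Bigl| E \sum_{m,i,j,k} (\Lambda^{-1})_{m,i}\, (W_i' - W_i)(W_j' - W_j)(W_k' - W_k) \tilde{R}_{mjk} \Bigr| \\
&\quad + \Bigl| \sum_{i,m} (\Lambda^{-1})_{m,i}\, E R_i f_m(W) \Bigr| + \Bigl| \sum_{m,i,j} (\Lambda^{-1})_{m,i}\, E(W_j R_i)\, E f_{m,j}(W) \Bigr| \\
&\le \frac{1}{4} \sum_{i,j} \gamma_i\, E\bigl|Q_{j,i} - E\{(W_i' - W_i)(W_j' - W_j) \mid W\}\bigr| + \frac{1}{12} B \\
&\quad + \sum_i \gamma_i\, E|R_i| + \frac{1}{2} \sum_{i,j} \gamma_i\, E|W_j R_i|,
\end{aligned} \tag{12.51}$$

where we have applied (12.46) and (12.50) to obtain the last inequality. The Cauchy–Schwarz inequality yields

$$E|R_i| \le \sqrt{E R_i^2} \quad\text{and}\quad E|W_j R_i| \le \sqrt{E W_j^2\, E R_i^2} \le \|\Sigma\|^{1/2} \sqrt{E R_i^2}.$$

Since $EW = 0$ and $E(W' - W) = 0$, taking expectations in (12.44) gives $ER = 0$, so that $E R_i^2 = \operatorname{Var}(R_i)$. The term $C$ in the bound of the theorem now follows from the last two terms of (12.51). Lastly, recalling that $E\,(W' - W)(W' - W)^\top = Q$, so that $Q_{j,i}$ is the mean of the conditional expectation from which it is subtracted, the first term of the bound, $A/4$, arises from the first term of (12.51) by a further application of the Cauchy–Schwarz inequality.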
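The identity $E\,(W' - W)(W' - W)^\top = 2\Sigma\Lambda^\top - 2E\,WR^\top$ used in the proof can be checked by exact enumeration on a small example. The sketch below is only an illustration under assumptions that are not part of the text: the summands are taken uniform on the four points $\pm\sqrt{2}e_1, \pm\sqrt{2}e_2$ in $\mathbb{R}^2$, the pair is built by resampling one uniformly chosen summand, and the function name `check_q_identity` is hypothetical. For this construction $\Lambda = I/n$ and $R = 0$.

```python
import itertools
import numpy as np

def check_q_identity(n=3):
    """Exact enumeration for a toy exchangeable pair (illustrative assumption).

    W = n^{-1/2} * sum of n i.i.d. vectors, each uniform on the four atoms
    below; W' replaces one uniformly chosen summand by an independent copy.
    """
    atoms = np.sqrt(2.0) * np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
    p = 2
    sigma = np.zeros((p, p))   # accumulates E[W W^T]
    q = np.zeros((p, p))       # accumulates E[(W'-W)(W'-W)^T]
    drift_err = 0.0            # max deviation of E(W'-W | config) from -W/n
    configs = list(itertools.product(range(4), repeat=n))
    prob = 1.0 / len(configs)
    for cfg in configs:
        xi = atoms[list(cfg)]
        w = xi.sum(axis=0) / np.sqrt(n)
        sigma += prob * np.outer(w, w)
        cond_diff = np.zeros(p)
        for i in range(n):            # index of the resampled summand
            for a in range(4):        # value of the independent copy
                d = (atoms[a] - xi[i]) / np.sqrt(n)
                weight = 1.0 / (4 * n)
                cond_diff += weight * d
                q += prob * weight * np.outer(d, d)
        drift_err = max(drift_err, np.abs(cond_diff + w / n).max())
    lam = np.eye(p) / n               # Lambda for this construction, with R = 0
    return sigma, q, lam, drift_err

sigma, q, lam, err = check_q_identity(3)
print(np.allclose(q, 2 * sigma @ lam.T), err)
```

Conditioning on the full configuration is finer than conditioning on $W$, so a zero drift error here implies the linearity condition with $R = 0$ for this toy pair.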
12.4 Local Dependence, and Bounds in Kolmogorov Distance

We now consider multivariate results in distances which include the Kolmogorov metric, due to Rinott and Rotar (1996), that apply to sums

$$W = \sum_{i=1}^n X_i \tag{12.52}$$

of bounded, locally dependent random variables. In this section, for an array $A$, let $|A|$ denote the sum of the absolute values of the components of $A$. Since in finite dimension all norms are equivalent, and, in addition, as constants in the results of this section are not given explicitly, this convention is only a matter of convenience.

Let $X_1, \ldots, X_n$ be random vectors in $\mathbb{R}^p$ satisfying $|X_i| \le B$ for all $i = 1, \ldots, n$; the constant $B$ is allowed to depend on $n$. For each $i = 1, \ldots, n$, let $S_i \subset N_i$ be dependency neighborhoods for $X_i$, such that with $D_1 \le D_2$ we have

$$\max\{|S_i|,\ i = 1, \ldots, n\} \le D_1 \quad\text{and}\quad \max\{|N_i|,\ i = 1, \ldots, n\} \le D_2,$$

where for a finite set we also use $|\cdot|$ to denote cardinality. The sets $S_i$ and $N_i$ may be random. For $i = 1, \ldots, n$, let

$$U_i = \sum_{k \in S_i} X_k, \quad V_i = W - U_i, \quad R_i = \sum_{k \in N_i} X_k \quad\text{and}\quad T_i = W - R_i, \tag{12.53}$$

and

$$\chi_1 = \sum_{i=1}^n E\bigl|E(X_i \mid V_i)\bigr|, \quad \chi_2 = \sum_{i=1}^n E\bigl|E(X_i U_i^\top) - E(X_i U_i^\top \mid T_i)\bigr| \quad\text{and}\quad \chi_3 = \Bigl|I - \sum_{i=1}^n E(X_i U_i^\top)\Bigr|. \tag{12.54}$$

As in Sect. 5.4, we consider a collection $\mathcal{H}$ of functions satisfying Condition 5.1, and recall that $a$ is a constant that measures the roughness of the collection with respect to the normal distribution. Specifically, with $Z$ a standard normal vector in $\mathbb{R}^p$, we assume the collection $\mathcal{H}$ satisfies

$$E\tilde{h}_\epsilon(Z) \le a\epsilon \quad\text{for all } \epsilon > 0, \tag{12.55}$$

where $\tilde{h}_\epsilon$ is defined in (iii) of Condition 5.1. The class of indicators of convex sets is one example of a collection $\mathcal{H}$ that satisfies Condition 5.1, see Sazonov (1968), and Bhattacharya and Rao (1986). As in the one dimensional case, given a function class $\mathcal{H}$ and random vectors $X$ and $Y$, we let

$$\|\mathcal{L}(X) - \mathcal{L}(Y)\|_{\mathcal{H}} = \sup_{h \in \mathcal{H}} \bigl|Eh(X) - Eh(Y)\bigr|.$$
Letting H be the collection of indicators of ‘lower quadrants’ the distance above specializes to the Kolmogorov distance. As our interest in this section is in multivariate results, we consider only the general p ≥ 1 case of the results of Rinott and Rotar (1996), and refer the reader there for improvements of Theorems 12.5 and 12.8 in the case p = 1.
Theorem 12.5 For $p \ge 1$, there exists a constant $C$ depending only on the dimension $p$, such that

$$\|\mathcal{L}(W) - \mathcal{L}(Z)\|_{\mathcal{H}} \le C\bigl\{a D_2 B + n a D_1 D_2 B^3 \bigl(|\log B| + \log n\bigr) + \chi_1 + \bigl(|\log B| + \log n\bigr)(\chi_2 + \chi_3)\bigr\}. \tag{12.56}$$

The $\log n$ terms in the bound will typically preclude the $n^{-1/2}$ rate possible in the $p = 1$ case. Though it is not assumed that $W$ have mean zero and identity covariance, typically, as below, the theorem is invoked in the standardized case. Theorem 12.5 follows from Theorem 12.8, which proceeds by way of smoothing inequalities, as in the proof of Theorem 5.8.

As in Rinott and Rotar (1996), we apply Theorem 12.5 to two questions in random graphs. First, let $G$ be a fixed regular graph with $n$ vertices, each of degree $m$. Let each vertex be independently assigned color $i = 1, \ldots, p$ with probability $\pi_i \ge 0$, where $\sum_{i=1}^p \pi_i = 1$. We are interested in counting the number of edges that connect vertices having the same color. Indexing the $N = nm/2$ edges of $G$ by $\{1, \ldots, N\}$, for $i = 1, \ldots, p$ we may write $Y_i$, the number of edges whose vertices both have color $i$, as the sum of indicators

$$Y_i = \sum_{k=1}^N X_{ki},$$

where $X_{ki}$ indicates if edge $k$ connects vertices both of which have color $i$. Letting $Y = (Y_1, \ldots, Y_p)$, clearly, the mean $\mu = EY$ is given by

$$\mu = N\bigl(\pi_1^2, \ldots, \pi_p^2\bigr). \tag{12.57}$$

To calculate the variance of $Y_i$, note first that there will be $N$ variance terms of the form $\operatorname{Var}(X_{ki}) = \pi_i^2(1 - \pi_i^2)$. Next, for a given edge $k$, the covariance between $X_{ki}$ and $X_{li}$ for $l \ne k$ will be nonzero only for the $2(m - 1)$ edges $l \ne k$ that share a vertex with $k$, and in this case $\operatorname{Cov}(X_{ki}, X_{li}) = \pi_i^3 - \pi_i^4$. Hence, letting $\Sigma = (\sigma_{ij})$ be the covariance matrix of $Y$,

$$\sigma_{ii} = N\pi_i^2\bigl(1 - \pi_i^2\bigr) + 2N(m - 1)\bigl(\pi_i^3 - \pi_i^4\bigr).$$

For $i \ne j$ note that $\operatorname{Cov}(X_{ki}, X_{lj})$ is zero if edges $k$ and $l$ have no vertex in common, while otherwise, as $X_{ki}X_{lj} = 0$, this covariance is $-\pi_i^2\pi_j^2$. Hence

$$\sigma_{ij} = -N(2m - 1)\pi_i^2\pi_j^2 \quad\text{for } i \ne j.$$

It is not difficult to see that we may write $\Sigma$ a bit more compactly as

$$\Sigma = N(2m - 1)\bigl(A - bb^\top\bigr) + NH, \tag{12.58}$$

where $A$ and $H$ are diagonal matrices with $i$th diagonal entries $\pi_i^3$ and $\pi_i^2 - \pi_i^3$, respectively, and $b$ the column vector with $i$th component $\pi_i^2$. Using this representation and recalling (12.15), we now show

$$\Sigma \succeq NH. \tag{12.59}$$
Let $D$ be the diagonal matrix with diagonal entries $\pi_i^{3/2}$ and $g$ the column vector with entries $\pi_i^{1/2}$. Then $A - bb^\top = D(I - gg^\top)D$. Since $\sum_i \pi_i = 1$, it is easy to see that the smallest eigenvalue of $I - gg^\top$ is 0. Hence $A - bb^\top$ is nonnegative definite, and (12.59) follows. It follows that, with

$$L = \Bigl(\min_{1 \le i \le p} \pi_i^2(1 - \pi_i)\Bigr)^{-1/2}, \tag{12.60}$$

and $\|\cdot\|$ denoting the maximal absolute value of the entries of an array as in (12.3),

$$\bigl\|\Sigma^{-1/2}\bigr\| \le N^{-1/2} L.$$

We apply Theorem 12.5 to the normalized counts $W = \Sigma^{-1/2}(Y - \mu)$ with the bound $B = pN^{-1/2}L$ on the standardized summands

$$X_k = \Sigma^{-1/2}\bigl(X_{k1} - \pi_1^2, \ldots, X_{kp} - \pi_p^2\bigr)^\top$$

in (12.52). For any edge $j$ we choose $S_j$ to be the collection of all edges that share a vertex with $j$, and $N_i = \bigcup_{j \in S_i} S_j$, yielding the bounds $D_1 = 2m - 1$ and $D_2 = (2m - 1)^2$ on the cardinalities of $S_j$ and $N_i$, respectively. These choices yield $\chi_1 = \chi_2 = \chi_3 = 0$. Observing that $m \le n$ and $L \ge 1$, we have obtained the following result.

Theorem 12.6 For $i = 1, \ldots, p$, let $Y_i$ count the number of edges, of a regular degree $m$ graph with $n$ vertices, that connect vertices both of color $i$, where colors $1, \ldots, p$ are assigned independently with probabilities $\pi_1, \ldots, \pi_p$. Then there exists a constant $C$, depending only on $p$, such that

$$\|\mathcal{L}(W) - \mathcal{L}(Z)\|_{\mathcal{H}} \le C a\, m^{3/2} L^3 \bigl(|\log L| + \log n\bigr) n^{-1/2}, \tag{12.61}$$

where $W = \Sigma^{-1/2}(Y - \mu)$ with the mean $\mu$ and variance $\Sigma$ of $Y$ given by (12.57) and (12.58), respectively, $L$ is given by (12.60) and the constant $a$ depends on the class $\mathcal{H}$ through (12.55).

In this same case, a bound of rate $n^{-1/2}$ was obtained by Goldstein and Rinott (1996), but only for smooth functions.

The following example illustrates an advantage of Theorem 12.5 in allowing the neighborhoods of dependence $S_i$ and $N_i$ to be random. Consider a sample of $n$ points chosen independently in $\mathbb{R}^k$ according to some absolutely continuous distribution. Similar to the earlier example, color each point with color $i$ with probability $\pi_i$, $i = 1, \ldots, p$, independently of the sample and of the colors assigned to other points. Now form the nearest neighbor graph by making a directed edge from each point in the collection to its nearest neighbor. Let $X_{ji}$ be the indicator that vertex $j$ and its nearest neighbor both have color $i$, so that

$$Y_i = \sum_{j=1}^n X_{ji}$$

counts the number of times a vertex and its nearest neighbor share color $i$, with mutual nearest neighbors counted twice.

The vector $Y = (Y_1, \ldots, Y_p)$ is of interest in multivariate tests of equality of distribution. In particular, consider observations that are drawn from $p = 2$ distributions, say $F_1$ and $F_2$, with probabilities $\pi_1$ and $\pi_2 = 1 - \pi_1$, respectively. When $F_1 \ne F_2$ one would expect that there would be a certain degree of clustering of the points drawn from the same distribution, and that the nearest neighbor of a point drawn from $F_1$, say, would more likely be a point from this same population, rather than one from $F_2$. Clustering of this type could be detected by computing functions of $Y$ for a given sample, and testing against the null hypothesis that $F_1 = F_2$, that is, the case where the colors are assigned independently of the sample. Hence, the normal approximation of $Y$ for this instance gives bounds on how well the null distribution of a test statistic can be approximated by the same function of a multivariate normal vector. The use of $Y$ for such tests was considered by Schilling (1986) and Henze (1988); the latter proves asymptotic normality, without rates, of certain test statistics.

Clearly, the mean $\mu$ of $Y$ is given by

$$\mu = n\bigl(\pi_1^2, \ldots, \pi_p^2\bigr). \tag{12.62}$$

Referring further details to Rinott and Rotar (1996), letting $\alpha$ be the probability that vertex $j$ is the nearest neighbor of its own nearest neighbor, $\beta = ED^2(j)$ for $D(j)$ the degree of $j$, $H$, $J$ and $K$ the $p \times p$ diagonal matrices having $i$th diagonal elements $\pi_i^3$, $\pi_i^2 - \pi_i^4$ and $\pi_i^2 - \pi_i^3$ respectively, and $b$ the column vector with $i$th component $\pi_i^2$, we may write the covariance matrix $\Sigma$ of $Y$ as
1 = n(β − 2) H − bb + nJ + nαK. 2 By use of this representation, one may derive −1/2 ≤ n−1/2 M where M = min π 2 − π 4 i i 1≤i≤p
(12.63)
−1/2
.
(12.64)
Hence, as in the previous example, we may take the bound $B$ on the standardized summand variables to equal $pn^{-1/2}M$.

Since the dependency neighborhoods may be random they can, in particular, depend on the nearest neighbor graph $G$ itself. In particular, let $S_j$ consist of vertex $j$ and all vertices that are connected to $j$ by an edge. Then for any set $A$ we have

$$P\bigl(\{X_{ji}, 1 \le i \le p\} \in A\bigr) = P\bigl(\{X_{ji}, 1 \le i \le p\} \in A \mid G, \{X_{li}, 1 \le i \le p, l \notin S_j\}\bigr).$$

Taking expectations conditioned on $\{X_{li}, 1 \le i \le p, l \notin S_j\}$ we obtain the independence of $\{X_{ji}, 1 \le i \le p\}$ and $\{X_{li}, 1 \le i \le p, l \notin S_j\}$. With similar arguments for $N_i = \bigcup_{j \in S_i} S_j$, one may conclude that $\chi_1 = \chi_2 = \chi_3 = 0$.

It is easy to see that in $\mathbb{R}^k$ the degree of the nearest neighbor graph is bounded by the kissing number $\kappa_k$, the maximum number of spheres of radius 1 that can simultaneously touch the unit sphere at the origin. The kissing number was discussed in Sect. 4.6, where further references, some to bounds, were provided. For our choices of dependency neighborhoods, we may take $D_1 = \kappa_k$ and $D_2 = \kappa_k^2$ as bounds on the cardinalities of $S_i$ and $N_i$, respectively. Applying Theorem 12.5, as for the bound (12.61), one obtains the following result.

Theorem 12.7 Let $n$ i.i.d. points from an absolutely continuous distribution in $\mathbb{R}^k$ be assigned colors $i = 1, \ldots, p$ independently with probabilities $\pi_1, \ldots, \pi_p$, and let the component $Y_i$ of $Y = (Y_1, \ldots, Y_p)$ count the number of vertices which share color $i$ with their nearest neighbor. Then, with $W = \Sigma^{-1/2}(Y - \mu)$, where the mean $\mu$ and the variance $\Sigma$ of $Y$ are given by (12.62) and (12.63), respectively, there exists a constant $C$, depending only on $p$, such that

$$\|\mathcal{L}(W) - \mathcal{L}(Z)\|_{\mathcal{H}} \le C a\, \kappa_k^3 M^3 \bigl(|\log M| + \log n\bigr) n^{-1/2},$$

where $\kappa_k$ is the kissing number in dimension $k$, $M$ is given in (12.64), and the constant $a$ depends on the class $\mathcal{H}$ through (12.55).

The quantities $\alpha$ and $\beta$ in (12.63) converge to finite limits as $n$ tends to infinity, and therefore $n^{-1}\Sigma$ tends to a limiting matrix. When the sample of points has a continuous distribution $F$, these limits do not depend on $F$, yielding, asymptotically, a non-parametric test for the equality of multivariate distributions. See Schilling (1986), Henze (1988) and Rinott and Rotar (1996) for further details.

We now proceed to the proof of Theorem 12.5, which is a consequence of the following result.

Theorem 12.8 For each $i = 1, \ldots, n$ assume that we have two representations of $W$, $W = U_i + V_i$, and $W = R_i + T_i$, such that $|U_i| \le A_1$ and $|R_i| \le A_2$ for constants $A_1 \le A_2$. Then for $p \ge 1$ there exists a constant $C$ depending only on the dimension $p$ such that

$$\|\mathcal{L}(W) - \mathcal{L}(Z)\|_{\mathcal{H}} \le C\bigl\{a A_2 + n a A_1 A_2 B \bigl(|\log A_2 B| + \log n\bigr) + \chi_1 + \bigl(|\log A_2 B| + \log n\bigr)(\chi_2 + \chi_3)\bigr\}, \tag{12.65}$$

where the constant $a$ depends on the class $\mathcal{H}$ through (12.55), $|X_i| \le B$ for all $i = 1, \ldots, n$ and $\chi_1$, $\chi_2$ and $\chi_3$ are specified in (12.54).
Theorem 12.5 follows by observing that the quantities defined in (12.53) satisfy the assumptions of Theorem 12.8 with $A_1 = D_1 B$ and $A_2 = D_2 B$. We note that the $\log A_2$ term that appears in (12.65) does not give rise to a $\log D_2$ term in (12.56) as $D_2 \le n$.

Throughout the proof of Theorem 12.8 we write $C$ for universal constants, not necessarily the same at each occurrence. For a given $h \in \mathcal{H}$ and $s \in (0, 1)$ we work with the smoothed version of $h$ given by

$$h_s(x) = \int_{\mathbb{R}^p} h\bigl(\sqrt{s}\,z + \sqrt{1 - s}\,x\bigr)\phi(z)\,dz, \tag{12.66}$$

where $\phi(z)$ denotes the standard normal density in $\mathbb{R}^p$. It is straightforward to verify that $Nh_s = Nh$ for all $s \in [0, 1]$. We require the following smoothing result, Lemma 2.11 of Götze (1991), from Bhattacharya and Rao (1986). Lemma 12.1 is essentially the same as Lemma 5.3, though there the smoothing is at scale $t$, and here at scale $\sqrt{t}$. Here we let $\Phi$ denote the standard normal distribution in $\mathbb{R}^p$.

Lemma 12.1 Let $Q$ be a probability measure on $\mathbb{R}^p$. Then there exists a constant $C > 0$, which depends only on $p$, such that for all $t \in (0, 1)$

$$\sup\Bigl\{\Bigl|\int_{\mathbb{R}^p} h\,d(Q - \Phi)\Bigr| : h \in \mathcal{H}\Bigr\} \le C\Bigl[\sup\Bigl\{\Bigl|\int_{\mathbb{R}^p} h_t\,d(Q - \Phi)\Bigr| : h \in \mathcal{H}\Bigr\} + a\sqrt{t}\Bigr],$$

where $a$ is any constant that satisfies (12.55).

For $f: \mathbb{R}^p \to \mathbb{R}$, let $\nabla f$ and $D^2 f$ denote the gradient and Hessian matrix of $f$, respectively. Consider the multivariate Stein equation (2.22) with $\mu = 0$ and $\Sigma = I$, that is,

$$\operatorname{Tr} D^2 f(w) - w \cdot \nabla f(w) = h(w) - Nh. \tag{12.67}$$

By Götze (1991), the function

$$f_t(x) = -\frac{1}{2}\int_t^1 \bigl[h_s(x) - Nh\bigr]\,\frac{ds}{1 - s} \tag{12.68}$$

solves the Stein equation (12.67) for $h_t$. Again by Götze (1991), when $|h| \le 1$, there exists a constant $C$ such that

$$\|\nabla f_t\| \le C \quad\text{and}\quad \|D^2 f_t\| \le C \log t^{-1}. \tag{12.69}$$

Setting $K_i = X_i U_i^\top$, by (12.67),

$$Eh_t(W) - Nh = E\bigl[\operatorname{Tr} D^2 f_t(W) - W \cdot \nabla f_t(W)\bigr] = A - B - C + D,$$

where

$$\begin{aligned}
A &= E \operatorname{Tr}\Bigl\{D^2 f_t(W)\Bigl(I - \sum_{i=1}^n K_i\Bigr)\Bigr\}, \\
B &= \sum_{i=1}^n E\bigl[X_i \cdot \nabla f_t(V_i)\bigr], \\
C &= \sum_{i=1}^n E\bigl[X_i \cdot \bigl(\nabla f_t(W) - \nabla f_t(V_i) - D^2 f_t(V_i)U_i\bigr)\bigr] \quad\text{and} \\
D &= \sum_{i=1}^n E \operatorname{Tr}\bigl\{K_i\bigl(D^2 f_t(W) - D^2 f_t(V_i)\bigr)\bigr\}.
\end{aligned} \tag{12.70}$$
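The next lemma is used to bound Taylor series remainder terms arising from the decomposition of terms (12.70). We will let $f^{(3)}_{t(jkl)}(x)$ denote the third derivative of $f_t$ at $x$, with respect to $x_j, x_k, x_l$, with a similar notation for the partial derivatives of the normal density $\phi(z)$, and derivatives of lower orders.

Lemma 12.2 Let $W$, $V$ and $U$ be any random vectors in $\mathbb{R}^p$ satisfying $W = V + U$, and let $Y$ be any random variable. Suppose there exist constants $C_1$ and $C_2$ such that $|U| \le C_1$ and $|Y| \le C_2$. Set

$$\kappa = \sup\bigl\{|Eh(W) - Nh| : h \in \mathcal{H}\bigr\}. \tag{12.71}$$

Then there exists a constant $C$, depending only on $p$, such that for all $\tau \in [0, 1]$ and $h \in \mathcal{H}$,

$$\bigl|E\,Y f^{(3)}_{t(jkl)}(V + \tau U)\bigr| \le C C_2\bigl(\kappa/\sqrt{t} + aC_1/\sqrt{t} + a|\log t|\bigr),$$

where $a$ satisfies (12.55).

Proof By replacing $h_t$ by $h_t - Nh$ we may assume $Nh_t = 0$. Differentiation of (12.66), using a change of variable, and (12.68), yield

$$f^{(3)}_{t(jkl)}(x) = \frac{1}{2}\int_t^1 \frac{(1 - s)^{1/2}}{s^{3/2}}\,ds \int_{\mathbb{R}^p} h\bigl(\sqrt{s}\,z + \sqrt{1 - s}\,x\bigr)\phi^{(3)}_{jkl}(z)\,dz.$$

Observe that

$$\int_{\mathbb{R}^p} \phi^{(3)}_{jkl}(z)\,dz = \frac{\partial^3}{\partial x_j\,\partial x_k\,\partial x_l}\int_{\mathbb{R}^p} \phi(z + x)\,dz\,\Big|_{x = 0} = \frac{\partial^3}{\partial x_j\,\partial x_k\,\partial x_l}\,1\,\Big|_{x = 0} = 0. \tag{12.72}$$

Now,

$$\begin{aligned}
\bigl|E\,Y f^{(3)}_{t(jkl)}(V + \tau U)\bigr|
&= \Bigl|\frac{1}{2}\int_t^1 \frac{(1 - s)^{1/2}}{s^{3/2}}\,ds\; E\,Y \int_{\mathbb{R}^p} h\bigl(\sqrt{s}\,z + \sqrt{1 - s}\,(V + \tau U)\bigr)\phi^{(3)}_{jkl}(z)\,dz\Bigr| \\
&= \Bigl|\frac{1}{2}\int_t^1 \frac{(1 - s)^{1/2}}{s^{3/2}}\,ds\; E\,Y \int_{\mathbb{R}^p} h\bigl(\sqrt{1 - s}\,W - \sqrt{1 - s}\,(1 - \tau)U + \sqrt{s}\,z\bigr)\phi^{(3)}_{jkl}(z)\,dz\Bigr| \\
&= \Bigl|\frac{1}{2}\int_t^1 \frac{(1 - s)^{1/2}}{s^{3/2}}\,ds\; E\,Y \int_{\mathbb{R}^p} \bigl[h\bigl(\sqrt{1 - s}\,W - \sqrt{1 - s}\,(1 - \tau)U + \sqrt{s}\,z\bigr) \\
&\qquad\qquad - h\bigl(\sqrt{1 - s}\,W - \sqrt{1 - s}\,(1 - \tau)U\bigr)\bigr]\phi^{(3)}_{jkl}(z)\,dz\Bigr|
\end{aligned} \tag{12.73}$$

$$\begin{aligned}
&\le \frac{1}{2}\int_t^1 \frac{1}{s^{3/2}}\,ds\; C_2 \int_{\mathbb{R}^p} E\Bigl[\sup_{|u| \le C_1 + \sqrt{s}|z|} h\bigl(\sqrt{1 - s}\,W + u\bigr) - \inf_{|u| \le C_1 + \sqrt{s}|z|} h\bigl(\sqrt{1 - s}\,W + u\bigr)\Bigr]\bigl|\phi^{(3)}_{jkl}(z)\bigr|\,dz \\
&= \frac{C_2}{2}\int_t^1 \frac{1}{s^{3/2}}\,ds \int_{\mathbb{R}^p} E\,\tilde{h}\bigl(\sqrt{1 - s}\,W;\ C_1 + \sqrt{s}|z|\bigr)\bigl|\phi^{(3)}_{jkl}(z)\bigr|\,dz,
\end{aligned} \tag{12.74}$$

where we have used (12.72) to obtain (12.73), and recalled the definition of $\tilde{h}$ in (5.28).

Let $Z$ denote an independent standard normal vector in $\mathbb{R}^p$. Adding and subtracting, the quantity (12.74) equals

$$\frac{C_2}{2}\int_t^1 \frac{1}{s^{3/2}}\,ds \int_{\mathbb{R}^p} E\bigl[\tilde{h}\bigl(\sqrt{1 - s}\,W;\ C_1 + \sqrt{s}|z|\bigr) - \tilde{h}\bigl(\sqrt{1 - s}\,Z;\ C_1 + \sqrt{s}|z|\bigr) + \tilde{h}\bigl(\sqrt{1 - s}\,Z;\ C_1 + \sqrt{s}|z|\bigr)\bigr]\bigl|\phi^{(3)}_{jkl}(z)\bigr|\,dz. \tag{12.75}$$

Again in view of definition (5.28) of $\tilde{h}$, for any $\epsilon > 0$,

$$\begin{aligned}
\bigl|E\bigl[\tilde{h}\bigl(\sqrt{1 - s}\,W; \epsilon\bigr) - \tilde{h}\bigl(\sqrt{1 - s}\,Z; \epsilon\bigr)\bigr]\bigr|
&\le \bigl|E\bigl[h^+\bigl(\sqrt{1 - s}\,W; \epsilon\bigr) - h^+\bigl(\sqrt{1 - s}\,Z; \epsilon\bigr)\bigr]\bigr| \\
&\quad + \bigl|E\bigl[h^-\bigl(\sqrt{1 - s}\,W; \epsilon\bigr) - h^-\bigl(\sqrt{1 - s}\,Z; \epsilon\bigr)\bigr]\bigr|.
\end{aligned} \tag{12.76}$$

By the closure conditions on the class $\mathcal{H}$ and the definition (12.71) of $\kappa$, we see that for any $\epsilon > 0$ the expression (12.76) is bounded by $2\kappa$. As

$$\int_t^1 \frac{1}{s^{3/2}}\,ds \le \frac{C}{\sqrt{t}},$$

we conclude, for some $C$,

$$\int_t^1 \frac{1}{s^{3/2}}\,ds \int_{\mathbb{R}^p} \bigl|E\bigl[\tilde{h}\bigl(\sqrt{1 - s}\,W;\ C_1 + \sqrt{s}|z|\bigr) - \tilde{h}\bigl(\sqrt{1 - s}\,Z;\ C_1 + \sqrt{s}|z|\bigr)\bigr]\bigr|\,\bigl|\phi^{(3)}_{jkl}(z)\bigr|\,dz \le C\kappa/\sqrt{t}. \tag{12.77}$$

Turning now to the last term of (12.75), by (12.55),

$$E\,\tilde{h}\bigl(\sqrt{1 - s}\,Z;\ C_1 + \sqrt{s}|z|\bigr) \le a\bigl(C_1 + \sqrt{s}|z|\bigr).$$

Hence

$$\begin{aligned}
\int_t^1 \frac{1}{s^{3/2}}\,ds \int_{\mathbb{R}^p} E\,\tilde{h}\bigl(\sqrt{1 - s}\,Z;\ C_1 + \sqrt{s}|z|\bigr)\bigl|\phi^{(3)}_{jkl}(z)\bigr|\,dz
&\le a\int_t^1 \frac{1}{s^{3/2}}\int_{\mathbb{R}^p} \bigl(C_1 + \sqrt{s}|z|\bigr)\bigl|\phi^{(3)}_{jkl}(z)\bigr|\,dz\,ds \\
&\le Ca\bigl(C_1/\sqrt{t} + |\log t|\bigr).
\end{aligned}$$

Lemma 12.2 now follows by collecting terms.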
We are now ready for the proof of Theorem 12.8.

Proof Consider the decomposition in (12.70), starting with the term $C$. Let $X_{ij}$ and $U_{ij}$ denote the $j$th components of $X_i$ and $U_i$, respectively. For $i = 1, \ldots, n$, Taylor expansion of $\nabla f_t(W)$ about $V_i$ shows that $C$ equals

$$\sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p \sum_{l=1}^p E\int_0^1 (1 - \tau)\,X_{ij}U_{ik}U_{il}\,f^{(3)}_{t(jkl)}(V_i + \tau U_i)\,d\tau. \tag{12.78}$$

Applying Lemma 12.2 for each $i$, with $U = U_i$ and $Y = X_{ij}U_{ik}U_{il}$, recalling $|X_i| \le B$ and $|U_i| \le A_1$, we obtain

$$|C| \le CnA_1^2 B\bigl(\kappa/\sqrt{t} + aA_1/\sqrt{t} + a|\log t|\bigr). \tag{12.79}$$

Next consider the term $D$ in (12.70). A first order Taylor expansion yields

$$f^{(2)}_{t(jk)}(W) - f^{(2)}_{t(jk)}(V_i) = \sum_{l=1}^p \int_0^1 f^{(3)}_{t(jkl)}(V_i + \tau U_i)\,U_{il}\,d\tau. \tag{12.80}$$

The term $D$ is obtained by multiplying (12.80) by the entries of $K_i$, and it is easy to see from their definition that this leads to a term which is similar to (12.78), allowing us to conclude that $|D|$ is bounded by the right hand side of (12.79), with a possibly different constant.

Next, note that we may write $B$ of (12.70) as

$$B = \sum_{i=1}^n E\bigl[\nabla f_t(V_i) \cdot E(X_i \mid V_i)\bigr].$$

By the bound (12.69) on the solution to (12.67), the components of $\nabla f_t(V_i)$ are uniformly bounded, implying that for some positive constant $C$

$$|B| \le C\sum_{i=1}^n \sum_{j=1}^p E\bigl|E(X_{ij} \mid V_i)\bigr|. \tag{12.81}$$

Finally, consider $A$. With $\delta_{jk} = 1$ when $j = k$ and 0 otherwise, we have

$$\begin{aligned}
\operatorname{Tr}\Bigl\{D^2 f_t(W)\Bigl(I - \sum_{i=1}^n K_i\Bigr)\Bigr\}
&= \sum_{j=1}^p \sum_{k=1}^p D^2 f_{t(jk)}(W)\Bigl(\delta_{jk} - \sum_{i=1}^n X_{ik}U_{ij}\Bigr) \\
&= \sum_{j=1}^p \sum_{k=1}^p D^2 f_{t(jk)}(W) \\
&\quad \times \Bigl(\delta_{jk} - \sum_{i=1}^n E(X_{ik}U_{ij}) + \sum_{i=1}^n E(X_{ik}U_{ij}) - \sum_{i=1}^n X_{ik}U_{ij}\Bigr).
\end{aligned} \tag{12.82}$$

By the bound (12.69), we have that $|D^2 f_{t(jk)}(W)| \le C\log(t^{-1})$ for all $j, k = 1, \ldots, p$ and $t \in (0, 1)$. Hence, for the expectation of the first two terms of the last line of (12.82), we obtain the bound

$$\Bigl|E\sum_{j=1}^p \sum_{k=1}^p D^2 f_{t(jk)}(W)\Bigl(\delta_{jk} - \sum_{i=1}^n E(X_{ik}U_{ij})\Bigr)\Bigr| \le C|\log t|\sum_{j=1}^p \sum_{k=1}^p \Bigl|\delta_{jk} - \sum_{i=1}^n E(X_{ik}U_{ij})\Bigr|. \tag{12.83}$$

Now write the expression involving the last two terms in the last line of (12.82), summand by summand in $i$, in the form

$$\sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p \bigl[D^2 f_{t(jk)}(W) - D^2 f_{t(jk)}(T_i) + D^2 f_{t(jk)}(T_i)\bigr] \times \bigl(E(X_{ik}U_{ij}) - X_{ik}U_{ij}\bigr). \tag{12.84}$$

Taylor expansion of the difference $D^2 f_{t(jk)}(W) - D^2 f_{t(jk)}(T_i)$, and Lemma 12.2 applied for each $i$ with $U = R_i$ and $Y = R_{il}\bigl(X_{ik}U_{ij} - E(X_{ik}U_{ij})\bigr)$, imply

$$\Bigl|\sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p E\bigl[\bigl(D^2 f_{t(jk)}(W) - D^2 f_{t(jk)}(T_i)\bigr)\bigl(E(X_{ik}U_{ij}) - X_{ik}U_{ij}\bigr)\bigr]\Bigr| \le CnA_1 A_2 B\bigl(\kappa/\sqrt{t} + aA_2/\sqrt{t} + a|\log t|\bigr). \tag{12.85}$$

Returning to (12.84), we apply (12.69) to bound the last term by

$$\Bigl|\sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p E\bigl[D^2 f_{t(jk)}(T_i)\bigl(E(X_{ik}U_{ij}) - X_{ik}U_{ij}\bigr)\bigr]\Bigr| \le C|\log t|\sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p E\bigl|E(X_{ik}U_{ij}) - E(X_{ik}U_{ij} \mid T_i)\bigr|. \tag{12.86}$$

Combining Lemma 12.1, the decomposition (12.70), and the bounds (12.79), (12.81), (12.83), (12.85) and (12.86), noting that since $A_1 \le A_2$ the term (12.79) may be ignored, being of smaller order than (12.85), we obtain

$$\begin{aligned}
\kappa &\le CnA_1 A_2 B\kappa/\sqrt{t} + CnaA_1 A_2 B\bigl(A_2/\sqrt{t} + |\log t|\bigr) + C\sum_{i=1}^n \sum_{j=1}^p E\bigl|E(X_{ij} \mid V_i)\bigr| \\
&\quad + C|\log t|\Bigl(\sum_{j=1}^p \sum_{k=1}^p \Bigl|\delta_{jk} - \sum_{i=1}^n E(X_{ij}U_{ik})\Bigr| + \sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p E\bigl|E(X_{ij}U_{ik}) - E(X_{ij}U_{ik} \mid T_i)\bigr|\Bigr) + Ca\sqrt{t}.
\end{aligned} \tag{12.87}$$

Setting $\sqrt{t} = 2CnA_1 A_2 B$, provided it is less than 1, simple manipulations yield (12.65) after observing that the last term in (12.87) is of lower order than the second term. If $t > 1$ for the choice above, then by enlarging $C$ in (12.65) as necessary, the theorem is trivial.
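The first of the manipulations referred to above can be sketched as follows. This is only an outline, assuming $\kappa < \infty$ (as is the case when the functions in the class are uniformly bounded) and writing $R_t$, a symbol introduced here for illustration, for the sum of all terms on the right hand side of (12.87) other than the first.

```latex
% With \sqrt{t} = 2 C n A_1 A_2 B, the coefficient of \kappa in (12.87) is 1/2:
\frac{C n A_1 A_2 B}{\sqrt{t}}\,\kappa = \frac{\kappa}{2},
\qquad\text{so}\qquad
\kappa \le \frac{\kappa}{2} + R_t
\;\Longrightarrow\;
\kappa \le 2 R_t .
% The same choice turns the a A_2 / \sqrt{t} contribution of (12.85) into a multiple of a A_2:
C n A_1 A_2 B \cdot \frac{a A_2}{\sqrt{t}} = \frac{a A_2}{2}.
```

This accounts for the leading $aA_2$ term in (12.65); the remaining terms carry the factor $|\log t|$ evaluated at this choice of $t$.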
Chapter 13
Non-normal Approximation
Though the principal theme of this book concerns the normal distribution, in this chapter we explore how Stein's method can be applied to approximations by non-normal distributions as well. There are already many well known distributions other than the normal where Stein's method works, the Poisson case being the most notable. Here we focus on approximation by continuous distributions where an analysis parallel to that for the normal, such as the method of exchangeable pairs, may proceed. Denoting the random variable of interest as $W = W_n$, it may be the case that appropriate approximating or limiting distributions of $W_n$ are not known a priori. In this chapter, following Chatterjee and Shao (2010), we first develop a method of exchangeable pairs which identifies an appropriate approximating distribution, and which obtains $L^1$ and Berry–Esseen type bounds for that approximation. As applications, in Sect. 13.3 we obtain error bounds of order $1/\sqrt{n}$ in both the $L^1$ and Kolmogorov distance for the non-central limit theorem for the magnetization in the Curie–Weiss model at the critical temperature. In Sect. 13.4, and also using different methods, we derive bounds for approximations by the exponential distribution, and, following Chatterjee et al. (2008), apply the results to the spectrum of the Bernoulli–Laplace diffusion model, and following Peköz and Röllin (2009), to first passage times of Markov chains.
13.1 Stein's Method via the Density Approach

One way of looking at Stein's characterization for the normal is the following. Since the standard normal density function $\phi(z) = e^{-z^2/2}/\sqrt{2\pi}$ satisfies

$$\frac{\phi'(z)}{\phi(z)} = -z, \tag{13.1}$$

we have $\phi'(z) + z\phi(z) = 0$, and integration by parts now yields a kind of 'dual' equation for the distribution of $Z$ having density $\phi$, that is, the Stein characterization

$$E\bigl[f'(Z) - Zf(Z)\bigr] = 0$$

holding for all functions for which the expectations above exist. We will now see that a number of arguments used for the normal hold more generally for distributions with density $p(y)$ when replacing the ratio $-y$ in (13.1) by $p'(y)/p(y)$.

We will consider approximations by the distribution of $Y$, a random variable with probability density function $p$ satisfying the following condition.

Condition 13.1 For some $-\infty \le a < b \le \infty$, the density function $p$ is strictly positive and absolutely continuous over the interval $(a, b)$, zero on $(a, b)^c$, and possesses a right-hand limit $p(a+)$ at $a$ and a left-hand limit $p(b-)$ at $b$. Furthermore, the derivative $p'$ of $p$ satisfies

$$\int_a^b |p'(y)|\,dy < \infty. \tag{13.2}$$
The key step for applying the Stein method for approximation by the distribution of Y is the development of a Stein identity and the derivation of bounds on solutions to the Stein equation.
13.1.1 The Stein Characterization and Equation

Let $Y$ have density $p$ satisfying Condition 13.1. Then, letting $f$ be an absolutely continuous function satisfying $f(a+) = f(b-) = 0$, whenever the expectations below exist, on the interval $(a, b)$ we have

$$\begin{aligned}
E\bigl[f'(Y) + f(Y)p'(Y)/p(Y)\bigr] &= E\bigl[(fp)'(Y)/p(Y)\bigr] \\
&= \int_a^b \bigl(f(y)p(y)\bigr)'\,dy = f(b-)p(b-) - f(a+)p(a+) = 0,
\end{aligned} \tag{13.3}$$

that is,

$$E\bigl[f'(Y) + f(Y)p'(Y)/p(Y)\bigr] = 0. \tag{13.4}$$

For any measurable function $h$ with $E|h(Y)| < \infty$, let $f = f_h$ be the solution to the Stein equation

$$f'(w) + f(w)p'(w)/p(w) = h(w) - Eh(Y). \tag{13.5}$$

Rewriting (13.5) we have that

$$\bigl(f(w)p(w)\bigr)' = \bigl(h(w) - Eh(Y)\bigr)p(w)$$

and hence the solution for $w \in (a, b)$ is given by

$$f_h(w) = \frac{1}{p(w)}\int_a^w \bigl(h(y) - Eh(Y)\bigr)p(y)\,dy = -\frac{1}{p(w)}\int_w^b \bigl(h(y) - Eh(Y)\bigr)p(y)\,dy. \tag{13.6}$$
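The solution formula (13.6) can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original development: it takes $p$ to be the exponential density $\lambda e^{-\lambda y}$ on $(0, \infty)$, so that $p'/p = -\lambda$, truncates the range at a large finite upper limit, uses a trapezoidal rule, and chooses the test function $h = \sin$; the function name `stein_solution_exponential` is hypothetical.

```python
import numpy as np

def stein_solution_exponential(h, lam=1.0, upper=40.0, num=400001):
    """Evaluate f_h from (13.6) on a grid for the exponential density.

    Assumptions: the range (0, infinity) is truncated at `upper`, and integrals
    are approximated by a cumulative trapezoidal rule.
    """
    y = np.linspace(0.0, upper, num)
    dy = y[1] - y[0]
    p = lam * np.exp(-lam * y)
    hy = h(y)

    def cumtrapz(v):
        # Cumulative trapezoidal integral from the left endpoint.
        out = np.zeros_like(v)
        out[1:] = np.cumsum(0.5 * (v[1:] + v[:-1]) * dy)
        return out

    mean_h = cumtrapz(hy * p)[-1]              # approximates Eh(Y)
    f = cumtrapz((hy - mean_h) * p) / p        # first expression in (13.6)
    return y, f, mean_h

lam = 1.0
y, f, mean_h = stein_solution_exponential(np.sin, lam=lam)
# Residual of the Stein equation (13.5): f' + f p'/p - (h - Eh(Y)), with p'/p = -lam.
resid = np.gradient(f, y) - lam * f - (np.sin(y) - mean_h)
mask = (y > 0.01) & (y < 5.0)                  # avoid the far tail, where 1/p is huge
print(mean_h, np.max(np.abs(resid[mask])))
```

The residual is only examined on a moderate range because dividing by $p(w)$ amplifies quadrature error in the far tail.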
As the Stein equation, and its solution $f$, are valid only over the interval $(a, b)$, we consider, without further mention, approximation of the distributions of random variables $W$ by that of $Y$, having density $p$ on $(a, b)$, only when the support of $W$ is contained in the closure of $(a, b)$.

Example 13.1 (Exponential Distribution) Let $Y \sim \mathcal{E}(\lambda)$, where $\mathcal{E}(\lambda)$ denotes the exponential distribution with parameter $\lambda$, that is, $Y$ is a random variable with density function $p(y) = \lambda e^{-\lambda y}\mathbf{1}(y > 0)$. Then $p'(y)/p(y) = -\lambda$ and identity (13.4) becomes

$$Ef'(Y) - \lambda Ef(Y) = 0, \tag{13.7}$$

for any absolutely continuous $f$ for which the expectation above exists, satisfying $f(0) = \lim_{y \to \infty} f(y) = 0$. Similar to the case of the normal, (13.7) is a characterization of the exponential distribution in that if (13.7) holds for all such functions $f$ then $Y \sim \mathcal{E}(\lambda)$.

The exponential distribution is a special case of the Gamma, which we turn to next.

Example 13.2 (The Gamma and $\chi^2$ distributions) With $\alpha$ and $\beta$ positive numbers, we say $Y$ has the $\Gamma(\alpha, \beta)$ distribution when $Y$ has density

$$p(y) = \frac{y^{\alpha - 1}e^{-y/\beta}}{\beta^\alpha\,\Gamma(\alpha)}\,\mathbf{1}_{\{y > 0\}}.$$

Then $p'(y)/p(y) = \frac{\alpha - 1}{y} - \frac{1}{\beta}$ and identity (13.4) becomes

$$E\Bigl[f'(Y) + \Bigl(\frac{\alpha - 1}{Y} - \frac{1}{\beta}\Bigr)f(Y)\Bigr] = 0.$$

In the special case where $Y$ has the $\chi^2_k$ distribution, that is, the $\Gamma(k/2, 2)$ distribution, the identity specializes to

$$E\Bigl[f'(Y) + \Bigl(\frac{k - 2}{2Y} - \frac{1}{2}\Bigr)f(Y)\Bigr] = 0.$$

Approximation by Gamma and $\chi^2$ distributions has been considered by Luk (1994) and Pickett (2004). Gamma approximation of the distribution of stochastic integrals of Wiener processes is handled in Nourdin and Peccati (2009), the normal version of which is explored in Chap. 14.

Example 13.3 Let $\mathcal{W}(\alpha, \beta)$ denote the density function

$$p(y) = \frac{\alpha e^{-|y|^\alpha/\beta}}{2\beta^{1/\alpha}\,\Gamma(\frac{1}{\alpha})} \quad\text{for } y \in \mathbb{R}, \text{ with } \alpha > 0,\ \beta > 0. \tag{13.8}$$

For any $\alpha > 0$, by the change of variable $u = y^\alpha$ we have

$$\int_{-\infty}^\infty e^{-|y|^\alpha}\,dy = 2\int_0^\infty e^{-y^\alpha}\,dy = \frac{2}{\alpha}\int_0^\infty e^{-u}u^{1/\alpha - 1}\,du = \frac{2}{\alpha}\,\Gamma(1/\alpha),$$

hence, scaling by $\beta^{1/\alpha} > 0$, the family of functions (13.8) are densities on $\mathbb{R}$. Note that the mean zero normal distributions with variance $\sigma^2$ are the special case $\mathcal{W}(2, 2\sigma^2)$. Of special interest will be the distribution $\mathcal{W}(4, 12)$ with density

$$p(y) = c_1 e^{-y^4/12} \quad\text{for } y \in \mathbb{R}, \text{ where } c_1 = \sqrt{2}\big/\bigl(3^{1/4}\,\Gamma(1/4)\bigr). \tag{13.9}$$

For $\mathcal{W}(4, 12)$ the ratio of the derivative to the density is given by

$$\frac{p'(y)}{p(y)} = -\frac{y^3}{3}.$$
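The normalization in (13.8), the constant in (13.9), and the ratio $p'(y)/p(y) = -y^3/3$ can be confirmed numerically. The following is a minimal sketch in which the function name `w_density`, the integration grid, and the trapezoidal rule are assumptions made for illustration.

```python
import math
import numpy as np

def w_density(y, alpha, beta):
    """Density (13.8) of the W(alpha, beta) family, evaluated at y."""
    c = alpha / (2.0 * beta ** (1.0 / alpha) * math.gamma(1.0 / alpha))
    return c * np.exp(-np.abs(y) ** alpha / beta)

# Total mass of W(4, 12) by a trapezoidal rule on [-12, 12]; the tails beyond are negligible.
y = np.linspace(-12.0, 12.0, 240001)
dy = y[1] - y[0]
p = w_density(y, 4.0, 12.0)
mass = np.sum(0.5 * (p[1:] + p[:-1])) * dy

# Constant c_1 from (13.9) against the general formula evaluated at y = 0.
c1 = math.sqrt(2.0) / (3.0 ** 0.25 * math.gamma(0.25))

# W(2, 2 sigma^2) against the mean zero normal density with variance sigma^2.
sigma = 1.7
normal = np.exp(-y**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# Ratio p'/p on a moderate range, via a finite difference of log p.
inner = np.abs(y) < 3.0
score = np.gradient(np.log(p[inner]), y[inner])

print(mass, c1, w_density(0.0, 4.0, 12.0))
print(np.max(np.abs(w_density(y, 2.0, 2 * sigma**2) - normal)))
print(np.max(np.abs(score[1:-1] + y[inner][1:-1] ** 3 / 3.0)))
```

The score is compared away from the endpoints of the inner range because one-sided finite differences there are less accurate.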
13.1.2 Properties of the Stein Solution

As in the normal case, in order to determine error bounds for approximations by the distribution of $Y$, we need to understand the basic properties of the Stein solution. As we consider the approximation of a random variable $W$ whose support is contained in the closure of $(a, b)$, in this chapter for a function $f$ on $\mathbb{R}$ we take $\|f\|$ to be the supremum of $|f(w)|$ over $w \in (a, b)$.

Lemma 13.1 Let $p$ be a density function satisfying Condition 13.1 for some $-\infty \le a < b \le \infty$ and let

$$F(y) = \int_a^y p(x)\,dx$$

be the associated distribution function. Further, let $h$ be a measurable function and $f_h$ the Stein solution given by (13.6).

(i) Suppose there exist $d_1 > 0$ and $d_2 > 0$ such that for all $y \in (a, b)$ we have

$$\min\bigl(1 - F(y), F(y)\bigr) \le d_1 p(y) \tag{13.10}$$

and

$$|p'(y)|\min\bigl(F(y), 1 - F(y)\bigr) \le d_2 p^2(y). \tag{13.11}$$

Then if $h$ is bounded

$$\|f_h\| \le 2d_1\|h\|, \tag{13.12}$$

$$\|f_h\,p'/p\| \le 2d_2\|h\| \tag{13.13}$$

and

$$\|f_h'\| \le (2 + 2d_2)\|h\|. \tag{13.14}$$

(ii) Suppose in addition to (13.10) and (13.11), there exist $d_3 \ge 0$ such that

$$\Bigl|\Bigl(\frac{p'}{p}\Bigr)'(y)\Bigr|\,\min\bigl(E|Y|\mathbf{1}_{\{Y \le y\}} + E|Y|F(y),\ E|Y|\mathbf{1}_{\{Y > y\}} + E|Y|\bigl(1 - F(y)\bigr)\bigr) \le d_3 p(y) \tag{13.15}$$

and $d_4(y)$ such that for all $y \in (a, b)$ we have

$$\min\bigl(E|Y|\mathbf{1}_{\{Y \le y\}} + E|Y|F(y),\ E|Y|\mathbf{1}_{\{Y > y\}} + E|Y|\bigl(1 - F(y)\bigr)\bigr) \le d_4(y)p(y). \tag{13.16}$$

Then if $h$ is absolutely continuous with bounded derivative $h'$,

$$\|f_h''\| \le (1 + d_2)(1 + d_3)\|h'\|, \tag{13.17}$$

$$|f_h(y)| \le d_4(y)\|h'\| \quad\text{for all } y \in (a, b), \tag{13.18}$$

and

$$\|f_h'\| \le (1 + d_3)d_1\|h'\|. \tag{13.19}$$
The proof of the lemma is deferred to the Appendix.
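To see what the hypotheses (13.10) and (13.11) ask for in a concrete case, the sketch below evaluates the two ratios on a grid for the exponential density of Example 13.1. The choice of example, the grid, and the function name `lemma_13_1_ratios` are assumptions made for illustration; for this density the ratios suggest that $d_1 = 1/\lambda$ and $d_2 = 1$ are admissible.

```python
import numpy as np

def lemma_13_1_ratios(lam, y):
    """Ratios appearing in (13.10) and (13.11) for the exponential density.

    Returns min(F, 1-F)/p and |p'| min(F, 1-F)/p^2 evaluated at y.
    """
    surv = np.exp(-lam * y)            # 1 - F(y), computed directly for accuracy
    cdf = -np.expm1(-lam * y)          # F(y) = 1 - exp(-lam y)
    p = lam * surv                     # density
    dp = -lam * p                      # derivative of the density
    m = np.minimum(cdf, surv)
    return m / p, np.abs(dp) * m / p**2

lam = 2.0
y = np.linspace(1e-3, 10.0, 100001)
r1, r2 = lemma_13_1_ratios(lam, y)
print(r1.max(), r2.max())   # candidates for d_1 and d_2 on this grid
```

A grid evaluation of this kind can only suggest constants; that the suprema over all of $(0, \infty)$ equal $1/\lambda$ and $1$ follows here from $\min(F, 1 - F) \le 1 - F = p/\lambda$.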
13.2 $L^1$ and $L^\infty$ Bounds via Exchangeable Pairs

Let $W$ be a random variable of interest and $(W, W')$ an exchangeable pair. Write

$$E(W - W' \mid W) = g(W) + r(W), \tag{13.20}$$

where we consider $g(W)$ to be the dominant term and $r(W)$ some negligible remainder. When $g(W) = \lambda W$, and $\lambda^{-1}E((W' - W)^2 \mid W)$ is nearly constant, the results in Sect. 5.2 show that the distribution of $W$ can be approximated by the normal, subject to some additional conditions. Here we use the function $g(w)$ to determine an appropriate approximating distribution for $W$, or, more particularly, identify its density function $p$. Once $p$ is determined, we can parallel the development of Stein's method of exchangeable pairs for normal approximation.

As a case in point, the proofs in this section depend on the following exchangeable pair identity, analogous to the one applied in the proof of Lemma 2.7 for the normal. That is, when (13.20) holds, for any absolutely continuous function $f$ for which the expectations below exist, recalling $\Delta = W - W'$, by exchangeability we have

$$\begin{aligned}
0 &= E\,(W - W')\bigl(f(W') + f(W)\bigr) \\
&= 2Ef(W)(W - W') + E\,(W - W')\bigl(f(W') - f(W)\bigr) \\
&= 2E\bigl[f(W)E\bigl((W - W') \mid W\bigr)\bigr] - E\,(W - W')\int_{-\Delta}^0 f'(W + t)\,dt \\
&= 2Ef(W)g(W) + 2Ef(W)r(W) - E\int_{-\infty}^\infty f'(W + t)\hat{K}(t)\,dt,
\end{aligned} \tag{13.21}$$

where

$$\hat{K}(t) = E\bigl\{\Delta\bigl(\mathbf{1}\{-\Delta \le t \le 0\} - \mathbf{1}\{0 < t \le -\Delta\}\bigr) \mid W\bigr\}. \tag{13.22}$$

Note that here, similar to (2.39), we have

$$\int_{-\infty}^\infty \hat{K}(t)\,dt = E\bigl(\Delta^2 \mid W\bigr). \tag{13.23}$$
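Identity (13.23) comes from integrating the integrand of (13.22) for each fixed value of $\Delta$ and then taking the conditional expectation. The sketch below checks the pointwise statements numerically: for fixed $\Delta$ the kernel $\Delta(\mathbf{1}\{-\Delta \le t \le 0\} - \mathbf{1}\{0 < t \le -\Delta\})$ is nonnegative, integrates to $\Delta^2$, and has first absolute moment $|\Delta|^3/2$. The function names and the Riemann-sum discretization are assumptions made for illustration.

```python
import numpy as np

def kernel(t, delta):
    """Integrand of (13.22) for a fixed value delta of Delta."""
    left = ((-delta <= t) & (t <= 0)).astype(float)
    right = ((0 < t) & (t <= -delta)).astype(float)
    return delta * (left - right)

def kernel_moments(delta, num=200001):
    """Riemann-sum approximations of the integrals of k(t) and |t| k(t)."""
    half_width = abs(delta) + 1.0
    t = np.linspace(-half_width, half_width, num)
    dt = t[1] - t[0]
    k = kernel(t, delta)
    return np.sum(k) * dt, np.sum(np.abs(t) * k) * dt, k.min()

for delta in (0.7, -1.3):
    total, first_abs, smallest = kernel_moments(delta)
    print(delta, total, delta**2, first_abs, abs(delta) ** 3 / 2, smallest)
```

Nonnegativity of the kernel for both signs of $\Delta$ is what allows absolute values to pass inside the conditional expectation when the first absolute moment is used in bounds.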
For a given function $g(y)$ defined on $(a, b)$ let $Y$ be a random variable with density function $p(y) = 0$ for $y \notin (a, b)$, and for $y \in (a, b)$,

$$p(y) = c_1 e^{-c_0 G(y)} \quad\text{where}\quad G(y) = \begin{cases} \int_0^y g(s)\,ds & \text{if } 0 \in [a, b), \\ \int_a^y g(s)\,ds & \text{if } a > 0, \\ \int_b^y g(s)\,ds & \text{if } b \le 0 \end{cases} \tag{13.24}$$

with $c_0 > 0$ and

$$c_1^{-1} = \int_a^b e^{-c_0 G(y)}\,dy < \infty. \tag{13.25}$$

Note that (13.24) implies

$$p'(y) = -c_0\,g(y)\,p(y) \quad\text{for all } y \in (a, b). \tag{13.26}$$
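The construction (13.24) and (13.25) can be carried out numerically for a given $g$. The sketch below is an illustration under assumptions not made in the text: the interval is truncated to a finite range containing $0$, integrals use the trapezoidal rule, the function name `density_from_g` is hypothetical, and the choice $g(y) = y^3/3$ with $c_0 = 1$ is made only because it should reproduce the density $\mathcal{W}(4, 12)$ of (13.9), for which $p'(y)/p(y) = -y^3/3$.

```python
import math
import numpy as np

def density_from_g(g, c0, a, b, num=200001):
    """Build p = c1 * exp(-c0 * G) on a grid over (a, b), assuming a <= 0 < b.

    G is the integral of g from 0, as in the first case of (13.24).
    """
    y = np.linspace(a, b, num)
    dy = y[1] - y[0]
    gy = g(y)
    G = np.zeros_like(y)
    G[1:] = np.cumsum(0.5 * (gy[1:] + gy[:-1]) * dy)   # integral from a
    G -= np.interp(0.0, y, G)                          # shift so that G(0) = 0
    w = np.exp(-c0 * G)
    c1 = 1.0 / (np.sum(0.5 * (w[1:] + w[:-1])) * dy)   # normalization (13.25)
    return y, c1 * w, c1

g = lambda y: y**3 / 3.0
y, p, c1 = density_from_g(g, c0=1.0, a=-8.0, b=8.0)
c1_closed = math.sqrt(2.0) / (3.0 ** 0.25 * math.gamma(0.25))
print(c1, c1_closed)
# Check (13.26): p' + c0 * g * p should vanish up to discretization error.
print(np.max(np.abs(np.gradient(p, y) + g(y) * p)))
```

Truncating to $[-8, 8]$ is harmless here because $e^{-y^4/12}$ is negligible outside that range; for a heavier tailed choice of $g$ the range would need to be widened.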
Theorem 13.1 shows that for deriving $L^1$ bounds for approximations by distributions with densities $p$ of the form (13.24), it suffices that there exist a function $b_0(y)$, and constants $b_1$ and $b_2$, such that

$$|f(y)| \le b_0(y) \text{ for all } y \in (a, b), \quad\text{and}\quad \|f'\| \le b_1 \text{ and } \|f''\| \le b_2 \tag{13.27}$$

for all solutions $f$ to the Stein equation (13.5) for absolutely continuous functions $h$ with $\|h'\| \le 1$. For some cases, the following two conditions will help verify the hypotheses of Lemma 13.1 for densities of the form (13.24), thus implying bounds of the form (13.27).

Condition 13.2 On the interval $(a, b)$ the function $g(y)$ is non-decreasing and $yg(y) \ge 0$.

Condition 13.3 On the interval $(a, b)$ the function $g$ is absolutely continuous, and there exists $c_2 < \infty$ such that

$$\min\Bigl(\frac{1}{c_1}, \frac{1}{|c_0 g(y)|}\Bigr)\Bigl(|y| + \frac{3}{c_1}\Bigr)\max\bigl(1, c_0|g'(y)|\bigr) \le c_2.$$

Lemma 13.2 Suppose that the density $p$ is given by (13.24) for some $c_0 > 0$, and $g$ satisfying Conditions 13.2 and 13.3, and $E|g(Y)| < \infty$ for $Y$ having density $p$. Then Condition 13.1 and all the bounds in Lemma 13.1 on the solution $f$ and its derivatives hold, with $d_1 = 1/c_1$, $d_2 = 1$, $d_3 = c_2$ and $d_4(y) = c_2$ for all $y \in (a, b)$.

We refer the reader to the Appendix for a proof of Lemma 13.2. Equipped with bounds on the solution, we can now provide the following $L^1$ result.
Theorem 13.1 Let $(W, W')$ be an exchangeable pair satisfying (13.20) and set $\Delta = W - W'$. Let $Y$ have density $p$ of the form (13.24), on an interval $(a, b)$, with $c_0 > 0$, and $g$ in (13.20) satisfying $E|g(Y)| < \infty$. Suppose that the solution $f$ to the Stein equation (13.5), for all absolutely continuous functions $h$ with $\|h'\| \le 1$, satisfies (13.27) for some function $b_0(w)$ and constants $b_1$ and $b_2$. Then
\[
\|\mathcal{L}(W) - \mathcal{L}(Y)\|_1 \le b_1 E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| + \frac{c_0 b_2}{4}E|\Delta|^3 + c_0 E\bigl|r(W) b_0(W) 1\{a < W < b\}\bigr|.
\tag{13.28}
\]
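Note that the first term in the bound on the right hand side of (13.28) will be small when $E(\Delta^2 | W)$ is nearly constant, that is, nearly equal to its expectation $E(\Delta^2)$, and $c_0$ is chosen close to $2/E(\Delta^2)$.

Proof Let $h$ be an absolutely continuous function satisfying $\|h'\| \le 1$ and let $f$ be the solution to the Stein equation (13.5) for $h$. By (13.5) and (13.26) we have
\[
Eh(W) - Eh(Y) = E\bigl(f'(W) + f(W)p'(W)/p(W)\bigr) = E\bigl(f'(W) - c_0 f(W)g(W)\bigr).
\tag{13.29}
\]
As identity (13.21) holds for this $f$, adding and subtracting using (13.23) gives
\[
\begin{aligned}
E\bigl(f'(W) - c_0 f(W)g(W)\bigr)
&= Ef'(W) - \frac{c_0}{2}\Bigl\{E\int_{-\infty}^{\infty} f'(W + t)\hat K(t)\,dt - 2Ef(W)r(W)\Bigr\} \\
&= E\Bigl\{f'(W)\Bigl(1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr)\Bigr\}
 + \frac{c_0}{2}E\int_{-\infty}^{\infty}\bigl(f'(W) - f'(W + t)\bigr)\hat K(t)\,dt + c_0 Ef(W)r(W).
\end{aligned}
\tag{13.30}
\]
Now note that
\[
E\int_{-\infty}^{\infty} \bigl|t\hat K(t)\bigr|\,dt = \frac12 E|\Delta|^3,
\]
and that by (13.6) we have $f(a) = f(b) = 0$, so, by the first inequality in (13.27), $|f(w)| \le b_0(w)1\{a < w < b\}$. Applying the three bounds in (13.27) to the corresponding terms of (13.30), using $|f'(W) - f'(W + t)| \le b_2|t|$ for the second, now yields (13.28).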
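Theorem 13.2 Let $(W, W')$ be an exchangeable pair satisfying (13.20), and let $Y$ have density (13.24) for some $c_0 > 0$, and $g$ in (13.20) satisfying $E|g(Y)| < \infty$ and Conditions 13.2 and 13.3. If $\Delta = W - W'$ satisfies $|W - W'| \le \delta$ for some constant $\delta$ then
\[
\begin{aligned}
\sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Y \le z)\bigr|
\le{}& 3E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| + c_1\max\{1, c_2\}\delta + \frac{2c_0}{c_1}E|r(W)| \\
&+ \delta^3 c_0\Bigl\{\Bigl(2 + \frac{c_2}{2}\Bigr)E|c_0 g(W)| + \frac{c_1 c_2}{2}\Bigr\}.
\end{aligned}
\tag{13.31}
\]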
Proof of Theorem 13.2 Since (13.31) is trivial when $c_1 c_2 \delta > 1$, we assume
\[
c_1 c_2 \delta \le 1.
\tag{13.32}
\]
Let $F$ be the distribution function of $Y$ and for $z \in \mathbb{R}$ let $f = f_z$ be the solution to the Stein equation
\[
f'(w) + f(w)p'(w)/p(w) = 1(w \le z) - F(z)
\]
or, by (13.26), equivalently, to
\[
f'(w) - c_0 f(w)g(w) = 1(w \le z) - F(z).
\tag{13.33}
\]
By Lemma 13.2, the bound (13.12) of Lemma 13.1 holds, so $\|f\| < \infty$. Letting $\hat K(t)$ be given by (13.22), in view of identities (13.21) and (13.23), and that $|W - W'| \le \delta$ and $\hat K(t) \ge 0$, we obtain
\[
\begin{aligned}
2Ef(W)g(W) + 2Ef(W)r(W)
&= E\int_{-\infty}^{\infty} f'(W + t)\hat K(t)\,dt \\
&= E\int_{-\delta}^{\delta}\bigl\{c_0 f(W + t)g(W + t) + 1(W + t \le z) - F(z)\bigr\}\hat K(t)\,dt \\
&\ge E\int_{-\delta}^{\delta} c_0 f(W + t)g(W + t)\hat K(t)\,dt + E1_{\{W \le z - \delta\}}\Delta^2 - F(z)E\Delta^2.
\end{aligned}
\]
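Rewriting, and adding and subtracting using (13.23) again, we have
\[
\begin{aligned}
E1_{\{W \le z - \delta\}}\Delta^2 - F(z)E\Delta^2
&\le 2Ef(W)g(W) + 2Ef(W)r(W) - E\int_{-\delta}^{\delta} c_0 f(W + t)g(W + t)\hat K(t)\,dt \\
&= 2E\Bigl\{f(W)g(W)\Bigl(1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr)\Bigr\} + 2Ef(W)r(W) \\
&\quad + c_0 E\int_{-\delta}^{\delta}\bigl\{f(W)g(W) - f(W + t)g(W + t)\bigr\}\hat K(t)\,dt \\
&:= J_1 + J_2 + J_3.
\end{aligned}
\tag{13.34}
\]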
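Lemma 13.2 and (13.12), (13.13), and (13.14) of Lemma 13.1 yield, along with (13.26), that
\[
\|f\| \le 2/c_1, \qquad \|fg\| \le 2/c_0 \qquad\text{and}\qquad \|f'\| \le 4.
\tag{13.35}
\]
Therefore
\[
|J_1| \le (4/c_0)E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr|
\tag{13.36}
\]
and
\[
|J_2| \le (4/c_1)E|r(W)|.
\tag{13.37}
\]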
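To bound $J_3$, we first show that
\[
\sup_{|t| \le \delta}\bigl|g(w + t) - g(w)\bigr| \le \frac{c_1 c_2 \delta}{2c_0}\bigl(c_1 + c_0|g(w)|\bigr).
\tag{13.38}
\]
From Condition 13.3 it follows that
\[
|g'(w)| \le \frac{c_1 c_2}{3c_0\min(1/c_1, 1/|c_0 g(w)|)}
= \frac{c_1 c_2}{3c_0}\max\bigl(c_1, c_0|g(w)|\bigr)
\le \frac{c_1 c_2}{3c_0}\bigl(c_1 + c_0|g(w)|\bigr).
\tag{13.39}
\]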
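Thus by the mean value theorem
\[
\begin{aligned}
\sup_{|t| \le \delta}\bigl|g(w + t) - g(w)\bigr|
&\le \delta\sup_{|t| \le \delta}|g'(w + t)| \\
&\le \frac{c_1 c_2 \delta}{3c_0}\Bigl(c_1 + c_0\sup_{|t| \le \delta}|g(w + t)|\Bigr) \\
&\le \frac{c_1 c_2 \delta}{3c_0}\Bigl(c_1 + c_0|g(w)| + c_0\sup_{|t| \le \delta}\bigl|g(w + t) - g(w)\bigr|\Bigr) \\
&= \frac{c_1 c_2 \delta}{3c_0}\bigl(c_1 + c_0|g(w)|\bigr) + \frac{c_1 c_2 \delta}{3}\sup_{|t| \le \delta}\bigl|g(w + t) - g(w)\bigr| \\
&\le \frac{c_1 c_2 \delta}{3c_0}\bigl(c_1 + c_0|g(w)|\bigr) + \frac13\sup_{|t| \le \delta}\bigl|g(w + t) - g(w)\bigr|,
\end{aligned}
\]
by (13.32). This proves (13.38). Now, by (13.35) and (13.38), when $|t| \le \delta$,
\[
\begin{aligned}
\bigl|f(w)g(w) - f(w + t)g(w + t)\bigr|
&\le |g(w)|\,\bigl|f(w + t) - f(w)\bigr| + |f(w + t)|\,\bigl|g(w + t) - g(w)\bigr| \\
&\le 4|g(w)||t| + \frac{2}{c_1}\cdot\frac{c_1 c_2 \delta}{2c_0}\bigl(c_1 + c_0|g(w)|\bigr) \\
&\le (4 + c_2)\delta|g(w)| + \delta c_1 c_2/c_0.
\end{aligned}
\]
Therefore
\[
|J_3| \le c_0(4 + c_2)\delta E|g(W)|\Delta^2 + \delta c_1 c_2 E\Delta^2
\le (4 + c_2)\delta^3 E|c_0 g(W)| + c_1 c_2\delta^3.
\tag{13.40}
\]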
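Combining (13.34), (13.36), (13.37) and (13.40) shows that
\[
\begin{aligned}
E1_{\{W \le z - \delta\}}\Delta^2 - F(z)E\Delta^2
\le{}& \frac{4}{c_0}E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| + \frac{4}{c_1}E|r(W)| \\
&+ (4 + c_2)\delta^3 E|c_0 g(W)| + c_1 c_2\delta^3.
\end{aligned}
\tag{13.41}
\]
On the other hand, using $F'(z) = p(z) \le c_1$ by (13.24), we have
\[
\begin{aligned}
E1_{\{W \le z - \delta\}}\Delta^2 - F(z)E\Delta^2
={}& \frac{2}{c_0}E\bigl\{1_{\{W \le z - \delta\}} - F(z - \delta)\bigr\}
 - \frac{2}{c_0}E\Bigl\{\bigl(1_{\{W \le z - \delta\}} - F(z)\bigr)\Bigl(1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr)\Bigr\} \\
&+ \frac{2}{c_0}\bigl(F(z - \delta) - F(z)\bigr) \\
\ge{}& \frac{2}{c_0}\bigl(P(W \le z - \delta) - F(z - \delta)\bigr)
 - \frac{2}{c_0}E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| - \frac{2c_1\delta}{c_0},
\end{aligned}
\tag{13.42}
\]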
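which together with (13.41) yields
\[
\begin{aligned}
P(W \le z - \delta) - F(z - \delta)
\le{}& E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| + c_1\delta \\
&+ \frac{c_0}{2}\Bigl\{\frac{4}{c_0}E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| + \frac{4}{c_1}E|r(W)|
 + (4 + c_2)\delta^3 E|c_0 g(W)| + c_1 c_2\delta^3\Bigr\} \\
={}& 3E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| + c_1\delta + 2c_0 E|r(W)|/c_1 \\
&+ \delta^3 c_0\bigl\{(2 + c_2/2)E|c_0 g(W)| + c_1 c_2/2\bigr\}.
\end{aligned}
\tag{13.43}
\]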
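Similarly, one can demonstrate
\[
\begin{aligned}
F(z + \delta) - P(W \le z + \delta)
\le{}& 3E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| + c_1\delta + 2c_0 E|r(W)|/c_1 \\
&+ \delta^3 c_0\bigl\{(2 + c_2/2)E|c_0 g(W)| + c_1 c_2/2\bigr\}.
\end{aligned}
\tag{13.44}
\]
As $c_1\delta \le c_1\max(1, c_2)\delta$, the proof of (13.31) is complete.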
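To see how the four terms of (13.31) trade off, the right hand side can be evaluated directly. The sketch below is illustrative only: the function and argument names are hypothetical, the value `C2 = 10.0` is a placeholder rather than a constant taken from the text, and the inputs are of the orders that arise for the Curie–Weiss model of Sect. 13.3 (namely $c_0 = n^{3/2}$, $\delta = 2n^{-3/4}$, $E|r(W)| \le 15n^{-2}$).

```python
import math

def theorem_13_2_bound(var_term, abs_r, abs_c0_g, delta, c0, c1, c2):
    """Right hand side of (13.31).

    var_term  -- E|1 - (c0/2) E(Delta^2 | W)|
    abs_r     -- E|r(W)|
    abs_c0_g  -- E|c0 g(W)|
    delta     -- almost sure bound on |W - W'|
    """
    return (3 * var_term
            + c1 * max(1.0, c2) * delta
            + 2 * c0 * abs_r / c1
            + delta ** 3 * c0 * ((2 + c2 / 2) * abs_c0_g + c1 * c2 / 2))

C1 = 1.0 / (2 * 12 ** 0.25 * math.gamma(1.25))  # normalizer of exp(-y^4/12)
C2 = 10.0                                       # placeholder, NOT a value from the text

def curie_weiss_bound(n):
    """Bound (13.31) with inputs of the orders appearing in Sect. 13.3."""
    return theorem_13_2_bound(var_term=7.5 / math.sqrt(n), abs_r=15 / n ** 2,
                              abs_c0_g=5.0, delta=2 * n ** -0.75,
                              c0=n ** 1.5, c1=C1, c2=C2)
```

With these inputs the first and third terms are of exact order $n^{-1/2}$ while the terms involving $\delta$ are of smaller order, so that `curie_weiss_bound(n) * math.sqrt(n)` decreases towards `22.5 + 30 / C1` as `n` grows.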
13.3 The Curie–Weiss Model

The Curie–Weiss model, or the Ising model on the complete graph, was introduced in Sect. 11.2 and is a simple statistical mechanical model of ferromagnetic interaction. We recall that for $n \in \mathbb{N}$ the vector $\sigma = (\sigma_1, \ldots, \sigma_n)$ of 'spins' in $\{-1, 1\}^n$ is assigned probability
\[
p(\sigma) = C_\beta\exp\Bigl(\frac{\beta}{n}\sum_{i < j}\sigma_i\sigma_j\Bigr)
\tag{13.45}
\]
for a given 'inverse temperature' $\beta > 0$, with $C_\beta$ the appropriate normalizing constant. For a detailed mathematical treatment of the Curie–Weiss model in general we refer to the book by Ellis (1985). When $0 < \beta < 1$ the total spin $\sum_{i=1}^n \sigma_i$, properly standardized, converges to a standard normal distribution as $n \to \infty$, see, e.g., Ellis and Newman (1978a, 1978b). For $\beta = 1$ it was proved by Ellis and Newman (1978a, 1978b) that as $n \to \infty$, the law of
\[
W = n^{-3/4}\sum_{i=1}^n \sigma_i
\tag{13.46}
\]
converges to the distribution $\mathcal{W}(4, 12)$ of Example 13.3. For various interesting extensions and refinements of their results, see Ellis et al. (1980), and Papangelou (1989). Here we present the following $L^1$ and Berry–Esseen bounds for the critical $\beta = 1$ non-central limit theorem, obtained via Theorems 13.1 and 13.2, respectively.

Theorem 13.3 Let $W$ be the scaled total spin (13.46) in the Curie–Weiss model, where the vector $\sigma$ of spins has distribution (13.45) at the critical inverse temperature $\beta = 1$, and let $Y$ be a random variable with distribution $\mathcal{W}(4, 12)$ as in (13.9). Then there exists a universal constant $C$ such that for all $n \in \mathbb{N}$,
\[
\|\mathcal{L}(W) - \mathcal{L}(Y)\|_1 \le Cn^{-1/2}
\tag{13.47}
\]
and
\[
\sup_{z \in \mathbb{R}}\bigl|P(W \le z) - P(Y \le z)\bigr| \le Cn^{-1/2}.
\tag{13.48}
\]
The required exchangeable pair is constructed following Example 2.2. Given $\sigma$ having distribution (13.45), construct $\sigma'$ by choosing $I$ uniformly and independently of $\sigma$, and replacing $\sigma_I$ by $\sigma_I'$, where $\sigma_I'$ is generated from the conditional distribution of $\sigma_I$ given $\{\sigma_j, j \ne I\}$. It is easy to see that $(\sigma, \sigma')$ is an exchangeable pair. Let $W' = W - n^{-3/4}(\sigma_I - \sigma_I')$, the scaled total spin of the configuration when $\sigma_I$ is replaced by $\sigma_I'$. Considering the sequence of distributions (13.45) indexed by $n \in \mathbb{N}$, the key step is to show (13.20), or, more specifically for the case at hand, that
\[
E(W - W' | W) = \frac13 n^{-3/2}W^3 + O\bigl(n^{-2}\bigr) \quad\text{as } n \to \infty.
\tag{13.49}
\]
To explain (13.49), roughly, a simple computation shows that at any inverse temperature,
\[
E(W - W' | W) = n^{-3/4}\bigl(M - \tanh(\beta M)\bigr) + O\bigl(n^{-2}\bigr) \quad\text{as } n \to \infty,
\]
where $M = n^{-1/4}W$ is known as the magnetization. Since $M \approx 0$ with high probability when $\beta \le 1$, and a Taylor expansion about zero yields $\tanh x = x - x^3/3 + O(x^5)$, we see that $n^{-3/4}(M - \tanh(\beta M))$ behaves like $n^{-3/4}M(1 - \beta)$ when $\beta < 1$, and like $n^{-3/4}M^3/3$ when $\beta = 1$. This is what distinguishes the high temperature regime $\beta < 1$ from the critical case $\beta = 1$, and how we arrive at (13.49). Comparing (13.49) with (13.20), we find that if on $(-\infty, \infty)$ we take
\[
g(y) = \frac13 n^{-3/2}y^3 \quad\text{and}\quad c_0 = n^{3/2},
\tag{13.50}
\]
then, following (13.24), the density function
\[
p(y) = c_1\exp\Bigl(-c_0\int_0^y g(s)\,ds\Bigr) = c_1 e^{-y^4/12}
\]
results, that is, the distribution $\mathcal{W}(4, 12)$ of the family considered in Example 13.3. We note that though $c_0$ depends on $n$, the constant $c_1$ given in (13.9) does not. We now make (13.49) precise, as well as verify the remainder of the hypotheses required in order to invoke Theorem 13.2.

Lemma 13.3 Let $W$ be the scaled total spin (13.46) in the Curie–Weiss model, where the spins $\sigma$ have distribution given by (13.45) at the critical inverse temperature $\beta = 1$, and let $W'$ be given by (13.46) when a uniformly chosen spin from $\sigma$ has been replaced by one having its conditional distribution given the others. Then for all $n \in \mathbb{N}$,
\[
|W - W'| \le 2n^{-3/4},
\tag{13.51}
\]
\[
E\Bigl|E(W - W' | W) - \frac13 n^{-3/2}W^3\Bigr| \le 15n^{-2},
\tag{13.52}
\]
\[
E\Bigl|1 - \frac{n^{3/2}}{2}E\bigl((W - W')^2 | W\bigr)\Bigr| \le \frac{15}{2}n^{-1/2}
\tag{13.53}
\]
and
\[
E|W|^3 \le 15.
\tag{13.54}
\]
Proof As $W$ and $W'$ differ in at most one coordinate, (13.51) is immediate. Next, let $M = n^{-1}\sum_{i=1}^n \sigma_i = n^{-1/4}W$ be the magnetization, and for each $i$, let
\[
M_i = n^{-1}\sum_{j \ne i}\sigma_j.
\]
It is easy to see that for every $i = 1, 2, \ldots, n$, if one chooses a variable $\sigma_i'$ from the conditional distribution of the $i$th spin given $\sigma_j$, $j \ne i$, independently of $\sigma_i$, then for $\tau \in \{-1, 1\}$,
\[
P\bigl(\sigma_i' = \tau | \sigma_j, j \ne i\bigr) = P\bigl(\sigma_i' = \tau | \sigma\bigr) = \frac{e^{M_i\tau}}{e^{M_i} + e^{-M_i}},
\tag{13.55}
\]
and so
\[
E\bigl(\sigma_i' | \sigma\bigr) = \frac{e^{M_i}}{e^{M_i} + e^{-M_i}} - \frac{e^{-M_i}}{e^{M_i} + e^{-M_i}} = \tanh M_i.
\]
Hence
\[
E(W - W' | \sigma) = n^{-1}\sum_{i=1}^n n^{-3/4}\bigl(\sigma_i - E(\sigma_i' | \sigma)\bigr)
= n^{-3/4}M - n^{-7/4}\sum_{i=1}^n \tanh M_i.
\tag{13.56}
\]
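Now it is easy to verify that the second derivative
\[
\frac{d^2}{dx^2}\tanh x = -2(\tanh x)\bigl(1 - \tanh^2 x\bigr)
\]
has exactly two extrema on the real line, the solutions $x$ to the equation $\tanh^2 x = 1/3$, and is bounded in absolute value by $4/3^{3/2}$. Thus, for all $x, y \in \mathbb{R}$,
\[
\bigl|\tanh x - \tanh y - (x - y)(\cosh y)^{-2}\bigr| \le \frac{2(x - y)^2}{3^{3/2}}.
\]
Applying this inequality with $x = M_i$ and $y = M$, for which $M_i - M = -\sigma_i/n$, and summing over $i = 1, \ldots, n$, it follows that
\[
\Bigl|\sum_{i=1}^n \tanh M_i - n\tanh M + n^{-1}(\cosh M)^{-2}\sum_{i=1}^n \sigma_i\Bigr| \le \frac{2n^{-1}}{3^{3/2}},
\]
and therefore
\[
\Bigl|\sum_{i=1}^n \tanh M_i - n\tanh M\Bigr| \le |M| + \frac{2n^{-1}}{3^{3/2}}.
\]
Applying this inequality and the relation $M = n^{-1/4}W$ in (13.56), we obtain
\[
\bigl|E(W - W' | \sigma) + n^{-3/4}(\tanh M - M)\bigr| \le n^{-2}|W| + \frac{2n^{-11/4}}{3^{3/2}}.
\tag{13.57}
\]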
Now consider the function $f(x) = \tanh x - x + x^3/3$. Since $f'(x) = (\cosh x)^{-2} - 1 + x^2 \ge 0$ for all $x$, the function $f$ is increasing, and as $f(0) = 0$ we obtain $f(x) \ge 0$ for all $x \ge 0$. Now, it can be easily verified that the first four derivatives of $f$ vanish at zero, and for all $x \ge 0$,
\[
\frac{d^5 f}{dx^5} = \frac{16}{\cosh^2 x} - 120\frac{\sinh^2 x}{\cosh^4 x} + 120\frac{\sinh^4 x}{\cosh^6 x} \le \frac{16}{\cosh^2 x} \le 16.
\]
Thus, for all $x \ge 0$,
\[
0 \le f(x) \le \frac{16}{5!}x^5 = \frac{2}{15}x^5.
\]
Since $f$ is an odd function, we obtain
\[
\Bigl|\tanh x - x + \frac13 x^3\Bigr| \le \frac{2|x|^5}{15} \quad\text{for all } x \in \mathbb{R}.
\]
Using this inequality in (13.57), we arrive at
\[
\Bigl|E(W - W' | \sigma) - \frac13 n^{-3/4}M^3\Bigr| \le \frac{2n^{-3/4}|M|^5}{15} + n^{-2}|W| + \frac{2n^{-11/4}}{3^{3/2}},
\]
or, by the relation $M = n^{-1/4}W$, equivalently,
\[
\Bigl|E(W - W' | \sigma) - \frac13 n^{-3/2}W^3\Bigr| \le \frac{2n^{-2}|W|^5}{15} + n^{-2}|W| + \frac{2n^{-11/4}}{3^{3/2}}.
\tag{13.58}
\]
The latter inequality implies, in particular, that
\[
\Bigl|E(W - W')W^3 - \frac13 n^{-3/2}EW^6\Bigr| \le \frac{2n^{-2}E(W^8)}{15} + n^{-2}EW^4 + \frac{2n^{-11/4}E|W|^3}{3^{3/2}}.
\tag{13.59}
\]
Thus,
\[
EW^6 \le 3n^{3/2}\bigl|E(W - W')W^3\bigr| + \frac{2n^{-1/2}E(W^8)}{5} + 3n^{-1/2}EW^4 + \frac{2n^{-5/4}E|W|^3}{3^{1/2}}.
\tag{13.60}
\]
Regarding the first term on the right hand side of (13.60), note that by the exchangeability of $(W, W')$,
\[
E(W - W')W^3 = \frac12 E(W - W')\bigl(W^3 - W'^3\bigr) = \frac12 E(W - W')^2\bigl(W^2 + WW' + W'^2\bigr).
\]
Now, by (13.51) and the Cauchy–Schwarz inequality,
\[
\bigl|E(W - W')W^3\bigr| \le 6n^{-3/2}EW^2.
\tag{13.61}
\]
For the remaining terms in (13.60), using the crude bound $|W| \le n^{1/4}$, we obtain
\[
\frac{2n^{-1/2}E(W^8)}{5} + 3n^{-1/2}EW^4 + \frac{2n^{-5/4}E|W|^3}{3^{1/2}}
\le \frac{2E(W^6)}{5} + 3EW^2 + \frac{2n^{-1}E(W^2)}{3^{1/2}}.
\tag{13.62}
\]
Combining (13.60), (13.61), and (13.62), we obtain
\[
EW^6 \le \Bigl(21 + \frac{2n^{-1}}{3^{1/2}}\Bigr)EW^2 + \frac{2E(W^6)}{5},
\]
and therefore, for all $n \in \mathbb{N}$,
\[
EW^6 \le \frac53\Bigl(21 + \frac{2n^{-1}}{3^{1/2}}\Bigr)EW^2 \le 36.925\,EW^2.
\]
Since $E(W^2) \le [E(W^6)]^{1/3}$, this gives
\[
EW^6 \le (36.925)^{3/2} \le 224.4 < (15)^2,
\tag{13.63}
\]
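and hence (13.54) holds. Applying the bound (13.63) in (13.58) yields, for all $n \in \mathbb{N}$,
\[
E\Bigl|E(W - W' | W) - \frac13 n^{-3/2}W^3\Bigr|
\le n^{-2}\Bigl(\frac{2(224.4)^{5/6}}{15} + (224.4)^{1/6}\Bigr) + \frac{2n^{-11/4}}{3^{3/2}} \le 15n^{-2},
\tag{13.64}
\]
completing the proof of (13.52). Lastly, to prove (13.53), by (13.55) we have
\[
\begin{aligned}
E\bigl((W - W')^2 | \sigma\bigr)
&= n^{-3/2}\,\frac1n\sum_{i=1}^n \frac{4e^{-\sigma_i M_i}}{e^{\sigma_i M_i} + e^{-\sigma_i M_i}} \\
&= 2n^{-5/2}\sum_{i=1}^n\bigl(1 - \tanh(\sigma_i M_i)\bigr) \\
&= 2n^{-3/2} - 2n^{-5/2}\sum_{i=1}^n \sigma_i\tanh M_i.
\end{aligned}
\]
Using $|\tanh M_i - \tanh M| \le |M_i - M| \le n^{-1}$, we obtain
\[
\bigl|E\bigl((W - W')^2 | \sigma\bigr) - 2n^{-3/2}\bigr| \le 2n^{-5/2} + 2n^{-3/2}M\tanh M \le 2n^{-5/2} + 2n^{-3/2}M^2 = 2n^{-5/2} + 2n^{-2}W^2.
\]
Using (13.63), we obtain, for all $n \in \mathbb{N}$, that
\[
E\bigl|E\bigl((W - W')^2 | W\bigr) - 2n^{-3/2}\bigr| \le 2n^{-5/2} + 2n^{-2}(224.4)^{1/3} \le 15n^{-2}.
\]
Now multiplying by $n^{3/2}/2$ completes the proof of (13.53), and the lemma.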
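The three expectation bounds of Lemma 13.3 can be checked exactly for moderate $n$, since at $\beta = 1$ the law of the spin sum $S = \sum_i \sigma_i$ is proportional to $\binom{n}{k}e^{S^2/(2n)}$ with $k$ the number of $+1$ spins, and both conditional moments in the proof are functions of $S$ alone. The following is a sketch under those observations; the function names are hypothetical and floating point arithmetic is used throughout.

```python
import math

def spin_sum_law(n):
    """Exact law of S = sigma_1 + ... + sigma_n under (13.45) with beta = 1."""
    log_w = {}
    for k in range(n + 1):                      # k = number of +1 spins
        s = 2 * k - n
        log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_w[s] = log_comb + s * s / (2 * n)   # sum_{i<j} sigma_i sigma_j = (s^2 - n)/2
    top = max(log_w.values())
    weights = {s: math.exp(v - top) for s, v in log_w.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def lemma_13_3_quantities(n):
    """Left hand sides of (13.54), (13.52) and (13.53), in that order."""
    third_moment = drift_error = variance_error = 0.0
    for s, prob in spin_sum_law(n).items():
        w = s / n ** 0.75
        n_plus, n_minus = (n + s) / 2, (n - s) / 2
        t_plus = math.tanh((s - 1) / n)         # tanh(M_i) when sigma_i = +1
        t_minus = math.tanh((s + 1) / n)        # tanh(M_i) when sigma_i = -1
        # (13.56): E(W - W' | sigma)
        drift = n ** -0.75 * (s / n) - n ** -1.75 * (n_plus * t_plus + n_minus * t_minus)
        # E((W - W')^2 | sigma) = 2 n^{-3/2} - 2 n^{-5/2} sum_i sigma_i tanh(M_i)
        second = 2 * n ** -1.5 - 2 * n ** -2.5 * (n_plus * t_plus - n_minus * t_minus)
        third_moment += prob * abs(w) ** 3
        drift_error += prob * abs(drift - w ** 3 / (3 * n ** 1.5))
        variance_error += prob * abs(1 - n ** 1.5 / 2 * second)
    return third_moment, drift_error, variance_error
```

Comparing the three returned values with $15$, $15n^{-2}$ and $(15/2)n^{-1/2}$ gives a direct numerical confirmation of (13.54), (13.52) and (13.53) for the chosen $n$.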
Proof of Theorem 13.3 We apply Theorems 13.1 and 13.2 with the coupling given in Lemma 13.3. First, inequality (13.52) of Lemma 13.3 shows that the exchangeable pair $(W, W')$ satisfies (13.20) with
\[
g(y) = \frac13 n^{-3/2}y^3 \quad\text{and}\quad E|r(W)| \le 15n^{-2}.
\tag{13.65}
\]
Recall that the density $p(y)$ of $Y$ on $(-\infty, \infty)$ is given by (13.9), or, equivalently, by (13.24) with $g(y)$ and $c_0 = n^{3/2}$ as in (13.50), and that $c_1$ is a constant not depending on $n$. It is clear that $Y$ has moments of all orders, so in particular $E|g(Y)| < \infty$, and that $g(y)$ satisfies Condition 13.2. As the quantity
\[
\min\bigl(1/c_1, 3/|y|^3\bigr)\bigl(|y| + 3/c_1\bigr)\bigl(1 + y^2\bigr)
\]
is bounded near zero and has finite limits at plus and minus infinity, Condition 13.3 is satisfied with a constant $c_2$ not depending on $n$.

Regarding the $L^1$ bound (13.47), we have that the hypotheses of Theorem 13.1 are satisfied, and need only verify that all terms in the bound (13.28) of that theorem are of order $O(n^{-1/2})$. Inequality (13.53) shows that the first term in the bound is no more than $b_1(15/2)n^{-1/2}$. Inequality (13.51) gives $E|\Delta|^3 \le 8n^{-9/4}$, so that, as $c_0 = n^{3/2}$, the second term is $O(n^{-3/4})$. Finally, as Lemma 13.2 provides a bound on the solution $f$ by the constant $d_4 = c_2$, (13.65) shows the last term to be of order $O(n^{3/2}n^{-2}) = O(n^{-1/2})$.

Regarding the supremum norm bound (13.48), Lemma 13.3 shows the coupling of $W$ and $W'$ is bounded, and therefore the hypotheses of Theorem 13.2 are satisfied; similarly, we need only verify that all terms in the bound (13.31) are $O(n^{-1/2})$. We have already shown the first term in the bound is of this order. Inequality (13.51) allows us to choose $\delta = 2n^{-3/4}$, showing the second term in the bound is of order $o(n^{-1/2})$. By (13.65) the third term is $O(n^{3/2}n^{-2}) = O(n^{-1/2})$. For the coefficient of the last term, we find $\delta^3 c_0 = O(n^{-9/4}n^{3/2}) = O(n^{-3/4})$. As $c_1$ and $c_2$ do not depend on $n$, and $E|c_0 g(W)| = E|W|^3/3 \le 5$ by (13.54), the final term is $o(n^{-1/2})$. As all terms in the bound are $O(n^{-1/2})$ the claim is shown.
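The rate in (13.48) can be explored numerically, because both distribution functions are computable: the law of $W$ is supported on the lattice $\{n^{-3/4}(2k - n)\}$, and the limit has density $c_1e^{-y^4/12}$ with $c_1 = 1/(2 \cdot 12^{1/4}\Gamma(5/4))$. The sketch below computes the exact Kolmogorov distance for a given $n$; it is an illustration only, the function names are hypothetical, and the limiting distribution function is approximated by Simpson's rule.

```python
import math

C1 = 1.0 / (2 * 12 ** 0.25 * math.gamma(1.25))   # normalizer of exp(-y^4/12)

def limit_cdf(z, steps=400):
    """Distribution function of Y with density C1*exp(-y^4/12) (Simpson's rule)."""
    hi = min(abs(z), 12.0)                       # mass beyond 12 is negligible
    h = hi / steps
    total = 1.0 + math.exp(-hi ** 4 / 12)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * math.exp(-(k * h) ** 4 / 12)
    return 0.5 + math.copysign(C1 * total * h / 3, z)

def scaled_spin_law(n):
    """Atoms (w, P(W = w)) of W in (13.46) at beta = 1, in increasing order of w."""
    log_w = []
    for k in range(n + 1):                       # k = number of +1 spins
        s = 2 * k - n
        log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_w.append((s / n ** 0.75, log_comb + s * s / (2 * n)))
    top = max(v for _, v in log_w)
    weights = [(w, math.exp(v - top)) for w, v in log_w]
    total = sum(x for _, x in weights)
    return [(w, x / total) for w, x in weights]

def kolmogorov_distance(n):
    """sup_z |P(W <= z) - P(Y <= z)|; the supremum is attained at an atom of W."""
    dist, cum = 0.0, 0.0
    for w, prob in scaled_spin_law(n):
        target = limit_cdf(w)
        dist = max(dist, abs(cum - target), abs(cum + prob - target))
        cum += prob
    return dist
```

Evaluating `kolmogorov_distance(n) * math.sqrt(n)` over a range of `n` gives an empirical impression of the constant $C$ in (13.48), though of course not a proof of any particular value.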
13.4 Exponential Approximation

In this section we focus on approximation by the exponential distribution $\mathcal{E}(\lambda)$ for $\lambda > 0$, that is, the distribution with density $p(x) = \lambda e^{-\lambda x}1\{x > 0\}$ as in Example 13.1. We consider two examples, the spectrum of the Bernoulli–Laplace diffusion model, and first passage times of Markov chains. The first example is handled using exchangeable pairs, and the second by introducing the equilibrium distribution.

13.4.1 Spectrum of the Bernoulli–Laplace Markov Chain

Since for the exponential distribution the ratio $p'(w)/p(w)$ is constant, following (13.20) we hope to construct an exchangeable pair $(W, W')$ satisfying
\[
E(W - W' | W) = 1/c_0 + r(W)
\tag{13.66}
\]
for some positive constant $c_0$ and small remainder term. Taking $(a, b) = (0, \infty)$, $g(y) = 1/c_0$ and $G(y) = y/c_0$ in (13.24) yields $p(y) = e^{-y}1\{y > 0\}$, and so $Y \sim \mathcal{E}(1)$, the unit exponential distribution; clearly $c_1 = 1$ and $E|g(Y)| < \infty$. We now obtain bounds on the solution to the Stein equation for the unit exponential.
Lemma 13.4 If $f$ is the solution to the Stein equation (13.5) with $p(y) = e^{-y}1(y > 0)$, the unit exponential density, and $h$ any absolutely continuous function with $\|h'\| \le 1$, then the bounds (13.27) hold with $b_0(y) = 3.5y$ for $y > 0$, $b_1 = 1$ and $b_2 = 2$.

Proof We verify the hypotheses of Lemma 13.1 are satisfied for $p(y) = e^{-y}$ and $F(y) = 1 - e^{-y}$ for $y > 0$. Clearly (13.10) and (13.11) are satisfied with $d_1 = 1$ and $d_2 = 1$, respectively. As $(p'/p)' = 0$, (13.15) is satisfied with $d_3 = 0$. Regarding (13.16), we have
\[
EY1(Y \le y) + EY\,P(Y \le y) = 2\bigl(1 - e^{-y}\bigr) - ye^{-y} \le 2\bigl(1 - e^{-y}\bigr),
\]
and similarly,
\[
EY1(Y > y) + EY\,P(Y > y) = (2 + y)e^{-y}.
\]
Hence, (13.16) is satisfied with $d_4(y) = 3.5y$ as
\[
e^y\min\bigl\{2\bigl(1 - e^{-y}\bigr), (2 + y)e^{-y}\bigr\} = \min\bigl\{2\bigl(e^y - 1\bigr), 2 + y\bigr\} \le 3.5y,
\]
where the final inequality is shown by considering the cases $0 < y < 4/5$ and $y \ge 4/5$. Invoking Lemma 13.1 with $d_1 = 1$, $d_2 = 1$, $d_3 = 0$ and $d_4(y) = 3.5y$ now completes the proof of the claim.

Theorem 13.1 now immediately yields the following result for approximation by the exponential distribution.

Theorem 13.4 Let $(W, W')$ be an exchangeable pair satisfying (13.66) for some $c_0 > 0$, let $\Delta = W - W'$, and $Y \sim \mathcal{E}(1)$. Then
\[
\|\mathcal{L}(W) - \mathcal{L}(Y)\|_1 \le E\Bigl|1 - \frac{c_0}{2}E\bigl(\Delta^2 | W\bigr)\Bigr| + \frac{c_0}{2}E|\Delta|^3 + 3.5c_0 E\bigl|W r(W)1\{|W| > 0\}\bigr|.
\]

We apply the result above to the Bernoulli–Laplace Markov chain, a simple model of diffusion, following the work of Chatterjee et al. (2008). Two urns contain $n$ balls each. Initially the balls in each urn are all of a single color, with urn 1 containing all white balls, and urn 2 all black. At each stage two balls are picked at random, one from each urn, and interchanged. Let the state of the chain be the number of white balls in urn 1. Diaconis and Shahshahani (1987) proved that $(n/4)\log(2n) + cn$ steps suffice for this process to reach equilibrium, in the sense that the total variation distance to stationarity is at most $ae^{-dc}$, for some positive universal constants $a$ and $d$.

To prove this result they used the fact that the spectrum of the chain consists of the numbers
\[
\lambda_i = 1 - \frac{i(2n - i + 1)}{n^2} \quad\text{for } i = 0, \ldots, n
\tag{13.67}
\]
occurring with multiplicities
\[
m_i = \binom{2n}{i} - \binom{2n}{i - 1} \quad\text{for } i = 0, 1, \ldots, n,
\]
where we adopt the convention that $\binom{n}{k} = 0$ when $k < 0$, so that the multiplicity $m_0$ of the eigenvalue $\lambda_0 = 1$ is 1. The Bernoulli–Laplace Markov chain is equivalent to a function of a certain random walk on the Johnson graph $J(2n, n)$. The vertices of the Johnson graph $J(2n, n)$ are all size $n$ subsets of $\{1, 2, \ldots, 2n\}$, and two subsets are connected by an edge if they differ by exactly one element. From a given vertex, the random walk moves to a neighbor chosen uniformly at random. Numbering the balls of the Bernoulli–Laplace model 1 through $2n$, with the white balls corresponding to the odd numbers $1, 3, \ldots, 2n - 1$ and the black balls to the even values $2, 4, \ldots, 2n$, the state of the random walk on the Johnson graph is simply the labels of the balls in urn 1. We apply Stein's method to study an approximation to the distribution of a randomly chosen eigenvalue, that is, to the values $\lambda_i$ given in (13.67), chosen in proportion to their multiplicities $m_i$, $i = 0, \ldots, n$. As the sum $\sum_{i=0}^n m_i$ telescopes, this means we choose $\lambda_i$ with probability
\[
\pi_i = \frac{\binom{2n}{i} - \binom{2n}{i - 1}}{\binom{2n}{n}}, \quad i = 0, 1, \ldots, n.
\tag{13.68}
\]
Letting $I$ have distribution $P(I = i) = \pi_i$, translating and scaling the distribution which chooses $\lambda_i$ with probability $\pi_i$ to be comparable to the unit exponential, we are led to study the random variable
\[
W = \mu_I \quad\text{where}\quad \mu_i = \frac{(n - i)(n + 1 - i)}{n}.
\tag{13.69}
\]
We construct an exchangeable pair $(W, W')$ using a reversible Markov chain on $\{0, 1, \ldots, n\}$, as outlined at the end of Sect. 2.3.2. For the chain to be reversible with respect to $\pi$, its transition kernel $K$ must satisfy the detailed balance equation
\[
\pi_i K(i, j) = \pi_j K(j, i) \quad\text{for all } i, j \in \{0, \ldots, n\}.
\tag{13.70}
\]
Given such a $K$, one obtains the pair $(W, W')$ by letting $W = \mu_I$ where $I$ is chosen from the equilibrium distribution $\pi$, and $W' = \mu_J$ where $J$ is determined by taking one step from state $I$ according to the transition mechanism $K$. One can verify that the following transition kernel $K$ satisfies (13.70). For the upward transitions,
\[
K(i, i + 1) = \frac{2n - i + 1}{4n(n - i)(2n - 2i + 1)} \quad\text{for } i = 0, \ldots, n - 1,
\]
for the downward transitions,
\[
K(i, i - 1) = \frac{i}{4n(2n - 2i + 1)(n - i + 1)} \quad\text{for } i = 0, \ldots, n,
\]
for the probability of returning to the same state, $K(i, i) = 1 - K(i, i + 1) - K(i, i - 1)$, and $K(i, j) = 0$ otherwise. The following lemma summarizes some of the properties of the exchangeable pair so constructed.

Lemma 13.5 Let $W = \mu_I$ as in (13.69) with $P(I = i) = \pi_i$, specified in (13.68), and let $W' = \mu_J$ where $J$ is obtained from $I$ by taking one step from $I$ according to the transition kernel $K$. Then $(W, W')$ is exchangeable, and, with $\Delta = W - W'$,
\[
E(\Delta | W) = \frac{1}{2n^2} - \frac{n + 1}{2n^2}1_{\{W = 0\}},
\tag{13.71}
\]
\[
E(W) = 1, \qquad E\bigl(\Delta^2 | W\bigr) = \frac{1}{n^2}
\tag{13.72}
\]
and
\[
E|\Delta|^3 \le 4n^{-5/2}.
\tag{13.73}
\]
Proof Since $K$ is reversible with respect to $\pi$, the pair $(I, J)$, and therefore the pair $(\mu_I, \mu_J) = (W, W')$, is exchangeable. The mapping $i \to \mu_i$ given by (13.69) is strictly decreasing, and is therefore invertible, for $i \in \{0, 1, \ldots, n\}$. Hence, conditioning on $W$ is equivalent to conditioning on $I$. For $i \in \{0, 1, \ldots, n - 1\}$ we have
\[
\begin{aligned}
E(\Delta | I = i) &= K(i, i + 1)(\mu_i - \mu_{i+1}) + K(i, i - 1)(\mu_i - \mu_{i-1}) \\
&= \frac{2n - i + 1}{4n(n - i)(2n - 2i + 1)}\cdot\frac{2n - 2i}{n} + \frac{i}{4n(2n - 2i + 1)(n - i + 1)}\cdot\frac{2i - 2n - 2}{n} \\
&= \frac{1}{2n^2},
\end{aligned}
\]
and for $i = n$,
\[
E(\Delta | I = n) = K(n, n - 1)(\mu_n - \mu_{n-1}) = -\frac{1}{2n}.
\]
As $\{I = n\} = \{W = 0\}$, the claim (13.71) is shown. To show that $E(W) = 1$, argue as in the proof of (13.71) to compute that
\[
E\bigl(\Delta^3 | W\bigr) = \frac{2}{n^3}(W - 1).
\tag{13.74}
\]
Since $W$ and $W'$ are exchangeable, $E(\Delta^3) = 0$, and taking expectation in (13.74) yields $E(W) = 1$. Similarly one checks that $E(\Delta^2 | W) = 1/n^2$, proving (13.72). Lastly, to show (13.73), since $\mu_i$ is decreasing in $i$, for $i \in \{0, 1, \ldots, n - 1\}$ we have
\[
\begin{aligned}
E\bigl(|\Delta|^3 | I = i\bigr) &= K(i, i + 1)(\mu_i - \mu_{i+1})^3 + K(i, i - 1)(\mu_{i-1} - \mu_i)^3 \\
&= \frac{2n - i + 1}{4n(n - i)(2n - 2i + 1)}\cdot\frac{(2n - 2i)^3}{n^3} + \frac{i}{4n(2n - 2i + 1)(n - i + 1)}\cdot\frac{(2n + 2 - 2i)^3}{n^3} \\
&= \frac{2\bigl((2n - i + 1)(n - i)^2 + i(n - i + 1)^2\bigr)}{n^4(2n - 2i + 1)} \\
&\le \frac{2\bigl((2n - i + 1)(n - i)(n - i + 1) + i(n - i + 1)^2\bigr)}{n^4(2n - 2i + 1)} \\
&= \frac{2(n - i + 1)}{n^3},
\end{aligned}
\]
and when $i = n$,
\[
E\bigl(|\Delta|^3 | I = n\bigr) = K(n, n - 1)(\mu_{n-1} - \mu_n)^3 = \frac{2}{n^3} = \frac{2(n - i + 1)}{n^3}.
\]
Thus,
\[
E\bigl(|\Delta|^3 | W\bigr) \le \frac{2(n - I + 1)}{n^3} \le \frac{2\sqrt{n(W + 2)}}{n^3} \le 2n^{-5/2}\sqrt{W + 2},
\]
and, by Jensen's inequality,
\[
E|\Delta|^3 \le 2n^{-5/2}\sqrt{E(W + 2)} = 2\sqrt{3}\,n^{-5/2} \le 4n^{-5/2}.
\]
This proves (13.73).
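The identities of Lemma 13.5 are rational, so they can be confirmed exactly for small $n$ with rational arithmetic. The sketch below is one way to do so, with hypothetical function names; it rebuilds $\pi$, $\mu$ and $K$ from (13.68), (13.69) and the displayed transition probabilities, and returns the conditional moments of $\Delta = W - W'$ given $I = i$.

```python
from fractions import Fraction
from math import comb

def bernoulli_laplace_pair(n):
    """Return (pi, mu, up, down, rows) for the exchangeable pair of Lemma 13.5.

    rows[i] = (E(Delta | I=i), E(Delta^2 | I=i), E(|Delta|^3 | I=i)).
    """
    total = comb(2 * n, n)
    pi = [Fraction(comb(2 * n, i) - (comb(2 * n, i - 1) if i > 0 else 0), total)
          for i in range(n + 1)]                                   # (13.68)
    mu = [Fraction((n - i) * (n + 1 - i), n) for i in range(n + 1)]  # (13.69)

    def up(i):      # K(i, i+1), taken to be zero at i = n
        if i == n:
            return Fraction(0)
        return Fraction(2 * n - i + 1, 4 * n * (n - i) * (2 * n - 2 * i + 1))

    def down(i):    # K(i, i-1), which vanishes at i = 0
        return Fraction(i, 4 * n * (2 * n - 2 * i + 1) * (n - i + 1))

    rows = []
    for i in range(n + 1):
        steps = []                          # pairs (probability, Delta)
        if i < n:
            steps.append((up(i), mu[i] - mu[i + 1]))
        if i > 0:
            steps.append((down(i), mu[i] - mu[i - 1]))
        rows.append((sum(q * d for q, d in steps),
                     sum(q * d ** 2 for q, d in steps),
                     sum(q * abs(d) ** 3 for q, d in steps)))
    return pi, mu, up, down, rows
```

Because `Fraction` arithmetic is exact, equalities such as $E(\Delta^2 | I = i) = 1/n^2$ can be tested with `==` rather than with a tolerance.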
Applying Theorem 13.4 with $c_0 = 2n^2$ and $r(w) = -\bigl((n + 1)/(2n^2)\bigr)1(w = 0)$, as dictated by (13.71), now yields the following result.

Theorem 13.5 Let $W$ be a scaled, translated randomly chosen eigenvalue of the Bernoulli–Laplace diffusion model given by (13.69). Then, with $Y$ having the unit exponential distribution,
\[
\|\mathcal{L}(W) - \mathcal{L}(Y)\|_1 \le 4n^{-1/2}.
\tag{13.75}
\]
Indeed, the first term of the bound in Theorem 13.4 vanishes by (13.72), the last vanishes as $W r(W) = 0$, and by (13.73) the middle term is at most $n^2 \cdot 4n^{-5/2}$.

As the difference between $W$ and $W'$ is large when $I$ is close to zero, Theorem 13.2 does not provide a useful bound for the Kolmogorov distance between $W$ and $Y$. However, using a completely different approach and some heavy machinery, Chatterjee et al. (2008) are able to show that
\[
\sup_{z \in \mathbb{R}}\bigl|P(W \le z) - P(Y \le z)\bigr| \le Cn^{-1/2}
\tag{13.76}
\]
where $C$ is a universal constant.
13.4.2 First Passage Times

We now consider an approach to exponential approximation different from that of the previous sections, following Peköz and Röllin (2009). In the nomenclature of renewal theory, for a non-negative random variable $X$ with finite mean, $X^e$ is said to have the equilibrium distribution with respect to $X$ if for all Lipschitz functions $f$,
\[
Ef(X) - f(0) = EX\,Ef'\bigl(X^e\bigr).
\tag{13.77}
\]
Clearly (13.77) holds for all Lipschitz functions if and only if it holds for all Lipschitz functions $f$ with $f(0) = 0$. The equilibrium distribution has a close connection with both the size biased and zero biased distributions. For the first connection, let $X^s$ have the $X$-size bias distribution of $X$, that is,
\[
EXf(X) = EX\,Ef\bigl(X^s\bigr)
\]
for all functions $f$ for which these expectations exist. Then, if $U$ has the uniform distribution on $[0, 1]$ independent of $X$ and $X^s$, the variable $UX^s$ has the equilibrium distribution $X^e$ with respect to $X$. Indeed, if $f$ is any Lipschitz function then
\[
Ef(X) - f(0) = EXf'(UX) = EX\,Ef'\bigl(UX^s\bigr) = EX\,Ef'\bigl(X^e\bigr).
\]
For the second connection, recall how a random variable with the zero bias distribution can likewise be formed by multiplying a square bias variable by an independent uniform. Note also that for any $a > 0$, parallel to (2.59), we have
\[
(aX)^e =_d U(aX)^s =_d aUX^s =_d aX^e.
\tag{13.78}
\]
Additionally, a simple calculation shows that in general $X^e$ is absolutely continuous with distribution function $\int_0^x P(X > t)\,dt/EX$, that is, with density function $P(X > x)/EX$. As identity (13.7) characterizes the exponential distribution we see that if $X =_d X^e$ then $X \sim \mathcal{E}(\lambda)$, where $\lambda = 1/EX$. Hence the equilibrium distribution of $X$ operates for the exponential distributions in the same way that the zero bias distributions do for the mean zero normals. By analogy then, if the distributions of $X$ and $X^e$ are close then $X$ should be approximately exponential. This intuition is made precise by the following result of Peköz and Röllin (2009).

Theorem 13.6 Let $W$ be a non-negative random variable with $EW = 1$ and let $W^e$ have the equilibrium distribution with respect to $W$. Then, for $Y \sim \mathcal{E}(1)$ and any $\beta > 0$,
\[
\sup_{z \in \mathbb{R}}\bigl|P(W \le z) - P(Y \le z)\bigr| \le 12\beta + 2P\bigl(|W^e - W| > \beta\bigr)
\tag{13.79}
\]
and
\[
\sup_{z \in \mathbb{R}}\bigl|P\bigl(W^e \le z\bigr) - P(Y \le z)\bigr| \le \beta + P\bigl(|W^e - W| > \beta\bigr).
\tag{13.80}
\]
If in addition $W$ has a finite second moment, then
\[
\|\mathcal{L}(W) - \mathcal{E}(1)\|_1 \le 2E\bigl|W^e - W\bigr|
\tag{13.81}
\]
and
\[
\sup_{z \in \mathbb{R}}\bigl|P\bigl(W^e \le z\bigr) - P(Y \le z)\bigr| \le E\bigl|W^e - W\bigr|.
\tag{13.82}
\]
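As a sanity check on the equilibrium transformation underlying Theorem 13.6, consider $X \sim \mathcal{E}(\lambda)$ with density $\lambda e^{-\lambda x}$, so that $P(X > x) = e^{-\lambda x}$ and $EX = 1/\lambda$. The density of $X^e$ is then
\[
\frac{P(X > x)}{EX} = \frac{e^{-\lambda x}}{1/\lambda} = \lambda e^{-\lambda x}, \quad x > 0,
\]
so $X^e =_d X$ and the exponential is a fixed point of the transformation. The defining identity (13.77) can be verified on the test function $f(x) = x^2$, chosen here only for illustration:
\[
Ef(X) - f(0) = EX^2 = \frac{2}{\lambda^2},
\qquad
EX\,Ef'\bigl(X^e\bigr) = \frac{1}{\lambda}\cdot 2EX^e = \frac{1}{\lambda}\cdot\frac{2}{\lambda} = \frac{2}{\lambda^2}.
\]
In particular, when $W \sim \mathcal{E}(1)$ one may take $W^e = W$, and the right hand sides of (13.81) and (13.82) vanish, as they must.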
The time until the occurrence of a rare event can often be well approximated by an exponential distribution. Aldous (1989) gives a wide survey of some settings where this phenomenon can occur, and Aldous and Fill (1994) summarize many results in the setting of Markov chain hitting times. We consider exponential approximation for first passage times, following the paper by Peköz and Röllin (2009). If $X_0, X_1, \ldots$ is a Markov chain taking values in a denumerable space $\mathcal{X}$, for $j \in \mathcal{X}$ let
\[
T_{\pi,j} = \inf\{t \ge 0: X_t = j\},
\tag{13.83}
\]
the time of the first visit to state $j$ when the chain is initialized at time 0 with distribution $\pi$, and let
\[
T_{i,j} = \inf\{t > 0: X_t = j\}
\tag{13.84}
\]
be the first time the Markov chain started in state $i$ at time 0 next visits state $j$.

Theorem 13.7 Let $X_0, X_1, \ldots$ be an ergodic, stationary Markov chain with stationary distribution $\pi$, and $T_{\pi,j}$ and $T_{i,j}$ the first passage times as given in (13.83) and (13.84), respectively. Then, with $Y \sim \mathcal{E}(1)$, the unit exponential distribution, for every $i \in \mathcal{X}$ we have
\[
\sup_{z \in \mathbb{R}}\bigl|P(\pi_i T_{\pi,i} \le z) - P(Y \le z)\bigr| \le 1.5\pi_i + \pi_i E|T_{\pi,i} - T_{i,i}|,
\tag{13.85}
\]
\[
\sup_{z \in \mathbb{R}}\bigl|P(\pi_i T_{\pi,i} \le z) - P(Y \le z)\bigr| \le 2\pi_i + P(T_{\pi,i} \ne T_{i,i})
\tag{13.86}
\]
and
\[
\sup_{z \in \mathbb{R}}\bigl|P(\pi_i T_{\pi,i} \le z) - P(Y \le z)\bigr| \le 2\pi_i + \sum_{n=1}^{\infty}\bigl|P^{(n)}_{i,i} - \pi_i\bigr|,
\tag{13.87}
\]
where $P^{(n)}_{i,i} = P(X_n = i | X_0 = i)$.

Proof We first claim that if $U$ is a uniform $[0, 1]$ random variable independent of all else, then the equilibrium distribution of $T_{i,i}$ is given by
\[
T^e_{i,i} =_d T_{\pi,i} + U.
\tag{13.88}
\]
To prove (13.88) we first demonstrate
\[
P(T_{\pi,i} = k) = \pi_i P(T_{i,i} > k).
\tag{13.89}
\]
Consider the renewal–reward process which has renewals at visits to state $i$ and a reward of 1 when the time between renewals is greater than $k$. Then, with $(X_1, R_1), (X_2, R_2), \ldots$ the sequence of renewal interarrival times and rewards, the renewal–reward theorem (see, e.g., Grimmett and Stirzaker 2001) yields
\[
\lim_{t \to \infty}\frac{ER(t)}{t} = \frac{ER_1}{EX_1},
\tag{13.90}
\]
where $R(t)$ is the total reward received by time $t$. As the mean length between renewals is $ET_{i,i} = 1/\pi_i$, the right hand side of (13.90) equals the right hand side of (13.89). On the other hand, there is precisely one time $t$ when the waiting time to the next renewal
\[
Y_t = \inf\{s \ge t: X_s = i\} - t
\]
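is exactly $k$ if and only if the cycle length is greater than $k$. Hence, up to boundary effects, $R(t) = \sum_{s=1}^{t} 1\{Y_s = k\}$, and $ER(t)/t \to P(Y_s = k)$ for the stationary chain, which is the left hand side of (13.89).

Hence, letting $f$ be a Lipschitz function with $f(0) = 0$, and using (13.89), we have
\[
\begin{aligned}
Ef'(T_{\pi,i} + U) &= E\bigl(f(T_{\pi,i} + 1) - f(T_{\pi,i})\bigr) \\
&= \pi_i\sum_{k=0}^{\infty} P(T_{i,i} > k)\bigl(f(k + 1) - f(k)\bigr) \\
&= \pi_i\sum_{k=0}^{\infty}\sum_{j=k+1}^{\infty} P(T_{i,i} = j)\bigl(f(k + 1) - f(k)\bigr) \\
&= \pi_i\sum_{j=0}^{\infty}\sum_{k=0}^{j-1} P(T_{i,i} = j)\bigl(f(k + 1) - f(k)\bigr) \\
&= \pi_i\sum_{j=0}^{\infty} P(T_{i,i} = j)f(j) \\
&= \pi_i Ef(T_{i,i}).
\end{aligned}
\]
The claim (13.88) now follows from definition (13.77). Writing the $L^\infty$ distance between the laws of random variables $\xi$ and $\eta$ more compactly as $\|\mathcal{L}(\xi) - \mathcal{L}(\eta)\|_\infty$, by the triangle inequality, we have
\[
\begin{aligned}
\|\mathcal{L}(\pi_i T_{\pi,i}) - \mathcal{E}(1)\|_\infty
&\le \|\mathcal{L}(\pi_i T_{\pi,i}) - \mathcal{L}(\pi_i(T_{\pi,i} + U))\|_\infty + \|\mathcal{L}(\pi_i(T_{\pi,i} + U)) - \mathcal{E}(1)\|_\infty \\
&\le \pi_i + \|\mathcal{L}(\pi_i(T_{\pi,i} + U)) - \mathcal{E}(1)\|_\infty.
\end{aligned}
\tag{13.91}
\]
To justify the final inequality (13.91), with $[z]$ denoting the greatest integer not greater than $z$ and $z \ge 0$, note that
\[
\begin{aligned}
P(T_{\pi,i} + U \le z) &= P\bigl(T_{\pi,i} \le [z] - 1\bigr) + P\bigl(T_{\pi,i} = [z], U \le z - [z]\bigr) \\
&= P\bigl(T_{\pi,i} \le [z] - 1\bigr) + \bigl(z - [z]\bigr)P\bigl(T_{\pi,i} = [z]\bigr),
\end{aligned}
\]
that
\[
P(T_{\pi,i} \le z) = P\bigl(T_{\pi,i} \le [z] - 1\bigr) + P\bigl(T_{\pi,i} = [z]\bigr),
\]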
and with $X_0, X_1, \ldots$ in the equilibrium distribution $\pi$, that
\[
P\bigl(T_{\pi,i} = [z]\bigr) \le P(X_{[z]} = i) = \pi_i.
\]
Continuing from (13.91) and applying (13.78), (13.88), and (13.82) of Theorem 13.6, and then (13.88) again, we obtain
\[
\begin{aligned}
\|\mathcal{L}(\pi_i T_{\pi,i}) - \mathcal{E}(1)\|_\infty
&\le \pi_i + \|\mathcal{L}(\pi_i T^e_{i,i}) - \mathcal{E}(1)\|_\infty \\
&\le \pi_i + \|\mathcal{L}((\pi_i T_{i,i})^e) - \mathcal{E}(1)\|_\infty \\
&\le \pi_i + \pi_i E|T_{\pi,i} + U - T_{i,i}| \\
&\le 1.5\pi_i + \pi_i E|T_{\pi,i} - T_{i,i}|,
\end{aligned}
\tag{13.92}
\]
proving (13.85). Taking $\beta = \pi_i$ in (13.80), from (13.92) and (13.88) we obtain
\[
\|\mathcal{L}(\pi_i T_{\pi,i}) - \mathcal{E}(1)\|_\infty \le 2\pi_i + P\bigl(|T_{\pi,i} + U - T_{i,i}| > 1\bigr),
\]
and we now obtain (13.86) by noting that $|T_{\pi,i} + U - T_{i,i}| > 1$ implies $T_{\pi,i} \ne T_{i,i}$.
Lastly, to show (13.87), let $X_0, X_1, \ldots$ be the chain in equilibrium and $Y_0, Y_1, \ldots$ a coupled copy started in state $i$ at time 0, according to the maximal coupling, see Griffeath (1974/1975), so that $P(X_n = Y_n = i) = \pi_i \wedge P^{(n)}_{i,i}$. Let $T_{\pi,i}$ and $T_{i,i}$ be the hitting times given by (13.83) and (13.84) defined on the $X$ and $Y$ chain, respectively. Then
\[
P(T_{\pi,i} \ne T_{i,i}) \le \sum_{n=0}^{\infty}\bigl[P(X_n = i, Y_n \ne i) + P(Y_n = i, X_n \ne i)\bigr],
\]
and since
\[
P(X_n = i, Y_n \ne i) = \pi_i - P(X_n = i, Y_n = i) = \pi_i - \pi_i \wedge P^{(n)}_{i,i} = \bigl(\pi_i - P^{(n)}_{i,i}\bigr)^+,
\]
and a similar calculation yields
\[
P(Y_n = i, X_n \ne i) = \bigl(P^{(n)}_{i,i} - \pi_i\bigr)^+,
\]
we obtain (13.87) from (13.86).
From, say, Billingsley (1968) Theorem 8.9, we know that for a finite, irreducible, aperiodic Markov chain, convergence to the unique stationary distribution is exponential, that is, that there exist $A \ge 0$ and $0 \le \rho < 1$ such that
\[
\bigl|P^{(n)}_{i,i} - \pi_i\bigr| \le A\rho^n \quad\text{for all } n \in \mathbb{N}.
\]
In this case, (13.87) of Theorem 13.7 immediately yields the bound
\[
\sup_{z \in \mathbb{R}}\bigl|P(\pi_i T_{\pi,i} \le z) - P(Y \le z)\bigr| \le \inf_{k \in \mathbb{N}}\bigl\{(k + 1)\pi_i + A\rho^k/(1 - \rho)\bigr\}
\]
on the Kolmogorov distance between $\pi_i T_{\pi,i}$ and the unit exponential.
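A simple case in which (13.87) can be compared with the truth is that of an i.i.d. sequence, viewed as a stationary Markov chain, in which state $i$ is visited at each time with probability $p = \pi_i$. This example is an assumption made here for illustration and is not taken from the text. Then $P^{(n)}_{i,i} = \pi_i$ for all $n \ge 1$, so (13.87) gives the bound $2p$, while $T_{\pi,i}$ is geometric with $P(T_{\pi,i} = k) = p(1 - p)^k$, and the exact Kolmogorov distance can be computed. The function name below is hypothetical.

```python
import math

def geometric_kolmogorov_distance(p):
    """sup_z |P(p*T <= z) - (1 - exp(-z))| where P(T = k) = p*(1-p)^k, k >= 0."""
    q = 1.0 - p
    terms = int(50 / p) + 1              # beyond z = 50 both tails are negligible
    dist = 0.0
    for k in range(terms):
        level = 1.0 - q ** (k + 1)       # P(p*T <= z) for k*p <= z < (k+1)*p
        left = 1.0 - math.exp(-k * p)    # exponential cdf at the left endpoint
        right = 1.0 - math.exp(-(k + 1) * p)   # and at the right endpoint
        dist = max(dist, abs(level - left), abs(level - right))
    return dist
```

At $z = 0$ the scaled hitting time has an atom of mass $p$ while the exponential has none, so the distance is at least $p$; the theorem guarantees it is at most $2p$, so in this example the bound is sharp up to a factor of two.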
Using the results of Theorem 13.7, Peköz and Röllin (2009) give bounds to the exponential for the times of appearance of patterns in independent coin tosses. In addition, they consider exponential approximations for random sums, and the asymptotic behavior of the scaled population size in a critical Galton–Watson branching process, conditioned on non-extinction.
Appendix

Proof of Lemma 13.1 (i) Let $Y'$ be an independent copy of $Y$. Then, for $y \in (a, b)$, we can rewrite $f_h$ in (13.6) as
\[
f(y) = \frac{1}{p(y)}E\bigl(h(Y) - h(Y')\bigr)1_{\{Y \le y\}} = -\frac{1}{p(y)}E\bigl(h(Y) - h(Y')\bigr)1_{\{Y > y\}},
\tag{13.93}
\]
which yields
\[
|f(y)| \le 2\|h\|\min\bigl(F(y), 1 - F(y)\bigr)/p(y).
\tag{13.94}
\]
Inequality (13.12) now follows from (13.10) and (13.94). Inequalities (13.94) and (13.11) imply $|f_h p'/p| \le 2d_2\|h\|$, that is, (13.13), and now (13.14) follows from (13.5).

(ii) Let $g_1(y) = p'(y)/p(y)$ for $y \in (a, b)$. Differentiating (13.5) we obtain
\[
f'' = h' - f'g_1 - fg_1'.
\tag{13.95}
\]
To prove (13.17), it suffices to show that
\[
\|fg_1'\| \le d_3\|h'\|
\tag{13.96}
\]
and
\[
\|f'g_1\| \le (1 + d_3)d_2\|h'\|.
\tag{13.97}
\]
By (13.93) again, we have
\[
\begin{aligned}
|f(y)p(y)| &\le \|h'\|\min\bigl(E(|Y| + |Y'|)1_{\{Y \le y\}}, E(|Y| + |Y'|)1_{\{Y > y\}}\bigr) \\
&= \|h'\|\min\bigl(E|Y|1_{\{Y \le y\}} + E|Y|F(y), E|Y|1_{\{Y > y\}} + E|Y|(1 - F(y))\bigr).
\end{aligned}
\tag{13.98}
\]
This inequality proves (13.96) by assumption (13.15); the claim (13.18) follows in the same way from (13.16). Rearranging (13.95) and multiplying by $p(y)$ yields
\[
\bigl(h' - fg_1'\bigr)p = p\bigl(f'' + f'g_1\bigr) = f''p + f'p' = (f'p)'.
\]
Thus, noting that the boundary terms vanish by (13.6),
\[
f'(y)p(y) = \int_a^y\bigl(h' - fg_1'\bigr)p\,dx = -\int_y^b\bigl(h' - fg_1'\bigr)p\,dx,
\]
and hence, using (13.96),
\[
|f'(y)p(y)| \le \|h'\|(1 + d_3)\min\bigl(F(y), 1 - F(y)\bigr),
\]
yielding (13.97), as well as (13.19), by (13.11) and (13.10), respectively.
Proof of Lemma 13.2 First, we see that Condition 13.1 is satisfied for densities of the form (13.24) whenever $E|Y| < \infty$ for $Y$ having density $p$, by (13.26). To prove the remaining claims, it suffices to verify that hypotheses (13.10), (13.11), (13.15) and (13.16) of Lemma 13.1 hold with $d_1 = 1/c_1$, $d_2 = 1$, $d_3 = c_2$ and $d_4(y) = c_2$. Let $g_2(y) = c_0 g(y)$ and $F(y) = P(Y \le y)$, the distribution function of $Y$. Consider the case $0 \in [a, b]$, so that $G(0) = 0$ and $p(0) = c_1$. We first show that (13.10) is satisfied with $d_1 = 1/c_1$. It suffices to show that
$$F(y) \le F(0)p(y)/c_1 \quad \text{for } a < y < 0, \tag{13.99}$$
and
$$1 - F(y) \le \bigl(1 - F(0)\bigr)p(y)/c_1 \quad \text{for } 0 \le y < b. \tag{13.100}$$
Consider the case $a < y < 0$ and let $H(y) = F(y) - F(0)p(y)/c_1$. Differentiating,
$$H'(y) = p(y) - F(0)p'(y)/c_1 = p(y) + F(0)g_2(y)p(y)/c_1 = p(y)\bigl(1 + F(0)g_2(y)/c_1\bigr).$$
Since $g_2(y)$ is non-decreasing by Condition 13.2, if $H'(0) > 0$, then $H'(y)$ has at most one sign change on $(a, 0)$, and therefore $H$ achieves its maximum either at $a$ or at $0$. If $H'(0) \le 0$, then $H'(y) \le 0$ for all $y < 0$, and $H$ achieves its maximum at $a$. However, as $H(0) = 0$ and $H(a) = -F(0)p(a)/c_1 \le 0$, we conclude that $H(y) \le 0$ for all $y \le 0$. This proves (13.99). Inequality (13.100) can be shown in a similar fashion.

Next we prove (13.11) holds with $d_2 = 1$. First consider $a < y < 0$. Inequality (13.11) is trivial when $p'(y) = 0$, so consider $y$ such that $p'(y) = -p(y)g_2(y) \ne 0$. By Condition 13.2, we have $g_2(x) \le g_2(y) < 0$ for all $x \le y$, and therefore
$$F(y) = \int_a^y p(x)\,dx \le \int_a^y \frac{p(x)g_2(x)}{g_2(y)}\,dx = \int_a^y \frac{-p'(x)}{g_2(y)}\,dx = \frac{p(y) - p(a)}{-g_2(y)} \le \frac{p(y)}{|g_2(y)|}. \tag{13.101}$$
Similarly, we have
$$1 - F(y) \le p(y)/g_2(y) \quad \text{for } 0 \le y < b. \tag{13.102}$$
Hence (13.11) is satisfied with $d_2 = 1$.
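Note that (13.100) and (13.102) imply that
$$1 - F(y) \le p(y) \min\bigl(1/c_1, 1/g_2(y)\bigr) \quad \text{for } 0 \le y < b, \tag{13.103}$$
and that likewise, by (13.99) and (13.101), we have
$$F(y) \le p(y) \min\bigl(1/c_1, 1/|g_2(y)|\bigr) \quad \text{for } a < y < 0. \tag{13.104}$$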
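To verify (13.15), with $0 \le y < b$ write
$$E|Y|\mathbf{1}_{\{Y > y\}} = yP(Y > y) + \int_y^b P(Y > t)\,dt$$
$$\le y\,p(y) \min\bigl(1/c_1, 1/g_2(y)\bigr) + \int_y^b p(t) \min\bigl(1/c_1, 1/g_2(t)\bigr)\,dt$$
$$\le y\,p(y) \min\bigl(1/c_1, 1/g_2(y)\bigr) + \min\bigl(1/c_1, 1/g_2(y)\bigr) \int_y^b p(t)\,dt$$
$$= \min\bigl(1/c_1, 1/g_2(y)\bigr)\bigl(y\,p(y) + 1 - F(y)\bigr)$$
$$\le \min\bigl(1/c_1, 1/g_2(y)\bigr)\bigl(y\,p(y) + p(y)/c_1\bigr)$$
$$= p(y) \min\bigl(1/c_1, 1/g_2(y)\bigr)\{y + 1/c_1\}. \tag{13.105}$$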
Similarly, for $a < y < 0$ we obtain
$$E|Y|\mathbf{1}_{\{Y \le y\}} \le p(y) \min\bigl(1/c_1, 1/|g_2(y)|\bigr)\bigl(|y| + 1/c_1\bigr). \tag{13.106}$$
Applying inequalities (13.105) and (13.106) at $y = 0$, and noting again that $G(0) = 0$ so that $p(0) = c_1$, gives $E|Y| \le 2/c_1$. Hence, recalling (13.103),
$$E|Y|\mathbf{1}_{\{Y > y\}} + E|Y|\bigl(1 - F(y)\bigr) \le p(y) \min\bigl(1/c_1, 1/g_2(y)\bigr)\{y + 3/c_1\} \quad \text{for } 0 \le y < b, \tag{13.107}$$
and, using (13.104),
$$E|Y|\mathbf{1}_{\{Y \le y\}} + E|Y|F(y) \le p(y) \min\bigl(1/c_1, 1/|g_2(y)|\bigr)\bigl(|y| + 3/c_1\bigr) \quad \text{for } a < y < 0. \tag{13.108}$$
Thus, (13.15) holds with $d_3 = c_2$ by Condition 13.3. Inequalities (13.107) and (13.108) also show that (13.16) is satisfied with $d_4(y) = c_2$, completing the proof of Lemma 13.2 for the case $0 \in [a, b]$. The other cases follow similarly, noting that for $a > 0$ we have $G(a) = 0$ and $p(a) = c_1$, and likewise when $b \le 0$ we have $G(b) = 0$ and $p(b) = c_1$.
Chapter 14
Group Characters and Malliavin Calculus
In this chapter we outline two recent developments in the area of Stein’s method, one in the area of algebraic combinatorics, and the other a deep connection to the Malliavin calculus. We provide some background material in these areas in order to help make our presentation more self-contained, but refer the reader to more complete sources for a comprehensive picture. The material in Sect. 14.1 is based on Fulman (2009), and that in Sect. 14.2 on Nourdin and Peccati (2009).
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_14, © Springer-Verlag Berlin Heidelberg 2011

14.1 Normal Approximation for Group Characters

In the combinatorial central limit theorem studied in Sects. 4.4 and 6.1, one considers
$$Y = \sum_{i=1}^n a_{i\pi(i)} \quad \text{where } \pi \sim \mathcal{U}(S_n), \tag{14.1}$$
that is, the sum of matrix elements, one from each row, where the column index is chosen according to a permutation $\pi$ with the uniform distribution on the symmetric group $S_n$ of $\{1, \ldots, n\}$. If one were interested, say, in the distribution of the number of fixed points of $\pi$, one could take $a_{ij} = \mathbf{1}_{\{i=j\}}$, that is, let the matrix $(a_{ij})$ be the identity. Or, to look at the matter another way, if $e_1, \ldots, e_n$ are the standard (column) basis vectors in $\mathbb{R}^n$, and if $P_\pi = [e_{\pi(1)}, \ldots, e_{\pi(n)}]$, the permutation matrix associated to $\pi$, then the number of fixed points $Y$ of $\pi$ is the trace $\operatorname{Tr}(P_\pi)$ of $P_\pi$. More generally, we may write $Y$ in (14.1) as $\operatorname{Tr}(AP_\pi)$. Asking similar questions for other groups leads us fairly directly to the study of the distribution of traces, or group characters on random matrices, and generalizations thereof.

One of the earliest results on traces of random matrices is due to Diaconis and Shahshahani (1994), who applied the method of moments to prove joint convergence in distribution for traces of powers in a number of classical compact groups, including the orthogonal group $O(n, \mathbb{R})$. For this case, Stein (1995) showed that the error in the normal approximation decreases faster than the rate $n^{-r}$ for any fixed $r$, and Johansson (1997), working more generally, showed the rate is actually exponential, validating a conjecture of Diaconis.

The theory of representations and group characters being a rich one, below we only provide the most basic relevant definitions, but for omitted details and proofs, refer the reader to Weyl (1997) and Fulton and Harris (1991) for in depth treatments, and in particular to Serre (1997) for finite groups, and Sagan (1991) for the symmetric group.

For $V$ a finite dimensional vector space, let $GL(V)$ denote the set of all invertible linear transformations from $V$ to itself. When taking $V$ to be some subspace of $\mathbb{C}^n$, each such transformation may be considered as an invertible $n \times n$ matrix with complex entries. A representation of a group $G$ is a map $\tau : G \to GL(V)$ which preserves the group structure, that is, which obeys
$$\tau(e) = I \quad \text{and} \quad \tau(gh) = \tau(g)\tau(h),$$
where $e$ is the identity element of $G$, and $I$ the identity matrix. For groups where $G$ itself has a topology we also require $\tau$ to be continuous. The map $\tau(g) = 1$ for all $g \in G$, clearly a representation, is called the trivial representation. We define the dimension $\dim(\tau)$ of the representation $\tau$ to be the dimension of the vector space $V$. The defining representation, for groups such as the ones we consider below which are already presented as matrices, is just the map that sends each matrix to itself. If $\tau$ is a given representation over the vector space $V$, we say that a subspace $W \subset V$ is invariant if
$$w \in W \quad \text{implies} \quad \tau(g)w \in W \quad \text{for all } g \in G.$$
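The permutation matrices $P_\pi$ from the motivating example give a concrete representation of $S_n$ on which these definitions can be checked by brute force. The sketch below, with illustrative helper names (`perm_matrix`, `matmul`, `trace`, `compose`) and an arbitrary test matrix `A`, all of which are assumptions of this example, verifies for $n = 4$ that $\operatorname{Tr}(P_\pi)$ counts fixed points, that $\operatorname{Tr}(AP_\pi) = \sum_i a_{i\pi(i)}$, that $\pi \mapsto P_\pi$ preserves the group structure, and that the span of the all-ones vector is an invariant subspace.

```python
from itertools import permutations


def perm_matrix(pi):
    """P_pi = [e_pi(0), ..., e_pi(n-1)]: column c is the basis vector e_pi(c)."""
    n = len(pi)
    return [[1 if r == pi[c] else 0 for c in range(n)] for r in range(n)]


def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def trace(M):
    return sum(M[i][i] for i in range(len(M)))


def compose(sigma, pi):
    """The permutation i -> sigma(pi(i))."""
    return tuple(sigma[pi[i]] for i in range(len(pi)))


n = 4
A = [[10 * i + j for j in range(n)] for i in range(n)]  # arbitrary matrix
S_n = list(permutations(range(n)))
ones = [[1] for _ in range(n)]  # the all-ones column vector

# Tr(P_pi) is the number of fixed points of pi.
fixed_points_ok = all(
    trace(perm_matrix(pi)) == sum(pi[i] == i for i in range(n)) for pi in S_n)

# Tr(A P_pi) is the sum in (14.1).
trace_formula_ok = all(
    trace(matmul(A, perm_matrix(pi))) == sum(A[i][pi[i]] for i in range(n))
    for pi in S_n)

# pi -> P_pi preserves the group structure.
homomorphism_ok = all(
    perm_matrix(compose(s, p)) == matmul(perm_matrix(s), perm_matrix(p))
    for s in S_n for p in S_n)

# The line spanned by the all-ones vector is invariant.
ones_invariant_ok = all(matmul(perm_matrix(pi), ones) == ones for pi in S_n)
```

Since the all-ones line is a nontrivial invariant subspace, this permutation representation is not irreducible for $n \ge 2$.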
If $\tau$ has no invariant subspaces other than the trivial ones $W = V$ and $W = 0$, we say $\tau$ is irreducible. Recall that $Y$ in the motivating example (14.1) could be written in terms of the trace of a matrix. Given a representation $\tau$, the character $\chi^\tau$ of $\tau$, also written simply as $\chi$ when $\tau$ is implicit, is given by
$$\chi(g) = \operatorname{Tr}\tau(g).$$
We let the dimension, or degree, of $\chi$ be the dimension of $\tau$, and say that $\chi$ is irreducible whenever $\tau$ is. The character $\chi$ inherits the properties of the trace, and in particular is the sum of the eigenvalues of $\tau(g)$, counting multiplicities. As a representation always sends the identity element to the identity matrix, the dimension of any representation may be calculated by evaluating its character on the identity.

A few more simple facts regarding characters are in order. For a complex number $z \in \mathbb{C}$, let $\bar{z}$ denote the complex conjugate of $z$. For $\chi$ a character of the representation $\tau$, we have
(i) $\chi(g^{-1}) = \overline{\chi(g)}$ for all $g \in G$.
(ii) $\chi(hgh^{-1}) = \chi(g)$ for all $g, h \in G$.
The first property may be shown by arguing that the transformations $\tau(g)$ may be taken to be unitary without loss of generality. Hence, any eigenvalue of $\tau(g)$ has modulus 1, so its inverse and complex conjugate are equal. Clearly then the same
holds for their sum. The second property is simply a consequence of the cyclic invariance of the trace. Recalling that $C$ is a conjugacy class of the group $G$ if
$$g^{-1}hg \in C \quad \text{for all } g \in G \text{ and } h \in C,$$
property (ii) may be stated as the fact that characters are constant on the conjugacy classes of $G$, that is, they are what are known as ‘class functions.’

Though the theory of group representations and group characters is more general, we closely follow a portion of the work of Fulman (2009), focusing on the following three compact Lie groups:
1. $O(n, \mathbb{R})$, the orthogonal group, of all $n \times n$ real valued matrices $U$ such that $U^{\mathsf T}U = I$.
2. $SO(n, \mathbb{R})$, the special orthogonal group, of all elements in $O(n, \mathbb{R})$ with determinant 1.
3. $USp(2n, \mathbb{C})$, the unitary symplectic group of all $2n \times 2n$ complex matrices $U$ that satisfy
$$U^\dagger \Upsilon U = \Upsilon \quad \text{where} \quad \Upsilon = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix},$$
where $U^\dagger$ is the conjugate transpose of $U$.

We will not require a precise definition of a Lie group, and refer the reader to Hall (2003). To consider the analogs of the uniform measure in (14.1) on these groups, we select a group element with distribution according to Haar measure, that is, the unique measure $\mu$ on $G$ of total mass one that satisfies
$$\mu(gS) = \mu(S) \quad \text{for all } g \in G \text{ and all measurable subsets } S \text{ of } G.$$
Using the Haar measure, we can define an inner product on complex valued functions $\chi$ and $\psi$ on $G$ by $\langle \chi, \psi \rangle = E\chi\bar{\psi}$, that is, as
$$\langle \chi, \psi \rangle = \int_G \chi(g)\overline{\psi(g)}\,d\mu.$$
If $\chi$ and $\psi$ are irreducible characters, then they satisfy the orthogonality relation
$$\langle \chi, \psi \rangle = \delta_{\chi,\psi}. \tag{14.2}$$
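For a finite group, Haar measure is simply the uniform distribution, so the orthogonality relation (14.2) can be verified by direct enumeration. The sketch below does this for the three irreducible characters of $S_3$ (trivial, sign, and the two dimensional character given by the number of fixed points minus one); the choice of $S_3$ and all names in the code are illustrative assumptions, not part of the development for Lie groups in the text.

```python
from fractions import Fraction
from itertools import permutations

G = list(permutations(range(3)))  # the symmetric group S_3


def fixed_points(g):
    return sum(1 for i, gi in enumerate(g) if gi == i)


def sign(g):
    """Sign of a permutation, computed from its number of inversions."""
    inversions = sum(1 for i in range(len(g))
                     for j in range(i + 1, len(g)) if g[i] > g[j])
    return -1 if inversions % 2 else 1


# The irreducible characters of S_3, as functions on the group.
characters = {
    "trivial": lambda g: 1,
    "sign": sign,
    "standard": lambda g: fixed_points(g) - 1,
}


def inner(chi, psi):
    """<chi, psi> under the uniform (Haar) measure; the characters here are
    real valued, so the complex conjugate is omitted."""
    return Fraction(sum(chi(g) * psi(g) for g in G), len(G))


gram = {(a, b): inner(characters[a], characters[b])
        for a in characters for b in characters}

# With W = chi(g) for the nontrivial 'standard' character and g uniform:
W = characters["standard"]
mean_W = inner(W, characters["trivial"])   # E W
second_moment_W = inner(W, W)              # E W^2
```

The Gram matrix `gram` is the identity, and the mean and second moment of `W` are 0 and 1, which is the finite group counterpart of the normalization of $W = \chi^\tau(g)$ obtained from the orthogonality relation.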
In particular, if $W = \chi^\tau(g)$ where $\tau$ is a nontrivial irreducible representation and $g$ is chosen according to Haar measure, this orthogonality relation implies
$$EW = 0 \quad \text{and} \quad EW^2 = 1,$$
where for the first equality we have used the orthogonality of $\chi^\tau$ to the character of the trivial representation 1.

We apply Stein’s method to study the distribution of $W = \chi^\tau(g)$ for $\tau$ an irreducible representation and $g$ chosen according to the Haar measure of some compact Lie group. In particular, we construct an appropriate Stein pair $W, W'$ so that Theorem 5.5 may be applied in this context. First, to show our choices are Stein pairs,
and for many subsequent calculations, we rely on Lemma 14.1 below, a result of Helgason (2000). Next, we must be able to bound the variance of a conditional expectation in order to handle the first term of Theorem 5.5. As $W$ is a function of $g$, (4.143) yields
$$\operatorname{Var}\bigl(E[(W' - W)^2 | W]\bigr) \le \operatorname{Var}\bigl(E[(W' - W)^2 | g]\bigr). \tag{14.3}$$
The conditional expectation of $(W')^2$ given $g$ is computed in Lemma 14.2 with the help of Lemma 14.1, allowing for the computation of the conditional expectation on the right hand side of (14.3), and subsequently, its variance, in Lemma 14.3. The higher order moments required for the evaluation of the second term in the bound of Theorem 5.5 are handled in Lemma 14.4; as our constructions will lead to Stein pairs, the last term of that bound will be zero.

Focusing now on the real case, let $G$ be a compact Lie group and $\chi^\tau$ a nontrivial, real valued irreducible character of $G$. To create an appropriate $W'$, let $\alpha$ be chosen independently and uniformly from some fixed self-inverse conjugacy class $C$ of $G$, that is, a conjugacy class $C$ such that $h \in C$ implies $h^{-1} \in C$. (When all the characters of $G$ are real valued, all conjugacy classes are self-inverse, see Fulman 2009.) We claim that the pair
$$(W, W') = \bigl(\chi^\tau(g), \chi^\tau(\alpha g)\bigr) \tag{14.4}$$
is exchangeable. Since for all $\alpha \in G$ the product $\alpha^{-1}g$ is distributed according to Haar measure whenever $g$ is, and $\alpha^{-1} =_d \alpha$ when $\alpha$ is chosen uniformly over $C$, by the independence of $g$ and $\alpha$ we obtain
$$\bigl(\chi^\tau(g), \chi^\tau(\alpha g)\bigr) =_d \bigl(\chi^\tau(\alpha^{-1}g), \chi^\tau(g)\bigr) =_d \bigl(\chi^\tau(\alpha g), \chi^\tau(g)\bigr).$$
Furthermore, since $C$ is self-inverse and characters are constant on conjugacy classes, for $\phi$ any representation,
$$\chi^\phi(\alpha) = \chi^\phi(\alpha^{-1}) = \overline{\chi^\phi(\alpha)},$$
implying $\chi^\phi(\alpha)$ is real.

Recall that the class functions on $G$ are the ones which are constant on the conjugacy classes of $G$, so in particular they form a vector space. We have seen that the characters themselves are class functions, but more is true. The irreducible characters of a group form an orthonormal basis for the class functions.
Indeed, the calculation of the bounds in the theorems that follow hinges on the expansion of given characters in terms of the irreducibles. For the calculation of the higher order moments required for the evaluation of the second term in the bound of Theorem 5.5 we require some additional facts regarding characters and tensor products. If $\tau$ and $\rho$ are representations of groups $G$ and $H$, then we may define the tensor product representation on the product group $G \times H$ by
$$(\tau \otimes \rho)(g, h) = \tau(g) \otimes \rho(h) \quad \text{for all } g \in G,\ h \in H.$$
Letting $\chi$, $\psi$ and $\chi \otimes \psi$ be the characters associated to $\tau$, $\rho$ and $\tau \otimes \rho$, respectively, the properties of the tensor product directly imply that
$$(\chi \otimes \psi)(g, h) = \chi(g)\psi(h). \tag{14.5}$$
When $H = G$ the mapping from $g$ to $\tau(g) \otimes \tau(g)$, denoted $\tau^2$, is again a representation of $G$, and more generally we may define the $r$-fold tensor product representation $\tau^r$ for $r \in \mathbb{N}$; when $r = 0$ we let the product $\tau^0$ be the trivial representation. If $\tau$ has character $\chi$, then by (14.5) the representation $\tau^r$ has character $\chi^r$. As the irreducible characters form a basis for the class functions, the character $\chi^r$ can be so decomposed; in particular, all that is needed to specify the decomposition of $\chi^r$ in terms of irreducible characters is the multiplicity $m_\phi(\tau^r)$ of the irreducible representation $\phi$ in $\tau^r$. The following lemma from Helgason (2000) is key in the sequel.
Lemma 14.1 Let $G$ be a compact Lie group and $\chi^\phi$ the character induced by the irreducible representation $\phi$ of $G$. Then
$$\int_G \chi^\phi\bigl(h\alpha h^{-1}g\bigr)\,dh = \frac{\chi^\phi(\alpha)}{\dim(\phi)}\chi^\phi(g)$$
for all $\alpha, g \in G$.

We can now show that $W, W'$ is a Stein pair, and develop a number of its properties.

Lemma 14.2 On the compact Lie group $G$ let $(W, W')$ be given by (14.4) with $g$ chosen from Haar measure, independently of $\alpha$ having the uniform distribution over some fixed self-inverse conjugacy class. Then for all $r \in \mathbb{N}_0$,
$$W^r = \sum_\phi m_\phi(\tau^r)\chi^\phi(g) \quad \text{and} \quad E\bigl[(W')^r | g\bigr] = \sum_\phi m_\phi(\tau^r)\frac{\chi^\phi(\alpha)}{\dim(\phi)}\chi^\phi(g), \tag{14.6}$$
where the sum is over all irreducible representations of $G$,
$$E(W' | W) = (1 - \lambda)W \quad \text{where } \lambda = 1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)}, \tag{14.7}$$
and
$$E(W - W')^2 = 2\left(1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)}\right). \tag{14.8}$$
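The linearity condition (14.7) and the second moment (14.8) can be sanity-checked on a finite group, where Haar measure is uniform and the averaging identity of Lemma 14.1 takes the same form. The sketch below is an illustration under stated assumptions only: it uses $S_4$, takes $\tau$ to be the three dimensional irreducible representation with character equal to the number of fixed points minus one, and takes the self-inverse conjugacy class to be the transpositions; none of these choices come from the text.

```python
from fractions import Fraction
from itertools import permutations

n = 4
G = list(permutations(range(n)))  # S_4 with the uniform (Haar) measure
identity = tuple(range(n))


def compose(a, g):
    """The product ag, acting as i -> a(g(i))."""
    return tuple(a[g[i]] for i in range(n))


def chi(g):
    """Character of the 3-dimensional 'standard' representation of S_4."""
    return sum(g[i] == i for i in range(n)) - 1


# Self-inverse conjugacy class: the transpositions.
C = [g for g in G if sum(g[i] != i for i in range(n)) == 2]
alpha = C[0]

# lambda = 1 - chi(alpha) / dim(tau), with dim(tau) = chi(identity).
lam = 1 - Fraction(chi(alpha), chi(identity))


def cond_exp_W_prime(g):
    """E[W' | g], averaging chi(alpha g) over alpha uniform on C."""
    return Fraction(sum(chi(compose(a, g)) for a in C), len(C))


# (14.7): E[W' | g] = (1 - lambda) W for every g.
linearity_ok = all(cond_exp_W_prime(g) == (1 - lam) * chi(g) for g in G)

# (14.8): E (W - W')^2 = 2 lambda.
second_moment = Fraction(
    sum((chi(g) - chi(compose(a, g))) ** 2 for g in G for a in C),
    len(G) * len(C))
```

Here $\lambda = 2/3$, and the enumeration reproduces both $E(W'|g) = W/3$ and $E(W - W')^2 = 4/3$ exactly.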
Proof The first claim is simply the decomposition of the character $W^r = \chi^\tau(g)^r$ of $\tau^r$ of the product group $G^r$ over the basis of irreducible representations of $G$. Using this decomposition on $g' = \alpha g$ we obtain
$$(W')^r = \sum_\phi m_\phi(\tau^r)\chi^\phi(g').$$
Now, since $C = \{h\alpha h^{-1} : h \in G\}$ is the conjugacy class of $\alpha$, and when $h$ is distributed according to Haar measure then $h\alpha h^{-1}$ is uniform over $C$, we have
$$E\bigl[\chi^\phi(g') | g\bigr] = \int_G \chi^\phi\bigl(h\alpha h^{-1}g\bigr)\,dh = \frac{\chi^\phi(\alpha)}{\dim(\phi)}\chi^\phi(g),$$
by Lemma 14.1, proving the second equality in (14.6). When $r = 1$ only the summand $\phi = \tau$ is non-vanishing, yielding $E(W' | g)$ as a function of $W$, hence (14.7). The last claim is now immediate from (2.34).

We now calculate the right hand side of (14.3) for use in bounding the first term in Theorem 5.5.

Lemma 14.3 With $W, W'$ and $g$ as in Lemma 14.2,
$$\operatorname{Var}\bigl(E[(W' - W)^2 | g]\bigr) = \sum_\phi^{*} m_\phi(\tau^2)^2\left(1 + \frac{\chi^\phi(\alpha)}{\dim(\phi)} - \frac{2\chi^\tau(\alpha)}{\dim(\tau)}\right)^2,$$
where $*$ signifies that the sum is over all nontrivial irreducible representations of $G$.

Proof Expanding the square and using the measurability of $W$ with respect to $g$, then applying (14.7), and (14.6) with $r = 2$, of Lemma 14.2, we obtain
$$E\bigl[(W' - W)^2 | g\bigr] = E\bigl[(W')^2 | g\bigr] - 2WE(W' | g) + W^2$$
$$= E\bigl[(W')^2 | g\bigr] + \left(1 - \frac{2\chi^\tau(\alpha)}{\dim(\tau)}\right)W^2$$
$$= \sum_\phi m_\phi(\tau^2)\left(1 + \frac{\chi^\phi(\alpha)}{\dim(\phi)} - \frac{2\chi^\tau(\alpha)}{\dim(\tau)}\right)\chi^\phi(g).$$
Now squaring and taking expectation, the orthogonality relation (14.2) for irreducible characters yields the second moment of the conditional expectation,
$$E\bigl(E[(W' - W)^2 | g]\bigr)^2 = \sum_\phi m_\phi(\tau^2)^2\left(1 + \frac{\chi^\phi(\alpha)}{\dim(\phi)} - \frac{2\chi^\tau(\alpha)}{\dim(\tau)}\right)^2. \tag{14.9}$$
We note that the square of $E(W' - W)^2$, as given in Lemma 14.2, is the summand of (14.9) corresponding to the trivial representation of multiplicity 1, and therefore subtraction of this term to yield the variance completes the proof.

We now focus on the calculation of the fourth moment of $W - W'$ in order to bound the second term in Theorem 5.5.

Lemma 14.4 With $W, W'$ as in Lemma 14.2,
$$E(W - W')^4 = \sum_\phi m_\phi(\tau^2)^2\left[8\left(1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)}\right) - 6\left(1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)}\right)\right].$$
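Proof Expanding and using the measurability of $W$ with respect to $g$, and then applying Lemma 14.2, we obtain
$$E\bigl[(W' - W)^4 | g\bigr] = \sum_{r=0}^4 (-1)^r\binom{4}{r}\chi^\tau(g)^{4-r}E\bigl[(W')^r | g\bigr]$$
$$= \sum_{r=0}^4 (-1)^r\binom{4}{r}\chi^\tau(g)^{4-r}\sum_\phi m_\phi(\tau^r)\frac{\chi^\phi(\alpha)}{\dim(\phi)}\chi^\phi(g).$$
Now, taking expectation yields
$$E(W' - W)^4 = \sum_{r=0}^4 (-1)^r\binom{4}{r}\sum_\phi m_\phi(\tau^r)\frac{\chi^\phi(\alpha)}{\dim(\phi)}\int \chi^\tau(g)^{4-r}\chi^\phi(g)\,dg$$
$$= \sum_{r=0}^4 (-1)^r\binom{4}{r}\sum_\phi m_\phi(\tau^r)m_\phi(\tau^{4-r})\frac{\chi^\phi(\alpha)}{\dim(\phi)},$$
where
$$\int \chi^\tau(g)^{4-r}\chi^\phi(g)\,dg = m_\phi(\tau^{4-r})$$
by the decomposition of $\chi^\tau(g)^{4-r}$ in terms of irreducible characters, and applying the orthogonality relation they satisfy. When $\alpha$ is the identity element of $G$ then $W' = W$ and $\chi^\phi(\alpha) = \dim(\phi)$, so
$$0 = \sum_{r=0}^4 (-1)^r\binom{4}{r}\sum_\phi m_\phi(\tau^r)m_\phi(\tau^{4-r}),$$
so we may write, for all $\alpha$,
$$E(W' - W)^4 = -\sum_{r=0}^4 (-1)^r\binom{4}{r}\sum_\phi m_\phi(\tau^r)m_\phi(\tau^{4-r})\left(1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)}\right).$$
The $r = 0, 4$ terms contribute zero, since the only $\phi$ which might contribute to the sum is the trivial representation, for which the last term vanishes. For $r = 2$ the contribution is
$$-6\sum_\phi m_\phi(\tau^2)^2\left(1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)}\right).$$
For both the $r = 1$ and $r = 3$ terms, the only nonzero summand is the one where $\phi = \tau$, with $m_\tau(\tau) = 1$, and hence these contributions sum to
$$8\left(1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)}\right)m_\tau(\tau^3). \tag{14.10}$$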
Taking the inner product of the character of $\tau^3$ with that of $\tau$ to find $m_\tau(\tau^3)$, and then using the decomposition of $\chi^\tau(g)^2$ in terms of irreducible characters, we have
$$m_\tau(\tau^3) = \int \chi^\tau(g)^4\,dg = \int \Bigl(\sum_\phi m_\phi(\tau^2)\chi^\phi(g)\Bigr)^2 dg = \sum_\phi m_\phi(\tau^2)^2.$$
Now substitution of this expression into (14.10) and the collection of terms yields the result.

We now state a normal approximation theorem for general compact Lie groups with real valued characters.

Theorem 14.1 Let $G$ be a compact Lie group and let $\tau$ be a non-trivial irreducible representation of $G$ with real valued character $\chi^\tau$. Let $W = \chi^\tau(g)$ where $g$ is chosen from the Haar measure of $G$. Then, if $\alpha$ is any non-identity element of $G$ with the property that $\alpha$ and $\alpha^{-1}$ are conjugate,
$$\sup_{z\in\mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{1}{2}\sqrt{\sum_\phi^{*} m_\phi(\tau^2)^2\left[2 - \frac{1}{\lambda}\left(1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)}\right)\right]^2}$$
$$+ \left(\frac{1}{\pi}\sum_\phi m_\phi(\tau^2)^2\left[8 - \frac{6}{\lambda}\left(1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)}\right)\right]\right)^{1/4},$$
where $\lambda = 1 - \chi^\tau(\alpha)/\dim(\tau)$, the first sum is over all non-trivial irreducible representations of $G$, and the second sum over all irreducible representations of $G$.

Proof Let $(W, W')$ be given by (14.4) for $g$ and an element chosen uniformly from the conjugacy class containing $\alpha$. By Lemma 14.2 we may invoke Theorem 5.5 with the given $\lambda$ and $R = 0$. The first term in the bound of Theorem 5.5 is handled using (5.4), (14.3) and Lemma 14.3, and the last term by (14.8) of Lemma 14.2, Lemma 14.4 and
$$E|W' - W|^3 \le \sqrt{E(W' - W)^2\,E(W' - W)^4}.$$

We apply Theorem 14.1 to characters of individual Lie groups with $\tau$ their defining representation. In order to apply the bounds, for each example we need information about the decomposition of $\tau^2$ in terms of irreducible characters. We include details for the calculation of the bound for the character of $O(2n, \mathbb{R})$, and omit the similar steps for the remaining examples. For additional details see Fulman (2009).
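Before specializing to Lie groups, the variance formula of Lemma 14.3 and the fourth moment formula of Lemma 14.4 can be checked by brute force on a finite group, where Haar measure is the uniform distribution. The following sketch makes several assumptions that are not part of the text: it works on $S_4$, takes $\tau$ to be the three dimensional standard representation and the transpositions as the self-inverse conjugacy class, and uses the well known character table of $S_4$, entered by hand and indexed by cycle type. Multiplicities $m_\phi(\tau^2)$ are computed from inner products rather than assumed.

```python
from fractions import Fraction
from itertools import permutations

n = 4
G = list(permutations(range(n)))
identity = tuple(range(n))


def cycle_type(g):
    """Cycle lengths of g in decreasing order, e.g. (2, 1, 1)."""
    seen, lengths = set(), []
    for start in range(n):
        if start not in seen:
            length, j = 0, start
            while j not in seen:
                seen.add(j)
                j = g[j]
                length += 1
            lengths.append(length)
    return tuple(sorted(lengths, reverse=True))


# Character table of S_4 by cycle type (standard fact, entered by hand).
TABLE = {
    "trivial":       {(1, 1, 1, 1): 1, (2, 1, 1): 1, (2, 2): 1, (3, 1): 1, (4,): 1},
    "sign":          {(1, 1, 1, 1): 1, (2, 1, 1): -1, (2, 2): 1, (3, 1): 1, (4,): -1},
    "two_dim":       {(1, 1, 1, 1): 2, (2, 1, 1): 0, (2, 2): 2, (3, 1): -1, (4,): 0},
    "standard":      {(1, 1, 1, 1): 3, (2, 1, 1): 1, (2, 2): -1, (3, 1): 0, (4,): -1},
    "standard_sign": {(1, 1, 1, 1): 3, (2, 1, 1): -1, (2, 2): -1, (3, 1): 0, (4,): 1},
}


def chi(name, g):
    return TABLE[name][cycle_type(g)]


def compose(a, g):
    return tuple(a[g[i]] for i in range(n))


tau = "standard"
C = [g for g in G if cycle_type(g) == (2, 1, 1)]  # transpositions
alpha = C[0]


def ratio(name):
    """chi^phi(alpha) / dim(phi)."""
    return Fraction(chi(name, alpha), chi(name, identity))


# Multiplicities m_phi(tau^2) = <chi_tau^2, chi_phi> (all characters real).
m = {p: Fraction(sum(chi(tau, g) ** 2 * chi(p, g) for g in G), len(G))
     for p in TABLE}


def cond_moment(g, k):
    """E[(W' - W)^k | g], averaging over alpha uniform on C."""
    return Fraction(
        sum((chi(tau, compose(a, g)) - chi(tau, g)) ** k for a in C), len(C))


cond2 = [cond_moment(g, 2) for g in G]
mean2 = sum(cond2) / len(G)
var_direct = sum(x * x for x in cond2) / len(G) - mean2 ** 2
fourth_direct = sum(cond_moment(g, 4) for g in G) / len(G)

# Right hand sides of Lemma 14.3 and Lemma 14.4.
var_formula = sum(m[p] ** 2 * (1 + ratio(p) - 2 * ratio(tau)) ** 2
                  for p in TABLE if p != "trivial")
fourth_formula = sum(m[p] ** 2 * (8 * (1 - ratio(tau)) - 6 * (1 - ratio(p)))
                     for p in TABLE)
```

In this example both sides of Lemma 14.3 equal $5/9$ and both sides of Lemma 14.4 equal $10/3$, so the character-theoretic formulas agree with direct enumeration.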
14.1.1 $O(2n, \mathbb{R})$

Let $\tau$ be the $2n$-dimensional defining representation and let $x_1, x_1^{-1}, \ldots, x_n, x_n^{-1}$ be the eigenvalues of an element of $O(2n, \mathbb{R})$. The following lemma is the $k = 2$ case of a result of Proctor (1990).

Lemma 14.5 For $n \ge 2$, the square of the defining representation of $O(2n, \mathbb{R})$ decomposes in a multiplicity free way as the sum of the following three irreducible representations:
(i) The trivial representation, with character 1.
(ii) The representation with character $\frac{1}{2}\bigl(\sum_i x_i + \overline{x_i}\bigr)^2 - \frac{1}{2}\sum_i \bigl(x_i^2 + \overline{x_i}^2\bigr)$.
(iii) The representation with character $\frac{1}{2}\bigl(\sum_i x_i + \overline{x_i}\bigr)^2 + \frac{1}{2}\sum_i \bigl(x_i^2 + \overline{x_i}^2\bigr) - 1$.

Armed with Lemma 14.5 we may now prove the following result.

Theorem 14.2 Let $g$ be chosen from the Haar measure of $O(2n, \mathbb{R})$ with $n \ge 2$, and let $W$ be the trace of $g$. Then
$$\sup_{z\in\mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{1}{\sqrt{2}(n - 1)}.$$

Proof We apply Theorem 14.1 with $\tau$ the defining representation and $\alpha$ a rotation by some angle $\theta$, that is, an element conjugate to the diagonal matrix with entries $\{x_1, x_1^{-1}, \ldots, x_n, x_n^{-1}\}$ where $x_1 = \cdots = x_{n-1} = 1$ and $x_n = e^{i\theta}$. Then $\alpha$ is conjugate to $\alpha^{-1}$ and
$$\lambda = 1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)} = 1 - \frac{2(n - 1) + 2\cos(\theta)}{2n} = \frac{1 - \cos(\theta)}{n}.$$
To calculate the first error term in Theorem 14.1, we apply Lemma 14.5 to write the decomposition of $\tau^2$ into non-trivial irreducibles. With $\phi_1$ the non-trivial irreducible character given in (ii),
$$\chi^{\phi_1}(\alpha) = \frac{1}{2}\bigl(2(n - 1) + 2\cos(\theta)\bigr)^2 - \bigl(n - 1 + \cos(2\theta)\bigr) \tag{14.11}$$
with $\dim(\phi_1) = (2n)^2/2 - 2n/2 = 2n^2 - n$, and with $\phi_2$ as in (iii)
$$\chi^{\phi_2}(\alpha) = \frac{1}{2}\bigl(2(n - 1) + 2\cos(\theta)\bigr)^2 + \bigl(n - 1 + \cos(2\theta)\bigr) - 1, \tag{14.12}$$
with $\dim(\phi_2) = (2n)^2/2 + 2n/2 - 1 = 2n^2 + n - 1$. Hence, substituting using (14.11) and (14.12), we find the first error term is
$$\frac{1}{2}\sqrt{\left[2 - \frac{1}{\lambda}\left(1 - \frac{\chi^{\phi_1}(\alpha)}{2n^2 - n}\right)\right]^2 + \left[2 - \frac{1}{\lambda}\left(1 - \frac{\chi^{\phi_2}(\alpha)}{2n^2 + n - 1}\right)\right]^2}$$
$$= \frac{1}{2}\,\frac{\sqrt{8\bigl(n^2\cos(2\theta) - 2n(n - 1)\cos(\theta) + 2n^2 + 1\bigr)}}{(n + 1)(2n - 1)}.$$
Similarly, the second term in the error bound equals
$$\left(\frac{1}{\pi}\left[8 + \left(8 - \frac{6}{\lambda}\left(1 - \frac{\chi^{\phi_1}(\alpha)}{2n^2 - n}\right)\right) + \left(8 - \frac{6}{\lambda}\left(1 - \frac{\chi^{\phi_2}(\alpha)}{2n^2 + n - 1}\right)\right)\right]\right)^{1/4} = \left(\frac{24n(1 - \cos\theta)}{\pi(n + 1)(2n - 1)}\right)^{1/4}.$$
Since the bound given by the sum of these two terms holds for all $\theta$, and is continuous in $\theta$, the bound holds in the limit as $\theta \to 0$. The limiting value of the first term is $\sqrt{2}/(2n - 1) < 1/(\sqrt{2}(n - 1))$, while the second term vanishes in the limit, thus completing the argument.
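The two error terms in this proof can be evaluated numerically from the characters in Lemma 14.5 and compared with closed forms. The sketch below is illustrative only: the function names are assumptions, the closed form for the first term is written with $n^2\cos(2\theta)$ as obtained by expanding the squares, and floating point arithmetic limits how small $\theta$ can usefully be taken.

```python
import math


def o2n_error_terms(n, theta):
    """First and second error terms of Theorem 14.1 for O(2n, R), computed
    directly from the characters (ii) and (iii) of Lemma 14.5 at alpha."""
    c = math.cos(theta)
    lam = (1.0 - c) / n
    s1 = 2 * (n - 1) + 2 * c                      # sum of x_i + 1/x_i
    s2 = 2 * (n - 1) + 2 * math.cos(2 * theta)    # sum of x_i^2 + 1/x_i^2
    chi1, dim1 = 0.5 * s1 ** 2 - 0.5 * s2, 2 * n ** 2 - n
    chi2, dim2 = 0.5 * s1 ** 2 + 0.5 * s2 - 1.0, 2 * n ** 2 + n - 1
    d1 = (1.0 - chi1 / dim1) / lam
    d2 = (1.0 - chi2 / dim2) / lam
    first = 0.5 * math.sqrt((2.0 - d1) ** 2 + (2.0 - d2) ** 2)
    # The trivial representation contributes the leading 8.
    inner = (8.0 + (8.0 - 6.0 * d1) + (8.0 - 6.0 * d2)) / math.pi
    second = max(inner, 0.0) ** 0.25              # guard against rounding
    return first, second


def o2n_closed_forms(n, theta):
    """Closed forms of the two terms, as functions of n and theta."""
    c, c2 = math.cos(theta), math.cos(2 * theta)
    first = 0.5 * math.sqrt(
        8.0 * (n ** 2 * c2 - 2 * n * (n - 1) * c + 2 * n ** 2 + 1)
    ) / ((n + 1) * (2 * n - 1))
    second = (24.0 * n * (1.0 - c) / (math.pi * (n + 1) * (2 * n - 1))) ** 0.25
    return first, second


n, theta = 3, 0.7
direct = o2n_error_terms(n, theta)
closed = o2n_closed_forms(n, theta)

# As theta -> 0 the first term tends to sqrt(2) / (2n - 1).
first_small, second_small = o2n_error_terms(n, 1e-3)
limit_first = math.sqrt(2.0) / (2 * n - 1)
```

For moderate $\theta$ the direct evaluation and the closed forms agree to rounding error, and for small $\theta$ the first term is close to $\sqrt{2}/(2n - 1)$, which lies below the stated bound $1/(\sqrt{2}(n - 1))$.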
14.1.2 $SO(2n + 1, \mathbb{R})$

We follow the same lines of argument as in Sect. 14.1.1, with corresponding notation. Let $\tau$ be the $2n + 1$ dimensional defining representation of $SO(2n + 1, \mathbb{R})$ and $W = \chi^\tau(g)$ with $g$ chosen according to Haar measure. The following result, which follows from Sundaram (1990), gives the decomposition of $\tau^2$ into irreducibles.

Lemma 14.6 For $n \ge 2$, the square of the defining representation of $SO(2n + 1, \mathbb{R})$ decomposes in a multiplicity free way as the sum of the following three irreducible representations:
(i) The trivial representation, with character 1.
(ii) The representation with character $\frac{1}{2}\bigl(\sum_i x_i + x_i^{-1}\bigr)^2 + \frac{1}{2}\sum_i \bigl(x_i^2 + x_i^{-2}\bigr) + \sum_i \bigl(x_i + x_i^{-1}\bigr)$.
(iii) The representation with character $\frac{1}{2}\bigl(\sum_i x_i + x_i^{-1}\bigr)^2 - \frac{1}{2}\sum_i \bigl(x_i^2 + x_i^{-2}\bigr) + \sum_i \bigl(x_i + x_i^{-1}\bigr)$.

With the help of Lemma 14.6, we have the following result.

Theorem 14.3 Let $g$ be chosen from the Haar measure of $SO(2n + 1, \mathbb{R})$ with $n \ge 2$, and let $W$ be the trace of $g$. Then
$$\sup_{z\in\mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{1}{\sqrt{2}\,n}.$$

Proof Let $\alpha$ be a rotation by some angle $\theta$, that is, an element conjugate to a diagonal matrix with entries $\{x_1, x_1^{-1}, \ldots, x_n, x_n^{-1}, 1\}$ along the diagonal, where $x_1 = \cdots = x_{n-1} = 1$ and $x_n = e^{i\theta}$. Then $\alpha$ is conjugate to $\alpha^{-1}$ and
$$\lambda = 2\bigl(1 - \cos(\theta)\bigr)/(2n + 1).$$
Using Lemma 14.6 we decompose $\tau^2$ into irreducibles. Taking the limit as $\theta \to 0$ of the first error term of Theorem 14.1 yields $1/(\sqrt{2}\,n)$. Taking the limit of the second term gives 0, as in the proof of Theorem 14.2.
14.1.3 $USp(2n, \mathbb{C})$

Let $\tau$ be the $2n$ dimensional defining representation of $USp(2n, \mathbb{C})$ and $W = \chi^\tau(g)$ with $g$ chosen according to Haar measure. The following result, which follows from Sundaram (1990), gives the decomposition of $\tau^2$ into irreducibles.

Lemma 14.7 For $n \ge 2$, the square of the defining representation of $USp(2n, \mathbb{C})$ decomposes in a multiplicity free way as the sum of the following three irreducible representations:
(i) The trivial representation, with character 1.
(ii) The representation with character $\frac{1}{2}\bigl(\sum_i x_i + x_i^{-1}\bigr)^2 + \frac{1}{2}\sum_i \bigl(x_i^2 + x_i^{-2}\bigr)$.
(iii) The representation with character $\frac{1}{2}\bigl(\sum_i x_i + x_i^{-1}\bigr)^2 - \frac{1}{2}\sum_i \bigl(x_i^2 + x_i^{-2}\bigr) - 1$.

Using Lemma 14.7, we are able to prove the following result.

Theorem 14.4 Let $g$ be chosen from the Haar measure of $USp(2n, \mathbb{C})$ with $n \ge 2$, and let $W$ be the trace of $g$. Then
$$\sup_{z\in\mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{1}{\sqrt{2}\,n}.$$

Proof Let $\alpha$ be an element conjugate to the diagonal matrix with diagonal entries $\{x_1, x_1^{-1}, \ldots, x_n, x_n^{-1}\}$ where $x_1 = \cdots = x_{n-1} = 1$ and $x_n = e^{i\theta}$. Then $\alpha$ is conjugate to $\alpha^{-1}$ and, since $\chi^\tau(\alpha) = 2(n - 1) + 2\cos(\theta)$ and $\dim(\tau) = 2n$,
$$\lambda = 1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)} = \bigl(1 - \cos(\theta)\bigr)/n.$$
Using Lemma 14.7 we decompose $\tau^2$ into irreducibles, and obtain the limit of $\sqrt{2}/(2n + 1) < 1/(\sqrt{2}\,n)$ for the first term, as $\theta \to 0$, and a limit of zero for the second term, as in the proof of Theorem 14.2.
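The limits quoted in the proofs of Theorems 14.3 and 14.4 can be checked numerically by evaluating the first error term of Theorem 14.1 at a small angle. In the sketch below the function names are illustrative assumptions; the dimensions are obtained by evaluating each character at the identity ($\theta = 0$), and $\lambda$ is computed directly from its definition $1 - \chi^\tau(\alpha)/\dim(\tau)$. The two non-trivial characters for each group are taken to be the symmetric and antisymmetric square combinations of the eigenvalue sums, as in Lemmas 14.6 and 14.7.

```python
import math


def first_error_term(lam, irreps):
    """First term of Theorem 14.1 when every non-trivial irreducible in
    tau^2 has multiplicity one; irreps is a list of (chi(alpha), dim)."""
    return 0.5 * math.sqrt(
        sum((2.0 - (1.0 - chi / dim) / lam) ** 2 for chi, dim in irreps))


def power_sums(n, theta):
    """Sums of x_i + 1/x_i and x_i^2 + 1/x_i^2 for x_1 = ... = x_{n-1} = 1
    and x_n = exp(i * theta)."""
    s1 = 2 * (n - 1) + 2 * math.cos(theta)
    s2 = 2 * (n - 1) + 2 * math.cos(2 * theta)
    return s1, s2


def so_odd_characters(n, theta):
    """(chi_tau, [non-trivial characters of tau^2]) for SO(2n + 1, R)."""
    s1, s2 = power_sums(n, theta)
    return s1 + 1.0, [0.5 * s1 ** 2 + 0.5 * s2 + s1,
                      0.5 * s1 ** 2 - 0.5 * s2 + s1]


def usp_characters(n, theta):
    """(chi_tau, [non-trivial characters of tau^2]) for USp(2n, C)."""
    s1, s2 = power_sums(n, theta)
    return s1, [0.5 * s1 ** 2 + 0.5 * s2,
                0.5 * s1 ** 2 - 0.5 * s2 - 1.0]


def first_term(characters, n, theta):
    tau_alpha, irr_alpha = characters(n, theta)
    tau_dim, irr_dims = characters(n, 0.0)   # characters at the identity
    lam = 1.0 - tau_alpha / tau_dim
    return first_error_term(lam, list(zip(irr_alpha, irr_dims)))


n, small_theta = 3, 1e-3
so_value = first_term(so_odd_characters, n, small_theta)
usp_value = first_term(usp_characters, n, small_theta)
so_limit = 1.0 / (math.sqrt(2.0) * n)
usp_limit = math.sqrt(2.0) / (2 * n + 1)
```

For $n = 3$ the small-angle values are close to $1/(\sqrt{2}\,n)$ and $\sqrt{2}/(2n + 1)$ respectively, matching the limits used in the two proofs.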
14.2 Stein’s Method and Malliavin Calculus

In some real sense, the most fundamental identity underlying Stein’s method for the normal, that for all absolutely continuous functions $f$ such that the expectations below exist,
$$E\bigl[Zf(Z)\bigr] = E\bigl[f'(Z)\bigr] \quad \text{if and only if } Z \text{ is standard normal}, \tag{14.13}$$
can be seen as really nothing more than integration by parts. Proceeding along these same lines in more general spaces, the fact that (14.13) is a consequence of the Malliavin calculus integration by parts formula has some profound consequences. In this section we scratch the surface of the deep connection that exists between Stein’s method and the Malliavin calculus, as unveiled in Nourdin and Peccati (2009). Working exclusively below with one dimensional Brownian motion, we do not attempt to cover the generality of Nourdin and Peccati (2009), whose
framework includes Gaussian fields in higher dimensions, fractional Brownian motion, and parallel results for the Gamma distribution. Neither is our presentation one which approaches a treatment with complete technical details. We refer the reader to Nourdin and Peccati (2009), the lecture notes of Peccati (2009), and the text Nourdin and Peccati (2011) for full coverage of the material here, and to the standard reference, Nualart (2006), for the needed elements of the Malliavin calculus.

Let $B(t)$ be a standard Brownian motion on $[0, 1]$ on a probability space $(\Omega, \mathcal{F}, P)$, and let $L^2(B)$ be the Hilbert space of square integrable functionals of $B$, endowed with inner product $\langle X, Y \rangle_{L^2(B)} = EXY$. Starting with the definition $\int \psi\,dB = \int_a^b dB_t = B(b) - B(a)$ of the stochastic integral for the indicator $\psi = \mathbf{1}_{(a,b]}$ of an interval in $[0, 1]$, we may extend to all $\psi \in L^2(\lambda)$, the collection of square integrable functions on $[0, 1]$ with respect to Lebesgue measure $\lambda$, by taking $L^2(B)$ limits to obtain $\int \psi\,dB$; this integral will be denoted by $I(\psi)$. The associated map $\psi \to I(\psi)$ from $L^2(\lambda) \to L^2(B)$ satisfies the isometry property
$$\langle \psi, \phi \rangle_{L^2(\lambda)} = \langle I(\psi), I(\phi) \rangle_{L^2(B)}. \tag{14.14}$$
In addition, $I(\psi)$ is a mean zero normal random variable, which by (14.14) has variance $\|\psi\|^2_{L^2(\lambda)}$. In particular, the collection $\{I(\psi) : \psi \in L^2(\lambda)\}$ is a real valued Gaussian process. For $\psi = \mathbf{1}_A$, the indicator of the measurable set $A$ in $[0, 1]$, we also write $B(A) = I(\mathbf{1}_A)$, and may think of $B(A)$ as a measure on $[0, 1]$ with values in the space of random variables. For $A = (0, t]$ the definition recovers the Brownian motion through $B((0, t]) = B(t)$.

To consider higher order integrals, for $m \ge 1$ and $L^2(\lambda^m)$, the collection of square integrable functions on $[0, 1]^m$ with respect to $m$ dimensional Lebesgue measure $\lambda^m$, the higher order stochastic integrals $I_m(\psi)$ are defined as follows. Consider first elementary functions of the form
$$\psi(t_1, \ldots, t_m) = \sum_{i_1, \ldots, i_m = 1}^n a_{i_1, \ldots, i_m}\mathbf{1}_{A_{i_1} \times \cdots \times A_{i_m}}(t_1, \ldots, t_m)$$
where the sets $A_j$, $j = 1, \ldots, n$, are disjoint and measurable, and the coefficients $a_{i_1, \ldots, i_m}$ are zero if any two of the indices $i_1, \ldots, i_m$ are equal. For a function $\psi$ of this form, define
$$I_m(\psi) = \sum_{i_1, \ldots, i_m = 1}^n a_{i_1, \ldots, i_m}B(A_{i_1}) \cdots B(A_{i_m}),$$
and extend to $L^2(\lambda^m)$ by taking $L^2(B)$ limits. It is clear that $I(\psi)$ and $I_1(\psi)$ agree. For a function $\psi : [0, 1]^m \to \mathbb{R}$, define the symmetrization of $\psi$ by
$$\tilde{\psi}(t_1, \ldots, t_m) = \frac{1}{m!}\sum_\sigma \psi(t_{\sigma(1)}, \ldots, t_{\sigma(m)}), \tag{14.15}$$
where the sum is over all permutations $\sigma$ of $\{1, \ldots, m\}$. Letting $L^2_s(\lambda^m)$ be the closed subspace of $L^2(\lambda^m)$ of symmetric, square integrable functions on $[0, 1]^m$
with respect to Lebesgue measure, we see that $\psi \in L^2(\lambda^m)$ implies $\tilde{\psi} \in L^2_s(\lambda^m)$ by the triangle inequality,
$$\|\tilde{\psi}\|_{L^2(\lambda^m)} \le \|\psi\|_{L^2(\lambda^m)}.$$
One can verify that the stochastic integrals $I_m(\cdot)$ have the following properties:
(i) $EI_m(\psi) = 0$ and $I_m(\psi) = I_m(\tilde{\psi})$ for all $\psi \in L^2(\lambda^m)$.
(ii) For all $\psi \in L^2(\lambda^p)$ and $\phi \in L^2(\lambda^q)$,
$$E\bigl[I_p(\psi)I_q(\phi)\bigr] = \begin{cases} 0 & p \ne q, \\ p!\,\langle \tilde{\psi}, \tilde{\phi} \rangle_{L^2(\lambda^p)} & p = q. \end{cases} \tag{14.16}$$
(iii) The mapping $\psi \to I_m(\psi)$ from $L^2(\lambda^m)$ to $L^2(B)$ is linear.

The goal of this section, achieved in Theorem 14.6, is to obtain a bound to the normal in the total variation distance for integrals $I_q(\psi)$ for $q \ge 2$. For this purpose we require the multiplication formula, which expresses the product of the stochastic integrals of $\psi \in L^2(\lambda^p)$ and $\phi \in L^2(\lambda^q)$ in terms of sums of integrals of contractions of $\psi$ and $\phi$,
$$I_p(\psi)I_q(\phi) = \sum_{r=0}^{p \wedge q} r!\binom{p}{r}\binom{q}{r}I_{p+q-2r}\bigl(\psi\,\tilde{\otimes}_r\,\phi\bigr), \tag{14.17}$$
where, for $r = 1, \ldots, p \wedge q$ and $(t_1, \ldots, t_{p-r}, s_1, \ldots, s_{q-r}) \in [0, 1]^{p+q-2r}$, the contraction $\otimes_r$ is given by
$$(\psi \otimes_r \phi)(t_1, \ldots, t_{p-r}, s_1, \ldots, s_{q-r}) = \int_{[0,1]^r} \psi(z_1, \ldots, z_r, t_1, \ldots, t_{p-r})\,\phi(z_1, \ldots, z_r, s_1, \ldots, s_{q-r})\,\lambda^r(dz_1, \ldots, dz_r),$$
and for $r = 0$, denoting $\otimes_0$ also by $\otimes$, by
$$(\psi \otimes \phi)(t_1, \ldots, t_p, s_1, \ldots, s_q) = \psi(t_1, \ldots, t_p)\phi(s_1, \ldots, s_q).$$
Even when $\psi$ and $\phi$ are symmetric $\psi \otimes_r \phi$ may not be, and we let $\psi\,\tilde{\otimes}_r\,\phi$ be the symmetrization of $\psi \otimes_r \phi$ as given in (14.15).

Using the multiple stochastic integrals $I_q$, $q \in \mathbb{N}$, any $F \in L^2(B)$, that is, any square integrable function of the Brownian motion $B$, can be represented by the following Wiener chaos decomposition. For any such $F$, there exists a unique sequence $\{\psi_q : q \ge 1\}$ with $\psi_q \in L^2_s(\lambda^q)$ such that
$$F = \sum_{q=0}^\infty I_q(\psi_q) \tag{14.18}$$
where $I_0(\psi_0) = EF$, and the series converges in $L^2$. When all terms but one in the sum vanish so that $F = I_q(\psi_q)$ for some $q$, we say $F$ belongs to the $q$th Wiener
chaos of $B$. Applying the orthogonality relation (14.16) to the symmetric ‘kernels’ $\psi_q$ for $F$ of the form (14.18),
$$\|F\|^2_{L^2(B)} = \sum_{q=0}^\infty q!\,\|\psi_q\|^2_{L^2(\lambda^q)}. \tag{14.19}$$
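In the first chaos, the isometry (14.14), which is the $p = q = 1$ case of (14.16), can be verified exactly for step functions. The sketch below assumes only the standard covariance $E[B(s)B(u)] = \min(s, u)$ of Brownian motion, a fact not stated in the text, and uses illustrative names throughout. For $\psi = \sum_i a_i \mathbf{1}_{(t_{i-1}, t_i]}$ the integral is $I(\psi) = \sum_i a_i\bigl(B(t_i) - B(t_{i-1})\bigr)$, so both sides of the isometry reduce to finite sums.

```python
from fractions import Fraction


def integral_covariance(a, b, t):
    """E[I(psi) I(phi)] for step functions psi = sum_i a[i-1] 1_(t[i-1], t[i]]
    and phi = sum_j b[j-1] 1_(t[j-1], t[j]], computed only from the
    Brownian covariance E[B(s) B(u)] = min(s, u)."""
    m = len(t) - 1
    total = Fraction(0)
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            # Covariance of the increments over (t[i-1], t[i]] and (t[j-1], t[j]].
            cov = (min(t[i], t[j]) - min(t[i], t[j - 1])
                   - min(t[i - 1], t[j]) + min(t[i - 1], t[j - 1]))
            total += a[i - 1] * b[j - 1] * cov
    return total


def l2_inner(a, b, t):
    """<psi, phi> in L^2(lambda) for the same step functions."""
    return sum(a[i] * b[i] * (t[i + 1] - t[i]) for i in range(len(a)))


# Hypothetical partition of [0, 1] and step heights.
t = [Fraction(0), Fraction(1, 5), Fraction(1, 2), Fraction(3, 4), Fraction(1)]
psi = [Fraction(2), Fraction(-1), Fraction(0), Fraction(3)]
phi = [Fraction(1), Fraction(4), Fraction(-2), Fraction(1, 2)]

lhs = l2_inner(psi, phi, t)
rhs = integral_covariance(psi, phi, t)
variance_of_I_psi = integral_covariance(psi, psi, t)
```

Because increments over disjoint intervals are uncorrelated, only the diagonal terms survive, and the variance of $I(\psi)$ equals $\|\psi\|^2_{L^2(\lambda)}$.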
We now briefly describe two of the basic operators of the Malliavin calculus: the Malliavin derivative $D$, and the Ornstein–Uhlenbeck generator $L$. Beginning with $D$, for $g : \mathbb{R}^n \to \mathbb{R}$ a smooth function with compact support, consider a random variable of the form
$$F = g\bigl(I(\psi_1), \ldots, I(\psi_n)\bigr) \quad \text{with } \psi_1, \ldots, \psi_n \in L^2(\lambda). \tag{14.20}$$
For such an $F$, the Malliavin derivative is defined as
$$DF = \sum_{i=1}^n \frac{\partial}{\partial x_i}g\bigl(I(\psi_1), \ldots, I(\psi_n)\bigr)\psi_i.$$
Note in particular
$$DI(\psi) = \psi \quad \text{for every } \psi \in L^2(\lambda). \tag{14.21}$$
In general then, the $m$th derivative $D^mF$, given by
$$D^mF = \sum_{i_1, \ldots, i_m = 1}^n \frac{\partial^m}{\partial x_{i_1} \cdots \partial x_{i_m}}g\bigl(I(\psi_1), \ldots, I(\psi_n)\bigr)\psi_{i_1} \otimes \cdots \otimes \psi_{i_m},$$
maps random variables $F$ to the Hilbert space $L^2(B, L^2(\lambda^m))$ of $L^2(\lambda^m)$ valued functionals of $B$, endowed with the inner product $\langle u, v \rangle_{L^2(B, L^2(\lambda^m))} = E\langle u, v \rangle_{L^2(\lambda^m)}$. Letting $\mathcal{S}$ denote the set of random variables of the form (14.20), for every $m \ge 1$ the domain of $D^m$ may be extended to $\mathbb{D}^{m,2}$, the closure of $\mathcal{S}$ with respect to the norm $\|\cdot\|_{m,2}$ given by
$$\|F\|^2_{m,2} = EF^2 + \sum_{i=1}^m E\bigl\|D^iF\bigr\|^2_{L^2(\lambda^i)}.$$
A random variable $F \in L^2(B)$ having chaotic expansion (14.18) is an element of $\mathbb{D}^{m,2}$ if and only if the kernels $\psi_q$, $q = 1, 2, \ldots$ satisfy
$$\sum_{q=1}^\infty q^m q!\,\|\psi_q\|^2_{L^2(\lambda^q)} < \infty,$$
in which case
$$E\bigl\|D^mF\bigr\|^2_{L^2(\lambda^m)} = \sum_{q=m}^\infty (q)_m\,q!\,\|\psi_q\|^2_{L^2(\lambda^q)}, \tag{14.22}$$
where $(q)_m$ is the falling factorial. In particular, any $F$ having a finite Wiener chaos expansion is an element of $\mathbb{D}^{m,2}$ for all $m \ge 1$. The Malliavin derivative obeys a chain rule. If $g : \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function with bounded derivative, and $F_i \in \mathbb{D}^{1,2}$ for $i = 1, \ldots, n$, then $g(F_1, \ldots, F_n) \in \mathbb{D}^{1,2}$ and
$$Dg(F_1, \ldots, F_n) = \sum_{i=1}^n \frac{\partial}{\partial x_i}g(F_1, \ldots, F_n)DF_i. \tag{14.23}$$
As we are considering Brownian motion on $[0, 1]$, and the indexed family $\{I(\psi) : \psi \in L^2(\lambda)\}$ with $\lambda$ non-atomic, the derivatives of random variables $F$ of the form (14.18) can be identified with the element of $L^2([0, 1] \times \Omega)$ given by
$$D_tF = \sum_{q=1}^\infty qI_{q-1}\bigl(\psi_q(\cdot, t)\bigr), \quad t \in [0, 1]. \tag{14.24}$$
We next introduce the Ornstein–Uhlenbeck generator $L$. For a square integrable random variable $F$ represented as in (14.18), let
$$LF = \sum_{q=0}^{\infty} -q\,I_q(\psi_q) \tag{14.25}$$
and, when $EF = 0$, let
$$L^{-1}F = \sum_{q=1}^{\infty} -\frac{1}{q}\,I_q(\psi_q).$$
In view of (14.19) and (14.22), we see that the operator $L^{-1}$ takes values in $\mathbb{D}^{2,2}$. As the Malliavin derivative $D$ maps random variables to the Hilbert space $L^2(B, L^2(\lambda))$ endowed with the inner product $E\langle u, v\rangle_{L^2(\lambda)}$, by definition, the adjoint operator $\delta$ satisfies the (integration by parts) identity
$$E\big[F\delta(u)\big] = E\langle DF, u\rangle_{L^2(\lambda)} \tag{14.26}$$
for every $F \in \mathbb{D}^{1,2}$, when $u$ lies in the domain $\mathrm{dom}(\delta)$ of $\delta$. One of the key consequences of this identity is that for every $F \in \mathbb{D}^{1,2}$ with $EF = 0$,
$$E\big[F f(F)\big] = E\big[\langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\, f'(F)\big] \tag{14.27}$$
for all real valued differentiable functions $f$ with bounded derivative; identity (14.27) also holds when $f$ is only a.e. differentiable if $F$ has an absolutely continuous law. The Stein identity (14.13) is the special case where $F = I(\psi)$ for $\psi \in L^2(\lambda)$ with $\|\psi\|_{L^2(\lambda)} = 1$. For then $F$ is standard normal, and by (14.21) and (14.25), respectively, we have
$$DF = \psi \quad \text{and} \quad L^{-1}F = L^{-1}I(\psi) = -I(\psi) = -F,$$
so
$$\langle DF, -DL^{-1}F\rangle_{L^2(\lambda)} = \langle \psi, DF\rangle_{L^2(\lambda)} = \langle \psi, \psi\rangle_{L^2(\lambda)} = \|\psi\|^2_{L^2(\lambda)} = 1. \tag{14.28}$$
Hence (14.27) implies (14.13).
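In the special case $F = I(\psi)$ with $\|\psi\|_{L^2(\lambda)} = 1$, the inner product in (14.27) equals one and the identity reduces to $E[Zf(Z)] = E[f'(Z)]$ for a standard normal $Z$. The sketch below checks this numerically for the illustrative choice $f = \sin$, for which both sides equal $e^{-1/2}$; the trapezoid-rule helper and its truncation parameters are assumptions made here for the example and are not part of the text.

```python
from math import cos, exp, pi, sin, sqrt


def gaussian_expectation(h, half_width=10.0, steps=20000):
    """Approximate E h(Z), Z standard normal, by the trapezoid rule
    on [-half_width, half_width]; the Gaussian tails beyond are negligible."""
    dx = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * dx
        weight = 0.5 if k in (0, steps) else 1.0  # endpoint weights
        total += weight * h(x) * exp(-0.5 * x * x)
    return total * dx / sqrt(2.0 * pi)


f = sin        # test function with bounded derivative
f_prime = cos  # its derivative

lhs = gaussian_expectation(lambda x: x * f(x))  # E[Z f(Z)]
rhs = gaussian_expectation(f_prime)             # E[f'(Z)]
print(lhs, rhs, exp(-0.5))
```

Any other differentiable $f$ with bounded derivative could be substituted for `sin`, provided `f_prime` is changed to match.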
Though the theory of Nourdin and Peccati (2009) supplies results in the Wasserstein, Fortet–Mourier, and Kolmogorov distances, recalling (4.1) and (4.3), we confine ourselves to the total variation norm. In this case, by the bounds (2.12), for any random variable $F$,
$$\|\mathcal{L}(F) - \mathcal{L}(Z)\|_{TV} \le \sup_{f \in \mathcal{F}_{TV}} \big|E\big[f'(F) - F f(F)\big]\big|, \tag{14.29}$$
where $\mathcal{F}_{TV}$ is the collection of piecewise continuously differentiable functions that are bounded by $\sqrt{\pi/2}$ and whose derivatives are bounded by 2.

Theorem 14.5 Let $F \in \mathbb{D}^{1,2}$ have mean zero and an absolutely continuous law with respect to Lebesgue measure. Then
$$\|\mathcal{L}(F) - \mathcal{L}(Z)\|_{TV} \le 2E\big|1 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\big|.$$

Proof By (14.27),
$$E\big[f'(F) - F f(F)\big] = E\big[f'(F)\big(1 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\big)\big],$$
and the proof is completed by applying (14.29).
To make use of Theorem 14.5 it is necessary to handle the inner product appearing in the bound. We note that (14.28) gives the simplest case, and also shows the upper bound to be tight in the sense that it is zero when $F =_d Z$. Theorem 14.6 gives a much more substantial illustration of a case where computation with the inner product is possible. We now follow Nourdin et al. (2009), which simplifies the calculations in Nourdin and Peccati (2009), as well as generalizes the results from integrals $I_2(\psi)$ to $I_q(\psi)$ for all $q \ge 2$.

Theorem 14.6 Let $F$ belong to the $q$th Wiener chaos of $B$ for some $q \ge 2$. Then
$$\|\mathcal{L}(F) - \mathcal{L}(Z)\|_{TV} \le 2\big|1 - EF^2\big| + 2\sqrt{\frac{q-1}{3q}\Big(EF^4 - 3\big(EF^2\big)^2\Big)}.$$
The following proof shows that it is always the case that $EF^4 \ge 3(EF^2)^2$.

Proof Writing $F = I_q(\psi)$ with $\psi \in L^2_s(\lambda^q)$, by (14.19) we obtain
$$EF^2 = q!\,\|\psi\|^2_{L^2(\lambda^q)}. \tag{14.30}$$
Now, applying (14.30) for the final equality below, by (14.24) and the multiplication formula (14.17), we have
$$\begin{aligned}
\frac{1}{q}\|DF\|^2_{L^2(\lambda)} &= q\int_0^1 I_{q-1}\big(\psi(\cdot, a)\big)^2\,\lambda(da) \\
&= q\int_0^1 \sum_{r=0}^{q-1} r!\binom{q-1}{r}^2 I_{2q-2-2r}\big(\psi(\cdot, a) \otimes_r \psi(\cdot, a)\big)\,\lambda(da) \\
&= q\sum_{r=0}^{q-1} r!\binom{q-1}{r}^2 I_{2q-2-2r}\big(\psi \otimes_{r+1} \psi\big) \\
&= EF^2 + q\sum_{r=1}^{q-1} (r-1)!\binom{q-1}{r-1}^2 I_{2q-2r}\big(\psi \otimes_r \psi\big).
\end{aligned} \tag{14.31}$$
Subtracting $EF^2$ from both sides and applying (14.16) yields
$$E\Big[\Big(\frac{1}{q}\|DF\|^2_{L^2(\lambda)} - EF^2\Big)^2\Big] = q^2\sum_{r=1}^{q-1} (r-1)!^2\binom{q-1}{r-1}^4 (2q-2r)!\,\big\|\psi \otimes_r \psi\big\|^2_{L^2(\lambda^{2q-2r})}. \tag{14.32}$$
Next, again by (14.17),
$$F^2 = \sum_{r=0}^{q} r!\binom{q}{r}^2 I_{2q-2r}\big(\psi \otimes_r \psi\big). \tag{14.33}$$
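Applying (14.27) and (14.25) for the second equality below, and (14.31), (14.33) and (14.16) for the third, we obtain
$$\begin{aligned}
EF^4 = E\big[F \times F^3\big] &= 3E\Big[F^2 \times \frac{1}{q}\|DF\|^2_{L^2(\lambda)}\Big] \\
&= 3\big(EF^2\big)^2 + \frac{3}{q}\sum_{r=1}^{q-1} r\,r!^2\binom{q}{r}^4 (2q-2r)!\,\big\|\psi \otimes_r \psi\big\|^2_{L^2(\lambda^{2q-2r})}.
\end{aligned} \tag{14.34}$$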
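Comparing (14.32) and (14.34) leads to
$$E\Big[\Big(\frac{1}{q}\|DF\|^2_{L^2(\lambda)} - EF^2\Big)^2\Big] \le \frac{q-1}{3q}\Big(EF^4 - 3\big(EF^2\big)^2\Big).$$
Lastly, by Theorem 14.5 and (14.25), which gives $-DL^{-1}F = \frac{1}{q}DF$ for $F$ in the $q$th chaos, and using the Cauchy–Schwarz inequality for the third inequality below,
$$\begin{aligned}
\|\mathcal{L}(F) - \mathcal{L}(Z)\|_{TV} &\le 2E\big|1 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\big| \\
&\le 2\big|1 - EF^2\big| + 2E\big|EF^2 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\big| \\
&\le 2\big|1 - EF^2\big| + 2\sqrt{E\big(EF^2 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\big)^2} \\
&\le 2\big|1 - EF^2\big| + 2\sqrt{\frac{q-1}{3q}\Big(EF^4 - 3\big(EF^2\big)^2\Big)}.
\end{aligned}$$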
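The bound of Theorem 14.6 depends on $F$ only through $q$, $EF^2$ and $EF^4$, so it is easy to evaluate once those moments are known. The sketch below does this for an illustrative second chaos sequence that does not appear in the text: $F_n = (2n)^{-1/2}\sum_{i=1}^{n}(Z_i^2 - 1)$, where $Z_i = I(\psi_i)$ for orthonormal $\psi_1, \ldots, \psi_n \in L^2(\lambda)$, for which $EF_n^2 = 1$ and $EF_n^4 = 3 + 12/n$. The function names are hypothetical.

```python
from math import sqrt


def fourth_moment_bound(q: int, m2: float, m4: float) -> float:
    """Right-hand side of Theorem 14.6 for F in the q-th Wiener chaos
    with E F^2 = m2 and E F^4 = m4."""
    if q < 2:
        raise ValueError("Theorem 14.6 requires q >= 2")
    # E F^4 - 3 (E F^2)^2 is nonnegative in theory; clamp rounding error.
    excess = max(m4 - 3.0 * m2 ** 2, 0.0)
    return 2.0 * abs(1.0 - m2) + 2.0 * sqrt((q - 1) / (3.0 * q) * excess)


def example_moments(n: int):
    """Second and fourth moments of F_n = (2n)^(-1/2) * sum_{i<=n} (Z_i^2 - 1)."""
    return 1.0, 3.0 + 12.0 / n


for n in (1, 10, 100, 1000):
    m2, m4 = example_moments(n)
    # For this sequence the bound simplifies to 2 * sqrt(2 / n).
    print(n, fourth_moment_bound(2, m2, m4))
```

For this sequence the bound decays at rate $n^{-1/2}$, and it vanishes exactly when $EF^2 = 1$ and $EF^4 = 3$.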
We note that, as one consequence of Theorem 14.6, Stein's method provides a streamlined proof of the Nualart–Peccati criterion, that is, if $F_n = I_q(\psi_n)$ for some $q \ge 2$, such that $E[F_n^2] \to \sigma^2 > 0$ as $n \to \infty$, then the following are equivalent:
(i) $\|\mathcal{L}(F_n) - \mathcal{L}(\sigma Z)\|_{TV} \to 0$.
(ii) $F_n \to_d \sigma Z$.
(iii) $EF_n^4 \to 3\sigma^4$.
Though in this section we have considered the case of Brownian motion on $[0, 1]$, with corresponding Gaussian process $\{I(\psi) : \psi \in L^2(\lambda)\}$, much here carries over with no essential changes when considering Gaussian processes $\{X(\psi) : \psi \in H\}$ indexed by more general Hilbert spaces $H$; see Nourdin and Peccati (2009) and Nourdin and Peccati (2011) for details.
Appendix
Notation
$\mathbf{1}$    indicator function
$=_d$    equality in distribution
$\to_d$    convergence in distribution
$\to_p$    convergence in probability
$\mathbb{Z}$    $\{\ldots, -1, 0, 1, \ldots\}$
$\mathbb{N}$    $\{1, 2, \ldots\}$
$\mathbb{N}_0$    $\{0, 1, \ldots\}$
$\mathbb{R}$    $(-\infty, \infty)$
$\mathbb{R}^+$    $[0, \infty)$
$S_n$    the symmetric group on $n$ symbols
$\mathcal{N}(\mu, \sigma^2)$    normal distribution with mean $\mu$ and variance $\sigma^2$
$\mathcal{U}[a, b]$    uniform distribution over $[a, b]$
$Z$    standard normal variable
$\Phi(z)$    standard normal distribution function
$Nh$    $Eh(Z)$
$K_i(t)$    K function, (prototype) page 19
$X^*$    zero bias, page 26
$X^s$    size bias, page 31
$X^\square$    square bias, page 34
$\|f\|$    supremum norm of $f$
$\|F - G\|_1$    $L^1$ distance, page 64
$\Gamma(\alpha)$    Gamma function
$\Gamma(\alpha, \beta)$    Gamma distribution
$B(\alpha, \beta)$    Beta distribution
$A^T$    transpose of the matrix $A$
$\mathrm{Tr}(A)$    trace of the matrix $A$
$\mathcal{L}(\cdot)$    law, or distribution, of a random variable
$\|h\|_{L^\infty_m(\mathbb{R})}$    page 136
$\mathcal{H}_{m,\infty}$    page 136
$\mathcal{H}_{m,\infty,p}$    page 313
$\|\mathcal{L}(X) - \mathcal{L}(Y)\|_{\mathcal{H}_{m,\infty,p}}$    page 313
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4, © Springer-Verlag Berlin Heidelberg 2011
References
Aldous, D. (1989). Applied mathematical sciences: Vol. 77. Probability approximations via the Poisson clumping heuristic. New York: Springer. Aldous, D., & Fill, J. A. (1994). Reversible Markov chains and random walks on graphs. Monograph in preparation. http://www.stat.berkeley.edu/aldous/RWG/book.html. Arratia, R., Goldstein, L., & Gordon, L. (1989). Two moments suffice for Poisson approximations: the Chen–Stein method. Annals of Probability, 17, 9–25. Baldi, P., & Rinott, Y. (1989). Asymptotic normality of some graph-related statistics. Journal of Applied Probability, 26, 171–175. Baldi, P., Rinott, Y., & Stein, C. (1989). A normal approximations for the number of local maxima of a random function on a graph. In T. W. Anderson, K. B. Athreya, & D. L. Iglehart (Eds.), Probability, statistics and mathematics, papers in honor of Samuel Karlin (pp. 59–81). San Diego: Academic Press. Barbour, A. D. (1990). Stein’s method for diffusion approximations. Probability Theory and Related Fields, 84, 297–322. Barbour, A. D., & Chen, L. H. Y. (2005a). In A. D. Barbour & L. H. Y. Chen (Eds.), The permutation distribution of matrix correlation statistics. Stein’s method and applications. Singapore: Singapore University Press. Barbour, A. D., & Chen, L. H. Y. (2005b). In A. D. Barbour & L. H. Y. Chen (Eds.), An introduction to Stein’s method. Singapore: Singapore University Press. Barbour, A. D., & Chen, L. H. Y. (2005c). In A. D. Barbour & L. H. Y. Chen (Eds.), Stein’s method and applications. Singapore: Singapore University Press. Barbour, A. D., & Eagleson, G. (1986). Random association of symmetric arrays. Stochastic Analysis and Applications, 4, 239–281. Barbour, A. D., & Xia, A. (1999). Poisson perturbations. ESAIM, P&S, 3, 131–150. Barbour, A. D., Karo´nski, M., & Ruci´nski, A. (1989). A central limit theorem for decomposable random variables with applications to random graphs. Journal of Combinatorial Theory. Series B, 47, 125–145. Barbour, A. D., Holst, L., & Janson, S. 
(1992). Poisson approximation. London: Oxford University Press. Bayer, D., & Diaconis, P. (1992). Trailing the dovetail shuffle to its lair. The Annals of Applied Probability, 2, 294–313. Bentkus, V., Götze, F., & Zitikis, R. (1994). Lower estimates of the convergence rate for U statistics. Annals of Probability, 22, 1707–1714. Berry, A. (1941). The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49, 122–136. Bhattacharya, R. N., & Ghosh, J. (1978). On the validity of the formal Edgeworth expansion. Annals of Statistics, 6, 434–451.
Bhattacharya, R. N., & Rao, R. (1986). Normal approximation and asymptotic expansion. Melbourne: Krieger. Bickel, P., & Doksum, K. (1977). Mathematical statistics: basic ideas and selected topics. Oakland: Holden-Day. Biggs, N. (1993). Algebraic graph theory. Cambridge: Cambridge University Press. Bikjalis, A. (1966). Estimates of the remainder term in the central limit theorem. Lietuvos Matematikos Rinkinys, 6, 323–346 (in Russian). Billingsley, P. (1968). Convergence of probability measures. New York: Wiley. Bollobás, B. (1985). Random graphs. San Diego: Academic Press. Bolthausen, E. (1984). An estimate of the reminder in a combinatorial central limit theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 66, 379–386. Borovskich, Yu. V. (1983). Asymptotics of U -statistics and von Mises’ functionals. Soviet Mathematics. Doklady 27, 303–308. Breiman, L. (1986). Probability. Reading: Addison–Wesley. Brouwer, A. E., Cohen, A. M., & Neumaier, A. (1989). Distance-regular graphs. Berlin: Springer. Bulinski, A., & Suquet, C. (2002). Normal approximation for quasi-associated random fields. Statistics & Probability Letters, 54, 215–226. Cacoullos, T., & Papathanasiou, V. (1992). Lower variance bounds and a new proof of the central limit theorem. Journal of Multivariate Analysis, 43, 173–184. Chatterjee, S. (2007). Stein’s method for concentration inequalities. Probability Theory and Related Fields, 138, 305–321. Chatterjee, S. (2008). A new method of normal approximation. Annals of Probability, 4, 1584– 1610. Chatterjee, S., & Meckes, E. (2008). Multivariate normal approximation using exchangeable pairs. ALEA. Latin American Journal of Probability and Mathematical Statistics, 4, 257–283. Chatterjee, S., & Shao, Q. M. (2010, to appear). Non-normal approximation by Stein’s method of exchangeable pairs with application to the Curie–Weiss model. The Annals of Applied Probability. Chatterjee, S., Fulman, J., & Röllin, A. (2008, to appear). 
Exponential approximation by Stein’s method and spectral graph theory. Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Annals of Probability, 3, 534– 545. Chen, L. H. Y. (1998). Stein’s method: some perspectives with applications. In L. Accardi & C. C. Heyde (Eds.), Lecture Notes in Statistics: Vol. 128. Probability towards 2000. Berlin: Springer. Chen, L. H. Y., & Leong, Y. K. (2010). From zero-bias to discretized normal approximation (Preprint). Chen, L. H. Y., & Röllin, A. (2010). Stein couplings for normal approximation. Chen, L. H. Y., & Shao, Q. M. (2001). A non-uniform Berry–Esseen bound via Stein’s method. Probability Theory and Related Fields, 120, 236–254. Chen, L. H. Y., & Shao, Q. M. (2004). Normal approximation under local dependence. Annals of Probability, 32, 1985–2028. Chen, L. H. Y., & Shao, Q. M. (2005). Stein’s method for normal approximation. In Lecture Notes Series, Institute for Mathematical Sciences, National University of Singapore: Vol. 4. An introduction to Stein’s method (p. 159). Singapore: Singapore University Press. Chen, L. H. Y., & Shao, Q. M. (2007). Normal approximation for nonlinear statistics using a concentration inequality approach. Bernoulli, 13, 581–599. Chen, L. H. Y., Fang, X., & Shao, Q. M. (2009). From Stein identities to moderate deviations. Conway, J. H., & Sloane, N. J. A. (1999). Sphere packings, lattices and groups (3rd ed.). New York: Springer. Cox, D. R. (1970). The continuity correction. Biometrika, 57, 217–219. Darling, R. W. R., & Waterman, M. S. (1985). Matching rectangles in d dimensions: algorithms and laws of large numbers. Advances in Applied Mathematics, 55, 1–12. Darling, R. W. R., & Waterman, M. S. (1986). Extreme value distribution for the largest cube in a random lattice. SIAM Journal on Applied Mathematics, 46, 118–132.
DeGroot, M. (1986). A conversation with Charles Stein. Statistical Science, 1, 454–462. Dembo, A., & Karlin, S. (1992). Poisson approximations for r-scan processes. The Annals of Applied Probability, 2, 329–357. Dembo, A., & Rinott, Y. (1996). Some examples of normal approximations by Stein’s method. In The IMA Volumes in Mathematics and Its Applications: Vol. 76. Random discrete structures (pp. 25–44) New York: Springer. Dharmadhikari, S., & Joag-Dev, K. (1998). Unimodality, convexity and applications. San Diego: Academic Press. Diaconis, P. (1977). The distribution of leading digits uniform distribution mod 1. Annals of Probability, 5, 72–81. Diaconis, P., & Freedman, D. (1987). A dozen de Finetti-style results in search of a theory. Annales de L’I.H.P. Probabilités Et Statistiques, 23, 397–423. Diaconis, P., & Holmes, S. (2004). Institute of mathematical statistics lecture notes, monograph series: Vol. 46. Stein’s method: expository lectures and applications. Beachwood: Institute of Mathematical Statistics. Diaconis, P., & Shahshahani, M. (1987). Time to reach stationarity in the Bernoulli–Laplace diffusion model. SIAM Journal on Mathematical Analysis, 18, 208–218. Diaconis, P., & Shahshahani, M. (1994). On the eigenvalues of random matrices. Journal of Applied Probability, 31A, 49–62. Donnelly, P., & Welsh, D. (1984). The antivoter problem: random 2-colourings of graphs. In B. Bollobás (Ed.), Graph theory and combinatorics (pp. 133–144). San Diego: Academic Press. Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9, 586–596. Ellis, R. (1985). Grundlehren der Mathematischen Wissenschaften. Entropy, large deviations, and statistical mechanics. New York: Springer. Ellis, R., & Newman, C. (1978a). The statistics of Curie–Weiss models. Journal of Statistical Physics, 19, 149–161. Ellis, R., & Newman, C. (1978b). Limit theorems for sums of dependent random variables occurring in statistical mechanics. 
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 44, 117–139. Ellis, R., Newman, C., & Rosen, J. (1980). Limit theorems for sums of dependent random variables occurring in statistical mechanics. II. Conditioning, multiple phases, and metastability. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 51, 153–169. Erdös, P., & Rényi, A. (1959a). On the central limit theorem for samples from a finite population. A Magyar Tudoma’nyos Akadémia Matematikai Kutató Intézetének Közleményei, 4, 49–61. Erdös, P., & Rényi, A. (1959b). On random graphs. Publicationes Mathematicae Debrecen, 6, 290–297. Erickson, R. (1974). L1 bounds for asymptotic normality of m-dependent sums using Stein’s technique. Annals of Probability, 2, 522–529. Esseen, C. (1942). On the Liapounoff limit of error in the theory of probability. Arkiv För Matematik, Astronomi Och Fysik, 28A, 19 pp. Esseen, C. (1945). Fourier analysis of distribution functions. A mathematical study of the Laplace– Gaussian law. Acta Mathematica, 77, 1–125. Ethier, S., & Kurtz, T. (1986). Markov processes: characterization and convergence. New York: Wiley. Feller, W. (1935). Über den Zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 40, 512–559. Feller, W. (1968b). An introduction to probability theory and its applications (Vol. 2). New York: Wiley. Feller, W. (1968a). An introduction to probability theory and its applications (Vol. 1). New York: Wiley. Ferguson, T. (1996). A course in large sample theory. New York: Chapman & Hall. Finkelstein, M., Kruglov, V., & Tucker, H. (1994). Convergence in law of random sums with nonrandom centering. Journal of Theoretical Probability, 7, 565–598.
Friedrich, K. (1989). A Berry–Esseen bound for functions of independent random variables. Annals of Statistics 17, 170–183. Fulman, J. (2006). An inductive proof of the Berry–Esseen theorem for character ratios. Annals of Combinatorics, 10, 319–332. Fulman, J. (2009). Communications in Mathematical Physics, 288, 1181–1201. Fulton, W., & Harris, J. (1991). Graduate texts in mathematics. Representation theory. New York: Springer. Geary, R. (1954). The continuity ratio and statistical mapping. Incorporated Statistician, 5, 115– 145. Ghosh, S. (2009). Lp bounds for a combinatorial central limit theorem with involutions (Preprint). Glaz, J., Naus, J., & Wallenstein, S. (2001). Springer series in statistics. Scan statistics. New York: Springer. Goldstein, L. (2004). Normal approximation for hierarchical sequences. The Annals of Applied Probability, 14, 1950–1969. Goldstein, L. (2005). Berry Esseen bounds for combinatorial central limit theorems and pattern occurrences, using zero and size biasing. Journal of Applied Probability, 42, 661–683. Goldstein, L. (2007). L1 bounds in normal approximation. Annals of Probability, 35, 1888–1930. Goldstein, L. (2010a). Bounds on the constant in the mean central limit theorem. Annals of Probability. 38, 1672–1689. Goldstein, L. (2010b). A Berry–Esseen bound with applications to counts in the Erdös–Rényi random graph (Preprint). Goldstein, L., & Penrose, M. (2010). Normal approximation for coverage models over binomial point processes. The Annals of Applied Probability, 20, 696–721. Goldstein, L., & Reinert, G. (1997). Stein’s method and the zero bias transformation with application to simple random sampling. The Annals of Applied Probability, 7, 935–952. Goldstein, L., & Reinert, G. (2005). Distributional transformations, orthogonal polynomials, and Stein characterizations. Journal of Theoretical Probability, 18, 237–260. Goldstein, L., & Rinott, Y. (1996). Multivariate normal approximations by Stein’s method and size bias couplings. 
Journal of Applied Probability, 33, 1–17. Goldstein, L., & Rinott, Y. (2003). A permutation test for matching and its asymptotic distribution. Metron, 61, 375–388. Goldstein, L., & Shao, Q. M. (2009). Berry–Esseen bounds for projections of coordinate symmetric random vectors. Electronic Communications in Probability, 14, 474–485. Goldstein, L., & Zhang, H. (2010). A Berry–Esseen theorem for the lightbulb problem. Submitted for publication. Götze, F. (1991). On the rate of convergence in the multivariate CLT. The Annals of Applied Probability, 19, 724–739. Griffeath, D. (1974/1975). A maximal coupling for Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 31, 95–106. Griffiths, R., & Kaufman, M. (1982). Spin systems on hierarchical lattices. Introduction and thermodynamic limit. Physical Review. B, Solid State, 26, 5022–5032. Grimmett, G., & Stirzaker, D. (2001). Probability and random processes. London: Oxford University Press. Haagerup, U. (1982). The best constants in the Khintchine inequality. Studia Mathematica, 70, 231–283. Hájek, J. (1960). Limiting distributions in simple random sampling from a finite population. A Magyar Tudoma’nyos Akadémia Matematikai Kutató Intézetének Közleményei, 5, 361–374. Hall, P. (1988). Introduction to the theory of coverage processes. New York: Wiley. Hall, B. (2003). Lie groups, Lie algebras, and representations: an elementary introduction. Berlin: Springer. Hall, P., & Barbour, A. D. (1984). Reversing the Berry–Esseen inequality. Proceedings of the American Mathematical Society 90(1), 107–110. Helgason, S. (2000). Groups and geometric analysis. Providence: American Mathematical Society.
Helmers, R. (1977). The order of the normal approximation for linear combinations of order statistics with smooth weight functions. Annals of Probability, 5, 940–953. Helmers, R., & van Zwet, W. (1982). The Berry–Esseen bound for U -statistics. Statistical decision theory and related topics, III, West Lafayette, Ind. (Vol. 1, pp. 497–512). New York: Academic Press. Helmers, R., Janssen, P., & Serfling, R. (1990). Berry–Esséen and bootstrap results for generalized L-statistics. Scandinavian Journal of Statistics, 17, 65–77. Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. Annals of Statistics, 16, 772–783. Ho, S. T., & Chen, L. H. Y. (1978). An Lp bound for the remainder in a combinatorial central limit theorem. Annals of Probability, 6, 231–249. Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19, 293–325. Hoeffding, W. (1951). A combinatorial central limit theorem. Annals of Mathematical Statistics, 22, 558–566. Horn, R., & Johnson, C. (1985). Matrix analysis. Cambridge: Cambridge University Press. Huang, H. (2002). Error bounds on multivariate normal approximations for word count statistics. Advances in Applied Probability, 34, 559–586. Hubert, L. (1987). Assignment methods in combinatorial data analysis. New York: Dekker. Janson, S., & Nowicki, K. (1991). The asymptotic distributions of generalized U -statistics with applications to random graphs. Probability Theory and Related Fields, 90, 341–375. Jing, B., & Zhou, W. (2005). A note on Edgeworth expansions for U -statistics under minimal conditions. Lietuvos Matematikos Rinkinys, 45, 435–440; translation in Lithuanian Math. J., 45, 353–358. Johansson, K. (1997). On random matrices from the compact classical groups. Annals of Mathematics, 145, 519–545. Jordan, J. H. (2002). Almost sure convergence for iterated functions of independent random variables. 
Journal of Applied Probability, 12, 985–1000. Karlub, S., & Brede, V. (1992). Chance and statistical significance in protein and DNA sequence analysis. Science, 257, 39–49. Karo´nski, M., & Ruci´nski, A. (1987). Poisson convergence of semi-induced properties of random graphs. Mathematical Proceedings of the Cambridge Philosophical Society, 101, 291–300. Klartag, B. (2007). A central limit theorem for convex sets. Inventiones Mathematicae, 168, 91– 131. Klartag, B. (2009). A Berry–Esseen type inequality for convex bodies with an unconditional basis. Probability Theory and Related Fields, 145, 1–33. Knox, G. (1964). Epidemiology of childhood leukemia in Northumberland and Durham. British Journal of Preventive & Social Medicine, 18, 17–24. Kolchin, V. F., & Chistyakov, V. P. (1973). On a combinatorial limit theorem. Theory of Probability and Its Applications, 18, 728–739. Kordecki, W. (1990). Normal approximation and isolated vertices in random graphs. In M. Karo´nski, J. Jaworski, & A. Ruci´nski (Eds.), Random graphs 1987. New York: Wiley. Koroljuk, V., & Borovskich, Y. (1994). Mathematics and its applications: Vol. 273. Theory of U statistics (translated from the 1989 Russian original by P. V. Malyshev & D. V. Malyshev and revised by the authors). Dordrecht: Kluwer Academic. LeCam, L. (1986). The central limit theorem around 1935. Statistical Science, 1, 78–96. Leech, J., & Sloane, N. (1971). Sphere packings and error-correcting codes. Canadian Journal of Mathematics, 23, 718–745. Lévy, P. (1935). Propriétes asymptotiques des sommes de variables indépendantes on enchainees. Journal de Mathématiques Pures Et Appliquées, 14, 347–402. Li, D., & Rogers, T. D. (1999). Asymptotic behavior for iterated functions of random variables. The Annals of Applied Probability, 9, 1175–1201. Liggett, T. (1985). Interacting particle systems. New York: Springer. Luk, M. (1994). Stein’s method for the Gamma distribution and related statistical applications. Ph.D. 
dissertation, University of Southern California, Los Angeles, USA.
Madow, W. G. (1948). On the limiting distributions of estimates based on samples from finite universes. Annals of Mathematical Statistics, 19, 535–545. Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research, 27, 209–220. Matloff, N. (1977). Ergodicity conditions for a dissonant voting model. Annals of Probability, 5, 371–386. Mattner, L., & Roos, B. (2007). A shorter proof of Kanter’s Bessel function concentration bound. Probability Theory and Related Fields, 139, 191–205. Meckes, M., & Meckes, E. (2007). The central limit problem for random vectors with symmetries. Journal of Theoretical Probability, 20, 697–720. Midzuno, H. (1951). On the sampling system with probability proportionate to sum of sizes. Annals of the Institute of Statistical Mathematics, 2, 99–108. Moran, P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society. Series B. Methodological, 10, 243–251. Motoo, M. (1957). On the Hoeffding’s combinatorial central limit theorem. Annals of the Institute of Statistical Mathematics, 8, 145–154. Nagaev, S. (1965). Some limit theorems for large deviations. Theory of Probability and Its Applications, 10, 214–235. Naus, J. I. (1982). Approximations for distributions of scan statistics. Journal of the American Statistical Association, 77, 177–183. Nourdin, I., & Peccati, G. (2009). Stein’s method on Wiener chaos. Probability Theory and Related Fields, 145, 75–118. Nourdin, I., & Peccati, G. (2011). Normal approximations with Malliavin calculus. From Stein’s method to universality. Nourdin, I., Peccati, G., & Reinert, G. (2009, to appear). Invariance principles for homogeneous sums: Universality of Gaussian Wiener chaos. Annals of Probability. Nualart, D. (2006). The Malliavin calculus and related topics (2nd ed.). Berlin: Springer. Palka, Z. (1984). On the number of vertices of a given degree in a random graph. Journal of Graph Theory, 8, 167–170. Papangelou, F. (1989).
On the Gaussian fluctuations of the critical Curie–Weiss model in statistical mechanics. Probability Theory and Related Fields, 83, 265–278. Peccati, G. (2009). Stein’s method, Malliavin calculus and infinite-dimensional Gaussian analysis. Lecture notes from Progress in Stein’s method. In http://www.ims.nus.edu.sg/Programs/ stein09/files/Lecture_notes_5.pdf. Peköz, E., & Röllin, A. (2009). New rates for exponential approximation and the theorems of Rényi and Yaglom (Preprint). Penrose, M. (2003). Random geometric graphs. Oxford: Oxford University Press. Petrov, V. (1995). Oxford studies in probability: Vol. 4. Limit theorems of probability theory: sequences of independent random variables. London: Oxford University Press. Pickett, A. (2004). Rates of convergence of χ 2 approximations via Stein’s method. Ph.D. dissertation, Oxford. Proctor, R. (1990). A Schensted algorithm which models tensor representations of the orthogonal group. Canadian Journal of Mathematics, 42, 28–49. Rachev, S. (1984). The Monge–Kantorovich transference problem and its stochastic applications. Theory of Probability and Its Applications, 29, 647–676. Raiˇc, M. (2004). A multivariate CLT for decomposable random vectors with finite second moment. Journal of Theoretical Probability, 17, 573–603. Raiˇc, M. (2007). CLT related large deviation bounds based on Stein’s method. Advances in Applied Probability, 39, 731–752. Rao, R., Rao, M., & Zhang, H. (2007). One bulb? Two bulbs? How many bulbs light up? A discrete probability problem involving dermal patches. Sankhy¯a. The Indian Journal of Statistics, 69, 137–161. Reinert, G., & Röllin, A. (2009). Multivariate normal approximation with Stein’s method of exchangeable pairs under a general linearity condition. Annals of Probability, 37, 2150–2173.
Reinert, G., & Röllin, A. (2010). Random subgraph counts and U -statistics: multivariate normal approximation via exchangeable pairs and embedding. Journal of Applied Probability, 47(2), 378–393. Rinott, Y. (1994). On normal approximation rates for certain sums of dependent random variables. Journal of Computational and Applied Mathematics, 55, 135–143. Rinott, Y., & Rotar, V. (1996). A multivariate CLT for local dependence with n−1/2 log n rate and applications to multivariate graph related statistics. Journal of Multivariate Analysis, 56, 333– 350. Rinott, Y., & Rotar, V. (1997). On coupling constructions and rates in the CLT for dependent summands with applications to the antivoter model and weighted U -statistics. The Annals of Applied Probability, 7, 1080–1105. Robbins, H. (1948). The asymptotic distribution of the sum of a random number of random variables. Bulletin of the American Mathematical Society, 54, 1151–1161. Ross, S., & Peköz, E. (2007). A second course in probability. United States: Pekozbooks. Sagan, B. (1991). The symmetric group. Belmont: Wadsworth. Sazonov, V. (1968). On the multi-dimensional central limit theorem. Sankhy¯a Series A, 30, 181– 204. Schechtman, G., & Zinn, J. (1990). On the volume of the intersection of two np balls. Proceedings of the American Mathematical Society, 110, 217–224. Schiffman, A., Cohen, S., Nowik, R., & Sellinger, D. (1978). Initial diagnostic hypotheses: factors which may distort physicians judgement. Organizational Behavior and Human Performance, 21, 305–315. Schilling, M. (1986). Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association, 81, 799–806. Schlösser, T., & Spohn, H. (1992). Sample to sample fluctuations in the conductivity of a disordered medium. Journal of Statistical Physics, 69, 955–967. Seber, G., & Lee, A. (2003). Wiley series in probability and statistics. Linear regression analysis (2nd ed.). New York: Wiley. Serfling, R. (1980). 
Approximation theorems of mathematical statistics. New York: Wiley. Serre, J.-P. (1997). Graduate texts in mathematics. Linear representations of finite groups. New York: Springer. Shao, Q. M., & Su, Z. (2005). The Berry–Esseen bound for character ratios. Proceedings of the American Mathematical Society, 134, 2153–2159. Shneiberg, I. (1986). Hierarchical sequences of random variables. Theory of Probability and Its Applications, 31, 137–141. Steele, J. M. (1986). An Efron–Stein inequality for nonsymmetric statistics. Annals of Statistics, 14, 753–758. Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955 (Vol. I, pp. 197–206). Berkeley: University of California Press. Stein, E. M. (1970). Singular integrals and differentiability properties of functions. Princeton: Princeton University Press. Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 2, pp. 586–602). Berkeley: University of California Press. Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9, 1135–1151. Stein, C. (1986). Approximate computation of expectations. Hayward: IMS. Stein, C. (1995). The accuracy of the normal approximation to the distribution of the traces of powers of random orthogonal matrices (Technical Report no. 470). Stanford University, Statistics Department. Stroock, D. (2000). Probability theory, an analytic view. Cambridge: Cambridge University Press. Sundaram, S. (1990). Tableaux in the representation theory of compact Lie groups. In IMA volumes in mathematics: Vol. 19. Invariant theory and tableaux (pp. 191–225). New York: Springer.
Tyurin, I. (2010, to appear). New estimates on the convergence rate in the Lyapunov theorem. von Bahr, B. (1976). Remainder term estimate in a combinatorial limit theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 35, 131–139. Wald, A., & Wolfowitz, J. (1944). Statistical tests based on permutations of the observations. Annals of Mathematical Statistics, 15, 358–372. Wehr, J. (1997). A strong law of large numbers for iterated functions of independent random variables. Journal of Statistical Physics, 86, 1373–1384. Wehr, J. (2001). Erratum on: A strong law of large numbers for iterated functions of independent random variables [Journal of Statistical Physics, 86(1997), 1373–1384]. Journal of Statistical Physics, 104, 901. Wehr, J., & Woo, J. M. (2001). Central limit theorems for nonlinear hierarchical sequences of random variables. Journal of Statistical Physics, 104, 777–797. Weyl, H. (1997). The classical groups. Princeton: Princeton University Press. Zhao, L., Bai, Z., Chao, C.-C., & Liang, W.-Q. (1997). Error bound in the central limit theorem of double-indexed permtation statistics. Annals of Statistics, 25, 2210–2227. Zhou, H., & Lange, K. (2009). Composition Markov chains of multinomial type. Advances in Applied Probability, 41, 270–291. Zong, C. (1999). Sphere packings. New York: Springer.
Author Index
A Accardi, L., see Chen, L. H. Y., 54 Aldous, D., 213, 364 Anderson, T. W., see Baldi, P., 135, 210 Arratia, R., 3 Athreya, K. B., see Baldi, P., 135, 210 B Bai, Z., see Zhao, L., 202 Baldi, P., 135, 210, 254 Barbour, A. D., vii, 3, 4, 18, 64, 201, 231, 315 Barbour, A. D., see Barbour, A. D., vii, 4, 201 Barbour, A. D., see Hall, P., 59 Bayer, D., 209 Bentkus, V., 261 Berry, A., 2, 45 Bhattacharya, R. N., 161, 163, 273, 332, 337 Bickel, P., 95 Biggs, N., 206 Bikjalis, A., 233 Billingsley, P., 366 Bollobás, B., 315 Bollobás, B., see Donnelly, P., 213 Bolthausen, E., 41, 167, 169, 315 Borovskich, Y., see Koroljuk, V., 260, 263, 284, 287 Borovskich, Yu. V., 263 Brede, V., see Karlub, S., 254 Breiman, L., 6 Brouwer, A. E., 206 Bulinski, A., 246 C Cacoullos, T., 117 Chao, C.-C., see Zhao, L., 202 Chatterjee, S., 37, 117, 122, 293, 326, 343, 359, 362
Chen, L. H. Y., 3, 16, 54, 55, 125, 126, 135, 221, 231, 233, 243, 245, 247, 253, 257, 293, 305 Chen, L. H. Y., see Barbour, A. D., vii, 4, 201 Chen, L. H. Y., see Ho, S. T., 167 Chistyakov, V. P., see Kolchin, V. F., 183 Cohen, A. M., see Brouwer, A. E., 206 Cohen, S., see Schiffman, A., 183 Conway, J. H., 123 Cox, D. R., 221 D Darling, R. W. R., 210 DeGroot, M., 4 Dembo, A., 135, 254, 255 Dharmadhikari, S., 231 Diaconis, P., vii, 88, 217, 359, 371 Diaconis, P., see Bayer, D., 209 Doksum, K., see Bickel, P., 95 Donnelly, P., 213 E Eagleson, G., see Barbour, A. D., 201 Efron, B., 130 Ellis, R., 298, 353 Erdös, P., 112, 315 Erickson, R., 47, 63 Esseen, C., 2, 45, 233 Ethier, S., 18 F Fang, X., see Chen, L. H. Y., 293 Feller, W., 2, 32, 59, 221 Ferguson, T., 273 Fill, J. A., see Aldous, D., 213, 364 Finkelstein, M., 271 Freedman, D., see Diaconis, P., 88
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4, © Springer-Verlag Berlin Heidelberg 2011
Friedrich, K., 261 Fulman, J., 169, 371, 373, 374, 378 Fulman, J., see Chatterjee, S., 343, 359, 362 Fulton, W., 372 G Geary, R., 201 Ghosh, J., see Bhattacharya, R. N., 273 Ghosh, S., 169, 183 Glaz, J., 210, 254 Goldstein, L., 26, 34, 64, 67, 68, 88, 124, 125, 157, 160, 167, 169, 183, 211–213, 315, 334 Goldstein, L., see Arratia, R., 3 Gordon, L., see Arratia, R., 3 Götze, F., 18, 161, 163, 337 Götze, F., see Bentkus, V., 261 Griffeath, D., 366 Griffiths, R., 73 Grimmett, G., 364 H Haagerup, U., 121 Hájek, J., 112 Hall, B., 373 Hall, P., 59, 122 Harris, J., see Fulton, W., 372 Helgason, S., 374, 375 Helmers, R., 263, 266 Henze, N., 335, 336 Heyde, C. C., see Chen, L. H. Y., 54 Ho, S. T., 167 Hoeffding, W., 100, 260 Holmes, S., see Diaconis, P., vii Holst, L., see Barbour, A. D., 3, 18, 64 Horn, R., 318 Huang, H., 210 Hubert, L., 201 I Iglehart, D. L., see Baldi, P., 135, 210 J Janson, S., 315 Janson, S., see Barbour, A. D., 3, 18, 64 Janssen, P., see Helmers, R., 266 Jaworski, J., see Kordecki, W., 315 Jing, B., 261 Joag-Dev, K., see Dharmadhikari, S., 231 Johansson, K., 371 Johnson, C., see Horn, R., 318 Jordan, J. H., 74
K Karlin, S., see Dembo, A., 254 Karlub, S., 254 Karoński, M., 315 Karoński, M., see Barbour, A. D., 315 Karoński, M., see Kordecki, W., 315 Kaufman, M., see Griffiths, R., 73 Klartag, B., 88, 138 Knox, G., 201 Kolchin, V. F., 183 Kordecki, W., 315 Koroljuk, V., 260, 263, 284, 287 Kruglov, V., see Finkelstein, M., 271 Kurtz, T., see Ethier, S., 18 L Lange, K., see Zhou, H., 212, 213 LeCam, L., 2, 59 Lee, A., see Seber, G., 119 Leech, J., 123 Leong, Y. K., see Chen, L. H. Y., 221 Lévy, P., 59 Li, D., 72, 74 Liang, W.-Q., see Zhao, L., 202 Liggett, T., 213 Luk, M., 18, 345 M Madow, W. G., 112 Mantel, N., 201 Matloff, N., 213 Mattner, L., 231 Meckes, E., see Chatterjee, S., 326 Meckes, E., see Meckes, M., 88 Meckes, M., 88 Midzuno, H., 32 Motoo, M., 100 N Nagaev, S., 233 Naus, J. I., 210, 254 Naus, J., see Glaz, J., 210, 254 Neumaier, A., see Brouwer, A. E., 206 Newman, C., see Ellis, R., 298, 353 Nourdin, I., 345, 371, 381, 382, 386, 388 Nowicki, K., see Janson, S., 315 Nowik, R., see Schiffman, A., 183 Nualart, D., 382 P Palka, Z., 315 Papangelou, F., 353 Papathanasiou, V., see Cacoullos, T., 117 Peccati, G., 382
Peccati, G., see Nourdin, I., 345, 371, 381, 382, 386, 388 Peköz, E., 343, 363, 364, 367 Peköz, E., see Ross, S., vii Penrose, M., 122 Penrose, M., see Goldstein, L., 124, 125, 157 Petrov, V., 293 Pickett, A., 345 Proctor, R., 379 R Rachev, S., 64, 65 Raič, M., 37, 293 Rao, M., see Rao, R., 211, 212 Rao, R., 211, 212 Rao, R., see Bhattacharya, R. N., 161, 163, 332, 337 Reinert, G., 325, 326, 328, 329 Reinert, G., see Goldstein, L., 26, 34 Reinert, G., see Nourdin, I., 386 Rényi, A., see Erdös, P., 112, 315 Rinott, Y., 135, 149, 161–163, 213, 216, 254, 260, 325, 331–333, 335, 336 Rinott, Y., see Baldi, P., 135, 210, 254 Rinott, Y., see Dembo, A., 135, 254, 255 Rinott, Y., see Goldstein, L., 183, 315, 334 Robbins, H., 270 Rogers, T. D., see Li, D., 72, 74 Röllin, A., see Chatterjee, S., 343, 359, 362 Röllin, A., see Chen, L. H. Y., 125, 126 Röllin, A., see Peköz, E., 343, 363, 364, 367 Röllin, A., see Reinert, G., 325, 326, 328, 329 Roos, B., see Mattner, L., 231 Rosen, J., see Ellis, R., 353 Ross, S., vii Rotar, V., see Rinott, Y., 149, 161–163, 213, 216, 260, 325, 331–333, 335, 336 Ruciński, A., see Barbour, A. D., 315 Ruciński, A., see Karoński, M., 315 Ruciński, A., see Kordecki, W., 315 S Sagan, B., 184, 372 Sazonov, V., 332 Schechtman, G., 98 Schiffman, A., 183 Schilling, M., 335, 336 Schlösser, T., 73 Seber, G., 119 Sellinger, D., see Schiffman, A., 183 Serfling, R., 266, 267 Serfling, R., see Helmers, R., 266 Serre, J.-P., 372
Shahshahani, M., see Diaconis, P., 359, 371 Shao, Q. M., 150 Shao, Q. M., see Chatterjee, S., 343 Shao, Q. M., see Chen, L. H. Y., 16, 55, 135, 231, 233, 243, 245, 247, 253, 257, 293, 305 Shao, Q. M., see Goldstein, L., 88 Shneiberg, I., 73, 87 Sloane, N. J. A., see Conway, J. H., 123 Sloane, N., see Leech, J., 123 Spohn, H., see Schlösser, T., 73 Steele, J. M., 130 Stein, C., vii, 3, 5, 13, 21, 37, 154, 371 Stein, C., see Baldi, P., 135, 210 Stein, C., see Efron, B., 130 Stein, E. M., 136 Stirzaker, D., see Grimmett, G., 364 Stroock, D., vii, 37, 57 Su, Z., see Shao, Q. M., 150 Sundaram, S., 380, 381 Suquet, C., see Bulinski, A., 246 T Tucker, H., see Finkelstein, M., 271 Tyurin, I., 2, 45, 53, 67 V van Zwet, W., see Helmers, R., 263 von Bahr, B., 167 W Wald, A., 100, 112 Wallenstein, S., see Glaz, J., 210, 254 Waterman, M. S., see Darling, R. W. R., 210 Wehr, J., 74, 75, 77, 78, 81, 82, 84, 86 Welsh, D., see Donnelly, P., 213 Weyl, H., 372 Wolfowitz, J., see Wald, A., 100, 112 Woo, J. M., see Wehr, J., 75, 77, 78, 81, 82, 84, 86 X Xia, A., see Barbour, A. D., 231 Z Zhang, H., see Goldstein, L., 160, 211–213 Zhang, H., see Rao, R., 211, 212 Zhao, L., 202 Zhou, H., 212, 213 Zhou, W., see Jing, B., 261 Zinn, J., see Schechtman, G., 98 Zitikis, R., see Bentkus, V., 261 Zong, C., 123
Subject Index
A adjoint, 385 anti-voter model, 213, 295 antisymmetric function, 21 averaging function, 74 B Bennett–Hoeffding inequality, 234, 237 Bernoulli distribution, 67 Bernoulli–Laplace model, 358 Berry–Esseen constant, 45 Berry–Esseen inequality, 45, 53 Beta distribution, 94, 95 binary expansion of a random integer, 217, 297 bounds on the Stein equation, 16 Brownian motion, 382 C Cauchy’s formula, 184 chi squared distribution, 345 class functions, 373, 374 combinatorial central limit theorem, 24, 100, 167–169, 183, 295, 371 complete graph, 296 composition Markov chains of multinomial type, 212 concentration inequality, 53, 57, 150, 159, 249 concentration inequality, non-uniform, 233 concentration inequality, randomized, 277 conditional variance formula, 111 conductivity of random media, 72 cone measure, 88 conjugacy class, 373 continuity correction, 221 contraction, 383 contraction principle, 69 convergence determining class, 136
coordinate symmetric, 88 coverage processes, 122 covered volume, 122 Cramér series, 293 Curie–Weiss model, 297, 353 cycle type, 183 D delta method, 273 dependency graph, 135, 254 dependency neighborhoods, random, 334 diamond lattice, 73 discretized normal distribution, 221 distance r-regular graph, 206 distribution constant on cycle type, 183, 187, 196 E Efron–Stein inequality, 130 embedding method, 325 equilibrium distribution, 363 Erdös–Rényi random graph, 315 exchangeable pair, 21, 102, 111, 113, 149, 151, 153, 155, 217, 347, 358 exchangeable pair, multivariate, 325, 329 exponential distribution, 345 F fast rates of convergence, 136 first passage times, 364 G Gamma distribution, 94, 95, 345 Gamma function, 94 generator method, 17, 18, 25 Gini’s mean difference, 265
graph patterns, 209 graphical rule, 121, 130 group characters, 371, 372 H Haar measure, 373 hierarchical variables, 72 Hoeffding decomposition, 260 homogeneous, 73 I inductive method, 57, 169 integer valued variables, 227 integration by parts, 385 interaction rule, 122 involution, 183 irreducible representation, 372 isolated points, 123 J Johnson graph, 360 K Kantorovich metric, 64 K function, 19 Khintchine’s inequality, 121 kissing number, 123, 124, 335 Kolmogorov distance, 45, 63, 331 L L-statistics, 265 L1 distance, 63, 64 L1 distance, dual form, 64 L1 distance, sup form, 64 L∞ distance, 45, 63 Lie group, 373 lightbulb process, 210 Lindeberg condition, 48, 59 Lipschitz function, 46 local dependence, 133, 202, 205, 207, 245 local dependence, multivariate, 331 local dependence conditions, 245 local extremes, 210 local maxima, 210 M m-dependence, 133, 208 m-scans process, 254 Malliavin calculus, 381 Malliavin derivative, 384, 385 moderate deviations, 293 multi-sample U-statistics, 262 multiplication formula, 383 multivariate approximation, 313
multivariate size bias coupling, 314 N nearest neighbor graph, 334 non-linear statistics, 258 non-uniform Berry–Esseen bounds, 237 non-uniform bounds, 233 O Ornstein–Uhlenbeck generator, 385 Ornstein–Uhlenbeck process, 25 orthogonal group, 373 P patterns in graphs, 209 permutation statistics, 201 permutation test, 100, 183, 201 Poisson binomial distribution, 222 Polish space, 64 positively homogeneous, 73 Q quadratic forms, 119 R random field, 133, 247 random graph, coloring, 334 random graph coloring, 333 random graphs, triangles, 326 random sums, 270 renewal–reward process, 364 representation, 372 resistor network, 77 rising sequence, 209 S scaled averaging function, 76 scaling conditional, 92 scan statistics, 208 simple random sampling, 25, 111, 112 size bias, 31, 32, 156, 157, 160, 162, 318 size biased coupling, multivariate, 314 smooth function bound, 314 smooth functions, 136 smoothed indicator, 153, 155, 158, 171, 251 smoothed indicator, bounds, 17 smoothing inequalities, 333 smoothing inequality, 161, 337 special orthogonal group, 373 square bias, 35, 90 Stein characterization, 13 Stein equation, 14, 15
Stein equation, general, 344 Stein identity, 36 Stein pair, 22, 192, 375 stochastic integral, 382, 383 strictly averaging function, 75, 77 strongly unimodal, 231 sub-sequences of random permutations, 208 supremum norm, 45, 63 symmetrization, of functions, 382, 383 T tensor product representation, 374 total variation distance, 63, 221, 386 trimmed mean, 266 two-sample ω²-statistic, 263 two-sample Wilcoxon statistic, 263
U U-statistics, 260 uniform distribution, 67 unimodal, 230 unitary symplectic group, 373 universal L1 constant, 67 V variance stabilizing transformation, 274 W Wasserstein distance, 64 Wiener chaos, 383, 386 Z zero bias, 26, 27, 29, 64, 102, 136, 147, 222, 228