If φ is convex, then φ(px + (1 − p)y) ≤ pφ(x) + (1 − p)φ(y) for 0 ≤ p ≤ 1. Jensen's inequality is

(5.29)  φ(E[X]) ≤ E[φ(X)],

valid if φ is convex on an interval containing the range of X.

Hölder's Inequality. Suppose that

(5.30)  p⁻¹ + q⁻¹ = 1,  p > 1, q > 1.

Hölder's inequality is

(5.31)  E[|XY|] ≤ E^{1/p}[|X|^p] · E^{1/q}[|Y|^q].

If, say, the first factor on the right vanishes, then X = 0 with probability 1, hence XY = 0 with probability 1, and hence the left side vanishes also. Assume then that the right side of (5.31) is positive. If a and b are positive, there exist s and t such that a = e^{s/p} and b = e^{t/q}. Since e^x is convex, e^{p⁻¹s + q⁻¹t} ≤ p⁻¹e^s + q⁻¹e^t, or

ab ≤ a^p/p + b^q/q.

This obviously holds for nonnegative as well as for positive a and b. Let u and v be the two factors on the right in (5.31). For each ω,

|X(ω)Y(ω)|/(uv) ≤ |X(ω)|^p/(p u^p) + |Y(ω)|^q/(q v^q).

Taking expected values and applying (5.30) leads to (5.31). If p = q = 2, Hölder's inequality becomes Schwarz's inequality:

(5.32)  E[|XY|] ≤ E^{1/2}[X²] · E^{1/2}[Y²].

Lyapounov's Inequality. Suppose that 0 < α < β. In (5.31) take p = β/α, q = β/(β − α), Y(ω) ≡ 1, and replace X by |X|^α. The result is E[|X|^α] ≤ E^{α/β}[|X|^β], or Lyapounov's inequality:

(5.33)  E^{1/α}[|X|^α] ≤ E^{1/β}[|X|^β],  0 < α < β.
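As a quick sanity check on (5.31) and (5.33), the short Python sketch below (an added illustration, not part of the text; the distribution and exponents are arbitrary choices) compares both sides of Hölder's and Lyapounov's inequalities for a simple pair of random variables.

```python
# Numerical check of Hölder's (5.31) and Lyapounov's (5.33) inequalities
# for a simple (finitely many valued) pair of random variables.
# The distribution below is an arbitrary example, not from the text.

values = [(-1.0, 2.0, 0.2), (0.5, -1.0, 0.3), (2.0, 0.5, 0.5)]  # (x, y, prob)

def expect(f):
    """E[f(X, Y)] for the simple distribution above."""
    return sum(prob * f(x, y) for x, y, prob in values)

p, q = 3.0, 1.5                       # conjugate exponents: 1/p + 1/q = 1
lhs = expect(lambda x, y: abs(x * y))
rhs = (expect(lambda x, y: abs(x) ** p) ** (1 / p)
       * expect(lambda x, y: abs(y) ** q) ** (1 / q))
print("Holder:", lhs, "<=", rhs)      # (5.31)

alpha, beta = 1.0, 2.5                # 0 < alpha < beta
la = expect(lambda x, y: abs(x) ** alpha) ** (1 / alpha)
lb = expect(lambda x, y: abs(x) ** beta) ** (1 / beta)
print("Lyapounov:", la, "<=", lb)     # (5.33)
```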
PROBLEMS

5.1. (a) Show that X is measurable with respect to the σ-field 𝒢 if and only if σ(X) ⊂ 𝒢. (b) Show that, if 𝒢 = {∅, Ω}, then X is measurable 𝒢 if and only if X is constant. (c) Show that, if P(A) is 0 or 1 for every A in 𝒢 and X is measurable 𝒢, then P[X = c] = 1 for a constant c. (d) Complement Theorem 5.1(ii) by showing that Y is measurable σ(X₁, ..., Xₙ) if and only if for all ω and ω′, Xᵢ(ω) = Xᵢ(ω′), i ≤ n, implies that Y(ω) = Y(ω′).

5.2. 2.17 ↑ Show that the unit interval can be replaced by any nonatomic probability measure space in the proof of Theorem 5.2.

5.3. Show that m = E[X] minimizes E[(X − m)²].

5.4. Suppose that X assumes the values m − a, m, m + a with probabilities p, 1 − 2p, p, and show that there is equality in (5.28). Thus Chebyshev's inequality cannot be improved without special assumptions on X.

5.5. Suppose that X has mean m and variance σ².
(a) Prove Cantelli's inequality

P[X − m ≥ a] ≤ σ²/(σ² + a²),  a ≥ 0.

(b) Show that P[|X − m| ≥ a] ≤ 2σ²/(a² + σ²). When is this better than Chebyshev's inequality?
(c) By considering a random variable assuming two values, show that Cantelli's inequality is sharp.

5.6. The polynomial E[(t|X| + |Y|)²] in t has at most one real zero. Deduce Schwarz's inequality once more.

5.7. (a) Write (5.33) in the form E^{β/α}[|X|^α] ≤ E[(|X|^α)^{β/α}] and deduce it directly from Jensen's inequality.
(b) Prove that E[1/X^p] ≥ 1/E^p[X] for p > 0 and X a positive random variable.

5.8. (a) Let f be a convex real function on a convex set C in the plane. Suppose that (X(ω), Y(ω)) ∈ C for all ω and prove a two-dimensional Jensen's
inequality:

(5.34)  f(E[X], E[Y]) ≤ E[f(X, Y)].

(b) Show that f is convex if it has continuous second derivatives that satisfy

(5.35)  f₁₁ ≥ 0,  f₂₂ ≥ 0,  f₁₁ f₂₂ − f₁₂² ≥ 0.

5.9. ↑ Hölder's inequality is equivalent to E[X^{1/p} Y^{1/q}] ≤ E^{1/p}[X] E^{1/q}[Y] (p⁻¹ + q⁻¹ = 1), where X and Y are nonnegative random variables. Derive this from (5.34).

5.10. ↑ Minkowski's inequality is

(5.36)  E^{1/p}[|X + Y|^p] ≤ E^{1/p}[|X|^p] + E^{1/p}[|Y|^p],

valid for p ≥ 1. It is enough to prove that E[(X^{1/p} + Y^{1/p})^p] ≤ (E^{1/p}[X] + E^{1/p}[Y])^p for nonnegative X and Y. Use (5.34).

5.11. The essential supremum of |X| is ess sup |X| = inf[x: P[|X| ≤ x] = 1].
(a) Show that E^{1/p}[|X|^p] ↑ ess sup |X| as p ↑ ∞.
(b) Show that E[|XY|] ≤ E[|X|] ess sup |Y| and ess sup |X + Y| ≤ ess sup |X| + ess sup |Y|. If one formally writes ess sup |X| = E^{1/∞}[|X|^∞], the first of these inequalities is Hölder's for p = 1 and q = ∞ (formally 1⁻¹ + ∞⁻¹ = 1) and the second is Minkowski's inequality (5.36) for p = ∞.
(c) Check (5.33) for the case β = ∞.

5.12. For events A₁, A₂, ..., not necessarily independent, let Nₙ = Σₖ₌₁ⁿ I_{Aₖ} be the number to occur among the first n. Let

(5.37)  αₙ = (1/n) Σₖ₌₁ⁿ P(Aₖ),  βₙ = (1/(n(n − 1))) Σ_{j ≠ k; j,k ≤ n} P(Aⱼ ∩ Aₖ).

Show that

(5.38)  Var[n⁻¹Nₙ] = βₙ − αₙ² + (αₙ − βₙ)/n.

Thus Var[n⁻¹Nₙ] → 0 if and only if βₙ − αₙ² → 0, which holds if the Aₙ are independent and P(Aₙ) = p (Bernoulli trials) because then αₙ = p and βₙ = p² = αₙ².

5.13. (a) Let S be the number of elements left fixed by a random permutation on n letters. Calculate E[S] and Var[S]. (b) A fair r-sided die is rolled n times. Let N be the number of pairs of trials on which the same face results. Calculate E[N] and Var[N]. Hint: In neither case consider the distribution of the random variable; instead, represent it as a sum of random variables assuming values 0 and 1.

5.14. Show that, if X has nonnegative integers as values, then E[X] = Σₙ₌₁^∞ P[X ≥ n].
5.15. Let Iᵢ = I_{Aᵢ} be the indicators of n events having union A. Let Sₖ = Σ I_{i₁} ··· I_{iₖ}, where the summation extends over all k-tuples satisfying 1 ≤ i₁ < ··· < iₖ ≤ n. Then sₖ = E[Sₖ] are the terms in the inclusion-exclusion formula P(A) = s₁ − s₂ + ··· ± sₙ. Deduce the inclusion-exclusion formula from I_A = S₁ − S₂ + ··· ± Sₙ. Prove the latter formula by expanding the product Πᵢ₌₁ⁿ (1 − Iᵢ).

5.16. 2.15 ↑ For integers m and primes p, let α_p(m) be the exact power of p in the prime factorization of m: m = Π_p p^{α_p(m)}. Let δ_p(m) be 1 or 0 as p divides m or not. Under each Pₙ (see (2.20)) the α_p and δ_p are random variables. Show that for distinct primes p₁, ..., p_u,

(5.39)  Pₙ[α_{pᵢ} ≥ kᵢ, i ≤ u] = (1/n) ⌊n/(p₁^{k₁} ··· p_u^{k_u})⌋ → 1/(p₁^{k₁} ··· p_u^{k_u})

and

(5.40)  Pₙ[α_{pᵢ} = kᵢ, i ≤ u] → Πᵢ₌₁ᵘ (1/pᵢ^{kᵢ})(1 − 1/pᵢ).

Similarly,

(5.41)  Pₙ[δ_{pᵢ} = 1, i ≤ u] = (1/n) ⌊n/(p₁ ··· p_u)⌋ → 1/(p₁ ··· p_u).

According to (5.40), the α_p are for large n approximately independent under Pₙ, and according to (5.41), the same is true of the δ_p. For a function f of positive integers, let

(5.42)  Eₙ[f] = (1/n) Σ_{m=1}ⁿ f(m)

be its expected value under the probability measure Pₙ. Show that

(5.43)  Eₙ[α_p] = Σₖ₌₁^∞ (1/n) ⌊n/pᵏ⌋ → 1/(p − 1);

this says roughly that (p − 1)⁻¹ is the average power of p in the factorization of large integers.

5.17. ↑ (a) From Stirling's formula, deduce

(5.44)  Eₙ[log] = log n + O(1).

From this, the inequality Eₙ[α_p] ≤ 2/p, and the relation log m = Σ_p α_p(m) log p, conclude that Σ_p p⁻¹ log p diverges and that there are infinitely many primes.
(b) Let log* m = Σ_p δ_p(m) log p. Show that

(5.45)  Eₙ[log*] = Σ_p (1/n) ⌊n/p⌋ log p = log n + O(1).

(c) Show that ⌊2n/p⌋ − 2⌊n/p⌋ is always nonnegative and equals 1 in the range n < p ≤ 2n. Deduce E_{2n}[log*] − Eₙ[log*] = O(1) and conclude that

(5.46)  Σ_{p ≤ x} log p = O(x).

Use this to estimate the error removing the integral-part brackets introduces into (5.45) and show that

(5.47)  Σ_{p ≤ x} p⁻¹ log p = log x + O(1).

(d) Restrict the range of summation in (5.47) to θx < p ≤ x for an appropriate θ and conclude that

(5.48)  Σ_{p ≤ x} log p ≍ x,

in the sense that the ratio of the two sides is bounded away from 0 and ∞.

(e) Use (5.48) and truncation arguments to prove for the number π(x) of primes not exceeding x that

(5.49)  π(x) ≍ x/log x.

(By the prime number theorem the ratio of the two sides in fact goes to 1.) Conclude that the rth prime p_r satisfies p_r ≍ r log r and that

(5.50)  Σ_p 1/p = ∞.

(A short numerical illustration of (5.48) and (5.49) follows the problem list.)

5.18. Let fₙ(x) be n²x or 2n − n²x or 0 according as 0 ≤ x ≤ n⁻¹ or n⁻¹ ≤ x ≤ 2n⁻¹ or 2n⁻¹ ≤ x ≤ 1. This gives a standard example of a sequence of continuous functions that converges to 0 but not uniformly. Note that ∫₀¹ fₙ(x) dx does not converge to 0; relate to Example 5.6.

5.19. 4.18 ↑ Prove Theorem 5.2 for the case lₙ = 2 by the result in Problem 4.18.
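The following short Python sketch (added here as an illustration; not part of the text, and the cutoffs chosen are arbitrary) tabulates Σ_{p ≤ x} log p and π(x) for a few values of x, so the orders of magnitude asserted in (5.48) and (5.49) can be seen numerically.

```python
# Numerical illustration of (5.48) and (5.49): the sum of log p over primes
# p <= x is of order x, and the number pi(x) of primes up to x is of order
# x / log x.
import math

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i:n + 1:i] = [False] * len(range(i * i, n + 1, i))
    return [i for i, is_p in enumerate(sieve) if is_p]

for x in (10 ** 3, 10 ** 4, 10 ** 5):
    ps = primes_up_to(x)
    theta = sum(math.log(p) for p in ps)          # sum_{p <= x} log p
    print(x, theta / x, len(ps) / (x / math.log(x)))
    # Both ratios stay bounded away from 0 and infinity (in fact they tend to 1).
```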
SECTION 6. THE LAW OF LARGE NUMBERS

The Strong Law
Let X₁, X₂, ... be a sequence of simple random variables on some probability space (Ω, ℱ, P). They are identically distributed if their distributions (in the sense of (5.9)) are all the same. Define Sₙ = X₁ + ··· + Xₙ.

Theorem 6.1. If the Xₙ are independent and identically distributed and E[Xₙ] = m, then

(6.1)  P[limₙ n⁻¹Sₙ = m] = 1.

PROOF. The conclusion is that n⁻¹Sₙ − m = n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − m) → 0 except on an ℱ-set of probability 0. Replacing Xᵢ by Xᵢ − m shows that there is no loss of generality in assuming m = 0.

Let E[Xᵢ²] = σ² and E[Xᵢ⁴] = ξ⁴. The proof is like that for Theorem 1.2. In analogy with (1.25), E[Sₙ⁴] = Σ E[X_α X_β X_γ X_δ], the four indices ranging independently from 1 to n. Since E[Xₙ] = 0, it follows by the product rule (5.21) for independent random variables that the summand vanishes if there is one index different from the three others. This leaves terms of the form E[Xᵢ⁴] = ξ⁴, of which there are n, and terms of the form E[Xᵢ²Xⱼ²] = E[Xᵢ²]E[Xⱼ²] = σ⁴ for i ≠ j, of which there are 3n(n − 1). Hence

(6.2)  E[Sₙ⁴] = nξ⁴ + 3n(n − 1)σ⁴ ≤ Kn²,

where K does not depend on n. By Markov's inequality (5.27) for k = 4, P[|Sₙ| ≥ nε] ≤ Kn⁻²ε⁻⁴, and so by the first Borel-Cantelli lemma, P[|n⁻¹Sₙ| ≥ ε i.o.] = 0 for each positive ε. Now

(6.3)  [limₙ n⁻¹Sₙ = 0]ᶜ = ⋃_ε [|n⁻¹Sₙ| ≥ ε i.o.],

where the union extends over positive rational ε. This shows in the first place that the set on the left in (6.3) lies in ℱ and in the second place that its probability is 0. Taking complements gives (6.1). •

Example 6.1. The classical example is the strong law of large numbers for Bernoulli trials. Here P[Xₙ = 1] = p, P[Xₙ = 0] = 1 − p, m = p; Sₙ represents the number of successes in n trials, and n⁻¹Sₙ → p with probability 1. The idea of probability as frequency depends on the long-range stability of the success ratio Sₙ/n. •
Example 6.2. Theorem 1.2 is the case of Example 6.1 in which (Ω, ℱ, P) is the unit interval and the Xₙ(ω) are the digits dₙ(ω) of the dyadic expansion of ω. Here p = ½. The set (1.20) of normal numbers in the unit interval has by (6.1) Lebesgue measure 1; its complement has measure 0 (and so in the terminology of Section 1 is negligible). See Example 4.12. •
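For readers who want to see the strong law in action, here is a minimal Python sketch (added for illustration; the sample sizes and seed are arbitrary) that tracks the success ratio Sₙ/n for simulated Bernoulli trials, the setting of Examples 6.1 and 6.2 with p = ½.

```python
# Empirical illustration of Theorem 6.1 / Example 6.1: the success ratio
# S_n / n of Bernoulli(p) trials settles down near p as n grows.
import random

random.seed(0)          # arbitrary seed, for reproducibility
p = 0.5                 # p = 1/2 corresponds to dyadic digits (Example 6.2)
s = 0
for n in range(1, 100001):
    s += 1 if random.random() < p else 0
    if n in (10, 100, 1000, 10000, 100000):
        print(n, s / n)   # the ratios approach p
```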
The Weak Law

If the set in (6.3) has probability 0, so has [|n⁻¹Sₙ| ≥ ε i.o.] for each positive ε, and Theorem 4.1 implies that P[|n⁻¹Sₙ| ≥ ε] → 0. (Compare the discussion centering on (4.31) and (4.32).) But this follows directly from Chebyshev's inequality (5.28), which also leads to a more flexible result.

Suppose that (Ωₙ, ℱₙ, Pₙ) is a probability space for each n and that X_{n1}, ..., X_{n rₙ} are simple random variables on this space. Let Sₙ = X_{n1} + ··· + X_{n rₙ}. If

(6.4)  E[X_{nk}] = m_{nk},  Var[X_{nk}] = σ_{nk}²,

and if X_{n1}, ..., X_{n rₙ} are independent, then

(6.5)  E[Sₙ] = mₙ = Σₖ₌₁^{rₙ} m_{nk},  Var[Sₙ] = σₙ² = Σₖ₌₁^{rₙ} σ_{nk}².

Chebyshev's inequality immediately gives this form of the weak law of large numbers:

Theorem 6.2. Suppose that for each n, X_{n1}, ..., X_{n rₙ} are independent and vₙ > 0. If σₙ/vₙ → 0, then

(6.6)  Pₙ[|Sₙ − mₙ| ≥ εvₙ] → 0

for all positive ε.
Example 6.3. Suppose that the spaces (Ωₙ, ℱₙ, Pₙ) are all the same, that rₙ = n, and that X_{nk} = Xₖ, where X₁, X₂, ... is an infinite sequence of independent random variables. If vₙ = n, E[Xₖ] = m, and Var[Xₖ] = σ², then σₙ/vₙ = σ/√n → 0. This gives the weak law corresponding to Theorem 6.1. •

Example 6.4. Let Ωₙ consist of the n! permutations of 1, 2, ..., n, all equally probable, and let X_{nk}(ω) be 1 or 0 according as the kth element in
the cyclic representation of ω ∈ Ωₙ completes a cycle or not. This is Example 5.5, although there the dependence on n was suppressed in the notation. Here rₙ = n, X_{n1}, ..., X_{nn} are independent, and Sₙ = X_{n1} + ··· + X_{nn} is the number of cycles. The mean m_{nk} of X_{nk} is the probability that it equals 1, namely (n − k + 1)⁻¹, and its variance is σ_{nk}² = m_{nk}(1 − m_{nk}). If Lₙ = Σₖ₌₁ⁿ k⁻¹, then Sₙ has mean Σₖ₌₁ⁿ m_{nk} = Lₙ and variance Σₖ₌₁ⁿ m_{nk}(1 − m_{nk}) ≤ Lₙ. By Chebyshev's inequality,

Pₙ[|Sₙ − Lₙ| ≥ εLₙ] ≤ 1/(ε²Lₙ).

Of the n! permutations on n letters, a proportion exceeding 1 − ε⁻²Lₙ⁻¹ thus have their cycle number in the range (1 ± ε)Lₙ. Since Lₙ = log n + O(1), most permutations on n letters have about log n cycles. For a refinement, see Example 27.3. Since Ωₙ changes with n, it is in the nature of the case that there cannot be a strong law corresponding to this result. •
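As a quick empirical companion to Example 6.4, the Python sketch below (an added illustration; the permutation size and number of trials are arbitrary) counts cycles of uniformly random permutations and compares the average count with Lₙ ≈ log n.

```python
# Example 6.4, empirically: the number of cycles of a random permutation
# of 1..n concentrates around L_n = 1 + 1/2 + ... + 1/n ~ log n.
import random

def cycle_count(perm):
    """Number of cycles in a permutation given as a list of images of 0..n-1."""
    seen, cycles = [False] * len(perm), 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

random.seed(0)
n, trials = 1000, 2000
counts = []
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)
    counts.append(cycle_count(perm))
L_n = sum(1.0 / k for k in range(1, n + 1))
print("average cycles:", sum(counts) / trials, "  L_n:", L_n)
```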
Bernstein's Theorem

Some theorems that can be stated without reference to probability nonetheless have simple probabilistic proofs, as the last example shows. Bernstein's approach to the Weierstrass approximation theorem is another example. Let f be a function on [0, 1]. The Bernstein polynomial of degree n associated with f is

(6.7)  Bₙ(x) = Σₖ₌₀ⁿ f(k/n) \binom{n}{k} xᵏ(1 − x)^{n−k}.

Theorem 6.3. If f is continuous, Bₙ(x) converges to f(x) uniformly on [0, 1].

According to the Weierstrass approximation theorem, f can be uniformly approximated by polynomials; Bernstein's result goes further and specifies an approximating sequence.

PROOF. Let M = sup_x |f(x)|, and let δ(ε) = sup[|f(x) − f(y)|: |x − y| ≤ ε] be the modulus of continuity of f. It will be shown that

(6.8)  sup_x |f(x) − Bₙ(x)| ≤ δ(ε) + 2M/(nε²).
By the uniform continuity of f, lim_{ε→0} δ(ε) = 0, and so this inequality (for ε = n^{−1/3}, say) will give the theorem.

Fix n ≥ 1 and x ∈ [0, 1] for the moment. Let X₁, ..., Xₙ be independent random variables (on some probability space) such that P[Xᵢ = 1] = x and P[Xᵢ = 0] = 1 − x; put S = X₁ + ··· + Xₙ. Since P[S = k] = \binom{n}{k} xᵏ(1 − x)^{n−k}, the formula (5.16) for calculating expected values of functions of random variables gives E[f(S/n)] = Bₙ(x). By the law of large numbers, there should be high probability that S/n is near x and hence (f being continuous) that f(S/n) is near f(x); E[f(S/n)] should therefore be near f(x). This is the probabilistic idea behind the proof and, indeed, behind the definition (6.7) itself.

Bound |f(n⁻¹S) − f(x)| by δ(ε) on the set [|n⁻¹S − x| ≤ ε] and by 2M on the complementary set, and use (5.19) and (5.20) as in the proof of Theorem 5.3. Since E[S] = nx, Chebyshev's inequality gives

|Bₙ(x) − f(x)| ≤ E[|f(n⁻¹S) − f(x)|] ≤ δ(ε)P[|n⁻¹S − x| ≤ ε] + 2MP[|n⁻¹S − x| ≥ ε] ≤ δ(ε) + 2M Var[S]/(n²ε²);

since Var[S] = nx(1 − x) ≤ n, (6.8) follows. •
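A short Python sketch of the construction in (6.7) follows (added as an illustration; the test function and degrees are arbitrary choices): it evaluates Bₙ for a continuous f and reports the maximum error on a grid, which shrinks as n grows, as Theorem 6.3 asserts.

```python
# Bernstein polynomials (6.7): B_n(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k).
# The error sup_x |f(x) - B_n(x)| shrinks as n grows (Theorem 6.3).
from math import comb, sin, pi

def bernstein(f, n, x):
    return sum(f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))

f = lambda x: sin(2 * pi * x) + abs(x - 0.3)   # an arbitrary continuous test function
grid = [i / 200 for i in range(201)]
for n in (10, 50, 250):
    err = max(abs(f(x) - bernstein(f, n, x)) for x in grid)
    print(n, err)
```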
Refinement of the Second Borel-Cantelli Lemma

For a sequence A₁, A₂, ... of events, consider the number Nₙ = I_{A₁} + ··· + I_{Aₙ} of occurrences among A₁, ..., Aₙ. Since [Aₙ i.o.] = [ω: supₙ Nₙ(ω) = ∞], P[Aₙ i.o.] can be studied by means of the random variables Nₙ.

Suppose that the Aₙ are independent. Put pₖ = P(Aₖ) and mₙ = p₁ + ··· + pₙ. From E[I_{Aₖ}] = pₖ and Var[I_{Aₖ}] = pₖ(1 − pₖ) ≤ pₖ follow E[Nₙ] = mₙ and Var[Nₙ] = Σₖ₌₁ⁿ Var[I_{Aₖ}] ≤ mₙ. If mₙ > x, then

(6.9)  P[Nₙ ≤ x] ≤ P[|Nₙ − mₙ| ≥ mₙ − x] ≤ Var[Nₙ]/(mₙ − x)² ≤ mₙ/(mₙ − x)².

If Σ pₙ = ∞, so that mₙ → ∞, it follows that limₙ P[Nₙ ≤ x] = 0 for each x. Since

(6.10)  P[supₖ Nₖ ≤ x] ≤ P[Nₙ ≤ x],
P[supₖ Nₖ ≤ x] = 0 and hence (take the union over x = 1, 2, ...) P[supₖ Nₖ < ∞] = 0. Thus P[Aₙ i.o.] = P[supₙ Nₙ = ∞] = 1 if the Aₙ are independent and Σ pₙ = ∞, which proves the second Borel-Cantelli lemma once again.

Independence was used in this argument only to estimate Var[Nₙ]. Even without independence, E[Nₙ] = mₙ and the first two inequalities in (6.9) hold.

Theorem 6.4.† If Σ P(Aₙ) diverges and

(6.11)  lim infₙ (Σ_{j,k ≤ n} P(Aⱼ ∩ Aₖ)) / (Σ_{k ≤ n} P(Aₖ))² ≤ 1,

then P[Aₙ i.o.] = 1.

† For an extension, see Problem 6.15.

As the proof will show, the ratio in (6.11) is at least 1; if (6.11) holds, the inequality must therefore be an equality.
PROOF. Let θₙ denote the ratio in (6.11). In the notation above,

Var[Nₙ] = E[Nₙ²] − mₙ² = E[Σ_{j,k ≤ n} I_{Aⱼ} I_{Aₖ}] − mₙ² = Σ_{j,k ≤ n} P(Aⱼ ∩ Aₖ) − mₙ² = (θₙ − 1)mₙ²

(and θₙ − 1 ≥ 0). Hence (see (6.9)) P[Nₙ ≤ x] ≤ (θₙ − 1)mₙ²/(mₙ − x)² for x < mₙ. Since mₙ²/(mₙ − x)² → 1, (6.11) implies that lim infₙ P[Nₙ ≤ x] = 0. It still follows by (6.10) that P[supₖ Nₖ ≤ x] = 0, and the rest of the argument is as before. •
Example 6.5. If, as in the second Borel-Cantelli lemma, the Aₙ are independent (or even if they are merely independent in pairs), the ratio in (6.11) is 1 + Σ_{k ≤ n}(pₖ − pₖ²)/mₙ², so that Σ P(Aₙ) = ∞ implies (6.11). •
Example 6.6. Return once again to the run lengths lₙ(ω) of Section 4. It was shown in Example 4.19 that {rₙ} is an outer boundary (P[lₙ ≥ rₙ i.o.] = 0) if Σ 2^{−rₙ} < ∞. It was also shown that {rₙ} is an inner boundary
(P[lₙ ≥ rₙ i.o.] = 1) if rₙ is nondecreasing and Σ 2^{−rₙ}rₙ⁻¹ = ∞, but Theorem 6.4 can be used to prove this under the sole assumption that Σ 2^{−rₙ} = ∞. As usual, the rₙ can be taken to be positive integers. Let Aₙ = [lₙ ≥ rₙ] = [dₙ = ··· = d_{n+rₙ−1} = 0]. If j + rⱼ ≤ k, then Aⱼ and Aₖ are independent. Suppose that j < k < j + rⱼ, and put m = max{j + rⱼ, k + rₖ}; then Aⱼ ∩ Aₖ = [dⱼ = ··· = d_{m−1} = 0] and so P(Aⱼ ∩ Aₖ) = 2^{−(m−j)} ≤ 2^{−(k−j)}2^{−rₖ} = 2^{−(k−j)}P(Aₖ). Therefore,

Σ_{j,k ≤ n} P(Aⱼ ∩ Aₖ) ≤ Σ_{k ≤ n} P(Aₖ) + 2 Σ_{j < k ≤ n} P(Aⱼ)P(Aₖ) + 2 Σ_{j < k ≤ n} 2^{−(k−j)}P(Aₖ)
  ≤ Σ_{k ≤ n} P(Aₖ) + (Σ_{k ≤ n} P(Aₖ))² + 2 Σ_{k ≤ n} P(Aₖ).

If Σ P(Aₙ) = Σ 2^{−rₙ} diverges, then (6.11) follows. Thus {rₙ} is an outer or an inner boundary according as Σ 2^{−rₙ} converges or diverges, which completely settles the issue. In particular, rₙ = log₂n + θ log₂log₂n gives an outer boundary for θ > 1 and an inner boundary for θ ≤ 1. •
Example 6.7. It is now possible to complete the analysis in Examples 4.11 and 4.15 of the relative error eₙ(ω) in the approximation of ω by Σₖ₌₁ⁿ dₖ(ω)2^{−k}. If lₙ(ω) ≥ −log₂ xₙ (0 < xₙ ≤ 1), then eₙ(ω) ≤ xₙ by (4.22). By the preceding example for the case rₙ = −log₂ xₙ, Σ xₙ = ∞ implies that P[ω: eₙ(ω) ≤ xₙ i.o.] = 1. By this and Example 4.11, [ω: eₙ(ω) ≤ xₙ i.o.] has Lebesgue measure 0 or 1 according as Σ xₙ converges or diverges. •
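To make the boundary dichotomy of Example 6.6 concrete, here is a small Python sketch (an added illustration with an arbitrary seed and horizon): it tracks the run length lₙ of zeros starting at position n in a random binary sequence and records how often lₙ ≥ log₂n + θ log₂log₂n for θ below and above 1.

```python
# Example 6.6, empirically: how often the run length l_n of zeros starting at
# position n exceeds r_n = log2(n) + theta * log2(log2(n)).
import math, random

random.seed(1)
N = 200000
bits = [random.randint(0, 1) for _ in range(N + 100)]

def run_length(n):
    """Length of the run of zeros starting at position n (0-indexed)."""
    k = 0
    while n + k < len(bits) and bits[n + k] == 0:
        k += 1
    return k

for theta in (0.5, 2.0):
    hits = 0
    for n in range(4, N):
        r_n = math.log2(n) + theta * math.log2(math.log2(n))
        if run_length(n) >= r_n:
            hits += 1
    print("theta =", theta, "exceedances:", hits)
# Exceedances keep occurring for theta <= 1 (inner boundary) and become
# rare for theta > 1 (outer boundary).
```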
PROBLEMS

6.1. Show that Zₙ → Z with probability 1 if and only if for every positive ε there exists an n such that P[|Zₖ − Z| < ε, n ≤ k ≤ m] > 1 − ε for all m exceeding n. This describes convergence with probability 1 in "finite" terms.

6.2. Show in Example 6.4 that P[|Sₙ − Lₙ| ≥ Lₙ^{1/2+ε}] → 0.

6.3. As in Examples 5.5 and 6.4, let ω be a random permutation of 1, 2, ..., n. Each k, 1 ≤ k ≤ n, occupies some position in the bottom row of the permu-
tation ω; let X_{nk}(ω) be the number of smaller elements (between 1 and k − 1) lying to the right of k in the bottom row. The sum Sₙ = X_{n1} + ··· + X_{nn} is the total number of inversions, the number of pairs appearing in the bottom row in reverse order of size. For the permutation in Example 5.5 the values of X_{71}, ..., X_{77} are 0, 0, 0, 2, 4, 2, 4, and S₇ = 12. Show that X_{n1}, ..., X_{nn} are independent and P[X_{nk} = i] = k⁻¹ for 0 ≤ i < k. Calculate E[Sₙ] and Var[Sₙ]. Show that Sₙ is likely to be near n²/4.

6.4. For a function f on [0, 1] write ‖f‖ = sup_x |f(x)|. Show that, if f has a continuous derivative f′, then ‖f − Bₙ‖ ≤ ε‖f′‖ + 2‖f‖/(nε²). Conclude that ‖f − Bₙ‖ = O(n^{−1/3}).

6.5. From Theorem 6.2 deduce Poisson's theorem: If A₁, A₂, ... are independent events and P(Aₙ) = pₙ, if p̄ₙ = (p₁ + ··· + pₙ)/n, and if Nₙ = Σₖ₌₁ⁿ I_{Aₖ}, then P[|n⁻¹Nₙ − p̄ₙ| ≥ ε] → 0.
In the following problems Sn = X1 + · · · + Xn .
6.6. Prove Cantelli's theorem: If X₁, X₂, ... are independent, E[Xₙ] = 0, and E[Xₙ⁴] is bounded, then n⁻¹Sₙ → 0 with probability 1. The Xₙ need not be identically distributed.

6.7. (a) Let x₁, x₂, ... be a sequence of real numbers, and put sₙ = x₁ + ··· + xₙ. Suppose that n⁻²s_{n²} → 0 and that the xₙ are bounded, and show that n⁻¹sₙ → 0.
(b) Suppose that n⁻²S_{n²} → 0 with probability 1 and that the Xₙ are uniformly bounded (sup_{n,ω}|Xₙ(ω)| < ∞). Show that n⁻¹Sₙ → 0 with probability 1. Here the Xₙ need not be identically distributed or even independent.
6.8. ↑ Suppose that X₁, X₂, ... are independent and uniformly bounded and E[Xₙ] = 0. Using only the preceding result, the first Borel-Cantelli lemma, and Chebyshev's inequality, prove that n⁻¹Sₙ → 0 with probability 1.

6.9. ↑ Use the ideas of Problem 6.8 to give a new proof of Borel's normal number theorem, Theorem 1.2. The point is to return to first principles and use only negligibility and the other ideas of Section 1, not the apparatus of Sections 2 through 6; in particular, P(A) is to be taken as defined only if A is a finite, disjoint union of intervals.

6.10. 5.12 6.7 ↑ Suppose that (in the notation of (5.37)) βₙ − αₙ² = O(1/n). Show that n⁻¹Nₙ − αₙ → 0 with probability 1. What condition on βₙ − αₙ² will imply a weak law? Note that independence is not assumed here.
6.11. Suppose that X₁, X₂, ... are m-dependent in the sense that random variables more than m apart in the sequence are independent. More precisely, assume that σ(X_{j₁}, ..., X_{k₁}), ..., σ(X_{jₗ}, ..., X_{kₗ}) are independent if k_{i−1} + m < jᵢ for i = 2, ..., l. (Independent random variables are 0-dependent.) Suppose that the Xₙ have this property and are uniformly bounded and that E[Xₙ] = 0. Show that n⁻¹Sₙ → 0. Hint: Consider the subsequences Xᵢ, X_{i+m+1}, X_{i+2(m+1)}, ... for 1 ≤ i ≤ m + 1.

6.12. ↑ Suppose that the Xₙ are independent and assume the values x₁, ..., xₗ with probabilities p(x₁), ..., p(xₗ). For u₁, ..., uₖ a k-tuple of the xᵢ's, let
Nₙ(u₁, ..., uₖ) be the frequency of the k-tuple in the first n + k − 1 trials, that is, the number of m such that 1 ≤ m ≤ n and Xₘ = u₁, ..., X_{m+k−1} = uₖ. Show that with probability 1, all asymptotic relative frequencies are what they should be: with probability 1, n⁻¹Nₙ(u₁, ..., uₖ) → p(u₁) ··· p(uₖ) for every k and every k-tuple u₁, ..., uₖ.

6.13. ↑ A number ω in the unit interval is completely normal if, for every base b and every k and every k-tuple of base-b digits, the k-tuple appears in the base-b expansion of ω with asymptotic relative frequency b^{−k}. Show that the set of completely normal numbers has Lebesgue measure 1.
6.14. Shannon's theorem. Suppose that X₁, X₂, ... are independent, identically distributed random variables taking on the values 1, ..., r with positive probabilities p₁, ..., p_r. If pₙ(i₁, ..., iₙ) = p_{i₁} ··· p_{iₙ} and if pₙ(ω) = pₙ(X₁(ω), ..., Xₙ(ω)), then pₙ(ω) is the probability that a new sequence of n trials would produce the particular sequence X₁(ω), ..., Xₙ(ω) of outcomes that happens actually to have been observed. Show that

−n⁻¹ log pₙ(ω) → h = −Σᵢ₌₁ʳ pᵢ log pᵢ

with probability 1. In information theory 1, ..., r are interpreted as the letters of an alphabet, X₁, X₂, ... are the successive letters produced by an information source, and h is the entropy of the source. Prove the asymptotic equipartition property: For large n there is probability exceeding 1 − ε that the probability pₙ(ω) of the observed n-long sequence lies between e^{−n(h+ε)} and e^{−n(h−ε)}.
6.15. The Rényi-Lamperti lemma. Let the limit inferior in (6.11) be c. Show that P[Aₙ i.o.] ≥ 2 − c. Note that c cannot be less than 1.

6.16. In the terminology of Example 6.6, show that log₂n + log₂log₂n + θ log₂log₂log₂n is an outer or inner boundary as θ > 1 or θ ≤ 1. Generalize. (Compare Problem 4.16.)

6.17. 5.17 ↑ Let g(m) = Σ_p δ_p(m) be the number of distinct prime divisors of m. For aₙ = Eₙ[g] (see (5.42)) show that aₙ → ∞. Show that
(6.12)  Eₙ[(δ_p − Eₙ[δ_p])(δ_q − Eₙ[δ_q])] ≤ 1/(np) + 1/(nq)

for p ≠ q and hence that the variance of g under Pₙ satisfies

(6.13)  Varₙ[g] = O(aₙ).

Prove the Hardy-Ramanujan theorem:

(6.14)  Pₙ[|g − aₙ| ≥ εaₙ] → 0 for each positive ε.
Since aₙ ∼ log log n (see Problem 18.18), most integers under n have something like log log n distinct prime divisors. Since log log 10⁷ is a little less than 3, the typical integer under 10⁷ has about three prime factors, remarkably few.

6.18. Suppose that X₁, X₂, ... are independent and P[Xₙ = 0] = p. Let Lₙ be the length of the run of 0's starting at the nth place: Lₙ = k if Xₙ = ··· = X_{n+k−1} = 0 ≠ X_{n+k}. Show that P[Lₙ ≥ rₙ i.o.] is 0 or 1 according as Σₙ p^{rₙ} converges or diverges. Example 6.6 covers the case p = ½.
• • •
=
=
=
=
=
=
=
SECTION 7. GAMBLING SYSTEMS
Let X₁, X₂, ... be an independent sequence of random variables (on some (Ω, ℱ, P)) taking on the two values +1 and −1 with probabilities P[Xₙ = +1] = p and P[Xₙ = −1] = q = 1 − p. Throughout the section Xₙ will be viewed as the gambler's gain on the nth of a series of plays at unit stakes. The game is favorable to the gambler if p > ½, fair if p = ½, and unfavorable if p < ½. The case p ≤ ½ will be called the subfair case.

After the classical gambler's ruin problem has been solved, it will be shown that every gambling system is in certain respects without effect and that some gambling systems are in other respects optimal. Gambling problems of the sort considered here have inspired many ideas in the mathematical theory of probability, ideas that carry far beyond their origin.

Red-and-black will provide numerical examples. Of the 38 spaces on a roulette wheel, 18 are red, 18 are black, and 2 are green. In betting either on red or on black the chance of winning is 18/38.
.
.
Gambler's Ruin
Suppose that the gambler enters the casino with capital a and adopts the strategy of continuing to bet at unit stakes until his fortune increases to c or his funds are exhausted. What is the probability of ruin, the probability that he will lose his capital, a? What is the probability he will achieve his goal, c? Here a and c are integers. Let (7 . 1 )
Sn = X1 + · · · + Xn '
The gambler's fortune after ( 7 .2)
S0 = 0.
n plays is a + Sn. The event
A a . n = (a + Sn = c, O < a + Sk < c, k < n )
SECTION
7.
89
GAMBLING SYSTEMS
represents success for the gambler at time n , and
Ba. n = ( a + Sn =
( 7 .3)
represents ruin at time success, then
n.
< a + S* <
0� 0
s (a) (
If
c,
k
< n)
denotes the probability of ultimate
(7 .4) for 0 < a < c. Fix c and let a vary. For n > 1 and 0 < a < c, define A a . n by (7.2), and adopt the conventions Aa.o = 0 for 0 < a < c and A,.o = 0 (success is impossible at time 0 if a < c and certain if a = c), as well as Ao. n = Ac. n = 0 for n � 1 (play never starts if a is 0 or c). By these conventions, s c (O) = 0 and s c (c) = 1 . Because of independence and the fact that the sequence X2 , X3, is a probabilistic replica of X1 , X2 , , it seems clear that the chance of success for a gambler with initial fortune a must be the chance of winning the first wager times the chance of success for an initial fortune a + 1 , plus the chance of losing the first wager times the chance of success for an initial fortune a 1 . It thus seems intuitively clear that •
•
•
•
•
•
-
sc ( a ) = psc ( a + 1 ) + qs ( a - 1) ,
(7.5)
0
c
< a < c.
For a rigorous argument, define A � . n just as Aa. n but with s; = X2 + . . . + Xn + l in place of sn in ( 7 . 2 ) . Now P[ X; = X;, i < n ] = P[ X; + I = X;, i s n] for each sequence x 1 , . . . , xn of + 1's and - 1's, and therefore n P[(X1 , . . . , Xn) e H ] = P[( X2 , . n. . , Xn + 1)e H) for H c R . Take H to be the set of X = (x 1 , . . . x n ) in R satisfying X ; = ± 1, a + X I + . . . + x n = ' c, and 0 < a + x 1 + +x k < c for k < n . It follows then that · · ·
(7.6) Moreover, Aa , n = ( [ X1 = + 1] n A �+ l , n - l ) U ( [ X1 = - 1] n A � - l , n - 1 ) for n � 1 and 0 < a < c. By independence and ( 7 . 6 ), P( Aa , n) = pP( Aa + 1, n_ 1) + qP ( A a _ 1 n - 1) ; adding over n now gives (7.5). Note that this argument mvolves the entire infinite sequence X1 , X2 , It remains to solve the difference equation (7.5) with the side conditions rc(O) = O, s c (c) = 1 . Let p = qjp be the odds against the gambler. Then [Al9] there exist constants A and B such that, for 0 � a < c, sc (a ) = A + Bp11 if p ¢ q and sc(a) = A + Ba if p = q. The requirements s c (O) = 0 and sc(c) = 1 determine A and B, which gives the solution: •
t
•
•
•
•
90
PROBABILITY
The probability that the gambler can, before ruin, attain his goal of c from an initial capital of a is

(7.7)  s_c(a) = (ρᵃ − 1)/(ρᶜ − 1),  0 ≤ a ≤ c, if ρ = q/p ≠ 1;  s_c(a) = a/c,  0 ≤ a ≤ c, if p = q = ½.
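A minimal Python sketch of (7.7) follows (an added illustration; the arguments used below are the capitals and goals appearing in Examples 7.1 and 7.2, and the algebraic rearrangement in the unfavorable case is only to keep floating-point arithmetic from overflowing).

```python
# Gambler's ruin, formula (7.7): probability of reaching c before 0
# starting from capital a, with win probability p per unit-stake play.
def success_probability(a, c, p):
    """s_c(a) from (7.7)."""
    q = 1.0 - p
    if abs(p - q) < 1e-12:            # fair case: s_c(a) = a / c
        return a / c
    rho = q / p                       # odds against the gambler
    if rho < 1.0:                     # favorable game: direct formula is stable
        return (rho ** a - 1.0) / (rho ** c - 1.0)
    # unfavorable game (rho > 1): rewrite to avoid overflow for large c
    return rho ** (a - c) * (1.0 - rho ** (-a)) / (1.0 - rho ** (-c))

print(success_probability(900, 1000, 0.5))       # 0.9 (fair game)
print(success_probability(900, 1000, 18 / 38))   # about 3e-5 (red-and-black)
print(success_probability(100, 20000, 18 / 38))  # about 1e-911; underflows to 0.0
```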
The gambler's initial capital is $900 and his goal is $1000. If p = � , his chance of success is very good: s 1000 ( 900) = .9. At red-and black, p = �= and hence p = i� ; in this case his chance of success as • computed by (7.7) is only about .0000 3 . Example 7. 1.
It is the gambler's desperate intention to convert his $100 into $20,000. For a game in which p = � (no casino has one), his chance of success is 100/20,000 = .005 ; at red-and-black it is minute-about 3 X • 10 - 9 11 . Example 7. 2.
In the analysis leading to (7.7), replace (7.2) by (7 . 3) . It follows that (7.7) with p and q interchanged (p goes to p - 1) and a and c - a interchanged a gives the probability rc( a ) of ruin for the gambler: rc ( a ) = ( p - (£ - ) 1 )/( P - c - 1 ) if p ¢ 1 and rc( a ) = (c - a)jc if p = 1. Hence sc( a ) + rc(a) = 1 holds in all cases: The probability is 0 that play continues forever. For positive integers a and b, let Ha. b = U �_ 1 [Sn = b, - a < Sk < b, k < n] be the event that Sn reaches + b before reaching - a. Its probability is simply (7.7) with c = a + b: P( Ha. b) = sa+b(a). Now let Hh = U�_ 1 Ha . b = U �_ 1 [Sn = b] = [supnSn > b] be the event that Sn ever reaches +b. Since Ha. h i Hb as a --+ oo , P( Hb) = lim as a+ h ( a ) ; this is 1 if p = 1 or p < 1, and it is 1/ph if p > 1 . Thus if p >
q, ( 7 .8) if p < q. This is the probability that a gambler with unlimited capital can ultimately gain b units. The gambler in Example 7.1 has capital 900 and the goal of winning b = 100; in Example 7.2 he has capital 100 and b is 19,900. Suppose, instead, that his capital is infinite. If p = �, the chance of achieving his goal increases from .9 to 1 in the first example and from .005 to 1 in the second. At red-and-black, however, the two probabilities .9 100 and .9 1 9900 remain essentially what they were before ( . 0000 3 and 3 x 10 - 9 1 1 ) Example 7.3.
•
•
SECTION
7.
91
GAMB LING SYSTEMS
Selection Systems
Players often try to improve their luck by betting only when in the preceding trials the wins and losses form an auspicious pattern. Perhaps the gambler bets on the nth trial only when among X1 , , Xn - l there are many more + 1 's than - 1 's, the idea being to ride winning streaks (he is " in the vein"). Or he may bet only when there are many more - 1's than + 1's, the idea being it is then surely time a + 1 came along (the " maturity of the chances"). There is a mathematical theorem which translated into gaming language says that all such systems are futile. •
•
•
, X,, _ 1 there is an It might be argued that it is sensible to bet if among X1 , excess of + l's, on the ground that it is evidence of a high value of p. But it is assumed throughout that statistical inference is not at issue: p is fixed-at * , for example, in the case of red-and-black-and is known to the gambler, or should be. •
•
•
The gambler' s strategy is described by random variables B 1 , B2 , taking the two values 0 and 1 : If Bn = 1, the gambler places a bet on the nth trial; if Bn = 0, he skips that trial. If Bn were ( Xn + 1)/2, so that Bn = 1 for Xn = + 1 and Bn = 0 for Xn = - 1 , the gambler would win every time he bet, but of course such a system requires he be prescient-he must know the outcome Xn in advance. For this reason the value of Bn is assumed to depend only on the values of X1 , , Xn 1 : there exists some n 1 function bn : R - l --+ R such that •
•
•
•
•
•
_
(7 .9) (Here B 1 is constant.) Thus the mathematics avoids, as it must, the question of whether prescience is actually possible. Define
{�
( x• . . . . . xn ) . F0 - {O, D } .
(7 .10 ) The
0
n
=
1 , 2, . . . ,
a-field � 1 generated by X1 , , Xn 1 corresponds to a knowledge of the outcomes of the first n - 1 trials. The requirement (7.9) ensures that Bn is mea surable � - I (Theorem 5.1) and so depends only on the information actually available to the gambler just before the n th trial. For n = 1, 2, . . . , let Nn be the time at which the gambler places his n th bet. This n th bet is placed at time k or earlier if and only if the number E�_ 1 B; of bets placed up to and including time k is n or more; in fact, Nn is the smallest k for which E�_ 1 B; = n. Thus the event [ Nn < k ] coincides _
•
•
•
_
92
PROBABILITY
[E�_ 1B; > n]; by (7.9) this latter o( XI , . . . ' xk - 1 > = !Fk - 1 · Therefore, with
event lies in o( B 1 ,
•
•
•
, Bk ) c
(7 .1 1 )
(Even though [Nn = k ] lies in .?Fk - I and hence in .?F, Nn is as a function on 0 generally not a simple random variable because it has infinite range. This makes no difference because expected values of the Nn will play no role; (7 .11) is the essential property.) To ensure that play continues forever (stopping rules will be considered later) and that the Nn have finite values with probability 1, make the further assumption that P ( Bn
( 7 .12 )
=
1 i.o]
=
1.
A sequence { Bn } of random variables assuming the values 0 and 1, having the form (7 . 9), and satisfying (7.12) is a selection system. Let Yn be the gambler's gain on the nth of the trials at which he does bet: Yn = XN . It is only on the set [Bn = 1 i.o.] that all the Nn and hence all the Yn are well defined. To complete the definition, set Yn = - 1, say, on [Bn = 1 i.o.] c; since this set has probabili�y 0 by (7.12), it really makes no difference how Yn is defined on it. Now Yn is a complicated function on 0 because Yn (w) = XN,. "' > (w). < Nonetheless, II
( w : Yn( w ) = 1 ]
=
00
U ( ( w : Nn ( w ) = k ] n ( w : Xk ( w ) = 1 ] )
k-1
lies in .fF and so does its complement simple random variable.
[ w: Yn ( w) = - 1 ] .
Hence
Yn
is a
An example will fix these ideas. Suppose that the rule is always to bet on the first trial, to bet on the second trial if and only if X1 = + 1, to bet on the third trial if and only if X1 = X2 , and to bet on all subsequent trials. Here B 1 = 1, [B2 = 1] = [ X1 = + 1], [B3 = 1 ] = [ X1 = X2 ], and B4 = B5 = · · · = 1. The table shows the ways the gambling can start out. A dot represents a value undetermined by X1 , X2 , X3 Ignore the rightmost column for the moment. Example 7.4.
•
SECTION
xl
x2
x3
-1
-1 -1 +1 +1 -1 -1 +1 +1
-1 +1 -1 +1 -1 +1 -1 +1
-1 -1 -1 +1 +1 +1 +1
7.
GAMBLING SYSTEMS
B t B2 B3 Nt N2 N3 N4 1 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1
1 1 0 0 0 0 1 1
1 1 1 1 1 1 1 1
3 3 4 4 2 2 2 2
4 4 5 5 4 4 3 3
5 5 6 6 5 5 4 4
93
yl
y2
y3
-1 -1 -1 -1 +1 +1 +1 +1
-1 +1
1 1 1 1 2 2 -1 3 +1 .
-1 -1 +1 +1
T
In the evolution represented by the first line of the table, the second bet is placed on the third trial ( N2 = 3), which results in a loss because y2 == XN == x3 = - 1 . Since x3 = - 1, the gambler was " wrong" to bet. 2 But remember that before the third trial he does not know X3( w) (much less "' itself); he knows only X1(w) and X2 (w). See the discussion in Example • s.� Selection systems achieve nothing because { Yn } has the same structure as { Xn } : 1beorem 7.1. For any selection + 1] = p , P [ Yn = - 1 ] = q.
system, { Yn } is independent and P[Yn =
PROOF.
Since random variables with indices that are themselves random variables are conceptually confusing on first aquaintance, the w ' s here will not be suppressed as they have been in previous proofs. Relabel p and q as p ( + 1) and p( - 1), so that P[w: Xk (w) = x ] = p(x) for x == ± 1 . If A e �k - l ' then A and [w: Xk (w) = x ] are independent, and so P( A n [w: Xk (w) = x ]) = P( A) p ( x ). Therefore, by (7.1 1),
==
00
L P ( w : Nn ( w ) = k , Xk ( w ) = x ]
k-1
= p(x) .
00
= L P ( w : Nn ( w ) = k ] p ( x ) k�l
94
PROBABILITY
More generally, for any sequence
P [w:
==
x1,
•
•
•
, x n of
± l ' s,
Y; ( w ) = x ; , i < n ] = P [ w : XN, < ., > ( w ) = x; , i < n ] kl <
...
E
n
P [ w : N;( w ) = k ; , Xk, ( w ) = x; , i < n ] ,
where the sum extends over n-tuples of positive integers satisfying k 1 < . < k n . The event [w: N; (w) = k ; , i < n] n [w: xk (w) = X; , i < n] lies in :Fk n 1 (note that there is no condition on Xk ,(w)), and therefore
..
I
-
P (w:
Y; ( w ) = X; , i < n ] kl <
Summing
...
E
n
P( [w : N;( w ) = k; , i < n ]
k n over k n_ 1 + 1, k n_ 1 + 2, . . .
E
k 1 < · · · < kn - 1
brings this last sum to
P [ w : N;( w ) = k ; , xk; ( w ) = X; , i < n ] p ( x n)
= P [ w : XN, < .,> ( w ) = X; , i < n ) p ( x n) = P (w : Y; ( w ) = X; , i < n ] p ( x n ) .
It follows by induction that
P (w : Y; ( w ) = X; , i < n ] = 0 p ( x ;) = 0 P ( w : Y; ( w ) = X ; ] , i s. n i
•
Gambling Policies There are schemes that go beyond selection systems and tell the gambler not only whether to bet but how much. Gamblers frequently contrive or adopt such schemes in the confident expectation that they can, by pure force of arithmetic, counter the most adverse workings of chance. If the wager specified for the n th trial is in the amount Wn and the gambler cannot see into the future, then Wn must depend only on X1 , , Xn_ 1 Assume •
•
•
•
SECTION
7.
GAMBLING SYSTEMS
95
wn is a nonnegative function of these random variables: there therefore that n 1 such that 1 R --+ R : fn is an
(7 .13) Apart from nonnegativity there are at the outset no constraints on the fn , although in an actual casino their values must be integral multiples of a basic unit. Such a sequence { Wn } is a betting system. Since Wn = 0 corresponds to a decision not to bet at all, betting systems in effect include n selection systems. In the double-or-nothing system, Wn = 2 - 1 if X1 = · · · = Xn _ 1 = - 1 ( W1 = 1) and Wn = 0 otherwise. The amount the gambler wins on the nth play is Wn Xn . If his fortune at time n is Fn , then
(7 . 14)
This also holds for n = 1 if F0 is taken as his initial (nonrandom) fortune. It is convenient to let Wn depend on F0 as well as the past history of play and hence to generalize (7.13) to
(7 . 15 ) for a function
gn : R n
(7 . 16 )
F:0
1
In expanded notation, Wn ( w ) = Bn( F0, X1 ( w ), . . . , Xn _ 1 ( w )) . The symbol Wn does not show the dependence on w or on F0 either. For each fixed initial fortune F0, Wn is a simple random variable; by (7.15) it is measurable � 1 • Similarly, Fn is a function as of X = F (w): F (w), . . . , X of F0 as well n n ( F0, w). n 1 If F0 = 0 and Bn = 1 , the Fn reduce to the partial sums (7.1). Since � _ 1 and o( Xn ) are independent, and since Wn is measurable �- 1 (for each fixed F0), Wn and Xn are independent. Therefore, E [ W,. Xn] == E[Wn ] · E [ Xn 1 · Now E[ Xn 1 = p - q < 0 in the subfair case ( p < �), with equality in the fair case ( p = � ). Since E [ Wn 1 > 0, (7.14) implies that E(Fn ] � E [ Fn _ 1 ]. Therefore, --+ R •
> ··· > ··· > E ( Fn ] > E ( F1 ] -
in the subfair case, and
(7 .1 7 ) the fair case. (If p < q and P [ Wn > 0] > 0, there is strict inequality in (7 . 1 6). ) Thus no betting system can convert a subfair game into a profitable enterprise.
in
96
PROBABILITY
Suppose that in addition to a betting system, the gambler adopts some policy for quitting. Perhaps he stops when his fortune reaches a set target, or his funds are exhausted, or the auguries are in some way dissuasive. The decision to stop must depend only on the initial fortune and the history of play up to the present. Let T ( F0, "' ) be a nonnegative integer for each "' in 0 and each F0 � 0. If T = n , the gambler plays on the n th trial (betting Wn ) and then stops; if T = 0, he does not begin gambling in the first place. The event [ w: T ( F0, "' ) = n ] represents the decision to stop just after the n th trial, and so, whatever value F0 may have, it must depend only on X1 , , Xn . Therefore, assume that •
•
•
n = 0, 1 , 2, . . .
(7 .1 8)
A T satisfying this requirement is a stopping time. (In general it has infinite range and hence is not a simple random variable; as expected values of T play no role here, this does not matter.) It is technically necessary to let T ( F0, "' ) be undefined or infinite on an w-set of probability 0. This has no effect on the requirement (7.1 8 ), which must hold for each finite n. Since T is finite with probability 1, play is certain to terminate. A betting system together with a stopping time is a gambling policy. Let 'IT denote such a policy. Suppose that the betting system is given by Wn = Bn , with Bn as in Example 7.4. Suppose that the stopping rule is to quit after the first loss of a wager. Then [ T = n] = u k_ 1 [Nk = n, y1 = . . . = yk 1 = + 1, Yk = 1 ]. For j � k , [Nk = n , lj = x ] = u:._ 1 [Nk = n , � = m,- Xm = x] lies in � by (7.11); hence T is a stopping time. The values of T are shown • in the rightmost column of the table. Example 7.5.
-
The sequence of fortunes is governed by (7.14) until play terminates, and then the fortune remains for all future time fixed at FT (with value FT< Fo , w ("' )). Therefore, the gambler's fortune at time n is
>
(7 .19 )
F* n =
{ FFnT
if T > n , if T < n .
Note that the case T = n is covered by both clauses here. If n 1 < n � T, then Fn* = Fn = Fn 1 + wn xn = Fn*- 1 + WnXn ; if T � n 1 < n, then Fn* = FT = Fn*_ 1 . Therefore, if W,.* = I[ .,. � n J Wn , then -
-
( 7 .20 )
SECTION
7.
97
GAMBLING SYSTEMS
But this is the equation for a new betting system in which the wager placed at time n is Wn* · If , > n (play has not already terminated), W,1* is the old amount Wn; if T < n (play has terminated), W,,* is 0. Now by (7.18), ( T � n] = [ T < n] c lies in � - 1• Thus Ir T , 1 is measurable � - 1 , so that wn* as well as wn is measurable � - 1' and { wn* } represents a legitimate betting system. Therefore, (7.16) and (7.17) apply to the new system: �
(7 . 21 )
F,0 = F,0* > -
E ( F1* ] ->
> ··· ··· > E ( Fn* ] -
if p < � , and
(7 .22) if p = t . The gambler' s ultimate fortune is FT . Now lim n Fn* = 1 since in fact Fn* = FT for n > T . If
(7 .23)
linm E ( Fn* )
=
E ( FT ),
then (7.21) and (7.22), respectively, imply that E[FT ] < According to Theorem 5.3, (7.23) does hold if the bounded. Call the policy bounded by M (M nonrandom) if
(7.24 )
FT with probability
n = 0, 1 , 2, . . .
F0 and E[ FT ] = F0• Fn* are uniformly
.
If Fn* is not bounded above, the gambler' s adversary must have infinite capital. A negative Fn* represents a debt, and if Fn* is not bounded below, the gambler must have a patron of infinite wealth and generosity from whom to borrow and so must in effect have infinite capital. In case Fn* is bounded below, 0 is the convenient lower bound-the gambler is assumed to have in hand all the capital to which he has access. In any real case, (7 .24) holds and (7.23) follows. (There is a technical point that arises because the general theory of integration has been postponed: FT must be assumed to have finite range so that it will be a simple random variable and hence have an expected value in the sense of Section 5.) The argument has led to this result :
Theorem 7.2. For anypolicy, (1.21) holds ifp < If the policy is bounded (and F., has finite range), and E [ FT ] = F0 for p = t.
} and (1.22) holds ifp = }. then E[F.,] < F0 for p < �
98
PROBABILITY
The gambler has initial capital a and plays at unit stakes until his capital increases to c (0 < a < c) or he is ruined. Here F0 = a and w,, 1 , and so Fn = a + sn. The policy is bounded by c and FT is c or 0 according as the gambler succeeds or fails. If p = t and if s is the probability of success, then a = F0 = E[FT ] = sc. Thus s = ajc. This gives a new derivation of (7.7) for the case p = ! . The argument assumes however that play is certain to terminate. If p < t, Theorem 7.2 only gives s < ajc, • which is weaker than (7.7). Example 7. 6. =
Example 7. 7. Suppose as before that F0 = a and Wn = 1, so that = a + Sn, but suppose the stopping rule is to quit as soon as Fn reaches
Fn a + b.
Here Fn* is bounded above by a + b but is not bounded below. If p = � , the gambler is by (7.8) certain to achieve his goal, so that FT = a + b. In this case F0 = a < a + b = E[F,.]. This illustrates the effect of infinite capital. It also illustrates the need for uniform boundedness in Theorem 5.3 (compare Example 5.6). • For some other systems (gamblers call them " martingales"), see the problems. For most such systems there is a large chance of a small gain and a small chance of a large loss. Bold Play *
The formula (7. 7) gives the chance that a gambler betting unit stakes can increase his fortune from a to c before being ruined. Suppose that a and c happen to be even and that at each trial the wager is two units instead of one. Since this has the effect of halving a and c, the chance of success is now pa/2 pc/2
_
-
1 1
-
pa pc
_
-
1 1
pc/2
+1 pa/2 + 1 '
g_ =
p
p
¢
1.
If p > 1 ( p < � ), the second factor on the right exceeds 1: Doubling the stakes increases the probability of success in the unfavorable case p > 1. In case p = 1, the probability remains the same. There is a sense in which large stakes are optimal. It will be convenient to rescale so that the initial fortune satisfies 0 < F0 < 1 and the goal is 1. The policy of bold play is this: At each stage the gambler bets his entire fortune, unless a win would carry him past his goal of 1, in which case he bets just * This topic may be omitted.
SECTION
7.
99
GAMBLING SYSTEMS
enough that a win would exactly achieve that goal: Wn =
( 7 .25)
{ Fn - 1
1 - Fn - 1
if 0 < F,, - 1 < 1 if ! < Fn 1 < 1 .
'
_
(It is convenient to allow even irrational fortunes.) As for stopping, the policy is to quit as soon as Fn reaches 0 or 1. Suppose that play has not terminated by time k - 1 ; under the policy (7.25), if play is not to terminate at time k, then Xk must be + 1 or - 1 according as Fk 1 < ! or Fk 1 > ! , and the conditional probability of this is at most m = max{ p, q }. It follows by induction that the probability that n bold play continues beyond time n is at most m , and so play is certain to terminate ( T is finite with probability 1). It will be shown that in the sub fair case, bold play maximizes the probability of successfully reaching the goal of 1. This is the Dubins-Savage theorem. It will further be shown that there are other policies that are also optimal in this sense, and this maximum probability will be calculated. Bold play can be substantially better than betting at constant stakes. This contrasts with Theorems 7.1 and 7.2 concerning respects in which gambling systems are worthless. From now on, consider only policies 'IT that are bounded by 1 (see (7.24)). Suppose further that play stops as soon as Fn reaches 0 or 1 and that this is certain eventually to happen. Since FT assumes the values 0 and 1, and since [ FT = x] = U:0_ 0[T = n] n [Fn = x] for x = 0 and x = 1, FT is a simple random variable. Bold play is one such policy 'IT. The policy 'IT leads to success if F., = 1. Let Q.,(x) be the probability of this for an initial fortune F0 = x: _
(7 .26)
_
Q., ( x ) = P [ F., = 1 ] ,
F0 = x.
Since Fn is a function l/l n (F0, X1 (w), . . . , Xn (w)) = Vn (F0, w ), (7.26) in expanded notation is Q.,(x) = P[w: VT( x, '-J ) (x, w) = 1]. As 'IT specifies that play stops at the boundaries 0 and 1,
{ Q.,(O)
Q ., (1) = 1, 0 < Q ., ( x ) < 1, 0 < X < 1.
(7 .27 )
= 0,
Let Q be the Q ., for bold play. (The notation does not show the dependence of Q and Q ., on p, which is fixed.) Theorem
7.3.
In the subfair case, Q .,(x) < Q (x) for all and all x. 'IT
100 PROOF.
( 7.28)
PROBABILITY
Under the assumption
p s q, it will be shown later that
Q( x ) > pQ( x + t ) + qQ(x - t ) , 0<X-
t < X < X + t < 1.
This can be interpreted as saying that the chance of success under bold play starting at x is at least as great as the chance of success if the amount t is wagered and bold play then pursued from x + t in case of a win and from x - t in case of a loss. Under the assumption of (7.28), optimality can be proved as follows. Consider a policy 'IT, and let Fn and Fn* be the simple random variables defined by (7.14) and (7.19) for this policy. Now Q(x) is a real function, and so Q ( Fn*) is also a simple random variable; it can be interpreted as the conditional chance of success if 'IT is replaced by bold play after time n . By (7.20), Fn* = x + tXn if Fn*_ 1 = x and W,.* = t. Therefore, X.I
where x and t vary over the (finite) ranges of Fn*_ 1 and Wn*, respectively. For each x and t, the indicator above is measurable � - 1 and Q ( x + tX. n ) is measurable o( Xn ); since the Xn are independent, (5.21) and (5.14) gtve X. I
By (7.28), E [Q(x + tXn )] < Q ( x ) if 0 < x - t S x < x + t S 1. As it is assumed of 'IT that Fn* lies in [0, 1] (that is, Wn* S min{ Fn*_ 1 , 1 - Fn*_ 1 } ), the probability in (7.29) is 0 unless x and t satisfy this constraint. There fore, X. I
X
This is true for each n, and so E[Q( Fn*)] � E[Q( F0*)] = Q(F0). Since Q( Fn*) = Q( FT ) for n > T, Theorem 5.3 implies that E[Q(FT )] � Q (F0 ). Since x = 1 implies that Q ( x ) = 1, P[FT = 1] s E[Q ( F,.)] < Q( F0). Thus Q"( F0 ) � Q ( F0 ) for the policy 'IT, whatever F0 may be.
SECTION
7.
101
GAMBLING SYSTEMS
Q ) �X< {pQ(2x , ) Q ( X p + qQ(2x (7 �X< x x Q(O)x. x < ! Q(1) x; if p) x + x 2x Q(2x)); x x; p) q) x x) 2x Q(2x Q(x) x x� Q(x) x.
It remains to analyze functional equation .30)
and prove (7.28). Everything hinges on the
=
0 i
- 1),
i, 1.
= 0 and = 1. The For = 0 and = 1 this is obvious because , the first stake idea is this: Suppose that the initial fortune is If under bold play is the gambler is to succeed in reaching 1, he must win the first trial (probability = and then from his new fortune go this makes the first half of (7.30) on to succeed (probability plausible. If � t, the first stake is 1 the gambler can succeed either by winning the first trial (probability or by losing the first trial (probabil = ity and then going on from his new fortune - (1 - 1 to succeed (probability - 1)); this makes the second half of (7.30) plausible. must be an increasing function of It is also intuitively clear that (0 � 1): the more money the gambler starts with, the better off he is. Finally, it is intuitively clear that ought to be a continuous function of the initial fortune
A formal proof of (7.30) can be constructed as for the difference equation (7.5). If /l ( x ) is x for x s ! and 1 - x for x � t , then under bold play W, == {J( F, _ 1 ). Starting from l0 ( x ) == x, recursively define
Then F,
== In ( Jb; x1 ' . . . ' xn ) . Now define
If Fo == x, then T, ( x) == [ gn (x; X1 , . . . , Xn ) == 1] is the event that bold play
will
by time n successfully increase the gambler's fortune to 1. From the recursive defini tion it follows by induction on n that for n � 1, In (x; x1 , . . . , xn ) == fn - 1 (x + /l( x ) x1 ; x 2 , , xn ) and hence that gn (x; x1 , . . . , xn ) == max { x, gn _ 1 (x + /l( x)x1 ; x 2 , , xn ) }. Since x == 1 implies gn _ 1 ( x + fj (x) x1 ; x 2 , , xn ) > x + /l( x) x1 == 1, T, ( x) == [ gn _ 1 ( x + {j(x) X1 ; X2 , , Xn ) == 1], and since the X; are independent and identically distributed, P( T, ( x)) == P(( X1 == + 1] n T,(x)) + P([ X1 == - 1] n T, ( x)) == pP [ gn _ 1 (x + fj (x); X2 , , Xn ) == 1] + qP [ gn _ 1 (x /l( x); X2 , , Xn ) == 1] == pP( T, _ 1 (x + fj (x))) + qP(T, _ 1 (x - fj (x ))). Letting n -+ oo now gives Q (x) == pQ (x + fj ( x)) + qQ (x - fj (x)), which reduces to (7.30) because Q (O) == 0 and Q (1) == 1. Suppose that y == ln - 1 (x; x1 , . . . , xn _ 1 ) is nondecreasing in x. If xn = + 1 , then /, ( X; X 1 , . . . , Xn ) is 2 y if 0 S y < t and 1 if t S y S 1 ; if Xn = - 1, then •
•
•
•
•
•
•
•
•
•
•
•
•
• • •
•
•
102
PROBABILITY
-
In ( x; x 1 , , xn ) is 0 if 0 < y < ! and 2 y 1 if ! < y < 1. In any case, fn ( x; x 1 , , xn ) is also nondecreasing in x, and by induction this is true for every n. It follows that the same is true of gn (x; x1 , , xn ), of P(T,(x)), and of Q (x). Thus Q (x) is nondecreasing. Since Q (1) = 1, (7.30) implies that Q (!) = p Q ( 1) = p, Q ( i) = pQ (!) = p 2 , Q ( i ) = p + qQ ( ! ) = p + pq. More generally, if Po = p and p 1 = q, then • • •
• • •
k (7.31) Q ,. 2
• • •
( )
-
L
[
n
Pu1
•
•
•
Pu. :
U;
;E. 2;
]
k
< " ' 2
n
> 1,
the sum extending over n-tuples ( u1 , , un ) of O's and 1 's satisfying the condition indicated. Indeed, it is easy to see that (7.31) is the same thing as • • •
( 7 .32) n for each dyadic rational . u1 • • un of rank n . If . u1 • • un +n 2- < ! , then u1 = 0 and by (7.30) the difference in (7.32) is p0[ Q (. u 2 un + 2- + t ) - Q (. u 2 • un )). But (7.32) follows inductively from this and a similar relation for the case . u 1 un > t. n n n Therefore Q ( k2 - n ) - Q (( k - 1)2- ) is bounded by max{ p , q }, and so by monotonicity Q is continuous. Since (7.32) is positive, it follows that Q is strictly increasing over [0, 1]. •
•
•
•
•
•
•
•
•
•
Thus Q is continuous and increasing and satisfies (7.30). The inequality (7 .28) is still to be proved. It is equivalent to the assertion that
a ( r, s ) = Q(a) - pQ(s) - qQ(r ) > 0 if 0 < r � s < 1, where a stands for the average: a = �( r + s). Since Q is n continuous, it suffices to prove the inequality for r and s of the form kj2 , and this will be done by induction on n . Checking all cases disposes of n = 0. Assume that the inequality holds for a particular n , and that r and s n have the form kj2 + t . There are four cases to consider.
CASE 1. s < ! . By the first part of (7.30), d(r, s) = p4(2r, 2s). Since 2r n and 2s have the form kj2 , the induction hypothesis implies that d(2r, 2s) > 0. CAsE
2.
t<
r. By the second part of (7.30), a (r, s ) = qa ( 2r - 1, 2s - 1 ) > 0.
CASE 3.
r < a � � � s. By (7.30), a ( r, s ) = p Q(2a ) - p ( p + qQ(2s - 1 ) ] - q [ p Q( 2r ) ] .
SECTION
7.
Since � s s � r + s = 2a < 1, - � < i , Q(2a - � ) = pQ (4a - i ), and it fellows that
&( r, s ) = q (Q( 2a -
103
GAMBLING SYSTEMS
Q(2a) = p + qQ(4a - 1); since 0 < 2a - 1). Therefore, p Q (2a) = p 2 + qQ(2a � ) - p Q( 2s
- 1 ) - p Q(2r )] .
Since p < q, the right side does not increase if either of the two p ' s is changed to q. Hence
& ( r, s ) > q max[�( 2r, 2s - 1 ) , & ( 2s - 1, 2r ) ] . The induction hypothesis applies to 2r < 2s - 1 or to 2s - 1 < 2r, as the case may be, and so one of the two & ' s on the right is nonnegative. CASE
4. r <
�
�
a < s. By (7.30),
& ( r, s ) = pq + qQ(2a - 1 ) - pqQ( 2s - 1 ) - pqQ( 2r ) . Since 0 � 2a - 1 = r + s - 1 < � ' Q (2a - 1) = p Q (4a - 2); since � < 2a - � = r + s - � < 1, Q (2a - � ) = p + qQ (4a - 2). Therefore, qQ (2a - 1) = p Q (2a - � ) - p 2, and it follows that
&( r, s ) = p ( q - p + Q ( 2a I f 2s
-1
<
� ) - qQ( 2s
- 1 ) - qQ( 2r )] .
2r, the right side here is p [( q - p ) (l - Q( 2r ) ) + &( 2s - 1, 2r )] > 0.
If
2r < 2s - 1, the right side is p [ ( q - p)(1 - Q(2 s - 1 )) + & { 2r, 2s - 1 )]
> 0.
This completes the proof of (7 .28) and hence of Theorem 7 .3.
•
The equation (7.31) has an interesting interpretation. Let Z1, Z2 be independent random variables satisfying P[Zn = 0] = Po = p and P(Zn = n ; 1 ] = p 1 = q.; From P[Z 1 i.o.] = 1 and E; n Z; 2 - < 2 - it follows that n = n n n ; ; < ] 2 [E 2 < k2 2 < k2 ] < P[E Z < k2 ]. Z P[E Z ; ; ; 7P �t �t t n Since by (7.31) the middle term is Q(k2 - ). ,
>
( 7 .33 )
•
•
•
104
PROBABILITY
holds for dyadic rational x and hence by continuity holds for all x. In Section 31, Q will reappear as a continuous, strictly increasing function singular in the sense of Lebesgue. On p. 428 is a graph for the case Po = .25. Note that Q( x) = x in the fair case p = t. In fact, for a bounded policy Theorem 7.2 implies that E[FT ] = F0 in the fair case, and if the policy is to stop as soon as the fortune reaches 0 or 1, then the chance of successfully reaching 1 is P[FT = 1 ] = E[FT ] = F0• Thus in the fair case with initial fortune x, the chance of success is x for every policy that stops at the boundaries, and x is an upper bound even if stopping earlier is allowed. Example 7. 8.
The gambler of Example 7.1 has capital $900 and goal $1000. For a fair game ( p = !) his chance of success is .9 whether he bets unit stakes or adopts bold play. At red-and-black ( p = tf), his chance of success with unit stakes is .0000 3 ; an approximate calculation based on (7.31) shows that under bold play his chance Q(.9) of success increases to • about .88, which compares well with the fair case. In Example 7.2 the capital is $100 and the goal $20,000. 11 At unit stakes the chance of successes is .005 for p = � and 3 x 10 - 9 for p = tf. Another approximate calculation shows that bold play at red-and black gives the gambler probability about .003 of success, which again compares well with the fair case. This example illustrates the point of Theorem 7.3. The gambler enters the casino knowing that he must by dawn convert his $100 into $20,000 or face certain death at the hands of criminals to whom he owes that amount. Only red-and-black is available to him. The question is not whether to gamble-he must gamble. The question is how to gamble so as to maximize the chance • of survival, and bold play is the answer. Example 7. 9.
Example 7.9. There are policies other than the bold one which achieve the maximum success probability Q(x). Suppose that as long as the gambler's fortune x is less than 1/2 he bets x for x ≤ 1/4 and 1/2 - x for 1/4 < x ≤ 1/2. This is, in effect, the bold-play strategy scaled down to the interval [0, 1/2], and so the chance he ever reaches 1/2 is Q(2x) for an initial fortune of x. Suppose further that if he does reach the goal of 1/2, or if he starts with fortune at least 1/2 in the first place, he continues with ordinary bold play. For an initial fortune x ≥ 1/2, the overall chance of success is of course Q(x), and for an initial fortune x < 1/2, it is Q(2x)Q(1/2) = pQ(2x) = Q(x). The success probability is indeed Q(x) as for bold play, although the policy is different. With this example in mind, one can generate a whole series of distinct optimal policies. •
Timid Play *
The optimality of bold play seems reasonable when one considers the effect of its opposite, timid play. Let the ε-timid policy be to bet W_n = min{ε, F_{n-1}, 1 - F_{n-1}} and stop when F_n reaches 0 or 1. Suppose that p < q, fix an initial fortune x = F_0 with 0 < x < 1, and consider what happens as ε → 0. By the strong law of large numbers, lim_n n^{-1}S_n = E[X_1] = p - q < 0. There is therefore probability 1 that sup_k S_k < ∞ and lim_n S_n = -∞. Given η > 0, choose ε so that P[sup_k (x + εS_k) < 1] > 1 - η. Since P(∪_{n=1}^∞ [x + εS_n < 0]) = 1, with probability at least 1 - η there exists an n such that x + εS_n < 0 and max_{k≤n} (x + εS_k) < 1. But under the ε-timid policy the gambler is in this circumstance ruined. If Q_ε(x) is the probability of success under the ε-timid policy, then lim_{ε→0} Q_ε(x) = 0 for 0 < x < 1. The law of large numbers carries the timid player to his ruin.
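A quick simulation makes the effect vivid. The sketch below is illustrative only (the trial count, the stakes, and the choice p = 18/38 are arbitrary, not from the text); it estimates Q_ε(x) for a few values of ε.

```python
# A brief simulation sketch of epsilon-timid play: bet min(eps, F, 1 - F) until
# the fortune F hits 0 or 1, and estimate the success probability.
import random

def timid_success(x, p, eps, trials=2000, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        f = x
        while 0.0 < f < 1.0:
            bet = min(eps, f, 1.0 - f)
            f += bet if rng.random() < p else -bet
        wins += (f >= 1.0)
    return wins / trials

p, x = 18 / 38, 0.5
for eps in (0.25, 0.05, 0.01):
    print(eps, timid_success(x, p, eps))   # the success probability shrinks with eps
```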
PROBLEMS

7.1. A gambler with initial capital a plays until his fortune increases b units or he is ruined. Suppose that p > 1/2. The chance of success is multiplied by 1 + θ if his initial capital is infinite instead of a. Show that 0 < θ ≤ ((p/q)^a - 1)^{-1} ≤ (a(p/q - 1))^{-1}; relate to Example 7.3.
7.2. As shown on p. 90, there is probability 1 that the gambler either achieves his goal of c or is ruined. For p ≠ q, deduce this directly from the strong law of large numbers. Deduce it (for all p) via the Borel-Cantelli lemma from the fact that if play never terminates, there can never occur c successive +1's.

7.3. 6.12 ↑ If V_n is the set of n-long sequences of ±1's, the function b_n in (7.9) maps V_{n-1} into {0, 1}. A selection system is a sequence of such maps. Although there are uncountably many selection systems, how many have an effective description in the sense of an algorithm or finite set of instructions by means of which a deputy (perhaps a machine) could operate the system for the gambler? An analysis of the question is a matter for mathematical logic, but one can see that there can be only countably many algorithms or finite sets of rules expressed in finite alphabets. Let Y_1^{(σ)}, Y_2^{(σ)}, ... be the random variables of Theorem 7.1 for a particular system σ, and let C_σ be the ω-set where every k-tuple of ±1's (k arbitrary) occurs in Y_1^{(σ)}(ω), Y_2^{(σ)}(ω), ... with the right asymptotic relative frequency (in the sense of Problem 6.12). Let C be the intersection of C_σ over all effective selection systems σ. Show that C lies in ℱ (the σ-field in the probability space (Ω, ℱ, P) on which the X_n are defined) and that P(C) = 1. A sequence (X_1(ω), X_2(ω), ...) for ω in C is called a collective: a subsequence chosen by any of the effective rules σ contains all k-tuples in the correct proportions.
* This topic may be omitted.
7.4. Let D_n be 1 or 0 according as X_{2n-1} ≠ X_{2n} or not, and let M_k be the time of the kth 1, that is, the smallest n such that Σ_{i=1}^n D_i = k. Let Z_k = X_{2M_k}. In other words, look at successive nonoverlapping pairs (X_{2n-1}, X_{2n}), discard accordant (X_{2n-1} = X_{2n}) pairs, and keep the second element of discordant (X_{2n-1} ≠ X_{2n}) pairs. Show that this process simulates a fair coin: Z_1, Z_2, ... are independent and identically distributed and P[Z_k = +1] = P[Z_k = -1] = 1/2, whatever p may be. Follow the proof of Theorem 7.1; it is instructive to write out the proof in all detail, suppressing no ω's.
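This pairing scheme is easy to try numerically. The sketch below is illustrative only (the function names, the bias, and the sample size are arbitrary choices, not from the text); it applies the rule of Problem 7.4 to a biased ±1 sequence.

```python
# Discard accordant pairs, keep the second element of discordant pairs, and
# check that the kept values are close to evenly split.
import random

def biased_steps(n, p):
    """n independent steps X_i with P[X_i = +1] = p, P[X_i = -1] = 1 - p."""
    return [1 if random.random() < p else -1 for _ in range(n)]

def extract_fair(xs):
    """Keep X_{2n} from each discordant pair (X_{2n-1}, X_{2n})."""
    return [b for a, b in zip(xs[::2], xs[1::2]) if a != b]

random.seed(0)
zs = extract_fair(biased_steps(200000, p=0.7))
print(sum(1 for z in zs if z == 1) / len(zs))   # close to 1/2 despite p = 0.7
```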
7.5. Suppose that a gambler with initial fortune 1 stakes a proportion θ (0 < θ < 1) of his current fortune: F_0 = 1 and W_n = θF_{n-1}. Show that F_n = Π_{k=1}^n (1 + θX_k) and hence that

log F_n = (n/2) log(1 - θ²) + (S_n/2) log((1 + θ)/(1 - θ)).

Show that F_n → 0 with probability 1 in the subfair case.

7.6. In "doubling," W_1 = 1, W_n = 2W_{n-1}, and the rule is to stop after the first win. For any positive p, play is certain to terminate. Here F_T = F_0 + 1, but of course infinite capital is required. If F_0 = 2^k - 1 and W_n cannot exceed F_{n-1}, the probability of F_T = F_0 + 1 in the fair case is 1 - 2^{-k}. Prove this via Theorem 7.2 and also directly.

7.7. In "progress and pinch," the wager, initially some integer, is increased by 1 after a loss and decreased by 1 after a win, the stopping rule being to quit if the next bet is 0. Show that play is certain to terminate if and only if p ≥ 1/2. Show that F_T = F_0 + W_1²/2 + (T - 1)/2. Infinite capital is required.

7.8. Here is a common martingale. At each stage the gambler has before him a pattern x_1, ..., x_k of positive numbers. He bets x_1 + x_k, or x_1 in case k = 1. If he loses, at the next stage he uses the pattern x_1, ..., x_k, x_1 + x_k. If he wins, at the next stage he uses the pattern x_2, ..., x_{k-1}, unless k is 1 or 2, in which case he quits. Show that play is certain to terminate if p ≥ 1/3 and that the ultimate gain is the sum of the numbers in the initial pattern. Infinite capital is again required.
_
Suppose that wk = 1 , so that Fk = Fo + sk . Suppose that p > q and T is a stopping time such that 1 < T < n with probability 1 . Show that E[ F, ] < E [ F, ], with equality in case p = q. Interpret this result in terms of a stock option that must be exercised by time n , Fo + s�.. representing the price of the stock at time k.
7. 10. Let u be a real function on [0, 1], u( x) representing the utility of the fortune x. Consider policies bounded by 1 ; see (7 .24) . Let Q ( Fo ) = E [ u ( F, ) ] ; this represents the expected utility under the policy , of an initial fortune Jb . Suppose of a policy w0 that 'IT
( 7 .34)
0 < X < 1,
SECTION
8.
107
MARKOV CHAINS
and that
( 7 .35) 0 < X - 1 < X < X + t < 1. Show that Q ( x) < Q ( x) for all x and all policies , . Such a w0 is optimal. Theorem 7.3 is the ;pecial case of this result for p < i , bold play in the role of w0 , and u(x) = 1 or u ( x ) = 0 according as x = 1 or x < 1. Condition (7.34) says that gambling with policy w0 is at least as good as not gambling at all; (7 .35) says that, although the prospects even under w0 become on the average less sanguine as time passes, it is better to use w0 now than to use some other policy for one step and then change to w0 . 'IT
'IT
7.1 1. The functional equation (7.30) and the assumption that Q is bounded suffice to determine Q completely. First, Q(O) and Q(1) must be 0 and 1, respec tively, and so ( 7.3 1) holds. Let 1Qx = !x and T1 x = }x + ! ; let f0 x = px and /1 x = p + qx. Then Q( Tu � x) = fu fu Q( x). If the binary expansions of x and y both b�gin with the diiits u 1 , � , u,r , they have the form x = T Tu,.x' and y = T Tu ,. y'. If K bounds Q and if m = max{ p , q }, it follows that IQ( x) - Q(y) l < Km n . Therefore, Q is continuous and satisfies (7.31). · · ·
"•
· · ·
· · ·
"•
• •
· · ·
SECTION 8. MARKOV CHAINS
As Markov chains illustrate well the connection between probability and measure, their basic properties are here developed in a measure-theoretic setting.

Definitions
Let S be a finite or countable set. Suppose that to each pair i and j in S there is assigned a nonnegative number piJ and that these numbers satisfy the constraint
L P;j = l ,
(8.1 )
jES
i E S.
Let X0, X1 , X2, be a sequence of random variables whose ranges are contained in S. The sequence is a Markov chain or Markov process if •
•
•
P ( Xn + t = JIXo = i o ,
(8.2 )
· · · ,
Xn = i n ]
= P ( Xn+ t = ii Xn = i n ] = P;nJ for every
n
and every sequence i0,
•
•
•
, i n in S for which P [ X0 = i0,
•
•
•
, Xn
108
PROBABILITY
= i n ] > 0. The set S is the state space or phase space of the process, and the p iJ are the transition probabilities. Part of the defining condition (8.2) is that the transition probability (8 .3) does not vary with n. t The elements of S are thought of as the possible states of a system, Xn representing the state at time n. The sequence or process X0, X1 , X2 , then represents the history of the system, which evolves in accordance with the probability law (8.2). The conditional distribution of the next state Xn + t given the present state Xn must not further depend on the past X0 , , Xn - t · This is what (8.2) requires, and it leads to a copious theory. The initial probabilities are •
•
'IT;
(8 .4)
•
•
•
•
= P [ X0 = i] .
The 'IT; are nonnegative and add to 1, but the definition of Markov chain places no further restrictions on them. Example 8. 1.
The Bernoulli-Lap/ace model of diffusion. Imagine r black
balls and r white balls distributed between two boxes, with the constraint that each box contains r balls. The state of the system is specified by the number of white balls in the first box, so that the state space is S = {0, 1, . . . , r } . The transition mechanism is this: at each stage one ball is chosen at random from each box and the two are interchanged. If the present state is i, the chance of a transition to i - 1 is the chance i/r of drawing one of the i white balls from the first box times the chance i/r of drawing one of the i black balls from the second box. Together with similar arguments for the other possibilities, this shows that the transition probabil ities are
P;, ; - t =
( � r.
r ; 2 Pi , i +l = r '
(
)
i r i ( ) = 2 Pu r2 '
the others being 0. This is the probabilistic analogue of the model for the • ftow of two liquids between two containers. * t Sometimes in the definition of Markov chain P [ Xn + 1 = jl Xn = i ] is allowed to depend on n . A chain satisfying (8.3) is then said to have stationary transition probabilities, a phrase that will be omitted here because (8.3) will always be assumed. * For an excellent collection of examples from physics and biology, see FELLER, Volume 1 , Chapter XV.
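To make the transition mechanism concrete, here is a small sketch (not from the text) that writes out the Bernoulli-Laplace transition probabilities for a small r and checks the row constraint (8.1); the value of r and the variable names are illustrative.

```python
# Build the Bernoulli-Laplace transition matrix and verify (8.1).
r = 4
S = range(r + 1)                              # number of white balls in the first box
P = [[0.0] * (r + 1) for _ in S]
for i in S:
    if i > 0:
        P[i][i - 1] = (i / r) ** 2            # white drawn from box 1, black from box 2
    if i < r:
        P[i][i + 1] = ((r - i) / r) ** 2      # black drawn from box 1, white from box 2
    P[i][i] = 2 * i * (r - i) / r ** 2        # the exchanged balls match in color
print(all(abs(sum(row) - 1.0) < 1e-12 for row in P))   # (8.1): each row sums to 1
```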
SECTION
8.
109
MARKOV CHAINS
The p_{ij} form the transition matrix P = [p_{ij}] of the process. A stochastic matrix is one whose entries are nonnegative and satisfy (8.1); the transition matrix of course has this property.

Example 8.2.
{ O , l, . . . , r } and
Random walk with absorbing barriers. 1
0 0
q •
•
0 0 0
•
•
0 0 0
0 0
p
q
0 P=
0
p
0
•
•
•
0 0 0
•
•
•
0 0 0
•
•
•
•
•
•
...
•
•
•
•
•
... •
•
•
•
•
•
0 0 0
0 0 0 •
•
•
q
0 0
•
•
•
0 0 0 •
0
q
0
•
•
p 0 0
S=
Suppose that 0 0 0
•
•
•
0
p 1
That is, Pi, i + l = p and P;, ; - t = q = 1 p for 0 < i < r and p00 = p,, = 1. The chain represents a particle in random walk. The particle moves one unit to the right or left, the respective probabilities being p and q, except that each of 0 and r is an absorbing state-once the particle enters, it cannot leave. The state can also be viewed as a gambler's fortune; absorption in 0 represents ruin for the gambler, absorption in r ruin for his adversary (see Section 7). The gambler's initial fortune is usually regarded as nonrandom, so that (see (8.4)) 'IT; = 1 for some i. • -
Example 8. 3. Unrestricted random walk. Let S consist of all the integers p. This chain i = 0, ± 1, ± 2, . . . , and take P;, ; + t = p and P;, ; - I = q = 1 -
represents a random walk without barriers, the particle being free to move • anywhere on the integer lattice. The walk is symmetric if p = q.
The state space may, as in the preceding example, be countably infinite. If so, the Markov chain consists of functions Xn on a probability space (D, !F, P ), but these will have infinite range and hence will not be random variables in the sense of the preceding sections. This will cause no difficulty however, because expected values of the Xn will not be considered. All that is required is that for each i E S the set [w: Xn (w) = i] lie in !F and hence have a probability. Let S consist of the integer lattice points in k-dimensional Euclidean space R k ; x = (x 1 , , x k ) lies in S if the coordinates are all integers. Now x has 2k neighbors, points of the form y = (x 1 , , X ; + 1, . . . , x k ); for each such y let Px_v = (2k) - 1 . Example 8.4.
Symmetric random walk in space.
•
•
.
.
•
•
110
PROBABILITY
The chain represents a particle moving randomly in space; for k = 1 it reduces to Example 8.3 with p = q = � . The cases k < 2 and k > 3 exhibit an interesting difference. If k s 2, the particle is certain to return to its • initial position, but this is not so if k > 3; see Example 8.6. Since the state space in this example is not a subset of the line, the X0 , X1 , do not assume real values. This is immaterial because expected values of the Xn play no role. All that is necessary is that Xn be a mapping from 0 into S (finite or countable) such that [ w: Xn ( w) = i] E � for i E S. There will be expected values E[/(Xn )1 for real functions f on S with finite range, but then /( Xn (w)) is a simple random variable as defined before. •
•
•
Example 8.5. A selection problem. A princess must choose from among r suitors. She is definite in her preferences and if presented with all r at once could choose her favorite and could even rank the whole group. They are ushered into her presence one by one in random order, however, and she must at each stage either stop and accept the suitor or else reject him and proceed in the hope that a better one will come along. What strategy will maximize her chance of stopping with the best suitor of all? Shorn of some details, the analysis is this. Let S1 , S2 , , Sr be the suitors in order of presentation; this sequence is a random permutation of the set of suitors. Let X1 == 1, and let X2 , X3 , be the successive positions of suitors who dominate (are preferable to) all their predecessors. Thus X2 == 4 and X3 == 6 means that S1 dominates S2 and � but S4 dominates S1 , S2 , � , and that S4 dominates S5 but S6 dominates S1 , , S5 • There can be at most r of these dominant suitors; if there are exactly m , xm + 1 == xm + 2 == == r + 1 by convention. As the suitors arrive in random order, the chance that S; ranks highest among S1 , , S; is (i - 1)!/i ! == 1/i. The chance that � ranks highest among S1 , � and S; ranks next is ( j - 2)!/j ! == 1/j ( j - 1). This leads to a chain with transition probabilitiest • • •
• • •
• • •
•
•
•
•
•
•
•
P [ X, + 1 = JI X,
(8.5 )
i i ... ] ... j( j _ l ) ,
•
(8.6)
•
p [ xn + 1 == r + 11 xn == i ] == !_ ' r
•
,
1 < i < j < r.
If Xn == i, then Xn + 1 == r + 1 means that S; dominates S; + 1 , S1 , , S; , and the conditional probability of this is •
•
• • •
, Sr
as
well
as
1 < i < r.
As downward transitions are impossible and r + 1 is absorbing, this specifies a transition matrix for S == {1, 2, . . . , r + 1 }. It is quite clear that in maximizing her chance of selecting the best suitor of all, the princess should reject those who do not dominate their predecessors. Her strategy therefore will be to stop with the suitor in position XT , where T is a random t The details can be found in DYNKIN & YUSHKEVICH. Chapter III.
SECTION
8.
111
MARKOV CHAINS
variable representing her strategy. Since her decision to stop must depend only on the suitors she has seen thus far, the event [ T == n ] must lie in a( X1 , , Xn ) . If X., == i , then by (8.6) the conditional probability of success is /(i) == ifr. The probability of success is therefore E[/( X., )], and the problem is to choose the • strategy T so as to maximize it. For the solution, see Example 8.16.t • • •
Higher-Order Transitions
The properties of the Markov chain are entirely determined by the transi tion and initial probabilities. The chain rule (4.2) for conditional probabili ties gives
Similarly, (8. 7)
for any sequence Further, (8. 8 )
i i 0'
1'
. . . ' i m of states.
P ( Xm + t = }1 , 1
<
t < n i Xs = i s , O < s < m ]
as follows by expressing the conditional probability as a ratio and applying (8. 7) to numerator and denominator. Adding out the intermediate states now gives the formula (8.9 )
P � �> = P ( Xm + n = J· I Xm = i ] '}
(the k1 range over S ) for the nth order transition probabilities. t With the princess replaced by an executive and the suitors by applicants for an office job, this is known as the secretary problem.
112
PROBABILITY
n
Notice thatp f? > is the entry in position (i, j ) of p , the n th power of the transition matrix P. If S is infinite, P is a matrix with infinitely many rows and columns; as the terms in (8.9) are nonnegative, there are no conver gence problems. It is natural to put
{
1 P ��> = 8 . . = 0 I}
I}
if i = j , if i ¢ j .
Then P0 is the identity, as it should be. From (8.1) and (8.9) follow n + <� � l � > � . . � � ) P p = p p p = 8.10 ( ) LJ 1 , "} LJ 1 , 'II} > ' I} ,
,
An Existence Theorem
Theorem 8.1. Suppose that P = [p_{ij}] is a stochastic matrix and that π_i are nonnegative numbers satisfying Σ_{i∈S} π_i = 1. There exists on some (Ω, ℱ, P) a Markov chain X_0, X_1, X_2, ... with initial probabilities π_i and transition probabilities p_{ij}.
e
Reconsider the proof of Theorem 5.2. There the space ( 0, ofF, P) was the unit interval, and the central part of the argument was the construction of the decompositions (5.10). Suppose for the moment that S = { 1 , 2, . . . ) . First construct a partition Jf0>, 1�0>, . . . of (0, 1] into counta bly manyt subintervals of lengths ( P is again Lebesgue measure) P( I/0> ) = 'IT;· Next decompose each 1/0> into subintervals I;�> of lengths P( l;�>) = 'IT; P ;J · Continuing inductively gives a sequence of partitions { /;�� �. ;,. : i 0 , , i n = 1 , 2, . . . } such that each refines the preceding and P( l;�� > . ;,. ) = 'IT;0 pi 0 i1 pin in Put Xn ( w) = i if w E U ;0 . . . ;,. _ 1 1;�� �. ;,. _ 1 ; . It follows just as in the proof of Theorem 5.2 that the set [ X0 = i 0 , , Xn = i n ] coincides with the interval 1;��� . ;,. · Thus P[ X0 = i , , Xn = in] = 'IT;0 P ;0 ; 1 P;,._ 1 ; ,.· From this, (8.2) 0 and (8.4) follow immediately. That completes the construction for the case S = { 1 , 2, . . . } . For the general countably infinite S, let g be a one-to-one mapping of { 1 , 2, . . . } onto S and replace the Xn as already constructed by g( Xn ); the assumption S = { 1, 2, . . . } was merely for notational convenience. The same argument • obviously works if S is finite. PROOF.
•
•
•
•
•
•
-I
•
•
•
•
•
•
•
•
•
•
Although strictly speaking the Markov chain is the sequence X , X1 � . . . , one often speaks as though the chain were the matrix P together0 = b - a and B; � 0 , then I; = (b - E1 s , 81 , b - E1 < ;B1 ), i = 1 , 2 decompose ( a , b ] into intervals of lengths B, .
t 1r
81 + 82 +
·
·
·
• . . .
,
SECTION
8.
113
MARKOV CHAINS
with the initial probabilities 'IT; or even P with some unspecified set of 'IT; . Theorem 8.1 justifies this attitude: For given P and 'IT; the corresponding Xn do exist, and the apparatus of probability theory-the Borel-Cantelli lemmas and so on-is available for the study of P and of systems evolving in accordance with the Markov rule. From now on fix a chain X0 , X1 , • • • satisfying 'IT; > 0 for all i. Denote by P; probabilities conditional on [ X0 = i]: P;(A) = P[A I X0 = i]. Thus
( 8. 11)
--
P. [ X = z· 1 < t < n ] = p I
I
I'
.
II1
..
p I 1 I 2 • • • P I· n - 1 I· n
by (8.8). The interest centers on these conditional probabilities, and the actual initial probabilities are now largely irrelevant. From (8.11) follows
(8. 12)
Suppose that I is a set (finite or infinite) of m-long sequences of states, J is a set of n-long sequences of states, and every sequence in I ends in j. Adding both sides of (8.12) for ( i 1 , • • • , i m ) ranging over I and ( j1, • • • , in ) ranging over J gives
( 8 . 13)
For this to hold it is essential that each sequence in (8.12) and (8.13) are of central importance.
I end in j.
Formulas
Transience and Persistence
Let
(8.14)
-
be the probability and let
(8.1 5 )
of a first visit to j at time
tij
= P;
n
for a system that starts in
( CJl x = i 1 ) = f /;�" > n-1
..
n-1
i,
1 14
PROBABILITY
be the probability of an eventual visit. A state i is persistent if a system starting at i is certain sometime to return to i: /;; = 1. The state is transient in the opposite case: /;; < 1. < n k and Suppose that n 1 , • • • , n k are integers satisfying 1 < n 1 < consider the event that the system visits j at times n 1 , • • • , n k but not in between; this event is determined by the conditions · · ·
XI
=I= j '
. . . ' xn l - I
=I= j '
xn l
= j'
. . . ' Xn 2 - I =I= j ' Xn 2 = j ' .................. . .......
Xn l + I
=I= j '
Repeated application of (8.13) shows that under P; the probability of this n n n �}n �< - n �< - •> . Add this over the k-tuples n 1 , • • • , n k : event is /;S · >�} 2 - 1 > 1 the P;-probability that Xn = j for at least k different values of n is /;1�J - • Letting k --+ oo therefore gives · · ·
if lI.jj. < if lI.jj. =
( 8 .16 ) Recall that
i.o. means infinitely often. Taking i = j gives if /; ; < if }1,. . =
( 8 .17 ) Thus
4.5),
1' 1.
1' 1.
P;[ Xn = i
i.o.] is either 0 or 1; compare the zero-one law (Theorem but note that the events [ Xn = i ] here are not in general independent.
Theorem 8.2. (i) Transience ofi is equivalent to P;[ Xn = i i.o.] = 0 and to Ln Pft> < oo . (ii) Persistence of i is equivalent to P;[ Xn = i i.o.] = 1 and to En pfr > = oo .
By the first Borel-Cantelli lemma, L n Pfr > < 00 implies P;[ xn = i i.o.] = 0, which by (8.17) in tum implies /;; < 1. The entire theorem will be proved if it is shown that /;; < 1 implies Ln Pfr > < oo . PROOF.
SECTION
8.
115
MARKOV CHAINS
The proof uses a first-passage argument : by (8.13),
n- 1
p!j > = P; [ xn = j ) = L P; [ xl * j , . . . ' xn - s - 1 * j , s-O
n- 1
- L
s-o
=
n-1 � �
s-o
xn - s = j , xn = j )
P; [ X1 * j , . . . , xn - s - 1 * i , xn - s = j ] lj [ Xs = j]
l_ (_n - s )p �� ) . lrJ JJ
Therefore,
n
n t- 1
t - s >p : > > = /;� : L L L P:f t- 1 t- 1 s-O n- 1
n
s-o
t-s+ 1
- L P :t > L
/;� t - s ) �
n
L P Jt >/; ; ·
s-o
Thus (1 - /;;) E7_ 1 p1f> � /;;, and if /;; < 1, this puts a bound on the partial � n ;(t) Surns f-t• 1P ; . Polya 's theorem. For the symmetric k-dimensional ran dom walk (Example . 8.4), all states are persistent if k = 1 or k = 2, and all states are transient if k > 3. To prove this, note first that the probability pfr > of return in n steps is the same for all states i; denote this probability by a�k> to indicate the dependence on the dimension k. Clearly, a ��>+ 1 = 0. Suppose that k = 1. Since return in 2n steps means n steps east and n steps west, Example 8.6.
a 2< ln) =
( 2nn ) _!_ . 22n
Stirling's formula, a�1J - ( 'ITn) - 1 12 • Therefore, En a �1 > = are persistent by Theorem 8.2. By
oo
and all states
1 16
PROBABILITY
In the plane, a return to the starting point in 2n steps means equal numbers of steps east and west as well as equal numbers north and south:
(2n )! 1 f. u!u! ( n - u ) ! ( n - u ) ! 4 2 n u == O
a(22n> _
( )
It can be seen on combinatorial grounds that the last sum is 2nn , and so a �2J = (a �1J ) 2 - ( w n) - I . Again, En a �2 ) = oo and every state is persistent. For three dimensions,
a 2( 3n) = L
( 2n ) ! u!u!v!v! ( n - u - v ) ! ( n - u - v ) !
the sum extending over nonnegative reduces to
u
and
v
1 6 2n '
satisfying
u + v < n.
This
) 2 n - 2/ ( 2 ) 2 / ) ( ( 1 2 a 2< 3>n - L � 3 3 a ��� u a W , n
(8 . 18)
_
/- 1
-
2
as can be checked by substitution. (To see the probabilistic meaning of this formula, condition on there being 2n - 21 steps parallel to the vertical axis and 21 steps parallel to the horizontal plane.) It will be shown that a �3J = O(n - 312 ), which will imply that En a�3> < oo . The terms in (8 . 18) for I = 0 and I = n are each OI( n - 312 ) and hence can be omitted. Now a �1 > < Ku - 1 12 and a�2 > < Ku - , as already seen, and so the sum in question is at most
n ) ) ) 2/ 2 2 / 1 ( ( ( 2 1 n2 3 K L ;� 3 ( 2n - 21 ) - 1 /2 ( 21 ) /-1
-•.
(2n - 21 ) - 1 12 < 2n 1 12 ( 2n - 21 ) - 1 � 4n 1 12 ( 2n - 21 + (21) - 1 � 2(21 + 1) - 1 , this is at most a constant times Since
! 2 n ) ( /2 I n 2n + 2 ( )! Thus L, n a�3> <
oo ,
1 ( 2n + 2 ) ( ! ) 2n - 2/ 1�1 2/ + 1 3
n-
+
I
1) - 1 and
( 2 ) 2/ I - O ( n - 3/2 ) . 3
+
and the states are transient. The same is true for
k=
SECTION
4, 5, . . . , since 0( n - " 12 ) .
8.
1 17
MARKOV CHAINS
an inductive extension of the argument shows that a �k >
=
•
It is possible for a system starting in i to reach j (f,.1 > 0) if and only if p1j > > 0 for some n. If this is true for all i and j, the Markov chain is
irreducible.
If the Markov chain is irreducible, then one of the following two alternatives holds. (i) All states are transient, P,. (U 1[ Xn = j i.o.]) = 0 for all i, and Ln Pfj > < oo for all i and j. (ii) All states are persistent, P,. (n1[ Xn = j i.o.]) = 1 for all i, and En p1j > = oo for all i and}. Theorem 8.3.
The irreducible chain itself can accordingly be called persistent or transient. In the persistent case the system visits every state infinitely often. In the transient case it visits each state only finitely often, hence visits each finite set only finitely often, and so may be said to go to infinity. PROOF. >
P�� > 0. }I
For each Now
i
and
j
there exist
r
and
P � � + s + n ) p ��>p � � >p � � )
(8. 19 )
II
> -
1)
}}
)1
s
such that
>
pf; > 0
and
'
and since p I}� � >p}�I� > > 0' En p I�I� > < oo implies that En p}}� � > < oo · if one state is transient, they all are. In this case (8.16) gives P,. [ Xn = j i.o.] = 0 for all i and j , so that P,.(U1[ Xn = i i.o.]) = O for all i. Since E�_ 1 p 1j > = n I(•)P ( n - • ) - �00 / ( P)�oom P (Jm ) < L.,�00m-== OP (Jm ) , 1· t f0llOWS that 1· f }· � ao � L.,.,._ I J ij L., -o { �n - 1L.,.,- 1 J ij 11 J is transient, then (Theorem 8.2) En pfj converges for every i. The other possibility is that all states are persistent. In this case lj[ Xn = j i.o.] = 1 by Theorem 8.2, and it follows by (8.13) that •
pJ,.m > = lj ( ( Xm = i ] n ( Xn = j i.o. ] ) < _
E
n>m � i..J
n>m
lj [ Xm = i, xm + l =1:- j,
•
•
•
, xn - 1 =1:- j, Xn = J ]
p � � )Jl,(_n - m ) = p � � >Jl. . . }I
I}
}I
I}
There is an m for which pJ,.m > > 0, and therefore fiJ = 1 . By (8.16), P, [ Xn = j i.o.] = /;1 = 1 . If En pfj > were to converge for some i and j, it would follow by the first Borel-Cantelli lemma that P,. [ Xn = j i.o.] = 0. •
1 18
PROBABILITY
Example B. 7.
impossible if
S
Since E1 pfj > = 1 , the first alternative in Theorem 8.3 is is finite: a finite, irreducible Markov chain is persistent.
The chain in Polya's theorem is certainly irreducible. If the dimension is 1 or 2, there is probability 1 that a particle in symmetric random walk visits every state infinitely often. If the dimension is 3 or more, • the particle goes to infinity. Example 8. 8.
Example 8. 9.
Consider the unrestricted random walk on the line (Ex ample 8.3). According to the ruin calculation (7.8), lot = pjq for p < q. Since the chain is irreducible, all states are transient. By symmetry, of course, the chain is also transient if p > q, although in this case (7.8) gives lot = 1. Thus /; = 1 (i ¢ j ) is possible in the transient case.t If p = q = , the chain is persistent by Polya' s theorem. If n and j - i have the same parity,
(
n
n +j - ; 1 li - i l < n · 2 This is maximal if j i or j i ± 1, and by Stirling' s formula the maximal value is of order n - t /2 • Therefore, lim n p �n > 0, which always holds in the =
=
=
transient case but is thus possible in the persistent case as well ( see Theorem
8.8).
•
Another Criterion for Persistence
Let Q = [ qiJ ] be a matrix with rows and columns indexed by the elements of a finite or countable set U. Suppose it is substochastic in the sense that q;1 > 0 and E1 q;1 < 1. Let Q n = [ q;)n >] be the n th power, so that � q . ,q"l<� > ' q1)� � + t ) = i..J ( 8.20 ) ,
,
Consider the row sums O.,( n ) =
( 8.21 ) From
( 8.22 )
(8.20) follows
o .< n + t ) I
=
� q��)•
i..J j
1)
� . . (J � n ) • q i..J j
1) J
t out for each j there must be some i =F j for which fij
<
1 ; see Problem 8.9.
8. MARKOV CHAINS 01.< l l � 1 , and hence o.< n + l )
1 19
SECTION
Si ncen Q is substochastic, n l r q I�JI >oJ1< ) � o.I< > Therefore the monotone limits � 'II • '
I
=
n lq .
EJ.E J1qI�JI
Jlj =
( 8.23) exist. By (8.22) and the Weierstrass M-test solve the system X;
(8.24 )
=
E jE U
[A28], o; E1q;1o1. Thus the o; =
q J x1 , i e U, i
O < x; < 1 ,
; e u.
n � � ;1 - o;( I ) , and X; < o;( ) for < L.t1q L.t1q;1x1 n n n all i implies that X; < E1q;1o} > o;< + I ) by (8.22). Thus X; < o;< > for all n by induction, and so X; < o;. Thus the o; give the maximal solution to (8.24): . · For an arb1 trary so1utton, X;
=
For a substochastic matrix solution of (8.24). Lemma
1.
Q
the limits (8.23) are the maximal
Now suppose that U is a subset of the state space S. The p iJ for i and j n in U give a substochastic matrix Q. The row sums (8.21) are o/ > = Ep;11 P1LI 2 • • • pin - Lin' where the j1 , . . . , in range over U, and so o/ n > = P; [ X1 e U, t < n ]. Let n --+ oo :
(8.25 )
o;
=
P; ( X1
E
U, t 1 , 2, . . . ) =
,
i E U.
this case, o; is thus the probability that the system remains forever in U, given that it starts at i. The following theorem is now an immediate consequence of Lemma 1. In
Theorem
8.4.
of the system
( 8.26 )
For U c S the probabilities (8.25) are the maximal solution X;
=
E jE U
piJ x1 , i e U,
0 < X; < 1 ,
i E U.
120
PROBABILITY
For the random walk on the line consider the set The system (8.26) is
Example 8.10.
{0, 1, 2, . . . }.
{ X; == PpxX; +. 1 + qxi - 1 '
Xo 1 xn = A + An
i>
U
=
1,
n
It follows [A19] that if p = q and x n = A - A(qjp) + 1 if p ¢ q. The only bounded solution is x n = 0 if q > p, and in this case there is probability 0 of staying forever among the nonnegative integers. If q < p, A = 1 gives the maximal solution xn = 1 - (qjp) n + 1 . Compare (7.8) and • Example 8.9. Now consider the system (8.26) with single state i 0 :
U = S - { i0 }
for an arbitrary
X; = E P;jxj, i ¢ i 0, j '=�= io 0 � X; < 1 ,
( 8 .27 )
There is always the trivial solution-the one for which
X; = 0.
An irreducible chain is transient zf and only if (8.27) has a nontrivial solution.
Theorem 8.5. PROOF.
The probabilities
( 8 .28 ) are by Theorem 8.4 the maximal solution of (8.27). Therefore (8.27) has a nontrivial solution if and only if /;;0 < 1 for some i ¢ i 0 • If the chain is persistent, this is impossible by Theorem 8.3(ii). Suppose the chain is transient. Since 00
P;o [ X1 = i, x2 ¢ i o , . . . ' Xn - 1 ¢ i o , xn = i o] hoio = P;o [ X1 = i o] + nE=2 iE -:l= i0 = P i0 i0 + L Pi0 i fi io ' ; '=�= io and since /;0 ;0 < 1, it follows that /;;0 < 1 for some i ¢ i 0• • Since the equations in (8.27) are homogeneous, the issue is whether they have a solution that is nonnegative, nontrivial, and bounded. If they do, 0 � x ; � 1 can be arranged by rescaling. t t see Problem 8. 1 1 .
SECTION
8.
121
MARKOV CHAIN S
In the simplest of queueing models the state space is and the transition matrix has the form
Example 8. 11.
{ 0, 1, 2, . . . }
ao aI a 2 0 0 0 ao aI a 2 0 0 0 0 ao al a 2 0 0 0 0 ao aI a 2 0 0 0 0 ao aI a 2
...
••• ••• ...
•••
.........................
If there are i customers in the queue and i > 1 , the customer at the head of the queue is served and leaves, and then 0, 1, or 2 new customers arrive (probabilities a0, a 1 , a 2 ), which leaves a queue of length i - 1, i, or i + 1. If i = 0, no one is served, and the new customers bring the queue length to 0, 1, or 2. Assume that a0 and a 2 are positive, so that the chain is irreducible. For i0 = 0 the system (8.27) is
(8 . 29 )
{ x i == aai xxi + a+2xa2 x, xk
o k-1
k > 2.
l k + a 2x k + 1 '
Since a0, a 1 , a 2 have the form q( 1 - a), a, p( 1 - a) for appropriate p, q, a, the second line of (8.29) has the form x k = px k + I + qx k _ 1 , k > 2. Now the solution [A19] is A + B( q/p) k = A + B( a0ja 2 ) k if a 0 ¢ a 2 ( p ¢ q) and A + Bk if a0 = a 2 ( p = q), and A can be expressed in terms of B because of the first equation in (8.29). The result is if a 0 ¢ a2 , if a 0 = a 2 • There is a nontrivial solution if a0 < a 2 but not if a 0 > a 2 . If a 0 < a 2 , the chain is thus transient, and the queue size goes to infinity with probability 1. If a0 � a 2 , the chain is persistent. For a nonempty queue the expected i_ncrease in queue length in one step is a 2 - a0, and the • queue goes out of control if and only if this is positive. Stationary Distributions
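A simulation makes the dichotomy visible. The sketch below is illustrative only (the particular values of a_0, a_1, a_2 and the run length are arbitrary): when a_0 > a_2 the simulated queue stays short, and when a_0 < a_2 it grows without bound.

```python
# Simulate the queueing chain of Example 8.11: from a nonempty queue the length
# moves down, stays, or moves up with probabilities a0, a1, a2; from an empty
# queue no one is served.
import random

def run_queue(a0, a1, a2, steps=20000, seed=1):
    random.seed(seed)
    length = 0
    for _ in range(steps):
        u = random.random()
        arrivals = 0 if u < a0 else (1 if u < a0 + a1 else 2)
        served = 1 if length > 0 else 0
        length += arrivals - served
    return length

print(run_queue(0.5, 0.3, 0.2))   # a0 > a2: persistent case, the queue stays short
print(run_queue(0.2, 0.3, 0.5))   # a0 < a2: transient case, the queue length keeps growing
```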
Suppose that the initial probabilities for the chain satisfy
(8 .30)
E 'IT; P ij = 'IT} '
iES
j E
S.
122
PROBABILITY
It then follows by induction that
� 'fT.p � �> 'fT.,
( 8 .31 )
i..J iES
I
1)
=
J
jE
S,
n
=
0, 1 , 2, . . . .
If 'IT; is the probability that X0 = i, then the left side of (8.31) is the probability that xn = j, and thus (8.30) implies that the probability of [ Xn = j] is the same for all n. A set of probabilities satisfying (8.30) is for this reason called a stationary distribution. The existence of such a distribu tion implies that the chain is very stable. To discuss this requires the notion of periodicity. The state j has period t if p}1n > > 0 implies that t divides n and if t is the largest integer with this property. In other words, the period of j is the greatest common divisor of the set of integers
( 8 .32 ) If the chain is irreducible, then for each pair i and j there exist for which p fj > and p}t > are positive, and of course
�� + s + n) > p � � >p � �>p � � ) PI I }} 1) )1
( 8 .33 )
-
r and s
•
Let t ; and t1 be the periods of i and j. Taking n = 0 in this inequality shows that t ; divides r + s; and now it follows by the inequality that p}1n > > 0 implies that t; divides r + s + n and hence divides n. Thus t; divides each integer in the set (8.32), and so t; � t1 . Since i and j can be interchanged in this argument, i and j have the same period. One can thus speak of the period of the chain itself in the irreducible case. The random walk on the line has period 2, for example. If the period is 1, the chain is
aperiodic.
In an irreducible, aperiodic chain, for each i and}, P0n > > 0 for all n exceeding some n0(i, j). PROOF. Since p1jm + n > > p1jm>pJ1n > , if M is the set (8.32) then m e M and n e M together imply m + n e M. But it is a fact of number theory [A21] Lemma
2.
that if a set of positive integers is closed under addition and has greatest common divisor 1, then it contains all integers exceeding some n 1 . Given i and J.' choose r so that p1)� � > > 0 If n > n 0 = n I + r then pI)� � > > pI}� � >p}}� �- r > > 0. • •
'
Suppose of an irreducible, aperiodic chain that there exists a stationary distribution-a solution of (8.30) satisfying 'IT; � 0 and L; 'IT; 1 . Theorem 8.6.
=
SECTION
Then the chain is persistent, ( 8.34 )
8.
MARKOV CHAINS
123
� ) = 'fT.} lim p � I} n
for all i and}, the 'ITi are all positive, and the stationary distribution is unique. n The main point of the conclusion is that, since p� > approaches 'ITi , the effect of the initial state i wears off. n PROOF. If the chain is transient, then p� > --+ 0 for all i and j by Theorem 8.3 , and it follows by (8.31) and the M-test that 'ITi is identically 0, which contradicts L; 'IT; = 1. The existence of a stationary distribution therefore
implies that the chain is persistent. Consider now a Markov chain with state space S X S and transition probabilities p(ij, k/) = P; k Pil (it is easy to verify that these form a stochas tic matrix). Call· this the coupled chain; it describes the joint behavior of a pair of independent systems, each evolving according to the laws of the original Markov chain. By Theorem 8.1 there exists a Markov chain ( Xn , Yn ), n = 0, 1, . . . , having positive initial probabilities and transition probabilities
P ( ( Xn + l ' Yn + t ) =
( k, / ) 1( Xn , Yn) = ( i, J ) ) = p ( ij, kl ) . n For n exceeding some n 0 depending on i, }, k, /, p< n >(ij, kl) = pf: >pi� > is positive by Lemma 2. Therefore, the coupled chain is irreducible. (The
proof of this fact requires only the assumptions that the original chain is irreducible and aperiodic.) It is easy to check that 'IT( ij) = 'IT; 'ITi forms a set of stationary initial probabilities for the coupled chain, which, like the original one, must therefore be persistent. It follows that, for an arbitrary initial state ( i, }) for the chain {( Xn , Yn )} and an arbitrary i0 in S, P;i [( Xn , Yn ) = ( i 0 , i0) i.o.] = 1. If T is the smallest integer such that XT = YT = i0, then T is finite with probability 1 under P;j· The idea of the proof is now this: Xn starts in i and yn starts in }; once xn = yn = i o occurs, xn and yn follow identical probability laws, and hence the initial states i and j will lose their influence. By (8.13) applied to the coupled chain, if m < n, then P;j [ ( xn ' yn ) =
(k' I)' T = m] t < m , ( Xm , Ym ) = ( i o , i o )] X P;o;o ( ( Xn - m ' Yn - m ) = ( k , / )]
= Pij ( ( X, , Y, ) ¢ ( i o , i o ) ,
�n - m)p �n - m > . = PI.J. ( T = m ] pt0A. 10/
124
PROBABILITY
= =I = = = k, < ] = = < ] = k ] < = k, < n ] + > ] = = k , < ] + PiJ > ] ] = k] +
Adding out I gives PiJ [ X, k , 'T m ] m] P.1 1 [ T out k oives PIJ. . [ Yn ' 'T probabilities, and add over m 1, . . . , n : =
ca
pi} [ xn
From this follows P;1 ( Xn
'T
n
PiJ ( Xn
PiJ ( Yn
=
=
=
P;1 [ T
=
m ] p 1c:ic- m ), and adding m ] p�1 11/" - m > • Take k = I' equate
pi} [ yn
k'
T
'T
< PiJ ( Yn
n
PiJ ( T >
'T
n .
P;1 ( 'T
n
( 'T
n
n .
This and the same inequality with X and Y interchanged give
Since T is finite with probability 1 ,
p�n) - p�n) lim I k l jk I n
(8 .35)
=
0•
(This proof of (8.35) requires only that the coupled chain be persistent.) By (8.31) , 'ITk - pj; > L; 'IT; ( pf; > - pj; >), and this goes to 0 by the M-test if (8.35) holds. Thus Iim n p}; > 'ITk . As this holds for each stationary distribution, there can be only one of them. It remains to show that the 'ITJ are all strictly positive. Choose r and s so that pf; > and p}t > are positive. Letting n --+ oo in (8.33) shows that 'IT; is positive if 'IT1 is; since some '"J is positive (they add to 1) , all the 'IT; must be • positive.
=
=
For the queueing model in Example 8.1 1 the equations
Example 8. 12.
(8.30) are
'ITo 'IT1 'IT2 'ITk
== 'IT'IToaao ++ 'IT1aa1o , 2a , o t 'IT1 + 'IT o = '1Toa2 + 'IT1a2 + 'IT2a1 + 'IT3ao , = 'ITk - 1a2 + 'ITkai + 'ITk + tao ,
k > 3.
Again write a 0, a 1 , a 2 , as q(1 - a), a , p(1 - a ). Since the last equation here is 'ITk q 'ITk + 1 + p 'ITk _ 1 , the solution is
=
AA Bk 'ITk = { + +
B ( p/q )
k
= A + B ( a 2ja0 ) k
if a0 if a 0
=I=
=
a2 , a2 ,
SECTION
8.
125
MARKOV CHAINS
for k > 2. If a 0 < a2 and Ewk converges, then wk 0, and hence there is no stationary distribution; but this is not new, because it was shown in Example 8.11 that the chain is transient in this case. If a 0 = a 2, there is again no stationary distribution, and this is new because the chain was in Example 8.11 shown to be persistent in this case. If a 0 > a2, then Ewk converges, provided A = 0. Solving for w0 and w1 in the first two equations of the system above gives w0 = Ba2 and w1 = Ba 2 ( 1 - a0 )/a0 • From Ekwk = 1 it now follows that B = (a0 - a 2 )/a 2, and the wk can be written down explicitly. Since wk = B(a2ja0) k for k � 2, there is small chance of a large queue length. • =
If a 0 = a2 in this queueing model, the chain is persistent (Example 8.11) but has no stationary distribution (Example 8.12). The next theorem de scribes the asymptotic behavior of the p!1n > in this case. 1beorem 8.7.
bution, then ( 8 .36) for all i and j.
If an irreducible, aperiodic chain has no stationary distri lim pI)� � > = 0
n
If the chain is transient, (8.36) follows from Theorem interesting here is the persistent case.
8.3.
What is
By the argument in the proof of Theorem 8.6, the coupled chain is irreducible. If it is transient, then E ( p�n> ) 2 converges by Theorem 8.2, and n the conclusion follows. Suppose, on the other hand, that the coupled chain is persistent. Then the stopping-time argument leading to (8.35) goes through as before. If the p�n> do not all go to 0, then there is an increasing sequence { n u } of integers along which some p�n> is bounded away from 0. By the diagonal method [A14], it is possible by passing to a subsequence of { n u } to ensure that each p�n��> converges to a li�t, which by (8.35) must be independent of i. Therefore, there is a sequence { n u } such that limup�n��> = a1 exists for all i and j, where a1 is nonnegative for all j and positive for some j. If M is a finite set of states, then E1e MaJ = limuEJ e MP�n��> < 1, and hence 0 < a = E1a1 < 1. Now Ek e MPf:��>pkJ < p�n��+ I> = Ek PikP�j��>; it is possible to pass to the limit ( u --+ oo) inside the first sum (if M is finite) and inside the second sum (by the M-test), and hence Ek e MakpkJ � EkPi kaJ = a1 . There fore, EkakpkJ � a1; if one of these inequalities were strict, it would follow that Ekak = E1EkakpkJ < E1a1, which is impossible. Therefore Eka k p k J = aj for all }, and the ratios w1 = a1ja give a stationary distribution, contrary • to the hypothesis. PROOF.
126
PROBABILITY
The limits In times. Let
(8.34) and (8.36) can be described in terms of mean return 00
n) . l. < . � n = i..J !ljj ' ,-} n� l if the series diverges, write p.i = oo. In the persistent case, this sum is to be thought of as the average number of steps to first return to j, given that Xo = j.t Lemma 3. Suppose that} is persistent and that lim n p}in > = u. Then u > 0 if and only if p.i < oo, in which case u = 1/p.i .
( 8.37)
u .
Under the convention that 0 = 1/oo, the case u = 0 and p.i = oo is consistent with the equation u = 1/p.i . n PROOF . For k > 0 let p k = En k fi} >; the notation does not show the dependence on j, which is fixed. Consider the double series 1.(2 ) + ljj 1.(3) + . . . 1_(1 ) + ljj ljj 1_(.2) + ljj 1_(3) + . . . + ljj 1_(.3 ) + + ljj >
n n to sums column th n the and 0) > k ( p to sums row The k th k fi} > (n > 1), and so [A27] the series in (8.37) converges if and only if E k p k does, in which case 00
( 8.38 )
P.j = L P k · k�O Since j is persistent, the lj-probability that the system does not hit j up to time n is the probability that it hits j after time n, and this is Pn· Therefore, 1 - pi)n > = Pi [ Xn ¢ j ) n- 1
= Pj [ Xl ¢ j, . . . , Xn ¢ J ) + L lj [ Xk = j, xk + l ¢ j, . . . , Xn ¢ J ) n-1
k-1
k = Pn + L Pj) )Pn - k ' k-1
t Since in general there is no upper bound to the number of steps to first return, it is not a simple random variable. It does come under the general theory in Chapter 4, and its expected value is indeed P.j (and (8.38) is just (5.25)) , but for the present the interpretation of P.j as an average is informal.
SECTION
and since
127
MARKOV CHAINS
1,
=
Po
8.
1
=
p0 pJJ��> + p 1 pJJ� � - 1 > +
·
· ·
+ pn 1 pJJP > + -
pn pJ�J�> •
Keep only the first k + 1 terms on the right here, and let n --+ oo; the result + P k )u. Therefore u > 0 implies that Ek p k converges, so is 1 � (p0 + that p.1 < oo . n Write x n k = P k PJ) - k ) for 0 < k � n and x n k = 0 for n < k. Then 0 < x n k < P k and Iimn x n k = p k u. If p.1 < oo , then E k p k converges and it follows by the M-test that 1 = Er- ox n k --+ Er_0 p k u. By (8.38), 1 = p.1 u, so • tha t u > 0 and u = 1jp.1 . ·
·
·
The law of large numbers bears on the relation u = 1jp.1 in the persistent case. Let Vn be the number of visits to state j up to time n. If the time from one visit to the next is about p.1 , then V,. should be about njp.1 : Vnfn 1jp.1 . But (if X0 = } ) Vn/n has expected value n - 1 EZ _ 1 pjf > , which goes to u under the hypotheses of Lemma 3 [A30]. Consider an irreducible, aperiodic, persistent chain. There are two possi bilities. If there is a stationary distribution, then the limits (8.34) are positive, and the chain is called positive persistent. It then follows by Lemma 3 that p.1 < oo and 'ITJ = 1jp.1 for all }. In this case, it is not actually necessary to assume persistence, since this follows from the existence of a stationary distribution. On the other hand, if the chain has no stationary distribution, then the limits (8.36) are all 0, and the chain is called null persistent. It then follows by Lemma 3 that p.1 = oo for all }. This, taken together with Theorem 8.3, provides a complete classification: =
For an irreducible, aperiodic chain there are three possibili The chain is transient; then for all i and}, lim n p�n > 0 and in fact
Theorem
ties:
8.8.
(i) �� n p '1� � > < 00 (ii) The chain
=
·
is persistent but there existsn no stationary distribution (the null persistent case); then for all i and}, p� > goes to 0 but so slowly that Ln P!j > = 00, and p.1 = oo. (iii) There exist stationary probabilities 'ITJ and (hence) the chain is persistent ( the positive persistent case); then for all i and}, lim n p!j > 'ITJ > 0 and p.i 1/'nj· < oo. Since the asymptotic properties of the p!j > are distinct in the three cases, =
these asymptotic properties in fact characterize the three cases.
=
128
PROBABILITY
Example 8. 13.
matrix is
Suppose that the states are 0, 1 , 2, . . . and the transition
o Po 0 q1 0 P 1
q
.
0 0
q2
.
.. ..
0 0 P2 .. .... ......... ...
where P ; and q; are positive. The state i represents the length of a success run, the conditional chance of a further success being P; · Clearly the chain is irreducible and aperiodic. A solution of the system (8.27) for testing for transience (with i0 = 0) must have the form x k = x 1 jp 1 P k - 1 . Hence there is a bounded, nontrivial solution, and the chain is transient, if and only if the limit a of p0 • • • Pn is positive. But the chance of no return to 0 (for initial state 0) in n steps is clearly p0 • • • Pn - 1 ; hence /00 = 1 - a , which checks: the chain is persistent if and only a = 0. Every solution of the steady-state equations (8.30) has the form 'ITk = 'IT0 p 0 • • • P k - 1 . Hence there is a stationary distribution if and only if E k p0 • • • p k converges; this is the positive persistent case. The null per p k --+ 0 but Ek Po p k diverges (which sistent case is that in which Po happens, for example, if qk = 1/k for k > 1). Since the chance of no return to 0 in n steps is p0 • • • Pn _ 1 , in the p k _ 1 . In the null persistent persistent case (8.38) gives p. 0 = Ek-o Po case this checks with p.0 = oo ; in the positive persistent case it gives P. o = Er-o'1Tk/'IT0 = 1/'IT0 , which again is consistent. • ·
·
·
·
·
·
·
·
·
·
·
·
Exponential Convergence *
In the finite case pfj > converges to
'ITi at an exponential rate: Theorem 8.9. If the state space is finite and the chain is irreducible and aperiodic, then there is a stationary distribution { 'IT; } , and lp � � ) 'IT · I < Apn I}
where A
>0
-
}
-
'
and 0 < p < 1.
From this it follows that a finite, irreducible, aperiodic chain cannot be null persistent. This also follows from Theorem 8.8 ; see Example 8.7. * This topic may be omitted.
SECTION
8.
129
MARKOV CHAINS
m(J n > = min . p � �> and MJ� n > = max · P � � > • By (8 • 10) ' n > = m (n ) � p . � . ( min <�> p > m ( n + l ) = min m p . i..J i..J I
Let
PROOF .
}
1)
I
I
1 11
J1
11)
-
I
J1
•
I JI
1)
}
}
'
n > = M� n > • . � . � � <�> p M � n + t > = max < max p M p . � . i..J "J J }
I
Sin ce obviously
J1
-
m} n > < M} n > , < ··· < m (1 > _ < m (2> _ 0_ }
}
( 8 .39)
1,
I
< -
J1
I JI
}
M � 2> < M � 1 > }
-
}
" �
1.
Suppose temporarily that all the P;J are positive. Let r be the number of states and let B = minii p;1 . Since E1 pii > rB, 0 < B < r - 1 • Fix states u and v for the moment; let E' denote the summation over j in S satisfying PuJ > PvJ and let E" denote summation over j satisfying PuJ < PvJ · Then (8 .40) Since E'pvj +
E"puJ > rB,
(8 .41 ) Apply (8.40) and then (8.41):
n n n _ + l + l . �( ( ( ) ) . � p p p p = Pu k vk VJ ) Jk ) i..J UJ 1
"( _ � ' ( PuJ _ PvJ ) Mk( n ) + � < i..J i..J PuJ
=
n n ) ' Ml m� p - >) L ( uj Pvj) (
< (1 Since
u
and
v
n ( m ) PvJ k )
rB ) ( Ml n > - m�n > ) .
are arbitrary,
Th erefore, Ml n > - m�n > < (1 - rB) n . It follows by (8.39) that M) n > have a common limit 'ITJ and that (8 .42 )
m5 n > and
130
PROBABILITY
Take A = 1 and p = 1 - r8. Passing to the limit in (8.10) shows that the 'IT; are stationary probabilities. (Note that the proof thus far makes almost no use of the preceding theory.) If the p;1 are not all positive, apply Lemma 2: Since there are only finitely many states, there exists an m such that pfj > > 0 for all i and j. By the case just treated M} m t > - m} m t ) < p1 Take A = p - I and then replace p by pl / m . • •
Example 8. 14.
Suppose that
P=
Po P t Pr - 1 Po •
•
•
•
•
•
• • • • • •
•
•
•
•
•
• • •
P1
•
•
Pr - 1 Pr - 2
•
•
•
•
Po
The rows of P are the cyclic permutations of the first row: piJ = p1 _; , j - i reduced modulo r. Since the columns of P add to 1 as well as the rows, the steady-state equations (8.30) have the solution 'IT; = r - 1 • If the P; are all 1 n positive, the theorem implies that p� > converges to r- at an exponential rate. If X0 , Y1 , Y2 , . . . are independent random variables with range {0, 1, . . . , r - 1 } , if each Yn has distribution { p0 , , p, _ 1 } , and if Xn = X0 + Y1 + + Yn , where the sum is reduced modulo r, then P[ Xn = j] --+ r - 1 • The Xn describe a random walk on a circle of points, and whatever the • initial distribution, the positions become equally likely in the limit. ·
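The convergence in Example 8.14 can be checked by iterating the distribution. The sketch below is illustrative only; the step distribution is an arbitrary choice with all p_i positive.

```python
# Random walk on a circle of r points: iterate the distribution under P and
# watch it approach the uniform distribution 1/r.
r = 5
step = [0.4, 0.3, 0.1, 0.1, 0.1]                                # p_0, ..., p_{r-1}, all positive
P = [[step[(j - i) % r] for j in range(r)] for i in range(r)]   # each row a cyclic shift

dist = [1.0] + [0.0] * (r - 1)                                  # start in state 0
for _ in range(100):
    dist = [sum(dist[i] * P[i][j] for i in range(r)) for j in range(r)]
print([round(x, 6) for x in dist])                              # each entry close to 1/r = 0.2
```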
·
•
·
•
•
Optimal Stopping *
Assume throughout the rest of the section that S is finite. Consider a function T on 0 for which T ( w ) is a nonnegative integer for each w. Let � = o( X0 , X1 , . . . , Xn ); T is a stopping time or a Markov time if
(8.4 3 )
[w:
T( w )
= n]
e�
for n = 0, 1, . . . . This is analogous to the condition (7. 1 8) on the gambler' s stopping time. It will be necessary to allow T ( w ) to assume the special value oo , but only on a set of probability 0. This has no effect on the requirement (8.43), which concerns finite n only. If f is a real function on the state space, then /( X0 ), /( X1 ), . . . are simple random variables. Imagine an observer who follows the successive * This topic may be omitted.
SECTION
8.
13 1
MARKOV CHAINS
X0 , X1 , X,.< c.) > ( w )), and
of the system. He stops at time T, when the state is XT (or receives an reward or payoff /( XT ). The condition (8.43) prevents prevision on the part of the observer. This is a kind of game, the stopping time is a strategy, and the problem is to find a strategy that maximizes the expected payoff E[/( XT )]. The problem in Example 8.5 had this form; there S = {1, 2, . . . , r + 1 }, and the payoff function is /(i) = i/r for i < r (set /(r + 1) = 0). If P (A) > 0 and Y = E1 y1 18 is a simple random variable, the B1 1 forming a finite decomposition of 0 into jli:sets, the conditional expected value of Y given A is defined by states
• • •
E [ Y I A ] = E Y;P( BJI A ) . 1
Denote by
E; conditional expected values for the case A E;[ Y ] = E [ YI X0 = i ] = Ey1 P;( B1 ) .
=
[ X0 = i]:
1
The stopping-time problem is to choose T so as to maximize simultaneously E;[/( XT )] for all initial states i. If x lies in the range of f, which is finite, and if T is everywhere finite, then (w: /( XT( w ) (w)) = X ) = U�-o (w: T( W ) = n, f( Xn ( w )) = x] lies in � ' and so /( XT ) is a simple random variable. In order that this always hold, put /( XT ( w ) (w)) = 0, say, if T(w) = oo (which happens only on a set of probability 0). The game with payoff function f has at i the value
(8 .44 ) the supremum extending over all Markov times T. It will tum out that the supremum here is achieved: there always exists an optimal stopping time. It will also tum out that there is an optimal T that works for all initial states i. The problem is to calculate v(i) and find the best T. If the chain is irreducible, the system must pass through every state, and the best strategy is obviously to wait until the system enters a state for which f is maximal. This describes an optimal T, and v(i ) = max / for all i. For this reason the interesting cases are those in which some states are transient and others are absorbing ( p ;; = 1 ) . A function q> on S is excessive, or superharmonic, if t
(8 .45)
E Pij <J> ( j) , j
t compare the conditions (7.28) and (7.35).
i E S.
132 In
PROBABI LITY
terms of conditional expectation the requirement is
cp(i) > E;[ cp ( X1 )).
The value function v is excessive. PROOF. Given choose for each j in S a "good" stopping time 'lj Xn ) E 11, ] satisfying E1 [f( X'i )] > v ( j ) - By (8.43), [ 'lj = n ] = [( X0 , for some set I1 n of (n + 1)-long sequences of states. Set T = n + 1 (n > 0) on the set ( X1 = j ) n (( X1 , . . . , Xn+ l ) E 11n ) ; that is, take one step and then from the new state X1 add on the "good" stopping time for that state. Then Lemma 4.
£
T
£.
•
•
•
,
is a stopping time and 00
E; [ f( XT )] = E E EP; [ XI = j , ( X1 , . . . , Xn + l ) E /Jn , Xn + l = k ] f ( k ) n -o j k 00
- E E E P;J lJ [ ( Xo, . . . , Xn ) E /Jn ' Xn = k ) f ( k ) n-o j k
= E Pij EA !( x,)] . 1
Therefore, v(i) � E; [f( XT )] > was arbitrary, v is excessive.
E1 piJ ( v ( j ) -
£
)
= E1 piJ v ( j ) -
Lemma 5. Suppose that cp is excessive. (i) For all stopping times T, cp(i) > E; [ cp( XT )] .
£.
Since
£
•
For all pairs of stopping times satisfying E; [ cp( XT )] .
o<
T,
E;[ cp( Xa)] >
Part (i) says that for an excessive payoff function optimal strategy.
T
0
represents an
(ii)
PROOF.
( 8.46 )
To prove (i), put
TN
=
= min{ T, N } . Then TN is a stopping time, and
E; [ cp ( XTN )] = E E P; [ T = n ' xn = k ] cp ( k ) N- 1
n-O k
+
EP; [ T > N, XN = k]cp ( k ) . k
8. MARKOV CHAINS Since [ T > N] = [ T < N]c e �N - 1 , the final sum here is by (8.13)
133
SECTION
E E P; [ T > N, XN - 1 = j , XN = k]cp(k) k j
= E E P; [ T > N, XN - 1 = j ] PJkcp ( k ) < E P; [ T > N, XN - 1 = j ] cp ( j ) . 1
k j
Substituting this into (8.46) leads to E;[ cp ( XTN )] < E; [ cp( XTN _ 1 )]. Since To = 0 and E;[ cp ( X0 )] = cp (i), E; [ cp ( XTN )] < cp(i ) for all N. But for T( w) finite, q> ( XTN < "' > ( w )) --+ cp( XT< "' > (w)) (there is equality for large N), and so E; [ cp( XTN )] --+ E; [ cp( XT )] by Theorem 5.3. The proof of (ii) is essentially the same. If TN = min{ T, o + N ) , then TN is a stopping time, and oo
N- 1
E; [ cp ( XTN ) ] = E E E P;( o = m , T = m + n , xm + n = k ]cp( k ) m=O n=O
k
00
+ E E P; ( o = m , T > m + N, Xm + N = k] cp ( k ) . m=O
k
Since [o = m, T > m + N] = [o = m] - [o = m, T < m + N] e �+ N - 1 , again E;[cp ( XTN )] < E; [ q> ( XTN _ 1 )] < E; [cp( XT0 )]. Since T0 = o, part (ii) fol lows from part (i) by another passage to the limit. •
If an excessive function cp dominates the payofffunction f, then it dominates the value function v as well. Lemma 6.
all
By definition, to say that
i.
By Lemma 5, cp(i) > and so cp(i ) > v(i) for all i.
PROOF.
,. ,
g
dominates
8.10.
dominating f.
is to say that
g(i) > h(i)
for
E; (cp( XT )] > E; [f( XT )] for all Markov times
Since T = 0 is a stopping time, immediately characterize v: Theorem
h
•
v
dominates
f.
Lemmas
4
and
6
The value function v is the minimal excessive function
There remains the problem of constructing the optimal strategy T. Let M be the set of states i for which v(i) = f(i); M, the support set, is nonempty
134
PROBABILITY
since it at least contains those i that maximize f. Let A = n:o= o [ Xn � M] be the event that the system never enters M. The following argument shows that P; (A) = 0 for each i. As this is trivial if M = S, assume that M ¢ S. Choose 8 > 0 so that /(i) < v(i) - 8 for i e S - M. Now E;[/( XT )] = E�-o E k P; [T = n, Xn = k]f(k); replacing the /(k) by v(k) or v(k) - 8 according as k e M or k e S - M gives E; [/( XT )] < E; [v( XT )] - 8P; [XT E S - M ] < E; [v( XT )] - 8P; (A) < v(i) - 8P; (A), the last inequality by Lemmas 4 and 5. Since this holds for every Markov time, taking the supremum over T gives P; (A) = 0. Whatever the initial state, the system is thus certain to enter the support set M. Let T0 ( w ) = min[n: Xn (w) E M] be the hitting time for M. Then T0 is a Markov time and To = 0 if X0 E M. It may be that Xn (w) � M for all n, in which case T0 ( w) = oo , but as just shown, the probability of this is 0.
Theorem 8.11. The hitting time τ_0 is optimal: E_i[f(X_{τ_0})] = v(i) for all i.

PROOF. By the definition of τ_0, f(X_{τ_0}) = v(X_{τ_0}). Put φ(i) = E_i[f(X_{τ_0})] = E_i[v(X_{τ_0})]. The first step is to show that φ is excessive. If τ_1 = min[n: n ≥ 1, X_n ∈ M], then τ_1 is a Markov time and

E_i[v(X_{τ_1})] = Σ_{n=1}^{∞} Σ_{k∈M} P_i[X_1 ∉ M, ..., X_{n−1} ∉ M, X_n = k] v(k)
             = Σ_{n=1}^{∞} Σ_{k∈M} Σ_{j∈S} p_ij P_j[X_0 ∉ M, ..., X_{n−2} ∉ M, X_{n−1} = k] v(k)
             = Σ_j p_ij E_j[v(X_{τ_0})].

Since τ_0 ≤ τ_1, E_i[v(X_{τ_0})] ≥ E_i[v(X_{τ_1})] by Lemmas 4 and 5. This shows that φ is excessive. And φ(i) ≤ v(i) by the definition (8.44). If φ(i) ≥ f(i) is proved, it will follow by Theorem 8.10 that φ(i) ≥ v(i) and hence that φ(i) = v(i).

Since τ_0 = 0 for X_0 ∈ M, i ∈ M implies that φ(i) = E_i[f(X_0)] = f(i). Suppose that φ(i) < f(i) for some values of i in S − M, and choose i_0 to maximize f(i) − φ(i). Then ψ(i) = φ(i) + f(i_0) − φ(i_0) dominates f and is excessive, being the sum of a constant and an excessive function. By Theorem 8.10, ψ must dominate v, so that ψ(i_0) ≥ v(i_0), or f(i_0) ≥ v(i_0). But this implies that i_0 ∈ M, a contradiction. ∎

The optimal strategy need not be unique. If f is constant, for example, all strategies have the same value.
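Theorems 8.10 and 8.11 suggest a simple way to compute v and the support set M numerically for a finite chain. The sketch below is my own illustration, not part of the text (it assumes the numpy library): it iterates v ← max(f, Pv), which for finite S converges upward from f to the minimal excessive function dominating f, and then reads off M as the set where v = f. Running it on the absorbing random walk of Examples 8.2 and 8.15 below, with f the indicator of state r, returns v(i) ≈ i/r and M = {0, r}.

```python
import numpy as np

def value_function(P, f, tol=1e-12, max_iter=100_000):
    """Iterate v <- max(f, P v) to approximate the minimal excessive
    function dominating f (the value function of Theorem 8.10)."""
    f = np.asarray(f, dtype=float)
    v = f.copy()
    for _ in range(max_iter):
        w = np.maximum(f, P @ v)          # one step of the recursion
        if np.max(np.abs(w - v)) < tol:
            return w
        v = w
    return v

# Symmetric random walk on {0, 1, ..., r} with absorbing barriers,
# payoff f = indicator of state r (Examples 8.2 and 8.15).
r = 6
P = np.zeros((r + 1, r + 1))
P[0, 0] = P[r, r] = 1.0
for i in range(1, r):
    P[i, i - 1] = P[i, i + 1] = 0.5
f = np.zeros(r + 1)
f[r] = 1.0

v = value_function(P, f)
M = [i for i in range(r + 1) if abs(v[i] - f[i]) < 1e-9]   # support set
print(v)   # approximately i/r
print(M)   # [0, 6]
```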
Example 8.15. For the symmetric random walk with absorbing barriers at 0 and r (Example 8.2), a function φ on S = {0, 1, ..., r} is excessive if φ(i) ≥ ½φ(i − 1) + ½φ(i + 1) for 1 ≤ i ≤ r − 1. The requirement is that φ give a concave function when extended by linear interpolation from S to the entire interval [0, r]. Hence v thus extended is the minimal concave function dominating f. The figure shows the geometry: the ordinates of the dots over 0, 1, ..., r are the values of f, and the polygonal line describes v. The optimal strategy is to stop at a state for which the dot lies on the polygon.

If f(r) = 1 and f(i) = 0 for i < r, v is a straight line: v(i) = i/r. The optimal Markov time τ_0 is the hitting time for M = {0, r}, and v(i) = E_i[f(X_{τ_0})] is the probability of absorption in the state r. This gives another solution of the gambler's ruin problem for the symmetric case. ∎
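Since Example 8.15 identifies v with the minimal concave majorant of f, one can also compute it directly, without iterating the chain, by taking the upper concave envelope of the points (i, f(i)). The sketch below is my own illustration under that reading of the example, not part of the text; it assumes numpy.

```python
import numpy as np

def minimal_concave_majorant(f):
    """Upper concave envelope of the points (i, f(i)), i = 0, ..., r.
    By Example 8.15 this is the value function v for the symmetric
    random walk with absorbing barriers; M is where it touches f."""
    hull = []                                  # envelope vertices, left to right
    for x, y in enumerate(f):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # drop the middle point if it lies on or below the new chord
            if (y2 - y1) * (x - x2) <= (y - y2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, y))
    xs, ys = zip(*hull)
    v = np.interp(np.arange(len(f)), xs, ys)   # evaluate the polygon at each state
    M = [i for i in range(len(f)) if abs(v[i] - f[i]) < 1e-12]
    return v, M

f = [0.0, 0.3, 0.1, 0.8, 0.2, 0.4, 1.0]        # an arbitrary payoff on {0, ..., 6}
v, M = minimal_concave_majorant(f)
print(v, M)                                     # stop exactly at the states in M
```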
Example 8.16. For the selection problem in Example 8.5, the p_ij are given by (8.5) and (8.6) for 1 ≤ i ≤ r, while p_{r+1, r+1} = 1. The payoff is f(i) = i/r for i ≤ r and f(r + 1) = 0. Thus v(r + 1) = 0, and since v is excessive,

(8.47)    v(i) ≥ g(i) = Σ_{j=i+1}^{r} [i/(j(j − 1))] v(j),    1 ≤ i < r.

By Theorem 8.10, v is the smallest function satisfying (8.47) and v(i) ≥ f(i) = i/r, 1 ≤ i ≤ r. Since (8.47) puts no lower limit on v(r), v(r) = f(r) = 1, and r lies in the support set M. By minimality,

(8.48)    v(i) = max{f(i), g(i)},    1 ≤ i ≤ r.

If i ∈ M, then f(i) = v(i) ≥ g(i) ≥ Σ_{j=i+1}^{r} i j^{−1}(j − 1)^{−1} f(j) = f(i) Σ_{j=i+1}^{r} (j − 1)^{−1}, and hence Σ_{j=i+1}^{r} (j − 1)^{−1} ≤ 1. On the other hand, if this inequality holds and i + 1, ..., r all lie in M, then g(i) = Σ_{j=i+1}^{r} i j^{−1}(j − 1)^{−1} f(j) = f(i) Σ_{j=i+1}^{r} (j − 1)^{−1} ≤ f(i), so that i ∈ M by (8.48). Therefore, M = {i_r, i_r + 1, ..., r, r + 1}, where i_r is determined by

(8.49)    1/i_r + ⋯ + 1/(r − 1) ≤ 1 < 1/(i_r − 1) + 1/i_r + ⋯ + 1/(r − 1).

If i < i_r, so that i ∉ M, then v(i) > f(i) and so, by (8.48),

v(i) = g(i) = Σ_{j=i+1}^{i_r − 1} [i/(j(j − 1))] v(j) + (i/r)(1/(i_r − 1) + ⋯ + 1/(r − 1)).
It follows by backward induction starting with i = i_r − 1 that

(8.50)    v(i) = p_r ≡ ((i_r − 1)/r)(1/(i_r − 1) + 1/i_r + ⋯ + 1/(r − 1))

is constant for 1 ≤ i < i_r. In the selection problem as originally posed, X_1 ≡ 1. The optimal strategy is to stop with the first X_n that lies in M. The princess should therefore reject the first i_r − 1 suitors and accept the next one who is preferable to all his predecessors (is dominant). The probability of success is p_r as given by (8.50). Failure can happen in two ways. Perhaps the first dominant suitor after i_r is not the best of all suitors; in this case the princess will be unaware of failure. Perhaps no dominant suitor comes after i_r; in this case the princess is obliged to take the last suitor of all and may be well aware of failure. Recall that the problem was to maximize the chance of getting the best suitor of all rather than, say, the chance of getting a suitor in the top half.

If r is large, (8.49) essentially requires that log r − log i_r be near 1, so that i_r ≈ r/e. In this case, p_r ≈ 1/e. Note that although the system starts in state 1 in the original problem, its resolution by means of the preceding theory requires consideration of all possible initial states. ∎
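A short computation, mine rather than the text's, confirms the threshold rule numerically: it locates i_r from (8.49) by accumulating the harmonic sums with exact fractions and evaluates p_r from (8.50), which indeed approaches 1/e as r grows.

```python
from fractions import Fraction

def optimal_threshold(r):
    """Smallest i_r with 1/i_r + ... + 1/(r-1) <= 1, as in (8.49); r >= 3."""
    s, i = Fraction(0), r
    while i > 1 and s + Fraction(1, i - 1) <= 1:
        i -= 1
        s += Fraction(1, i)
    return i, s          # s = 1/i_r + ... + 1/(r-1)

def success_probability(r):
    """p_r of (8.50): ((i_r - 1)/r) * (1/(i_r - 1) + ... + 1/(r - 1))."""
    i_r, s = optimal_threshold(r)
    return i_r, Fraction(i_r - 1, r) * (Fraction(1, i_r - 1) + s)

for r in (4, 10, 100, 1000):
    i_r, p_r = success_probability(r)
    print(r, i_r, float(p_r))   # i_r is near r/e, p_r is near 1/e = 0.3679...
```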
This theory carries over in part to the case of infinite S, although this requires the general theory of expected values, since f(X_τ) may not be a simple random variable. Theorem 8.10 holds for infinite S if the payoff function is nonnegative and the value function is finite.† But then problems arise: optimal strategies may not exist, and the probability of hitting the support set M may be less than 1. Even if this probability is 1, the strategy of stopping on first entering M may be the worst one of all.‡

†The only essential change in the argument is that Fatou's lemma (Theorem 16.3) must be used in place of Theorem 5.3 in the proof of Lemma 5.
‡See Problems 8.34 and 8.35.

PROBLEMS
8.1. Show that (8.2) holds if the first and third terms are equal for all n, i_0, ..., i_n, and j.

8.2. Let Y_0, Y_1, ... be independent and identically distributed with P[Y_n = 1] = p, P[Y_n = 0] = q = 1 − p, p ≠ q. Put X_n = Y_n + Y_{n+1} (mod 2). Show that X_0, X_1, ... is not a Markov chain even though P[X_{n+1} = j | X_{n−1} = i] = P[X_{n+1} = j]. Does this last relation hold for all Markov chains? Why?

8.3. Show by example that a function f(X_0), f(X_1), ... of a Markov chain need not be a Markov chain.

8.4. 4.18 ↑ Given a 2 × 2 stochastic matrix and initial probabilities w_0 and w_1, define P for cylinders in sequence space by P[ω: a_i(ω) = u_i, 1 ≤ i ≤ n] = w_{u_1} p_{u_1 u_2} ⋯ p_{u_{n−1} u_n} for sequences u_1, ..., u_n of 0's and 1's (Problem 4.18 is the case p_ij = p_j). Extend P as a probability measure to ℱ_0 and then to ℱ. Show that a_1(ω), a_2(ω), ... is a Markov chain with transition probabilities p_ij.

8.5. Show that
•
•
+
+
8.3. 8.4.
•
· · ·
8.5.
• • •
/,. . I)
oo
� �
k -0
p))�� ) ==
oo
� �
n
� �
n-1 m=1
j,.< !" >p � � - m ) == 1) ))
and prove that if j is transient, then E n P�j' > Show that if j is transient, then
8.6.
8.7.
• •
<
oo
oo
� �
n-1
p� n ) ' 1)
(compare Theorem 8.3(i)).
specialize to the case i == j: in addition to implying that i is transient (Theorem 8.2(i)), a finite value for E�_ 1 p1r > suffices to determine /;; exactly. Call { X; } a subsolution of (8.24) if X; < E 1 qiJx and 0 < X; < 1, i e U. Extending Lemma 1, show that a subsolution (X; } satisfies X; < o; : the solution { o; } of (8.24) dominates all subsolutions as well as all solutions. Show that if X; == E1 qiJ xi and - 1 < X; < 1, then { 1 X; 1 } is a subsolution of (8.24). Find solutions to (8.24) that are not maximal.
138
8.8.
8.9.
PROBAB ILITY
Show by solving (8.27) that the unrestricted random walk on the line (Example 8.3) is persistent if and only if p == t . (a) Generalize an argument in the proof of Theorem 8.5 to show that /; k = p; k + EJ k piJJjk Generalize this further to •
•
JiI k == Jit(lk ) + . . . +Jit(kn ) + E P; [ XI
i• k
¢
k ' . . . ' xn - I
¢
k ' xn == j ] Jjk •
Take k == i. Show that /, > 0 if and only if P; [ XI -:1= i, . . . ' xn - 1 ¢ i, xn == j] > 0 for some n, and conclude that i is transient if and only if � ; < 1 for some j ¢ i such that fiJ > 0. (c) Show that an irreducible chain is transient if and only if for each i there is a j ¢ i such that Jj; < 1. 8. 10. Suppose that S == (0, 1, 2, . . . }, p00 == 1, and /;0 > 0 for all i. (a) Show that P; (Uj.... 1 [ Xn == j i.o.]) == 0 for all i. (b) Regard the state as the size of a population and interpret the conditions p00 == 1 and /;0 > 0 and the conclusion in part (a). 8.1 1. 8.6 f Show for an irreducible chain that (8.27) has a nontrivial solution if and only if there exists a nontrivial, bounded sequence { X; } (not necessarily nonnegative) satisfying X; == EJ piJ xJ , i -:1= i0• (See the remark following the proof of Theorem 8.5.) (b)
• ;0
8.12.
f Show that an irreducible chain is transient if and only if (for arbitrary i0 ) the system Y; == EJ PJ.J.Y.i ' i ¢ i0 (sum over all j) has a bounded, noncon stant solution { Y; , i e .s } .
8.13. Show that the P;-probabilities of ever leaving U for i e U are the minimal solution of the system
Z; == L PijZj + L Pi} ' i E U,
{ 8.51 )
jE U
0 S Z; < 1 ,
jE U
i E U.
The constraint z; < 1 can be dropped: the minimal solution automatically satisfies it, since z; 1 is a solution. 8.14. Show that supiJ n0(i, j) == oo is possible in Lemma 2. 8.15. Suppose that { 'IT; } solves (8.30), where it is assumed that E; l w; l < oo , so that the left side is well defined. Show in the irreducible case that the 'IT; are either all positive or all negative or all 0. Stationary probabilities thus exist in the irreducible case if and only if (8.30) has a nontrivial solution { 'IT; } (L, w, absolutely convergent). 8.16. Show by example that the coupled chain in the proof of Theorem 8.6 need not be irreducible if the original chain is not aperiodic. =
SECTION
8.
139
M.AllKOV CHAINS
8. 17. Suppose that S consists of all the integers and - 13 ' - 1 - p0. 0 - p0, + 1 Pk . k - 1 == q , Pk . k + 1 == p , Pk . k - 1 == p , Pk . k + 1 == q , n
.rO,
k < -1 k > 1. -
'
Show that the chain is irreducible and aperiodic. For which p's is the chain persistent? For which p 's are there stationary probabilities? 8.18. Show that the period of j is the greatest common divisor of the set ( 8.52) 8.19.
f
s
be nonnegative numbers with I == EC:.. 1 In Recurrent events. Let 11 , 12 , 1. Define u 1 , u 2 , recursively by u 1 == 11 and • •
• • •
•
( 8.53) (a) Show that I < 1 if and only if En un < oo .
(b)
Assume that I == 1, set
( 8.54)
p.
== E�_ 1 nln , and assume that
gcd[ n : n
�
1 , In > 0] == 1 .
Prove the renewal theorem: Under these assumptions, the limit u == lim n u n exists, and u > 0 if and only if p. < oo, in which case u == 1 Ip.. Although these definitions and facts are stated in purely analytical terms, they have a probabilistic interpretation: Imagine an event I that may occur at times 1, 2, . . . . Suppose In is the probability I occurs first at time n. Suppose further that at each occurrence of I the system starts anew, so that In is the probability that I next occurs n steps later. Such an I is called a recurrent event. If un is the probability that I occurs at time n , then (8.53) holds. The rec11rrent event I is called transient or persistent according as I < 1 or I == 1, it is called aperiodic if (8.54) holds, and if I == 1, p. is interpreted as the mean recurrence time. 8.20. Show for a two-state chain with S == {0, 1 } that pbQ> == p 1 0 + ( p00 p 1 0 ) p6{; _ 1+,. Derive an explicit formula for pbQ> and show that it converges to p 1 0/( p01 p 1 0 ) if all p, are positive. Calculate the other P1j> and verify that the limits satisfy (8. JO). Verify that the rate of convergence is exponen tial. 8.21. A thinker who owns r umbrellas travels back and forth between home and office, taking along an umbrella (if there is one at hand) in rain (probability p) but not in shine (probability q). Let the state be the number of umbrellas at hand, irrespective of whether the thinker is at home or at work. Set up the transition matrix and find the stationary probabilities. Find the steady-state probability of his getting wet, and show that five umbrellas will protect him at the 5% level against any climate (any p ) . •
140
PROBABILITY
8.22. (a) A transition matrix is doub�v stochastic if E; p, 1 == 1 for each j. For a finite, irreducible, aperiodic chain with doubly stochastic transition matrix, show that the stationary probabilities are all equal. (b) Generalize Example 8.14: Let S be a finite group, let p ( i ) be probabili I ties, and put p; 1 == p(j · ; - ) , where product and inverse refer to the group operation. Show that, if all p ( i ) are positive, the states are all equally likely in the 1imi t. (c) Let S be the symmetric group on 52 elements. What has (b) to say about card shuffling? 8.23. A set e in S is closed if E1 c pi J == 1 for i e e: once the system enters C it cannot leave. Show that a chain is irreducible if and only if S has no proper closed subset. 8.24. f Let T be the set of transient states and define persistent states i and j (if there are any) to be equivalent if /;1 > 0. Show that this is an equivalence relation on S - T and decomposes it into equivalence classes e1 , C2 , , so that S == T u C1 u e2 u . . . . Show that each em is closed and that /;; == 1 for i and j in the same em . 8.25. 8.13 8.23 f Let T be the set of transient states and let e be any closed set of persistent states. Show that the P;-probabilities of eventual absorption in C for i e T are the minimal solution of ·
e
• • •
Y; ==
( 8.55)
E
jE T
P;J J'.i
0 < Y; < 1 ,
+ E Pi } ' jE C
iE
T,
ie
T.
8.26. Suppose that an irreducible chain has period t > 1. Show that S decomposes into sets 50 , . . . , S,_ 1 such that piJ > 0 only if i E S, and j E S,+ for some 11 ( 11 + .1 reduced modulo t). Thus the system passes through the S, in cyclic successton. 8.27. f Suppose that an irreducible chain of period t > 1 has a stationary distribution { '17) } · Show that, if i E s, and j E s,+ cr ( 11 + a reduced modulo t). ' then lim n p� �r + cr ) == 'fT. Show that lim n n - I�"-- mn -= 1 p � �) == w.jt for all i and 1
1·
IJ
J"
'l
J
8.28. Suppose that { Xn } is a Markov chain with state space S and put Y, == ( Xn , Xn + 1 ). Let T be the set of pairs ( i , j ) such that P;1 > 0 and show that Y, is a Markov chain with state space T. Write aown the transition probabilities. Show that, if { Xn } is irreducible and aperiodic, so is { Y,, } . Show that, if "'; are stationary probabilities for { Xn }, then "'; piJ are stationary probabilities for { Y, }. 8.29. 6.10 8.28 f Suppose that the chain is finite, irreducible, and aperiodic and that the initial probabilities are the stationary ones. Fix a state i, let A n == [ Xn == i ], and let Nn be the number of passages through i in the first n steps. Calculate an and fln as defined by (5.37). Show that fln - a; = 0(1/n ) , so that n - 1 Nn --+ with probability 1 . Show for a function f on the state space that n - 1 E'ic .... 1 /( Xk ) --+ E;w;/(i) with probability 1. Show that n - 1 E'ic .... l g( Xk , xk + l > --+ E j wi pi j g,i , j ) for functions g on s X s. 'IT ;
i
SECTION
8.30.
8.
141
MARKOV CHAINS
, in , put p, ( w ) 8. 2 9 f If X0 ( w ) == i0 , , X, ( w ) == i, for states i0 == w, p, ; · · · p, , , so that p, ( w ) is the probability of the observation obse�e�� Show ffi�i - n - • tog p, ( w ) --+ h == - E, 1 P; log p, 1 with probabil ity 1 if the chain is finite, irreducible, and aperiodic. & tend to this case the notions of source, entropy, and asymptotic equipartition. A sequence { Xn } is a Markov chain of second order if P [ Xn + 1 == j l X0 == i0 , , Xn == in ] = P[ Xn + t == j ! Xn - t == i n_ 1 , Xn == in ] = P;n_ 1 ;n: 1 Show that nothing really new is involved because the sequence of pairs ( xn ' xn + 1 ) is an ordinary Markov chain (of first order). Compare Problem 8.28. Generalize this idea into chains of order r. Consider a chain on S == {0, 1, . . . , r }, where 0 and r are absorbing states and Pi , i + 1 == P; > 0, P; , ; - 1 == q; == 1 - P; > 0 for 0 < i < r. Identify state i with a point z; on the line, where 0 == z0 < · · · < z, and the distance from z; to z; + 1 is q;/P; times that from z; _ 1 to z; . Given a function q> on S, consider the associated function ip on [0, z, ] defined at the z; by ip(z; ) == q> ( i ) and in between by linear interpolation. Show that q> is excessive if and only if ip is concave. Show that the probability of absorption in r for initial state i is t; _ 1 /t, _ 1 , where I; == E� _0q1 Pk · Deduce (7.1). Show that qk /p 1 in the new scale the expected distance moved on each step is 0. Suppose that a finite chain is irreducible and aperiodic. Show by Theorem 8.9 that an excessive function must be constant. Alter the chain in Example 8.13 so that q0 == 1 - p0 == 1 (the other P; and q; still positive). Let {J == limn p 1 p, and assume that {J > 0. Define a payoff function by /(0) = 1 and /(i) == 1 - /;0 for i > 0. If X0 , , Xn are positive, put a, == n ; otherwise let an be the smallest k such that X,._ == 0. Show that E; [/( Xa )] --+ 1 as n --+ oo , so that v(i) 1 . Thus the support set is M = {0} , and for an initial state i > 0 the probability of ever hitting M is /; o < 1. For an arbitrary finite stopping time T, choose n so that P; [ T < n = a, ] > 0. Then E; [/( XT )] � 1 - h + n . o P; [ T < n == a, ] < 1. Thus no strategy achieves the value v(i) (except of course for i == 0). f Let the chain be as in the preceding problem, but assume that {J == 0, so that /;0 == 1 for all i. Suppose that A 1 , A 2 , exceed 1 and that A 1 A n --+ A < oo ; put /(0) == 0 and /(i) == A 1 A; _ 1 /p 1 P t · For an arbitrary (finite) stopping time T, the event [ T == n ] must have the- form [( X0 , , Xn ) e /n ) for some set In of ( n + 1)-long sequences of states. Show that for each i there is at most one n � 0 such that (i, i + 1 , . . . , i + n ) e ln. If there is no such n , then E; [ /( XT )] == 0. If there is one, then 6.14
• • •
• • • •
'IT;
8.31.
•
8.32.
•
•
•
•
8.33. 8.34.t
•
•
•
•
•
•
•
•
•
•
•
=
8.35.
•
•
•
•
•
•
•
•
•
•
•
•
;
•
•
.
•
.
and hence the only possible values of E; [ f( XT )] are 0 , /( i ) , P;/( i + 1 )
= /( i) A ; ,
P; P; + 1 /( i + 2) == /( i) A; A; + 1 ,
•
†The last two problems involve expected values for random variables with infinite range.
.
Thus v(i) = f(i)λ/λ_1 ⋯ λ_{i−1} for i ≥ 1; no strategy achieves this value. The support set is M = {0}, and the hitting time τ_0 for M is finite, but E_i[f(X_{τ_0})] = 0.

SECTION 9. LARGE DEVIATIONS AND THE LAW OF THE ITERATED LOGARITHM*
It is interesting in connection with the strong law of large numbers to estimate the rate at which S_n/n converges to the mean m. The proof of the strong law used upper bounds for the probabilities P[|S_n − m| ≥ a] for large a. Accurate upper and lower bounds for these probabilities will lead to the law of the iterated logarithm, a theorem giving very precise rates for S_n/n → m. The first concern will be to estimate the probability of large deviations from the mean, which will require the method of moment generating functions. The estimates will be applied first to a problem in statistics and then to the law of the iterated logarithm.

Moment Generating Functions
Let X be a simple random variable assuming values x_1, ..., x_l with respective probabilities p_1, ..., p_l. Its moment generating function is

(9.1)    M(t) = E[e^{tX}] = Σ_{i=1}^{l} p_i e^{t x_i}.

(See (5.16) for expected values of functions of random variables.) This function, defined for all real t, can be regarded as associated with X itself or as associated with its distribution, that is, with the measure on the line having mass p_i at x_i (see (5.9)). If c = max_i |x_i|, the partial sums of the series e^{tX} = Σ_{k=0}^{∞} t^k X^k/k! are bounded by e^{|t|c}, and so the corollary to Theorem 5.3 applies:

(9.2)    M(t) = Σ_{k=0}^{∞} (t^k/k!) E[X^k].

Thus M(t) has a Taylor expansion, and as follows from the general theory [A29], the coefficient of t^k must be M^{(k)}(0)/k!. Thus

(9.3)    M^{(k)}(0) = E[X^k].

*This section may be omitted.
9.
LARGE DEVIATIONS
143
Furthermore, term-by-term differentiation in (9.1) gives
M
==
I
L P;x fe'x; E [ X ke' x ] ;
i- 1
==
taking t = 0 here gives (9.3) again. Thus the moments of X can be calculated by successive differentiation, whence M( t ) gets its name. Note that M(O) = 1. If X assumes the values 1 and 0 with probabilities p and q = 1 - p , as in Bernoulli trials, its moment generating function is M(t) == pe 1 + q. The first2 two moments are M '(O) == p and M "(O) == p, and the variance is p - p == pq. • Example 9. 1.
If XI , . . . ' xn are independent, then for each t (see the argument follow ing (5.7)), e 1 x1 , • • • , e' x, are also independent. Let M and M1 , • • • , Mn be the respective moment generating functions of s == xl + . . . + xn and of X1, • • • , Xn ; of course, e 1 5 == II ; e 1 x'. Since by (5.21) expected values multiply for independent random variables, there results the fundamental relation
(9 . 4) This is an effective way of calculating the moment generating function of the sum S. The real interest, however, centers on the distribution of S, and so it is important to know that distributions can in principle be recovered from their moment generating functions. Consider along with (9.1) another finite exponential sum N(t) == E1q1e 'Y1, and suppose that M( t) == N( t) for all t. If X;0 == max X; and y10 == max y1 , then M(t) - P ;0e 1x;o and N(t) - q10e'YJo as t --+ oo , and so X;0 = y10 and 1x; == E ,. ioq1e'Y1, and The same argument now applies to E; .,. e = • ; ; ; P 0 1 P10 q0 it follows inductively that with appropriate relabeling, X; == Y; and P; == q; for each i. Thus the function (9.1) does uniquely determine the X; and P; · If X1 , • • • , Xn are independent, each assuming values 1 and 0 with probabilities p and q, then s == xl + . . . + xn is the number of successes in n Bernoulli trials. By (9.4) and Example 9.1, S has the moment generating function EXIIIple II 9.2.
144
PROBABILITY
The right-hand form shows this to be the moment generating function of a n distribution with mass ( = )p kq - k at the integer k, 0 < k < n. The unique ness just established therefore yields the standard fact that P[S = k ]
= ( =)p kq n - k . The
•
cumulant generating function of X (or of its distribution) is C ( t ) = log M ( t ) = log E ( e' x ).
( 9.5 )
(Note that M( t) is strictly positive.) Since - ( M') 2 )/M 2, and since M(O) = 1 ,
C' = M'/M and C " = (MM"
C (O) = 0, C' (O ) = E ( X ] , C " (O) = Var( X ] . Let m k = E [ X k ]. The leading term in (9.2) is m 0 = 1 , and so expansion of the logarithm in (9.5) gives v v+1 1 oo ) oo k . t C( t ) = E ( ( 9.7 ) E� k- 1 k . (9.6)
v-1
)
(
V
a formal
1 as t � 0, this expansion is valid for t in some neighbor Since M( t ) hood of 0. By the theory of series, the powers on the right can be expanded ; and terms with a common factor t collected together. This gives an . expansion .-.
( 9.8 )
c( t )
=
00
� i..J
,. _
1
C;
.
7f t ' , I.
valid in some neighborhood of 0. The C; are the cumulants of X. Equating coefficients in the expansions (9.7) and (9.8) leads to c1 = m 1 and c2 = m 2 - m�, which checks with (9.6). Each c; can be expressed as a polynomial in m 1 , , m ; and conversely, although the calculations soon become tedious. If E[ X] = 0, however, so that m 1 = c 1 = 0, it is not hard to check that •
.
•
( 9.9) Taking logarithms converts the multiplicative relation (9.4) into the additive relation
(9.10)
SECTION
9.
LARGE DEVIATIONS
145
M (t)
--�----�----� � T
for the corresponding cumulant generating functions; it is valid in the presence of independence. By this and the definition (9.8), it follows that cumulants add for independent random variables. Clearly, M "(t) = E[ X 2e 1 x ] > 0. Since ( M'(t)) 2 = E 2 [ Xe 1 x ] < E[e 1 x ] E [ X 2e 1 x ] = M(t)M "(t) by Schwarz's inequality (5.32), C"(t) > 0. Thus ·
the moment generating function
convex.
and the cumulant generating function are both
Large Deviations Let Y be a simple random variable assuming values y1 with probabilities p1 . The probl�m i� to estimate P[Y > a] when Y has mean 0 and a is positive. It is not.ationally convenient to subtract a away from Y and instead estimate P[Y > 0] when Y has negative mean. Assume then that
(9.11 )
P ( Y > O] > O, the second assumption to avoid trivialities. Let M(t) = E1 p1 e'Y1 be the moment generating function of Y. Then M '(0) < 0 by the first assumption in (9.11), and M(t) � oo as t --+ oo by the second. Since M(t) is convex, it has its minimum p at a positive argument T: (9.12 ) inf M ( t ) = M ( T ) = p , 0 < p < 1 , T > 0. E ( Y ] < O,
I
Construct (on an entirely irrelevant probability space) an auxiliary ran dom variable Z such that
(9.13 ) for each y1 in the range of Y. Note that the probabilities on the right do add to 1. The moment generating function of Z is
(9.1 4 )
146
PROBABILITY
and therefore
( 9.15)
M' T ) ( E ( Z ) = p = 0,
For all positive t, (5.27), and hence
T ) M" ( 2 2 s = E ( Z ] = p > 0.
P[Y 0] = P[e 1 Y > 1] < M(t) by Markov's inequality �
P ( Y > 0)
( 9.16 )
<
p.
Inequalities in the other direction are harder to obtain. If summation over those indices j for which yj > 0, then
Put the final sum here in the form e - 1, and let p = P[Z > � 0. Since log x is concave, Jensen's inequality (5.29) gives fJ - IJ
= log E' e - "Yip - 1P ( Z = yi ] > E' ( - Tyj) p - 1P ( Z = yi ]
�
= - Tsp - 1 L' P [ Z = yi ] By
E'
0].
denotes
By
(9.16),
+ log p + log p
+ log p .
(9.15) and Lyapounov's inequality (5.33), Y· 1 E li Z < 1 E 112 z 2 ] = 1 . = z ( P < y j I] ; [ ] ; E' :
The last two inequalities give
(9.18 )
0 � IJ �
TS ) - log P ( Z > 0) . � P[Z O
This proves the following result.
Suppose that Y satisfies (9.11). Define p and T by (9.12), let variable with distribution (9.13), and define s 2 by (9.15). Then P[Y > 0] = pe - 9, where fJ satisfies (9.18).
Theorem 9. 1. Z be a random
To use (9.18) requires a lower bound for P[Z
> 0].
SECTION
9.
LARGE DEVIATIONS
1beorem 9.2. If E[Z] = 0, E[ Z 2 ] = � 0] � s 4/4 � 4 .t
s 2, and E[ Z 4 ] = � 4 > 0, then
147
P[Z
Let z + = Z/[ Z � OJ and z- = - Z/[ Z < OJ · Then z+ and z- are nonnegative, Z = z + _ z-, Z2 = (Z + )2 + (Z-) 2 , and PROOF.
(9. 1 9) Let p
= P[ Z
�
0]. By Schwarz's inequality (5.32),
By Holder's inequality (5.31) (for p = f and
Since E[ Z ] = q j) gives ==
= 3)
0, another application of Holder's inequality (for p = 4 and
Combining these three inequalities with
(Epl/4 ) 213� 4/3
q
=
2 ptl2� 2 .
(9.1 9)
gives
s2
�
p1 12� 2 +
•
Chernoff's Theorem* 1beorem
9.3.
Let xl, x2 , . . . be independent, identically distributed simple
random variables satisfying E [ Xn ] < 0 and P[ Xn > 0] > 0, let M(t) be their common moment generating function, and put p = inf M(t ) . Then ,
(9.20)
.!. tog P [ X1 + lim n � oo n
· · ·
+ Xn > 0 ) = log p .
t For a related result, see Problem 25.21 . *This theorem is not needed for the law of the iterated logarithm, Theorem 9.5.
148
PROBABILITY
Put Yn = X1 + · · · + Xn. Then E[ Yn] < 0 and P[ Yn > 0] > P " [ X1 > 0 ] > 0, and so the hypotheses of Theorem 9.1 are satisfied. Define Pn and Tn by inf,Mn ( t ) = Mn ( Tn ) = Pn, where Mn(t ) is the moment generating n fu�ction of Yn . Since Mn( t ) == M ( t ), it follows that Pn = p" and Tn = T, where M( T) == p. Let Zn be the analogue for Yn of the Z described by (9.13). Its moment n n generating function (see (9.14)) is Mn ( T + t)/p = ( M( T + t)/p) . This is also the moment generating function of V1 + · · · + V, for independent random variables V1 , . . . , V, each having moment generating function M( T + t )jp . Now each v; has (see (9.15)) mean 0 and some positive variance o 2 and fourth moment � 4 independent of i. Since Z, must have the same moments as V1 + · · · + V,, it has mean 0, variance s; = n o 2 , and fourth moment �! == n� 4 + 3n(n - 1) o 4 = O(n 2 ) (see (6.2)). By Theorem 9.2, P[ Zn > 0] > s:/4�! > a for some positive a independent of n. By Theo n 9 > rem 9.1 then, P[ Yn 0] = p e - , where 0 < on < Tnsna - 1 - log a = Ta - 1o /n - log a. This gives (9.20), and shows, in fact, that the rate of • convergence is 0( n - 1 /2 ). PROOF.
"
This result is important in the theory of statistical hypothesis testing. An informal treatment of the Bernoulli case will illustrate the connection. Suppose Sn == X1 + · · · + Xn , where the X; are independent and assume the values 1 and O with probabilities p and q. Now P( Sn > na ] == P(EZ= 1 ( X�c - a ) > 0], and Chernoff's theorem applies if p < a < 1. In this case M ( t ) = E ( e '< x1 - a ) ] = e - 'a ( pe ' + q). Minimizing this shows that the p of Chernoff's theorem satisfies
a
b
- log p == K( a , p) = a log - + b log - , p q where b == 1 - a. By (9.20), n- 1 log P[Sn > na ] -+ - K(a, p ); express this as (9 .21) Suppose now that p is unknown and that there are two competing hypotheses concerning its value, the hypothesis H1 that p = p 1 and the hypothesis H2 that p = P2 ' where P I < P2 . Given the observed results x. ' . . . ' xn of n Bernoulli trials, one decides in favor of H2 if sn > na and in favor of HI if s, < na ' where a is some number satisfying p 1 < a < p2 • The problem is to find an advantageous value for the threshold a. By (9.21), (9 .22) where the notation indicates that the probability is calculated for p = p 1 -that is, under the assumption of H1 • By symmetry, (9.23)
SECTION
9.
LARGE DEVIATIONS
149
The left sides of (9.22) and (9.23) are the probabilities of erroneously deciding in favor of H2 when H1 is, in fact, true and of erroneously deciding in favor of H1 when H2 is, in fact, true-the probabilities describing the level and power of the tes t. Suppose a is chosen so that K(a, p 1 ) = K(a, p 2 ), which makes the two error probabilities approximately equal. This constraint gives for a a linear equation with solution (9 .24) where q; = 1 - P; · The common error probability is approximately e - n K for this value of a , and so the larger K (a, p 1 ) is, the easier it is to distinguish statistically between p 1 and p2 • Although K( a(p . , p2 ), p 1 ) is a complicated function, it has 3a simple approxima tion for p 1 near p2 . As x -+ 0, log(1 + x) = x - !x 2 + O(x ). Using this in the definition of K and collecting terms gives (9 .25) Fix
x2 K ( p + x , p ) = 2pq + O ( x 3 ) ,
X -+ 0.
p 1 = p, and let p2 = p + t; (9.24) becomes a function 1/!(t) of t, and expanding
the logarithms gives {9.26)
t -+ 0
after some reductions. Finally, (9.25) and (9.26) together imply that (9.27)
t -+ 0.
In distinguishing p 1 == p from p2 == p + t for small t, if a is chosen to equalize the two error probabilities, then their common value is about e - n t 2 /Sp q. For t fixed, the nearer p is to ! , the larger this probability is and the more difficult it is to distin,uish p from p + t. As an example, compare p == .1 with p == .5. Now .36nt /8(.1)(.9) == nt 2/8(.5)(.5). With a sample only 36 percent as large, .1 can therefore be distinguished from .1 + t with about the same precision as .5 can be distinguished from .5 + t. 1be
Law of the Iterated Logarithm
analysis of the rate at which S,jn approaches the mean depends on the following variant of the theorem on large deviations.
The
Let sn = xl + . . . xn , where the xn are independent and identically distributed simple random variables with mean 0 and variance 1. If
Theorem 9.4.
150
PROBABILITY
a n are constants satisfying (9 .28 ) then ( 9. 2 9) for a sequence rn going to 0. PROOF. Put yn = sn - a nrn = E�- l ( Xk - a n! ..fi). Then E[Yn1 < 0. Since X1 has mean 0 and variance 1, P[X1 > 0] > 0, and it follows by (9.28) that P[nX1 > a,/ {i] > 0 for n sufficiently large, in which case P [Yn > 0] > p [ X1 - a n/..fn > 0] > 0. Thus Theorem 9.1 applies to Yn for all large enough n. Let Mn (t), Pn ' Tn , and zn be associated with yn as in the theorem. If m(t) and c( t ) are the moment and cumulant generating functions of the Xn , a then Mn( t ) is the n th power of the moment generating function e - t ,/�m(t) of xl - a nl rn ' and so yn has cumulant generating function (9 . 30 ) Since Tn is the unique minimum of Cn( t ), and since c;(t) = - a nrn + nc'(t ), Tn is determined by the equation c'( Tn ) = a n! rn . Since xl has mean 0 and variance 1, it follows by (9.6) that c (O ) = c' (O ) = 0, c ( O) = 1 . ( 9 . 31 ) Now c'( t ) is nondecreasing because c(t) is convex, and since c'( Tn ) = a n! rn goes tO 0, Tn must therefore go tO 0 as Well and must in fact be 0( a n/ rn ) . By the second-order mean-value theorem for c'( t ) , a nl rn = c'( Tn ) = Tn + 0( Tn2 ), from which follows "
( 9.32 )
'�'n =
an 7n
+
(0 a; ) --;;
·
c( t ), log pn = Cn( Tn ) = - Tn a nrn + nc ( Tn ) = - Tn a nrn + n 'l'n2 + o( ;
By the third-order mean-value theorem for
Applying (9.32) gives
( 9.33 )
[I
.,.
)] .
SECTION
9.
151
LARGE DEVIATIONS
Now (see (9 .14)) zn has moment generating function Mn ( Tn + t)/ Pn and (see (9.30)) cumulant generating function Dn (t) = Cn (Tn + t) - log pn = - (Tn + t)a nln + nc(t + Tn ) - log pn. The mean of zn is D�(O) = 0. Its variance s; is D�'(O); by (9.31), this is
(9 . 3 4 )
s; = nc"( Tn ) = n ( c"(O) + 0 ( Tn ) ) = n ( 1
+
o (1)) .
The fourth cumulant of zn is D�" ' (O) = nc""( Tn ) = 0( n ) . By formula (9.9) relating moments and cumulants (applicable because E[Zn1 = 0), E[Z:] = 3s: + D�" ' (O). Therefore, E[Z:];s: 3, and it follows by Theorem 9.2 that there exists an a such that P [Zn > 0] > a > 0 for all sufficiently large n. By Theorem 9. 1 , P[Yn � 0] = Pne - B, with 0 < (Jn < Tn sna - 1 + log a. By (Jn = O(a n ) = o(a;), and it follows by (9.33) that (9.28), (9.32), and (9.34), a�( l + o( 1 ))/2 . y = e � 0 ] n • [ P -+
The law of the interated logarithm is this:
Let sn = x1 + . . . + xn, where the xn are independent, identically distributed simple random variables with mean 0 and variance 1. Then
Theorem 9.5.
(9.35 )
P
[
Sn lim sup n J2 n log log n
=
1
]
=
1.
Equivalent to (9.35) is the assertion that for positive £
(9 .3 6 )
P [ Sn > ( 1 + £ ) J2n log log n
i.o. ]
=
0
P [ Sn � ( 1 - £ ) J2n log log n
i.o. ]
=
1.
and
(9. 3 7 )
The set in (9.35) is, in fact, the intersection over positive rational £ of the sets in (9.37) minus the union over positive rational £ of the sets in (9.36). The idea of the proof is this. Write
(9. 3 8 )
q> ( n ) = J2n log log n .
If A ! = [Sn > (1 ± £)f/>(n)], then by (9.29) , P ( A !) is near (log n) - < 1 ± '>2• If n k increases exponentially, say n k - (J k for (J > 1, then P (A�) is of the order k - < 1 ± '>2• Now Ekk -< 1 ± '>2 converges if the sign is + and diverges if
152
PROBABILITY
the sign is - . It will follow by the first Borel-Cantelli lemma that there is probability 0 that A� occurs for infinitely many k. In proving (9.36), an extra argument is required to get around the fact that the A; for n =I= nk must also be accounted for (this involves choosing (J near 1). If the A; were independent, it would follow by the second Borel-Cantelli lemma that with probability 1, A n " occurs for infinitely many k, which would in tum imply (9.37). An extra argument is required to get around the fact that the A n " are dependent (this involves choosing 8 large). For the proof of (9.36) a preliminary result is needed. Put Mk = max{ S0 , S1 , . . . Sk }, where S0 = 0. ,
Theorem 9.6.
for a � {2,
P
(9.39) PROOF.
If the Xk are independent with mean 0 and variance l , then
[ � > a] � [-!;- > a - ] . 2P
n:
If Aj == ( � _ 1 < a..fn < Mj], then
Since Sn - � has variance n - j, it follows by independence and Chebyshev's inequality that the probability in the sum is at most
(9.36). Given £ , choose (J so that (J > 1 but 8 2 < k 1 [(J ] and xk == 8(2 log log n k ) 12• By (9.29) and (9.39),
PROOF OF
nk
==
1
+
£.
Let
SECTION
9.
153
LARGE DEVIATIONS
where � k --+ 0. The negative of the exponent is asymptotically 8 2 log k and hence for large k exceeds 8 log k, so that
Since 8 > 1, it follows by the first Borel-Cantelli lemma that there is probability 0 that (see (9.38))
(9.40) for infinitely manx k. Suppose that nk _ 1 < n < n k and that
sn > ( 1
(9.41 )
+ £ ) q, ( n )
.
1 Now q, ( n ) � q, ( n k _ 1 ) - 8 - 12q,(n k ); hence, by the choice of 8, (1 + £)q,( n ) > IJq,( nk) if k is large enough. Thus for sufficiently large k, (9.41) implies (9.40) (if n k _ 1 < n < n k ), and there is therefore probability 0 that (9.41) • holds for infinitely many n. 1 PROOF OF (9.37). Given £, choose an integer (J so large that 38 - 12 < £. k Take nk = fJ . Now n k - n k _ 1 --+ oo , and (9.29) applies with n = n k - nk _ 1 1 and a n = xk/ Jn k - n k _ 1 , where xk = ( 1 - 8 - )q,( n k ). It follows that
P ( Sn k - Sn k - 1
> xk ]
=
P [ Sn - n 1c - 1 �<
> xk
]=
[
x k2 1 exp - 1 + �k ) , 2 nk - nk 1 ( -
-
]
1 where � k 0. The negative of the exponent is asymptotically ( 1 (J - )log k and so for large k is less than log k, in which case P[Sn k - Sn k _ 1 > xk ] > k - 1 • The events here being independent, it follows by the second Borel- Cantelli lemma that with probability 1, Sn " - Sn k - 1 > x k for infinitely many k. On the other hand, by (9.36) applied to { - Xn }, there is probability 1 that - Sn " _ 1 < 2q,( n k _ 1 ) < 2fJ - 1 12q,(n k ) for all but finitely many k. These two inequalities give sn k > x k - 2fJ - lllq,(n k ) > ( 1 • t ) � (n k ), the last inequality because of the choice of 8. .-.
That completes the proof of Theorem 9.5.
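A small simulation, my own illustration rather than part of the text (it assumes numpy), shows how slowly the bound in (9.35) is approached for the symmetric ±1 walk: the running maximum of S_n/√(2n log log n) creeps upward and stays below about 1 even for millions of steps.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2_000_000
steps = rng.choice([-1, 1], size=N)          # X_k = +/-1: mean 0, variance 1
S = np.cumsum(steps)

n = np.arange(3, N + 1)                      # start at 3 so that log log n > 0
norm = S[2:] / np.sqrt(2 * n * np.log(np.log(n)))
running_max = np.maximum.accumulate(norm)
print(running_max[[10**3, 10**5, N - 3]])    # slowly increasing, below about 1
```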
154
PROBABILITY
PROBLEMS 9.1. Prove (6.2) by using (9.9) and the fact that cumulants add in the presence of independence.
9.2. Let Y be a simple random variable with moment generating function M( t ) . Show that P[ Y > 0] == 0 implies that P [ Y � 0] == inf M ( t ) . Is the infimum ,
achieved in this case?
9.3. In the Bernoulli case, (9.21) gives
where p < a < 1 and
xn == n ( a - p ). Theorem 9.4 gives
where xn == a nlnpq . Resolve the apparent discrepancy. Use (9.25) to compare the two expressions in case x fn is small. See Problem 27. 17. n
9.4. Relabel the binomial parameter p as fJ == I( p ) , where I is increasing and continuously differentiable. Show by (9.27) that the distin�uishability of fJ from fJ + �fJ, as measured by K, is ( �6 ) 2j8p (1 - p)(l'( p )) + 0( �6 ) 3 • The leading coefficient is independent of fJ if I( p ) == arc sin {i. 9.5. From (9.35) and the same result for { - Xn } , together with the uniform boundedness of the xn ' deduce that with the probability 1 the set of limit points of the sequence { sn (2 n log log n ) - 1 12 } is the closed interval from - 1 to + 1. 9.6.
Suppose Xn takes the values ± 1 with probability ! each and show that P[ Sn = 0 i.o.] == 1 (This gives still another proof of the persistence of symmet ric random walk on the line (Example 8.3).) Show more generally that, if the Xn are bounded by M, then P [ I Sn l s M i.o.] == 1.
f
9.7. Weakened versions of (9.36) are quite easy to prove. By a fourth moment argument (see (6.2)), show that P[Sn > n 314 (log n ) < 1 + f )/4 i.o.] == 0. Use (9.29) to give a simple proof that P[ Sn > (3n log n ) 1 12 i.o.] == 0. 9.8. What happens in Theorem 9.6 if fi is replaced by fJ (0 < fJ < a)? 9.9. Show that (9.35) is true if sn is replaced by I sn I or maxk n sk or maxk n I sk I · s
s
CHAPTER
2
Measure
SECI'ION 10. GENERAL MEASURES
Lebesgue measure on the unit interval was central to the ideas in Chapter 1. Lebesgue measure on the entire real line is important in probability as well as in analysis generally, and a uniform treatment of this and other examples requires a notion of measure for which infinite values are possible. The present chapter extends the ideas of Sections 2 and 3 to this more general
case.
Classes
of Sets
The σ-field of Borel sets in (0, 1] played an essential role in Chapter 1, and it is necessary to construct the analogous classes for the entire real line and for k-dimensional Euclidean space.

Example 10.1. Let x = (x_1, ..., x_k) be the generic point of Euclidean k-space R^k. The bounded rectangles
(10.1)    [x = (x_1, ..., x_k): a_i < x_i ≤ b_i, i = 1, ..., k]
155
156
MEASURE
many rational rectangles, this is a countable union. Thus f11 k contains the open sets. Since a closed set has open complement, f11 k also contains the closed sets. Just as fJI contains all the sets in (0, 1] that actually come up in ordinary analysis and probability theory, !1/k contains all the sets in R k that actually come up. The a-field gJ k is generated by subclasses other than the class of rectangles. If An is the x-set where a; < x; < b; + n - 1 , i = 1, . . . , k, then An is open and (10.1) is nnAn. Thus !1/ k is generated by the open sets. Similarly, it is generated by the closed sets. Now an open set is a countable union of rational rectangles. Therefore, the (countable) class of rational
rectangles generates f11 k.
•
The a-field !A 1 on the line R 1 is by definition generated by the finite intervals. The a-field fJI in (0, 1] is generated by the subintervals of (0, 1 ]. The question naturally arises whether the elements of !!4 are the elements of !1/ 1 that happen to lie inside (0, 1]. The answer is yes, as shown by the following technical but frequently useful result. Theorem 10.1. (i) If !F is
Oo.
Suppose that 0 0 c 0. a a-field in 0, then �0 = [ A
n
0 0:
A e �]
is a a-field in
If .JJI generates the a-field � in 0, then ..ra10 = [ A n 0 0 : A e ..raf] generates the a-field � = [A n 0 0 : A e � ] in 0 0• PROOF. Of course, 0 0 = 0 n 0 0 e �0• If B = A n 0 0 and A e �' then 0 0 - B = (0 - A ) n 0 0 and 0 - A e �- If Bn = An n 0 0 and An e �' then U n Bn = (U n An) n O o and U nAn E �- Thus � is a a-field in O o . Let a0( �0 ) be the a-field ..ra10 generates in 0 0• Since �0 is a a-field in 0 0 , and since .JJ/0 c � ' certainly a0( d0 ) c �0• Now �0 c a0 ( .JJ/0 ) will follow if it is shown that A e � implies A n 0 0 e a0 ( ..ra10 ), or, to put it another way, if it is shown that � is contained in f§ = [ A : A n O o E ao ( ..rafo )]. Since A E d implies that A n Oo E ..rafo c a0 ( .JJ/0 ), it follows that d c �- It is therefore enough to show that f§ is a a-field in D. But if A E f§, then (0 - A) n O o = O o - ( A n O o ) E ao (do) and hence D - A E �- If An E f§ for all n, then (UnAn) n Oo = Un(An n 0 0 ) E a0(�0 ) and hence U nAn E f§. • (ii)
If 0 0 e /F, then �0 = [ A : A c 0 0 , A e � ]. If 0 = R1, 0 0 = (0, 1 ] , and 1 � = fA , and if .JJI is the class of finite intervals, then ..ra10 is the class of subintervals of (0, 1] and fJI = a0( d0) is given by
( 1 0 .2)
ffl = ( A : A
c (O, 1] , A
e !11 1 ] .
SECTION
10.
GENERAL MEASURES
157
A
subset of (0, 1] is thus a Borel set (lies in ffl) if and only if it is a linear Borel set (lies in 91 1 ), and the distinction in terminology can be dropped. Conventions Involving oo
Measures assume values in the set [0, oo] consisting of the ordinary nonnegative reals and the special value oo , and some arithmetic conventions are called for. For x, y e [0, oo ], x < y means that y == oo or else x and y are finite (that is, are ordinary real numbers) and x < y holds in the usual sense. Similarly, x < y means that y == oo and x is finite or else x and y are both finite and x < y holds in the usual sense. For a finite or infinite sequence x, x 1 , x 2 , . in [0, oo ], • •
(10 .3) means that either (i) x == oo and xk == oo for some k, or (ii) x == oo and xk < oo for all k and :E k xk is an ordinary divergent infinite series or (iii) x < oo and xk < oo for all k and (10.3) holds in the usual sense for l:k xk an ordinary finite sum or convergent infinite series. By these conventions and Dirichlet's theorem [A26], the order of summation in (10.3) has no effect on the sum. For an infinite sequence x, x 1 , x 2 , in [0, oo ], • • •
(10.4) means in the first place that xk < xk 1 s x and in the second place that either (i) x < oo and there is convergence in the usual sense, or (ii) xk == oo for some k, or (iii) x == oo and the xk are finite reals converging to infinity in the usual sense. +
Measures
A set function p. on a field .fF in D is a
measure
if it satisfies these
conditions: (i) p.( A ) E [0, oo] for A E ofF; (ii) ,., (0) = 0; (iii) if A I , A 2, . . . is a disjoint sequence of jli:sets and if uk_ l Ak E .?F, then (see (10.3))
The measure p. is finite or infinite as p.(O) < oo or p.(O) = oo ; it is a probability measure if p.(D) = 1, as in Chapter 1. I f D = A 1 u A 2 u · · · for some finite or countable sequence of jli:sets satisfying p. ( A k) < oo, then p. is a-finite. The significance of this concept
158
MEASURE
will be seen later. A finite measure is by definition a-finite; a a-finite measure may be finite or infinite. If SJI is a subclass of !F, p. is a-finite on SJI if D = U k A k for some finite or infinite sequence of �sets satisfying p.( A k ) < oo . It is not required that these sets A k be disjoint. Note that if D is not a finite or countable union of �sets, then no measure can be a-finite on .JJI . If p. is a measure on a a-field � in D, the triple (0, �' p.) is a measure space. (This term is not used if � is merely a field.) It is an infinite, a a-finite, a finite, or a probability measure space according as p. has the corresponding properties. If p.( Ac) = 0 for an jli:set A , then A is a support of p., and p. is concentrated on A. For a finite measure, A is a support if and only if p. ( A ) = p.(D). The pair ( D, � ) itself is a measurable space if � is a a-field in D. To say that p. is a measure on (0, �) indicates clearly both the space and the class of sets involved. As in the case of probability measures, (iii) above is the condition of countable additivity, and it implies finite additivity: If A 1 , . . . , A n are disjoint jli:sets, then
As in the case of probability measures, if this holds for extends inductively to all n.
n = 2,
then it
A measure p. on (D, �) is discrete if there are finitely or countably many points w; in D and masses m ; in [0, oo] such that p. ( A ) = E"'. A m ; for A e �- It is an infinite, a finite, or a probability measure as E ; m ; diverges, or converges, or converges to 1; the last case was treated in Example 2.9. If � contains each singleton { w ; } , then p. is a-finite if and • only if m ; < oo for all i. Example 10. 2
I
e
Example 10.3. Let ℱ be the σ-field of all subsets of an arbitrary Ω, and let μ(A) be the number of points in A, where μ(A) = ∞ if A is not finite. This μ is counting measure; it is finite if and only if Ω is finite and is σ-finite if and only if Ω is countable. Even if ℱ does not contain every subset of Ω, counting measure is well defined on ℱ. ∎
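The discrete and counting measures of Examples 10.2 and 10.3 are easy to mimic numerically. The following sketch is my own illustration, not part of the text: it represents a discrete measure by its point masses and checks additivity and monotonicity on a few sets (the class name and the particular masses are arbitrary choices).

```python
class DiscreteMeasure:
    """mu(A) = sum of the masses m_i over the points omega_i in A (Example 10.2)."""
    def __init__(self, masses):
        self.masses = dict(masses)          # point -> nonnegative mass
    def __call__(self, A):
        return sum(self.masses.get(w, 0.0) for w in A)

# counting measure on a countable Omega, restricted here to finite sets (Example 10.3)
counting = DiscreteMeasure({n: 1.0 for n in range(10_000)})

mu = DiscreteMeasure({0: 0.5, 1: 0.25, 2: 0.25})   # a probability measure
A, B = {0}, {1, 2}
assert mu(A | B) == mu(A) + mu(B)                  # additivity for disjoint sets
assert mu(A) <= mu(A | B)                          # monotonicity
print(counting(set(range(5))), mu({0, 1, 2}))      # 5.0  1.0
```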
Specifying a measure includes specifying its domain. If p. is a measure on a field � and �0 is a field contained in �, then the restriction p. 0 of p. to .F0 is also a measure. Although often denoted by th e Example 10.4.
SECTION
10.
159
GENERAL MEASURES
same symbol, p.0 is really a different measure from p. unless �0 = �- Its properties may be different: If p. is counting measure on the a-field � of all subsets of a countably infinite 0, p. is a-finite, but its restriction to the • a-field $0 = { 0, 0 } is not a-finite. Certain properties of probability measures carry over immediately to the general case. First, p. is monotone: p.(A) � p.( B) if A c B. This is derived, just as its special case (2.4), from p.(A) + p.(B - A) = p.(B). But it is possible to go on and write p.(B - A) = p.(B) - p.(A) only if p.( B) < oo . If p, ( B) = oo and p.(A) < oo, then p.(B - A) = oo ; but for every a e (0, oo] there are cases where p.(A) = p.(B) = oo and p.( B - A) = a. The inclusion-exclusion formula (2.7) also carries over without change to jli:sets of finite measure:
(10 .5)
P.( lJ Ak ) = E p.( A;) - E p. ( A ; n Aj ) k-1
The proof of
i
��
+
···
finite subadditivity also goes through just as before:
here the A k need not have finite measure. 1beorem 10.2. Let p. be a measure on a field (i) Continuity from below: If A n and A p,( A n ) i p.( A).
�lie in � and A n j A, thent
Continuity from above: If A n and A lie in !F and An � A, and if I'( A 1 ) < oo, then p.( A n) � p ( A). (iii) Countable subadditivity: If A 1 , A 2 , and U k-t A k lie in �, then (ii)
•
•
•
If p. is a-finite on !F, then � cannot contain an uncountable, disjoint collection of sets of positive p.-measure. (iv)
tSee (10.4).
160
MEASURE
The proofs of (i) and (iii) are exactly as for the corresponding parts of Theorem 2.1 . The same is essentially true of (ii): If p.( A1) < oo , subtrac tion is possible and A1 - A n i A1 - A implies that p. ( A 1 ) - p. ( A n ) = p. ( A1 - A n ) i p.( At - A ) = p. ( At) - p. ( A ). There remains (iv). Let [ B6: 8 E 8 ] be a disjoint collection of jli:sets satisfying p. ( B6 ) > 0. Consider an jli:set A for which p.( A ) < oo . If 81, , 8n are distinct indices satisfying p. ( A n B6.) > £ > 0, then n £ < E7_ 1p.( A n B6. ) � p.( A ), and so n < p.( A)j£. Thus the index set [8: p. ( A n B6) > £] is finite, and hence (take the union over positive rational £) [8: p. ( A n B6) > 0] is countable. Since p. is a-finite, 0 = UkAk for some finite or countable sequence of jli:sets Ak satisfying p.(Ak) < oo . But then ek = [8: p.(Ak n B6 ) > 0] is countable for each k. If p.(Ak n B6 ) = 0 for all k, then p. ( B6 ) < • Ekp. ( Ak n B6 ) = 0. Therefore, 8 = U ke k , and so 9 is countable. PROOF.
•
•
•
I
I
Uniqueness
According to Theorem 3.3, probability measures agreeing on a w-system agree on a(fJ'). There is an extension to the general case.
9'
Suppose that p.1 and p.2 are measures on a(fJ'), where 9' is a w-system, and suppose they are ,a-finite on 9'. If p.1 and p. 2 agree on 9', then they agree on a ( 9' ). PROOF. Suppose B E 9' and p.1( B) = p.2( B ) oo , and let !1'8 be the class of sets A in a(fJ') for which p.1( B n A ) = p. 2 ( B n A ). Then !1'8 is a A-system containing 9' and hence (Theorem 3.2) containing a(fJ'). By a-finiteness there exist .9'-sets Bk satisfying 0 = UkBk and p.1( Bk ) = Theorem 10.3.
<:
p.2( Bk ) < oo . By the inclusion-exclusion formula (10 . 5),
P. a
(u
i-1
)
( BJ ' A ) =
E P. a ( B; n A ) -
1 sisn
E
1 s i<j < n
P. a ( B; n Bj n A ) + . . .
for a = 1 , 2 and all n. Since 9' is a w-system containing the B;, it contains the B; n Bi, and so on. Whatever a ( fJ' )-set A may be, the terms on the right above are therefore the same for a = 1 as for a = 2. The left side is then the • same for a = 1 as for a = 2; letting n --+ oo gives p.1( A ) = p. 2( A ) .
Suppose p.1 and p. 2 are finite measures on a(fJ'), where 9' is a w-system and 0 is a finite or countable union of sets in 9'. If p.1 and p. 2 agree on 9', then they agree on a(fJ'). Theorem 10.4.
SECTION
10.
161
GENERAL MEASURES
By hypothesis, 0 = U k Bk for .9'-sets Bk , and of course P. a( Bk) < P.a ( D ) < oo , a = 1, 2. Thus p.1 and p.2 are a-finite on gJ, and Theorem 10.3 • applies. PROOF.
If 9' consists of the empty set alone, then it is a 'IT-system and a(fJ') = {0, 0 }. Any two finite measures agree on 9', but of course they need not agree on a(fJ'). Theorem 10.4 does not apply in this case because 0 is not a countable union of sets in 9'. For the same reason, no measure on a(fJ') is a-finite on 9', and hence Theorem 10.3 does not • apply. Example 10.5.
1
Suppose (0, �) = ( R\ !1/ ) and 9' consists of the half infinite intervals ( oo , x]. By Theorem 10.4, two finite measures on � that agree on 9' also agree on �. The 91'-sets of finite measure required in the • definition of a-finiteness cannot in this example be made disjoint. Example 10.6.
-
If a measure on (0, � ) is a-finite on a subfield �0 of !F, then 0 = U kBk for disjoint �0-sets Bk of finite measure: if they are not • disjoint, replace Bk by Bk n Bf · · · n Bk- t · Example 10. 7.
The proof of Theorem 10.3 simplifies slightly if 0 = U k Bk for disjoint .9'-sets with p.1( Bk) = p.2( Bk) < oo , because additivity itself can be used in place of the inclusion-exclusion formula.
PROBLEMS 10.1. 10.2.
10.3. 10.4.
Show that if conditions (i) and (iii) in the definition of measure hold, and if p.( A ) < oo for some A e !F, then condition (ii) holds. Let ( Qn , � , P.n ) be a measure space, n == 1, 2, . . . , and suppose that the sets Qn are disjoint. Let Q == UnQn , let !F consist of the sets of the form A == UnAn with An � ' and for such an A let p.( A ) == E n P.n ( A n >· Show that ( 0 , !F, p.) is a measure space. Under what conditions is it a-finite? Finite? On the a-field of all subsets of Q == {1, 2, . . . }, put p.( A ) == E k e 2- k if A is finite and p.( A ) == oo otherwise. Is p. finitely additive? Countably additi.ve.? (a) In connection with Theorem 10.2(ii), show that if An � A and p.( A k ) < oo for some k, then I'( An ) � p.( A ). (b) Find an example in which An � A , p.( A n ) oo , and A == 0.
E
A
=
162 10.5.
MEASURE
The natural generalization of (4.10) is ( 10.6)
1'( lim infn A,. )
.s
lim inf n I' ( A,. )
.s
lim sup I'( A,. )
n
.s
(
}
I' lim sup A,. .
n
Show that the left-hand inequality always holds. Show that the right-hand inequality holds if p.(U k nA k ) < oo for some n but can fail otherwise. 3.5 t A measure space ( 0, !F, p.) is complete if A c B, B e !F, and p. ( B ) == 0 together imply that A e !F- the definition is just as in the probability case. Use the ideas of Problem 3.5 to construct a complete measure space ( 0, r, p. + ) such that !F c r and p. and p. + agree on !F. �
10.6.
The condition in Theorem 10.2(iv) essentially characterizes a-finiteness. (a) Suppose that (Q, !F, p.) has no "infinite atoms," in the sense that for every A in !F, if p. ( A ) == oo , then there is in !F a B such that B c A and 0 < p. ( B ) < oo . Show that if !F does not contain an uncountable, disjoint collection of sets of positive measure, then p. is a-finite. (Use Zorn's lemma.) (b) Show by example that this is false without the condition that there are no " infinite atoms." 10.8. Deduce Theorem 3.3 from Theorem 10.3 and from Theorem 10.4. 10.9. Example 10.5 shows that Theorem 10.3 fails without the a-finiteness condi tion. Construct other examples of this kind. 10.10. Suppose that p.1 and p.2 are measures on a a-field !F and are a-finite on a field � generating !F. (a) Show that, if p. 1 ( A ) s p.2 ( A ) holds for all A in 'a , then it holds for all A in !F. (b) Show that this fails if � is only a w-system. 10.7.
SECI'ION 11. OUTER MEASURE Outer Measure
An outer measure is a set function p.* that is defined for all subsets of a space D and has these four properties: (i) p.*( A ) e (0, oo] for every A c 0; (ii) p.*(O) = 0; (iii) p.* is monotone: A c B implies p.*( A ) � p.*( B); ( iv) p.* is countably subadditive: p.*(UnA n) � Enp.*( An)·
SECTION
11.
OUTER MEASURE
163
The set function P* defined by (3.1) is an example, one that generalizes : Example 11.1. Let p be a set function on a class .JJI in 0. Assume that 0 e � and p(O) = 0, and that p(A) e [0, oo] for A e SJI; p and � are otherwise arbitrary. Put (1 1 .1) n
where the infimum extends over all finite and countable coverings of A by �sets A n . If no such covering exists, take p.*(A) = oo in accordance with the convention that the infimum over an empty set is oo. That p.* satisfies (i), (ii), and (iii) is clear. If p.*(A n ) = oo for some n, then obviously p.*( U n A n ) � En p.*(A n ) Otherwise, cover each A n by �sets Bn k satisfying Ek p ( Bn k) < p.*(A n ) + £/2 n ; then p.*( U n A n ) � E n , k P ( Bn k) < • E n p.*( A n ) + £. Thus p.* is an outer measure. ·
Define A to be p.*-measurab/e if p.* ( A n E ) + p.* ( A c n E ) = p.* ( E )
(1 1 .2)
for every E. This is the general version of the definition (3.4) used in Section 3. It is by subadditivity equivalent to (1 1 .3 )
Denote by .l(p.* ) the class of p.*-measurable sets. The extension property for probability measures in Theorem 3.1 was proved by means of a sequence of four lemmas. These lemmas carry over directly to the case of the general outer measure: if P* is replaced by p.* and .,1( by .l(p.*) at each occurrence, the proofs hold word for word, symbol for symbol. In particular, an examination of the arguments shows that oo as a possible value for p.* does not necessitate any changes. The fourth lemma in Section 3 becomes this: If p.* is an outer measure, then .l(p.*) is a a-field, and p.* restricted to .l(p.* ) is a measure.
1beorem
1 1.1.
This will be used to prove an extension theorem, but it has other applications as well; see Sections 19 and 29.
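For a finite Ω the definitions (11.1) and (11.2) can be checked by brute force, which makes the roles of the covering class and of the measurability criterion concrete. The sketch below is my own illustration, not part of the text; the set function rho, the space Ω = {1, 2, 3}, and the test sets are arbitrary choices made for the example.

```python
from itertools import chain, combinations

def outer_measure(rho, A):
    """mu*(A) as in (11.1): infimum of sum rho(A_n) over coverings of A
    by sets in the domain of rho (everything finite, so brute force works)."""
    best = float('inf')
    sets = list(rho)
    for r in range(len(sets) + 1):
        for cover in combinations(sets, r):
            if set(A) <= set().union(*cover):
                best = min(best, sum(rho[C] for C in cover))
    return best

def caratheodory_measurable(omega, rho, A):
    """The criterion (11.2): mu*(A ∩ E) + mu*(A^c ∩ E) = mu*(E) for every E."""
    A = set(A)
    all_E = chain.from_iterable(combinations(sorted(omega), r) for r in range(len(omega) + 1))
    for E in map(set, all_E):
        if abs(outer_measure(rho, A & E) + outer_measure(rho, E - A) - outer_measure(rho, E)) > 1e-9:
            return False
    return True

omega = {1, 2, 3}
rho = {frozenset(): 0.0, frozenset({1}): 0.3, frozenset({2, 3}): 0.7, frozenset(omega): 1.0}
print(caratheodory_measurable(omega, rho, {1}))   # True:  {1} splits every cover cleanly
print(caratheodory_measurable(omega, rho, {2}))   # False: mu*({2}) + mu*({3}) > mu*({2, 3})
```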
164
MEASURE
Extension Theorem 1 1.2.
a-field.
A
measure on a field has an extension to the generated
If the original measure on the field is σ-finite, then it follows by Theorem 10.3 that the extension is unique. Theorem 11.2 can be deduced from Theorem 11.1 by the arguments used in the proof of Theorem 3.1.† It is unnecessary to retrace the steps, however, because the ideas will appear in stronger form in the proof of the next result, which generalizes Theorem 11.2.

Define a class 𝒜 of subsets of Ω to be a semiring if (i) ∅ ∈ 𝒜; (ii) A, B ∈ 𝒜 implies A ∩ B ∈ 𝒜; (iii) if A, B ∈ 𝒜 and A ⊂ B, then there exist disjoint 𝒜-sets C₁, ..., Cₙ such that B − A = ⋃ₖ₌₁ⁿ Cₖ. The class of finite intervals in Ω = R¹ and the class of subintervals of Ω = (0, 1] are the simplest examples of semirings. Note that a semiring need not contain Ω.

Example 11.2. If A and B lie in a semiring 𝒜, then since A ∩ B ∈ 𝒜 by (ii), A ∩ Bᶜ = A − A ∩ B has by (iii) the form ⋃ₖ₌₁ⁿ Cₖ for disjoint 𝒜-sets Cₖ. If A, B₁, ..., B_{j+1} are in 𝒜 and A ∩ B₁ᶜ ∩ ··· ∩ B_jᶜ has this form, then A ∩ B₁ᶜ ∩ ··· ∩ B_{j+1}ᶜ = ⋃ₖ₌₁ⁿ (Cₖ ∩ B_{j+1}ᶜ), and each Cₖ ∩ B_{j+1}ᶜ is a finite disjoint union of 𝒜-sets. It follows by induction that for any sets A, B₁, ..., B_j in 𝒜, there are in 𝒜 disjoint sets Cₖ such that A ∩ B₁ᶜ ∩ ··· ∩ B_jᶜ = ⋃ₖ₌₁ⁿ Cₖ. ∎
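For instance, in the semiring of subintervals of (0, 1], if A = (a, b] ⊂ B = (c, d] (so that c ≤ a ≤ b ≤ d), then B − A = (c, a] ∪ (b, d], a disjoint union of at most two intervals, either of which may be empty; and the intersection of two intervals is again an interval, so conditions (ii) and (iii) hold.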
Theorem 11.3. Suppose that μ is a set function on a semiring 𝒜. Suppose that μ has values in [0, ∞], that μ(∅) = 0, and that μ is finitely additive and countably subadditive. Then μ extends to a measure on σ(𝒜).*

This contains Theorem 11.2, because the conditions are all satisfied if 𝒜 is a field and μ is a measure on it. If Ω = ⋃ₖ Aₖ for a sequence of 𝒜-sets satisfying μ(Aₖ) < ∞, then it follows by Theorem 10.3 that the extension is unique.
t See also Problem 11.1.
*And so p. must have been countably additive on � in the first place; see Problem 1 1 .4.
PROOF. If A, B, and the Cₖ are related as in condition (iii) in the definition of semiring, then by finite additivity μ(B) = μ(A) + Σₖ₌₁ⁿ μ(Cₖ) ≥ μ(A). Thus μ is monotone. Define an outer measure μ* by (11.1) for ρ = μ:

(11.4)  μ*(A) = inf Σₙ μ(Aₙ),

the infimum extending over coverings of A by 𝒜-sets. The first step is to show that 𝒜 ⊂ ℳ(μ*). Suppose that A ∈ 𝒜. If μ*(E) = ∞, then (11.3) holds trivially. If μ*(E) < ∞, for given ε choose 𝒜-sets Aₙ such that E ⊂ ⋃ₙ Aₙ and Σₙ μ(Aₙ) < μ*(E) + ε. Since 𝒜 is a semiring, Bₙ = A ∩ Aₙ lies in 𝒜 and Aᶜ ∩ Aₙ = Aₙ − Bₙ has the form ⋃ₖ₌₁^{mₙ} Cₙₖ for disjoint 𝒜-sets Cₙₖ. Note that Aₙ = Bₙ ∪ ⋃ₖ₌₁^{mₙ} Cₙₖ, where the union is disjoint, and that A ∩ E ⊂ ⋃ₙ Bₙ and Aᶜ ∩ E ⊂ ⋃ₙ ⋃ₖ₌₁^{mₙ} Cₙₖ. By the definition of μ* and the assumed finite additivity of μ,

μ*(A ∩ E) + μ*(Aᶜ ∩ E) ≤ Σₙ μ(Bₙ) + Σₙ Σₖ₌₁^{mₙ} μ(Cₙₖ) = Σₙ μ(Aₙ) < μ*(E) + ε.

Since ε is arbitrary, (11.3) follows. Thus 𝒜 ⊂ ℳ(μ*).

The next step is to show that μ* and μ agree on 𝒜. If A ⊂ ⋃ₙ Aₙ for 𝒜-sets A and Aₙ, then by the assumed countable subadditivity of μ and the monotonicity established above, μ(A) ≤ Σₙ μ(A ∩ Aₙ) ≤ Σₙ μ(Aₙ). Therefore, A ∈ 𝒜 implies that μ(A) ≤ μ*(A) and hence, since the reverse inequality is an immediate consequence of (11.4), μ(A) = μ*(A). Thus μ* agrees with μ on 𝒜.

Since 𝒜 ⊂ ℳ(μ*) and ℳ(μ*) is a σ-field (Theorem 11.1), σ(𝒜) ⊂ ℳ(μ*). Since μ* is countably additive on ℳ(μ*) (Theorem 11.1 again), μ* restricted to σ(𝒜) is an extension of μ on 𝒜, as required. ∎

Example 11.3. For 𝒜 take the semiring of subintervals of Ω = (0, 1] together with the empty set. For μ take length λ: λ(a, b] = b − a. The finite additivity of λ follows by Theorem 2.2 and the countable subadditivity by Lemma 2 to that theorem.† By Theorem 11.3, λ extends to a measure on the class σ(𝒜) = ℬ of Borel sets in (0, 1]. ∎
t On a field, countable additivity implies countable subadditivity, but � is merely a semiring. Hence the necessity of this observation; but see Problem 1 1 .4.
This gives a second construction of Lebesgue measure in the unit interval. In the first construction λ was extended first from the class of intervals to the field ℬ₀ of finite disjoint unions of intervals (see (2.11)) and then by Theorem 11.2 (in its special form Theorem 3.1) from ℬ₀ to ℬ = σ(ℬ₀). Using Theorem 11.3 instead of Theorem 11.2 effects a slight economy, since the extension then goes from 𝒜 directly to ℬ without the intermediate stop at ℬ₀, and the arguments involving (2.12) and (2.13) become unnecessary.
Example 11.4. In Theorem 11.3 take for 𝒜 the semiring of finite intervals on the real line R¹, and consider λ₁(a, b] = b − a. The arguments for Theorem 2.2 and its two lemmas in no way require that the (finite) intervals in question be contained in (0, 1], and so λ₁ is finitely additive and countably subadditive on this class 𝒜. Hence λ₁ extends to the σ-field 𝓡¹ of linear Borel sets, which is by definition generated by 𝒜. This defines Lebesgue measure λ₁ over the whole real line. ∎
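Note that λ₁ is σ-finite on this semiring: R¹ = ⋃ₙ (−n, n] and λ₁(−n, n] = 2n < ∞, so by Theorem 10.3 the extension just constructed is the only measure on 𝓡¹ giving each finite interval its length.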
A subset of (0, 1] lies in ℬ if and only if it lies in 𝓡¹ (see (10.2)). Now λ₁(A) = λ(A) for subintervals A of (0, 1], and it follows by uniqueness (Theorem 3.3) that λ₁(A) = λ(A) for all A in ℬ. Thus there is no inconsistency in dropping λ₁ and using λ to denote Lebesgue measure on 𝓡¹ as well as on ℬ.

Example 11.5. The class of bounded rectangles in R^k is a semiring, a fact needed in the next section. Suppose A = [x: xᵢ ∈ Iᵢ, i ≤ k] and B = [x: xᵢ ∈ Jᵢ, i ≤ k] are nonempty rectangles, the Iᵢ and Jᵢ being finite intervals. If A ⊂ B, then Iᵢ ⊂ Jᵢ, so that Jᵢ − Iᵢ is a disjoint union Iᵢ′ ∪ Iᵢ″ of intervals (possibly empty). Consider the 3^k disjoint rectangles [x: xᵢ ∈ ψᵢ, i ≤ k], where for each i, ψᵢ is Iᵢ or Iᵢ′ or Iᵢ″. One of these rectangles is A itself, and B − A is the union of the others. The rectangles thus form a semiring. ∎
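For instance, in the plane take A = (0, 1] × (0, 1] and B = (−1, 2] × (−1, 2]. Each side satisfies Jᵢ − Iᵢ = (−1, 0] ∪ (1, 2], so there are 3² = 9 disjoint rectangles ψ₁ × ψ₂ with ψᵢ one of (0, 1], (−1, 0], (1, 2]; one of them is A, and B − A is the disjoint union of the other eight.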
An Approximation Theorem
If 𝒜 is a semiring, then by Theorem 10.3 a measure on σ(𝒜) is determined by its values on 𝒜 if it is σ-finite there. The following theorem shows more explicitly how the measure of a σ(𝒜)-set can be approximated by the measures of 𝒜-sets.
Theorem 11.4. Suppose that 𝒜 is a semiring, μ is a measure on ℱ = σ(𝒜), and μ is σ-finite on 𝒜.
(i) If B ∈ ℱ and ε > 0, there exists a finite or infinite disjoint sequence A₁, A₂, ... of 𝒜-sets such that B ⊂ ⋃ₖ Aₖ and μ((⋃ₖ Aₖ) − B) < ε.
(ii) If B ∈ ℱ and ε > 0, and if μ(B) < ∞, then there exists a finite disjoint sequence A₁, ..., Aₙ of 𝒜-sets such that μ(B △ (⋃ₖ₌₁ⁿ Aₖ)) < ε.
PROOF. Return to the proof of Theorem 11.3. If μ* is the outer measure defined by (11.4), then ℱ ⊂ ℳ(μ*) and μ* agrees with μ on 𝒜, as was shown. Since μ* restricted to ℱ is a measure, it follows by Theorem 10.3 that μ* agrees with μ on ℱ as well.

Suppose now that B lies in ℱ and μ(B) = μ*(B) < ∞. There exist 𝒜-sets Aₖ such that B ⊂ ⋃ₖ Aₖ and μ(⋃ₖ Aₖ) ≤ Σₖ μ(Aₖ) < μ(B) + ε; but then μ((⋃ₖ Aₖ) − B) < ε. To make the sequence {Aₖ} disjoint, replace Aₖ by Aₖ ∩ A₁ᶜ ∩ ··· ∩ Aₖ₋₁ᶜ; each of these sets is (Example 11.2) a finite disjoint union of sets in 𝒜.

Next suppose that B lies in ℱ and μ(B) = μ*(B) = ∞. By σ-finiteness there exist 𝒜-sets Cₘ such that Ω = ⋃ₘ Cₘ and μ(Cₘ) < ∞. By what has just been shown, there exist 𝒜-sets Aₘₖ such that B ∩ Cₘ ⊂ ⋃ₖ Aₘₖ and μ((⋃ₖ Aₘₖ) − (B ∩ Cₘ)) < ε/2^m. The sets Aₘₖ taken all together provide a sequence A₁, A₂, ... of 𝒜-sets satisfying B ⊂ ⋃ₖ Aₖ and μ((⋃ₖ Aₖ) − B) < ε. As before, the Aₖ can be made disjoint.

To prove part (ii), consider the Aₖ of part (i). If B has finite measure, so has A = ⋃ₖ Aₖ, and hence by continuity from above (Theorem 10.2(ii)), μ(A − ⋃_{k≤n} Aₖ) < ε for some n. But then μ(B △ (⋃ₖ₌₁ⁿ Aₖ)) < 2ε. ∎
If, for example, B is a linear Borel set of finite Lebesgue measure, then λ(B △ (⋃ₖ₌₁ⁿ Aₖ)) < ε for some disjoint collection of finite intervals A₁, ..., Aₙ.
Corollary. If μ is a finite measure on a σ-field ℱ generated by a field ℱ₀, then for each ℱ-set A and each positive ε there is an ℱ₀-set B such that μ(A △ B) < ε.
PROOF. This is of course an immediate consequence of part (ii) of the theorem, but there is a simple direct argument. Let 𝒢 be the class of ℱ-sets with the required property. Since Aᶜ △ Bᶜ = A △ B, 𝒢 is closed under complementation. If A = ⋃ₙ Aₙ, where Aₙ ∈ 𝒢, given ε choose n₀ so that μ(A − ⋃_{n≤n₀} Aₙ) < ε, and then choose ℱ₀-sets Bₙ, n ≤ n₀, so that μ(Aₙ △ Bₙ) < ε/n₀. Since (⋃_{n≤n₀} Aₙ) △ (⋃_{n≤n₀} Bₙ) ⊂ ⋃_{n≤n₀} (Aₙ △ Bₙ), the ℱ₀-set B = ⋃_{n≤n₀} Bₙ satisfies μ(A △ B) < 2ε. Of course ℱ₀ ⊂ 𝒢; since 𝒢 is a σ-field, ℱ ⊂ 𝒢, as required. ∎
Caratheodory's Condition*
Suppose that μ* is an outer measure on the class of all subsets of R^k. It satisfies Carathéodory's condition if

(11.5)  μ*(A ∪ B) = μ*(A) + μ*(B)

whenever A and B are at positive distance in the sense that dist(A, B) = inf[|x − y|: x ∈ A, y ∈ B] is positive.

Theorem 11.5. If μ* is an outer measure on R^k satisfying Carathéodory's condition, then 𝓡^k ⊂ ℳ(μ*).
PROOF. Since ℳ(μ*) is a σ-field (Theorem 11.1) and the closed sets generate 𝓡^k (Example 10.1), it is enough to show that each closed set is μ*-measurable. The problem is thus to prove (11.3) for A closed and E arbitrary. Let B = A ∩ E and C = Aᶜ ∩ E. Let Cₙ = [x ∈ C: dist(x, A) ≥ n⁻¹], so that Cₙ ↑ C and dist(Cₙ, A) ≥ n⁻¹. By Carathéodory's condition, μ*(E) = μ*(B ∪ C) ≥ μ*(B ∪ Cₙ) = μ*(B) + μ*(Cₙ). Hence it suffices to prove that μ*(Cₙ) → μ*(C), or, since Cₙ ↑ C, that μ*(C) ≤ limₙ μ*(Cₙ).

Let Dₙ = Cₙ₊₁ − Cₙ; if x ∈ Dₙ₊₁ and |x − y| < n⁻¹(n + 1)⁻¹, then dist(y, A) ≤ |y − x| + dist(x, A) < n⁻¹, so that y ∉ Cₙ. Thus |x − y| ≥ n⁻¹(n + 1)⁻¹ if x ∈ Dₙ₊₁ and y ∈ Cₙ, and so Dₙ₊₁ and Cₙ are at positive distance if they are nonempty. By the Carathéodory condition extended inductively,

(11.6)  μ*(C₂ₙ₊₁) ≥ Σₖ₌₁ⁿ μ*(D₂ₖ)

and

(11.7)  μ*(C₂ₙ) ≥ Σₖ₌₁ⁿ μ*(D₂ₖ₋₁).

By subadditivity,

(11.8)  μ*(C) ≤ μ*(C₂ₙ) + Σ_{k=n}^{∞} μ*(D₂ₖ) + Σ_{k=n+1}^{∞} μ*(D₂ₖ₋₁).

If Σₖ μ*(D₂ₖ) or Σₖ μ*(D₂ₖ₋₁) diverges, μ*(C) ≤ limₙ μ*(Cₙ) follows from (11.6) or (11.7); otherwise, it follows from (11.8). ∎

*This topic may be omitted; it is needed only for the construction of Hausdorff measure in
Section 19.
Example 11.6. For A ⊂ R¹ define λ*(A) = inf Σₙ (bₙ − aₙ), where the infimum extends over countable coverings of A by intervals (aₙ, bₙ]. Then λ* is an outer measure (Example 11.1). Suppose that A and B are at positive distance, and consider any covering of A ∪ B by intervals (aₙ, bₙ]; (11.5) will follow if Σ(bₙ − aₙ) ≥ λ*(A) + λ*(B) necessarily holds. The sum here is unchanged if (aₙ, bₙ] is replaced by a disjoint union of subintervals each of length less than dist(A, B). But if bₙ − aₙ < dist(A, B), then (aₙ, bₙ] cannot meet both A and B. Let M and N be the sets of n for which (aₙ, bₙ] meets A and B, respectively. Then M and N are disjoint, A ⊂ ⋃_{n∈M} (aₙ, bₙ], and B ⊂ ⋃_{n∈N} (aₙ, bₙ]; hence Σ(bₙ − aₙ) ≥ Σ_{n∈M} (bₙ − aₙ) + Σ_{n∈N} (bₙ − aₙ) ≥ λ*(A) + λ*(B). Thus λ* satisfies Carathéodory's condition. By Theorem 11.5, 𝓡¹ ⊂ ℳ(λ*), and so λ* restricted to 𝓡¹ is a measure by Theorem 11.1. By Lemma 2 in Section 2 (p. 24), λ*(a, b] = b − a. Thus λ* restricted to 𝓡¹ is λ, which gives still another construction of Lebesgue measure. ∎
PROBLEMS

11.1. The proof of Theorem 3.1 obviously applies if the probability measure is replaced by a finite measure, since this is only a matter of rescaling. Take as a starting point then the fact that a finite measure on a field extends uniquely to the generated σ-field. By the following steps prove Theorem 11.2, that is, remove the assumption of finiteness.
(a) Let μ be a measure (not necessarily even σ-finite) on a field ℱ₀ and let ℱ = σ(ℱ₀). If A ∈ ℱ₀ and μ(A) < ∞, restrict μ to a finite measure μ_A on the field ℱ₀(A) = [B: B ⊂ A, B ∈ ℱ₀] in A, and extend μ_A to a finite measure (still denoted μ_A) on the σ-field ℱ(A) = [B: B ⊂ A, B ∈ ℱ] generated in A by ℱ₀(A).
(b) Suppose that E ∈ ℱ. If there exist disjoint ℱ₀-sets Aₙ such that μ(Aₙ) < ∞ and E ⊂ ⋃ₙ Aₙ, put μ(E) = Σₙ μ_{Aₙ}(E ∩ Aₙ) and prove consistency; otherwise, put μ(E) = ∞.
(c) Show that μ is a measure on ℱ and agrees with the original μ on ℱ₀.

11.2. (a) If μ* is an outer measure and ν*(E) = μ*(E ∩ A), then ν* is an outer measure.
(b) If μₙ* are outer measures and μ*(E) = Σₙ μₙ*(E) or μ*(E) = supₙ μₙ*(E), then μ* is an outer measure.

11.3. 2.5 10.10 ↑ Suppose 𝒜 is a semiring that contains Ω.
(a) Show that the field generated by 𝒜 consists of the finite disjoint unions of elements of 𝒜.
(b) Suppose that measures μ₁ and μ₂ on σ(𝒜) are σ-finite on 𝒜 and that μ₁(A) ≤ μ₂(A) for all A in 𝒜. Show that μ₁(A) ≤ μ₂(A) for all A in σ(𝒜).
11.4.
Suppose that .JJI is a semiring and I' is a finitely additive set function on .JJI having values in [0, oo] and satisfying 1'(0) 0. (a) Suppose that A , A1 , , Am are �sets. Show that Ek'- 11'(Ak ) S I'( A ) if Uk'- 1 A k c A and the Ak are disjoint. Show that I'( A ) s Ek'- 1 1'(A k ) if A c Uk'- 1 A k (here the Ak need not be disjoint). (b) Show that I' is countably additive on .JJI if and only if it is countably subadditive. 11.5. Measure theory is sometimes based on the notion of a ring. A class .JJI is a ring if it contains the empty set and is closed under the formation of differences and finite unions; it is a a-ring if it is also closed under the formation of countable unions. (a) Show that a ring ( a-ring) containing Q is a field ( a-field). Find a ring that is not a field, a a-ring that is not a a-field, a a-ring that is not a field. (b) Show that a ring is closed under the formation of symmetric dif ferences and finite intersections. Show that a a-ring is closed under the formation of countable intersections. (c) Show that the field f( .JJI ) generated (see Problem 2.5) by a ring .JJI consists of the sets A for which either A e .JJI or A c e .JJI. Show that /( .JJI ) a( .JJI ) if .JJI is a a-ring. (d) Show that a (countably additive) measure on a ring .JJI extends to /( .JJI ) . Of course this follows by Theorem 11.3, but prove it directly. (e) Let .JJI be a ring and let .JJI ' be the a-ring generated (define) by .JJI. Show that a measure on .JJI extends to a measure on .JJI' and that the extension is unique if the original measure is a-finite on .JJI . Hint: To prove uniqueness, first show by adapting the proof of Theorem 3.4 that a monotone class containing .JJI must contain .JJI ' . 11.6. Call .JJI a strong semiring if it contains the empty set and is closed under the formation of finite intersections and if A , B e .JJI and A c B imply that there exist �sets A 0 , , A n such that A A0 c A1 c · · c A n B and A k - A k - l e .JJI, k 1, . . . , n . k (a) Show that the class of bounded rectangles in R is a strong semiring. (b) Show that a semiring need not be a strong semiring. (c) Call I' on .JJI pairwise additive if A , B, A u B e .JJI and A n B 0 imply that I'( A u B) I'( A ) + I'( B). It can be shown that, if I' is pairwise additive on a strong semiring, then it is finitely additive there. Show that this is false if .JJI is only a semiring. 11.7. Suppose that I' is a finite measure on the a-field !F generated by the field -Fa · Suppose that A e !F and that A1 , . . . , A n is a decomposition of A into Jli:sets. Show that there exist an �-set B and a decomposition B1 , , Bn of B into �-sets such that I' ( A k � Bk ) < £, k 1, . . . , n . Show that B may be taken as A if A e -Fo11.8. Suppose that I' is a-finite on a field -Fa generating the a-field !F. Show that for A !F there exist -Fa-sets Cn ; and Dn i such that, if c U n ni cn l and D n n U; Dn ; ' then C c A c D and I' ( D - C ) 0. 11.9. Show that a-finiteness is essential in Theorem 11.4. Show that part (ii) can fail if I'( B) oo. 11.10. Here is another approach to Theorem 11.4(i). Suppose for simplicity that p. is a finite measure on the a-field !F generated by a field �- Let t§ be the • • •
class of ℱ-sets G such that for each ε there exist ℱ₀-sets Aₖ and Bₖ satisfying ⋂ₖ Aₖ ⊂ G ⊂ ⋃ₖ Bₖ and μ((⋃ₖ Bₖ) − (⋂ₖ Aₖ)) < ε. Show that 𝒢 is a σ-field containing ℱ₀.

11.11. This and Problems 11.12, 16.25, 17.16, 17.17, and 17.18 lead to proofs of the Daniell-Stone and Riesz representation theorems. Let Λ be a real linear functional on a vector lattice ℒ of (finite) real functions on a space Ω. This means that if f and g lie in ℒ, then so do f ∨ g and f ∧ g (with values max{f(ω), g(ω)} and min{f(ω), g(ω)}), as well as αf + βg, and Λ(αf + βg) = αΛ(f) + βΛ(g). Assume further of ℒ that f ∈ ℒ implies f ∧ 1 ∈ ℒ (where 1 denotes the function identically equal to 1). Assume further of Λ that it is positive in the sense that f ≥ 0 (pointwise) implies Λ(f) ≥ 0 and continuous from above at 0 in the sense that fₙ ↓ 0 (pointwise) implies Λ(fₙ) → 0.
(a) If f ≤ g (f, g ∈ ℒ), define in Ω × R¹ an "interval"
(11.9)  (f, g] = [(ω, t): f(ω) < t ≤ g(ω)].

Show that these sets form a semiring 𝒜₀.
(b) Define a set function μ₀ on 𝒜₀ by

(11.10)  μ₀(f, g] = Λ(g − f).

Show that μ₀ is finitely additive and countably subadditive on 𝒜₀.

11.12. ↑ (a) Assume f ∈ ℒ and let fₙ = (n(f − f ∧ 1)) ∧ 1. Show that f(ω) ≤ 1 implies fₙ(ω) = 0 for all n and f(ω) > 1 implies fₙ(ω) = 1 for all sufficiently large n. Conclude that for x > 0,

(11.11)  (0, x fₙ] ↑ [ω: f(ω) > 1] × (0, x].
(b) Let !F be the smallest a-field with respect to which every f in !R is measurable: !F== a[r 1 H: f e !R, H e 911 ]. Let 'a be the class of A in !F for which A X (0, 1] e a ( Jaf0 ) . Show that 'a is a semiring and that !F== a('a). (c) Let be' the extension of �'o ( see (11.10)) to a(Jaf0) and for A e 'a define l'o ( A ) == P ( A X (0, 1]). Show that l'o is finitely additive and count ably subadditive on the semiring �.,
SECTION 12. MEASURES IN EUCLIDEAN SPACE

Lebesgue Measure
In Example 11.4 Lebesgue measure λ was constructed on the class 𝓡¹ of linear Borel sets. By Theorem 10.3, λ is the only measure on 𝓡¹ satisfying λ(a, b] = b − a for all intervals. There is in k-space an analogous k-dimensional Lebesgue measure λ_k on the class 𝓡^k of k-dimensional Borel sets (Example 10.1). It is specified by the requirement that bounded rectangles
have measure

(12.1)  λ_k[x: aᵢ < xᵢ ≤ bᵢ, i = 1, ..., k] = ∏ᵢ₌₁ᵏ (bᵢ − aᵢ).
This is ordinary volume, that is, length (k = 1), area (k = 2), volume (k = 3), or hypervolume (k ≥ 4). Since an intersection of rectangles is again a rectangle, the uniqueness theorem shows that (12.1) completely determines λ_k. That there does exist such a measure on 𝓡^k can be proved in several ways. One is to use the ideas involved in the case k = 1. A second construction is given in Theorem 12.5. A third, independent, construction uses the general theory of product measures; this is carried out in Section 18.† For the moment, assume the existence on 𝓡^k of a measure λ_k satisfying (12.1). Of course, λ_k is σ-finite.

A basic property of λ_k is translation invariance.*
Theorem 12.1. If A ∈ 𝓡^k, then A + x = [a + x: a ∈ A] ∈ 𝓡^k and λ_k(A) = λ_k(A + x) for all x.

PROOF. If 𝒢 is the class of A such that A + x is in 𝓡^k for all x, then 𝒢 is a σ-field containing the bounded rectangles, and so 𝒢 ⊃ 𝓡^k. Thus A + x ∈ 𝓡^k for A ∈ 𝓡^k. For fixed x define a measure μ on 𝓡^k by μ(A) = λ_k(A + x). Then μ and λ_k agree on the π-system of bounded rectangles and so agree for all Borel sets. ∎
If A is a (k − 1)-dimensional subspace and x lies outside A, the hyperplanes A + tx for real t are disjoint and, by Theorem 12.1, all have the same measure. Since only countably many disjoint sets can have positive measure (Theorem 10.2(iv)), the measure common to the A + tx must be 0. Every (k − 1)-dimensional hyperplane has k-dimensional Lebesgue measure 0.

The Lebesgue measure of a rectangle is its ordinary volume. The following theorem makes it possible to calculate the measures of simple figures.
Theorem 12.2. If T: R^k → R^k is linear and nonsingular, then A ∈ 𝓡^k implies that TA ∈ 𝓡^k and

(12.2)  λ_k(TA) = |det T| · λ_k(A).
t See also Problems 12. 1 5, 1 7.18, and 20.4.
*An analogous fact was used in the construction of a nonmeasurable set at the end of Section 3.
As a parallelepiped is the image of a rectangle under a linear transformation, (12.2) can be used to compute its volume. If T is a rotation or a reflection (an orthogonal or a unitary transformation), det T = ±1, and so λ_k(TA) = λ_k(A). Hence every rigid transformation (a rotation followed by a translation) preserves Lebesgue measure.

PROOF OF THE THEOREM. Since T⋃ₙ Aₙ = ⋃ₙ TAₙ and TAᶜ = (TA)ᶜ because of the assumed nonsingularity of T, the class 𝒢 = [A: TA ∈ 𝓡^k] is a σ-field. Since TA is open for open A, again by the assumed nonsingularity of T, 𝒢 contains all the open sets and hence (Example 10.1) all the Borel sets. Therefore, A ∈ 𝓡^k implies that TA ∈ 𝓡^k.

For A ∈ 𝓡^k, set μ₁(A) = λ_k(TA) and μ₂(A) = |det T| · λ_k(A). Then μ₁ and μ₂ are measures, and by Theorem 10.3 they will agree on 𝓡^k (which is the assertion (12.2)) if they agree on the π-system consisting of the rectangles [x: aᵢ < xᵢ ≤ bᵢ, i = 1, ..., k] for which the aᵢ and the bᵢ are all rational (Example 10.1). It suffices therefore to prove (12.2) for rectangles with sides of rational length. Since such a rectangle is a finite disjoint union of cubes and λ_k is translation-invariant, it is enough to check (12.2) for cubes

(12.3)  A = [x: 0 < xᵢ ≤ c, i = 1, ..., k]
that have their lower corner at the origin.

Now the general T can by elementary row and column operations† be represented as a product of linear transformations of these three special forms:

(1°) T(x₁, ..., x_k) = (x_{π1}, ..., x_{πk}), where π is a permutation of the set {1, 2, ..., k};
(2°) T(x₁, ..., x_k) = (ax₁, x₂, ..., x_k);
(3°) T(x₁, ..., x_k) = (x₁ + x₂, x₂, ..., x_k).
Because of the rule for multiplying determinants, it suffices to check (12.2) for T of these three forms. And, as observed, for each such T it suffices to consider cubes (12.3).

(1°) Such a T is a permutation matrix, and so det T = ±1. Since (12.3) is invariant under T, (12.2) is in this case obvious.

(2°) Here det T = a, and TA = [x: x₁ ∈ H, 0 < xᵢ ≤ c, i = 2, ..., k], where H = (0, ac] if a > 0, H = {0} if a = 0, and H = [ac, 0) if a < 0. In each case, λ_k(TA) = |a| · c^k = |a| · λ_k(A).

†BIRKHOFF & MAC LANE, Section 8.9.
(3°) Here det T = 1. Let B = [x: 0 < xᵢ ≤ c, i = 3, ..., k], where B = R^k if k < 3, and define

B₁ = [x: 0 < x₁ ≤ x₂ ≤ c] ∩ B,
B₂ = [x: 0 < x₂ < x₁ ≤ c] ∩ B,
B₃ = [x: c < x₁ ≤ c + x₂, 0 < x₂ ≤ c] ∩ B.

Then A = B₁ ∪ B₂, TA = B₂ ∪ B₃, and B₁ + (c, 0, ..., 0) = B₃. Since λ_k(B₁) = λ_k(B₃) by translation invariance, (12.2) follows by additivity. ∎
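The scaling law (12.2) is easy to check numerically in low dimensions. The following sketch is only an illustration, not part of the development; the particular matrix T and the sample size are arbitrary choices. It estimates λ₂(TA) for the unit square A by Monte Carlo, using the fact that a point y lies in TA exactly when T⁻¹y lies in A, and compares the estimate with |det T| · λ₂(A) = |det T|.

```python
# Numerical check of (12.2) for the unit square A = (0,1] x (0,1].
# Illustrative sketch only; T and the sample size are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[2.0, 1.0],
              [0.0, 3.0]])          # a nonsingular linear map of the plane
Tinv = np.linalg.inv(T)

# Bounding box of T(A): the parallelogram TA is the convex hull of the
# images of the four corners of the unit square.
corners = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
images = corners @ T.T
lo, hi = images.min(axis=0), images.max(axis=0)
box_area = np.prod(hi - lo)

# Monte Carlo: y lies in T(A) exactly when T^{-1} y lies in A.
n = 1_000_000
y = lo + (hi - lo) * rng.random((n, 2))
x = y @ Tinv.T
inside = np.all((x > 0.0) & (x <= 1.0), axis=1)
estimate = inside.mean() * box_area

print("Monte Carlo estimate of lambda_2(TA):", estimate)
print("|det T| * lambda_2(A):              ", abs(np.linalg.det(T)))
```

With this T the estimate is close to 6, which is |det T|, as (12.2) predicts.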
If T is singular, then det T = 0 and TA lies in a (k − 1)-dimensional subspace. Since such a subspace has measure 0, (12.2) holds if A and TA lie in 𝓡^k. The surprising thing is that A ∈ 𝓡^k need not imply that TA ∈ 𝓡^k if T is singular. For even a very simple transformation such as the projection T(x₁, x₂) = (x₁, 0) in the plane, there exist Borel sets A for which TA is not a Borel set.† But if A is a rectangle, TA is a Borel set by the argument involving (1°), (2°), and (3°); note that a can be 0 in (2°).
Regularity
Important among measures on 𝓡^k are those assigning finite measure to bounded sets. They share with λ_k the property of regularity:
Theorem 12.3. Suppose that μ is a measure on 𝓡^k such that μ(A) < ∞ if A is bounded.
(i) For A ∈ 𝓡^k and ε > 0, there exist a closed C and an open G such that C ⊂ A ⊂ G and μ(G − C) < ε.
(ii) If μ(A) < ∞, then μ(A) = sup μ(K), the supremum extending over the compact subsets K of A.
PROOF. The second part of the theorem follows from the first: μ(A) < ∞ implies that μ(A − A₀) < ε for a bounded subset A₀ of A, and it then
tSee HAUSDORFF, p. 241 .
follows from the first part that μ(A₀ − K) < ε for a closed and hence compact subset K of A₀.

To prove (i) consider first a bounded rectangle A = [x: aᵢ < xᵢ ≤ bᵢ, i ≤ k]. The set Gₙ = [x: aᵢ < xᵢ < bᵢ + n⁻¹, i ≤ k] is open and Gₙ ↓ A. Since μ(G₁) is finite by hypothesis, it follows by continuity from above that μ(Gₙ − A) < ε for large n. A bounded rectangle can therefore be approximated from the outside by open sets.

The rectangles form a semiring (Example 11.5). For an arbitrary set A in 𝓡^k, by Theorem 11.4(i) there exist bounded rectangles Aₖ such that A ⊂ ⋃ₖ Aₖ and μ((⋃ₖ Aₖ) − A) < ε. Choose open sets Gₖ such that Aₖ ⊂ Gₖ and μ(Gₖ − Aₖ) < ε/2^k. Then G = ⋃ₖ Gₖ is open and μ(G − A) < 2ε. Thus the general k-dimensional Borel set can be approximated from the outside by open sets. To approximate from the inside by closed sets, pass to complements. ∎
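For instance, with μ = λ and A the set of rationals in (0, 1], enumerated as r₁, r₂, ..., the open set G = ⋃ₙ (rₙ − ε2⁻ⁿ⁻¹, rₙ + ε2⁻ⁿ⁻¹) contains A and has λ(G) ≤ ε, while the closed set C = ∅ satisfies C ⊂ A ⊂ G and λ(G − C) ≤ ε; and in (ii) every compact subset of A is countable and so has measure 0 = λ(A).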
There are on the line many measures other than A which are important for probability theory. There is a useful way to describe the collection of all measures on 91 1 that assign finite measure to each bounded set. If p. is such a measure, define a real function F by (12.4 )
F( x )
=
{
p.(O, x] - p.( x, o ]
if X > 0, if X S 0.
It is because μ(A) < ∞ for bounded A that F is a finite function. Clearly, F is nondecreasing. Suppose that xₙ ↓ x. If x > 0, apply part (ii) of Theorem 10.2, and if x ≤ 0, apply part (i); in either case, F(xₙ) → F(x) follows. Thus F is continuous from the right. Finally,

(12.5)  μ(a, b] = F(b) − F(a)

for every bounded interval (a, b]. If μ is Lebesgue measure, then (12.4) gives F(x) = x.

The finite intervals form a π-system generating 𝓡¹, and therefore by Theorem 10.3 the function F completely determines μ through the relation (12.5). But (12.5) and μ do not determine F: if F(x) satisfies (12.5), then so does F(x) + c. On the other hand, for a given μ, (12.5) certainly determines F to within such an additive constant. For finite μ, it is customary to standardize F by defining it not by (12.4) but by

(12.6)  F(x) = μ(−∞, x];
then lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = μ(R¹). If μ is a probability measure, F is called a distribution function (the adjective cumulative is sometimes added).

Measures μ are often specified by means of the function F. The following theorem ensures that to each F there does exist a μ.
Theorem 12.4. If F is a nondecreasing, right-continuous real function on the line, there exists on 𝓡¹ a unique measure μ satisfying (12.5) for all a and b.
As noted above, uniqueness is a simple consequence of Theorem 10.3. The proof of existence is almost the same as the construction of Lebesgue measure, the case F(x) = x. This proof is not carried through at this point, because it is contained in a parallel, more general construction for k-dimensional space in the next theorem. For a very simple argument establishing Theorem 12.4, see the second proof of Theorem 14.1.

Specifying Measures in R^k
The a-field !Jt k of k-dimensional Borel sets is generated by the class of bounded rectangles =
[x:
a ; < x; s b; , i 1 , . . . , k ] =
(12.7)
A
(Example 10.1). If
I; (a ; , b; ], =
A has the form of a cartesian product
(12 .8) Consider the sets of the special form (12.9 )
sx
=
[y:
Y; s
X; '
i 1' . . . ' k]; =
Sx consists of the points " southwest" of x ( x 1 , , x k ) ; in the case k 1 it is the half-infinite interval ( oo, x]. Now Sx is closed, and (12.7) has the =
•
•
=
•
-
form
Therefore, the class of sets (12.9) generates gt k_ This class is a w-system. The objective is to find a version of Theorem 12.4 for k-space. This will in particular give k-dimensional Lebesgue measure. The first problem is to find the analogue of (12.5). A bounded rectangle (12.7) has 2 k vertices-the points x = (x 1 , , x k ) for which each X; is either a ; or b; . Let sgn A x, the signum of the vertex, be •
•
•
SECTION
12.
177
MEASURES IN EUCLI DEAN SPACE
1 or - 1, according as the number of i k(1 < i < k ) satisfying X; = a ; is even or odd. For a real function F on R , the difference of F around the vertices of A is &AF = E sgn x F( x) , the sum extending over the 2 k vertices x of A . In the case k = 1, A = ( a , b ] and &AF = F( b ) - F( a) . In the case k = 2, &AF = F( b 1 , b2 ) - F( b 1 , a 2 ) - F(a 1 , b2 ) + F( a 1 , a 2 ) . Since the k-dimensional analogue of (12.4) is complicated, suppose at first that p. is a finite measure on !1/ k and consider instead the analogue of (1 2.6), namely +
A
F( X ) = ,., [ y : Y; < Xi ' i = 1 ' . . . ' k ] .
(1 2.11) Suppose that Then
·
Sx is defined by (12.9) and A is a bounded rectangle (12.7).
(12.12) To see this, apply to the union on the right in (12.10) the inclusion-exclusion formula (10.5). The k sets in the union give 2 k - 1 intersections, and these are the sets Sx for x ranging over the vertices of A other than ( b 1 , , b k ). Taking into account the signs in (10.5) leads to (12.12). •
+
-
+
-
+
+
-
+
+
-
-
+
-
+
-
+
•
•
+
+
Suppose x < n > � x in the sense that xf n > � X; as n --+ oo for each i = 1, . . . , k. Then Sx < " ) � Sx , and hence F(x < n >) --+ F(x ) by Theorem 10.2(ii). In this sense, F is continuous from above.
Suppose that the real function F on R k is continuous from above and satisfies &AF > 0 for bounded rectangles A . Then there exists a unique measure p. on !1/ k satisfying (12.12) for bounded rectangles A .
Theorem 12.5.
The empty set can be taken as a bounded rectangle (12.7) for which a 1 = h; for some i, and for such a set A, &A F = 0. Thus (12.12) defines a finite-valued set function p. on the class of bounded rectangles. The point of the theorem is that p. extends uniquely to a measure on !1/ k . The uniqueness is an immediate consequence of Theorem 10.3, since the bounded rectangles form a w-system generating !1/ k .
178
MEASURE
If F is bounded, then p. will be a finite measure. But the theorem does not require that F be bounded. The most important unbounded F is F(x) = x 1 • • • x k . Here &AF = (b 1 - a 1 ) • • • (b k - a k ) for A given by (12.7). This is the ordinary volume of A as specified by (12.1). The corresponding measure extended to !Ji k is k-dimensional Lebesgue measure as described at the beginning of this section. 12.5. As already observed, the uniqueness of the extension is easy to prove. To prove its existence it will first be shown that p. as defined by (12.12) is finitely additive on the class of bounded rectangles. Suppose that each side I; = ( a ; , b; ) of a bounded rectangle (12.7) is partitioned into n ; subintervals JiJ == ( t;. 1 _ 1 , t! � ], j == 1, . . . , n ; , where a ; == t; o < til < · · · < t; n ; = b;. The n 1 n 2 · · · n k rectanpes PROOF OF THEOREM
{ 12.13) then partition A . Call such a partition regular. It will first be shown that regular partitions: ( 12.14)
,.,. ( A )
== E ,.,. ( �1 ... it
ik
• . •
p.
adds for
j )• lc
The right side of (12.14) is E8Ex sgn8x · F(x), where the outer sum extends over the rectangles B of the form (12.13) and the inner sum extends over the vertices x of B. Now ( 12.15 )
E E sgn8x · F( x ) B
X
== L F( x ) L sgn8x , X
B
where on the right the outer sum extends over each x that is a vertex of one or more of the B 's, and for fixed x the inner sum extends over the B 's of which it is a vertex. Suppose that x is a vertex of one or more of the B 's but is not a vertex of A . Then there must be an i such that X; is neither a; nor b;. There may be several such i, but fix on one of them and suppose for notational convenience that it is i == 1. Then x1 == t1 1 with 0 < j < n 1 • The rectangles (12.13) of which x is a vertex therefore come in pairs B' == � 2 · · ·) and B " == B 1 • 1 2 ilc ' and sgn 8 x = - sgn 8 ,x. Thus the inner sum on t'&e n&bt in (12.15) is 6 if x is not a vertex of A . On the other hand, if x is a vertex of A as well as of at least one B, then for each i either X; == a ; == t; o or X; == b; = t; n . · In this case x is a vertex of only one B of the form (12.13)- the one for which j; == 1 or j; == n ; , according as X; = a; or X; == b, -and sgn8x == sgn A x. Thus the right side of (12.15) reduces to ll A F, which proves (12.14). +
,
• • .
'
!
� -- -
-- - -- - -- ----
I
I
·
-
-
-
12.
179 Now suppose that A == U: _ 1 A u, where A is the bounded rectangle (12.8), X lk u for u == 1, . . . , n , and the A u are disjoint. For each i (1 < i < 11u X Au k ) , the intervals /i l , . . . , I; n have I; as their union, although they need not be disj oint. But their endpoints split I; into disjoint subintervals .Ii i , . . . , .f; n such that each l; u is the union of certain of the JiJ . The rectangles B of the form (i2.13) are a regular partition of A as before; furthermore, the B 's contained in a single A u form a regular partition of A u . Since the A u are disjoint, it follows by (12.14) that SECTION
=
•
•
MEASURES IN EUCLIDEAN SPACE
·
n
n
It ( A ) == E �t { B) == E E It { B ) == E It { A u ) · B
u- 1
B c A11
u= 1
Therefore, I' is finitely additive on the class J k of bounded k-dimensional rectangles. As shown in Example 11 .5, Jk is a semiring, and so Theorem 11.3 applies. Now suppose that U:_ 1 A u e A , where A and the A u are in Jk and the latter are disjoint. As shown in Example 11.2, A - U:_ 1 A u is a finite disjoint union u:_ 1 Bv of sets in Jk . Therefore, E: .... I �t ( Aj ) + E� t i' ( Bv ) == �t ( A ), and so n
n
U
E It { A u ) S It { A ) if
{ 12.16)
u- 1
u -= 1
Au e A ,
e U: _ 1 A u, where A and the A u are in Jk . Let B1 == A n A1 and Bu == A n A u n A} n · · · n A � _ 1 • Again, each Bu is a finite disjoint union Ut, Bu of elements of Jk . The Bu are disjoint, and so the Buv taken all together are also disjoint. They have union A , so that I'( A ) == EuEv�t ( But, ). Since Ut, Buv == Bu e A u , an application of (12.16) gives
Now suppose that A l'
n
�£ { A ) S E �t ( A u ) if A
{ 12.17)
u- 1
e
n
U
Au.
u- 1
Of
course, (12.16) and (12.17) give back finite additivity again. To apply Theorem 11.3 requires showing that I' is countably subadditive on Jk . Suppose then that A e U�_ 1 A u , where A and the A u are in Jk . The problem is to prove that �t { A )
(12.18)
<
00
E �t ( A u ) ·
u- 1
Suppose that £ > 0. If A is given by (12.7) and B == [ x: a ; + B < X; < b; , i s k ], then I' ( B) > I' ( A ) £ for small enough positive B because I' is defined by (12.12) and F is continuous from above. Note that A contains the closure B- == [x: a; + B < x, < b, , i < k] of B. Similarly, for each u there is in Jk a set Bu == [x: u a i u < X; < b u + Bu , i < k ] such that �t ( Bu ) < �t ( A u ) + £/2 and A u is in the interior B: == [ x: a,u < X; < b;u + Bu , i < k] of Bu . Since B- e A e U:'_ 1 A u e U�_ 1 B:, it follows by the Heine-Bore) theorem that B c B- e u : _ 1 B: e U:_ 1 Bu for some n . Now (12.17) applies, and so �t( A ) - £ < p. ( B ) < E: .... 1 �t ( Bu ) < EC:- 1 �t ( A u ) + £. Since £ was arbitrary, the proof of (12.18) is complete. -
,
MEASURE 180 Thus p. as defined by (12.12) is finitely additive and countably subadditive on the semiring J* . By Theorem 11.3, p. extends to a measure on at" = a ( J " ). •
PROBLEMS 12.1. 12.2.
Suppose that p. is a measure on 9P1 which is finite for bounded sets and is translation-invariant: p.(A + x) p.(A ) . Show that p. ( A ) a A ( A ) for some a > 0. Extend to R*. Suppose that A e 9P1 , A ( A ) > 0, and 0 < 8 < 1. Show that there is a bounded open interval I such that A( A n I) > 8A( I). Hint: Show that A( A ) may be assumed finite and choose an open G such that A c G and A( A ) > 6A(G). Now G U,, In for disjoint open intervals In [A12], and E n A(A n In ) > 8EnA(In ); use an In. f If A e 911 and A ( A ) > 0, then the origin is interior to the difference set D( A ) [ x - y : x, y e A ]. Hint: Choose a bounded open interval I as in Problem 12.2 for 6 � . Suppose that l z 1 < A( I)/2 ; since A n I and (A n I) + z are contained in an interval of length less than 3A( I)/2 and hence cannot be disjoint, z e D( A ). 4.13 12.3 f The following construction leads to a subset H of the unit interval that is nonmeasurable in the extreme sense that its inner and outer Lebesgue measures are 0 and 1 : A .( H ) 0 and A* ( H ) 1 (see (3.9) and (3 .1 0)). Complete the details. The ideas are those in the construction of a nonmeasurable set at the end of Section 3. It will be convenient to work in G [0, 1); let E9 and e denote addition and subtraction modulo 1 in G, which is a group with identity 0. (a) Fix an irrational 6 in G and for n 0, ± 1, ± 2, . . . let (Jn be n8 reduced modulo 1. Show that 6n E9 6m 6n + m , 6n e 6m 6n - m , and the (Jn are distinct. Show that { 62 n : n 0, ± 1, . . . } and { 62 n + 1 : n 0, + 1, . . . } are dense in G. (b) Take x and y to be equivalent if x e y lies in { 6n : n 0, ± 1, . . . }, which is a subgroup. Let S contain one representive from each equivalence class (each coset). Show that G Un (S E9 6n ), where the union is disjoint. Put H Un (S E9 62 n ) and show that G - H H E9 6. (c) Suppose that A is a Borel set contained in H. If A( A ) > 0, then D( A ) contains an interval (0, £); but then some 62 k + I lies in (0, £) c D( A ) c D( H), and so 62 k + t h 1 - h 2 h 1 e h 2 ( s 1 E9 62 n ) e (s 2 E9 62 n ) for some h 1 , h 2 in H and some s 1 , s 2 in S. Deduce that s : s2 and ob tain a contradiction. Conclude that A.(H) 0. 1. (d) Show that A .( H E9 6) 0 and A*( H) f The construction here gives sets Hn such that Hn f G and A .( Hn ) 0. If Jn G - Hn , then Jn � 0 and A*( Jn ) 1. (a) Let Hn U Z , ( S E9 61c ), so that Hn f G. Show that the sets Hn E9 �2 n + !} v are disjoint for different v . (b) Suppose that A is a Borel set contained in Hn . Show that A and indeed all the A E9 6< 2 , + I ) v have Lebesgue measure 0.
==
==
==
12.3.
12.4.
==
==
==
==
== == == ==
==
12.5.
==
== ==
==
_ _
==
==
== == == ==
==
== == == ==
==
SECTION
12.6.
12.7.
12.8. 12.9.
12.10. 12.11.
12.12.
U.13.
12.14.
1 2.
MEASURES IN EUCLIDEAN SPACE
181
Suppose that p. is nonnegative and finitely additive on �k and p.( R k ) < oo . Suppose further that p.(A ) == sup p.( K), where K ranges over the compact subsets of A . Show that p. is countably additive. (Compare Theorem 12.3(ii).) Suppose p. is a measure on � k such that bounded sets have finite measure. Given A , show that there exist an F, set U (a countable union of closed sets) and a G8 set V (a countable intersection of open sets) such that U c A c V and p.( V - U) == 0. 2.17 f Suppose that p. is a nonatomic probability measure on ( R k , � k ) and that p.(A) > 0. Show that there is an uncountable compact set K such that K c A and p.( K ) == 0. The closed support of a measure p. on � k is a closed set C0 such that CO c C for closed C if and only if C supports p.. Prove its existence and uniqueness. Characterize the points of CO as those x such that p. ( U) > 0 for every neighborhood U of x. If k == 1 and if p. and the function F(x) are related by (12.5), the condition is F(x - £ ) < F(x + £) for all £; x is in this case called a point of increase of F. Let t§ be the class of sets in � 2 of the form [(x1 , x 2 ): x1 e H) for some H in �1 . Show that t§ is a a-field in � 2 and that two-dimensional Lebesgue measure is not a-finite on t§ even though it is a-finite on � 2 • Of minor interest is the k-dimensional analogue of (12.4). Let I, be (0, t ) for t > 0 and ( t, O] for t < 0, and let Ax == lx 1 X · · · X lx ,<" Let cp(x) be + 1 or - 1 according as the number of i, 1 < i < k , for which X; < 0 is even or odd. Show that, if F(x) == cp(x)p.(Ax ), then (12.12) holds for bounded rectangles A . Call F degenerate if it is a function of some k - 1 of the coordinates, the requirement in the case k == 1 being that F is constant. Show that ll A F == 0 for every bounded rectangle if and only if F is a finite sum of degenerate functions; (12.12) determines F to within addition of a function of this sort. Let G be a nondecreasing, right-continuous function on the line and put F(x, y ) = min{ G(x), y }. Show that F satisfies the conditions of Theorem 1 12.5, that the curve C == [(x, G(x)): x e R ] supports the corresponding measure, and that A 2 (C) == 0. Let F1 and F2 be nondecreasing, right-continuous functions on the line and put F(x. , x 2 ) == F1 (x 1 )F2 (x 2 ). Show that F satisfies the conditions of Theorem 12.5. Let p., p. 1 , p. 2 be the measures corresponding to F, F1 , F2 , and prove that p.(A 1 X A 2 ) == p. 1 (A 1 )p. 2 (A 2 ) for intervals A 1 and A 2 • This p. is the product of p. 1 and p. 2 ; products are studied in a general setting in Section 18. Let fl' be the class of sets A such that there exist an open G and a closed C such that C c A c G and p.(G - C) < £. Prove Theorem 12.3(i) for the case of a finite p. by showing that .JII' is a a-field containing the closed rectangles.
182
MEASURE
12.15. In the case F(x)
x 1 x" , the proof of Theorem 12.5 shows that k dimensional volume (see (12.1)) is finitely additive and countably subad ditive for bounded rectangles. Starting with this fact, use the ideas in Example 11.6 to give another construction of """ . 12.16. Identifying "' · Take as a starting point the fact that each point (x, y ) of the plane other than the origin has a unique polar representation ( x, y ) == 1 ( rcos 8, r sin 8), where 0 < 8 < 2 w and r == (x 2 + y 2 ) 12 . Let D, be the closed disc [( x, y): x 2 + y 2 < r 2 ] and define w0 == A 2 ( D1 ). The problem is to ideo tify w0 , an area, with the , of trigonometry. (a) Let T" be the linear map with matrix ==
•
•
•
- sin q> COS q>
]
Show by the addition formulas for sin and cos that T" rotates the plane through the angle q>. Show that it preserves area. (b) Consider the sector S9 1 9 consisting of the points with polar representa2 tions satisfying 61 < 6 < 62 • Show that A 2 ( D1 n S9 9 2 ) = w0 ( 82 - 81 )/2w by considering in sequence the cases where (i) (62 - f1 )/2w is the recipro cal of an integer, (ii) a rational, (iii) arbitrary. (c) Show that the linear map taking (x, y ) to ( rx, ry) multiplies areas by r 2 • Let A ,1 , be the annulus D, - D,1 , and sltow that 2
2
(12.19) Problem 18.24 shows how to prove w0 D1 n So9 in two different ways.
==
, by calculating the area of
SECI'ION 13. MEASURABLE FUNCitONS AND MAPPINGS
If a real function X on 0 has finite range, it is by the definition in Section 5 a simple random variable if [ w: X( w) = x] lies in the basic a-field � for each x. The requirement appropriate for the general real function X is stronger; namely, [ w: X( w) e H] must lie in � for each linear Borel set H. An abstract version of this definition greatly simplifies the theory of such functions . .,
Measurable Mappings
Let ( 0, $ ) and ( 0', � ') be two measurable spaces. For a mapping T: 0 0', consider the inverse images r- 1A' = [ w E 0: Tw E A'] for A' c 0'. (See [A 7] for the properties of inverse images.) The mapping T is measur able $j$' if T- �, E $ for each A' E $'. -+
SECTION
1 3.
MEASURABLE FUNCTIONS AND MAPPINGS
183
For a real function /, the image space 0' is the line R 1 , and in this case 91 1 is always tacitly understood to play the role of $i" ' . A real function f on 0 is thus measurable $i" (or simply measurable if it is clear from the context what !F is involved) if it is measurable fFjgJ 1 -that is, if f- 1H = [ w: f(w) e H] E !F for every H e qJ 1 . In probability contexts, a real measur able function is called a random variable. The point of the definition is to ensure that [ w: /( w) e H] has a measure or probability for all sufficiently regular sets H of real numbers-that is, for all Borel sets H.
f with finite range is measurable if a condition to f- 1 { x } e !F for each singleton { x }, but this is too weak impose on the general f. (It is satisfied if ( 0, !F) = ( R 1 , qJ 1 ) and f is any one-to-one map of the line into itself; but in this case f- 1H even for so simple a set H as an interval can for an appropriately chosen f be any uncountable set, say the non-Borel set constructed in Section 3.) On the other hand, for a measurable f with finite range, f- 1H e !F for every H c R 1 ; but this is too strong a condition to impose on the general f. (For ( D, /F) = ( R 1 , 9l 1 ), even f(x) = x fails to satisfy it.) Notice that nothing is • required of fA ; it need not lie in qJ 1 for A in !F. Example 13. 1.
A real function
If in addition to (0, !F), ( 0', !F'), and the map T: 0 --+ 0' there is a third measurable space (0", !F ") and a map T': 0' --+ 0", the composition T'T = T' o T is the mapping 0 --+ 0" that carries w to T'(T( w )).
If T- 1A' e !F for each A' e d ' and d' generates !F', then T is measurable !Fj!F'. (ii) If T is measurable !Fj!F' and T' is measurable !F 'j!F", then T'T is measurable !Fj!F ". PROOF. Since T- 1 (0' - A') = 0 - T - 1A' and T- 1 (U n A�) = U n T- 1A�, and since !F is a a-field in 0, the class [A': T - 1A' e !F] is a a-field in 0'. If Theorem 13.1. (i)
this a-field contains d ', it must also contain a( d '), and (i) follows. As for (ii), it follows by the hypotheses that A" e !F " implies that (T') - 1A" e $ ', which in tum implies that (T'T) - 1A" = [w: T'Tw e A"] = [w: Tw E (T') - 1A"] = T - 1 ((T') - 1A") E !F. • By part (i), if f is a real function such that [w: /(w) < x] lies in !F for all x, then f is measurable !F. This condition is usually easy to check. Mappings into Rk
For a mapping f: 0 --+ R k carrying D into k-space, fA k is always under stood to be the a-field in the image space. In probabilistic contexts, a
184
MEASURE
measurable mapping into R k is called a random vector. Now f must have the form ( 1 3 .1 ) for real functions h ( w ). Since the sets (12.9) (the "southwest regions") generate 9t k , Theorem 13.1(i) implies that f is measurable � if and only if the set ( 1 3 .2 )
[ w : /l ( w ) < x l , . . . , fk ( w )
�
k
x k ] = n [ w : h ( w ) < xj ] j- 1
lies in � for each (x 1 , , x k ). This condition holds if each h is measur able !F. On the other hand, if xj = x is fixed and x 1 = · · · = xi - I = xi+ I = · · · = x k = n goes to oo, the sets (13.2) increase to [w: h (w) � x]; the condition thus implies that each h is measurable. Therefore, f is measurable !F if and only if each component function h is measurable �. This provides a practical criterion for mappings into R k . A mapping f: R ; --+ R k is defined to be measurable if it is measurable 91 ;!9/ k . Such functions are often called Borel functions. To sum up, T: D D' is measurable �/�' if T- 1A' e � for all A' e � '; f: 0 --+ R k is measurable !F if it is measurable �!9t k ; and f: . R ; --+ R k is measurable (a Borel function) if it is measurable gJ ijgJ k . If H lies outside gJ1, then IH ( i = k = 1) is not a Borel function. •
•
•
-+
Theorem 13.2.
Iff: R ; --+ R k is continuous, then it is measurable.
As noted above, it suffices to check that each set (13.2) lies in gJ i. • But each is closed because of continuity.
PROOF.
If h = D --+ R 1 is measurable � ' j = 1, . . . , k, then g( / ( w ), . . . , /k (w)) is measurable � if g: R k --+ R 1 is measurable-in 1 particular, if it is continuous. Theorem 13.3.
If the h are measurable, then so is (13.1), so that the result follows • by Theorem 13.1(ii). PROOF.
Taking g( xl , . . . ' x k ) to be E�- IX;, n�- IX;, and max{ X I , . . . ' x k } in tum shows that sums, products, and maxima of measurable functions are mea surable. If /(w ) is teal and measurable, then so are sin /(w), e'f< w > , and so on, and if f ( w ) never vanishes, then 1 If( w ) is measurable as well.
SECTION
13.
MEASURABLE FUNCTIONS AND MAPPINGS
185
Limits and MeasurabUity
For a real function f it is often convenient to admit the artificial values oo and - oo - to work with the extended rea/ line [ - oo, oo]. Such an f is by definition measurable j&' if [ w: /( w) e H] lies in j&' for each Borel set H of (finite) real numbers and if [w: /(w) = oo] and [w: /( w) = - oo] both lie in .F. This extension of the notion of measurability is convenient in connection with limits and suprema, which need not be finite.
Suppose that /1 , /2 , are real functions measurable j&' _ (i) The functions supnfn , infn fn , lim supnfn , and lim infn/n are measur
'lbeorem 13.4.
able !F. (ii) (iii) (iv)
PROOF.
•
•
•
If lim nfn exists everywhere, then it is measurable j&'. The w-set where { /n ( w )} converges lies in j&'. Iff is measurable j&', then the w-set where fn ( w) --+ /( w) lies in !F. Clearly, (supnfn � x) = n n ( fn < x) lies in j&' even for X = 00 and
oo, and so supn fn is measurable. The measurability of infn /n follows the same way, and hence lim supn fn = inf n sup k n fk and lim inf n/n = supn inf k n fk are measurable. If limn fn exists, it coincides with these last two functions and hence is measurable. Finally, the set in (iii) is the set where lim supn /n (w) = lim infn/n ( w ), and that in (iv) is the set where this common value is /( w ). • x = -
�
�
Special cases of this theorem have been encountered before-part (iv), for example, in connection with the strong law of large numbers. The last three parts of the theorem obviously carry over to mappings into R k . A simple real function is one with finite range; it can be put in the form
(13 .3 ) where the A; form a finite decomposition of �. It is measurable !F if each A; lies in !F. The simple random variables of Section 5 have this form. Many results concerning measurable functions are most easily proved first for simple functions and then, by an appeal to the next theorem and a passage to the limit, for the general measurable function.
Iff is real and measurable !F, there exists a sequence { In } of simple functions, each measurable !F, such that
Theorem 13.5.
(13.4)
iff( w ) > 0
186
MEASURE
and ill( w ) < O.
(13 .5 ) PROOF.
Define
( 13.6 ) In( W ) =
-n - ( k - 1 )2 - n ( k - 1 )2 - n n
-n, < - ( k - 1 )2 - n , n, n2 < 1 < k n n if ( k - 1 ) 2 - < l( w ) < k2 - , 1 < k < n2 n , if n < I ( w ) < oo . if - oo < l( w) < n if -k2 - < l ( w )
•
This sequence has the required properties.
Note that (13.6) covers the possibilities l(w) = oo and l(w) = - oo . I f A e !F, a function I defined only on A is by definition measurable if [w e A : l(w) e H] lies in !F for H e !1/ 1 and for H = { oo } and H =
{ - 00 }.
Transfonnations of Measures
Let (D, !F) and (D', !F ') be measurable spaces and suppose that the mapping T: D --+ D' is measurable !Fj!F'. Given a measure p. on !F, define a set function p.T - 1 on !F ' by
( 13.7 )
A' E !F '.
p.T - 1 assigns value p.(T- 1A') to the set A'. If A' e �', then T - 1A' e !F because T is measurable, and hence the set function p.T - 1 is well defined on � '. Since T- 1 U n A� = U n T- 1A� and the T - 1A� are disjoint sets in D if the A� are disjoint sets in D', the countable additivity of p.T - 1 follows from that of p.. Thus p.T - 1 is a measure. This way of transferring a That is,
measure from D to D' will prove useful in a number of ways. If p. is finite, so is p. T- 1 ; if p. is a probability measure, so is p. T - 1 • t t But see Problem 1 3.22.
SECTION
13.
187
MEASURABLE FUNCTIONS AND MAPPINGS
PROBLEMS 13.1.
Show by examples in which Q == 0' is a space of two points that measur ability .Fj.F' is compatible with all but one of the eight statements A A A A
13.2. 13.3. 13.4.
13.5.
13.6. 13.7.
13.8. 13.9.
e .F and e .F and e .F and e .F and
rA rA rA rA
A' A' A' A'
e .F', e .F', e .F', e .F',
and r- 1A' E .F, and r- 1A' e .F, e .F' and r- 1A E .F e .F' and 1 1A' e .F.
E .F' E .F'
'
'
Although an indicator lA is measurable .F if and only if A e .F, the simple function (13.3) can be measurable .F even if some of the A; lie outside .F. Show that a monotone real function is measurable. Functions are often defined in pieces (for example, let l( x) be x3 or x- 1 as x � 0 or x < 0), and the following result shows that the function is measurable if the pieces are. Consider measurable spaces (Q, .F) and ( 0', .F') and a map T: Q -+ 0'. Let A 1 , A 2 , • • • be a countable covering of Q by $sets. Consider the a-field � == [ A : A c An , A e .F) in A n and the restriction T, of r to An. Show that T is measurable .Fj.F' if and only if T, is measurable !F,/.F' for each n . (a) For a map r and a-fields .F and .F' , define r- 1-F' == (r- 1A' : A' E .F'] and r.F== [ A' : 1 1A' E .F). Show that r- 1-F' and r.F are a-fields and that pteasurability .Fj.F' is equivalent to r- 1-F' c .F and to !F' c T.F. (b) For given !F', r- 1-F', which is the smallest a-field for which r is measurable .Fj.F', is by definition the a-field generated by r. For simple random variables describe a( XI ' . . . ' xn ) in these terms. (c) Let a'( J/1') be the a-field in 0' generated by J/1'. Show that a(r- 1JJI') == r- 1 ( a'( J/1' )). Prove Theorem 10.1 by taking r to be the identity map from no to n. f Suppose that 1: Q R1 • Show that I is measurable r- 1!F' if and only if there exists a map q>: 0' R1 such that q> is measurable .F' and I == q>T. Hint: First consider simple functions and then use Theorem 13 . 5. f Relate the result in Problem 13.6 to Theorem 5 . 1(ii). Show of real functions I and g that I( w ) + g( w) < x if and only if there exist rationals r and s such that r + s < x, I< w) < r, and g( w) < s. Prove directly that I + g is measurable .F if I and g are. Let .F be a a-field in R 1 • Show that 911 1c .F if and only if every continuous function is measurable !F. Thus 9P is the smallest a-field with respect to which all the continuous functions are measurable. -+
-+
188
MEASURE
13.10. Consider on R 1 the smallest class
13.1 1. 13. 12. 13.13.
13. 14. 13.15.
13.16.
(that is, the intersection of all classes) of real functions containing all the continuous functions and closed under pointwise passages to the limit. The elements of !I are called Baire functions. Show that Baire functions and Borel functions are the same thing. A real function I on the line is upper semicontinuous at x if for each there is a B such that l x - Y l < B implies that l( y ) < /( x) + £ . Show that. if I is everywhere upper semicontinuous, then it is measurable. If I is bounded and measurable !F, then there exist simple functions, each measurable !F, such that /,. ( w ) --+ I( w ) uniformly in w. Suppose that In and I are finite-valued, jlt:measurable functions such that In < w ) --+' I< w ) for w e A , where p. (A ) < oo ( p. a measure on !F ). Prove Egoroff s theorem : For each £ there exists a subset B of A such that I'( B) < £ and In < w ) --+ I< w ) uniformly on A -1 B. Hint: Let B�' d be the set of w in A such that ll( w ) - /; ( w ) l > k - for some i > n . Show that 00 B
£
1 +
r-
=
""
p .( A ).
Show that, if � is a field generating !F, if I is measurable !F, and if I'( Q ) < oo , then for positive ( there exists a simple function g == E; x; IA such that A ; e � and I' [ w : II( w ) - g( w ) 1 > (] < (. Hint: First ap� proximate I by a simple function measurable !F, and then use Problem 11.7. 13.18. 2.8 f Show that, if I is measurable a ( JJI), then there exists a countable subclass � of JJI such that I is measurable a( � ). 13.19. Circular Lebesgue measure. Let C be the unit circle in the complex plane and define T: [0, 1) --+ C by Tw == e 2 w i . Let !I consist of the Borel subsets of [0, 1) and let A be Lebesfue measure on !1. Show that � == (A : r- 1A e !I] consists of the sets in � (identify R 2 with the complex plane) that are contained in C. Show that � is generated by the arcs of C. Circular Lebesgue measure is defined as I' == A 1 1 • Show that I' is invariant under rotations: 1'[6z: z e A ] == I'( A ) for A e � and 6 e C. 13.20. f Suppose that the circular Lebesgue measure of A satisfies I'( A ) > 1 n - t and that B contains at most n points. Show that some rotation carries B into A : 6 B c A for some 6 in C. 13.17. 11.7 f
w
SECTION
14.
DISTRIBUTION FUNCTIONS
13.21. In connection with (13.7), show that p.(T'T) - 1 = ( p.T- 1 )(T') - 1 •
189
1 13.22. Show by example that p. a-finite does not imply p.T- a-finite.
13.23. Consider Lebesgue measure A restricted to the class !I of Borel sets in
(0, 1]. For a fixed permutation n1 , n 2 , • • • of the positive integers, if x has dyadic expansion . x 1 x 2 •1 • • , take Tx == • xn1 x, 2 • • • • Show that T is measura ble �/!I and that AT- == A. SECI10N 14. DISTRIBUTION FUNCI10NS
Distribution Functions

A random variable as defined in Section 13 is a measurable real function X on a probability measure space (Ω, ℱ, P). The distribution or law of the random variable is the probability measure μ on (R¹, 𝓡¹) defined by

(14.1)  μ(A) = P[X ∈ A],  A ∈ 𝓡¹.
As in the case of the simple random variables in Chapter 1, the argument ω is usually omitted: P[X ∈ A] is short for P[ω: X(ω) ∈ A]. In the notation (13.7), the distribution is PX⁻¹. For simple random variables the distribution was defined in Section 5; see (5.9). There μ was defined for every subset of the line, however; from now on μ will be defined only for Borel sets, because unless X is simple, one cannot in general be sure that [X ∈ A] has a probability for A outside 𝓡¹. The distribution function of X is defined by

(14.2)  F(x) = μ(−∞, x] = P[X ≤ x]
for real x. By continuity from above (Theorem 10.2(ii)) for μ, F is right-continuous. Since F is nondecreasing, the left-hand limit F(x−) = lim_{y↑x} F(y) exists, and by continuity from below (Theorem 10.2(i)) for μ,

(14.3)  F(x−) = μ(−∞, x) = P[X < x].

Thus the jump or saltus in F at x is

F(x) − F(x−) = μ{x} = P[X = x].

Therefore (Theorem 10.2(iv))
190
MEASURE
discontinuity. Clearly,
(14.4)
lim F( x ) = O,
x � - oo
lim F( x ) = 1 .
A function with these properties must, in fact, be the distribution
function of some random variable:
a nondecreasing, right-continuous function satisfying (14.4), then there exists on some probability space a random variable X for which F(x) = P[ X < x]. FIRST PROOF. By Theorem 12.4, if F is nondecreasing and right-continu ous, there is on ( R 1 , 91 1 ) a measure p. for which p.(a, b] = F(b) - F( a). But lim x _. _ 00 F( x ) = 0 implies that p.( - oo, x] = F(x) , and lim x _. 00F(x) = 1 implies that p. ( R 1 ) = 1. For the probability space take (0, � ' P) = ( R 1 , !1/ 1 , p. ) and for X take the identity function: X(w) = w. Then P[ X < • x ] = p. [w e R 1 : w < x] = F(x). Theorem 14.1.
If F is
There is a proof that uses only the existence of Lebesgue measure on the unit interval and does not require Theorem 12.4. For the probability space take the open unit interval: 0 is (0, 1), � consists of the Borel subsets of (0, 1), and P(A) is the Lebesgue measure of A. To understand the method, suppose at first that F is continuous and strictly increasing. Then F is a one- to-one mapping of R 1 onto (0, 1); let cp : (0, 1) --+ R 1 be the inverse mapping. For 0 < w < 1, let X( w) = cp ( w ) . Since cp is increasing, certainly X is measurable �- If 0 < u < 1, then cp ( u) < x if and only if u < F(x) . Since P is Lebesgue measure, P[ X < x] = P[w e (0, 1): cp ( w ) < x] = P[w e (0, 1): w < F(x)] = F(x), as required. SECOND PROOF.
[Figure: the graph of F(x) and the quantile function φ(u).]

If F has discontinuities or is not strictly increasing, define†

(14.5)    φ(u) = inf[x: u ≤ F(x)]

for 0 < u < 1. Since F is nondecreasing, [x: u ≤ F(x)] is an interval stretching to ∞; since F is right-continuous, this interval is closed on the left. For 0 < u < 1, therefore, [x: u ≤ F(x)] = [φ(u), ∞), and so φ(u) ≤ x if and only if u ≤ F(x). If X(ω) = φ(ω) for 0 < ω < 1, then by the same reasoning as before, X is a random variable with P[X ≤ x] = F(x). •

† This is called the quantile function.

This second argument actually provides a simple proof of Theorem 12.4 for a probability distribution† F: the distribution μ (as defined by (14.1)) of the random variable just constructed satisfies μ(−∞, x] = F(x) and hence μ(a, b] = F(b) − F(a).

† For the general case, see Problem 14.2.
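The construction in the second proof is exactly the inverse-transform method for simulating a random variable with a prescribed distribution function. Below is a minimal illustrative sketch, not from the text: it uses the quantile function (14.5) of the exponential distribution function F(x) = 1 − e^{−αx} (see (14.7) below), for which φ has the explicit form −α⁻¹ log(1 − u); the rate α = 2.0 and the sample size are arbitrary choices.

```python
import math
import random

ALPHA = 2.0  # arbitrary rate parameter

def exp_cdf(x, alpha=ALPHA):
    # F(x) = 1 - e^{-alpha x} for x >= 0, as in (14.7)
    return 1.0 - math.exp(-alpha * x) if x >= 0 else 0.0

def exp_quantile(u, alpha=ALPHA):
    # phi(u) = inf[x: u <= F(x)]; for this F it is the explicit inverse of F
    return -math.log(1.0 - u) / alpha

# X = phi(U) with U uniform on (0, 1) has distribution function F
random.seed(0)
sample = [exp_quantile(random.random()) for _ in range(100_000)]
for x in (0.25, 0.5, 1.0, 2.0):
    empirical = sum(1 for s in sample if s <= x) / len(sample)
    print(f"x={x}: empirical={empirical:.4f}  F(x)={exp_cdf(x):.4f}")
```

The empirical distribution function of the simulated values agrees with F to within sampling error, which is just the identity P[φ(U) ≤ x] = P[U ≤ F(x)] = F(x) in numerical form.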
Exponential Distributions
There are a number of results which for their interpretation require random variables, independence, and other probabilistic concepts, but which can be discussed technically in terms of distribution functions alone and do not require the apparatus of measure theory. Suppose as an example that F is the distribution function of the waiting time to the occurrence of some event, say the arrival of the next customer at a queue or the next call at a telephone exchange. As the waiting time must be positive, assume that F(0) = 0. Suppose that F(x) < 1 for all x and furthermore suppose that

(14.6)    (1 − F(x + y))/(1 − F(x)) = 1 − F(y),    x, y ≥ 0.

The right side of this equation is the probability that the waiting time exceeds y; by the definition (4.1) of conditional probability, the left side is the probability that the waiting time exceeds x + y given that it exceeds x. Thus (14.6) attributes to the waiting-time mechanism a kind of lack of memory or aftereffect: If after a lapse of x units of time the event has not yet occurred, the waiting time still remaining is conditionally distributed just as the entire waiting time from the beginning. For reasons that will emerge later (see Section 23), waiting times often have this property.

The condition (14.6) completely determines the form of F. If U(x) = 1 − F(x), (14.6) is U(x + y) = U(x)U(y). This is a form of Cauchy's equation [A20], and since U is bounded, U(x) = e^{−αx} for some α. Since lim_{x→∞} U(x) = 0, α must be positive. Thus (14.6) implies that F has the exponential form

(14.7)    F(x) = 0 if x < 0,    F(x) = 1 − e^{−αx} if x ≥ 0,

and conversely.

Weak Convergence
Random variables X₁, ..., X_n are defined to be independent if the events [X₁ ∈ A₁], ..., [X_n ∈ A_n] are independent for all Borel sets A₁, ..., A_n, so that P[X_i ∈ A_i, i = 1, ..., n] = Π_{i=1}^n P[X_i ∈ A_i]. To find the distribution function of the maximum M_n = max{X₁, ..., X_n}, take A₁ = ··· = A_n = (−∞, x]. This gives P[M_n ≤ x] = Π_{i=1}^n P[X_i ≤ x]. If the X_i are independent and have common distribution function G and M_n has distribution function F_n, then

(14.8)    F_n(x) = Gⁿ(x).

It is possible without any appeal to measure theory to study the real function F_n solely by means of the relation (14.8), which can indeed be taken as defining F_n. It is possible in particular to study the asymptotic properties of F_n:

Example 14.1. Consider a stream or sequence of events, say arrivals of calls at a telephone exchange. Suppose that the times between successive events, the interarrival times, are independent and that each has the exponential form (14.7) with a common value of α. By (14.8) the maximum M_n among the first n interarrival times has distribution function F_n(x) = (1 − e^{−αx})ⁿ, x ≥ 0. For each x, lim_n F_n(x) = 0, which means that M_n tends to be large for n large. But P[M_n − α⁻¹ log n ≤ x] = F_n(x + α⁻¹ log n). This is the distribution function of M_n − α⁻¹ log n, and it satisfies

(14.9)    F_n(x + α⁻¹ log n) = (1 − n⁻¹e^{−αx})ⁿ → e^{−e^{−αx}}

as n → ∞; the equality here holds if α⁻¹ log n ≥ −x, and so the limit holds for all x. This gives for large n the approximate distribution of the normalized random variable M_n − α⁻¹ log n. •
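A quick numerical check of (14.9), offered only as an illustration: the sketch below compares F_n(x + α⁻¹ log n) = (1 − n⁻¹e^{−αx})ⁿ with the limit exp(−e^{−αx}) for a few values of n and x; the choice α = 1.0 is arbitrary.

```python
import math

ALPHA = 1.0  # arbitrary rate

def Fn_shifted(x, n):
    # F_n(x + alpha^{-1} log n) = (1 - e^{-alpha x}/n)^n once the argument is >= 0
    t = x + math.log(n) / ALPHA
    return (1.0 - math.exp(-ALPHA * t)) ** n if t >= 0 else 0.0

def gumbel(x):
    # the limit in (14.9)
    return math.exp(-math.exp(-ALPHA * x))

for x in (-1.0, 0.0, 1.0, 2.0):
    row = "  ".join(f"n={n}: {Fn_shifted(x, n):.4f}" for n in (10, 100, 10_000))
    print(f"x={x:+.1f}  {row}  limit={gumbel(x):.4f}")
```

Already for moderate n the shifted distribution function is close to its limit, which is the practical content of the approximation at the end of the example.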
If F_n and F are distribution functions, then by definition, F_n converges weakly to F, written F_n ⇒ F, if

(14.10)    lim_n F_n(x) = F(x)

for each x at which F is continuous.†

† For the role of continuity, see Example 14.4.

To study the approximate distribution of a random variable Y_n it is often necessary to study instead the normalized or rescaled random variable (Y_n − b_n)/a_n for appropriate constants a_n and b_n. If Y_n has distribution function F_n and if a_n > 0, then P[(Y_n − b_n)/a_n ≤ x] = P[Y_n ≤ a_n x + b_n], and therefore (Y_n − b_n)/a_n has distribution function F_n(a_n x + b_n). For this reason weak convergence often appears in the form*

(14.11)    F_n(a_n x + b_n) ⇒ F(x).

An example of this is (14.9): there a_n = 1, b_n = α⁻¹ log n, and F(x) = e^{−e^{−αx}}.

* To write F_n(a_n x + b_n) ⇒ F(x) ignores the distinction between a function and its value at an unspecified value of its argument, but the meaning of course is that F_n(a_n x + b_n) → F(x) at continuity points x of F.

Example 14.2. Consider again the distribution function (14.8) of the maximum, but suppose that G has the form

G(x) = 0 if x < 1,    G(x) = 1 − x^{−α} if x ≥ 1,

where α > 0. Here F_n(n^{1/α}x) = (1 − n⁻¹x^{−α})ⁿ for x ≥ n^{−1/α}, and therefore

lim_n F_n(n^{1/α}x) = 0 if x ≤ 0,    lim_n F_n(n^{1/α}x) = e^{−x^{−α}} if x > 0.

This is an example of (14.11) in which a_n = n^{1/α} and b_n = 0. •

Example 14.3. Consider (14.8) once more, but for

G(x) = 0 if x < 0,    G(x) = 1 − (1 − x)^α if 0 ≤ x ≤ 1,    G(x) = 1 if x > 1,

where α > 0. This time F_n(n^{−1/α}x + 1) = (1 − n⁻¹(−x)^α)ⁿ if −n^{1/α} ≤ x ≤ 0. Therefore,

lim_n F_n(n^{−1/α}x + 1) = e^{−(−x)^α} if x ≤ 0,    lim_n F_n(n^{−1/α}x + 1) = 1 if x > 0,

a case of (14.11) in which a_n = n^{−1/α} and b_n = 1. •
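As with Example 14.1, the limits in Examples 14.2 and 14.3 are easy to check numerically. The sketch below is illustrative only: it evaluates F_n = Gⁿ at the rescaled points and compares with the limiting values e^{−x^{−α}} and e^{−(−x)^α}; the shape parameter α = 2.0 and the evaluation points are arbitrary choices.

```python
import math

ALPHA = 2.0  # arbitrary shape parameter

def G2(x):  # Example 14.2: G(x) = 1 - x^{-alpha} for x >= 1, else 0
    return 1.0 - x ** -ALPHA if x >= 1 else 0.0

def G3(x):  # Example 14.3: G(x) = 1 - (1 - x)^alpha on [0, 1]
    return 0.0 if x < 0 else (1.0 if x > 1 else 1.0 - (1.0 - x) ** ALPHA)

x = 1.5   # a point with x > 0 for Example 14.2
y = -0.5  # a point with y < 0 for Example 14.3
for n in (10, 100, 10_000):
    fn2 = G2(n ** (1 / ALPHA) * x) ** n        # F_n(n^{1/alpha} x)
    fn3 = G3(n ** (-1 / ALPHA) * y + 1) ** n   # F_n(n^{-1/alpha} y + 1)
    print(n,
          round(fn2, 4), round(math.exp(-x ** -ALPHA), 4),
          round(fn3, 4), round(math.exp(-(-y) ** ALPHA), 4))
```

In each case the rescaled F_n settles quickly onto the limit named in the example.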
Let Δ be the distribution function with a unit jump at the origin:

(14.12)    Δ(x) = 0 if x < 0,    Δ(x) = 1 if x ≥ 0.

If X(ω) ≡ 0, then X has distribution function Δ.

Example 14.4. Let X₁, X₂, ... be independent random variables for which P[X_k = 1] = P[X_k = −1] = 1/2, and put S_n = X₁ + ··· + X_n. By the weak law of large numbers,

(14.13)    lim_n P[|n⁻¹S_n| ≥ ε] = 0

for ε > 0. Let F_n be the distribution function of n⁻¹S_n. If x > 0, then F_n(x) = 1 − P[n⁻¹S_n > x] → 1; if x < 0, then F_n(x) ≤ P[|n⁻¹S_n| ≥ |x|] → 0. As this accounts for all the continuity points of Δ, F_n ⇒ Δ. It is easy to turn the argument around and deduce (14.13) from F_n ⇒ Δ. Thus the weak law of large numbers is equivalent to the assertion that the distribution function of n⁻¹S_n converges weakly to Δ.

If n is odd, so that S_n = 0 is impossible, then by symmetry the events [S_n ≤ 0] and [S_n > 0] each have probability 1/2 and hence F_n(0) = 1/2. Thus F_n(0) does not converge to Δ(0) = 1, but because Δ is discontinuous at 0, the definition of weak convergence does not require this. •
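As an illustration (not part of the text), F_n here can be computed exactly from the binomial distribution, since S_n = 2B_n − n with B_n binomial(n, 1/2). The sketch below shows F_n(x) approaching Δ(x) for x ≠ 0 while F_n(0) stays at 1/2 for odd n, exactly as in the example.

```python
from math import comb

def F_n(x, n):
    # F_n(x) = P[S_n/n <= x] with S_n = 2B - n, B ~ binomial(n, 1/2);
    # S_n/n <= x  iff  B <= n(1 + x)/2
    k_max = int(n * (1 + x) / 2 + 1e-12)  # floor, guarding against float error
    if k_max < 0:
        return 0.0
    k_max = min(k_max, n)
    return sum(comb(n, k) for k in range(k_max + 1)) / 2 ** n

for n in (11, 101, 1001):
    print(n, [round(F_n(x, n), 4) for x in (-0.2, 0.0, 0.2)])
```

The values at x = −0.2 and x = 0.2 move to 0 and 1, while the value at the discontinuity point x = 0 remains 1/2, which is why (14.10) is only required at continuity points.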
Allowing (14.10) to fail at discontinuity points x of F thus makes it possible to bring the weak law of large numbers under the theory of weak convergence. But if (14.10) need hold only for certain values of x, there arises the question of whether weak limits are unique. Suppose that F_n ⇒ F and F_n ⇒ G. Then F(x) = lim_n F_n(x) = G(x) if F and G are both continuous at x. Since F and G each have only countably many points of discontinuity,† the set of common continuity points is dense, and it follows by right continuity that F and G are identical. A sequence can thus have at most one weak limit.

Convergence of distribution functions is studied in detail in Chapter 5. The remainder of this section is devoted to some weak-convergence theorems which are interesting both for themselves and for the reason that they require so little technical machinery.

† The proof following (14.3) uses measure theory, but this is not necessary: If the saltus σ(x) = F(x) − F(x−) exceeds ε at x₁ < ··· < x_n, then F(x_i) − F(x_{i−1}) > ε (take x₀ < x₁), and so nε < F(x_n) − F(x₀) ≤ 1; hence [x: σ(x) > ε] is finite and [x: σ(x) > 0] is countable.
Convergence of Types*

* This topic may be omitted.

Distribution functions F and G are of the same type if there exist constants a and b, a > 0, such that F(ax + b) = G(x) for all x. A distribution function is degenerate if it has the form Δ(x − x₀) (see (14.12)) for some x₀; otherwise, it is nondegenerate.

Theorem 14.2. Suppose that F_n(u_n x + v_n) ⇒ F(x) and F_n(a_n x + b_n) ⇒ G(x), where u_n > 0, a_n > 0, and F and G are nondegenerate. Then there exist a and b, a > 0, such that a_n/u_n → a, (b_n − v_n)/u_n → b, and F(ax + b) = G(x).

Thus there can be only one possible limit type and essentially only one possible sequence of norming constants. The proof of the theorem is for clarity set out in a sequence of lemmas. In all of them, a and the a_n are assumed to be positive.

Lemma 1. If F_n ⇒ F, a_n → a, and b_n → b, then F_n(a_n x + b_n) ⇒ F(ax + b).

PROOF. If x is a continuity point of F(ax + b) and ε > 0, choose continuity points u and v of F so that u < ax + b < v and F(v) − F(u) < ε; this is possible because F has only countably many discontinuities. For large enough n, u < a_n x + b_n < v, |F_n(u) − F(u)| < ε, and |F_n(v) − F(v)| < ε; but then F(ax + b) − 2ε < F(u) − ε < F_n(u) ≤ F_n(a_n x + b_n) ≤ F_n(v) < F(v) + ε < F(ax + b) + 2ε. •

Lemma 2. If F_n ⇒ F and a_n → ∞, then F_n(a_n x) ⇒ Δ(x).

PROOF. Given ε choose a continuity point u of F so large that F(u) > 1 − ε. If x > 0, then for all large enough n, a_n x > u and |F_n(u) − F(u)| < ε, so that F_n(a_n x) ≥ F_n(u) > F(u) − ε > 1 − 2ε. Thus lim_n F_n(a_n x) = 1 for x > 0; similarly, lim_n F_n(a_n x) = 0 for x < 0. •

Lemma 3. If F_n ⇒ F and b_n is unbounded, then F_n(x + b_n) cannot converge weakly.

PROOF. Suppose that b_n is unbounded and that b_n → ∞ along some subsequence (the case b_n → −∞ is similar). Suppose that F_n(x + b_n) ⇒ G(x). Given ε choose a continuity point u of F so that F(u) > 1 − ε. Whatever x may be, for n far enough out in the subsequence, x + b_n > u and F_n(u) > 1 − 2ε, so that F_n(x + b_n) > 1 − 2ε. Thus G(x) = lim_n F_n(x + b_n) = 1 for all continuity points x of G, which is impossible. •

Lemma 4. If F_n(x) ⇒ F(x) and F_n(a_n x + b_n) ⇒ G(x), where F and G are nondegenerate, then

(14.14)    0 < inf_n a_n ≤ sup_n a_n < ∞,    sup_n |b_n| < ∞.

PROOF. Suppose that a_n is not bounded above. Arrange by passing to a subsequence that a_n → ∞. Then by Lemma 2, F_n(a_n x) ⇒ Δ(x), and by Lemma 3 and the hypothesis, b_n is bounded along this subsequence. Arrange by passing to a further subsequence that b_n converges to some b. Then by Lemma 1, F_n(a_n x + b_n) ⇒ Δ(x + b), and so by hypothesis G(x) = Δ(x + b), contrary to the assumption that G is nondegenerate. Thus a_n is bounded above.

If G_n(x) = F_n(a_n x + b_n), then G_n(x) ⇒ G(x) and G_n(a_n⁻¹x − a_n⁻¹b_n) = F_n(x) ⇒ F(x). The result just proved shows that a_n⁻¹ is bounded. Thus a_n is bounded away from 0 and ∞. If b_n is not bounded, pass to a subsequence along which b_n → ±∞ and a_n converges to some a. Since F_n(a_n x) ⇒ F(ax) along the subsequence by Lemma 1, b_n → ±∞ is by Lemma 3 incompatible with the hypothesis that F_n(a_n x + b_n) ⇒ G(x). •

Lemma 5. If F(x) = F(ax + b) for all x and F is nondegenerate, then a = 1 and b = 0.

PROOF. Since F(x) = F(aⁿx + (a^{n−1} + ··· + a + 1)b), it follows by Lemma 4 that aⁿ is bounded away from 0 and ∞, so that a = 1, and it then follows that nb is bounded, so that b = 0. •

PROOF OF THEOREM 14.2. Suppose first that u_n = 1 and v_n = 0. Then (14.14) holds. Fix any subsequence along which a_n converges to some positive a and b_n converges to some b. By Lemma 1, F_n(a_n x + b_n) ⇒ F(ax + b) along this subsequence, and the hypothesis gives F(ax + b) = G(x). Suppose that along some other sequence, a_n → u > 0 and b_n → v. Then F(ux + v) = G(x) and F(ax + b) = G(x) both hold, so that u = a and v = b by Lemma 5. Every convergent subsequence of {(a_n, b_n)} thus converges to (a, b), and so the entire sequence does.

For the general case, let H_n(x) = F_n(u_n x + v_n). Then H_n(x) ⇒ F(x) and H_n(a_n u_n⁻¹x + (b_n − v_n)u_n⁻¹) ⇒ G(x), and so by the case already treated, a_n u_n⁻¹ converges to some positive a and (b_n − v_n)u_n⁻¹ to some b, and as before, F(ax + b) = G(x). •
Extremal Distributions*

* This topic may be omitted.

A distribution function F is extremal if it is nondegenerate and if, for some distribution function G and constants a_n (a_n > 0) and b_n,

(14.15)    Gⁿ(a_n x + b_n) ⇒ F(x).

These are the possible limiting distributions of normalized maxima (see (14.8)), and Examples 14.1, 14.2, and 14.3 give three specimens. The following analysis shows that these three examples exhaust the possible types.

Assume that F is extremal. From (14.15) follow G^{nk}(a_n x + b_n) ⇒ F^k(x) and G^{nk}(a_{nk}x + b_{nk}) ⇒ F(x), and so by Theorem 14.2 there exist constants c_k and d_k such that c_k is positive and

(14.16)    F^k(x) = F(c_k x + d_k).

From F(c_{jk}x + d_{jk}) = F^{jk}(x) = F^j(c_k x + d_k) = F(c_j(c_k x + d_k) + d_j) follow (Lemma 5) the relations

(14.17)    c_{jk} = c_j c_k,    d_{jk} = c_j d_k + d_j = c_k d_j + d_k.

Of course, c₁ = 1 and d₁ = 0. There are three cases to be considered separately.

CASE 1. Suppose that c_k = 1 for all k. Then

(14.18)    F^k(x) = F(x + d_k).

This implies that F^{j/k}(x) = F(x + d_j − d_k). For positive rational r = j/k, put δ_r = d_j − d_k; (14.17) implies that the definition is consistent, and F^r(x) = F(x + δ_r). Since F is nondegenerate, there is an x such that 0 < F(x) < 1, and it follows by (14.18) that d_k is decreasing in k, so that δ_r is strictly decreasing in r. For positive real t let φ(t) = inf_{0<r<t} δ_r (r rational in the infimum). Then φ(t) is decreasing in t and

(14.19)    F^t(x) = F(x + φ(t))

for all x and all positive t. Further, (14.17) implies that φ(st) = φ(s) + φ(t), so that by the theorem on Cauchy's equation [A20] applied to φ(e^x), φ(t) = −β log t, where β > 0 because φ(t) is strictly decreasing. Now (14.19) with t = e^{x/β} gives F(x) = exp{e^{−x/β} log F(0)}, and so F must be of the same type as

(14.20)    F₁(x) = exp(−e^{−x}).

Example 14.1 shows that this distribution function can arise as a limit of distributions of maxima; that is, F₁ is indeed extremal.

CASE 2. Suppose that c_{k₀} ≠ 1 for some k₀, which necessarily exceeds 1. Then there exists an x′ such that c_{k₀}x′ + d_{k₀} = x′; but (14.16) then gives F^{k₀}(x′) = F(x′), so that F(x′) is 0 or 1. (In Case 1, F has the type (14.20) and so never assumes the values 0 and 1.) Now suppose further that, in fact, F(x′) = 0.

Let x₀ be the supremum of those x for which F(x) = 0. By passing to a new F of the same type one can arrange that x₀ = 0; then F(x) = 0 for x ≤ 0 and F(x) > 0 for x > 0. The new F will satisfy (14.16), but with new constants d_k. If a (new) d_k is distinct from 0, then there is an x near 0 for which the arguments on the two sides of (14.16) have opposite signs. Therefore, d_k = 0 for all k, and

(14.21)    F^k(x) = F(c_k x)

for all k and x. This implies that F^{j/k}(x) = F(xc_j/c_k). For positive rational r = j/k, put γ_r = c_j/c_k. The definition is by (14.17) again consistent, and F^r(x) = F(γ_r x). Since 0 < F(x) < 1 for some x, necessarily positive, it follows by (14.21) that c_k is decreasing in k, so that γ_r is strictly decreasing in r. Put ψ(t) = inf_{0<r<t} γ_r for positive real t. From (14.17) follows ψ(st) = ψ(s)ψ(t), and by the corollary to the theorem on Cauchy's equation [A20] applied to ψ(e^x), ψ(t) = t^{−ξ} for some ξ > 0. Since F^t(x) = F(ψ(t)x) for all x and for t positive, F(x) = exp{x^{−1/ξ} log F(1)} for x > 0. Thus (take α = 1/ξ) F is of the same type as

(14.22)    F_{2,α}(x) = 0 if x ≤ 0,    F_{2,α}(x) = exp(−x^{−α}) if x > 0.

Example 14.2 shows that this case can arise.

CASE 3. Suppose as in Case 2 that c_{k₀} ≠ 1 for some k₀, so that F(x′) is 0 or 1 for some x′, but this time suppose that F(x′) = 1. Let x₁ be the infimum of those x for which F(x) = 1. By passing to a new F of the same type, arrange that x₁ = 0; F(x) < 1 for x < 0 and F(x) = 1 for x ≥ 0. If d_k ≠ 0, then for some x near 0, one side of (14.16) is 1 and the other is not. Thus d_k = 0 for all k, and (14.21) again holds. And again γ_{j/k} = c_j/c_k consistently defines a function satisfying F^r(x) = F(γ_r x). Since F is nondegenerate, 0 < F(x) < 1 for some x, but this time x is necessarily negative, so that c_k is increasing. The same analysis as before shows that there is a positive ξ such that F^t(x) = F(t^ξ x) for all x and for t positive. Thus F(x) = exp{(−x)^{1/ξ} log F(−1)} for x < 0, and F is of the type F_{3,α}, where

(14.23)    F_{3,α}(x) = exp(−(−x)^α) if x ≤ 0,    F_{3,α}(x) = 1 if x > 0.

Example 14.3 shows that this distribution function is indeed extremal.

This completely characterizes the class of extremal distributions:

Theorem 14.3. The class of extremal distribution functions consists exactly of the distribution functions of the types (14.20), (14.22), and (14.23).
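The three types are max-stable: each satisfies (14.15) with G = F itself for suitable a_n and b_n, which is one way to see directly that they are extremal. The sketch below is an illustration, not part of the text; it checks numerically the exact identities F₁ⁿ(x + log n) = F₁(x), F_{2,α}ⁿ(n^{1/α}x) = F_{2,α}(x), and F_{3,α}ⁿ(n^{−1/α}x) = F_{3,α}(x), with the arbitrary choices α = 2.0, x = 0.7.

```python
import math

ALPHA = 2.0  # arbitrary shape parameter

def F1(x):                       # (14.20)
    return math.exp(-math.exp(-x))

def F2(x):                       # (14.22)
    return math.exp(-x ** -ALPHA) if x > 0 else 0.0

def F3(x):                       # (14.23)
    return math.exp(-(-x) ** ALPHA) if x <= 0 else 1.0

x = 0.7
for n in (2, 10, 1000):
    print(n,
          round(F1(x + math.log(n)) ** n - F1(x), 12),
          round(F2(n ** (1 / ALPHA) * x) ** n - F2(x), 12),
          round(F3(n ** (-1 / ALPHA) * (-x)) ** n - F3(-x), 12))
```

Each printed difference is zero up to floating-point error, reflecting the fact that the identities hold exactly, not just in the limit.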
It is possible to go on and characterize the domains of attraction. That is, it is possible for each extremal distribution function F to describe the class of G satisfying (14.15) for some constants a_n and b_n, the class of G attracted to F.†

† This theory is associated with the names of Fisher, Fréchet, Gnedenko, and Tippett. For further information, see GALAMBOS.

PROBLEMS

14.1. The general nondecreasing function F has at most countably many discontinuities. Prove this by considering the open intervals (sup_{u<x} F(u), inf_{v>x} F(v)); each nonempty one contains a rational.

14.2. For distribution functions F, the second proof of Theorem 14.1 shows how to construct a measure μ on (R¹, ℛ¹) such that μ(a, b] = F(b) − F(a). (a) Extend to the case of bounded F. (b) Extend to the general case. Hint: Let F_n(x) be −n or F(x) or n as F(x) < −n or −n ≤ F(x) ≤ n or n ≤ F(x). Construct the corresponding μ_n and define μ(A) = lim_n μ_n(A).

14.3. Suppose that X assumes positive integers as values. Show that P[X > n + m | X > n] = P[X > m] if and only if X has a Pascal distribution: P[X = n] = q^{n−1}p, n = 1, 2, .... This is the discrete analogue of the exponential distribution.
]00
14.4.
MEASURE
(a) Suppose that X has a continuous, strictly increasing distribution
function F. Show that the random variable F( X) is uniformly distributed over the unit interval in the sense that P[ F( X) < u ] == u for 0 � u < 1. Passing from X to F( X) is called the probability transformation. (b) Show that the function q>( u ) defined by ( 1 4.5) satisfies F( q>( u ) - ) < u < F( q> ( u)) and that, if F is continuous (but not necessarily strictly increas ing), then F( q>( u)) == u for 0 < u < 1. (c) Show that P[ F( X) < u] == F( q> ( u ) - ) and hence that the result in part (a) holds as long as F is continuous.
14.5.
f Let C be the set of continuity points of F. (a) Show that for every Borel set A , P[ F( X) e A , X e C] is at most the Lebesgue measure of A. (b) Show that if F is continuous at each point of F" 1A , then P[ F( X) e A ] is at most the Lebesgue measure of A.
14.6.
A distribution function F can be described by its derivative or density f = F', provided that f exists and can be integrated back to F. One thinks in terms of the approximation
( 14.24)
P[ x < X < x
+
d.x ]
=
f( x ) dx.
(a) Show that, if X has distribution function with density f and g is
14.7.
increasing and differentiable, then the distribution function of g( X) has density /(g - 1 (x))/g'(g - 1 (x)). (b) Check part (a) of Problem 14.4 under the assumption that F is differentiable. (a) Suppose that X1 , , Xn are independent and that each has distribu tion function F. Let � kJ be the k th order statistic, the k th smallest among the X;. Show that � k > bas distribution function •
•
•
-
n
Gk ( X ) == E k
u-
( : ) ( F( X ) )
u
( 1 - F( X ) )
n-u
.
(b) Assume that F has density f and show that Gk has density
;:
F( x )) k - I /( x ) ( 1 - F( x )) " - k ( ( k - 1) n - k ) !
14.8.
•
Interpret this in terms of (14.24). Hint: Do not differentiate directly. Let A be the event that k - 1 of the X; lie in ( - oo , x], one lies in (x, x + h ], and n - k lie in (x, oo) . Show that P[x < � k ) < x + h ] - P (A ) is non negative and does not exceed the probability that two of the X, lie in (x, x + h ]. Estimate the latter probability, calculate P(A ) exactly, and let h go to 0. Show for every distribution function F that there exist distribution func tions F, such that F, � F and (a) F, is continuous; (b) F, is constant over each interval (( k - 1) n - 1 , kn - 1 ], k 0, ± 1, ± 2, . . . . =
201 14.9. The Levy distance d( F, G) between two distribution functions is the in fimum of those £ such that G(x - £) - £ < F( x) < G(x + £) + £ for all x. Verify that this is a metric on the set of distribution functions. Show that a necessary and sufficient condition for F, � F is that d( F, , F) --+ 0. 14.10. 12.3 f A Borel function satisfying Cauchy's equation [A20] is automati cally bounded in some interval and hence satisfies l(x) xl(1) Hint: Take K large enough that A[x: x > 0, ll(x) l < K ] > 0. Apply Problem 12.3 and conclude that I is bounded in some interval to the right of 0. 14. 1 1. f Consider sets S of reals that are linearly independent over the field of rationals in the sense that n 1 x1 + · · · + n k xk == 0 for distinct points X; in S and integers n; (positive or negative) is impossible unless n; 0. (a) By Zorn's lemma find a maximal such S. Show that it is a Hamel basis. That is, show that each real x can be written uniquely as x == n 1 x 1 + · · · + n k xk for distinct points X; in S and integers n; . (b) Define I arbitrarily on S and define it elsewhere by I( n 1 x 1 + · · · + n k xk ) == n 1 l(x 1 ) + · · · + n k l(xk ) Show that I satisfies Cauchy's equation but .need not satisfy l(x) = xl(1). (c) By means of Problem 14.10 give a new construction of a nonmeasur able function and a nonmeasurable set. SECTION
14 .
DISTRIBUTION FUNCTIONS
=
.
=
.
14.12. 14.9 f
(a) Show that if a distribution function F is everywhere continu
ous, then it is uniformly continuous. (b) Let BF (£) == sup[ F(x) - F(y): l x - Y l < £] be the modulus of con tinuity of F. Show that d( F, G) < £ implies that supx I F( x) - G ( x) I < £ + BF ( £). (c) Show that, if F, � F and F is everywhere continuous, then F, (x) --+ F( x) uniformly in x. What if F is continuous over a closed interval? 14.13. Show that (14.22) and (14.23) are everywhere infinitely differentiable, al though not analytic.
CHAPTER
3
Integration
SECTION 15. THE INTEGRAL
Expected values of simple random variables and Riemann integrals of continuous functions can be brought together with other related concepts under a general theory of integration, and this theory is the subject of the present chapter. Definition
Throughout this section, f, g, and so on will denote real measurable functions, the values ± oo allowed, on a measure space (0, ofF, p.). The object is to define and study the definite integral
Suppose first that f is nonnegative. For each finite decomposition of D into �sets, consider the sum
( 15 .1 ) In computing the products here, the conventions about infinity are
( 15 .2 ) 202
{ X0
. 00 = 00 .
0 = 0,
00 = 00 · X = 00 if 0 < X < 00 , 00 . 00 = 00 . •
{ A; }
SECTION
15.
203
THE INTEGRAL
The reasons for these conventions will become clear later. Also in force are the conventions of Section 10 for sums and limits involving infinity; see (10.3) and (10.4). If A; is empty, the infimum in (15 . 1) is by the standard convention oo ; but then p.(A;) = 0, so that by the convention (15.2), this term makes no contribution to the sum (15.1). The integral of f is defined as the supremum of the sums (15.1): (15 .3) The supremum here extends over all finite decompositions JEsets. For general /, consider its positive part,
its
(15 .5)
of 0 into
if 0 < / ( w ) < 00 ' if - oo < / ( w ) < 0
(15 .4) and
{ A; }
negative part, if - 00 < / ( w ) < 0, if 0 < / ( w ) < 00 .
These functions are nonnegative and measurable, and f = j+ - f- . The general integral is defined by ( 15 .6) unless f j+ dp. = f /- dp. = oo, in which case f has no integral. If f j+ dp. and f /- dp. are both finite, f is integrable, or integrable p., or summable, and has (15.6) as its definite integral. If f j+ dp. = oo and f 1- dp. < oo , f is not integrable but is in accordance with (15.6) assigned + oo as its definite integral. Similarly, if f j dp. < oo and f /- dp. = oo, f is not integrable but has definite integral - oo. Note that f can have a definite integral without being integrable; it fails to have a definite integral if and only if its positive and negative parts both have infinite integrals. The really important case of (15.6) is that in which f j+ dp. and f f- dp. are both finite. Allowing infinite integrals is a convention that simplifies the statements of various theorems, especially theorems involving nonnegative functions. Note that (15.6) is defined unless it involves "oo - oo; " if one term on the right is oo and the other is a finite real x, the difference is defined by the conventions oo - x = oo and x - oo = - oo.
204
INTEGRATION
The extension of the integral from the nonnegative case to the general case is consistent: (15.6) agrees with (15.3) if f is nonnegative because then
!-= 0.
Nonnegative Functions
It is convenient first to analyze nonnegative functions.
If f = E;x ; IA is a nonnegative simple function, { A; } being a finite decomposition of 0 into �sets, then f fdp. = E;x ;P. ( A;). (ii) If 0 � /( w ) � g(w) for all w, then f fdp. < f gdp.. (iii) If 0 � /n (w) j /( w) for all w, then 0 < f fdp. i f fdp.. (iv) For nonnegative functions f and g and nonnegative constants a and {J, j ( af + {Jg) dp. = af fdp. + {Jf g dp..
Theorem 15.1
(i)
I
In part (iii) the essential point is that f fdp. = lim n f In dp., and it is important to understand that both sides of this equation may be oo . If In = /A and f = /A , where A n i A, the conclusion is that p. is continuous from below (Theorem 10.2(i)): li m n p.(A n ) = p.(A); this equation often takes the form oo = oo . II
(i). Let { B1 } be a finite decomposition of 0 and let P1 be the infimum of f over B1 . If A ; n B1 ¢ 0, then {J1 s x;; therefore, E1P1p.(B1 ) = E;1p1p.((A ; n B1 ) s E;1x;p.(A; n B1 ) = E;x;p.(A;) · On the other hand, there • is equality here if { B1 } coincides with { A; }. PROOF OF
PROOF OF
by g.
(ii). The sums (15.1) obviously do not decrease if f is replaced
•
(iii) . By (ii) the sequence f In dp. is nondecreasing and bounded above by f f dp.. It therefore suffices to show that f f dp. S lim n f In dp., or that PROOF OF
m lim n /fn dfl > S = L V;fl ( A; )
( 15.7 )
i- 1
is any decomposition of 0 into �sets and V; = inf w e A f(w) . In order to see the essential idea of the proof, which is quite simple , suppose first that S is finite and all the v; and p.(A;) are positive and finite . Fix an £ that is positive and less than each v; , and put A;n = [w E A ;: fn ( w) > V; £]. Since In i /, A ;n i A;. Decompose 0 into A1 n , . . . , A m n and if
A1,
•
•
•
, Am -
I
SECTION
15.
THE INTEGRAL
the complement of their union, and observe that
( 15 .8 )
jf, dfl > iE- 1 ( v; - £ ) #' ( A ; n ) -+ iE= l ( v; - £ ) �L ( A; ) m
m
m
= s - £ E #' ( A ; ) . i- 1
Since the #' ( A ; ) are all finite, letting £ -+ 0 gives (15.7). Now suppose only that S is finite. Each product V; #L( A ; ) is then finite; suppose it is positive for i < m 0 and 0 for i > m 0• (Here m 0 � m; if the product is 0 for all i, then S = 0 and (15.7) is trivial.) Now v; and #'(A ; ) are positive and finite for i < m 0 (one or the other may be oo for i > m 0 ). Define A ; n as before, but only for i < m 0• This time decompose 0 into .A1 n , . . . , A m o n and the complement of their union. Replace m by m0 in (15.8) and complete the proof as before. Finally, suppose that S = oo . Then V;0 #'(A;0 ) = oo for some i0, so that v10 and #' ( A ; ) are both positive and at least one is oo . Suppose 0 < x < V; 0 0 � oo and 0 < y < #'(A; ) � oo, and put A; n = [w E A; : fn (w) > x]. Since 0 0 0 /,. f /, A;0 n i A;0 ; hence #'(A;0 n ) > y for n exceeding some n 0• But then (decompose 0 into A;0 n and its complement) f In dl' > X�L ( A ;0 n ) > xy for n > n 0 , and therefore li m n / In dl' > xy. If V;0 = oo, let x -+ oo , and if p( A;0 ) = oo, let y -+ oo . In either case (15.7) follows: lim n f In dl' = oo . • PROOF OF (iv). Suppose at first that f = E ; x ; IA and g = Ej.y./8 are j simple. Then a/ + {Jg = E ;1 (ax; + {Jy1 )IA , n B1 , and so J
I
j( af + fJg ) dfl = L. ( ax; + fJyj ) ll ( A; n Bi ) I)
j
= a E X ;fl ( A; ) + fJ LYifl ( Bi ) = a fdfl + I
}
fJ Jg dfl .
Note that the argument is valid if some of a , {J , X;, y1 are infinite. Apart from this possibility, the ideas are as in the proof of (5.18). For general nonnegative f and g, there exist by Theorem 13.5 simple functions In and Bn such that 0 S In i f and 0 < Bn i g . But then 0 < a.fn + /lg,. i af + {J g and f ( afn + {Jgn ) dl' = a f In dl' + {Jf 8n d#', so that (iv) fol lows via (iii) . •
15.1, the expected values of simple random Variables in Chapter 1 are integrals: E [ X] = f X( w)P( d w) . This also covers By part (i) of Theorem
206
INTEGRATION
the step functions in Section 1 (see (1.5)). The relation between the Riemann integral and the integral as defined here will be studied in Section 17.
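Before turning to examples, here is a small illustrative sketch, not from the text, of how the sums (15.1) behave: for f(x) = x² on Ω = (0, 1] with Lebesgue measure it evaluates (15.1) for the level-set decompositions A_{n,k} = [ω: (k−1)2^{−n} ≤ f(ω) < k2^{−n}], whose values increase toward ∫f dλ = 1/3 in accordance with (15.3) and Theorem 15.1(iii).

```python
import math

def lower_sum(n):
    # the sum (15.1) for the decomposition by the level sets
    # A_{n,k} = [x in (0,1]: (k-1)/2^n <= x^2 < k/2^n], k = 1, ..., 2^n,
    # on which inf f = (k-1)/2^n; the measure of A_{n,k} under Lebesgue
    # measure is sqrt(min(k/2^n, 1)) - sqrt(min((k-1)/2^n, 1))
    total = 0.0
    for k in range(1, 2 ** n + 1):
        lo, hi = (k - 1) / 2 ** n, k / 2 ** n
        measure = math.sqrt(min(hi, 1.0)) - math.sqrt(min(lo, 1.0))
        total += lo * measure
    return total

for n in (2, 4, 8, 12):
    print(n, round(lower_sum(n), 6))   # increases toward 1/3
```

The successive values are nondecreasing and converge to 1/3, which is the supremum (15.3) in this case.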
Consider the line ( R 1 , 91 1 , A) with Lebesgue measure. Suppose that - oo < a 0 � a 1 S · · · < a m < oo and let f be the function with nonnegative value X; on (a; _ 1 , a;], i = 1, . . . , m , and value 0 on ( - oo, a 0 ] and ( a m , oo). By part (i) of Theorem 15.1, f fdA = E7' 1 x;(a,. a ; _ 1 ) because of the convention 0 · oo = O-see (15.2). If the "area under the curve" to the left of a 0 and to the right of am is to be 0, this convention is inevitable. It must then work the other way as well : oo · 0 = 0, so that f fdA = 0 if f is oo at a single point (say) and 0 elsewhere. If f = I( a , oo ) ' the area-under-the-curve point of view makes f fdp. = oo natural. Hence the second convention in (15.2), which also requires that the • integral be infinite if f is oo on a nonempty interval and 0 elsewhere. Example 15. 1.
Recall that
almost everywhere means outside a set of measure 0.
Theorem 15.2. Suppose that f and g are nonnegative. (i) Iff = 0 almost everywhere, then f fdp. = 0. (ii) If p.(w: /(w) > 0] > 0, then f fdp. > 0.
If f I dp. < oo, then f < oo almost everywhere. Iff < g almost everywhere, then ffdp. < f g dp.. If I = g almost everywhere, then f fdp. = f gdp.. Suppose that f = 0 almost everywhere. If A; meets [ w: /( w ) = 0], PROOF. then the infimum in (15.1) is 0; otherwise, p. ( A ; ) = 0. Hence each sum (15.1) is 0, and (i) follows. If A = [w: /(w) > £], then A( i [w: /(w) > 0] as £ � 0, so that under the hypothesis of (ii) there is a positive £ for which p. ( A( ) > 0. Decomposing 0 into A ( and its complement shows that f fdp. > £p.(A() > 0. If p.[/ = oo ] > 0, decompose D into [/ = oo ] and its complement: f fdp. � oo · p.[/ = oo] = oo by the conventions. Hence (iii) . To prove (iv), let G = [/ < g]. For any finite decomposition { A 1 , , A m } (iii) (iv) (v)
(
•
of 0,
[ ] p. ( A ; ) = E [ in�I ] p. ( A ; n G )
E inf I A,
A,
s
[
E inf
A, n G
]
•
•
I p. ( A ; n G )
SECTION
15.
207
THE INTEGRAL
where the last inequality comes from a consideration of the decomposition A 1 n G, . . . , A m n G, Gc. This proves (iv), and (v) follows immediately. • Suppose that I = g almost everywhere, where I and g may not be nonnegative. If I has a definite integral, then since I+ = g + and 1- = g almost everywhere, it follows by Theorem 15 . 2(v) that g also has a definite integral and f l dp. = f g dp. . Uniqueness Although there are various ways to frame the definition of the integral, they are all equivalent- they all assign the same value to lldp. . This is because the integral is uniquely determined by certain simple properties it is natural to require of it. It is natural to want the integral to have properties (i) and (iii) of Theorem 15.1. But these uniquely determine the integral for nonnegative functions: For I nonnega tive, there exist by Theorem 13.5 simple functions In such that 0 < In f I; by (iii) , I Idp, must be limn I In" dp. , and (i) determines the value of each I In dp.. Property (i) can itself by derived from (iv) (linearity) together with the assumption that I lA dp. == p. ( A ) for indicators /A : I (E; x; IA ) dp. == E ; x; l lA dp. == E; X;p. ( A ; ). If (iv) of Theorem 15.1 is to persist when the integral is extended beyond the class of nonnegative functions, I1 dp. must be I
I
PROBLEMS These problems outline alternative definitions of the integral. In all of them I is assumed measurable and nonnegative. Call (15.3) the lower integral and write it as
{15 .9) to
distinguish it from the upper integral
(15 .10) The infimum in (15.10), like the supremum in (15.9), extends over all finite partitions { A; } of Q into Jli:sets. 15.1.
Show that I * ldp. == oo if p. [ w : l( w ) > 0] == oo or if p. [ w : l( w ) > x ] all (finite) x .
As there are many I 's of this last kind that ought to be integrable,
>
0 for
(15.10) is
inappropriate as a definition of Ildp,. The only trouble with (15.10), however, is that
INTEGRATION
it treats infinity incorrectly. To see this, and to focus on essentials, assume in the following problems that p.(O) < oo and that I is bounded and nonnegative. 15.2.
f (a) Show that
� [��/("')] I'( A, ) < 7 [���/("') ] I'( B; )
{ B1 } refines { A ; }. Prove a dual relation for the sums in (15.10) and conclude that I * ldp. s I * ldp.. (b) Suppose that M bounds I and consider the partition A ; == [ w : i£ � if
I< w ) < ( i + 1)£], i that
0, . . . , N, where N is large enough that N£ > M. Show
=
Conclude that
( 15.11 ) To define the integral as the common value in (15.11) is the Darboux- Young approach. The advantage of (15.3) as a definition is that it applies at once to unbounded I and infinite p.. 15.3. 3.2 15.2 f For A c Q, define p.*(A) and p. .( A ) by (3.9) and (3.10) with p. in place of P. Suspend for the moment the assumption that I is measurable and show that I * lA dp. == p.*( A ) and I * lA dp. == p. .( A ) for every A . Hence (15.11) can fail if I is not measurable. (Where was measurability used in the proof of (15.11)?) Even though the definition (15.3) makes formal sense for all nonnegative functions, it is thus idle to apply it to nonmeasurable ones. Assume again that I is measurable (as well as nonnegative and bounded, and that p. is finite). 15.4.
f Show that for positive £ there exists a finite partition { A } such that , if { � } is any finer partition and w1 e � , then ;
Jfd#L - 'L, /( "'i ) #L ( B; )
< f.
j
15.5.
f
Show that fd" == lim n
f
•
n 2" k -
L
k-l
2"
1
[P. "'
:
k-
2"
1
< t< "'>
k
< 2"
]
·
The limit on the right here is Lebesgue 's definition of the integral.
SECTION
15.6.
16.
PROPERTIES OF THE INTEGRAL
209
↑ Suppose that the integral is defined for simple functions by ∫(Σ_i x_i I_{A_i}) dμ = Σ_i x_i μ(A_i). Suppose that f_n and g_n are simple and nondecreasing and have a common limit: 0 ≤ f_n ↑ f and 0 ≤ g_n ↑ f. Adapt the arguments used to prove Theorem 15.1(iii) and show that lim_n ∫f_n dμ = lim_n ∫g_n dμ. Thus ∫f dμ can (Theorem 13.5) consistently be defined as lim_n ∫f_n dμ for simple functions for which 0 ≤ f_n ↑ f.

SECTION 16. PROPERTIES OF THE INTEGRAL
Equalities and Inequalities
f is that f f+ dp. and ff- dp. both be finite, which is the same as the requirement that f /+ dp. + Jr dp. < oo and hence is the same as the requirement that / ( /+ + f- ) dp. < oo (Theorem 15.1(iv)). Since /+ + f+ = 1/1 , f is integrable if and only if
By definition, the requirement for integrability of
jill dp. < 00 .
( 16.1)
l g l almost everywhere and g is integrable, then integrable as well. If p.(D) < oo , a bounded f is integrable. It follows that if
11leorem
1/ 1
16.1 (i)
everywhere, then
<
f is
Monotonicity : Iff and g are integrable and/ s g almost
jIdp. s jg dp. .
( 16 .2 )
Linearity: Iff and g are integrable and a, P are finite real numbers, then af + pg is integrable and (ii)
j ( af + Pg ) dp. = a jfdp. + P jg dp..
( 1 6.3) PROOF
OF
PROoF
OF
(i). For nonnegative f and g such that f < g almost everywhere, (16.2) follows by Theorem 15.2(iv). And for general f and g, if f < g almost everywhere, then j+ < g + and /- > g - almost everywhere, and so (16.2) follows by the definition (15.6). • (ii). First,
af + pg is integrable because, by Theorem 15.1,
j Ia/ + Ps i dp. s j (Ia I · Ill + I P I · l s i ) dp. = I a I j Ill dp. + I P I j l s i dp. < oo .
210
INTEGRATION
By Theorem 15.1(iv) and the definition (15.6), f (a/) dp. = af f dp.-consider separately the cases a > 0 and a < 0. Therefore, it is enough to check (16.3) for the case a = P = 1. By definition, ( / + g ) + - ( / + g ) - = f + g = f+ 1- + g + - g- and therefore ( I + g ) + + /- + g - = ( I + g ) - + /+ + g + . All these functions being nonnegative, f ( f + g ) + dp. + f /- dp. + f g - dp. f ( / + g ) - dp. + f /+ dp. + f g + dp., which can be rearranged to give f ( / + g) + dp. - f ( I + g ) - dp. = f /+ dp. - f /- dp. + f g + dp. - f g - dp.. But this reduces to (16.3). • =
Since - 1/1
<
f < 1/1 , it follows by Theorem 16.1 that
(16 .4) for integrable f. Applying this to integrable f and g gives (16 .5) Suppose that 0 is countable, that � consists of all the subsets of D, and that p. is counting measure: each singleton has measure 1. To be definite, take D = { 1, 2, . . . }. A function is then a sequence x1, x 2 , If xnm is xm or 0 as m s n or m > n , the function corresponding to xn l ' Xn 2 ' . . . has integral E:, _ l xm by Theorem 15.1(i) (consider the decom position { 1 }, . . . , { n }, { n + 1, n + 2, . . . }). It follows by Theorem 15.1(iii) that in the nonnegative case the integral of the function given by { x m } is the sum E mxm (finite or infinite) of the corresponding infinite series. In the general case the function is integrable if and only if E: _ 1 1 x m 1 is a convergent infinite series, in which case the integral is E:_ 1 x� - E:_ 1 x� . The function xm = ( - 1) m + l m - 1 is not integrable by this definition and even fails to have a definite integral, since E:_ 1 x� = E:_ 1 x m = oo . This invites comparison with the ordinary theory of infinite series, according to which the alternating harmonic series does converge in the sense that lim M E !'_ 1 ( - 1) m + l m - 1 = log 2. But since this says that the sum of the first M terms has a limit, it requires that the elements of the space 0 be ordered. If D consists not of the positive integers but, say, of the integer lattice points in 3-space, it has no canonical linear ordering. And if E mxm is to have th e same finite value no matter what the order of summation, the series must be absolutely convergent.t This helps to explain why f is defined to be • integrable only if f /+ dp. and f f- dp. are both finite. Example 16.1.
•
t RUDIN , p. 76.
•
•
•
16. PROPERTIES OF THE INTEGRAL 211 Example 16.2. In connection with Example 15.1, consider the function I = 3/( a , oo ) - 2 /( - oo , a ) · There is no natural value for f fdA (it is "oo - oo " ), and none is assigned by the definition. If a function f is bounded on bounded intervals, then each function I, = fl( - n , n > is integrable with respect to A. Since f = lim n fn , the limit of I In dA, if it exists, is sometimes called the " principal value" of the integral of f. Although it is natural for some purposes to integrate symmetrically about the origin, this is not the right definition of the integral in the context of general measure theory. The functions gn = fl( - n , n + 1 > for example also converge to f, and f gn dA may have some other limit, or none at all; f(x) = x is a case in point. There is no general reason why In should take precedence over gn . As in the preceding example, I = Er_ 1 ( - 1) kk - 1I( k , k + 1 has no integral, > even though the f In dA above converge. • SECTION
Integration to the Limit
The first result, the Theorem 15.1(iii).
monotone convergence theorem, essentially restates
If 0 S In i f almost everywhere, then f In dp. i f / dp. . If 0 < In j f on a set A with p.(Ac) = 0, then 0 < fn/A i f/A holds
1beorem 16.2. PROOF.
everywhere, and it follows by Theorem 15.1(iii) and the remark following Theorem 15.2 that f In dp. = fIn/A dp. i f f/A dp. = f fdp.. • As the functions in Theorem 16.2 are nonnegative almost everywhere, all the integrals exist. The conclusion of the theorem is that lim n f In dp. and I Idp. are both infinite or both finite and in the latter case are equal. Consider the space { 1, 2, . . . } together with counting measure, as in Example 16.1. If for each m, 0 < x n m j x m as n --+ oo, then lim n E m x n m = E m x m , a standard result about infinite series. • Example 16.3.
If p. is a measure on � and �0 is a a-field contained in S', then the restriction p.0 of p. to �0 is another measure (Example 10.4). If / == /A and A e �0, then Example 16.4.
the common value being p. ( A ) = p. 0 ( A ). The same is by linearity true for
212
INTEGRATION
nonnegative simple functions measurable �0• It holds by Theorem 16.2 for all nonnegative I measurable j&'0 because (Theorem 13.5) 0 < In i I for simple functions In measurable j&'0• For functions measurable �0, integra tion with respect to p. is thus the same thing as integration with respect to •
P. o·
In this example a property was extended by linearity from indicators to nonnegative simple functions and thence to the general nonnegative func tion by a monotone passage to the limit. This is a technique of very frequent application. For a finite or infinite sequence of measures P.n on j&', p.(A) = LnP.n ( A) defines another measure (countably additive because [A27] sums can be reversed in a nonnegative double series). For indicators I, Example 16.5.
and by linearity the same holds for simple I > 0. If 0 < lk i I for simple lk, then by Theorem 16.2 and Example 16.3, f I dp. = lim k f lk dp. lim k En f lk dp.n = En lim k f lk dp.n = En f ldP.n · The relation in question thus holds for all nonnegative 1. • =
An important consequence of the monotone convergence theorem is
Fatou 's lemma: Theorem 16.3.
For nonnegative In,
If gn = infk 'il!:. nlk , then 0 < gn j g = lim infnln , and the preceding two theorems give f In dp. � f 8n dp. -+ f g dp.. • PROOF.
.
On ( R1 , !1/ 1 , A), In = n 2/(o, n -t ) and I = 0 satisfy ln ( x ) -+ l ( x ) for each x, but / IdA = 0 and fln d A = n -+ oo . This shows that the inequality in (16.6) can be strict and that it is not always possible to integrate to the limit. This phenomenon has been encountered before; see • Examples 5.6 and 7.7. EXIIIple II 16. 6.
16. PROPERTIES OF THE INTEGRAL 213 Fatou' s lemma leads to Lebesgue 's dominated convergence theorem: SECTION
f
If Ifn i s g almost everywhere, where g is integrable, and if In f almost everywhere, then f and the In are integrable and fIn dp. f fdp.. PROOF. From the hypothesis 1/n l s g it follows that all the In as well as I * = lim supn In and f * = lim inf In are integrable. Since g + In and g - In neorem 16.4. -+
-+
are nonnegative, Fatou' s lemma gives
and
.S:
j
j
j
lim inf In dp. . n ( g - In ) dp. = g dp. - lim sup n
Therefore (16 . 7) .S:
j
j
lim sup In dp. .S: lim sup In dp. . n n
The only assumption used thus far is that the In are dominated by an integrable g. If In f almost everywhere, the extreme terms in (16.7) agree • with f fdp.. -+
Example 16.6 shows that this theorem can fail if no dominating EXIIIple II 16. 7.
g exists.
The Weierstrass M-test for series. Consider the space
{ 1, 2, . . . } together with counting measure, as in Example 16.1. If l x n m l < M, and E m Mm < oo, and if lim n x n m = x m for each m, then lim n E m x n m = E,.. x m . This follows by an application of Theorem 16.4 with the function given by the sequence M1 , M2 , in the role of g. This is another standard • result on infinite series [A28]. •
•
•
214
INTEGRATION
The next result, the bounded convergence theorem, is a special case of Theorem 16.4. It contains Theorem 5.3 as a further special case.
If p.(D) < oo and the In are uniformly bounded, then In --+ f almost everywhere implies fIn dp. --+ f Idp.. Theorem 16.5.
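As a concrete illustration, not from the text, of the bounded convergence theorem: take Ω = (0, 1] with Lebesgue measure and f_n(x) = xⁿ. The f_n are uniformly bounded by 1 and converge to 0 except at the single point x = 1, and ∫f_n dλ = 1/(n + 1) → 0 = ∫ lim_n f_n dλ. The sketch below checks this with a midpoint sum as a stand-in for the integral.

```python
def integral_xn(n, grid=100_000):
    # midpoint sum approximating the integral of x^n over (0, 1];
    # the exact value is 1/(n + 1)
    h = 1.0 / grid
    return sum(((i + 0.5) * h) ** n for i in range(grid)) * h

for n in (1, 5, 50, 500):
    print(n, round(integral_xn(n), 6), round(1 / (n + 1), 6))
```

The integrals go to 0 along with the limit function's integral, exactly as Theorem 16.5 guarantees.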
The next two theorems are simply the series versions of the monotone and the dominated convergence theorems. Theorem 16.6.
If fn > 0, then J Enfn dp. = En / fn dp..
The members of this last equation are both equal either to oo or to the same finite, nonnegative real number.
If En fn converges almost everywhere and IE� == tfk l < g almost everywhere, where g is integrable, then En fn and the In are integrable and f En ln dp. = En fIn dp.. Theorem 16.7.
If Enf Ifni dp. < oo , then Enfn converges absolutely almost ev erywhere and is integrable, and f En fn dp. = En fIn dp.. PROOF. The function g = En Ifni is integrable by Theorem 16.6 and is finite almost everywhere by Theorem 15.2(iii). Hence Enl/nl and En/n converge • almost everywhere and Theorem 16.7 applies. Corollary.
In place of a sequence { In } of real measurable functions on ( Q, !F, p. ) , consider a family [/,: 1 > 0] indexed by a continuous parameter 1. Suppose of a measurable f that ( 16.8) on a set A , where { 16.9)
lim /,( w ) == f( w )
t� oo
A e !F ,
A technical point arises here, since holds:
p. ( O - A ) == O. !F
need not contain the w-set where (16.8)
Ex1111le 1p 16.8. Let !F consist of the Borel subsets of Q == (0, 1], and let H be a nonmeasurable set-a subset of Q that does not lie in !F (see the end of Section 3). Define /, ( w ) == 1 if w equals the fractional part t [ 1] of 1 and their common value lies in H('; define /, ( w ) == 0 otherwise. Each /, is measurable !F, but if f( w ) 0, • then the w-set where (16.8) holds is exactly H. -
=
Because of such examples, the set A above must be assumed to lie in � (Because of Theorem 13.4, no such assumption is necessary in the case of sequences.)
215 The assumption that (16.8) holds on a set A satisfying (16. 9) can still be expressed as the assumption that (16.8) holds almost everywhere. If I, == f /, dp. converges to I == f f dp. as t --+ oo , then certainly I,n --+ I for each sequence { t, } going to infinity. But the converse holds as well: if I, does not converge to I, then there is a positive £ such that 1 1, n - II > £ for a sequence { tn } going to infinity. To the question of whether l,n converges to I the previous theorems apply. Suppose that (16.8) holds almost everywhere and that 1/, 1 < g almost every where, where g is integrable. By the dominated convergence theorem, f must then be integrable and I, --+ I for each sequence { tn } going to infinity. It follows that f/, dp. --+ f f dp.. In this result t could go continuously to 0 or to some other value instead of to infinity. SECTION
Theorem 16.8.
( a, b).
16.
PROPERTIES OF THE INTEGRAL
Suppose that f( w, t) is a measurable function of w for each t in
(i) Suppose that /( w, t) is almost everywhere continuous in t at t0; suppose
further that, for each t in some neighborhood of t0, 1/(w, t) l < g( w) almost every where, where g is integrable. Then f f( w, t)p.( dw) is continuous in t at t0• (ii) Suppose that for w e A , where A satisfies (16.9), /( w , t) has in (a, b) a derivative / '( w, t) with respect to t; suppose further that 1/ '( w, t) I < g( w ) for w e A and t e ( a , b), where g is integrable. Then f f( w, t)p.(dw) has derivative f /' ( w, t) p. ( d w) on (a, b). Part (i) is an immediate consequence of the preceding discussion. To prove part (ii), consider a fixed t. By the mean value theorem, PllOOF.
/( w , I + h ) - /( w , I) h
== / ' ( w , s ) '
where s lies between t and t + h. The left side goes to /'( w, t) as h --+ 0 and is by the hypothesis dominated by the integrable function g( w ) . Therefore,
� [ J!( w , t + h ) p.( dw ) - jf( w , t ) p. ( dw ) ] --+ jf' ( w , t) p. ( dw ) .
•
The condition involving g can be localized. It suffices to assume that for each t there is an integrable g( w, t) such that 1/ '( w, s) 1 < g( w, t) for all s in some neighborhood of t. Integration over Sets The integral of I over a set A in � is defined by
(16 . 10)
j fdp. = jJAfdp. . A
The definition applies if I is defined only on A in the first place (set I = 0
outside A ). Notice that
fA idp. = 0 if p. ( A ) = 0.
216
INTEGRATION
All
the concepts and theorems above carry over in an obvious way to integrals over A . Theorems 16.6 and 16.7 yield this result:
If A1, A 2, are disjoint, and if f is either nonnegative or integrable, then fu, A , fdp. = En fA , fdp.. 16.9. The integrals (16.10) suffice to determine /, provided p. 11 EX111ple is a-finite. Suppose that f and g are nonnegative and that fAfdp. < fAgdp. for all A in !F. If p. is a-finite, there are jli:sets A n such that A n i 0 and p. ( A n ) < oo . If Bn = [0 < g < f, g S n ], then the hypothesized inequality applied to A n n Bn implies fA , , fdp. s fA , , gdp. < 00 (since A n n Bn has finite measure and g is bounded there) and hence f IA , ,(I - g) dp. = Theorem 16.9.
• • •
nB
nB
n8
0. But then by Theorem 15.2(ii), the integrand is 0 almost everywhere, and so p. ( A n n Bn ) = 0. Therefore, p.(O s g < f, g < oo] = 0, so that I < g almost everywhere:
Iff and g are nonnegative and fAfdp, = fAg dp, for all A in � ' and if p. is a-.finite, then f = g almost everywhere. (i)
There is a simpler argument for integrable functions; it does not require that the functions be nonnegative or that p, be a-finite. Suppose that f and g are integrable and that fAfdp, < fAgdp. for all A in �. Then f - g) dp. = 0 and hence p,[g < / ] = 0 by Theorem 15.2(ii):
Ils
If f and g are integrable and fAfdp, = fA.g dp, for all A in � ' then f = g almost everywhere. (iii) Iff and g are integrable and fA fdp, = fA gdp, for all A in 9', where 9' is a 'IT-system generating � and D is a finite or countable union of 9'-sets, then f = g almost everywhere. If f and g are nonnegative, this follows from Theorem 10.4 and (ii) above. For the general case, first prove that /+ + g - = f- + g + almost (ii)
everywhere.
Densities
Suppose that 8 is a nonnegative function and define a measure " by (Theorem 16.9) (16.11)
v(A)
=
L8 dfl ,
A
e � '·
8 is not assumed integrable with respect to p.. Many measures arise in this way. Note that p. ( A ) = 0 implies that 11(A) = 0. Clearly, " is finite if and
217 16. PROPERTIES OF THE INTEGRAL only if 8 is integrable p.. Another function 8' gives rise to the same " if 8 = 8 ' almost everywhere. On the other hand, J1(A) = JAB' dp. and (16.11) together imply that 8 = 8 ' almost everywhere if p. is a-finite, as was shown SECTION
in the preceding example. t p.
The measure " defined by (16.11) is said to have density 8 with respect to A density is by definition nonnegative. Formal substitution dJI = 8 dp. gives formulas (16.12) and (16.13).
1beorem 16.10.
If has density 8 with respect to p., then Jl
jfdv = jf8 dfA
( 16.12)
for nonnegative f. Moreover, f (not necessarily nonnegative) is integra ble with respect to if and only if fB is integrable with respect to p., in which case (16.12) and holds
Jl
(16.13 ) both hold. For nonnegative f, (16.13) always holds. Here fB is to be taken as 0 if f = 0 or if 8 conventions (15.2). Note that J1( 8 = 0] = 0.
= 0; this is consistent with the
If f = lA, then f fdJI = J1( A), so that (16.12) reduces to the definition (16.11). If f is a simple nonnegative function, (16.12) then follows by linearity. If f is nonnegative, fIn dJI = f fn6 dp. for the simple functions /, of Theorem 13.5, and (1�.12) follows by a monotone passage to the limit -that is, by Theorem 16:! Note that both sides of (16.12) may be infinite. Even if f is not nonnegative, (16.12) applies to 1/1 , whence it follows that I is integrable with respect to " if and only if f6 is integrable with respect to p. And if f is integrable, (16.12) follows from differencing the same result • for j+ and /-. Replacing f by fiA leads from (16.12) to (16.13). PROOF.
If J1( A) = p.(A n A0), then (16.11) holds with 6 and (16.13) reduces to fAfdJI = fA n A o fdp.. Example 16. 10.
= lA o'
•
Theorem 16.10 has two features in common with a number of theorems about integration. (i) The relation in question, (16.12) in this case, in addition to holding for integrable functions, holds for all nonnegative 'This is true also if " is a-finite;
see
Problem 16.12.
218
INTEGRATION
functions- the point being that if one side of the equation is infinite, then so is the other, and if both are finite, then they have the same value. This is useful in checking for integrability in the first place. (ii) The result is proved first for indicator functions, then for simple functions, then for nonnegative functions, then for integrable functions. In this connection, see Examples 16.4 and 16.5. The next result is Scheffe ' s theorem. Theorem 16.11.
sities 8n and 8. If
Suppose that �'n ( A) = fA Bn dP. and I'( A) = fA B dp. for den
n = 1 , 2, . . . , and if Bn .-. 8 except on a set of p.-measure 0, then ( 16.14 )
( 16.15 )
The inequality in (16.15) of course follows from (16.5). Let gn = 8 - Bn . The positive part g: of gn converges to 0 except on a set of p.-measure 0. Moreover, 0 < g: < 8 and 8 is integrable, and so the dominated convergence theorem applies: f g: dp. --+ 0. But f gn dp. = 0 by (16.14), and therefore PROOF.
• A corollary concerning infinite series follows immediately- take p. as counting measure on 0 = {1, 2, . . . } .
If E mx n m = Emxm < oo, the terms being nonnegative, and if lim n x n m = xm for each m , then lim nE mlx n m - xml = 0. If Ym is bounded, then lim nL m YmX n m = LmYmX m . Corollary.
Change of Variable
Let ( 0, � ) and ( 0', �') be measurable spaces, and suppose that the mapping T: 0 --+ D' is measurable �/� '. For a measure p. on �, define a measure p. T 1 on � ' by -
( 16 .16 )
as at the end of Section 13.
A'
E
� ',
SECTION
16.
PROPERTIES OF THE INTEGRAL
219
Suppose f is a real function on 0' that is measurable §' ', so that the composition fT is a real function on 0 that is measurable §' (Theorem 13.1(ii)). The change-of-variable formulas are (16.17) and (16.18). If A' = 0', the second reduces to the first. Theorem 16.12.
Iff is nonnegative, then
( 16 .17)
1 /( Tw ) p. ( dw ) = 1 f ( w' ) p.T- 1 ( dw' ) . O'
n
function f (not necessarily nonnegative) is integrable with respect to p. T- 1 if and only if fT is integrable with respect to p. , in which case (16.17) and
A
(}1 1A'/( Tw ) p. ( dw ) = jA'f( w' ) p.T- 1 ( dw' )
(16 .18)
hold. For nonnegative f, (16.18) always holds. PROOF. If f = JA ,, then fT = 11 tA' ' and so (16.17) reduces to the defini
tion (16.16). By linearity, (16.17) holds for nonnegative simple functions. If /, are simple functions for which 0 < i then 0 < fn T i fT, and (16.17) follows by the monotone convergence theorem. An application of (16.17) to establishes the assertion about integrabil ity, and for integrable f, (16.17) follows by decomposition into positive and negative parts. Finally, if f is replaced by f/A,, (16.17) reduces to (16.18). •
/, /,
1/1
Suppose that ( 0', $' ') = ( R 1 , !1/ 1 ) and T = cp is an ordinary real function, measurable §'. If /(x) = x, (16.17) becomes
Example
( 16.1 9
16. 11.
) p.
If cp = E ; x ; IA is simple, then P.'P- I has mass ( A ; ) at X;, and each side of • (16.19) reduces to L;X;p.(A;). I
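A simple numerical illustration of (16.19), offered only as a sketch and not taken from the text: let Ω = (0, 1] with P Lebesgue measure and φ(ω) = ω². Then Pφ⁻¹ has distribution function √x on [0, 1], hence density 1/(2√x) there, and both sides of (16.19) equal 1/3. The code evaluates the two sides by midpoint sums.

```python
import math

N = 200_000
h = 1.0 / N
mid = [(i + 0.5) * h for i in range(N)]

# left side of (16.19): integral of phi(w) = w^2 over (0, 1] against Lebesgue measure
left = sum(w ** 2 for w in mid) * h

# right side: integral of x against P phi^{-1}, which has density 1/(2 sqrt(x)) on (0, 1]
right = sum(x / (2.0 * math.sqrt(x)) for x in mid) * h

print(round(left, 6), round(right, 6))   # both approximately 1/3
```

Computing an expectation either on the original space or against the induced distribution on the line is the practical content of the change-of-variable formula.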
Uniform Integrability
1/1 /ll/l
If I is integrable, then � a] goes to 0 almost everywhere as a --+ oo and is dominated by and hence
1/1,
( 1 6.20)
lim
a
.....
oo
j[ 1/l � a )1/1 dp.
=
0.
A sequence $\{f_n\}$ is uniformly integrable if (16.20) holds uniformly in $n$:

(16.21)    $\lim_{a \to \infty} \sup_n \int_{[|f_n| \ge a]} |f_n|\,d\mu = 0$.

If (16.21) holds and $\mu(\Omega) < \infty$, and if $a$ is large enough that the supremum in (16.21) is less than 1, then

(16.22)    $\int |f_n|\,d\mu \le a\mu(\Omega) + 1$,
and hence the $f_n$ are integrable. On the other hand, (16.21) always holds if the $f_n$ are uniformly bounded, but the $f_n$ need not in that case be integrable if $\mu(\Omega) = \infty$. For this reason the concept of uniform integrability is interesting only for $\mu$ finite.

If $h$ is the maximum of $|f|$ and $|g|$, then

$\int_{[|f+g| \ge 2a]} |f + g|\,d\mu \le 2\int_{[h \ge a]} h\,d\mu \le 2\int_{[|f| \ge a]} |f|\,d\mu + 2\int_{[|g| \ge a]} |g|\,d\mu$.

Therefore, if $\{f_n\}$ and $\{g_n\}$ are uniformly integrable, so is $\{f_n + g_n\}$.
Theorem 16.13. If $\mu(\Omega) < \infty$ and $f_n \to f$ almost everywhere, and if the $f_n$ are uniformly integrable, then $f$ is integrable and $\int f_n\,d\mu \to \int f\,d\mu$.

PROOF. Since $\int |f_n|\,d\mu$ is bounded (see (16.22)), it follows by Fatou's lemma that $f$ is integrable. Put $h_n = |f_n - f|$. Then the $h_n$ are uniformly integrable and $h_n \to 0$ almost everywhere. If $h_n^{(a)} = h_n I_{[h_n < a]}$, then for each $a$, $\lim_n h_n^{(a)} = 0$ and so $\lim_n \int h_n^{(a)}\,d\mu = 0$ by the bounded convergence theorem. Now

$\int h_n\,d\mu = \int_{[h_n \ge a]} h_n\,d\mu + \int h_n^{(a)}\,d\mu$.

Given $\epsilon$, choose $a$ so that the first term on the right is less than $\epsilon$ for all $n$; then choose $n_0$ so that the second term on the right is less than $\epsilon$ for all $n \ge n_0$. This shows that $\int h_n\,d\mu \to 0$.

Suppose that
(16.23)    $\sup_n \int |f_n|^{1+\epsilon}\,d\mu < \infty$

for a positive $\epsilon$. If $K$ is the supremum here, then

$\int_{[|f_n| \ge a]} |f_n|\,d\mu \le \frac{1}{a^{\epsilon}} \int_{[|f_n| \ge a]} |f_n|^{1+\epsilon}\,d\mu \le \frac{K}{a^{\epsilon}}$,

and so $\{f_n\}$ is uniformly integrable.
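A small numerical sketch of the criterion (16.23), not from the text: the sequence $f_n(x) = (1 + 1/n)x^{-1/2}$ on $([0,1], \lambda)$ is an illustrative choice that satisfies (16.23) for $\epsilon < 1$, and the tail integrals over $[|f_n| \ge a]$ are seen to be small uniformly in $n$.

import numpy as np

# Midpoint-rule approximation of the dx-integrals on (0, 1].
N = 200_000
x = (np.arange(1, N + 1) - 0.5) / N

def tail_integral(n, a):
    f = (1 + 1.0 / n) * x ** (-0.5)
    return np.mean(np.where(f >= a, f, 0.0))     # integral of |f_n| over [|f_n| >= a]

for a in (5.0, 10.0, 50.0):
    print(a, max(tail_integral(n, a) for n in range(1, 20)))
# The reported suprema over n decrease toward 0 as a grows, as uniform integrability requires.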
Complex Functions

A complex-valued function on $\Omega$ has the form $f(\omega) = g(\omega) + ih(\omega)$, where $g$ and $h$ are ordinary finite-valued real functions on $\Omega$. It is, by definition, measurable $\mathcal{F}$ if $g$ and $h$ are. If $g$ and $h$ are integrable, then $f$ is by definition integrable, and its integral is of course taken as

(16.24)    $\int (g + ih)\,d\mu = \int g\,d\mu + i\int h\,d\mu$.

Since $\max\{|g|, |h|\} \le |f| \le |g| + |h|$, $f$ is integrable if and only if $\int |f|\,d\mu < \infty$, just as in the real case. The linearity equation (16.3) extends to complex functions and coefficients; the proof requires only that everything be decomposed into real and imaginary parts. Consider the inequality (16.4) for the complex case:

(16.25)    $\left|\int f\,d\mu\right| \le \int |f|\,d\mu$.

If $f = g + ih$ and $g$ and $h$ are simple, the corresponding partitions can be taken to be the same ($g = \sum_k x_k I_{A_k}$ and $h = \sum_k y_k I_{A_k}$), and (16.25) follows by the triangle inequality. For the general integrable $f$, represent $g$ and $h$ as limits of simple functions dominated by $|f|$ and pass to the limit.

The results on integration to the limit extend as well. Suppose that $f_k = g_k + ih_k$ are complex functions satisfying $\sum_k \int |f_k|\,d\mu < \infty$. Then $\sum_k \int |g_k|\,d\mu < \infty$, and so by the corollary to Theorem 16.7, $\sum_k g_k$ is integrable and integrates to $\sum_k \int g_k\,d\mu$. The same is true of the imaginary parts, and hence $\sum_k f_k$ is integrable and

(16.26)    $\int \sum_k f_k\,d\mu = \sum_k \int f_k\,d\mu$.
PROBLEMS

16.1. Suppose that $\mu(A) < \infty$ and $\alpha \le f \le \beta$ almost everywhere on $A$. Show that $\alpha\mu(A) \le \int_A f\,d\mu \le \beta\mu(A)$.

16.2. (a) Deduce parts (i) and (ii) of Theorem 10.2 from Theorems 16.2 and 16.4. (b) Compare (16.7), (10.6), and (4.10).
13.13 Suppose that p.(Q) < oo and In are uniformly bounded. (a) Assume In --+ f uniformly and deduce I/,. dp. --+ I I dp. from (16.5). (b) Use part (a) and Egoroff's theorem to give another proof of Theorem 16.5. 16.4. Prove that if 0 < In --+ f almost everywhere and I In dp. < A < oo , then f is integrable and Ifdp. < A . (This is essentially the same as Fatou' s lemma and is sometimes called by that name.) 16.5. Suppose that p. is Lebesgue measure on the unit interval 0 and that ( a, b) = (0, 1) in Theorem 16.8. If /( w, t) is 1 or 0 according as w < t or w > t, then for each t, f'( w, t ) = 0 almost everywhere. But lf( w, t)p.( dw ) does not differentiate to 0. Why does this not contradict the theorem? Examine the proof for this case. 16.6 (a) Suppose that functions an , bn , fn converge almost everywhere to func tions a , b, f, respectively. Suppose that the first two sequences may be integrated to the limit-that is, the functions are all integrable and I an dp. --+ I a dp., I bn dp. --+ I b dp.. Suppose, finally, that the first two sequences enclose the third: an < In < bn almost everywhere. Show that the third may be integrated to the limit. (b) Deduce Lebesgue's dominated convergence theorem from part (a). 16.7. Show that, for f integrable and £ positive, there exists an integrable simple g such that I If - gl dp. < £ . If .JJ1 is a field generating !F, g can be taken as E; x; IA with A ; e .JJI. Hint: Use Theorems 13.5, 16.4, and 11.4. 16.8. Suppose that f( w, · ) is, for each w, a function on an open set W in the complex plane and that f( · , z ) is for z in W measurable !F. Suppose that A satisfies (16.9), that /( w, · ) is analytic in W for c.> in A , and that for each z0 in W there is an integrable g( · , z0 ) such that 1/( w, z ) I < g( w, z0 ) for all w e A and all z in some neighborhood of z0 • Show that I/( w, z )p.( dw) is analytic in W. 16.9. Suppose that In are integrable and supn l In dp. < oo . Show that, if In f /, then I is integrable and lfn dp. --+ lfdp.. This is Beppo Levi 's theorem. 16.10. Suppose of integrable In that am n = I lim - ln l dp. --+ 0 as m , n --+ 00 . Then there exists an integrable f such that /Jn = I If - In I dp. --+ 0. Prove this by the following steps. (a) Choose n k so that m , n > n k implies that am n < 2- k . Use the corollary to Theorem 16.7 to show that f = lim k In k exists and is integrable and lim k I If - Ink I d�t = o. (b> For n > n k , write I If - In I d�t < I If - In k I dp, + I lin k - In I d�t . 16. 1 1 (a) If p. is not a-finite, (i) in Example 16.9 can fail. Find an example. (b) Modifying the argument in Example 16.9, show that if f, g > 0, if IA fdp. < IA g dp. for all A e !F, and if [ g < oo,f > 0] is a countable union of sets of finite p.-measure, then f < g almost everywhere. (c) Show that the condition on [ g < oo , f > 0] in part (b) is satisfied if the measure v ( A ) = IA fdp. is a-finite. 16.12. f (a) Suppose that (16.11) holds ( B � 0). Show that v determines B up to a set of p.-measure 0 if p. or v is a-finite. 16.3.
I
SECTION
16.
223
PROPERTIES OF THE INTEGRAL
(b)
16. 13. 16. 14. 16.1 5. 16.16. 16. 17.
16.18. 16.19.
16.20.
16.21.
16.22.
16.13.
Show that, if p. is a-finite and p.( B = oo ] = 0, then ., is a-finite. (c) Show that if neither p. nor ., is a-finite then B may not be unique even up to sets of p.-measure 0. Show that the supremum in (16.15) is half the integral on the right. Show that if (16.14) holds and if lim infnBn > B almost everywhere, then (16.15) holds and lim infnBn = B almost everywhere. Show that Scheffe' s theorem still holds if (16.14) is weakened to �'n (g) --+ ., < n ) . f Suppose that In and f are integrable and that /,, --+ f almost every where. Show that / lfn I dp. --+ / 1/1 dp. if and only if / lfn - / 1 dp. --+ 0. (a) Show that if 1/n I < g and g is integrable, then { In } is uniformly integrable. Compare the hypotheses of Theorems 16.4 and 16.13. (b) On the unit interval with Lebesgue measure, let In = ( njlog n) /<0• n- • > for n > 3. Show that the In are uniformly integrable (and f In dp. --+ 0) although they are not dominated by any integrable g. (c) Show for In = nl
�
16.14.
•
•
.
�
•
•
Consider the vector lattice .ftJ and the functional A of Problems 11.11 and 11.12. Let p. be the extension (Theorem 11.3) to !F= a ( � ) of the set function p.0 on 'a. (a) Show by (1 1.11) that for positive x, y1 , y2 , v([ l > 1] X (0, x ]) = xp.0( 1 > 1] xp.[ l > 1] and v([y1 < I < Y2 J X (0, x]) = xp.[yt < I < Y2 ). (b) Show that if I e .ftJ, then I is integrable and
16.15. 11.12 f
=
A ( /) = jfdp. . (Consider first the case I � 0.) This is the Danie/1-Stone representation
theorem.
SECTION 17. INTEGRAL WITH RESPECT TO LEBESGUE MEASURE

The Lebesgue Integral on the Line

A real measurable function on the line is Lebesgue integrable if it is integrable with respect to Lebesgue measure $\lambda$, and its Lebesgue integral $\int f\,d\lambda$ is denoted by $\int f(x)\,dx$ or, in case of integration over an interval, by $\int_a^b f(x)\,dx$. It is instructive to compare it with the Riemann integral.
The Riemann Integral
A real function $f$ on an interval $(a, b]$ is by definition† Riemann integrable, with integral $r$, if this condition holds: For each $\epsilon$ there exists a $\delta$ with the property that

(17.1)    $\left| r - \sum_j f(x_j)\lambda(J_j) \right| < \epsilon$

if $\{J_j\}$ is any finite partition of $(a, b]$ into subintervals satisfying $\lambda(J_j) < \delta$ and if $x_j \in J_j$ for each $j$. The Riemann integral for step functions was used in Section 1.

Suppose that $f$ is Borel measurable, and suppose that $f$ is bounded, so that it is Lebesgue integrable. If $f$ is also Riemann integrable, the $r$ of (17.1) must coincide with the Lebesgue integral $\int_a^b f(x)\,dx$. To see this, first note that letting $x_j$ vary over $J_j$ leads from (17.1) to

(17.2)    $\left| r - \sum_j \sup_{x \in J_j} f(x) \cdot \lambda(J_j) \right| \le \epsilon$.

†There are of course a number of equivalent definitions.
Consider the simple function $g$ with value $\sup_{x \in J_j} f(x)$ on $J_j$. Now $f \le g$, and the sum in (17.2) is the Lebesgue integral of $g$. By monotonicity of the Lebesgue integral, $\int_a^b f(x)\,dx \le \int_a^b g(x)\,dx \le r + \epsilon$. The reverse inequality follows in the same way, and so $\int_a^b f(x)\,dx = r$. Therefore, the Riemann integral when it exists coincides with the Lebesgue integral.

Suppose that $f$ is continuous on $[a, b]$. By uniform continuity, for each $\epsilon$ there exists a $\delta$ such that $|f(x) - f(y)| < \epsilon/(b - a)$ if $|x - y| < \delta$. If $\lambda(J_j) < \delta$ and $x_j \in J_j$, then $g = \sum_j f(x_j) I_{J_j}$ satisfies $|f - g| < \epsilon/(b - a)$ and hence $|\int_a^b f\,dx - \int_a^b g\,dx| < \epsilon$. But this is (17.1) with $r$ replaced (as it must be) by the Lebesgue integral $\int_a^b f\,dx$: A continuous function on a closed interval is Riemann integrable.

Example 17.1. If $f$ is the indicator of the set of rationals in (0, 1], then the Lebesgue integral $\int_0^1 f(x)\,dx$ is 0 because $f = 0$ almost everywhere. But for an arbitrary partition $\{J_j\}$ of (0, 1] into intervals, $\sum_j f(x_j)\lambda(J_j)$ for $x_j \in J_j$ is 1 if each $x_j$ is taken from the rationals and 0 if each $x_j$ is taken from the irrationals. Thus $f$ is not Riemann integrable.

Example 17.2. For the $f$ of Example 17.1, there exists a $g$ (namely, $g = 0$) such that $f = g$ almost everywhere and $g$ is Riemann integrable. To show that the Lebesgue theory is not reducible to the Riemann theory by the casting out of sets of measure 0, it is of interest to produce an $f$ (bounded and Borel measurable) for which no such $g$ exists. In Examples 3.1 and 3.2 there were constructed Borel subsets $A$ of (0, 1] such that $0 < \lambda(A) < 1$ and such that $\lambda(A \cap J) > 0$ for each subinterval $J$ of (0, 1]. Take $f = I_A$. Suppose that $f = g$ almost everywhere and that $\{J_j\}$ is a decomposition of (0, 1] into subintervals. Since $\lambda(J_j \cap A \cap [f = g]) = \lambda(J_j \cap A) > 0$, $g(y_j) = f(y_j) = 1$ for some $y_j$ in $J_j \cap A$, and

(17.3)    $\sum_j g(y_j)\lambda(J_j) = 1 > \lambda(A)$.

If $g$ were Riemann integrable, its Riemann integral would coincide with the Lebesgue integrals $\int g\,dx = \int f\,dx = \lambda(A)$, in contradiction to (17.3).

It is because of their extreme oscillations that the functions in Examples 17.1 and 17.2 fail to be Riemann integrable. (It can be shown that a bounded function on a bounded interval is Riemann integrable if and only if the set of its discontinuities has Lebesgue measure 0.†) This cannot happen in the case of the Lebesgue integral of a measurable function: if $f$ fails to be Lebesgue integrable, it is because its positive part or its negative part is too large, not because one or the other is too irregular.

†See Problems 17.5 and 17.6.
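Example 17.1 can be illustrated numerically. The sketch below is not from the text; it simply evaluates Riemann sums for the indicator of the rationals over one partition with rational tags and once with irrational tags, using the type of the tag as a stand-in for rationality.

from fractions import Fraction
from math import sqrt

# f is the indicator of the rationals in (0, 1]; Riemann sums over the same
# partition are 1 with rational tags and 0 with irrational tags, so f has no
# Riemann integral, while its Lebesgue integral is 0.
def f(x):
    # a tag is passed either as a Fraction (rational) or as a float chosen irrational
    return 1.0 if isinstance(x, Fraction) else 0.0

n = 1000
mesh = 1.0 / n
rational_tags = [Fraction(j, n) for j in range(1, n + 1)]                 # right endpoints
irrational_tags = [j / n - sqrt(2) / (10 * n) for j in range(1, n + 1)]   # shifted into each interval

print(sum(f(x) * mesh for x in rational_tags))     # 1.0
print(sum(f(x) * mesh for x in irrational_tags))   # 0.0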
Example 17.3. It is an important analytic fact that

(17.4)    $\lim_{t \to \infty} \int_0^t \frac{\sin x}{x}\,dx = \frac{\pi}{2}$.

The existence of the limit is simple to prove because $\int_{(n-1)\pi}^{n\pi} x^{-1}\sin x\,dx$ alternates in sign and its absolute value decreases to 0; the value of the limit will be identified in the next section (Example 18.3). On the other hand, $x^{-1}\sin x$ is not Lebesgue integrable over $(0, \infty)$ because its positive and negative parts integrate to $\infty$. Within the conventions of the Lebesgue theory, (17.4) thus cannot be written $\int_0^\infty x^{-1}\sin x\,dx = \pi/2$, although such "improper" integrals appear in calculus texts. It is, of course, just a question of choosing the terminology most convenient for the subject at hand.

The function in Example 17.2 is not equal almost everywhere to any Riemann integrable function. Every Lebesgue integrable function can, however, be approximated in a certain sense by Riemann integrable functions of two kinds.
Theorem 17.1. Suppose that $\int |f|\,dx < \infty$ and $\epsilon > 0$.
(i) There is a step function $g = \sum_{i=1}^k x_i I_{A_i}$, with bounded intervals as the $A_i$, such that $\int |f - g|\,dx < \epsilon$.
(ii) There is a continuous integrable $h$ such that $\int |f - h|\,dx < \epsilon$.

PROOF. By the construction (13.6) and the dominated convergence theorem, (i) holds if the $A_i$ are not required to be intervals; moreover, $\lambda(A_i) < \infty$ for each $i$ for which $x_i \ne 0$. By Theorem 11.4 there exists a finite disjoint union $B_i$ of intervals such that $\lambda(A_i \triangle B_i) < \epsilon/k|x_i|$. But then $\sum_i x_i I_{B_i}$ satisfies the requirements of (i) with $2\epsilon$ in place of $\epsilon$.

To prove (ii) it is only necessary to show that for the $g$ of (i) there is a continuous $h$ such that $\int |g - h|\,dx < \epsilon$. Suppose that $A_i = (a_i, b_i]$; let $h_i(x)$ be 1 on $(a_i, b_i]$ and 0 outside $(a_i - \delta, b_i + \delta]$, and let it increase linearly from 0 to 1 over $(a_i - \delta, a_i]$ and decrease linearly from 1 to 0 over $(b_i, b_i + \delta]$. Since $\int |I_{A_i} - h_i|\,dx \to 0$ as $\delta \to 0$, $h = \sum_i x_i h_i$ for sufficiently small $\delta$ will satisfy the requirements. Note that this $h$ vanishes outside a bounded interval.

The Lebesgue integral is thus determined by its values for continuous functions.†

†This provides another way of defining the Lebesgue integral on the line. See Problem 17.17.
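A numerical sketch in the spirit of Theorem 17.1(i), not from the text: the integrable function and the grid of intervals below are arbitrary illustrative choices, and the $L^1$ error of the step approximation is computed by a Riemann-type sum.

import numpy as np

# Approximate an integrable f in L1 by a step function constant on bounded intervals.
f = lambda x: np.exp(-np.abs(x))                     # integrable over the line
edges = np.linspace(-10, 10, 401)                    # bounded intervals A_i = (edges[i], edges[i+1]]
mid = 0.5 * (edges[:-1] + edges[1:])
heights = f(mid)                                     # constant value x_i on A_i

x = np.linspace(-30, 30, 600_001)                    # fine grid for the dx-integral
idx = np.clip(np.searchsorted(edges, x, side="left") - 1, 0, len(heights) - 1)
g = np.where((x > edges[0]) & (x <= edges[-1]), heights[idx], 0.0)

err = np.sum(np.abs(f(x) - g)) * (x[1] - x[0])       # approximates integral |f - g| dx
print(err)                                           # small, and it shrinks as the intervals are refined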
The Fundamental Theorem of Calculus

For positive $h$,

$\left| \frac{1}{h}\int_x^{x+h} f(y)\,dy - f(x) \right| \le \frac{1}{h}\int_x^{x+h} |f(y) - f(x)|\,dy \le \sup\left[\,|f(y) - f(x)|\colon x \le y \le x + h\,\right]$,

and the right side goes to 0 with $h$ if $f$ is continuous at $x$. The same thing holds for negative $h$,† and therefore $\int_a^x f(y)\,dy$ has derivative $f(x)$:

(17.5)    $\frac{d}{dx}\int_a^x f(y)\,dy = f(x)$

if $f$ is continuous at $x$. Suppose that $F$ is a function with continuous derivative $F' = f$; that is, suppose that $F$ is a primitive of the continuous function $f$. Then

(17.6)    $\int_a^b f(x)\,dx = \int_a^b F'(x)\,dx = F(b) - F(a)$,

as follows from the fact that $F(x) - F(a)$ and $\int_a^x f(y)\,dy$ agree at $x = a$ and by (17.5) have identical derivatives. For continuous $f$, (17.5) and (17.6) are two ways of stating the fundamental theorem of calculus. To the calculation of Lebesgue integrals the methods of elementary calculus thus apply.
outside a set of Lebesgue measure 0 if f is integrable-it need not be continuous. As the following example shows, however, (17.6) can fail for discontinuous f. E%111le 11p 1 7.4. Define F(x) == x 2 sin x - 2 for 0 < x s t and F(x) == 0 for x s 0 and for x � 1. Now for t < x < 1 define F(x) in such a way that F is continuously differentiable over (0, oo ). Then F is everywhere differentiable, but F'(O) 0 and F'( x) == 2x sin x- 2 - 2x- 1cos x- 2 for 0 < x < }. Thus F' is dis continuous at 0; F' is, in fact, not even integrable over (0, 1], which makes (17.6) impossible for a = 0. For a more extreme example, decompose (0, 1] into countably many subintervals (a n , bn 1 · Define G( x) = 0 for x < 0 and x > 1, and on (an , bn ] define G(x) = F((x - a n )/( bn - an )). Then G is everywhere differentiable but (17.6) is impossible for G if (a, b] contains any of the (an , bn ] because G is not integrable over any of them . =
t nus requires defining /� = - ft� for
u >
v.
•
Change of Variable
Suppose that $T$ is a continuous, increasing mapping of $(a, b)$ onto $(Ta, Tb)$. If $T$ has a continuous derivative $T'$, then the formal substitutions $y = T(x)$ and $dy = T'(x)\,dx$ give the usual formula for changing variables:

(17.7)    $\int_a^b f(T(x))T'(x)\,dx = \int_{Ta}^{Tb} f(y)\,dy$.

There is a more general formula. If $T$ has a continuous derivative on $(a, b)$ and is either increasing or decreasing, then

(17.8)    $\int_A f(T(x))|T'(x)|\,dx = \int_{TA} f(y)\,dy$

for Borel subsets $A$ of $(a, b)$. To prove this, define $\mu(A) = \int_A |T'(x)|\,dx$ for $A \subset (a, b)$. If $B$ is a subinterval of $(a, b)$, then

(17.9)    $\mu(B) = \lambda(TB)$,

as follows by the fundamental theorem of calculus (consider increasing and decreasing $T$ separately). By Theorem 10.4, (17.9) holds for all Borel sets $B$ in $(a, b)$. By the general change-of-variable formula (16.18), $\int_{T^{-1}B} f(Tx)\,\mu(dx) = \int_B f(y)\,dy$. Putting $B = TA$ ($T$ is one-to-one) and transforming the left side by the formula (16.13) for integration with respect to a density gives (17.8).† Note that $T'$ may vanish and $T^{-1}$ may be nondifferentiable at places, for example if $T(x) = x^3$. The intervals $(a, b)$ and $(Ta, Tb)$ may be infinite. Finally, (17.8) holds for all nonnegative $f$ as well as for integrable $f$.

†For another approach, see Problem 17.12.

Example 17.5. Put $T(x) = \sin x/\cos x$ on $(-\pi/2, \pi/2)$. Then $T'(x) = 1 + T^2(x)$, and (17.7) for $f(y) = (1 + y^2)^{-1}$ gives

(17.10)    $\int_{-\infty}^{\infty} \frac{dy}{1 + y^2} = \pi$.
The Lebesgue Integral in $R^k$

The $k$-dimensional Lebesgue integral, the integral in $(R^k, \mathcal{R}^k, \lambda_k)$, is denoted $\int f(x)\,dx$, it being understood that $x = (x_1, \ldots, x_k)$. In low-dimensional cases it is also denoted $\iint_A f(x_1, x_2)\,dx_1\,dx_2$, and so on.

A mapping $T\colon V \to R^k$ carrying a set $V$ in $R^k$ into $R^k$ has the form $T(x) = (t_1(x), \ldots, t_k(x))$. Suppose that the partial derivatives $t_{ij} = \partial t_i/\partial x_j$ exist, let $D_x$ be the $k \times k$ matrix with entries $t_{ij}(x)$, and let $J(x)$ denote the Jacobian:

(17.11)    $J(x) = \det D_x$.

Theorem 17.2. Let $T\colon V \to TV$ be a one-to-one mapping of an open set $V$ onto an open set $TV$. Suppose that $T$ is continuous and that $T$ has continuous partial derivatives $t_{ij}$. Then

(17.12)    $\int_V f(Tx)|J(x)|\,dx = \int_{TV} f(y)\,dy$.

As usual this holds for nonnegative $f$ and for $f$ integrable over $TV$. Replacing $f$ by $fI_{TA}$ for $A \subset V$ gives

(17.13)    $\int_A f(Tx)|J(x)|\,dx = \int_{TA} f(y)\,dy$.

In the case $k = 1$, this reduces to (17.8). If $T$ is linear, $D_x$ is for each $x$ the matrix of the transformation. If $T$ is identified with this matrix, (17.13) reduces to

(17.14)    $|\det T| \cdot \int_A f(Tx)\,dx = \int_{TA} f(y)\,dy$.

Only this special case will be proved here, because Theorem 19.3 contains Theorem 17.2.†

Suppose that $T$ is linear and nonsingular and $V = R^k$. It is enough to prove (17.14) for $A = R^k$. If $f$ is an indicator, this reduces to (12.2), and it follows in the usual way that it holds for simple $f$, then for nonnegative $f$, and finally for the general integrable $f$.

†The proof in Section 19 uses properties of Hausdorff measure. For an alternative argument, see, for example, RUDIN, p. 184. In the present book, as it happens, Theorem 17.2 is applied only to nonsingular linear transformations $T$ and to the $T$ of Example 17.6. For the latter case, there is an entirely different proof; see Problems 12.16, 17.15, and 18.24.
Example 17.6. If $T(r, \theta) = (r\cos\theta, r\sin\theta)$, then the Jacobian (17.11) is $J(r, \theta) = r$, which gives the formula for integrating in polar coordinates:

(17.15)    $\iint_{r>0,\ 0<\theta<2\pi} f(r\cos\theta, r\sin\theta)\, r\,dr\,d\theta = \iint_{R^2} f(x, y)\,dx\,dy$.
•
Stieltjes Integrals
Suppose that $F$ is a function on $R^k$ satisfying the hypotheses of Theorem 12.5, so that there exists a measure $\mu$ such that $\mu(A) = \Delta_A F$ for bounded rectangles $A$. In integrals with respect to $\mu$, $\mu(dx)$ is often replaced by $dF(x)$:

(17.16)    $\int_A f(x)\,dF(x) = \int_A f(x)\,\mu(dx)$.

The left side of this equation is the Stieltjes integral of $f$ with respect to $F$; since it is defined by the right side of the equation, nothing new is involved. Suppose that $f$ is uniformly continuous on a rectangle $A$, and suppose that $A$ is decomposed into rectangles $A_m$ small enough that $|f(x) - f(y)| < \epsilon/\mu(A)$ for $x, y \in A_m$. Then

$\left| \int_A f(x)\,dF(x) - \sum_m f(x_m)\Delta_{A_m}F \right| < \epsilon$

for $x_m \in A_m$. In this case the left side of (17.16) can be defined as the limit of these approximating sums without any reference to the general theory of measure and for historical reasons is sometimes called the Riemann-Stieltjes integral; (17.16) for the general $f$ is then called the Lebesgue-Stieltjes integral. Since these distinctions are unimportant in the context of general measure theory, $\int f(x)\,dF(x)$ and $\int f\,dF$ are best regarded as merely notational variants for $\int f(x)\,\mu(dx)$ and $\int f\,d\mu$.
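The approximating sums above are easy to compute. The following sketch is not from the text; $F(x) = x^2$ on (0, 1] and $f(x) = x$ are illustrative choices, for which the Lebesgue-Stieltjes value is $\int_0^1 x \cdot 2x\,dx = 2/3$.

import numpy as np

# Riemann-Stieltjes approximating sums sum_m f(x_m) * (F(b_m) - F(a_m)) on (0, 1].
F = lambda x: x ** 2
f = lambda x: x

for m in (10, 100, 1000, 10000):
    edges = np.linspace(0.0, 1.0, m + 1)
    tags = 0.5 * (edges[:-1] + edges[1:])            # any x_m in A_m will do
    s = np.sum(f(tags) * (F(edges[1:]) - F(edges[:-1])))
    print(m, s)                                      # approaches 2/3 as the mesh shrinks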
17.1
17.1. Extend Theorem 17.1 to $R^k$.

17.2. Show that if $f$ is integrable, then $\lim_{t \to 0} \int |f(x + t) - f(x)|\,dx = 0$. Use Theorem 17.1.
17.3.
Suppose that p. is a finite measure on qjA and A is closed. Show that p. ( x + A ) is upper semicontinuous in x and hence measurable.
17.4.
Stappose that /0 1/( x ) l dx < oo . Show that for each £ , A [ x: x > a , 1/( x ) l > � 1 � 0 as a --+ oo . Show by example that /( x) need not go to 0 as
X � 00 .
17.5.
A Riemann integrable function on (0, 1] is continuous almost everywhere. Let the function be f and l�t A( be the set of x such that for every 8 there are points y and z satisfying IY - x l < 8, l z - x l < 8, and 1/( y ) - /( z ) l > £ . Show that A( is closed and that the set D of discontinuities of f is the
union of the A ( for positive, rational £ . Suppose that [0, 1] is decomposed into intervals I, . If the interior of I; meets A( , choose Y; and z; in I; in such a way that /(Y; ) - /( z; ) > £; otherwise take Y; = z, e 1; . Show that E( /( Y; ) - /( z; )) A ( I; ) > £ A ( A( ) . Finally, show that if f is Riemann integra ble, then A ( D ) = 0.
17.6.
A bounded Borel function on (0, 1 ] that is continuous almost everywhere is Riemann in.tegrab/e. Let M bound 1/1 and suppose that A ( D ) = 0. Given find an open G such that D c G and A ( G) < £/M and take C = [0, 1] - G. For each x in C find a 8x such that 1/( y ) - /( x) l < £ if IY - x l < 8x . Cover C by finitely many ( xk - 8x /2, x" + 8x /2) and take 2 8 less than " the minimum 8x . Show that 1/( y) _! /( x) l < 2£ if IY - x l < 8 and x e C. If [0, 1] is deco�posed into intervals I; with A ( I; ) < 8 , and if Y; e I; , let g f
f,
be the function with value /( Y; ) on 1; . Let E' denote summation over those i for which I, meets C, and let I:" denote summation over the other i. Show that
17. 7.
Show that if f is integrable, there exist continuous, integrable functions gn such that gn (x) --+ /(x ) except on a set of Lebesgue measure 0. (Use Theorem 17.1(ii) with £ = n - 2 . )
17.8.
13.13 1 7.7 f Let f be a finite-valued Borel function over (0, 1). By the ' following steps, prove Lusin s theorem: For each £ there exists a continuous function g such that A [ x e (0, 1): /( x) =I= g( x)] < £ . (a) Show that f may be assumed integrable, or even bounded. (b) Let gn be continuous functions converging to f almost everywhere. Combine Egoroff's theorem and Theorem 12.3 to show that convergence is uniform on a compact set K such that A ((O, 1) - K ) < £. The limit lim n gn ( x) = /(x) must be continuous when restricted to K. (c) Exhibit (0, 1) - K as a disjoint union of open intervals Ik [A12] ; define g as f on K and define it by linear interpolation on each lk . n n Let fn ( x ) = x - 1 - 2x2 - 1 • Calculate and compare /JE':- 1 /n ( x) dx and EC:- 1 /J/n ( x) dx. Relate this to Theorem 16.6 and to the corollary to Theorem 16.7.
17.9.
17.10. Show that (1 + y 2 ) - 1 has equal integrals over ( - oo , - 1), ( - 1, 0), (0, 1), (1 , oo). Conclude from (17.10) that /J (1 + y 2 ) - 1 dy == w/4. Expand the integrand in a geometric series and deduce Leibniz's formula
1 1 1 'IT - == 1 - - + - - - + · · · 5 7 4 3 16.7 (note that its corollary does not apply). 17.11. Take T( x ) == sin x and l(y) = (1 - y 2 ) - 1 12 in (17.7) and show that t == /Ji0 1(1 - y 2 ) - 1 12 dy for 0 < t s wj2. From Newton's binomial for mula and two applications of Theorem 16.6, conclude that by Theorem
{ 17 .17) 17.12. 17.7 f
Suppose that T maps [ a, b] into [ u , v] arid has continuous deriva tive T'; T need not be monotonic, but assume for notational convenience that Ta s Tb. Suppose that I is a Borel function on [ u, v] for which J: I/( T( x )) T' ( x ) l dx < oo and Jl: II(Y) I dy < oo . Then (17.7) holds. (a) Prove this for continuous I by replacing b by a variable upper limit and differentiating. (b) Prove it for bounded I by exhibiting I as the limit almost everywhere of uniformly bounded continuous functions. (c) Prove it in general.
17.13. A function I on [0, 1] has Kurzweil integral· I if for each £ there exists a positive function B( · ) over [0, 1] such that, if x0 , , xn and � 1 , , �, •
•
•
•
•
•
satisfy
{ 17 .18)
{
O == x0 < x1 < · · · < xn == 1 ' X; - 1 < �; < X; ' X; - X; - 1 < B ( �; ) ,
i == 1 , . . . , n ,
then
( 17.19)
n
E l( �; ) ( x; - X; _ 1 ) - I < £ .
k- 1
Riemann's definition is narrower because it requires the function B( · ) to be constant. Show in fact by the following steps that, if a Borel function over [0, 1] is Lebesgue integrable, then it has a Kwzweil integral, namely its Lebesgue integral. (a) The Kunweil integral being obviously additive, assume that I > 0. Choose k0 so that J1 � k I dx < £, and for k == 1, 2, . . . choose open sets Gk such that Gk :::> A k = r(k - 1)£ < I < k£] and A ( Gk - A k ) < 1jkk5 2k . Define B(() == dist (�, Gk ) for E e A k .
(b) Suppose that (17.18) holds. Note that �; e A k implies that [ X; _ 1 , c Gk and show that E7- 1 /( �; )( x; - X; _ 1 ) < I + 2£ by splitting the sum according to which A k it is that �; lies in. k (c) Letk E< > denote summation over the i for which �; e Ak , and show that E< > ( x; - X; _ 1 ) 1 - E1 ., k E < I) ( x; - X; _ 1 ) > A ( Ak ) - k0 2 • Con clude that E7- 1 l(�; )( x; - X; _ 1 ) > k - 3£. Thus - 2£ > I
x,]
f dx
=
I dx
11< 0 I dx
(17.18) implies (17.19) for I = I Idx (with 3£ in place of £). 17. 14. f (a) Suppose that I is continuous on (0, 1] and ( 17 .20)
lim j11( x ) dx = I.
��o ,
Choose 1J so that t s 1J implies that I I - 1/1 dxl < £; define 80( ) over (0, 1] in such a way that 0 < x , y s 1 and lx - Y l < B0 ( x ) imply that ll( x ) - I( Y ) I < £, and put ·
B( x ) =
min{ � x , Bo ( x )} min{ 1 + 1 �( 0) I } 11 '
if O < X < 1 , if X
= 0.
From (17.18) deduce that � 1 = 0 and then that (17.19) holds with 3£ in place of £. (b) Let F be the function in Example 17.4, and put I = F' on [0, 1]. Show that I has a Kunweil integral even though it is not Lebesgue integrable. 17.15. 12.16 f Identifying w. Deduce from (12.19) that ( 17 .21 )
A2 ( A)
= 'ITo fJ rdrdfJ , 'IT B
where B is the set of ( r, 8 ) with (rcos 8, r sin 8 ) e A , and then that ( 17.22)
fJ f( x , y ) dx dy = � JJ R
2
f( rcos fJ , r sin fJ ) r dr dfJ .
r>O
0 < 9 < 2 tr
When w0 = has been established in Problem 18.24, this will give (17.15). 17.16. 16.25 f Let !R consist of the continuous functions on R1 with compact support. Show that !R is a vector lattice in the sense of Problem 11.11 and has the property that I e !R implies I 1\ 11 e !R (note that 1 E !R). Show that the a-field !F generated by !R is 9P • Suppose A is a positive linear functional on !R; show that A has the required continuity property if and only if ln (x) �O uniformly in x implies A ( /n ) � 0. Show under this ,
assumption on
A
{ 17 .23 )
that there is a measure p. on 911 such that A
( /)
=
ffdp. ,
f e !R.
Show that p. is a-finite and unique. This is a version of the Riesz representa
tion theorem .
17. 17.
f Let A ( /) be the Riemann integral of f, which does exist for f in !£'. Using the most elementary facts about Riemann integration, show that the p. determined by (17 .23) is Lebesgue measure. This gives still another way of constructing Lebesgue measure.
17.18. ↑ Extend the ideas in the preceding two problems to $R^k$.

SECTION 18.
PRODUCT MEASURE AND FUBINI'S THEOREM
Let ( X, !I) and ( Y, C!!J ) be measurable spaces . For given measures p. and v on these spaces, the problem is to construct on the cartesian product X X Y a product measure 'IT such that 'IT(A X B) = p.(A)v( B) for A c X and B c Y. In the case where p. and " are Lebesgue measure on the line, 'IT will be Lebesgue measure in the plane. The main result is Fubini 's theorem, according to which double integrals can be calculated as iterated integrals. Product Spaces
It is notationally convenient in this section to change from (Sl, �) to ( X, fi) and ( Y, C!!J ). In the product space X X Y a measurable rectangle is a product A X B for which A e fi and B e C!!J . The natural class of sets in X X Y to consider is the a-field fi X C!!J generated by the measurable rectangles. (Of course, fi X C!!J is not a cartesian product in the usual sense.) Suppose that X = Y = R 1 and f£= C!!J = !R 1 . Then a measurable rectangle is a cartesian product A X B in which A and B are linear Borel sets. The term rectangle has up to this point been reserved for cartesian products of intervals, and so a measurable rectangle is more general. As the measurable rectangles do include the ordinary ones and the latter generate !A 2 , !A 2 c !A 1 X 91 1 • On the other hand, i f A is an interval, [B: A X B e !A 2 ] contains R 1 (A X R 1 = U n ( A X ( - n, n ])e !A 2 ) and is closed under the formation of proper differences and countable unions; thus it is a a-field containing the intervals and hence the Borel sets. Therefore, if B is a Borel set, [ A : A X B e 91 2 ] contains the intervals and hence, being a Example 18. 1.
$\sigma$-field, contains the Borel sets. Thus all the measurable rectangles are in $\mathcal{R}^2$, and so $\mathcal{R}^1 \times \mathcal{R}^1 = \mathcal{R}^2$ consists exactly of the two-dimensional Borel sets.
As this example shows, fi X C!!J is in general much larger than the class of measurable rectangles.
Theorem 18.1. (i) If $E \in \mathcal{X} \times \mathcal{Y}$, then for each $x$ the set $[y\colon (x, y) \in E]$ lies in $\mathcal{Y}$ and for each $y$ the set $[x\colon (x, y) \in E]$ lies in $\mathcal{X}$.
(ii) If $f$ is measurable $\mathcal{X} \times \mathcal{Y}$, then for each fixed $x$ the function $f(x, y)$ with $y$ varying is measurable $\mathcal{Y}$, and for each fixed $y$ the function $f(x, y)$ with $x$ varying is measurable $\mathcal{X}$.

The set $[y\colon (x, y) \in E]$ is the section of $E$ determined by $x$, and $f(x, y)$ as a function of $y$ is the section of $f$ determined by $x$.
f(x, y)
x and consider the mapping Tx : Y --+ X X Y defined by TxY = (x, y). If E = A X B is a measurable rectangle, Tx 1E is B or 0 according as A contains x or not, and in either case Tx 1E e C!!J. By Theorem 13.1(i), Tx is measurable C!!Jj!IX C!!J . Hence [y: (x, y) e E] = Tx- 1E e C!!J for E e fiX C!!J. By Theorem 13.1(ii), if f is measurable fi X fl/91 1 , then fTx is measurable C!!f/Y't 1 • Hence f(x, y) = fTx ( Y ) as a function of y is measurable C!!J. The symmetric statements for fixed y are proved the PROOF.
Fix
-
-
•
same way.
Product Measure
Now suppose that ( X, !I, p.) and (Y, C!!J , ") are measure spaces, and suppose for the moment that p. and " are finite. By the theorem just proved Jl[y: (x, y) e E] is a well-defined function of x. If !l' is the class of E in fiX C!!J for which this function is measurable fi, it is not hard to show that !l' is a A-system. Since the function is IA(x)JI(B) for E = A X B, !l' contains the w-system consisting of the measurable rectangles. Hence !l' coincides with � X f!!/ by the 'IT-A theorem. It follows without difficulty that (18 . 1 )
'11' ' ( £ ) =
f: [y: ( x, y ) E E ) p(dx) ,
E e !I x C!!J ,
is a finite measure on fiX C!!J, and similarly for
(18 .2 )
'11' "( £ ) =
ft lx: ( x, y )
E
E ) P ( dy ) ,
E E fi X
C!!J .
For measurable rectangles,
w' ( A X B ) = w"( A X B ) = �t ( A ) · P ( B ) . The class of E in fiX C!!J for which w'(E) = w"( E) thus contains the measurable rectangles; since this class is a A-system, it contains :r X C!!J. The common value w'(E) = w"(E) is the product measure sought. To show that (18.1) and (18.2) also agree for a:.finite I' and ,, let { A m } and { Bn } be decompositions of X and Y into sets of finite measure, and put l' m (A) = �t(A n A m ) and �'n (B) = P( B n Bn > · Since P( B) = Lm �'m ( B), the integrand in (18.1) is measurable fi in the a-finite as well as in the finite case; hence w' is a well-defined measure on fi X C!!J and so is w". If w:,n and w:,:n are (18.1) and (18.2) for I' m and "n ' then by the finite case, already treated, w'( E ) = Lm nw:, n (E) = L m nw,::n ( E) = w"(E) . Thus (18.1) and (18.2) coincide in the a-finite case as well. Moreover, w'( A X B) = L m nl' m (A)Pn ( B) = #'(A)P( B).
(18.3 )
If ( X, !I, I') and ( Y, C!!J, ") are a-finite measure spaces, w( E ) = w'( E ) = w"(E) defines a a-finite measure on fi X C!!J; it is the only measure such that w(A X B) = �t(A) P(B) for measurable rectangles.
Theorem 18.2.
·
Only a-finiteness and uniqueness remain to be proved. The prod ucts Am X Bn for { A m } and { Bn } as above decompose X X Y into measurable rectangles of finite w-measure. This proves both a-finiteness and uniqueness, since the measurable rectangles form a w-system generating fi X f!!/ (Theorem 10.3). • PROOF.
The w thus defined is called product measure; it is usually denoted I' X " · Note that the integrands in (18.1) and (18.2) may be infinite for certain x which is one reason for introducing functions with infinite values. and Note also that (18.3) in some cases requires the conventions (15.2).
y,
Fubini's Theorem
Integrals with respect to $\pi$ are usually computed via the formulas

(18.4)    $\int_{X \times Y} f(x, y)\,\pi(d(x, y)) = \int_X \left[ \int_Y f(x, y)\,\nu(dy) \right] \mu(dx)$

and

(18.5)    $\int_{X \times Y} f(x, y)\,\pi(d(x, y)) = \int_Y \left[ \int_X f(x, y)\,\mu(dx) \right] \nu(dy)$.
The left side here is a double integral, and the right sides are iterated integrals. The formulas hold very generally, as the following argument shows. Consider (18.4). The inner integral on the right is
(18.6)    $\int_Y f(x, y)\,\nu(dy)$.
Because of Theorem 18.1(ii), for f measurable fi X CfY the integrand here is measurable d.Y; the question is whether the integral exists, whether (18.6) is measurable fi as a function of x , and whether it integrates to the left side of (18. 4). First consider nonnegative f. If f = I£, everything follows from Theorem 18.2: (18.6) is l'[ y : (x, y ) e E] and (18.4) reduces to 'IT ( E ) = 'IT '( E ). Because of linearity (Theorem 15.1(iv)), if f is a nonnegative simple function, then (18.6) is a linear combination of functions measurable fi and hence is itself measurable f£; further application of linearity to the two sides of (18.4) shows that (18.4) again holds. The general nonnegative f is the monotone limit of nonnegative simple functions; applying the monotone convergence theorem to (18.6) and then to each side of (18.4) shows that again f has the properties required. Thus for nonnegative f, (18.6) is a well-defined function of x (the value oo is not excluded), measurable fi , whose integral satisfies (18.4). If one side of (18.4) is infinite, so is the other; if both are finite, they have the same finite value. Now suppose that f, not necessarily nonnegative, is integrable with respect to 'IT. Then the two sides of (18.4) are finite if f is replaced by 1/1 . Now make the further assumption that
(18.7)    $\int_Y |f(x, y)|\,\nu(dy) < \infty$

for all $x$. Then
/j ( x , y )v (dy ) /j+ ( x , y )v ( dy) - J;- ( x , y)v( dy ) . =
functions on the right here are measurable fi and (since /+ , /- < 1/1 ) integrable with respect to p., and so the same is true of the function on the left. Integrating out the x and applying (18.4) to j+ and to f- gives (18.4) for I itself. The
The set $A_0$ of $x$ satisfying (18.7) need not coincide with $X$, but $\mu(X - A_0) = 0$ if $f$ is integrable with respect to $\pi$ because the function in (18.7) integrates to $\int |f|\,d\pi$ (Theorem 15.2(iii)). Now (18.8) holds on $A_0$, (18.6) is measurable $\mathcal{X}$ on $A_0$, and (18.4) again follows if the inner integral on the right is given some arbitrary constant value on $X - A_0$. The same analysis applies to (18.5):

Theorem 18.3. Under the hypotheses of Theorem 18.2, for nonnegative $f$ the functions

(18.9)    $\int_Y f(x, y)\,\nu(dy)$,    $\int_X f(x, y)\,\mu(dx)$
are measurable !I and C!!l, respectively, and (18.4) and (18.5) hold. Iff (not necessarily nonnegative) is integrable with respect to 'IT, then the two functions (1 8.9) are finite and measurable on A0 and on B0, respectively, where p. ( X - A0) = 1'( Y - B0) = 0, and again (18.4) and (18.5) hold. It is understood here that the inner integrals on the right in (18.4) and (1 8.5) are set equal to 0 (say) outside A0 and B0. t This is Fubini 's theorem; the part concerning nonnegative f is sometimes called Tonelli ' s theorem. Application of the theorem usually follows a two-step procedure that parallels its proof. First, one of the iterated in tegrals is computed (or estimated above) with 1/1 in place of f. If the result is finite, then the double integral (integral with respect to 'IT ) of 1/1 must be finite, so that f is integrable with respect to 'IT; then the value of the double integral of f is found by computing one of the iterated integrals of f. If the iterated integral of 1/1 is infinite, f is not integrable 'IT. Let I = f �«Je - x2 dx. By Fubini' s theorem applied in the plane and by the polar-coordinate formula (see (17.15)),
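Before the worked examples, here is a small numerical sketch of the two-step procedure on a purely finite product space; it is not from the text, and the weights and the function $f$ are arbitrary illustrative choices.

import numpy as np

# For measures with finitely many atoms, the double integral against pi = mu x nu
# and both iterated integrals of (18.4)-(18.5) coincide.
rng = np.random.default_rng(1)
mu = rng.random(5)            # masses of mu on X = {0,...,4}
nu = rng.random(7)            # masses of nu on Y = {0,...,6}
f = rng.standard_normal((5, 7))

double = np.sum(f * np.outer(mu, nu))             # integral against mu x nu
iter_xy = np.sum(mu * np.sum(f * nu, axis=1))     # integrate out y first, then x
iter_yx = np.sum(nu * np.sum(f.T * mu, axis=1))   # integrate out x first, then y
print(double, iter_xy, iter_yx)                   # all three agree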
Example 18.2. Let $I = \int_{-\infty}^{\infty} e^{-x^2}\,dx$. By Fubini's theorem applied in the plane and by the polar-coordinate formula (see (17.15)),

$I^2 = \iint_{R^2} e^{-(x^2 + y^2)}\,dx\,dy = \iint_{r>0,\ 0<\theta<2\pi} e^{-r^2} r\,dr\,d\theta$.
two functions that are equal almost everywhere have the same integral, the theory of integration could be extended to functions that are only defined almost everywhere; then A 0 and B0 would disappear from Theorem 1 8. 3 .
SECTION
1 8.
PRODUCT MEASURE AND FUBINI ' S THEOREM
239
another application of Fubini' s theorem, which leads to the famous formula
(18 .10) As
the integrand in this example is nonnegative, the question of integrability • does not arise.
E
ple
It is possible by means of Fubini' s theorem to identify the limit in (17 .4). First, xam
18.3.
l0 e - uxsin x dx = 1 +1 u [1 - e - u ( u sin 2
'
'
as follows by differentiation with respect to
t
+ cos t ) ]
.
t. Since
Fubini's theorem applies to the integration of e - u x sin x over (0, t)
11 sin x dx = 11sm. x [ l ooe - ux du ] dx 0 x
0
0
oo 1 = o
X (0, oo ) :
du 1+u 2
oo e - ul ( u sin t + cos t ) du. 1 o
1+u
2
next-to-last integral is 'IT/2 (see (17.10)), and a change of variable ut == s shows that the final integral goes to 0 as t --+ oo . Therefore,
The
( 18.11 ) lntearation by Parts
. 'IT 1 sm x 1 lim 0 dx = -2 . X 1 -.
oo
•
Let F and G be two nondecreasing, right-continuous functions on an . lllterval [a, b ], and let p. and " be the corresponding measures:
l&( x , y]
=
F ( y) - F(x ) ,
Jl( x, y) = G ( y) - G ( x ) ,
a < x < y < b.
INTEGRATION
dF(x) and dG(x) in place
In accordance with the convention (17.16), write of p.( dx) and P ( dx ).
If F and G have no common points of discontinuity in (a, b ],
Theorem 18.4.
then
( 1 8 .12)
j
(a, b)
G ( x ) dF( x )
=
F ( b ) G ( b ) - F( a ) G ( a )
In brief: f GdF = integration formula.
-f
F( x ) dG ( x ) .
(a, b]
&FG - f FdG. This is one version of the partial
Note first that replacing F(x) by F(x) - C leaves (18.12) un changed-it merely adds and subtracts CP(a, b) on the right. Hence (take C = F(a)) it is no restriction to assume that F(x) = p.(a, x] and no restriction to assume that G(x) = P(a, x] . If , = p. X , is product measure in the plane, then by Fubini's theorem, PROOF.
(18 .13)
'1T [(x, y) : a < y � x � b ] =
j
(a, b ]
v( a, x) �&(dx)
j
=
(a, b ]
G (x) dF( x )
and (18 .14)
'1T [(x. y) : a < x � y < b] =
j
�&( a, y) v( dy )
(a, b 1
=
j
F(y) dG ( y ) .
(a, b ]
The two sets on the left have as their union the square S = The diagonal of S has 'IT-measure
'1T [ ( x, y ) : a < x y < b] =
=
j
(a, b ]
(a, b] X (a, b ].
v { x } �& ( dx ) 0 =
because of the assumption that p. and , share no points of positive measure. Thus the left sides of (18.13) and (18.14) add to 'IT( S) = p.(a, b]P(a, b] =
F(b)G(b).
•
g
Suppose that " has a density with respect to Lebesgue measure and let Transform the right side of (18.12) by the formula c + (1 6.13) for integration with respect to a density; the result is
G(x ) =
J:g(t) dt.
(18 .15 )
J
(a, b 1
G( x ) dF(x )
= F(b)G(b) - F( a )G( a ) - j6F( x ) g ( x ) dx . a
A consideration of positive and negative parts shows that this holds for any g integrable over Suppose further that p. has a density I with respect to Lebesgue measure ' c + and let Then (18.15) further reduces to
(a, b ). F( x ) = J:l(t) dt. (18. 16 ) j6G( x ) f ( x ) dx = F(b )G(b) - F( a )G( a ) - j6F(x ) g (x) dx . a
a
Again, I can be any integrable function. This is the classical formula for integration by parts. Under the appropriate integrability conditions, can be replaced by an unbounded interval.
(a, b ]
Products of Higher Order
Suppose that ( X, !£, p. ), ( Y, �, "), and ( Z, .11' , "1) are three a-finite measure spaces. In the usual way, identify the products X X Y X Z and ( X X Y) X Z. Let !£X � X .11' be the a-field in X X Y X Z generated by the A X B X C with A, B, C in !£, �, .11' , respectively. For C in Z, let ric be the class of E e !I X � for which E X C e !£X � X .11' . Then ric is a a-field contain ing the measurable rectangles in X X Y, and so ric !£X �- Therefore, (� X '!?I ) X .11' c !£ X � X 11'. But the reverse relation is obvious, and so (.!'" X '!?/) X .11' !£ X � X .11' . Define the product p. X " X "1 on !£X � X .11' as ( p. X ") X "1. It gives to p.(A)11(B) 1J (C), and it is A X B X C the value (p. X 11) (A X B) · 1J (C) the only measure that does. The formulas (18.4) and (18.5) extend in the obvious way. Products of four or more components can clearly be treated in the same way. This leads in particular to another construction of Lebesgue measure k in R R 1 x · · · X R 1 (see Example 18.1) as the product A X · · · XA ( k
=
=
=
=
factors) on !Ji k = !1/ 1 x · · · X91 1 . Fubini' s theorem of course gives a practi cal way to calculate volumes: Let Vk be the volume of the sphere of radius 1 in R k ; by Theorem 12.2, a sphere in R k with radius r has volume r k Vk . Let A be the unit sphere in R k , let B = [(x 1 , x 2 ): xf + xi < 1], and let C(x 1 , x 2 ) = [(x 3 , , x k ): E�_ 3 xl < 1 - xf - xi]. By Fubini' s theorem, Example 18.4.
•
•
•
=
vk - 2
JJ (t - P2 ) < k - 2>f2 p dp dO
O < ll < 2w O
If V0 is taken as 1, this holds for k = 2 as well as for k > 3. Since V1 = follows by induction that
2(2w ) ;- t v2 ; - 1 = --------. 5 --. .� . ( 2i - 1- -) ' 1 .3� for i =
2, it
); v2 ; = -. (2w 2 4 . . . -( 2i ) ---------
1, 2, . . . .
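The recursion of Example 18.4 is easy to check against the closed form of Problem 18.22(d). The sketch below is not from the text and simply compares the two numerically.

import math

# V_k = V_{k-2} * 2*pi/k with V_0 = 1, V_1 = 2, versus pi**(k/2) / Gamma(k/2 + 1).
V = {0: 1.0, 1: 2.0}
for k in range(2, 11):
    V[k] = V[k - 2] * 2 * math.pi / k
for k in range(1, 11):
    closed = math.pi ** (k / 2) / math.gamma(k / 2 + 1)
    print(k, V[k], closed)          # the two columns agree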
• PROBLEMS
18.1. 18.2.
18.3. 18.4.
Show by Theorem 1 8.1 that if A E !I and B E tty_
A X B is nonempty and lies in !I X
tty , then
Suppose that X == Y is uncountable and !I == tty consists of the countable and the co-countable sets. Show that the diagonal E == [( x y ) : x == y ] does not lie in !I X tty , even though [y: ( x, y) E E] E t!JI and [ x : ( x, y) E E ) E !I for all x and y. 10.6 1 8.1 f Let ( X, !I, p. ) == ( Y, tty , v ) be the completion of ( R1 , 911 , A ) . Show that ( X X Y, !I X tty , p. X v) is not complete. 2.8 f
,
The assumption of a-finiteness in Theorem 1 8.2 is essential: Let p. be Lebesgue measure on the line, let v be counting measure on the line, and take E == [( x, y): x == y]. Then (1 8.1) and (1 8.2) do not agree.
SECTION
18.5.
18.
PRODUCT MEASURE AND FUBINI ' S THEOREM
243
Suppose Y is the real line and ., is Lebesgue measure. Suppose that, for each x in X, l(x, y) has with respect to y a derivative l '(x, y). Formally,
J' [ /Xf ' ( x . � ) p.( dx ) ] d� ... jX[ J( x , y ) - f( x , a ) ] p. ( dx ) a
and hence fxl(x, Y)l'( dx) has derivative fxl '(x, y)p.( dx). Use this idea to prove Theorem 16.8(ii).
18.6.
Suppose that I is nonnegative on a a-finite measure space ( n, !F, p. ) . Show that
Prove that the set on the right is measurable. This gives the " area under the curve." Given the existence of I' X A on n X R1 , one can use the right side of this equation as an alternative definition of the integral. 18.7.
Reconsider Problem 12.13.
18.8.
Suppose that v [y: (x , y) e E) == v[y: (x, y) e F) for all x and show that (I' X ., )( E) == (I' X ., )( F). This is a general version of Cavalieri ' s principle.
18.9.
(a) Suppose that p. is a-finite and prove the corollary to Theorem 16.7 by
18.10.
Fubini's theorem in the product of ( 0 , !F, p.) and {1 , 2, . . . } with counting measure. (b) Relate the series in Problem 17.9 to Fubini's theorem.
(a) Let p. == ., be counting measure on X == Y == { 1 , 2, . . . }. If
l( x, y )
==
2 - 2-x if X == y, - 2 + 2 - x if X == y + 1 , 0 otherwise,
then the iterated integrals exist but are unequal. Why does this not con tradict Fubini's theorem? (b) Show that xyj(x 2 + y 2 ) 2 is not integrable over the square [(x, y): I x I , IY I < 1 ] even though the iterated integrals exist and are equal. 18.11.
For an integrable function I on (0, 1 ], consider a Riemann sum in the form R == E}- 1 l(x1 )( x1 - x1 _ 1 ), where 0 == x0 < · · · < xn == 1 . Extend I to (0, 2] by setting I< x) == I< x - 1) for 1 < x < 2, and define n
R ( t) == E J- 1
I( x1 + t ) ( x1 - x1 1 ) . _
For 0 < t < 1 , the points x1 + t reduced modulo 1 give a partition of (0, 1], and R ( t) is essentially the corresponding Riemann sum, differing from it by at most three terms. If maxJ ( x1 - x 1 ) is small, then R ( t) is for most � values of t a good approximation to Jol(x) dx, even though R = R (O) may _
be a poor one. Show in fact that 1 ( 1 8 . 1 7) R ( t ) - t( x) dx
fo
f
... 10
n
1
L
j == 1
n
< L
1
x
1
XJ - l
[ /( x1 + t) - /( x + t ) ) dx dt
1 1 10 1f( x1 + t} - f( x 1
x
j-1
dt
I
x,
+
t )l dt dx.
Given £ choose a continuous g such that /J ll(x) - g(x) l dx < £ 2/3 and then choose B so that l x - Yl < B implies that l g(x) - g(y ) l < £ 2/3. Show that max(x1 - x1 _ 1 ) < B implies that the first integral in (1 8.17) is at most £ 2 and hence that I R ( t) - /JI(x) dx l < £ on a set of t (in (0, 1]) of Lebesgue measure exceeding 1 - £ .
18. 12. Exhibit a case in which (1 8.12) fails because F and G have a common point of discontinuity.
18.13. Prove (1 8.16) for the case in which all the functions are continuous by differentiating with respect to the upper limit of integration.
18.14. Prove for distribution functions F that /�00 ( F(x + c) - F(x)) dx = c.
t. 18.16. Suppose that a number In is defined for each n > n0 and put F(x) = E n o s n s x ln · Deduce from (1 8.1 5) that 18.15. Prove for continuous distribution functions that /� 00 F( x) dF( x )
L
{ 1 8 . 1 8)
G ( n )/,
n0 � n � x
=
F( x ) G ( x ) -
=
{n oF( t ) g( t ) dt
f1g( t) dt, which will hold if G has continuous derivative g. First assume that the In are nonnegative. 18. 17. f Take n0 1 , In 1 , and G(x) 1/x and derive E n s x n - 1 log x + y + 0( 1 /x ) , where y 1 - /f (t - [t]) t - 2 dt is Euler's constant. if
G(y) - G(x) =
=
=
=
=
==
18.18. 5.17 1 8.16 f such that ( 1 8 .19)
18.19. If A k
=
Use (1 8.1 8) and (5.47) to prove that there exists a constant
L .! p p �x E�_ 1 a; and Bk
=
==
log log x
E�_ 1 b; , then
n
L a k Bk
k-1
+c+
c
o( ) · 1 10g x
n
==
An Bn + L A k - l bk . k-2
This is Abel's partial summation formula. Derive it by arguing proof of Theorem 1 8.4.
as
in the
18. 20. (a) Let In == j012 (sin x)" dx . Show by partial integration that In
1)( 1n _ 2 - In ) and hence that In == ( n - 1) n - 1 1n _ 2 •
245 ==
(n -
(b) Show by induction that 1 r;, + 1 From
(17.17) deduce that 2 w - == 8
...
( - 1 ) " ( 2n + 1 )
{ -} ).
1 1 1 1 + -2 + -2 + -2 + · · · 3
5
7
1 22
1 32
1 42
and hence that
2 w 6
== 1 + - + - + - + · · ·
(c) Show by induction that
2 n - 1 2n - 3 . . . 3 1 'IT 2 n 2n - 2 . . . 4 2 == n 1 12 n == 2n 2n - 2 1 2 + 4 22' 2n + 1 2n - 1 S3· From /2 n I > 12 n > 12 , + 1 deduce that 'IT 22 · 4 2 · · · ( 2n ) 2 2n w < < 2 2n + 1 3 2 . 5 2 . . . ( 2n - 1 ) 2( 2n + 1 ) 2 ' and from this derive Wallis ' s formula, ,
18.21.
4466 - 1 3 3 -5 5 7
'IT 22 == - -
{ 18.20)
2
•
•
•
Stirling ' s formula. The area under the curve y == log x between the abscis sas 1 and n is An == n log n - n + 1 . The area under the corresponding inscribed polygon is Bn == log n ! - ! log n. y=
log
log
X
2
0
1
2
3
4
(a) Let Sk be the sliver-shaped region bounded by curve and polygon
between the abscissas k and k + 1, and let Tk be this region translated so that its right-hand vertex is at (2, log 2). Show that the Tk do not overlap and that T, u T, + 1 u · · · is contained in a triangle of area t log(1 + 1/n ). Conclude that log n ! =
{ 18.21 ) where
(b)
( n �) +
log n
- n + c + an ,
c is 1 minus the area of T1 u T2 u · · · and 1 1 1 0 < an < 2 log 1 + n < n .
)
(
By Wallis's formula (18.20) show that log
{f
Substitute
=
�
[
2 n log 2 +
2logn! - log(2n)! - � log( 2n + 1 ) ] .
(1 8.21) and show that c == log J2w . This gives Stirling's formula
18.22. Euler's gamma function is defined for positive t by f( t) == f0x ' - 1e - x dx. (a) Prove that r
(b) Show by partial integration that f( t + 1) == tf( t) and hence that f( n + 1) == n ! for integral n. (c) From (1 8.10) deduce f( ! ) == {; .
(d)
Show that the unit sphere in
Rk has volume (see Example 18.4)
( 18.22) 18.13. By partial integration prove that /0 (sin x )/x) 2 dx == wj2 and /� 00 (1 cos x ) x- 2 dx ==
18.24. 17.15 f and
w. Identifying
0 s cp <
If A8 is the set of ( rcos cp, r sin cp ) satisfying r < fJ, then by (17.21), 'If.
( 18.23 ) But by Fubini's theorem, if 0
{ 18.24)
< fJ < w/2, then
1 12 sin : x + 2 1 005' x 1 { ) dx 1 COS dx 1cosl
A 2 ( A ,) == 0 ==
11
1 . 2 ) 1 /2 dx. 1 x sm fJ cos fJ + ( 2 cosfJ
1
247 1 9 . HAUSDORFF MEASURE Equate the right sides of (18.23) and (18.24), differentiate with respect to 8, and conclude that w0 == w. Thus the factor w0/w can be dropped in (17.22), which gives the formula (17.15) for integrating in polar coordinates. 18.15. Suppose that 1-' is a probability measure on ( X, .t'") and that, for each x in X, "x is a probability measure on ( Y, '!V). Suppose further that, for each B in t!JI, "x ( B) is, as a function of x, measurable .t'". Regard the /-'( A ) as initial probabilities and the �'x ( B) as transition probabilities. (a) Show that, if E e .t'" X '!V, then �'x [y: ( x, y) e E] is measurable .t'". (b) Show that w( E) == Ix "x [ y: ( x, y) e E]/-'( dx) defines a probability measure on .t'"X '!V. If "x == v does not depend on x, this is just (18.1). (c) Show that if I is measurable !£ X 'lV and nonnegative, then I I( x, y ) vx ( dy) is measurable .t'". Show further that SECTION
y
which extenas Fubini's theorem (in the probability case). Consider also I 's that may be negative. (d) Let v(B) == lx �'x ( B)/-'(dx). Show that w( X X B) == v(B) and
SECTION 19. HAUSDORFF MEASURE*

*This section should be omitted on a first reading.

The theory of Lebesgue measure gives the area of a figure in the plane and the volume of a solid in 3-space, but how can one define the area of a curved surface in 3-space? This section is devoted to such geometric questions.

The Definition

Suppose that $A$ is a set in $R^k$. For positive $m$ and $\epsilon$ put

(19.1)    $h_{m,\epsilon}(A) = c_m \inf \sum_n (\operatorname{diam} B_n)^m$,

where the infimum extends over countable coverings of $A$ by sets $B_n$ with diameters $\operatorname{diam} B_n = \sup[\,|x - y|\colon x, y \in B_n\,]$ less than $\epsilon$. Here $c_m$ is a positive constant to be assigned later. As $\epsilon$ decreases, the infimum extends over smaller classes, and so $h_{m,\epsilon}(A)$ does not decrease. Thus $h_{m,\epsilon}(A)$ has a
limit $h_m(A)$, finite or infinite:

(19.2)    $h_m(A) = \lim_{\epsilon \downarrow 0} h_{m,\epsilon}(A)$.
Example 19.1. Suppose that m = k = 1. If a n = inf Bn and bn = sup Bn , then [a n , bn ] and Bn have the same diameter bn - a n . Thus (19.1) is in this
case unchanged if the Bn are taken to be intervals. Further, any sum Ln diam Bn remains unchanged if each interval Bn is split into subintervals of length less than £. Thus h 1 , ,(A) does not depend on £, and h 1 ( A ) = h 1 , ,(A) is for A c R 1 the ordinary outer Lebesgue measure A*( A) of A , provided that c 1 is taken to be 1, as it will. A planar segment S, = [(x, t): 0 < x < /], as an isometric copy of the interval [0, /], satisfies h 1 (S,) = I. This is a first indication that Hausdorff measure has the correct geometric properties. Since the segments S, are disjoint, one-dimensional Hausdorff measure in the plane is not o-finite. •
Example 19. 2. Consider the segments S0 and S, of the preceding exam ple with I = 1 and 0 < t < 1. Now h 1 (S0 U S, ) = 2, and in fact h 1 , ((S0 U S,) = 2 if £ < t. This shows the role of £: If h m ( A) were defined, not by ( 19 . 2), but as the infimum in (19.1) without the requirement diam Bn < then S0 U S, would have the wrong one-dimensional measure-diam ( S0 U • S,) = (1 + t 2 ) 1 12 instead of 2. £,
The Normalizing Constant

In the definition, $m$ can be any positive number, but suppose from now on that it is an integer between 1 and the dimension of the space $R^k$: $m = 1, \ldots, k$. The problem is to choose $c_m$ so as to satisfy the geometric intention of the definition. Let $V_m$ be the volume of a sphere of radius 1 in $R^m$; the value of $V_m$ is explicitly calculated in Example 18.4. And now take the $c_m$ in (19.1) as the volume of a sphere of diameter 1:
(19 .3 ) A sphere in R m of diameter d has volume em d m . It will be proved below that for this choice of em the k-dimensional Hausdorff and Lebesgue measures agree for k-dimensional Borel sets:
fJI/ m and B is a set in R k that is an isometric copy of A , it will theorem that h m ( B) = h m (A) = A m (A). The unit cube C in R k can be covered by n k cubes of side n - 1 and diameter k 1 1 2 n - 1 ; taking n > k 1 1 2£ - 1 shows that h k . f( C ) < 1 2 1 k 2 clc n k (k 1 n - ) = ek k k / ' and so h k (C) < oo If C c U n Bn and diam Bn = d,, enclose Bn in a sphere Sn of radius dn . Then C c U n Sn , and so 1 == A k (C) < Ln A k (Sn ) = Ln Vk d! ; hence h k (C) > ek/ Vk . Thus h k (C) is finite and positive. Put h k (C) = K, where again C is the unit cube in R k . Then of course If A e follow by this
·
( 1 9. 4 ) for fJ >
C. Consider the linear transformation carrying x to fJx, where 0. By Theorem 12.2, A k (fJA) = fJ k A k (A) for A e gJ k . Since diam fJA =
A =
B" diam
A , the definition (19.2) gives
(1 9. 5) Thus (19.4) holds for all cubes and hence by additivity holds for rectangles whose vertices have rational coordinates. The latter sets form a w-system generating fJit k , and so (19.4) holds for all A e gJ k , by Theorem 10.3. Thus Theorem 19.1 will follow once it has been shown that the h k-mea sure of the unit cube in R k , the K in (19.4), is 1. To establish this requires
INTEGRATION
two lemmas connecting measure with geometry. The first is a version of the
Vitali covering theorem.
Lemma 1. Suppose that G is a bounded open set in R^k and that ε > 0. Then there exists in G a disjoint sequence S_1, S_2, ... of closed spheres such that λ_k(G − ∪_n S_n) = 0 and 0 < diam S_n < ε.

PROOF. Let S_1 be any closed sphere in G satisfying 0 < diam S_1 < ε. Suppose that S_1, ..., S_n have already been constructed. Let 𝓢_n be the class of closed spheres in G that have diameter less than ε and do not meet S_1 ∪ ··· ∪ S_n. Let s_n be the supremum of the radii of the spheres in 𝓢_n, and let S_{n+1} be an element of 𝓢_n whose radius r_{n+1} exceeds s_n/2. Since the S_n are disjoint subsets of G, Σ_n V_k r_n^k = Σ_n λ_k(S_n) ≤ λ_k(G) < ∞, and hence

(19.6)    r_n → 0.

Put B = G − ∪_n S_n. It is to be shown that λ_k(B) = 0. Assume the opposite. Let S′_n be the sphere having the same center as S_n and having radius 4r_n. Then Σ_n λ_k(S′_n) = 4^k Σ_n λ_k(S_n) ≤ 4^k λ_k(G). Choose N so that Σ_{n>N} λ_k(S′_n) < λ_k(B). Then there is an x_0 such that

(19.7)    x_0 ∈ B,    x_0 ∉ ∪_{n>N} S′_n.

As the S_n are closed, there is about x_0 a closed sphere S with radius r small enough that

(19.8)    S ⊂ G,    diam S < ε,    S ∩ (S_1 ∪ ··· ∪ S_N) = ∅.

If S does not meet S_1 ∪ ··· ∪ S_n, then r ≤ s_n < 2r_{n+1}; this cannot hold for all n because of (19.6). Thus S meets some S_n; let S_{n_0} be the first of these:

(19.9)    S ∩ S_{n_0} ≠ ∅,    S ∩ (S_1 ∪ ··· ∪ S_{n_0−1}) = ∅.

Because of (19.8), n_0 > N, and by (19.7), x_0 ∉ S′_{n_0}. But since x_0 ∉ S′_{n_0} and S ∩ S_{n_0} ≠ ∅, S has radius r ≥ 3r_{n_0} > (3/2) s_{n_0−1}. By (19.9), r ≤ s_{n_0−1}. Thus s_{n_0−1} ≥ r > (3/2) s_{n_0−1}, a contradiction. ∎

Lemma 2. If A is a bounded set in 𝓡^k, then λ_k(A) ≤ c_k (diam A)^k. In other words, the volume of a set cannot exceed that of a sphere with the same diameter.
PROOF. Represent the general point of R^k as (t, y), where t ∈ R^1, y ∈ R^{k−1}. Let A_y = [t: (t, y) ∈ A]. Since A is bounded, each A_y is bounded and A_y = ∅ for y outside some bounded set. By the results of Section 18, A_y ∈ 𝓡^1 and λ(A_y) is measurable 𝓡^{k−1}. Let I_y be the open (perhaps empty) interval (−λ(A_y)/2, +λ(A_y)/2), and set S_1A = ∪_y [(t, y): t ∈ I_y]. If 0 ≤ f_n(y) ↑ λ(A_y)/2 and f_n is simple, then [(t, y): |t| < f_n(y)] is in 𝓡^k and increases to S_1A; thus S_1A ∈ 𝓡^k. By Fubini's theorem, λ_k(A) = λ_k(S_1A). The passage from A to S_1A is Steiner symmetrization with respect to the hyperplane H_1 = [x ∈ R^k: x_1 = 0].

Let J_y be the closed interval from inf A_y to sup A_y; it has length λ(J_y) ≥ λ(A_y). Let m_y be the midpoint of J_y. If s ∈ I_y and s′ ∈ I_{y′}, then |s − s′| ≤ ½λ(A_y) + ½λ(A_{y′}) ≤ |m_y − m_{y′}| + ½λ(J_y) + ½λ(J_{y′}), and this last sum is |t − t′| for appropriate endpoints t of J_y and t′ of J_{y′}. Thus for any points (s, y) and (s′, y′) of S_1A, there are points (t, y) and (t′, y′) that are at least as far apart and are limits of points in A. Hence diam S_1A ≤ diam A.

For each i there is a Steiner symmetrization S_i with respect to the hyperplane H_i = [x: x_i = 0]; λ_k(S_iA) = λ_k(A) and diam S_iA ≤ diam A. It is obvious by the construction that S_iA is symmetric with respect to H_i: if x′ is x with its ith coordinate reversed in sign, then x and x′ both lie in S_iA or neither does. If A is symmetric with respect to H_i (i > 1) and y′ is y with its (i − 1)st coordinate reversed in sign, then, in the notation above, A_y = A_{y′}, and so S_1A is symmetric with respect to H_i. More generally, if A is symmetric with respect to H_i and j ≠ i, then S_jA is also symmetric with respect to H_i.

Pass from A to SA = S_k ··· S_1A. This is central symmetrization. Now λ_k(SA) = λ_k(A), diam SA ≤ diam A, and SA is symmetric with respect to
each H_i, so that x ∈ SA implies −x ∈ SA. If |x| > ½ diam A, then |x − (−x)| > diam A ≥ diam SA, so that SA cannot contain both x and −x. Thus SA is contained in the closed sphere with center 0 and diameter diam A, and hence λ_k(A) = λ_k(SA) ≤ c_k (diam A)^k. ∎

PROOF OF THEOREM 19.1. It is only necessary to show that K = 1 in (19.4). By Lemma 1 the interior of the unit cube C in R^k contains disjoint spheres S_1, S_2, ... such that diam S_n < ε and h_{k,ε}(C − ∪_n S_n) ≤ h_k(C − ∪_n S_n) = K λ_k(C − ∪_n S_n) = 0. By (19.3), h_{k,ε}(C) ≤ h_{k,ε}(∪_n S_n) + h_{k,ε}(C − ∪_n S_n) = h_{k,ε}(∪_n S_n) ≤ c_k Σ_n (diam S_n)^k = Σ_n λ_k(S_n) ≤ λ_k(C) = 1. Hence h_k(C) ≤ 1. If C ⊂ ∪_n B_n, then by Lemma 2, 1 ≤ Σ_n λ_k(B_n) ≤ Σ_n c_k (diam B_n)^k, and hence h_k(C) ≥ 1. ∎
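As a quick plausibility check of Lemma 2 (an addition, not part of the text), the short Python sketch below compares the area of an equilateral triangle in the plane with c_2 (diam A)^2, where c_2 = π/4 is the area of a disc of diameter 1 as in (19.3); the particular set and numbers are illustrative only.

import math

c2 = math.pi / 4                   # c_2 from (19.3): area of a disc of diameter 1
s = 1.0                            # side of an equilateral triangle A
area = math.sqrt(3) / 4 * s ** 2   # lambda_2(A)
diam = s                           # the diameter of A is its side length
print(area <= c2 * diam ** 2)      # True: 0.433... <= 0.785...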
Change of Variable
If x_1, ..., x_m are points in R^k, let P(x_1, ..., x_m) be the parallelepiped [Σ_{i=1}^m a_i x_i: 0 ≤ a_1, ..., a_m ≤ 1]. Suppose that D is a k × m matrix, and let x_1, ..., x_m be its columns, viewed as points of R^k; define |D| = h_m(P(x_1, ..., x_m)). If C is the unit cube in R^m and D is viewed as a linear map from R^m to R^k, then P(x_1, ..., x_m) = DC, so that

(19.10)    |D| = h_m(DC).

In the case m = k it follows by Theorem 12.2 that

(19.11)    |D| = |det D|,    m = k.
Theorem 12.2 has an extension:

Theorem 19.2. Suppose that D is univalent.† If A ∈ 𝓡^m, then DA ∈ 𝓡^k and

(19.12)    h_m(DA) = |D| λ_m(A).

PROOF. Since D is univalent, D ∪_n A_n = ∪_n DA_n and D(R^m − A) = DR^m − DA; since DR^m is closed, it is in 𝓡^k. Thus [A: DA ∈ 𝓡^k] is a σ-field. Since D is continuous, DA is compact if A is. By Theorem 13.1(i), D is measurable 𝓡^m/𝓡^k. Each side of (19.12) is a measure as A varies over 𝓡^m. They agree by definition if A is the unit cube in R^m, and so by (19.5) they agree if A is any cube. By the usual argument they agree for all A in 𝓡^m. ∎

† That is, one-to-one: Du = 0 implies that u = 0. In this case m ≤ k.
If D is not univalent, A ∈ 𝓡^m need not imply that DA ∈ 𝓡^k, but (19.12) still holds because each side vanishes.

The Jacobian formula (17.12) for change of variable also generalizes. Suppose that V is an open set in R^m and T is a continuous, one-to-one mapping of V onto the set TV in R^k, where m ≤ k. Let t_i(x) be the ith coordinate of Tx, and suppose that t_{ij}(x) = ∂t_i(x)/∂x_j is continuous on V. Put

(19.13)    D_x = [t_{ij}(x)],

the k × m matrix whose entry in row i and column j is t_{ij}(x). Then |D_x| as defined by (19.10) plays the role of the modulus of the Jacobian (17.11) and by (19.11) coincides with it in case m = k.
Suppose that T is a continuous, one-to-one map of an open set V in R m into R k . If T has continuous partial derivatives and A c V, then
Theorem 19.3.
(19.14 ) and (19.15 ) By Theorem by
A m.
19.1, h", on the left in (19.14) and (19.15) can be replaced
This generalizes Theorem 17 .2. Several explanations are required. Since V is open, it is a countable union of compact sets Kn . If A is closed, T( A n Kn ) is compact because T is continuous, and so T( A n V) = U n T( A n Kn ) E £Il k . In particular, TV E £Il k , and it follows by the usual argument that TA e £Il k for Borel subsets A of V. During the course of the proof of the theorem it will be seen that IDx l is continuous in x. Thus the formulas (19.14) and (19.15) make sense if A and I are measurable. As usual, (19.15) holds if f is nonnegative and also if the two sides are finite when f is replaced by 1 /1 . And (19.15) follows from (1 9.14) by the standard argument starting with indicators. Hence only (1 9.14) need be proved. To see why (19.14) ought to hold, imagine that A is a rectangle split into many small rectangles A n ; choose x n in A n and approximate the integral in (1 9.1 4) by E I Dx l h m (A n ) · For y in A n , Ty has the linear approximation "
Tx n + Dx ,( Y - x n ), and so TA n is approximated by Tx n + Dx,(A n - x n ), an m-dimensional parallelepiped for which h m is I Dx, l h m (A n - x n ) = I Dx , l h m (A n ) by (19.12). Hence EI Dx, l h m (A n ) Eh m ( TA n ) = h m ( TA) . The first step in a proof based on these ideas is to split the region of integration into small sets (not necessarily rectangles) on each of which the Dx are all approximately equal to a linear transformation that in tum gives a good local approximation to T. Let V0 be the set of x in V for which Dx is univalent. ==
If (J > 1, then there exists a countable mdecomposition of V0 into Borel sets B1 such that, for some linear maps M1 R --+ R k , Lemma 3.
:
and Choose £ and 80 in such a way that fJ - 1 + £ < 80 1 < 1 < 80 < fJ - £. Let Jl be the set of k X m matrices with rational coefficients. For M in Jl and p a positive integer let E (M, p) be the set of x in V such that
PROOF.
(19.18) and (19.19)
y E V. Imposing these conditions for countable, dense sets of v ' s and y ' s shows that E( M, p) is a Borel set. The essential step is to show that the E (M, p) cover V0• If Dx is univalent, then 80 = inf[ I Dxv l : I vi = 1] is positive. Take 8 so that 8 < ( 80 - 1)80 and 8 � (1 - 80 1 )80 , andm then choose in Jt an M such that I Mv - Dxv l � 8 1 v l for all v E R . Then I Mv l < I Dxf'l + I Mv - Dxf1 1 < I Dxv l + 8 1 v l � 801 Dxf'l , and (19.18) follows from this and a similar in equality going the other way. As for (19.19), it follows from (19.18) that M is univalent, and so there is a positive ., such that 1J I VI � I Mv l for all v. Since T is differentiable at x, there exists a p such that IY - x l < p - 1 implies that I Ty - Tx - Dx ( Y x) l � £ 1J IY - x l � £ 1 M(y - x) l . This and (19.18) imply that I Ty - Tx l < I Dx (x - Y ) l + I Tx - Ty - Dx (x - y) l � fJ0IM(x - y ) l + £ 1 M(x - y ) l
19.
IJI M( x - y ) l . This inequality and a similar one in the other direction give (19.19). Thus the E( M, p) cover V0 • Now E( M, p) is a countable union of sets B such that diam B < p - 1 . These sets B (for all M in Jf and all p ), made disjoint in the usual way, together with the corresponding maps M, satisfy �
the requirements of the lemma.
•
Also needed is an extension of the fact that isometries preserve Hausdorff measure.
Suppose that A c R k , cp: A --+ R ;, andm 1/J: A --+ Ri. If lcpx fi>Y I < 8 1 1/J x - 1/IY I for X, y E A, then h m (cpA) < (J h m (l/IA). PROOF. Given £ and 8, cover 1/JA by sets Bn such that 8 diam Bn < £ and cm En1 (diam Bn ) m < h m ( 1/IA) + B. The sets cpl/J - 1Bn cover cpA and m diamcpl/J - Bn < IJ diam Bn < £, whence follows h m ,((cpA) < (J (h m (l/IA) + Lemma 4.
3).
•
19.3:
Suppose first that A is contained in the set V0 where Dx is univalent. Given 8 > 1, choose sets B1 as in Lemma 3 and put A1 = A n B1 • If C is the unit cube in R m , it follows by (19.16) and Lemma 4 that 9 - mh m (M1C) < h m (DxC) < 8 mh m (M1C) for X E A1; by (19.10), 9 - m i M1I S I Dx l < 8 mmi M1 1 for x E A1 t By (19.17) and Lemma 4, I h m (A1) 9 - mh m ( M1A1) < h m (TA I ) < IJ hmm (MIA I ). Now h m ( MIA I ) = I M1 m by Theorem 19.2, and so 8 - 2 h m (TA1) < fA,I Dx l h m ( dx ) < 8 2 h m ( TA1) . Adding over A 1 and letting (J --+ 1 gives (19.14). SECOND CASE. Suppose that A is contained in the set V - V0 where Dx is not univalent. It is to be shown that each side of (19.14) vanishes. Since the entries of (19.13) are continuous in x and V is a countable union of compact sets, it is no restriction to assume that there exists a constant K such that I Dxv l < K l v l for x e A, v e R m . It is also no restriction to assume that h m (A) = A m (A) is finite. Suppose that 0 < £ < K. Define T' : R m --+ R k X R m by T'x = ( Tx, £x). For the corresponding matrix D; of derivatives, D;v = ( Dxv, £ V ), and so D; is univalent. Fixing x in A for the moment, choose in R m an orthonormal set v 1 , • • • , vm such that Dxv 1 = 0, which is possible because x e V - V0 • Then I D;v 1 1 = 1 (0, £V 1 ) 1 = £ and ID.:V;I < I Dxv;l + £ I V; I < 2K. Since h m ( P( v l , . . . ' vm )) = 1, it follows by Theorem 19.2 that I D; I = h m ( D;P( v 1 , • • • , vm )). But the parallelepiped D;P( v 1 , • • • , vm ) of dimension PROOF OF THEOREM
FIRST CASE.
.
tsince the entries of Dx are continuous in x, if y is close enough to x, then fJ - 1 1 Dxv I < I Dyv I S fJ I Dx v I for v e R m , so that by the same argument, 8 - m I Dx I < I Dy I � (J m I Dx I ; hence I Dx I is continuous in x .
m in R k + m has one edge of length £ and m - m1 edges of length at most 2K and hence is isometric with a subset of [x E R : l x 1 1 < £, l x; l < 2mK, i = (19.14) for the univalent case 2, . . . , m]; therefore, I D� I < £(2mK) m - •. Now already treated gives h m (TA) S fA £(2mK ) m - lh m ( dx ). If w: R k X R m R k is defined by w(z, v) = z, then TA = wT'A, and Lemma 4 gives h m (TA) < h m (TA) = 0. h m (T'A) < £(2mK ) m - 1h m (A). Hence m If Dx is not univalent, Dx R is contained in an (m - I)-dimensional subspace of R k and hence is isometric to a set in R m of Lebesgue measure 0. Thus I Dx l = 0 for x e V - V0 , and both sides of (19.14) do vanish. • -+
Calculations
The h m on the left in (19.14) and (19.15) can be replaced by A m . To evaluate the integrals, however, still requires calculating I D I as defined by (19.10). Here only the case m = k - 1 will be considered. t If D is a k X ( k - 1) matrix, let z be the point in R k whose ith component is ( - 1) ; + 1 times the determinant of D with its i th row removed. Then IDI = 1 z 1 . (19.20 ) To see this, note first that for each x in R k the inner product x · z is the determinant of D augmented by adding x as its first column (expand by cofactors). Hence z is orthogonal to the columns x 1 , , x k _ 1 of D. If z =I= 0, let z 0 = z/ l z l ; otherwise, let z 0 be any unit vector orthogonal to x 1 , , x k _ 1 • In either case a rotation and an application of Fubini 's theorem shows that I D I = h k _ 1 (P(x 1 , , x k _ 1 )) = A k (P(z0, x 1 , , x k _ 1 )); this last quantity is by Theorem 12.2 the absolute value of the determinant of the matrix with columns z 0 , x 1 , , x k _ 1 • But, as above, this is z 0 z = 1 z I · •
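As a numerical aside (not in the original text), the sketch below implements the recipe behind (19.20) in Python: it forms the vector z of signed (k − 1) × (k − 1) minors of a k × (k − 1) matrix D and checks that |z| agrees with the Gram-determinant expression sqrt(det(DᵀD)) for the (k − 1)-dimensional volume of the parallelepiped spanned by the columns of D. The function name cofactor_vector and the random test matrix are illustrative choices only.

import numpy as np

def cofactor_vector(D):
    # z_i = (-1)^(i+1) det(D with its i-th row removed), as in (19.20)
    k = D.shape[0]
    return np.array([(-1.0) ** i * np.linalg.det(np.delete(D, i, axis=0))
                     for i in range(k)])

rng = np.random.default_rng(0)
k = 4
D = rng.normal(size=(k, k - 1))          # a k x (k-1) matrix

z = cofactor_vector(D)
gram = np.sqrt(np.linalg.det(D.T @ D))   # (k-1)-volume of P(x_1, ..., x_{k-1})

print(abs(np.linalg.norm(z) - gram) < 1e-10)   # True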
•
•
•
•
•
•
•
•
•
•
•
•
.
•
•
V be the1 unit sphere in R k - 1 . If t;(x) = X ; for i [1 - E�.:-lxf] 12 , then TV is the top half of the surface of the unit sphere in R k . The determinants of the ( k - 1) X ( k - 1) submatrices of Dx are easily calculated, and I Dx l comes out to t; 1 (x). If A k is the surface area of the unit sphere in R k , then by (19.14), A k = 2h k - 1 (TV) = 2 f vti 1 (x) dx. Integrating out one of the variables shows that A k = 2wVk _ 2 , where Vk _ 2 is the volume of the unit sphere in R k - 2 • The calculation in Example 18.4 now gives A 2 = 2w and i- I ' 2 2w (2 ( ) ) '17' .= A 2; - 1 = . . . . . 2 A ' 2 2 4 6 (2 i - 2) 1 3 5 ( . - 3) Example 19.3. < k and t k (x) =
Let
'
t For the general case, see SPIVAK.
I
•
•
•
•
•
19.
for i = 2, 3, . . . . Note that A k = k Vk for k = 1 , 2, . . . . By (19.5) the sphere in R k with radius r has surface area r k - tA k . • PROBLEMS 19.1. 19.2.
Describe S2 S1 A for the general triangle A in the plane. Show that in (19.1) finite coverings by open sets suffice if A is compact: for positive £ and B there exist finitely many m open sets Bn such that A c
Un Bn , diam Bn < £ , and cm En (diam Bn ) 19.3.
<
f
h m (A) + B. ,
f Let f be an arc or curve-a continuous mapping of an interval [a, b) into R k -and let /[a, b) = [/(t): a < t s b) be its trace or graph. The arc length L(f) of f is the supremum of the lengths of inscribed polygons:
L ( I)
=
n
sup E 1/ ( t; ) - I( t; - 1 ) I , i-1
where the supremum extends over sets { t; } for which a = t0 < t1 < · · · < tn == b. The curve is rectifiable if L (f) is finite. (a) Let /[ u , v] be the length of f restricted to a subinterval [ u , v ]. Show that I [ u , v] == I [ u, t] + I [ t, v] for u < t < v. (b) Let I ( t) be the length of f restricted to [ a, t ]. Show for s < t that 1/( t) - / (s) l < /[s, t] == /( t) - /(s) and deduce that h 1 ( / [a, b )) < L(f ): the one-dimensional measure of the trace is at most the length of the arc. (c) Show that h 1 (/[a, b]) < L( / ) is possible. (d) Show that 1/( b) - /(a) I < h 1 (/[a, b)). Hint: Use a compactness ar,fument-cover /[a, b) by finitely many open sets B1 , , BN such that
En _ 1 diam Bn < h 1 , ( ( / ( a, b]) + £.
•
•
•
(e) Show that h 1 (/[a, b]) = L(/) if the arc has no multiple points ( / is
19.4.
one-to-one). (f) Show that if the arc is rectifiable, then I ( t) is continuous. (g) Show that a rectifiable arc can be reparameterized in such a way that [a, b ] = [0, L(/)] and /( t) == t. Show that under this parameterization, h 1 ( A ) for a Borel subset A of the arc is the Lebesgue measure of r 1A , provided there are no multiple points. f Suppose the arc is smooth in the sense that /( t) == ( /1 ( t), . . . , fk ( t)) has continuous derivative / '( t) == ( /{( t), . . . , /f( t)). Show that L( /) == f: If' ( t) I dt. The curve is said to be parameterized by arc length if 1/ ' ( t) 1 1, because in that case f[s, t] has length t - s. f Show that a circle of radius r has perimeter 2wr: if /( t) == ( r cos t, r sin t) for O < t < 2w, then /� '" 1/ '( t) l dt = 2wr. The w here is that of trigonometry; for the connection with w as the area of the unit disc, see Problem 18.24. f Suppose that /(t) = (x( t), y( t)), a < t < b, is parameterized by arc length. For a < � s b, define l/;( �, ., ) = ( x( �), y( �), ., ). Interpret the trans formation geometrically and show that it preserves area and arc length. =
19.5.
19.6.
19.7.
19.8.
f (a) Calculate the length of the helix (sin 8, cos fJ, 8 ) , 0 < 8 < 2w. (b) Calculate the area of the helical surface ( t sin 8, t cos 8, 8 ) , 0 < 8 < 2 w, 0 < t s 1. Since arc length can be approximated by the lengths of inscribed polygons, it might be thought that surface area could be approximated by the areas of inscribed polyhedra The matter is very complicated, however, as an exam ple will show. Split a 1 by 2w rectangle into mn rectangles, each m- 1 by 2wn - 1 , and split each of these into four triangles by means of the two diagonals. Bend the large rectangle into a cylinder of height 1 and cir cumference 2w, and use the vertices of the 4mn triangles as the vertices of an inscribed polyhedron with 4 mn flat triangular faces. Show that the resulting polyhedron has area
Show that as m , n
[2w, oo ).
19.9.
--+ oo ,
the limit points of Am n fill out the interval
R3 4J2 . - /2 2 (18.22), (19.3) em 2 mwm ;r(1 m/2), m 0. Thus h m m.
A point on the unit sphere in is specified by its longitude 8, 0 < 8 < 2w, and its latitude 4J , - w/2 < 4J < w/2. Calculate the surface area of the sector 61 < 8 < 8 , 4J 1 < + <
19.10. By
can
be
written = is well defined for all
and this is Show that h 0 is
+
defined for all � counting measure. 19.11. f (a) Show that, if hm(A) < oo and B > 0, then hm+8(A) = 0. (b) Because of this, there exists a nonnegative number dim A , called the Hausdorff dimension of A , such that hm (A) = oo if m < dim A and hm (A) = 0 if m > dim A. Thus 0 < hm (A) < oo implies that dim A = m. Show that "most" sets in specified by m smooth parameters have (integral) dimension m. (c) Show that dim A S dim B if A C B and that dim Un A n = supn dim An. 19.12. f Show that hm (A) is well defined by (19.1) and (19.2) for sets A in an arbitrary metric space. Show that hm is an outer measure, whatever the space. Show that part (a) of the preceding problem goes through, so that Hausdorff dimension is defined for all sets in all metric spaces, and show that part (c) again holds.
Rk
Rk
19.13. Let x = (xb . . . , xk ) range over the surface of the unit sphere in and consider the set where au < x u < flu , 1 < u < t; here t < k - 2 and - 1 <
a u S flu � 1. Show that its area is ( 19.21 )
A k- r
[ Ak - tls 1 -
t
E
i-1
r ](k2) 2 / dx1 xl
•
•
•
dx , ,
where is as in Example 19.3 and S is the set where E�_ 1 xl < 1 and tlu < Xu < flu , 1 < U < t.
CHAPTER 4

Random Variables and Expected Values
SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS

This section and the next cover random variables and the machinery for dealing with them: expected values, distributions, moment generating functions, independence, convolution.

Random
Variables and Vectors
random variable on a probability space ( 0, �, P ) is a real-valued function X = X( w) measurable �. Sections 5 through 9 dealt with random
A
variables of a special kind, namely simple random variables, those with finite range. All concepts and facts concerning real measurable functions carry over to random variables; any changes are matters of viewpoint, notation, and terminology only. The positive and negative parts x+ and x- of X are defined as in (15.4) and (15.5). Theorem 13.5 also applies: Define
(20.1 )
1/l n ( x ) =
( k - 1 )2 - n n
If X is nonnegative and
Xn
=
- 1 ) 2 - n < X < k2 - n , 1 < k < n2 n , if x > n .
if ( k
1/1 n ( X), then 0
<
Xn i X.
If
X is not neces259
sarily nonnegative, define if if
( 20.2 )
X � 0, X s 0.
(This is the same as (13.6).) Then 0 < Xn ( w) i X( w) if X( w) > 0 and 0 � Xn (w) � X(w) if X(w) < 0; and IXn (w) l i I X(w) l for every w. The random variable Xn is in each case simple. A random vector is a mapping from 0 to R k measurable §'. Any mapping from 0 to R k must have the form w --+ X(w) = ( X1 (w), . . . , Xk ( w )), where each X;( w) is real; as shown in Section 13 (see (13.2)) , X is measurable if and only if each X; is. Thus a random vector is simply a k-tuple X = ( X1 , , Xk ) of random variables. •
•
•
Subfields
If f§ is a a-field for which f§ c §', a k-dimensional random vector X is of course measurable f§ if [ w: X( w) e H] e f§ for every H in gJ k _ The a-field a( X) generated by X is the smallest a-field with respect to which it is measurable. The a-field generated by a collection of random vectors is the smallest a-field with respect to which each one is measurable. As explained in Sections 4 and 5, a sub-a-field corresponds to partial information about w. The information contained in a( X) = a( X1 , , Xk ) consists of the k numbers X1 ( w ), . . . , Xk ( w ) t The following theorem is instructive in this connection, although only part (i) is used much. It is the analogue of Theorem 5.1, but there are technical complications in its proof. •
.
.
.
Theorem 20.1. Let X = (i) The a-field a( X)
( X1 , , Xk ) be a random vector. = a( X1 , , Xk ) consists exactly of the sets [Xe •
•
•
H] for H E fJII k. (ii) In order that a random variable Y be measurable a( X) a( X1 , , Xk ) it is necessary and sufficient that there exist a measurable map f: R k --+ R 1 such that Y( w) = /( X1 ( w ), . . . , Xk ( w )) for all w. PROOF. The class f§ of sets of the form [ X e H] for H e gJ k is a a-field. Since X is measurable a( X), f§ c a( X). Since X is measurable f§, a( X) c .
•
•
=
•
•
•
f§. Hence part (i). Measurability of f in part (ii) refers of course to measurability !1/ kj!l/ 1 • The sufficiency is easy: if such an f exists, Theorem 1 3. 1 (ii) implies that Y is measurable o ( X ). t The partition defined by (4.17) consists of the sets [ w: X( w )
=
x ] for x
e
Rk .
20.
To prove necessity,t suppose at first that Y is a simple random variable , Ym be its different possible values. Since A; [ w : Y( w ) y;] and let y 1 , lies in o( X) , it must by part (i) have the form [ w: X( w) E H;] for some H; in £Il k . Put I = L; Y;ln. ; certainly I is measurable. Since the A ; are disjoint, no X( w) can lie in more than one H; (even though the latter need not be disjoint), and hence I( X( w )) = Y( w ). To treat the general case, consider simple random variables Y,, such that Yn ( w) --+1 Y( w) for each w. For each n , there is a measurable function In : R k --+ R such that Yn( w ) = ln( X( w )) for all w. Let M be the set of x in R k for which { ln ( x ) } converges; by Theorem 1 3.4(iii), M lies in !1/k. Let f( x ) = lim nln(x) for x in M, and let l( x ) = 0 for x in R k - M. Since f = lim n ln iM and ln iM is measurable, I is measurable by Theorem 1 3.4(ii). For each w , Y(w) = lim nln( X( w )); this implies in the first place that X( w ) lies in M and in the second place that Y( w ) = lim n In ( X( w )) = I( X( w )). • •
•
=
•
=
Distributions
The distribution or law of a random variable X was in Section 14 defined as the probability measure on the line given by p. = PX- 1 (see (1 3.7)), or
(20 .3)
= P[X E A),
p. ( A )
A
E
£1/ 1 .
The distribution function of X was defined by (20 .4)
F( X ) = p.( - 00 ' X] = p [ X < X]
for real x. The left-hand limit of (20.5)
F satisfies
{ F(F X -- =F ,., (X)
( - 00 ' X ) = p [ X < X]' ( X ) = p. { X } = P [ X = X ) ,
)
and F has at most countably many discontinuities. Further, F is nonde creasing and right-continuous, and lim 00 F(x) = 0, lim 00F( x ) = 1 . By Theorem 14.1, for each F with these properties there exists on some probability space a random variable having F as its distribution function. A support for p. is a Borel set S for which p. ( S ) = 1 . A random variable, its distribution, and its distribution function are discrete if p. has a count able support S = { x 1 , x 2 , } . In this case p. is completely determined by the values p. { x 1 } , p. { x 2 } , x
•
•
•
•
_.
_
•
•
•
tpor a general version of this argument,
see
Problem 13.6.
x
_.
262
RANDOM VARIAB LES AND EXPECTED VA LUES
A familiar discrete distribution is the
binomial:
( 20.6) P [ X = r ] = p { r } = ( � ) p'(l - p r ' ,
r = 0, 1 , . . . , n .
There are many random variables, on many spaces,with this distribution: If { Xk } is an independent sequence such that P [ Xk = 1] = p and P [ Xk = 0] = 1 - p (see Theorem 5.2), then X could be E? 1X; , or Ef +;x;, or the sum of any n of the X; . Or 0 could be {0, 1 , . . . , n } if !F consists of all subsets, P { r } = p. { r } , r = 0, 1, . . . , n, and X( r) = r. Or again the space and ran dom variable could be those given by the construction in either of the two proofs of Theorem 14.1. These examples show that, although the distribu tion of a random variable X contains all the information about the probabilistic behavior of X itself, it contains beyond this no further information about the underlying probability space (D, �, P ) or about the interaction of X with other random variables on the space. Another common discrete distribution is the Poisson distribution with parameter A > 0: Ar " P [ X = r ] = p { r } = e- r! ,
( 20 .7)
r = 0, 1 , . . . .
A constant c can be regarded as a discrete random variable with X( w) = c. In this case P [ X = c] = p. { c } = 1 . For an artificial discrete example, let { } be an enumeration of the rationals and put
x1, x2 ,
•
•
•
(20 .8) the point of the example is that the support need not be contained in a lattice. A random variable and its distribution have density I with respect to Lebesgue measure if I is a nonnegative function on R 1 and (20.9)
P[ X e A) = p(A) =
j f ( x ) dx , A
A
E
!1/ 1 .
In other words, the requirement is that p. have a density with respect to A in the sense of (16.11). The density is assumed to be with respect to A if no other measure is specified. Taking A = R 1 in (20.9) shows that I must integrate to 1 . Note that I is determined only to within a set of Lebesgue measure 0: if 1 = g except on a set of Lebesgue measure 0, then g can also serve as a density for X and p..
20.
It follows by Theorem 3.3 that (20.9) holds for every Borel set A if it holds for every interval- that is, if
F( b ) - F(a )
=
bf x dx j ( ) a
holds for every a and b. Note that F need not differentiate to I everywhere (see (20.13), for example); all that is required is that I integrate properly- that (20.9) hold. On the other hand, if F does differentiate to I and I is continuous, it follows by the fundamental theorem of calculus that f is indeed a density for F.t For the exponential distribution with parameter a > 0, the density is if X < 0, if X > 0.
(20.10) The corresponding distribution function
F( x )
(20.1 1 )
=
{�
_
e - ax
if X S 0, if X > 0
was studied in Section 14. For the normal distribution with parameters (20.12)
f( x )
=
1 {ii exp 2w o
[ - ( x - m2 ) 2 ] , 2o
m and o,
o > 0,
- oo < x < oo ;
a change of variable together with (1 8.10) shows that I does integrate to 1 . For the standard normal distribution, m = 0 and o = 1 . For the uniform distribution over an interval (a, b ], (20.13)
l(x )
=
1
b-a
0
if
a < x s b,
otherwise.
The distribution function F is useful if it has a simple expression, as in (20.1 1). It is ordinarily simpler to describe p. by means of the density l(x) or the discrete probabilities ,., { x r } . If F comes from a density, it is continuous. In the discrete case, F increases in jumps; the example (20.8), in which the points of discontinuity ' The general question of the relation between differentiation and integration is taken up in Section 31 .
are dense, shows that it may nonetheless be very irregular. There exist distributions that are not discrete but are not continuous either. An example is p. (A) = �p.1( A) �p. 2 (A) for p. 1 discrete and p. 2 coming from a density; such mixed cases arise, but they are few. Section 31 has examples of a more interesting kind, namely functions F that are continuous but do not come from any density. These are the functions singular in the sense of Lebesgue; the Q( x ) describing bold play in gambling (see (7 .33)) turns out to be one of them. If has distribution p. and is a real function of a real variable,
+
X
g
(20.14)
g( X) p. g-1
in the notation (1 3.7). is Thus the distribution of In the case where there is a density, f and F are related by F(x )
(20.1 5 )
=
{'
- oo
f( t ) dt.
Hence f at its continuity points must be the derivative of F. As noted above, if F has a continuous derivative, this derivative can serve as the density f. Suppose that f is continuous and is increasing, and let T= The distribution function of is
g-1•
If
g(X)
P[ g ( X ) < x ) P[ X < T(x )) =
g
=
F( T( x )) .
T is differentiable, this differentiates to
�d P (g ( X ) < x )
( 20 .16 )
g(X)
=
/ ( T( x )) T'( x ) ,
which is thus the density for (as follows also by (17.8)). has normal density (20.12) and > 0, (20.16) shows that b If b and has the normal density with parameters Calculating (20.16) is generally more complicated than calculating the distribution function of X) from first principles and then differentiating, which works even if is many-to-one:
X
g(
Example
20.1.
If
a am +
ao .
X has the standard normal distribution, then
aX + g
for X > 0. Hence
20.
X2
has density
/ (x ) =
0
1 .� x - 1/2e x/2 -
v � 'IT
if X < 0, •
if X > 0 .
Multidimensional Distributions
For a k-dimensional random vector X = ( X1 , , Xk ), the distribution p. (a probability measure on gJ k ) and the distribution function F (a real func tion on R k ) are defined by . • •
(20.17)
{
= P ( ( X1 , , Xk ) E A ] , A e fJit k , F( x 1 , , x k ) = P( X1 < x 1 , , Xk < x k ] = p.(Sx ) ,
P. ( A )
• • •
• • •
• • •
where Sx = [ y: Y; < x ; , i = 1, . . . , k] consists of the points "southwest" of x. Often p. and F are called the joint distribution and joint distribution function of xl, . . . ' xk . Now F is nondecreasing in each variable, and &AF > 0 for bounded rectangles A (see (12.12)) As h decreases to 0, the set
s h=
+ h' i = 1' . . . ' k] decreases to Sx , and therefore (Theorem 2.1(ii)J F is continuous from above in the sense that lim h l 0 F(x 1 + h, . . . , x k + h) = F(x 1 , , x k ). Further, F( x , x k ) --+ 0 if x; --+ - oo for some i (the other coordinates held fixed), and F( x 1 , , x k ) --+ 1 if X; --+ oo for each i. For any F with these properties there is by Theorem 12.5 a unique probability measure p. on gJ k such that p. ( A ) = &AF for bounded rectangles A, and p. ( Sx ) = F(x) for Xt
1,
[ y : Y; < X;
• • •
• • •
all X.
• • •
As h decreases to 0, Sx , - h increases to the interior i = 1, . . . , k ] of Sx , and so
sxo = [ y:
Y; < X ;,
(20.18) Since F is nondecreasing in each variable, it is continuous at x if and only if it is continuous from below there in the sense that this last limit coincides with F(x). Thus F is continuous at x if and only if F(x) = p.(Sx ) = p. ( Sx0 ), which holds if and only if the boundary asx = sx - sxo (the y-set where Y; s; X; for all i and Y; = X; for some i) satisfies p.( asx ) = 0. If k > 1 , F can have discontinuity points even if p. has no point masses: if p. corre-
sponds to a uniform distribution of mass over the segment B = [(x, 0): 0 < x < 1] in the plane (p.( B) = A[x : 0 < x < 1, (x, O) e B]) then F is discontinuous at each point of B. This also shows that F can be discontinu ous at uncountably many points. On the other hand, for fixed x the boundaries asx , h are disjoint for different values of h, and so (Theorem 1 0.2(iv)) only countably many of them can have positive p.-measure. Thus x is the limit of points (x 1 + h, . . . , x k + h) at which F is continuous: the continuity points of F are dense. There is always a random vector having a given distribution and distribu tion function : Take (O, S&", P) = ( R k , f1t k , p.) and X(w) = w. This is the obvious extension of the construction in the first proof of Theorem 14.1. The distribution may as for the line be discrete in the sense of having countable support. It may have density I with respect to k-dimensional Lebesgue measure: p.(A) = fA I(x) dx. As in the case k = 1, the distribu tion p. is more fundamental than the distribution function F, and usually p. is described not by F but by a density or by discrete probabilities. If X is a k-dimensional random vector and g: R k .-. R ,. is measurable, then g( X) is an i-dimensional random vector; if the distribution of X is p., the distribution of g( X) is p.g - 1 , just as in the case k = 1 see (20.14). If gj : R k .-. R 1 is defined1 by gj (x 1 , , x k ) = xj, then gj ( X) is Xj , and its distribution P.j = p.gj is given by p.j (A) = p.[(x 1 , , x k ): xj e A] = P[ � e A] for A e 91 1 • The P.j are the marginal distributions of p.. If p. has a density I in R k , then P.j has over the line the density ,
-
•
•
•
•
•
•
Jj(x ) =
(20 .1 9 )
since by Fubini ' s theorem the right side integrated over A comes to p.[(x 1 , , x k ): xj e A]. Now suppose that g: U --+ V is one-to-one, where U and V are open subsets of R k . Suppose that T = g - 1 is continuously differentiable, and let J(x) be its Jacobian. If X has a density that vanishes outside U, then P[g( X) E A] = P[ X E TA] = frA I( y ) dy, and so by (17.1 3), •
•
(20.20 ) for A c
•
P ( g ( X ) e A ) = j f( Tx )IJ( x )l dx A
V. Thus the density for g( X) is I(Tx)I J(x) l on V and 0 elsewhere. t
tAlthough (20.20) is indispensible in many investigations, it is used in this book only for linear T and in Example 20.2; and here different approaches are possible-see ( 1 7. 14) and Proble m 1 8.24.
Example 20.2.
20. 2
20.
Suppose that ( X1 , X2 ) has density
and let g be the transformation to polar coordinates. Then U = R², and V and T are as in Example 17.6. If R and Θ are the polar coordinates of (X₁, X₂), then (R, Θ) = g(X₁, X₂) has density (2π)^{−1} r e^{−r²/2} in V. By (20.19), R has density r e^{−r²/2} on (0, ∞) and Θ is uniformly distributed over (0, 2π). ∎
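Example 20.2 can be run in reverse to simulate standard normal pairs; this is the classical Box–Muller device rather than anything stated in the text, and the sketch below (whose names and sample size are illustrative assumptions) draws R from the distribution function 1 − e^{−r²/2}, draws Θ uniformly over (0, 2π), and checks that R cos Θ and R sin Θ behave like standard normals.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

U = 1.0 - rng.uniform(size=n)      # uniform on (0, 1]
V = rng.uniform(size=n)
R = np.sqrt(-2.0 * np.log(U))      # P[R > r] = exp(-r^2/2)
Theta = 2.0 * np.pi * V            # uniform over (0, 2*pi), independent of R

X1 = R * np.cos(Theta)
X2 = R * np.sin(Theta)

# both coordinates should be approximately standard normal
print(round(X1.mean(), 2), round(X1.var(), 2))   # ~0.0, ~1.0
print(round(X2.mean(), 2), round(X2.var(), 2))   # ~0.0, ~1.0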
Independence
Random variables xl , . . . ' xk are defined to be independent if the a-fields , o ( Xk ) t�ey generate are independent in the sense of Section 4. a( X1 ), This concept for simple random variables was studied extensively in Chapter 1; the general case was touched on in Section 14. Since o ( X;) consists of the sets [ X; E H ] for H E 91 1 , xl , . . . ' xk are independent if and only if • • •
for all linear Borel sets H1 , , Hk . The definition (4.11) of independence requires that (20.21) hold also if some of the events [ X; e H; ] are sup pressed on each side, but this only means taking H; = R 1 Suppose that • • •
•
for all real x 1 , , xk; it then also holds if some of th� events [ X; < X;] are suppressed on each side (let X; --+ oo ). Since the intervals ( oo , x] form a tr-system generating !1/ 1 , the sets [ X; < x] form a w-system generating a ( X;). Therefore, by Theorem 4.2, (20.22) implies that X1 , Xk are independent. If, for example, the X; are integer-valued, it is enough that P [ X1 = n 1 , , Xk = n k ] = P [ X1 = n 1 ] P[ Xk = n k ] for integral " 1 ' . . , n k (see (5.6)). Let ( X1 , , Xk ) have distribution p. and distribution function F, and let the X; have distributions P. ; and distribution functions F; (the marginals). By (20.21), X1 , , Xk are independent if and only if p. is product measure in the sense of Section 18: • • •
-
, • • •
•
•
• • •
•
.
• • •
• • •
(20. 23 )
By (20.22), X1,
• • •
, Xk are independent if and only if
(20 .24) Suppose that each P. ; has density /;; by Fubini' s theorem, f1 (y 1 ) fk ( Yk ) X ( - oo , x k ] is just F1 (x 1 ) Fk (x k ), so integrated over ( oo, x 1 ] X that p. has density • • •
-
· · ·
· · ·
(20 .25 ) in the case of independence. If f§1 , , f§k are independent a-fields and X; is measurable f§;, i = 1 , . . . ' k, then certainly x1 , . . . ' xk are independent. If X; is a d;-dimensional random vector, i = 1, . . . , k, then X1 , , Xk are by definition independent if the a-fields a( X1 ), , a( Xk ) are indepen dent. The theory is just as for random variables: x1 , . . . ' xk are indepen dent if and only if (20.21) holds for H1 e gJd•, , Hk e gJd". Now ( X1 , , Xk ) can be regarded as a random vector of dimension d = E�_ 1 d;; X Rdk and P. ; is the distribution of if p. is its distribution in Rd = Rd1 X X; in Rd•, then, just as before, X1 , , Xk are independent if and only if X p. k . In none of this need the d; components of a single X; p. = p. 1 X be themselves independent random variables. An infinite collection of random variables or random vectors is by definition independent if each finite subcollection is. The argument follow ing (5.7) extends from collections of simple random variables to collections of random vectors: • • •
• • •
• • •
• • •
• • •
· · ·
• • •
·
·
·
Theorem 20.2.
(20 .26)
Suppose that X11 , X1 2 , X21 , x22 , . . . · · ·
•
•
•
•
•
•
•
•
•
is an independent collection of random vectors. If .fF; is the a-field generated by the ith row, then �1 , �2 , . . . are independent. Let �; consist of the finite intersections of sets of the form [ X;1 e H] with H a Borel set in a space of the appropriate dimension, and apply Theorem 4.2. The a-fields .fF; = a( �;), i = 1, . . . , n, are independent • for each n, and the result follows. PROOF.
Each row of (20.26) may be finite or infinite, and there may be finitely or infinitely many rows. As a matter of fact, rows may be uncountable and there may be uncountably many of them.
20.
Suppose that X and Y are independent random vectors with distribu tions p. and Jl in Ri and R k . Then ( X, Y ) has distribution p. X Jl in R i X R k = R i +k . Let x range over Ri and over R k . By Fubini's theorem,
y
(20 .2 7) ( p. X ) ( B ) = J1
Replace B by reduces to
(20 .2 8 )
( ll
X
f. [ y : ( x, y ) Rl
J1
E
B ) p. ( dx ) ,
(A X R k ) n B, where A E gJi and B E gJ J +k . Then (20.27) v)(( A X R k ) n B ) = j v[ y : ( x , y ) E B ) fl (dx) , A
Bx = [ y: ( x, y) E B] is the x-section of B, so that Bx E gJ k (Theorem 1 8 .1), then P[(x, Y) E B] = P[w: (x, Y( w )) E B] = P[w: Y(w) E Bx ] = •
gives this result: Theorem 20.3.
p,
and
Jl
(20 . 29 )
If X and Y are independent random vectors with distributions
in Ri and R k, then P [ ( X, Y )
E B) =
f. .P [( x , Y ) E B ) p. ( dx ) , RJ
B E gJ J +k
and
(20.30)
P [ X E A, ( X, Y ) E B ) =
£P [( x , Y ) E B ) fl ( dx ) , A E gJi, B E gJ J +k .
Suppose that X and Y are independent exponentially distributed random variables. By (20.29), P[Y/X > z] = ∫₀^∞ P[Y/x > z] αe^{−αx} dx = ∫₀^∞ e^{−αxz} αe^{−αx} dx = (1 + z)^{−1}. Thus Y/X has density (1 + z)^{−2} for z > 0. Since P[X > z₁, Y/X > z₂] = ∫_{z₁}^∞ P[Y/x > z₂] αe^{−αx} dx by (20.30), the joint distribution of X and Y/X can be calculated as well. ∎
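For a quick numerical check of this calculation (added here, not from the text; the parameter values are arbitrary), one can simulate independent exponentials and compare the empirical frequency of the event [Y/X > z] with (1 + z)^{−1}, which does not depend on α.

import numpy as np

rng = np.random.default_rng(2)
alpha, n, z = 2.0, 200_000, 1.5

# numpy parameterizes the exponential law by its scale 1/alpha
X = rng.exponential(scale=1.0 / alpha, size=n)
Y = rng.exponential(scale=1.0 / alpha, size=n)

print(round(np.mean(Y / X > z), 3))   # close to ...
print(round(1.0 / (1.0 + z), 3))      # ... 0.4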
Formulas (20.29) and (20.30) are constantly applied as in this example. There is no virtue in making an issue of each case, however, and the appeal to Theorem 20.3 is usually silent.
Here is a more complicated argument of the same sort. Let X1 , , Xn be independent random variables, each uniformly distributed over (0, I ] o Let Y" be the k th smallest among the X, , so that 0 < Y1 < · · · < Y,. < 1. The X, divide [0, 1 ] into n + 1 subintervals of lengths Y1 , Y2 - Y1 , , Yn - Yn - 1 , I - Y, ; let M be the maximum of these lengths. Define 1/1n ( I, a) = P [ M < a ]. The problem is to show that EXIIIle IIp 20.4. o
o
•
• • •
{ 20.31)
1/ln ( t, a )
( n 1 )( ) T:o ( - l) k z 1 - k �
n+l ...
"
+ ,
where x + ( x + l x l )/2 denotes positive part. Separate consideration of the possibilities 0 < a < 1/2, 1/2 < a < 1, and 1 < disposes of the case n 1. Suppose it is shown that the probability 1/Jn ( l, a) satisfies the recursion ==
a
==
( 20.32)
1/Jn ( l, a )
==
n a ( l - x ) - 1 dx nl l/Jn _ 1 ( 1 - x, a ) 1
1
0
°
Now (as follows by an integration together with Pascal' s identity for binomial coefficients) the right side of (20.31) satisfies this same recursion, and so it will follow by induction that (20.31) holds for all n. In intuitive form, the argument for (20.32) is this: If [ M < a] is to hold, the smallest of the X; must have some value x in [0, a ]. If X1 is the smallest of the X, , then x2 ' . . . ' xn must all lie in [x, 1] and divide it into subintervals of length at most a; the probability of nthis is (1 - xjl) n - 11/ln - 1 ( 1 - x, a ) because x2 , . . . ' xn have probability (1 - xjl) - l of all lying in [x, 1], and if they do, they are independent and uniformly distributed there. Now (20.32) results from integrating with respect to the density for X1 and multiplying by n to account for the fact that any of x1 ' . . . ' xn may be the smallest. To make this argument rigorous, apply (20.30) for j 1 and k n - 1. Let A be the interval [0, a], and let B consist of the points ( x 1 , . 0 0 , xn ) for which 0 < X; < I, x 1 is the minimum of x 1 , , xn , and x 2 , 0 0 . , xn divide [ x1 , 1 ] into subintervals of length at most a. Then P[ X1 min X;, M < a ] == P[ X1 e A , ( X1 , . . . , Xn ) e B). Take X1 for X and ( X2 , . . . , Xn ) for Y in (20.30)0 Since X1 has density 1/1, ==
• • •
(20 .33)
P ( X1
==
.
mm X; , M < a )
=
l0a
==
==
P ( ( x, X2 , . . . , Xn )
E
dx
B] I
o
n
If C is the event that x s X; < 1 for 2 < i < n , then P(C) (1 - xjl) - t o A simple calculation shows that P [ X; - X < s, , 2 < i s n i C] n;_ 2 (s;/(l - x )); in other WOrds, given C, X2 - X, . . Xn - X are conditionally independent and uni formly distributed over [0, I - X]. Now ( x2 ' 0 0 0 ' xn ) are random variables on some probability space ( Q, !F, P); replacing P by P( · I C) shows that the integrand in (20.33) is the same as that in (20032). The same argument holds with the index 1 o
,
=
==
20. RANDOM V ARIABLES AND DISTRIBUTIONS replaced by any k (1 < k < n ) which gives (20.32). (The events [ XI< y � a] are not disjoint, but any two intersect in a set of probability 0.) SECTION
,
271 = min
X, ,
•
Sequences of Random Variables Theorem 5.2 extends to general distributions P.n ·
{ If p. n } is a finite or infinite sequence of probability mea sures on !A 1 , there exists on some probability space ( 0, ofF, P) an independent sequence { xn } of random variables such that xn has distribution P.n ·
1beorem 20.4.
PROOF.
By Theorem 5.2 there exists on some probability space an indepen of random variables assuming the values 0 and 1 dent sequence Z 1 , Z2, with probabilities P[Zn = 0] = P[ Zn = 1] = ! . As a matter of fact, Theo rem 5.2 is not needed : take the space to be the unit interval and the Zn( w ) to be the digits of the dyadic expansion of w-the functions dn( w ) of Sections and 1 and 4. Relabel the countably many random variables Zn so that they form a double array, • • •
zl l zl 2 Z21 , Z22 ,
, · · ·
,
•
•
•
•
•
•
· · ·
•
•
•
the zn k are independent. Put un = Ef- I Zn k 2 - k . The series certainly converges, and Un is a random variable by Theorem 13.4. Further, U1 , U2 , is, by Theorem 20.2, an independent sequence. Now P[ Zn; = z; , 1 � i < k] = 2 - k for each sequence z 1 , , z k of O' s k and l's; hence the 2 k possible values j2 - k , 0 < j < 2 , of Sn k = E�_ 1Zn;2 - ; all have probability 2 - k . If 0 � x < 1, the number of the j2 - k that lie in [O,x ] is [2 k x] + 1, the brackets indicating integral part, and therefore P[S,.k � x ] = ([2 kx] + 1)/2 k . Since Sn k ( w ) i Un( w ) as k i oo , [ Sn k < x] � [Un k k S x ] as k t oo , and so P[Un � x ] = lim k P[Sn k < x ] = lim k ([2 x] + 1)/2 = X for 0 � X < 1. Thus Un is uniformly distributed over the unit interval. The construction thus far establishes the existence of an independent sequence of random variables Un each uniformly distributed over [0, 1]. Let F,. be the distribution function corresponding to P.n, and put cpn( u ) = inf[ x : u S Fn( x )] for 0 < u < 1. This is the inverse used in Section 14-see (14.5). Set q>,. (u) = 0, say, for u outside (0, 1), and put Xn( w ) = cpn(Un(w)). Since cp,.( u ) � x if and only if u < Fn( x )-see the argument following (14.5 ) -P[ Xn < x ] = P[Un < Fn(x)] = Fn(x). Thus Xn has distribution function Fn. And by Theorem 20.2, XI, x2, . . . are independent. • All
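The final step of the proof, X_n(ω) = φ_n(U_n(ω)), is easy to simulate. The sketch below is an added illustration, not part of the proof: it uses the exponential distribution function (20.11), for which the inverse φ(u) = −α^{−1} log(1 − u) is explicit, and checks the sample mean against 1/α; the names phi and alpha are illustrative.

import numpy as np

def phi(u, alpha):
    # inverse of F(x) = 1 - exp(-alpha*x): phi(u) = inf[x: u <= F(x)]
    return -np.log(1.0 - u) / alpha

rng = np.random.default_rng(3)
alpha = 2.0
U = rng.uniform(size=100_000)   # independent and uniform over [0, 1)
X = phi(U, alpha)               # has the exponential distribution (20.10)

print(round(X.mean(), 2))       # approximately 1/alpha = 0.5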
• • •
• • •
This theorem of course includes Theorem 5.2 as a special case, and its proof does not depend on the earlier result. Theorem 20.4 is a special case of Kolmogorov' s existence theorem in Section 36. Convolution
Let X and Y be independent random variables with distributions p. and " · Apply (20.27) and (20.29) to the planar set B = [(x, y): x + y e H] with
H E !1/ 1 :
j-oo P( H - x )p( dx ) oo = f oo P [ Y e H - x) p( dx ) . - oo
P [ X + Y e H) =
( 20.34 )
The
convolution of p. and " is the measure p. • " defined by
( 20.35 )
( p • ., )( H )
=
j-oo P( H - x )p ( dx ) , oo
H E !1/ 1 •
If X and Y are independent and have distributions p. and ,, (20.34) shows that X + Y has distribution p. • " · Since addition of random variables is commutative and associative, the same is true of convolution: p. • " = " • p. and p. • (" • '11 ) = ( p. • ") • '11 · If F and G are the distribution functions corresponding to p. and ,, the distribution function corresponding to p. • " is denoted F • G. Taking H ( - oo , y] in (20.35) shows that =
( F • G )( y) =
(20.36)
(See (17.16) for the notation
j-oo G(y - x ) dF( x). oo
dF(x).) If G has density g, then G(y - x) = JY--,;g(s) ds g(t x) dt, and so the right side of (20.36) is JY 00 fY 00 [ / 00 00 8( 1 - x) dF(x)] dt by Fubini's theorem. Thus F • G has density F • g, where =
(20.37) this holds if
G
has density
g.
If, in addition,
F
has density f,
(20.37) is
20.
denoted I • g and reduces by (16.12) to
oo ( / • g )(y) = j- g (y - x ) f(x ) dx . oo
(20. 38)
This defines convolution for densities, and p. • " has density I • g if p. and " have densities I and g. The formula (20.38) can be used for many explicit calculations. Let X1 , , Xk be independent random variables, each with the exponential density (20.10). Define gk by
EXIIIple II
• • •
20.5.
k- 1 ax ( ) - ax (20 .3 9) gk ( x ) = a ( k - l ) ! e ' put gk (x) = 0 for x < 0. Now
k=
X > 0,
1 , 2, . . . ;
1
Y = )( y) ( Y - x ) gl (x) dx, t k ( 8k - t * 8 t 8 0
which reduces to g,.(y). Thus gk = Bk - l • g 1 , and since g 1 coincides with (20. 1 0), it follows by induction that the sum XI + . . . + xk has density Bk · The corre·iponding distribution function is
X > 0, as
•
follows by differentiation.
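For a purely numerical illustration of (20.38) and (20.39) (added here, with an arbitrary grid and parameter choice), one can discretize the convolution integral of two exponential densities and compare it with g₂; the function name g is illustrative.

import math
import numpy as np

alpha = 1.0

def g(k, x):
    # the density (20.39): alpha*(alpha*x)^(k-1)*exp(-alpha*x)/(k-1)!, zero for x <= 0
    x = np.asarray(x, dtype=float)
    val = alpha * (alpha * x) ** (k - 1) * np.exp(-alpha * x) / math.factorial(k - 1)
    return np.where(x > 0, val, 0.0)

dx = 0.001
xs = np.arange(dx, 20.0, dx)
for y in (0.5, 1.0, 3.0):
    conv = np.sum(g(1, y - xs) * g(1, xs)) * dx        # Riemann sum for (20.38)
    print(round(conv, 3), round(float(g(2, y)), 3))    # the columns agree to about three decimals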
E
ple
Suppose that X has the normal density (20.12) with m == 0 and that Y has the same density with T in place of o . If X and Y are independent, then X + Y has density mm
20.6.
1
2 '1TO A
change of variable
[ /oo T - oo
exp -
u
]
( y - x ) 2 - x 2 dx . 2 2 2o
2T
= x(o 2 + T 2 ) 1 12joT reduces this to
m = 0 and with o 2 + T 2 in place
Thus X + Y has the normal density with of o 2 •
•
I f p. and " are arbitrary finite measures on the line, their convolution is
defined by
(20.35) even if
they are not probability measures.
Convergence in ProbabUity Random variables Xn converge in probability to X, written Xn lim P [I Xn - X I � £ ]
(20.41 )
n
--+ p X,
if
=0
1,
for each positive £. t If xn --+ X with probability then lim supn [ I xn - XI > £ ] has probability 0, and it follows by Theorem 4.1 that Xn --+ p X. But the
converse does not hold : There exist sets A n such that P( A n )
1.
--+ 0
and
For example in the unit interval, let A 1 , A be the 2 dyadic intervals of order one, A 3 , • • • , A 6, those of order two, and so on (see P(lim supn A n ) =
(4.30) or (1.31)). If = 0.
Xn =
lA
II
and X =
0,
then Xn
--+ p X
but P[lim n Xn =
X]
If Xn --+ X with probability I, then Xn --+ p X. (ii) A necessary and sufficient condition for Xn --+ p X is that each subse quence { Xnk } contain a further subsequence { Xnk
Theorem 20.5
PROOF.
The argument for (i) has been given.
{ nk }
{n
choose a subsequence k < ;> } so that k � k ( i ) implies that P [ I Xnk - XI � ; - 1 ] < By the first Borel-Cantelli lemma I f Xn
-+ p X,
given
there is probability
1
that I Xn k
2 - ;.
- XI <
) (l babill. ty . h pro Th ere fore, lim ; Xnk . = X Wit
;- I
1.
for
all
but finitely many
i.
If Xn does not �nverge to X in probability, there is some positive £ fo r
{ nk }·
which P [ I Xn - XI � £] > £ holds along some sequence No subse " quence of Xnk } can converge in probability to X, and hence none can
{
converge to X with probability It follows from
probability
1.
(ii)
1.
that if Xn
•
--+ p X
and Xn
--+ pY,
then X = Y with
It follows further that if I is continuous and xn
/( Xn ) -+ pf( X ).
--+ p X,
In nonprobabilistic contexts, convergence in probability becomes
gence in measure:
If In and
then
conver
f are real measurable functions on a measure
t This is often expressed p lim, X, - X.
20.
space ( 0, �' p.), and if p.[w: /, converges in measure to f.
lf(w) - fn (w) l
�
£] -+
0 for each £ > 0,
then
1be Glivenko-CanteUi Theorem*
The empirical distribution function for random variables XI , . . . ' distribution function Fn ( x, w) with a jump of n - I at each Xk ( w ):
xn
is the
(20.42 ) If the Xk have a common unknown distribution function F( x ), Fn ( x, w) is its natural estimate. The estimate has the right limiting behavior, according to the Glivenko-Cantelli theorem:
Suppose that X1 , X2 , are independent and have a com mon distribution function F; put Dn (w) = supx i Fn (x, w) - F(x) l . Then Dn � 0 with probability I. Theorem 20.6.
•
•
•
For each x, Fn ( x, w) as a function of w is a random variable. By right continuity, the supremum above is unchanged if x is restricted to the rationals, and therefore Dn is a random variable. The summands in (20.42) are independent, identically distributed simple random variables, and so by the strong law of large numbers (Theorem 6.1), for each x there is a set A x of probability 0 such that
(20 .43)
Fn( x, w) = F( x ) lim n
for "' � A x · But Theorem 20.6 says more, namely that (20.43) holds for w outside some set A of probability 0, where A does not depend on x-as there are uncountably many of the sets A x ' it is conceivable a priori that their union might necessarily have positive measure. Further, the conver gence in (20.43) is uniform in x. Of course, the theorem implies that with probability 1 there is weak convergence Fn ( x, w) � F( x ) in the sense of Section 14. As already observed, the set A x where (20.43) fails has probability 0. Another application of the strong law of large �umbers,. with /( - x ) in place of /( - x ] in (20.42), shows that (see (20.5)) lim n Fn ( x - , w) = F( x - ) except on a set Bx of probability 0. Let cp( u) =
PROOF OF THE THEOREM. 00 ,
*This topic may be omitted.
00 ,
inf[x: u ≤ F(x)] for 0 < u < 1 (see (14.5)), and put x_{m,k} = φ(k/m), m ≥ 1, 1 ≤ k ≤ m. It is not hard to see that F(φ(u)−) ≤ u ≤ F(φ(u)); hence F(x_{m,k}−) − F(x_{m,k−1}) ≤ m^{−1}, F(x_{m,1}−) ≤ m^{−1}, and F(x_{m,m}) ≥ 1 − m^{−1}. Let D_{m,n}(ω) be the maximum of the quantities |F_n(x_{m,k}, ω) − F(x_{m,k})| and |F_n(x_{m,k}−, ω) − F(x_{m,k}−)| for k = 1, ..., m. If x_{m,k−1} ≤ x < x_{m,k}, then F_n(x, ω) ≤ F_n(x_{m,k}−, ω) ≤ F(x_{m,k}−) + D_{m,n}(ω) ≤ F(x) + m^{−1} + D_{m,n}(ω) and F_n(x, ω) ≥ F_n(x_{m,k−1}, ω) ≥ F(x_{m,k−1}) − D_{m,n}(ω) ≥ F(x) − m^{−1} − D_{m,n}(ω). Together with similar arguments for the cases x < x_{m,1} and x ≥ x_{m,m}, this shows that

(20.44)    D_n(ω) ≤ D_{m,n}(ω) + m^{−1}.

If ω lies outside the union A of all the A_{x_{m,k}} and B_{x_{m,k}}, then lim_n D_{m,n}(ω) = 0 and hence lim_n D_n(ω) = 0 by (20.44). But A has probability 0. ∎
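As an added illustration (not part of the proof), D_n is easy to compute for simulated data, since for the step function F_n(·, ω) the supremum is attained at the order statistics of the sample; the choice of the standard exponential for F below is an arbitrary assumption, as are the names and sample sizes.

import numpy as np

def D_n(sample):
    # sup_x |F_n(x, w) - F(x)| for F(x) = 1 - exp(-x), checked at the order statistics
    x = np.sort(sample)
    n = len(x)
    F = 1.0 - np.exp(-x)
    upper = np.arange(1, n + 1) / n - F   # F_n(x_(i)) - F(x_(i))
    lower = F - np.arange(0, n) / n       # F(x_(i)) - F_n(x_(i)-)
    return max(upper.max(), lower.max())

rng = np.random.default_rng(4)
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(D_n(rng.exponential(size=n)), 4))   # decreases toward 0, as the theorem predicts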
20.2. 20.3. 20.4.
20.5.
20.6. 20. 7. 20.8.
2. 10 f A necessary and sufficient condition for a a-field t§ to be countably generated is that t§ == a( X) for some random variable X. Hint: If t§ == k a(A 1 , A 2 , • • • ), consider X == Ej/_ 1 f( IA �< )/10 , where /(x) is 4 for x == 0 and 5 for x =F 0. If X is a positive random variable with density /, x- 1 has density /(1/x)jx 2 • Prove this by (20.16) and by a direct argument. Suppose that a two-dimensional distribution function F has a continuous density f. Show that /(x, y) == a 2 F(x, y)jax ay. The construction in Theorem 20.4 requires only Lebesgue measure on the Use the theorem to prove the existence of Lebesgue measure unit interval. k on R . First construct A k restricted to ( - n , n ] X · · · X ( - n , n ], and then pass to the limit ( n --+ oo ) . The idea is to argue from first principles, and not to use previous constructions, such as those in Theorems 12.5 and 18.2. and latitude of a random point on 19.9 f Let 8 and _, be the longitude 3 the surface of the unit sphere in R ; probability is proportional to surface area. Show that 8 and _, are independent, 8 is uniformly distributed over [0, 2 w ), and _, is distributed over [ - w/2 , + w/2] with density ! cos 4-. Suppose that A , B, and C are positive, independent random variables with distribution function F. Show that the quadratic Az 2 + Bz + C has real zeros with probability /0/0F(x 2 /4y) dF(x) dF(y). Show that X1 , X2 , are independent if a( X1 , , Xn _ 1 ) and a ( Xn ) are independent for each n. Let X0 , X1 , be a persistent, irreducible Markov chain and for a fixed state j let T1 , 72 , . . . be the times of the successive passages through j. Let Z1 == T1 and Zn == T, - T, _ 1 ,k n > 2. Show that Z1 , Z2 , . . . are indepen dent and that P [ Zn == k ] == Jj� > for n > 2. •
•
•
•
•
•
•
•
•
277 20. RANDOM VARIABLES AND DISTRIBUTIONS are non are distribution functions and p 1 , p2 20.9. Suppose that F1 , F2 , negative and add to 1. Show that F( x) = E�= 1 p,, F,, ( x ) is a distribution function. Show that, if F,(x) has density /,, ( x ) for each n, then F( x ) has density E�_ 1 p,, In ( x ). 20. 10. Let XI ' . . . ' xn be independent, identically distributed random variables and let , be a permutation of 1, 2, . . . , n. Show that P(( X1 , , X,, ) e H) == p [( xw l ' . . . ' x'IT, ) H) for H qJ". 20.11. i Ranks and records. Let X1 , X2 be independent random variables with a common continuous distribution function. Let B be the w-set where xm ( w ) == xn ( w) for some pair m ' n of distinct integers, and show that P( B) == 0. Remove B from the space 0 on which the Xn are defined. This leaves the joint distributions of the xn unchanged and makes ties impossi ble. n n Let r < > ( w ) == ( T1< n > ( w ), . . . , r,< > ( w )) be that permutation ( t 1 , , tn ) of (1, . . . , n) for which X, ( w ) < X, ( w ) < · · · < X, ( w ). Let Y, be the rank of xn among x. ' . . . ' xn : Y, == / if and only if i; < xn for exactly r - 1 values of i preceding n. n (a) Show that r< > is uniformly distributed over the n ! permutations. (b) Show that P[ Y, == r] == 1 /n , 1 < nr < n. (c) Show that Yk is measurable a( T< > ) for k < n. (d) Show that Y1 , Yi , . . . are independent. 20.12. f Record values. Let An be the event that a record occurs at time n: max k < n xk < xn . are independent and P(An) == 1/n. (a) Show that A 1 , A 2 , (b) Show that no record stands forever. (c) Let Nn be the time of the first record after time n. Show that P [ Nn == n + k] == n ( n + k - 1) - 1 ( n + k) - 1 • SECTION
• •
• • •
E
E
•
, •
•
.
•
•
•
•
•
•
•
• • •
20.13. Use Fubini's theorem to prove that convolution of finite measures is 20.14. 20.15.
20.16.
20.17. 20.18. 20.19.
commutative and associative. In Example 20.6, remove the assumption that m is 0 for each of the normal densities. Suppose that X and Y are independent and have densities. Use (20.20) to find the joint density for ( X + Y, X) and then use (20.19) to find the density for X + Y. Check with (20.38). If F( x - £) < F( x + £) for all positive £, x is a point of increase of F (see Problem 12. 9). If F( x - ) < F( x ), then x" is an atom of F. (a) Show that, if x and y are points of increase of F and G, then x + y is a point of increase of F • G. (b) Show that, if x and y are atoms of F and G, then x + y is an atom of F • G. Suppose that I' and " consist of masses an and fln at n , n == 0, 1, 2, . . . . Show that I' • " consists of a mass of E'k - oa k /Jn - k at n, n == 0, 1, 2, . . . i Show that two Poisson distributions (the parameters may differ) con volve to a Poisson distribution. Show that, if X1 , , Xn are independent and have the normal distribution (20.12) with m == 0, then the same is true of ( X1 + · · · + Xn )/ {n . • • •
20.20. The Cauchy distribution has density

(20.45)  $c_u(x) = \frac{1}{\pi}\,\frac{u}{u^2 + x^2}, \qquad -\infty < x < \infty,$

for u > 0. (By (17.10), the density integrates to 1.)
(a) Show that c_u ∗ c_v = c_{u+v}. Hint: Expand the convolution integrand in partial fractions.
(b) Show that, if X₁, …, X_n are independent and have density c_u, then (X₁ + ⋯ + X_n)/n has density c_u as well. Compare with Problem 20.19.

20.21. ↑ (a) Show that, if X and Y are independent and have the standard normal density, then X/Y has the Cauchy density with u = 1.
(b) Show that, if X has the uniform distribution over (−π/2, π/2), then tan X has the Cauchy distribution with u = 1.

20.22. 18.22 ↑ Let X₁, …, X_n be independent, each having the standard normal distribution. Show that

$\chi_n^2 = X_1^2 + \cdots + X_n^2$

has density

(20.46)  $\frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{(n/2)-1} e^{-x/2}$

over (0, ∞). This is called the chi-squared distribution with n degrees of freedom.

20.23. ↑ The gamma distribution has density

(20.47)  $f(x;\alpha,u) = \frac{\alpha^u}{\Gamma(u)}\, x^{u-1} e^{-\alpha x}$

over (0, ∞) for positive parameters α and u. Check that (20.47) integrates to 1. Show that

(20.48)  $f(\cdot\,;\alpha,u) \ast f(\cdot\,;\alpha,v) = f(\cdot\,;\alpha,u+v).$

Note that (20.46) is f(x; ½, n/2), and from (20.48) deduce again that (20.46) is the density of χ_n². Note that the exponential density (20.10) is f(x; α, 1), and from (20.48) deduce (20.39) once again.

20.24. ↑ The beta function is defined for positive u and v by

$B(u,v) = \int_0^1 x^{u-1}(1-x)^{v-1}\,dx.$

From (20.48) deduce that

$B(u,v) = \frac{\Gamma(u)\Gamma(v)}{\Gamma(u+v)}.$

Show that B(u, v) = (u − 1)!(v − 1)!/(u + v − 1)! for integral u and v.

20.25. 20.23 ↑ Let N, X₁, X₂, … be independent, where P[N = n] = q^{n−1}p, n ≥ 1, and each X_k has the exponential density f(x; α, 1). Show that X₁ + ⋯ + X_N has density f(x; αp, 1).

20.26. Let A_{n,m}(ε) = [|Z_k − Z| ≤ ε, n ≤ k ≤ m]. Show that Z_n → Z with probability 1 if and only if lim_n lim_m P(A_{n,m}(ε)) = 1 for all positive ε, whereas Z_n →_P Z if and only if lim_n P(A_{n,n}(ε)) = 1 for all positive ε.

20.27. (a) Suppose that f: R² → R¹ is continuous. Show that X_n →_P X and Y_n →_P Y imply f(X_n, Y_n) →_P f(X, Y).
(b) Show that addition and multiplication preserve convergence in probability.

20.28. Suppose that the sequence {X_n} is fundamental in probability in the sense that for ε positive there exists an N_ε such that P[|X_m − X_n| > ε] < ε for m, n ≥ N_ε.
(a) Prove there is a subsequence {X_{n_k}} and a random variable X such that lim_k X_{n_k} = X with probability 1. Hint: Choose increasing n_k such that P[|X_m − X_{n_k}| > 2^{−k}] < 2^{−k} for m, n ≥ n_k. Analyze P[|X_{n_{k+1}} − X_{n_k}| > 2^{−k}].
(b) Show that X_n →_P X.

20.29. (a) Suppose that X₁ ≤ X₂ ≤ ⋯ and that X_n →_P X. Show that X_n → X with probability 1.
(b) Show by example that in an infinite measure space functions can converge almost everywhere without converging in measure.

20.30. If X_n → 0 with probability 1, then n⁻¹ Σ_{k=1}^n X_k → 0 with probability 1 by the standard theorem on Cesàro means [A30]. Show by example that this is not so if convergence with probability 1 is replaced by convergence in probability.

20.31. 2.17 ↑ (a) Show that in a discrete probability space convergence in probability is equivalent to convergence with probability 1.
(b) Show that discrete spaces are essentially the only ones where this equivalence holds: Suppose that P has a nonatomic part in the sense that there is a set A such that P(A) > 0 and P(·|A) is nonatomic. Construct random variables X_n such that X_n →_P 0 but X_n does not converge to 0 with probability 1.
20.32. 20.28 20.31 ↑ Let d(X, Y) be the infimum of those positive ε for which P[|X − Y| ≥ ε] ≤ ε.
(a) Show that d(X, Y) = 0 if and only if X = Y with probability 1. Identify random variables that are equal with probability 1, and show that d is a metric on the resulting space.
(b) Show that X_n →_P X if and only if d(X_n, X) → 0.
(c) Show that the space is complete.
(d) Show that in general there is no metric d₀ on this space such that X_n → X with probability 1 if and only if d₀(X_n, X) → 0.

SECTION 21. EXPECTED VALUES
Expected Value as Integral

The expected value of a random variable X on (Ω, ℱ, P) is the integral of X with respect to the measure P:

$E[X] = \int X\,dP = \int_\Omega X(\omega)\,P(d\omega).$

All the definitions, conventions, and theorems of Chapter 3 apply. For nonnegative X, E[X] is always defined (it may be infinite); for the general X, E[X] is defined, or X has an expected value, if at least one of E[X⁺] and E[X⁻] is finite, in which case E[X] = E[X⁺] − E[X⁻]; and X is integrable if and only if E[|X|] < ∞. The integral ∫_A X dP over a set A is defined, as before, as E[I_A X]. In the case of simple random variables, the definition reduces to that used in Sections 5 through 9.

Expected Values and Distributions

Suppose that X has distribution μ. If g is a real function of a real variable, then by the change-of-variable formula (16.17),

(21.1)  $E[g(X)] = \int_{-\infty}^{\infty} g(x)\,\mu(dx).$

(In applying (16.17), replace T: Ω → Ω′ by X: Ω → R¹, μ by P, μT⁻¹ by μ, and f by g.) This formula holds in the sense explained in Theorem 16.12: It holds in the nonnegative case, so that

(21.2)  $E[|g(X)|] = \int_{-\infty}^{\infty} |g(x)|\,\mu(dx);$
if one side is infinite, then so is the other. And if the two sides of (21.2) are finite, then (21.1) holds.

If μ is discrete and μ{x₁, x₂, …} = 1, then (21.1) becomes (use Theorem 16.9)

(21.3)  $E[g(X)] = \sum_r g(x_r)\,\mu\{x_r\}.$

If X has density f, then (21.1) becomes (use Theorem 16.10)

(21.4)  $E[g(X)] = \int_{-\infty}^{\infty} g(x)f(x)\,dx.$

If F is the distribution function of X and μ, (21.1) can be written E[g(X)] = ∫_{−∞}^{∞} g(x) dF(x) in the notation (17.16).
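The special cases (21.3) and (21.4) are easy to check numerically. The following sketch is purely illustrative and is not part of the original text; it assumes the NumPy library and uses g(x) = x², a four-point discrete distribution, and the exponential density e^{−x} as examples of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: x ** 2

# Discrete case, (21.3): X puts masses mu{x_r} on the points x_r.
xs = np.array([0.0, 1.0, 2.0, 3.0])
masses = np.array([0.1, 0.4, 0.3, 0.2])
print(np.sum(g(xs) * masses))                  # E[g(X)] = 3.4

# Density case, (21.4): X has the exponential density f(x) = e^{-x}, x > 0.
grid = np.linspace(0.0, 40.0, 400001)
dx = grid[1] - grid[0]
print(np.sum(g(grid) * np.exp(-grid)) * dx)    # Riemann sum, close to E[X^2] = 2

# Monte Carlo check of the same expectation: average g over simulated values.
sample = rng.exponential(scale=1.0, size=200000)
print(g(sample).mean())                        # also close to 2
```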
Moments

By (21.2), μ and F determine all the absolute moments of X:

(21.5)  $E[|X|^k] = \int_{-\infty}^{\infty} |x|^k\,\mu(dx) = \int_{-\infty}^{\infty} |x|^k\,dF(x), \qquad k = 1, 2, \ldots.$

Since j ≤ k implies that |x|^j ≤ 1 + |x|^k, if X has a finite absolute moment of order k, then it has finite absolute moments of orders 1, 2, …, k − 1 as well. For each k for which (21.5) is finite, X has kth moment

(21.6)  $E[X^k] = \int_{-\infty}^{\infty} x^k\,\mu(dx) = \int_{-\infty}^{\infty} x^k\,dF(x).$
These quantities are also referred to as the moments of μ and of F. They can be computed via (21.3) and (21.4) in the appropriate circumstances.

Example 21.1. Consider the normal density (20.12) with m = 0 and σ = 1. For each k, x^k e^{−x²/2} goes to 0 exponentially as x → ±∞, and so finite moments of all orders exist. Integration by parts shows that

$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^k e^{-x^2/2}\,dx = \frac{k-1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^{k-2} e^{-x^2/2}\,dx, \qquad k = 2, 3, \ldots.$

(Apply (18.16) to g(x) = x^{k−2} and f(x) = xe^{−x²/2}, and let a → −∞, b → ∞.) Of course, E[X] = 0 by symmetry and E[X⁰] = 1. It follows by induction that

(21.7)  $E[X^{2k}] = 1 \cdot 3 \cdot 5 \cdots (2k-1), \qquad k = 1, 2, \ldots,$

and that the odd moments all vanish. •

If the first two moments of X are finite and E[X] = m, then just as in Section 5, the variance is

(21.8)  $\mathrm{Var}[X] = E[(X-m)^2] = \int_{-\infty}^{\infty} (x-m)^2\,\mu(dx).$

From Example 21.1 and a change of variable, it follows that a random variable with the normal density (20.12) has mean m and variance σ².

Consider for nonnegative X the relation
(21.9)  $E[X] = \int_0^{\infty} P[X > t]\,dt = \int_0^{\infty} P[X \ge t]\,dt.$

Since P[X = t] can be positive for at most countably many values of t, the two integrands differ only on a set of Lebesgue measure 0, and hence the integrals are the same. For X simple and nonnegative, (21.9) was proved in Section 5; see (5.25). For the general nonnegative X, let X_n be simple random variables for which 0 ≤ X_n ↑ X (see (20.1)). By the monotone convergence theorem, E[X_n] ↑ E[X]; moreover, P[X_n > t] ↑ P[X > t], and so ∫₀^∞ P[X_n > t] dt ↑ ∫₀^∞ P[X > t] dt, again by the monotone convergence theorem. Since (21.9) holds for each X_n, a passage to the limit establishes (21.9) for X itself. Note that both sides of (21.9) may be infinite. If the integral on the right is finite, then X is integrable.

Replacing X by X·I_{[X > a]} leads from (21.9) to

(21.10)  $\int_{[X > a]} X\,dP = aP[X > a] + \int_a^{\infty} P[X > t]\,dt, \qquad a \ge 0.$

As long as a ≥ 0, this holds even if X is not nonnegative.
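As a purely illustrative aside (not in the original text), (21.9) can be checked by simulation; the sketch below assumes NumPy and uses an exponential variable with mean 2 as the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100000)    # X >= 0 with E[X] = 2

ts = np.linspace(0.0, 25.0, 2501)
tail = np.array([(x > t).mean() for t in ts])  # empirical P[X > t]
print(x.mean(), tail.sum() * (ts[1] - ts[0]))  # both close to E[X] = 2
```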
Inequalities

Since the final term in (21.10) is nonnegative, aP[X > a] ≤ ∫_{[X > a]} X dP ≤ E[X]. Thus

(21.11)  $P[X \ge a] \le \frac{1}{a}\int_{[X \ge a]} X\,dP \le \frac{1}{a}\,E[X], \qquad a > 0,$

for nonnegative X. As in Section 5, there follows the inequality

(21.12)  $P[|X| \ge a] \le \frac{1}{a^k}\int_{[|X| \ge a]} |X|^k\,dP \le \frac{1}{a^k}\,E[|X|^k].$

It is the inequality between the two extreme terms here that usually goes under the name of Markov; but the left-hand inequality is often useful, too. As a special case there is Chebyshev's inequality,

(21.13)  $P[|X - m| \ge a] \le \frac{1}{a^2}\,\mathrm{Var}[X]$

(m = E[X]).

Jensen's inequality

(21.14)  $\varphi(E[X]) \le E[\varphi(X)]$

holds if φ is convex on an interval containing the range of X and if X and φ(X) both have expected values. To prove it, let l(x) = ax + b be a supporting line through (E[X], φ(E[X])), a line lying entirely under the graph of φ [A33]. Then aX(ω) + b ≤ φ(X(ω)), so that aE[X] + b ≤ E[φ(X)]. But the left side of this inequality is φ(E[X]).

Hölder's inequality is

(21.15)  $E[|XY|] \le E^{1/p}\!\left[|X|^p\right]\,E^{1/q}\!\left[|Y|^q\right], \qquad \frac{1}{p} + \frac{1}{q} = 1.$

For discrete random variables, this was proved in Section 5; see (5.31). For the general case, choose simple random variables X_n and Y_n satisfying 0 ≤ |X_n| ↑ |X| and 0 ≤ |Y_n| ↑ |Y|; see (20.2). Then (5.31) and the monotone convergence theorem give (21.15). Notice that (21.15) implies that if |X|^p and |Y|^q are integrable, then so is XY.

Schwarz's inequality is the case p = q = 2:

(21.16)  $E[|XY|] \le E^{1/2}[X^2]\,E^{1/2}[Y^2].$

If X and Y have second moments, then XY must have a first moment. The same reasoning shows that Lyapounov's inequality (5.33) carries over from the simple to the general case.
RANDOM VARIABLES AND EXPECTED VALUES
Joint Integrals
The relation (21 .1) extends to random vectors. Suppose that ( X1 , has distribution p. in k-space and g: R k -+ R 1 • By Theorem 16. 1 2,
•
•
•
, Xk )
(21 .17) with the usual provisos about infinite values. For example, E [ X;Xi ] = fR "x ;xip.(dx). If E[ X;] = m ; , the covariance of X; and Xi is
Independence and Expected Value
Suppose that X and Y are independent. If they are also simple, then E [ XY] = E[ X]E[ Y], as proved in Section 5 -see (5.21). Define Xn by (20.2) and similarly define Yn = 1/ln( Y+ ) - 1/l n( Y-). Then Xn and Yn are independent and simple, so that E[ I XnYn l 1 = E[ 1 Xn 1 1 E [ I Yn l ], and 0 < 1 Xn 1 i l XI , 0 � I Yn l i I Yl . If X and Y are integrable, then E[ I XnYn l 1 = E [ 1 Xn 1 1 E [ I Yn 1 1 i E[ I XI ] · E[ l Yl ], and it follows by the monotone conver gence theorem that E[ I XYI ] < oo ; since XnYn -+ XY and I XnYnl < I XYI , it follows further by the dominated convergence theorem that E [ XY] = lim n E [ XnYn] = limnE[Xn]E[ Yn] = E[ X]E [ Y ]. Therefore, XY is integrable if X and Y are (which is by no means true for dependent random variables) and E[ XY] = E[ X]E[Y]. This argument obviously extends inductively: If X1 , , Xk are indepen dent and integrable, then the product xl . . . xk is also integrable and •
•
•
(21 .18) Suppose that fl1 and fl2 are independent a-fields, A lies in fl1 , X1 is measurable G1, and X2 is measurable fl2• Then /A X1 and X2 are indepen dent, so that (21 .18) gives (21 .19)
£x. x2 dP = £ X1 dP . E [ X2 ]
if the random variables are integrable. In particular, (21 .20)
SECTION
21 .
EXPECTED VALUES
285
From (21 .1 8) it follows just as for simple random variables (see (5.24)) that variances add for sums of independent random variables. It is even enough that the random variables be independent in pairs. Moment Generating Functions
The
moment generating function is defined as
(21 .21 )
M ( s ) = E [ e•X] =
J oooo e•xfl ( dx ) = J oooo e •x dF( x ) -
-
for all s for which this is finite (note that the integrand is nonnegative). Section 9 shows in the case of simple random variables the power of moment generating function methods. This function is also called the Laplace transform of p., especially in nonprobabilistic contexts. Now f0e sxp.(dx) is finite for s < 0, and if it is finite for a positive s, then it is finite for all smaller s. Together with the corresponding result for the left half-line, this shows that M(s) is defined on some interval containing 0. If X is nonnegative, this interval contains ( - oo, 0] and perhaps part of (0, oo ); if X is nonpositive, it contains [0, oo) and perhaps part of ( - oo , 0). It is possible that the interval consists of 0 alone; this happens for example, if p. is concentrated on the integers and p. { n } = p. { - n } = Cjn 2 for n = 1, 2, . . . . Suppose that M(s) is defined throughout an interval ( -s0, s0), where So > 0. Since elsxl < esx + e-sx and the latter function is integrable ,., for l s i < s0, so is Er- o l sx l kjk! = e lsxl . By the corollary to Theorem 16.7, p. has finite moments of all orders and
Thus M(s ) has a Taylor expansion about 0 with positive radius of convergence if it is defined in some ( - s0 , s0), s0 > 0. If M(s) can somehow be calculated and expanded in a series E k a k s k , and if the coefficients ak can be identified, then, since ak must coincide with E[ X k ]jk!, the moments of X can be computed: E[X k ] = a k k ! It also follows from the theory of Taylor expansions [A29] that a k k ! is the k th derivative M < k > ( s ) evaluated at s = 0:
(21 .23) This holds if
M ( k l (O ) = E [ X k ] =
/-00
oo
X kfl ( dx ) .
M(s) exists in some neighborhood of 0.
286
RANDOM VARIABLES AND EXPECTED VALUES
Suppose now that M is defined in some neighborhood of s. If " has density e s xjM(s ) with respect to p. (see (16.11)), then " has moment generating function N( u ) = M(s + u)/M(s ) for u in some neighborhood of 0. But then by (21 . 23 ) N < k >(O) = j 00 00 X k v( dx ) = f 00 00 X ke sxp.( dx )/M(s ), and since N < k >(O) = M< k >(s )/M(s), ,
M < k l (s ) =
( 21 . 24 )
oo
j- oox ke sxp. ( dx ) .
This holds as long as the moment generating function exists in some neighborhood of s. If s = 0, this gives ( 21.23 ) again. Taking k = 2 shows that M( s) is convex in its interval of definition.
Example
For the standard normal density,
21. 2.
M( s ) =
oo
j J2w - oo 1
es:xe - :x z /2 dx =
and a change of variable gives
h e •z /2J
oo
- oo
e - < x - s )z /2 dx ,
( 21 .25 )
The moment generating function is in this case defined for all s. Since es2 1 2 has the expansion e • z/2 =
f _!_ ( � ) k = f
k -0
k!
2
1 · 3 · · · (2k (2k ) ! k -0
-
1 ) s 2k
'
the moments can be read off from (21.22), which proves ( 21 . 7 ) once more. •
Exllltlple
21.3.
function
In the exponential case
(20.10)
the moment generating
(21 .26) is defined for s < a. By (21.22) the k th moment is k !a - k . The mean and • variance are thus a - 1 and a - 2.
Example ( 21 .27 )
21.4.
For the Poisson distribution (20.7),
SECTION
s M(s) and M'(s ) = Ae ce in S moments are M '(O) = A and both A.
21 .
287
EXPECTED VALUES
M"(s) = (A2e 2 s + Ae s ) M(s), the first two M"(O) = A2 + A; the mean and variance are •
Let X₁, …, X_k be independent random variables, and suppose that each X_i has a moment generating function M_i(s) = E[e^{sX_i}] in (−s₀, s₀). For |s| < s₀, each exp(sX_i) is integrable and, since they are independent, their product exp(s Σ_{i=1}^k X_i) is also integrable (see (21.18)). The moment generating function of X₁ + ⋯ + X_k is therefore

(21.28)  $M(s) = M_1(s) \cdots M_k(s)$

in (−s₀, s₀). This relation for simple random variables was essential to the arguments in Section 9.

For simple random variables it was shown in Section 9 that the moment generating function determines the distribution. This will later be proved for general random variables; see Theorem 22.2 for the nonnegative case and Section 30 for the general case.
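As an illustrative aside (not part of the original text), (21.28) and (21.23) can be checked numerically. The sketch assumes NumPy and uses two independent exponential variables with parameter α = 2, for which the transform is α/(α − s) for s < α (cf. (21.26)); the parameter values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, s = 2.0, 0.5
x = rng.exponential(scale=1.0 / alpha, size=400000)
y = rng.exponential(scale=1.0 / alpha, size=400000)

# (21.28): the transform of the sum is the product of the transforms,
# here (alpha/(alpha - s))^2 for s < alpha.
lhs = np.exp(s * (x + y)).mean()                   # M_{X+Y}(s), empirically
rhs = np.exp(s * x).mean() * np.exp(s * y).mean()  # M_X(s) M_Y(s)
print(lhs, rhs, (alpha / (alpha - s)) ** 2)

# (21.23): numerical derivatives of M(s) = alpha/(alpha - s) at s = 0
# recover the moments E[X^k] = k!/alpha^k.
M = lambda t: alpha / (alpha - t)
h = 1e-5
print((M(h) - M(-h)) / (2 * h), 1.0 / alpha)                   # E[X]
print((M(h) - 2 * M(0.0) + M(-h)) / h ** 2, 2.0 / alpha ** 2)  # E[X^2]
```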
PROBLEMS 21.1.
Prove
1 J2w
11.2. 11.3. 21.4. 21.5. 11.6.
21.7.
1 00 e- x 12 dx -
,
oo
2
== 1 -
1 12 ,
differentiate k times with respect to t inside the integral (justify), and derive (21. 7) again. 2 n + 1 ] == X Show that, if X has the standard normal distribution, then E[ 1 1 n
2 n h/2/w . 16.23 f Suppose that X1 , X2 , . are identically distributed (not neces sarily independent) and E( I X1 1 ] < oo . Show that E[max k n i Xk 1 1 == o( n ). 20.12 f Records. Consider the sequence of records in the sense of Problem 20.12. Show that the expected waiting time to the next record is infinite. 20.20 f Show that the Cauchy distribution has no mean. Prove the first Borel-Cantelli lemma by applying Theorem 16.6 to indicator random variables. Why is Theorem 16.6 not enough for the second • •
s
Borel-Cantelli lemma? Derive (21.11) by imitating the proof of (5.26). Derive Jensen's inequality (21.14) for bounded X and q> by passing to the limit from the case of simple random variables (see (5.29)).
288
21.8. 21.9.
RANDOM VARIABLES AND EXPECTED VALUES
Prove (21.9) by Fubini' s theorem. Prove for integrable X that
E [ X) -
{ P[ X > t) dt - t "
0
- oo
P[ X < t) dt.
21.10. (a) Suppose that X and Y have first moments and prove
E [ Y ] - E [ X)
...
/-cooo ( P [ X < t <
Y) - P ( Y < t < X ] ) dt .
(b)
Let ( X, Y] be a nondegenerate random interval. Show that its expected length is the integral with respect to t of the probability that it covers t. 21.11. Suppose that X and Y are random variables with distribution functions F and G. (a) Show that if F and G have no common jumps, then E[ F( Y)] +
E[ G( X)] == 1. (b) If F is continuous, then E[ F( X)] == ! . (c) Even if F and G have common jumps, if X and Y are taken to be independent, then E[ F( Y)] + E[G( X)] == 1 + P[ X == Y]. (d) Even if F has jumps, E[ F( X)] == ! + 1E x P 2 ( X == x]. 21.12. Let X have distribution function F. (a) Show that E[ X+ ] < oo if and only if J: < - log F( t)) dt < oo for some a. (b) Show more precisely that, if
f
X> o
X dP < a ( - log F( a )) +
a
and F( a ) are positive, then
l eo ( - log F( t ) ) dt < F(l ) f G
a
X> o
X dP.
21.13. Suppose that X and Y are nonnegative random variables, r > 1, and
P[ Y � t] < t - 1/y:?. , X dP for t > 0. Use (21.9), Fubini's theorem, and Holder's inequality to prove E( Y' ] < (rj( r - 1))'E[ X' ). 21. 14. Let X1 , X2 , . be identically distributed random variables with finite sec ond moment. Show that nP[ I X1 1 > £{,;' ] � 0 and n - 1 12 max k < n i X�c l • •
� P o.
21.15. Random variables X and Y with second moments are uncorre/ated if
E[ XY ] == E[ X] E( Y].
(a) Show that independent random variables are uncorrelated but that the
converse is false. (b) Show that, if X1 , , Xn are uncorrelated (in pairs), then Var[ X1 + · · · + Xn ] == Var[ X1 ] + · · · + Var[ Xn 1 · 21.16. f Let X, Y, and Z be independent random variables such that X and Y assume the values 0, 1, 2 with probability t each and Z assumes the values 0 and 1 with probabilities l and 1 . Let X' == X and Y' == X + Z (mod 3). (a) Show that X', Y', and X' + Y' have the same one-dimensional distri butions as X, Y, and X + Y, respectively, even though ( X', Y') and ( X, Y) have different distributions. • • •
21 . EXPECTED VALUES 289 (b) Show that X' and Y' are dependent but uncorrelated. (c) Show that, despite dependence, the moment generating function of X' + Y' is the product of the moment generating functions of X' and Y'. Suppose that X and Y are independent, nonnegative random variables and that E[ X] == oo and E[ Y] == 0. What is the value common to E( XY] and E[ X] E[ Y]? Use the conventions (15.2) for both the product of the random variables and the product of their expected values. What if E[ X] == oo and 0 < E[ Y] < oo? Deduce (21.18) for k == 2 from (21.17), (20.23), and Fubini' s theorem. Do not in the nonnegative case assume integrability. Suppose that X and Y are independent and that f( x , y) is nonnegative. Put g(x ) == E[f(x, Y)] and show that E[ g( X)] == E( f( X, Y)]. Show more gen erally that Ix A g( X) dP == Ix A /( X, Y) dP. Extend to / that may be negative. f The integrability of X + Y does not imply that of X and Y separately. Show that it·does if X and Y are independent. SECTION
21 .17.
21.18. 21.19.
e
21.20.
&
21.21. The dominated convergence theorem (Theorem 16.4) and the theorem on
uniformly integrable functions (Theorem 16.13) of course apply to random variables. Show that they hold if convergence with probability 1 is replaced by convergence in probability.
f Show that if E( l Xn - XI ) -+ 0, then Xn -+ p X. Show that the converse is false but holds under the added hypothesis that the xn are uniformly integrable. If E[ I xn - X I ] -+ 0, then xn is said to converge to X in the mean . 21.13. f Write d1 ( X, Y) == E[ I X - Yl /(1 + I x· - Yl )]. Show that this is a met ric equivalent to the one in Problem 20.32. 21.24. 5.10 f Extend Minkowski's inequality (5.36) to arbitrary random variables.
21.22.
Suppose that p > 1. For a fixed probability space ( 0 , S', P), consider the set of random variables satisfying E[ l X I P ] < oo and identify random variables that differ only on a set1 of probability 0. This is the space L P == L P (Q, .F, P). Put dp ( X, Y) == E 1P ( I X - Y I P ] for X, y E LP. (a) Show that L P is a metric space under dP . (b) Show that L P is complete. 2.10 16.7 21.25 f Show that L P ( Q, S', P) is separable if S' is sep arable, or if §' is the completion of a separable a-field. 21 .25 f For X and Y in L2 put ( X, Y) == E[ XY]. Show that ( · , · { is an inner product for which d2 ( X, Y) == ( X - Y, X - Y). Thus L is a Hilbert space. 5.11 f Extend Problem 5.11 concerning essential suprema to random variables that need not be simple. Consider on (1, oo) densities that are multiples of e - ax , e - x2, x - 2e - ax , x - 2 ; also consider reflections of these through 0 and linear combinations of
21.25. 20.28 21.21 21.24 f
21.26. 21.17. 21.28. 21.29.
290
21.30. 21.31.
21.32. 21.33.
21.34. 21.35.
RANDOM VARIABLES AND EXPECTED VALUES
them. Show that the convergence interval for a moment generating function can be any interval containing 0. There are 16 possibilities, depending on whether the endpoints are 0, finite and different from 0, or infinite, and whether they are contained in the interval or not. For the density Cexp( - l x l 1 12 ), - oo < x < oo , show that moments of all orders exist but that the moment generating function exists only at s == 0. 16.8 f Show that a moment generating function M( s ) defined in ( - s0 , s0 ) , s0 > 0, can be extended to a function analytic in the strip [ z: - s0 < Re z < s0 ] . If M ( s ) is defined in [0, s0 ), s0 > 0, show that it can be extended to a function continuous in [ z: 0 < Re z < s0 ] and analytic in [z: 0 < Re z < s0 ]. Use (21 .28) to find the generating function of (20.39). The joint moment generating function of ( X, Y) is defined as M( s, t ) == E( esX + t Y]. Assume that it exists for ( s, t) in some neighborhood of the origin, expand it in a double series, and conclude that E [ xm yn ] is a m + nMjas m at n evaluated at s .. t == 0. For independent random variables having moment generating functions, show by (21 .28) that the variances add. 20.23 f Show that the gamma density (20.47) has moment generating function (1 - sja ) - uk for s < a. Show that the k th moment is u( u + 1) · · · ( u + k - 1 )/a . Show that the chi-squared distribution with n de grees of freedom has mean n and variance 2 n . SECI10N 22. SUMS OF INDEPENDENT RANDOM VARIABLES
Let X1 , X2, be a sequence of independent random variables on some probability space. It is natural to ask whether the infinite series E�_ 1 Xn converges with probability 1, or as in Section 6 whether n - 1 Ek_1 Xk converges to some limit with probability 1. It is to questions of this sort that the present section is devoted. Throughout the section, Sn will denote the partial sum Ek _ 1 Xk (S0 = 0) . •
•
•
The Strong Law of Large Numbers
The central result is a general version of Theorem 6.1 .
are independent and identically distributed and have finite mean, then S,jn --+ E [ X1 ] with probability 1. Theorem 22. 1.
If X1 , X2,
•
•
.
Formerly this theorem stood at the end of a chain of results. following argument, due to Etemadi, proceeds from first principles.
The
SECTION
22.
SUMS OF INDEPENDENT RANDOM VARIABLES
291
pJtOO F.
If the theorem holds for nonnegative random variables, then ,.- 1sn = n - 1Ei:_ 1 Xt - n - 1Ei:_ 1 X; -+ E[ X(] - E[ X}] = E[ X1 ] with probability 1. Assume then that Xk > 0. Consider the truncated random variables Yk = Xk ll x. k J and their par n tial sums Sn* = Ei:_ 1 Yk . For a > 1, temporarily fixed, let un = [a ], where the brackets denote integer part. The first step is to prove s
>£ <
(22 .1)
Since the
00 .
Xn are independent and identically distributed, n
n
k-1
k-1
Var[Sn* ] = E Var ( Yk ] < E E [ Yk2 ] n
= E E [ x.2J( X, s k ) ] k-1
s:
nE [ x.2I( x, s n d .
It follows by Chebyshev' s inequality that the sum in (22.1) is at most
Let K = 2aj(a - 1) and suppose x > 0. If N is the smallest n such that N u,. � x, then a > x, and since y � 2 [y] for y > 1,
E u ; 1 � 2 E a - n = Ka - N < Kx - 1 •
u,. � x
n�N
Therefore, E:_ 1 u n 1/[ x1 s u,. J < KX} 1 for X1 > 0, and the sum in (22.1) is at most K£ - 2E [ X1 ] < oo . From (22.1) it follows by the first Borel-Cantelli lemma (take a union over positive, rational £) that (Su* - E[Su* 1)/u n -+ 0 with probability 1 . But 1 1 by the consistency of Cesaro sunimation (:�30], n - E[Sn*1 = n - Ei:_ 1 E[Yk ] has the same limit as E[Yn ], namely, E[X1 ]. Therefore Su* fun -+ E[X1 ] with probability 1. Since II
00
00
E P [ Xn ¢ Yn ] = nE P [ X1 > n ] s: ,_ 1 -=1
1
00
0
P [ X1 > t ] dt = E [ X. J <
oo ,
292
RANDOM VARIABLES AND EXPECTED VALUES
another application of the first Borel-Cantelli lemma shows that Sn )/n --+ 0 and hence
(Sn* -
( 22 .2) with probability 1 . If u n < k < u n + l' then since
X; > 0,
u n + 1 /u n --+ a, and so it follows by (22.2) that ! E [ X1 ) � lim inf £k � lim sup £k < aE ( X1 ) a k k k k with probability 1. This is true for each a > 1 . Intersecting the correspond ing sets over rational a exceeding 1 gives Iim k Sk/k = E[X1 1 with probabil But
•
ity 1 .
Although the hypothesis that the xn all have the same distribution is used several times, independence is used only through the equation Var[ Sn* 1 = EZ - tVar[ Yk 1 , and for this it is enough that the xn be independent in pairs. The proof given for Theorem 6.1 of course extends beyond the case of simple random variables, but it requires E[Xt1 < oo .
Suppose that X1 , X2 , . . . are independent and identically distrib uted and E [ X1- 1 < oo, E[Xi1 = oo (so that E[ X1 1 = oo). Then n - 1EZ_ 1 Xk oo with probability 1 . By the theorem, n - 1 EZ_ 1 X; --+ E[ X1- 1 with probability 1, and so PROOF. it suffices to prove the corollary for the case X1 = X1+ � 0. If xn if 0 < xn � u, u xn< > = 0 if xn > u, then n - 1Ek_ 1 Xk � n - 1Ek_ 1 X�"> --+ E[Xf"> 1 by the theorem. Let u --+ oo . Corollary. -+
{
•
1be Weak Law and Moment Generating Functions
The statement and proof of Theorem 6.2, the weak law of large numbers, carry over word for word for the case of general random variables with second moments-only Chebyshev' s inequality is required. The idea can be
SECTION
22.
SUMS OF IN DEPENDENT RANDOM VARIABLES
293
used to prove in a very simple way that a distribution concentrated on [0 , oo ) is uniquely determined by its moment generating function or Laplace transform. For each A, let Y" be a random variable (on some probability space) having the Poisson distribution with parameter A. Since Y" has mean and variance A (Example 21 .4), Chebyshev' s inequality gives
Let
G"
Y"/A, so that [At] k A " E e - k ,. .
be the distribution function of
G"( t ) =
k -0
-
The result above can be restated as if t > 1 , if t < 1 .
( 22 .3)
In the notation of Section 14, G"(x) => &(x - 1) as A � oo . Now consider a probability distribution p. concentrated on [0, oo ). Let F be the corresponding distribution function. Define
s > 0;
(22 .4)
here 0 is included in the range of integration. This is the moment generating function (21 .21), but the argument has been reflected through the origin. It is a one-sided Laplace transform, defined for all nonnegative s. For positive s, (21.24) gives
.
00 k k > M< = (22 .5 ) ( s ) ( - 1 ) y ke - s vp. ( dy) . 0 Therefore, for positive x and s, k k ( 1 ( 1 sy ( ) ( ) kM< k ) ( s ) = - sy , fl ( dy ) e s (22.6) E E -� k. k.
1 0
k -0
= Fix
the
1 oo fooo GsA; ) fl( dy ) . SX
SX
k-0
x > 0. If t 0 < y < x, then Gsy (xjy) --. 1 as s � oo by (22.3); if y > x, limit is 0. If p. { x } = 0, the integrand on the right in (22.6) thus
t lf y = 0, the integrand in (22.5) is 1 for k = 0 and 0 for k in the middle term of (22.6) is 1.
>
1 ; hence for y = 0, the integrand
294
RANDOM VARIABLES AND EXPECTED VALUES
converges as s --+ oo to Iro. x 1 (y) except on a set of p.-measure bounded convergence theorem then gives [sx ]
(22 .7)
lim E
s -. 00
(
_
k 1) ,
k.
k -o
0.
The
s kM < k > ( s ) = p. (O, x ] = F( x ) .
Thus M( s) determines the value of F at x if x > 0 and p. { x } = 0, which covers all but countably many values of x in [0, oo ). Since F is right-con tinuous, F itself and hence p. are determined through (22. 7) by M( s ) . In fact p. is by (22. 7) determined by the values of M( s) for s beyond an arbitrary s0: Theorem 22.2.
Let p. and be probability measures on [0, oo ). If
where s 0 > 0, then p. Corollary.
Jl
= Jl.
Let ft and /2 be real functions on [0, oo ). If
where s0 > 0, then ft = /2 outside a set of Lebesgue measure 0. /; need not be nonnegative, and they need e - sx/;(x) must be integrable over [0, oo) for s > s0 • The
not be integrable, but
For the nonnegative case, apply the theorem to the probability densities g; (x) = e - so x/;(x)jm, where m = J:e - sox/;(x) dx, i = 1, 2. For • the general case, prove that It + /2 = fi + /1 almost everywhere. PROOF.
Example 22.1. If μ₁ ∗ μ₂ = μ₃, then the corresponding transforms (22.4) satisfy M₁(s)M₂(s) = M₃(s) for s ≥ 0. If μᵢ is the Poisson distribution with mean λᵢ, then (see (21.27)) Mᵢ(s) = exp[λᵢ(e^{−s} − 1)]. It follows by Theorem 22.2 that if two of the μᵢ are Poisson, so is the third, and λ₁ + λ₂ = λ₃. •
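As an illustrative aside (not part of the original text), the inversion formula (22.7) can be tried out numerically. For the exponential distribution with parameter α, the transform (22.4) is M(s) = α/(α + s) and M^{(k)}(s) = (−1)^k k! α/(α + s)^{k+1}, so the kth term of (22.7) reduces to (α/(α + s))(s/(α + s))^k; the sketch below (plain Python, with parameter values chosen only for illustration) sums these terms and watches them approach F(x) = 1 − e^{−αx} as s grows.

```python
import math

alpha, x = 2.0, 1.5
F_true = 1.0 - math.exp(-alpha * x)        # F(x) for the exponential law

for s in [5.0, 20.0, 100.0, 1000.0]:
    r = s / (alpha + s)
    # Sum over k <= [sx] of (-1)^k s^k M^(k)(s) / k!, in its simplified form.
    total = sum((alpha / (alpha + s)) * r ** k for k in range(int(s * x) + 1))
    print(s, total, F_true)                # the sum approaches F(x) as s grows
```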
•
•
Kolmogorov's Zero-One Law
Consider the set A of w for which n - t Ek- tXk (w) --+ 0 as n --+ oo . For each m , the values of Xt (w), . . . , Xm t (w) are irrelevant to the question of -
SECTION
22.
SUMS OF INDEPENDENT RANDOM VARIABLES
295
whether or not w lies in A, and so A ought to lie in the a-field o( Xm , X,. + 1 , ). In fact, limn n - 1EZ'-JXk (w) = 0 for fixed m, and hence w lies in l A if and only if lim n n - L� - m Xk ( w) = 0. Therefore, • • •
[
A = n U n w: n-1
(22 .8)
f
N�m n> N
Xk (w ) t k-m
<
t],
the first intersection extending over positive rational £ . The set on the inside lies in o ( Xm , Xm + 1 , ), and hence so does A. Similarly, the w-set where the series E n Xn (w) converges lies in each o ( Xm , Xm + 1 , ). The intersection !T= n :_ 1a ( Xn, Xn + 1 , ) is the tail a-field associated with the sequence X1 , X2 , ; its elements are tail events. In the case xn = IA ,' this is the a-field (4.29) studied in Section 4. The following general ' form of Kolmogorov s zero-one law extends Theorem 4.5. • • •
• • •
• • •
• • •
If { Xn } is independent, and if A E !T= n :_ 1 a(Xn , Xn + 1 , ), then either P(A) = 0 or P(A) = 1 . PROOF. Let $0 = Uk_ 1a( X1 , , Xk ). The first thing to establish is that � is a field generating the a-field a( X1 , X2 , ) If B and C lie in $0 , then B e a( X1 , . . . , �) and C e a ( X1 , . . . , Xk ) for some j and k; if m -= max { j, k }, then B and C both lie in a( X1 , . . . , Xm ), so that B U C e a( X1 , , Xm ) c $0 . Thus $0 is closed under the formation of finite unions; since it is similarly closed under complementation, $0 is a field. For H e fil l , [ Xn e H] e � c a($0 ), and hence Xn is measurable a ( � ) ; thus .F0 generates a (X1 , X2 , ) (which in general is much larger than $0 ). Suppose that A lies in !T. Then A lies in a( Xk +l ' Xk + 2 , ) for each k. Therefore, if B e a( X1 , , Xk ), then A and B are independent by Theo rem 20.2. Therefore, A is independent of $0 and hence by Theorem 4.2 is also independent of a( X1 , X2 , ) But then A is independent of itself: P( A n A) = P(A)P(A). Therefore, P(A) = P 2 (A), which implies that P(A) is either 0 or 1 . • Deorem 22.3. • • •
• • •
• • •
•
.
.
.
• • .
• • •
. • .
• • •
.
As noted above, the set where En Xn ( w) converges satisfies the hypothesis of Theorem 22.3, and so does the set where n - 1EZ _ 1 Xk (w) --+ 0. In many similar cases it is very easy to prove by this theorem that a set at hand must have probability either 0 or 1 . But to determine which of 0 and 1 is, in fact, the probability of the set may be extremely difficult. In connection with the problem of verifying the hypotheses of Theorem 22.3, consider this example: Let 0 = R l , let A be a subset of R 1 not contained in fll 1 (see the end of Section 3), and let $ be the a-field generated by fll 1 u { A }.
E
ple
xam
22. 2.
296
RANDOM VARIABLES AND EXPECTED VALUES
Take X1 ( w ) = IA (w) and X2 (w) = X3 (w) = · · · = w. Then A lies in a ( x1 , x2 , . . . ) Fix m. If one knows Xm (w), Xm + 1 (w), . . . , then he knows w itself and hence knows whether or not w lies in A. To put it another way, if X1 (w') = X1( w" ) for all j > m, then IA(w') = IA(w"); therefore, the values of X1 (w), . . . , Xm _ 1 (w) are irrelevant to the question of whether or not w E A. Does A lie in o( Xm , xm + 1 ' • • • )? For m � 2 the answer is no, because o( Xm , Xm + 1 , • • • ) = gJ 1 and A � gJ 1 • • .
Although the issue is a technical one of measurability, this example shows that in checking the hypothesis of Theorem 22.3 it is not enough to show for each m that A does not depend on the values of X1 , • • • , Xm _ 1 • It must for each m be shown by some construction like (22.8) that A does depend in a measurable way on xm , xm + 1 ' • • • alone. Maximal Inequalities
Essential to the study of random series are maximal inequalities-inequali ties concerning the maxima of partial sums. The best known is that of Kolmogorov.
Suppose that X1 , . • • , Xn are independent with mean 0 and finite variances. For > 0 Theorem 22.4.
a
( 22.9 ) Let A k be the set where are disjoint,
PROOF.
Ak
I Sk i
�
a
but
ISi l
<
a
for j' < k. Since the
n E [ s; ] > E J s; dP k=l
n - E
Ak
k-l
jA [ sf + 2Sk ( Sn - Sk ) + ( Sn - Sk ) 2] dP
k -= l
jA [ sf + 2Sk ( Sn - Sk ) ] dP.
n > E
k
k
A k and Sk are measurable o( X1 , • • • , Xk ) and Sn - Sk is measurable o( Xk+ 1 , • • • , Xn ), and since the means are all 0, it follows by (21.19) and Since
SECTION
22.
SUMS OF INDEPENDENT RANDOM VARIABLES
297
independence that fAkSk(Sn - Sk ) dP = 0. Therefore, n n E [s;] > L Sf dP > L a2P ( Ak ) k-1 k - 1 Ak
f
• By Chebyshev's inequality, P [ I Sn l > a ] < a - 2 Var[ Sn 1· That this can be strengthened to (22.9) is an instance of a general phenomenon: For sums of independent variables, if max k s n i Sk l is large, then I Sn l is probably large as well. Theorem 9.6 is an instance of this, and so is the following result, due to Etemadi.
Suppose that X1 ,
Theorem 22.5.
(22 .10 )
P
• • •
[ 1 ms kaxs n I Sk l � 4a]
, Xn
are independent. For a > 0
< 4 max P [ I Sk l � a] . n
1 sks
Let Bk be the set where I Sk i > 4 a but I Si l < 4 a for j < k. Since the Bk are disjoint, n-1 max I Sk l � 4 a < P [ I Sn l > 2 a ] + L P ( Bk n [ I Sn l < 2 a] ) 1 sksn k-1 n-1 < P [ I Sn l > 2a ] + L P ( Bk n [ I Sn - Sk i > 2 a] ) k-1 n-1 = P ( I Sn l > 2 a ] + L P ( Bk ) P ( I Sn - Sk i > 2 a ] k-1
PROOF.
]
P[
< 2 max P [ I Sn - Sk i > 2 a]
Os k < n
Now replace 4 a by 2 a and apply the same argument to the partial sums of the random variables in reverse order: Xn , . . . , X1 ; it follows that
P
[ Oms kax< n I Sn - Sk i > 2 a ] � 2 I ms kaxs n P [I Sk l > a ] .
These two inequalities give (22.10).
•
298
RANDOM VARIABLES AND EXPECTED VALUES
If the Xk have mean 0 and Chebyshev' s inequality is applied to the right side of (22.10), and if a is replaced by a/4, the result is Kolmogorov' s inequality (22.9) with an extra factor of 64 on the right side. For this reason, the two inequalities are equally useful for the applications in this section. Convergence of Random Series
For independent Xn , the probability that EXn converges is either 0 or 1 . It is natural to try and characterize the two cases in terms of the distributions of the individual xn .
Suppose that { Xn } is an independent sequence and E[Xn ] = 0. If E Var[Xn 1 < oo , then EXn converges with probability 1. PROOF. By (22.9),
Theorem 12.6.
Since the sets on the left are nondecreasing in
r, letting r --+
oo
gives
Since E Var[ Xn ] converges, (22 .1 1 )
for each £. Let E( n, £) be the set where supj , k ni Si - Sk i > 2£ , and put E( £) = n n E(n, £). Then E (n , £) � E(£), and (22 . 11 ) implies P( E(£)) = 0. Now U E ( £ ) , where the union extends over positive rational £, contains the set where the sequence { Sn } is not fundamental (does not have the Cauchy • property), and this set therefore has probability 0. �
(
Example 22.3. Let X_n(ω) = r_n(ω)a_n, where the r_n are the Rademacher functions on the unit interval (see (1.12)). Then X_n has variance a_n², and so Σ a_n² < ∞ implies that Σ r_n(ω)a_n converges with probability 1. An interesting special case is a_n = n⁻¹: if the signs in Σ ±n⁻¹ are chosen on the toss of a coin, then the series converges with probability 1. The alternating harmonic series 1 − 2⁻¹ + 3⁻¹ − ⋯ is thus typical in this respect. •
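As a purely illustrative aside (not part of the original text), the random signed series Σ ±1/n of Example 22.3 can be simulated; the sketch assumes NumPy, and different seeds give different, but convergent, sums.

```python
import numpy as np

N = 10 ** 6
for seed in range(3):
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=N)        # coin-tossed signs r_n
    partial = np.cumsum(signs / np.arange(1, N + 1))
    # The partial sums settle down: the change over the second half is tiny.
    print(partial[-1], abs(partial[-1] - partial[N // 2]))
```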
SECTION
22.
299
SUMS OF INDEPENDENT RANDOM VARIABLES
If EXn converges with probability 1, then Sn converges with probability 1 to some finite random variable S. By Theorem 20.5, this implies that S,. _... p S. The reverse implication of course does not hold in general, but it does if the summands are independent.
For an independent sequence { Xn }, the Sn converge with probability 1 if and only if they converge in probability. It is enough to show that if Sn --+ p S, then { Sn } is fundamental pJtOOF. with probability 1. Since 1beorem
22.7.
P ( I Sn +j - Sn l �
t)
[
s P l sn +j - S l
i] + [
>
P l sn - S l
>
i ].
Sn -+p S implies
(22. 12) But by
sup P ( I Sn +J - Sn l lim n }� 1
> £)
=
0.
(22.10), P
[
m � I Sn +J - Sn l
1 Sj S k
>£
] <4
[
m� P I Sn +J - Sn l
1 Sj S k
>
£ 4
],
and therefore
It now follows by before.
(22.12) that (22.11) holds, and the proof is completed as
•
The final result in this direction, the three series theorem, does provide necessary and sufficient conditions for the convergence of EXn in terms of the individual distributions of the xn . Let x� c) be xn truncated at c: X
Theorem 22.8.
aeries (22.1 3 ) In
Suppose that { Xn } is independent and consider the three
E P [ I Xn l
> c] ,
order that EXn converge with probability 1 it is necessary that the three Series converge for all positive c and sufficient that they converge for some positive c.
300
RANDOM VARIABLES AND EXPECTED VALUES
Suppose that the series (22.13) converge and put m�c > = E [ X�c>]. By Theorem 22.6, E( X�c> - m�c > ) converges with probabil ity 1 , and since Em�c> converges, so does EX�c >. Since P[ Xn ¢ x�c > i.o.] = 0 by the first Borel-Cantelli lemma, it follows finally that EXn converges with probability 1 . • PROOF OF SUFFICIENCY.
Although it is possible to prove necessity in the three series theorem by the methods of the present section, the simplest and clearest argument uses the central limit theorem as treated in Section 27. This involves no circular ity of reasoning, since the three series theorem is nowhere used in what follows. Suppose that EXn converges with probability 1 and fix c > 0. Since Xn --+ 0 with probability 1, it follows that Ex�c > converges with probability 1 and, by the second Borel-Cantelli lemma, that EP[ I xn I > c] < oo . Let M�c > and s!c > be the mean and standard deviation of s�c > = Ek - t x�c > . If s!c> --+ oo , then since the x�c> - m�c > are uniformly bounded, it follows by the central limit theorem (see Example 27.4) that PROOF OF NECESSITY.
( 22.14)
sn< c> - Mn< c>
And since
EX�c>
s�c >;s!c> --+
Y
2
2 1 1 dt. e j V2 '1T x
converges with probability 1, s!c> --+ 0 with probability 1, so that (Theorem 20.5)
limP n [ IS�c>;s!c> l
( 22.15 ) But
=
1
also implies
> £ ] = 0.
(22.14) and (22.15) stand in contradiction: p
oo
Since
< c> < c> - Mn< c> s s n n X< <£ < y, sn< c> sn< c >
is greater than or equal to the probability in (22.14) minus that in (22.15) , it is positive for all sufficiently large n (if x < y ) But then .
X-£<
- Mn< c>;sn< c> < y + £ '
and this cannot hold simultaneously for, say, (x - £, y + £) = ( - 1, 0) and (x - £, y + £ ) = (0, 1). Thus s!c> cannot go to oo , and the third series in (22.13) converges.
SECTION
22.
SUMS OF INDEPENDENT RAN DOM VARIABLES
301
And now it follows by Theorem 22.6 that E( X� ( ) - m �c) ) converges with probability 1 , so that the middle series in (22. 1 3) converges as well. • If Xn = rn a n , where rn are the Rademacher functions, then Ea; < oo implies that E Xn converges with probability 1 . If E Xn converges, then a n is bounded, and for large c the convergence of the series (22 . 1 3) implies E a; < oo : If the signs in E + a n are chosen on the toss of a coin, then the series converges with probability 1 or 0 according as Ea; converges or diverges. If Ea; converges but E l a nl diverges, then E + a n is • with probability 1 conditionally but not absolutely convergent. Example 22.4.
Theorems 22.6, 22.7, and 22.8 concern conditional convergence, and in the most interesting cases, E xn converges not because the xn go to 0 at a high rate but because they tend to cancel each other out.t Random Taylor Series *
Consider a power series E ± z n , where the signs are chosen on the toss of a coin. The radius of convergence being 1, the series represents an analytic function in the open unit disc D0 = [z: l z l < 1] in the complex plane. The question arises whether this function can be extended analytically beyond D0 • The answer is no: With probability 1 the unit circle is the natural boundary. 1beorem 22.9.
Let { xn } be an independent sequence such that n = 0, 1, . . . .
(22.16 )
There is probability 0 that F( w,
(22.17 )
Z
00
) = E Xn ( W ) z n n-O
coincides in D0 with a function analytic in an open set properly containing D0• It will be seen in the course of the proof that the w-set in question lies in a ( X0 , X 1 , ) and hence has a probability. It is intuitively clear that if the set is measurable at all, it must depend only on the Xn for large n and hence • • •
t See Problem 22.3 . *This topic, which requires complex variable theory, may be omitted.
302
RANDOM VARIABLES AND EXPECTED VALUES
must have probability either 0 or 1. PROOF.
Since n = 0, 1 , . . .
( 22.18 )
with probability 1, the series in (22.17) has radius of convergence 1 outside a set of measure 0. Consider an open disc D = [z: l z - r l < r ] , where r E Do and r > 0. Now (22.17) coincides in D0 with a function analytic in D0 u D if and only if its expansion
about r converges at least for this holds. The coefficient am ( "' ) =
�!
lz - rl
< r. Let
m
p< > ( "' · r ) =
AD be the set of w for which
E ( t':t)
X,. ( "' ) r n -
m
n-m
-
is a complex-valued random variable measurable o ( Xm, Xm + 1 , • • • ). By the m� �root test, w E AD if and only if lim supm Vl am ( w )I < r - 1 • For each m 0 , the condition for w E AD can thus be expressed in terms of a m 0( w ), a m o 1 ( w ), . . . alone, and so AD E (J ( Xm o' Xm o 1 ' • • • ). Thus AD has a prob ability, and in fact P(AD) is 0 or 1 by the zero-one law. Of course, P( A D) = 1 if D c D0• The central step in the proof is to show that P(AD) = 0 if D contains points not in D0 • Assume on the contrary that P(AD) = 1 for such a D. Consider that part of the cir cumference of the unit circle that lies in D, and let k be an integer large enough that this arc has length exceeding 2 '1Tjk. Define +
+
if n ¥= 0 ( mod k ) , if n = 0 ( mod k ) . Let BD be the w-set where the function
( 22.19 )
G(w, z ) = E Yn( w ) z n 00
n -O
coincides in D0 with a function analytic in D0 u D. The sequence { Y0 , Y1 , . . • } has the same structure as the original se quence: the Yn are independent and assume the values ± 1 with probabili ty t
SECTION
22.
303
SUMS OF INDEPENDENT RANDOM VARIABLES
each . Since BD is defined in terms of the Yn in the same way as AD is defined in terms of the Xn , it is intuitively clear that P( BD) and P(AD) must be the same. Assume for the moment the truth of this statement, which is somewhat more obvious than its proof. If for a particular w each of (22.17) and (22.19) coincides in D0 with a function analytic in D0 u D, the same must be true of
( 22 . 20 )
00
F( w, z ) - G ( w, z ) = 2 E xmk ( w ) z mk . m -0
Let D1 = [ze 2 "illk : z E D). Since replacing z by ze 2 "i /k leaves the function (22.20) unchanged, it can be extended analytically to each D0 u D1, I = 1 , 2, . . . . Because of the choice of k, it can therefore be extended analyti cally to [z: l z I < 1 + E] for some positive E; but this is impossible if (22.18) holds, since the radius of convergence must then be 1. Therefore, AD n BD cannot contain a point w satisfying (22.18). Since (22.18) holds with probability 1, this rules out the possibility P(AD) = P( BD) = 1 and by the zero-one law leaves only the possibility P(AD) = P( BD) = 0. Let A be the w-set where (22.17) extends to a function analytic in some open set larger than D0• Then w E A if and only if (22.17) extends to D0 U D for some D = [z: l z - r 1 < r] for which D - D0 ¢ 0, r is rational, and r has rational real and imaginary parts; in other words, A is the countable union of AD for such D. Therefore, A lies in o( X0, X1 , ) and has probability 0. It remains only to show that P(AD) = P(BD), and this is most easily done by comparing { xn } and { yn } with a canonical sequence having the same structure. Put Zn ( w ) = ( Xn ( w ) + 1)/2, and let Tw be E:- o zn ( w )2 - n - I on the w-set A* where this sum lies in (0, 1 ]; on 0 - A* let Tw be 1, say. Because of (22.16) P(A*) = 1. Let �= o( X0, X1 , ) and let � be the a-field of Borel subsets of (0, 1 ]; then T: 0 --+ (0, 1] is measurable �/!fl. Let rn(x) be the n th Rademacher function. If M = [x: r1(x ) = u ; , i = 1, . . . , n ], where u ; = ± 1 for each i, then P(T - 1M ) = P[ w : �((a,)) = u ; , i = 0, 1, . . . , n - 1] = 2 - n , which is the Lebesgue measure A(M) of M. Since these sets form a w-system generating ffl, P(T - 1M) = A(M ) for all M in fJI (Theorem 3.3). n Let MD be the set of x for which E�-o rn + 1 (x)z extends analytically to D0 U D. Then MD lies in ffl, this being a special case of the fact that AD lies in �. Moreover, if w E A*, then w E AD if and only if Tw E MD : 1 A * n AD = A* n T - MD. Since P(A*) = 1, it follows that P(AD) = A ( MD) . This argument only uses (22.16), and therefore it applies to { Yn } and BD • as well. Therefore, P(BD) = A( MD) = P(AD). • • •
• . •
304
RANDOM VARIABLES AND EXPECTED VALUES
The Hewitt-Savage Zero-One Law *
Kolmogorov's zero-one law does not apply to the events [Sn > an i.o.] and [ Sn = 0 i.o. ], which lie in each a-field a ( Sn , Sn + 1 , ) but not in each a ( X,. , Xn + 1 , ) . If the X,, are identically distributed as well as independent, however, these events come under a different zero-one law. Let �:· k consist of the sets H in � n + " that are symmetric in the first n coordinates in the sense that if ( x1 , , x,. + k ) lies in H then so does ( x '" 1 , , x'" n ' xn + 1 , , xn + k ) for every permutation of 1, . . . , n. Then �:· " is a a-field. Let 9:,, k consist of the sets [( XI ' . . . ' xn + k ) E H) for H E �:· k ' let 9:, be the a-field generated by Uk'.... 1 9:,, k , and let .9= n�_ 1 9:, . The information in 9:, corresponds to knowing the values xn + I ( w ) , xn + 2 ( w ), . . . exactly and knowing the values X1 ( w ) , . . . , Xn ( w) to within a permutation. If n < m , then [ Sm e Hm ] e .Ym . o c � m - n c � , and so [ Sm e H i.o.] comes under the Hewitt-Savage zero-one . m law: •
•
•
•
•
•
•
•
•
•
•
•
•
.
'IT
•
Theorem 22.10. If XI ' x2 ' . . are independent and identically distributed ' and if
A
e
.9,
.
then either P(A ) = 0 or P( A) = 1.
Theorem 11.4 there Since u�- I a ( XI ' . . . ' xn) is a field, by the corollary to n is for given £ an n and a set U = [( X1 , , Xn ) e H) ( H e � ) such that P( A�U ) < ( . Let v = [( xn + l ' . . . ' x2 n> E H). From A E .92 n + I and the fact that the xk are identically distributed as well as independent it will be shown that
PROOF.
•
( 22 .21 )
•
•
P( A�U ) = P ( A� V ) .
This will imply P(A � V) < £ and P(A�(U n V)) � P( A�U) + P(A � V) < 2£. Since U and V are independent and have the same probability, it will follow that P( .A ) is within £ of P(U) and hence P 2 (A) is within 2£ of P 2 (U) == P( U)P( V) = P( U n V), which is in tum within 2£ of P(A ). But then 1P 2 ( A ) - P(A ) I < 4£ for all £ , and so P( .A ) must be 0 or 1. It remains ntok prove (22.21). If B -E .92 n. k ' then B == [( X1 , , X2 n + k ) e J) for some J in &f 2 · and since •
s
•
•
'
it follows that P( B n U) == P( B n V). But since U k'_ 1 .92 n. k is a field generating .92 n , P( B n U ) == P( B n V) holds for all B in .92 n (Theorem 10.4). Therefore P(Ac n U) == P(Ac n V); the same argument shows that P(A n Uc) == P(A n • vc ) . Hence (22.21 ) . * This topic may be omitted.
SECTION
22.
SUMS OF INDEPENDENT RANDOM VARIAB LES
305
PROBLEMS 22. 1.
is an independent sequence and Y is measurable Suppose that X1 , X2 , ) for each n. Show that there exists a constant a such that a( Xn , Xn + 1 , P[ Y == a] == 1.
22.2.
21.12 f Let XI ' x2 ' . . . be independent random variables with distribution functions F1 , F2 , (a) Show that P[supn Xn < oo ] is 0 or 1 and is in fact 1 if and only if En (1 - F, ( x)) < oo for some x. (b) Suppose that X == supn Xn is finite with probability 1 and let F be its distribution function. Show that F( X ) == n � F, (X ) . (c) Show further that E[ x+ ] < oo if and only if En fx a Xn dP < oo for some a .
• • •
• • •
•
• •
•
-1
II
>
22.3.
Assume { Xn } independent and define X!c > as in Theorem 22.8. Prove that for E I xn I to converge with probability 1 it is necessary that EP[ I xn I > c) and E E( 1 X! c > 1 ] converge for all positive c and sufficient that they converge for some positive c. If the three series (22.13) converge but EE( I X!c > 1 1 == oo , then there is probability 1 that EXn converges conditionally but not absolutely.
22.4.
f (a) Generalize the Borel-Cantelli lemmas: Suppose Xn are nonnega tive. If E E( Xn ] < oo , then EXn converges with probability 1. If the Xn are independent and uniformly bounded, and if EE[ Xn ] = oo , then EXn di verges with probability 1. (b) Construct independent, nonnegative Xn such that EXn converges with probability 1 but EE[ Xn 1 diverges. For an extreme example, arrange that P[ Xn > 0 i.o. ) == 0 but E[ Xn ] = oo .
22.5.
Show under the hypotheses of Theorem 22.6 that EXn has finite variance and extend Theorem 22.4 to infinite sums.
22.6.
are independent, each with the Cauchy 20.20 f Suppose that X1 , X2 , distribution (20.45) for a common value of u . (a) Show that n - 1 EZ .... 1 Xk does not converge with probability 1 to a constant. Contrast with Theorem 22.1. (b) Show that P[ n - 1 max k s n xk < x] --+ e - ujwx for X > 0. Relate to Theorem 14.3.
12.7.
If X1 , X2 , are independent and identically distributed, and if P[ X1 > 0] == 1 and P[ X1 > 0] > 0, then En Xn == oo with probability 1. Deduce this from Theorem 22.1 and its corollary and also directly: find a positive £ such that xn > ( infinitely often with probability 1.
11.8.
Suppose that XI ' x2 ' . . . are independent and identically distributed and show that En Pl i Xn l > an ] == oo for each a , E[ I X1 1 ] == oo . Use (21.9) to 1 and conclude that supnn - 1 Xn l == oo with probability 1. Now show that supn n - 1 1 sn I == oo with probability 1. Compare with the corollary to Theo rem 22.1.
• •
•
• • •
306
22.9.
RANDOM VARIABLES AND EXPECTED VALUES
Wald ' s equation. Let XI ' x2 ' . . . be independent and identically distributed
with finite mean, and put sn == x. + . . . + xn . Suppose that T is a stopping time: T has positive integers as values and [ T == n ] E a( XI ' . . . ' xn ); see Section 7 for examples. Suppose also that E[ T] < oo. (a) Prove that (22.22)
22.10.
22.11.
22.12.
22. 13.
22.14. 22.15.
22. 16.
(b) Suppose that xn is ± 1 with probabilities p and q , p :F q , let T be the first n for which sn is - a or b ( a and b positive integers), and calculate E[ T ) . This gives the expected duration of the game in the gambler's ruin problem for unequal p and q. 20.12 f Let Zn be 1 or 0 according as at time n there is or is not a record in the sense of Problem 20.12. Let Rn == Z1 + · · · + Zn be the number of records up to time n. Show that R n/log n -+pl. 22.1 f (a) Show that for an independent sequence { Xn } the radius of n convergence of the random Taylor series En Xn z is r with probability 1 for some nonrandom r. (b) Suppose that the xn have the same distribution and P[ XI :F 0] > 0. Show that r is 1 or 0 according as log + 1 X1 1 has finite mean or not. Suppose that X0 , X1 , . . are independent and each is uniformly distributed n ;x over [0, 2w]. Show that with probability 1 the series Ene ,. z has the unit circle as its natural boundary. In the proof of Kolmogorov's zero-one law, use the corollary to Theorem 11.4 in place of Theorem 4.2. Hint: See the first part of the proof of the Hewitt-Savage zero-one law. Prove (what is essentially Kolmogorov's zero-one law) that if A is indepen dent of a w-system gJ and A e a(gJ), then P(A ) is either 0 or 1. 11.3 f Suppose that .JJI is a semiring containing Q. (a) Show that if P(A n B ) < bP( B) for all B e .JJI, and if b < 1 and A e a( .JJI ) , then P(A ) == 0. (b) Show that if P( A n B) S P( A)P( B) for all B e .JJI, and if A e a(..w') , then P( A ) is 0 or 1. (c) Show that if aP( B) < P(A n B) for all B e .JJI, and if a > 0 and A e a( .JJI ) , then P( A ) == 1. (d) Show that if P( A )P( B ) < P( A n B) for all B e .JJI, and if A e a(..w' ) , then P( A ) is 0 or 1. (e) Reconsider Problem 3.20. 22.14 f Burstin ' s theorem. Let I be a Borel function on [0, 1] with arbitrarily small periods: For each £ there is a p with 0 < p < £ such that l( x ) == l( x + p ) for 0 < x < 1 - p. Show that such an I is constant almost everywhere: (a) Show that it is enough to prove that P< r 1 B) is 0 or 1 for every Borel set B, where P is Lebesgue measure on the unit interval •
SECTION
23.
307
THE POISSON PROCESS
Show that r 1 B is independent of each interval [0, X], and conclude that P< r 1 B ) is 0 or 1. (c) Show by example that f need not be constant. (b)
SECTION 23. THE POISSON PROCESS

Characterization of the Exponential Distribution

Suppose that X has the exponential distribution with parameter α:

(23.1)  $P[X > x] = e^{-\alpha x}, \qquad x \ge 0.$

The definition (4.1) of conditional probability then gives

(23.2)  $P[X > x + y \mid X > x] = P[X > y], \qquad x, y \ge 0.$
Consider next a stream or sequence of events, say arrivals of calls at an exchange. Let X1 be the waiting time to the first event, let X2 be the waiting time between the first and second events, and so on. The formal model consists of an infinite sequence X1 , X2 , of random variables on some probability space, and sn = XI + . . . + xn represents the time of occurrence of the n th event; it is convenient to write S0 = 0. The stream of events itself remains intuitive and unformalized, and the mathematical definitions and arguments are framed exclusively in terms of the Xn. If no two of the events are to occur simultaneously, the Sn must be strictly increasing, and if only finitely many of the events are to occur in each finite interval of time, sn must go to infinity: •
•
•
n
308
RANDOM VARIABLES AND EXPECTED VALUES
This condition is the same thing as
(23.4 )
n
Throughout the section it will be assumed that these conditions hold everywhere-for every w. If they hold only on a set A of probability 1, and if Xn ( w) is redefined as Xn ( w) = 1, say, for w � A , then the conditions hold everywhere and the joint distributions of the xn and sn are unaffected.
Condition 0°.
For each w, (23.3) and (23.4) hold.
The arguments go through under the weaker condition that (23.3) and (23.4) hold with probability 1, but they then involve some fussy and uninteresting details. There are at the outset no further restrictions on the X;; they are not assumed independent, for example, or identically distrib uted. The number N1 of events that occur in the time interval [0, t] is the largest integer n such that sn < t:
(23.5)  N_t = max[n: S_n ≤ t].

Note that N_t = 0 if t < S_1 = X_1; in particular, N_0 = 0. The number of events in (s, t] is the increment N_t - N_s.

[Figure: a sample path of N_t, a step function starting at S_0 = 0 that jumps from 0 to 1 at S_1 = X_1, from 1 to 2 at S_2 = X_1 + X_2, and so on.]

From (23.5) follows the basic relation connecting the N_t with the S_n:

(23.6)  [N_t ≥ n] = [S_n ≤ t].

From this follows

(23.7)  [N_t = n] = [S_n ≤ t] - [S_{n+1} ≤ t] = [S_n ≤ t < S_{n+1}].

Each N_t is thus a random variable.
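The definitions (23.5) through (23.7) translate directly into code. The sketch below (an added illustration with arbitrary waiting times) computes N_t = max[n: S_n ≤ t] from the partial sums S_n and recovers S_n = inf[t: N_t ≥ n] back from the counts, anticipating the correspondence described in the next paragraphs.

    import numpy as np

    rng = np.random.default_rng(1)
    # Any positive waiting times satisfying (23.4) will do; exponentials are one choice.
    X = rng.exponential(scale=1.0, size=20)
    S = np.concatenate(([0.0], np.cumsum(X)))          # S_0 = 0, S_1, S_2, ...

    def N(t):
        """N_t = max[n: S_n <= t], as in (23.5)."""
        return int(np.searchsorted(S, t, side="right") - 1)

    # (23.6): the events [N_t >= n] and [S_n <= t] coincide.
    t, n = 3.7, 4
    assert (N(t) >= n) == (S[n] <= t)

    # Recover the arrival times from the counting function: S_n = inf[t: N_t >= n].
    grid = np.linspace(0.0, float(S[-1]), 200_001)
    counts = np.array([N(u) for u in grid])
    S_back = [grid[np.argmax(counts >= k)] for k in range(1, 6)]
    print(np.round(S[1:6], 3))
    print(np.round(S_back, 3))                         # agree up to the grid spacing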
The collection [N_t: t ≥ 0] is a stochastic process, that is, a collection of random variables indexed by a parameter regarded as time. Condition 0° can be restated in terms of this process:

Condition 0°. For each ω, N_t(ω) is a nonnegative integer for t ≥ 0, N_0(ω) = 0, and lim_{t→∞} N_t(ω) = ∞; further, for each ω, N_t(ω) as a function of t is nondecreasing and right-continuous, and at the points of discontinuity the saltus N_t(ω) - sup_{s<t} N_s(ω) is exactly 1.

It is easy to see that (23.3) and (23.4) and the definition (23.5) give random variables N_t having these properties. On the other hand, if the stochastic process [N_t: t ≥ 0] is given and does have these properties, and if random variables are defined by S_n(ω) = inf[t: N_t(ω) ≥ n] and X_n(ω) = S_n(ω) - S_{n-1}(ω), then (23.3) and (23.4) hold, and the definition (23.5) gives back the original N_t. Therefore, anything that can be said about the X_n can be stated in terms of the N_t, and conversely. The points S_1(ω), S_2(ω), ... of (0, ∞) are exactly the discontinuities of N_t(ω) as a function of t; because of the queueing example, it is natural to call them arrival times. The program is to study the joint distributions of the N_t under conditions on the waiting times X_n and vice versa. The most common model specifies the independence of the waiting times and the absence of aftereffect:
and so (23.3) and (23.4) hold with probability 1 ; to assume they hold everywhere (Condition 0°) is simply a convenient normalization. Under Condition 1 o, Sn has the distribution function specified by 1 (20.40), so that P[N1 � n] = Er:_ n ( ) / by (23.6), and
law of large numbers (Theorem
22.1),
e - at i ! " at ) ( , (23 .8) P [ Nt = n ] = e_,. n ! ' n = 0, 1 , . . . . Thus N, has the Poisson distribution with mean at. More will be proved 0
;
Presently.
Condition
N,1,
•
•
•
2°
, N," -
(i)
N,k - 1
For 0 < t1 < are independent.
· · ·
<
tk the increments
N11, N1 2 -
310
RAN DOM VARIABLES AND EXPECTED VALUES
The individual increments have the Poisson distribution: " t s)) a ( ( s P ( N, - Ns = n] = e - t ( 23 .9) n! , n = 0, 1 , . . . , 0 < s < t. (ii)
a(
)
t Poisson process,
Since N0 = 0, (23.8) is a special case of (23. 9). A collection [ N,: > 0] of random variables satisfying Condition 2° is called a and a is the of the process. As the increments are independent by (i), if < < then the distributions of Ns - N, and N, - Ns must convolve to that of N, - N,. But this requirement is consistent with (ii) because Poisson distributions with parameters and convolve to a Poisson distribution with parameter +
rate r s t,
u
u v. Conditions 1 ° and
Theorem 23. 1.
Condition 0°.
v
2°
are equivalent in the presence of
t
t.
1 o --+ 2° . Fix and consider the events that happen after time By (23.5), SN < < SN + I ' and the waiting time from to the first event following is SN + 1 - the waiting time between the first and second events following is XN + 2 ; and so on. Thus PROOF OF
t
t t
I
( 23 .10 ) Xfl) =
t
I
t;
I
I
SN + l - t, I
X
X
t.
I
·
m,
define the waiting times following By (23.6), N, + s - N, > or N, + s > N, + m , if and only if SN + m < + which is the same thing as x: t) + · · · + X�'> < Thus
s.
( 23 .11)
t s,
I
N, + s - N, = m [ m : ax
m
Xf'> + · · · + X�' > <
s] .
Hence [N, + s - N, = ] = [ Xf'> + · · · + X�> < s < Xf'> + · · · + X�'! 1]. A comparison of (23.11) and (23.5) shows that for fixed the random variables N, + s - N, for > 0 are defined in terms of the sequence (23.10) in exactly the same way as the Ns are defined in terms of the original sequence of waiting times. The idea now is to show that conditionally on the event [N, = ] the random variables (23.10) are independent and exponentially distributed. Because of the independence of the Xk and the basic property (23.2) of the exponential distribution, this seems intuitively clear. For a proof, app ly (20.30). Suppose y > 0; if Gn is the distribution function of Sn , then sin ce
s
t
n
SECTION
Xn + l
By
23.
31 1
THE POISSON PROCESS
h as the exponential distribution,
=
1
=
e - ay
the assumed independence of the
=
If H = ( y 1 , oo )
X
P [ Xn +l
x
1
t + y - x ) dGn( x )
>
P ( Xn + l > t - x) dGn( x )
x
X,,
P(Sn + l - t > yI ' S
11
<
-
t
<
Sn + l ] e - ay2 · · ·
· · · X ( y1 , oo ), this is
(23.12 ) P [ N, = n , ( Xf ' > , . . . , xp>) e n] =
By Theorem 10.4, all H in gJ i.
e - ayJ
P ( N, = n ) P [ ( X1
the equation extends from
, • . .
, �) E H] .
H of the special form above to
Now the event [Ns . = m ; , 1 < e H ] , where j = m + 1 and u
i < u] can be put in the form [( X , �) H is the set of x in R 1 for which x 1 + + x m < S ; < x 1 + · · · + x m + I' 1 < i < u. But then [( Xf t ) , . . . , X1� 1 > ) e H ] is by (23.11) the same as the event [N, + s - N, = m ; , 1 < i < u]. Thus (23.12) gives I
·
·
·
1,
•
I
• • •
I
I
P [ N, = n , Nt + s1 - N, = m ; , 1
<
i < u]
From this it follows by induction on k that if 0
(23 .13 ) P [ N, - N,1 - l = n,. , 1 I
<
i < k]
=
t0
<
<
···
0 P [ N, _ 11 - 1 = n ; ] · k
=
t1
i-1
I
<
tk , then
312
RANDOM VARIABLES AND EXPECTED VALUES
Thus Condition 1 o implies (23.13) and, as already seen, (23.8). But from (23.13) and (23.8) follow the two parts of Condition 2°. •
2° --+ 1°. If 2° holds, then by (23.6), P( X1 > t) = P(N1 = 0] = e - at, so that X1 is exponentially distributed. To find the joint distribution of X1 and X2 , suppose that 0 < s 1 < t 1 < s 2 < t 2 and perform the calculation PROOF OF
=
a ( t l - s . )( e -•u z - e - arz ) =
a 2e - <>Yz dyl dy2 . < yl < l1 s2 < Y2SI2
JJsl
Thus for a rectangle A contained in the open set G
=
[( y1 , y2 ): 0 < y1
< y2 ]
,
By inclusion-exclusion, this holds for finite unions of such rectangles and hence by a passage to the limit for countable ones. Therefore, it holds for A = G n G' if G' is open. Since the2 open sets form a w-system generating the Borel sets, (S1 , S2 ) has density a e - ay2 on G (of course, the density is 0 outside G). By a similar argument in R k (the notation only is more complicated), (S 1 , , Sk ) has density a ke - aYk on [y: 0 < y1 < · · · < Yk ]. If a linear transformation g( y) = X is defined by X; = Y; - Y; - I ' then ( xl ' . . . ' xk ) = g(S l , . . . ' Sk ) has by (20.20) the density n�- l ae - OtX; (the Jacobian is identi • cally 1). This proves Condition 1°. • • •
The Poisson Approximation
Other characterizations of the Poisson process depend on a generalization of the classical Poisson approximation to the binomial distribution.
Suppose that for each n, Zn 1 , , Zn r are independent random variables and Zn k assumes the values 1 and 0 with probabilities Pnk Theorem 23.2.
• • •
II
SECTION
and
1 - Pn k · If
23.
rll L Pn k --+ A � 0,
(23. 14 )
313
THE POISSON PROCESS
max
1 sk s
k-l
11r Pn k --+ 0,
then
i = 0, 1 , 2, . . . .
(23 .15 )
= 0, the limit in (23.15) is interpreted as 1 for i = 0 and 0 for i > 1. In the case where rn = and Pn k = Ajn , (23.15) is the Poisson approxima tion to the binomial. Note that if A > 0, then (23.14) implies rn --+ oo . If A
n
Io : 1
-
P
II :p
I
�1 1 o l�----� - ------��----------�--� J1 :e -P / ! J2 J0:e -P PROOF.
p1
The argument depends on a construction like that in the proof of Theorem 20.4. Let U1 , U2 , be independent random variables, each uni formly distributed over [0, 1). For each p, 0 � p < 1, split [0, 1) into the two intervals /0 ( p) [0, 1 - p) and /1 ( p) = [1 - p, 1), as well as into the . tervals J ( p ) I I I. , L.t - (�;�;i _ 1 e pp j;J· I. ) , 1 - 0 1 sequence of tn L.ti _ 1e - pp j1J ; Define vn k = 1 if uk E /t ( Pn k ) and v,. k = .0 - if uk E lo ( Pn k > · Then v,.l , . . . , vn r are independent and vn k assumes the values 1 and 0 with probabilities P[Uk E /1 ( Pn k )] = Pn k and P[Uk E /0 ( Pn k )] = 1 - Pn k· Since Zn r , (23.15) will V,.1, . . . , Vn r have the same joint distribution as Zn1 , follow if it is shown that V,. = Ek��- tvn k satisfies • • •
=
.
-
.
-
,
, • • •
•
II
• • • ,
II
II
(23.16 ) Wn k = i if Uk e l;( Pn k ), i = 0, 1, . . . . Then P[ Wn k = i] = e -P �cp!Ji ! - Wn k has the Poisson distribution with mean Pn k· Since the W,. k are independent, Wn = Ek��- 1W,. k has the Poisson distribution with mean A n = Ek��- tPn k· Since 1 - p < e P , J1 ( p) c /1 ( p) (see the diagram). Now define ..
-
Therefore,
= rnn k e - P 111c pn k -< rnn2k ' -
314
RANDOM VARIABLES AND EXPECTED VALUES
and r,.
P [ Vn ¢ Wn] < E P;k < A n 1 max Pn k --+ 0 < k < r,. k-l
by
(23.14). And now (23.16) and (23.15) follow because •
Other Characterizations of the Poisson Process
The condition (23.2) is an interesting characterization of the exponential distribution because it is essentially qualitative. There are qualitative char acterizations of the Poisson process as well. For each w , the function N,( "' ) has a discontinuity at t if and only if Sn ( "' ) = t for some n > 1; t is a fixed discontinuity if the probability of this is positive. The condition that there be no fixed discontinuities is therefore
( 23 .17 )
P ( Sn = t ] = 0,
t > 0,
n > 1;
that is, each of S 1 , S2 , has a continuous distribution function. Of course there is probability 1 (under Condition 0°) that N,( "' ) has a discontinuity somewhere (and indeed has infinitely many of them). But (23.17) ensures that a t specified in advance has probability 0 of being a discontinuity, or time of an arrival. The Poisson process satisfies this natural condition. •
•
•
If Condition 0° holds and [ N,: t > 0] has independent increments and no fixed discontinuities, then each increment has a Poisson distribution.
Theorem 23.3.
This is Prekopa's theorem. The conclusion is not that [ N,: t > 0] is a Poisson process, because the mean of N, - Ns need not be proportional to t - s. If cp is an arbitrary nondecreasing, continuous function on [0, oo) with cp (O) = 0, and [ N,: t > 0] is a Poisson process, then Ncp
t This is in fact the general process satisfying them; see Problem 23.9. * This proof carries over to general topological spaces ; see Timothy C. Brown, A mer. Math . Month(v, 91 ( 1984) 1 16- 1 23.
23. THE POISSON PROCESS The procedure is to construct a sequence of partitions
315
SECTION
(23 .18) of
t'
=
t n O < t nl < . . . < t n
rII
=
t"
[t ', t "] with three properties. First, each decomposition refines the preced
ing one: each t n k is a t n + 1 , j . Second, r,.
E p [ Nl,. k - Nln . k - 1
(23 .19)
>
k-l for
some finite
(2 3 .20 )
1] j A
A and max P [ N,nk - N,n,k - 1 > l
_ r,.
1 ) --+ 0.
Third,
(2 3 . 2 1 )
[
P max ( N1nk - N1 n. k- 1 ) > 2 1Sk< r,.
] --+ 0.
Once the partitions have been constructed, the rest of the proof is easy: Let Zn k be 1 or 0 according as N,nlr. - N,n,k- 1 is positive or not. Since [ N,: t � 0] has independent increments, the Zn k are independent for each n. By Theorem 23.2, therefore, (23.19) and (23.20) imply that Zn = Ek _ 1 Zn k satisfies P[ Zn = i] --+ e - A A;/i! Now N,, - N,, > Zn , and there is strict inequality if and only if N1nlr. - N,n k- 1 > 2 for some k. Thus (23.21) implies i A P[N,, - N,, ¢ Zn 1 --+ 0, and there1ore P[N,, - N,, = i] = e - A'/i! To construct the partitions, consider for each t the distance D, = infm � 1 1 t - Sm l from t to the nearest arrival time. Since Sm --+ oo , the infimum is achieved. Further, D, = 0 if and only if Sm = t for some m , and since by hypothesis there are no fixed discontinuities, the probability of this is 0: P [ D1 = 0] = 0. Choose 81 so that 0 < 81 < n - 1 and P[ D1 < 8,] < n - 1 • The intervals (t - 81 , t + 8,) for t' < t < t" cover [t', t "]. Choose a finite subcover, and in (23.18) take the t n k for 0 < k < rn to be the endpoints (of intervals in the subcover) that are contained in (t', t"). By the construction, ..
·
(23 .22 ) and the probability that ( t n, k 1 ' t n k 1 contains some sm is less than -
n - 1 . This
gives a sequence of partitions satisfying (23.20). Inserting more points in a Partition cannot increase the maxima in (23.20) and (23.22), and so it can be arranged that each partition refines the preceding one.
316
RANDOM VARIABLES AND EXPECTED VALUES
To prove (23.21) it is enough (Theorem 4.1) to show that the limit superior of the sets involved has probability It is in fact empty: If for infinitely many N, (w ) - N, - 1 (w ) > 2 holds for some k < n , then by discontinuity points (arrival (23.22), N,( w ) as a function of has in times) arbitrarily close together, which requires Sm ( w ) e for in finitely many in violation of Condition 0°. It remains to prove (23.19). If Zn k and Zn are defined as above and Pn k = P [ Zn k = 1], then the sum in (23.19) is E k Pn k = E[ Zn 1 · Since Zn + t ;::: Zn , E k Pn k is nondecreasing in Now
n,
""
"·"
t
0. [t', t"]
r [t ', t"]
m,
n.
P ( N,, - N,, = 0 ) = P ( Zn k = 0, k � rn ) =
r ,
.n
k�l
[ r, l
( 1 - Pn k ) < exp - E Pn k
·
k-1
If the left-hand side here is positive, this puts an upper bound on E k Pn k ' and (23.19) follows. But suppose P[N,, - N, , = = If is the midpoint of and then since the increments are independent, one of P[Ns - N,, = and P[N,, - Ns = must vanish. It is therefore possible to find a nested sequence of intervals [u m , vm ] such that vm - u m --+ and the event A m = [Nv - Nu > 1] has probability 1. But then P(n m A m ) = 1, and if t is the point common to the [ u m , vm ], there is an arrival at with probability • 1, contrary to the assumption that there are no fixed discontinuities.
t' 0]
0] 0. s
t",
m
0]
m
0
t
Theorem 23.3 in some cases makes the Poisson model quite plausible. The increments will be · essentially independent if the arrivals to time cannot seriously deplete the population of potential arrivals, so that Ns has for > negligible effect on N, - Ns . And the condition that there are no fixed discontinuities is entirely natural. These conditions hold for arrivals of calls at a telephone exchange if the rate of calls is small in comparison with the population of subscribers and calls are not placed at fixed, prede termined times. If the arrival rate is essentially constant, this leads to the following condition.
s
t s
Condition 3°. (i) For 0 < t1 < · · · < tk the increments N11, N12 N, , . . . , N1 - N1 are independent. (ii) The distribution of N1 - Ns depends only on the difference t - s. Theorem 23.4. Conditions 1°, 2°, and 3° are equivalent in the presence of Condition 0°. Obviously Condition 2° implies 3°. Suppose that Condition 3° holds. If J, is the saltus at t (J1 = N, - sups < 1 Ns ), then [N, - N, _ n- 1 1
k
k-1
PROOF.
2:
SECTION
23.
317
THE POISSON PROCESS
1 ] � [ J1 > 1 ], and it follows by (ii) of Condition 3° that P[ J, > 1] is the same
t.
for all But if the value common to P[ J, > 1] is positive, then by the indepe ndence of the increments and the second Borel-Cantelli lemma there is probability 1 that 11 > 1 for infinitely many rational in (0, 1 ), for example, which contradicts Condition 0° . By Theorem 23.3, then, the increments have Poisson distributions. If and is the mean of N,, then N, - Ns for < must have mean + thus = must by (ii) have mean Therefore, f satisfies Cauchy' s functional equation [ A20] and, being nondecreasing, must = for > 0. Condition 0° makes = 0 impossible . have the form •
t
s t f(t) - f(s) f(t - s); f(t) f(s) f(t - s). a f( t) at a
f( t)
One standard way of deriving the Poisson process is by differential equations.
Condition 4°. If 0 < t1 < tegers, then
···
<
tk and if n1, . . . , nk are nonnegative in
(23.23)
tmd (23.24 )
Moreover, [ N,: t > 0] has no fixed discontinuities. The occurrences of o( h) in (23.23) and (23.24) denote functions, say +1 ( h ), and f/>2 (h), such that h Iq,; ( h) -+ 0 as h � 0; the f/>; may a priori depend on k, t1, , tk, and n1, , nk as well as on h. It is assumed in as h � 0 .
-
•
•
•
•
•
•
(23.23) and (23.24) that the conditioning events have positive probability, so that the conditional probabilities are well defined.
1beorem
23.5.
condition 0° .
Conditions 1 ° through 4° are all equivalent in the presence of
2 ° -+ 4 °. For a Poisson process with rate a, the left-hand sides of (23.23) and (23.24) are e - ahah and 1 - e - ah - e - ahah, and these are «h + o(h) and o(h), respectively, because e - ah = 1 - ah + o(h). And by PROOF OF
the argument -·
PROOF
[�i
in the preceding proof, the process has no fixed discontinui-
ti,
ni;
.
OF 4° -+ 2 ° . Fix k, the and the denote by A the event j � k]; and for > 0 put = P [ N," + t - N," = i A ] . It will
== ni,
t
Pn (t)
n
318
RANDOM VARIABLES AND EXPECTED VALUES
be shown that (23 .25 )
r , (at _ ,. t = e n ( P ) n! '
n
This will also be proved for the case in which 2° will then follow by induction. If t > 0 and I t - s I < n - 1 , then
= 0, 1 , . . . . Pn(t) = P[N1 = n]. Condition
As n j oo , the right side here decreases to the probability of a discontinuity at t, which is 0 by hypothesis. Thus P[N, = n] is continuous at t. The same kind of argument works for conditional probabilities and for t = 0, and so Pn ( t) is continuous for t > 0. To simplify the notation, put D, = N,k + t - N,k . If D, + h = n, then D, = m for some m < n. If t > 0, then by the rules for conditional probabilities,
Pn( t + h ) = Pn( t ) P ( D, + h - D, = OIA n [ D, = n ) ] +pn_ 1 ( t ) P ( D, + h - D, = l i A n [ D, = n - 1 ]] n-2 + E Pm ( t ) P [ D, + h - D, = n - m i A n [ D, = m ]] . m =O
For n < 1, the final sum is absent, and for n = 0, the middle term is absent as well. This holds in the case Pn(t) = P[N, = n ] if D, = N, and A = 0. (If t = 0, some of the conditioning events here are empty; hence the assump tion t > 0.) By (23.24), the final sum is o( h) for each fixed n. Applying (23.23) and (23.24) now leads to
Pn ( t + h ) = Pn(t )( l - ah ) + Pn - l (t )ah + o ( h ) , and letting
h � 0 gives
( 23 .26)
In the case n = 0, take p _ 1 ( t) to be identically 0. In (23.26), t > 0 and p�( t) is a right-hand derivative. But since Pn( t) and the right side of th e equation are continuous on [0, oo ), (23.26) holds also for t = 0 and p� ( t) can be taken as a two-sided derivative for t > 0 [A22].
SECTION
23.
319
THE POISSON PROCESS
Now (23.26) gives [A23]
Since Pn(O) is 1 or 0 as
n = 0 or n > 0, (23.25) follows by induction on n. •
Stochastic Processes
stochastic process-that
t
The Poisson process [N1 : > 0] is one example of a is, a collection of random variables (on some probability space ( 0, �, P)) indexed by a parameter regarded as representing time. In the Poisson case, time is In some cases the time is Section 7 concerns the sequence { Fn } of a gambler's fortunes; there represents time, but time that increases in jumps. Part of the structure of a stochastic process is specified by its For any finite sequence 1 , , of time points, the k-dimensional random vector ( N1 , , N111 ) has a distribution p. 11 ... 111 over R k . These measures p. 1 1k are the finite-dimensional distributions of the process. Condition 2° of this section in effect specifies them for the Poisson case:
continuous.
discrete: n
dimensional distributions.
t
1
n1
n
1
•
•
•
•
•
tk
finite
•
•
t
t
n 0 t0
if 0 < < · · · < k and 0 < 1 < · · · < k (take = = 0). The finite-dimensional distributions do not, however, contain all the mathematically interesting information about the process in the case of continuous time. Because of (23.3), (23.4), and the definition (23.5), for each fixed w, N1 ( w) as a function of has the regularity properties given in the second version of Condition 0°. These properties are used in an essential way in the proofs. Suppose that is t or 0 according as is rational or irrational. Let N1 be defined as before, and let
t
/(t)
(23 .28)
t
M1 ( w ) = N1 (w ) + f ( t + X1 ( w ) ) .
If R is the set of rationals, then P[w: f(t + X1 (w)) ¢ 0] = P[w: X1 (w) e R - t ] = 0 for each t because R - t is countable and X1 has a density. Thus P[M1 = N1 ] = 1 for each t, and so the stochastic process [M,: t > 0] has the same finite-dimensional distributions as [N1 : t > 0]. For w fixed,
320
RANDOM VARIABLES AND EXPECTED VALUES
t
however, M1 ( w ) as a function of is everywhere discontinuous and is neither monotone nor exclusively integer-valued. The functions obtained by fixing w and letting vary are called the or of the process. The example above shows that the finite-dimensional distributions do not suffice to determine the character of the path functions. In specifying a stochastic process as a model for some phenomenon, it is natural to place conditions on the character of the sample paths as well as on the finite-dimensional distributions. Condition 0° was imposed throughout this section to ensure that the sample paths are nondecreasing, right-continuous, integer-valued step functions, a natural condition if N, is to represent the number of events in [0, Stochastic processes in continuous time are studied further in Chapter 7.
path
t
functions sample paths
t].
PROBLEMS
Assume the Poisson processes here satisfy Condition 0° as well as Condition 1°. 13.1. 13.2.
13.3.
Show that the minimum of independent exponential waiting times is again exponential and that the parameters add. 20.23 f Show that the time Sn of the nth event in a Poisson stream has the gamma density /( x; a, n) as defined by (20.47). This is sometimes called the Er/ang density. Let A , t - SN1 be the time back to the most recent event in the Poisson stream (or to 0) and let B, SN 1 - t be the time forward to the next event. Show that A, and B, are independent, that B, is distributed as X1 (exponentially with parameter a), and that A, is distributed as min{ X1 , t } : P [ A, S t] is 0, 1 - e - or 1 as x < 0, 0 < x < t, or x � t. f Let L, A + B, SN1 1 - SN1 be the length of the interarrival interval covering t. (a) Show that L, has density ==
==
,
+
ax ,
13.4.
==
==
,
+
if O < x < t, if X > I. (b)
13.5.
Show that E[ L,] converges to 2 E[ X1 ] as t --+ oo . This seems paradoxi cal because L, is one of the Xn. Give an intuitive resolution of the apparent paradox. Merging Poisson streams. Define a process { N, } by (23.5) for a sequence { Xn } of random variables satisfying (23.4). Let { X� } be a second sequence of random variables, on the same probability space, satisfying (23.4) and define { N,' } by N,' max[ n: X{ + · · · + X� < t ]. Define { N," } by N," N, + N,'. Show that, if a( X1 , X2 , ) and a( X{, X2 , . . . ) are independent ==
=
•
•
•
23.
SECTION
and { � } and { N, } are Poisson processes with respective rates a and fJ, then { N, } is a Poisson process with rate a + {J . f The n th and (n + 1)st events in the process { N, } occur at times Sn and '
"
2.1.6.
321
THE POISSON PROCESS
sn + 1 ·
(a) Find the distribution of the number Ns,.
+
1
- Ns,. of events in the other
process during this time interval. (b) Generalize to Ns - Ns,. . 23.7. For a Poisson stream and a bounded linear Borel set A , let N(A ) be the number of events that occur at times lying in A . Show that N( A ) is a random variable having the Poisson distribution with mean a times the Lebesgue measure of A . Show that if A and B are disjoint, then N( A ) and N( B) are independent. Suppose that X1 , X2 , are independent and exponentially distributed with 23.8. parameter a, so that (23.5) defines a Poisson process { N, }. Suppose that are independent and identically distributed and that Y1 , Y2 , a( x1 ' x2 ' . . . ) and a( y1 ' Yi ' . . . ) are independent. Put z, == Ek N yk . This l is the compound Poisson process. If, for example, the event at time Sn in the original process represents an insurance claim, and if Y, represents the amount of the claim, then Z, represents the total claims to time t. (a) If Yk == 1 with probability 1, then { Z, } is an ordinary Poisson process. (b) Show that Z, has independent increments and that Zs + r - Zs has the same distribution as Z, . (c) Show that, if Yk assumes the values 1 and 0 with probabilities p and 1 p (0 < p < 1), then { Z, } is a Poisson process with rate pa. 11.9. Suppose a process satisfies Condition 0° and has independent, Poisson-dis tributed increments and no fixed discontinuities. Show that it has the form { Ncr> < I ) } , where { N, } is a standard Poisson process and cp is a nondecreas ing, continuous function on [0, oo) with cp(O) == 0. 11. 10. If the waiting times Xn are independent and exponentially distributed with parameter a, then Sn/n --+ a - 1 with probability 1, by the strong law of large numbers. From lim, _ 00 N, == oo and SN < t < SN + 1 deduce that lim, - 00 N,/t == a with probability 1. 1.1.1 1. f (a) Suppose that X1 , X2 , . are positive and assume directly that Sn/n --+ m with probability 1, as happens if the Xn are independent and identically distributed with mean m. Show that lim , N,/t == 1/m with probability 1. (b) Suppose now that Sn/n --+ oo with probability 1, as happens if the Xn are independent and identically distributed and have infinite mean. Show that lim, N,/t == 0 with probability 1. The results in Problem 23.11 are theorems in renewal theory: A component of some mechanism is replaced each time it fails or wears out. The Xn are the lifetimes of the successive components, and N, is the number of replacements, or renewals, to time t. 11. 1 2. 20.8 23.11 f Consider a persistent, irreducible Markov chain and for a fixed state j let Nn be the number of passages through j up to time n. Show that Nn/n --+ 1 /m with probability 1, where m = Ek' 1 kfj�k > is the "'
•
•
•
•
•
•
s
-
1
• •
1
322
RANDOM VARIABLES AND EXPECTED VALUES
mean return time (replace 1/m by 0 if this mean is infinite). See Lemma in Section 8.
3
23.13. Suppose that X and Y have Poisson distributions with parameters a and {J. Show that I P( X = i] - P[Y = i] l < I a - fJ I . Hint: Suppose that a < fJ and represent Y as X + D, where X and D are independent and have Poisson distributions with parameters
23.14.
a and fJ - a.
f Use the methods in the proof of Theorem 23.2 to show that the error in (23.15) is bounded uniformly in i by l A - A n I + A n max k Pnk .
SECTION 24. QUEUES AND RANDOM WALK*
Results in queueing theory can be derived from facts about random walk, facts having an independent interest of their own.

The Single-Server Queue
Suppose that customers arrive at a server in sequence, their arrival times being 0, A 1 , A 1 + A 2 , The customers are numbered 0, 1, 2, . . . , and customer n arrives at time A 1 + +A n , customer 0 arriving by conven tion at time 0. Assume that the sequence { A n } of interarrival times is independent and that the An all have the same distribution concentrated on (0, oo ) . The A n may be exponentially distributed, in which case the arrivals form a Poisson process as in the preceding section. Another possibility is P [ A n = a] = 1 for a positive constant a; here the arrival stream is determin istic. At the outset, no special conditions are placed on the distribution common to the A n . Let B0, B 1 , . . . be the service times of the successive customers. The Bn are by assumption independent of each other and of the A n , and they have a common distribution concentrated on (0, oo ) . Nothing further is assumed initially of this distribution: it may be concentrated at a single point for example, or it may be exponential. Another possibility is that the service-time distribution is the convolution of several exponential distributions; this represents a service consisting of several stages, the service times of the constituent stages being dependent and themselves exponentially distrib uted. If a customer arrives to find the server busy, he joins the end of a queue and his service begins when all the preceding customers have been served. If he arrives to find the server free, his service begins immediately; the server is by convention free at time 0 when customer 0 arrives. • • •
•
· · ·
* This section may be omitted.
SECTION
24.
QUEUES AND RANDOM WALK
323
Airplanes coming into an airport form a queueing system. Here the A n are the times between arrivals at the holding pattern, Bn is the time the plane takes to land and clear the runway, and the queue is the holding pattern itself. Let Wo , WI , W2 , . . . be the waiting times of the successive customers: wn is the length of time customer n must wait until his service begins; thus W,. + Bn is the total time spent at the service facility. Because of the convention that the server is free at time 0, W0 = 0. Customer n arrives at time A1 + · · · + A n ; abbreviate this as t for the moment. His service begins at time t + Wn and ends at time t + Wn + Bn. Moreover, customer n + 1 arrives at time t + A n +1 . If customer n + 1 arrives before t + Wn + Bn (that is, if t + A n + t < t + Wn + Bn , or Wn + Bn - A n +1 > 0), then his waiting time is Wn + t = (t + Wn + Bn ) - (t + A n + t ) = Wn + Bn - A n +1 ; if he arrives after t + Wn + Bn , he finds the server free and his waiting time is W,. + 1 = 0. Therefo�e, Wn + t = max[O, Wn + Bn - A n + 1 ]. Define
(24.1)
n
=
1 , 2, . . . ;
the recursion then becomes
n = 0, 1 , . . . .
(24.2 ) Note that Wn is measurable
o( X1 ,
•
•
•
, Xn ) and hence is measurable
Bn - 1 ' A 1 , . . . ' A n ) . The equations (24.1) and ( 24.2) also describe inventory systems. Suppose
a( Bo , . . . '
that on day n the inventory at a storage facility is augmented by the amount Bn l and the day' s demand is A n . Since the facility cannot supply what it does not have, (24.1) and (24.2) describe the inventory Wn at the end of day n. A problem of obvious interest in queueing theory is to compute or approximate the distribution of the waiting time Wn. Write -
(24 . 3 )
S0 = 0,
The following induction argument shows that
(24 .4) this is obvious for n = 0, assume that it holds for a particular n. If wn + xn + 1 is nonnegative, it coincides with wn + 1 ' which is therefore the maximum of Sn - Sk + Xn + 1 = Sn + 1 - Sk over O < k < n; this maxi mum, being nonnegative, is unchanged if extended over 0 < k < n + 1, As
324
RANDOM VARIABLES AND EXPECTED VALUES
whence (24.4) follows for n + 1. On the other hand, if Wn + Xn + 1 < 0, then sn + 1 - sk = sn - sk + xn + 1 < 0 for 0 < k < n, so that max 0 � k � n + 1 ( Sn + 1 - Sk ) and Wn + 1 are both 0. Thus (24.4) holds for all n. Since the components of ( Xh X2 , , Xn ) are independent and have a common distribution, this random vector has the same distribution over R n as ( Xn , Xn h . . . , X1 ). Therefore, the partial sums S0, S 1 , S2 , , Sn have the same joint distribution as 0, Xn , Xn + xn 1 ' . . . ' xn + . . . + XI ; that is, they have the same joint distribution as Sn -- Sn , Sn - Sn _ 1 , Sn - Sn_ 2 , , Sn - S0• Therefore, (24.4) has the same distribution as max o � k � n sk . The following theorem sums up these facts. • • •
•
•
•
• • •
Let { Xn } be an independent sequence of random variables with common distribution, and define Wn by (24.2) and Sn by (24.3). Then (24.4) holds and wn has the same distribution as max o � k � n sk . Theorem 24.1.
In this and the other theorems of this section, the Xn are any indepen dent random variables with a common distribution; it is only in the queueing-theory context that they are assumed defined by (24.1) in terms of positive, independent random variables A n and Bn . The problem is to investigate the distribution function
( 24.5 )
Mn ( x )
=
P ( Wn s x ]
=
]
[
Sk < x , P max n � � O k
X > 0.
(Since wn > 0, M(x) = 0 for X < 0.) By the zero-one law (Theorem 22.3), the event [sup k Sk = oo"] has probability 0 or 1. In the latter case max0 � k � n sk --+ oo with probability 1, so that
( 24.6 ) for every x, however large. On the other hand, if [sup k Sk probability 0, then
( 24.7 )
M( x )
=
[
P sup Sk k>O
s
x
= oo ]
]
is a well-defined distribution function. Moreover, [max0 k � n Sk x] � [sup k Sk s x] for each x and hence <
( 24. 8)
nlimoo Mn ( x ) ......
=
has
M( x ) .
Thus M(x) provides an approximation for Mn (x).
s
24. QUEUES AND RANDOM WALK If the A n and Bn of queueing theory have finite means, the ratio SECTION
325
(24 .9 )
called the traffic intensity. The larger p is, the more congested is the queueing system. If p > 1, then E[ Xn1 > 0 by (24.1), and since n - 1sn --+ E[ X1 ] with probability 1 by the strong law of large numbers, supn Sn = oo with probability 1. In this case (24.6) holds: the waiting time for the n th customer has a distribution that may be said to move out to + oo ; the service time is so large with respect to the arrival rate that the server cannot accommodate the customers. It can be shown that supn Sn = oo with probability 1 also if p = 1 , in which case E[ Xn 1 = 0. The analysis to follow covers only the case p < 1 . Here the strong law gives n - 1Sn --+ E[ X1 ] < 0, Sn --+ - oo , and supn Sn < oo with probability 1, so that (24.8) applies. is
Random Walk and Ladder Indices
it helps to visualize the Sn as the successive positions in a random walk. The integer n is a ladder index for the random walk if Sn is the maximum position up to time n:
To calculate (24.7)
(24 .1 0 )
Note that the maximum is nonnegative because S0 = 0. Define a finite measure L by (23 .11) L ( A ) =
P[ max n sk n- l O
E
=
0<
]
sn E A ,
A
c
( 0, oo ) .
Thus L is a measure on the Borel sets of the line and has support (0, oo ). The probability that there is at least one ladder index is then (24.12 )
Let T1 , T2 , be the times between the ladder indices: T1 is the smallest n satisfying (24.10), T1 + T2 is the next smallest, and so on. To complete the definition, if there are only k ladder indices ( k may be 0), set Tk + I = Tk+ 2 · · · = oo . Define the k th ladder height by H1 + · · · + Hk = Sr1 + . .. + r,/ Thus H1 + · · · + Hk is the position in the random walk at the k th ladder •
•
•
•
326
RANDOM VARIABLES AND EXPECTED VALUES
index; if there are only k ladder indices, set Hk + I Hk + 2 = · · · = 0. As long as ladder indices occur, the ladder heights must increase and the Hk must be positive. Whether there are infinitely many ladder indices or only finitely many, the definitions are so framed that =
(24.13 )
sup Sk k
=
00
E Hn . n== l
Therefore, (24.7) can be studied through the Hn. The distribution of H1 consists of the measure L defined by (24.11) together with a mass of 1 - p at the point 0. If there is a ladder index and n is the first one ( T1 = n ), then H2 is the first positive term (if there is one) in the sequence Sn + 1 - Sn , Sn + 2 - sn , . . . ; but this sequence has the same structure as the original random walk. Therefore, under the condition that there is a ladder index and hence a first one, the distribution of H2 is the original distribution of H1 An extension of this idea is the key to understanding the distribution of (24.13). Let L n * denote the n-fold convolution of L with itself; L n * = L< n - 1 > * L, and L0 * is a unit mass at the point 0. Let 1/1 be the measure (see Example 16.5) •
•
00
( 24.14)
n -O
A c [O, oo ) .
Note that 1/1 has a unit mass at 0 and the rest of the mass is confined to (0, 00 ) . Theorem 24.2. (i)
If A 1 ,
•
•
•
, An
are subsets of (0, oo), then
( 24.15 ) If p = 1, then with probability 1 there are infinitely many ladder indices and supn Sn = oo. (iii) If p < 1, then with probability p n (l - p) there are exactly n ladder indices; with probability 1 there are only finitely many ladder indices and supn Sn < oo ; finally, (ii)
( 24.16 )
[ n >O
P sup Sn e A
] = ( 1 - p ) l/J ( A ) ,
A c [0, oo ) .
If A c (0, oo), then P[Hn e A] = p n - 1L(A), as will be seen in the course of the proof; thus (24.15) does not assert the independence of the Hn - obvi ously Hn = 0 implies that Hn + 1 = 0.
327 24. QUEUES AND RANDOM WALK (i) For n = 1, (24.15) is just the definition (24.1 1). Fix subsets PROOF. .A1 , . . . , A n of (0, oo) and for k > n - 1 let Bk be the w-set where H; A; for i � n - 1 and T1 + · · · + Tn - l k. Then Bk E o( X1, . . . , Xk ) and so independence and the assumption that the X; all have the same distribuSECTION
=
by tion,
E
Summing over m = 1, 2, . . . and then over k = n - 1, n, n + 1, . . . gives P[H; e A ; , i � n] = P [ H; E A ; , i < n - 1] L ( A n ). Thus (24.15) follows by induction. Taking A 1 = · · · = A n = (0, oo) in (24.15) shows that the probability of at least n ladder indices is P[H; > 0, i < n] = p n . (ii) If p = 1, then there must be infinitely many ladder indices with probability 1. In this case the Hn are independent random variables with distribution L concentrated on (0, oo ) ; since 0 < E[ Hn ] < oo, it follows by the strong law of large numbers that n - 1 E'ic,_1Hk converges to a positive number or to oo, so that (24.13) is infinite with probability 1. (iii) Suppose that p < 1. The chance that there are exactly n ladder indices is p n - p n +l = p n (1 - p) and with probability 1 there are only finitely many. Moreover, the chance that there are exactly n ladder indices and that H; E A ; for i < n is L(A 1 ) · • • L(A n ) - L(A 1 ) · • · L(A n )L(O, oo ) = L(A 1 ) • • • L(A n )(1 - p). Conditionally on there being exactly n ladder indices, H1 , . . . , Hn are therefore independent, each with distribution p - 1L. Thus the probability that there are exactly n ladder indices and H1 + · · · + Hn E A is (1 - p)L n *(A) for A c (0, oo). This 0 holds also for n = 0: (1 - p )L * consists of a mass 1 - p = P[sup k Sk = 0] at the point 0. Adding over n = 0, 1, 2, . . . gives (24.16). Note that in this case there are with probability 1 only finitely many positive terms on the • right in (24.13). Exponential Right Tail
Further information about the p and 1/1 in (24.16) can be obtained by a time-reversal argument like that leading to Theorem 24.1. As observed before, the two random vectors (S0, S1 , . • . , Sn ) and (Sn - Sn, Sn S,._ 1 , • • • , Sn - S0) have the same distribution over R n + t . Therefore, the event [max o k n sk < sn E A] has the same probability as the event <
<
328
RANDOM VARIABLES AND EXPECTED VALUES
[
A]
Sk > 0, Sn E Sk < Sn E A ] = P min P [ max l k n O k
�
00
00
00
L l/ln( A ) = E L P [ Tt +
n- 1
n - l j- l
···
<
<
+ 1j = n , H1 +
···
+ H1 e A ]
00
= E Li* ( A ) . j- l
Let 1/1 0 be a unit mass at 0; it follows by (24.14) that 00
E 1/J n( A ) = 1/J ( A ) ,
( 24. 1 8 )
n -O
A
c
[O, oo) .
Let F be the distribution function common to the Xn . If P. n is the distribution in the plane of (min 1 � k � n sk , Sn ), then by (20.30), P[min 1 � k � n Sk > 0, Sn +l � x] = /y1 > 0 F( x - y2 ) dp. n (y1 , y2 ). Identifying 1/1 n < A) with the right side of (24.17) givest
(24. 19) f'XJOOF( x - y ) t/ln ( dy) = p L � n sk > 0, sn +l � X ] . If the minimum is suppressed, this holds for n = 0 as well as for n > 1. Since convolution commutes, the left side here is J 00 oo l/l n ( - oo, x - y] dF(y) = /y � xl/1 n [O, x - y] dF(y ). Adding over n = 0, 1, . . . leads to this result: Theorem 24.3.
The measure (24. 1 4) satisfies
fy s x1/! [0, x - y) dF(y ) = nE- l P [ l smink < n sk > o , sn < x ] . 00
( 24.20 )
The important case is x � 0; here the sets on the right in (24.20) are disjoint and their UniOn is the set where Sn < 0 for SOme n > 1 and Sn � X t That /y1
>
0/(y2 ) dp.n (y1 , Yl )
indicators f.
==
/� 00/( y )1/Jn ( dy ) follows by the usual argument starting with
SECTION
24.
QUEUES AND RANDOM WALK
329
for the first such n. This theorem makes it possible to calculate (24.16) in a
case
of interest for queueing theory.
Theorem
24.4.
Suppose that E[X1 ] < 0 and that the right tail of F is
exponential: P ( X1 > x] = 1 - F( x ) = �e - A x , (24 .21 ) where 0 < � < 1 and A > 0. Then p < 1 and (24. 22) p SUp Sn > x = pe - A ( l -p ) x ,
[
n �O
]
X > 0,
X > 0.
Moreover, A(1 - p) is the unique root of the equation (24.23) in
the range 0 < s < A.
Because of the absence of memory (property (23.2) of the exponen tial distribution), it is intuitively clear that (even though only the right tail of F is exponential), if Sn t < 0 < Sn , then the amount by which Sn exceeds 0 should follow the exponential distribution. If so, PROOF.
L ( x , oo ) = pe - A x ,
(24.24 )
X > 0.
To prove this, apply (20.30):
[
P max Sk < o, sn > x k
] P [ max Sk < 0, xn > X - sn - 1 ] 1 �e - ><< x - S. - t l dP =
=
k
max S�c s O
k
Adding over n gives (24.24). Thus L is the exponential distribution multiplied by p. Therefore, L k * k k can be computed explicitly: L *[O, x] is p times the right side of (20.40) with A = a, so that
(24.25 )
330
RANDOM VARIABLES AND EXPECTED VALUES
Note that (24.24) and (24.25) follow from (24.21) alone-the condition E[ X1 ] < 0 has not been used. From E [ X1 ] < 0 it follows that Sn --+ - oo with probability 1 , so that p must be less than 1 . Summing (24.25) in the other reduces it to (24.26)
1
tP (0, X ) = 1 - p
P e - A ( l -p ) x , 1 -p
whence (24.22) follows via (24.16). It remains to analyze (24.23). Since Sn 1 for x = 0, and so by (24.26)
X > 0,
--+ - oo, the right side of (24.20) is
fy !f;O( 1 - pe><
( 24 .27)
Let l(s) denote the function on the left in (24.23). Because of (24.21), it is defined for 0 � s < A and (see (21 .26)) l(s) = �A( A - s ) - 1 + fx !f; 0esx dF( x ). From this, (24.27), and F(O) = 1 - �' it follows that I( A(l - p )) = 1. Now I has value 1 at s = 0 also. Since I is continuous on [0, A ) and l "(s) = f 00 00 x 2es x dF(x) is positive on (0, A ) because F does not con • centrate at 0, (24.23) cannot have more than one root in (0, A ). Suppose in the queuing context-see (24.1)- that the service times Bn are exponentially distributed with parameter P and that the A n have the distribution function A over (0, oo ). Assume that p- t < E[A 1 ], so that the traffic intensity (24.9) is less than 1 and E [ X1 ] < 0. Since P[ X1 > x] = e - flxJ0e - flY dA(y), Theorem 24.4 applies for A = p. The Bn have moment generating function fJ/(P - s) by (21.26); the moment gener ating function for F can by the product rule be expressed in terms of that for A. Therefore, (24.23) is in this case Example 24. 1.
(24.28)
1
o
00e - sy dA ( y ) = p {J- s .
This can be solved explicitly if the interarrival times An are also exponen tial, say with parameter a < p. The left side of (24.28) is then aj( a + s ) , the only positive root is P - a, and so p = a;p. In this case (see (24.8)) (24.29 )
P ( W, < x) = 1 - : e-
X�
0.
SECTION
24.
331
QUEUES AND RANDOM WALK
This limit is a mixture of discrete and continuous distributions: there is a mass of 1 - a.;p at 0 and a density ap - 1 ( P - a.) e -
Theorem 24.5. Suppose that left tail of F is exponential:
(24.20) through 0 makes it possible F is exponential.
E[X1 ] < 0, that F is continuous, and that the
(24.30) where
X<
0,
0 < � < 1 and A > 0. Then
(24 .3 1) and
L(O, x)
(24 .3 2 )
=
x
fo
F( x ) - F(O) + >.. (1 - F( y )) dy.
First assume not (24.30) but (24.21), and assume that E[ X1 ] > 0. Then Sn + oo , and so p = 1. As pointed out after the proof of (24.25), that relation follows from (24.21) alone; in the case p = 1 it reduces to (reverse the sum and use the Poisson mean) 1/J[O, x] = 1 + AX, x > 0. By Theorem 24.3, PROOF.
-+
[
(24 .33) 'f. P l � n Sk > 0, Sn < x n 1 00
= for
x <
] � < } 1 + >.. ( x - y )) dF( y) =
(1 +
0.
>.. x ) P ( X1 < x) - >.. f xX1 dP X1 <
Now if (24.30) holds and if E[ X1 ] < 0, the analysis above applies to { - xn }. If xn is replaced by - xn and X by - X, then (24.33) becomes
'f. P [ 1 max S < 0, Sn > x ] k n < sk n -= 1 00
valid
for
=
(1 -
>.. x ) P ( X1 > x ) + >..
f
X1 > x
X1 dP,
x > 0. Now F is assumed continuous, and it follows by the
332
RANDOM VARIABLES AND EXPECTED VALUES
convolution formula that the same must be true of the distribution of each 00 Sk . By (24.11) and (21.10), L(x, oo) = L[x, oo) = 1 - F(x) + >.. fx (1 F(y)) dy. By (24.30), fx1 0 X1 dP = - >.. - 1� = - >.. - 1F(O), and so /0(1 F( y )) dy = fx1 0 X1 dP = E[ X1 ] + A - 1F(O). Since p = L (0, oo ), this gives • (24.31) and (24.32). �
';!!
This theorem gives formulas for p and L and hence in principle determines the distribution of supn 0 Sn via (24.14) and (24.16). The rela tion becomes simpler if expressed in terms of the Laplace transforms 'i!!
and
.l ( s ) = E exp { -s sup sn ) . n �O
(
]
These are moment generating functions with the argument reflected through 0 as in (22.4); they exist for s > 0. Let �+ (s) denote the transform of the restriction to (0, oo) of the measure corresponding to F:
By (24.32), L is this last measure plus the measure with density A (1 over (0, oo ). An integration by parts shows that
=
- F(x))
� ( 1 - F(0)) + ( 1 - � ) !F+ ( s ) .
By (24.14) and (24.16), .l(s) = (1 - p)r./:_ 0!t' n (s); the series converges because !t'(s) � !t'(O) = p < 1. By (24.31),
( 24"34 ) .l ( s ) =
>.. E ( X1 )s ( s - " ) !F+ ( s ) - s + " ( 1 - F(O)) '
s > 0.
This, a form of the Khinchine-Po//aczek formula, is valid under the hypothe ses of Theorem 24.5. Because of Theorem 22.2, it determines the distribu tion of supn 0Sn . 'i!!
SECTION
24.
333
QUEUES AN D RAN DOM WALK
Example 24. 2.
Suppose that the queueing variables A ,, are exponen tially distributed with parameter a, so that the arrival stream is a Poisson process . Let b = E [ Bn], and assume that the traffic intensity p = ab is less than 1 . If the B, have distribution function B( x ) an d transform ffl( s ) = JOOe -sx dB ( x ), then F(x ) has densityt ae axdl( a ) ,
F (x) = '
ae ax
J
x < O
e - a v dB ( y ) ,
y>x
X>
0.
Note that F(O) = !11 ( a). Thus Theorem 24.5 applies for A = a, and revers ing integrals shows that �+ (s) = a( a - s ) - 1 ( 91(s) - F(O)). Therefore, (24. 3 4) becomes (24.35)
.K
(s ) =
(1 - p )s s - a + adl( s ) '
the
usual form of the Khinchine-Pollaczek formula. If the B,. are exponen tially distributed with parameter p, this checks with the Laplace transform • of the distribution function M derived in Example 24.1. Queue Size
Suppose that p < 1 and let Qn be the size of the queue left behind by customer n when his service terminates. His service terminates at time A I + . . . + A n + wn + Bn , and customer n + k arrives at time A l + A n + k · Therefore, Qn < k if and only if A 1 + · · + A n + k > A 1 + + + A n + Wn + Bn . (This requires the convention that a customer arriving exactly when service te rminates is not counted as being left behind in the queue.) Thus ·
·
·
·
·
·
·
k
> 1.
Since W,. is measurable o( B0, , Bn _ 1 , A 1 , , A n ) (see (24.2)), the three random variables Wn , Bn , and - ( A n + l + + A n + k ) are independent. If Mn , B, and Vk are the distribution functions of these random variables, then by the convolution formula, P[Qn < k ] = j 00 00 Mn ( -y )( B • Vk )( dy ). If M is the distribution function of supn Sn , then (see (24.8)) M,. converges to M and hence p [ Q n < k ] converges to I 00 00 M( -y )( B * vk )( dy ), which is the • • •
• • •
·
tJust
as
·
·
(20.37) gives the density for X + Y, f� 00 g( x - y ) dF( x ) gives the density for X - Y.
334
RANDOM VARIABLES AND EXPECTED VALUES
same as /00 00 Vk ( -y )( M * B )( dy) because convolution is commutative. Dif ferencing for successive values of k now shows that = k ) = qk , P(Qn n -+ oo
( 24.36)
lim
k > 1,
where
These relations hold under the sole assumption that p < 1. If the A n are exponential with parameter a, the integrand in (24.37) can be written down explicitly:
( 24.38) Example
Suppose that the A n and Bn are exponential with parameters p, as in Example 24.1. Then (see (24.29)) M corresponds to a mass of 1 - a;P = 1 - p at 0 together with a density p(p - a ) e - < IJ - a > x over (0, oo ) . By (20.37), M * B has density 24.3. a and
1
fJe - fl < y - x) dM ( x ) = ( fJ
[O. y ]
- a ) e - < fl - a) y .
By (24.38),
Use inductive integration by parts (or the gamma function) to reduce the last integral to k ! Therefore, qk = (1 - p)pk , a fundamental formula of • queuing theory. See Examples 8.11 and 8.12.
C HAP T ER
5
Convergence of Distributions
SECTION 25. WEAK CONVERGENCE
Many of the best known theorems in probability have to do with the asymptotic behavior of distributions. This chapter covers both general methods for deriving such theorems and specific applications. The present section concerns the general limit theory for distributions on the real line, and the methods of proof use in an essential way the order structure of the line. For the theory in R k , see Section 29. Definitions
Distribution functions Fn were in Section the distribution function F if
14 defined to converge weakly to
(25 .1 ) for every continuity point x of F; this is expressed by writing Fn � F. Examples 14.1, 14.2, and 14.3 illustrate this concept in connection with the asymptotic distribution of maxima. Example 14.4 shows the point of allowing (25.1) to fail if F is discontinuous at x; see also Example 25.4. Theorem 25.8 and Example 25.9 show why this exemption is essential to a useful theory. If p. n and p. are the probability measures on ( R 1 , !1/ 1 ) corresponding to Fn and F, then Fn � F if and only if lim p. n ( A )
(25 .2 ) for every A of the form
n
A
=
(
-
oo ,
=
p. ( A )
x] for which p. { x } = O-see
(20.5). In 335
336
CONVERGENCE OF DISTRIBUTIONS
this case the distributions themselves are said to converge weakly, which is expressed by writing P. n � p.. Thus Fn � F and P. n => p. are only different expressions of the same fact. From weak convergence it follows that (25.2) holds for many sets A besides half-infinite intervals; see Theorem 25.8.
Example
Let Fn be the distribution function corresponding to a unit mass at n: Fn = /[ n . oo > · Then lim n Fn ( x ) = 0 for every x, so that (25. 1 ) is satisfied if F( x) = 0. But Fn � F does not hold, because F is not a distribution function. Weak convergence is in this section defined only for functions Fn and F that rise from 0 at - oo to 1 at + oo -that is, it is defined only for probability measures p. n and p.. t For another example of the same sort, take Fn as the distribution function of the waiting time W, for the n th customer in a queue. As observed after (24.9), Fn (x) --+ 0 for each x if the traffic intensity exceeds 1 . 25. 1.
•
Example
Poisson approximation to the binomial. Let P. n be the binomial distribution (20.6) for p = 'A/n and let p. be the Poisson distribu tion (20.7). For nonnegative integers k , ( n ) "' k A n - k P. n { k } = k n 1 - n n k k-1 'A/n 'A 1 1 i ) ( x = n 1 -k n k! ( 1 - 'Ajn ) i = O 25. 2.
( )(
)
(
)
if n > k . As n --+ oo the second factor on the right goes to 1 for fixed k , and so P. n { k } --+ p. { k } ; this is a special case of Theorem 23.2. By the series form of Scheffe's theorem (the corollary to Theorem 16. 1 1 ), (25.2) holds for every set A of nonnegative integers. Since the nonnegative integers support p. and the P. n ' (25.2) even holds for every linear Borel set A. Certainly P. n converges • weakly to p. in this case.
Example 25.3. Let P. n correspond to a mass of n - 1 at each point kjn, k = 0, 1 , . . . , n - 1 ; let p. be Lebesgue measure confined to the unit interval. The corresponding distribution functions satisfy Fn (x) = ([nx] + 1 )/n --+ F(x) for 0 < x < 1 , and so Fn � F. In this case (25. 1) holds for every x, but (25.2) does not as in the preceding example hold for every Borel set
A:
t There is (see Section 28) a related notion of vague convergence in which p. may be defective in the sense that p.( R 1 ) < 1 . Weak convergence is in this context sometimes called complete convergence.
SECTION
25.
WEAK CONVERGENCE
337
is the set of rationals, then P. n (A) = 1 does not converge to p. (A) = 0. Despite this, P. n does converge weakly to p.. •
if A
Example
If P. n is a unit mass at x n and p. is a unit mass at x, then p , :::o p. if and only if x n --+ x. If x n > x for infinitely many n , then (25.1) • fails at the discontinuity point of F. 25.4.
Unifonn Distribution Modulo 1*
For a sequence x1 , x 2 , of real numbers, consider the corresponding sequence of their fractional parts { xn } == xn - [ xn ]. For each n , define a probability measure p. n •
•
•
by
(25 .3)
P.n ( A ) == -1 , # ( k : 1 < k < n , { xk } E A ] ; n
has mass n - 1 at the points { x1 }, , { xn }, and if several of these points coincide, the masses add. The problem is to find the weak 1imi t of { p. n } in number-theoretically interesting cases. Suppose that xn == nplq, where plq is a rational fraction in lowest terms. Then { kplq } and { k'plq } coincide if and only if (k - k')plq is an integer, and since p and q have no factor in common, this holds if and only if q divides k - k'. Thus for any q successive values of k the fractional parts { kp1q } consist of the q points
P.n
•
0 1
•
•
,..., q'q
(25 .4)
q-1 q
in
some order. If O < x < 1, then P.n [O, x] == P.n [O, y ], where y is the point ilq just left of x, namely y == [ qx]lq. If r == [ nlq], then the number of { kpjq } in [0, y] for 1 s k < n is between r([ qx] + 1) and ( r + 1)([ qx] + 1) and so differs from nq- 1 ([ qx] + 1) by at most 2 q. Therefore, {25 .5)
- [ qx 1 + 1 2q p." [ 0 ' x 1 < n ' q
0 < X < 1.
If _, consists of a mass of q - 1 at each of the points (25.4), then (25.5) implies that l'n � _,
. Suppose next that xn == n 8, where 8 is irrational. If pIq is a highly accurate rational approximation to 8 (q large), then the measures (25.3) corresponding to { n fJ } and to { npIq } should be close to one another. If ., is Lebesgue measure restricted to the unit interval, then .,
is for large q near ., (see Example 25.3), and so there is reason to believe that the measures (25.3) for { n8 } will converge weakly to
P.
If the #Ln defined by (25.3) converge weakly to Lebesgue measure restricted to the unit interval, the sequence x1 , x 2 , is said to be uniformly distributed modulo 1. In •
•
This topic may be omitted.
•
•
338
CONVERGENCE OF DISTRIBUTIONS
this case every subinterval has asymptotically its proportional share of the points { x n } ; by Theorem 25.8 below, the same is true of every subset whose boundary has Lebesgue measure 0.
Theorem 25.1. PROOF.
Define
For () irrational, { n6 } is uniformly distributed modulo 1 . #Ln by (25.3) for xn #Ln ( X , y]
( 25 .6)
n6. It suffices to show that
==
--+ y -
X,
0 < x < y < 1.
Suppose that 0 < £ < min { x , 1 - y } .
(25 .7)
It follows by the elementary theory of continued fractionst that for n exceeding some n 0 == n 0 ( £ , () ) there exists an irreducible fraction pIq (depending on £ , n , and () ) such that
p n 6 - -q < £ .
q<( ' n
-1 < ( '
q
If p.� is the measure (25.3) for the sequence
( 25 .8)
I ll�( x , y] - ( y - x) I <
{ np1q } , then by (25.5),
(
q
+
,t,.( x , y] - [ qy ] + 1 - [ qx 1 + 1 q q
2
+
-4qn
2
� -
q
)
< 6£ .
Now l k 6 - kplql < £ for k == 1 , . . . , n. If { k6 } lies in ( x, y ] , then k6 and kp/q have the same integral part because of (25 . 7), so that { kplq } lies in ( x - £ , y + £ ] . Thus P.n ( x, y ] < p.� ( x - £ , y + £ ] . Similarly, p.� ( x + £ , y - £ ] < P.n ( x , y ] . Two ap plications of (25.8) give I P.n ( x, y ] - ( y - x) l < 8£ . Since this holds for n > n 0 • • (25.6) follows. For another proof see Example 26 . 3 . Theorem 25.1 implies Kronecker's theorem. according to which { n () } is dense in the unit interval if () is irrational.
Convergence in Distribution
Let Xn and X be random variables with respective distribution function s Fn and F. If Fn => F, then Xn is said to converge in distribution or in law to X, written Xn => X. This dual use of the double arrow will cause no confu sion . t Let p,/q, be the convergents of 9 and choose , = "n so sufficiently large, this forces , to be large enough that q, > ( fractions, 1 9 - p,/q, l < 1 /q,q, + 1 < 1 /n ( q, < (/n . Take p/ q
-
that q, < n( < q, + l · If n i s 2 • By the theory of cont i n ued = p, /q, .
SECTION
25.
Because of the defining conditions (25.1) and (25.2),
(25.9) for every x such that
339
WEAK CONVERGENCE
lim n P [ Xn < x ] =
Xn => X if and only if
P[ X < x]
P[ X = x] = 0.
be independent random variables, each Example 25. 5. Let X1 , X ,. 2 with the exponential distribution: P[ Xn > x] = e - a x , x > 0. Put Mn = max{ XI , . . . ' xn } and bn = a - 1 log n. The relation (14.9) established in •
•
•
Example 14.1 can be restated as P[Mn - bn < x] --+ e - e . If X is any ax random variable with distribution function e - e - ' this can be written �-��x • - ax
One is usually interested in proving weak convergence of the distribu tions of some given sequence of random variables, such as the Mn - bn in this example, and the result is often most clearly expressed in terms of the random variables themselves rather than in terms of their distributions or distribution functions. Although the Mn - bn here arise naturally from the problem at hand, the random variable X is simply constructed to make it possible to express the asymptotic relation compactly by Mn - bn => X. Recall that by Theorem 14.1 there does exist a random variable with any prescribed distribution. For each n , let On be the space of n-tuples of O's and 1's, let .F, consist of all subsets of On, and let Pn assign probability ( 'Ajn ) k (1 n - 'Ajn ) - k to each w consisting of k 1 's and n - k O's. Let Xn(w) be the number of 1 's in w; then Xn, a random variable on (On, � ' Pn), represents the number of successes in n Bernoulli trials having probability 'A/n of success at each. Let X be a random variable, on some ( 0 , :F, P), having the Poisson • distribution with parameter 'A. According to Example 25.2, Xn => X. Exllllple l 25. 6.
As this example shows, the random variables Xn may be defined on entirely different probability spaces. To allow for this possibility, the P on the left in (25.9) really should be written Pn. Suppressing the n causes no confusion if it is understood that P refers to whatever probability space it is that Xn is defined on; the underlying probability space enters into the definition only via the distribution μn it induces on the line. Any instance of Fn ⇒ F or of μn ⇒ μ can be rewritten in terms of convergence in distribution: There exist random variables Xn and X (on some probability spaces) with distribution functions Fn and F, and Fn ⇒ F and Xn ⇒ X express the same fact.
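The convergence asserted in Example 25.6 can also be checked directly from the binomial and Poisson formulas. In the sketch below, λ = 2 and the range of k are illustrative choices, not part of the text.

```python
from math import comb, exp, factorial

lam = 2.0  # assumed Poisson parameter
for n in (10, 100, 1000):
    p = lam / n
    diffs = [
        comb(n, k) * p**k * (1 - p) ** (n - k) - exp(-lam) * lam**k / factorial(k)
        for k in range(5)
    ]
    print(n, "binomial minus Poisson probabilities, k = 0..4:",
          [round(d, 5) for d in diffs])
```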
Convergence in Probability
Suppose that X, X₁, X₂, ... are random variables all defined on the same probability space (Ω, ℱ, P). If Xn → X with probability 1, then P[|Xn − X| ≥ ε i.o.] = 0 for ε > 0, and hence

(25.10)   lim_n P[|Xn − X| ≥ ε] = 0
by Theorem 4.1. Thus there is convergence in probability Xn →_P X as discussed in Section 20; see (20.41). That convergence with probability 1 implies convergence in probability is part of Theorem 20.5. Suppose that (25.10) holds for each positive ε. Now P[X ≤ x − ε] − P[|Xn − X| ≥ ε] ≤ P[Xn ≤ x] ≤ P[X ≤ x + ε] + P[|Xn − X| ≥ ε]; letting n tend to ∞ and then letting ε tend to 0 shows that P[X < x] ≤ lim inf_n P[Xn ≤ x] ≤ lim sup_n P[Xn ≤ x] ≤ P[X ≤ x]. Thus P[Xn ≤ x] → P[X ≤ x] if P[X = x] = 0, and so Xn ⇒ X:
Theorem 25.2. Suppose that Xn and X are random variables on the same probability space. If Xn → X with probability 1, then Xn →_P X. If Xn →_P X, then Xn ⇒ X.
Of the two implications in this theorem, neither converse holds. As observed after (20.41), convergence in probability does not imply convergence with probability 1. Neither does convergence in distribution imply convergence in probability: if X and Y are independent and assume the values 0 and 1 with probability ½ each, and if Xn = Y, then Xn ⇒ X, but Xn →_P X cannot hold because P[|X − Y| = 1] = ½. What is more, (25.10) is impossible if X and the Xn are defined on different probability spaces, as may happen in the case of convergence in distribution.

Although (25.10) in general makes no sense unless X and the Xn are defined on the same probability space, suppose that X is replaced by a constant real number a, that is, suppose that X(ω) ≡ a. Then (25.10) becomes

(25.11)   lim_n P[|Xn − a| ≥ ε] = 0,
and this condition makes sense even if the space of Xn does vary with n. Now a can be regarded as a random variable (on any probability space at all), and it is easy to show that (25.11) implies that Xn ⇒ a: Put ε = |x − a|; if x > a, then P[Xn ≤ x] ≥ P[|Xn − a| < ε] → 1, and if x < a, then P[Xn ≤ x] ≤ P[|Xn − a| ≥ ε] → 0. If a is regarded as a random variable, its
distribution function is 0 for x < a and 1 for x ≥ a. Thus (25.11) implies that the distribution function of Xn converges weakly to that of a. Suppose, on the other hand, that Xn ⇒ a. Then P[|Xn − a| ≥ ε] ≤ P[Xn ≤ a − ε] + 1 − P[Xn < a + ε] → 0, so that (25.11) holds:

Theorem 25.3. Condition (25.11) holds for all positive ε if and only if Xn ⇒ a, that is, if and only if

lim_n P[Xn ≤ x] = 0 for x < a   and   lim_n P[Xn ≤ x] = 1 for x > a.

If (25.11) holds for all positive ε, Xn may be said to converge to a in probability. As this does not require that the Xn be defined on the same space, it is not really a special case of convergence in probability as defined by (25.10). Convergence in probability in this new sense will be denoted Xn ⇒ a, in accordance with the theorem just proved.

Example 14.4 restates the weak law of large numbers in terms of this concept. Indeed, if X₁, X₂, ... are independent, identically distributed random variables with finite mean m, and if Sn = X₁ + ··· + Xn, the weak law of large numbers is the assertion n^{−1}Sn ⇒ m. Example 6.4 provides another illustration: If Sn is the number of cycles in a random permutation on n letters, then Sn/log n ⇒ 1.
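The weak law cited here can be watched numerically. In this rough sketch the underlying distribution (exponential with mean m = 1), the tolerance ε, and the trial counts are all illustrative assumptions; the estimated probability P[|n^{−1}Sn − m| ≥ ε] shrinks as n grows, which is what n^{−1}Sn ⇒ m asserts.

```python
import random

random.seed(1)
m, eps, trials = 1.0, 0.05, 2000    # assumed mean, tolerance, repetitions

for n in (10, 100, 1000):
    bad = sum(
        1
        for _ in range(trials)
        if abs(sum(random.expovariate(1.0) for _ in range(n)) / n - m) >= eps
    )
    print(n, "estimated P[|S_n/n - m| >= eps]:", bad / trials)
```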
Example 25.7. Suppose that Xn ⇒ X and δn → 0. Given ε and η, choose x so that P[|X| ≥ x] < η and P[X = ±x] = 0, and then choose n₀ so that n ≥ n₀ implies that |δn| < ε/x and |P[Xn ≤ y] − P[X ≤ y]| < η for y = ±x. Then P[|δnXn| ≥ ε] ≤ 3η for n ≥ n₀. Thus Xn ⇒ X and δn → 0 imply that δnXn ⇒ 0, a restatement of Lemma 2 of Section 14 (p. 195). •

The asymptotic properties of a random variable should remain unaffected if it is altered by the addition of a random variable that goes to 0 in probability. Let (Xn, Yn) be a two-dimensional random vector.

Theorem 25.4. If Xn ⇒ X and Xn − Yn ⇒ 0, then Yn ⇒ X.

PROOF. Suppose that y′ < x < y″ and P[X = y′] = P[X = y″] = 0. If ε < x − y′ and ε < y″ − x, then

(25.12)   P[Xn ≤ y′] − P[|Xn − Yn| ≥ ε] ≤ P[Yn ≤ x] ≤ P[Xn ≤ y″] + P[|Xn − Yn| ≥ ε].
Since Xn ⇒ X, letting n → ∞ gives

(25.13)   P[X ≤ y′] ≤ lim inf_n P[Yn ≤ x] ≤ lim sup_n P[Yn ≤ x] ≤ P[X ≤ y″].
Since P[X = y] = 0 for all but countably many y, if P[X = x] = 0, then y′ and y″ can further be chosen so that P[X ≤ y′] and P[X ≤ y″] are arbitrarily near P[X ≤ x]; hence P[Yn ≤ x] → P[X ≤ x]. •

Theorem 25.4 has a useful extension. Suppose that (X_n^{(u)}, Yn) is a two-dimensional random vector.

Theorem 25.5. If X_n^{(u)} ⇒ X^{(u)} as n → ∞ for each u, if X^{(u)} ⇒ X as u → ∞, and if

(25.14)   lim_u lim sup_n P[|X_n^{(u)} − Yn| ≥ ε] = 0

for positive ε, then Yn ⇒ X.

PROOF. Replace Xn by X_n^{(u)} in (25.12). If P[X = y′] = P[X^{(u)} = y′] = 0 and P[X = y″] = P[X^{(u)} = y″] = 0, then letting n → ∞ in (25.12) and then letting u → ∞ with the help of (25.14) and X^{(u)} ⇒ X gives (25.13) as before, and the argument is completed as in the proof of Theorem 25.4. •
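Theorem 25.4 lends itself to a simple numerical illustration. In the sketch below everything concrete (the standard normal Xn, the perturbation Zn/√n, the point x, the sample sizes) is an illustrative assumption; the point is only that the perturbed variable has nearly the same limiting distribution function value.

```python
import math
import random

# Y_n = X_n + Z_n / sqrt(n), where Z_n is exponential, so X_n - Y_n => 0;
# by Theorem 25.4, Y_n has the same weak limit as X_n.
random.seed(2)
x, trials = 0.5, 20_000                            # assumed point, repetitions
target = 0.5 * (1 + math.erf(x / math.sqrt(2)))    # P[X <= x], X standard normal

for n in (10, 100, 10_000):
    count = 0
    for _ in range(trials):
        x_n = random.gauss(0.0, 1.0)
        y_n = x_n + random.expovariate(1.0) / math.sqrt(n)
        if y_n <= x:
            count += 1
    print(n, "P[Y_n <= x] ~", round(count / trials, 4), " limit", round(target, 4))
```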
Fundamental Theorems
Some of the fundamental properties of weak convergence were established in Section 14. It was shown there that a sequence cannot have two distinct weak limits: If Fn ⇒ F and Fn ⇒ G, then F = G. The proof is simple: The hypothesis implies that F and G agree at their common points of continuity, hence at all but countably many points, and hence by right continuity at all points. Another simple fact is this: If lim_n Fn(d) = F(d) for d in a set D dense in R¹, then Fn ⇒ F. Indeed, if F is continuous at x, there are in D points d′ and d″ such that d′ < x < d″ and F(d″) − F(d′) < ε, and it follows that the limits superior and inferior of Fn(x) are within ε of F(x).

For any probability measure on (R¹, ℛ¹) there is on some probability space a random variable having that measure as its distribution. Therefore, for probability measures satisfying μn ⇒ μ, there exist random variables Yn and Y having these measures as distributions and satisfying Yn ⇒ Y. According to the following theorem, the Yn and Y can be constructed on the same probability space, and even in such a way that Yn(ω) → Y(ω) for every ω, a condition which is of course much stronger than Yn ⇒ Y. This result, Skorohod's theorem, makes possible very simple and transparent proofs of many important facts.
Theorem 25.6. Suppose that μn and μ are probability measures on (R¹, ℛ¹) and μn ⇒ μ. There exist random variables Yn and Y on a common probability space (Ω, ℱ, P) such that Yn has distribution μn, Y has distribution μ, and Yn(ω) → Y(ω) for each ω.
PROOF. For the probability space (Ω, ℱ, P), take Ω = (0, 1), let ℱ consist of the Borel subsets of (0, 1), and for P(A) take the Lebesgue measure of A. The construction is related to that in the proofs of Theorems 14.1 and 20.4. Consider the distribution functions Fn and F corresponding to μn and μ. For 0 < ω < 1, put Yn(ω) = inf[x: ω ≤ Fn(x)] and Y(ω) = inf[x: ω ≤ F(x)]. Since ω ≤ Fn(x) if and only if Yn(ω) ≤ x (see the argument following (14.5)), P[ω: Yn(ω) ≤ x] = P[ω: ω ≤ Fn(x)] = Fn(x). Thus Yn has distribution function Fn; similarly, Y has distribution function F. It remains to show that Yn(ω) → Y(ω). The idea is that Yn and Y are essentially inverse functions to Fn and F; if the direct functions converge, so must the inverses.

Suppose that 0 < ω < 1. Given ε, choose x so that Y(ω) − ε < x < Y(ω) and μ{x} = 0. Then F(x) < ω; Fn(x) → F(x) now implies that, for n large enough, Fn(x) < ω and hence Y(ω) − ε < x < Yn(ω). Thus lim inf_n Yn(ω) ≥ Y(ω). If ω < ω′ and ε is positive, choose a y for which Y(ω′) < y < Y(ω′) + ε and μ{y} = 0. Now ω < ω′ ≤ F(Y(ω′)) ≤ F(y), and so, for n large enough, ω < Fn(y) and hence Yn(ω) ≤ y < Y(ω′) + ε. Thus lim sup_n Yn(ω) ≤ Y(ω′) if ω < ω′. Therefore, Yn(ω) → Y(ω) if Y is continuous at ω.

Since Y is nondecreasing on (0, 1), it has at most countably many discontinuities. At discontinuity points ω of Y, redefine Yn(ω) = Y(ω) = 0. With this change, Yn(ω) → Y(ω) for every ω. Since Y and the Yn have been altered only on a set of Lebesgue measure 0, their distributions are still μn and μ. •
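The quantile construction used in this proof can be carried out mechanically. The sketch below is a toy version under an assumed concrete family, not taken from the text: Fn is the normal distribution function with mean 1/n and F the standard normal one; the bisection routine computes Yn(ω) = inf[x: ω ≤ Fn(x)] and exhibits Yn(ω) → Y(ω) at a fixed ω.

```python
import math

def normal_cdf(x, mean=0.0):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def quantile(cdf, w, lo=-50.0, hi=50.0):
    # Y(w) = inf[x : w <= cdf(x)], found by bisection.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if cdf(mid) >= w:
            hi = mid
        else:
            lo = mid
    return hi

w = 0.3                                   # assumed point of (0, 1)
y = quantile(normal_cdf, w)
for n in (1, 10, 100, 1000):
    y_n = quantile(lambda t: normal_cdf(t, mean=1.0 / n), w)
    print(n, "Y_n(w) =", round(y_n, 6), "  Y(w) =", round(y, 6))
```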
Note that this proof uses the order structure of the real line in an essential way. The proof of the corresponding result in R^k (Theorem 29.6) is more complicated.

The following mapping theorem is of very frequent use.

Theorem 25.7. Suppose that h: R¹ → R¹ is measurable and that the set Dh of its discontinuities is measurable.† If μn ⇒ μ and μ(Dh) = 0, then μn h^{−1} ⇒ μh^{−1}.
†That Dh lies in ℛ¹ is generally obvious in applications. In point of fact, it always holds (even if h is not measurable): Let A(ε, δ) be the set of x for which there exist y and z such that |x − y| < δ, |x − z| < δ, and |h(y) − h(z)| ≥ ε. Then A(ε, δ) is open and Dh = ∪_ε ∩_δ A(ε, δ), where ε and δ range over the positive rationals.
Recall (see (13.7)) that μh^{−1} has value μ(h^{−1}A) at A.
PROOF. Consider the random variables Yn and Y of Theorem 25.6. Since Yn(ω) → Y(ω), Y(ω) ∉ Dh implies that h(Yn(ω)) → h(Y(ω)). Since P[ω: Y(ω) ∈ Dh] = μ(Dh) = 0, h(Yn(ω)) → h(Y(ω)) holds with probability 1. Hence h(Yn) ⇒ h(Y) by Theorem 25.2. Since P[h(Y) ∈ A] = P[Y ∈ h^{−1}A] = μ(h^{−1}A), h(Y) has distribution μh^{−1}; similarly, h(Yn) has distribution μn h^{−1}. Thus h(Yn) ⇒ h(Y) is the same thing as μn h^{−1} ⇒ μh^{−1}. •
Take
If Xn => X and P[ X E Dh] = 0, then h( Xn ) => h( X).
X = a:
Corollary 2.
If Xn � a and h is continuous at a, then h( Xn ) => h(a).
From Xn => X aXn + b => aX + b. Suppose also a ) Xn => 0 by Example 25.7, and now an Xn + bn => aX + b follows
it follows directly by the theorem that that a n � a and bn � b. Then (a n so (anXn + bn) - (aX, + b) => 0. And by Theorem 25 . 4 : If Xn => X, an � a, and bn � b, then anXn + bn => aX + b. This fact was stated and proved • differently in Section 14-see Le mma 1 on p. 195. Example 25.8.
By definition, p. n => p. means that the corresponding distribution func tions converge weakly. The following theorem characterizes weak conver gence without reference to distribution functions. The boundary aA of A consists of the points that are limits of sequences in A and are also limits of sequences in Ac; alternatively, aA is the closure of A minus its interior. A set A is a p.-continuity set if it is a Borel set and p.( a A ) = 0. Theorem 25.8. The following three conditions are equivalent. (i) p. n p.; (ii) f f dp. n � f fdp. for every bounded, continuous real function f; (iii) P.n( A ) � p.(A) for every p.-continuity set A. :::0
Suppose that P.n => p. and consider the random variables Yn and Y of Theorem 25.6. Suppose that f is a bounded function such that p.( D1) = 0, where D1 is the set of points of discontinuity of f. Since P [ Y E D1] = p.( D1 ) = 0, f( Yn) � f( Y) with probability 1, and so by change of variable (see (21 .1)) and the bounded convergence theorem, ffdp.n = E[/( Yn)] � PROOF.
SECTION
25.
WEAK CONVERGENCE
345
E[ /( Y )] = f fdp.. Thus P.n => p. and p.( D1 ) = 0 together imply that f fdp.n __. f f dp. if f is bounded. In particular, (i) implies (ii). Further, if f = IA ' then D1 = aA , and from p.( aA ) = O and P.n => p. follows p.,,( A ) = f fdp.n __. f I dp. = p. ( A ). Thus (i) also implies (iii). Since a( - 00 , x ] = { X } , obviously (iii) implies (i). It therefore remains only to deduce p. n => p. from (ii). Consider the corresponding distribution functions. Suppose that x < y, and let /( t ) be 1 for t < x, 0 for t > y, and interpolate linearly on [x, y]: /( t ) = ( y - t )/( y - x ) for x < t < y . Since Fn (x ) < f fdp. n and f fdp. < F( y ), it follows from (ii) that lim supn Fn(x ) < F( y ); letting y � x shows that lim supn Fn(x ) < F( x ). Similarly, F( u ) < lim infn Fn( x ) for u < x and hence F( x - ) < lim infn Fn( x). This implies • convergence at continuity points. The function f in this last part of the proof is uniformly continuous. Hence P.n => p. follows if J fdp.n � f fdp. for every bounded and uniformly continuous f. Example 25. 9. The distributions of Example 25.3 satisfy P.n => p., but Hence this A P n ( A ) does not converge to p.( A) if A is the set of rationals. • cannot be a p.-continuity set; in fact, of course, a A = R 1 •
The concept of weak convergence would be nearly useless if (25.2) were not allowed to fail when p.( aA ) > 0. Since F(x) - F(x - ) = p. { x } = p.( a( - 00 , X )), it is therefore natural in the original definition tO allow (25.}) to fail when x is not a continuity point of F. Belly's 1beorem
One of the most frequently used results in analysis is the Helly selection theorem: Theorem 25.9. For every sequence { Fn } of distribution functions there exists a subsequence { Fn " } and a nondecreasing, right-continuous function F such that lim k Fn "(x) = F(x) at continuity points x of F. An application of the diagonal method [A14] gives a sequence PROOF. { n k } of integers along which the limit G( r ) = lim k Fn"( r ) exists for every rational r. Define F( x) = inf(G( r ): x < r ] . Clearly F is nondecreasing. To each x and £ there is an r for which x < r and G ( r ) < F( x ) + £. If x s y < r, then F( y ) < G(r) < F(x) + £. Hence F is continuous from the right. If F is continuous at x, choose y < x so that F( x ) - £ < F( y ) ; now choose rational r and s so that y < r < x < s and G(s ) < F( x) + £. From
346
CONVERGENCE OF DISTRIBUTIONS
F( x ) - £ < G ( r ) < G(s ) < F(x ) + £ and Fn ( r ) < Fn (x ) < Fn( s ) it follows that as k goes to infinity Fn k(x) has limits superior and inferior within £ of F( x ). • The F in this theorem necessarily satisfies 0 < F(x) < 1 . But F need not be a distribution function; if Fn has a unit jump at n , for example, F( x ) 0 is the only possibility. It is important to have a condition which ensures that for some subsequence the limit F is a distribution function. A sequence of probability measures P.n on (R 1 , !1/ 1 ) is said to be tight if for each £ there exists a finite interval ( a, b] such that p. n( a, b] > 1 - £ for all n . In terms of the corresponding distribution functions Fn, the condition is that for each £ there exist x and y such that Fn(x) < £ and F,, ( y ) > 1 - £ for all n . If p. n is a unit mass at n , { p. n } is not tight in this sense- the mass of p. n "escapes to infinity." Tightness is a condition preventing this escape of mass. =
Theorem 25.10. Tightness is a necessary and sufficient condition that for every subsequence { p. n k } there exist a further subsequence { p. n } and a probability measure p. such that p. n k ( J ) � p. as j � oo . k(})
Only the sufficiency of the condition in this theorem is used follows.
in
what
Apply Helly' s theorem to the subsequence { F,, " } of corresponding distribution functions. There exists a further subsequence { Fn k j ) } such that lim 1.Fn k ( J )( x ) = F(x ) at continuity points of F, where F is � nonaecreasing and right-continuous. There exists by Theorem 12.4 a measure p. on ( R 1 , !1/ 1 ) such that p.(a, b] = F( b ) - F( a ) . Given £, choose a and b so that P. n( a , b ] > 1 - £ for all n , which is possible by tightness. By decreasing a and increasing b, one can ensure that they are continuity points of F. But then p. ( a, b] > 1 - £. Therefore, p. is a probability mea • sure, and of course p. n k ( J ) � p. . PROOF :
SUFFICIENCY.
If { p. n } is not tight, there exists a positive £ such that for each finite interval ( a, b ], p. n( a, b] < 1 - £ for some n . Choose n " so that P. n k ( - k , k ] < 1 - £. Suppose that some subsequence { P. n A ( J ) } of { P. n A } we re to converge weakly to some probability measure p. . Choose ( a, b] so that p. { a } = p. { b } = 0 and p. ( a, b ] > 1 - £. For large enough }, ( a , b) c ( - k ( j ), k ( j )], and so 1 - £ > P. n A (J ) ( - k ( j ), k ( j )] > p. , A ( J ) ( a, b] � p. ( a � b] . • Thus p. ( a, b ] < 1 - £, a contradiction. NECESSITY.
Corollary. If { P.n } is a tight sequence of probability measures, and if each subsequence that converges weakly at all converges weakly to the probability measure p. , then p. n � p. .
SECTION
25.
WEAK CONVERGENCE
347
By the theorem, each subsequence { p. n " } contains a further subsequence { p. n } converging weakly ( j � oo) to some limit, and that limit must by hypothesis be p.. Thus every subsequence { p., A } contains a further su bsequence { p. n lc ( J ) } converging weakly to p.. Suppose that p. n � p. is false. Then there exists some x such that p { x } = 0 but P. n( - oo, x] does not converge to p.( - oo , x]. But then there exists a positive £ such that I P. n "( - oo , x] - p. ( - oo , x ] l > £ for an infinite sequence { n k } of integers, and no subsequence of { P. n A } can converge • weakly to p.. This contradiction shows that P.n � p.. PROOF.
A(})
If ,., n is a unit mass at X n ' then { ,., n } is tight if and only if { X n } is bounded. The theorem above and its corollary reduce in this case to standard facts about the real line; see Example 25.4 and A10: tightness of sequences of probability measures is analogous to boundedness of sequences of real numbers. Let P.n be the normal distribution with mean m n and variance on2• If m n and on2 are bounded, then the second moment of P.n is bounded, and it follows by Markov' s inequality (21 .12) that { P.n } is tight. The conclusion of Theorem 25.10 can also be checked directly: If { n k (J) } is chosen so that lim1.m n lc ( J ) = m and lim 1.on2lc ( J ) = o 2, then P.n A t J ) � p., where p. is normal with mean m and variance o2 (a unit mass at m if o2 = 0). If m n > b, then P.n(b, oo) > ! ; if m n < a, then p. , ( - oo , a] > t · Hence { p. n } cannot be tight if m n is unbounded. If m n is bounded, say by K, then P n ( - oo , a ] > v ( - oo, ( a - K )on- 1 ], where v is the standard normal distri bution. If on is unbounded, then v ( - oo, (a - K )on- 1 ] � i along some subsequence, and { p. n } cannot be tight. Thus a sequence of normal distribu • tions is tight if and only if the means and variances are bounded. Example 25. 10.
Integration to the Limit 1beorem 25. 1 1.
If Xn � X, then E [ I XI ] < lim infn E [ 1 Xn 1 1 -
Apply Skorohod ' s Theorem 25.6 to the distributions of X, and X: There exist on a common probability space random variables Y,, and Y such that Y = lim n Yn with probability 1 , Yn has the distribution of Xn , and Y has the distribution of X. By Fatou ' s lemma, E[ I Yl ] < lim inf, E[ I Y,, l ]. Since 1 XI and 1 Yl have the same distribution, they have the same expected value (see (21 .6)), and similarly for 1 xn 1 and I Yn 1 • PROOF.
The random variables Xn are said to be uniformly integrable if (25 . 1 5 )
lim sup a , .....
oo
£
[ I xn 1 � a J
I X,. I dP
=
0;
348
CONVERGENCE OF DISTRIBUTIONS
see (16.21). This implies (see (16.22)) that sup E ( I Xn l ) < 00 .
( 25 .16 )
n
If Xn � X and the Xn are uniformly integrable, then X is
Theorem 25. 12.
integrable and ( 25 .17)
Construct random variables Yn and Y as in the preceding proof. Since Yn � Y with probability 1 and the Yn are uniformly integrable in the sense of (16.21 ), E [ Xn ] = E [ Yn ] � E [ Y ] = E [ X] by Theorem 16.13. • PROOF.
If supn E [ I xn 1 1 + ( ] < 00 for some positive integrable because
£'
then the xn are uniformly
(25 .18 )
Since Xn :::o X implies that x; � xr by Theorem 25.7, there is the follow ing consequence of the theorem.
Let r be a positive integer. If Xn :::o X and supn E [ I Xn l r + f ] < > 0, then E [ I X I r] < 00 and E [ x;] � E[ xr].
Corollary.
here
w
£
oo ,
The Xn are also uniformly integrable if there is an integrable random variable Z such that P[ I Xn l > t ] s P[ I Z I > t ] for t > 0, because then ( 2 1 . 1 0 ) gives
1
[ I X,.I � a)
I X,. I dP
<
1
[ IZI > a ]
I Z I dP.
From this the dominated convergence theorem follows again. PROBLEMS 25.1.
(a) Show by example that distribution functions having densities can
converge weakly even if the densities do not converge. Hint: Consider /,. ( x ) == 1 + cos 2wnx on [0, 1]. (b) Let /,. be 211 times the indicator of the set of x in the unit interval for which d ,. + (x) == · · · == d2 n ( x) == 0, where d" (x) is the k th dyadic digi t. 1
SECTION
25.
349
WEAK CONVERGENCE
Show that In ( x) --+ 0 except on a set of Lebesgue measure 0; on this exceptional set, redefine In ( x) == 0 for all n , so that /,. ( x) --+ 0 everywhere. Show that the distributions -corresponding to these densities converge weakly to Lebesgue measure confined to the unit interval. (c) Show that distributions with densities can converge weakly to a limit that has no density (even to a unit mass). (d) Show that discrete distributions can converge weakly to a distribution that has a density. (e) Construct an example, like that of Example 25.3, in which p.,. ( A ) --+ p.( A ) fails but in which all the measures come from continuous densities on 25.2. 25.3.
[0, 1 ]. 14.12 f Give a simple proof of the Glivenko-Cantelli theorem (Theorem 20.6) under the extra hypothesis that F is continuous. Initial digits. (a) Show that the first significant digit of a positive number x is d (in the scale of 10) if and only if {log1 0x } lies between log1 0 d and log1 0( d + 1), d == 1, . . . , 9, where the brackets denote fractional part. (b) For positive numbers x 1 , x2 , , let Nn ( d ) be the number among the first n that have initial digit d. Show that •
•
•
d == 1 , . . . , 9,
( 25 .19)
if the sequence log1 0xn , n == 1, 2, . .n. , is uniformly distributed modulo 1. This is true, for example, of xn == f if log1 08 is irrational. (c) Let Dn be the first significant digit of a positive random variable Xn . Show that
( 25 .20) lim P [ Dn == d ] == log 1 0( d + 1 ) - log1 0 d , n
25.4.
d == 1 , . . . , 9,
if {log1 0 Xn } � U, where U is uniformly distributed over the unit interval. Show that for each probability measure p. on the line there exist probability measures #Ln with finite support such that #Ln � p.. Show further that P.n{ x } can be taken rational and that each point in the support can be taken rational. Thus there exists a countable set of probability measures such that every p, is the weak limit of some sequence from the set. The space of distribution functions is thus separable in the Levy metric (see Problem
14.9).
25.5. 25.6.
25.7. 25.8.
Show that (25.10) implies that P([ X < X ]�[ xn < X ]) --+ 0 if P[ X == X ] == 0. For arbitrary random variables Xn there exist positive constants an such that an xn � 0. Generalize Example 25.8 by showing for three-dimensional random vectors ( A n , Bn , Xn ) and constants a and b, a > 0, that, if An � a, Bn � b, and Xn � X, then AnXn + Bn � aX + b . Suppose that Xn � X and that hn and h are Borel functions. Let E be the set of x for1 which hnxn --+ hx fails for some sequence xn --+ x. Suppose that E e 9P and P[ X e E] == 0. Show that hn Xn � hX.
JSO
CONVERGENCE OF DISTRIBUTIONS
Suppose that the distributions of random variables X,, and X have densities /,. and f. Show that if fn( x) --+ /( x ) for x outside a set of Lebesgue measure 0, then Xn � X. 25. 10. f Suppose that Xn assumes as values y, + k B,. , k == 0, + 1, . . . , where with n in Bn > 0. Suppose that Bn --+ 0 and that, if kn is an integer varying 1 such a way that Yn + knBn --+ x, then P[ Xn == Yn + knBn 1 Bn- --+ /( x), where I is the density of a random variable X. Show that xn = X. 25. 11. f Let Sn have the binomial distribution with parameters n and p. Assume as known that 25.9.
( 25 .21)
25.12. 25. 13.
25. 14.
P [ S,. = k n ] ( np ( l - p ))
1
--+ ..ff; e - ' 2 /2
if ( kn - np)( np(1 - p)) - 1 12 1--+ x. Deduce the DeMoivre-Laplace the orem: ( Sn - np)( np( 1 - p)) - 12 = N, where N has the standard normal distribution. This is a special case of the central limit theorem; see Section 27. Prove weak convergence in Example 25.3 by using Theorem 25.8 and the theory of the Riemann integral. (a) Show that probability measures satisfy #Ln = p. if P. n ( a , b] --+ p.( a , b) whenever I' { a } == I' { b } == 0. (b) Show that, if I fdp.n --+ I fdp. for all continuous f with bounded support, then ,.,. n = ,.,. . f Let I' be Lebesgue measure confined to the unit interval; let l'n COrrespond tO a mass Of Xn . Xn 1 at SOme point in ( Xn . Xn . ] where 0 == Xn o < xn 1 < · · · < xnn == 1. Show by considering the distribution functions that p.,. = p. if max ; < n( x, ; - xn . ; - 1 ) --+ 0. Deduce that a bounded Borel function continuous almost everywhere on the unit interval is Riemann integrable. See Problem 17.6. 2.15 5.16 f A function f of positive integers has distribution function F if F is the weak limit of the distribution function Pn [ m : /( m ) < x] of I under the measure having probability 1/n at each of 1, . . . , n (see (2.20)). In this case D [ m : /( m ) < x] == F(x) (see (2.21)) for continuity points x of F. Show that q>(m)/m (see (2.23)) has a distribution: (a) Show by the mapping theorem that it suffices to prove that f( m) .. log(q>( m )/m) == EPBP ( m)log(1 - 1/p) has a distribution. (b) Let fu ( m) == EP "B1 ( m )log(1 - 1/p), and show by (5.41) that fu has distribution function �\ X ) == P{E u Xp log(1 - 1/p) < x], where the XP are independent random variables (one for each prime p) such that P[ XP .. 1] == 1/p and P( XP == 0] == 1 - 1/p. (c) Show that EP XP log(1 - 1/p) converges with probability 1. Hint: Use Theorem 22.6. (d) Show that limu _ oo supn En l l/ - fu l 1 == 0 (see (5.42) for the notation). (e) Conclude by Markov's inequality and Theorem 25.5 that f has the distribution of the sum in (c). i
-
.
I_ 1,
I_
•.
25.15.
1 /2
s
s
I ,
26. CHARACTERISTIC FUNCTIONS 351 25.1 6. 5.16 f Suppose that /( n ) --+ x. Show that Pn [ m : 1/( m ) - x l > £] --+ 0. From Theorem 25.12 conclude that En [/] --+ x (see (5.42); use the fact that f is bounded). Deduce the consistency of Cesaro summation [A30]. 25 .1 7. For A E 911 and T > 0, put "T( A ) == A ([ - T, T] n A )/2T, where " is Lebesgue measure. The relative measure of A is SECTION
p( A)
( 25 .22)
==
lim AT ( A ) ,
r-. oo
provided that this limit exists. This is a continuous analogue of density (see (2.21)) for sets of integers. A Borel function f has a distribution under "T; if this converges weakly to F, then
p ( x : /( X )
( 25 .23 )
<
u]
==
F( u )
for continuity points u of F, and F is called the distribution function of f. Show that all periodic functions have distributions. 25.18. Suppose that supn / fdp.n < oo for a nonnegative f such that f(x) --+ oo as x --+ + oo . Show that { P.n } is tight. 25. 19. f Let f( x) == l x l cr , a > 0 ; this is an f of the kind in Problem 25.18. Construct a tight sequence for which f lx l aP.n ( dx) --+ oo ; construct one for which f l x l aP.n( dx) == 00 . 25.20. 23.4 t
Show that the random variables A, and L, in Problems 23.3 and 23.4 converge in distribution. Show that the moments converge. 25.21. In the applications of Theorem 9.2, only a weaker result is actually needed: For each K there exists a positive a == a( K) such that if E[ X] == 0, E[ X2 ] == 1, and E [ X4 ] < K, then P[ X > 0] > a. Prove this by using tightness and the corollary to Theorem 25.12. 25.22. Find uniformly integrable random variables Xn for which there is no integrable Z satisfying P[ I Xn l > t] � P[ I ZI > t] for t > 0. SECI'ION 26. CHARACI'ERISTIC FUNCI'IONS Definition
The characteristic function of a probability measure μ on the line is defined for real t by

φ(t) = ∫_{−∞}^{∞} e^{itx} μ(dx);
see the end of Section 16 for integrals of complex-valued functions.† A random variable X with distribution μ has characteristic function

φ(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} μ(dx).
The characteristic function is thus defined as the moment generating func tion but with the real argument s replaced by it; it has the advantage that it always exists because e itx is bounded. The characteristic function in non probabilistic contexts is called the Fourier transform. The characteristic function has three fundamental properties to be estab lished here: (i) If p. 1 and p. 2 have respective characteristic functions cp 1 (t) and cp2 ( t ), then p.1 * p. 2 has characteristic function cp 1 ( t) cp2 ( t ). Although con volution is essential to the study of sums of independent random variables, it is a complicated operation, and it is often simpler to study the products of the corresponding characteristic functions. (ii) The characteristic function uniquely determines the distribution. This shows that in studying the products in (i), no information is lost. (iii) From the pointwise convergence of characteristic functions follows the weak convergence of the corresponding distributions. This makes it possible, for example, to investigate the asymptotic distributions of sums of independent random variables by means of their characteristic functions. Moments and Derivatives
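Property (i) and the normal case treated below in Example 26.1 can be checked by a small Monte Carlo computation. In this sketch the sample size, the value of t, and the uniform distribution chosen for Y are illustrative assumptions; the empirical characteristic function of X + Y is compared with the product of the individual ones, and the empirical characteristic function of a standard normal sample with e^{−t²/2}.

```python
import cmath
import math
import random

random.seed(3)
n, t = 100_000, 1.3                      # assumed sample size and argument
xs = [random.gauss(0.0, 1.0) for _ in range(n)]       # standard normal sample
ys = [random.uniform(-1.0, 1.0) for _ in range(n)]    # independent uniform sample

def empirical_cf(sample, t):
    return sum(cmath.exp(1j * t * v) for v in sample) / len(sample)

phi_x = empirical_cf(xs, t)
phi_y = empirical_cf(ys, t)
phi_sum = empirical_cf([u + v for u, v in zip(xs, ys)], t)

print("normal sample:      ", phi_x, "  exp(-t^2/2) =", math.exp(-t * t / 2))
print("product phi_x*phi_y:", phi_x * phi_y)
print("sample of X + Y:    ", phi_sum)
```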
It is convenient first to study the relation between a characteristic function and the moments of the distribution it comes from. Of course, cp(O) = 1, and by (16.25), l cp(t) l < 1 for all t. By Theorem 16.8(i), cp(t) is continuous in t. In fact, l cp(t + h) - cp(t) l < f l e i h x 1 1 p. ( dx ), and so it follows by the bounded convergence theorem that cp(t) is
uniformly continuous.
Integration by parts shows that
(26 .1)
1 (X X
0
S
n
) eu ds ·
=
n+1 n +1 's 1 I + ( X - S ) e ds, 1 1 n+ n+ 0 •
X
X
·
and it follows by induction that
n ( · k ·n + 1 n ) 1 ( x - s ) e ;s ds e ix (26 .2) + L '\ k. n. for n � 0. Replace n by n - 1 in (26.1), solve for the integral on the right, =
t From
k-o
1
1
x
o
complex variable theory only De Moivre's formula and the simplest properties of the
exponential function are needed here.
SECTION
26.
353
CHARACTERISTIC FUNCTIONS
and substitute this for the integral in (26.2); this gives (26 . 3)
e
ix
=
�
( ix ) k +
i..Jo k ' k·
(n
;n
_
1x( x - ) n -1 ( e is - 1) ds . 1) ' s
o
·
Estimating the integrals in (26.2) and (26.3) (consider separately the cases x � 0 and x < 0) now leads to
e ix -
(26 .4)
n ( ix ) k
E
k-o
k '.
<
.
nun
{ I In+1 ! , 21 In } X
( n + 1)
X
n '.
I
for n > 0. The first term on the right gives a sharp estimate for x I small, the second a sharp estimate for l x l large. For n = 0, 1 , 2, the inequality specializes to
l e ix - 1 1 <
min { l x l , 2} ;
{�
min x 2 , 2 1 xI
};
If X has a moment of order n , it follows that k 1 + n � . k [ ( ] E X < E nun (26.5) , i..Jo k'. ' 1) n + . k-
21
I eix - ( 1 + ix ) I �
For any
(26 . 6 )
) t cp t
(it )
[ { (I tXI
tXI n } ]
n '.
.
satisfying
cp( t) must therefore have the expansion (26. 7) compare (21.22). If
then (see (16.26)) (26.7) must hold. Thus generating function over the whole line.
(26.7) holds if
X has a moment
354
CONVERGENCE OF DISTRIBUTIONS
Since E[el t XI] < oo if X has the standard normal distri bution, by (26.7) and (21.7) its characteristic function is Example 26. 1.
(26 .8) cp ( t ) =
00 it ) 2 k 1 · 3 · · · (2 k - 1) = E ( E (2k ) 9. 00
k-0
This and (21 .25) formally coincide if s =
( -)
2 k 1 t 1 = e - 1 2 12 . k. 2 k -0
it.
•
If the power series expansion (26.7) holds, the moments of off from it:
X can be read
(26.9) This is the analogue of (21 .23). It holds, however, under the weakest possible assumption, namely that E[ 1 X k 1 ] < oo . Indeed,
l
hX i ( t + e h t ( )