BASIC PROBABILITY THEORY
Robert B. Ash Department of Mathematics University of Illinois
DOVER PUBLICATIONS, INC. Mineola, New York
Copyright
Copyright © 1970, 2008 by Robert B. Ash
All rights reserved.
Bibliographical Note

This Dover edition, first published in 2008, is an unabridged republication of the work originally published in 1970 by John Wiley & Sons, Inc., New York. Readers of this book who would like to receive the solutions to the exercises not published in the book may request them from the publisher at the following e-mail address: [email protected]
Library of Congress Cataloging-in-Publication Data

Ash, Robert B.
Basic probability theory / Robert B. Ash. - Dover ed.
p. cm.
Includes index.
Originally published: New York : Wiley, [1970]
ISBN-13: 978-0-486-46628-6
ISBN-10: 0-486-46628-0
1. Probabilities. I. Title.
QA273.A77 2008
519.2-dc22
200800473

Manufactured in the United States of America
Dover Publications, Inc., 31 East 2nd Street, Mineola, N.Y. 11501
Preface
This book has been written for a first course in probability and was developed from lectures given at the University of Illinois during the last five years. Most of the students have been juniors, seniors, and beginning graduates, from the fields of mathematics, engineering, and physics. The only formal prerequisite is calculus, but an additional degree of mathematical maturity may be helpful. In talking about nondiscrete probability spaces, it is difficult to avoid measure-theoretic concepts. However, to develop extensive formal machinery from measure theory before going into probability (as is done in most graduate programs in mathematics) would be inappropriate for the particular audience to whom the book is addressed. Thus I have tried to suggest, when possible, the underlying measure-theoretic ideas, while emphasizing the probabilistic way of thinking, which is likely to be quite novel to anyone studying this subject for the first time.

The major field of application considered in the book is statistics (Chapter 8). In addition, some of the problems suggest connections with the physical sciences. Chapters 1 to 5 and Chapter 8 will serve as the basis for a one-semester or two-quarter course covering both probability and statistics. If probability alone is to be considered, Chapter 8 may be replaced by Chapters 6 and 7, as time permits. An asterisk before a section or a problem indicates material that I have normally omitted (without loss of continuity), either because it involves subject matter that many of the students have not been exposed to (for example, complex variables) or because it represents too concentrated a dosage of abstraction.

A word to the instructor about notation. In the most popular terminology, P{X ≤ x} is written for the probability that the random variable X assumes a value less than or equal to the number x.
I tried this once in my class, and I found that as the semester progressed, the capital X tended to become smaller in the students' written work, and the small x larger. The following semester, I switched to the letter R for random variable, and this notation is used throughout the book.
Fairly detailed solutions to some of the problems (and numerical answers to others) are given at the end of the book. I hope that the book will provide an introduction to more advanced courses in probability and real analysis and that it will make the abstract ideas to be encountered later more meaningful. I also hope that nonmathematics majors who come in contact with probability theory in their own areas will find the book useful. A brief list of references, suitable for further study, is given at the end of the book.

I am grateful to the many students and colleagues who have influenced my own understanding of probability theory and thus contributed to this book. I also thank Mrs. Dee Keel for her superb typing, and the staff of Wiley for its continuing interest and assistance.

Urbana, Illinois, 1969
Robert B. Ash
Contents

1. BASIC CONCEPTS

1.1 Introduction 1
1.2 Algebra of Events (Boolean Algebra) 3
1.3 Probability 10
1.4 Combinatorial Problems 15
1.5 Independence 25
1.6 Conditional Probability 33
1.7 Some Fallacies in Combinatorial Problems 39
1.8 Appendix: Stirling's Formula 43

2. RANDOM VARIABLES

2.1 Introduction 46
2.2 Definition of a Random Variable 48
2.3 Classification of Random Variables 51
2.4 Functions of a Random Variable 58
2.5 Properties of Distribution Functions 66
2.6 Joint Density Functions 70
2.7 Relationship Between Joint and Individual Densities; Independence of Random Variables 76
2.8 Functions of More Than One Random Variable 85
2.9 Some Discrete Examples 95

3. EXPECTATION

3.1 Introduction 100
3.2 Terminology and Examples 107
3.3 Properties of Expectation 114
3.4 Correlation 119
3.5 The Method of Indicators 122
3.6 Some Properties of the Normal Distribution 124
3.7 Chebyshev's Inequality and the Weak Law of Large Numbers 126

4. CONDITIONAL PROBABILITY AND EXPECTATION

4.1 Introduction 130
4.2 Examples 133
4.3 Conditional Density Functions 135
4.4 Conditional Expectation 140
4.5 Appendix: The General Concept of Conditional Expectation 152

5. CHARACTERISTIC FUNCTIONS

5.1 Introduction 154
5.2 Examples 158
5.3 Properties of Characteristic Functions 166
5.4 The Central Limit Theorem 169

6. INFINITE SEQUENCES OF RANDOM VARIABLES

6.1 Introduction 178
6.2 The Gambler's Ruin Problem 182
6.3 Combinatorial Approach to the Random Walk; the Reflection Principle 186
6.4 Generating Functions 191
6.5 The Poisson Random Process 196
6.6 The Strong Law of Large Numbers 203

7. MARKOV CHAINS

7.1 Introduction 211
7.2 Stopping Times and the Strong Markov Property 217
7.3 Classification of States 220
7.4 Limiting Probabilities 230
7.5 Stationary and Steady-State Distributions 236

8. INTRODUCTION TO STATISTICS

8.1 Statistical Decisions 241
8.2 Hypothesis Testing 243
8.3 Estimation 258
8.4 Sufficient Statistics 264
8.5 Unbiased Estimates Based on a Complete Sufficient Statistic 268
8.6 Sampling from a Normal Population 274
8.7 The Multidimensional Gaussian Distribution 279

Tables 286
A Brief Bibliography 289
Solutions to Problems 290
Index 333
BASIC PROBABILITY THEORY
Basic Concepts

1.1 INTRODUCTION
The origin of probability theory lies in physical observations associated with games of chance. It was found that if an "unbiased" coin is tossed independently n times, where n is very large, the relative frequency of heads, that is, the ratio of the number of heads to the total number of tosses, is very likely to be very close to 1/2. Similarly, if a card is drawn from a perfectly shuffled deck and then is replaced, the deck is reshuffled, and the process is repeated over and over again, there is (in some sense) convergence of the relative frequency of spades to 1/4.

In the card experiment there are 52 possible outcomes when a single card is drawn. There is no reason to favor one outcome over another (the principle of "insufficient reason" or of "least astonishment"), and so the early workers in probability took as the probability of obtaining a spade the number of favorable outcomes divided by the total number of outcomes, that is, 13/52 or 1/4. This so-called "classical definition" of probability (the probability of an event is the number of outcomes favorable to the event, divided by the total number of outcomes, where all outcomes are equally likely) is first of all restrictive (it considers only experiments with a finite number of outcomes) and, more seriously, circular (no matter how you look at it, "equally likely"
essentially means "equally probable," and thus we are using the concept of probability to define probability itself). Thus we cannot use this idea as the basis of a mathematical theory of probability; however, the early probabilists were not prevented from deriving many valid and useful results.

Similarly, an attempt at a frequency definition of probability will cause trouble. If Sn is the number of occurrences of an event in n independent performances of an experiment, we expect physically that the relative frequency Sn/n should converge to a limit; however, we cannot assert that the limit exists in a mathematical sense. In the case of the tossing of an unbiased coin, we expect that Sn/n → 1/2, but a conceivable outcome of the process is that the coin will keep coming up heads forever. In other words, it is possible that Sn/n → 1, or that Sn/n → any number between 0 and 1, or that Sn/n has no limit at all.

In this chapter we introduce the concepts that are to be used in the construction of a mathematical theory of probability. The first ingredient we need is a set Ω, called the sample space, representing the collection of possible outcomes of a random experiment. For example, if a coin is tossed once we may take Ω = {H, T}, where H corresponds to a head and T to a tail. If the coin is tossed twice, this is a different experiment and we need a different Ω, say {HH, HT, TH, TT}; in this case one performance of the experiment corresponds to two tosses of the coin.

If a single die is tossed, we may take Ω to consist of six points, say Ω = {1, 2, ..., 6}. However, another possible sample space consists of two points, corresponding to the outcomes "N is even" and "N is odd," where N is the result of the toss. Thus different sample spaces can be associated with the same experiment. The nature of the particular problem under consideration will dictate which sample space is to be used.
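The relative-frequency behavior described above is easy to experiment with numerically. The following sketch is not part of the text; the function name, the seed, and the use of Python's standard `random` module are my own choices for illustration.

```python
import random

def relative_frequency(n, p=0.5, seed=1):
    """Simulate n independent tosses of a coin with P(heads) = p
    and return S_n / n, the relative frequency of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p for _ in range(n))
    return heads / n

# As n grows, S_n / n is very likely to be close to 1/2, although,
# as the text notes, nothing guarantees convergence for any
# particular sequence of tosses.
for n in (10, 100, 10_000):
    print(n, relative_frequency(n))
```

Running this for increasing n shows the relative frequency settling near 1/2, which is exactly the physical observation the mathematical theory is meant to make precise.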
If we are interested, for example, in whether or not N ≥ 3 in a given performance of the experiment, the second sample space, corresponding to "N even" and "N odd," will not be useful to us. In general, the only physical requirement on Ω is that a given performance of the experiment must produce a result corresponding to exactly one of the points of Ω. We have as yet no mathematical requirements on Ω; it is simply a set of points.

Next we come to the notion of event. An "event" associated with a random experiment corresponds to a question about the experiment that has a yes or no answer, and this in turn is associated with a subset of the sample space. For example, if a coin is tossed twice and Ω = {HH, HT, TH, TT}, "the number of heads is ≤ 1" will be a condition that either occurs or does not occur in a given performance of the experiment. That is, after the experiment is performed, the question "Is the number of heads ≤ 1?" can be answered yes or no. The subset of Ω corresponding to a "yes" answer is A = {HT, TH, TT}; that is, if the outcome of the experiment is HT, TH, or TT, the answer
FIGURE 1.1.1 Coin-Tossing Experiment: A = {number of heads ≤ 1}, B = {first toss = second toss}.
to the question "Is the number of heads ≤ 1?" will be "yes," and if the outcome is HH, the answer will be "no." Similarly, the subset of Ω associated with the "event" that the result of the first toss is the same as the result of the second toss is B = {HH, TT}. Thus an event is defined as a subset of the sample space, that is, a collection of points of the sample space. (We shall qualify this in the next section.) Events will be denoted by capital letters at the beginning of the English alphabet, such as A, B, C, and so on.

An event may be characterized by listing all of its points, or equivalently by describing the conditions under which the event will occur. For example, in the coin-tossing experiment just considered, we write

A = {the number of heads is less than or equal to 1}

This expression is to be read as "A is the set consisting of those outcomes which satisfy the condition that the number of heads is less than or equal to 1," or, more simply, "A is the event that the number of heads is less than or equal to 1." The event A consists of the points HT, TH, and TT; therefore we write A = {HT, TH, TT}, which is to be read "A is the event consisting of the points HT, TH, and TT." As another example, if B is the event that the result of the first toss is the same as the result of the second toss, we may describe B by writing B = {first toss = second toss} or, equivalently, B = {HH, TT} (see Figure 1.1.1).

Each point belonging to an event A is said to be favorable to A. The event A will occur in a given performance of the experiment if and only if the outcome of the experiment corresponds to one of the points of A. The entire sample space Ω is said to be the sure (or certain) event; it must occur on any given performance of the experiment. On the other hand, the event consisting of none of the points of the sample space, that is, the empty set ∅, is called the impossible event; it can never occur in a given performance of the experiment.

1.2 ALGEBRA OF EVENTS (BOOLEAN ALGEBRA)
Before talking about the assignment of probabilities to events, we introduce some operations by which new events are formed from old ones. These
operations correspond to the construction of compound sentences by use of the connectives "or," "and," and "not." Let A and B be events in the same sample space. Define the union of A and B (denoted by A ∪ B) as the set consisting of those points belonging to either A or B or both. (Unless otherwise specified, the word "or" will have, for us, the inclusive connotation. In other words, the statement "p or q" will always mean "p or q or both.") Define the intersection of A and B, written A ∩ B, as the set of points that belong to both A and B. Define the complement of A, written Ac, as the set of points which do not belong to A.
Example 1. Consider the experiment involving the toss of a single die, with N = the result; take a sample space with six points corresponding to N = 1, 2, 3, 4, 5, 6. For convenience, label the points of the sample space by the integers 1 through 6. Let

A = {N is even}  and  B = {N ≥ 3}

Then

A ∪ B = {N is even or N ≥ 3} = {2, 3, 4, 5, 6}
A ∩ B = {N is even and N ≥ 3} = {4, 6}
Ac = {N is not even} = {1, 3, 5}
Bc = {N is not ≥ 3} = {N < 3} = {1, 2}

FIGURE 1.2.1 Venn Diagrams (A ∪ B and A ∩ B).
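The die example can be checked mechanically by representing events as Python sets, where `|`, `&`, and set difference play the roles of union, intersection, and complement. This is my own illustration, not the book's notation.

```python
omega = {1, 2, 3, 4, 5, 6}            # sample space for one toss of a die
A = {n for n in omega if n % 2 == 0}  # {N is even}
B = {n for n in omega if n >= 3}      # {N >= 3}

print(sorted(A | B))      # union:            [2, 3, 4, 5, 6]
print(sorted(A & B))      # intersection:     [4, 6]
print(sorted(omega - A))  # complement of A:  [1, 3, 5]
print(sorted(omega - B))  # complement of B:  [1, 2]
```

The complement is taken relative to the sample space, which is why it is written here as a set difference from `omega`.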
Schematic representations (called Venn diagrams) of unions, intersections, and complements are shown in Figure 1.2.1.

Define the union of n events A1, A2, ..., An (notation: A1 ∪ ··· ∪ An, or ∪_{i=1}^n Ai) as the set consisting of those points which belong to at least one of the events A1, A2, ..., An. Similarly define the union of an infinite sequence of events A1, A2, ... as the set of points belonging to at least one of the events A1, A2, ... (notation: A1 ∪ A2 ∪ ···, or ∪_{i=1}^∞ Ai). Define the intersection of n events A1, ..., An as the set of points belonging to all of the events A1, ..., An (notation: A1 ∩ A2 ∩ ··· ∩ An, or ∩_{i=1}^n Ai). Similarly define the intersection of an infinite sequence of events as the set of points belonging to all the events in the sequence (notation: A1 ∩ A2 ∩ ···, or ∩_{i=1}^∞ Ai).

In the above example, with A = {N is even} = {2, 4, 6}, B = {N ≥ 3} = {3, 4, 5, 6}, C = {N = 1 or N = 5} = {1, 5}, we have

A ∪ B ∪ C = Ω,  A ∩ B ∩ C = ∅
A ∪ Bc ∪ C = {2, 4, 6} ∪ {1, 2} ∪ {1, 5} = {1, 2, 4, 5, 6}
(A ∪ C) ∩ [(A ∩ B)c] = {1, 2, 4, 5, 6} ∩ {4, 6}c = {1, 2, 5}
Two events in a sample space are said to be mutually exclusive or disjoint if A and B have no points in common, that is, if it is impossible that both A and B occur during the same performance of the experiment. In symbols, A and B are mutually exclusive if A ∩ B = ∅. In general the events A1, A2, ..., An are said to be mutually exclusive if no two of the events have a point in common; that is, no more than one of the events can occur during the same performance of the experiment. Symbolically, this condition may be written

Ai ∩ Aj = ∅  for i ≠ j

Similarly, infinitely many events A1, A2, ... are said to be mutually exclusive if Ai ∩ Aj = ∅ for i ≠ j.

FIGURE 1.2.2 A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

In some ways the algebra of events is similar to the algebra of real numbers, with union corresponding to addition and intersection to multiplication. For example, the commutative and associative properties hold.

A ∪ B = B ∪ A,  A ∪ (B ∪ C) = (A ∪ B) ∪ C
A ∩ B = B ∩ A,  A ∩ (B ∩ C) = (A ∩ B) ∩ C   (1.2.1)
Furthermore, we can prove that for events A, B, and C in the same sample space we have

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)   (1.2.2)

There are several ways to establish this; for example, we may verify that the sets on both the left and right sides of the equality above are represented by the same area in the Venn diagram of Figure 1.2.2.
Another approach is to use the definitions of union and intersection to show that the sets in question have precisely the same members; that is, we show that any point which belongs to the set on the left necessarily belongs to the set on the right, and conversely. To do this, we proceed as follows.

x ∈ A ∩ (B ∪ C) ⇒ x ∈ A and x ∈ B ∪ C
⇒ x ∈ A and (x ∈ B or x ∈ C)

(The symbol ⇒ means "implies," and ⇔ means "implies and is implied by.")

CASE 1. x ∈ B. Then x ∈ A and x ∈ B, so x ∈ A ∩ B, so x ∈ (A ∩ B) ∪ (A ∩ C).

CASE 2. x ∈ C. Then x ∈ A and x ∈ C, so x ∈ A ∩ C, so x ∈ (A ∩ B) ∪ (A ∩ C).

Thus x ∈ A ∩ (B ∪ C) ⇒ x ∈ (A ∩ B) ∪ (A ∩ C); that is, A ∩ (B ∪ C) ⊂ (A ∩ B) ∪ (A ∩ C). (The symbol ⊂ is read "is a subset of"; we say that A1 ⊂ A2 provided that x ∈ A1 ⇒ x ∈ A2; see Figure 1.2.3. Notice that, according to this definition, a set A is a subset of itself: A ⊂ A.)

Conversely, let x ∈ (A ∩ B) ∪ (A ∩ C). Then x ∈ A ∩ B or x ∈ A ∩ C.

CASE 1. x ∈ A ∩ B. Then x ∈ B, so x ∈ B ∪ C, so x ∈ A ∩ (B ∪ C).

CASE 2. x ∈ A ∩ C. Then x ∈ C, so x ∈ B ∪ C, so x ∈ A ∩ (B ∪ C).

Thus (A ∩ B) ∪ (A ∩ C) ⊂ A ∩ (B ∪ C); hence

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
FIGURE 1.2.3 A1 ⊂ A2.

As another example we show that

(A1 ∪ A2 ∪ ··· ∪ An)c = A1c ∩ A2c ∩ ··· ∩ Anc   (1.2.3)
The steps are as follows.

x ∈ (A1 ∪ ··· ∪ An)c ⇔ x ∉ A1 ∪ ··· ∪ An
⇔ it is not the case that x belongs to at least one of the Ai
⇔ x belongs to none of the Ai
⇔ x ∈ Aic for all i
⇔ x ∈ A1c ∩ ··· ∩ Anc
An identical argument shows that

(A1 ∪ A2 ∪ ···)c = A1c ∩ A2c ∩ ···   (1.2.4)

and similarly

(A1 ∩ A2 ∩ ··· ∩ An)c = A1c ∪ A2c ∪ ··· ∪ Anc   (1.2.5)

Also

(A1 ∩ A2 ∩ ···)c = A1c ∪ A2c ∪ ···   (1.2.6)

The identities (1.2.3)-(1.2.6) are called the DeMorgan laws. In many ways the algebra of events differs from the algebra of real numbers, as some of the identities below indicate.

A ∪ A = A        A ∩ A = A
A ∪ Ac = Ω       A ∩ Ac = ∅
A ∪ Ω = Ω        A ∩ Ω = A
A ∪ ∅ = A        A ∩ ∅ = ∅   (1.2.7)
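Identities such as the distributive law (1.2.2) and the DeMorgan laws hold for every choice of events, so they can be spot-checked over many randomly generated subsets of a sample space. The sketch below is my own illustration; the 10-point sample space and the event generator are arbitrary choices.

```python
import random

rng = random.Random(0)
omega = frozenset(range(10))        # an arbitrary 10-point sample space

def random_event():
    """Pick each point of omega independently with probability 1/2."""
    return frozenset(x for x in omega if rng.random() < 0.5)

def comp(S):
    """Complement relative to the sample space."""
    return omega - S

for _ in range(200):
    A, B, C = random_event(), random_event(), random_event()
    assert A & (B | C) == (A & B) | (A & C)                  # (1.2.2)
    assert (A | B) & (A | C) == A | (B & C)                  # (1.2.9)
    assert comp(A | B | C) == comp(A) & comp(B) & comp(C)    # (1.2.3)
    assert comp(A & B & C) == comp(A) | comp(B) | comp(C)    # (1.2.5)

print("all identities hold on 200 random triples of events")
```

Of course, passing such checks is not a proof; the element-chasing argument given above is what actually establishes the identities for all events.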
Another method of verifying relations among events involves algebraic manipulation, using the identities already derived. Four examples are given below; in working out the identities, it may be helpful to write A ∪ B as A + B and A ∩ B as AB.

1. A ∪ (A ∩ B) = A   (1.2.8)

PROOF. A + AB = AΩ + AB = A(Ω + B) = AΩ = A

2. (A ∪ B) ∩ (A ∪ C) = A ∪ (B ∩ C)   (1.2.9)
PROOF.

(A + B)(A + C) = (A + B)A + (A + B)C
= AA + AB + AC + BC   (note AB = BA)
= A(Ω + B + C) + BC
= AΩ + BC
= A + BC

3.   (1.2.10)

PROOF.
4. (A ∩ Bc) ∪ (A ∩ B) ∪ (Ac ∩ B) = A ∪ B   (1.2.11)

PROOF.

ABc + AB + AcB = ABc + AB + AB + AcB   [see (1.2.7)]
= A(Bc + B) + (A + Ac)B
= AΩ + ΩB
= A + B

(see Figure 1.2.4). As another example, let Ω be the set of nonnegative real numbers. Let

An = {x ∈ Ω: 0 ≤ x ≤ 1 − 1/n},  n = 1, 2, ...
(This will be another common way of describing an event. It is to be read: "An is the set consisting of those points x in Ω such that 0 ≤ x ≤ 1 − 1/n." If there is no confusion about what space Ω we are considering, we shall simply write An = {x: 0 ≤ x ≤ 1 − 1/n}.) Then

∪_{n=1}^∞ An = [0, 1) = {x: 0 ≤ x < 1}

FIGURE 1.2.4 Venn Diagram Illustrating (A ∩ Bc) ∪ (A ∩ B) ∪ (Ac ∩ B) = A ∪ B.

As an illustration of the DeMorgan laws,

(∪_{n=1}^∞ An)c = [0, 1)c = [1, ∞) = {x: x ≥ 1}
∩_{n=1}^∞ Anc = ∩_{n=1}^∞ (1 − 1/n, ∞) = [1, ∞)

(Notice that x > 1 − 1/n for all n = 1, 2, ... ⇔ x ≥ 1.) Also

(∩_{n=1}^∞ An)c = {0}c = (0, ∞) = {x: x > 0}
∪_{n=1}^∞ Anc = ∪_{n=1}^∞ (1 − 1/n, ∞) = (0, ∞)
PROBLEMS

1. An experiment involves choosing an integer N between 0 and 9 (the sample space consists of the integers from 0 to 9, inclusive). Let A = {N ≤ 5}, B = {3 ≤ N ≤ 7}, C = {N is even and N > 0}. List the points that belong to the following events.

A ∩ B ∩ C,  A ∪ (B ∩ Cc),  (A ∪ B) ∩ Cc,  (A ∩ B) ∩ [(A ∪ C)c]

2. Let A, B, and C be arbitrary events in the same sample space. Let D1 be the event that at least two of the events A, B, C occur; that is, D1 is the set of points common to at least two of the sets A, B, C. Let

D2 = {exactly two of the events A, B, C occur}
D3 = {at least one of the events A, B, C occur}
D4 = {exactly one of the events A, B, C occur}
D5 = {not more than two of the events A, B, C occur}

Each of the events D1 through D5 can be expressed in terms of A, B, and C by using unions, intersections, and complements. For example, D3 = A ∪ B ∪ C. Find suitable expressions for D1, D2, D4, and D5.

3. A public opinion poll (circa 1850) consisted of the following three questions:
(a) Are you a registered Whig?
(b) Do you approve of President Fillmore's performance in office?
(c) Do you favor the Electoral College system?
A group of 1000 people is polled. Assume that the answer to each question must be either "yes" or "no." It is found that:
550 people answer "yes" to the third question and 450 answer "no."
325 people answer "yes" exactly twice; that is, their responses contain two "yeses" and one "no."
100 people answer "yes" to all three questions.
125 registered Whigs approve of Fillmore's performance.
How many of those who favor the Electoral College system do not approve of Fillmore's performance, and in addition are not registered Whigs? HINT: Draw a Venn diagram.

4. If A and B are events in a sample space, define A − B as the set of points which belong to A but not to B; that is, A − B = A ∩ Bc. Establish the following.
(a) A ∩ (B − C) = (A ∩ B) − (A ∩ C)
(b) A − (B ∪ C) = (A − B) − C
Is it true that (A − B) ∪ C = (A ∪ C) − B?

5. Let Ω be the reals. Establish the following.

(a, b) = ∪_{n=1}^∞ (a, b − 1/n] = ∪_{n=1}^∞ [a + 1/n, b)
[a, b] = ∩_{n=1}^∞ [a, b + 1/n) = ∩_{n=1}^∞ (a − 1/n, b]

6. If A and B are disjoint events, are Ac and Bc disjoint? Are A ∩ C and B ∩ C disjoint? What about A ∪ C and B ∪ C?

7. If An ⊂ An−1 ⊂ ··· ⊂ A1, show that ∩_{i=1}^n Ai = An and ∪_{i=1}^n Ai = A1.

8. Suppose that A1, A2, ... is a sequence of subsets of Ω, and we know that for each n, ∩_{i=1}^n Ai is not empty. Is it true that ∩_{i=1}^∞ Ai is not empty? (A related question about real numbers: if, for each n, we have Σ_{i=1}^n ai ≤ b, is it true that Σ_{i=1}^∞ ai ≤ b?)

9. If A, B1, B2, ... are arbitrary events, show that

A ∩ (B1 ∪ B2 ∪ ···) = (A ∩ B1) ∪ (A ∩ B2) ∪ ···

This is the distributive law with infinitely many factors.

1.3 PROBABILITY
We now consider the assignment of probabilities to events. A technical complication arises here. It may not always be possible to regard all subsets of Ω as events. We may discard or fail to measure some of the information in the outcome corresponding to the point ω ∈ Ω, so that for a given subset A of Ω it may not be possible to give a yes or no answer to the question "Is ω ∈ A?" For example, if the experiment involves tossing a coin five times, we may record the results of only the first three tosses, so that A = {at least four heads} will not be "measurable"; that is, membership of ω in A cannot be determined from the given information about ω.

In a given problem there will be a particular class of subsets of Ω called the "class of events." For reasons of mathematical consistency, we require that the event class ℱ form a sigma field, which is a collection of subsets of Ω satisfying the following three requirements.

Ω ∈ ℱ   (1.3.1)

A1, A2, ... ∈ ℱ implies ∪_{n=1}^∞ An ∈ ℱ   (1.3.2)

That is, ℱ is closed under finite or countable union.

A ∈ ℱ implies Ac ∈ ℱ   (1.3.3)

That is, ℱ is closed under complementation.

Notice that if A1, A2, ... ∈ ℱ, then A1c, A2c, ... ∈ ℱ by (1.3.3); hence ∪_{n=1}^∞ Anc ∈ ℱ by (1.3.2). By the DeMorgan laws, ∩_{n=1}^∞ An = (∪_{n=1}^∞ Anc)c; hence, by (1.3.3), ∩_{n=1}^∞ An ∈ ℱ. Thus ℱ is closed under finite or countable intersection. Also, by (1.3.1) and (1.3.3), the empty set ∅ belongs to ℱ. Thus, for example, if the question "Did An occur?" has a definite answer for n = 1, 2, ..., so do the questions "Did at least one of the An occur?" and "Did all the An occur?" Note also that if we apply the algebraic operations of Section 1.2 to sets in ℱ, the new sets we obtain still belong to ℱ.

In many cases we shall be able to take ℱ = the collection of all subsets of Ω, so that every subset of Ω is an event. Problems in which ℱ cannot be chosen in this way generally arise in uncountably infinite sample spaces; for example, Ω = the reals. We shall return to this subject in Chapter 2.

We are now ready to talk about the assignment of probabilities to events. If A ∈ ℱ, the probability P(A) should somehow reflect the long-run relative frequency of A in a large number of independent repetitions of the experiment. Thus P(A) should be a number between 0 and 1, and P(Ω) should be 1. Now if A and B are disjoint events, the number of occurrences of A ∪ B in n performances of the experiment is obtained by adding the number of occurrences of A to the number of occurrences of B. Thus we should have
P(A ∪ B) = P(A) + P(B)   if A and B are disjoint

and, similarly,

P(A1 ∪ ··· ∪ An) = Σ_{i=1}^n P(Ai)   if A1, ..., An are disjoint

For mathematical convenience we require that

P(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An)

when we have a countably infinite family of disjoint events A1, A2, ....
The assumption of countable rather than simply finite additivity has not been convincingly justified physically or philosophically; however, it leads to a much richer mathematical theory. A function that assigns a number P(A) to each set A in the sigma field ℱ is called a probability measure on ℱ, provided that the following conditions are satisfied.

P(A) ≥ 0 for every A ∈ ℱ   (1.3.4)

P(Ω) = 1   (1.3.5)

If A1, A2, ... are disjoint sets in ℱ, then

P(A1 ∪ A2 ∪ ···) = P(A1) + P(A2) + ···   (1.3.6)

We may now give the underlying mathematical framework for probability theory.
DEFINITION. A probability space is a triple (Ω, ℱ, P), where Ω is a set, ℱ a sigma field of subsets of Ω, and P a probability measure on ℱ.
We shall not, at this point, embark on a general study of probability measures. However, we shall establish four facts from the definition. (All sets in the arguments to follow are assumed to belong to ℱ.)

1. P(∅) = 0   (1.3.7)

PROOF. A ∪ ∅ = A; hence P(A ∪ ∅) = P(A). But A and ∅ are disjoint, and so P(A ∪ ∅) = P(A) + P(∅). Thus P(A) = P(A) + P(∅); consequently P(∅) = 0.
2. P(A ∪ B) = P(A) + P(B) − P(A ∩ B)   (1.3.8)

PROOF. A = (A ∩ B) ∪ (A ∩ Bc), and these sets are disjoint (see Figure 1.2.4). Thus P(A) = P(A ∩ B) + P(A ∩ Bc). Similarly P(B) = P(A ∩ B) + P(Ac ∩ B). Thus P(A) + P(B) − P(A ∩ B) = P(A ∩ B) + P(A ∩ Bc) + P(Ac ∩ B) = P(A ∪ B). Intuitively, if we add the outcomes in A to those in B, we have counted those in A ∩ B twice; subtracting the outcomes in A ∩ B yields the outcomes in A ∪ B.
3. If B ⊂ A, then P(B) ≤ P(A); in fact,

P(A − B) = P(A) − P(B)   (1.3.9)

where A − B is the set of points that belong to A but not to B.

PROOF. P(A) = P(B) + P(A − B), since B ⊂ A (see Figure 1.3.1), and the result follows because P(A − B) ≥ 0. Intuitively, if the occurrence of B always implies the occurrence of A, A must occur at least as often as B in any sequence of performances of the experiment.

FIGURE 1.3.1 A − B.

4. P(A1 ∪ A2 ∪ ···) ≤ P(A1) + P(A2) + ···   (1.3.10)

That is, the probability that at least one of a finite or countably infinite collection of events will occur is less than or equal to the sum of the probabilities; note that, for the case of two events, this follows from P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B).

PROOF. We make use of the fact that any union may be written as a disjoint union, as follows.
A1 ∪ A2 ∪ ··· = A1 ∪ (A1c ∩ A2) ∪ (A1c ∩ A2c ∩ A3) ∪ ··· ∪ (A1c ∩ A2c ∩ ··· ∩ An−1c ∩ An) ∪ ···   (1.3.11)

To see this, observe that if x belongs to the set on the right then x ∈ A1c ∩ ··· ∩ An−1c ∩ An for some n; hence x ∈ An. Thus x belongs to the set on the left. Conversely, if x belongs to the set on the left, then x ∈ An for some n. Let n0 be the smallest such n. Then x ∈ A1c ∩ ··· ∩ An0−1c ∩ An0, and so x belongs to the set on the right. Thus

P(A1 ∪ A2 ∪ ···) = Σ_{n=1}^∞ P(A1c ∩ ··· ∩ An−1c ∩ An) ≤ Σ_{n=1}^∞ P(An)

using (1.3.9); notice that A1c ∩ ··· ∩ An−1c ∩ An ⊂ An.
REMARKS. The basic difficulty with the classical and frequency definitions of probability is that their approach is to try somehow to prove mathematically that, for example, the probability of picking a heart from a perfectly shuffled deck is 1/4, or that the probability of an unbiased coin coming up heads is 1/2. This cannot be done. All we can say is that if a card is picked at random and then replaced, and the
process is repeated over and over again, the result that the ratio of hearts to total number of drawings will be close to 1/4 is in accord with our intuition and our physical experience. For this reason we should assign a probability 1/4 to the event of obtaining a heart, and similarly we should assign a probability 1/52 to each possible outcome of the experiment. The only reason for doing this is that the consequences agree with our experience. If you decide that some mysterious factor caused the ace of spades to be more likely than any other card, you could incorporate this factor by assigning a higher probability to the ace of spades. The mathematical development of the theory would not be affected; however, the conclusions you might draw from this assumption would be at variance with experimental results.

One can never really use mathematics to prove a specific physical fact. For example, we cannot prove mathematically that there is a physical quantity called "force." What we can do is postulate a mathematical entity called "force" that satisfies a certain differential equation. We can build up a collection of mathematical results that, when interpreted properly, provide a reasonable description of certain physical phenomena (reasonable until another mathematical theory is constructed that provides a better description). Similarly, in probability theory we are faced with situations in which our intuition or some physical experiments we have carried out suggest certain results. Intuition and experience lead us to an assignment of probabilities to events. As far as the mathematics is concerned, any assignment of probabilities will do, subject to the rules of mathematical consistency.
However, our hope is to develop mathematical results that, when interpreted and related to physical experience, will help to make precise such notions as "the ratio of the number of heads to the total number of observations in a very large number of independent tosses of an unbiased coin is very likely to be very close to 1/2." We emphasize that the insights gained by the early workers in probability are not to be discarded, but instead cast in a more precise form.
PROBLEMS

1. Write down some examples of sigma fields other than the collection of all subsets of a given set Ω.
2. Give an example to show that P(A − B) need not equal P(A) − P(B) if B is not a subset of A.
1.4 COMBINATORIAL PROBLEMS
We consider a class of problems in which the assignment of probabilities can be made in a natural way. Let Ω be a finite or countably infinite set, and let ℱ consist of all subsets of Ω. For each point ωi ∈ Ω, i = 1, 2, ..., assign a nonnegative number pi, with Σi pi = 1. If A is any subset of Ω, let P(A) = Σ_{ωi ∈ A} pi. Then it may be verified that P is a probability measure; P{ωi} = pi, and the probability of any event A is found by adding the probabilities of the points of A. An (Ω, ℱ, P) of this type is called a discrete probability space.

Example 1. Throw a (biased) coin twice (see Figure 1.4.1). Let A1 = {HH}, A2 = {HT}, A3 = {TH}, A4 = {TT}, and assign

P(A1) = .36,  P(A2) = P(A3) = .24,  P(A4) = .16

Let E1 = {at least one head}. Then E1 = A1 ∪ A2 ∪ A3; hence

P(E1) = P(A1) + P(A2) + P(A3) = .36 + .24 + .24 = .84

Let E2 = {tail on first toss}; then E2 = A3 ∪ A4 and P(E2) = P(A3) + P(A4) = .24 + .16 = .40.

FIGURE 1.4.1 Coin-Tossing Problem.
In the special case when Ω = {ω1, ..., ωn} and pi = 1/n, i = 1, 2, ..., n, we have

P(A) = (number of points of A)/(total number of points in Ω) = (favorable outcomes)/(total outcomes)

corresponding to the classical definition of probability. Thus, in this case, finding P(A) simply involves counting the number of outcomes favorable to A. When Ω is large, counting by hand may not be feasible; combinatorial analysis is simply a method of counting that can often be used to avoid writing down the entire list of favorable outcomes.

There is only one basic idea in combinatorial analysis, and that is the following. Suppose that a symbol is selected from the set {a1, ..., an}; if ai is chosen, a symbol is selected from the set {b_{i1}, ..., b_{im}}. Each pair of selections (ai, b_{ij}) is assumed to determine a "result" f(i, j). If all results are distinct, the number of possible results is nm, since there is a one-to-one correspondence between results and pairs of integers (i, j), i = 1, ..., n, j = 1, ..., m. If, after the symbol b_{ij} is chosen, a symbol is selected from the set {c_{ij1}, c_{ij2}, ..., c_{ijp}}, and each triple (ai, b_{ij}, c_{ijk}) determines a distinct result f(i, j, k), the number of possible results is nmp. Analogous statements may be made for any finite sequence of selections.

Certain standard selections occur frequently, and it is convenient to classify them. Let a1, ..., an be distinct symbols.

Ordered Samples of Size r, with Replacement

The number of ordered sequences (a_{i1}, ..., a_{ir}), where the a_{ik} belong to {a1, ..., an}, is n × n × ··· × n (r times), or

n^r   (1.4.1)

(The term "with replacement" refers to the fact that if the symbol a_{ik} is selected at step k it may be selected again at any future time.) For example, the number of possible outcomes if three dice are thrown is 6 × 6 × 6 = 216.

Ordered Samples of Size r, without Replacement

The number of ordered sequences (a_{i1}, ..., a_{ir}), where the a_{ik} belong to {a1, ..., an}, but repetition is not allowed (i.e., no ai can appear more than once in the sequence), is

n(n − 1) ··· (n − r + 1) = n!/(n − r)!,   r = 1, 2, ..., n   (1.4.2)

(The first symbol may be chosen in n ways, and the second in n − 1 ways, since the first symbol may not be used again, and so on.) The above number is sometimes called the number of permutations of r objects out of n, written (n)_r.

For example, the number of 3-digit numbers that can be formed from 1, 2, ..., 9, if no digit can be repeated, is 9(8)(7) = 504.
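These two counting formulas are easy to sanity-check by machine; a minimal Python sketch (the function names are ours, not the book's):

```python
from math import factorial

def ordered_with_replacement(n, r):
    # (1.4.1): n^r ordered sequences of size r when repetition is allowed
    return n ** r

def ordered_without_replacement(n, r):
    # (1.4.2): (n)_r = n(n-1)...(n-r+1) = n!/(n-r)!
    return factorial(n) // factorial(n - r)

# The book's examples: three dice thrown, and 3-digit numbers
# from 1..9 with no repeated digit
print(ordered_with_replacement(6, 3))     # 216
print(ordered_without_replacement(9, 3))  # 504
```

Python 3.8 and later also provide `math.perm` and `math.comb`, which compute (n)_r and C(n, r) directly.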
Unordered Samples of Size r, without Replacement

The number of unordered sets {a_{i1}, ..., a_{ir}}, where the a_{ik}, k = 1, ..., r, are distinct elements of {a1, ..., an} (i.e., the number of ways of selecting r distinct objects out of n), if order does not count, is

C(n, r) = n!/(r!(n − r)!)   (1.4.3)

To see this, consider the following process.
(a) Select r distinct objects out of n without regard to order; this can be done in C(n, r) ways, where C(n, r) is to be determined.
(b) For each set selected in (a), say {a_{i1}, ..., a_{ir}}, select an ordering of a_{i1}, ..., a_{ir}. This can be done in (r)_r = r! ways (see Figure 1.4.2 for n = 3, r = 2).

The result of performing (a) and (b) is a permutation of r objects out of n; hence

C(n, r) r! = (n)_r

or

C(n, r) = (n)_r/r! = n!/(r!(n − r)!),   r = 1, 2, ..., n

We define C(n, 0) to be n!/(0! n!) = 1, to make the formula for C(n, r) valid for r = 0, 1, ..., n. Notice that C(n, k) = C(n, n − k). C(n, r) is sometimes called the number of combinations of r objects out of n.

FIGURE 1.4.2 Determination of C(n, r) for n = 3, r = 2: each unordered pair (12, 13, 23) yields two orderings (12, 21; 13, 31; 23, 32).
Unordered Samples of Size r, with Replacement

We wish to find the number of unordered sets {a_{i1}, ..., a_{ir}}, where the a_{ik} belong to {a1, ..., an} and repetition is allowed. Such a sample is determined by its occupancy numbers r1, ..., rn, where ri is the number of times ai appears in the sample; the ri are nonnegative integers with r1 + ··· + rn = r. Represent each sample by a row of stars separated by bars, the stars between successive bars recording the occurrences of each symbol. For example, with n = 3, r = 4,

|*||***|  corresponds to  r1 = 1, r2 = 0, r3 = 3  (sample a1a3a3a3)
|*|*|**|  corresponds to  r1 = 1, r2 = 1, r3 = 2  (sample a1a2a3a3)

FIGURE 1.4.3 Counting Unordered Samples with Replacement.

See Figure 1.4.3 for n = 3, r = 4 (the thicker bars at the sides are fixed). Each arrangement corresponds to a solution of r1 + ··· + rn = r in nonnegative integers. The number of arrangements is the number of ways of selecting r positions out of n + r − 1 for the stars to occur (or n − 1 positions for the bars); that is, C(n + r − 1, r). For n = 3, r = 4, there are 15 solutions (r1, r2, r3):

(4, 0, 0), (0, 4, 0), (0, 0, 4), (3, 1, 0), (3, 0, 1), (1, 3, 0), (0, 3, 1), (1, 0, 3), (0, 1, 3), (2, 2, 0), (2, 0, 2), (0, 2, 2), (2, 1, 1), (1, 2, 1), (1, 1, 2)

corresponding to the samples a1a1a1a1, a2a2a2a2, ..., a1a2a3a3.
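The stars-and-bars count, and the table of 15 solutions for n = 3, r = 4, can be verified by brute force; a sketch:

```python
from itertools import combinations_with_replacement
from math import comb

n, r = 3, 4
# All unordered samples of size r, with replacement, from n symbols
samples = list(combinations_with_replacement(range(1, n + 1), r))
print(len(samples))        # 15
print(comb(n + r - 1, r))  # C(6, 4) = 15, agreeing with the formula

# Each sample corresponds to one solution (r1, r2, r3) of r1 + r2 + r3 = 4
solutions = {tuple(s.count(i) for i in range(1, n + 1)) for s in samples}
print(len(solutions))      # 15 distinct occupancy vectors
```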
Example 2. Find the probability of obtaining four of a kind in an ordinary five-card poker hand. There are C(52, 5) distinct poker hands (without regard to order), and so we may take Ω to have C(52, 5) points. To obtain the number of hands in which there are four of a kind:
(a) Choose the face value to appear four times (13 choices: A, K, Q, ..., 2).
(b) Choose the fifth card (48 ways).
Thus p = (13)(48)/C(52, 5). Figure 1.4.4 indicates the selection process.

NOTE. The problem may also be done using ordered samples. The number of ordered poker hands is (52)(51)(50)(49)(48) = (52)_5 (the drawing is without replacement). The number of ordered poker hands having four of a kind is (13)(48)(5!), so that p = (13)(48)(5!)/(52)_5 = (13)(48)/C(52, 5) as before. Here we may take the space Ω′ to have (52)_5 points; each point of Ω corresponds to 5! points of Ω′.
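Both counts in Example 2 (unordered and ordered) can be reproduced numerically and checked against each other; a sketch:

```python
from math import comb, factorial

# Unordered: choose the face value (13 ways), then the fifth card (48 ways)
p_unordered = 13 * 48 / comb(52, 5)

# Ordered: (52)_5 ordered hands; 13 * 48 * 5! of them show four of a kind
perm_52_5 = factorial(52) // factorial(47)
p_ordered = 13 * 48 * factorial(5) / perm_52_5

print(p_unordered)  # about 0.00024
```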
Example 3.
Three balls are dropped into three boxes. Find the probability that exactly one box will be empty.
FIGURE 1.4.4 Counting Process for Selecting Four of a Kind: (a) choose the face value (Ace, King, ..., 2); (b) choose the fifth card (e.g., King of Spades). There is a one-to-one correspondence between paths in the diagram and favorable outcomes.
In problems of dropping r balls into n boxes, we may regard the boxes as (distinct) symbols a1, ..., an; each toss of a ball corresponds to the selection of a box. Thus the sequence a2a1a1a3 corresponds to the first ball into box 2, the second and third into box 1, and the fourth into box 3. In general, an arrangement of r balls in n boxes corresponds to a sample of size r from the symbols a1, ..., an. If we require that the sampling be with replacement, this means that a given box can contain any number of balls. Sampling without replacement means that a given box cannot contain more than one ball. If we consider ordered samples we are saying that the balls are distinguishable. For example, a3a7 (ball 1 into box 3, ball 2 into box 7) is different from a7a3 (ball 1 into box 7, ball 2 into box 3); in other words, we may regard the balls as being numbered 1, 2, ..., r. Unordered sampling corresponds to indistinguishable balls.

If there is no restriction on the number of balls in a given box, the total number of arrangements, taking into account the order in which the balls are tossed (i.e., regarding the balls as distinct), is the number of ordered samples of size r (from {a1, ..., an}) with replacement, or n^r. If the boxes are energy levels in physics and the balls are particles, the Maxwell-Boltzmann assumption is that all n^r arrangements are equally likely. If there can be at most one ball in a given box, the number of (ordered) arrangements is (n)_r. If the order in which the balls are tossed is neglected, we are simply choosing r boxes out of n to be occupied; the Fermi-Dirac assumption takes the C(n, r) possible selections of boxes (or energy levels) as equally likely.
We might also mention the Bose-Einstein assumption; here a box may contain an unlimited number of balls, but the balls are indistinguishable; that is, the order in which the balls are tossed is neglected, so that, for example, a2a1a1a3 is identified with a1a3a1a2. Thus the number of arrangements counted in this scheme is the number of unordered samples of size r with replacement, or C(n + r − 1, r). The Bose-Einstein assumption takes all these arrangements as equally likely.

To return to the original problem, we have boxes a1, a2, and a3 and sequences of length 3 (three balls are tossed). We take all 3^3 = 27 ordered samples with replacement as equally likely. (We shall see that this model, ordered sampling with replacement, corresponds to the tossing of the balls independently; this idea will be developed in the next section.) Now

P{exactly 1 box empty} = P{box 1 empty, boxes 2 and 3 occupied} + P{box 2 empty, boxes 1 and 3 occupied} + P{box 3 empty, boxes 1 and 2 occupied}

Furthermore,

P{box 1 empty, boxes 2 and 3 occupied} = P{a1 does not occur in the sequence a_{i1}a_{i2}a_{i3}, but a2 and a3 both occur}

If a1 does not occur, either a2 or a3 must occur twice, and the other symbol once. We may choose the symbol that is to occur twice in two ways; the symbol that occurs once is then determined. If, say, a3 occurs twice and a2 once, the position of a2 may be any of three possibilities; the position of the two a3's is then determined. Thus the probability that box 1 will be empty and boxes 2 and 3 occupied is 2(3)/27 = 6/27 (in fact the six favorable outcomes are a2a2a3, a2a3a2, a3a2a2, a3a3a2, a3a2a3, and a2a3a3). Thus the probability that exactly one box will be empty is, by symmetry, 3(6)/27 = 2/3.
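Example 3's answer can be confirmed by listing all 27 equally likely ordered outcomes; a sketch:

```python
from itertools import product

# Each outcome assigns one of 3 boxes to each of 3 distinguishable balls
outcomes = list(product(range(1, 4), repeat=3))
# Exactly one box empty <=> exactly two distinct boxes are used
favorable = [w for w in outcomes if len(set(w)) == 2]
print(len(favorable), len(outcomes))   # 18 27
print(len(favorable) / len(outcomes))  # 0.666..., i.e. 2/3
```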
Example 4. In a 13-card bridge hand the probability that the hand will contain the A K Q J 10 of spades is C(47, 8)/C(52, 13). (The A K Q J 10 of spades must be chosen, and afterward eight cards must be selected out of the 47 that remain after the five top spades have been removed.) Now let us find the probability of obtaining the A K Q J 10 of at least one suit. Thus, if A_S is the event that the A K Q J 10 of spades is obtained, and similarly for A_H, A_D, and A_C (hearts, diamonds, and clubs), we are looking for

P(A_S ∪ A_H ∪ A_D ∪ A_C)

The sets are not disjoint, so that we cannot simply add probabilities. It is possible to obtain, for example, the A K Q J 10 of both spades and hearts in
a 13-card hand, and this probability is easy to compute:

P(A_S ∩ A_H) = C(42, 3)/C(52, 13)

(the 10 specified cards must appear, and the remaining three cards are selected from the 42 that remain).
What we need here is a way of expressing P(A_S ∪ A_H ∪ A_D ∪ A_C) in terms of the individual terms P(A_S), etc., and the intersections P(A_S ∩ A_H), etc. We know that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

If we have three events, then

P(A ∪ B ∪ C) = P(A ∪ (B ∪ C))
= P(A) + P(B ∪ C) − P(A ∩ (B ∪ C))
= P(A) + P(B ∪ C) − P((A ∩ B) ∪ (A ∩ C))
= P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)

The general pattern is now clear and may be verified by induction:

P(A1 ∪ ··· ∪ An) = Σi P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − ··· + (−1)^(n−1) P(A1 ∩ ··· ∩ An)   (1.4.5)

In the present problem the intersections taken three or four at a time are empty (a 13-card hand cannot contain the top five cards of three suits); hence, by symmetry,

P(A_S ∪ A_H ∪ A_D ∪ A_C) = 4 P(A_S) − 6 P(A_S ∩ A_H) = [4 C(47, 8) − 6 C(42, 3)]/C(52, 13)
It is illuminating to consider an incorrect approach to this problem. Suppose that we first pick a suit (four choices); we then select the A K Q J 10 of that suit. The remaining eight cards can be anything (if they include the A K Q J 10 of another suit, the condition that the A K Q J 10 of at least one suit be obtained will still be satisfied). Thus we have C(47, 8) choices, so that the desired probability is 4 C(47, 8)/C(52, 13). The above procedure illustrates multiple counting, the nemesis of the combinatorial analyst (see Figure 1.4.5).
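The inclusion-exclusion value, and the size of the over-count in the incorrect approach, can be computed directly; a sketch:

```python
from math import comb

# Inclusion-exclusion: 4 P(A_S) - 6 P(A_S ∩ A_H); higher intersections are empty
numerator = 4 * comb(47, 8) - 6 * comb(42, 3)
p = numerator / comb(52, 13)
print(p)  # about 0.00198

# The incorrect count 4*C(47, 8) is too large by the 6*C(42, 3) duplications
print(4 * comb(47, 8) - numerator)  # 68880
```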
FIGURE 1.4.5 Multiple Counting. Two different paths (one beginning with diamonds, one with clubs) lead to the same hand: A K Q J 10 of spades, A K Q J 10 of clubs, 2 of diamonds, 2 of spades, 2 of clubs.
In writing p = 4 C(47, 8)/C(52, 13) we are saying that there is a one-to-one correspondence between paths in Figure 1.4.5 and favorable outcomes. But this is not the case, since the two paths indicated in the diagram both lead to the same result, namely, A K Q J 10 of spades, A K Q J 10 of clubs, 2 of diamonds, 2 of spades, 2 of clubs. In fact there are 6 C(42, 3) such duplications. For we can pick the two suits in C(4, 2) = 6 ways; then, after taking the A K Q J 10 of each suit, we select the remaining three cards in C(42, 3) ways. If we subtract the number of duplications, 6 C(42, 3), from the original count, 4 C(47, 8), we obtain the correct result. To rephrase: the counting process we have proposed counts the number of paths in the above diagram, that is, the number of choices at Step 1 times the number of choices at Step 2. However, the paths do not in general lead to distinct "results," namely, distinct bridge hands.

PROBLEMS
1. If a 3-digit number (000 to 999) is chosen at random, find the probability that exactly 1 digit will be > 5.
2. Find the probability that a five-card poker hand will be:
(a) A straight (five cards in sequence regardless of suit; ace may be high but not low).
(b) Three of a kind (three cards of the same face value x, plus two cards with face values y and z, with x, y, z distinct).
(c) Two pairs (two cards of face value x, two of face value y, and one of face value z, with x, y, z distinct).
3. An urn contains 3 red, 8 yellow, and 13 green balls; another urn contains 5 red, 7 yellow, and 6 green balls. One ball is selected from each urn. Find the probability that both balls will be of the same color.
4. An experiment consists of drawing 10 cards from an ordinary 52-card pack.
(a) If the drawing is done with replacement, find the probability that no two cards will have the same face value.
(b) If the drawing is done without replacement, find the probability that at least 9 cards will be of the same suit.
5. An urn contains 10 balls numbered from 1 to 10. Five balls are drawn without replacement. Find the probability that the second largest of the five numbers drawn will be 8.
6. m men and w women seat themselves at random in m + w seats arranged in a row. Find the probability that all the women will be adjacent.
7. If a box contains 75 good light bulbs and 25 defective bulbs and 15 bulbs are removed, find the probability that at least one will be defective.
8. Eight cards are drawn without replacement from an ordinary deck. Find the probability of obtaining exactly three aces or exactly three kings (or both).
9. (The game of rencontre.) An urn contains n tickets numbered 1, 2, ..., n. The tickets are shuffled thoroughly and then drawn one by one without replacement. If the ticket numbered r appears in the rth drawing, this is denoted as a match (French: rencontre). Show that the probability of at least one match is
1 − 1/2! + 1/3! − ··· + (−1)^(n−1)/n! → 1 − e^(−1)   as n → ∞

10. A "language" consists of three "words," W1 = a, W2 = ba, W3 = bb. Let N(k) be the number of "sentences" using exactly k letters (e.g., N(1) = 1 (i.e., a), N(2) = 3 (aa, ba, bb), N(3) = 5 (aaa, aba, abb, baa, bba); no space is allowed between words).
(a) Show that N(k) = N(k − 1) + 2N(k − 2), k = 2, 3, ... (define N(0) = 1).
(b) Show that the general solution to the second-order homogeneous linear difference equation of part (a) [with N(0) and N(1) specified] is N(k) = A2^k + B(−1)^k, where A and B are determined by N(0) and N(1). Evaluate A and B in the present case.
11. (The birthday problem.) Assume that a person's birthday is equally likely to fall on any of the 365 days in a year (neglect leap years). If r people are selected, find the probability that all r birthdays will be different. Equivalently, if r balls are dropped into 365 boxes, we are looking for the probability that no box will contain more than one ball. It turns out that the probability is less than 1/2
for r > 23, so that in a class of 23 or more students the odds are that two or more people will have the same birthday.
12. Fourteen balls are dropped into six boxes. Find the number of arrangements (ordered samples of size 14 with replacement, from six symbols) whose occupancy numbers coincide with 4, 4, 2, 2, 2, 0 in some order (i.e., boxes i1 and i2 contain four balls, boxes i3, i4, and i5 two balls, and box i6 no balls, for some i1, ..., i6).
13. (a) Let Ω be a set with n elements. Show that there are 2^n subsets of Ω. For example, if Ω = {1, 2, 3}, the subsets are ∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, and {1, 2, 3} = Ω; 2^3 = 8 altogether.
(b) How many ways are there of selecting ordered pairs (A, B) of subsets of Ω such that A ⊂ B? For example, A = {1}, B = {1, 3} gives such a pair, but A = {1, 2}, B = {1, 3} does not.
14. Let Ω be a finite set. A partition of Ω is an (unordered) set {A1, ..., An}, where the Ai are nonempty subsets whose union is Ω. For example, if Ω = {1, 2, 3}, there are five partitions:
A1 = {1, 2, 3}
A1 = {1, 2}, A2 = {3}
A1 = {1, 3}, A2 = {2}
A1 = {1}, A2 = {2, 3}
A1 = {1}, A2 = {2}, A3 = {3}
Let g(n) be the number of partitions of a set with n elements.
(a) Show that g(n) = Σ_{k=0}^{n−1} C(n − 1, k) g(k) [define g(0) = 1].
(b) Show that g(n) = e^(−1) Σ_{k=0}^{∞} k^n/k!  HINT: Show that the series satisfies the difference equation of part (a).

1.5 INDEPENDENCE
Consider the following experiment. A person is selected at random and his height is recorded. After this the last digit of the license number of the next car to pass is noted. If A is the event that the height is over 6 feet, and B is the event that the digit is > 7, then, intuitively, A and B are "independent" in the sense that knowledge about the occurrence or nonoccurrence of one of the events should not influence the odds about the other. For example, say that P(A) = .2, P(B) = .3. In a long sequence of trials we would expect the following situation.

(Roughly) 20% of the time A occurs; of those cases in which A occurs:
30% B occurs
70% B does not occur

80% of the time A does not occur; of these cases:
30% B occurs
70% B does not occur
Thus, if B is independent of A, it appears that P(A ∩ B) should be .2(.3) = .06 = P(A)P(B), and P(Ac ∩ B) should be .8(.3) = .24 = P(Ac)P(B). Conversely, if P(A ∩ B) = P(A)P(B) = .06 and P(Ac ∩ B) = P(Ac)P(B) = .24, then, if A occurs roughly 20% of the time and we look at only the cases in which A occurs, B must occur in roughly 30% of these cases in order to have A ∩ B occur 6% of the time. Similarly, if we look at the cases in which A does not occur (80%), then, since we are assuming that Ac ∩ B occurs 24% of the time, we must have B occurring in 30% of these cases. Thus the odds about B are not changed by specifying the occurrence or nonoccurrence of A.

It appears that we should say that event B is independent of A iff P(A ∩ B) = P(A)P(B) and P(Ac ∩ B) = P(Ac)P(B). However, the second condition is already implied by the first. If P(A ∩ B) = P(A)P(B), then

P(Ac ∩ B) = P(B − A) = P(B − (A ∩ B)) = P(B) − P(A ∩ B)

since A ∩ B is a subset of B; hence

P(Ac ∩ B) = P(B) − P(A)P(B) = (1 − P(A))P(B) = P(Ac)P(B)

Thus B is independent of A, that is, knowledge of A does not influence the odds about B, iff P(A)P(B) = P(A ∩ B). But this condition is perfectly symmetrical; in other words, B is independent of A iff A is independent of B. Thus we are led to the following definition.

DEFINITION. Two events A and B are independent iff P(A ∩ B) = P(A)P(B).
If we have three events A, B, C that are (intuitively) independent, knowledge of the occurrence or nonoccurrence of A ∩ B, for example, should not change the odds about C; this leads as above to the requirement that P(A ∩ B ∩ C) = P(A ∩ B)P(C). But if A, B, and C are to be independent, we must expect that A and B are independent (as well as A and C, and B and C), so we should have all of the following conditions satisfied:

P(A ∩ B) = P(A)P(B),  P(A ∩ C) = P(A)P(C),  P(B ∩ C) = P(B)P(C)

and

P(A ∩ B ∩ C) = P(A)P(B)P(C)

We are led to the following definition.

DEFINITION. Let Ai, i ∈ I, where I is an arbitrary index set, possibly infinite, be an arbitrary collection of events [a fixed probability space (Ω, ℱ, P) is of course assumed].
The Ai are said to be independent iff for each finite set of distinct indices i1, ..., ik ∈ I we have

P(A_{i1} ∩ A_{i2} ∩ ··· ∩ A_{ik}) = P(A_{i1})P(A_{i2}) ··· P(A_{ik})
REMARKS

1. If the Ai, i ∈ I, are independent, it follows that

P(B_{i1} ∩ ··· ∩ B_{ik}) = P(B_{i1}) ··· P(B_{ik})

for all (distinct) i1, ..., ik, where each B_{ir} may be either A_{ir} or A_{ir}^c. To put it simply, if the Ai are independent and we replace any event by its complement, we still have independence [see Problem 1; actually we have already done most of the work by showing that P(A ∩ B) = P(A)P(B) implies P(Ac ∩ B) = P(Ac)P(B)].

2. The condition P(A1 ∩ ··· ∩ An) = P(A1) ··· P(An) does not imply the analogous condition for any smaller family of events. For example, it is possible to have P(A ∩ B ∩ C) = P(A)P(B)P(C), but P(A ∩ B) ≠ P(A)P(B), P(A ∩ C) ≠ P(A)P(C), P(B ∩ C) ≠ P(B)P(C). In particular, A, B, and C are not independent. Conversely it is possible to have, for example, P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C), but P(A ∩ B ∩ C) ≠ P(A)P(B)P(C). Thus A and B are independent, as are A and C, and also B and C, but A, B, and C are not independent.
Example 1. Let two dice be tossed, and take Ω = all ordered pairs (i, j), i, j = 1, 2, ..., 6, with each point assigned probability 1/36. Let

A = {first die = 1, 2, or 3}
B = {first die = 3, 4, or 5}
C = {the sum of the two faces is 9}

(Thus A ∩ B = {(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)}, A ∩ C = {(3, 6)}, B ∩ C = {(3, 6), (4, 5), (5, 4)}, A ∩ B ∩ C = {(3, 6)}.) Then

P(A ∩ B) = 1/6 ≠ P(A)P(B) = (1/2)(1/2) = 1/4
P(A ∩ C) = 1/36 ≠ P(A)P(C) = (1/2)(1/9) = 1/18
P(B ∩ C) = 1/12 ≠ P(B)P(C) = (1/2)(1/9) = 1/18

But

P(A ∩ B ∩ C) = 1/36 = P(A)P(B)P(C)
Now in the same probability space let

A = {first die = 1, 2, or 3}
B = {second die = 4, 5, or 6}
C = {the sum of the two faces is 7}

(Thus A ∩ B ∩ C = {(1, 6), (2, 5), (3, 4)} = A ∩ C, etc.) Then

P(A ∩ B) = 1/4 = P(A)P(B) = (1/2)(1/2)
P(A ∩ C) = 1/12 = P(A)P(C) = (1/2)(1/6)
P(B ∩ C) = 1/12 = P(B)P(C) = (1/2)(1/6)

But

P(A ∩ B ∩ C) = 1/12 ≠ P(A)P(B)P(C) = 1/24
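The second half of Example 1 (pairwise independent but not independent) can be checked by enumerating the 36-point space with exact fractions; a sketch:

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely ordered pairs

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] in (1, 2, 3)  # first die = 1, 2, or 3
B = lambda w: w[1] in (4, 5, 6)  # second die = 4, 5, or 6
C = lambda w: w[0] + w[1] == 7   # sum of the two faces is 7

print(prob(lambda w: A(w) and B(w)) == prob(A) * prob(B))  # True
print(prob(lambda w: A(w) and C(w)) == prob(A) * prob(C))  # True
print(prob(lambda w: B(w) and C(w)) == prob(B) * prob(C))  # True
p_abc = prob(lambda w: A(w) and B(w) and C(w))
print(p_abc, prob(A) * prob(B) * prob(C))  # 1/12 1/24: not independent
```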
We illustrate the idea of independence by considering some problems related to the classical coin-tossing experiment. A sequence of n Bernoulli trials is a sequence of n independent observations, each of which may result in exactly one of two possible situations, called "success" or "failure." At each observation the probability of success is p, and the probability of failure is q = 1 − p.
SPECIAL CASES

(a) Toss a coin independently n times, with success = heads, failure = tails.
(b) Examine components produced on an assembly line; success = acceptable, failure = defective.
(c) Transmit binary digits through a communication channel; success = digit received correctly, failure = digit received incorrectly.

We take Ω = all 2^n ordered sequences of length n, with components 0 (failure) and 1 (success). To assign probabilities in accordance with the physical description given above, we reason as follows. Consider the sample point ω = 11···10···0 (k 1's followed by n − k 0's). Let Ai = {success on trial i} = the set of all sequences with a 1 in the ith coordinate. Because of the independence of the trials we must assign

P{ω} = P(A1 ∩ A2 ∩ ··· ∩ Ak ∩ A_{k+1}^c ∩ ··· ∩ An^c) = P(A1)P(A2) ··· P(Ak)P(A_{k+1}^c) ··· P(An^c) = p^k q^(n−k)

Similarly, any point with k 1's and n − k 0's is assigned probability p^k q^(n−k). The number of such points is the number of ways of selecting k distinct
positions for the 1's to occur (or selecting n − k distinct positions for the 0's); that is, C(n, k). The sum of the probabilities assigned to all the points is

Σ_{k=0}^{n} C(n, k) p^k q^(n−k) = (p + q)^n = 1

by the binomial theorem. Thus we have a legitimate assignment. Furthermore, the probability of obtaining exactly k successes is

p(k) = C(n, k) p^k q^(n−k),   k = 0, 1, ..., n   (1.5.1)

p(k), k = 0, 1, ..., n, is called the binomial probability function.
Example 2. Six balls are tossed independently into three boxes A, B, C. For each ball the probability of going into a specific box is 1/3. Find the probability that box A will contain (a) exactly four balls, (b) at least two balls, (c) at least five balls.

Here we have six Bernoulli trials, with success corresponding to a ball in box A, failure to a ball in box B or C. Thus n = 6, p = 1/3, q = 2/3, and so the required probabilities are

(a) p(4) = C(6, 4)(1/3)^4 (2/3)^2
(b) 1 − p(0) − p(1) = 1 − (2/3)^6 − C(6, 1)(1/3)(2/3)^5
(c) p(5) + p(6) = C(6, 5)(1/3)^5 (2/3) + (1/3)^6
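Equation (1.5.1) and the three answers in Example 2 can be computed exactly; a sketch:

```python
from math import comb
from fractions import Fraction

def binomial_pmf(k, n, p):
    # (1.5.1): p(k) = C(n, k) p^k q^(n-k), with q = 1 - p
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 6, Fraction(1, 3)
pa = binomial_pmf(4, n, p)                              # (a) exactly four
pb = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)  # (b) at least two
pc = binomial_pmf(5, n, p) + binomial_pmf(6, n, p)      # (c) at least five
print(pa, pb, pc)  # 20/243 473/729 13/729

# The pmf sums to 1 over k = 0, ..., n, as the binomial theorem guarantees
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))  # 1
```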
We now consider generalized Bernoulli trials. Here we have a sequence of n independent trials, and on each trial the result is exactly one of the k possibilities b1, ..., bk. On a given trial let bi occur with probability pi, i = 1, 2, ..., k (pi ≥ 0, p1 + ··· + pk = 1).

We take Ω = all k^n ordered sequences of length n with components b1, ..., bk; for example, if ω = (b1 b3 b2 b2) then b1 occurs on trial 1, b3 on trial 2, and b2 on trials 3 and 4. As in the previous situation, assign to the point

ω = (b1 ··· b1 b2 ··· b2 ··· bk ··· bk)   (n1 b1's, then n2 b2's, ..., then nk bk's)

the probability p1^n1 p2^n2 ··· pk^nk. This is the probability assigned to any sequence having ni occurrences of bi, i = 1, 2, ..., k. To find the number of such sequences, first select n1 positions out of n for the b1's to occur, then n2 positions out of the remaining n − n1 for the b2's, n3 out of n − n1 − n2 for the b3's, and so on. Thus the number of sequences having exactly n1 occurrences of b1, ..., nk occurrences of bk is

n!/(n1! n2! ··· nk!)

The total probability assigned to all points is

Σ n!/(n1! n2! ··· nk!) p1^n1 ··· pk^nk = 1   (1.5.2)

where the sum is over all nonnegative integers n1, ..., nk with n1 + ··· + nk = n. To see this, notice that (p1 + ··· + pk)^n = (p1 + ··· + pk)(p1 + ··· + pk) ··· (p1 + ··· + pk), n times. A typical term in the expansion is p1^n1 ··· pk^nk; the number of times this term appears is

n!/(n1! n2! ··· nk!)

since we may count the appearances by selecting p1 from n1 of the n factors, selecting p2 from n2 of the remaining n − n1 factors, and so forth. Thus the total probability is (p1 + ··· + pk)^n = 1, and we have a legitimate assignment of probabilities. The probability that b1 will occur n1 times, b2 will occur n2 times, ..., and bk will occur nk times is

p(n1, ..., nk) = n!/(n1! n2! ··· nk!) p1^n1 ··· pk^nk   (1.5.3)

p(n1, ..., nk), where n1, ..., nk are nonnegative integers whose sum is n, is called the multinomial probability function. Note that when k = 2, generalized Bernoulli trials reduce to ordinary Bernoulli trials [let b1 = "success," b2 = "failure," p1 = p, p2 = q = 1 − p, n1 = k, n2 = n − k; then (n!/(n1! n2!)) p1^n1 p2^n2 = C(n, k) p^k q^(n−k) = the probability of k successes in n trials].
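The multinomial probability function (1.5.3) translates directly into code; a sketch, applied to the dice problem of Example 3 below:

```python
from math import factorial
from fractions import Fraction

def multinomial_pmf(ns, ps):
    # (1.5.3): n!/(n1! ... nk!) * p1^n1 ... pk^nk
    n = sum(ns)
    coef = factorial(n)
    for ni in ns:
        coef //= factorial(ni)
    prob = Fraction(coef)
    for ni, pi in zip(ns, ps):
        prob *= Fraction(pi) ** ni
    return prob

# Four dice: exactly two 1's, one 2, and one face from {3, 4, 5, 6}
p = multinomial_pmf([2, 1, 1], [Fraction(1, 6), Fraction(1, 6), Fraction(2, 3)])
print(p)  # 1/27
```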
Example 3. Throw four unbiased dice independently. Find the probability of exactly two 1's and one 2. Let

b1 = "1 occurs" (on a given trial),  n1 = 2,  p1 = 1/6
b2 = "2 occurs",  n2 = 1,  p2 = 1/6
b3 = "3, 4, 5, or 6 occurs",  n3 = 1,  p3 = 2/3

The probability is (4!/(2! 1! 1!))(1/6)^2 (1/6)^1 (2/3)^1 = 1/27.
Example 4. If 10 balls are tossed independently into five boxes, with a given ball equally likely to fall into each box, find the probability that all boxes will have the same number of balls. Let bi = "ball goes into box i," pi = 1/5, ni = 2, i = 1, 2, 3, 4, 5, n = 10. The probability = (10!/(2!)^5)(1/5)^10.
REMARK. If r balls are tossed independently into n boxes, we have seen that the event {ball 1 into box i1, ball 2 into box i2, ..., ball r into box ir} must have probability p_{i1} p_{i2} ··· p_{ir}, where pi is the probability that a specific ball will fall into box i. In particular, if all pi = 1/n (as is assumed in Examples 2 and 4 above), the probability of the event is 1/n^r. In other words, all ordered samples of size r (out of n symbols), with replacement, have the same probability. This justifies the assertion we made in Example 3 of Section 1.4. We emphasize that the independence of the tosses is an assumption, not a theorem. For example, if two balls are tossed into two boxes and a given box can contain at most one ball, then the events A1 = {ball 1 goes into box 1} and A2 = {ball 2 goes into box 1} are not independent, since P(A1 ∩ A2) = 0 while P(A1) = P(A2) = 1/2.
Example 5. An urn contains equal numbers of black, white, red, and green balls. Four balls are drawn independently, with replacement. Find the probability p(k) that exactly k colors will appear in the sample, k = 1, 2, 3, 4.
This is a multinomial problem with n = 4 and b1 = B = black, b2 = W = white, b3 = R = red, b4 = G = green.

k = 4: The probability that all four colors will appear is given by the multinomial formula with all ni = 1; that is,

p(4) = (4!/(1! 1! 1! 1!))(1/4)^4 = 6/64

k = 3: The probability of obtaining two black, one white, and one red ball is given by the multinomial formula with n1 = 2, n2 = n3 = 1, n4 = 0; that is,

(4!/(2! 1! 1! 0!))(1/4)^4 = 3/64

To find the total probability of obtaining exactly three colors, multiply by the number of ways of selecting three colors out of four [C(4, 3) = 4] and the number of ways of selecting one of the three colors to be repeated (3). Thus p(3) = 36/64.

k = 2: The probability of obtaining two black and two white balls is

(4!/(2! 2! 0! 0!))(1/4)^4 = 3/128

Thus the probability of obtaining two balls of one color and two of another is 3/128 times the number of ways of selecting two colors out of four [C(4, 2) = 6], or 9/64. Similarly, the probability of obtaining three of one color and one of another is

(4!/(3! 1! 0! 0!))(1/4)^4 (4)(3) = 12/64

Notice that the extra factor is (4)(3) = 12, not C(4, 2) = 6, since three blacks and one white constitute a different selection from three whites and one black. Thus p(2) = 9/64 + 12/64 = 21/64.

k = 1: The probability that all balls will be of the same color is

p(1) = (4!/(4! 0! 0! 0!))(1/4)^4 (4) = 1/64

REMARK. The sample space of this problem consists of all ordered samples of size 4, with replacement, from the symbols B, W, R, G, with all samples assigned the same probability. The reader should resist the temptation to assign equal probability to all unordered samples of size 4, with replacement. This would imply, for example, that {WWWW} and {WWWB, WWBW, WBWW, BWWW} = {three whites and one black} have the same probability, and this is inconsistent with the assumption of independence.
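Example 5's four answers can be confirmed by enumerating all 4^4 = 256 equally likely ordered samples; a sketch:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("BWRG", repeat=4))  # 256 ordered color sequences

def p_colors(k):
    # Probability that exactly k distinct colors appear in the sample
    favorable = sum(1 for w in outcomes if len(set(w)) == k)
    return Fraction(favorable, len(outcomes))

print([p_colors(k) for k in (1, 2, 3, 4)])
# equal to 1/64, 21/64, 36/64, and 6/64 respectively
```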
PROBLEMS
1. Show that the events Ai, i ∈ I, are independent iff P(B_{i1} ∩ ··· ∩ B_{ik}) = P(B_{i1}) ··· P(B_{ik}) for all (distinct) i1, ..., ik, where each B_{ir} may be either A_{ir} or A_{ir}^c.
2. Let p(k), k = 0, 1, ..., n, be the binomial probability function.
(a) If (n + 1)p is not an integer, show that p(k) is strictly increasing up to k = [(n + 1)p] = the largest integer ≤ (n + 1)p, and attains a maximum at [(n + 1)p]. p(k) is strictly decreasing for all larger values of k.
(b) If (n + 1)p is an integer, show that p(k) is strictly increasing up to k = (n + 1)p − 1 and has a double maximum at k = (n + 1)p − 1 and k = (n + 1)p; p(k) is strictly decreasing for larger values of k.
3. A single card is drawn from an ordinary deck. Give examples of events A and B associated with this experiment that are
(a) Mutually exclusive (disjoint) but not independent
(b) Independent but not mutually exclusive
(c) Independent and mutually exclusive
(d) Neither independent nor mutually exclusive
4. Of the 100 people in a certain village, 50 always tell the truth, 30 always lie, and 20 always refuse to answer. A sample of size 30 is taken with replacement.
(a) Find the probability that the sample will contain 10 people of each category.
(b) Find the probability that there will be exactly 12 liars.
5. Six unbiased dice are tossed independently. Find the probability that the number of 1's minus the number of 2's will be 3.
6. How many terms are there in the multinomial expansion (1.5.2)?
7. An urn contains t1 balls of color C1, t2 of color C2, ..., tk of color Ck.
(a) If n balls are drawn without replacement, show that the probability of obtaining exactly n1 of color C1, n2 of color C2, ..., nk of color Ck is

C(t1, n1) C(t2, n2) ··· C(tk, nk)/C(t, n)

where t = t1 + t2 + ··· + tk is the total number of balls in the urn and C(ti, ni) is defined to be 0 if ni > ti. (Notice the pattern: t1 + t2 + ··· + tk = t, n1 + n2 + ··· + nk = n.) The above expression, regarded as a function of n1, ..., nk, is called the hypergeometric probability function.
(b) What is the probability of the event of part (a) if the balls are drawn independently, with replacement?
8. (a) If an event A is independent of itself, that is, if A and A are independent, show that P(A) = 0 or 1.
(b) If P(A) = 0 or 1, show that A and B are independent for any event B; in particular, A and A are independent.
1.6 CONDITIONAL PROBABILITY
If A and B are independent events, the occurrence or nonoccurrence of A does not influence the odds concerning the occurrence of B. If A and B are not independent, it would be desirable to have some way of measuring exactly how much the occurrence of one of the events changes the odds about the other. In a long sequence of independent repetitions of the experiment, P(A) measures the fraction of the trials on which A occurs. If we look only at the trials on which A occurs (say there are n_A of these) and record those trials
on which B occurs also (there are n_AB of these, where n_AB is the number of trials on which both A and B occur), the ratio n_AB/n_A is a measure of P(B | A), the "conditional probability of B given A," that is, the fraction of the time that B occurs, looking only at trials producing an occurrence of A. Comparing P(B | A) with P(B) will indicate the difference between the odds about B when A is known to have occurred, and the odds about B before any information about A is revealed. The above discussion suggests that we define the conditional probability of B given A as

P(B | A) = P(A ∩ B)/P(A)    (1.6.1)

This makes sense if P(A) > 0.
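The frequency interpretation of P(B | A) is easy to see numerically. The following sketch (the particular events, trial count, and seed are choices of this illustration, not part of the text) simulates many independent throws of two dice and computes the ratio n_AB/n_A:

```python
import random

random.seed(0)
trials = 200_000
n_A = 0      # trials on which A occurs
n_AB = 0     # trials on which both A and B occur

for _ in range(trials):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    A = (d1 + d2 == 8)     # event A: sum of the faces is 8
    B = (d1 == d2)         # event B: faces are equal
    if A:
        n_A += 1
        if B:
            n_AB += 1

# The relative frequency n_AB / n_A approximates P(B | A);
# the exact value here is (1/36)/(5/36) = 1/5.
print(n_AB / n_A)
```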
Example 1. Throw two unbiased dice independently. Let A = {sum of the faces = 8}, B = {faces are equal}. Then

P(B | A) = P(A ∩ B)/P(A) = P{4-4}/P{4-4, 5-3, 3-5, 6-2, 2-6} = (1/36)/(5/36) = 1/5

(see Figure 1.6.1). There is a point here that may be puzzling. In counting the outcomes favorable to A, we note that there are two ways of making an 8 using a 5 and a 3, but only one way using a 4 and a 4. The probability space consists of all 36 ordered pairs (i, j), i, j = 1, 2, 3, 4, 5, 6, each assigned probability 1/36. The ordered pair (4, 4) is the same as the ordered pair (4, 4) (this is rather difficult to dispute), while (5, 3) is different from (3, 5). Alternatively, think of the first die to be thrown as red and the second as green. A 5 on the red die and a 3 on the green is a different outcome from a 3 on the red and a 5 on the green. However, using 4's we can make an 8 in only one way, a 4 on the red followed by a 4 on the green.
The extension of the definition of conditional probability to events with probability zero will be considered in great detail later on. For now, we are
FIGURE 1.6.1 Example on Conditional Probability. Numbers Indicate Favorable Outcomes.
content to note some consequences of the above definition [whenever an expression such as P(B | A) is written, it is assumed that P(A) > 0]. If A and B are independent, then P(A ∩ B) = P(A)P(B), so that P(B | A) = P(B) and P(A | B) (= P(A ∩ B)/P(B)) = P(A).

From (1.6.1) we have

P(A ∩ B) = P(A)P(B | A)

Hence

P(A ∩ B ∩ C) = P(A ∩ B)P(C | A ∩ B) = P(A)P(B | A)P(C | A ∩ B)    (1.6.2)

Similarly

P(A ∩ B ∩ C ∩ D) = P(A)P(B | A)P(C | A ∩ B)P(D | A ∩ B ∩ C)    (1.6.3)
and so on.

Example 2. Three cards are drawn without replacement from an ordinary deck. Find the probability of not obtaining a heart. Let Ai = {card i is not a heart}. Then we are looking for

P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A1 ∩ A2) = (39/52)(38/51)(37/50)

[For example, to find P(A2 | A1), we restrict ourselves to the outcomes favorable to A1. If the first card is not a heart, 51 cards remain in the deck, including 13 hearts, so that the probability of not getting a heart on the second trial is 38/51.] Notice that the above probability can be written (39 choose 3)/(52 choose 3), which could have been derived by direct combinatorial reasoning. Furthermore, if the cards were drawn independently, with replacement, the probability would be quite different, (3/4)^3 = 27/64.

We now prove one of the most useful theorems of the subject.
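The two routes to this probability agree, as a quick check with Python's standard library shows (the numbers are those of Example 2):

```python
from math import comb

# Step-by-step conditioning: P(A1) P(A2 | A1) P(A3 | A1 ∩ A2)
chain = (39/52) * (38/51) * (37/50)

# Direct combinatorial count: choose 3 of the 39 non-hearts
direct = comb(39, 3) / comb(52, 3)

# Sampling with replacement gives a different answer
with_replacement = (39/52) ** 3   # = (3/4)^3 = 27/64

print(chain, direct, with_replacement)
```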
Theorem of Total Probability. Let B1, B2, ... be a finite or countably infinite family of mutually exclusive and exhaustive events (i.e., the Bi are disjoint and their union is Ω). If A is any event, then

P(A) = Σ_i P(A ∩ Bi)    (1.6.4)

Thus P(A) is computed by finding a list of mutually exclusive, exhaustive ways in which A can happen, and then adding the individual probabilities. Also

P(A) = Σ_i P(Bi)P(A | Bi)    (1.6.5)
where the sum is taken over those i for which P(Bi) > 0. Thus P(A) is a weighted average of the conditional probabilities P(A | Bi).

PROOF.

P(A) = P(A ∩ Ω) = P(A ∩ (∪_i Bi)) = P(∪_i (A ∩ Bi)) = Σ_i P(A ∩ Bi) = Σ_i P(Bi)P(A | Bi)
Notice that under the above assumptions we have

P(Bk | A) = P(A ∩ Bk)/P(A) = P(Bk)P(A | Bk) / Σ_i P(Bi)P(A | Bi)    (1.6.6)

This formula is sometimes referred to as Bayes' theorem; P(Bk | A) is sometimes called an a posteriori probability. The reason for this terminology may be seen in the example below.
Example 3. Two coins are available, one unbiased and the other two-headed. Choose a coin at random and toss it once; assume that the unbiased coin is chosen with probability 3/4. Given that the result is heads, find the probability that the two-headed coin was chosen.

The "tree diagram" shown in Figure 1.6.2 represents the experiment (FIGURE 1.6.2 Tree Diagram: the first branch chooses the unbiased or two-headed coin, the second branch records H or T). We may take Ω to consist of the four possible paths through the tree, with each path assigned a probability equal to the product of the probabilities assigned to each branch. Notice that we are given the probabilities of the
events B1 = {unbiased coin chosen} and B2 = {two-headed coin chosen}, as well as the conditional probabilities P(A | Bi), where A = {coin comes up heads}. This is sufficient to determine the probabilities of all events. Now we can compute P(B2 | A) using Bayes' theorem; this is facilitated if, instead of trying to identify the individual terms in (1.6.6), we simply look at the tree and write

P(B2 | A) = P(B2 ∩ A)/P(A) = P{two-headed coin chosen and coin comes up heads}/P{coin comes up heads} = (1/4)(1) / [(3/4)(1/2) + (1/4)(1)] = 2/5
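Reading Bayes' theorem off the tree can be mechanized. Here is a small sketch that enumerates the four paths of Figure 1.6.2 with exact arithmetic (the path table simply transcribes the branch probabilities given above):

```python
from fractions import Fraction as F

# Each path through the tree: (coin, result) with probability
# P(coin) * P(result | coin), as given in Example 3.
paths = {
    ("unbiased", "H"):   F(3, 4) * F(1, 2),
    ("unbiased", "T"):   F(3, 4) * F(1, 2),
    ("two-headed", "H"): F(1, 4) * F(1, 1),
    ("two-headed", "T"): F(1, 4) * F(0, 1),
}

p_heads = sum(p for (coin, res), p in paths.items() if res == "H")
p_two_headed_and_heads = paths[("two-headed", "H")]

# Bayes' theorem read directly off the tree
print(p_two_headed_and_heads / p_heads)   # 2/5
```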
There are many situations in which an experiment consists of a sequence of steps, and the conditional probabilities of events happening at step n + 1, given outcomes at step n, are specified. In such cases a description by means of a tree diagram may be very convenient (see Problems).
Example 4. A loaded die is tossed once; if N is the result of the toss, then P{N = i} = pi, i = 1, 2, 3, 4, 5, 6. If N = i, an unbiased coin is tossed independently i times. Find the conditional probability that N will be odd, given that at least one head is obtained (see Figure 1.6.3).

FIGURE 1.6.3 Tree diagram for Example 4.
Let A = {at least one head obtained}, B = {N odd}. Then P(B | A) = P(A ∩ B)/P(A). Now

P(A ∩ B) = Σ_{i=1,3,5} P{N = i and at least one head obtained} = (1/2)p1 + (7/8)p3 + (31/32)p5

since when an unbiased coin is tossed independently i times, the probability of at least one head is 1 − (1/2)^i. Similarly,

P(A) = Σ_{i=1}^{6} P{N = i and at least one head obtained} = Σ_{i=1}^{6} pi(1 − 2^{−i})

Thus

P(B | A) = [(1/2)p1 + (7/8)p3 + (31/32)p5] / Σ_{i=1}^{6} pi(1 − 2^{−i})
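For a concrete instance of this computation, here is a sketch with exact arithmetic; the particular loading p1, ..., p6 is a hypothetical choice for illustration (any probabilities summing to 1 would do), while the formulas are those derived above:

```python
from fractions import Fraction as F

# A hypothetical loaded die (assumption of this sketch)
p = {1: F(1, 4), 2: F(1, 8), 3: F(1, 8), 4: F(1, 8), 5: F(1, 8), 6: F(1, 4)}

# P{at least one head | N = i} = 1 - (1/2)^i
at_least_one_head = {i: 1 - F(1, 2**i) for i in range(1, 7)}

p_A_and_B = sum(p[i] * at_least_one_head[i] for i in (1, 3, 5))
p_A = sum(p[i] * at_least_one_head[i] for i in range(1, 7))

print(p_A_and_B / p_A)   # P(B | A) for this loading
```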
PROBLEMS
1. In 10 Bernoulli trials find the conditional probability that all successes will occur consecutively (i.e., no two successes will be separated by one or more failures), given that the number of successes is between four and six.
2. If X is the number of successes in n Bernoulli trials, find the probability that X > 3 given that X > 1.
3. An unbiased die is tossed once. If the face is odd, an unbiased coin is tossed repeatedly; if the face is even, a biased coin with probability of heads p ≠ 1/2 is tossed repeatedly. (Successive tosses of the coin are independent in each case.) If the first n throws result in heads, what is the probability that the unbiased coin is being used?
4. A positive integer I is selected, with P{I = n} = (1/2)^n, n = 1, 2, .... If I takes the value n, a coin with probability of heads e^{−n} is tossed once. Find the probability that the resulting toss will be a head.
5. A bridge player and his partner are known to have six spades between them. Find the probability that the spades will be split (a) 3-3 (b) 4-2 or 2-4 (c) 5-1 or 1-5 (d) 6-0 or 0-6.
6. An urn contains 30 white and 15 black balls. If 10 balls are drawn with (respectively, without) replacement, find the probability that the first two balls will be white, given that the sample contains exactly six white balls.
7. Let C1 be an unbiased coin, and C2 a biased coin with probability of heads 3/4. At time t = 0, C1 is tossed. If the result is heads, then C1 is tossed at time t = 1; if the result is tails, C2 is tossed at t = 1. The process is repeated at t = 2, 3, .... In general, if heads appears at t = n, then C1 is tossed at t = n + 1; if tails appears at t = n, then C2 is tossed at t = n + 1. Find y_n, the probability that the toss at t = n will be a head (set up a difference equation).
8. In the switching network of Figure P.1.6.8, the switches operate independently.

FIGURE P.1.6.8 Switching Network.
Each switch closes with probability p, and remains open with probability 1 − p.
(a) Find the probability that a signal at the input will be received at the output.
(b) Find the conditional probability that switch E is open given that a signal is received.
9. In a certain village 20% of the population has disease D. A test is administered which has the property that if a person has D, the test will be positive 90% of the time, and if he does not have D, the test will still be positive 30% of the time. All those whose test is positive are given a drug which invariably cures the disease, but produces a characteristic rash 25% of the time. Given that a person picked at random has the rash, what is the probability that he actually had D to begin with?
1.7 SOME FALLACIES IN COMBINATORIAL PROBLEMS
In this section we illustrate some common traps occurring in combinatorial problems. In the first three examples there will be a multiple count.

Example 1. Three cards are selected from an ordinary deck, without replacement. Find the probability of not obtaining a heart.
PROPOSED SOLUTION. The total number of selections is (52 choose 3). To find the number of favorable outcomes, notice that the first card cannot be a heart; thus we have 39 choices at step 1. Having removed one card, there are 38 nonhearts left at step 2 (and then 37 at step 3). The desired probability is (39)(38)(37)/(52 choose 3).

FALLACY. In computing the number of favorable outcomes, a particular selection might be: 9 of diamonds, 8 of clubs, 7 of diamonds. Another selection is: 8 of clubs, 9 of diamonds, 7 of diamonds. In fact the 3! = 6 possible orderings of these three cards are counted separately in the numerator (but not in the denominator). Thus the proposed answer is too high by a factor of 3!; the actual probability is (39)(38)(37)/[3!(52 choose 3)] = (39 choose 3)/(52 choose 3) (see Example 2, Section 1.6).
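The 3! overcount is easy to verify numerically (a sketch; the counts are those of Example 1):

```python
from math import comb, factorial
from fractions import Fraction as F

ordered_favorable = 39 * 38 * 37          # the proposed solution's numerator

# Each unordered 3-card selection of nonhearts is counted 3! times
assert ordered_favorable == factorial(3) * comb(39, 3)

proposed = F(ordered_favorable, comb(52, 3))   # too high
actual   = F(comb(39, 3), comb(52, 3))
print(proposed / actual)   # 6
```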
Example 2. Find the probability that a five-card poker hand will result in three of a kind (three cards of the same face value x, plus two cards of face values y and z, with x, y, and z distinct).

PROPOSED SOLUTION. Pick the face value to appear three times (13 possibilities). Pick three suits out of four for the "three of a kind" ((4 choose 3) choices). Now one face value is excluded, so that 48 cards are left in the deck. Pick one of them as the fourth card; the fifth card can be chosen in 44 ways, since the fourth card excludes another face value. Thus the desired probability is 13(4 choose 3)(48)(44)/(52 choose 5).

FALLACY. Say the first three cards are aces. The fourth and fifth cards might be the jack of clubs and the 6 of diamonds, or equally well the 6 of diamonds and the jack of clubs. These possibilities are counted separately in the numerator but not in the denominator, so that the proposed answer is too high by a factor of 2. The actual probability is 13(4 choose 3)(48)(44)/[2(52 choose 5)] = 13(4 choose 3)(12 choose 2)(16)/(52 choose 5) [see Problem 2, Section 1.4; the factor (12 choose 2)(16) corresponds to the selection of two distinct face values out of the remaining 12, then one card from each of these face values].

REMARK. A more complicated approach to this problem is as follows. Pick the face value x to appear three times, then pick three suits out of four, as before. Forty-nine cards remain in the deck, and the total number of ways of selecting the two remaining cards is (49 choose 2). However, if the two face values are the same, we obtain a full house; there are 12(4 choose 2) selections in which this happens (select one face value out of 12, then two suits out of four). Also, if one of the two cards has face value x, we obtain four of a kind; since there is only one remaining card with face value x and 48 cards remain after this one is chosen, there
are 48 possibilities. Thus the probability of obtaining three of a kind is

13(4 choose 3)[(49 choose 2) − 12(4 choose 2) − 48] / (52 choose 5)
(This agrees with the previous answer.)

Example 3. Ten cards are drawn without replacement from an ordinary deck. Find the probability that at least nine will be of the same suit.

PROPOSED SOLUTION. Pick the suit in any one of four ways, then choose nine of 13 face values. Forty-three cards now remain in the deck, so that the desired probability is 4(13 choose 9)(43)/(52 choose 10).

FALLACY. Consider two possible selections.
1. Spades are chosen, then face values A K Q J 10 9 8 7 6. The last card is the 5 of spades.
2. Spades are chosen, then face values A K Q J 10 9 8 7 5. The last card is the 6 of spades (see Figure 1.7.1).
Both selections yield the same 10 cards, but are counted separately in the computation. To find the number of duplications, notice that we can select 10 cards out of 13 to be involved in the duplication; each choice of one card (out of 10) for the last card yields a distinct path in Figure 1.7.1. Of the 10 possible paths corresponding to a given selection of cards, nine are redundant. Thus the actual probability is

4[(13 choose 9)43 − (13 choose 10)9] / (52 choose 10)

FIGURE 1.7.1 Multiple Count.
Now

(13 choose 9)43 − (13 choose 10)9 = (13 choose 9)39 + (13 choose 10)

so that the probability is

4[(13 choose 9)39 + (13 choose 10)] / (52 choose 10)

as obtained in a straightforward manner in Problem 4, Section 1.4.

Example 4. An urn contains 10 balls b1, ..., b10. Five balls are drawn without replacement. Find the probability that b8 and b9 will be included in the sample.

PROPOSED SOLUTION. We are drawing half the balls, so that the probability that a particular ball will be included is 1/2. Thus the probability of including both b8 and b9 is (1/2)(1/2) = 1/4.

FALLACY. Let A = {b8 is included}, B = {b9 is included}. The difficulty is simply that A and B are not independent. For P(A ∩ B) = (8 choose 3)/(10 choose 5) = 2/9 (after b8 and b9 are chosen, three balls are to be selected from the remaining eight). Also P(A) = P(B) = (9 choose 4)/(10 choose 5) = 1/2, so that P(A ∩ B) ≠ P(A)P(B).

Example 5. Two cards are drawn independently, with replacement, from an ordinary deck; at each selection all 52 cards are equally likely. Find the probability that the king of spades and the king of hearts will be chosen (in some order).

PROPOSED SOLUTION. The number of unordered samples of size 2 out of 52, with replacement, is (52 + 2 − 1 choose 2) = (53 choose 2) [see (1.4.4)]. The kings of spades and hearts constitute one such sample, so that the desired probability is 1/(53 choose 2).

FALLACY. It is not legitimate to assign equal probability to all unordered samples with replacement. If we do this we are saying, for example, that the outcomes "ace of spades, ace of spades" and "king of spades, king of hearts" have the same probability. However, this cannot be the case if independent
sampling is assumed. For the probability that the ace of spades is chosen twice is (1/52)^2, while the probability that the spade and heart kings will be chosen (in some order) is P{first card is the king of spades, second card is the king of hearts} + P{first card is the king of hearts, second card is the king of spades} = 2(1/52)^2, which is the desired probability. The main point is that we must use ordered samples with replacement in order to capture the idea of independence.
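An exhaustive check of Example 4's fallacy is easy, since there are only (10 choose 5) = 252 samples to enumerate (a sketch):

```python
from itertools import combinations
from fractions import Fraction as F

balls = range(1, 11)                      # b_1, ..., b_10
samples = list(combinations(balls, 5))    # all unordered samples of size 5

pA  = F(sum(1 for s in samples if 8 in s), len(samples))             # b_8 in
pB  = F(sum(1 for s in samples if 9 in s), len(samples))             # b_9 in
pAB = F(sum(1 for s in samples if 8 in s and 9 in s), len(samples))  # both in

print(pA, pB, pAB)       # 1/2 1/2 2/9
print(pAB == pA * pB)    # False: A and B are not independent
```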
1.8 APPENDIX: STIRLING'S FORMULA
An estimate of n! that is of importance both in numerical calculations and theoretical analysis is Stirling's formula:

n! ∼ n^n e^{−n} √(2πn)

in the sense that the ratio of the two sides approaches 1 as n → ∞.

PROOF. Define (2n)!! (read "2n semifactorial") as 2n(2n − 2)(2n − 4) ··· (6)(4)(2), and (2n + 1)!! as (2n + 1)(2n − 1) ··· (5)(3)(1). We first show that

(a)  (2n)!!/(2n + 1)!! < (π/2)[(2n − 1)!!/(2n)!!] < (2n − 2)!!/(2n − 1)!!

Let I_k = ∫_0^{π/2} (cos x)^k dx, k = 0, 1, 2, .... Then I_0 = π/2, I_1 = 1. Integrating by parts, we obtain I_k = ∫_0^{π/2} (cos x)^{k−1} d(sin x) = ∫_0^{π/2} (k − 1)(cos x)^{k−2} sin^2 x dx. Since sin^2 x = 1 − cos^2 x, we have I_k = (k − 1)I_{k−2} − (k − 1)I_k, or I_k = [(k − 1)/k]I_{k−2}. By iteration, we obtain I_{2n} = (π/2)[(2n − 1)!!/(2n)!!] and I_{2n+1} = (2n)!!/(2n + 1)!!. Since (cos x)^k decreases with k, so does I_k, and hence I_{2n+1} < I_{2n} < I_{2n−1}, and (a) is proved.

(b) Let Q_n = (2n choose n)/2^{2n}. Then

lim_{n→∞} Q_n √(nπ) = 1

To prove this, write

Q_n = (2n)!/[(n!)^2 2^{2n}] = (2n)!/[((2n)(2n − 2) ··· (4)(2))^2] = (2n − 1)!!/(2n)!!
Thus, by (a),

(2n)!!/(2n + 1)!! < (π/2)Q_n < (2n − 2)!!/(2n − 1)!!

Multiply this inequality by

(2n − 1)!!/(2n − 2)!! = [(2n − 1)!!/(2n)!!][(2n)!!/(2n − 2)!!] = Q_n(2n)

to obtain

2n/(2n + 1) < nπQ_n^2 < 1

If we let n → ∞, we obtain nπQ_n^2 → 1, proving (b).

(c) Proof of Stirling's formula. Let c_n = n!/(n^n e^{−n} √(2πn)). We must show that c_n → 1 as n → ∞. Consider (n + 1)!/n! = n + 1. We have

(n + 1)!/n! = [c_{n+1}(n + 1)^{n+1} e^{−(n+1)} √(2π(n + 1))] / [c_n n^n e^{−n} √(2πn)]

Thus

c_{n+1}/c_n = e[n/(n + 1)]^{n+1/2} = e/(1 + 1/n)^{n+1/2}

Now (1 + 1/n)^{n+1/2} > e for sufficiently large n (take logarithms and expand in a power series); hence c_{n+1}/c_n < 1 for n large enough. Since every monotone bounded sequence converges, c_n converges to a limit c. We must show c = 1. By (b),

lim_{n→∞} (2n choose n) √(nπ) 2^{−2n} = 1

But

(2n choose n) √(nπ) 2^{−2n} = (2n)! √(nπ)/[2^{2n}(n!)^2] = [c_{2n}(2n/e)^{2n} √(2π(2n))] √(nπ) / [2^{2n}(c_n(n/e)^n √(2πn))^2] = c_{2n}/c_n^2

However, c_{2n} → c and c_n^2 → c^2, and consequently c_{2n}/c_n^2 → c/c^2 = 1, so that c = 1. The theorem is proved.

REMARK. The last step requires that c be > 0. To see this, write

c_n = (c_n/c_{n−1})(c_{n−1}/c_{n−2}) ··· (c_1/c_0)
where c_0 is defined as 1. To show that c_n → a nonzero limit, it suffices to show that lim_n ln c_{n+1} is finite, and for this it is sufficient to show that Σ_n ln(c_{n+1}/c_n) converges to a finite limit. Now

ln(c_n/c_{n+1}) = ln[(1 + 1/n)^{n+1/2}/e] = (n + 1/2) ln(1 + 1/n) − 1 = (n + 1/2)[1/n − 1/(2n^2) + O(n)/n^3] − 1

where O(n) is bounded by a constant independent of n. This is of the order of 1/n^2; hence Σ_n ln(c_{n+1}/c_n) converges, and the result follows.
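The convergence c_n → 1 is visible numerically; a quick sketch:

```python
from math import factorial, e, pi, sqrt

def stirling(n):
    """The Stirling approximation n^n e^(-n) sqrt(2 pi n)."""
    return (n / e) ** n * sqrt(2 * pi * n)

# c_n = n!/stirling(n) decreases toward 1 (roughly as 1 + 1/(12n))
for n in (1, 5, 10, 50, 100):
    print(n, factorial(n) / stirling(n))
```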
Random Variables

2.1 INTRODUCTION
In Chapter 1 we mentioned that there are situations in which not all subsets of the sample space Ω can belong to the event class ℱ, and that difficulties of this type generally arise when Ω is uncountable. Such spaces may arise physically as approximations to discrete spaces with a very large number of points. For example, if a person is picked at random in the United States and his age recorded, a complete description of this experiment would involve a probability space with approximately 200 million points (if the data are recorded accurately enough, no two people have the same age). A more convenient way to describe the experiment is to group the data, for example, into 10-year intervals. We may define a function q(x), x = 5, 15, 25, ..., so that q(x) is the number of people, say in millions, between x − 5 and x + 5 years (see Figure 2.1.1). For example, if q(15) = 40, there are 40 million people between the ages of 10 and 20 or, on the average, 4 million per year over that 10-year span. Now if we want the probability that a person picked at random will be between 14 and 16, we can get a reasonable figure by taking the average number of people per year [4 = q(15)/10] and multiplying by the number of years (2) to obtain (roughly) 8 million people, then dividing by the total population to obtain a probability of 8/200 = .04.
FIGURE 2.1.1 Age Statistics. (Bar graph of q(x): each bar gives the number of people, in millions, in a 10-year age interval; e.g., 35 million people between 20 and 30, etc.)
If we connect the values of q(x) by a smooth curve, essentially what we are doing is evaluating (1/200)∫_{14}^{16} [q(x)/10] dx to find the probability that a person picked at random will be between 14 and 16 years old. In general, we estimate the number of people between ages a and b by ∫_a^b [q(x)/10] dx, so that q(x)/10 is the age density, that is, the number of people per unit age. We estimate the probability of obtaining an age between a and b by ∫_a^b [q(x)/2000] dx; thus q(x)/2000 is the probability density, or probability per unit age.

Thus we are led to the idea of assigning probabilities by means of an integral. We are taking Ω as (a subset of) the reals, and assigning P(B) = ∫_B f(x) dx, where f is a real-valued function defined on the reals. There are several immediate questions, namely, what sigma field we are using, what functions f are allowed, what we mean by ∫_B f(x) dx, and how we know that the resulting P is a probability.

For the moment suppose that we restrict ourselves to continuous or piecewise continuous f. Then we can certainly talk about ∫_B f(x) dx, at least when B is an interval, and the integral is in the Riemann sense. Thus the appropriate sigma field should contain the intervals, and hence must be at least as big as the smallest sigma field ℬ containing the intervals (ℬ exists; it can be described as the intersection of all sigma fields containing the intervals). The sigma field ℬ = ℬ(E¹) is called the class of Borel sets of the reals E¹. Intuitively we may think of ℬ as being generated by starting with the intervals and repeatedly forming new sets by taking countable unions (and countable intersections) and complements in all possible ways (it turns out that there are subsets of E¹ that are not Borel sets). Thus our problem will be to construct probability measures on the class of Borel sets of E¹.

The reason for considering only the Borel sets rather than all subsets of E¹ is this. Suppose that we require that P(B) = ∫_B f(x) dx
for all intervals B, where f is a particular nonnegative continuous function defined on E¹, and ∫_{−∞}^{∞} f(x) dx = 1. There is no probability measure on the class of all subsets of E¹ satisfying this requirement, but there is such a measure on the Borel sets. Before elaborating on these ideas, it is convenient to introduce the concept of a random variable; we do this in the next section.
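The recipe P(B) = ∫_B f(x) dx can be sketched numerically. The density below, uniform over an 80-year age range and standing in for q(x)/2000, is an assumption of this illustration, as is the Riemann-sum helper:

```python
def prob(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] by a midpoint Riemann sum."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# A density: f >= 0 and total mass 1 (uniform on [0, 80] for this sketch)
f = lambda x: 1 / 80 if 0 <= x <= 80 else 0.0

print(prob(f, 14, 16))   # P{age between 14 and 16} = 2/80 = 0.025
print(prob(f, 0, 80))    # total probability, approximately 1
```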
2.2 DEFINITION OF A RANDOM VARIABLE
Intuitively, a random variable is a quantity that is measured in connection with a random experiment. If Ω is a sample space, and the outcome of the experiment is ω, a measuring process is carried out to obtain a number R(ω). Thus a random variable is a real-valued function on a sample space. (The formal definition, which is postponed until later in the section, is somewhat more restrictive.)

Example 1. Throw a coin 10 times, and let R be the number of heads. We take Ω = all sequences of length 10 with components H and T; 2^10 points altogether. A typical sample point is ω = HHTHTTHHTH. For this point R(ω) = 6. Another random variable, R1, is the number of times a head is followed immediately by a tail. For the point above, R1(ω) = 3.

Example 2. Pick a person at random from a certain population and measure his height and weight. We may take the sample space to be the plane E², that is, the set of all pairs (x, y) of real numbers, with the first coordinate x representing the height and the second coordinate y the weight (we can take care of the requirement that height and weight be nonnegative by assigning probability 0 to the complement of the first quadrant). Let R1 be the height of the person selected, and let R2 be the weight. Then R1(x, y) = x, R2(x, y) = y. As another example, let R3 be twice the height plus the cube root of the weight; that is, R3 = 2R1 + ∛R2. Then R3(x, y) = 2R1(x, y) + ∛R2(x, y) = 2x + ∛y.

Example 3. Throw two dice. We may take the sample space to be the set of all pairs of integers (x, y), x, y = 1, 2, ..., 6 (36 points in all). Let R1 = the result of the first toss. Then R1(x, y) = x. Let R2 = the sum of the two faces. Then R2(x, y) = x + y. Let R3 = 1 if at least one face is an even number, R3 = 0 otherwise. Then R3(6, 5) = 1; R3(3, 6) = 1; R3(1, 3) = 0, and so on.
Example 4.
Imagine that we can observe the times at which electrons are emitted from the cathode of a vacuum tube, starting at time t = 0. As a sample space, we may take all infinite sequences of positive real numbers, with the components representing the emission times. Assume that the emission process never stops. Typical sample points might be ω1 = (.2, 1.5, 6.3, ...), ω2 = (.01, .5, .9, 1.7, ...). If R1 is the number of electrons emitted before t = 1, then R1(ω1) = 1, R1(ω2) = 3. If R2 is the time at which the first electron is emitted, then R2(ω1) = .2, R2(ω2) = .01.
If we are interested in a random variable R defined on a given sample space, we generally want to know the probability of events involving R. Physical measurements of a quantity R generally lead to statements of the form a ≤ R ≤ b, and it is natural to ask for the probability that R will lie between a and b in a given performance of the experiment. Thus we are looking for P{ω : a ≤ R(ω) ≤ b} (or, equally well, P{ω : a < R(ω) ≤ b}, and so on). For example, if a coin is tossed independently n times, with probability p of coming up heads on a given toss, and if R is the number of heads, we have seen in Chapter 1 that

P{ω : a ≤ R(ω) ≤ b} = Σ_{k=a}^{b} (n choose k) p^k (1 − p)^{n−k}

NOTATION. {ω : a ≤ R(ω) ≤ b} will often be abbreviated to {a ≤ R ≤ b}.
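The binomial expression above is easy to evaluate exactly; a small sketch using exact rational arithmetic (the particular n, p, a, b below are illustrative choices, not part of the text):

```python
from fractions import Fraction as F
from math import comb

def binom_range(n, p, a, b):
    """P{a <= R <= b} for R = number of heads in n independent tosses."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

# e.g., a fair coin tossed 10 times, between 4 and 6 heads
print(binom_range(10, F(1, 2), 4, 6))   # 21/32
```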
As another example, if two unbiased dice are tossed independently, and R2 is the sum of the faces (Example 3 above), then P{R2 = 6} = P{(5, 1), (1, 5), (4, 2), (2, 4), (3, 3)} = 5/36.

In general an "event involving R" corresponds to a statement that the value of R lies in a set B; that is, the event is of the form {ω : R(ω) ∈ B}. Intuitively, if P{ω : R(ω) ∈ I} is known for all intervals I, then P{ω : R(ω) ∈ B} is determined for any "well-behaved" set B, the reason being that any such set can be built up from intervals. For example, P{0 ≤ R < 2 or R > 3} (= P{R ∈ [0, 2) ∪ (3, ∞)}) = P{0 ≤ R < 2} + P{R > 3}. Thus it appears that in order to describe the nature of R completely, it is sufficient to know P{R ∈ I} for each interval I. We consider in more detail the problem of characterizing a random variable in the next section; in the remainder of this section we give the formal definition of a random variable.

*For the concept of random variable to fit in with our established model for a probability space, the sets {a ≤ R ≤ b} must be events; that is, they must belong to the sigma field ℱ. Thus a first restriction on R is that for all real a, b, the sets {ω : a ≤ R(ω) ≤ b} are in ℱ. Thus we can talk intelligently about the event that R lies between a and b. A question now comes up: Suppose that the sets {a ≤ R ≤ b} are in ℱ
for all a, b. Can we talk about the event that R belongs to a set B of reals, for B more general than a closed interval? For example, let B = [a, b) be an interval closed on the left, open on the right. Then a ≤ R(ω) < b iff

a ≤ R(ω) ≤ b − 1/n    for at least one n = 1, 2, ...

Thus

{ω : a ≤ R(ω) < b} = ∪_{n=1}^{∞} {ω : a ≤ R(ω) ≤ b − 1/n}

and this set is a countable union of sets in ℱ, hence belongs to ℱ. In a similar fashion we can handle all types of intervals. Thus {ω : R(ω) ∈ B} ∈ ℱ for all intervals B. In fact {ω : R(ω) ∈ B} belongs to ℱ for all Borel sets B. The sequence of steps by which this is proved is outlined in Problem 1. We are now ready for the formal definition.

DEFINITION. A random variable on the probability space (Ω, ℱ, P) is a real-valued function R defined on Ω, such that for every Borel subset B of the reals, {ω : R(ω) ∈ B} belongs to ℱ.
Notice that the probability P is not involved in the definition at all; if R is a random variable on (Ω, ℱ, P) and the probability measure is changed, R is still a random variable. Notice also that, by the above discussion, to check whether a given function R is a random variable it is sufficient to know that {ω : a ≤ R(ω) ≤ b} ∈ ℱ for all real a, b. In fact (Problem 2) it is sufficient that {ω : R(ω) ≤ b} ∈ ℱ for all real b (or, equally well, {ω : R(ω) < b} ∈ ℱ for all real b; or {ω : R(ω) ≥ a} ∈ ℱ for all real a; or {ω : R(ω) > a} ∈ ℱ for all real a; the argument is essentially the same in all cases).

Notice that if ℱ consists of all subsets of Ω, {ω : R(ω) ∈ B} automatically belongs to ℱ, so that in this case any real-valued function on the sample space is a random variable. Examples 1 and 3 fall into this category.

Now let us consider Example 2. We take Ω = the plane E², ℱ = the class of Borel subsets of E², that is, the smallest sigma field containing all rectangles (we shall use "rectangle" in a very broad sense, allowing open, closed, or semiclosed rectangles, as well as infinite rectangular strips). To check that R1 is a random variable, we have

{(x, y) : a ≤ R1(x, y) ≤ b} = {(x, y) : a ≤ x ≤ b}

which is a rectangular strip and hence a set in ℱ. Similarly, R2 is a random variable. For R3, see Problem 3. Example 2 generalizes as follows. Take Ω = E^n = all n-tuples of real
numbers, ℱ = the smallest sigma field containing the n-dimensional "intervals." [If a = (a1, ..., an), b = (b1, ..., bn), the interval (a, b) is defined as {x ∈ E^n : ai < xi < bi, i = 1, ..., n}; closed and semiclosed intervals are defined similarly.] The coordinate functions, given by R1(x1, ..., xn) = x1, R2(x1, ..., xn) = x2, ..., Rn(x1, ..., xn) = xn, are random variables.

Example 4 involves some serious complications, since the sample points are infinite sequences of real numbers. We postpone the discussion of situations of this type until much later (Chapter 6).
PROBLEMS
*1. Let R be a real-valued function on a sample space Ω, and let 𝒞 be the collection of all subsets B of E¹ such that {ω : R(ω) ∈ B} ∈ ℱ.
(a) Show that 𝒞 is a sigma field.
(b) If all intervals belong to 𝒞, that is, if {ω : R(ω) ∈ B} ∈ ℱ when B is an interval, show that all Borel sets belong to 𝒞. Conclude that R is a random variable.
*2. Let R be a real-valued function on a sample space Ω, and assume {ω : R(ω) ≤ b} ∈ ℱ for all real b. Show that R is a random variable.
*3. In Example 2, show that R3 is a random variable. Do this by showing that if R1 and R2 are random variables, so is R1 + R2; if R is a random variable, so is aR for any real a; if R is a random variable, so is ∛R.
2.3 CLASSIFICATION OF RANDOM VARIABLES
If R is a random variable on the probability space (Ω, ℱ, P), we are generally interested in calculating probabilities of events involving R, that is, P{ω : R(ω) ∈ B} for various (Borel) sets B. The way in which these probabilities are calculated will depend on the particular nature of R; in this section we examine some standard classes of random variables.
The random variable R is said to be discrete iff the set of possible values of R is finite or countably infinite. In this case, if x1, x2, ... are the values of R that belong to B, then
P{R ∈ B} = P{R = x1 or R = x2 or ···} = P{R = x1} + P{R = x2} + ··· = Σ_{x ∈ B} p_R(x)
where p_R(x), x real, is the probability function of R, defined by p_R(x) = P{R = x}. Thus the probability of an event involving R is found by summing
the probability function over the set of points favorable to the event. In particular, the probability function determines the probability of all events involving R.

Example 1. Let R be the number of heads in two independent tosses of a coin, with the probability of heads being .6 on a given toss. Take Ω = {HH, HT, TH, TT} with probabilities .36, .24, .24, .16 assigned to the four points of Ω; take ℱ = all subsets. Then R has three possible values, namely, 0, 1, and 2, and P{R = 0} = .16, P{R = 1} = .48, P{R = 2} = .36, by inspection or by using the binomial formula
P{R = k} = C(n, k) p^k (1 − p)^(n−k)
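The three probabilities above can be reproduced directly from the binomial formula; a minimal sketch (the function name is ours, not the text's):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P{R = k} = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example 1: n = 2 independent tosses, probability of heads p = .6
for k in range(3):
    print(k, round(binomial_pmf(k, 2, 0.6), 2))
# 0 0.16
# 1 0.48
# 2 0.36
```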
Another way of characterizing R is by means of the distribution function, defined by
F_R(x) = P{R ≤ x}, x real
(see Figure 2.3.1 for a sketch of F_R and p_R in Example 1). Observe that, for example, P{R ≤ 1} = p_R(0) + p_R(1) = .64, but if 0 ≤ x < 1, we have P{R ≤ x} = p_R(0) = .16. Thus F_R has a discontinuity at x = 1, of magnitude .48 = p_R(1). In general, if R is discrete, and P{R = x_n} = p_n, n = 1, 2, ..., where the p_n are > 0 and Σ_n p_n = 1, then F_R has a jump of magnitude p_n at x = x_n; F_R is constant between jumps. In the discrete case, if we are given the probability function, we can construct the distribution function, and, conversely, given F_R, we can construct
p_R.

[FIGURE 2.3.1 Distribution and Probability Functions of a Discrete Random Variable.]
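The step structure of F_R just described can be illustrated with a short sketch (not from the text; the probability function is the one computed in Example 1):

```python
def make_cdf(pmf):
    """Build the distribution function F_R(x) = P{R <= x} of a discrete
    random variable from its probability function {value: probability}."""
    def F(x):
        return sum(p for v, p in pmf.items() if v <= x)
    return F

p_R = {0: 0.16, 1: 0.48, 2: 0.36}   # Example 1
F_R = make_cdf(p_R)

print(round(F_R(0.5), 2))   # 0.16  (constant between jumps)
print(round(F_R(1), 2))     # 0.64  (jump of magnitude p_R(1) = .48 at x = 1)
print(round(F_R(2), 2))     # 1.0
```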
Knowledge of either function is sufficient to determine the probability of all events involving R.
We now consider the case introduced in Section 2.1, where probabilities are assigned by means of an integral. Let f be a nonnegative Riemann integrable† function defined on E¹ with ∫_{−∞}^{∞} f(x) dx = 1. Take Ω = E¹, ℱ = Borel sets. We would like to write, for each B ∈ ℱ,
P(B) = ∫_B f(x) dx
but this makes sense only if B is an interval. However, the following result is applicable.

Theorem 1. Let f be a nonnegative real-valued function on E¹, with ∫_{−∞}^{∞} f(x) dx = 1. There is a unique probability measure P defined on the Borel subsets of E¹, such that P(B) = ∫_B f(x) dx for all intervals B = (a, b].

The theorem belongs to the domain of measure and integration theory, and will not be proved here. The theorem allows us to talk about the integral of f over an arbitrary Borel set B. We simply define ∫_B f(x) dx as P(B), where P is the probability measure given by the theorem. The uniqueness part of the theorem may then be phrased as follows. If Q is a probability measure on the Borel subsets of E¹ and Q(B) = ∫_B f(x) dx for all intervals B = (a, b], then Q(B) = ∫_B f(x) dx for all Borel sets B.
If R is defined on Ω by R(ω) = ω (so that the outcome of the experiment is identified with the value of R), then
P{ω : R(ω) ∈ B} = P(B) = ∫_B f(x) dx
In particular, the distribution function of R is given by
F_R(x) = P{ω : R(ω) ≤ x} = P(−∞, x] = ∫_{−∞}^{x} f(t) dt
so that F_R is represented as an integral.
DEFINITION. The random variable R is said to be absolutely continuous iff there is a nonnegative function f = f_R defined on E¹ such that
F_R(x) = ∫_{−∞}^{x} f_R(t) dt for all real x   (2.3.1)
f_R is called the density function of R. We shall see in Section 2.5 that F_R(x) must approach 1 as x → ∞; hence ∫_{−∞}^{∞} f_R(x) dx = 1.
† "Integrable" will from now on mean "Riemann integrable."
[FIGURE 2.3.2 Distribution and Density Functions of a Uniformly Distributed Random Variable.]
Example 2. A number R is chosen at random between a and b; R is assumed to be uniformly distributed; that is, the probability that R will fall into an interval of length c depends only on c, not on the position of the interval within [a, b]. We take Ω = E¹, ℱ = Borel sets, R(ω) = ω, f(x) = f_R(x) = 1/(b − a) for a ≤ x ≤ b; f(x) = 0 for x > b or x < a. Define P(B) = ∫_B f(x) dx. In particular, if B is a subinterval of [a, b], then P(B) = (length of B)/(b − a). The density and distribution function of R are shown in Figure 2.3.2.
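A small numerical sketch of Example 2 (the endpoints a = 2, b = 7 and the grid size are our own illustrative choices):

```python
def uniform_density(a, b):
    """Density of a number chosen uniformly between a and b."""
    def f(x):
        return 1.0 / (b - a) if a <= x <= b else 0.0
    return f

def prob_interval(f, lo, hi, n=10000):
    # midpoint-rule approximation of the integral of the density over [lo, hi]
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

f = uniform_density(2.0, 7.0)
# P(B) = (length of B)/(b - a): here length 1.5 over b - a = 5
print(round(prob_interval(f, 3.0, 4.5), 3))   # 0.3
print(round(prob_interval(f, 2.0, 7.0), 3))   # 1.0
```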
NOTE. The values of F_R are probabilities, but the values of f_R are not; probabilities are found by integrating f_R:
F_R(x) = P{R ≤ x} = ∫_{−∞}^{x} f_R(t) dt
If R is absolutely continuous, then
P{a < R ≤ b} = ∫_a^b f_R(x) dx, a < b
For {R ≤ b} is the disjoint union of the events {R ≤ a} and {a < R ≤ b}; hence P{R ≤ b} = P{R ≤ a} + P{a < R ≤ b}. It follows that
P{a < R ≤ b} = F_R(b) − F_R(a) = ∫_{−∞}^{b} f_R(x) dx − ∫_{−∞}^{a} f_R(x) dx = ∫_a^b f_R(x) dx   (2.3.2)
Thus, if Q(B) = P{R ∈ B}, we have Q(B) = ∫_B f_R(x) dx when B is an interval (a, b]. By Theorem 1, Q(B) = ∫_B f_R(x) dx for all Borel sets B. Therefore, if R is absolutely continuous,
P{R ∈ B} = ∫_B f_R(x) dx for all Borel sets B
The basic point is that the density function f_R determines the probability of all events involving R.
If R is absolutely continuous, then
P{R = c} = P{c ≤ R ≤ c} = ∫_c^c f_R(x) dx = 0
The event {R = c} is in general not impossible; for example, if R is uniformly distributed between a and b, each event {R = x}, a ≤ x ≤ b, is possible; that is, the set {ω : R(ω) = x} is not empty. But the event {R = c} has probability 0. This does not contradict the axioms of probability. The definition of a probability measure requires that if the event A is impossible (i.e., A = ∅) then P(A) = 0; the converse need not be true. Intuitively, if R is uniformly distributed between a and b, it should be expected that all events {R = x}, a ≤ x ≤ b, will have the same probability. Any probability other than 0 will lead to a contradiction, since there are infinitely many points between a and b.
As a consequence of the fact that P{R = x} = 0 in the absolutely continuous case, we have
P{a ≤ R ≤ b} = P{a < R ≤ b} = P{a ≤ R < b} = P{a < R < b} = ∫_a^b f_R(x) dx = F_R(b) − F_R(a)   (2.3.3)
Notice also that although in the discrete case the probability function of R determines the probability of all events involving R, in the absolutely continuous case it gives no information at all, since p_R(x) = P{R = x} = 0 for
all x. However, the distribution function of R is still adequate, since F_R determines f_R. If f_R is continuous, it may be obtained from F_R by differentiation; that is,
(d/dx) ∫_{−∞}^{x} f_R(t) dt = f_R(x)
(the fundamental theorem of calculus). The general proof that F_R determines f_R is measure-theoretic, and we shall not pursue it here.
If f_R is continuous, we have just seen that F_R is differentiable, and its derivative is f_R. In general, if R is absolutely continuous, F_R(x) will be a continuous function of x, but again we shall not pursue this. We shall show in Section 2.5 that the distribution function of an arbitrary random variable must be nondecreasing [a ≤ b implies F_R(a) ≤ F_R(b)], must approach 1 as x → ∞, and must approach 0 as x → −∞.
Example 3. Let R be the time of emission of the first electron from the cathode of a vacuum tube. Under certain physical assumptions, it turns out that R has the following density function:
f_R(x) = λe^{−λx}, x > 0   (λ a constant)
f_R(x) = 0, x ≤ 0
(see Figure 2.3.3). If x ≤ 0,
F_R(x) = ∫_{−∞}^{x} f_R(t) dt = 0
and if x > 0,
F_R(x) = ∫_{−∞}^{x} f_R(t) dt = ∫_0^x λe^{−λt} dt = 1 − e^{−λx}

[FIGURE 2.3.3 Exponential Density and Distribution Functions.]
[FIGURE 2.3.4 Calculation of Probabilities.]
We calculate some probabilities of events involving R:
P{1 < R ≤ 2} = ∫_1^2 λe^{−λx} dx = e^{−λ} − e^{−2λ} = F_R(2) − F_R(1)
P{(R − 1)(R − 2) > 0} = P{R < 1 or R > 2} = P{R < 1} + P{R > 2}
= ∫_0^1 λe^{−λx} dx + ∫_2^∞ λe^{−λx} dx = 1 − e^{−λ} + e^{−2λ}
(see Figure 2.3.4).
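The two probabilities just computed can be checked for a sample value of λ (the constant is left unspecified in the text; λ = 1 below is an arbitrary choice):

```python
from math import exp

lam = 1.0   # lambda is arbitrary; the text leaves it as a constant

def F(x):
    """F_R(x) = 1 - e^(-lambda x) for x > 0, and 0 otherwise."""
    return 1.0 - exp(-lam * x) if x > 0 else 0.0

# P{1 < R <= 2} = F(2) - F(1) = e^(-lambda) - e^(-2 lambda)
p1 = F(2) - F(1)

# P{(R-1)(R-2) > 0} = P{R < 1} + P{R > 2} = 1 - e^(-lambda) + e^(-2 lambda)
p2 = F(1) + (1.0 - F(2))

# p1 + p2 = 1, since P{R = 1} = P{R = 2} = 0 in the absolutely continuous case
print(round(p1, 4), round(p2, 4))   # 0.2325 0.7675
```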
REMARK. You will often see the statement "Let R be an absolutely continuous random variable with density function f," with no reference made to the underlying probability space. However, we have seen that we can always supply an appropriate space, as follows. Take Ω = E¹, ℱ = Borel sets, P(B) = ∫_B f(x) dx for all B ∈ ℱ. If R(ω) = ω, ω ∈ Ω, then R is absolutely continuous and has density f. In a sense, it does not make any difference how we arrive at Ω and P; we may equally well use a different Ω and P and a different R, as long as R is absolutely continuous with density f. No matter what construction we use, we get the same essential result, namely,
P{R ∈ B} = ∫_B f(x) dx
Thus questions about probabilities of events involving R are answered completely by knowledge of the density f.
PROBLEMS
1. An absolutely continuous random variable R has a density function f(x) = (1/2)e^{−|x|}.
(a) Sketch the distribution function of R.
(b) Find the probability of each of the following events.
(1) {|R| ≤ 2}
(2) {|R| ≤ 2 or R > 0}
(3) {|R| ≤ 2 and R ≤ −1}
(4) {|R| + |R − 3| ≤ 3}
(5) {R³ − R² − R − 2 ≤ 0}
(6) {e^{sin πR} ≥ 1}
(7) {R is irrational} (= {ω : R(ω) is an irrational number})
2. Consider a sequence of five Bernoulli trials. Let R be the number of times that a head is followed immediately by a tail. For example, if ω = HHTHT then R(ω) = 2, since a head is followed directly by a tail at trials 2 and 3, and also at trials 4 and 5. Find the probability function of R.
2.4 FUNCTIONS OF A RANDOM VARIABLE
A general problem that arises in many branches of science is the following. Given a system of some sort to which an input is applied, knowledge of some of the characteristics of the system, together with knowledge of the input, will allow some estimate of the behavior at the output. We formulate a special case of this problem. Given a random variable R1 on a probability space, and a real-valued function g on the reals, we define a random variable R2 by R2 = g(R1); that is, R2(ω) = g(R1(ω)), ω ∈ Ω. R1 plays the role of the input, and g the role of the system; the output R2 is a random variable defined on the same space as R1. Given the function g and the distribution or density function of the random variable R1, the problem is to find the distribution or density function of R2.

NOTE. If R1 is a random variable and we set R2 = g(R1), the question arises as to whether R2 is in fact a random variable. The answer is yes if g is continuous or piecewise continuous; we shall consider this problem in greater detail in Section 2.7.

Example 1. Let R1 be absolutely continuous, with the density f1 given in Figure 2.4.1. Let R2 = R1²; that is, R2(ω) = R1²(ω), ω ∈ Ω. Find the distribution or density function of R2. We shall indicate two approaches to the problem.
[FIGURE 2.4.1 The density f1 (from the integrals below: f1(x) = 1/2 for −1 < x ≤ 0, f1(x) = (1/2)e^{−x} for x > 0).]
[FIGURE 2.4.2]
DISTRIBUTION FUNCTION METHOD. In this method the distribution function F2 of R2 is found directly, by expressing the event {R2 ≤ y} in terms of the random variable R1. First, since R2 ≥ 0, we have F2(y) = P{R2 ≤ y} = 0 for y < 0. If y ≥ 0, then R2 ≤ y iff −√y ≤ R1 ≤ √y (see Figure 2.4.2). Thus, if y ≥ 0,
F2(y) = P{R2 ≤ y} = P{−√y ≤ R1 ≤ √y} = ∫_{−√y}^{√y} f1(x) dx
In particular, if 0 ≤ y ≤ 1, then
F2(y) = ∫_{−√y}^{√y} f1(x) dx = ∫_{−√y}^{0} (1/2) dx + ∫_0^{√y} (1/2)e^{−x} dx = (1/2)√y + (1/2)(1 − e^{−√y})
(see Figure 2.4.3). If y > 1,
F2(y) = ∫_{−√y}^{√y} f1(x) dx = ∫_{−1}^{0} (1/2) dx + ∫_0^{√y} (1/2)e^{−x} dx = 1/2 + (1/2)(1 − e^{−√y})
(see Figure 2.4.4). A sketch of F2 is given in Figure 2.4.5.
[FIGURE 2.4.3]
[FIGURE 2.4.4]
We would like to conclude, by inspection of the distribution function F2, that the random variable R2 is absolutely continuous. We should be able to find the density f2 of R2 by differentiating F2:
f2(y) = (1/(4√y))(1 + e^{−√y}), 0 < y < 1
f2(y) = (1/(4√y)) e^{−√y}, y > 1
(see Figure 2.4.6). It may be verified that F2(y) is given by ∫_{−∞}^{y} f2(t) dt, so that f2 is in fact the density of R2. Thus, in this case, if we differentiate F2 and then integrate the derivative, we get back to F2. It is reasonable to expect that a random variable R, whose distribution function is continuous everywhere and defined by an explicit formula or collection of formulas, will be absolutely continuous. The following result will cover almost all situations encountered in practice.

[FIGURE 2.4.5]
[FIGURE 2.4.6]
Let R be a random variable with distribution function F. Suppose that
(1) F is continuous for all x
(2) F is differentiable everywhere except possibly at a finite number of points
(3) The derivative f(x) = F′(x) is continuous except possibly at a finite number of points   (2.4.1)
Then R is an absolutely continuous random variable with density function f. (The proof involves the application of the fundamental theorem of calculus; see Problem 6.)

NOTE.
Density functions need not be continuous or bounded. Also, in this case there is an ambiguity in the values of f2(y) at y = 0 and y = 1, since F2 is not differentiable at these points. However, any values may be assigned, since changing a function at a single point, or a finite or countably infinite number of points, or in fact on a set of total length (Lebesgue measure) zero, does not change the integral.
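As a sanity check on Example 1, one can integrate the density obtained by differentiation and compare with the increments of F2 directly; a sketch (the integrator, its grid size, and the cutoff 0.01 used to avoid the endpoint y = 0 are arbitrary choices):

```python
from math import sqrt, exp

def F2(y):
    """Distribution function of R2 = R1^2 derived above."""
    if y < 0:
        return 0.0
    if y <= 1:
        return 0.5 * sqrt(y) + 0.5 * (1.0 - exp(-sqrt(y)))
    return 0.5 + 0.5 * (1.0 - exp(-sqrt(y)))

def f2(y):
    """Density obtained by differentiating F2 (any value may be used at y = 0, 1)."""
    if y <= 0:
        return 0.0
    if y < 1:
        return (1.0 + exp(-sqrt(y))) / (4.0 * sqrt(y))
    return exp(-sqrt(y)) / (4.0 * sqrt(y))

def integral(f, lo, hi, n=20000):
    # midpoint-rule approximation of the integral of f over [lo, hi]
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

# integrating the derivative recovers the increments of F2
for y in (0.5, 2.0):
    print(round(F2(y) - F2(0.01), 4), round(integral(f2, 0.01, y), 4))
    # the two columns agree
```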
DENSITY FUNCTION METHOD. In this approach we develop an explicit formula for the density of R2 in terms of that of R1. We first give an informal description. The probability that R2 will lie in the small interval [y, y + dy] is
∫_y^{y+dy} f2(t) dt
which is roughly f2(y) dy if f2 is well-behaved near y. But if we set h1(y) = √y, h2(y) = −√y, y > 0, then (see Figure 2.4.7)
P{y < R2 ≤ y + dy} = P{h1(y) < R1 ≤ h1(y + dy)} + P{h2(y + dy) ≤ R1 < h2(y)}
[FIGURE 2.4.7]
Hence
f2(y) dy ≈ f1(h1(y))[h1(y + dy) − h1(y)] + f1(h2(y))[h2(y) − h2(y + dy)]
Let dy → 0 to obtain
f2(y) = f1(h1(y)) h1′(y) + f1(h2(y))(−h2′(y)) = f1(h1(y)) |h1′(y)| + f1(h2(y)) |h2′(y)|
In this case
f2(y) = f1(√y) (d/dy)√y + f1(−√y) |(d/dy)(−√y)| = (1/(2√y))[f1(√y) + f1(−√y)]
Now (see Figure 2.4.1), if 0 < y < 1,
f2(y) = (1/(2√y))[(1/2)e^{−√y} + 1/2] = (1/(4√y))(1 + e^{−√y})
If y > 1,
f2(y) = (1/(2√y))[(1/2)e^{−√y} + 0] = (1/(4√y)) e^{−√y}
as before. Similar reasoning shows that in general
f2(y) = Σ_j f1(h_j(y)) |h_j′(y)|
where h1(y), ..., hn(y) are the values of R1 corresponding to R2 = y.
Here is the formal statement. Suppose that the domain of g can be written as the union of intervals I1, I2, ..., In. Assume that over the interval Ij, g is strictly increasing or strictly decreasing and is differentiable (except possibly at the end points), with hj = the inverse of g over Ij. Let F1 satisfy the three conditions (2.4.1). Then
f2(y) = Σ_{j=1}^{n} f1(hj(y)) |hj′(y)|
where f1(hj(y))|hj′(y)| is interpreted as 0 if y ∉ the domain of hj. For the proof, see Problem 7.

REMARK. If we have R1 = h(R2), where h is a one-to-one differentiable function, and R2 = g(R1), where g is the inverse of h, then
h′(y) = 1/g′(x), evaluated at x = h(y)
Thus we may write
f2(y) = Σ_{j=1}^{n} f1(hj(y)) / |g′(hj(y))|
where R2 = g(R1). In the present example we have g(x) = x², h1(y) = √y, h2(y) = −√y, and g′(x) = 2x, so this formula again gives f2(y) = (1/(2√y))[f1(√y) + f1(−√y)].
Example 2. Let R1 be uniformly distributed between 0 and 2π; that is,
f1(x) = 1/(2π), 0 < x < 2π
f1(x) = 0 elsewhere
Let R2 = sin R1 (see Figure 2.4.8).
[FIGURE 2.4.8]
DISTRIBUTION FUNCTION METHOD. If 0 ≤ y < 1, the event {R2 ≤ y} = {sin R1 ≤ y} occurs iff R1 ≤ sin⁻¹ y or R1 ≥ π − sin⁻¹ y, so
F2(y) = (1/(2π))[sin⁻¹ y + 2π − (π − sin⁻¹ y)] = 1/2 + (1/π) sin⁻¹ y
where the branch of the arc sine function is chosen so that −π/2 ≤ sin⁻¹ y ≤ π/2. If −1 ≤ y < 0,
F2(y) = P{π − sin⁻¹ y ≤ R1 ≤ 2π + sin⁻¹ y} = 1/2 + (1/π) sin⁻¹ y
as above, and thus
F2(y) = 1/2 + (1/π) sin⁻¹ y, −1 ≤ y < 1
(see Figure 2.4.9).
[FIGURE 2.4.9]
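The density-function method of the preceding pages applies here too: for −1 < y < 1 the equation sin x = y has two solutions in (0, 2π), each with |h_j′(y)| = 1/√(1 − y²), so the general formula gives f2(y) = 1/(π√(1 − y²)), which is exactly the derivative of F2 above. A short sketch (the branch bookkeeping is ours):

```python
from math import asin, pi, sqrt

def f1(x):
    # uniform density on (0, 2*pi)
    return 1.0 / (2.0 * pi) if 0.0 < x < 2.0 * pi else 0.0

def f2(y):
    """Density of R2 = sin R1 via f2(y) = sum_j f1(h_j(y)) |h_j'(y)|."""
    if not -1.0 < y < 1.0:
        return 0.0
    dh = 1.0 / sqrt(1.0 - y * y)          # |h_j'(y)| is the same for each branch
    if y > 0:
        branches = (asin(y), pi - asin(y))            # solutions in (0, 2*pi)
    else:
        branches = (pi - asin(y), 2.0 * pi + asin(y))
    return sum(f1(h) * dh for h in branches)

for y in (-0.9, -0.3, 0.3, 0.9):
    print(round(f2(y), 6), round(1.0 / (pi * sqrt(1.0 - y * y)), 6))
    # the two columns agree: f2 is the derivative of 1/2 + (arcsin y)/pi
```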
PROBLEMS
1. Let R1 be absolutely continuous with density
f1(x) = e^{−x}, x > 0; f1(x) = 0, x < 0
Define
R2 = R1 if R1 ≤ 1
R2 = 1/R1 if R1 > 1
Show that R2 is absolutely continuous and find its density.
2. An absolutely continuous random variable R1 is uniformly distributed between −1 and +1. Find and sketch either the density or the distribution function of the random variable R2, where R2 = e^{−R1}.
3. Let R1 have density f1(x) = 1/x², x > 1; f1(x) = 0, x < 1. Define R2 = 2R1 − R1². Find the density of R2.
4. Let R1 be as in Problem 3, and define
R2 = 2R1 for R1 ≤ 2
R2 = 5 for R1 > 2
Find and sketch the distribution function of R2; is R2 absolutely continuous?
5. (a) Let R1 have distribution function
F1(x) = 1 − e^{−x}, x ≥ 0
F1(x) = 0, x < 0
Define
R2 = 1 − e^{−R1}, R1 ≥ 0
R2 = 0, R1 < 0
Show that R2 is uniformly distributed between 0 and 1.
(b) In general, if a random variable R1 has a continuous distribution function g(x) = F1(x) and we define a random variable R2 by R2 = g(R1), show that R2 is uniformly distributed between 0 and 1.
6. If R is a random variable with distribution function F, where F is continuous everywhere and has a continuous derivative f at all but a finite number of points, show that R is absolutely continuous with density f.
7. Establish the validity of the formula
f2(y) = Σ_{j=1}^{n} f1(hj(y)) |hj′(y)|
under the conditions given in the text.
8. Let R1 be chosen at random between 0 and 1, with density f1 [so that ∫_0^1 f1(y) dy = 1]. Let R2 be the second digit in the decimal expansion of R1. (To avoid ambiguity, write, for example, .3 as .3000···, not .2999···.)
(a) Show that R2 = k iff i + 10⁻¹k ≤ 10R1 < i + 10⁻¹(k + 1) for some i = 0, 1, ..., 9. Hence
p_{R2}(k) = Σ_{i=0}^{9} ∫_{(i + 10⁻¹k)/10}^{(i + 10⁻¹(k+1))/10} f1(y) dy, k = 0, 1, ..., 9
(b) If R is uniformly distributed between 0 and 1, and R1 = √R, find the probability function of R2 = the second digit in the decimal expansion of R1.
9. A projectile is fired with initial velocity v0 at an angle θ uniformly distributed between 0 and π/2 (see Figure P.2.4.9). If R is the distance from the launch site to the point at which the projectile returns to earth, find the density of R (consider only the effect of gravity).

[FIGURE P.2.4.9]
2.5 PROPERTIES OF DISTRIBUTION FUNCTIONS
We shall establish some general properties of the distribution function of an arbitrary random variable. We need two facts about probability measures.
Theorem 1. Let (Ω, ℱ, P) be a probability space.
(a) If A1, A2, ... is an expanding sequence of sets in ℱ, that is, A_n ⊂ A_{n+1} for all n, and A = ∪_{n=1}^{∞} A_n, then P(A) = lim_{n→∞} P(A_n).
(b) If A1, A2, ... is a contracting sequence of sets in ℱ, that is, A_{n+1} ⊂ A_n for all n, and A = ∩_{n=1}^{∞} A_n, then P(A) = lim_{n→∞} P(A_n).

[FIGURE 2.5.1 Expanding Sequence.]
PROOF.
(a) We can write
A = A1 ∪ (A2 − A1) ∪ (A3 − A2) ∪ ··· ∪ (A_n − A_{n−1}) ∪ ···
(see Figure 2.5.1; note this is the expansion (1.3.11) in the special case of an expanding sequence). Since this is a disjoint union,
P(A) = P(A1) + P(A2 − A1) + P(A3 − A2) + ···
= P(A1) + P(A2) − P(A1) + P(A3) − P(A2) + ···   (since A_n ⊂ A_{n+1})
= lim_{n→∞} P(A_n)
(b) If A = ∩_{n=1}^{∞} A_n, then, by the DeMorgan laws, A^c = ∪_{n=1}^{∞} A_n^c. Now A_{n+1} ⊂ A_n; hence A_n^c ⊂ A_{n+1}^c. Thus the sets A_n^c form an expanding sequence, so, by (a), P(A_n^c) → P(A^c); that is, 1 − P(A_n) → 1 − P(A). The result follows.
Let F be the distribution function of an arbitrary random variable R. Then Theorem 2. 1.
F(x) is nondecreasing; that is, a < b implies F(a) < F(b)
For we have shown [see (2 .3. 2)] that F(b) - F(a)
2. Let xn , n
lim F( x)
x� oo
=
=
P{a < R < b} > 0.
1
1 , 2 , . . . be a sequence of rea! numbers increasing to + oo . Let A n = {R < xn } . Then the An form an expanding sequence. (Since xn < Xn+l ' R < Xn implies R < Xn+1 · ) Now u : 1 An = n, since, given any point =
68
RANDOM VARIABLES
F(x)
y-
F(x
---+----�--� X
FIGURE 2.5.2
Right Continuity of Distribution Functions.
w E n, R(w ) is a real number ; hence, for sufficiently large n, R (w) < xn, so that w E An. Thus P(A n) � P(Q) = 1 , that is, limn -4 oo F(xn) = 1 . lim F(x) = 0 -4 X - 00 Let xn, n = 1 , 2, . . . be a sequence of real numbers decreasing to oo . Let A n = {R < xn } . Then the A n form a contracting sequence. {Since xn+ 1 < x n , R < Xn+1 implies R < xn . ) Now n : 1 A n = 0' since if w is any point of 0, R ( w ) cannot always be < xn because xn � - oo . Thus P(A n ) � P (0) = 0 ; that is , F(xn ) � 0. 3.
-
4. F is continuous from the right; that is, lim� -4 xo+ F(x) = F(x0) Hence F assumes the upper value at any discontinuity ; see Figure 2. 5.2. Let xn approach x0 from above ; that is, let xn , n = 1 , 2, . . . be a (strictly) decreasing sequence whose limit is x0• As before, let A n = {R < xn } . The A n form a contracting sequence whose limit (intersection) is A = {R < x0}. In order to show that n � 1 A n = {R < Xo }, we reason as follows. If R(w ) < xn for all n, then, since x n � x0, R (w) < x0• Conversely, if R ( w ) < x0, then, since x0 < xn for all n, R(w ) < X71 for all n. Thus P(A n ) � P(A) ; that is , F(xn) � F(x0). 5. lim_ F(x) = P{R < x0} [We write F(x0) for limx -4 xo- F(x).] Let xn , n = 1 , 2, . . . be a (strictly) increasing sequence whose limit is x0• Again let A n = {R < xn }. The An form an expanding sequence whose union is {R < Xo}. To show u � 1 A n = {R < Xo} , we reason as follows. If w E some A n , then R(w) < xn , so that R(w ) < x0• Conversely, if R (w) < x0, then, since xn � Xo, eventually R(w) < xn , so that w E u � 1 A n . Thus P(A n) � P{R < x0}, and the result follows. 6. P{R = x0} = F(x0) - F(x0)
Thus F is continuous at x0 iff P{R = x0} = 0, and if F is discontinuous at x0, the magnitude of the jump is the probability that R = x0. For P{R ≤ x0} = P{R < x0} + P{R = x0}, so that F(x0) = F(x0−) + P{R = x0}.

REMARK. The random variable R is said to be continuous iff its distribution function F_R(x) is a continuous function of x for all x. In any reasonable case a continuous random variable will have a density (that is, it will be absolutely continuous), but it is possible to establish the existence of random variables that are continuous but not absolutely continuous.
7. Let F be a function from the reals to the reals, satisfying properties 1, 2, 3, and 4 above. Then F is the distribution function of some random variable.
This is a somewhat vague statement. Let us try to clarify it, even though we omit the proof. What we are doing essentially is making the statement "Let R be a random variable with distribution function F." It is up to us to supply the underlying probability space. As we have done before, we take Ω = E¹, ℱ = Borel sets, R(ω) = ω. Now if F is to be the distribution function of R, we must have, for a < b,
P(a, b] = P{a < R ≤ b} = F(b) − F(a)   by (2.3.2)
It turns out that if F satisfies conditions 1-4, there is a unique probability measure P defined on the Borel subsets of E¹ such that P(a, b] = F(b) − F(a) for all real a, b, a < b; thus the probabilities of all events involving R are determined by F. If we let a → −∞, we obtain P(−∞, b] = F(b), that is, P{R ≤ b} = F(b), so that in fact F is the distribution function of R.
In the special case in which F(x) = ∫_{−∞}^{x} f(t) dt, where f is a nonnegative integrable function and ∫_{−∞}^{∞} f(x) dx = 1, P(a, b] = F(b) − F(a) = ∫_a^b f(x) dx. This is exactly the situation we considered in Theorem 1 of Section 2.3.
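Property 7 guarantees that such an R exists; a standard way to realize one concretely (different from the Ω = E¹ construction in the text) is the inverse-transform method, sketched here for F(x) = 1 − e^(−x), which satisfies conditions 1-4:

```python
import random
from math import exp, log

random.seed(1)

def F(x):
    # a distribution function satisfying properties 1-4
    return 1.0 - exp(-x) if x > 0 else 0.0

def F_inverse(u):
    # F^{-1}(u) = -log(1 - u) for 0 <= u < 1
    return -log(1.0 - u)

# R = F^{-1}(U), with U uniform on [0, 1), has distribution function F
samples = [F_inverse(random.random()) for _ in range(200000)]

for x in (0.5, 1.0, 2.0):
    empirical = sum(s <= x for s in samples) / len(samples)
    print(round(empirical, 3), round(F(x), 3))   # columns agree to ~2 decimals
```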
PROBLEMS
1. Let R be a random variable with the distribution function shown in Figure P.2.5.1; notice that R is neither discrete nor continuous. Find the probability of the following events.

[FIGURE P.2.5.1]

(a) {R = 2}
(b) {R < 2}
(c) {R = 2 or .5 < R < 1.5}
(d) {R = 2 or .5 ≤ R < 3}
2. Let R be an arbitrary random variable with distribution function F. We have seen that P{a < R ≤ b} = F(b) − F(a), a < b. Show that
P{a ≤ R ≤ b} = F(b) − F(a−)
P{a ≤ R < b} = F(b−) − F(a−)
P{a < R < b} = F(b−) − F(a)
(Of course these are all equal if F is continuous at a and b.)
2.6 JOINT DENSITY FUNCTIONS
We are going to investigate situations in which we deal simultaneously with several random variables defined on the same sample space. As an introductory example, suppose that a person is selected at random from a certain population, and his age and weight recorded. We may take as the sample space the set of all pairs (x, y) of real numbers, that is, the Euclidean plane E², where we interpret x as the age and y as the weight. Let R1 be the age of the person selected, and R2 the weight; that is, R1(x, y) = x, R2(x, y) = y. We wish to assign probabilities to events that involve R1 and R2 simultaneously. A cross-section of the available data might appear as shown in Figure 2.6.1. Thus there are 4 million people whose age is between 20 and 25 and (simultaneously) whose weight is between 150 and 160 pounds, and so on. Now suppose that we wish to estimate the number of people between 22 and 23 years, and 154 and 156 pounds. There are 4 million people spread over 5 years and 10 pounds, or 4/50 million per year-pound. We are interested in
[FIGURE 2.6.1 Age-Weight Data (Number of People Is in Millions).]
[FIGURE 2.6.2 Estimation of Probabilities.]
a range of 1 year and 2 pounds, and so our estimate is (4/50) × 1 × 2 = 8/50 million (see Figure 2.6.2). If the total population is 200 million, then
P{22 < R1 ≤ 23, 154 ≤ R2 ≤ 156}
should be approximately
(8/50)/200 = .0008

NOTATION. {22 < R1 ≤ 23, 154 ≤ R2 ≤ 156} means {22 < R1 ≤ 23 and 154 ≤ R2 ≤ 156}.
What we are doing is multiplying an age-weight density 4/50 by an area 1 × 2 to estimate the number of people or, equally well, a probability density 4/[50(200)] by an area (1 × 2) to estimate the probability. Thus it appears that we should assign probabilities by means of an integral over an area. Let us try to construct an appropriate probability space. We take Ω = E², ℱ = the Borel subsets of E². Suppose we have a nonnegative real-valued function f on E², with
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1
Theorem 1 of Section 2.3 holds just as well in the two-dimensional case; there is a unique probability measure P on ℱ such that P(B) = ∫∫_B f(x, y) dx dy for all rectangles B. If we define R1(x, y) = x, R2(x, y) = y, then
P{(R1, R2) ∈ B} = P(B) = ∫∫_B f(x, y) dx dy
For example,
P{a < R1 ≤ b, c < R2 ≤ d} = ∫_a^b ∫_c^d f(x, y) dy dx
The joint distribution function of two arbitrary random variables R1 and R2 is defined by
F12(x, y) = P{R1 ≤ x, R2 ≤ y}
In the present case we have
F12(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du
In general, if R1 and R2 are arbitrary random variables defined on a given probability space, the pair (R1, R2) is said to be absolutely continuous iff there is a nonnegative function f = f12 defined on E² such that
F12(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f12(u, v) dv du for all real x, y   (2.6.1)
f12 is called the density of (R1, R2) or the joint density of R1 and R2. Just as in the one-dimensional case, if (R1, R2) is absolutely continuous, it follows that
P{(R1, R2) ∈ B} = ∫∫_B f12(x, y) dx dy
for all two-dimensional Borel sets B (see Problem 1). Again, as in the one-dimensional case, if f is a nonnegative function on E² with
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1
we can always find random variables R1, R2 such that (R1, R2) is absolutely continuous with density f. We take Ω = E², ℱ = Borel sets, R1(x, y) = x, R2(x, y) = y, P(B) = ∫∫_B f(x, y) dx dy. Even if we use a completely different construction, we get the same result, namely,
P{(R1, R2) ∈ B} = ∫∫_B f(x, y) dx dy
We have a similar situation in n dimensions. If the n random variables R1, R2, ..., Rn are all defined on the same probability space, the joint distribution function of R1, R2, ..., Rn is defined by
F12...n(x1, ..., xn) = P{R1 ≤ x1, ..., Rn ≤ xn}
The random vector or n-tuple (R1, ..., Rn) is said to be absolutely continuous iff there is a nonnegative function f12...n defined on Eⁿ, called the
density of (R1, ..., Rn) or the joint density of R1, ..., Rn, such that
F12...n(x1, ..., xn) = ∫_{−∞}^{x1} ··· ∫_{−∞}^{xn} f12...n(u1, ..., un) du1 ··· dun   (2.6.2)
for all real x1, ..., xn. Notice that f12...n can be recovered from F12...n by differentiation:
∂ⁿF12...n(x1, ..., xn) / ∂x1 ··· ∂xn = f12...n(x1, ..., xn)
at least at points where f12...n is continuous. If (R1, ..., Rn) is absolutely continuous, then
P{(R1, ..., Rn) ∈ B} = ∫ ··· ∫_B f12...n(x1, ..., xn) dx1 ··· dxn
for all n-dimensional Borel sets B. If f is a nonnegative function on Eⁿ such that
∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x1, ..., xn) dx1 ··· dxn = 1
we can always find random variables R1, ..., Rn such that (R1, ..., Rn) is absolutely continuous with density f. We take Ω = Eⁿ, ℱ = Borel sets, and define R1(x1, ..., xn) = x1, ..., Rn(x1, ..., xn) = xn. If B is any Borel subset of Eⁿ, we assign
P(B) = ∫ ··· ∫_B f(x1, ..., xn) dx1 ··· dxn
Then (R1, ..., Rn) is absolutely continuous with density f.
Example 1. Let
f12(x, y) = 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1
f12(x, y) = 0 elsewhere
(This is the uniform density on the unit square.) We may as well take Ω = E², ℱ = Borel sets, R1(x, y) = x, R2(x, y) = y, P(B) = ∫∫_B f12(x, y) dx dy. Let us calculate the probability that 1/2 ≤ R1 + R2 ≤ 3/2. Now
{1/2 ≤ R1 + R2 ≤ 3/2} = {(x, y) : 1/2 ≤ R1(x, y) + R2(x, y) ≤ 3/2} = {(x, y) : 1/2 ≤ x + y ≤ 3/2}
[FIGURE 2.6.3 Calculation of P{1/2 ≤ R1 + R2 ≤ 3/2}.]
[FIGURE 2.6.4 Calculation of P{R1 > R2 > 2}.]
Thus (see Figure 2.6.3)
P{1/2 ≤ R1 + R2 ≤ 3/2} = ∫∫_{1/2 ≤ x+y ≤ 3/2} f12(x, y) dx dy = ∫∫_{shaded area} 1 dx dy = shaded area = 1 − 2(1/8) = 3/4
If we want the probability that 1/2 ≤ R1 ≤ 3/4 and 0 ≤ R2 ≤ 1/2, we obtain
P{1/2 ≤ R1 ≤ 3/4, 0 ≤ R2 ≤ 1/2} = ∫_{x=1/2}^{3/4} ∫_{y=0}^{1/2} 1 dy dx = (1/4)(1/2) = 1/8

Example 2. Let
f12(x, y) = e^{−(x+y)}, x, y ≥ 0
f12(x, y) = 0 elsewhere
Let us calculate the probability that R1 > R2 > 2. We have (see Figure 2.6.4)
P{R1 > R2 > 2} = ∫∫_{x > y > 2} f12(x, y) dx dy
To summarize:
P{(R1, R2) ∈ B} = ∫∫_{(x,y) ∈ B} f12(x, y) dx dy
The probability of any event is found by integrating the density function over the set defined by the event. This is perhaps about as close as one can come to a one-sentence summary of the role of density functions in probability theory.
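Both examples can be checked by brute-force numerical integration of the density over the event; a crude sketch (the grid sizes and the truncation of the infinite region at 20 are arbitrary choices):

```python
from math import exp

def integrate2d(f, x0, x1, y0, y1, n):
    """Midpoint-rule approximation of the double integral of f over a rectangle."""
    hx, hy = (x1 - x0) / n, (y1 - y0) / n
    total = 0.0
    for i in range(n):
        x = x0 + (i + 0.5) * hx
        for j in range(n):
            total += f(x, y0 + (j + 0.5) * hy)
    return total * hx * hy

# Example 1: uniform density on the unit square, event {1/2 <= R1 + R2 <= 3/2}
def f_ex1(x, y):
    return 1.0 if 0.5 <= x + y <= 1.5 else 0.0

p1 = integrate2d(f_ex1, 0.0, 1.0, 0.0, 1.0, n=400)
print(round(p1, 2))        # close to 3/4

# Example 2: f12(x, y) = e^{-(x+y)} for x, y >= 0, event {R1 > R2 > 2}
def f_ex2(x, y):
    return exp(-(x + y)) if x > y > 2.0 else 0.0

p2 = integrate2d(f_ex2, 0.0, 20.0, 0.0, 20.0, n=800)
print(round(p2, 4))        # close to e^{-4}/2, about 0.0092
```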
PROBLEMS
1. Let F12 be the joint distribution function of R1 and R2, where (R1, R2) is absolutely continuous with density f12. Show that
P{a1 < R1 ≤ b1, a2 < R2 ≤ b2} = ∫_{a1}^{b1} ∫_{a2}^{b2} f12(x, y) dy dx
The uniqueness part of Theorem 2.3.1 (generalized to two dimensions) shows that
P{(R1, R2) ∈ B} = ∫∫_B f12(x, y) dx dy
for all two-dimensional Borel sets B. HINT: If F is the joint distribution function of the random variables R1 and R2, show that
P{a1 < R1 ≤ b1, a2 < R2 ≤ b2} = F(b1, b2) − F(a1, b2) − F(b1, a2) + F(a1, a2)
2. If F is the joint distribution function of the random variables R1, R2, and R3, express P{a1 < R1 ≤ b1, a2 < R2 ≤ b2, a3 < R3 ≤ b3} in terms of F. Can you see a general pattern that will extend this result to n dimensions?
3. If
F(x, y) = 1 for x + y ≥ 0
F(x, y) = 0 for x + y < 0

[FIGURE P.2.6.3]

(see Figure P.2.6.3), show that F cannot possibly be the joint distribution
76
4.
RANDOM VARIABLES
function of a pair of random variables (see Problem 1 . ) Let R1 and R2 have the following joint density : x
/12 ( , y) = l
if - 1 � x elsewhere
�
1
and
-1
�
y�1
=0 (This corresponds to R1 and R2 being chosen independently , each uniformly distributed between - 1 and + 1 ; we elaborate on this in the next section.) Find the probability of each of the following events. I
(a) {R1 + R2 � �} (b) {R1 - R2 � �} (c) {R1R2 � l }
{�: (e) { �
(d)
�
�} �
�}
(f) { I R1 l + I R2I < 1 } (g) { I R2 I < eRt } 2.7
2.7 RELATIONSHIP BETWEEN JOINT AND INDIVIDUAL DENSITIES; INDEPENDENCE OF RANDOM VARIABLES
If R1 and R2 are two random variables defined on the same probability space, we wish to investigate the relation between the characterization of the random variables individually and their characterization simultaneously. We shall consider two problems.

1. If (R1, R2) is absolutely continuous, are R1 and R2 absolutely continuous, and, if so, how can the individual densities of R1 and R2 be found in terms of the joint density?
2. Given R1, R2 (individually) absolutely continuous, is (R1, R2) absolutely continuous, and, if so, can the joint density be derived from the individual densities?

Problem 1

To go from simultaneous information to individual information is essentially a matter of adding across a row or column. For example, suppose that a group of 14 people has the age-weight distribution shown in Figure 2.7.1. The number of people between 20 and 25 years is found by adding the numbers in the first column; thus 4 + 2 = 6. Let us develop this idea a bit further. If R1 and R2 are discrete, the joint probability function of R1 and R2 [or the probability function of the pair
FIGURE 2.7.1 Calculation of Individual Probabilities from Joint Probabilities.
(R1, R2)] is defined by

p12(x, y) = P{R1 = x, R2 = y},  x, y real   (2.7.1)

If the possible values of R2 are y1, y2, ..., then

{R1 = x} = {R1 = x, R2 = y1} ∪ {R1 = x, R2 = y2} ∪ ···

since the events {R2 = yn}, n = 1, 2, ..., are mutually exclusive and exhaustive. Thus the probability function of R1 is given by

p1(x) = P{R1 = x} = Σ_y p12(x, y)   (2.7.2)

Similarly,

p2(y) = P{R2 = y} = Σ_x p12(x, y)   (2.7.3)

There are analogous formulas in higher dimensions; for example,

p12(x, y) = Σ_z p123(x, y, z),  p2(y) = Σ_{x,z} p123(x, y, z)

where p123(x, y, z) = P{R1 = x, R2 = y, R3 = z}.

Now let us return to the absolutely continuous case. If (R1, R2) is absolutely continuous with joint density f12, we shall show that R1 is absolutely continuous (and so is R2) and find f1 and f2 in terms of f12. For any x0 we have, intuitively,

P{x0 < R1 < x0 + dx0} = f1(x0) dx0   (2.7.4)

But

P{x0 < R1 < x0 + dx0} = P{x0 < R1 < x0 + dx0, −∞ < R2 < ∞} = ∫_{x0}^{x0+dx0} [∫_{−∞}^{∞} f12(x, y) dy] dx

(see Figure 2.7.2). If f12 is well-behaved, this is approximately

dx0 ∫_{−∞}^{∞} f12(x0, y) dy   (2.7.5)
FIGURE 2.7.2 Calculation of Individual Densities from Joint Densities.
From (2.7.4) and (2.7.5) (replacing x0 by x) we have

f1(x) = ∫_{−∞}^{∞} f12(x, y) dy

To verify this formally, we work with the distribution function of R1:

F1(x0) = P{R1 ≤ x0} = P{R1 ≤ x0, −∞ < R2 < ∞} = ∫_{−∞}^{x0} [∫_{−∞}^{∞} f12(x, y) dy] dx

Thus F1 is represented as an integral, and so R1 is absolutely continuous with density

f1(x) = ∫_{−∞}^{∞} f12(x, y) dy   (2.7.6)

Similarly,

f2(y) = ∫_{−∞}^{∞} f12(x, y) dx   (2.7.7)
In exactly the same way we may establish similar formulas in higher dimensions; for example,

f12(x, y) = ∫_{−∞}^{∞} f123(x, y, z) dz   (2.7.8)

f2(y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f123(x, y, z) dx dz   (2.7.9)

The process of obtaining the individual densities from the joint density is sometimes called the calculation of marginal densities, because of the similarity to the process of adding across a row or column.
FIGURE 2.7.3
Example 1. Let

f12(x, y) = 8xy,  0 ≤ y ≤ x ≤ 1
          = 0 elsewhere

(see Figure 2.7.3). The marginal densities are

f1(x) = ∫_{−∞}^{∞} f12(x, y) dy,  f2(y) = ∫_{−∞}^{∞} f12(x, y) dx

If 0 ≤ x ≤ 1,

f1(x) = ∫_{0}^{x} 8xy dy = 4x³   (Figure 2.7.4a)

and f1(x) = 0 if x < 0 or x > 1. If 0 ≤ y ≤ 1,

f2(y) = ∫_{y}^{1} 8xy dx = 4y(1 − y²)   (Figure 2.7.4b)

and f2(y) = 0 if y < 0 or y > 1. Sketches of f1 and f2 are given in Figure 2.7.5.

FIGURE 2.7.4
FIGURE 2.7.5
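The two marginal computations of Example 1 can be verified with a crude numerical integration; the test points 0.7 and 0.3 below are arbitrary choices.

```python
def f12(x, y):
    """Joint density of Example 1: 8xy on the triangle 0 < y < x < 1."""
    return 8.0 * x * y if 0.0 < y < x < 1.0 else 0.0

def integrate(g, a, b, n=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

f1_at_07 = integrate(lambda y: f12(0.7, y), 0.0, 1.0)  # should be close to 4 * 0.7**3
f2_at_03 = integrate(lambda x: f12(x, 0.3), 0.0, 1.0)  # should be close to 4 * 0.3 * (1 - 0.3**2)
```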
Problem 2

The second problem posed at the beginning of this section has a negative answer; that is, if R1 and R2 are each absolutely continuous, then (R1, R2) is not necessarily absolutely continuous. Furthermore, even if (R1, R2) is absolutely continuous, f1(x) and f2(y) do not determine f12(x, y). We give examples later in the section. However, there is an affirmative answer when the random variables are independent.

We have considered the notion of independence of events, and this can be used to define independence of random variables. Intuitively, the random variables R1, ..., Rn are independent if knowledge about some of the Ri does not change the odds about the other Ri's. In other words, if Ai is an event involving Ri alone, that is, if Ai = {Ri ∈ Bi}, then the events A1, ..., An should be independent. Formally, we define independence as follows.

DEFINITION. Let R1, ..., Rn be random variables on (Ω, F, P). R1, ..., Rn are said to be independent iff for all Borel subsets B1, ..., Bn of E¹ we have

P{R1 ∈ B1, ..., Rn ∈ Bn} = P{R1 ∈ B1} ··· P{Rn ∈ Bn}

REMARK. If R1, ..., Rn are independent, so are R1, ..., Rk for k < n. For

P{R1 ∈ B1, ..., Rk ∈ Bk} = P{R1 ∈ B1, ..., Rk ∈ Bk, −∞ < Rk+1 < ∞, ..., −∞ < Rn < ∞} = P{R1 ∈ B1} ··· P{Rk ∈ Bk}

since P{−∞ < Ri < ∞} = 1.

If (Ri, i ∈ I), I an arbitrary index set, is a family of random variables on the space (Ω, F, P), the Ri are said to be independent iff for each finite set of distinct indices i1, ..., ik ∈ I, the random variables Ri1, ..., Rik are independent.
We may now give the solution to Problem 2 under the hypothesis of independence.

Theorem 1. Let R1, R2, ..., Rn be independent random variables on a given probability space. If each Ri is absolutely continuous with density fi, then (R1, R2, ..., Rn) is absolutely continuous; also, for all x1, ..., xn,

f12···n(x1, x2, ..., xn) = f1(x1) f2(x2) ··· fn(xn)

Thus in this sense the joint density is the product of the individual densities.

PROOF. The joint distribution function of R1, ..., Rn is given by

F12···n(x1, ..., xn) = P{R1 ≤ x1, ..., Rn ≤ xn}
 = P{R1 ≤ x1} ··· P{Rn ≤ xn}  by independence
 = ∫_{−∞}^{x1} ··· ∫_{−∞}^{xn} f1(u1) ··· fn(un) du1 ··· dun

It follows from the definition of absolute continuity [see (2.6.2)] that (R1, ..., Rn) is absolutely continuous and that the joint density is f12···n(x1, ..., xn) = f1(x1) ··· fn(xn).

Note that we have the following intuitive interpretation (when n = 2). From the independence of R1 and R2 we obtain

P{x < R1 ≤ x + dx, y < R2 ≤ y + dy} = P{x < R1 ≤ x + dx} P{y < R2 ≤ y + dy}

If there is a joint density, we have (roughly) f12(x, y) dx dy = f1(x) dx f2(y) dy, so that f12(x, y) = f1(x) f2(y).

As a consequence of this result, the statement "Let R1, ..., Rn be independent random variables, with Ri having density fi" is unambiguous in the sense that it completely determines all probabilities of events involving the random vector (R1, ..., Rn); if B is an n-dimensional Borel set,

P{(R1, ..., Rn) ∈ B} = ∫···∫_B f1(x1) ··· fn(xn) dx1 ··· dxn

We now show that Problem 2 has a negative answer when the hypothesis of independence is dropped. We have seen that if (R1, ..., Rn) is absolutely continuous then each Ri is absolutely continuous, but the converse is false
in general if the Ri are not independent; that is, each of the random variables R1, ..., Rn can have a density without there being a density for the n-tuple (R1, ..., Rn).

Example 2. Let R1 be an absolutely continuous random variable with density f, and take R2 = R1; that is, R2(ω) = R1(ω), ω ∈ Ω. Then R2 is absolutely continuous, but (R1, R2) is not. For suppose that (R1, R2) has a density g. Necessarily (R1, R2) ∈ L, where L is the line y = x, but

P{(R1, R2) ∈ L} = ∫∫_L g(x, y) dx dy

Since L has area 0, the integral on the right is 0. But the probability on the left is 1, a contradiction.

We can also give an example to show that if R1 and R2 are each absolutely continuous (but not necessarily independent), then even if (R1, R2) is absolutely continuous, the joint density is not determined by the individual densities.
Example 3. Let

f12(x, y) = (1/4)(1 + xy),  −1 < x < 1, −1 < y < 1
          = 0 elsewhere

Since ∫_{−1}^{1} x dx = ∫_{−1}^{1} y dy = 0,

f1(x) = ∫_{−∞}^{∞} f12(x, y) dy = 1/2,  −1 < x < 1
      = 0 elsewhere

f2(y) = 1/2,  −1 < y < 1
      = 0 elsewhere

But if

f12(x, y) = 1/4,  −1 < x < 1, −1 < y < 1
          = 0 elsewhere

we get the same individual densities.
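Example 3 can be illustrated numerically: the two joint densities differ, yet their marginals agree. A small sketch (the test points are arbitrary choices):

```python
def f12(x, y):
    """(1/4)(1 + xy) on the square -1 < x, y < 1."""
    return 0.25 * (1.0 + x * y) if -1.0 < x < 1.0 and -1.0 < y < 1.0 else 0.0

def g12(x, y):
    """Uniform density 1/4 on the same square."""
    return 0.25 if -1.0 < x < 1.0 and -1.0 < y < 1.0 else 0.0

def integrate(h, a, b, n=4000):
    step = (b - a) / n
    return sum(h(a + (i + 0.5) * step) for i in range(n)) * step

x0 = 0.6
f1_x0 = integrate(lambda y: f12(x0, y), -1.0, 1.0)  # marginal of f12 at x0: 1/2
g1_x0 = integrate(lambda y: g12(x0, y), -1.0, 1.0)  # marginal of g12 at x0: 1/2
joints_differ = abs(f12(0.5, 0.5) - g12(0.5, 0.5)) > 0.01  # the joints themselves differ
```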
FIGURE 2.7.6
Now intuitively, if R1 and R2 are independent, then, say, e^{R1} and sin R2 should be independent, since information about e^{R1} should not change the odds concerning R2 and hence should not affect sin R2 either. We shall prove a theorem of this type, but first we need some additional terminology.

If g is a function that maps points in the set D into points in the set E,† and T ⊂ E, we define the preimage of T under g as

g⁻¹(T) = {x ∈ D: g(x) ∈ T}

For example, let D = {x1, x2, x3, x4}, E = {a, b, c}, g(x1) = g(x2) = g(x3) = a, g(x4) = c (see Figure 2.7.6). We then have

g⁻¹{a} = {x1, x2, x3}
g⁻¹{a, b} = {x1, x2, x3}
g⁻¹{a, c} = {x1, x2, x3, x4}
g⁻¹{b} = ∅

Note that, by definition of preimage, x ∈ g⁻¹(T) iff g(x) ∈ T.

Now let R1, ..., Rn be random variables on a given probability space, and let g1, ..., gn be functions of one variable, that is, functions from the reals to the reals. Let R1' = g1(R1), ..., Rn' = gn(Rn); that is, Ri'(ω) = gi(Ri(ω)), ω ∈ Ω. We assume that the Ri' are also random variables; this will be the case if the gi are continuous or piecewise continuous. Specifically, we have the following result, which we shall use without proof.

If g is a real-valued function defined on the reals, and g is piecewise continuous, then for each Borel set B ⊂ E¹, g⁻¹(B) is also a Borel subset of E¹. (A function with this property is said to be Borel measurable.)

Now we show that if gi is piecewise continuous or, more generally, Borel measurable, Ri' is a random variable. Let Bi' be a Borel subset of E¹. Then

Ri'⁻¹(Bi') = {ω: Ri'(ω) ∈ Bi'} = {ω: gi(Ri(ω)) ∈ Bi'} = {ω: Ri(ω) ∈ gi⁻¹(Bi')} ∈ F

since gi⁻¹(Bi') is a Borel set.

† A common notation for such a function is g: D → E. It means simply that g(x) is defined and belongs to E for each x in D.
Similarly, if g is a continuous real-valued function defined on E^n, then, for each Borel set B ⊂ E¹, g⁻¹(B) is a Borel subset of E^n. It follows that if R1, ..., Rn are random variables, so is g(R1, ..., Rn).

Theorem 2. If R1, ..., Rn are independent, then R1', ..., Rn' are also independent. (For short, "functions of independent random variables are independent.")

PROOF. If B1', ..., Bn' are Borel subsets of E¹, then

P{R1' ∈ B1', ..., Rn' ∈ Bn'} = P{g1(R1) ∈ B1', ..., gn(Rn) ∈ Bn'}
 = P{R1 ∈ g1⁻¹(B1'), ..., Rn ∈ gn⁻¹(Bn')}
 = Π_{i=1}^{n} P{Ri ∈ gi⁻¹(Bi')}  by independence of the Ri
 = Π_{i=1}^{n} P{gi(Ri) ∈ Bi'} = Π_{i=1}^{n} P{Ri' ∈ Bi'}
PROBLEMS

1. Let (R1, R2) have the following density function.

f12(x, y) = 4xy  if 0 < x < 1, 0 < y < 1, x > y
          = 6x²  if 0 < x < 1, 0 < y < 1, x < y
          = 0 elsewhere

(a) Find the individual density functions f1 and f2.
(b) If A = {R1 < 1/2}, B = {R2 < 1/2}, find P(A ∪ B).

2. If (R1, R2) is absolutely continuous with

f12(x, y) = 2e^{−(x+y)},  0 < x < y
          = 0 elsewhere

find f1(x) and f2(y).

3. Let (R1, R2) be uniformly distributed over the parallelogram with vertices (−1, 0), (1, 0), (2, 1), and (0, 1).
(a) Find and sketch the density functions of R1 and R2.
(b) A new random variable R3 is defined by R3 = R1 + R2. Show that R3 is absolutely continuous, and find and sketch its density.

4. If R1, R2, ..., Rn are independent, show that the joint distribution function is the product of the individual distribution functions; that is,

F12···n(x1, x2, ..., xn) = F1(x1) F2(x2) ··· Fn(xn)  for all real x1, ..., xn

[Conversely, it can be shown that if F12···n(x1, ..., xn) = F1(x1) ··· Fn(xn) for all real x1, ..., xn, then R1, ..., Rn are independent.]

5. Show that a random variable R is independent of itself (in other words, R and R are independent) if and only if R is degenerate, that is, essentially constant (P{R = c} = 1 for some c).

6. Under what conditions will R and sin R be independent? (Use Problem 5 and the result that functions of independent random variables are independent.)

7. If (R1, ..., Rn) is absolutely continuous and f12···n(x1, ..., xn) = f1(x1) ··· fn(xn) for all x1, ..., xn, show that R1, ..., Rn are independent.

8. Let (R1, R2) be absolutely continuous with density f12(x, y) = (x + y)/8, 0 < x < 2, 0 < y < 2; f12(x, y) = 0 elsewhere.
(a) Find the probability that R1² + R2 < 1.
(b) Find the conditional probability that exactly one of the random variables R1, R2 is < 1, given that at least one of the random variables is < 1.
(c) Determine whether or not R1 and R2 are independent.

2.8 FUNCTIONS OF MORE THAN ONE RANDOM VARIABLE
We are now equipped to consider a wide variety of problems of the following sort. If R1, ..., Rn are random variables with a given joint density, and we define R = g(R1, ..., Rn), we ask for the distribution or density function of R. We shall use a distribution function approach to these problems; that is, we shall find the distribution function of R directly. There is also a density function method, but it is usually not as convenient; the density function approach is outlined in Problem 12. The distribution function method can be described as follows.

F_R(z) = P{R ≤ z} = P{g(R1, ..., Rn) ≤ z} = ∫···∫_{g(x1,...,xn) ≤ z} f12···n(x1, ..., xn) dx1 ··· dxn
Example 1. Let R1 and R2 be uniformly distributed between 0 and 1, and independent.

(a) Let R3 = R1 + R2. Then, since f12(x, y) = f1(x) f2(y) by independence,

F3(z) = P{R1 + R2 ≤ z} = ∫∫_{x+y ≤ z} f1(x) f2(y) dx dy

If 0 ≤ z ≤ 1 (see Figure 2.8.1a),

F3(z) = ∫∫_{shaded area} 1 dx dy = shaded area = z²/2

If 1 ≤ z ≤ 2 (see Figure 2.8.1b),

F3(z) = shaded area = 1 − (2 − z)²/2

Thus

f3(z) = z,  0 < z < 1;  f3(z) = 2 − z,  1 < z < 2;  f3(z) = 0 elsewhere

(see Figure 2.8.1c).

FIGURE 2.8.1 (a) Calculation of F3(z), 0 ≤ z ≤ 1. (b) Calculation of F3(z), 1 ≤ z ≤ 2. (c) f3(z).
(b) Let R3 = R1R2. (Notice that 0 ≤ R3 ≤ 1.) If 0 < z < 1 (see Figure 2.8.2),

F3(z) = P{R1R2 ≤ z} = shaded area = z + ∫_{z}^{1} (z/x) dx = z − z ln z

Thus

f3(z) = −ln z,  0 < z < 1
      = 0 elsewhere
FIGURE 2.8.2
(c) Let R3 = max(R1, R2). If 0 < z < 1 (see Figure 2.8.3),

F3(z) = P{R3 ≤ z} = P{R1 ≤ z, R2 ≤ z} = shaded area = z²

[Alternatively, F3(z) = P{R1 ≤ z} P{R2 ≤ z} = z² by independence.]

f3(z) = 2z,  0 < z < 1
      = 0 elsewhere
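The three distribution functions derived in Example 1 can be checked by simulation; the seed, sample size, and test point z = 1/2 below are arbitrary choices.

```python
import math
import random

rng = random.Random(0)  # fixed seed, so the run is reproducible
N = 200_000
pairs = [(rng.random(), rng.random()) for _ in range(N)]

z = 0.5
p_sum = sum(1 for x, y in pairs if x + y <= z) / N      # F3(z) = z^2/2 for the sum
p_prod = sum(1 for x, y in pairs if x * y <= z) / N     # F3(z) = z - z ln z for the product
p_max = sum(1 for x, y in pairs if max(x, y) <= z) / N  # F3(z) = z^2 for the maximum
```

Each empirical frequency should land within a few thousandths of the corresponding formula.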
Before the next example, we introduce the Gaussian or normal density function:

f(x) = (1/(b√(2π))) e^{−(x−a)²/2b²},  x real (b > 0, a any real number)   (2.8.1)

This is the familiar bell-shaped curve centered at a (Figure 2.8.4); the smaller the value of b, the higher the peak and the more f is concentrated close to x = a.

FIGURE 2.8.3

FIGURE 2.8.4 Normal Density.

To check that this is a legitimate density, we must show that the area under f is 1. Let

I = ∫_{−∞}^{∞} e^{−x²} dx

Then

I² = ∫_{−∞}^{∞} e^{−x²} dx ∫_{−∞}^{∞} e^{−y²} dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x²+y²)} dx dy = ∫_{0}^{2π} dθ ∫_{0}^{∞} r e^{−r²} dr = π  (in polar coordinates)

so that

∫_{−∞}^{∞} e^{−x²} dx = √π   (2.8.2)

Thus (with y = (x − a)/(b√2))

∫_{−∞}^{∞} (1/(b√(2π))) e^{−(x−a)²/2b²} dx = (1/√π) ∫_{−∞}^{∞} e^{−y²} dy = 1

Example 2. Let R1, R2, and R3 be independent, each normally distributed (i.e., having the normal density), with a = 0, b = 1. Let R4 = (R1² + R2² + R3²)^{1/2}; take the positive square root so that R4 ≥ 0. (For example, if R1, R2, and R3 are the velocity components of a particle, then R4 is the speed.) Find the distribution function of R4.

F4(w) = P{R4 ≤ w} = P{R1² + R2² + R3² ≤ w²} = ∫∫∫_{x²+y²+z² ≤ w²} (2π)^{−3/2} e^{−(x²+y²+z²)/2} dx dy dz

We switch to spherical coordinates:

x = r sin φ cos θ,  y = r sin φ sin θ,  z = r cos φ
(φ is the "cone angle," and θ the "polar coordinate angle.") Then

F4(w) = ∫_{0}^{2π} dθ ∫_{0}^{π} dφ ∫_{0}^{w} (2π)^{−3/2} e^{−r²/2} r² sin φ dr = (2π)^{−3/2}(2π)(2) ∫_{0}^{w} r² e^{−r²/2} dr

Thus R4 has a density given by

f4(w) = √(2/π) w² e^{−w²/2},  w > 0
      = 0,  w ≤ 0
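As a numerical sanity check, both densities of this subsection, the normal density (2.8.1) and the speed density f4, integrate to 1; the parameter values a = 1, b = 2 and the truncation points below are arbitrary choices.

```python
import math

def integrate(f, lo, hi, n=200_000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

a, b = 1.0, 2.0  # arbitrary center and spread for the normal density (2.8.1)

def normal(x):
    return math.exp(-(x - a) ** 2 / (2 * b * b)) / (b * math.sqrt(2 * math.pi))

def speed_density(w):
    """f4(w) = sqrt(2/pi) w^2 exp(-w^2/2) from Example 2."""
    return math.sqrt(2 / math.pi) * w * w * math.exp(-w * w / 2)

area_normal = integrate(normal, -30.0, 30.0)  # tails beyond +-30 are negligible
area_speed = integrate(speed_density, 0.0, 30.0)
```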
Example 3. There are certain situations in which it is possible to avoid all integration in an n-dimensional problem. Suppose that R1, ..., Rn are independent and Fi is the distribution function of Ri, i = 1, 2, ..., n. Let Tk be the kth smallest of the Ri. [For example, if n = 4 and R1(ω) = 3, R2(ω) = 1.5, R3(ω) = −10, R4(ω) = 7, then

T1(ω) = min_i Ri(ω) = R3(ω) = −10,  T4(ω) = max_i Ri(ω) = R4(ω) = 7]

(Ties may be broken, for example, by favoring the random variable with the smaller subscript.) We wish to find the distribution function of Tk. When k = 1 or k = n, the calculation is brief.

F_{Tn}(x) = P{Tn ≤ x} = P{max(R1, ..., Rn) ≤ x} = P{R1 ≤ x, ..., Rn ≤ x} = Π_{i=1}^{n} P{Ri ≤ x}  by independence

Thus

F_{Tn}(x) = Π_{i=1}^{n} Fi(x)

Similarly,

P{T1 > x} = P{min(R1, ..., Rn) > x} = P{R1 > x, ..., Rn > x} = Π_{i=1}^{n} P{Ri > x}

Thus

F_{T1}(x) = 1 − P{T1 > x} = 1 − Π_{i=1}^{n} (1 − Fi(x))
REMARK. We may also calculate F_{T1}(x) as follows.

P{T1 ≤ x} = P{at least one Ri is ≤ x} = P(A1 ∪ A2 ∪ ··· ∪ An),  where Ai = {Ri ≤ x}
          = P(A1) + P(A1ᶜ ∩ A2) + ···  by (1.3.11)

But

P(A1ᶜ ∩ ··· ∩ A_{i−1}ᶜ ∩ Ai) = P{R1 > x, ..., R_{i−1} > x, Ri ≤ x} = (1 − F1(x)) ··· (1 − F_{i−1}(x)) Fi(x)

Thus

F_{T1}(x) = F1(x) + (1 − F1(x))F2(x) + (1 − F1(x))(1 − F2(x))F3(x) + ··· + (1 − F1(x)) ··· (1 − F_{n−1}(x))Fn(x)

Hence

1 − F_{T1} = (1 − F1)[1 − F2 − (1 − F2)F3 − ··· − (1 − F2) ··· (1 − F_{n−1})Fn]
           = (1 − F1)(1 − F2)[1 − F3 − (1 − F3)F4 − ··· − (1 − F3) ··· (1 − F_{n−1})Fn]
           = ··· = Π_{i=1}^{n} (1 − Fi)

as above.

We now make the simplifying assumption that the Ri are absolutely continuous (as well as independent), each with the same density f. [Note that

P{Ri = Rj} = ∫∫_{xi = xj} f(xi) f(xj) dxi dxj = 0  (if i ≠ j)

Hence

P{Ri = Rj for at least one i ≠ j} ≤ Σ_{i≠j} P{Ri = Rj} = 0

Thus ties occur with probability zero and can be ignored.] We shall show that the Tk are absolutely continuous, and find the density explicitly. We do this intuitively first. We have

P{x < Tk ≤ x + dx} = P{x < Tk ≤ x + dx, Tk = R1} + P{x < Tk ≤ x + dx, Tk = R2} + ··· + P{x < Tk ≤ x + dx, Tk = Rn}

by the theorem of total probability. (The events {Tk = Ri}, i = 1, ..., n, are mutually exclusive and exhaustive.) Thus

P{x < Tk ≤ x + dx} = n P{x < Tk ≤ x + dx, Tk = R1} = n P{Tk = R1, x < R1 ≤ x + dx}  by symmetry
Now for R1 to be the kth smallest and fall between x and x + dx, exactly k − 1 of the random variables R2, ..., Rn must be < R1, and the remaining n − k must be > R1 [and R1 must lie in (x, x + dx)]. Since there are C(n−1, k−1) ways of selecting k − 1 distinct objects out of n − 1, we have

P{x < Tk ≤ x + dx} = n C(n−1, k−1) P{x < R1 ≤ x + dx, R2 < R1, ..., Rk < R1, Rk+1 > R1, ..., Rn > R1}

But if R1 falls in (x, x + dx), Ri < R1 is essentially the same thing as Ri < x, so that

P{x < Tk ≤ x + dx} = n C(n−1, k−1) f(x) dx (P{Ri < x})^{k−1} (P{Ri > x})^{n−k}
                   = n C(n−1, k−1) f(x) (F(x))^{k−1} (1 − F(x))^{n−k} dx

Since P{x < Tk ≤ x + dx} = fk(x) dx, where fk is the density of Tk (assumed to exist), we have

fk(x) = n C(n−1, k−1) f(x) (F(x))^{k−1} (1 − F(x))^{n−k}

[When k = n we get n f(x)(F(x))^{n−1} = (d/dx) F(x)^n, and when k = 1 we get n f(x)(1 − F(x))^{n−1} = (d/dx)(1 − (1 − F(x))^n), in agreement with the previous results if all Ri have distribution function F and the density f can be obtained by differentiating F.]

To obtain the result formally, we reason as follows.

P{Tk ≤ x} = Σ_{i=1}^{n} P{Tk ≤ x, Tk = Ri} = n P{Tk ≤ x, Tk = R1}
 = n P{R1 ≤ x, exactly k − 1 of the variables R2, ..., Rn are < R1, and the remaining n − k variables are > R1}
 = n C(n−1, k−1) P{R1 ≤ x, R2 < R1, ..., Rk < R1, Rk+1 > R1, ..., Rn > R1}  by symmetry
 = ∫_{−∞}^{x} n C(n−1, k−1) f(u) (F(u))^{k−1} (1 − F(u))^{n−k} du

The integrand is the density of Tk, in agreement with the intuitive approach. T1, ..., Tn are called the order statistics of R1, ..., Rn.

REMARK. All events

{Ri1 ≤ x, Ri2 < Ri1, ..., Rik < Ri1, Rik+1 > Ri1, ..., Rin > Ri1}

(where i1, ..., in is a permutation of 1, ..., n) have the same probability, namely,

∫_{−∞}^{x} f(u) (F(u))^{k−1} (1 − F(u))^{n−k} du

This justifies the appeal to symmetry in the above argument.
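For uniform Ri (so f = 1 and F(x) = x on (0, 1)) the order-statistic density just derived can be cross-checked against the direct combinatorial expression P{Tk ≤ x} = Σ_{j=k}^{n} C(n, j) x^j (1 − x)^{n−j}, the probability that at least k of the n values fall below x. A sketch (n = 5, k = 3, x = 0.4 are arbitrary choices):

```python
import math

def f_k(x, n, k):
    """Density of the kth smallest of n iid uniform(0, 1) variables."""
    return n * math.comb(n - 1, k - 1) * x ** (k - 1) * (1 - x) ** (n - k)

def cdf_via_density(x, n, k, m=20_000):
    """Integrate f_k from 0 to x by the midpoint rule."""
    h = x / m
    return sum(f_k((i + 0.5) * h, n, k) for i in range(m)) * h

def cdf_direct(x, n, k):
    """P{T_k <= x} = P{at least k of the R_i are <= x}."""
    return sum(math.comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(k, n + 1))

n, k, x = 5, 3, 0.4
lhs = cdf_via_density(x, n, k)
rhs = cdf_direct(x, n, k)
total = cdf_via_density(1.0, n, k)  # the density integrates to 1
```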
PROBLEMS

1. Let R1 and R2 be independent and uniformly distributed between 0 and 1. Find and sketch the distribution or density function of the random variable R3 = R2/R1².

2. If R1 and R2 are independent random variables, each with the density function f(x) = e^{−x}, x > 0; f(x) = 0, x < 0, find and sketch the distribution or density function of the random variable R3, where
(a) R3 = R1 + R2
(b) R3 = R2/R1

3. Let R1 and R2 be independent, absolutely continuous random variables, each normally distributed with parameters a = 0 and b = 1; that is,

f1(x) = f2(x) = (1/√(2π)) e^{−x²/2}

Find and sketch the density or distribution function of the random variable R3 = R2/R1.

4. Let R1 and R2 be independent, absolutely continuous random variables, each uniformly distributed between 0 and 1. Find and sketch the distribution or density function of the random variable

R3 = max(R1, R2)/min(R1, R2)

REMARK. The example in which R3 = max(R1, R2) + min(R1, R2) may occur to the reader. However, this yields nothing new, since max(R1, R2) + min(R1, R2) = R1 + R2 (the sum of two numbers is the larger plus the smaller).

5. A point-size worm is inside an apple in the form of the sphere x² + y² + z² = 4a². (Its position is uniformly distributed.) If the apple is eaten down to a core determined by the intersection of the sphere and the cylinder x² + y² = a², find the probability that the worm will be eaten.

6. A point (R1, R2, R3) is uniformly distributed over the region in E³ described by x² + y² ≤ 4, 0 ≤ z ≤ 3x. Find the probability that R3 < 2R1.

7. Solve Problem 6 under the assumption that (R1, R2, R3) has density f(x, y, z) = kz² over the given region and f(x, y, z) = 0 outside the region.

8. Let T1, ..., Tn be the order statistics of R1, ..., Rn, where R1, ..., Rn are independent, each with density f. Show that the joint density of T1, ..., Tn is given by

g(x1, ..., xn) = n! f(x1) ··· f(xn),  x1 < x2 < ··· < xn
              = 0 elsewhere

HINT: Find P{T1 ≤ b1, ..., Tn ≤ bn, R1 < R2 < ··· < Rn}.

9. Let R1, R2, and R3 be independent, each with density f(x) = e^{−x}, x > 0; f(x) = 0, x < 0. Find the probability that R1 > 2R2 > 3R3.

10. A man and a woman agree to meet at a certain place some time between 11 and 12 o'clock. They agree that the one arriving first will wait z hours, 0 < z < 1, for the other to arrive. Assuming that the arrival times are independent and uniformly distributed, find the probability that they will meet.

11. If n points R1, ..., Rn are picked independently and with uniform density on a straight line of length L, find the probability that no two points will be less than distance d apart; that is, find

P{min_{i≠j} |Ri − Rj| > d}

HINT: First find P{min_{i≠j} |Ri − Rj| > d, R1 < R2 < ··· < Rn}; show that the region of integration defined by this event is

x_{n−1} + d < x_n < L
x_{n−2} + d < x_{n−1} < L − d
x_{n−3} + d < x_{n−2} < L − 2d
···
x1 + d < x2 < L − (n − 2)d
0 < x1 < L − (n − 1)d
12. (The density function method for functions of more than one random variable.) Let (R1, ..., Rn) be absolutely continuous with density f12···n(x1, ..., xn). Define random variables W1, ..., Wn by Wi = gi(R1, ..., Rn), i = 1, 2, ..., n; thus (W1, ..., Wn) = g(R1, ..., Rn). Assume that g is one-to-one and continuously differentiable with a nonzero Jacobian Jg (hence g has a continuously differentiable inverse h). Show that (W1, ..., Wn) is absolutely continuous, with density given by

f(y1, ..., yn) = f12···n(h(y)) |Jh(y)| = f12···n(h(y)) / |Jg(x)| evaluated at x = h(y)

[The result is the same if g is defined only on some open subset D of E^n and P{(R1, ..., Rn) ∈ D} = 1.]

13. Let R1 and R2 be independent random variables, each normally distributed with a = 0 and the same b. Define random variables R0 and Θ0 by

R1 = R0 cos Θ0,  R2 = R0 sin Θ0  (taking R0 > 0)

Show that R0 and Θ0 are independent, and find their density functions.

14. Let R1 and R2 be independent, absolutely continuous, positive random variables, and let R3 = R1R2. Show that the density function of R3 is given by

f3(z) = ∫_{0}^{∞} (1/w) f1(z/w) f2(w) dw,  z > 0
      = 0,  z < 0

Note: This problem may be done by the distribution function method or by applying Problem 12 as follows. Take R3 = R1R2 and R4 = R2; use the results of Problem 12 to obtain f34(z, w), and from this find f3(z).

15. Because of inefficiency of production, the resistances R1 and R2 in Figure P.2.8.15 may be regarded as independent random variables, each uniformly

FIGURE P.2.8.15
distributed between 0 and 1 ohm. Find the probability that the total resistance R of the network is ≤ 1/2 ohm.

16. A chamber consists of the inside of the cylinder x² + y² = 1. A particle at the origin is given initial velocity components vx = R1 and vy = R2, where R1 and R2 are independent random variables, each with normal density f(x) = (2π)^{−1/2} e^{−x²/2}. (There is no motion in the z-direction, and no force acting on the particle after the initial "push" at time t = 0.) If T is the time at which the particle strikes the wall of the chamber, find the distribution and density functions of T.

2.9 SOME DISCRETE EXAMPLES
In this section we examine some typical problems involving one or more discrete random variables. We first introduce the Poisson distribution, which may be regarded as an approximation to the binomial when the number n of trials is large and the probability p of success on a given trial is small.

Let Rn be the number of successes in n Bernoulli trials, with probability pn of success on a given trial. We have seen (Section 1.5) that Rn has the binomial distribution;† that is, the probability function of Rn is

p_{Rn}(k) = C(n, k) pn^k (1 − pn)^{n−k},  k = 0, 1, ..., n

† When a probabilist says he knows the distribution of a random variable R, he generally means that he has some way of calculating P{R ∈ B} for all Borel sets B. For example, he might know the distribution function of R, or the probability function if R is discrete, or the density function if R is absolutely continuous. Thus to say that R has the normal distribution means that R has a density given by the formula (2.8.1).

We now let n → ∞ and pn → 0 in such a way that npn → λ = constant. We shall show that

p_{Rn}(k) → (λ^k/k!) e^{−λ},  k = 0, 1, ...

To see this, write

p_{Rn}(k) = [n(n − 1) ··· (n − k + 1)/k!] pn^k (1 − pn)^{n−k}
          = [(npn)^k/k!] (1 − 1/n)(1 − 2/n) ··· (1 − (k − 1)/n) (1 − npn/n)^n / (1 − npn/n)^k

Now (1 − npn/n)^k → 1 and (1 − npn/n)^n → e^{−λ} (Problem 1), and the result follows. We call

p(k) = (λ^k/k!) e^{−λ},  k = 0, 1, 2, ...   (2.9.1)
the Poisson probability function; a random variable which has this probability function is said to have the Poisson distribution. (To check that it is a legitimate probability function: Σ_{k=0}^{∞} (λ^k/k!) e^{−λ} = e^{−λ} e^{λ} = 1.)
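The quality of the Poisson approximation can be seen numerically; n = 1000 and p = 0.002 (so λ = 2) below are arbitrary choices.

```python
import math

def binom_pmf(k, n, p):
    """Exact binomial probability function."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """The Poisson probability function (2.9.1)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

n, p = 1000, 0.002
lam = n * p
# largest pointwise discrepancy over the first 20 values of k
max_gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(20))
```

For these parameters the two probability functions agree to roughly three decimal places at every k.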
We shall show that if R1 and R2 are independent, each having the Poisson distribution, then R1 + R2 also has the Poisson distribution. We first need a characterization of independence in the discrete case.
Theorem 1. Let R1, ..., Rn be discrete random variables on a given probability space, with probability functions p1, ..., pn. Let p12···n be the joint probability function of R1, ..., Rn, defined by

p12···n(x1, ..., xn) = P{R1 = x1, ..., Rn = xn}   (2.9.2)

Then R1, ..., Rn are independent if and only if

p12···n(x1, ..., xn) = p1(x1) ··· pn(xn)  for all x1, ..., xn

PROOF. If R1, ..., Rn are independent, then

p12···n(x1, ..., xn) = P{R1 = x1, ..., Rn = xn} = P{R1 = x1} ··· P{Rn = xn} = p1(x1) ··· pn(xn)

Conversely, if p12···n(x1, ..., xn) = p1(x1) ··· pn(xn), then for all one-dimensional Borel sets B1, ..., Bn,

P{R1 ∈ B1, ..., Rn ∈ Bn} = Σ_{x1 ∈ B1, ..., xn ∈ Bn} P{R1 = x1, ..., Rn = xn}
                         = Σ_{x1 ∈ B1, ..., xn ∈ Bn} p1(x1) ··· pn(xn) = P{R1 ∈ B1} ··· P{Rn ∈ Bn}

Hence R1, ..., Rn are independent.

REMARKS. If R1 and R2 are not independent, the joint probability function of R1 and R2 is not determined by the individual probability functions. For example, if

P{R1 = 1, R2 = 1} = P{R1 = 2, R2 = 2} = a
P{R1 = 1, R2 = 2} = P{R1 = 2, R2 = 1} = 1/2 − a,  0 ≤ a ≤ 1/2

then P{R1 = 1} = P{R1 = 2} = 1/2 and P{R2 = 1} = P{R2 = 2} = 1/2. Thus we have uncountably many joint probability functions giving rise to the same individual probability functions.

If we wish to define discrete random variables R1, ..., Rn having a specified joint probability function p12···n, there is no difficulty in constructing an appropriate probability space. Take Ω = E^n, F = all subsets of Ω (since the random variables are discrete, there is no need to restrict to Borel sets), and P(B) = Σ_{(x1,...,xn) ∈ B} p12···n(x1, ..., xn), B ∈ F.

Now let R1 and R2 be independent, with Ri having the Poisson distribution with parameter λi, i = 1, 2. By Theorem 1, the joint probability function of R1 and R2 is

p12(j, k) = (λ1^j e^{−λ1}/j!)(λ2^k e^{−λ2}/k!)

We find the probability function of R1 + R2:

P{R1 + R2 = m} = Σ_{j+k=m} p12(j, k) = e^{−(λ1+λ2)} Σ_{j=0}^{m} (λ1^j/j!)(λ2^{m−j}/(m − j)!)

But (λ1 + λ2)^m = Σ_{j=0}^{m} C(m, j) λ1^j λ2^{m−j} by the binomial theorem, so that

P{R1 + R2 = m} = ((λ1 + λ2)^m/m!) e^{−(λ1+λ2)},  m = 0, 1, ...

Thus R1 + R2 has the Poisson distribution with parameter λ1 + λ2. By induction, it follows that the sum of n independent random variables R1, ..., Rn, where Ri is Poisson with parameter λi, has the Poisson distribution with parameter λ1 + ··· + λn. The use of the Poisson distribution as an approximation to the binomial is illustrated in the problems.
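The convolution argument above can be confirmed term by term; λ1 = 1.5 and λ2 = 2.5 below are arbitrary choices.

```python
import math

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

l1, l2 = 1.5, 2.5
# P{R1 + R2 = m}, computed by convolving the two probability functions ...
conv = [sum(poisson_pmf(j, l1) * poisson_pmf(m - j, l2) for j in range(m + 1))
        for m in range(15)]
# ... and directly from the Poisson(l1 + l2) formula
direct = [poisson_pmf(m, l1 + l2) for m in range(15)]
```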
1.
Six unbiased dice are tossed independently. Let R1 be the number of ones , R 2 the number of twos ; R1 and R 2 have the binomial distri bution with n = 6, p = 1/6 ; that is, Example
Pt( k) = P
( )k ( ) ( )6-k 5 1 6 , 2 ( k) = k
6
6
k = 0,
1 , 2, 3 , 4, 5 , 6
98
RANDOM VARIABLES
Let us find the joint probability function p12(j, k) of R1 and R2. This is a multinomial problem: on a given toss,

    b1 = "1 occurs",             p1 = 1/6,   n1 = j
    b2 = "2 occurs",             p2 = 1/6,   n2 = k
    b3 = "3, 4, 5, or 6 occurs", p3 = 4/6,   n3 = 6 − j − k    (n = 6)

Thus

    p12(j, k) = [6! / (j! k! (6 − j − k)!)] (1/6)^{j+k} (2/3)^{6−j−k},   j, k = 0, 1, ..., 6;  j + k ≤ 6

Thus the multinomial formula appears as the joint probability function of a number of random variables, each of which is individually binomial. Now let us find the conditional probability function of R1 given R2; that is,

    p1(j | k) = P{R1 = j | R2 = k} = p12(j, k) / p2(k)
              = [6!/(j! k! (6 − j − k)!)] (1/6)^{j+k} (4/6)^{6−j−k} / {[6!/(k! (6 − k)!)] (1/6)^k (5/6)^{6−k}}
              = [(6 − k)! / (j! (6 − k − j)!)] (1/5)^j (4/5)^{6−k−j},   j = 0, 1, ..., 6 − k

Intuitively, given R2 = k, there are 6 − k remaining tosses. The possible outcomes are 1, 3, 4, 5, or 6 (2 is not permitted), all equally likely. Thus, given R2 = k, R1 should be binomial with n = 6 − k, p = 1/5. This is verified by the formal calculation above.

REMARK. Since the discrete random variables R1 and R2 are independent iff p12(j, k) = p1(j)p2(k) for all j, k, it follows that independence is equivalent to p1(j | k) = p1(j) for all j, k [such that p2(k) > 0]. In the present case p1(j | k) is the binomial probability function with n = 6 − k, p = 1/5, and p1(j) is the binomial probability function with n = 6, p = 1/6. Thus R1 and R2 are not independent. This is clear intuitively; for example, if we know that R1 = 6, the odds about R2 are certainly affected; in fact, R2 must be 0. ◀
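The conditional-probability calculation in this example can be verified by machine. A sketch (the helper names `p12` and `p2` are mine):

```python
from math import comb, factorial

def p12(j, k):
    # joint probability that six dice show j ones and k twos (multinomial)
    if j < 0 or k < 0 or j + k > 6:
        return 0.0
    return (factorial(6) / (factorial(j) * factorial(k) * factorial(6 - j - k))
            * (1 / 6) ** (j + k) * (4 / 6) ** (6 - j - k))

def p2(k):
    # marginal of R2: binomial with n = 6, p = 1/6
    return comb(6, k) * (1 / 6) ** k * (5 / 6) ** (6 - k)

# the conditional p1(j | k) should be binomial with n = 6 - k, p = 1/5
k = 2
for j in range(6 - k + 1):
    cond = p12(j, k) / p2(k)
    binom = comb(6 - k, j) * (1 / 5) ** j * (4 / 5) ** (6 - k - j)
    assert abs(cond - binom) < 1e-12
```

The same loop with any k between 0 and 5 confirms the "remaining tosses" intuition.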
PROBLEMS
1.
(a) If |x| ≤ 1/2, show that

    ln(1 + x) = x + θx²,   where |θ| ≤ 1, θ depending on x

(b) Show that if xn → λ, then

    (1 − xn/n)ⁿ → e^{−λ}
2. If R has the binomial distribution with n large and p small, the Poisson approximation with λ = np may be used (a rule of thumb that has been given is that the approximation will be good to several decimal places if n ≥ 100 and p ≤ .01). Feller (An Introduction to Probability Theory and Its Applications, vol. 1, John Wiley and Sons, 1950) gives several examples of such random variables:
(i) The number of color-blind people in a large group (or the number of people possessing some other rare characteristic).
(ii) The number of misprints on a page.
(iii) The number of radioactive particles (or particles with some other distinguishing characteristic) passing through a counting device in a given time interval.
(iv) The number of flying bomb hits on a particular area of London during World War II (n is the number of bombs in a given period of time, p the probability that a single bomb will hit the area).
(v) The number of raisins in a cookie. [Here the assumptions are not entirely clear. Perhaps what is envisioned is that the dough is bombarded by a raisin gun at some stage in the cookie-making process. It would seem that this is simply a peaceful version of example (iv).]
In the following exercises, use the Poisson approximation to calculate the probabilities.
(a) If p = .001, how large must n be if P{R ≥ 1} ≥ .99?
(b) If np = 2, find P{R ≥ 2}.
3. The joint probability function of two discrete random variables R1 and R2 is as follows:

    p12(1, 1) = .4    p12(1, 2) = .3
    p12(2, 1) = .2    p12(2, 2) = .1
    p12(j, k) = 0     elsewhere

(a) Determine whether or not R1 and R2 are independent.
(b) Find the probability that R1R2 ≤ 2.
4. Let R1 and R2 be independent; assume that R1 has the binomial distribution with parameters n and p, and R2 the binomial distribution with parameters m and p. Find P{R1 = j | R1 + R2 = k}, and interpret the result intuitively. [Note: one approach involves establishing the formula C(n + m, k) = Σ_{i=0}^{k} C(n, i) C(m, k − i).]
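The quality of the Poisson approximation in Problem 2 can be seen directly by comparing the two probability functions. A sketch (n = 1000 and p = .002 are arbitrary choices giving λ = np = 2):

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    # exact binomial probability P{R = k}
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    # Poisson approximation with lam = n * p
    return exp(-lam) * lam ** k / factorial(k)

n, p = 1000, 0.002
lam = n * p
# with n large and p small, the two agree to about three decimal places
for k in range(6):
    assert abs(binom_pmf(n, p, k) - poisson_pmf(lam, k)) < 1e-3
```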
Expectation

3.1 INTRODUCTION
We begin here the study of the long-run convergence properties of situations involving a very large number of independent repetitions of a random experiment. As an introductory example, suppose that we observe the length of a telephone call made from a specific phone booth at a given time of the day, say, the first call after 12 o'clock noon. Suppose that we repeat the experiment independently n times, where n is very large, and record the cost of each call (which is determined by its length). If we take the arithmetic average of the costs, that is, add the total cost of all n calls and then divide by n, we expect physically that the arithmetic average will converge in some sense to a number that we should interpret as the long-run average cost of a call.

We shall try first to pin down the notion of average more precisely. Assume that the cost R2 of a call in terms of its length R1 is as follows:

    R2 = 10 (cents)   if 0 ≤ R1 ≤ 3 (minutes)
    R2 = 20           if 3 < R1 ≤ 6
    R2 = 30           if 6 < R1 ≤ 9

(Assume for simplicity that the telephone is automatically disconnected after 9 minutes.)
Thus R2 takes on three possible values, 10, 20, and 30; say P{R2 = 10} = .6, P{R2 = 20} = .25, P{R2 = 30} = .15. If we observe N calls, where N is very large, then, roughly, {R2 = 10} will occur .6N times; the total cost of calls of this type is 10(.6N) = 6N. {R2 = 20} will occur approximately .25N times, giving rise to a total cost of 20(.25N) = 5N. {R2 = 30} will occur approximately .15N times, producing a total cost of 30(.15N) = 4.5N. The total cost of all calls is 6N + 5N + 4.5N = 15.5N, or 15.5 cents per call on the average. Observe how we have computed the average:

    [10(.6N) + 20(.25N) + 30(.15N)] / N = 10(.6) + 20(.25) + 30(.15) = Σ_y y P{R2 = y}
Thus we are taking a weighted average of the possible values of R 2 , where the weights are the probabilities of R2 assuming those values. This suggests the following definition. Let R be a simple random variable , that is , a discrete random variable taking on only finitely many possible values. Define the expectation [also called the expected value, average value, mean value, or mean ] of R as
    E(R) = Σ_x x P{R = x}    (3.1.1)
Since R is simple, this is a finite sum and there are no convergence problems. In particular, if R is identically constant, say R = c, then E(R) = cP{R = c} = c. For short,

    E(c) = c    (3.1.2)
Note that if R takes the values x1, ..., xn, each with probability 1/n, then E(R) = (x1 + ··· + xn)/n, as we would expect intuitively. In this case each xi is given the same weight, namely, 1/n.

We now have the problem of extending the definition to more general random variables. If R is an arbitrary discrete random variable, the natural choice for E(R) is again Σ_x x P{R = x}, provided that the sum makes sense. (Theorem 1 will make this precise.) Similarly, let R1 be discrete and R2 = g(R1). Since R2 is also discrete, we have E(R2) = Σ_y y P{R2 = y}. However, if x1, x2, ... are the values of R1, then with probability p_{R1}(xi) we have R1 = xi, hence R2 = g(xi). Thus if our definition of expectation is sound, we should have the following alternate expression for E(R2):

    E(R2) = E[g(R1)] = Σ_i g(xi) p_{R1}(xi)    (3.1.3)
again a weighted average of possible values of R2, but expressed in terms of the probability function of R1.

If R1 is absolutely continuous, this approach breaks down completely, since P{R1 = x} = 0 for all x. However, we may get some idea as to how to compute E(R2) = E[g(R1)] when R1 is absolutely continuous, by making a discrete approximation. If we split the real line into intervals (xi, xi + dxi], then, roughly, the probability that xi < R1 ≤ xi + dxi is f_{R1}(xi) dxi (see Figure 3.1.1). If R1 falls into this interval, g(R1) is approximately g(xi), at least if g is continuous. Thus an approximation to E(R2) should be

    Σ_i g(xi) f_{R1}(xi) dxi

which suggests that if a general definition of expectation is formulated properly, and R1 is absolutely continuous,

    E(R2) = E[g(R1)] = ∫_{−∞}^{∞} g(x) f_{R1}(x) dx    (3.1.4)

FIGURE 3.1.1
In the telephone call example above, if R1 has density f1, we obtain

    E(R2) = 10 ∫₀³ f1(x) dx + 20 ∫₃⁶ f1(x) dx + 30 ∫₆⁹ f1(x) dx

where

    R2 = g(R1) = 10   if 0 ≤ R1 ≤ 3
               = 20   if 3 < R1 ≤ 6
               = 30   if 6 < R1 ≤ 9

Thus

    E(R2) = 10(.6) + 20(.25) + 30(.15)

as before
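The weighted-average formula for a discrete cost can be written out directly. A sketch using the cost distribution of this section (the helper name `expectation` is mine):

```python
# E(R2) as a weighted average of its values, matching the hand calculation
pmf = {10: 0.60, 20: 0.25, 30: 0.15}   # cost in cents -> probability

def expectation(pmf):
    # E(R) = sum over x of x * P{R = x}   (3.1.1)
    return sum(x * p for x, p in pmf.items())

assert abs(expectation(pmf) - 15.5) < 1e-9
```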
If we have an n-dimensional situation, for instance R0 = g(R1, ..., Rn), the preceding formulas generalize in a natural way. If R1, ..., Rn are discrete,

    E[g(R1, ..., Rn)] = Σ_{x1,...,xn} g(x1, ..., xn) p12...n(x1, ..., xn)    (3.1.5)

If (R1, ..., Rn) is absolutely continuous with density f12...n,

    E[g(R1, ..., Rn)] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(x1, ..., xn) f12...n(x1, ..., xn) dx1 ··· dxn    (3.1.6)
We shall outline very briefly a general definition of expectation that includes all the previous special cases. If R is a simple random variable on (Ω, ℱ, P), we define

    E(R) = Σ_x x P{R = x}

just as above. Now let R be a nonnegative random variable. We approximate R by simple random variables as follows. Define

    Rn(ω) = (k − 1)/2ⁿ   if (k − 1)/2ⁿ ≤ R(ω) < k/2ⁿ,   k = 1, 2, ..., n2ⁿ

and let Rn(ω) = n if R(ω) ≥ n (see Figure 3.1.2 for an illustration with n = 2). For any fixed ω, eventually R(ω) < n, so that 0 ≤ R(ω) − Rn(ω) < 2^{−n}. Thus Rn(ω) → R(ω). In fact Rn(ω) ≤ Rn+1(ω) for all n, ω. For example, if 3/4 ≤ R(ω) < 7/8, then R2(ω) = R3(ω) = 3/4; if 7/8 ≤ R(ω) < 1, then R2(ω) = 3/4, R3(ω) = 7/8. In general, if R(ω) lies in the lower half of the interval [(k − 1)/2ⁿ, k/2ⁿ), then Rn(ω) = Rn+1(ω); if R(ω) lies in the upper half, Rn(ω) < Rn+1(ω). Thus we have constructed a sequence of nonnegative simple functions Rn converging monotonically up to R. We have already defined E(Rn), and since Rn ≤ Rn+1 we have E(Rn) ≤ E(Rn+1). We define

    E(R) = lim_{n→∞} E(Rn)    (this may be +∞)

It is possible to show that if {Rn′} is any other sequence of nonnegative simple functions converging monotonically up to R, then

    lim_{n→∞} E(Rn′) = lim_{n→∞} E(Rn)

and thus E(R) is well defined.
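The dyadic staircase approximation can be sketched for a single fixed value R(ω). The helper `approx` below is my own shorthand for the construction, not the text's notation:

```python
def approx(r, n):
    # R_n(w): the value (k-1)/2^n on [(k-1)/2^n, k/2^n), cut off at n
    if r >= n:
        return n
    return int(r * 2 ** n) / 2 ** n   # floor down to the dyadic grid

r = 0.8   # a fixed value R(w)
vals = [approx(r, n) for n in range(1, 30)]

# monotone nondecreasing, within 2^-n of r, converging up to r
assert all(a <= b for a, b in zip(vals, vals[1:]))
assert all(r - approx(r, n) < 2 ** -n for n in range(1, 30))
```

For r = 0.8 the sequence begins 1/2, 3/4, 3/4, 13/16, ..., exactly as the text describes for values in [3/4, 7/8).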
FIGURE 3.1.2 Approximation of a Nonnegative Random Variable by Simple Random Variables.
Finally, if R is an arbitrary random variable, let R⁺ = max(R, 0), R⁻ = max(−R, 0); that is,

    R⁺(ω) = R(ω) if R(ω) ≥ 0;    R⁺(ω) = 0 if R(ω) < 0
    R⁻(ω) = −R(ω) if R(ω) ≤ 0;   R⁻(ω) = 0 if R(ω) > 0

R⁺ and R⁻ are called the positive and negative parts of R (see Figure 3.1.3). It follows that R = R⁺ − R⁻ (and |R| = R⁺ + R⁻), and we define E(R) = E(R⁺) − E(R⁻) if this is not of the form +∞ − ∞; if it is, we say that the expectation does not exist. Note that E(R) is finite if and only if E(R⁺) and E(R⁻) are both finite. Since it can be shown that E(|R|) = E(R⁺) + E(R⁻), it follows that

    E(R) is finite if and only if E(|R|) is finite    (3.1.7)

The expectation of a nonnegative random variable always exists; it may be +∞.

The following results may be proved. Let R1, R2, ..., Rn be random variables on (Ω, ℱ, P), and let R0 = g(R1, ..., Rn), where g is a function from Eⁿ to E¹. Assume that g has the property that g⁻¹(B) is a Borel subset of Eⁿ whenever B is a Borel subset of E¹. Then, as we indicated in Section 2.7, R0 is a random variable.
FIGURE 3.1.3 Positive and Negative Parts of a Random Variable.
Theorem 1. If R1, ..., Rn are discrete, then

    E[g(R1, ..., Rn)] = Σ_{x1,...,xn} g(x1, ..., xn) p12...n(x1, ..., xn)

if g(x1, ..., xn) ≥ 0 for all x1, ..., xn, or if the series on the right is absolutely convergent.

Theorem 2. If (R1, ..., Rn) is absolutely continuous with density f12...n, then

    E[g(R1, ..., Rn)] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(x1, ..., xn) f12...n(x1, ..., xn) dx1 ··· dxn

if g(x1, ..., xn) ≥ 0 for all x1, ..., xn, or if the integral on the right is absolutely convergent.
We shall look at examples that are neither discrete nor absolutely continuous in Chapter 4. Notice that it is quite possible for the expectation to exist and be infinite, or not to exist at all. For example, let

    f_R(x) = 1/x²,  x ≥ 1;    f_R(x) = 0,  x < 1

(see Figure 3.1.4a). Then

    E(R) = ∫_{−∞}^{∞} x f_R(x) dx = ∫₁^∞ (1/x) dx = ∞
FIGURE 3.1.4 (a) E(R) = ∞. (b) E(R) Does Not Exist.
As another example, let f_R(x) = 1/2x², |x| ≥ 1; f_R(x) = 0, |x| < 1 (Figure 3.1.4b). Then (see Figure 3.1.5)

    E(R⁺) = E(R⁻) = ½ ∫₁^∞ (1/x) dx = ∞

Thus E(R) does not exist. Finally, it can be shown that if two random variables R1 and R2 are "essentially" equal, that is, if P{ω: R1(ω) ≠ R2(ω)} = 0, then E(R1) = E(R2) if the expectations exist.

REMARK.
Theorem 1 fails if the series on the right is conditionally but not absolutely convergent. For example, let P{R1 = n} = (1/2)ⁿ, n = 1, 2, ..., and R2 = g(R1), where g(n) = (−1)^{n+1} 2ⁿ/n. If R1(ω) = n, n odd, then R2(ω) = 2ⁿ/n; hence R2⁺(ω) = g(n) = 2ⁿ/n, R2⁻(ω) = 0. If R1(ω) = n, n even, then R2(ω) = −2ⁿ/n; hence R2⁺(ω) = 0, R2⁻(ω) = −g(n) = 2ⁿ/n. Therefore, by the nonnegative case of Theorem 1,

    E(R2⁺) = Σ_{n odd} g(n) P{R1 = n} = 1 + 1/3 + 1/5 + ··· = ∞

and

    E(R2⁻) = Σ_{n even} −g(n) P{R1 = n} = 1/2 + 1/4 + 1/6 + ··· = ∞

Hence E(R2) does not exist, although

    Σ_{n=1}^{∞} g(n) P{R1 = n} = 1 − 1/2 + 1/3 − 1/4 + ···

is conditionally convergent. From an intuitive standpoint, the expectation should not change if the series is rearranged; a conditionally but not absolutely convergent series will not have this property.

FIGURE 3.1.5  x⁺ = x, x ≥ 0; x⁺ = 0, x < 0.  x⁻ = −x, x ≤ 0; x⁻ = 0, x > 0.
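The rearrangement phenomenon in the Remark can be observed numerically: the same terms 1, −1/2, 1/3, −1/4, ... summed in two different orders approach two different limits. The grouping below (two positive terms, then one negative) is one standard rearrangement, whose sum is (3/2) ln 2 rather than ln 2:

```python
from math import log

# partial sums of 1 - 1/2 + 1/3 - 1/4 + ...  (conditionally convergent, sum ln 2)
N = 200000
s_alt = sum((-1) ** (n + 1) / n for n in range(1, N + 1))

# rearranged: two positive terms, then one negative, repeatedly;
# identical terms, but the sum shifts to (3/2) ln 2
s_rearr = 0.0
for g in range(N // 3):
    s_rearr += 1 / (4 * g + 1) + 1 / (4 * g + 3) - 1 / (2 * g + 2)

assert abs(s_alt - log(2)) < 1e-4
assert abs(s_rearr - 1.5 * log(2)) < 1e-3
```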
3.2 TERMINOLOGY AND EXAMPLES
If R is a random variable on a given probability space, the kth moment of R (k > 0, not necessarily an integer) is defined by

    αk = E(R^k)    if the expectation exists

Thus

    αk = Σ_x x^k p_R(x)              if R is discrete
       = ∫_{−∞}^{∞} x^k f_R(x) dx    if R is absolutely continuous

α1 is simply E(R), the expectation of R, often written as m and called the mean of R. If R has density f_R, m may be regarded as the abscissa of the centroid of the region in the plane between the x-axis and the graph of f_R (see Figure 3.2.1). To see this, notice that the total moment of the region about the line x = m is

    ∫_{−∞}^{∞} (x − m) f_R(x) dx = m − m = 0

FIGURE 3.2.1 Geometric Interpretation of E(R). The "Strip" Between x and x + dx Contributes (x − m) f_R(x) dx to the Moment about x = m.
The expectation of R is a measure of central tendency in the sense that the arithmetic average of n independent observations of R converges (in a sense yet to be made precise) to E(R).

The kth central moment of R (k > 0) is defined by

    βk = E[(R − m)^k]    if m is finite and the expectation exists

Thus

    βk = Σ_x (x − m)^k p_R(x)              if R is discrete
       = ∫_{−∞}^{∞} (x − m)^k f_R(x) dx    if R is absolutely continuous

Notice that β1 = E(R − m) = m − m = 0. β2 = E[(R − m)²] is called the variance of R, written σ², σ²(R), or Var R. σ (the positive square root of β2) is called the standard deviation of R. Note that if R has finite mean, then, since (R − m)² ≥ 0, Var R always exists; it may be infinite. If R has density f_R, the variance of R may be regarded as the moment of inertia of the region in the plane between the x-axis and the graph of f_R, about the axis x = m.

The variance may be interpreted as a measure of dispersion. A large variance corresponds to a high probability that R will fall far from its mean, while a small variance indicates that R is likely to be close to its mean (see Figure 3.2.2). We shall make a quantitative statement to this effect (Chebyshev's inequality) in Section 3.7.
▶ Example 1. Consider the normal density function

    f_R(x) = [1/(√(2π) b)] e^{−(x−a)²/2b²},   b > 0, a real

Since f_R is symmetrical about x = a, the centroid of the area under f_R has abscissa a, so that E(R) = a. We compute the variance of R.
FIGURE 3.2.2

We have

    σ² = E[(R − a)²] = [1/(√(2π) b)] ∫_{−∞}^{∞} (x − a)² e^{−(x−a)²/2b²} dx

With the substitution y = (x − a)/(√2 b) this becomes

    σ² = (2b²/√π) ∫_{−∞}^{∞} y² e^{−y²} dy

Now, by (2.8.2), ∫_{−∞}^{∞} e^{−y²} dy = √π. Integrate by parts to obtain

    √π = ∫_{−∞}^{∞} e^{−y²} dy = ∫_{−∞}^{∞} 2y² e^{−y²} dy

It follows that ∫_{−∞}^{∞} y² e^{−y²} dy = √π/2, and hence σ² = b². Thus we may write

    f_R(x) = [1/(√(2π) σ)] e^{−(x−m)²/2σ²}    (3.2.1)

In this case the mean and variance determine the density completely. If R has the normal density with mean m and variance σ², we sometimes write "R is normal (m, σ²)" for short. ◀

Before looking at the next example, it will be convenient to introduce the gamma function, defined by
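The conclusion σ² = b² can be checked by direct numerical integration of the variance integral; a sketch (the values a = 1, b = 2 are arbitrary, and the crude Riemann sum is mine, not a method from the text):

```python
from math import exp, pi, sqrt

a, b = 1.0, 2.0   # arbitrary mean and scale parameters

def f(x):
    # normal density with mean a and variance b^2
    return exp(-(x - a) ** 2 / (2 * b ** 2)) / (sqrt(2 * pi) * b)

# Riemann sum for the variance integral over [a - 10b, a + 10b];
# the tail beyond ten standard deviations is negligible
h = 0.001
xs = [a - 10 * b + i * h for i in range(int(20 * b / h))]
var = sum((x - a) ** 2 * f(x) * h for x in xs)
assert abs(var - b ** 2) < 1e-4
```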
    Γ(r) = ∫₀^∞ x^{r−1} e^{−x} dx,   r > 0    (3.2.2)

Integrating by parts, we have

    Γ(r + 1) = ∫₀^∞ x^r e^{−x} dx = r ∫₀^∞ x^{r−1} e^{−x} dx

Thus

    Γ(r + 1) = rΓ(r)    (3.2.3)
Since Γ(1) = ∫₀^∞ e^{−x} dx = 1, we have

    Γ(2) = 1·Γ(1) = 1,   Γ(3) = 2Γ(2) = 2·1,   Γ(4) = 3Γ(3) = 3·2·1 = 3!

and

    Γ(n + 1) = n!,   n = 0, 1, ...    (3.2.4)

We also need Γ(1/2). Let x = y² to obtain

    Γ(1/2) = ∫₀^∞ (1/y) e^{−y²} 2y dy = 2 ∫₀^∞ e^{−y²} dy = ∫_{−∞}^{∞} e^{−y²} dy

By (2.8.2), we have

    Γ(1/2) = √π    (3.2.5)
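Formulas (3.2.3)–(3.2.5) can be confirmed against a library implementation; a quick sketch using Python's `math.gamma`:

```python
from math import factorial, gamma, isclose, pi, sqrt

# Gamma(n + 1) = n!   (3.2.4)
for n in range(8):
    assert isclose(gamma(n + 1), factorial(n))

# Gamma(1/2) = sqrt(pi)   (3.2.5)
assert isclose(gamma(0.5), sqrt(pi))

# the recursion Gamma(r + 1) = r * Gamma(r)   (3.2.3), at an arbitrary r
assert isclose(gamma(3.7), 2.7 * gamma(2.7))
```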
▶ Example 2. Let R1 be absolutely continuous with density f1(x) = e^{−x}, x > 0; f1(x) = 0, x < 0. Let R2 = R1². We may compute E(R2) in two ways.

1. E(R2) = E(R1²) = ∫_{−∞}^{∞} x² f1(x) dx = ∫₀^∞ x² e^{−x} dx = Γ(3) = 2   by (3.2.4)

2. We may find the density of R2 by the technique of Section 2.4 (see Figure 3.2.3). We have

    f2(y) = f1(√y) (d/dy)√y = e^{−√y}/(2√y),   y > 0
    f2(y) = 0,   y < 0
FIGURE 3.2.3 Computation of Density of R2.
Then

    E(R2) = ∫₀^∞ y f2(y) dy = ∫₀^∞ (√y/2) e^{−√y} dy = 2

as before. Notice that both methods must give the same answer by Theorem 2 of Section 3.1. For R2(ω) = (R1(ω))²; applying the theorem with g(R1) = R1², we obtain

    E(R1²) = ∫_{−∞}^{∞} x² f1(x) dx

Applying the theorem with g(R2) = R2, we have

    E(R2) = ∫_{−∞}^{∞} y f2(y) dy

Generally the first method is easier, since the computation of the density of R2 is avoided. ◀

▶ Example 3.
Let R1 and R2 be independent, each with density f(x) = e^{−x}, x > 0; f(x) = 0, x < 0. Let R3 = max(R1, R2). We compute E(R3).

    E(R3) = E[g(R1, R2)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f12(x, y) dx dy
          = ∫₀^∞ ∫₀^∞ max(x, y) e^{−x} e^{−y} dx dy

Now max(x, y) = x if x ≥ y; max(x, y) = y if x < y (see Figure 3.2.4: on A, max(x, y) = y; on B, max(x, y) = x). Thus

    E(R3) = ∫∫_B x e^{−x} e^{−y} dx dy + ∫∫_A y e^{−x} e^{−y} dx dy
          = ∫₀^∞ x e^{−x} [∫₀^x e^{−y} dy] dx + ∫₀^∞ y e^{−y} [∫₀^y e^{−x} dx] dy

FIGURE 3.2.4

The two integrals are equal, since one may be obtained from the other by interchanging x and y. Thus

    E(R3) = 2 ∫₀^∞ x e^{−x} (1 − e^{−x}) dx = 2(1 − 1/4) = 3/2  ◀
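The value E(R3) = 3/2 can be confirmed by numerical integration of the single-variable form above; a sketch (the crude Riemann sum and its truncation point are my own choices):

```python
from math import exp

# E(R3) = 2 * integral over (0, inf) of x e^{-x} (1 - e^{-x}) dx;
# each closed-form piece is 1 - 1/4 = 3/4, so E(R3) = 3/2
h = 0.0005
e_r3 = 2 * sum(x * exp(-x) * (1 - exp(-x)) * h
               for x in (i * h for i in range(1, int(40 / h))))
assert abs(e_r3 - 1.5) < 1e-3
```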
The moments and central moments of a random variable R, especially the mean and variance, give some information about the behavior of R. In many situations it may be difficult to compute the distribution function of R explicitly, but the calculation of some of the moments may be easier. We shall examine some problems of this type in Section 3.5.

Another parameter that gives some information about a random variable R is the median of R, defined when F_R is continuous as a number μ (not necessarily unique) such that F_R(μ) = 1/2 (see Figure 3.2.5a and b). In general the median of a random variable R is a number μ such that

    F_R(μ) = P{R ≤ μ} ≥ 1/2
    F_R(μ−) = P{R < μ} ≤ 1/2

(see Figure 3.2.5c). Loosely speaking, μ is the halfway point of the distribution function of R.
FIGURE 3.2.5 (a) μ is the Unique Median. (b) Any Number Between a and b is a Median. (c) μ is the Unique Median.
PROBLEMS

1. Let R be normally distributed with mean 0 and variance 1. Show that

    E(Rⁿ) = 0,   n odd
    E(Rⁿ) = (n − 1)(n − 3) ··· (5)(3)(1),   n even

2. Let R1 have the exponential density f1(x) = e^{−x}, x > 0; f1(x) = 0, x < 0. Let R2 = g(R1) be the largest integer ≤ R1 (if 0 ≤ R1 < 1, R2 = 0; if 1 ≤ R1 < 2, R2 = 1, and so on).
(a) Find E(R2) by computing ∫_{−∞}^{∞} g(x) f1(x) dx.
(b) Find E(R2) by evaluating the probability function of R2 and then computing Σ_y y p_{R2}(y).
3. Let R1 and R2 be independent random variables, each with the exponential density f(x) = e^{−x}, x > 0; f(x) = 0, x < 0. Find the expectation of (a) R1R2 (b) R1 − R2 (c) |R1 − R2|.
4. Let R1 and R2 be independent, each uniformly distributed between −1 and +1. Find E[max(R1, R2)].
5. Suppose that the density function for the length R of a telephone call is

    f(x) = x e^{−x},  x > 0;    f(x) = 0,  x < 0

The cost of a call is

    C(R) = 2,   0 < R ≤ 3;    C(R) = 2 + 6(R − 3),   R > 3

Find the average cost of a call.
6. Two machines are put into service at t = 0, processing the same data. Let Ri (i = 1, 2) be the time (in hours) at which machine i breaks down. Assume that R1 and R2 are independent random variables, each having the exponential density function f(x) = λe^{−λx}, x > 0; f(x) = 0, x < 0. Suppose that we start counting down time if and only if both machines are out of service. No repairs are allowed during the working day (which is T hours long), but any machine that has failed during the day is assumed to be completely repaired by the time the next day begins. For example, if T = 8 and the machines fail at t = 2 and t = 6, the down time is 2 hours.
(a) Find the probability that at least one machine will fail during a working day.
(b) Find the average down time per day. (Leave the answer in the form of an integral.)
7. Show that if R has the binomial distribution with parameters n and p, that is, R is the number of successes in n Bernoulli trials with probability of success p on a given trial, then E(R) = np, as one should expect intuitively. HINT: in E(R) = Σ_{k=0}^{n} k C(n, k) p^k (1 − p)^{n−k}, factor out np and use the binomial theorem.
REMARK. In Section 3.5 we shall calculate the mean and variance of R in an indirect but much more efficient way.
8. If R has the Poisson distribution with parameter λ, show that

    E[R(R − 1)(R − 2) ··· (R − r + 1)] = λ^r

Conclude that E(R) = Var R = λ.

3.3
PROPERTIES OF EXPECTATION
In this section we list several basic properties of the expectation of a random variable. A precise justification of these properties would require a detailed analysis of the general definition of E(R) that we gave in Section 3.1; what we actually did there was to outline the construction of the abstract Lebesgue integral. Instead we shall give plausibility arguments or proofs in special cases.

1. Let R1, ..., Rn be random variables on a given probability space. Then

    E(R1 + ··· + Rn) = E(R1) + ··· + E(Rn)

CAUTION. Recall that E(R) can be ±∞, or not exist at all. The complete statement of property 1 is: If E(Ri) exists for all i = 1, 2, ..., n, and +∞ and −∞ do not both appear in the sum E(R1) + ··· + E(Rn) (+∞ alone or −∞ alone is allowed), then E(R1 + ··· + Rn) exists and equals E(R1) + ··· + E(Rn).
For example, suppose that (R1, R2) has density f12, and R′ = g(R1, R2), R″ = h(R1, R2). Then

    E(R′ + R″) = E[g(R1, R2) + h(R1, R2)]
               = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [g(x, y) + h(x, y)] f12(x, y) dx dy
               = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f12(x, y) dx dy + ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f12(x, y) dx dy
               = E(R′) + E(R″)
2. If R is a random variable whose expectation exists, and a is any real number, then E(aR) exists and

    E(aR) = aE(R)†

For example, if R1 has density f1 and R2 = aR1, then

    E(R2) = ∫_{−∞}^{∞} ax f1(x) dx = aE(R1)

Basically, properties 1 and 2 say that the expectation is linear.

3. If R1 ≤ R2, then E(R1) ≤ E(R2), assuming that both expectations exist. For example, if R has density f, and R1 = g(R), R2 = h(R), and g ≤ h, we have

    E(R1) = ∫_{−∞}^{∞} g(x) f(x) dx ≤ ∫_{−∞}^{∞} h(x) f(x) dx = E(R2)

4. If R ≥ 0 and E(R) = 0, then R is essentially 0; that is, P{R = 0} = 1. This we can actually prove, from the previous properties. Define Rn = 0 if 0 ≤ R < 1/n; Rn = 1/n if R ≥ 1/n. Then 0 ≤ Rn ≤ R, so that, by property 3, E(Rn) = 0. But Rn has only two possible values, 0 and 1/n, and so

    E(Rn) = (1/n) P{R ≥ 1/n} = 0

Thus

    P{R ≥ 1/n} = 0   for all n

But {R > 0} is the union of the expanding sequence of events {R ≥ 1/n}, so that P{R > 0} = limₙ P{R ≥ 1/n} = 0. Hence

    P{R = 0} = 1

Notice that if R is discrete, the argument is much faster: if Σ_{x>0} x p_R(x) = 0, then x p_R(x) = 0 for all x > 0; hence p_R(x) = 0 for x > 0, and therefore p_R(0) = 1.

COROLLARY. If Var R = 0, then R is essentially constant.

PROOF. If m = E(R), then E[(R − m)²] = 0, hence P{R = m} = 1.

† Since E(R) is allowed to be infinite, expressions of the form 0·∞ will occur. The most convenient way to handle this is simply to define 0·∞ = 0; no inconsistency will result.
5. Let R1, ..., Rn be independent random variables.
(a) If all the Ri are nonnegative, then

    E(R1R2 ··· Rn) = E(R1)E(R2) ··· E(Rn)

(b) If E(Ri) is finite for all i (whether or not the Ri ≥ 0), then E(R1R2 ··· Rn) is finite and

    E(R1R2 ··· Rn) = E(R1)E(R2) ··· E(Rn)

We can prove this when all the Ri are discrete, if we accept certain facts about infinite series. For

    E(R1R2 ··· Rn) = Σ_{x1,...,xn} x1 ··· xn P{R1 = x1, ..., Rn = xn} = Σ_{x1,...,xn} x1 ··· xn p1(x1) ··· pn(xn)

Under hypothesis (a) we may restrict the xi's to be ≥ 0. Under hypothesis (b) the above series is absolutely convergent. Since a nonnegative or absolutely convergent series can be summed in any order, we have

    E(R1R2 ··· Rn) = [Σ_{x1} x1 p1(x1)] ··· [Σ_{xn} xn pn(xn)] = E(R1) ··· E(Rn)

If (R1, ..., Rn) is absolutely continuous, the argument is similar, with sums replaced by integrals:

    E(R1R2 ··· Rn) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} x1 ··· xn f12...n(x1, ..., xn) dx1 ··· dxn
                   = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} x1 ··· xn f1(x1) ··· fn(xn) dx1 ··· dxn
                   = [∫_{−∞}^{∞} x1 f1(x1) dx1] ··· [∫_{−∞}^{∞} xn fn(xn) dxn] = E(R1) ··· E(Rn)
6. Let R be a random variable with finite mean m and variance σ² (possibly infinite). If a and b are real numbers, then

    Var(aR + b) = a²σ²

PROOF. Since E(aR + b) = am + b by properties 1 and 2 [and (3.1.2)], we have

    Var(aR + b) = E[(aR + b − (am + b))²]
                = E[a²(R − m)²] = a²E[(R − m)²] = a²σ²   by property 2
7. Let R1, ..., Rn be independent random variables, each with finite mean. Then

    Var(R1 + ··· + Rn) = Var R1 + ··· + Var Rn

PROOF. Let mi = E(Ri). Then

    Var(R1 + ··· + Rn) = E[(Σ_{i=1}^{n} Ri − Σ_{i=1}^{n} mi)²] = E[(Σ_{i=1}^{n} (Ri − mi))²]

If this is expanded, the "cross terms" are 0, since, if i ≠ j,

    E[(Ri − mi)(Rj − mj)] = E(RiRj − miRj − mjRi + mimj)
                          = E(Ri)E(Rj) − miE(Rj) − mjE(Ri) + mimj   by properties 5, 1, and 2
                          = 0   since E(Ri) = mi, E(Rj) = mj

Thus

    Var(R1 + ··· + Rn) = Σ_{i=1}^{n} E(Ri − mi)² = Σ_{i=1}^{n} Var Ri

COROLLARY. If R1, ..., Rn are independent, each with finite mean, and a1, ..., an, b are real numbers, then

    Var(a1R1 + ··· + anRn + b) = a1² Var R1 + ··· + an² Var Rn

PROOF. This follows from properties 6 and 7. (Notice that a1R1, ..., anRn are still independent; see Problem 1.)
8. The central moments β1, ..., βn (n ≥ 2) can be obtained from the moments α1, ..., αn, provided that α1, ..., α_{n−1} are finite and αn exists. To see this, expand (R − m)ⁿ by the binomial theorem. Thus

    βn = E[(R − m)ⁿ] = Σ_{k=0}^{n} C(n, k) (−m)^{n−k} αk

Notice that since α1, ..., α_{n−1} are finite, no terms of the form +∞ − ∞ can appear in the summation, and thus we may take the expectation term by term, by property 1.

This result is applied most often when n = 2. If R has finite mean [E(R²) always exists since R² ≥ 0], then (R − m)² = R² − 2mR + m²; hence

    Var R = E(R²) − 2mE(R) + m²
That is,

    σ² = E(R²) − [E(R)]²    (3.3.1)

which is the "mean of the square" minus the "square of the mean."

9. If E(R^k) is finite and 0 < j < k, then E(R^j) is also finite.

PROOF.

    |R(ω)|^j ≤ |R(ω)|^k   if |R(ω)| ≥ 1
    |R(ω)|^j ≤ 1           if |R(ω)| ≤ 1

Thus |R(ω)|^j ≤ 1 + |R(ω)|^k for all ω, and the result follows. Notice that the expectation of a random variable is finite if and only if the expectation of its absolute value is finite; see (3.1.7). Thus in property 8, if α_{n−1} is finite, automatically α1, ..., α_{n−2} are finite as well.
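Formula (3.3.1) can be verified exactly on a small discrete example; a sketch using a fair die and exact rational arithmetic (the choice of example is mine):

```python
from fractions import Fraction

# a fair die: check Var R = E(R^2) - [E(R)]^2   (3.3.1), with no rounding
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

m = sum(x * p for x, p in pmf.items())              # E(R) = 7/2
second = sum(x * x * p for x, p in pmf.items())     # E(R^2) = 91/6
var_direct = sum((x - m) ** 2 * p for x, p in pmf.items())

assert m == Fraction(7, 2)
assert second - m ** 2 == var_direct == Fraction(35, 12)
```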
REMARK. Properties 5 and 7 fail without the hypothesis of independence. For example, let R1 = R2 = R, where R has finite mean. Then E(R1R2) ≠ E(R1)E(R2), since E(R²) − [E(R)]² = Var R, which is > 0 unless R is essentially constant, by the corollary to property 4. Also, Var(R1 + R2) = Var(2R) = 4 Var R, which is not the same as Var R1 + Var R2 = 2 Var R unless R is essentially constant.
PROBLEMS

1. If R1, ..., Rn are independent random variables, show that a1R1 + b1, ..., anRn + bn are independent for all possible choices of the constants ai and bi.
2. If R is normally distributed with mean m and variance σ², evaluate the central moments of R (see Problem 1, Section 3.2).
3. Let θ be uniformly distributed between 0 and 2π. Define R1 = cos θ, R2 = sin θ. Show that E(R1R2) = E(R1)E(R2), and also Var(R1 + R2) = Var R1 + Var R2, but R1 and R2 are not independent. Thus, in properties 5 and 7, the converse assertion is false.
4. If E(R) exists, show that |E(R)| ≤ E(|R|).
5. Let R be a random variable with finite mean. Indicate how and under what conditions the moments of R can be obtained from the central moments. In particular show that E(R²) < ∞ if and only if Var R < ∞. More generally, αn is finite if and only if βn is finite.

3.4 CORRELATION
If R1 and R2 are random variables on a given probability space, we may define joint moments associated with R1 and R2:

    αjk = E(R1^j R2^k),   j, k ≥ 0

and joint central moments

    βjk = E[(R1 − m1)^j (R2 − m2)^k],   m1 = E(R1), m2 = E(R2)

We shall study β11 = E[(R1 − m1)(R2 − m2)] = E(R1R2) − E(R1)E(R2), which is called the covariance of R1 and R2, written Cov(R1, R2). In this section we assume that E(R1) and E(R2) are finite, and E(R1R2) exists; then the covariance of R1 and R2 is well defined.

Theorem 1. If R1 and R2 are independent, then Cov(R1, R2) = 0, but not conversely.
Assume that E(R12) and E(R 22 ) are finite (R1 and R 2 then automatically have finite mean, by property 9 of Section 3.3, and finite variance, by property 8) . Then E(R1R2 ) is finite, and IE(R1R 2) 1 2 <( E(R12)E(R 22) Theorem
(Schwarz Inequality).
PROOF. If R1 is essentially 0, the inequality is immediate, so assume not essentially 0 ; then E(R12 ) > 0. For any real number x let
R1
Since h(x) is the expectation of a nonnegative random variable, it must be > O for all x. The quadratic equation h(x) = 0 has either no real roots or, at
120
EXPECTATION h(x)
/
FIGURE 3 .4. 1
/
/
/ //Not possible /
Proof of the Schwarz Inequality.
worst, one real repeated root ( see Figure 3.4. 1 ). Thus the discriminant must be
Now assume that E(R1²) and E(R2²) are finite and, in addition, that the variances σ1² and σ2² of R1 and R2 are > 0. Define the correlation coefficient of R1 and R2 as

    ρ = ρ(R1, R2) = Cov(R1, R2) / (σ1σ2)

By Theorem 1, if R1 and R2 are independent, they are uncorrelated; that is, ρ(R1, R2) = 0, but not conversely. Furthermore, |ρ| ≤ 1.

PROOF. Apply the Schwarz inequality to R1 − E(R1) and R2 − E(R2):

    |E[(R1 − ER1)(R2 − ER2)]|² ≤ E[(R1 − ER1)²] E[(R2 − ER2)²]

that is, [Cov(R1, R2)]² ≤ σ1²σ2², and hence ρ² ≤ 1.
We shall show that ρ is a measure of linear dependence between R1 and R2 [more precisely, between R1 − E(R1) and R2 − E(R2)], in the following sense. Let us try to estimate R2 − ER2 by a linear combination c(R1 − ER1) + d; that is, find the c and d that minimize

    E{[(R2 − ER2) − (c(R1 − ER1) + d)]²} = σ2² − 2c Cov(R1, R2) + c²σ1² + d²
                                         = σ2² − 2cρσ1σ2 + c²σ1² + d²

Clearly we can do no better than to take d = 0. Now the minimum of Ax² + 2Bx + D occurs for x = −B/A; hence σ1²c² − 2ρσ1σ2c + σ2² is minimized when

    c = ρσ2/σ1

Thus the minimum expectation is σ2² − 2ρ²σ2² + ρ²σ2² = σ2²(1 − ρ²). For a given σ2², the closer |ρ| is to 1, the better R2 is approximated (in the mean square sense) by a linear combination aR1 + b. In particular, if |ρ| = 1, then

    E{[(R2 − ER2) − c(R1 − ER1)]²} = 0

so that

    R2 − ER2 = c(R1 − ER1)

with probability 1. Thus, if |ρ| = 1, then R1 − E(R1) and R2 − E(R2) are linearly dependent. (The random variables R1, ..., Rn are said to be linearly dependent iff there are real numbers a1, ..., an, not all 0, such that P{a1R1 + ··· + anRn = 0} = 1.) Conversely, if R1 − E(R1) and R2 − E(R2) are linearly dependent, that is, if a(R1 − ER1) + b(R2 − ER2) = 0 with probability 1 for some constants a and b, not both 0, then |ρ| = 1 (Problem 1).
1.
2.
3. 4.
•
•
If R1 - E(R1) and R2 - E(R2) are linearly dependent, show that I p(R1, R2)1 = 1 . If aR1 + b R2 = c for some constants a, b, c, where a and b are not both 0, show that R1 - E(R1) and R2 - E(R2) are linearly dependent. Thus I p(R1, R2)1 = 1 if and only if there is a line L in the plane such that (R1 ( w) , R2( w )) lies on L for "almost" all w, that is, for all w outside a set of probability 0. Show that equality occurs in the Schwarz inequality, IE(R1R2) 1 2 = E(R12)E(R22), if and only if R1 and R2 are linearly dependent. Prove the following results. (a) Schwarz inequality for sums: For any real numbers a1 , . . . , an , b1 , . . . , bn ,
5. Show that if R1, ..., Rn are arbitrary random variables with E(Ri²) finite for all i, then

    Var(R1 + ··· + Rn) = Σ_{i=1}^{n} Var Ri + 2 Σ_{i<j} Cov(Ri, Rj)

3.5 THE METHOD OF INDICATORS
In this section we introduce a technique that in certain cases allows the expectation of a random variable to be computed quickly, without any knowledge of the distribution function. This is especially useful in situations when the distribution function is difficult to calculate.

The indicator of an event A is a random variable I_A defined as follows:

    I_A(ω) = 1 if ω ∈ A;    I_A(ω) = 0 if ω ∉ A

Thus I_A = 1 if A occurs and 0 if A does not occur. (Sometimes I_A is called the "characteristic function" of A, but we do not use this terminology since we reserve the term "characteristic function" for something quite different.) The expectation of I_A is given by

    E(I_A) = 0·P{I_A = 0} + 1·P{I_A = 1} = P{I_A = 1} = P(A)

The "method of indicators" simply involves expressing, if possible, a given random variable R as a sum of indicators, say R = I_{A1} + ··· + I_{An}. Then

    E(R) = Σ_{j=1}^{n} E(I_{Aj}) = Σ_{j=1}^{n} P(Aj)

Hopefully, it will be easier to compute the P(Aj) than to evaluate E(R) directly.
Let R be the number of successes in n Bernoulli trials, with probability of success p on a given trial; then R has the binomial distribution with parameters n and p; that is,

P{R = k} = C(n, k) p^k (1 − p)^{n−k},  k = 0, 1, ..., n

We have found by a direct evaluation that E(R) = np (Problem 7, Section 3.2), but the method of indicators does the job more smoothly. Let Ai be the event that there is a success on trial i, i = 1, 2, ..., n. Then R = I_{A1} + ··· + I_{An} (note that I_{Ai} may be regarded as the number of successes on trial i). Thus

E(R) = Σ_{i=1}^n E(I_{Ai}) = Σ_{i=1}^n P(Ai) = np

Now, since A1, ..., An are independent, the indicators I_{A1}, ..., I_{An} are independent (Problem 1), and so there is a bonus, namely (by property 7, Section 3.3),

Var R = Σ_{i=1}^n Var I_{Ai}

But I_{Ai}² = I_{Ai}; hence, by (3.3.1),

Var I_{Ai} = E(I_{Ai}²) − [E(I_{Ai})]² = p − p² = p(1 − p)

Therefore

Var R = np(1 − p)
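As a numerical check of Example 1 (not part of the original text), the following Python sketch computes E(R) and Var R exactly from the binomial probability function, for arbitrarily chosen n and p, and compares them with np and np(1 − p):

```python
from math import comb

# Exact mean and variance of the binomial distribution, summed directly
# over the probability function; n and p are chosen arbitrarily.
n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))
print(mean, var)  # np = 3.0 and np(1-p) = 2.1, up to rounding
```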
Example 2. A single unbiased die is tossed independently n times. Let R1 be the number of 1's obtained, and R2 the number of 2's. Find E(R1R2).

If Ai is the event that the ith toss results in a 1, and Bi the event that the ith toss results in a 2, then

R1 = I_{A1} + ··· + I_{An},  R2 = I_{B1} + ··· + I_{Bn}

Hence

E(R1R2) = Σ_{i,j=1}^n E(I_{Ai} I_{Bj})

Now if i ≠ j, I_{Ai} and I_{Bj} are independent (see Problem 1); hence E(I_{Ai} I_{Bj}) = E(I_{Ai})E(I_{Bj}) = 1/36. If i = j, Ai and Bi are disjoint, since the ith toss cannot simultaneously result in a 1 and a 2. Thus I_{Ai} I_{Bi} = I_{Ai ∩ Bi} = 0 (see Problem 2). Thus

E(R1R2) = n(n − 1)/36

since there are n(n − 1) ordered pairs (i, j) of integers in {1, 2, ..., n} such that i ≠ j. Note that the I_{Ai} I_{Bj}, i, j = 1, ..., n, are not independent [for instance, if I_{A1}(ω)I_{B2}(ω) = 1, then I_{A2}(ω)I_{B3}(ω) must be 0], so that we cannot compute the variance of R1R2 in the same way as in Example 1.
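The value E(R1R2) = n(n − 1)/36 can be confirmed by brute force for a small n; this sketch (not in the original text) enumerates all 6³ equally likely outcomes of three tosses of a fair die:

```python
from itertools import product

# Enumerate all outcomes of n = 3 tosses of a fair die and average R1*R2.
n = 3
total = 0
for outcome in product(range(1, 7), repeat=n):
    r1 = outcome.count(1)   # number of 1's
    r2 = outcome.count(2)   # number of 2's
    total += r1 * r2
expectation = total / 6**n
print(expectation)  # n(n-1)/36 = 6/36
```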
PROBLEMS

1. If the events A1, ..., An are independent, show that the indicators I_{A1}, ..., I_{An} are independent random variables, and conversely.

2. Establish the following properties of indicators:
(a) I_Ω = 1, I_∅ = 0
(b) I_{A∩B} = I_A I_B;  I_{A∪B} = I_A + I_B − I_{A∩B}
(c) I_{∪ Ai} = Σ_i I_{Ai} if the Ai are disjoint
(d) If A1, A2, ... is an expanding sequence of events (An ⊂ A_{n+1} for all n) and ∪_n An = A, or if A1, A2, ... is a contracting sequence (A_{n+1} ⊂ An for all n) and ∩_n An = A, then I_{An} → I_A; that is, lim_{n→∞} I_{An}(ω) = I_A(ω) for all ω.

3. In Example 2, find the joint probability function of R1 and R2. Notice how unwieldy is the direct expression for E(R1R2):

E(R1R2) = Σ_{j,k=0}^n jk P{R1 = j, R2 = k}

4. In a sequence of n Bernoulli trials, let R0 be the number of times a success is followed immediately by a failure. For example, if n = 7 and ω = (SSFFSFS), then R0(ω) = 2, as indicated. Find E(R0).

5. Find Var R0 in Problem 4.

6. 100 balls are tossed independently and at random into 50 boxes. Let R be the number of empty boxes. Find E(R).
3.6 SOME PROPERTIES OF THE NORMAL DISTRIBUTION
Let R1 be normally distributed with mean m and variance σ². If R2 = aR1 + b, a ≠ 0, we shall show that R2 is also normally distributed [necessarily E(R2) = am + b, Var R2 = a²σ², by properties 1, 2, and 6 of Section 3.3]. We may use the technique of Section 2.4 to find the density of R2. R2 = y corresponds to R1 = h(y) = (y − b)/a. Thus

f2(y) = f1(h(y)) |h′(y)| = (1/|a|) f1((y − b)/a) = [1/(√(2π) |a| σ)] exp[−(y − (am + b))² / (2a²σ²)]

so that R2 has the normal density with mean am + b and variance a²σ².

We may use this result in the calculation of probabilities of events involving a normally distributed random variable. If R has the normal density with E(R) = m, Var R = σ², then

P{c < R ≤ d} = ∫_c^d [1/(σ√(2π))] exp[−(x − m)²/(2σ²)] dx

One must resort to tables to evaluate this. The point we wish to bring out is that, regardless of m and σ², only one table is needed, namely, that of the normal distribution function when m = 0, σ² = 1; that is,

F*(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt

For if R is normally distributed with E(R) = m, Var R = σ², then R* = (R − m)/σ is normally distributed with E(R*) = 0 and Var R* = 1. Thus

P{c < R ≤ d} = P{(c − m)/σ < R* ≤ (d − m)/σ} = F*((d − m)/σ) − F*((c − m)/σ)

A brief table of values of F* is given at the end of the book.

REMARK. If a random variable has a density function f that is symmetrical about 0 [i.e., an even function: f(−x) = f(x)], then the distribution function has the property that F(−x) = 1 − F(x). For (see Figure 3.6.1)

F(−x) = P{R ≤ −x} = ∫_{−∞}^{−x} f(t) dt = ∫_x^∞ f(t) dt = P{R ≥ x} = 1 − F(x)

FIGURE 3.6.1 Symmetrical Density.

In particular, the distribution function F* has this property, and thus once the values of F*(x) for positive x are known, the values of F*(x) for negative x are determined.
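The standardization procedure is easy to carry out by machine, since F* is related to the error function by F*(x) = ½[1 + erf(x/√2)]. The following sketch (with illustrative numbers not taken from the text) computes a normal probability without tables:

```python
from math import erf, sqrt

# Standard normal distribution function F*, via the error function.
def F_star(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# If R is normal with (hypothetical) mean m = 2 and sigma = 2, then
# P{c < R <= d} = F*((d - m)/sigma) - F*((c - m)/sigma).
m, sigma = 2.0, 2.0
c, d = 1.0, 5.0
prob = F_star((d - m) / sigma) - F_star((c - m) / sigma)
print(prob)  # F*(1.5) - F*(-0.5), about .62
```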
PROBLEMS

1. Let R be normally distributed with m = 1, σ² = 9.
(a) Find P{−.5 < R < 4}.
(b) If P{R > c} = .9, find c.

2. If R is normally distributed and k is a positive real number, show that P{|R − m| ≥ kσ} does not depend on m or σ; thus one can speak unambiguously of the "probability that a normally distributed random variable lies at least k standard deviations from its mean." Show that when k = 1.96, the probability is .05.
3.7 CHEBYSHEV'S INEQUALITY AND THE WEAK LAW OF LARGE NUMBERS
In this section we are going to prove a result that corresponds to the physical statement that the arithmetic average of a very large number of independent observations of a random variable R is very likely to be very close to E(R). We first establish a quantitative result about the variance as a measure of dispersion.

Theorem 1.

(a) Let R be a nonnegative random variable, and b a positive real number. Then

P{R ≥ b} ≤ E(R)/b

PROOF. We first consider the absolutely continuous case. We have

E(R) = ∫_{−∞}^∞ x f_R(x) dx = ∫_0^∞ x f_R(x) dx

since R ≥ 0, so that f_R(x) = 0, x < 0. Now if we drop the integral from 0 to b, we get something smaller:

E(R) ≥ ∫_b^∞ x f_R(x) dx

Since x ≥ b on the range of integration,

∫_b^∞ x f_R(x) dx ≥ ∫_b^∞ b f_R(x) dx = b P{R ≥ b}

This is the desired result.

The general proof is based on the same idea. Let A_b = {R ≥ b}; then R ≥ R·I_{A_b}. For if ω ∉ A_b, this says simply that R(ω) ≥ 0; if ω ∈ A_b, it says that R(ω) ≥ R(ω). Thus E(R) ≥ E(R·I_{A_b}). But R·I_{A_b} ≥ b·I_{A_b}, since ω ∈ A_b implies that R(ω) ≥ b. Thus

E(R) ≥ E(R·I_{A_b}) ≥ E(b·I_{A_b}) = b·E(I_{A_b}) = b·P(A_b)

Consequently P(A_b) ≤ E(R)/b, as desired.
(b) Let R be an arbitrary random variable, c any real number, and ε and m positive real numbers. Then

P{|R − c| ≥ ε} ≤ E[|R − c|^m] / ε^m

PROOF.

P{|R − c| ≥ ε} = P{|R − c|^m ≥ ε^m} ≤ E[|R − c|^m] / ε^m    by (a)

(c) If R has finite mean m and finite variance σ² > 0, and k is a positive real number, then

P{|R − m| ≥ kσ} ≤ 1/k²

PROOF. This follows from (b) with c = m, ε = kσ, and the exponent m = 2.

All three parts of Theorem 1 go under the name of Chebyshev's inequality. Part (c) says that the probability that a random variable will fall k or more standard deviations from its mean is ≤ 1/k². Notice that nothing at all is said about the distribution function of R; Chebyshev's inequality is therefore quite a general statement. When applied to a particular case, however, it may be quite weak. For example, let R be normally distributed with mean m and variance σ². Then (Problem 2, Section 3.6) P{|R − m| ≥ 1.96σ} = .05. In this case Chebyshev's inequality says only that

P{|R − m| ≥ 1.96σ} ≤ 1/(1.96)² = .26

which is a much weaker statement. The strength of Chebyshev's inequality lies in its universality. We are now ready for the main result.
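The comparison just made (exact normal tail .05 versus Chebyshev bound .26) can be reproduced numerically; the sketch below is not part of the text:

```python
from math import erf, sqrt

# Exact normal tail P{|R - m| >= k*sigma} versus the Chebyshev bound 1/k**2.
def F_star(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

k = 1.96
exact = 2.0 * (1.0 - F_star(k))   # same for every normal R, by standardization
bound = 1.0 / k**2                # valid for any distribution with finite variance
print(exact, bound)  # about .05 versus about .26
```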
Theorem 2 (Weak Law of Large Numbers). For each n = 1, 2, ..., suppose that R1, R2, ..., Rn are independent random variables on a given probability space, each having finite mean and variance. Assume that the variances are uniformly bounded; that is, assume that there is some finite positive number M such that σi² ≤ M for all i. Let Sn = Σ_{i=1}^n Ri. Then, for any ε > 0,

P{ |Sn − E(Sn)|/n ≥ ε } → 0  as n → ∞

Before proving the theorem, we consider two cases of interest.

SPECIAL CASES.

1. Suppose that E(Ri) = m for all i, and Var Ri = σ² for all i. Then

E(Sn) = Σ_{i=1}^n E(Ri) = nm

so that

P{ |(R1 + ··· + Rn)/n − m| ≥ ε } → 0  as n → ∞

Therefore, for any arbitrary ε > 0, there is for large n a high probability that the arithmetic average (R1 + ··· + Rn)/n and the expectation m will differ by < ε.

This case covers the situation when R1, R2, ..., Rn are independent observations of a given random variable R. All this means is that R1, ..., Rn are independent, and the Ri all have the same distribution function, namely, F_R. In particular, E(Ri) = E(R), so that for large n there is a high probability that (R1 + ··· + Rn)/n and E(R) will differ by < ε.

2. Consider a sequence of Bernoulli trials, and let Ri be the number of successes on trial i; that is, Ri = I_{Ai}, where Ai = {success on trial i}. Then (R1 + ··· + Rn)/n is the relative frequency of successes in n trials. Now E(Ri) = P(Ai) = p, the probability of success on a given trial, so that for large n there is a high probability that the relative frequency will differ from p by < ε.

PROOF OF THEOREM 2. By the second form of Chebyshev's inequality [part (b) of Theorem 1], with R = (Sn − E(Sn))/n, c = 0, and m = 2, we have

P{ |Sn − E(Sn)|/n ≥ ε } ≤ Var Sn / (n²ε²)

But since R1, ..., Rn are independent,

Var Sn = Σ_{i=1}^n Var Ri    by property 7 of Section 3.3
       ≤ nM    by hypothesis

Thus

P{ |Sn − E(Sn)|/n ≥ ε } ≤ nM/(n²ε²) = M/(nε²) → 0  as n → ∞

REMARK. If a coin with probability p of heads is tossed indefinitely, the successive tosses being independent, we expect that as a practical matter the relative frequency of heads will converge, in the ordinary sense of convergence of a sequence of real numbers, to p. This is a somewhat stronger statement than the weak law of large numbers, which says that for large n the relative frequency of heads in n trials is very likely to be very close to p. The first statement, when properly formulated, becomes the strong law of large numbers, which we shall examine in detail later.
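Special Case 2 is easy to watch in a simulation. The sketch below (seeded for reproducibility; not part of the text) tosses a fair coin n times and reports the relative frequency of heads, which should be close to p = 1/2 for large n:

```python
import random

# Relative frequency of heads in n independent tosses of a fair coin.
random.seed(0)          # fixed seed so the run is reproducible
p, n = 0.5, 10000
heads = sum(1 for _ in range(n) if random.random() < p)
freq = heads / n
print(freq)  # close to 0.5 for large n
```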
PROBLEMS

1. Let R have the exponential density f(x) = e^{−x}, x ≥ 0; f(x) = 0, x < 0. Evaluate P{|R − m| ≥ kσ} and compare with the Chebyshev bound.

2. Suppose that we have a sequence of random variables Rn such that P{Rn = e^n} = 1/n, P{Rn = 0} = 1 − 1/n, n = 1, 2, ....
(a) State and prove a theorem that expresses the fact that for large n, Rn is very likely to be 0.
(b) Show that E(Rn^k) → ∞ as n → ∞ for any k > 0.

3. Suppose that Rn is the amount you win on trial n in a game of chance. Assume that the Ri are independent random variables, each with finite mean m and finite variance σ². Make the realistic assumption that m < 0. Show that P{(R1 + ··· + Rn)/n < m/2} → 1 as n → ∞. What is the moral of this result?
Conditional Probability and Expectation

4.1 INTRODUCTION
We have thus far defined the conditional probability P(B | A) only when P(A) > 0. However, there are many situations in which it is natural to talk about a conditional probability given an event of probability 0. For example, suppose that a real number R is selected at random, with density f. If R takes the value x, a coin with probability of heads g(x) is tossed (0 ≤ g(x) ≤ 1). It is natural to assert that the conditional probability of obtaining a head, given R = x, is g(x). But since R is absolutely continuous, the event {R = x} has probability 0, and thus conditional probabilities given R = x are not as yet defined.

If we ignore this problem for the moment, we can find the over-all probability of obtaining a head by the following intuitive argument. The probability that R will fall into the interval (x, x + dx] is roughly f(x) dx; given that R falls into this interval, the probability of a head is roughly g(x). Thus we should expect, from the theorem of total probability, that the probability of a head will be Σ_x g(x)f(x) dx, which approximates ∫_{−∞}^∞ g(x)f(x) dx. Thus the probability in question is a weighted average of conditional probabilities, the weights being assigned in accordance with the density f.
FIGURE 4.1.1

Let us examine what is happening here. We have two random variables R1 and R2 [R1 = R, R2 = (say) the number of heads obtained]. We are specifying the density of R1, and for each x and each Borel set B we are specifying a quantity P_x(B) that is to be interpreted intuitively as the conditional probability that R2 ∈ B given that R1 = x. (We shall often write P{R2 ∈ B | R1 = x} for P_x(B).) We would like to conclude that the probabilities of all events involving R1 and R2 are now determined. Suppose that C is a two-dimensional Borel set. What is a reasonable figure for P{(R1, R2) ∈ C}? Intuitively, the probability that R1 falls into (x, x + dx] is f1(x) dx. Given that this happens, that is, (roughly) given R1 = x, the only way (R1, R2) can lie in C is if R2 belongs to the "section" C_x = {y : (x, y) ∈ C} (see Figure 4.1.1a). This happens with probability P_x(C_x). Thus we expect that the total probability that (R1, R2) will belong to C is

∫_{−∞}^∞ P_x(C_x) f1(x) dx

In particular, if C = A × B = {(x, y) : x ∈ A, y ∈ B} (see Figure 4.1.1b), then

C_x = ∅ if x ∉ A;  C_x = B if x ∈ A

Thus

P{(R1, R2) ∈ C} = P{R1 ∈ A, R2 ∈ B} = ∫_A P_x(B) f1(x) dx

The above reasoning may be formalized as follows. Let Ω = E², ℱ = Borel subsets, R1(x, y) = x, R2(x, y) = y. Let f1 be a density function on E¹, that is, a nonnegative function such that ∫_{−∞}^∞ f1(x) dx = 1. Suppose that for each real x we are given a probability measure P_x on the Borel subsets of E¹. Assume also that P_x(B) is a piecewise continuous function of x for each fixed B. Then it turns out that there is a unique probability measure P on ℱ such
that for all Borel subsets A, B of E¹,

P{R1 ∈ A, R2 ∈ B} = ∫_A P_x(B) f1(x) dx    (4.1.1)

Thus the requirement (4.1.1), which may be regarded as a continuous version of the theorem of total probability, determines P uniquely. In fact, if C ∈ ℱ, P(C) is given explicitly by

P(C) = ∫_{−∞}^∞ P_x(C_x) f1(x) dx    (4.1.2)
Notice that if R1(x, y) = x, R2(x, y) = y, then

P(A × B) = P{R1 ∈ A, R2 ∈ B}

Furthermore, the distribution function of R1 is given by

F1(x0) = P{R1 ≤ x0} = P{R1 ∈ A, R2 ∈ B}  where A = (−∞, x0], B = (−∞, ∞)
       = ∫_A P_x(B) f1(x) dx = ∫_{−∞}^{x0} f1(x) dx

Thus f1 is in fact the density of R1. Notice also that

P{R2 ∈ B} = P{R1 ∈ A, R2 ∈ B}  where A = (−∞, ∞); hence

P{R2 ∈ B} = ∫_{−∞}^∞ P_x(B) f1(x) dx    (4.1.3)

To summarize: If we start with a density for R1 and a set of probabilities P_x(B) that we interpret as P{R2 ∈ B | R1 = x}, the probabilities of events of the form {(R1, R2) ∈ C} are determined in a natural way, if you believe that there should be a continuous version of the theorem of total probability; P{(R1, R2) ∈ C} is given explicitly by (4.1.2), which reduces to (4.1.1) in the special case when C = A × B.

We have not yet answered the question of how to define P{R2 ∈ B | R1 = x} for arbitrarily specified random variables R1 and R2; we attack this problem later in the chapter. Instead we have approached the problem in a somewhat oblique way. However, there are many situations in which one specifies the density of R1, and then the conditional probability of events involving R2, given R1 = x. We now know how to formulate such problems precisely.

Consider again the problem at the beginning of the section. If R1 has density f1, and a coin with probability of heads g(x) is tossed whenever R1 = x (and
a head corresponds to R2 = 1, a tail to R2 = 0), then the probability of obtaining a head is

P{R2 = 1} = ∫_{−∞}^∞ P{R2 = 1 | R1 = x} f1(x) dx = ∫_{−∞}^∞ g(x) f1(x) dx    by (4.1.3)

in agreement with the previous intuitive argument.

4.2 EXAMPLES

We apply the general results of this section to some typical special cases.

Example 1.
A point is chosen with uniform density between 0 and 1. If the number R1 selected is x, then a coin with probability x of heads is tossed independently n times. If R2 is the resulting number of heads, find p2(k) = P{R2 = k}, k = 0, 1, ..., n.

Here we have f1(x) = 1, 0 ≤ x ≤ 1; f1(x) = 0 elsewhere. Also

P_x{k} = P{R2 = k | R1 = x} = C(n, k) x^k (1 − x)^{n−k}

By (4.1.3),

P{R2 = k} = ∫_0^1 C(n, k) x^k (1 − x)^{n−k} dx

This is an instance of the beta function, defined by

β(r, s) = ∫_0^1 x^{r−1} (1 − x)^{s−1} dx,  r, s > 0

It can be shown that the beta function can be expressed in terms of the gamma function [see (3.2.2)] by

β(r, s) = Γ(r)Γ(s)/Γ(r + s)    (4.2.1)

(see Problem 1). Thus

p2(k) = C(n, k) β(k + 1, n − k + 1) = C(n, k) Γ(k + 1)Γ(n − k + 1)/Γ(n + 2)
      = C(n, k) k!(n − k)!/(n + 1)! = 1/(n + 1),  k = 0, 1, ..., n
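The striking answer 1/(n + 1) can be checked by approximating the defining integral directly (a sketch with arbitrarily chosen n and k, not from the text):

```python
from math import comb

# Midpoint-rule approximation of C(n,k) * Integral_0^1 x**k (1-x)**(n-k) dx.
n, k = 5, 2
steps = 200000
h = 1.0 / steps
integral = sum(comb(n, k) * ((i + 0.5) * h)**k * (1 - (i + 0.5) * h)**(n - k)
               for i in range(steps)) * h
print(integral)  # about 1/(n+1) = 1/6
```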
Example 2. A nonnegative number R1 is chosen with the density f1(x) = xe^{−x}, x ≥ 0; f1(x) = 0, x < 0. If R1 = x, a number R2 is chosen with uniform density between 0 and x. Find P{R1 + R2 ≤ 2}.

Now we must have 0 ≤ R2 ≤ R1; hence, if 0 ≤ R1 ≤ 1, then necessarily R1 + R2 ≤ 2. If 1 < R1 < 2, then R1 + R2 ≤ 2 provided that R2 ≤ 2 − R1. If R1 ≥ 2, then R1 + R2 cannot be ≤ 2. By (4.1.2),

P{R1 + R2 ≤ 2} = ∫_0^1 xe^{−x}(1) dx + ∫_1^2 xe^{−x} P{R2 ≤ 2 − x | R1 = x} dx + ∫_2^∞ xe^{−x}(0) dx

Given R1 = x, R2 is uniformly distributed between 0 and x; thus

P{R2 ≤ 2 − x | R1 = x} = (2 − x)/x,  1 < x < 2

(see Figure 4.2.1). Therefore

P{R1 + R2 ≤ 2} = ∫_0^1 xe^{−x} dx + ∫_1^2 xe^{−x} ((2 − x)/x) dx = 1 − 2e^{−1} + e^{−2}
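The answer to Example 2 can be verified by numerical integration (a sketch, not in the original text):

```python
from math import exp

# Midpoint-rule evaluation of
#   P{R1 + R2 <= 2} = Integral_0^inf x e**-x P{R2 <= 2 - x | R1 = x} dx.
def conditional(x):              # P{R2 <= 2 - x | R1 = x}
    if x <= 1.0:
        return 1.0
    if x < 2.0:
        return (2.0 - x) / x
    return 0.0

steps = 200000
h = 2.0 / steps                  # the integrand vanishes for x >= 2
prob = sum(((i + 0.5) * h) * exp(-(i + 0.5) * h) * conditional((i + 0.5) * h)
           for i in range(steps)) * h
print(prob)  # about 1 - 2/e + e**-2
```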
Example 3.
Let R1 be a discrete random variable, taking on the values x1, x2, ... with probabilities p(x1), p(x2), .... If R1 = xi, a random variable R2 is observed, where R2 has density fi. What is P{(R1, R2) ∈ C}?

FIGURE 4.2.1 Conditional Probability Calculation.

This is not quite the situation we considered in Section 4.1, since R1 is discrete. However, the theorem of total probability should still be in force. R1 takes the value xi with probability p(xi); given that R1 = xi, the probability that R2 ∈ B is P_{xi}(B) = ∫_B fi(y) dy. Thus we should have

P{R1 ∈ A, R2 ∈ B} = Σ_{xi ∈ A} p(xi) ∫_B fi(y) dy    (4.2.2)

and, more generally,

P{(R1, R2) ∈ C} = Σ_i p(xi) ∫_{C_{xi}} fi(y) dy    (4.2.3)

In fact, if we take Ω = E², ℱ = Borel sets, R1(x, y) = x, R2(x, y) = y, it turns out that there is a unique probability measure P on ℱ satisfying (4.2.2) for all Borel subsets A, B of E¹; P is given explicitly by (4.2.3).
PROBLEMS

1. Derive formula (4.2.1). HINT: In Γ(r) = ∫_0^∞ t^{r−1} e^{−t} dt, let t = x². Then write Γ(r)Γ(s) as a double integral and switch to polar coordinates.

2. In Example 2, what are the sets C and C_x in (4.1.2)? What is P_x(C_x)?

3. In Example 3, suppose that R1 takes on positive integer values 1, 2, ... with probabilities p1, p2, ... (pi ≥ 0, Σ_{i=1}^∞ pi = 1). If R1 = n, R2 is selected according to the density f_n(x) = ne^{−nx}, x ≥ 0; f_n(x) = 0, x < 0. Find the probability that 4 ≤ R1 + R2 ≤ 6.

4. In Example 3 we specified P_{xi}(B), to be interpreted intuitively as the probability that R2 ∈ B, given that R1 = xi. This, plus the specification of p(xi), i = 1, 2, ..., determines the probability measure P. Use (4.2.2) to show that if p(xi) > 0 then P{R2 ∈ B | R1 = xi} = P_{xi}(B), thus justifying the intuition. In other words, the conditional probability as computed from the probability measure P coincides with the original specification.

5. A number R1 is chosen with density f1(x) = 1/x², x ≥ 1; f1(x) = 0, x < 1. If R1 = x, let R2 be uniformly distributed between 0 and x. Find the distribution and density functions of R2.
4.3 CONDITIONAL DENSITY FUNCTIONS

We have seen that specification of the distribution or density function of a random variable R1, together with P_x(B) (for all real x and Borel subsets B of E¹), interpreted intuitively as the conditional probability that R2 ∈ B, given R1 = x, determines the probability of all events of the form {(R1, R2) ∈ C}. However, this has not resolved the difficulty of defining conditional probabilities given events of probability 0. If we are given random variables R1 and R2 with a particular joint distribution function, we can ask whether it is possible to define in a meaningful way the conditional probability P{R2 ∈ B | R1 = x}, even though the event {R1 = x} may have probability 0 for some, in fact perhaps for all, x. We now consider this question in the case in which R1 and R2 have a joint density f.
A reasonable approach to the conditional probability P{R2 ∈ B | R1 = x0} is to look at P{R2 ∈ B | x0 − h < R1 ≤ x0 + h} and let h → 0. Now

P{x0 − h < R1 ≤ x0 + h, R2 ∈ B} = ∫_{x0−h}^{x0+h} ∫_B f(x, y) dy dx

which for small h should look like 2h ∫_B f(x0, y) dy. But P{x0 − h < R1 ≤ x0 + h} looks like 2h f1(x0) for small h, where f1(x) = ∫_{−∞}^∞ f(x, y) dy is the density of R1. Thus, as h → 0, it appears that under appropriate conditions P{R2 ∈ B | x − h < R1 ≤ x + h} should approach ∫_B [f(x, y)/f1(x)] dy, so that we find conditional probabilities involving R2, given R1 = x, by integrating f(x, y)/f1(x) with respect to y. We are led to define the conditional density of R2 given R1 = x (or, for short, the conditional density of R2 given R1) as

h(y | x) = f(x, y)/f1(x)    (4.3.1)

Since ∫_{−∞}^∞ f(x, y) dy = f1(x) (see Section 2.7), we have ∫_{−∞}^∞ h(y | x) dy = 1, so that h(y | x), regarded as a function of y, is a legitimate density.

Notice that the conditional density is defined only when f1(x) > 0. However, we may essentially ignore those (x, y) at which the conditional density is not defined. For let S = {(x, y) : f1(x) = 0}. We can show that P{(R1, R2) ∈ S} = 0:

P{(R1, R2) ∈ S} = ∫∫_S f(x, y) dx dy = ∫_{{x : f1(x)=0}} [∫_{−∞}^∞ f(x, y) dy] dx = ∫_{{x : f1(x)=0}} f1(x) dx = 0

We define the conditional probability that R2 belongs to the Borel set B, given that R1 = x, as

P_x(B) = P{R2 ∈ B | R1 = x} = ∫_B h(y | x) dy    (4.3.2)

We can ask whether this is a sensible definition of conditional probability. We have set up our own ground rules to answer this question: "sensible" means that the theorem of total probability holds. Let us check that in fact (4.1.1) [and hence (4.1.2)] holds. We have

P{R1 ∈ A, R2 ∈ B} = ∫_{x∈A} ∫_{y∈B} f(x, y) dy dx = ∫_A f1(x) [∫_{y∈B} h(y | x) dy] dx = ∫_A P_x(B) f1(x) dx

which is (4.1.1).
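The definition (4.3.1) guarantees that h(y | x), as a function of y, integrates to 1, and this is easy to see numerically. The sketch below uses a hypothetical joint density chosen only for illustration (it is not from the text):

```python
# Hypothetical joint density f(x, y) = x + y on the unit square; then
# f1(x) = x + 1/2, and h(y|x) = (x + y)/(x + 1/2) should integrate to 1 in y.
def f(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

x = 0.3
steps = 100000
dy = 1.0 / steps
f1 = sum(f(x, (i + 0.5) * dy) for i in range(steps)) * dy       # marginal density at x
total = sum(f(x, (i + 0.5) * dy) / f1 for i in range(steps)) * dy  # integral of h(y|x)
print(f1, total)  # f1 near x + 0.5 = 0.8, total near 1
```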
We have seen that if (R1, R2) has density f(x, y) and R1 has density f1(x), we have a conditional density h(y | x) = f(x, y)/f1(x) for R2, given R1 = x. Let us reverse this process. Suppose that we observe a random variable R1 with density f1(x); if R1 = x, we observe a random variable R2 with density h(y | x). If we accept the continuous version of the theorem of total probability, we may calculate the joint distribution function of R1 and R2 using (4.1.1):

F(x0, y0) = P{R1 ≤ x0, R2 ≤ y0} = ∫_{−∞}^{x0} P{R2 ≤ y0 | R1 = x} f1(x) dx
          = ∫_{−∞}^{x0} [∫_{−∞}^{y0} h(y | x) dy] f1(x) dx = ∫_{−∞}^{x0} ∫_{−∞}^{y0} f1(x) h(y | x) dy dx

Thus (R1, R2) has a density given by f(x, y) = f1(x)h(y | x), in agreement with (4.3.1).

To summarize: We may look at the formula f(x, y) = f1(x)h(y | x) in two ways.

1. If (R1, R2) has density f(x, y), we have a natural notion of conditional probability.

2. If R1 has density f1(x), and whenever R1 = x we select R2 with density h(y | x), then in the natural formulation of this problem (R1, R2) has density f(x, y) = f1(x)h(y | x).

In both cases "natural" indicates that (4.1.1), the continuous version of the theorem of total probability, is required to hold.

We may extend these results to higher dimensions. For example, if (R1, R2, R3, R4) has density f(x1, x2, x3, x4), we define (say) the conditional density of (R3, R4) given (R1, R2), as

h(x3, x4 | x1, x2) = f(x1, x2, x3, x4) / f12(x1, x2)

where

f12(x1, x2) = ∫_{−∞}^∞ ∫_{−∞}^∞ f(x1, x2, x3, x4) dx3 dx4

The conditional probability that (R3, R4) belongs to the two-dimensional Borel set B, given that R1 = x1, R2 = x2, is defined by

P_{x1 x2}(B) = P{(R3, R4) ∈ B | R1 = x1, R2 = x2} = ∫∫_B h(x3, x4 | x1, x2) dx3 dx4

The appropriate version of the theorem of total probability is

P{(R1, R2) ∈ A, (R3, R4) ∈ B} = ∫∫_A P_{x1 x2}(B) f12(x1, x2) dx1 dx2

If (R1, R2) has density f12(x1, x2), and having observed R1 = x1, R2 = x2, we select (R3, R4) with density h(x3, x4 | x1, x2), then (R1, R2, R3, R4) must have density f(x1, x2, x3, x4) = f12(x1, x2)h(x3, x4 | x1, x2).

Let us do some examples.
Example 1. We arrive at a bus stop at time t = 0. Two buses A and B are in operation. The arrival time R1 of bus A is uniformly distributed between 0 and tA minutes, and the arrival time R2 of bus B is uniformly distributed between 0 and tB minutes, with tA ≤ tB. The arrival times are independent. Find the probability that bus A will arrive first.

We are looking for the probability that R1 < R2. Since R1 and R2 are independent (and have a joint density), the conditional density of R2 given R1 is

h(y | x) = f(x, y)/f1(x) = f2(y) = 1/tB,  0 ≤ y ≤ tB

If bus A arrives at x, 0 ≤ x ≤ tA, it will be first provided that bus B arrives between x and tB; this happens with probability (tB − x)/tB. Thus, by (4.1.2),

P{R1 < R2} = ∫_0^{tA} (1/tA)(1 − x/tB) dx = 1 − tA/(2tB)

[Formally, taking the sample space as E², we have C = {R1 < R2} = {(x, y) : x < y}, C_x = {y : x < y}, P_x(C_x) = P{R1 < R2 | R1 = x} = 1 − x/tB, 0 ≤ x ≤ tA.]

Alternatively, we may simply use the joint density (see Figure 4.3.1):

P{R1 < R2} = ∫∫_{x<y} f(x, y) dx dy = ∫_0^{tA} ∫_x^{tB} [1/(tA tB)] dy dx = 1 − tA/(2tB)

as before.

FIGURE 4.3.1 Bus Problem.
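The bus-stop answer can be checked numerically (a sketch with arbitrarily chosen tA and tB, not from the text):

```python
# Midpoint-rule evaluation of P{R1 < R2} = Integral_0^tA (1/tA)(1 - x/tB) dx.
tA, tB = 10.0, 30.0
steps = 2000
dx = tA / steps
prob = sum((1.0 / tA) * (1.0 - ((i + 0.5) * dx) / tB) for i in range(steps)) * dx
print(prob)  # 1 - tA/(2*tB) = 5/6
```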
Example 2. Let R0 be a nonnegative random variable with density f0(λ) = e^{−λ}, λ ≥ 0. If R0 = λ, we take n independent observations R1, R2, ..., Rn, each Ri having the exponential density f_λ(y) = λe^{−λy}, y ≥ 0 (= 0 for y < 0). Find the conditional density of R0 given (R1, R2, ..., Rn).

Here we have specified f0(λ), the density of R0, and the conditional density of (R1, R2, ..., Rn) given R0, namely,

h(x1, ..., xn | λ) = λ^n e^{−λ(x1 + ··· + xn)},  x1, ..., xn ≥ 0

by the independence assumption. The joint density of R0, R1, ..., Rn is therefore

f(λ, x1, ..., xn) = f0(λ) h(x1, ..., xn | λ) = λ^n e^{−λ(1+x)},  where x = x1 + ··· + xn

The joint density of R1, ..., Rn is given by

g(x1, ..., xn) = ∫_{−∞}^∞ f(λ, x1, ..., xn) dλ = ∫_0^∞ λ^n e^{−λ(1+x)} dλ = n!/(1 + x)^{n+1}

(with the substitution y = λ(1 + x)). Thus the conditional density of R0 given (R1, ..., Rn) is

h(λ | x1, ..., xn) = f(λ, x1, ..., xn)/g(x1, ..., xn) = (1/n!) λ^n e^{−λ(1+x)} (1 + x)^{n+1},
  λ, x1, ..., xn ≥ 0, x = x1 + ··· + xn
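The conditional density obtained in Example 2 should itself be a legitimate density in λ; the sketch below (arbitrary n and x, not from the text) checks this numerically:

```python
from math import exp, factorial

# Numerically integrate h(lam | x1,...,xn) =
#   lam**n * exp(-lam*(1+x)) * (1+x)**(n+1) / n!   over lam >= 0.
n, x = 3, 2.0
steps, lam_max = 200000, 60.0     # the tail beyond lam_max is negligible
d = lam_max / steps
total = sum(((i + 0.5) * d)**n * exp(-((i + 0.5) * d) * (1 + x)) * (1 + x)**(n + 1)
            for i in range(steps)) * d / factorial(n)
print(total)  # near 1
```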
PROBLEMS

1. Let (R1, R2) have density f(x, y) = e^{−y}, 0 < x < y; f(x, y) = 0 elsewhere. Find the conditional density of R2 given R1, and P{R2 ≤ y | R1 = x}, the conditional distribution function of R2 given R1 = x.

2. Let (R1, R2) have density f(x, y) = k|x|, −1 ≤ x ≤ 1, −1 ≤ y ≤ x; f(x, y) = 0 elsewhere. Find k; also find the individual densities of R1 and R2, the conditional density of R2 given R1, and the conditional density of R1 given R2.

3. (a) If (R1, R2) is uniformly distributed over the set C = {(x, y) : x² + y² ≤ 1}, show that, given R1 = x, R2 is uniformly distributed between −(1 − x²)^{1/2} and +(1 − x²)^{1/2}.
(b) Let (R1, R2) be uniformly distributed over the arbitrary two-dimensional Borel set C [i.e., P(B) = (area of B ∩ C)/(area of C) (= area B/area C if B ⊂ C)]. Show that, given R1 = x, R2 is uniformly distributed on C_x = {y : (x, y) ∈ C}. In other words, h(y | x) is constant for y ∈ C_x, and 0 for y ∉ C_x.

4. In Problem 1, let R3 = R2 − R1. Find the conditional density of R3 given R1 = x. Also find P{1 ≤ R3 ≤ 2 | R1 = x}.

5. Suppose that (R1, R2) has density f and R3 = g(R1, R2). You are asked to compute the conditional distribution function of R3, given R1 = x; that is, P{R3 ≤ z | R1 = x}. How would you go about it?
C O N D I T I O N AL E X P E C T A T I O N
In the preceding sections we considered situations in which two successive observations are made, the second observation depending on the result of the first. The essential ingredient in such problems is the quantity Px(B) , defined for real x and Borel sets B, to be interpreted as the conditional prob ability that the second observation will fall into B, given that the first observa tion takes the value x : for short, P{R 2 E B I R1 = x }. In particular, we may define the conditional distribution function of R 2 given R1 = x, as F2 (y I x) = P{R 2 < y I R1 = x }. If R1 and R 2 have a joint density, this can- be computed from the conditional density of R 2 given R1 : F2 (y0 I x) = j110 h (y I x) dy . In any case, for each real x we have a probability measure Px defined on the Borel subsets of £1 • Now if R1 = x and we observe R 2 , there should be an average value associated with R 2 , that is, a conditional expectation of R 2 given that R1 = x. How should this be computed ? Let us try to set up an appropriate model. We are observing a single random variable R 2 , so let Q = £1 , :F = Borel sets , R 2 (y ) = y . We are not concerned with the prob ability that R 2 E B, but instead with the probability that R 2 E B, given that R1 = x. In other words, the appropriate probability measure is P The expectation of R 2 , computed with respect to Px ' is called the conditional expectation of R 2 given that R1 = x (or, for short, the conditional expectation of R 2 given R1) , written E(R 2 1 R1 = x). Note that if g is a (piecewise continuous) function from £1 to £1 , then g(R 2 ) is also a random variable (see Section 2.7) , so that we may also talk about 00
�·
'
CONDITIONAL EXPECTATION
141
the conditional expectation of g(R 2 ) given R1 = x, written E[g(R 2 ) I R1 = x]. In particular, if there is a conditional density of R 2 given R1 = x, then, by Theorem 2 of Section 3. 1 ,
E [g(R 2) I R1
=
x]
=
J: g(y)h(y I x) dy
(4.4. 1)
if g > 0 or if the integral is absolutely convergent. There is an immediate extension to n dimensions. For example, if there is a conditional density of (R4 , R5) given (R1 , R 2 , R3) , then E[g(R4,
R5) I R1
=
x1 , R 2 = x2, R 3
=
x 3]
00 J 00 J: g(x , x5)h(x4, x5 l x1 , x2 , x3) dx4 dx5 Note also that conditional probability can be obtained from conditional 0 if expectation. If in (4.4. 1 ) we take g(y) IB(Y) 1 if y E B, and y ¢= B, then E [ g(R2 ) I Rl x ] E [IB (R 2 ) I R l x ] s:IB(y)h(y I x) dy Lh(y I x) dy P{R2 E B I R1 x} =
4
=
=
=
=
=
=
=
=
=
=
We have seen previously that P{R 2 E B} = E[I{ R2 EB}]. We now have a similar result under the condition that R 1 = x . [Notice that I (R2 ) = I{ R 2 EB } ; for In(R 2 (w)) = 1 iff R 2 (w) E B, that is, iff I{ R 2 EB} (w) = 1 . ] Let us consider again the examples of Section 4.2 .
B
.,._
1.
R1 is uniformly distributed between 0 and 1; if R1 = x, R2 is the number of heads in n tosses of a coin with probability x of heads. Given that R1 = x, R2 has a binomial distribution with parameters n and x: P{R2 = k | R1 = x} = C(n, k) x^k (1 − x)^{n−k}. It follows that E(R2 | R1 = x) is the average number of successes in n Bernoulli trials, with probability x of success on a particular trial, namely, nx.

Example 2. R1 has density f1(x) = xe^{−x}, x ≥ 0; f1(x) = 0, x < 0. The conditional density of R2 given R1 = x is uniform over [0, x]. It follows that, for x > 0,

E(R2 | R1 = x) = ∫_{−∞}^∞ y h(y | x) dy = ∫_0^x y (1/x) dy = x/2

Similarly,

E[e^{R2} | R1 = x] = ∫_{−∞}^∞ e^y h(y | x) dy = ∫_0^x e^y (1/x) dy = (e^x − 1)/x
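Both conditional expectations in Example 2 are one-line numerical integrations (a sketch with x chosen arbitrarily, not from the text):

```python
from math import exp

# Given R1 = x, R2 is uniform on [0, x]; check E(R2 | R1 = x) = x/2 and
# E(e**R2 | R1 = x) = (e**x - 1)/x by the midpoint rule, for x = 2.
x = 2.0
steps = 200000
dy = x / steps
e_r2 = sum(((i + 0.5) * dy) * (1.0 / x) for i in range(steps)) * dy
e_exp = sum(exp((i + 0.5) * dy) * (1.0 / x) for i in range(steps)) * dy
print(e_r2, e_exp)  # x/2 = 1.0 and (e**2 - 1)/2
```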
1 42
CONDITIONAL PROBABILITY AND EXPECTATION
Example 3. R1 is discrete, with p(xi) = P{R1 = xi}, i = 1, 2, . . . . Given R1 = xi, R2 has density fi; that is,

P{R2 ∈ B | R1 = xi} = ∫_B fi(y) dy

Thus

E[g(R2) | R1 = xi] = ∫_{-∞}^{∞} g(y)fi(y) dy
Now let us consider a slightly different case.

Example 4. Let R1 and R2 be discrete random variables. If R1 = x, then R2 will take the value y with probability

p(y | x) = P{R2 = y | R1 = x} = p12(x, y)/p1(x)

where

p12(x, y) = P{R1 = x, R2 = y},  p1(x) = P{R1 = x}

p(y | x), which is defined provided that p1(x) > 0, will be called the conditional probability function of R2 given R1 = x (or the conditional probability function of R2 given R1, for short). We may find the probability that R2 ∈ B given R1 = x by summing the conditional probability function.

Px(B) = P{R2 ∈ B | R1 = x} = P{R1 = x, R2 ∈ B}/P{R1 = x} = Σ_{y∈B} p12(x, y)/p1(x) = Σ_{y∈B} p(y | x)

Thus, given that R1 = x, the probabilities of events involving R2 are found from the probability function p(y | x), y real. Therefore the conditional expectation of g(R2) given R1 = x is

E[g(R2) | R1 = x] = Σ_y g(y)p(y | x)

In particular,

E(R2 | R1 = x) = Σ_y y p(y | x)    (4.4.2)
There is a feature common to all these examples. In each case the expectation of R2 (or of a function of R2) can be expressed as a weighted average of conditional expectations. Let us look at Example 4 first. With probability p1(x), R1 takes the value x; if R1 = x, the average value of R2 is E(R2 | R1 = x). By analogy with the theorem of total probability, it is reasonable to expect that

E(R2) = Σ_x p1(x)E(R2 | R1 = x)
To justify this, write
E(R2) = Σ_y y p2(y) = Σ_y y P{R2 = y}
      = Σ_y y Σ_x P{R1 = x, R2 = y}    by (2.7.2)
      = Σ_{x,y} y P{R1 = x}P{R2 = y | R1 = x}
      = Σ_x p1(x)[Σ_y y p(y | x)]
This is the desired result. In Example 1 the probability that R1 will lie in an interval of length dx about x is f1(x) dx = dx; given that R1 = x, the average value of R2 is E(R2 | R1 = x) = nx. We expect that

E(R2) = ∫_{-∞}^{∞} f1(x)E(R2 | R1 = x) dx

To verify this, notice that we calculated in Section 4.2 that

P{R2 = k} = 1/(n + 1),  k = 0, 1, . . . , n

Thus

E(R2) = Σ_{k=0}^{n} k P{R2 = k} = (1/(n + 1))(1 + 2 + · · · + n) = (1/(n + 1)) · (n + 1)n/2 = n/2

But

∫_{-∞}^{∞} f1(x)E(R2 | R1 = x) dx = ∫_0^1 nx dx = n/2
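This two-stage computation is easy to test by simulation. The Python sketch below (added here, not from the book) draws R1 uniformly, then tosses n coins with bias R1, and checks that the overall mean of R2 is near n/2; the seed and trial count are arbitrary:

```python
import random

random.seed(1)
n, trials = 5, 200_000
total = 0
for _ in range(trials):
    x = random.random()                                  # R1 uniform on (0, 1)
    total += sum(random.random() < x for _ in range(n))  # R2 | R1 = x binomial(n, x)

print(total / trials)   # close to n/2 = 2.5
```

The Monte Carlo average agrees with the theorem of total expectation to within sampling error.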
In Example 2, the joint density of R1 and R2 is

f(x, y) = f1(x)h(y | x) = xe^{−x} · (1/x) = e^{−x},  x > 0, 0 < y < x

Now

E(R2) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} y f(x, y) dx dy

[Notice that we need not compute f2(y) explicitly; instead we simply regard R2 as a function of R1 and R2; that is, we set g(R1, R2) = R2 and compute

E[g(R1, R2)] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y)f(x, y) dx dy]

Thus

E(R2) = ∫_0^∞ (∫_0^x y dy) e^{−x} dx = ∫_0^∞ (x²/2)e^{−x} dx = 1

But

∫_{-∞}^{∞} f1(x)E(R2 | R1 = x) dx = ∫_0^∞ xe^{−x}(x/2) dx = 1
In Example 3 we have [see (4.2.2)]

P{R2 ∈ B} = Σ_i p(xi) ∫_B fi(y) dy = ∫_B [Σ_i p(xi)fi(y)] dy

so that R2 has density

f2(y) = Σ_i p(xi)fi(y)    (4.4.3)

Thus

E(R2) = ∫_{-∞}^{∞} y f2(y) dy = Σ_i p(xi) ∫_{-∞}^{∞} y fi(y) dy

and consequently

E(R2) = Σ_i p(xi)E(R2 | R1 = xi)

as expected. Results of the form

E(R2) = Σ_x p1(x)E(R2 | R1 = x)    (4.4.4)

or

E(R2) = ∫_{-∞}^{∞} f1(x)E(R2 | R1 = x) dx    (4.4.5)

are called versions of the theorem of total expectation. In the situations we are considering, conditional expectations are derived ultimately from a given set of probabilities Px(B) = P{R2 ∈ B | R1 = x}. In such cases it turns out that if E(R2) exists, (4.4.4) will hold if R1 is discrete, and (4.4.5) will hold if R1 is absolutely continuous. Notice that E(R2 | R1 = x) will in general depend on x and hence may be written as g(x); ∫_{-∞}^{∞} g(x)f1(x) dx in (4.4.5) [or Σ_x g(x)p1(x) in (4.4.4)] is then the expectation of g(R1). Thus (4.4.4) and (4.4.5) may be rephrased as follows.

The expectation of the conditional expectation of R2 given R1 is the (over-all) expectation of R2.
Example 5. Let R be a random variable with the distribution function shown in Figure 4.4.1. Find E(R³).

[FIGURE 4.4.1: the distribution function F, with F(x) = 0 for x < −1, F(x) = 1/4 for −1 < x < 2, a jump to 3/4 at x = 2, and a linear rise from 3/4 to 1 on [2, 3].]
If R were discrete we would compute

E(R³) = Σ_x x³ pR(x)

and if R were absolutely continuous we would compute

E(R³) = ∫_{-∞}^{∞} x³ f(x) dx

In this case, however, R falls into neither category. We are going to show how to use the theorem of total expectation to compute E(R³). We have P{R = −1} = 1/4, P{R = 2} = 3/4 − 1/4 = 1/2, P{R = x} = 0 for other values of x. Let F1 be a step function that is 0 for x < −1 and has a jump of 1/4 at x = −1 and a jump of 1/2 at x = 2. Subtract F1 from F to obtain a continuous function F2 that can be represented as an integral of a nonnegative function f2. F1 is called the "discrete part" of F, and F2 the "absolutely continuous part" (see Figure 4.4.2). F1 and F2 are monotone, right-continuous functions, and they approach zero as x → −∞. However, they approach limits that are less than 1 as x → ∞, so that they cannot be regarded as distribution functions of random variables. However, (4/3)F1 and 4F2 are legitimate distribution functions. We shall show that

E(R³) = Σ_x x³ pR(x) + ∫_{-∞}^{∞} x³ f2(x) dx
Consider the following random experiment. With probability 3/4 (= F1(∞) = Σ_x pR(x), where pR(x) = P{R = x}), pick a number in accordance

[FIGURE 4.4.2: Discrete and Absolutely Continuous Parts of a Distribution Function. Left panel: F1(x), a step function with jumps of 1/4 at x = −1 and 1/2 at x = 2. Right panel: F2(x) = F(x) − F1(x), rising linearly from 0 to 1/4 on [2, 3], with density f2(x) = dF2(x)/dx = 1/4 on (2, 3).]
[FIGURE 4.4.3: Tree Diagram for Example 5. First stage: with probability 3/4, the discrete branch is taken (N = −1 or N = 2); with probability 1/4, N is uniformly distributed between 2 and 3.]
with (4/3)F1; that is, pick −1 with probability 1/3 and 2 with probability 2/3. With probability 1/4 [= F2(∞)], pick a number in accordance with F2, that is, one uniformly distributed between 2 and 3 (see Figure 4.4.3). If N is the resulting number, then, by the theorem of total probability,

P{N < x} = P(A)P{N < x | A} + P(B)P{N < x | B}

where A and B correspond to the two possible results at the first stage of the experiment. Thus

FN(x) = (3/4)(4/3)F1(x) + (1/4)4F2(x) = F1(x) + F2(x) = F(x)

Therefore FN is the original distribution function F. Since N and R have the same distribution function, we expect that E(N³) = E(R³). Now we may compute E(N³) by the theorem of total expectation.

E(N³) = P(A)E(N³ | A) + P(B)E(N³ | B) = (3/4)[(−1)³(1/3) + 2³(2/3)] + (1/4)∫_2^3 x³ dx = 15/4 + 65/16 = 125/16

Notice that this may be expressed as

(−1)³(1/4) + 2³(1/2) + ∫_2^3 x³(1/4) dx

that is,

E(R³) = Σ_x x³ pR(x) + ∫_{-∞}^{∞} x³ f2(x) dx
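The mixed discrete-plus-continuous arithmetic above can be reproduced exactly with rational arithmetic. A short Python check (not in the original; exact fractions are used so no rounding enters):

```python
from fractions import Fraction

# Discrete part: p_R(-1) = 1/4, p_R(2) = 1/2.
discrete = (-1)**3 * Fraction(1, 4) + 2**3 * Fraction(1, 2)
# Absolutely continuous part: f2(x) = 1/4 on (2, 3), and the integral of
# x^3 * (1/4) from 2 to 3 is (3^4 - 2^4)/16.
continuous = Fraction(3**4 - 2**4, 16)
print(discrete + continuous)   # 125/16
```

The two pieces are 15/4 and 65/16, summing to 125/16 as computed in the example.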
More generally, the expectation of a function of R may be computed by

E[g(R)] = Σ_x g(x)pR(x) + ∫_{-∞}^{∞} g(x)f2(x) dx    (4.4.6)†

if g > 0 or if both the series and the integral are absolutely convergent.
Example 6. Let R be a random variable on a given probability space, and A an event with P(A) > 0. Formulate the proper definition of the conditional expectation of R, given that A has occurred.

This actually is not a new concept. If IA is the indicator of A, we are looking for the expectation of R, given that IA = 1. Let the experiment be performed independently n times, n very large, and let Rj be the value of R obtained on trial j, j = 1, 2, . . . , n. Renumber the trials so that A occurs on the first k trials, and A^c on the last n − k [k will be approximately nP(A)]. The average value of R, considering only those trials on which A occurs, is

(R1 + · · · + Rk)/k = (n/k)(1/n) Σ_{j=1}^{n} Rj Ij

where Ij = 1 if A occurs on trial j; Ij = 0 if A does not occur on trial j. In other words, Ij is simply the jth observation of IA. It appears that (1/n) Σ_{j=1}^{n} Rj Ij approximates the expectation of RIA; since k/n approximates P(A), we are led to define the conditional expectation of R given A as

E(R | A) = E(RIA)/P(A)  if P(A) > 0    (4.4.7)

Let us check that (4.4.7) agrees with previous results when R is discrete. By (4.4.2),

E(R | A) = E(R | IA = 1) = Σ_y y P{R = y | IA = 1} = Σ_{y≠0} y P{R = y | IA = 1}

But if y ≠ 0,

P{R = y | IA = 1} = P{R = y, IA = 1}/P{IA = 1} = P{RIA = y}/P(A)

Thus

E(R | A) = Σ_{y≠0} y P{RIA = y}/P(A) = E(RIA)/P(A)
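Formula (4.4.7) can be checked exactly in a small discrete case. The sketch below (not from the book; a fair die is an illustrative choice) compares E(R·IA)/P(A) with the direct average over the outcomes in A:

```python
from fractions import Fraction

faces = range(1, 7)          # a fair die: R(w) = w, each face probability 1/6
p = Fraction(1, 6)
A = {2, 4, 6}                # the event "R is even"

P_A = sum(p for w in faces if w in A)            # P(A) = 1/2
E_R_IA = sum(w * p for w in faces if w in A)     # E(R * I_A) = (2 + 4 + 6)/6 = 2
print(E_R_IA / P_A)          # 4, the average of the even faces
```

The ratio gives 4, exactly the average (2 + 4 + 6)/3 of the faces in A.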
† The reader may recognize this as the Riemann-Stieltjes integral ∫_{-∞}^{∞} g(x) dF(x). Alternatively, if one differentiates F formally to obtain f = f2 plus "impulses" or "delta functions" at −1 and 2 of strength 1/4 and 1/2, respectively, and then evaluates ∫_{-∞}^{∞} g(x)f(x) dx, (4.4.6) is obtained.
Let us look at another special case. For any random variable R and event A with P(A) > 0, we may define the conditional distribution function of R given A in a natural way, namely,

FR(x | A) = P{R < x | A} = P(A ∩ {R < x})/P(A)    (4.4.8)

Now assume that R has density f and A is of the form {R ∈ B} for some Borel set B. Then

P(A ∩ {R < x0}) = P{R ∈ B, R < x0} = ∫_{x∈B, x<x0} f(x) dx

Thus (4.4.8) becomes

FR(x0 | A) = ∫_{-∞}^{x0} f(x)IB(x) dx / P(A)

In other words, there is a conditional density of R given A, namely,

fR(x | A) = f(x)IB(x)/P(A) = f(x)/P(A) if x ∈ B;  fR(x | A) = 0 if x ∉ B    (4.4.9)

We may then compute the conditional expectation of R given A.

E(R | A) = ∫_{-∞}^{∞} x fR(x | A) dx = ∫_{-∞}^{∞} x IB(x)f(x) dx / P(A) = E(RIB(R))/P(A)    by Theorem 2 of Section 3.1

But

IB(R) = I{R∈B} = IA    by the discussion preceding Example 1

Thus

E(R | A) = E(RIA)/P(A)
in agreement with (4.4.7).

REMARK. (4.4.8) and (4.4.9) extend to n dimensions. The conditional distribution function of (R1, . . . , Rn) given A is F12...n(x1, . . . , xn | A) = P{R1 < x1, . . . , Rn < xn | A}. If (R1, . . . , Rn) has density f and A = {(R1, . . . , Rn) ∈ B}, there is a conditional density of (R1, . . . , Rn) given A, namely f(x1, . . . , xn)IB(x1, . . . , xn)/P(A). The argument is essentially the same as above.
PROBLEMS

1. Let (R1, R2) have density f(x, y) = 8xy, 0 < y < x < 1; f(x, y) = 0 elsewhere.
(a) Find the conditional expectation of R2 given R1 = x, and the conditional expectation of R1 given R2 = y.
(b) Find the conditional expectation of R2^4 given R1 = x.
(c) Find the conditional expectation of R2 given A = {R1 < 1/2}.

2. In Example 2 of Section 4.3, find the conditional expectation of R_{n+1}, given R1 = x1, . . . , Rn = xn.

3. Let (R1, R2) be uniformly distributed over the parallelogram with vertices (0, 0), (2, 0), (3, 1), (1, 1). Find E(R2 | R1 = x).

4. If a single die is tossed independently n times, find the average number of 2's, given that the number of 1's is k.

5. Let R1 and R2 be independent random variables, each uniformly distributed between 0 and 2.
(a) Find the conditional probability that R1 > 1, given that R1 + R2 < 3.
(b) Find the conditional expectation of R1, given that R1 + R2 < 3.

6. Let B1, B2, . . . be mutually exclusive, exhaustive events, with P(Bn) > 0, n = 1, 2, . . . , and let R be a random variable. Establish the following version of the theorem of total expectation:

E(R) = Σ_{n=1}^{∞} P(Bn)E(R | Bn)

[if E(R) exists].

7. Of the 100 people in a certain village, 50 always tell the truth, 30 always lie, and 20 always refuse to answer. A single unbiased die is tossed. If the result is 1, 2, 3, or 4, a sample of size 30 is taken with replacement. If the result is 5 or 6, a sample of size 30 is taken without replacement. A random variable R is defined as follows: R = 1 if the resulting sample contains 10 people of each category; R = 2 if the sample is taken with replacement and contains 12 liars; R = 3 otherwise. Find E(R).

8. Let R1 and R2 be independent random variables, each uniformly distributed between 0 and 1. Define R3 = 1 if R1² + R2² < 1; R3 = 0 otherwise.
(a) Find F3(z) and compute E(R3) from this.
(b) Compute E(R3) from ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y)f12(x, y) dx dy.
(c) Compute E(R3 | R1² + R2² < 1) and E(R3 | R1² + R2² > 1); then find E(R3) by using the theorem of total expectation.

9. The density for the time T required for the failure of a light bulb is f(x) = λe^{−λx}, x > 0. Find the conditional density function of T − t0, given that T > t0, and interpret the result intuitively.

10. Let R1 and R2 be independent random variables, each uniformly distributed between 0 and 1. Find the conditional expectation of (R1 + R2)² given R1 − R2.

11. Let R1 and R2 be independent random variables, each with density f(x) = (1/2)e^{−x}, x > 0; f(x) = 1/2, −1 < x < 0; f(x) = 0, x < −1. Let R3 = R1² + R2². Find E(R3 | R1 = x).

12. Let R1 be a discrete random variable; if R1 = x, let R2 have a conditional density h(y | x). Define the conditional probability that R1 = x given that R2 = y as

P{R1 = x | R2 = y} = P{R1 = x}h(y | x)/f2(y)

where f2(y) = Σ_x P{R1 = x}h(y | x) [see (4.4.3)] (cf. Bayes' theorem).
(a) Interpret this definition intuitively by considering P{R1 = x | y < R2 < y + dy}.
(b) Show that the definition is natural in the sense that the appropriate version of the theorem of total probability is satisfied:

P{R1 ∈ A, R2 ∈ B} = ∫_B f2(y)P{R1 ∈ A | R2 = y} dy

where

P{R1 ∈ A | R2 = y} = Σ_{x∈A} P{R1 = x | R2 = y}

13. If R1 is absolutely continuous and R2 discrete, and p(y | x) = P{R2 = y | R1 = x} is specified, show that there is a conditional density of R1 given R2, namely,

h(x | y) = f1(x)p(y | x)/p2(y)

where

p2(y) = P{R2 = y} = ∫_{-∞}^{∞} f1(x)p(y | x) dx

14. Let R be uniformly distributed between 0 and 1. If R = λ, a coin with probability of heads λ is tossed independently n times. If R1, . . . , Rn are the results of the tosses (Ri = 1 for a head, Ri = 0 for a tail), find the conditional density of R given (R1, . . . , Rn), and the conditional expectation of R given (R1, . . . , Rn).

15. (Hypothesis testing) Consider the following experiment. Throw a coin with probability p of heads. If the coin comes up heads, observe a random variable R with density f0(x); if the coin comes up tails, let R have density f1(x). Suppose that we are not told the result of the coin toss, but only the value of R, and we have to guess whether or not the coin came up heads. We do this by means of a decision scheme, which is simply a Borel set S of real numbers, with the interpretation that if R = x and x ∈ S, we decide for tails, that is, f1, and if x ∉ S we decide for heads, that is, f0.
(a) Find the over-all probability of error in terms of p, f0, f1, and S. [There are two types of errors: if the actual density is f0 and we decide for f1 (type 1 error), and if the actual density is f1 and we decide for f0 (type 2 error).]
(b) For a given p, f0, f1, find the S that makes the over-all probability of error a minimum. Apply the results to the case in which fi is the normal density with mean mi and variance σ², i = 0, 1.

REMARK. A physical model for part (b) is the following. The input R to a radar receiver is of the form θ + N, where θ (the signal) and N (the noise) are independent random variables, with P{θ = m0} = p, P{θ = m1} = 1 − p, and N normally distributed with mean 0 and variance σ². If θ = mi (i = 0 corresponds to a head in the above discussion, and i = 1 to a tail), then R is normal with mean mi and variance σ²; thus fi is the conditional density of R given θ = mi. We are trying to determine the actual value of the signal with as low a probability of error as possible.

16. Let R be the number of successes in n Bernoulli trials, with probability p of success on a given trial. Find the conditional expectation of R, given that R > 2.

17. Let R1 be uniformly distributed between 0 and 10, and define R2 by R2 = R1² if 0 < R1 < 6; R2 = 3 if 6 < R1 < 10. Find the conditional expectation of R2 given that 2 < R2 < 4.

18. Consider the following two-stage random experiment. (i) A circle of radius R and center at (0, 0) is selected, where R has density fR(z) = e^{−z}, z > 0; fR(z) = 0, z < 0. (ii) A point (R1, R2) is chosen, where (R1, R2) is uniformly distributed inside the circle selected in step (i).
(a) If D = (R1² + R2²)^{1/2} is the distance of the resulting point from the origin, find E(D).
(b) Find the conditional density of R given R1 = x, R2 = y. (Leave the answer in the form of an integral.)

19. (An estimation problem) The input R to a radar receiver is of the form θ + N, where θ (the signal) and N (the noise) are independent random variables with finite mean and variance. The value of R is observed, and then an estimate of θ is made, say θ* = d(R), where d is a function from the reals to the reals. We wish to choose the estimate so that E[(θ* − θ)²] is as small as possible.
(a) Show that d(x) is the conditional expectation E(θ | R = x). (Assume that R is either absolutely continuous or discrete.)
(b) Let θ = ±1 with equal probability, and let N be uniformly distributed between −2 and +2. Find d(x) and the minimum value of E[(θ* − θ)²].

20. A number θ is chosen at random with density fθ(x) = e^{−x}, x > 0; fθ(x) = 0, x < 0. If θ takes the value λ, a random variable R is observed, where R has the Poisson distribution with parameter λ. For example, R might be the number of radioactive particles (or particles with some other distinguishing characteristic) passing through a counting device in a given time interval, where the average number of such particles is selected randomly. The value of R is observed and an estimate of θ is made, say θ* = d(R). The argument of Problem 19, which applies in any situation when one makes an estimate θ* = d(R) of a parameter θ, and when the distribution function of R depends on θ, shows that the estimate that minimizes E[(θ* − θ)²] is d(x) = E(θ | R = x). Find d(x) in this case.

REMARK. Problems 15, 19, and 20 illustrate some techniques of statistics. This subject will be taken up systematically in Chapter 8.

4.5 APPENDIX: THE GENERAL CONCEPT OF CONDITIONAL EXPECTATION
By shifting our viewpoint slightly, we may regard a conditional expectation as a random variable defined on the given probability space. For example, suppose that E(R2 | R1 = x) = x². We may then say that, having observed R1, the average value of R2 is R1². We adopt the notation E(R2 | R1) = R1². In general, if E(R2 | R1 = x) = g(x), we define E(R2 | R1) = g(R1). Then E(R2 | R1) is a function defined on Ω; its value at the point ω is g(R1(ω)). Let us see what happens to the theorem of total expectation in this notation. If, for example,

E(R2) = ∫ f1(x)E(R2 | R1 = x) dx = ∫ f1(x)g(x) dx

then E(R2) = E[g(R1)]; in other words,

E(R2) = E[E(R2 | R1)]    (4.5.1)

The expectation of the conditional expectation of R2 given R1 is the expectation of R2.
Let us develop this a bit further. Let A be a Borel subset of E1. Then, assuming that (4.5.1) holds for the random variable R2·I{R1∈A}, we have

E(R2 I{R1∈A}) = E[E(R2 I{R1∈A} | R1)]

But having observed R1, R2 I{R1∈A} will be R2 if R1 ∈ A, and 0 otherwise; thus we expect intuitively that

E(R2 I{R1∈A} | R1) = I{R1∈A} E(R2 | R1)

It appears reasonable to expect, then, that

E(R2 I{R1∈A}) = E[I{R1∈A} E(R2 | R1)]  for all Borel subsets A of E1    (4.5.2)

It turns out that if R1 is an arbitrary random variable and R2 a random variable whose expectation exists, there is a random variable R, of the form g(R1) for some Borel measurable function g, such that

E(R2 I{R1∈A}) = E[I{R1∈A} R]  for all Borel subsets A of E1

We set R = E(R2 | R1). Furthermore, R is essentially unique: if R' = g'(R1) for some Borel measurable function g', and R' also satisfies (4.5.2), then R = R' except perhaps on a set of probability 0. In the cases considered in this chapter, the conditional expectations all satisfy (4.5.2) (which is just a restatement of the theorem of total expectation), and thus the examples of the chapter are consistent with the general notion of conditional expectation.
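The identity E(R2) = E[E(R2 | R1)] can be watched at work by simulation, using Example 2 of Section 4.4, where E(R2 | R1) = R1/2. The Python sketch below is an addition (the seed, sample size, and use of a gamma sampler for the density xe^{−x} are implementation choices):

```python
import random

random.seed(7)
N = 200_000
sum_r2, sum_cond = 0.0, 0.0
for _ in range(N):
    r1 = random.gammavariate(2, 1)      # density x e^{-x}, x > 0, as in Example 2
    sum_r2 += random.uniform(0, r1)     # R2 | R1 = x is uniform on [0, x]
    sum_cond += r1 / 2                  # E(R2 | R1), evaluated at the observed R1

print(sum_r2 / N, sum_cond / N)         # both close to E(R2) = 1
```

Averaging R2 directly and averaging the random variable E(R2 | R1) give the same answer, E(R2) = 1, up to sampling error.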
Characteristic Functions

5.1 INTRODUCTION
In Chapter 2 we examined the problem of finding probabilities of the form P{(R1, . . . , Rn) ∈ B}, where R1, . . . , Rn were random variables on a given probability space. If (R1, . . . , Rn) has density f, then

P{(R1, . . . , Rn) ∈ B} = ∫···∫_B f(x1, . . . , xn) dx1 · · · dxn

In general, the evaluation of integrals of this type is quite difficult, if it is possible at all. In this chapter we describe an approach to a particular class of problems, those involving sums of independent random variables, which avoids integration in n dimensions. The approach is similar in spirit to the application of Fourier or Laplace transforms to a differential equation. Let R be a random variable on a given probability space. We introduce the characteristic function of R, defined by

MR(u) = E(e^{−iuR}),  u real    (5.1.1)

Here we meet complex-valued random variables for the first time. A complex-valued random variable on (Ω, F, P) is a function T from Ω to the complex numbers C, such that the real part T1 and the imaginary part T2 of T are (real-valued) random variables. Thus T(ω) = T1(ω) + iT2(ω), ω ∈ Ω. We define the expectation of T as the complex number E(T) = E(T1) + iE(T2); E(T) is defined only if E(T1) and E(T2) are both finite. In the present case we have MR(u) = E(cos uR) − iE(sin uR); since the cosine and the sine are < 1 in absolute value, all expectations are finite. Thus MR is a function from the reals to the complex numbers. If R has density fR we obtain

MR(u) = ∫_{-∞}^{∞} e^{−iux} fR(x) dx    (5.1.2)

which is the Fourier transform of fR. It will be convenient in many computations to use a Laplace rather than a Fourier transform. The generalized characteristic function of R is defined by

NR(s) = E(e^{−sR}),  s complex†    (5.1.3)

NR(s) is defined only for those s such that E(e^{−sR}) is finite. If s is imaginary, that is, if s = iu, u real, then NR(s) = MR(u), so that NR(s) is defined at least for s on the imaginary axis. There will be situations in which NR(s) is not defined for any s off the imaginary axis, and other situations in which NR(s) is defined for all s. If R has density fR, we obtain

NR(s) = ∫_{-∞}^{∞} e^{−sx} fR(x) dx    (5.1.4)
This is the (two-sided) Laplace transform of fR. The basic fact about characteristic functions is the following.

Theorem 1. Let R1, . . . , Rn be independent random variables on a given probability space, and let R0 = R1 + · · · + Rn. If NRi(s) is finite for all i = 1, 2, . . . , n, then NR0(s) is finite, and

NR0(s) = NR1(s)NR2(s) · · · NRn(s)

In particular, if we set s = iu, we obtain

MR0(u) = MR1(u)MR2(u) · · · MRn(u)

Thus the characteristic function of a sum of independent random variables is the product of the characteristic functions.
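For discrete random variables, Theorem 1 can be checked exactly by comparing the characteristic function of a convolved probability function with the product of the individual characteristic functions. A Python sketch (not in the original; two fair dice are an arbitrary example):

```python
import cmath

def M(pmf, u):
    """Characteristic function E(e^{-iuR}) of a discrete random variable,
    given its probability function as a dict {value: probability}."""
    return sum(p * cmath.exp(-1j * u * x) for x, p in pmf.items())

die = {k: 1 / 6 for k in range(1, 7)}

# Probability function of the sum of two independent dice, by direct convolution.
conv = {}
for a, pa in die.items():
    for b, pb in die.items():
        conv[a + b] = conv.get(a + b, 0.0) + pa * pb

u = 0.7
print(abs(M(conv, u) - M(die, u) ** 2))   # essentially 0
```

The difference is zero up to floating-point rounding, as Theorem 1 asserts.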
† In doing most of the examples in this chapter, the student will not come to grief if he regards s as a real variable and replaces statements such as "a < Re s < b" by "a < s < b." Also, a comment about notation. We have taken E(e^{−iuR}) as the definition of the characteristic function rather than the more usual E(e^{iuR}) in order to preserve a notational symmetry between Fourier and Laplace transforms (∫ e^{−sx}f(x) dx, not ∫ e^{sx}f(x) dx, is the standard notation for the Laplace transform). Since u ranges over all real numbers, this change is of no essential significance.
PROOF.

E(e^{−sR0}) = E(e^{−s(R1+···+Rn)}) = E[Π_{k=1}^{n} e^{−sRk}] = Π_{k=1}^{n} E(e^{−sRk})

by independence.
We have glossed over one point in this argument. If we take n = 2 for simplicity, we have complex-valued random variables V = V1 + iV2 and W = W1 + iW2 (V = e^{−sR1}, W = e^{−sR2}), where, by Theorem 2 of Section 2.7, Vj and Wk are independent (j, k = 1, 2), and all expectations are finite. We must show that E(VW) = E(V)E(W), which we have proved only in the case when V and W are real-valued and independent. However, there is no difficulty.

E(VW) = E[V1W1 − V2W2 + i(V1W2 + V2W1)]
      = E(V1)E(W1) − E(V2)E(W2) + i(E(V1)E(W2) + E(V2)E(W1))
      = [E(V1) + iE(V2)][E(W1) + iE(W2)]
      = E(V)E(W)

The proof for arbitrary n is more cumbersome, but the idea is exactly the same. Thus we may find the characteristic function of a sum of independent random variables without any n-dimensional integration. However, this technique will not be of value unless it is possible to recover the distribution function from the characteristic function. In fact we have the following result, which we shall not prove.
Theorem 2 (Correspondence Theorem). If MR1(u) = MR2(u) for all u, then

FR1(x) = FR2(x)  for all x

For computational purposes we need some facts about the Laplace transform. Let f be a piecewise continuous function from E1 to E1 (not necessarily a density) and Lf its Laplace transform:

Lf(s) = ∫_{-∞}^{∞} f(x)e^{−sx} dx

Laplace Transform Properties
1. If there are real numbers K1 and K2 and nonnegative real numbers A1 and A2 such that |f(x)| < A1 e^{K1 x} for x > 0, and |f(x)| < A2 e^{K2 x} for x < 0, then Lf(s) is finite for K1 < Re s < K2. This follows, since

∫_0^∞ |f(x)e^{−sx}| dx < ∫_0^∞ A1 e^{(K1−σ)x} dx

and

∫_{-∞}^0 |f(x)e^{−sx}| dx < ∫_{-∞}^0 A2 e^{(K2−σ)x} dx
where σ = Re s. The integrals are finite if K1 < σ < K2. Thus the class of functions whose Laplace transform can be taken is quite large.

2. If g(x) = f(x − a) and Lf is finite at s, then Lg is also finite at s and Lg(s) = e^{−as}Lf(s). This follows, since

∫_{-∞}^{∞} f(x − a)e^{−sx} dx = e^{−as} ∫_{-∞}^{∞} f(x − a)e^{−s(x−a)} d(x − a)
3. If h(x) = f(−x) and Lf is finite at s, then Lh is finite at −s and Lh(−s) = Lf(s) [or Lh(s) = Lf(−s) if Lf is finite at −s]. To verify this, write

∫_{-∞}^{∞} h(x)e^{sx} dx = (with y = −x) ∫_{-∞}^{∞} f(y)e^{−sy} dy

4. If g(x) = e^{−ax}f(x) and Lf is finite at s, then Lg is finite at s − a and Lg(s − a) = Lf(s) [or Lg(s) = Lf(s + a) if Lf is finite at s + a]. For

∫_{-∞}^{∞} e^{−ax} e^{−(s−a)x} f(x) dx = ∫_{-∞}^{∞} e^{−sx} f(x) dx

We now construct a very brief table of Laplace transforms for use in the examples. In Table 5.1.1, u(x) is the unit step function, defined by u(x) = 1,
Table 5.1.1  Laplace Transforms

f(x)                                Lf(s)                       Region of Convergence
u(x)                                1/s                         Re s > 0
e^{−ax}u(x)                         1/(s + a)                   Re s > −a
x^n e^{−ax}u(x), n = 0, 1, . . .    n!/(s + a)^{n+1}            Re s > −a
x^α e^{−ax}u(x), α > −1             Γ(α + 1)/(s + a)^{α+1}      Re s > −a
x > 0; u(x) = 0, x < 0. If we verify the last entry in the table the others will follow. Now

∫_0^∞ x^α e^{−ax} e^{−sx} dx = [with y = (s + a)x] (1/(s + a)^{α+1}) ∫_0^∞ y^α e^{−y} dy = Γ(α + 1)/(s + a)^{α+1} †
† Strictly speaking, these manipulations are only valid for s real and > −a. However, one can show that under the hypothesis of property 1, Lf is analytic for K1 < Re s < K2. In the present case K1 = −a and K2 can be taken arbitrarily large, so that Lf is analytic for Re s > −a. Now Lf(s) = Γ(α + 1)/(s + a)^{α+1} for s real and > −a, and therefore, by the identity theorem for analytic functions, the formula holds for all s with Re s > −a. This technique, which allows one to treat certain complex integrals as if the integrands were real-valued, will be used several times without further comment.
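The table entry for x^n e^{−ax}u(x) can be checked by numerical quadrature of the defining integral. A Python sketch (added; the truncation point, step count, and the particular n, a, s are arbitrary choices):

```python
from math import exp, factorial

def laplace(f, s, upper=60.0, steps=200_000):
    """Numerical one-sided Laplace transform: integral of f(x) e^{-sx} over
    (0, upper) by the midpoint rule; upper is chosen so the tail is negligible."""
    dx = upper / steps
    return sum(f((i + 0.5) * dx) * exp(-s * (i + 0.5) * dx) * dx
               for i in range(steps))

n, a, s = 3, 1.0, 0.5                       # need Re s > -a
numeric = laplace(lambda x: x**n * exp(-a * x), s)
exact = factorial(n) / (s + a) ** (n + 1)   # table entry n!/(s + a)^{n+1}
print(numeric, exact)
```

For n = 3, a = 1, s = 1/2, both values come out near 3!/(3/2)^4.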
REMARK. u(x) and −u(−x) have the same Laplace transform 1/s, but the regions of convergence are disjoint:

∫_{-∞}^{∞} u(x)e^{−sx} dx = ∫_0^∞ e^{−sx} dx = 1/s,  Re s > 0

and

∫_{-∞}^{∞} −u(−x)e^{−sx} dx = −∫_{-∞}^0 e^{−sx} dx = 1/s,  Re s < 0
This indicates that any statement about Laplace transforms should be accompanied by some information about the region of convergence. We need the following result in doing examples; the proof is measure-theoretic and will be omitted.
5.2
E X A MPLE S
We are going to examine some typical problems involving sums of independ ent random variables. We shall use the result, to be justified in Example 6, that if R1 , R 2 , , Rn are independent , each absolutely continuous, then R1 + + Rn is also absolutely continuous. In all examples Ni(s) will denote the generalized characteristic function of the random variable Ri . •
·
·
•
•
·
Example 1. Let R1 and R2 be independent random variables, with R1 uniformly distributed between −1 and +1, and R2 having the exponential density e^{−y}u(y). Find the density of R0 = R1 + R2.

We have

N1(s) = ∫_{-1}^{1} (1/2)e^{−sx} dx = (e^s − e^{−s})/2s,  all s

N2(s) = 1/(s + 1),  Re s > −1

Thus, by Theorem 1 of Section 5.1,

N0(s) = N1(s)N2(s) = (1/(2s(s + 1)))(e^s − e^{−s})
[FIGURE 5.2.1: the density h(x) of R0 = R1 + R2; h vanishes for x < −1, rises on (−1, 1), and equals (1/2)[e^{−(x−1)} − e^{−(x+1)}] for x > 1.]
at least for Re s > −1. To find a function with this Laplace transform, we use partial fraction expansion of the rational function part of N0(s):

1/(2s(s + 1)) = 1/2s − 1/2(s + 1)

Now from Table 5.1.1, u(x) has transform 1/s (Re s > 0) and e^{−x}u(x) has transform 1/(s + 1) (Re s > −1). Thus (1/2)(1 − e^{−x})u(x) has transform 1/2s(s + 1) (Re s > 0). By property 2 of Laplace transforms (Section 5.1), (1/2)(1 − e^{−(x+1)})u(x + 1) has transform e^s/2s(s + 1) and (1/2)(1 − e^{−(x−1)})u(x − 1) has transform e^{−s}/2s(s + 1) (Re s > 0). Thus a function h whose transform is N0(s) for Re s > 0 is

h(x) = (1/2)(1 − e^{−(x+1)})u(x + 1) − (1/2)(1 − e^{−(x−1)})u(x − 1)

By property 5 of Laplace transforms, h is the density of R0; for a sketch, see Figure 5.2.1.
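The density recovered from the Laplace transform can be cross-checked against the convolution integral ∫ f1(t)f2(x − t) dt computed directly. A Python sketch (added; the sample points and step count are arbitrary):

```python
from math import exp

def u(x):
    """Unit step function."""
    return 1.0 if x > 0 else 0.0

def h(x):
    """Density of R0 as recovered from the Laplace transform in Example 1."""
    return (0.5 * (1 - exp(-(x + 1))) * u(x + 1)
            - 0.5 * (1 - exp(-(x - 1))) * u(x - 1))

def conv(x, steps=20_000):
    """Direct convolution: integral over t in (-1, 1) of (1/2) e^{-(x-t)} u(x-t) dt,
    with f1 uniform on (-1, 1) and f2(y) = e^{-y} u(y)."""
    dt = 2.0 / steps
    total = 0.0
    for i in range(steps):
        t = -1.0 + (i + 0.5) * dt
        total += 0.5 * exp(-(x - t)) * u(x - t) * dt
    return total

for x in (0.0, 0.5, 2.0):
    print(h(x), conv(x))   # the two columns agree to several decimals
```

The transform method and the two-dimensional (here, one-dimensional convolution) computation give the same density.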
Example 2. Let R0 = R1 + R2 + R3, where R1, R2, and R3 are independent with densities f1(x) = f2(x) = e^x u(−x), f3(x) = e^{−(x−1)}u(x − 1). Find the density of R0.

We have

N1(s) = N2(s) = ∫_{-∞}^0 e^x e^{−sx} dx = 1/(1 − s),  Re s < 1

and

N3(s) = ∫_1^∞ e^{−(x−1)} e^{−sx} dx = e^{−s}/(s + 1),  Re s > −1

Thus

N0(s) = N1(s)N2(s)N3(s) = e^{−s}/((s − 1)²(s + 1)),  −1 < Re s < 1
We expand the rational function in partial fractions.

G(s) = 1/((s − 1)²(s + 1)) = A/(s − 1)² + B/(s − 1) + C/(s + 1)

The coefficients may be found as follows.

A = [(s − 1)²G(s)]_{s=1} = 1/2
B = [d/ds ((s − 1)²G(s))]_{s=1} = [−1/(s + 1)²]_{s=1} = −1/4
C = [(s + 1)G(s)]_{s=−1} = 1/4

From Table 5.1.1, the transform of xe^{−x}u(x) is 1/(s + 1)², Re s > −1. By Laplace transform property 3, the transform of −xe^x u(−x) is 1/(1 − s)², Re s < 1. The transform of e^{−x}u(x) is 1/(s + 1), Re s > −1, so that, again by property 3, the transform of e^x u(−x) is 1/(1 − s), Re s < 1. Thus the transform of

−(1/2)xe^x u(−x) + (1/4)e^x u(−x) + (1/4)e^{−x}u(x)

is

G(s) = (1/2)/(s − 1)² − (1/4)/(s − 1) + (1/4)/(s + 1),  −1 < Re s < 1

By property 2, the transform of

h(x) = [1/4 − (1/2)(x − 1)]e^{x−1}u(−(x − 1)) + (1/4)e^{−(x−1)}u(x − 1)

is e^{−s}G(s), −1 < Re s < 1. By property 5, h is the density of R0 (see Figure 5.2.2).
[FIGURE 5.2.2: the density h(x) = (1/4 + (1/2)(1 − x))e^{x−1} for x < 1; h(x) = (1/4)e^{−(x−1)} for x > 1.]
Example 3. Let R have the Cauchy density; that is,

fR(x) = 1/(π(1 + x²))

The characteristic function of R is

MR(u) = ∫_{-∞}^{∞} e^{−iux} fR(x) dx

[In this case NR(s) is finite only for s on the imaginary axis.] MR(u) turns out to be e^{−|u|}. This may be verified by complex variable methods (see Problem 9), but instead we give a rough sketch of another attack. If the characteristic function of a random variable R is integrable, that is,

∫_{-∞}^{∞} |MR(u)| du < ∞

it turns out that R has a density, and in fact f is given by the inverse Fourier transform

fR(x) = (1/2π) ∫_{-∞}^{∞} MR(u)e^{iux} du    (5.2.1)

In the present case

∫_{-∞}^{∞} e^{−|u|} du = ∫_{-∞}^0 e^u du + ∫_0^∞ e^{−u} du = 2 < ∞

and thus the density corresponding to e^{−|u|} is

(1/2π) ∫_{-∞}^{∞} e^{−|u|} e^{iux} du = (1/2π) ∫_{-∞}^0 e^{u(1+ix)} du + (1/2π) ∫_0^∞ e^{−u(1−ix)} du
  = (1/2π)[1/(1 + ix) + 1/(1 − ix)] = 1/(π(1 + x²))

Thus the Cauchy density in fact corresponds to the characteristic function e^{−|u|}. This argument has a serious gap. We started with the assumption that e^{−|u|} was the characteristic function of some random variable, and deduced from this that the random variable must have density 1/π(1 + x²). We must establish that e^{−|u|} is in fact a characteristic function (see Problem 8). Now let R0 = R1 + · · · + Rn, where the Ri are independent, each with the Cauchy density. Let us find the density of R0. We have

M0(u) = [MR(u)]^n = e^{−n|u|}
If instead we consider R0/n, we obtain

M_{R0/n}(u) = E[e^{−iuR0/n}] = M0(u/n) = e^{−|u|}

Thus R0/n has the Cauchy density. Now if R2 = nR1, then f2(y) = (1/n)f1(y/n) (see Section 2.4), and so the density of R0 is

f0(y) = (1/n) · 1/(π(1 + y²/n²)) = n/(π(y² + n²))
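The striking conclusion that R0/n is again standard Cauchy can be seen by simulation. The Python sketch below (added; the inverse-distribution-function sampler, seed, and sample sizes are implementation choices) looks at the quartiles of the averages, which stay at ±1 instead of shrinking:

```python
import math, random

random.seed(3)

def cauchy():
    """Sample from the Cauchy density 1/(pi(1 + x^2)) by inverting the
    distribution function: tan(pi(U - 1/2)) for U uniform on (0, 1)."""
    return math.tan(math.pi * (random.random() - 0.5))

N, n = 50_000, 10
averages = sorted(sum(cauchy() for _ in range(n)) / n for _ in range(N))
q1, q3 = averages[N // 4], averages[3 * N // 4]
print(q1, q3)   # close to -1 and +1, the quartiles of the Cauchy density itself
```

If the averages were converging to a constant, the interquartile range would shrink as n grows; it does not, in line with Remark 1 below.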
REMARKS. 1. The arithmetic average R_0/n of a sequence of independent Cauchy-distributed random variables has the same density as each of the components. There is no convergence of the arithmetic average to a constant, as we might expect physically. The trouble is that E(R) does not exist.

2. If R has the Cauchy density and R_1 = c_1 R, R_2 = c_2 R, with c_1, c_2 constant and > 0, then

    M_1(u) = E(e^{-iuR_1}) = E(e^{-iuc_1 R}) = M_R(c_1 u) = e^{-c_1|u|}

and similarly M_2(u) = e^{-c_2|u|}. Thus, if R_0 = R_1 + R_2 = (c_1 + c_2)R,

    M_0(u) = e^{-(c_1 + c_2)|u|}

which happens to be M_1(u)M_2(u). This shows that if the characteristic function of the sum of two random variables is the product of the characteristic functions, the random variables need not be independent.

3. If R has the Cauchy density and R_1 = \theta R, \theta > 0, then by the calculation performed before Remark 1, R_1 has density f_1(y) = \theta/\pi(y^2 + \theta^2) and (as in Remark 2) characteristic function M_1(u) = e^{-\theta|u|}. A random variable with this density is said to be of the Cauchy type with parameter \theta, or to have the Cauchy density with parameter \theta. The formula for M_1(u) shows immediately that if R_1, \ldots, R_n are independent and R_i is of the Cauchy type with parameter \theta_i, i = 1, \ldots, n, then R_1 + \cdots + R_n is of the Cauchy type with parameter \theta_1 + \cdots + \theta_n. ◄
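The inversion computation above can be checked numerically. The following sketch is not part of the text; it is a Python fragment (assuming the NumPy library) that integrates (1/2π) ∫ e^{-|u|} e^{iux} du on a truncated grid and compares the result with the Cauchy density:

```python
import numpy as np

# Numerical check that the inverse Fourier transform of exp(-|u|)
# reproduces the Cauchy density 1/(pi*(1 + x^2)).
u = np.linspace(-60.0, 60.0, 200001)   # tail beyond |u| = 60 is negligible
du = u[1] - u[0]
for x in [0.0, 0.5, 1.0, 3.0]:
    # exp(-|u|) is even, so only the cosine part of e^{iux} contributes.
    inv = np.sum(np.exp(-np.abs(u)) * np.cos(u * x)) * du / (2 * np.pi)
    cauchy = 1.0 / (np.pi * (1.0 + x * x))
    assert abs(inv - cauchy) < 1e-4
```

The truncation at |u| = 60 is an arbitrary choice; the integrand decays like e^{-u}, so the discarded tail is far below the tolerance used.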
► Example 4. If R_1, R_2, \ldots, R_n are independent and normally distributed, then R_0 = R_1 + \cdots + R_n is also normally distributed.
We first show that if R is normally distributed with mean m and variance \sigma^2, then

    N_R(s) = e^{-sm} e^{s^2\sigma^2/2}  (all s)   (5.2.2)

Now

    N_R(s) = \int_{-\infty}^{\infty} e^{-sx} f_R(x)\, dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-sx} e^{-(x-m)^2/2\sigma^2}\, dx

Let y = (x - m)/\sqrt{2}\,\sigma and complete the square to obtain

    N_R(s) = e^{-sm + s^2\sigma^2/2} \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{-(y + s\sigma/\sqrt{2})^2}\, dy = e^{-sm + s^2\sigma^2/2}

by (2.8.2). Thus, if R_1, \ldots, R_n are independent, with R_i normally distributed with mean m_i and variance \sigma_i^2, then

    N_0(s) = N_1(s)N_2(s)\cdots N_n(s) = e^{-s(m_1 + \cdots + m_n)}\, e^{s^2(\sigma_1^2 + \cdots + \sigma_n^2)/2}

But this is the generalized characteristic function of a normally distributed random variable, and the result follows. Note that m_0 = m_1 + \cdots + m_n, \sigma_0^2 = \sigma_1^2 + \cdots + \sigma_n^2, as we should expect from the results of Section 3.3. ◄
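Formula (5.2.2) can be verified numerically. The following check is not in the text; it is a Python sketch (NumPy assumed, with m and σ chosen arbitrarily for illustration) that approximates N(s) = ∫ e^{-sx} f(x) dx and compares it with the closed form:

```python
import numpy as np

# Numerical check of (5.2.2): N(s) = exp(-s*m + s^2*sigma^2/2)
# for a normal density with mean m and variance sigma^2.
m, sigma = 1.5, 2.0
x = np.linspace(m - 60 * sigma, m + 60 * sigma, 600001)
dx = x[1] - x[0]
f = np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
for s in [-0.5, 0.0, 0.7]:
    numeric = np.sum(np.exp(-s * x) * f) * dx
    closed = np.exp(-s * m + s ** 2 * sigma ** 2 / 2)
    assert abs(numeric - closed) < 1e-6 * closed
```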
► Example 5. Let R have the Poisson distribution:

    p_R(k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k = 0, 1, \ldots

We first show that the generalized characteristic function of R is

    N_R(s) = \exp[\lambda(e^{-s} - 1)]  (all s)   (5.2.3)

We have

    N_R(s) = \sum_{k=0}^{\infty} e^{-sk}\, \frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^{-s})^k}{k!} = e^{-\lambda} e^{\lambda e^{-s}} = \exp[\lambda(e^{-s} - 1)]
as asserted. We now show that if R_1, \ldots, R_n are independent random variables, each with the Poisson distribution, then R_0 = R_1 + \cdots + R_n also has the Poisson distribution. If R_i has the Poisson distribution with parameter \lambda_i, then

    N_0(s) = N_1(s)N_2(s)\cdots N_n(s) = \exp[(\lambda_1 + \cdots + \lambda_n)(e^{-s} - 1)]

This is the generalized characteristic function of a Poisson random variable with parameter \lambda_1 + \cdots + \lambda_n, and the result follows. Note that if R has the Poisson distribution with parameter \lambda, then E(R) = Var R = \lambda (see Problem 8, Section 3.2). Thus the result that the parameter of R_0 is \lambda_1 + \cdots + \lambda_n is consistent with the fact that E(R_0) = E(R_1) + \cdots + E(R_n) and Var R_0 = Var R_1 + \cdots + Var R_n. ◄
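The additivity can also be checked directly, without transforms. The sketch below is not in the text; it convolves two Poisson probability functions term by term and compares the result with the Poisson probability function with the summed parameter:

```python
from math import exp, factorial

# p(k) for a Poisson random variable with parameter lam.
def poisson_pmf(lam, k):
    return exp(-lam) * lam ** k / factorial(k)

lam1, lam2 = 1.3, 2.2   # illustrative parameter values
for k in range(12):
    # P{R1 + R2 = k} = sum over j of P{R1 = j} P{R2 = k - j}
    conv = sum(poisson_pmf(lam1, j) * poisson_pmf(lam2, k - j)
               for j in range(k + 1))
    assert abs(conv - poisson_pmf(lam1 + lam2, k)) < 1e-12
```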
► Example 6. In certain situations (especially when the Laplace transforms cannot be expressed in closed form) it may be convenient to use a convolution procedure rather than the transform technique to find the density of a sum of independent random variables. The method is based on the following result.

Convolution Theorem. Let R_1 and R_2 be independent random variables, having densities f_1 and f_2, respectively. Let R_0 = R_1 + R_2. Then R_0 has a density given by

    f_0(z) = \int_{-\infty}^{\infty} f_2(z - x) f_1(x)\, dx = \int_{-\infty}^{\infty} f_1(z - y) f_2(y)\, dy   (5.2.4)

(Intuitively, the probability that R_1 lies in (x, x + dx] is f_1(x) dx; given that R_1 = x, the probability that R_0 lies in (z, z + dz] is the probability that R_2 lies in (z - x, z - x + dz], namely f_2(z - x) dz. Integrating with respect to x, we obtain the result that the probability that R_0 lies in (z, z + dz] is

    dz \int_{-\infty}^{\infty} f_2(z - x) f_1(x)\, dx

Since the probability is f_0(z) dz, (5.2.4) follows.)

PROOF. To prove the convolution theorem, observe that

    F_0(z) = P\{R_1 + R_2 \le z\} = \iint_{x+y \le z} f_1(x) f_2(y)\, dx\, dy = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{z-x} f_2(y)\, dy \right] f_1(x)\, dx

Let y = u - x to obtain

    F_0(z) = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{z} f_2(u - x)\, du \right] f_1(x)\, dx = \int_{-\infty}^{z} \left[ \int_{-\infty}^{\infty} f_1(x) f_2(u - x)\, dx \right] du
This proves the first relation of (5.2.4); the other follows by a symmetrical argument.

We consider a numerical example. Let f_1(x) = 1/x^2, x > 1; f_1(x) = 0, x < 1. Let f_2(y) = 1, 0 < y < 1; f_2(y) = 0 elsewhere. If z < 1, f_0(z) = 0; if 1 < z < 2,

    f_0(z) = \int_{-\infty}^{\infty} f_1(x) f_2(z - x)\, dx = \int_{1}^{z} \frac{1}{x^2}\, dx = 1 - \frac{1}{z}

If z > 2,

    f_0(z) = \int_{z-1}^{z} \frac{1}{x^2}\, dx = \frac{1}{z-1} - \frac{1}{z}

(see Figure 5.2.3).

FIGURE 5.2.3  Application of the Convolution Theorem. [Sketches of f_1, f_2, and f_0.]
REMARK. The successive application of the convolution theorem shows that if R_1, \ldots, R_n are independent, each absolutely continuous, then R_1 + \cdots + R_n is absolutely continuous. ◄
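The numerical example can be reproduced by discretizing (5.2.4). The sketch below is not from the text; it is a Python fragment (NumPy assumed) that integrates f_1(x) f_2(z − x) by a midpoint rule:

```python
import numpy as np

# f1(x) = 1/x^2 for x > 1, and f2 is uniform on (0, 1), so f2(z - x) = 1
# exactly when z - 1 < x < z; f0(z) is the integral of f1 over that window.
def f1(x):
    return np.where(x >= 1.0, 1.0 / np.maximum(x, 1.0) ** 2, 0.0)

def f0(z, n=1_000_000):
    # midpoint rule over the window (z - 1, z)
    x = np.linspace(z - 1.0, z, n, endpoint=False) + 0.5 / n
    return np.sum(f1(x)) / n

assert abs(f0(1.5) - (1 - 1 / 1.5)) < 1e-5    # 1 < z < 2: f0(z) = 1 - 1/z
assert abs(f0(3.0) - (1 / 2 - 1 / 3)) < 1e-5  # z > 2: f0(z) = 1/(z-1) - 1/z
```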
PROBLEMS

1. Let R_1, R_2, and R_3 be independent random variables, each uniformly distributed between -1 and +1. Find and sketch the density function of the random variable R_0 = R_1 + R_2 + R_3.
2. Two independent random variables R_1 and R_2 each have the density function f(x) = 1/3, -1 < x < 0; f(x) = 2/3, 0 < x < 1; f(x) = 0 elsewhere. Find and sketch the density function of R_1 + R_2.
3. Let R = R_1^2 + \cdots + R_n^2, where R_1, \ldots, R_n are independent, and each R_i is normal with mean 0 and variance 1. Show that the density of R is

    f(x) = \frac{1}{2^{n/2}\Gamma(n/2)}\, x^{(n/2)-1} e^{-x/2}, \qquad x > 0

(R is said to have the "chi-square" distribution with n "degrees of freedom.")
4. A random variable R is said to have the "gamma distribution" if its density is, for some \alpha, \beta > 0,

    f(x) = \frac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\beta^{\alpha}}, \quad x > 0; \qquad f(x) = 0, \quad x < 0

Show that if R_1 and R_2 are independent random variables, each having the gamma distribution with the same \beta, then R_1 + R_2 also has the gamma distribution.
5. If R_1, \ldots, R_n are independent nonnegative random variables, each with density \lambda e^{-\lambda x} u(x), find the density of R_0 = R_1 + \cdots + R_n.
6. Let \theta be uniformly distributed between -\pi/2 and \pi/2. Show that \tan\theta has the Cauchy density.
7. Let R have density f(x) = 1 - |x|, |x| < 1; f(x) = 0, |x| > 1. Show that M_R(u) = 2(1 - \cos u)/u^2.
*8. (a) Suppose that f is the density of a random variable and the associated characteristic function M is real-valued, nonnegative, and integrable. Show that kf(u), -\infty < u < \infty, is the characteristic function of a random variable with density kM(x)/2\pi, where k is chosen so that kf(0) = 1, that is,

    \int_{-\infty}^{\infty} [kM(x)/2\pi]\, dx = 1

(b) Use part (a) to show that the following are characteristic functions of random variables: (i) e^{-|u|}; (ii) M(u) = 1 - |u|, |u| < 1; M(u) = 0, |u| > 1.
*9. Use the calculus of residues to evaluate the characteristic function of the Cauchy density.
10. Calculate the characteristic function of the normal (0, 1) random variable as follows. Differentiate

    M(u) = \int_{-\infty}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2\pi}} \cos ux\, dx

under the integral sign; then integrate by parts to obtain M'(u) = -uM(u). Solve the resulting differential equation to obtain M(u) = e^{-u^2/2}. From this, find the characteristic function of a random variable that is normal with mean m and variance \sigma^2.
5.3 PROPERTIES OF CHARACTERISTIC FUNCTIONS

Let R be a random variable with characteristic function M and generalized characteristic function N. We shall establish several properties of M and N.

1. M(0) = N(0) = 1. This follows, since M(0) = N(0) = E(e^0) = 1.

2. |M(u)| \le 1 for all u. If R has a density f, we have

    |M(u)| = \left| \int_{-\infty}^{\infty} e^{-iux} f(x)\, dx \right| \le \int_{-\infty}^{\infty} |e^{-iux} f(x)|\, dx = \int_{-\infty}^{\infty} f(x)\, dx = 1

The general case can be handled by replacing f(x) dx by dF(x), where F is the distribution function of R. This involves Riemann-Stieltjes integration, which we shall not enter into here.

3. If R has a density f, and f is even, that is, f(-x) = f(x) for all x, then M(u) is real-valued for all u. For

    M(u) = \int_{-\infty}^{\infty} f(x) \cos ux\, dx - i \int_{-\infty}^{\infty} f(x) \sin ux\, dx

Since f(x) is an even function of x and \sin ux is an odd function of x, f(x) \sin ux is odd; hence the second integral is 0. It turns out that the assertion that M(u) is real for all u is equivalent to the statement that R has a symmetric distribution, that is, P\{R \in B\} = P\{R \in -B\} for every Borel set B. (-B = \{-x : x \in B\}.)

4. If R is a discrete random variable taking on only integer values, then M(u + 2\pi) = M(u) for all u. To see this, write

    M(u) = E(e^{-iuR}) = \sum_{n=-\infty}^{\infty} p_n e^{-iun}   (5.3.1)
where p_n = P\{R = n\}. Since e^{-iun} = e^{-i(u+2\pi)n}, the result follows. Note that the p_n are the coefficients of the Fourier series of M on the interval [0, 2\pi]. If we multiply (5.3.1) by e^{iuk} and integrate, we obtain

    p_k = \frac{1}{2\pi} \int_{0}^{2\pi} M(u) e^{iuk}\, du   (5.3.2)
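Formula (5.3.2) can be illustrated numerically. The sketch below is not in the text; it is a Python fragment (NumPy assumed; the integer-valued distribution — binomial with parameters 3 and 0.4 — is an arbitrary illustrative choice) that recovers each p_k from M(u) by a Riemann sum over [0, 2π):

```python
import numpy as np

# Recover p_k = P{R = k} from M(u) = sum_n p_n e^{-iun}, as in (5.3.2).
p = {0: 0.216, 1: 0.432, 2: 0.288, 3: 0.064}   # binomial(3, 0.4)
N = 4096
u = 2 * np.pi * np.arange(N) / N                # uniform grid on [0, 2*pi)
M = sum(pn * np.exp(-1j * u * n) for n, pn in p.items())
for k, pk in p.items():
    # (1/2*pi) * integral of M(u) e^{iuk} du, as a mean over the grid
    est = np.mean(M * np.exp(1j * u * k)).real
    assert abs(est - pk) < 1e-10
```

Because the integrand is a trigonometric polynomial, the uniform-grid Riemann sum here is exact up to rounding.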
We come now to the important moment-generating property. Suppose that N(s) can be expanded in a power series about s = 0:

    N(s) = \sum_{k=0}^{\infty} a_k s^k

where the series converges in some neighborhood of the origin. This is just the Taylor expansion of N; hence the coefficients must be given by

    a_k = \frac{1}{k!} \left[ \frac{d^k N(s)}{ds^k} \right]_{s=0}
But if R has density f and we can differentiate N(s) = \int_{-\infty}^{\infty} e^{-sx} f(x)\, dx under the integral sign, we obtain N'(s) = \int_{-\infty}^{\infty} -x e^{-sx} f(x)\, dx; if we can differentiate k times, we find that

    N^{(k)}(s) = \int_{-\infty}^{\infty} (-x)^k e^{-sx} f(x)\, dx

Thus

    N^{(k)}(0) = (-1)^k E(R^k)   (5.3.3)

and hence

    a_k = \frac{(-1)^k E(R^k)}{k!}

The precise statement is as follows.

5. If N_R(s) is analytic at s = 0 (i.e., expandable in a power series in a neighborhood of s = 0), then all moments of R are finite, and

    N_R(s) = \sum_{k=0}^{\infty} \frac{(-1)^k E(R^k)}{k!}\, s^k   (5.3.4)

within the radius of convergence of the series. In particular, (5.3.3) holds for all k. We shall not give a proof of (5.3.4). The above remarks make it at least plausible; further evidence is presented by the following argument. If R has density f, then

    N(s) = \int_{-\infty}^{\infty} e^{-sx} f(x)\, dx = \int_{-\infty}^{\infty} \left( 1 - sx + \frac{s^2x^2}{2!} - \frac{s^3x^3}{3!} + \cdots + \frac{(-1)^k s^k x^k}{k!} + \cdots \right) f(x)\, dx

If we are allowed to integrate term by term, we obtain (5.3.4). Let us verify (5.3.4) for a numerical example. Let f(x) = e^{-x} u(x), so that N(s) = 1/(s + 1), Re s > -1. We have a power series expansion for N(s) about s = 0:
    \frac{1}{1+s} = 1 - s + s^2 - s^3 + \cdots + (-1)^k s^k + \cdots, \qquad |s| < 1

Equation (5.3.4) indicates that we should have (-1)^k E(R^k)/k! = (-1)^k, or E(R^k) = k!. To check this, notice that

    E(R^k) = \int_{0}^{\infty} x^k e^{-x}\, dx = k!
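The moment check can also be carried out numerically. This sketch is not from the text; it is a Python fragment (NumPy assumed) that computes E(R^k) = ∫_0^∞ x^k e^{-x} dx on a truncated grid and compares with k!:

```python
import numpy as np
from math import factorial

# For f(x) = e^{-x} (x > 0), the k-th moment should equal k!.
x = np.linspace(0.0, 80.0, 800001)   # tail beyond x = 80 is negligible
dx = x[1] - x[0]
f = np.exp(-x)
for k in range(6):
    moment = np.sum(x ** k * f) * dx
    assert abs(moment - factorial(k)) < 1e-3 * factorial(k)
```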
REMARK. Let R be a discrete random variable taking on only nonnegative integer values. In the generalized characteristic function

    N(s) = \sum_{k=0}^{\infty} p_k e^{-sk}, \qquad p_k = P\{R = k\}

make the substitution z = e^{-s}. We obtain

    A(z) = N(s)\big]_{z = e^{-s}} = E(z^R) = \sum_{k=0}^{\infty} p_k z^k

A is called the generating function of R; it is finite at least for |z| \le 1, since \sum_{k=0}^{\infty} p_k = 1. We consider generating functions in detail in connection with the random walk problem in Chapter 6.
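As a small illustration (not in the text): substituting z = e^{-s} into the Poisson transform N(s) = exp[λ(e^{-s} − 1)] of (5.2.3) gives the generating function A(z) = exp[λ(z − 1)], which the following Python sketch checks against the series Σ p_k z^k directly:

```python
from math import exp, factorial

# Poisson generating function: A(z) = E(z^R) = exp(lam*(z - 1)).
lam = 1.7   # illustrative parameter value
for z in [0.2, 0.5, 0.9, 1.0]:
    series = sum(exp(-lam) * lam ** k / factorial(k) * z ** k
                 for k in range(80))   # truncated, but the tail is tiny
    assert abs(series - exp(lam * (z - 1))) < 1e-12
```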
PROBLEMS

1. Could [2/(s + 1)] - (1/s)(1 - e^{-s}) (Re s > 0) be the generalized characteristic function of an (absolutely continuous) random variable? Explain.
2. If the density of a random variable R is zero for x outside the finite interval [a, b], show that N_R(s) is finite for all s.
3. We have stated that if M_R(u) is integrable, R has a density [see (5.2.1)]. Is the converse true?
4. Let R have a lattice distribution; that is, R is discrete and takes on the values a + nd, where a and d are fixed real numbers and n ranges over the integers. What can be said about the characteristic function of R?
5. If R has the Poisson distribution with parameter \lambda, calculate the mean and variance of R by differentiating N_R(s).

5.4 THE CENTRAL LIMIT THEOREM
The weak law of large numbers states that if, for each n, R_1, R_2, \ldots, R_n are independent random variables with finite expectations and uniformly bounded variances, then, for every \varepsilon > 0,

    P\left\{ \left| \frac{1}{n} \sum_{i=1}^{n} (R_i - E(R_i)) \right| \ge \varepsilon \right\} \to 0 \qquad \text{as } n \to \infty

In particular, if the R_i are independent observations of a random variable R (with finite mean m and finite variance \sigma^2), then

    P\left\{ \left| \frac{1}{n} \sum_{i=1}^{n} R_i - m \right| \ge \varepsilon \right\} \to 0 \qquad \text{as } n \to \infty

The central limit theorem gives further information; it says roughly that for large n, the sum R_1 + \cdots + R_n of n independent random variables is approximately normally distributed, under wide conditions on the individual R_i.

To make the idea of "approximately normal" more precise, we need the notion of convergence in distribution. Let R_1, R_2, \ldots be random variables with distribution functions F_1, F_2, \ldots, and let R be a random variable with distribution function F. We say that the sequence R_1, R_2, \ldots converges in distribution to R (notation: R_n \xrightarrow{d} R) iff F_n(x) \to F(x) at all points x at which F is continuous. To see the reason for the restriction to continuity points of F, consider the following example.
► Example 1. Let R_n be uniformly distributed between 0 and 1/n (see Figure 5.4.1). Intuitively, as n \to \infty, R_n approximates more and more closely a random variable R that is identically 0. Now F_n(x) \to F(x) when x \ne 0, but not at x = 0, since F_n(0) = 0 for all n, and F(0) = 1. Since x = 0 is not a continuity point of F, we have R_n \xrightarrow{d} R. ◄

FIGURE 5.4.1  Convergence in Distribution. [Sketches of F_n, F, and f_n.]
REMARK. The type of convergence involved in the weak law of large numbers is called convergence in probability. The sequence R_1, R_2, \ldots is said to converge in probability to R (notation: R_n \xrightarrow{P} R) iff for every \varepsilon > 0, P\{|R_n - R| \ge \varepsilon\} \to 0 as n \to \infty. Intuitively, for large n, R_n is very likely to be very close to R. Thus the weak law of large numbers states that (1/n)\sum_{i=1}^{n} (R_i - E(R_i)) \xrightarrow{P} 0; in the case in which E(R_i) = m for all i, we have

    \frac{1}{n} \sum_{i=1}^{n} R_i \xrightarrow{P} m

The relation between convergence in probability and convergence in distribution is outlined in Problem 1. The basic result about convergence in distribution is the following.

Theorem 1. The sequence R_1, R_2, \ldots converges in distribution to R if and only if M_n(u) \to M(u) for all u, where M_n is the characteristic function of R_n, and M is the characteristic function of R.

The proof is measure-theoretic, and will be omitted. Thus, in order to show that a sequence converges in distribution to a normal random variable, it suffices to show that the corresponding sequence of characteristic functions converges to a normal characteristic function. This is the technique that will be used to prove the main theorem, which we now state.
Theorem 2 (Central Limit Theorem). For each n, let R_1, R_2, \ldots, R_n be independent random variables on a given probability space. Assume that the R_i all have the same density function f (and characteristic function M) with finite mean m and finite variance \sigma^2 > 0, and finite third moment as well. Let

    T_n = \frac{\sum_{i=1}^{n} R_i - nm}{\sqrt{n}\,\sigma}

(= [S_n - E(S_n)]/\sigma(S_n), where S_n = R_1 + \cdots + R_n and \sigma(S_n) is the standard deviation of S_n), so that T_n has mean 0 and variance 1. Then T_1, T_2, \ldots converge in distribution to a random variable that is normal with mean 0 and variance 1.

*Before giving the proof, we need some preliminaries.
Theorem 3. Let f be a complex-valued function on E^1 with n continuous derivatives on the interval V = (-b, b). Then, on V,

    f(u) = \sum_{k=0}^{n-1} \frac{f^{(k)}(0) u^k}{k!} + u^n \int_{0}^{1} \frac{(1-t)^{n-1}}{(n-1)!}\, f^{(n)}(tu)\, dt

Thus, if |f^{(n)}| \le M on V,

    f(u) = \sum_{k=0}^{n-1} \frac{f^{(k)}(0) u^k}{k!} + \frac{\theta u^n}{n!}, \qquad \text{where } |\theta| \le M \ (\theta \text{ depends on } u)

PROOF. Using integration by parts, we obtain

    \int_{0}^{u} f^{(n)}(t)\, \frac{(u-t)^{n-1}}{(n-1)!}\, dt
      = -\frac{f^{(n-1)}(0)\, u^{n-1}}{(n-1)!} + \int_{0}^{u} f^{(n-1)}(t)\, \frac{(u-t)^{n-2}}{(n-2)!}\, dt
      = -\frac{f^{(n-1)}(0)\, u^{n-1}}{(n-1)!} - \frac{f^{(n-2)}(0)\, u^{n-2}}{(n-2)!} + \int_{0}^{u} f^{(n-2)}(t)\, \frac{(u-t)^{n-3}}{(n-3)!}\, dt
      = \cdots = -\sum_{k=1}^{n-1} \frac{f^{(k)}(0) u^k}{k!} + \int_{0}^{u} f'(t)\, dt \qquad \text{(by iteration)}

Thus

    f(u) = \sum_{k=0}^{n-1} \frac{f^{(k)}(0) u^k}{k!} + \int_{0}^{u} f^{(n)}(t)\, \frac{(u-t)^{n-1}}{(n-1)!}\, dt

The change of variables t = ut' in the above integral yields the desired expression for f(u). Now if

    I = u^n \int_{0}^{1} \frac{(1-t)^{n-1}}{(n-1)!}\, f^{(n)}(tu)\, dt

then

    |I| \le M |u|^n \int_{0}^{1} \frac{(1-t)^{n-1}}{(n-1)!}\, dt = \frac{M |u|^n}{n!}

Let \theta = I\, n!/u^n; then |\theta| \le M and the result follows.
Theorem 4. (a) If y is an arbitrary real number,

    e^{-iy} = 1 - iy + \frac{\theta y^2}{2!} = 1 - iy - \frac{y^2}{2} + \frac{\theta_1 y^3}{3!}

where |\theta| \le 1 and |\theta_1| \le 1, \theta and \theta_1 depending on y.

PROOF. This is immediate from Theorem 3.

(b) If z is a complex number and |z| \le 1/2, then

    \ln(1 + z) = z + \theta|z|^2

where |\theta| \le 1, \theta depending on z. (Take \ln 1 = 0, to determine the branch of the logarithm.)

PROOF.

    \ln(1 + z) = z - \frac{z^2}{2} + \frac{z^3}{3} - \frac{z^4}{4} + \cdots

so that

    \ln(1 + z) - z = z^2 \left( -\frac{1}{2} + \frac{z}{3} - \frac{z^2}{4} + \cdots \right)

Now, since |z| \le 1/2,

    \left| -\frac{1}{2} + \frac{z}{3} - \frac{z^2}{4} + \cdots \right| \le \frac{1}{2} + \frac{1}{3}\left(\frac{1}{2}\right) + \frac{1}{4}\left(\frac{1}{2}\right)^2 + \cdots \le \sum_{k=1}^{\infty} \left(\frac{1}{2}\right)^k = 1

and the result follows.

PROOF OF THEOREM 2. By Theorem 1, we must show that M_{T_n}(u) \to e^{-u^2/2}, the characteristic function of a normal (0, 1) random variable. Now we may assume without loss of generality that m = 0. For if we have proved the theorem under this assumption, then write

    T_n = \sum_{i=1}^{n} \frac{R_i - m}{\sqrt{n}\,\sigma}

The random variables R_i - m have mean 0 and variance \sigma^2, and the result will follow.

The characteristic function of \sum_{i=1}^{n} R_i is (M(u))^n; hence the characteristic function of T_n is

    M_{T_n}(u) = \left[ M\!\left( \frac{u}{\sqrt{n}\,\sigma} \right) \right]^n

Now, by Theorem 4a,

    M\!\left( \frac{u}{\sqrt{n}\,\sigma} \right) = \int_{-\infty}^{\infty} \left( 1 - \frac{iux}{\sqrt{n}\,\sigma} - \frac{u^2 x^2}{2n\sigma^2} + \frac{\theta_1(x)\, u^3 x^3}{3!\, n^{3/2}\sigma^3} \right) f(x)\, dx

But

    \int_{-\infty}^{\infty} x f(x)\, dx = m = 0, \qquad \int_{-\infty}^{\infty} x^2 f(x)\, dx = \sigma^2

Thus

    M\!\left( \frac{u}{\sqrt{n}\,\sigma} \right) = 1 - \frac{u^2}{2n} + \frac{c}{n^{3/2}}

where c depends on u but not on n. Hence

    \left[ M\!\left( \frac{u}{\sqrt{n}\,\sigma} \right) \right]^n = \left( 1 - \frac{u^2}{2n} + \frac{c}{n^{3/2}} \right)^n

Take logarithms to obtain, by Theorem 4b,

    n \ln\left( 1 - \frac{u^2}{2n} + \frac{c}{n^{3/2}} \right) = -\frac{u^2}{2} + \frac{c}{n^{1/2}} + n\theta \left| -\frac{u^2}{2n} + \frac{c}{n^{3/2}} \right|^2 \to -\frac{u^2}{2} \qquad \text{as } n \to \infty
Thus M_{T_n}(u) \to e^{-u^2/2}, which is the desired result. ◄

*REMARK. Convergence in distribution of T_n = [S_n - E(S_n)]/\sigma(S_n), S_n = \sum_{i=1}^{n} R_i, \sigma(S_n) = standard deviation of S_n, can be established under conditions much more general than those given in Theorem 2. For example, the finiteness of the third moment is not necessary in Theorem 2; neither is the assumption that the R_i have a density. We give two other sufficient conditions for normal convergence.

1. The R_i are uniformly bounded; that is, there is a constant k such that |R_i| \le k for all i, and also Var(S_n) \to \infty. The requirement that Var S_n \to \infty is necessary, for otherwise we could take R_1 to have an arbitrary distribution function, and R_n = 0 for n \ge 2. Then S_n = R_1, so the functions F_{T_n} are the same for all n and hence in general cannot approach F^*, the normal distribution function with mean 0 and variance 1.

2. (The Liapounov condition)

    \frac{\sum_{k=1}^{n} E|R_k - E R_k|^{2+\delta}}{[\sigma(S_n)]^{2+\delta}} \to 0 \qquad \text{for some } \delta > 0
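The conclusion of Theorem 2 is easy to see in simulation. The sketch below is not part of the text; it is a Python fragment (NumPy assumed, with uniform summands and sample sizes chosen arbitrarily) that standardizes sums of i.i.d. uniform variables and compares P{T_n ≤ 1} with the normal value Φ(1) ≈ .8413:

```python
import numpy as np

# Standardized sums of i.i.d. uniform(0, 1) variables should be
# approximately normal(0, 1) for moderately large n.
rng = np.random.default_rng(0)
n, trials = 50, 100_000
m, sigma = 0.5, np.sqrt(1.0 / 12.0)   # mean and sd of uniform(0, 1)
S = rng.random((trials, n)).sum(axis=1)
T = (S - n * m) / (np.sqrt(n) * sigma)
frac = np.mean(T <= 1.0)               # Monte Carlo estimate of P{T_n <= 1}
assert abs(frac - 0.8413) < 0.01       # Phi(1) is about .8413
```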
REMARK. For each n, let R_1, R_2, \ldots, R_n be independent random variables, where all R_i have the same distribution function, with finite mean m, finite variance \sigma^2, and also finite third moment. It can be shown that there is a positive number k such that |F_{T_n}(x) - F^*(x)| \le k/\sqrt{n} for all x and all n. It follows that, for large n, S_n is approximately normal with mean nm and variance n\sigma^2, in the sense that for all x,

    \left| F_{S_n}(x) - \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi n}\,\sigma}\, e^{-(t-nm)^2/2n\sigma^2}\, dt \right| \le \frac{k}{\sqrt{n}}

For

    F_{S_n}(x) = P\{S_n \le x\} = P\left\{ \frac{S_n - nm}{\sqrt{n}\,\sigma} \le \frac{x - nm}{\sqrt{n}\,\sigma} \right\}

and this differs from F^*((x - nm)/\sqrt{n}\,\sigma) by at most k/\sqrt{n}; but

    F^*\!\left( \frac{x - nm}{\sqrt{n}\,\sigma} \right) = \int_{-\infty}^{(x-nm)/\sqrt{n}\,\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \qquad \left( \text{set } t = \frac{y - nm}{\sqrt{n}\,\sigma} \right) \qquad = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi n}\,\sigma}\, e^{-(y-nm)^2/2n\sigma^2}\, dy

and the result follows. In particular, let R be the number of successes in n Bernoulli trials; then (Example 1, Section 3.5) R = R_1 + \cdots + R_n, where the R_i are independent, and P\{R_i = 1\} = p, P\{R_i = 0\} = 1 - p. Thus, for large n, R is approximately normal with mean np and variance np(1 - p), in the sense described above.
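The Bernoulli case can be examined concretely. The following check is not in the text; it is a Python sketch (standard library only) that compares the exact binomial probability P{R ≤ 8080} for n = 10,000, p = .8 with the normal approximation Φ((8080 − np)/√(np(1 − p))):

```python
import math

n, p = 10_000, 0.8
mean, sd = n * p, math.sqrt(n * p * (1 - p))   # 8000 and 40

def binom_cdf(x):
    # exact P{R <= x}, summing the binomial pmf in log space for stability
    total = 0.0
    for k in range(x + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_pmf)
    return total

def phi(z):
    # standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

exact = binom_cdf(8080)
approx = phi((8080 - mean) / sd)   # Phi(2), about .9772
assert abs(exact - approx) < 0.01
```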
PROBLEMS

1. Show that R_n \xrightarrow{P} R implies R_n \xrightarrow{d} R, as follows. Let F_n be the distribution function of R_n, and F the distribution function of R.
(a) If \varepsilon > 0, show that

    P\{R \le x - \varepsilon\} \le P\{|R_n - R| \ge \varepsilon\} + P\{R_n \le x\}

and

    P\{R_n \le x\} \le P\{|R_n - R| \ge \varepsilon\} + P\{R \le x + \varepsilon\}

Conclude that

    F(x - \varepsilon) - P\{|R_n - R| \ge \varepsilon\} \le F_n(x) \le P\{|R_n - R| \ge \varepsilon\} + F(x + \varepsilon)

(b) If R_n \xrightarrow{P} R and F is continuous at x, show that F_n(x) \to F(x).
2. Give an example of a sequence R_1, R_2, \ldots that converges in distribution to a random variable R, but does not converge in probability to R.
3. If R_n converges in distribution to a constant c, that is, \lim_{n\to\infty} F_n(x) = 1 for x > c, and 0 for x < c, show that R_n \xrightarrow{P} c.
4. Let R_1, R_2, \ldots be random variables such that P\{R_n = e^n\} = 1/n, P\{R_n = 0\} = 1 - 1/n. Show that R_n \xrightarrow{P} 0, but E(R_n^k) \to \infty as n \to \infty, for any fixed k > 0.
5. Two candidates, A and B, are running for President and it is desired to predict the outcome of the election. Assume that n people are selected independently and at random and asked their preference. Suppose that the probability of selecting a voter who favors A in any particular observation is p. (p is fixed but unknown.) Let Q_n be the relative frequency of "A" voters in the sample; that is,

    Q_n = \frac{\text{number of "A" voters in sample}}{\text{size of sample}}

(a) We wish to choose n large enough so that P\{|Q_n - p| \le .001\} \ge .99 for all possible values of p. In other words, we wish to predict A's percentage of the vote to within .1%, with 99% confidence. Estimate the minimum value of n.
(b) Estimate the minimum value of n if we wish to predict A's percentage to within 1%, with 95% confidence. (Use the central limit theorem.)
6. (a) Show that the normal density function (with mean 0, variance 1) satisfies the inequality

    \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \le \frac{1}{\sqrt{2\pi}}\, \frac{1}{x}\, e^{-x^2/2}, \qquad x > 0

HINT: show that \int_{x}^{\infty} e^{-t^2/2}\, dt \le (1/x)\, e^{-x^2/2} by differentiating both sides.
(b) Show that

    \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \sim \frac{1}{\sqrt{2\pi}\,x}\, e^{-x^2/2}

in the sense that the ratio of the two sides approaches 1 as x \to \infty.
7. Consider a container holding n = 10^6 molecules. In the steady state it is reasonable that there be roughly as many molecules on the left side as on the right. Assume that the molecules are dropped independently and at random into the container and that each molecule may fall with equal probability on the left or right side. If R is the number of molecules on the right side of the container, we may invoke the central limit theorem to justify the physical assumption that for the purpose of calculating P\{a \le R \le b\} we may regard R as normally distributed with mean np = n/2 and variance np(1 - p) = n/4. Use Problem 6 to bound P\{|R - n/2| > .005n\}, the probability of a fluctuation about the mean of more than \pm .5% of the total number of molecules.
8. Let R be the number of successes in 10,000 Bernoulli trials, with probability of success .8 on a given trial. Use the central limit theorem to estimate P\{7940 < R < 8080\}.
Infinite Sequences of Random Variables

6.1 INTRODUCTION
We have not yet encountered any situation in which it is necessary to consider an infinite collection of random variables, all defined on the same probability space. In the central limit theorem, for example, the basic underlying hypothesis is "For each n, let R_1, \ldots, R_n be independent random variables." As n changes, the underlying probability space may change, but this is of no consequence, since a convergence in distribution statement is a statement about convergence of a sequence of real-valued functions on E^1. If R_1, \ldots, R_n are independent, with distribution functions F_1, \ldots, F_n, and T_n = (S_n - E(S_n))/\sigma(S_n), S_n = R_1 + \cdots + R_n, the distribution function of T_n is completely determined by the F_i, and the validity of a statement about convergence in distribution of T_n is also determined by the F_i, regardless of the construction of the underlying space. However, there are occasions when it is necessary to consider an infinite number of random variables defined on the same probability space. For example, consider the following random experiment. We start at the origin on the real line, and flip a coin independently over and over again. If the result of the first toss is heads, we take one step to the right (i.e., from x = 0
to x = 1), and if the result is tails, we move one step to the left (to x = -1). We continue the process; if we are at x = k after n trials, then at trial n + 1 we move to x = k + 1 if the (n + 1)th toss results in heads, or to x = k - 1 if it results in tails. We ask, for example, for the probability of eventually returning to the origin. Now the position after n steps is the sum S_n = R_1 + \cdots + R_n of n independent random variables, where P\{R_k = 1\} = p = probability of heads, P\{R_k = -1\} = 1 - p. We are looking for P\{S_n = 0 for some n > 0\}.

We must decide what probability space we are considering. If we are interested only in the first n trials, there is no problem. We simply have a sequence of n Bernoulli trials, and we have considered the assignment of probabilities in detail. However, the difficulty is that the event \{S_n = 0 for some n > 0\} involves infinitely many trials. We must take \Omega = E^\infty = all infinite sequences (x_1, x_2, \ldots) of real numbers. (In this case we may restrict the x_i to be \pm 1, but it is convenient to allow arbitrary x_i so that the discussion will apply to the general problem of assigning probabilities to events involving infinitely many random variables.) We have the problem of specifying the sigma field \mathscr{F} and the probability measure P. The physical description of the problem has determined all probabilities involving finitely many R_i; that is, we know P\{(R_1, \ldots, R_n) \in B\} for each positive integer n and n-dimensional Borel set B. What we would like to conclude is that a reasonable specification of probabilities involving finitely many R_i determines the probability of events involving all the R_i. For example, consider \{all R_i = 1\}. This event may be expressed as

    \bigcap_{n=1}^{\infty} \{R_1 = 1, \ldots, R_n = 1\}

The sets \{R_1 = 1, \ldots, R_n = 1\} form a contracting sequence; hence

    P\{\text{all } R_i = 1\} = \lim_{n\to\infty} P\{R_1 = 1, \ldots, R_n = 1\} = \lim_{n\to\infty} p^n = 0 \quad \text{if } p < 1

As another example,

    \{R_n = 1 \text{ for infinitely many } n\} = \{\text{for every } n, \text{ there exists } k \ge n \text{ such that } R_k = 1\} = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} \{R_k = 1\}

Thus

    P\{R_n = 1 \text{ for infinitely many } n\} = \lim_{n\to\infty} P\left[ \bigcup_{k=n}^{\infty} \{R_k = 1\} \right] = \lim_{n\to\infty} \lim_{m\to\infty} P\left[ \bigcup_{k=n}^{m} \{R_k = 1\} \right]
Thus again the probability is determined once we know the probabilities of all events involving finitely many R_i. We now sketch the general situation. Let \Omega = E^\infty. A set of the form \{(x_1, x_2, \ldots) : (x_1, \ldots, x_n) \in B_n\}, where B_n \subset E^n, is called a cylinder with base B_n, and a measurable cylinder if B_n is a Borel subset of E^n. Suppose that for each n we specify a probability measure P_n on the Borel subsets of E^n; P_n(B_n) is to be interpreted as P\{(R_1, \ldots, R_n) \in B_n\}, where R_i(x_1, x_2, \ldots) = x_i. Suppose, for example, that we have specified P_5. Then P_k, k \le 5, is determined. In particular, in the discrete case we have

    P\{(R_1, R_2, R_3) \in B_3\} = \sum_{\substack{(x_1, x_2, x_3) \in B_3 \\ -\infty < x_4 < \infty, \ -\infty < x_5 < \infty}} P\{R_1 = x_1, R_2 = x_2, R_3 = x_3, R_4 = x_4, R_5 = x_5\}

and in the absolutely continuous case we have

    P\{(R_1, R_2, R_3) \in B_3\} = \int \cdots \int_{\substack{(x_1, x_2, x_3) \in B_3 \\ -\infty < x_4 < \infty, \ -\infty < x_5 < \infty}} f(x_1, x_2, x_3, x_4, x_5)\, dx_1\, dx_2\, dx_3\, dx_4\, dx_5

In general, once P_n is given, P_k, k \le n, is determined. But we have specified P_k, k \le n, at the beginning; if our assignment of probabilities is to make sense, the original P_k must agree with that derived from P_n, n \ge k. If, for all n = 1, 2, \ldots and all k \le n, the probability measure P_k originally specified agrees with that derived from P_n, n \ge k, we say that the probability measures are consistent. Under the consistency hypothesis, the Kolmogorov extension theorem states that there is a unique probability measure P on \mathscr{F} = the smallest sigma field of subsets of \Omega containing the measurable cylinders, such that

    P(\text{the measurable cylinder with base } B_n) = P_n(B_n)

for all n = 1, 2, \ldots and all Borel subsets B_n of E^n. In other words, a consistent specification of finite-dimensional probabilities determines the probabilities of events involving all the R_i.

We now consider the case in which the R_i are discrete. Here we determine probabilities involving (R_1, \ldots, R_n) by prescribing the joint probability function p_{12\cdots n}(x_1, \ldots, x_n) = P\{R_1 = x_1, \ldots, R_n = x_n\}. We may then derive the joint probability function of R_1, \ldots, R_k:

    P\{R_1 = x_1, \ldots, R_k = x_k\} = \sum_{x_{k+1}, \ldots, x_n} P\{R_1 = x_1, \ldots, R_n = x_n\}   (6.1.1)
If this coincides with the given p_{12\cdots k} (for all n and all k \le n), we say that the system of joint probability functions is consistent. If we sum (6.1.1) over (x_1, \ldots, x_k) \in B_k, we find that consistency of the joint probability functions is equivalent to consistency of the associated probability measures P_n. Thus in the discrete case the essential point is the consistency of the joint probability functions. In particular, suppose that we require that for each n, R_1, \ldots, R_n be independent, with R_i having a specified probability function p_i. Then (6.1.1) becomes

    P\{R_1 = x_1, \ldots, R_k = x_k\} = \sum_{x_{k+1}, \ldots, x_n} p_1(x_1) \cdots p_n(x_n) = p_1(x_1) \cdots p_k(x_k)

and thus the joint probability functions are consistent. The point we are making here is that there is a unique probability measure on \mathscr{F} such that the random variables R_1, R_2, \ldots are independent, each with a specified probability function. In other words, the statement "Let R_1, R_2, \ldots be independent random variables, where R_i is discrete and has probability function p_i," is unambiguous.

In the absolutely continuous case, probabilities involving R_1, \ldots, R_n are determined by the joint density function f_{12\cdots n}. The joint density of R_1, \ldots, R_k is then given by

    g(x_1, \ldots, x_k) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{12\cdots n}(x_1, \ldots, x_n)\, dx_{k+1} \cdots dx_n   (6.1.2)

If this coincides with the given f_{12\cdots k} (n, k = 1, 2, \ldots, k \le n), we say that the system of joint densities is consistent. By integrating (6.1.2) over Borel sets B_k \subset E^k, we find that consistency of joint density functions is equivalent to consistency of the associated probability measures P_n. In particular, if we require that for each n, R_1, \ldots, R_n be independent, with R_i having a specified density function f_i, then (6.1.2) becomes

    g(x_1, \ldots, x_k) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_1(x_1) \cdots f_n(x_n)\, dx_{k+1} \cdots dx_n = f_1(x_1) \cdots f_k(x_k)

Therefore the joint density functions are consistent, and the statement "Let R_1, R_2, \ldots be independent random variables, where R_i is absolutely continuous with density function f_i," is unambiguous.
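The marginalization step in (6.1.1) is easy to see in a concrete case. The sketch below is not in the text (the marginals used are arbitrary illustrative choices); it sums a product joint probability function over the last coordinate and recovers the product of the first two marginals:

```python
# Marginal probability functions p1, p2, p3 (arbitrary illustrative choices).
p1 = {0: 0.3, 1: 0.7}
p2 = {0: 0.5, 1: 0.5}
p3 = {-1: 0.2, 1: 0.8}

def joint(x1, x2, x3):
    # independent case: the joint probability function is the product
    return p1[x1] * p2[x2] * p3[x3]

# Summing over x3, as in (6.1.1) with n = 3 and k = 2, must give p1*p2.
for x1 in p1:
    for x2 in p2:
        marginal = sum(joint(x1, x2, x3) for x3 in p3)
        assert abs(marginal - p1[x1] * p2[x2]) < 1e-12
```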
PROBLEMS

1. By working directly with the probability measures P_n, give an argument shorter than the one above to show that the statement "Let R_1, R_2, \ldots be independent random variables, where R_i is absolutely continuous with density f_i," is unambiguous.
2. If R_1, R_2, \ldots are independent, with P\{R_i = 1\} = p, P\{R_i = -1\} = 1 - p, as at the beginning of the section, find P\{R_n = 1 for infinitely many n\}; also, find P\{\lim_{n\to\infty} R_n = 1\}. (Assume 0 < p < 1.)
6.2 THE GAMBLER'S RUIN PROBLEM

Suppose that a gambler starts with a capital of x dollars and plays a sequence of games against an opponent with b - x dollars. At each trial he wins a dollar with probability p, and loses a dollar with probability q = 1 - p. (The trials are assumed independent, with 0 < p < 1, 0 < x < b.) The process continues until the gambler's capital reaches 0 (ruin) or b (victory). We wish to find h(x), the probability of eventual ruin when the initial capital is x.
Let A = \{eventual ruin\}, B_1 = \{win on trial 1\}, B_2 = \{lose on trial 1\}. By the theorem of total probability,

    P(A) = P(B_1)P(A \mid B_1) + P(B_2)P(A \mid B_2)

We are given that P(B_1) = p, P(B_2) = q; P(A) is the unknown probability h(x). Now if the gambler wins at the first trial, his capital is then x + 1; thus P(A \mid B_1) is the probability of eventual ruin, starting at x + 1, that is, h(x + 1). Similarly, P(A \mid B_2) = h(x - 1). Thus

    h(x) = p\,h(x + 1) + q\,h(x - 1), \qquad x = 1, 2, \ldots, b - 1   (6.2.1)

[The intuition behind the argument leading to (6.2.1) is compelling; however, a formal proof involves concepts not treated in this book, and will be omitted.]

We have not yet found h(x), but we know that it satisfies (6.2.1), a linear homogeneous difference equation with constant coefficients. The boundary conditions are h(0) = 1, h(b) = 0. To see this, note that if x = 1, then with probability p the gambler wins on trial 1; his probability of eventual ruin is then h(2). With probability q he loses on trial 1, and then he is already ruined. In other words, if (6.2.1) is to be satisfied at x = 1, we must have h(0) = 1. Similarly, examination of (6.2.1) at x = b - 1 shows that h(b) = 0. The difference equation may be put into the standard form

    p\,h(x + 2) - h(x + 1) + q\,h(x) = 0, \qquad x = 0, 1, \ldots, b - 2; \qquad h(0) = 1, \ h(b) = 0

It is solved in the same way as the analogous differential equation

    p\,\frac{d^2y}{dx^2} - \frac{dy}{dx} + q\,y = 0
We assume an exponential solution; for convenience, we take h(x) = λ^x (= e^{x ln λ}). Then pλ^{x+2} − λ^{x+1} + qλ^x = λ^x(pλ² − λ + q) = 0. Since λ^x is never 0, the only allowable λ's are the roots of the characteristic equation pλ² − λ + q = 0, namely,

λ = (1 ± √(1 − 4pq)) / 2p

Now

(p − q)² = p² − 2pq + q² = p² + 2pq + q² − 4pq = (p + q)² − 4pq = 1 − 4pq   (6.2.2)

Hence

λ = (1 ± |p − q|) / 2p

The two roots are

λ1 = (1 + (p − q)) / 2p = 1,   λ2 = (1 − (p − q)) / 2p = q/p
CASE 1. p ≠ q. Then λ1 and λ2 are distinct; hence

h(x) = Aλ1^x + Cλ2^x = A + C(q/p)^x
h(0) = A + C = 1
h(b) = A + C(q/p)^b = 0

Solving, we obtain

A = −(q/p)^b / (1 − (q/p)^b),   C = 1 / (1 − (q/p)^b)

Therefore

h(x) = ((q/p)^x − (q/p)^b) / (1 − (q/p)^b)   (6.2.3)
CASE 2. p = q = 1/2. Then λ1 = λ2 = λ = 1, a repeated root. In such a case (just as in the analogous differential equation) we may construct two linearly independent solutions by taking λ^x and xλ^x; that is,

h(x) = Aλ^x + Cxλ^x = A + Cx
h(0) = A = 1
h(b) = A + Cb = 0,  so C = −1/b

Thus

h(x) = 1 − x/b = (b − x)/b   (6.2.4)
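The closed forms (6.2.3) and (6.2.4) are easy to check against a direct simulation of the game. The sketch below is illustrative only; the function names are mine, not the book's.

```python
import random

def ruin_formula(x, b, p):
    """Closed-form ruin probability h(x): (6.2.3) for p != q, (6.2.4) for p == q."""
    if p == 0.5:
        return (b - x) / b
    r = (1 - p) / p  # q/p
    return (r ** x - r ** b) / (1 - r ** b)

def ruin_simulated(x, b, p, trials=50_000, seed=0):
    """Monte Carlo estimate of the ruin probability by playing the game to absorption."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(trials):
        capital = x
        while 0 < capital < b:
            capital += 1 if rng.random() < p else -1
        if capital == 0:
            ruined += 1
    return ruined / trials

print(ruin_formula(5, 10, 0.5))            # 0.5, as symmetry demands
print(round(ruin_formula(3, 10, 0.4), 4))  # unfavorable game: ruin is very likely
print(round(ruin_simulated(3, 10, 0.4), 3))
```

The simulated value should agree with the formula to within sampling error.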
FIGURE 6.2.1 Probability of Eventual Ruin (sketch of h(x) in the various cases).
so that the probability of eventual ruin is the ratio of the adversary's capital to the total capital. A sketch of h(x) in the various cases is shown in Figure 6.2.1.

Similarly, let g(x) be the probability of eventual victory, starting with a capital of x dollars. We cannot conclude immediately that g(x) = 1 − h(x), since there is the possibility that the game will never end; that is, the gambler's fortune might oscillate forever within the limits x = 1 and x = b − 1. However, we can show that this event has probability 0, as follows. By the same reasoning as that leading to (6.2.1), we obtain

g(x) = p g(x + 1) + q g(x − 1)   (6.2.5)

The boundary conditions are now g(0) = 0, g(b) = 1. But we may verify that g(x) = 1 − h(x) satisfies (6.2.5) with the given boundary conditions; since the solution is unique (see Problem 1), we must have g(x) = 1 − h(x); that is, the game ends with probability 1.

We should mention, at least in passing, the probability space we are working in. We take Ω = E^∞, F = the smallest sigma field containing the measurable cylinders, Ri(x1, x2, ...) = xi, i = 1, 2, ..., and P the probability measure determined by the requirement that R1, R2, ... be independent, with P{Ri = 1} = p, P{Ri = −1} = q. Thus Ri is the gambler's net gain on trial i, and x + ∑_{i=1}^n Ri is his capital after n trials. We are looking for h(x) = P{for some n, x + ∑_{i=1}^n Ri = 0 and 0 < x + ∑_{i=1}^k Ri < b, k = 1, 2, ..., n − 1}.

A sequence of random variables of the form x + ∑_{i=1}^n Ri, n = 1, 2, ..., where the Ri are independent and have the same distribution function (or, more generally, R0 + ∑_{i=1}^n Ri, n = 1, 2, ..., where R0, R1, R2, ... are independent and R1, R2, ... have the same distribution function), is called a random walk, a simple random walk if Ri (i ≥ 1) takes on only the values ±1. The present case may be regarded as a simple random walk with absorbing
barriers at 0 and b, since when the gambler's fortune reaches either of these figures, the game ends, and we may as well regard his capital as forever frozen. We wish to investigate the effect of removing one or both of the barriers. Let h_b(x) be the probability of eventual ruin starting from x, when the total capital is b. It is reasonable to expect that lim_{b→∞} h_b(x) should be the probability of eventual ruin when the gambler has the misfortune of playing against an adversary with infinite capital. Let us verify this.

Consider the simple random walk with only the barrier at x = 0 present; that is, the adversary has infinite capital. If the gambler starts at x > 0, his probability h*(x) of eventual ruin is

h*(x) = P(A) = P{for some positive integer b, 0 is reached before b}

Let A_b = {0 is reached before b}. The sets A_b, b = 1, 2, ..., form an expanding sequence whose union is A; hence

P(A) = lim_{b→∞} P(A_b)

But P(A_b) = h_b(x). Consequently

h*(x) = lim_{b→∞} h_b(x) = 1 if q ≥ p;  = (q/p)^x if q < p,  by (6.2.3) and (6.2.4)  (x = 1, 2, ...)   (6.2.6)
Thus, in fact, lim_{b→∞} h_b(x) is the probability h*(x) of eventual ruin when the adversary has infinite capital; 1 − h*(x) is the probability that the origin will never be reached, that is, that the game will never end. If q < p, then h*(x) < 1, and so there is a positive probability that the game will go on forever.

Finally, consider a simple random walk starting at 0, with no barriers. Let r be the probability of eventually returning to 0. Now if R1 = 1 (a win on trial 1), there will be a return to 0 with probability h*(1). If R1 = −1 (a loss on trial 1), the probability of eventually reaching 0 is found by evaluating h*(1) with q and p interchanged, that is, 1 for p ≥ q, and p/q if p < q. Thus,

if q < p,  r = p(q/p) + q(1) = 2q
if p < q,  r = p(1) + q(p/q) = 2p
One expression covers both of these cases, namely,

r = 1 − |p − q|  (= 1 if p = q = 1/2;  < 1 if p ≠ q)   (6.2.7)
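A small simulation illustrates (6.2.7): the probability of returning within a large but finite number of steps approaches 1 − |p − q|. (For p = q the approach is quite slow, since returns can take a very long time.) This sketch is mine, with illustrative names.

```python
import random

def return_within(steps, p, trials=100_000, seed=1):
    """Estimate P{the simple random walk returns to 0 within `steps` steps}."""
    rng = random.Random(seed)
    returned = 0
    for _ in range(trials):
        s = 0
        for _ in range(steps):
            s += 1 if rng.random() < p else -1
            if s == 0:
                returned += 1
                break
    return returned / trials

for p in (0.5, 0.7):
    limit = 1 - abs(p - (1 - p))  # r = 1 - |p - q| from (6.2.7)
    print(p, limit, round(return_within(400, p), 3))
```

For p = 0.7 the estimate should sit just below the limiting value 0.6; for p = 0.5 it creeps toward 1.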
PROBLEMS
1. Show that the difference equation arising from the gambler's ruin problem has a unique solution subject to given boundary conditions at x = 0 and x = b.
2. In the gambler's ruin problem, let D(x) be the average duration of the game when the initial capital is x. Show that D(x) = p(1 + D(x + 1)) + q(1 + D(x − 1)), x = 1, 2, ..., b − 1 [the boundary conditions are D(0) = D(b) = 0].
3. Show that the solution to the difference equation of Problem 2 is

D(x) = x/(q − p) − (b/(q − p)) (1 − (q/p)^x)/(1 − (q/p)^b)  if p ≠ q
     = x(b − x)  if p = q = 1/2

[D(x) can be shown to be finite, so that the usual method of solution applies; see Problem 4, Section 7.4.]
REMARK. If D_b(x) is the average duration of the game when the total capital is b, then lim_{b→∞} D_b(x) (= ∞ if p ≥ q; = x/(q − p) if p < q) can be interpreted as the average length of time required to reach 0 when the adversary has infinite capital.
4. In a simple random walk starting at 0 (with no barriers), show that the average length of time required to return to the origin is infinite. (Corollary: A couple decides to have children until the number of boys equals the number of girls. The average number of children is infinite.)
5. Consider the simple random walk starting at 0. If b > 0, find the probability that x = b will eventually be reached.
6.3 COMBINATORIAL APPROACH TO THE RANDOM WALK; THE REFLECTION PRINCIPLE
In this section we obtain, by combinatorial methods, some explicit results connected with the simple random walk. We assume that the walk starts at 0, with no barriers; thus the position at time n is S_n = ∑_{k=1}^n R_k, where R1, R2, ... are independent random variables with P{R_k = 1} = p, P{R_k = −1} = q = 1 − p. We may regard the R_k as the results of an infinite sequence of Bernoulli trials; we call an occurrence of R_k = 1 a "success," and that of R_k = −1 a "failure."

Suppose that among the first n trials there are exactly a successes and b failures (a + b = n); say a > b. We ask for the (conditional) probability that the process will always be positive at times 1, 2, ..., n, that is,

P{S1 > 0, S2 > 0, ..., S_n > 0 | S_n = a − b}   (6.3.1)
(Notice that the only way that S_n can equal a − b is for there to be a successes and b failures in the first n trials; for if x is the number of successes and y the number of failures, then x + y = n = a + b and x − y = a − b; hence x = a, y = b. Thus {S_n = a − b} = {a successes, b failures in the first n trials}.) Now (6.3.1) may be written as
P{S1 > 0, ..., S_n > 0, S_n = a − b} / P{S_n = a − b}   (6.3.2)

A favorable outcome in the numerator corresponds to a path from (0, 0) to (n, a − b) that always lies above the axis,† and a favorable outcome in the denominator to an arbitrary path from (0, 0) to (n, a − b) (see Figure 6.3.1). Thus (6.3.2) becomes

p^a q^b [the number of paths from (0, 0) to (n, a − b) that are always above 0] / p^a q^b [the total number of paths from (0, 0) to (n, a − b)]

A path from (0, 0) to (n, a − b) is determined by selecting a positions out of n for the successes to occur; the total number of paths is C(n, a) = C(a + b, a), where C(n, k) denotes the binomial coefficient n!/(k!(n − k)!). To count the number of paths lying above 0, we reason as follows (see Figure 6.3.2). Let A and B be points above the axis. Given any path from A to B that touches or crosses the axis, reflect the segment between A and the first zero point T, as shown. We get a path from A' to B, where A' is the reflection of A. Conversely, given any path from A' to B, the path must reach the axis at some point T. Reflecting the segment from A' to T, we obtain a path from A to B that touches or crosses the axis. The correspondence thus established is one-to-one; hence

the number of paths from A to B that touch or cross the axis = the total number of paths from A' to B

† Terminology: For the purpose of determining whether or not a path lies above the axis (or touches it, crosses it, etc.), the end points are not included in the path.
FIGURE 6.3.1 A Path in the Random Walk. n = 6, a = 4, b = 2; P{R1 = R2 = 1, R3 = −1, R4 = R5 = 1, R6 = −1} = p⁴q², one contribution to P{S1 > 0, ..., S6 > 0, S6 = 2}.

This is called the reflection principle. Now

the number of paths from (0, 0) to (n, a − b) lying entirely above the axis
= the number from (1, 1) to (n, a − b) that neither touch nor cross the axis (since R1 must be +1 in this case)
= the total number from (1, 1) to (n, a − b) − the number from (1, 1) to (n, a − b) that either touch or cross the axis
= the total number from (1, 1) to (n, a − b) − the total number from (1, −1) to (n, a − b)  (by the reflection principle)

[Notice that in a path from (1, 1) to (n, a − b) there are x successes and y failures, where x + y = n − 1 = a + b − 1 and x − y = a − b − 1, so
FIGURE 6.3.2 Reflection Principle.
x = a − 1, y = b. Similarly, a path from (1, −1) to (n, a − b) must have a successes and b − 1 failures.] Thus the count is

C(n − 1, a − 1) − C(n − 1, a) = (n − 1)!/((a − 1)! b!) − (n − 1)!/(a! (b − 1)!) = ((a − b)/n) · n!/(a! b!)

Thus, if a + b = n,

the number of paths from (0, 0) to (n, a − b) lying entirely above the axis = ((a − b)/(a + b)) C(a + b, a)   (6.3.3)

Therefore

P{S1 > 0, ..., S_n > 0 | S_n = a − b} = (a − b)/(a + b) = (a − b)/n   (6.3.4)
REMARK. This problem is equivalent to the ballot problem: In an election with two candidates, candidate 1 receives a votes and candidate 2 receives b votes, with a > b, a + b = n. The ballots are shuffled and counted one by one. The probability that candidate 1 will lead throughout the balloting is (a − b)/(a + b). [Each possible sequence of ballots corresponds to a path from (0, 0) to (n, a − b); a sequence in which candidate 1 is always ahead corresponds to a path from (0, 0) to (n, a − b) that is always above the axis.]
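For small a and b, the ballot formula can be verified by brute-force enumeration of all orderings; the helper name below is mine.

```python
from itertools import combinations
from math import comb

def leads_throughout(a, b):
    """Count orderings of a (+1)-votes and b (-1)-votes whose partial sums stay > 0."""
    n = a + b
    count = 0
    for ups in combinations(range(n), a):  # positions of the +1 steps
        up_set = set(ups)
        s = 0
        for i in range(n):
            s += 1 if i in up_set else -1
            if s <= 0:
                break
        else:
            count += 1
    return count

a, b = 5, 3
total = comb(a + b, a)
print(leads_throughout(a, b), (a - b) * total // (a + b))  # both equal 14
```

Dividing the first number by C(a + b, a) = 56 recovers the probability (a − b)/(a + b) = 1/4.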
We now compute

h_j = P{the first return to 0 occurs at time j}

Since h_j must be 0 for j odd, we may as well set j = 2n. Now

h_{2n} = P{S1 ≠ 0, ..., S_{2n−1} ≠ 0, S_{2n} = 0}

and thus h_{2n} is the number of paths from (0, 0) to (2n, 0) lying above the axis, times 2 (to take into account paths lying below the axis), times p^n q^n, the probability of each path (see Figure 6.3.3).

FIGURE 6.3.3 A First Return to 0 at Time 2n.

The number of paths from (0, 0) to (2n, 0) lying above the axis is the number from (0, 0) to (2n − 1, 1) lying above the axis, which, by (6.3.3), is C(2n − 1, n)(a − b)/(2n − 1) (where a + b = 2n − 1, a − b = 1; hence a = n, b = n − 1), that is,

(1/(2n − 1)) C(2n − 1, n) = (2n − 2)!/(n! (n − 1)!) = (1/n) C(2n − 2, n − 1)

Thus

h_{2n} = (2/n) C(2n − 2, n − 1) (pq)^n   (6.3.5)

FIGURE 6.3.4 Computation of First Passage Times (a path from (0, 0) to (y + 2k, y)).
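Formula (6.3.5) can be checked exactly, for small n, by enumerating all ±1 paths of length 2n and adding up the probabilities of those whose first return to 0 is at time 2n. The function names are illustrative.

```python
from itertools import product
from math import comb

def h2n_formula(n, p):
    """(6.3.5): probability that the first return to 0 occurs at time 2n."""
    q = 1 - p
    return (2 / n) * comb(2 * n - 2, n - 1) * (p * q) ** n

def h2n_enumerated(n, p):
    """Sum path probabilities over all +-1 paths of length 2n with first return at 2n."""
    q = 1 - p
    total = 0.0
    for steps in product((1, -1), repeat=2 * n):
        s = 0
        first_return_at_end = True
        for i, step in enumerate(steps):
            s += step
            if s == 0 and i < 2 * n - 1:   # returned too early
                first_return_at_end = False
                break
        if first_return_at_end and s == 0:
            total += p ** steps.count(1) * q ** steps.count(-1)
    return total

for n in (1, 2, 3):
    assert abs(h2n_formula(n, 0.6) - h2n_enumerated(n, 0.6)) < 1e-12
print(h2n_formula(1, 0.5))  # h_2 = 2pq = 0.5
```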
We now compute probabilities of first passage times, that is, P{the first passage through y > 0 takes place at time r}. The only possible values of r are of the form y + 2k, k = 0, 1, ...; hence we are looking for

h^{(y)}_{y+2k} = P{the first passage through y > 0 occurs at time y + 2k}

To do the computation in an effortless manner, see Figure 6.3.4. If we look at the path of Figure 6.3.4 backward, it always lies below y and travels a vertical distance y in time y + 2k. Thus the number of favorable paths is the number of paths from (0, 0) to (y + 2k, y) that lie above the axis; that is, by (6.3.3),

C(y + 2k, a) (a − b)/(y + 2k),  where a + b = y + 2k, a − b = y

Thus a = y + k, b = k. Consequently

h^{(y)}_{y+2k} = (y/(y + 2k)) C(y + 2k, y + k) p^{y+k} q^k   (6.3.6)
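As with the first-return formula, (6.3.6) can be verified by enumeration for small y and k; the code below is a quick check, with names of my own choosing.

```python
from itertools import product
from math import comb

def first_passage_formula(y, k, p):
    """(6.3.6): P{first passage through y occurs at time y + 2k}."""
    n = y + 2 * k
    return (y / n) * comb(n, y + k) * p ** (y + k) * (1 - p) ** k

def first_passage_enumerated(y, k, p):
    """Enumerate all +-1 paths of length y + 2k that first hit y at the final step."""
    n = y + 2 * k
    q = 1 - p
    total = 0.0
    for steps in product((1, -1), repeat=n):
        s, hit_early = 0, False
        for i, step in enumerate(steps):
            s += step
            if s == y and i < n - 1:
                hit_early = True
                break
        if not hit_early and s == y:
            total += p ** steps.count(1) * q ** steps.count(-1)
    return total

for y, k in ((1, 0), (1, 1), (2, 1), (3, 2)):
    assert abs(first_passage_formula(y, k, 0.55) - first_passage_enumerated(y, k, 0.55)) < 1e-12
print(first_passage_formula(1, 0, 0.5))  # trivially p = 0.5
```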
PROBLEMS
The first five problems refer to the simple random walk starting at 0, with no barriers.
1. Show that P{S1 ≥ 0, S2 ≥ 0, ..., S_{2n−1} ≥ 0, S_{2n} = 0} = u_{2n}/(n + 1), where u_{2n} = P{S_{2n} = 0} is the probability of n successes (and n failures) in 2n Bernoulli trials, that is, C(2n, n)(pq)^n.
2. Let p = q = 1/2. (a) Show that h_{2n} = u_{2n−2}/2n, where u_{2n} = C(2n, n)(1/2)^{2n}. (b) Show that u_{2n}/u_{2n−2} = 1 − 1/2n; hence h_{2n} = u_{2n−2} − u_{2n}.
3. If p = q = 1/2, show that P{S1 ≠ 0, ..., S_{2n} ≠ 0}, the probability of no return to the origin in the first 2n steps, is u_{2n} = C(2n, n)2^{−2n}. Show also that P{S1 ≠ 0, ..., S_{2n−2} ≠ 0} = h_{2n} + u_{2n}.
4. If p = q = 1/2, show that P{S1 > 0, ..., S_{2n} > 0} = (1/2)C(2n, n)2^{−2n}.
5. If p = q = 1/2, show that the average length of time required to return to the origin is infinite, by using Stirling's formula to find the asymptotic expression for h_{2n}, and then showing that ∑_n n h_{2n} = ∞.
6. Two players each toss an unbiased coin independently n times. Show that the probability that each player will have the same number of heads after n tosses is (1/2)^{2n} ∑_{k=0}^n C(n, k)².
7. By looking at Problem 6 in a slightly different way, show that ∑_{k=0}^n C(n, k)² = C(2n, n).
8. A spider and a fly are situated at the corners of an n by n grid, as shown in Figure P.6.3.8. The spider walks only north or east, the fly only south or west; they take their steps simultaneously, to an adjacent vertex of the grid. (a) Show that if they meet, the point of contact must be on the diagonal D. (b) Show that if the successive steps are independent, and equally likely to go in each of the two possible directions, the probability that they will meet is C(2n, n)(1/2)^{2n}.

FIGURE P.6.3.8 (the n by n grid, with the spider and the fly at opposite corners and the diagonal D between them)
6.4 GENERATING FUNCTIONS

Let {a_n, n ≥ 0} be a bounded sequence of real numbers. The generating function of the sequence is defined by

A(z) = ∑_{n=0}^∞ a_n z^n,  z complex
The series converges at least for |z| < 1. If R is a discrete random variable taking on only nonnegative integer values, with P{R = n} = a_n, n = 0, 1, ..., then A(z) = ∑_{n=0}^∞ z^n P{R = n} is called the generating function of R. Note that A(z) = E(z^R), the characteristic function of R with z replacing e^{−iu}. We have seen that the characteristic function of a sum of independent random variables is the product of the characteristic functions. An analogous result holds for generating functions.

Theorem 1. Let {a_n} and {b_n} be bounded sequences of real numbers. Let {c_n} be the convolution of {a_n} and {b_n}, defined by

c_n = ∑_{k=0}^n a_k b_{n−k}  (= ∑_{j=0}^n b_j a_{n−j})

Then C(z) = ∑_{n=0}^∞ c_n z^n is convergent at least for |z| < 1, and C(z) = A(z)B(z).

PROOF. Suppose first that a_n = P{R1 = n}, b_n = P{R2 = n}, where R1 and R2 are independent nonnegative integer-valued random variables. Then c_n = P{R1 + R2 = n}, since {R1 + R2 = n} is the disjoint union of the events {R1 = k, R2 = n − k}, k = 0, 1, ..., n. Thus

C(z) = E(z^{R1+R2}) = E(z^{R1} z^{R2}) = E(z^{R1})E(z^{R2}) = A(z)B(z)

In general, the same conclusion follows by multiplying the two series term by term and collecting the coefficient of z^n, which is exactly c_n.
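The convolution rule of Theorem 1 is easy to check numerically on truncated sequences; the tiny discrepancy below is just the truncation error. Names are illustrative.

```python
def convolve(a, b):
    """c_n = sum_{k=0}^{n} a_k b_{n-k}, the convolution of Theorem 1 (truncated)."""
    return [sum(a[k] * b[n - k] for k in range(n + 1)) for n in range(len(a))]

def gf(coeffs, z):
    """Evaluate the (truncated) generating function at z."""
    return sum(c * z ** n for n, c in enumerate(coeffs))

N = 60
a = [0.5 * 0.5 ** n for n in range(N)]    # a geometric-type probability sequence
b = [0.25 * 0.75 ** n for n in range(N)]
c = convolve(a, b)
z = 0.4
print(abs(gf(c, z) - gf(a, z) * gf(b, z)))  # essentially 0
```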
We have seen that under appropriate conditions the moments of a random variable can be obtained from its characteristic function. Similar results hold for generating functions. Let A(z) be the generating function of the random variable R; restrict z to be real and between 0 and 1. We show that

E(R) = A'(1),  where A'(1) = lim_{z→1} A'(z)   (6.4.1)

If E(R) is finite, then the variance of R is given by

Var R = A''(1) + A'(1) − [A'(1)]²   (6.4.2)

To establish (6.4.1) and (6.4.2), notice that

A(z) = ∑_{n=0}^∞ a_n z^n,  a_n = P{R = n}
Thus
A'(z) = ∑_{n=1}^∞ n a_n z^{n−1}

Let z → 1 to obtain

A'(1) = ∑_{n=1}^∞ n a_n = E(R)

proving (6.4.1). Similarly,

A''(z) = ∑_{n=2}^∞ n(n − 1) a_n z^{n−2},  so A''(1) = E(R²) − E(R)

Therefore

Var R = E(R²) − [E(R)]² = A''(1) + A'(1) − [A'(1)]²

which is (6.4.2).

Now consider the simple random walk starting at 0, with no barriers. Let u_n = P{S_n = 0}, and let h_n = the probability that the first return to 0 will occur at time n = P{S1 ≠ 0, ..., S_{n−1} ≠ 0, S_n = 0}. Let

U(z) = ∑_{n=0}^∞ u_n z^n,  H(z) = ∑_{n=0}^∞ h_n z^n

(For the remainder of this section, z is restricted to real values.)
If we are at the origin at time n, the first return to 0 must occur at some time k, k = 1, 2, ..., n. If the first return to 0 occurs at time k, we must be at the origin after n − k additional steps. Since the events {first return to 0 at time k}, k = 1, 2, ..., n, are disjoint, we have

u_n = ∑_{k=1}^n h_k u_{n−k},  n = 1, 2, ...

Let us write this as

u_n = ∑_{k=0}^n h_k u_{n−k},  n = 1, 2, ...

This will be valid provided that we define h_0 = 0. Now u_0 = 1, since the walk starts at the origin, but h_0 u_0 = 0. Thus we may write

v_n = ∑_{k=0}^n h_k u_{n−k},  n = 0, 1, ...   (6.4.3)

where v_n = u_n for n ≥ 1, and v_0 = 0 = u_0 − 1. Since {v_n} is the convolution of {h_n} and {u_n}, Theorem 1 yields

V(z) = H(z)U(z)

But

V(z) = ∑_{n=0}^∞ v_n z^n = ∑_{n=1}^∞ u_n z^n = U(z) − 1
Thus U(z) − 1 = H(z)U(z); that is,

U(z)(1 − H(z)) = 1   (6.4.4)

We may use (6.4.4) to find the h_n explicitly. For

U(z) = ∑_{n=0}^∞ u_{2n} z^{2n} = ∑_{n=0}^∞ C(2n, n)(pq)^n z^{2n}
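The renewal relation (6.4.3) behind (6.4.4) says that {u_n} is the convolution of {h_n} and {u_n} (apart from the n = 0 term), and it can be checked directly from the explicit formulas for u_n and h_n. This sketch uses my own helper names.

```python
from math import comb

def u(n, p):
    """u_n = P{S_n = 0}: zero for odd n, C(n, n/2)(pq)^{n/2} for even n."""
    if n % 2:
        return 0.0
    m = n // 2
    return comb(n, m) * (p * (1 - p)) ** m

def h(n, p):
    """First-return probabilities from (6.3.5); zero for odd n and for n = 0."""
    if n == 0 or n % 2:
        return 0.0
    m = n // 2
    return (2 / m) * comb(2 * m - 2, m - 1) * (p * (1 - p)) ** m

p = 0.6
for n in range(1, 12):
    lhs = u(n, p)
    rhs = sum(h(k, p) * u(n - k, p) for k in range(n + 1))
    assert abs(lhs - rhs) < 1e-12
print("renewal identity u_n = sum_k h_k u_{n-k} verified for n = 1..11")
```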
This can be put into a closed form, as follows. We claim that

C(2n, n) = (−4)^n C(−1/2, n)   (6.4.5)

where, for any real number α, C(α, n) denotes α(α − 1) ··· (α − n + 1)/n!. To see this, write

C(−1/2, n) = (−1/2)(−3/2) ··· [−(2n − 1)/2] / n! = (−1)^n [1 · 3 · 5 ··· (2n − 1)] / (n! 2^n)

Multiplying and dividing by 2 · 4 · 6 ··· (2n) = 2^n n! gives

C(−1/2, n) = (−1)^n (2n)! / (4^n n! n!) = C(2n, n)/(−4)^n

proving (6.4.5). Thus

U(z) = ∑_{n=0}^∞ C(−1/2, n)(−4pqz²)^n = (1 − 4pqz²)^{−1/2}

by the binomial theorem. By (6.4.4) we have

H(z) = 1 − 1/U(z) = 1 − (1 − 4pqz²)^{1/2}   (6.4.6)

This may be expanded by the binomial theorem to obtain the h_n (see Problem 1); of course the results will agree with (6.3.5), obtained by the combinatorial approach of the preceding section. Notice that we must have the positive square root in (6.4.6), since H(0) = h_0 = 0. Some useful information may be gathered without expanding H(z). Observe that H(1) = ∑_{n=0}^∞ h_n is the probability of eventual return to 0.
By (6.4.6),
H(1) = 1 − (1 − 4pq)^{1/2} = 1 − |p − q|  by (6.2.2)

This agrees with the result (6.2.7) obtained previously. Now assume p = q = 1/2, so that there is a return to 0 with probability 1. We show that the average length of time required to return to the origin is infinite. For if T is the time of first return to 0, then

E(T) = ∑_{n=1}^∞ n P{T = n} = ∑_{n=1}^∞ n h_n = H'(1)

as in (6.4.1). By (6.4.6), with 4pq = 1,

H'(z) = −(d/dz)(1 − z²)^{1/2} = z(1 − z²)^{−1/2} → ∞  as z → 1

Thus E(T) = ∞, as asserted (see Problem 5, Section 6.3, for another approach).
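Both conclusions, H(1) = 1 − |p − q| and the divergence of ∑ n h_n when p = q, can be seen numerically. To avoid huge binomial coefficients, the sketch below generates h_{2n} by the ratio recursion h_{2(n+1)}/h_{2n} = (2(2n − 1)/(n + 1)) pq, which follows from (6.3.5); the function name is mine.

```python
def first_return_probs(p, n_max):
    """h_{2n} for n = 1..n_max, via a numerically stable ratio recursion."""
    x = p * (1 - p)
    probs = []
    h = 2 * x                      # h_2 = 2pq
    for n in range(1, n_max + 1):
        probs.append(h)
        h *= 2 * (2 * n - 1) / (n + 1) * x
    return probs

# Partial sums of H(1) approach 1 - |p - q|, in agreement with (6.2.7).
for p in (0.5, 0.6, 0.8):
    print(p, round(sum(first_return_probs(p, 20_000)), 3), 1 - abs(2 * p - 1))

# For p = q = 1/2 the mean return time diverges: partial sums of 2n*h_{2n} keep growing.
hs = first_return_probs(0.5, 100_000)
for N in (100, 10_000, 100_000):
    print(N, round(sum(2 * n * hs[n - 1] for n in range(1, N + 1)), 2))
```

The partial sums in the second loop grow roughly like a constant times √N, so they never settle down.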
PROBLEMS
1. Expand (6.4.6) by the binomial theorem to obtain the h_n.
2. Solve the difference equation a_{n+1} − 3a_n = 4 by taking the generating function of both sides to obtain

A(z) = 4z/((1 − z)(1 − 3z)) + a_0/(1 − 3z)

Expand in partial fractions and use a geometric series expansion to find a_n.
3. Let A(z) be the generating function of the sequence {a_n}; assume that ∑_{n=0}^∞ |a_n − a_{n−1}| < ∞. Show that if lim_{n→∞} a_n exists, the limit is lim_{z→1} (1 − z)A(z).
4. If R is a random variable with generating function A(z), find the generating functions of R + k and kR, where k is a nonnegative integer. If F(n) = P{R ≤ n}, find the generating function of {F(n)}.
5. Let R1, R2, ... be independent random variables, with P{Ri = 1} = p, P{Ri = 0} = q = 1 − p, i = 1, 2, .... Thus we have an infinite sequence of Bernoulli trials; Ri = 1 corresponds to a success on trial i, and Ri = 0 is a failure. (Assume 0 < p < 1.) Let R be the number of trials required to obtain the first success.
(a) Show that P{R = k} = q^{k−1}p, k = 1, 2, ....
(b) Use generalized characteristic functions to show that E(R) = 1/p, Var R = (1 − p)/p²; check the result by calculating the generating function of R and using (6.4.1) and (6.4.2). R is said to have the geometric distribution.
6. With the Ri as in Problem 5, let N_r be the number of trials required to obtain the rth success (r = 1, 2, ...).
(a) Show that

P{N_r = k} = C(k − 1, r − 1) p^r q^{k−r} = C(−r, k − r) p^r (−q)^{k−r},  k = r, r + 1, ...

where C(−r, j) is defined as (−r)(−r − 1) ··· (−r − j + 1)/j!, j = 1, 2, ..., and C(−r, 0) = 1.
(b) Let T1 = the number of trials required to obtain the first success, T2 = the number of trials following the first success up to and including the second success, ..., T_r = the number of trials following the (r − 1)st success up to and including the rth success (thus N_r = T1 + ··· + T_r). Show that the Ti are independent, each with the geometric distribution.
(c) Show that E(N_r) = r/p, Var N_r = r(1 − p)/p². Find the characteristic function and the generating function of N_r. N_r is said to have the negative binomial distribution.
7. With the Ri as in Problem 5, let R be the length of the run (of either successes or failures) started by the first trial. Find P{R = k}, k = 1, 2, ..., and E(R).
8. In Problem 6, find the joint probability function of N1 and N2; also find (in a relatively effortless manner) the correlation coefficient between N1 and N2.

6.5 THE POISSON RANDOM PROCESS
We now consider a mathematical model that fits a wide variety of physical phenomena. Let T1, T2, ... be a sequence of independent random variables, where each Ti is absolutely continuous with density f(x) = λe^{−λx}, x ≥ 0; f(x) = 0, x < 0 (λ is a fixed positive constant). Let A_n = T1 + ··· + T_n, n = 1, 2, .... We may think of A_n as the arrival time of the nth customer at a serving counter, so that T_n is the waiting time between the arrival of the (n − 1)st customer and the arrival of the nth customer. Equally well, A_n may be regarded as the time at which the nth call is made at a telephone exchange, the time at which the nth component fails on an assembly line, or the time at which the nth electron arrives at the anode of a vacuum tube.

If t ≥ 0, let R_t be the number of customers that have arrived up to and including time t; that is, R_t = n if A_n ≤ t < A_{n+1} (n = 0, 1, ...; define A_0 = 0). A sketch of (R_t, t ≥ 0) is given in Figure 6.5.1. Thus we have a family of random variables R_t, t ≥ 0 (not just a sequence, but instead a random variable defined for each nonnegative real number). A family of random variables R_t, where t ranges over an arbitrary set I, is called a random process or stochastic process. Note that if I is the set of positive integers, the random process becomes a sequence of random variables; if I is a finite set, we obtain a finite collection of random variables; and if I consists of only one element, we obtain a single random variable. Thus the concept of a random process includes all situations studied previously.
FIGURE 6.5.1 Poisson Process (the graph of R_t versus t, jumping by 1 at the arrival times A1, A2, A3, A4; T_n = A_n − A_{n−1}).
If the outcome of the experiment is ω, we may regard (R_t(ω), t ∈ I) as a real-valued function defined on I. In Figure 6.5.1 what is actually sketched is R_t(ω) versus t, t ∈ I = the nonnegative reals, for a particular ω. Thus, roughly speaking, we have a "random function," that is, a function that depends on the outcome of a random experiment.

The particular process introduced above is called the Poisson process since, for each t > 0, R_t has the Poisson distribution with parameter λt. Let us verify this. If k is a nonnegative integer,

P{R_t ≤ k} = P{at most k customers have arrived by time t} = P{(k + 1)st customer arrives after time t} = P{T1 + ··· + T_{k+1} > t}

But A_{k+1} = T1 + ··· + T_{k+1} is the sum of k + 1 independent random variables, each with generalized characteristic function ∫_0^∞ λe^{−λx} e^{−sx} dx = λ/(s + λ), Re s > −λ; hence A_{k+1} has the generalized characteristic function [λ/(s + λ)]^{k+1}, Re s > −λ. Thus (see Table 5.1.1) the density of A_{k+1} is

f_{A_{k+1}}(x) = (1/k!) λ^{k+1} x^k e^{−λx} u(x)   (6.5.1)

where u is the unit step function. Thus

P{R_t ≤ k} = P{T1 + ··· + T_{k+1} > t} = ∫_t^∞ (1/k!) λ^{k+1} x^k e^{−λx} dx
Integrating by parts successively, we obtain

P{R_t ≤ k} = ∑_{j=0}^k e^{−λt} (λt)^j / j!
Hence R_t has the Poisson distribution with parameter λt. Now the mean of a Poisson random variable is its parameter, so that E(R_t) = λt. Thus λ may be interpreted as the average number of customers arriving per second. We should expect that λ^{−1} is the average number of seconds per customer, that is, the average waiting time between customers. This may be verified by computing

E(Ti) = ∫_0^∞ λx e^{−λx} dx = 1/λ
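The construction just described, exponential interarrival gaps summed into arrival times, can be simulated directly, and the counts R_t do come out Poisson(λt). This is a sketch with names of my own choosing.

```python
import random
from math import exp, factorial

def poisson_counts(lam, t, trials=20_000, seed=2):
    """Simulate A_n as cumulative sums of Exp(lam) gaps; count arrivals in [0, t]."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        arrival, n = 0.0, 0
        while True:
            arrival += rng.expovariate(lam)   # one interarrival time T_i
            if arrival > t:
                break
            n += 1
        counts.append(n)
    return counts

lam, t = 2.0, 3.0
counts = poisson_counts(lam, t)
print(round(sum(counts) / len(counts), 2), lam * t)   # E(R_t) = lambda*t = 6
for k in (4, 6, 8):
    empirical = sum(c == k for c in counts) / len(counts)
    theoretical = exp(-lam * t) * (lam * t) ** k / factorial(k)
    print(k, round(empirical, 3), round(theoretical, 3))
```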
We now establish an important feature of the Poisson process. Intuitively, if we arrive at the serving counter at time t and the last customer to arrive came at time t − h, the distribution function of the length of time we must wait for the arrival of the next customer does not depend on h, and in fact coincides with the distribution function of T1. Thus we are essentially starting from scratch at time t; the process does not remember that h seconds have elapsed between the arrival of the last customer and the present time. If W_t is the waiting time from t to the arrival of the next customer, we wish to show that P{W_t ≤ z} = P{T1 ≤ z}, z ≥ 0. We have

P{W_t ≤ z} = P{for some n = 1, 2, ..., the nth customer arrives in (t, t + z] and the (n + 1)st customer arrives after time t + z}   (6.5.2)
(see Figure 6.5.2). To justify (6.5.2), notice that if t < A_n ≤ t + z < A_{n+1} for some n, then W_t ≤ z. Conversely, if W_t ≤ z, then some customer arrives in (t, t + z], and hence there will be a last customer to arrive in the interval. (If not, then ∑_{n=1}^∞ T_n < ∞; but this event has probability 0; see Problem 1.) Now P{t < A_n ≤ t + z < A_{n+1}} = P{t < A_n ≤ t + z, A_n + T_{n+1} > t + z}.

FIGURE 6.5.2 (A_n falls in (t, t + z]; A_{n+1} exceeds t + z)
Since A_n (= T1 + ··· + T_n) and T_{n+1} are independent, we obtain, by (6.5.1),

P{t < A_n ≤ t + z < A_{n+1}} = ∫∫_{t < x ≤ t+z, x+y > t+z} (λ^n x^{n−1}/(n − 1)!) e^{−λx} λe^{−λy} dx dy = e^{−λ(t+z)} [(λ(t + z))^n − (λt)^n] / n!

Since ∑_{n=0}^∞ r^n/n! = e^r, (6.5.2) yields

P{W_t ≤ z} = e^{−λ(t+z)} [(e^{λ(t+z)} − 1) − (e^{λt} − 1)] = 1 − e^{−λz}

Thus W_t
has the same distribution function as T1.

Alternatively, we may write

P{W_t ≤ z} = P[∪_{n=0}^∞ {A_n ≤ t < A_{n+1} ≤ t + z}]

For if A_n ≤ t < A_{n+1} ≤ t + z for some n, then W_t ≤ z. Conversely, if W_t ≤ z, then some customer arrives in (t, t + z], and there must be a first customer to arrive in this interval, say customer n + 1. Thus A_n ≤ t < A_{n+1} ≤ t + z for some n ≥ 0. An argument very similar to the above shows that

P{A_n ≤ t < A_{n+1} ≤ t + z} = ((λt)^n / n!) (e^{−λt} − e^{−λ(t+z)})

and therefore P{W_t ≤ z} = 1 − e^{−λz}, as before. In this approach we do not have the problem of showing that P{∑_{n=1}^∞ T_n < ∞} = 0.
To justify completely the statement that the process starts from scratch at time t, we may show that if V1, V2, ... are the successive waiting times starting at t (so V1 = W_t), then V1, V2, ... are independent, and V_i and T_i have the same distribution function for all i. To see this, observe that

P{V1 ≤ x1, ..., V_k ≤ x_k} = P[∪_{n=0}^∞ {A_n ≤ t < A_{n+1} ≤ t + x1, T_{n+2} ≤ x2, ..., T_{n+k} ≤ x_k}]
For it is clear that the set on the right is a subset of the set on the left. Conversely, if V1 ≤ x1, ..., V_k ≤ x_k, then a customer arrives in (t, t + x1], and hence there is a first customer in this interval, say customer n + 1. Then A_n ≤ t < A_{n+1} ≤ t + x1, and also V_i = T_{n+i}, i = 2, ..., k, as desired. Therefore

P{V1 ≤ x1, ..., V_k ≤ x_k} = ∑_{n=0}^∞ P{A_n ≤ t < A_{n+1} ≤ t + x1} ∏_{i=2}^k P{T_{n+i} ≤ x_i} = P{V1 ≤ x1} ∏_{i=2}^k P{T_i ≤ x_i}

Fix j and let x_i → ∞ for i ≠ j, to conclude that P{V_j ≤ x_j} = P{T_j ≤ x_j}. Consequently
=
-
-
Z
P{ Tn < + h I Tn > h} P{h < Tn � z + h} P{ Tn > h} +h
l {" z
-
=
A A. e- "' d x
h
A A. e- "' dx
e-;.h
e-;.( z+h) e- ).h
_
=
1
-
e- ;. z
=
< z} P{ Tn -
so that Wt and Tn have the same distribution function (for any n).
6.5 �---�--
A n- 1 = t-h
Tn
THE POISSON RANDOM PROCESS
z+h
20 1
>
---�
An
t
t+ z
FIGURE 6.5.3
The key property of the exponential density used in this argument is that is,
P{Tn > z + h I Tn > h} = P{Tn > z} or (since {Tn > h, Tn > z + h} = {Tn > z + h})
P{Tn > z + h} = P{Tn > z}P{Tn > h} In fact, a positive random variable T having the property that
P{T > z + h} = P{T > z}P{T > h}
for all z, h > 0
must have exponential density. For if G(z) = P{T > z} , then
G(z + h) = G(z) G(h)
Therefore
�
G(z + h - G(z)
= G(z)
( G(h� - 1 ) = G(z) ( G(h) � G(O) )
Let h � 0 to conclude that
G' (z) = G' (O)G(z) This is a differential equation of the form
dy + A = 0 y dx
whose solution is y =
l ce- x ;
( A = - G'(O))
that is,
G(z) = ce- l z But G(O) =
c
= 1 ; hence the distribution function of T is F p (z)
= 1 - G(z) = 1 - e- l z ,
z>0
since G(O) = 1
202
INFINITE SEQUENCES OF RANDOM VARIABLES
and thus the density of T is as desired. (The above argument assumes that T has a continuous density, but actually the result is true without this requirement.) The memoryless feature of the Poisson process may be used to show that the process satisfies the Markov p roperty : , an+ l a�re nonnegative integers If 0 < t1 < t 2 < · · · < tn+l and a1, with al < . . . < an+ l ' then P{Rtn 1 = an+l I R t = al , . . . ' R tn = an } = P{R tn+ = an+ l I Rtn = an } •
+
1
•
•
1
Thus the behavior of the process at the "future" time tn+l ' given the be havior at "past" times t1, . . : , t n-l and the "present" time tn , depends only on the present state , or the number of customers at time t n . For example , P R t4
{
=
=
1 5 j Rt 1 = 0, Rt2 = 3, Rt3 = 8 } = P {Rt4 = 1 5 j Rt3 = 8 } the probability that exactly 7 customers will arrive between t3 and t4 -,1.(t4- t 3 ) [A ( t4 - t3 ) ] 7 e 7! =
This result is reasonable in view of the memoryless feature, but a formal proof becomes quite cumbersome and will be omitted. We consider the Markov property in detail in the next chapter.

We close this section by describing a physical approach to the Poisson process. Suppose that we divide the interval (t, t + τ] into subintervals of length Δt, and assume that the subintervals are so small that the probability of the arrival of more than one customer in a given subinterval I is negligible. If the average number of customers arriving per second is λ, and R_I is the number of customers arriving in the subinterval I, then E(R_I) = λΔt. But E(R_I) = (0)P{R_I = 0} + (1)P{R_I = 1} = P{R_I = 1}, so that with probability λΔt a customer will arrive in I, and with probability 1 − λΔt no customer will arrive. We assume that if I and J are disjoint subintervals, R_I and R_J are independent. Then we have a sequence of n = τ/Δt Bernoulli trials, with probability p = λΔt of success on a given trial, and with Δt very small. Thus we expect that the number N(t, t + τ] of customers arriving in (t, t + τ] should have the Poisson distribution with parameter λτ. Furthermore, if W_t is the waiting time from t to the arrival of the next customer, then P{W_t > x} = P{N(t, t + x] = 0} = e^{−λx}, so that W_t has exponential density.
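The memoryless property of the exponential waiting times discussed in this section is easy to confirm by simulation: among draws of T that exceed h, the residual wait has the same distribution regardless of h. A sketch with illustrative names:

```python
import random
from math import exp

def residual_wait_cdf(z, h, lam=1.5, trials=200_000, seed=3):
    """Among draws T > h with T ~ Exp(lam), estimate P{T <= z + h | T > h}."""
    rng = random.Random(seed)
    hits = conditioning = 0
    for _ in range(trials):
        t = rng.expovariate(lam)
        if t > h:
            conditioning += 1
            if t <= z + h:
                hits += 1
    return hits / conditioning

z, lam = 0.5, 1.5
print(round(1 - exp(-lam * z), 4))   # P{T <= z} = 1 - e^{-lambda z} = 0.5276
for h in (0.0, 1.0, 3.0):
    print(h, round(residual_wait_cdf(z, h, lam), 3))
```

All three estimates should match the unconditional value, up to sampling noise (larger for big h, where few draws survive the conditioning).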
PROBLEMS

1. Show that P{∑_{n=1}^∞ T_n < ∞} = 0.

2. Show that the probability that an even number of customers will arrive in the interval (t, t + τ] is (1/2)(1 + e^{−2λτ}) and the probability that an odd number of customers will arrive in this interval is (1/2)(1 − e^{−2λτ}).
3. (The random telegraph signal) Let T₁, T₂, ... be independent, each with density λe^{−λx}u(x). Define a random process R_t, t ≥ 0, as follows. R₀ = +1 or −1 with equal probability (assume R₀ independent of the T_i), and

R_t = R₀          if 0 ≤ t < T₁
R_t = −R₀         if T₁ ≤ t < T₁ + T₂
⋮
R_t = (−1)ⁿR₀     if T₁ + ··· + T_n ≤ t < T₁ + ··· + T_{n+1}

(see Figure P.6.5.3).

FIGURE P.6.5.3 (a sample path alternating between +1 and −1 at the times T₁, T₁ + T₂, ...)

(a) Find the joint probability function of R_t and R_{t+τ} (t, τ > 0).
(b) Find the covariance function of the process, defined by K(t, τ) = Cov(R_t, R_{t+τ}), t, τ > 0.

*6.6 THE STRONG LAW OF LARGE NUMBERS
In this section we show that the arithmetic average (R₁ + ··· + R_n)/n of a sequence of independent observations of a random variable R converges with probability 1 to E(R) (assumed finite). In other words,

lim_{n→∞} [R₁(ω) + ··· + R_n(ω)]/n = E(R)

for all ω, except possibly on a set of probability 0. We shall see that this is a stronger convergence statement than the weak law of large numbers.

Let (Ω, ℱ, P) be a probability space, fixed throughout the discussion. If A₁, A₂, ... is a sequence of events, we define the upper limit or limit superior of the sequence as

lim sup_n A_n = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k    (6.6.1)

and the lower limit or limit inferior of the sequence as

lim inf_n A_n = ∪_{n=1}^∞ ∩_{k=n}^∞ A_k    (6.6.2)

Theorem 1. lim sup_n A_n = {ω : ω ∈ A_n for infinitely many n}; lim inf_n A_n = {ω : ω ∈ A_n eventually, i.e., for all but finitely many n}.
Proof. By (6.6.1), ω ∈ lim sup_n A_n iff, for every n, ω ∈ ∪_{k=n}^∞ A_k; that is, for every n there is a k ≥ n such that ω ∈ A_k, i.e., ω ∈ A_n for infinitely many n. By (6.6.2), ω ∈ lim inf_n A_n iff, for some n, ω ∈ ∩_{k=n}^∞ A_k; that is, for some n, ω ∈ A_k for all k ≥ n, i.e., ω ∈ A_n eventually.

Theorem 2. Let R, R₁, R₂, ... be random variables on (Ω, ℱ, P). Denote by {R_n → R} the set {ω : lim_{n→∞} R_n(ω) = R(ω)}. Then {R_n → R} = ∩_{m=1}^∞ lim inf_n A_{nm}, where A_{nm} = {ω : |R_n(ω) − R(ω)| < 1/m}.

Proof. R_n(ω) → R(ω) iff for every m = 1, 2, ..., |R_n(ω) − R(ω)| < 1/m eventually, that is (Theorem 1), for every m = 1, 2, ..., ω ∈ lim inf_n A_{nm}.
We say that the sequence R₁, R₂, ... converges almost surely to R (notation: R_n →(a.s.) R) iff P{R_n → R} = 1. The terminology "almost everywhere" is also used.

Theorem 3. R_n →(a.s.) R iff for every ε > 0, P{|R_k − R| ≥ ε for at least one k ≥ n} → 0 as n → ∞.
PROOF. By Theorem 2, P{R_n → R} ≤ P(lim inf_n A_{nm}) for every m; hence R_n →(a.s.) R implies that lim inf_n A_{nm} has probability 1 for every m. Conversely, if P(lim inf_n A_{nm}) = 1 for all m, then, by Theorem 2, {R_n → R} is a countable intersection of sets with probability 1 and hence has probability 1 (see Problem 1). Now in (6.6.2), ∩_{k=n}^∞ A_k, n = 1, 2, ..., is an expanding sequence; hence P(lim inf_n A_n) = lim_{n→∞} P(∩_{k=n}^∞ A_k). Thus

R_n →(a.s.) R
    iff P(lim inf_n A_{nm}) = 1 for all m = 1, 2, ...
    iff P(∩_{k=n}^∞ A_{km}) → 1 as n → ∞ for all m = 1, 2, ...
    iff P{|R_k − R| < 1/m for all k ≥ n} → 1 as n → ∞ for all m = 1, 2, ...
    iff P{|R_k − R| ≥ 1/m for at least one k ≥ n} → 0 as n → ∞ for all m = 1, 2, ...
    iff P{|R_k − R| ≥ ε for at least one k ≥ n} → 0 as n → ∞ for all ε > 0

(see Problem 2).

COROLLARY. R_n →(a.s.) R implies R_n →(P) R.

PROOF. R_n →(P) R iff for every ε > 0, P{|R_n − R| ≥ ε} → 0 as n → ∞ (see Section 5.4). Now {|R_k − R| ≥ ε for at least one k ≥ n} ⊃ {|R_n − R| ≥ ε}, so that P{|R_k − R| ≥ ε for at least one k ≥ n} ≥ P{|R_n − R| ≥ ε}. The result now follows from Theorem 3.
For an example in which R_n →(P) R but R_n does not converge almost surely to R, see Problem 3.

Theorem 4 (Borel–Cantelli Lemma). If A₁, A₂, ... are events in a given probability space, and ∑_{n=1}^∞ P(A_n) < ∞, then P(lim sup_n A_n) = 0; that is, the probability that A_n occurs for infinitely many n is 0.

PROOF. By (6.6.1),

P(lim sup_n A_n) ≤ P(∪_{k=n}^∞ A_k)    for every n
    ≤ ∑_{k=n}^∞ P(A_k)    by (1.3.10)
    = ∑_{k=1}^∞ P(A_k) − ∑_{k=1}^{n−1} P(A_k) → 0    as n → ∞
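The lemma is easy to see numerically. In the sketch below (the choice P(A_n) = 1/n², a summable series, is ours), even over a very long horizon only a handful of the independent events A_n ever occur:

```python
import random

random.seed(0)

# Independent events A_n with P(A_n) = 1/n^2, so sum P(A_n) < infinity.
# By the Borel-Cantelli lemma, with probability 1 only finitely many occur.
N = 100_000
runs = 30
max_seen = 0
for _ in range(runs):
    occurred = sum(random.random() < 1.0 / n**2 for n in range(1, N + 1))
    max_seen = max(max_seen, occurred)

# sum_{n>=1} 1/n^2 = pi^2/6 ~ 1.64, so the expected number of events that
# ever occur is under 2; the observed counts stay small even for large N.
print("largest number of events occurring in any run:", max_seen)
```

A₁ always occurs (P(A₁) = 1), but across all 30 runs the total count stays far from growing with N, as the lemma predicts.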
Theorem 5. If for every ε > 0, ∑_{n=1}^∞ P{|R_n − R| ≥ ε} < ∞, then R_n →(a.s.) R.

PROOF. By Theorem 4, lim sup_n {|R_n − R| ≥ ε} has probability 0 for every ε > 0; that is, the probability that |R_n − R| ≥ ε for infinitely many n is 0. Take complements to show that the probability that |R_n − R| ≥ ε for only finitely many n is 1; that is, with probability 1, |R_n − R| < ε eventually. Since ε is arbitrary, we may set ε = 1/m, m = 1, 2, .... Thus P(lim inf_n A_{nm}) = 1 for m = 1, 2, ..., and the result follows by Theorem 2.

Theorem 6. If ∑_{n=1}^∞ E[|R_n − R|ᵏ] < ∞ for some k > 0, then R_n →(a.s.) R.

PROOF. P{|R_n − R| ≥ ε} ≤ E[|R_n − R|ᵏ]/εᵏ by Chebyshev's inequality, and the result follows by Theorem 5.
Theorem 7 (Strong Law of Large Numbers). Let R₁, R₂, ... be independent random variables on a given probability space. Assume uniformly bounded fourth central moments; that is, for some positive real number M,

E[|R_i − E(R_i)|⁴] ≤ M    for all i

Let S_n = R₁ + ··· + R_n. Then

(S_n − E(S_n))/n →(a.s.) 0

In particular, if E(R_i) = m for all i, then E(S_n) = nm; hence

S_n/n →(a.s.) m

PROOF. Since S_n − E(S_n) = ∑_{i=1}^n R_i′, where R_i′ = R_i − E(R_i), E(R_i′) = 0, we may assume without loss of generality that all E(R_i) are 0. We show that ∑_{n=1}^∞ E[(S_n/n)⁴] < ∞, and the result will follow by Theorem 6. Now

S_n⁴ = (R₁ + ··· + R_n)⁴
    = ∑_{j=1}^n R_j⁴ + (4!/(2!2!)) ∑_{j<k} R_j²R_k² + (4!/2!) ∑ R_j²R_kR_l + 4! ∑_{j<k<l<m} R_jR_kR_lR_m + (4!/3!) ∑_{j≠k} R_j³R_k

But

E(R_j²R_kR_l) = E(R_j²)E(R_k)E(R_l) = 0    by independence

Similarly, E(R_jR_kR_lR_m) = 0 and E(R_j³R_k) = E(R_j³)E(R_k) = 0. Thus

E(S_n⁴) = ∑_{j=1}^n E(R_j⁴) + 6 ∑_{j<k} E(R_j²)E(R_k²)

By the Schwarz inequality,

E(R_j²) = E(R_j² · 1) ≤ [E(R_j⁴)E(1²)]^{1/2} ≤ M^{1/2}

Hence

E(S_n⁴) ≤ nM + 6 · [n(n − 1)/2] M = (3n² − 2n)M ≤ 3n²M

Consequently

∑_{n=1}^∞ E[(S_n/n)⁴] ≤ ∑_{n=1}^∞ 3M/n² < ∞
The theorem is proved.

REMARKS. If all R_i have the same distribution function, it turns out that the hypothesis on the fourth central moments can be replaced by the assumption that the E(R_i) are finite [of course, in this case E(R_i) is the same for all i]. In general, the hypothesis on the fourth central moments can be replaced by the assumption that for some M and δ > 0, E[|R_i − E(R_i)|^{1+δ}] ≤ M for all i.

Now consider an infinite sequence of Bernoulli trials, with R_i = 1 if there is a success on trial i, R_i = 0 if there is a failure on trial i. Then

S_n/n = (R₁ + ··· + R_n)/n

is the relative frequency of successes in n trials, and E(R_i) is the probability p of success on a given trial. The strong law of large numbers says that if we regard the observation of all the R_i as one performance of the experiment, the relative frequency of successes will almost certainly converge to p. The weak law of large numbers says only that if we consider a sufficiently large but fixed n (essentially regarding observation of R₁, ..., R_n as one performance of the experiment), the probability that the relative frequency will be within a specified distance ε of p is ≥ 1 − δ, where δ > 0 is preassigned. The requirements on n will depend on ε and δ, as well as p. Recall that, by Chebyshev's inequality,

P{|S_n/n − p| ≥ ε} ≤ E[(S_n − E(S_n))²]/(n²ε²) = Var S_n/(n²ε²) = ∑_{i=1}^n Var R_i/(n²ε²) = p(1 − p)/(nε²)

Thus

P{|S_n/n − p| < ε} ≥ 1 − p(1 − p)/(nε²) ≥ 1 − δ    for large enough n
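The convergence of the relative frequency, and the Chebyshev bound just quoted, can be illustrated with a short simulation (the value p = 0.3 and the sample sizes are illustrative choices):

```python
import random

random.seed(42)

# One long realization of Bernoulli trials; S_n/n should approach p.
p, n = 0.3, 200_000
successes = 0
checkpoints = {}
for i in range(1, n + 1):
    successes += random.random() < p
    if i in (100, 10_000, 200_000):
        checkpoints[i] = successes / i

for i, freq in sorted(checkpoints.items()):
    print(f"n = {i:>7}: S_n/n = {freq:.4f}")

# Chebyshev bound from the text: P{|S_n/n - p| >= eps} <= p(1-p)/(n eps^2)
eps = 0.01
print("Chebyshev bound at n = 200000, eps = 0.01:",
      p * (1 - p) / (n * eps**2))
```

The relative frequency settles near p = 0.3, and the Chebyshev bound (here about 0.0105) quantifies the weak-law statement for this fixed n.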
PROBLEMS

1. Show that a countable intersection of sets with probability 1 still has probability 1. Does this hold for an uncountable intersection?

2. If P{|R_k − R| ≥ ε for at least one k ≥ n} → 0 as n → ∞ for all ε of the form 1/m, m = 1, 2, ..., show that this holds for all ε > 0.

3. Let R₀ be uniformly distributed on the interval (0, 1]. Define the following sequence of random variables:

R₁ = g₁(R₀) = 1 if 0 < R₀ ≤ 1, 0 otherwise
R₂₁ = g₂₁(R₀) = 1 if 0 < R₀ ≤ 1/2, 0 otherwise
R₂₂ = 1 if 1/2 < R₀ ≤ 1, 0 otherwise
R₃₁ = 1 if 0 < R₀ ≤ 1/3, 0 otherwise
R₃₂ = 1 if 1/3 < R₀ ≤ 2/3, 0 otherwise
R₃₃ = 1 if 2/3 < R₀ ≤ 1, 0 otherwise

In general, let

R_{nm} = 1 if (m − 1)/n < R₀ ≤ m/n, 0 otherwise,    n = 1, 2, ..., m = 1, 2, ..., n

(see Figure P.6.6.3 for n = 4). The fact that we are using two subscripts is unimportant; we may arrange the R_{nm} as a single sequence. Show that the sequence converges in probability to 0, but does not converge almost surely to 0. In fact, P{R_{nm} → 0} = 0.
FIGURE P.6.6.3 (the indicators R₄₁, ..., R₄₄ on (0, 1])
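Problem 3's phenomenon can be sketched numerically (the sampled value of R₀ and the number of rows are illustrative): each indicator in row n equals 1 with probability only 1/n, yet every realization of R₀ turns on exactly one indicator in every row, so the single arranged sequence returns to 1 infinitely often.

```python
import random

random.seed(7)

# R_nm = 1 iff (m-1)/n < R0 <= m/n.  P{R_nm = 1} = 1/n -> 0 (convergence
# in probability), but for EVERY R0 each row n contains exactly one 1,
# so the sequence cannot converge almost surely to 0.
r0 = random.uniform(0.0, 1.0)

ones_per_row = []
for n in range(1, 1001):
    row = [1 if (m - 1) / n < r0 <= m / n else 0 for m in range(1, n + 1)]
    ones_per_row.append(sum(row))

print("every row contains a 1:", all(k == 1 for k in ones_per_row))
print("P{R_nm = 1} in row n = 1000:", 1 / 1000)
```

So P{R_{nm} = 1} → 0 while the realized sequence equals 1 in every row — exactly the gap between convergence in probability and almost sure convergence.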
4. Let A₁, A₂, ... be an arbitrary sequence of events in a given probability space.
(a) Show that lim inf_n A_n ⊂ lim sup_n A_n.
(b) If the A_n form an expanding sequence whose union is A, or a contracting sequence whose intersection is A, show that lim inf_n A_n = lim sup_n A_n = A. In general, if lim sup_n A_n = lim inf_n A_n = A, we say that A is the limit of the sequence {A_n} (notation: A = lim_n A_n).
(c) Give an example of a sequence that is not eventually expanding or contracting (i.e., that does not become an expanding or contracting sequence if an appropriate finite number of terms is omitted), but that has a limit.
(d) If A = lim_n A_n, show that P(A) = lim_{n→∞} P(A_n). HINT: ∩_{k=n}^∞ A_k expands to lim inf_n A_n, ∪_{k=n}^∞ A_k contracts to lim sup_n A_n, and ∩_{k=n}^∞ A_k ⊂ A_n ⊂ ∪_{k=n}^∞ A_k.
(e) Show that (lim inf_n A_n)ᶜ = lim sup_n A_nᶜ, and (lim sup_n A_n)ᶜ = lim inf_n A_nᶜ.

5. Find lim sup_n A_n and lim inf_n A_n if Ω = E¹ and A_n = [0, 1 − 1/n] if n is even, A_n = [−1, 1/n] if n is odd.

6. Let Ω = E² and take A_n = the interior of the circle with center at ((−1)ⁿ/n, 0) and radius 1. Find lim sup_n A_n and lim inf_n A_n.

7. Let x₁, x₂, ... be a sequence of real numbers, and let A_n = (−∞, x_n). What is the connection between lim sup_n x_n and lim sup_n A_n (similarly for lim inf)?

8. (Second Borel–Cantelli lemma) If A₁, A₂, ... are independent events and ∑_{n=1}^∞ P(A_n) = ∞, show that P(lim sup_n A_n) = 1. HINT: Show that P(lim sup_n A_n) = lim_{n→∞} lim_{m→∞} P(∪_{k=n}^m A_k), and consider (∪_{k=n}^m A_k)ᶜ; use the fact that e⁻ˣ ≥ 1 − x.

9. Let R₁, R₂, ... be a sequence of independent random variables, all defined on the same probability space, and let c be any real number. Show that R_n →(a.s.) c if and only if for every ε > 0, ∑_{n=1}^∞ P{|R_n − c| ≥ ε} < ∞.

10. Let R₁, R₂, ... be a sequence of independent random variables, and let c be any real number. Show that either R_n →(a.s.) c or R_n "diverges almost surely" from c; that is, P{x : lim_{n→∞} R_n(x) = c} = 0. Thus, for example, it is impossible that P{R_n → c} = 1/3.

11. Let R₁, R₂, ... be independent random variables, with R_n = 1 with probability p_n, R_n = 0 with probability 1 − p_n.
(a) What conditions on the p_n are equivalent to the statement that R_n →(P) 0?
(b) What conditions on the p_n are equivalent to the statement that R_n →(a.s.) 0?

12. Let R₁, R₂, ... be independent random variables, with E(R_n) = 0 and Var R_n = σ_n² ≤ M/n, where M is some fixed positive constant. Show that (R₁ + ··· + R_n)/n →(a.s.) 0.

13. Give an example of a particular sequence of random variables R₁, R₂, ... and a random variable R such that 0 < P{R_n → R} < 1.
Markov Chains

7.1 INTRODUCTION

Suppose that a machine with two components is inspected every hour. A given component operating at t = n hours has probability p of failing before the next inspection at t = n + 1; a component that is not operating at t = n has probability r of being repaired at t = n + 1, regardless of how long it has been out of service. (Each repair crew works a 1-hour day and refuses to inform the next crew of any insights it may have gained.) The components are assumed to fail and be repaired independently of each other. The situation may be summarized as follows. If R_n is the number of components in operation at t = n, then

P{R_{n+1} = 0 | R_n = 0} = (1 − r)²
P{R_{n+1} = 1 | R_n = 0} = 2r(1 − r)
P{R_{n+1} = 2 | R_n = 0} = r²
P{R_{n+1} = 0 | R_n = 1} = p(1 − r)
P{R_{n+1} = 1 | R_n = 1} = pr + (1 − p)(1 − r)
P{R_{n+1} = 2 | R_n = 1} = (1 − p)r
P{R_{n+1} = 0 | R_n = 2} = p²
P{R_{n+1} = 1 | R_n = 2} = 2p(1 − p)
P{R_{n+1} = 2 | R_n = 2} = (1 − p)²        (7.1.1)
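The table (7.1.1) can be assembled programmatically. The sketch below (the function name and the values p = 1/10, r = 1/2 are our illustrative choices) also confirms that each row sums to 1, i.e., that the matrix is stochastic:

```python
from fractions import Fraction

def two_component_matrix(p, r):
    """Transition matrix (7.1.1) for the two-component machine;
    state i = number of components in operation."""
    q, s = 1 - p, 1 - r
    return [
        [s * s, 2 * r * s,     r * r],   # from state 0
        [p * s, p * r + q * s, q * r],   # from state 1
        [p * p, 2 * p * q,     q * q],   # from state 2
    ]

# Illustrative numbers: p = failure, r = repair probability per hour.
P = two_component_matrix(Fraction(1, 10), Fraction(1, 2))
for i, row in enumerate(P):
    print(f"state {i}: {[str(x) for x in row]}, row sum = {sum(row)}")
```

The row sums are identically 1 because each row is a product of the two components' complete failure/repair alternatives: row 0 is (s + r)², row 1 is (p + q)(s + r), row 2 is (p + q)².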
FIGURE 7.1.1 A Markov chain.
For example, if component A is operating and component B is out of service at t = n, then in order to have one component in service at t = n + 1, either A fails and B is repaired between t = n and t = n + 1, or A does not fail and B is not repaired between t = n and t = n + 1. In order to have two components in service at t = n + 1, A must not fail and B must be repaired. The other entries of (7.1.1) are derived similarly.

Thus, at time t = n, there are three possible states, 0, 1, and 2; to be in state i means that i components are in operation, that is, R_n = i. There are various transition probabilities p_ij indicating the probability of moving to state j at t = n + 1, given that we are in state i at t = n (see Figure 7.1.1). The p_ij may be arranged in the form of a matrix:

                       0           1                      2
               0   (1 − r)²    2r(1 − r)              r²
Π = [p_ij] =   1   p(1 − r)    pr + (1 − p)(1 − r)    (1 − p)r
               2   p²          2p(1 − p)              (1 − p)²
Notice that Π is stochastic; that is, the elements are nonnegative and the row sums ∑_j p_ij are 1 for all i. If R_n is the state of the process at time n, then, according to the way the problem is stated, if we know that R₀ = i₀, R₁ = i₁, ..., R_n = i_n = (say) 2, we are in state 2 at t = n. Regardless of how we got there, once we know that R_n = 2, the probability that R_{n+1} = j is C(2, j)(1 − p)ʲp²⁻ʲ, j = 0, 1, 2. In
other words,

P{R_{n+1} = i_{n+1} | R₀ = i₀, ..., R_n = i_n} = P{R_{n+1} = i_{n+1} | R_n = i_n}    (7.1.2)

This is the Markov property. What we intend to show is that, given a description of a process in terms of states and transition probabilities (formally, given a stochastic matrix), we can construct in a natural way an infinite sequence of random variables satisfying (7.1.2).

Assume that we are given a stochastic matrix Π = [p_ij], where i and j range over the state space S, which we take to be a finite or infinite subset of the integers. Let p_i, i ∈ S, be a set of nonnegative numbers adding to 1 (the initial distribution). We specify the joint probability function of R₀, R₁, ..., R_n as follows:

P{R₀ = i₀, R₁ = i₁, ..., R_n = i_n} = p_{i₀} p_{i₀i₁} p_{i₁i₂} ··· p_{i_{n−1}i_n},  n = 1, 2, ...;  P{R₀ = i₀} = p_{i₀}    (7.1.3)

If we sum the right side of (7.1.3) over i_{k+1}, ..., i_n, we obtain p_{i₀} p_{i₀i₁} ··· p_{i_{k−1}i_k}, since Π is stochastic. But this coincides with the original specification of P{R₀ = i₀, R₁ = i₁, ..., R_k = i_k}. Thus the joint probability functions (7.1.3) are consistent, and therefore, by the discussion in Section 6.1, we can construct a sequence of random variables R₀, R₁, ... such that for each n the joint probability function of (R₀, ..., R_n) is given by (7.1.3).

Let us verify that the Markov property (7.1.2) is in fact satisfied. We have

P{R_{n+1} = i_{n+1} | R₀ = i₀, ..., R_n = i_n}
    = P{R₀ = i₀, ..., R_{n+1} = i_{n+1}} / P{R₀ = i₀, ..., R_n = i_n}    (assuming the denominator is not zero)
    = p_{i₀} p_{i₀i₁} ··· p_{i_{n−1}i_n} p_{i_n i_{n+1}} / [p_{i₀} p_{i₀i₁} ··· p_{i_{n−1}i_n}]    by (7.1.3)
    = p_{i_n i_{n+1}}

But

P{R_{n+1} = i_{n+1} | R_n = i_n} = P{R_n = i_n, R_{n+1} = i_{n+1}} / P{R_n = i_n}
    = [∑_{i₀,...,i_{n−1}} p_{i₀} p_{i₀i₁} ··· p_{i_{n−1}i_n} p_{i_n i_{n+1}}] / [∑_{i₀,...,i_{n−1}} p_{i₀} p_{i₀i₁} ··· p_{i_{n−1}i_n}]
    = p_{i_n i_{n+1}}

establishing the Markov property.
The sequence {R_n} is called the Markov chain corresponding to the matrix Π and the initial distribution {p_i}. Π is called the transition matrix, and the p_ij the transition probabilities, of the chain.

REMARK. The basic construction given here can be carried out if we have "nonstationary transition probabilities"; that is, if, instead of the stationary transition probabilities p_ij, we have, for each n = 0, 1, ..., a stochastic matrix [ₙp_ij], where ₙp_ij is interpreted as the probability of moving to state j at time n + 1 when the state at time n is i. We define P{R₀ = i₀, ..., R_n = i_n} = p_{i₀} ₀p_{i₀i₁} ₁p_{i₁i₂} ··· ₙ₋₁p_{i_{n−1}i_n}. This yields a Markov process, a sequence of random variables satisfying the Markov property, with P{R_n = i_n | R₀ = i₀, ..., R_{n−1} = i_{n−1}} = ₙ₋₁p_{i_{n−1}i_n}.
We begin the analysis of Markov chains by calculating the probability function of R_n in terms of the initial probabilities p_i and the matrix Π. Let p_j⁽ⁿ⁾ = P{R_n = j}, n = 0, 1, ..., j ∈ S (thus p_j⁽⁰⁾ = p_j). If we are in state i at time n − 1, we move to state j at time n with probability P{R_n = j | R_{n−1} = i} = p_ij; thus, by the theorem of total probability,

p_j⁽ⁿ⁾ = ∑_i P{R_{n−1} = i} P{R_n = j | R_{n−1} = i} = ∑_i p_i⁽ⁿ⁻¹⁾ p_ij    (7.1.4)

If v⁽ⁿ⁾ = (p_i⁽ⁿ⁾, i ∈ S) is the "state distribution" or "state probability vector" at time n, then (7.1.4) may be written in matrix form as v⁽ⁿ⁾ = v⁽ⁿ⁻¹⁾Π. Iterating this, we obtain

v⁽ⁿ⁾ = v⁽⁰⁾Πⁿ    (7.1.5)

But suppose that we specify R₀ = i; that is, p_i⁽⁰⁾ = 1, p_j⁽⁰⁾ = 0, j ≠ i. Then v⁽⁰⁾ has a 1 in the ith coordinate and 0's elsewhere, so that by (7.1.5) v⁽ⁿ⁾ is simply row i of Πⁿ. Thus the element p_ij⁽ⁿ⁾ in row i and column j of Πⁿ is the probability that R_n = j when the initial state is i. In other words,

p_ij⁽ⁿ⁾ = P{R_n = j | R₀ = i}    (7.1.6)

(A slight formal quibble lies behind the phrase "in other words"; see Problem 1.) Because of (7.1.6), Πⁿ is called the n-step transition matrix; it follows immediately that Πⁿ is stochastic. We shall be interested in the behavior of Πⁿ for large n. As an example, suppose that
Π =
    ⎡1/4  3/4⎤
    ⎣1/2  1/2⎦

We compute

Π² =
    ⎡7/16  9/16 ⎤
    ⎣6/16  10/16⎦

Π⁴ =
    ⎡103/256  153/256⎤
    ⎣102/256  154/256⎦
Thus p₁₁⁽⁴⁾ ≈ p₂₁⁽⁴⁾ and p₁₂⁽⁴⁾ ≈ p₂₂⁽⁴⁾, so that the probability of being in state j at time t = 4 is almost independent of the initial state. It appears as if, for large n, a "steady state" condition will be approached; the probability of being in a particular state j at t = n will be almost independent of the initial state at t = 0. Mathematically, we express this condition by saying that

lim_{n→∞} p_ij⁽ⁿ⁾ = v_j,    i, j ∈ S    (7.1.7)

where v_j does not depend on i. In Sections 7.4 and 7.5 we investigate the conditions under which (7.1.7) holds; it is not true for an arbitrary Markov chain. Note that (7.1.7) is equivalent to the statement that Πⁿ → a matrix with identical rows, the rows being (v_j, j ∈ S).

Example 1. Consider the simple random walk with no barriers. Then S = the integers and p_{i,i+1} = p, p_{i,i−1} = q = 1 − p, i = 0, ±1, ±2, .... If there is an absorbing barrier at 0 (gambler's ruin problem when the adversary has infinite capital), S = the nonnegative integers and p_{i,i+1} = p, p_{i,i−1} = q, i = 1, 2, ..., p₀₀ = 1 (hence p₀ⱼ = 0, j ≠ 0). If there are absorbing barriers at 0 and b, then S = {0, 1, ..., b}, p_{i,i+1} = p, p_{i,i−1} = q, i = 1, 2, ..., b − 1, p₀₀ = p_bb = 1.
Example 2. Consider an infinite sequence of Bernoulli trials. Let state 1 (at t = n) correspond to successes (S) at t = n − 1 and at t = n; state 2 to success at t = n − 1 and failure (F) at t = n; state 3 to failure at t = n − 1 and success at t = n; state 4 to failures at t = n − 1 and at t = n (see Figure 7.1.2).

FIGURE 7.1.2

We observe that Π² has identical rows, the rows being (p², pq, qp, q²). Hence Πⁿ has identical rows for n ≥ 2 (see Problem 2), so that Πⁿ approaches a limit.
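Example 2's claim about Π² can be verified directly once the transition matrix is written down (the value p = 1/3 below is an illustrative choice):

```python
from fractions import Fraction

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

p = Fraction(1, 3)          # illustrative success probability
q = 1 - p
# States: 1 = SS, 2 = SF, 3 = FS, 4 = FF (outcomes at times n-1 and n).
# The next state keeps the current outcome and appends a fresh S or F.
Pi = [
    [p, q, 0, 0],   # from SS
    [0, 0, p, q],   # from SF
    [p, q, 0, 0],   # from FS
    [0, 0, p, q],   # from FF
]

Pi2 = matmul(Pi, Pi)
print("all rows of Pi^2 equal (p^2, pq, qp, q^2):",
      all(row == [p * p, p * q, q * p, q * q] for row in Pi2))
```

Two steps are enough because after two fresh trials the state (the last two outcomes) no longer depends on where the chain started, which is exactly why every row of Π² is (p², pq, qp, q²).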
Example 3. (A queueing example) Assume that customers are to be served at discrete times t = 0, 1, ..., and at most one customer can be served at a given time. Say there are R_n customers before the completion of service at time n, and, in the interval [n, n + 1), N_n new customers arrive, where P{N_n = k} = p_k, k = 0, 1, .... The number of customers before completion of service at time n + 1 is

R_{n+1} = (R_n − 1)⁺ + N_n

That is,

R_{n+1} = R_n − 1 + N_n    if R_n ≥ 1
R_{n+1} = N_n              if R_n = 0

If the number of customers at time n + 1 is ≥ M, a new serving counter automatically opens and immediately serves all customers who are waiting and also those who arrive in the interval [n + 1, n + 2); thus R_{n+2} = 0. The queueing process may be represented as a Markov chain with S = the nonnegative integers and transition matrix

            0    1    2    3   ···
      0  ⎡ p₀   p₁   p₂   p₃  ··· ⎤
      1  ⎢ p₀   p₁   p₂   p₃  ··· ⎥
Π =   2  ⎢ 0    p₀   p₁   p₂  ··· ⎥
      3  ⎢ 0    0    p₀   p₁  ··· ⎥
      ⋮  ⎢                        ⎥
    M−1  ⎢ 0    ···  0    p₀  p₁ p₂ ··· ⎥
      M  ⎢ 1    0    0    ···      ⎥
    M+1  ⎣ 1    0    0    ···      ⎦
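The structure of this matrix is easy to generate for a truncated state space. In the sketch below the helper name, the choice M = 4, and the arrival distribution (p₀, p₁, p₂) = (1/2, 1/4, 1/4) are illustrative assumptions — the genuinely infinite matrix is of course not representable directly:

```python
from fractions import Fraction

def queue_matrix(arrival_probs, M, size):
    """Truncated transition matrix for the queueing chain of Example 3:
    states 0..size-1; rows i >= M return to state 0 (new counter opens)."""
    def pk(k):
        return arrival_probs[k] if 0 <= k < len(arrival_probs) else Fraction(0)

    P = []
    for i in range(size):
        if i >= M:
            row = [Fraction(1)] + [Fraction(0)] * (size - 1)
        else:
            remaining = max(i - 1, 0)                 # (i - 1)^+ stay in line
            row = [pk(j - remaining) for j in range(size)]
        P.append(row)
    return P

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # p0, p1, p2
P = queue_matrix(probs, M=4, size=6)
for i, row in enumerate(P):
    print(i, [str(x) for x in row])
```

Rows 0 and 1 coincide (a lone customer finishing service leaves the next state equal to the number of arrivals, just as from an empty queue), and the p's shift one column to the right in each later row, matching the displayed matrix.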
PROBLEMS
1. Consider a Markov chain {R_n} with transition matrix Π = [p_ij] and initial distribution {p_i}; assume p_r > 0. Let {T_n} be a Markov chain with the same transition matrix and initial distribution {q_i}, where q_r = 1, q_j = 0, j ≠ r. Show that P{R_n = j | R₀ = r} = P{T_n = j} = p_rj⁽ⁿ⁾; this justifies (7.1.6).

2. If Π is a transition matrix of a Markov chain, and Πᵏ has identical rows, show that Πⁿ has identical rows for all n ≥ k. Similarly, if Πᵏ has a column all of whose elements are ≥ δ > 0, show that Πⁿ has this property for all n ≥ k.

3. Let {R_n} be a Markov chain with state space S, and let g be a function from S to S. If g is one-to-one, {g(R_n)} is also a Markov chain (this simply amounts to relabeling the states). Give an example to show that if g is not one-to-one, {g(R_n)} need not have the Markov property.

4. If {R_n} is a Markov chain, show that P{R_{n+1} = i₁, ..., R_{n+k} = i_k | R_n = i} = p_{ii₁} p_{i₁i₂} ··· p_{i_{k−1}i_k}.
'
ST O P P I N G T I M E S A ND T H E S T R O N G MARK O V PROPERTY
Let {R_n} be a Markov chain, and let T be the first time at which a particular state i is reached; set T = ∞ if i is never reached. For example, if i = 3 and R₀(ω) = 4, R₁(ω) = 2, R₂(ω) = 2, R₃(ω) = 5, R₄(ω) = 3, R₅(ω) = 1, R₆(ω) = 3, ..., then T(ω) = 4. For our present purposes, the key feature of T is that if we examine R₀, R₁, ..., R_k, we can come to a definite decision as to whether or not T = k. Formally, for each k = 0, 1, 2, ..., I_{T=k} can be expressed as g_k(R₀, R₁, ..., R_k), where g_k is a function from S^{k+1} to {0, 1}. A random variable T, whose possible values are the nonnegative integers together with ∞, that satisfies this condition for each k = 0, 1, ... is said to be a stopping time for the chain {R_n}. Now let T be the first time at which the state i is reached, as above. If we look at the sequence {R_n} after we arrive at i, in other words, the sequence R_T, R_{T+1}, ..., it is reasonable to expect that we have a Markov chain with the same transition probabilities as the original chain. After all, if T = k, we are looking at the sequence R_k, R_{k+1}, .... However, since T is a random variable rather than a constant, there is something to be proved. We first introduce a new concept. If T is a stopping time, an event A is said to be prior to T iff, whenever T = k, we can tell by examination of R₀, ..., R_k whether or not A has occurred. Formally, for each k = 0, 1, ..., I_{A ∩ {T=k}} can be expressed as h_k(R₀, R₁, ..., R_k), where h_k is a function from S^{k+1} to {0, 1}.
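These definitions can be exercised on a small chain. Below (the two-state matrix is an illustrative choice of ours), T is the first visit to state 1; the empirical distribution of R_{T+1} given R_T = 1 matches row 1 of Π, as the strong Markov property established later in this section asserts:

```python
import random

random.seed(3)

# Illustrative two-state transition matrix.
Pi = [[0.9, 0.1],
      [0.4, 0.6]]

def step(i):
    """One transition of the chain from state i."""
    return 0 if random.random() < Pi[i][0] else 1

# T = first time state 1 is reached, starting from state 0; record the
# state one step AFTER T.  By the strong Markov property,
# P{R_{T+1} = 1 | R_T = 1} should equal Pi[1][1] = 0.6.
hits = 0
trials = 20000
for _ in range(trials):
    s = 0
    while s != 1:              # run until the stopping time T
        s = step(s)
    hits += step(s) == 1       # one more transition from R_T = 1

print(f"empirical P(R_T+1 = 1) = {hits / trials:.3f} (theory 0.6)")
```

Even though T is random, the chain restarted at T behaves exactly like the chain started at state 1, which is the content of Theorem 2 below.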
Example 1.
If T is a stopping time for the Markov chain {Rn }, define the random variable Rp as follows . If T(w) = k, take Rp (w) = Rk (w) , k = 0, I , . . . . If T(w) = oo , take Rp (w ) = c , where c is an arbitrary element not be longing to the state space S. If we like we can replace S by S u { c} and define Pee = 1 ' Pci = 0, j E S. Example
218
MARKOV CHAINS
If A = {R_{T−j} = i}, i ∈ S, j = 0, 1, ..., then A is prior to T. (Set R_{T−j} = R₀ if j > T.) To see this, note that {R_{T−j} = i} ∩ {T = k} = {R_{k−j} = i} ∩ {T = k}; examination of R₀, ..., R_k determines the value of R_{k−j}, and also determines whether or not T = k.
�
•
•
•
If T is a stopping time for the Markov chain {R } , then {T = r} is prior to T for all r = 0, 1 , . . . . For Example 2.
n
{T = r} n {T = k} = 0 if r � k if r = k = {T = k} In eith er case ping time. _....
/ {T= r } n { T= k }
is a function of R0,
•
•
•
, Rk, since T is a stop
Theorem 1. Let T be a stopping time for the Markov chain {R_n}. If A is prior to T, then

P(A ∩ {R_T = i, R_{T+1} = i₁, ..., R_{T+k} = i_k}) = P(A ∩ {R_T = i}) p_{ii₁} p_{i₁i₂} ··· p_{i_{k−1}i_k}    (i, i₁, ..., i_k ∈ S)
PROOF. The probability of the set on the left is

∑_{n=0}^∞ P(A ∩ {T = n, R_n = i, R_{n+1} = i₁, ..., R_{n+k} = i_k})
    = ∑_{n=0}^∞ P(A ∩ {T = n, R_n = i}) × P(R_{n+1} = i₁, ..., R_{n+k} = i_k | A ∩ {T = n, R_n = i})

(Actually we sum only over those n for which P(A ∩ {T = n, R_n = i}) > 0.) Now

P(R_{n+1} = i₁, ..., R_{n+k} = i_k | A ∩ {T = n, R_n = i}) = P{R_{n+1} = i₁, ..., R_{n+k} = i_k | R_n = i}

since I_{A ∩ {T=n}} is a function of R₀, R₁, ..., R_n (see Problem 1), and this equals p_{ii₁} p_{i₁i₂} ··· p_{i_{k−1}i_k} (Problem 4, Section 7.1). Thus the summation becomes

∑_{n=0}^∞ P(A ∩ {T = n, R_n = i}) p_{ii₁} p_{i₁i₂} ··· p_{i_{k−1}i_k}

and the result follows.
Theorem 2 (Strong Markov Property). Let T be a stopping time for the Markov chain {R_n}. Then

(a) P{R_{T+1} = i₁, ..., R_{T+k} = i_k | R_T = i} = p_{ii₁} p_{i₁i₂} ··· p_{i_{k−1}i_k} if P{R_T = i} > 0 (i, i₁, ..., i_k ∈ S).

(b) If A is prior to T, then P(R_{T+1} = i₁, ..., R_{T+k} = i_k | A ∩ {R_T = i}) = p_{ii₁} p_{i₁i₂} ··· p_{i_{k−1}i_k} if P(A ∩ {R_T = i}) > 0 (i, i₁, ..., i_k ∈ S).
.
•
•
•
•
The strong Markov property reduces to the ordinary Markov property ( 7. 1 .2) if we set k = 1 , T = n , and A = {R0 = j0 , , R n_ 1 = jn_1} . For T is a stopping time since
REMARK.
•
=
l{ T=k}
gk (Ro ,
·
·
·
' Rk)
=
0
if k � n
=
1
if k
and A is prior to T since l.A n { T=k }
=
h k ( Ro ,
·
·
=
•
•
n
' Rk)
·
=
=
1
if k
=
0
otherwise
n and R0
=
j0 ,
•
•
•
=
, Rn-1
jn 1 -
PROBLEMS
1. Let {R_n} be a Markov chain. If D is an event whose occurrence or nonoccurrence is determined by examination of R₀, ..., R_n — that is, I_D is a function of R₀, ..., R_n, or, equivalently, D is of the form {(R₀, ..., R_n) ∈ B} for some B ⊂ S^{n+1} — show that

P(R_{n+1} = i₁, ..., R_{n+k} = i_k | D ∩ {R_n = i}) = p_{ii₁} p_{i₁i₂} ··· p_{i_{k−1}i_k}    if P(D ∩ {R_n = i}) > 0

2. If {R_n} is a Markov chain, show that the "reversed sequence" R_n, R_{n−1}, R_{n−2}, ... also has the Markov property.
7.3 CLASSIFICATION OF STATES

In this section we examine various modes of behavior of Markov chains. A key to the analysis is the following result. We consider a fixed Markov chain {R_n} throughout.
Theorem 1 (First Entrance Theorem). Let f_ii⁽ⁿ⁾ be the probability that the first return to i will occur at time n, when the initial state is i; that is,

f_ii⁽ⁿ⁾ = P{R_n = i, R_k ≠ i for 1 ≤ k ≤ n − 1 | R₀ = i},    n = 1, 2, ...

If i ≠ j, let f_ij⁽ⁿ⁾ be the probability that the first visit to state j will occur at time n, when the initial state is i; that is,

f_ij⁽ⁿ⁾ = P{R_n = j, R_k ≠ j for 1 ≤ k ≤ n − 1 | R₀ = i},    n = 1, 2, ...

Then

p_ij⁽ⁿ⁾ = ∑_{k=1}^n f_ij⁽ᵏ⁾ p_jj⁽ⁿ⁻ᵏ⁾,    n = 1, 2, ...
Intuitively, if we are to be in state j after n steps, we must reach j for the first time at step k, 1 ≤ k ≤ n. After this happens, we are in state j and must be in state j again after the n − k remaining steps. For a formal proof, we use the strong Markov property. Assume that the initial state is i, and let T be the time of the first visit to j (T = min {k ≥ 1 : R_k = j} if R_k = j for some k = 1, 2, ...; T = ∞ if R_k ≠ j for all k = 1, 2, ...). Then

P{R_n = j} = ∑_{k=1}^n P{R_n = j, T = k} = ∑_{k=1}^n P{T = k, R_{T+n−k} = j}

But

P{T = k, R_{T+n−k} = j} = P{T = k} P{R_{T+n−k} = j | T = k}

and since {T = k} = {T = k, R_T = j},

P{R_{T+n−k} = j | T = k} = P{R_{T+n−k} = j | R_T = j, T = k}
    = P{R_{T+n−k} = j | R_T = j}    by Theorem 2b of Section 7.2
    = p_jj⁽ⁿ⁻ᵏ⁾    by Theorem 2a of Section 7.2

Since P{T = k} = f_ij⁽ᵏ⁾, the result follows.

Now let

f_ii = ∑_{n=1}^∞ f_ii⁽ⁿ⁾

f_ii is the probability of eventual return to state i when the initial state is i.
Theorem 2. If the initial state is i, the probability of returning to i at least r times is (f_ii)ʳ.

PROOF. The result is immediate if r = 1. If it holds when r = m − 1, let T be the time of first return to i. Then, starting from i, P{return to i at least m times} = ∑_{k=1}^∞ P{T = k, at least m − 1 returns after T}. But

P{T = k, at least m − 1 returns after T}
    = P{T = k} P{R_{T+1}, R_{T+2}, ... returns to i at least m − 1 times | T = k}
    = P{T = k} P{R_{T+1}, R_{T+2}, ... returns to i at least m − 1 times | R_T = i, T = k}

By the strong Markov property this may be written as

P{T = k} P{R_{T+1}, R_{T+2}, ... returns to i at least m − 1 times | R_T = i}
    = P{T = k} P{R₁, R₂, ... returns to i at least m − 1 times | R₀ = i}
    = f_ii⁽ᵏ⁾ (f_ii)^{m−1}    by the induction hypothesis

Thus the probability of returning to i at least m times is

∑_{k=1}^∞ f_ii⁽ᵏ⁾ (f_ii)^{m−1} = f_ii (f_ii)^{m−1} = (f_ii)ᵐ

which is the desired result.

COROLLARY. Let the initial state be i. If f_ii = 1, the probability of returning to i infinitely often is 1. If f_ii < 1, the probability of returning to i infinitely often is 0.

PROOF. The events {return to i at least r times}, r = 1, 2, ..., form a contracting sequence whose intersection is {return to i infinitely often}. Thus the probability of returning to i infinitely often is lim_{r→∞} (f_ii)ʳ, and the result follows.
DEFINITION.
if
It is useful to have a criterion for recurrence in terms of the probabilities P1t>, since these numbers are often easier to handle than the J:r>. Theorem
3.
PROOF.
By the first entrance theorem,
The state i is recurrent iff I � 1 p� � > = oo. n
"" l'.< ?c>p� � -k) Pu�� > = £.., Ju u k=1
222
MARKOV CHAINS
so that
co
"" Pii( n ) - Jiink =O - .f
Thus Hence, if
00
>< n!-P�7 =l
oo
then Iii < 1 , so that i is transient. Now
N "" �� > ,k pu n=l
Thus
=
N n N N N N k k k k � 1' ) ) "" ) "" l'. "" . . < "" "" l'. � �?'l' "" �� < > < ,k J u > ,k pu. �r ) ,k J u ,k pu ,k ,k J u pu n=l k=l k=l r=O k=l n=k =
N n "" ( ) ,k Pii N n =l � 1 .f "" l'.<.k > > "" 1'.<� > > as N � oo if ! P:; > - N Jtt - ,k J u ,k J u n =l k=l k=l ' r) ( k.t Pii r=O Therefore ! : 1 p�t > oo implies that Iii 1 , so that i is recurrent. .
.
=
00
00
=
oo
=
=
We denote by Iii the probability of ever visiting j at some future time, starting from i ; that is ,
Theorem 4.
Ifj is a transient state and i an arbitrary state, then 00
! p� � ) < 00 n =l
hence PROOF.
as n � oo By the first entrance theorem,
=
00
! p �j ) fi in=O
7.3 'CLASSIFICATION OF STATES
But hi' being a probability, is < 1 , and result follows.
I� 0 p�j'
223
< oo by The orem 3 ; the
If j is a transient state and the initial state of the chain is i, then, by Theorem 4 above and the Borel-Cantelli lemma (Theorem 4 , Section 6.6) , with probability 1 the state j will be visited only finitely many times. Alternatively , we may use the argument of Theorem 2 (witb initial state i and T = the time of first visit to j) to show that the probability that j will be visited at least m times is h3(h3)m-l . Now hi < 1 since j is transient, and thus , if we let m ---+ oo , we find (as in the corollary� to Theorem 2)- that the probability that j will be visited infinitely often is 0 . In fact this result holds for an arbitrary initial distribution. For
REMARK.
    P{R_n = j for infinitely many n} = Σ_i P{R_0 = i} P{R_n = j for infinitely many n | R_0 = i} = 0

It follows that if B is a finite set of transient states, then the probability of remaining in B forever is 0. For if R_n ∈ B for all n, then, since B is finite, we have, for some j ∈ B, R_n = j for infinitely many n; thus

    P{R_n ∈ B for all n} ≤ Σ_{j∈B} P{R_n = j for infinitely many n} = 0

One of our main problems will be to classify the states of a given chain as to recurrence or nonrecurrence. The first step is to introduce an equivalence relation on the state space and show that within each equivalence class all states are of the same type.

DEFINITION. If i and j are distinct states, we say that i leads to j iff f_ij > 0; that is, it is possible to reach j, starting from i. Equivalently, i leads to j iff p_ij^(n) > 0 for some n ≥ 1. By convention, i leads to itself. We say that i and j communicate iff i leads to j and j leads to i.
We define an equivalence relation on the state space S by taking i equivalent to j iff i and j communicate. (It is not difficult to verify that we have a legitimate equivalence relation.) The next theorem shows that recurrence or nonrecurrence is a class property: that is, if one state in a given equivalence class is recurrent, all states are recurrent.

Theorem 5. If i is recurrent and i leads to j, then j is recurrent. Furthermore, f_ij = f_ji = 1. In fact, if g_ij is the probability that j will be visited infinitely often when the initial state is i, then g_ij = g_ji = 1.
PROOF. Start in state i, and let T be the time of the first visit to j. Then

    1 = Σ_{k=1}^∞ P{T = k} + P{T = ∞}
      = Σ_{k=1}^∞ P{T = k, infinitely many visits to i after T} + P{T = ∞}    (since i is recurrent)
      = Σ_{k=1}^∞ P{T = k} P{R_{T+1}, R_{T+2}, ... visits i infinitely often | T = k, R_T = j} + 1 − f_ij
      = Σ_{k=1}^∞ f_ij^(k) P{R_1, R_2, ... visits i infinitely often | R_0 = j} + 1 − f_ij

Thus 1 = f_ij g_ji + 1 − f_ij, where g_ji is the probability of infinitely many visits to i when the initial state is j. Since f_ij > 0 by hypothesis, g_ji = 1; in particular f_ji = 1. Now if p_ji^(s) > 0 and p_ij^(r) > 0, then

    p_jj^(n+r+s) ≥ p_ji^(s) p_ii^(n) p_ij^(r)

since one way of going from j to j in n + r + s steps is to go from j to i in s steps, from i to i in n steps, and finally from i to j in r steps. It follows from Theorem 3 that Σ_{n=1}^∞ p_jj^(n) = ∞; hence j is recurrent. Finally, we have j recurrent and f_ji > 0. By the above argument, with i and j interchanged, g_ij = 1. Since f_ij ≥ g_ij and f_ji ≥ g_ji, it follows that f_ij = f_ji = 1 and the theorem is proved.
Theorem 6. In a finite chain (i.e., S a finite set), it is not possible for all states to be transient. In particular, if every state in a finite chain can be reached from every other state, so that there is only one equivalence class (namely S), then all states are recurrent.

PROOF. If S = {1, 2, ..., r}, then Σ_{j=1}^r p_ij^(n) = 1 for all n. Let n → ∞. If all states were transient, then by Theorem 4 and the fact that the limit of a finite sum is the sum of the limits, we would have 0 = Σ_{j=1}^r lim_{n→∞} p_ij^(n) = lim_{n→∞} Σ_{j=1}^r p_ij^(n) = 1, a contradiction.

In the case of a finite chain, it is easy to decide whether or not a given class is recurrent; we shall see how to do this in a moment.
DEFINITION. A nonempty subset C of the state space S is said to be closed iff it is not possible to leave C; that is, Σ_{j∈C} p_ij = 1 for all i ∈ C. Notice that if C is closed, then the submatrix [p_ij], i, j ∈ C, is stochastic; hence so is [p_ij^(n)], i, j ∈ C.
Theorem 7. C is closed iff, for all i ∈ C, i leads to j implies j ∈ C.

PROOF. Let C be closed. If i ∈ C and i leads to j, then p_ij^(n) > 0 for some n. If j ∉ C, then Σ_{k∈C} p_ik^(n) < 1, a contradiction. Conversely, if the condition is satisfied and C is not closed, then Σ_{j∈C} p_ij < 1 for some i ∈ C; hence p_ij > 0 for some i ∈ C, j ∉ C, a contradiction.
Theorem 8.
(a) Let C be a recurrent class. Then C is closed.
(b) If C is any equivalence class, no proper subset of C is closed.
(c) In a finite chain, every closed equivalence class C is recurrent. Thus, in a finite chain, the recurrent classes are simply those classes that are closed.

PROOF.
(a) Let C be a recurrent class. If C is not closed, then by Theorem 7 we have some i ∈ C leading to a j ∉ C. But by Theorem 5, i and j communicate, and so i is equivalent to j. This contradicts i ∈ C, j ∉ C.
(b) Let D be a (nonempty) proper subset of the arbitrary equivalence class C. Pick i ∈ D and j ∈ C, j ∉ D. Then i leads to j, since both states belong to the same equivalence class. Thus D cannot be closed.
(c) Consider C itself as a chain; this is possible since Σ_{j∈C} p_ij = 1, i ∈ C. (We are simply restricting the original transition matrix to C.) By Theorem 6 and the fact that recurrence is a class property, C is recurrent.
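In a finite chain, Theorems 7 and 8 reduce classification to a reachability computation: find the communicating classes, and mark a class recurrent exactly when no state in it leads outside it. A sketch in Python; the transition matrix is made up to have the same class structure as Example 1 below (the exact positive probabilities are illustrative, not taken from the figure):

```python
# A chain with the class structure of Example 1: {1, 2} and {6} transient,
# {3, 4, 5} a closed (hence recurrent) class.  Probabilities are illustrative.
P = {
    1: {1: 0.5, 2: 0.3, 3: 0.2},
    2: {1: 0.5, 2: 0.5},
    3: {4: 1.0},
    4: {5: 1.0},
    5: {3: 1.0},
    6: {1: 0.5, 3: 0.5},
}

def leads_to(P):
    """reach[i] = set of states j such that i leads to j (including i itself)."""
    reach = {i: {i} | set(P[i]) for i in P}
    changed = True
    while changed:                      # transitive closure of one-step reachability
        changed = False
        for i in P:
            new = set().union(*(reach[j] for j in reach[i]))
            if not new <= reach[i]:
                reach[i] |= new
                changed = True
    return reach

def classify(P):
    reach = leads_to(P)
    classes = []
    for i in P:                         # i and j communicate iff each leads to the other
        cls = {j for j in P if j in reach[i] and i in reach[j]}
        if cls not in classes:
            classes.append(cls)
    # a class is closed iff no state in it leads outside it (Theorem 7);
    # in a finite chain, closed <=> recurrent (Theorem 8c)
    return [(cls, all(reach[i] <= cls for i in cls)) for cls in classes]

for cls, closed in classify(P):
    print(sorted(cls), "recurrent" if closed else "transient")
```

Only which entries of the matrix are positive matters here; the numerical values of the probabilities play no role in the classification.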
Example 1. Consider the chain of Figure 7.3.1. (An arrow from i to j indicates that p_ij > 0.) There are three equivalence classes, C_1 = {1, 2}, C_2 = {3, 4, 5}, and C_3 = {6}. By Theorem 8, C_2 is recurrent and C_1 and C_3 are transient.

FIGURE 7.3.1  A Finite Markov Chain.

There is no foolproof method for classifying the states of an infinite chain, but in some cases an analysis can be done quickly. Consider the chain of Example 3, Section 7.1, and assume that all p_k > 0. Then every state is reachable from every other state, so that the entire state space forms a
single equivalence class. We claim that the class is recurrent. For assume the contrary; then all states are transient. Let 0 be the initial state; by the remark after Theorem 4, the set B = {0, 1, ..., M − 1} will be visited only finitely many times; that is, eventually R_n ≥ M (with probability 1). But by definition of the transition matrix, if R_n ≥ M then R_{n+1} = 0, a contradiction.

We now describe another basic class property, that of periodicity. If p_ii^(n) > 0 for some n ≥ 1, that is, if starting from i it is possible to return to i, we define the period of i (notation: d_i) as the greatest common divisor of the set of positive integers n such that p_ii^(n) > 0. Equivalently, the period of i is the greatest common divisor of the set of positive integers n such that f_ii^(n) > 0 (see Problem 1c).

Theorem 9. If the distinct states i and j are in the same equivalence class, they have the same period.
PROOF. Since i and j communicate, each has a period. If p_ij^(r) > 0 and p_ji^(s) > 0, then p_jj^(n+r+s) ≥ p_ji^(s) p_ii^(n) p_ij^(r) (see the argument of Theorem 5). Set n = 0 to obtain p_jj^(r+s) > 0, so that r + s is a multiple of d_j. Thus if n is not a multiple of d_j (so that neither is n + r + s), we have p_jj^(n+r+s) = 0; hence p_ii^(n) = 0. But this says that if p_ii^(n) > 0 then n is a multiple of d_j; hence d_j ≤ d_i. By a symmetrical argument, d_i ≤ d_j.

The transitions from state to state within a closed equivalence class C of period d ≥ 1, although random, have a certain cyclic pattern, which we now describe. Let i, j ∈ C; if p_ij^(r) > 0 and p_ij^(s) > 0, let t be such that p_ji^(t) > 0. Then

    p_ii^(r+t) ≥ p_ij^(r) p_ji^(t) > 0

hence d divides r + t. Similarly, d divides s + t, and so d divides s − r. Thus, if r = ad + b, a and b integers, 0 ≤ b ≤ d − 1, then s = cd + b for some integer c. Consequently, if i leads to j in n steps, then n is of the form cd + b; that is, n ≡ b mod d, where the integer b depends on the states i and j but is independent of n. Now fix i ∈ C and define

    C_0     = {j ∈ C : p_ij^(n) > 0 implies n ≡ 0 mod d}
    C_1     = {j ∈ C : p_ij^(n) > 0 implies n ≡ 1 mod d}
    ...
    C_{d−1} = {j ∈ C : p_ij^(n) > 0 implies n ≡ d − 1 mod d}
FIGURE 7.3.2  A Periodic Chain.

Then

    C = ∪_{j=0}^{d−1} C_j

Theorem 10. If k ∈ C_t and p_kj > 0, then j ∈ C_{t+1} (with indices reduced mod d; i.e., C_d = C_0, C_{d+1} = C_1, etc.). Thus, starting from i, the chain moves from C_0 to C_1 to ... to C_{d−1}, back to C_0, and so on. The C_j are called the cyclically moving subclasses of C.

PROOF. Choose an n such that p_ik^(n) > 0. Then n is of the form ad + t. Now p_ij^(n+1) ≥ p_ik^(n) p_kj > 0; hence i leads to j, and therefore j ∈ C, since C is closed. But n ≡ t mod d; hence n + 1 ≡ t + 1 mod d, so that j ∈ C_{t+1}.
Example 2. Consider the chain of Figure 7.3.2. Since every state leads to every other state, the entire state space forms a closed equivalence class C (necessarily recurrent by Theorem 6). We now describe an effective procedure that can be used to find the period of any finite closed equivalence class. Start with any state, say 1, and let C_0 be the subclass containing 1. Then all states reachable in one step from 1 belong to C_1; in this case 3 ∈ C_1. All states reachable in one step from 3 belong to C_2; in this case 5, 6 ∈ C_2. Continue in this fashion to obtain the following table, constructed according to the rule that all states reachable in one step from at least one state in C_i belong to C_{i+1}.

    C_0   C_1   C_2    C_3   C_4   C_5   C_6
    1     3     5, 6   2     4     7     1, 2

Stop the construction when a class C_k is reached that contains a state belonging to some C_j, j < k. Here we have 2 ∈ C_3 ∩ C_6; hence C_3 = C_6. Also, 1 ∈ C_0 ∩ C_6; hence C_0 = C_3 = C_6. Repeat the process with C_0 = the class containing 1 and 2; all states reachable in one step from either 1 or 2 belong to C_1. We obtain

    C_0    C_1    C_2       C_3
    1, 2   3, 4   5, 6, 7   1, 2

We find that C_0 = C_3 (which we already knew). Since C_0 ∪ C_1 ∪ C_2 is the entire equivalence class C, we are finished. We conclude that the period is 3, and that C_0 = {1, 2}, C_1 = {3, 4}, C_2 = {5, 6, 7}. If C has only a finite number of states, the above process must terminate in a finite number of steps.

We have the following schematic representation of the powers of the transition matrix.
    Π:        C_0 C_1 C_2      Π²:       C_0 C_1 C_2      Π³:       C_0 C_1 C_2
         C_0 [ 0   x   0 ]          C_0 [ 0   0   x ]          C_0 [ x   0   0 ]
         C_1 [ 0   0   x ]          C_1 [ x   0   0 ]          C_1 [ 0   x   0 ]
         C_2 [ x   0   0 ]          C_2 [ 0   x   0 ]          C_2 [ 0   0   x ]

(x stands for positive element.) Notice that Π⁴ has the same form as Π but is not the same numerically; similarly, Π⁵ has the same form as Π², Π⁶ has the same form as Π³, and so on.
Example 3. Consider the simple random walk.
(a) If there are no barriers, the entire state space forms a closed equivalence class with period 2. We have seen that f_00 = 1 − |p − q| [see (6.2.7)]; by symmetry, f_ii = f_00 for all i. Thus if p = q the class is recurrent, and if p ≠ q the class is transient.
(b) If there is an absorbing barrier at 0, then there are two classes, C = {0} and D = {1, 2, ...}. C is clearly recurrent, and since D is not closed, it is transient by Theorem 8. C has period 1, and D has period 2.
(c) If there are absorbing barriers at 0 and b, then there are three equivalence classes, C = {0}, D = {1, 2, ..., b − 1}, E = {b}. C and E have period 1 and are recurrent; D has period 2 and is transient.
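The values f_00 = 1 − |p − q| in part (a) can be checked by simulation: run many walks from 0 and count the fraction that return, truncating each walk at a finite horizon. A rough Monte Carlo sketch (sample size, horizon, and seed are arbitrary; for p = q the truncation biases the estimate slightly below the true value 1):

```python
import random

def return_prob(p, trials=5000, horizon=1000, seed=1):
    """Monte Carlo estimate of f_00 for the simple random walk."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pos = 0
        for _ in range(horizon):
            pos += 1 if rng.random() < p else -1
            if pos == 0:          # returned to the starting state
                hits += 1
                break
    return hits / trials

for p in (0.5, 0.7):
    # estimate vs. the exact value 1 - |p - q|
    print(p, return_prob(p), 1 - abs(p - (1 - p)))
```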
REMARK. We have seen that if B is a finite set of transient states, the probability of remaining forever in B is 0. This is not true for an infinite set of transient states. For example, in the simple random walk with an absorbing barrier at 0, if the initial state is x ≥ 1 and p > q, there is a probability 1 − (q/p)^x > 0 of remaining forever in the transient class D = {1, 2, ...} [see (6.2.6)].

TERMINOLOGY. A state (or class) is said to be aperiodic iff its period d is 1, periodic iff d > 1.
PROBLEMS

1. (a) Let A be a (possibly infinite) set of positive integers with greatest common divisor d. Show that there is a finite subset of A with greatest common divisor d.
   (b) If A is a nonempty set of positive integers with greatest common divisor d, and A is closed under addition, show that all sufficiently large multiples of d belong to A.
   (c) If d_i is the period of the state i, show that d_i is the greatest common divisor of {n ≥ 1 : f_ii^(n) > 0}.
2. A state i is said to be essential iff its equivalence class is closed. Show that i is essential iff, whenever i leads to j, it follows that j leads to i.
3. Prove directly (without using Theorem 5) that an equivalence class that is not closed must be transient.
4. Classify the states of the following Markov chains. [In (a) and (b) assume 0 < p < 1.]
   (a) Simple random walk with reflecting barrier at 0 (S = {1, 2, ...}, p_11 = q, p_{i,i+1} = p for all i, p_{i,i−1} = q, i = 2, 3, ...)
   (b) Simple random walk with reflecting barriers at 0 and l + 1 (S = {1, 2, ..., l}, p_11 = q, p_ll = p, p_{i,i+1} = p, i = 1, 2, ..., l − 1, p_{i,i−1} = q, i = 2, 3, ..., l)
   (c)
           Π = [ .2  .8   0   0 ]
               [  0  .1  .9   0 ]
               [  0   0  .2  .8 ]
               [ .7  .3   0   0 ]
   (d)
           Π = [ 1/2  1/2   0  ]
               [  0   1/3  2/3 ]
               [ 1/2   0   1/2 ]
5. Let R_0, R_1, R_2, ... be independent random variables, all having the same distribution function, with values in the countable set S; assume P{R_i = j} > 0 for all j ∈ S.
   (a) Show that {R_n} may be regarded as a Markov chain; what is Π?
   (b) Classify the states of the chain.
6. Let i be a state of a Markov chain, and let

       H(z) = Σ_{n=0}^∞ f_ii^(n) z^n,    U(z) = Σ_{n=0}^∞ p_ii^(n) z^n

   [take f_ii^(0) = 0]. Use the first entrance theorem to show that U(z) − 1 = H(z)U(z).
7.4 LIMITING PROBABILITIES
In this section we investigate the limiting behavior of the n-step transition probability p_ij^(n). The basic result is the following.
Theorem 1. Let f_1, f_2, ... be a sequence of nonnegative numbers with Σ_{n=1}^∞ f_n = 1, such that the greatest common divisor of {n : f_n > 0} is 1. Set u_0 = 1, u_n = Σ_{k=1}^n f_k u_{n−k}, n = 1, 2, .... Define μ = Σ_{n=1}^∞ n f_n. Then u_n → 1/μ as n → ∞.

We shall apply the theorem to a Markov chain with f_n = f_ii^(n), i a given recurrent state with period 1; then u_n = p_ii^(n) by the first entrance theorem. Also, μ = μ_i = Σ_{n=1}^∞ n f_ii^(n), so that if T is the time required to return to i when the initial state is i, then μ_i = E(T). If i is an arbitrary recurrent state of a Markov chain, μ_i is called the mean recurrence time of i. Theorem 1 states that p_ii^(n) → 1/μ_i; thus, starting in i, there is a limiting probability for state i, namely, the reciprocal of the mean recurrence time. Intuitively, if μ_i = (say) 4, then for large n we should be in state i roughly one quarter of the time, and it is reasonable to expect that p_ii^(n) → 1/4.

PROOF. We first list three results from real analysis that will be needed. All numbers a_kj, c_j, a_j, b_k are assumed real.

1. Fatou's Lemma: If |a_kj| ≤ c_j, k, j = 1, 2, ..., and Σ_j c_j < ∞, then

       lim sup_{k→∞} Σ_j a_kj ≤ Σ_j lim sup_{k→∞} a_kj

   and

       lim inf_{k→∞} Σ_j a_kj ≥ Σ_j lim inf_{k→∞} a_kj

   The "lim inf" statement holds without the hypothesis that |a_kj| ≤ c_j, Σ_j c_j < ∞, if all a_kj ≥ 0.

2. Dominated Convergence Theorem: If |a_kj| ≤ c_j, k, j = 1, 2, ..., Σ_j c_j < ∞, and lim_{k→∞} a_kj = a_j, j = 1, 2, ..., then
       lim_{k→∞} Σ_j a_kj = Σ_j lim_{k→∞} a_kj = Σ_j a_j

   (The dominated convergence theorem follows from Fatou's lemma. Alternatively, a fairly short direct proof may be given.)

3. lim inf_{k→∞} (a_k + b_k) ≤ lim inf_{k→∞} a_k + lim sup_{k→∞} b_k. (This follows quickly from the definitions of lim inf and lim sup.)

We now prove Theorem 1. First notice that 0 ≤ u_n ≤ 1 for all n (by induction). Define

       r_n = Σ_{i=n+1}^∞ f_i,    n = 0, 1, ...

Then

       u_n = f_1 u_{n−1} + ··· + f_n u_0 = (r_0 − r_1)u_{n−1} + ··· + (r_{n−1} − r_n)u_0,    n ≥ 1    (7.4.1)

Since r_0 = Σ_{i=1}^∞ f_i = 1, we have u_n = r_0 u_n, and thus we may rearrange terms in (7.4.1) to obtain

       r_0 u_n + r_1 u_{n−1} + ··· + r_n u_0 = r_0 u_{n−1} + ··· + r_{n−1} u_0,    n ≥ 1

This indicates that Σ_{k=0}^n r_k u_{n−k} is independent of n; hence

       Σ_{k=0}^n r_k u_{n−k} = r_0 u_0 = 1,    n = 0, 1, ...    (7.4.2)

[An alternative proof that Σ_{k=0}^n r_k u_{n−k} = 1: construct a Markov chain with f_ii^(n) = f_n, p_ii^(n) = u_n (see Problem 1). Then

       Σ_{k=0}^n r_k u_{n−k} = Σ_{k=0}^n u_k r_{n−k}
                             = Σ_{k=0}^n p_ii^(k) P{T > n − k | R_0 = i}

(where T is the time required to return to i when the initial state is i)

                             = Σ_{k=0}^n P{R_k = i, R_{k+1} ≠ i, ..., R_n ≠ i | R_0 = i}
                             = P{R_k = i for some k = 0, 1, ..., n | R_0 = i} = 1]
Now let b = lim sup_n u_n. Pick a subsequence {u_{n_k}} converging to b. Then

       b = lim_k u_{n_k} = lim inf_k u_{n_k} = lim inf_k [ f_i u_{n_k−i} + Σ_{j≥1, j≠i} f_j u_{n_k−j} ]

[We use here the fact that lim inf_k (a_k + b_k) ≤ lim inf_k a_k + lim sup_k b_k. Furthermore, if we take u_n = 0 for n < 0, then

       lim sup_k Σ_{j≥1, j≠i} f_j u_{n_k−j} ≤ Σ_{j≥1, j≠i} f_j lim sup_k u_{n_k−j} ≤ (1 − f_i)b

Notice that since |f_j u_{n_k−j}| ≤ f_j and Σ_{j=1}^∞ f_j < ∞, Fatou's lemma applies.] Therefore

       b ≤ f_i lim inf_k u_{n_k−i} + (1 − f_i)b

or

       f_i lim inf_k u_{n_k−i} ≥ f_i b

Thus f_i > 0 implies u_{n_k−i} → b as k → ∞. It follows that u_{n_k−t} → b for all sufficiently large t. For if f_j > 0, we apply the above argument to the sequence u_{n_k−i}, k = 1, 2, ..., to show that f_j > 0 implies u_{n_k−i−j} → b. Thus if t = Σ_{r=1}^m a_r i_r, where the a_r are positive integers and f_{i_r} > 0, then u_{n_k−t} → b. The set S of all such t's is closed under addition and has greatest common divisor 1, since S is generated by the positive integers i for which f_i > 0. Thus (Problem 1b, Section 7.3) S contains all sufficiently large positive integers. Say u_{n_k−i} → b for i ≥ I. By (7.4.2),

       Σ_{i=0}^∞ r_i u_{n_k−I−i} = 1    (with u_n = 0 for n < 0)    (7.4.3)

If Σ_{i=0}^∞ r_i < ∞, the dominated convergence theorem shows that we may let k → ∞ and take limits term by term in (7.4.3) to obtain b Σ_{i=0}^∞ r_i = 1. If Σ_{i=0}^∞ r_i = ∞, Fatou's lemma gives 1 ≥ b Σ_{i=0}^∞ r_i; hence b = 0. In either case, then,

       b = 1 / Σ_{i=0}^∞ r_i
But

       r_0 = f_1 + f_2 + f_3 + ···
       r_1 = f_2 + f_3 + ···
       r_2 = f_3 + ···
       ...

Hence

       Σ_{n=0}^∞ r_n = Σ_{n=1}^∞ n f_n = μ

Consequently b = lim sup_n u_n = 1/μ. By an entirely symmetric argument, lim inf_n u_n = 1/μ, and the result follows.
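Theorem 1 is easy to check numerically: iterate the recursion u_n = Σ_k f_k u_{n−k} and compare with 1/μ. A sketch with an arbitrary aperiodic choice of f_1, f_2, f_3 (the particular numbers are illustrative):

```python
# An aperiodic renewal distribution: gcd{n : f_n > 0} = gcd(1, 2, 3) = 1
f = {1: 0.2, 2: 0.5, 3: 0.3}
mu = sum(n * p for n, p in f.items())        # mu = 2.1

u = [1.0]                                    # u_0 = 1
for n in range(1, 200):
    # the recursion u_n = sum_{k=1}^{n} f_k u_{n-k}
    u.append(sum(f.get(k, 0.0) * u[n - k] for k in range(1, n + 1)))

print(u[-1], 1 / mu)                         # u_n approaches 1/mu = 0.47619...
```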
We now apply Theorem 1 to gain complete information about the limiting behavior of the n-step transition probability p_ij^(n). A recurrent state j is said to be positive iff its mean recurrence time μ_j is < ∞, null iff μ_j = ∞.

Theorem 2.
(a) If the state j is transient, then Σ_{n=1}^∞ p_ij^(n) < ∞ for all i; hence

       p_ij^(n) → 0    as n → ∞

PROOF. This is Theorem 4 of Section 7.3.

(b) If j is recurrent and aperiodic, and i belongs to the same equivalence class as j, then p_ij^(n) → 1/μ_j. Furthermore, μ_i is finite iff μ_j is finite. If i belongs to a different class, then p_ij^(n) → f_ij/μ_j.

PROOF. We have

       p_ij^(n) = Σ_{k=1}^n f_ij^(k) p_jj^(n−k)

by the first entrance theorem [take p_jj^(r) = 0, r < 0]. By the dominated convergence theorem, we may take limits term by term as n → ∞; since p_jj^(n−k) → 1/μ_j by Theorem 1, we have

       p_ij^(n) → ( Σ_{k=1}^∞ f_ij^(k) ) (1/μ_j) = f_ij/μ_j

If i and j belong to the same recurrent class, f_ij = 1. Now assume that μ_j is finite. If p_ij^(r) > 0 and p_ji^(s) > 0, then p_ii^(n+r+s) ≥ p_ij^(r) p_jj^(n) p_ji^(s); this is bounded away from 0 for large n, since p_jj^(n) → 1/μ_j > 0. But if μ_i = ∞, then p_ii^(n) → 0 as n → ∞, a contradiction. This proves (b).
(c) Let j be recurrent with period d > 1. Let i be in the same class as j, with i ∈ the cyclically moving subclass C_r, j ∈ C_{r+a}. Then p_ij^(nd+a) → d/μ_j. Also, μ_j is finite iff μ_i is finite, so that the property of being recurrent positive (or recurrent null) is a class property.

PROOF. First assume a = 0. Then j is recurrent and aperiodic relative to the chain with transition matrix Π^d. (If A has greatest common divisor d, then the greatest common divisor of {x/d : x ∈ A} is 1.) By (b),

       p_ij^(nd) → d/μ_j

Now, having established the result for a = r, assume a = r + 1 and write

       p_ij^(nd+r+1) = Σ_k p_ik p_kj^(nd+r) → Σ_k p_ik (d/μ_j) = d/μ_j

as asserted. The argument that μ_j is finite iff μ_i is finite is the same as in (b), with nd replacing n.

(d) If j is recurrent with period d > 1, and i is an arbitrary state, then

       p_ij^(nd+a) → [ Σ_{r=0}^∞ f_ij^(rd+a) ] (d/μ_j),    a = 0, 1, ..., d − 1

The expression in brackets is the probability of reaching j from i in a number of steps that is ≡ a mod d. Thus, if j is recurrent null, p_ij^(n) → 0 as n → ∞ for all i.

PROOF.

       p_ij^(nd+a) = Σ_{k=1}^{nd+a} f_ij^(k) p_jj^(nd+a−k),    a = 0, 1, ..., d − 1

Since j has period d, p_jj^(nd+a−k) = 0 unless k − a is of the form rd (necessarily r ≤ n); hence

       p_ij^(nd+a) = Σ_{r=0}^n f_ij^(rd+a) p_jj^((n−r)d)

Let n → ∞ and use (c) to finish the proof.
(e) A finite chain has no recurrent null states.
PROOF. Let C be a finite recurrent null class, say C = {1, 2, ..., r}. Then

       Σ_{j=1}^r p_ij^(n) = 1,    i ∈ C

Let n → ∞; by (d) we obtain 0 = 1, a contradiction.
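The dichotomy in Theorem 2 — p_ij^(n) → 0 for transient j, and p_ij^(n) → 1/μ_j for j in a recurrent aperiodic class, independently of i — can be watched by raising a matrix to a high power. The small chain below is illustrative (not one of the book's numbered examples): state 0 is transient, {1, 2} is a closed aperiodic recurrent class with stationary probabilities (6/13, 7/13), so μ_1 = 13/6 and μ_2 = 13/7.

```python
def matmul(A, B):
    """Naive square-matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# State 0 is transient; {1, 2} is a closed aperiodic recurrent class.
P = [[0.5, 0.25, 0.25],
     [0.0, 0.3,  0.7 ],
     [0.0, 0.6,  0.4 ]]

Pn = P
for _ in range(60):
    Pn = matmul(Pn, P)

# Every row tends to (0, 6/13, 7/13): p_ij^(n) -> 0 for the transient j = 0,
# and -> 1/mu_j for j in the recurrent class, whatever the initial state i.
for row in Pn:
    print([round(x, 6) for x in row])
```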
PROBLEMS

1. With f_n and u_n as in Theorem 1, show how to construct a Markov chain with a state i such that f_ii^(n) = f_n and p_ii^(n) = u_n for all n.
2. (The renewal theorem) Let T_1, T_2, ... be independent random variables, all with the same distribution function, taking values in the positive integers. (Think of the T_i as waiting times for customers to arrive, or as lifetimes of a succession of products such as light bulbs. If T_1 + ··· + T_n = x, bulb n has burned out at time x, and the light must be renewed by placing bulb n + 1 in position.) Assume that the greatest common divisor of {x : P{T_k = x} > 0} is d, and let G(n) = Σ_{k=1}^∞ P{T_1 + ··· + T_k = n}, n = 1, 2, .... If μ = E(T_i), show that lim_{n→∞} G(nd) = d/μ; interpret the result intuitively.
3. Show that in any Markov chain, (1/n) Σ_{k=1}^n p_ij^(k) approaches a limit as n → ∞, namely, f_ij/μ_j. (Define μ_j = ∞ if j is transient.) HINT: d = period of j.
4. Let V_ij be the number of visits to the state j, starting at i. (If i = j, t = 0 counts as a visit.)
   (a) Show that E(V_ij) = Σ_{n=0}^∞ p_ij^(n). Thus i is recurrent iff E(V_ii) = ∞, and if j is transient, E(V_ij) < ∞ for all i.
   (b) Let C be a transient class, N_ij = E(V_ij), i, j ∈ C. Show that

          N_ij = δ_ij + Σ_{k∈C} p_ik N_kj    (δ_ij = 1 if i = j; δ_ij = 0 if i ≠ j)

      In matrix form, N = I + QN, where Q = Π restricted to C.
   (c) Show that (I − Q)N = N(I − Q) = I, so that N = (I − Q)^(−1) in the case of a finite chain (the inverse of an infinite matrix need not be unique).

REMARK. (a) implies that in the gambler's ruin problem with finite capital, the average duration of the game is finite. For if the initial capital is i and D is the duration of the game, then D = Σ_{j=1}^{b−1} V_ij, so that E(D) < ∞.
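The fundamental-matrix identity N = (I − Q)^(−1) of Problem 4, together with the remark, gives a concrete computation. The sketch below carries it out with exact fractions for the fair gambler's ruin with absorbing barriers at 0 and b = 5 (so C = {1, 2, 3, 4}); the row sums of N are the expected game durations, and the classical answer E(D) = i(b − i) serves as a check:

```python
from fractions import Fraction

def inverse(M):
    """Gauss-Jordan inverse over exact fractions."""
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(i == j) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]            # partial pivot
        A[col] = [x / A[col][col] for x in A[col]] # normalize pivot row
        for r in range(n):
            if r != col and A[r][col] != 0:
                A[r] = [a - A[r][col] * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

# Fair gambler's ruin, absorbing barriers at 0 and b = 5.
# Q = transition matrix restricted to the transient class C = {1, 2, 3, 4}.
half = Fraction(1, 2)
Q = [[0, half, 0, 0],
     [half, 0, half, 0],
     [0, half, 0, half],
     [0, 0, half, 0]]

I4 = [[Fraction(i == j) for j in range(4)] for i in range(4)]
N = inverse([[I4[i][j] - Q[i][j] for j in range(4)] for i in range(4)])

# E(duration | initial capital i) = sum_j N_ij = i(b - i)
for i in range(4):
    print(i + 1, sum(N[i]))     # -> 1 4, 2 6, 3 6, 4 4
```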
7.5 STATIONARY AND STEADY-STATE DISTRIBUTIONS
A stationary distribution for a Markov chain with state space S is a set of numbers v_i, i ∈ S, such that v_i ≥ 0, Σ_{i∈S} v_i = 1, and

       Σ_{i∈S} v_i p_ij = v_j,    j ∈ S

Thus, if V = (v_i, i ∈ S), then VΠ = V. By induction, VΠ^n = (VΠ)Π^(n−1) = VΠ^(n−1) = ··· = VΠ = V, so that VΠ^n = V for all n = 0, 1, .... Therefore, if the initial state distribution is V, the state distribution at all future times is still V. Furthermore (Problem 4, Section 7.1), the sequence {R_n} is stationary; that is, the joint probability function of R_n, R_{n+1}, ..., R_{n+k} does not depend on n.

Stationary distributions are closely related to limiting probabilities. The main result is the following.
Theorem 1. Consider a Markov chain with transition matrix [p_ij]. Assume

       lim_{n→∞} p_ij^(n) = q_j

for all states i, j (where q_j does not depend on i). Then
(a) Σ_{j∈S} q_j ≤ 1 and Σ_{i∈S} q_i p_ij = q_j, j ∈ S.
(b) Either all q_j = 0, or else Σ_{j∈S} q_j = 1.
(c) If all q_j = 0, there is no stationary distribution. If Σ_{j∈S} q_j = 1, then {q_j} is the unique stationary distribution.
PROOF. We have

       Σ_j q_j = Σ_j lim_n p_ij^(n) ≤ lim inf_n Σ_j p_ij^(n) = 1

by Fatou's lemma; hence Σ_{j∈S} q_j ≤ 1. Now

       Σ_i q_i p_ij = Σ_i ( lim_n p_ki^(n) ) p_ij ≤ lim inf_n Σ_i p_ki^(n) p_ij = lim inf_n p_kj^(n+1) = q_j

again by Fatou's lemma. Thus Σ_i q_i p_ij ≤ q_j for all j. If the inequality were strict for some j, then summing over j would give

       Σ_j q_j > Σ_j Σ_i q_i p_ij = Σ_i q_i Σ_j p_ij = Σ_i q_i = Σ_j q_j
which is a contradiction. This proves (a). Now if Q = (q_i, i ∈ S), then by (a), QΠ = Q; hence, by induction, QΠ^n = Q, that is, Σ_i q_i p_ij^(n) = q_j. Thus

       q_j = lim_n Σ_i q_i p_ij^(n) = Σ_i q_i lim_n p_ij^(n) = ( Σ_i q_i ) q_j

by the dominated convergence theorem. Hence q_j = (Σ_i q_i) q_j, proving (b). Finally, if {v_i} is a stationary distribution, then Σ_i v_i p_ij^(n) = v_j. Let n → ∞ to obtain Σ_i v_i q_j = v_j, so that q_j = v_j. Consequently, if a stationary distribution exists, it is unique and coincides with {q_j}. Therefore no stationary distribution can exist if all q_j = 0; if Σ_j q_j = 1, then, by (a), {q_j} is stationary and the result is established.

The numbers v_j, j ∈ S, are said to form a steady-state distribution iff lim_{n→∞} p_ij^(n) = v_j for all i, j ∈ S, and Σ_{j∈S} v_j = 1. Thus we require that limiting probabilities exist (independent of the initial state) and form a probability distribution.

In the case of a finite chain, a set of limiting probabilities that are independent of the initial state must form a steady-state distribution; that is, the case in which all q_j = 0 cannot occur in Theorem 1. For Σ_{j∈S} p_ij^(n) = 1 for all i ∈ S; let n → ∞ to obtain, since S is finite, Σ_{j∈S} q_j = 1. If the chain is infinite, this result is no longer valid. For example, if all states are transient, then p_ij^(n) → 0 for all i, j.

If {q_j} is a steady-state distribution, {q_j} is the unique stationary distribution, by Theorem 1. However, a chain can have a unique stationary distribution without having a steady-state distribution, in fact without having limiting probabilities. We give examples later in the section. We shall establish conditions under which a steady-state distribution exists after we discuss the existence and uniqueness of stationary distributions. Let N be the number of positive recurrent classes.

CASE 1. N = 0. Then all states are transient or recurrent null. Hence p_ij^(n) → 0 for all i, j by Theorem 2 of Section 7.4, so that, by Theorem 1 of this section, there is no stationary distribution.

CASE 2. N = 1. Let C be the unique positive recurrent class. If C is aperiodic, then, by Theorem 2 of Section 7.4, p_ij^(n) → 1/μ_j, i, j ∈ C.
If j ∉ C, then j is transient or recurrent null, so that p_ij^(n) → 0 for all i. By Theorem 1,
if we assign v_j = 1/μ_j, j ∈ C, and v_j = 0, j ∉ C, then {v_j} is the unique stationary distribution, and p_ij^(n) → v_j for all i, j. Now assume C periodic, with period d > 1. Let D be a cyclically moving subclass of C. The states of D are recurrent and aperiodic relative to the transition matrix Π^d. By Theorem 2 of Section 7.4, p_ij^(nd) → d/μ_j, i, j ∈ D; hence {d/μ_j, j ∈ D} is the unique stationary distribution for D relative to Π^d (in particular, Σ_{j∈D} 1/μ_j = 1/d). It follows that v_j = 1/μ_j, j ∈ C, v_j = 0, j ∉ C, gives the unique stationary distribution for the original chain (see Problem 1).

CASE 3. N ≥ 2. There is a unique stationary distribution for each positive recurrent class, hence uncountably many stationary distributions for the original chain. For if V_1Π = V_1 and V_2Π = V_2, then, if a_1, a_2 ≥ 0, a_1 + a_2 = 1, we have

       (a_1 V_1 + a_2 V_2)Π = a_1 V_1 + a_2 V_2

In summary, there is a unique stationary distribution if and only if there is exactly one positive recurrent class.
Finally, we have the basic theorem concerning steady-state distributions.

Theorem 2.
(a) If there is a steady-state distribution, there is exactly one positive recurrent class C, and this class is aperiodic; also, f_ij = 1 for all j ∈ C and all i ∈ S.
(b) Conversely, if there is exactly one positive recurrent class C, which is aperiodic, and, in addition, f_ij = 1 for all j ∈ C and all i ∈ S, then a steady-state distribution exists.

PROOF. (a) Let {v_j} be a steady-state distribution. By Theorem 1, {v_j} is the unique stationary distribution; hence there must be exactly one positive recurrent class C. Suppose that C has period d > 1, and let i ∈ a cyclically moving subclass C_0, j ∈ C_1. Then p_ij^(nd+1) → d/μ_j by Theorem 2 of Section 7.4, and p_ij^(nd) = 0 for all n. Since d/μ_j > 0, p_ij^(n) has no limit as n → ∞, contradicting the hypothesis. If j ∈ C and i ∈ S, then by Theorem 2(b) of Section 7.4, p_ij^(n) → f_ij/μ_j; hence v_j = f_ij/μ_j. Since v_j does not depend on i, we have f_ij = f_jj = 1.

(b) By Theorem 2(b) of Section 7.4,

       p_ij^(n) → f_ij/μ_j    for all i, if j ∈ C
       p_ij^(n) → 0           for all i, if j ∉ C
since in this case j is transient or recurrent null. Therefore, if f_ij = 1 for all i ∈ S and j ∈ C, the limit v_j is independent of i. Since C is positive, v_j > 0 for j ∈ C; hence, by Theorem 1, Σ_j v_j = 1 and the result follows. [Note that if a steady-state distribution exists, there are no recurrent null classes (or closed transient classes). For if D is such a class and i ∈ D, then since D is closed, f_ij = 0 for all j ∈ C, a contradiction. Thus in Theorem 2, the statement "there is exactly one positive recurrent class, which is aperiodic" may be replaced by "there is exactly one recurrent class, which is positive and aperiodic".]

COROLLARY. Consider a finite chain.
(a) A steady-state distribution exists iff there is exactly one closed equivalence class C, and C is aperiodic.
(b) There is a unique stationary distribution iff there is exactly one closed equivalence class.

PROOF. The result follows from Theorem 2, with the aid of Theorem 8c of Section 7.3, Theorem 2e of Section 7.4, and the fact that if B is a finite set of transient states, the probability of remaining forever in B is 0 (see the remark after Theorem 4 of Section 7.3). [It is not difficult to verify that a finite chain has at least one closed equivalence class. Thus a finite chain always has at least one stationary distribution.]

REMARK. Consider a finite chain with exactly one closed equivalence class, which is periodic. Then, by the above corollary, there is a unique stationary distribution but no steady-state distribution, in fact no limiting probabilities (see the argument of Theorem 2a). For example, consider the chain with transition matrix
       Π = [ 0  1 ]
           [ 1  0 ]

The unique stationary distribution is (1/2, 1/2), but

       Π^n = [ 1  0 ]    (n even),        Π^n = [ 0  1 ]    (n odd)
             [ 0  1 ]                           [ 1  0 ]

and therefore Π^n does not approach a limit.

Usually the easiest way to find a steady-state distribution {v_j}, if it exists,
is to use the fact that a steady-state distribution must be the unique stationary distribution. Thus we solve the equations

       Σ_{i∈S} v_i p_ij = v_j,    j ∈ S

under the conditions that all v_i ≥ 0 and Σ_{i∈S} v_i = 1.
PROBLEMS

1. Show that if there is a single positive recurrent class C, then {1/μ_j, j ∈ C}, with probability 0 assigned to states outside C, gives the unique stationary distribution for the chain. HINT: p_ij^(nd) = Σ_{k∈C} p_ik^(nd−1) p_kj, i ∈ C. Use Fatou's lemma to show that 1/μ_j ≥ Σ_{k∈C} (1/μ_k) p_kj. Then use the fact that Σ_{j∈C} 1/μ_j = 1.
2. (a) If, for some N, Π^N has a column bounded away from 0, that is, if for some j_0 and some δ > 0 we have p_{ij_0}^(N) ≥ δ > 0 for all i, show that there is exactly one recurrent class (namely, the class of j_0); this class is positive and aperiodic.
   (b) In the case of a finite chain, show that a steady-state distribution exists iff Π^N has a positive column for some N.
3. Classify the states of the following Markov chains. Discuss the limiting behavior of the transition probabilities and the existence of steady-state and stationary distributions.
   1. Simple random walk with no barriers.
   2. Simple random walk with absorbing barrier at 0.
   3. Simple random walk with absorbing barriers at 0 and b.
   4. Simple random walk with reflecting barrier at 0.
   5. Simple random walk with reflecting barriers at 0 and l + 1.
   6. The chain of Example 2, Section 7.1.
   7. The chain of Problem 4c, Section 7.3.
   8. The chain of Problem 4d, Section 7.3.
   9. A sequence of independent random variables (Problem 5, Section 7.3).
   10. The chain with transition matrix

                 1    2    3    4    5    6    7
            1 [  0    0    1    0    0    0    0  ]
            2 [  0    0    0    1    0    0    0  ]
            3 [  0    0    0    0   3/4  1/4   0  ]
       Π =  4 [  0    0    0    0    0    0    1  ]
            5 [  0    1    0    0    0    0    0  ]
            6 [  0    1    0    0    0    0    0  ]
            7 [ 1/2  1/2   0    0    0    0    0  ]
Introduction to Statistics

8.1 STATISTICAL DECISIONS

Suppose that the number of telephone calls made per day at a given exchange is known to have a Poisson distribution with parameter θ, but θ itself is unknown. In order to obtain some information about θ, we observe the number of calls over a certain period of time, and then try to come to a decision about θ. The nature of the decision will depend on the type of information desired. For example, it may be that extra equipment will be needed if θ > θ_0, but not if θ ≤ θ_0. In this case we make one of two possible decisions: we decide either that θ > θ_0 or that θ ≤ θ_0. Alternatively, we may want to estimate the actual value of θ in order to know how much equipment to install. In this case the decision results in a number θ̂, which we hope is as close to θ as possible. In general, an incorrect decision will result in a loss, which may be measurable in precise terms, as in the case of the cost of unnecessary equipment, but which also may have intangible components. For example, it may be difficult to assign a numerical value to losses due to customer complaints, unfavorable publicity, or government investigations.

Decision problems such as the one just discussed may be formulated mathematically by means of a statistical decision model. The ingredients of the model are as follows.

1. N, the set of states of nature.
2. A random variable (or random vector) R, the observable, whose distribution function F_θ depends on the particular θ ∈ N. We may imagine that "nature" chooses the parameter θ ∈ N (without revealing the result to us); we then observe the value of a random variable R with distribution function F_θ. In the above example, N is the set of positive real numbers, and F_θ is the distribution function of a Poisson random variable with parameter θ.

3. A, the set of possible actions. In the above example, since we are trying to determine the value of θ, A = N = (0, ∞).

4. A loss function (or cost function) L(θ, a), θ ∈ N, a ∈ A; L(θ, a) represents our loss when the true state of nature is θ and we take action a.

The process by which we arrive at a decision may be described by means of a decision function, defined as follows. Let E be the range of the observable R (e.g., E¹ if R is a random variable, Eⁿ if R is an n-dimensional random vector). A nonrandomized decision function is a function φ from E to A. Thus, if R takes the value x, we take action φ(x). φ is to be chosen so as to minimize the loss, in some sense. Nonrandomized decision functions are not adequate to describe all aspects of the decision-making process. For example, under certain conditions we may flip a coin or use some other chance device to determine the appropriate action. (If you are a statistician employed by a company, it is best to do this out of sight of the customer.) The general concept of a decision function is that of a mapping assigning to each x ∈ E a probability measure P_x on an appropriate sigma field of subsets of A. Thus P_x(B) is the probability of taking an action in the set B when R = x is observed. A nonrandomized decision function may be regarded as a decision function with each P_x concentrated on a single point; that is, for each x we have P_x{a} = 1 for some a (= φ(x)) in A.
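The ingredients above can be assembled into a small numeric sketch. Everything concrete here (the value of θ₀, the costs, the threshold rule φ) is an illustrative assumption, not from the text: a state of nature θ, an observable R with a Poisson distribution, a two-point action set, a loss function, and a nonrandomized decision function φ whose quality is summarized by its expected loss.

```python
import math
import random

# Hypothetical instance of the decision model: nature's state theta,
# observable R ~ Poisson(theta), actions {0, 1} with 1 = "theta > theta0",
# and a loss function L(theta, a). All numbers are illustrative.
theta0 = 5.0

def loss(theta, action):
    # cost 1 for wrongly deciding "theta > theta0", cost 2 for the reverse
    if action == 1 and theta <= theta0:
        return 1.0
    if action == 0 and theta > theta0:
        return 2.0
    return 0.0

def poisson_sample(theta, rng):
    # inverse-transform sampling of a Poisson(theta) variate
    u, k = rng.random(), 0
    p = math.exp(-theta)
    cum = p
    while u > cum:
        k += 1
        p *= theta / k
        cum += p
    return k

def phi(x):
    # a nonrandomized decision function: take action 1 iff the count is large
    return 1 if x > theta0 else 0

def expected_loss(theta, trials=20000, seed=0):
    # Monte Carlo estimate of E_theta[ L(theta, phi(R)) ]
    rng = random.Random(seed)
    return sum(loss(theta, phi(poisson_sample(theta, rng)))
               for _ in range(trials)) / trials

risk_small = expected_loss(3.0)   # true state below theta0
risk_large = expected_loss(8.0)   # true state above theta0
```

The two expected losses play the role of the risk function evaluated at two states of nature; a decision function is judged by the whole collection of such numbers.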
We shall concentrate on the two most important special cases of the statistical decision problem, hypothesis testing and estimation. A typical physical situation in which decisions of this type occur is the problem of signal detection. The input to a radar receiver at a particular instant of time may be regarded as a random variable R with density f_θ, where θ is related to the signal strength. In the simplest model, R = θ + R', where R' (the noise) is a random variable with a specified density, and θ is a fixed but unknown constant determined by the strength of the signal. We may be interested in distinguishing between two conditions: the absence of a target (θ = θ₀) versus its presence (θ = θ₁); this is an example of a hypothesis testing problem. Alternatively, we may know that a signal is present and wish to estimate its strength. Thus, after observing R, we record a number that we hope is close to the true value of θ; this is an example of a problem in estimation.
As another example, suppose that θ is the (unknown) percentage of defective components produced on an assembly line. We inspect n components (i.e., we observe R₁, …, Rₙ, where Rᵢ = 1 if component i is defective, Rᵢ = 0 if component i is acceptable) and then try to say something about θ. We may be trying to distinguish between the two conditions θ ≤ θ₀ and θ > θ₀ (hypothesis testing), or we may be trying to come as close as possible to the true value of θ (estimation). In working the specific examples in the chapter, the table of common density and probability functions and their properties given at the end of the book may be helpful.

8.2 HYPOTHESIS TESTING
Consider again the statistical decision model of the preceding section. Suppose that H₀ and H₁ are disjoint nonempty subsets of N whose union is N, and our objective is to determine whether the true state of nature θ belongs to H₀ or to H₁. (In the example on the telephone exchange, H₀ might correspond to θ ≤ θ₀, and H₁ to θ > θ₀.) Thus our ultimate decision must be either "θ ∈ H₀" or "θ ∈ H₁," so that the action space A contains only two points, labeled 0 and 1 for convenience. The above decision problem is called a hypothesis-testing problem; H₀ is called the null hypothesis, and H₁ the alternative. H₀ is said to be simple iff it contains only one element; otherwise H₀ is said to be composite, and similarly for H₁. To take action 1 is to reject the null hypothesis H₀; to take action 0 is to accept H₀. We first consider the case of simple hypothesis versus simple alternative. Here H₀ and H₁ each contain one element, say θ₀ and θ₁. For the sake of definiteness, we assume that under H₀, R is absolutely continuous with density f₀, and under H₁, R is absolutely continuous with density f₁. (The results of this section will also apply to the discrete case upon replacing integrals by sums.) Thus the problem essentially comes down to deciding, after observing R, whether R has density f₀ or f₁. A decision function may be specified by giving a (Borel measurable) function φ from E to [0, 1], with φ(x) interpreted as the probability of rejecting H₀ when x is observed. Thus, if φ(x) = 1, we reject H₀; if φ(x) = 0, we accept H₀; and if φ(x) = a, 0 < a < 1, we toss a coin with probability a of heads: if the coin comes up heads, we reject H₀; if tails, we accept H₀. The set {x : φ(x) = 1} is called the rejection region or the critical region; the function φ is called a test. The decision we arrive at may be in error in two possible ways.
A type 1 error occurs if we reject H₀ when it is in fact true, and a type 2 error occurs if H₀ is accepted when it is false, that is, when H₁ is true. Now if H₀ is true and we observe R = x, an error will be made if H₀ is rejected; this happens with probability φ(x). Thus the probability of a type 1 error is

$$\alpha = \int_{-\infty}^{\infty} \varphi(x)\, f_0(x)\, dx \tag{8.2.1}$$

Similarly, the probability of a type 2 error is

$$\beta = \int_{-\infty}^{\infty} (1 - \varphi(x))\, f_1(x)\, dx \tag{8.2.2}$$
Note that α is the expectation of φ(R) under H₀, sometimes written E₀φ; similarly, β = 1 − E₁φ. It would be desirable to choose φ so that both α and β will be small, but, as we shall see, a decrease in one of the two error probabilities usually results in an increase in the other. For example, if we ignore the observed data and always accept H₀, then α = 0 but β = 1. There is no unique answer to the question of what is a good test; we shall consider several possibilities. First, suppose that there is a nonnegative cost cᵢ associated with a type i error, i = 1, 2. (For simplicity, assume that the cost of a correct decision is 0.) Suppose also that we know the probability p that the null hypothesis will be true. (p is called the a priori probability of H₀. In many situations it will be difficult to estimate; for example, in a radar reception problem, H₀ might correspond to no signal being present.) Let φ be a test with error probabilities α(φ) and β(φ). The over-all average cost associated with φ is

$$B(\varphi) = p c_1 \alpha(\varphi) + (1-p) c_2 \beta(\varphi) \tag{8.2.3}$$

B(φ) is called the Bayes risk associated with φ; a test that minimizes B(φ) is called a Bayes test corresponding to the given p, c₁, c₂, f₀, and f₁.

The Bayes solution can be computed in a straightforward way. From (8.2.1)–(8.2.3) we have

$$B(\varphi) = \int_{-\infty}^{\infty} \left[ p c_1 \varphi(x) f_0(x) + (1-p) c_2 (1 - \varphi(x)) f_1(x) \right] dx = \int_{-\infty}^{\infty} \varphi(x) \left[ p c_1 f_0(x) - (1-p) c_2 f_1(x) \right] dx + (1-p) c_2 \tag{8.2.4}$$
Now if we wish to minimize ∫_S φ(x)g(x) dx and g(x) < 0 on S, we can do no better than to take φ(x) = 1 for all x in S; if g(x) > 0 on S, we should take φ(x) = 0 for all x in S; if g(x) = 0 on S, φ(x) may be chosen arbitrarily. In this case g(x) = pc₁f₀(x) − (1 − p)c₂f₁(x), and the Bayes solution may therefore be given as follows.
Let L(x) = f₁(x)/f₀(x).

If L(x) > pc₁/(1 − p)c₂, take φ(x) = 1; that is, reject H₀.
If L(x) < pc₁/(1 − p)c₂, take φ(x) = 0; that is, accept H₀.
If L(x) = pc₁/(1 − p)c₂, take φ(x) = anything.

L is called the likelihood ratio, and a test φ such that for some constant λ, 0 ≤ λ ≤ ∞, φ(x) = 1 when L(x) > λ and φ(x) = 0 when L(x) < λ, is called a likelihood ratio test, abbreviated LRT. To avoid ambiguity, if f₁(x) > 0 and f₀(x) = 0, we take L(x) = ∞. The set on which f₁(x) = f₀(x) = 0 may be ignored, since it will have probability 0 under both H₀ and H₁. Also, if we observe an x for which f₁(x) > 0 and f₀(x) = 0, it must be associated with H₁, so that we should take φ(x) = 1. It will be convenient to build this requirement into the definition of a likelihood ratio test: if L(x) = ∞ we assume that φ(x) = 1. In fact, likelihood ratio tests are completely adequate to describe the problem of testing a simple hypothesis versus a simple alternative. This assertion will be justified by the sequence of theorems to follow. From now on, the notation P_θ(B) will indicate the probability that the value of R will belong to the set B when the true state of nature is θ.

Theorem 1. For any α, 0 ≤ α ≤ 1, there is a likelihood ratio test whose probability of type 1 error is α.
PROOF. If α = 0, the test given by φ(x) = 1 if L(x) = ∞; φ(x) = 0 if L(x) < ∞, is the desired LRT, so assume α > 0. Now G(y) = P_{θ₀}{x : L(x) ≤ y}, −∞ < y < ∞, is a distribution function [of the random variable L(R); notice that L(R) ≥ 0, and L(R) cannot be infinite under H₀]. Thus either we can find λ, 0 ≤ λ < ∞, such that G(λ) = 1 − α, or else G jumps through 1 − α; that is, for some λ we have G(λ−) ≤ 1 − α ≤ G(λ) (see Figure 8.2.1). Define

$$\varphi(x) = \begin{cases} 1 & \text{if } L(x) > \lambda \\ 0 & \text{if } L(x) < \lambda \\ a & \text{if } L(x) = \lambda \end{cases}$$

where a = [G(λ) − (1 − α)]/[G(λ) − G(λ−)] if G(λ) > G(λ−), a = an arbitrary number in [0, 1] if G(λ) = G(λ−). Then the probability of a type 1 error is

$$P_{\theta_0}\{x : L(x) > \lambda\} + a P_{\theta_0}\{x : L(x) = \lambda\} = 1 - G(\lambda) + a[G(\lambda) - G(\lambda-)] = \alpha$$

as desired.
FIGURE 8.2.1 [The distribution function G(y), showing the level 1 − α.]
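The construction in Theorem 1 is easy to carry out when L(R) has a continuous distribution under H₀, since then no randomization is needed. The sketch below is illustrative; the exponential densities are an assumption chosen to match Problem 1 of this section, and the closed-form inversion of G is specific to them.

```python
import math

# Illustrative densities (as in Problem 1 of this section):
#   H0: f0(x) = e^{-x}, x > 0      H1: f1(x) = 2 e^{-2x}, x > 0
# Then L(x) = f1(x)/f0(x) = 2 e^{-x}, strictly decreasing in x,
# so L(x) > lam  <=>  x < ln(2/lam), and no randomization is required.

def size_of_lrt(lam):
    # alpha = P0{L(R) > lam} = P0{R < ln(2/lam)} = 1 - exp(-ln(2/lam))
    if lam >= 2:
        return 0.0
    x_cut = math.log(2.0 / lam)
    return 1.0 - math.exp(-x_cut)

def lrt_for_alpha(alpha):
    # invert G: alpha = 1 - lam/2, so lam = 2(1 - alpha)
    return 2.0 * (1.0 - alpha)

lam = lrt_for_alpha(0.05)
# the resulting LRT really has type 1 error .05
assert abs(size_of_lrt(lam) - 0.05) < 1e-12
```

When G jumps through 1 − α (a discrete observable), the same recipe applies with the randomization probability a of Theorem 1; Example 2 below works such a case by hand.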
A test is said to be at level α₀ if its probability α of type 1 error is ≤ α₀. α itself is called the size of the test, and 1 − β, the probability of rejecting the null hypothesis when it is false, is called the power of the test. The following result, known as the Neyman-Pearson lemma, is the fundamental theorem of hypothesis testing.

Theorem 2. Let φ_λ be a LRT with parameter λ and error probabilities α_λ and β_λ. Let φ be an arbitrary test with error probabilities α and β; if α ≤ α_λ then β ≥ β_λ. In other words, the LRT has maximum power among all tests at level α_λ.
We give two proofs.

FIRST PROOF. Consider the Bayes problem with costs c₁ = c₂ = 1, and set λ = pc₁/(1 − p)c₂ = p/(1 − p). Assuming first that λ < ∞, we have p = λ/(1 + λ). Thus φ_λ is the Bayes solution when the a priori probability is p = λ/(1 + λ). If β < β_λ, we compute the Bayes risk [see (8.2.3)] for p = λ/(1 + λ), using the test φ:

$$B(\varphi) = p\alpha + (1 - p)\beta$$

But α ≤ α_λ by hypothesis, while β < β_λ and p < 1 by assumption. Thus B(φ) < B(φ_λ), contradicting the fact that φ_λ is the Bayes solution. It remains to consider the case λ = ∞. Then we must have φ_λ(x) = 1 if L(x) = ∞, φ_λ(x) = 0 if L(x) < ∞. Then α_λ = 0, since L(R) is never infinite under H₀; consequently α = 0, so that, by (8.2.1), φ(x)f₀(x) = 0 [strictly speaking, φ(x)f₀(x) = 0 except possibly on a set of Lebesgue measure 0]. By (8.2.2),

$$\beta = \int_{\{x : L(x) < \infty\}} (1 - \varphi(x)) f_1(x)\, dx + \int_{\{x : L(x) = \infty\}} (1 - \varphi(x)) f_1(x)\, dx$$

If L(x) < ∞, then f₀(x) > 0; hence φ(x) = 0. Thus, in order to minimize β, we must take φ(x) = 1 when L(x) = ∞. But this says that β ≥ β_λ, completing the proof.

SECOND PROOF. First assume λ < ∞. We claim that [φ_λ(x) − φ(x)][f₁(x) − λf₀(x)] ≥ 0 for all x. For if f₁(x) > λf₀(x), then φ_λ(x) = 1 ≥ φ(x), and if f₁(x) < λf₀(x), then φ_λ(x) = 0 ≤ φ(x). Thus

$$\int_{-\infty}^{\infty} [\varphi_\lambda(x) - \varphi(x)][f_1(x) - \lambda f_0(x)]\, dx \ge 0$$

By (8.2.1) and (8.2.2),

$$1 - \beta_\lambda - (1 - \beta) - \lambda\alpha_\lambda + \lambda\alpha \ge 0 \quad\text{or}\quad \beta - \beta_\lambda \ge \lambda(\alpha_\lambda - \alpha) \ge 0$$

The case λ = ∞ is handled just as in the first proof.
If we wish to construct a test that is best at level α in the sense of maximum power, we find, by Theorem 1, a LRT of size α. By Theorem 2, the test has maximum power among all tests at level α. We shall illustrate the procedure with examples and problems later in the section. Finally, we show that no matter what criterion the statistician adopts in defining a good test, he can restrict himself to the class of likelihood ratio tests. A test φ with error probabilities α and β is said to be inadmissible iff there is a test φ' with error probabilities α' and β', with α' ≤ α, β' ≤ β, and either α' < α or β' < β. (In this case we say that φ' is better than φ.) Of course, φ is admissible iff it is not inadmissible.

Theorem 3. Every LRT is admissible.

PROOF. Let φ_λ be a LRT with parameter λ and error probabilities α_λ and β_λ, and φ an arbitrary test with error probabilities α and β. We have seen that if α ≤ α_λ, then β ≥ β_λ. But the Neyman-Pearson lemma is symmetric in H₀ and H₁. In other words, if we relabel H₁ as the null hypothesis and H₀ as the alternative, Theorem 2 states that if β ≤ β_λ, then α ≥ α_λ; the result follows.

Thus no test can be better than a LRT. In fact, if φ is any test, then there is a LRT φ_λ that is as good as φ; that is, α_λ ≤ α and β_λ ≤ β. For by Theorem 1 there is a LRT φ_λ with α_λ = α, and by Theorem 2, β_λ ≤ β. This argument establishes the following result, essentially a converse to Theorem 3.
Theorem 4. If φ is an admissible test, there is a LRT with exactly the same error probabilities.

PROOF. As above, we find a LRT φ_λ with α_λ = α and β_λ ≤ β; since φ is admissible, we must have β_λ = β.
Example 1. Suppose that under H₀, R is uniformly distributed between 0 and 1, and under H₁, R has density 3x², 0 ≤ x ≤ 1. For short we write

H₀: f₀(x) = 1, 0 ≤ x ≤ 1
H₁: f₁(x) = 3x², 0 ≤ x ≤ 1

We are going to find the risk set S, that is, the set of points (α(φ), β(φ)) where φ ranges over all possible tests. [The individual points (α(φ), β(φ)) are called risk points.] We are also going to find the set S_A of admissible risk points, that is, the set of risk points corresponding to admissible tests. By Theorems 3 and 4, S_A is the set of risk points corresponding to LRTs. First we notice two general properties of S.

1. S is convex; that is, if Q₁ and Q₂ belong to S, so do all points on the line segment joining Q₁ to Q₂. In other words, (1 − a)Q₁ + aQ₂ ∈ S for all a ∈ [0, 1]. For if Q₁ = (α(φ₁), β(φ₁)), Q₂ = (α(φ₂), β(φ₂)) and 0 ≤ a ≤ 1, let φ = (1 − a)φ₁ + aφ₂. Then φ is a test, and by (8.2.1) and (8.2.2), α(φ) = (1 − a)α(φ₁) + aα(φ₂), β(φ) = (1 − a)β(φ₁) + aβ(φ₂). If Q = (α(φ), β(φ)), then Q ∈ S, since φ is a test, and Q = (1 − a)Q₁ + aQ₂.

2. S is symmetric about (1/2, 1/2); that is, if |ε|, |δ| ≤ 1/2 and (1/2 − ε, 1/2 − δ) ∈ S, then (1/2 + ε, 1/2 + δ) ∈ S. Equivalently, (α, β) ∈ S implies (1 − α, 1 − β) ∈ S. For if (α(φ), β(φ)) ∈ S, let φ' = 1 − φ; then φ' is a test and α(φ') = 1 − α(φ), β(φ') = 1 − β(φ).

To return to the present example, we have L(x) = 3x², 0 ≤ x ≤ 1. Thus the error probabilities for a LRT with parameter λ ≤ 3 are

$$\alpha = P_0\{x : L(x) > \lambda\} = P_0\left\{x : x > \left(\tfrac{\lambda}{3}\right)^{1/2}\right\} = 1 - \left(\tfrac{\lambda}{3}\right)^{1/2}$$

$$\beta = P_1\{x : L(x) < \lambda\} = P_1\left\{x : x < \left(\tfrac{\lambda}{3}\right)^{1/2}\right\} = \int_0^{(\lambda/3)^{1/2}} 3x^2\, dx = \left(\tfrac{\lambda}{3}\right)^{3/2} = (1 - \alpha)^3$$

(If λ > 3, then α = 0, β = 1.) Thus S_A = {(α, (1 − α)³) : 0 ≤ α ≤ 1}. Since no test can be better than a LRT, S_A is the lower boundary of the set S; hence, by symmetry, {(1 − α, 1 − (1 − α)³) : 0 ≤ α ≤ 1} = {(α, 1 − α³) : 0 ≤ α ≤ 1} is the upper boundary of S. Thus S must be {(α, β) : 0 ≤ α ≤ 1, (1 − α)³ ≤ β ≤ 1 − α³} (see Figure 8.2.2).

FIGURE 8.2.2 [The risk set S, with lower boundary β = (1 − α)³ and upper boundary β = 1 − α³.]

Various tests may now be computed without difficulty. We give some typical illustrations.

(a) Find a most powerful test at level .15. Set α = .15 = 1 − (λ/3)^{1/2}. Since L(x) > λ iff x > (λ/3)^{1/2}, the test is given by

$$\varphi(x) = \begin{cases} 1 & \text{if } x > .85 \\ 0 & \text{if } x < .85 \\ \text{anything} & \text{if } x = .85 \end{cases}$$

We have β = (1 − α)³ = (.85)³ = .614.

(b) Find a Bayes test corresponding to c₁ = 3/2, c₂ = 3, p = 3/4. This is a LRT with λ = pc₁/(1 − p)c₂ = 3/2; that is,

$$\varphi(x) = \begin{cases} 1 & \text{if } x > \left(\tfrac{\lambda}{3}\right)^{1/2} = \tfrac{\sqrt{2}}{2} \doteq .707 \\ 0 & \text{if } x < \tfrac{\sqrt{2}}{2} \\ \text{anything} & \text{if } x = \tfrac{\sqrt{2}}{2} \end{cases}$$

Thus α = 1 − (λ/3)^{1/2} = .293, β = (1 − α)³, and the Bayes risk may be computed using (8.2.3).
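The numbers in parts (a) and (b) follow directly from α = 1 − (λ/3)^{1/2} and β = (1 − α)³, and are easy to check by machine. The sketch below (the function names are mine, not the text's) recomputes both tests and the Bayes risk.

```python
import math

# Example 1: H0: f0(x) = 1 on [0,1];  H1: f1(x) = 3x^2 on [0,1];  L(x) = 3x^2.
# A LRT with parameter lam <= 3 rejects H0 when x > (lam/3)^(1/2).

def errors(lam):
    cut = math.sqrt(lam / 3.0)   # rejection threshold on x
    alpha = 1.0 - cut            # P0{x > cut} under the uniform density
    beta = cut ** 3              # integral of 3x^2 from 0 to cut
    return alpha, beta

# (a) most powerful test at level .15: threshold .85, so lam = 3(.85)^2
a_alpha, a_beta = errors(3 * 0.85 ** 2)

# (b) Bayes test with c1 = 3/2, c2 = 3, p = 3/4: lam = p c1 / ((1-p) c2)
lam = (0.75 * 1.5) / (0.25 * 3.0)
b_alpha, b_beta = errors(lam)
bayes_risk = 0.75 * 1.5 * b_alpha + 0.25 * 3.0 * b_beta   # (8.2.3)
```

Here `a_beta` reproduces (.85)³ = .614, `b_alpha` reproduces 1 − √2/2 ≈ .293, and `bayes_risk` evaluates (8.2.3) at the Bayes test.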
FIGURE 8.2.3 [Geometric Interpretation of the Bayes Solution: the line (9/8)α + (3/4)β = c is moved until it touches S_A at α = 1 − √2/2.]

Geometric Interpretation of Bayes Solution. The Bayes solution may be interpreted geometrically as follows. We are trying to find a test that minimizes the Bayes risk pc₁α + (1 − p)c₂β = (9/8)α + (3/4)β. If we vary c until the line (9/8)α + (3/4)β = c intersects S_A, we find the desired test (see Figure 8.2.3). Notice also that to find the Bayes solution we may differentiate (9/8)α + (3/4)(1 − α)³ and set the result equal to zero to obtain α = 1 − √2/2, as before.

(c) Find a minimax test, that is, a test that minimizes max(α, β). It is immediate from the definition of admissibility that an admissible test with constant risk (i.e., α = β) is minimax. Thus we set α = β = (1 − α)³, which yields α = .318 (approximately). Therefore (λ/3)^{1/2} = 1 − α = .682, and so we reject H₀ if x > .682 and accept H₀ if x < .682.

Example 2.
Let R be a discrete random variable taking on only the values 0, 1, 2, 3. Let the probability function of R under Hᵢ be pᵢ, i = 0, 1, where the pᵢ are as follows.

x:      0    1    2    3
p₀(x): .1   .2   .3   .4
p₁(x): .2   .1   .4   .3

The appropriate likelihood ratio here is L(x) = p₁(x)/p₀(x). Arranging the values of L(x) in increasing order, we have the following table.

x:     1     3     2     0
L(x): 1/2   3/4   4/3    2
We may therefore describe the LRT with parameter λ as follows.

LRT              Rejection Region   Acceptance Region   α     β
λ > 2            none               0, 1, 2, 3          0     1
4/3 < λ < 2      0                  1, 2, 3             .1    .8
3/4 < λ < 4/3    0, 2               1, 3                .4    .4
1/2 < λ < 3/4    0, 2, 3            1                   .8    .1
0 ≤ λ < 1/2      0, 1, 2, 3         none                1     0

Now assume λ = 3/4. Then we reject H₀ if x = 0 or 2, accept H₀ if x = 1, and if x = 3 we randomize, that is, reject H₀ with probability a, 0 ≤ a ≤ 1. Thus

α = p₀(0) + p₀(2) + ap₀(3) = .4 + .4a
β = p₁(1) + (1 − a)p₁(3) = .1 + .3(1 − a)

As a ranges over [0, 1], (α, β) traces out the line segment joining (.4, .4) to (.8, .1). In a similar fashion we calculate the error probabilities for λ = 1/2, 4/3, and 2. The admissible risk points are shown in Figure 8.2.4. We compute several tests.

(a) Find a most powerful test at level .25. Since .1 < .25 < .4, we have λ = 4/3. Thus we reject H₀ if x = 0, accept H₀ if x = 1 or 3, and reject H₀ with probability a if x = 2, where .1(1 − a) + .4a = .25, so that a = 1/2. Notice that β = .8(1 − a) + .4a = .6.

(b) Find a Bayes test with c₁ = c₂ = 1, p = .6. We have λ = pc₁/(1 − p)c₂ = 3/2. Thus we reject H₀ if x = 0 and accept H₀ otherwise. The error probabilities are α = .1, β = .8, and the Bayes risk is pc₁α + (1 − p)c₂β = .38.

(c) Find a minimax test. The only admissible test with α = β has α = β = .4, so that 3/4 ≤ λ ≤ 4/3. We reject H₀ when x = 0 or 2 and accept H₀ if x = 1 or 3.
FIGURE 8.2.4 Admissible Risk Points When R is Discrete.
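For a discrete observable, the whole risk computation is a finite enumeration. The sketch below (exact rational arithmetic is my choice, to avoid floating-point ties at L(x) = λ) recomputes the LRT table and the randomized level-.25 test of part (a).

```python
from fractions import Fraction as F

# Example 2: p0, p1 on the values 0, 1, 2, 3 (exact arithmetic).
p0 = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
p1 = {0: F(2, 10), 1: F(1, 10), 2: F(4, 10), 3: F(3, 10)}
L = {x: p1[x] / p0[x] for x in p0}     # likelihood ratio L(x) = p1(x)/p0(x)

def errors(lam, a=F(0)):
    # LRT: reject if L(x) > lam, accept if L(x) < lam,
    # reject with probability a when L(x) == lam
    alpha = sum(p0[x] * (1 if L[x] > lam else a if L[x] == lam else 0)
                for x in p0)
    beta = sum(p1[x] * (0 if L[x] > lam else (1 - a) if L[x] == lam else 1)
               for x in p1)
    return alpha, beta

# nonrandomized LRTs: lam strictly between successive values of L(x)
table = [errors(lam) for lam in (F(3), F(3, 2), F(1), F(3, 5), F(2, 5))]
# -> (0,1), (1/10, 4/5), (2/5, 2/5), (4/5, 1/10), (1, 0)

# part (a): level .25 test, lam = 4/3, randomize at x = 2 with a = 1/2
alpha_a, beta_a = errors(F(4, 3), a=F(1, 2))   # -> (1/4, 3/5)
```

The five points in `table` are exactly the risk points of the nonrandomized LRTs; the randomized test interpolates between (.1, .8) and (.4, .4).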
Example 3. Let R be normally distributed with mean θ and variance σ², where σ² is known. We wish to test the null hypothesis that θ = θ₀ against the alternative that θ = θ₁, and the test is to be based on n independent observations R₁, …, Rₙ of R. (Assume θ₀ < θ₁.) The appropriate likelihood ratio is

$$L(x_1, \ldots, x_n) = \frac{f_1(x_1, \ldots, x_n)}{f_0(x_1, \ldots, x_n)} = \frac{(2\pi\sigma^2)^{-n/2} \exp\left[-\sum_{k=1}^n (x_k - \theta_1)^2 / 2\sigma^2\right]}{(2\pi\sigma^2)^{-n/2} \exp\left[-\sum_{k=1}^n (x_k - \theta_0)^2 / 2\sigma^2\right]}$$

The condition L(x) ≥ λ is equivalent to ln L(x) ≥ ln λ; that is,

$$\sum_{k=1}^n 2(\theta_1 - \theta_0) x_k + n(\theta_0^2 - \theta_1^2) \ge 2\sigma^2 \ln \lambda$$

This is of the form Σₖ₌₁ⁿ xₖ ≥ c. Thus a LRT must be of the form

$$\varphi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{k=1}^n x_k > c \\ 0 & \text{if } \sum_{k=1}^n x_k < c \\ \text{anything} & \text{if } \sum_{k=1}^n x_k = c \end{cases} \tag{8.2.5}$$
Now R₁ + ⋯ + Rₙ is normal with mean nθ and variance nσ², so that the error probabilities are

$$\alpha = P_{\theta_0}\{R_1 + \cdots + R_n > c\} = P_{\theta_0}\left\{\frac{R_1 + \cdots + R_n - n\theta_0}{\sqrt{n}\,\sigma} > \frac{c - n\theta_0}{\sqrt{n}\,\sigma}\right\} = 1 - F^*\left(\frac{c - n\theta_0}{\sqrt{n}\,\sigma}\right)$$

where F* is the normal (0, 1) distribution function, and

$$\beta = P_{\theta_1}\{R_1 + \cdots + R_n < c\} = F^*\left(\frac{c - n\theta_1}{\sqrt{n}\,\sigma}\right)$$

FIGURE 8.2.5 Admissible Risk Points When R is Normal.
Thus we have parametric equations for α and β with c as parameter, −∞ < c < ∞. The admissible risk points are sketched in Figure 8.2.5. Suppose that we want a LRT of size α. If N_α is the number such that 1 − F*(N_α) = α, then (c − nθ₀)/√n σ = N_α, so that c = nθ₀ + √n σN_α. We now apply the results to a problem in testing a simple hypothesis versus a composite alternative. Again let R be normal (θ, σ²), and take H₀: θ = θ₀, H₁: θ > θ₀. If we choose any particular θ₁ > θ₀ and test θ = θ₀ against θ = θ₁, the test described above is most powerful at level α. However, the test is completely specified by c, and c does not depend on θ₁. Thus, for any θ₁ > θ₀, the test has the highest power of any test at level α of θ = θ₀ versus θ = θ₁. Such a test is called a uniformly most powerful (UMP) level α test of θ = θ₀ versus θ > θ₀. We expect intuitively that the larger the separation between θ₀ and θ₁, the better the performance of the test in distinguishing between the two possibilities. This may be verified by considering the power function Q, defined by

Q(θ) = E_θφ = the probability of rejecting H₀ when the true state of nature is θ

Here Q(θ) = 1 − F*((c − nθ)/√n σ); thus Q(θ) increases with θ. Now if H₀: θ = θ₀, H₁: θ = θ₁, where θ₁ < θ₀, the same technique as
above shows that a size α LRT is of the form

$$\varphi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{k=1}^n x_k < c \\ 0 & \text{if } \sum_{k=1}^n x_k > c \\ \text{anything} & \text{if } \sum_{k=1}^n x_k = c \end{cases}$$

where

$$\alpha = F^*\left(\frac{c - n\theta_0}{\sqrt{n}\,\sigma}\right), \qquad \beta = 1 - F^*\left(\frac{c - n\theta_1}{\sqrt{n}\,\sigma}\right), \qquad c = n\theta_0 + \sqrt{n}\,\sigma N_{1-\alpha}$$

Again, the test is UMP at level α for θ = θ₀ versus θ < θ₀, with power function

$$Q'(\theta) = F^*\left(\frac{c - n\theta}{\sqrt{n}\,\sigma}\right)$$

which increases as θ decreases (see Figure 8.2.6). The above discussion suggests that there can be no UMP level α test of θ = θ₀ versus θ ≠ θ₀. For any such test φ must have power function Q(θ) for θ > θ₀, and Q'(θ) for θ < θ₀. But the power function of φ is given by

$$E_\theta \varphi = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \varphi(x_1, \ldots, x_n)\, f_\theta(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$$

where f_θ is the joint density of n independent normal random variables with mean θ and variance σ². It can be shown that this is differentiable for all θ (the derivative can be taken under the integral sign). But a function that is Q(θ) for θ > θ₀ and Q'(θ) for θ < θ₀ cannot be differentiable at θ₀.
FIGURE 8.2.6 Power Functions.
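These formulas lend themselves to direct computation. The sketch below evaluates the power function Q(θ) = 1 − F*((c − nθ)/(√n σ)) of the one-sided test and finds the smallest n meeting prescribed error bounds. The specific numbers (θ₁ = θ₀ + σ, α ≤ .05, β ≤ .03, in the spirit of Problem 4) are illustrative assumptions.

```python
import math

def F_star(z):
    # standard normal distribution function F*
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def N_of(alpha, lo=-10.0, hi=10.0):
    # N_alpha with 1 - F*(N_alpha) = alpha, found by bisection
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 1.0 - F_star(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power(theta, n, sigma, c):
    # Q(theta) = P_theta{R1 + ... + Rn > c}
    return 1.0 - F_star((c - n * theta) / (math.sqrt(n) * sigma))

def min_n(theta0, theta1, sigma, alpha, beta_max):
    # smallest n for which the size-alpha test of theta0 versus theta1 > theta0
    # has type 2 error at most beta_max
    n = 1
    while True:
        c = n * theta0 + math.sqrt(n) * sigma * N_of(alpha)
        if 1.0 - power(theta1, n, sigma, c) <= beta_max:
            return n
        n += 1

# illustrative: theta1 - theta0 = sigma, alpha <= .05, beta <= .03
n = min_n(theta0=0.0, theta1=1.0, sigma=1.0, alpha=0.05, beta_max=0.03)
```

Since √n ≥ N_.05 + N_.03 ≈ 1.645 + 1.881 here, the search returns n = 13, the smallest integer exceeding (N_.05 + N_.03)².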
In fact, the test φ with power function Q(θ) is UMP at level α for the composite hypothesis H₀: θ ≤ θ₀ versus the composite alternative H₁: θ > θ₀. Let us explain what this means. φ is said to be at level α for H₀ versus H₁ iff E_θφ ≤ α for all θ ≤ θ₀; φ is UMP at level α if for any test φ' at level α for H₀ versus H₁ we have E_θφ' ≤ E_θφ for all θ > θ₀. In the present case E_θφ = Q(θ) ≤ α for θ ≤ θ₀ by monotonicity of Q(θ), and E_θφ' ≤ E_θφ for θ > θ₀, since φ is UMP at level α for θ = θ₀ versus θ > θ₀. The underlying reason for the existence of uniformly most powerful tests is the following. If θ < θ', the likelihood ratio f_{θ'}(x)/f_θ(x) can be expressed as a nondecreasing function of t(x) [where, in this case, t(x) = x₁ + ⋯ + xₙ; see (8.2.5)]. Whenever this happens, the family of densities f_θ is said to have the monotone likelihood ratio (MLR) property. Suppose that the f_θ have the MLR property. Consider the following test of θ = θ₀ versus θ = θ₁, θ₁ > θ₀:

$$\varphi(x) = \begin{cases} 1 & \text{if } t(x) > c \\ 0 & \text{if } t(x) < c \\ a & \text{if } t(x) = c \end{cases}$$

where P_{θ₀}{x : t(x) > c} + aP_{θ₀}{x : t(x) = c} = α (notice that c does not depend on θ₁). Let λ be the value of the likelihood ratio when t(x) = c; then L(x) > λ implies t(x) > c, hence φ(x) = 1. Also L(x) < λ implies t(x) < c, so that φ(x) = 0. Thus φ is a LRT and hence is most powerful at level α. We may make the following observations.

1. φ is UMP at level α for θ = θ₀ versus θ > θ₀. This is immediate from the Neyman-Pearson lemma and the fact that c does not depend on the particular θ₁ > θ₀.

2. If θ₁ < θ₂, φ is the most powerful test at level α₁ = E_{θ₁}φ for θ = θ₁ versus θ = θ₂. Since φ is a LRT, the Neyman-Pearson lemma yields this result immediately.

3. If θ₁ < θ₂, then E_{θ₁}φ ≤ E_{θ₂}φ; that is, φ has a monotone nondecreasing power function. It follows, as in the earlier discussion, that φ is UMP at level α for θ ≤ θ₀ versus θ > θ₀. For by property 2, φ is most powerful at level α₁ = E_{θ₁}φ for θ = θ₁ versus θ = θ₂. But the test φ'(x) ≡ α₁ is also at level α₁; hence E_{θ₂}φ' ≤ E_{θ₂}φ, that is, α₁ = E_{θ₁}φ ≤ E_{θ₂}φ.

REMARK. Since the Neyman-Pearson lemma is symmetric in H₀ and H₁, if θ₁ < θ₂, then for all tests φ' with β(φ') ≤ β(φ), we have E_{θ₁}φ ≤ E_{θ₁}φ'.
We might say that φ is uniformly least powerful for θ < θ₀ among all tests whose type 2 error is ≤ β whenever θ > θ₀.
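The MLR construction above can be made concrete for Bernoulli trials, where t(x) = x₁ + ⋯ + xₙ is binomial (Problem 2b treats this family). The sketch below chooses c and the randomization probability a so that the size is exactly α; the particular numbers n = 3, θ₀ = 1/4, α = .1 are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def ump_bernoulli(n, theta0, alpha):
    # MLR construction: reject if t(x) > c, randomize with probability a
    # at t(x) = c, where P_{theta0}{t > c} + a P_{theta0}{t = c} = alpha
    tail = 0.0
    for c in range(n, -1, -1):
        pc = binom_pmf(c, n, theta0)
        if tail + pc > alpha:
            a = (alpha - tail) / pc
            return c, a
        tail += pc
    raise ValueError("alpha must lie in (0, 1)")

def power(c, a, n, theta):
    # E_theta(phi) = P_theta{t > c} + a P_theta{t = c}
    return (sum(binom_pmf(k, n, theta) for k in range(c + 1, n + 1))
            + a * binom_pmf(c, n, theta))

c, a = ump_bernoulli(3, 0.25, 0.1)   # c = 2, a = .6
size = power(c, a, 3, 0.25)          # exactly .1
```

Because c and a do not depend on the alternative, this single test is UMP against every θ > θ₀, and its power function power(c, a, n, θ) is nondecreasing in θ, in line with observations 1 and 3.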
PROBLEMS

1. Let H₀: f₀(x) = e⁻ˣ, x ≥ 0; H₁: f₁(x) = 2e⁻²ˣ, x ≥ 0.
(a) Find the risk set and the admissible risk points.
(b) Find a most powerful test at level .05.
(c) Find a minimax test.

2. Show that the following families have the MLR property, and thus UMP tests may be constructed as in the discussion of Example 3.
(a) p_θ = the joint probability function of n independent random variables, each Poisson with parameter θ.
(b) p_θ = the joint probability function of n independent random variables Rᵢ, where Rᵢ is Bernoulli with parameter θ; that is, P{Rᵢ = 1} = θ, P{Rᵢ = 0} = 1 − θ, 0 < θ < 1; notice that R₁ + ⋯ + Rₙ has the binomial distribution with parameters n and θ.
(c) Suppose that of N objects, θ are defective. If n objects are drawn without replacement, the probability that exactly x defective objects will be found in the sample is

$$p_\theta(x) = \frac{\binom{\theta}{x}\binom{N-\theta}{n-x}}{\binom{N}{n}}, \qquad x = 0, 1, \ldots, n \quad (\theta = 0, 1, \ldots, N)$$
This is the hypergeometric probability function; see Problem 7, Section 1.5.
(d) f_θ = the joint density of n independent normally distributed random variables with mean 0 and variance θ > 0.

3. It is desired to test the null hypothesis that a die is unbiased versus the alternative that the die is loaded, with faces 1 and 2 having probability 1/4 and faces 3, 4, 5, and 6 having probability 1/8.
(a) Sketch the set of admissible risk points.
(b) Find a most powerful test at level .1.
(c) Find a Bayes solution if the cost of a type 1 error is c₁, the cost of a type 2 error is 2c₁, and the null hypothesis has probability 3/4.

4. It is desired to test the null hypothesis that R is normal with mean θ₀ and known variance σ² versus the alternative that R is normal with mean θ₁ = θ₀ + σ and variance σ², on the basis of n independent observations of R. Find the minimum value of n such that α ≤ .05 and β ≤ .03.

5. Consider the problem of testing the null hypothesis that R is normal (0, θ₀) versus the alternative that R is normal (0, θ₁), θ₁ > θ₀ (notice that in this case a UMP test of θ ≤ θ₀ versus θ > θ₀ exists; see Problem 2d). Describe a most powerful test at level α and indicate how to find the minimum number of independent observations of R necessary to reduce the probability of a type 2 error below a given figure.

6. Let R₁, …, Rₙ be independent random variables, each uniformly distributed between 0 and θ, θ > 0. Show that the following test is UMP at level α for H₀: θ = θ₀ versus H₁: θ ≠ θ₀:

$$\varphi(x) = \begin{cases} 1 & \text{if } \max_i x_i \le \theta_0 \alpha^{1/n} \text{ or if } \max_i x_i > \theta_0 \\ 0 & \text{otherwise} \end{cases}$$

Find and sketch the power function of the test.

7. Consider the test of Problem 6 with H₀: θ = 1, H₁: θ = 2. Find the risk set and the set of admissible risk points.

8. Let R₁, R₂, R₃ be independent, each Bernoulli with parameter θ, 0 < θ < 1. Find the UMP test of size α = .1 of θ ≤ 1/4 versus θ > 1/4, and find the power function of the test.

9. Show that every admissible test is a Bayes test for some choice of costs c₁ and c₂ and a priori probability p. Conversely, show that every Bayes test with c₁ > 0, c₂ > 0, 0 < p < 1, is admissible. Give an example of an inadmissible Bayes test with c₁ > 0, c₂ > 0.

10. If φ is most powerful at level α₀ and β(φ) > 0, show that φ is actually of size α₀. Give a counterexample to the assertion if β(φ) = 0.

*11. Let φ be a most powerful test at level α. Show that for some constant λ we have φ(x) = 1 if L(x) > λ; φ(x) = 0 if L(x) < λ, except possibly for x in a set of Lebesgue measure 0.

12. A class C of tests is said to be essentially complete iff for any test φ there is a test φ' ∈ C such that φ' is as good as φ. Show that the following classes are essentially complete.
(a) The likelihood ratio tests.
(b) The admissible tests.
(c) The Bayes tests (i.e., considering all possible c₁, c₂, and p).

13. Give an example of tests φ₁ and φ₂ such that the statements "φ₁ is as good as φ₂" and "φ₂ is as good as φ₁" are both false.

14. Let R₁, R₂, … be independent random variables, each with density h_θ, and let H₀: θ = θ₀, H₁: θ = θ₁.
(a) If φₙ is a test based on n observations that minimizes the sum of the error probabilities, show that φₙ(x) = 1 if gₙ(x) = ∏ᵢ₌₁ⁿ [h_{θ₁}(xᵢ)/h_{θ₀}(xᵢ)] > 1, φₙ(x) = 0 if gₙ(x) < 1. Thus

$$\alpha_n + \beta_n = P_{\theta_0}\{x : g_n(x) > 1\} + P_{\theta_1}\{x : g_n(x) \le 1\}$$

(b) Let t(xᵢ) = [h_{θ₁}(xᵢ)/h_{θ₀}(xᵢ)]^{1/2}. Show that

$$P_{\theta_0}\{x : g_n(x) > 1\} \le \prod_{i=1}^n E_{\theta_0}\, t(R_i) = [E_{\theta_0}\, t(R_1)]^n$$

(c) Show that E_{θ₀} t(R₁) < 1; hence αₙ → 0 as n → ∞. A similar argument with θ₀ and θ₁ interchanged shows that βₙ → 0 as n → ∞, so that if enough observations are taken, both error probabilities can be made arbitrarily small.

8.3 ESTIMATION
Consider the statistical decision model of Section 8.1. Suppose that γ is a real-valued function on the set N of states of nature, and we wish to estimate γ(θ). If we observe R = x we must produce a number ψ(x) that we hope will be close to γ(θ). Thus the action space A is the set of reals, and a decision function may be specified by giving a (Borel measurable) function ψ from the range of R to E¹; such a ψ is called an estimate, and the above decision problem is called a problem of point estimation of a real parameter. Although the estimate ψ appears intrinsically nonrandomized, it is possible to introduce randomization without an essential change in the model. If R₁ is the observable, we let R₂ be a random variable independent of R₁ and θ, with an arbitrary distribution function F. Formally, assume P_θ{R₁ ∈ B₁, R₂ ∈ B₂} = P_θ{R₁ ∈ B₁}P{R₂ ∈ B₂}, where P{R₂ ∈ B₂} is determined by the distribution function F and is unaffected by θ. If R₁ = x and R₂ = y, we estimate γ(θ) by a number ψ(x, y). Thus we introduce randomization by enlarging the observable. There is no unique way of specifying a good estimate; we shall discuss several classes of estimates that have desirable properties. We first consider maximum likelihood estimates. Let f_θ be the density (or probability) function corresponding to the state of nature θ, and assume for simplicity that γ(θ) = θ. If R = x, the maximum likelihood estimate of θ is given by ψ(x) = θ̂ = the value of θ that maximizes f_θ(x). Thus (at least in the discrete case) the estimate is the state of nature that makes the particular observation most likely. In many cases the maximum likelihood estimate is easily computable.
► Example 1. Let R have the binomial distribution with parameters n and θ, 0 < θ < 1, so that

p_θ(x) = C(n, x) θ^x (1 - θ)^{n-x},  x = 0, 1, ..., n

To find θ̂ we may set (d/dθ) ln p_θ(x) = 0 to obtain

x/θ - (n - x)/(1 - θ) = 0,  or  θ̂ = x/n

Notice that R may be regarded as a sum of independent random variables R1, ..., Rn, where Ri is 1 with probability θ and 0 with probability 1 - θ. In terms of the Ri we have θ̂(R) = (R1 + ··· + Rn)/n, which converges in probability to E(Ri) = θ by the weak law of large numbers. Convergence in probability of the maximum likelihood estimate to the true parameter can be established under rather general conditions. ◄
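The computation in Example 1 is easy to check numerically. The following sketch (the sample values and the grid search are invented for illustration and are not part of the text) maximizes the binomial log-likelihood by brute force and compares the maximizer with x/n:

```python
import math

def binom_log_pmf(x, n, theta):
    """ln p_theta(x) = ln C(n, x) + x ln(theta) + (n - x) ln(1 - theta)."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(theta) + (n - x) * math.log(1 - theta))

def binom_mle(x, n, grid=100000):
    """Brute-force maximizer of the likelihood over a fine grid in (0, 1)."""
    return max((k / grid for k in range(1, grid)),
               key=lambda theta: binom_log_pmf(x, n, theta))

# n = 20 trials, x = 7 successes: the maximizer should be x/n = 0.35.
theta_hat = binom_mle(7, 20)
assert abs(theta_hat - 7 / 20) < 1e-3
```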
► Example 2. Let R1, ..., Rn be independent, normally distributed random variables with mean μ and variance σ². Find the maximum likelihood estimate of θ = (μ, σ²). (Here θ is a point in E² rather than a real number, but the maximum likelihood estimate is defined as before.) We have

f_θ(x) = (2πσ²)^{-n/2} exp[-(1/2σ²) Σ_{i=1}^n (x_i - μ)²]

so that

ln f_θ(x) = -(n/2) ln 2π - n ln σ - (1/2σ²) Σ_{i=1}^n (x_i - μ)²

Thus

(∂/∂μ) ln f_θ(x) = (1/σ²) Σ_{i=1}^n (x_i - μ) = (n/σ²)(x̄ - μ),  where x̄ = (1/n) Σ_{i=1}^n x_i

and

(∂/∂σ) ln f_θ(x) = -(n/σ) + (1/σ³) Σ_{i=1}^n (x_i - μ)² = (n/σ³)[-σ² + (1/n) Σ_{i=1}^n (x_i - μ)²]

Setting the partial derivatives equal to zero, we obtain

θ̂ = (x̄, s²),  where s² = (1/n) Σ_{i=1}^n (x_i - x̄)²

(A standard calculus argument shows that this is actually a maximum.) In terms of the Ri, we have θ̂ = (R̄, V²), where R̄ is the sample mean (R1 + ··· + Rn)/n and V² is the sample variance (1/n) Σ_{i=1}^n (Ri - R̄)². If the problem is changed so that θ = μ (i.e., σ² is known), we obtain μ̂ = x̄ as above. However, if θ = σ², then we find θ̂ = (1/n) Σ_{i=1}^n (x_i - μ)², since the equation ∂ ln f_θ(x)/∂μ = 0 is no longer present. ◄
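As a sanity check on Example 2 (a sketch with invented sample data, not from the text), one can compute (x̄, s²) and confirm that perturbing either coordinate does not increase the log-likelihood:

```python
import math

def normal_log_lik(xs, mu, sigma2):
    """ln f_theta(x) = -(n/2) ln(2 pi) - (n/2) ln(sigma2) - sum((xi - mu)^2)/(2 sigma2)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [1.2, -0.7, 3.1, 0.4, 2.2, -1.5, 0.9]               # illustrative sample
mu_hat = sum(xs) / len(xs)                                # sample mean x-bar
s2_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)     # sample variance, (1/n) form

best = normal_log_lik(xs, mu_hat, s2_hat)
for dmu in (-0.1, 0.1):
    for ds2 in (-0.1, 0.1):
        # (mu_hat, s2_hat) is the global maximizer, so nearby values cannot do better
        assert best >= normal_log_lik(xs, mu_hat + dmu, s2_hat + ds2)
```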
We now discuss Bayes estimates. For the sake of definiteness we consider the absolutely continuous case. Assume N = E^1, and let f_θ be the density of R when the state of nature is θ. Assume that there is an a priori density g for θ; that is, the probability that the state of nature will lie in the set B is given by ∫_B g(θ) dθ. Finally, assume that we are given a (nonnegative) loss function L(γ(θ), a), θ ∈ N, a ∈ A; L(γ(θ), a) is the cost when our estimate of γ(θ) turns out to be a. If ψ is an estimate, the over-all average cost associated with ψ is

B(ψ) = ∫_{-∞}^∞ ∫_{-∞}^∞ g(θ) f_θ(x) L(γ(θ), ψ(x)) dθ dx

B(ψ) is called the Bayes risk of ψ, and an estimate that minimizes B(ψ) is called a Bayes estimate. If we write

B(ψ) = ∫_{-∞}^∞ [∫_{-∞}^∞ g(θ) f_θ(x) L(γ(θ), ψ(x)) dθ] dx    (8.3.1)

it follows that in order to minimize B(ψ) it is sufficient to minimize the expression in brackets for each x. Often this is computationally feasible. In particular, let L(γ(θ), a) = (γ(θ) - a)². Thus we are trying to minimize

∫_{-∞}^∞ g(θ) f_θ(x) (γ(θ) - ψ(x))² dθ

This is of the form Aψ²(x) - 2Bψ(x) + C, which is a minimum when ψ(x) = B/A; that is,

ψ(x) = [∫_{-∞}^∞ g(θ) f_θ(x) γ(θ) dθ] / [∫_{-∞}^∞ g(θ) f_θ(x) dθ]    (8.3.2)

But the conditional density of θ given R = x is g(θ) f_θ(x) / ∫_{-∞}^∞ g(θ) f_θ(x) dθ, so that ψ(x) is simply the conditional expectation of γ(θ) given R = x. To summarize: to find a Bayes estimate with quadratic loss function, set ψ(x) = the conditional expectation of the parameter to be estimated, given that the observable takes the value x.

► Example 3.
Let R have the binomial distribution with parameters n and θ, 0 < θ < 1, and let γ(θ) = θ. Take g as the beta density with parameters r and s; that is,

g(θ) = θ^{r-1} (1 - θ)^{s-1} / β(r, s),  0 < θ < 1,  r, s > 0

where β(r, s) is the beta function (see Section 2 of Chapter 4). First we find a Bayes estimate of θ with quadratic loss function. The discussion leading to (8.3.2) applies, with f_θ(x) replaced by p_θ(x) = C(n, x) θ^x (1 - θ)^{n-x}, x = 0, 1, ..., n. Thus

ψ(x) = [∫_0^1 θ^{r+x} (1 - θ)^{s-1+n-x} dθ] / [∫_0^1 θ^{r-1+x} (1 - θ)^{s-1+n-x} dθ]
     = β(r + x + 1, n - x + s) / β(r + x, n - x + s)
     = [Γ(r + x + 1) Γ(n - x + s) / Γ(r + s + n + 1)] [Γ(r + s + n) / (Γ(r + x) Γ(n - x + s))]
     = (r + x) / (r + s + n)
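The closed form (r + x)/(r + s + n) can be verified numerically; the sketch below (parameter values invented for illustration) evaluates the ratio of integrals in (8.3.2) by midpoint integration, noting that the binomial coefficient and the constant 1/β(r, s) cancel in the ratio:

```python
def posterior_mean(x, n, r, s, steps=20000):
    """Numerical evaluation of (8.3.2): the ratio of the integrals of
    theta^(r+x) (1-theta)^(s-1+n-x) and theta^(r-1+x) (1-theta)^(s-1+n-x)."""
    num = den = 0.0
    for k in range(steps):
        theta = (k + 0.5) / steps          # midpoint rule on (0, 1)
        w = theta ** (r - 1 + x) * (1 - theta) ** (s - 1 + n - x)
        num += theta * w
        den += w
    return num / den

r, s, n, x = 2.0, 3.0, 10, 4               # illustrative values
psi = posterior_mean(x, n, r, s)
assert abs(psi - (r + x) / (r + s + n)) < 1e-4
```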
Now, for a given θ, the average loss ρ_ψ(θ), using ψ, may be computed as follows.

ρ_ψ(θ) = E_θ[((r + R)/(r + s + n) - θ)²] = [1/(r + s + n)²] E_θ[(R - nθ + r - rθ - sθ)²]

Since E_θ[(R - nθ)²] = Var_θ R = nθ(1 - θ) and E_θ R = nθ, we have

ρ_ψ(θ) = [1/(r + s + n)²] [nθ(1 - θ) + (r - rθ - sθ)²]
       = [1/(r + s + n)²] [((r + s)² - n)θ² + (n - 2r(r + s))θ + r²]

ρ_ψ is called the risk function of ψ; notice that

B(ψ) = ∫_0^1 ρ_ψ(θ) g(θ) dθ    (8.3.3)

It is possible to choose r and s so that ρ_ψ will be constant for all θ. For this to happen,

n = (r + s)² = 2r(r + s)

which is satisfied if r = s = √n/2. We then have

ψ(x) = (x + √n/2)/(n + √n) = (x/n)[√n/(1 + √n)] + (1/2)/(1 + √n)

ρ_ψ(θ) = (n/4)/(n + √n)² = 1/[4(1 + √n)²] = B(ψ)

Thus in this case ψ is a Bayes estimate with constant risk; we claim that ψ must be minimax, that is, ψ minimizes max_θ ρ_ψ(θ). For if ψ' had a maximum risk smaller than that of ψ, (8.3.3) shows that B(ψ') < B(ψ), contradicting the fact that ψ is Bayes. Notice that if θ̂(x) = x/n is the maximum likelihood estimate, then ψ(x) = a_n θ̂(x) + b_n, where a_n → 1, b_n → 0 as n → ∞. ◄

We have not yet discussed randomized estimates; in fact, in a wide variety of situations, including the case of quadratic loss functions, randomization can be ignored. In order to justify this, we first consider a basic theorem concerning convex functions. A function f from the reals to the reals is said to be convex iff f[(1 - α)x + αy] ≤ (1 - α)f(x) + αf(y) for all real x, y and all α ∈ [0, 1]. A sufficient condition for f to be convex is that it have a nonnegative second derivative ("concave upward" is the phrase used in calculus books). The geometric interpretation is that f lies on or above any of its tangents.

Theorem 1 (Jensen's Inequality).
If R is a random variable, f is a convex function, and E(R) is finite, then E[f(R)] ≥ f[E(R)]. (For example, E[R^{2n}] ≥ [E(R)]^{2n}, n = 1, 2, ....)

PROOF. Consider a tangent to f at the point E(R) (see Figure 8.3.1); let the equation of the tangent be y = ax + b. Since f is convex, f(x) ≥ ax + b for all x; hence f(R) ≥ aR + b. Thus E[f(R)] ≥ aE(R) + b = f(E(R)).

[Figure 8.3.1. Proof of Jensen's Inequality: the graph of the convex function f lies on or above its tangent line y = ax + b at the point x = E(R).]
We may now prove the theorem that allows us to ignore randomized estimates.

Theorem 2 (Rao-Blackwell). Let R1 be an observable, and let R2 be independent of R1 and θ, as indicated in the discussion of randomized estimates at the beginning of this section. Let ψ = ψ(x, y) be any estimate of γ(θ) based on observation of R1 and R2. Assume that the loss function L(γ(θ), a) is a convex function of a for each θ (this includes the case of quadratic loss). Define

ψ*(x) = E_θ[ψ(R1, R2) | R1 = x] = E[ψ(x, R2)]

(E_θ ψ(R1, R2) is assumed finite.) Let ρ_ψ be the risk function of ψ, defined by ρ_ψ(θ) = E_θ[L(γ(θ), ψ(R1, R2))] = the average loss, using ψ, when the state of nature is θ. Similarly, let ρ_ψ*(θ) = E_θ[L(γ(θ), ψ*(R1))]. Then ρ_ψ*(θ) ≤ ρ_ψ(θ) for all θ; hence the nonrandomized estimate ψ* is at least as good as the randomized estimate ψ.

PROOF. We have

L(γ(θ), E_θ[ψ(R1, R2) | R1 = x]) ≤ E_θ[L(γ(θ), ψ(R1, R2)) | R1 = x]

by the argument of Jensen's inequality applied to conditional expectations. Therefore

L(γ(θ), ψ*(R1)) ≤ E_θ[L(γ(θ), ψ(R1, R2)) | R1]

Take expectations on both sides to obtain ρ_ψ*(θ) ≤ E_θ[L(γ(θ), ψ(R1, R2))] = ρ_ψ(θ), as desired.
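A small Monte-Carlo sketch illustrates the theorem (the Bernoulli setup, the uniform randomizer, and the seed are invented for illustration). The randomized estimate ψ(R1, R2) = R1 + (R2 - 1/2), with R2 uniform on (0, 1), has conditional expectation ψ*(R1) = R1, and its quadratic risk should dominate that of ψ*:

```python
import random

random.seed(0)
theta, trials = 0.3, 200_000
se_rand = se_star = 0.0
for _ in range(trials):
    r1 = 1.0 if random.random() < theta else 0.0    # Bernoulli(theta) observable
    r2 = random.random()                             # independent uniform randomizer
    se_rand += (r1 + (r2 - 0.5) - theta) ** 2        # randomized estimate psi
    se_star += (r1 - theta) ** 2                     # psi*(R1) = E[psi | R1] = R1
rho_rand, rho_star = se_rand / trials, se_star / trials

# Exact risks are theta(1-theta) + 1/12 versus theta(1-theta).
assert rho_star < rho_rand
```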
PROBLEMS

1. Let R1, ..., Rn be independent random variables, all having the same density h_θ; thus f_θ(x1, ..., xn) = ∏_{i=1}^n h_θ(x_i). In each case find the maximum likelihood estimate of θ.
(a) h_θ(x) = θ x^{θ-1}, 0 < x < 1, θ > 0
(b) h_θ(x) = (1/θ) e^{-x/θ}, x > 0, θ > 0
(c) h_θ(x) = 1/θ, 0 < x < θ, θ > 0

2. Let R have the Cauchy density with parameter θ; that is,

f_θ(x) = θ / [π(x² + θ²)],  θ > 0

Find the maximum likelihood estimate of θ.

3. Let R have the negative binomial distribution (see Problem 6, Section 6.4); that is,

p_θ(x) = P{R = x} = C(x - 1, r - 1) θ^r (1 - θ)^{x-r},  x = r, r + 1, ...,  0 < θ < 1

Find the maximum likelihood estimate of θ.

4. Find the risk function in Example 3, using the maximum likelihood estimate θ̂ = x/n.

5. In Example 3, find the Bayes estimate if θ is uniformly distributed between 0 and 1.

6. In Example 3, change the loss function to L(θ, a) = (θ - a)²/[θ(1 - θ)], and let θ be uniformly distributed between 0 and 1. Find the Bayes estimate and show that it has constant risk and is therefore minimax.

7. Let R have the Poisson distribution with parameter θ > 0. Find the Bayes estimate ψ of θ with quadratic loss function if the a priori density is g(θ) = e^{-θ}, θ > 0. Compute the risk function and the Bayes risk using ψ, and compare with the results using the maximum likelihood estimate.
8.4 SUFFICIENT STATISTICS

In many situations the statistician is concerned with reduction of data. For example, if a sequence of observations results in numbers x1, ..., xn, it is easier to store the single number x1 + ··· + xn than to record the entire set of observations. Under certain conditions no essential information is lost in reducing the data; let us illustrate this by an example.

Let R1, ..., Rn be independent Bernoulli random variables with parameter θ; that is, P{Ri = 1} = θ, P{Ri = 0} = 1 - θ, 0 < θ < 1. Let T = t(R1, ..., Rn) = R1 + ··· + Rn, which has the binomial distribution with parameters n and θ. We claim that P_θ{R1 = x1, ..., Rn = xn | T = y} actually does not depend on θ. We compute, for x_i = 0 or 1, i = 1, ..., n,

P_θ{R1 = x1, ..., Rn = xn | T = y} = P_θ{R1 = x1, ..., Rn = xn, T = y} / P_θ{T = y}

This is 0 unless y = x1 + ··· + xn, in which case we obtain

P_θ{R1 = x1, ..., Rn = xn} / P_θ{T = y} = θ^y (1 - θ)^{n-y} / [C(n, y) θ^y (1 - θ)^{n-y}] = 1/C(n, y)
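The claim that the conditional distribution given T is free of θ can be confirmed by direct enumeration (a small sketch; the particular n, y, and θ values are invented for illustration):

```python
from itertools import product
from math import comb

def cond_given_t(n, y, theta):
    """P_theta{R1 = x1, ..., Rn = xn | T = y}, computed from the joint
    Bernoulli probabilities by brute-force enumeration of all 0/1 vectors."""
    joint = {xs: theta ** sum(xs) * (1 - theta) ** (n - sum(xs))
             for xs in product((0, 1), repeat=n)}
    p_t = sum(p for xs, p in joint.items() if sum(xs) == y)
    return {xs: p / p_t for xs, p in joint.items() if sum(xs) == y}

n, y = 5, 2
a, b = cond_given_t(n, y, 0.3), cond_given_t(n, y, 0.8)
for xs in a:
    assert abs(a[xs] - b[xs]) < 1e-12            # free of theta
    assert abs(a[xs] - 1 / comb(n, y)) < 1e-12   # equal to 1 / C(n, y)
```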
The significance of this result is that for the purpose of making a statistical decision based on observation of R1, ..., Rn, we may ignore the individual Ri and base the decision entirely on R1 + ··· + Rn. To justify this, consider two statisticians, A and B. Statistician A observes R1, ..., Rn and then makes his decision. Statistician B, on the other hand, is only given T = R1 + ··· + Rn. He then constructs random variables R1', ..., Rn' as follows. If T = y, let R1', ..., Rn' be chosen according to the conditional probability function of R1, ..., Rn given T = y. Explicitly,

P{R1' = x1, ..., Rn' = xn | T = y} = 1/C(n, y)

where x_i = 0 or 1, i = 1, ..., n, x1 + ··· + xn = y. B then follows A's decision procedure, using R1', ..., Rn'. Note that since the conditional probability function of R1, ..., Rn given T = y does not depend on the unknown parameter θ, B's procedure is sensible. Now if x1 + ··· + xn = y,

P_θ{R1' = x1, ..., Rn' = xn} = P_θ{T = y} P{R1' = x1, ..., Rn' = xn | T = y}
= C(n, y) θ^y (1 - θ)^{n-y} [1/C(n, y)]
= θ^y (1 - θ)^{n-y} = P_θ{R1 = x1, ..., Rn = xn}

Thus (R1', ..., Rn') has exactly the same probability function as (R1, ..., Rn), so that the procedures of A and B are equivalent. In other words, anything A can do, B can do at least as well, even though B starts with less information.

We now give the formal definitions. For simplicity, we restrict ourselves to the discrete case. However, the definition of sufficiency in the absolutely continuous case is the same, with probability functions replaced by densities. Also, the basic factorization theorem, to be proved below, holds in the absolutely continuous case (admittedly with a more difficult proof).

Let R be a discrete random variable (or random vector) whose probability function under the state of nature θ is p_θ. Let T be a statistic for R, that is, a function of R that is also a random variable. T is said to be sufficient for R (or for the family p_θ, θ ∈ N) iff the conditional probability function of R given T does not depend on θ. The definition is often unwieldy, and the following criterion for sufficiency is useful.
Theorem 1 (Factorization Theorem). Let T = t(R) be a statistic for R. T is sufficient for R if and only if the probability function p_θ can be factored in the form p_θ(x) = g(θ, t(x)) h(x).

PROOF. Assume a factorization of this form. Then

P_θ{R = x | T = y} = P_θ{R = x, T = y} / P_θ{T = y}

This is 0 unless t(x) = y, in which case we obtain

P_θ{R = x} / P_θ{T = y} = g(θ, t(x)) h(x) / [Σ_{z: t(z)=y} g(θ, t(z)) h(z)]
= g(θ, y) h(x) / [g(θ, y) Σ_{z: t(z)=y} h(z)]
= h(x) / Σ_{z: t(z)=y} h(z)

which is free of θ.

Conversely, if T is sufficient, then

p_θ(x) = P_θ{R = x} = P_θ{R = x, T = t(x)} = P_θ{T = t(x)} P_θ{R = x | T = t(x)}

which is of the desired form g(θ, t(x)) h(x), since by definition of sufficiency the conditional probability P_θ{R = x | T = t(x)} does not depend on θ. ◄
► Example 1. Let R1, ..., Rn be independent, each Bernoulli with parameter θ. Show that R1 + ··· + Rn is sufficient for (R1, ..., Rn).

We have done this in the introductory discussion, using the definition of sufficiency. Let us check the result using the factorization theorem. If x_i = 0 or 1, i = 1, ..., n, and t(x) = x1 + ··· + xn, then

p_θ(x1, ..., xn) = θ^{t(x)} (1 - θ)^{n - t(x)}

which is of the form specified in the factorization theorem [with h(x) ≡ 1]. ◄
...._
Let R1 , . . . , Rn be independent, each Poisson with param eter fJ . Again R1 + + R n is sufficient for (R 1 , . . . , Rn) (Notice that + Rn is Poisson with parameter nO. ) R1 + For ·
·
·
·
·
·
·
n
= IT P0{ Ri = xi } i 1 =
e-
n{} f)Xl +
X
1f ·
• • •
• · •
+xn
Xn
•f
8.4 SUFFICIENT STATISTICS
The factorization theorem applies , with g(fJ, t (x) ) 1 / x1 ! n ! , t (X) = x1 + · · · + n . ...._ �
...X
Example
3.
X
=
267
n e- o()t<:x:>, h (x)
=
Let R1, ..., Rn be independent, each normally distributed with mean μ and variance σ². Find a sufficient statistic for (R1, ..., Rn), assuming
(a) μ and σ² both unknown; that is, θ = (μ, σ²).
(b) σ² known; that is, θ = μ.
(c) μ known; that is, θ = σ².
[Of course (R1, ..., Rn) is always sufficient for itself, but we hope to reduce the data a bit more.] We compute

f_θ(x) = (2πσ²)^{-n/2} exp[-(1/2σ²) Σ_{i=1}^n (x_i - μ)²]    (8.4.1)

Let

x̄ = (1/n) Σ_{i=1}^n x_i,  s² = (1/n) Σ_{i=1}^n (x_i - x̄)²

Since x_i - μ = (x_i - x̄) + (x̄ - μ), we have

(1/n) Σ_{i=1}^n (x_i - μ)² = s² + (x̄ - μ)²

Thus

f_θ(x) = (2πσ²)^{-n/2} exp[-(n/2σ²)(s² + (x̄ - μ)²)]    (8.4.2)

By (8.4.2), if μ and σ² are unknown, then [take h(x) ≡ 1] (R̄, V²) is sufficient, where R̄ is the sample mean (1/n) Σ_{i=1}^n Ri and V² is the sample variance (1/n) Σ_{i=1}^n (Ri - R̄)². If σ² is known, then the term (2πσ²)^{-n/2} e^{-ns²/2σ²} can be taken as h(x) in the factorization theorem; hence R̄ is sufficient. If μ is known, then, by (8.4.1), Σ_{i=1}^n (Ri - μ)² is sufficient. ◄
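The algebraic identity behind (8.4.2), (1/n) Σ (x_i - μ)² = s² + (x̄ - μ)², is easy to confirm numerically (the sample values and the value of μ are invented for illustration):

```python
xs = [2.3, -0.4, 1.1, 0.7, 3.2]    # illustrative sample
mu = 1.0                            # hypothetical known mean
n = len(xs)

xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / n

lhs = sum((x - mu) ** 2 for x in xs) / n   # (1/n) sum (xi - mu)^2
rhs = s2 + (xbar - mu) ** 2                # s^2 + (xbar - mu)^2
assert abs(lhs - rhs) < 1e-12
```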
PROBLEMS

1. Let R1, ..., Rn be independent, each uniformly distributed on the interval [θ1, θ2]. Find a sufficient statistic for (R1, ..., Rn), assuming
(a) θ1, θ2 both unknown
(b) θ1 known
(c) θ2 known

2. Repeat Problem 1 if each Ri has the gamma density with parameters θ1 and θ2; that is,

f_θ(x) = x^{θ1 - 1} e^{-x/θ2} / [Γ(θ1) θ2^{θ1}],  x > 0,  θ = (θ1, θ2),  θ1, θ2 > 0

3. Repeat Problem 1 if each Ri has the beta density with parameters θ1 and θ2; that is,

f_θ(x) = x^{θ1 - 1} (1 - x)^{θ2 - 1} / β(θ1, θ2),  0 < x < 1,  θ = (θ1, θ2),  θ1, θ2 > 0

4. Let R1 and R2 be independent, with R1 normal (θ, σ²) and R2 normal (θ, τ²), where σ² and τ² are known. Show that R1/σ² + R2/τ² is sufficient for (R1, R2).

5. An exponential family of densities is a family of the form

f_θ(x) = a(θ) b(x) exp[Σ_{i=1}^k c_i(θ) t_i(x)],  x real, θ ∈ N

(a) Verify that the following density (or probability) functions can be put into the above form.
(i) Binomial (n, θ): p_θ(x) = C(n, x) θ^x (1 - θ)^{n-x}, x = 0, 1, ..., n, 0 < θ < 1
(ii) Poisson (θ): p_θ(x) = e^{-θ} θ^x / x!, x = 0, 1, ..., θ > 0
(iv) Gamma (θ1, θ2): f_θ(x) = x^{θ1-1} e^{-x/θ2} / [Γ(θ1) θ2^{θ1}], x > 0, θ = (θ1, θ2), θ1, θ2 > 0
(v) Beta (θ1, θ2): f_θ(x) = x^{θ1-1} (1 - x)^{θ2-1} / β(θ1, θ2), 0 < x < 1, θ = (θ1, θ2), θ1, θ2 > 0
(vi) Negative binomial (r, θ): p_θ(x) = C(x - 1, r - 1) θ^r (1 - θ)^{x-r}, x = r, r + 1, ..., 0 < θ < 1, r a known positive integer
(b) If R1, ..., Rn are independent, each Ri having the density f_θ of part (a), find a sufficient statistic for (R1, ..., Rn).

6. Let T be sufficient for the family of densities f_θ, θ ∈ N. Consider the problem of testing the null hypothesis that θ ∈ H0 versus the alternative that θ ∈ H1. Show that all possible risk points can be obtained from tests based on T [i.e., φ(x) expressible as a function of t(x)].

8.5 UNBIASED ESTIMATES BASED ON A COMPLETE SUFFICIENT STATISTIC
In this section we require our estimates ψ of γ(θ) to be unbiased; that is, E_θ ψ(R) = γ(θ) for all θ ∈ N. Our objective is to show that in a wide class of situations it is possible to construct unbiased estimates ψ that have uniformly minimum risk; that is, if ψ' is any unbiased estimate of γ(θ), then ρ_ψ(θ) ≤ ρ_ψ'(θ) for all θ. We need a technical definition first. If T is a statistic for R, T is said to be complete iff there are no nontrivial unbiased estimates of 0 based on T; that is, iff whenever E_θ g(T) = 0 for all θ ∈ N, we have P_θ{g(T) = 0} = 1 for all θ ∈ N.
Theorem 1. Let T = t(R) be a complete sufficient statistic for R, and let ψ be an unbiased estimate of γ(θ) based on T [i.e., ψ(x) can be expressed as a function of t(x)]. Assume that the loss function L(γ(θ), a) is convex in a for each fixed θ. Then ψ has uniformly minimum risk among all unbiased estimates of γ(θ).

PROOF. Let ψ' be any unbiased estimate of γ(θ), and define ψ''(x) = E[ψ'(R) | T = t(x)]. (Since T is sufficient, ψ'' does not depend on θ and hence is a legitimate estimate.) ψ'' is an unbiased estimate of γ(θ) based on T, and so is ψ, and therefore E_θ[ψ''(R) - ψ(R)] = 0 for all θ. But ψ''(R) and ψ(R) can be expressed as functions of T; hence, by completeness,

P_θ{ψ''(R) = ψ(R)} = 1  for all θ

It follows that ρ_ψ''(θ) = ρ_ψ(θ) for all θ. But the proof of the Rao-Blackwell theorem, with R1 replaced by T, x by t(x), ψ(R1, R2) by ψ'(R), and ψ*(R1) by ψ''(R), shows that ρ_ψ''(θ) ≤ ρ_ψ'(θ) for all θ, as desired.

If L(γ(θ), a) = (γ(θ) - a)², then ρ_ψ(θ) = E_θ[(γ(θ) - ψ(R))²] = Var_θ ψ(R). Thus ψ has the smallest variance of all unbiased estimates of γ(θ), regardless of the state of nature. In this case ψ is said to be a uniformly minimum variance unbiased estimate (UMVUE).

► Example 1.
Let R1, ..., Rn be independent, each Bernoulli with parameter θ, 0 < θ < 1. By Example 1, Section 8.4, T = R1 + ··· + Rn is sufficient for (R1, ..., Rn); let us show that it is complete. Now T is binomial with parameters n and θ; hence

E_θ g(T) = Σ_{k=0}^n g(k) C(n, k) θ^k (1 - θ)^{n-k} = (1 - θ)^n Σ_{k=0}^n g(k) C(n, k) [θ/(1 - θ)]^k  if θ < 1

If E_θ g(T) = 0 for all θ ∈ (0, 1), then Σ_{k=0}^n g(k) C(n, k) z^k = 0 for all z ∈ (0, ∞); hence g(k) = 0 for k = 0, 1, ..., n.

We now look for unbiased estimates of γ(θ) based on T. If ψ(x1, ..., xn) = g(t(x1, ..., xn)), t(x1, ..., xn) = x1 + ··· + xn, is such an estimate, the above argument shows that E_θ ψ(R1, ..., Rn) = E_θ g(T) is a polynomial in θ of degree at most n. If T^(r) = T(T - 1) ··· (T - r + 1) and n^(r) = n(n - 1) ··· (n - r + 1), then

E_θ[T^(r)/n^(r)] = (1/n^(r)) Σ_{k=r}^n k(k - 1) ··· (k - r + 1) [n!/(k!(n - k)!)] θ^k (1 - θ)^{n-k}
= θ^r Σ_{k=r}^n [(n - r)!/((k - r)!(n - k)!)] θ^{k-r} (1 - θ)^{n-k}
= θ^r (θ + 1 - θ)^{n-r} = θ^r

Thus Σ_{k=0}^n a_k [T^(k)/n^(k)] is a UMVUE of γ(θ) = Σ_{k=0}^n a_k θ^k; in particular the sample mean T/n is a UMVUE of θ. ◄
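The identity E_θ[T^(r)/n^(r)] = θ^r can be checked by summing directly over the binomial distribution (a sketch; the particular n, r, and θ values are invented for illustration):

```python
from math import comb

def falling(m, r):
    """Falling factorial m(m-1)...(m-r+1)."""
    out = 1
    for i in range(r):
        out *= m - i
    return out

def expect_ratio(n, r, theta):
    """E_theta[T^(r)/n^(r)] with T binomial(n, theta), computed from the pmf."""
    num = sum(falling(k, r) * comb(n, k) * theta ** k * (1 - theta) ** (n - k)
              for k in range(n + 1))
    return num / falling(n, r)

for theta in (0.2, 0.5, 0.9):
    for r in (1, 2, 3):
        assert abs(expect_ratio(10, r, theta) - theta ** r) < 1e-12
```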
► Example 2. Let R1, ..., Rn be independent, each Poisson with parameter θ. By Example 2, Section 8.4, T = R1 + ··· + Rn is sufficient for (R1, ..., Rn); T is also complete. For T is Poisson with parameter nθ; hence

E_θ g(T) = Σ_{k=0}^∞ g(k) e^{-nθ} (nθ)^k / k!

If E_θ g(T) = 0 for all θ > 0, then

Σ_{k=0}^∞ [g(k) n^k / k!] θ^k = 0  for all θ > 0

Since this is a power series in θ, we must have g ≡ 0.

If ψ(x1, ..., xn) = g(t(x1, ..., xn)) is an unbiased estimate of γ(θ), then

γ(θ) = E_θ g(T) = e^{-nθ} Σ_{k=0}^∞ [g(k) n^k / k!] θ^k

Thus γ(θ) must be expressible as a power series in θ. If γ(θ) = Σ_{k=0}^∞ a_k θ^k, then

e^{nθ} γ(θ) = Σ_{k=0}^∞ c_k θ^k,  where c_k = Σ_{i=0}^k a_{k-i} n^i / i!

But e^{nθ} γ(θ) = Σ_{k=0}^∞ [g(k) n^k / k!] θ^k; hence

g(k) = (k!/n^k) Σ_{i=0}^k a_{k-i} n^i / i!

We conclude that

g(T) = T! Σ_{i=0}^T a_{T-i} / [n^{T-i} i!]

is a UMVUE of γ(θ) = Σ_{k=0}^∞ a_k θ^k.

For example, if γ(θ) = θ^r, r = 1, 2, ..., the UMVUE is

T!/[(T - r)! n^r] = T^(r)/n^r  (= T/n = the sample mean when r = 1)

[In this particular case the above computation could have been avoided, since we know that E_θ[T^(r)] = (nθ)^r (Problem 8, Section 3.2). Since T^(r)/n^r is an unbiased estimate of θ^r based on T, it is a UMVUE.]

As another example, a UMVUE of 1/(1 - θ) = Σ_{k=0}^∞ θ^k, 0 < θ < 1, is

T! Σ_{i=0}^T 1/[n^{T-i} i!]  ◄
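The Poisson case can be checked in the same spirit: the following sketch (n, r, θ, and the series truncation point are invented for illustration) verifies E_θ[T^(r)/n^r] = θ^r for T Poisson with parameter nθ:

```python
import math

def expect_poisson_umvue(n, r, theta, kmax=100):
    """E_theta[T^(r)/n^r] with T Poisson(n*theta); series truncated at kmax,
    which is more than enough for the small parameters used here."""
    lam = n * theta
    total = 0.0
    for k in range(r, kmax):
        falling = 1.0
        for i in range(r):
            falling *= k - i
        total += falling * math.exp(-lam) * lam ** k / math.factorial(k)
    return total / n ** r

for theta in (0.5, 1.0, 2.0):
    for r in (1, 2, 3):
        assert abs(expect_poisson_umvue(4, r, theta) - theta ** r) < 1e-8
```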
PROBLEMS

1. Find a UMVUE of e^{-θ} in Example 2.

2. Let R1, ..., Rn be independent, each uniformly distributed between 0 and θ, θ > 0. By Problem 1, Section 8.4, T = max Ri is sufficient for (R1, ..., Rn).
(a) Show that T is complete.
(b) Find a UMVUE of γ(θ), assuming that γ extends to a function with a continuous derivative on [0, ∞), and θ^n γ(θ) → 0 as θ → 0.
[In part (a), use without proof the fact that if ∫_0^θ h(y) dy = 0 for all θ > 0, then h(y) = 0 except on a set of Lebesgue measure 0. Notice that if it is known that h is continuous, then h ≡ 0 by the fundamental theorem of calculus.]

3. Let R1, ..., Rn be independent, each normal with mean θ and known variance σ².
(a) Show that the sample mean R̄ is a UMVUE of θ.
(b) Show that (R̄)² - (σ²/n) is a UMVUE of θ².
[Use without proof the fact that if ∫_{-∞}^∞ h(y) e^{θy} dy = 0 for all θ > 0, then h(y) = 0 except on a set of Lebesgue measure 0.]

4. Let R1, ..., Rn be independent, with P{Ri = k} = 1/N, k = 1, ..., N; take θ = N, N = 1, 2, ....
(a) Show that max_{1≤i≤n} Ri is a complete sufficient statistic.
(b) Find a UMVUE of γ(N).

5. Let R have the negative binomial distribution:

P{R = k} = C(k - 1, r - 1) p^r (1 - p)^{k-r},  k = r, r + 1, ...,  0 < p ≤ 1

Take θ = 1 - p, 0 ≤ θ < 1. Show that γ(θ) has a UMVUE if and only if it is expressible as a power series in θ; find the form of the UMVUE.

6. Let

P{R = k} = e^{-θ} θ^k / [(1 - e^{-θ}) k!],  k = 1, 2, ...,  θ > 0

(This is the conditional probability function of a Poisson random variable R', given that R' ≥ 1.) R is clearly sufficient for itself, and is complete by an argument similar to that of Example 2.
(a) Find a UMVUE of e^{-θ}.
(b) Show that (assuming quadratic loss function) the estimate ψ found in part (a) is inadmissible; that is, there is another estimate ψ' such that ρ_ψ'(θ) ≤ ρ_ψ(θ) for all θ, and ρ_ψ'(θ) < ρ_ψ(θ) for some θ. This shows that unbiased estimates, while often easy to find, are not necessarily desirable.

7. The following is another method for obtaining a UMVUE. Let R1, ..., Rn be independent, each Bernoulli with parameter θ, 0 < θ < 1, as in Example 1. If j = 1, ..., n, then

E[∏_{i=1}^j Ri] = P{R1 = ··· = Rj = 1} = θ^j

Thus R1 R2 ··· Rj is an unbiased estimate of θ^j. But then ψ(k) = E[R1 ··· Rj | Σ_{i=1}^n Ri = k] is an unbiased estimate of θ^j based on the complete sufficient statistic Σ_{i=1}^n Ri, so that ψ is a UMVUE. Compute ψ directly and show that the result agrees with Example 1.

8. Let R1, ..., Rn be independent, each Poisson with parameter θ > 0. Show, using the analysis in Problem 7, that

E[R1 R2 | Σ_{i=1}^n Ri = k] = k(k - 1)/n²,  k = 0, 1, ...

9. Let R1, ..., Rn be independent, each uniformly distributed between 0 and θ; if T = max Ri, then [(n + 1)/n] T is a UMVUE of θ (see Problem 2). Compare the risk function E_θ[((1 + 1/n)T - θ)²] using [(n + 1)/n] T with the risk function E_θ[((2/n) Σ_{i=1}^n Ri - θ)²] using the unbiased estimate (2/n) Σ_{i=1}^n Ri.

10. Let R1, ..., Rn be independent, each Bernoulli with parameter θ ∈ [0, 1]. Show that (assuming quadratic loss function) there is no best estimate of θ based on R1, ..., Rn; that is, there is no estimate ψ such that ρ_ψ(θ) ≤ ρ_ψ'(θ) for all θ and all estimates ψ' of θ.

11. If E_θ ψ1(R) = E_θ ψ2(R) = γ(θ) and ψ1, ψ2 both minimize ρ_ψ(θ) = E_θ[(ψ(R) - γ(θ))²], θ fixed, show that P_θ{ψ1(R) = ψ2(R)} = 1. Consequently, if ψ1 and ψ2 are UMVUEs of γ(θ), then, for each θ, ψ1(R) = ψ2(R) with probability 1.

12. Let f_θ, θ ∈ N = an open interval of reals, be a family of densities. Assume that ∂f_θ(x)/∂θ exists and is continuous everywhere, and that ∫_{-∞}^∞ f_θ(x) dx can be differentiated under the integral sign with respect to θ.
(a) If R has density f_θ when the state of nature is θ, show that

E_θ[(∂/∂θ) ln f_θ(R)] = 0

(b) If E_θ ψ(R) = γ(θ) and ∫_{-∞}^∞ ψ(x) f_θ(x) dx can be differentiated under the integral sign with respect to θ, show that

(d/dθ) γ(θ) = E_θ[ψ(R) (∂/∂θ) ln f_θ(R)]

(c) Under the assumptions of part (b), show that

Var_θ ψ(R) ≥ [γ'(θ)]² / Var_θ[(∂/∂θ) ln f_θ(R)]

if the denominator is > 0. In particular, if f_θ(x1, ..., xn) = ∏_{i=1}^n h_θ(x_i), then

Var_θ[(∂/∂θ) ln f_θ(R)] = n Var_θ[(∂/∂θ) ln h_θ(Ri)]

where R = (R1, ..., Rn). The above result is called the Cramér-Rao inequality (an analogous theorem may be proved with densities replaced by probability functions). If ψ is an estimate that satisfies the Cramér-Rao lower bound with equality for all θ, then ψ is a UMVUE of γ(θ). This idea may be used to give an alternative proof that the sample mean is a UMVUE of the true mean in the Bernoulli, Poisson, and normal cases (see Examples 1 and 2 and Problem 3 of this section).

13. If R1, ..., Rn are independent, each with mean μ and variance σ², and V² is the sample variance, show that V² is a biased estimate of σ²; specifically,

E(V²) = [(n - 1)/n] σ²
8.6 SAMPLING FROM A NORMAL POPULATION

If R1, ..., Rn are independent, each normally distributed with mean μ and variance σ², we have seen that (R̄, V²), where R̄ is the sample mean (1/n)(R1 + ··· + Rn) and V² = (1/n) Σ_{i=1}^n (Ri - R̄)² is the sample variance, is a sufficient statistic for (R1, ..., Rn). R̄ and V² have some special properties that are often useful. First, R̄ is a sum of the independent normal random variables Ri/n, each of which has mean μ/n and variance σ²/n²; hence R̄ is normal with mean μ and variance σ²/n. We now prove that R̄ and V² are independent.
Theorem 1. If R1, ..., Rn are independent, each normal (μ, σ²), the associated sample mean and variance are independent random variables.

*PROOF. Define random variables W1, ..., Wn by

W1 = (1/√n)(R1 + ··· + Rn),  Wi = Σ_{j=1}^n c_ij R_j,  i = 2, ..., n

where the c_ij are chosen so as to make the transformation orthogonal. [This may be accomplished by extending the vector (1/√n, ..., 1/√n) to an orthonormal basis for E^n.] The Jacobian J of the transformation is the determinant of the orthogonal matrix A = [c_ij] (with c_1j = 1/√n, j = 1, ..., n), namely ±1. Thus (see Problem 12, Section 2.8) the density of (W1, ..., Wn) is given by

f*(y1, ..., yn) = f(x1, ..., xn)/|J| = (2πσ²)^{-n/2} exp[-(1/2σ²) Σ_{i=1}^n (x_i - μ)²]

where (x1, ..., xn) = A^{-1}(y1, ..., yn). Since the transformation is orthogonal, Σ_{i=1}^n x_i² = Σ_{i=1}^n y_i², and √n y1 = x1 + ··· + xn; hence

Σ_{i=1}^n (x_i - μ)² = Σ_{i=1}^n x_i² - 2μ Σ_{i=1}^n x_i + nμ² = Σ_{i=1}^n y_i² - 2μ√n y1 + nμ²

Thus

f*(y1, ..., yn) = [1/(√(2π) σ)] exp[-(y1 - √n μ)²/2σ²] ∏_{i=2}^n [1/(√(2π) σ)] exp(-y_i²/2σ²)

It follows that W1, ..., Wn are independent, with W2, ..., Wn each normal (0, σ²) and W1 normal (√n μ, σ²). But

nV² = Σ_{i=1}^n (Ri - R̄)² = Σ_{i=1}^n Ri² - n(R̄)² = Σ_{i=1}^n Wi² - W1² = Σ_{i=2}^n Wi²

[here Σ Wi² = Σ Ri² by orthogonality, and W1 = √n R̄]. Since √n R̄ = W1, it follows that R̄ and V² are independent, completing the proof.

The above argument also gives us the distribution of the sample variance. For

nV²/σ² = Σ_{i=2}^n (Wi/σ)²

where the Wi/σ are independent, each normal (0, 1). Thus nV²/σ² has the chi-square distribution with n - 1 degrees of freedom; that is, the density of nV²/σ² is

x^{(n-3)/2} e^{-x/2} / [2^{(n-1)/2} Γ((n - 1)/2)],  x > 0
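Both facts, the independence of R̄ and V² and the chi-square(n − 1) law of nV²/σ² (which has mean n − 1), can be checked by simulation; the sketch below uses invented parameters, sample size, and seed:

```python
import random
random.seed(1)

n, mu, sigma, trials = 5, 2.0, 3.0, 50_000
means, variances = [], []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    means.append(xbar)
    variances.append(sum((x - xbar) ** 2 for x in xs) / n)

# n V^2 / sigma^2 should average to n - 1, the mean of chi-square(n - 1) ...
mean_stat = sum(n * v / sigma ** 2 for v in variances) / trials
assert abs(mean_stat - (n - 1)) < 0.1

# ... and R-bar and V^2 should be uncorrelated (they are in fact independent).
m1, m2 = sum(means) / trials, sum(variances) / trials
cov = sum((a - m1) * (b - m2) for a, b in zip(means, variances)) / trials
assert abs(cov) < 0.15
```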
(See Problem 3, Section 5.2, for this density.)

Now since R̄ is normal (μ, σ²/n), √n(R̄ - μ)/σ is normal (0, 1); hence

P{-b < √n(R̄ - μ)/σ < b} = F*(b) - F*(-b) = 2F*(b) - 1

where F* is the normal (0, 1) distribution function. If b is chosen so that 2F*(b) - 1 = 1 - α, that is, F*(b) = 1 - α/2 (b = N_{α/2} in the terminology of Example 3, Section 8.2), then

P{-N_{α/2} < √n(R̄ - μ)/σ < N_{α/2}} = 1 - α

or

P{R̄ - σN_{α/2}/√n < μ < R̄ + σN_{α/2}/√n} = 1 - α

Thus, with probability 1 - α, the true mean μ lies in the random interval

I = [R̄ - σN_{α/2}/√n, R̄ + σN_{α/2}/√n]

I is called a confidence interval for μ with confidence coefficient 1 - α. The interval I is computable from the given observations of R1, ..., Rn, provided that σ² is known. If σ² is unknown, it is natural to replace the true variance σ² by the sample variance V². However, we then must know something about the random variable (R̄ - μ)/V. In order to provide the necessary information, we do the following computation.

Let R1 be normal (0, 1), and let R2 have the chi-square distribution with m degrees of freedom; assume that R1 and R2 are independent. We compute the density of √m R1/√R2 as follows. Let

W1 = √m R1/√R2,  W2 = R2

so that

R1 = W1√W2/√m,  R2 = W2

Thus we have a transformation of the form (W1, W2) = g(R1, R2), with inverse (x1, x2) = h(y1, y2) = (y1√y2/√m, y2).
By Problem 12, Section 2.8, the density of (W1, W2) is given by

f12*(y1, y2) = f12(h(y1, y2)) |J_h(y1, y2)|,  y2 > 0

where

J_h(y1, y2) = det [ ∂x1/∂y1  ∂x1/∂y2 ; ∂x2/∂y1  ∂x2/∂y2 ] = det [ √y2/√m  y1/(2√m√y2) ; 0  1 ] = √y2/√m

Thus

f*(y1, y2) = [1/√(2π)] e^{-y1²y2/2m} [1/(2^{m/2} Γ(m/2))] y2^{m/2 - 1} e^{-y2/2} √y2/√m

Therefore the density of W1 is

f_{W1}(y1) = ∫_0^∞ f*(y1, y2) dy2

But

∫_0^∞ y2^{(m-1)/2} e^{-(y2/2)(1 + y1²/m)} dy2 = Γ((m + 1)/2) [2/(1 + y1²/m)]^{(m+1)/2}

Hence

f_{W1}(y1) = Γ((m + 1)/2) / [√(mπ) Γ(m/2) (1 + y1²/m)^{(m+1)/2}]

A random variable with this density is said to have the t distribution with m degrees of freedom. An application of Stirling's formula shows that the t density approaches the normal (0, 1) density as m → ∞.

Now we know that √n(R̄ - μ)/σ is normal (0, 1), and nV²/σ² has the chi-square distribution with n - 1 degrees of freedom; furthermore, these random variables are independent by Theorem 1. Thus

√(n-1) [√n(R̄ - μ)/σ] / √(nV²/σ²) = √(n-1)(R̄ - μ)/V = (R̄ - μ) / √[(1/n(n-1)) Σ_{i=1}^n (Ri - R̄)²]

has the t distribution with n - 1 degrees of freedom. If t_{β,m} is such that ∫_{t_{β,m}}^∞ h_m(t) dt = β, where h_m is the t density with m degrees of freedom, then

P{-t_{α/2,n-1} < √(n-1)(R̄ - μ)/V < t_{α/2,n-1}} = 1 - α

Thus

[R̄ - V t_{α/2,n-1}/√(n-1), R̄ + V t_{α/2,n-1}/√(n-1)]

is a confidence interval for μ with confidence coefficient 1 - α.
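As a usage sketch (the data are invented, and the critical value t_{0.025,4} ≈ 2.776 is taken from a standard t table), the t interval can be computed directly from a sample:

```python
import math

def t_confidence_interval(xs, t_crit):
    """[xbar - V t / sqrt(n-1), xbar + V t / sqrt(n-1)], where V^2 is the
    (1/n)-form sample variance and t_crit = t_{alpha/2, n-1}."""
    n = len(xs)
    xbar = sum(xs) / n
    v = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    half = v * t_crit / math.sqrt(n - 1)
    return xbar - half, xbar + half

xs = [4.1, 5.3, 3.8, 4.9, 5.4]             # n = 5 illustrative observations
lo, hi = t_confidence_interval(xs, 2.776)   # 95% interval, t_{0.025, 4} ~ 2.776
assert lo < sum(xs) / len(xs) < hi          # centered at the sample mean
```

Note that V/√(n − 1) equals s/√n, where s² is the (1/(n − 1))-form sample variance, so this agrees with the usual form of the t interval.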
PROBLEMS
Let be independent, chi-square random variables with and degrees of freedom, respectively. Show that has density
R1
R2
m and
n
(R1/m)/(R /n) 2 2)-1 2 ( / / m m x (m/n) (x) {3(m/2, n/2) (1 + mxfn)<m+n)/2 ' x > 0 fmn /n) is said to have the F distribution with m and n degrees offreedom, (R1/m)f(R 2 abbreviated F(m , n). Calculate the mean and variance _of the chi-square, t, and F distributions. (a) If has the t distribution with n degrees of freedom, show that has the F(1 , n) distribution. If R has the F(m, n) distribution, show that 1/R has the F(n, m) distribution. (c) If R1 is chi-square (m) and R2 is chi-square (n), show that R1 + R 2 is chi square (m + n). Discuss the problem of obtaining confidence intervals for the variance a2 of a =
2. 3.
T2
T
(b)
4.
normally distributed random variable, assuming that (a) The mean p is known (b) p is unknown be inde (A two-sample problem) Let pendent, with the normal p , and the normal (p1 , p , p 2 and unknown). Thus we are taking independent samples from two different normal populations. Show that if and i = 1 , 2, are the sample mean and variance of the two samples, and
R11 ,aR2 12 , • • • , R1n1 , R21, R22 , • • • , R2n2 a2 R2 i R1 i ( 2 a2) ( 1 ) Ri Vi2 , 1 2 / n n + 2 2 1 [ l k (nl vl2 + n2 v2 ) /2 nln2(nl + n2 - 2)J then [R1 - R2 - kta. 1 2 , n 1+n 2_ 2 , R1 - R2 + kta. 1 2 , n1+n 2_ 2 ] is a confidence interval for p 1 - p 2 with confidence coefficient In Problem 5, assume that the samples have different variances a12 and a22 • Discuss the problem of obtaining confidence intervals for the ratio a12/a22 • 7. (a) Suppose that C(R) is a confidence set for y(O) with confidence coefficient >1 that is, for all E N P9{y(O) E C(R)} > 1 Consider the hypothesis-testing problem H0 : y (O) k H1 : y (O) k and the following test. q;k(x) 1 if k tj: C(x) if k C(x) 0 [Thus C(x) is the accep tance region of q;k.] Show that q;k is a test at level
S.
=
1 - (X.
6.
- (X ;
-
(X
0
=
"#:
=
=
E
(X .
(b) Suppose that for all k in the range of γ there is a nonrandomized test φ_k [i.e., φ_k(x) = 0 or 1 for all x] at level α for H₀: γ(θ) = k versus H₁: γ(θ) ≠ k. Let C(x) be the set {k: φ_k(x) = 0}. Show that C(R) is a confidence set for γ(θ) with confidence coefficient ≥ 1 − α. This result allows the confidence interval examples in this section to be translated into the language of hypothesis testing.
*8.7 THE MULTIDIMENSIONAL GAUSSIAN DISTRIBUTION
If R′₁, ..., R′ₙ are independent, normally distributed random variables and we define random variables R₁, ..., Rₙ by Rᵢ = Σ_{j=1}^n aᵢⱼR′ⱼ + bᵢ, i = 1, ..., n, the Rᵢ have a distribution of considerable importance in many aspects of probability and statistics. In this section we examine the properties of this distribution and make an application to the problem of prediction.
Let R = (R₁, ..., Rₙ) be a random vector. The characteristic function of R (or the joint characteristic function of R₁, ..., Rₙ) is defined by

M(u₁, ..., uₙ) = E[exp (i(u₁R₁ + ··· + uₙRₙ))]
             = ∫_{−∞}^∞ ··· ∫_{−∞}^∞ exp (i Σ_{k=1}^n u_k x_k) dF(x₁, ..., xₙ),  u₁, ..., uₙ real

where F is the distribution function of R. It will be convenient to use a vector-matrix notation. If u = (u₁, ..., uₙ) ∈ Eⁿ, u will denote the column vector with components u₁, ..., uₙ. Similarly we write x for col (x₁, ..., xₙ) and R for col (R₁, ..., Rₙ). A superscript t will indicate the transpose of a matrix. Just as in one dimension, it can be shown that the characteristic function determines the distribution function uniquely.
DEFINITION. The random vector R = (R₁, ..., Rₙ) is said to be Gaussian (or R₁, ..., Rₙ are said to be jointly Gaussian) iff the characteristic function of R is

(8.7.1)    M(u₁, ..., uₙ) = exp [iuᵗb − (1/2)uᵗKu]

where b₁, ..., bₙ are arbitrary real numbers and K is an arbitrary real symmetric nonnegative definite n by n matrix. (Nonnegative definite means that Σ_{r,s=1}^n a_r K_{rs} a_s is real and ≥ 0 for all real numbers a₁, ..., aₙ.)
We must show that there is a random vector with this characteristic function. We shall do this in the proof of the next theorem.

Theorem 1. Let R be a random n-vector. R is Gaussian iff R can be expressed as WR′ + b, where b = (b₁, ..., bₙ) ∈ Eⁿ, W is an n by n matrix, and R′₁, ..., R′ₙ are independent normal random variables with 0 mean. The matrix K of (8.7.1) is given by WDWᵗ, where D = diag (λ₁, ..., λₙ) is a diagonal matrix with entries λⱼ = Var R′ⱼ, j = 1, ..., n. (To avoid having to treat the case λⱼ = 0 separately, we agree that normal with expectation m and variance 0 will mean degenerate at m.) Furthermore, the matrix W can be taken as orthogonal.
PROOF. If R = WR′ + b, then

E[exp (iuᵗR)] = exp [iuᵗb] E[exp (iuᵗWR′)] = exp [iuᵗb] E[exp (ivᵗR′)],  where v = Wᵗu

But

E[exp (ivᵗR′)] = E[Π_{k=1}^n exp (iv_kR′_k)] = Π_{k=1}^n E[exp (iv_kR′_k)] = exp [−(1/2) Σ_{k=1}^n λ_k v_k²]

Set v = Wᵗu to obtain

E[exp (iuᵗR)] = exp [iuᵗb − (1/2)uᵗKu]

where K = WDWᵗ. K is clearly symmetric, and is also nonnegative definite, since uᵗKu = vᵗDv = Σ_{k=1}^n λ_k v_k² ≥ 0, where v = Wᵗu. Thus R is Gaussian. (Notice also that if K is symmetric and nonnegative definite, there is an orthogonal matrix W such that WᵗKW = D, where D is the diagonal matrix of eigenvalues of K. Thus K = WDWᵗ, so that it is always possible to construct a Gaussian random vector corresponding to a prescribed K and b.)
Conversely, let R have characteristic function exp [iuᵗb − (1/2)uᵗKu], where K is symmetric and nonnegative definite. Let W be an orthogonal matrix such that WᵗKW = D = diag (λ₁, ..., λₙ), where the λᵢ are the eigenvalues of K. Let R′ = Wᵗ(R − b). Then

E[exp (iuᵗR′)] = E[exp (iuᵗWᵗ(R − b))] = exp [−ivᵗb] E[exp (ivᵗR)],  where v = Wu
             = exp [−(1/2)vᵗKv] = exp [−(1/2)uᵗWᵗKWu] = exp [−(1/2) Σ_{k=1}^n λ_k u_k²]
It follows that R′₁, ..., R′ₙ are independent, with R′ⱼ normal (0, λⱼ). Since W is orthogonal, Wᵗ = W⁻¹; hence R = WR′ + b.
The matrix K has probabilistic significance, as follows.

Theorem 2. In Theorem 1 we have E(R) = b, that is, E(Rⱼ) = bⱼ, j = 1, ..., n, and K is the covariance matrix of the Rᵢ, that is, K_rs = Cov (R_r, R_s), r, s = 1, ..., n.

PROOF. Since the R′ⱼ have finite second moments, so do the Rᵢ. E(R) = b follows immediately by linearity of the expectation. Now the covariance matrix of the Rᵢ is

E[(R − b)(R − b)ᵗ] = E[WR′(WR′)ᵗ]

where E(A) for a matrix A means the matrix [E(A_rs)]. Thus the covariance matrix is

E[WR′(WR′)ᵗ] = WE(R′R′ᵗ)Wᵗ = WDWᵗ = K

since D is the covariance matrix of the R′ⱼ.
The representation of Theorem 1 yields many useful properties of Gaussian vectors. Let R be Gaussian with representation R = WR′ + b, W orthogonal, as in Theorem 1.

Theorem 3. 1. If K is nonsingular, then the random variables R*ⱼ = Rⱼ − bⱼ are linearly independent; that is, if Σ_{j=1}^n aⱼR*ⱼ = 0 with probability 1, then all aⱼ = 0. In this case R has a density given by

f(x) = (2π)^{−n/2} (det K)^{−1/2} exp [−(1/2)(x − b)ᵗK⁻¹(x − b)]

2. If K is singular, the R*ⱼ are linearly dependent. If, say, {R*₁, ..., R*_r} is a maximal linearly independent subset of {R*₁, ..., R*ₙ}, then (R₁, ..., R_r) has a density of the above form, with K replaced by K_r = the first r rows and columns of K. R*_{r+1}, ..., R*ₙ can be expressed (with probability 1) as linear combinations of R*₁, ..., R*_r.
PROOF. 1. If K is nonsingular, all λⱼ are > 0; hence R′ has density

f′(y) = (2π)^{−n/2} (λ₁ ··· λₙ)^{−1/2} exp [−(1/2) Σ_{k=1}^n y_k²/λ_k]
      = (2π)^{−n/2} (det K)^{−1/2} exp [−(1/2) yᵗD⁻¹y]
The Jacobian of the transformation x = Wy + b is det W = ±1; hence R has density

f(x) = f′(Wᵗ(x − b)) = (2π)^{−n/2} (det K)^{−1/2} exp [−(1/2)(x − b)ᵗWD⁻¹Wᵗ(x − b)]

Since K = WDWᵗ, we have K⁻¹ = WD⁻¹Wᵗ, which yields the desired expression for the density. Now if Σ_{j=1}^n aⱼR*ⱼ = 0 with probability 1,

0 = E[|Σ_{j=1}^n aⱼR*ⱼ|²] = Σ_{r,s=1}^n a_r E(R*_r R*_s) a_s = Σ_{r,s=1}^n a_r K_rs a_s

Since K is nonsingular, it is positive rather than merely nonnegative definite, and thus all a_r = 0.
2. If K is singular, then Σ_{r,s=1}^n a_r K_rs a_s will be 0 for some a₁, ..., aₙ, not all 0. (This follows since uᵗKu = Σ_{k=1}^n λ_k v_k², where v = Wᵗu; if K is singular, then some λⱼ is 0.) But by the analysis of case 1, E[|Σ_{j=1}^n aⱼR*ⱼ|²] = 0; hence Σ_{j=1}^n aⱼR*ⱼ = 0 with probability 1, proving linear dependence. The remaining statements of 2 follow from 1.

REMARK. The result that K is singular iff the Rⱼ are linearly dependent is true for arbitrary random variables with finite second moments, as the above argument shows.
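Theorems 1 and 2 can be checked by simulation: pick an orthogonal W, eigenvalues λⱼ, and b (the numbers below are illustrative, not from the text), generate R = WR′ + b with R′_k normal (0, λ_k), and compare the sample mean and covariance with b and K = WDWᵗ. A rough sketch:

```python
import math
import random

random.seed(0)

# Illustrative ingredients (not from the text): an orthogonal W, eigenvalues, and b
theta = math.pi / 6
c, s = math.cos(theta), math.sin(theta)
W = [[c, -s], [s, c]]            # a rotation, hence orthogonal
lam = [4.0, 1.0]                 # D = diag(4, 1), the variances of R'_1, R'_2
b = [1.0, -2.0]

# K = W D W^t, the covariance matrix Theorem 2 predicts
K = [[sum(W[i][k] * lam[k] * W[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

# Generate R = W R' + b and estimate mean and covariance empirically
N = 100_000
samples = []
for _ in range(N):
    rp = [random.gauss(0.0, math.sqrt(l)) for l in lam]
    samples.append([W[i][0] * rp[0] + W[i][1] * rp[1] + b[i] for i in range(2)])

mean = [sum(x[i] for x in samples) / N for i in range(2)]
cov = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in samples) / N
        for j in range(2)] for i in range(2)]
```

With these numbers K = [[3.25, 1.299...], [1.299..., 1.75]], and the empirical mean and covariance land close to b and K.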
Example 1. Let (R₁, R₂) be Gaussian. Then

K = [σ₁²  σ₁₂]
    [σ₁₂  σ₂²]

where σ₁² = Var R₁, σ₂² = Var R₂, σ₁₂ = Cov (R₁, R₂). Also, det K = σ₁²σ₂²(1 − ρ₁₂²), where ρ₁₂ is the correlation coefficient between R₁ and R₂. Thus K is singular iff |ρ₁₂| = 1. In the nonsingular case we have

f(x, y) = [2πσ₁σ₂(1 − ρ₁₂²)^{1/2}]⁻¹ exp {−[2(1 − ρ₁₂²)]⁻¹ [(x − a)²/σ₁² − 2ρ₁₂(x − a)(y − b)/σ₁σ₂ + (y − b)²/σ₂²]}

where a = E(R₁), b = E(R₂). The characteristic function of (R₁, R₂) is

M(u₁, u₂) = exp [i(au₁ + bu₂)] exp [−(1/2)(σ₁²u₁² + 2σ₁₂u₁u₂ + σ₂²u₂²)]

Notice that if n = 1, (8.7.1) reduces to the ordinary Gaussian distribution. For in this case we have

M(u) = e^{ium} e^{−u²σ²/2},  f(x) = [1/(σ√(2π))] e^{−(x−m)²/2σ²}

where m = E(R).
Theorem 4. If R₁ is a Gaussian n-vector and R₂ = AR₁, where A is an m by n matrix, then R₂ is a Gaussian m-vector.

PROOF. Let R₁ = WR′ + b as in Theorem 1. Then R₂ = AWR′ + Ab, and hence R₂ is Gaussian by Theorem 1.

COROLLARY. (a) If R₁, ..., Rₙ are jointly Gaussian, so are R₁, ..., R_m, m ≤ n.
(b) If R₁, ..., Rₙ are jointly Gaussian, then a₁R₁ + ··· + aₙRₙ is a Gaussian random variable.

PROOF. For (a) take A = [I 0], where I is an m by m identity matrix. For (b) take A = [a₁ a₂ ··· aₙ].
Thus we see that if R₁, ..., Rₙ are jointly Gaussian, then the Rᵢ are (individually) Gaussian. The converse is not true, however. It is possible to find Gaussian random variables R₁, R₂ such that (R₁, R₂) is not Gaussian, and in addition R₁ + R₂ is not Gaussian. For example, let R₁ be normal (0, 1) and define R₂ as follows. Let R₃ be independent of R₁, with P{R₃ = 0} = P{R₃ = 1} = 1/2. If R₃ = 0, let R₂ = R₁; if R₃ = 1, let R₂ = −R₁. Then P{R₂ ≤ y} = (1/2)P{R₁ ≤ y} + (1/2)P{−R₁ ≤ y} = P{R₁ ≤ y}, so that R₂ is normal (0, 1). But if R₃ = 0, then R₁ + R₂ = 2R₁, and if R₃ = 1, then R₁ + R₂ = 0. Therefore P{R₁ + R₂ = 0} = 1/2; hence R₁ + R₂ is not Gaussian. By corollary (b) to Theorem 4, (R₁, R₂) is not Gaussian.
Notice that if R₁, ..., Rₙ are independent and each Rᵢ is Gaussian, then the Rᵢ are jointly Gaussian (with K = the diagonal matrix of variances of the Rᵢ).

Theorem 5. If R₁, ..., Rₙ are jointly Gaussian and uncorrelated, that is, K_ij = 0 for i ≠ j, they are independent.
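The counterexample preceding Theorem 5 is easy to see numerically: R₂ below is marginally normal (0, 1), yet R₁ + R₂ has an atom of mass 1/2 at 0, so it cannot be Gaussian. A small simulation sketch:

```python
import random

random.seed(1)
N = 100_000
zero_sum = 0
r2_within_1 = 0
for _ in range(N):
    r1 = random.gauss(0.0, 1.0)
    # R3 is an independent fair coin: R2 = R1 or -R1, each with probability 1/2
    r2 = r1 if random.random() < 0.5 else -r1
    if r1 + r2 == 0.0:
        zero_sum += 1
    if abs(r2) <= 1.0:
        r2_within_1 += 1

p_zero = zero_sum / N       # P{R1 + R2 = 0}: near 1/2, so R1 + R2 is not Gaussian
p_within = r2_within_1 / N  # P{|R2| <= 1}: near .6827, consistent with R2 normal (0, 1)
```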
PROOF. Let σᵢ² = Var Rᵢ. We may assume all σᵢ² > 0; if σᵢ² = 0, then Rᵢ is constant with probability 1 and may be deleted. Now K = diag (σ₁², ..., σₙ²); hence K⁻¹ = diag (1/σ₁², ..., 1/σₙ²), and so, by Theorem 3, R₁, ..., Rₙ have a joint density given by

f(x₁, ..., xₙ) = (2π)^{−n/2} (σ₁ ··· σₙ)⁻¹ exp [−(1/2) Σ_{i=1}^n (xᵢ − bᵢ)²/σᵢ²]

Since this is the product of the marginal normal densities, the Rᵢ are independent.
We now consider the following prediction problem. Let R₁, ..., R_{n+1} be jointly Gaussian. We observe R₁ = x₁, ..., Rₙ = xₙ and then try to predict the value of R_{n+1}. If the predicted value is ψ(x₁, ..., xₙ) and the actual value is x_{n+1}, we assume a quadratic loss (x_{n+1} − ψ(x₁, ..., xₙ))². In other words, we are trying to minimize the mean square difference between the true value and the predicted value of R_{n+1}. This is simply a problem of Bayes estimation with quadratic loss function, as considered in Section 8.3; in this case R_{n+1} plays the role of the state of nature and (R₁, ..., Rₙ) the observable. It follows that the best estimate is

ψ(x₁, ..., xₙ) = E(R_{n+1} | R₁ = x₁, ..., Rₙ = xₙ)

We now show that in the jointly Gaussian case ψ is a linear function of x₁, ..., xₙ. Thus the optimum predictor assumes a particularly simple form.
Say {R₁, ..., R_r} is a maximal linearly independent subset of {R₁, ..., Rₙ}. If R₁, ..., Rₙ, R_{n+1} are linearly dependent, there is nothing to prove; if R₁, ..., R_r, R_{n+1} are linearly independent, we may replace R₁, ..., Rₙ by R₁, ..., R_r in the problem. Thus we may as well assume R₁, ..., R_{n+1} linearly independent and, subtracting means, that E(Rᵢ) = 0 for all i. Then (R₁, ..., R_{n+1}) has a density, and the conditional density of R_{n+1} given R₁ = x₁, ..., Rₙ = xₙ is

h(x_{n+1} | x₁, ..., xₙ) = f(x₁, ..., x_{n+1}) / ∫_{−∞}^∞ f(x₁, ..., xₙ, t) dt,
f(x₁, ..., x_{n+1}) = (2π)^{−(n+1)/2} (det K)^{−1/2} exp [−(1/2) Σ_{r,s=1}^{n+1} q_rs x_r x_s]

where K is the covariance matrix of R₁, ..., R_{n+1} and Q = [q_rs] = K⁻¹.
Thus

h(x_{n+1} | x₁, ..., xₙ) = (const) exp [−Cx_{n+1}² − Dx_{n+1}]

where C = (1/2)q_{n+1,n+1} and

D = D(x₁, ..., xₙ) = Σ_{r=1}^n q_{n+1,r} x_r

Therefore the conditional density can be expressed as

(const) exp [D²/4C] exp [−C(x_{n+1} + D/2C)²]

Thus, given R₁ = x₁, ..., Rₙ = xₙ, R_{n+1} is normal with mean −D/2C and variance 1/2C = 1/q_{n+1,n+1}. Hence

ψ(x₁, ..., xₙ) = −D/2C = −(1/q_{n+1,n+1}) Σ_{r=1}^n q_{n+1,r} x_r
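For n = 1 the recipe ψ(x) = −q₂₁x/q₂₂ with Q = K⁻¹ reduces, for zero means and unit variances, to the familiar regression line ρx. A minimal check (the value ρ = 0.6 is arbitrary):

```python
# One observed variable R_1 predicting R_2; zero means, unit variances, correlation rho
rho = 0.6
K = [[1.0, rho], [rho, 1.0]]

# Q = K^{-1} for the 2x2 case
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
Q = [[K[1][1] / det, -K[0][1] / det],
     [-K[1][0] / det, K[0][0] / det]]

def psi(x):
    # best predictor of R_2 given R_1 = x:  -(q_21 x) / q_22
    return -Q[1][0] * x / Q[1][1]
```

Here psi(x) equals rho * x exactly, as the regression interpretation predicts.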
Tables

Common Density Functions and Their Properties

Type: Uniform on [a, b]
  Density: 1/(b − a), a < x < b
  Parameters: a, b real, a < b

Type: Normal
  Density: [1/(σ√(2π))] e^{−(x−μ)²/2σ²}
  Parameters: μ real, σ > 0

Type: Gamma
  Density: x^{α−1} e^{−x/β} / (Γ(α)β^α), x > 0
  Parameters: α, β > 0

Type: Beta
  Density: x^{r−1}(1 − x)^{s−1} / β(r, s), 0 < x < 1
  Parameters: r, s > 0

Type: Exponential (= gamma with α = 1, β = 1/λ)
  Density: λe^{−λx}, x > 0
  Parameters: λ > 0

Type: Chi-square (= gamma with α = n/2, β = 2)
  Density: x^{(n/2)−1} e^{−x/2} / (Γ(n/2) 2^{n/2}), x > 0
  Parameters: n = 1, 2, ...

Type: t
  Density: [Γ((n + 1)/2) / (√(nπ) Γ(n/2))] (1 + x²/n)^{−(n+1)/2}
  Parameters: n = 1, 2, ...

Type: F
  Density: (m/n)^{m/2} x^{(m/2)−1} / [β(m/2, n/2)(1 + (m/n)x)^{(m+n)/2}], x > 0
  Parameters: m, n = 1, 2, ...

Type: Cauchy
  Density: θ / [π(θ² + x²)]
  Parameters: θ > 0
Common Density Functions (continued)

Type: Uniform on [a, b]
  Mean: (a + b)/2
  Variance: (b − a)²/12
  Generalized characteristic function (if easily computable): (e^{−sa} − e^{−sb}) / [s(b − a)], all s

Type: Normal
  Mean: μ
  Variance: σ²
  Generalized characteristic function: e^{−sμ} e^{s²σ²/2}, all s

Type: Gamma
  Mean: αβ
  Variance: αβ²
  Generalized characteristic function: (βs + 1)^{−α}, Re s > −1/β

Type: Beta
  Mean: r/(r + s)
  Variance: rs / [(r + s)²(r + s + 1)]

Type: Exponential
  Mean: 1/λ
  Variance: 1/λ²
  Generalized characteristic function: λ/(s + λ), Re s > −λ

Type: Chi-square
  Mean: n
  Variance: 2n
  Generalized characteristic function: (2s + 1)^{−n/2}, Re s > −1/2

Type: t
  Mean: 0 if n > 1; does not exist if n = 1
  Variance: n/(n − 2) if n > 2; ∞ if n = 1 or 2

Type: F
  Mean: n/(n − 2) if n > 2; ∞ if n = 1 or 2
  Variance: 2n²(m + n − 2) / [m(n − 2)²(n − 4)] if n > 4; ∞ if n = 3 or 4

Type: Cauchy
  Mean: Does not exist
  Variance: Does not exist
  Generalized characteristic function: e^{−θ|u|}, s = iu, u real
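Entries of this table can be spot-checked by numerical integration. The sketch below verifies the gamma mean αβ and variance αβ² for the (arbitrary) choice α = 3, β = 2, using a simple midpoint rule:

```python
import math

def gamma_density(x, alpha, beta):
    # table entry: x^(alpha-1) e^(-x/beta) / (Gamma(alpha) beta^alpha), x > 0
    return x ** (alpha - 1) * math.exp(-x / beta) / (math.gamma(alpha) * beta ** alpha)

def moment(alpha, beta, k, hi=100.0, steps=100_000):
    # midpoint rule for the k-th moment on (0, hi]; the tail beyond hi is negligible here
    h = hi / steps
    return sum(((i + 0.5) * h) ** k * gamma_density((i + 0.5) * h, alpha, beta) * h
               for i in range(steps))

alpha, beta = 3.0, 2.0
m1 = moment(alpha, beta, 1)      # table: alpha * beta = 6
m2 = moment(alpha, beta, 2)
var = m2 - m1 ** 2               # table: alpha * beta^2 = 12
```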
Common Probability Functions and Their Properties

Type: Discrete uniform
  Probability p(k): 1/N, k = 1, 2, ..., N
  Parameters: N = 1, 2, ...

Type: Bernoulli
  Probability: p(1) = p, p(0) = q
  Parameters: 0 ≤ p ≤ 1, q = 1 − p

Type: Binomial
  Probability: C(n, k) p^k q^{n−k}, k = 0, 1, ..., n
  Parameters: 0 ≤ p ≤ 1, q = 1 − p, n = 1, 2, ...

Type: Poisson
  Probability: e^{−λ} λ^k / k!, k = 0, 1, ...
  Parameters: λ > 0

Type: Geometric
  Probability: pq^{k−1}, k = 1, 2, ...
  Parameters: 0 < p ≤ 1, q = 1 − p

Type: Negative binomial
  Probability: C(k − 1, r − 1) p^r q^{k−r}, k = r, r + 1, ...
  Parameters: 0 < p ≤ 1, q = 1 − p, r = 1, 2, ...

Common Probability Functions (continued)

Type: Discrete uniform
  Mean: (N + 1)/2
  Variance: (N² − 1)/12
  Generalized characteristic function: e^{−s}(1 − e^{−sN}) / [N(1 − e^{−s})], all s

Type: Bernoulli
  Mean: p
  Variance: p(1 − p)
  Generalized characteristic function: q + pe^{−s}, all s

Type: Binomial
  Mean: np
  Variance: np(1 − p)
  Generalized characteristic function: (q + pe^{−s})^n, all s

Type: Poisson
  Mean: λ
  Variance: λ
  Generalized characteristic function: exp [λ(e^{−s} − 1)], all s

Type: Geometric
  Mean: 1/p
  Variance: (1 − p)/p²
  Generalized characteristic function: pe^{−s} / (1 − qe^{−s}), |qe^{−s}| < 1

Type: Negative binomial
  Mean: r/p
  Variance: r(1 − p)/p²
  Generalized characteristic function: [pe^{−s} / (1 − qe^{−s})]^r, |qe^{−s}| < 1
Selected Values of the Standard Normal Distribution Function

F(x) = (2π)^{−1/2} ∫_{−∞}^x e^{−t²/2} dt,   F(−x) = 1 − F(x)

  x    F(x)  |   x    F(x)  |   x    F(x)  |   x    F(x)
  .0   .500  |   .9   .816  |  1.64  .950  |  2.33  .990
  .1   .540  |  1.0   .841  |  1.7   .955  |  2.4   .992
  .2   .579  |  1.1   .864  |  1.8   .964  |  2.5   .994
  .3   .618  |  1.2   .885  |  1.9   .971  |  2.6   .995
  .4   .655  |  1.28  .900  |  1.96  .975  |  2.7   .996
  .5   .691  |  1.3   .903  |  2.0   .977  |  2.8   .997
  .6   .726  |  1.4   .919  |  2.1   .982  |  2.9   .998
  .7   .758  |  1.5   .933  |  2.2   .986  |  3.0   .999
  .8   .788  |  1.6   .945  |  2.3   .989  |
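The table can be reproduced from the error function, since F(x) = (1/2)[1 + erf(x/√2)]; for example F(1.96) ≈ .975 and F(2.33) ≈ .990. A short sketch:

```python
import math

def Phi(x):
    # F(x) = (1/2) (1 + erf(x / sqrt(2))) for the standard normal distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# a few (x, F(x)) pairs from the table above
table = {0.0: 0.500, 1.28: 0.900, 1.64: 0.950, 1.96: 0.975, 2.33: 0.990, 3.0: 0.999}
```

Each tabulated value agrees with Phi(x) to within rounding of the three printed digits.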
A Brief Bibliography

An excellent source of examples and applications of basic probability is the classic work of W. Feller, Introduction to Probability Theory and Its Applications (John Wiley, Vol. 1, 1950; Vol. 2, 1966). Another good source is Modern Probability Theory by E. Parzen (John Wiley, 1960). Many properties of Markov chains and other stochastic processes are given in A First Course in Stochastic Processes by S. Karlin (Academic Press, 1966). A. Papoulis' Probability, Random Variables, and Stochastic Processes (McGraw-Hill, 1967) is a treatment of stochastic processes that is directed toward engineers.
A comprehensive treatment of basic statistics is given in Introduction to Mathematical Statistics by R. Hogg and A. Craig (Macmillan, 1965). A more advanced work emphasizing the decision-theory point of view is Mathematical Statistics, A Decision Theoretic Approach by T. Ferguson (Academic Press, 1967).
The student who wishes to take more advanced work in probability will need a course in measure theory. H. Royden's Real Analysis (Macmillan, 1963) is a popular text for such a course. For those with a measure theory background, J. Lamperti's Probability (W. A. Benjamin, 1966) gives the flavor of modern probability theory in a relatively light and informal way. A more systematic account is given in Probability by L. Breiman (Addison-Wesley, 1968).
Solutions to Problems
CHAPTER 1

Section 1.2

1. A ∩ B ∩ C = {4}, A ∪ (B ∩ Cᶜ) = {0, 1, 2, 3, 4, 5, 7}, (A ∪ B) ∩ Cᶜ = {0, 1, 3, 5, 7}, (A ∩ B) ∩ [(A ∪ C)ᶜ] = ∅
3. W = registered Whigs
   F = those who approve of Fillmore
   E = those who favor the electoral college

   x + y + z + 100 = 550
   x + y + 25 = 325

Thus z + 100 − 25 = 550 − 325 = 225, so z = 150.

[Venn diagram] PROBLEM 1.2.3

5. If a < x < b, then x ≤ b − 1/n for some n, hence x ∈ ∪_{n=1}^∞ (a, b − 1/n]. Conversely, if x ∈ ∪_{n=1}^∞ (a, b − 1/n], then a < x ≤ b − 1/n for some n, hence x ∈ (a, b). The other arguments are similar.
9. x ∈ A
Section 1.3

1. (a) Let A ⊂ Ω, and let F consist of ∅, Ω, A, and Aᶜ.
(b) Let A₁, A₂, ... be disjoint sets whose union is Ω. Let F consist of all finite or countable unions of the Aᵢ (including ∅ and Ω).
Section 1.4

1. .432
2. (a) The sequence of face values may be chosen in 9 ways (the high card may be anything from an ace to a six); the suit of each card in the straight may be chosen in 4 ways. Thus the probability of a straight is 9(4⁵)/C(52, 5).
(b) Select the face value to appear three times (13 choices); select the two other face values [C(12, 2) choices]. Select three suits out of four for the face value appearing three times, and select one suit for each of the two odd cards. Thus p = 13 C(12, 2) C(4, 3)(16)/C(52, 5).
(c) Select two face values for the pairs [C(13, 2) possibilities]; then choose two suits out of four for each of the pairs. Finally, choose the odd card from the 44 cards that remain when the two face values are removed from the deck. Thus p = C(13, 2) C(4, 2) C(4, 2)(44)/C(52, 5).
3. 149/432
(20)(16)/521° 4. (a) (52) (4 8 ) (b) [4(193)39 + 4 (!3)]/ (�3) · ·
5. We must have exactly one number > 8 (2 choices) and exactly 3 numbers < 8 [C(7, 3) possibilities]. Thus p = 2 C(7, 3)/C(10, 4).
6. (m + 1 )/(m� ) w
7. 1 − P(A₁ ∪ ··· ∪ Aₙ), where

P(A₁ ∪ ··· ∪ Aₙ) = n(1/n) − C(n, 2)(n − 2)!/n! + C(n, 3)(n − 3)!/n! − ··· + (−1)^{n−1} C(n, n) 0!/n!
               = 1 − 1/2! + 1/3! − ··· + (−1)^{n−1}/n!

11. (365)(364) ··· (365 − r + 1)/365^r
12. The number of arrangements with occupancy numbers r₁ = r₂ = 4, r₃ = r₄ = r₅ = 2, r₆ = 0 is C(14, 4)C(10, 4)C(6, 2)C(4, 2)C(2, 2) = 14!/(4! 4! 2! 2! 2! 0!). We must select 2 boxes out of 6 to receive 4 balls, then 3 boxes of the remaining 4 to receive 2; the remaining box receives 0. This can be done in C(6, 2)C(4, 3)C(1, 1) = 6!/(2! 3! 1!) ways. Thus the total number is 14! 6!/[(4!)²(2!)⁴3!].
Section 1.5

4. (a) This is a multinomial problem. The probability is

[30!/(10! 10! 10!)] (50/100)^{10} (30/100)^{10} (20/100)^{10}

(b) This is a binomial problem. The probability is

C(30, 12) (30/100)^{12} (70/100)^{18}

5. The probability that there will be exactly 3 ones and no twos (and 3 from {3, 4, 5, 6}) is, by the multinomial formula, [6!/(3! 0! 3!)](1/6)³(1/6)⁰(4/6)³. The probability of exactly 4 ones and 1 two is [6!/(4! 1! 1!)](1/6)⁴(1/6)¹(4/6)¹. The sum of these expressions is the desired probability.
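The two multinomial terms in Problem 5 can be evaluated directly; the sketch below computes them for the stated dice problem (six tosses of a fair die):

```python
from math import factorial

def multinomial(n, ks):
    # n! / (k_1! k_2! ... ), the multinomial coefficient
    out = factorial(n)
    for k in ks:
        out //= factorial(k)
    return out

# exactly 3 ones, no twos, 3 results from {3, 4, 5, 6}
p1 = multinomial(6, [3, 0, 3]) * (1/6) ** 3 * (1/6) ** 0 * (4/6) ** 3
# exactly 4 ones, 1 two, 1 result from {3, 4, 5, 6}
p2 = multinomial(6, [4, 1, 1]) * (1/6) ** 4 * (1/6) ** 1 * (4/6) ** 1
p = p1 + p2
```

The total comes out to about 0.0300.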
Section 1.6

1. (p⁴ + 6p⁵q⁵ + 5p⁶q⁴) / Σ_k C(10, k) p^k q^{10−k}
2. [1 − qⁿ − npq^{n−1} − C(n, 2)p²q^{n−2}] / (1 − qⁿ)
3. We have the following tree diagram:

[Tree diagram] PROBLEM 1.6.3

The desired probability is

P{unbiased coin used and n heads obtained} / P{n heads obtained} = (1/2)^{n+1} / [(1/2)^{n+1} + (1/2)pⁿ]

4. Let A = {heads}. By the theorem of total probability,

P(A) = Σ_{n=1}^∞ P{I = n} P(A | I = n) = Σ_{n=1}^∞ (1/2)ⁿ e⁻ⁿ = (1/2)e⁻¹ / [1 − (1/2)e⁻¹]

REMARK. To formalize this problem, we may take Ω = all pairs (n, i), n = 1, 2, ..., i = 0, 1 [where (n, 1) indicates that I = n and the coin comes up heads, and (n, 0) indicates that I = n and the coin comes up tails]. We assign p(n, i) = (1/2)ⁿe⁻ⁿ if i = 1, and (1/2)ⁿ(1 − e⁻ⁿ) if i = 0. Alternatively, a tree diagram can be constructed; a typical path is indicated in the diagram.

[Tree diagram, branches H and T] PROBLEM 1.6.4

5. (a) .36  (b) .48  (c) .15  (d) .01
7. By the theorem of total probability, if Xₙ is the result of the toss at t = n, then (see diagram)

yₙ₊₁ = P{Xₙ₊₁ = H} = P{Xₙ = H}P{Xₙ₊₁ = H | Xₙ = H} + P{Xₙ = T}P{Xₙ₊₁ = H | Xₙ = T}
     = yₙ(1/2) + (1 − yₙ)(3/4)

Thus yₙ₊₁ + (1/4)yₙ = 3/4. The solution to the homogeneous equation yₙ₊₁ + (1/4)yₙ = 0 is yₙ = A(−1/4)ⁿ. Since the "forcing function" 3/4 is constant, we assume as a "particular solution" yₙ = c. Then (5c)/4 = 3/4, or c = 3/5. Thus the general solution is yₙ = A(−1/4)ⁿ + 3/5. Since y₀ is given as 1/2, we have A = 1/2 − 3/5 = −1/10. Thus yₙ = 3/5 − (1/10)(−1/4)ⁿ.

[Transition diagram] PROBLEM 1.6.7
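The recursion and its closed form can be compared numerically; the iterate converges to the limit 3/5 at rate (1/4)ⁿ. A quick sketch:

```python
def y_closed(n):
    # y_n = 3/5 - (1/10) (-1/4)^n, derived above
    return 3/5 - (1/10) * (-0.25) ** n

def y_iter(n):
    # iterate y_{k+1} = 3/4 - (1/4) y_k from y_0 = 1/2
    y = 0.5
    for _ in range(n):
        y = 0.75 - 0.25 * y
    return y
```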
9. We have (see diagram)

P(D | R) = P(D ∩ R)/P(R) = .2(.9)(.25) / [.2(.9)(.25) + .8(.3)(.25)] = 3/7

[Tree diagram; P = positive, N = negative, R = rash] PROBLEM 1.6.9
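The Bayes computation reads off the tree weights; with the branch probabilities exactly as they appear in the solution, the posterior is 3/7. A one-line check:

```python
# The two branches of the tree leading to R, with the weights used in the solution
num = 0.2 * 0.9 * 0.25                       # P(D and R)
den = 0.2 * 0.9 * 0.25 + 0.8 * 0.3 * 0.25   # P(R), by total probability
p_d_given_r = num / den                      # P(D | R) = 3/7
```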
CHAPTE R 2 Section 2.2 1 . (a)
(b)
{w : R(w) E £ 1} = f2 E g;, hence E1 E �. If B 1 , B2, E
•
•
=
=
rc
is a sigma field containing the intervals , hence is at least as large as the smallest sigma-field containing the intervals. Thus all Borel sets belong to �-
Section 2.3

1. (a) F_R(x) = (1/2)eˣ, x < 0; F_R(x) = 1 − (1/2)e⁻ˣ, x ≥ 0
(b) 1. P{|R| ≤ 2} = P{−2 ≤ R ≤ 2} = ∫_{−2}^2 f_R(x) dx = F_R(2) − F_R(−2) = 1 − e⁻²
2. P{|R| ≥ 2 or R ≥ 0} = P{R ≥ −2} = 1 − F_R(−2) = 1 − (1/2)e⁻²
3. P{|R| ≤ 2 and R ≤ −1} = P{−2 ≤ R ≤ −1} = (1/2)(e⁻¹ − e⁻²)
4. P{|R| + |R − 3| ≤ 3} = P{0 ≤ R ≤ 3} = (1/2)(1 − e⁻³)
5. P{R³ − R² − R − 2 ≥ 0} = P{(R − 2)(R² + R + 1) ≥ 0} = P{R ≥ 2} (since R² + R + 1 is always > 0) = (1/2)e⁻²
6. P{e^{sin πR} ≥ 1} = P{sin πR ≥ 0} = P{0 ≤ R ≤ 1} + P{2 ≤ R ≤ 3} + P{4 ≤ R ≤ 5} + ··· + P{−2 ≤ R ≤ −1} + P{−4 ≤ R ≤ −3} + ···

But P{2n − 1 ≤ R ≤ 2n} = P{−2n ≤ R ≤ −2n + 1}, since f_R is an even function. Thus

P{e^{sin πR} ≥ 1} = Σ_{n=0}^∞ P{2n ≤ R ≤ 2n + 1} + Σ_{n=1}^∞ P{2n − 1 ≤ R ≤ 2n} = P{R ≥ 0} = 1/2

7. The rationals are a countable set, say {x₁, x₂, ...}. Hence P{R rational} = Σ_{i=1}^∞ P{R = xᵢ} = 0, so P{R irrational} = 1 − P{R rational} = 1.
2. By direct enumeration,

P{R = 0} = p⁵ + p⁴q + p³q² + p²q³ + pq⁴ + q⁵
P{R = 1} = 4p⁴q + 6p³q² + 6p²q³ + 4pq⁴
P{R = 2} = 3p³q² + 3p²q³
Section 2.4
1. 0 If
<
y
1, y l �} F2 (y) = P{Rl y} + P{Rl y = e-X dx + Jrool/Y e-X dx = 1 - e-v + e-l /y <
>
<
0
F2 is the integral of
/2 CY) = -ddy F2 (y) = e-v + y-12 e-t lv, 0 =0 elsewhere
2.
Hence
R2
is absolutely continuous.
f. 2CY) = 21Y e-1 Y e =0 ,
<
elsewhere
<
1
SOLUTIONS TO PROBLEMS
296
3.
.f2(y) = 2y-32 ,22 < y < 4 �y- 1 , y > 4 0, y < 2 iy/2 1 d.tJ 1 4. 2 < < 4, F2 (y) P{R2 < y} 1 � 4 < y < 5, F2 (y) y > 5 , F2(y) 1 P{R2 5} P{R1 > 2} R2 7. y () =
=
y
If
If If
=
=
=
=
7i
=
= � , so =g
Consider the graph of
x
=
X
=
2
y
is not absolutely continuous.
. . The horizontal line at height
y
will intersect
y
y
I I
--------�--���--�'-�-��'�� x h1(y)
ha(Y)
h 2(y)
PROBLEM
2.4. 7
If we choose a sufficiently small xi ' with x E the graph at points (say) xi 1 , k we have (except for finitely many about which lead to inter interval (a , sections at the endpoints of intervals) •
•
•
i Ii.
,
y, b) y, 2(b - F2(a P {a < R2 < b} ! P{ci < R1 < d1} d1 h1(b) j {i1, . , ik} hi c1ci hi(a) h1(b) d1 hi(a) {i1, . , ik} h1 1 di 0 j tj: {i1, . . . , i7c}. c1 {dj F2 (b - F2 (a ,2;Jc1 f1( dx ,2; [F1 (d;) - F1(c3)] F1 (di) - F1(ci) i( ( 0 ci > d1. b f2(b L� 1f1 [h b)] l h; b)l, Section 1. l 7/ 1 2 l + l - l C!) l + F ( 1 .5 - F (.5 FR(3) - FR(.5 � - /2 ! F
where
=
) =
)
=
and = and = = = (say) and = ) =
)
where to obtain
n
j= l
and and Thus
. . . .
if E ifj E
if
n
x
)
is interpreted as
) =
=
n
if Differentiate with respect to as desired.
2.5
(a) l (b) (c) (d)
R
)
R
) =
) =
=
is increasing at x1 is decreasing at xi
=
SOLUTIONS TO PROBLEMS
Section 2.6
4. (a) � � ( b) 32 32 (c)
(d) � (e) !
5 + In 4
(f) �
8
(g) 1
In each case, the probability is 1 (shaded area) . x+y=f
x -y =f
(a)
(b)
y = lx
(d)
(e)
(f) PROBLEM 2.6.4
Section 2. 7
1 . (a) .f1 (x) ( b)
/2 (Y)
3/ 8
= =
6x2
-
4x3(0 < x < 1),
2y(O < y < 1)
(g)
297
298
SOLUTIONS TO PROBLEMS
2. /1 (x) = 2e-x - 2e-2x , x > 0 f2 (y) = 2e-2Y, Y > 0 3. f3(z)
1
2
1
1
2
8. (a) P{R 12 + R 2
<
(b) Let A = {R 1
PROBLEM 2.7.3
1 1 1} =
<
::- y
1
0
r 1-x 2 x y) dy x H + d =� 480 x=O � =O 1 , R 2 < 1 }, B = {R 1 < 1 , R 2 > 1 } , C = {R 1 > 1 , R 2
<
1}
y J
2 1
B A
0
c 1
2
PROBLEM 2. 7. 8
..
"'
X
1 is P(A u B u C) = P(A) + P(B) + P(C) = -l + ! + ! = � The probability that exactly one of the random variables is < 1 is P(B u C) = P(B) + P(C) = !. If D = {exactly one random variable < 1} and E = { at least one randon1 variable < 1}, then P(D n E) P( D) 1/2 4 = = = P( D I E) = 5/8 5 P(E) P(E) 2 !(x + y) dy dx = ! . etc . Notice that , for example, P(B) = 0 1 (c) P{R1 < 1 , R 2 < 1} � "#: P{R1 < 1 }P{R 2 < 1} = (3/ 8 )2 , hence the random The probability that at least one of the random variables is
(
=
variables are not independent .
J:
1
<
)
SOLUTIONS TO PROBLEMS
Section 2.8
1 . f3(z) = l , 0
<
z
<
1 ; ,{3(z) = lz-3 12 , z
> 1 ; f3(z) = 0 , z
<
299
0
2. (a) /3 (z) = ze-z, z > O ; f3(z) = 0 , z 0 1 b) ( /3(z) = , z > O, f3(z) = 0 , z 0 (z + 1) 1 ) 3. /3 (z = 7T( 1 + z ) all z 1 4. f3 (z) = 2 , z > 1 , f3 (z) = O, z 1 z 5. 3 V3/8 6. 2/3 7. 8/27 2/33 1 0. Let R1 = arrival tin1e of the man , R 2 = arrival time of the woman. Then the probability that they will meet is <
<
2
2 '
<
9.
P{ I R 1 - R 1
2 <
z} =
shaded area = 1 - ( 1 - z)2 = z(2 - z) t o tal area
0
11.
[
1 -
(n - 1)
1 3 . fR (r) = o
L
1
PROBLEM 2 . 8 . 1 0
if (n - 1) d
re- r 2 f 2b 2 2 ' b
Section 2.9
2. (a) > 4600 n
]
d n
(b) 1 - se-2
1
z
r
< L,
and 0 if (n - 1) d
> 0 ; foo (8) = 217T '
0
< f) <
>
2 7T .
L
300
SOLUTIONS TO PROBLEMS
3 . (a) No. (b). 9
k
4. P{Rl + R2 = k} = L P{Rl = i, R2 = k - i} i =O
[We may select k positions out of + m in (n trn) ways . The number of selections in which exactly i posi ti ons are chosen from the first is ( f) (kmi ) . Sum over i to obtain (a).] Thus P{R1 + R 2 = k} = ( n t m)p Tcqn+m-k , k = 0 , 1 , + m. (Intuitively , R1 + R 2 is the number of successes in n + m Bernoulli trials , with probability of success p on a given trial . ) Now P{R1 = j, R1 + R2 = k } =
n
n
... ,n
P{R1 = j, R 2 = k - j} = (j)p iqn- i (kmi)pk-jqm -k+ i = (j ) (kmi)pkqn+m-k , j = +m 1 , . . . , n , k = j, j + 1 ,
...,n
0,
7�;,�)) ,
the hypergeometric probability Thus P{R1 = j I R1 + R 2 = k} = function (see Problem 7 of Section 1 .5). Intuitively , given that k successes have occurred in + m trials , the positions for the successes may be chosen in ( n t m) ways . The number of such selections in which j successes occur in the first .) . trials is (�) (km 3 -J
n
n
CHAPTER
3
Section 3.2 = =
n+ _ 4 d d e-"' x e-"' 2e-"'d;c i � "' dx + Ia x L + J:2 + 3 + e-1 - e-2 + (e-2 - e-3) + 3 (e-3 - e-4) + +
2
·
·
·
·
·
·
·
·
·
SOLUTIONS TO PROBLEMS
30 1
n+ 1 e-x dx = e-n - e-(n+l) (b) P {R 2 = n} = P {n < R1 < n + 1 } = n E(R2) = 2 nP{R 2 = n} = 2 n [e-n - e-< n+1 > ] n=O n=O co
J
co
e-1 = as above . 1 - e-1
3. (a) 1
4.
(b) 0
(c) 1
1 /3
5. 2 + 30e-3
8. E[R(R - 1)
·
·
·
(R
- r
�
+ 1)] = 2
k=O
k(k - 1 )
·
·
I
1
00
= _Ar 2 k=r (k -
r) !
(k - r k! ·
+ 1)
e-;.;J� �
e-)._Ak-r = _Are-). e). =
xr
= 1 to obtain E(R) = .A ; set r = 2 to obtain E(R 2 - R) = .A2 , hence E(R2) = .A + .A2 • It follows that Var R = .A.
Set
r
Section
3.3
1 . This is immediate from Theorem 2 of Section 2.7.
2. E[(R - n1) n ] = 0, n odd Section
= an (n - 1 ) (n
3.4
- 3)
·
·
·
(5) (3) (1), n even
1 . Let a(R 1 - ER1) + h(R 2 - ER 2) = 0 (with probability 1). If, say, h "#: 0 then we may write R 2 - ER 2 = c(R1 - ER1). Thus a22 = c2 a12 and Cov (R1 , R 2) = ca12 2 ca1 • Therefore p(R1 , R?) = , hence I PI = 1 . I c I a1 2 4 . In (a), let R take the values 1 , 2, . . . , n, each with probability 1 /n, and set R1 = g(R) , R 2 = h(R) , where g(i) = ai , h(i) = hi . Then
� 1 h i2 E(R 22) = £i =1 n -
In (b) let R be uniformly distributed between R 2 = h (R). Then E(R1R 2) =
i
a
and h, and set R1
b g(x)h(x)
a
d x, h -a
In each case the result follows from Theorem 2 of Section 3 . 4 . 5. This follows from the argument of property 7 , Section 3 . 3 .
= g (R),
SOLUTIONS TO PROBLEMS
302
3.5
Section n! (l) i+k (2 )n-j-k k • P R1 - R 2 j ! k ! (n j k) ! 2.9) . 1, j, k 0 , 1 , . . . , n , j + k n 4. (n - 1)p(1 - p). n-1 1 5. A i n - 1 R0 i�= 1 IAi + i + 2 , IA · R02 i = 1IA 2 + 2i""< i I4 .IA ·· IA .IA · 0, IA · E(IAiiA ) P(A i )P(A1) (pq)2 . . n-3 n- 1 E R02) (n - 1)pq + 2 � � (pq) 2 i= 1 i=i+ 2 n-3 (n - 1)pq + 2(pq)2 i�=1 - 2 (n - 1)pq + 2(pq)22 [1 + 2 + + (n - 3)] (n - 1)pq + (pq) (n - 3)(n - 2) . E R02) - [E R0)] 2 (n - 1)pq + (n - 2) (n - 3)(pq)2 R0 n 2). (n - 1)2(pq100)2, q 1 - p 6. 50 (49/50) Section 1. . 532 (b) - 2 . 84 Section 3. 7 1 . P{I R - ml a2 1 . m J.r xe-x dx 1 , E R2) f� x2e-x dx 2, JA-k e-x dx + f� k e-x dx ka P{I R - 11 k . 0 k 1, 1 - 1+k - 1-1 k - d 3
{
-
}
], ·
_
<
=
Let
=
and
s
_
3
Section
(see Example
in failure}. Then
{trial i results in success and trial i
=
� �
�
.
�
.-.
independent, with
z
J
1.
H-1
=
and if]. >
=
J
=
and
J
are
Thus
=
=
(
(n
=
=
- i)
·
·
·
=
Therefore , Var
=
( (assum ing
(
=
>
=
-
3.6
(a)
> = = hence = ( > } If < = } = this is < e < k> + e < > . When k > 1 , it becomes f�r-k e x x = e < + >. Notice that the Chebyshev bound is vacuous when k < 1 , and for k > 1 , e- < 1+ k > approaches zero much more rapidly than 1 /k2• =
=
CHAPTER
Section 1 . T (r)
4.2
Thus =
Now
4
=
f� tr-1e- t d
t
=
(with
t
=
x2)
2 f� x2r-1e-x2 dx.
T(r)T(s) 4 f�f� x2r-1y2s-1e-<x4Y2> dx dy fJ)2 s-1e-P 2 p2 r+2 s-1 dp . 4 f012 J� u p2)� f� ur+s-1e-u du f� p2r+2 s-1e-P 2 d p � T(r + s) T (r)r (s) irr/2 fJ)2r-1 (sin fJ)2s-1 dfJ 2T (r + s) =
dfJ
(in polar coordinates)
Therefore
=
=
0
(set
(cos
=
(cos fJ)2r-1 (sin
=
303
SOLUTIONS TO PROBLEMS
Let z = cos2 () so that 1 - z = sin2 0 , dz = - 2 cos () sin () dO , dz d() = - 1 12 2z (1 - z)1 /2
Thus
3. 5.
T(r)T(s) = 2T(r + s) ---
_
!.
2
lo 1
z r-1 (1 - z) s-1 dz = � {3 (r , s)
3 3 P 1 (e- - e-5) + P 2 (e-4 - e-8) + p3(e- - e-9) + P 4 (1 - e-8) + p 5 (1 - e-5)
0
/2 CY) = � , 1
y > 1
Section 4.3
1 . h(y I x) = ex-y , 0 < x < y, and 0 elsewhere ; P{R 2 < y I R1 = x} = 1 y > x, and 0 elsewhere.
2. J1 (x) = = /2 (y) = =
l l f io
x
-
1
x
-1
-
ero-11 ,
kx dy = kx(x + 1), 0 < x < 1 -kx dy = - kx (x + 1), - 1 < x < 0
kx dx = � k (1 - y2), 0 < y < 1 - kx dx +
1
1
kx dx = �k( l + y2), - 1 < y < 0
Since f�t .f1 (x) dx = f�1 f2 (y) dy = k, we must have k = 1 . The conditional density of R2 given R1 is (x, y) 1 f = , -1 < x < 1, -1 < y < x h2 (y i x) = x + x ) ( 1 t f
The conditional density of R1 given R 2 is ht (x I y) =
J(x, y) 2x = , 0 < y < 1, y < x < 1 2 ) 1 y MY _
2x , - 1 < y < 0, 0 < X < 1 1 + Y2 2x - 1 < y < 0, y < - 1 + y2 '
X
<0
4. The conditional density of R3 given R1 = x is h(z I x) = e- z , z P{1 < R3 < 2 1 R1 = x} = e-1 - e-2 , x > 0
>
0, x
>
0,
304 5.
SOLUTIONS TO PROBLEMS
P{g (R1 , RJ < z I R1 = x} = P{g(x , R2) < z I R1 = x} =
Section 4.4
J
{y: g ( x, y) < z}
h (y I x) dy
[ (x, y) 8 xy 2y 0
l:yh2(Y l x) dY = ry en dy ix, 0 < X < 1 x E(R1 1 R2 = y) = J:xh1 (x · i· Y) dx = f x { t � 2 ) dx = ! �� = �� ' y 0
=
x)
(c) The conditional density of (R1 , R2) given A is
= 1 28xy , {l' J dxJ 8xy dy The conditional density of R2 given A is l/ 2 NY I A) = f_oo .f (x, y I A) dx = l 1 28xy dx = 1 6y - 64-if , "" f (x, y) A) (x , = P = f yI (A)
8 xy
l/ 2
0
0
Y
00 E(R 2 I A) = Y 2(y A ) dy _ 00 / I
1
= 0 elsewhere { 1/ 2 = y (l6y - 64y3) dy
oJ
E(R 214) Alternatively , E(R2 I A) = P (A ) -
.r:/2 f'·y (Sxy) dy dx lf 2 x 8xy dy dx
f f 0
1 /60 = 4/ 1 5 = 1/16
0
Alternatively, E(R 2 I A) = J0000 .f1 (x I A)E(R2 I R1 x) dx, where .f1 (x I A) = /1 (x)fP (A) if 0 < x < � , and 0 elsewhere. Thus l / 2 4x3 -1- �x dx = 4/ 1 5 E(R2 I A) = 1 16 0 ( 1 + x) n 2. + xn x = x1 + n! '
i
.
o
o
=
SOLUTIONS TO PROBLEMS
3. E(R2 I R1
=
x)
=
=
4.
� (n - k) 5. (a) 3/7 (b) 19/21 7. E(R) = 1 P{R where
=
=
305
O<x <1 x/2, 1 <X <2 1/2 , (x - 1 )/2,
1 } + 2P{R
=
=
2} + 3P{R
3}
1 c�g) cig) cig) 2 30 I 1 1 1 0 0 0 P{R - 1 } - 3 ! ! ! (.5 ) (. 3) (.2) + 3 10 10 10 (\�0 ) P{R = 2} = �Ci�)(.3)12 (.7) 1s P{R = 3} = 1 - P{R = 1 } - P{R = 2} 8. (a) P{R3 < z} = P{R3 < z , R12 + R 22 < 1} + P{R3 < z , R12 + R22 > 1}. If 0 < z < 1 , then R1 2 + R22 > 1 implies R3 = 2 > z , so R{R3 < z , R12 + R2 2 > 1 } = 0 . Thus _
F3 (z) =
_
P {R3 < z , R12 + R22 < 1}
=
� [z ( 1 - z2) 1 1 2 + arc sin z] , 0 < z < 1 If 1 < z < 2, P{R3 < z , R1 2 + R 22 > 1} is still 0 , and F3(z) = P{R3 < R12 + R 22 < 1} = P{ R12 + R 22 < 1} = 7Tf4. If z > 2, P{R3 < z} = 1 .
1 7f"
4
0
1
2
PROBLEM
By (4 .4 .6), E(R:J
=
2P{R3
=
2 1 -
=
4.4.8
2} +
fz (l - z2)112 dz
( ;) + � � - ; . =
z,
SOLUTIONS TO PROBLEMS
306
(b) E(R3) = ∫_{-∞}^∞ ∫_{-∞}^∞ g(x, y)f(x, y) dx dy = ∫∫_{x^2+y^2<1} x f1(x)f2(y) dx dy + ∫∫_{x^2+y^2>1} 2 f1(x)f2(y) dx dy
= ∫_0^1 x(1 - x^2)^{1/2} dx + 2(1 - π/4) = 1/3 + 2(1 - π/4) = 7/3 - π/2, as before.
(c) Since R3 = 2 when R1^2 + R2^2 > 1, E(R3 | R1^2 + R2^2 > 1) = 2. The conditional density of (R1, R2) given A = {R1^2 + R2^2 < 1} is
f(x, y | A) = f(x, y)/P(A) = 4/π, x^2 + y^2 < 1, x, y > 0; = 0 elsewhere
Since R1^2 + R2^2 < 1 implies R3 = R1,
E(R3 | A) = ∫_{-∞}^∞ ∫_{-∞}^∞ x f(x, y | A) dx dy = ∫_{-∞}^∞ x f1(x | A) dx
where f1(x | A) = ∫_{-∞}^∞ f(x, y | A) dy = (4/π)(1 - x^2)^{1/2}, 0 < x < 1; = 0 elsewhere
Thus E(R3 | A) = ∫_0^1 (4/π) x(1 - x^2)^{1/2} dx = 4/3π
Alternatively, E(R3 | A) = E(R1; A)/P(A) = [∫_0^1 x ∫_0^{(1-x^2)^{1/2}} dy dx]/(π/4) = (1/3)/(π/4) = 4/3π
Therefore
E(R3) = E(R3 | A)P(A) + E(R3 | A^c)P(A^c) = (4/3π)(π/4) + 2(1 - π/4) = 7/3 - π/2, as before.
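The value E(R3) = 7/3 - π/2 in Problem 8 can be checked with a small Riemann-sum sketch (R1, R2 uniform on (0, 1) as in the problem; the grid size is arbitrary):

```python
import math

# Riemann-sum check that E(R3) = 7/3 - pi/2, where R1, R2 are independent and
# uniform on (0, 1), and R3 = R1 if R1^2 + R2^2 <= 1, R3 = 2 otherwise.

def expected_R3(n=1000):
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            g = x if x * x + y * y <= 1.0 else 2.0
            total += g * h * h
    return total

print(expected_R3(), 7/3 - math.pi/2)
```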
10. Let R3 = R1 + R2, R4 = R1 - R2. The joint density of R3 and R4 is
f34(u, v) = f12(x, y)|∂(x, y)/∂(u, v)|
where u = x + y, v = x - y, x = ½(u + v), y = ½(u - v) (see Section 2.8, Problem 12). Thus f34(u, v) = ½, 0 < x < 1, 0 < y < 1. But 0 < x < 1, 0 < y < 1 corresponds to the square in the (u, v) plane with vertices (0, 0), (1, 1), (2, 0), (1, -1) (see diagram).

[Figure: the square with vertices (0, 0), (1, 1), (2, 0), (1, -1) — PROBLEM 4.4.10]

The density of R4 is f4(v) = ∫_{-∞}^∞ f34(u, v) du = 1 - |v|, |v| < 1; = 0 elsewhere. Therefore
h3(u | v) = f34(u, v)/f4(v) = 1/[2(1 - v)], 0 < v < 1, v < u < 2 - v
= 1/[2(1 + v)], -1 < v < 0, -v < u < 2 + v
If 0 < v < 1, E(R3^2 | R4 = v) = [1/2(1 - v)] ∫_v^{2-v} u^2 du = (4 - 2v + v^2)/3
If -1 < v < 0, E(R3^2 | R4 = v) = [1/2(1 + v)] ∫_{-v}^{2+v} u^2 du = (4 + 2v + v^2)/3
Thus E[(R1 + R2)^2 | R1 - R2 = v] = (1/3)[4 - 2|v| + v^2], -1 < v < 1.
11. x^2 + 7/6
14. f_R(λ | R1 = x1, …, Rn = xn) = λ^x(1 - λ)^{n-x}/β(x + 1, n - x + 1), 0 < λ < 1, where x = x1 + ⋯ + xn
E(R | R1 = x1, …, Rn = xn) = (x + 1)/(n + 2)
16. (np - npq^{n-1})/(1 - q^n - npq^{n-1})
17. (1/6 + 310(3 - 2^{3/2}))/(160 - 110√2)
18. f12(x, y | z) = 1/πz^2, x^2 + y^2 < z^2, and 0 elsewhere
(a) E(D | R = z) = ∫_{-∞}^∞ ∫_{-∞}^∞ (x^2 + y^2)^{1/2} f12(x, y | z) dx dy = (1/πz^2) ∫∫_{x^2+y^2<z^2} (x^2 + y^2)^{1/2} dx dy
= (1/πz^2) ∫_0^{2π} dθ ∫_0^z r^2 dr = 2z/3
Thus E(D) = ∫_0^∞ (2z/3)e^{-z} dz = 2/3
19. (b) d(x) = -1, -3 < x < -1
= 0, -1 < x < 1
= 1, 1 < x < 3
E_min[(θ* - θ)^2] = 1/2
20. ½(x + 1)
CHAPTER 5
Section 5.2
1. N0(s) = N1(s)N2(s)N3(s) = [(e^s - e^{-s})/2s]^3 = (1/8s^3)(e^{3s} - 3e^s + 3e^{-s} - e^{-3s}), all s ≠ 0
Thus
f0(x) = (1/16)[(x + 3)^2 u(x + 3) - 3(x + 1)^2 u(x + 1) + 3(x - 1)^2 u(x - 1) - (x - 3)^2 u(x - 3)]
that is,
f0(x) = (1/16)(x + 3)^2, -3 < x < -1
= (3 - x^2)/8, -1 < x < 1
= (1/16)(x - 3)^2, 1 < x < 3
= 0 elsewhere.
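The piecewise density in Problem 1 (the sum of three independent uniform(-1, 1) variables) can be cross-checked by a direct numerical convolution; this is only a sketch, and the grid sizes and sample points are arbitrary choices:

```python
# Compare the closed-form density of R1 + R2 + R3 (three independent
# uniform(-1, 1) variables) with a brute-force numerical convolution.

def f0(x):
    # piecewise density from the solution of Problem 1
    if -3 < x < -1:
        return (x + 3) ** 2 / 16
    if -1 <= x <= 1:
        return (3 - x * x) / 8
    if 1 < x < 3:
        return (x - 3) ** 2 / 16
    return 0.0

def conv_density(x, n=400):
    # f0(x) = ∫∫ f(u) f(v) f(x - u - v) du dv with f = 1/2 on (-1, 1)
    h = 2.0 / n
    total = 0.0
    for i in range(n):
        u = -1 + (i + 0.5) * h
        for j in range(n):
            v = -1 + (j + 0.5) * h
            if -1 < x - u - v < 1:
                total += 0.125 * h * h   # (1/2)^3 times the cell area
    return total

for x in (-2.0, 0.0, 2.0):
    print(x, f0(x), conv_density(x))
```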
2. f0(x) = (1/9)[(x + 2)u(x + 2) + 2(x + 1)u(x + 1) - 3xu(x) - 4(x - 1)u(x - 1) + 4(x - 2)u(x - 2)]
3. ∫_{-∞}^∞ (1/√(2π)) e^{-sy^2} e^{-y^2/2} dy = (2s + 1)^{-1/2}, Re s > -½
and the result follows from Table 5.1.1.
4. If R0 = R1 + R2 where R1 and R2 are independent and Ri has the gamma distribution with parameters αi and β, i = 1, 2, then
N0(s) = N1(s)N2(s) = (1 + βs)^{-α1}(1 + βs)^{-α2} = (1 + βs)^{-(α1+α2)}
so R0 has the gamma distribution with parameters α1 + α2 and β.
5. f0(x) = λ^n x^{n-1} e^{-λx} u(x)/(n - 1)!
9. Integrate e^{-iuz}/π(1 + z^2) around contour (a) if u > 0, and around contour (b) if u < 0. (Notice that |e^{-iu(x+iy)}| is bounded if u > 0, as long as y < 0.)

[Figure: semicircular contours of radius r → ∞ in the lower half plane (a) and upper half plane (b) — PROBLEM 5.2.9]

If u > 0, the integral is -2πi[residue of e^{-iuz}/π(1 + z^2) at z = -i] = -2πi[e^{-iu(-i)}/π(-2i)] = e^{-u}
If u < 0, the integral is 2πi[residue of e^{-iuz}/π(1 + z^2) at z = i] = 2πi[e^{-iu(i)}/π(2i)] = e^u
The result follows since the integral around the semicircle of radius r approaches 0 as r → ∞.
Section 5.3
4. M_R(u) = Σ_{n=-∞}^∞ p_n e^{iu(a+nd)}, where p_n = P{R = a + nd}. Thus M_R(u) = e^{iua} M(u) where M is periodic with period 2π/d; the numbers p_n are the coefficients of the Fourier series of M.
5. N_R(s) = exp[λ(e^{-s} - 1)], hence dN_R(s)/ds = -λe^{-s} exp[λ(e^{-s} - 1)] = -λ when s = 0.
d^2 N_R(s)/ds^2 = (λe^{-s})^2 exp[λ(e^{-s} - 1)] + λe^{-s} exp[λ(e^{-s} - 1)] = λ^2 + λ when s = 0.
By (5.3.3), E(R) = λ, E(R^2) = λ^2 + λ, hence Var R = λ.

Section 5.4
1. (a) P{Rn < x} = P{Rn < x, R > x + ε} + P{Rn < x, R < x + ε} < P{|Rn - R| > ε} + P{R < x + ε}
since Rn < x, R > x + ε implies |Rn - R| > ε. Similarly
P{R < x - ε} = P{R < x - ε, Rn > x} + P{R < x - ε, Rn < x} < P{|Rn - R| > ε} + P{Rn < x}
Thus F(x - ε) - P{|Rn - R| > ε} < Fn(x) < P{|Rn - R| > ε} + F(x + ε)
(b) Given δ > 0, choose ε > 0 so small that
F(x + ε) < F(x) + δ/2, F(x - ε) > F(x) - δ/2.
(This is possible since F is continuous at x.) For large enough n, P{|Rn - R| > ε} < δ/2 since Rn → R in probability. By (a), F(x) - δ < Fn(x) < F(x) + δ for large enough n. Thus Fn(x) → F(x).
5. (a) n > 1,690,000 (b) n > 9604
7. P{|R - ½n| > .005n} = P{|R - ½n|/(½√n) > .01√n} ≈ P{|R*| > .01√n}
= 2P{R* > .01√n} = (2/√(2π)) ∫_{.01√n}^∞ e^{-t^2/2} dt < [2/(√(2π))(.01√n)] e^{-.0001n/2} = (200/√(2πn)) e^{-(1/2)10^{-4} n}
For example, if n = 10^6, this is (.2/√(2π)) e^{-50}.
8. .91
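The generating-function computation in Section 5.3, Problem 5 can be sketched numerically: differentiating N(s) = exp[λ(e^{-s} - 1)] at s = 0 by finite differences should recover E(R) = λ and E(R^2) = λ^2 + λ (the step size and the value λ = 2.5 are arbitrary choices):

```python
import math

# Finite-difference check that for N(s) = exp(lam*(exp(-s) - 1)),
# -N'(0) = lam and N''(0) = lam^2 + lam, as used with (5.3.3).

def N(s, lam):
    return math.exp(lam * (math.exp(-s) - 1.0))

def derivatives_at_zero(lam, h=1e-4):
    d1 = (N(h, lam) - N(-h, lam)) / (2 * h)                    # ~ N'(0)
    d2 = (N(h, lam) - 2 * N(0.0, lam) + N(-h, lam)) / (h * h)  # ~ N''(0)
    return d1, d2

lam = 2.5
d1, d2 = derivatives_at_zero(lam)
print(-d1, d2)   # close to lam = 2.5 and lam^2 + lam = 8.75
```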
CHAPTER 6
Section 6.2
1. f1(x) = λ1^x and f2(x) = λ2^x [or f1(x) = λ^x, f2(x) = xλ^x in the repeated root case] are linearly independent solutions. For if c1 f1 + c2 f2 ≡ 0 then
c1 λ1^x + c2 λ2^x = 0
c1 λ1^{x+1} + c2 λ2^{x+1} = 0
If c1 and c2 are not both 0, then
det[λ1^x, λ2^x; λ1^{x+1}, λ2^{x+1}] = 0
hence
λ1^x λ2^x det[1, 1; λ1, λ2] = 0, a contradiction.
If f is any solution then for some constants A and C,
[f(0); f(1)] = A[f1(0); f1(1)] + C[f2(0); f2(1)]
since three vectors in a two-dimensional space are linearly dependent. But then
f(2) = d1 f(0) + d2 f(1), d1 = -q/p, d2 = 1/p
= d1[A f1(0) + C f2(0)] + d2[A f1(1) + C f2(1)] = A f1(2) + C f2(2)
Recursively, f(x) = A f1(x) + C f2(x). Thus all solutions are of the form A f1(x) + C f2(x). But we have shown in the text that A and C are uniquely determined by the boundary conditions at 0 and b; the result follows.
2. By the theorem of total expectation, if R is the duration of the game and A = {win on trial 1}, then
E(R) = P(A)E(R | A) + P(A^c)E(R | A^c)
Thus
D(x) = p[1 + D(x + 1)] + q[1 + D(x - 1)], x = 1, 2, …, b - 1
since if we win on trial 1, the game has already lasted for one trial, and the average number of trials remaining after the first is D(x + 1). [Notice that this argument, just as the one leading to (6.2.1), is intuitive rather than formal.]
3. In standard form,
pD(x + 2) - D(x + 1) + qD(x) = -p - q = -1, D(0) = D(b) = 0.
CASE 1. p ≠ q. The homogeneous equation is the same as (6.2.1), with solution A + C(q/p)^x. To find a particular solution, notice that the "forcing function" -1 already satisfies the homogeneous equation, so try D(x) = kx. Then
k[p(x + 2) - (x + 1) + qx] = (2p - 1)k = (p - q)k = -1
Thus
D(x) = A + C(q/p)^x + x/(q - p).
Set D(0) = D(b) = 0 to solve for A and C.
CASE 2. p = q = 1/2. The homogeneous solution is A + Cx. Since polynomials of degree 0 and 1 already satisfy the homogeneous equation, try as a particular solution D(x) = kx^2. Then
k[½(x + 2)^2 - (x + 1)^2 + ½x^2] = k = -1
Thus D(x) = A + Cx - x^2; D(0) = A = 0, D(b) = b(C - b) = 0 so that C = b.
Therefore D(x) = x(b - x). If we let b → ∞ we obtain
D(x) = ∞ if p ≥ q; D(x) = x/(q - p) if p < q
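The duration recursion of Problem 3 can be sketched as a small linear system and compared with the closed form (the board size b = 10 and p = 1/2 are arbitrary choices):

```python
# Solve D(x) - p*D(x+1) - q*D(x-1) = 1, D(0) = D(b) = 0, by Gaussian
# elimination, and compare with D(x) = x(b - x) when p = q = 1/2.

def duration(p, b):
    q = 1.0 - p
    n = b - 1                               # unknowns D(1), ..., D(b-1)
    a = [[0.0] * n for _ in range(n)]
    rhs = [1.0] * n
    for k in range(n):                      # row k is the equation at x = k+1
        a[k][k] = 1.0
        if k + 1 < n:
            a[k][k + 1] = -p
        if k - 1 >= 0:
            a[k][k - 1] = -q
    for col in range(n):                    # forward elimination
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for c in range(col, n):
                a[row][c] -= f * a[col][c]
            rhs[row] -= f * rhs[col]
    D = [0.0] * n
    for row in range(n - 1, -1, -1):        # back substitution
        s = rhs[row] - sum(a[row][c] * D[c] for c in range(row + 1, n))
        D[row] = s / a[row][row]
    return [0.0] + D + [0.0]

D = duration(0.5, 10)
print(D[3], 3 * (10 - 3))   # both should be 21 for p = q = 1/2
```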
Section 6.3
1. P{S1 > 0, …, S2n-1 > 0, S2n = 0} is the number of paths from (0, 0) to (2n, 0) lying on or above the axis, times (pq)^n. These paths are in one-to-one correspondence with the paths from (-1, -1) to (2n, 0) lying above -1 [connect (-1, -1) to (0, 0) to establish the correspondence]. Thus the number of paths is the same as the number from (0, 0) to (2n + 1, 1) lying above 0, namely
[(a - b)/(2n + 1)] C(2n + 1, a), where a + b = 2n + 1, a - b = 1, that is, a = n + 1, b = n
Thus the desired probability is
[1/(2n + 1)] C(2n + 1, n + 1)(pq)^n = [(2n)!/n!(n + 1)!](pq)^n = u_{2n}/(n + 1)
5. u_{2n} = C(2n, n)(pq)^n = [(2n)!/(n!)^2](pq)^n ≈ [(2n)^{2n}√(4πn)/(n^n √(2πn))^2](pq)^n = (4pq)^n/√(nπ) = 1/√(nπ) if p = q = 1/2
By Problem 2,
h_{2n} = u_{2n-2}/2n ≈ (1/2n)(1/√((n - 1)π)) ≈ 1/(2√(πn^3))
Let T be the time required to return to 0. Then P{T = 2n} = 2h_{2n}, n = 1, 2, …, and
E(T) = Σ_{n=1}^∞ 2nP{T = 2n} = Σ_{n=1}^∞ 2n · 2h_{2n}
But 2nh_{2n} ≈ K/√n and Σ 1/√n = ∞, hence E(T) = ∞.
6. The probability that both players will have k heads is [C(n, k)(½)^n]^2; sum from k = 0 to n to obtain the desired result.
7. The probability that both players will receive the same number of heads = the probability that the number of heads obtained by player 1 = the number of tails obtained by player 2 (since p = q = 1/2), and this is the probability of being at 0 after 2n steps of a simple random walk, namely C(2n, n)(½)^{2n}. Comparing this expression with the result of Problem 6, we obtain the desired conclusion. (Alternatively, we may use the formula of Section 2.9, Problem 4, with m = k = n.)

Section 6.4
1. (1 - 4pqz^2)^{1/2} = Σ_{n=0}^∞ C(1/2, n)(-4pqz^2)^n = Σ_{n=0}^∞ C(1/2, n)(-4pq)^n z^{2n}
Thus
H(z) = 1 - (1 - 4pqz^2)^{1/2} = -Σ_{n=1}^∞ C(1/2, n)(-4pq)^n z^{2n}
Thus h_{2n} = (-1)^{n+1} C(1/2, n)(4pq)^n. But
(-1)^{n+1} C(1/2, n) = (-1)^{n+1}(1/2)(-1/2)(-3/2) ⋯ [-(2n - 3)/2]/n!
= [1 · 3 · 5 ⋯ (2n - 3)]/2^n n!
= (2n - 2)!/[2 · 4 ⋯ (2n - 2)] 2^n n!
= (2n - 2)!/2^{2n-1} n!(n - 1)!
= (1/n) C(2n - 2, n - 1)(1/2^{2n-1})
Therefore h_{2n} = (2/n) C(2n - 2, n - 1)(pq)^n, in agreement with (6.3.5).
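The identity just derived can be sketched numerically: the coefficient of z^{2n} in 1 - (1 - 4pq z^2)^{1/2} should equal (2/n) C(2n-2, n-1)(pq)^n (the value p = 0.3 and the sample indices are arbitrary choices):

```python
import math

# Compare h_{2n} = (2/n) C(2n-2, n-1) (pq)^n with the series coefficient
# -C(1/2, n)(-4pq)^n of H(z) = 1 - (1 - 4pq z^2)^(1/2).

def binom_half(n):
    # generalized binomial coefficient C(1/2, n)
    num = 1.0
    for k in range(n):
        num *= (0.5 - k)
    return num / math.factorial(n)

def h2n_closed(n, p):
    q = 1.0 - p
    return (2.0 / n) * math.comb(2 * n - 2, n - 1) * (p * q) ** n

def h2n_series(n, p):
    q = 1.0 - p
    return -binom_half(n) * (-4.0 * p * q) ** n

for n in (1, 2, 5, 10):
    print(n, h2n_closed(n, 0.3), h2n_series(n, 0.3))
```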
2. If A(z) = Σ_{n=0}^∞ a_n z^n, then (1/z)[A(z) - a0] - 3A(z) = 4/(1 - z), or
A(z)(z^{-1} - 3) = 4/(1 - z) + a0/z
Thus
A(z) = 4z/[(1 - z)(1 - 3z)] + a0/(1 - 3z)
= -2/(1 - z) + (2 + a0)/(1 - 3z)
= -2 Σ_{n=0}^∞ z^n + (2 + a0) Σ_{n=0}^∞ 3^n z^n
Thus a_n = (2 + a0)3^n - 2. Notice that (2 + a0)3^n is the homogeneous solution, -2 the particular solution.
6. (a) P{Nr = k} = P{the first k - 1 trials result in exactly r - 1 successes, and trial k results in a success} = C(k - 1, r - 1)p^{r-1} q^{k-r} p, k = r, r + 1, …. Now
C(k - 1, r - 1) = (k - 1)(k - 2) ⋯ (r + 1)r/(k - r)!
= (-1)^{k-r}(-r)(-r - 1)(-r - 2) ⋯ [-r - (k - 1 - r)]/(k - r)!
= (-1)^{k-r} C(-r, k - r)
and the result follows. Note that if j = k - r, this computation shows that
(-1)^j C(-r, j) = C(j + r - 1, r - 1)
(b) We show that T1 and T2 are independent. The argument for T1, …, Tr is similar, but the notation becomes cumbersome.
P{T1 = j, T2 = k} = P{R1 = ⋯ = R_{j-1} = 0, Rj = 1, R_{j+1} = ⋯ = R_{j+k-1} = 0, R_{j+k} = 1} = p^2 q^{j+k-2}, j, k = 1, 2, …
Now P{T1 = j} = q^{j-1} p by Problem 5, and
P{T2 = k} = Σ_{j=1}^∞ P{T1 = j, T2 = k} = Σ_{j=1}^∞ p^2 q^{j+k-2} = pq^{k-1}
Hence P{T1 = j, T2 = k} = P{T1 = j}P{T2 = k} and the result follows.
(c) E(Nr) = r/p, Var Nr = r[(1 - p)/p^2] since Nr = T1 + ⋯ + Tr and the Ti are independent. The generalized characteristic function of Nr is
[pe^{-s}/(1 - qe^{-s})]^r, |qe^{-s}| < 1
Set s = iu to obtain the characteristic function, z = e^{-s} to obtain the generating function.
7. P{R = k} = p^k q + q^k p, k = 1, 2, …; E(R) = pq^{-1} + qp^{-1}
8. 1/√2
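The negative binomial facts in Problem 6 can be sketched numerically by truncating the infinite sums (the parameters r = 3, p = 0.4 and the cutoff K are arbitrary choices):

```python
import math

# Check that P{Nr = k} = C(k-1, r-1) p^r q^(k-r) sums to 1 and has mean r/p
# and variance r(1-p)/p^2, truncating the sum at a large K.

def neg_binom_pmf(k, r, p):
    q = 1.0 - p
    return math.comb(k - 1, r - 1) * p ** r * q ** (k - r)

def moments(r, p, K=2000):
    total = mean = second = 0.0
    for k in range(r, K):
        pk = neg_binom_pmf(k, r, p)
        total += pk
        mean += k * pk
        second += k * k * pk
    return total, mean, second - mean * mean

t, m, v = moments(3, 0.4)
print(t, m, v)   # close to 1, r/p = 7.5, r(1-p)/p^2 = 11.25
```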
Section 6.5
2. P{an even number of customers arrives in (t, t + τ]} = Σ_{k=0,2,4,…} e^{-λτ}(λτ)^k/k!
= ½ Σ_{k=0}^∞ e^{-λτ}(λτ)^k/k! + ½ Σ_{k=0}^∞ e^{-λτ}(-λτ)^k/k!
= ½ e^{-λτ}(e^{λτ} + e^{-λτ}) = ½(1 + e^{-2λτ})
P{an odd number of customers arrives in (t, t + τ]} = Σ_{k=1,3,5,…} e^{-λτ}(λτ)^k/k!
= ½ Σ_{k=0}^∞ e^{-λτ}(λτ)^k/k! - ½ Σ_{k=0}^∞ e^{-λτ}(-λτ)^k/k!
= ½ e^{-λτ}(e^{λτ} - e^{-λτ}) = ½(1 - e^{-2λτ})
[Alternatively, we may note that Σ_{k=0,2,4,…} (λτ)^k/k! = cosh λτ and Σ_{k=1,3,5,…} (λτ)^k/k! = sinh λτ.]
3. (a) P{Rt = 1, R_{t+τ} = 1} = P{Rt = -1, R_{t+τ} = -1} = ¼(1 + e^{-2λτ})
P{Rt = 1, R_{t+τ} = -1} = P{Rt = -1, R_{t+τ} = 1} = ¼(1 - e^{-2λτ})
(b) K(t, τ) = e^{-2λτ}
Section 6.6
4. (a) is immediate from Theorem 1.
(b) If ω ∈ An for infinitely many n, then ω ∈ A, which in turn implies that ω ∈ An eventually. Thus lim sup An ⊂ A ⊂ lim inf An ⊂ lim sup An, so all these sets are equal.
(c) Let An = [1 - 1/n, 2 - 1/n]; lim sup An = lim inf An = [1, 2). (Another example: if the An are disjoint, lim sup An = lim inf An = ∅.)
(d) ∩_{k=n}^∞ Ak ⊂ An ⊂ ∪_{k=n}^∞ Ak, and Bn = ∩_{k=n}^∞ Ak expands to lim inf_n An = A, Cn = ∪_{k=n}^∞ Ak contracts to lim sup_n An = A. Thus P(An) is boxed between P(Bn) and P(Cn), each of which approaches P(A).
(e) (lim inf_n An)^c = (∪_{n=1}^∞ ∩_{k=n}^∞ Ak)^c = ∩_{n=1}^∞ ∪_{k=n}^∞ Ak^c = lim sup_n An^c by the DeMorgan laws; (lim sup_n An)^c = lim inf_n An^c similarly.
6. lim inf An = {(x, y): x^2 + y^2 < 1}, lim sup An = {(x, y): x^2 + y^2 ≤ 1} - {(0, 1), (0, -1)}.
8. P(lim sup_n An) = lim_{n→∞} P(∪_{k=n}^∞ Ak) by definition of lim sup, hence
P(lim sup_n An) = lim_{n→∞} lim_{m→∞} P(∪_{k=n}^m Ak)
Now, by independence,
P(∪_{k=n}^m Ak) = 1 - P(∩_{k=n}^m Ak^c) = 1 - Π_{k=n}^m [1 - P(Ak)] > 1 - exp[-Σ_{k=n}^m P(Ak)]
since 1 - x < e^{-x}. But Σ_{k=n}^m P(Ak) → ∞ as m → ∞, hence P(∪_{k=n}^∞ Ak) = 1 for every n. The result follows.
9. If Σ_{n=1}^∞ P{|Rn - c| > ε} < ∞ for every ε > 0, then Rn → c a.s. by Theorem 5. Conversely, assume that Σ_{n=1}^∞ P{|Rn - c| > ε} = ∞ for some ε > 0. Then by the second Borel-Cantelli lemma, P{|Rn - c| > ε for infinitely many n} = 1. But |Rn - c| > ε for infinitely many n implies that Rn does not converge to c, hence P{Rn → c} = 0, not 1.
12. Let Sn = R1 + ⋯ + Rn; then
E(Sn/n)^2 = (1/n^2) Var Sn = (1/n^2) Σ_{k=1}^n Var Rk < (M/n^2) Σ_{k=1}^n k^{-1/2} < (M/n^2) ∫_0^n x^{-1/2} dx = 2M/n^{3/2}
(notice that k^{-1/2} < x^{-1/2} if k - 1 < x < k), so that
Σ_{n=1}^∞ E(Sn/n)^2 < Σ_{n=1}^∞ 2M/n^{3/2} < ∞
By Theorem 6, Sn/n → 0 a.s.
CHAPTER 7
Section 7.1
1. P{Tn = j} = Σ_{i0, …, i_{n-1}} P{T0 = i0, …, T_{n-1} = i_{n-1}, Tn = j}
= Σ_{i0, …, i_{n-1}} r_{i0} p_{i0 i1} ⋯ p_{i_{n-2} i_{n-1}} p_{i_{n-1} j}
= Σ_i r_i p_{ij}^{(n)}
4. P{R_{n+1} = i1, …, R_{n+k} = ik | Rn = i} = P{Rn = i, R_{n+1} = i1, …, R_{n+k} = ik}/P{Rn = i} = p_{i i1} p_{i1 i2} ⋯ p_{i_{k-1} ik}, which does not depend on n.

Section 7.2
1. The desired probability is
P[D ∩ {Rn = i} ∩ {R_{n+1} = i1, …, R_{n+k} = ik}]/P[D ∩ {Rn = i}]
D must be of the form (R0, …, Rn) ∈ B for some B ⊂ S^{n+1}, hence D is a countable union of sets of the form {R0 = j0, …, Rn = jn}. Now
P[{R0 = j0, …, Rn = jn} ∩ {R_{n+1} = i1, …, R_{n+k} = ik}]
= P{R0 = j0, …, R_{n-1} = j_{n-1}, Rn = i}P{R_{n+1} = i1, …, R_{n+k} = ik | Rn = i} if jn = i
= 0 if jn ≠ i
Thus the numerator is simply
P[D ∩ {Rn = i}]P{R_{n+1} = i1, …, R_{n+k} = ik | Rn = i}
and the result follows.

Section 7.3
1. (a) If n1 ∈ A, then n1 is a multiple of d, say n1 = r1 d. If r1 = 1 then {n1} is the required set. If not, choose n2 ∈ A such that n1 does not divide n2 (if n2 does not exist then d = n1, hence r1 = 1). If n2 = r2'd then gcd(n1, n2) = gcd(r1 d, r2'd) = r2 d for some positive integer r2 < r1 (if r2 = r1, then n1 divides n2). If r2 = 1, {n1, n2} is the required set; if not, find n3 ∈ A such that r2 d does not divide n3. [If n3 does not exist, then r2 d divides everything in A, including n1 and n2. But r2 d = gcd(n1, n2) so that r2 d = d, hence r2 = 1, a contradiction.] If n3 = r3'd then gcd(n1, n2, n3) = gcd(r2 d, r3'd) = r3 d, r3 < r2. If r3 = 1, {n1, n2, n3} is the desired set. If not, find n4 ∈ A such that r3 d does not divide n4, and continue in this fashion. We obtain a decreasing sequence of positive integers r1 > r2 > ⋯; the sequence must terminate in a finite number of steps, thus yielding a finite set whose gcd is d.
(b) By (a) there are integers n1, …, nk ∈ A such that gcd(n1, …, nk) = d. Thus for some integers c1, …, ck we have c1 n1 + ⋯ + ck nk = d. Collect positive and negative terms to obtain, since A is closed under addition, md, nd ∈ A such that md - nd = d, that is, m - n = 1. Now let q = cd, c > n(n - 1).
Write c = an + b, a > n - 1, 0 < b < n - 1. Then q = cd = [(a - b)n + (bn + b)]d = [(a - b)n + bm]d ∈ A, since a - b > 0 and A is closed under addition. The result follows.
(c) Let A = {n > 1: p_{ii}^{(n)} > 0}, B = {n > 1: f_{ii}^{(n)} > 0}. Then di = gcd(A). If ei = gcd(B), then since B ⊂ A we have ei > di. To show that di > ei we show that ei divides all elements of A. If this is not the case, let n be the smallest positive integer such that p_{ii}^{(n)} > 0 and ei does not divide n. Write n = aei + b, 0 < b < ei. Then, decomposing by the time rei of first return,
p_{ii}^{(n)} = Σ_r f_{ii}^{(rei)} p_{ii}^{((a-r)ei + b)}
Now ei does not divide (a - r)ei + b, so by minimality of n, p_{ii}^{((a-r)ei + b)} = 0. But then p_{ii}^{(n)} = 0, a contradiction. The result follows.
4. (a) S forms a single aperiodic equivalence class. Starting from 1, the probability of returning to 1 is 1 if p < q, and q + p(q/p) = 2q if p > q (see 6.2.6). Thus S is recurrent if p < q, transient if p > q.
(b) and (c) S forms a single aperiodic equivalence class. By Theorem 6, S is recurrent.
(d) The equivalence classes are C = {1} and D = {2, 3}. C is not closed, hence is transient; D is closed, hence recurrent. C and D are aperiodic.
Section 7.4
1. For each n such that fn > 0, form the following system of states (always originating from a fixed state i). It is clear from the construction that f_{ii}^{(n)} = fn for all n.

[Figure: constructed chain — PROBLEM 7.4.1]

Also, p_{ii}^{(n)} = u_n by induction, using the First Entrance Theorem. Note that we need not have gcd{n: fn > 0} = 1 for this construction.
2. Construct a Markov chain such that for some i,
f_{ii}^{(n)} = P{T1 = n}, n = 1, 2, …
We claim that G(n) = p_{ii}^{(n)} for all n = 1, 2, …. For G(1) = P{T1 = 1} = f_{ii}^{(1)} = p_{ii}^{(1)}, and if G(r) = p_{ii}^{(r)} for r = 1, 2, …, n, then
G(n + 1) = Σ_{k=1}^∞ P{T1 + ⋯ + Tk = n + 1}
= Σ_{k=1}^∞ Σ_{l=1}^{n+1} P{T1 = l}P{T2 + ⋯ + Tk = n + 1 - l}
= Σ_{l=1}^{n+1} f_{ii}^{(l)} G(n + 1 - l), if we define G(0) = 1
= Σ_{l=1}^{n+1} f_{ii}^{(l)} p_{ii}^{(n+1-l)} by induction hypothesis
= p_{ii}^{(n+1)} by the First Entrance Theorem
Now state i is recurrent since Σ_{n=1}^∞ P{T1 = n} = 1, and has period d by hypothesis. Thus by Theorem 2(c), lim_{n→∞} G(nd) = d/μ. Since a renewal can only take place at times nd, n = 1, 2, …, G(nd) is the probability that a renewal takes place in the interval [nd, (n + 1)d). If the average length of time between renewals is μ, for large n it is reasonable that one renewal should take place every μ seconds, hence there should be, on the average, d/μ renewals in a time interval of length d. Thus we expect intuitively that G(nd) → d/μ.
4. (a) Let the initial state be i. Then V_{ii} = Σ_{n=0}^∞ I{Rn = i}, and the result follows.
(b) By (a), N = Σ_{n=0}^∞ Q^n so that QN = Σ_{n=1}^∞ Q^n = N - I. (In particular, QN is finite.)
(c) By (b), (I - Q)N = I. But N = Σ_{n=0}^∞ Q^n so that QN = NQ, hence N(I - Q) = I.
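The renewal recursion in Problem 2 is easy to iterate numerically; as a sketch, take the hypothetical inter-renewal law f(1) = f(2) = 1/2 (so d = 1 and μ = 1.5), under which G(n) should approach 1/μ = 2/3:

```python
# Iterate G(n) = sum_l f(l) G(n - l), G(0) = 1, and watch convergence to d/mu.
# The law f(1) = f(2) = 1/2 is an arbitrary illustrative choice (mu = 1.5).

def renewal_sequence(f, N):
    # f: dict mapping l -> P{T = l}
    G = [1.0] + [0.0] * N
    for n in range(1, N + 1):
        G[n] = sum(pl * G[n - l] for l, pl in f.items() if l <= n)
    return G

G = renewal_sequence({1: 0.5, 2: 0.5}, 50)
print(G[50])   # close to 2/3
```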
Section 7.5
2. (a) j0 is recurrent since p_{j0 j0}^{(n)} > δ for all n > N (see Problem 2, Section 7.1), hence Σ_n p_{j0 j0}^{(n)} = ∞. If i is any recurrent state then since p_{i j0}^{(n)} > δ > 0, i leads to j0. By Theorem 5 of Section 7.3, j0 leads to i, so that there can only be one recurrent class. Since p_{j0 j0}^{(n)} > δ > 0 for all n > N, the class is aperiodic, so that lim_n p_{j0 j0}^{(n)} = 1/μ_{j0}. But then 1/μ_{j0} > δ > 0, hence μ_{j0} < ∞ and the class is positive.
(Note also that if i is any state and C is the equivalence class of j0, then, for n > N, P{Rn ∉ C | R0 = i} < 1 - δ, hence P{Rkn ∉ C | R0 = i} < (1 - δ)^k → 0 as k → ∞. Thus f_{i j0} = 1, and it follows that a steady state distribution exists.)
(b) If Π^N has a positive column then by (a) there is exactly one recurrent class, which is (positive and) aperiodic, and therefore a steady state distribution exists. Conversely, let {vj} be a steady state distribution. Pick j0 so that v_{j0} [= lim_{n→∞} p_{i j0}^{(n)}] > 0. Since the chain is finite, p_{i j0}^{(n)} > 0 for all i if n is sufficiently large, say n > N. But then Π^N has a positive column.
3. 1. If p ≠ q, the chain is transient, so p_{ij}^{(n)} → 0 for all i, j. If p = q the chain is recurrent. We have observed (Problem 5, Section 6.3) that the mean recurrence time is infinite, hence the chain is recurrent null, and thus p_{ij}^{(n)} → 0. In either case there is no stationary distribution, hence no steady state distribution. The period is 2.
2. There is one positive recurrent class, namely {0}; the remaining states form a transient class. Thus there is a unique stationary distribution, given by v0 = 1, vj = 0, j > 1. Now starting from i > 1, the probability of eventually reaching 0 is lim_{n→∞} p_{i0}^{(n)}, since the events {Rn = 0} expand to {0 is reached eventually}. By (6.2.6),
lim_{n→∞} p_{i0}^{(n)} = (q/p)^i if p > q
= 1 if p < q
(Also p_{00}^{(n)} = 1, p_{ij}^{(n)} → 0, j > 1.) If p > q the limit is not independent of i so there is no steady state distribution.
3. There are two positive recurrent classes {0} and {b}; {1, 2, …, b - 1} is a transient class. Thus there are uncountably many stationary distributions, given by v0 = ρ1, vb = ρ2, vj = 0, 1 < j < b - 1, where ρ1, ρ2 > 0, ρ1 + ρ2 = 1. There is no steady state distribution. By (6.2.3) and (6.2.4),
lim_{n→∞} p_{i0}^{(n)} = [(q/p)^i - (q/p)^b]/[1 - (q/p)^b] if p ≠ q
= 1 - i/b if p = q
lim_{n→∞} p_{ib}^{(n)} = 1 - lim_{n→∞} p_{i0}^{(n)}
lim_{n→∞} p_{ij}^{(n)} = 0, 1 < j < b - 1
4. The chain is aperiodic. If p > q then f_{i1} = (q/p)^{i-1} < 1, i > 1, hence the chain is transient. Therefore p_{ij}^{(n)} → 0 as n → ∞ for all i, j, and there is no stationary or steady state distribution. Now if p < q then f_{i1} = 1 for i > 1, hence f_{11} = q + p f_{21} = 1, and the chain is recurrent. The equations VΠ = V become
v1 = v1 q + v2 q
v2 = v1 p + v3 q
v3 = v2 p + v4 q
...............
This may be reduced to vj = (p/q)v_{j-1}, j = 2, 3, …. If p = q then all vj are equal, hence vj ≡ 0 and there is no stationary or steady state distribution. Thus the chain is recurrent null. If p < q, the condition Σ_{j=1}^∞ vj = 1 yields the unique solution
vj = [(q - p)/q](p/q)^{j-1}, j = 1, 2, …
Thus there is a unique stationary distribution, so that the chain is recurrent positive; {vj} is also the steady state distribution.
5. The chain forms a recurrent positive aperiodic class, hence p_{ij}^{(n)} → vj where the vj form the unique stationary distribution and the steady state distribution. The equations VΠ = V, Σ_j vj = 1 yield
vj = (p/q)^{j-1}/Σ_i (p/q)^{i-1}
6. The chain forms a recurrent positive aperiodic class. Since Π has identical rows (p^2, pq, qp, q^2) = V, there is a steady state distribution (= the unique stationary distribution), namely V.
7. The chain forms a recurrent positive aperiodic class, hence p_{ij}^{(n)} → vj where the vj form the unique stationary distribution and the steady state distribution. The equations VΠ = V, Σ_j vj = 1 yield v2 = 1/3, v3 = 1/24, v4 = 1/3.
8. There is a single positive recurrent class {2, 3}, which is aperiodic, hence p_{ij}^{(n)} → vj where the vj form the unique stationary distribution and the steady state distribution. We find that v1 = 0, v2 = 3/7, v3 = 4/7.
9. We may take p_{ij} = P{Rn = j} = pj for all i, j (with initial distribution pj = P{R0 = j} also). The chain forms a recurrent class since from any initial state, P{Rn never = j} = Π_{n=1}^∞ P{Rn ≠ j} = Π_{n=1}^∞ (1 - pj) = 0. The class is aperiodic. Clearly vj = pj is a stationary distribution, so that the chain is recurrent positive and the stationary distribution is unique and coincides with the steady state distribution.
10. The chain forms a positive recurrent class of period 3 (see Section 7.3, Example 2). Thus there is a unique stationary distribution, given by
v1 = 1/9, v2 = 2/9, v3 = 1/9, v4 = 2/9, v5 = 1/12, v6 = 1/36, v7 = 2/9
Now the cyclically moving subclasses are C0 = {1, 2}, C1 = {3, 4}, C2 = {5, 6, 7}. By Theorem 2(c) of Section 7.4, if i ∈ Cr, j ∈ C_{r+a}, then p_{ij}^{(3n+a)} → 3vj. Thus

             1    2    3    4    5     6    7
      1  [ 1/3  2/3   0    0    0     0    0  ]
      2  [ 1/3  2/3   0    0    0     0    0  ]
      3  [  0    0   1/3  2/3   0     0    0  ]
Π^{3n} → 4 [  0    0   1/3  2/3   0     0    0  ]
      5  [  0    0    0    0   1/4  1/12  2/3 ]
      6  [  0    0    0    0   1/4  1/12  2/3 ]
      7  [  0    0    0    0   1/4  1/12  2/3 ]

               1    2    3    4    5     6    7
        1  [  0    0   1/3  2/3   0     0    0  ]
        2  [  0    0   1/3  2/3   0     0    0  ]
        3  [  0    0    0    0   1/4  1/12  2/3 ]
Π^{3n+1} → 4 [  0    0    0    0   1/4  1/12  2/3 ]
        5  [ 1/3  2/3   0    0    0     0    0  ]
        6  [ 1/3  2/3   0    0    0     0    0  ]
        7  [ 1/3  2/3   0    0    0     0    0  ]

               1    2    3    4    5     6    7
        1  [  0    0    0    0   1/4  1/12  2/3 ]
        2  [  0    0    0    0   1/4  1/12  2/3 ]
        3  [ 1/3  2/3   0    0    0     0    0  ]
Π^{3n+2} → 4 [ 1/3  2/3   0    0    0     0    0  ]
        5  [  0    0   1/3  2/3   0     0    0  ]
        6  [  0    0   1/3  2/3   0     0    0  ]
        7  [  0    0   1/3  2/3   0     0    0  ]
CHAPTER 8
Section 8.2
1. (a) L(x) = 2e^{-2x}/e^{-x} = 2e^{-x}, so L(x) > A iff x < c = -ln(A/2). Thus
α = ∫_0^c e^{-x} dx = 1 - e^{-c}, β = ∫_c^∞ 2e^{-2x} dx = e^{-2c} = (1 - α)^2
Hence as in Example 1 of the text, SA = {(α, (1 - α)^2), 0 < α < 1} and S = {(α, β): 0 < α < 1, (1 - α)^2 < β < 1 - α^2}.
(b) e^{-c} = 1 - α = .95, so that c = .051. Thus we reject H0 if x < .051, accept H0 if x > .051. We have β = (1 - α)^2 = .9025, which indicates that tests based on a single observation are not very promising here.
(c) Set α = β = (1 - α)^2; thus α = (3 - √5)/2 = .38 = 1 - e^{-c}, so that c = .477.
3. (a)
x       1    2    3    4    5    6
p0(x)  1/6  1/6  1/6  1/6  1/6  1/6
p1(x)  1/4  1/4  1/8  1/8  1/8  1/8
L(x)   3/2  3/2  3/4  3/4  3/4  3/4

LRT              Rejection Region   Acceptance Region
A < 3/4          all x              empty
3/4 < A < 3/2    x = 1, 2           x = 3, 4, 5, 6
A > 3/2          empty              all x

The admissible risk points are given in the diagram.

[Figure: risk set, α versus β — PROBLEM 8.2.3(a)]
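Parts (b) and (c) of Problem 1 can be sketched numerically (the densities e^{-x} and 2e^{-2x} are from the problem; everything else is elementary arithmetic):

```python
import math

# H0 density e^{-x}, H1 density 2e^{-2x} (x > 0); reject H0 when x < c.
# Then alpha = 1 - e^{-c} and beta = e^{-2c} = (1 - alpha)^2.

def errors(c):
    alpha = 1.0 - math.exp(-c)     # P0{X < c}
    beta = math.exp(-2.0 * c)      # P1{X > c}
    return alpha, beta

# (b) alpha = .05 gives c = -ln(.95) and beta = .95^2 = .9025
c = -math.log(0.95)
print(c, errors(c))

# (c) alpha = beta = (1 - alpha)^2 forces alpha = (3 - sqrt(5))/2
alpha = (3 - math.sqrt(5)) / 2
print(alpha, (1 - alpha) ** 2)
```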
(b) Reject with probability a if x = 1 or 2, accept if x = 3, 4, 5 or 6, where a/3 = .1, that is, a = .3. We have 1 - β = .3(1/2) = .15, or β = .85.
(c) A = pc1/(1 - p)c2 = 3/2, so reject with probability a if x = 1 or 2, accept if x = 3, 4, 5, 6, where a is any number in [0, 1]. Thus there are uncountably many Bayes solutions.
4. n > 13.
5. By Problem 2d, the test is of the form
φ(x) = 1 if Σ_{k=1}^n xk^2 > c
= 0 if Σ_{k=1}^n xk^2 < c
= anything if Σ_{k=1}^n xk^2 = c
Now if the true variance is θ, Σ_{k=1}^n Rk^2/θ has the chi-square density h_n with n degrees of freedom (see Problem 3, Section 5.2), hence
P_θ{x: Σ_{k=1}^n xk^2 > c} = ∫_{c/θ}^∞ h_n(x) dx = A_n(c/θ)
(The numbers A_n are tabulated in most statistics books.) Thus the error probabilities are given by
α = A_n(c/θ0), 1 - β = A_n(c/θ1)
For a given value of n, the specification of α determines c, which in turn determines β. In practice, one must keep trying larger values of n until β is reduced to the desired figure.
6. First consider θ = θ0 versus θ = θ1, θ1 > θ0.
L(x) = f_{θ1}(x)/f_{θ0}(x) = (θ0/θ1)^n if 0 < t(x) = max xi < θ0
= ∞ if θ0 < t(x) < θ1
Let A = (θ0/θ1)^n and consider the following test:
φ(x) = 1 if L(x) > A, that is, if θ0 < t(x) < θ1
= 0 if L(x) < A (this never occurs)
= 1 if L(x) = A and t(x) < θ0 α^{1/n}, that is, if 0 < t(x) < θ0 α^{1/n}
= 0 if L(x) = A and t(x) > θ0 α^{1/n}, that is, if θ0 α^{1/n} < t(x) < θ0
Since t(x) can never be < 0 or > θ1, φ is exactly the test proposed in the statement of the problem. Its type 1 error probability is
P_{θ0}{x: max xi < θ0 α^{1/n}} = (θ0 α^{1/n}/θ0)^n = α
Since φ is a LRT, it is most powerful at level α. But φ does not depend on θ1, hence is UMP for θ = θ0 versus θ > θ0. Now let θ1 < θ0. Then
L(x) = (θ0/θ1)^n if 0 < t(x) < θ1
= 0 if θ1 < t(x) < θ0
Let A = (θ0/θ1)^n and consider the following test.
φ'(x) = 1 if L(x) > A (this never occurs)
= 0 if L(x) < A
= 1 if L(x) = A and t(x) < θ0 α^{1/n}
= 0 if L(x) = A and t(x) > θ0 α^{1/n}
Since t(x) cannot be > θ0 in this case, φ' = φ. Again, φ is UMP for θ = θ0 versus θ < θ0, and the result follows. The power function is (see diagram)
Q(θ) = E_θ φ = 1 if 0 < θ < θ0 α^{1/n}
= (θ0 α^{1/n}/θ)^n = α(θ0/θ)^n, θ0 α^{1/n} < θ < θ0
= 1 - P_θ{x: θ0 α^{1/n} < t(x) < θ0} = 1 - [(θ0/θ)^n - (θ0 α^{1/n}/θ)^n] = 1 - (1 - α)(θ0/θ)^n, θ > θ0

[Figure: the power function Q(θ) — PROBLEM 8.2.6]

7. The risk set is {(α, β): 0 < α < 1, (1 - α)2^{-n} < β < 1 - α2^{-n}}, and the set of admissible risk points is {(α, (1 - α)2^{-n}): 0 < α < 1}.
10. If α(φ) < α0, let φ' ≡ 1 and φt = (1 - t)φ + tφ', 0 < t < 1. Then
α(φt) = (1 - t)α(φ) + tα(φ')
β(φt) = (1 - t)β(φ) + tβ(φ')
Since α(φ) < α0, α(φt) will be < α0 for some t ∈ (0, 1). But β(φ') = 0 and β(φ) > 0, hence β(φt) < β(φ), contradicting the assumption that φ is most powerful at level α0. For the counterexample, let R be uniformly distributed between a and b, and let H0: a = 0, b = 1, H1: a = 2, b = 3. Let φt(x) = 1 if x > t, φt(x) = 0 otherwise, where 0 < t < 1. Then β(φt) = 0, α(φt) = 1 - t. For t < 1, φt is most powerful at level α0 = 1, but is of size < 1.
14. (a) φn is Bayes with c1 = c2 = 1, p = 1/2 (hence A = 1), and L(x) = f_{θ1}(x)/f_{θ0}(x) where f_θ(x) = Π_{i=1}^n h_θ(xi); the result follows.
(b) P_{θ0}{x: gn(x) > 1} = P_{θ0}{gn(R) > 1} < E_{θ0}[gn(R)^{1/2}] by Chebyshev's inequality
= Π_{i=1}^n E_{θ0} t(Ri) = [E_{θ0} t(R1)]^n

Section 8.3
1. (a) θ̂ = -n/Σ_{i=1}^n ln xi (b) θ̂ = x̄
2. θ̂ = |x|
3. θ̂ = r/x̄
4. ρ(θ̂) = θ(1 - θ)/n
5. By (8.3.2) with g(θ) = 1, 0 < θ < 1, we have
ψ(x) = [∫_0^1 θ C(n, x)θ^x(1 - θ)^{n-x} dθ]/[∫_0^1 C(n, x)θ^x(1 - θ)^{n-x} dθ] = β(x + 2, n - x + 1)/β(x + 1, n - x + 1) = (x + 1)/(n + 2)
6. For each x, we wish to minimize ∫_0^1 C(n, x)θ^x(1 - θ)^{n-x}[(θ - ψ(x))^2/θ(1 - θ)] dθ [see (8.3.1)]. In the same way that we derived (8.3.2), we find that
ψ(x) = [∫_0^1 θ^x(1 - θ)^{n-x-1} dθ]/[∫_0^1 θ^{x-1}(1 - θ)^{n-x-1} dθ] = β(x + 1, n - x)/β(x, n - x) = x/n
The risk function is
ρψ(θ) = E_θ{[(R/n) - θ]^2/θ(1 - θ)} = [1/n^2 θ(1 - θ)] Var_θ R = 1/n = constant
Section 8.4
1. Thus in (a), (min Ri, max Ri) is sufficient; in (b), max Ri is sufficient; in (c), min Ri is sufficient.
2. Hence if θ1, θ2 are both unknown, (Π_{i=1}^n Ri, Σ_{i=1}^n Ri) is sufficient; if θ1 is known, Σ_{i=1}^n Ri is sufficient; if θ2 is known, Π_{i=1}^n Ri is sufficient.
3. [Π_{i=1}^n Ri, Π_{i=1}^n (1 - Ri)] is sufficient if θ1 and θ2 are unknown; if θ1 is known, Π_{i=1}^n (1 - Ri) is sufficient, and if θ2 is known, Π_{i=1}^n Ri is sufficient.

Section 8.5
1. (1 - 1/n)^T.
2. (a) T has density f_T(y) = ny^{n-1}/θ^n, 0 < y < θ (Example 3, Section 2.8), so E_θ g(T) = ∫_0^θ g(y)f_T(y) dy = (n/θ^n) ∫_0^θ y^{n-1} g(y) dy. If E_θ g(T) = 0 for all θ > 0 then y^{n-1} g(y) = 0, hence g(y) = 0, for all y (except on a set of Lebesgue measure 0). Thus
P_θ{g(T) = 0} = ∫_{{y: g(y)=0}} f_T(y) dy = 1
(b) If g(T) is an unbiased estimate of γ(θ) then
E_θ g(T) = (n/θ^n) ∫_0^θ y^{n-1} g(y) dy = γ(θ)
Assuming g continuous, we have
(1/n)(d/dθ)[θ^n γ(θ)] = θ^{n-1} g(θ)
or
g(θ) = γ(θ) + θγ'(θ)/n
Conversely, if g satisfies this equation then nθ^{n-1} g(θ) = (d/dθ)[θ^n γ(θ)], hence n ∫_0^θ y^{n-1} g(y) dy = θ^n γ(θ), assuming θ^n γ(θ) → 0 as θ → 0. Thus a UMVUE of γ(θ) is given by g(T) = γ(T) + (T/n)γ'(T). For example, if γ(θ) = θ then g(T) = T + T/n = [(n + 1)/n]T; if γ(θ) = 1/θ then g(T) = 1/T + (T/n)(-1/T^2) = (1/T)[1 - (1/n)], assuming that n > 1.
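The unbiasedness of [(n + 1)/n]T in Problem 2 can be sketched numerically using the density f_T(y) = ny^{n-1}/θ^n (the sample size n = 5, the value θ = 2, and the grid are arbitrary choices):

```python
# Check numerically that E[ ((n+1)/n) T ] = theta for T = max of n
# uniform(0, theta) observations, via the density f_T(y) = n y^(n-1)/theta^n.

def expectation_of_estimate(n, theta, grid=200000):
    h = theta / grid
    total = 0.0
    for i in range(grid):
        y = (i + 0.5) * h
        total += ((n + 1) / n) * y * (n * y ** (n - 1) / theta ** n) * h
    return total

print(expectation_of_estimate(5, 2.0))   # close to theta = 2
```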
4. We have
P_N{R1 = x1, …, Rn = xn} = (1/N^n) Π_{i=1}^n I_{{1,2,…}}(xi) I_{{1,2,…,N}}(max xi)
hence T = max Ri is sufficient. Now
P_N{T < k} = (k/N)^n, k = 1, 2, …, N
therefore
P_N{T = k} = [k^n - (k - 1)^n]/N^n, k = 1, 2, …, N
Thus
E_N g(T) = Σ_{k=1}^N g(k)[k^n - (k - 1)^n]/N^n
If E_N g(T) = 0 for all N = 1, 2, …, take N = 1 to conclude that g(1) = 0. If g(k) = 0 for k = 1, …, N - 1, then E_N g(T) = 0 implies that
[N^n - (N - 1)^n]g(N)/N^n = 0
hence g(N) = 0. By induction, g ≡ 0 and T is complete. To find a UMVUE of γ(N), we must solve the equation
Σ_{k=1}^N g(k)[k^n - (k - 1)^n] = N^n γ(N), N = 1, 2, …
Subtracting the equations for N and N - 1,
g(N) = [N^n γ(N) - (N - 1)^n γ(N - 1)]/[N^n - (N - 1)^n], N = 1, 2, …
5. R is clearly sufficient for itself, and

$$E_\theta g(R) = \sum_{k=r}^\infty g(k) \binom{k-1}{r-1} (1-\theta)^r\, \theta^{k-r}$$

If $E_\theta g(R) = 0$ for all θ ∈ [0, 1) then $\sum_{k=r}^\infty g(k)\binom{k-1}{r-1}\theta^{k-r} = 0$, so that g = 0. Thus R is complete. The above expression for $E_\theta g(R)$ shows that for a UMVUE to exist, γ(θ) must be expandable in a power series. Conversely, let $\gamma(\theta) = \sum_{i=0}^\infty a_i \theta^i$, 0 < θ < 1. We must find g such that

$$\sum_{k=r}^\infty g(k)\binom{k-1}{r-1}\theta^{k-r} = (1-\theta)^{-r}\,\gamma(\theta) = \sum_{i=0}^\infty b_i \theta^i = \sum_{i=r}^\infty b_{i-r}\,\theta^{i-r}$$

Therefore

$$g(i) = b_{i-r}\bigg/\binom{i-1}{r-1}, \qquad i = r, r+1, \ldots$$

For example, if γ(θ) = θ^k then

$$(1-\theta)^{-r}\,\gamma(\theta) = \theta^k \sum_{j=0}^\infty \binom{j+r-1}{r-1}\theta^j \qquad \text{(Problem 6a, Section 6.4)}$$

and $b_i = 0$ for $i < k$, $b_i = \binom{i+r-1-k}{r-1}$ for $i \ge k$. Thus

$$g(i) = \binom{i-k-1}{r-1}\bigg/\binom{i-1}{r-1}, \qquad i = r+k, r+k+1, \ldots$$

and g(i) = 0 otherwise. In particular, if k = 1 then

$$g(i) = \binom{i-2}{r-1}\bigg/\binom{i-1}{r-1} = \frac{i-r}{i-1}, \qquad i = r+1, r+2, \ldots$$

Thus a UMVUE of θ = 1 − p is (R − r)/(R − 1); a UMVUE of p = 1 − θ is 1 − (R − r)/(R − 1) = (r − 1)/(R − 1). (The maximum likelihood estimate of p is r/R, which is biased; see Problem 3, Section 8.3.)
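Both claims (unbiasedness of (r − 1)/(R − 1) and the upward bias of r/R) can be seen in simulation. A Monte Carlo sketch of our own, not from the text, with R the trial on which the r-th success occurs and success probability p:

```python
import random

# Monte Carlo sketch: the UMVUE (r-1)/(R-1) should average to p, while the
# maximum likelihood estimate r/R is biased upward (Jensen: E[r/R] > r/E[R] = p).
random.seed(2)
p, r, trials = 0.4, 3, 50000
umvue_sum = mle_sum = 0.0
for _ in range(trials):
    successes, k = 0, 0
    while successes < r:        # run Bernoulli trials until the r-th success
        k += 1
        if random.random() < p:
            successes += 1
    umvue_sum += (r - 1) / (k - 1)
    mle_sum += r / k
umvue_mean, mle_mean = umvue_sum / trials, mle_sum / trials
print(round(umvue_mean, 2))  # close to p = 0.4
print(round(mle_mean, 2))    # noticeably above 0.4
```

Note that the UMVUE formula requires r ≥ 2; for r = 1 the ratio (r − 1)/(R − 1) is 0/0 at R = 1 and the unbiased estimate must be constructed separately.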
6. (a) Here R has the truncated Poisson distribution $P_\theta\{R = k\} = e^{-\theta}\theta^k/[k!\,(1 - e^{-\theta})]$, k = 1, 2, ..., and

$$E_\theta \psi(R) = \frac{e^{-\theta}}{1 - e^{-\theta}} \sum_{k=1}^\infty \psi(k)\, \frac{\theta^k}{k!}$$

Thus $E_\theta \psi(R) = e^{-\theta}$ requires

$$\sum_{k=1}^\infty \psi(k)\,\frac{\theta^k}{k!} = 1 - e^{-\theta} = \sum_{k=1}^\infty (-1)^{k-1}\,\frac{\theta^k}{k!}$$

The UMVUE is given by

$$\psi(k) = (-1)^{k-1} = \begin{cases} +1 & \text{if } k \text{ is odd} \\ -1 & \text{if } k \text{ is even} \end{cases}$$

(b) Since $e^{-\theta}$ is always > 0, the estimate found in part (a) looks rather silly. If $\psi_1(k) \equiv 1$ then $E_\theta\{[\psi_1(R) - e^{-\theta}]^2\} < E_\theta\{[\psi(R) - e^{-\theta}]^2\}$ for all θ, hence ψ is inadmissible.
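The unbiasedness in part (a) is a series identity, so it can be checked deterministically. A sketch of our own (not from the text), truncating the series at 60 terms:

```python
import math

# Deterministic check: for the Poisson distribution truncated to k >= 1,
# E_theta[psi(R)] = e^{-theta} when psi(k) = (-1)^(k-1), since the alternating
# series sums to 1 - e^{-theta}.
theta = 1.7
expectation = (math.exp(-theta) / (1 - math.exp(-theta))) * sum(
    (-1) ** (k - 1) * theta**k / math.factorial(k) for k in range(1, 60)
)
print(abs(expectation - math.exp(-theta)) < 1e-12)  # True
```

The estimate is exactly unbiased, yet it takes only the values ±1 while the target e^{−θ} lies in (0, 1), which is what makes it inadmissible.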
9. Since [(n + 1)/n]T is unbiased,

$$E_\theta\left[\left(\frac{n+1}{n}\,T - \theta\right)^2\right] = \mathrm{Var}_\theta\left[\frac{n+1}{n}\,T\right] = \frac{(n+1)^2}{n^2}\,\mathrm{Var}_\theta\, T$$

Now

$$E_\theta T = \frac{n}{\theta^n}\int_0^\theta y^n\, dy = \frac{n\theta}{n+1} \quad \text{(see Problem 2b)}, \qquad E_\theta T^2 = \frac{n}{\theta^n}\int_0^\theta y^{n+1}\, dy = \frac{n\theta^2}{n+2}$$

Thus

$$\mathrm{Var}_\theta\, T = \theta^2\left[\frac{n}{n+2} - \frac{n^2}{(n+1)^2}\right] = \frac{n\theta^2}{(n+1)^2(n+2)}$$

and

$$E_\theta\left[\left(\frac{n+1}{n}\,T - \theta\right)^2\right] = \frac{\theta^2}{n(n+2)}$$

Therefore if n > 1,

$$\frac{\theta^2}{n(n+2)} < \frac{\theta^2}{3n}$$
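The MSE comparison can be simulated. In the sketch below (our own, not from the text) we take the moment estimate 2R̄ as the competitor, since its variance is 4(θ²/12)/n = θ²/(3n), matching the bound above; θ = 1 and n = 5 are arbitrary choices:

```python
import random

# Monte Carlo sketch: the MSE of [(n+1)/n] max R_i is theta^2 / [n(n+2)],
# which for n > 1 beats theta^2 / (3n), the MSE of the unbiased moment
# estimate 2 * mean(R_i).
random.seed(3)
theta, n, trials = 1.0, 5, 40000
se_max = se_mean = 0.0
for _ in range(trials):
    sample = [random.uniform(0, theta) for _ in range(n)]
    se_max += ((n + 1) / n * max(sample) - theta) ** 2
    se_mean += (2 * sum(sample) / n - theta) ** 2
mse_max, mse_mean = se_max / trials, se_mean / trials
print(round(mse_max, 3), round(mse_mean, 3))  # near 1/35 and 1/15
```

For n = 5 the theoretical values are 1/35 ≈ 0.029 and 1/15 ≈ 0.067, so the maximum-based estimate is more than twice as efficient here.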
11. In the inequality $[(a + b)/2]^2 \le (a^2 + b^2)/2$, set $a = \psi_1(R) - \gamma(\theta)$, $b = \psi_2(R) - \gamma(\theta)$, to obtain

$$E_\theta\left[\left(\frac{\psi_1(R) - \gamma(\theta)}{2} + \frac{\psi_2(R) - \gamma(\theta)}{2}\right)^2\right] \le \frac{1}{2}\left[\rho_{\psi_1}(\theta) + \rho_{\psi_2}(\theta)\right] = \rho_{\psi_1}(\theta)$$

By minimality, we actually have equality. But the left side is

$$\frac{1}{4}\left[\rho_{\psi_1}(\theta) + \rho_{\psi_2}(\theta) + 2E\{[\psi_1(R) - \gamma(\theta)][\psi_2(R) - \gamma(\theta)]\}\right] = \frac{1}{2}\,\rho_{\psi_1}(\theta) + \frac{1}{2}\,\mathrm{Cov}_\theta[\psi_1(R), \psi_2(R)]$$

Thus

$$\mathrm{Cov}_\theta[\psi_1(R), \psi_2(R)] = \rho_{\psi_1}(\theta) = \left[\rho_{\psi_1}(\theta)\,\rho_{\psi_2}(\theta)\right]^{1/2} = \left[\mathrm{Var}_\theta\, \psi_1(R)\; \mathrm{Var}_\theta\, \psi_2(R)\right]^{1/2}$$

We therefore have equality in the Schwarz inequality, and it follows that (with probability 1) one of the two random variables $\psi_1(R) - \gamma(\theta)$, $\psi_2(R) - \gamma(\theta)$ is a multiple of the other (Problem 3, Section 3.4). The multiple is +1 or −1 since $\psi_1(R)$ and $\psi_2(R)$ have the same variance. If the multiple is +1, we are finished, and if it is −1, then $(\psi_1(R) + \psi_2(R))/2 = \gamma(\theta)$. The minimum variance is therefore 0, hence $\psi_1(R) = \psi_2(R) = \gamma(\theta)$, as desired.
Section 8.6

1. We have

$$P\left\{\frac{R_1}{R_2} \le z\right\} = \int_0^\infty \int_0^{zy} f(x, y)\, dx\, dy = \frac{1}{2^{(m+n)/2}\,\Gamma(m/2)\,\Gamma(n/2)} \int_0^\infty \int_0^{zy} y^{(n/2)-1}e^{-y/2}\; x^{(m/2)-1}e^{-x/2}\, dx\, dy$$

But $\int_0^{zy} x^{(m/2)-1}e^{-x/2}\, dx =$ (with x = uy) $\int_0^z (uy)^{(m/2)-1}e^{-uy/2}\, y\, du$. Thus

$$P\left\{\frac{R_1}{R_2} \le z\right\} = \int_0^z h(x)\, dx$$

where

$$h(x) = \frac{x^{(m/2)-1}}{2^{(m+n)/2}\,\Gamma(m/2)\,\Gamma(n/2)} \int_0^\infty y^{[(m+n)/2]-1}\, e^{-(y/2)(1+x)}\, dy = \frac{\Gamma[(m+n)/2]\, x^{(m/2)-1}\, 2^{(m+n)/2}}{2^{(m+n)/2}\,\Gamma(m/2)\,\Gamma(n/2)\,(1+x)^{(m+n)/2}}$$

$$= \frac{x^{(m/2)-1}}{\beta(m/2, n/2)\,(1+x)^{(m+n)/2}}, \qquad x > 0$$

If the density of $(R_1/m)/(R_2/n) = (n/m)(R_1/R_2)$ is computed from h by a change of variable, the result is $f_{mn}(x)$, as desired.
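As a sanity check, h should integrate to 1 over (0, ∞). A numeric sketch of our own (not from the text), substituting x = t/(1 − t) to map the half-line to (0, 1) and using a simple Riemann sum:

```python
import math

# Numeric check: h(x) = x^(m/2-1) / [B(m/2, n/2) (1+x)^((m+n)/2)]
# should integrate to 1 over (0, infinity).
m, n = 4, 6
beta = math.gamma(m / 2) * math.gamma(n / 2) / math.gamma((m + n) / 2)

def h(x):
    return x ** (m / 2 - 1) / (beta * (1 + x) ** ((m + n) / 2))

steps = 200000
total = 0.0
for i in range(1, steps):
    t = i / steps
    total += h(t / (1 - t)) / (1 - t) ** 2 / steps  # dx = dt / (1-t)^2
print(round(total, 4))  # approximately 1.0
```

The degrees of freedom m = 4, n = 6 are an arbitrary choice; any m, n ≥ 1 would do, though very small m makes the integrand steeper near 0.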
2. If R is chi-square with n degrees of freedom then $R = R_1^2 + \cdots + R_n^2$ where the $R_i$ are independent and normal (0, 1). Thus E(R) = n, and

$$\mathrm{Var}\, R = n\, \mathrm{Var}\, R_i^2 = n\left[E(R_i^4) - [E(R_i^2)]^2\right] = n(3 - 1) = 2n$$

If R has the t distribution with n degrees of freedom, then E(R) = 0 by symmetry, unless n = 1, in which case R has a Cauchy density and E(R) does not exist. Now in the integral

$$E(R^2) = \frac{\Gamma[(n+1)/2]}{\sqrt{n\pi}\,\Gamma(n/2)} \int_{-\infty}^\infty x^2 \left[1 + \frac{x^2}{n}\right]^{-(n+1)/2} dx$$

let

$$y = \frac{x^2/n}{1 + (x^2/n)}$$

The integral becomes

$$n^{3/2} \int_0^1 y^{1/2}(1 - y)^{(n/2)-2}\, dy = n^{3/2}\, \beta\left(\frac{3}{2},\ \frac{n}{2} - 1\right)$$

Thus

$$\mathrm{Var}\, R = E(R^2) = \frac{\Gamma[(n+1)/2]}{\sqrt{n\pi}\,\Gamma(n/2)}\; n^{3/2}\; \frac{\Gamma(3/2)\,\Gamma[(n/2) - 1]}{\Gamma[(n+1)/2]} = \frac{n/2}{(n/2) - 1} = \frac{n}{n-2}, \qquad n > 2$$

If n = 2, the same calculation gives Var R = ∞. A similar calculation shows that if R has the F(m, n) distribution then E(R) = n/(n − 2) if n > 2, E(R) = ∞ if n = 1 or 2, Var R = $[2n^2(m + n - 2)]/[m(n-2)^2(n-4)]$ if n > 4, Var R = ∞ if n = 3 or 4.
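The reduction of the gamma-function expression for E(R²) to n/(n − 2) can be verified numerically. A deterministic sketch of our own (not from the text):

```python
import math

# Deterministic check: the gamma-function expression for E(R^2) of the
# t distribution reduces to n / (n - 2) for n > 2.
for n in (3, 4, 5, 10, 25):
    e_r2 = (
        math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
        * n ** 1.5
        * math.gamma(1.5) * math.gamma(n / 2 - 1) / math.gamma((n + 1) / 2)
    )
    assert abs(e_r2 - n / (n - 2)) < 1e-9
print("ok")
```

The Γ[(n + 1)/2] factors cancel, Γ(3/2) = √π/2 cancels the √π, and Γ(n/2) = (n/2 − 1)Γ(n/2 − 1) supplies the denominator, exactly as in the closed form above.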
Index
Absolutely continuous random variable, 53
Absolutely continuous random vector, 72
Actions, set of, 242
Admissible and inadmissible tests, 247
Admissible risk points, 248
Alternative, simple and composite, 243
Average value, see Expectation
Bayes estimate, 260
  with constant risk, 262
  with quadratic loss function, 260
Bayes risk, 244, 260
Bayes test, 244
Bayes' theorem, 36, 150
Bernoulli distribution, see Distribution
Bernoulli trials, 28, 38, 58, 128, 151, 175, 177, 187, 190, 195, 207, 215
  generalized, 29
  see also Distribution, binomial
Beta distribution, see Distribution
Beta function, 133, 261
Binomial distribution, see Distribution
Boolean algebra, 3ff
Borel-Cantelli lemma, 205
  second, 209
Borel measurable function, 83
Borel sets, 47, 50
Bose-Einstein assumption, 21
Cauchy distribution, see Distribution
Central limit theorem, 169ff, 171
Characteristic function(s), 154ff
  correspondence theorem for, 156
  properties of, 166ff
  of a random vector, 279
Chebyshev's inequality, 126, 127, 129, 206, 208
Coin tossing, see Bernoulli trials; Distribution, binomial
Combinatorial problems, 15ff
  fallacies in, 39ff
  multiple counting in, 22
Complement of an event, 4
Conditional density, 136, 148
Conditional distribution function, 139, 140, 148
Conditional expectation, 140ff
Conditional probability, 33ff, 130ff
Conditional probability function, 98, 142
Confidence coefficient, 276
Confidence interval, 276
Confidence set, 278
Continuous random variable, 69
Convergence, almost surely (almost everywhere), 204-206, 208, 210
  in distribution, 170, 171, 175, 176
  in probability, 171, 175, 176, 205, 208, 210
Convex function, 262
Convexity of the risk set, 248
Convolution theorem, 164
Correlation, 119ff
Correlation coefficient, 120
Covariance, 119
Covariance function, 203
Covariance matrix, 281
Cylinder, 180
  measurable, 180
Decision function, 242, 243
  nonrandomized, 242
Decision scheme, 151
DeMorgan laws, 7, 9, 11
Density function(s), 53
  conditional, 136, 148
  joint, 70ff, 181
  marginal, 78
Difference equation, 24, 39, 182, 186, 195
  characteristic equation of, 183
Discrete probability space, 15
Discrete random variables, 51, 95ff
Disjoint events, 5
Distribution, 95
  Bernoulli, 256, 264, 266, 269, 272
  beta, 260, 268
  binomial, 29, 32, 95, 97-99, 113, 122, 141, 176, 256, 258, 260, 264, 268
  Cauchy, 161, 166, 264
  chi-square, 165, 275, 276, 278
  exponential, 56, 65, 93, 110, 111, 113, 129, 139, 150-152, 166, 168, 196, 200-202, 256, 263, 264
  F, 278
  gamma, 166, 267, 268
  geometric, 195, 196
  hypergeometric, 33, 256
  multidimensional Gaussian (joint Gaussian), 279ff
  negative binomial, 196, 264, 268, 272
  normal (Gaussian), 87, 88, 92, 94, 108, 113, 118, 124-126, 162, 165, 166, 171, 173-176, 252, 256, 267, 268, 271, 274-276, 278
  Poisson, 96-99, 114, 152, 163, 169, 197, 198, 200, 202, 256, 264, 266, 268, 270, 272
  t, 277, 278
  uniform, 54, 73, 76, 84, 92, 93, 113, 118, 141, 149, 150-152, 165, 208, 257, 263, 264, 267, 271, 272
Distribution function(s), 52
  conditional, 139, 140, 148
  joint, 72
  properties of, 66ff
Dominated convergence theorem, 231
Essentially constant random variable, 85, 115
Estimate, 258
  Bayes, 260
    with constant risk, 262
  inadmissible, 272
  maximum likelihood, 258
  minimax, 262
  randomized, 258, 263
  risk function of, 261
  unbiased, 268
  uniformly minimum variance unbiased (UMVUE), 269
Estimation, 152, 242, 243, 258ff
Event(s), 2, 11
  algebra of, 3ff
  complement of, 4
  contracting sequence of, 67
  exhaustive, 35
  expanding sequence of, 66
  impossible, 3, 55
  independent, 26, 27
  intersection of, 4
  mutually exclusive (disjoint), 5
  union of, 4
  upper and lower limits of sequence of, 204, 209
  sure (certain), 3
Expectation, 100ff
  conditional, 140ff
  general definition of, 103
  properties of, 114ff
Exponential distribution, see Distribution
F distribution, see Distribution
Factorization theorem, 266
Fatou's lemma, 230
Fermi-Dirac assumption, 20
Fourier series, 167
Fourier transform, 155
Gambler's ruin problem, 182ff, 235
Gamma distribution, see Distribution
Gamma function, 109, 133
Gaussian distribution, see Distribution, normal
Generating function, 169, 191ff
  moments obtained from, 192
Geometric distribution, see Distribution
Hypergeometric distribution, see Distribution
Hypothesis, 243ff
  a priori probability of, 244
  composite, 243
  null, 243
  simple, 243
Hypothesis testing, 151, 242, 243ff
  fundamental theorem of, 246
  see also Test
Independence, 25ff
Independence of sample mean and variance in normal sampling, 274
Independent events, 26, 27
Independent random variables, 80
Indicators, 122ff
Intersection of events, 4
Jensen's inequality, 262
Joint characteristic function, 279
Joint density function, 70ff, 181
Joint distribution function, 72
Joint probability function, 76, 96, 180, 181
Kolmogorov extension theorem, 180
Laplace transform, 155
  properties of, 156, 157
Lattice distribution, 169
Law of large numbers, strong, 129, 203, 206, 207
  weak, 128, 169, 171, 207
Lebesgue integral, 114
Level of a test, 246
Liapounov condition, 175
Likelihood ratio, 245
  test (LRT), 245
Limit inferior (lower limit), 204, 209
Limit superior (upper limit), 204, 209
Linearly dependent random variables, 121, 281
Loss function (cost function), 242
  quadratic, 260
Marginal densities, 78
Markov chain(s), 211ff
  closed sets of, 224
  cyclically moving subclasses of, 227
  definition of, 214
  equivalence classes of states of, 223
  first entrance theorem for, 220
  initial distribution of, 213
  limiting probabilities of, 230ff
  state distribution of, 214
  state space of, 213
  states of, 220ff
    aperiodic, periodic, 229
    essential, 229
    mean recurrence time of, 230
    period of, 226-229
    recurrent (persistent), 221ff
    recurrent null, 233
    recurrent positive, 233
    transient, 221ff
  stationary distribution for, 236
  steady state distribution for, 215, 237
  stopping time for, 217
  strong Markov property of, 219
  transition matrix of, 214
    n-step, 214
  transition probabilities of, 214
Maximum likelihood estimate, 258
Maxwell-Boltzmann assumption, 20
Median, 112
Mean, see Expectation
Minimax estimate, 262
Minimax test, 250
Moment-generating property of characteristic functions, 167, 168
Moments, 107
  central, 108
  joint, 119
  obtained from generating functions, 192
Multinomial probability function, 30, 98
Mutually exclusive events, 5
Negative binomial distribution, see Distribution
Negative part of a random variable, 104
Neyman-Pearson lemma, 246
Normal distribution, see Distribution
Observable, 242
Order statistics, 91
Partial fraction expansion, 159
Poisson distribution, see Distribution
Poisson random process, 196ff
Poker, 19, 23, 40
Positive part of a random variable, 104
Power function of a test, 253
Power of a test, 246
Probability, 10ff
  a posteriori, 36
  classical definition of, 1, 13, 16
  conditional, 33ff
  frequency definition of, 2, 13
Probability function, 51
  conditional, 98, 142
  joint, 76, 96, 180, 181
Probability measure(s), 12
  consistent, 180
Probability space, 12
  discrete, 15
Queueing, 216
Random process, 196
Random telegraph signal, 203
Random variable(s), 46ff
  absolutely continuous, 53
  central moments of, 108
  characteristic function of, 154ff
  classification of, 51ff
  continuous, 69
  definition of, 48, 50
  degenerate (essentially constant), 85, 115
  density function of, 53
  discrete, 51, 95ff
  functions of, 58ff, 84, 85ff, 94
  generating function of, 192ff
  independent, 80
  infinite sequences of, 178ff
  linearly dependent, 121, 281
  moments of, 107
  positive and negative parts of, 104
  probability function of, 51
  simple, 101
Random vector, 72
  absolutely continuous, 72
Random walk, 184ff
  combinatorial approach to, 186ff
  simple, 184
  with absorbing barriers, 184, 185, 215, 228, 240
  with no barriers, 185, 186-191, 193-195, 215, 228, 240
    average length of time required to return to 0 in, 186, 191, 195
    distribution of first return to 0 in, 189
    first passage times in, 190
    probability of eventual return to 0 in, 185
  with reflecting barriers, 229, 240
Rao-Blackwell theorem, 263
Recurrent (persistent) states of a Markov chain, 221
Reflection principle, 188
Renewal theorem, 235
Risk function, 261
Risk set, 248
Sample mean, 259, 274
Sample space, 2
Sample variance, 259, 274
Samples, 16ff
  ordered, with replacement, 16
    without replacement, 16
  unordered, with replacement, 18
    without replacement, 17
Sampling from a normal population, 274
Schwarz inequality, 119, 121, 207
Sigma field, 11
Simple random variable, 101
Size of a test, 246
Standard deviation, 108
States of nature, 241
Statistic, for a random variable, 265
  complete, 269
  sufficient, 265
Statistical decision model, 241
Statistics, 241ff
Stirling's formula, 43, 191
Stochastic matrix, 212
Stochastic process, 196
Stopping times, 217ff
Strong law of large numbers, 129, 203, 206, 207
Strong Markov property, 219
t Distribution, see Distribution
Test, 243
  acceptance region of, 278
  admissible and inadmissible, 247
  Bayes, 244
  level of, 246
  likelihood ratio (LRT), 245
  minimax, 250
  power of, 246
  power function of, 253
  rejection region (critical region) for, 243
  risk set of, 248
  size of, 246
  type 1 and type 2 errors of, 243
  uniformly most powerful (UMP), 253
Total expectation, theorem of, 144, 149, 152, 153
Total probability, theorem of, 35, 90, 130, 132, 134, 150, 182, 214
Transient states of a Markov chain, 221
Uniform distribution, see Distribution
Uniformly most powerful (UMP) test, 253
Union of events, 4
Unit step function, 157
Variance, 108, 115-118
Venn diagrams, 4
Weak law of large numbers, 128, 169, 171, 207