Probability and Random Processes for Electrical Engineers
John A. Gubner
University of Wisconsin-Madison
Discrete Random Variables

Bernoulli(p)
$P(X = 1) = p$, $P(X = 0) = 1 - p$.
$E[X] = p$, $\operatorname{var}(X) = p(1-p)$, $G_X(z) = (1-p) + pz$.

binomial(n, p)
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, $k = 0, \dots, n$.
$E[X] = np$, $\operatorname{var}(X) = np(1-p)$, $G_X(z) = [(1-p) + pz]^n$.

geometric0(p)
$P(X = k) = (1-p)p^k$, $k = 0, 1, 2, \dots$
$E[X] = \dfrac{p}{1-p}$, $\operatorname{var}(X) = \dfrac{p}{(1-p)^2}$, $G_X(z) = \dfrac{1-p}{1-pz}$.

geometric1(p)
$P(X = k) = (1-p)p^{k-1}$, $k = 1, 2, 3, \dots$
$E[X] = \dfrac{1}{1-p}$, $\operatorname{var}(X) = \dfrac{p}{(1-p)^2}$, $G_X(z) = \dfrac{(1-p)z}{1-pz}$.

negative binomial or Pascal(m, p)
$P(X = k) = \binom{k-1}{m-1} (1-p)^m p^{k-m}$, $k = m, m+1, \dots$
$E[X] = \dfrac{m}{1-p}$, $\operatorname{var}(X) = \dfrac{mp}{(1-p)^2}$, $G_X(z) = \left[\dfrac{(1-p)z}{1-pz}\right]^m$.
Note that Pascal(1, p) is the same as geometric1(p).

Poisson($\lambda$)
$P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!}$, $k = 0, 1, \dots$
$E[X] = \lambda$, $\operatorname{var}(X) = \lambda$, $G_X(z) = e^{\lambda(z-1)}$.
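As a quick sanity check on the table above, the following Python sketch sums each pmf numerically and compares the resulting mean and variance with the closed-form entries. The helper name `moments`, the parameter values, and the truncation point `K` are illustrative assumptions and do not come from the text; the infinite supports are truncated, so the agreement is only up to rounding and a negligible tail.

```python
from math import comb, exp, factorial

def moments(pmf, support):
    """Return (total mass, mean, variance) of a pmf summed over `support`."""
    total = sum(pmf(k) for k in support)
    mean = sum(k * pmf(k) for k in support)
    second = sum(k * k * pmf(k) for k in support)
    return total, mean, second - mean ** 2

# Illustrative parameter values (assumptions, not taken from the text).
p, n, lam = 0.3, 10, 2.5
K = 100  # truncation point for the infinite supports; the tails are negligible here

binomial = moments(lambda k: comb(n, k) * p**k * (1 - p)**(n - k), range(n + 1))
geometric0 = moments(lambda k: (1 - p) * p**k, range(K))
geometric1 = moments(lambda k: (1 - p) * p**(k - 1), range(1, K))
poisson = moments(lambda k: lam**k * exp(-lam) / factorial(k), range(K))

# Compare each numerical (mean, variance) with the table's closed forms.
for name, (total, mean, var), (m_formula, v_formula) in [
    ("binomial", binomial, (n * p, n * p * (1 - p))),
    ("geometric0", geometric0, (p / (1 - p), p / (1 - p) ** 2)),
    ("geometric1", geometric1, (1 / (1 - p), p / (1 - p) ** 2)),
    ("Poisson", poisson, (lam, lam)),
]:
    print(name, total, mean, m_formula, var, v_formula)
```

Agreement for one set of parameter values is only a spot check, not a derivation of the formulas.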
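Fourier Transforms

Fourier Transform: $H(f) = \int_{-\infty}^{\infty} h(t)\, e^{-j2\pi ft}\, dt$
Inversion Formula: $h(t) = \int_{-\infty}^{\infty} H(f)\, e^{j2\pi ft}\, df$

Transform pairs, with $h(t)$ on the left and $H(f)$ on the right:

$I_{[-T,T]}(t) \;\longleftrightarrow\; 2T\,\dfrac{\sin(2\pi Tf)}{2\pi Tf}$
$2W\,\dfrac{\sin(2\pi Wt)}{2\pi Wt} \;\longleftrightarrow\; I_{[-W,W]}(f)$
$(1 - |t|/T)\, I_{[-T,T]}(t) \;\longleftrightarrow\; T \left[\dfrac{\sin(\pi Tf)}{\pi Tf}\right]^2$
$W \left[\dfrac{\sin(\pi Wt)}{\pi Wt}\right]^2 \;\longleftrightarrow\; (1 - |f|/W)\, I_{[-W,W]}(f)$
$e^{-\lambda t} u(t) \;\longleftrightarrow\; \dfrac{1}{\lambda + j2\pi f}$
$e^{-\lambda |t|} \;\longleftrightarrow\; \dfrac{2\lambda}{\lambda^2 + (2\pi f)^2}$
$\dfrac{\lambda}{\lambda^2 + t^2} \;\longleftrightarrow\; \pi e^{-2\pi\lambda |f|}$
$e^{-(t/\sigma)^2/2} \;\longleftrightarrow\; \sqrt{2\pi}\,\sigma\, e^{-\sigma^2 (2\pi f)^2/2}$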
Preface

Intended Audience
This book contains enough material to serve as a text for a two-course sequence in probability and random processes for electrical engineers. It is also useful as a reference by practicing engineers. For students with no background in probability and random processes, a first course can be offered either at the undergraduate level to talented juniors and seniors, or at the graduate level. The prerequisite is the usual undergraduate electrical engineering course on signals and systems, e.g., Haykin and Van Veen [18] or Oppenheim and Willsky [30] (see the Bibliography at the end of the book). A second course can be offered at the graduate level. The additional prerequisite is some familiarity with linear algebra; e.g., matrix-vector multiplication, determinants, and matrix inverses. Because of the special attention paid to complex-valued Gaussian random vectors and related random variables, the text will be of particular interest to students in wireless communications. Additionally, the last chapter, which focuses on self-similar processes, long-range dependence, and aggregation, will be very useful for students in communication networks who are interested in modeling Internet traffic.
Material for a First Course
In a first course, Chapters 1-5 would make up the core of any offering. These chapters cover the basics of probability and discrete and continuous random variables. Following Chapter 5, additional topics such as wide-sense stationary processes (Sections 6.1-6.5), the Poisson process (Section 8.1), discrete-time Markov chains (Section 9.1), and confidence intervals (Sections 12.1-12.4) can also be included. These topics can be covered independently of each other, in any order, except that Problem 15 in Chapter 12 refers to the Poisson process.
Material for a Second Course
In a second course, Chapters 7-8 and 10-11 would make up the core, with additional material from Chapters 6, 9, 12, and 13 depending on student preparation and course objectives. For review purposes, it may be helpful at the beginning of the course to assign the more advanced problems from Chapters 1-5 that are marked with a ⋆.
Features
Those parts of the book mentioned above as being suitable for a first course are written at a level appropriate for undergraduates. More advanced problems and sections in these parts of the book are indicated by a ⋆. Those parts of the book not indicated as suitable for a first course are written at a level suitable for graduate students.

Throughout the text, there are numerical superscripts that refer to notes at the end of each chapter. These notes are usually rather technical and address subtleties of the theory.

The last section of each chapter is devoted to problems. The problems are separated by section so that all problems relating to a particular section are clearly indicated. This enables the student to refer to the appropriate part of the text for background relating to particular problems, and it enables the instructor to make up assignments more quickly.

Tables of discrete random variables and of Fourier transform pairs are found inside the front cover. A table of continuous random variables is found inside the back cover.

When cdfs or other functions are encountered that do not have a closed form, Matlab commands are given for computing them. For a list, see "Matlab" in the index.

The index was compiled as the book was being written. Hence, there are many pointers to specific information. For example, see "noncentral chi-squared random variable."

With the importance of wireless communications, it is vital for students to be comfortable with complex random variables, especially complex Gaussian random variables. Hence, Section 7.5 is devoted to this topic, with special attention to the circularly symmetric complex Gaussian. Also important in this regard are the central and noncentral chi-squared random variables and their square roots, the Rayleigh and Rice random variables. These random variables, as well as related ones such as the beta, F, Nakagami, and Student's t, appear in numerous problems in Chapters 3-5 and 7.

Chapter 10 gives an extensive treatment of convergence in mean of order p. Special attention is given to mean-square convergence and the Hilbert space of square-integrable random variables. This allows us to prove the projection theorem, which is important for establishing the existence of certain estimators, and conditional expectation in particular.
The Hilbert space setting allows us to define the Wiener integral, which is used in Chapter 13 on advanced topics to construct fractional Brownian motion.

We give an extensive treatment of confidence intervals in Chapter 12, not only for normal data, but also for more arbitrary data via the central limit theorem. The latter is important for any situation in which the data is not normal; e.g., sampling for quality control, election polls, etc. Much of this material can be covered in a first course. However, the last subsection on derivations is more advanced, and uses results from Chapter 7 on Gaussian random vectors.

With the increasing use of self-similar and long-range dependent processes for modeling Internet traffic, we provide in Chapter 13 an introduction to these developments. In particular, fractional Brownian motion is a standard example of a continuous-time self-similar process with stationary increments. Fractional autoregressive integrated moving average (FARIMA) processes are developed as examples of important discrete-time processes in this context.

July 19, 2002
Table of Contents

1 Introduction to Probability
  1.1 Review of Set Notation
  1.2 Probability Models
  1.3 Axioms and Properties of Probability
      Consequences of the Axioms
  1.4 Independence
      Independence for More Than Two Events
  1.5 Conditional Probability
      The Law of Total Probability and Bayes' Rule
  1.6 Notes
  1.7 Problems

2 Discrete Random Variables
  2.1 Probabilities Involving Random Variables
      Discrete Random Variables
      Integer-Valued Random Variables
      Pairs of Random Variables
      Multiple Independent Random Variables
      Probability Mass Functions
  2.2 Expectation
      Expectation of Functions of Random Variables, or the Law of the Unconscious Statistician (LOTUS)
      ⋆ Derivation of LOTUS
      Linearity of Expectation
      Moments
      Probability Generating Functions
      Expectations of Products of Functions of Independent Random Variables
      Binomial Random Variables and Combinations
      Poisson Approximation of Binomial Probabilities
  2.3 The Weak Law of Large Numbers
      Uncorrelated Random Variables
      Markov's Inequality
      Chebyshev's Inequality
      Conditions for the Weak Law
  2.4 Conditional Probability
      The Law of Total Probability
      The Substitution Law
  2.5 Conditional Expectation
      Substitution Law for Conditional Expectation
      Law of Total Probability for Expectation
  2.6 Notes
  2.7 Problems

3 Continuous Random Variables
  3.1 Definition and Notation
      The Paradox of Continuous Random Variables
  3.2 Expectation of a Single Random Variable
      Moment Generating Functions
      Characteristic Functions
  3.3 Expectation of Multiple Random Variables
      Linearity of Expectation
      Expectations of Products of Functions of Independent Random Variables
  3.4 ⋆ Probability Bounds
  3.5 Notes
  3.6 Problems

4 Analyzing Systems with Random Inputs
  4.1 Continuous Random Variables
      ⋆ The Normal CDF and the Error Function
  4.2 Reliability
  4.3 Cdfs for Discrete Random Variables
  4.4 Mixed Random Variables
  4.5 Functions of Random Variables and Their Cdfs
  4.6 Properties of Cdfs
  4.7 The Central Limit Theorem
      Derivation of the Central Limit Theorem
  4.8 Problems

5 Multiple Random Variables
  5.1 Joint and Marginal Probabilities
      Product Sets and Marginal Probabilities
      Joint and Marginal Cumulative Distributions
  5.2 Jointly Continuous Random Variables
      Marginal Densities
      Specifying Joint Densities
      Independence
      Expectation
      ⋆ Continuous Random Variables That Are not Jointly Continuous
  5.3 Conditional Probability and Expectation
  5.4 The Bivariate Normal
  5.5 ⋆ Multivariate Random Variables
      The Law of Total Probability
  5.6 Notes
  5.7 Problems

6 Introduction to Random Processes
  6.1 Mean, Correlation, and Covariance
  6.2 Wide-Sense Stationary Processes
      Strict-Sense Stationarity
      Wide-Sense Stationarity
      Properties of Correlation Functions and Power Spectral Densities
  6.3 WSS Processes through Linear Time-Invariant Systems
  6.4 The Matched Filter
  6.5 The Wiener Filter
      ⋆ Causal Wiener Filters
  6.6 ⋆ Expected Time-Average Power and the Wiener-Khinchin Theorem
      Mean-Square Law of Large Numbers for WSS Processes
  6.7 ⋆ Power Spectral Densities for non-WSS Processes
      Derivation of (6.22)
  6.8 Notes
  6.9 Problems

7 Random Vectors
  7.1 Mean Vector, Covariance Matrix, and Characteristic Function
  7.2 The Multivariate Gaussian
      The Characteristic Function of a Gaussian Random Vector
      For Gaussian Random Vectors, Uncorrelated Implies Independent
      The Density Function of a Gaussian Random Vector
  7.3 Estimation of Random Vectors
      Linear Minimum Mean Squared Error Estimation
      Minimum Mean Squared Error Estimation
  7.4 Transformations of Random Vectors
  7.5 Complex Random Variables and Vectors
      Complex Gaussian Random Vectors
  7.6 Notes
  7.7 Problems

8 Advanced Concepts in Random Processes
  8.1 The Poisson Process
      ⋆ Derivation of the Poisson Probabilities
      Marked Poisson Processes
      Shot Noise
  8.2 Renewal Processes
  8.3 The Wiener Process
      Integrated White-Noise Interpretation of the Wiener Process
      The Problem with White Noise
      The Wiener Integral
      Random Walk Approximation of the Wiener Process
  8.4 Specification of Random Processes
      Finitely Many Random Variables
      Infinite Sequences (Discrete Time)
      Continuous-Time Random Processes
  8.5 Notes
  8.6 Problems

9 Introduction to Markov Chains
  9.1 Discrete-Time Markov Chains
      State Space and Transition Probabilities
      Examples
      Stationary Distributions
      Derivation of the Chapman-Kolmogorov Equation
      Stationarity of the n-step Transition Probabilities
  9.2 Continuous-Time Markov Chains
      Kolmogorov's Differential Equations
      Stationary Distributions
  9.3 Problems

10 Mean Convergence and Applications
  10.1 Convergence in Mean of Order p
  10.2 Normed Vector Spaces of Random Variables
  10.3 The Wiener Integral (Again)
  10.4 Projections, Orthogonality Principle, Projection Theorem
  10.5 Conditional Expectation
      Notation
  10.6 The Spectral Representation
  10.7 Notes
  10.8 Problems

11 Other Modes of Convergence
  11.1 Convergence in Probability
  11.2 Convergence in Distribution
  11.3 Almost Sure Convergence
      The Skorohod Representation Theorem
  11.4 Notes
  11.5 Problems

12 Parameter Estimation and Confidence Intervals
  12.1 The Sample Mean
  12.2 Confidence Intervals When the Variance Is Known
  12.3 The Sample Variance
  12.4 Confidence Intervals When the Variance Is Unknown
      Applications
      Sampling with and without Replacement
  12.5 Confidence Intervals for Normal Data
      Estimating the Mean
      Limiting t Distribution
      Estimating the Variance - Known Mean
      Estimating the Variance - Unknown Mean
      ⋆ Derivations
  12.6 Notes
  12.7 Problems

13 Advanced Topics
  13.1 Self Similarity in Continuous Time
      Implications of Self Similarity
      Stationary Increments
      Fractional Brownian Motion
  13.2 Self Similarity in Discrete Time
      Convergence Rates for the Mean-Square Law of Large Numbers
      Aggregation
      The Power Spectral Density
      Notation
  13.3 Asymptotic Second-Order Self Similarity
  13.4 Long-Range Dependence
  13.5 ARMA Processes
  13.6 ARIMA Processes
  13.7 Problems

Bibliography
Index
CHAPTER 1
Introduction to Probability

If we toss a fair coin many times, then we expect that the fraction of heads should be close to 1/2, and we say that 1/2 is the "probability of heads." In fact, we might try to define the probability of heads to be the limiting value of the fraction of heads as the total number of tosses tends to infinity. The difficulty with this approach is that there is no simple way to guarantee that the desired limit exists.

An alternative approach was developed by A. N. Kolmogorov in 1933. Kolmogorov's idea was to start with an axiomatic definition of probability and then deduce results logically as consequences of the axioms. One of the successes of Kolmogorov's approach is that under suitable assumptions, it can be proved that the fraction of heads in a sequence of tosses of a fair coin converges to 1/2 as the number of tosses increases. You will meet a simple version of this result in the weak law of large numbers in Chapter 2. Generally speaking, a law of large numbers is a theorem that gives conditions under which the numerical average of a large number of measurements converges in some sense to a probability or other parameter specified by the underlying mathematical model. Other laws of large numbers are discussed in Chapters 10 and 11.

Another celebrated limit result of probability theory is the central limit theorem, which you will meet in Chapter 4. The central limit theorem says that when you add up a large number of random perturbations, the overall effect has a Gaussian or normal distribution. For example, the central limit theorem explains why thermal noise in amplifiers has a Gaussian distribution. The central limit theorem also accounts for the fact that the speed of a particle in an ideal gas has the Maxwell distribution.

In addition to the more famous results mentioned above, in this book you will learn the "tools of the trade" of probability and random processes. These include probability mass functions and densities, expectation, transform methods, random processes, filtering of processes by linear time-invariant systems, and more.

Since probability theory relies heavily on the use of set notation and set theory, a brief review of these topics is given in Section 1.1. In Section 1.2, we consider a number of simple physical experiments, and we construct mathematical probability models for them. These models are used to solve several sample problems. Motivated by our specific probability models, in Section 1.3, we introduce the general axioms of probability and several of their consequences. The concepts of statistical independence and conditional probability are introduced in Sections 1.4 and 1.5, respectively. Section 1.6 contains the notes that are referenced in the text by numerical superscripts. These notes are usually rather technical and can be skipped by the beginning student. However, the notes provide a more in-depth discussion of certain topics that may be of interest to more advanced readers. The chapter concludes with problems for the reader to solve. Problems and sections marked by a ⋆ are intended for more advanced readers.
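The behavior of the fraction of heads described at the start of this chapter can be explored with a short simulation. The Python sketch below is only an illustration: the function name `fraction_of_heads`, the fixed seed, and the chosen numbers of tosses are assumptions, and a simulation of this kind suggests the limiting behavior without proving it.

```python
import random

def fraction_of_heads(num_tosses, seed=0):
    """Simulate fair-coin tosses and return the fraction that come up heads."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so runs are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses

# The fraction typically moves closer to 1/2 as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, fraction_of_heads(n))
```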
1.1. Review of Set Notation
Let $\Omega$ be a set of points. If $\omega$ is a point in $\Omega$, we write $\omega \in \Omega$. Let $A$ and $B$ be two collections of points in $\Omega$. If every point in $A$ also belongs to $B$, we say that $A$ is a subset of $B$, and we denote this by writing $A \subset B$. If $A \subset B$ and $B \subset A$, then we write $A = B$; i.e., two sets are equal if they contain exactly the same points.

Set relationships can be represented graphically in Venn diagrams. In these pictures, the whole space $\Omega$ is represented by a rectangular region, and subsets of $\Omega$ are represented by disks or oval-shaped regions. For example, in Figure 1.1(a), the disk $A$ is completely contained in the oval-shaped region $B$, thus depicting the relation $A \subset B$.

If $A \subset \Omega$, and $\omega \in \Omega$ does not belong to $A$, we write $\omega \notin A$. The set of all such $\omega$ is called the complement of $A$ in $\Omega$; i.e.,
\[ A^c := \{\omega \in \Omega : \omega \notin A\}. \]
This is illustrated in Figure 1.1(b), in which the shaded region is the complement of the disk $A$.

The empty set or null set of $\Omega$ is denoted by $\varnothing$; it contains no points of $\Omega$. Note that for any $A \subset \Omega$, $\varnothing \subset A$. Also, $\Omega^c = \varnothing$.
Figure 1.1. (a) Venn diagram of $A \subset B$. (b) The complement of the disk $A$, denoted by $A^c$, is the shaded part of the diagram.
The union of two subsets $A$ and $B$ is
\[ A \cup B := \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B\}. \]
Here "or" is inclusive; i.e., if $\omega \in A \cup B$, we permit $\omega$ to belong to either $A$ or $B$ or both. This is illustrated in Figure 1.2(a), in which the shaded region is the union of the disk $A$ and the oval-shaped region $B$.

The intersection of two subsets $A$ and $B$ is
\[ A \cap B := \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}; \]
hence, $\omega \in A \cap B$ if and only if $\omega$ belongs to both $A$ and $B$. This is illustrated in Figure 1.2(b), in which the shaded area is the intersection of the disk $A$ and the oval-shaped region $B$. The reader should also note the following special case. If $A \subset B$ (recall Figure 1.1(a)), then $A \cap B = A$. In particular, we always have $A \cap \Omega = A$.
Figure 1.2. (a) The shaded region is $A \cup B$. (b) The shaded region is $A \cap B$.
The set difference operation is defined by
\[ B \setminus A := B \cap A^c; \]
i.e., $B \setminus A$ is the set of $\omega \in B$ that do not belong to $A$. In Figure 1.3(a), $B \setminus A$ is the shaded part of the oval-shaped region $B$.

Two subsets $A$ and $B$ are disjoint or mutually exclusive if $A \cap B = \varnothing$; i.e., there is no point in $\Omega$ that belongs to both $A$ and $B$. This condition is depicted in Figure 1.3(b).

Figure 1.3. (a) The shaded region is $B \setminus A$. (b) Venn diagram of disjoint sets $A$ and $B$.
Using the preceding definitions, it is easy to see that the following properties hold for subsets $A$, $B$, and $C$ of $\Omega$.

The commutative laws are
\[ A \cup B = B \cup A \quad\text{and}\quad A \cap B = B \cap A. \]
The associative laws are
\[ A \cap (B \cap C) = (A \cap B) \cap C \quad\text{and}\quad A \cup (B \cup C) = (A \cup B) \cup C. \]
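The distributive laws are
\[ A \cap (B \cup C) = (A \cap B) \cup (A \cap C) \]
and
\[ A \cup (B \cap C) = (A \cup B) \cap (A \cup C). \]
DeMorgan's laws are
\[ (A \cap B)^c = A^c \cup B^c \quad\text{and}\quad (A \cup B)^c = A^c \cap B^c. \]

We next consider infinite collections of subsets of $\Omega$. Suppose $A_n \subset \Omega$, $n = 1, 2, \dots$ Then
\[ \bigcup_{n=1}^{\infty} A_n := \{\omega \in \Omega : \omega \in A_n \text{ for some } n \ge 1\}. \]
In other words, $\omega \in \bigcup_{n=1}^{\infty} A_n$ if and only if for at least one integer $n \ge 1$, $\omega \in A_n$. This definition admits the possibility that $\omega \in A_n$ for more than one value of $n$. Next, we define
\[ \bigcap_{n=1}^{\infty} A_n := \{\omega \in \Omega : \omega \in A_n \text{ for all } n \ge 1\}. \]
In other words, $\omega \in \bigcap_{n=1}^{\infty} A_n$ if and only if $\omega \in A_n$ for every positive integer $n$.

Example 1.1. Let $\Omega$ denote the real numbers, $\Omega = (-\infty, \infty)$. Then the following infinite intersections and unions can be simplified. Consider the intersection
\[ \bigcap_{n=1}^{\infty} (-\infty, 1/n) = \{\omega : \omega < 1/n \text{ for all } n \ge 1\}. \]
Now, if $\omega < 1/n$ for all $n \ge 1$, then $\omega$ cannot be positive; i.e., we must have $\omega \le 0$. Conversely, if $\omega \le 0$, then for all $n \ge 1$, $\omega \le 0 < 1/n$. It follows that
\[ \bigcap_{n=1}^{\infty} (-\infty, 1/n) = (-\infty, 0]. \]
Consider the infinite union,
\[ \bigcup_{n=1}^{\infty} (-\infty, -1/n] = \{\omega : \omega \le -1/n \text{ for some } n \ge 1\}. \]
Now, if $\omega \le -1/n$ for some $n \ge 1$, then we must have $\omega < 0$. Conversely, if $\omega < 0$, then for large enough $n$, $\omega \le -1/n$. Thus,
\[ \bigcup_{n=1}^{\infty} (-\infty, -1/n] = (-\infty, 0). \]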
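In a similar way, one can show that
\[ \bigcap_{n=1}^{\infty} [0, 1/n) = \{0\}, \]
as well as
\[ \bigcup_{n=1}^{\infty} (-\infty, n] = (-\infty, \infty) \quad\text{and}\quad \bigcap_{n=1}^{\infty} (-\infty, -n] = \varnothing. \]

The following generalized distributive laws also hold,
\[ B \cap \Bigl( \bigcup_{n=1}^{\infty} A_n \Bigr) = \bigcup_{n=1}^{\infty} (B \cap A_n), \]
and
\[ B \cup \Bigl( \bigcap_{n=1}^{\infty} A_n \Bigr) = \bigcap_{n=1}^{\infty} (B \cup A_n). \]
We also have the generalized DeMorgan's laws,
\[ \Bigl( \bigcap_{n=1}^{\infty} A_n \Bigr)^c = \bigcup_{n=1}^{\infty} A_n^c \]
and
\[ \Bigl( \bigcup_{n=1}^{\infty} A_n \Bigr)^c = \bigcap_{n=1}^{\infty} A_n^c. \]

Finally, we will need the following definition. We say that subsets $A_n$, $n = 1, 2, \dots$, are pairwise disjoint if $A_n \cap A_m = \varnothing$ for all $n \ne m$.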
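The finite set identities reviewed in this section (commutative, associative, distributive, and DeMorgan's laws, together with the set difference) can be spot-checked on a small example. In the Python sketch below, the ten-point universe `OMEGA`, the particular sets `A`, `B`, `C`, and the helper names are illustrative assumptions; confirming the identities on one example is a sanity check, not a proof.

```python
# Finite universe standing in for the whole space Omega (an illustrative choice).
OMEGA = frozenset(range(1, 11))
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})
C = frozenset({4, 6, 8, 10})

def complement(S):
    """Complement of S relative to OMEGA."""
    return OMEGA - S

def check_set_laws(A, B, C):
    """Return a dict mapping each identity's name to True or False."""
    return {
        "commutative": A | B == B | A and A & B == B & A,
        "associative": (A & (B & C) == (A & B) & C
                        and A | (B | C) == (A | B) | C),
        "distributive": (A & (B | C) == (A & B) | (A & C)
                         and A | (B & C) == (A | B) & (A | C)),
        "de_morgan": (complement(A & B) == complement(A) | complement(B)
                      and complement(A | B) == complement(A) & complement(B)),
        "difference": B - A == B & complement(A),
    }

results = check_set_laws(A, B, C)
print(results)
```

Python's `|`, `&`, and `-` on `frozenset` correspond to union, intersection, and set difference, which is why the identities can be written almost verbatim.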
1.2. Probability Models
Consider the experiment of tossing a fair die and measuring, i.e., noting, the face turned up. Our intuition tells us that the "probability" of the $i$th face turning up is 1/6, and that the "probability" of a face with an even number of dots turning up is 1/2.

Here is a mathematical model for this experiment and measurement. Let $\Omega$ be any set containing six points. We call $\Omega$ the sample space. Each point in $\Omega$ corresponds to, or models, a possible outcome of the experiment. For simplicity, let
\[ \Omega := \{1, 2, 3, 4, 5, 6\}. \]
Now put
\[ F_i := \{i\}, \quad i = 1, 2, 3, 4, 5, 6, \]
and
\[ E := \{2, 4, 6\}. \]
We call the sets $F_i$ and $E$ events. The event $F_i$ corresponds to, or models, the die's turning up showing the $i$th face. Similarly, the event $E$ models the die's showing a face with an even number of dots. Next, for every subset $A$ of $\Omega$, we denote the number of points in $A$ by $|A|$. We call $|A|$ the cardinality of $A$. We define the probability of any event $A$ by
\[ P(A) := |A|/|\Omega|. \]
It then follows that $P(F_i) = 1/6$ and $P(E) = 3/6 = 1/2$, which agrees with our intuition.

We now make four observations about our model:
(i) $P(\varnothing) = |\varnothing|/|\Omega| = 0/|\Omega| = 0$.
(ii) $P(A) \ge 0$ for every event $A$.
(iii) If $A$ and $B$ are mutually exclusive events, i.e., $A \cap B = \varnothing$, then $P(A \cup B) = P(A) + P(B)$; for example, $F_3 \cap E = \varnothing$, and it is easy to check that $P(F_3 \cup E) = P(\{2, 3, 4, 6\}) = P(F_3) + P(E)$.
(iv) When the die is tossed, something happens; this is modeled mathematically by the easily verified fact that $P(\Omega) = 1$.
As we shall see, these four properties hold for all the models discussed in this section.

We next modify our model to accommodate an unfair die as follows. Observe that for a fair die,*
\[ P(A) = \frac{|A|}{|\Omega|} = \sum_{\omega \in A} \frac{1}{|\Omega|} = \sum_{\omega \in A} p(\omega), \]
where $p(\omega) := 1/|\Omega|$. For an unfair die, we redefine $P$ by taking
\[ P(A) := \sum_{\omega \in A} p(\omega), \]
where now $p(\omega)$ is not constant, but is chosen to reflect the likelihood of occurrence of the various faces. This new definition of $P$ still satisfies (i) and (iii); however, to guarantee that (ii) and (iv) still hold, we must require that $p$ be nonnegative and sum to one, or, in symbols, $p(\omega) \ge 0$ and $\sum_{\omega \in \Omega} p(\omega) = 1$.

Example 1.2. Consider a die for which face $i$ is twice as likely as face $i - 1$. Find the probability of the die's showing a face with an even number of dots.

Solution. To solve the problem, we use the above modified probability model with† $p(\omega) = 2^{\omega - 1}/63$. We then need to realize that we are being asked to compute $P(E)$, where $E = \{2, 4, 6\}$. Hence,
\[ P(E) = [2^1 + 2^3 + 2^5]/63 = 42/63 = 2/3. \]

* If $A = \varnothing$, the summation is always taken to be zero.
† The derivation of this formula appears in Problem 8.
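The fair-die and unfair-die models above can be reproduced in a few lines of Python. This is a minimal sketch under stated assumptions: the names `prob`, `fair`, and `unfair` are hypothetical, and exact rational arithmetic from the standard `fractions` module is used so that the four observations and the answer 2/3 of Example 1.2 can be compared without rounding.

```python
from fractions import Fraction

OMEGA = range(1, 7)  # sample space: the six faces of the die

def prob(event, pmf):
    """P(A) = sum of pmf[w] over w in A; an empty event gives 0."""
    return sum((pmf[w] for w in event), Fraction(0))

# Fair die: p(w) = 1/|Omega| for every face.
fair = {w: Fraction(1, 6) for w in OMEGA}

# Unfair die of Example 1.2: p(w) = 2**(w - 1) / 63.
unfair = {w: Fraction(2 ** (w - 1), 63) for w in OMEGA}

E = {2, 4, 6}  # a face with an even number of dots
F3 = {3}       # the third face

p_even_fair = prob(E, fair)
p_even_unfair = prob(E, unfair)
print(p_even_fair, p_even_unfair)  # prints: 1/2 2/3

# The four observations (i)-(iv), checked for the unfair die.
assert prob(set(), unfair) == 0                                    # (i)
assert all(prob({w}, unfair) >= 0 for w in OMEGA)                  # (ii)
assert prob(F3 | E, unfair) == prob(F3, unfair) + prob(E, unfair)  # (iii)
assert prob(OMEGA, unfair) == 1                                    # (iv)
```

Storing the model as a dictionary from outcomes to probabilities keeps the code close to the definition of $P(A)$ as a sum of $p(\omega)$ over $\omega \in A$.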
This problem is typical of the kinds of "word problems" to which probability theory is applied to analyze well-defined physical experiments. The application of probability theory requires the modeler to take the following steps:
1. Select a suitable sample space $\Omega$.
2. Define $P(A)$ for all events $A$.
3. Translate the given "word problem" into a problem requiring the calculation of $P(E)$ for some specific event $E$.

The following example gives a family of constructions that can be used to model experiments having a finite number of possible outcomes.

Example 1.3. Let $M$ be a positive integer, and put $\Omega := \{1, 2, \dots, M\}$. Next, let $p(1), \dots, p(M)$ be nonnegative real numbers such that $\sum_{\omega=1}^{M} p(\omega) = 1$. For any subset $A \subset \Omega$, put
\[ P(A) := \sum_{\omega \in A} p(\omega). \]
In particular, to model equally likely outcomes, or equivalently, outcomes that occur "at random," we take $p(\omega) = 1/M$. In this case, $P(A)$ reduces to $|A|/|\Omega|$.
Example 1.4. A single card is drawn at random from a well-shuffled deck of playing cards. Find the probability of drawing an ace. Also find the probability of drawing a face card.

Solution. The first step in the solution is to specify the sample space $\Omega$ and the probability $P$. Since there are 52 possible outcomes, we take $\Omega := \{1, \ldots, 52\}$. Each integer corresponds to one of the cards in the deck. To specify $P$, we must define $P(E)$ for all events $E \subset \Omega$. Since all cards are equally likely to be drawn, we put $P(E) := |E|/|\Omega|$. To find the desired probabilities, let $1, 2, 3, 4$ correspond to the four aces, and let $41, \ldots, 52$ correspond to the 12 face cards. We identify the drawing of an ace with the event $A := \{1, 2, 3, 4\}$, and we identify the drawing of a face card with the event $F := \{41, \ldots, 52\}$. It then follows that $P(A) = |A|/52 = 4/52 = 1/13$ and $P(F) = |F|/52 = 12/52 = 3/13$.

While the sample spaces $\Omega$ in Example 1.3 can model any experiment with a finite number of outcomes, it is often convenient to use alternative sample spaces.

Example 1.5. Suppose that we have two well-shuffled decks of cards, and we draw one card at random from each deck. What is the probability of drawing the ace of spades followed by the jack of hearts? What is the probability of drawing an ace and a jack (in either order)?
Solution. The first step in the solution is to specify the sample space $\Omega$ and the probability $P$. Since there are 52 possibilities for each draw, there are $52^2 = 2{,}704$ possible outcomes when drawing two cards. Let $D := \{1, \ldots, 52\}$, and put
$$\Omega := \{(i, j) : i, j \in D\}.$$
Then $|\Omega| = |D|^2 = 52^2 = 2{,}704$ as required. Since all pairs are equally likely, we put $P(E) := |E|/|\Omega|$ for arbitrary events $E \subset \Omega$. As in the preceding example, we denote the aces by $1, 2, 3, 4$. We let 1 denote the ace of spades. We also denote the jacks by $41, 42, 43, 44$, and the jack of hearts by 42. The drawing of the ace of spades followed by the jack of hearts is identified with the event
$$A := \{(1, 42)\},$$
and so $P(A) = 1/2{,}704 \approx 0.000370$. The drawing of an ace and a jack is identified with $B := B_{aj} \cup B_{ja}$, where
$$B_{aj} := \{(i, j) : i \in \{1, 2, 3, 4\} \text{ and } j \in \{41, 42, 43, 44\}\}$$
corresponds to the drawing of an ace followed by a jack, and
$$B_{ja} := \{(i, j) : i \in \{41, 42, 43, 44\} \text{ and } j \in \{1, 2, 3, 4\}\}$$
corresponds to the drawing of a jack followed by an ace. Since $B_{aj}$ and $B_{ja}$ are disjoint, $P(B) = P(B_{aj}) + P(B_{ja}) = (|B_{aj}| + |B_{ja}|)/|\Omega|$. Since $|B_{aj}| = |B_{ja}| = 16$, $P(B) = 2 \cdot 16/2{,}704 = 2/169 \approx 0.0118$.
Example 1.6. Two cards are drawn at random from a single well-shuffled deck of playing cards. What is the probability of drawing the ace of spades followed by the jack of hearts? What is the probability of drawing an ace and a jack (in either order)?

Solution. The first step in the solution is to specify the sample space $\Omega$ and the probability $P$. There are 52 possibilities for the first draw and 51 possibilities for the second. Hence, the sample space should contain $52 \cdot 51 = 2{,}652$ elements. Using the notation of the preceding example, we take
$$\Omega := \{(i, j) : i, j \in D \text{ with } i \ne j\}.$$
Note that $|\Omega| = 52^2 - 52 = 2{,}652$ as required. Again, all such pairs are equally likely, and so we take $P(E) := |E|/|\Omega|$ for arbitrary events $E \subset \Omega$. The events $A$ and $B$ are defined as before, and the calculation is the same except that $|\Omega| = 2{,}652$ instead of $2{,}704$. Hence, $P(A) = 1/2{,}652 \approx 0.000377$, and $P(B) = 2 \cdot 16/2{,}652 = 8/663 \approx 0.012$.
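The counting in Examples 1.5 and 1.6 can be reproduced by brute-force enumeration. The sketch below assumes the card labeling used in those examples (aces are 1 to 4, jacks are 41 to 44); the helper names `prob` and `ace_and_jack` are illustrative only.

```python
from fractions import Fraction
from itertools import product

D = range(1, 53)            # card labels 1, ..., 52
ACES = {1, 2, 3, 4}
JACKS = {41, 42, 43, 44}


def prob(event, omega):
    """Equally likely outcomes: P(E) = |E| / |Omega|."""
    return Fraction(len(event), len(omega))


def ace_and_jack(omega):
    """Pairs showing an ace and a jack in either order."""
    return [(i, j) for (i, j) in omega
            if (i in ACES and j in JACKS) or (i in JACKS and j in ACES)]


# Example 1.5: one card from each of two decks (repeats allowed).
two_decks = list(product(D, D))
# Example 1.6: two cards from a single deck (no repeats).
one_deck = [(i, j) for (i, j) in two_decks if i != j]

p_two = prob(ace_and_jack(two_decks), two_decks)
p_one = prob(ace_and_jack(one_deck), one_deck)
print(len(two_decks), p_two, len(one_deck), p_one)
```

The event itself is the same in both cases; only the size of the sample space changes, which is why the two answers differ slightly.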
Example 1.7 (The Birthday Problem). In a group of $n$ people, what is the probability that two or more people have the same birthday?

Solution. The first step in the solution is to specify the sample space $\Omega$ and the probability $P$. Let $D := \{1, \ldots, 365\}$ denote the days of the year, and let
$$\Omega := \{(d_1, \ldots, d_n) : d_i \in D\}$$
denote the set of all possible sequences of $n$ birthdays. Then $|\Omega| = |D|^n$. Since all sequences are equally likely, we take $P(E) := |E|/|\Omega|$ for arbitrary events $E \subset \Omega$. Let $Q$ denote the set of sequences $(d_1, \ldots, d_n)$ that have at least one pair of repeated entries. For example, if $n = 9$, one of the sequences in $Q$ would be
$$(364, 17, 201, 17, 51, 171, 51, 33, 51).$$
Notice that 17 appears twice and 51 appears 3 times. The set $Q$ is very complicated. On the other hand, consider $Q^c$, which is the set of sequences $(d_1, \ldots, d_n)$ that have no repeated entries. A typical element of $Q^c$ can be constructed as follows. There are $|D|$ possible choices for $d_1$. For $d_2$ there are $|D| - 1$ possible choices because repetitions are not allowed for elements of $Q^c$. For $d_3$ there are $|D| - 2$ possible choices. Continuing in this way, we see that
$$|Q^c| = |D| \cdot (|D| - 1) \cdots (|D| - [n - 1]) = \frac{|D|!}{(|D| - n)!}.$$
It follows that $P(Q^c) = |Q^c|/|\Omega|$, and $P(Q) = 1 - P(Q^c)$. A plot of $P(Q)$ as a function of $n$ is shown in Figure 1.4. As the dashed line indicates, for $n \ge 23$, the probability of two or more people having the same birthday is greater than $1/2$.

Example 1.8. A certain memory location in an old personal computer is faulty and returns 8-bit bytes at random. What is the probability that a returned byte has seven 0s and one 1? Six 0s and two 1s?

Solution. Let $\Omega := \{(b_1, \ldots, b_8) : b_i = 0 \text{ or } 1\}$. Since all bytes are equally likely, put $P(E) := |E|/|\Omega|$ for arbitrary $E \subset \Omega$. Let
$$A_1 := \Big\{(b_1, \ldots, b_8) : \sum_{i=1}^{8} b_i = 1\Big\}$$
and
$$A_2 := \Big\{(b_1, \ldots, b_8) : \sum_{i=1}^{8} b_i = 2\Big\}.$$
Then $P(A_1) = |A_1|/|\Omega| = 8/256 = 1/32$, and $P(A_2) = |A_2|/|\Omega| = 28/256 = 7/64$.
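The two counts in Example 1.8 can be confirmed by listing all 256 bytes. This is a small sketch; `prob_weight` is a name introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 8-bit bytes (b1, ..., b8) with each bit 0 or 1.
omega = list(product((0, 1), repeat=8))


def prob_weight(k):
    """P(exactly k of the eight bits equal 1) under equally likely bytes."""
    count = sum(1 for b in omega if sum(b) == k)
    return Fraction(count, len(omega))


p_one_1 = prob_weight(1)    # seven 0s and one 1
p_two_1s = prob_weight(2)   # six 0s and two 1s
print(len(omega), p_one_1, p_two_1s)
```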
Figure 1.4. A plot of $P(Q)$ as a function of $n$. For $n \ge 23$, the probability of two or more people having the same birthday is greater than $1/2$.
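The curve in Figure 1.4 can be regenerated from the formula of Example 1.7. The sketch below evaluates $P(Q^c)$ as a running product rather than through factorials, which avoids very large integers; the function name `prob_shared_birthday` is an illustrative choice.

```python
def prob_shared_birthday(n, days=365):
    """P(Q): probability that at least two of n people share a birthday."""
    p_distinct = 1.0
    for k in range(n):
        # The (k + 1)st person must avoid the k birthdays already taken.
        p_distinct *= (days - k) / days
    return 1.0 - p_distinct


# Smallest group size for which P(Q) exceeds 1/2.
smallest_n = next(n for n in range(1, 366) if prob_shared_birthday(n) > 0.5)
print(smallest_n, prob_shared_birthday(smallest_n))
```

Tabulating `prob_shared_birthday(n)` for `n` from 0 to 55 gives the points plotted in the figure.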
In some experiments, the number of possible outcomes is countably infinite. For example, consider the tossing of a coin until the first heads appears. Here is a model for such situations. Let $\Omega$ denote the set of all positive integers, $\Omega := \{1, 2, \ldots\}$. For $\omega \in \Omega$, let $p(\omega)$ be nonnegative, and suppose that $\sum_{\omega=1}^{\infty} p(\omega) = 1$. For any subset $A \subset \Omega$, put
$$P(A) := \sum_{\omega \in A} p(\omega).$$
This construction can be used to model the coin tossing experiment by identifying $\omega = i$ with the outcome that the first heads appears on the $i$th toss. If the probability of tails on a single toss is $\alpha$ ($0 \le \alpha < 1$), it can be shown that we should take $p(\omega) = \alpha^{\omega - 1}(1 - \alpha)$ (cf. Example 2.8). To find the probability that the first head occurs before the fourth toss, we compute $P(A)$, where $A = \{1, 2, 3\}$. Then
$$P(A) = p(1) + p(2) + p(3) = (1 + \alpha + \alpha^2)(1 - \alpha).$$
If $\alpha = 1/2$, $P(A) = (1 + 1/2 + 1/4)/2 = 7/8$.

For some experiments, the number of possible outcomes is more than countably infinite. Examples include the lifetime of a lightbulb or a transistor, a noise voltage in a radio receiver, and the arrival time of a city bus. In these cases, $P$ is usually defined as an integral,
$$P(A) := \int_A f(\omega)\, d\omega, \quad A \subset \Omega,$$
for some nonnegative function $f$. Note that $f$ must also satisfy $\int_\Omega f(\omega)\, d\omega = 1$.

Example 1.9. Consider the following model for the lifetime of a lightbulb. For the sample space we take the nonnegative half line, $\Omega := [0, \infty)$, and we put
$$P(A) := \int_A f(\omega)\, d\omega,$$
where, for example, $f(\omega) := e^{-\omega}$. Then the probability that the lightbulb's lifetime is between 5 and 7 time units is
$$P([5, 7]) = \int_5^7 e^{-\omega}\, d\omega = e^{-5} - e^{-7}.$$
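As a numerical sanity check on Example 1.9, the integral can be approximated with a simple midpoint rule and compared with the closed form. This is a sketch under stated assumptions: the helper `integrate` and the step count are choices made here, not part of the text.

```python
import math


def integrate(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


def density(w):
    """The example density f(w) = exp(-w) on [0, infinity)."""
    return math.exp(-w)


numeric = integrate(density, 5.0, 7.0)
closed_form = math.exp(-5) - math.exp(-7)
print(numeric, closed_form)
```

The same helper can be applied to any interval $A = [a, b]$; for sets that are not intervals, the integral would have to be split into pieces first.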
Example 1.10. A certain bus is scheduled to pick up riders at 9:15. However, it is known that the bus arrives randomly in the 20-minute interval between 9:05 and 9:25, and departs immediately after boarding waiting passengers. Find the probability that the bus arrives at or after its scheduled pick-up time.

Solution. Let $\Omega := [5, 25]$, and put
$$P(A) := \int_A f(\omega)\, d\omega.$$
Now, the term "randomly" in the problem statement is usually taken to mean that $f(\omega) \equiv$ constant. In order that $P(\Omega) = 1$, we must choose the constant to be $1/\text{length}(\Omega) = 1/20$. We represent the bus arriving at or after 9:15 with the event $L := [15, 25]$. Then
$$P(L) = \int_{[15, 25]} \frac{1}{20}\, d\omega = \int_{15}^{25} \frac{1}{20}\, d\omega = \frac{25 - 15}{20} = \frac{1}{2}.$$
Example 1.11. A dart is thrown at random toward a circular dartboard of radius 10 cm. Assume the thrower never misses the board. Find the probability that the dart lands within 2 cm of the center.

Solution. Let $\Omega := \{(x, y) : x^2 + y^2 \le 100\}$, and for any $A \subset \Omega$, put
$$P(A) := \frac{\text{area}(A)}{\text{area}(\Omega)} = \frac{\text{area}(A)}{100\pi}.$$
We then identify the event $A := \{(x, y) : x^2 + y^2 \le 4\}$ with the dart's landing within 2 cm of the center. Hence,
$$P(A) = \frac{4\pi}{100\pi} = 0.04.$$
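The area-ratio model of Example 1.11 can also be explored by simulation. The following sketch draws points uniformly on the board by rejection sampling from the enclosing square; the function name, the number of throws, and the fixed seed are all assumptions made for illustration, and the result is only an estimate of 0.04.

```python
import random


def estimate_hit_probability(inner_radius=2.0, board_radius=10.0,
                             throws=200_000, seed=0):
    """Estimate P(dart lands within inner_radius of the center)."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    hits = 0
    accepted = 0
    while accepted < throws:
        x = rng.uniform(-board_radius, board_radius)
        y = rng.uniform(-board_radius, board_radius)
        r2 = x * x + y * y
        if r2 > board_radius ** 2:
            continue            # the thrower never misses: discard this point
        accepted += 1
        if r2 <= inner_radius ** 2:
            hits += 1
    return hits / throws


estimate = estimate_hit_probability()
print(estimate)
```

With 200,000 throws the standard deviation of the estimate is roughly 0.0004, so values near 0.04 should be expected.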
1.3. Axioms and Properties of Probability
The probability models of the preceding section suggest the following axiomatic definition of probability. Given a nonempty set $\Omega$, called the sample space, and a function $P$ defined on the subsets of $\Omega$, we say $P$ is a probability measure if the following four axioms are satisfied (see Note 1):

(i) The empty set $\varnothing$ is called the impossible event. The probability of the impossible event is zero; i.e., $P(\varnothing) = 0$.
(ii) Probabilities are nonnegative; i.e., for any event $A$, $P(A) \ge 0$.
(iii) If $A_1, A_2, \ldots$ are events that are mutually exclusive or pairwise disjoint, i.e., $A_n \cap A_m = \varnothing$ for $n \ne m$, then‡
$$P\Big(\bigcup_{n=1}^{\infty} A_n\Big) = \sum_{n=1}^{\infty} P(A_n).$$
This property is summarized by saying that the probability of the union of disjoint events is the sum of the probabilities of the individual events, or more briefly, "the probabilities of disjoint events add."
(iv) The entire sample space $\Omega$ is called the sure event or the certain event. Its probability is always one; i.e., $P(\Omega) = 1$.

We now give an interpretation of how $\Omega$ and $P$ model randomness. We view the sample space $\Omega$ as being the set of all possible "states of nature." First, Mother Nature chooses a state $\omega_0 \in \Omega$. We do not know which state has been chosen. We then conduct an experiment, and based on some physical measurement, we are able to determine that $\omega_0 \in A$ for some event $A \subset \Omega$. In some cases, $A = \{\omega_0\}$, that is, our measurement reveals exactly which state $\omega_0$ was chosen by Mother Nature. (This is the case for the events $F_i$ defined at the beginning of Section 1.2.) In other cases, the set $A$ contains $\omega_0$ as well as other points of the sample space. (This is the case for the event $E$ defined at the beginning of Section 1.2.) In either case, we do not know before making the measurement what measurement value we will get, and so we do not know what event $A$ Mother Nature's $\omega_0$ will belong to. Hence, in many applications, e.g., gambling, weather prediction, computer message traffic, etc., it is useful to compute $P(A)$ for various events to determine which ones are most probable.

In many situations, an event $A$ under consideration depends on some system parameter, say $\theta$, that we can select. For example, suppose $A_\theta$ occurs if we correctly receive a transmitted radio message; here $\theta$ could represent a voltage threshold. Hence, we can choose $\theta$ to maximize $P(A_\theta)$.
Consequences of the Axioms
Axioms (i)-(iv) that define a probability measure have several important implications as discussed below.

‡ Footnote to axiom (iii): See the paragraph Finite Disjoint Unions below and Problem 9 for further discussion regarding this axiom.
Finite Disjoint Unions. Let $N$ be a positive integer. By taking $A_n = \varnothing$ for $n > N$ in axiom (iii), we obtain the special case (still for pairwise disjoint events)
$$P\Big(\bigcup_{n=1}^{N} A_n\Big) = \sum_{n=1}^{N} P(A_n).$$
Remark. It is not possible to go backwards and use this special case to derive axiom (iii).

Example 1.12. If $A$ is an event consisting of a finite number of sample points, say $A = \{\omega_1, \ldots, \omega_N\}$, then (see Note 2) $P(A) = \sum_{n=1}^{N} P(\{\omega_n\})$. Similarly, if $A$ consists of countably many sample points, say $A = \{\omega_1, \omega_2, \ldots\}$, then directly from axiom (iii), $P(A) = \sum_{n=1}^{\infty} P(\{\omega_n\})$.

Probability of a Complement. Given an event $A$, we can always write $\Omega = A \cup A^c$, which is a finite disjoint union. Hence, $P(\Omega) = P(A) + P(A^c)$. Since $P(\Omega) = 1$, we find that
$$P(A^c) = 1 - P(A).$$

Monotonicity. If $A$ and $B$ are events, then
$$A \subset B \text{ implies } P(A) \le P(B).$$
To see this, first note that $A \subset B$ implies
$$B = A \cup (B \cap A^c).$$
This relation is depicted in Figure 1.5, in which the disk $A$ is a subset of the oval-shaped region $B$; the shaded region is $B \cap A^c$. The figure shows that $B$ is the disjoint union of the disk $A$ together with the shaded region $B \cap A^c$. Since $B = A \cup (B \cap A^c)$ is a disjoint union, and since probabilities are nonnegative,
$$P(B) = P(A) + P(B \cap A^c) \ge P(A).$$

Figure 1.5. In this diagram, the disk $A$ is a subset of the oval-shaped region $B$; the shaded region is $B \cap A^c$, and $B = A \cup (B \cap A^c)$.
Note that the special case $B = \Omega$ results in $P(A) \le 1$ for every event $A$. In other words, probabilities are always less than or equal to one.

Inclusion-Exclusion. Given any two events $A$ and $B$, we always have
$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{1.1}$$
To derive (1.1), first note that (see Figure 1.6)
$$A = (A \cap B^c) \cup (A \cap B)$$
and
$$B = (A \cap B) \cup (A^c \cap B).$$

Figure 1.6. (a) Decomposition $A = (A \cap B^c) \cup (A \cap B)$. (b) Decomposition $B = (A \cap B) \cup (A^c \cap B)$.

Hence,
$$A \cup B = \big[(A \cap B^c) \cup (A \cap B)\big] \cup \big[(A \cap B) \cup (A^c \cap B)\big].$$
The two copies of $A \cap B$ can be reduced to one using the identity $F \cup F = F$ for any set $F$. Thus,
$$A \cup B = (A \cap B^c) \cup (A \cap B) \cup (A^c \cap B).$$
A Venn diagram depicting this last decomposition is shown in Figure 1.7.

Figure 1.7. Decomposition $A \cup B = (A \cap B^c) \cup (A \cap B) \cup (A^c \cap B)$.

Taking probabilities of the preceding equations, which involve disjoint unions, we find that
$$P(A) = P(A \cap B^c) + P(A \cap B),$$
$$P(B) = P(A \cap B) + P(A^c \cap B),$$
$$P(A \cup B) = P(A \cap B^c) + P(A \cap B) + P(A^c \cap B).$$
Using the first two equations, solve for $P(A \cap B^c)$ and $P(A^c \cap B)$, respectively, and then substitute into the first and third terms on the right-hand side of the last equation. This results in
$$P(A \cup B) = \big[P(A) - P(A \cap B)\big] + P(A \cap B) + \big[P(B) - P(A \cap B)\big] = P(A) + P(B) - P(A \cap B).$$

Limit Properties. Using axioms (i)-(iv), the following formulas can be derived (see Problems 10-12). For any sequence of events $A_n$,
$$P\Big(\bigcup_{n=1}^{\infty} A_n\Big) = \lim_{N \to \infty} P\Big(\bigcup_{n=1}^{N} A_n\Big), \tag{1.2}$$
and
$$P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = \lim_{N \to \infty} P\Big(\bigcap_{n=1}^{N} A_n\Big). \tag{1.3}$$
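In particular, notice that if $A_n \subset A_{n+1}$ for all $n$, then the finite union in (1.2) reduces to $A_N$. Thus, (1.2) becomes
$$P\Big(\bigcup_{n=1}^{\infty} A_n\Big) = \lim_{N \to \infty} P(A_N), \quad \text{if } A_n \subset A_{n+1}. \tag{1.4}$$
Similarly, if $A_{n+1} \subset A_n$ for all $n$, then the finite intersection in (1.3) reduces to $A_N$. Thus, (1.3) becomes
$$P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = \lim_{N \to \infty} P(A_N), \quad \text{if } A_{n+1} \subset A_n. \tag{1.5}$$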
Formulas (1.1) and (1.2) together imply that for any sequence of events $A_n$,
$$P\Big(\bigcup_{n=1}^{\infty} A_n\Big) \le \sum_{n=1}^{\infty} P(A_n).$$
This formula is known as the union bound in engineering and as countable subadditivity in mathematics. It is derived in Problems 13 and 14 at the end of the chapter.
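Several of the consequences above can be checked on the fair-die model of Section 1.2. The events `A` and `B` below are arbitrary choices made for this sketch, not events from the text, and the union bound is only illustrated for two events.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))   # fair die


def prob(event):
    """Equally likely outcomes: P(A) = |A| / |Omega|."""
    return Fraction(len(event & OMEGA), len(OMEGA))


A = frozenset({1, 2, 3})   # hypothetical event: at most three dots
B = frozenset({2, 4, 6})   # an even number of dots

lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)     # inclusion-exclusion (1.1)
union_bound = prob(A) + prob(B)           # two-event union bound
complement_ok = prob(OMEGA - A) == 1 - prob(A)
print(lhs, rhs, union_bound, complement_ok)
```

Here the union bound is strict because `A` and `B` overlap in the outcome 2; for disjoint events it would hold with equality.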
1.4. Independence
Consider two tosses of an unfair coin whose "probability" of heads is $1/4$. Assuming that the first toss has no influence on the second, our intuition tells us that the "probability" of heads on both tosses is $(1/4)(1/4) = 1/16$. This motivates the following definition. Two events $A$ and $B$ are said to be statistically independent, or just independent, if
$$P(A \cap B) = P(A)\, P(B). \tag{1.6}$$
The notation $A \perp\!\!\!\perp B$ is sometimes used to mean $A$ and $B$ are independent. We now make some important observations about independence.

First, it is a simple exercise to show that if $A$ and $B$ are independent events, then so are $A$ and $B^c$, $A^c$ and $B$, and $A^c$ and $B^c$. For example, using the identity
$$A = (A \cap B) \cup (A \cap B^c),$$
we have
$$P(A) = P(A \cap B) + P(A \cap B^c) = P(A)\, P(B) + P(A \cap B^c),$$
and so
$$P(A \cap B^c) = P(A) - P(A)\, P(B) = P(A)[1 - P(B)] = P(A)\, P(B^c).$$
By interchanging the roles of $A$ and $A^c$ and/or $B$ and $B^c$, it follows that if any one of the four pairs is independent, then so are the other three.

Now suppose that $A$ and $B$ are any two events. If $P(B) = 0$, then we claim that $A$ and $B$ are independent. To see this, first note that the right-hand side of (1.6) is zero; as for the left-hand side, since $A \cap B \subset B$, we have $0 \le P(A \cap B) \le P(B) = 0$, i.e., the left-hand side is also zero. Hence, (1.6) always holds if $P(B) = 0$.

We now show that if $P(B) = 1$, then $A$ and $B$ are independent. This follows from the two preceding paragraphs. If $P(B) = 1$, then $P(B^c) = 0$, and we see that $A$ and $B^c$ are independent; but then $A$ and $B$ must also be independent.
Independence for More Than Two Events
Suppose that for $j = 1, 2, \ldots$, $A_j$ is an event. When we say that the $A_j$ are independent, we certainly want that for any $i \ne j$,
$$P(A_i \cap A_j) = P(A_i)\, P(A_j).$$
And for any distinct $i, j, k$, we want
$$P(A_i \cap A_j \cap A_k) = P(A_i)\, P(A_j)\, P(A_k).$$
We want analogous equations to hold for any four events, five events, and so on. In general, we want that for every finite subset $J$ containing two or more positive integers,
$$P\Big(\bigcap_{j \in J} A_j\Big) = \prod_{j \in J} P(A_j).$$
In other words, we want the probability of every intersection involving finitely many of the $A_j$ to be equal to the product of the probabilities of the individual events. If the above equation holds for all finite subsets of two or more positive integers, then we say that the $A_j$ are mutually independent, or just independent. If the above equation holds for all subsets $J$ containing exactly two positive integers but not necessarily for all finite subsets of 3 or more positive integers, we say that the $A_j$ are pairwise independent.

Example 1.13. Given three events, say $A$, $B$, and $C$, they are mutually independent if and only if the following equations all hold,
$$P(A \cap B \cap C) = P(A)\, P(B)\, P(C),$$
$$P(A \cap B) = P(A)\, P(B),$$
$$P(A \cap C) = P(A)\, P(C),$$
$$P(B \cap C) = P(B)\, P(C).$$
It is possible to construct events $A$, $B$, and $C$ such that the last three equations hold (pairwise independence), but the first one does not (see Note 3). It is also possible for the first equation to hold while the last three fail (see Note 4).
Example 1.14. A coin is tossed three times and the number of heads is noted. Find the probability that the number of heads is two, assuming the tosses are mutually independent and that on each toss the probability of heads is $\lambda$ for some fixed $0 \le \lambda \le 1$.

Solution. To solve the problem, let $H_i$ denote the event that the $i$th toss is heads (so $P(H_i) = \lambda$), and let $S_2$ denote the event that the number of heads in three tosses is 2. Then
$$S_2 = (H_1 \cap H_2 \cap H_3^c) \cup (H_1 \cap H_2^c \cap H_3) \cup (H_1^c \cap H_2 \cap H_3).$$
This is a disjoint union, and so $P(S_2)$ is equal to
$$P(H_1 \cap H_2 \cap H_3^c) + P(H_1 \cap H_2^c \cap H_3) + P(H_1^c \cap H_2 \cap H_3). \tag{1.7}$$
Next, since $H_1$, $H_2$, and $H_3$ are mutually independent, so are $(H_1 \cap H_2)$ and $H_3$. Hence, $(H_1 \cap H_2)$ and $H_3^c$ are also independent. Thus,
$$P(H_1 \cap H_2 \cap H_3^c) = P(H_1 \cap H_2)\, P(H_3^c) = P(H_1)\, P(H_2)\, P(H_3^c) = \lambda^2 (1 - \lambda).$$
Treating the last two terms in (1.7) similarly, we have $P(S_2) = 3\lambda^2(1 - \lambda)$. If the coin is fair, i.e., $\lambda = 1/2$, then $P(S_2) = 3/8$.

In working the preceding example, we did not explicitly specify the sample space $\Omega$ or the probability measure $P$. This is common practice. However, the interested reader can find one possible choice for $\Omega$ and $P$ in the Notes (see Note 5).

Example 1.15. If $A_1, A_2, \ldots$ are mutually independent, show that
$$P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = \prod_{n=1}^{\infty} P(A_n).$$

Solution. Write
$$P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = \lim_{N \to \infty} P\Big(\bigcap_{n=1}^{N} A_n\Big), \quad \text{by (1.3)},$$
$$= \lim_{N \to \infty} \prod_{n=1}^{N} P(A_n), \quad \text{by independence},$$
$$= \prod_{n=1}^{\infty} P(A_n),$$
where the last step is just the definition of the infinite product.
Example 1.16. Consider an infinite sequence of independent coin tosses. Assume that the probability of heads is $0 < p < 1$. What is the probability of seeing all heads? What is the probability of ever seeing heads?

Solution. We use the result of the preceding example as follows. Let $\Omega$ be a sample space equipped with a probability measure $P$ and events $A_n$, $n = 1, 2, \ldots$, with $P(A_n) = p$, where the $A_n$ are mutually independent (see Note 6). The event $A_n$ corresponds to, or models, the outcome that the $n$th toss results in a heads. The outcome of seeing all heads corresponds to the event $\bigcap_{n=1}^{\infty} A_n$, and its probability is
$$P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = \lim_{N \to \infty} \prod_{n=1}^{N} P(A_n) = \lim_{N \to \infty} p^N = 0.$$
The outcome of ever seeing heads corresponds to the event $A := \bigcup_{n=1}^{\infty} A_n$. Since $P(A) = 1 - P(A^c)$, it suffices to compute the probability of $A^c = \bigcap_{n=1}^{\infty} A_n^c$. Arguing exactly as above, we have
$$P\Big(\bigcap_{n=1}^{\infty} A_n^c\Big) = \lim_{N \to \infty} \prod_{n=1}^{N} P(A_n^c) = \lim_{N \to \infty} (1 - p)^N = 0.$$
Thus, $P(A) = 1 - 0 = 1$.
1.5. Conditional Probability
Conditional probability gives us a mathematically precise way of handling questions of the form, "Given that an event $B$ has occurred, what is the conditional probability that $A$ has also occurred?" If $P(B) > 0$, we put
$$P(A|B) := \frac{P(A \cap B)}{P(B)}. \tag{1.8}$$
We call $P(A|B)$ the conditional probability of $A$ given $B$. In order to justify using the word "probability" in reference to $P(\cdot|B)$, note that it is an easy exercise to show that if we put $Q(A) = P(A|B)$ for fixed $B$ with $P(B) > 0$, then $Q$ satisfies axioms (i)-(iv) in Section 1.3 that define a probability measure. Note, however, that although $Q$ is a probability measure, it has the special property that if $A \subset B^c$, then $Q(A) = 0$. In other words, if $B$ implies $A$ has not occurred, and if $B$ occurs, then $P(A|B) = 0$.

The definition of conditional probability satisfies the following intuitive property, "If $A$ and $B$ are independent, then $P(A|B)$ does not depend on $B$." In fact, if $A$ and $B$ are independent,
$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)\, P(B)}{P(B)} = P(A).$$
Observe that $P(A|B)$ as defined in (1.8) is the unique solution of
$$P(A \cap B) = P(A|B)\, P(B) \tag{1.9}$$
when $P(B) > 0$. If $P(B) = 0$, then no matter what real number we assign to $P(A|B)$, both sides of (1.9) are zero; clearly the right-hand side is zero, and, for the left-hand side, note that since $A \cap B \subset B$, we have $0 \le P(A \cap B) \le P(B) = 0$. Hence, in probability theory, when $P(B) = 0$, the value of $P(A|B)$ is permitted to be arbitrary, and it is understood that (1.9) always holds.

The Law of Total Probability and Bayes' Rule

From the identity
$$A = (A \cap B) \cup (A \cap B^c),$$
it follows that
$$P(A) = P(A \cap B) + P(A \cap B^c) = P(A|B)\, P(B) + P(A|B^c)\, P(B^c). \tag{1.10}$$
This formula is the simplest version of the law of total probability. In many applications, the quantities on the right of (1.10) are known, and it is required to find $P(B|A)$. This can be accomplished by writing
$$P(B|A) := \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)} = \frac{P(A|B)\, P(B)}{P(A)}.$$
Substituting (1.10) into the denominator yields
$$P(B|A) = \frac{P(A|B)\, P(B)}{P(A|B)\, P(B) + P(A|B^c)\, P(B^c)}.$$
This formula is the simplest version of Bayes' rule.

Example 1.17. Polychlorinated biphenyls (PCBs) are toxic chemicals. Given that you are exposed to PCBs, suppose that your conditional probability of developing cancer is 1/3. Given that you are not exposed to PCBs, suppose that your conditional probability of developing cancer is 1/4. Suppose that the probability of being exposed to PCBs is 3/4. Find the probability that you were exposed to PCBs given that you do not develop cancer.

Solution. To solve this problem, we use the notation
$$E = \{\text{exposed to PCBs}\} \quad \text{and} \quad C = \{\text{develop cancer}\}.$$
With this notation, it is easy to interpret the problem as telling us that
$$P(C|E) = 1/3, \quad P(C|E^c) = 1/4, \quad \text{and} \quad P(E) = 3/4, \tag{1.11}$$
and asking us to find $P(E|C^c)$. Before solving the problem, note that the above data implies three additional equations as follows. First, recall that $P(E^c) = 1 - P(E)$. Similarly, since conditional probability is a probability as a function of its first argument, we can write $P(C^c|E) = 1 - P(C|E)$ and $P(C^c|E^c) = 1 - P(C|E^c)$. Hence,
$$P(C^c|E) = 2/3, \quad P(C^c|E^c) = 3/4, \quad \text{and} \quad P(E^c) = 1/4. \tag{1.12}$$
To find the desired conditional probability, we write
$$P(E|C^c) = \frac{P(E \cap C^c)}{P(C^c)} = \frac{P(C^c|E)\, P(E)}{P(C^c)} = \frac{(2/3)(3/4)}{P(C^c)} = \frac{1/2}{P(C^c)}.$$
To find the denominator, we use the law of total probability to write
$$P(C^c) = P(C^c|E)\, P(E) + P(C^c|E^c)\, P(E^c) = (2/3)(3/4) + (3/4)(1/4) = 11/16.$$
Hence,
$$P(E|C^c) = \frac{1/2}{11/16} = \frac{8}{11}.$$
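The arithmetic of Example 1.17 can be retraced step by step in exact rational arithmetic. The variable names below are illustrative; the three input values are the ones given in (1.11).

```python
from fractions import Fraction

# Data from (1.11).
p_E = Fraction(3, 4)            # P(E): exposed to PCBs
p_C_given_E = Fraction(1, 3)    # P(C | E)
p_C_given_Ec = Fraction(1, 4)   # P(C | E^c)

# Derived quantities from (1.12).
p_Ec = 1 - p_E
p_Cc_given_E = 1 - p_C_given_E
p_Cc_given_Ec = 1 - p_C_given_Ec

# Law of total probability (1.10) applied to C^c.
p_Cc = p_Cc_given_E * p_E + p_Cc_given_Ec * p_Ec

# Bayes' rule.
p_E_given_Cc = p_Cc_given_E * p_E / p_Cc
print(p_Cc, p_E_given_Cc)
```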
In working the preceding example, we did not explicitly specify the sample space $\Omega$ or the probability measure $P$. As mentioned earlier, this is common practice. However, one possible choice for $\Omega$ and $P$ is given in the Notes (see Note 7).

We now generalize the law of total probability. Let $B_n$ be a sequence of pairwise disjoint events such that $\sum_n P(B_n) = 1$. Then for any event $A$,
$$P(A) = \sum_n P(A|B_n)\, P(B_n).$$
To derive this result, put $B := \bigcup_n B_n$, and observe that
$$P(B) = \sum_n P(B_n) = 1.$$
It follows that $P(B^c) = 0$. Next, for any event $A$, $A \cap B^c \subset B^c$, and so
$$0 \le P(A \cap B^c) \le P(B^c) = 0.$$
Hence, $P(A \cap B^c) = 0$. Writing
$$A = (A \cap B^c) \cup (A \cap B),$$
it follows that
$$P(A) = P(A \cap B^c) + P(A \cap B) = P(A \cap B) = P\Big(A \cap \Big[\bigcup_n B_n\Big]\Big) = P\Big(\bigcup_n [A \cap B_n]\Big) = \sum_n P(A \cap B_n). \tag{1.13}$$
Since $P(A \cap B_n) = P(A|B_n)\, P(B_n)$ by (1.9), the result follows.

To compute the posterior probabilities $P(B_k|A)$, write
$$P(B_k|A) = \frac{P(A \cap B_k)}{P(A)} = \frac{P(A|B_k)\, P(B_k)}{P(A)}.$$
Applying the law of total probability to $P(A)$ in the denominator yields the general form of Bayes' rule,
$$P(B_k|A) = \frac{P(A|B_k)\, P(B_k)}{\sum_n P(A|B_n)\, P(B_n)}.$$
1.6. Notes
Notes §1.3: Axioms and Properties of Probability

Note 1. When the sample space $\Omega$ is finite or countably infinite, $P(A)$ is usually defined for all subsets of $\Omega$ by taking
$$P(A) := \sum_{\omega \in A} p(\omega)$$
for some nonnegative function $p$ that sums to one; i.e., $p(\omega) \ge 0$ and $\sum_{\omega \in \Omega} p(\omega) = 1$. (It is easy to check that if $P$ is defined in this way, then it satisfies the axioms of a probability measure.) However, for larger sample spaces, such as when $\Omega$ is an interval of the real line, e.g., Example 1.10, it is not possible to define $P(A)$ for all subsets and still have $P$ satisfy all four axioms. (A proof of this fact can be found in advanced texts, e.g., [4, p. 45].) The way around this difficulty is to define $P(A)$ only for some subsets of $\Omega$, but not all subsets of $\Omega$. It is indeed fortunate that this can be done in such a way that $P(A)$ is defined for all subsets of interest that occur in practice. A set $A$ for which $P(A)$ is defined is called an event, and the collection of all events is denoted by $\mathcal{A}$. The triple $(\Omega, \mathcal{A}, P)$ is called a probability space.

Given that $P(A)$ is defined only for $A \in \mathcal{A}$, in order for the probability axioms to make sense, $\mathcal{A}$ must have certain properties. First, axiom (i) requires that $\varnothing \in \mathcal{A}$. Second, axiom (iv) requires that $\Omega \in \mathcal{A}$. Third, axiom (iii) requires that if $A_1, A_2, \ldots$ are mutually exclusive events, then their union, $\bigcup_{n=1}^{\infty} A_n$, is also an event. Additionally, we need in Section 1.4 that if $A_1, A_2, \ldots$ are arbitrary events, then so is their intersection, $\bigcap_{n=1}^{\infty} A_n$. We show below that these four requirements are satisfied if we assume only that $\mathcal{A}$ has the following three properties:
(i) The empty set $\varnothing$ belongs to $\mathcal{A}$, i.e., $\varnothing \in \mathcal{A}$.
(ii) If $A \in \mathcal{A}$, then so does its complement, $A^c$, i.e., $A \in \mathcal{A}$ implies $A^c \in \mathcal{A}$.
(iii) If $A_1, A_2, \ldots$ belong to $\mathcal{A}$, then so does their union, $\bigcup_{n=1}^{\infty} A_n$.
A collection of subsets with these properties is called a $\sigma$-field or a $\sigma$-algebra. Of the four requirements above, the first and third obviously hold for any $\sigma$-field $\mathcal{A}$. The second requirement holds because (i) and (ii) imply that $\varnothing^c = \Omega$ must be in $\mathcal{A}$. The fourth requirement is a consequence of (iii) and DeMorgan's law.

Note 2. In light of the preceding note, we see that to guarantee that $P(\{\omega_n\})$ is defined in Example 1.12, it is necessary to assume that the singleton sets $\{\omega_n\}$ are events, i.e., $\{\omega_n\} \in \mathcal{A}$.
Notes §1.4: Independence

Note 3. Here is an example of three events that are pairwise independent, but not mutually independent. Let
$$\Omega := \{1, 2, 3, 4, 5, 6, 7\},$$
and put $P(\{\omega\}) := 1/8$ for $\omega \ne 7$, and $P(\{7\}) := 1/4$. Take $A := \{1, 2, 7\}$, $B := \{3, 4, 7\}$, and $C := \{5, 6, 7\}$. Then $P(A) = P(B) = P(C) = 1/2$, and $P(A \cap B) = P(A \cap C) = P(B \cap C) = P(\{7\}) = 1/4$. Hence, $A$ and $B$, $A$ and $C$, and $B$ and $C$ are pairwise independent. However, since $P(A \cap B \cap C) = P(\{7\}) = 1/4$, and since $P(A)\, P(B)\, P(C) = 1/8$, $A$, $B$, and $C$ are not mutually independent.

Note 4. Here is an example of three events for which $P(A \cap B \cap C) = P(A)\, P(B)\, P(C)$ but no pair is independent. Let $\Omega := \{1, 2, 3, 4\}$. Put $P(\{1\}) = P(\{2\}) = P(\{3\}) = p$ and $P(\{4\}) = q$, where $3p + q = 1$ and $0 \le p, q \le 1$. Put $A := \{1, 4\}$, $B := \{2, 4\}$, and $C := \{3, 4\}$. Then the intersection of any pair is $\{4\}$, as is the intersection of all three sets. Also, $P(\{4\}) = q$. Since $P(A) = P(B) = P(C) = p + q$, we require $(p + q)^3 = q$ and $(p + q)^2 \ne q$. Solving $3p + q = 1$ and $(p + q)^3 = q$ for $q$ reduces to solving $8q^3 + 12q^2 - 21q + 1 = 0$. Now, $q = 1$ is obviously a root, but this results in $p = 0$, which implies mutual independence. However, since $q = 1$ is a root, it is easy to verify that
$$8q^3 + 12q^2 - 21q + 1 = (q - 1)(8q^2 + 20q - 1).$$
By the quadratic formula, the desired root is $q = \big({-5} + 3\sqrt{3}\big)/4$. It then follows that $p = \big(3 - \sqrt{3}\big)/4$ and that $p + q = \big({-1} + \sqrt{3}\big)/2$. Now just observe that $(p + q)^2 \ne q$.

Note 5. Here is a choice for $\Omega$ and $P$ for Example 1.14. Let
$$\Omega := \{(i, j, k) : i, j, k = 0 \text{ or } 1\},$$
with 1 corresponding to heads and 0 to tails. Now put
$$H_1 := \{(i, j, k) : i = 1\}, \quad H_2 := \{(i, j, k) : j = 1\}, \quad H_3 := \{(i, j, k) : k = 1\},$$
and observe that
$$H_1 = \{(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)\},$$
$$H_2 = \{(0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1)\},$$
$$H_3 = \{(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)\}.$$
Next, let $P(\{(i, j, k)\}) := \lambda^{i+j+k}(1 - \lambda)^{3-(i+j+k)}$. Since
$$H_3^c = \{(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)\},$$
$H_1 \cap H_2 \cap H_3^c = \{(1, 1, 0)\}$. Similarly, $H_1 \cap H_2^c \cap H_3 = \{(1, 0, 1)\}$, and $H_1^c \cap H_2 \cap H_3 = \{(0, 1, 1)\}$. Hence,
$$S_2 = \{(1, 1, 0), (1, 0, 1), (0, 1, 1)\} = \{(1, 1, 0)\} \cup \{(1, 0, 1)\} \cup \{(0, 1, 1)\},$$
and thus, $P(S_2) = 3\lambda^2(1 - \lambda)$.
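The two counterexamples in Notes 3 and 4 can be verified numerically. The sketch below uses exact fractions for Note 3 and floating point (with a tolerance) for Note 4, since the latter involves $\sqrt{3}$; the helper names `pair_checks` and `triple_factorizes` are illustrative, not from the text.

```python
import math
import operator
from fractions import Fraction
from itertools import combinations


def prob(event, pmf):
    """P(A) as the sum of the point masses of the outcomes in A."""
    return sum(pmf[w] for w in event)


def pair_checks(events, pmf, same):
    """For each pair, report whether P(X and Y) = P(X) P(Y)."""
    return [same(prob(X & Y, pmf), prob(X, pmf) * prob(Y, pmf))
            for X, Y in combinations(events, 2)]


def triple_factorizes(events, pmf, same):
    """Report whether P(A and B and C) = P(A) P(B) P(C)."""
    A, B, C = events
    return same(prob(A & B & C, pmf),
                prob(A, pmf) * prob(B, pmf) * prob(C, pmf))


def approx(a, b):
    """Floating-point comparison with a small tolerance."""
    return math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)


# Note 3: pairwise independent but not mutually independent.
pmf3 = {w: Fraction(1, 8) for w in range(1, 7)}
pmf3[7] = Fraction(1, 4)
events3 = [frozenset({1, 2, 7}), frozenset({3, 4, 7}), frozenset({5, 6, 7})]
note3_pairs = pair_checks(events3, pmf3, operator.eq)
note3_triple = triple_factorizes(events3, pmf3, operator.eq)

# Note 4: the triple product holds but no pair is independent.
q = (-5 + 3 * math.sqrt(3)) / 4
p = (3 - math.sqrt(3)) / 4
pmf4 = {1: p, 2: p, 3: p, 4: q}
events4 = [frozenset({1, 4}), frozenset({2, 4}), frozenset({3, 4})]
note4_pairs = pair_checks(events4, pmf4, approx)
note4_triple = triple_factorizes(events4, pmf4, approx)

print(note3_pairs, note3_triple, note4_pairs, note4_triple)
```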
Note 6. To show the existence of a sample space and probability measure with such independent events is beyond the scope of this book. Such constructions can be found in more advanced texts such as [4, Section 36].

Notes §1.5: Conditional Probability

Note 7. Here is a choice for $\Omega$ and $P$ for Example 1.17. Let
$$\Omega := \{(e, c) : e, c = 0 \text{ or } 1\},$$
where $e = 1$ corresponds to exposure to PCBs, and $c = 1$ corresponds to developing cancer. We then take
$$E := \{(e, c) : e = 1\} = \{(1, 0), (1, 1)\},$$
and
$$C := \{(e, c) : c = 1\} = \{(0, 1), (1, 1)\}.$$
It follows that
$$E^c = \{(0, 1), (0, 0)\} \quad \text{and} \quad C^c = \{(1, 0), (0, 0)\}.$$
Hence, $E \cap C = \{(1, 1)\}$, $E \cap C^c = \{(1, 0)\}$, $E^c \cap C = \{(0, 1)\}$, and $E^c \cap C^c = \{(0, 0)\}$. In order to specify a suitable probability measure on $\Omega$, we work backwards. First, if a measure $P$ on $\Omega$ exists such that (1.11) and (1.12) hold, then
$$P(\{(1, 1)\}) = P(E \cap C) = P(C|E)\, P(E) = 1/4,$$
$$P(\{(0, 1)\}) = P(E^c \cap C) = P(C|E^c)\, P(E^c) = 1/16,$$
$$P(\{(1, 0)\}) = P(E \cap C^c) = P(C^c|E)\, P(E) = 1/2,$$
$$P(\{(0, 0)\}) = P(E^c \cap C^c) = P(C^c|E^c)\, P(E^c) = 3/16.$$
This suggests that we define $P$ by
$$P(A) := \sum_{\omega \in A} p(\omega),$$
where $p(\omega) = p(e, c)$ is given by $p(1, 1) := 1/4$, $p(0, 1) := 1/16$, $p(1, 0) := 1/2$, and $p(0, 0) := 3/16$. Starting from this definition of $P$, it is not hard to check that (1.11) and (1.12) hold.
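The check suggested at the end of Note 7 can be carried out mechanically. The sketch below builds the four-point sample space, defines conditional probability through (1.8), and recovers the data of (1.11) together with the answer of Example 1.17; the function names `prob` and `cond` are illustrative.

```python
from fractions import Fraction

# Point masses p(e, c) from Note 7.
pmf = {
    (1, 1): Fraction(1, 4),
    (0, 1): Fraction(1, 16),
    (1, 0): Fraction(1, 2),
    (0, 0): Fraction(3, 16),
}

OMEGA = frozenset(pmf)
E = frozenset(w for w in OMEGA if w[0] == 1)   # exposed to PCBs
C = frozenset(w for w in OMEGA if w[1] == 1)   # develop cancer
Ec, Cc = OMEGA - E, OMEGA - C


def prob(event):
    """P(A) as the sum of p(w) over w in A."""
    return sum((pmf[w] for w in event), Fraction(0))


def cond(a, b):
    """Conditional probability P(A | B) from (1.8); requires P(B) > 0."""
    return prob(a & b) / prob(b)


print(prob(E), cond(C, E), cond(C, Ec), cond(E, Cc))
```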
1.7. Problems
Problems §1.1: Review of Set Notation

1. For real numbers $-\infty < a < b < \infty$, we use the following notation.
$$(a, b] := \{x : a < x \le b\}$$
$$(a, b) := \{x : a < x < b\}$$
$$[a, b) := \{x : a \le x < b\}$$
$$[a, b] := \{x : a \le x \le b\}.$$
We also use
$$(-\infty, b] := \{x : x \le b\}$$
$$(-\infty, b) := \{x : x < b\}$$
$$(a, \infty) := \{x : x > a\}$$
$$[a, \infty) := \{x : x \ge a\}.$$
For example, with this notation, $(0, 1]^c = (-\infty, 0] \cup (1, \infty)$ and $(0, 2] \cup [1, 3) = (0, 3)$. Now analyze
(a) $[2, 3]^c$,
(b) $(1, 3) \cup (2, 4)$,
(c) $(1, 3) \cap [2, 4)$,
(d) $(3, 6] \setminus (5, 7)$.

2. Sketch the following subsets of the $x$-$y$ plane.
(a) $B_z := \{(x, y) : x + y \le z\}$ for $z = 0, -1, +1$.
(b) $C_z := \{(x, y) : x > 0,\ y > 0, \text{ and } xy \le z\}$ for $z = 1$.
(c) $H_z := \{(x, y) : x \le z\}$ for $z = 3$.
(d) $J_z := \{(x, y) : y \le z\}$ for $z = 3$.
(e) $H_z \cap J_z$ for $z = 3$.
(f) $H_z \cup J_z$ for $z = 3$.
(g) $M_z := \{(x, y) : \max(x, y) \le z\}$ for $z = 3$, where $\max(x, y)$ is the larger of $x$ and $y$. For example, $\max(7, 9) = 9$. Of course, $\max(9, 7) = 9$ too.
(h) $N_z := \{(x, y) : \min(x, y) \le z\}$ for $z = 3$, where $\min(x, y)$ is the smaller of $x$ and $y$. For example, $\min(7, 9) = 7 = \min(9, 7)$.
(i) $M_2 \cap N_3$.
(j) $M_4 \cap N_3$.

3. Let $\Omega$ denote the set of real numbers, $\Omega = (-\infty, \infty)$.
(a) Use the distributive law to simplify $[1, 4] \cap \big([0, 2] \cup [3, 5]\big)$.
(b) Use DeMorgan's law to simplify $\big([0, 1] \cup [2, 3]\big)^c$.
(c) Simplify $\bigcap_{n=1}^{\infty} (-1/n, 1/n)$.
(d) Simplify $\bigcap_{n=1}^{\infty} [0, 3 + 1/(2n))$.
(e) Simplify $\bigcup_{n=1}^{\infty} [5, 7 - (3n)^{-1}]$.
(f) Simplify $\bigcup_{n=1}^{\infty} [0, n]$.
Problems §1.2: Probability Models

4. A letter of the alphabet (a–z) is generated at random. Specify a sample space $\Omega$ and a probability measure $\mathsf{P}$. Compute the probability that a vowel (a, e, i, o, u) is generated.

5. A collection of plastic letters, a–z, is mixed in a jar. Two letters are drawn at random, one after the other. What is the probability of drawing a vowel (a, e, i, o, u) and a consonant in either order? Two vowels in any order? Specify your sample space $\Omega$ and probability $\mathsf{P}$.

6. A new baby wakes up exactly once every night. The time at which the baby wakes up occurs at random between 9 pm and 7 am. If the parents go to sleep at 11 pm, what is the probability that the parents are not awakened by the baby before they would normally get up at 7 am? Specify your sample space $\Omega$ and probability $\mathsf{P}$.

7. For any real or complex number $z \ne 1$ and any positive integer $N$, derive the geometric series formula
$$\sum_{k=0}^{N-1} z^k = \frac{1 - z^N}{1 - z}, \qquad z \ne 1.$$
Hint: Let $S_N := 1 + z + \cdots + z^{N-1}$, and show that $S_N - zS_N = 1 - z^N$. Then solve for $S_N$.
Remark. If $|z| < 1$, $|z|^N \to 0$ as $N \to \infty$. Hence,
$$\sum_{k=0}^{\infty} z^k = \frac{1}{1 - z}, \qquad \text{for } |z| < 1.$$

8. Let $\Omega := \{1, \ldots, 6\}$. If $p(\omega) = 2\,p(\omega - 1)$ for $\omega = 2, \ldots, 6$, and if $\sum_{\omega=1}^{6} p(\omega) = 1$, show that $p(\omega) = 2^{\omega-1}/63$. Hint: Use Problem 7.
Problems §1.3: Axioms and Properties of Probability

9. Suppose that instead of axiom (iii) of Section 1.3, we assume only that for any two disjoint events $A$ and $B$, $\mathsf{P}(A \cup B) = \mathsf{P}(A) + \mathsf{P}(B)$. Use this assumption and induction$^{\S}$ on $N$ to show that for any finite sequence of pairwise disjoint events $A_1, \ldots, A_N$,
$$\mathsf{P}\Bigl(\bigcup_{n=1}^{N} A_n\Bigr) = \sum_{n=1}^{N} \mathsf{P}(A_n).$$
Using this result for finite $N$, it is not possible to derive axiom (iii), which is the assumption needed to derive the limit results of Section 1.3.

$\star$ 10. The purpose of this problem is to show that any countable union can be written as a union of pairwise disjoint sets. Given any sequence of sets $F_n$, define a new sequence by $A_1 := F_1$, and
$$A_n := F_n \cap F_{n-1}^c \cap \cdots \cap F_1^c, \qquad n \ge 2.$$
Note that the $A_n$ are pairwise disjoint. For finite $N \ge 1$, show that
$$\bigcup_{n=1}^{N} F_n = \bigcup_{n=1}^{N} A_n.$$
Also show that
$$\bigcup_{n=1}^{\infty} F_n = \bigcup_{n=1}^{\infty} A_n.$$

$\star$ 11. Use the preceding problem to show that for any sequence of events $F_n$,
$$\mathsf{P}\Bigl(\bigcup_{n=1}^{\infty} F_n\Bigr) = \lim_{N \to \infty} \mathsf{P}\Bigl(\bigcup_{n=1}^{N} F_n\Bigr).$$

$\star$ 12. Use the preceding problem to show that for any sequence of events $G_n$,
$$\mathsf{P}\Bigl(\bigcap_{n=1}^{\infty} G_n\Bigr) = \lim_{N \to \infty} \mathsf{P}\Bigl(\bigcap_{n=1}^{N} G_n\Bigr).$$

13. The Finite Union Bound. Show that for any finite sequence of events $F_1, \ldots, F_N$,
$$\mathsf{P}\Bigl(\bigcup_{n=1}^{N} F_n\Bigr) \le \sum_{n=1}^{N} \mathsf{P}(F_n).$$
Hint: Use the inclusion-exclusion formula (1.1) and induction on $N$. See the last footnote for information on induction.

$^{\S}$ In this case, using induction on $N$ means that you first verify the desired result for $N = 2$. Second, you assume the result is true for some arbitrary $N \ge 2$ and then prove the desired result is true for $N + 1$.
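$\star$ 14. The Infinite Union Bound. Show that for any infinite sequence of events $F_n$,
$$\mathsf{P}\Bigl(\bigcup_{n=1}^{\infty} F_n\Bigr) \le \sum_{n=1}^{\infty} \mathsf{P}(F_n).$$
Hint: Combine Problems 11 and 13.

$\star$ 15. Borel–Cantelli Lemma. Show that if $B_n$ is a sequence of events for which
$$\sum_{n=1}^{\infty} \mathsf{P}(B_n) < \infty, \tag{1.14}$$
then
$$\mathsf{P}\Bigl(\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} B_k\Bigr) = 0.$$
Hint: Let $G := \bigcap_{n=1}^{\infty} G_n$, where $G_n := \bigcup_{k=n}^{\infty} B_k$. Now use Problem 12, the union bound of the preceding problem, and the fact that (1.14) implies
$$\lim_{N \to \infty} \sum_{n=N}^{\infty} \mathsf{P}(B_n) = 0.$$

Problems §1.4: Independence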
16. A certain binary communication system has a bit-error rate of 0.1; i.e., in transmitting a single bit, the probability of receiving the bit in error is 0.1. To transmit messages, a three-bit repetition code is used. In other words, to send the message 1, 111 is transmitted, and to send the message 0, 000 is transmitted. At the receiver, if two or more 1s are received, the decoder decides that message 1 was sent; otherwise, i.e., if two or more zeros are received, it decides that message 0 was sent. Assuming bit errors occur independently, find the probability that the decoder puts out the wrong message. Answer: 0.028.

17. You and your neighbor attempt to use your cordless phones at the same time. Your phones independently select one of ten channels at random to connect to the base unit. What is the probability that both phones pick the same channel?

18. A new car is equipped with dual airbags. Suppose that they fail independently with probability $p$. What is the probability that at least one airbag functions properly?

19. A discrete-time FIR filter is to be found satisfying certain constraints, such as energy, phase, sign changes of the coefficients, etc. FIR filters can be thought of as vectors in some finite-dimensional space. The energy constraint implies that all suitable vectors lie in some hypercube; i.e., a square in $\mathbb{R}^2$, a cube in $\mathbb{R}^3$, etc. It is easy to generate random vectors uniformly in a hypercube. This suggests the following Monte–Carlo procedure for finding a filter that satisfies all the desired properties. Suppose we generate vectors independently in the hypercube until we find one that has all the desired properties. What is the probability of ever finding such a filter?

20. A dart is thrown at random toward a circular dartboard of radius 10 cm. Assume the thrower never misses the board. Let $A_n$ denote the event that the dart lands within 2 cm of the center on the $n$th throw. Suppose that the $A_n$ are mutually independent and that $\mathsf{P}(A_n) = p$ for some $0 < p < 1$. Find the probability that the dart never lands within 2 cm of the center.

21. Each time you play the lottery, your probability of winning is $p$. You play the lottery $n$ times, and plays are independent. How large should $n$ be to make the probability of winning at least once more than $1/2$? Answer: For $p = 1/10^6$, $n \ge 693147$.

22. Consider the sample space $\Omega = [0,1]$ equipped with the probability measure
$$\mathsf{P}(A) := \int_A 1 \, d\omega, \qquad A \subset \Omega.$$
For $A = [0,1/2]$, $B = [0,1/4] \cup [1/2,3/4]$, and $C = [0,1/8] \cup [1/4,3/8] \cup [1/2,5/8] \cup [3/4,7/8]$, determine whether or not $A$, $B$, and $C$ are mutually independent.
Problems §1.5: Conditional Probability

23. The university buys workstations from two different suppliers, Moon Microsystems (MM) and Hyped–Technology (HT). On delivery, 10% of MM's workstations are defective, while 20% of HT's workstations are defective. The university buys 140 MM workstations and 60 HT workstations.
(a) What is the probability that a workstation is from MM? From HT?
(b) What is the probability of a defective workstation? Answer: 0.13.
(c) Given that a workstation is defective, what is the probability that it came from Moon Microsystems? Answer: 7/13.

24. The probability that a cell in a wireless system is overloaded is 1/3. Given that it is overloaded, the probability of a blocked call is 0.3. Given that it is not overloaded, the probability of a blocked call is 0.1. Find the conditional probability that the system is overloaded given that your call is blocked. Answer: 0.6.

25. A certain binary communication system operates as follows. Given that a 0 is transmitted, the conditional probability that a 1 is received is $\varepsilon$. Given that a 1 is transmitted, the conditional probability that a 0 is received is $\delta$. Assume that the probability of transmitting a 0 is the same as the probability of transmitting a 1. Given that a 1 is received, find the conditional probability that a 1 was transmitted. Hint: Use the notation
$$T_i := \{i \text{ is transmitted}\}, \quad i = 0, 1,$$
and
$$R_j := \{j \text{ is received}\}, \quad j = 0, 1.$$

26. Professor Random has taught probability for many years. She has found that 80% of students who do the homework pass the exam, while 10% of students who don't do the homework pass the exam. If 60% of the students do the homework, what percent of students pass the exam? Of students who pass the exam, what percent did the homework? Answer: 12/13.

27. The Bingy 007 jet aircraft's autopilot has conditional probability 1/3 of failure given that it employs a faulty Hexium 4 microprocessor chip. The autopilot has conditional probability 1/10 of failure given that it employs a nonfaulty chip. According to the chip manufacturer, unre l, the probability of a customer's receiving a faulty Hexium 4 chip is 1/4. Given that an autopilot failure has occurred, find the conditional probability that a faulty chip was used. Use the following notation:
$$A_F = \{\text{autopilot fails}\}, \qquad C_F = \{\text{chip is faulty}\}.$$
Answer: 10/19.
$\star$ 28. You have five computer chips, two of which are known to be defective.
(a) You test one of the chips; what is the probability that it is defective?
(b) Your friend tests two chips at random and reports that one is defective and one is not. Given this information, you test one of the three remaining chips at random; what is the conditional probability that the chip you test is defective?
(c) Consider the following modification of the preceding scenario. Your friend takes away two chips at random for testing; before your friend tells you the results, you test one of the three remaining chips at random; given this (lack of) information, what is the conditional probability that the chip you test is defective? Since you have not yet learned the results of your friend's tests, intuition suggests that your conditional probability should be the same as your answer to part (a). Is your intuition correct?

29. Given events $A$, $B$, and $C$, show that
$$\mathsf{P}(A \cap B \cap C) = \mathsf{P}(A|B \cap C)\,\mathsf{P}(B|C)\,\mathsf{P}(C).$$
Also show that
$$\mathsf{P}(A \cap C|B) = \mathsf{P}(A|B)\,\mathsf{P}(C|B)$$
if and only if
$$\mathsf{P}(A|B \cap C) = \mathsf{P}(A|B).$$
In this case, $A$ and $C$ are conditionally independent given $B$.
CHAPTER 2
Discrete Random Variables

In most scientific and technological applications, measurements and observations are expressed as numerical quantities, e.g., thermal noise voltages, number of web site hits, round-trip times for Internet packets, engine temperatures, wind speeds, loads on an electric power grid, etc. Traditionally, numerical measurements or observations that have uncertain variability each time they are repeated are called random variables. However, in order to exploit the axioms and properties of probability that we studied in Chapter 1, we need to define random variables in terms of an underlying sample space $\Omega$. Fortunately, once some basic operations on random variables are derived, we can think of random variables in the traditional manner.

The purpose of this chapter is to show how to use random variables to model events and to express probabilities and averages. Section 2.1 introduces the notions of random variable and of independence of random variables. Several examples illustrate how to express events and probabilities using multiple random variables. For discrete random variables, some of the more common probability mass functions are introduced as they arise naturally in the examples. (A summary of the more common ones can be found on the inside of the front cover.) Expectation for discrete random variables is defined in Section 2.2, and then moments and probability generating functions are introduced. Probability generating functions are an important tool for computing both probabilities and expectations. In particular, the binomial random variable arises in a natural way using generating functions, avoiding the usual combinatorial development. The weak law of large numbers is derived in Section 2.3 using Chebyshev's inequality. Conditional probability and conditional expectation are developed in Sections 2.4 and 2.5, respectively. Several examples illustrate how probabilities and expectations of interest can be easily computed using the law of total probability.
2.1. Probabilities Involving Random Variables
A real-valued function $X(\omega)$ defined for points $\omega$ in a sample space $\Omega$ is called a random variable.

Example 2.1. Let us construct a model for counting the number of heads in a sequence of three coin tosses. For the underlying sample space, we take
$$\Omega := \{\text{TTT, TTH, THT, HTT, THH, HTH, HHT, HHH}\},$$
which contains the eight possible sequences of tosses. However, since we are only interested in the number of heads in each sequence, we define the random variable (function) $X$ by
$$X(\omega) := \begin{cases}
0, & \omega = \text{TTT},\\
1, & \omega \in \{\text{TTH, THT, HTT}\},\\
2, & \omega \in \{\text{THH, HTH, HHT}\},\\
3, & \omega = \text{HHH}.
\end{cases}$$
This is illustrated in Figure 2.1.

Figure 2.1. Illustration of a random variable $X$ that counts the number of heads in a sequence of three coin tosses. (Each outcome in $\Omega$ is joined by a line to its value 0, 1, 2, or 3 in $\mathbb{R}$.)
With the setup of the previous example, let us assume for specificity that the sequences are equally likely. Now let us find the probability that the number of heads $X$ is less than 2. In other words, we want to find $\mathsf{P}(X < 2)$. But what does this mean? Let us agree that $\mathsf{P}(X < 2)$ is shorthand for
$$\mathsf{P}(\{\omega \in \Omega : X(\omega) < 2\}).$$
Then the first step is to identify the event $\{\omega \in \Omega : X(\omega) < 2\}$. In Figure 2.1, the only lines pointing to numbers less than 2 are the lines pointing to 0 and 1. Tracing these lines backwards from $\mathbb{R}$ into $\Omega$, we see that
$$\{\omega \in \Omega : X(\omega) < 2\} = \{\text{TTT, TTH, THT, HTT}\}.$$
Since the sequences are equally likely,
$$\mathsf{P}(\{\text{TTT, TTH, THT, HTT}\}) = \frac{|\{\text{TTT, TTH, THT, HTT}\}|}{|\Omega|} = \frac{4}{8} = \frac{1}{2}.$$
The shorthand introduced above is standard in probability theory. More generally, if $B \subset \mathbb{R}$, we use the shorthand
$$\{X \in B\} := \{\omega \in \Omega : X(\omega) \in B\}$$
and$^1$
$$\mathsf{P}(X \in B) := \mathsf{P}(\{X \in B\}) = \mathsf{P}(\{\omega \in \Omega : X(\omega) \in B\}).$$
If $B$ is an interval such as $B = (a,b]$,
$$\{X \in (a,b]\} := \{a < X \le b\} := \{\omega \in \Omega : a < X(\omega) \le b\}$$
and
$$\mathsf{P}(a < X \le b) = \mathsf{P}(\{\omega \in \Omega : a < X(\omega) \le b\}).$$
Analogous notation applies to intervals such as $[a,b]$, $[a,b)$, $(a,b)$, $(-\infty,b)$, $(-\infty,b]$, $(a,\infty)$, and $[a,\infty)$.

Example 2.2. Show that
$$\mathsf{P}(a < X \le b) = \mathsf{P}(X \le b) - \mathsf{P}(X \le a).$$
Solution. It is convenient to first rewrite the desired equation as
$$\mathsf{P}(X \le b) = \mathsf{P}(X \le a) + \mathsf{P}(a < X \le b).$$
Now observe that
$$\{\omega \in \Omega : X(\omega) \le b\} = \{\omega \in \Omega : X(\omega) \le a\} \cup \{\omega \in \Omega : a < X(\omega) \le b\}.$$
Since we cannot have an $\omega$ with $X(\omega) \le a$ and $X(\omega) > a$ at the same time, the events in the union are disjoint. Taking probabilities of both sides yields the desired result.
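The idea that $\mathsf{P}(X \in B)$ is shorthand for the probability of a set of outcomes can be sketched in a few lines of Python. This is only an illustration for the three-coin-toss model of Example 2.1 with equally likely sequences; the helper names `event` and `prob` are assumptions introduced here, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space of Example 2.1: all sequences of three tosses.
omega = ["".join(t) for t in product("HT", repeat=3)]

def X(w):
    """The random variable: number of heads in the outcome w."""
    return w.count("H")

def event(condition):
    """The event {w in Omega : condition(X(w))}, i.e. a subset of Omega."""
    return {w for w in omega if condition(X(w))}

def prob(ev):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(ev), len(omega))

p_lt_2 = prob(event(lambda x: x < 2))          # P(X < 2)

# Numerical check of Example 2.2 for one (arbitrarily chosen) pair a < b.
a, b = 0, 2
lhs = prob(event(lambda x: a < x <= b))        # P(a < X <= b)
rhs = prob(event(lambda x: x <= b)) - prob(event(lambda x: x <= a))
print(p_lt_2, lhs, rhs)
```

Here `event` makes the "trace the lines backwards from $\mathbb{R}$ into $\Omega$" step explicit: the probability is always computed on a subset of the sample space, never directly on values of $X$.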
If $B$ is a singleton set, say $B = \{x_0\}$, we write $\{X = x_0\}$ instead of $\{X \in \{x_0\}\}$.

Example 2.3. Show that
$$\mathsf{P}(X = 0 \text{ or } X = 1) = \mathsf{P}(X = 0) + \mathsf{P}(X = 1).$$
Solution. First note that the word "or" means "union." Hence, we are trying to find the probability of $\{X = 0\} \cup \{X = 1\}$. If we expand our shorthand, this union becomes
$$\{\omega \in \Omega : X(\omega) = 0\} \cup \{\omega \in \Omega : X(\omega) = 1\}.$$
Since we cannot have an $\omega$ with $X(\omega) = 0$ and $X(\omega) = 1$ at the same time, these events are disjoint. Hence, their probabilities add, and we obtain
$$\mathsf{P}(\{X = 0\} \cup \{X = 1\}) = \mathsf{P}(X = 0) + \mathsf{P}(X = 1).$$
The following notation is also useful in computing probabilities. For any subset $B$, we define the indicator function of $B$ by
$$I_B(x) := \begin{cases} 1, & x \in B,\\ 0, & x \notin B. \end{cases}$$
Readers familiar with the unit-step function,
$$u(x) := \begin{cases} 1, & x \ge 0,\\ 0, & x < 0, \end{cases}$$
will note that $u(x) = I_{[0,\infty)}(x)$. However, the indicator notation is often more compact. For example, if $a < b$, it is easier to write $I_{[a,b)}(x)$ than $u(x - a) - u(x - b)$. How would you write $I_{(a,b]}(x)$ in terms of the unit step?
Discrete Random Variables

We say $X$ is a discrete random variable if there exist distinct real numbers $x_i$ such that
$$\sum_i \mathsf{P}(X = x_i) = 1.$$
For discrete random variables,
$$\mathsf{P}(X \in B) = \sum_i I_B(x_i)\,\mathsf{P}(X = x_i).$$
To derive this equation, we apply the law of total probability as given in (1.13) with $A = \{X \in B\}$ and $B_i = \{X = x_i\}$. Since the $x_i$ are distinct, the $B_i$ are disjoint, and (1.13) says that
$$\mathsf{P}(X \in B) = \sum_i \mathsf{P}(\{X \in B\} \cap \{X = x_i\}).$$
Now observe that
$$\{X \in B\} \cap \{X = x_i\} = \begin{cases} \{X = x_i\}, & x_i \in B,\\ \varnothing, & x_i \notin B. \end{cases}$$
Hence,
$$\mathsf{P}(\{X \in B\} \cap \{X = x_i\}) = I_B(x_i)\,\mathsf{P}(X = x_i).$$
Integer-Valued Random Variables

An integer-valued random variable is a discrete random variable whose distinct values are $x_i = i$. For integer-valued random variables,
$$\mathsf{P}(X \in B) = \sum_i I_B(i)\,\mathsf{P}(X = i).$$
Here are some simple probability calculations involving integer-valued random variables.
$$\mathsf{P}(X \le 7) = \sum_i I_{(-\infty,7]}(i)\,\mathsf{P}(X = i) = \sum_{i=-\infty}^{7} \mathsf{P}(X = i).$$
Similarly,
$$\mathsf{P}(X \ge 7) = \sum_i I_{[7,\infty)}(i)\,\mathsf{P}(X = i) = \sum_{i=7}^{\infty} \mathsf{P}(X = i).$$
However,
$$\mathsf{P}(X > 7) = \sum_i I_{(7,\infty)}(i)\,\mathsf{P}(X = i) = \sum_{i=8}^{\infty} \mathsf{P}(X = i),$$
which is equal to $\mathsf{P}(X \ge 8)$. Similarly,
$$\mathsf{P}(X < 7) = \sum_i I_{(-\infty,7)}(i)\,\mathsf{P}(X = i) = \sum_{i=-\infty}^{6} \mathsf{P}(X = i),$$
which is equal to $\mathsf{P}(X \le 6)$.

When an experiment results in a finite number of "equally likely" or "totally random" outcomes, we model it with a uniform random variable. We say that $X$ is uniformly distributed on $1, \ldots, n$ if
$$\mathsf{P}(X = k) = \frac{1}{n}, \qquad k = 1, \ldots, n.$$
For example, to model the toss of a fair die we would use $\mathsf{P}(X = k) = 1/6$ for $k = 1, \ldots, 6$. To model the selection of one card from a well-shuffled deck of playing cards we would use $\mathsf{P}(X = k) = 1/52$ for $k = 1, \ldots, 52$. More generally, we can let $k$ vary over any subset of $n$ integers. A common alternative to $1, \ldots, n$ is $0, \ldots, n - 1$. For $k$ not in the range of experimental outcomes, we put $\mathsf{P}(X = k) = 0$.

Example 2.4. Ten neighbors each have a cordless phone. The number of people using their cordless phones at the same time is totally random. Find the probability that more than half of the phones are in use at the same time.

Solution. We model the number of phones in use at the same time as a uniformly distributed random variable $X$ taking values $0, \ldots, 10$. Zero is included because we allow for the possibility that no phones are in use. We must compute
$$\mathsf{P}(X > 5) = \sum_{i=6}^{10} \mathsf{P}(X = i) = \sum_{i=6}^{10} \frac{1}{11} = \frac{5}{11}.$$
If the preceding example had asked for the probability that at least half the phones are in use, then the answer would have been $\mathsf{P}(X \ge 5) = 6/11$.
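The formula $\mathsf{P}(X \in B) = \sum_i I_B(x_i)\,\mathsf{P}(X = x_i)$ translates almost directly into code. The sketch below applies it to the uniform model of Example 2.4; representing the pmf as a dictionary and the set $B$ as a membership test `in_B` are choices made here for illustration only.

```python
from fractions import Fraction

# pmf of X uniformly distributed on 0, ..., 10 (the phone model of Example 2.4).
pmf = {k: Fraction(1, 11) for k in range(11)}

def prob(pmf, in_B):
    """P(X in B) = sum_i I_B(x_i) P(X = x_i); in_B(x) plays the role of I_B(x)."""
    return sum((p for x, p in pmf.items() if in_B(x)), Fraction(0))

more_than_half = prob(pmf, lambda x: x > 5)    # P(X > 5)
at_least_half = prob(pmf, lambda x: x >= 5)    # P(X >= 5)
print(more_than_half, at_least_half)
```

The difference between the two results comes entirely from whether the value 5 satisfies the membership test, which mirrors the distinction between $I_{(5,\infty)}$ and $I_{[5,\infty)}$.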
Pairs of Random Variables

If $X$ and $Y$ are random variables, we use the shorthand
$$\{X \in B,\, Y \in C\} := \{\omega \in \Omega : X(\omega) \in B \text{ and } Y(\omega) \in C\},$$
which is equal to
$$\{\omega \in \Omega : X(\omega) \in B\} \cap \{\omega \in \Omega : Y(\omega) \in C\}.$$
Putting all of our shorthand together, we can write
$$\{X \in B,\, Y \in C\} = \{X \in B\} \cap \{Y \in C\}.$$
We also have
$$\mathsf{P}(X \in B,\, Y \in C) := \mathsf{P}(\{X \in B,\, Y \in C\}) = \mathsf{P}(\{X \in B\} \cap \{Y \in C\}).$$
In particular, if the events $\{X \in B\}$ and $\{Y \in C\}$ are independent for all sets $B$ and $C$, we say that $X$ and $Y$ are independent random variables. In light of this definition and the above shorthand, we see that $X$ and $Y$ are independent random variables if and only if
$$\mathsf{P}(X \in B,\, Y \in C) = \mathsf{P}(X \in B)\,\mathsf{P}(Y \in C) \tag{2.1}$$
for all sets$^2$ $B$ and $C$.

Example 2.5. On a certain aircraft, the main control circuit on an autopilot fails with probability $p$. A redundant backup circuit fails independently with probability $q$. The airplane can fly if at least one of the circuits is functioning. Find the probability that the airplane cannot fly.

Solution. We introduce two random variables, $X$ and $Y$. We set $X = 1$ if the main circuit fails, and $X = 0$ otherwise. We set $Y = 1$ if the backup circuit fails, and $Y = 0$ otherwise. Then $\mathsf{P}(X = 1) = p$ and $\mathsf{P}(Y = 1) = q$. We assume $X$ and $Y$ are independent random variables. Then the event that the airplane cannot fly is modeled by
$$\{X = 1\} \cap \{Y = 1\}.$$
Using the independence of $X$ and $Y$, $\mathsf{P}(X = 1,\, Y = 1) = \mathsf{P}(X = 1)\,\mathsf{P}(Y = 1) = pq$.

The random variables $X$ and $Y$ of the preceding example are said to be Bernoulli. To indicate the relevant parameters, we write $X \sim \text{Bernoulli}(p)$ and $Y \sim \text{Bernoulli}(q)$. Bernoulli random variables are good for modeling the result of an experiment having two possible outcomes (numerically represented by 0 and 1), e.g., a coin toss, testing whether a certain block on a computer disk is bad, whether a new radar system detects a stealth aircraft, whether a certain internet packet is dropped due to congestion at a router, etc.
Multiple Independent Random Variables

We say that $X_1, X_2, \ldots$ are independent random variables if for every finite subset $J$ containing two or more positive integers,
$$\mathsf{P}\Bigl(\bigcap_{j \in J} \{X_j \in B_j\}\Bigr) = \prod_{j \in J} \mathsf{P}(X_j \in B_j),$$
for all sets $B_j$. If for every $B$, $\mathsf{P}(X_j \in B)$ does not depend on $j$ for all $j$, then we say the $X_j$ are identically distributed. If the $X_j$ are both independent and identically distributed, we say they are i.i.d.

Example 2.6. Ten neighbors each have a cordless phone. The number of people using their cordless phones at the same time is totally random. For one week (five days), each morning at 9:00 am we count the number of people using their cordless phone. Assume that the numbers of phones in use on different days are independent. Find the probability that on each day fewer than three phones are in use at 9:00 am. Also find the probability that on at least one day, more than two phones are in use at 9:00 am.

Solution. For $i = 1, \ldots, 5$, let $X_i$ denote the number of phones in use at 9:00 am on the $i$th day. We assume that the $X_i$ are independent, uniformly distributed random variables taking values $0, \ldots, 10$. For the first part of the problem, we must compute
$$\mathsf{P}\Bigl(\bigcap_{i=1}^{5} \{X_i < 3\}\Bigr).$$
Using independence,
$$\mathsf{P}\Bigl(\bigcap_{i=1}^{5} \{X_i < 3\}\Bigr) = \prod_{i=1}^{5} \mathsf{P}(X_i < 3).$$
Now observe that since the $X_i$ are nonnegative, integer-valued random variables,
$$\begin{aligned}
\mathsf{P}(X_i < 3) &= \mathsf{P}(X_i \le 2)\\
&= \mathsf{P}(X_i = 0 \text{ or } X_i = 1 \text{ or } X_i = 2)\\
&= \mathsf{P}(X_i = 0) + \mathsf{P}(X_i = 1) + \mathsf{P}(X_i = 2)\\
&= 3/11.
\end{aligned}$$
Hence,
$$\mathsf{P}\Bigl(\bigcap_{i=1}^{5} \{X_i < 3\}\Bigr) = (3/11)^5 \approx 0.00151.$$
For the second part of the problem, we must compute
$$\begin{aligned}
\mathsf{P}\Bigl(\bigcup_{i=1}^{5} \{X_i > 2\}\Bigr) &= 1 - \mathsf{P}\Bigl(\bigcap_{i=1}^{5} \{X_i \le 2\}\Bigr)\\
&= 1 - \prod_{i=1}^{5} \mathsf{P}(X_i \le 2)\\
&= 1 - (3/11)^5 \approx 0.99849.
\end{aligned}$$

Calculations similar to those in the preceding example can be used to find probabilities involving the maximum or minimum of several independent random variables.

Example 2.7. Let $X_1, \ldots, X_n$ be independent random variables. Evaluate
$$\mathsf{P}(\max(X_1, \ldots, X_n) \le z) \quad \text{and} \quad \mathsf{P}(\min(X_1, \ldots, X_n) \le z).$$
Solution. Observe that $\max(X_1, \ldots, X_n) \le z$ if and only if all of the $X_k$ are less than or equal to $z$; i.e.,
$$\{\max(X_1, \ldots, X_n) \le z\} = \bigcap_{k=1}^{n} \{X_k \le z\}.$$
It then follows that
$$\begin{aligned}
\mathsf{P}(\max(X_1, \ldots, X_n) \le z) &= \mathsf{P}\Bigl(\bigcap_{k=1}^{n} \{X_k \le z\}\Bigr)\\
&= \prod_{k=1}^{n} \mathsf{P}(X_k \le z),
\end{aligned}$$
where the second equation follows by independence.

For the min problem, observe that $\min(X_1, \ldots, X_n) \le z$ if and only if at least one of the $X_k$ is less than or equal to $z$; i.e.,
$$\{\min(X_1, \ldots, X_n) \le z\} = \bigcup_{k=1}^{n} \{X_k \le z\}.$$
Hence,
$$\begin{aligned}
\mathsf{P}(\min(X_1, \ldots, X_n) \le z) &= \mathsf{P}\Bigl(\bigcup_{k=1}^{n} \{X_k \le z\}\Bigr)\\
&= 1 - \mathsf{P}\Bigl(\bigcap_{k=1}^{n} \{X_k > z\}\Bigr)\\
&= 1 - \prod_{k=1}^{n} \mathsf{P}(X_k > z).
\end{aligned}$$
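The product formulas of Examples 2.6 and 2.7 can be checked numerically. The sketch below assumes i.i.d. random variables uniform on $0, \ldots, 10$, as in Example 2.6; the function names and the brute-force check over pairs ($n = 2$, $z = 3$) are illustrative choices, not part of the text.

```python
from fractions import Fraction
from itertools import product

values = range(11)
pmf = {k: Fraction(1, 11) for k in values}      # uniform on 0, ..., 10

def cdf(z):
    """P(X <= z) for a single random variable with the pmf above."""
    return sum((p for x, p in pmf.items() if x <= z), Fraction(0))

# Example 2.6: five independent days.
p_all_fewer_than_3 = cdf(2) ** 5                # P(all X_i < 3)
p_some_more_than_2 = 1 - p_all_fewer_than_3     # P(at least one X_i > 2)

# Example 2.7 specialized to i.i.d. variables.
def p_max_le(z, n):
    return cdf(z) ** n                          # product of P(X_k <= z)

def p_min_le(z, n):
    return 1 - (1 - cdf(z)) ** n                # 1 - product of P(X_k > z)

# Brute-force check for n = 2: sum the joint pmf over all pairs of values.
z = 3
brute_max = sum(pmf[a] * pmf[b] for a, b in product(values, repeat=2) if max(a, b) <= z)
brute_min = sum(pmf[a] * pmf[b] for a, b in product(values, repeat=2) if min(a, b) <= z)
print(float(p_all_fewer_than_3), float(p_some_more_than_2))
```

In the brute-force sums, independence enters through the joint probability `pmf[a] * pmf[b]`; the closed forms avoid enumerating all $11^n$ tuples.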
Example 2.8. A drug company has developed a new vaccine, and would like to analyze how effective it is. Suppose the vaccine is tested in several volunteers until one is found in whom the vaccine fails. Let $T = k$ if the first time the vaccine fails is on the $k$th volunteer. Find $\mathsf{P}(T = k)$.

Solution. For $i = 1, 2, \ldots$, let $X_i = 1$ if the vaccine works in the $i$th volunteer, and let $X_i = 0$ if it fails in the $i$th volunteer. Then the first failure occurs on the $k$th volunteer if and only if the vaccine works for the first $k - 1$ volunteers and then fails for the $k$th volunteer. In terms of events,
$$\{T = k\} = \{X_1 = 1\} \cap \cdots \cap \{X_{k-1} = 1\} \cap \{X_k = 0\}.$$
Assuming the $X_i$ are independent, and taking probabilities of both sides yields
$$\begin{aligned}
\mathsf{P}(T = k) &= \mathsf{P}(\{X_1 = 1\} \cap \cdots \cap \{X_{k-1} = 1\} \cap \{X_k = 0\})\\
&= \mathsf{P}(X_1 = 1) \cdots \mathsf{P}(X_{k-1} = 1)\,\mathsf{P}(X_k = 0).
\end{aligned}$$
If we further assume that the $X_i$ are identically distributed with $p := \mathsf{P}(X_i = 1)$ for all $i$, then
$$\mathsf{P}(T = k) = p^{k-1}(1 - p).$$

The preceding random variable $T$ is an example of a geometric random variable. Since we consider two types of geometric random variables, both depending on $0 \le p < 1$, it is convenient to write $X \sim \text{geometric}_1(p)$ if
$$\mathsf{P}(X = k) = (1 - p)p^{k-1}, \qquad k = 1, 2, \ldots,$$
and $X \sim \text{geometric}_0(p)$ if
$$\mathsf{P}(X = k) = (1 - p)p^{k}, \qquad k = 0, 1, \ldots.$$
As the above example shows, the $\text{geometric}_1(p)$ random variable represents the first time something happens. We will see in Chapter 9 that the $\text{geometric}_0(p)$ random variable represents the steady-state number of customers in a queue with an infinite buffer.

Note that by the geometric series formula (Problem 7 in Chapter 1), for any real or complex number $z$ with $|z| < 1$,
$$\sum_{k=0}^{\infty} z^k = \frac{1}{1 - z}.$$
Taking $z = p$ shows that the probabilities sum to one in both cases. Note that if we put $q = 1 - p$, then $0 < q \le 1$, and we can write $\mathsf{P}(X = k) = q(1 - q)^{k-1}$ in the $\text{geometric}_1(p)$ case and $\mathsf{P}(X = k) = q(1 - q)^{k}$ in the $\text{geometric}_0(p)$ case.
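Example 2.9. If $X \sim \text{geometric}_0(p)$, evaluate $\mathsf{P}(X > 2)$.

Solution. We could write
$$\mathsf{P}(X > 2) = \sum_{k=3}^{\infty} \mathsf{P}(X = k).$$
However, since $\mathsf{P}(X = k) = 0$ for negative $k$, a finite series is obtained by writing
$$\begin{aligned}
\mathsf{P}(X > 2) &= 1 - \mathsf{P}(X \le 2)\\
&= 1 - \sum_{k=0}^{2} \mathsf{P}(X = k)\\
&= 1 - (1 - p)[1 + p + p^2].
\end{aligned}$$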
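As a quick numerical companion to Example 2.9, the sketch below evaluates the finite-series expression with exact fractions. The value $p = 7/10$ and the function name `geometric0_pmf` are assumptions chosen for illustration. Since $(1 - p)(1 + p + p^2) = 1 - p^3$, the result should equal $p^3$, and the code checks this together with the partial-sum form of the geometric series.

```python
from fractions import Fraction

def geometric0_pmf(k, p):
    """P(X = k) = (1 - p) p^k for k = 0, 1, 2, ...; zero for negative k."""
    return (1 - p) * p**k if k >= 0 else 0

p = Fraction(7, 10)   # illustrative parameter value

# Example 2.9: P(X > 2) = 1 - P(X <= 2), a finite sum.
p_gt_2 = 1 - sum(geometric0_pmf(k, p) for k in range(3))

# Partial sums of the pmf: sum_{k<N} (1-p) p^k = 1 - p^N, which tends to 1.
N = 50
partial = sum(geometric0_pmf(k, p) for k in range(N))
print(p_gt_2, float(partial))
```

The same pattern, complementing and summing a handful of terms, is usually preferable to truncating an infinite tail sum.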
Probability Mass Functions

The probability mass function (pmf) of a discrete random variable $X$ taking distinct values $x_i$ is defined by
$$p_X(x_i) := \mathsf{P}(X = x_i).$$
With this notation,
$$\mathsf{P}(X \in B) = \sum_i I_B(x_i)\,\mathsf{P}(X = x_i) = \sum_i I_B(x_i)\,p_X(x_i).$$

Example 2.10. If $X \sim \text{geometric}_0(p)$, write down its pmf.

Solution. Write
$$p_X(k) := \mathsf{P}(X = k) = (1 - p)p^k, \qquad k = 0, 1, 2, \ldots.$$
A graph of $p_X(k)$ is shown in Figure 2.2.

Figure 2.2. The $\text{geometric}_0(p)$ pmf $p_X(k) = (1 - p)p^k$ with $p = 0.7$.

The Poisson random variable is used to model many different physical phenomena ranging from the photoelectric effect and radioactive decay to computer message traffic arriving at a queue for transmission. A random variable $X$ is said to have a Poisson probability mass function with parameter $\lambda > 0$, denoted by $X \sim \text{Poisson}(\lambda)$, if
$$p_X(k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots.$$
A graph of $p_X(k)$ is shown in Figure 2.3. To see that these probabilities sum to one, recall that for any real or complex number $z$, the power series for $e^z$ is
$$e^z = \sum_{k=0}^{\infty} \frac{z^k}{k!}.$$
Taking $z = \lambda$ and multiplying by $e^{-\lambda}$ gives $\sum_{k} p_X(k) = e^{-\lambda} e^{\lambda} = 1$.
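The claim that the Poisson probabilities sum to one can be illustrated numerically. In the sketch below, the parameter value $\lambda = 5$ and the truncation at 100 terms are assumptions made for the illustration; the neglected tail is far below floating-point precision for this $\lambda$.

```python
import math

def poisson_pmf(k, lam):
    """Poisson(lam) pmf: lam^k e^{-lam} / k! for k = 0, 1, 2, ..."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 5.0
# Truncated sum of the pmf; the exact infinite sum equals e^{-lam} e^{lam} = 1.
total = sum(poisson_pmf(k, lam) for k in range(100))
print(total)
```

For large $\lambda$ or large $k$ this direct formula can overflow, so a production implementation would typically work with logarithms; that refinement is omitted here.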
The joint probability mass function of $X$ and $Y$ is defined by
$$p_{XY}(x_i, y_j) := \mathsf{P}(X = x_i,\, Y = y_j) = \mathsf{P}(\{X = x_i\} \cap \{Y = y_j\}). \tag{2.2}$$
Two applications of the law of total probability as in (1.13) can be used to show that$^3$
$$\mathsf{P}(X \in B,\, Y \in C) = \sum_i \sum_j I_B(x_i)\,I_C(y_j)\,p_{XY}(x_i, y_j). \tag{2.3}$$
If $B = \{x_k\}$ and $C = \mathbb{R}$, the left-hand side of (2.3) is
$$\mathsf{P}(X = x_k,\, Y \in \mathbb{R}) = \mathsf{P}(\{X = x_k\} \cap \Omega) = \mathsf{P}(X = x_k).$$
(Because we have defined random variables to be real-valued functions of $\omega$, it is easy to see, after expanding our shorthand notation, that $\{Y \in \mathbb{R}\} := \{\omega \in \Omega : Y(\omega) \in \mathbb{R}\} = \Omega$.) The right-hand side of (2.3) reduces to $\sum_j p_{XY}(x_k, y_j)$. Hence,
$$p_X(x_k) = \sum_j p_{XY}(x_k, y_j).$$
Thus, the pmf of $X$ can be recovered from the joint pmf of $X$ and $Y$ by summing over all values of $Y$. Similarly,
$$p_Y(y_\ell) = \sum_i p_{XY}(x_i, y_\ell).$$
Figure 2.3. The $\text{Poisson}(\lambda)$ pmf $p_X(k) = \lambda^k e^{-\lambda}/k!$ with $\lambda = 5$.
In this context, we call $p_X$ and $p_Y$ marginal probability mass functions. From (2.2) it is clear that if $X$ and $Y$ are independent, then $p_{XY}(x_i, y_j) = p_X(x_i)\,p_Y(y_j)$. The converse is also true, since if $p_{XY}(x_i, y_j) = p_X(x_i)\,p_Y(y_j)$, then the double sum in (2.3) factors, and we have
$$\begin{aligned}
\mathsf{P}(X \in B,\, Y \in C) &= \Bigl[\sum_i I_B(x_i)\,p_X(x_i)\Bigr] \Bigl[\sum_j I_C(y_j)\,p_Y(y_j)\Bigr]\\
&= \mathsf{P}(X \in B)\,\mathsf{P}(Y \in C).
\end{aligned}$$
Hence, a pair of discrete random variables $X$ and $Y$ are independent if and only if their joint pmf factors into the product of their marginal pmfs.
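The relationship between joint and marginal pmfs, and the factorization test for independence, can be sketched as follows. The joint pmf is stored as a dictionary keyed by pairs $(x, y)$; the parameter values $p = 1/10$, $q = 1/5$ and the function names are hypothetical choices for illustration (the independent pair is modeled on the two Bernoulli variables of Example 2.5).

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Recover p_X and p_Y by summing the joint pmf over the other variable."""
    p_x, p_y = {}, {}
    for (x, y), pr in joint.items():
        p_x[x] = p_x.get(x, 0) + pr
        p_y[y] = p_y.get(y, 0) + pr
    return p_x, p_y

def is_independent(joint):
    """True if p_XY(x, y) = p_X(x) p_Y(y) for every pair of values."""
    p_x, p_y = marginals(joint)
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x in p_x for y in p_y)

# Independent Bernoulli(p) and Bernoulli(q): joint pmf built as a product.
p, q = Fraction(1, 10), Fraction(1, 5)
p_X = {0: 1 - p, 1: p}
p_Y = {0: 1 - q, 1: q}
joint = {(x, y): p_X[x] * p_Y[y] for x, y in product(p_X, p_Y)}

# A dependent pair for contrast: Y always equals X.
dependent = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}
print(is_independent(joint), is_independent(dependent))
```

Exact fractions are used so that the equality test in `is_independent` is meaningful; with floating-point probabilities a tolerance would be needed.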
2.2. Expectation
The notion of expectation is motivated by the conventional definition of numerical average. Recall that the numerical average of $n$ numbers, $x_1, \ldots, x_n$, is
$$\frac{1}{n} \sum_{i=1}^{n} x_i.$$
We use the average to summarize or characterize the entire collection of numbers $x_1, \ldots, x_n$ with a single "typical" value. If $X$ is a discrete random variable taking $n$ distinct, equally likely, real values, $x_1, \ldots, x_n$, then we define its expectation by
$$\mathsf{E}[X] := \frac{1}{n} \sum_{i=1}^{n} x_i.$$
Since $\mathsf{P}(X = x_i) = 1/n$ for each $i = 1, \ldots, n$, we can also write
$$\mathsf{E}[X] = \sum_{i=1}^{n} x_i\,\mathsf{P}(X = x_i).$$
Observe that this last formula makes sense even if the values $x_i$ do not occur with the same probability. We therefore define the expectation, mean, or average of a discrete random variable $X$ by
$$\mathsf{E}[X] := \sum_i x_i\,\mathsf{P}(X = x_i),$$
or, using the pmf notation $p_X(x_i) = \mathsf{P}(X = x_i)$,
$$\mathsf{E}[X] = \sum_i x_i\,p_X(x_i).$$
In complicated problems, it is sometimes easier to compute means than probabilities. Hence, the single number $\mathsf{E}[X]$ often serves as a conveniently computable way to summarize or partially characterize an entire collection of pairs $\{(x_i, p_X(x_i))\}$.

Example 2.11. Find the mean of a Bernoulli($p$) random variable $X$.

Solution. Since $X$ takes only the values $x_0 = 0$ and $x_1 = 1$, we can write
$$\mathsf{E}[X] = \sum_{i=0}^{1} i\,\mathsf{P}(X = i) = 0 \cdot (1 - p) + 1 \cdot p = p.$$
Note that while $X$ takes only the values 0 and 1, its "typical" value $p$ is never seen (unless $p = 0$ or $p = 1$).
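The definition $\mathsf{E}[X] = \sum_i x_i\,p_X(x_i)$ is a one-line computation once the pmf is stored as a dictionary. The sketch below applies it to a fair die (the uniform model on $1, \ldots, 6$ mentioned earlier in the section) and to a Bernoulli($p$) random variable; the value $p = 3/10$ and the function name `expectation` are illustrative assumptions.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum over the values x of x * p_X(x)."""
    return sum(x * p for x, p in pmf.items())

# Fair die: equally likely values, so E[X] is the plain numerical average.
die = {k: Fraction(1, 6) for k in range(1, 7)}

# Bernoulli(p) as in Example 2.11.
p = Fraction(3, 10)
bernoulli = {0: 1 - p, 1: p}

print(expectation(die), expectation(bernoulli))
```

Note that neither mean is a value the random variable can actually take, which is the point made at the end of Example 2.11.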
Example 2.12 (Every Probability is an Expectation). There is a close connection between expectation and probability. If $X$ is any random variable and $B$ is any set, then $I_B(X)$ is a discrete random variable taking the values zero and one. Thus, $I_B(X)$ is a Bernoulli random variable, and
$$\mathsf{E}[I_B(X)] = \mathsf{P}\bigl(I_B(X) = 1\bigr) = \mathsf{P}(X \in B).$$

Example 2.13. When light of intensity $\lambda$ is incident on a photodetector, the number of photoelectrons generated is Poisson with parameter $\lambda$. Find the mean number of photoelectrons generated.

Solution. Let $X$ denote the number of photoelectrons generated. We need to calculate $\mathsf{E}[X]$. Since a Poisson random variable takes only nonnegative integer values with positive probability,
$$\mathsf{E}[X] = \sum_{n=0}^{\infty} n\,\mathsf{P}(X = n).$$
Since the term with $n = 0$ is zero, it can be dropped. Hence,
$$\begin{aligned}
\mathsf{E}[X] &= \sum_{n=1}^{\infty} n\,\frac{\lambda^n e^{-\lambda}}{n!}\\
&= \sum_{n=1}^{\infty} \frac{\lambda^n e^{-\lambda}}{(n-1)!}\\
&= \lambda e^{-\lambda} \sum_{n=1}^{\infty} \frac{\lambda^{n-1}}{(n-1)!}.
\end{aligned}$$
Now change the index of summation from $n$ to $k = n - 1$. This results in
$$\mathsf{E}[X] = \lambda e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$
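The result of Example 2.13 can be checked numerically by truncating the series for $\mathsf{E}[X]$. The sketch assumes $\lambda = 5$ and a cutoff of 100 terms; both are arbitrary illustrative choices, and the neglected tail is negligible at this $\lambda$.

```python
import math

def poisson_pmf(k, lam):
    """Poisson(lam) pmf: lam^k e^{-lam} / k!."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 5.0
# Truncated version of E[X] = sum_{n >= 0} n P(X = n).
mean = sum(n * poisson_pmf(n, lam) for n in range(100))
print(mean)
```

The printed value should agree with $\lambda$ up to floating-point rounding.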
Example 2.14 (Infinite Expectation). Here is a random variable for which $\mathsf{E}[X] = \infty$. Suppose that $\mathsf{P}(X = k) = C^{-1}/k^2$, $k = 1, 2, \ldots$, where$^{\dagger}$
$$C := \sum_{k=1}^{\infty} \frac{1}{k^2}.$$
Then
$$\mathsf{E}[X] = \sum_{k=1}^{\infty} k\,\mathsf{P}(X = k) = \sum_{k=1}^{\infty} k\,\frac{C^{-1}}{k^2} = C^{-1} \sum_{k=1}^{\infty} \frac{1}{k} = \infty,$$
as shown in Problem 34.

Some care is necessary when computing expectations of signed random variables that take more than finitely many values. It is the convention in probability theory that $\mathsf{E}[X]$ should be evaluated as
$$\mathsf{E}[X] = \sum_{i : x_i \ge 0} x_i\,\mathsf{P}(X = x_i) + \sum_{i : x_i < 0} x_i\,\mathsf{P}(X = x_i),$$
assuming that at least one of these sums is finite. If the first sum is $+\infty$ and the second one is $-\infty$, then no value is assigned to $\mathsf{E}[X]$, and we say that $\mathsf{E}[X]$ is undefined.

$^{\dagger}$ Note that $C$ is finite by Problem 34. This is important since if $C = \infty$, then $C^{-1} = 0$ and the probabilities would sum to zero instead of one.
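Example 2.15 (Undefined Expectation). With $C$ as in the previous example, suppose that for $k = 1, 2, \ldots$, $\mathsf{P}(X = k) = \mathsf{P}(X = -k) = \tfrac{1}{2} C^{-1}/k^2$. Then
$$\begin{aligned}
\mathsf{E}[X] &= \sum_{k=1}^{\infty} k\,\mathsf{P}(X = k) + \sum_{k=-\infty}^{-1} k\,\mathsf{P}(X = k)\\
&= \frac{1}{2C} \sum_{k=1}^{\infty} \frac{1}{k} + \frac{1}{2C} \sum_{k=-\infty}^{-1} \frac{1}{k}\\
&= \frac{\infty}{2C} + \frac{-\infty}{2C}\\
&= \text{undefined}.
\end{aligned}$$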
Expectation of Functions of Random Variables, or the Law of the Unconscious Statistician (LOTUS)
Given a random variable X, we will often have occasion to de ne a new random variable by Z := g(X), where g(x) is a real-valued function of the real variable x. More precisely, recall that a random variable X is actually a function taking points of the sample space, ! 2 , into real numbers X(!). Hence, the notation Z = g(X) is actually shorthand for Z(!) := g(X(!)). We now show that if X has pmf pX , then we can compute E[Z] = E[g(X)] without actually nding the pmf of Z. The precise formula is X E[g(X)] = g(xi ) pX (xi): (2.4) i
Equation (2.4) is sometimes called the law of the unconscious statistician (LOTUS). As a simple example of its use, we can write, for any constant a, X X E[aX] = axi pX (xi ) = a xi pX (xi ) = aE[X]: i
i
In other words, constant factors can be pulled out of the expectation. ? Derivation
of LOTUS
To derive (2.4), we proceed as follows. Let X take distinct values $x_i$. Then Z takes values $g(x_i)$. However, the values $g(x_i)$ may not be distinct. For example, if $g(x) = x^2$, and X takes the four distinct values $\pm 1$ and $\pm 2$, then Z takes only the two distinct values 1 and 4. In any case, let $z_k$ denote the distinct values of Z, and put
\[ B_k := \{x : g(x) = z_k\}. \]
Then $Z = z_k$ if and only if $g(X) = z_k$, and this happens if and only if $X \in B_k$. Thus, the pmf of Z is
\[ p_Z(z_k) := \wp(Z = z_k) = \wp(X \in B_k), \]
where
\[ \wp(X \in B_k) = \sum_i I_{B_k}(x_i)\, p_X(x_i). \]
We now compute
\[
\begin{aligned}
E[Z] &= \sum_k z_k\, p_Z(z_k) \\
&= \sum_k z_k \Bigl[ \sum_i I_{B_k}(x_i)\, p_X(x_i) \Bigr] \\
&= \sum_i \Bigl[ \sum_k z_k\, I_{B_k}(x_i) \Bigr] p_X(x_i).
\end{aligned}
\]
From the definition of the $B_k$ in terms of the distinct $z_k$, the $B_k$ are disjoint. Hence, for fixed i in the outer sum, in the inner sum, only one of the $B_k$ can contain $x_i$. In other words, all terms but one in the inner sum are zero. Furthermore, for that one value of k with $I_{B_k}(x_i) = 1$, we have $x_i \in B_k$, and $z_k = g(x_i)$; i.e., for this value of k,
\[ z_k\, I_{B_k}(x_i) = g(x_i)\, I_{B_k}(x_i) = g(x_i). \]
Hence, the inner sum reduces to $g(x_i)$, and
\[ E[Z] = E[g(X)] = \sum_i g(x_i)\, p_X(x_i). \]
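The two routes to E[g(X)] in the derivation above can be compared numerically. The following is a minimal sketch, not part of the original text: the pmf values and the helper names `lotus_expectation` and `pmf_of_function` are illustrative assumptions, with $g(x) = x^2$ on the four values $\pm 1$, $\pm 2$ as in the example.

```python
from collections import defaultdict
from fractions import Fraction

def lotus_expectation(pmf, g):
    """E[g(X)] via (2.4): sum of g(x_i) * p_X(x_i) over the values of X."""
    return sum(g(x) * p for x, p in pmf.items())

def pmf_of_function(pmf, g):
    """pmf of Z = g(X): p_Z(z_k) = P(X in B_k), where B_k = {x : g(x) = z_k}."""
    p_z = defaultdict(Fraction)
    for x, p in pmf.items():
        p_z[g(x)] += p  # x belongs to exactly one B_k
    return dict(p_z)

# Hypothetical pmf on the four distinct values -2, -1, 1, 2 (an assumption).
p_x = {-2: Fraction(1, 8), -1: Fraction(3, 8), 1: Fraction(1, 4), 2: Fraction(1, 4)}

def square(x):
    return x * x

p_z = pmf_of_function(p_x, square)             # Z takes only the values 1 and 4
direct = sum(z * p for z, p in p_z.items())    # E[Z] from the pmf of Z
via_lotus = lotus_expectation(p_x, square)     # E[g(X)] without the pmf of Z
```

Both computations give 17/8 here; exact fractions are used so that the agreement is not blurred by rounding.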
Linearity of Expectation
The derivation of the law of the unconscious statistician can be generalized to show that if g(x, y) is a real-valued function of two variables x and y, then
\[ E[g(X, Y)] = \sum_i \sum_j g(x_i, y_j)\, p_{XY}(x_i, y_j). \]
In particular, taking g(x, y) = x + y, it is a simple exercise to show that E[X + Y] = E[X] + E[Y]. Using this fact, we can write
\[ E[aX + bY] = E[aX] + E[bY] = aE[X] + bE[Y]. \]
In other words, expectation is linear.
Example 2.16. A binary communication link has bit-error probability p. What is the expected number of bit errors in a transmission of n bits?
Solution. For $i = 1, \ldots, n$, let $X_i = 1$ if the ith bit is received incorrectly, and let $X_i = 0$ otherwise. Then $X_i \sim$ Bernoulli(p), and $Y := X_1 + \cdots + X_n$ is the total number of errors in the transmission of n bits. We know from Example 2.11 that $E[X_i] = p$. Hence,
\[ E[Y] = E\Bigl[ \sum_{i=1}^{n} X_i \Bigr] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np. \]
Moments
The nth moment, $n \ge 1$, of a real-valued random variable X is defined to be $E[X^n]$. In particular, the first moment of X is its mean E[X]. Letting m = E[X], we define the variance of X by
\[
\begin{aligned}
\operatorname{var}(X) &:= E[(X - m)^2] \\
&= E[X^2 - 2mX + m^2] \\
&= E[X^2] - 2mE[X] + m^2, \quad \text{by linearity,} \\
&= E[X^2] - m^2 \\
&= E[X^2] - (E[X])^2.
\end{aligned}
\]
The variance is a measure of the spread of a random variable about its mean value. The larger the variance, the more likely we are to see values of X that are far from the mean m. The standard deviation of X is defined to be the positive square root of the variance. The nth central moment of X is $E[(X - m)^n]$. Hence, the second central moment is the variance.
Example 2.17. Find the second moment and the variance of X if $X \sim$ Bernoulli(p).
Solution. Since X takes only the values 0 and 1, it has the unusual property that $X^2 = X$. Hence, $E[X^2] = E[X] = p$. It now follows that
\[ \operatorname{var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p). \]
Example 2.18. Find the second moment and the variance of X if $X \sim$ Poisson($\lambda$).
Solution. Observe that $E[X(X - 1)] + E[X] = E[X^2]$. Since we know that $E[X] = \lambda$ from Example 2.13, it suffices to compute
\[
\begin{aligned}
E[X(X - 1)] &= \sum_{n=0}^{\infty} n(n - 1) \frac{\lambda^n e^{-\lambda}}{n!} \\
&= \sum_{n=2}^{\infty} \frac{\lambda^n e^{-\lambda}}{(n - 2)!} \\
&= \lambda^2 e^{-\lambda} \sum_{n=2}^{\infty} \frac{\lambda^{n-2}}{(n - 2)!}.
\end{aligned}
\]
Making the change of summation k = n − 2, we have
\[ E[X(X - 1)] = \lambda^2 e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda^2. \]
It follows that $E[X^2] = \lambda^2 + \lambda$, and
\[ \operatorname{var}(X) = E[X^2] - (E[X])^2 = (\lambda^2 + \lambda) - \lambda^2 = \lambda. \]
Thus, the Poisson($\lambda$) random variable is unusual in that its mean and variance are the same.
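The conclusion of Example 2.18 can be checked by summing the series numerically. This is a small sketch under stated assumptions: the function names and the truncation at 100 terms are illustrative choices, adequate for moderate $\lambda$ but not a general-purpose routine.

```python
import math

def poisson_pmf(n, lam):
    """Poisson(lam) pmf at n."""
    return math.exp(-lam) * lam**n / math.factorial(n)

def poisson_moments(lam, terms=100):
    """Truncated series for E[X], E[X^2], and var(X) = E[X^2] - (E[X])^2."""
    mean = sum(n * poisson_pmf(n, lam) for n in range(terms))
    second = sum(n * n * poisson_pmf(n, lam) for n in range(terms))
    return mean, second, second - mean**2

mean, second, variance = poisson_moments(3.5)
```

For $\lambda = 3.5$ the three values agree with $\lambda$, $\lambda^2 + \lambda$, and $\lambda$ up to floating-point error.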
Probability Generating Functions
Let X be a discrete random variable taking only nonnegative integer values. The probability generating function (pgf) of X is$^4$
\[ G_X(z) := E[z^X] = \sum_{n=0}^{\infty} z^n\, \wp(X = n). \]
Since for $|z| \le 1$,
\[
\Bigl| \sum_{n=0}^{\infty} z^n\, \wp(X = n) \Bigr|
\le \sum_{n=0}^{\infty} |z^n\, \wp(X = n)|
= \sum_{n=0}^{\infty} |z|^n\, \wp(X = n)
\le \sum_{n=0}^{\infty} \wp(X = n) = 1,
\]
$G_X$ has a power series expansion with radius of convergence at least one. The reason that $G_X$ is called the probability generating function is that the probabilities $\wp(X = k)$ can be obtained from $G_X$ as follows. Writing
\[ G_X(z) = \wp(X = 0) + z\, \wp(X = 1) + z^2\, \wp(X = 2) + \cdots, \]
we immediately see that $G_X(0) = \wp(X = 0)$. We also see that
\[ G_X'(z) = \wp(X = 1) + 2z\, \wp(X = 2) + 3z^2\, \wp(X = 3) + \cdots. \]
Hence, $G_X'(0) = \wp(X = 1)$. Continuing in this way shows that
\[ G_X^{(k)}(z)\big|_{z=0} = k!\, \wp(X = k), \]
or equivalently,
\[ \frac{G_X^{(k)}(z)\big|_{z=0}}{k!} = \wp(X = k). \tag{2.5} \]
The probability generating function can also be used to find moments. Since
\[ G_X'(z) = \sum_{n=1}^{\infty} n z^{n-1}\, \wp(X = n), \]
setting z = 1 yields
\[ G_X'(z)\big|_{z=1} = \sum_{n=1}^{\infty} n\, \wp(X = n) = E[X]. \]
Similarly, since
\[ G_X''(z) = \sum_{n=2}^{\infty} n(n - 1) z^{n-2}\, \wp(X = n), \]
setting z = 1 yields
\[ G_X''(z)\big|_{z=1} = \sum_{n=2}^{\infty} n(n - 1)\, \wp(X = n) = E[X(X - 1)] = E[X^2] - E[X]. \]
In general, since
\[ G_X^{(k)}(z) = \sum_{n=k}^{\infty} n(n - 1) \cdots (n - [k - 1])\, z^{n-k}\, \wp(X = n), \]
setting z = 1 yields
\[ G_X^{(k)}(z)\big|_{z=1} = E\bigl[ X(X - 1)(X - 2) \cdots (X - [k - 1]) \bigr]. \]
The right-hand side is called the kth factorial moment of X.
Example 2.19. Find the probability generating function of $X \sim$ Poisson($\lambda$), and use the result to find E[X] and var(X).
Solution. Write
\[
\begin{aligned}
G_X(z) &= E[z^X] \\
&= \sum_{n=0}^{\infty} z^n\, \wp(X = n) \\
&= \sum_{n=0}^{\infty} z^n \frac{\lambda^n e^{-\lambda}}{n!} \\
&= e^{-\lambda} \sum_{n=0}^{\infty} \frac{(z\lambda)^n}{n!} \\
&= e^{-\lambda} e^{z\lambda} \\
&= \exp[\lambda(z - 1)].
\end{aligned}
\]
Now observe that $G_X'(z) = \exp[\lambda(z - 1)]\lambda$, and so $E[X] = G_X'(1) = \lambda$. Since $G_X''(z) = \exp[\lambda(z - 1)]\lambda^2$, $E[X^2] - E[X] = \lambda^2$, and so $E[X^2] = \lambda^2 + \lambda$. For the variance, write
\[ \operatorname{var}(X) = E[X^2] - (E[X])^2 = (\lambda^2 + \lambda) - \lambda^2 = \lambda. \]
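The derivative formulas for the pgf can be exercised on a truncated Poisson pmf. The sketch below is illustrative only: `pgf_derivative` is a hypothetical helper that differentiates the power series term by term, and the cutoff of 80 terms is an assumption that is ample for $\lambda = 2$.

```python
import math

def pgf_derivative(pmf, k, z):
    """k-th derivative at z of G(z) = sum_n pmf[n] * z**n for a truncated pmf."""
    total = 0.0
    for n in range(k, len(pmf)):
        falling = math.perm(n, k)  # n(n-1)...(n-[k-1])
        total += falling * z ** (n - k) * pmf[n]
    return total

lam = 2.0
pmf = [math.exp(-lam) * lam**n / math.factorial(n) for n in range(80)]

mean = pgf_derivative(pmf, 1, 1.0)        # G'(1) = E[X]
factorial2 = pgf_derivative(pmf, 2, 1.0)  # G''(1) = E[X(X-1)]
variance = factorial2 + mean - mean**2    # E[X^2] - (E[X])^2
prob3 = pgf_derivative(pmf, 3, 0.0) / math.factorial(3)  # (2.5) with k = 3
```

Evaluating at z = 1 yields the factorial moments, while evaluating at z = 0 and dividing by k! recovers the probabilities, as in (2.5).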
Expectations of Products of Functions of Independent Random Variables
If X and Y are independent, then
\[ E[h(X)\, k(Y)] = E[h(X)]\, E[k(Y)] \]
for all functions h(x) and k(y). In other words, if X and Y are independent, then for any functions h(x) and k(y), the expectation of the product h(X)k(Y) is equal to the product of the individual expectations. To derive this result, we use the fact that
\[
\begin{aligned}
p_{XY}(x_i, y_j) &:= \wp(X = x_i, Y = y_j) \\
&= \wp(X = x_i)\, \wp(Y = y_j), \quad \text{by independence,} \\
&= p_X(x_i)\, p_Y(y_j).
\end{aligned}
\]
Now write
\[
\begin{aligned}
E[h(X)\, k(Y)] &= \sum_i \sum_j h(x_i)\, k(y_j)\, p_{XY}(x_i, y_j) \\
&= \sum_i \sum_j h(x_i)\, k(y_j)\, p_X(x_i)\, p_Y(y_j) \\
&= \sum_i h(x_i)\, p_X(x_i) \Bigl[ \sum_j k(y_j)\, p_Y(y_j) \Bigr] \\
&= E[h(X)]\, E[k(Y)].
\end{aligned}
\]
Example 2.20. A certain communication network consists of n links. Suppose that each link goes down with probability p independently of the other links. For $k = 0, \ldots, n$, find the probability that exactly k out of the n links are down.
Solution. Let $X_i = 1$ if the ith link is down and $X_i = 0$ otherwise. Then the $X_i$ are independent Bernoulli(p). Put $Y := X_1 + \cdots + X_n$. Then Y counts the number of links that are down. We first find the probability generating function of Y. Write
\[
\begin{aligned}
G_Y(z) &= E[z^Y] \\
&= E[z^{X_1 + \cdots + X_n}] \\
&= E[z^{X_1} \cdots z^{X_n}] \\
&= E[z^{X_1}] \cdots E[z^{X_n}], \quad \text{by independence.}
\end{aligned}
\]
Now, the probability generating function of a Bernoulli(p) random variable is easily seen to be
\[ E[z^{X_i}] = z^0 (1 - p) + z^1 p = (1 - p) + pz. \]
Thus,
\[ G_Y(z) = [(1 - p) + pz]^n. \]
To apply (2.5), we need the derivatives of $G_Y(z)$. The first derivative is
\[ G_Y'(z) = n[(1 - p) + pz]^{n-1} p, \]
and in general, the kth derivative is
\[ G_Y^{(k)}(z) = n(n - 1) \cdots (n - [k - 1])\, [(1 - p) + pz]^{n-k} p^k. \]
It now follows that
\[
\begin{aligned}
\wp(Y = k) &= \frac{G_Y^{(k)}(0)}{k!} \\
&= \frac{n(n - 1) \cdots (n - [k - 1])}{k!} (1 - p)^{n-k} p^k \\
&= \frac{n!}{k!(n - k)!}\, p^k (1 - p)^{n-k}.
\end{aligned}
\]
Since the formula for $G_Y(z)$ is a polynomial of degree n, $G_Y^{(k)}(z) = 0$ for all k > n. Thus, $\wp(Y = k) = 0$ for k > n.
The preceding random variable Y is an example of a binomial(n, p) random variable. (An alternative derivation using techniques from Section 2.4 is given in the Notes.$^5$) The binomial random variable counts how many times an event has occurred. Its probability mass function is usually written using the notation
\[ p_Y(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, \ldots, n, \]
where the symbol $\binom{n}{k}$ is read "n choose k," and is defined by
\[ \binom{n}{k} := \frac{n!}{k!(n - k)!}. \]
In Matlab, $\binom{n}{k}$ = nchoosek(n, k).
For any discrete random variable Y taking only nonnegative integer values, we always have
\[ G_Y(1) = \Bigl( \sum_{k=0}^{\infty} z^k p_Y(k) \Bigr) \Big|_{z=1} = \sum_{k=0}^{\infty} p_Y(k) = 1. \]
For the binomial random variable Y this says
\[ \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k} = 1, \]
when p is any number satisfying $0 \le p \le 1$. More generally, suppose a and b are any nonnegative numbers with a + b > 0. Put p = a/(a + b). Then $0 \le p \le 1$, and 1 − p = b/(a + b). It follows that
\[ \sum_{k=0}^{n} \binom{n}{k} \Bigl( \frac{a}{a + b} \Bigr)^{k} \Bigl( \frac{b}{a + b} \Bigr)^{n-k} = 1. \]
Multiplying both sides by $(a + b)^k (a + b)^{n-k} = (a + b)^n$ yields the binomial theorem,
\[ \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} = (a + b)^n. \]
Remark. The above derivation is for nonnegative a and b. The binomial theorem actually holds for complex a and b. This is easy to derive using induction on n along with the easily verified identity
\[ \binom{n}{k - 1} + \binom{n}{k} = \binom{n + 1}{k}. \]
On account of the binomial theorem, the quantity $\binom{n}{k}$ is sometimes called the binomial coefficient. We also point out that the binomial coefficients can be read off from the nth row of Pascal's triangle in Figure 2.4. Noting that the top row is row 0, it is immediate, for example, that
\[ (a + b)^5 = a^5 + 5a^4 b + 10a^3 b^2 + 10a^2 b^3 + 5ab^4 + b^5. \]
To generate the triangle, observe that except for the entries that are ones, each entry is equal to the sum of the two numbers above it to the left and right.
               1
             1   1
           1   2   1
         1   3   3   1
       1   4   6   4   1
     1   5  10  10   5   1
               ⋮
Figure 2.4. Pascal's triangle.
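The pgf argument of Example 2.20 amounts to multiplying n copies of the polynomial (1 − p) + pz and reading off the coefficients. The following is a rough sketch of that idea, not the author's code: the function names are hypothetical, and Python's `math.comb` plays the role of Matlab's nchoosek.

```python
import math

def bernoulli_sum_pmf(n, p):
    """Coefficients of [(1-p) + p*z]**n, i.e. the pmf of Y = X_1 + ... + X_n."""
    coeffs = [1.0]  # the constant polynomial 1
    for _ in range(n):
        nxt = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k] += c * (1 - p)  # contribution of the z^0 term of the Bernoulli pgf
            nxt[k + 1] += c * p    # contribution of the z^1 term of the Bernoulli pgf
        coeffs = nxt
    return coeffs

def binomial_pmf(n, p):
    """Closed form: (n choose k) p^k (1-p)^(n-k) for k = 0..n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

from_pgf = bernoulli_sum_pmf(10, 0.3)
closed_form = binomial_pmf(10, 0.3)
row5 = [math.comb(5, k) for k in range(6)]  # row 5 of Pascal's triangle
```

The inner update is the same "sum of the two entries above" rule that generates Pascal's triangle, weighted by 1 − p and p.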
Binomial Random Variables and Combinations
We showed in Example 2.20 that if $X_1, \ldots, X_n$ are i.i.d. Bernoulli(p), then $Y := X_1 + \cdots + X_n$ is binomial(n, p). We can use this result to show that the binomial coefficient $\binom{n}{k}$ is equal to the number of n-bit words with exactly k ones and n − k zeros. On the one hand, $\wp(Y = k) = \binom{n}{k} p^k (1 - p)^{n-k}$. On the other hand, observe that Y = k if and only if $X_1 = i_1, \ldots, X_n = i_n$ for some n-tuple $(i_1, \ldots, i_n)$ of ones and zeros with exactly k ones and n − k zeros. Denoting this set of n-tuples by $B_{n,k}$, we can write
\[
\begin{aligned}
\wp(Y = k) &= \wp\Bigl( \bigcup_{(i_1, \ldots, i_n) \in B_{n,k}} \{X_1 = i_1, \ldots, X_n = i_n\} \Bigr) \\
&= \sum_{(i_1, \ldots, i_n) \in B_{n,k}} \wp(X_1 = i_1, \ldots, X_n = i_n) \\
&= \sum_{(i_1, \ldots, i_n) \in B_{n,k}} \prod_{\nu=1}^{n} \wp(X_\nu = i_\nu).
\end{aligned}
\]
Now, each factor is p if $i_\nu = 1$ and 1 − p if $i_\nu = 0$. Since each n-tuple belongs to $B_{n,k}$, we must have k factors being p and n − k factors being 1 − p. Thus,
\[ \wp(Y = k) = \sum_{(i_1, \ldots, i_n) \in B_{n,k}} p^k (1 - p)^{n-k}. \]
Since all terms in the above sum are the same,
\[ \wp(Y = k) = |B_{n,k}|\, p^k (1 - p)^{n-k}. \]
It follows that $|B_{n,k}| = \binom{n}{k}$.
Consider n objects $O_1, \ldots, O_n$. How many ways can we select exactly k of the n objects if the order of selection is not important? Each such selection is called a combination. One way to answer this question is to think of arranging the n objects over trap doors as shown in Figure 2.5. If k doors are opened (and n − k shut), this device allows the corresponding combination of k of the n objects to fall into the bowl below. The number of ways to select the open and shut doors is exactly the number of n-bit words with k ones and n − k zeros. As argued above, this number is $\binom{n}{k}$.
[Figure 2.5 shows the objects $O_1, O_2, \ldots, O_{n-1}, O_n$ arranged over trap doors above a bowl.]
Figure 2.5. There are n objects arranged over trap doors. If k doors are opened (and n − k shut), this device allows the corresponding combination of k of the n objects to fall into the bowl below.
Example 2.21. A 12-person jury is to be selected from a group of 20 potential jurors. How many different juries are possible?
Solution. There are
\[ \binom{20}{12} = \frac{20!}{12!\, 8!} = 125970 \]
different juries.
Example 2.22. A 12-person jury is to be selected from a group of 20 potential jurors of which 11 are men and 9 are women. How many 12-person juries are there with 5 men and 7 women?
Solution. There are $\binom{11}{5}$ ways to choose the 5 men, and there are $\binom{9}{7}$ ways to choose the 7 women. Hence, there are
\[ \binom{11}{5} \binom{9}{7} = \frac{11!\, 9!}{5!\, 6!\, 7!\, 2!} = 16632 \]
possible juries with 5 men and 7 women.
Example 2.23. An urn contains 11 green balls and 9 red balls. If 12 balls are chosen at random, what is the probability of choosing exactly 5 green balls and 7 red balls?
Solution. Since there are $\binom{20}{12}$ ways to choose any 12 balls, we envision a sample space $\Omega$ with $|\Omega| = \binom{20}{12}$. We are interested in a subset of $\Omega$, call it $B_{g,r}(5, 7)$, corresponding to those selections of 12 balls such that 5 are green and 7 are red. Thus,
\[ |B_{g,r}(5, 7)| = \binom{11}{5} \binom{9}{7}. \]
Since all selections of 12 balls are equally likely, the probability we seek is
\[ \frac{|B_{g,r}(5, 7)|}{|\Omega|} = \frac{\binom{11}{5} \binom{9}{7}}{\binom{20}{12}} = \frac{16632}{125970} \approx 0.132. \]
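The counts in Examples 2.21 through 2.23 are easy to reproduce. A minimal sketch using Python's standard `math.comb` (the variable names are illustrative):

```python
import math

juries = math.comb(20, 12)                         # Example 2.21
mixed_juries = math.comb(11, 5) * math.comb(9, 7)  # Example 2.22
prob = mixed_juries / juries                       # Example 2.23

print(juries, mixed_juries, round(prob, 3))  # prints: 125970 16632 0.132
```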
Poisson Approximation of Binomial Probabilities
If we let $\lambda := np$, then the probability generating function of a binomial(n, p) random variable can be written as
\[ [(1 - p) + pz]^n = [1 + p(z - 1)]^n = \Bigl[ 1 + \frac{\lambda(z - 1)}{n} \Bigr]^n. \]
From calculus, recall the formula
\[ \lim_{n \to \infty} \Bigl( 1 + \frac{x}{n} \Bigr)^n = e^x. \]
So, for large n,
\[ \Bigl[ 1 + \frac{\lambda(z - 1)}{n} \Bigr]^n \approx \exp[\lambda(z - 1)], \]
which is the probability generating function of a Poisson($\lambda$) random variable (Example 2.19). In making this approximation, n should be large compared to $\lambda(z - 1)$. Since $\lambda := np$, as n becomes large, so does $\lambda(z - 1)$. To keep the size of $\lambda$ small enough to be useful, we should keep p small. Under this assumption, the binomial(n, p) probability generating function is close to the Poisson(np) probability generating function. This suggests the approximation‡
\[ \binom{n}{k} p^k (1 - p)^{n-k} \approx \frac{(np)^k e^{-np}}{k!}, \quad n \text{ large}, \; p \text{ small}. \]
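To get a feel for when the Poisson approximation is good, one can compare the two pmfs directly. The sketch below is only an illustration: the helper names and the cap `k_max` on the range of k (which keeps the factorials small) are assumptions, and the two test cases share the same $\lambda = np = 2$.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_abs_gap(n, p, k_max=30):
    """Largest |binomial(n,p) pmf - Poisson(np) pmf| over k = 0..min(n, k_max)."""
    lam = n * p
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
        for k in range(min(n, k_max) + 1)
    )

gap_small_p = max_abs_gap(1000, 0.002)  # n large, p small
gap_large_p = max_abs_gap(10, 0.2)      # same lambda, but p not small
```

With n large and p small the largest pointwise gap is well below 0.001, whereas for n = 10 and p = 0.2 it is more than an order of magnitude larger.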
2.3. The Weak Law of Large Numbers
Let $X_1, X_2, \ldots$ be a sequence of random variables with a common mean $E[X_i] = m$ for all i. In practice, since we do not know m, we use the numerical average, or sample mean,
\[ M_n := \frac{1}{n} \sum_{i=1}^{n} X_i, \]
in place of the true, but unknown value, m. Can this procedure of using $M_n$ as an estimate of m be justified in some sense?
Example 2.24. You are given a coin which may or may not be fair, and you want to determine the probability of heads, p. If you toss the coin n times and use the fraction of times that heads appears as an estimate of p, how does this fit into the above framework?
Solution. Let $X_i = 1$ if the ith toss results in heads, and let $X_i = 0$ otherwise. Then $\wp(X_i = 1) = p$ and $m := E[X_i] = p$ as well. Note that $X_1 + \cdots + X_n$ is the number of heads, and $M_n$ is the fraction of heads. Are we justified in using $M_n$ as an estimate of p?
‡ This approximation is justified rigorously in Problems 15 and 16(a) in Chapter 11. It is also derived directly without probability generating functions in Problem 17 in Chapter 11.
One way to answer these questions is with a weak law of large numbers (WLLN). A weak law of large numbers gives conditions under which
\[ \lim_{n \to \infty} \wp(|M_n - m| \ge \varepsilon) = 0 \]
for every $\varepsilon > 0$. This is a complicated formula. However, it can be interpreted as follows. Suppose that based on physical considerations, m is between 30 and 70. Let us agree that if $M_n$ is within $\varepsilon = 1/2$ of m, we are "close enough" to the unknown value m. For example, if $M_n = 45.7$, and if we know that $M_n$ is within 1/2 of m, then m is between 45.2 and 46.2. Knowing this would be an improvement over the starting point $30 \le m \le 70$. So, if $|M_n - m| < \varepsilon$, we are "close enough," while if $|M_n - m| \ge \varepsilon$ we are not "close enough." A weak law says that by making n large (averaging lots of measurements), the probability of not being close enough can be made as small as we like; equivalently, the probability of being close enough can be made as close to one as we like. For example, if $\wp(|M_n - m| \ge \varepsilon) \le 0.1$, then
\[ \wp(|M_n - m| < \varepsilon) = 1 - \wp(|M_n - m| \ge \varepsilon) \ge 0.9, \]
and we would be 90% sure that $M_n$ is "close enough" to the true, but unknown, value of m.
Before giving conditions under which the weak law holds, we need to introduce the notion of uncorrelated random variables. Also, in order to prove the weak law, we need to first derive Markov's inequality and Chebyshev's inequality.
Uncorrelated Random Variables
A pair of random variables, say X and Y, is said to be uncorrelated if
\[ E[XY] = E[X]\, E[Y]. \]
If X and Y are independent, then they are uncorrelated. Intuitively, the property of being uncorrelated is weaker than the property of independence. Recall that for independent random variables,
\[ E[h(X)\, k(Y)] = E[h(X)]\, E[k(Y)] \]
for all functions h(x) and k(y), while for uncorrelated random variables, we only require that this hold for h(x) = x and k(y) = y. For an example of uncorrelated random variables that are not independent, see Problem 41.
Example 2.25. If $m_X := E[X]$ and $m_Y := E[Y]$, show that X and Y are uncorrelated if and only if
\[ E[(X - m_X)(Y - m_Y)] = 0. \]
Solution. The result is obvious if we expand
\[
\begin{aligned}
E[(X - m_X)(Y - m_Y)] &= E[XY - m_X Y - X m_Y + m_X m_Y] \\
&= E[XY] - m_X E[Y] - E[X] m_Y + m_X m_Y \\
&= E[XY] - m_X m_Y - m_X m_Y + m_X m_Y \\
&= E[XY] - m_X m_Y.
\end{aligned}
\]
Example 2.26. Let $X_1, X_2, \ldots$ be a sequence of uncorrelated random variables; more precisely, for $i \ne j$, $X_i$ and $X_j$ are uncorrelated. Show that
\[ \operatorname{var}\Bigl( \sum_{i=1}^{n} X_i \Bigr) = \sum_{i=1}^{n} \operatorname{var}(X_i). \]
In other words, for uncorrelated random variables, the variance of the sum is the sum of the variances.
Solution. Let $m_i := E[X_i]$ and $m_j := E[X_j]$. Then uncorrelated means that
\[ E[(X_i - m_i)(X_j - m_j)] = 0 \quad \text{for all } i \ne j. \]
Put
\[ X := \sum_{i=1}^{n} X_i. \]
Then
\[ E[X] = E\Bigl[ \sum_{i=1}^{n} X_i \Bigr] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} m_i, \]
and
\[ X - E[X] = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} m_i = \sum_{i=1}^{n} (X_i - m_i). \]
Now write
\[
\begin{aligned}
\operatorname{var}(X) &= E[(X - E[X])^2] \\
&= E\Bigl[ \Bigl( \sum_{i=1}^{n} (X_i - m_i) \Bigr) \Bigl( \sum_{j=1}^{n} (X_j - m_j) \Bigr) \Bigr] \\
&= \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{n} E[(X_i - m_i)(X_j - m_j)] \Bigr).
\end{aligned}
\]
For fixed i, consider the sum over j. There is only one term in this sum that need not be zero, and that is the term j = i. All the other terms are zero because $X_i$ and $X_j$ are uncorrelated for $j \ne i$. Hence,
\[
\begin{aligned}
\operatorname{var}(X) &= \sum_{i=1}^{n} E[(X_i - m_i)(X_i - m_i)] \\
&= \sum_{i=1}^{n} E[(X_i - m_i)^2] \\
&= \sum_{i=1}^{n} \operatorname{var}(X_i).
\end{aligned}
\]
Markov's Inequality
If X is a nonnegative random variable, we show that for any a > 0,
\[ \wp(X \ge a) \le \frac{E[X]}{a}. \]
This relation is known as Markov's inequality. Since we always have $\wp(X \ge a) \le 1$, Markov's inequality is useful only when the right-hand side is less than one.
Consider the random variable $a I_{[a,\infty)}(X)$. This random variable takes the value zero if X < a, and it takes the value a if $X \ge a$. Hence,
\[ E[a I_{[a,\infty)}(X)] = 0 \cdot \wp(X < a) + a \cdot \wp(X \ge a) = a\, \wp(X \ge a). \]
Now observe that
\[ a I_{[a,\infty)}(X) \le X. \]
To see this, note that the left-hand side is either zero or a. If it is zero, there is no problem because X is a nonnegative random variable. If the left-hand side is a, then we must have $I_{[a,\infty)}(X) = 1$; but this means that $X \ge a$. Next take expectations of both sides to obtain
\[ a\, \wp(X \ge a) \le E[X]. \]
Dividing both sides by a yields Markov's inequality.
Chebyshev's Inequality
For any random variable Y and any a > 0, we show that
\[ \wp(|Y| \ge a) \le \frac{E[Y^2]}{a^2}. \]
This result, known as Chebyshev's inequality, is an easy consequence of Markov's inequality. As in the case of Markov's inequality, it is useful only when the right-hand side is less than one.
To prove Chebyshev's inequality, just note that
\[ \{|Y| \ge a\} = \{|Y|^2 \ge a^2\}. \]
Since the above two events are equal, they have the same probability. Hence,
\[ \wp(|Y| \ge a) = \wp(|Y|^2 \ge a^2) \le \frac{E[|Y|^2]}{a^2}, \]
where the last step follows by Markov's inequality.
The following special cases of Chebyshev's inequality are sometimes of interest. If m := E[X] is finite, then taking Y = |X − m| yields
\[ \wp(|X - m| \ge a) \le \frac{\operatorname{var}(X)}{a^2}. \]
If $\sigma^2 := \operatorname{var}(X)$ is also finite, taking $a = k\sigma$ yields
\[ \wp(|X - m| \ge k\sigma) \le \frac{1}{k^2}. \]
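Both inequalities can be compared against exact tail probabilities for a concrete pmf. The sketch below uses a binomial(20, 1/2) random variable purely as an illustrative assumption; the thresholds a = 15 and deviation 5 are likewise arbitrary choices.

```python
import math

def binomial_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

n, p = 20, 0.5
pmf = binomial_pmf(n, p)
mean = sum(k * q for k, q in enumerate(pmf))                    # np = 10
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))  # np(1-p) = 5

a = 15
tail = sum(pmf[a:])       # exact P(X >= a)
markov_bound = mean / a   # Markov: P(X >= a) <= E[X]/a

dev = 5
two_sided = sum(q for k, q in enumerate(pmf) if abs(k - mean) >= dev)  # P(|X-m| >= dev)
chebyshev_bound = variance / dev**2  # Chebyshev: P(|X-m| >= dev) <= var(X)/dev^2
```

Here the exact tail is about 0.021 against a Markov bound of 2/3, and the two-sided probability is about 0.041 against a Chebyshev bound of 0.2, which illustrates that both bounds hold but can be loose.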
Conditions for the Weak Law
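We now give sufficient conditions for a version of the weak law of large numbers (WLLN). Suppose that the $X_i$ all have the same mean m and the same variance $\sigma^2$. Assume also that the $X_i$ are uncorrelated random variables. Then for every $\varepsilon > 0$,
\[ \lim_{n \to \infty} \wp(|M_n - m| \ge \varepsilon) = 0. \]
This is an immediate consequence of the following two facts. First, by Chebyshev's inequality,
\[ \wp(|M_n - m| \ge \varepsilon) \le \frac{\operatorname{var}(M_n)}{\varepsilon^2}. \]
Second,
\[ \operatorname{var}(M_n) = \operatorname{var}\Bigl( \frac{1}{n} \sum_{i=1}^{n} X_i \Bigr) = \Bigl( \frac{1}{n} \Bigr)^2 \sum_{i=1}^{n} \operatorname{var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}. \tag{2.6} \]
Thus,
\[ \wp(|M_n - m| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}, \tag{2.7} \]
which goes to zero as $n \to \infty$. Note that the bound $\sigma^2/(n\varepsilon^2)$ can be used to select a suitable value of n.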
Example 2.27. Given $\varepsilon$ and $\sigma^2$, determine how large n should be so that the probability that $M_n$ is within $\varepsilon$ of m is at least 0.9.
Solution. We want to have
\[ \wp(|M_n - m| < \varepsilon) \ge 0.9. \]
Rewrite this as
\[ 1 - \wp(|M_n - m| \ge \varepsilon) \ge 0.9, \]
or
\[ \wp(|M_n - m| \ge \varepsilon) \le 0.1. \]
By (2.7), it suffices to take
\[ \frac{\sigma^2}{n\varepsilon^2} \le 0.1, \]
or $n \ge 10\sigma^2/\varepsilon^2$.
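The rule $n \ge 10\sigma^2/\varepsilon^2$ generalizes to any allowed failure probability. A minimal sketch follows; the function name `chebyshev_sample_size` and the parameter `delta` (the allowed value of $\wp(|M_n - m| \ge \varepsilon)$, 0.1 in the example) are assumptions introduced here, and exact fractions are used so that the ceiling is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction

def chebyshev_sample_size(variance, eps, delta):
    """Smallest n with variance / (n * eps^2) <= delta, based on bound (2.7).

    Arguments may be ints, Fractions, or decimal strings such as "0.05".
    """
    variance, eps, delta = (Fraction(v) for v in (variance, eps, delta))
    return math.ceil(variance / (delta * eps**2))

n_coin = chebyshev_sample_size("0.25", "0.05", "0.1")  # variance 1/4, eps 0.05
n_unit = chebyshev_sample_size(1, "0.5", "0.1")        # variance 1, eps 1/2
```

For example, with $\sigma^2 = 1/4$ (the largest possible variance of a Bernoulli random variable) and $\varepsilon = 0.05$, the bound asks for n = 1000. Since Chebyshev's inequality is conservative, the sample size it suggests is sufficient but typically larger than necessary.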
2.4. Conditional Probability
For conditional probabilities involving random variables, we use the notation
\[
\begin{aligned}
\wp(X \in B \mid Y \in C) &:= \wp(\{X \in B\} \mid \{Y \in C\}) \\
&= \frac{\wp(\{X \in B\} \cap \{Y \in C\})}{\wp(\{Y \in C\})} \\
&= \frac{\wp(X \in B, Y \in C)}{\wp(Y \in C)}.
\end{aligned}
\]
For discrete random variables, we define the conditional probability mass functions,
\[
\begin{aligned}
p_{X|Y}(x_i \mid y_j) &:= \wp(X = x_i \mid Y = y_j) \\
&= \frac{\wp(X = x_i, Y = y_j)}{\wp(Y = y_j)} \\
&= \frac{p_{XY}(x_i, y_j)}{p_Y(y_j)},
\end{aligned}
\]
and
\[
\begin{aligned}
p_{Y|X}(y_j \mid x_i) &:= \wp(Y = y_j \mid X = x_i) \\
&= \frac{\wp(X = x_i, Y = y_j)}{\wp(X = x_i)} \\
&= \frac{p_{XY}(x_i, y_j)}{p_X(x_i)}.
\end{aligned}
\]
We call $p_{X|Y}$ the conditional probability mass function of X given Y. Similarly, $p_{Y|X}$ is called the conditional pmf of Y given X.
Conditional pmfs are important because we can use them to compute conditional probabilities just as we use marginal pmfs to compute ordinary probabilities. For example,
\[ \wp(Y \in C \mid X = x_k) = \sum_j I_C(y_j)\, p_{Y|X}(y_j \mid x_k). \]
This formula is derived by taking $B = \{x_k\}$ in (2.3), and then dividing the result by $\wp(X = x_k) = p_X(x_k)$.
Example 2.28. To transmit message i using an optical communication system, light of intensity $\lambda_i$ is directed at a photodetector. When light of intensity $\lambda_i$ strikes the photodetector, the number of photoelectrons generated is a Poisson($\lambda_i$) random variable. Find the conditional probability that the number of photoelectrons observed at the photodetector is less than 2 given that message i was sent.
Solution. Let X denote the message to be sent, and let Y denote the number of photoelectrons generated by the photodetector. The problem statement is telling us that
\[ \wp(Y = n \mid X = i) = \frac{\lambda_i^n e^{-\lambda_i}}{n!}, \quad n = 0, 1, 2, \ldots. \]
The conditional probability to be calculated is
\[
\begin{aligned}
\wp(Y < 2 \mid X = i) &= \wp(Y = 0 \text{ or } Y = 1 \mid X = i) \\
&= \wp(Y = 0 \mid X = i) + \wp(Y = 1 \mid X = i) \\
&= e^{-\lambda_i} + \lambda_i e^{-\lambda_i}.
\end{aligned}
\]
Example 2.29. For the random variables X and Y used in the solution of the previous example, write down their joint pmf if $X \sim$ geometric$_0$(p).
Solution. The joint pmf is
\[ p_{XY}(i, n) = p_X(i)\, p_{Y|X}(n \mid i) = (1 - p) p^i\, \frac{\lambda_i^n e^{-\lambda_i}}{n!}, \]
for $i, n \ge 0$, and $p_{XY}(i, n) = 0$ otherwise.
The Law of Total Probability
Let A be any event, and let X be any discrete random variable taking distinct values $x_i$. Then the events
\[ B_i := \{X = x_i\} = \{\omega \in \Omega : X(\omega) = x_i\} \]
are pairwise disjoint, and $\sum_i \wp(B_i) = \sum_i \wp(X = x_i) = 1$. The law of total probability as in (1.13) yields
\[ \wp(A) = \sum_i \wp(A \cap B_i) = \sum_i \wp(A \mid X = x_i)\, \wp(X = x_i). \tag{2.8} \]
Hence, we call (2.8) the law of total probability as well.
Now suppose that Y is an arbitrary random variable. Taking $A = \{Y \in C\}$, where $C \subset \mathbb{R}$, yields
\[ \wp(Y \in C) = \sum_i \wp(Y \in C \mid X = x_i)\, \wp(X = x_i). \]
If Y is a discrete random variable taking distinct values $y_j$, then setting $C = \{y_j\}$ yields
\[
\begin{aligned}
\wp(Y = y_j) &= \sum_i \wp(Y = y_j \mid X = x_i)\, \wp(X = x_i) \\
&= \sum_i p_{Y|X}(y_j \mid x_i)\, p_X(x_i).
\end{aligned}
\]
Example 2.30. Radioactive samples give off alpha-particles at a rate based on the size of the sample. For a sample of size k, suppose that the number of particles observed is a Poisson random variable Y with parameter k. If the sample size is a geometric$_1$(p) random variable X, find $\wp(Y = 0)$ and $\wp(X = 1 \mid Y = 0)$.
Solution. The first step is to realize that the problem statement is telling us that as a function of n, $\wp(Y = n \mid X = k)$ is the Poisson pmf with parameter k. In other words,
\[ \wp(Y = n \mid X = k) = \frac{k^n e^{-k}}{n!}, \quad n = 0, 1, \ldots. \]
In particular, note that $\wp(Y = 0 \mid X = k) = e^{-k}$. Now use the law of total probability to write
\[
\begin{aligned}
\wp(Y = 0) &= \sum_{k=1}^{\infty} \wp(Y = 0 \mid X = k)\, \wp(X = k) \\
&= \sum_{k=1}^{\infty} e^{-k} (1 - p) p^{k-1} \\
&= \frac{1 - p}{p} \sum_{k=1}^{\infty} (p/e)^k \\
&= \frac{1 - p}{p} \cdot \frac{p/e}{1 - p/e} = \frac{1 - p}{e - p}.
\end{aligned}
\]
Next,
\[
\begin{aligned}
\wp(X = 1 \mid Y = 0) &= \frac{\wp(X = 1, Y = 0)}{\wp(Y = 0)} \\
&= \frac{\wp(Y = 0 \mid X = 1)\, \wp(X = 1)}{\wp(Y = 0)} \\
&= e^{-1} (1 - p) \cdot \frac{e - p}{1 - p} \\
&= 1 - p/e.
\end{aligned}
\]
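The closed forms in Example 2.30 can be sanity-checked by truncating the law-of-total-probability sum. This is an illustrative sketch only; the function names and the cutoff of 200 terms are assumptions.

```python
import math

def p_y0(p, terms=200):
    """Truncated sum over k = 1..terms of P(Y=0 | X=k) * P(X=k)."""
    return sum(math.exp(-k) * (1 - p) * p ** (k - 1) for k in range(1, terms + 1))

def p_x1_given_y0(p, terms=200):
    """P(X=1 | Y=0) = P(Y=0 | X=1) * P(X=1) / P(Y=0)."""
    return math.exp(-1) * (1 - p) / p_y0(p, terms)

p = 0.6
numeric_y0 = p_y0(p)
closed_y0 = (1 - p) / (math.e - p)
numeric_post = p_x1_given_y0(p)
closed_post = 1 - p / math.e
```

For p = 0.6 the truncated sum and the closed form $(1 - p)/(e - p)$ agree to within floating-point error, as do the two expressions for $\wp(X = 1 \mid Y = 0)$.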
Example 2.31. A certain electric eye employs a photodetector whose efficiency occasionally drops in half. When operating properly, the detector outputs photoelectrons according to a Poisson($\lambda$) pmf. When the detector malfunctions, it outputs photoelectrons according to a Poisson($\lambda/2$) pmf. Let p < 1 denote the probability that the detector is operating properly. Find the pmf of the observed number of photoelectrons. Also find the conditional probability that the circuit is malfunctioning given that n output photoelectrons are observed.
Solution. Let Y denote the detector output, and let X = 1 indicate that the detector is operating properly. Let X = 0 indicate that it is malfunctioning. Then the problem statement is telling us that $\wp(X = 1) = p$ and
\[ \wp(Y = n \mid X = 1) = \frac{\lambda^n e^{-\lambda}}{n!} \quad \text{and} \quad \wp(Y = n \mid X = 0) = \frac{(\lambda/2)^n e^{-\lambda/2}}{n!}. \]
Now, using the law of total probability,
\[
\begin{aligned}
\wp(Y = n) &= \wp(Y = n \mid X = 1)\, \wp(X = 1) + \wp(Y = n \mid X = 0)\, \wp(X = 0) \\
&= \frac{\lambda^n e^{-\lambda}}{n!}\, p + \frac{(\lambda/2)^n e^{-\lambda/2}}{n!}\, (1 - p).
\end{aligned}
\]
This is the pmf of the observed number of photoelectrons.
The above formulas can be used to find $\wp(X = 0 \mid Y = n)$. Write
\[
\begin{aligned}
\wp(X = 0 \mid Y = n) &= \frac{\wp(X = 0, Y = n)}{\wp(Y = n)} \\
&= \frac{\wp(Y = n \mid X = 0)\, \wp(X = 0)}{\wp(Y = n)} \\
&= \frac{\dfrac{(\lambda/2)^n e^{-\lambda/2}}{n!} (1 - p)}{\dfrac{\lambda^n e^{-\lambda}}{n!}\, p + \dfrac{(\lambda/2)^n e^{-\lambda/2}}{n!} (1 - p)} \\
&= \frac{1}{\dfrac{2^n e^{-\lambda/2}\, p}{1 - p} + 1},
\end{aligned}
\]
which is clearly a number between zero and one as a probability should be. Notice that as we observe a greater output Y = n, the conditional probability that the detector is malfunctioning decreases.
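The posterior probability in Example 2.31 can be computed either directly from Bayes' rule or from the simplified closed form. The sketch below compares the two; the function names and the parameter values ($\lambda = 4$, p = 0.9) are illustrative assumptions.

```python
import math

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam**n / math.factorial(n)

def prob_malfunction(n, lam, p):
    """P(X = 0 | Y = n) by Bayes' rule from the two conditional pmfs."""
    good = poisson_pmf(n, lam) * p           # P(Y=n | X=1) P(X=1)
    bad = poisson_pmf(n, lam / 2) * (1 - p)  # P(Y=n | X=0) P(X=0)
    return bad / (good + bad)

def prob_malfunction_closed_form(n, lam, p):
    """Simplified expression 1 / (2^n e^{-lam/2} p/(1-p) + 1)."""
    return 1 / (2**n * math.exp(-lam / 2) * p / (1 - p) + 1)

lam, p = 4.0, 0.9
posteriors = [prob_malfunction(n, lam, p) for n in range(10)]
```

The list `posteriors` is strictly decreasing in n, matching the remark that larger observed outputs make a malfunction less likely.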
The Substitution Law
It is often the case that Z is a function of X and some other discrete random variable Y, say Z = X + Y, and we are interested in $\wp(Z = z)$. In this case, the law of total probability becomes
\[
\begin{aligned}
\wp(Z = z) &= \sum_i \wp(Z = z \mid X = x_i)\, \wp(X = x_i) \\
&= \sum_i \wp(X + Y = z \mid X = x_i)\, \wp(X = x_i).
\end{aligned}
\]
We now claim that
\[ \wp(X + Y = z \mid X = x_i) = \wp(x_i + Y = z \mid X = x_i). \]
This property is known as the substitution law of conditional probability. To derive it, we need the observation
\[ \{X + Y = z\} \cap \{X = x_i\} = \{x_i + Y = z\} \cap \{X = x_i\}. \]
From this we see that
\[
\begin{aligned}
\wp(X + Y = z \mid X = x_i) &= \frac{\wp(\{X + Y = z\} \cap \{X = x_i\})}{\wp(\{X = x_i\})} \\
&= \frac{\wp(\{x_i + Y = z\} \cap \{X = x_i\})}{\wp(\{X = x_i\})} \\
&= \wp(x_i + Y = z \mid X = x_i).
\end{aligned}
\]
Writing
\[ \wp(x_i + Y = z \mid X = x_i) = \wp(Y = z - x_i \mid X = x_i), \]
we can make further simplifications if X and Y are independent. In this case,
\[
\begin{aligned}
\wp(Y = z - x_i \mid X = x_i) &= \frac{\wp(Y = z - x_i, X = x_i)}{\wp(X = x_i)} \\
&= \frac{\wp(Y = z - x_i)\, \wp(X = x_i)}{\wp(X = x_i)} \\
&= \wp(Y = z - x_i).
\end{aligned}
\]
Thus, when X and Y are independent, we can write
\[ \wp(Y = z - x_i \mid X = x_i) = \wp(Y = z - x_i), \]
and we say that we "drop the conditioning."
Example 2.32 (Signal in Additive Noise). A random, integer-valued signal X is transmitted over a channel subject to independent, additive, integer-valued noise Y. The received signal is Z = X + Y. To estimate X based on the received value Z, the system designer wants to use the conditional pmf $p_{X|Z}$. Our problem is to find this conditional pmf.
Solution. Let X and Y be independent, discrete, integer-valued random variables with pmfs $p_X$ and $p_Y$, respectively. Put Z := X + Y. We begin by writing out the formula for the desired pmf
\[
\begin{aligned}
p_{X|Z}(i \mid j) &= \wp(X = i \mid Z = j) \\
&= \frac{\wp(X = i, Z = j)}{\wp(Z = j)} \\
&= \frac{\wp(Z = j \mid X = i)\, \wp(X = i)}{\wp(Z = j)} \\
&= \frac{\wp(Z = j \mid X = i)\, p_X(i)}{\wp(Z = j)}.
\end{aligned}
\tag{2.9}
\]
To continue the analysis, we use the substitution law followed by independence to write
\[
\begin{aligned}
\wp(Z = j \mid X = i) &= \wp(X + Y = j \mid X = i) \\
&= \wp(i + Y = j \mid X = i) \\
&= \wp(Y = j - i \mid X = i) \\
&= \wp(Y = j - i) \\
&= p_Y(j - i).
\end{aligned}
\tag{2.10}
\]
This result can also be combined with the law of total probability to compute the denominator in (2.9). Just write
\[ p_Z(j) = \sum_i \wp(Z = j \mid X = i)\, \wp(X = i) = \sum_i p_Y(j - i)\, p_X(i). \tag{2.11} \]
In other words, if X and Y are independent, discrete, integer-valued random variables, the pmf of Z = X + Y is the discrete convolution of $p_X$ and $p_Y$. It now follows that
\[ p_{X|Z}(i \mid j) = \frac{p_Y(j - i)\, p_X(i)}{\sum_k p_Y(j - k)\, p_X(k)}, \]
where in the denominator we have changed the dummy index of summation to k to avoid confusion with the i in the numerator.
The Poisson($\lambda$) random variable is a good model for the number of photoelectrons generated in a photodetector when the incident light intensity is $\lambda$. Now suppose that an additional light source of intensity $\mu$ is also directed at the photodetector. Then we expect that the number of photoelectrons generated should be related to the total light intensity $\lambda + \mu$. The next example illustrates the corresponding probabilistic model.
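Example 2.33. If X and Y are independent Poisson random variables with respective parameters $\lambda$ and $\mu$, use the results of the preceding example to show that Z := X + Y is Poisson($\lambda + \mu$). Also show that as a function of i, $p_{X|Z}(i \mid j)$ is a binomial($j, \lambda/(\lambda + \mu)$) pmf.
Solution. To find $p_Z(j)$, we apply (2.11) as follows. Since $p_X(i) = 0$ for i < 0 and since $p_Y(j - i) = 0$ for j < i, (2.11) becomes
\[
\begin{aligned}
p_Z(j) &= \sum_{i=0}^{j} \frac{\lambda^i e^{-\lambda}}{i!} \cdot \frac{\mu^{j-i} e^{-\mu}}{(j - i)!} \\
&= \frac{e^{-(\lambda+\mu)}}{j!} \sum_{i=0}^{j} \frac{j!}{i!(j - i)!} \lambda^i \mu^{j-i} \\
&= \frac{e^{-(\lambda+\mu)}}{j!} \sum_{i=0}^{j} \binom{j}{i} \lambda^i \mu^{j-i} \\
&= \frac{e^{-(\lambda+\mu)}}{j!} (\lambda + \mu)^j, \quad j = 0, 1, \ldots,
\end{aligned}
\]
where the last step follows by the binomial theorem.
Our second task is to compute
\[
\begin{aligned}
p_{X|Z}(i \mid j) &= \frac{\wp(Z = j \mid X = i)\, \wp(X = i)}{\wp(Z = j)} \\
&= \frac{\wp(Z = j \mid X = i)\, p_X(i)}{p_Z(j)}.
\end{aligned}
\]
Since we have already found $p_Z(j)$, all we need is $\wp(Z = j \mid X = i)$, which, using (2.10), is simply $p_Y(j - i)$. Thus,
\[
\begin{aligned}
p_{X|Z}(i \mid j) &= \frac{\mu^{j-i} e^{-\mu}}{(j - i)!} \cdot \frac{\lambda^i e^{-\lambda}}{i!} \Bigg/ \Bigl[ \frac{e^{-(\lambda+\mu)}}{j!} (\lambda + \mu)^j \Bigr] \\
&= \frac{\lambda^i \mu^{j-i}}{(\lambda + \mu)^j} \binom{j}{i} \\
&= \binom{j}{i} \Bigl[ \frac{\lambda}{\lambda + \mu} \Bigr]^i \Bigl[ \frac{\mu}{\lambda + \mu} \Bigr]^{j-i},
\end{aligned}
\]
for $i = 0, \ldots, j$.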
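Both claims of Example 2.33 can be verified numerically from the convolution formula (2.11). The following is a minimal sketch; the function names and the parameter values $\lambda = 1.5$, $\mu = 2.5$ are illustrative assumptions.

```python
import math

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam**n / math.factorial(n)

def pmf_of_sum(j, lam, mu):
    """p_Z(j) from the convolution (2.11); only i = 0..j contribute."""
    return sum(poisson_pmf(i, lam) * poisson_pmf(j - i, mu) for i in range(j + 1))

def conditional_pmf(i, j, lam, mu):
    """p_{X|Z}(i|j) = p_Y(j-i) p_X(i) / p_Z(j)."""
    return poisson_pmf(j - i, mu) * poisson_pmf(i, lam) / pmf_of_sum(j, lam, mu)

lam, mu = 1.5, 2.5
q = lam / (lam + mu)  # binomial success probability lambda/(lambda+mu)
j = 6
cond = [conditional_pmf(i, j, lam, mu) for i in range(j + 1)]
binom = [math.comb(j, i) * q**i * (1 - q) ** (j - i) for i in range(j + 1)]
```

The convolution reproduces the Poisson($\lambda + \mu$) pmf, and the list `cond` matches the binomial($j, \lambda/(\lambda+\mu)$) pmf in `binom` up to rounding.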
2.5. Conditional Expectation
Just as we developed expectation for discrete random variables in Section 2.2, including the law of the unconscious statistician, we can develop conditional expectation in the same way. This leads to the formula
\[ E[g(Y) \mid X = x_i] = \sum_j g(y_j)\, p_{Y|X}(y_j \mid x_i). \tag{2.12} \]
Example 2.34. The random number Y of alpha-particles given off by a radioactive sample is Poisson(k) given that the sample size X = k. Find $E[Y \mid X = k]$.
Solution. We must compute
\[ E[Y \mid X = k] = \sum_n n\, \wp(Y = n \mid X = k), \]
where (cf. Example 2.30)
\[ \wp(Y = n \mid X = k) = \frac{k^n e^{-k}}{n!}, \quad n = 0, 1, \ldots. \]
Hence,
\[ E[Y \mid X = k] = \sum_{n=0}^{\infty} n\, \frac{k^n e^{-k}}{n!}. \]
Now observe that the right-hand side is exactly the ordinary expectation of a Poisson random variable with parameter k (cf. the calculation in Example 2.13). Therefore, $E[Y \mid X = k] = k$.
Example 2.35. Suppose that given Y = n, $X \sim$ binomial(n, p). Find $E[X \mid Y = n]$.
Solution. We must compute
\[ E[X \mid Y = n] = \sum_{k=0}^{n} k\, \wp(X = k \mid Y = n), \]
where
\[ \wp(X = k \mid Y = n) = \binom{n}{k} p^k (1 - p)^{n-k}. \]
Hence,
\[ E[X \mid Y = n] = \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k}. \]
Now observe that the right-hand side is exactly the ordinary expectation of a binomial(n, p) random variable. It is shown in Problem 20 that the mean of such a random variable is np. Therefore, $E[X \mid Y = n] = np$.
Substitution Law for Conditional Expectation
For functions of two variables, we have the following conditional law of the unconscious statistician,
\[ E[g(X, Y) \mid X = x_i] = \sum_k \sum_j g(x_k, y_j)\, p_{XY|X}(x_k, y_j \mid x_i). \]
However,
\[
\begin{aligned}
p_{XY|X}(x_k, y_j \mid x_i) &= \wp(X = x_k, Y = y_j \mid X = x_i) \\
&= \frac{\wp(X = x_k, Y = y_j, X = x_i)}{\wp(X = x_i)}.
\end{aligned}
\]
Now, when $k \ne i$, the intersection $\{X = x_k\} \cap \{Y = y_j\} \cap \{X = x_i\}$ is empty, and has zero probability. Hence, the numerator above is zero for $k \ne i$. When k = i, the above intersections reduce to $\{X = x_i\} \cap \{Y = y_j\}$, and so
\[ p_{XY|X}(x_k, y_j \mid x_i) = p_{Y|X}(y_j \mid x_i), \quad \text{for } k = i. \]
It now follows that
\[
\begin{aligned}
E[g(X, Y) \mid X = x_i] &= \sum_j g(x_i, y_j)\, p_{Y|X}(y_j \mid x_i) \\
&= E[g(x_i, Y) \mid X = x_i],
\end{aligned}
\tag{2.13}
\]
which is the substitution law for conditional expectation. Note that if g in (2.13) is a function of Y only, then (2.13) reduces to (2.12). Also, if g is of product form, say g(x, y) = h(x)k(y), then
\[ E[h(X)\, k(Y) \mid X = x_i] = h(x_i)\, E[k(Y) \mid X = x_i]. \]
Law of Total Probability for Expectation
In Section 2.4 we discussed the law of total probability, which shows how to compute probabilities in terms of conditional probabilities. We now derive the analogous formula for expectation. Write
\[
\begin{aligned}
\sum_i E[g(X, Y) \mid X = x_i]\, p_X(x_i) &= \sum_i \sum_j g(x_i, y_j)\, p_{Y|X}(y_j \mid x_i)\, p_X(x_i) \\
&= \sum_i \sum_j g(x_i, y_j)\, p_{XY}(x_i, y_j) \\
&= E[g(X, Y)].
\end{aligned}
\]
In particular, if g is a function of Y only, then
\[ E[g(Y)] = \sum_i E[g(Y) \mid X = x_i]\, p_X(x_i). \]
Example 2.36. Light of intensity $\lambda$ is directed at a photomultiplier that generates $X \sim$ Poisson($\lambda$) primaries. The photomultiplier then generates Y secondaries, where given X = n, Y is conditionally geometric$_1\bigl((n + 2)^{-1}\bigr)$. Find the expected number of secondaries and the correlation between the primaries and the secondaries.
Solution. The law of total probability for expectations says that
\[ E[Y] = \sum_{n=0}^{\infty} E[Y \mid X = n]\, p_X(n), \]
where the range of summation follows because X is Poisson($\lambda$). The next step is to compute the conditional expectation. The conditional pmf of Y is geometric$_1$(p), where $p = (n + 2)^{-1}$, and the mean of such a pmf is, by Problem 19, 1/(1 − p). Hence,
\[ E[Y] = \sum_{n=0}^{\infty} \Bigl[ 1 + \frac{1}{n + 1} \Bigr] p_X(n) = E\Bigl[ 1 + \frac{1}{X + 1} \Bigr]. \]
An easy calculation (Problem 14) shows that for $X \sim$ Poisson($\lambda$),
\[ E\Bigl[ \frac{1}{X + 1} \Bigr] = [1 - e^{-\lambda}]/\lambda, \]
and so $E[Y] = 1 + [1 - e^{-\lambda}]/\lambda$.
The correlation between X and Y is
\[
\begin{aligned}
E[XY] &= \sum_{n=0}^{\infty} E[XY \mid X = n]\, p_X(n) \\
&= \sum_{n=0}^{\infty} n\, E[Y \mid X = n]\, p_X(n) \\
&= \sum_{n=0}^{\infty} n \Bigl[ 1 + \frac{1}{n + 1} \Bigr] p_X(n) \\
&= E\Bigl[ X \Bigl( 1 + \frac{1}{X + 1} \Bigr) \Bigr].
\end{aligned}
\]
Now observe that
\[ X \Bigl( 1 + \frac{1}{X + 1} \Bigr) = X + 1 - \frac{1}{X + 1}. \]
It follows that
\[ E[XY] = \lambda + 1 - [1 - e^{-\lambda}]/\lambda. \]
72
Chap. 2 Discrete Random Variables
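The two closed forms in Example 2.36 can be checked numerically by summing the series over $n$ directly. This is only a sketch: the function names and the truncation point `terms` are illustrative choices, not part of the text.

```python
import math

def poisson_pmf(n, lam):
    """P(X = n) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**n / math.factorial(n)

def example_2_36_series(lam, terms=100):
    """Return (E[Y], E[XY]) by truncating the sums over n after `terms` terms."""
    ey = 0.0
    exy = 0.0
    for n in range(terms):
        cond_mean = 1.0 + 1.0 / (n + 1)   # E[Y | X = n] = 1/(1 - p) with p = 1/(n + 2)
        w = poisson_pmf(n, lam)
        ey += cond_mean * w
        exy += n * cond_mean * w          # E[XY | X = n] = n E[Y | X = n]
    return ey, exy

def example_2_36_closed_form(lam):
    """Closed forms from the example: 1 + c and lam + 1 - c, with c = (1 - e^{-lam})/lam."""
    c = (1.0 - math.exp(-lam)) / lam
    return 1.0 + c, lam + 1.0 - c

series = example_2_36_series(2.0)
closed = example_2_36_closed_form(2.0)
print(series, closed)
```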
2.6. Notes
Notes §2.1: Probabilities Involving Random Variables

Note 1. According to Note 1 in Chapter 1, $P(A)$ is only defined for certain subsets $A\in\mathcal{A}$. Hence, in order that the probability $P(\{\omega\in\Omega : X(\omega)\in B\})$ be defined, it is necessary that
$$\{\omega\in\Omega : X(\omega)\in B\}\in\mathcal{A}. \tag{2.14}$$
To guarantee that this is true, it is convenient to consider $P(X\in B)$ only for sets $B$ in some $\sigma$-field $\mathcal{B}$ of subsets of $\mathbb{R}$. The technical definition of a random variable is then as follows. A function $X$ from $\Omega$ into $\mathbb{R}$ is a random variable if and only if (2.14) holds for every $B\in\mathcal{B}$. Usually $\mathcal{B}$ is taken to be the Borel $\sigma$-field; i.e., $\mathcal{B}$ is the smallest $\sigma$-field containing all the open subsets of $\mathbb{R}$. If $B\in\mathcal{B}$, then $B$ is called a Borel set. It can be shown [4, pp. 182–183] that a real-valued function $X$ satisfies (2.14) for all Borel sets $B$ if and only if
$$\{\omega\in\Omega : X(\omega)\le x\}\in\mathcal{A}, \quad\text{for all } x\in\mathbb{R}.$$
Note 2. In light of Note 1 above, we do not require that (2.1) hold for all sets $B$ and $C$, but only for all Borel sets $B$ and $C$.

Note 3. We show that
$$P(X\in B,\,Y\in C) = \sum_i\sum_j I_B(x_i)\,I_C(y_j)\,p_{XY}(x_i,y_j).$$
Consider the disjoint events $\{Y=y_j\}$. Since $\sum_j P(Y=y_j)=1$, we can use the law of total probability as in (1.13) with $A=\{X=x_i,\,Y\in C\}$ to write
$$P(X=x_i,\,Y\in C) = \sum_j P(X=x_i,\,Y\in C,\,Y=y_j).$$
Now observe that
$$\{Y\in C\}\cap\{Y=y_j\} = \begin{cases}\{Y=y_j\}, & y_j\in C,\\ \varnothing, & y_j\notin C.\end{cases}$$
Hence,
$$P(X=x_i,\,Y\in C) = \sum_j I_C(y_j)\,P(X=x_i,\,Y=y_j) = \sum_j I_C(y_j)\,p_{XY}(x_i,y_j).$$
The next step is to use (1.13) again, but this time with the disjoint events $\{X=x_i\}$ and $A=\{X\in B,\,Y\in C\}$. Then,
$$P(X\in B,\,Y\in C) = \sum_i P(X\in B,\,Y\in C,\,X=x_i).$$
Now observe that
$$\{X\in B\}\cap\{X=x_i\} = \begin{cases}\{X=x_i\}, & x_i\in B,\\ \varnothing, & x_i\notin B.\end{cases}$$
Hence,
$$P(X\in B,\,Y\in C) = \sum_i I_B(x_i)\,P(X=x_i,\,Y\in C) = \sum_i I_B(x_i)\sum_j I_C(y_j)\,p_{XY}(x_i,y_j).$$
Notes §2.2: Expectation

Note 4. When $z$ is complex,
$$E[z^X] := E[\operatorname{Re}(z^X)] + j\,E[\operatorname{Im}(z^X)].$$
By writing
$$z^n = r^n e^{jn\theta} = r^n[\cos(n\theta) + j\sin(n\theta)],$$
it is easy to check that
$$E[z^X] = \sum_{n=0}^\infty z^n\,P(X=n).$$
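Note 4 can be illustrated numerically for a Poisson random variable, whose generating function $e^{\lambda(z-1)}$ is listed in the table of discrete random variables. The sketch below computes $E[\operatorname{Re}(z^X)]$ and $E[\operatorname{Im}(z^X)]$ separately from the polar form and compares the result with the complex power series and the closed form. The particular values of $\lambda$ and $z$, and the truncation at 80 terms, are arbitrary illustrative choices.

```python
import cmath
import math

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam**n / math.factorial(n)

def pgf_via_real_and_imag(z, lam, terms=80):
    """E[Re(z^X)] + j E[Im(z^X)], using z^n = r^n [cos(n theta) + j sin(n theta)]."""
    r, theta = cmath.polar(z)
    re = sum(r**n * math.cos(n * theta) * poisson_pmf(n, lam) for n in range(terms))
    im = sum(r**n * math.sin(n * theta) * poisson_pmf(n, lam) for n in range(terms))
    return complex(re, im)

def pgf_via_series(z, lam, terms=80):
    """sum_n z^n P(X = n), evaluated with complex arithmetic."""
    return sum(z**n * poisson_pmf(n, lam) for n in range(terms))

lam = 1.5
z = cmath.rect(0.9, 0.7)            # z = 0.9 e^{j 0.7}
a = pgf_via_real_and_imag(z, lam)
b = pgf_via_series(z, lam)
c = cmath.exp(lam * (z - 1))        # closed form for the Poisson case
print(a, b, c)
```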
Notes §2.4: Conditional Probability

Note 5. Here is an alternative derivation of the fact that the sum of independent Bernoulli random variables is a binomial random variable. Let $X_1, X_2, \ldots$ be independent Bernoulli($p$) random variables. Put
$$Y_n := \sum_{i=1}^n X_i.$$
We need to show that $Y_n\sim\text{binomial}(n,p)$. The case $n=1$ is trivial. Suppose the result is true for some $n\ge 1$. We show that it must be true for $n+1$. Use the law of total probability to write
$$P(Y_{n+1}=k) = \sum_{i=0}^n P(Y_{n+1}=k\mid Y_n=i)\,P(Y_n=i). \tag{2.15}$$
To compute the conditional probability, we first observe that $Y_{n+1}=Y_n+X_{n+1}$. Also, since the $X_i$ are independent, and since $Y_n$ depends only on $X_1,\ldots,X_n$, we see that $Y_n$ and $X_{n+1}$ are independent. Keeping this in mind, we apply the substitution law and write
$$\begin{aligned}
P(Y_{n+1}=k\mid Y_n=i) &= P(Y_n+X_{n+1}=k\mid Y_n=i)\\
&= P(i+X_{n+1}=k\mid Y_n=i)\\
&= P(X_{n+1}=k-i\mid Y_n=i)\\
&= P(X_{n+1}=k-i).
\end{aligned}$$
Since $X_{n+1}$ takes only the values zero and one, this last probability is zero unless $i=k$ or $i=k-1$. Returning to (2.15), we can write§
$$P(Y_{n+1}=k) = \sum_{i=k-1}^{k} P(X_{n+1}=k-i)\,P(Y_n=i).$$
Assuming that $Y_n\sim\text{binomial}(n,p)$, this becomes
$$P(Y_{n+1}=k) = p\binom{n}{k-1}p^{k-1}(1-p)^{n-(k-1)} + (1-p)\binom{n}{k}p^k(1-p)^{n-k}.$$
Using the easily verified identity,
$$\binom{n}{k-1}+\binom{n}{k} = \binom{n+1}{k},$$
we see that $Y_{n+1}\sim\text{binomial}(n+1,p)$.
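The induction step of Note 5 translates directly into a recursion on pmfs. The following sketch applies that step repeatedly, starting from Bernoulli($p$), and compares the result with the binomial pmf. The value $p=0.3$, the target $n=10$, and the function names are illustrative assumptions.

```python
from math import comb

def step(pmf_n, p):
    """Given the pmf of Y_n (list indexed by i = 0..n), return the pmf of Y_{n+1} = Y_n + X_{n+1}."""
    n = len(pmf_n) - 1
    out = []
    for k in range(n + 2):
        prev = pmf_n[k - 1] if k >= 1 else 0.0   # P(Y_n = k - 1), zero when k = 0
        same = pmf_n[k] if k <= n else 0.0       # P(Y_n = k), zero when k = n + 1
        # P(X_{n+1} = 1) P(Y_n = k - 1) + P(X_{n+1} = 0) P(Y_n = k)
        out.append(p * prev + (1 - p) * same)
    return out

def binomial_pmf(n, p):
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

p = 0.3
pmf = [1 - p, p]              # Y_1 ~ Bernoulli(p)
for _ in range(9):
    pmf = step(pmf, p)        # after the loop, pmf is the pmf of Y_10
target = binomial_pmf(10, p)
print(max(abs(a - b) for a, b in zip(pmf, target)))
```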
§ (Footnote to Note 5.) When $k=0$ or $k=n+1$, this sum actually has only one term, since $P(Y_n=-1)=P(Y_n=n+1)=0$.

2.7. Problems

Problems §2.1: Probabilities Involving Random Variables

1. Let $Y$ be an integer-valued random variable. Show that
$$P(Y=n) = P(Y>n-1) - P(Y>n).$$

2. Let $X\sim\text{Poisson}(\lambda)$. Evaluate $P(X>1)$; your answer should be in terms of $\lambda$. Then compute the numerical value of $P(X>1)$ when $\lambda=1$. Answer: 0.264.

3. A certain photo-sensor fails to activate if it receives fewer than four photons in a certain time interval. If the number of photons is modeled by a Poisson(2) random variable $X$, find the probability that the sensor activates. Answer: 0.143.

4. Let $X\sim\text{geometric}_1(p)$.
(a) Show that $P(X>n)=p^n$.
(b) Compute $P(\{X>n+k\}\mid\{X>n\})$. Hint: If $A\subset B$, then $A\cap B=A$.
Remark. Your answer to (b) should not depend on $n$. For this reason, the geometric random variable is said to have the memoryless property. For example, let $X$ model the number of the toss on which the first heads occurs in a sequence of coin tosses. Then given that a heads has not occurred up to and including time $n$, the conditional probability that a heads does not occur in the next $k$ tosses does not depend on $n$. In other words, the fact that no heads occurs on tosses $1,\ldots,n$ has no effect on the conditional probability of heads occurring in the future. Future tosses do not remember the past.

*5. From your solution of Problem 4(b), you can see that if $X\sim\text{geometric}_1(p)$, then $P(\{X>n+k\}\mid\{X>n\})=P(X>k)$. Now prove the converse; i.e., show that if $Y$ is a positive integer-valued random variable such that $P(\{Y>n+k\}\mid\{Y>n\})=P(Y>k)$, then $Y\sim\text{geometric}_1(p)$, where $p=P(Y>1)$. Hint: First show that $P(Y>n)=P(Y>1)^n$; then apply Problem 1.

6. At the Chicago IRS office, there are $m$ independent auditors. The $k$th auditor processes $X_k$ tax returns per day, where $X_k$ is Poisson distributed with parameter $\lambda>0$. The office's performance is unsatisfactory if any auditor processes fewer than 2 tax returns per day. Find the probability that the office performance is unsatisfactory.

7. An astronomer has recently discovered $n$ similar galaxies. For $i=1,\ldots,n$, let $X_i$ denote the number of black holes in the $i$th galaxy, and assume the $X_i$ are independent Poisson($\lambda$) random variables.
(a) Find the probability that at least one of the galaxies contains two or more black holes.
(b) Find the probability that all $n$ galaxies have at least one black hole.
(c) Find the probability that all $n$ galaxies have exactly one black hole.
Your answers should be in terms of $n$ and $\lambda$.

8. There are 29 stocks on the Get Rich Quick Stock Exchange. The price of each stock (in whole dollars) is geometric$_0(p)$ (same $p$ for all stocks). Prices of different stocks are independent. If $p=0.7$, find the probability that at least one stock costs more than 10 dollars. Answer: 0.44.

9. Let $X_1,\ldots,X_n$ be independent, geometric$_1(p)$ random variables. Evaluate $P(\min(X_1,\ldots,X_n)>\ell)$ and $P(\max(X_1,\ldots,X_n)\le\ell)$.

10. A class consists of 15 students. Each student has probability $p=0.1$ of getting an "A" in the course. Find the probability that exactly one student receives an "A." Assume the students' grades are independent. Answer: 0.343.

11. In a certain lottery game, the player chooses three digits. The player wins if at least two out of three digits match the random drawing for that day in both position and value. Find the probability that the player wins. Assume that the digits of the random drawing are independent and equally likely. Answer: 0.028.

12. Blocks on a computer disk are good with probability $p$ and faulty with probability $1-p$. Blocks are good or bad independently of each other. Let $Y$ denote the location (starting from 1) of the first bad block. Find the pmf of $Y$.

13. Let $X$ and $Y$ be jointly discrete, integer-valued random variables with joint pmf
$$p_{XY}(i,j) = \begin{cases}
\dfrac{3^{j-1}e^{-3}}{j!}, & i=1,\ j\ge 0,\\[2ex]
4\,\dfrac{6^{j-1}e^{-6}}{j!}, & i=2,\ j\ge 0,\\[2ex]
0, & \text{otherwise.}
\end{cases}$$
Find the marginal pmfs $p_X(i)$ and $p_Y(j)$, and determine whether or not $X$ and $Y$ are independent.
Problems §2.2: Expectation

14. If $X$ is Poisson($\lambda$), compute $E[1/(X+1)]$.

15. A random variable $X$ has mean 2 and variance 7. Find $E[X^2]$.

16. Let $X$ be a random variable with mean $m$ and variance $\sigma^2$. Find the constant $c$ that best approximates the random variable $X$ in the sense that $c$ minimizes the mean squared error $E[(X-c)^2]$.

17. Find var($X$) if $X$ has probability generating function
$$G_X(z) = \tfrac16 + \tfrac16 z + \tfrac23 z^2.$$

18. Find var($X$) if $X$ has probability generating function
$$G_X(z) = \Bigl(\frac{2+z}{3}\Bigr)^5.$$

19. Evaluate $G_X(z)$ for the cases $X\sim\text{geometric}_0(p)$ and $X\sim\text{geometric}_1(p)$. Use your results to find the mean and variance of $X$ in each case.

20. The probability generating function of $Y\sim\text{binomial}(n,p)$ was found in Example 2.20. Use it to find the mean and variance of $Y$.

21. Compute $E[(X+Y)^3]$ if $X$ and $Y$ are independent Bernoulli($p$) random variables.

22. Let $X_1,\ldots,X_n$ be independent Poisson($\lambda$) random variables. Find the probability generating function of $Y:=X_1+\cdots+X_n$.

23. For $i=1,\ldots,n$, let $X_i\sim\text{Poisson}(\lambda_i)$. Put
$$Y := \sum_{i=1}^n X_i.$$
Find $P(Y=2)$ if the $X_i$ are independent.

24. A certain digital communication link has bit-error probability $p$. In a transmission of $n$ bits, find the probability that $k$ bits are received incorrectly, assuming bit errors occur independently.

25. A new school has $M$ classrooms. For $i=1,\ldots,M$, let $n_i$ denote the number of seats in the $i$th classroom. Suppose that the number of students in the $i$th classroom is binomial($n_i,p$) and independent. Let $Y$ denote the total number of students in the school. Find $P(Y=k)$.

26. Let $X_1,\ldots,X_n$ be i.i.d. with $P(X_i=1)=1-p$ and $P(X_i=2)=p$. If $Y:=X_1+\cdots+X_n$, find $P(Y=k)$ for all $k$.

27. Ten-bit codewords are transmitted over a noisy channel. Bits are flipped independently with probability $p$. If no more than two bits of a codeword are flipped, the codeword can be correctly decoded. Find the probability that a codeword cannot be correctly decoded.

28. From a well-shuffled deck of 52 playing cards you are dealt 14 cards. What is the probability that 2 cards are spades, 3 are hearts, 4 are diamonds, and 5 are clubs? Answer: 0.0116.

29. From a well-shuffled deck of 52 playing cards you are dealt 5 cards. What is the probability that all 5 cards are of the same suit? Answer: 0.00198.

*30. A generalization of the binomial coefficient is the multinomial coefficient. Let $m_1,\ldots,m_r$ be nonnegative integers that sum to $n$. Then
$$\binom{n}{m_1,\ldots,m_r} := \frac{n!}{m_1!\cdots m_r!}.$$
If words of length $n$ are composed of $r$-ary symbols, then the multinomial coefficient is the number of words with exactly $m_1$ symbols of type 1, $m_2$ symbols of type 2, ..., and $m_r$ symbols of type $r$. The multinomial theorem says that
$$\Bigl(\sum_{j=1}^J a_j\Bigr)^{k_J}$$
is equal to
$$\sum_{k_{J-1}=0}^{k_J}\cdots\sum_{k_1=0}^{k_2}\binom{k_J}{k_1,\,k_2-k_1,\,\ldots,\,k_J-k_{J-1}}\, a_1^{k_1}a_2^{k_2-k_1}\cdots a_J^{k_J-k_{J-1}}.$$
Derive the multinomial theorem by induction on $J$.

Remark. The multinomial theorem can also be expressed in the form
$$\Bigl(\sum_{j=1}^J a_j\Bigr)^{n} = \sum_{m_1+\cdots+m_J=n}\binom{n}{m_1,m_2,\ldots,m_J}\, a_1^{m_1}a_2^{m_2}\cdots a_J^{m_J}.$$
31. Make a table comparing both sides of the Poisson approximation of binomial probabilities,
$$\binom{n}{k}p^k(1-p)^{n-k} \approx \frac{(np)^k e^{-np}}{k!}, \quad n \text{ large, } p \text{ small},$$
for $k=0,1,2,3,4,5$ if $n=100$ and $p=1/100$. Hint: If Matlab is available, the binomial probability can be written
nchoosek(n,k)*p^k*(1-p)^(n-k)
and the Poisson probability can be written
(n*p)^k*exp(-n*p)/factorial(k).

32. Betting on Fair Games. Let $X$ be a Bernoulli($p$) random variable. For example, we could let $X=1$ model the result of a coin toss being heads. Or we could let $X=1$ model your winning the lottery. In general, a bettor wagers a stake of $s$ dollars that $X=1$ with a bookmaker who agrees to pay $d$ dollars to the bettor if $X=1$ occurs; if $X=0$, the stake $s$ is kept by the bookmaker. Thus, the net income of the bettor is
$$Y := dX - s(1-X),$$
since if $X=1$, the bettor receives $Y=d$ dollars, and if $X=0$, the bettor receives $Y=-s$ dollars; i.e., loses $s$ dollars. Of course the net income to the bookmaker is $-Y$. If the wager is fair to both the bettor and the bookmaker, then we should have $E[Y]=0$. In other words, on average, the net income to either party is zero. Show that a fair wager requires that $d/s=(1-p)/p$.

33. Odds. Let $X$ be a Bernoulli($p$) random variable. We say that the (fair) odds against $X=1$ are $n_2$ to $n_1$ (written $n_2:n_1$) if $n_2$ and $n_1$ are positive integers satisfying $n_2/n_1=(1-p)/p$. Typically, $n_2$ and $n_1$ are chosen to have no common factors. Conversely, we say that the odds for $X=1$ are $n_1$ to $n_2$ if $n_1/n_2=p/(1-p)$. Consider a state lottery game in which players wager one dollar that they can correctly guess a randomly selected three-digit number in the range 000–999. The state offers a payoff of \$500 for a correct guess.
(a) What is the probability of correctly guessing the number?
(b) What are the (fair) odds against guessing correctly?
(c) The odds against actually offered by the state are determined by the ratio of the payoff divided by the stake, in this case, $500:1$. Is the game fair to the bettor? If not, what should the payoff be to make it fair? (See the preceding problem for the notion of "fair.")

*34. These results are needed for Examples 2.14 and 2.15. Show that
$$\sum_{k=1}^\infty \frac1k = \infty$$
and that
$$\sum_{k=1}^\infty \frac1{k^2} < 2.$$
(The exact value of the second sum is $\pi^2/6$ [38, p. 198].) Hints: For the first formula, use the inequality
$$\int_k^{k+1}\frac1t\,dt \le \int_k^{k+1}\frac1k\,dt = \frac1k.$$
For the second formula, use the inequality
$$\int_k^{k+1}\frac1{t^2}\,dt \ge \int_k^{k+1}\frac1{(k+1)^2}\,dt = \frac1{(k+1)^2}.$$

35. Let $X$ be a discrete random variable taking finitely many distinct values $x_1,\ldots,x_n$. Let $p_i:=P(X=x_i)$ be the corresponding probability mass function. Consider the function
$$g(x) := -\log P(X=x).$$
Observe that $g(x_i)=-\log p_i$. The entropy of $X$ is defined by
$$H(X) := E[g(X)] = \sum_{i=1}^n g(x_i)\,P(X=x_i) = \sum_{i=1}^n p_i\log\frac1{p_i}.$$
If all outcomes are equally likely, i.e., $p_i=1/n$, find $H(X)$. If $X$ is a constant random variable, i.e., $p_j=1$ for some $j$ and $p_i=0$ for $i\ne j$, find $H(X)$.

*36. Jensen's inequality. Recall that a real-valued function $g$ defined on an interval $I$ is convex if for all $x,y\in I$ and all $0\le\lambda\le 1$,
$$g\bigl(\lambda x+(1-\lambda)y\bigr) \le \lambda g(x)+(1-\lambda)g(y).$$
Let $g$ be a convex function, and let $X$ be a discrete random variable taking finitely many values, say $n$ values, all in $I$. Derive Jensen's inequality,
$$E[g(X)] \ge g(E[X]).$$
Hint: Use induction on $n$.

*37. Derive Lyapunov's inequality,
$$E[|Z|^\alpha]^{1/\alpha} \le E[|Z|^\beta]^{1/\beta}, \quad 1\le\alpha<\beta<\infty.$$
Hint: Apply Jensen's inequality to the convex function $g(x)=x^{\beta/\alpha}$ and the random variable $X=|Z|^\alpha$.

*38. A discrete random variable is said to be nonnegative, denoted by $X\ge 0$, if $P(X\ge 0)=1$; i.e., if
$$\sum_i I_{[0,\infty)}(x_i)\,P(X=x_i) = 1.$$
(a) Show that for a nonnegative random variable, if $x_k<0$ for some $k$, then $P(X=x_k)=0$.
(b) Show that for a nonnegative random variable, $E[X]\ge 0$.
(c) If $X$ and $Y$ are discrete random variables, we write $X\ge Y$ if $X-Y\ge 0$. Show that if $X\ge Y$, then $E[X]\ge E[Y]$; i.e., expectation is monotone.
Problems §2.3: The Weak Law of Large Numbers

39. Show that $E[M_n]=m$. Also show that for any constant $c$, $\operatorname{var}(cX)=c^2\operatorname{var}(X)$.

40. Let $X_1,X_2,\ldots,X_n$ be i.i.d. geometric$_1(p)$ random variables, and put $Y:=X_1+\cdots+X_n$. Find $E[Y]$, var($Y$), and $E[Y^2]$. Also find the moment generating function of $Y$. Remark. We say that $Y$ is a negative binomial or Pascal random variable with parameters $n$ and $p$.

41. Show by counterexample that being uncorrelated does not imply independence. Hint: Let $P(X=\pm1)=P(X=\pm2)=1/4$, and put $Y:=|X|$. Show that $E[XY]=E[X]E[Y]$, but $P(X=1,Y=1)\ne P(X=1)\,P(Y=1)$.

42. Let $X\sim\text{geometric}_0(1/4)$. Compute both sides of Markov's inequality,
$$P(X\ge 2) \le \frac{E[X]}{2}.$$

43. Let $X\sim\text{geometric}_0(1/4)$. Compute both sides of Chebyshev's inequality,
$$P(X\ge 2) \le \frac{E[X^2]}{4}.$$

44. Student heights range from 120 cm to 220 cm. To estimate the average height, determine how many students' heights should be measured to make the sample mean within 0.25 cm of the true mean height with probability at least 0.9. Assume measurements are uncorrelated and have variance $\sigma^2=1$. What if you only want to be within 1 cm of the true mean height with probability at least 0.9?

45. Show that for an arbitrary sequence of random variables, it is not always true that for every $\varepsilon>0$,
$$\lim_{n\to\infty} P(|M_n-m|\ge\varepsilon) = 0.$$
Hint: Let $Z$ be a nonconstant random variable and put $X_i:=Z$ for $i=1,2,\ldots$. To be specific, try $Z\sim\text{Bernoulli}(1/2)$ and $\varepsilon=1/4$.

46. Let $X$ and $Y$ be two random variables with means $m_X$ and $m_Y$ and variances $\sigma_X^2$ and $\sigma_Y^2$. The correlation coefficient of $X$ and $Y$ is
$$\rho := \frac{E[(X-m_X)(Y-m_Y)]}{\sigma_X\sigma_Y}.$$
Note that $\rho=0$ if and only if $X$ and $Y$ are uncorrelated. Suppose $X$ cannot be observed, but we are able to measure $Y$. We wish to estimate $X$ by using the quantity $aY$, where $a$ is a suitable constant. Assuming $m_X=m_Y=0$, find the constant $a$ that minimizes the mean squared error $E[(X-aY)^2]$. Your answer should depend on $\sigma_X$, $\sigma_Y$ and $\rho$.
Problems §2.4: Conditional Probability

47. Let $X$ and $Y$ be integer-valued random variables. Suppose that conditioned on $X=i$, $Y\sim\text{binomial}(n,p_i)$, where $0<p_i<1$. Evaluate $P(Y<2\mid X=i)$.

48. Let $X$ and $Y$ be integer-valued random variables. Suppose that conditioned on $Y=j$, $X\sim\text{Poisson}(\lambda_j)$. Evaluate $P(X>2\mid Y=j)$.

49. Let $X$ and $Y$ be independent random variables. Show that $p_{X|Y}(x_i\mid y_j)=p_X(x_i)$ and $p_{Y|X}(y_j\mid x_i)=p_Y(y_j)$.

50. Let $X$ and $Y$ be independent with $X\sim\text{geometric}_0(p)$ and $Y\sim\text{geometric}_0(q)$. Put $T:=X-Y$, and find $P(T=n)$ for all $n$.

51. When a binary optical communication system transmits a 1, the receiver output is a Poisson($\mu$) random variable. When a 2 is transmitted, the receiver output is a Poisson($\nu$) random variable. Given that the receiver output is equal to 2, find the conditional probability that a 1 was sent. Assume messages are equally likely.

52. In a binary communication system, when a 0 is sent, the receiver outputs a random variable $Y$ that is geometric$_0(p)$. When a 1 is sent, the receiver output $Y\sim\text{geometric}_0(q)$, where $q\ne p$. Given that the receiver output $Y=k$, find the conditional probability that the message sent was a 1. Assume messages are equally likely.

53. Apple crates are supposed to contain only red apples, but occasionally a few green apples are found. Assume that the number of red apples and the number of green apples are independent Poisson random variables with parameters $\rho$ and $\gamma$, respectively. Given that a crate contains a total of $k$ apples, find the conditional probability that none of the apples is green.

54. Let $X\sim\text{Poisson}(\lambda)$, and suppose that given $X=n$, $Y\sim\text{Bernoulli}(1/(n+1))$. Find $P(X=n\mid Y=1)$.

55. Let $X\sim\text{Poisson}(\lambda)$, and suppose that given $X=n$, $Y\sim\text{binomial}(n,p)$. Find $P(X=n\mid Y=k)$ for $n\ge k$.

56. Let $X$ and $Y$ be independent binomial($n,p$) random variables. Find the conditional probability that $X>k$ given that $\max(X,Y)>k$ if $n=100$, $p=0.01$, and $k=1$. Answer: 0.284.

57. Let $X\sim\text{geometric}_0(p)$ and $Y\sim\text{geometric}_0(q)$, and assume $X$ and $Y$ are independent.
(a) Find $P(XY=4)$.
(b) Put $Z:=X+Y$ and find $p_Z(j)$ for all $j$ using the discrete convolution formula (2.11). Treat the cases $p\ne q$ and $p=q$ separately.

58. Let $X$ and $Y$ be independent random variables, each taking the values 0, 1, 2, 3 with equal probability. Put $Z:=X+Y$ and sketch $p_Z(j)$ for all $j$. Hint: Use the discrete convolution formula (2.11), and use graphical convolution techniques.

Problems §2.5: Conditional Expectation

59. Let $X$ and $Y$ be as in Problem 13. Compute $E[Y\mid X=i]$, $E[Y]$, and $E[X\mid Y=j]$.

60. Let $X$ and $Y$ be as in Example 2.30. Find $E[Y]$, $E[XY]$, $E[Y^2]$, and var($Y$).

61. Let $X$ and $Y$ be as in Example 2.31. Find $E[Y\mid X=1]$, $E[Y\mid X=0]$, $E[Y]$, $E[Y^2]$, and var($Y$).

62. Let $X\sim\text{Bernoulli}(2/3)$, and suppose that given $X=i$, $Y\sim\text{Poisson}\bigl(3(i+1)\bigr)$. Find $E[(X+1)Y^2]$.

63. Let $X\sim\text{Poisson}(\lambda)$, and suppose that given $X=n$, $Y\sim\text{Bernoulli}(1/(n+1))$. Find $E[XY]$.

64. Let $X\sim\text{geometric}_1(p)$, and suppose that given $X=n$, $Y\sim\text{Pascal}(n,q)$. Find $E[XY]$.

65. Let $X$ and $Y$ be integer-valued random variables. Suppose that given $Y=k$, $X$ is conditionally Poisson with parameter $k$. If $Y$ has mean $m$ and variance $r$, find $E[X^2]$.

66. Let $X$ and $Y$ be independent random variables, with $X\sim\text{binomial}(n,p)$, and let $Y\sim\text{binomial}(m,p)$. Put $V:=X+Y$. Find the pmf of $V$. Find $P(V=10\mid X=4)$ (assume $n\ge 4$ and $m\ge 6$).

67. Let $X$ and $Y$ be as in Problem 13. Find the probability generating function of $Y$, and then find $P(Y=2)$.

68. Let $X$ and $Y$ be as in Example 2.30. Find $G_Y(z)$.
CHAPTER 3
Continuous Random Variables

In Chapter 2, the only specific random variables we considered were discrete such as the Bernoulli, binomial, Poisson, and geometric. In this chapter we consider a class of random variables that are allowed to take on a continuum of values. These random variables are called continuous random variables and are introduced in Section 3.1. Their expectation is defined in Section 3.2 and used to develop the concepts of moment generating function (Laplace transform) and characteristic function (Fourier transform). In Section 3.3 expectation of multiple random variables is considered. Applications of characteristic functions to sums of independent random variables are illustrated. In Section 3.4 Markov's inequality, Chebyshev's inequality, and the Chernoff bound illustrate simple techniques for bounding probabilities in terms of expectations.
3.1. Definition and Notation
Recall that a real-valued function $X(\omega)$ defined on a sample space $\Omega$ is called a random variable. We say that $X$ is a continuous random variable if $P(X\in B)$ has the form
$$P(X\in B) = \int_B f(t)\,dt := \int_{-\infty}^{\infty} I_B(t)f(t)\,dt \tag{3.1}$$
for some integrable function $f$. Since $P(X\in\mathbb{R})=1$, $f$ must integrate to one; i.e., $\int_{-\infty}^\infty f(t)\,dt=1$. Further, since $P(X\in B)\ge 0$ for all $B$, it can be shown that $f$ must be nonnegative.$^1$ A nonnegative function that integrates to one is called a probability density function (pdf).

Here are some examples of continuous random variables. A summary of the more common ones can be found on the inside of the back cover.

The simplest continuous random variable is the uniform. It is used to model experiments in which the outcome is constrained to lie in a known interval, say $[a,b]$, and all outcomes are equally likely. We used this model in the bus Example 1.10 in Chapter 1. We write $f\sim\text{uniform}[a,b]$ if $a<b$ and
$$f(x) = \begin{cases}\dfrac{1}{b-a}, & a\le x\le b,\\[1ex] 0, & \text{otherwise.}\end{cases}$$
To verify that $f$ integrates to one, first note that since $f(x)=0$ for $x<a$ and $x>b$, we can write
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_a^b f(x)\,dx.$$
Next, for $a\le x\le b$, $f(x)=1/(b-a)$, and so
$$\int_a^b f(x)\,dx = \int_a^b \frac{1}{b-a}\,dx = 1.$$
This calculation illustrates a very important technique that is often incorrectly carried out by beginning students: First modify the limits of integration, then substitute the appropriate formula for $f(x)$. For example, it is quite common to see the incorrect calculation,
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty}\frac{1}{b-a}\,dx = 1.$$

Example 3.1. If $X$ is a continuous random variable with density $f\sim\text{uniform}[-1,3]$, find $P(X\le 0)$, $P(X\le 1)$, and $P(X>1\mid X>0)$.

Solution. To begin, write
$$P(X\le 0) = P(X\in(-\infty,0]) = \int_{-\infty}^0 f(x)\,dx.$$
Since $f(x)=0$ for $x<-1$, we can modify the limits of integration and obtain
$$P(X\le 0) = \int_{-1}^0 f(x)\,dx = \int_{-1}^0 \frac14\,dx = 1/4.$$
Similarly, $P(X\le 1)=\int_{-1}^1 1/4\,dx = 1/2$. To calculate
$$P(X>1\mid X>0) = \frac{P(\{X>1\}\cap\{X>0\})}{P(X>0)},$$
observe that the denominator is simply $P(X>0)=1-P(X\le 0)=1-1/4=3/4$. As for the numerator,
$$P(\{X>1\}\cap\{X>0\}) = P(X>1) = 1-P(X\le 1) = 1/2.$$
Thus, $P(X>1\mid X>0)=(1/2)/(3/4)=2/3$.

Another simple continuous random variable is the exponential with parameter $\lambda>0$. We write $f\sim\exp(\lambda)$ if
$$f(x) = \begin{cases}\lambda e^{-\lambda x}, & x\ge 0,\\ 0, & x<0.\end{cases}$$
It is easy to check that $f$ integrates to one. The exponential random variable is often used to model lifetimes, such as how long a lightbulb burns or how long it takes for a computer to transmit a message from one node to another.
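The numbers in Example 3.1, and the claim that the exponential density integrates to one, can be checked with a few lines of code. This is a sketch under stated assumptions: the helper names, the midpoint rule, the value $\lambda=2$, and the cutoff at $40/\lambda$ are illustrative choices, not anything prescribed by the text.

```python
import math

def uniform_cdf(x, a, b):
    """P(X <= x) for X ~ uniform[a, b]: integrate 1/(b - a) over [a, min(x, b)]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def midpoint_integral(f, lo, hi, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

# Example 3.1 with f ~ uniform[-1, 3]
a, b = -1.0, 3.0
p_le_0 = uniform_cdf(0.0, a, b)
p_le_1 = uniform_cdf(1.0, a, b)
p_cond = (1.0 - p_le_1) / (1.0 - p_le_0)   # P(X > 1 | X > 0)

# Exponential density with lam = 2: integrate over [0, 40/lam] only,
# since the density is zero for x < 0 and negligible beyond the cutoff.
lam = 2.0
exp_total = midpoint_integral(lambda x: lam * math.exp(-lam * x), 0.0, 40.0 / lam)
print(p_le_0, p_le_1, p_cond, exp_total)
```

Note that the integration limits are restricted to where the density is nonzero before the formula for $f$ is substituted, which mirrors the technique emphasized above.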
The exponential random variable also arises as a function of other random variables. For example, in Problem 35 you will show that if $U\sim\text{uniform}(0,1)$, then $X=\ln(1/U)$ is $\exp(1)$. We also point out that if $U$ and $V$ are independent Gaussian random variables, then $U^2+V^2$ is exponential$^2$ and $\sqrt{U^2+V^2}$ is Rayleigh (defined in Problem 30).

Related to the exponential is the Laplace, sometimes called the double-sided exponential. For $\lambda>0$, we write $f\sim\text{Laplace}(\lambda)$ if
$$f(x) = \frac{\lambda}{2}e^{-\lambda|x|}.$$
You will show in Problem 45 that the difference of two independent exponential random variables is a Laplace random variable.

Example 3.2. If $X$ is a continuous random variable with density $f\sim\text{Laplace}(\lambda)$, find
$$P(-3\le X\le -2 \text{ or } 0\le X\le 3).$$

Solution. The desired probability can be written as
$$P(\{-3\le X\le -2\}\cup\{0\le X\le 3\}).$$
Since these are disjoint events, the probability of the union is the sum of the individual probabilities. We therefore need to compute
$$P(-3\le X\le -2) = \int_{-3}^{-2}\frac{\lambda}{2}e^{-\lambda|x|}\,dx = \frac{\lambda}{2}\int_{-3}^{-2} e^{\lambda x}\,dx,$$
which is equal to $(e^{-2\lambda}-e^{-3\lambda})/2$, and
$$P(0\le X\le 3) = \int_0^3\frac{\lambda}{2}e^{-\lambda|x|}\,dx = \frac{\lambda}{2}\int_0^3 e^{-\lambda x}\,dx,$$
which is equal to $(1-e^{-3\lambda})/2$. The desired probability is then
$$\frac{1-2e^{-3\lambda}+e^{-2\lambda}}{2}.$$

The Cauchy random variable with parameter $\lambda>0$ is also easy to work with. We write $f\sim\text{Cauchy}(\lambda)$ if
$$f(x) = \frac{\lambda/\pi}{\lambda^2+x^2}.$$
Since $(1/\pi)(d/dx)\tan^{-1}(x/\lambda)=f(x)$, and since $\tan^{-1}(\infty)=\pi/2$, it is easy to check that $f$ integrates to one. The Cauchy random variable arises as the quotient of independent Gaussian random variables (Problem 18 in Chapter 5). Gaussian random variables are defined later in this section.
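Looking back at Example 3.2, the closed-form answer can be compared against a direct numerical integration of the Laplace density over the two intervals. The sketch below assumes $\lambda=1$ for the comparison and uses a simple midpoint rule; the function names are hypothetical.

```python
import math

def laplace_pdf(x, lam):
    """Laplace(lam) density (lam/2) exp(-lam |x|)."""
    return 0.5 * lam * math.exp(-lam * abs(x))

def midpoint_integral(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def example_3_2_numeric(lam):
    """P(-3 <= X <= -2) + P(0 <= X <= 3), the two events being disjoint."""
    f = lambda x: laplace_pdf(x, lam)
    return midpoint_integral(f, -3.0, -2.0) + midpoint_integral(f, 0.0, 3.0)

def example_3_2_closed(lam):
    """Closed form from the example: (1 - 2 e^{-3 lam} + e^{-2 lam}) / 2."""
    return (1.0 - 2.0 * math.exp(-3.0 * lam) + math.exp(-2.0 * lam)) / 2.0

numeric = example_3_2_numeric(1.0)
closed = example_3_2_closed(1.0)
print(numeric, closed)
```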
Example 3.3. In the $\lambda$-lottery you choose a number $\lambda$ with $1\le\lambda\le 10$. Then a random variable $X$ is chosen according to the Cauchy density with parameter $\lambda$. If $|X|\ge 1$, then you win the lottery. Which value of $\lambda$ should you choose to maximize your probability of winning?

Solution. Your probability of winning is
$$P(|X|\ge 1) = P(X\ge 1 \text{ or } X\le -1) = \int_1^\infty f(x)\,dx + \int_{-\infty}^{-1} f(x)\,dx,$$
where $f(x)=(\lambda/\pi)/(\lambda^2+x^2)$ is the Cauchy density. Since the Cauchy density is an even function,
$$P(|X|\ge 1) = 2\int_1^\infty \frac{\lambda/\pi}{\lambda^2+x^2}\,dx.$$
Now make the change of variable $y=x/\lambda$, $dy=dx/\lambda$, to get
$$P(|X|\ge 1) = 2\int_{1/\lambda}^\infty \frac{1/\pi}{1+y^2}\,dy.$$
Since the integrand is nonnegative, the integral is maximized by minimizing $1/\lambda$ or by maximizing $\lambda$. Hence, choosing $\lambda=10$ maximizes your probability of winning.

The most important density is the Gaussian or normal. As a consequence of the central limit theorem, whose discussion is taken up in Chapter 4, Gaussian random variables are good models for sums of many independent random variables. Hence, Gaussian random variables often arise as noise models in communication and control systems. For $\sigma^2>0$, we write $f\sim N(m,\sigma^2)$ if
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Bigl[-\frac12\Bigl(\frac{x-m}{\sigma}\Bigr)^2\Bigr],$$
where $\sigma$ is the positive square root of $\sigma^2$. If $m=0$ and $\sigma^2=1$, we say that $f$ is a standard normal density. A graph of the standard normal density is shown in Figure 3.1.

To verify that an arbitrary normal density integrates to one, we proceed as follows. (For an alternative derivation, see Problem 14.) First, making the change of variable $t=(x-m)/\sigma$ shows that
$$\int_{-\infty}^\infty f(x)\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-t^2/2}\,dt.$$
So, without loss of generality, we may assume $f$ is a standard normal density with $m=0$ and $\sigma=1$. We then need to show that $I:=\int_{-\infty}^\infty e^{-x^2/2}\,dx=\sqrt{2\pi}$.
Figure 3.1. Standard normal density.
The trick is to show instead that $I^2=2\pi$. First write
$$I^2 = \Bigl(\int_{-\infty}^\infty e^{-x^2/2}\,dx\Bigr)\Bigl(\int_{-\infty}^\infty e^{-y^2/2}\,dy\Bigr).$$
Now write the product of integrals as the iterated integral
$$I^2 = \int_{-\infty}^\infty\int_{-\infty}^\infty e^{-(x^2+y^2)/2}\,dx\,dy.$$
Next, we interpret this as a double integral over the whole plane and change to polar coordinates. This yields
$$I^2 = \int_0^{2\pi}\int_0^\infty e^{-r^2/2}\,r\,dr\,d\theta = \int_0^{2\pi}\Bigl(-e^{-r^2/2}\Big|_0^\infty\Bigr)d\theta = \int_0^{2\pi} 1\,d\theta = 2\pi.$$

Example 3.4. If $f$ denotes the standard normal density, show that
$$\int_{-\infty}^0 f(x)\,dx = \int_0^\infty f(x)\,dx = 1/2.$$

Solution. Since $f(x)=e^{-x^2/2}/\sqrt{2\pi}$ is an even function of $x$, the two integrals are equal. Since the sum of the two integrals is $\int_{-\infty}^\infty f(x)\,dx=1$, each individual integral must be 1/2.
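The identity $I^2=2\pi$ and the result of Example 3.4 lend themselves to a quick numerical check. The sketch below truncates the integrals to $[-10,10]$, which is an assumption made for illustration (the neglected tails are far below the tolerance used), and approximates them with a midpoint rule.

```python
import math

def midpoint_integral(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def std_normal_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# I = integral of exp(-x^2/2), truncated to [-10, 10]
I = midpoint_integral(lambda x: math.exp(-x * x / 2.0), -10.0, 10.0)

# The two half-line integrals of Example 3.4, truncated the same way
left_half = midpoint_integral(std_normal_pdf, -10.0, 0.0)
right_half = midpoint_integral(std_normal_pdf, 0.0, 10.0)
print(I * I, 2.0 * math.pi, left_half, right_half)
```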
The Paradox of Continuous Random Variables

Let $X\sim\text{uniform}(0,1)$, and fix any point $x_0\in(0,1)$. What is $P(X=x_0)$? Consider the interval $[x_0-1/n,\,x_0+1/n]$. Write
$$P(X\in[x_0-1/n,\,x_0+1/n]) = \int_{x_0-1/n}^{x_0+1/n} 1\,dx = 2/n \to 0$$
as $n\to\infty$. Since the interval $[x_0-1/n,\,x_0+1/n]$ converges to the singleton set $\{x_0\}$, we conclude† that $P(X=x_0)=0$ for every $x_0\in(0,1)$. We are thus confronted with the fact that continuous random variables take no fixed value with positive probability! The way to understand this apparent paradox is to realize that continuous random variables are an idealized model of what we normally think of as continuous-valued measurements. For example, a voltmeter only shows a certain number of digits after the decimal point, say 5.127 volts, because physical devices have limited precision. Hence, the measurement $X=5.127$ should be understood as saying that
$$5.1265\le X<5.1275,$$
since all numbers in this range round to 5.127. Now there is no paradox because $P(5.1265\le X<5.1275)$ has positive probability.

You may still ask, "Why not just use a discrete random variable taking the distinct values $k/1000$, where $k$ is any integer?" After all, this would model the voltmeter in question. The answer is that if you get a better voltmeter, you need to redefine the random variable, while with the idealized, continuous-random-variable model, even if the voltmeter changes, the random variable does not.

Remark. If $B$ is any set with finitely many points, or even countably many points, then $P(X\in B)=0$ when $X$ is a continuous random variable. To see this, suppose $B=\{x_1,x_2,\ldots\}$ where the $x_i$ are distinct real numbers. Then
$$P(X\in B) = P\Bigl(X\in\bigcup_{i=1}^\infty\{x_i\}\Bigr) = \sum_{i=1}^\infty P(X=x_i) = 0,$$
since, as argued above, each term is zero.

† This argument is made rigorously for arbitrary continuous random variables in Section 4.6.

3.2. Expectation of a Single Random Variable

Let $X$ be a continuous random variable with density $f$, and let $g$ be a real-valued function taking finitely many distinct values $y_j\in\mathbb{R}$. We claim that
$$E[g(X)] = \int_{-\infty}^\infty g(x)f(x)\,dx. \tag{3.2}$$
First note that the assumptions on g imply Y := g(X) is a discrete random variable whose expectation is de ned as X E[Y ] = yj }(Y = yj ): j
Let Bj := fx : g(x) = yj g. Observe that X 2 Bj if and only if g(X) = yj , or equivalently, if and only if Y = yj . Hence, X E[Y ] = yj }(X 2 Bj ): j
Since X is a continuous random variable,
}(X 2 Bj ) =
Z
1
,1
IBj (x)f(x) dx:
Substituting this into the preceding equation, and writing g(X) instead of Y , we have E[g(X)] =
X
j
yj
Z
1
,1
IBj (x)f(x) dx =
1 X
Z
,1 j
yj IBj (x) f(x) dx:
It suffices to show that the sum in brackets is exactly $g(x)$. Fix any $x \in \mathbb{R}$. Then $g(x) = y_k$ for some $k$. In other words, $x \in B_k$. Now the $B_j$ are disjoint because the $y_j$ are distinct. Hence, for $x \in B_k$, the sum in brackets reduces to $y_k$, which is exactly $g(x)$.

We would like to apply (3.2) for more arbitrary functions $g$. However, if $g(X)$ is not a discrete random variable, its expectation has not yet been defined! This raises the question of how to define the expectation of an arbitrary random variable. The approach is to approximate $X$ by a sequence of discrete random variables (for which expectation was defined in Chapter 2) and then define $E[X]$ to be the limit of the expectations of the approximations. To be more precise about this, consider the sequence of functions $q_n$ sketched in Figure 3.2. Since each $q_n$ takes only finitely many distinct values, $q_n(X)$ is a discrete random variable for which $E[q_n(X)]$ is defined. Since $q_n(X) \to X$, we then define (see Note 3) $E[X] := \lim_{n\to\infty} E[q_n(X)]$.

Now suppose $X$ is a continuous random variable. Since $q_n(X)$ is a discrete random variable, (3.2) applies, and we can write

$$E[X] := \lim_{n\to\infty} E[q_n(X)] = \lim_{n\to\infty} \int_{-\infty}^{\infty} q_n(x) f(x)\,dx.$$

Now bring the limit inside the integral (see Note 4), and then use the fact that $q_n(x) \to x$. This yields

$$E[X] = \int_{-\infty}^{\infty} \lim_{n\to\infty} q_n(x) f(x)\,dx = \int_{-\infty}^{\infty} x f(x)\,dx.$$

[Figure 3.2. Finite-step quantizer $q_n(x)$ for approximating arbitrary random variables by discrete random variables. The horizontal axis is $x$, the vertical axis is $q_n(x)$, each step has width $1/2^n$, and the steps extend up to $n$. The number of steps is $n2^n$. In the figure, $n = 2$ and so there are 8 steps.]
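The limiting definition of expectation can be checked numerically. The following Python sketch assumes that the quantizer rounds down to the nearest multiple of $2^{-n}$ and saturates at the value $n$ for $x \ge n$; the saturation value is an assumption read off Figure 3.2, and the function names are illustrative. For $X \sim \exp(1)$, whose mean is 1, $E[q_n(X)]$ is a finite sum that should increase toward 1.

```python
import math


def q(n, x):
    """Finite-step quantizer for x >= 0 (assumed form).

    Rounds x down to a multiple of 2**-n and caps the result at n.
    """
    if x < 0:
        raise ValueError("q_n is defined only for x >= 0")
    return min(math.floor(x * 2**n) / 2**n, float(n))


def expected_q_exp1(n):
    """E[q_n(X)] for X ~ exp(1), evaluated as a finite sum over the steps."""
    step = 2.0 ** -n
    total = 0.0
    for k in range(n * 2**n):
        # P(k*step <= X < (k+1)*step) = exp(-k*step) - exp(-(k+1)*step)
        prob = math.exp(-k * step) - math.exp(-(k + 1) * step)
        total += k * step * prob
    # On {X >= n} the quantizer outputs n, and P(X >= n) = exp(-n).
    total += n * math.exp(-n)
    return total


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 12):
        print(n, expected_q_exp1(n))
```

The printed values are nondecreasing in $n$, in line with the monotonicity of $q_n$ discussed in Note 3.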
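The same technique can be used to show that (3.2) holds even if $g$ takes more than finitely many values. Write

$$\begin{aligned}
E[g(X)] &:= \lim_{n\to\infty} E[q_n(g(X))] \\
&= \lim_{n\to\infty} \int_{-\infty}^{\infty} q_n(g(x)) f(x)\,dx \\
&= \int_{-\infty}^{\infty} \lim_{n\to\infty} q_n(g(x)) f(x)\,dx \\
&= \int_{-\infty}^{\infty} g(x) f(x)\,dx.
\end{aligned}$$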
Example 3.5. If $X$ is a uniform$[a, b]$ random variable, find $E[X]$ and $E[\cos(X)]$.

Solution. To find $E[X]$, write

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_a^b x\,\frac{1}{b-a}\,dx = \frac{x^2}{2(b-a)} \bigg|_a^b,$$

which simplifies to

$$\frac{b^2 - a^2}{2(b-a)} = \frac{(b+a)(b-a)}{2(b-a)} = \frac{a+b}{2},$$

which is simply the numerical average of $a$ and $b$. To find $E[\cos(X)]$, write

$$E[\cos(X)] = \int_{-\infty}^{\infty} \cos(x) f(x)\,dx = \int_a^b \cos(x)\,\frac{1}{b-a}\,dx.$$

Since the antiderivative of $\cos$ is $\sin$, we have

$$E[\cos(X)] = \frac{\sin(b) - \sin(a)}{b-a}.$$

In particular, if $b = a + 2\pi k$ for some positive integer $k$, then $E[\cos(X)] = 0$.
Example 3.6. Let $X$ be a continuous random variable with standard Gaussian density $f \sim N(0, 1)$. Compute $E[X^n]$ for all $n \ge 1$.

Solution. Write

$$E[X^n] = \int_{-\infty}^{\infty} x^n f(x)\,dx,$$

where $f(x) = \exp(-x^2/2)/\sqrt{2\pi}$. Since $f$ is an even function of $x$, the above integrand is odd for $n$ odd. Hence, all the odd moments are zero. For $n \ge 2$, apply integration by parts to

$$E[X^n] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^{n-1} \cdot x e^{-x^2/2}\,dx$$

to obtain $E[X^n] = (n-1)E[X^{n-2}]$, where $E[X^0] := E[1] = 1$. When $n = 2$ this yields $E[X^2] = 1$, and when $n = 4$ this yields $E[X^4] = 3$. The general result for even $n$ is

$$E[X^n] = 1 \cdot 3 \cdots (n-3)(n-1).$$
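The recursion $E[X^n] = (n-1)E[X^{n-2}]$ is easy to check by machine. The sketch below is illustrative only: the function names, the midpoint rule, and the truncation of the integral to $[-12, 12]$ are choices made here, not part of the text.

```python
import math


def gaussian_moment(n):
    """E[X^n] for X ~ N(0,1): zero for odd n, 1*3*...*(n-1) for even n."""
    if n % 2 == 1:
        return 0
    product = 1
    for k in range(1, n, 2):  # k = 1, 3, ..., n-1
        product *= k
    return product


def gaussian_moment_numeric(n, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of the integral of x^n f(x) over [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        total += x**n * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, gaussian_moment(n), gaussian_moment_numeric(n))
```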
Example 3.7. Determine $E[X]$ if $X$ has a Cauchy density with parameter $\lambda = 1$.

Solution. This is a trick question. Recall that as noted following Example 2.14, for signed discrete random variables,

$$E[X] = \sum_{i : x_i \ge 0} x_i\,\wp(X = x_i) + \sum_{i : x_i < 0} x_i\,\wp(X = x_i),$$

if at least one of the sums is finite. The analogous formula for continuous random variables is

$$E[X] = \int_0^{\infty} x f(x)\,dx + \int_{-\infty}^0 x f(x)\,dx,$$

assuming at least one of the integrals is finite. Otherwise we say that $E[X]$ is undefined. With $f(x) = (1/\pi)/(1 + x^2)$, write

$$\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x f(x)\,dx \\
&= \int_0^{\infty} x f(x)\,dx + \int_{-\infty}^0 x f(x)\,dx \\
&= \frac{1}{2\pi} \ln(1 + x^2) \bigg|_0^{\infty} + \frac{1}{2\pi} \ln(1 + x^2) \bigg|_{-\infty}^0 \\
&= (\infty - 0) + (0 - \infty) \\
&= \infty - \infty \\
&= \text{undefined}.
\end{aligned}$$
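To see the divergence concretely, the following sketch approximates the one-sided integral $\int_0^T x f(x)\,dx$ for the Cauchy density with $\lambda = 1$ and compares it with the antiderivative value $\ln(1+T^2)/(2\pi)$. The function names, the midpoint rule, and the choice of cutoffs $T$ are illustrative assumptions.

```python
import math


def cauchy_partial_mean(T, steps=200_000):
    """Midpoint-rule approximation of the integral of x*f(x) over [0, T], f = Cauchy(1) density."""
    h = T / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x / (math.pi * (1.0 + x * x))
    return total * h


def cauchy_partial_mean_closed_form(T):
    """Value of ln(1 + x^2)/(2*pi) evaluated between 0 and T."""
    return math.log(1.0 + T * T) / (2.0 * math.pi)


if __name__ == "__main__":
    for T in (10.0, 100.0, 1000.0):
        print(T, cauchy_partial_mean(T), cauchy_partial_mean_closed_form(T))
```

The partial integrals keep growing like $\ln T$, so neither the positive nor the negative part of the integral is finite.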
Moment Generating Functions
The probability generating function was defined in Chapter 2 for discrete random variables taking nonnegative integer values. For more general random variables, we introduce the moment generating function (mgf)

$$M_X(s) := E[e^{sX}].$$

This generalizes the concept of probability generating function because if $X$ is discrete taking only nonnegative integer values, then

$$M_X(s) = E[e^{sX}] = E[(e^s)^X] = G_X(e^s).$$

To see why $M_X$ is called the moment generating function, write

$$\begin{aligned}
M_X'(s) &= \frac{d}{ds} E[e^{sX}] \\
&= E\Big[ \frac{d}{ds} e^{sX} \Big] \\
&= E[X e^{sX}].
\end{aligned}$$

Taking $s = 0$, we have

$$M_X'(s)\big|_{s=0} = E[X].$$

Differentiating $k$ times and then setting $s = 0$ yields

$$M_X^{(k)}(s)\big|_{s=0} = E[X^k]. \tag{3.3}$$

This derivation makes sense only if $M_X(s)$ is finite in a neighborhood of $s = 0$ [4, p. 278]. In the preceding equations, it is convenient to allow $s$ to be complex. This means that we need to define the expectation of complex-valued functions of $x$. If $g(x) = u(x) + jv(x)$, where $u$ and $v$ are real-valued functions of $x$, then

$$E[g(X)] := E[u(X)] + jE[v(X)].$$
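If $X$ has density $f$, then

$$\begin{aligned}
E[g(X)] &:= E[u(X)] + jE[v(X)] \\
&= \int_{-\infty}^{\infty} u(x) f(x)\,dx + j \int_{-\infty}^{\infty} v(x) f(x)\,dx \\
&= \int_{-\infty}^{\infty} [u(x) + jv(x)] f(x)\,dx \\
&= \int_{-\infty}^{\infty} g(x) f(x)\,dx.
\end{aligned}$$

It now follows that if $X$ is a continuous random variable with density $f$, then

$$M_X(s) = E[e^{sX}] = \int_{-\infty}^{\infty} e^{sx} f(x)\,dx,$$

which is just the Laplace transform of $f$.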
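Example 3.8. If $X$ is an exponential random variable with parameter $\lambda > 0$, then for $\operatorname{Re} s < \lambda$,

$$\begin{aligned}
M_X(s) &= \int_0^{\infty} e^{sx} \lambda e^{-\lambda x}\,dx \\
&= \lambda \int_0^{\infty} e^{x(s - \lambda)}\,dx \\
&= \frac{\lambda}{\lambda - s}.
\end{aligned}$$

If $M_X(s)$ is finite for all real $s$ in a neighborhood of the origin, say for $-r < s < r$ for some $0 < r \le \infty$, then $X$ has finite moments of all orders, and the following calculation using the power series $e^{\xi} = \sum_{n=0}^{\infty} \xi^n/n!$ is valid for complex $s$ with $|s| < r$ [4, p. 278]:

$$\begin{aligned}
E[e^{sX}] &= E\Big[ \sum_{n=0}^{\infty} \frac{(sX)^n}{n!} \Big] \\
&= \sum_{n=0}^{\infty} \frac{s^n}{n!} E[X^n], \qquad |s| < r.
\end{aligned} \tag{3.4}$$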
Example 3.9. For the exponential random variable of the previous example, we can obtain the power series as follows. Recalling the geometric series formula (Problem 7 in Chapter 1), write, this time for $|s| < \lambda$,

$$\frac{\lambda}{\lambda - s} = \frac{1}{1 - s/\lambda} = \sum_{n=0}^{\infty} (s/\lambda)^n.$$

Comparing this expression with (3.4) and equating the coefficients of the powers of $s$, we see by inspection that $E[X^n] = n!/\lambda^n$. In particular, we have $E[X] = 1/\lambda$ and $E[X^2] = 2/\lambda^2$. Since $\operatorname{var}(X) = E[X^2] - (E[X])^2$, it follows that $\operatorname{var}(X) = 1/\lambda^2$.
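The moment formula $E[X^n] = n!/\lambda^n$ read off from the power series can be compared with a direct numerical evaluation of $\int_0^{\infty} x^n \lambda e^{-\lambda x}\,dx$. This is a minimal sketch; the function names, the midpoint rule, and the truncation point $60/\lambda$ are assumptions made for illustration.

```python
import math


def exp_moment_closed_form(n, lam):
    """E[X^n] = n!/lam^n for X ~ exp(lam), as obtained from the power series."""
    return math.factorial(n) / lam**n


def exp_moment_numeric(n, lam, upper_scale=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of x^n * lam*exp(-lam*x) over [0, upper_scale/lam]."""
    upper = upper_scale / lam
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x**n * lam * math.exp(-lam * x)
    return total * h


if __name__ == "__main__":
    lam = 2.0
    for n in range(1, 5):
        print(n, exp_moment_closed_form(n, lam), exp_moment_numeric(n, lam))
    # Variance from the first two moments; should equal 1/lam**2.
    print(exp_moment_closed_form(2, lam) - exp_moment_closed_form(1, lam) ** 2)
```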
Example 3.10. We find the moment generating function of $X \sim N(0, 1)$. To begin, write

$$M_X(s) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{sx} e^{-x^2/2}\,dx.$$

We first show that $M_X(s)$ is finite for all real $s$. Combine the exponents in the above integral and complete the square. This yields

$$M_X(s) = e^{s^2/2} \int_{-\infty}^{\infty} \frac{e^{-(x-s)^2/2}}{\sqrt{2\pi}}\,dx.$$

The above integrand is a normal density with mean $s$ and unit variance. Since densities integrate to one, $M_X(s) = e^{s^2/2}$ for all real $s$. Note that since the mean of a real-valued random variable must be real, our argument required that $s$ be real (see Note 5). We next show that $M_X(s) = e^{s^2/2}$ holds for all complex $s$. Since the moment generating function is finite for all real $s$, we can use the power series expansion (3.4) to find $M_X$. The moments of $X$ were determined in Example 3.6. Recalling that the odd moments are zero,

$$\begin{aligned}
M_X(s) &= \sum_{m=0}^{\infty} \frac{s^{2m}}{(2m)!} E[X^{2m}] \\
&= \sum_{m=0}^{\infty} \frac{s^{2m}}{(2m)!}\, 1 \cdot 3 \cdots (2m-3)(2m-1) \\
&= \sum_{m=0}^{\infty} \frac{s^{2m}}{2 \cdot 4 \cdots 2m}.
\end{aligned}$$

Combining this result with the power series

$$e^{s^2/2} = \sum_{m=0}^{\infty} \frac{(s^2/2)^m}{m!} = \sum_{m=0}^{\infty} \frac{s^{2m}}{2 \cdot 4 \cdots 2m}$$

shows that $M_X(s) = e^{s^2/2}$ for all complex $s$.
Characteristic Functions
In Example 3.8, the moment generating function was guaranteed finite only for $\operatorname{Re} s < \lambda$. It is possible to have random variables for which $M_X(s)$ is defined only for $\operatorname{Re} s = 0$; i.e., $M_X(s)$ is only defined for imaginary $s$. For example, if $X$ is a Cauchy random variable, then it is easy to see that $M_X(s) = \infty$ for all real $s \ne 0$. In order to develop transform methods that always work for any random variable $X$, we introduce the characteristic function of $X$, defined by

$$\varphi_X(\nu) := E[e^{j\nu X}]. \tag{3.5}$$

Note that $\varphi_X(\nu) = M_X(j\nu)$. Also, since $|e^{j\nu X}| = 1$, $|\varphi_X(\nu)| \le E[|e^{j\nu X}|] = 1$. Hence, the characteristic function always exists and is bounded in magnitude by one. If $X$ is a continuous random variable with density $f$, then

$$\varphi_X(\nu) = \int_{-\infty}^{\infty} e^{j\nu x} f(x)\,dx,$$

which is just the Fourier transform of $f$. Using the Fourier inversion formula,

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-j\nu x} \varphi_X(\nu)\,d\nu.$$
Example 3.11. If $X$ is an $N(0, 1)$ random variable, then by Example 3.10, $M_X(s) = e^{s^2/2}$. Thus, $\varphi_X(\nu) = M_X(j\nu) = e^{(j\nu)^2/2} = e^{-\nu^2/2}$. In terms of Fourier transforms,

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{j\nu x} e^{-x^2/2}\,dx = e^{-\nu^2/2}.$$

In signal processing terms, the Fourier transform of a Gaussian pulse is a Gaussian pulse.

Example 3.12. Let $X$ be a gamma random variable with parameter $p > 0$ (defined in Problem 11). It is shown in Problem 39 that $M_X(s) = 1/(1-s)^p$ for complex $s$ with $|s| < 1$. It follows that $\varphi_X(\nu) = 1/(1 - j\nu)^p$ for $|\nu| < 1$. It can be shown (see Note 6) that $\varphi_X(\nu) = 1/(1 - j\nu)^p$ for all $\nu$.

Example 3.13. As noted above, the characteristic function of an $N(0, 1)$ random variable is $e^{-\nu^2/2}$. What is the characteristic function of an $N(m, \sigma^2)$ random variable?

Solution. Let $f_0$ denote the $N(0, 1)$ density. If $X \sim N(m, \sigma^2)$, then $f_X(x) = f_0((x - m)/\sigma)/\sigma$. Now write

$$\begin{aligned}
\varphi_X(\nu) &= E[e^{j\nu X}] \\
&= \int_{-\infty}^{\infty} e^{j\nu x} f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} e^{j\nu x}\,\frac{1}{\sigma} f_0\Big( \frac{x - m}{\sigma} \Big)\,dx.
\end{aligned}$$

Now apply the change of variable $y = (x - m)/\sigma$, $dy = dx/\sigma$ and obtain

$$\begin{aligned}
\varphi_X(\nu) &= \int_{-\infty}^{\infty} e^{j\nu(\sigma y + m)} f_0(y)\,dy \\
&= e^{j\nu m} \int_{-\infty}^{\infty} e^{j(\nu\sigma)y} f_0(y)\,dy \\
&= e^{j\nu m} e^{-(\nu\sigma)^2/2} \\
&= e^{j\nu m - \sigma^2\nu^2/2}.
\end{aligned}$$
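The closed form $e^{j\nu m - \sigma^2\nu^2/2}$ can be compared with a direct numerical evaluation of $\int e^{j\nu x} f_X(x)\,dx$. The sketch below is only a sanity check under stated assumptions: the function names are hypothetical, and the integral is truncated to $m \pm 12\sigma$ and evaluated with the midpoint rule.

```python
import cmath
import math


def normal_cf_closed_form(nu, m, sigma):
    """Characteristic function of N(m, sigma^2) from Example 3.13."""
    return cmath.exp(1j * nu * m - (sigma * nu) ** 2 / 2.0)


def normal_cf_numeric(nu, m, sigma, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of the integral of exp(j*nu*x) * f_X(x) over m +/- half_width*sigma."""
    a = m - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0j
    for i in range(steps):
        x = a + (i + 0.5) * h
        density = math.exp(-((x - m) / sigma) ** 2 / 2.0) / (sigma * math.sqrt(2.0 * math.pi))
        total += cmath.exp(1j * nu * x) * density
    return total * h


if __name__ == "__main__":
    nu, m, sigma = 0.7, 1.5, 2.0
    print(normal_cf_closed_form(nu, m, sigma))
    print(normal_cf_numeric(nu, m, sigma))
```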
If $X$ is a discrete integer-valued random variable, then

$$\varphi_X(\nu) = E[e^{j\nu X}] = \sum_n e^{j\nu n}\,\wp(X = n)$$

is a $2\pi$-periodic Fourier series. Given $\varphi_X$, the coefficients can be recovered by the formula for Fourier series coefficients,

$$\wp(X = n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-j\nu n} \varphi_X(\nu)\,d\nu.$$

When the moment generating function is not finite in a neighborhood of the origin, the moments of $X$ cannot be obtained from (3.3). However, the moments can sometimes be obtained from the characteristic function. For example, if we differentiate (3.5) with respect to $\nu$, we obtain

$$\varphi_X'(\nu) = \frac{d}{d\nu} E[e^{j\nu X}] = E[jX e^{j\nu X}].$$

Taking $\nu = 0$ yields $\varphi_X'(0) = jE[X]$. The general result is

$$\varphi_X^{(k)}(\nu)\big|_{\nu=0} = j^k E[X^k], \qquad \text{assuming } E[|X|^k] < \infty.$$

The assumption that $E[|X|^k] < \infty$ justifies differentiating inside the expectation [4, pp. 344–345].
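The Fourier-coefficient formula can be tried on a Poisson random variable, whose characteristic function is $\varphi_X(\nu) = G_X(e^{j\nu}) = \exp[\lambda(e^{j\nu} - 1)]$. The following sketch recovers $\wp(X = n)$ by a midpoint-rule approximation of the integral over $[-\pi, \pi]$; the function names and the number of integration points are illustrative assumptions.

```python
import cmath
import math


def poisson_cf(nu, lam):
    """Characteristic function of a Poisson(lam) random variable."""
    return cmath.exp(lam * (cmath.exp(1j * nu) - 1.0))


def pmf_from_cf(cf, n, steps=4096):
    """Approximate (1/(2*pi)) * integral over [-pi, pi] of exp(-j*nu*n) * cf(nu) d nu."""
    h = 2.0 * math.pi / steps
    total = 0j
    for i in range(steps):
        nu = -math.pi + (i + 0.5) * h
        total += cmath.exp(-1j * nu * n) * cf(nu)
    # The imaginary part is numerical noise because the result is a probability.
    return (total * h / (2.0 * math.pi)).real


if __name__ == "__main__":
    lam = 0.5
    for n in range(5):
        recovered = pmf_from_cf(lambda nu: poisson_cf(nu, lam), n)
        exact = math.exp(-lam) * lam**n / math.factorial(n)
        print(n, recovered, exact)
```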
3.3. Expectation of Multiple Random Variables
In Chapter 2 we showed that for discrete random variables, expectation is linear and monotone. We also showed that the expectation of a product of independent discrete random variables is the product of the individual expectations. We now derive these properties for arbitrary random variables. Recall that for an arbitrary random variable $X$, $E[X] := \lim_{n\to\infty} E[q_n(X)]$, where $q_n(x)$ is sketched in Figure 3.2, $q_n(x) \to x$, and for each $n$, $q_n(X)$ is a discrete random variable taking finitely many values.

Linearity of Expectation

To establish linearity, write

$$\begin{aligned}
aE[X] + bE[Y] &:= a \lim_{n\to\infty} E[q_n(X)] + b \lim_{n\to\infty} E[q_n(Y)] \\
&= \lim_{n\to\infty} E[a q_n(X) + b q_n(Y)] \\
&= E\big[ \lim_{n\to\infty} a q_n(X) + b q_n(Y) \big] \\
&= E[aX + bY].
\end{aligned}$$

From our new definition of expectation, it is clear that if $X \ge 0$, then so is $E[X]$. Combining this with linearity shows that monotonicity holds for general random variables; i.e., $X \ge Y$ implies $E[X] \ge E[Y]$.

Expectations of Products of Functions of Independent Random Variables

Suppose $X$ and $Y$ are independent random variables. For any functions $h(x)$ and $k(y)$ write

$$\begin{aligned}
E[h(X)]\,E[k(Y)] &:= \lim_{n\to\infty} E[q_n(h(X))] \lim_{n\to\infty} E[q_n(k(Y))] \\
&= \lim_{n\to\infty} E[q_n(h(X))]\,E[q_n(k(Y))] \\
&= \lim_{n\to\infty} E[q_n(h(X))\,q_n(k(Y))] \\
&= E\big[ \lim_{n\to\infty} q_n(h(X))\,q_n(k(Y)) \big] \\
&= E[h(X)k(Y)].
\end{aligned}$$
Example 3.14. Let $Z := X + Y$, where $X$ and $Y$ are independent random variables. Show that the characteristic function of $Z$ is the product of the characteristic functions of $X$ and $Y$.

Solution. The characteristic function of $Z$ is

$$\varphi_Z(\nu) := E[e^{j\nu Z}] = E[e^{j\nu(X+Y)}] = E[e^{j\nu X} e^{j\nu Y}].$$

Now use independence to write

$$\varphi_Z(\nu) = E[e^{j\nu X}]\,E[e^{j\nu Y}] = \varphi_X(\nu)\,\varphi_Y(\nu).$$

In the preceding example, suppose that $X$ and $Y$ are continuous random variables with densities $f_X$ and $f_Y$. Then we can inverse Fourier transform the last equation and show that $f_Z$ is the convolution of $f_X$ and $f_Y$. In symbols,

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - \xi)\,f_Y(\xi)\,d\xi.$$

Example 3.15. In the preceding example, suppose that $X$ and $Y$ are Cauchy with parameters $\lambda$ and $\mu$, respectively. Find the density of $Z$.

Solution. The characteristic functions of $X$ and $Y$ are, by Problem 44, $\varphi_X(\nu) = e^{-\lambda|\nu|}$ and $\varphi_Y(\nu) = e^{-\mu|\nu|}$. Hence,

$$\varphi_Z(\nu) = \varphi_X(\nu)\,\varphi_Y(\nu) = e^{-\lambda|\nu|} e^{-\mu|\nu|} = e^{-(\lambda+\mu)|\nu|},$$

which is the characteristic function of a Cauchy random variable with parameter $\lambda + \mu$. In terms of convolution,

$$\frac{(\lambda+\mu)/\pi}{(\lambda+\mu)^2 + z^2} = \int_{-\infty}^{\infty} \frac{\lambda/\pi}{\lambda^2 + (z - \xi)^2} \cdot \frac{\mu/\pi}{\mu^2 + \xi^2}\,d\xi.$$
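The convolution identity for Cauchy densities can be checked numerically. Because the Cauchy tails are heavy, the sketch below uses the substitution $\xi = \mu\tan\theta$, under which $f_Y(\xi)\,d\xi = d\theta/\pi$, so the convolution becomes an integral over $(-\pi/2, \pi/2)$. The substitution, the function names, and the test values are choices made here for illustration.

```python
import math


def cauchy_pdf(x, lam):
    """Cauchy density with parameter lam: (lam/pi) / (lam^2 + x^2)."""
    return (lam / math.pi) / (lam * lam + x * x)


def cauchy_convolution(z, lam, mu, steps=20_000):
    """Approximate the convolution of the Cauchy(lam) and Cauchy(mu) densities at z.

    With xi = mu*tan(theta), f_Y(xi) d xi = d theta / pi, so the convolution
    equals (1/pi) times the integral of f_X(z - mu*tan(theta)) over (-pi/2, pi/2).
    """
    h = math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = -math.pi / 2.0 + (i + 0.5) * h
        total += cauchy_pdf(z - mu * math.tan(theta), lam)
    return total * h / math.pi


if __name__ == "__main__":
    z, lam, mu = 0.5, 1.0, 2.0
    print(cauchy_convolution(z, lam, mu), cauchy_pdf(z, lam + mu))
```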
3.4. ⋆Probability Bounds

In many applications, it is difficult to compute the probability of an event exactly. However, bounds on the probability can often be obtained in terms of various expectations. For example, the Markov and Chebyshev inequalities were derived in Chapter 2. Below we use Markov's inequality to derive a much stronger result known as the Chernoff bound.‡

‡ This bound, often attributed to Chernoff (1952) [7], was used earlier by Cramér (1938) [11].

Example 3.16 (Using Markov's Inequality). Let $X$ be a Poisson random variable with parameter $\lambda = 1/2$. Use Markov's inequality to bound $\wp(X > 2)$. Compare your bound with the exact result.

Solution. First note that since $X$ takes only integer values, $\wp(X > 2) = \wp(X \ge 3)$. Hence, by Markov's inequality and the fact that $E[X] = \lambda = 1/2$ from Example 2.13,

$$\wp(X \ge 3) \le \frac{E[X]}{3} = \frac{1/2}{3} = 0.167.$$

The exact answer can be obtained by noting that $\wp(X \ge 3) = 1 - \wp(X < 3) = 1 - \wp(X = 0) - \wp(X = 1) - \wp(X = 2)$. For a Poisson($\lambda$) random variable with $\lambda = 1/2$, $\wp(X \ge 3) = 0.0144$. So Markov's inequality gives quite a loose bound.

Example 3.17 (Using Chebyshev's Inequality). Let $X$ be a Poisson random variable with parameter $\lambda = 1/2$. Use Chebyshev's inequality to bound $\wp(X > 2)$. Compare your bound with the result of using Markov's inequality in Example 3.16.

Solution. Since $X$ is nonnegative, we don't have to worry about the absolute value signs. Using Chebyshev's inequality and the fact that $E[X^2] = \lambda^2 + \lambda = 0.75$ from Example 2.18,

$$\wp(X \ge 3) \le \frac{E[X^2]}{3^2} = \frac{3/4}{9} \approx 0.0833.$$

From Example 3.16, the exact probability is 0.0144 and the Markov bound is 0.167.
We now derive the Chernoff bound. Let $X$ be a random variable whose moment generating function $M_X(s)$ is finite for $s \ge 0$. For $s > 0$,

$$\{X \ge a\} = \{sX \ge sa\} = \{e^{sX} \ge e^{sa}\}.$$

Using this identity along with Markov's inequality, we have

$$\wp(X \ge a) = \wp(e^{sX} \ge e^{sa}) \le \frac{E[e^{sX}]}{e^{sa}} = e^{-sa} M_X(s).$$

Now observe that the inequality holds for all $s > 0$, and the left-hand side does not depend on $s$. Hence, we can minimize the right-hand side to get as tight a bound as possible. The Chernoff bound is given by

$$\wp(X \ge a) \le \inf_{s > 0} e^{-sa} M_X(s),$$

where the infimum is over all $s > 0$ for which $M_X(s)$ is finite.
Example 3.18. Let $X$ be a Poisson random variable with parameter $\lambda = 1/2$. Bound $\wp(X > 2)$ using the Chernoff bound. Compare your result with the exact probability and with the bound obtained via Chebyshev's inequality in Example 3.17 and with the bound obtained via Markov's inequality in Example 3.16.

Solution. First recall that $M_X(s) = G_X(e^s)$, where $G_X(z) = \exp[\lambda(z - 1)]$ was derived in Example 2.19. Hence,

$$e^{-sa} M_X(s) = e^{-sa} \exp[\lambda(e^s - 1)] = \exp[\lambda(e^s - 1) - as].$$

The desired Chernoff bound when $a = 3$ is

$$\wp(X \ge 3) \le \inf_{s > 0} \exp[\lambda(e^s - 1) - 3s].$$

To evaluate the infimum, we must minimize the exponential. Since $\exp$ is an increasing function, it suffices to minimize its argument. Taking the derivative of the argument and setting it equal to zero requires us to solve $\lambda e^s - 3 = 0$. The solution is $s = \ln(3/\lambda)$. Substituting this value of $s$ into $\exp[\lambda(e^s - 1) - 3s]$ and simplifying the exponent yields

$$\wp(X \ge 3) \le \exp[3 - \lambda - 3\ln(3/\lambda)].$$

Since $\lambda = 1/2$,

$$\wp(X \ge 3) \le \exp[2.5 - 3\ln 6] = 0.0564.$$

Recall that from Example 3.16, the exact probability is 0.0144 and Markov's inequality yielded the bound 0.167. From Example 3.17, Chebyshev's inequality yielded the bound 0.0833.
Example 3.19. Let $X$ be a continuous random variable having exponential density with parameter $\lambda = 1$. Compute $\wp(X \ge 7)$ and the corresponding Markov, Chebyshev, and Chernoff bounds.

Solution. The exact probability is $\wp(X \ge 7) = \int_7^{\infty} e^{-x}\,dx = e^{-7} = 0.00091$. For the Markov and Chebyshev inequalities, recall that from Example 3.9, $E[X] = 1/\lambda$ and $E[X^2] = 2/\lambda^2$. For the Chernoff bound, we need $M_X(s) = \lambda/(\lambda - s)$ for $s < \lambda$, which was derived in Example 3.8. Armed with these formulas, we find that Markov's inequality yields $\wp(X \ge 7) \le E[X]/7 = 1/7 = 0.143$ and Chebyshev's inequality yields $\wp(X \ge 7) \le E[X^2]/7^2 = 2/49 = 0.041$. For the Chernoff bound, write

$$\wp(X \ge 7) \le \inf_s \frac{e^{-7s}}{1 - s},$$

where the minimization is over $0 < s < 1$. The derivative of $e^{-7s}/(1 - s)$ with respect to $s$ is

$$\frac{e^{-7s}(7s - 6)}{(1 - s)^2}.$$

Setting this equal to zero requires that $s = 6/7$. Hence, the Chernoff bound is

$$\wp(X \ge 7) \le \frac{e^{-7s}}{1 - s} \bigg|_{s = 6/7} = 7e^{-6} = 0.017.$$

For sufficiently large $a$, the Chernoff bound on $\wp(X \ge a)$ is always smaller than the bound obtained by Chebyshev's inequality, and this is smaller than the one obtained by Markov's inequality. However, for small $a$, this may not be the case. See Problem 56 for an example.
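The three bounds in Examples 3.16 through 3.19 can be reproduced with a few lines of Python. In this sketch the Chernoff infimum is approximated by a simple grid search over $s$ instead of the calculus used in the text; the grid ranges and spacings, and all function names, are assumptions chosen for illustration.

```python
import math


def markov_bound(mean, a):
    """Markov bound E[X]/a on P(X >= a) for nonnegative X."""
    return mean / a


def second_moment_bound(second_moment, a):
    """Chebyshev bound in the form E[X^2]/a^2 used in the examples."""
    return second_moment / a**2


def chernoff_bound(mgf, a, s_grid):
    """Grid-search approximation of inf over s of exp(-s*a) * M_X(s)."""
    return min(math.exp(-s * a) * mgf(s) for s in s_grid)


def poisson_half_mgf(s):
    """M_X(s) for X ~ Poisson(1/2)."""
    return math.exp(0.5 * (math.exp(s) - 1.0))


def exp1_mgf(s):
    """M_X(s) = 1/(1 - s) for X ~ exp(1); valid only for s < 1."""
    return 1.0 / (1.0 - s)


poisson_grid = [k / 1000 for k in range(1, 5001)]  # 0 < s <= 5
exp1_grid = [k / 10000 for k in range(1, 10000)]   # 0 < s < 1

# (Markov, Chebyshev, Chernoff) for P(X >= 3), X ~ Poisson(1/2)
poisson_bounds = (
    markov_bound(0.5, 3),
    second_moment_bound(0.75, 3),
    chernoff_bound(poisson_half_mgf, 3, poisson_grid),
)

# (Markov, Chebyshev, Chernoff) for P(X >= 7), X ~ exp(1)
exp1_bounds = (
    markov_bound(1.0, 7),
    second_moment_bound(2.0, 7),
    chernoff_bound(exp1_mgf, 7, exp1_grid),
)

if __name__ == "__main__":
    print(poisson_bounds)
    print(exp1_bounds)
```

A grid search can only overestimate the infimum, so the value it returns is still a valid upper bound on the probability.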
3.5. Notes
Notes §3.1: Definition and Notation

Note 1. Strictly speaking, it can only be shown that $f$ in (3.1) is nonnegative almost everywhere; that is,

$$\int_{\{t \in \mathbb{R} : f(t) < 0\}} 1\,dt = 0.$$

For example, $f$ could be negative at a finite or countably infinite set of points.

Note 2. If $U$ and $V$ are $N(0, 1)$, then by Problem 41, $U^2$ and $V^2$ are each chi-squared with one degree of freedom (defined in Problem 12). By the Remark following Problem 46(c), $U^2 + V^2$ is chi-squared with 2 degrees of freedom, which is the same as $\exp(1/2)$.

Notes §3.2: Expectation of a Single Random Variable

Note 3. Since $q_n$ in Figure 3.2 is defined only for $x \ge 0$, the definition of expectation in the text applies only to arbitrary nonnegative random variables. However, for signed random variables,

$$X = \frac{|X| + X}{2} - \frac{|X| - X}{2},$$

and we define

$$E[X] := E\Big[ \frac{|X| + X}{2} \Big] - E\Big[ \frac{|X| - X}{2} \Big],$$

assuming the difference is not of the form $\infty - \infty$. Otherwise, we say $E[X]$ is undefined.

We also point out that for $x \ge 0$, $q_n(x) \to x$. To see this, fix any $x \ge 0$, and let $n > x$. Then $x$ will lie under one of the steps in Figure 3.2. If $x$ lies under the $k$th step, then

$$\frac{k-1}{2^n} \le x < \frac{k}{2^n}.$$

For $x$ in this range, the value of $q_n(x)$ is $(k-1)/2^n$. Hence, $0 \le x - q_n(x) < 1/2^n$.

Another important fact to note is that for each $x \ge 0$, $q_n(x) \le q_{n+1}(x)$. Hence, $q_n(X) \le q_{n+1}(X)$, and so $E[q_n(X)] \le E[q_{n+1}(X)]$ as well. In other words, the sequence of real numbers $E[q_n(X)]$ is nondecreasing. This implies that $\lim_{n\to\infty} E[q_n(X)]$ exists either as a finite real number or the extended real number $\infty$ [38, p. 55].

Note 4. In light of the preceding note, we are using Lebesgue's monotone convergence theorem [4, p. 208], which applies to nonnegative functions.

Note 5. If $s$ were complex, we could interpret $\int_{-\infty}^{\infty} e^{-(x-s)^2/2}\,dx$ as a contour integral in the complex plane. By appealing to the Cauchy–Goursat Theorem [9, pp. 115–121], one could then show that this integral is equal to $\int_{-\infty}^{\infty} e^{-t^2/2}\,dt = \sqrt{2\pi}$. Alternatively, one can use a permanence of form argument [9, pp. 286–287]. In this approach, one shows that $M_X(s)$ is analytic in some region, in this case the whole complex plane. One then obtains a formula for $M_X(s)$ on a contour in this region; in this case, the contour is the entire real axis. The permanence of form theorem then states that the formula is valid in the entire region.

Note 6. The permanence of form argument mentioned above in Note 5 is the easiest approach. Since $M_X(s)$ is analytic for complex $s$ with $\operatorname{Re} s < 1$, and since $M_X(s) = 1/(1-s)^p$ for real $s$ with $s < 1$, $M_X(s) = 1/(1-s)^p$ holds for all complex $s$ with $\operatorname{Re} s < 1$. In particular, the formula holds for $s = j\nu$, since in this case, $\operatorname{Re} s = 0 < 1$.
3.6. Problems
Problems §3.1: Definition and Notation

1. A certain burglar alarm goes off if its input voltage exceeds five volts at three consecutive sampling times. If the voltage samples are independent and uniformly distributed on $[0, 7]$, find the probability that the alarm sounds.

2. Let $X$ have density

$$f(x) = \begin{cases} 2/x^3, & x > 1, \\ 0, & \text{otherwise}. \end{cases}$$

The median of $X$ is the number $t$ satisfying $\wp(X > t) = 1/2$. Find the median of $X$.

3. Let $X$ have an $\exp(\lambda)$ density.
(a) Show that $\wp(X > t) = e^{-\lambda t}$.
(b) Compute $\wp(X > t + \Delta t \mid X > t)$. Hint: If $A \subset B$, then $A \cap B = A$.
Remark. Observe that $X$ has a memoryless property similar to that of the geometric$_1(p)$ random variable. See the remark following Problem 4 in Chapter 2.

4. A company produces independent voltage regulators whose outputs are $\exp(\lambda)$ random variables. In a batch of 10 voltage regulators, find the probability that exactly 3 of them produce outputs greater than $v$ volts.

5. Let $X_1, \ldots, X_n$ be i.i.d. $\exp(\lambda)$ random variables.
(a) Find the probability that $\min(X_1, \ldots, X_n) > 2$.
(b) Find the probability that $\max(X_1, \ldots, X_n) > 2$.
Hint: Example 2.7 may be helpful.

6. A certain computer is equipped with a hard drive whose lifetime (measured in months) is $X \sim \exp(\lambda)$. The lifetime of the monitor (also measured in months) is $Y \sim \exp(\mu)$. Assume the lifetimes are independent.
(a) Find the probability that the monitor fails during the first 2 months.
(b) Find the probability that both the hard drive and the monitor fail during the first year.
(c) Find the probability that either the hard drive or the monitor fails during the first year.
7. A random variable $X$ has the Weibull density with parameters $p > 0$ and $\lambda > 0$, denoted by $X \sim$ Weibull$(p, \lambda)$, if its density is given by $f(x) := \lambda p x^{p-1} e^{-\lambda x^p}$ for $x > 0$, and $f(x) := 0$ for $x \le 0$.
(a) Show that this density integrates to one.
(b) If $X \sim$ Weibull$(p, \lambda)$, evaluate $\wp(X > t)$ for $t > 0$.
(c) Let $X_1, \ldots, X_n$ be i.i.d. Weibull$(p, \lambda)$ random variables. Find the probability that none of them exceeds 3. Find the probability that at least one of them exceeds 3.
Remark. The Weibull density arises in the study of reliability in Chapter 4. Note that Weibull$(1, \lambda)$ is the same as $\exp(\lambda)$.

⋆8. The standard normal density $f \sim N(0, 1)$ is given by $f(x) := e^{-x^2/2}/\sqrt{2\pi}$. The following steps provide a mathematical proof that the normal density is indeed "bell-shaped" as shown in Figure 3.1.
(a) Use the derivative of $f$ to show that $f$ is decreasing for $x > 0$ and increasing for $x < 0$. (It then follows that $f$ has a global maximum at $x = 0$.)
(b) Show that $f$ is concave for $|x| < 1$ and convex for $|x| > 1$. Hint: Show that the second derivative of $f$ is negative for $|x| < 1$ and positive for $|x| > 1$.
(c) Since $e^z = \sum_{n=0}^{\infty} z^n/n!$, for positive $z$, $e^z \ge z$. Hence, $e^{+x^2/2} \ge x^2/2$. Use this fact to show that $e^{-x^2/2} \to 0$ as $|x| \to \infty$.

⋆9. Use the results of the preceding problem to obtain the corresponding three results for the general normal density $f \sim N(m, \sigma^2)$. Hint: Let $\varphi(t) := e^{-t^2/2}/\sqrt{2\pi}$ denote the $N(0, 1)$ density, and observe that $f(x) = \varphi((x - m)/\sigma)/\sigma$.

⋆10. As in the preceding problem, let $f \sim N(m, \sigma^2)$. Keeping in mind that $f(x)$ depends on $\sigma > 0$, show that $\lim_{\sigma\to\infty} f(x) = 0$. Show also that for $x \ne m$, $\lim_{\sigma\to 0} f(x) = 0$, whereas for $x = m$, $\lim_{\sigma\to 0} f(x) = \infty$. With $m = 0$, sketch $f(x)$ for $\sigma = 0.5$, 1, and 2.

11. The gamma density with parameter $p > 0$ is given by

$$g_p(x) := \begin{cases} \dfrac{x^{p-1} e^{-x}}{\Gamma(p)}, & x > 0, \\ 0, & x \le 0, \end{cases}$$

where $\Gamma(p)$ is the gamma function,

$$\Gamma(p) := \int_0^{\infty} x^{p-1} e^{-x}\,dx, \qquad p > 0.$$

In other words, the gamma function is defined exactly so that the gamma density integrates to one. Note that the gamma density is a generalization of the exponential since $g_1$ is the $\exp(1)$ density.
Remark. In Matlab, $\Gamma(p)$ = gamma(p).
(a) Use integration by parts to show that

$$\Gamma(p) = (p - 1)\,\Gamma(p - 1), \qquad p > 1.$$

Since $\Gamma(1)$ can be directly evaluated and is equal to one, it follows that

$$\Gamma(n) = (n - 1)!, \qquad n = 1, 2, \ldots.$$

Thus $\Gamma$ is sometimes called the factorial function.
(b) Show that $\Gamma(1/2) = \sqrt{\pi}$ as follows. In the defining integral, use the change of variable $y = \sqrt{2x}$. Write the result in terms of the standard normal density, which integrates to one, in order to obtain the answer by inspection.
(c) Show that

$$\Gamma\Big( \frac{2n+1}{2} \Big) = \frac{(2n-1) \cdots 5 \cdot 3 \cdot 1}{2^n} \sqrt{\pi}, \qquad n \ge 1.$$

Remark. The integral definition of $\Gamma(p)$ makes sense only for $p > 0$. However, the recursion $\Gamma(p) = (p-1)\Gamma(p-1)$ suggests a simple way to define $\Gamma$ for negative, noninteger arguments. For $0 < \varepsilon < 1$, the right-hand side of $\Gamma(\varepsilon) = (\varepsilon - 1)\Gamma(\varepsilon - 1)$ is undefined. However, we rearrange this equation and make the definition,

$$\Gamma(\varepsilon - 1) := -\frac{\Gamma(\varepsilon)}{1 - \varepsilon}.$$

Similarly writing $\Gamma(\varepsilon - 1) = (\varepsilon - 2)\Gamma(\varepsilon - 2)$, and so on, leads to

$$\Gamma(\varepsilon - n) = \frac{(-1)^n\,\Gamma(\varepsilon)}{(n - \varepsilon) \cdots (2 - \varepsilon)(1 - \varepsilon)}.$$

12. Important generalizations of the gamma density $g_p$ of the preceding problem arise if we include a scale parameter. For $\lambda > 0$, put

$$g_{p,\lambda}(x) := \lambda\,g_p(\lambda x) = \lambda\,\frac{(\lambda x)^{p-1} e^{-\lambda x}}{\Gamma(p)}, \qquad x > 0.$$

We write $X \sim$ gamma$(p, \lambda)$ if $X$ has density $g_{p,\lambda}$, which is called the gamma density with parameters $p$ and $\lambda$.
(a) Let $f$ be any probability density. For $\lambda > 0$, show that $f_{\lambda}(x) := \lambda f(\lambda x)$ is also a probability density.
(b) For $p = m$ a positive integer, $g_{m,\lambda}$ is called the Erlang density with parameters $m$ and $\lambda$. We write $X \sim$ Erlang$(m, \lambda)$ if $X$ has density

$$g_{m,\lambda}(x) = \lambda\,\frac{(\lambda x)^{m-1} e^{-\lambda x}}{(m-1)!}, \qquad x > 0.$$

Remark. As you will see in Problem 46(c), the sum of $m$ i.i.d. $\exp(\lambda)$ random variables is Erlang$(m, \lambda)$.
(c) If $X \sim$ Erlang$(m, \lambda)$, show that

$$\wp(X > t) = \sum_{k=0}^{m-1} \frac{(\lambda t)^k}{k!} e^{-\lambda t}.$$

In other words, if $Y \sim$ Poisson$(\lambda t)$, then $\wp(X > t) = \wp(Y < m)$.
(d) For $p = k/2$ and $\lambda = 1/2$, $g_{k/2,1/2}$ is called the chi-squared density with $k$ degrees of freedom. It is not required that $k$ be an integer. Of course, the chi-squared density with an even number of degrees of freedom, say $k = 2m$, is the same as the Erlang$(m, 1/2)$ density. Using Problem 11(b), it is also clear that for $k = 1$,

$$g_{1/2,1/2}(x) = \frac{e^{-x/2}}{\sqrt{2\pi x}}, \qquad x > 0.$$

For an odd number of degrees of freedom, say $k = 2m + 1$, where $m \ge 1$, show that

$$g_{\frac{2m+1}{2},\frac{1}{2}}(x) = \frac{x^{m-1/2} e^{-x/2}}{(2m-1) \cdots 5 \cdot 3 \cdot 1\,\sqrt{2\pi}}$$

for $x > 0$. Hint: Use Problem 11(c).
Remark. As you will see in Problem 41, the chi-squared random variable arises as the square of a normal random variable. In communication systems employing noncoherent receivers, the incoming signal is squared before further processing. Since the thermal noise in these receivers is Gaussian, chi-squared random variables naturally appear.

⋆13. The beta density with parameters $r > 0$ and $s > 0$ is defined by

$$b_{r,s}(x) := \begin{cases} \dfrac{\Gamma(r+s)}{\Gamma(r)\,\Gamma(s)}\,x^{r-1}(1-x)^{s-1}, & 0 < x < 1, \\ 0, & \text{otherwise}, \end{cases}$$

where $\Gamma$ is the gamma function defined in Problem 11. We note that if $X \sim$ gamma$(p, \lambda)$ and $Y \sim$ gamma$(q, \lambda)$ are independent random variables, then $X/(X+Y)$ has the beta density with parameters $p$ and $q$ (Problem 23 in Chapter 5).
(a) Find simplified formulas and sketch the beta density for the following sets of parameter values: (a) $r = 1$, $s = 1$. (b) $r = 2$, $s = 2$. (c) $r = 1/2$, $s = 1$.
(b) Use the following approach to show that the density integrates to one. Write $\Gamma(r)\,\Gamma(s)$ as a product of integrals using different variables of integration, say $x$ and $y$, respectively. Rewrite this as an iterated integral, with the inner variable of integration being $x$. In the inner integral, make the change of variable $u = x/(x+y)$. Change the order of integration, and then make the change of variable $v = y/(1-u)$ in the inner integral. Recognize the resulting inner integral, which does not depend on $u$, as $\Gamma(r+s)$. Thus,

$$\Gamma(r)\,\Gamma(s) = \Gamma(r+s) \int_0^1 u^{r-1}(1-u)^{s-1}\,du, \tag{3.6}$$

showing that indeed, the density integrates to one.
Remark. The above integral, which is a function of $r$ and $s$, is usually called the beta function, and is denoted by $B(r, s)$. Thus,

$$B(r, s) = \frac{\Gamma(r)\,\Gamma(s)}{\Gamma(r+s)},$$

and

$$b_{r,s}(x) = \frac{x^{r-1}(1-x)^{s-1}}{B(r, s)}, \qquad 0 < x < 1. \tag{3.7}$$

⋆14. Use equation (3.6) in the preceding problem to show that $\Gamma(1/2) = \sqrt{\pi}$. Hint: Make the change of variable $u = \sin^2\theta$. Then take $r = s = 1/2$.
Remark. In Problem 11, you used the fact that the normal density integrates to one to show that $\Gamma(1/2) = \sqrt{\pi}$. Since your derivation there is reversible, it follows that the normal density integrates to one if and only if $\Gamma(1/2) = \sqrt{\pi}$. In this problem, you used the fact that the beta density integrates to one to show that $\Gamma(1/2) = \sqrt{\pi}$. Thus, you have an alternative derivation of the fact that the normal density integrates to one.

⋆15. Show that

$$\int_0^{\pi/2} \sin^n\theta\,d\theta = \frac{\Gamma\big( \frac{n+1}{2} \big) \sqrt{\pi}}{2\,\Gamma\big( \frac{n+2}{2} \big)}.$$

Hint: Use equation (3.6) in Problem 13 with $r = (n+1)/2$ and $s = 1/2$, and make the substitution $u = \sin^2\theta$.

⋆16. The beta function $B(r, s)$ is defined as the integral in (3.6) in Problem 13. Show that

$$B(r, s) = \int_0^{\infty} (1 - e^{-\theta})^{r-1} e^{-s\theta}\,d\theta.$$

⋆17. The student's $t$ density with $\nu$ degrees of freedom is given by

$$f_{\nu}(x) := \frac{\big( 1 + \frac{x^2}{\nu} \big)^{-(\nu+1)/2}}{\sqrt{\nu}\,B(\frac{1}{2}, \frac{\nu}{2})}, \qquad -\infty < x < \infty,$$

where $B$ is the beta function. Show that $f_{\nu}$ integrates to one. Hint: The change of variable $e^{\theta} = 1 + x^2/\nu$ may be useful. Also, the result of the preceding problem may be useful.
Remarks. (i) Note that $f_1 \sim$ Cauchy$(1)$. (ii) It is shown in Problem 24 in Chapter 5 that if $X$ and $Y$ are independent with $X \sim N(0, 1)$ and $Y$ chi-squared with $k$ degrees of freedom, then $X/\sqrt{Y/k}$ has the student's $t$ density with $k$ degrees of freedom, a result of crucial importance in the study of confidence intervals in Chapter 12.

⋆18. Show that the student's $t$ density $f_{\nu}(x)$ defined in Problem 17 converges to the standard normal density as $\nu \to \infty$. Hints: Recall that

$$\lim_{\nu\to\infty} \Big( 1 + \frac{x}{\nu} \Big)^{\nu} = e^x.$$

Combine (3.7) in Problem 13 with Problem 11 for the cases $\nu = 2n$ and $\nu = 2n + 1$ separately. Then appeal to Wallis's formula [3, p. 206],

$$\frac{\pi}{2} = \frac{2}{1} \cdot \frac{2}{3} \cdot \frac{4}{3} \cdot \frac{4}{5} \cdot \frac{6}{5} \cdot \frac{6}{7} \cdots.$$

⋆19. For $p$ and $q$ positive, let $B(p, q)$ denote the beta function defined by the integral in (3.6) in Problem 13. Show that

$$f_Z(z) := \frac{1}{B(p, q)} \cdot \frac{z^{p-1}}{(1+z)^{p+q}}, \qquad z > 0,$$

is a valid density (i.e., integrates to one) on $(0, \infty)$. Hint: Make the change of variable $t = 1/(1+z)$.
Problems §3.2: Expectation of a Single Random Variable

20. Let $X$ have density $f(x) = 2/x^3$ for $x \ge 1$ and $f(x) = 0$ otherwise. Compute $E[X]$.

21. Let $X$ have the $\exp(\lambda)$ density, $f_X(x) = \lambda e^{-\lambda x}$, for $x \ge 0$. Use integration by parts to show that $E[X] = 1/\lambda$.

22. High-Mileage Cars has just begun producing its new Lambda Series, which averages $\mu$ miles per gallon. Al's Excellent Autos has a limited supply of $n$ cars on its lot. Actual mileage of the $i$th car is given by an exponential random variable $X_i$ with $E[X_i] = \mu$. Assume actual mileages of different cars are independent. Find the probability that at least one car on Al's lot gets less than $\mu/2$ miles per gallon.

23. The differential entropy of a continuous random variable $X$ with density $f$ is

$$h(X) := E[-\log f(X)] = \int_{-\infty}^{\infty} f(x) \log \frac{1}{f(x)}\,dx.$$

If $X \sim$ uniform$[0, 2]$, find $h(X)$. Repeat for $X \sim$ uniform$[0, \frac{1}{2}]$ and for $X \sim N(m, \sigma^2)$.

24. Let $X$ be a continuous random variable with density $f$, and suppose that $E[X] = 0$. If $Z$ is another random variable with density $f_Z(z) := f(z - m)$, find $E[Z]$.

⋆25. For the random variable $X$ in Problem 20, find $E[X^2]$.

⋆26. Let $X$ have the student's $t$ density with $\nu$ degrees of freedom, as defined in Problem 17. Show that $E[|X|^k]$ is finite if and only if $k < \nu$.

27. Let $X$ have moment generating function $M_X(s) = e^{\sigma^2 s^2/2}$. Use formula (3.3) to find $E[X^4]$.

28. Let $Z \sim N(0, 1)$, and put $Y = Z + n$ for some constant $n$. Show that $E[Y^4] = n^4 + 6n^2 + 3$.

29. Let $X \sim$ gamma$(p, 1)$ as in Problem 12. Show that

$$E[X^n] = \frac{\Gamma(n + p)}{\Gamma(p)} = p(p+1)(p+2) \cdots (p + [n-1]).$$

30. Let $X$ have the standard Rayleigh density, $f(x) := x e^{-x^2/2}$ for $x \ge 0$ and $f(x) := 0$ for $x < 0$.
(a) Show that $E[X] = \sqrt{\pi/2}$.
(b) For $n \ge 2$, show that $E[X^n] = 2^{n/2}\,\Gamma(1 + n/2)$.
(c) An Internet router has $n$ input links. The flows in the links are independent standard Rayleigh random variables. The router's buffer memory overflows if more than two links have flows greater than $\beta$. Find the probability of buffer memory overflow.

31. Let $X \sim$ Weibull$(p, \lambda)$ as in Problem 7. Show that $E[X^n] = \Gamma(1 + n/p)/\lambda^{n/p}$.

32. A certain nonlinear circuit has random input $X \sim \exp(1)$, and output $Y = X^{1/4}$. Find the second moment of the output.

⋆33. Let $X$ have the student's $t$ density with $\nu$ degrees of freedom, as defined in Problem 17. For $n$ a positive integer less than $\nu/2$, show that

$$E[X^{2n}] = \nu^n\,\frac{\Gamma\big( \frac{2n+1}{2} \big)\,\Gamma\big( \frac{\nu - 2n}{2} \big)}{\Gamma\big( \frac{1}{2} \big)\,\Gamma\big( \frac{\nu}{2} \big)}.$$

⋆34. Recall that the moment generating function of an $N(0, 1)$ random variable is $e^{s^2/2}$. Use this fact to find the moment generating function of an $N(m, \sigma^2)$ random variable.

35. If $X \sim$ uniform$(0, 1)$, show that $Y = \ln(1/X) \sim \exp(1)$ by finding its moment generating function for $s < 1$.

36. Find a closed-form expression for $M_X(s)$ if $X \sim$ Laplace$(\lambda)$. Use your result to find $\operatorname{var}(X)$.

⋆37. Let $X$ be the random variable of Problem 20. For what real values of $s$ is $M_X(s)$ finite? Hint: It is not necessary to evaluate $M_X(s)$ to answer the question.

38. Let $M_p(s)$ denote the moment generating function of the gamma density $g_p$ defined in Problem 11. Show that

$$M_p(s) = \frac{1}{1 - s}\,M_{p-1}(s), \qquad p > 1.$$

Remark. Since $g_1(x)$ is the $\exp(1)$ density, and $M_1(s) = 1/(1 - s)$ by direct calculation, it now follows that the moment generating function of an Erlang$(m, 1)$ random variable is $1/(1 - s)^m$.
Chap. 3 Continuous Random Variables

*39. To do this problem, use the methods of Example 3.10. Let $X$ have the gamma density $g_p$ given in Problem 11.
(a) For real $s < 1$, show that $M_X(s) = 1/(1-s)^p$.
(b) The moments of $X$ are given in Problem 29. Hence, from (3.4), we have for complex $s$,
$$M_X(s) = \sum_{n=0}^{\infty} \frac{s^n}{n!} \cdot \frac{\Gamma(n+p)}{\Gamma(p)}, \quad |s| < 1.$$
For complex $s$ with $|s| < 1$, derive the Taylor series for $1/(1-s)^p$ and show that it is equal to the above series. Thus, $M_X(s) = 1/(1-s)^p$ for all complex $s$ with $|s| < 1$.

40. As shown in the preceding problem, the basic gamma density with parameter $p$, $g_p(x)$, has moment generating function $1/(1-s)^p$. The more general gamma density defined by $g_{p,\lambda}(x) := \lambda\, g_p(\lambda x)$ is given in Problem 12.
(a) Find the moment generating function and then the characteristic function of $g_{p,\lambda}(x)$.
(b) Use the answer to (a) to find the moment generating function and the characteristic function of the Erlang density with parameters $m$ and $\lambda$, $g_{m,\lambda}(x)$.
(c) Use the answer to (a) to find the moment generating function and the characteristic function of the chi-squared density with $k$ degrees of freedom, $g_{k/2,1/2}(x)$.

41. Let $X \sim N(0,1)$, and put $Y = X^2$. For real values of $s < 1/2$, show that
$$M_Y(s) = \left(\frac{1}{1-2s}\right)^{1/2}.$$
By Problem 40(c), it follows that $Y$ is chi-squared with one degree of freedom.

42. Let $X \sim N(m,1)$, and put $Y = X^2$. For real values of $s < 1/2$, show that
$$M_Y(s) = \frac{e^{s m^2/(1-2s)}}{\sqrt{1-2s}}.$$
Remark. For $m \ne 0$, $Y$ is said to be noncentral chi-squared with one degree of freedom and noncentrality parameter $m^2$. For $m = 0$, this reduces to the result of the previous problem.

43. Let $X$ have characteristic function $\varphi_X(\nu)$. If $Y := aX + b$ for constants $a$ and $b$, express the characteristic function of $Y$ in terms of $a$, $b$, and $\varphi_X$.
3.6 Problems
44. Apply the Fourier inversion formula to $\varphi_X(\nu) = e^{-\lambda|\nu|}$ to verify that this is the characteristic function of a Cauchy$(\lambda)$ random variable.
Problems §3.3: Expectation of Multiple Random Variables

45. Let $X$ and $Y$ be independent random variables with moment generating functions $M_X(s)$ and $M_Y(s)$. If $Z := X - Y$, show that $M_Z(s) = M_X(s)\,M_Y(-s)$. Show that if both $X$ and $Y$ are $\exp(\lambda)$, then $Z \sim \text{Laplace}(\lambda)$.

46. Let $X_1, \ldots, X_n$ be independent, and put $Y_n := X_1 + \cdots + X_n$.
(a) If $X_i \sim N(m_i, \sigma_i^2)$, show that $Y_n \sim N(m, \sigma^2)$, and identify $m$ and $\sigma^2$. Hint: Example 3.13 may be helpful.
(b) If $X_i \sim \text{Cauchy}(\lambda_i)$, show that $Y_n \sim \text{Cauchy}(\lambda)$, and identify $\lambda$. Hint: Problem 44 may be helpful.
(c) If $X_i$ is a gamma random variable with parameters $p_i$ and $\lambda$ (same $\lambda$ for all $i$), show that $Y_n$ is gamma with parameters $p$ and $\lambda$, and identify $p$. Hint: The results of Problem 40 may be helpful.
Remark. Note the following special cases of this result. If all the $p_i = 1$, then the $X_i$ are exponential with parameter $\lambda$, and $Y_n$ is Erlang with parameters $n$ and $\lambda$. If all the $p_i = 1/2$ and $\lambda = 1/2$, then each $X_i$ is chi-squared with one degree of freedom, and $Y_n$ is chi-squared with $n$ degrees of freedom.

47. Let $X_1, \ldots, X_r$ be i.i.d. gamma random variables with parameters $p$ and $\lambda$. Let $Y = X_1 + \cdots + X_r$. Find $E[Y^n]$.

48. Packet transmission times on a certain network link are i.i.d. with an exponential density of parameter $\lambda$. Suppose $n$ packets are transmitted. Find the density of the time to transmit $n$ packets.

49. The random number generator on a computer produces i.i.d. uniform$(0,1)$ random variables $X_1, \ldots, X_n$. Find the probability density of
$$Y = \ln\left(\prod_{i=1}^{n} \frac{1}{X_i}\right).$$

50. Let $X_1, \ldots, X_n$ be i.i.d. Cauchy$(\lambda)$. Find the density of $Y := \beta_1 X_1 + \cdots + \beta_n X_n$, where the $\beta_i$ are given positive constants.

51. Three independent pressure sensors produce output voltages $U$, $V$, and $W$, each $\exp(\lambda)$ random variables. The three voltages are summed and fed into an alarm that sounds if the sum is greater than $x$ volts. Find the probability that the alarm sounds.
52. A certain electric power substation has $n$ power lines. The line loads are independent Cauchy$(\lambda)$ random variables. The substation automatically shuts down if the total load is greater than $\ell$. Find the probability of automatic shutdown.

53. The new outpost on Mars extracts water from the surrounding soil. There are 13 extractors. Each extractor produces water with a random efficiency that is uniformly distributed on $[0,1]$. The outpost operates normally if fewer than 3 extractors produce water with efficiency less than 0.25. If the efficiencies are independent, find the probability that the outpost operates normally.

*54. In this problem we generalize the noncentral chi-squared density of Problem 42. To distinguish these new densities from the original chi-squared densities defined in Problem 12, we refer to the original ones as central chi-squared densities. The noncentral chi-squared density with $k$ degrees of freedom and noncentrality parameter $\lambda^2$ is defined by
$$c_{k,\lambda^2}(x) := \sum_{n=0}^{\infty} \frac{(\lambda^2/2)^n e^{-\lambda^2/2}}{n!}\, c_{2n+k}(x), \quad x > 0,$$
where $c_{2n+k}$ denotes the central chi-squared density with $2n+k$ degrees of freedom. (Footnote: A closed-form expression is derived in Problem 17 of Chapter 4.)
(a) Show that $\int_0^\infty c_{k,\lambda^2}(x)\,dx = 1$.
(b) If $X$ is a noncentral chi-squared random variable with $k$ degrees of freedom and noncentrality parameter $\lambda^2$, show that $X$ has moment generating function
$$M_{k,\lambda^2}(s) = \frac{\exp[s\lambda^2/(1-2s)]}{(1-2s)^{k/2}}.$$
Hint: Problem 40 may be helpful.
Remark. When $k = 1$, this agrees with Problem 42.
(c) Let $X_1, \ldots, X_n$ be independent random variables with $X_i \sim c_{k_i,\lambda_i^2}$. Show that $Y := X_1 + \cdots + X_n$ has the $c_{k,\lambda^2}$ density, and identify $k$ and $\lambda^2$.
Remark. By part (b), if each $k_i = 1$, we could assume that each $X_i$ is the square of an $N(\lambda_i, 1)$ random variable.
(d) Show that
$$\frac{e^{-(x+\lambda^2)/2}}{\sqrt{2\pi x}} \cdot \frac{e^{\lambda\sqrt{x}} + e^{-\lambda\sqrt{x}}}{2} = c_{1,\lambda^2}(x).$$
(Note that if $\lambda = 0$, the left-hand side reduces to the central chi-squared density with one degree of freedom.) Hint: Use the power series $e^{\xi} = \sum_{n=0}^{\infty} \xi^n/n!$ for the two exponentials involving $\sqrt{x}$.
Problems §3.4: Probability Bounds

55. Let $X$ be the random variable of Problem 20. For $a \ge 1$, compare $P(X \ge a)$ and the bound obtained via Markov's inequality.

*56. Let $X$ be an exponential random variable with parameter $\lambda = 1$. Use Markov's inequality, Chebyshev's inequality, and the Chernoff bound to obtain bounds on $P(X \ge a)$ as a function of $a$.
(a) For what values of $a$ does Markov's inequality give the best bound?
(b) For what values of $a$ does Chebyshev's inequality give the best bound?
(c) For what values of $a$ does the Chernoff bound work best?
CHAPTER 4
Analyzing Systems with Random Inputs

In this chapter we consider systems with random inputs, and we try to compute probabilities involving the system outputs. The key tool we use to solve these problems is the cumulative distribution function (cdf). If $X$ is a real-valued random variable, its cdf is defined by
$$F_X(x) := P(X \le x).$$
Cumulative distribution functions of continuous random variables are introduced in Section 4.1. The emphasis is on the fact that if $Y = g(X)$, where $X$ is a continuous random variable and $g$ is a fairly simple function, then the density of $Y$ can be found by first determining the cdf of $Y$ via simple algebraic manipulations and then differentiating.

The material in Section 4.2 is a brief diversion into reliability theory. The topic is placed here because it makes simple use of the cdf. With the exception of the formula
$$E[T] = \int_0^\infty P(T > t)\,dt$$
for nonnegative random variables, which is derived at the beginning of Section 4.2, the remaining material on reliability is not used in the rest of the book.

In trying to compute probabilities involving $Y = g(X)$, the method of Section 4.1 only works for very simple functions $g$. To handle more complicated functions, we need more background on cdfs. Section 4.3 provides a brief introduction to cdfs of discrete random variables. This section is not of great interest in itself, but serves as preparation for Section 4.4 on mixed random variables. Mixed random variables frequently appear in the form $Y = g(X)$ when $X$ is continuous, but $g$ has "flat spots." More precisely, there is an interval where $g(x)$ is constant. For example, most amplifiers have a linear region, say $-v \le x \le v$, where $g(x) = x$. But if $x > v$, $g(x) = v$, and if $x < -v$, $g(x) = -v$. In this case, if a continuous random variable is applied to such a device, the output will be a mixed random variable, which can be thought of as a random variable whose "generalized density" contains Dirac impulses.

(Footnote: The material in Sections 4.3–4.5 is not used in the rest of the book, though it does provide examples of cdfs with jump discontinuities.)

The problem of finding the cdf and generalized density of $Y = g(X)$ is studied in Section 4.5. At this point, having seen several generalized densities and their corresponding cdfs, Section 4.6 summarizes and derives the general properties that characterize arbitrary cdfs.

Section 4.7 introduces the central limit theorem. Although we have seen many examples for which we can explicitly write down probabilities involving a sum of i.i.d. random variables, in general, the problem is very hard. The central limit theorem provides an approximation for computing probabilities involving the sum of i.i.d. random variables.
4.1. Continuous Random Variables
If $X$ is a continuous random variable with density $f$, then
$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$
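Example 4.1. Find the cdf of a Cauchy random variable $X$ with parameter $\lambda = 1$.

Solution. Write
$$F_X(x) = \int_{-\infty}^{x} \frac{1/\pi}{1+t^2}\,dt = \frac{1}{\pi}\tan^{-1}(t)\Big|_{-\infty}^{x} = \frac{1}{\pi}\left[\tan^{-1}(x) - \left(-\frac{\pi}{2}\right)\right] = \frac{1}{\pi}\tan^{-1}(x) + \frac{1}{2}.$$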
A graph of $F_X$ is shown in Figure 4.1.

Figure 4.1. Cumulative distribution function of a Cauchy random variable.
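The closed form from Example 4.1 can be checked numerically. The following is a minimal Python sketch (Python rather than the Matlab mentioned later in the text; the function names, the truncation point `lower`, and the step count are illustrative assumptions, not part of the text). It compares the arctangent formula with a midpoint-rule approximation of the defining integral.

```python
import math

def cauchy_cdf(x: float) -> float:
    """Closed-form cdf of a Cauchy random variable with parameter 1 (Example 4.1)."""
    return math.atan(x) / math.pi + 0.5

def cauchy_cdf_numeric(x: float, lower: float = -1e4, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of (1/pi)/(1+t^2) from `lower` to x.

    The tail below `lower` is dropped, so an error of roughly 1/(pi*|lower|) remains.
    """
    h = (x - lower) / steps
    total = 0.0
    for k in range(steps):
        t = lower + (k + 0.5) * h
        total += (1.0 / math.pi) / (1.0 + t * t)
    return total * h

if __name__ == "__main__":
    for x in (-5.0, 0.0, 1.0, 5.0):
        print(x, cauchy_cdf(x), cauchy_cdf_numeric(x))
```

The two columns printed should agree to within the truncation error noted in the docstring.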
Example 4.2. Find the cdf of a uniform$[a,b]$ random variable $X$.

Solution. Since $f(t) = 0$ for $t < a$ and $t > b$, we see that $F_X(x) = \int_{-\infty}^{x} f(t)\,dt$ is equal to 0 for $x < a$, and is equal to $\int_{-\infty}^{\infty} f(t)\,dt = 1$ for $x > b$. For $a \le x \le b$, we have
$$F_X(x) = \int_{-\infty}^{x} f(t)\,dt = \int_{a}^{x} \frac{1}{b-a}\,dt = \frac{x-a}{b-a}.$$
Hence, for $a \le x \le b$, $F_X(x)$ is an affine function of $x$. A graph of $F_X$ when $X \sim \text{uniform}[0,1]$ is shown in Figure 4.2.
Figure 4.2. Cumulative distribution function of a uniform random variable.
(Footnote to Example 4.2: A function is affine if it is equal to a linear function plus a constant.)

We now consider the cdf of a Gaussian random variable. If $X \sim N(m,\sigma^2)$, then
$$F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{t-m}{\sigma}\right)^2\right] dt.$$
Unfortunately, there is no closed-form expression for this integral. However, it can be computed numerically, and there are many subroutines available for doing it. For example, in Matlab, the above integral can be computed with normcdf(x,m,sigma). If you do not have access to such a routine, you may have software available for computing some related functions that can be used to calculate the normal cdf. To see how, we need the following analysis. Making the change of variable $\theta = (t-m)/\sigma$ yields
$$F_X(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(x-m)/\sigma} e^{-\theta^2/2}\,d\theta.$$
In other words, if we put
$$\Phi(y) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-\theta^2/2}\,d\theta,$$
then
$$F_X(x) = \Phi\left(\frac{x-m}{\sigma}\right). \qquad (4.1)$$
Note that $\Phi$ is the cdf of a standard normal random variable (a graph is shown in Figure 4.3). Thus, the cdf of an $N(m,\sigma^2)$ random variable can always be expressed in terms of the cdf of a standard $N(0,1)$ random variable. If your computer has a routine that can compute $\Phi$, then you can compute an arbitrary normal cdf.

Figure 4.3. Cumulative distribution function of a standard normal random variable.
For continuous random variables, the density can be recovered from the cdf by differentiation. Since
$$F_X(x) = \int_{-\infty}^{x} f(t)\,dt,$$
differentiation yields
$$F'(x) = f(x).$$
The observation that the density of a continuous random variable can be recovered from its cdf is of tremendous importance, as the following examples illustrate.

Example 4.3. Consider an electrical circuit whose random input voltage $X$ is first amplified by a gain $\mu > 0$ and then added to a constant offset voltage $\beta$. Then the output voltage is $Y = \mu X + \beta$. If the input voltage is a continuous random variable $X$, find the density of the output voltage $Y$.

Solution. Although the question asks for the density of $Y$, it is more advantageous to find the cdf first and then differentiate to obtain the density. Write
$$F_Y(y) = P(Y \le y) = P(\mu X + \beta \le y) = P\big(X \le (y-\beta)/\mu\big) = F_X\big((y-\beta)/\mu\big),$$
where the third equality uses $\mu > 0$. If $X$ has density $f_X$, then
$$f_Y(y) = \frac{d}{dy} F_X\left(\frac{y-\beta}{\mu}\right) = \frac{1}{\mu} F_X'\left(\frac{y-\beta}{\mu}\right) = \frac{1}{\mu} f_X\left(\frac{y-\beta}{\mu}\right).$$
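The result of Example 4.3 can be sketched in Python as follows. The names `density_of_affine`, `gain`, and `offset` are illustrative assumptions. The check relies on the standard fact, not derived in this section, that a gain-and-offset applied to an $N(0,1)$ input gives a normal output with mean `offset` and standard deviation `gain`.

```python
import math

def std_normal_pdf(x: float) -> float:
    """Density of an N(0,1) random variable."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def normal_pdf(y: float, m: float, sigma: float) -> float:
    """Density of an N(m, sigma^2) random variable."""
    return math.exp(-0.5 * ((y - m) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def density_of_affine(f_x, gain: float, offset: float, y: float) -> float:
    """Density of Y = gain*X + offset: (1/gain) * f_X((y - offset)/gain), for gain > 0."""
    if gain <= 0:
        raise ValueError("the derivation in Example 4.3 assumes a positive gain")
    return f_x((y - offset) / gain) / gain

if __name__ == "__main__":
    # Hypothetical numbers: gain 2, offset 3, standard normal input.
    for y in (0.0, 3.0, 5.0):
        print(y, density_of_affine(std_normal_pdf, 2.0, 3.0, y), normal_pdf(y, 3.0, 2.0))
```

Passing the input density as a function keeps the sketch independent of any particular distribution for $X$.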
Example 4.4. Amplitude modulation in certain communication systems can be accomplished using various nonlinear devices such as a semiconductor diode. Suppose we model the nonlinear device by the function $Y = X^2$. If the input $X$ is a continuous random variable, find the density of the output $Y = X^2$.

Solution. Although the question asks for the density of $Y$, it is more advantageous to find the cdf first and then differentiate to obtain the density. To begin, note that since $Y = X^2$ is nonnegative, for $y < 0$, $F_Y(y) = P(Y \le y) = 0$. For nonnegative $y$, write
$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = \int_{-\sqrt{y}}^{\sqrt{y}} f_X(t)\,dt.$$
The density is
$$f_Y(y) = \frac{d}{dy} \int_{-\sqrt{y}}^{\sqrt{y}} f_X(t)\,dt = \frac{1}{2\sqrt{y}} \left[ f_X(\sqrt{y}) + f_X(-\sqrt{y}) \right], \quad y > 0.$$
Since $P(Y \le y) = 0$ for $y < 0$, $f_Y(y) = 0$ for $y < 0$.

When the diode input voltage $X$ of the preceding example is $N(0,1)$, it turns out that $Y$ is chi-squared with one degree of freedom (Problem 7). If $X$ is $N(m,1)$ with $m \ne 0$, then $Y$ is noncentral chi-squared with one degree of freedom (Problem 8). These results are frequently used in the analysis of digital communication systems.

The two preceding examples illustrate the problem of finding the density of $Y = g(X)$ when $X$ is a continuous random variable. The next example illustrates the problem of finding the density of $Z = g(X,Y)$ when $X$ is discrete and $Y$ is continuous.
Example 4.5 (Signal in Additive Noise). Let $X$ and $Y$ be independent random variables, with $X$ being discrete with pmf $p_X$ and $Y$ being continuous with density $f_Y$. Put $Z := X + Y$ and find the density of $Z$.

Solution. Although the question asks for the density of $Z$, it is more advantageous to find the cdf first and then differentiate to obtain the density. This time we use the law of total probability, substitution, and independence. Write
$$\begin{aligned}
F_Z(z) &= P(Z \le z) \\
&= \sum_i P(Z \le z \mid X = x_i)\,P(X = x_i) \\
&= \sum_i P(X + Y \le z \mid X = x_i)\,P(X = x_i) \\
&= \sum_i P(x_i + Y \le z \mid X = x_i)\,P(X = x_i) \\
&= \sum_i P(Y \le z - x_i \mid X = x_i)\,P(X = x_i) \\
&= \sum_i P(Y \le z - x_i)\,P(X = x_i) \\
&= \sum_i F_Y(z - x_i)\,p_X(x_i).
\end{aligned}$$
Differentiating this expression yields
$$f_Z(z) = \sum_i f_Y(z - x_i)\,p_X(x_i).$$
We should also note that $F_{Z|X}(z \mid x_i) := P(Z \le z \mid X = x_i)$ is called the conditional cdf of $Z$ given $X$.
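The density formula derived in Example 4.5 is a weighted sum of shifted copies of $f_Y$, which the following Python sketch makes concrete. The two-point signal `PMF_X` and the choice of standard normal noise are hypothetical values chosen only for illustration.

```python
import math

def std_normal_pdf(y: float) -> float:
    """Density of N(0,1) noise (an illustrative choice for f_Y)."""
    return math.exp(-y * y / 2.0) / math.sqrt(2.0 * math.pi)

def density_of_sum(pmf_x: dict, f_y, z: float) -> float:
    """f_Z(z) = sum_i f_Y(z - x_i) p_X(x_i) for Z = X + Y, X discrete, Y continuous."""
    return sum(p * f_y(z - x) for x, p in pmf_x.items())

# Hypothetical signal: X takes the values -1 and +1 with equal probability.
PMF_X = {-1.0: 0.5, 1.0: 0.5}

if __name__ == "__main__":
    for z in (-2.0, 0.0, 2.0):
        print(z, density_of_sum(PMF_X, std_normal_pdf, z))
```

With these assumed inputs the resulting density is symmetric about zero and has one bump centered at each signal value.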
*The Normal CDF and the Error Function

If your computer does not have a routine that computes the normal cdf, it may have a routine to compute a related function called the error function, which is defined by
$$\text{erf}(z) := \frac{2}{\sqrt{\pi}} \int_0^z e^{-\xi^2}\,d\xi.$$
Note that $\text{erf}(z)$ is negative for $z < 0$. In fact, erf is an odd function. To express $\Phi$ in terms of erf, write
$$\begin{aligned}
\Phi(y) &= \frac{1}{\sqrt{2\pi}} \left[ \int_{-\infty}^{0} e^{-\theta^2/2}\,d\theta + \int_0^y e^{-\theta^2/2}\,d\theta \right] \\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \int_0^y e^{-\theta^2/2}\,d\theta, \quad \text{by Example 3.4,} \\
&= \frac{1}{2} + \frac{1}{2} \cdot \frac{2}{\sqrt{\pi}} \int_0^{y/\sqrt{2}} e^{-\xi^2}\,d\xi,
\end{aligned}$$
where the last step follows by making the change of variable $\xi = \theta/\sqrt{2}$. We now see that
$$\Phi(y) = \frac{1}{2}\left[ 1 + \text{erf}\big(y/\sqrt{2}\big) \right].$$
From the derivation, it is easy to see that this holds for all $y$, both positive and negative. The Matlab command for $\text{erf}(z)$ is erf(z).

In many applications, instead of $\Phi(y)$, we need
$$Q(y) := 1 - \Phi(y).$$
Using the complementary error function,
$$\text{erfc}(z) := 1 - \text{erf}(z) = \frac{2}{\sqrt{\pi}} \int_z^\infty e^{-\xi^2}\,d\xi,$$
it is easy to see that
$$Q(y) = \frac{1}{2}\,\text{erfc}\big(y/\sqrt{2}\big).$$
The Matlab command for $\text{erfc}(z)$ is erfc(z).
4.2. Reliability
Let $T$ be the lifetime of a device or system. The reliability function of the device or system is defined by
$$R(t) := P(T > t) = 1 - F_T(t).$$
The reliability at time $t$ is the probability that the lifetime is greater than $t$. The mean time to failure (MTTF) is simply the expected lifetime, $E[T]$. Since lifetimes are nonnegative random variables, we claim that
$$E[T] = \int_0^\infty P(T > t)\,dt. \qquad (4.2)$$
It then follows that
$$E[T] = \int_0^\infty R(t)\,dt;$$
i.e., the MTTF is the integral of the reliability. To derive (4.2), write
$$\begin{aligned}
\int_0^\infty P(T > t)\,dt &= \int_0^\infty E[I_{(t,\infty)}(T)]\,dt \\
&= E\left[ \int_0^\infty I_{(t,\infty)}(T)\,dt \right] \\
&= E\left[ \int_0^\infty I_{(-\infty,T)}(t)\,dt \right].
\end{aligned}$$
To evaluate the integral, observe that since $T$ is nonnegative, the intersection of $[0,\infty)$ and $(-\infty,T)$ is $[0,T)$. Hence,
$$\int_0^\infty P(T > t)\,dt = E\left[ \int_0^T dt \right] = E[T].$$
The failure rate of a device or system with lifetime $T$ is
$$r(t) := \lim_{\Delta t \downarrow 0} \frac{P(T \le t + \Delta t \mid T > t)}{\Delta t}.$$
This can be rewritten as
$$P(T \le t + \Delta t \mid T > t) \approx r(t)\,\Delta t.$$
In other words, given that the device or system has operated for more than $t$ units of time, the conditional probability of failure before time $t + \Delta t$ is approximately $r(t)\,\Delta t$. Intuitively, the form of a failure rate function should be as shown in Figure 4.4. For small values of $t$, $r(t)$ is relatively large when pre-existing defects are likely to appear. Then for intermediate values of $t$, $r(t)$ is flat, indicating a constant failure rate. For large $t$, as the device gets older, $r(t)$ increases, indicating that failure is more likely.

Figure 4.4. Typical form of a failure rate function (vertical axis $r(t)$, horizontal axis $t$).

To say more about the failure rate, write
$$\begin{aligned}
P(T \le t + \Delta t \mid T > t) &= \frac{P(\{T \le t + \Delta t\} \cap \{T > t\})}{P(T > t)} \\
&= \frac{P(t < T \le t + \Delta t)}{P(T > t)} \\
&= \frac{F_T(t + \Delta t) - F_T(t)}{R(t)}.
\end{aligned}$$
Since $F_T(t) = 1 - R(t)$, we can rewrite this as
$$P(T \le t + \Delta t \mid T > t) = -\frac{R(t + \Delta t) - R(t)}{R(t)}.$$
Dividing both sides by $\Delta t$ and letting $\Delta t \downarrow 0$ yields the differential equation
$$r(t) = -\frac{R'(t)}{R(t)}.$$
Now suppose $T$ is a continuous random variable with density $f_T$. Then
$$R(t) = P(T > t) = \int_t^\infty f_T(\theta)\,d\theta,$$
and
$$R'(t) = -f_T(t).$$
We can now write
$$r(t) = -\frac{R'(t)}{R(t)} = \frac{f_T(t)}{\int_t^\infty f_T(\theta)\,d\theta}.$$
In this case, the failure rate $r(t)$ is completely determined by the density $f_T(t)$. The converse is also true; i.e., given the failure rate $r(t)$, we can recover the density $f_T(t)$. To see this, rewrite the above differential equation as
$$-r(t) = \frac{R'(t)}{R(t)} = \frac{d}{dt} \ln R(t).$$
Integrating the left and right-hand formulas from zero to $t$ yields
$$-\int_0^t r(\tau)\,d\tau = \ln R(t) - \ln R(0).$$
Then
$$e^{-\int_0^t r(\tau)\,d\tau} = \frac{R(t)}{R(0)} = R(t),$$
where we have used the fact that for a nonnegative, continuous random variable, $R(0) = P(T > 0) = P(T \ge 0) = 1$. Since $R'(t) = -f_T(t)$,
$$f_T(t) = r(t)\,e^{-\int_0^t r(\tau)\,d\tau}.$$
For example, if the failure rate is constant, $r(t) = \lambda$, then $f_T(t) = \lambda e^{-\lambda t}$, and we see that $T$ has an exponential density with parameter $\lambda$.
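The recovery of the density from the failure rate can be sketched numerically. In the Python sketch below, the integral of the failure rate is approximated with a midpoint rule. The function name, the step count, and the linearly increasing rate used as a second test case are assumptions made for illustration and are not taken from the text.

```python
import math

def density_from_failure_rate(r, t: float, steps: int = 10_000) -> float:
    """f_T(t) = r(t) * exp(-integral_0^t r(tau) dtau), integral by the midpoint rule."""
    if t < 0:
        return 0.0  # lifetimes are nonnegative
    h = t / steps
    integral = sum(r((k + 0.5) * h) for k in range(steps)) * h
    return r(t) * math.exp(-integral)

if __name__ == "__main__":
    lam = 2.0
    # Constant failure rate: the result should match lam * exp(-lam * t).
    for t in (0.5, 1.0, 1.5):
        print(t, density_from_failure_rate(lambda _: lam, t), lam * math.exp(-lam * t))
    # Hypothetical linearly increasing rate r(t) = t: the formula gives t * exp(-t^2 / 2).
    print(density_from_failure_rate(lambda tau: tau, 2.0), 2.0 * math.exp(-2.0))
```

The midpoint rule is exact for constant and linear rates, so both comparisons should agree up to floating-point rounding. For a general rate function the step count controls the accuracy.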
4.3. Cdfs for Discrete Random Variables
In this section, we show that for discrete random variables, the pmf can be recovered from the cdf. If $X$ is a discrete random variable, then
$$F_X(x) = P(X \le x) = \sum_i I_{(-\infty,x]}(x_i)\,p_X(x_i) = \sum_{i:\,x_i \le x} p_X(x_i).$$
Example 4.6. Let $X$ be a discrete random variable with $p_X(0) = 0.1$, $p_X(1) = 0.4$, $p_X(2.5) = 0.2$, and $p_X(3) = 0.3$. Find the cdf $F_X(x)$ for all $x$.

Solution. Put $x_0 = 0$, $x_1 = 1$, $x_2 = 2.5$, and $x_3 = 3$. Let us first evaluate $F_X(0.5)$. Observe that
$$F_X(0.5) = \sum_{i:\,x_i \le 0.5} p_X(x_i) = p_X(0),$$
since only $x_0 = 0$ is less than or equal to 0.5. Similarly, for any $0 \le x < 1$, $F_X(x) = p_X(0) = 0.1$. Next observe that
$$F_X(2.1) = \sum_{i:\,x_i \le 2.1} p_X(x_i) = p_X(0) + p_X(1),$$
since only $x_0 = 0$ and $x_1 = 1$ are less than or equal to 2.1. Similarly, for any $1 \le x < 2.5$, $F_X(x) = 0.1 + 0.4 = 0.5$. The same kind of argument shows that for $2.5 \le x < 3$, $F_X(x) = 0.1 + 0.4 + 0.2 = 0.7$. Since all the $x_i$ are less than or equal to 3, for $x \ge 3$, $F_X(x) = 0.1 + 0.4 + 0.2 + 0.3 = 1.0$. Finally, since none of the $x_i$ is less than zero, for $x < 0$, $F_X(x) = 0$. The graph of $F_X$ is given in Figure 4.5.

Figure 4.5. Cumulative distribution function of a discrete random variable.

We now show that the arguments used in the preceding example hold for the cdf of any discrete random variable $X$ taking distinct values $x_i$. Fix two adjacent points, say $x_{j-1} < x_j$ as shown in Figure 4.6. For $x_{j-1} \le x < x_j$, observe that
$$F_X(x) = \sum_{i:\,x_i \le x} p_X(x_i) = \sum_{i:\,x_i \le x_{j-1}} p_X(x_i) = F_X(x_{j-1}).$$
In other words, the cdf is constant on the interval $[x_{j-1}, x_j)$, with the constant equal to the value of the cdf at the left-hand endpoint, $F_X(x_{j-1})$.

Figure 4.6. Analysis for the cdf of a discrete random variable (points $x_1, x_2, \ldots, x_{j-1}, x_j, \ldots$ on the $x$-axis carrying masses $p_1, p_2, \ldots, p_{j-1}, p_j, \ldots$).

Next, since
$$F_X(x_j) = \sum_{i:\,x_i \le x_j} p_X(x_i) = \sum_{i:\,x_i \le x_{j-1}} p_X(x_i) + p_X(x_j),$$
we have
$$F_X(x_j) = F_X(x_{j-1}) + p_X(x_j),$$
or
$$F_X(x_j) - F_X(x_{j-1}) = p_X(x_j).$$
As noted above, for $x_{j-1} \le x < x_j$, $F_X(x) = F_X(x_{j-1})$. Hence, $F_X(x_j-) := \lim_{x \uparrow x_j} F_X(x) = F_X(x_{j-1})$. It then follows that
$$F_X(x_j) - F_X(x_j-) = F_X(x_j) - F_X(x_{j-1}) = p_X(x_j).$$
In other words, the size of the jump in the cdf at $x_j$ is $p_X(x_j)$, and the cdf is constant between jumps.
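The staircase structure described in this section is easy to reproduce in code. The following Python sketch evaluates the cdf of Example 4.6 directly from its pmf and recovers one pmf value as a jump size. The dictionary representation and the small offset `eps`, used to approximate the left limit, are illustrative choices.

```python
def discrete_cdf(pmf: dict, x: float) -> float:
    """F_X(x) = sum of p_X(x_i) over all points x_i <= x."""
    return sum((p for xi, p in pmf.items() if xi <= x), 0.0)

# The pmf of Example 4.6.
PMF = {0.0: 0.1, 1.0: 0.4, 2.5: 0.2, 3.0: 0.3}

def jump_at(pmf: dict, x: float, eps: float = 1e-9) -> float:
    """Approximate F_X(x) - F_X(x-) by evaluating the cdf just to the left of x."""
    return discrete_cdf(pmf, x) - discrete_cdf(pmf, x - eps)

if __name__ == "__main__":
    for x in (-1.0, 0.5, 2.1, 2.5, 3.0, 4.0):
        print(x, discrete_cdf(PMF, x))
    print("jump at 2.5:", jump_at(PMF, 2.5))
```

The jump approximation is valid only when `eps` is smaller than the spacing between adjacent points of the pmf.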
4.4. Mixed Random Variables
We say that $X$ is a mixed random variable if $P(X \in B)$ has the form
$$P(X \in B) = \int_B \tilde f(t)\,dt + \sum_i I_B(x_i)\,\tilde p_i \qquad (4.3)$$
for some nonnegative function $\tilde f$, some nonnegative sequence $\tilde p_i$, and distinct points $x_i$ such that $\int_{-\infty}^{\infty} \tilde f(t)\,dt + \sum_i \tilde p_i = 1$. Note that the mixed random variables include the continuous and discrete real-valued random variables as special cases.
Example 4.7. Let $X$ be a mixed random variable with
$$P(X \in B) := \frac{1}{4} \int_B e^{-|t|}\,dt + \frac{1}{3} I_B(0) + \frac{1}{6} I_B(7). \qquad (4.4)$$
Compute $P(X < 0)$, $P(X \le 0)$, $P(X > 2)$, and $P(X \ge 2)$.

Solution. We need to compute $P(X \in B)$ for $B = (-\infty, 0)$, $(-\infty, 0]$, $(2, \infty)$, and $[2, \infty)$, respectively. Since the set $B = (-\infty, 0)$ does not contain 0 or 7, the two indicator functions in the definition of $P(X \in B)$ are zero in this case. Hence,
$$P(X < 0) = \frac{1}{4} \int_{-\infty}^{0} e^{-|t|}\,dt = \frac{1}{4} \int_{-\infty}^{0} e^{t}\,dt = \frac{1}{4}.$$
Next, the set $B = (-\infty, 0]$ contains 0 but not 7. Thus,
$$P(X \le 0) = \frac{1}{4} \int_{-\infty}^{0} e^{-|t|}\,dt + \frac{1}{3} = \frac{1}{4} + \frac{1}{3} = \frac{7}{12}.$$
For the last two probabilities, we need to consider $(2, \infty)$ and $[2, \infty)$. Both intervals contain 7 but not 0. Hence, the probabilities are the same, and both equal
$$\frac{1}{4} \int_2^\infty e^{-t}\,dt + \frac{1}{6} = \frac{e^{-2}}{4} + \frac{1}{6}.$$

Example 4.8. With $X$ as in the last example, compute $P(X = 0)$ and $P(X = 7)$. Also, use your results to verify that $P(X = 0) + P(X < 0) = P(X \le 0)$.

Solution. To compute $P(X = 0) = P(X \in \{0\})$, take $B = \{0\}$ in (4.4). Then
$$P(X = 0) = \frac{1}{4} \int_{\{0\}} e^{-|t|}\,dt + \frac{1}{3} I_{\{0\}}(0) + \frac{1}{6} I_{\{0\}}(7).$$
The integral of an ordinary function over a set consisting of a single point is zero, and since $7 \notin \{0\}$, the last term is zero also. Hence, $P(X = 0) = 1/3$. Similarly, $P(X = 7) = 1/6$. From the previous example, $P(X < 0) = 1/4$. Hence, $P(X = 0) + P(X < 0) = 1/3 + 1/4 = 7/12$, which is indeed the value of $P(X \le 0)$ computed earlier.
In order to perform calculations with mixed random variables more easily, it is sometimes convenient to generalize the notion of density to permit impulses. Recall that the unit impulse or Dirac delta function, denoted by $\delta$, is defined by the two properties
$$\delta(t) = 0 \text{ for } t \ne 0 \quad \text{and} \quad \int_{-\infty}^{\infty} \delta(t)\,dt = 1.$$
Putting
$$f(t) := \tilde f(t) + \sum_i \tilde p_i\,\delta(t - x_i), \qquad (4.5)$$
it is easy to verify that
$$P(X \in B) = \int_B f(t)\,dt \quad \text{and} \quad E[g(X)] = \int_{-\infty}^{\infty} g(t) f(t)\,dt.$$
We call $f$ in (4.5) a generalized probability density function. If $\tilde f(t) = 0$ for all $t$, we say that $f$ is purely impulsive, and if $\tilde p_i = 0$ for all $i$, we say $f$ is nonimpulsive.

Example 4.9. The preceding examples can be worked using delta functions. In these examples, the generalized density is
$$f(t) = \frac{1}{4} e^{-|t|} + \frac{1}{3}\delta(t) + \frac{1}{6}\delta(t - 7),$$
and
$$P(X \in B) = \int_B f(t)\,dt.$$
Since the impulses are located at 0 and at 7, they will contribute to the integral if and only if 0 and/or 7 lie in the set $B$. For example, in computing
$$P(0 < X \le 7) = \int_{0+}^{7} f(t)\,dt,$$
the impulse at the origin makes no contribution, but the impulse at 7 does. Thus,
$$\begin{aligned}
P(0 < X \le 7) &= \int_{0+}^{7} \left[ \frac{1}{4} e^{-|t|} + \frac{1}{3}\delta(t) + \frac{1}{6}\delta(t - 7) \right] dt \\
&= \frac{1}{4} \int_0^7 e^{-t}\,dt + \frac{1}{6} \\
&= \frac{1 - e^{-7}}{4} + \frac{1}{6} = \frac{5}{12} - \frac{e^{-7}}{4}.
\end{aligned}$$
Similarly, in computing $P(X = 0) = P(X \in \{0\})$, only the impulse at zero makes a contribution. Thus,
$$P(X = 0) = \int_{\{0\}} f(t)\,dt = \int_{\{0\}} \frac{1}{3}\delta(t)\,dt = \frac{1}{3}.$$
We now show that if $X$ is a mixed random variable as in (4.3), then $\tilde f(x)$ and $\tilde p_i$ can be recovered from the cdf of $X$. From (4.3),
$$F_X(x) = \int_{-\infty}^{x} \tilde f(t)\,dt + \sum_{i:\,x_i \le x} \tilde p_i,$$
and so
$$F_X'(x) = \tilde f(x), \quad x \ne x_i,$$
while
$$F_X(x_i) - F_X(x_i-) = \tilde p_i.$$
Figure 4.7 shows the cdf of a mixed random variable with $\tilde f(t) = (0.7/\pi)/(1+t^2)$, $x_1 = 0$, $x_2 = 3$, $\tilde p_1 = 0.21$, and $\tilde p_2 = 0.09$.

Figure 4.7. Cumulative distribution function of a mixed random variable.
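To tie the cdf of a mixed random variable back to Examples 4.7 through 4.9, the following Python sketch writes out $F_X(x)$ and its left limit $F_X(x-)$ for the random variable defined in (4.4). The closed form of the continuous part is obtained here by integrating $\frac{1}{4}e^{-|t|}$ by hand. The function names and the dictionary of atoms are illustrative assumptions.

```python
import math

# Impulse locations and weights from (4.4).
ATOMS = {0.0: 1.0 / 3.0, 7.0: 1.0 / 6.0}

def continuous_part(x: float) -> float:
    """(1/4) * integral from -infinity to x of exp(-|t|) dt, evaluated in closed form."""
    if x < 0:
        return 0.25 * math.exp(x)
    return 0.25 * (2.0 - math.exp(-x))

def mixed_cdf(x: float) -> float:
    """F_X(x) = P(X <= x): continuous part plus every atom located at or below x."""
    return continuous_part(x) + sum(p for xi, p in ATOMS.items() if xi <= x)

def mixed_cdf_left(x: float) -> float:
    """F_X(x-) = P(X < x): continuous part plus every atom located strictly below x."""
    return continuous_part(x) + sum(p for xi, p in ATOMS.items() if xi < x)

if __name__ == "__main__":
    print("P(X < 0)      =", mixed_cdf_left(0.0))
    print("P(X <= 0)     =", mixed_cdf(0.0))
    print("P(X = 7)      =", mixed_cdf(7.0) - mixed_cdf_left(7.0))
    print("P(0 < X <= 7) =", mixed_cdf(7.0) - mixed_cdf(0.0))
```

The jump `mixed_cdf(x) - mixed_cdf_left(x)` recovers the impulse weight at each atom and is zero elsewhere, in line with the relations above.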
4.5. Functions of Random Variables and Their Cdfs
Most modern systems are composed of many subsystems in which the output of one system serves as the input to another. When the input to a system or a subsystem is random, so is the output. To evaluate system performance, it is necessary to take into account this randomness. The first step in this process is to find the cdf of the system output if we know the pmf or density of the random input. In many cases, the output will be a mixed random variable with a generalized impulsive density.

We consider systems modeled by real-valued functions $g(x)$. The random system input is a random variable $X$, and the system output is the random variable $Y = g(X)$. To find $F_Y(y)$, observe that
$$F_Y(y) := P(Y \le y) = P(g(X) \le y) = P(X \in B_y),$$
where
$$B_y := \{x \in \mathbb{R} : g(x) \le y\}.$$
If $X$ has density $f_X$, then
$$F_Y(y) = P(X \in B_y) = \int_{B_y} f_X(x)\,dx.$$
The difficulty is to identify the set $B_y$. However, if we first sketch the function $g(x)$, the problem becomes manageable.

Example 4.10. Suppose $g$ is given by
$$g(x) := \begin{cases} 1, & -2 \le x < -1, \\ x^2, & -1 \le x < 0, \\ x, & 0 \le x < 2, \\ 2, & 2 \le x < 3, \\ 0, & \text{otherwise.} \end{cases}$$
Find the inverse image $B = \{x \in \mathbb{R} : g(x) \le y\}$ for $y \ge 2$, $2 > y \ge 1$, $1 > y \ge 0$, and $y < 0$.

Solution. To begin, we sketch $g$ in Figure 4.8. Fix any $y \ge 2$. Since $g$ is upper bounded by 2, $g(x) \le 2 \le y$ for all $x \in \mathbb{R}$. Thus, for $y \ge 2$, $B = \mathbb{R}$.

Figure 4.8. The function $g$ from Example 4.10.

Next, fix any $y$ with $2 > y \ge 1$. On the graph of $g$, draw a horizontal line at level $y$. In Figure 4.9, the line is drawn at level $y = 1.5$. Next, at the intersection of the horizontal line and the curve $g$, drop a vertical line to the $x$-axis. In the figure, this vertical line hits the $x$-axis at a marked point.

Figure 4.9. Determining $B$ for $1 \le y < 2$.

Clearly, for all $x$ to the left of this point, and for all $x \ge 3$, $g(x) \le y$. To find the $x$-coordinate of the marked point, we solve $g(x) = y$ for $x$. For the $y$-value in question, the formula for $g(x)$ is $g(x) = x$. Hence, the $x$-coordinate of the marked point is simply $y$. Thus, $g(x) \le y \Leftrightarrow x \le y$ or $x \ge 3$, and so,
$$B = (-\infty, y] \cup [3, \infty), \quad 1 \le y < 2.$$
Now fix any $y$ with $1 > y \ge 0$, and draw a horizontal line at level $y$ as before. In Figure 4.10 we have used $y = 1/2$. This time the horizontal line intersects the curve $g$ in two places, and there are two points marked on the $x$-axis. Call the $x$-coordinate of the left one $x_1$ and that of the right one $x_2$. We must solve $g(x_1) = y$, where $x_1$ is negative and $g(x_1) = x_1^2$. We must also solve $g(x_2) = y$, where $g(x_2) = x_2$. We conclude that $g(x) \le y \Leftrightarrow x < -2$ or $-\sqrt{y} \le x \le y$ or $x \ge 3$. Thus,
$$B = (-\infty, -2) \cup [-\sqrt{y}, y] \cup [3, \infty), \quad 0 \le y < 1.$$
Note that when $y = 0$, the interval $[0, 0]$ is just the singleton set $\{0\}$. Finally, since $g(x) \ge 0$ for all $x \in \mathbb{R}$, for $y < 0$, $B = \emptyset$.
Example 4.11. Let $g$ be as in Example 4.10, and suppose that $X \sim \text{uniform}[-4, 4]$. Find and sketch the cumulative distribution of $Y = g(X)$. Also find and sketch the density.

Solution. Proceeding as in the two examples above, write $F_Y(y) = P(Y \le y) = P(X \in B)$ for the appropriate inverse image $B = \{x \in \mathbb{R} : g(x) \le y\}$.

Figure 4.10. Determining $B$ for $0 \le y < 1$.

From Example 4.10,
$$B = \begin{cases} \emptyset, & y < 0, \\ (-\infty, -2) \cup [-\sqrt{y}, y] \cup [3, \infty), & 0 \le y < 1, \\ (-\infty, y] \cup [3, \infty), & 1 \le y < 2, \\ \mathbb{R}, & y \ge 2. \end{cases}$$
For $y < 0$, $F_Y(y) = 0$. For $0 \le y < 1$,
$$F_Y(y) = P(X < -2) + P(-\sqrt{y} \le X \le y) + P(X \ge 3) = \frac{-2 - (-4)}{8} + \frac{y - (-\sqrt{y})}{8} + \frac{4 - 3}{8} = \frac{y + \sqrt{y} + 3}{8}.$$
For $1 \le y < 2$,
$$F_Y(y) = P(X \le y) + P(X \ge 3) = \frac{y - (-4)}{8} + \frac{4 - 3}{8} = \frac{y + 5}{8}.$$
For $y \ge 2$, $F_Y(y) = P(X \in \mathbb{R}) = 1$. Putting all this together,
$$F_Y(y) = \begin{cases} 0, & y < 0, \\ (y + \sqrt{y} + 3)/8, & 0 \le y < 1, \\ (y + 5)/8, & 1 \le y < 2, \\ 1, & y \ge 2. \end{cases}$$

Figure 4.11. Cumulative distribution function $F_Y(y)$ from Example 4.11.

In sketching $F_Y(y)$, we note from the formula that it is 0 for $y < 0$ and 1 for $y \ge 2$. Also from the formula, note that there is a jump discontinuity of $3/8$ at $y = 0$ and a jump of $1/8$ at $y = 1$ and at $y = 2$. See Figure 4.11. From the observations used in graphing $F_Y$, we can easily obtain the generalized density,
$$h_Y(y) = \frac{3}{8}\delta(y) + \frac{1}{8}\delta(y - 1) + \frac{1}{8}\delta(y - 2) + \tilde f_Y(y),$$
where
$$\tilde f_Y(y) = \begin{cases} [1 + 1/(2\sqrt{y})]/8, & 0 < y < 1, \\ 1/8, & 1 < y < 2, \\ 0, & \text{otherwise,} \end{cases}$$
is obtained by differentiating $F_Y(y)$ at non-jump points $y$. A sketch of $h_Y$ is shown in Figure 4.12.
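The cdf found in Example 4.11 can be checked against a brute-force computation. The Python sketch below codes $g$ and the piecewise formula for $F_Y$, then estimates $P(g(X) \le y)$ for $X \sim \text{uniform}[-4,4]$ as the fraction of a fine grid of $x$ values with $g(x) \le y$. The grid approach and its resolution `n` are illustrative assumptions, not a method used in the text.

```python
import math

def g(x: float) -> float:
    """The function g of Example 4.10."""
    if -2 <= x < -1:
        return 1.0
    if -1 <= x < 0:
        return x * x
    if 0 <= x < 2:
        return x
    if 2 <= x < 3:
        return 2.0
    return 0.0

def cdf_y(y: float) -> float:
    """Piecewise formula for F_Y(y) from Example 4.11."""
    if y < 0:
        return 0.0
    if y < 1:
        return (y + math.sqrt(y) + 3.0) / 8.0
    if y < 2:
        return (y + 5.0) / 8.0
    return 1.0

def cdf_y_grid(y: float, n: int = 80_000) -> float:
    """Fraction of the midpoints of n equal cells of [-4, 4] at which g(x) <= y."""
    h = 8.0 / n
    count = sum(1 for k in range(n) if g(-4.0 + (k + 0.5) * h) <= y)
    return count / n

if __name__ == "__main__":
    for y in (-0.5, 0.0, 0.5, 1.0, 1.5, 2.0):
        print(y, cdf_y(y), cdf_y_grid(y))
```

Because $X$ is uniform, the fraction of grid points in the inverse image approximates its probability, and the two columns should agree to within a few grid cells.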
4.6. Properties of Cdfs
Given an arbitrary real-valued random variable $X$, its cumulative distribution function is defined by
$$F(x) := P(X \le x), \quad -\infty < x < \infty.$$
We show below that $F$ satisfies the following eight properties:
(i) $0 \le F(x) \le 1$.
(ii) For $a < b$, $P(a < X \le b) = F(b) - F(a)$.
(iii) $F$ is nondecreasing, i.e., $a \le b$ implies $F(a) \le F(b)$.
(iv) $\lim_{x \uparrow \infty} F(x) = 1$.
(v) $\lim_{x \downarrow -\infty} F(x) = 0$.
(vi) $F(x_0+) := \lim_{x \downarrow x_0} F(x) = P(X \le x_0) = F(x_0)$. In other words, $F$ is right-continuous.
(vii) $F(x_0-) := \lim_{x \uparrow x_0} F(x) = P(X < x_0)$.
(viii) $P(X = x_0) = F(x_0) - F(x_0-)$.

Figure 4.12. Density $h_Y(y)$ from Example 4.11.

We also point out that
$$P(X > x_0) = 1 - P(X \le x_0) = 1 - F(x_0),$$
and
$$P(X \ge x_0) = 1 - P(X < x_0) = 1 - F(x_0-).$$
If $F(x)$ is continuous at $x = x_0$, i.e., $F(x_0-) = F(x_0)$, then this last equation becomes
$$P(X \ge x_0) = 1 - F(x_0).$$
Another consequence of the continuity of $F(x)$ at $x = x_0$ is that $P(X = x_0) = 0$. Note that if a random variable has a nonimpulsive density, then its cumulative distribution is continuous everywhere.

We now derive the eight properties of cumulative distribution functions.
(i) The properties of $P$ imply that $F(x) = P(X \le x)$ satisfies $0 \le F(x) \le 1$.
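(ii) First consider the disjoint union $(-\infty, b] = (-\infty, a] \cup (a, b]$. It then follows that
$$\{X \le b\} = \{X \le a\} \cup \{a < X \le b\}$$
is a disjoint union of events in $\Omega$. Now write
$$F(b) = P(X \le b) = P(\{X \le a\} \cup \{a < X \le b\}) = P(X \le a) + P(a < X \le b) = F(a) + P(a < X \le b).$$
Now subtract $F(a)$ from both sides.
(iii) This follows from (ii) since $P(a < X \le b) \ge 0$.
(iv) We prove the simpler result $\lim_{N \to \infty} F(N) = 1$. Starting with
$$\mathbb{R} = (-\infty, \infty) = \bigcup_{n=1}^{\infty} (-\infty, n],$$
we can write
$$1 = P(X \in \mathbb{R}) = P\left( \bigcup_{n=1}^{\infty} \{X \le n\} \right) = \lim_{N \to \infty} P(X \le N) = \lim_{N \to \infty} F(N),$$
where the third equality is by limit property (1.4).
(v) We prove the simpler result, $\lim_{N \to \infty} F(-N) = 0$. Starting with
$$\emptyset = \bigcap_{n=1}^{\infty} (-\infty, -n],$$
we can write
$$0 = P(X \in \emptyset) = P\left( \bigcap_{n=1}^{\infty} \{X \le -n\} \right) = \lim_{N \to \infty} P(X \le -N) = \lim_{N \to \infty} F(-N),$$
where the third equality is by limit property (1.5).
(vi) We prove the simpler result, $P(X \le x_0) = \lim_{N \to \infty} F(x_0 + \frac{1}{N})$. Starting with
$$(-\infty, x_0] = \bigcap_{n=1}^{\infty} (-\infty, x_0 + \tfrac{1}{n}],$$
we can write
$$P(X \le x_0) = P\left( \bigcap_{n=1}^{\infty} \{X \le x_0 + \tfrac{1}{n}\} \right) = \lim_{N \to \infty} P(X \le x_0 + \tfrac{1}{N}) = \lim_{N \to \infty} F(x_0 + \tfrac{1}{N}),$$
where the second equality is by (1.5).
(vii) We prove the simpler result, $P(X < x_0) = \lim_{N \to \infty} F(x_0 - \frac{1}{N})$. Starting with
$$(-\infty, x_0) = \bigcup_{n=1}^{\infty} (-\infty, x_0 - \tfrac{1}{n}],$$
we can write
$$P(X < x_0) = P\left( \bigcup_{n=1}^{\infty} \{X \le x_0 - \tfrac{1}{n}\} \right) = \lim_{N \to \infty} P(X \le x_0 - \tfrac{1}{N}) = \lim_{N \to \infty} F(x_0 - \tfrac{1}{N}),$$
where the second equality is by (1.4).
(viii) First consider the disjoint union $(-\infty, x_0] = (-\infty, x_0) \cup \{x_0\}$. It then follows that
$$\{X \le x_0\} = \{X < x_0\} \cup \{X = x_0\}$$
is a disjoint union of events in $\Omega$. Using Property (vii), it follows that
$$F(x_0) = F(x_0-) + P(X = x_0).$$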
4.7. The Central Limit Theorem
Let X1 ; X2; : : : be i.i.d. with common mean m and common variance 2. There are many cases for which we know the probability mass function or density of n X Xi : i=1
For example, if the Xi are Bernoulli, binomial, Poisson, gamma, or Gaussian, we know the cdf of the sum (see Example 2.20 or Problem 22 in Chapter 2 and Problem 46 in Chapter 3). Note that the exponential and chi-squared are special cases of the gamma (see Problem 12 in Chapter 3). In general, however, nding the cdf of a sum of i.i.d. random variables is not possible. Fortunately, we have the following approximation. Central Limit Theorem (CLT). Let X1 ; X2; : : : be independent, identically distributed random variables with nite mean m and nite variance 2 . If n , m and M := 1 X Yn := M=n p n n n i=1 Xi ; July 18, 2002
138
Chap. 4 Analyzing Systems with Random Inputs
then
lim F (y) n!1 Yn
= (y): To get some idea of how large n should be, would like to compare FYn (y) and (y) in cases where FYn is known. To do this, we need the following result. Example 4.12. Show that if Gn is the cdf of Pni=1 Xi, then p (4.6) FYn (y) = Gn(y n + nm):
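Solution. Write
\begin{align*}
F_{Y_n}(y) &= P\Bigl( \frac{M_n - m}{\sigma/\sqrt{n}} \le y \Bigr) \\
&= P(M_n - m \le y\sigma/\sqrt{n}) \\
&= P(M_n \le y\sigma/\sqrt{n} + m) \\
&= P\Bigl( \sum_{i=1}^{n} X_i \le y\sigma\sqrt{n} + nm \Bigr) \\
&= G_n(y\sigma\sqrt{n} + nm).
\end{align*}

When the $X_i$ are exp(1), $G_n$ is the Erlang($n$, 1) cdf given in Problem 12(c) in Chapter 3. In Figure 4.13 we plot $F_{Y_{30}}(y)$ (dashed line) and the $N(0,1)$ cdf $\Phi(y)$ (solid line).

A typical calculation using the central limit theorem is as follows. To approximate
\[ P\Bigl( \sum_{i=1}^{n} X_i > t \Bigr), \]
write
\begin{align*}
P\Bigl( \sum_{i=1}^{n} X_i > t \Bigr) &= P\Bigl( \frac{1}{n} \sum_{i=1}^{n} X_i > \frac{t}{n} \Bigr) \\
&= P(M_n > t/n) \\
&= P(M_n - m > t/n - m) \\
&= P\Bigl( \frac{M_n - m}{\sigma/\sqrt{n}} > \frac{t/n - m}{\sigma/\sqrt{n}} \Bigr) \\
&= P\Bigl( Y_n > \frac{t/n - m}{\sigma/\sqrt{n}} \Bigr) \\
&= 1 - F_{Y_n}\Bigl( \frac{t/n - m}{\sigma/\sqrt{n}} \Bigr) \\
&\approx 1 - \Phi\Bigl( \frac{t/n - m}{\sigma/\sqrt{n}} \Bigr).
\end{align*}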
[Figure 4.13: two cdf curves plotted for $y$ from $-3$ to $3$; the vertical axis runs from 0 to 1.]

Figure 4.13. Illustration of the central limit theorem when the $X_i$ are exponential with parameter 1. The dashed line is $F_{Y_{30}}$, and the solid line is the standard normal cumulative distribution, $\Phi$.
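The comparison in Figure 4.13 can be reproduced numerically. The following is a minimal sketch, assuming only Python's standard library. The function names are illustrative and do not come from the text. The Erlang($n$, 1) cdf is written as $1 - \sum_{k=0}^{n-1} x^k e^{-x}/k!$ for $x > 0$, and (4.6) is applied with $m = \sigma = 1$.

```python
import math


def erlang_cdf(n, x):
    """cdf G_n(x) of a sum of n i.i.d. exp(1) random variables (Erlang(n, 1))."""
    if x <= 0.0:
        return 0.0
    term = math.exp(-x)  # k = 0 term of sum_{k<n} x^k e^{-x} / k!
    partial = term
    for k in range(1, n):
        term *= x / k
        partial += term
    return 1.0 - partial


def std_normal_cdf(y):
    """Standard normal cdf Phi(y), written with the error function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def normalized_sum_cdf(n, y):
    """F_{Y_n}(y) from (4.6) with m = sigma = 1 (the exp(1) case)."""
    return erlang_cdf(n, y * math.sqrt(n) + n)


def max_gap(n, lo=-3.0, hi=3.0, steps=600):
    """Largest |F_{Y_n}(y) - Phi(y)| over a uniform grid on [lo, hi]."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(abs(normalized_sum_cdf(n, y) - std_normal_cdf(y)) for y in grid)
```

Calling `max_gap(30)` gives the largest vertical distance between the two curves of the figure on the plotted range; larger values of `n` should give a smaller gap.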
Example 4.13. A certain digital communication link has bit-error probability $p$. Use the central limit theorem to approximate the probability that in transmitting a word of $n$ bits, more than $k$ bits are received incorrectly.

Solution. Let $X_i = 1$ if bit $i$ is received in error, and $X_i = 0$ otherwise. We assume the $X_i$ are independent Bernoulli($p$) random variables. The number of errors in $n$ bits is $\sum_{i=1}^{n} X_i$. We must compute
\[ P\Bigl( \sum_{i=1}^{n} X_i > k \Bigr). \]
Using the approximation above in which $t = k$, $m = p$, and $\sigma^2 = p(1-p)$, we have
\[ P\Bigl( \sum_{i=1}^{n} X_i > k \Bigr) \approx 1 - \Phi\Bigl( \frac{k/n - p}{\sqrt{p(1-p)/n}} \Bigr). \]
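To see how the approximation of Example 4.13 behaves, it can be compared with the exact binomial tail. This is a minimal sketch using only the standard library. The function names and the parameter values in the usage note are illustrative assumptions, not taken from the text.

```python
import math


def std_normal_cdf(y):
    """Standard normal cdf Phi(y)."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def clt_tail(n, p, k):
    """CLT approximation to P(more than k of n bits are in error)."""
    return 1.0 - std_normal_cdf((k / n - p) / math.sqrt(p * (1.0 - p) / n))


def exact_tail(n, p, k):
    """Exact binomial(n, p) probability of more than k errors."""
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1, n + 1)
    )
```

For instance, `clt_tail(100, 0.5, 50)` and `exact_tail(100, 0.5, 50)` can be printed side by side. The two values differ noticeably here because the approximation ignores the probability mass sitting exactly at `k`.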
Derivation of the Central Limit Theorem
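It is instructive to consider first the following special case, which illustrates the key steps of the general derivation. Suppose that the $X_i$ are i.i.d. Laplace with parameter $\lambda = \sqrt{2}$. Then $m = 0$, $\sigma^2 = 1$, and
\[ Y_n = \frac{M_n - m}{\sigma/\sqrt{n}} = \sqrt{n}\, M_n = \frac{\sqrt{n}}{n} \sum_{i=1}^{n} X_i = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i. \]
The characteristic function of $Y_n$ is
\[ \varphi_{Y_n}(\nu) = E[e^{j\nu Y_n}] = E\Bigl[ e^{j(\nu/\sqrt{n}) \sum_{i=1}^{n} X_i} \Bigr] = \prod_{i=1}^{n} E\bigl[ e^{j(\nu/\sqrt{n}) X_i} \bigr]. \]
Of course, $E[e^{j(\nu/\sqrt{n}) X_i}] = \varphi_{X_i}(\nu/\sqrt{n})$, where, for the Laplace($\sqrt{2}$) random variable $X_i$,
\[ \varphi_{X_i}(\nu) = \frac{2}{2 + \nu^2} = \frac{1}{1 + \nu^2/2}. \]
Thus,
\[ E\bigl[ e^{j(\nu/\sqrt{n}) X_i} \bigr] = \varphi_{X_i}\Bigl( \frac{\nu}{\sqrt{n}} \Bigr) = \frac{1}{1 + \frac{\nu^2/2}{n}}, \]
and
\[ \varphi_{Y_n}(\nu) = \Bigl( \frac{1}{1 + \frac{\nu^2/2}{n}} \Bigr)^{n} = \frac{1}{\bigl( 1 + \frac{\nu^2/2}{n} \bigr)^{n}}. \]
We now use the fact that for any number $\xi$,
\[ \Bigl( 1 + \frac{\xi}{n} \Bigr)^{n} \to e^{\xi}. \]
It follows that
\[ \varphi_{Y_n}(\nu) = \frac{1}{\bigl( 1 + \frac{\nu^2/2}{n} \bigr)^{n}} \to \frac{1}{e^{\nu^2/2}} = e^{-\nu^2/2}, \]
which is the characteristic function of an $N(0,1)$ random variable.

We now turn to the derivation in the general case. Write
\begin{align*}
Y_n &= \frac{M_n - m}{\sigma/\sqrt{n}} \\
&= \frac{\sqrt{n}}{\sigma} \Bigl( \frac{1}{n} \sum_{i=1}^{n} X_i - m \Bigr) \\
&= \frac{\sqrt{n}}{\sigma} \cdot \frac{1}{n} \sum_{i=1}^{n} (X_i - m) \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - m}{\sigma}.
\end{align*}
Now let $Z_i := (X_i - m)/\sigma$. Then
\[ Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i, \]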
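where the $Z_i$ are zero mean and unit variance. Since the $X_i$ are i.i.d., so are the $Z_i$. Let $\varphi_Z(\nu) := E[e^{j\nu Z_i}]$ denote their common characteristic function. We can write the characteristic function of $Y_n$ as
\begin{align*}
\varphi_{Y_n}(\nu) := E[e^{j\nu Y_n}] &= E\Bigl[ \exp\Bigl( j \frac{\nu}{\sqrt{n}} \sum_{i=1}^{n} Z_i \Bigr) \Bigr] \\
&= E\Bigl[ \prod_{i=1}^{n} \exp\Bigl( j \frac{\nu}{\sqrt{n}} Z_i \Bigr) \Bigr] \\
&= \prod_{i=1}^{n} E\Bigl[ \exp\Bigl( j \frac{\nu}{\sqrt{n}} Z_i \Bigr) \Bigr] \\
&= \prod_{i=1}^{n} \varphi_Z\Bigl( \frac{\nu}{\sqrt{n}} \Bigr) \\
&= \varphi_Z\Bigl( \frac{\nu}{\sqrt{n}} \Bigr)^{n}.
\end{align*}
Now recall that for any complex $\xi$,
\[ e^{\xi} = 1 + \xi + \tfrac{1}{2} \xi^2 + R(\xi). \]
Thus,
\begin{align*}
\varphi_Z\Bigl( \frac{\nu}{\sqrt{n}} \Bigr) &= E\bigl[ e^{j(\nu/\sqrt{n}) Z_i} \bigr] \\
&= E\Bigl[ 1 + j \frac{\nu}{\sqrt{n}} Z_i + \frac{1}{2} \Bigl( j \frac{\nu}{\sqrt{n}} Z_i \Bigr)^{2} + R\Bigl( j \frac{\nu}{\sqrt{n}} Z_i \Bigr) \Bigr].
\end{align*}
Since $Z_i$ is zero mean and unit variance,
\[ \varphi_Z\Bigl( \frac{\nu}{\sqrt{n}} \Bigr) = 1 - \frac{1}{2} \cdot \frac{\nu^2}{n} + E\Bigl[ R\Bigl( j \frac{\nu}{\sqrt{n}} Z_i \Bigr) \Bigr]. \]
It can be shown that the last term on the right is asymptotically negligible [4, pp. 357-358], and so
\[ \varphi_Z\Bigl( \frac{\nu}{\sqrt{n}} \Bigr) \approx 1 - \frac{\nu^2/2}{n}. \]
We now have
\[ \varphi_{Y_n}(\nu) = \varphi_Z\Bigl( \frac{\nu}{\sqrt{n}} \Bigr)^{n} \approx \Bigl( 1 - \frac{\nu^2/2}{n} \Bigr)^{n} \to e^{-\nu^2/2}, \]
which is the $N(0,1)$ characteristic function. Since the characteristic function of $Y_n$ converges to the $N(0,1)$ characteristic function, it follows that $F_{Y_n}(y) \to \Phi(y)$ [4, p. 349, Theorem 26.3].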
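The limit used in the Laplace special case can be checked numerically. The sketch below, which assumes only the standard library and uses illustrative function names, evaluates $\varphi_{Y_n}(\nu) = (1 + (\nu^2/2)/n)^{-n}$ for increasing $n$ and compares it with $e^{-\nu^2/2}$.

```python
import math


def laplace_cf_of_yn(nu, n):
    """phi_{Y_n}(nu) = (1 + (nu^2/2)/n)^(-n) for i.i.d. Laplace(sqrt(2)) summands."""
    return (1.0 + (nu * nu / 2.0) / n) ** (-n)


def normal_cf(nu):
    """Characteristic function of N(0, 1)."""
    return math.exp(-nu * nu / 2.0)


def cf_error(nu, n):
    """Absolute gap between phi_{Y_n}(nu) and the N(0, 1) limit."""
    return abs(laplace_cf_of_yn(nu, n) - normal_cf(nu))
```

Evaluating `cf_error(1.0, n)` for `n = 10, 100, 1000` shows the gap shrinking as `n` grows, in line with the derivation.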
Example 4.14. In the derivation of the central limit theorem, we found that $\varphi_{Y_n}(\nu) \to e^{-\nu^2/2}$ as $n \to \infty$. We can use this result to give a simple derivation of Stirling's formula,
\[ n! \approx \sqrt{2\pi}\, n^{n+1/2} e^{-n}. \]

Solution. Since $n! = n(n-1)!$, it suffices to show that
\[ (n-1)! \approx \sqrt{2\pi}\, n^{n-1/2} e^{-n}. \]
To this end, let $X_1, \ldots, X_n$ be i.i.d. exp(1). Then by Problem 46(c) in Chapter 3, $\sum_{i=1}^{n} X_i$ has the $n$-Erlang density,
\[ g_n(x) = \frac{x^{n-1} e^{-x}}{(n-1)!}, \quad x \ge 0. \]
Also, the mean and variance of $X_i$ are both one. Differentiating (4.6) (with $m = \sigma = 1$) yields
\[ f_{Y_n}(y) = g_n(y\sqrt{n} + n) \sqrt{n}. \]
Note that $f_{Y_n}(0) = n^{n-1/2} e^{-n} / (n-1)!$. On the other hand, by inverting the characteristic function of $Y_n$, we can also write
\begin{align*}
f_{Y_n}(0) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{Y_n}(\nu)\, d\nu \\
&\approx \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-\nu^2/2}\, d\nu, && \text{by the CLT,} \\
&= 1/\sqrt{2\pi}.
\end{align*}
It follows that $(n-1)! \approx \sqrt{2\pi}\, n^{n-1/2} e^{-n}$.

Remark. A more precise version of Stirling's formula is [14, pp. 50-53]
\[ \sqrt{2\pi}\, n^{n+1/2} e^{-n + 1/(12n+1)} < n! < \sqrt{2\pi}\, n^{n+1/2} e^{-n + 1/(12n)}. \]
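Stirling's formula and the two-sided bound in the Remark are easy to check for small $n$. The following is a minimal sketch with illustrative function names, using only the standard library.

```python
import math


def stirling(n):
    """Stirling's approximation sqrt(2*pi) * n^(n + 1/2) * e^(-n) to n!."""
    return math.sqrt(2.0 * math.pi) * n ** (n + 0.5) * math.exp(-n)


def stirling_bounds(n):
    """Lower and upper bounds on n! from the more precise version in the Remark."""
    base = stirling(n)
    return base * math.exp(1.0 / (12 * n + 1)), base * math.exp(1.0 / (12 * n))


def relative_error(n):
    """Relative error of Stirling's approximation to n!."""
    return abs(stirling(n) / math.factorial(n) - 1.0)
```

Looping `n` over a small range and printing `stirling_bounds(n)` next to `math.factorial(n)` shows how tight the bounds are even for `n = 1`.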
4.8. Problems
Problems §4.1: Continuous Random Variables

1. Find the cumulative distribution function $F_X(x)$ of an exponential random variable $X$ with parameter $\lambda$. Sketch the graph of $F_X$ when $\lambda = 1$.

2. The Rayleigh density with parameter $\lambda$ is defined by
\[ f(x) := \begin{cases} \dfrac{x}{\lambda^2}\, e^{-(x/\lambda)^2/2}, & x \ge 0, \\ 0, & x < 0. \end{cases} \]
Find the cumulative distribution function.

*3. The Maxwell density with parameter $\lambda$ is defined by
\[ f(x) := \begin{cases} \sqrt{\dfrac{2}{\pi}}\, \dfrac{x^2}{\lambda^3}\, e^{-(x/\lambda)^2/2}, & x \ge 0, \\ 0, & x < 0. \end{cases} \]
Show that the cdf $F(x)$ can be expressed in terms of the standard normal cdf $\Phi(x)$ defined in (4.1).

4. If $Z$ has density $f_Z(z)$ and $Y = e^Z$, find $f_Y(y)$.

5. If $X \sim$ uniform(0, 1), find the density of $Y = \ln(1/X)$.

6. Let $X \sim$ Weibull($p$, $\lambda$). Find the density of $Y = \lambda X^p$.

7. The input to a squaring circuit is a Gaussian random variable $X$ with mean zero and variance one. Show that the output $Y = X^2$ has the chi-squared density with one degree of freedom,
\[ f_Y(y) = \frac{e^{-y/2}}{\sqrt{2\pi y}}, \quad y > 0. \]

8. If the input to the squaring circuit of Problem 7 includes a fixed bias, say $m$, then the output is given by $Y = (X + m)^2$, where again $X \sim N(0,1)$. Show that $Y$ has the noncentral chi-squared density with one degree of freedom and noncentrality parameter $m^2$,
\[ f_Y(y) = \frac{e^{-(y+m^2)/2}}{\sqrt{2\pi y}} \cdot \frac{e^{m\sqrt{y}} + e^{-m\sqrt{y}}}{2}, \quad y > 0. \]
Note that if $m = 0$, we recover the result of Problem 7.

9. Let $X_1, \ldots, X_n$ be independent with common cumulative distribution function $F(x)$. Let us define $X_{\max} := \max(X_1, \ldots, X_n)$ and $X_{\min} := \min(X_1, \ldots, X_n)$. Express the cumulative distributions of $X_{\max}$ and $X_{\min}$ in terms of $F(x)$. Hint: Example 2.7 may be helpful.

10. If $X$ and $Y$ are independent exp($\lambda$) random variables, find $E[\max(X, Y)]$.

11. Digital Communication System. The received voltage in a digital communication system is $Z = X + Y$, where $X \sim$ Bernoulli($p$) is a random message, and $Y \sim N(0,1)$ is a Gaussian noise voltage. Assuming $X$ and $Y$ are independent, find the conditional cdf $F_{Z|X}(z|i)$ for $i = 0, 1$, the cdf $F_Z(z)$, and the density $f_Z(z)$.

12. Fading Channel. Let $X$ and $Y$ be as in the preceding problem, but now suppose $Z = X/A + Y$, where $A$, $X$, and $Y$ are independent, and $A$ takes the values 1 and 2 with equal probability. Find the conditional cdf $F_{Z|A,X}(z|a,i)$ for $a = 1, 2$ and $i = 0, 1$.
13. Generalized Rayleigh Densities. Let $Y_n$ be chi-squared with $n$ degrees of freedom as defined in Problem 12 of Chapter 3. Put $Z_n := \sqrt{Y_n}$.
(a) Express the cdf of $Z_n$ in terms of the cdf of $Y_n$.
(b) Find the density of $Z_1$.
(c) Show that $Z_2$ has a Rayleigh density, as defined in Problem 2, with $\lambda = 1$.
(d) Show that $Z_3$ has a Maxwell density, as defined in Problem 3, with $\lambda = 1$.
(e) Show that $Z_{2m}$ has a Nakagami-$m$ density
\[ f(z) := \begin{cases} \dfrac{2}{2^m \Gamma(m)}\, \dfrac{z^{2m-1}}{\lambda^{2m}}\, e^{-(z/\lambda)^2/2}, & z \ge 0, \\ 0, & z < 0, \end{cases} \]
with $\lambda = 1$.

Remark. For the general chi-squared random variable $Y_n$, it is not necessary that $n$ be an integer. However, if $n$ is a positive integer, and if $X_1, \ldots, X_n$ are i.i.d. $N(0,1)$, then the $X_i^2$ are chi-squared with one degree of freedom by Problem 7, and $Y_n := X_1^2 + \cdots + X_n^2$ is chi-squared with $n$ degrees of freedom by Problem 46(c) in Chapter 3. Hence, the above densities usually arise from taking the square root of the sum of squares of standard normal random variables. For example, $(X_1, X_2)$ can be regarded as a random point in the plane whose horizontal and vertical coordinates are independent $N(0,1)$. The distance of this point from the origin is $\sqrt{X_1^2 + X_2^2} = Z_2$, which is a Rayleigh random variable. As another example, consider an ideal gas. The velocity of a given particle is obtained by adding up the results of many collisions with other particles. By the central limit theorem (Section 4.7), the components of the given particle's velocity vector, say $(X_1, X_2, X_3)$, should be i.i.d. $N(0,1)$. The speed of the particle is $\sqrt{X_1^2 + X_2^2 + X_3^2} = Z_3$, which has the Maxwell density. When the Nakagami-$m$ density is used as a model for fading in wireless communication channels, $m$ is often not an integer.
14. Generalized Gamma Densities.
(a) Let $X \sim$ gamma($p$, 1), and put $Y := X^{1/q}$. Show that
\[ f_Y(y) = \frac{q\, y^{pq-1} e^{-y^q}}{\Gamma(p)}, \quad y > 0. \]
(b) If in part (a) we replace $p$ with $p/q$, we find that
\[ f_Y(y) = \frac{q\, y^{p-1} e^{-y^q}}{\Gamma(p/q)}, \quad y > 0. \]
If we introduce a scale parameter $\lambda > 0$, we have the generalized gamma density [46]. More precisely, we say that $Y \sim$ g-gamma($p$, $\lambda$, $q$) if $Y$ has density
\[ f_Y(y) = \frac{\lambda q\, (\lambda y)^{p-1} e^{-(\lambda y)^q}}{\Gamma(p/q)}, \quad y > 0. \]
Clearly, g-gamma($p$, $\lambda$, 1) = gamma($p$, $\lambda$), which includes the Erlang and the chi-squared as special cases. Show that
  (i) g-gamma($p$, $\lambda^{1/p}$, $p$) = Weibull($p$, $\lambda$).
  (ii) g-gamma(2, $1/(\sqrt{2}\lambda)$, 2) is the Rayleigh density defined in Problem 2.
  (iii) g-gamma(3, $1/(\sqrt{2}\lambda)$, 2) is the Maxwell density defined in Problem 3.
(c) If $Y \sim$ g-gamma($p$, $\lambda$, $q$), show that
\[ E[Y^n] = \frac{\Gamma\bigl( (n+p)/q \bigr)}{\Gamma(p/q)\, \lambda^n}, \]
and conclude that
\[ M_Y(s) = \sum_{n=0}^{\infty} \frac{s^n}{n!} \cdot \frac{\Gamma\bigl( (n+p)/q \bigr)}{\Gamma(p/q)\, \lambda^n}. \]

*15. In the analysis of communication systems, one is often interested in $P(X > x) = 1 - F_X(x)$ for some voltage threshold $x$. We call $F_X^c(x) := 1 - F_X(x)$ the complementary cumulative distribution function (ccdf) of $X$. Of particular interest is the ccdf of the standard normal, which is often denoted by
\[ Q(x) := 1 - \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt. \]
Using the hints below, show that for $x > 0$,
\[ \frac{e^{-x^2/2}}{\sqrt{2\pi}} \Bigl( \frac{1}{x} - \frac{1}{x^3} \Bigr) < Q(x) < \frac{e^{-x^2/2}}{x\sqrt{2\pi}}. \]
Hints: To derive the upper bound, apply integration by parts to
\[ \int_x^{\infty} \frac{1}{t} \cdot t e^{-t^2/2}\, dt, \]
and then drop the new integral term (which is positive),
\[ \int_x^{\infty} \frac{1}{t^2}\, e^{-t^2/2}\, dt. \]
If you do not drop the above term, you can derive the lower bound by applying integration by parts one more time (after dividing and multiplying by $t$ again) and then dropping the final integral.
*16. Let $C_k(x)$ denote the chi-squared cdf with $k$ degrees of freedom. Show that the noncentral chi-squared cdf with $k$ degrees of freedom and noncentrality parameter $\lambda^2$ is given by (recall Problem 54 in Chapter 3)
\[ C_{k,\lambda^2}(x) = \sum_{n=0}^{\infty} \frac{(\lambda^2/2)^n e^{-\lambda^2/2}}{n!}\, C_{2n+k}(x). \]
Remark. Note that in Matlab, $C_k(x)$ = chi2cdf(x, k), and $C_{k,\lambda^2}(x)$ = ncx2cdf(x, k, lambda^2).
*17. Generalized Rice or Noncentral Rayleigh Densities. Let $Y_n$ be noncentral chi-squared with $n$ degrees of freedom and noncentrality parameter $m^2$ as defined in Problem 54 in Chapter 3. (In general, $n$ need not be an integer, but if it is, and if $X_1, \ldots, X_n$ are independent normal random variables with $X_i \sim N(m_i, 1)$, then by Problem 8, $X_i^2$ is noncentral chi-squared with one degree of freedom and noncentrality parameter $m_i^2$, and by Problem 54 in Chapter 3, $X_1^2 + \cdots + X_n^2$ is noncentral chi-squared with $n$ degrees of freedom and noncentrality parameter $m^2 = m_1^2 + \cdots + m_n^2$.)
(a) Show that $Z_n := \sqrt{Y_n}$ has the generalized Rice density,
\[ f_{Z_n}(z) = \frac{z^{n/2}}{m^{n/2-1}}\, e^{-(m^2+z^2)/2}\, I_{n/2-1}(mz), \quad z > 0, \]
where $I_\nu$ is the modified Bessel function of the first kind, order $\nu$,
\[ I_\nu(x) := \sum_{\ell=0}^{\infty} \frac{(x/2)^{2\ell+\nu}}{\ell!\, \Gamma(\ell + \nu + 1)}. \]
Remark. In Matlab, $I_\nu(x)$ = besseli(nu, x).
(b) Show that $Z_2$ has the original Rice density,
\[ f_{Z_2}(z) = z e^{-(m^2+z^2)/2} I_0(mz), \quad z > 0. \]
(c) Show that
\[ f_{Y_n}(y) = \frac{1}{2} \Bigl( \frac{\sqrt{y}}{m} \Bigr)^{n/2-1} e^{-(m^2+y)/2}\, I_{n/2-1}(m\sqrt{y}), \quad y > 0, \]
giving a closed-form expression for the noncentral chi-squared density. Recall that you already have a closed-form expression for the moment generating function and characteristic function of a noncentral chi-squared random variable (see Problem 54(b) in Chapter 3).
Remark. Note that in Matlab, the cdf of $Y_n$ is given by $F_{Y_n}(y)$ = ncx2cdf(y, n, m^2).
(d) Denote the complementary cumulative distribution of $Z_n$ by
\[ F_{Z_n}^c(z) := P(Z_n > z) = \int_z^{\infty} f_{Z_n}(t)\, dt. \]
Show that
\[ F_{Z_n}^c(z) = \Bigl( \frac{z}{m} \Bigr)^{n/2-1} e^{-(m^2+z^2)/2}\, I_{n/2-1}(mz) + F_{Z_{n-2}}^c(z). \]
Hint: Use integration by parts; you will need the easily-verified fact that
\[ \frac{d}{dx} \bigl[ x^\nu I_\nu(x) \bigr] = x^\nu I_{\nu-1}(x). \]
(e) The complementary cdf of $Z_2$, $F_{Z_2}^c(z)$, is known as the Marcum Q function,
\[ Q(m, z) := \int_z^{\infty} t e^{-(m^2+t^2)/2} I_0(mt)\, dt. \]
Show that if $n \ge 4$ is an even integer, then
\[ F_{Z_n}^c(z) = Q(m, z) + e^{-(m^2+z^2)/2} \sum_{k=1}^{n/2-1} \Bigl( \frac{z}{m} \Bigr)^{k} I_k(mz). \]
(f) Show that $Q(m, z) = \tilde{Q}(m, z)$, where
\[ \tilde{Q}(m, z) := e^{-(m^2+z^2)/2} \sum_{k=0}^{\infty} (m/z)^k I_k(mz). \]
Hint: [19, p. 450] Show that $Q(0, z) = \tilde{Q}(0, z) = e^{-z^2/2}$. It then suffices to prove that
\[ \frac{\partial}{\partial m} \tilde{Q}(m, z) = \frac{\partial}{\partial m} Q(m, z). \]
To this end, use the derivative formula in the hint of part (d) to show that
\[ \frac{\partial}{\partial m} \tilde{Q}(m, z) = z e^{-(m^2+z^2)/2} I_1(mz). \]
Now take the same partial derivative of $Q(m, z)$ as defined in part (e), and then use integration by parts on the term involving $I_1$.
*18. Properties of Modified Bessel Functions.
(a) Use the power-series definition of $I_\nu(x)$ in the preceding problem to show that
\[ I_\nu'(x) = \tfrac{1}{2} [I_{\nu-1}(x) + I_{\nu+1}(x)] \]
and that
\[ I_{\nu-1}(x) - I_{\nu+1}(x) = 2(\nu/x)\, I_\nu(x). \]
Note that the second identity implies the recursion,
\[ I_{\nu+1}(x) = I_{\nu-1}(x) - 2(\nu/x)\, I_\nu(x). \]
Hence, once $I_0(x)$ and $I_1(x)$ are known, $I_n(x)$ can be computed for $n = 2, 3, \ldots$.
(b) Parts (b) and (c) of this problem are devoted to showing that for integers $n \ge 0$,
\[ I_n(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{x\cos\theta} \cos(n\theta)\, d\theta. \]
To this end, denote the above integral by $\tilde{I}_n(x)$. Use integration by parts and a trigonometric identity to show that
\[ \tilde{I}_n(x) = \frac{x}{2n} [\tilde{I}_{n-1}(x) - \tilde{I}_{n+1}(x)]. \]
Hence, in Part (c) it will be enough to show that $\tilde{I}_0(x) = I_0(x)$ and $\tilde{I}_1(x) = I_1(x)$.
(c) From the integral definition of $\tilde{I}_n(x)$, it is clear that $\tilde{I}_0'(x) = \tilde{I}_1(x)$. It is also easy to see from the power series for $I_0(x)$ that $I_0'(x) = I_1(x)$. Hence, it is enough to show that $\tilde{I}_0(x) = I_0(x)$. Since the integrand defining $\tilde{I}_0(x)$ is even,
\[ \tilde{I}_0(x) = \frac{1}{\pi} \int_0^{\pi} e^{x\cos\theta}\, d\theta. \]
Show that
\[ \tilde{I}_0(x) = \frac{1}{\pi} \int_{-\pi/2}^{\pi/2} e^{-x\sin t}\, dt. \]
Then use the power series $e^{\xi} = \sum_{k=0}^{\infty} \xi^k/k!$ in the above integrand and integrate term by term. Then use the results of Problems 15 and 11 of Chapter 3 to show that $\tilde{I}_0(x) = I_0(x)$.
(d) Use the integral formula for $I_n(x)$ to show that
\[ I_n'(x) = \tfrac{1}{2} [I_{n-1}(x) + I_{n+1}(x)]. \]
Problems §4.2: Reliability

19. The lifetime $T$ of a Model $n$ Internet router has an Erlang($n$, 1) density, $f_T(t) = t^{n-1} e^{-t}/(n-1)!$.
(a) What is the router's mean time to failure?
(b) Show that the reliability of the router after $t$ time units of operation is
\[ R(t) = \sum_{k=0}^{n-1} \frac{t^k}{k!}\, e^{-t}. \]
(c) Find the failure rate (known as the Erlang failure rate).

20. A certain device has the Weibull failure rate
\[ r(t) = \lambda p\, t^{p-1}, \quad t > 0. \]
(a) Sketch the failure rate for $\lambda = 1$ and the cases $p = 1/2$, $p = 1$, $p = 3/2$, $p = 2$, and $p = 3$.
(b) Find the reliability $R(t)$.
(c) Find the mean time to failure.
(d) Find the density $f_T(t)$.

21. A certain device has the Pareto failure rate
\[ r(t) = \begin{cases} p/t, & t \ge t_0, \\ 0, & t < t_0. \end{cases} \]
(a) Find the reliability $R(t)$ for $t \ge 0$.
(b) Sketch $R(t)$ if $t_0 = 1$ and $p = 2$.
(c) Find the mean time to failure if $p > 1$.
(d) Find the Pareto density $f_T(t)$.

22. A certain device has failure rate $r(t) = t^2 - 2t + 2$ for $t \ge 0$.
(a) Sketch $r(t)$ for $t \ge 0$.
(b) Find the corresponding density $f_T(t)$ in closed form (no integrals).

23. Suppose that the lifetime $T$ of a device is uniformly distributed on the interval [1, 2].
(a) Find and sketch the reliability $R(t)$ for $t \ge 0$.
(b) Find the failure rate $r(t)$ for $0 \le t < 2$.
(c) Find the mean time to failure.

24. Consider a system composed of two devices with respective lifetimes $T_1$ and $T_2$. Let $T$ denote the lifetime of the composite system. Suppose that the system operates properly if and only if both devices are functioning. In other words, $T > t$ if and only if $T_1 > t$ and $T_2 > t$. Express the reliability of the overall system $R(t)$ in terms of $R_1(t)$ and $R_2(t)$, where $R_1(t)$ and $R_2(t)$ are the reliabilities of the individual devices. Assume $T_1$ and $T_2$ are independent.

25. Consider a system composed of two devices with respective lifetimes $T_1$ and $T_2$. Let $T$ denote the lifetime of the composite system. Suppose that the system operates properly if and only if at least one of the devices is functioning. In other words, $T > t$ if and only if $T_1 > t$ or $T_2 > t$. Express the reliability of the overall system $R(t)$ in terms of $R_1(t)$ and $R_2(t)$, where $R_1(t)$ and $R_2(t)$ are the reliabilities of the individual devices. Assume $T_1$ and $T_2$ are independent.

26. Let $Y$ be a nonnegative random variable. Show that
\[ E[Y^n] = \int_0^{\infty} n y^{n-1} P(Y > y)\, dy. \]
Hint: Put $T = Y^n$ in (4.2).
Problems §4.3: Discrete Random Variables

27. Let $X \sim$ binomial($n$, $p$) with $n = 4$ and $p = 3/4$. Sketch the graph of the cumulative distribution function of $X$, $F_X(x)$.

Problems §4.4: Mixed Random Variables

28. A random variable $X$ has generalized density
\[ f(t) = \tfrac{1}{3} e^{-t} u(t) + \tfrac{1}{2} \delta(t) + \tfrac{1}{6} \delta(t - 1), \]
where $u$ is the unit step function defined in Section 2.1, and $\delta$ is the Dirac delta function defined in Section 4.4.
(a) Sketch $f(t)$.
(b) Compute $P(X = 0)$ and $P(X = 1)$.
(c) Compute $P(0 < X < 1)$ and $P(X > 1)$.
(d) Use your above results to compute $P(0 \le X \le 1)$ and $P(X \ge 1)$.
(e) Compute $E[X]$.

29. If $X$ has generalized density $f(t) = \tfrac{1}{2} [\delta(t) + I_{(0,1]}(t)]$, evaluate $E[X]$ and $P(X = 0 \mid X \le 1/2)$.

30. Let $X$ have cdf
\[ F_X(x) = \begin{cases} 0, & x < 0, \\ x^2, & 0 \le x < 1/2, \\ x, & 1/2 \le x < 1, \\ 1, & x \ge 1. \end{cases} \]
Show that $E[X] = 7/12$.

31. Sketch the graph of the cumulative distribution function of the random variable in Example 4.9. Also sketch the corresponding impulsive density function.

*32. A certain computer monitor contains a loose connection. The connection is loose with probability 1/2. When the connection is loose, the monitor displays a blank screen (brightness = 0). When the connection is not loose, the brightness is uniformly distributed on (0, 1]. Let $X$ denote the observed brightness. Find formulas and plot the cdf and generalized density of $X$.

Problems §4.5: Functions of Random Variables and Their Cdfs

33. Let $\Theta \sim$ uniform[$-\pi$, $\pi$], and put $X := \cos\Theta$ and $Y := \sin\Theta$.
(a) Show that $F_X(x) = 1 - \frac{1}{\pi} \cos^{-1} x$ for $-1 \le x \le 1$.
(b) Show that $F_Y(y) = \frac{1}{2} + \frac{1}{\pi} \sin^{-1} y$ for $-1 \le y \le 1$.
(c) Use the identity $\sin(\pi/2 - \theta) = \cos\theta$ to show that $F_X(x) = \frac{1}{2} + \frac{1}{\pi} \sin^{-1} x$. Thus, $X$ and $Y$ have the same cdf and are called arcsine random variables. The corresponding density is $f_X(x) = (1/\pi)/\sqrt{1 - x^2}$ for $-1 < x < 1$.

34. Find the cdf and density of $Y = X(X + 2)$ if $X$ is uniformly distributed on [$-3$, 1].

35. Let $g$ be as in Examples 4.10 and 4.11. Find the cdf and density of $Y = g(X)$ if
(a) $X \sim$ uniform[$-1$, 1];
(b) $X \sim$ uniform[$-1$, 2];
(c) $X \sim$ uniform[$-2$, 3];
(d) $X \sim$ exp($\lambda$).

36. Let
\[ g(x) := \begin{cases} 0, & |x| < 1, \\ |x| - 1, & 1 \le |x| \le 2, \\ 1, & |x| > 2. \end{cases} \]
Find the cdf and density of $Y = g(X)$ if
(a) $X \sim$ uniform[$-1$, 1];
(b) $X \sim$ uniform[$-2$, 2];
(c) $X \sim$ uniform[$-3$, 3];
(d) $X \sim$ Laplace($\lambda$).
37. Let
\[ g(x) := \begin{cases} -x - 2, & x < -1, \\ -x^2, & -1 \le x < 0, \\ x^3, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases} \]
Find the cdf and density of $Y = g(X)$ if
(a) $X \sim$ uniform[$-3$, 2];
(b) $X \sim$ uniform[$-3$, 1];
(c) $X \sim$ uniform[$-1$, 1].

38. Consider the function $g$ given by
\[ g(x) = \begin{cases} x^2 - 1, & x < 0, \\ x - 1, & 0 \le x < 2, \\ 1, & x \ge 2. \end{cases} \]
If $X$ is uniform[$-3$, 3], find the cdf and density of $Y = g(X)$.

39. Let $X$ be a uniformly distributed random variable on the interval [$-3$, 1]. Let $Y = g(X)$, where
\[ g(x) = \begin{cases} 0, & x < -2, \\ x + 2, & -2 \le x < -1, \\ x^2, & -1 \le x < 0, \\ \sqrt{x}, & x \ge 0. \end{cases} \]
Find the cdf and density of $Y$.

40. Let $X \sim$ uniform[$-6$, 0], and suppose that $Y = g(X)$, where
\[ g(x) = \begin{cases} |x| - 1, & 1 \le x < 2, \\ 1 - \sqrt{|x| - 2}, & |x| \ge 2, \\ 0, & \text{otherwise.} \end{cases} \]
Find the cdf and density of $Y$.

41. Let $X \sim$ uniform[$-2$, 1], and suppose that $Y = g(X)$, where
\[ g(x) = \begin{cases} x + 2, & -2 \le x < -1, \\ \dfrac{2x^2}{1 + x^2}, & -1 \le x < 0, \\ 0, & \text{otherwise.} \end{cases} \]
Find the cdf and density of $Y$.

*42. For $x \ge 0$, let $g(x)$ denote the fractional part of $x$. For example, $g(5.649) = 0.649$, and $g(0.123) = 0.123$. Find the cdf and density of $Y = g(X)$ if
(a) $X \sim$ exp(1);
(b) $X \sim$ uniform[0, 1);
(c) $X \sim$ uniform[$v$, $v + 1$), where $v = m + \delta$ for some integer $m \ge 0$ and some $0 < \delta < 1$.
Problems §4.6: Properties of Cdfs

*43. From your solution of Problem 3(b) in Chapter 3, you can see that if $X \sim$ exp($\lambda$), then $P(X > t + \Delta t \mid X > t) = P(X > \Delta t)$. Now prove the converse; i.e., show that if $Y$ is a nonnegative random variable such that $P(Y > t + \Delta t \mid Y > t) = P(Y > \Delta t)$, then $Y \sim$ exp($\lambda$), where $\lambda = -\ln[1 - F_Y(1)]$, assuming that $P(Y > t) > 0$ for all $t \ge 0$. Hints: Put $h(t) := \ln P(Y > t)$, which is a right-continuous function of $t$ (Why?). Show that $h(t + \Delta t) = h(t) + h(\Delta t)$ for all $t, \Delta t \ge 0$.

Problems §4.7: The Central Limit Theorem

44. Packet transmission times on a certain Internet link are i.i.d. with mean $m$ and variance $\sigma^2$. Suppose $n$ packets are transmitted. Then the total expected transmission time for $n$ packets is $nm$. Use the central limit theorem to approximate the probability that the total transmission time for the $n$ packets exceeds twice the expected transmission time.

45. To combat noise in a digital communication channel with bit-error probability $p$, the use of an error-correcting code is proposed. Suppose that the code allows correct decoding of a received binary codeword if the fraction of bits in error is less than or equal to $t$. Use the central limit theorem to approximate the probability that a received word cannot be reliably decoded.

46. Let $X_i = \pm 1$ with equal probability. Then the $X_i$ are zero mean and have unit variance. Put
\[ Y_n = \sum_{i=1}^{n} \frac{X_i}{\sqrt{n}}. \]
Derive the central limit theorem for this case; i.e., show that $\varphi_{Y_n}(\nu) \to e^{-\nu^2/2}$. Hint: Use the Taylor series approximation $\cos(\xi) \approx 1 - \xi^2/2$.
CHAPTER 5
Multiple Random Variables

The main focus of this chapter is the study of multiple continuous random variables that are not independent. In particular, conditional probability and conditional expectation along with corresponding laws of total probability and substitution are studied. These tools are used to compute probabilities involving the output of systems with multiple random inputs.

In Section 5.1 we introduce the concept of joint cumulative distribution function for a pair of random variables. Marginal cdfs are also defined. In Section 5.2 we introduce pairs of jointly continuous random variables, joint densities, and marginal densities. Conditional densities, independence, and expectation are also discussed in the context of jointly continuous random variables. In Section 5.3 conditional probability and expectation are defined for jointly continuous random variables. In Section 5.4 the bivariate normal density is introduced, and some of its properties are illustrated via examples. In Section 5.5 most of the bivariate concepts are extended to the case of $n$ random variables.
5.1. Joint and Marginal Probabilities
Suppose we have a pair of random variables, say $X$ and $Y$, for which we are able to compute $P((X, Y) \in A)$ for arbitrary$^1$ sets $A \subset \mathbb{R}^2$. For example, suppose $X$ and $Y$ are the horizontal and vertical coordinates of a dart striking a target. We might be interested in the probability that the dart lands within two units of the center. Then we would take $A$ to be the disk of radius two centered at the origin, i.e., $A = \{(x, y) : x^2 + y^2 \le 4\}$. Now, even though we can write down a formula for $P((X, Y) \in A)$, we may be interested in finding potentially simpler expressions for things like $P(X \in B, Y \in C)$, $P(X \in B)$, and $P(Y \in C)$, etc. To this end, the notion of a product set is helpful.
Product Sets and Marginal Probabilities

The Cartesian product of two univariate sets $B$ and $C$ is defined by
\[ B \times C := \{(x, y) : x \in B \text{ and } y \in C\}. \]
In other words,
\[ (x, y) \in B \times C \iff x \in B \text{ and } y \in C. \]
For example, if $B = [1, 3]$ and $C = [0.5, 3.5]$, then $B \times C$ is the rectangle
\[ [1, 3] \times [0.5, 3.5] = \{(x, y) : 1 \le x \le 3 \text{ and } 0.5 \le y \le 3.5\}, \]
which is illustrated in Figure 5.1(a). In general, if $B$ and $C$ are intervals, then $B \times C$ is a rectangle or square. If one of the sets is an interval and the other is a singleton, then the product set degenerates to a line segment in the plane. A more complicated example is shown in Figure 5.1(b), which illustrates the product $([1, 2] \cup [3, 4]) \times [1, 4]$. Figure 5.1(b) also illustrates the general result that $\times$ distributes over $\cup$; i.e., $(B_1 \cup B_2) \times C = (B_1 \times C) \cup (B_2 \times C)$.

[Figure 5.1: two panels, (a) and (b), each with horizontal and vertical axes running from 0 to 4.]

Figure 5.1. The Cartesian products (a) $[1, 3] \times [0.5, 3.5]$ and (b) $([1, 2] \cup [3, 4]) \times [1, 4]$.

Using the notion of product set,
\begin{align*}
\{X \in B, Y \in C\} &= \{\omega \in \Omega : X(\omega) \in B \text{ and } Y(\omega) \in C\} \\
&= \{\omega \in \Omega : (X(\omega), Y(\omega)) \in B \times C\},
\end{align*}
for which we use the shorthand
\[ \{(X, Y) \in B \times C\}. \]
We can therefore write
\[ P(X \in B, Y \in C) = P((X, Y) \in B \times C). \]
The preceding expression allows us to obtain the marginal probability $P(X \in B)$ as follows. First, for any event $F$, we have $F \subset \Omega$, and therefore, $F = F \cap \Omega$. Second, $Y$ is assumed to be a real-valued random variable, i.e., $Y(\omega) \in \mathbb{R}$ for all $\omega$. Thus, $\{Y \in \mathbb{R}\} = \Omega$. Now write
\begin{align*}
P(X \in B) &= P(\{X \in B\} \cap \Omega) \\
&= P(\{X \in B\} \cap \{Y \in \mathbb{R}\}) \\
&= P(X \in B, Y \in \mathbb{R}) \\
&= P((X, Y) \in B \times \mathbb{R}).
\end{align*}
Similarly,
\[ P(Y \in C) = P((X, Y) \in \mathbb{R} \times C). \]
If $X$ and $Y$ are independent random variables, then
\[ P((X, Y) \in B \times C) = P(X \in B, Y \in C) = P(X \in B)\, P(Y \in C). \]
Joint and Marginal Cumulative Distributions

The joint cumulative distribution function of $X$ and $Y$ is defined by
\[ F_{XY}(x, y) := P(X \le x, Y \le y) = P((X, Y) \in (-\infty, x] \times (-\infty, y]). \]
In Chapter 4, we saw that $P(X \in B)$ could be computed if we knew the cumulative distribution function $F_X(x)$ for all $x$. Similarly, it turns out that we can compute $P((X, Y) \in A)$ if we know $F_{XY}(x, y)$ for all $x, y$. For example, it is shown in Problems 1 and 2 that
\[ P((X, Y) \in (a, b] \times (c, d]) = P(a < X \le b, c < Y \le d) \]
is given by
\[ F_{XY}(b, d) - F_{XY}(a, d) - F_{XY}(b, c) + F_{XY}(a, c). \]
In other words, the probability that $(X, Y)$ lies in a rectangle can be computed in terms of the cdf values at the corners.

If $X$ and $Y$ are independent random variables, we have the simplifications
\[ F_{XY}(x, y) = P(X \le x, Y \le y) = P(X \le x)\, P(Y \le y) = F_X(x) F_Y(y), \]
and
\[ P(a < X \le b, c < Y \le d) = P(a < X \le b)\, P(c < Y \le d), \]
which is equal to $[F_X(b) - F_X(a)][F_Y(d) - F_Y(c)]$.

We return now to the general case ($X$ and $Y$ not necessarily independent). It is possible to obtain the marginal cumulative distributions $F_X$ and $F_Y$ directly from $F_{XY}$ by setting the unwanted variable to $\infty$. More precisely, it can be shown that$^2$
\[ F_X(x) = \lim_{y\to\infty} F_{XY}(x, y) =: F_{XY}(x, \infty), \tag{5.1} \]
and
\[ F_Y(y) = \lim_{x\to\infty} F_{XY}(x, y) =: F_{XY}(\infty, y). \]

Example 5.1. If
\[ F_{XY}(x, y) = \begin{cases} \dfrac{y + e^{-x(y+1)}}{y + 1} - e^{-x}, & x, y > 0, \\ 0, & \text{otherwise,} \end{cases} \]
find both of the marginal cumulative distribution functions, $F_X(x)$ and $F_Y(y)$.
Solution. For $x, y > 0$,
\[ F_{XY}(x, y) = \frac{y}{y + 1} + \frac{1}{y + 1}\, e^{-x(y+1)} - e^{-x}. \]
Hence, for $x > 0$,
\[ \lim_{y\to\infty} F_{XY}(x, y) = 1 + 0 \cdot 0 - e^{-x} = 1 - e^{-x}. \]
For $x \le 0$, $F_{XY}(x, y) = 0$ for all $y$. So, for $x \le 0$, $\lim_{y\to\infty} F_{XY}(x, y) = 0$. The complete formula for the marginal cdf of $X$ is
\[ F_X(x) = \begin{cases} 1 - e^{-x}, & x > 0, \\ 0, & x \le 0. \end{cases} \tag{5.2} \]
Next, for $y > 0$,
\[ \lim_{x\to\infty} F_{XY}(x, y) = \frac{y}{y + 1} + \frac{1}{y + 1} \cdot 0 - 0 = \frac{y}{y + 1}. \]
We then see that the marginal cdf of $Y$ is
\[ F_Y(y) = \begin{cases} y/(y + 1), & y > 0, \\ 0, & y \le 0. \end{cases} \tag{5.3} \]
Note that since $F_{XY}(x, y) \ne F_X(x) F_Y(y)$ for $x, y > 0$, $X$ and $Y$ are not independent.

Example 5.2. Express $P(X \le x \text{ or } Y \le y)$ using cdfs.

Solution. The desired probability is that of the union, $P(A \cup B)$, where $A := \{X \le x\}$ and $B := \{Y \le y\}$. Applying the inclusion-exclusion formula (1.1) yields
\begin{align*}
P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\
&= P(X \le x) + P(Y \le y) - P(X \le x, Y \le y) \\
&= F_X(x) + F_Y(y) - F_{XY}(x, y).
\end{align*}
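The formulas of this section can be exercised on the joint cdf of Example 5.1. The following is a minimal sketch, assuming only the standard library. The function names are illustrative, and the marginals are taken from (5.2) and (5.3). It computes a rectangle probability from the four corner values, and the union probability of Example 5.2.

```python
import math


def joint_cdf(x, y):
    """F_XY(x, y) from Example 5.1."""
    if x <= 0.0 or y <= 0.0:
        return 0.0
    return (y + math.exp(-x * (y + 1.0))) / (y + 1.0) - math.exp(-x)


def marginal_cdf_x(x):
    """F_X(x) from (5.2)."""
    return 1.0 - math.exp(-x) if x > 0.0 else 0.0


def marginal_cdf_y(y):
    """F_Y(y) from (5.3)."""
    return y / (y + 1.0) if y > 0.0 else 0.0


def rectangle_prob(a, b, c, d):
    """P(a < X <= b, c < Y <= d) from the cdf values at the four corners."""
    return joint_cdf(b, d) - joint_cdf(a, d) - joint_cdf(b, c) + joint_cdf(a, c)


def union_prob(x, y):
    """P(X <= x or Y <= y) = F_X(x) + F_Y(y) - F_XY(x, y), as in Example 5.2."""
    return marginal_cdf_x(x) + marginal_cdf_y(y) - joint_cdf(x, y)
```

Evaluating `joint_cdf(x, y)` with one argument very large should reproduce the corresponding marginal, which is a quick sanity check on (5.1).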
5.2. Jointly Continuous Random Variables
In analogy with the univariate case, we say that two random variables $X$ and $Y$ are jointly continuous with joint density $f_{XY}(x, y)$ if
\begin{align*}
P((X, Y) \in A) &= \iint_A f_{XY}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I_A(x, y) f_{XY}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} \Bigl( \int_{-\infty}^{\infty} I_A(x, y) f_{XY}(x, y)\, dy \Bigr) dx \\
&= \int_{-\infty}^{\infty} \Bigl( \int_{-\infty}^{\infty} I_A(x, y) f_{XY}(x, y)\, dx \Bigr) dy
\end{align*}
for some nonnegative function $f_{XY}$ that integrates to one; i.e.,
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx\, dy = 1. \]

Example 5.3. Show that
\[ f_{XY}(x, y) = \frac{1}{2\pi}\, e^{-(2x^2 - 2xy + y^2)/2} \]
is a valid joint probability density.

Solution. Since $f_{XY}(x, y)$ is nonnegative, all we have to do is show that it integrates to one. By completing the square in the exponent, we obtain
\[ f_{XY}(x, y) = \frac{e^{-(y-x)^2/2}}{\sqrt{2\pi}} \cdot \frac{e^{-x^2/2}}{\sqrt{2\pi}}. \]
This factorization allows us to write the double integral
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{e^{-(2x^2 - 2xy + y^2)/2}}{2\pi}\, dx\, dy \]
as the iterated integral
\[ \int_{-\infty}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2\pi}} \Bigl( \int_{-\infty}^{\infty} \frac{e^{-(y-x)^2/2}}{\sqrt{2\pi}}\, dy \Bigr) dx. \]
The inner integral, as a function of $y$, is a normal density with mean $x$ and variance one. Hence, the inner integral is one. But this leaves only the outer integral, whose integrand is an $N(0,1)$ density, which also integrates to one.
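The conclusion of Example 5.3 can also be checked by brute-force numerical integration. The sketch below uses only the standard library. The midpoint rule, the truncation of the plane to a finite box, and the grid sizes are all illustrative assumptions rather than anything prescribed by the text.

```python
import math


def joint_density(x, y):
    """f_XY(x, y) from Example 5.3."""
    return math.exp(-(2.0 * x * x - 2.0 * x * y + y * y) / 2.0) / (2.0 * math.pi)


def midpoint_double_integral(f, x_lo, x_hi, y_lo, y_hi, nx, ny):
    """Midpoint-rule approximation to the integral of f over a rectangle."""
    hx = (x_hi - x_lo) / nx
    hy = (y_hi - y_lo) / ny
    total = 0.0
    for i in range(nx):
        x = x_lo + (i + 0.5) * hx
        for j in range(ny):
            y = y_lo + (j + 0.5) * hy
            total += f(x, y)
    return total * hx * hy


# The box is wide enough that the neglected tails are negligible.
total_mass = midpoint_double_integral(joint_density, -8.0, 8.0, -12.0, 12.0, 160, 240)
```

The value of `total_mass` should be very close to one, matching the analytical argument.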
Marginal Densities
If $A$ is a product set, say $A = B \times C$, then $I_A(x, y) = I_B(x)\, I_C(y)$, and so
\begin{align*}
P(X \in B, Y \in C) &= P((X, Y) \in B \times C) \\
&= \int_B \Bigl( \int_C f_{XY}(x, y)\, dy \Bigr) dx \tag{5.4} \\
&= \int_C \Bigl( \int_B f_{XY}(x, y)\, dx \Bigr) dy.
\end{align*}
At this point we would like to substitute $B = (-\infty, x]$ and $C = (-\infty, y]$ in order to obtain expressions for $F_{XY}(x, y)$. However, the preceding integrals already use $x$ and $y$ for the variables of integration. To avoid confusion, we must first replace the variables of integration. We change $x$ to $t$ and $y$ to $\tau$. We then find that
\[ F_{XY}(x, y) = \int_{-\infty}^{x} \Bigl( \int_{-\infty}^{y} f_{XY}(t, \tau)\, d\tau \Bigr) dt, \]
or, equivalently,
\[ F_{XY}(x, y) = \int_{-\infty}^{y} \Bigl( \int_{-\infty}^{x} f_{XY}(t, \tau)\, dt \Bigr) d\tau. \]
It then follows that
\[ \frac{\partial^2}{\partial y\, \partial x} F_{XY}(x, y) = f_{XY}(x, y) \quad\text{and}\quad \frac{\partial^2}{\partial x\, \partial y} F_{XY}(x, y) = f_{XY}(x, y). \]

Example 5.4. Let
\[ F_{XY}(x, y) = \begin{cases} \dfrac{y + e^{-x(y+1)}}{y + 1} - e^{-x}, & x, y > 0, \\ 0, & \text{otherwise,} \end{cases} \]
as in Example 5.1. Find the joint density $f_{XY}$.

Solution. For $x, y > 0$,
\[ \frac{\partial}{\partial x} F_{XY}(x, y) = e^{-x} - e^{-x(y+1)}, \]
and
\[ \frac{\partial^2}{\partial y\, \partial x} F_{XY}(x, y) = x e^{-x(y+1)}. \]
Thus,
\[ f_{XY}(x, y) = \begin{cases} x e^{-x(y+1)}, & x, y > 0, \\ 0, & \text{otherwise.} \end{cases} \]
We now show that if $X$ and $Y$ are jointly continuous, then $X$ and $Y$ are individually continuous with marginal densities obtained as follows. Taking $C = \mathbb{R}$ in (5.4), we obtain
\[ P(X \in B) = P((X, Y) \in B \times \mathbb{R}) = \int_B \Bigl( \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy \Bigr) dx, \]
which implies the marginal density of $X$ is
\[ f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy. \tag{5.5} \]
Similarly,
\[ P(Y \in C) = P((X, Y) \in \mathbb{R} \times C) = \int_C \Bigl( \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx \Bigr) dy, \]
and
\[ f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx. \]
Thus, to obtain the marginal densities, integrate out the unwanted variable.

Example 5.5. Using the joint density $f_{XY}$ obtained in Example 5.4, find the marginal densities $f_X$ and $f_Y$ by integrating out the unneeded variable. To check your answer, also compute the marginal densities by differentiating the marginal cdfs obtained in Example 5.1.

Solution. We first compute $f_X(x)$. To begin, observe that for $x \le 0$, $f_{XY}(x, y) = 0$. Hence, for $x \le 0$, the integral in (5.5) is zero. Now suppose $x > 0$. Since $f_{XY}(x, y) = 0$ whenever $y \le 0$, the lower limit of integration in (5.5) can be changed to zero. For $x > 0$, it remains to compute
\[ \int_0^{\infty} f_{XY}(x, y)\, dy = x e^{-x} \int_0^{\infty} e^{-xy}\, dy = e^{-x}. \]
Hence,
\[ f_X(x) = \begin{cases} e^{-x}, & x > 0, \\ 0, & x \le 0, \end{cases} \]
and we see that $X$ is exponentially distributed with parameter $\lambda = 1$. Note that the same answer can be obtained by differentiating the formula for $F_X(x)$ in (5.2).

We now turn to the calculation of $f_Y(y)$. Arguing as above, we have $f_Y(y) = 0$ for $y \le 0$, and $f_Y(y) = \int_0^{\infty} f_{XY}(x, y)\, dx$ for $y > 0$. Write this integral as
\[ \int_0^{\infty} f_{XY}(x, y)\, dx = \frac{1}{y + 1} \int_0^{\infty} x \cdot (y + 1) e^{-(y+1)x}\, dx. \tag{5.6} \]
If we put $\lambda = y + 1$, then the integral on the right has the form
\[ \int_0^{\infty} x \cdot \lambda e^{-\lambda x}\, dx, \]
which is the mean of an exponential random variable with parameter $\lambda$. This integral is equal to $1/\lambda = 1/(y + 1)$, and so the right-hand side of (5.6) is equal to $1/(y + 1)^2$. We conclude that
\[ f_Y(y) = \begin{cases} 1/(y + 1)^2, & y > 0, \\ 0, & y \le 0. \end{cases} \]
Note that the same answer can be obtained by differentiating the formula for $F_Y(y)$ in (5.3).
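The two marginal densities of Example 5.5 can be confirmed by integrating out the unwanted variable numerically. This is a minimal sketch using only the standard library. The midpoint rule, the truncation point `upper`, and the step count are illustrative assumptions. The truncation is adequate for arguments of moderate size but not for values of $x$ very close to zero, where the integrand in $y$ decays slowly.

```python
import math


def joint_density(x, y):
    """f_XY(x, y) = x e^{-x(y+1)} for x, y > 0, from Example 5.4."""
    if x <= 0.0 or y <= 0.0:
        return 0.0
    return x * math.exp(-x * (y + 1.0))


def integrate_midpoint(f, lo, hi, n):
    """Midpoint-rule approximation to the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


def marginal_x(x, upper=60.0, n=60000):
    """Approximate f_X(x) by integrating out y over (0, upper)."""
    return integrate_midpoint(lambda y: joint_density(x, y), 0.0, upper, n)


def marginal_y(y, upper=60.0, n=60000):
    """Approximate f_Y(y) by integrating out x over (0, upper)."""
    return integrate_midpoint(lambda x: joint_density(x, y), 0.0, upper, n)
```

Comparing `marginal_x(1.0)` with `math.exp(-1.0)` and `marginal_y(1.0)` with `0.25` checks both closed-form answers at one point each.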
Specifying Joint Densities
Suppose $X$ and $Y$ are jointly continuous with joint density $f_{XY}$. Define
\[ f_{Y|X}(y|x) := \frac{f_{XY}(x, y)}{f_X(x)}. \tag{5.7} \]
For reasons that will become clear later, we call $f_{Y|X}$ the conditional density of $Y$ given $X$. For the moment, simply note that as a function of $y$, $f_{Y|X}(y|x)$ is nonnegative and integrates to one, since
\[ \int_{-\infty}^{\infty} f_{Y|X}(y|x)\, dy = \frac{\int_{-\infty}^{\infty} f_{XY}(x, y)\, dy}{f_X(x)} = \frac{f_X(x)}{f_X(x)} = 1. \]
Now rewrite (5.7) as
\[ f_{XY}(x, y) = f_{Y|X}(y|x)\, f_X(x). \]
This suggests an easy way to specify a joint density. First, pick any one-dimensional density, say $f_X(x)$. Then $f_X(x)$ is nonnegative and integrates to one. Next, for each $x$, let $f_{Y|X}(y|x)$ be any density in the variable $y$. For different values of $x$, $f_{Y|X}(y|x)$ could be very different densities as a function of $y$; the only important thing is that for each $x$, $f_{Y|X}(y|x)$ as a function of $y$ should be nonnegative and integrate to one with respect to $y$. Now if we define $f_{XY}(x, y) := f_{Y|X}(y|x)\, f_X(x)$, we will have a nonnegative function of $(x, y)$ whose double integral is one. In practice, and in the problems at the end of the chapter, this is how we usually specify joint densities.

Example 5.6. Let $X \sim N(0,1)$, and suppose that the conditional density of $Y$ given $X = x$ is $N(x, 1)$. Find the joint density $f_{XY}(x, y)$.

Solution. First note that
\[ f_X(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}} \quad\text{and}\quad f_{Y|X}(y|x) = \frac{e^{-(y-x)^2/2}}{\sqrt{2\pi}}. \]
Then write
\[ f_{XY}(x, y) = f_{Y|X}(y|x)\, f_X(x) = \frac{e^{-(y-x)^2/2}}{\sqrt{2\pi}} \cdot \frac{e^{-x^2/2}}{\sqrt{2\pi}}, \]
which simplifies to
\[ f_{XY}(x, y) = \frac{\exp[-(2x^2 - 2xy + y^2)/2]}{2\pi}. \]

By interchanging the roles of $X$ and $Y$, we have
\[ f_{X|Y}(x|y) := \frac{f_{XY}(x, y)}{f_Y(y)}, \]
and we can define $f_{XY}(x, y) = f_{X|Y}(x|y)\, f_Y(y)$ if we specify a density $f_Y(y)$ and a function $f_{X|Y}(x|y)$ that is a density in $x$ for each fixed $y$.
5.2 Jointly Continuous Random Variables
163
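Specifying a joint density as $f_{Y|X}(y|x)\,f_X(x)$ also gives a two-step recipe for simulation: draw $X$ from $f_X$, then draw $Y$ from $f_{Y|X}(\cdot|x)$. The following sketch applies this to Example 5.6 and checks the closed-form joint density against the product form. The sample size, seed, and function names are arbitrary illustrative choices. The targets $E[XY] = 1$ and $E[Y^2] = 2$ are not stated in the example; they follow from writing $Y = X + W$ with $W \sim N(0,1)$ independent of $X$.

```python
import math
import random

def f_x(x):
    """N(0,1) density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def f_y_given_x(y, x):
    """Conditional density of Y given X = x, namely N(x, 1)."""
    return math.exp(-(y - x) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def f_xy(x, y):
    """Closed-form joint density obtained in Example 5.6."""
    return math.exp(-(2.0 * x * x - 2.0 * x * y + y * y) / 2.0) / (2.0 * math.pi)

def sample_pair(rng):
    """Draw X from f_X, then Y from f_{Y|X}(.|x)."""
    x = rng.gauss(0.0, 1.0)
    y = rng.gauss(x, 1.0)
    return x, y

rng = random.Random(0)
pairs = [sample_pair(rng) for _ in range(200_000)]
mean_xy = sum(x * y for x, y in pairs) / len(pairs)   # estimates E[XY]
second_moment_y = sum(y * y for _, y in pairs) / len(pairs)  # estimates E[Y^2]
print(mean_xy, second_moment_y)
```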
Independence
We now consider the joint density of jointly continuous independent random variables. As noted in Section 5.1, if $X$ and $Y$ are independent, then $F_{XY}(x,y) = F_X(x)\,F_Y(y)$ for all $x$ and $y$. If $X$ and $Y$ are also jointly continuous, then by taking second-order mixed partial derivatives, we find
\[ \frac{\partial^2}{\partial y\,\partial x} F_X(x)\,F_Y(y) = f_X(x)\,f_Y(y). \]
In other words, if $X$ and $Y$ are jointly continuous and independent, then the joint density is the product of the marginal densities. Using (5.4), it is easy to see that the converse is also true. If $f_{XY}(x,y) = f_X(x)\,f_Y(y)$, (5.4) implies
\begin{align*}
P(X \in B, Y \in C) &= \int_B \int_C f_{XY}(x,y)\,dy\,dx \\
&= \int_B \int_C f_X(x)\,f_Y(y)\,dy\,dx \\
&= \int_B f_X(x) \Bigl( \int_C f_Y(y)\,dy \Bigr) dx \\
&= \Bigl( \int_B f_X(x)\,dx \Bigr) P(Y \in C) \\
&= P(X \in B)\,P(Y \in C).
\end{align*}
Expectation
If $X$ and $Y$ are jointly continuous with joint density $f_{XY}$, then the methods of Section 3.2 can easily be used to show that
\[ E[g(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y)\,f_{XY}(x,y)\,dx\,dy. \]
For arbitrary random variables $X$ and $Y$, their bivariate characteristic function is defined by
\[ \varphi_{XY}(\nu_1,\nu_2) := E[e^{j(\nu_1 X + \nu_2 Y)}]. \]
If $X$ and $Y$ have joint density $f_{XY}$, then
\[ \varphi_{XY}(\nu_1,\nu_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x,y)\,e^{j(\nu_1 x + \nu_2 y)}\,dx\,dy, \]
which is simply the bivariate Fourier transform of $f_{XY}$. By the inversion formula,
\[ f_{XY}(x,y) = \frac{1}{(2\pi)^2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \varphi_{XY}(\nu_1,\nu_2)\,e^{-j(\nu_1 x + \nu_2 y)}\,d\nu_1\,d\nu_2. \]
Now suppose that $X$ and $Y$ are independent. Then
\[ \varphi_{XY}(\nu_1,\nu_2) = E[e^{j(\nu_1 X + \nu_2 Y)}] = E[e^{j\nu_1 X}]\,E[e^{j\nu_2 Y}] = \varphi_X(\nu_1)\,\varphi_Y(\nu_2). \]
In other words, if $X$ and $Y$ are independent, then their joint characteristic function factors. The converse is also true; i.e., if the joint characteristic function factors, then $X$ and $Y$ are independent. The general proof is complicated, but if $X$ and $Y$ are jointly continuous, it is easy using the Fourier inversion formula.

⋆Continuous Random Variables That Are not Jointly Continuous

Let $\Theta \sim \text{uniform}[-\pi,\pi]$, and put $X := \cos\Theta$ and $Y := \sin\Theta$. As shown in Problem 33 in Chapter 4, $X$ and $Y$ are both arcsine random variables, each having density $(1/\pi)/\sqrt{1 - x^2}$ for $-1 < x < 1$.

Next, since $X^2 + Y^2 = 1$, the pair $(X,Y)$ takes values only on the unit circle
\[ C := \{(x,y) : x^2 + y^2 = 1\}. \]
Thus, $P((X,Y) \in C) = 1$. On the other hand, if $X$ and $Y$ have a joint density $f_{XY}$, then
\[ P((X,Y) \in C) = \iint_C f_{XY}(x,y)\,dx\,dy = 0 \]
because a double integral over a set of zero area must be zero. So, if $X$ and $Y$ had a joint density, this would imply that $1 = 0$. Since this is not true, there can be no joint density.
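A short simulation illustrates both halves of this argument: each coordinate behaves like an arcsine random variable, yet every sample lands on the unit circle. This is only a sketch; the seed, sample size, and the helper `arcsine_cdf` (the integral of the density $(1/\pi)/\sqrt{1-x^2}$ from $-1$ to $x$) are illustrative additions.

```python
import math
import random

rng = random.Random(1)
n = 100_000
points = []
for _ in range(n):
    theta = rng.uniform(-math.pi, math.pi)   # Theta ~ uniform[-pi, pi]
    points.append((math.cos(theta), math.sin(theta)))

# Every sample satisfies x^2 + y^2 = 1 up to floating-point rounding.
max_dev = max(abs(x * x + y * y - 1.0) for x, y in points)

def arcsine_cdf(x):
    """cdf of the density (1/pi)/sqrt(1 - x^2) on (-1, 1)."""
    return 0.5 + math.asin(x) / math.pi

# Empirical P(X <= 0.5) compared with the arcsine cdf at 0.5.
frac = sum(1 for x, _ in points if x <= 0.5) / n
print(max_dev, frac, arcsine_cdf(0.5))
```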
5.3. Conditional Probability and Expectation
If $X$ is a continuous random variable, then its cdf $F_X(x) := P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$ is a continuous function of $x$. It follows from the properties of cdfs in Section 4.6 that $P(X = x) = 0$ for all $x$. Hence, we cannot define $P(Y \in C | X = x)$ by $P(X = x, Y \in C)/P(X = x)$ since this requires division by zero. Similar problems arise with conditional expectation. How should we define conditional probability and expectation in this case?

Recall from Section 2.5 that if $X$ and $Y$ are both discrete random variables, then the following law of total probability holds,
\[ E[g(X,Y)] = \sum_i E[g(X,Y) | X = x_i]\,p_X(x_i). \]
If $X$ and $Y$ are jointly continuous, however we define $E[g(X,Y) | X = x]$, we certainly want the analogous formula
\[ E[g(X,Y)] = \int_{-\infty}^{\infty} E[g(X,Y) | X = x]\,f_X(x)\,dx \tag{5.8} \]
to hold. To discover how $E[g(X,Y) | X = x]$ should be defined, write
\begin{align*}
E[g(X,Y)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y)\,f_{XY}(x,y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \Bigl( \int_{-\infty}^{\infty} g(x,y)\,f_{XY}(x,y)\,dy \Bigr) dx \\
&= \int_{-\infty}^{\infty} \Bigl( \int_{-\infty}^{\infty} g(x,y)\,\frac{f_{XY}(x,y)}{f_X(x)}\,dy \Bigr) f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} \Bigl( \int_{-\infty}^{\infty} g(x,y)\,f_{Y|X}(y|x)\,dy \Bigr) f_X(x)\,dx.
\end{align*}
It follows that if
\[ E[g(X,Y) | X = x] := \int_{-\infty}^{\infty} g(x,y)\,f_{Y|X}(y|x)\,dy, \tag{5.9} \]
then (5.8) holds. Now observe that if $g$ in (5.9) is a function of $y$ alone, then
\[ E[g(Y) | X = x] = \int_{-\infty}^{\infty} g(y)\,f_{Y|X}(y|x)\,dy. \tag{5.10} \]
Returning now to the case $g(x,y)$, we see that (5.10) implies that for any $\tilde{x}$,
\[ E[g(\tilde{x},Y) | X = x] = \int_{-\infty}^{\infty} g(\tilde{x},y)\,f_{Y|X}(y|x)\,dy. \]
Taking $\tilde{x} = x$, we obtain
\[ E[g(x,Y) | X = x] = \int_{-\infty}^{\infty} g(x,y)\,f_{Y|X}(y|x)\,dy. \]
Comparing this with (5.9) shows that
\[ E[g(X,Y) | X = x] = E[g(x,Y) | X = x]. \]
In other words, the substitution law holds. Another important point to note is that if $X$ and $Y$ are independent, then
\[ f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)} = \frac{f_X(x)\,f_Y(y)}{f_X(x)} = f_Y(y). \]
In this case, (5.10) becomes $E[g(Y) | X = x] = E[g(Y)]$. In other words, we can "drop the conditioning."
Example 5.7. Let $X \sim \exp(1)$, and suppose that given $X = x$, $Y$ is conditionally normal with $f_{Y|X}(\cdot|x) \sim N(0,x^2)$. Evaluate $E[Y^2]$ and $E[Y^2 X^3]$.

Solution. We use the law of total probability for expectation. We begin with
\[ E[Y^2] = \int_{-\infty}^{\infty} E[Y^2 | X = x]\,f_X(x)\,dx. \]
Since $f_{Y|X}(y|x) = e^{-(y/x)^2/2}/(\sqrt{2\pi}\,x)$, we see that
\[ E[Y^2 | X = x] = \int_{-\infty}^{\infty} y^2\,\frac{e^{-(y/x)^2/2}}{\sqrt{2\pi}\,x}\,dy \]
is the second moment of a zero-mean Gaussian density with variance $x^2$. Hence, $E[Y^2 | X = x] = x^2$, and
\[ E[Y^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx = E[X^2]. \]
Since $X \sim \exp(1)$, $E[X^2] = 2$ by Example 3.9. To compute $E[Y^2 X^3]$, we proceed similarly. Write
\begin{align*}
E[Y^2 X^3] &= \int_{-\infty}^{\infty} E[Y^2 X^3 | X = x]\,f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} E[Y^2 x^3 | X = x]\,f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} x^3\,E[Y^2 | X = x]\,f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} x^3 x^2 f_X(x)\,dx \\
&= E[X^5] = 5!, \quad\text{by Example 3.9.}
\end{align*}

For conditional probability, if we put
\[ P(Y \in C | X = x) := \int_C f_{Y|X}(y|x)\,dy, \]
then it is easy to show that the law of total probability
\[ P(Y \in C) = \int_{-\infty}^{\infty} P(Y \in C | X = x)\,f_X(x)\,dx \]
holds. It can also be shown (see Note 3) that the substitution law holds for conditional probabilities involving $X$ and $Y$ conditioned on $X$. Also, if $X$ and $Y$ are independent, then $P(Y \in C | X = x) = P(Y \in C)$; i.e., we can "drop the conditioning."

Example 5.8 (Signal in Additive Noise). A random, continuous-valued signal $X$ is transmitted over a channel subject to additive, continuous-valued noise $Y$. The received signal is $Z = X + Y$. Find the cdf and density of $Z$ if $X$ and $Y$ are jointly continuous random variables with joint density $f_{XY}$.
Solution. Since we are not assuming that $X$ and $Y$ are independent, the characteristic-function method of Example 3.14 does not work here. Instead, we use the laws of total probability and substitution. Write
\begin{align*}
F_Z(z) = P(Z \le z) &= \int_{-\infty}^{\infty} P(Z \le z | Y = y)\,f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} P(X + Y \le z | Y = y)\,f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} P(X + y \le z | Y = y)\,f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} P(X \le z - y | Y = y)\,f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} F_{X|Y}(z - y | y)\,f_Y(y)\,dy.
\end{align*}
By differentiating with respect to $z$,
\[ f_Z(z) = \int_{-\infty}^{\infty} f_{X|Y}(z - y | y)\,f_Y(y)\,dy. \]
In particular, note that if $X$ and $Y$ are independent, we can drop the conditioning and recover the convolution result stated following Example 3.14,
\[ f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\,f_Y(y)\,dy. \]
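The convolution formula for independent $X$ and $Y$ is easy to evaluate numerically. As a hedged illustration (not taken from the text), the sketch below uses two independent $\exp(1)$ random variables, for which the convolution integral evaluates to $z e^{-z}$ for $z \ge 0$. The function names, integration limits, and grid size are illustrative assumptions.

```python
import math

def f_exp(x, lam=1.0):
    """exp(lam) density: lam*exp(-lam*x) for x >= 0, zero otherwise."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def conv_density(z, f_x, f_y, lo=-50.0, hi=50.0, n=100_000):
    """Midpoint-rule approximation of the integral of f_x(z - y) f_y(y) dy."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        total += f_x(z - y) * f_y(y)
    return h * total

z = 1.5
print(conv_density(z, f_exp, f_exp), z * math.exp(-z))
```

The truncation to $[-50, 50]$ is harmless here because both densities are negligible outside that range; heavier-tailed densities would need wider limits.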
Example 5.9 (Signal in Multiplicative Noise). A random, continuous-valued signal $X$ is transmitted over a channel subject to multiplicative, continuous-valued noise $Y$. The received signal is $Z = XY$. Find the cdf and density of $Z$ if $X$ and $Y$ are jointly continuous random variables with joint density $f_{XY}$.

Solution. We proceed as in the previous example. Write
\begin{align*}
F_Z(z) = P(Z \le z) &= \int_{-\infty}^{\infty} P(Z \le z | Y = y)\,f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} P(XY \le z | Y = y)\,f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} P(Xy \le z | Y = y)\,f_Y(y)\,dy.
\end{align*}
At this point we have a problem when we attempt to divide through by $y$. If $y$ is negative, we have to reverse the inequality sign. Otherwise, we do not have to reverse the inequality. The solution to this difficulty is to break up the range of integration. Write
\[ F_Z(z) = \int_{-\infty}^{0} P(Xy \le z | Y = y)\,f_Y(y)\,dy + \int_{0}^{\infty} P(Xy \le z | Y = y)\,f_Y(y)\,dy. \]
Now we can divide by $y$ separately in each integral. Thus,
\begin{align*}
F_Z(z) &= \int_{-\infty}^{0} P(X \ge z/y | Y = y)\,f_Y(y)\,dy + \int_{0}^{\infty} P(X \le z/y | Y = y)\,f_Y(y)\,dy \\
&= \int_{-\infty}^{0} \bigl[ 1 - F_{X|Y}\bigl(\tfrac{z}{y} \big| y\bigr) \bigr] f_Y(y)\,dy + \int_{0}^{\infty} F_{X|Y}\bigl(\tfrac{z}{y} \big| y\bigr) f_Y(y)\,dy.
\end{align*}
Differentiating with respect to $z$ yields
\[ f_Z(z) = -\int_{-\infty}^{0} f_{X|Y}\bigl(\tfrac{z}{y} \big| y\bigr) \frac{1}{y}\,f_Y(y)\,dy + \int_{0}^{\infty} f_{X|Y}\bigl(\tfrac{z}{y} \big| y\bigr) \frac{1}{y}\,f_Y(y)\,dy. \]
Now observe that in the first integral, the range of integration implies that $y$ is always negative. For such $y$, $-y = |y|$. In the second integral, $y$ is always positive, and so $y = |y|$. Thus,
\begin{align*}
f_Z(z) &= \int_{-\infty}^{0} f_{X|Y}\bigl(\tfrac{z}{y} \big| y\bigr) \frac{1}{|y|}\,f_Y(y)\,dy + \int_{0}^{\infty} f_{X|Y}\bigl(\tfrac{z}{y} \big| y\bigr) \frac{1}{|y|}\,f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} f_{X|Y}\bigl(\tfrac{z}{y} \big| y\bigr) \frac{1}{|y|}\,f_Y(y)\,dy.
\end{align*}
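As a quick sanity check of the final formula, consider the special case (an assumption made here for illustration, not part of the example) in which $X$ and $Y$ are independent and both uniform on $(0,1)$. Then $f_{X|Y} = f_X$, the integrand is $1/y$ for $z < y < 1$, and the formula gives $f_Z(z) = -\ln z$ for $0 < z < 1$. The sketch below evaluates the integral with a midpoint rule; the names and grid parameters are illustrative.

```python
import math

def f_unif(x):
    """uniform(0,1) density."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def product_density(z, f_x, f_y, lo=-5.0, hi=5.0, n=200_000):
    """Midpoint-rule approximation of the integral of f_x(z/y) (1/|y|) f_y(y) dy
    for independent X and Y. The midpoints never land exactly on y = 0."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        total += f_x(z / y) * f_y(y) / abs(y)
    return h * total

z = 0.5
print(product_density(z, f_unif, f_unif), -math.log(z))
```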
5.4. The Bivariate Normal
The bivariate Gaussian or bivariate normal density is a generalization of the univariate $N(m,\sigma^2)$ density. Recall that the standard $N(0,1)$ density is given by $\varphi(x) := \exp(-x^2/2)/\sqrt{2\pi}$. The general $N(m,\sigma^2)$ density can be written in terms of $\varphi$ as
\[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Bigl[ -\frac{1}{2} \Bigl( \frac{x - m}{\sigma} \Bigr)^2 \Bigr] = \frac{1}{\sigma}\,\varphi\Bigl( \frac{x - m}{\sigma} \Bigr). \]
In order to define the general bivariate Gaussian density, it is convenient to define a standard bivariate density first. So, for $|\rho| < 1$, put
\[ \varphi_\rho(u,v) := \frac{\exp\Bigl( \frac{-1}{2(1-\rho^2)} [u^2 - 2\rho uv + v^2] \Bigr)}{2\pi\sqrt{1 - \rho^2}}. \tag{5.11} \]

Figure 5.2. The Gaussian surface $\varphi_\rho(u,v)$ of (5.11) with $\rho = 0$.
For fixed $\rho$, this function of the two variables $u$ and $v$ defines a surface. The surface corresponding to $\rho = 0$ is shown in Figure 5.2. From the figure and from the formula (5.11), we see that $\varphi_0$ is circularly symmetric; i.e., for all $(u,v)$ on a circle of radius $r$, in other words, for $u^2 + v^2 = r^2$, $\varphi_0(u,v) = e^{-r^2/2}/(2\pi)$ does not depend on the particular values of $u$ and $v$, but only on the radius of the circle on which they lie. We also point out that for $\rho = 0$, the formula (5.11) factors into the product of two univariate $N(0,1)$ densities, i.e., $\varphi_0(u,v) = \varphi(u)\,\varphi(v)$. For $\rho \ne 0$, $\varphi_\rho$ does not factor. In other words, if $U$ and $V$ have joint density $\varphi_\rho$, then $U$ and $V$ are independent if and only if $\rho = 0$. A plot of $\varphi_\rho$ for $\rho = -0.85$ is shown in Figure 5.3. It turns out that now $\varphi_\rho$ is constant on ellipses instead of circles. The axes of the ellipses are not parallel to the coordinate axes. Notice how the density is concentrated along the line $v = -u$. As $\rho \to -1$, this concentration becomes more extreme. As $\rho \to +1$, the density concentrates around the line $v = u$.

Figure 5.3. The Gaussian surface $\varphi_\rho(u,v)$ of (5.11) with $\rho = -0.85$.

We now show that the density $\varphi_\rho$ integrates to one. To do this, first observe that for all $|\rho| < 1$,
\[ u^2 - 2\rho uv + v^2 = u^2(1 - \rho^2) + (v - \rho u)^2. \]
It follows that
\begin{align*}
\varphi_\rho(u,v) &= \frac{e^{-u^2/2}}{\sqrt{2\pi}} \cdot \frac{\exp\Bigl( \frac{-1}{2(1-\rho^2)} [v - \rho u]^2 \Bigr)}{\sqrt{2\pi}\sqrt{1 - \rho^2}} \\
&= \varphi(u) \cdot \frac{1}{\sqrt{1 - \rho^2}}\,\varphi\Bigl( \frac{v - \rho u}{\sqrt{1 - \rho^2}} \Bigr). \tag{5.12}
\end{align*}
Observe that the right-hand factor as a function of $v$ has the form of a univariate normal density with mean $\rho u$ and variance $1 - \rho^2$. With $\varphi_\rho$ factored as in (5.12), we can write $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \varphi_\rho(u,v)\,du\,dv$ as the iterated integral
\[ \int_{-\infty}^{\infty} \varphi(u) \Bigl[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{1 - \rho^2}}\,\varphi\Bigl( \frac{v - \rho u}{\sqrt{1 - \rho^2}} \Bigr) dv \Bigr] du. \]
As noted above, the inner integrand, as a function of $v$, is simply an $N(\rho u, 1 - \rho^2)$ density, and therefore integrates to one. Hence, the above iterated integral becomes $\int_{-\infty}^{\infty} \varphi(u)\,du = 1$.

We can now easily define the general bivariate Gaussian density with parameters $m_X$, $m_Y$, $\sigma_X^2$, $\sigma_Y^2$, and $\rho$ by
\[ f_{XY}(x,y) := \frac{1}{\sigma_X \sigma_Y}\,\varphi_\rho\Bigl( \frac{x - m_X}{\sigma_X}, \frac{y - m_Y}{\sigma_Y} \Bigr). \]
More explicitly, this density is
\[ \frac{\exp\Bigl( \frac{-1}{2(1-\rho^2)} \Bigl[ \bigl( \frac{x - m_X}{\sigma_X} \bigr)^2 - 2\rho \bigl( \frac{x - m_X}{\sigma_X} \bigr) \bigl( \frac{y - m_Y}{\sigma_Y} \bigr) + \bigl( \frac{y - m_Y}{\sigma_Y} \bigr)^2 \Bigr] \Bigr)}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}}. \tag{5.13} \]
It can be shown that the marginals are $f_X \sim N(m_X, \sigma_X^2)$ and $f_Y \sim N(m_Y, \sigma_Y^2)$ (see Problems 25 and 28). The parameter $\rho$ is called the correlation coefficient. From (5.13), we observe that $X$ and $Y$ are independent if and only if $\rho = 0$. A plot of $f_{XY}$ with $m_X = m_Y = 0$, $\sigma_X = 1.5$, $\sigma_Y = 0.6$, and $\rho = 0$ is shown in Figure 5.4. Notice how the density is concentrated around the line $y = 0$. Also, $f_{XY}$ is constant on ellipses of the form
\[ \Bigl( \frac{x}{\sigma_X} \Bigr)^2 + \Bigl( \frac{y}{\sigma_Y} \Bigr)^2 = r^2. \]

Figure 5.4. The bivariate normal density $f_{XY}(x,y)$ of (5.13) with $m_X = m_Y = 0$, $\sigma_X = 1.5$, $\sigma_Y = 0.6$, and $\rho = 0$.

To show that $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x,y)\,dx\,dy = 1$ as well, use formula (5.12) for $\varphi_\rho$ and proceed as above, integrating with respect to $y$ first and then $x$. For the inner integral, make the change of variable $v = (y - m_Y)/\sigma_Y$, and in the remaining outer integral make the change of variable $u = (x - m_X)/\sigma_X$.

Example 5.10. Let random variables $U$ and $V$ have the standard bivariate normal density $\varphi_\rho$ in (5.11). Show that $E[UV] = \rho$.

Solution. Using the factored form of $\varphi_\rho$ in (5.12), write
\begin{align*}
E[UV] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} uv\,\varphi_\rho(u,v)\,du\,dv \\
&= \int_{-\infty}^{\infty} u\,\varphi(u) \Bigl[ \int_{-\infty}^{\infty} \frac{v}{\sqrt{1 - \rho^2}}\,\varphi\Bigl( \frac{v - \rho u}{\sqrt{1 - \rho^2}} \Bigr) dv \Bigr] du.
\end{align*}
The quantity in brackets has the form $E[\hat{V}]$, where $\hat{V}$ is a univariate normal random variable with mean $\rho u$ and variance $1 - \rho^2$. Thus,
\begin{align*}
E[UV] &= \int_{-\infty}^{\infty} u\,\varphi(u)\,[\rho u]\,du \\
&= \rho \int_{-\infty}^{\infty} u^2 \varphi(u)\,du \\
&= \rho,
\end{align*}
since $\varphi$ is the $N(0,1)$ density.
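Example 5.11. Let $U$ and $V$ have the standard bivariate normal density $f_{UV}(u,v) = \varphi_\rho(u,v)$ given in (5.11). Find the conditional densities $f_{V|U}$ and $f_{U|V}$.

Solution. It is shown in Problem 25 that $f_U$ and $f_V$ are both $N(0,1)$. Hence,
\[ f_{V|U}(v|u) = \frac{f_{UV}(u,v)}{f_U(u)} = \frac{\varphi_\rho(u,v)}{\varphi(u)}, \]
where $\varphi$ is the $N(0,1)$ density. If we now substitute the factored form of $\varphi_\rho(u,v)$ given in (5.12), we obtain
\[ f_{V|U}(v|u) = \frac{1}{\sqrt{1 - \rho^2}}\,\varphi\Bigl( \frac{v - \rho u}{\sqrt{1 - \rho^2}} \Bigr); \]
i.e., $f_{V|U}(\cdot|u) \sim N(\rho u, 1 - \rho^2)$. To compute $f_{U|V}$ we need the following alternative factorization of $\varphi_\rho$,
\[ \varphi_\rho(u,v) = \frac{1}{\sqrt{1 - \rho^2}}\,\varphi\Bigl( \frac{u - \rho v}{\sqrt{1 - \rho^2}} \Bigr) \cdot \varphi(v). \]
It then follows that
\[ f_{U|V}(u|v) = \frac{1}{\sqrt{1 - \rho^2}}\,\varphi\Bigl( \frac{u - \rho v}{\sqrt{1 - \rho^2}} \Bigr); \]
i.e., $f_{U|V}(\cdot|v) \sim N(\rho v, 1 - \rho^2)$.

Example 5.12. If $U$ and $V$ have standard joint normal density $\varphi_\rho(u,v)$, find $E[V | U = u]$.

Solution. Recall that from Example 5.11, $f_{V|U}(\cdot|u) \sim N(\rho u, 1 - \rho^2)$. Hence,
\[ E[V | U = u] = \int_{-\infty}^{\infty} v\,f_{V|U}(v|u)\,dv = \rho u. \]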
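The factorization (5.12) and the conditional density of Example 5.11 suggest a simple way to simulate a standard bivariate normal pair: draw $U \sim N(0,1)$, then draw $V$ from $N(\rho u, 1 - \rho^2)$. The sketch below does this and checks both the algebraic identity behind (5.12) and the result $E[UV] = \rho$ of Example 5.10. The seed, sample size, test point, and function names are illustrative assumptions.

```python
import math
import random

def phi(x):
    """Standard N(0,1) density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def phi_rho(u, v, rho):
    """Standard bivariate normal density (5.11), valid for |rho| < 1."""
    q = (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho * rho)
    return math.exp(-q / 2.0) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def phi_rho_factored(u, v, rho):
    """Factored form (5.12): phi(u) times an N(rho*u, 1 - rho^2) density in v."""
    s = math.sqrt(1.0 - rho * rho)
    return phi(u) * phi((v - rho * u) / s) / s

def sample_uv(rho, rng):
    """Draw U ~ N(0,1), then V given U = u from N(rho*u, 1 - rho^2)."""
    u = rng.gauss(0.0, 1.0)
    v = rng.gauss(rho * u, math.sqrt(1.0 - rho * rho))
    return u, v

rng = random.Random(2)
rho = -0.85
samples = [sample_uv(rho, rng) for _ in range(200_000)]
est_corr = sum(u * v for u, v in samples) / len(samples)  # estimates E[UV]
print(est_corr, rho)
```

Note that `random.gauss` takes a standard deviation, so the conditional variance $1 - \rho^2$ enters through its square root.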
5.5. ⋆Multivariate Random Variables
Let $X := [X_1, \ldots, X_n]'$ (the prime denotes transpose) be a length-$n$ column vector of random variables. We call $X$ a random vector. We say that the variables $X_1, \ldots, X_n$ are jointly continuous with joint density $f_X = f_{X_1 \cdots X_n}$ if for $A \subset \mathbb{R}^n$,
\[ P(X \in A) = \int_A f_X(x)\,dx = \int_{\mathbb{R}^n} I_A(x)\,f_X(x)\,dx, \]
which is shorthand for
\[ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} I_A(x_1, \ldots, x_n)\,f_{X_1 \cdots X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n. \]
Usually, we need to compute probabilities of the form
\[ P(X_1 \in B_1, \ldots, X_n \in B_n) := P\Bigl( \bigcap_{k=1}^{n} \{X_k \in B_k\} \Bigr), \]
where $B_1, \ldots, B_n$ are one-dimensional sets. The key to this calculation is to observe that
\[ \bigcap_{k=1}^{n} \{X_k \in B_k\} = \bigl\{ [X_1, \ldots, X_n]' \in B_1 \times \cdots \times B_n \bigr\}. \]
(Recall that a vector $[x_1, \ldots, x_n]'$ lies in the Cartesian product set $B_1 \times \cdots \times B_n$ if and only if $x_i \in B_i$ for $i = 1, \ldots, n$.) It now follows that
\[ P(X_1 \in B_1, \ldots, X_n \in B_n) = \int_{B_1} \cdots \int_{B_n} f_{X_1 \cdots X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n. \]
For example, to compute the joint cumulative distribution function, take $B_k = (-\infty, x_k]$ to get
\begin{align*}
F_X(x) := F_{X_1 \cdots X_n}(x_1, \ldots, x_n) &:= P(X_1 \le x_1, \ldots, X_n \le x_n) \\
&= \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_{X_1 \cdots X_n}(t_1, \ldots, t_n)\,dt_1 \cdots dt_n,
\end{align*}
where we have changed the dummy variables of integration to $t$ to avoid confusion with the upper limits of integration. Note also that
\[ \frac{\partial^n F_X}{\partial x_n \cdots \partial x_1} \Bigr|_{x_1, \ldots, x_n} = f_{X_1 \cdots X_n}(x_1, \ldots, x_n). \]
We can use these equations to characterize the joint cdf and density of independent random variables. First, suppose $X_1, \ldots, X_n$ are independent. Then
\[ F_X(x) = P(X_1 \le x_1, \ldots, X_n \le x_n) = P(X_1 \le x_1) \cdots P(X_n \le x_n). \]
By taking partial derivatives, we learn that
\[ f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1) \cdots f_{X_n}(x_n). \]
Thus, if the $X_i$ are independent, then the joint density factors. On the other hand, if the joint density factors, then $P(X_1 \in B_1, \ldots, X_n \in B_n)$ has the form
\[ \int_{B_1} \cdots \int_{B_n} f_{X_1}(x_1) \cdots f_{X_n}(x_n)\,dx_1 \cdots dx_n, \]
which factors into the product
\[ \Bigl( \int_{B_1} f_{X_1}(x_1)\,dx_1 \Bigr) \cdots \Bigl( \int_{B_n} f_{X_n}(x_n)\,dx_n \Bigr). \]
Thus, if the joint density factors,
\[ P(X_1 \in B_1, \ldots, X_n \in B_n) = P(X_1 \in B_1) \cdots P(X_n \in B_n), \]
and we see that the $X_i$ are independent.

Sometimes we need to compute probabilities involving only some of the $X_i$. In this case, we can obtain the joint density of the ones we need by integrating out the ones we do not need. For example,
\[ f_{X_i}(x_i) = \int_{\mathbb{R}^{n-1}} f_X(x_1, \ldots, x_n)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_n, \]
and
\[ f_{X_1 X_2}(x_1, x_2) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1 \cdots X_n}(x_1, \ldots, x_n)\,dx_3 \cdots dx_n. \]
We can use these results to compute conditional densities such as
\[ f_{X_3 \cdots X_n | X_1 X_2}(x_3, \ldots, x_n | x_1, x_2) = \frac{f_{X_1 \cdots X_n}(x_1, \ldots, x_n)}{f_{X_1 X_2}(x_1, x_2)}. \]

Example 5.13. Let
\[ f_{XYZ}(x,y,z) = \frac{3z^2}{7\sqrt{2\pi}}\,e^{-zy} \exp\Bigl[ -\frac{1}{2} \Bigl( \frac{x - y}{z} \Bigr)^2 \Bigr], \]
for $y \ge 0$ and $1 \le z \le 2$, and $f_{XYZ}(x,y,z) = 0$ otherwise. Find $f_{YZ}(y,z)$ and $f_{X|YZ}(x|y,z)$. Then find $f_Z(z)$, $f_{Y|Z}(y|z)$, and $f_{XY|Z}(x,y|z)$.

Solution. Observe that the joint density can be written as
\[ f_{XYZ}(x,y,z) = \frac{\exp\Bigl[ -\frac{1}{2} \bigl( \frac{x - y}{z} \bigr)^2 \Bigr]}{\sqrt{2\pi}\,z} \cdot z e^{-zy} \cdot \tfrac{3}{7} z^2. \]
The first factor as a function of $x$ is an $N(y, z^2)$ density. Hence,
\[ f_{YZ}(y,z) = \int_{-\infty}^{\infty} f_{XYZ}(x,y,z)\,dx = z e^{-zy} \cdot \tfrac{3}{7} z^2, \]
and
\[ f_{X|YZ}(x|y,z) = \frac{f_{XYZ}(x,y,z)}{f_{YZ}(y,z)} = \frac{\exp\Bigl[ -\frac{1}{2} \bigl( \frac{x - y}{z} \bigr)^2 \Bigr]}{\sqrt{2\pi}\,z}. \]
Thus, $f_{X|YZ}(\cdot|y,z) \sim N(y, z^2)$. Next, in the above formula for $f_{YZ}(y,z)$, observe that $z e^{-zy}$ as a function of $y$ is an exponential density with parameter $z$. Thus,
\[ f_Z(z) = \int_0^{\infty} f_{YZ}(y,z)\,dy = \tfrac{3}{7} z^2, \quad 1 \le z \le 2. \]
It follows that $f_{Y|Z}(y|z) = f_{YZ}(y,z)/f_Z(z) = z e^{-zy}$; i.e., $f_{Y|Z}(\cdot|z) \sim \exp(z)$. Finally,
\[ f_{XY|Z}(x,y|z) = \frac{f_{XYZ}(x,y,z)}{f_Z(z)} = \frac{\exp\Bigl[ -\frac{1}{2} \bigl( \frac{x - y}{z} \bigr)^2 \Bigr]}{\sqrt{2\pi}\,z} \cdot z e^{-zy}. \]

The preceding example shows that
\[ f_{XYZ}(x,y,z) = f_{X|YZ}(x|y,z)\,f_{Y|Z}(y|z)\,f_Z(z). \]
More generally,
\[ f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1)\,f_{X_2|X_1}(x_2|x_1)\,f_{X_3|X_1 X_2}(x_3|x_1,x_2) \cdots f_{X_n|X_1 \cdots X_{n-1}}(x_n|x_1, \ldots, x_{n-1}). \]
Contemplating this equation for a moment reveals that
\[ f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = f_{X_1 \cdots X_{n-1}}(x_1, \ldots, x_{n-1})\,f_{X_n|X_1 \cdots X_{n-1}}(x_n|x_1, \ldots, x_{n-1}), \]
which is an important recursion that is discussed later.

If $g(x) = g(x_1, \ldots, x_n)$ is a real-valued function of the vector $x = [x_1, \ldots, x_n]'$, then we can compute
\[ E[g(X)] = \int_{\mathbb{R}^n} g(x)\,f_X(x)\,dx, \]
which is shorthand for
\[ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_n)\,f_{X_1 \cdots X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n. \]
The Law of Total Probability
We also have the law of total probability and the substitution law for multiple random variables. Let $U = [U_1, \ldots, U_m]'$ and $V = [V_1, \ldots, V_n]'$ be any two vectors of random variables. The law of total probability tells us that
\[ E[g(U,V)] = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{m \text{ times}} E[g(U,V) | U = u]\,f_U(u)\,du, \]
where, by the substitution law, $E[g(U,V) | U = u] = E[g(u,V) | U = u]$, and
\[ E[g(u,V) | U = u] = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{n \text{ times}} g(u,v)\,f_{V|U}(v|u)\,dv, \]
and
\[ f_{V|U}(v|u) = \frac{f_{UV}(u_1, \ldots, u_m, v_1, \ldots, v_n)}{f_U(u_1, \ldots, u_m)}. \]
When $g$ has a product form, say $g(u,v) = h(u)\,k(v)$, it is easy to see that
\[ E[h(u)\,k(V) | U = u] = h(u)\,E[k(V) | U = u]. \]

Example 5.14. Let $X$, $Y$, and $Z$ be as in Example 5.13. Find $E[X]$ and $E[XZ]$.

Solution. Rather than use the marginal density of $X$ to compute $E[X]$, we use the law of total probability. Write
\[ E[X] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} E[X | Y = y, Z = z]\,f_{YZ}(y,z)\,dy\,dz. \]
Next, from Example 5.13, $f_{X|YZ}(\cdot|y,z) \sim N(y, z^2)$, and so $E[X | Y = y, Z = z] = y$. Thus,
\begin{align*}
E[X] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y\,f_{YZ}(y,z)\,dy\,dz \\
&= \int_{-\infty}^{\infty} \Bigl( \int_{-\infty}^{\infty} y\,f_{Y|Z}(y|z)\,dy \Bigr) f_Z(z)\,dz \\
&= \int_{-\infty}^{\infty} E[Y | Z = z]\,f_Z(z)\,dz.
\end{align*}
From Example 5.13, $f_{Y|Z}(\cdot|z) \sim \exp(z)$ and $f_Z(z) = 3z^2/7$; hence, $E[Y | Z = z] = 1/z$, and we have
\[ E[X] = \int_1^2 \tfrac{3}{7} z\,dz = \tfrac{9}{14}. \]
Now, to find $E[XZ]$, write
\[ E[XZ] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} E[Xz | Y = y, Z = z]\,f_{YZ}(y,z)\,dy\,dz. \]
We then note that $E[Xz | Y = y, Z = z] = E[X | Y = y, Z = z]\,z = yz$. Thus,
\[ E[XZ] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} yz\,f_{YZ}(y,z)\,dy\,dz = E[YZ]. \]
In Problem 34 the reader is asked to show that $E[YZ] = 1$. Thus, $E[XZ] = 1$ as well.
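The chain $f_{XYZ} = f_{X|YZ}\,f_{Y|Z}\,f_Z$ from Example 5.13 can be turned into a sampler, which gives a Monte Carlo check of the values $E[X] = 9/14$ and $E[XZ] = 1$ in Example 5.14. In this sketch, $Z$ is drawn by inverting its cdf $(z^3 - 1)/7$ on $[1,2]$, which is obtained by integrating $f_Z(z) = 3z^2/7$ and is a step added here. The seed, sample size, and names are illustrative.

```python
import random

def sample_xyz(rng):
    """Draw (X, Y, Z) by sampling Z, then Y given Z, then X given (Y, Z)."""
    # Z has density 3 z^2 / 7 on [1, 2]; invert the cdf (z^3 - 1)/7.
    z = (1.0 + 7.0 * rng.random()) ** (1.0 / 3.0)
    # Given Z = z, Y is exponential with parameter z (mean 1/z).
    y = rng.expovariate(z)
    # Given Y = y and Z = z, X is N(y, z^2); gauss takes the standard deviation z.
    x = rng.gauss(y, z)
    return x, y, z

rng = random.Random(3)
n = 400_000
sum_x = 0.0
sum_xz = 0.0
for _ in range(n):
    x, y, z = sample_xyz(rng)
    sum_x += x
    sum_xz += x * z
est_ex = sum_x / n     # estimates E[X]
est_exz = sum_xz / n   # estimates E[XZ]
print(est_ex, 9.0 / 14.0)
print(est_exz, 1.0)
```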
Example 5.15. Let $N$ be a positive, integer-valued random variable, and let $X_1, X_2, \ldots$ be i.i.d. Further assume that $N$ is independent of this sequence. Consider the random sum,
\[ \sum_{i=1}^{N} X_i. \]
Note that the number of terms in the sum is a random variable. Find the mean value of the random sum.

Solution. Use the law of total probability to write
\[ E\Bigl[ \sum_{i=1}^{N} X_i \Bigr] = \sum_{n=1}^{\infty} E\Bigl[ \sum_{i=1}^{n} X_i \Bigm| N = n \Bigr] P(N = n). \]
By independence of $N$ and the $X_i$ sequence,
\[ E\Bigl[ \sum_{i=1}^{n} X_i \Bigm| N = n \Bigr] = E\Bigl[ \sum_{i=1}^{n} X_i \Bigr] = \sum_{i=1}^{n} E[X_i]. \]
Since the $X_i$ are i.i.d., they all have the same mean. In particular, for all $i$, $E[X_i] = E[X_1]$. Thus,
\[ E\Bigl[ \sum_{i=1}^{n} X_i \Bigm| N = n \Bigr] = n\,E[X_1]. \]
Now we can write
\begin{align*}
E\Bigl[ \sum_{i=1}^{N} X_i \Bigr] &= \sum_{n=1}^{\infty} n\,E[X_1]\,P(N = n) \\
&= E[N]\,E[X_1].
\end{align*}
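The identity $E\bigl[\sum_{i=1}^{N} X_i\bigr] = E[N]\,E[X_1]$ can be illustrated by simulation. The specific distributions below are assumptions chosen for the sketch, not part of the example: $N \sim \text{geometric}_1(p)$ with $P(N = k) = (1-p)p^{k-1}$ and mean $1/(1-p)$, and $X_i \sim \exp(\lambda)$ with mean $1/\lambda$. The parameter values, seed, and function names are likewise illustrative.

```python
import random

def sample_geometric1(p, rng):
    """P(N = k) = (1 - p) p^(k-1) for k = 1, 2, ...; the mean is 1/(1 - p)."""
    n = 1
    while rng.random() < p:
        n += 1
    return n

def random_sum(p, lam, rng):
    """Sum of N i.i.d. exp(lam) variables, with N drawn independently of them."""
    n = sample_geometric1(p, rng)
    return sum(rng.expovariate(lam) for _ in range(n))

rng = random.Random(4)
p, lam = 0.6, 2.0
trials = 200_000
est = sum(random_sum(p, lam, rng) for _ in range(trials)) / trials
expected = (1.0 / (1.0 - p)) * (1.0 / lam)   # E[N] * E[X_1]
print(est, expected)
```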
5.6. Notes
Notes §5.1: Joint and Marginal Probabilities

Note 1. Comments analogous to Note 1 in Chapter 2 apply here. Specifically, the set $A$ must be restricted to a suitable $\sigma$-field $\mathcal{B}$ of subsets of $\mathbb{R}^2$. Typically, $\mathcal{B}$ is taken to be the collection of Borel sets of $\mathbb{R}^2$; i.e., $\mathcal{B}$ is the smallest $\sigma$-field containing all the open sets of $\mathbb{R}^2$.

Note 2. We now derive the limit formula for $F_X(x)$ in (5.1); the formula for $F_Y(y)$ can be derived similarly. To begin, write
\[ F_X(x) := P(X \le x) = P((X,Y) \in (-\infty,x] \times \mathbb{R}). \]
Next, observe that $\mathbb{R} = \bigcup_{n=1}^{\infty} (-\infty,n]$, and write
\begin{align*}
(-\infty,x] \times \mathbb{R} &= (-\infty,x] \times \bigcup_{n=1}^{\infty} (-\infty,n] \\
&= \bigcup_{n=1}^{\infty} (-\infty,x] \times (-\infty,n].
\end{align*}
Since the union is increasing, we can use the limit property (1.4) to show that
\begin{align*}
F_X(x) &= P\Bigl( (X,Y) \in \bigcup_{n=1}^{\infty} (-\infty,x] \times (-\infty,n] \Bigr) \\
&= \lim_{N \to \infty} P((X,Y) \in (-\infty,x] \times (-\infty,N]) \\
&= \lim_{N \to \infty} F_{XY}(x,N).
\end{align*}

Notes §5.3: Conditional Probability and Expectation

Note 3. To show that the law of substitution holds for conditional probability, write
\[ P(g(X,Y) \in C) = E[I_C(g(X,Y))] = \int_{-\infty}^{\infty} E[I_C(g(X,Y)) | X = x]\,f_X(x)\,dx \]
and reduce the problem to one involving conditional expectation, for which the law of substitution has already been established.
5.7. Problems
Problems §5.1: Joint and Marginal Distributions

1. For $a < b$ and $c < d$, sketch the following sets.
(a) $R := (a,b] \times (c,d]$.
(b) $A := (-\infty,a] \times (-\infty,d]$.
(c) $B := (-\infty,b] \times (-\infty,c]$.
(d) $C := (a,b] \times (-\infty,c]$.
(e) $D := (-\infty,a] \times (c,d]$.
(f) $A \cap B$.

2. Show that $P(a < X \le b, c < Y \le d)$ is given by
\[ F_{XY}(b,d) - F_{XY}(a,d) - F_{XY}(b,c) + F_{XY}(a,c). \]
Hint: Using the notation of the preceding problem, observe that
\[ (-\infty,b] \times (-\infty,d] = R \cup (A \cup B), \]
and solve for $P((X,Y) \in R)$.

3. The joint density in Example 5.4 was obtained by differentiating $F_{XY}(x,y)$ first with respect to $x$ and then with respect to $y$. In this problem, find the joint density by differentiating first with respect to $y$ and then with respect to $x$.

4. Find the marginals $F_X(x)$ and $F_Y(y)$ if
\[ F_{XY}(x,y) = \begin{cases} x - 1 - \dfrac{e^{-y} - e^{-xy}}{y}, & 1 \le x \le 2,\ y > 0, \\[1ex] 1 - \dfrac{e^{-y} - e^{-2y}}{y}, & x > 2,\ y > 0, \\[1ex] 0, & \text{otherwise.} \end{cases} \]

5. Find the marginals $F_X(x)$ and $F_Y(y)$ if
\[ F_{XY}(x,y) = \begin{cases} \dfrac{2}{7}(1 - e^{-2y}), & 2 \le x < 3,\ y \ge 0, \\[1ex] \dfrac{7 - 2e^{-2y} - 5e^{-3y}}{7}, & x \ge 3,\ y \ge 0, \\[1ex] 0, & \text{otherwise.} \end{cases} \]

Problems §5.2: Jointly Continuous Random Variables

6. Find the marginal density $f_X(x)$ if
\[ f_{XY}(x,y) = \frac{\exp[-|y - x| - x^2/2]}{2\sqrt{2\pi}}. \]

7. Find the marginal density $f_Y(y)$ if
\[ f_{XY}(x,y) = \frac{4 e^{-(x-y)^2/2}}{y^5 \sqrt{2\pi}}, \quad y \ge 1. \]

8. Let $X$ and $Y$ have joint density $f_{XY}(x,y)$. Find the marginal cdf and density of $\max(X,Y)$ and of $\min(X,Y)$. How do your results simplify if $X$ and $Y$ are independent? What if you further assume that the densities of $X$ and $Y$ are the same?

9. Let $X$ and $Y$ be independent gamma random variables with positive parameters $p$ and $q$, respectively. Find the density of $Z := X + Y$. Then compute $P(Z > 1)$ if $p = q = 1/2$.

10. Find the density of $Z := X + Y$, where $X$ and $Y$ are independent Cauchy random variables with parameters $\lambda$ and $\mu$, respectively. Then compute $P(Z \le 1)$ if $\lambda = \mu = 1/2$.
⋆11. If $X \sim N(0,1)$, then the complementary cumulative distribution function (ccdf) of $X$ is
\[ Q(x_0) := P(X > x_0) = \int_{x_0}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx. \]
(a) Show that
\[ Q(x_0) = \frac{1}{\pi} \int_0^{\pi/2} \exp\Bigl( \frac{-x_0^2}{2\cos^2\theta} \Bigr) d\theta, \quad x_0 \ge 0. \]
Hint: For any random variables $X$ and $Y$, we can always write
\[ P(X > x_0) = P(X > x_0, Y \in \mathbb{R}) = P((X,Y) \in D), \]
where $D$ is the half plane $D := \{(x,y) : x > x_0\}$. Now specialize to the case where $X$ and $Y$ are independent and both $N(0,1)$. Then the probability on the right is a double integral that can be evaluated using polar coordinates.

Remark. The procedure outlined in the hint is a generalization of that used in Section 3.1 to show that the standard normal density integrates to one. To see this, note that if $x_0 = -\infty$, then $D = \mathbb{R}^2$.

(b) Use the result of (a) to derive Craig's formula [10, p. 572, eq. (9)],
\[ Q(x_0) = \frac{1}{\pi} \int_0^{\pi/2} \exp\Bigl( \frac{-x_0^2}{2\sin^2 t} \Bigr) dt, \quad x_0 \ge 0. \]
Remark. Simon [41] has derived a similar result for the Marcum $Q$ function (defined in Problem 17 in Chapter 4) and its higher-order generalizations. See also [43, pp. 1865-1867].
Problems §5.3: Conditional Probability and Expectation

12. Let $f_{XY}(x,y)$ be as derived in Example 5.4, and note that $f_X(x)$ and $f_Y(y)$ were found in Example 5.5. Find $f_{Y|X}(y|x)$ and $f_{X|Y}(x|y)$ for $x, y > 0$.

13. Let $f_{XY}(x,y)$ be as derived in Example 5.4, and note that $f_X(x)$ and $f_Y(y)$ were found in Example 5.5. Compute $E[Y | X = x]$ for $x > 0$ and $E[X | Y = y]$ for $y > 0$.

14. Let $X$ and $Y$ be jointly continuous. Show that if
\[ P(Y \in C | X = x) := \int_C f_{Y|X}(y|x)\,dy, \]
then
\[ P(Y \in C) = \int_{-\infty}^{\infty} P(Y \in C | X = x)\,f_X(x)\,dx. \]

15. Find $P(X \le Y)$ if $X$ and $Y$ are independent with $X \sim \exp(\lambda)$ and $Y \sim \exp(\mu)$.

16. Let $X$ and $Y$ be independent random variables with $Y$ being exponential with parameter 1 and $X$ being uniform on $[1,2]$. Find $P(Y/\ln(1 + X^2) > 1)$.

17. Let $X$ and $Y$ be jointly continuous random variables with joint density $f_{XY}$. Find $f_Z(z)$ if
(a) $Z = e^X Y$.
(b) $Z = |X + Y|$.

18. Let $X$ and $Y$ be independent continuous random variables with respective densities $f_X$ and $f_Y$. Put $Z = Y/X$.
(a) Find the density of $Z$. Hint: Review Example 5.9.
(b) If $X$ and $Y$ are both $N(0,\sigma^2)$, show that $Z$ has a Cauchy(1) density that does not depend on $\sigma^2$.
(c) If $X$ and $Y$ are both Laplace($\lambda$), find a closed-form expression for $f_Z(z)$ that does not depend on $\lambda$.
(d) Find a closed-form expression for the density of $Z$ if $Y$ is uniform on $[-1,1]$ and $X \sim N(0,1)$.
(e) If $X$ and $Y$ are both Rayleigh random variables with parameter $\lambda$, find a closed-form expression for the density of $Z$. Your answer should not depend on $\lambda$.

19. Let $X$ and $Y$ be independent with densities $f_X(x)$ and $f_Y(y)$. If $X$ is a positive random variable, and if $Z = Y/\ln(X)$, find the density of $Z$.

20. Let $Y \sim \exp(\lambda)$, and suppose that given $Y = y$, $X \sim \text{gamma}(p,y)$. Assuming $r > n$, evaluate $E[X^n Y^r]$.

21. Use the law of total probability to solve the following problems.
(a) Evaluate $E[\cos(X + Y)]$ if given $X = x$, $Y$ is conditionally uniform on $[x - \pi, x + \pi]$.
(b) Evaluate $P(Y > y)$ if $X \sim \text{uniform}[1,2]$, and given $X = x$, $Y$ is exponential with parameter $x$.
(c) Evaluate $E[X e^Y]$ if $X \sim \text{uniform}[3,7]$, and given $X = x$, $Y \sim N(0,x^2)$.
(d) Let $X \sim \text{uniform}[1,2]$, and suppose that given $X = x$, $Y \sim N(0,1/x)$. Evaluate $E[\cos(XY)]$.

22. Find $E[X^n Y^m]$ if $Y \sim \exp(\lambda)$, and given $Y = y$, $X \sim \text{Rayleigh}(y)$.

⋆23. Let $X \sim \text{gamma}(p,\lambda)$ and $Y \sim \text{gamma}(q,\lambda)$ be independent.
(a) If $Z := X/Y$, show that the density of $Z$ is
\[ f_Z(z) = \frac{1}{B(p,q)} \cdot \frac{z^{p-1}}{(1 + z)^{p+q}}, \quad z > 0. \]
Observe that $f_Z(z)$ depends on $p$ and $q$, but not on $\lambda$. It was shown in Problem 19 in Chapter 3 that $f_Z(z)$ integrates to one. Hint: You will need the fact that $B(p,q) = \Gamma(p)\Gamma(q)/\Gamma(p+q)$, which was shown in Problem 13 in Chapter 3.
(b) Show that $V := X/(X + Y)$ has a beta density with parameters $p$ and $q$. Hint: Observe that $V = Z/(1 + Z)$, where $Z = X/Y$ as above.
Remark. If $W := (X/p)/(Y/q)$, then $f_W(w) = (p/q)\,f_Z(w(p/q))$. If further $p = k_1/2$ and $q = k_2/2$, then $W$ is said to be an $F$ random variable with $k_1$ and $k_2$ degrees of freedom. If further $\lambda = 1/2$, then $X$ and $Y$ are chi-squared with $k_1$ and $k_2$ degrees of freedom, respectively.

⋆24. Let $X$ and $Y$ be independent with $X \sim N(0,1)$ and $Y$ being chi-squared with $k$ degrees of freedom. Show that $Z := X/\sqrt{Y/k}$ has a student's $t$ density with $k$ degrees of freedom. Hint: For this problem, it may be helpful to review the results of Problems 11-13 and 17 in Chapter 3.
Problems §5.4: The Bivariate Normal

25. Let $U$ and $V$ have the joint Gaussian density in (5.11). Show that for all $\rho$ with $-1 < \rho < 1$, $U$ and $V$ both have standard univariate $N(0,1)$ marginal densities that do not involve $\rho$.

26. Let $X$ and $Y$ be jointly Gaussian with density $f_{XY}(x,y)$ given by (5.13). Find $f_X(x)$, $f_Y(y)$, $f_{X|Y}(x|y)$, and $f_{Y|X}(y|x)$.

27. Let $X$ and $Y$ be jointly Gaussian with density $f_{XY}(x,y)$ given by (5.13). Find $E[Y | X = x]$ and $E[X | Y = y]$.

28. Let $X$ and $Y$ be jointly normal with parameters $m_X$, $m_Y$, $\sigma_X^2$, $\sigma_Y^2$, and $\rho$. Compute $E[X]$, $E[X^2]$, and $E[XY]$.

⋆29. Let $\varphi_\rho$ be the standard bivariate normal density defined in (5.11). Put
\[ f_{UV}(u,v) := \tfrac{1}{2} [\varphi_{\rho_1}(u,v) + \varphi_{\rho_2}(u,v)], \]
where $-1 < \rho_1 \ne \rho_2 < 1$.
(a) Show that the marginals $f_U$ and $f_V$ are both $N(0,1)$. (You may use the results of Problem 25.)
(b) Show that $\rho := E[UV] = (\rho_1 + \rho_2)/2$. (You may use the result of Example 5.10.)
(c) Show that $U$ and $V$ cannot be jointly normal. Hints: (i) To obtain a contradiction, suppose that $f_{UV}$ is a jointly normal density with parameters given by parts (a) and (b). (ii) Consider $f_{UV}(u,u)$. (iii) Use the following fact: If $\beta_1, \ldots, \beta_n$ are distinct real numbers, and if
\[ \sum_{k=1}^{n} \alpha_k e^{\beta_k t} = 0, \quad\text{for all } t \ge 0, \]
then $\alpha_1 = \cdots = \alpha_n = 0$.

⋆30. Let $U$ and $V$ be jointly normal with joint density $\varphi_\rho(u,v)$ defined in (5.11). Put
\[ Q_\rho(u_0,v_0) := P(U > u_0, V > v_0). \]
Show that for $u_0, v_0 \ge 0$,
\[ Q_\rho(u_0,v_0) = \int_0^{\tan^{-1}(u_0/v_0)} h_\rho(u_0^2,\theta)\,d\theta + \int_0^{\pi/2 - \tan^{-1}(u_0/v_0)} h_\rho(v_0^2,\theta)\,d\theta, \]
where
\[ h_\rho(z,\theta) := \frac{\sqrt{1 - \rho^2}}{2\pi(1 - \rho\sin 2\theta)} \exp\Bigl[ \frac{-z(1 - \rho\sin 2\theta)}{2(1 - \rho^2)\sin^2\theta} \Bigr]. \]
This formula for $Q_\rho(u_0,v_0)$ is Simon's bivariate generalization [43, pp. 1864-1865] of Craig's univariate formula given in Problem 11. Hint: Write $P(U > u_0, V > v_0)$ as a double integral and convert to polar coordinates. It may be helpful to review your solution of Problem 11 first.

⋆31. Use Simon's formula (Problem 30) to show that
\[ Q(x_0)^2 = \frac{1}{\pi} \int_0^{\pi/4} \exp\Bigl( \frac{-x_0^2}{2\sin^2 t} \Bigr) dt. \]
In other words, to compute $Q(x_0)^2$, we integrate Craig's integrand (Problem 11) only half as far [42, p. 210]!
Problems x5.5: ? Multivariate Random Variables ? 32.
If
2 fXY Z (x; y; z) = 2 exp[,jx ,5ypj , (y , z) =2] ; z 1; z 2 and fXY Z (x; y; z) = 0 otherwise, nd fY Z (y; z), fX jY Z (xjy; z), fZ (z), and fY jZ (yjz).
July 18, 2002
184 ? 33.
? 34. ? 35. ? 36. ? 37.
Chap. 5 Multiple Random Variables
Let
,(x,y)2 =2e,(y,z)2 =2e,z2 =2 : fXY Z (x; y; z) = e (2)3=2
Find fXY (x; y). Then nd the means and variances of X and Y . Also nd the correlation, E[XY ]. Let X, Y , and Z be as in Example 5.13. Evaluate E[XY ] and E[Y Z]. Let X, Y , and Z be as in Problem 32. Evaluate E[XY Z]. Let X, Y , and Z be jointly continuous. Assume that X uniform[1; 2]; that given X = x, Y exp(1=x); and that given X = x and Y = y, Z is N(x; 1). Find E[XY Z]. Let N denote the number of primaries in a photomultiplier, and let Xi be the number of secondaries due to the ith primary. Then the total number of secondaries is N X Y = Xi : i=1
Express the characteristic function of Y in terms of the probability generating function of N, GN (z), and the characteristic function of the Xi , assuming that the Xi are i.i.d. with common characteristic function 'X (). Assume that N is independent of the Xi sequence. Find the density of Y if N geometric1 (p) and Xi exp().
CHAPTER 6
Introduction to Random Processes

A random process or stochastic process is any family of random variables. For example, consider the experiment consisting of three coin tosses. A suitable sample space would be
$$\Omega := \{\text{TTT, TTH, THT, HTT, THH, HTH, HHT, HHH}\}.$$
On this sample space, we can define several random variables. For $n = 1, 2, 3$, put
$$X_n(\omega) := \begin{cases} 0, & \text{if the $n$th component of $\omega$ is T},\\ 1, & \text{if the $n$th component of $\omega$ is H}. \end{cases}$$
Then, for example, $X_1(\text{THH}) = 0$, $X_2(\text{THH}) = 1$, and $X_3(\text{THH}) = 1$. For fixed $\omega$, we can plot the sequence $X_1(\omega)$, $X_2(\omega)$, $X_3(\omega)$, called the sample path corresponding to $\omega$. This is done for $\omega = \text{THH}$ and for $\omega = \text{HTH}$ in Figure 6.1. These two graphs illustrate the important fact that every time the sample point $\omega$ changes, the sample path also changes.

[Figure 6.1: two panels, $X_n(\text{THH})$ and $X_n(\text{HTH})$, each plotted against $n = 1, 2, 3$.]
Figure 6.1. Sample paths $X_n(\omega)$ for $\omega = \text{THH}$ at the left and for $\omega = \text{HTH}$ at the right.
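The coin-toss construction can be reproduced in a few lines. The following is a minimal Python sketch, not part of the original text; the names `OMEGA`, `X`, and `sample_path` are chosen here only for illustration. It enumerates the sample space and evaluates the sample paths for the two sample points discussed above.

```python
from itertools import product

# Sample space for three coin tosses: 'TTT', 'TTH', ..., 'HHH'.
OMEGA = ["".join(w) for w in product("TH", repeat=3)]

def X(n, omega):
    """X_n(omega): 0 if the nth toss (1-based) is T, 1 if it is H."""
    return 1 if omega[n - 1] == "H" else 0

def sample_path(omega):
    """The sequence X_1(omega), X_2(omega), X_3(omega) for a fixed omega."""
    return [X(n, omega) for n in range(1, len(omega) + 1)]

if __name__ == "__main__":
    for omega in ("THH", "HTH"):
        print(omega, sample_path(omega))
```

Fixing `n` and varying `omega` gives a random variable; fixing `omega` and varying `n` gives a sample path. These are the two viewpoints on a random process.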
In general, a random process $X_n$ may be defined over any range of integers $n$, which we usually think of as discrete time. In this case, we can usually plot only a portion of a sample path as in Figure 6.2. Notice that there is no requirement that $X_n$ be integer valued. For example, in Figure 6.2, $X_0(\omega) = 0.5$ and $X_1(\omega) = 2.5$.

It is also useful to allow continuous-time random processes $X_t$ where $t$ can be any real number, not just an integer. Thus, for each $t$, $X_t(\omega)$ is a random variable, or function of $\omega$. However, as in the discrete-time case, we can fix $\omega$ and allow the time parameter $t$ to vary. In this case, for fixed $\omega$, if we plot the sample path $X_t(\omega)$ as a function of $t$, we get a curve instead of a sequence as shown in Figure 6.3.
[Figure 6.2: $X_n(\omega)$ plotted against $n$ for $n = -6, \dots, 6$.]
Figure 6.2. A portion of a sample path $X_n(\omega)$ of a discrete-time random process. Because there are more dots than in the previous figure, the dashed line has been added to more clearly show the ordering of the dots.
[Figure 6.3: $X_t(\omega)$ plotted against $t$ for $0 \le t \le 1$.]
Figure 6.3. Sample path $X_t(\omega)$ of a continuous-time random process with continuous but very wiggly sample paths.
[Figure 6.4: $X_t(\omega)$ plotted against $t$ for $0 \le t \le 1$.]
Figure 6.4. Sample path $X_t(\omega)$ of a continuous-time random process with very wiggly sample paths and jump discontinuities.
[Figure 6.5: $N_t$ plotted against $t$, taking the values 1 through 5 with jumps at the times $T_1, \dots, T_5$.]
Figure 6.5. Sample path $X_t(\omega)$ of a continuous-time random process with sample paths that are piecewise constant with jumps at random times.
Figure 6.3 shows a very wiggly but continuous waveform. However, wiggly waveforms with jump discontinuities are also possible as shown in Figure 6.4. Figures 6.3 and 6.4 are what we usually think of when we think of a random process as a model for noise waveforms. However, random waveforms do not have to be wiggly. They can be very regular as in Figure 6.5.

The main point of the foregoing pictures and discussion is to emphasize that there are two ways to think about a random process. The first way is to think of $X_n$ or $X_t$ as a family of random variables, which by definition, is just a family of functions of $\omega$. The second way is to think of $X_n$ or $X_t$ as a random waveform in discrete or continuous time, respectively.

Section 6.1 introduces the mean function, the correlation function, and the covariance function of a random process. Section 6.2 introduces the concept of a stationary process and of a wide-sense stationary process. The correlation function and the power spectral density of wide-sense stationary processes are introduced and their properties are derived. Section 6.3 is the heart of the chapter, and covers the analysis of wide-sense stationary processes through linear, time-invariant systems. These results are then applied in Sections 6.4 and 6.5 to derive the matched filter and the Wiener filter. Section 6.6 contains a discussion of the expected time-average power in a wide-sense stationary random process. The Wiener–Khinchin Theorem is derived and used to provide an alternative expression for the power spectral density. As an easy corollary of our derivation of the Wiener–Khinchin Theorem, we obtain the mean-square ergodic theorem for wide-sense stationary processes. Section 6.7 briefly extends the notion of power spectral density to processes that are not wide-sense stationary.
6.1. Mean, Correlation, and Covariance
If $X_t$ is a random process, its mean function is
$$m_X(t) := E[X_t].$$
Its correlation function is
$$R_X(t_1,t_2) := E[X_{t_1}X_{t_2}].$$

Example 6.1. In modeling a communication system, the carrier signal at the receiver is modeled by $X_t = \cos(2\pi f t + \Theta)$, where $\Theta \sim \text{uniform}[-\pi,\pi]$. The reason for the random phase is that the receiver does not know the exact time when the transmitter was turned on. Find the mean function and the correlation function of $X_t$.

Solution. For the mean, write
$$E[X_t] = E[\cos(2\pi f t + \Theta)] = \int_{-\infty}^{\infty} \cos(2\pi f t + \theta)\, f_\Theta(\theta)\, d\theta = \int_{-\pi}^{\pi} \cos(2\pi f t + \theta)\, \frac{d\theta}{2\pi}.$$
Be careful to observe that this last integral is with respect to $\theta$, not $t$. As a function of $\theta$, the integrand is a cosine integrated over one full period. Hence, this integral evaluates to zero. For the correlation, write
$$R_X(t_1,t_2) = E[X_{t_1}X_{t_2}] = E[\cos(2\pi f t_1 + \Theta)\cos(2\pi f t_2 + \Theta)] = \tfrac{1}{2} E[\cos(2\pi f[t_1+t_2] + 2\Theta) + \cos(2\pi f[t_1 - t_2])].$$
The first cosine has expected value zero just as the mean did. The second cosine is nonrandom, and equal to its expected value. Thus, $R_X(t_1,t_2) = \cos(2\pi f[t_1 - t_2])/2$.

Correlation functions have special properties. First,
$$R_X(t_1,t_2) = E[X_{t_1}X_{t_2}] = E[X_{t_2}X_{t_1}] = R_X(t_2,t_1).$$
In other words, the correlation function is a symmetric function of $t_1$ and $t_2$. Next, observe that $R_X(t,t) = E[X_t^2] \ge 0$, and for any $t_1$ and $t_2$,
$$|R_X(t_1,t_2)| \le \sqrt{E[X_{t_1}^2]\, E[X_{t_2}^2]}. \qquad (6.1)$$
This is an easy consequence of the Cauchy–Schwarz inequality (see Problem 1), which says that for any random variables $U$ and $V$,
$$E[UV]^2 \le E[U^2]\, E[V^2].$$
Taking $U = X_{t_1}$ and $V = X_{t_2}$ yields the desired result.

The covariance function is
$$C_X(t_1,t_2) := E\big[(X_{t_1} - E[X_{t_1}])(X_{t_2} - E[X_{t_2}])\big].$$
An easy calculation shows that
$$C_X(t_1,t_2) = R_X(t_1,t_2) - m_X(t_1)\,m_X(t_2).$$
Note that the covariance function is also symmetric; i.e., $C_X(t_1,t_2) = C_X(t_2,t_1)$. Since we usually assume that our processes are zero mean; i.e., $m_X(t) \equiv 0$, we focus on the correlation function and its properties.
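The mean and correlation found in Example 6.1 can be checked by simulation. The following is a rough Monte Carlo sketch in Python, not part of the original text; the function names, the trial count, and the seed are assumptions made here for illustration.

```python
import math
import random

def simulate_carrier_moments(f, t1, t2, trials=200_000, seed=0):
    """Monte Carlo estimates of E[X_t1] and E[X_t1 X_t2] for
    X_t = cos(2*pi*f*t + Theta) with Theta ~ uniform[-pi, pi]."""
    rng = random.Random(seed)
    mean_sum = 0.0
    corr_sum = 0.0
    for _ in range(trials):
        theta = rng.uniform(-math.pi, math.pi)   # one draw of the random phase
        x1 = math.cos(2 * math.pi * f * t1 + theta)
        x2 = math.cos(2 * math.pi * f * t2 + theta)
        mean_sum += x1
        corr_sum += x1 * x2
    return mean_sum / trials, corr_sum / trials

def exact_correlation(f, t1, t2):
    """R_X(t1, t2) = cos(2*pi*f*[t1 - t2]) / 2 from Example 6.1."""
    return math.cos(2 * math.pi * f * (t1 - t2)) / 2
```

For instance, `simulate_carrier_moments(1.0, 0.3, 0.1)` should return a mean estimate near zero and a correlation estimate near `exact_correlation(1.0, 0.3, 0.1)`, up to sampling error. Note that the same phase draw is used for both time instants, since both belong to the same sample path.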
6.2. Wide-Sense Stationary Processes
Strict-Sense Stationarity

A random process is strictly stationary if for any finite collection of times $t_1, \dots, t_n$, all joint probabilities involving $X_{t_1}, \dots, X_{t_n}$ are the same as those involving $X_{t_1+\Delta t}, \dots, X_{t_n+\Delta t}$ for any positive or negative time shift $\Delta t$. For discrete-time processes, this is equivalent to requiring that joint probabilities involving $X_1, \dots, X_n$ are the same as those involving $X_{1+m}, \dots, X_{n+m}$ for any integer $m$.
Example 6.2. Consider a discrete-time process of integer-valued random variables $X_n$ such that
$$\mathsf{P}(X_1 = i_1, \dots, X_n = i_n) = q(i_1)\, r(i_2|i_1)\, r(i_3|i_2) \cdots r(i_n|i_{n-1}),$$
where $q(i)$ is any pmf, and for each $i$, $r(j|i)$ is a pmf in the variable $j$; i.e., $r$ is any conditional pmf. Show that if $q$ has the property
$$\sum_k q(k)\, r(j|k) = q(j), \qquad (6.2)$$
then $X_n$ is strictly stationary for positive time shifts $m$.

Solution. Observe that
$$\mathsf{P}(X_1 = j_1, \dots, X_m = j_m, X_{1+m} = i_1, \dots, X_{n+m} = i_n)$$
is equal to
$$q(j_1)\, r(j_2|j_1)\, r(j_3|j_2) \cdots r(i_1|j_m)\, r(i_2|i_1) \cdots r(i_n|i_{n-1}).$$
Summing both expressions over $j_1$ and using (6.2) shows that
$$\mathsf{P}(X_2 = j_2, \dots, X_m = j_m, X_{1+m} = i_1, \dots, X_{n+m} = i_n)$$
is equal to
$$q(j_2)\, r(j_3|j_2) \cdots r(i_1|j_m)\, r(i_2|i_1) \cdots r(i_n|i_{n-1}).$$
Continuing in this way, summing over $j_2, \dots, j_m$, shows that
$$\mathsf{P}(X_{1+m} = i_1, \dots, X_{n+m} = i_n) = q(i_1)\, r(i_2|i_1) \cdots r(i_n|i_{n-1}),$$
which was the definition of $\mathsf{P}(X_1 = i_1, \dots, X_n = i_n)$.
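The argument of Example 6.2 can be checked by brute force on a small case. The sketch below is an illustration only: the two-state conditional pmf `r` and the pmf `q` are hypothetical numbers chosen here so that (6.2) holds, and the function names are not from the text.

```python
from itertools import product

# Hypothetical two-state example: r[i][j] plays the role of r(j|i).
r = [[0.9, 0.1],
     [0.3, 0.7]]
# q is chosen so that sum_k q(k) r(j|k) = q(j), i.e., property (6.2).
q = [0.75, 0.25]

def joint_pmf(path):
    """P(X_1 = i_1, ..., X_n = i_n) = q(i_1) r(i_2|i_1) ... r(i_n|i_{n-1})."""
    p = q[path[0]]
    for prev, cur in zip(path, path[1:]):
        p *= r[prev][cur]
    return p

def shifted_pmf(path, m):
    """P(X_{1+m} = i_1, ..., X_{n+m} = i_n), obtained by summing the joint
    pmf over all values j_1, ..., j_m of X_1, ..., X_m."""
    states = range(len(q))
    return sum(joint_pmf(prefix + tuple(path))
               for prefix in product(states, repeat=m))
```

With these numbers, `shifted_pmf(path, m)` agrees with `joint_pmf(tuple(path))` for every short path and shift, which is the strict stationarity claimed in the example. Replacing `q` by a pmf that violates (6.2), such as `[0.5, 0.5]`, breaks the agreement.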
Strict stationarity is a very strong property with many implications. Even taking $n = 1$ in the definition tells us that for any $t_1$ and $t_1 + \Delta t$, $X_{t_1}$ and $X_{t_1+\Delta t}$ have the same pmf or density. It then follows that for any function $g(x)$, $E[g(X_{t_1})] = E[g(X_{t_1+\Delta t})]$. Taking $\Delta t = -t_1$ shows that $E[g(X_{t_1})] = E[g(X_0)]$, which does not depend on $t_1$. Stationarity also implies that for any function $g(x_1,x_2)$, we have
$$E[g(X_{t_1}, X_{t_2})] = E[g(X_{t_1+\Delta t}, X_{t_2+\Delta t})]$$
for every time shift $\Delta t$. Since $\Delta t$ is arbitrary, let $\Delta t = -t_2$. Then
$$E[g(X_{t_1}, X_{t_2})] = E[g(X_{t_1-t_2}, X_0)].$$
(Footnote to Example 6.2.) In Chapter 9, we will see that $X_n$ is a time-homogeneous Markov chain. The condition on $q$ is that it be the equilibrium distribution of the chain.
It follows that $E[g(X_{t_1}, X_{t_2})]$ depends on $t_1$ and $t_2$ only through the time difference $t_1 - t_2$.

Asking that all the joint probabilities involving any $X_{t_1}, \dots, X_{t_n}$ be the same as those involving $X_{t_1+\Delta t}, \dots, X_{t_n+\Delta t}$ for every time shift $\Delta t$ is a very strong requirement. In practice, e.g., analyzing receiver noise in a communication system, it is often enough to require only that $E[X_t]$ not depend on $t$ and that the correlation $R_X(t_1,t_2) = E[X_{t_1}X_{t_2}]$ depend on $t_1$ and $t_2$ only through the time difference, $t_1 - t_2$. This is a much weaker requirement than stationarity for several reasons. First, we are only concerned with one or two variables at a time rather than arbitrary finite collections. Second, we are not concerned with probabilities, only expectations. Third, we are only concerned with $E[X_t]$ and $E[X_{t_1}X_{t_2}]$ rather than $E[g(X_t)]$ and $E[g(X_{t_1}, X_{t_2})]$ for arbitrary functions $g$.
Wide-Sense Stationarity
We say that a process is wide-sense stationary (WSS) if the following two properties both hold:
(i) The mean function $m_X(t) = E[X_t]$ does not depend on $t$.
(ii) The correlation function $R_X(t_1,t_2) = E[X_{t_1}X_{t_2}]$ depends on $t_1$ and $t_2$ only through the time difference $t_1 - t_2$.

The process of Example 6.1 was WSS since we showed that $E[X_t] = 0$ and $R_X(t_1,t_2) = \cos(2\pi f[t_1 - t_2])/2$.

Once we know a process is WSS, we write $R_X(t_1 - t_2)$ instead of $R_X(t_1,t_2)$. We can also write
$$R_X(\tau) = E[X_{t+\tau}X_t], \quad \text{for any choice of } t.$$
In particular, note that
$$R_X(0) = E[X_t^2] \ge 0, \quad \text{for all } t.$$
At time $t$, the instantaneous power in a WSS process is $X_t^2$.† The expected instantaneous power is then
$$P_X := E[X_t^2] = R_X(0),$$
which does not depend on $t$, since the process is WSS.

It is now convenient to introduce the Fourier transform of $R_X(\tau)$,
$$S_X(f) := \int_{-\infty}^{\infty} R_X(\tau)\, e^{-j2\pi f\tau}\, d\tau.$$
We call $S_X(f)$ the power spectral density of the process. Observe that by the Fourier inversion formula,
$$R_X(\tau) = \int_{-\infty}^{\infty} S_X(f)\, e^{j2\pi f\tau}\, df.$$
In particular, taking $\tau = 0$ shows that
$$\int_{-\infty}^{\infty} S_X(f)\, df = R_X(0) = P_X.$$

† If $X_t$ is the voltage across a one-ohm resistor or the current passing through a one-ohm resistor, then $X_t^2$ is the instantaneous power dissipated.
To explain the terminology, "power spectral density," recall that probability densities are nonnegative functions that are integrated to obtain probabilities. Similarly, power spectral densities are nonnegative functions‡ that are integrated to obtain powers. The adjective "spectral" refers to the fact that $S_X(f)$ is a function of frequency.

Example 6.3. Let $X_t$ be a WSS random process with power spectral density $S_X(f) = e^{-f^2/2}$. Find the power in the process.

Solution. The power is
$$P_X = \int_{-\infty}^{\infty} S_X(f)\, df = \int_{-\infty}^{\infty} e^{-f^2/2}\, df = \sqrt{2\pi} \int_{-\infty}^{\infty} \frac{e^{-f^2/2}}{\sqrt{2\pi}}\, df = \sqrt{2\pi}.$$
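The value $\sqrt{2\pi}$ in Example 6.3 can be confirmed numerically. The sketch below uses a plain trapezoidal rule; the helper names and the truncation of the integral to $[-10, 10]$ are assumptions made here, justified by the negligible tails of the Gaussian.

```python
import math

def integrate(func, a, b, n=20_000):
    """Composite trapezoidal rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for k in range(1, n):
        total += func(a + k * h)
    return total * h

def gaussian_psd(f):
    """S_X(f) = exp(-f^2 / 2) from Example 6.3."""
    return math.exp(-f * f / 2)

# Assumption: truncating to [-10, 10]; the tails beyond this are negligible.
power = integrate(gaussian_psd, -10.0, 10.0)
```

The value of `power` should agree with `math.sqrt(2 * math.pi)` to many decimal places, which illustrates that the power is the area under the power spectral density.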
Example 6.4. Let $X_t$ be a WSS process with correlation function $R_X(\tau) = 5\sin(\tau)/\tau$. Find the power in the process.

Solution. The power is
$$P_X = R_X(0) = \lim_{\tau \to 0} R_X(\tau) = \lim_{\tau \to 0} 5\,\frac{\sin(\tau)}{\tau} = 5.$$

Example 6.5. Let $X_t$ be a WSS random process with ideal lowpass power spectral density $S_X(f) = I_{[-W,W]}(f)$ shown in Figure 6.6. Find its correlation function $R_X(\tau)$.

Solution. We must find the inverse Fourier transform of $S_X(f)$. Write
$$\begin{aligned}
R_X(\tau) &= \int_{-\infty}^{\infty} S_X(f)\, e^{j2\pi f\tau}\, df \\
&= \int_{-W}^{W} e^{j2\pi f\tau}\, df \\
&= \left.\frac{e^{j2\pi f\tau}}{j2\pi\tau}\right|_{f=-W}^{f=W} \\
&= \frac{e^{j2\pi W\tau} - e^{-j2\pi W\tau}}{j2\pi\tau} \\
&= 2W\, \frac{e^{j2\pi W\tau} - e^{-j2\pi W\tau}}{2j(2\pi W\tau)} \\
&= 2W\, \frac{\sin(2\pi W\tau)}{2\pi W\tau}.
\end{aligned}$$

[Figure 6.6: $S_X(f)$ equal to 1 for $-W \le f \le W$ and 0 elsewhere.]
Figure 6.6. Power spectral density of bandlimited noise in Example 6.5.

‡ The nonnegativity of power spectral densities is derived in Example 6.9.
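The closed form in Example 6.5 can be compared against a direct numerical evaluation of the inverse transform. The following is a small sketch under stated assumptions: the function names are chosen here, a midpoint rule stands in for the integral, and only the cosine part is integrated because the sine part vanishes by symmetry.

```python
import math

def lowpass_correlation_numeric(W, tau, n=20_000):
    """Midpoint-rule approximation of the integral of cos(2*pi*f*tau)
    over -W <= f <= W. The imaginary (sine) part of e^{j 2 pi f tau}
    integrates to zero by symmetry, so it is omitted."""
    h = 2 * W / n
    return h * sum(math.cos(2 * math.pi * (-W + (k + 0.5) * h) * tau)
                   for k in range(n))

def lowpass_correlation_closed_form(W, tau):
    """2W sin(2 pi W tau) / (2 pi W tau), with the limiting value 2W at tau = 0."""
    x = 2 * math.pi * W * tau
    return 2 * W if x == 0 else 2 * W * math.sin(x) / x
```

Evaluating both functions at a few lags, for example with `W = 2.0`, should give matching values. At `tau = 0` both return `2 * W`, which is the power in the process.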
If the bandwidth $W$ of $S_X(f)$ in the preceding example is allowed to go to infinity, then all frequencies will be present in equal amounts. A WSS process whose power spectral density is constant for all $f$ is called white noise. The value of the constant is usually denoted by $N_0/2$. Since the inverse transform of a constant function is a delta function, the correlation function of white noise with $S_X(f) = N_0/2$ is $R_X(\tau) = \frac{N_0}{2}\delta(\tau)$.

Remark. Just as the delta function is not an ordinary function, white noise is not an ordinary random process. For example, since $\delta(0)$ is not defined, and since $E[X_t^2] = R_X(0) = \frac{N_0}{2}\delta(0)$, we cannot speak of the second moment of $X_t$ when $X_t$ is white noise. On the other hand, since $S_X(f) = N_0/2$ for white noise, and since
$$\int_{-\infty}^{\infty} \frac{N_0}{2}\, df = \infty,$$
we often say that white noise has infinite average power.
Properties of Correlation Functions and Power Spectral Densities
The correlation function $R_X(\tau)$ is completely characterized by the following three properties.¹
(i) $R_X(-\tau) = R_X(\tau)$; i.e., $R_X$ is an even function of $\tau$.
(ii) $|R_X(\tau)| \le R_X(0)$. In other words, the maximum value of $|R_X(\tau)|$ occurs at $\tau = 0$. In particular, taking $\tau = 0$ reaffirms that $R_X(0) \ge 0$.
(iii) The power spectral density, $S_X(f)$, is a real, even, nonnegative function of $f$.

Property (i) is a restatement of the fact that any correlation function, as a function of two variables, is symmetric; for a WSS process this means that $R_X(t_1 - t_2) = R_X(t_2 - t_1)$. Taking $t_1 = \tau$ and $t_2 = 0$ yields the result.

Property (ii) is a restatement of (6.1) for a WSS process. However, we can also derive it directly by again using the Cauchy–Schwarz inequality. Write
$$|R_X(\tau)| = |E[X_{t+\tau}X_t]| \le \sqrt{E[X_{t+\tau}^2]\, E[X_t^2]} = \sqrt{R_X(0)\, R_X(0)} = R_X(0).$$

Property (iii): To see that $S_X(f)$ is real and even, write
$$S_X(f) = \int_{-\infty}^{\infty} R_X(\tau)\, e^{-j2\pi f\tau}\, d\tau = \int_{-\infty}^{\infty} R_X(\tau)\cos(2\pi f\tau)\, d\tau - j\int_{-\infty}^{\infty} R_X(\tau)\sin(2\pi f\tau)\, d\tau.$$
Since correlation functions are real and even, and since the sine is odd, the second integrand is odd, and therefore integrates to zero. Hence, we can always write
$$S_X(f) = \int_{-\infty}^{\infty} R_X(\tau)\cos(2\pi f\tau)\, d\tau.$$
Thus, $S_X(f)$ is real. Furthermore, since the cosine is even, $S_X(f)$ is an even function of $f$ as well. The fact that $S_X(f)$ is nonnegative is derived later in Example 6.9.

Remark. Property (iii) actually implies Properties (i) and (ii). (Problem 9.) Hence, a real-valued function of $\tau$ is a correlation function if and only if its Fourier transform is real, even, and nonnegative. However, if one is trying to show that a function of $\tau$ is not a correlation function, it is easier to check if either (i) or (ii) fails. Of course, if (i) and (ii) both hold, it is then necessary to go ahead and find the power spectral density and see if it is nonnegative.

Example 6.6. Determine whether or not $R(\tau) := \tau e^{-|\tau|}$ is a valid correlation function.

Solution. Property (i) requires that $R(\tau)$ be even. However, $R(\tau)$ is not even (in fact, it is odd). Hence, it cannot be a correlation function.

Example 6.7. Determine whether or not $R(\tau) := 1/(1 + \tau^2)$ is a valid correlation function.

Solution. We must check the three properties that characterize correlation functions. It is obvious that $R$ is even and that $|R(\tau)| \le R(0) = 1$. The Fourier transform of $R(\tau)$ is $S(f) = \pi\exp(-2\pi|f|)$, as can be verified by applying the inversion formula to $S(f)$.§ Since $S(f)$ is nonnegative, we see that $R(\tau)$ satisfies all three properties, and is therefore a valid correlation function.
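The screening strategy from the Remark (test the easy properties (i) and (ii) first) can be sketched in code. The function below is an illustration written for this purpose, not a method from the text; it only checks the two necessary conditions on a finite grid of lags, so a pass does not establish validity.

```python
import math

def violates_necessary_conditions(R, taus, tol=1e-12):
    """Return a reason string if R fails property (i) or (ii) on the grid,
    else None. Passing does not prove validity: property (iii), the
    nonnegativity of the Fourier transform, must still be checked."""
    for tau in taus:
        if abs(R(-tau) - R(tau)) > tol:
            return "not even"
        if abs(R(tau)) > R(0) + tol:
            return "exceeds R(0)"
    return None

def R_66(tau):
    """Candidate from Example 6.6: tau * exp(-|tau|)."""
    return tau * math.exp(-abs(tau))

def R_67(tau):
    """Candidate from Example 6.7: 1 / (1 + tau^2)."""
    return 1 / (1 + tau * tau)

def S_67(f):
    """Fourier transform of R_67 stated in Example 6.7: pi * exp(-2 pi |f|)."""
    return math.pi * math.exp(-2 * math.pi * abs(f))

grid = [k / 10 for k in range(1, 51)]   # lags 0.1, 0.2, ..., 5.0
```

Here `violates_necessary_conditions(R_66, grid)` reports that the function is not even, while `R_67` passes the screen and its transform `S_67` is nonnegative everywhere, in agreement with the two examples.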
Example 6.8 (Power in a Frequency Band). Let $X_t$ have power spectral density $S_X(f)$. Find the power in the frequency band $W_1 \le f \le W_2$.

Solution. Since power spectral densities are even, we include the corresponding negative frequencies too. Thus, we compute
$$\int_{-W_2}^{-W_1} S_X(f)\, df + \int_{W_1}^{W_2} S_X(f)\, df = 2\int_{W_1}^{W_2} S_X(f)\, df.$$
6.3. WSS Processes through Linear Time-Invariant Systems
In this section, we consider passing a WSS random process through a linear time-invariant (LTI) system. Our goal is to find the correlation and power spectral density of the output. Before doing so, it is convenient to introduce the following concepts.

Let $X_t$ and $Y_t$ be random processes. Their cross-correlation function is
$$R_{XY}(t_1,t_2) := E[X_{t_1}Y_{t_2}].$$
To distinguish between the terms cross-correlation function and correlation function, the latter is sometimes referred to as the auto-correlation function. The cross-covariance function is
$$C_{XY}(t_1,t_2) := E[\{X_{t_1} - m_X(t_1)\}\{Y_{t_2} - m_Y(t_2)\}] = R_{XY}(t_1,t_2) - m_X(t_1)\, m_Y(t_2).$$
A pair of processes $X_t$ and $Y_t$ is called jointly wide-sense stationary (J-WSS) if all three of the following properties hold:
(i) $X_t$ is WSS.
(ii) $Y_t$ is WSS.
(iii) The cross-correlation $R_{XY}(t_1,t_2) = E[X_{t_1}Y_{t_2}]$ depends on $t_1$ and $t_2$ only through the difference $t_1 - t_2$. If this is the case, we write $R_{XY}(t_1 - t_2)$ instead of $R_{XY}(t_1,t_2)$.

Remark. In checking to see if a pair of processes is J-WSS, it is usually easier to check property (iii) before (ii).

§ There is a short table of transform pairs in the Problems section and inside the front cover.

The most common situation in which we have a pair of J-WSS processes occurs when $X_t$ is the input to an LTI system, and $Y_t$ is the output. Recall that an LTI system is expressed by the convolution
$$Y_t = \int_{-\infty}^{\infty} h(t - \tau)\, X_\tau\, d\tau,$$
where $h$ is the impulse response of the system. An equivalent formula is obtained by making the change of variable $\theta = t - \tau$, $d\theta = -d\tau$. This allows us to write
$$Y_t = \int_{-\infty}^{\infty} h(\theta)\, X_{t-\theta}\, d\theta,$$
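which we use in most of the calculations below.

We now show that if the input $X_t$ is WSS, then the output $Y_t$ is WSS and $X_t$ and $Y_t$ are J-WSS. To begin, we show that $E[Y_t]$ does not depend on $t$. To do this, write
$$E[Y_t] = E\left[\int_{-\infty}^{\infty} h(\theta)\, X_{t-\theta}\, d\theta\right] = \int_{-\infty}^{\infty} E[h(\theta)\, X_{t-\theta}]\, d\theta.$$
To justify bringing the expectation inside the integral, write the integral as a Riemann sum and use the linearity of expectation. More explicitly,
$$\begin{aligned}
E\left[\int_{-\infty}^{\infty} h(\theta)\, X_{t-\theta}\, d\theta\right]
&\approx E\left[\sum_i h(\theta_i)\, X_{t-\theta_i}\, \Delta\theta_i\right] \\
&= \sum_i E[h(\theta_i)\, X_{t-\theta_i}\, \Delta\theta_i] \\
&= \sum_i E[h(\theta_i)\, X_{t-\theta_i}]\, \Delta\theta_i \\
&\approx \int_{-\infty}^{\infty} E[h(\theta)\, X_{t-\theta}]\, d\theta.
\end{aligned}$$
In computing the expectation inside the integral, observe that $X_{t-\theta}$ is random, but $h(\theta)$ is not. Hence, $h(\theta)$ is a constant and can be pulled out of the expectation; i.e., $E[h(\theta)X_{t-\theta}] = h(\theta)E[X_{t-\theta}]$. Next, since $X_t$ is WSS, $E[X_{t-\theta}]$ does not depend on $t - \theta$, and is therefore equal to some constant, say $m$. Hence,
$$E[Y_t] = \int_{-\infty}^{\infty} h(\theta)\, m\, d\theta = m\int_{-\infty}^{\infty} h(\theta)\, d\theta,$$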
which does not depend on $t$.

Logically, the next step would be to show that $E[Y_{t_1}Y_{t_2}]$ depends on $t_1$ and $t_2$ only through the difference $t_1 - t_2$. However, it is more efficient if we first obtain a formula for the cross-correlation, $E[X_{t_1}Y_{t_2}]$, and show that it depends only on $t_1 - t_2$. Write
$$\begin{aligned}
E[X_{t_1}Y_{t_2}] &= E\left[X_{t_1}\int_{-\infty}^{\infty} h(\theta)\, X_{t_2-\theta}\, d\theta\right] \\
&= \int_{-\infty}^{\infty} h(\theta)\, E[X_{t_1}X_{t_2-\theta}]\, d\theta \\
&= \int_{-\infty}^{\infty} h(\theta)\, R_X(t_1 - [t_2 - \theta])\, d\theta \\
&= \int_{-\infty}^{\infty} h(\theta)\, R_X([t_1 - t_2] + \theta)\, d\theta,
\end{aligned}$$
which depends only on $t_1 - t_2$ as required. We can now write
$$R_{XY}(\tau) = \int_{-\infty}^{\infty} h(\theta)\, R_X(\tau + \theta)\, d\theta. \qquad (6.3)$$
If we make the change of variable $\alpha = -\theta$, $d\alpha = -d\theta$, then
$$R_{XY}(\tau) = \int_{-\infty}^{\infty} h(-\alpha)\, R_X(\tau - \alpha)\, d\alpha, \qquad (6.4)$$
and we see that $R_{XY}$ is the convolution of $h(-\alpha)$ with $R_X(\alpha)$. This suggests that we define the cross power spectral density by
$$S_{XY}(f) := \int_{-\infty}^{\infty} R_{XY}(\tau)\, e^{-j2\pi f\tau}\, d\tau.$$
Since we are considering real-valued random variables here, we must assume $h(\tau)$ is real valued. Letting $H(f)$ denote the Fourier transform of $h(\tau)$, it follows that the Fourier transform of $h(-\tau)$ is $H(f)^*$, where the superscript $*$ denotes the complex conjugate. Thus, the Fourier transform of (6.4) is
$$S_{XY}(f) = H(f)^* S_X(f). \qquad (6.5)$$
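We now calculate the auto-correlation of $Y_t$ and show that it depends only on $t_1 - t_2$. Write
$$\begin{aligned}
E[Y_{t_1}Y_{t_2}] &= E\left[\int_{-\infty}^{\infty} h(\theta)\, X_{t_1-\theta}\, d\theta\; Y_{t_2}\right] \\
&= \int_{-\infty}^{\infty} h(\theta)\, E[X_{t_1-\theta}Y_{t_2}]\, d\theta \\
&= \int_{-\infty}^{\infty} h(\theta)\, R_{XY}([t_1 - \theta] - t_2)\, d\theta \\
&= \int_{-\infty}^{\infty} h(\theta)\, R_{XY}([t_1 - t_2] - \theta)\, d\theta.
\end{aligned}$$
Thus,
$$R_Y(\tau) = \int_{-\infty}^{\infty} h(\theta)\, R_{XY}(\tau - \theta)\, d\theta.$$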
[Figure 6.7: top panel, $S_X(f)$ against $f$ with the bands $[-W_2,-W_1]$ and $[W_1,W_2]$ marked; bottom panel, $H(f)$ equal to 1 on those two bands and 0 elsewhere.]
Figure 6.7. Setup of Example 6.9 to show that a power spectral density must be nonnegative.
In other words, $R_Y$ is the convolution of $h$ and $R_{XY}$. Taking Fourier transforms, we obtain $S_Y(f) = H(f)S_{XY}(f) = H(f)H(f)^* S_X(f)$, and
$$S_Y(f) = |H(f)|^2 S_X(f). \qquad (6.6)$$
In other words, the power spectral density of the output of an LTI system is the product of the magnitude squared of the transfer function and the power spectral density of the input.
Example 6.9. We can use the above formula to show that power spectral densities are nonnegative. Suppose $X_t$ is a WSS process with $S_X(f) < 0$ on some interval $[W_1, W_2]$, where $0 < W_1 \le W_2$. Since $S_X$ is even, it is also negative on $[-W_2, -W_1]$ as shown in Figure 6.7. Let $H(f) := I_{[-W_2,-W_1]}(f) + I_{[W_1,W_2]}(f)$ as shown in Figure 6.7. Note that since $H(f)$ takes only the values zero and one, $|H(f)|^2 = H(f)$. Let $h(t)$ denote the inverse Fourier transform of $H$. Since $H$ is real and even, $h$ is also real and even. Then $Y_t := \int_{-\infty}^{\infty} h(t - \tau)X_\tau\, d\tau$ is a real WSS random process, and
$$S_Y(f) = |H(f)|^2 S_X(f) = H(f)S_X(f) \le 0,$$
with $S_Y(f) < 0$ for $W_1 < |f| < W_2$. Hence, $\int_{-\infty}^{\infty} S_Y(f)\, df < 0$. It then follows that
$$0 \le E[Y_t^2] = R_Y(0) = \int_{-\infty}^{\infty} S_Y(f)\, df < 0,$$
which is a contradiction. Thus, $S_X(f) \ge 0$.
Example 6.10. A certain communication receiver employs a bandpass filter to reduce white noise generated in the amplifier. Suppose that the white noise $X_t$ has power spectral density $N_0/2$ and that the filter transfer function $H(f)$ is given in Figure 6.7. Find the expected output power from the filter.

Solution. The expected output power is obtained by integrating the power spectral density of the filter output. Denoting the filter output by $Y_t$,
$$P_Y = \int_{-\infty}^{\infty} S_Y(f)\, df = \int_{-\infty}^{\infty} |H(f)|^2 S_X(f)\, df.$$
Since $|H(f)|^2 S_X(f) = N_0/2$ for $W_1 \le |f| \le W_2$, and is zero otherwise, $P_Y = 2(N_0/2)(W_2 - W_1) = N_0(W_2 - W_1)$. In other words, the expected output power is $N_0$ times the bandwidth of the filter.
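The result of Example 6.10 follows from (6.6) and can be checked by integrating $|H(f)|^2 S_X(f)$ numerically. The following sketch is illustrative only: the function name, the midpoint rule, and the integration window $[-2W_2, 2W_2]$ are choices made here, not part of the text.

```python
def bandpass_output_power(N0, W1, W2, n=100_000):
    """Approximate the integral of |H(f)|^2 S_X(f) df for the ideal bandpass
    H of Figure 6.7 and white noise S_X(f) = N0/2, using a midpoint rule
    on the window [-2*W2, 2*W2], outside of which H is zero."""
    def H(f):
        return 1.0 if W1 <= abs(f) <= W2 else 0.0
    lo, hi = -2 * W2, 2 * W2
    h = (hi - lo) / n
    return h * sum(abs(H(lo + (k + 0.5) * h)) ** 2 * (N0 / 2) for k in range(n))
```

For example, `bandpass_output_power(2.0, 1.0, 3.0)` should be close to `N0 * (W2 - W1) = 4.0`. The small discrepancy comes from the discontinuities of the ideal filter at the band edges.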
6.4. The Matched Filter
Consider an air-traffic control system which sends out a known radar pulse. If there are no objects in range of the radar, the system returns only a noise waveform $X_t$, which we model as a zero-mean, WSS random process with power spectral density $S_X(f)$. If there is an object in range, the system returns the reflected radar pulse, say $v(t)$, plus the noise $X_t$. We wish to design a system that decides whether the received waveform is noise only, $X_t$, or signal plus noise, $v(t) + X_t$. As an aid to achieving this goal, we propose to take the received waveform and pass it through an LTI system with impulse response $h(t)$. If the received signal is in fact $v(t) + X_t$, then the output of the linear system is
$$\int_{-\infty}^{\infty} h(t - \tau)[v(\tau) + X_\tau]\, d\tau = v_o(t) + Y_t,$$
where
$$v_o(t) := \int_{-\infty}^{\infty} h(t - \tau)\, v(\tau)\, d\tau$$
is the output signal, and
$$Y_t := \int_{-\infty}^{\infty} h(t - \tau)\, X_\tau\, d\tau$$
is the output noise process. We will now try to find the impulse response $h$ that maximizes the output signal-to-noise ratio (SNR),
$$\text{SNR} := \frac{v_o(t_0)^2}{E[Y_{t_0}^2]},$$
where $v_o(t_0)^2$ is the instantaneous output signal power at time $t_0$, and $E[Y_{t_0}^2]$ is the expected instantaneous output noise power at time $t_0$. Note that since $E[Y_{t_0}^2] = R_Y(0) = P_Y$, we can also write
$$\text{SNR} = \frac{v_o(t_0)^2}{P_Y}.$$
Our approach will be to obtain an upper bound on the numerator of the form $v_o(t_0)^2 \le P_Y \cdot B$, where $B$ does not depend on the impulse response $h$. It will then follow that
$$\text{SNR} = \frac{v_o(t_0)^2}{P_Y} \le \frac{P_Y \cdot B}{P_Y} = B.$$
We then show how to choose the impulse response so that in fact $v_o(t_0)^2 = P_Y \cdot B$. For this choice of impulse response, we then have $\text{SNR} = B$, the maximum possible value.

We begin by analyzing the denominator in the SNR. Observe that
$$P_Y = \int_{-\infty}^{\infty} S_Y(f)\, df = \int_{-\infty}^{\infty} |H(f)|^2 S_X(f)\, df.$$
To analyze the numerator, write
$$v_o(t_0) = \int_{-\infty}^{\infty} V_o(f)\, e^{j2\pi f t_0}\, df = \int_{-\infty}^{\infty} H(f)V(f)\, e^{j2\pi f t_0}\, df,$$
where $V_o(f)$ is the Fourier transform of $v_o(t)$, and $V(f)$ is the Fourier transform of $v(t)$. Next, write
$$\begin{aligned}
v_o(t_0) &= \int_{-\infty}^{\infty} H(f)\sqrt{S_X(f)} \cdot \frac{V(f)\, e^{j2\pi f t_0}}{\sqrt{S_X(f)}}\, df \\
&= \int_{-\infty}^{\infty} H(f)\sqrt{S_X(f)} \left[\frac{V(f)^* e^{-j2\pi f t_0}}{\sqrt{S_X(f)}}\right]^* df,
\end{aligned}$$
where the asterisk denotes complex conjugation. Applying the Cauchy–Schwarz inequality (see Problem 1 and the Remark following it), we obtain the upper bound,
$$|v_o(t_0)|^2 \le \underbrace{\int_{-\infty}^{\infty} |H(f)|^2 S_X(f)\, df}_{=\, P_Y} \cdot \underbrace{\int_{-\infty}^{\infty} \frac{|V(f)|^2}{S_X(f)}\, df}_{=:\, B}.$$
Thus,
$$\text{SNR} = \frac{|v_o(t_0)|^2}{P_Y} \le \frac{P_Y \cdot B}{P_Y} = B.$$
Now, the Cauchy–Schwarz inequality holds with equality if and only if $H(f)\sqrt{S_X(f)}$ is a multiple of $V(f)^* e^{-j2\pi f t_0}/\sqrt{S_X(f)}$. Thus, the upper bound on the SNR will be achieved if we take $H(f)$ to solve
$$H(f)\sqrt{S_X(f)} = \alpha\, \frac{V(f)^* e^{-j2\pi f t_0}}{\sqrt{S_X(f)}}$$
for any complex multiplier $\alpha$; i.e., we should take
$$H(f) = \alpha\, \frac{V(f)^* e^{-j2\pi f t_0}}{S_X(f)}.$$
[Figure 6.8: top panel, $v(t)$ against $t$; bottom panel, $h(t) = v(t_0 - t)$ against $t$, with $t_0$ marked on the time axis.]
Figure 6.8. Known signal $v(t)$ and corresponding matched filter impulse response $h(t)$ in the case of white noise.
Thus, the optimal filter is "matched" to the known signal.

Consider the special case in which $X_t$ is white noise with power spectral density $S_X(f) = N_0/2$. Taking $\alpha = N_0/2$ as well, we have $H(f) = V(f)^* e^{-j2\pi f t_0}$, which inverse transforms to $h(t) = v(t_0 - t)$, assuming $v(t)$ is real. Thus, the matched filter has an impulse response which is a time-reversed and translated copy of the known signal $v(t)$. An example of $v(t)$ and the corresponding $h(t)$ is shown in Figure 6.8. As the figure illustrates, if $v(t)$ is a finite-duration, causal waveform, as any radar "pulse" would be, then the sampling time $t_0$ can always be chosen so that $h(t)$ corresponds to a causal system.
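The white-noise case has a simple discrete-time analogue that can be explored numerically. The sketch below is an assumption-laden illustration, not the continuous-time derivation of the text: it takes an FIR filter $h[k]$, a hypothetical pulse $v[n]$, and white noise of variance `noise_var`, for which the output SNR at time $n_0$ is $(\sum_k h[k]v[n_0-k])^2 / (\text{noise\_var}\sum_k h[k]^2)$.

```python
import random

def output_snr(h, v, n0, noise_var):
    """SNR at time n0 for FIR filter h applied to pulse v plus white noise:
    (sum_k h[k] v[n0-k])^2 / (noise_var * sum_k h[k]^2)."""
    signal = sum(h[k] * v[n0 - k] for k in range(len(h)) if 0 <= n0 - k < len(v))
    noise_power = noise_var * sum(c * c for c in h)
    return signal ** 2 / noise_power

v = [1.0, 2.0, -1.0, 0.5]                      # hypothetical known pulse
n0 = len(v) - 1                                # sample at the end of the pulse
matched = [v[n0 - k] for k in range(len(v))]   # h[k] = v[n0 - k]
bound = sum(x * x for x in v) / 0.5            # analogue of B for noise_var = 0.5

# Randomly chosen competing filters, used only for comparison.
rng = random.Random(1)
others = [output_snr([rng.gauss(0, 1) for _ in v], v, n0, 0.5)
          for _ in range(100)]
```

In this setting, `output_snr(matched, v, n0, 0.5)` attains `bound`, while none of the randomly drawn filters in `others` exceeds it. This is the Cauchy–Schwarz argument in finite-dimensional form: equality holds only when the filter is proportional to the time-reversed pulse.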
6.5. The Wiener Filter
In the preceding section, the available data was of the form $v(t) + X_t$, where $v(t)$ was a known, nonrandom signal, and $X_t$ was a zero-mean, WSS noise process. In this section, we suppose that $V_t$ is an unknown random process that we would like to estimate based on observing a related random process $U_t$. For example, we might have $U_t = V_t + X_t$, where $X_t$ is a noise process. However, to begin, we only assume that $U_t$ and $V_t$ are zero-mean, J-WSS with known power spectral densities and known cross power spectral density.

We restrict attention to linear estimators of the form
$$\widehat{V}_t = \int_{-\infty}^{\infty} h(t - \tau)\, U_\tau\, d\tau = \int_{-\infty}^{\infty} h(\theta)\, U_{t-\theta}\, d\theta. \qquad (6.7)$$
Note that to estimate $V_t$ at a single time $t$, we use the entire observed waveform $U_\tau$ for $-\infty < \tau < \infty$. Our goal is to find an impulse response $h$ that minimizes the mean squared error, $E[|V_t - \widehat{V}_t|^2]$. In other words, we are looking for an impulse response $h$ such that if $\widehat{V}_t$ is given by (6.7), and if $\tilde{h}$ is any other impulse response, and we put
$$\widetilde{V}_t = \int_{-\infty}^{\infty} \tilde{h}(t - \tau)\, U_\tau\, d\tau = \int_{-\infty}^{\infty} \tilde{h}(\theta)\, U_{t-\theta}\, d\theta, \qquad (6.8)$$
then
$$E[|V_t - \widehat{V}_t|^2] \le E[|V_t - \widetilde{V}_t|^2].$$
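To find the optimal filter $h$, we apply the orthogonality principle, which says that if
$$E\left[(V_t - \widehat{V}_t)\int_{-\infty}^{\infty} \tilde{h}(\theta)\, U_{t-\theta}\, d\theta\right] = 0 \qquad (6.9)$$
for every filter $\tilde{h}$, then $h$ is the optimal filter.

Before proceeding any further, we need the following observation. Suppose (6.9) holds for every choice of $\tilde{h}$. Then in particular, it holds if we replace $\tilde{h}$ by $h - \tilde{h}$. Making this substitution in (6.9) yields
$$E\left[(V_t - \widehat{V}_t)\int_{-\infty}^{\infty} [h(\theta) - \tilde{h}(\theta)]\, U_{t-\theta}\, d\theta\right] = 0.$$
Since the integral in this expression is simply $\widehat{V}_t - \widetilde{V}_t$, we have that
$$E[(V_t - \widehat{V}_t)(\widehat{V}_t - \widetilde{V}_t)] = 0. \qquad (6.10)$$
To establish the orthogonality principle, assume (6.9) holds for every choice of $\tilde{h}$. Then (6.10) holds as well. Now write
$$\begin{aligned}
E[|V_t - \widetilde{V}_t|^2] &= E[|(V_t - \widehat{V}_t) + (\widehat{V}_t - \widetilde{V}_t)|^2] \\
&= E[|V_t - \widehat{V}_t|^2 + 2(V_t - \widehat{V}_t)(\widehat{V}_t - \widetilde{V}_t) + |\widehat{V}_t - \widetilde{V}_t|^2] \\
&= E[|V_t - \widehat{V}_t|^2] + 2E[(V_t - \widehat{V}_t)(\widehat{V}_t - \widetilde{V}_t)] + E[|\widehat{V}_t - \widetilde{V}_t|^2] \\
&= E[|V_t - \widehat{V}_t|^2] + E[|\widehat{V}_t - \widetilde{V}_t|^2] \\
&\ge E[|V_t - \widehat{V}_t|^2],
\end{aligned}$$
where the cross term vanishes by (6.10). Thus, $h$ is the filter that minimizes the mean squared error.

The next task is to characterize the filter $h$ such that (6.9) holds for every choice of $\tilde{h}$. Recalling (6.9), we have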
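$$\begin{aligned}
0 &= E\left[(V_t - \widehat{V}_t)\int_{-\infty}^{\infty} \tilde{h}(\theta)\, U_{t-\theta}\, d\theta\right] \\
&= E\left[\int_{-\infty}^{\infty} \tilde{h}(\theta)(V_t - \widehat{V}_t)\, U_{t-\theta}\, d\theta\right] \\
&= \int_{-\infty}^{\infty} E[\tilde{h}(\theta)(V_t - \widehat{V}_t)\, U_{t-\theta}]\, d\theta \\
&= \int_{-\infty}^{\infty} \tilde{h}(\theta)\, E[(V_t - \widehat{V}_t)\, U_{t-\theta}]\, d\theta \\
&= \int_{-\infty}^{\infty} \tilde{h}(\theta)\,[R_{VU}(\theta) - R_{\widehat{V}U}(\theta)]\, d\theta.
\end{aligned}$$
Since $\tilde{h}$ is arbitrary, take $\tilde{h}(\theta) = R_{VU}(\theta) - R_{\widehat{V}U}(\theta)$ to get
$$\int_{-\infty}^{\infty} |R_{VU}(\theta) - R_{\widehat{V}U}(\theta)|^2\, d\theta = 0. \qquad (6.11)$$
Thus, (6.9) holds for all $\tilde{h}$ if and only if $R_{VU} = R_{\widehat{V}U}$.

The next task is to analyze $R_{\widehat{V}U}$. Recall that $\widehat{V}_t$ in (6.7) is the response of an LTI system to input $U_t$. Applying (6.3) with $X$ replaced by $U$ and $Y$ replaced by $\widehat{V}$, we have, also using the fact that $R_U$ is even,
$$R_{\widehat{V}U}(\tau) = R_{U\widehat{V}}(-\tau) = \int_{-\infty}^{\infty} h(\theta)\, R_U(\tau - \theta)\, d\theta.$$
Taking Fourier transforms of
$$R_{VU}(\tau) = R_{\widehat{V}U}(\tau) = \int_{-\infty}^{\infty} h(\theta)\, R_U(\tau - \theta)\, d\theta \qquad (6.12)$$
yields
$$S_{VU}(f) = H(f)\, S_U(f).$$
Thus,
$$H(f) = \frac{S_{VU}(f)}{S_U(f)}$$
is the optimal filter. This choice of $H(f)$ is called the Wiener filter.

⋆Causal Wiener Filters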
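Typically, the Wiener filter as found above is not causal; i.e., we do not have $h(t) = 0$ for $t < 0$. To find such an $h$, we need to reformulate the problem by replacing (6.7) with
$$\widehat{V}_t = \int_{-\infty}^{t} h(t - \tau)\, U_\tau\, d\tau = \int_{0}^{\infty} h(\theta)\, U_{t-\theta}\, d\theta,$$
and replacing (6.8) with
$$\widetilde{V}_t = \int_{-\infty}^{t} \tilde{h}(t - \tau)\, U_\tau\, d\tau = \int_{0}^{\infty} \tilde{h}(\theta)\, U_{t-\theta}\, d\theta.$$
Everything proceeds as before from (6.9) through (6.11) except that lower limits of integration are changed from $-\infty$ to 0. Thus, instead of concluding $R_{VU}(\tau) = R_{\widehat{V}U}(\tau)$ for all $\tau$, we only have $R_{VU}(\tau) = R_{\widehat{V}U}(\tau)$ for $\tau \ge 0$. Instead of (6.12), we have
$$R_{VU}(\tau) = \int_{0}^{\infty} h(\theta)\, R_U(\tau - \theta)\, d\theta, \quad \tau \ge 0. \qquad (6.13)$$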
This is known as the Wiener–Hopf equation. Because the equation only holds for $\tau \ge 0$, we run into a problem if we try to take Fourier transforms. To compute $S_{VU}(f)$, we need to integrate $R_{VU}(\tau)e^{-j2\pi f\tau}$ from $\tau = -\infty$ to $\tau = \infty$. But we can only use the Wiener–Hopf equation for $\tau \ge 0$.

In general, the Wiener–Hopf equation is very difficult to solve. However, if $U$ is white noise, say $R_U(\theta) = \delta(\theta)$, then (6.13) reduces to
$$R_{VU}(\tau) = h(\tau), \quad \tau \ge 0.$$
Since $h$ is causal, $h(\tau) = 0$ for $\tau < 0$.

The preceding observation suggests the construction of $H(f)$ using a whitening filter as shown in Figure 6.9. If $U_t$ is not white noise, suppose we can find a causal filter $K(f)$ such that when $U_t$ is passed through this system, the output is white noise $W_t$, by which we mean $S_W(f) = 1$. Letting $k$ denote the impulse response corresponding to $K$, we can write $W_t$ mathematically as
$$W_t = \int_{0}^{\infty} k(\theta)\, U_{t-\theta}\, d\theta. \qquad (6.14)$$
Then
$$1 = S_W(f) = |K(f)|^2 S_U(f). \qquad (6.15)$$

[Figure 6.9: block diagram in which $U_t$ enters $K(f)$, producing $W_t$, which enters $H_0(f)$, producing $\widehat{V}_t$; the cascade of the two blocks is labeled $H(f)$.]
Figure 6.9. Decomposition of the causal Wiener filter using the whitening filter $K(f)$.

Consider the problem of causally estimating $V_t$ based on $W_t$ instead of $U_t$. The solution is again given by the Wiener–Hopf equation,
$$R_{VW}(\tau) = \int_{0}^{\infty} h_0(\theta)\, R_W(\tau - \theta)\, d\theta, \quad \tau \ge 0.$$
Since $K$ was chosen so that $S_W(f) = 1$, $R_W(\theta) = \delta(\theta)$. Therefore, the Wiener–Hopf equation tells us that $h_0(\tau) = R_{VW}(\tau)$ for $\tau \ge 0$. Using (6.14), it is easy to see that
$$R_{VW}(\tau) = \int_{0}^{\infty} k(\theta)\, R_{VU}(\tau + \theta)\, d\theta, \qquad (6.16)$$
and then¶
$$S_{VW}(f) = K(f)^* S_{VU}(f). \qquad (6.17)$$
We now summarize the procedure.

¶ If $k(\theta)$ is complex valued, so is $W_t$ in (6.14). In this case, as in Problem 26, it is understood that $R_{VW}(\tau) = E[V_{t+\tau}W_t^*]$.
July 18, 2002
6.5 The Wiener Filter
205
1. According to (6.15), we must first write $S_U(f)$ in the form
\[ S_U(f) = \frac{1}{K(f)}\cdot\frac{1}{K(f)^*}, \]
where $K(f)$ is a causal filter (this is known as spectral factorization).‖
2. The optimum filter is $H(f)=H_0(f)K(f)$, where
\[ H_0(f) = \int_0^\infty R_{VW}(\tau)\,e^{-j2\pi f\tau}\,d\tau, \]
and $R_{VW}(\tau)$ is given by (6.16) or by the inverse transform of (6.17).

‖ If $S_U(f)$ satisfies the Paley–Wiener condition,
\[ \int_{-\infty}^{\infty}\frac{|\ln S_U(f)|}{1+f^2}\,df < \infty, \]
then $S_U(f)$ can always be factored in this way.

Example 6.11. Let $U_t = V_t + X_t$, where $V_t$ and $X_t$ are zero-mean, WSS processes with $E[V_tX_\tau]=0$ for all $t$ and $\tau$. Assume that the signal $V_t$ has power spectral density $S_V(f) = 2\lambda/[\lambda^2+(2\pi f)^2]$ and that the noise $X_t$ is white with power spectral density $S_X(f)=1$. Find the causal Wiener filter.

Solution. From your solution of Problem 32, $S_U(f) = S_V(f)+S_X(f)$. Thus,
\[ S_U(f) = \frac{2\lambda}{\lambda^2+(2\pi f)^2} + 1 = \frac{A^2+(2\pi f)^2}{\lambda^2+(2\pi f)^2}, \]
where $A^2 := \lambda^2+2\lambda$. This factors into
\[ S_U(f) = \frac{A+j2\pi f}{\lambda+j2\pi f}\cdot\frac{A-j2\pi f}{\lambda-j2\pi f}. \]
Then
\[ K(f) = \frac{\lambda+j2\pi f}{A+j2\pi f} \]
is the required causal (by Problem 35) whitening filter. Next, from your solution of Problem 32, $S_{VU}(f)=S_V(f)$. So, by (6.17),
\[ S_{VW}(f) = \frac{\lambda-j2\pi f}{A-j2\pi f}\cdot\frac{2\lambda}{\lambda^2+(2\pi f)^2} = \frac{2\lambda}{(A-j2\pi f)(\lambda+j2\pi f)} = \frac{B}{A-j2\pi f}+\frac{B}{\lambda+j2\pi f}, \]
where $B := 2\lambda/(\lambda+A)$. It follows that
\[ R_{VW}(\tau) = Be^{A\tau}u(-\tau) + Be^{-\lambda\tau}u(\tau), \]
where $u$ is the unit-step function. Since $h_0(\tau)=R_{VW}(\tau)$ for $\tau\ge 0$, $h_0(\tau)=Be^{-\lambda\tau}u(\tau)$ and $H_0(f)=B/(\lambda+j2\pi f)$. Next,
\[ H(f) = H_0(f)K(f) = \frac{B}{\lambda+j2\pi f}\cdot\frac{\lambda+j2\pi f}{A+j2\pi f} = \frac{B}{A+j2\pi f}, \]
and $h(\tau) = Be^{-A\tau}u(\tau)$.
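The algebra in Example 6.11 can be spot-checked numerically. The following is a minimal sketch, not part of the original text: the function names and the test value $\lambda = 1.5$ are illustrative assumptions. It checks the whitening condition (6.15), the partial-fraction expansion of $S_{VW}(f)$, and the final form of $H(f)$ at a few frequencies.

```python
import math

def constants(lam):
    """Return (A, B) of Example 6.11: A^2 = lam^2 + 2*lam, B = 2*lam/(lam + A)."""
    A = math.sqrt(lam ** 2 + 2.0 * lam)
    B = 2.0 * lam / (lam + A)
    return A, B

def S_V(f, lam):
    """Signal power spectral density 2*lam / (lam^2 + (2*pi*f)^2)."""
    return 2.0 * lam / (lam ** 2 + (2.0 * math.pi * f) ** 2)

def S_U(f, lam):
    """Observation spectrum S_V(f) + S_X(f) with white noise S_X(f) = 1."""
    return S_V(f, lam) + 1.0

def K(f, lam, A):
    """Whitening filter (lam + j*2*pi*f) / (A + j*2*pi*f)."""
    s = 1j * 2.0 * math.pi * f
    return (lam + s) / (A + s)

def H(f, A, B):
    """Causal Wiener filter B / (A + j*2*pi*f)."""
    return B / (A + 1j * 2.0 * math.pi * f)

lam = 1.5  # illustrative value, not from the text
A, B = constants(lam)
for f in (0.0, 0.3, 2.0):
    s = 1j * 2.0 * math.pi * f
    whitened = abs(K(f, lam, A)) ** 2 * S_U(f, lam)       # (6.15): should be 1
    S_VW = K(f, lam, A).conjugate() * S_V(f, lam)         # (6.17)
    partial_fractions = B / (A - s) + B / (lam + s)
    H_from_product = (B / (lam + s)) * K(f, lam, A)       # H0(f) * K(f)
    print(f, whitened, abs(S_VW - partial_fractions), abs(H_from_product - H(f, A, B)))
```

Each printed row should show a whitened spectrum of one and differences at the level of floating-point rounding.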
6.6. ⋆Expected Time-Average Power and the Wiener–Khinchin Theorem

In this section we develop the notion of the expected time-average power in a WSS process. We then derive the Wiener–Khinchin Theorem, which implies that the expected time-average power is the same as the expected instantaneous power. As an easy corollary of the derivation of the Wiener–Khinchin Theorem, we derive the mean-square ergodic theorem for WSS processes. This result shows that $E[X_t]$ can often be computed by averaging a single sample path over time.

Recall that the average power in a deterministic waveform $x(t)$ is given by
\[ \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(t)^2\,dt. \]
The analogous formula for a random process is
\[ \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} X_t^2\,dt. \]
The problem with the above quantity is that it is a random variable because the integrand, $X_t^2$, is random. We would like to characterize the process by a nonrandom quantity. This suggests that we define the expected time-average power in a random process by
\[ \widetilde P_X := E\biggl[\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} X_t^2\,dt\biggr]. \]
We now carry out some analysis of $\widetilde P_X$. To begin, we want to apply Parseval's equation to the integral. To do this, we use the following notation. Let
\[ X_t^T := \begin{cases} X_t, & |t|\le T,\\ 0, & |t|>T, \end{cases} \]
so that $\int_{-T}^{T} X_t^2\,dt = \int_{-\infty}^{\infty} |X_t^T|^2\,dt$. Now, the Fourier transform of $X_t^T$ is
\[ \widetilde X_f^T := \int_{-\infty}^{\infty} X_t^T e^{-j2\pi ft}\,dt = \int_{-T}^{T} X_t e^{-j2\pi ft}\,dt, \tag{6.18} \]
and by Parseval's equation,
\[ \int_{-\infty}^{\infty} |X_t^T|^2\,dt = \int_{-\infty}^{\infty} |\widetilde X_f^T|^2\,df. \]
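Returning now to $\widetilde P_X$, we have
\[
\begin{aligned}
\widetilde P_X &= E\biggl[\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} X_t^2\,dt\biggr]\\
&= E\biggl[\lim_{T\to\infty}\frac{1}{2T}\int_{-\infty}^{\infty} |X_t^T|^2\,dt\biggr]\\
&= E\biggl[\lim_{T\to\infty}\frac{1}{2T}\int_{-\infty}^{\infty} |\widetilde X_f^T|^2\,df\biggr]\\
&= \int_{-\infty}^{\infty}\biggl(\lim_{T\to\infty}\frac{E\bigl[|\widetilde X_f^T|^2\bigr]}{2T}\biggr)\,df.
\end{aligned}
\]
The Wiener–Khinchin Theorem says that the above integrand is exactly the power spectral density $S_X(f)$. From our earlier work, we know that $P_X := E[X_t^2] = R_X(0) = \int_{-\infty}^{\infty} S_X(f)\,df$. Thus, $\widetilde P_X = P_X$; i.e., the expected time-average power is equal to the expected instantaneous power at any point in time. Note also that the above integrand is obviously nonnegative, and so if we can show that it is equal to $S_X(f)$, then this will provide another proof that $S_X(f)$ must be nonnegative.

To derive the Wiener–Khinchin Theorem, we begin with the numerator,
\[ E\bigl[|\widetilde X_f^T|^2\bigr] = E\bigl[(\widetilde X_f^T)(\widetilde X_f^T)^*\bigr], \]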
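where the asterisk denotes complex conjugation. To evaluate the right-hand side, use (6.18) to obtain
\[ E\biggl[\biggl(\int_{-T}^{T} X_t e^{-j2\pi ft}\,dt\biggr)\biggl(\int_{-T}^{T} X_\theta e^{-j2\pi f\theta}\,d\theta\biggr)^{\!*}\,\biggr]. \]
We can now write
\[
\begin{aligned}
E\bigl[|\widetilde X_f^T|^2\bigr] &= \int_{-T}^{T}\!\int_{-T}^{T} E[X_tX_\theta]\,e^{-j2\pi f(t-\theta)}\,dt\,d\theta\\
&= \int_{-T}^{T}\!\int_{-T}^{T} R_X(t-\theta)\,e^{-j2\pi f(t-\theta)}\,dt\,d\theta \qquad (6.19)\\
&= \int_{-T}^{T}\!\int_{-T}^{T}\biggl[\int_{-\infty}^{\infty} S_X(\nu)e^{j2\pi\nu(t-\theta)}\,d\nu\biggr]e^{-j2\pi f(t-\theta)}\,dt\,d\theta\\
&= \int_{-\infty}^{\infty} S_X(\nu)\biggl[\int_{-T}^{T} e^{j2\pi\theta(f-\nu)}\biggl(\int_{-T}^{T} e^{j2\pi t(\nu-f)}\,dt\biggr)d\theta\biggr]d\nu.
\end{aligned}
\]
Notice that the inner two integrals decouple so that
\[
\begin{aligned}
E\bigl[|\widetilde X_f^T|^2\bigr] &= \int_{-\infty}^{\infty} S_X(\nu)\biggl[\int_{-T}^{T} e^{j2\pi\theta(f-\nu)}\,d\theta\biggr]\biggl[\int_{-T}^{T} e^{j2\pi t(\nu-f)}\,dt\biggr]d\nu\\
&= \int_{-\infty}^{\infty} S_X(\nu)\biggl[2T\,\frac{\sin\bigl(2\pi T(f-\nu)\bigr)}{2\pi T(f-\nu)}\biggr]^2 d\nu.
\end{aligned}
\]
We can then write
\[ \frac{E\bigl[|\widetilde X_f^T|^2\bigr]}{2T} = \int_{-\infty}^{\infty} S_X(\nu)\cdot 2T\biggl[\frac{\sin\bigl(2\pi T(f-\nu)\bigr)}{2\pi T(f-\nu)}\biggr]^2 d\nu. \tag{6.20} \]
This is a convolution integral. Furthermore, the quantity multiplying $S_X(\nu)$ converges to the delta function $\delta(f-\nu)$ as $T\to\infty$.² Thus,
\[ \lim_{T\to\infty}\frac{E\bigl[|\widetilde X_f^T|^2\bigr]}{2T} = \int_{-\infty}^{\infty} S_X(\nu)\,\delta(f-\nu)\,d\nu = S_X(f), \]
which is exactly the Wiener–Khinchin Theorem.

Remark. The preceding derivation shows that $S_X(f)$ is equal to the limit of (6.19) divided by $2T$. Thus,
\[ S_X(f) = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}\!\int_{-T}^{T} R_X(t-\theta)\,e^{-j2\pi f(t-\theta)}\,dt\,d\theta. \]
As noted in Problem 37, the properties of the correlation function directly imply that this double integral is nonnegative. This is the direct way to prove that power spectral densities are nonnegative.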
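The convergence claimed in (6.20) can be observed numerically. The sketch below is an illustration under stated assumptions: it takes the spectrum $S_X(\nu) = 2\lambda/[\lambda^2+(2\pi\nu)^2]$ with $\lambda=1$ (chosen only as a convenient example), and approximates the integral in (6.20) by a midpoint Riemann sum over a truncated range. The function names, the truncation width, and the step size are all hypothetical choices.

```python
import math

def kernel(f, T):
    """2T * [sin(2*pi*T*f) / (2*pi*T*f)]^2, the factor multiplying S_X in (6.20)."""
    x = 2.0 * math.pi * T * f
    if abs(x) < 1e-12:
        return 2.0 * T          # limiting value at f = 0
    return 2.0 * T * (math.sin(x) / x) ** 2

def S_X(nu, lam=1.0):
    """Example spectrum 2*lam / (lam^2 + (2*pi*nu)^2) (an assumption for this demo)."""
    return 2.0 * lam / (lam ** 2 + (2.0 * math.pi * nu) ** 2)

def smoothed_psd(f, T, half_width=40.0, step=1e-3):
    """Midpoint Riemann sum of (6.20) over nu in [f - half_width, f + half_width]."""
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n):
        nu = f - half_width + (i + 0.5) * step
        total += S_X(nu) * kernel(f - nu, T)
    return total * step

f = 0.1
for T in (1.0, 5.0, 25.0):
    # The gap between the two printed values should shrink as T grows.
    print(T, smoothed_psd(f, T), S_X(f))
```

The step size must stay well below the kernel's oscillation period $1/(2T)$, which is why larger $T$ would need a finer grid.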
Mean-Square Law of Large Numbers for WSS Processes

In the process of deriving the weak law of large numbers in Chapter 2, we showed that for an uncorrelated sequence $X_n$ with common mean $m = E[X_n]$ and common variance $\sigma^2 = \mathrm{var}(X_n)$, the sample mean (or time average)
\[ M_n := \frac{1}{n}\sum_{i=1}^{n} X_i \]
converges to $m$ in the sense that $E[|M_n-m|^2] = \mathrm{var}(M_n)\to 0$ as $n\to\infty$ by (2.6). In this case, we say that $M_n$ converges in mean square to $m$, and we call this a mean-square law of large numbers.

We now show that for a WSS process $Y_t$ with mean $m = E[Y_t]$, the sample mean (or time average)
\[ M_T := \frac{1}{2T}\int_{-T}^{T} Y_t\,dt \to m \]
in the sense that $E[|M_T-m|^2]\to 0$ as $T\to\infty$ if and only if
\[ \lim_{T\to\infty}\frac{1}{2T}\int_{-2T}^{2T} C_Y(\tau)\Bigl(1-\frac{|\tau|}{2T}\Bigr)d\tau = 0, \]
where $C_Y$ is the covariance (not correlation) function of $Y_t$. We can view this result as a mean-square law of large numbers for WSS processes. Laws of large numbers for sequences or processes that are not uncorrelated are often called ergodic theorems. Hence, the above result is also called the mean-square ergodic theorem for WSS processes. The point in all theorems of this type is that the expectation $E[Y_t]$ can be computed by averaging a single sample path over time.

We now derive this result. To begin, put $X_t := Y_t - m$ so that $X_t$ is zero mean and has correlation function $R_X(\tau) = C_Y(\tau)$. Also,
\[ M_T - m = \frac{1}{2T}\int_{-T}^{T} X_t\,dt = \frac{\widetilde X_f^T\big|_{f=0}}{2T}, \]
where $\widetilde X_f^T$ was defined in (6.18). Using (6.20) with $f=0$ we can write
\[
\begin{aligned}
E[|M_T-m|^2] &= \frac{E\bigl[|\widetilde X_0^T|^2\bigr]}{4T^2} = \frac{1}{2T}\cdot\frac{E\bigl[|\widetilde X_0^T|^2\bigr]}{2T}\\
&= \frac{1}{2T}\int_{-\infty}^{\infty} S_X(\nu)\cdot 2T\biggl[\frac{\sin(2\pi T\nu)}{2\pi T\nu}\biggr]^2 d\nu. \qquad (6.21)
\end{aligned}
\]
Applying Parseval's equation yields
\[
\begin{aligned}
E[|M_T-m|^2] &= \frac{1}{2T}\int_{-2T}^{2T} R_X(\tau)\Bigl(1-\frac{|\tau|}{2T}\Bigr)d\tau\\
&= \frac{1}{2T}\int_{-2T}^{2T} C_Y(\tau)\Bigl(1-\frac{|\tau|}{2T}\Bigr)d\tau.
\end{aligned}
\]
To give a sufficient condition for when this goes to zero, recall that by the argument following (6.20), as $T\to\infty$ the integral in (6.21) is approximately $S_X(0)$ if $S_X(f)$ is continuous at $f=0$.³ Thus, if $S_X(f)$ is continuous at $f=0$,
\[ E[|M_T-m|^2] \approx \frac{S_X(0)}{2T} \to 0 \]
as $T\to\infty$.

Remark. If $R_X(\tau) = C_Y(\tau)$ is absolutely integrable, then $S_X(f)$ is uniformly continuous. To see this write
\[
\begin{aligned}
|S_X(f)-S_X(f_0)| &= \biggl|\int_{-\infty}^{\infty} R_X(\tau)e^{-j2\pi f\tau}\,d\tau - \int_{-\infty}^{\infty} R_X(\tau)e^{-j2\pi f_0\tau}\,d\tau\biggr|\\
&\le \int_{-\infty}^{\infty} |R_X(\tau)|\,\bigl|e^{-j2\pi f\tau}-e^{-j2\pi f_0\tau}\bigr|\,d\tau\\
&= \int_{-\infty}^{\infty} |R_X(\tau)|\,\bigl|e^{-j2\pi f_0\tau}\bigl[e^{-j2\pi(f-f_0)\tau}-1\bigr]\bigr|\,d\tau\\
&= \int_{-\infty}^{\infty} |R_X(\tau)|\,\bigl|e^{-j2\pi(f-f_0)\tau}-1\bigr|\,d\tau.
\end{aligned}
\]
Now observe that $|e^{-j2\pi(f-f_0)\tau}-1|\to 0$ as $f\to f_0$. Since $R_X$ is absolutely integrable, Lebesgue's dominated convergence theorem [4, p. 209] implies that the integral goes to zero as well.

6.7. ⋆Power Spectral Densities for non-WSS Processes

If $X_t$ is not WSS, then the instantaneous power $E[X_t^2]$ will depend on $t$. In this case, it is more appropriate to look at the expected time-average power $\widetilde P_X$ defined in the previous section. At the end of this section, we show that
\[ \lim_{T\to\infty}\frac{E\bigl[|\widetilde X_f^T|^2\bigr]}{2T} = \int_{-\infty}^{\infty} \overline R_X(\tau)\,e^{-j2\pi f\tau}\,d\tau, \tag{6.22} \]
where**
\[ \overline R_X(\tau) := \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} R_X(\tau+\theta,\theta)\,d\theta. \]
Hence, for a non-WSS process, we define its power spectral density to be the Fourier transform of $\overline R_X(\tau)$,
\[ S_X(f) := \int_{-\infty}^{\infty} \overline R_X(\tau)\,e^{-j2\pi f\tau}\,d\tau. \]
An important application of the foregoing is to cyclostationary processes. A process $X_t$ is (wide-sense) cyclostationary if its mean function is periodic in $t$, and if its correlation function has the property that for fixed $\tau$, $R_X(\tau+\theta,\theta)$ is periodic in $\theta$. For a cyclostationary process with period $T_0$, it is not hard to show that
\[ \overline R_X(\tau) = \frac{1}{T_0}\int_{-T_0/2}^{T_0/2} R_X(\tau+\theta,\theta)\,d\theta. \tag{6.23} \]

** Note that if $X_t$ is WSS, then $\overline R_X(\tau) = R_X(\tau)$.

Example 6.12. Let $X_t$ be WSS, and put $Y_t := X_t\cos(2\pi f_0t)$. Show that $Y_t$ is cyclostationary and that
\[ S_Y(f) = \tfrac14\bigl[S_X(f-f_0)+S_X(f+f_0)\bigr]. \]
Solution. The mean of $Y_t$ is
\[ E[Y_t] = E[X_t\cos(2\pi f_0t)] = E[X_t]\cos(2\pi f_0t). \]
Because $X_t$ is WSS, $E[X_t]$ does not depend on $t$, and it is then clear that $E[Y_t]$ has period $1/f_0$. Next consider
\[
\begin{aligned}
R_Y(t+\theta,\theta) &= E[Y_{t+\theta}Y_\theta]\\
&= E\bigl[X_{t+\theta}\cos(2\pi f_0\{t+\theta\})\,X_\theta\cos(2\pi f_0\theta)\bigr]\\
&= R_X(t)\cos(2\pi f_0\{t+\theta\})\cos(2\pi f_0\theta),
\end{aligned}
\]
which is periodic in $\theta$ with period $1/f_0$. To compute $S_Y(f)$, first use a trigonometric identity to write
\[ R_Y(t+\theta,\theta) = \frac{R_X(t)}{2}\bigl[\cos(2\pi f_0t)+\cos(2\pi f_0\{t+2\theta\})\bigr]. \]
Applying (6.23) to $R_Y$ with $T_0 = 1/f_0$ yields
\[ \overline R_Y(t) = \frac{R_X(t)}{2}\cos(2\pi f_0t). \]
Taking Fourier transforms yields the claimed formula for $S_Y(f)$.
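The averaging step (6.23) in Example 6.12 can be checked numerically. The following sketch assumes a particular correlation function, $R_X(t) = e^{-|t|}$, purely for illustration; the helper names are hypothetical. It averages $R_Y(t+\theta,\theta)$ over one period in $\theta$ and compares the result with $\tfrac12 R_X(t)\cos(2\pi f_0 t)$.

```python
import math

def R_X(t):
    """Illustrative WSS correlation function (an assumption, not from the text)."""
    return math.exp(-abs(t))

def R_Y(t, theta, f0):
    """R_Y(t + theta, theta) for Y_t = X_t * cos(2*pi*f0*t)."""
    return (R_X(t)
            * math.cos(2.0 * math.pi * f0 * (t + theta))
            * math.cos(2.0 * math.pi * f0 * theta))

def averaged_correlation(t, f0, n=2000):
    """Midpoint-rule approximation of (6.23) with period T0 = 1/f0."""
    T0 = 1.0 / f0
    step = T0 / n
    total = 0.0
    for i in range(n):
        theta = -T0 / 2.0 + (i + 0.5) * step
        total += R_Y(t, theta, f0)
    return total * step / T0

def claimed_average(t, f0):
    """Closed form from the example: R_X(t)/2 * cos(2*pi*f0*t)."""
    return 0.5 * R_X(t) * math.cos(2.0 * math.pi * f0 * t)

f0 = 2.0
for t in (0.0, 0.37, -1.2):
    print(t, averaged_correlation(t, f0), claimed_average(t, f0))
```

The term $\cos(2\pi f_0\{t+2\theta\})$ integrates to zero over a full period, which is why only the $\cos(2\pi f_0 t)$ term survives.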
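Derivation of (6.22)

We begin as in the derivation of the Wiener–Khinchin Theorem, except that instead of (6.19) we have
\[
\begin{aligned}
E\bigl[|\widetilde X_f^T|^2\bigr] &= \int_{-T}^{T}\!\int_{-T}^{T} R_X(t,\theta)\,e^{-j2\pi f(t-\theta)}\,dt\,d\theta\\
&= \int_{-T}^{T}\!\int_{-\infty}^{\infty} I_{[-T,T]}(t)\,R_X(t,\theta)\,e^{-j2\pi f(t-\theta)}\,dt\,d\theta.
\end{aligned}
\]
Now make the change of variable $\tau = t-\theta$ in the inner integral. This results in
\[ E\bigl[|\widetilde X_f^T|^2\bigr] = \int_{-T}^{T}\!\int_{-\infty}^{\infty} I_{[-T,T]}(\tau+\theta)\,R_X(\tau+\theta,\theta)\,e^{-j2\pi f\tau}\,d\tau\,d\theta. \]
Change the order of integration to get
\[ E\bigl[|\widetilde X_f^T|^2\bigr] = \int_{-\infty}^{\infty} e^{-j2\pi f\tau}\int_{-T}^{T} I_{[-T,T]}(\tau+\theta)\,R_X(\tau+\theta,\theta)\,d\theta\,d\tau. \]
To simplify the inner integral, observe that $I_{[-T,T]}(\tau+\theta) = I_{[-T-\tau,\,T-\tau]}(\theta)$. Now $T-\tau$ is to the left of $-T$ if $2T<\tau$, and $-T-\tau$ is to the right of $T$ if $-2T>\tau$. Thus,
\[ \frac{E\bigl[|\widetilde X_f^T|^2\bigr]}{2T} = \int_{-\infty}^{\infty} e^{-j2\pi f\tau}\,g_T(\tau)\,d\tau, \]
where
\[
g_T(\tau) := \begin{cases}
\dfrac{1}{2T}\displaystyle\int_{-T}^{T-\tau} R_X(\tau+\theta,\theta)\,d\theta, & 0\le\tau\le 2T,\\[2ex]
\dfrac{1}{2T}\displaystyle\int_{-T-\tau}^{T} R_X(\tau+\theta,\theta)\,d\theta, & -2T\le\tau<0,\\[2ex]
0, & |\tau|>2T.
\end{cases}
\]
If $T$ is much greater than $|\tau|$, then $T-\tau\approx T$ and $-T-\tau\approx -T$ in the above limits of integration. Hence, if $R_X$ is a reasonably-behaved correlation function,
\[ \lim_{T\to\infty} g_T(\tau) = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} R_X(\tau+\theta,\theta)\,d\theta = \overline R_X(\tau), \]
and we find that
\[ \lim_{T\to\infty}\frac{E\bigl[|\widetilde X_f^T|^2\bigr]}{2T} = \int_{-\infty}^{\infty} e^{-j2\pi f\tau}\lim_{T\to\infty} g_T(\tau)\,d\tau = \int_{-\infty}^{\infty} e^{-j2\pi f\tau}\,\overline R_X(\tau)\,d\tau. \]
Remark. If $X_t$ is actually WSS, then $R_X(\tau+\theta,\theta) = R_X(\tau)$, and
\[ g_T(\tau) = R_X(\tau)\Bigl(1-\frac{|\tau|}{2T}\Bigr) I_{[-2T,2T]}(\tau). \]
In this case, for each fixed $\tau$, $g_T(\tau)\to R_X(\tau)$. We thus have an alternative derivation of the Wiener–Khinchin Theorem.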
6.8. Notes

Notes §6.2: Wide-Sense Stationary Processes

Note 1. We mean that if $R_X$ is the correlation function of a WSS process, then $R_X$ must satisfy Properties (i)–(iii). Furthermore, it can be proved [50, pp. 94–96] that if $R$ is a real-valued function satisfying Properties (i)–(iii), then there is a WSS process having $R$ as its correlation function.
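Notes §6.6: ⋆Expected Time-Average Power and the Wiener–Khinchin Theorem

Note 2. To give a rigorous derivation of the fact that
\[ \lim_{T\to\infty}\int_{-\infty}^{\infty} S_X(\nu)\cdot 2T\biggl[\frac{\sin\bigl(2\pi T(f-\nu)\bigr)}{2\pi T(f-\nu)}\biggr]^2 d\nu = S_X(f), \]
it is convenient to assume $S_X(f)$ is continuous at $f$. Letting
\[ \delta_T(f) := 2T\biggl[\frac{\sin(2\pi Tf)}{2\pi Tf}\biggr]^2, \]
we must show that
\[ \biggl|\int_{-\infty}^{\infty} S_X(\nu)\,\delta_T(f-\nu)\,d\nu - S_X(f)\biggr| \to 0. \]
To proceed, we need the following properties of $\delta_T$. First,
\[ \int_{-\infty}^{\infty} \delta_T(f)\,df = 1. \]
This can be seen by using the Fourier transform table to evaluate the inverse transform of $\delta_T(f)$ at $t=0$. Second, for fixed $\Delta f>0$, as $T\to\infty$,
\[ \int_{\{f:\,|f|>\Delta f\}} \delta_T(f)\,df \to 0. \]
This can be seen by using the fact that $\delta_T(f)$ is even and writing
\[ \int_{\Delta f}^{\infty} \delta_T(f)\,df \le \frac{2T}{(2\pi T)^2}\int_{\Delta f}^{\infty}\frac{1}{f^2}\,df = \frac{1}{2T\pi^2\,\Delta f}, \]
which goes to zero as $T\to\infty$. Third, for $|f|\ge\Delta f>0$,
\[ |\delta_T(f)| \le \frac{1}{2T(\pi\,\Delta f)^2}. \]
Now, using the first property of $\delta_T$, write
\[ S_X(f) = S_X(f)\int_{-\infty}^{\infty} \delta_T(f-\nu)\,d\nu = \int_{-\infty}^{\infty} S_X(f)\,\delta_T(f-\nu)\,d\nu. \]
Then
\[ S_X(f) - \int_{-\infty}^{\infty} S_X(\nu)\,\delta_T(f-\nu)\,d\nu = \int_{-\infty}^{\infty} [S_X(f)-S_X(\nu)]\,\delta_T(f-\nu)\,d\nu. \]
For the next step, let $\varepsilon>0$ be given, and use the continuity of $S_X$ at $f$ to get the existence of a $\Delta f>0$ such that for $|f-\nu|<\Delta f$, $|S_X(f)-S_X(\nu)|<\varepsilon$. Now break up the range of integration into $\nu$ such that $|f-\nu|<\Delta f$ and $\nu$ such that $|f-\nu|\ge\Delta f$. For the first range, we need the calculation
\[
\begin{aligned}
\biggl|\int_{f-\Delta f}^{f+\Delta f} [S_X(f)-S_X(\nu)]\,\delta_T(f-\nu)\,d\nu\biggr|
&\le \int_{f-\Delta f}^{f+\Delta f} |S_X(f)-S_X(\nu)|\,\delta_T(f-\nu)\,d\nu\\
&\le \varepsilon\int_{f-\Delta f}^{f+\Delta f} \delta_T(f-\nu)\,d\nu\\
&\le \varepsilon\int_{-\infty}^{\infty} \delta_T(f-\nu)\,d\nu = \varepsilon.
\end{aligned}
\]
For the second range of integration, consider the integral
\[
\begin{aligned}
\biggl|\int_{f+\Delta f}^{\infty} [S_X(f)-S_X(\nu)]\,\delta_T(f-\nu)\,d\nu\biggr|
&\le \int_{f+\Delta f}^{\infty} |S_X(f)-S_X(\nu)|\,\delta_T(f-\nu)\,d\nu\\
&\le \int_{f+\Delta f}^{\infty} \bigl(|S_X(f)|+|S_X(\nu)|\bigr)\,\delta_T(f-\nu)\,d\nu\\
&= |S_X(f)|\int_{f+\Delta f}^{\infty} \delta_T(f-\nu)\,d\nu + \int_{f+\Delta f}^{\infty} |S_X(\nu)|\,\delta_T(f-\nu)\,d\nu.
\end{aligned}
\]
Observe that
\[ \int_{f+\Delta f}^{\infty} \delta_T(f-\nu)\,d\nu = \int_{-\infty}^{-\Delta f} \delta_T(\theta)\,d\theta, \]
which goes to zero by the second property of $\delta_T$. Using the third property, we have
\[
\begin{aligned}
\int_{f+\Delta f}^{\infty} |S_X(\nu)|\,\delta_T(f-\nu)\,d\nu &= \int_{-\infty}^{-\Delta f} |S_X(f-\theta)|\,\delta_T(\theta)\,d\theta\\
&\le \frac{1}{2T(\pi\,\Delta f)^2}\int_{-\infty}^{-\Delta f} |S_X(f-\theta)|\,d\theta,
\end{aligned}
\]
which also goes to zero as $T\to\infty$.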
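The three properties of $\delta_T$ used in Note 2 can be checked numerically. This is a rough sketch under assumptions: the values $T=5$ and $\Delta f = 0.5$, the truncation of the infinite range to $[-50, 50]$, and the helper names are illustrative choices, and the midpoint rule gives only approximate integrals.

```python
import math

def delta_T(f, T):
    """delta_T(f) = 2T * [sin(2*pi*T*f) / (2*pi*T*f)]^2 from Note 2."""
    x = 2.0 * math.pi * T * f
    if abs(x) < 1e-12:
        return 2.0 * T
    return 2.0 * T * (math.sin(x) / x) ** 2

def midpoint_integral(g, a, b, n):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    step = (b - a) / n
    return sum(g(a + (i + 0.5) * step) for i in range(n)) * step

T, df = 5.0, 0.5   # illustrative values

# First property: total mass is 1 (truncated range, so slightly less than 1).
total_mass = midpoint_integral(lambda f: delta_T(f, T), -50.0, 50.0, 200000)

# Second property: one-sided tail mass is bounded by 1 / (2*T*pi^2*df).
tail_mass = midpoint_integral(lambda f: delta_T(f, T), df, 50.0, 100000)
tail_bound = 1.0 / (2.0 * T * math.pi ** 2 * df)

# Third property: pointwise bound 1 / (2*T*(pi*df)^2) for |f| >= df.
pointwise_bound = 1.0 / (2.0 * T * (math.pi * df) ** 2)
worst_value = max(delta_T(df + 0.001 * i, T) for i in range(5000))

print(total_mass, tail_mass, tail_bound, worst_value, pointwise_bound)
```

Increasing `T` in this sketch shrinks both bounds in proportion to $1/T$, which is the mechanism behind the limit in Note 2.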
Note 3. In applying the derivation in Note 2 to the special case $f=0$ in (6.21), we do not need $S_X(f)$ to be continuous for all $f$; we only need continuity at $f=0$.
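6.9. Problems

Recall that if
\[ H(f) = \int_{-\infty}^{\infty} h(t)\,e^{-j2\pi ft}\,dt, \]
then the inverse Fourier transform is
\[ h(t) = \int_{-\infty}^{\infty} H(f)\,e^{j2\pi ft}\,df. \]
With these formulas in mind, the following table of Fourier transform pairs may be useful.
\[
\begin{array}{ll}
h(t) & H(f)\\ \hline
I_{[-T,T]}(t) & 2T\,\dfrac{\sin(2\pi Tf)}{2\pi Tf}\\[2ex]
2W\,\dfrac{\sin(2\pi Wt)}{2\pi Wt} & I_{[-W,W]}(f)\\[2ex]
(1-|t|/T)\,I_{[-T,T]}(t) & T\biggl[\dfrac{\sin(\pi Tf)}{\pi Tf}\biggr]^2\\[2ex]
W\biggl[\dfrac{\sin(\pi Wt)}{\pi Wt}\biggr]^2 & (1-|f|/W)\,I_{[-W,W]}(f)\\[2ex]
e^{-\lambda t}u(t) & \dfrac{1}{\lambda+j2\pi f}\\[2ex]
e^{-\lambda|t|} & \dfrac{2\lambda}{\lambda^2+(2\pi f)^2}\\[2ex]
\dfrac{\lambda}{\lambda^2+t^2} & \pi e^{-2\pi\lambda|f|}\\[2ex]
e^{-(t/\sigma)^2/2} & \sqrt{2\pi}\,\sigma\,e^{-\sigma^2(2\pi f)^2/2}
\end{array}
\]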
Problems §6.1: Mean, Correlation, and Covariance

1. The Cauchy–Schwarz inequality says that for any random variables $U$ and $V$,
\[ E[UV]^2 \le E[U^2]\,E[V^2]. \tag{6.24} \]
Derive this formula as follows. For any constant $\lambda$, we can always write
\[ 0 \le E[|U-\lambda V|^2] = E[U^2] - 2\lambda E[UV] + \lambda^2 E[V^2]. \tag{6.25} \]
Now set $\lambda = E[UV]/E[V^2]$ and rearrange the inequality. Note that since equality holds in (6.25) if and only if $U=\lambda V$, equality holds in (6.24) if and only if $U$ is a multiple of $V$.

Remark. The same technique can be used to show that for complex-valued waveforms,
\[ \biggl|\int_{-\infty}^{\infty} g(\theta)h(\theta)^*\,d\theta\biggr|^2 \le \int_{-\infty}^{\infty} |g(\theta)|^2\,d\theta\cdot\int_{-\infty}^{\infty} |h(\theta)|^2\,d\theta, \]
where the asterisk denotes complex conjugation, and for any complex number $z$, $|z|^2 = z\cdot z^*$.

2. Show that $R_X(t_1,t_2) = E[X_{t_1}X_{t_2}]$ is a positive semidefinite function in the sense that for any real or complex constants $c_1,\ldots,c_n$ and any times $t_1,\ldots,t_n$,
\[ \sum_{i=1}^{n}\sum_{k=1}^{n} c_i\,R_X(t_i,t_k)\,c_k^* \ge 0. \]
Hint: Observe that
\[ E\biggl[\,\biggl|\sum_{i=1}^{n} c_iX_{t_i}\biggr|^2\,\biggr] \ge 0. \]

3. Let $X_t$ for $t>0$ be a random process with zero mean and correlation function $R_X(t_1,t_2) = \min(t_1,t_2)$. If $X_t$ is Gaussian for each $t$, write down the density of $X_t$.
Problems §6.2: Wide-Sense Stationary Processes

4. Find the correlation function corresponding to each of the following power spectral densities. (a) $\delta(f)$. (b) $\delta(f-f_0)+\delta(f+f_0)$. (c) $e^{-f^2/2}$. (d) $e^{-|f|}$.

5. Let $X_t$ be a WSS random process with power spectral density $S_X(f) = I_{[-W,W]}(f)$. Find $E[X_t^2]$.

6. Explain why each of the following frequency functions cannot be a power spectral density. (a) $e^{-f}u(f)$, where $u$ is the unit step function. (b) $e^{-f^2}\cos(f)$. (c) $(1-f^2)/(1+f^4)$. (d) $1/(1+jf^2)$.

7. For each of the following functions, determine whether or not it is a valid correlation function. (a) $\sin(\tau)$. (b) $\cos(\tau)$. (c) $e^{-\tau^2/2}$. (d) $e^{-|\tau|}$. (e) $\tau^2e^{-|\tau|}$. (f) $I_{[-T,T]}(\tau)$.

8. Let $R_0(\tau)$ be a correlation function, and put $R(\tau) := R_0(\tau)\cos(2\pi f_0\tau)$ for some $f_0>0$. Determine whether or not $R(\tau)$ is a valid correlation function.

⋆9. Let $S(f)$ be a real-valued, even, nonnegative function, and put
\[ R(\tau) := \int_{-\infty}^{\infty} S(f)\,e^{j2\pi f\tau}\,df. \]
Show that $R$ satisfies properties (i) and (ii) that characterize a correlation function.

⋆10. Let $R_0(\tau)$ be a real-valued, even function, but not necessarily a correlation function. Let $R(\tau)$ denote the convolution of $R_0$ with itself, i.e.,
\[ R(\tau) := \int_{-\infty}^{\infty} R_0(\theta)\,R_0(\tau-\theta)\,d\theta. \]
(a) Show that $R(\tau)$ is a valid correlation function. Hint: You will need the remark made in Problem 1.
(b) Now suppose that $R_0(\tau) = I_{[-T,T]}(\tau)$. In this case, what is $R(\tau)$, and what is its Fourier transform?

11. A discrete-time random process is WSS if $E[X_n]$ does not depend on $n$ and if the correlation $E[X_{n+k}X_k]$ does not depend on $k$. In this case we write $R_X(n) = E[X_{n+k}X_k]$. For discrete-time WSS processes, the power spectral density is defined by
\[ S_X(f) := \sum_{n=-\infty}^{\infty} R_X(n)\,e^{-j2\pi fn}, \]
which is a periodic function of $f$ with period one. By the formula for Fourier series coefficients,
\[ R_X(n) = \int_{-1/2}^{1/2} S_X(f)\,e^{j2\pi fn}\,df. \]
(a) Show that $R_X$ is an even function of $n$.
(b) Show that $S_X$ is a real and even function of $f$.
Problems §6.3: WSS Processes through LTI Systems

12. White noise with power spectral density $S_X(f) = N_0/2$ is applied to a lowpass filter with transfer function
\[ H(f) = \begin{cases} 1-f^2, & |f|\le 1,\\ 0, & |f|>1. \end{cases} \]
Find the output power of the filter.

13. White noise with power spectral density $N_0/2$ is applied to a lowpass filter with transfer function $H(f) = \sin(\pi f)/(\pi f)$. Find the output noise power from the filter.

14. A WSS input signal $X_t$ with correlation function $R_X(\tau) = e^{-\tau^2/2}$ is passed through an LTI system with transfer function $H(f) = e^{-(2\pi f)^2/2}$. Denote the system output by $Y_t$. Find (a) the cross power spectral density, $S_{XY}(f)$; (b) the cross-correlation, $R_{XY}(\tau)$; (c) $E[X_{t_1}Y_{t_2}]$; (d) the output power spectral density, $S_Y(f)$; (e) the output auto-correlation, $R_Y(\tau)$; (f) the output power $P_Y$.

15. White noise with power spectral density $S_X(f) = N_0/2$ is applied to a lowpass RC filter with impulse response $h(t) = \frac{1}{RC}e^{-t/(RC)}u(t)$. Find (a) the cross power spectral density, $S_{XY}(f)$; (b) the cross-correlation, $R_{XY}(\tau)$; (c) $E[X_{t_1}Y_{t_2}]$; (d) the output power spectral density, $S_Y(f)$; (e) the output auto-correlation, $R_Y(\tau)$; (f) the output power $P_Y$.

16. White noise with power spectral density $N_0/2$ is passed through a linear, time-invariant system with impulse response $h(t) = 1/(1+t^2)$. If $Y_t$ denotes the filter output, find $E[Y_{t+1/2}Y_t]$.

17. A WSS process $X_t$ with correlation function $R_X(\tau) = 1/(1+\tau^2)$ is passed through an LTI system with impulse response $h(t) = 3\sin(\pi t)/(\pi t)$. Let $Y_t$ denote the system output. Find the output power $P_Y$.

18. White noise with power spectral density $S_X(f) = N_0/2$ is passed through a filter with impulse response $h(t) = I_{[-T/2,T/2]}(t)$. Find and sketch the correlation function of the filter output.

19. Let $X_t$ be a WSS random process, and put $Y_t := \int_{t-3}^{t} X_\tau\,d\tau$. Determine whether or not $Y_t$ is WSS.

20. Consider the system
\[ Y_t = e^{-t}\int_{-\infty}^{t} e^{\theta}X_\theta\,d\theta. \]
Assume that $X_t$ is zero-mean white noise with power spectral density $S_X(f) = N_0/2$. Show that $X_t$ and $Y_t$ are J-WSS, and find $R_{XY}(\tau)$, $S_{XY}(f)$, $S_Y(f)$, and $R_Y(\tau)$.

21. A zero-mean, WSS process $X_t$ with correlation function $(1-|\tau|)I_{[-1,1]}(\tau)$ is to be processed by a filter with transfer function $H(f)$ designed so that the system output $Y_t$ has correlation function
\[ R_Y(\tau) = \frac{\sin(\pi\tau)}{\pi\tau}. \]
Find a formula for, and sketch, the required filter $H(f)$.

22. Let $X_t$ be a WSS random process. Put $Y_t := \int_{-\infty}^{\infty} h(t-\theta)X_\theta\,d\theta$, and $Z_t := \int_{-\infty}^{\infty} g(t-\theta)X_\theta\,d\theta$. Determine whether or not $Y_t$ and $Z_t$ are J-WSS.

23. Let $X_t$ be a zero-mean WSS random process with power spectral density $S_X(f) = 2/[1+(2\pi f)^2]$. Put $Y_t := X_t - X_{t-1}$.
(a) Show that $X_t$ and $Y_t$ are J-WSS.
(b) Find the power spectral density $S_Y(f)$.
(c) Find the power in $Y_t$.

24. Let $\{X_t\}$ be a zero-mean wide-sense stationary random process with power spectral density $S_X(f)$. Consider the process
\[ Y_t := \sum_{n=-\infty}^{\infty} h_nX_{t-n}, \]
with $h_n$ real valued.
(a) Show that $\{X_t\}$ and $\{Y_t\}$ are jointly wide-sense stationary.
(b) Show that $S_Y(f)$ has the form $S_Y(f) = P(f)S_X(f)$ where $P$ is a real-valued, nonnegative, periodic function of $f$ with period 1. Give a formula for $P(f)$.

25. System Identification. When white noise $\{W_t\}$ with power spectral density $S_W(f) = 3$ is applied to a certain linear time-invariant system, the output has power spectral density $e^{-f^2}$. Now let $\{X_t\}$ be a zero-mean, wide-sense stationary random process with power spectral density $S_X(f) = e^{f^2}I_{[-1,1]}(f)$. If $\{Y_t\}$ is the response of the system to $\{X_t\}$, find $R_Y(\tau)$ for all $\tau$.

⋆26. Extension to Complex Random Processes. If $X_t$ is a complex-valued random process, then its auto-correlation function is defined by $R_X(t_1,t_2) := E[X_{t_1}X_{t_2}^*]$. Similarly, if $Y_t$ is another complex-valued random process, their cross-correlation is defined by $R_{XY}(t_1,t_2) := E[X_{t_1}Y_{t_2}^*]$. The concepts of WSS, J-WSS, the power spectral density, and the cross power spectral density are defined as in the real case. Now suppose that $X_t$ is a complex WSS process and that $Y_t = \int_{-\infty}^{\infty} h(t-\tau)X_\tau\,d\tau$, where the impulse response $h$ is now possibly complex valued.
(a) Show that $R_X(-\tau) = R_X(\tau)^*$.
(b) Show that $S_X(f)$ must be real valued. Hint: What does part (a) say about the real and imaginary parts of $R_X(\tau)$?
(c) Show that
\[ E[X_{t_1}Y_{t_2}^*] = \int_{-\infty}^{\infty} h(-\theta)^*\,R_X([t_1-t_2]-\theta)\,d\theta. \]
(d) Even though the above result is a little different from (6.4), show that (6.5) and (6.6) still hold for complex random processes.

27. Let $X_n$ be a discrete-time WSS process as defined in Problem 11. Put
\[ Y_n = \sum_{k=-\infty}^{\infty} h_kX_{n-k}. \]
(a) Suggest a definition of jointly WSS discrete-time processes and show that $X_n$ and $Y_n$ are J-WSS.
(b) Suggest a definition for the cross power spectral density of two discrete-time J-WSS processes and show that (6.5) holds if
\[ H(f) := \sum_{n=-\infty}^{\infty} h_n\,e^{-j2\pi fn}. \]
Also show that (6.6) holds.
(c) Write out the discrete-time analog of Example 6.9. What condition do you need to impose on $W_2$?
Problems §6.4: The Matched Filter

28. Determine the matched filter if the known radar pulse is $v(t) = \sin(t)\,I_{[0,\pi]}(t)$, and $X_t$ is white noise with power spectral density $S_X(f) = N_0/2$. For what values of $t_0$ is the optimal system causal?

29. Determine the matched filter if $v(t) = e^{-(t/\sqrt{2})^2/2}$, and $S_X(f) = e^{-(2\pi f)^2/2}$.

30. Derive the matched filter for a discrete-time received signal $v(n)+X_n$. Hint: Problems 11 and 27 may be helpful.

Problems §6.5: The Wiener Filter

31. Suppose $V_t$ and $X_t$ are J-WSS. Let $U_t := V_t+X_t$. Show that $U_t$ and $V_t$ are J-WSS.

32. Suppose $U_t = V_t+X_t$, where $V_t$ and $X_t$ are each zero mean and WSS. Also assume that $E[V_tX_\tau]=0$ for all $t$ and $\tau$. Express the Wiener filter $H(f)$ in terms of $S_V(f)$ and $S_X(f)$.

33. Using the setup of the previous problem, find the Wiener filter $H(f)$ and the corresponding impulse response $h(t)$ if $S_V(f) = 2\lambda/[\lambda^2+(2\pi f)^2]$ and $S_X(f)=1$. Remark. You may want to compare your answer with the causal Wiener filter found in Example 6.11.

34. Using the setup of Problem 32, suppose that the signal has correlation function $R_V(\tau) = \bigl[\sin(\pi\tau)/(\pi\tau)\bigr]^2$ and that the noise has power spectral density $S_X(f) = 1 - I_{[-1,1]}(f)$. Find the Wiener filter $H(f)$ and the corresponding impulse response $h(t)$.

35. Find the impulse response of the whitening filter $K(f)$ of Example 6.11. Is it causal?

36. Derive the Wiener filter for discrete-time J-WSS signals $U_n$ and $V_n$ with zero means. Hints: (i) First derive the analogous orthogonality principle. (ii) Problems 11 and 27 may be helpful.
Problems §6.6: ⋆Expected Time-Average Power and the Wiener–Khinchin Theorem

37. Recall that by Problem 2, correlation functions are positive semidefinite. Use this fact to prove that the double integral in (6.19) is nonnegative, assuming that $R_X$ is continuous. Hint: Since $R_X$ is continuous, the double integral in (6.19) is a limit of Riemann sums of the form
\[ \sum_i\sum_k R_X(t_i-t_k)\,e^{-j2\pi f(t_i-t_k)}\,\Delta t_i\,\Delta t_k. \]

38. Let $Y_t$ be a WSS process. In each of the cases below, determine whether or not $\frac{1}{2T}\int_{-T}^{T} Y_t\,dt \to E[Y_t]$ in mean square.
(a) The covariance $C_Y(\tau) = e^{-|\tau|}$.
(b) The covariance $C_Y(\tau) = \sin(\pi\tau)/(\pi\tau)$.

39. Let $Y_t = \cos(2\pi t+\Theta)$, where $\Theta\sim\text{uniform}[-\pi,\pi]$. As in Example 6.1, $E[Y_t]=0$. Determine whether or not
\[ \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} Y_t\,dt \to 0. \]

40. Let $X_t$ be a zero-mean, WSS process. For fixed $\tau$, you might expect
\[ \frac{1}{2T}\int_{-T}^{T} X_{t+\tau}X_t\,dt \]
to converge in mean square to $E[X_{t+\tau}X_t] = R_X(\tau)$. Give conditions on the process $X_t$ under which this will be true. Hint: Define $Y_t := X_{t+\tau}X_t$.
Remark. When $\tau=0$ this says that $\frac{1}{2T}\int_{-T}^{T} X_t^2\,dt$ converges in mean square to $R_X(0) = P_X = \widetilde P_X$.

41. Let $X_t$ be a zero-mean, WSS process. For a fixed set $B\subset\mathbb{R}$, you might expect††
\[ \frac{1}{2T}\int_{-T}^{T} I_B(X_t)\,dt \]
to converge in mean square to $E[I_B(X_t)] = \wp(X_t\in B)$. Give conditions on the process $X_t$ under which this will be true. Hint: Define $Y_t := I_B(X_t)$.

†† This is the fraction of time during $[-T,T]$ that $X_t\in B$. For example, we might have $B = [v_{\min},v_{\max}]$ being the acceptable operating range of the voltage of some device. Then we would be interested in the fraction of time during $[-T,T]$ that the device is operating normally.
CHAPTER 7
Random Vectors

This chapter covers concepts that require some familiarity with linear algebra, e.g., matrix-vector multiplication, determinants, and matrix inverses. Section 7.1 defines the mean vector, covariance matrix, and characteristic function of random vectors. Section 7.2 introduces the multivariate Gaussian random vector and illustrates several of its properties. Section 7.3 discusses both linear and nonlinear minimum mean squared error estimation of random variables and random vectors. Estimators are obtained using appropriate orthogonality principles. Section 7.4 covers the Jacobian formula for finding the density of $Y = G(X)$ when the density of $X$ is given. Section 7.5 introduces complex-valued random variables and random vectors. Emphasis is on circularly symmetric Gaussian random vectors due to their importance in the design of digital communication systems.
7.1. Mean Vector, Covariance Matrix, and Characteristic Function

If $X = [X_1,\ldots,X_n]'$ is a random vector, we put $m_i := E[X_i]$, and we define $R_{ij}$ to be the covariance between $X_i$ and $X_j$, i.e.,
\[ R_{ij} := \mathrm{cov}(X_i,X_j) := E[(X_i-m_i)(X_j-m_j)]. \]
Note that $R_{ii} = \mathrm{cov}(X_i,X_i) = \mathrm{var}(X_i)$. We call the vector $m := [m_1,\ldots,m_n]'$ the mean vector of $X$, and we call the $n\times n$ matrix $R := [R_{ij}]$ the covariance matrix of $X$. Note that since $R_{ij} = R_{ji}$, the matrix $R$ is symmetric. In other words, $R' = R$. Also note that for $i\ne j$, $R_{ij}=0$ if and only if $X_i$ and $X_j$ are uncorrelated. Thus, $R$ is a diagonal matrix if and only if $X_i$ and $X_j$ are uncorrelated for all $i\ne j$.

If we define the expectation of a random vector $X = [X_1,\ldots,X_n]'$ to be the vector of expectations, i.e.,
\[ E[X] := \begin{bmatrix} E[X_1]\\ \vdots\\ E[X_n] \end{bmatrix}, \]
then $E[X] = m$. We define the expectation of a matrix of random variables to be the matrix of expectations. This leads to the following characterization of the covariance matrix $R$. To begin, observe that if we multiply an $n\times 1$ matrix (a column vector) by a $1\times n$ matrix (a row vector), we get an $n\times n$ matrix. Hence, $(X-m)(X-m)'$ is equal to
\[ \begin{bmatrix} (X_1-m_1)(X_1-m_1) & \cdots & (X_1-m_1)(X_n-m_n)\\ \vdots & & \vdots\\ (X_n-m_n)(X_1-m_1) & \cdots & (X_n-m_n)(X_n-m_n) \end{bmatrix}. \]
The expectation of the $ij$th entry is $\mathrm{cov}(X_i,X_j) = R_{ij}$. Thus,
\[ R = E[(X-m)(X-m)'] =: \mathrm{cov}(X). \]
Note the distinction between the covariance of a pair of random variables, which is a scalar, and the covariance of a column vector, which is a matrix.

Example 7.1. Write out the covariance matrix of the three-dimensional random vector $U := [X,Y,Z]'$ if $U$ has zero mean.

Solution. The covariance matrix of $U$ is
\[ \mathrm{cov}(U) = E[UU'] = \begin{bmatrix} E[X^2] & E[XY] & E[XZ]\\ E[YX] & E[Y^2] & E[YZ]\\ E[ZX] & E[ZY] & E[Z^2] \end{bmatrix}. \]
Example 7.2. Let $X = [X_1,\ldots,X_n]'$ be a random vector with covariance matrix $R$, and let $c = [c_1,\ldots,c_n]'$ be a given constant vector. Define the scalar
\[ Z := c'(X-m) = \sum_{i=1}^{n} c_i(X_i-m_i). \]
Show that
\[ \mathrm{var}(Z) = c'Rc. \]
Solution. First note that since $E[X]=m$, $Z$ has zero mean. Hence, $\mathrm{var}(Z) = E[Z^2]$. Write
\[
\begin{aligned}
E[Z^2] &= E\biggl[\biggl(\sum_{i=1}^{n} c_i(X_i-m_i)\biggr)\biggl(\sum_{j=1}^{n} c_j(X_j-m_j)\biggr)\biggr]\\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} c_ic_j\,E[(X_i-m_i)(X_j-m_j)]\\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} c_ic_jR_{ij}\\
&= \sum_{i=1}^{n} c_i\biggl(\sum_{j=1}^{n} R_{ij}c_j\biggr).
\end{aligned}
\]
The inner sum is the $i$th component of the column vector $Rc$. Hence, $E[Z^2] = c'(Rc) = c'Rc$.
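The identity $\mathrm{var}(Z) = c'Rc$ of Example 7.2 can be verified exactly on a small discrete distribution. The sketch below is illustrative only: the four sample points, their probabilities, and the vector $c$ are made-up values, and the helper names are hypothetical.

```python
def mean_vector(values, probs):
    """Mean vector m of a random vector taking values[i] with probability probs[i]."""
    n = len(values[0])
    return [sum(p * x[i] for x, p in zip(values, probs)) for i in range(n)]

def covariance_matrix(values, probs):
    """Covariance matrix R with entries E[(X_i - m_i)(X_j - m_j)]."""
    m = mean_vector(values, probs)
    n = len(m)
    return [[sum(p * (x[i] - m[i]) * (x[j] - m[j]) for x, p in zip(values, probs))
             for j in range(n)] for i in range(n)]

def quadratic_form(c, R):
    """Compute c' R c."""
    n = len(c)
    return sum(c[i] * R[i][j] * c[j] for i in range(n) for j in range(n))

# Made-up three-dimensional distribution on four points.
values = [(0.0, 1.0, 2.0), (1.0, -1.0, 0.5), (2.0, 0.0, -1.0), (-1.0, 3.0, 1.0)]
probs = [0.1, 0.4, 0.3, 0.2]
c = [2.0, -1.0, 0.5]

m = mean_vector(values, probs)
R = covariance_matrix(values, probs)

# Z = c'(X - m) has zero mean, so var(Z) = E[Z^2].
z_values = [sum(ci * (xi - mi) for ci, xi, mi in zip(c, x, m)) for x in values]
var_Z = sum(p * z * z for z, p in zip(z_values, probs))

print(var_Z, quadratic_form(c, R))
```

Both printed numbers should agree up to rounding, and the common value is nonnegative, as it must be for any covariance matrix.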
The preceding example shows that a covariance matrix must satisfy
\[ c'Rc = E[Z^2] \ge 0 \]
for all vectors $c = [c_1,\ldots,c_n]'$. A symmetric matrix $R$ with the property $c'Rc\ge 0$ for all vectors $c$ is said to be positive semidefinite. If $c'Rc>0$ for all nonzero $c$, then $R$ is called positive definite.

Recall that the norm of a column vector $x$ is defined by $\|x\| := (x'x)^{1/2}$. It is shown in Problem 5 that
\[ E[\|X-E[X]\|^2] = \mathrm{tr}(R) = \sum_{i=1}^{n} \mathrm{var}(X_i), \]
where the trace of an $n\times n$ matrix $R$ is defined by
\[ \mathrm{tr}(R) := \sum_{i=1}^{n} R_{ii}. \]
The joint characteristic function of $X = [X_1,\ldots,X_n]'$ is defined by
\[ \varphi_X(\nu) := E[e^{j\nu'X}] = E[e^{j(\nu_1X_1+\cdots+\nu_nX_n)}], \]
where $\nu = [\nu_1,\ldots,\nu_n]'$. Note that when $X$ has a joint density, $\varphi_X(\nu) = E[e^{j\nu'X}]$ is just the $n$-dimensional Fourier transform,
\[ \varphi_X(\nu) = \int_{\mathbb{R}^n} e^{j\nu'x}f_X(x)\,dx, \tag{7.1} \]
and the joint density can be recovered using the multivariate inverse Fourier transform:
\[ f_X(x) = \frac{1}{(2\pi)^n}\int_{\mathbb{R}^n} e^{-j\nu'x}\varphi_X(\nu)\,d\nu. \]
Whether $X$ has a joint density or not, the joint characteristic function can be used to obtain its various moments.

Example 7.3. The components of the mean vector and covariance matrix can be obtained from the characteristic function as follows. Write
\[ \frac{\partial}{\partial\nu_k}E[e^{j\nu'X}] = E[e^{j\nu'X}\,jX_k], \]
and
\[ \frac{\partial^2}{\partial\nu_\ell\,\partial\nu_k}E[e^{j\nu'X}] = E[e^{j\nu'X}(jX_\ell)(jX_k)]. \]
Then
\[ \frac{\partial}{\partial\nu_k}E[e^{j\nu'X}]\bigg|_{\nu=0} = jE[X_k], \]
and
\[ \frac{\partial^2}{\partial\nu_\ell\,\partial\nu_k}E[e^{j\nu'X}]\bigg|_{\nu=0} = -E[X_\ell X_k]. \]
Higher-order moments can be obtained in a similar fashion.

If the components of $X = [X_1,\ldots,X_n]'$ are independent, then
\[
\begin{aligned}
\varphi_X(\nu) &= E[e^{j\nu'X}] = E[e^{j(\nu_1X_1+\cdots+\nu_nX_n)}]\\
&= E\biggl[\prod_{k=1}^{n} e^{j\nu_kX_k}\biggr]\\
&= \prod_{k=1}^{n} E[e^{j\nu_kX_k}]\\
&= \prod_{k=1}^{n} \varphi_{X_k}(\nu_k).
\end{aligned}
\]
In other words, the joint characteristic function is the product of the marginal characteristic functions.
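The moment formulas of Example 7.3 can be illustrated by differentiating a characteristic function numerically. The following sketch is an assumption-laden demo: it uses a made-up two-dimensional discrete distribution, approximates the partial derivatives at $\nu=0$ by central finite differences with hand-picked step sizes, and the function names are hypothetical.

```python
import cmath

# Made-up distribution: X takes VALUES[i] with probability PROBS[i].
VALUES = [(0.0, 1.0), (1.0, -1.0), (2.0, 0.5), (-1.0, 3.0)]
PROBS = [0.1, 0.4, 0.3, 0.2]

def char_fn(nu):
    """Joint characteristic function E[exp(j * nu'X)] of the discrete vector."""
    return sum(p * cmath.exp(1j * sum(n * x for n, x in zip(nu, xs)))
               for xs, p in zip(VALUES, PROBS))

def first_moment(k, h=1e-4):
    """Estimate E[X_k] as (1/j) * d/d nu_k of the characteristic function at 0."""
    plus = [0.0, 0.0]
    plus[k] = h
    minus = [-v for v in plus]
    derivative = (char_fn(plus) - char_fn(minus)) / (2.0 * h)
    return (derivative / 1j).real

def second_moment(l, k, h=1e-3):
    """Estimate E[X_l X_k] as minus the mixed second partial derivative at 0."""
    def shifted(sign_l, sign_k):
        nu = [0.0, 0.0]
        nu[l] += sign_l * h
        nu[k] += sign_k * h
        return char_fn(nu)
    mixed = (shifted(1, 1) - shifted(1, -1) - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return (-mixed).real

exact_mean_0 = sum(p * x[0] for x, p in zip(VALUES, PROBS))
exact_cross = sum(p * x[0] * x[1] for x, p in zip(VALUES, PROBS))
print(first_moment(0), exact_mean_0)
print(second_moment(0, 1), exact_cross)
```

Finite differences are used only to keep the demo dependency-free; the estimates agree with the exact moments up to the discretization error of the chosen step sizes.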
7.2. The Multivariate Gaussian

A random vector $X = [X_1,\ldots,X_n]'$ is said to be Gaussian or normal if every linear combination of the components of $X$, e.g.,
\[ \sum_{i=1}^{n} c_iX_i, \tag{7.2} \]
is a scalar Gaussian random variable.

Example 7.4. If $X$ is a Gaussian random vector, then the numerical average of its components,
\[ \frac{1}{n}\sum_{i=1}^{n} X_i, \]
is a scalar Gaussian random variable.

An easy consequence of our definition of Gaussian random vector is that any subvector is also Gaussian. To see this, suppose $X = [X_1,\ldots,X_n]'$ is a Gaussian random vector. Then every linear combination of the components of the subvector $[X_1,X_3,X_5]'$ is of the form (7.2) if we take $c_i=0$ for $i$ not equal to $1,3,5$.

Sometimes it is more convenient to express linear combinations as the product of a row vector times the column vector $X$. For example, if we put $c = [c_1,\ldots,c_n]'$, then
\[ \sum_{i=1}^{n} c_iX_i = c'X. \]
Now suppose that $Y = AX$ for some $p\times n$ matrix $A$. Letting $c = [c_1,\ldots,c_p]'$, every linear combination of the $p$ components of $Y$ has the form
\[ \sum_{i=1}^{p} c_iY_i = c'Y = c'(AX) = (A'c)'X, \]
which is a linear combination of the components of $X$, and therefore normal. We can even add a constant vector. If $Y = AX+b$, where $A$ is again $p\times n$, and $b$ is $p\times 1$, then
\[ c'Y = c'(AX+b) = (A'c)'X + c'b. \]
Adding the constant $c'b$ to the normal random variable $(A'c)'X$ results in another normal random variable (with a different mean). In summary, if $X$ is a Gaussian random vector, then so is $AX+b$ for any $p\times n$ matrix $A$ and any $p$-vector $b$.

The Characteristic Function of a Gaussian Random Vector

We now find the joint characteristic function of the Gaussian random vector $X$, $\varphi_X(\nu) = E[e^{j\nu'X}]$. Since we are assuming that $X$ is a normal vector, the quantity $Y := \nu'X$ is a scalar Gaussian random variable. Hence,
\[ \varphi_X(\nu) = E[e^{j\nu'X}] = E[e^{jY}] = E[e^{j\eta Y}]\big|_{\eta=1} = \varphi_Y(\eta)\big|_{\eta=1}. \]
In other words, all we have to do is find the characteristic function of $Y$ and evaluate it at $\eta=1$. Since $Y$ is normal, we know its characteristic function is (recall Example 3.13)
\[ \varphi_Y(\eta) := E[e^{j\eta Y}] = e^{j\eta\mu-\eta^2\sigma^2/2}, \]
where $\mu := E[Y]$, and $\sigma^2 := \mathrm{var}(Y)$. Assuming $X$ has mean vector $m$, $\mu = E[\nu'X] = \nu'm$. Suppose $X$ has covariance matrix $R$. Next, write $\mathrm{var}(Y) = E[(Y-\mu)^2]$. Since
\[ Y-\mu = \nu'(X-m), \]
Example 7.2 tells us that $\mathrm{var}(Y) = \nu'R\nu$. It now follows that
\[ \varphi_X(\nu) = e^{j\nu'm-\nu'R\nu/2}. \]
If $X$ is a Gaussian random vector with mean vector $m$ and covariance matrix $R$, we write $X\sim N(m,R)$.
For Gaussian Random Vectors, Uncorrelated Implies Independent

If the components of a random vector are uncorrelated, then the covariance matrix is diagonal. In general, this is not enough to prove that the components of the random vector are independent. However, if $X$ is a Gaussian random vector with uncorrelated components, then the components are independent. To see this, suppose that $X$ is Gaussian with uncorrelated components. Then $R$ is diagonal, say
\[ R = \begin{bmatrix} \sigma_1^2 & & 0\\ & \ddots & \\ 0 & & \sigma_n^2 \end{bmatrix}, \]
where $\sigma_i^2 = R_{ii} = \mathrm{var}(X_i)$. The diagonal form of $R$ implies that
\[ \nu'R\nu = \sum_{i=1}^{n} \sigma_i^2\nu_i^2, \]
and so
\[ \varphi_X(\nu) = e^{j\nu'm-\nu'R\nu/2} = \prod_{i=1}^{n} e^{j\nu_im_i-\sigma_i^2\nu_i^2/2}. \]
In other words,
\[ \varphi_X(\nu) = \prod_{i=1}^{n} \varphi_{X_i}(\nu_i), \]
where $\varphi_{X_i}(\nu_i)$ is the characteristic function of the $N(m_i,\sigma_i^2)$ density. Multivariate inverse Fourier transformation then yields
\[ f_X(x) = \prod_{i=1}^{n} f_{X_i}(x_i), \]
where $f_{X_i}\sim N(m_i,\sigma_i^2)$. This establishes the independence of the $X_i$.
Example 7.5. Let $X_1, \ldots, X_n$ be i.i.d. $N(m, \sigma^2)$ random variables. Let $\bar{X} := \frac{1}{n}\sum_{i=1}^n X_i$ denote the average of the $X_i$. Furthermore, for $j = 1, \ldots, n$, put $Y_j := X_j - \bar{X}$. Show that $\bar{X}$ and $Y := [Y_1, \ldots, Y_n]'$ are jointly normal and independent.

Solution. Let $X := [X_1, \ldots, X_n]'$, and put $a := [\tfrac{1}{n} \ \cdots \ \tfrac{1}{n}]$. Then $\bar{X} = aX$. Next, observe that
$$\begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix} - \begin{bmatrix} \bar{X} \\ \vdots \\ \bar{X} \end{bmatrix}.$$
Let $M$ denote the $n \times n$ matrix with each row equal to $a$; i.e., $M_{ij} = 1/n$ for all $i, j$. Then $Y = X - MX = (I - M)X$, and $Y$ is a jointly normal random vector. Next consider the vector
$$Z := \begin{bmatrix} \bar{X} \\ Y \end{bmatrix} = \begin{bmatrix} a \\ I - M \end{bmatrix} X.$$
Since $Z$ is a linear transformation of the Gaussian random vector $X$, $Z$ is also a Gaussian random vector. Furthermore, its covariance matrix has the block-diagonal form (see Problem 11)
$$\begin{bmatrix} \operatorname{var}(\bar{X}) & 0 \\ 0 & E[YY'] \end{bmatrix}.$$
This implies, by Problem 12, that $\bar{X}$ and $Y$ are independent.
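The block-diagonal covariance claimed in Example 7.5 can be verified by direct matrix arithmetic. The sketch below assumes NumPy; the values `n = 5` and `sigma2 = 2.0` are arbitrary illustrative choices, and the variable name `T` for the stacked matrix is hypothetical. It relies on the standard fact that a linear map $TX$ has covariance $TRT'$.

```python
import numpy as np

n, sigma2 = 5, 2.0                    # illustrative sample size and variance

a = np.full((1, n), 1.0 / n)          # row vector with every entry 1/n
M = np.full((n, n), 1.0 / n)          # n x n matrix with each row equal to a
T = np.vstack([a, np.eye(n) - M])     # Z = T X stacks the average on top of Y

R = sigma2 * np.eye(n)                # covariance of i.i.d. components
cov_Z = T @ R @ T.T                   # covariance matrix of Z

var_xbar = cov_Z[0, 0]                # variance of the sample average
cross_block = cov_Z[0, 1:]            # covariances between the average and each Y_j
cov_Y = cov_Z[1:, 1:]                 # the block E[Y Y']
```

Here `cross_block` is zero up to rounding, which is the uncorrelatedness that, combined with joint normality, gives independence. Note that `cov_Y` is singular (its rows sum to zero), so $Y$ itself has no density even though it is Gaussian.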
The Density Function of a Gaussian Random Vector
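We now derive the general multivariate Gaussian density function under the assumption that $R$ is positive definite. Using the multivariate Fourier inversion formula,
$$\begin{aligned}
f_X(x) &= \frac{1}{(2\pi)^n}\int_{\mathbb{R}^n} e^{-jx'\nu}\varphi_X(\nu)\,d\nu \\
&= \frac{1}{(2\pi)^n}\int_{\mathbb{R}^n} e^{-jx'\nu}e^{j\nu'm - \nu'R\nu/2}\,d\nu \\
&= \frac{1}{(2\pi)^n}\int_{\mathbb{R}^n} e^{-j(x-m)'\nu}e^{-\nu'R\nu/2}\,d\nu.
\end{aligned}$$
Now make the multivariate change of variable $\zeta = R^{1/2}\nu$, $d\zeta = \det R^{1/2}\,d\nu$. The existence of $R^{1/2}$ and the fact that $\det(R^{1/2}) = \sqrt{\det R} > 0$ are shown in the Notes (see Note 1). We have
$$\begin{aligned}
f_X(x) &= \frac{1}{(2\pi)^n}\int_{\mathbb{R}^n} e^{-j(x-m)'R^{-1/2}\zeta}e^{-\zeta'\zeta/2}\,\frac{d\zeta}{\det R^{1/2}} \\
&= \frac{1}{(2\pi)^n}\int_{\mathbb{R}^n} e^{-j\{R^{-1/2}(x-m)\}'\zeta}e^{-\zeta'\zeta/2}\,\frac{d\zeta}{\sqrt{\det R}}.
\end{aligned}$$
Put $t = R^{-1/2}(x - m)$. Then
$$\begin{aligned}
f_X(x) &= \frac{1}{(2\pi)^n}\int_{\mathbb{R}^n} e^{-jt'\zeta}e^{-\zeta'\zeta/2}\,\frac{d\zeta}{\sqrt{\det R}} \\
&= \frac{1}{\sqrt{\det R}}\prod_{i=1}^n\left(\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-jt_i\zeta_i}e^{-\zeta_i^2/2}\,d\zeta_i\right).
\end{aligned}$$
Observe that $e^{-\zeta_i^2/2}$ is the characteristic function of a scalar $N(0,1)$ random variable. Hence,
$$\begin{aligned}
f_X(x) &= \frac{1}{\sqrt{\det R}}\prod_{i=1}^n \frac{1}{\sqrt{2\pi}}e^{-t_i^2/2} \\
&= \frac{1}{(2\pi)^{n/2}\sqrt{\det R}}\exp\Bigl[-\frac{1}{2}\sum_{i=1}^n t_i^2\Bigr] \\
&= \frac{1}{(2\pi)^{n/2}\sqrt{\det R}}\exp[-t't/2] \\
&= \frac{\exp[-\frac{1}{2}(x-m)'R^{-1}(x-m)]}{(2\pi)^{n/2}\sqrt{\det R}}. \qquad (7.3)
\end{aligned}$$
With the norm notation, $t't = \|t\|^2 = \|R^{-1/2}(x-m)\|^2$, and so
$$f_X(x) = \frac{\exp[-\|R^{-1/2}(x-m)\|^2/2]}{(2\pi)^{n/2}\sqrt{\det R}}$$
as well.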
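The density formula (7.3) translates into a few lines of code. The following is a sketch under stated assumptions: it uses NumPy, the function name `gaussian_density` is hypothetical, and it replaces the symmetric square root $R^{1/2}$ of the text by a Cholesky factor $L$ with $R = LL'$. This substitution changes the vector $t$ but not its norm, since $\|L^{-1}(x-m)\|^2 = (x-m)'R^{-1}(x-m)$.

```python
import numpy as np

def gaussian_density(x, m, R):
    """Evaluate the N(m, R) density at x, assuming R is symmetric positive definite."""
    x, m, R = (np.asarray(v, dtype=float) for v in (x, m, R))
    n = m.size
    L = np.linalg.cholesky(R)                    # R = L L' (raises if R is not positive definite)
    t = np.linalg.solve(L, x - m)                # ||t||^2 equals the quadratic form in (7.3)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log det R
    log_f = -0.5 * (t @ t) - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)
    return np.exp(log_f)

# Usage with illustrative values.
m = np.array([0.0, 1.0])
R = np.array([[2.0, 0.6],
              [0.6, 1.0]])
value = gaussian_density([0.3, -1.2], m, R)
```

Working with the log-density until the final exponentiation avoids overflow and underflow in the determinant when $n$ is large, which is why the sketch does not form $\sqrt{\det R}$ directly.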
7.3. Estimation of Random Vectors
Consider a pair of random vectors $X$ and $Y$, where $X$ is not observed, but $Y$ is observed. We wish to estimate $X$ based on our knowledge of $Y$. By an estimator of $X$ based on $Y$, we mean a function $g(y)$ such that $\hat{X} := g(Y)$ is our estimate or "guess" of the value of $X$. What is the best function $g$ to use? What do we mean by best? In this section, we define $g$ to be best if it minimizes the mean squared error (MSE) $E[\|X - g(Y)\|^2]$ over all functions $g$ in some class of functions. The optimal function $g$ is called the minimum mean squared error (MMSE) estimator. In the next subsection, we restrict attention to the class of linear estimators (actually affine). Later we allow $g$ to be arbitrary.
Linear Minimum Mean Squared Error Estimation
We now restrict attention to estimators $g$ of the form $g(y) = Ay + b$, where $A$ is a matrix and $b$ is a column vector. Such a function of $y$ is said to be affine. If $b = 0$, then $g$ is linear. It is common to say $g$ is linear even if $b \ne 0$, since this is only a slight abuse of terminology, and the meaning is understood. We shall follow this convention. To find the best linear estimator is simply to find the matrix $A$ and the column vector $b$ that minimize the MSE, which for linear estimators has the form
$$E\bigl[\|X - (AY + b)\|^2\bigr].$$
Letting $m_X = E[X]$ and $m_Y = E[Y]$, the MSE is equal to
$$E\Bigl[\bigl\|\{(X - m_X) - A(Y - m_Y)\} + \{m_X - Am_Y - b\}\bigr\|^2\Bigr].$$
Since the left-hand quantity in braces is zero mean, and since the right-hand quantity in braces is a constant (nonrandom), the MSE simplifies to
$$E\bigl[\|(X - m_X) - A(Y - m_Y)\|^2\bigr] + \|m_X - Am_Y - b\|^2.$$
No matter what matrix $A$ is used, the optimal choice of $b$ is
$$b = m_X - Am_Y,$$
and the estimate is $g(Y) = AY + b = A(Y - m_Y) + m_X$. The estimate is truly linear in $Y$ if and only if $Am_Y = m_X$. We now turn to the problem of minimizing
$$E\bigl[\|(X - m_X) - A(Y - m_Y)\|^2\bigr].$$

[Figure 7.1. Geometric interpretation of the orthogonality principle: a line $l$, a point $p$ not on the line, and the point $\hat{p}$ on the line closest to $p$.]
The matrix $A$ is optimal if and only if for all matrices $B$,
$$E\bigl[\|(X - m_X) - A(Y - m_Y)\|^2\bigr] \le E\bigl[\|(X - m_X) - B(Y - m_Y)\|^2\bigr]. \qquad (7.4)$$
The following condition is equivalent, and easier to use. This equivalence is known as the orthogonality principle. It says that (7.4) holds for all $B$ if and only if
$$E\bigl[\{B(Y - m_Y)\}'\{(X - m_X) - A(Y - m_Y)\}\bigr] = 0 \quad \text{for all } B. \qquad (7.5)$$
Below we prove that (7.5) implies (7.4). The converse is also true, but we shall not use it in this book.

We first explain the terminology and show geometrically why it is true. Recall that given a straight line $l$ and a point $p$ not on the line, the shortest path between $p$ and the line is obtained by dropping a perpendicular segment from the point to the line as shown in Figure 7.1. The point $\hat{p}$ on the line where the segment touches is closer to $p$ than any other point on the line. Notice also that the vertical segment, $p - \hat{p}$, is orthogonal to the line $l$. In our situation, the role of $p$ is played by the random variable $X - m_X$, the role of $\hat{p}$ is played by the random variable $A(Y - m_Y)$, and the role of the line is played by the set of all random variables of the form $B(Y - m_Y)$ as $B$ runs over all matrices. Since the inner product between two random vectors $U$ and $V$ can be defined as $E[V'U]$, (7.5) says that
$$(X - m_X) - A(Y - m_Y)$$
is orthogonal to all $B(Y - m_Y)$.

To use (7.5), first note that since it is a scalar equation, the left-hand side is equal to its trace. Bringing the trace inside the expectation and using the fact that $\operatorname{tr}(\alpha\beta) = \operatorname{tr}(\beta\alpha)$, we see that the left-hand side of (7.5) is equal to
$$E\Bigl[\operatorname{tr}\bigl(\{(X - m_X) - A(Y - m_Y)\}(Y - m_Y)'B'\bigr)\Bigr].$$
Taking the trace back out of the expectation shows that (7.5) is equivalent to
$$\operatorname{tr}([R_{XY} - AR_Y]B') = 0 \quad \text{for all } B, \qquad (7.6)$$
where $R_Y = \operatorname{cov}(Y)$ is the covariance matrix of $Y$, and
$$R_{XY} = E[(X - m_X)(Y - m_Y)'] =: \operatorname{cov}(X, Y).$$
(This is our third usage of cov. We have already seen the covariance of a pair of random variables and of a random vector. This is the third type of covariance, that of a pair of random vectors. When $X$ and $Y$ are both of dimension one, this reduces to the first usage of cov.) By Problem 5, it follows that (7.6) holds if and only if $A$ solves the equation
$$AR_Y = R_{XY}.$$
If $R_Y$ is invertible, the unique solution of this equation is
$$A = R_{XY}R_Y^{-1}.$$
In this case, the complete estimate of $X$ is
$$R_{XY}R_Y^{-1}(Y - m_Y) + m_X.$$
Even if $R_Y$ is not invertible, $AR_Y = R_{XY}$ always has a solution, as shown in Problem 16.

Example 7.6 (Signal in Additive Noise). Let $X$ denote a random signal of zero mean and known covariance matrix $R_X$. Suppose that in order to estimate $X$, all we have available is the noisy measurement
$$Y = X + W,$$
where $W$ is a noise vector with zero mean and known covariance matrix $R_W$. Further assume that the covariance between the signal and noise, $R_{XW}$, is zero. Find the linear MMSE estimate of $X$ based on $Y$.

Solution. Since $E[Y] = E[X + W] = 0$, $m_Y = m_X = 0$. Next,
$$R_{XY} = E[(X - m_X)(Y - m_Y)'] = E[X(X + W)'] = R_X.$$
Similarly,
$$R_Y = E[(Y - m_Y)(Y - m_Y)'] = E[(X + W)(X + W)'] = R_X + R_W.$$
It follows that
$$\hat{X} = R_X(R_X + R_W)^{-1}Y.$$

We now show that (7.5) implies (7.4). To simplify the notation, we assume zero means. Then
$$\begin{aligned}
E[\|X - BY\|^2] &= E[\|(X - AY) + (AY - BY)\|^2] \\
&= E[\|(X - AY) + (A - B)Y\|^2] \\
&= E[\|X - AY\|^2] + E[\|(A - B)Y\|^2],
\end{aligned}$$
where the cross terms $2E[\{(A - B)Y\}'(X - AY)]$ vanish by (7.5). If we drop the right-hand term in the above display, we obtain
$$E[\|X - BY\|^2] \ge E[\|X - AY\|^2].$$
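The equation $AR_Y = R_{XY}$ and the signal-in-noise estimate of Example 7.6 can be sketched numerically as follows. This assumes NumPy and an invertible $R_Y$; the function names `lmmse_gain` and `linear_mse` and the particular matrices are illustrative, not from the text. The MSE helper uses the expansion $E[\|X - AY\|^2] = \operatorname{tr}(R_X - AR_{YX} - R_{XY}A' + AR_YA')$ for zero-mean vectors.

```python
import numpy as np

def lmmse_gain(R_XY, R_Y):
    """Solve A R_Y = R_XY for A, assuming R_Y is invertible."""
    # Transposing gives R_Y' A' = R_XY', a standard linear system in A'.
    return np.linalg.solve(R_Y.T, R_XY.T).T

def linear_mse(A, R_X, R_XY, R_Y):
    """Mean squared error of the zero-mean linear estimate A Y of X."""
    return np.trace(R_X - A @ R_XY.T - R_XY @ A.T + A @ R_Y @ A.T)

# Setup of Example 7.6 with illustrative covariance matrices.
R_X = np.array([[2.0, 0.5],
                [0.5, 1.0]])
R_W = np.array([[0.5, 0.0],
                [0.0, 0.25]])
R_XY = R_X                  # signal and noise are uncorrelated
R_Y = R_X + R_W

A = lmmse_gain(R_XY, R_Y)   # equals R_X (R_X + R_W)^{-1}
mse_opt = linear_mse(A, R_X, R_XY, R_Y)

# Any other matrix B gives an MSE that is at least as large, as in (7.4).
B = A + 0.1 * np.ones_like(A)
mse_other = linear_mse(B, R_X, R_XY, R_Y)
```

Solving the linear system rather than forming $R_Y^{-1}$ explicitly is a common numerical preference; when $R_Y$ is singular, a least-squares or pseudoinverse solution would be needed instead, in line with Problem 16.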
Minimum Mean Squared Error Estimation
We begin with an observed vector $Y$ and a scalar $X$ to be estimated. (The generalization to vector-valued $X$ is left to the problems.) We seek a real-valued function $g(y)$ with the property that
$$E[|X - g(Y)|^2] \le E[|X - h(Y)|^2] \quad \text{for all } h. \qquad (7.7)$$
Here we are not requiring $g$ and/or $h$ to be linear. We again have an orthogonality principle (see Problem 17) that says that the above condition is equivalent to
$$E[\{X - g(Y)\}h(Y)] = 0 \quad \text{for all } h. \qquad (7.8)$$
We claim that $g(y) = E[X \mid Y = y]$ is the optimal function $g$ if $X$ and $Y$ are jointly continuous. In this case, we can use the law of total probability to write
$$E[\{X - g(Y)\}h(Y)] = \int E\bigl[\{X - g(Y)\}h(Y) \bigm| Y = y\bigr]\,f_Y(y)\,dy.$$
Applying the substitution law to the conditional expectation yields
$$\begin{aligned}
E\bigl[\{X - g(Y)\}h(Y) \bigm| Y = y\bigr] &= E\bigl[\{X - g(y)\}h(y) \bigm| Y = y\bigr] \\
&= E[X \mid Y = y]\,h(y) - g(y)h(y) \\
&= g(y)h(y) - g(y)h(y) \\
&= 0.
\end{aligned}$$

In order that the expectations in (7.7) be finite, we assume $E[X^2] < \infty$, and we restrict $g$ and $h$ to be such that $E[g(Y)^2] < \infty$ and $E[h(Y)^2] < \infty$.
It is shown in Problem 18 that the solution $g$ of (7.8) is unique. We can use this fact to find conditional expectations if $X$ and $Y$ are jointly normal. For simplicity, assume zero means. We claim that $g(y) = R_{XY}R_Y^{-1}y$ solves (7.8). We first observe that
$$\begin{bmatrix} X - R_{XY}R_Y^{-1}Y \\ Y \end{bmatrix} = \begin{bmatrix} I & -R_{XY}R_Y^{-1} \\ 0 & I \end{bmatrix}\begin{bmatrix} X \\ Y \end{bmatrix}$$
is a linear transformation of $[X, Y']'$, and so the left-hand side is a Gaussian random vector whose top and bottom entries are easily seen to be uncorrelated:
$$E[(X - R_{XY}R_Y^{-1}Y)Y'] = R_{XY} - R_{XY}R_Y^{-1}R_Y = 0.$$
Being jointly Gaussian and uncorrelated, they are independent. Hence, for any function $h(y)$,
$$E[(X - R_{XY}R_Y^{-1}Y)h(Y)] = E[X - R_{XY}R_Y^{-1}Y]\,E[h(Y)] = 0 \cdot E[h(Y)].$$
We have now learned two things about jointly normal random variables. First, we have learned how to find conditional expectations. Second, in looking for the best estimator function $g$, and not requiring $g$ to be linear, we found that the best $g$ is linear if $X$ and $Y$ are jointly normal.
7.4. Transformations of Random Vectors
If $G(x)$ is a vector-valued function of $x \in \mathbb{R}^n$, and $X$ is an $\mathbb{R}^n$-valued random vector, we can define a new random vector by $Y = G(X)$. If $X$ has joint density $f_X$, and $G$ is a suitable invertible mapping, then we can find a relatively explicit formula for the joint density of $Y$. Suppose that the entries of the vector equation $y = G(x)$ are given by
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} g_1(x_1, \ldots, x_n) \\ \vdots \\ g_n(x_1, \ldots, x_n) \end{bmatrix}.$$
If $G$ is invertible, we can apply $G^{-1}$ to both sides of $y = G(x)$ to obtain $G^{-1}(y) = x$. Using the notation $H(y) := G^{-1}(y)$, we can write the entries of the vector equation $x = H(y)$ as
$$\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} h_1(y_1, \ldots, y_n) \\ \vdots \\ h_n(y_1, \ldots, y_n) \end{bmatrix}.$$
Assuming that $H$ is continuous and has continuous partial derivatives, let
$$H'(y) := \begin{bmatrix}
\dfrac{\partial h_1}{\partial y_1} & \cdots & \dfrac{\partial h_1}{\partial y_n} \\
\vdots & & \vdots \\
\dfrac{\partial h_i}{\partial y_1} & \cdots & \dfrac{\partial h_i}{\partial y_n} \\
\vdots & & \vdots \\
\dfrac{\partial h_n}{\partial y_1} & \cdots & \dfrac{\partial h_n}{\partial y_n}
\end{bmatrix}.$$
To compute $\wp(Y \in C) = \wp(G(X) \in C)$, it is convenient to put $B = \{x : G(x) \in C\}$ so that
$$\wp(Y \in C) = \wp(G(X) \in C) = \wp(X \in B) = \int_{\mathbb{R}^n} I_B(x)\,f_X(x)\,dx.$$
Now apply the multivariate change of variable $x = H(y)$. Keeping in mind that $dx = |\det H'(y)|\,dy$,
$$\wp(Y \in C) = \int_{\mathbb{R}^n} I_B(H(y))\,f_X(H(y))\,|\det H'(y)|\,dy.$$
Observe that $I_B(H(y)) = 1$ if and only if $H(y) \in B$, which happens if and only if $G(H(y)) \in C$. However, since $H = G^{-1}$, $G(H(y)) = y$, and we see that $I_B(H(y)) = I_C(y)$. Thus,
$$\wp(Y \in C) = \int_C f_X(H(y))\,|\det H'(y)|\,dy.$$
It follows that $Y$ has density
$$f_Y(y) = f_X(H(y))\,|\det H'(y)|.$$
Since $\det H'(y)$ is called the Jacobian of $H$, the preceding equations are sometimes called Jacobian formulas.

Example 7.7. Let $X$ and $Y$ be independent univariate $N(0,1)$ random variables. Let $R = \sqrt{X^2 + Y^2}$ and $\Theta = \tan^{-1}(Y/X)$. Find the joint density of $R$ and $\Theta$.

Solution. The transformation $G$ is given by
$$r = \sqrt{x^2 + y^2}, \qquad \theta = \tan^{-1}(y/x).$$
The first thing we must do is find the inverse transformation $H$. To begin, observe that $x\tan\theta = y$. Also, $r^2 = x^2 + y^2$. Write
$$x^2\tan^2\theta = y^2 = r^2 - x^2.$$
Then $x^2(1 + \tan^2\theta) = r^2$. Since $1 + \tan^2\theta = \sec^2\theta$,
$$x^2 = r^2\cos^2\theta.$$
It then follows that $y^2 = r^2 - x^2 = r^2(1 - \cos^2\theta) = r^2\sin^2\theta$. Hence, the inverse transformation $H$ is given by
$$x = r\cos\theta, \qquad y = r\sin\theta.$$
The matrix $H'(r, \theta)$ is given by
$$H'(r, \theta) = \begin{bmatrix} \dfrac{\partial x}{\partial r} & \dfrac{\partial x}{\partial\theta} \\ \dfrac{\partial y}{\partial r} & \dfrac{\partial y}{\partial\theta} \end{bmatrix} = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix},$$
and $\det H'(r, \theta) = r\cos^2\theta + r\sin^2\theta = r$. Then
$$\begin{aligned}
f_{R,\Theta}(r, \theta) &= f_{XY}(x, y)\Big|_{x = r\cos\theta,\ y = r\sin\theta} \cdot |\det H'(r, \theta)| \\
&= f_{XY}(r\cos\theta, r\sin\theta)\,r.
\end{aligned}$$
Now, since $X$ and $Y$ are independent $N(0,1)$, $f_{XY}(x, y) = f_X(x)\,f_Y(y) = e^{-(x^2+y^2)/2}/(2\pi)$, and
$$f_{R,\Theta}(r, \theta) = re^{-r^2/2} \cdot \frac{1}{2\pi}, \qquad r \ge 0, \ -\pi < \theta \le \pi.$$
Thus, $R$ and $\Theta$ are independent, with $R$ having a Rayleigh density and $\Theta$ having a uniform$(-\pi, \pi]$ density.
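The Jacobian computation in Example 7.7 can be checked numerically. The sketch below assumes NumPy; the function names `H`, `jacobian_det`, and `f_R_Theta` are hypothetical, and the derivative matrix $H'(r, \theta)$ is approximated by central finite differences rather than computed symbolically.

```python
import numpy as np

def H(r, theta):
    """Inverse transformation of Example 7.7: polar to rectangular coordinates."""
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def jacobian_det(r, theta, eps=1e-6):
    """Approximate det H'(r, theta) using central finite differences."""
    d_r = (H(r + eps, theta) - H(r - eps, theta)) / (2 * eps)          # first column of H'
    d_theta = (H(r, theta + eps) - H(r, theta - eps)) / (2 * eps)      # second column of H'
    return np.linalg.det(np.column_stack([d_r, d_theta]))

def f_R_Theta(r, theta):
    """Joint density of (R, Theta) from the Jacobian formula, for independent N(0,1) X and Y."""
    x, y = H(r, theta)
    f_xy = np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)
    return f_xy * abs(jacobian_det(r, theta))

# Usage at an arbitrary test point.
r0, theta0 = 1.7, 0.4
det_value = jacobian_det(r0, theta0)
density_value = f_R_Theta(r0, theta0)
```

At any point with $r > 0$, `det_value` should be close to $r$ and `density_value` close to $re^{-r^2/2}/(2\pi)$, matching the closed form derived in the example.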
7.5. Complex Random Variables and Vectors
A complex random variable is a pair of real random variables, say $X$ and $Y$, written in the form $Z = X + jY$, where $j$ denotes the square root of $-1$. The advantage of the complex notation is that it becomes easy to write down certain functions of $(X, Y)$. For example, it is easier to talk about
$$Z^2 = (X + jY)(X + jY) = (X^2 - Y^2) + j(2XY)$$
than the vector-valued mapping
$$g(X, Y) = \begin{bmatrix} X^2 - Y^2 \\ 2XY \end{bmatrix}.$$
Recall that the absolute value of a complex number $z = x + jy$ is
$$|z| := \sqrt{x^2 + y^2}.$$
The complex conjugate of $z$ is
$$z^* := x - jy,$$
and so
$$zz^* = (x + jy)(x - jy) = x^2 + y^2 = |z|^2.$$
The expected value of $Z$ is simply
$$E[Z] := E[X] + jE[Y].$$
The covariance of $Z$ is
$$\operatorname{cov}(Z) := E[(Z - E[Z])(Z - E[Z])^*] = E\bigl[|Z - E[Z]|^2\bigr].$$
Note that $\operatorname{cov}(Z) = \operatorname{var}(X) + \operatorname{var}(Y)$, while
$$E[(Z - E[Z])^2] = [\operatorname{var}(X) - \operatorname{var}(Y)] + j[2\operatorname{cov}(X, Y)],$$
which is zero if and only if $X$ and $Y$ are uncorrelated and have the same variance.

If $X$ and $Y$ are jointly continuous real random variables, then we say that $Z = X + jY$ is a continuous complex random variable with density
$$f_Z(z) = f_Z(x + jy) := f_{XY}(x, y).$$
Sometimes the formula for $f_{XY}(x, y)$ is more easily expressed in terms of the complex variable $z$. For example, if $X$ and $Y$ are independent $N(0, 1/2)$, then
$$f_{XY}(x, y) = \frac{e^{-x^2}}{\sqrt{2\pi}\sqrt{1/2}} \cdot \frac{e^{-y^2}}{\sqrt{2\pi}\sqrt{1/2}} = \frac{e^{-|z|^2}}{\pi}.$$
Note that $E[Z] = 0$ and $\operatorname{cov}(Z) = 1$. Also, the density is circularly symmetric since $|z|^2 = x^2 + y^2$ depends only on the distance from the origin of the point $(x, y) \in \mathbb{R}^2$.

A complex random vector of dimension $n$, say
$$Z = [Z_1, \ldots, Z_n]',$$
is a vector whose $i$th component is a complex random variable $Z_i = X_i + jY_i$, where $X_i$ and $Y_i$ are real random variables. If we put
$$X := [X_1, \ldots, X_n]' \quad \text{and} \quad Y := [Y_1, \ldots, Y_n]',$$
then $Z = X + jY$, and the mean vector of $Z$ is $E[Z] = E[X] + jE[Y]$. The covariance matrix of $Z$ is
$$C := E[(Z - E[Z])(Z - E[Z])^H],$$
where the superscript $H$ denotes the complex conjugate transpose. In other words, the $i\,k$ entry of $C$ is
$$C_{ik} = E[(Z_i - E[Z_i])(Z_k - E[Z_k])^*] =: \operatorname{cov}(Z_i, Z_k).$$
It is also easy to show that
$$C = (R_X + R_Y) + j(R_{YX} - R_{XY}). \qquad (7.9)$$
For joint distribution purposes, we identify the $n$-dimensional complex vector $Z$ with the $2n$-dimensional real random vector
$$[X_1, \ldots, X_n, Y_1, \ldots, Y_n]'. \qquad (7.10)$$
If this $2n$-dimensional real random vector has a joint density $f_{XY}$, then we write
$$f_Z(z) := f_{XY}(x_1, \ldots, x_n, y_1, \ldots, y_n).$$
Sometimes the formula for the right-hand side can be written simply in terms of the complex vector $z$.
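Identity (7.9) is purely algebraic, so it also holds exactly for sample covariance matrices computed from data. The following sketch, assuming NumPy, illustrates this with synthetic samples; the sample size, dimension, and the way `Y` is made to depend on `X` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 1000, 3                              # number of samples and vector dimension

X = rng.normal(size=(N, n))                 # real parts, one sample per row
Y = 0.5 * X + rng.normal(size=(N, n))       # imaginary parts, correlated with X
Z = X + 1j * Y

# Center the samples, then form sample covariance matrices.
Xc, Yc, Zc = X - X.mean(axis=0), Y - Y.mean(axis=0), Z - Z.mean(axis=0)
R_X = Xc.T @ Xc / N
R_Y = Yc.T @ Yc / N
R_XY = Xc.T @ Yc / N                        # sample version of E[(X - m_X)(Y - m_Y)']
R_YX = R_XY.T
C = Zc.T @ Zc.conj() / N                    # sample version of E[(Z - m_Z)(Z - m_Z)^H]

C_from_79 = (R_X + R_Y) + 1j * (R_YX - R_XY)
```

The matrices `C` and `C_from_79` agree up to rounding. Note the conjugate on the second factor when forming `C`; omitting it would give the matrix $E[(Z - E[Z])(Z - E[Z])']$ instead, which is a different quantity.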
Complex Gaussian Random Vectors
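An $n$-dimensional complex random vector $Z = X + jY$ is said to be Gaussian if the $2n$-dimensional real random vector in (7.10) is jointly Gaussian; i.e., its characteristic function $\varphi_{XY}(\nu, \theta) = E[e^{j(\nu'X + \theta'Y)}]$ has the form
$$\exp\left\{ j(\nu'm_X + \theta'm_Y) - \frac{1}{2}\begin{bmatrix} \nu' & \theta' \end{bmatrix}\begin{bmatrix} R_X & R_{XY} \\ R_{YX} & R_Y \end{bmatrix}\begin{bmatrix} \nu \\ \theta \end{bmatrix} \right\}. \qquad (7.11)$$
Now observe that
$$\begin{bmatrix} \nu' & \theta' \end{bmatrix}\begin{bmatrix} R_X & R_{XY} \\ R_{YX} & R_Y \end{bmatrix}\begin{bmatrix} \nu \\ \theta \end{bmatrix} \qquad (7.12)$$
is equal to
$$\nu'R_X\nu + \theta'R_Y\theta + 2\theta'R_{YX}\nu.$$
On the other hand, if we put $w := \nu + j\theta$, and use (7.9), then (see Problem 26)
$$w^HCw = \nu'(R_X + R_Y)\nu + \theta'(R_X + R_Y)\theta + 2\theta'(R_{YX} - R_{XY})\nu.$$
Clearly, if
$$R_X = R_Y \quad \text{and} \quad R_{XY} = -R_{YX}, \qquad (7.13)$$
then (7.12) is equal to $w^HCw/2$. Conversely, if (7.12) is equal to $w^HCw/2$ for all $w = \nu + j\theta$, then (7.13) holds (Problem 33). We say that a complex Gaussian random vector $Z = X + jY$ is circularly symmetric if (7.13) holds. If $Z$ is circularly symmetric and zero mean, then its characteristic function is
$$E[e^{j(\nu'X + \theta'Y)}] = e^{-w^HCw/4}, \qquad w = \nu + j\theta.$$
The density corresponding to (7.11) is (assuming zero means)
$$f_{XY}(x, y) = \frac{\exp\left\{ -\dfrac{1}{2}\begin{bmatrix} x' & y' \end{bmatrix}\begin{bmatrix} R_X & R_{XY} \\ R_{YX} & R_Y \end{bmatrix}^{-1}\begin{bmatrix} x \\ y \end{bmatrix} \right\}}{(2\pi)^n\sqrt{\det K}}, \qquad (7.14)$$
where
$$K := \begin{bmatrix} R_X & R_{XY} \\ R_{YX} & R_Y \end{bmatrix}.$$
It is shown in Problem 34 that under the assumption of circular symmetry (7.13),
$$f_{XY}(x, y) = \frac{e^{-z^HC^{-1}z}}{\pi^n\det C}, \qquad z = x + jy, \qquad (7.15)$$
and that $C$ is invertible if and only if $K$ is invertible.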
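The agreement between (7.14) and (7.15) under circular symmetry can be spot-checked numerically. The sketch below assumes NumPy; it builds a Hermitian positive-definite matrix `C` from a random complex matrix `G` (an illustrative construction, not from the text), recovers $R_X$ and $R_{YX}$ from (7.9) and (7.13), and evaluates both density formulas at one point.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3

# Illustrative Hermitian positive-definite covariance matrix C.
G = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
C = G @ G.conj().T + np.eye(n)

# Under (7.13), equation (7.9) reads C = 2 R_X + j 2 R_YX.
R_X = C.real / 2
R_YX = C.imag / 2
K = np.block([[R_X, -R_YX],      # R_XY = -R_YX
              [R_YX, R_X]])      # R_Y  =  R_X

# Evaluate both density formulas at an arbitrary point z = x + jy.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + 1j * y
xy = np.concatenate([x, y])

f_714 = np.exp(-0.5 * xy @ np.linalg.solve(K, xy)) / ((2 * np.pi) ** n * np.sqrt(np.linalg.det(K)))
quad = (z.conj() @ np.linalg.solve(C, z)).real     # z^H C^{-1} z is real for Hermitian C
f_715 = np.exp(-quad) / (np.pi ** n * np.linalg.det(C).real)
```

The two values coincide up to rounding, and the determinants satisfy $\det K = (\det C)^2/2^{2n}$, which is the relation established in Problem 34(a).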
7.6. Notes
Notes §7.2: The Multivariate Gaussian Density

Note 1. Recall that an $n \times n$ matrix $R$ is symmetric if it is equal to its transpose; i.e., $R = R'$. It is positive definite if $c'Rc > 0$ for all $c \ne 0$. We show that the determinant of a positive-definite matrix is positive. A trivial modification of the derivation shows that the determinant of a positive semidefinite matrix is nonnegative. At the end of the note, we also define the square root of a positive-definite matrix.

We start with the well-known fact that a symmetric matrix can be diagonalized [22]; i.e., there is an $n \times n$ matrix $P$ such that $P'P = PP' = I$ and such that $P'RP$ is a diagonal matrix, say
$$P'RP = \Lambda = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}.$$
Next, from $P'RP = \Lambda$, we can easily obtain $R = P\Lambda P'$. Since the determinant of a product of matrices is the product of their determinants, $\det R = \det P \det\Lambda \det P'$. Since the determinants are numbers, they can be multiplied in any order. Thus,
$$\begin{aligned}
\det R &= \det\Lambda \det P' \det P \\
&= \det\Lambda \det(P'P) \\
&= \det\Lambda \det I \\
&= \det\Lambda \\
&= \lambda_1 \cdots \lambda_n.
\end{aligned}$$
Rewrite $P'RP = \Lambda$ as $RP = P\Lambda$. Then it is easy to see that the columns of $P$ are eigenvectors of $R$; i.e., if $P$ has columns $p_1, \ldots, p_n$, then $Rp_i = \lambda_i p_i$. Next, since $P'P = I$, each $p_i$ satisfies $p_i'p_i = 1$. Since $R$ is positive definite,
$$0 < p_i'Rp_i = p_i'(\lambda_i p_i) = \lambda_i p_i'p_i = \lambda_i.$$
Thus, each eigenvalue $\lambda_i > 0$, and it follows that $\det R = \lambda_1 \cdots \lambda_n > 0$.

Because positive-definite matrices are diagonalizable with positive eigenvalues, it is easy to define their square root by
$$\sqrt{R} := P\sqrt{\Lambda}\,P',$$
where
$$\sqrt{\Lambda} := \begin{bmatrix} \sqrt{\lambda_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\lambda_n} \end{bmatrix}.$$
Thus, $\det\sqrt{R} = \sqrt{\lambda_1} \cdots \sqrt{\lambda_n} = \sqrt{\det R}$. Furthermore, from the definition of $\sqrt{R}$, it is clear that it is positive definite and satisfies $\sqrt{R}\sqrt{R} = R$. We also point out that since $R = P\Lambda P'$, $R^{-1} = P\Lambda^{-1}P'$, where $\Lambda^{-1}$ is diagonal with diagonal entries $1/\lambda_i$; hence, $\sqrt{R^{-1}} = (\sqrt{R}\,)^{-1}$. Finally, note that
$$\sqrt{R}\,R^{-1}\sqrt{R} = (P\sqrt{\Lambda}P')(P\Lambda^{-1}P')(P\sqrt{\Lambda}P') = I.$$
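The construction of $\sqrt{R}$ in Note 1 maps directly onto a symmetric eigendecomposition. The following is a minimal sketch assuming NumPy; the function name `sqrtm_pd` and the example matrix are illustrative. It uses `numpy.linalg.eigh`, which returns the eigenvalues and an orthonormal matrix of eigenvectors of a symmetric matrix.

```python
import numpy as np

def sqrtm_pd(R):
    """Square root P sqrt(Lambda) P' of a symmetric positive-definite matrix R."""
    lam, P = np.linalg.eigh(R)            # R = P diag(lam) P' with P'P = I
    if np.any(lam <= 0):
        raise ValueError("R must be positive definite")
    return (P * np.sqrt(lam)) @ P.T       # scales column i of P by sqrt(lam[i])

# Usage with an illustrative positive-definite matrix.
R = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
S = sqrtm_pd(R)

check_square = S @ S                              # should reproduce R
check_det = np.linalg.det(S)                      # should equal sqrt(det R)
check_identity = S @ np.linalg.inv(R) @ S         # should be the identity matrix
```

The three check quantities correspond to the properties $\sqrt{R}\sqrt{R} = R$, $\det\sqrt{R} = \sqrt{\det R}$, and $\sqrt{R}\,R^{-1}\sqrt{R} = I$ stated in the note.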
7.7. Problems
Problems §7.1: Mean Vector, Covariance Matrix, and Characteristic Function

1. The input $U$ to a certain amplifier is $N(0,1)$, and the output is $X = ZU + Y$, where the amplifier's random gain $Z$ has density
$$f_Z(z) = \tfrac{3}{7}z^2, \qquad 1 \le z \le 2;$$
and given $Z = z$, the amplifier's random bias $Y$ is conditionally exponential with parameter $z$. Assuming that the input $U$ is independent of the amplifier parameters $Z$ and $Y$, find the mean vector and the covariance matrix of $[X, Y, Z]'$.

2. Find the mean vector and covariance matrix of $[X, Y, Z]'$ if
$$f_{XYZ}(x, y, z) = \frac{2\exp[-|x - y| - (y - z)^2/2]}{z^5\sqrt{2\pi}}, \qquad z \ge 1,$$
and $f_{XYZ}(x, y, z) = 0$ otherwise.

3. Let $X$, $Y$, and $Z$ be jointly continuous. Assume that $X \sim$ uniform$[1, 2]$; that given $X = x$, $Y \sim \exp(1/x)$; and that given $X = x$ and $Y = y$, $Z$ is $N(x, 1)$. Find the mean vector and covariance matrix of $[X, Y, Z]'$.

4. Find the mean vector and covariance matrix of $[X, Y, Z]'$ if
$$f_{XYZ}(x, y, z) = \frac{e^{-(x-y)^2/2}\,e^{-(y-z)^2/2}\,e^{-z^2/2}}{(2\pi)^{3/2}}.$$
Also find the joint characteristic function of $[X, Y, Z]'$.

5. Traces and Norms.
(a) If $M$ is a random $n \times n$ matrix, show that $\operatorname{tr}(E[M]) = E[\operatorname{tr}(M)]$.
(b) Let $A$ and $B$ be matrices of dimensions $m \times n$ and $n \times m$, respectively. Derive the formula $\operatorname{tr}(AB) = \operatorname{tr}(BA)$. Hint: Recall that if $C = AB$, then
$$C_{ij} := \sum_{k=1}^n A_{ik}B_{kj}.$$
(c) Show that if $X$ is an $n$-dimensional random vector with covariance matrix $R$, then
$$E[\|X - E[X]\|^2] = \operatorname{tr}(R) = \sum_{i=1}^n \operatorname{var}(X_i).$$
(d) For $m \times n$ matrices $U$ and $V$, show that
$$\operatorname{tr}(UV') = \sum_{i=1}^m\sum_{k=1}^n U_{ik}V_{ik}.$$
Show that if $U$ is fixed and $\operatorname{tr}(UV') = 0$ for all matrices $V$, then $U = 0$.
Remark. What is going on here is that $\operatorname{tr}(UV')$ defines an inner product on the space of $m \times n$ matrices.
Problems §7.2: The Multivariate Gaussian

6. If $X$ is a zero-mean, multivariate Gaussian random variable, show that
$$E[(\nu'XX'\nu)^k] = (2k - 1)(2k - 3) \cdots 5 \cdot 3 \cdot 1 \cdot (\nu'E[XX']\nu)^k.$$
Hint: Example 3.6.

7. Wick's Theorem. Let $X \sim N(0, R)$ be $n$-dimensional. Let $(i_1, \ldots, i_{2k})$ be a vector of indices chosen from $\{1, \ldots, n\}$. Repetitions are allowed; e.g., $(1, 3, 3, 4)$. Derive Wick's Theorem,
$$E[X_{i_1} \cdots X_{i_{2k}}] = \sum_{j_1, \ldots, j_{2k}} R_{j_1 j_2} \cdots R_{j_{2k-1} j_{2k}},$$
where the sum is over all $j_1, \ldots, j_{2k}$ that are permutations of $i_1, \ldots, i_{2k}$ and such that the product $R_{j_1 j_2} \cdots R_{j_{2k-1} j_{2k}}$ is distinct. Hint: The idea is to view both sides of the equation derived in the previous problem as a multivariate polynomial in the $n$ variables $\nu_1, \ldots, \nu_n$. After collecting all terms on each side that involve $\nu_{i_1} \cdots \nu_{i_{2k}}$, the corresponding coefficients must be equal. In the expression
$$\begin{aligned}
E[(\nu'X)^{2k}] &= E\Bigl[\Bigl(\sum_{j_1=1}^n \nu_{j_1}X_{j_1}\Bigr) \cdots \Bigl(\sum_{j_{2k}=1}^n \nu_{j_{2k}}X_{j_{2k}}\Bigr)\Bigr] \\
&= \sum_{j_1=1}^n \cdots \sum_{j_{2k}=1}^n \nu_{j_1} \cdots \nu_{j_{2k}}\,E[X_{j_1} \cdots X_{j_{2k}}],
\end{aligned}$$
we are only interested in those terms for which $j_1, \ldots, j_{2k}$ is a permutation of $i_1, \ldots, i_{2k}$. There are $(2k)!$ such terms, each equal to
$$\nu_{i_1} \cdots \nu_{i_{2k}}\,E[X_{i_1} \cdots X_{i_{2k}}].$$
Similarly, from
$$(\nu'R\nu)^k = \Bigl(\sum_{i=1}^n\sum_{j=1}^n \nu_i R_{ij}\nu_j\Bigr)^k$$
we are only interested in terms of the form
$$\nu_{j_1}\nu_{j_2} \cdots \nu_{j_{2k-1}}\nu_{j_{2k}}\,R_{j_1 j_2} \cdots R_{j_{2k-1} j_{2k}},$$
where $j_1, \ldots, j_{2k}$ is a permutation of $i_1, \ldots, i_{2k}$. Now many of these permutations involve the same value of the product $R_{j_1 j_2} \cdots R_{j_{2k-1} j_{2k}}$. First, because $R$ is symmetric, each factor $R_{ij}$ also occurs as $R_{ji}$. This happens in $2^k$ different ways. Second, the order in which the $R_{ij}$ are multiplied together occurs in $k!$ different ways.

8. Let $X$ be a multivariate normal random vector with covariance matrix $R$. Use Wick's theorem of the previous problem to evaluate $E[X_1X_2X_3X_4]$, $E[X_1X_3^2X_4]$, and $E[X_1^2X_2^2]$.

9. Evaluate (7.3) if $m = 0$ and
$$R = \begin{bmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_1\sigma_2\rho & \sigma_2^2 \end{bmatrix}.$$
Show that your result has the same form as the bivariate normal density in (5.13).

10. Let $X = [X_1, \ldots, X_n]' \sim N(m, R)$, and suppose that $Y = AX + b$, where $A$ is a $p \times n$ matrix, and $b \in \mathbb{R}^p$. Find the mean and variance of $Y$.

11. Let $X_1, \ldots, X_n$ be i.i.d. $N(m, \sigma^2)$ random variables, and let $\bar{X} := \frac{1}{n}\sum_{i=1}^n X_i$ denote the average of the $X_i$. For $j = 1, \ldots, n$, put $Y_j := X_j - \bar{X}$. Show that $E[Y_j] = 0$ and that $E[\bar{X}Y_j] = 0$ for $j = 1, \ldots, n$.

12. Let $X = [X_1, \ldots, X_n]' \sim N(m, R)$, and suppose $R$ is block diagonal, say
$$R = \begin{bmatrix} S & 0 \\ 0 & T \end{bmatrix},$$
where $S$ and $T$ are square submatrices with $S$ being $s \times s$ and $T$ being $t \times t$ with $s + t = n$. Put $U := [X_1, \ldots, X_s]'$ and $V := [X_{s+1}, \ldots, X_n]'$. Show that $U$ and $V$ are independent. Hint: Show that
$$\varphi_X(\nu) = \varphi_U(\nu_1, \ldots, \nu_s)\,\varphi_V(\nu_{s+1}, \ldots, \nu_n),$$
where $\varphi_U$ is an $s$-variate normal characteristic function, and $\varphi_V$ is a $t$-variate normal characteristic function. Use the notation $\alpha := [\nu_1, \ldots, \nu_s]'$ and $\beta := [\nu_{s+1}, \ldots, \nu_n]'$.

13. The digital signal processing chip in a wireless communication receiver generates the $n$-dimensional Gaussian vector $X$ with mean zero and positive-definite covariance matrix $R$. It then computes the vector $Y = R^{-1/2}X$. (Since $R^{-1/2}$ is invertible, there is no loss of information in applying such a transformation.) Finally, the decision statistic $V = \|Y\|^2 := \sum_{k=1}^n Y_k^2$ is computed.
(a) Find the multivariate density of $Y$.
(b) Find the density of $Y_k^2$ for $k = 1, \ldots, n$.
(c) Find the density of $V$.
Problems §7.3: Estimation of Random Vectors

14. Let $X$ denote a random signal of known mean $m_X$ and known covariance matrix $R_X$. Suppose that in order to estimate $X$, all we have available is the noisy measurement
$$Y = GX + W,$$
where $G$ is a known gain matrix, and $W$ is a noise vector with zero mean and known covariance matrix $R_W$. Further assume that the covariance between the signal and noise, $R_{XW}$, is zero. Find the linear MMSE estimate of $X$ based on $Y$.

15. Let $X$ and $Y$ be random vectors with known means and covariance matrices. Do not assume zero means. Find the best purely linear estimate of $X$ based on $Y$; i.e., find the matrix $A$ that minimizes $E[\|X - AY\|^2]$. Similarly, find the best constant estimate of $X$; i.e., find the vector $b$ that minimizes $E[\|X - b\|^2]$.

16. In this problem, you will show that $AR_Y = R_{XY}$ has a solution even if $R_Y$ is singular. Recall that since $R_Y$ is symmetric, it can be diagonalized [22]; i.e., there is an $n \times n$ matrix $P$ such that $P'P = PP' = I$ and such that $P'R_YP = \Lambda$ is a diagonal matrix, say $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$. Define a new random variable $\tilde{Y} := P'Y$. Consider the problem of finding the best linear estimate of $X$ based on $\tilde{Y}$. This leads to finding a matrix $\tilde{A}$ that solves
$$\tilde{A}R_{\tilde{Y}} = R_{X\tilde{Y}}. \qquad (7.16)$$
(a) Show that if $\tilde{A}$ solves (7.16), then $A := \tilde{A}P'$ solves $AR_Y = R_{XY}$.
(b) Show that (7.16) has a solution. Hint: Use the fact that if $\operatorname{var}(\tilde{Y}_k) = 0$, then $\operatorname{cov}(X_i, \tilde{Y}_k) = 0$ as well. (This fact is an easy consequence of the Cauchy–Schwarz inequality, which is derived in Problem 1 of Chapter 6.)

17. Show that (7.8) implies (7.7).

18. Show that if $g_1(y)$ and $g_2(y)$ both solve (7.8), then $g_1 = g_2$ in the sense that
$$E[|g_2(Y) - g_1(Y)|] = 0.$$
Hint: Observe that
$$|g_2(y) - g_1(y)| = [g_2(y) - g_1(y)]I_{B_+}(y) - [g_2(y) - g_1(y)]I_{B_-}(y),$$
where
$$B_+ := \{y : g_2(y) > g_1(y)\} \quad \text{and} \quad B_- := \{y : g_2(y) < g_1(y)\}.$$
Now write down the four versions of (7.8) obtained with $g = g_2$ or $g = g_1$ and $h(y) = I_{B_+}(y)$ or $h(y) = I_{B_-}(y)$.

19. Write down the analogs of (7.7) and (7.8) when $X$ is vector valued. Find $E[X \mid Y = y]$ if $[X', Y']'$ is a Gaussian random vector such that $X$ has mean $m_X$ and covariance $R_X$, $Y$ has mean $m_Y$ and covariance $R_Y$, and $R_{XY} = \operatorname{cov}(X, Y)$.

20. Let $X$ and $Y$ be jointly normal random vectors as in the previous problem, and let the matrix $A$ solve $AR_Y = R_{XY}$. Show that given $Y = y$, $X$ is conditionally $N(m_X + A(y - m_Y),\ R_X - AR_{YX})$. Hints: First note that $(X - m_X) - A(Y - m_Y)$ and $Y$ are uncorrelated and therefore independent by Problem 12. Next, observe that $E[e^{j\nu'X} \mid Y = y]$ is equal to
$$E\bigl[e^{j\nu'[(X - m_X) - A(Y - m_Y)]}\,e^{j\nu'[m_X + A(Y - m_Y)]} \bigm| Y = y\bigr].$$
Problems §7.4: Transformations of Random Vectors

21. Let $X$ and $Y$ have joint density $f_{XY}(x, y)$. Let $U := X + Y$ and $V := X - Y$. Find $f_{UV}(u, v)$.

22. Let $X$ and $Y$ be independent uniform$(0, 1]$ random variables. Let $U := \sqrt{-2\ln X}\,\cos(2\pi Y)$ and $V := \sqrt{-2\ln X}\,\sin(2\pi Y)$. Show that $U$ and $V$ are independent $N(0,1)$ random variables.

23. Let $X$ and $Y$ have joint density $f_{XY}(x, y)$. Let $U := X + Y$ and $V := X/(X + Y)$. Find $f_{UV}(u, v)$. Apply your result to the case where $X$ and $Y$ are independent gamma random variables $X \sim g_{p,\lambda}$ and $Y \sim g_{q,\lambda}$. What type of density is $f_U$? What type of density is $f_V$? Hint: Results from Problem 21(d) in Chapter 5 may be helpful.
Problems §7.5: Complex Random Variables and Vectors

24. Show that for a complex random variable $Z = X + jY$, $\operatorname{cov}(Z) = \operatorname{var}(X) + \operatorname{var}(Y)$.

25. Consider the complex random vector $Z = X + jY$ with covariance matrix $C$.
(a) Show that $C = (R_X + R_Y) + j(R_{YX} - R_{XY})$.
(b) If the circular symmetry conditions $R_X = R_Y$ and $R_{XY} = -R_{YX}$ hold, show that the diagonal elements of $R_{XY}$ are zero; i.e., the components $X_i$ and $Y_i$ are uncorrelated.
(c) If the circular symmetry conditions hold, and if $C$ is a real matrix, show that $X$ and $Y$ are uncorrelated.

26. Let $Z$ be a complex random vector with covariance matrix $C = R + jQ$ for real matrices $R$ and $Q$.
(a) Show that $R = R'$ and that $Q' = -Q$.
(b) If $Q' = -Q$, show that $\nu'Q\nu = 0$.
(c) If $w = \nu + j\theta$, show that
$$w^HCw = \nu'R\nu + \theta'R\theta + 2\theta'Q\nu.$$

27. Let $Z = X + jY$ be a complex random vector, and let $A = \alpha + j\beta$ be a complex matrix. Show that the transformation $Z \mapsto AZ$ is equivalent to
$$\begin{bmatrix} X \\ Y \end{bmatrix} \mapsto \begin{bmatrix} \alpha & -\beta \\ \beta & \alpha \end{bmatrix}\begin{bmatrix} X \\ Y \end{bmatrix}.$$
Hence, multiplying an $n$-dimensional complex random vector by an $n \times n$ complex matrix is a linear transformation of the $2n$-dimensional vector $[X', Y']'$. Now show that such a transformation preserves circular symmetry; i.e., if $Z$ is circularly symmetric, then so is $AZ$.

28. Consider the complex random vector $\Gamma$ partitioned as
$$\Gamma = \begin{bmatrix} Z \\ W \end{bmatrix} = \begin{bmatrix} X + jY \\ U + jV \end{bmatrix},$$
where $X$, $Y$, $U$, and $V$ are appropriately sized real random vectors. Since every complex random vector is identified with a real random vector of twice the length, it is convenient to put $\tilde{Z} := [X', Y']'$ and $\tilde{W} := [U', V']'$. Since the real and imaginary parts of $\Gamma$ are $\Gamma_R := [X', U']'$ and $\Gamma_I := [Y', V']'$, we put
$$\tilde{\Gamma} := \begin{bmatrix} \Gamma_R \\ \Gamma_I \end{bmatrix} = \begin{bmatrix} X \\ U \\ Y \\ V \end{bmatrix}.$$
Assume that $\Gamma$ is Gaussian and circularly symmetric.
(a) Show that $C_{ZW} = 0$ if and only if $R_{\tilde{Z}\tilde{W}} = 0$.
(b) Show that the complex matrix $A = \alpha + j\beta$ solves $AC_W = C_{ZW}$ if and only if
$$\tilde{A} := \begin{bmatrix} \alpha & -\beta \\ \beta & \alpha \end{bmatrix}$$
solves $\tilde{A}R_{\tilde{W}} = R_{\tilde{Z}\tilde{W}}$.
(c) If $A$ solves $AC_W = C_{ZW}$, show that given $W = w$, $Z$ is conditionally Gaussian and circularly symmetric $N(m_Z + A(w - m_W),\ C_Z - AC_{WZ})$. Hint: Problem 20.

29. Let $Z = X + jY$ have density $f_Z(z) = e^{-|z|^2}/\pi$ as discussed in the text.
(a) Find $\operatorname{cov}(Z)$.
(b) Show that $2|Z|^2$ has a chi-squared density with 2 degrees of freedom.

30. Let $X \sim N(m_r, 1)$ and $Y \sim N(m_i, 1)$ be independent, and define the complex random variable $Z := X + jY$. Use the result of Problem 17 in Chapter 4 to show that $|Z|$ has the Rice density.

31. The base station of a wireless communication system generates an $n$-dimensional, complex, circularly symmetric, Gaussian random vector $Z$ with mean zero and covariance matrix $C$. Let $W = C^{-1/2}Z$.
(a) Find the density of $W$.
(b) Let $W_k = U_k + jV_k$. Find the joint density of the pair of real random variables $(U_k, V_k)$.
(c) If
$$\|W\|^2 := \sum_{k=1}^n |W_k|^2 = \sum_{k=1}^n U_k^2 + V_k^2,$$
show that $2\|W\|^2$ has a chi-squared density with $2n$ degrees of freedom.
Remark. (i) The chi-squared density with $2n$ degrees of freedom is the same as the $n$-Erlang density, whose cdf has a closed-form expression given in Problem 12(c) in Chapter 3. (ii) By Problem 13 in Chapter 4, $\sqrt{2}\,\|W\|$ has a Nakagami-$n$ density with parameter $\lambda = 1$.

32. Let $M$ be a real symmetric matrix such that $u'Mu = 0$ for all real vectors $u$.
(a) Show that $v'Mu = 0$ for all real vectors $u$ and $v$. Hint: Consider the quantity $(u + v)'M(u + v)$.
(b) Show that $M = 0$. Hint: Note that $M = 0$ if and only if $Mu = 0$ for all $u$, and $Mu = 0$ if and only if $\|Mu\| = 0$.

33. Show that if (7.12) is equal to $w^HCw/2$ for all $w = \nu + j\theta$, then (7.13) holds. Hint: Use the result of the preceding problem.

34. Assume that circular symmetry (7.13) holds. In this problem you will show that (7.14) reduces to (7.15).
(a) Show that $\det K = (\det C)^2/2^{2n}$. Hint:
$$\begin{aligned}
\det(2K) &= \det\begin{bmatrix} 2R_X & -2R_{YX} \\ 2R_{YX} & 2R_X \end{bmatrix} \\
&= \det\begin{bmatrix} 2R_X + j2R_{YX} & -2R_{YX} \\ 2R_{YX} - j2R_X & 2R_X \end{bmatrix} \\
&= \det\begin{bmatrix} C & -2R_{YX} \\ -jC & 2R_X \end{bmatrix} \\
&= \det\begin{bmatrix} C & -2R_{YX} \\ 0 & C^H \end{bmatrix} \\
&= (\det C)^2.
\end{aligned}$$
Remark. Thus, $K$ is invertible if and only if $C$ is invertible.
(b) Matrix Inverse Formula. For any matrices $A$, $B$, $C$, and $D$, let $V = A + BCD$. If $A$ and $C$ are invertible, show that
$$V^{-1} = A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1}$$
by verifying that the formula for $V^{-1}$ satisfies $VV^{-1} = I$.
(c) Show that
$$K^{-1} = \begin{bmatrix} \Delta^{-1} & R_X^{-1}R_{YX}\Delta^{-1} \\ -\Delta^{-1}R_{YX}R_X^{-1} & \Delta^{-1} \end{bmatrix},$$
where $\Delta := R_X + R_{YX}R_X^{-1}R_{YX}$, by verifying that $KK^{-1} = I$. Hint: Note that $\Delta^{-1}$ satisfies
$$\Delta^{-1} = R_X^{-1} - R_X^{-1}R_{YX}\Delta^{-1}R_{YX}R_X^{-1}.$$
(d) Show that $C^{-1} = (\Delta^{-1} - jR_X^{-1}R_{YX}\Delta^{-1})/2$ by verifying that $CC^{-1} = I$.
(e) Show that (7.14) reduces to (7.15). Hint: Before using the formula in part (d), first use the fact that $(\operatorname{Im} C^{-1})' = -\operatorname{Im} C^{-1}$; this fact can be seen using the equation for $\Delta^{-1}$ given in part (c).
CHAPTER 8
Advanced Concepts in Random Processes

The two most important continuous-time random processes are the Poisson process and the Wiener process, which are introduced in Sections 8.1 and 8.3, respectively. The construction of arbitrary random processes in discrete and continuous time using Kolmogorov's Theorem is discussed in Section 8.4. In addition to the Poisson process, marked Poisson processes and shot noise are introduced in Section 8.1. The extension of the Poisson process to renewal processes is presented briefly in Section 8.2. In Section 8.3, the Wiener process is defined and then interpreted as integrated white noise. The Wiener integral is introduced. The approximation of the Wiener process via a random walk is also outlined. For random walks without finite second moments, it is shown by a simulation example that the limiting process is no longer a Wiener process.
8.1. The Poisson Process
A counting process $\{N_t, t \ge 0\}$ is a random process that counts how many times something happens from time zero up to and including time $t$. A sample path of such a process is shown in Figure 8.1. Such processes always have a staircase form with jumps of height one. The randomness is in the times $T_i$ at which whatever we are counting happens. Note that counting processes are right continuous.

Figure 8.1. Sample path $N_t$ of a counting process. (Staircase plot of $N_t$, with levels 1 to 5, versus $t$; the jump times $T_1, \dots, T_5$ are marked on the time axis.)

Here are some examples of things that we might count:
• $N_t$ = the number of radioactive particles emitted from a sample of radioactive material up to and including time $t$.
• $N_t$ = the number of photoelectrons emitted from a photodetector up to and including time $t$.
• $N_t$ = the number of hits of a web site up to and including time $t$.
• $N_t$ = the number of customers passing through a checkout line at a grocery store up to and including time $t$.
• $N_t$ = the number of vehicles passing through a toll booth on a highway up to and including time $t$.

Suppose that $0 \le t_1 < t_2 < \infty$ are given times, and we want to know how many things have happened between $t_1$ and $t_2$. Now $N_{t_2}$ is the number of occurrences up to and including time $t_2$. If we subtract $N_{t_1}$, the number of occurrences up to and including time $t_1$, then the difference $N_{t_2} - N_{t_1}$ is simply the number of occurrences that happen after $t_1$ up to and including $t_2$. We call differences of the form $N_{t_2} - N_{t_1}$ increments of the process.

A counting process $\{N_t, t \ge 0\}$ is called a Poisson process if the following three conditions hold:
• $N_0 \equiv 0$; i.e., $N_0$ is a constant random variable whose value is always zero.
• For any $0 \le s < t < \infty$, the increment $N_t - N_s$ is a Poisson random variable with parameter $\lambda(t - s)$. The constant $\lambda$ is called the rate or the intensity of the process.
• If the time intervals $(t_1, t_2], (t_2, t_3], \dots, (t_n, t_{n+1}]$ are disjoint, then the increments $N_{t_2} - N_{t_1}, N_{t_3} - N_{t_2}, \dots, N_{t_{n+1}} - N_{t_n}$ are independent; i.e., the process has independent increments. In other words, the numbers of occurrences in disjoint time intervals are independent.

Example 8.1. Photoelectrons are emitted from a photodetector at a rate of $\lambda$ per minute. Find the probability that during each of 2 consecutive minutes, more than 5 photoelectrons are emitted.
Solution. Let $N_i$ denote the number of photoelectrons emitted from time zero up through the $i$th minute. The probability that during the first minute and during the second minute more than 5 photoelectrons are emitted is
$$P(\{N_1 - N_0 \ge 6\} \cap \{N_2 - N_1 \ge 6\}).$$
By the independent increments property, this is equal to
$$P(N_1 - N_0 \ge 6)\, P(N_2 - N_1 \ge 6).$$
Each of these factors is equal to
$$1 - \sum_{k=0}^{5} \frac{\lambda^k e^{-\lambda}}{k!},$$
where we have used the fact that the length of the time increments is one. Hence,
$$P(\{N_1 - N_0 \ge 6\} \cap \{N_2 - N_1 \ge 6\}) = \Bigl(1 - \sum_{k=0}^{5} \frac{\lambda^k e^{-\lambda}}{k!}\Bigr)^2.$$
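The closed form in Example 8.1 is easy to evaluate numerically. The sketch below is only an illustration: the function names are hypothetical, and the rate 4.0 is an assumed value, since the example leaves $\lambda$ symbolic.

```python
import math


def poisson_cdf(k_max: int, mean: float) -> float:
    """Return P(N <= k_max) for a Poisson random variable N with the given mean."""
    return sum(mean**k * math.exp(-mean) / math.factorial(k)
               for k in range(k_max + 1))


def prob_more_than_5_in_each_of_2_minutes(lam: float) -> float:
    """Example 8.1: [1 - P(N <= 5)]^2, where N ~ Poisson(lam * 1 minute)."""
    per_minute = 1.0 - poisson_cdf(5, lam)
    return per_minute**2


# lam = 4.0 per minute is an assumed, illustrative rate (not from the text).
print(prob_more_than_5_in_each_of_2_minutes(4.0))
```

The squaring step is exactly the independent increments property: the two one-minute increments are independent and identically distributed.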
We now compute the correlation and covariance functions of a Poisson process. Since $N_0 \equiv 0$, $N_t = N_t - N_0$ is a Poisson random variable with parameter $\lambda(t - 0) = \lambda t$. Hence,
$$E[N_t] = \lambda t \quad\text{and}\quad \operatorname{var}(N_t) = \lambda t.$$
This further implies that $E[N_t^2] = \lambda t + (\lambda t)^2$. For $0 \le s < t$, we can compute the correlation
$$E[N_t N_s] = E[(N_t - N_s)N_s] + E[N_s^2] = E[(N_t - N_s)(N_s - N_0)] + (\lambda s)^2 + \lambda s.$$
Since $(0, s]$ and $(s, t]$ are disjoint, the above increments are independent, and so
$$E[(N_t - N_s)(N_s - N_0)] = E[N_t - N_s]\, E[N_s - N_0] = \lambda(t - s) \cdot \lambda s.$$
It follows that
$$E[N_t N_s] = (\lambda t)(\lambda s) + \lambda s.$$
We can also compute the covariance,
$$\operatorname{cov}(N_t, N_s) = E[(N_t - \lambda t)(N_s - \lambda s)] = E[N_t N_s] - (\lambda t)(\lambda s) = \lambda s.$$
More generally, given any two times $t_1$ and $t_2$,
$$\operatorname{cov}(N_{t_1}, N_{t_2}) = \lambda \min(t_1, t_2).$$

So far, we have focused on the number of occurrences between two fixed times. Now we focus on the jump times, which are defined by (see Figure 8.1)
$$T_n := \min\{t > 0 : N_t \ge n\}.$$
In other words, $T_n$ is the time of the $n$th jump in Figure 8.1. In particular, if $T_n > t$, then the $n$th jump happens after time $t$; hence, at time $t$ we must have $N_t < n$. Conversely, if at time $t$, $N_t < n$, then the $n$th occurrence has not happened yet; it must happen after time $t$, i.e., $T_n > t$. We can now write
$$P(T_n > t) = P(N_t < n) = \sum_{k=0}^{n-1} \frac{(\lambda t)^k}{k!} e^{-\lambda t}.$$
From Problem 12 in Chapter 3, we see that $T_n$ has an Erlang density with parameters $n$ and $\lambda$. In particular, $T_1$ has an exponential density with parameter $\lambda$. Depending on the context, the jump times are sometimes called arrival times or occurrence times.

In the previous paragraph, we defined the occurrence times in terms of the counting process $\{N_t, t \ge 0\}$. Observe that we can express $N_t$ in terms of the occurrence times since
$$N_t = \sum_{k=1}^{\infty} I_{(0,t]}(T_k).$$
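As a numerical sanity check on the identity $P(T_n > t) = P(N_t < n)$, the following sketch compares the finite Poisson sum with one minus the integral of the Erlang density $\lambda(\lambda x)^{n-1}e^{-\lambda x}/(n-1)!$. The function names and the midpoint-rule integration are choices made here for illustration, not part of the text.

```python
import math


def jump_time_tail(n: int, lam: float, t: float) -> float:
    """P(T_n > t) = P(N_t < n) = sum_{k=0}^{n-1} (lam t)^k e^{-lam t} / k!."""
    return sum((lam * t)**k * math.exp(-lam * t) / math.factorial(k)
               for k in range(n))


def erlang_pdf(n: int, lam: float, x: float) -> float:
    """Erlang density with parameters n and lam, evaluated at x >= 0."""
    return lam * (lam * x)**(n - 1) * math.exp(-lam * x) / math.factorial(n - 1)


def erlang_tail_numeric(n: int, lam: float, t: float, steps: int = 20000) -> float:
    """1 - integral_0^t of the Erlang density, using the midpoint rule."""
    dx = t / steps
    area = sum(erlang_pdf(n, lam, (i + 0.5) * dx) for i in range(steps)) * dx
    return 1.0 - area


# Illustrative parameter values (assumed): n = 3, lam = 2.0, t = 1.5.
print(jump_time_tail(3, 2.0, 1.5), erlang_tail_numeric(3, 2.0, 1.5))
```

The two printed values should agree to within the integration error; for $n = 1$ the sum collapses to $e^{-\lambda t}$, the exponential tail.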
We now define the interarrival times,
$$X_1 = T_1, \qquad X_n = T_n - T_{n-1}, \quad n = 2, 3, \dots.$$
The occurrence times can be recovered from the interarrival times by writing
$$T_n = X_1 + \cdots + X_n.$$
We noted above that $T_n$ is Erlang with parameters $n$ and $\lambda$. Recalling Problem 46 in Chapter 3, which shows that a sum of $n$ i.i.d. exp($\lambda$) random variables is Erlang with parameters $n$ and $\lambda$, we wonder if the $X_i$ are i.i.d. exponential with parameter $\lambda$. This is indeed the case, as shown in [4, p. 301].

Example 8.2. Micrometeors strike the space shuttle according to a Poisson process. The expected time between strikes is 30 minutes. Find the probability that during at least one hour out of 5 consecutive hours, 3 or more micrometeors strike the shuttle.
Solution. The problem statement is telling us that the expected interarrival time is 30 minutes. Since the interarrival times are exp($\lambda$) random variables, their mean is $1/\lambda$. Thus, $1/\lambda = 30$ minutes, or 0.5 hours, and so $\lambda = 2$ strikes per hour. The number of strikes during the $i$th hour is $N_i - N_{i-1}$. The probability that during at least one hour out of 5 consecutive hours, 3 or more micrometeors strike the shuttle is
$$P\Bigl(\bigcup_{i=1}^{5} \{N_i - N_{i-1} \ge 3\}\Bigr) = 1 - P\Bigl(\bigcap_{i=1}^{5} \{N_i - N_{i-1} < 3\}\Bigr) = 1 - \prod_{i=1}^{5} P(N_i - N_{i-1} \le 2),$$
where the last step follows by the independent increments property of the Poisson process. Since $N_i - N_{i-1} \sim$ Poisson($\lambda[i - (i-1)]$), or simply Poisson($\lambda$),
$$P(N_i - N_{i-1} \le 2) = e^{-\lambda}(1 + \lambda + \lambda^2/2) = 5e^{-2},$$
we have
$$P\Bigl(\bigcup_{i=1}^{5} \{N_i - N_{i-1} \ge 3\}\Bigr) = 1 - (5e^{-2})^5 \approx 0.86.$$

⋆Derivation of the Poisson Probabilities
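In the definition of the Poisson process, we required that $N_t - N_s$ be a Poisson random variable with parameter $\lambda(t - s)$. In particular, this implies that
$$\frac{P(N_{t+\Delta t} - N_t = 1)}{\Delta t} = \frac{\lambda\,\Delta t\, e^{-\lambda \Delta t}}{\Delta t} \to \lambda,$$
as $\Delta t \to 0$. The Poisson assumption also implies that
$$\frac{1 - P(N_{t+\Delta t} - N_t = 0)}{\Delta t} = \frac{P(N_{t+\Delta t} - N_t \ge 1)}{\Delta t}
= \frac{1}{\Delta t}\sum_{k=1}^{\infty} \frac{(\lambda \Delta t)^k e^{-\lambda \Delta t}}{k!}
= \lambda e^{-\lambda \Delta t}\Bigl(1 + \sum_{k=2}^{\infty} \frac{(\lambda \Delta t)^{k-1}}{k!}\Bigr) \to \lambda,$$
as $\Delta t \to 0$.

As we now show, the converse is also true as long as we continue to assume that $N_0 \equiv 0$ and that the process has independent increments. So, instead of assuming that $N_t - N_s$ is a Poisson random variable with parameter $\lambda(t - s)$, we assume the following.
• During a sufficiently short time interval, $\Delta t > 0$, $P(N_{t+\Delta t} - N_t = 1) \approx \lambda \Delta t$. By this we mean that
$$\lim_{\Delta t \downarrow 0} \frac{P(N_{t+\Delta t} - N_t = 1)}{\Delta t} = \lambda.$$
This property can be interpreted as saying that the probability of having exactly one occurrence during a short time interval of length $\Delta t$ is approximately $\lambda \Delta t$.
• For sufficiently small $\Delta t > 0$, $P(N_{t+\Delta t} - N_t = 0) \approx 1 - \lambda \Delta t$. More precisely,
$$\lim_{\Delta t \downarrow 0} \frac{1 - P(N_{t+\Delta t} - N_t = 0)}{\Delta t} = \lambda.$$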
By combining this property with the preceding one, we see that during a short time interval of length $\Delta t$, we have either exactly one occurrence or no occurrences. In other words, during a short time interval, at most one occurrence is observed.

For $n = 0, 1, \dots$, let
$$p_n(t) := P(N_t - N_s = n), \quad t \ge s. \tag{8.1}$$
Note that $p_0(s) = P(N_s - N_s = 0) = P(0 = 0) = P(\Omega) = 1$. Now,
$$p_n(t + \Delta t) = P(N_{t+\Delta t} - N_s = n)
= P\Bigl(\bigcup_{k=0}^{n} \bigl[\{N_t - N_s = n - k\} \cap \{N_{t+\Delta t} - N_t = k\}\bigr]\Bigr)
= \sum_{k=0}^{n} P(N_{t+\Delta t} - N_t = k,\; N_t - N_s = n - k)
= \sum_{k=0}^{n} P(N_{t+\Delta t} - N_t = k)\, p_{n-k}(t),$$
using independent increments and (8.1). Break the preceding sum into three terms as follows.
$$p_n(t + \Delta t) = P(N_{t+\Delta t} - N_t = 0)\, p_n(t) + P(N_{t+\Delta t} - N_t = 1)\, p_{n-1}(t) + \sum_{k=2}^{n} P(N_{t+\Delta t} - N_t = k)\, p_{n-k}(t).$$
This enables us to write
$$p_n(t + \Delta t) - p_n(t) = -[1 - P(N_{t+\Delta t} - N_t = 0)]\, p_n(t) + P(N_{t+\Delta t} - N_t = 1)\, p_{n-1}(t) + \sum_{k=2}^{n} P(N_{t+\Delta t} - N_t = k)\, p_{n-k}(t). \tag{8.2}$$
For $n = 0$, only the first term on the right in (8.2) is present, and we can write
$$p_0(t + \Delta t) - p_0(t) = -[1 - P(N_{t+\Delta t} - N_t = 0)]\, p_0(t). \tag{8.3}$$
It then follows that
$$\lim_{\Delta t \downarrow 0} \frac{p_0(t + \Delta t) - p_0(t)}{\Delta t} = -\lambda p_0(t).$$
In other words, we are left with the first-order differential equation,
$$p_0'(t) = -\lambda p_0(t), \quad p_0(s) = 1,$$
whose solution is simply
$$p_0(t) = e^{-\lambda(t - s)}, \quad t \ge s.$$
To handle the case $n \ge 2$, note that since
$$\sum_{k=2}^{n} P(N_{t+\Delta t} - N_t = k)\, p_{n-k}(t) \le \sum_{k=2}^{n} P(N_{t+\Delta t} - N_t = k) \le P(N_{t+\Delta t} - N_t \ge 2) = 1 - [\,P(N_{t+\Delta t} - N_t = 0) + P(N_{t+\Delta t} - N_t = 1)\,],$$
it follows that
$$\lim_{\Delta t \downarrow 0} \frac{\sum_{k=2}^{n} P(N_{t+\Delta t} - N_t = k)\, p_{n-k}(t)}{\Delta t} = \lambda - \lambda = 0.$$
Returning to (8.2), we see that for $n = 1$ and for $n \ge 2$,
$$\lim_{\Delta t \downarrow 0} \frac{p_n(t + \Delta t) - p_n(t)}{\Delta t} = -\lambda p_n(t) + \lambda p_{n-1}(t).$$
This results in the differential-difference equation,
$$p_n'(t) = -\lambda p_n(t) + \lambda p_{n-1}(t), \quad p_0(t) = e^{-\lambda(t - s)}. \tag{8.4}$$
It is easily verified that for $n = 1, 2, \dots$, the functions
$$p_n(t) = \frac{[\lambda(t - s)]^n e^{-\lambda(t - s)}}{n!},$$
which are the claimed Poisson probabilities, solve (8.4).
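One way to build confidence in the claim that the Poisson probabilities solve (8.4) is to integrate the differential-difference equations numerically and compare with the closed form. The sketch below takes $s = 0$ and uses forward Euler steps; both choices, and all function names, are assumptions made for illustration only.

```python
import math


def solve_poisson_odes(lam: float, t_end: float, n_max: int,
                       steps: int = 50000) -> list:
    """Forward-Euler integration, starting at s = 0, of
    p_0' = -lam p_0 and p_n' = -lam p_n + lam p_{n-1} for n = 1..n_max."""
    dt = t_end / steps
    p = [1.0] + [0.0] * n_max          # p_0(0) = 1, p_n(0) = 0 for n >= 1
    for _ in range(steps):
        new_p = [p[0] - lam * p[0] * dt]
        for n in range(1, n_max + 1):
            new_p.append(p[n] + (-lam * p[n] + lam * p[n - 1]) * dt)
        p = new_p
    return p


def poisson_pmf(n: int, mean: float) -> float:
    """Closed-form Poisson probability mean^n e^{-mean} / n!."""
    return mean**n * math.exp(-mean) / math.factorial(n)


# Illustrative values (assumed): lam = 2.0, t = 1.5, n = 0..4.
numeric = solve_poisson_odes(2.0, 1.5, 4)
for n, value in enumerate(numeric):
    print(n, value, poisson_pmf(n, 2.0 * 1.5))
```

Each Euler step mirrors the derivation above: over a short interval of length dt the count either stays put (weight $1 - \lambda\,dt$) or increases by one (weight $\lambda\,dt$), and the contribution of two or more occurrences is dropped.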
Marked Poisson Processes
It is frequently the case that in counting arrivals, each arrival is associated with a mark. For example, suppose packets arrive at a router according to a Poisson process of rate $\lambda$, and that the size of the $i$th packet is $B_i$ bytes. The size $B_i$ is the mark. Thus, the $i$th packet, whose size is $B_i$, arrives at time $T_i$, where $T_i$ is the $i$th occurrence time of the Poisson process. The total number of bytes processed up to time $t$ is
$$M_t := \sum_{i=1}^{N_t} B_i.$$
We usually assume that the mark sequence is independent of the Poisson process. In this case, the mean of $M_t$ can be computed as in Example 5.15. The characteristic function of $M_t$ can be computed as in Problem 37 in Chapter 5.
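A short simulation can make the marked process concrete. The sketch below generates arrivals from i.i.d. exponential interarrival times and adds up one mark per arrival; the function name, the uniform packet sizes, and the parameter values are all assumptions for illustration. When the marks are i.i.d. and independent of the arrivals, conditioning on $N_t$ gives $E[M_t] = \lambda t\, E[B_1]$, which the sample mean should approximate.

```python
import random


def simulate_marked_poisson(lam: float, t: float, draw_mark, rng: random.Random) -> float:
    """Return one realization of M_t = sum_{i=1}^{N_t} B_i.

    Arrival times are built from i.i.d. exp(lam) interarrival times;
    draw_mark(rng) returns the mark B_i of each arrival.
    """
    total = 0.0
    clock = rng.expovariate(lam)       # first arrival time T_1
    while clock <= t:                  # arrival occurs in (0, t]
        total += draw_mark(rng)
        clock += rng.expovariate(lam)  # next arrival time
    return total


# Assumed example: rate 3 per unit time, t = 2, packet sizes uniform on [0, 100].
rng = random.Random(0)
trials = 20000
sample_mean = sum(
    simulate_marked_poisson(3.0, 2.0, lambda r: r.uniform(0.0, 100.0), rng)
    for _ in range(trials)
) / trials
print(sample_mean, "vs lam * t * E[B] =", 3.0 * 2.0 * 50.0)
```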
Shot Noise
Light striking a photodetector generates photoelectrons according to a Poisson process. The rate of the process is proportional to the intensity of the light and the efficiency of the detector. The detector output is then passed through an amplifier of impulse response $h(t)$. We model the input to the amplifier as a train of impulses
$$X_t := \sum_i \delta(t - T_i),$$
where the $T_i$ are the occurrence times of the Poisson process. The amplifier output is
$$Y_t = \int_{-\infty}^{\infty} h(t - \tau) X_\tau \, d\tau
= \sum_{i=1}^{\infty} \int_{-\infty}^{\infty} h(t - \tau)\,\delta(\tau - T_i) \, d\tau
= \sum_{i=1}^{\infty} h(t - T_i).$$
For any realizable system, $h(t)$ is a causal function; i.e., $h(t) = 0$ for $t < 0$. Then
$$Y_t = \sum_{i : T_i \le t} h(t - T_i) = \sum_{i=1}^{N_t} h(t - T_i). \tag{8.5}$$
A process of the form of $Y_t$ is called a shot-noise process or a filtered Poisson process. Note that if the impulse response $h(t)$ is continuous, then so is $Y_t$. If $h(t)$ contains jump discontinuities, then so does $Y_t$. If $h(t)$ is nonnegative, then so is $Y_t$.
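Formula (8.5) translates directly into code once the occurrence times are known. In the sketch below the function names are hypothetical, and the decaying-exponential impulse response and the arrival times are assumed for illustration; any causal $h$ could be substituted.

```python
import math


def shot_noise(t: float, arrival_times, h) -> float:
    """Y_t = sum of h(t - T_i) over the arrivals with T_i <= t, as in (8.5)."""
    return sum(h(t - arrival) for arrival in arrival_times if arrival <= t)


def exp_pulse(u: float, decay: float = 1.0) -> float:
    """Assumed causal impulse response: e^{-decay * u} for u >= 0, else 0."""
    return math.exp(-decay * u) if u >= 0 else 0.0


# Hypothetical occurrence times T_1, T_2, T_3.
arrivals = [0.5, 1.0, 2.5]
print(shot_noise(2.0, arrivals, exp_pulse))   # only T_1 and T_2 contribute
```

Because this particular $h$ jumps from 0 to 1 at the origin, the resulting $Y_t$ jumps by one at each arrival, which illustrates the remark about jump discontinuities.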
8.2. Renewal Processes
Recall that a Poisson process of rate $\lambda$ can be constructed by writing
$$N_t := \sum_{k=1}^{\infty} I_{[0,t]}(T_k),$$
where the arrival times
$$T_k := X_1 + \cdots + X_k,$$
and the $X_k$ are i.i.d. exp($\lambda$) interarrival times. If we drop the requirement that the interarrival times be exponential and let them have arbitrary density $f$, then $N_t$ is called a renewal process.

If we let $F$ denote the cdf corresponding to the interarrival density $f$, it is easy to see that the mean of the process is
$$E[N_t] = \sum_{k=1}^{\infty} F_k(t), \tag{8.6}$$
where $F_k$ is the cdf of $T_k$. The corresponding density, denoted by $f_k$, is the $k$-fold convolution of $f$ with itself. Hence, in general this formula is difficult to work with. However, there is another way to characterize $E[N_t]$. In the problems you are asked to derive the renewal equation
$$E[N_t] = F(t) + \int_0^t E[N_{t-x}]\, f(x) \, dx.$$
The mean function $m(t) := E[N_t]$ of a renewal process is called the renewal function. Note that $m(0) = E[N_0] = 0$, and that the renewal equation can be written in terms of the renewal function as
$$m(t) = F(t) + \int_0^t m(t - x)\, f(x) \, dx.$$

8.3. The Wiener Process
The Wiener process or Brownian motion is a random process that models integrated white noise. Such a model is needed because, as indicated below, white noise itself does not exist as an ordinary random process! In fact, the careful reader will note that in Chapter 6, we never worked directly with white noise, but rather with the output of an LTI system whose input was white noise. The Wiener process provides a mathematically well-defined object that can be used in modeling the output of an LTI system in response to (approximately) white noise.

We say that $\{W_t, t \ge 0\}$ is a Wiener process if the following four conditions hold:
• $W_0 \equiv 0$; i.e., $W_0$ is a constant random variable whose value is always zero.
• For any $0 \le s \le t < \infty$, the increment $W_t - W_s$ is a Gaussian random variable with zero mean and variance $\sigma^2(t - s)$.
• If the time intervals $(t_1, t_2], (t_2, t_3], \dots, (t_n, t_{n+1}]$ are disjoint, then the increments $W_{t_2} - W_{t_1}, W_{t_3} - W_{t_2}, \dots, W_{t_{n+1}} - W_{t_n}$ are independent; i.e., the process has independent increments.
• For each sample point $\omega \in \Omega$, $W_t(\omega)$ as a function of $t$ is continuous. More briefly, we just say that $W_t$ has continuous sample paths.

Remarks. (i) If the parameter $\sigma^2 = 1$, then the process is called a standard Wiener process. A sample path of a standard Wiener process is shown in Figure 8.2.

Figure 8.2. Sample path $W_t$ of a standard Wiener process. (Plot of $W_t$ versus $t$ for $0 \le t \le 1$.)

(ii) Since the first and third properties of the Wiener process are the same as those of the Poisson process, it is easy to show that
$$\operatorname{cov}(W_{t_1}, W_{t_2}) = \sigma^2 \min(t_1, t_2).$$
(iii) The fourth condition, that as a function of $t$, $W_t$ should be continuous, is always assumed in practice, and can always be arranged by construction [4, Section 37]. Actually, the precise statement of the fourth property is
$$P(\{\omega \in \Omega : W_t(\omega) \text{ is a continuous function of } t\}) = 1;$$
i.e., the realizations of $W_t$ are continuous with probability one.
(iv) As indicated in Figure 8.2, the Wiener process is very "wiggly." In fact, it wiggles so much that it is nowhere differentiable with probability one [4, p. 505, Theorem 37.3].
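The first three defining conditions suggest a direct way to sample a Wiener process at finitely many times: start from zero and add independent Gaussian increments with variance $\sigma^2(t - s)$. The sketch below does this and estimates $E[W_{0.3}W_{0.8}]$, which should be close to $\sigma^2 \min(0.3, 0.8)$; the function name and parameter values are assumptions for illustration.

```python
import math
import random


def wiener_path(times, sigma2: float, rng: random.Random) -> list:
    """Sample W at the given increasing times, starting from W_0 = 0,
    by accumulating independent N(0, sigma2 * (t - prev)) increments."""
    path, w, prev = [], 0.0, 0.0
    for t in times:
        w += rng.gauss(0.0, math.sqrt(sigma2 * (t - prev)))
        path.append(w)
        prev = t
    return path


# Assumed values: sigma^2 = 2, sample times 0.3 and 0.8.
rng = random.Random(0)
trials = 20000
acc = 0.0
for _ in range(trials):
    w_s, w_t = wiener_path([0.3, 0.8], 2.0, rng)
    acc += w_s * w_t
print(acc / trials, "vs sigma^2 * min(s, t) =", 2.0 * 0.3)
```

Sampling on a finite grid says nothing about the fourth condition (continuity of sample paths); it only reproduces the finite-dimensional distributions.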
Integrated White-Noise Interpretation of the Wiener Process
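We now justify the interpretation of the Wiener process as integrated white noise. Let $X_t$ be a zero-mean wide-sense stationary white noise process with correlation function $R_X(\tau) = \sigma^2 \delta(\tau)$ and power spectral density $S_X(f) = \sigma^2$. For $t \ge 0$, put
$$W_t = \int_0^t X_\tau \, d\tau.$$
Then $W_t$ is clearly zero mean. For $0 \le s < t < \infty$, consider the increment
$$W_t - W_s = \int_0^t X_\tau \, d\tau - \int_0^s X_\tau \, d\tau = \int_s^t X_\tau \, d\tau.$$
This increment also has zero mean. To compute its variance, write
$$\operatorname{var}(W_t - W_s) = E[(W_t - W_s)^2]
= E\Bigl[\int_s^t X_\tau \, d\tau \int_s^t X_\theta \, d\theta\Bigr]
= \int_s^t \Bigl(\int_s^t E[X_\tau X_\theta] \, d\tau\Bigr) d\theta
= \int_s^t \Bigl(\int_s^t \sigma^2 \delta(\tau - \theta) \, d\tau\Bigr) d\theta
= \int_s^t \sigma^2 \, d\theta
= \sigma^2 (t - s). \tag{8.7}$$
A random process $X_t$ is said to be Gaussian (cf. Section 7.2), if for every function $c(\tau)$,
$$\int_{-\infty}^{\infty} c(\tau) X_\tau \, d\tau$$
is a Gaussian random variable. For example, if our white noise $X_t$ is Gaussian, and if we take $c(\tau) = I_{(0,t]}(\tau)$, then
$$\int_{-\infty}^{\infty} c(\tau) X_\tau \, d\tau = \int_{-\infty}^{\infty} I_{(0,t]}(\tau) X_\tau \, d\tau = \int_0^t X_\tau \, d\tau = W_t$$
is a Gaussian random variable. Similarly,
$$W_t - W_s = \int_s^t X_\tau \, d\tau = \int_{-\infty}^{\infty} I_{(s,t]}(\tau) X_\tau \, d\tau$$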
is a Gaussian random variable with zero mean and variance $\sigma^2(t - s)$. More generally, for time intervals
$$(t_1, t_2], (t_2, t_3], \dots, (t_n, t_{n+1}],$$
consider the random vector
$$[W_{t_2} - W_{t_1}, W_{t_3} - W_{t_2}, \dots, W_{t_{n+1}} - W_{t_n}]'.$$
Recall from Section 7.2 that this vector is normal if every linear combination of its components is a Gaussian random variable. To see that this is indeed the case, write
$$\sum_{i=1}^{n} c_i (W_{t_{i+1}} - W_{t_i})
= \sum_{i=1}^{n} c_i \int_{t_i}^{t_{i+1}} X_\tau \, d\tau
= \sum_{i=1}^{n} c_i \int_{-\infty}^{\infty} I_{(t_i, t_{i+1}]}(\tau) X_\tau \, d\tau
= \int_{-\infty}^{\infty} \Bigl[\sum_{i=1}^{n} c_i I_{(t_i, t_{i+1}]}(\tau)\Bigr] X_\tau \, d\tau.$$
This is a Gaussian random variable if we assume $X_t$ is a Gaussian random process.

Finally, if the time intervals are disjoint, we claim that the increments are uncorrelated (and therefore independent under the Gaussian assumption). Without loss of generality, suppose $0 \le s < t \le \theta < \tau < \infty$. Write
$$E[(W_t - W_s)(W_\tau - W_\theta)] = E\Bigl[\int_s^t X_\alpha \, d\alpha \int_\theta^\tau X_\beta \, d\beta\Bigr]
= \int_s^t \Bigl(\int_\theta^\tau \sigma^2 \delta(\alpha - \beta) \, d\beta\Bigr) d\alpha.$$
In the inside integral, we never have $\alpha = \beta$ because $s < \alpha \le t \le \theta < \beta \le \tau$. Hence, the inner integral evaluates to zero, and the increments $W_t - W_s$ and $W_\tau - W_\theta$ are uncorrelated as claimed.
The Problem with White Noise
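If white noise really existed and
$$W_t = \int_0^t X_\tau \, d\tau,$$
then the derivative of $W_t$ would be $\dot{W}_t = X_t$. Now consider the following calculations. First, write
$$\frac{d}{dt} E[W_t^2] = E\Bigl[\frac{d}{dt} W_t^2\Bigr]
= E[2 W_t \dot{W}_t]
= E\Bigl[2 W_t \lim_{\Delta t \downarrow 0} \frac{W_{t+\Delta t} - W_t}{\Delta t}\Bigr]
= 2 \lim_{\Delta t \downarrow 0} \frac{E[W_t (W_{t+\Delta t} - W_t)]}{\Delta t}.$$
It is a simple calculation using the independent increments property to show that $E[W_t(W_{t+\Delta t} - W_t)] = 0$. Hence
$$\frac{d}{dt} E[W_t^2] = 0.$$
On the other hand, from (8.7), $E[W_t^2] = \sigma^2 t$, and so
$$\frac{d}{dt} E[W_t^2] = \frac{d}{dt} \sigma^2 t = \sigma^2 > 0.$$
The two calculations contradict each other, so the formal manipulations with $\dot{W}_t = X_t$ cannot all be valid.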
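The "simple calculation" referred to above can be written out as follows. It uses only $W_0 \equiv 0$, the independence of the increments over the disjoint intervals $(0, t]$ and $(t, t + \Delta t]$, and the fact that increments have zero mean.

```latex
\begin{align*}
E[W_t (W_{t+\Delta t} - W_t)]
  &= E[(W_t - W_0)(W_{t+\Delta t} - W_t)] \\
  &= E[W_t - W_0] \, E[W_{t+\Delta t} - W_t] \\
  &= 0 \cdot 0 = 0.
\end{align*}
```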
The Wiener Integral
The Wiener process is a well-defined mathematical object. We argued above that $W_t$ behaves like $\int_0^t X_\tau \, d\tau$, where $X_t$ is white Gaussian noise. If such noise is applied to an LTI system starting at time zero, and if the system has impulse response $h$, then the output is
$$\int_0^{\infty} h(t - \tau) X_\tau \, d\tau.$$
We now suppress $t$ and write $g(\tau)$ instead of $h(t - \tau)$. Then we need a well-defined mathematical object to play the role of
$$\int_0^{\infty} g(\tau) X_\tau \, d\tau.$$
The required object is the Wiener integral,
$$\int_0^{\infty} g(\tau) \, dW_\tau,$$
which is defined as follows. For piecewise constant functions $g$ of the form
$$g(\tau) = \sum_{i=1}^{n} g_i I_{(t_i, t_{i+1}]}(\tau),$$
where $0 \le t_1 < t_2 < \cdots < t_{n+1} < \infty$, we define
$$\int_0^{\infty} g(\tau) \, dW_\tau := \sum_{i=1}^{n} g_i (W_{t_{i+1}} - W_{t_i}).$$
Note that the right-hand side is a weighted sum of independent, zero-mean, Gaussian random variables. The sum is therefore Gaussian with zero mean and variance
$$\sum_{i=1}^{n} g_i^2 \operatorname{var}(W_{t_{i+1}} - W_{t_i}) = \sum_{i=1}^{n} g_i^2 \, \sigma^2 (t_{i+1} - t_i) = \sigma^2 \int_0^{\infty} g(\tau)^2 \, d\tau.$$
Because of the zero mean, the variance and second moment are the same. Hence, we also have
$$E\Bigl[\Bigl(\int_0^{\infty} g(\tau) \, dW_\tau\Bigr)^2\Bigr] = \sigma^2 \int_0^{\infty} g(\tau)^2 \, d\tau. \tag{8.8}$$
For functions $g$ that are not piecewise constant, but do satisfy $\int_0^{\infty} g(\tau)^2 \, d\tau < \infty$, the Wiener integral can be defined by a limiting process, which is discussed in more detail in Chapter 10. Basic properties of the Wiener integral are explored in the problems.
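For piecewise constant $g$, the definition is just a finite weighted sum of independent Gaussian increments, so it can be sampled directly. The sketch below draws samples of the Wiener integral and compares the empirical second moment with the right-hand side of (8.8); the function names, levels, and breakpoints are assumptions for illustration.

```python
import math
import random


def wiener_integral_piecewise(levels, breakpoints, sigma2: float,
                              rng: random.Random) -> float:
    """One sample of sum_i g_i (W_{t_{i+1}} - W_{t_i}), where g equals
    levels[i] on (breakpoints[i], breakpoints[i+1]]."""
    total = 0.0
    for g_i, t_lo, t_hi in zip(levels, breakpoints[:-1], breakpoints[1:]):
        increment = rng.gauss(0.0, math.sqrt(sigma2 * (t_hi - t_lo)))
        total += g_i * increment
    return total


def wiener_integral_variance(levels, breakpoints, sigma2: float) -> float:
    """sigma^2 times the integral of g^2, the right-hand side of (8.8)."""
    return sigma2 * sum(g * g * (b - a)
                        for g, a, b in zip(levels, breakpoints[:-1], breakpoints[1:]))


# Assumed example: g = 1 on (0,1], -2 on (1,1.5], 0.5 on (1.5,3]; sigma^2 = 2.
levels, breakpoints, sigma2 = [1.0, -2.0, 0.5], [0.0, 1.0, 1.5, 3.0], 2.0
rng = random.Random(0)
trials = 20000
second_moment = sum(
    wiener_integral_piecewise(levels, breakpoints, sigma2, rng) ** 2
    for _ in range(trials)
) / trials
print(second_moment, "vs", wiener_integral_variance(levels, breakpoints, sigma2))
```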
Random Walk Approximation of the Wiener Process
We now show how to approximate the Wiener process with a piecewise-constant, continuous-time process that is obtained from a discrete-time random walk.

Let $X_1, X_2, \dots$ be i.i.d. $\pm 1$-valued random variables with $P(X_i = \pm 1) = 1/2$. Then each $X_i$ has zero mean and variance one. Let
$$S_n := \sum_{i=1}^{n} X_i.$$
Then $S_n$ has zero mean and variance $n$. The process* $\{S_n, n \ge 0\}$ is called a symmetric random walk. A sample path of $S_n$ is shown in Figure 8.3.

Figure 8.3. Sample path $S_k$, $k = 0, \dots, 74$.

* It is understood that $S_0 \equiv 0$.

We next consider the scaled random walk $S_n/\sqrt{n}$. Note that $S_n/\sqrt{n}$ has zero mean and variance one. By the central limit theorem, which is discussed in detail in Chapter 4, the cdf of $S_n/\sqrt{n}$ converges to the standard normal cdf.

To approximate the continuous-time Wiener process, we use the piecewise-constant, continuous-time process
$$W_t^{(n)} := \frac{1}{\sqrt{n}} S_{\lfloor nt \rfloor},$$
where $\lfloor \tau \rfloor$ denotes the greatest integer that is less than or equal to $\tau$. For example, if $n = 100$ and $t = 3.1476$, then
$$W_{3.1476}^{(100)} = \frac{1}{\sqrt{100}} S_{\lfloor 100 \cdot 3.1476 \rfloor} = \frac{1}{10} S_{\lfloor 314.76 \rfloor} = \frac{1}{10} S_{314}.$$
Figure 8.4 shows a sample path of $W_t^{(75)}$ plotted as a function of $t$. As the continuous variable $t$ ranges over $[0, 1]$, the values of $\lfloor 75t \rfloor$ range over the integers $0, 1, \dots, 75$. Thus, the constant levels seen in Figure 8.4 are
$$\frac{1}{\sqrt{75}} S_{\lfloor 75t \rfloor} = \frac{1}{\sqrt{75}} S_k, \quad k = 0, \dots, 75.$$

Figure 8.4. Sample path $W_t^{(75)}$.

In other words, the levels in Figure 8.4 are $1/\sqrt{75}$ times those in Figure 8.3. Figures 8.5 and 8.6 show sample paths of $W_t^{(150)}$ and $W_t^{(10,000)}$, respectively. As $n$ increases, the sample paths look more and more like those of the Wiener process shown in Figure 8.2.

Since the central limit theorem applies to any i.i.d. sequence with finite variance, the preceding convergence to the Wiener process holds if we replace the $\pm 1$-valued $X_i$ by any i.i.d. sequence with finite variance.† However, if the $X_i$ only have finite mean but infinite variance, other limit processes can be obtained. For example, suppose the $X_i$ are i.i.d. having a student's t density with $\nu = 3/2$ degrees of freedom. Then the $X_i$ have zero mean and infinite variance. As can be seen in Figure 8.7, the limiting process has jumps, which is inconsistent with the Wiener process, which has continuous sample paths.
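The construction of $W_t^{(n)}$ from the random walk is short enough to sketch in code. The function names below are hypothetical, and the seed and walk length are arbitrary; the index computation repeats the worked example with $n = 100$ and $t = 3.1476$.

```python
import math
import random


def random_walk(num_steps: int, rng: random.Random) -> list:
    """Return the partial sums S_0 = 0, S_1, ..., S_num_steps of
    i.i.d. steps that are +1 or -1 with probability 1/2 each."""
    sums = [0]
    for _ in range(num_steps):
        sums.append(sums[-1] + rng.choice((-1, 1)))
    return sums


def scaled_walk_value(sums, n: int, t: float) -> float:
    """W_t^{(n)} = S_{floor(n t)} / sqrt(n)."""
    return sums[math.floor(n * t)] / math.sqrt(n)


rng = random.Random(0)
sums = random_walk(400, rng)
# With n = 100 and t = 3.1476, floor(314.76) = 314, so the value is S_314 / 10.
print(scaled_walk_value(sums, 100, 3.1476), sums[314] / 10)
```

Plotting `scaled_walk_value(sums, n, t)` over a fine grid of $t$ in $[0, 1]$ for increasing $n$ would produce pictures of the kind described for Figures 8.4 to 8.6.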
† If the mean of the $X_i$ is $m$ and the variance is $\sigma^2$, then we must replace $S_n$ by $(S_n - nm)/\sigma$.

Figure 8.5. Sample path $W_t^{(150)}$.

Figure 8.6. Sample path $W_t^{(10,000)}$.

Figure 8.7. Sample path $S_{\lfloor nt \rfloor}/n^{2/3}$ for $n = 10,000$ when the $X_i$ have a student's t density with 3/2 degrees of freedom.

8.4. Specification of Random Processes

Finitely Many Random Variables

In this text we have often seen statements of the form, "Let $X$, $Y$, and $Z$ be random variables with $P((X, Y, Z) \in B) = \mu(B)$," where $B \subset \mathbb{R}^3$, and $\mu(B)$ is given by some formula. For example, if $X$, $Y$, and $Z$ are discrete, we would have
$$\mu(B) = \sum_i \sum_j \sum_k I_B(x_i, y_j, z_k)\, p_{i,j,k}, \tag{8.9}$$
where the $x_i$, $y_j$, and $z_k$ are the values taken by the random variables, and the $p_{i,j,k}$ are nonnegative numbers that sum to one. If $X$, $Y$, and $Z$ are jointly continuous, we would have
$$\mu(B) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I_B(x, y, z)\, f(x, y, z) \, dx \, dy \, dz, \tag{8.10}$$
where $f$ is nonnegative and integrates to one. In fact, if $X$ is discrete and $Y$ and $Z$ are jointly continuous, we would have
$$\mu(B) = \sum_i \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I_B(x_i, y, z)\, f(x_i, y, z) \, dy \, dz, \tag{8.11}$$
where $f$ is nonnegative and
$$\sum_i \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_i, y, z) \, dy \, dz = 1.$$
The big question is, given a formula for computing $\mu(B)$, how do we know that a sample space $\Omega$, a probability measure $P$, and functions $X(\omega)$, $Y(\omega)$, and $Z(\omega)$ exist such that we indeed have
$$P((X, Y, Z) \in B) = \mu(B), \quad B \subset \mathbb{R}^3.$$
As we show in the next paragraph, the answer turns out to be rather simple.

If $\mu$ is defined by expressions such as (8.9)-(8.11), it can be shown that $\mu$ is a probability measure‡ on $\mathbb{R}^3$; the case of (8.9) is easy; the other two require some background in measure theory, e.g., [4]. More generally, if we are given any probability measure $\mu$ on $\mathbb{R}^3$, we take $\Omega = \mathbb{R}^3$ and put $P(A) := \mu(A)$ for $A \subset \Omega = \mathbb{R}^3$. For $\omega = (\omega_1, \omega_2, \omega_3)$, we define
$$X(\omega) := \omega_1, \quad Y(\omega) := \omega_2, \quad\text{and}\quad Z(\omega) := \omega_3.$$
It then follows that for $B \subset \mathbb{R}^3$,
$$\{\omega \in \Omega : (X(\omega), Y(\omega), Z(\omega)) \in B\}$$
reduces to
$$\{\omega \in \Omega : (\omega_1, \omega_2, \omega_3) \in B\} = B.$$
Hence, $P(\{(X, Y, Z) \in B\}) = P(B) = \mu(B)$.

For fixed $n \ge 1$, the foregoing ideas generalize in the obvious way to show the existence of a sample space $\Omega$, probability measure $P$, and random variables $X_1, \dots, X_n$ with
$$P((X_1, X_2, \dots, X_n) \in B) = \mu(B), \quad B \subset \mathbb{R}^n,$$
where $\mu$ is any given probability measure defined on $\mathbb{R}^n$.

‡ See Section 1.3 to review the definition of probability measure.

Infinite Sequences (Discrete Time)

Consider an infinite sequence of random variables such as $X_1, X_2, \dots$. While $(X_1, \dots, X_n)$ takes values in $\mathbb{R}^n$, the infinite sequence $(X_1, X_2, \dots)$ takes values in $\mathbb{R}^\infty$. If such an infinite sequence of random variables exists on some sample space $\Omega$ equipped with some probability measure $P$, then¹
$$P((X_1, X_2, \dots) \in B), \quad B \subset \mathbb{R}^\infty,$$
is a probability measure on $\mathbb{R}^\infty$. We denote this probability measure by $\mu(B)$. Similarly, $P$ induces on $\mathbb{R}^n$ the measure
$$\mu_n(B_n) = P((X_1, \dots, X_n) \in B_n), \quad B_n \subset \mathbb{R}^n.$$
Of course, $P$ induces on $\mathbb{R}^{n+1}$ the measure
$$\mu_{n+1}(B_{n+1}) = P((X_1, \dots, X_n, X_{n+1}) \in B_{n+1}), \quad B_{n+1} \subset \mathbb{R}^{n+1}.$$
Now, if we take $B_{n+1} = B_n \times \mathbb{R}$ for any $B_n \subset \mathbb{R}^n$, then
$$\mu_{n+1}(B_n \times \mathbb{R}) = P((X_1, \dots, X_n, X_{n+1}) \in B_n \times \mathbb{R})
= P((X_1, \dots, X_n) \in B_n,\; X_{n+1} \in \mathbb{R})
= P(\{(X_1, \dots, X_n) \in B_n\} \cap \{X_{n+1} \in \mathbb{R}\})
= P(\{(X_1, \dots, X_n) \in B_n\} \cap \Omega)
= P((X_1, \dots, X_n) \in B_n)
= \mu_n(B_n).$$
Thus, we have the consistency condition
$$\mu_{n+1}(B_n \times \mathbb{R}) = \mu_n(B_n), \quad B_n \subset \mathbb{R}^n, \quad n = 1, 2, \dots. \tag{8.12}$$
Next, observe that since $(X_1, \dots, X_n) \in B_n$ if and only if
$$(X_1, X_2, \dots, X_n, X_{n+1}, \dots) \in B_n \times \mathbb{R} \times \cdots,$$
it follows that $\mu_n(B_n)$ is equal to
$$P((X_1, X_2, \dots, X_n, X_{n+1}, \dots) \in B_n \times \mathbb{R} \times \cdots),$$
which is simply $\mu(B_n \times \mathbb{R} \times \cdots)$. Thus,
$$\mu(B_n \times \mathbb{R} \times \cdots) = \mu_n(B_n), \quad B_n \subset \mathbb{R}^n, \quad n = 1, 2, \dots. \tag{8.13}$$
The big question here is, if we are given a sequence of probability measures $\mu_n$ on $\mathbb{R}^n$ for $n = 1, 2, \dots$, does there exist a probability measure $\mu$ on $\mathbb{R}^\infty$ such that (8.13) holds? The answer to this question was given by Kolmogorov, and is known as Kolmogorov's Consistency Theorem or as Kolmogorov's Extension Theorem. It says that if the consistency condition (8.12) holds,§ then a probability measure $\mu$ exists on $\mathbb{R}^\infty$ such that (8.13) holds [8, p. 188].

We now specialize the foregoing discussion to the case of integer-valued random variables $X_1, X_2, \dots$. For each $n = 1, 2, \dots$, let $p_n(i_1, \dots, i_n)$ denote a proposed joint probability mass function of $X_1, \dots, X_n$. In other words, we want a random process for which
$$P((X_1, \dots, X_n) \in B_n) = \sum_{i_1 = -\infty}^{\infty} \cdots \sum_{i_n = -\infty}^{\infty} I_{B_n}(i_1, \dots, i_n)\, p_n(i_1, \dots, i_n).$$
More precisely, with $\mu_n(B_n)$ given by the above right-hand side, does there exist a measure $\mu$ on $\mathbb{R}^\infty$ such that (8.13) holds? By Kolmogorov's theorem, we just need to show that (8.12) holds.

We now show that (8.12) is equivalent to
$$\sum_{j = -\infty}^{\infty} p_{n+1}(i_1, \dots, i_n, j) = p_n(i_1, \dots, i_n). \tag{8.14}$$

§ Knowing the measure $\mu_n$, we can always write the corresponding cdf as
$$F_n(x_1, \dots, x_n) = \mu_n\bigl((-\infty, x_1] \times \cdots \times (-\infty, x_n]\bigr).$$
Conversely, if we know the $F_n$, there is a unique measure $\mu_n$ on $\mathbb{R}^n$ such that the above formula holds [4, Section 12]. Hence, the consistency condition has the equivalent formulation in terms of cdfs [8, p. 189],
$$\lim_{x_{n+1} \to \infty} F_{n+1}(x_1, \dots, x_n, x_{n+1}) = F_n(x_1, \dots, x_n).$$
Chap. 8 Advanced Concepts in Random Processes
The left-hand side of (8.12) takes the form 1 X i1 =,1
1 X
1 X
IBn IR (i1 ; : : :; in; j)pn+1 (i1 ; : : :; in ; j):
in =,1 j =,1
(8.15)
Observe that IBn IR = IBn IIR = IBn . Hence, the above sum becomes 1 X
i1 =,1
1 X
1 X
in =,1 j =,1
IBn (i1 ; : : :; in)pn+1 (i1 ; : : :; in; j);
which, using (8.14), simpli es to 1 X i1 =,1
1 X in =,1
IBn (i1 ; : : :; in)pn(i1 ; : : :; in );
(8.16)
which is our de nition of n(Bn ). Conversely, if in (8.12), or equivalently in (8.15) and (8.16), we take Bn to be the singleton set Bn = f(j1; : : :; jn)g; then we obtain (8.14). The next question is how to construct a sequence of probability mass functions satisfying (8.14). Observe that (8.14) can be rewritten as 1 X
pn+1 (i1 ; : : :; in; j) = 1: j =,1 pn (i1 ; : : :; in) In other words, if pn (i1 ; : : :; in ) is a valid joint pmf, and if we de ne pn+1 (i1 ; : : :; in ; j) := pn+1j1;:::;n(j ji1 ; : : :; in) pn(i1 ; : : :; in); where pn+1jn;:::;1(j ji1 ; : : :; in ) is a valid pmf in the variable j (i.e., is nonnegative and the sum over j is one), then (8.14) will automatically hold!
Example 8.3. Let $q(i)$ be any pmf. Take $p_1(i) := q(i)$, and take $p_{n+1|1,\dots,n}(j \mid i_1, \dots, i_n) := q(j)$. Then, for example,
$$p_2(i, j) = p_{2|1}(j \mid i)\, p_1(i) = q(j)\, q(i),$$
and
$$p_3(i, j, k) = p_{3|1,2}(k \mid i, j)\, p_2(i, j) = q(i)\, q(j)\, q(k).$$
More generally,
$$p_n(i_1, \dots, i_n) = q(i_1) \cdots q(i_n).$$
Thus, the $X_n$ are i.i.d. with common pmf $q$.

Example 8.4. Again let $q(i)$ be any pmf. Suppose that for each $i$, $r(j \mid i)$ is a pmf in the variable $j$; i.e., $r$ is any conditional pmf. Put $p_1(i) := q(i)$, and put¶
$$p_{n+1|1,\dots,n}(j \mid i_1, \dots, i_n) := r(j \mid i_n).$$
Then
$$p_2(i, j) = p_{2|1}(j \mid i)\, p_1(i) = r(j \mid i)\, q(i),$$
and
$$p_3(i, j, k) = p_{3|1,2}(k \mid i, j)\, p_2(i, j) = q(i)\, r(j \mid i)\, r(k \mid j).$$
More generally,
$$p_n(i_1, \dots, i_n) = q(i_1)\, r(i_2 \mid i_1)\, r(i_3 \mid i_2) \cdots r(i_n \mid i_{n-1}).$$
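The consistency condition (8.14) can be checked mechanically for the construction in Example 8.4. The sketch below restricts attention to a finite state space $\{0, 1, 2\}$, which is an assumption made here so that the sums are finite; the function names and the particular $q$ and $r$ are likewise only illustrative.

```python
from itertools import product


def markov_joint_pmf(q, r, path) -> float:
    """p_n(i_1, ..., i_n) = q(i_1) r(i_2|i_1) ... r(i_n|i_{n-1}),
    where r[i][j] stores r(j | i)."""
    prob = q[path[0]]
    for prev, cur in zip(path[:-1], path[1:]):
        prob *= r[prev][cur]
    return prob


def is_consistent(q, r, n: int) -> bool:
    """Check (8.14): summing p_{n+1} over its last argument gives p_n,
    for every choice of (i_1, ..., i_n)."""
    states = range(len(q))
    for path in product(states, repeat=n):
        marginal = sum(markov_joint_pmf(q, r, path + (j,)) for j in states)
        if abs(marginal - markov_joint_pmf(q, r, path)) > 1e-12:
            return False
    return True


# Assumed initial pmf q and conditional pmf r (each row of r sums to one).
q = [0.2, 0.5, 0.3]
r = [[0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2],
     [0.0, 0.5, 0.5]]
print(is_consistent(q, r, 3))
```

If a row of `r` does not sum to one, so that $r(\cdot \mid i)$ is not a valid pmf, the check fails, which matches the requirement stated just before Example 8.3.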
Continuous-Time Random Processes
The consistency condition for a continuous-time random process is a little more complicated. The reason is that in discrete time, between any two consecutive integers, there are no other integers, while in continuous time, for any $t_1 < t_2$, there are infinitely many times between $t_1$ and $t_2$.

Now suppose that for any $t_1 < \cdots < t_{n+1}$, we are given a probability measure $\mu_{t_1, \dots, t_{n+1}}$ on $\mathbb{R}^{n+1}$. Fix any $B_n \subset \mathbb{R}^n$. For $k = 1, \dots, n+1$, we define $B_{n,k} \subset \mathbb{R}^{n+1}$ by
$$B_{n,k} := \{(x_1, \dots, x_{n+1}) : (x_1, \dots, x_{k-1}, x_{k+1}, \dots, x_{n+1}) \in B_n \text{ and } x_k \in \mathbb{R}\}.$$
Note the special cases‖
$$B_{n,1} = \mathbb{R} \times B_n \quad\text{and}\quad B_{n,n+1} = B_n \times \mathbb{R}.$$
The continuous-time consistency condition is that [40, p. 244] for $k = 1, \dots, n+1$,
$$\mu_{t_1, \dots, t_{n+1}}(B_{n,k}) = \mu_{t_1, \dots, t_{k-1}, t_{k+1}, \dots, t_{n+1}}(B_n). \tag{8.17}$$
If this condition holds, then there is a sample space $\Omega$, a probability measure $P$, and random variables $X_t$ such that
$$P((X_{t_1}, \dots, X_{t_n}) \in B_n) = \mu_{t_1, \dots, t_n}(B_n), \quad B_n \subset \mathbb{R}^n,$$
for any $n \ge 1$ and any times $t_1 < \cdots < t_n$.

¶ In Chapter 9, we will see that $X_n$ is a Markov chain with stationary transition probabilities.
‖ In the previous subsection, we only needed the case $k = n + 1$. If we had wanted to allow two-sided discrete-time processes $X_n$ for $n$ any positive or negative integer, then both $k = 1$ and $k = n + 1$ would have been needed (Problem 28).
8.5. Notes
Notes §8.4: Specification of Random Processes

Note 1. Comments analogous to Note 1 in Chapter 5 apply here. Specifically, the set $B$ must be restricted to a suitable $\sigma$-field $\mathcal{B}^\infty$ of subsets of $\mathbb{R}^\infty$. Typically, $\mathcal{B}^\infty$ is taken to be the smallest $\sigma$-field containing all sets of the form
$$\{\omega = (\omega_1, \omega_2, \dots) \in \mathbb{R}^\infty : (\omega_1, \dots, \omega_n) \in B_n\},$$
where $B_n$ is a Borel subset of $\mathbb{R}^n$, and $n$ ranges over the positive integers [4, p. 485].
8.6. Problems
Problems §8.1: The Poisson Process

1. Hits to a certain web site occur according to a Poisson process of rate $\lambda = 3$ per minute. What is the probability that there are no hits in a 10-minute period? Give a formula and then evaluate it to obtain a numerical answer.

2. Cell-phone calls processed by a certain wireless base station arrive according to a Poisson process of rate $\lambda = 12$ per minute. What is the probability that more than 3 calls arrive in a 20-second interval? Give a formula and then evaluate it to obtain a numerical answer.

3. Let $N_t$ be a Poisson process with rate $\lambda = 2$, and consider a fixed observation interval $(0, 5]$.
(a) What is the probability that $N_5 = 10$?
(b) What is the probability that $N_i - N_{i-1} = 2$ for all $i = 1, \dots, 5$?

4. A sports clothing store sells football jerseys with a certain very popular number on them according to a Poisson process of rate 3 crates per day. Find the probability that on 5 days in a row, the store sells at least 3 crates each day.

5. A sporting goods store sells a certain fishing rod according to a Poisson process of rate 2 per day. Find the probability that on at least one day during the week, the store sells at least 3 rods. (Note: week = five days.)

6. Let $N_t$ be a Poisson process with rate $\lambda$, and let $\Delta t > 0$.
(a) Show that $N_{t+\Delta t} - N_t$ and $N_t$ are independent.
(b) Show that $P(N_{t+\Delta t} = k + \ell \mid N_t = k) = P(N_{t+\Delta t} - N_t = \ell)$.
(c) Evaluate $P(N_t = k \mid N_{t+\Delta t} = k + \ell)$.
(d) Show that as a function of $k = 0, \dots, n$, $P(N_t = k \mid N_{t+\Delta t} = n)$ has the binomial($n, p$) probability mass function and identify $p$.

7. Customers arrive at a store according to a Poisson process of rate $\lambda$. What is the expected time until the $n$th customer arrives? What is the expected time between customers?

8. During the winter, snowstorms occur according to a Poisson process of intensity $\lambda = 2$ per week.
(a) What is the average time between snowstorms?
(b) What is the probability that no storms occur during a given two-week period?
(c) If winter lasts 12 weeks, what is the expected number of snowstorms?
(d) Find the probability that during at least one of the 12 weeks of winter, there are at least 5 snowstorms.

9. Space shuttles are launched according to a Poisson process. The average time between launches is two months.
(a) Find the probability that there are no launches during a four-month period.
(b) Find the probability that during at least one month out of four consecutive months, there are at least two launches.

10. Diners arrive at a popular restaurant according to a Poisson process $N_t$ of rate $\lambda$. A confused maitre d' seats the $i$th diner with probability $p$, and turns the diner away with probability $1 - p$. Let $Y_i = 1$ if the $i$th diner is seated, and $Y_i = 0$ otherwise. The number of diners seated up to time $t$ is
$$M_t := \sum_{i=1}^{N_t} Y_i.$$
Show that $M_t$ is a Poisson random variable and find its parameter. Assume the $Y_i$ are independent of each other and of the Poisson process.
Remark. $M_t$ is an example of a thinned Poisson process.

11. Lightning strikes occur according to a Poisson process of rate $\lambda$ per minute. The energy of the $i$th strike is $V_i$. Assume the energy is independent of the occurrence times. What is the expected energy of a storm that lasts for $t$ minutes? What is the average time between lightning strikes?

⋆12. Find the mean and characteristic function of the shot-noise random variable $Y_t$ in equation (8.5).
Problems §8.2: Renewal Processes
13. In the case of a Poisson process, show that the right-hand side of (8.6) reduces to $\lambda t$. Hint: Formula (4.2) in Section 4.2 may be helpful.
14. Derive the renewal equation
$$E[N_t] = F(t) + \int_0^t E[N_{t-x}]\, f(x)\, dx$$
as follows. (a) Show that $E[N_t \mid X_1 = x] = 0$ for $x > t$. (b) Show that $E[N_t \mid X_1 = x] = 1 + E[N_{t-x}]$ for $x \le t$. (c) Use parts (a) and (b) and the laws of total probability and substitution to derive the renewal equation.
15. Solve the renewal equation for the renewal function $m(t) := E[N_t]$ if the interarrival density is $f \sim \exp(\lambda)$. Hint: Differentiate the renewal equation and show that $m'(t) = \lambda$. It then follows that $m(t) = \lambda t$, which is what we expect since $f \sim \exp(\lambda)$ implies $N_t$ is a Poisson process of rate $\lambda$.
Problems §8.3: The Wiener Process
16. For $0 \le s < t < \infty$, use the definition of the Wiener process to show that $E[W_t W_s] = \sigma^2 s$.
17. Let the random vector $X = [W_{t_1}, \dots, W_{t_n}]'$, $0 < t_1 < \cdots < t_n < \infty$, consist of samples of a Wiener process. Find the covariance matrix of $X$.
18. For piecewise constant $g$ and $h$, show that
$$\int_0^\infty g(\tau)\, dW_\tau + \int_0^\infty h(\tau)\, dW_\tau = \int_0^\infty \bigl[g(\tau) + h(\tau)\bigr]\, dW_\tau.$$
Hint: The problem is easy if $g$ and $h$ are constant over the same intervals.
19. Use (8.8) to derive the formula
$$E\left[\int_0^\infty g(\tau)\, dW_\tau \int_0^\infty h(\tau)\, dW_\tau\right] = \sigma^2 \int_0^\infty g(\tau) h(\tau)\, d\tau.$$
Hint: Consider the expectation
$$E\left[\left(\int_0^\infty g(\tau)\, dW_\tau - \int_0^\infty h(\tau)\, dW_\tau\right)^2\right],$$
which can be evaluated in two different ways. The first way is to expand the square and take expectations term by term, applying (8.8) where possible. The second way is to observe that since
$$\int_0^\infty g(\tau)\, dW_\tau - \int_0^\infty h(\tau)\, dW_\tau = \int_0^\infty \bigl[g(\tau) - h(\tau)\bigr]\, dW_\tau,$$
the above second moment can be computed directly using (8.8).
20. Let
$$Y_t = \int_0^t g(\tau)\, dW_\tau, \quad t \ge 0.$$
(a) Use (8.8) to show that
$$E[Y_t^2] = \sigma^2 \int_0^t |g(\tau)|^2\, d\tau.$$
Hint: Observe that
$$\int_0^t g(\tau)\, dW_\tau = \int_0^\infty g(\tau) I_{(0,t]}(\tau)\, dW_\tau.$$
(b) Show that $Y_t$ has correlation function
$$R_Y(t_1, t_2) = \sigma^2 \int_0^{\min(t_1, t_2)} |g(\tau)|^2\, d\tau, \quad t_1, t_2 \ge 0.$$
21. Consider the process
$$Y_t = e^{-\lambda t} V + \int_0^t e^{-\lambda(t-\tau)}\, dW_\tau, \quad t \ge 0,$$
where $W_t$ is a Wiener process independent of $V$, and $V$ has zero mean and variance $q^2$. Show that $Y_t$ has correlation function
$$R_Y(t_1, t_2) = e^{-\lambda(t_1+t_2)}\left(q^2 - \frac{\sigma^2}{2\lambda}\right) + \frac{\sigma^2}{2\lambda}\, e^{-\lambda|t_1-t_2|}.$$
Remark. If $V$ is normal, then the process $Y_t$ is Gaussian and is known as an Ornstein-Uhlenbeck process.
22. Let $W_t$ be a Wiener process, and put
$$Y_t := \frac{e^{-\lambda t}}{\sqrt{2\lambda}}\, W_{e^{2\lambda t}}.$$
Show that
$$R_Y(t_1, t_2) = \frac{\sigma^2}{2\lambda}\, e^{-\lambda|t_1-t_2|}.$$
In light of the remark above, this is another way to define an Ornstein-Uhlenbeck process.
23. So far we have defined the Wiener process $W_t$ only for $t \ge 0$. When defining $W_t$ for all $t$, we continue to assume that $W_0 \equiv 0$; that for $s < t$, the increment $W_t - W_s$ is a Gaussian random variable with zero mean and variance $\sigma^2(t - s)$; that $W_t$ has independent increments; and that $W_t$ has continuous sample paths. The only difference is that $s$ or both $s$ and $t$ can be negative, and that increments can be located anywhere in time, not just over intervals of positive time. In the following take $\sigma^2 = 1$. (a) For $t > 0$, show that $E[W_t^2] = t$. (b) For $s < 0$, show that $E[W_s^2] = -s$. (c) Show that
$$E[W_t W_s] = \frac{|t| + |s| - |t - s|}{2}.$$
Problems §8.4: Specification of Random Processes
24. Suppose $X$ and $Y$ are random variables with
$$\wp\bigl((X, Y) \in A\bigr) = \sum_i \int_{-\infty}^{\infty} I_A(x_i, y)\, f_{XY}(x_i, y)\, dy,$$
where the $x_i$ are distinct real numbers, and $f_{XY}$ is a nonnegative function satisfying
$$\sum_i \int_{-\infty}^{\infty} f_{XY}(x_i, y)\, dy = 1.$$
(a) Show that
$$\wp(X = x_k) = \int_{-\infty}^{\infty} f_{XY}(x_k, y)\, dy.$$
(b) Show that for $C \subset \mathbb{R}$,
$$\wp(Y \in C) = \int_C \sum_i f_{XY}(x_i, y)\, dy.$$
In other words, $Y$ has marginal density
$$f_Y(y) = \sum_i f_{XY}(x_i, y).$$
(c) Show that
$$\wp(Y \in C \mid X = x_i) = \int_C \frac{f_{XY}(x_i, y)}{p_X(x_i)}\, dy.$$
In other words,
$$f_{Y|X}(y \mid x_i) = \frac{f_{XY}(x_i, y)}{p_X(x_i)}.$$
(d) For $B \subset \mathbb{R}$, show that if we define
$$\wp(X \in B \mid Y = y) := \sum_i I_B(x_i)\, p_{X|Y}(x_i \mid y),$$
where
$$p_{X|Y}(x_i \mid y) := \frac{f_{XY}(x_i, y)}{f_Y(y)},$$
then
$$\int_{-\infty}^{\infty} \wp(X \in B \mid Y = y)\, f_Y(y)\, dy = \wp(X \in B).$$
In other words, we have the law of total probability.
25. Let $F$ be the standard normal cdf. Then $F$ is a one-to-one mapping from $(-\infty, \infty)$ onto $(0, 1)$. Therefore, $F$ has an inverse, $F^{-1}: (0, 1) \to (-\infty, \infty)$. If $U \sim$ uniform$(0, 1)$, show that $X := F^{-1}(U)$ has $F$ for its cdf.
26. Consider the cdf
$$F(x) := \begin{cases} 0, & x < 0, \\ x^2, & 0 \le x < 1/2, \\ 1/4, & 1/2 \le x < 1, \\ x/2, & 1 \le x < 2, \\ 1, & x \ge 2. \end{cases}$$
(a) Sketch $F(x)$. (b) For $0 < u < 1$, sketch
$$G(u) := \inf\{x \in \mathbb{R} : F(x) \ge u\}.$$
Hint: First identify the set $B_u := \{x \in \mathbb{R} : F(x) \ge u\}$. Then find its infimum.
27. As illustrated in the previous problem, an arbitrary cdf $F$ is usually not invertible, either because the equation $F(x) = u$ has more than one solution, e.g., $F(x) = 1/4$, or because it has no solution, e.g., $F(x) = 3/8$. However, for any cdf $F$, we can always introduce the function
$$G(u) := \inf\{x \in \mathbb{R} : F(x) \ge u\}, \quad 0 < u < 1,$$
which, you will now show, can play the role of $F^{-1}$ in Problem 25. (a) Show that if $0 < u < 1$ and $x \in \mathbb{R}$, then $G(u) \le x$ if and only if $u \le F(x)$. (b) Let $U \sim$ uniform$(0, 1)$, and put $X := G(U)$. Show that $X$ has cdf $F$.
28. In the text we considered discrete-time processes $X_n$ for $n = 1, 2, \dots$. The consistency condition (8.12) arose from the requirement that
$$\wp\bigl((X_1, \dots, X_n, X_{n+1}) \in B \times \mathbb{R}\bigr) = \wp\bigl((X_1, \dots, X_n) \in B\bigr),$$
where $B \subset \mathbb{R}^n$. For processes $X_n$ with $n = 0, \pm 1, \pm 2, \dots$, we require not only
$$\wp\bigl((X_m, \dots, X_n, X_{n+1}) \in B \times \mathbb{R}\bigr) = \wp\bigl((X_m, \dots, X_n) \in B\bigr),$$
but also
$$\wp\bigl((X_{m-1}, X_m, \dots, X_n) \in \mathbb{R} \times B\bigr) = \wp\bigl((X_m, \dots, X_n) \in B\bigr),$$
where now $B \subset \mathbb{R}^{n-m+1}$. Let $\mu_{m,n}(B)$ be a proposed formula for the above right-hand side. Then the two consistency conditions are
$$\mu_{m,n+1}(B \times \mathbb{R}) = \mu_{m,n}(B) \quad\text{and}\quad \mu_{m-1,n}(\mathbb{R} \times B) = \mu_{m,n}(B).$$
For integer-valued random processes, show that these are equivalent to
$$\sum_{j=-\infty}^{\infty} p_{m,n+1}(i_m, \dots, i_n, j) = p_{m,n}(i_m, \dots, i_n)$$
and
$$\sum_{j=-\infty}^{\infty} p_{m-1,n}(j, i_m, \dots, i_n) = p_{m,n}(i_m, \dots, i_n),$$
where $p_{m,n}$ is the proposed joint probability mass function of $X_m, \dots, X_n$.
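The function $G$ of Problems 26 and 27 is what makes inverse-transform sampling work for a cdf that is not invertible. The following Python sketch is an illustration added here and is not part of the original problem set. The names `F`, `G`, and `sample` are hypothetical, and the closed form used in `G` was worked out by hand for the particular cdf of Problem 26, so it should be read as one possible answer to the sketch requested in Problem 26(b) rather than as a general algorithm.

```python
import math
import random


def F(x):
    """The cdf of Problem 26: it has a flat stretch on [1/2, 1) and a jump at x = 1."""
    if x < 0:
        return 0.0
    if x < 0.5:
        return x * x
    if x < 1:
        return 0.25
    if x < 2:
        return x / 2
    return 1.0


def G(u):
    """G(u) = inf{x : F(x) >= u} for 0 < u < 1, derived by hand for the F above."""
    if not 0 < u < 1:
        raise ValueError("u must lie strictly between 0 and 1")
    if u <= 0.25:
        return math.sqrt(u)  # invert x**2 on [0, 1/2]
    if u <= 0.5:
        return 1.0           # F jumps from 1/4 to 1/2 at x = 1
    return 2 * u             # invert x/2 on (1, 2)


def sample(n, seed=0):
    """Draw n values X = G(U) with U uniform on (0, 1)."""
    rng = random.Random(seed)
    values = []
    while len(values) < n:
        u = rng.random()     # lies in [0, 1); skip the endpoint 0
        if u > 0:
            values.append(G(u))
    return values
```

Because $F$ is flat on $[1/2, 1)$, no sample ever falls strictly between $1/2$ and $1$, and because $F$ jumps at $x = 1$, the value $1$ is returned with probability $1/4$.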
29. Let $q$ be any pmf, and let $r(j \mid i)$ be any conditional pmf. In addition, assume that $\sum_k q(k)\, r(j \mid k) = q(j)$. Put
$$p_{m,n}(i_m, \dots, i_n) := q(i_m)\, r(i_{m+1} \mid i_m)\, r(i_{m+2} \mid i_{m+1}) \cdots r(i_n \mid i_{n-1}).$$
Show that both consistency conditions for pmfs in the preceding problem are satisfied.
Remark. This process is strictly stationary as defined in Section 6.2 since upon writing out the formula for $p_{m+k,n+k}(i_m, \dots, i_n)$, we see that it does not depend on $k$.
30. Let $\mu_n$ be a probability measure on $\mathbb{R}^n$, and suppose that it is given in terms of a joint density $f_n$, i.e.,
$$\mu_n(B_n) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} I_{B_n}(x_1, \dots, x_n)\, f_n(x_1, \dots, x_n)\, dx_n \cdots dx_1.$$
Show that the consistency condition (8.12) holds if and only if
$$\int_{-\infty}^{\infty} f_{n+1}(x_1, \dots, x_n, x_{n+1})\, dx_{n+1} = f_n(x_1, \dots, x_n).$$
31. Generalize the preceding problem for the continuous-time consistency condition (8.17).
32. Let $W_t$ be a Wiener process. Let $f_{t_1, \dots, t_n}$ denote the joint density of $W_{t_1}, \dots, W_{t_n}$. Find $f_{t_1, \dots, t_n}$ and show that it satisfies the density version of (8.17) that you derived in the preceding problem. Hint: The joint density should be Gaussian.
CHAPTER 9
Introduction to Markov Chains

A Markov chain is a random process with the property that given the values of the process from time zero up through the current time, the conditional probability of the value of the process at any future time depends only on its value at the current time. This is equivalent to saying that the future and the past are conditionally independent given the present (cf. Problem 29 in Chapter 1). Markov chains often have intuitively pleasing interpretations. Some examples discussed in this chapter are random walks (without barriers and with barriers, which may be reflecting, absorbing, or neither), queuing systems (with finite or infinite buffers), birth-death processes (with or without spontaneous generation), life (with states being "healthy," "sick," and "death"), and the gambler's ruin problem.

Discrete-time Markov chains are introduced in Section 9.1 via the random walk and its variations. The emphasis is on the Chapman-Kolmogorov equation and the finding of stationary distributions. Continuous-time Markov chains are introduced in Section 9.2 via the Poisson process. The emphasis is on Kolmogorov's forward and backward differential equations, and their use to find stationary distributions.
9.1. Discrete-Time Markov Chains
A sequence of discrete random variables, $X_0, X_1, \dots$ is called a Markov chain if for $n \ge 1$,
$$\wp(X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \dots, X_0 = x_0)$$
is equal to
$$\wp(X_{n+1} = x_{n+1} \mid X_n = x_n).$$
In other words, given the sequence of values $x_0, \dots, x_n$, the conditional probability of what $X_{n+1}$ will be depends only on the value of $x_n$.

Example 9.1 (Random Walk). Let $X_0$ be an integer-valued random variable that is independent of the i.i.d. sequence $Z_1, Z_2, \dots$, where $\wp(Z_n = 1) = a$, $\wp(Z_n = -1) = b$, and $\wp(Z_n = 0) = 1 - (a + b)$. Show that if
$$X_n := X_{n-1} + Z_n, \quad n = 1, 2, \dots,$$
then $X_n$ is a Markov chain.

Solution. It helps to write out
$$\begin{aligned} X_1 &= X_0 + Z_1 \\ X_2 &= X_1 + Z_2 = X_0 + Z_1 + Z_2 \\ &\;\;\vdots \\ X_n &= X_{n-1} + Z_n = X_0 + Z_1 + \cdots + Z_n. \end{aligned}$$
The point here is that $(X_0, \dots, X_n)$ is a function of $(X_0, Z_1, \dots, Z_n)$, and hence, $Z_{n+1}$ and $(X_0, \dots, X_n)$ are independent. Now observe that
$$\wp(X_{n+1} = x_{n+1} \mid X_n = x_n, \dots, X_0 = x_0)$$
is equal to
$$\wp(X_n + Z_{n+1} = x_{n+1} \mid X_n = x_n, \dots, X_0 = x_0).$$
Using the substitution law, this becomes
$$\wp(Z_{n+1} = x_{n+1} - x_n \mid X_n = x_n, \dots, X_0 = x_0).$$
On account of the independence of $Z_{n+1}$ and $(X_0, \dots, X_n)$, the above conditional probability is equal to
$$\wp(Z_{n+1} = x_{n+1} - x_n).$$
Since this depends on $x_n$ but not on $x_{n-1}, \dots, x_0$, it follows by the next example that $\wp(Z_{n+1} = x_{n+1} - x_n) = \wp(X_{n+1} = x_{n+1} \mid X_n = x_n)$.

The Markov chain of the preceding example is called a random walk on the integers. The random walk is said to be symmetric if $a = b = 1/2$. A realization of a symmetric random walk is shown in Figure 9.1. Notice that each point differs from the preceding one by $\pm 1$. To restrict the random walk to the nonnegative integers, we can take $X_n = \max(0, X_{n-1} + Z_n)$ (Problem 1).

Example 9.2. Show that if
$$\wp(X_{n+1} = x_{n+1} \mid X_n = x_n, \dots, X_0 = x_0)$$
is a function of $x_n$ but not $x_{n-1}, \dots, x_0$, then the above conditional probability is equal to
$$\wp(X_{n+1} = x_{n+1} \mid X_n = x_n).$$
Solution. We use the identity $\wp(A \cap B) = \wp(A \mid B)\, \wp(B)$, where $A = \{X_{n+1} = x_{n+1}\}$ and $B = \{X_n = x_n, \dots, X_0 = x_0\}$. Hence, if
$$\wp(X_{n+1} = x_{n+1} \mid X_n = x_n, \dots, X_0 = x_0) = h(x_n),$$
then the joint probability mass function
$$\wp(X_{n+1} = x_{n+1}, X_n = x_n, X_{n-1} = x_{n-1}, \dots, X_0 = x_0)$$
is equal to
$$h(x_n)\, \wp(X_n = x_n, X_{n-1} = x_{n-1}, \dots, X_0 = x_0).$$
Summing both of these formulas over all values of $x_{n-1}, \dots, x_0$ shows that
$$\wp(X_{n+1} = x_{n+1}, X_n = x_n) = h(x_n)\, \wp(X_n = x_n).$$
Hence, $h(x_n) = \wp(X_{n+1} = x_{n+1} \mid X_n = x_n)$ as claimed.

Figure 9.1. Realization of a symmetric random walk $X_n$.
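A realization like the one in Figure 9.1 can be generated directly from the recursion $X_n = X_{n-1} + Z_n$ of Example 9.1. The short Python sketch below is an illustration added here, not the code used to produce the figure; the function name `random_walk` and its parameters are hypothetical.

```python
import random


def random_walk(n_steps, a=0.5, b=0.5, x0=0, seed=None):
    """Simulate X_n = X_{n-1} + Z_n with P(Z=1)=a, P(Z=-1)=b, P(Z=0)=1-(a+b)."""
    if a < 0 or b < 0 or a + b > 1:
        raise ValueError("need a, b >= 0 and a + b <= 1")
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        u = rng.random()      # uniform on [0, 1)
        if u < a:
            z = 1
        elif u < a + b:
            z = -1
        else:
            z = 0
        path.append(path[-1] + z)
    return path


# A symmetric walk (a = b = 1/2) over 80 steps, the horizon shown in Figure 9.1.
symmetric_path = random_walk(80, seed=1)
```

Each new value is computed from the current value and a fresh independent $Z_n$ only, which is exactly the structure that makes the sequence a Markov chain.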
State Space and Transition Probabilities
The set of possible values that the random variables $X_n$ can take is called the state space of the chain. In the rest of this chapter, we take the state space to be the set of integers or some specified subset of the integers. The conditional probabilities
$$\wp(X_{n+1} = j \mid X_n = i)$$
are called transition probabilities. In this chapter, we assume that the transition probabilities do not depend on time $n$. Such a Markov chain is said to have stationary transition probabilities or to be time homogeneous. For a time-homogeneous Markov chain, we use the notation
$$p_{ij} := \wp(X_{n+1} = j \mid X_n = i)$$
for the transition probabilities. The $p_{ij}$ are also called the one-step transition probabilities because they are the probabilities of going from state $i$ to state $j$ in one time step. One of the most common ways to specify the transition probabilities is with a state transition diagram as in Figure 9.2. This particular diagram says that the state space is the finite set $\{0, 1\}$, and that $p_{01} = a$, $p_{10} = b$, $p_{00} = 1 - a$, and $p_{11} = 1 - b$.

Figure 9.2. A state transition diagram. The diagram says that the state space is the finite set $\{0, 1\}$, and that $p_{01} = a$, $p_{10} = b$, $p_{00} = 1 - a$, and $p_{11} = 1 - b$.

Note that the sum of all the probabilities leaving a state must be one. This is because for each state $i$,
$$\sum_j p_{ij} = \sum_j \wp(X_{n+1} = j \mid X_n = i) = 1.$$
The transition probabilities $p_{ij}$ can be arranged in a matrix $P$, called the transition matrix, whose $i\,j$ entry is $p_{ij}$. For the chain in Figure 9.2,
$$P = \begin{bmatrix} 1 - a & a \\ b & 1 - b \end{bmatrix}.$$
The top row of $P$ contains the probabilities $p_{0j}$, which are obtained by noting the probabilities written next to all the arrows leaving state 0. Similarly, the probabilities written next to all the arrows leaving state 1 are found in the bottom row of $P$.
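The row-sum property is easy to check mechanically. The following Python sketch is an added illustration; the helper names `two_state_matrix` and `is_stochastic` and the numerical values of $a$ and $b$ are assumptions chosen only for the example.

```python
def two_state_matrix(a, b):
    """Transition matrix of the chain in Figure 9.2: row i lists p_i0, p_i1."""
    return [[1 - a, a],
            [b, 1 - b]]


def is_stochastic(P, tol=1e-12):
    """True if every entry is nonnegative and every row sums to one."""
    return all(min(row) >= 0 and abs(sum(row) - 1) <= tol for row in P)


# Hypothetical values of a and b, chosen only for illustration.
P = two_state_matrix(0.3, 0.1)
valid = is_stochastic(P)
```

A nested list is used so that `P[i][j]` reads the same way as $p_{ij}$ in the text.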
Examples
The general random walk on the integers has the state transition diagram shown in Figure 9.3. Note that the Markov chain constructed in Example 9.1 is a special case in which $a_i = a$ and $b_i = b$ for all $i$. The state transition diagram is telling us that
$$p_{ij} = \begin{cases} b_i, & j = i - 1, \\ 1 - (a_i + b_i), & j = i, \\ a_i, & j = i + 1, \\ 0, & \text{otherwise.} \end{cases} \tag{9.1}$$

Figure 9.3. State transition diagram for a random walk on the integers.
Hence, the transition matrix $P$ is infinite, tridiagonal, and its $i$th row is
$$\begin{bmatrix} \cdots & 0 & b_i & 1 - (a_i + b_i) & a_i & 0 & \cdots \end{bmatrix}.$$
Frequently, it is convenient to introduce a barrier at zero, leading to the state transition diagram in Figure 9.4. In this case, we speak of the random walk with barrier. If $a_0 = 1$, the barrier is said to be reflecting. If $a_0 = 0$, the barrier is said to be absorbing. Once a chain hits an absorbing state, the chain stays in that state from that time onward. If $a_i = a$ and $b_i = b$ for all $i$, the Markov chain is viewed as a model for a queue with an infinite buffer.

Figure 9.4. State transition diagram for a random walk with barrier at the origin (also called a birth-death process).

A random walk with barrier at the origin is also known as a birth-death process. With this terminology, the state $i$ is taken to be a population, say of bacteria. In this case, if $a_0 > 0$, there is spontaneous generation. If $b_i = 0$ for all $i$, we have a pure birth process. For $i \ge 1$, the formula for $p_{ij}$ is given by (9.1) above, while for $i = 0$,
$$p_{0j} = \begin{cases} 1 - a_0, & j = 0, \\ a_0, & j = 1, \\ 0, & \text{otherwise.} \end{cases} \tag{9.2}$$
The transition matrix $P$ is the tridiagonal, semi-infinite matrix
$$P = \begin{bmatrix} 1 - a_0 & a_0 & 0 & 0 & 0 & \cdots \\ b_1 & 1 - (a_1 + b_1) & a_1 & 0 & 0 & \cdots \\ 0 & b_2 & 1 - (a_2 + b_2) & a_2 & 0 & \cdots \\ 0 & 0 & b_3 & 1 - (a_3 + b_3) & a_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots & \ddots & \ddots \end{bmatrix}.$$
Sometimes it is useful to consider a random walk with barriers at the origin and at $N$, as shown in Figure 9.5. The formula for $p_{ij}$ is given by (9.1) above for $1 \le i \le N - 1$, by (9.2) above for $i = 0$, and, for $i = N$, by
$$p_{Nj} = \begin{cases} b_N, & j = N - 1, \\ 1 - b_N, & j = N, \\ 0, & \text{otherwise.} \end{cases} \tag{9.3}$$
This chain is viewed as a model for a queue with a finite buffer, especially if $a_i = a$ and $b_i = b$ for all $i$. When $a_0 = 0$ and $b_N = 0$, the barriers at 0 and $N$ are absorbing, and the chain is a model for the gambler's ruin problem. In this problem, a gambler starts at time zero with $1 \le i \le N - 1$ dollars, and plays until he either runs out of money (absorption into state zero) or acquires a fortune of $N$ dollars and stops playing (absorption into state $N$).

Figure 9.5. State transition diagram for a queue with a finite buffer.

If $N = 2$ and $b_2 = 0$, the chain can be interpreted as the story of life if we view state $i = 0$ as being the "healthy" state, $i = 1$ as being the "sick" state, and $i = 2$ as being the "death" state. In this model, if you are healthy (in state 0), you remain healthy with probability $1 - a_0$ and become sick (move to state 1) with probability $a_0$. If you are sick (in state 1), you become healthy (move to state 0) with probability $b_1$, remain sick (stay in state 1) with probability $1 - (a_1 + b_1)$, or die (move to state 2) with probability $a_1$. Since state 2 is absorbing ($b_2 = 0$), once you enter this state, you never leave.
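Formulas (9.1), (9.2), and (9.3) can be collected into one small routine that builds the finite transition matrix for a walk with barriers at 0 and $N$. The Python sketch below is an added illustration. The function name `birth_death_matrix` is hypothetical, and the convention of passing full-length lists with `b[0] = 0` and `a[N] = 0` is a choice made here so that all three formulas reduce to a single line.

```python
def birth_death_matrix(a, b):
    """Build the (N+1) x (N+1) matrix of (9.1)-(9.3).

    a[i] is the probability of stepping from i to i+1 and b[i] the
    probability of stepping from i to i-1.  Both lists have length N+1,
    with b[0] == 0 (no step below 0) and a[N] == 0 (no step above N).
    """
    size = len(a)
    if len(b) != size or b[0] != 0 or a[-1] != 0:
        raise ValueError("need len(a) == len(b), b[0] == 0 and a[N] == 0")
    P = [[0.0] * size for _ in range(size)]
    for i in range(size):
        if i > 0:
            P[i][i - 1] = b[i]
        if i < size - 1:
            P[i][i + 1] = a[i]
        P[i][i] = 1 - (a[i] + b[i])   # probability of staying put
    return P


# Gambler's ruin with N = 4 and a fair game: a_0 = 0 and b_N = 0 make
# states 0 and 4 absorbing.  The value 1/2 is an illustrative choice.
ruin = birth_death_matrix([0, 0.5, 0.5, 0.5, 0], [0, 0.5, 0.5, 0.5, 0])
```

In the resulting matrix the first and last rows have a single entry equal to one on the diagonal, which is the matrix form of the statement that an absorbing state is never left.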
Stationary Distributions

The $n$-step transition probabilities are defined by
$$p_{ij}^{(n)} := \wp(X_n = j \mid X_0 = i).$$
This is the probability of going from state $i$ (at time zero) to state $j$ in $n$ steps. For a chain with stationary transition probabilities, it will be shown later that the $n$-step transition probabilities are also stationary; i.e., for $m = 1, 2, \dots$,
$$\wp(X_{n+m} = j \mid X_m = i) = \wp(X_n = j \mid X_0 = i) = p_{ij}^{(n)}.$$
We also note that
$$p_{ij}^{(0)} = \wp(X_0 = j \mid X_0 = i) = \delta_{ij},$$
where $\delta_{ij}$ denotes the Kronecker delta, which is one if $i = j$ and is zero otherwise. A Markov chain with stationary transition probabilities also satisfies the Chapman-Kolmogorov equation,
$$p_{ij}^{(n+m)} = \sum_k p_{ik}^{(n)} p_{kj}^{(m)}. \tag{9.4}$$
This equation, which is derived later, says that to go from state $i$ to state $j$ in $n + m$ steps, first you go to some intermediate state $k$ in $n$ steps, and then you go from state $k$ to state $j$ in $m$ steps. Taking $m = 1$, we have the special case
$$p_{ij}^{(n+1)} = \sum_k p_{ik}^{(n)} p_{kj}.$$
If $n = 1$ as well,
$$p_{ij}^{(2)} = \sum_k p_{ik} p_{kj}.$$
This equation says that the matrix with entries $p_{ij}^{(2)}$ is equal to the product $PP$. More generally, the matrix with entries $p_{ij}^{(n)}$, called the $n$-step transition matrix, is equal to $P^n$. The Chapman-Kolmogorov equation thus says that
$$P^{n+m} = P^n P^m.$$
For many Markov chains, it can be shown [17, Section 6.4] that
$$\lim_{n \to \infty} \wp(X_n = j \mid X_0 = i)$$
exists and does not depend on $i$. In other words, if the chain runs for a long time, it reaches a "steady state" in which the probability of being in state $j$ does not depend on the initial state of the chain. If the above limit exists and does not depend on $i$, we put
$$\pi_j := \lim_{n \to \infty} \wp(X_n = j \mid X_0 = i).$$
We call $\{\pi_j\}$ the stationary distribution or the equilibrium distribution of the chain. Note that if we sum both sides over $j$, we get
$$\sum_j \pi_j = \sum_j \lim_{n \to \infty} \wp(X_n = j \mid X_0 = i) = \lim_{n \to \infty} \sum_j \wp(X_n = j \mid X_0 = i) = \lim_{n \to \infty} 1 = 1. \tag{9.5}$$
To find the $\pi_j$, we can use the Chapman-Kolmogorov equation. If $p_{ij}^{(n)} \to \pi_j$, then taking limits in
$$p_{ij}^{(n+1)} = \sum_k p_{ik}^{(n)} p_{kj}$$
shows that
$$\pi_j = \sum_k \pi_k p_{kj}.$$
If we think of $\pi$ as a row vector with entries $\pi_j$, and if we think of $P$ as the matrix with entries $p_{ij}$, then the above equation says that
$$\pi = \pi P.$$
Figure 9.6. State transition diagram for a queuing system with an infinite buffer.
On account of (9.5), $\pi$ cannot be the zero vector. Hence, $\pi$ is a left eigenvector of $P$ with eigenvalue 1. To say this another way, $I - P$ is singular; i.e., there are many solutions of $\pi(I - P) = 0$. The solution we want must satisfy not only $\pi = \pi P$, but also the normalization condition,
$$\sum_j \pi_j = 1,$$
derived in (9.5).
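The relations $P^{n+m} = P^n P^m$ and $\pi = \pi P$ can be checked numerically on a small chain. The Python sketch below is an added illustration for the two-state chain of Figure 9.2. The helper names and the values $a = 0.3$, $b = 0.1$ are assumptions made for the example, and raising $P$ to a high power is used here only as a simple way to approximate the limit, not as a recommended numerical method.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def matpow(P, n):
    """n-step transition matrix P**n, computed by repeated multiplication."""
    size = len(P)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, P)
    return result


a, b = 0.3, 0.1                       # illustrative values only
P = [[1 - a, a], [b, 1 - b]]

# Chapman-Kolmogorov with n = 2 and m = 3: P^5 should equal P^2 P^3.
lhs = matpow(P, 5)
rhs = matmul(matpow(P, 2), matpow(P, 3))
ck_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))

# For large n every row of P^n approaches the stationary row vector pi.
P200 = matpow(P, 200)
pi_est = P200[0]
pi_next = matmul([pi_est], P)[0]      # pi P, which should reproduce pi
```

For this chain, solving $\pi = \pi P$ with $\pi_0 + \pi_1 = 1$ by hand gives $\pi_0 = b/(a+b)$ and $\pi_1 = a/(a+b)$, which is what both rows of `P200` approach.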
Example 9.3. The state transition diagram for a queuing system with an infinite buffer is shown in Figure 9.6. Find the stationary distribution of the chain.

Solution. We begin by writing out
$$\pi_j = \sum_k \pi_k p_{kj} \tag{9.6}$$
for $j = 0, 1, 2, \dots$. For each $j$, the coefficients $p_{kj}$ are obtained by inspection of the state transition diagram. For $j = 0$, we must consider
$$\pi_0 = \sum_k \pi_k p_{k0}.$$
We need the values of $p_{k0}$. From the diagram, the only way to get to state 0 is from state 0 itself (with probability $p_{00} = 1 - a$) or from state 1 (with probability $p_{10} = b$). The other $p_{k0} = 0$. Hence,
$$\pi_0 = \pi_0 (1 - a) + \pi_1 b.$$
We can rearrange this to get
$$\pi_1 = \frac{a}{b}\, \pi_0.$$
Now put $j = 1$ in (9.6). The state transition diagram tells us that the only way to enter state 1 is from states 0, 1, and 2, with probabilities $a$, $1 - (a + b)$, and $b$, respectively. Hence,
$$\pi_1 = \pi_0 a + \pi_1 [1 - (a + b)] + \pi_2 b.$$
Substituting $\pi_1 = (a/b)\pi_0$ yields $\pi_2 = (a/b)^2 \pi_0$. In general, if we substitute $\pi_j = (a/b)^j \pi_0$ and $\pi_{j-1} = (a/b)^{j-1} \pi_0$ into
$$\pi_j = \pi_{j-1} a + \pi_j [1 - (a + b)] + \pi_{j+1} b,$$
then we obtain $\pi_{j+1} = (a/b)^{j+1} \pi_0$. We conclude that
$$\pi_j = \left(\frac{a}{b}\right)^j \pi_0, \quad j = 0, 1, 2, \dots.$$
To solve for $\pi_0$, we use the fact that
$$\sum_{j=0}^{\infty} \pi_j = 1,$$
or
$$\pi_0 \sum_{j=0}^{\infty} \left(\frac{a}{b}\right)^j = 1.$$
Provided $a < b$, so that the series converges, the geometric series formula shows that
$$\pi_0 = 1 - a/b,$$
and
$$\pi_j = \left(\frac{a}{b}\right)^j (1 - a/b).$$
In other words, the stationary distribution is a geometric$_0(a/b)$ probability mass function.
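The result of Example 9.3 can be sanity-checked by substituting the geometric pmf back into the balance equations (9.6). The Python sketch below is an added illustration; the names `geometric0_pmf` and `balance_residual` are hypothetical, and the values $a = 0.2$, $b = 0.5$ are assumptions chosen so that $a < b$.

```python
def geometric0_pmf(rho, j):
    """geometric0(rho) pmf: (1 - rho) * rho**j for j = 0, 1, 2, ..."""
    return (1 - rho) * rho ** j


a, b = 0.2, 0.5          # illustrative values with a < b
rho = a / b


def balance_residual(j):
    """pi_j minus the right-hand side of (9.6) for the queue of Figure 9.6."""
    def pi(k):
        return geometric0_pmf(rho, k)
    if j == 0:
        return pi(0) - (pi(0) * (1 - a) + pi(1) * b)
    return pi(j) - (pi(j - 1) * a + pi(j) * (1 - (a + b)) + pi(j + 1) * b)


# Residuals should vanish up to rounding, and the pmf should sum to one.
worst = max(abs(balance_residual(j)) for j in range(30))
total = sum(geometric0_pmf(rho, j) for j in range(200))
```

The check covers only the first 30 states and a truncated sum, so it supports the closed form without replacing the induction argument in the solution.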
Derivation of the Chapman-Kolmogorov Equation

We begin by showing that
$$p_{ij}^{(n+1)} = \sum_l p_{il}^{(n)} p_{lj}.$$
To derive this, we need the following law of total conditional probability. We claim that
$$\wp(X = x \mid Z = z) = \sum_y \wp(X = x \mid Y = y, Z = z)\, \wp(Y = y \mid Z = z).$$
The right-hand side of this equation is simply
$$\sum_y \frac{\wp(X = x, Y = y, Z = z)}{\wp(Y = y, Z = z)} \cdot \frac{\wp(Y = y, Z = z)}{\wp(Z = z)}.$$
Canceling common factors and then summing over $y$ yields
$$\frac{\wp(X = x, Z = z)}{\wp(Z = z)} = \wp(X = x \mid Z = z).$$
We can now write
$$\begin{aligned} p_{ij}^{(n+1)} &= \wp(X_{n+1} = j \mid X_0 = i) \\ &= \sum_{l_n} \cdots \sum_{l_1} \wp(X_{n+1} = j \mid X_n = l_n, \dots, X_1 = l_1, X_0 = i)\, \wp(X_n = l_n, \dots, X_1 = l_1 \mid X_0 = i) \\ &= \sum_{l_n} \cdots \sum_{l_1} \wp(X_{n+1} = j \mid X_n = l_n)\, \wp(X_n = l_n, \dots, X_1 = l_1 \mid X_0 = i) \\ &= \sum_{l_n} \wp(X_{n+1} = j \mid X_n = l_n)\, \wp(X_n = l_n \mid X_0 = i) \\ &= \sum_l p_{il}^{(n)} p_{lj}. \end{aligned}$$
This establishes that for $m = 1$,
$$p_{ij}^{(n+m)} = \sum_k p_{ik}^{(n)} p_{kj}^{(m)}.$$
We now assume it is true for $m$ and show it is true for $m + 1$. Write
$$\begin{aligned} p_{ij}^{(n+[m+1])} &= p_{ij}^{([n+m]+1)} \\ &= \sum_l p_{il}^{(n+m)} p_{lj} \\ &= \sum_l \sum_k p_{ik}^{(n)} p_{kl}^{(m)} p_{lj} \\ &= \sum_k p_{ik}^{(n)} \sum_l p_{kl}^{(m)} p_{lj} \\ &= \sum_k p_{ik}^{(n)} p_{kj}^{(m+1)}. \end{aligned}$$
Stationarity of the n-step Transition Probabilities
We would now like to use the Chapman-Kolmogorov equation to show that the $n$-step transition probabilities are stationary; i.e.,
$$\wp(X_{n+m} = j \mid X_m = i) = \wp(X_n = j \mid X_0 = i) = p_{ij}^{(n)}.$$
To do this, we need the fact that for any $0 \le \nu < n$,
$$\wp(X_{n+1} = j \mid X_n = i, X_\nu = l) = \wp(X_{n+1} = j \mid X_n = i). \tag{9.7}$$
To establish this fact, we use the law of total conditional probability to write the left-hand side as
$$\sum_{l_{n-1}} \cdots \sum_{l_0} \wp(X_{n+1} = j \mid X_n = i, X_{n-1} = l_{n-1}, \dots, X_\nu = l, \dots, X_0 = l_0)\, \wp(X_{n-1} = l_{n-1}, \dots, X_{\nu+1} = l_{\nu+1}, X_{\nu-1} = l_{\nu-1}, \dots, X_0 = l_0 \mid X_n = i, X_\nu = l),$$
where the sum is $(n - 1)$-fold, the sum over $l_\nu$ being omitted. By the Markov property, this simplifies to
$$\sum_{l_{n-1}} \cdots \sum_{l_0} \wp(X_{n+1} = j \mid X_n = i)\, \wp(X_{n-1} = l_{n-1}, \dots, X_{\nu+1} = l_{\nu+1}, X_{\nu-1} = l_{\nu-1}, \dots, X_0 = l_0 \mid X_n = i, X_\nu = l),$$
which further simplifies to $\wp(X_{n+1} = j \mid X_n = i)$.

We can now show that the $n$-step transition probabilities are stationary. For a time-homogeneous Markov chain, the case $n = 1$ holds by definition. Assume it holds for $n$, and use the law of total conditional probability to write
$$\begin{aligned} \wp(X_{n+m+1} = j \mid X_m = i) &= \sum_k \wp(X_{n+m+1} = j \mid X_{n+m} = k, X_m = i)\, \wp(X_{n+m} = k \mid X_m = i) \\ &= \sum_k \wp(X_{n+m+1} = j \mid X_{n+m} = k)\, \wp(X_{n+m} = k \mid X_m = i) \\ &= \sum_k p_{ik}^{(n)} p_{kj} \\ &= p_{ij}^{(n+1)}, \end{aligned}$$
by the Chapman-Kolmogorov equation. Here the second equality uses (9.7), and the third uses the induction hypothesis together with time homogeneity.
9.2. Continuous-Time Markov Chains
A family of integer-valued random variables, $\{X_t, t \ge 0\}$, is called a Markov chain if for all $n \ge 1$, and for all $0 \le s_0 < \cdots < s_{n-1} < s < t$,
$$\wp(X_t = j \mid X_s = i, X_{s_{n-1}} = i_{n-1}, \dots, X_{s_0} = i_0) = \wp(X_t = j \mid X_s = i).$$
In other words, given the sequence of values $i_0, \dots, i_{n-1}, i$, the conditional probability of what $X_t$ will be depends only on the condition $X_s = i$. The quantity $\wp(X_t = j \mid X_s = i)$ is called the transition probability.

Example 9.4. Show that the Poisson process of rate $\lambda$ is a Markov chain.

Solution. To begin, observe that
$$\wp(N_t = j \mid N_s = i, N_{s_{n-1}} = i_{n-1}, \dots, N_{s_0} = i_0)$$
is equal to
$$\wp(N_t - i = j - i \mid N_s = i, N_{s_{n-1}} = i_{n-1}, \dots, N_{s_0} = i_0).$$
By the substitution law, this is equal to
$$\wp(N_t - N_s = j - i \mid N_s = i, N_{s_{n-1}} = i_{n-1}, \dots, N_{s_0} = i_0). \tag{9.8}$$
Since
$$(N_s, N_{s_{n-1}}, \dots, N_{s_0}) \tag{9.9}$$
is a function of
$$(N_s - N_{s_{n-1}}, \dots, N_{s_1} - N_{s_0}, N_{s_0} - N_0),$$
and since this is independent of $N_t - N_s$ by the independent increments property of the Poisson process, it follows that (9.9) and $N_t - N_s$ are also independent. Thus, (9.8) is equal to $\wp(N_t - N_s = j - i)$, which depends on $i$ but not on $i_{n-1}, \dots, i_0$. It then follows that
$$\wp(N_t = j \mid N_s = i, N_{s_{n-1}} = i_{n-1}, \dots, N_{s_0} = i_0) = \wp(N_t = j \mid N_s = i),$$
and we see that the Poisson process is a Markov chain.

As shown in the above example, for $j \ge i$,
$$\wp(N_t = j \mid N_s = i) = \wp(N_t - N_s = j - i) = \frac{[\lambda(t - s)]^{j-i}\, e^{-\lambda(t-s)}}{(j - i)!}$$
depends on $t$ and $s$ only through $t - s$. In general, if a Markov chain has the property that the transition probability $\wp(X_t = j \mid X_s = i)$ depends on $t$ and $s$ only through $t - s$, we say that the chain is time-homogeneous or that it has stationary transition probabilities. In this case, if we put
$$p_{ij}(t) := \wp(X_t = j \mid X_0 = i),$$
then $\wp(X_t = j \mid X_s = i) = p_{ij}(t - s)$. Note that $p_{ij}(0) = \delta_{ij}$, the Kronecker delta.

In the remainder of the chapter, we assume that $X_t$ is a time-homogeneous Markov chain with transition probability function $p_{ij}(t)$. For such a chain, we can derive the continuous-time Chapman-Kolmogorov equation,
$$p_{ij}(t + s) = \sum_k p_{ik}(t)\, p_{kj}(s).$$
To derive this, we first use the law of total conditional probability to write
$$\begin{aligned} p_{ij}(t + s) &= \wp(X_{t+s} = j \mid X_0 = i) \\ &= \sum_k \wp(X_{t+s} = j \mid X_t = k, X_0 = i)\, \wp(X_t = k \mid X_0 = i). \end{aligned}$$
Now use the Markov property and time homogeneity to obtain
$$\begin{aligned} p_{ij}(t + s) &= \sum_k \wp(X_{t+s} = j \mid X_t = k)\, \wp(X_t = k \mid X_0 = i) \\ &= \sum_k p_{kj}(s)\, p_{ik}(t). \end{aligned}$$
The reader may wonder why the derivation of the continuous-time Chapman-Kolmogorov equation is so much simpler than the derivation of the discrete-time version. The reason is that in discrete time, the Markov property and time homogeneity are defined in a one-step manner. Hence, induction arguments are first needed to derive the discrete-time analogs of the continuous-time definitions.
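For the Poisson process of Example 9.4 the transition probability function is known in closed form, so the continuous-time Chapman-Kolmogorov equation can be checked numerically. The Python sketch below is an added illustration; the function names and the numerical values of the rate and times are assumptions made for the example.

```python
import math


def poisson_transition(i, j, t, lam):
    """p_ij(t) for a Poisson process of rate lam: zero unless j >= i."""
    if j < i:
        return 0.0
    k = j - i
    return (lam * t) ** k * math.exp(-lam * t) / math.factorial(k)


def chapman_kolmogorov_rhs(i, j, t, s, lam):
    """Sum over k of p_ik(t) p_kj(s); only i <= k <= j contributes."""
    return sum(poisson_transition(i, k, t, lam) * poisson_transition(k, j, s, lam)
               for k in range(i, j + 1))


lam, t, s = 2.0, 0.7, 1.1             # illustrative values only
lhs = poisson_transition(3, 8, t + s, lam)
rhs = chapman_kolmogorov_rhs(3, 8, t, s, lam)
```

The sum over intermediate states is finite here because a Poisson process never decreases, so only states $k$ between $i$ and $j$ have nonzero probability.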
Kolmogorov's Differential Equations

In the remainder of the chapter, we assume that for small $\Delta t > 0$,
$$p_{ij}(\Delta t) \approx g_{ij}\, \Delta t, \quad i \ne j, \quad\text{and}\quad p_{ii}(\Delta t) \approx 1 + g_{ii}\, \Delta t.$$
These approximations tell us the conditional probability of being in state $j$ at time $\Delta t$ in the very near future given that we are in state $i$ at time zero. These assumptions are more precisely written as
$$\lim_{\Delta t \downarrow 0} \frac{p_{ij}(\Delta t)}{\Delta t} = g_{ij} \quad\text{and}\quad \lim_{\Delta t \downarrow 0} \frac{p_{ii}(\Delta t) - 1}{\Delta t} = g_{ii}. \tag{9.10}$$
Note that $g_{ij} \ge 0$, while $g_{ii} \le 0$. Using the Chapman-Kolmogorov equation, write
$$\begin{aligned} p_{ij}(t + \Delta t) &= \sum_k p_{ik}(t)\, p_{kj}(\Delta t) \\ &= p_{ij}(t)\, p_{jj}(\Delta t) + \sum_{k \ne j} p_{ik}(t)\, p_{kj}(\Delta t). \end{aligned}$$
Now subtract $p_{ij}(t)$ from both sides to get
$$p_{ij}(t + \Delta t) - p_{ij}(t) = p_{ij}(t)\,[p_{jj}(\Delta t) - 1] + \sum_{k \ne j} p_{ik}(t)\, p_{kj}(\Delta t). \tag{9.11}$$
Dividing by $\Delta t$ and applying the limit assumptions (9.10), we obtain
$$p_{ij}'(t) = p_{ij}(t)\, g_{jj} + \sum_{k \ne j} p_{ik}(t)\, g_{kj}.$$
This is Kolmogorov's forward differential equation, which can be written more compactly as
$$p_{ij}'(t) = \sum_k p_{ik}(t)\, g_{kj}. \tag{9.12}$$
To derive the backward equation, observe that since $p_{ij}(t + \Delta t) = p_{ij}(\Delta t + t)$, we can write
$$\begin{aligned} p_{ij}(t + \Delta t) &= \sum_k p_{ik}(\Delta t)\, p_{kj}(t) \\ &= p_{ii}(\Delta t)\, p_{ij}(t) + \sum_{k \ne i} p_{ik}(\Delta t)\, p_{kj}(t). \end{aligned}$$
Now subtract $p_{ij}(t)$ from both sides to get
$$p_{ij}(t + \Delta t) - p_{ij}(t) = [p_{ii}(\Delta t) - 1]\, p_{ij}(t) + \sum_{k \ne i} p_{ik}(\Delta t)\, p_{kj}(t).$$
Dividing by $\Delta t$ and applying the limit assumptions (9.10), we obtain
$$p_{ij}'(t) = g_{ii}\, p_{ij}(t) + \sum_{k \ne i} g_{ik}\, p_{kj}(t).$$
This is Kolmogorov's backward differential equation, which can be written more compactly as
$$p_{ij}'(t) = \sum_k g_{ik}\, p_{kj}(t). \tag{9.13}$$
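The derivations of both the forward and backward differential equations require taking a limit in $\Delta t$ inside the sum over $k$. For example, in deriving the backward equation, we tacitly assumed that
$$\lim_{\Delta t \downarrow 0} \sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) = \sum_{k \ne i} \lim_{\Delta t \downarrow 0} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t). \tag{9.14}$$
If the state space of the chain is finite, the above sum is finite and there is no problem. Otherwise, additional technical assumptions are required to justify this step. We now show that a sufficient assumption for deriving the backward equation is that
$$\sum_{j \ne i} g_{ij} = -g_{ii} < \infty.$$
A chain that satisfies this condition is said to be conservative. For any finite $N$, observe that
$$\sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) \ge \sum_{\substack{|k| \le N \\ k \ne i}} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t).$$
Since the right-hand side is a finite sum,
$$\liminf_{\Delta t \downarrow 0} \sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) \ge \sum_{\substack{|k| \le N \\ k \ne i}} g_{ik}\, p_{kj}(t).$$
Letting $N \to \infty$ shows that
$$\liminf_{\Delta t \downarrow 0} \sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) \ge \sum_{k \ne i} g_{ik}\, p_{kj}(t). \tag{9.15}$$
To get an upper bound on the limit, take $N \ge |i|$ and write
$$\sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) = \sum_{\substack{|k| \le N \\ k \ne i}} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) + \sum_{|k| > N} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t).$$
Since $p_{kj}(t) \le 1$,
$$\begin{aligned} \sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) &\le \sum_{\substack{|k| \le N \\ k \ne i}} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) + \sum_{|k| > N} \frac{p_{ik}(\Delta t)}{\Delta t} \\ &= \sum_{\substack{|k| \le N \\ k \ne i}} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) + \frac{1}{\Delta t}\Bigl[1 - \sum_{|k| \le N} p_{ik}(\Delta t)\Bigr] \\ &= \sum_{\substack{|k| \le N \\ k \ne i}} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) + \frac{1 - p_{ii}(\Delta t)}{\Delta t} - \sum_{\substack{|k| \le N \\ k \ne i}} \frac{p_{ik}(\Delta t)}{\Delta t}. \end{aligned}$$
Since these sums are finite,
$$\limsup_{\Delta t \downarrow 0} \sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) \le \sum_{\substack{|k| \le N \\ k \ne i}} g_{ik}\, p_{kj}(t) - g_{ii} - \sum_{\substack{|k| \le N \\ k \ne i}} g_{ik}.$$
Letting $N \to \infty$ shows that
$$\limsup_{\Delta t \downarrow 0} \sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) \le \sum_{k \ne i} g_{ik}\, p_{kj}(t) - g_{ii} - \sum_{k \ne i} g_{ik}.$$
If the chain is conservative, this simplifies to
$$\limsup_{\Delta t \downarrow 0} \sum_{k \ne i} \frac{p_{ik}(\Delta t)}{\Delta t}\, p_{kj}(t) \le \sum_{k \ne i} g_{ik}\, p_{kj}(t).$$
Combining this with (9.15) yields (9.14), thus justifying the backward equation.

Readers familiar with linear system theory may find it insightful to write the forward and backward equations in matrix form. Let $P(t)$ denote the matrix whose $i\,j$ entry is $p_{ij}(t)$, and let $G$ denote the matrix whose $i\,j$ entry is $g_{ij}$ ($G$ is called the generator matrix, or rate matrix). Then the forward equation (9.12) becomes
$$P'(t) = P(t)\, G,$$
and the backward equation (9.13) becomes
$$P'(t) = G\, P(t).$$
The initial condition in both cases is $P(0) = I$. Under suitable assumptions, the solution of both equations is given by the matrix exponential,
$$P(t) = e^{Gt} := \sum_{n=0}^{\infty} \frac{(Gt)^n}{n!}.$$
When the state space is finite, $G$ is a finite-dimensional matrix, and the theory is straightforward. Otherwise, more careful analysis is required.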
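For a finite state space the series for $e^{Gt}$ can be summed directly. The Python sketch below is an added illustration on a hypothetical two-state chain with rate `lam` from state 0 to state 1 and rate `mu` from state 1 to state 0; the names, the rates, and the choice of 60 series terms are all assumptions made here. Truncating the power series is adequate for a small example like this one but is not a robust general-purpose method.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_exp(G, t, terms=60):
    """Approximate exp(G t) by the first `terms` terms of its power series."""
    n = len(G)
    Gt = [[g * t for g in row] for row in G]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]              # (Gt)^0 / 0!
    for m in range(1, terms):
        term = [[x / m for x in row] for row in matmul(term, Gt)]  # (Gt)^m / m!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


lam, mu = 1.5, 0.5                                 # illustrative rates only
G = [[-lam, lam], [mu, -mu]]                       # each row sums to zero
P2 = mat_exp(G, 2.0)

# Forward equation P'(t) = P(t) G, checked with a finite difference at t = 2.
h = 1e-6
P2h = mat_exp(G, 2.0 + h)
derivative = [[(P2h[i][j] - P2[i][j]) / h for j in range(2)] for i in range(2)]
forward_rhs = matmul(P2, G)
```

Each row of `P2` sums to one, as it must for a matrix of transition probabilities, because the rows of the conservative generator `G` sum to zero.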
Stationary Distributions
Let us suppose that $p_{ij}(t) \to \pi_j$ as $t \to \infty$, independently of the initial state $i$. Then for large $t$, $p_{ij}(t)$ is approximately constant, and we therefore hope that its derivative is converging to zero. Assuming this to be true, the limiting form of the forward equation (9.12) is
$$0 = \sum_k \pi_k\, g_{kj}.$$
Combining this with the normalization condition $\sum_k \pi_k = 1$ allows us to solve for $\pi_k$ much as in the discrete case.
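As a small worked illustration of these two conditions, consider a hypothetical two-state chain, not one of the examples in the text, that moves from state 0 to state 1 at rate $\lambda > 0$ and from state 1 to state 0 at rate $\mu > 0$. The symbols $\lambda$ and $\mu$ are assumptions introduced only for this sketch.

```latex
\begin{align*}
G &= \begin{bmatrix} -\lambda & \lambda \\ \mu & -\mu \end{bmatrix}, \\
0 &= \sum_k \pi_k g_{k0} = -\lambda \pi_0 + \mu \pi_1
   \quad\Longrightarrow\quad \pi_1 = \frac{\lambda}{\mu}\,\pi_0, \\
% The j = 1 equation, 0 = \lambda\pi_0 - \mu\pi_1, gives the same relation.
1 &= \pi_0 + \pi_1 = \pi_0\Bigl(1 + \frac{\lambda}{\mu}\Bigr)
   \quad\Longrightarrow\quad
   \pi_0 = \frac{\mu}{\lambda + \mu}, \qquad
   \pi_1 = \frac{\lambda}{\lambda + \mu}.
\end{align*}
```

As in the discrete case, the equations $0 = \sum_k \pi_k g_{kj}$ alone determine $\pi$ only up to a scale factor, and the normalization condition fixes that factor.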
9.3. Problems
Problems §9.1: Introduction to Discrete-Time Markov Chains
1. Let $X_0, Z_1, Z_2, \dots$ be a sequence of independent discrete random variables. Put
$$X_n = g(X_{n-1}, Z_n), \quad n = 1, 2, \dots.$$
Show that $X_n$ is a Markov chain. For example, if $X_n = \max(0, X_{n-1} + Z_n)$, where $X_0$ and the $Z_n$ are as in Example 9.1, then $X_n$ is a random walk restricted to the nonnegative integers.
2. Let $X_n$ be a time-homogeneous Markov chain with transition probabilities $p_{ij}$. Put $\nu_i := \wp(X_0 = i)$. Express
$$\wp(X_0 = i, X_1 = j, X_2 = k, X_3 = l)$$
in terms of $\nu_i$ and entries from the transition probability matrix.
3. Find the stationary distribution of the Markov chain in Figure 9.2.
4. Draw the state transition diagram and find the stationary distribution of the Markov chain whose transition matrix is
$$P = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/4 & 0 & 3/4 \\ 1/2 & 1/2 & 0 \end{bmatrix}.$$
Answer: $\pi_0 = 5/12$, $\pi_1 = 1/3$, $\pi_2 = 1/4$.
5. Draw the state transition diagram and find the stationary distribution of the Markov chain whose transition matrix is
$$P = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 1/4 & 3/4 & 0 \\ 1/4 & 3/4 & 0 \end{bmatrix}.$$
Answer: $\pi_0 = 1/5$, $\pi_1 = 7/10$, $\pi_2 = 1/10$.
6. Draw the state transition diagram and find the stationary distribution of the Markov chain whose transition matrix is
$$P = \begin{bmatrix} 1/2 & 1/2 & 0 & 0 \\ 9/10 & 0 & 1/10 & 0 \\ 0 & 1/10 & 0 & 9/10 \\ 0 & 0 & 1/2 & 1/2 \end{bmatrix}.$$
Answer: $\pi_0 = 9/28$, $\pi_1 = 5/28$, $\pi_2 = 5/28$, $\pi_3 = 9/28$.
7. Find the stationary distribution of the queuing system with finite buffer of size $N$, whose state transition diagram is shown in Figure 9.7.

Figure 9.7. State transition diagram for a queue with a finite buffer.

8. Show that if $\nu_i := \wp(X_0 = i)$, then
$$\wp(X_n = j) = \sum_i \nu_i\, p_{ij}^{(n)}.$$
If $\{\pi_j\}$ is the stationary distribution, and if $\nu_i = \pi_i$, show that $\wp(X_n = j) = \pi_j$.
Problems §9.2: Continuous-Time Markov Chains
9. The general continuous-time random walk is defined by
$$g_{ij} = \begin{cases} \mu_i, & j = i-1, \\ -(\lambda_i + \mu_i), & j = i, \\ \lambda_i, & j = i+1, \\ 0, & \text{otherwise.} \end{cases}$$
Write out the forward and backward equations. Is the chain conservative?
10. The continuous-time queue with infinite buffer can be obtained by modifying the general random walk in the preceding problem to include a barrier at the origin. Put
$$g_{0j} = \begin{cases} -\lambda_0, & j = 0, \\ \lambda_0, & j = 1, \\ 0, & \text{otherwise.} \end{cases}$$
Find the stationary distribution assuming
$$\sum_{j=1}^{\infty} \frac{\lambda_0 \cdots \lambda_{j-1}}{\mu_1 \cdots \mu_j} < \infty.$$
If $\lambda_i = \lambda$ and $\mu_i = \mu$ for all $i$, simplify the above condition to one involving only the relative values of $\lambda$ and $\mu$.
11. Modify Problem 10 to include a barrier at some finite $N$. Find the stationary distribution.
12. For the chain in Problem 10, let
$$\lambda_j = j\lambda + \alpha \quad\text{and}\quad \mu_j = j\mu,$$
where $\lambda$, $\alpha$, and $\mu$ are positive. Put $m_i(t) := E[X_t \mid X_0 = i]$. Derive a differential equation for $m_i(t)$ and solve it. Treat the cases $\lambda = \mu$ and $\lambda \ne \mu$ separately. Hint: Use the forward equation (9.12).
13. For the chain in Problem 10, let $\mu_i = 0$ and $\lambda_i = \lambda$. Write down and solve the forward equation (9.12). Hint: Equation (8.4).
14. Let $T$ denote the first time a chain leaves state $i$,
$$T := \min\{t \ge 0 : X_t \ne i\}.$$
Show that given $X_0 = i$, $T$ is conditionally $\exp(-g_{ii})$. In other words, the time the chain spends in state $i$, known as the sojourn time, has an exponential density with parameter $-g_{ii}$. Hints: By Problem 43 in Chapter 4, it suffices to prove that
$$\mathsf{P}(T > t + \Delta t \mid T > t, X_0 = i) = \mathsf{P}(T > \Delta t \mid X_0 = i).$$
To derive this equation, use the fact that if $X_t$ is right-continuous,
$$T > t \quad\text{if and only if}\quad X_s = i \text{ for } 0 \le s \le t.$$
Use the Markov property in the form
$$\mathsf{P}(X_s = i,\ t \le s \le t + \Delta t \mid X_s = i,\ 0 \le s \le t) = \mathsf{P}(X_s = i,\ t \le s \le t + \Delta t \mid X_t = i),$$
and use time homogeneity in the form
$$\mathsf{P}(X_s = i,\ t \le s \le t + \Delta t \mid X_t = i) = \mathsf{P}(X_s = i,\ 0 \le s \le \Delta t \mid X_0 = i).$$
To identify the parameter of the exponential density, you may use the approximation $\mathsf{P}(X_s = i,\ 0 \le s \le \Delta t \mid X_0 = i) \approx p_{ii}(\Delta t)$.
15. The notion of a Markov chain can be generalized to include random variables that are not necessarily discrete. We say that $X_t$ is a continuous-time Markov process if
$$\mathsf{P}(X_t \in B \mid X_s = x, X_{s_{n-1}} = x_{n-1}, \dots, X_{s_0} = x_0) = \mathsf{P}(X_t \in B \mid X_s = x).$$
Such a process is time homogeneous if $\mathsf{P}(X_t \in B \mid X_s = x)$ depends on $t$ and $s$ only through $t - s$. Show that the Wiener process is a Markov process that is time homogeneous. Hint: It is enough to look at conditional cdfs; i.e., show that
$$\mathsf{P}(X_t \le y \mid X_s = x, X_{s_{n-1}} = x_{n-1}, \dots, X_{s_0} = x_0) = \mathsf{P}(X_t \le y \mid X_s = x).$$
16. Let $X_t$ be a time-homogeneous Markov process as defined in the previous problem. Put
$$P_t(x, B) := \mathsf{P}(X_t \in B \mid X_0 = x),$$
and assume that there is a corresponding conditional density, denoted by $f_t(x, y)$, such that
$$P_t(x, B) = \int_B f_t(x, y)\, dy.$$
Derive the Chapman-Kolmogorov equation for conditional densities,
$$f_{t+s}(x, y) = \int_{-\infty}^{\infty} f_s(x, z) f_t(z, y)\, dz.$$
Hint: It suffices to show that
$$P_{t+s}(x, B) = \int_{-\infty}^{\infty} f_s(x, z) P_t(z, B)\, dz.$$
To derive this, you may assume that a law of total conditional probability holds for random variables with appropriate conditional densities.
CHAPTER 10
Mean Convergence and Applications

As mentioned at the beginning of Chapter 1, limit theorems have been the foundation of the success of Kolmogorov's axiomatic theory of probability. In this chapter and the next, we focus on different notions of convergence and their implications. Section 10.1 introduces the notion of convergence in mean of order $p$. Section 10.2 introduces the normed $L^p$ spaces. Norms provide a compact notation for establishing results about convergence in mean of order $p$. We also point out that the $L^p$ spaces are complete. Completeness is used to show that convolution sums like
$$\sum_{k=0}^{\infty} h_k X_{n-k}$$
are well defined. This is an important result because sums like this represent the response of a causal, linear, time-invariant system to a random input $X_k$. Section 10.3 uses completeness to develop the Wiener integral. Section 10.4 introduces the notion of projections. The $L^2$ setting allows us to introduce a general orthogonality principle that unifies results from earlier chapters on the Wiener filter, linear estimators of random vectors, and minimum mean squared error estimation. The completeness of $L^2$ is also used to prove the projection theorem. In Section 10.5, the projection theorem is used to establish the existence of conditional expectation for random variables that may not be discrete or jointly continuous. In Section 10.6, completeness is used to establish the spectral representation of wide-sense stationary random sequences.
10.1. Convergence in Mean of Order p
We say that $X_n$ converges in mean of order $p$ to $X$ if
$$\lim_{n\to\infty} E[|X_n - X|^p] = 0,$$
where $1 \le p < \infty$. Mostly we focus on the cases $p = 1$ and $p = 2$. The case $p = 1$ is called convergence in mean or mean convergence. The case $p = 2$ is called mean square convergence or quadratic mean convergence.

Example 10.1. Let $X_n \sim N(0, 1/n^2)$. Show that $\sqrt{n}\,X_n$ converges in mean square to zero.

Solution. Write
$$E[|\sqrt{n}\,X_n|^2] = n E[X_n^2] = n \cdot \frac{1}{n^2} = \frac{1}{n} \to 0.$$
In the following example, $X_n$ converges in mean square to zero, but not in mean of order 4.

Example 10.2. Let $X_n$ have density
$$f_n(x) = g_n(x)(1 - 1/n^3) + h_n(x)/n^3,$$
where $g_n \sim N(0, 1/n^2)$ and $h_n \sim N(n, 1)$. Show that $X_n$ converges to zero in mean square, but not in mean of order 4.

Solution. For convergence in mean square, write
$$E[|X_n|^2] = \frac{1}{n^2}(1 - 1/n^3) + (1 + n^2)/n^3 \to 0.$$
However, using Problem 28 in Chapter 3, we have
$$E[|X_n|^4] = \frac{3}{n^4}(1 - 1/n^3) + (n^4 + 6n^2 + 3)/n^3 \to \infty.$$

The preceding example raises the question of whether $X_n$ might converge in mean of order 4 to something other than zero. However, by Problem 8 at the end of the chapter, if $X_n$ converged in mean of order 4 to some $X$, then it would also converge in mean square to $X$. Hence, the only possible limit for $X_n$ in mean of order 4 is zero, and as we saw, $X_n$ does not converge in mean of order 4 to zero.

Example 10.3. Let $X_1, X_2, \dots$ be uncorrelated random variables with common mean $m$ and common variance $\sigma^2$. Show that the sample mean
$$M_n := \frac{1}{n}\sum_{i=1}^{n} X_i$$
converges in mean square to $m$. We call this the mean-square law of large numbers for uncorrelated random variables.

Solution. Since
$$M_n - m = \frac{1}{n}\sum_{i=1}^{n} (X_i - m),$$
we can write
$$E[|M_n - m|^2] = \frac{1}{n^2} E\Big[\sum_{i=1}^{n}(X_i - m)\sum_{j=1}^{n}(X_j - m)\Big] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E[(X_i - m)(X_j - m)]. \tag{10.1}$$
Since $X_i$ and $X_j$ are uncorrelated, the preceding expectations are zero when $i \ne j$. Hence,
$$E[|M_n - m|^2] = \frac{1}{n^2}\sum_{i=1}^{n} E[(X_i - m)^2] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n},$$
which goes to zero as $n \to \infty$.
Example 10.4. Here is a generalization of the preceding example. Let $X_1, X_2, \dots$ be wide-sense stationary; i.e., the $X_i$ have common mean $m = E[X_i]$, and the covariance $E[(X_i - m)(X_j - m)]$ depends only on the difference $i - j$. Put
$$R(i) := E[(X_{j+i} - m)(X_j - m)].$$
Show that
$$M_n := \frac{1}{n}\sum_{i=1}^{n} X_i$$
converges in mean square to $m$ if and only if
$$\lim_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} R(k) = 0. \tag{10.2}$$
We call this result the mean-square law of large numbers for wide-sense stationary sequences or the mean-square ergodic theorem for wide-sense stationary sequences. Note that a sufficient condition for (10.2) to hold is that $\lim_{k\to\infty} R(k) = 0$ (Problem 3).

Solution. We show that (10.2) implies $M_n$ converges in mean square to $m$. The converse is left to the reader in Problem 4. From (10.1), we see that
$$\begin{aligned}
n^2 E[|M_n - m|^2] &= \sum_{i=1}^{n}\sum_{j=1}^{n} R(i - j) \\
&= \sum_{i=j} R(0) + 2\sum_{j<i} R(i - j) \\
&= nR(0) + 2\sum_{i=2}^{n}\sum_{j=1}^{i-1} R(i - j) \\
&= nR(0) + 2\sum_{i=2}^{n}\sum_{k=1}^{i-1} R(k).
\end{aligned}$$
On account of (10.2), given $\varepsilon > 0$, there is an $N$ such that for all $i \ge N$,
$$\bigg|\frac{1}{i-1}\sum_{k=1}^{i-1} R(k)\bigg| < \varepsilon.$$
For $n \ge N$, the above double sum can be written as
$$\sum_{i=2}^{n}\sum_{k=1}^{i-1} R(k) = \sum_{i=2}^{N-1}\sum_{k=1}^{i-1} R(k) + \sum_{i=N}^{n} (i-1)\bigg[\frac{1}{i-1}\sum_{k=1}^{i-1} R(k)\bigg].$$
The magnitude of the right-most double sum is upper bounded by
$$\bigg|\sum_{i=N}^{n} (i-1)\bigg[\frac{1}{i-1}\sum_{k=1}^{i-1} R(k)\bigg]\bigg| < \varepsilon\sum_{i=N}^{n}(i-1) < \varepsilon\sum_{i=1}^{n}(i-1) = \frac{\varepsilon n(n-1)}{2} < \frac{\varepsilon n^2}{2}.$$
It now follows that $\lim_{n\to\infty} E[|M_n - m|^2]$ can be no larger than $\varepsilon$. Since $\varepsilon$ is arbitrary, the limit must be zero.
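The rearrangement of the double sum in Example 10.4 can be checked numerically. The sketch below evaluates $E[|M_n - m|^2]$ both directly from (10.1) and from the folded form $nR(0) + 2\sum_{i=2}^{n}\sum_{k=1}^{i-1} R(k)$. The covariance $R(k) = \rho^{|k|}$ and the function names are assumptions made for illustration; they do not come from the text.

```python
import numpy as np

def mse_direct(R, n):
    """E|M_n - m|^2 computed as n^{-2} * sum_{i,j} R(i - j), as in (10.1)."""
    idx = np.arange(1, n + 1)
    return R(idx[:, None] - idx[None, :]).sum() / n**2

def mse_folded(R, n):
    """Same quantity via n R(0) + 2 * sum_{i=2}^{n} sum_{k=1}^{i-1} R(k)."""
    inner = sum(R(np.arange(1, i)).sum() for i in range(2, n + 1))
    return (n * R(0) + 2 * inner) / n**2

rho = 0.9
R = lambda k: rho ** np.abs(k)  # hypothetical covariance with R(k) -> 0

for n in (20, 200, 2000):
    print(n, mse_direct(R, n), mse_folded(R, n))
```

Because this $R(k)$ tends to zero, condition (10.2) holds and the printed values shrink as $n$ grows. A constant covariance $R(k) = 1$ violates (10.2), and the same functions then return 1 for every $n$.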
Example 10.5. Let $W_t$ be a Wiener process with $E[W_t^2] = \sigma^2 t$. Show that $W_t/t$ converges in mean square to zero as $t \to \infty$.

Solution. Write
$$E\bigg[\Big|\frac{W_t}{t}\Big|^2\bigg] = \frac{\sigma^2 t}{t^2} = \frac{\sigma^2}{t} \to 0.$$

Example 10.6. Let $X$ be a nonnegative random variable with finite mean; i.e., $E[X] < \infty$. Put
$$X_n := \min(X, n) = \begin{cases} X, & X \le n, \\ n, & X > n. \end{cases}$$
The idea here is that $X_n$ is a bounded random variable that can be used to approximate $X$. Show that $X_n$ converges in mean to $X$.

Solution. Since $X \ge X_n$, $E[|X_n - X|] = E[X - X_n]$. Since $X - X_n$ is nonnegative, we can write
$$E[X - X_n] = \int_0^{\infty} \mathsf{P}(X - X_n > t)\, dt,$$
where we have appealed to (4.2) in Section 4.2. Next, for $t \ge 0$, a little thought shows that
$$\{X - X_n > t\} = \{X > t + n\}.$$
Hence,
$$E[X - X_n] = \int_0^{\infty} \mathsf{P}(X > t + n)\, dt = \int_n^{\infty} \mathsf{P}(X > \theta)\, d\theta,$$
which goes to zero as $n \to \infty$ on account of the fact that
$$\infty > E[X] = \int_0^{\infty} \mathsf{P}(X > \theta)\, d\theta.$$
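As a concrete instance of Example 10.6, suppose (purely as an assumption for illustration) that $X \sim \exp(1)$, so that $\mathsf{P}(X > \theta) = e^{-\theta}$ and the tail integral equals $e^{-n}$. The sketch below approximates $\int_n^\infty \mathsf{P}(X > \theta)\,d\theta$ with a trapezoidal rule on a truncated range. The helper name `tail_integral` and its parameters are hypothetical.

```python
import numpy as np

def tail_integral(survival, n, width=60.0, points=200_001):
    """Trapezoidal approximation of the integral of P(X > theta) over [n, n + width]."""
    theta = np.linspace(n, n + width, points)
    s = survival(theta)
    return float(np.sum((s[1:] + s[:-1]) * np.diff(theta)) / 2.0)

survival = lambda theta: np.exp(-theta)  # P(X > theta) for the assumed X ~ exp(1)

# E[X - X_n] for n = 0, 1, ..., 5; in this case the exact value is exp(-n).
errors = [tail_integral(survival, n) for n in range(6)]
for n, err in enumerate(errors):
    print(n, err, np.exp(-n))
```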
10.2. Normed Vector Spaces of Random Variables
We denote by $L^p$ the set of all random variables $X$ with the property that $E[|X|^p] < \infty$. We claim that $L^p$ is a vector space. To prove this, we need to show that if $E[|X|^p] < \infty$ and $E[|Y|^p] < \infty$, then $E[|aX + bY|^p] < \infty$ for all scalars $a$ and $b$. To begin, recall that the triangle inequality applied to numbers $x$ and $y$ says that
$$|x + y| \le |x| + |y|.$$
If $|y| \le |x|$, then
$$|x + y| \le 2|x|,$$
and so
$$|x + y|^p \le 2^p |x|^p.$$
A looser bound that has the advantage of being symmetric is
$$|x + y|^p \le 2^p(|x|^p + |y|^p).$$
It is easy to see that this bound also holds if $|y| > |x|$. We can now write
$$E[|aX + bY|^p] \le E[2^p(|aX|^p + |bY|^p)] = 2^p(|a|^p E[|X|^p] + |b|^p E[|Y|^p]).$$
Hence, if $E[|X|^p]$ and $E[|Y|^p]$ are both finite, then so is $E[|aX + bY|^p]$.

For $X \in L^p$, we put
$$\|X\|_p := E[|X|^p]^{1/p}.$$
We claim that $\|\cdot\|_p$ is a norm on $L^p$, by which we mean the following three properties hold:
(i) $\|X\|_p \ge 0$, and $\|X\|_p = 0$ if and only if $X$ is the zero random variable.
(ii) For scalars $a$, $\|aX\|_p = |a|\,\|X\|_p$.
(iii) For $X, Y \in L^p$,
$$\|X + Y\|_p \le \|X\|_p + \|Y\|_p.$$
As in the numerical case, this is also known as the triangle inequality.
The first two properties are obvious, while the third one is known as Minkowski's inequality, which is derived in Problem 9. Observe now that $X_n$ converges in mean of order $p$ to $X$ if and only if
$$\lim_{n\to\infty} \|X_n - X\|_p = 0.$$
Hence, the three norm properties above can be used to derive results about convergence in mean of order $p$, as will be seen in the problems.

Recall that a sequence of real numbers $x_n$ is Cauchy if for every $\varepsilon > 0$, for all sufficiently large $n$ and $m$, $|x_n - x_m| < \varepsilon$. A basic fact that can be proved about the set of real numbers is that it is complete; i.e., given any Cauchy sequence of real numbers $x_n$, there is a real number $x$ such that $x_n$ converges to $x$ [38, p. 53, Theorem 3.11]. Similarly, a sequence of random variables $X_n \in L^p$ is said to be Cauchy if for every $\varepsilon > 0$, for all sufficiently large $n$ and $m$,
$$\|X_n - X_m\|_p < \varepsilon.$$
It can be shown that the $L^p$ spaces are complete; i.e., if $X_n$ is a Cauchy sequence of $L^p$ random variables, then there exists an $L^p$ random variable $X$ such that $X_n$ converges in mean of order $p$ to $X$. This is known as the Riesz-Fischer Theorem [37, p. 244]. A normed vector space that is complete is called a Banach space.

Of special interest is the case $p = 2$ because the norm $\|\cdot\|_2$ can be expressed in terms of the inner product
$$\langle X, Y\rangle := E[XY], \quad X, Y \in L^2.$$
It is easily seen that
$$\langle X, X\rangle^{1/2} = \|X\|_2.$$
Because the norm $\|\cdot\|_2$ can be obtained using the inner product, $L^2$ is called an inner-product space. Since the $L^p$ spaces are complete, $L^2$ in particular is a complete inner-product space. A complete inner-product space is called a Hilbert space.

The space $L^2$ has several important properties. First, for fixed $Y$, it is easy to see that $\langle X, Y\rangle$ is linear in $X$. Second, the simple relationship between the norm and the inner product implies the parallelogram law (Problem 25),
$$\|X + Y\|_2^2 + \|X - Y\|_2^2 = 2(\|X\|_2^2 + \|Y\|_2^2).$$
Third, there is the Cauchy-Schwarz inequality (Problem 1 in Chapter 6),
$$|\langle X, Y\rangle| \le \|X\|_2\,\|Y\|_2.$$
For complex-valued random variables (defined in Section 7.5), we put $\langle X, Y\rangle := E[XY^*]$.
Example 10.7. Show that
$$\sum_{k=1}^{\infty} h_k X_k$$
is well defined as an element of $L^2$ assuming that
$$\sum_{k=1}^{\infty} |h_k| < \infty \quad\text{and}\quad E[|X_k|^2] \le B, \text{ for all } k,$$
where $B$ is a finite constant.

Solution. Consider the partial sums,
$$Y_n := \sum_{k=1}^{n} h_k X_k.$$
Each $Y_n$ is an element of $L^2$, which is complete. If we can show that $Y_n$ is a Cauchy sequence, then there will exist a $Y \in L^2$ with $\|Y_n - Y\|_2 \to 0$. Thus, the infinite-sum expression $\sum_{k=1}^{\infty} h_k X_k$ is understood to be shorthand for "the mean-square limit of $\sum_{k=1}^{n} h_k X_k$ as $n \to \infty$." Next, for $n > m$, write
$$Y_n - Y_m = \sum_{k=m+1}^{n} h_k X_k.$$
Then
$$\begin{aligned}
\|Y_n - Y_m\|_2^2 &= \langle Y_n - Y_m, Y_n - Y_m\rangle \\
&= \bigg\langle \sum_{k=m+1}^{n} h_k X_k, \sum_{l=m+1}^{n} h_l X_l \bigg\rangle \\
&\le \sum_{k=m+1}^{n}\sum_{l=m+1}^{n} |h_k|\,|h_l|\,|\langle X_k, X_l\rangle| \\
&\le \sum_{k=m+1}^{n}\sum_{l=m+1}^{n} |h_k|\,|h_l|\,\|X_k\|_2\,\|X_l\|_2,
\end{aligned}$$
by the Cauchy-Schwarz inequality. Next, since $\|X_k\|_2 = E[|X_k|^2]^{1/2} \le \sqrt{B}$,
$$\|Y_n - Y_m\|_2^2 \le B\bigg(\sum_{k=m+1}^{n} |h_k|\bigg)^2.$$
Since $\sum_{k=1}^{\infty} |h_k| < \infty$ implies$^1$
$$\sum_{k=m+1}^{n} |h_k| \to 0 \quad\text{as } n \text{ and } m \to \infty \text{ with } n > m,$$
it follows that $Y_n$ is Cauchy.
Under the conditions of the previous example, since $\|Y_n - Y\|_2 \to 0$, we can write, for any $Z \in L^2$,
$$|\langle Y_n, Z\rangle - \langle Y, Z\rangle| = |\langle Y_n - Y, Z\rangle| \le \|Y_n - Y\|_2\,\|Z\|_2 \to 0.$$
In other words,
$$\lim_{n\to\infty} \langle Y_n, Z\rangle = \langle Y, Z\rangle.$$
Taking $Z = 1$ and writing the inner products as expectations yields
$$\lim_{n\to\infty} E[Y_n] = E[Y],$$
or, using the definitions of $Y_n$ and $Y$,
$$\lim_{n\to\infty} E\bigg[\sum_{k=1}^{n} h_k X_k\bigg] = E\bigg[\sum_{k=1}^{\infty} h_k X_k\bigg].$$
Since the sum on the left is finite, we can bring the expectation inside and write
$$\lim_{n\to\infty} \sum_{k=1}^{n} h_k E[X_k] = E\bigg[\sum_{k=1}^{\infty} h_k X_k\bigg].$$
In other words,
$$\sum_{k=1}^{\infty} h_k E[X_k] = E\bigg[\sum_{k=1}^{\infty} h_k X_k\bigg].$$

The foregoing example and discussion have the following application. Consider a discrete-time, causal, stable, linear, time-invariant system with impulse response $h_k$. Now suppose that the random sequence $X_k$ is applied to the input of this system. If $E[|X_k|^2]$ is bounded as in the example, then† the output of the system at time $n$ is
$$\sum_{k=0}^{\infty} h_k X_{n-k},$$
which is a well-defined element of $L^2$. Furthermore,
$$E\bigg[\sum_{k=0}^{\infty} h_k X_{n-k}\bigg] = \sum_{k=0}^{\infty} h_k E[X_{n-k}].$$

† The assumption in the example, $\sum_k |h_k| < \infty$, is equivalent to the assumption that the system is stable.
10.3. The Wiener Integral (Again)
In Section 8.3, we defined the Wiener integral
$$\int_0^{\infty} g(\tau)\, dW_\tau := \sum_i g_i (W_{t_{i+1}} - W_{t_i}),$$
for piecewise constant $g$, say $g(\tau) = g_i$ for $t_i < \tau \le t_{i+1}$ for a finite number of intervals, and $g(\tau) = 0$ otherwise. In this case, since the integral is the sum of scaled, independent, zero mean, Gaussian increments of variance $\sigma^2(t_{i+1} - t_i)$,
$$E\bigg[\bigg(\int_0^{\infty} g(\tau)\, dW_\tau\bigg)^2\bigg] = \sigma^2 \sum_i g_i^2 (t_{i+1} - t_i) = \sigma^2 \int_0^{\infty} g(\tau)^2\, d\tau.$$
We now define the Wiener integral for arbitrary $g$ satisfying
$$\int_0^{\infty} g(\tau)^2\, d\tau < \infty. \tag{10.3}$$
To do this, we use the fact [13, p. 86, Prop. 3.4.2] that for $g$ satisfying (10.3), there always exists a sequence of piecewise constant functions $g_n$ converging to $g$ in the mean-square sense
$$\lim_{n\to\infty} \int_0^{\infty} |g_n(\tau) - g(\tau)|^2\, d\tau = 0. \tag{10.4}$$
The set of $g$ satisfying (10.3) is an inner product space with inner product $\langle g, h\rangle = \int_0^{\infty} g(\tau) h(\tau)\, d\tau$ and corresponding norm $\|g\| = \langle g, g\rangle^{1/2}$. Thus, (10.4) says that $\|g_n - g\| \to 0$. In particular, this implies $g_n$ is Cauchy; i.e., $\|g_n - g_m\| \to 0$ as $n, m \to \infty$ (cf. Problem 11). Consider the random variables
$$Y_n := \int_0^{\infty} g_n(\tau)\, dW_\tau.$$
Since each $g_n$ is piecewise constant, $Y_n$ is well defined and is Gaussian with zero mean and variance
$$\sigma^2 \int_0^{\infty} g_n(\tau)^2\, d\tau.$$
Now observe that
$$\begin{aligned}
\|Y_n - Y_m\|_2^2 &= E[|Y_n - Y_m|^2] \\
&= E\bigg[\bigg|\int_0^{\infty} g_n(\tau)\, dW_\tau - \int_0^{\infty} g_m(\tau)\, dW_\tau\bigg|^2\bigg] \\
&= E\bigg[\bigg|\int_0^{\infty} \big(g_n(\tau) - g_m(\tau)\big)\, dW_\tau\bigg|^2\bigg] \\
&= \sigma^2 \int_0^{\infty} |g_n(\tau) - g_m(\tau)|^2\, d\tau,
\end{aligned}$$
since $g_n - g_m$ is piecewise constant. Thus,
$$\|Y_n - Y_m\|_2^2 = \sigma^2 \|g_n - g_m\|^2.$$
Since $g_n$ is Cauchy, we see that $Y_n$ is too. Since $L^2$ is complete, there exists a random variable $Y \in L^2$ with $\|Y_n - Y\|_2 \to 0$. We denote this random variable by
$$\int_0^{\infty} g(\tau)\, dW_\tau,$$
and call it the Wiener integral of $g$.
10.4. Projections, Orthogonality Principle, Projection Theorem
Let $C$ be a subset of $L^p$. Given $X \in L^p$, suppose we want to approximate $X$ by some $\hat X \in C$. We call $\hat X$ a projection of $X$ onto $C$ if $\hat X \in C$ and if
$$\|X - \hat X\|_p \le \|X - Y\|_p, \quad\text{for all } Y \in C.$$
Note that if $X \in C$, then we can take $\hat X = X$.

Example 10.8. Let $C$ be the unit ball,
$$C := \{Y \in L^p : \|Y\|_p \le 1\}.$$
For $X \notin C$, i.e., $\|X\|_p > 1$, show that
$$\hat X = \frac{X}{\|X\|_p}.$$

Solution. First note that the proposed formula for $\hat X$ satisfies $\|\hat X\|_p = 1$ so that $\hat X \in C$ as required. Now observe that
$$\|X - \hat X\|_p = \bigg\|X - \frac{X}{\|X\|_p}\bigg\|_p = \bigg|1 - \frac{1}{\|X\|_p}\bigg|\,\|X\|_p = \|X\|_p - 1.$$
Next, for any $Y \in C$,
$$\begin{aligned}
\|X - Y\|_p &\ge \big|\|X\|_p - \|Y\|_p\big|, \quad\text{by Problem 15,} \\
&= \|X\|_p - \|Y\|_p \\
&\ge \|X\|_p - 1 \\
&= \|X - \hat X\|_p.
\end{aligned}$$
Thus, no $Y \in C$ is closer to $X$ than $\hat X$.
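Example 10.8 can be illustrated on a finite sample space, where a random variable is just a vector of values and $\|X\|_p$ is a weighted sum. In the sketch below, the six equally likely outcomes, the values of `X`, the choice $p = 3$, and the helper `lp_norm` are all assumptions made for illustration. The random search over members of $C$ is only a spot check, not a proof.

```python
import numpy as np

def lp_norm(x, prob, p):
    """||X||_p = E[|X|^p]^(1/p) on a finite sample space with probabilities prob."""
    return float(np.sum(prob * np.abs(x) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
prob = np.full(6, 1 / 6)                          # six equally likely outcomes
X = np.array([3.0, -1.0, 2.0, 0.5, -4.0, 1.5])    # hypothetical X with ||X||_p > 1
p = 3

X_hat = X / lp_norm(X, prob, p)                   # proposed projection onto the unit ball
dist_hat = lp_norm(X - X_hat, prob, p)            # should equal ||X||_p - 1

# Distance from X to randomly drawn members of the unit ball C.
dist_random = []
for _ in range(1000):
    Y = rng.normal(size=6)
    Y = Y / max(1.0, lp_norm(Y, prob, p))         # rescale so that ||Y||_p <= 1
    dist_random.append(lp_norm(X - Y, prob, p))

print(dist_hat, lp_norm(X, prob, p) - 1.0, min(dist_random))
```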
Much more can be said about projections when $p = 2$ and when the set we are projecting onto is a subspace rather than an arbitrary subset. We now present two fundamental results about projections onto subspaces of $L^2$. The first result is the orthogonality principle.

Let $M$ be a subspace of $L^2$. If $X \in L^2$, then $\hat X \in M$ satisfies
$$\|X - \hat X\|_2 \le \|X - Y\|_2, \quad\text{for all } Y \in M, \tag{10.5}$$
if and only if
$$\langle X - \hat X, Y\rangle = 0, \quad\text{for all } Y \in M. \tag{10.6}$$
Furthermore, if such an $\hat X \in M$ exists, it is unique. Observe that there is no claim that an $\hat X \in M$ exists that satisfies either (10.5) or (10.6). In practice, we try to find an $\hat X \in M$ satisfying (10.6), since such an $\hat X$ then automatically satisfies (10.5). This was the approach used to derive the Wiener filter in Section 6.5, where we implicitly took (for fixed $t$)
$$M = \{\hat V_t : \hat V_t \text{ is given by (6.7) and } E[\hat V_t^2] < \infty\}.$$
In Section 7.3, when we discussed linear estimation of random vectors, we implicitly took
$$M = \{AY + b : A \text{ is a matrix and } b \text{ is a column vector}\}.$$
When we discussed minimum mean squared error estimation, also in Section 7.3, we implicitly took
$$M = \{g(Y) : g \text{ is any function such that } E[g(Y)^2] < \infty\}.$$
Thus, several estimation problems discussed in earlier chapters are seen to be special cases of finding the projection onto a suitable subspace of $L^2$. Each of these special cases had its version of the orthogonality principle, and so it should be no trouble for the reader to show that (10.6) implies (10.5). The converse is also true, as we now show. Suppose (10.5) holds, but for some $Y \in M$,
$$\langle X - \hat X, Y\rangle = c \ne 0.$$
Because we can divide this equation by $\|Y\|_2$, there is no loss of generality in assuming $\|Y\|_2 = 1$. Now, since $M$ is a subspace containing both $\hat X$ and $Y$, $\hat X + cY$ also belongs to $M$. We show that this new vector is strictly closer to $X$ than $\hat X$, contradicting (10.5). Write
$$\begin{aligned}
\|X - (\hat X + cY)\|_2^2 &= \|(X - \hat X) - cY\|_2^2 \\
&= \|X - \hat X\|_2^2 - |c|^2 - |c|^2 + |c|^2 \\
&= \|X - \hat X\|_2^2 - |c|^2 \\
&< \|X - \hat X\|_2^2.
\end{aligned}$$
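The orthogonality principle can be seen at work on a finite sample space, where $\langle U, V\rangle = E[UV]$ is a weighted dot product and projecting onto a finite-dimensional subspace is a weighted least-squares problem. The sketch below is an illustration under assumptions: the sample space of eight equally likely outcomes, the random draws for `X` and `Y`, and the subspace $M = \operatorname{span}\{1, Y, Y^2\}$ are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
prob = np.full(n, 1 / n)        # probabilities of the eight outcomes
Y = rng.normal(size=n)          # values of Y on each outcome
X = rng.normal(size=n)          # values of X on each outcome
basis = np.column_stack([np.ones(n), Y, Y**2])   # M = span{1, Y, Y^2}

def inner(U, V):
    """Inner product <U, V> = E[U V] on the finite sample space."""
    return float(np.sum(prob * U * V))

# Weighted least squares: minimize E[(X - basis @ coef)^2] over coef.
w = np.sqrt(prob)
coef, *_ = np.linalg.lstsq(basis * w[:, None], X * w, rcond=None)
X_hat = basis @ coef
residual = X - X_hat

# (10.6): the error is orthogonal to every basis element of M.
print([inner(residual, basis[:, k]) for k in range(3)])

# (10.5): another element of M is no closer to X than X_hat.
Z = basis @ rng.normal(size=3)
print(inner(residual, residual), inner(X - Z, X - Z))
```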
The second fundamental result to be presented is the Projection Theorem. Recall that the orthogonality principle does not guarantee the existence of an $\hat X \in M$ satisfying (10.5). If we are not smart enough to solve (10.6), what can we do? This is where the projection theorem comes in. To state and prove this result, we need the concept of a closed set. We say that $M$ is closed if whenever $X_n \in M$ and $\|X_n - X\|_2 \to 0$ for some $X \in L^2$, the limit $X$ must actually be in $M$. In other words, a set is closed if it contains all the limits of all converging sequences from the set.

Example 10.9. Show that the set of Wiener integrals
$$M := \bigg\{\int_0^{\infty} g(\tau)\, dW_\tau : \int_0^{\infty} g(\tau)^2\, d\tau < \infty\bigg\}$$
is closed.

Solution. A sequence $X_n$ from $M$ has the form
$$X_n = \int_0^{\infty} g_n(\tau)\, dW_\tau$$
for square-integrable $g_n$. Suppose $X_n$ converges in mean square to some $X$. We must show that there exists a square-integrable function $g$ for which
$$X = \int_0^{\infty} g(\tau)\, dW_\tau.$$
Since $X_n$ converges, it is Cauchy. Now observe that
$$\begin{aligned}
\|X_n - X_m\|_2^2 &= E[|X_n - X_m|^2] \\
&= E\bigg[\bigg|\int_0^{\infty} g_n(\tau)\, dW_\tau - \int_0^{\infty} g_m(\tau)\, dW_\tau\bigg|^2\bigg] \\
&= E\bigg[\bigg|\int_0^{\infty} \big(g_n(\tau) - g_m(\tau)\big)\, dW_\tau\bigg|^2\bigg] \\
&= \sigma^2 \int_0^{\infty} |g_n(\tau) - g_m(\tau)|^2\, d\tau \\
&= \sigma^2 \|g_n - g_m\|^2.
\end{aligned}$$
Thus, $g_n$ is Cauchy. Since the set of square-integrable time functions is complete (the Riesz-Fischer Theorem again [37, p. 244]), there is a square-integrable $g$ with $\|g_n - g\| \to 0$. For this $g$, write
$$\begin{aligned}
\bigg\|X_n - \int_0^{\infty} g(\tau)\, dW_\tau\bigg\|_2^2 &= E\bigg[\bigg|\int_0^{\infty} g_n(\tau)\, dW_\tau - \int_0^{\infty} g(\tau)\, dW_\tau\bigg|^2\bigg] \\
&= E\bigg[\bigg|\int_0^{\infty} \big(g_n(\tau) - g(\tau)\big)\, dW_\tau\bigg|^2\bigg] \\
&= \sigma^2 \int_0^{\infty} |g_n(\tau) - g(\tau)|^2\, d\tau \\
&= \sigma^2 \|g_n - g\|^2 \to 0.
\end{aligned}$$
Since mean-square limits are unique (Problem 14), $X = \int_0^{\infty} g(\tau)\, dW_\tau$.
Projection Theorem. If $M$ is a closed subspace of $L^2$, and $X \in L^2$, then there exists a unique $\hat X \in M$ such that (10.5) holds.

To prove this result, first put $h := \inf_{Y \in M} \|X - Y\|_2$. From the definition of the infimum, there is a sequence $Y_n \in M$ with $\|X - Y_n\|_2 \to h$. We will show that $Y_n$ is a Cauchy sequence. Since $L^2$ is a Hilbert space, $Y_n$ converges to some limit in $L^2$. Since $M$ is closed, the limit, say $\hat X$, must be in $M$.

To show $Y_n$ is Cauchy, we proceed as follows. By the parallelogram law,
$$\begin{aligned}
2(\|X - Y_n\|_2^2 + \|X - Y_m\|_2^2) &= \|2X - (Y_n + Y_m)\|_2^2 + \|Y_m - Y_n\|_2^2 \\
&= 4\bigg\|X - \frac{Y_n + Y_m}{2}\bigg\|_2^2 + \|Y_m - Y_n\|_2^2.
\end{aligned}$$
Note that the vector $(Y_n + Y_m)/2 \in M$ since $M$ is a subspace. Hence,
$$2(\|X - Y_n\|_2^2 + \|X - Y_m\|_2^2) \ge 4h^2 + \|Y_m - Y_n\|_2^2.$$
Since $\|X - Y_n\|_2 \to h$, given $\varepsilon > 0$, there exists an $N$ such that for all $n \ge N$, $\|X - Y_n\|_2 < h + \varepsilon$. Hence, for $m, n \ge N$,
$$\|Y_m - Y_n\|_2^2 < 2\big((h + \varepsilon)^2 + (h + \varepsilon)^2\big) - 4h^2 = 4\varepsilon(2h + \varepsilon),$$
and we see that $Y_n$ is Cauchy.

Since $L^2$ is a Hilbert space, and since $M$ is closed, $Y_n \to \hat X$ for some $\hat X \in M$. We now have to show that $\|X - \hat X\|_2 \le \|X - Y\|_2$ for all $Y \in M$. Write
$$\|X - \hat X\|_2 = \|X - Y_n + Y_n - \hat X\|_2 \le \|X - Y_n\|_2 + \|Y_n - \hat X\|_2.$$
Since $\|X - Y_n\|_2 \to h$ and since $\|Y_n - \hat X\|_2 \to 0$,
$$\|X - \hat X\|_2 \le h.$$
Since $h \le \|X - Y\|_2$ for all $Y \in M$, $\|X - \hat X\|_2 \le \|X - Y\|_2$ for all $Y \in M$. The uniqueness of $\hat X$ is left to Problem 22.
10.5. Conditional Expectation
In earlier chapters, we defined conditional expectation separately for discrete and jointly continuous random variables. We are now in a position to introduce a unified definition. The new definition is closely related to the orthogonality principle and the projection theorem. It also reduces to the earlier definitions in the discrete and jointly continuous cases.

We say that $\hat g(Y)$ is the conditional expectation of $X$ given $Y$ if
$$E[Xg(Y)] = E[\hat g(Y)g(Y)], \quad\text{for all bounded functions } g. \tag{10.7}$$
When $X$ and $Y$ are discrete or jointly continuous it is easy to check that $\hat g(y) = E[X \mid Y = y]$ solves this equation.‡ However, the importance of this definition is that we can prove the existence and uniqueness of such a function $\hat g$ even if $X$ and $Y$ are not discrete or jointly continuous, as long as $X \in L^1$. Recall that uniqueness was proved in Problem 18 in Chapter 7.

We first consider the case $X \in L^2$. Put
$$M := \{g(Y) : E[g(Y)^2] < \infty\}. \tag{10.8}$$
It is a consequence of the Riesz-Fischer Theorem [37, p. 244] that $M$ is closed. By the projection theorem combined with the orthogonality principle, there exists a $\hat g(Y) \in M$ such that
$$\langle X - \hat g(Y), g(Y)\rangle = 0, \quad\text{for all } g(Y) \in M.$$
Since the above inner product is defined as an expectation, it is equivalent to
$$E[Xg(Y)] = E[\hat g(Y)g(Y)], \quad\text{for all } g(Y) \in M.$$
Since boundedness of $g$ implies $g(Y) \in M$, (10.7) holds. When $X \in L^2$, we have shown that $\hat g(Y)$ is the projection of $X$ onto $M$.

For $X \in L^1$, we proceed as follows. First consider the case of nonnegative $X$ with $E[X] < \infty$. We can approximate $X$ by the bounded function $X_n$ of Example 10.6. Being bounded, $X_n \in L^2$, and the corresponding $\hat g_n(Y)$ exists and satisfies
$$E[X_n g(Y)] = E[\hat g_n(Y)g(Y)], \quad\text{for all } g(Y) \in M. \tag{10.9}$$
Since $X_n \le X_{n+1}$, $\hat g_n(Y) \le \hat g_{n+1}(Y)$ by Problem 29. Hence, $\hat g(Y) := \lim_{n\to\infty} \hat g_n(Y)$ exists. To verify that $\hat g(Y)$ satisfies (10.7), write$^2$
$$\begin{aligned}
E[Xg(Y)] &= E\big[\lim_{n\to\infty} X_n g(Y)\big] \\
&= \lim_{n\to\infty} E[X_n g(Y)] \\
&= \lim_{n\to\infty} E[\hat g_n(Y)g(Y)], \quad\text{by (10.9),} \\
&= E\big[\lim_{n\to\infty} \hat g_n(Y)g(Y)\big] \\
&= E[\hat g(Y)g(Y)].
\end{aligned}$$
For signed $X$ with $E[|X|] < \infty$, consider the nonnegative random variables
$$X^+ := \begin{cases} X, & X \ge 0, \\ 0, & X < 0, \end{cases} \quad\text{and}\quad X^- := \begin{cases} -X, & X < 0, \\ 0, & X \ge 0. \end{cases}$$
Since $X^+ + X^- = |X|$, it is clear that $X^+$ and $X^-$ are $L^1$ random variables. Since they are nonnegative, their conditional expectations exist. Denote them by $\hat g^+(Y)$ and $\hat g^-(Y)$. Since $X^+ - X^- = X$, it is easy to verify that $\hat g(Y) := \hat g^+(Y) - \hat g^-(Y)$ satisfies (10.7).

‡ See, for example, the derivation following (7.8) in Section 7.3.
Notation
We have shown that to every $X \in L^1$, there corresponds a unique function $\hat g(y)$ such that (10.7) holds. The standard notation for this function of $y$ is $E[X \mid Y = y]$, which, as noted above, is given by the usual formulas when $X$ and $Y$ are discrete or jointly continuous. It is conventional in probability theory to write $E[X \mid Y]$ instead of $\hat g(Y)$. We emphasize that $E[X \mid Y = y]$ is a deterministic function of $y$, while $E[X \mid Y]$ is a function of $Y$ and is therefore a random variable. We also point out that with the conventional notation, (10.7) becomes
$$E[Xg(Y)] = E\big[E[X \mid Y]\,g(Y)\big], \quad\text{for all bounded functions } g. \tag{10.10}$$
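For discrete random variables, the defining property (10.10) can be checked directly from a joint probability mass function. The sketch below uses a small hypothetical pmf and two bounded test functions $g$; the variable names and the numerical values are assumptions chosen only to illustrate the identity.

```python
import numpy as np

# Hypothetical joint pmf: pmf[i, j] = P(X = x_vals[i], Y = y_vals[j]).
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([-1.0, 1.0])
pmf = np.array([[0.10, 0.20],
                [0.30, 0.05],
                [0.15, 0.20]])

p_y = pmf.sum(axis=0)                                # marginal pmf of Y
g_hat = (x_vals[:, None] * pmf).sum(axis=0) / p_y    # E[X | Y = y] for each y

def lhs(g):
    """E[X g(Y)] computed from the joint pmf."""
    return float(np.sum(x_vals[:, None] * g(y_vals)[None, :] * pmf))

def rhs(g):
    """E[E[X | Y] g(Y)] computed from the marginal pmf of Y."""
    return float(np.sum(g_hat * g(y_vals) * p_y))

indicator = lambda y: (y > 0).astype(float)          # a bounded test function
for g in (np.cos, indicator):
    print(lhs(g), rhs(g))                            # the two columns agree
```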
10.6. The Spectral Representation
Let $X_n$ be a discrete-time, zero-mean, wide-sense stationary process with correlation function $R(n) := E[X_{n+m}X_m]$ and corresponding power spectral density
$$S(f) = \sum_{n=-\infty}^{\infty} R(n) e^{-j2\pi fn} \tag{10.11}$$
so that§
$$R(n) = \int_{-1/2}^{1/2} S(f) e^{j2\pi fn}\, df. \tag{10.12}$$
Consider the space of complex-valued frequency functions,
$$\mathcal{S} := \bigg\{G : \int_{-1/2}^{1/2} |G(f)|^2 S(f)\, df < \infty\bigg\}$$
equipped with the inner product
$$\langle G, H\rangle := \int_{-1/2}^{1/2} G(f)H(f)^* S(f)\, df$$
and corresponding norm $\|G\| := \langle G, G\rangle^{1/2}$. Then $\mathcal{S}$ is a Hilbert space. Furthermore, every $G \in \mathcal{S}$ can be approximated by a trigonometric polynomial of the form
$$G_0(f) = \sum_{n=-N}^{N} g_n e^{j2\pi fn}.$$
The approximation is in norm; i.e., given any $\varepsilon > 0$, there is a trigonometric polynomial $G_0$ such that $\|G - G_0\| < \varepsilon$ [5, p. 139].

§ For some correlation functions, the sum in (10.11) may not converge. However, by Herglotz's Theorem, we can always replace (10.12) by
$$R(n) = \int_{-1/2}^{1/2} e^{j2\pi fn}\, dS_0(f),$$
where $S_0(f)$ is the spectral (cumulative) distribution function, and the rest of the section goes through with the necessary changes. Note that when the power spectral density exists, $S(f) = S_0'(f)$.

To each frequency function $G \in \mathcal{S}$, we now associate an $L^2$ random variable as follows. For trigonometric polynomials like $G_0$, put
$$T(G_0) := \sum_{n=-N}^{N} g_n X_n.$$
Note that $T$ is well defined (Problem 35). A critical property of these trigonometric polynomials is that (Problem 36)
$$E[T(G_0)T(H_0)^*] = \int_{-1/2}^{1/2} G_0(f)H_0(f)^* S(f)\, df = \langle G_0, H_0\rangle,$$
where $H_0$ is defined similarly to $G_0$. In particular, $T$ is norm preserving on the trigonometric polynomials; i.e.,
$$\|T(G_0)\|_2^2 = \|G_0\|^2.$$
To define $T$ for arbitrary $G \in \mathcal{S}$, let $G_n$ be a sequence of trigonometric polynomials converging to $G$ in norm; i.e., $\|G_n - G\| \to 0$. Then $G_n$ is Cauchy (cf. Problem 11). Furthermore, the linearity and norm-preservation properties of $T$ on the trigonometric polynomials tell us that
$$\|G_n - G_m\| = \|T(G_n - G_m)\|_2 = \|T(G_n) - T(G_m)\|_2,$$
and we see that $T(G_n)$ is Cauchy in $L^2$. Since $L^2$ is complete, there is a limit random variable, denoted by $T(G)$ and such that $\|T(G_n) - T(G)\|_2 \to 0$. Note that $T(G)$ is well defined, norm preserving, linear, and continuous on $\mathcal{S}$ (Problem 37).

There is another way to approximate elements of $\mathcal{S}$. Every $G \in \mathcal{S}$ can also be approximated by a piecewise constant function of the following form. For $-1/2 \le f_0 < \cdots < f_n \le 1/2$, let
$$G_0(f) = \sum_{i=1}^{n} g_i I_{(f_{i-1}, f_i]}(f).$$
Given any $\varepsilon > 0$, there is a piecewise constant function $G_0$ such that $\|G - G_0\| < \varepsilon$. This is exactly what we did for the Wiener process, but with a different norm. For piecewise constant $G_0$,
$$T(G_0) = \sum_{i=1}^{n} g_i T(I_{(f_{i-1}, f_i]}).$$
Since
$$I_{(f_{i-1}, f_i]} = I_{[-1/2, f_i]} - I_{[-1/2, f_{i-1}]},$$
if we put
$$Z_f := T(I_{[-1/2, f]}), \quad -1/2 \le f \le 1/2,$$
then $Z_f$ is well defined by Problem 38, and
$$T(G_0) = \sum_{i=1}^{n} g_i (Z_{f_i} - Z_{f_{i-1}}).$$
The family of random variables $\{Z_f, -1/2 \le f \le 1/2\}$ has many similarities to the Wiener process (see Problem 39). We write
$$\int_{-1/2}^{1/2} G_0(f)\, dZ_f := \sum_{i=1}^{n} g_i (Z_{f_i} - Z_{f_{i-1}}).$$
For arbitrary $G \in \mathcal{S}$, we approximate $G$ with a sequence of piecewise constant functions $G_n$. Then $G_n$ will be Cauchy. Since
$$\bigg\|\int_{-1/2}^{1/2} G_n(f)\, dZ_f - \int_{-1/2}^{1/2} G_m(f)\, dZ_f\bigg\|_2 = \|T(G_n) - T(G_m)\|_2 = \|G_n - G_m\|,$$
there is a limit random variable in $L^2$, denoted by
$$\int_{-1/2}^{1/2} G(f)\, dZ_f,$$
and
$$\bigg\|\int_{-1/2}^{1/2} G_n(f)\, dZ_f - \int_{-1/2}^{1/2} G(f)\, dZ_f\bigg\|_2 \to 0.$$
On the other hand, since $G_n$ is piecewise constant,
$$\int_{-1/2}^{1/2} G_n(f)\, dZ_f = T(G_n),$$
and since $\|G_n - G\| \to 0$, and since $T$ is continuous, $\|T(G_n) - T(G)\|_2 \to 0$. Thus,
$$T(G) = \int_{-1/2}^{1/2} G(f)\, dZ_f.$$
In particular, taking $G(f) = e^{j2\pi fn}$ shows that
$$\int_{-1/2}^{1/2} e^{j2\pi fn}\, dZ_f = T(G) = X_n,$$
where the last step follows from the original definition of $T$. The formula
$$X_n = \int_{-1/2}^{1/2} e^{j2\pi fn}\, dZ_f$$
is called the spectral representation of $X_n$.
10.7. Notes
Notes §10.1: Convergence in Mean of Order p
Note 1. The assumption
$$\sum_{k=1}^{\infty} |h_k| < \infty$$
should be understood more precisely as saying that the partial sum
$$A_n := \sum_{k=1}^{n} |h_k|$$
is bounded above by some finite constant. Since $A_n$ is also nondecreasing, the monotonic sequence property of real numbers [38, p. 55, Theorem 3.14] says that $A_n$ converges to a real number, say $A$. Now, by the same argument used to solve Problem 11, since $A_n$ converges, it is Cauchy. Hence, given any $\varepsilon > 0$, for large enough $n$ and $m$, $|A_n - A_m| < \varepsilon$. If $n > m$, we have
$$A_n - A_m = \sum_{k=m+1}^{n} |h_k| < \varepsilon.$$
Notes §10.5: Conditional Expectation
Note 2. The interchange of limit and expectation is justified by Lebesgue's monotone convergence theorem [4, p. 208].
10.8. Problems
Problems §10.1: Convergence in Mean of Order p
1. Let $U \sim \text{uniform}[0, 1]$, and put
$$X_n := \sqrt{n}\, I_{[0, 1/n]}(U), \quad n = 1, 2, \dots.$$
For what values of $p \ge 1$ does $X_n$ converge in mean of order $p$ to zero?
2. Let $N_t$ be a Poisson process of rate $\lambda$. Show that $N_t/t$ converges in mean square to $\lambda$ as $t \to \infty$.
3. Show that $\lim_{k\to\infty} R(k) = 0$ implies
$$\lim_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} R(k) = 0,$$
and thus $M_n$ converges in mean square to $m$ by Example 10.4.
4. Show that if $M_n$ converges in mean square to $m$, then
$$\lim_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} R(k) = 0,$$
where $R(k)$ is defined in Example 10.4. Hint: Observe that
$$\sum_{k=0}^{n-1} R(k) = E\bigg[(X_1 - m)\sum_{k=1}^{n}(X_k - m)\bigg] = E[(X_1 - m)\cdot n(M_n - m)].$$
5. Let $Z$ be a nonnegative random variable with $E[Z] < \infty$. Given any $\varepsilon > 0$, show that there is a $\delta > 0$ such that for any event $A$,
$$\mathsf{P}(A) < \delta \quad\text{implies}\quad E[ZI_A] < \varepsilon.$$
Hint: Recalling Example 10.6, put $Z_n := \min(Z, n)$ and write
$$E[ZI_A] = E[(Z - Z_n)I_A] + E[Z_n I_A].$$
6. Derive Hölder's inequality,
$$E[|XY|] \le E[|X|^p]^{1/p}E[|Y|^q]^{1/q},$$
if $1 < p, q < \infty$ with $\frac{1}{p} + \frac{1}{q} = 1$. Hint: Let $\alpha$ and $\beta$ denote the factors on the right-hand side; i.e., $\alpha := E[|X|^p]^{1/p}$ and $\beta := E[|Y|^q]^{1/q}$. Then it suffices to show that
$$\frac{E[|XY|]}{\alpha\beta} \le 1.$$
To this end, observe that by the convexity of the exponential function, for any real numbers $u$ and $v$, we can always write
$$\exp\Big[\frac{1}{p}u + \frac{1}{q}v\Big] \le \frac{1}{p}e^u + \frac{1}{q}e^v.$$
Now take $u = \ln(|X|/\alpha)^p$ and $v = \ln(|Y|/\beta)^q$.
7. Derive Lyapunov's inequality,
$$E[|Z|^\alpha]^{1/\alpha} \le E[|Z|^\beta]^{1/\beta}, \quad 1 \le \alpha < \beta < \infty.$$
Hint: Apply Hölder's inequality to $X = |Z|^\alpha$ and $Y = 1$ with $p = \beta/\alpha$.
8. Show that if $X_n$ converges in mean of order $\beta > 1$ to $X$, then $X_n$ converges in mean of order $\alpha$ to $X$ for all $1 \le \alpha < \beta$.
9. Derive Minkowski's inequality,
$$E[|X + Y|^p]^{1/p} \le E[|X|^p]^{1/p} + E[|Y|^p]^{1/p},$$
where $1 \le p < \infty$. Hint: Observe that
$$E[|X + Y|^p] = E[|X + Y|\,|X + Y|^{p-1}] \le E[|X|\,|X + Y|^{p-1}] + E[|Y|\,|X + Y|^{p-1}],$$
and apply Hölder's inequality.

Problems §10.2: Normed Vector Spaces of Random Variables
10. Show that if $X_n$ converges in mean of order $p$ to $X$, and if $Y_n$ converges in mean of order $p$ to $Y$, then $X_n + Y_n$ converges in mean of order $p$ to $X + Y$.
11. Let $X_n$ converge in mean of order $p$ to $X \in L^p$. Show that $X_n$ is Cauchy.
12. Let $X_n$ be a Cauchy sequence in $L^p$. Show that $X_n$ is bounded in the sense that there is a finite constant $K$ such that $\|X_n\|_p \le K$ for all $n$. Remark. Since by the previous problem, a convergent sequence is Cauchy, it follows that a convergent sequence is bounded.
13. Let $X_n$ converge in mean of order $p$ to $X \in L^p$, and let $Y_n$ converge in mean of order $p$ to $Y \in L^p$. Show that $X_n Y_n$ converges in mean of order $p$ to $XY$.
14. Show that limits in mean of order $p$ are unique; i.e., if $X_n$ converges in mean of order $p$ to both $X$ and $Y$, show that $E[|X - Y|^p] = 0$.
15. Show that
$$\big|\|X\|_p - \|Y\|_p\big| \le \|X - Y\|_p \le \|X\|_p + \|Y\|_p.$$
16. If $X_n$ converges in mean of order $p$ to $X$, show that
$$\lim_{n\to\infty} E[|X_n|^p] = E[|X|^p].$$
17. Show that
$$\sum_{k=1}^{\infty} h_k X_k$$
is well defined as an element of $L^1$ assuming that
$$\sum_{k=1}^{\infty} |h_k| < \infty \quad\text{and}\quad E[|X_k|] \le B, \text{ for all } k,$$
10.8 Problems
319
where B is a nite constant. Remark. By the Cauchy{Schwarz inequality, a sucient condition for E[jXkj] B is that E[Xk2 ] B 2 .
Problems §10.3: The Wiener Integral (Again)
18. Recall that the Wiener integral of $g$ was defined to be the mean-square limit of
$$Y_n := \int_0^\infty g_n(\tau)\, dW_\tau,$$
where each $g_n$ is piecewise constant, and $\|g_n - g\| \to 0$. Show that
$$E\Bigl[\int_0^\infty g(\tau)\, dW_\tau\Bigr] = 0,$$
and
$$E\Bigl[\Bigl(\int_0^\infty g(\tau)\, dW_\tau\Bigr)^2\Bigr] = \sigma^2 \int_0^\infty g(\tau)^2\, d\tau.$$
Hint: Let $Y$ denote the mean-square limit of $Y_n$. Show that $E[Y_n] \to E[Y]$ and that $E[Y_n^2] \to E[Y^2]$.
19. Use the following approach to show that the limit definition of the Wiener integral is unique. Let $Y_n$ and $Y$ be as in the text. If $\tilde g_n$ is another sequence of piecewise constant functions with $\|\tilde g_n - g\| \to 0$, put
$$\tilde Y_n := \int_0^\infty \tilde g_n(\tau)\, dW_\tau.$$
By the argument that $Y_n$ has a limit $Y$, $\tilde Y_n$ has a limit, say $\tilde Y$. Show that $\|Y - \tilde Y\|_2 = 0$.
20. Show that the Wiener integral is linear even for functions that are not piecewise constant.
Problems §10.4: Projections, Orthogonality Principle, Projection Theorem
21. Let $C = \{Y \in L^p : \|Y\|_p \le r\}$. For $\|X\|_p > r$, find a projection of $X$ onto $C$.
22. Show that if $\hat X \in M$ satisfies (10.6), it is unique.
23. Let $N \subset M$ be subspaces of $L^2$. Assume that the projection of $X \in L^2$ onto $M$ exists and denote it by $\hat X_M$. Similarly, assume that the projection of $X$ onto $N$ exists and denote it by $\hat X_N$. Show that $\hat X_N$ is the projection of $\hat X_M$ onto $N$.
24. The preceding problem shows that when $N \subset M$ are subspaces, the projection of $X$ onto $N$ can be computed in two stages: First project $X$ onto $M$ and then project the result onto $N$. Show that this is not true in general if $N$ and $M$ are not both subspaces. Hint: Draw a disk of radius one. Then draw a straight line through the origin. Identify $M$ with the disk and identify $N$ with the line segment obtained by intersecting the line with the disk. If the point $X$ is outside the disk and not on the line, then projecting first onto the disk and then onto the segment does not give the projection.
Remark. Interestingly, projecting first onto the line (not the line segment) and then onto the disk does give the correct answer.
25. Derive the parallelogram law,
$$\|X + Y\|_2^2 + \|X - Y\|_2^2 = 2\bigl(\|X\|_2^2 + \|Y\|_2^2\bigr).$$
26. Show that the result of Example 10.7 holds if the assumption $\sum_{k=1}^{\infty} |h_k| < \infty$ is replaced by the two assumptions $\sum_{k=1}^{\infty} |h_k|^2 < \infty$ and $E[X_k X_l] = 0$ for $k \ne l$.
27. If $\|X_n - X\|_2 \to 0$ and $\|Y_n - Y\|_2 \to 0$, show that
$$\lim_{n\to\infty} \langle X_n, Y_n \rangle = \langle X, Y \rangle.$$
28. If $Y$ has density $f_Y$, show that $M$ in (10.8) is closed. Hints: (i) Observe that
$$E[g(Y)^2] = \int_{-\infty}^{\infty} g(y)^2 f_Y(y)\, dy.$$
(ii) Use the fact that the set of functions
$$G := \Bigl\{ g : \int_{-\infty}^{\infty} g(y)^2 f_Y(y)\, dy < \infty \Bigr\}$$
is complete if $G$ is equipped with the norm
$$\|g\|_Y^2 := \int_{-\infty}^{\infty} g(y)^2 f_Y(y)\, dy.$$
(iii) Follow the method of Example 10.9.
Problems §10.5: Conditional Expectation
29. Fix $X \in L^1$, and suppose $X \ge 0$. Show that $E[X|Y] \ge 0$ in the sense that $\wp(E[X|Y] < 0) = 0$. Hints: To begin, note that
$$\{E[X|Y] < 0\} = \bigcup_{n=1}^{\infty} \{E[X|Y] < -1/n\}.$$
By limit property (1.4),
$$\wp(E[X|Y] < 0) = \lim_{n\to\infty} \wp(E[X|Y] < -1/n).$$
To obtain a proof by contradiction, suppose that $\wp(E[X|Y] < 0) > 0$. Then for sufficiently large $n$, $\wp(E[X|Y] < -1/n) > 0$ too. Take $g(Y) = I_B(E[X|Y])$ where $B = (-\infty, -1/n)$ and consider the defining relationship
$$E[X g(Y)] = E\bigl[E[X|Y]\, g(Y)\bigr].$$
30. For $X \in L^1$, let $X^+$ and $X^-$ be as in the text, and denote their corresponding conditional expectations by $E[X^+|Y]$ and $E[X^-|Y]$, respectively. Show that
$$E[X g(Y)] = E\bigl[\bigl(E[X^+|Y] - E[X^-|Y]\bigr) g(Y)\bigr].$$
31. For $X \in L^1$, derive the smoothing property of conditional expectation,
$$E[X|Y_1] = E\bigl[E[X|Y_2, Y_1] \,\big|\, Y_1\bigr].$$
Remark. If $X \in L^2$, this is an instance of Problem 23.
32. If $X \in L^1$ and $h$ is a bounded function, show that
$$E[h(Y)X|Y] = h(Y)E[X|Y].$$
33. If $X \in L^2$ and $h(Y) \in L^2$, use the orthogonality principle to show that
$$E[h(Y)X|Y] = h(Y)E[X|Y].$$
34. If $X \in L^1$ and if $h(Y)X \in L^1$, show that
$$E[h(Y)X|Y] = h(Y)E[X|Y].$$
Hint: Use a limit argument as in the text to approximate $h(Y)$ by a bounded function of $Y$. Then apply either of the preceding two problems.
Problems §10.6: The Spectral Representation
35. Show that the mapping $T$ defined by
$$G_0(f) = \sum_{n=-N}^{N} g_n e^{j2\pi f n} \;\mapsto\; \sum_{n=-N}^{N} g_n X_n$$
is well defined; i.e., if $G_0(f)$ has another representation, say
$$G_0(f) = \sum_{n=-N}^{N} \tilde g_n e^{j2\pi f n},$$
show that
$$\sum_{n=-N}^{N} \tilde g_n X_n = \sum_{n=-N}^{N} g_n X_n.$$
Hint: With $d_n := g_n - \tilde g_n$, it suffices to show that if $\sum_{n=-N}^{N} d_n e^{j2\pi f n} = 0$, then $Y := \sum_{n=-N}^{N} d_n X_n = 0$, and for this it is enough to show that $E[|Y|^2] = 0$. Formula (10.12) may be helpful.
36. Show that if $H_0(f)$ is defined similarly to $G_0(f)$ in the preceding problem, then
$$E[T(G_0)\, T(H_0)^*] = \int_{-1/2}^{1/2} G_0(f)\, H_0(f)^*\, S(f)\, df.$$
37. Show that if $G_n$ and $\tilde G_n$ are trigonometric polynomials with $\|G_n - G\| \to 0$ and $\|\tilde G_n - G\| \to 0$, then
$$\lim_{n\to\infty} T(G_n) = \lim_{n\to\infty} T(\tilde G_n).$$
Also show that $T$ is norm preserving, linear, and continuous on $S$. Hint: Put $T(G) := \lim_{n\to\infty} T(G_n)$ and $Y := \lim_{n\to\infty} T(\tilde G_n)$. Show that $\|T(G) - Y\|_2 = 0$. Use the fact that $T$ is norm preserving on trigonometric polynomials.
38. Show that $Z_f := T(I_{[-1/2, f]})$ is well defined. Hint: It suffices to show that $I_{[-1/2, f]} \in S$.
39. Since
$$\int_{-1/2}^{1/2} G(f)\, dZ_f = T(G),$$
use results about $T(G)$ to prove the following.
(a) Show that
$$E\Bigl[\int_{-1/2}^{1/2} G(f)\, dZ_f\Bigr] = 0.$$
(b) Show that
$$E\Bigl[\Bigl(\int_{-1/2}^{1/2} G(f)\, dZ_f\Bigr)\Bigl(\int_{-1/2}^{1/2} H(\nu)\, dZ_\nu\Bigr)^{*}\Bigr] = \int_{-1/2}^{1/2} G(f)\, H(f)^*\, S(f)\, df.$$
(c) Show that $Z_f$ has orthogonal (uncorrelated) increments.
CHAPTER 11
Other Modes of Convergence
This chapter introduces three further types of convergence: convergence in probability, convergence in distribution, and almost sure convergence. They are related to each other (and to convergence in mean of order $p$) as shown in Figure 11.1.
[Figure 11.1. Implications of various types of convergence: convergence in mean of order $p$ implies convergence in probability; almost sure convergence implies convergence in probability; convergence in probability implies convergence in distribution.]
Section 11.1 introduces the notion of convergence in probability. Convergence in probability will be important in Chapter 12 on parameter estimation and confidence intervals, where it justifies various statistical procedures that are used to estimate unknown parameters.
Section 11.2 introduces the notion of convergence in distribution. Convergence in distribution is often used to approximate probabilities that are hard to calculate exactly. Suppose that $X_n$ is a random variable whose cumulative distribution function $F_{X_n}(x)$ is very hard to compute. But suppose that for large $n$, $F_{X_n}(x) \approx F_X(x)$, where $F_X(x)$ is a cdf that is very easy to compute. When $F_{X_n}(x) \to F_X(x)$, we say that $X_n$ converges in distribution to $X$. In this case, we can approximate $F_{X_n}(x)$ by $F_X(x)$ if $n$ is large enough. When the central limit theorem applies, $F_X$ is the normal cdf with mean zero and variance one.
Section 11.3 introduces the notion of almost-sure convergence. This kind of convergence is more technically demanding to analyze, but it allows us to discuss important results such as the strong law of large numbers, for which we give the usual fourth-moment proof. Almost-sure convergence also allows us to derive the Skorohod representation, which is a powerful tool for studying convergence in distribution.
11.1. Convergence in Probability
We say that $X_n$ converges in probability to $X$ if
$$\lim_{n\to\infty} \wp(|X_n - X| \ge \varepsilon) = 0 \quad\text{for all } \varepsilon > 0.$$
The first thing to note about convergence in probability is that it is implied by convergence in mean of order $p$. Using Markov's inequality, we have
$$\wp(|X_n - X| \ge \varepsilon) = \wp(|X_n - X|^p \ge \varepsilon^p) \le \frac{E[|X_n - X|^p]}{\varepsilon^p}.$$
Hence, if $X_n$ converges in mean of order $p$ to $X$, then $X_n$ converges in probability to $X$ as well.
Example 11.1 (Weak Law of Large Numbers). Since the sample mean $M_n$ defined in Example 10.3 converges in mean square to $m$, it follows that $M_n$ converges in probability to $m$. This is known as the weak law of large numbers.
The second thing to note about convergence in probability is that it is possible to have $X_n$ converging in probability to $X$, while $X_n$ does not converge to $X$ in mean of order $p$ for any $p \ge 1$. For example, let $U \sim \text{uniform}[0, 1]$, and consider the sequence
$$X_n := n I_{[0, 1/n]}(U), \quad n = 1, 2, \ldots.$$
Observe that for $\varepsilon > 0$,
$$\{|X_n| \ge \varepsilon\} = \begin{cases} \{U \le 1/n\}, & n \ge \varepsilon, \\ \varnothing, & n < \varepsilon. \end{cases}$$
It follows that for all $n$,
$$\wp(|X_n| \ge \varepsilon) \le \wp(U \le 1/n) = 1/n \to 0.$$
Thus, $X_n$ converges in probability to zero. However,
$$E[|X_n|^p] = n^p \wp(U \le 1/n) = n^{p-1},$$
which does not go to zero as $n \to \infty$.
The third thing to note about convergence in probability is that if $g(x, y)$ is continuous, and if $X_n$ converges in probability to $X$, and $Y_n$ converges in probability to $Y$, then $g(X_n, Y_n)$ converges in probability to $g(X, Y)$. This result is derived in Problem 6. In particular, note that since the functions $g(x, y) = x + y$ and $g(x, y) = xy$ are continuous, $X_n + Y_n$ and $X_n Y_n$ converge in probability to $X + Y$ and $XY$, respectively, whenever $X_n$ and $Y_n$ converge in probability to $X$ and $Y$, respectively.
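The counterexample $X_n := n I_{[0,1/n]}(U)$ from this section can be checked numerically. The following is a minimal sketch, not part of the original text; the function names `prob_exceeds`, `moment`, and `simulate_prob` are hypothetical, and the Monte Carlo part uses an arbitrary seed and trial count.

```python
import random

def prob_exceeds(n, eps):
    """Exact P(|X_n| >= eps) for X_n = n * 1{U <= 1/n}, U ~ uniform[0, 1]."""
    return 1.0 / n if n >= eps else 0.0

def moment(n, p):
    """Exact E[|X_n|^p] = n^p * P(U <= 1/n) = n^(p - 1)."""
    return float(n) ** (p - 1)

def simulate_prob(n, eps, trials=100_000, seed=0):
    """Monte Carlo estimate of P(|X_n| >= eps), as a sanity check (seed is arbitrary)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = n if rng.random() <= 1.0 / n else 0
        if abs(x) >= eps:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # The tail probability shrinks like 1/n, yet the p-th moments do not vanish.
    for n in (1, 10, 100, 1000):
        print(n, prob_exceeds(n, 0.5), simulate_prob(n, 0.5), moment(n, 1), moment(n, 2))
```

The printed rows should show the probability column decaying while the first moment stays at one and the second moment grows, which is the behavior derived above.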
Example 11.2. Since Problem 6 is somewhat involved, it is helpful to first see the derivation for the special case in which $X_n$ and $Y_n$ both converge in probability to constant random variables, say $u$ and $v$, respectively.
Solution. Let $\varepsilon > 0$ be given. We show that
$$\wp(|g(X_n, Y_n) - g(u, v)| \ge \varepsilon) \to 0.$$
By the continuity of $g$ at the point $(u, v)$, there is a $\delta > 0$ such that whenever $(u', v')$ is within $2\delta$ of $(u, v)$; i.e., whenever
$$\sqrt{|u' - u|^2 + |v' - v|^2} < 2\delta,$$
then
$$|g(u', v') - g(u, v)| < \varepsilon.$$
Since
$$\sqrt{|u' - u|^2 + |v' - v|^2} \le |u' - u| + |v' - v|,$$
we see that
$$|X_n - u| < \delta \text{ and } |Y_n - v| < \delta \;\Rightarrow\; |g(X_n, Y_n) - g(u, v)| < \varepsilon.$$
Equivalently, taking the contrapositive,
$$|g(X_n, Y_n) - g(u, v)| \ge \varepsilon \;\Rightarrow\; |X_n - u| \ge \delta \text{ or } |Y_n - v| \ge \delta.$$
It follows that
$$\wp(|g(X_n, Y_n) - g(u, v)| \ge \varepsilon) \le \wp(|X_n - u| \ge \delta) + \wp(|Y_n - v| \ge \delta).$$
These last two terms go to zero by hypothesis.
Example 11.3. Let $X_n$ converge in probability to $x$, and let $c_n$ be a converging sequence of real numbers with limit $c$. Show that $c_n X_n$ converges in probability to $cx$.
Solution. Define the sequence of constant random variables $Y_n := c_n$. It is a simple exercise to show that $Y_n$ converges in probability to $c$. Thus, $X_n Y_n = c_n X_n$ converges in probability to $cx$.
11.2. Convergence in Distribution
We say that $X_n$ converges in distribution to $X$, or converges weakly to $X$ if
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x), \quad\text{for all } x \in C(F_X),$$
where $C(F_X)$ is the set of points $x$ at which $F_X$ is continuous. If $F_X$ has a jump at a point $x_0$, then we do not say anything about the existence of
$$\lim_{n\to\infty} F_{X_n}(x_0).$$
The first thing to note about convergence in distribution is that it is implied by convergence in probability. To derive this result, fix any $x$ at which $F_X$ is continuous. For any $\varepsilon > 0$, write
$$\begin{aligned} F_{X_n}(x) &= \wp(X_n \le x, X \le x + \varepsilon) + \wp(X_n \le x, X > x + \varepsilon) \\ &\le F_X(x + \varepsilon) + \wp(X_n - X < -\varepsilon) \\ &\le F_X(x + \varepsilon) + \wp(|X_n - X| \ge \varepsilon). \end{aligned}$$
It follows that
$$\limsup_{n\to\infty} F_{X_n}(x) \le F_X(x + \varepsilon).$$
Similarly,
$$\begin{aligned} F_X(x - \varepsilon) &= \wp(X \le x - \varepsilon, X_n \le x) + \wp(X \le x - \varepsilon, X_n > x) \\ &\le F_{X_n}(x) + \wp(X - X_n < -\varepsilon) \\ &\le F_{X_n}(x) + \wp(|X_n - X| \ge \varepsilon), \end{aligned}$$
and we obtain
$$F_X(x - \varepsilon) \le \liminf_{n\to\infty} F_{X_n}(x).$$
Since the liminf is always less than or equal to the limsup, we have
$$F_X(x - \varepsilon) \le \liminf_{n\to\infty} F_{X_n}(x) \le \limsup_{n\to\infty} F_{X_n}(x) \le F_X(x + \varepsilon).$$
Since $F_X$ is continuous at $x$, letting $\varepsilon$ go to zero shows that the liminf and the limsup are equal to each other and to $F_X(x)$. Hence $\lim_{n\to\infty} F_{X_n}(x)$ exists and equals $F_X(x)$.
The need to restrict attention to continuity points can also be seen in the following example.
Example 11.4. Let $X_n \sim \exp(n)$; i.e.,
$$F_{X_n}(x) = \begin{cases} 1 - e^{-nx}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$
For $x > 0$, $F_{X_n}(x) \to 1$ as $n \to \infty$. However, since $F_{X_n}(0) = 0$, we see that $F_{X_n}(x) \to I_{(0,\infty)}(x)$, which, being left continuous at zero, is not the cumulative distribution function of any random variable. Fortunately, the constant random variable $X \equiv 0$ has $I_{[0,\infty)}(x)$ for its cdf. Since the only point of discontinuity is zero, $F_{X_n}(x)$ does indeed converge to $F_X(x)$ at all continuity points of $F_X$.
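The behavior in Example 11.4 is easy to tabulate. The sketch below is illustrative only; the helper names `cdf_exp` and `cdf_zero` are hypothetical and the evaluation points are arbitrary choices.

```python
import math

def cdf_exp(n, x):
    """Cdf of an exp(n) random variable: 1 - e^{-n x} for x >= 0, else 0."""
    return 1.0 - math.exp(-n * x) if x >= 0 else 0.0

def cdf_zero(x):
    """Cdf of the constant random variable X = 0, i.e. I_[0, inf)(x)."""
    return 1.0 if x >= 0 else 0.0

if __name__ == "__main__":
    # At x = 0 the values stay at 0 for every n, although the limit cdf equals 1 there.
    for x in (-0.1, 0.0, 0.01, 1.0):
        values = [cdf_exp(n, x) for n in (1, 10, 100, 1000)]
        print(x, values, "limit cdf:", cdf_zero(x))
```

Only the row for $x = 0$, the single discontinuity of the limit cdf, fails to converge to the limit cdf, which is why the definition excludes such points.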
The second thing to note about convergence in distribution is that if $X_n$ converges in distribution to $X$, and if $X$ is a constant random variable, say $X \equiv c$, then $X_n$ converges in probability to $c$. To derive this result, first observe that
$$\{|X_n - c| < \varepsilon\} = \{-\varepsilon < X_n - c < \varepsilon\} = \{c - \varepsilon < X_n\} \cap \{X_n < c + \varepsilon\}.$$
It is then easy to see that
$$\begin{aligned} \wp(|X_n - c| \ge \varepsilon) &\le \wp(X_n \le c - \varepsilon) + \wp(X_n \ge c + \varepsilon) \\ &\le F_{X_n}(c - \varepsilon) + \wp(X_n > c + \varepsilon/2) \\ &= F_{X_n}(c - \varepsilon) + 1 - F_{X_n}(c + \varepsilon/2). \end{aligned}$$
Since $X \equiv c$, $F_X(x) = I_{[c,\infty)}(x)$. Therefore, $F_{X_n}(c - \varepsilon) \to 0$ and $F_{X_n}(c + \varepsilon/2) \to 1$.
The third thing to note about convergence in distribution is that it is equivalent to the condition
$$\lim_{n\to\infty} E[g(X_n)] = E[g(X)] \quad\text{for every bounded continuous function } g. \tag{11.1}$$
A proof that convergence in distribution implies (11.1) can be found in [17, p. 316]. A proof that (11.1) implies convergence in distribution is sketched in Problem 19.
Example 11.5. Show that if $X_n$ converges in distribution to $X$, then the characteristic function of $X_n$ converges to the characteristic function of $X$.
Solution. Fix any $\nu$, and take $g(x) = e^{j\nu x}$. Then
$$\varphi_{X_n}(\nu) = E[e^{j\nu X_n}] = E[g(X_n)] \to E[g(X)] = E[e^{j\nu X}] = \varphi_X(\nu).$$
Remark. It is also true that if $\varphi_{X_n}(\nu)$ converges to $\varphi_X(\nu)$ for all $\nu$, then $X_n$ converges in distribution to $X$ [4, p. 349, Theorem 26.3].
Example 11.6. Let $X_n \sim N(m_n, \sigma_n^2)$, where $m_n \to m$ and $\sigma_n^2 \to \sigma^2$. If $X \sim N(m, \sigma^2)$, show that $X_n$ converges in distribution to $X$.
Solution. It suffices to show that the characteristic function of $X_n$ converges to the characteristic function of $X$. Write
$$\varphi_{X_n}(\nu) = e^{j\nu m_n - \sigma_n^2 \nu^2/2} \to e^{j\nu m - \sigma^2 \nu^2/2} = \varphi_X(\nu).$$
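As a numerical companion to Example 11.6, the sketch below evaluates the Gaussian characteristic function for one assumed pair of sequences, $m_n = m + 1/n$ and $\sigma_n^2 = \sigma^2 + 1/n$. These sequences, the default values of `m` and `var`, and the function names are illustrative assumptions, not part of the text.

```python
import cmath

def gaussian_cf(nu, mean, var):
    """Characteristic function E[e^{j nu X}] of X ~ N(mean, var)."""
    return cmath.exp(1j * nu * mean - var * nu * nu / 2.0)

def cf_gap(n, nu, m=1.0, var=2.0):
    """|phi_{X_n}(nu) - phi_X(nu)| for the assumed m_n = m + 1/n, var_n = var + 1/n."""
    return abs(gaussian_cf(nu, m + 1.0 / n, var + 1.0 / n) - gaussian_cf(nu, m, var))

if __name__ == "__main__":
    # The gap at a fixed frequency shrinks as n grows.
    for n in (1, 10, 100, 1000):
        print(n, cf_gap(n, nu=1.0))
```

For a fixed $\nu$ the gap decays roughly like $1/n$ under these assumed sequences, consistent with the pointwise convergence used in the solution.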
A very important instance of convergence in distribution is the central limit theorem. This result says that if $X_1, X_2, \ldots$ are independent, identically distributed random variables with finite mean $m$ and finite variance $\sigma^2$, and if
$$M_n := \frac{1}{n} \sum_{i=1}^{n} X_i \quad\text{and}\quad Y_n := \frac{M_n - m}{\sigma/\sqrt{n}},$$
then
$$\lim_{n\to\infty} F_{Y_n}(y) = \Phi(y) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-t^2/2}\, dt.$$
In other words, for all $y$, $F_{Y_n}(y)$ converges to the standard normal cdf $\Phi(y)$ (which is continuous at all points $y$). To better see the difference between the weak law of large numbers and the central limit theorem, we specialize to the case $m = 0$ and $\sigma^2 = 1$. In this case,
$$Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i.$$
In other words, the sample mean $M_n$ divides the sum $X_1 + \cdots + X_n$ by $n$ while $Y_n$ divides the sum only by $\sqrt{n}$. The difference is that $M_n$ converges in probability (and in distribution) to the constant zero, while $Y_n$ converges in distribution to an $N(0, 1)$ random variable. Notice also that the weak law requires only uncorrelated random variables, while the central limit theorem requires i.i.d. random variables. The central limit theorem was derived in Section 4.7, where examples and problems can also be found. The central limit theorem is also used to determine confidence intervals in Chapter 12.
Example 11.7. Let $N_t$ be a Poisson process of rate $\lambda$. Show that
$$Y_n := \frac{N_n/n - \lambda}{\sqrt{\lambda/n}}$$
converges in distribution to an $N(0, 1)$ random variable.
Solution. Since $N_0 \equiv 0$,
$$\frac{N_n}{n} = \frac{1}{n} \sum_{k=1}^{n} (N_k - N_{k-1}).$$
By the independent increments property of the Poisson process, the terms of the sum are i.i.d. Poisson($\lambda$) random variables, with mean $\lambda$ and variance $\lambda$. Hence, $Y_n$ has the structure to apply the central limit theorem, and so $F_{Y_n}(y) \to \Phi(y)$ for all $y$.
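For the Poisson process of Example 11.7, the cdf of $Y_n$ can be computed exactly, because $N_n$ is Poisson with mean $n\lambda$. The sketch below compares it with $\Phi(y)$; the function names are hypothetical, and the rate $\lambda = 1$ and the point $y = 0.5$ are arbitrary choices made for illustration.

```python
import math

def poisson_cdf(k, mu):
    """P(N <= k) for N ~ Poisson(mu), summing the pmf computed in the log domain."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(int(math.floor(k)) + 1):
        total += math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1))
    return min(total, 1.0)

def std_normal_cdf(y):
    """Standard normal cdf Phi(y), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def cdf_Yn(y, n, lam):
    """Exact P(Y_n <= y), using N_n ~ Poisson(n * lam) and
    Y_n <= y  <=>  N_n <= n*lam + y*sqrt(n*lam)."""
    mu = n * lam
    return poisson_cdf(mu + y * math.sqrt(mu), mu)

if __name__ == "__main__":
    for n in (4, 25, 100, 400):
        print(n, cdf_Yn(0.5, n, 1.0), std_normal_cdf(0.5))
```

Because $N_n$ is integer valued, the exact cdf is a staircase, so the agreement with $\Phi(y)$ need not improve monotonically in $n$ even though it holds in the limit.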
The next example is a version of Slutsky's Theorem. We need it in our analysis of confidence intervals in Chapter 12.
Example 11.8. Let $Y_n$ be a sequence of random variables with corresponding cdfs $F_n$. Suppose that $F_n$ converges to a continuous cdf $F$. Suppose also that $U_n$ converges in probability to 1. Show that
$$\lim_{n\to\infty} \wp(Y_n \le yU_n) = F(y).$$
Solution. The result is very intuitive. For large $n$, $U_n \approx 1$ and $F_n(y) \approx F(y)$ suggest that
$$\wp(Y_n \le yU_n) \approx \wp(Y_n \le y) = F_n(y) \approx F(y).$$
The precise details are more involved. Fix any $y > 0$ and $0 < \delta < 1$. Then $\wp(Y_n \le yU_n)$ is equal to
$$\wp(Y_n \le yU_n, |U_n - 1| < \delta) + \wp(Y_n \le yU_n, |U_n - 1| \ge \delta).$$
The second term is upper bounded by $\wp(|U_n - 1| \ge \delta)$, which goes to zero. Rewrite the first term as
$$\wp(Y_n \le yU_n, 1 - \delta < U_n < 1 + \delta),$$
which is equivalent to
$$\wp(Y_n \le yU_n, y(1 - \delta) < yU_n < y(1 + \delta)). \tag{11.2}$$
Now this is upper bounded by
$$\wp(Y_n \le y(1 + \delta)) = F_n(y(1 + \delta)).$$
Thus,
$$\limsup_{n\to\infty} \wp(Y_n \le yU_n) \le F(y(1 + \delta)).$$
Next, (11.2) is lower bounded by
$$\wp(Y_n \le y(1 - \delta), |U_n - 1| < \delta),$$
which is equal to
$$\wp(Y_n \le y(1 - \delta)) - \wp(Y_n \le y(1 - \delta), |U_n - 1| \ge \delta).$$
Now the second term satisfies
$$\wp(Y_n \le y(1 - \delta), |U_n - 1| \ge \delta) \le \wp(|U_n - 1| \ge \delta) \to 0.$$
In light of these observations,
$$\liminf_{n\to\infty} \wp(Y_n \le yU_n) \ge F(y(1 - \delta)).$$
Since $\delta$ was arbitrary, and since $F$ is continuous,
$$\lim_{n\to\infty} \wp(Y_n \le yU_n) = F(y).$$
The case $y < 0$ is similar.
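The conclusion of Example 11.8 can be illustrated by simulation. The sketch below assumes one particular model that satisfies the hypotheses: $Y_n \sim N(0,1)$ for every $n$ (so $F_n = F = \Phi$) and $U_n = 1 + V/\sqrt{n}$ with $V \sim N(0,1)$ independent of $Y_n$. The model, the seed, and the function names are assumptions made for illustration only.

```python
import math
import random

def std_normal_cdf(y):
    """Standard normal cdf Phi(y)."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def estimate(y, n, trials=50_000, seed=1):
    """Monte Carlo estimate of P(Y_n <= y * U_n) under the assumed model
    Y_n ~ N(0, 1) and U_n = 1 + V / sqrt(n) with V ~ N(0, 1) independent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y_n = rng.gauss(0.0, 1.0)
        u_n = 1.0 + rng.gauss(0.0, 1.0) / math.sqrt(n)
        if y_n <= y * u_n:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(n, estimate(1.0, n), "F(y) =", std_normal_cdf(1.0))
```

For small $n$ the random scaling $U_n$ visibly perturbs the probability; for large $n$ the estimate should settle near $F(y)$, up to Monte Carlo error.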
11.3. Almost Sure Convergence
Let $X_n$ be any sequence of random variables, and let $X$ be any other random variable. Put
$$G := \{\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}.$$
In other words, $G$ is the set of sample points $\omega \in \Omega$ for which the sequence of real numbers $X_n(\omega)$ converges to the real number $X(\omega)$. We think of $G$ as the set of "good" $\omega$'s for which $X_n(\omega) \to X(\omega)$. Similarly, we think of the complement of $G$, $G^c$, as the set of "bad" $\omega$'s for which $X_n(\omega) \not\to X(\omega)$. We say that $X_n$ converges almost surely to $X$ if (see Note 1)
$$\wp\bigl(\{\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) \ne X(\omega)\}\bigr) = 0. \tag{11.3}$$
In other words, $X_n$ converges almost surely to $X$ if the "bad" set $G^c$ has probability zero. We write $X_n \to X$ a.s. to indicate that $X_n$ converges almost surely to $X$.
If it should happen that the bad set $G^c = \varnothing$, then $X_n(\omega) \to X(\omega)$ for every $\omega \in \Omega$. This is called sure convergence, and is a special case of almost sure convergence.
Because almost sure convergence is so closely linked to the convergence of ordinary sequences of real numbers, many results are easy to derive.
Example 11.9. Show that if $X_n \to X$ a.s. and $Y_n \to Y$ a.s., then $X_n + Y_n \to X + Y$ a.s.
Solution. Let $G_X := \{X_n \to X\}$ and $G_Y := \{Y_n \to Y\}$. In other words, $G_X$ and $G_Y$ are the "good" sets for the sequences $X_n$ and $Y_n$ respectively. Now consider any $\omega \in G_X \cap G_Y$. For such $\omega$, the sequence of real numbers $X_n(\omega)$ converges to the real number $X(\omega)$, and the sequence of real numbers $Y_n(\omega)$ converges to the real number $Y(\omega)$. Hence, from convergence theory for sequences of real numbers,
$$X_n(\omega) + Y_n(\omega) \to X(\omega) + Y(\omega). \tag{11.4}$$
At this point, we have shown that
$$G_X \cap G_Y \subset G, \tag{11.5}$$
where $G$ denotes the set of all $\omega$ for which (11.4) holds. To prove that $X_n + Y_n \to X + Y$ a.s., we must show that $\wp(G^c) = 0$. On account of (11.5), $G^c \subset G_X^c \cup G_Y^c$. Hence,
$$\wp(G^c) \le \wp(G_X^c) + \wp(G_Y^c),$$
and the two terms on the right are zero because $X_n$ and $Y_n$ both converge almost surely.
In order to discuss almost sure convergence in more detail, it is helpful to characterize when (11.3) holds. Recall that a sequence of real numbers $x_n$ converges to the real number $x$ if given any $\varepsilon > 0$, there is a positive integer $N$ such that for all $n \ge N$, $|x_n - x| < \varepsilon$. Hence,
$$\{X_n \to X\} = \bigcap_{\varepsilon > 0} \bigcup_{N=1}^{\infty} \bigcap_{n=N}^{\infty} \{|X_n - X| < \varepsilon\}.$$
Equivalently,
$$\{X_n \not\to X\} = \bigcup_{\varepsilon > 0} \bigcap_{N=1}^{\infty} \bigcup_{n=N}^{\infty} \{|X_n - X| \ge \varepsilon\}.$$
It is convenient to put
$$A_n(\varepsilon) := \{|X_n - X| \ge \varepsilon\}, \quad B_N(\varepsilon) := \bigcup_{n=N}^{\infty} A_n(\varepsilon),$$
and
$$A(\varepsilon) := \bigcap_{N=1}^{\infty} B_N(\varepsilon). \tag{11.6}$$
Then
$$\wp(\{X_n \not\to X\}) = \wp\Bigl(\bigcup_{\varepsilon > 0} A(\varepsilon)\Bigr).$$
If $X_n \to X$ a.s., then
$$0 = \wp(\{X_n \not\to X\}) = \wp\Bigl(\bigcup_{\varepsilon > 0} A(\varepsilon)\Bigr) \ge \wp\bigl(A(\varepsilon_0)\bigr)$$
for any choice of $\varepsilon_0 > 0$. Conversely, suppose $\wp(A(\varepsilon)) = 0$ for every positive $\varepsilon$. We claim that $X_n \to X$ a.s. To see this, observe that in the earlier characterization of the convergence of a sequence of real numbers, we could have restricted attention to values of $\varepsilon$ of the form $\varepsilon = 1/k$ for positive integers $k$. In other words, a sequence of real numbers $x_n$ converges to a real number $x$ if and only if for every positive integer $k$, there is a positive integer $N$ such that for all $n \ge N$, $|x_n - x| < 1/k$. Hence,
$$\wp(\{X_n \not\to X\}) = \wp\Bigl(\bigcup_{k=1}^{\infty} A(1/k)\Bigr) \le \sum_{k=1}^{\infty} \wp\bigl(A(1/k)\bigr).$$
From this we see that if $\wp(A(\varepsilon)) = 0$ for all $\varepsilon > 0$, then $\wp(\{X_n \not\to X\}) = 0$.
To say more about almost sure convergence, we need to examine (11.6) more closely. Observe that
$$B_N(\varepsilon) = \bigcup_{n=N}^{\infty} A_n(\varepsilon) \supset \bigcup_{n=N+1}^{\infty} A_n(\varepsilon) = B_{N+1}(\varepsilon).$$
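By limit property (1.5),
$$\wp\bigl(A(\varepsilon)\bigr) = \wp\Bigl(\bigcap_{N=1}^{\infty} B_N(\varepsilon)\Bigr) = \lim_{N\to\infty} \wp\bigl(B_N(\varepsilon)\bigr).$$
The next two examples use this equation to derive important results about almost sure convergence.
Example 11.10. Show that if $X_n \to X$ a.s., then $X_n$ converges in probability to $X$.
Solution. Recall that, by definition, $X_n$ converges in probability to $X$ if and only if $\wp(A_N(\varepsilon)) \to 0$ for every $\varepsilon > 0$. If $X_n \to X$ a.s., then for every $\varepsilon > 0$,
$$0 = \wp\bigl(A(\varepsilon)\bigr) = \lim_{N\to\infty} \wp\bigl(B_N(\varepsilon)\bigr).$$
Next, since
$$\wp\bigl(B_N(\varepsilon)\bigr) = \wp\Bigl(\bigcup_{n=N}^{\infty} A_n(\varepsilon)\Bigr) \ge \wp\bigl(A_N(\varepsilon)\bigr),$$
it follows that $\wp(A_N(\varepsilon)) \to 0$ too.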
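Example 11.11. Show that if
$$\sum_{n=1}^{\infty} \wp\bigl(A_n(\varepsilon)\bigr) < \infty \tag{11.7}$$
holds for all $\varepsilon > 0$, then $X_n \to X$ a.s.
Solution. For any $\varepsilon > 0$,
$$\begin{aligned} \wp\bigl(A(\varepsilon)\bigr) &= \lim_{N\to\infty} \wp\bigl(B_N(\varepsilon)\bigr) \\ &= \lim_{N\to\infty} \wp\Bigl(\bigcup_{n=N}^{\infty} A_n(\varepsilon)\Bigr) \\ &\le \lim_{N\to\infty} \sum_{n=N}^{\infty} \wp\bigl(A_n(\varepsilon)\bigr) \\ &= 0, \quad\text{on account of (11.7).} \end{aligned}$$
What we have done here is derive a particular instance of the Borel–Cantelli Lemma (cf. Problem 15 in Chapter 1).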
Example 11.12. Let $X_1, X_2, \ldots$ be i.i.d. zero-mean random variables with finite fourth moment. Show that
$$M_n := \frac{1}{n} \sum_{i=1}^{n} X_i \to 0 \quad\text{a.s.}$$
Solution. We already know from Example 10.3 that $M_n$ converges in mean square to zero, and hence, $M_n$ converges to zero in probability and in distribution as well. Unfortunately, as shown in Example 11.14, convergence in mean does not imply almost sure convergence. However, by the previous example, the almost sure convergence of $M_n$ to zero will be established if we can show that for every $\varepsilon > 0$,
$$\sum_{n=1}^{\infty} \wp(|M_n| \ge \varepsilon) < \infty. \tag{11.8}$$
By Markov's inequality,
$$\wp(|M_n| \ge \varepsilon) = \wp(|M_n|^4 \ge \varepsilon^4) \le \frac{E[|M_n|^4]}{\varepsilon^4}.$$
By Problem 34, there are finite, nonnegative constants $\alpha$ (depending on $E[X_i^4]$) and $\beta$ (depending on $E[X_i^2]$) such that
$$E[|M_n|^4] \le \frac{\alpha}{n^3} + \frac{\beta}{n^2}.$$
Hence, (11.8) holds by Problem 33.
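The fourth-moment bound quoted from Problem 34 can be sketched as follows. This is one standard route, written out here for convenience; the explicit choice of constants, $\alpha = E[X_1^4]$ and $\beta = 3(E[X_1^2])^2$, is an assumption made for illustration and is not stated in the text.

```latex
% Expand the fourth power of the sum. By independence and zero mean, every
% term in which some index appears exactly once has zero expectation, leaving
% n terms with all four indices equal and 3n(n-1) terms made of two distinct pairs.
\begin{align*}
E[M_n^4]
  &= \frac{1}{n^4} \sum_{i,j,k,l=1}^{n} E[X_i X_j X_k X_l] \\
  &= \frac{1}{n^4} \Bigl( n\,E[X_1^4] + 3n(n-1)\,\bigl(E[X_1^2]\bigr)^2 \Bigr) \\
  &\le \frac{E[X_1^4]}{n^3} + \frac{3\bigl(E[X_1^2]\bigr)^2}{n^2}.
\end{align*}
% Summability in (11.8) then follows from Markov's inequality, since
% both 1/n^3 and 1/n^2 are summable.
\[
\sum_{n=1}^{\infty} \wp(|M_n| \ge \varepsilon)
  \le \frac{1}{\varepsilon^4} \sum_{n=1}^{\infty}
      \Bigl( \frac{\alpha}{n^3} + \frac{\beta}{n^2} \Bigr) < \infty .
\]
```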
The preceding example is an instance of the strong law of large numbers. The derivation in the example is quite simple because of the assumption of finite fourth moments (which implies finiteness of the third, second, and first moments by Lyapunov's inequality). A derivation assuming only finite second moments can be found in [17, pp. 326–327], and assuming only finite first moments in [17, pp. 329–331].
Strong Law of Large Numbers (SLLN). Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with finite mean $m$. Then
$$M_n := \frac{1}{n} \sum_{i=1}^{n} X_i \to m \quad\text{a.s.}$$
Since almost sure convergence implies convergence in probability, the following form of the weak law of large numbers holds.
Weak Law of Large Numbers (WLLN). Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with finite mean $m$. Then
$$M_n := \frac{1}{n} \sum_{i=1}^{n} X_i$$
converges in probability to $m$.
The weak law of large numbers in Example 11.1, which relied on the mean-square version in Example 10.3, required finite second moments and uncorrelated random variables. The above form does not require finite second moments, but does require independent random variables.
Example 11.13. Let $W_t$ be a Wiener process with $E[W_t^2] = \sigma^2 t$. Use the strong law to show that $W_n/n$ converges almost surely to zero.
Solution. Since $W_0 \equiv 0$, we can write
$$\frac{W_n}{n} = \frac{1}{n} \sum_{k=1}^{n} (W_k - W_{k-1}).$$
By the independent increments property of the Wiener process, the terms of the sum are i.i.d. $N(0, \sigma^2)$ random variables. By the strong law, this average converges almost surely to zero.
Example 11.14. We construct a sequence of random variables that converges in mean to zero, but does not converge almost surely to zero. Fix any positive integer $n$. Then $n$ can be uniquely represented as $n = 2^m + k$, where $m$ and $k$ are integers satisfying $m \ge 0$ and $0 \le k \le 2^m - 1$. Define
$$g_n(x) = g_{2^m + k}(x) = I_{\left[\frac{k}{2^m}, \frac{k+1}{2^m}\right)}(x).$$
For example, taking $m = 2$ and $k = 0, 1, 2, 3$, which corresponds to $n = 4, 5, 6, 7$, we find
$$g_4(x) = I_{[0, \frac14)}(x), \quad g_5(x) = I_{[\frac14, \frac12)}(x),$$
and
$$g_6(x) = I_{[\frac12, \frac34)}(x), \quad g_7(x) = I_{[\frac34, 1)}(x).$$
For fixed $m$, as $k$ goes from 0 to $2^m - 1$, $g_{2^m + k}$ is a sequence of pulses moving from left to right. This is repeated for $m + 1$ with twice as many pulses that are half as wide. The two key ideas are that the pulses get narrower and that for any fixed $x \in [0, 1)$, $g_n(x) = 1$ for infinitely many $n$.
Now let $U \sim \text{uniform}[0, 1)$. Then
$$E[g_n(U)] = \wp\Bigl(\frac{k}{2^m} \le U < \frac{k+1}{2^m}\Bigr) = \frac{1}{2^m}.$$
Since $m \to \infty$ as $n \to \infty$, we see that $g_n(U)$ converges in mean to zero. It then follows that $g_n(U)$ converges in probability to zero. Since almost sure convergence also implies convergence in probability, the only possible almost sure limit is zero. (Limits in probability are unique by Problem 4.) However, we now show that $g_n(U)$ does not converge almost surely to zero. Fix any $x \in [0, 1)$. Then for each $m = 0, 1, 2, \ldots$,
$$\frac{k}{2^m} \le x < \frac{k+1}{2^m}$$
for some $k$ satisfying $0 \le k \le 2^m - 1$. For these values of $m$ and $k$, $g_{2^m + k}(x) = 1$. In other words, there are infinitely many values of $n = 2^m + k$ for which $g_n(x) = 1$. Hence, for $0 \le x < 1$, $g_n(x)$ does not converge to zero. Therefore,
$$\{U \in [0, 1)\} \subset \{g_n(U) \not\to 0\},$$
and it follows that
$$\wp(\{g_n(U) \not\to 0\}) \ge \wp(\{U \in [0, 1)\}) = 1.$$
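The moving-pulse construction of Example 11.14 translates directly into code. The sketch below is illustrative; the names `decompose`, `g`, and `hit_indices` are hypothetical helpers introduced here.

```python
def decompose(n):
    """Return (m, k) with n = 2**m + k, m >= 0 and 0 <= k <= 2**m - 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    m = n.bit_length() - 1
    return m, n - (1 << m)

def g(n, x):
    """Indicator of the interval [k / 2**m, (k + 1) / 2**m) evaluated at x."""
    m, k = decompose(n)
    return 1 if k / 2**m <= x < (k + 1) / 2**m else 0

def hit_indices(x, max_m):
    """All n < 2**(max_m + 1) with g_n(x) = 1; there is exactly one per level m."""
    return [n for n in range(1, 2 ** (max_m + 1)) if g(n, x) == 1]

if __name__ == "__main__":
    # Pulse width 2**(-m) shrinks, yet every level m hits the fixed point x once.
    print(hit_indices(0.3, 5))
```

For any fixed $x \in [0,1)$ the list of hits gains one index per level $m$, so $g_n(x) = 1$ infinitely often even though the pulse widths $2^{-m}$ go to zero.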
The Skorohod Representation Theorem
The Skorohod Representation Theorem says that if $X_n$ converges in distribution to $X$, then we can construct random variables $Y_n$ and $Y$ with $F_{Y_n} = F_{X_n}$, $F_Y = F_X$, and such that $Y_n$ converges almost surely to $Y$. This can often simplify proofs concerning convergence in distribution.
Example 11.15. Let $X_n$ converge in distribution to $X$, and let $c(x)$ be a continuous function. Show that $c(X_n)$ converges in distribution to $c(X)$.
Solution. Let $Y_n$ and $Y$ be as given by the Skorohod representation theorem. Since $Y_n$ converges almost surely to $Y$, the set
$$G := \{\omega \in \Omega : Y_n(\omega) \to Y(\omega)\}$$
has the property that $\wp(G^c) = 0$. Fix any $\omega \in G$. Then $Y_n(\omega) \to Y(\omega)$. Since $c$ is continuous,
$$c\bigl(Y_n(\omega)\bigr) \to c\bigl(Y(\omega)\bigr).$$
Thus, $c(Y_n)$ converges almost surely to $c(Y)$. Now recall that almost-sure convergence implies convergence in probability, which implies convergence in distribution. Hence, $c(Y_n)$ converges in distribution to $c(Y)$. To conclude, observe that since $Y_n$ and $X_n$ have the same cumulative distribution function, so do $c(Y_n)$ and $c(X_n)$. Similarly, $c(Y)$ and $c(X)$ have the same cumulative distribution function. Thus, $c(X_n)$ converges in distribution to $c(X)$.
We now derive the Skorohod representation theorem. For $0 < u < 1$, let
$$G_n(u) := \inf\{x \in \mathbb{R} : F_{X_n}(x) \ge u\},$$
and
$$G(u) := \inf\{x \in \mathbb{R} : F_X(x) \ge u\}.$$
By Problem 27(a) in Chapter 8,
$$G(u) \le x \;\Leftrightarrow\; u \le F_X(x),$$
or, equivalently,
$$G(u) > x \;\Leftrightarrow\; u > F_X(x),$$
and similarly for $G_n$ and $F_{X_n}$. Now let $U \sim \text{uniform}(0, 1)$. By Problem 27(b) in Chapter 8, $Y_n := G_n(U)$ and $Y := G(U)$ satisfy $F_{Y_n} = F_{X_n}$ and $F_Y = F_X$, respectively.
From the definition of $G$, it is easy to see that $G$ is nondecreasing. Hence, its set of discontinuities, call it $D$, is at most countable (Problem 39), and so $\wp(U \in D) = 0$ by Problem 40. We show below that for $u \notin D$, $G_n(u) \to G(u)$. It then follows that
$$Y_n(\omega) := G_n\bigl(U(\omega)\bigr) \to G\bigl(U(\omega)\bigr) =: Y(\omega),$$
except for $\omega \in \{\omega : U(\omega) \in D\}$, which has probability zero.
Fix any $u \notin D$, and let $\varepsilon > 0$ be given. Then between $G(u) - \varepsilon$ and $G(u)$ we can select a point $x$,
$$G(u) - \varepsilon < x < G(u),$$
that is a continuity point of $F_X$. Since $x < G(u)$,
$$F_X(x) < u.$$
Since $x$ is a continuity point of $F_X$, $F_{X_n}(x)$ must be close to $F_X(x)$ for large $n$. Thus, for large $n$, $F_{X_n}(x) < u$. But this implies $G_n(u) > x$. Thus,
$$G(u) - \varepsilon < x < G_n(u),$$
and it follows that
$$G(u) \le \liminf_{n\to\infty} G_n(u).$$
To obtain the reverse inequality involving the limsup, fix any $u'$ with $u < u' < 1$. Fix any $\varepsilon > 0$, and select another continuity point of $F_X$, again called $x$, such that
$$G(u') < x < G(u') + \varepsilon.$$
Then $G(u') \le x$, and so $u' \le F_X(x)$. But then $u < F_X(x)$. Since $F_{X_n}(x) \to F_X(x)$, for large $n$, $u < F_{X_n}(x)$, which implies $G_n(u) \le x$. It then follows that
$$\limsup_{n\to\infty} G_n(u) \le G(u').$$
Since $u$ is a continuity point of $G$, we can let $u' \to u$ to get
$$\limsup_{n\to\infty} G_n(u) \le G(u).$$
It now follows that $G_n(u) \to G(u)$ as claimed.
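The construction $Y_n := G_n(U)$ can be made concrete for the exponential sequence of Example 11.4, where the quantile functions have closed forms. The sketch below is an illustration under that assumption; the function names are hypothetical, and the argument `u` must lie strictly between 0 and 1.

```python
import math

def G_n(u, n):
    """Quantile function of exp(n): inf{x : 1 - e^{-n x} >= u} = -ln(1 - u) / n."""
    return -math.log(1.0 - u) / n

def G(u):
    """Quantile function of the constant X = 0: inf{x : I_[0, inf)(x) >= u} = 0."""
    return 0.0

def sample_path(u, ns):
    """Y_n(omega) = G_n(U(omega)) along the indices ns, for one fixed U(omega) = u."""
    return [G_n(u, n) for n in ns]

if __name__ == "__main__":
    # One fixed outcome u drives every Y_n, so the whole path converges to G(u) = 0.
    print(sample_path(0.9, [1, 10, 100, 1000]), "limit:", G(0.9))
```

Here a single uniform outcome drives all the $Y_n$ at once, which is what turns convergence of the cdfs into convergence of the sample paths.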
11.4. Notes
Notes §11.3: Almost Sure Convergence
Note 1. In order that (11.3) be well defined, it is necessary that the set
$$\{\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) \ne X(\omega)\}$$
be an event in the technical sense of Note 1 in Chapter 1. This is assured by the assumption that each $X_n$ is a random variable (the term "random variable" is used in the technical sense of Note 1 in Chapter 2). The fact that this assumption is sufficient is demonstrated in more advanced texts, e.g., [4, pp. 183–184].
11.5. Problems
Problems §11.1: Convergence in Probability
1. Let $c_n$ be a converging sequence of real numbers with limit $c$. Define the constant random variables $Y_n \equiv c_n$ and $Y \equiv c$. Show that $Y_n$ converges in probability to $Y$.
2. Let $U \sim \text{uniform}[0, 1]$, and put
$$X_n := n I_{[0, 1/\sqrt{n}]}(U), \quad n = 1, 2, \ldots.$$
Does $X_n$ converge in probability to zero?
3. Let $V$ be any continuous random variable with an even density, and let $c_n$ be any positive sequence with $c_n \to \infty$. Show that $X_n := V/c_n$ converges in probability to zero.
4. Show that limits in probability are unique; i.e., show that if $X_n$ converges in probability to $X$, and $X_n$ converges in probability to $Y$, then $\wp(X \ne Y) = 0$. Hint: Write
$$\{X \ne Y\} = \bigcup_{k=1}^{\infty} \{|X - Y| \ge 1/k\},$$
and use limit property (1.4).
5. Suppose you have shown that given any $\varepsilon > 0$, for sufficiently large $n$,
$$\wp(|X_n - X| \ge \varepsilon) < \varepsilon.$$
Show that
$$\lim_{n\to\infty} \wp(|X_n - X| \ge \varepsilon) = 0 \quad\text{for every } \varepsilon > 0.$$
6. Let $g(x, y)$ be continuous, and suppose that $X_n$ converges in probability to $X$, and that $Y_n$ converges in probability to $Y$. In this problem you will show that $g(X_n, Y_n)$ converges in probability to $g(X, Y)$.
(a) Fix any $\varepsilon > 0$. Show that for sufficiently large $\alpha$ and $\beta$,
$$\wp(|X| > \alpha) < \varepsilon/4 \quad\text{and}\quad \wp(|Y| > \beta) < \varepsilon/4.$$
(b) Once $\alpha$ and $\beta$ have been fixed, we can use the fact that $g(x, y)$ is uniformly continuous on the rectangle $|x| \le 2\alpha$ and $|y| \le 2\beta$. In other words, there is a $\delta > 0$ such that for all $(x', y')$ and $(x, y)$ in the rectangle and satisfying
$$|x' - x| \le \delta \quad\text{and}\quad |y' - y| \le \delta,$$
we have
$$|g(x', y') - g(x, y)| < \varepsilon.$$
There is no loss of generality if we assume that $\delta \le \alpha$ and $\delta \le \beta$. Show that if the four conditions
$$|X_n - X| < \delta, \quad |Y_n - Y| < \delta, \quad |X| \le \alpha, \quad\text{and}\quad |Y| \le \beta$$
hold, then
$$|g(X_n, Y_n) - g(X, Y)| < \varepsilon.$$
(c) Show that if $n$ is large enough that
$$\wp(|X_n - X| \ge \delta) < \varepsilon/4 \quad\text{and}\quad \wp(|Y_n - Y| \ge \delta) < \varepsilon/4,$$
then
$$\wp(|g(X_n, Y_n) - g(X, Y)| \ge \varepsilon) < \varepsilon.$$
7. Let $X_1, X_2, \ldots$ be i.i.d. with common finite mean $m$ and common finite variance $\sigma^2$. Also assume that $E[X_i^4] < \infty$. Put
$$M_n := \frac{1}{n} \sum_{i=1}^{n} X_i \quad\text{and}\quad V_n := \frac{1}{n} \sum_{i=1}^{n} X_i^2.$$
(a) Explain (briefly) why $V_n$ converges in probability to $\sigma^2 + m^2$.
(b) Explain (briefly) why
$$S_n^2 := \frac{n}{n-1} \Bigl[ \Bigl( \frac{1}{n} \sum_{i=1}^{n} X_i^2 \Bigr) - M_n^2 \Bigr]$$
converges in probability to $\sigma^2$.
8. Let $X_n$ converge in probability to $X$. Assume that there is a nonnegative random variable $Y$ with $E[Y] < \infty$ and such that $|X_n| \le Y$ for all $n$.
(a) Show that $\wp(|X| \ge Y + 1) = 0$. (In other words, $|X| < Y + 1$ with probability one, from which it follows that $E[|X|] \le E[Y] + 1 < \infty$.)
(b) Show that $X_n$ converges in mean to $X$. Hints: Write
$$E[|X_n - X|] = E[|X_n - X| I_{A_n}] + E[|X_n - X| I_{A_n^c}],$$
where $A_n := \{|X_n - X| \ge \varepsilon\}$. Then use Problem 5 in Chapter 10 with $Z = Y + |X|$.
9. (a) Let $g$ be a bounded, nonnegative function satisfying $\lim_{x\to 0} g(x) = 0$. Show that $\lim_{n\to\infty} E[g(X_n)] = 0$ if $X_n$ converges in probability to zero.
(b) Show that
$$\lim_{n\to\infty} E\Bigl[\frac{|X_n|}{1 + |X_n|}\Bigr] = 0$$
if and only if $X_n$ converges in probability to zero.
Problems §11.2: Convergence in Distribution
10. Let $c_n$ be a converging sequence of real numbers with limit $c$. Define the constant random variables $Y_n \equiv c_n$ and $Y \equiv c$. Show that $Y_n$ converges in distribution to $Y$.
11. Let $X$ be a random variable, and let $c_n$ be a positive sequence converging to limit $c$. Show that $c_n X$ converges in distribution to $cX$. Consider separately the cases $c = 0$ and $0 < c < \infty$.
12. For $t \ge 0$, let $X_t \le Y_t \le Z_t$ be three continuous-time random processes such that
\[ \lim_{t\to\infty} F_{X_t}(y) = \lim_{t\to\infty} F_{Z_t}(y) = F(y) \]
for some continuous cdf $F$. Show that $F_{Y_t}(y) \to F(y)$ for all $y$.
13. Show that the Wiener integral $Y := \int_0^\infty g(\tau)\,dW_\tau$ is Gaussian with zero mean and variance $\int_0^\infty g(\tau)^2\,d\tau$. Hints: The desired integral $Y$ is the mean-square limit of the sequence $Y_n$ defined in Problem 18 in Chapter 10. Use Example 11.5.
14. Let $g(t,\tau)$ be such that for each $t$, $\int_0^\infty g(t,\tau)^2\,d\tau < \infty$. Define the process
\[ X_t = \int_0^\infty g(t,\tau)\,dW_\tau. \]
Use the result of the preceding problem to show that for any $0 \le t_1 < \cdots < t_n < \infty$, the random vector of samples $X := [X_{t_1}, \ldots, X_{t_n}]'$ is Gaussian. Hint: Read the first paragraph of Section 7.2.
15. If the moment generating functions $M_{X_n}(s)$ converge to the moment generating function $M_X(s)$, show that $X_n$ converges in distribution to $X$. Also show that for nonnegative, integer-valued random variables, if the probability generating functions $G_{X_n}(z)$ converge to the probability generating function $G_X(z)$, then $X_n$ converges in distribution to $X$. Hint: The Remark following Example 11.5 may be useful.
16. Let $X_n$ and $X$ be integer-valued random variables with probability mass functions $p_n(k) := P(X_n = k)$ and $p(k) := P(X = k)$, respectively.
(a) If $X_n$ converges in distribution to $X$, show that for each $k$, $p_n(k) \to p(k)$.
(b) If $X_n$ and $X$ are nonnegative, and if for each $k \ge 0$, $p_n(k) \to p(k)$, show that $X_n$ converges in distribution to $X$.
17. Let $p_n$ be a sequence of numbers lying between 0 and 1 and such that $n p_n \to \lambda > 0$ as $n \to \infty$. Let $X_n \sim \mathrm{binomial}(n, p_n)$, and let $X \sim \mathrm{Poisson}(\lambda)$. Show that $X_n$ converges in distribution to $X$. Hints: By the previous problem, it suffices to prove that the probability mass functions converge. Stirling's formula,
\[ n! \sim \sqrt{2\pi}\, n^{n+1/2} e^{-n}, \]
by which we mean
\[ \lim_{n\to\infty} \frac{n!}{\sqrt{2\pi}\, n^{n+1/2} e^{-n}} = 1, \]
and the formula
\[ \lim_{n\to\infty} \Bigl(1 - \frac{q_n}{n}\Bigr)^n = e^{-q}, \quad\text{if } q_n \to q, \]
may be helpful.
18. Let $X_n \sim \mathrm{binomial}(n, p_n)$ and $X \sim \mathrm{Poisson}(\lambda)$, where $p_n$ and $\lambda$ are as in the previous problem. Show that the probability generating function $G_{X_n}(z)$ converges to $G_X(z)$.
19. Suppose that $X_n$ and $X$ are such that for every bounded continuous function $g(x)$,
\[ \lim_{n\to\infty} E[g(X_n)] = E[g(X)]. \]
Show that $X_n$ converges in distribution to $X$ as follows:
(a) For $a < b$, sketch the three functions $I_{(-\infty,a]}(t)$, $I_{(-\infty,b]}(t)$, and
\[ g_{a,b}(t) := \begin{cases} 1, & t < a, \\ \dfrac{b-t}{b-a}, & a \le t \le b, \\ 0, & t > b. \end{cases} \]
(b) Your sketch in part (a) shows that
\[ I_{(-\infty,a]}(t) \le g_{a,b}(t) \le I_{(-\infty,b]}(t). \]
Use these inequalities to show that for any random variable $Y$,
\[ F_Y(a) \le E[g_{a,b}(Y)] \le F_Y(b). \]
(c) For $\Delta x > 0$, use part (b) with $a = x$ and $b = x + \Delta x$ to show that
\[ \limsup_{n\to\infty} F_{X_n}(x) \le F_X(x + \Delta x). \]
(d) For $\Delta x > 0$, use part (b) with $a = x - \Delta x$ and $b = x$ to show that
\[ F_X(x - \Delta x) \le \liminf_{n\to\infty} F_{X_n}(x). \]
(e) If $x$ is a continuity point of $F_X$, show that
\[ \lim_{n\to\infty} F_{X_n}(x) = F_X(x). \]
20. Show that $X_n$ converges in distribution to zero if and only if
\[ \lim_{n\to\infty} E\biggl[\frac{|X_n|}{1 + |X_n|}\biggr] = 0. \]
21. For $t \ge 0$, let $Z_t$ be a continuous-time random process. Suppose that as $t \to \infty$, $F_{Z_t}(z)$ converges to a continuous cdf $F(z)$. Let $u(t)$ be a positive function of $t$ such that $u(t) \to 1$ as $t \to \infty$. Show that
\[ \lim_{t\to\infty} P\bigl(Z_t \le z\,u(t)\bigr) = F(z). \]
Hint: Modify the derivation in Example 11.8.
22. Let $Z_t$ be as in the preceding problem. Show that if $c(t) \to c > 0$, then
\[ \lim_{t\to\infty} P\bigl(c(t)\,Z_t \le z\bigr) = F(z/c). \]
23. Let $Z_t$ be as in Problem 21. Let $s(t) \to 0$ as $t \to \infty$. Show that if $X_t = Z_t + s(t)$, then $F_{X_t}(x) \to F(x)$.
24. Let $N_t$ be a Poisson process of rate $\lambda$. Show that
\[ Y_t := \frac{N_t/t - \lambda}{\sqrt{\lambda/t}} \]
converges in distribution to an $N(0,1)$ random variable. Hint: By Example 11.7, $Y_n$ converges in distribution to an $N(0,1)$ random variable. Next, since $N_t$ is a nondecreasing function of $t$, observe that
\[ N_{\lfloor t\rfloor} \le N_t \le N_{\lceil t\rceil}, \]
where $\lfloor t\rfloor$ denotes the greatest integer less than or equal to $t$, and $\lceil t\rceil$ denotes the smallest integer greater than or equal to $t$. Hint: The preceding two problems and Problem 12 may be useful.
Problems §11.3: Almost Sure Convergence
25. Let $X_n \to X$ a.s. and let $Y_n \to Y$ a.s. If $g(x,y)$ is a continuous function, show that $g(X_n,Y_n) \to g(X,Y)$ a.s.
26. Let $X_n \to X$ a.s., and suppose that $X = Y$ a.s. Show that $X_n \to Y$ a.s. (The statement $X = Y$ a.s. means $P(X \ne Y) = 0$.)
27. Show that almost sure limits are unique; i.e., if $X_n \to X$ a.s. and $X_n \to Y$ a.s., then $X = Y$ a.s. (The statement $X = Y$ a.s. means $P(X \ne Y) = 0$.)
28. Suppose $X_n \to X$ a.s. and $Y_n \to Y$ a.s. Show that if $X_n \le Y_n$ a.s. for all $n$, then $X \le Y$ a.s. (The statement $X_n \le Y_n$ a.s. means $P(X_n > Y_n) = 0$.)
29. If $X_n$ converges almost surely and in mean, show that the two limits are equal almost surely. Hint: Problem 4 may be helpful.
30. In Problem 11, suppose that the limit is $c = \infty$. Specify the limit random variable $cX(\omega)$ as a function of $\omega$. Note that $cX(\omega)$ is a discrete random variable. What is its probability mass function?
31. Let $S$ be a nonnegative random variable with $E[S] < \infty$. Show that $S < \infty$ a.s. Hints: It is enough to show that $P(S = \infty) = 0$. Observe that
\[ \{S = \infty\} = \bigcap_{n=1}^\infty \{S > n\}. \]
Now appeal to the limit property (1.5) and use Markov's inequality.
32. Under the assumptions of Problem 17 in Chapter 10, show that
\[ \sum_{k=1}^\infty h_k X_k \]
is well defined as an almost-sure limit. Hints: It is enough to prove that
\[ S := \sum_{k=1}^\infty |h_k X_k| < \infty \quad\text{a.s.} \]
Hence, the result of the preceding problem can be applied if it can be shown that $E[S] < \infty$. To this end, put
\[ S_n := \sum_{k=1}^n |h_k|\,|X_k|. \]
By Problem 17 in Chapter 10, $S_n$ converges in mean to $S \in L^1$. Use the nonnegativity of $S_n$ and $S$ along with Problem 16 in Chapter 10 to show that
\[ E[S] = \lim_{n\to\infty} \sum_{k=1}^n |h_k|\,E[|X_k|] < \infty. \]
33. For $p > 1$, show that $\sum_{n=1}^\infty 1/n^p < \infty$.
34. Let $X_1, X_2, \ldots$ be i.i.d. with $\gamma := E[X_i^4]$, $\sigma^2 := E[X_i^2]$, and $E[X_i] = 0$. Show that
\[ M_n := \frac{1}{n}\sum_{i=1}^n X_i \]
satisfies $E[M_n^4] = [n\gamma + 3n(n-1)\sigma^4]/n^4$.
35. In Problem 7, explain why the assumption $E[X_i^4] < \infty$ can be omitted.
36. Let $N_t$ be a Poisson process of rate $\lambda$. Show that $N_t/t$ converges almost surely to $\lambda$. Hint: First show that $N_n/n$ converges almost surely to $\lambda$. Second, since $N_t$ is a nondecreasing function of $t$, observe that
\[ N_{\lfloor t\rfloor} \le N_t \le N_{\lceil t\rceil}, \]
where $\lfloor t\rfloor$ denotes the greatest integer less than or equal to $t$, and $\lceil t\rceil$ denotes the smallest integer greater than or equal to $t$.
37. Let $N_t$ be a renewal process as defined in Section 8.2. Let $X_1, X_2, \ldots$ denote the i.i.d. interarrival times. Assume that the interarrival times have finite, positive mean $\mu$.
(a) Show that for any $\varepsilon > 0$, for all sufficiently large $n$,
\[ \sum_{k=1}^n X_k < n(\mu + \varepsilon) \quad\text{a.s.} \]
(b) Show that $N_t \to \infty$ as $t \to \infty$ a.s.; i.e., show that for any $M$, $N_t \ge M$ for all sufficiently large $t$.
(c) Show that $n/T_n \to 1/\mu$ a.s. Here $T_n := X_1 + \cdots + X_n$ is the $n$th occurrence time.
(d) Show that $N_t/t \to 1/\mu$ a.s. Hints: On account of (c), if we put $Y_n := n/T_n$, then $Y_{N_t} \to 1/\mu$ a.s. since $N_t \to \infty$. Also note that
\[ T_{N_t} \le t < T_{N_t+1}. \]
38. Give an example of a sequence of random variables that converges almost surely to zero but not in mean.
39. Let $G$ be a nondecreasing function defined on the closed interval $[a,b]$. Let $D_\varepsilon$ denote the set of discontinuities of size greater than $\varepsilon$ on $[a,b]$,
\[ D_\varepsilon := \{u \in [a,b] : G(u+) - G(u-) > \varepsilon\}, \]
with the understanding that $G(b+)$ means $G(b)$ and $G(a-)$ means $G(a)$. Show that if there are $n$ points in $D_\varepsilon$, then
\[ n < [G(b) - G(a)]/\varepsilon + 2. \]
Remark. The set of all discontinuities of $G$ on $[a,b]$, denoted by $D[a,b]$, is simply $\bigcup_{k=1}^\infty D_{1/k}$. Since this is a countable union of finite sets, $D[a,b]$ is at most countably infinite. If $G$ is defined on the open interval $(0,1)$, we can write
\[ D(0,1) = \bigcup_{n=3}^\infty D[1/n,\, 1 - 1/n]. \]
Since this is a countable union of countably infinite sets, $D(0,1)$ is also countably infinite [37, p. 21, Proposition 7].
40. Let $D$ be a countably infinite subset of $(0,1)$. Let $U \sim \mathrm{uniform}(0,1)$. Show that $P(U \in D) = 0$. Hint: Since $D$ is countably infinite, we can enumerate its elements as a sequence $u_n$. Fix any $\varepsilon > 0$ and put $K_n := (u_n - \varepsilon/2^n,\, u_n + \varepsilon/2^n)$. Observe that
\[ D \subset \bigcup_{n=1}^\infty K_n. \]
Now bound
\[ P\Bigl(U \in \bigcup_{n=1}^\infty K_n\Bigr). \]
CHAPTER 12
Parameter Estimation and Confidence Intervals

Consider observed measurements $X_1, X_2, \ldots$, each having unknown mean $m$. The natural thing to do is to use the sample mean
\[ M_n := \frac{1}{n}\sum_{i=1}^n X_i \]
as an estimate of $m$. However, since $M_n$ is a random variable and $m$ is a constant, we do not expect to have $M_n$ always equal to $m$ (or even ever equal if the $X_i$ are continuous random variables). In this chapter we investigate how close the sample mean $M_n$ is to the true mean $m$ (also known as the ensemble mean or the population mean).

Section 12.1 introduces the notions of sample mean, unbiased estimator, and consistent estimator. It also recalls the central limit theorem approximation from Section 4.7. Section 12.2 introduces the notion of a confidence interval for the mean when the variance is known. In Section 12.3, the sample variance is introduced, and some of its properties are presented. In Section 12.4, the sample mean and sample variance are combined to obtain confidence intervals for the mean when both the mean and the variance are unknown. Section 12.5 considers the special case of normal data. While the preceding sections assume that the number of samples is large, in the normal case, this assumption is not necessary. To obtain confidence intervals for the mean when the variance is unknown, we introduce the student's $t$ density. By introducing the chi-squared density, we obtain confidence intervals for the variance when the mean is known and when the mean is unknown. Section 12.5 concludes with a subsection containing derivations of the more complicated results concerning the density of the sample variance and of the $T$ random variable. These derivations can be skipped by the beginning student.
12.1. The Sample Mean
Although we do not expect to have $M_n = m$, we can say that
\[ E[M_n] = E\biggl[\frac{1}{n}\sum_{i=1}^n X_i\biggr] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}\sum_{i=1}^n m = m. \]
In other words, the expected value of the estimator is equal to the quantity we are trying to estimate. An estimator with this property is said to be unbiased.

Next, if the $X_i$ have common finite variance $\sigma^2$, as we assume from now on, and if they are uncorrelated, then for any error bound $\delta > 0$, Chebyshev's inequality tells us that
\[ P(|M_n - m| > \delta) \le \frac{E[|M_n - m|^2]}{\delta^2} = \frac{\mathrm{var}(M_n)}{\delta^2} = \frac{\sigma^2}{n\delta^2}. \tag{12.1} \]
This says that the probability that the estimate $M_n$ differs from the true mean $m$ by more than $\delta$ is upper bounded by $\sigma^2/(n\delta^2)$. For large enough $n$, this bound is very small, and so the probability of being off by more than $\delta$ is negligible. In particular, (12.1) implies that
\[ \lim_{n\to\infty} P(|M_n - m| > \delta) = 0, \quad\text{for every } \delta > 0. \]
The terminology for describing the above expression is to say that $M_n$ converges in probability to $m$. An estimator that converges in probability to the parameter to be estimated is said to be consistent. Thus, the sample mean is a consistent estimator of the population mean.

While it is reassuring to know that the sample mean is a consistent estimator of the ensemble mean, this is an asymptotic result. In practice, we work with a finite value of $n$, and so it is unfortunate that, as the next example shows, the bound in (12.1) is not very good.
Example 12.1. Let $X_1, X_2, \ldots$ be i.i.d. $N(m, \sigma^2)$. Then $M_n \sim N(m, \sigma^2/n)$ and so
\[ Y_n := \frac{M_n - m}{\sigma/\sqrt{n}} \]
is $N(0,1)$. Thus,
\[ P(|M_n - m| > \sigma) = P(|Y_n| > \sqrt{n}\,) = 2[1 - \Phi(\sqrt{n}\,)], \]
where we have used the fact that $Y_n$ has an even density. For $n = 3$, we find $P(|M_3 - m| > \sigma) = 0.0833$. For $\delta = \sigma$ in (12.1), we have the bound
\[ P(|M_n - m| > \sigma) \le \frac{1}{n}. \]
To make this bound less than or equal to 0.0833 would require $n \ge 12$. Thus, (12.1) would lead us to use a lot more data than is really necessary.

Since the bound in (12.1) is not very good, we would like to know the distribution of $Y_n$ itself in the general case, not just the Gaussian case as in the example. Fortunately, the central limit theorem (Section 4.7) tells us that if the $X_i$ are i.i.d. with common mean $m$ and common variance $\sigma^2$, then for large $n$, $F_{Y_n}$ can be approximated by the normal cdf $\Phi$.
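The gap between the exact Gaussian tail probability and the Chebyshev bound in Example 12.1 can be checked numerically. The following is a minimal sketch in Python using only the standard library; the function names are ours and are not part of the text.

```python
from statistics import NormalDist


def exact_gaussian_tail(n: int) -> float:
    """P(|M_n - m| > sigma) = 2[1 - Phi(sqrt(n))] for i.i.d. N(m, sigma^2) data."""
    return 2.0 * (1.0 - NormalDist().cdf(n ** 0.5))


def chebyshev_bound(n: int) -> float:
    """Right-hand side of (12.1) with delta = sigma, namely 1/n."""
    return 1.0 / n


if __name__ == "__main__":
    # Compare the exact probability with the bound for a few sample sizes.
    for n in (3, 6, 12):
        print(n, round(exact_gaussian_tail(n), 4), round(chebyshev_bound(n), 4))
```

For every $n$ the exact probability lies below the bound, and the ratio between the two grows quickly with $n$.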
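12.2. Confidence Intervals When the Variance Is Known

The notion of a confidence interval is obtained by rearranging
\[ P\Bigl(|M_n - m| \le \frac{\sigma y}{\sqrt{n}}\Bigr) \]
as
\[ P\Bigl(-\frac{\sigma y}{\sqrt{n}} \le M_n - m \le \frac{\sigma y}{\sqrt{n}}\Bigr), \]
or
\[ P\Bigl(\frac{\sigma y}{\sqrt{n}} \ge m - M_n \ge -\frac{\sigma y}{\sqrt{n}}\Bigr), \]
and then
\[ P\Bigl(M_n + \frac{\sigma y}{\sqrt{n}} \ge m \ge M_n - \frac{\sigma y}{\sqrt{n}}\Bigr). \]
The above quantity is the probability that the random interval
\[ \Bigl[M_n - \frac{\sigma y}{\sqrt{n}},\; M_n + \frac{\sigma y}{\sqrt{n}}\Bigr] \]
contains the true mean $m$. This random interval is called a confidence interval. If $y$ is chosen so that
\[ P\Bigl(m \in \Bigl[M_n - \frac{\sigma y}{\sqrt{n}},\; M_n + \frac{\sigma y}{\sqrt{n}}\Bigr]\Bigr) = 1 - \alpha \tag{12.2} \]
for some $0 < \alpha < 1$, then the confidence interval is said to be a $100(1-\alpha)\%$ confidence interval. The shorthand for the above expression is
\[ m = M_n \pm \frac{\sigma y}{\sqrt{n}} \quad\text{with } 100(1-\alpha)\% \text{ probability.} \]
In practice, we are given a confidence level $1 - \alpha$, and we would like to choose $y$ so that (12.2) holds. Now, (12.2) is equivalent to
\begin{align*}
1 - \alpha &= P\Bigl(|M_n - m| \le \frac{\sigma y}{\sqrt{n}}\Bigr) \\
&= P\Bigl(\frac{|M_n - m|}{\sigma/\sqrt{n}} \le y\Bigr) \\
&= P(|Y_n| \le y) \\
&= F_{Y_n}(y) - F_{Y_n}(-y) \\
&\approx \Phi(y) - \Phi(-y),
\end{align*}
where the last step follows by the central limit theorem (Section 4.7) if the $X_i$ are i.i.d. with common mean $m$ and common variance $\sigma^2$. Since the $N(0,1)$ density is even, the approximation becomes
\[ 2\Phi(y) - 1 \approx 1 - \alpha. \]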
$1-\alpha$    $y_{\alpha/2}$
0.90          1.645
0.91          1.695
0.92          1.751
0.93          1.812
0.94          1.881
0.95          1.960
0.96          2.054
0.97          2.170
0.98          2.326
0.99          2.576

Table 12.1. Confidence levels $1-\alpha$ and corresponding $y_{\alpha/2}$ such that $2\Phi(y_{\alpha/2}) - 1 = 1 - \alpha$.
Ignoring the approximation, we solve $\Phi(y) = 1 - \alpha/2$ for $y$. We denote the solution of this equation by $y_{\alpha/2}$. It can be found from tables, e.g., Table 12.1, or numerically by finding the unique root of the equation $\Phi(y) + \alpha/2 - 1 = 0$, or in Matlab by $y_{\alpha/2}$ = norminv(1 - alpha/2). (Since $\Phi$ can be related to the error function erf (see Section 4.1), $y_{\alpha/2}$ can also be found using the inverse of the error function. Hence, $y_{\alpha/2} = \sqrt{2}\,\mathrm{erf}^{-1}(1 - \alpha)$. The Matlab command for $\mathrm{erf}^{-1}(z)$ is erfinv(z).)

The width of a confidence interval is
\[ 2 \cdot \frac{\sigma y_{\alpha/2}}{\sqrt{n}}. \]
Notice that as $n$ increases, the width gets smaller as $1/\sqrt{n}$. If we increase the requested confidence level $1 - \alpha$, we see from Table 12.1 that $y_{\alpha/2}$ gets larger. Hence, by using a higher confidence level, the confidence interval gets larger.

Example 12.2. Let $X_1, X_2, \ldots$ be i.i.d. random variables with variance $\sigma^2 = 2$. If $M_{100} = 7.129$, find the 93% and 97% confidence intervals for the population mean.

Solution. In Table 12.1 we scan the $1 - \alpha$ column until we find 0.93. The corresponding value of $y_{\alpha/2}$ is 1.812. Since $\sigma = \sqrt{2}$, we obtain the 93% confidence interval
\[ \Bigl[7.129 - \frac{1.812\sqrt{2}}{\sqrt{100}},\; 7.129 + \frac{1.812\sqrt{2}}{\sqrt{100}}\Bigr] = [6.873,\, 7.385]. \]
Since $1.812\sqrt{2}/\sqrt{100} = 0.256$, we can also express this confidence interval as
\[ m = 7.129 \pm 0.256 \quad\text{with 93\% probability.} \]
For the 97% confidence interval, we use $y_{\alpha/2} = 2.170$ and obtain
\[ \Bigl[7.129 - \frac{2.170\sqrt{2}}{\sqrt{100}},\; 7.129 + \frac{2.170\sqrt{2}}{\sqrt{100}}\Bigr] = [6.822,\, 7.436], \]
or
\[ m = 7.129 \pm 0.307 \quad\text{with 97\% probability.} \]
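The computation in Example 12.2 can be reproduced without tables. The sketch below is in Python rather than Matlab and uses `statistics.NormalDist().inv_cdf` from the standard library in the role that norminv plays in the text; the function names are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def normal_quantile(level: float) -> float:
    """Return y_{alpha/2} solving Phi(y) = 1 - alpha/2, where level = 1 - alpha."""
    alpha = 1.0 - level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def mean_ci_known_variance(sample_mean, sigma, n, level):
    """Interval [M_n - sigma*y/sqrt(n), M_n + sigma*y/sqrt(n)] for confidence `level`."""
    half_width = sigma * normal_quantile(level) / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


if __name__ == "__main__":
    # Data of Example 12.2: sigma^2 = 2, n = 100, M_100 = 7.129.
    for level in (0.93, 0.97):
        lo, hi = mean_ci_known_variance(7.129, sqrt(2.0), 100, level)
        print(f"{level:.2f}: [{lo:.3f}, {hi:.3f}]")
```

Evaluating `normal_quantile` at the levels 0.90 through 0.99 should agree with the entries of Table 12.1 to the three decimals shown there.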
12.3. The Sample Variance
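If the population variance $\sigma^2$ is unknown, then the confidence interval
\[ \Bigl[M_n - \frac{\sigma y}{\sqrt{n}},\; M_n + \frac{\sigma y}{\sqrt{n}}\Bigr] \]
cannot be used. Instead, we replace $\sigma$ by the sample standard deviation $S_n$, where
\[ S_n^2 := \frac{1}{n-1}\sum_{i=1}^n (X_i - M_n)^2 \]
is the sample variance. This leads to the new confidence interval
\[ \Bigl[M_n - \frac{S_n y}{\sqrt{n}},\; M_n + \frac{S_n y}{\sqrt{n}}\Bigr]. \]
Before we can do any analysis of this, we need some properties of $S_n$. To begin, we need the formula,
\[ S_n^2 = \frac{1}{n-1}\biggl[\Bigl(\sum_{i=1}^n X_i^2\Bigr) - n M_n^2\biggr]. \tag{12.3} \]
It is derived as follows. Write
\begin{align*}
S_n^2 &:= \frac{1}{n-1}\sum_{i=1}^n (X_i - M_n)^2 \\
&= \frac{1}{n-1}\sum_{i=1}^n (X_i^2 - 2X_i M_n + M_n^2) \\
&= \frac{1}{n-1}\biggl[\Bigl(\sum_{i=1}^n X_i^2\Bigr) - 2\Bigl(\sum_{i=1}^n X_i\Bigr) M_n + n M_n^2\biggr] \\
&= \frac{1}{n-1}\biggl[\Bigl(\sum_{i=1}^n X_i^2\Bigr) - 2(n M_n) M_n + n M_n^2\biggr] \\
&= \frac{1}{n-1}\biggl[\Bigl(\sum_{i=1}^n X_i^2\Bigr) - n M_n^2\biggr].
\end{align*}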
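As a quick numerical check of (12.3), the two expressions for the sample variance can be evaluated on the same data. This is a minimal Python sketch with made-up data; `statistics.variance` from the standard library computes the same $n-1$ normalized quantity and serves as a cross-check.

```python
from statistics import fmean, variance


def sample_variance_definition(xs):
    """S_n^2 = (1/(n-1)) * sum (X_i - M_n)^2."""
    n = len(xs)
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """S_n^2 = (1/(n-1)) * (sum X_i^2 - n M_n^2), i.e. formula (12.3)."""
    n = len(xs)
    m = fmean(xs)
    return (sum(x * x for x in xs) - n * m * m) / (n - 1)


if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data only
    print(sample_variance_definition(data))
    print(sample_variance_shortcut(data))
    print(variance(data))
```

In floating-point arithmetic the form (12.3) subtracts two nearly equal numbers when the mean is large relative to the spread, so the defining formula is usually the safer one to compute with; (12.3) is mainly convenient for analysis.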
Using (12.3), it is shown in the problems that $S_n^2$ is an unbiased estimator of $\sigma^2$; i.e., $E[S_n^2] = \sigma^2$. Furthermore, if the $X_i$ are i.i.d. with finite fourth moment, then it is shown in the problems that
\[ \frac{1}{n}\sum_{i=1}^n X_i^2 \]
is a consistent estimator of the second moment $E[X_i^2] = \sigma^2 + m^2$; i.e., the above formula converges in probability to $\sigma^2 + m^2$. Since $M_n$ converges in probability to $m$, it follows that
\[ S_n^2 = \frac{n}{n-1}\biggl[\Bigl(\frac{1}{n}\sum_{i=1}^n X_i^2\Bigr) - M_n^2\biggr] \]
converges in probability (see Note 1) to $(\sigma^2 + m^2) - m^2 = \sigma^2$. In other words, $S_n^2$ is a consistent estimator of $\sigma^2$. Furthermore, it now follows that $S_n$ is a consistent estimator of $\sigma$; i.e., $S_n$ converges in probability to $\sigma$.
12.4. Confidence Intervals When the Variance Is Unknown

As noted in the previous section, if the population variance $\sigma^2$ is unknown, then the confidence interval
\[ \Bigl[M_n - \frac{\sigma y}{\sqrt{n}},\; M_n + \frac{\sigma y}{\sqrt{n}}\Bigr] \]
cannot be used. Instead, we replace $\sigma$ by the sample standard deviation $S_n$. This leads to the new confidence interval
\[ \Bigl[M_n - \frac{S_n y}{\sqrt{n}},\; M_n + \frac{S_n y}{\sqrt{n}}\Bigr]. \]
Given a confidence level $1 - \alpha$, we would like to choose $y$ so that
\[ P\Bigl(m \in \Bigl[M_n - \frac{S_n y}{\sqrt{n}},\; M_n + \frac{S_n y}{\sqrt{n}}\Bigr]\Bigr) = 1 - \alpha. \tag{12.4} \]
To this end, rewrite the left-hand side as
\[ P\Bigl(|M_n - m| \le \frac{S_n y}{\sqrt{n}}\Bigr), \]
or
\[ P\Bigl(\Bigl|\frac{M_n - m}{\sigma/\sqrt{n}}\Bigr| \le \frac{S_n}{\sigma}\, y\Bigr). \]
Again using $Y_n := (M_n - m)/(\sigma/\sqrt{n}\,)$, we have
\[ P\Bigl(|Y_n| \le \frac{S_n}{\sigma}\, y\Bigr). \]
As $n \to \infty$, $S_n$ is converging in probability to $\sigma$. So (see Note 2), for large $n$,
\begin{align*}
P\Bigl(|Y_n| \le \frac{S_n}{\sigma}\, y\Bigr) &\approx P(|Y_n| \le y) \tag{12.5} \\
&= F_{Y_n}(y) - F_{Y_n}(-y) \\
&\approx 2\Phi(y) - 1,
\end{align*}
where the last step follows by the central limit theorem approximation. Thus, if we want to solve (12.4), we can solve the approximate equation
\[ 2\Phi(y) - 1 = 1 - \alpha \]
instead, exactly as in Section 12.2. In other words, to find confidence intervals, we still get $y_{\alpha/2}$ from Table 12.1, but we use $S_n$ in place of $\sigma$.

Remark. In this section, we are using not only the central limit theorem approximation, for which it is suggested $n$ should be at least 30, but we are also using the approximation (12.5). For the methods of this section to apply, $n$ should be at least 100. This choice is motivated by considerations in the next section.

Example 12.3. Let $X_1, X_2, \ldots$ be i.i.d. Bernoulli($p$) random variables. Find the 95% confidence interval for $p$ if $M_{100} = 0.28$ and $S_{100} = 0.451$.

Solution. Observe that since $m := E[X_i] = p$, we can use $M_n$ to estimate $p$. From Table 12.1, $y_{\alpha/2} = 1.960$, $S_{100}\, y_{\alpha/2}/\sqrt{100} = 0.088$, and
\[ p = 0.28 \pm 0.088 \quad\text{with 95\% probability.} \]
The actual confidence interval is $[0.192,\, 0.368]$.
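The numbers in Example 12.3 can be checked with a short Python sketch (standard library only; function names are ours). It also uses an observation that is not stated in the example but follows from (12.3): for 0/1-valued data, $X_i^2 = X_i$, so $S_n^2 = \frac{n}{n-1} M_n(1 - M_n)$, which is consistent with the given value $S_{100} = 0.451$ when $M_{100} = 0.28$.

```python
from math import sqrt
from statistics import NormalDist


def bernoulli_sample_std(sample_mean: float, n: int) -> float:
    """For 0/1 data, X_i^2 = X_i, so (12.3) gives S_n^2 = n/(n-1) * M_n (1 - M_n)."""
    return sqrt(n / (n - 1) * sample_mean * (1.0 - sample_mean))


def mean_ci_unknown_variance(sample_mean, sample_std, n, level):
    """Large-n interval M_n +/- S_n * y_{alpha/2} / sqrt(n)."""
    y = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half_width = sample_std * y / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


if __name__ == "__main__":
    s100 = bernoulli_sample_std(0.28, 100)
    print(round(s100, 3))                                  # compare with S_100 in Example 12.3
    print(mean_ci_unknown_variance(0.28, s100, 100, 0.95)) # compare with [0.192, 0.368]
```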
Applications
Estimating the number of defective products in a lot. Consider a production run of $N$ cellular phones, of which, say $d$ are defective. The only way to determine $d$ exactly and for certain is to test every phone. This is not practical if $N$ is very large. So we consider the following procedure to estimate the fraction of defectives, $p := d/N$, based on testing only $n$ phones, where $n$ is large, but smaller than $N$.

    for i = 1 to n
        Select a phone at random from the lot of N phones;
        if the ith phone selected is defective
            let X_i = 1;
        else
            let X_i = 0;
        end if
        Return the phone to the lot;
    end for
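The procedure above can be simulated directly. The following Python sketch is one possible rendering under stated assumptions: the lot is represented as a list of 0/1 flags, the lot composition (280 defectives out of 1,000) is hypothetical, and the function name is ours. It returns the estimate $M_n$ together with the half-width $S_n y_{\alpha/2}/\sqrt{n}$ from Section 12.4.

```python
import random
from math import sqrt
from statistics import NormalDist


def estimate_defective_fraction(lot, n, level=0.95, rng=None):
    """Sample n items with replacement from `lot` (a list of 0/1 flags, 1 = defective).

    Returns (M_n, half_width), where half_width = S_n * y_{alpha/2} / sqrt(n).
    """
    if rng is None:
        rng = random.Random()
    # Each draw picks an item uniformly at random; the lot itself is never
    # modified, which corresponds to returning the phone after testing.
    xs = [rng.choice(lot) for _ in range(n)]
    m = sum(xs) / n
    s = sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    y = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return m, s * y / sqrt(n)


if __name__ == "__main__":
    N, d = 1000, 280                      # hypothetical lot: 280 defectives out of 1,000
    lot = [1] * d + [0] * (N - d)
    p_hat, hw = estimate_defective_fraction(lot, n=100, rng=random.Random(0))
    print(f"p is about {p_hat:.3f} +/- {hw:.3f}; d is about {N * p_hat:.0f} +/- {N * hw:.0f}")
```

Because the draws are random, the printed estimate varies with the seed; only about 95% of runs should produce an interval that covers the true fraction.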
Because phones are returned to the lot (sampling with replacement), it is possible to test the same phone more than once. However, because the phones are always chosen from the same set of $N$ phones, the $X_i$ are i.i.d. with $P(X_i = 1) = d/N = p$. Hence, the central limit theorem applies, and we can use the method of Example 12.3 to estimate $p$ and $d = Np$. For example, if $N = 1{,}000$ and we use the numbers from Example 12.3, we would estimate that
\[ d = 280 \pm 88 \quad\text{with 95\% probability.} \]
In other words, we are 95% sure that the number of defectives is between 192 and 368 for this particular lot of 1,000 phones.

If the phones were not returned to the lot after testing (sampling without replacement), the $X_i$ would not be i.i.d. as required by the central limit theorem. However, in sampling with replacement when $n$ is much smaller than $N$, the chances of testing the same phone twice are negligible. Hence, we can actually sample without replacement and proceed as above.

Predicting the Outcome of an Election. In order to predict the outcome of a presidential election, 4,000 registered voters are surveyed at random. 2,104 (more than half) say they will vote for candidate A, and the rest say they will vote for candidate B. To predict the outcome of the election, let $p$ be the fraction of votes actually received by candidate A out of the total number of voters $N$ (millions). Our poll samples $n = 4{,}000$, and $M_{4000} = 2{,}104/4{,}000 = 0.526$. Suppose that $S_{4000} = 0.499$. For a 95% confidence interval for $p$, $y_{\alpha/2} = 1.960$, $S_{4000}\, y_{\alpha/2}/\sqrt{4000} = 0.015$, and
\[ p = 0.526 \pm 0.015 \quad\text{with 95\% probability.} \]
Rounding off, we would predict that candidate A will receive 53% of the vote, with a margin of error of 2%. Thus, we are 95% sure that candidate A will win the election.
Sampling with and without Replacement
Consider sampling $n$ items from a batch of $N$ items, $d$ of which are defective. If we sample with replacement, then the theory above worked out rather simply. We also argued briefly that if $n$ is much smaller than $N$, then sampling without replacement would give essentially the same results. We now make this statement more precise.

To begin, recall that the central limit theorem says that for large $n$, $F_{Y_n}(y) \approx \Phi(y)$, where $Y_n = (M_n - m)/(\sigma/\sqrt{n}\,)$, and $M_n = (1/n)\sum_{i=1}^n X_i$. If we sample with replacement and set $X_i = 1$ if the $i$th item is defective, then the $X_i$ are i.i.d. Bernoulli($p$) with $p = d/N$. When $X_1, X_2, \ldots$ are i.i.d. Bernoulli($p$), we know that $\sum_{i=1}^n X_i$ is binomial($n, p$) (e.g., by Example 2.20). Putting this all together, we obtain the DeMoivre-Laplace Theorem, which says that if $V \sim \mathrm{binomial}(n, p)$ and $n$ is large, then the cdf of $(V/n - p)/\sqrt{p(1-p)/n}$ is approximately standard normal.

Now suppose we sample $n$ items without replacement. Let $U$ denote the number of defectives out of the $n$ samples. It is shown in Note 3 that $U$ has the hypergeometric($N, d, n$) pmf,
\[ P(U = k) = \frac{\dbinom{d}{k}\dbinom{N-d}{n-k}}{\dbinom{N}{n}}, \quad k = 0, \ldots, n. \]
As we show below, if $n$ is much smaller than $d$, $N - d$, and $N$, then $P(U = k) \approx P(V = k)$. It then follows that the cdf of $(U/n - p)/\sqrt{p(1-p)/n}$ is close to the cdf of $(V/n - p)/\sqrt{p(1-p)/n}$, which is close to the standard normal cdf if $n$ is large. (Thus, to make it all work we need $n$ large, but still much smaller than $d$, $N - d$, and $N$.)

To show that $P(U = k) \approx P(V = k)$, write out $P(U = k)$ as
\[ \frac{d!}{k!\,(d-k)!} \cdot \frac{(N-d)!}{(n-k)!\,[(N-d)-(n-k)]!} \cdot \frac{n!\,(N-n)!}{N!}. \]
We can easily identify the factor $\binom{n}{k}$. Next, since $0 \le k \le n \ll d$,
\[ \frac{d!}{(d-k)!} = d(d-1)\cdots(d-k+1) \approx d^k. \]
Similarly, since $0 \le k \le n \ll (N - d)$,
\[ \frac{(N-d)!}{[(N-d)-(n-k)]!} = (N-d)\cdots[(N-d)-(n-k)+1] \approx (N-d)^{n-k}. \]
Finally, since $n \ll N$,
\[ \frac{(N-n)!}{N!} = \frac{1}{N(N-1)\cdots(N-n+1)} \approx \frac{1}{N^n}. \]
Writing $p = d/N$, we have
\[ P(U = k) \approx \binom{n}{k} p^k (1-p)^{n-k}. \]
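The quality of the approximation $P(U = k) \approx P(V = k)$ can be examined numerically. The Python sketch below (standard library only; the function names and the lot sizes in the usage lines are illustrative assumptions) computes the largest pointwise gap between the hypergeometric and binomial pmfs.

```python
from math import comb


def hypergeometric_pmf(k, N, d, n):
    """P(U = k): k defectives among n items drawn without replacement."""
    return comb(d, k) * comb(N - d, n - k) / comb(N, n)


def binomial_pmf(k, n, p):
    """P(V = k) for V ~ binomial(n, p)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def max_pmf_gap(N, d, n):
    """Largest |P(U = k) - P(V = k)| over k = 0, ..., n, with p = d/N."""
    p = d / N
    return max(abs(hypergeometric_pmf(k, N, d, n) - binomial_pmf(k, n, p))
               for k in range(n + 1))


if __name__ == "__main__":
    print(max_pmf_gap(10_000, 3_000, 10))  # n much smaller than d, N - d, and N
    print(max_pmf_gap(20, 6, 10))          # n comparable to N: approximation is poor
```

Note that `math.comb(d, k)` returns zero when $k > d$, which matches the convention for the binomial coefficients described in Note 3.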
12.5. Confidence Intervals for Normal Data

In this section we assume that $X_1, X_2, \ldots$ are i.i.d. $N(m, \sigma^2)$.

Estimating the Mean

Recall from Example 12.1 that $Y_n \sim N(0,1)$. Hence, the analysis in Sections 12.1 and 12.2 shows that
\begin{align*}
P\Bigl(m \in \Bigl[M_n - \frac{\sigma y}{\sqrt{n}},\; M_n + \frac{\sigma y}{\sqrt{n}}\Bigr]\Bigr) &= P(|Y_n| \le y) \\
&= 2\Phi(y) - 1.
\end{align*}
The point is that for normal data there is no central limit theorem approximation. Hence, we can determine confidence intervals as in Section 12.2 even if $n < 30$.

Example 12.4. Let $X_1, X_2, \ldots$ be i.i.d. $N(m, 2)$. If $M_{10} = 5.287$, find the 90% confidence interval for $m$.

Solution. From Table 12.1 for $1 - \alpha = 0.90$, $y_{\alpha/2}$ is 1.645. Since $\sigma = \sqrt{2}$, we obtain the 90% confidence interval
\[ \Bigl[5.287 - \frac{1.645\sqrt{2}}{\sqrt{10}},\; 5.287 + \frac{1.645\sqrt{2}}{\sqrt{10}}\Bigr] = [4.551,\, 6.023]. \]
Since $1.645\sqrt{2}/\sqrt{10} = 0.736$, we can also express this confidence interval as
\[ m = 5.287 \pm 0.736 \quad\text{with 90\% probability.} \]

Unfortunately, the results of Section 12.4 still involve approximation in (12.5) even if the $X_i$ are normal. However, we now show how to compute the left-hand side of (12.4) exactly when the $X_i$ are normal. The left-hand side of (12.4) is
\[ P\Bigl(|M_n - m| \le \frac{S_n y}{\sqrt{n}}\Bigr) = P\Bigl(\Bigl|\frac{M_n - m}{S_n/\sqrt{n}}\Bigr| \le y\Bigr). \]
It is convenient to rewrite the above quotient as
\[ T := \frac{M_n - m}{S_n/\sqrt{n}}. \]
It is shown later that if $X_1, X_2, \ldots$ are i.i.d. $N(m, \sigma^2)$, then $T$ has a student's $t$ density with $\nu = n - 1$ degrees of freedom (defined in Problem 17 in Chapter 3). To compute $100(1-\alpha)\%$ confidence intervals, we must solve
\[ P(|T| \le y) = 1 - \alpha. \]
Since the density $f_T$ is even, this is equivalent to
\[ 2F_T(y) - 1 = 1 - \alpha, \]
or $F_T(y) = 1 - \alpha/2$. This can be solved using tables, e.g., Table 12.2, or numerically by finding the unique root of the equation $F_T(y) + \alpha/2 - 1 = 0$, or in Matlab by $y_{\alpha/2}$ = tinv(1 - alpha/2, n - 1).

Example 12.5. Let $X_1, X_2, \ldots$ be i.i.d. $N(m, \sigma^2)$ random variables, and suppose $M_{10} = 5.287$. Further suppose that $S_{10} = 1.564$. Find the 90% confidence interval for $m$.

Solution. In Table 12.2 with $n = 10$, we see that for $1 - \alpha = 0.90$, $y_{\alpha/2}$ is 1.833. Since $S_{10}\, y_{\alpha/2}/\sqrt{10} = 0.907$,
\[ m = 5.287 \pm 0.907 \quad\text{with 90\% probability.} \]
The corresponding confidence interval is $[4.380,\, 6.194]$.
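The root-finding route mentioned above can be sketched in Python without any statistics package. This is an illustration under our own assumptions, not the method used to produce the tables: the $t$ cdf is obtained by integrating the standard student's $t$ density with Simpson's rule (using the evenness of the density), and $F_T(y) = 1 - \alpha/2$ is solved by bisection. All function names are ours.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(x, nu):
    """Student's t density with nu degrees of freedom."""
    log_c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return exp(log_c - (nu + 1) / 2 * log(1.0 + x * x / nu))


def t_cdf(y, nu, steps=400):
    """F_T(y) for y >= 0: 1/2 plus Simpson's rule on [0, y] (steps must be even)."""
    if y == 0.0:
        return 0.5
    h = y / steps
    total = t_pdf(0.0, nu) + t_pdf(y, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + total * h / 3.0


def t_quantile(level, nu):
    """Solve F_T(y) = 1 - alpha/2 by bisection, where level = 1 - alpha."""
    target = 1.0 - (1.0 - level) / 2.0
    lo, hi = 0.0, 1.0
    while t_cdf(hi, nu) < target:      # expand until the root is bracketed
        hi *= 2.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if t_cdf(mid, nu) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    y = t_quantile(0.90, 9)            # n = 10 observations, 9 degrees of freedom
    print(y, 1.564 * y / sqrt(10))     # compare with Example 12.5
```

With a statistics library available, a built-in inverse $t$ cdf would be the natural replacement for `t_quantile`.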
$1-\alpha$    $y_{\alpha/2}$ ($n = 10$)    $y_{\alpha/2}$ ($n = 100$)
0.90          1.833                       1.660
0.91          1.899                       1.712
0.92          1.973                       1.769
0.93          2.055                       1.832
0.94          2.150                       1.903
0.95          2.262                       1.984
0.96          2.398                       2.081
0.97          2.574                       2.202
0.98          2.821                       2.365
0.99          3.250                       2.626

Table 12.2. Confidence levels $1-\alpha$ and corresponding $y_{\alpha/2}$ such that $P(|T| \le y_{\alpha/2}) = 1 - \alpha$. The $n = 10$ column is for 10 observations with $T$ having $n - 1 = 9$ degrees of freedom, and the $n = 100$ column is for 100 observations with $T$ having $n - 1 = 99$ degrees of freedom.
Limiting t Distribution
If we compare the $n = 100$ column of Table 12.2 with Table 12.1, we see they are almost the same. This is a consequence of the fact that as $n$ increases, the $t$ cdf converges to the standard normal cdf. We can see this by writing
\begin{align*}
P(T \le t) &= P\Bigl(\frac{M_n - m}{S_n/\sqrt{n}} \le t\Bigr) \\
&= P\Bigl(\frac{M_n - m}{\sigma/\sqrt{n}} \le \frac{S_n}{\sigma}\, t\Bigr) \\
&= P\Bigl(Y_n \le \frac{S_n}{\sigma}\, t\Bigr) \\
&\approx P(Y_n \le t),
\end{align*}
since (see Note 2) $S_n$ converges in probability to $\sigma$. Finally, since the $X_i$ are independent and normal, $F_{Y_n}(t) = \Phi(t)$. We also recall from Problem 18 in Chapter 3 that the $t$ density converges to the standard normal density.
Estimating the Variance - Known Mean

Suppose that $X_1, X_2, \ldots$ are i.i.d. $N(m, \sigma^2)$ with $m$ known but $\sigma^2$ unknown. We use
\[ V_n^2 := \frac{1}{n}\sum_{i=1}^n (X_i - m)^2 \]
as our estimator of the variance $\sigma^2$. It is shown in the problems that $V_n^2$ is a consistent estimator of $\sigma^2$.

For determining confidence intervals, it is easier to work with
\[ \frac{n}{\sigma^2} V_n^2 = \sum_{i=1}^n \Bigl(\frac{X_i - m}{\sigma}\Bigr)^2. \]
Since $(X_i - m)/\sigma$ is $N(0,1)$, its square is chi-squared with one degree of freedom (Problem 41 in Chapter 3 or Problem 7 in Chapter 4). It then follows that $\frac{n}{\sigma^2} V_n^2$ is chi-squared with $n$ degrees of freedom (see Problem 46(c) and its Remark in Chapter 3).

Choose $0 < \ell < u$, and consider the equation
\[ P\Bigl(\ell \le \frac{n}{\sigma^2} V_n^2 \le u\Bigr) = 1 - \alpha. \]
We can rewrite this as
\[ P\Bigl(\frac{n V_n^2}{\ell} \ge \sigma^2 \ge \frac{n V_n^2}{u}\Bigr) = 1 - \alpha. \]
This suggests the confidence interval
\[ \Bigl[\frac{n V_n^2}{u},\; \frac{n V_n^2}{\ell}\Bigr]. \]
Then the probability that $\sigma^2$ lies in this interval is
\[ F(u) - F(\ell) = 1 - \alpha, \]
where $F$ is the chi-squared cdf with $n$ degrees of freedom. We usually choose $\ell$ and $u$ to solve
\[ F(\ell) = \alpha/2 \quad\text{and}\quad F(u) = 1 - \alpha/2. \]
These equations can be solved using tables, e.g., Table 12.3, or numerically by root finding, or in Matlab with the commands $\ell$ = chi2inv(alpha/2, n) and $u$ = chi2inv(1 - alpha/2, n).

Example 12.6. Let $X_1, X_2, \ldots$ be i.i.d. $N(5, \sigma^2)$ random variables. Suppose that $V_{100}^2 = 1.645$. Find the 90% confidence interval for $\sigma^2$.

Solution. From Table 12.3 we see that for $1 - \alpha = 0.90$, $\ell = 77.929$ and $u = 124.342$. The 90% confidence interval is
\[ \Bigl[\frac{100(1.645)}{124.342},\; \frac{100(1.645)}{77.929}\Bigr] = [1.323,\, 2.111]. \]
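The root-finding approach for $\ell$ and $u$ can be sketched in Python with the standard library alone. This is an illustration under our own assumptions: the chi-squared cdf with $n$ degrees of freedom is evaluated through the series for the regularized lower incomplete gamma function $P(n/2, x/2)$, and the quantiles are found by bisection. The function names are ours; they stand in for the chi2inv command cited in the text.

```python
from math import exp, lgamma, log


def chi2_cdf(x, n):
    """Chi-squared cdf with n degrees of freedom, via the series for the
    regularized lower incomplete gamma function P(n/2, x/2)."""
    if x <= 0.0:
        return 0.0
    a, z = n / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-17 * total:        # terms eventually decay geometrically
        term *= z / (a + k)
        total += term
        k += 1
    return total * exp(-z + a * log(z) - lgamma(a))


def chi2_quantile(q, n):
    """Solve F(x) = q by bisection."""
    lo, hi = 0.0, float(n)
    while chi2_cdf(hi, n) < q:         # expand until the root is bracketed
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, n) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def variance_ci_known_mean(v2, n, level):
    """Interval [n V_n^2 / u, n V_n^2 / l] with F(l) = alpha/2 and F(u) = 1 - alpha/2."""
    alpha = 1.0 - level
    lower_q = chi2_quantile(alpha / 2.0, n)
    upper_q = chi2_quantile(1.0 - alpha / 2.0, n)
    return n * v2 / upper_q, n * v2 / lower_q


if __name__ == "__main__":
    print(variance_ci_known_mean(1.645, 100, 0.90))   # compare with Example 12.6
```

The same two quantile calls with $n - 1$ in place of $n$ give the $\ell$ and $u$ needed when the mean is unknown and $S_n^2$ is used instead of $V_n^2$.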
$1-\alpha$    $\ell$      $u$
0.90          77.929      124.342
0.91          77.326      125.170
0.92          76.671      126.079
0.93          75.949      127.092
0.94          75.142      128.237
0.95          74.222      129.561
0.96          73.142      131.142
0.97          71.818      133.120
0.98          70.065      135.807
0.99          67.328      140.169

Table 12.3. Confidence levels $1-\alpha$ and corresponding values of $\ell$ and $u$ such that $P(\ell \le \frac{n}{\sigma^2} V_n^2 \le u) = 1 - \alpha$ and such that $P(\frac{n}{\sigma^2} V_n^2 \le \ell) = P(\frac{n}{\sigma^2} V_n^2 \ge u) = \alpha/2$ for $n = 100$ observations.
Estimating the Variance - Unknown Mean

Let $X_1, X_2, \ldots$ be i.i.d. $N(m, \sigma^2)$, where both the mean and the variance are unknown, but we are interested only in estimating the variance. Since we do not know $m$, we cannot use the estimator $V_n^2$ above. Instead we use $S_n^2$. However, for determining confidence intervals, it is easier to work with $\frac{n-1}{\sigma^2} S_n^2$. As argued below, $\frac{n-1}{\sigma^2} S_n^2$ is a chi-squared random variable with $n - 1$ degrees of freedom.

Choose $0 < \ell < u$, and consider the equation
\[ P\Bigl(\ell \le \frac{n-1}{\sigma^2} S_n^2 \le u\Bigr) = 1 - \alpha. \]
We can rewrite this as
\[ P\Bigl(\frac{(n-1) S_n^2}{\ell} \ge \sigma^2 \ge \frac{(n-1) S_n^2}{u}\Bigr) = 1 - \alpha. \]
This suggests the confidence interval
\[ \Bigl[\frac{(n-1) S_n^2}{u},\; \frac{(n-1) S_n^2}{\ell}\Bigr]. \]
Then the probability that $\sigma^2$ lies in this interval is
\[ F(u) - F(\ell) = 1 - \alpha, \]
where now $F$ is the chi-squared cdf with $n - 1$ degrees of freedom. We usually choose $\ell$ and $u$ to solve
\[ F(\ell) = \alpha/2 \quad\text{and}\quad F(u) = 1 - \alpha/2. \]
These equations can be solved using tables, e.g., Table 12.4, or numerically by root finding, or in Matlab with the commands $\ell$ = chi2inv(alpha/2, n - 1) and $u$ = chi2inv(1 - alpha/2, n - 1).
$1-\alpha$    $\ell$      $u$
0.90          77.046      123.225
0.91          76.447      124.049
0.92          75.795      124.955
0.93          75.077      125.963
0.94          74.275      127.103
0.95          73.361      128.422
0.96          72.288      129.996
0.97          70.972      131.966
0.98          69.230      134.642
0.99          66.510      138.987

Table 12.4. Confidence levels $1-\alpha$ and corresponding values of $\ell$ and $u$ such that $P(\ell \le \frac{n-1}{\sigma^2} S_n^2 \le u) = 1 - \alpha$ and such that $P(\frac{n-1}{\sigma^2} S_n^2 \le \ell) = P(\frac{n-1}{\sigma^2} S_n^2 \ge u) = \alpha/2$ for $n = 100$ observations ($n - 1 = 99$ degrees of freedom).
Example 12.7. Let $X_1, X_2, \ldots$ be i.i.d. $N(m, \sigma^2)$ random variables. If $S_{100}^2 = 1.608$, find the 90% confidence interval for $\sigma^2$.

Solution. From Table 12.4 we see that for $1 - \alpha = 0.90$, $\ell = 77.046$ and $u = 123.225$. The 90% confidence interval is
\[ \Bigl[\frac{99(1.608)}{123.225},\; \frac{99(1.608)}{77.046}\Bigr] = [1.292,\, 2.067]. \]

* Derivations
The remainder of this section is devoted to deriving the distributions of $S_n^2$ and
\[ T := \frac{M_n - m}{S_n/\sqrt{n}} = \frac{(M_n - m)/(\sigma/\sqrt{n}\,)}{\sqrt{\dfrac{n-1}{\sigma^2} S_n^2 \Big/ (n-1)}} \]
under the assumption that the $X_i$ are i.i.d. $N(m, \sigma^2)$. The analysis is rather complicated, and may be omitted by the beginning student.

We begin with the numerator in $T$. Recall that $(M_n - m)/(\sigma/\sqrt{n}\,)$ is simply $Y_n$ as defined in Section 12.1. It was shown in Example 12.1 that $Y_n \sim N(0,1)$.

For the denominator, we show that the density of $\frac{n-1}{\sigma^2} S_n^2$ is chi-squared with $n - 1$ degrees of freedom. We begin by recalling the derivation of (12.3). If we replace the first line of the derivation with
\[ S_n^2 = \frac{1}{n-1}\sum_{i=1}^n \bigl([X_i - m] - [M_n - m]\bigr)^2, \]
then we end up with
\[ S_n^2 = \frac{1}{n-1}\biggl[\Bigl(\sum_{i=1}^n [X_i - m]^2\Bigr) - n[M_n - m]^2\biggr]. \]
Using the notation $Z_i := (X_i - m)/\sigma$, we have
\[ \frac{n-1}{\sigma^2} S_n^2 = \sum_{i=1}^n Z_i^2 - n\Bigl(\frac{M_n - m}{\sigma}\Bigr)^2, \]
or
\[ \frac{n-1}{\sigma^2} S_n^2 + \Bigl(\frac{M_n - m}{\sigma/\sqrt{n}}\Bigr)^2 = \sum_{i=1}^n Z_i^2. \]
As we argue below, the two terms on the left are independent. It then follows that the density of $\sum_{i=1}^n Z_i^2$ is equal to the convolution of the densities of the other two terms. To find the density of $\frac{n-1}{\sigma^2} S_n^2$, we use moment generating functions. Now, the second term on the left is the square of an $N(0,1)$ random variable, and by Problem 7 in Chapter 4 has a chi-squared density with one degree of freedom. Its moment generating function is $1/(1-2s)^{1/2}$ (see Problem 40(c) in Chapter 3). For the same reason, each $Z_i^2$ has a chi-squared density with one degree of freedom. Since the $Z_i$ are independent, $\sum_{i=1}^n Z_i^2$ has a chi-squared density with $n$ degrees of freedom, and its moment generating function is $1/(1-2s)^{n/2}$ (see Problem 46(c) in Chapter 3). It now follows that the moment generating function of $\frac{n-1}{\sigma^2} S_n^2$ is the quotient
\[ \frac{1/(1-2s)^{n/2}}{1/(1-2s)^{1/2}} = \frac{1}{(1-2s)^{(n-1)/2}}, \]
which is the moment generating function of a chi-squared density with $n - 1$ degrees of freedom.

It remains to show that $S_n^2$ and $M_n$ are independent. Observe that $S_n^2$ is a function of the vector
\[ W := [(X_1 - M_n), \ldots, (X_n - M_n)]'. \]
In fact, $S_n^2 = W'W/(n-1)$. By Example 7.5, the vector $W$ and the sample mean are independent. It then follows that any function of $W$ and any function of $M_n$ are independent.

We can now find the density of
\[ T = \frac{(M_n - m)/(\sigma/\sqrt{n}\,)}{\sqrt{\dfrac{n-1}{\sigma^2} S_n^2 \Big/ (n-1)}}. \]
If the $X_i$ are i.i.d. $N(m, \sigma^2)$, then the numerator and the denominator are independent; the numerator is $N(0,1)$, and the denominator is the square root of a chi-squared random variable with $n - 1$ degrees of freedom divided by $n - 1$. By Problem 24 in Chapter 5, $T$ has a student's $t$ density with $\nu = n - 1$ degrees of freedom.
12.6. Notes
Notes §12.3: The Sample Variance

Note 1. We are appealing to the fact that if $g(u,v)$ is a continuous function of two variables, and if $U_n$ converges in probability to $u$, and if $V_n$ converges in probability to $v$, then $g(U_n, V_n)$ converges in probability to $g(u,v)$. This result is proved in Example 11.2.

Notes §12.4: Confidence Intervals When the Variance Is Unknown

Note 2. We are appealing to the fact that if the cdf of $Y_n$, say $F_n$, converges to a continuous cdf $F$, and if $U_n$ converges in probability to 1, then
\[ P(Y_n \le y U_n) \to F(y). \]
This result, which is proved in Example 11.8, is a version of Slutsky's Theorem.

Note 3. The hypergeometric random variable arises in the following situation. We have a collection of $N$ items, $d$ of which are defective. Rather than test all $N$ items, we select at random a small number of items, say $n < N$. Let $Y_n$ denote the number of defectives out of the $n$ items tested. We show that
\[ P(Y_n = k) = \frac{\dbinom{d}{k}\dbinom{N-d}{n-k}}{\dbinom{N}{n}}, \quad k = 0, \ldots, n. \]
We denote this by $Y_n \sim \mathrm{hypergeometric}(N, d, n)$.

Remark. In the typical case, $d \ge n$ and $N - d \ge n$; however, if these conditions do not hold in the above formula, it is understood that $\binom{d}{k} = 0$ if $d < k \le n$, and $\binom{N-d}{n-k} = 0$ if $n - k > N - d$, i.e., if $0 \le k < n - (N - d)$.

For $i = 1, \ldots, n$, draw at random an item from the collection and test it. If the $i$th item is defective, let $X_i = 1$, and put $X_i = 0$ otherwise. In either case, do not put the tested item back into the collection (sampling without replacement). Then the total number of defectives among the first $n$ items tested is
\[ Y_n := \sum_{i=1}^n X_i. \]
We show that $Y_n \sim \mathrm{hypergeometric}(N, d, n)$. Consider the case $n = 1$. Then $Y_1 = X_1$, and the chance of drawing a defective item at random is simply the ratio of the number of defectives to the total number of items in the collection; i.e., $P(Y_1 = 1) = P(X_1 = 1) = d/N$.
Now in general, suppose the result is true for some $n \ge 1$. We show it is true for $n+1$. Use the law of total probability to write

$$P(Y_{n+1} = k) = \sum_{i=0}^{n} P(Y_{n+1} = k \mid Y_n = i)\,P(Y_n = i). \tag{12.6}$$

Since $Y_{n+1} = Y_n + X_{n+1}$, we can use the substitution law to write

$$P(Y_{n+1} = k \mid Y_n = i) = P(Y_n + X_{n+1} = k \mid Y_n = i) = P(i + X_{n+1} = k \mid Y_n = i) = P(X_{n+1} = k - i \mid Y_n = i).$$

Since $X_{n+1}$ takes only the values zero and one, this last expression is zero unless $i = k$ or $i = k-1$. Returning to (12.6), we can write

$$P(Y_{n+1} = k) = \sum_{i=k-1}^{k} P(X_{n+1} = k - i \mid Y_n = i)\,P(Y_n = i). \tag{12.7}$$

When $i = k-1$, the above conditional probability is

$$P(X_{n+1} = 1 \mid Y_n = k-1) = \frac{d-(k-1)}{N-n},$$

since given $Y_n = k-1$, there are $N-n$ items left in the collection, and of those, the number of defectives remaining is $d-(k-1)$. When $i = k$, the needed conditional probability is

$$P(X_{n+1} = 0 \mid Y_n = k) = \frac{(N-d)-(n-k)}{N-n},$$

since given $Y_n = k$, there are $N-n$ items left in the collection, and of those, the number of nondefectives remaining is $(N-d)-(n-k)$. If we now assume that $Y_n \sim$ hypergeometric$(N,d,n)$, we can expand (12.7) to get

$$P(Y_{n+1} = k) = \frac{d-(k-1)}{N-n}\cdot\frac{\dbinom{d}{k-1}\dbinom{N-d}{n-(k-1)}}{\dbinom{N}{n}} + \frac{(N-d)-(n-k)}{N-n}\cdot\frac{\dbinom{d}{k}\dbinom{N-d}{n-k}}{\dbinom{N}{n}}.$$

Using $[d-(k-1)]\binom{d}{k-1} = k\binom{d}{k}$, $[(N-d)-(n-k)]\binom{N-d}{n-k} = (n+1-k)\binom{N-d}{[n+1]-k}$, and $(N-n)\binom{N}{n} = (n+1)\binom{N}{n+1}$, it is a simple calculation to see that the first term on the right is equal to

$$\frac{k}{n+1}\cdot\frac{\dbinom{d}{k}\dbinom{N-d}{[n+1]-k}}{\dbinom{N}{n+1}},$$

and the second term is equal to

$$\left(1 - \frac{k}{n+1}\right)\cdot\frac{\dbinom{d}{k}\dbinom{N-d}{[n+1]-k}}{\dbinom{N}{n+1}}.$$

Thus, $Y_{n+1} \sim$ hypergeometric$(N,d,n+1)$.
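The induction step can be checked exactly for small cases. The following sketch, which is only an illustration and not part of the original argument, evaluates the right-hand side of (12.7) with exact rational arithmetic and compares it with the hypergeometric$(N,d,n+1)$ formula. The helper names `hypergeom_pmf` and `step` and the values $N = 12$, $d = 5$ are assumptions chosen for the example.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(N, d, n, k):
    """P(Y_n = k) for Y_n ~ hypergeometric(N, d, n), as an exact fraction."""
    if k < 0 or k > n:
        return Fraction(0)
    # math.comb(a, b) is 0 when b > a, which matches the Remark's convention.
    return Fraction(comb(d, k) * comb(N - d, n - k), comb(N, n))


def step(N, d, n, k):
    """Right-hand side of (12.7), assuming Y_n ~ hypergeometric(N, d, n)."""
    p_defective = Fraction(d - (k - 1), N - n)        # P(X_{n+1}=1 | Y_n=k-1)
    p_good = Fraction((N - d) - (n - k), N - n)       # P(X_{n+1}=0 | Y_n=k)
    return (p_defective * hypergeom_pmf(N, d, n, k - 1)
            + p_good * hypergeom_pmf(N, d, n, k))


N, d = 12, 5  # hypothetical collection size and number of defectives
recursion_holds = all(
    step(N, d, n, k) == hypergeom_pmf(N, d, n + 1, k)
    for n in range(1, N)
    for k in range(0, n + 2)
)
```

Because `Fraction` arithmetic is exact, `recursion_holds` tests equality rather than closeness.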
12.7. Problems
Problems §12.1: The Sample Mean

1. Show that the random variable $Y_n$ defined in Example 12.1 is $N(0,1)$.

2. Show that when the $X_i$ are uncorrelated, $Y_n$ has zero mean and unit variance.

Problems §12.2: Confidence Intervals When the Variance Is Known

3. If $\sigma^2 = 4$ and $n = 100$, how wide is the 99% confidence interval? How large would $n$ have to be to have a 99% confidence interval of width less than or equal to 1/4?

4. Let $W_1, W_2, \ldots$ be i.i.d. with zero mean and variance 4. Let $X_i = m + W_i$, where $m$ is an unknown constant. If $M_{100} = 14.846$, find the 95% confidence interval.

5. Let $X_i = m + W_i$, where $m$ is an unknown constant, and the $W_i$ are i.i.d. Cauchy with parameter 1. Find $\delta > 0$ such that the probability is 2/3 that the confidence interval $[M_n - \delta, M_n + \delta]$ contains $m$; i.e., find $\delta > 0$ such that
$$P(|M_n - m| \le \delta) = 2/3.$$
Hints: Since $E[W_i^2] = \infty$, the central limit theorem does not apply. However, you can solve for $\delta$ exactly if you can find the cdf of $M_n - m$. The cdf of $W_i$ is $F(w) = \frac{1}{\pi}\tan^{-1}(w) + 1/2$, and the characteristic function of $W_i$ is $E[e^{j\nu W_i}] = e^{-|\nu|}$.
Problems §12.3: The Sample Variance

6. If $X_1, X_2, \ldots$ are uncorrelated, use the formula
$$S_n^2 = \frac{1}{n-1}\left[\left(\sum_{i=1}^n X_i^2\right) - nM_n^2\right]$$
to show that $S_n^2$ is an unbiased estimator of $\sigma^2$; i.e., show that $E[S_n^2] = \sigma^2$.
7. Let $X_1, X_2, \ldots$ be i.i.d. with finite fourth moment. Let $m_2 := E[X_i^2]$ be the second moment. Show that
$$\frac{1}{n}\sum_{i=1}^n X_i^2$$
is a consistent estimator of $m_2$. Hint: Let $\tilde X_i := X_i^2$.
Problems §12.4: Confidence Intervals When the Variance Is Unknown

8. Let $X_1, X_2, \ldots$ be i.i.d. random variables with unknown, finite mean $m$ and variance $\sigma^2$. If $M_{100} = 10.083$ and $S_{100} = 0.568$, find the 95% confidence interval for the population mean.

9. Suppose that 100 engineering freshmen are selected at random and $X_1, \ldots, X_{100}$ are their times (in years) to graduation. If $M_{100} = 4.422$ and $S_{100} = 0.957$, find the 93% confidence interval for their expected time to graduate.

10. From a batch of $N = 10{,}000$ computers, $n = 100$ are sampled, and 10 are found defective. Estimate the number of defective computers in the total batch of 10,000, and give the margin of error for 90% probability if $S_{100} = 0.302$.

11. You conduct a presidential preference poll by surveying 3,000 voters. You find that 1,559 (more than half) say they plan to vote for candidate A, and the others say they plan to vote for candidate B. If $S_{3000} = 0.500$, are you 90% sure that candidate A will win the election? Are you 99% sure?

12. From a batch of 100,000 airbags, 500 are sampled, and 48 are found defective. Estimate the number of defective airbags in the total batch of 100,000, and give the margin of error for 94% probability if $S_{100} = 0.295$.

13. A new vaccine has just been developed at your company. You need to be 97% sure that side effects do not occur more than 10% of the time.
(a) In order to estimate the probability $p$ of side effects, the vaccine is tested on 100 volunteers. Side effects are experienced by 6 of the volunteers. Using the value $S_{100} = 0.239$, find the 97% confidence interval for $p$. Are you 97% sure that $p \le 0.1$?
(b) Another study is performed, this time with 1000 volunteers. Side effects occur in 71 volunteers. Find the 97% confidence interval for the probability $p$ of side effects if $S_{1000} = 0.257$. Are you 97% sure that $p \le 0.1$?

14. Packet transmission times on a certain Internet link are independent and identically distributed. Assume that the times have an exponential density with mean $\mu$.
(a) Find the probability that in transmitting $n$ packets, at least one of them takes more than $t$ seconds to transmit.
(b) Let $T$ denote the total time to transmit $n$ packets. Find a closed-form expression for the density of $T$.
(c) Your answers to parts (a) and (b) depend on $\mu$, which in practice is unknown and must be estimated. To estimate the expected transmission time, $n = 100$ packets are sent, and the transmission times $T_1, \ldots, T_n$ recorded. It is found that the sample mean $M_{100} = 1.994$, and sample standard deviation $S_{100} = 1.798$, where
$$M_n := \frac{1}{n}\sum_{i=1}^n T_i \quad\text{and}\quad S_n^2 := \frac{1}{n-1}\sum_{i=1}^n (T_i - M_n)^2.$$
Find the 95% confidence interval for the expected transmission time.

15. Let $N_t$ be a Poisson process with unknown intensity $\lambda$. For $i = 1, \ldots, 100$, put $X_i = (N_i - N_{i-1})$. Then the $X_i$ are i.i.d. Poisson($\lambda$), and $E[X_i] = \lambda$. If $M_{100} = 5.170$, and $S_{100} = 2.132$, find the 95% confidence interval for $\lambda$.
Problems §12.5: Confidence Intervals for Normal Data

16. Let $W_1, W_2, \ldots$ be i.i.d. $N(0,\sigma^2)$ with $\sigma^2$ unknown. Let $X_i = m + W_i$, where $m$ is an unknown constant. Suppose $M_{10} = 14.832$ and $S_{10} = 1.904$. Find the 95% confidence interval for $m$.

17. Let $X_1, X_2, \ldots$ be i.i.d. $N(0,\sigma^2)$ with $\sigma^2$ unknown. Find the 95% confidence interval for $\sigma^2$ if $V_{100}^2 = 4.413$.

18. Let $W_t$ be a zero-mean, wide-sense stationary white noise process with power spectral density $S_W(f) = N_0/2$. Suppose that $W_t$ is applied to the ideal lowpass filter of bandwidth $B = 1$ MHz and power gain 120 dB; i.e., $H(f) = G\,I_{[-B,B]}(f)$, where $G = 10^6$. Denote the filter output by $Y_t$, and for $i = 1, \ldots, 100$, put $X_i := Y_{i\Delta t}$, where $\Delta t = (2B)^{-1}$. Show that the $X_i$ are zero mean, uncorrelated, with variance $\sigma^2 = G^2 B N_0$. Assuming the $X_i$ are normal, and that $V_{100}^2 = 4.659$ mW, find the 95% confidence interval for $N_0$; your answer should have units of Watts/Hz.

19. Let $X_1, X_2, \ldots$ be i.i.d. with known, finite mean $m$, unknown, finite variance $\sigma^2$, and finite fourth central moment $E[(X_i - m)^4]$. Show that $V_n^2$ is a consistent estimator of $\sigma^2$. Hint: Apply Problem 7 with $X_i$ replaced by $X_i - m$.

20. Let $W_1, W_2, \ldots$ be i.i.d. $N(0,\sigma^2)$ with $\sigma^2$ unknown. Let $X_i = m + W_i$, where $m$ is an unknown constant. Find the 95% confidence interval for $\sigma^2$ if $S_{100}^2 = 4.736$.
CHAPTER 13
Advanced Topics

Self similarity has been noted in the traffic of local-area networks [26], wide-area networks [32], and in World Wide Web traffic [12]. The purpose of this chapter is to introduce the notion of self similarity and related concepts so that students can be conversant with the kinds of stochastic processes being used to model network traffic.

Section 13.1 introduces the Hurst parameter and the notion of distributional self similarity for continuous-time processes. The concept of stationary increments is also presented. As an example of such processes, fractional Brownian motion is developed using the Wiener integral. In Section 13.2, we show that if one samples the increments of a continuous-time self-similar process with stationary increments, then the samples have a covariance function with a very specific formula. It is shown that this formula is equivalent to specifying the variance of the sample mean for all values of $n$. Also, the power spectral density is found up to a multiplicative constant. Section 13.3 introduces the concept of asymptotic second-order self similarity and shows that it is equivalent to specifying the limiting form of the variance of the sample mean. The main result here is a sufficient condition on the power spectral density that guarantees asymptotic second-order self similarity. Section 13.4 defines long-range dependence. It is shown that every long-range-dependent process is asymptotically second-order self similar. Section 13.5 introduces ARMA processes, and Section 13.6 extends this to fractional ARIMA processes. Fractional ARIMA processes provide a large class of models that are asymptotically second-order self similar.
13.1. Self Similarity in Continuous Time
Loosely speaking, a continuous-time random process $W_t$ is said to be self similar with Hurst parameter $H > 0$ if the process $W_{\lambda t}$ "looks like" the process $\lambda^H W_t$. If $\lambda > 1$, then time is speeded up for $W_{\lambda t}$ compared to the original process $W_t$. If $\lambda < 1$, then time is slowed down. The factor $\lambda^H$ in $\lambda^H W_t$ either increases or decreases the magnitude (but not the time scale) compared with the original process. Thus, for a self-similar process, when time is speeded up, the apparent effect is the same as changing the magnitude of the original process, rather than its time scale.

The precise definition of "looks like" will be in the sense of finite-dimensional distributions. That is, for $W_t$ to be self similar, we require that for every $\lambda > 0$, for every finite collection of times $t_1, \ldots, t_n$, all joint probabilities involving $W_{\lambda t_1}, \ldots, W_{\lambda t_n}$ are the same as those involving $\lambda^H W_{t_1}, \ldots, \lambda^H W_{t_n}$. The best example of a self-similar process is the Wiener process. This is easy to verify by comparing the joint characteristic functions of $W_{\lambda t_1}, \ldots, W_{\lambda t_n}$ and $\lambda^H W_{t_1}, \ldots, \lambda^H W_{t_n}$ for the correct value of $H$.
Implications of Self Similarity
Let us focus first on a single time point $t$. For a self-similar process, we must have

$$P(W_{\lambda t} \le x) = P(\lambda^H W_t \le x).$$

Taking $t = 1$ results in

$$P(W_\lambda \le x) = P(\lambda^H W_1 \le x).$$

Since $\lambda > 0$ is a dummy variable, we can call it $t$ instead. Thus,

$$P(W_t \le x) = P(t^H W_1 \le x), \quad t > 0.$$

Now rewrite this as

$$P(W_t \le x) = P(W_1 \le t^{-H} x), \quad t > 0,$$

or, in terms of cumulative distribution functions,

$$F_{W_t}(x) = F_{W_1}(t^{-H} x), \quad t > 0.$$

It can now be shown (Problem 1) that $W_t$ converges in distribution to the zero random variable as $t \to 0$. Similarly, as $t \to \infty$, $W_t$ converges in distribution to a discrete random variable taking the values 0 and $\pm\infty$.

We next look at expectations of self-similar processes. We can write

$$E[W_{\lambda t}] = E[\lambda^H W_t] = \lambda^H E[W_t].$$

Setting $t = 1$ and replacing $\lambda$ by $t$ results in

$$E[W_t] = t^H E[W_1], \quad t > 0. \tag{13.1}$$

Hence, for a self-similar process, its mean function has the form of a constant times $t^H$ for $t > 0$. As another example, consider

$$E[W_{\lambda t}^2] = E[(\lambda^H W_t)^2] = \lambda^{2H} E[W_t^2]. \tag{13.2}$$

Arguing as above, we find that

$$E[W_t^2] = t^{2H} E[W_1^2], \quad t > 0. \tag{13.3}$$

We can also take $t = 0$ in (13.2) to get

$$E[W_0^2] = \lambda^{2H} E[W_0^2].$$

Since the left-hand side does not depend on $\lambda$, $E[W_0^2] = 0$, which implies $W_0 = 0$ a.s. Hence, (13.1) and (13.3) both continue to hold even when $t = 0$.*

* Using the formula $a^b = e^{b \ln a}$, $0^H = e^{H \ln 0} = e^{-\infty}$, since $H > 0$. Thus, $0^H = 0$.
Example 13.1. Assuming that the Wiener process is self similar, show that the Hurst parameter must be $H = 1/2$.

Solution. Recall that for the Wiener process, we have $E[W_t^2] = \sigma^2 t$. Thus, (13.2) implies that

$$\sigma^2(\lambda t) = \lambda^{2H}\sigma^2 t.$$

Hence, $H = 1/2$.

Stationary Increments

A process $W_t$ is said to have stationary increments if for every increment $\tau > 0$, the increment process

$$Z_t := W_t - W_{t-\tau}$$

is a stationary process in $t$. If $W_t$ is self similar with Hurst parameter $H$, and has stationary increments, we say that $W_t$ is $H$-sssi.

If $W_t$ is $H$-sssi, then $E[Z_t]$ cannot depend on $t$; but by (13.1),

$$E[Z_t] = E[W_t - W_{t-\tau}] = \left[t^H - (t-\tau)^H\right] E[W_1].$$

If $H \ne 1$, then we must have $E[W_1] = 0$, which by (13.1) implies $E[W_t] = 0$. As we will see later, the case $H = 1$ is not of interest if $W_t$ has finite second moments, and so we will always take $E[W_t] = 0$.

If $W_t$ is $H$-sssi, then the stationarity of the increments and (13.3) imply

$$E[Z_t^2] = E[Z_\tau^2] = E[(W_\tau - W_0)^2] = E[W_\tau^2] = \tau^{2H} E[W_1^2].$$

Similarly, for $t > s$,

$$E[(W_t - W_s)^2] = E[(W_{t-s} - W_0)^2] = E[W_{t-s}^2] = (t-s)^{2H} E[W_1^2].$$

For $t < s$,

$$E[(W_t - W_s)^2] = E[(W_s - W_t)^2] = (s-t)^{2H} E[W_1^2].$$

Thus, for arbitrary $t$ and $s$,

$$E[(W_t - W_s)^2] = |t-s|^{2H} E[W_1^2].$$

Note in particular that

$$E[W_t^2] = |t|^{2H} E[W_1^2].$$

Now, we also have

$$E[(W_t - W_s)^2] = E[W_t^2] - 2E[W_t W_s] + E[W_s^2],$$

and it follows that

$$E[W_t W_s] = \frac{E[W_1^2]}{2}\left[|t|^{2H} - |t-s|^{2H} + |s|^{2H}\right]. \tag{13.4}$$
Fractional Brownian Motion
Let $W_t$ denote the standard Wiener process on $-\infty < t < \infty$ as defined in Problem 23 in Chapter 8. The standard fractional Brownian motion is the process $B_H(t)$ defined by the Wiener integral

$$B_H(t) := \int_{-\infty}^{\infty} g_{H,t}(\tau)\,dW_\tau,$$

where $g_{H,t}$ is defined below. Then $E[B_H(t)] = 0$, and

$$E[B_H(t)^2] = \int_{-\infty}^{\infty} g_{H,t}(\tau)^2\,d\tau.$$

To evaluate this expression as well as the correlation $E[B_H(t)B_H(s)]$, we must now define $g_{H,t}(\tau)$. To this end, let

$$q_H(\theta) := \begin{cases} \theta^{H-1/2}, & \theta > 0, \\ 0, & \theta \le 0, \end{cases}$$

and put

$$g_{H,t}(\tau) := \frac{1}{C_H}\left[q_H(t-\tau) - q_H(-\tau)\right],$$

where

$$C_H^2 := \int_0^\infty \left[(1+\theta)^{H-1/2} - \theta^{H-1/2}\right]^2 d\theta + \frac{1}{2H}.$$
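First note that since $g_{H,0}(\tau) = 0$, $B_H(0) = 0$. Next,

$$B_H(t) - B_H(s) = \int_{-\infty}^{\infty} \left[g_{H,t}(\tau) - g_{H,s}(\tau)\right] dW_\tau = C_H^{-1}\int_{-\infty}^{\infty} \left[q_H(t-\tau) - q_H(s-\tau)\right] dW_\tau,$$

and so

$$E[|B_H(t) - B_H(s)|^2] = C_H^{-2}\int_{-\infty}^{\infty} |q_H(t-\tau) - q_H(s-\tau)|^2\,d\tau.$$

If we now assume $s < t$, then this integral is equal to the sum of

$$\int_{-\infty}^{s} \left[(t-\tau)^{H-1/2} - (s-\tau)^{H-1/2}\right]^2 d\tau$$

and

$$\int_s^t (t-\tau)^{2H-1}\,d\tau = \int_0^{t-s} \theta^{2H-1}\,d\theta = \frac{(t-s)^{2H}}{2H}.$$

To evaluate the integral from $-\infty$ to $s$, let $\theta = s - \tau$ to get

$$\int_0^\infty \left[(t-s+\theta)^{H-1/2} - \theta^{H-1/2}\right]^2 d\theta,$$

which is equal to

$$(t-s)^{2H-1}\int_0^\infty \left[\bigl(1 + \theta/(t-s)\bigr)^{H-1/2} - \bigl(\theta/(t-s)\bigr)^{H-1/2}\right]^2 d\theta.$$

Making the change of variable $\mu = \theta/(t-s)$ yields

$$(t-s)^{2H}\int_0^\infty \left[(1+\mu)^{H-1/2} - \mu^{H-1/2}\right]^2 d\mu.$$

Adding the two pieces gives $(t-s)^{2H} C_H^2$. It is now clear that

$$E[|B_H(t) - B_H(s)|^2] = (t-s)^{2H}, \quad t > s.$$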
Since interchanging the positions of $t$ and $s$ on the left-hand side has no effect, we can write for arbitrary $t$ and $s$,

$$E[|B_H(t) - B_H(s)|^2] = |t-s|^{2H}.$$

Taking $s = 0$ yields

$$E[B_H(t)^2] = |t|^{2H}.$$

Furthermore, expanding $E[|B_H(t) - B_H(s)|^2]$, we find that

$$|t-s|^{2H} = |t|^{2H} - 2E[B_H(t)B_H(s)] + |s|^{2H},$$

or,

$$E[B_H(t)B_H(s)] = \frac{|t|^{2H} - |t-s|^{2H} + |s|^{2H}}{2}. \tag{13.5}$$

Observe that $B_H(t)$ is a Gaussian process in the sense that if we select any sampling times, $t_1 < \cdots < t_n$, then the random vector $[B_H(t_1), \ldots, B_H(t_n)]'$ is Gaussian; this is a consequence of the fact that $B_H(t)$ is defined as a Wiener integral (Problem 14 in Chapter 11). Furthermore, the covariance matrix of the random vector is completely determined by (13.5). On the other hand, by (13.4), we see that any $H$-sssi process has the same covariance function (up to a scale factor). If that $H$-sssi process is Gaussian, then as far as the joint probabilities involving any finite number of sampling times are concerned, we may as well assume that the $H$-sssi process is fractional Brownian motion. In this sense, there is only one $H$-sssi process with finite second moments that is Gaussian: fractional Brownian motion.
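Formula (13.5) is easy to evaluate directly. The sketch below is only an illustration; the function name `fbm_cov` is hypothetical. It shows that for $H = 1/2$ and $t, s \ge 0$ the formula reduces to $\min(t,s)$, which agrees with Example 13.1's $E[W_t^2] = \sigma^2 t$ when $\sigma^2 = 1$.

```python
def fbm_cov(t, s, H):
    """E[B_H(t) B_H(s)] for standard fractional Brownian motion, per (13.5)."""
    two_h = 2.0 * H
    return 0.5 * (abs(t) ** two_h - abs(t - s) ** two_h + abs(s) ** two_h)


# H = 1/2: (|t| - |t - s| + |s|)/2 equals min(t, s) for t, s >= 0.
wiener_case = fbm_cov(2.0, 5.0, 0.5)

# Setting s = t recovers the variance |t|^{2H}.
variance_at_3 = fbm_cov(3.0, 3.0, 0.8)
```

Filling an $n \times n$ matrix with `fbm_cov(t_i, t_k, H)` gives the covariance matrix of $[B_H(t_1), \ldots, B_H(t_n)]'$ mentioned above.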
13.2. Self Similarity in Discrete Time
Let $W_t$ be an $H$-sssi process. By choosing an appropriate time scale for $W_t$, we can focus on the unit increment $\tau = 1$. Furthermore, the advent of digital signal processing suggests that we sample the increment process. This leads us to consider the discrete-time increment process

$$X_n := W_n - W_{n-1}.$$

Since $W_t$ is assumed to have zero mean, the covariance of $X_n$ is easily found using (13.4). For $n > m$,

$$E[X_n X_m] = \frac{E[W_1^2]}{2}\left[(n-m+1)^{2H} - 2(n-m)^{2H} + (n-m-1)^{2H}\right].$$

Since this depends only on the time difference, the covariance function of $X_n$ is

$$C(n) = \frac{\sigma^2}{2}\left[|n+1|^{2H} - 2|n|^{2H} + |n-1|^{2H}\right], \tag{13.6}$$

where $\sigma^2 := E[W_1^2]$.

The foregoing analysis assumed that $X_n$ was obtained by sampling the increments of an $H$-sssi process. More generally, a discrete-time, wide-sense stationary (WSS) process is said to be second-order self similar if its covariance function has the form in (13.6). In this context it is not assumed that $X_n$ is obtained from an underlying continuous-time process or that $X_n$ has zero mean. A second-order self-similar process that is Gaussian is called fractional Gaussian noise, since one way of obtaining it is by sampling the increments of fractional Brownian motion.
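The covariance (13.6) can be tabulated with a few lines of code. The following is a minimal sketch; the name `fgn_cov` and the default $\sigma^2 = 1$ are assumptions made for the example. It also shows how the sign of $C(1)$ changes with $H$.

```python
def fgn_cov(n, H, sigma2=1.0):
    """Covariance C(n) of a second-order self-similar process, per (13.6)."""
    n = abs(n)  # C(n) is even in n
    two_h = 2.0 * H
    return 0.5 * sigma2 * ((n + 1) ** two_h - 2 * n ** two_h + abs(n - 1) ** two_h)


c0 = fgn_cov(0, 0.8, sigma2=2.0)   # C(0) equals sigma^2
uncorrelated = fgn_cov(3, 0.5)     # H = 1/2: zero for n != 0
positive_lag1 = fgn_cov(1, 0.8)    # H > 1/2: positive correlation at lag 1
negative_lag1 = fgn_cov(1, 0.3)    # H < 1/2: negative correlation at lag 1
```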
Convergence Rates for the Mean-Square Law of Large Numbers
Suppose that $X_n$ is a discrete-time, WSS process with mean $\mu := E[X_n]$. It is shown later in Section 13.4 that if $X_n$ is second-order self similar; i.e., if (13.6) holds, then $C(n) \to 0$. On account of Example 10.4, the sample mean

$$\frac{1}{n}\sum_{i=1}^n X_i$$

converges in mean square to $\mu$. But how fast does the sample mean converge? We show that (13.6) holds if and only if

$$E\left[\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right|^2\right] = \sigma^2 n^{2H-2} = \frac{\sigma^2}{n}\,n^{2H-1}. \tag{13.7}$$

In other words, $X_n$ is second-order self similar if and only if (13.7) holds. To put (13.7) into perspective, first consider the case $H = 1/2$. Then (13.6) reduces to zero for $n \ne 0$. In other words, the $X_n$ are uncorrelated. Also, when $H = 1/2$, the factor $n^{2H-1}$ in (13.7) is not present. Thus, (13.7) reduces to the mean-square law of large numbers for uncorrelated random variables derived in Example 10.3. If $2H - 1 < 0$, or equivalently $H < 1/2$, then the convergence is faster than in the uncorrelated case. If $1/2 < H < 1$, then the convergence is slower than in the uncorrelated case. This has important consequences for determining confidence intervals, as shown in Problem 6.

To show the equivalence of (13.6) and (13.7), the first step is to define

$$Y_n := \sum_{i=1}^n (X_i - \mu),$$

and observe that (13.7) is equivalent to $E[Y_n^2] = \sigma^2 n^{2H}$. The second step is to express $E[Y_n^2]$ in terms of $C(n)$. Write

$$E[Y_n^2] = E\left[\left(\sum_{i=1}^n (X_i - \mu)\right)\left(\sum_{k=1}^n (X_k - \mu)\right)\right] = \sum_{i=1}^n\sum_{k=1}^n E[(X_i - \mu)(X_k - \mu)] = \sum_{i=1}^n\sum_{k=1}^n C(i-k). \tag{13.8}$$
The above sum amounts to summing all the entries of the $n \times n$ matrix with $i\,k$ entry $C(i-k)$. This matrix is symmetric and is constant along each diagonal. Thus,

$$E[Y_n^2] = nC(0) + 2\sum_{\nu=1}^{n-1} C(\nu)(n-\nu). \tag{13.9}$$

Now that we have a formula for $E[Y_n^2]$ in terms of $C(n)$, we follow Likhanov [28, p. 195] and write

$$E[Y_{n+1}^2] - E[Y_n^2] = C(0) + 2\sum_{\nu=1}^{n} C(\nu),$$

and then

$$\left(E[Y_{n+1}^2] - E[Y_n^2]\right) - \left(E[Y_n^2] - E[Y_{n-1}^2]\right) = 2C(n). \tag{13.10}$$

Applying the formula $E[Y_n^2] = \sigma^2 n^{2H}$ shows that for $n \ge 1$,

$$C(n) = \frac{\sigma^2}{2}\left[(n+1)^{2H} - 2n^{2H} + (n-1)^{2H}\right].$$

Finally, it is a simple exercise (Problem 7) using induction on $n$ to show that (13.6) implies $E[Y_n^2] = \sigma^2 n^{2H}$ for $n \ge 1$.
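The equivalence of (13.6) and (13.7) can be spot-checked numerically. The sketch below is an illustration only, with hypothetical function names. It evaluates $E[Y_n^2]$ through (13.9) using the covariance (13.6) and compares the result with $\sigma^2 n^{2H}$; it also checks the second-difference identity (13.10).

```python
def fgn_cov(n, H, sigma2=1.0):
    """Covariance C(n) from (13.6)."""
    n = abs(n)
    two_h = 2.0 * H
    return 0.5 * sigma2 * ((n + 1) ** two_h - 2 * n ** two_h + abs(n - 1) ** two_h)


def second_moment_of_sum(n, H, sigma2=1.0):
    """E[Y_n^2] computed from (13.9): n C(0) + 2 sum_{v=1}^{n-1} C(v)(n - v)."""
    total = n * fgn_cov(0, H, sigma2)
    total += 2.0 * sum(fgn_cov(v, H, sigma2) * (n - v) for v in range(1, n))
    return total


# Largest deviation of (13.9) from sigma^2 n^{2H} over a small grid.
max_error = max(
    abs(second_moment_of_sum(n, H) - n ** (2.0 * H))
    for H in (0.3, 0.5, 0.8)
    for n in range(1, 51)
)

# (13.10) at n = 10, H = 0.8: the second difference equals 2 C(n).
m = second_moment_of_sum
second_diff = (m(11, 0.8) - m(10, 0.8)) - (m(10, 0.8) - m(9, 0.8))
```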
Aggregation
Consider the partitioning of the sequence $X_n$ into blocks of size $m$:

$$\underbrace{X_1, \ldots, X_m}_{\text{1st block}},\ \underbrace{X_{m+1}, \ldots, X_{2m}}_{\text{2nd block}},\ \ldots,\ \underbrace{X_{(n-1)m+1}, \ldots, X_{nm}}_{n\text{th block}},\ \ldots$$

The average of the first block is $\frac{1}{m}\sum_{k=1}^m X_k$. The average of the second block is $\frac{1}{m}\sum_{k=m+1}^{2m} X_k$. The average of the $n$th block is

$$X_n^{(m)} := \frac{1}{m}\sum_{k=(n-1)m+1}^{nm} X_k. \tag{13.11}$$
The superscript $(m)$ indicates the block size, which is the number of terms used to compute the average. The subscript $n$ indicates the block number. We call $\{X_n^{(m)}\}_{n=-\infty}^{\infty}$ the aggregated process. We now show that if $X_n$ is second-order self similar, then the covariance function of $X_n^{(m)}$, denoted by $C^{(m)}(n)$, satisfies

$$C^{(m)}(n) = m^{2H-2}\,C(n). \tag{13.12}$$

In other words, if the original sequence is replaced by the sequence of averages of blocks of size $m$, then the new sequence has a covariance function that is the same as the original one except that the magnitude is scaled by $m^{2H-2}$.

The derivation of (13.12) is very similar to the derivation of (13.7). Put

$$\tilde X_\nu^{(m)} := \sum_{k=(\nu-1)m+1}^{\nu m} (X_k - \mu). \tag{13.13}$$

Since $X_k$ is WSS, so is $\tilde X_\nu^{(m)}$. Let its covariance function be denoted by $\tilde C^{(m)}(\nu)$. Next define

$$\tilde Y_n := \sum_{\nu=1}^n \tilde X_\nu^{(m)}.$$

Just as in (13.10),

$$2\tilde C^{(m)}(n) = \left(E[\tilde Y_{n+1}^2] - E[\tilde Y_n^2]\right) - \left(E[\tilde Y_n^2] - E[\tilde Y_{n-1}^2]\right).$$

Now observe that

$$\tilde Y_n = \sum_{\nu=1}^n \tilde X_\nu^{(m)} = \sum_{\nu=1}^n \sum_{k=(\nu-1)m+1}^{\nu m} (X_k - \mu) = \sum_{\nu=1}^{nm} (X_\nu - \mu) = Y_{nm},$$

where this $Y$ is the same as the one defined in the preceding subsection. Hence,

$$2\tilde C^{(m)}(n) = \left(E[Y_{(n+1)m}^2] - E[Y_{nm}^2]\right) - \left(E[Y_{nm}^2] - E[Y_{(n-1)m}^2]\right). \tag{13.14}$$

Now we use the fact that since $X_n$ is second-order self similar, $E[Y_n^2] = \sigma^2 n^{2H}$. Thus,

$$\tilde C^{(m)}(n) = \frac{\sigma^2}{2}\left[\bigl((n+1)m\bigr)^{2H} - 2(nm)^{2H} + \bigl((n-1)m\bigr)^{2H}\right].$$

Since $C^{(m)}(n) = \tilde C^{(m)}(n)/m^2$, (13.12) follows.
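The scaling law (13.12) can be checked without simulation, since the covariance of two block averages is a double sum of $C(i-k)$ divided by $m^2$. The sketch below is an illustration under the assumption that the covariance is exactly (13.6); the function names are hypothetical.

```python
def fgn_cov(n, H, sigma2=1.0):
    """Covariance C(n) from (13.6)."""
    n = abs(n)
    two_h = 2.0 * H
    return 0.5 * sigma2 * ((n + 1) ** two_h - 2 * n ** two_h + abs(n - 1) ** two_h)


def aggregated_cov(n, m, H, sigma2=1.0):
    """C^{(m)}(n): covariance between block averages n+1 and 1 (block size m)."""
    total = sum(
        fgn_cov(i - k, H, sigma2)
        for i in range(n * m + 1, (n + 1) * m + 1)   # indices in block n + 1
        for k in range(1, m + 1)                     # indices in block 1
    )
    return total / m ** 2


H = 0.8
# Largest deviation from the right-hand side of (13.12) over a small grid.
max_gap = max(
    abs(aggregated_cov(n, m, H) - m ** (2.0 * H - 2.0) * fgn_cov(n, H))
    for m in (1, 2, 5)
    for n in range(0, 5)
)
```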
The Power Spectral Density
We show that the power spectral density of a second-order self-similar process is proportional to†

$$\sin^2(\pi f)\sum_{i=-\infty}^{\infty} \frac{1}{|i+f|^{2H+1}}. \tag{13.15}$$

The proof rests on the fact (derived below) that for any wide-sense stationary process,

$$\frac{C^{(m)}(n)}{m^{2H-2}} = \int_0^1 e^{j2\pi fn}\sum_{k=0}^{m-1} \frac{S([f+k]/m)}{m^{2H+1}}\left[\frac{\sin(\pi f)}{\sin(\pi[f+k]/m)}\right]^2 df$$

for $m = 1, 2, \ldots$. Now we also have

$$C(n) = \int_{-1/2}^{1/2} S(f)e^{j2\pi fn}\,df = \int_0^1 S(f)e^{j2\pi fn}\,df.$$

Hence, if $S(f)$ satisfies

$$\sum_{k=0}^{m-1} \frac{S([f+k]/m)}{m^{2H+1}}\left[\frac{\sin(\pi f)}{\sin(\pi[f+k]/m)}\right]^2 = S(f), \quad m = 1, 2, \ldots, \tag{13.16}$$

then (13.12) holds. In particular it holds for $n = 0$. Thus,

$$\frac{E[Y_m^2]}{m^{2H}} = \frac{C^{(m)}(0)}{m^{2H-2}} = C(0) = \sigma^2, \quad m = 1, 2, \ldots.$$

As noted earlier, $E[Y_n^2] = \sigma^2 n^{2H}$ is just (13.7), which is equivalent to (13.6). Now observe that if $S(f)$ is proportional to (13.15), then (13.16) holds.

The integral formula for $C^{(m)}(n)/m^{2H-2}$ is derived following Sinai [44, p. 66]. Write

$$\begin{aligned}
\frac{C^{(m)}(n)}{m^{2H-2}} &= \frac{1}{m^{2H-2}}\,E[(X_{n+1}^{(m)} - \mu)(X_1^{(m)} - \mu)] \\
&= \frac{1}{m^{2H}}\sum_{i=nm+1}^{(n+1)m}\sum_{k=1}^{m} E[(X_i - \mu)(X_k - \mu)] \\
&= \frac{1}{m^{2H}}\sum_{i=nm+1}^{(n+1)m}\sum_{k=1}^{m} C(i-k) \\
&= \frac{1}{m^{2H}}\sum_{i=nm+1}^{(n+1)m}\sum_{k=1}^{m} \int_{-1/2}^{1/2} S(f)e^{j2\pi f(i-k)}\,df \\
&= \frac{1}{m^{2H}}\int_{-1/2}^{1/2} S(f)\sum_{i=nm+1}^{(n+1)m}\sum_{k=1}^{m} e^{j2\pi f(i-k)}\,df.
\end{aligned}$$

† The constant of proportionality can be found using results from Section 13.3; see Problem 15.
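Now write

$$\sum_{i=nm+1}^{(n+1)m}\sum_{k=1}^{m} e^{j2\pi f(i-k)} = \sum_{\nu=1}^{m}\sum_{k=1}^{m} e^{j2\pi f(nm+\nu-k)} = e^{j2\pi fnm}\sum_{\nu=1}^{m}\sum_{k=1}^{m} e^{j2\pi f(\nu-k)} = e^{j2\pi fnm}\left|\sum_{k=1}^{m} e^{-j2\pi fk}\right|^2.$$

Using the finite geometric series,

$$\sum_{k=1}^{m} e^{-j2\pi fk} = e^{-j2\pi f}\,\frac{1 - e^{-j2\pi fm}}{1 - e^{-j2\pi f}} = e^{-j\pi f(m+1)}\,\frac{\sin(m\pi f)}{\sin(\pi f)}.$$

Thus,

$$\frac{C^{(m)}(n)}{m^{2H-2}} = \frac{1}{m^{2H}}\int_{-1/2}^{1/2} S(f)e^{j2\pi fnm}\left[\frac{\sin(m\pi f)}{\sin(\pi f)}\right]^2 df.$$

Since the integrand has period one, we can shift the range of integration to $[0,1]$ and then make the change-of-variable $\theta = mf$. Thus,

$$\begin{aligned}
\frac{C^{(m)}(n)}{m^{2H-2}} &= \frac{1}{m^{2H}}\int_0^1 S(f)e^{j2\pi fnm}\left[\frac{\sin(m\pi f)}{\sin(\pi f)}\right]^2 df \\
&= \frac{1}{m^{2H+1}}\int_0^m S(\theta/m)e^{j2\pi\theta n}\left[\frac{\sin(\pi\theta)}{\sin(\pi\theta/m)}\right]^2 d\theta \\
&= \frac{1}{m^{2H+1}}\sum_{k=0}^{m-1}\int_k^{k+1} S(\theta/m)e^{j2\pi\theta n}\left[\frac{\sin(\pi\theta)}{\sin(\pi\theta/m)}\right]^2 d\theta.
\end{aligned}$$

Now make the change-of-variable $f = \theta - k$ to get

$$\begin{aligned}
\frac{C^{(m)}(n)}{m^{2H-2}} &= \frac{1}{m^{2H+1}}\sum_{k=0}^{m-1}\int_0^1 S([f+k]/m)e^{j2\pi[f+k]n}\left[\frac{\sin(\pi[f+k])}{\sin(\pi[f+k]/m)}\right]^2 df \\
&= \frac{1}{m^{2H+1}}\sum_{k=0}^{m-1}\int_0^1 S([f+k]/m)e^{j2\pi fn}\left[\frac{\sin(\pi f)}{\sin(\pi[f+k]/m)}\right]^2 df \\
&= \int_0^1 e^{j2\pi fn}\sum_{k=0}^{m-1}\frac{S([f+k]/m)}{m^{2H+1}}\left[\frac{\sin(\pi f)}{\sin(\pi[f+k]/m)}\right]^2 df.
\end{aligned}$$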
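The finite geometric series step used in this derivation can be confirmed numerically. The sketch below is an illustration only; the function names and test frequencies are hypothetical. It compares the squared magnitude of the direct sum with the closed form $[\sin(m\pi f)/\sin(\pi f)]^2$.

```python
import cmath
import math


def kernel_direct(f, m):
    """|sum_{k=1}^{m} exp(-j 2 pi f k)|^2, summed term by term."""
    total = sum(cmath.exp(-2j * math.pi * f * k) for k in range(1, m + 1))
    return abs(total) ** 2


def kernel_closed(f, m):
    """Closed form [sin(m pi f) / sin(pi f)]^2, valid when f is not an integer."""
    return (math.sin(m * math.pi * f) / math.sin(math.pi * f)) ** 2


# Hypothetical test points (f must not be an integer).
gap_a = abs(kernel_direct(0.13, 7) - kernel_closed(0.13, 7))
gap_b = abs(kernel_direct(0.40, 3) - kernel_closed(0.40, 3))
```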
Notation
We have been using the term correlation function to refer to the quantity $E[X_n X_m]$. This is the usual practice in engineering. However, engineers studying network traffic follow the practice of statisticians and use the term correlation function to refer to

$$\frac{\operatorname{cov}(X_n, X_m)}{\sqrt{\operatorname{var}(X_n)\operatorname{var}(X_m)}}.$$

In other words, in networking, the term correlation function refers to our covariance function $C(n)$ divided by $C(0)$. We use the notation

$$\rho(n) := \frac{C(n)}{C(0)}.$$

Now assume that $X_n$ is second-order self similar. We have by (13.6) that $C(0) = \sigma^2$, and so

$$\rho(n) = \frac{1}{2}\left[|n+1|^{2H} - 2|n|^{2H} + |n-1|^{2H}\right].$$

Let $\rho^{(m)}$ denote the correlation function of $X_n^{(m)}$. Then (13.12) tells us that

$$\rho^{(m)}(n) := \frac{C^{(m)}(n)}{C^{(m)}(0)} = \frac{m^{2H-2}C(n)}{m^{2H-2}C(0)} = \rho(n).$$
13.3. Asymptotic Second-Order Self Similarity
We showed in the previous section that second-order self similarity (eq. (13.6)) is equivalent to (13.7), which specifies

$$E\left[\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right|^2\right] \tag{13.17}$$

exactly for all $n$. While this is a very nice result, it applies only when the covariance function has exactly the form in (13.6). However, if we only need to know the behavior of (13.17) for large $n$, say

$$\lim_{n\to\infty} \frac{E\left[\left|\dfrac{1}{n}\displaystyle\sum_{i=1}^n X_i - \mu\right|^2\right]}{n^{2H-2}} = \sigma_\infty^2 \tag{13.18}$$

for some finite, positive $\sigma_\infty^2$, then we can allow more freedom in the behavior of the covariance function. The key to obtaining such a result is suggested by (13.12), which says that for a second-order self-similar process,

$$\frac{C^{(m)}(n)}{m^{2H-2}} = \frac{\sigma^2}{2}\left[|n+1|^{2H} - 2|n|^{2H} + |n-1|^{2H}\right].$$

You are asked to show in Problems 9 and 10 that (13.18) holds if and only if

$$\lim_{m\to\infty} \frac{C^{(m)}(n)}{m^{2H-2}} = \frac{\sigma_\infty^2}{2}\left[|n+1|^{2H} - 2|n|^{2H} + |n-1|^{2H}\right]. \tag{13.19}$$
A wide-sense stationary process that satisfies (13.19) is said to be asymptotically second-order self similar. In the literature, (13.19) is usually written in terms of the correlation function $\rho^{(m)}(n)$; see Problem 11.

If a wide-sense stationary process has a covariance function $C(n)$, how can we check if (13.18) or (13.19) holds? Below we answer this question in the frequency domain with a sufficient condition on the power spectral density. In Section 13.4, we answer this question in the time domain with a sufficient condition on $C(n)$ known as long-range dependence.

Let us look into the frequency domain. Suppose that $C(n)$ has a power spectral density $S(f)$ so that‡

$$C(n) = \int_{-1/2}^{1/2} S(f)e^{j2\pi fn}\,df.$$

‡ The power spectral density of a discrete-time process is periodic with period 1, and is real, even, and nonnegative. It is integrable since $\int_{-1/2}^{1/2} S(f)\,df = C(0) < \infty$.

The following result is proved later in this section.

Theorem. If

$$\lim_{f\to 0} \frac{S(f)}{|f|^{\alpha-1}} = s \tag{13.20}$$

for some finite, positive $s$, and if for every $0 < \delta < 1/2$, $S(f)$ is bounded on $[\delta, 1/2]$, then the process is asymptotically second-order self similar with $H = 1 - \alpha/2$ and

$$\sigma_\infty^2 = s\cdot\frac{4\cos(\alpha\pi/2)\,\Gamma(\alpha)}{(2\pi)^\alpha(1-\alpha)(2-\alpha)}. \tag{13.21}$$

Notice that $0 < \alpha < 1$ implies $H = 1 - \alpha/2 \in (1/2, 1)$.

Below we give a specific power spectral density that satisfies the above conditions.

Example 13.2 (Hosking [23, Theorem 1(c)]). Fix $0 < d < 1/2$, and let§

$$S(f) = |1 - e^{-j2\pi f}|^{-2d}.$$

§ As shown in Section 13.6, a process with this power spectral density is an ARIMA$(0,d,0)$ process. The covariance function that corresponds to $S(f)$ is derived in Problem 12.

Since

$$1 - e^{-j2\pi f} = 2je^{-j\pi f}\,\frac{e^{j\pi f} - e^{-j\pi f}}{2j},$$

we can write

$$S(f) = [4\sin^2(\pi f)]^{-d}.$$

Since

$$\lim_{f\to 0} \frac{[4\sin^2(\pi f)]^{-d}}{[4(\pi f)^2]^{-d}} = 1,$$

if we put $\alpha = 1 - 2d$, then

$$\lim_{f\to 0} \frac{S(f)}{|f|^{\alpha-1}} = (2\pi)^{-2d}.$$
Notice that to keep $0 < \alpha < 1$, we needed $0 < d < 1/2$.

The power spectral density $S(f)$ in the above example factors into

$$S(f) = [1 - e^{-j2\pi f}]^{-d}\,[1 - e^{j2\pi f}]^{-d}.$$

More generally, let $S(f)$ be any power spectral density satisfying (13.20), boundedness away from the origin, and having a factorization of the form $S(f) = G(f)G(f)^*$. Then pass any wide-sense stationary, uncorrelated sequence through the discrete-time filter $G(f)$. By the discrete-time analog of (6.6), the output power spectral density is proportional to $S(f)$, and the output process is therefore asymptotically second-order self similar.

Example 13.3 (Hosking [23, Theorem 1(a)]). Find the impulse response of the filter $G(f) = [1 - e^{-j2\pi f}]^{-d}$.

Solution. Observe that $G(f)$ can be obtained by evaluating the $z$ transform $(1 - z^{-1})^{-d}$ on the unit circle, $z = e^{j2\pi f}$. Hence, the desired impulse response can be found by inspection of the series for $(1 - z^{-1})^{-d}$. To this end, it is easy to show that the Taylor series for $(1+z)^d$ is¶

$$(1+z)^d = 1 + \sum_{n=1}^{\infty} \frac{d(d-1)\cdots(d-[n-1])}{n!}\,z^n.$$

Hence,

$$(1 - z^{-1})^{-d} = 1 + \sum_{n=1}^{\infty} \frac{(-d)(-d-1)\cdots(-d-[n-1])}{n!}\,(-z^{-1})^n = 1 + \sum_{n=1}^{\infty} \frac{d(d+1)\cdots(d+[n-1])}{n!}\,z^{-n}.$$

Note that the impulse response is causal.

¶ Notice that if $d \ge 0$ is an integer, the product $d(d-1)\cdots(d-[n-1])$ contains zero as a factor for $n \ge d+1$; in this case, the sum contains only $d+1$ terms and converges for all complex $z$. In fact, the formula reduces to the binomial theorem.
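The impulse response in Example 13.3 can be generated recursively, since the coefficient of $z^{-n}$ equals the coefficient of $z^{-(n-1)}$ times $(d + n - 1)/n$. The following sketch is only an illustration; the function name `frac_integrator_taps` and the argument `count` are hypothetical.

```python
def frac_integrator_taps(d, count):
    """First `count` impulse-response values g_0, g_1, ... of (1 - z^{-1})^{-d}.

    Uses g_0 = 1 and g_n = g_{n-1} * (d + n - 1) / n, which reproduces
    d(d+1)...(d+[n-1]) / n! from the series in Example 13.3.
    """
    if count <= 0:
        return []
    taps = [1.0]
    for n in range(1, count):
        taps.append(taps[-1] * (d + n - 1) / n)
    return taps


accumulator = frac_integrator_taps(1.0, 5)    # d = 1: 1/(1 - z^{-1}), all ones
fractional = frac_integrator_taps(0.25, 3)    # 1, d, d(d+1)/2
```

For $0 < d < 1/2$ the taps decay slowly with $n$; the recursion only ever uses nonnegative indices, in line with the causality noted above.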
Once we have a process whose power spectral density satisfies (13.20) and boundedness away from the origin, it remains so after further filtering by stable linear time-invariant systems. For if $\sum_n |h_n| < \infty$, then

$$H(f) = \sum_{n=-\infty}^{\infty} h_n e^{-j2\pi fn}$$

is an absolutely convergent series and therefore continuous. If $S(f)$ satisfies (13.20), then

$$\lim_{f\to 0} \frac{|H(f)|^2 S(f)}{|f|^{\alpha-1}} = |H(0)|^2 s.$$
A wide class of stable filters is provided by autoregressive moving average (ARMA) systems discussed in Section 13.5.

Proof of Theorem. We now establish that (13.20) and boundedness of $S(f)$ away from the origin imply asymptotic second-order self similarity. Since (13.19) and (13.18) are equivalent, it suffices to establish (13.18). As noted in the Hint in Problem 10, (13.18) is equivalent to $E[Y_n^2]/n^{2H} \to \sigma_\infty^2$. From (13.8),

$$E[Y_n^2] = \sum_{i=1}^n\sum_{k=1}^n C(i-k) = \sum_{i=1}^n\sum_{k=1}^n \int_{-1/2}^{1/2} S(f)e^{j2\pi f(i-k)}\,df = \int_{-1/2}^{1/2} S(f)\left|\sum_{k=1}^n e^{-j2\pi fk}\right|^2 df.$$

Using the finite geometric series,

$$\sum_{k=1}^n e^{-j2\pi fk} = e^{-j2\pi f}\,\frac{1 - e^{-j2\pi fn}}{1 - e^{-j2\pi f}} = e^{-j\pi f(n+1)}\,\frac{\sin(n\pi f)}{\sin(\pi f)}.$$

Thus,

$$E[Y_n^2] = \int_{-1/2}^{1/2} S(f)\left[\frac{\sin(n\pi f)}{\sin(\pi f)}\right]^2 df = n^2\int_{-1/2}^{1/2} S(f)\left[\frac{\sin(n\pi f)}{n\pi f}\right]^2\left[\frac{\pi f}{\sin(\pi f)}\right]^2 df.$$

We now show that

$$\lim_{n\to\infty} \frac{E[Y_n^2]}{n^{2-\alpha}} = s\cdot\frac{4\cos(\alpha\pi/2)\,\Gamma(\alpha)}{(2\pi)^\alpha(1-\alpha)(2-\alpha)}.$$
The rst step is to put Kn() := and show that
n
Z 1=2
sin(nf) 1 , j f j nf ,1=2 1
379 2
E[Yn2 ] , s K () ! 0: n 2,
f 2 df; sin(f)
(13.22) n The second step, which is left to Problem 13, is to show that Kn () ! K(), where Z 1 1 sin 2d: K() := , ,1 jj1, The third step, which is left to Problem 14, is to show that 4 cos(=2),() : K() = (2) (1 , )(2 , ) To begin, observe that E[Yn2] , s K () n n2, is equal to 2 2 Z 1=2 s sin(nf) f n S(f) , jf j1, (13.23) nf sin(f) df: ,1=2 Let " > 0 be given, and let 0 < < 1=2 be so small that S(f) ,1 , s < "; jf j or s < " : S(f) , jf j1, jf j1, In analyzing (13.23), it is rst convenient to focus on the range of integration [0; ]. Then the absolute value of 2 2 Z s sin(nf) f n S(f) , jf j1, nf sin(f) df 0 is upper bounded by 2 Z 1 sin(nf) 2 df; "n 2 nf 0 f 1, where we have used the fact that for jf j 1=2, 2: 1 sin(f) f July 18, 2002
380
Chap. 13 Advanced Topics
Now make the change-of-variable = nf in the above integral to get the expression 2 , Z n 1 sin 2 "n 2 n 1, d; 0 or 2 Z n 2 2 K() 1 sin , " 2 d " 1, 2 2 : 0
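We return to (13.23) and focus now on the range of integration $[\delta, 1/2]$. The absolute value of

$$
n^{\alpha} \int_{\delta}^{1/2} \left[ S(f) - \frac{s}{|f|^{1-\alpha}} \right] \left[\frac{\sin(\pi nf)}{\pi nf}\right]^2 \left[\frac{\pi f}{\sin(\pi f)}\right]^2 df
$$

is upper bounded by

$$
n^{\alpha} \left(\frac{\pi}{2}\right)^2 B \int_{\delta}^{1/2} \left[\frac{\sin(\pi nf)}{\pi nf}\right]^2 df, \tag{13.24}
$$

where

$$
B := \sup_{|f| \in [\delta, 1/2]} \left| S(f) - \frac{s}{|f|^{1-\alpha}} \right|.
$$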
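In (13.24) make the change-of-variable $\theta = \pi nf$ to get

$$
\int_{\delta}^{1/2} \left[\frac{\sin(\pi nf)}{\pi nf}\right]^2 df = \frac{1}{\pi n} \int_{\pi n\delta}^{\pi n/2} \left[\frac{\sin\theta}{\theta}\right]^2 d\theta \le \frac{1}{\pi n} \int_{0}^{\infty} \left[\frac{\sin\theta}{\theta}\right]^2 d\theta.
$$

Thus, (13.24) is upper bounded by

$$
n^{\alpha-1}\, \frac{\pi}{4}\, B \int_{0}^{\infty} \left[\frac{\sin\theta}{\theta}\right]^2 d\theta.
$$

Since this integral is finite, and since $\alpha < 1$, this bound goes to zero as $n \to \infty$. Thus, (13.22) is established.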
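The constant $K(\alpha)$ in the limit can be evaluated from its closed form, and the integral identity behind it (Problem 14) can be checked numerically. The Python sketch below is an illustration only, not the text's method: the helper names, the Simpson rule, the truncation point `upper=200`, and the tail approximation (replacing $\sin^2\theta$ by its average value $1/2$ beyond the truncation point) are all assumptions made here.

```python
import math

def closed_form(alpha):
    """2^(1-alpha) cos(alpha pi/2) Gamma(alpha) / ((1-alpha)(2-alpha)), 0 < alpha < 1."""
    return (2.0 ** (1.0 - alpha) * math.cos(alpha * math.pi / 2.0)
            * math.gamma(alpha) / ((1.0 - alpha) * (2.0 - alpha)))

def K(alpha):
    """K(alpha) = 2 pi^(-alpha) * integral_0^inf theta^(alpha-3) sin^2(theta) d theta."""
    return 2.0 * math.pi ** (-alpha) * closed_form(alpha)

def simpson(func, a, b, n):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

def numeric_integral(alpha, upper=200.0):
    """Approximate integral_0^inf theta^(alpha-3) sin^2(theta) d theta."""
    # Head on [0, 1]: the substitution theta = u**(1/alpha) removes the
    # integrable singularity theta**(alpha-1) at the origin.
    def head(u):
        theta = u ** (1.0 / alpha)
        sinc = 1.0 if theta == 0.0 else math.sin(theta) / theta
        return sinc * sinc / alpha

    # Body on [1, upper]: the integrand is smooth here.
    def body(theta):
        return theta ** (alpha - 3.0) * math.sin(theta) ** 2

    # Tail beyond `upper`: approximate sin^2 by its average value 1/2.
    tail = upper ** (alpha - 2.0) / (2.0 * (2.0 - alpha))
    return simpson(head, 0.0, 1.0, 2000) + simpson(body, 1.0, upper, 100000) + tail

if __name__ == "__main__":
    for alpha in (0.3, 0.5, 0.8):
        print(alpha, numeric_integral(alpha), closed_form(alpha), K(alpha))
```

With these choices the numerical and closed-form values should agree to roughly three decimal places; tighter agreement would need a more careful treatment of the oscillatory tail.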
13.4. Long-Range Dependence
Loosely speaking, a wide-sense stationary process is said to be long-range dependent (LRD) if its covariance function $C(n)$ decays slowly as $n \to \infty$. The precise definition of slow decay is the requirement that for some $0 < \alpha < 1$,

$$
\lim_{n\to\infty} \frac{C(n)}{n^{-\alpha}} = c, \tag{13.25}
$$

for some finite, positive constant $c$. In other words, for large $n$, $C(n)$ looks like $c/n^{\alpha}$.
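To make (13.25) concrete, the sketch below assumes a covariance of the second-order self-similar form $C(n) = \frac{\sigma^2}{2}[(n+1)^{2H} - 2n^{2H} + (n-1)^{2H}]$, which is the form analyzed in this section, and checks numerically that $C(n)/(c\,n^{-\alpha})$ approaches 1 for $\alpha = 2 - 2H$ and $c = \sigma^2 H(2H-1)$. The function names are illustrative and the values of $H$ and $n$ are arbitrary choices.

```python
def self_similar_cov(n, hurst, sigma2=1.0):
    """C(n) = (sigma^2/2) [(n+1)^{2H} - 2 n^{2H} + (n-1)^{2H}] for n >= 1."""
    p = 2.0 * hurst
    return 0.5 * sigma2 * ((n + 1) ** p - 2.0 * n ** p + (n - 1) ** p)

def lrd_parameters(hurst, sigma2=1.0):
    """Candidate (alpha, c) for (13.25): alpha = 2 - 2H, c = sigma^2 H (2H - 1)."""
    return 2.0 - 2.0 * hurst, sigma2 * hurst * (2.0 * hurst - 1.0)

if __name__ == "__main__":
    hurst = 0.8  # any 1/2 < H < 1
    alpha, c = lrd_parameters(hurst)
    for n in (10, 100, 1000):
        # The ratio should approach 1 as n grows.
        print(n, self_similar_cov(n, hurst) / (c * n ** (-alpha)))
```

For very large $n$ the second difference suffers from floating-point cancellation, so moderate values of $n$ are used here.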
In this section, we prove two important results. The first result is that a second-order self-similar process is long-range dependent. The second result is that long-range dependence implies asymptotic second-order self similarity.

To prove that second-order self similarity implies long-range dependence, we proceed as follows. Write (13.6) for $n \ge 1$ as

$$
C(n) = \frac{\sigma^2}{2}\, n^{2H} \left[ (1+1/n)^{2H} - 2 + (1-1/n)^{2H} \right] = \frac{\sigma^2}{2}\, n^{2H} q(1/n),
$$

where

$$
q(t) := (1+t)^{2H} - 2 + (1-t)^{2H}.
$$

For $n$ large, $1/n$ is small. This suggests that we examine the Taylor expansion of $q(t)$ for $t$ near zero. Since $q(0) = q'(0) = 0$, we expand to second order to get

$$
q(t) \approx \frac{q''(0)}{2}\, t^2 = 2H(2H-1)\,t^2.
$$

So, for large $n$,

$$
C(n) = \frac{\sigma^2}{2}\, n^{2H} q(1/n) \approx \sigma^2 H(2H-1)\, n^{2H-2}. \tag{13.26}
$$

It appears that $\alpha = 2 - 2H$ and $c = \sigma^2 H(2H-1)$. Note that $0 < \alpha < 1$ corresponds to $1/2 < H < 1$. Also, $H > 1/2$ corresponds to $\sigma^2 H(2H-1) > 0$. To prove that these values of $\alpha$ and $c$ work, write

$$
\lim_{n\to\infty} \frac{C(n)}{n^{2H-2}} = \frac{\sigma^2}{2} \lim_{n\to\infty} \frac{q(1/n)}{n^{-2}} = \frac{\sigma^2}{2} \lim_{t \downarrow 0} \frac{q(t)}{t^2},
$$

and apply l'Hôpital's rule twice to obtain

$$
\lim_{n\to\infty} \frac{C(n)}{n^{2H-2}} = \sigma^2 H(2H-1). \tag{13.27}
$$

This formula implies the following two facts. First, if $H > 1$, then $C(n) \to \infty$ as $n \to \infty$. This contradicts the fact that covariance functions are bounded (recall that $|C(n)| \le C(0)$ by the Cauchy–Schwarz inequality; cf. Section 6.2). Thus, a second-order self-similar process cannot have $H > 1$. Second, if $H = 1$, then $C(n) \to \sigma^2$. In other words, the covariance does not decay to zero as $n$ increases. Since this situation does not arise in applications, we do not consider the case $H = 1$.

We now show that long-range dependence (13.25) implies‖ asymptotic second-order self similarity with $H = 1 - \alpha/2$ and $\sigma_\infty^2 = 2c/[(1-\alpha)(2-\alpha)]$.

‖ Actually, the weaker condition,
$$
\lim_{n\to\infty} \frac{C(n)}{\ell(n)\, n^{-\alpha}} = c,
$$
where $\ell$ is a slowly varying function, is enough to imply asymptotic second-order self similarity [48]. The derivation we present results from taking the proof in [48, Appendix A] and setting $\ell(n) \equiv 1$ so that no theory of slowly varying functions is required.

From (13.9),

$$
\frac{E[Y_n^2]}{n^{2-\alpha}} = \frac{C(0)}{n^{1-\alpha}} + 2\,\frac{\sum_{\nu=1}^{n-1} C(\nu)}{n^{1-\alpha}} - 2\,\frac{\sum_{\nu=1}^{n-1} \nu\, C(\nu)}{n^{2-\alpha}}.
$$

We claim that if (13.25) holds, then

$$
\lim_{n\to\infty} \frac{\sum_{\nu=1}^{n-1} C(\nu)}{n^{1-\alpha}} = \frac{c}{1-\alpha}, \tag{13.28}
$$

and

$$
\lim_{n\to\infty} \frac{\sum_{\nu=1}^{n-1} \nu\, C(\nu)}{n^{2-\alpha}} = \frac{c}{2-\alpha}. \tag{13.29}
$$
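Since $n^{1-\alpha} \to \infty$, it follows that

$$
\lim_{n\to\infty} \frac{E[Y_n^2]}{n^{2-\alpha}} = \frac{2c}{1-\alpha} - \frac{2c}{2-\alpha} = \frac{2c}{(1-\alpha)(2-\alpha)}.
$$

Since $n^{1-\alpha} \to \infty$, to prove (13.28), it is enough to show that for some $k$,

$$
\lim_{n\to\infty} \frac{\sum_{\nu=k}^{n-1} C(\nu)}{n^{1-\alpha}} = \frac{c}{1-\alpha}.
$$

Fix any $0 < \varepsilon < c$. By (13.25), there is a $k$ such that for all $\nu \ge k$,

$$
\left| \frac{C(\nu)}{\nu^{-\alpha}} - c \right| < \varepsilon.
$$

Then

$$
(c-\varepsilon) \sum_{\nu=k}^{n-1} \nu^{-\alpha} \le \sum_{\nu=k}^{n-1} C(\nu) \le (c+\varepsilon) \sum_{\nu=k}^{n-1} \nu^{-\alpha}.
$$

Hence, we only need to prove that

$$
\lim_{n\to\infty} \frac{\sum_{\nu=k}^{n-1} \nu^{-\alpha}}{n^{1-\alpha}} = \frac{1}{1-\alpha}.
$$

This is done in Problem 17 by exploiting the inequality

$$
\sum_{\nu=k}^{n-1} (\nu+1)^{-\alpha} \le \int_k^n t^{-\alpha}\, dt \le \sum_{\nu=k}^{n-1} \nu^{-\alpha}. \tag{13.30}
$$

Note that

$$
I_n := \int_k^n t^{-\alpha}\, dt = \frac{n^{1-\alpha} - k^{1-\alpha}}{1-\alpha} \to \infty \quad\text{as } n \to \infty.
$$

A similar approach is used in Problem 18 to derive (13.29).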
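The sandwich inequality (13.30) and the limit it is used to prove can be observed numerically. The sketch below is an illustration under arbitrary parameter choices ($\alpha = 1/2$, $k = 1$); the function name `sandwich` is hypothetical.

```python
def sandwich(alpha, k, n):
    """Return the three quantities in (13.30): lower sum, integral I_n, upper sum."""
    lower = sum((v + 1) ** (-alpha) for v in range(k, n))  # sum of (nu+1)^(-alpha), nu = k..n-1
    upper = sum(v ** (-alpha) for v in range(k, n))        # sum of nu^(-alpha), nu = k..n-1
    integral = (n ** (1.0 - alpha) - k ** (1.0 - alpha)) / (1.0 - alpha)
    return lower, integral, upper

if __name__ == "__main__":
    alpha, k = 0.5, 1
    for n in (10**3, 10**4, 10**5):
        lower, integral, upper = sandwich(alpha, k, n)
        # upper / n^(1-alpha) should drift toward 1/(1-alpha) = 2, but slowly.
        print(n, lower <= integral <= upper, upper / n ** (1.0 - alpha))
```

The convergence is slow (the error decays like $n^{\alpha-1}$), which is consistent with the slow decay that characterizes long-range dependence.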
13.5. ARMA Processes
We say that $X_n$ is an autoregressive moving average (ARMA) process if $X_n$ satisfies the equation

$$
X_n + a_1 X_{n-1} + \cdots + a_p X_{n-p} = Z_n + b_1 Z_{n-1} + \cdots + b_q Z_{n-q}, \tag{13.31}
$$

where $Z_n$ is an uncorrelated sequence of zero-mean random variables with common variance $\sigma^2 = E[Z_n^2]$. In this case, we say that $X_n$ is ARMA($p,q$). If $a_1 = \cdots = a_p = 0$, then

$$
X_n = Z_n + b_1 Z_{n-1} + \cdots + b_q Z_{n-q},
$$

and we say that $X_n$ is a moving average process, denoted by MA($q$). If instead $b_1 = \cdots = b_q = 0$, then

$$
X_n = -(a_1 X_{n-1} + \cdots + a_p X_{n-p}) + Z_n,
$$

and we say that $X_n$ is an autoregressive process, denoted by AR($p$).

To gain some insight into (13.31), rewrite it using convolution sums as

$$
\sum_{k=-\infty}^{\infty} a_k X_{n-k} = \sum_{k=-\infty}^{\infty} b_k Z_{n-k}, \tag{13.32}
$$

where

$$
a_0 := 1, \quad a_k := 0, \; k < 0 \text{ and } k > p,
$$

and

$$
b_0 := 1, \quad b_k := 0, \; k < 0 \text{ and } k > q.
$$

Taking z transforms of (13.32) yields

$$
A(z)X(z) = B(z)Z(z), \quad\text{or}\quad X(z) = \frac{B(z)}{A(z)}\, Z(z),
$$

where

$$
A(z) := 1 + a_1 z^{-1} + \cdots + a_p z^{-p} \quad\text{and}\quad B(z) := 1 + b_1 z^{-1} + \cdots + b_q z^{-q}.
$$

This suggests that if $h_n$ has z transform $H(z) := B(z)/A(z)$, and if

$$
X_n := \sum_k h_k Z_{n-k} = \sum_k h_{n-k} Z_k, \tag{13.33}
$$

then (13.32) holds. This is indeed the case, as can be seen by writing

$$
\begin{aligned}
\sum_i a_i X_{n-i} &= \sum_i a_i \sum_k h_{n-i-k} Z_k \\
&= \sum_k \Big( \sum_i a_i h_{n-k-i} \Big) Z_k \\
&= \sum_k b_{n-k} Z_k,
\end{aligned}
$$
since $A(z)H(z) = B(z)$.

The "catch" in the preceding argument is to make sure that the infinite sums in (13.33) are well defined. If $h_n$ is causal ($h_n = 0$ for $n < 0$), and if $h_n$ is stable ($\sum_n |h_n| < \infty$), then (13.33) holds in $L^2$, $L^1$, and almost surely (recall Example 10.7 and Problems 17 and 32 in Chapter 11). Hence, it remains to prove the key result of this section, that if $A(z)$ has all roots strictly inside the unit circle, then $h_n$ is causal and stable.

To begin the proof, observe that since $A(z)$ has all its roots inside the unit circle, the polynomial $\alpha(z) := A(1/z)$ has all its roots strictly outside the unit circle. Hence, for small enough $\delta > 0$, $1/\alpha(z)$ has the power series expansion

$$
\frac{1}{\alpha(z)} = \sum_{n=0}^{\infty} \alpha_n z^n, \quad |z| < 1 + \delta,
$$

for unique coefficients $\alpha_n$. In particular, this series converges for $z = 1 + \delta/2$. Since the terms of a convergent series go to zero, we must have $\alpha_n (1+\delta/2)^n \to 0$. Since a convergent sequence is bounded, there is some finite $M$ for which $|\alpha_n (1+\delta/2)^n| \le M$, or $|\alpha_n| \le M(1+\delta/2)^{-n}$, which is summable by the geometric series. Thus, $\sum_n |\alpha_n| < \infty$. Now write

$$
H(z) = \frac{B(z)}{A(z)} = \frac{B(z)}{\alpha(1/z)} = B(z)\, \frac{1}{\alpha(1/z)},
$$

or

$$
H(z) = B(z) \sum_{n=0}^{\infty} \alpha_n z^{-n} = \sum_{n=-\infty}^{\infty} h_n z^{-n},
$$

where $h_n$ is given by the convolution

$$
h_n = \sum_{k=-\infty}^{\infty} \alpha_k b_{n-k}.
$$

Since $\alpha_n$ and $b_n$ are causal, so is their convolution $h_n$. Furthermore, for $n \ge 0$,

$$
h_n = \sum_{k=\max(0,\,n-q)}^{n} \alpha_k b_{n-k}. \tag{13.34}
$$

In Problem 19, you are asked to show that $\sum_n |h_n| < \infty$. In Problem 20, you are asked to show that (13.33) is the unique solution of (13.32).
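The construction of $h_n$ in (13.34) can be sketched in a few lines of Python. This is an illustration under assumptions made here, not the text's own procedure: the power series coefficients of $1/A(1/z)$ are generated by the recursion that follows from multiplying the series by the polynomial $1 + a_1 z + \cdots + a_p z^p$ and matching powers of $z$. The function names and the ARMA(1,1) example coefficients are hypothetical.

```python
def inverse_series(a, n_terms):
    """First n_terms power series coefficients of 1/(1 + a[0] z + ... + a[p-1] z^p).

    Matching coefficients in (1 + a_1 z + ... + a_p z^p) * sum_n c_n z^n = 1
    gives c_0 = 1 and c_n = -sum_{i=1}^{min(n,p)} a_i c_{n-i}.
    """
    coeffs = [1.0]
    for n in range(1, n_terms):
        acc = 0.0
        for i in range(1, min(n, len(a)) + 1):
            acc += a[i - 1] * coeffs[n - i]
        coeffs.append(-acc)
    return coeffs

def impulse_response(a, b, n_terms):
    """h_n from (13.34): convolution of the inverse-series coefficients with b (b_0 = 1)."""
    coeffs = inverse_series(a, n_terms)
    b_full = [1.0] + list(b)
    q = len(b)
    return [sum(coeffs[k] * b_full[n - k] for k in range(max(0, n - q), n + 1))
            for n in range(n_terms)]

def max_residual(a, b, h):
    """Largest |sum_i a_i h_{n-i} - b_n| over the computed range (a_0 = b_0 = 1)."""
    a_full = [1.0] + list(a)
    b_full = [1.0] + list(b)
    worst = 0.0
    for n in range(len(h)):
        lhs = sum(a_full[i] * h[n - i] for i in range(min(n, len(a)) + 1))
        rhs = b_full[n] if n < len(b_full) else 0.0
        worst = max(worst, abs(lhs - rhs))
    return worst

if __name__ == "__main__":
    # Hypothetical ARMA(1,1): A(z) = 1 - 0.5 z^{-1} has its root at z = 0.5, inside the unit circle.
    h = impulse_response([-0.5], [0.4], 40)
    print(h[:4], sum(abs(x) for x in h), max_residual([-0.5], [0.4], h))
```

For this example the coefficients decay geometrically, so the partial sums of $|h_n|$ settle quickly, in line with the stability claim; `max_residual` checks the relation $A(z)H(z) = B(z)$ coefficient by coefficient.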
13.6. ARIMA Processes
Before defining ARIMA processes, we introduce the differencing filter, whose z transform is $1 - z^{-1}$. If the input to this filter is $X_n$, then the output is $X_n - X_{n-1}$.

A process $X_n$ is said to be an autoregressive integrated moving average (ARIMA) process if instead of $A(z)X(z) = B(z)Z(z)$, we have

$$
A(z)(1 - z^{-1})^d X(z) = B(z)Z(z), \tag{13.35}
$$

where $A(z)$ and $B(z)$ are defined as in the previous section. In this case, we say that $X_n$ is an ARIMA($p,d,q$) process. If we let $\tilde A(z) = A(z)(1 - z^{-1})^d$, it would seem that ARIMA($p,d,q$) is just a fancy name for ARMA($p+d,q$). While this is true when $d$ is a nonnegative integer, there are two problems. First, recall that the results of the previous section assume $A(z)$ has all roots strictly inside the unit circle, while $\tilde A(z)$ has a root at $z = 1$ repeated $d$ times. The second problem is that we will be focusing on fractional values of $d$, in which case $\tilde A(1/z)$ is no longer a polynomial, but an infinite power series in $z$.

Let us rewrite (13.35) as

$$
X(z) = (1 - z^{-1})^{-d}\, \frac{B(z)}{A(z)}\, Z(z) = H(z)\,G_d(z)\,Z(z),
$$

where $H(z) := B(z)/A(z)$ as in the previous section, and**

$$
G_d(z) := (1 - z^{-1})^{-d}.
$$

From the calculations following Example 13.2,

$$
G_d(z) = \sum_{n=0}^{\infty} g_n z^{-n},
$$

where $g_0 = 1$, and for $n \ge 1$,

$$
g_n = \frac{d(d+1)\cdots(d+[n-1])}{n!}.
$$

The plan then is to set

$$
Y_n := \sum_{k=0}^{\infty} g_k Z_{n-k}
$$

and then††

$$
X_n := \sum_{k=0}^{\infty} h_k Y_{n-k}.
$$

Note that the power spectral density of $Y$ is‡‡

$$
S_Y(f) = |G_d(e^{j2\pi f})|^2 \sigma^2 = |1 - e^{-j2\pi f}|^{-2d} \sigma^2 = [4\sin^2(\pi f)]^{-d} \sigma^2,
$$

using the result of Example 13.2. If $p = q = 0$, then $A(z) = B(z) = H(z) = 1$, $X_n = Y_n$, and we see that the process of Example 13.2 is ARIMA($0,d,0$).

Now, the problem with the above plan is that we have to make sure that $Y_n$ is well defined. To analyze the situation, we need to know how fast the $g_n$ decay. To this end, observe that

$$
\begin{aligned}
\Gamma(d+n) &= (d+[n-1])\,\Gamma(d+[n-1]) \\
&\;\;\vdots \\
&= (d+[n-1]) \cdots (d+1)\,\Gamma(d+1).
\end{aligned}
$$

Hence,

$$
g_n = \frac{d\,\Gamma(d+n)}{\Gamma(d+1)\,\Gamma(n+1)}.
$$

Now apply Stirling's formula,*

$$
\Gamma(x) \sim \sqrt{2\pi}\, x^{x-1/2} e^{-x},
$$

to the gamma functions that involve $n$. This yields

$$
g_n \sim \frac{d\,e^{1-d}}{\Gamma(d+1)} \left(1 + \frac{d-1}{n+1}\right)^{n+1/2} (n+d)^{d-1}.
$$

Since

$$
\left(1 + \frac{d-1}{n+1}\right)^{n+1/2} = \left(1 + \frac{d-1}{n+1}\right)^{n+1} \left(1 + \frac{d-1}{n+1}\right)^{-1/2} \to e^{d-1},
$$

and since $(n+d)^{d-1} \sim n^{d-1}$, we see that

$$
g_n \sim \frac{d}{\Gamma(d+1)}\, n^{d-1},
$$

as in Hosking [23, Theorem 1(a)]. For $0 < d < 1/2$, $-1 < d-1 < -1/2$, and we see that the $g_n$ are not absolutely summable. However, since $-2 < 2d-2 < -1$, they are square summable. Hence, $Y_n$ is well defined as a limit in mean square by Problem 26 in Chapter 11. The sum defining $X_n$ is well defined in $L^2$, $L^1$, and almost surely by Example 10.7 and Problems 17 and 32 in Chapter 11. Since $X_n$ is the result of filtering the long-range dependent process $Y_n$ with the stable impulse response $h_n$, $X_n$ is still long-range dependent as pointed out in Section 13.4.

** Since $1 - z^{-1}$ is a differencing filter, $(1 - z^{-1})^{-1}$ is a summing or integrating filter. For noninteger values of $d$, $(1 - z^{-1})^{-d}$ is called a fractional integrating filter. The corresponding process is sometimes called a fractional ARIMA process (FARIMA).

†† Recall that $h_n$ is given by (13.34).

‡‡ We are appealing to the discrete-time version of (6.6).

* We derived Stirling's formula for $\Gamma(n) = (n-1)!$ in Example 4.14. A proof for noninteger $x$ can be found in [6, pp. 300–301].
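The decay rate of the $g_n$ can be observed numerically. The sketch below is illustrative only: it generates $g_n$ by the recursion $g_n = g_{n-1}(d+n-1)/n$, which follows directly from the product formula for $g_n$, and compares it with the gamma-function form and with the asymptote $n^{d-1}/\Gamma(d)$ (note that $d/\Gamma(d+1) = 1/\Gamma(d)$). The function names and the value $d = 0.3$ are arbitrary choices.

```python
import math

def frac_int_coeffs(d, n_terms):
    """g_0, ..., g_{n_terms-1} with g_0 = 1 and g_n = g_{n-1} (d + n - 1) / n."""
    g = [1.0]
    for n in range(1, n_terms):
        g.append(g[-1] * (d + n - 1) / n)
    return g

def gamma_form(d, n):
    """g_n = Gamma(n + d) / (Gamma(d) Gamma(n + 1)), via log-gamma to avoid overflow."""
    return math.exp(math.lgamma(n + d) - math.lgamma(d) - math.lgamma(n + 1))

def asymptote(d, n):
    """Large-n approximation n^(d-1) / Gamma(d)."""
    return n ** (d - 1.0) / math.gamma(d)

if __name__ == "__main__":
    d = 0.3  # any 0 < d < 1/2
    g = frac_int_coeffs(d, 10001)
    for n in (10, 100, 1000, 10000):
        print(n, g[n], gamma_form(d, n), g[n] / asymptote(d, n))
    # Partial sums of g_n keep growing, while partial sums of g_n^2 level off.
    print(sum(g), sum(x * x for x in g))
```

The last line illustrates the distinction drawn in the text: for $0 < d < 1/2$ the coefficients are square summable but not absolutely summable.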
13.7. Problems
Problems §13.1: Self Similarity in Continuous Time

1. Show that for a self-similar process, $W_t$ converges in distribution to the zero random variable as $t \to 0$. Next, identify $\lim_{t\to\infty} t^H W_1(\omega)$ as a function of $\omega$, and find the probability mass function of the limit in terms of $F_{W_1}(x)$.

2. Use joint characteristic functions to show that the Wiener process is self similar with Hurst parameter $H = 1/2$.

3. Use joint characteristic functions to show that the Wiener process has stationary increments.

4. Show that for $H = 1/2$,
$$
B_H(t) - B_H(s) = \int_s^t dW_\tau = W_t - W_s.
$$
Taking $t > s = 0$ shows that $B_H(t) = W_t$, while taking $s < t = 0$ shows that $B_H(s) = W_s$. Thus, $B_{1/2}(t) = W_t$ for all $t$.

5. Show that for $0 < H < 1$,
$$
\int_0^\infty \left[ (1+\theta)^{H-1/2} - \theta^{H-1/2} \right]^2 d\theta < \infty.
$$

Problems §13.2: Self Similarity in Discrete Time

6. Let $X_n$ be a second-order self-similar process with mean $\mu = E[X_n]$, variance $\sigma^2 = E[(X_n - \mu)^2]$, and Hurst parameter $H$. Then the sample mean
$$
M_n := \frac{1}{n} \sum_{i=1}^n X_i
$$
has expectation $\mu$ and, by (13.7), variance $\sigma^2/n^{2-2H}$. If $X_n$ is a Gaussian sequence,
$$
\frac{M_n - \mu}{\sigma/n^{1-H}} \sim N(0,1),
$$
and so given a confidence level $1 - \alpha$, we can choose $y$ (e.g., by Table 12.1) such that
$$
\mathsf{P}\left( \left| \frac{M_n - \mu}{\sigma/n^{1-H}} \right| \le y \right) = 1 - \alpha.
$$
For $1/2 < H < 1$, show that the width of the corresponding confidence interval is wider by a factor of $n^{H-1/2}$ than the confidence interval obtained if the $X_n$ had been independent as in Section 12.2.
7. Use (13.10) and induction on $n$ to show that (13.6) implies $E[Y_n^2] = \sigma^2 n^{2H}$.

8. Suppose that $X_k$ is wide-sense stationary.
(a) Show that the process $\tilde X^{(m)}$ defined in (13.13) is also wide-sense stationary.
(b) If $X_k$ is second-order self similar, prove (13.12) for the case $n = 0$.

Problems §13.3: Asymptotic Second-Order Self Similarity

9. Show that asymptotic second-order self similarity (13.19) implies (13.18). Hint: Observe that $C^{(n)}(0) = E[(X_1^{(n)} - \mu)^2]$.

10. Show that (13.18) implies asymptotic second-order self similarity (13.19). Hint: Use (13.14), and note that (13.18) is equivalent to $E[Y_n^2]/n^{2H} \to \sigma_\infty^2$.

11. Show that a process is asymptotically second-order self similar; i.e., (13.19) holds, if and only if the conditions
$$
\lim_{m\to\infty} \rho^{(m)}(n) = \tfrac{1}{2}\left[ |n+1|^{2H} - 2|n|^{2H} + |n-1|^{2H} \right]
$$
and
$$
\lim_{m\to\infty} \frac{C^{(m)}(0)}{m^{2H-2}} = \sigma_\infty^2
$$
both hold.

12. Show that the covariance function corresponding to the power spectral density of Example 13.2 is
$$
C(n) = \frac{(-1)^n\, \Gamma(1-2d)}{\Gamma(n+1-d)\, \Gamma(1-d-n)}.
$$
This result is due to Hosking [23, Theorem 1(d)]. Hints: First show that
$$
C(n) = \frac{1}{\pi} \int_0^{\pi} [4\sin^2(\nu/2)]^{-d} \cos(n\nu)\, d\nu.
$$
Second, use the change-of-variable $\theta = 2\pi - \nu$ to show that
$$
\frac{1}{\pi} \int_{\pi}^{2\pi} [4\sin^2(\nu/2)]^{-d} \cos(n\nu)\, d\nu = C(n).
$$
Third, use the formula [16, p. 372]
$$
\int_0^{\pi} \sin^{p-1}(t) \cos(at)\, dt = \frac{\pi \cos(a\pi/2)\, \Gamma(p+1)\, 2^{1-p}}{p\, \Gamma\!\left(\frac{p+a+1}{2}\right) \Gamma\!\left(\frac{p-a+1}{2}\right)}.
$$
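13. Prove that $K_n(\alpha) \to K(\alpha)$ as follows. Put
$$
J_n(\alpha) := n^{\alpha} \int_{-1/2}^{1/2} \frac{1}{|f|^{1-\alpha}} \left[\frac{\sin(\pi nf)}{\pi nf}\right]^2 df.
$$
First show that
$$
J_n(\alpha) \to \pi^{-\alpha} \int_{-\infty}^{\infty} \frac{1}{|\theta|^{1-\alpha}} \left[\frac{\sin\theta}{\theta}\right]^2 d\theta.
$$
Then show that $K_n(\alpha) - J_n(\alpha) \to 0$; take care to write this difference as an integral, and to break up the range of integration into $[0,\delta]$ and $[\delta,1/2]$, where $0 < \delta < 1/2$ is chosen small enough that when $|f| < \delta$,
$$
\left| \left[\frac{\pi f}{\sin(\pi f)}\right]^2 - 1 \right| < \varepsilon.
$$

14. Show that for $0 < \alpha < 1$,
$$
\int_0^{\infty} \theta^{\alpha-3} \sin^2\theta\, d\theta = \frac{2^{1-\alpha} \cos(\alpha\pi/2)\, \Gamma(\alpha)}{(1-\alpha)(2-\alpha)}.
$$
Remark. The formula actually holds for complex $\alpha$ with $0 < \operatorname{Re}\alpha < 2$ [16, p. 447].
Hints: (i) Fix $0 < \varepsilon < r < \infty$, and apply integration by parts to
$$
\int_{\varepsilon}^{r} \theta^{\alpha-3} \sin^2\theta\, d\theta
$$
with $u = \sin^2\theta$ and $dv = \theta^{\alpha-3}\, d\theta$.
(ii) Apply integration by parts to the integral
$$
\int t^{\alpha-2} \sin t\, dt
$$
with $u = \sin t$ and $dv = t^{\alpha-2}\, dt$.
(iii) Use the fact that for $0 < \alpha < 1$,†
$$
\lim_{\substack{r\to\infty \\ \varepsilon\to 0}} \int_{\varepsilon}^{r} t^{\alpha-1} e^{-jt}\, dt = e^{-j\alpha\pi/2}\, \Gamma(\alpha).
$$

† For $s > 0$, a change of variable shows that
$$
\lim_{\substack{r\to\infty \\ \varepsilon\to 0}} \int_{\varepsilon}^{r} t^{\alpha-1} e^{-st}\, dt = \frac{\Gamma(\alpha)}{s^{\alpha}}.
$$
As in Notes 5 and 6 in Chapter 3, a permanence of form argument allows us to set $s = j = e^{j\pi/2}$.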
15. Let $S(f)$ be given by (13.15). For $1/2 < H < 1$, put $\alpha = 2 - 2H$.
(a) Evaluate the limit in (13.20). Hint: You may use the fact that
$$
\sum_{i=1}^{\infty} \frac{1}{|i+f|^{2H+1}}
$$
converges uniformly for $|f| \le 1/2$ and is therefore a continuous and bounded function.
(b) Evaluate
$$
\int_{-1/2}^{1/2} S(f)\, df.
$$
Hint: The above integral is equal to $C(0) = \sigma^2$. Since (13.15) corresponds to a second-order self-similar process, not just an asymptotically second-order self-similar process, $\sigma^2 = \sigma_\infty^2$. Now apply (13.21).

Problems §13.4: Long-Range Dependence

16. Show directly that if a wide-sense stationary sequence has the covariance function $C(n)$ given in Problem 12, then the process is long-range dependent; i.e., (13.25) holds with appropriate values of $\alpha$ and $c$ [23, Theorem 1(d)]. Hints: Use the Remark following Problem 11 in Chapter 3, Stirling's formula,
$$
\Gamma(x) \sim \sqrt{2\pi}\, x^{x-1/2} e^{-x},
$$
and the formula $(1 + d/n)^n \to e^d$.

17. For $0 < \alpha < 1$, show that
$$
\lim_{n\to\infty} \frac{\sum_{\nu=k}^{n-1} \nu^{-\alpha}}{n^{1-\alpha}} = \frac{1}{1-\alpha}.
$$
Hints: Rewrite (13.30) in the form
$$
B_n + n^{-\alpha} - k^{-\alpha} \le I_n \le B_n.
$$
Then
$$
1 \le \frac{B_n}{I_n} \le 1 + \frac{k^{-\alpha} - n^{-\alpha}}{I_n}.
$$
Show that $I_n/n^{1-\alpha} \to 1/(1-\alpha)$, and note that this implies $I_n/n^{-\alpha} \to \infty$.

18. For $0 < \alpha < 1$, show that
$$
\lim_{n\to\infty} \frac{\sum_{\nu=k}^{n-1} \nu^{1-\alpha}}{n^{2-\alpha}} = \frac{1}{2-\alpha}.
$$
Problems §13.5: ARMA Processes

19. Use the bound $|\alpha_n| \le M(1+\delta/2)^{-n}$ to show that $\sum_n |h_n| < \infty$, where
$$
h_n = \sum_{k=\max(0,\,n-q)}^{n} \alpha_k b_{n-k}.
$$

20. Assume (13.32) holds and that $A(z)$ has all roots strictly inside the unit circle. Show that (13.33) must hold. Hint: Compute the convolution
$$
\sum_n \alpha_{m-n} Y_n
$$
first for $Y_n$ replaced by the left-hand side of (13.32) and again for $Y_n$ replaced by the right-hand side of (13.32).

21. Let $\sum_{k=0}^{\infty} |h_k| < \infty$. Show that if $X_n$ is WSS, then $Y_n = \sum_{k=0}^{\infty} h_k X_{n-k}$ and $X_n$ are J-WSS.
Bibliography

[1] M. Abramowitz and I. A. Stegun, eds., Handbook of Mathematical Functions, with Formulas, Graphs, and Mathematical Tables. New York: Dover, 1970.
[2] J. Beran, Statistics for Long-Memory Processes. New York: Chapman & Hall, 1994.
[3] P. Billingsley, Probability and Measure. New York: Wiley, 1979.
[4] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.
[5] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods. New York: Springer-Verlag, 1987.
[6] R. C. Buck, Advanced Calculus, 3rd ed. New York: McGraw-Hill, 1978.
[7] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations," Ann. Math. Statist., vol. 23, pp. 493–507, 1952.
[8] Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales, 2nd ed. New York: Springer, 1988.
[9] R. V. Churchill, J. W. Brown, and R. F. Verhey, Complex Variables and Applications, 3rd ed. New York: McGraw-Hill, 1976.
[10] J. W. Craig, "A new, simple and exact result for calculating the probability of error for two-dimensional signal constellations," in Proc. IEEE Milit. Commun. Conf. MILCOM '91, McLean, VA, Oct. 1991, pp. 571–575.
[11] H. Cramér, "Sur un nouveau théorème-limite de la théorie des probabilités," Actualités Scientifiques et Industrielles 736, pp. 5–23. Colloque consacré à la théorie des probabilités, vol. 3, Hermann, Paris, Oct. 1937.
[12] M. Crovella and A. Bestavros, "Self-similarity in World Wide Web traffic: Evidence and possible causes," Perf. Eval. Rev., vol. 24, pp. 160–169, 1996.
[13] M. H. A. Davis, Linear Estimation and Stochastic Control. London: Chapman and Hall, 1977.
[14] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd ed. New York: Wiley, 1968.
[15] L. Gonick and W. Smith, The Cartoon Guide to Statistics. New York: HarperPerennial, 1993.
[16] I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products. Orlando, FL: Academic Press, 1980.
[17] G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes, 3rd ed. Oxford: Oxford University Press, 2001.
[18] S. Haykin and B. Van Veen, Signals and Systems. New York: Wiley, 1999.
[19] C. W. Helstrom, Statistical Theory of Signal Detection, 2nd ed. Oxford: Pergamon, 1968.
[20] C. W. Helstrom, Probability and Stochastic Processes for Engineers, 2nd ed. New York: Macmillan, 1991.
[21] P. G. Hoel, S. C. Port, and C. J. Stone, Introduction to Probability Theory. Boston: Houghton Mifflin, 1971.
[22] K. Hoffman and R. Kunze, Linear Algebra, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[23] J. R. M. Hosking, "Fractional differencing," Biometrika, vol. 68, no. 1, pp. 165–176, 1981.
[24] S. Karlin and H. M. Taylor, A First Course in Stochastic Processes, 2nd ed. New York: Academic Press, 1975.
[25] S. Karlin and H. M. Taylor, A Second Course in Stochastic Processes. New York: Academic Press, 1981.
[26] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, "On the self-similar nature of Ethernet traffic (extended version)," IEEE/ACM Trans. Networking, vol. 2, no. 1, pp. 1–15, Feb. 1994.
[27] A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, 2nd ed. Reading, MA: Addison-Wesley, 1994.
[28] N. Likhanov, "Bounds on the buffer occupancy probability with self-similar traffic," pp. 193–213, in Self-Similar Network Traffic and Performance Evaluation, K. Park and W. Willinger, Eds. New York: Wiley, 2000.
[29] A. O'Hagan, Probability: Methods and Measurement. London: Chapman and Hall, 1988.
[30] A. V. Oppenheim and A. S. Willsky, with S. H. Nawab, Signals & Systems, 2nd ed. Upper Saddle River, NJ: Prentice Hall, 1997.
[31] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd ed. New York: McGraw-Hill, 1984.
[32] V. Paxson and S. Floyd, "Wide-area traffic: The failure of Poisson modeling," IEEE/ACM Trans. Networking, vol. 3, pp. 226–244, 1995.
[33] B. Picinbono, Random Signals and Systems. Englewood Cliffs, NJ: Prentice Hall, 1993.
[34] S. Ross, A First Course in Probability, 3rd ed. New York: Macmillan, 1988.
[35] R. I. Rothenberg, Probability and Statistics. San Diego: Harcourt Brace Jovanovich, 1991.
[36] G. Roussas, A First Course in Mathematical Statistics. Reading, MA: Addison-Wesley, 1973.
[37] H. L. Royden, Real Analysis, 2nd ed. New York: MacMillan, 1968.
[38] W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York: McGraw-Hill, 1976.
[39] G. Samorodnitsky and M. S. Taqqu, Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. New York: Chapman & Hall, 1994.
[40] A. N. Shiryayev, Probability. New York: Springer, 1984.
[41] M. K. Simon, "A new twist on the Marcum Q function and its application," IEEE Commun. Lett., vol. 2, no. 2, pp. 39–41, Feb. 1998.
[42] M. K. Simon and D. Divsalar, "Some new twists to problems involving the Gaussian probability integral," IEEE Trans. Commun., vol. 46, no. 2, pp. 200–210, Feb. 1998.
[43] M. K. Simon and M.-S. Alouini, "A unified approach to the performance analysis of digital communication over generalized fading channels," Proc. IEEE, vol. 86, no. 9, pp. 1860–1877, Sep. 1998.
[44] Ya. G. Sinai, "Self-similar probability distributions," Theory of Probability and its Applications, vol. XXI, no. 1, pp. 64–80, 1976.
[45] J. L. Snell, Introduction to Probability. New York: Random House, 1988.
[46] E. W. Stacy, "A generalization of the gamma distribution," Ann. Math. Stat., vol. 33, no. 3, pp. 1187–1192, Sept. 1962.
[47] H. Stark and J. W. Woods, Probability and Random Processes with Applications to Signal Processing, 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2002.
[48] B. Tsybakov and N. D. Georganas, "Self-similar processes in communication networks," IEEE Trans. Inform. Theory, vol. 44, no. 5, pp. 1713–1725, Sept. 1998.
[49] Y. Viniotis, Probability and Random Processes for Electrical Engineers. Boston: WCB/McGraw-Hill, 1998.
[50] E. Wong and B. Hajek, Stochastic Processes in Engineering Systems. New York: Springer, 1985.
[51] R. D. Yates and D. J. Goodman, Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers. New York: Wiley, 1999.
[52] R. E. Ziemer, Elements of Engineering Probability and Statistics. Upper Saddle River, NJ: Prentice Hall, 1997.
Index
A
absorbing state, 283 ane estimator, 230 ane function, 119, 230 aggregated process, 372 almost sure convergence, 330 arcsine random variable, 151, 164 ARIMA, 376 fractional, 385 ARMA process, 383 arrival times, 252 associative laws, 3 asymptotic second-order self similarity, 376, 388 auto-correlation function, 195 autoregressive integrated moving average see ARIMA, 385 autoregressive moving average process see ARMA, 383 autoregressive process | see AR process, 383
C
cardinality of a set, 6 Cartesian product, 155 Cauchy random variable, 87 cdf, 118 characteristic function, 113 nonexistence of mean, 93 special case of student's t, 109 Cauchy sequence of Lp random variables, 304 of real numbers, 304 Cauchy{Schwarz inequality, 189, 215, 304 causal Wiener lter, 203 cdf | cumulative distribution function, 117 central chi-squared random variable, 114 central limit theorem, 1, 117, 137, 262, 328 compared with weak law of large numbers, 328 central moment, 49 certain event, 12 change of variable (multivariate), 229, 235 Chapman{Kolmogorov equation B continuous time, 290 discrete time, 284 Banach space, 304 discrete-time derivation, 287 Bayes' rule, 20, 21 for Markov processes, 297 Bernoulli random variable, 38 characteristic function mean, 45 bivariate, 163 probability generating function, 52 multivariate (joint), 225 second moment and variance, 49 univariate, 97 Bessel function, 146 Chebyshev's inequality, 60, 100, 115 properties, 147 used to derive the weak law, 61 beta function, 108, 109, 110 Cherno bound, 100, 101, 115 beta random variable, 108 chi-squared random variable, 107 relation to chi-squared random variable, 182 as squared zero-mean Gaussian, 112, 122, 143 betting on fair games, 78 characteristic function, 112 binomial | approximation by Poisson, 57, 340 moment generating function, 112 binomial coecient, 54 related to generalized gamma, 145 binomial random variable, 53 relation to F random variable, 182 mean, variance, and pgf, 76 relation to beta random variable, 182 binomial theorem, 54, 377 see also noncentral chi-squared, 112 birth process | see Markov chain, 283 square root of = Rayleigh, 144 birth{death process | see Markov chain, 283 circularly symmetric complex Gaussian, 238 birthday problem, 9 closed set, 310 bivariate characteristic function, 163 CLT | central limit theorem, 137 bivariate Gaussian random variables, 168 combination, 56 Borel{Cantelli lemma, 28, 332 commutative laws, 3 Borel set, 72 complement of a 
set, 2 Borel sets of IR2 , 177 complementary cdf, 145 Borel - eld, 72 complementary error function, 123 Brownian motion fractional | see fractionalBrownian motion, 368 completeness of the Lp spaces, 304 ordinary | see Wiener process, 257 397
398
Index
of the real numbers, 304 complex conjugate, 237 complex Gaussian random vector, 238 complex random variable, 236 complex random vector, 237 conditional cdf, 122 conditional density, 162 conditional expectation abstract de nition, 312 for discrete random variables, 69 for jointly continuous random variables, 164 conditional independence, 31, 279 conditional probability, 19 conditional probability mass functions, 62 con dence interval, 347 con dence level, 347 conservative Markov chain, 292 consistency condition continuous-time processes, 269 discrete-time processes, 267 consistent estimator, 346 continuous random variable, 85 arcsine, 151, 164 beta, 108 Cauchy, 87 chi-squared, 107 Erlang, 107 exponential, 86 F , 182 gamma, 106 Gaussian = normal, 88 generalized gamma, 145 Laplace, 87 Maxwell, 143 multivariate Gaussian, 226 Nakagami, 144 noncentral chi-squared, 114 noncentral Rayleigh, 146 Pareto, 149 Rayleigh, 111 Rice, 146 student's t, 109 uniform, 85 Weibull, 105 continuous sample paths, 257 convergence almost sure (a.s.), 330 in distribution, 325 in mean of order p, 299 in mean square, 299 in probability, 324, 346 in quadratic mean, 299 sure, 330 weak, 325 convex function, 80 convolution, 67 convolution of densities, 99
convolution and LTI systems, 196 of probability mass functions, 67 correlation coecient, 81, 170 correlation function, 188, 195 engineering de nition, 374 statistics/networking de nition, 375 countable subadditivity, 15 counting process, 249 covariance, 223 distinction between scalar and matrix, 224 function, 189 matrix, 223 Craig's formula, 180 cross power spectral density, 197 cross-correlation function, 195 cross-covariance function, 195 cumulative distribution function (cdf), 117 continuous random variable, 118 discrete random variable, 126 joint, 157 properties, 134 cyclostationary process, 210
D
delta function, 193 Dirac, 129 Kronecker, 284 DeMoivre{Laplace theorem, 352 DeMorgan's laws, 4 generalized, 5 dierence of sets, 3 dierencing lter, 384 dierential entropy, 110 Dirac delta function, 129 discrete random variable, 36 Bernoulli, 38 binomial, 53 geometric, 41 hypergeometric, 353 negative binomial = Pascal, 80 Poisson, 42 uniform, 37 disjoint sets, 3 distributive laws, 4 generalized, 5 double-sided exponential = Laplace, 87
E
eigenvalue, 286 eigenvector, 286 empty set, 2 ensemble mean, 345 entropy, 79 dierential, 110 equilibrium distribution, 285
July 19, 2002
Index
399
ergodic theorems, 209 Erlang random variable, 107 as sum of i.i.d. exponentials, 113 cumulative distribution function, 107 moment generating function, 111, 112 related to generalized gamma, 145 error function, 123 complementary, 123 estimate of the value of a random variable, 230 estimator of a random variable, 230 event, 6, 22, 337 expectation linearity for arbitrary random variables, 99 linearity for discrete random variables, 48 monotonicity for arbitrary random variables, 99 monotonicity for discrete random variables, 80 of a discrete random variable, 44 of an arbitrary random variable, 91 when it is unde ned, 46, 93 exponential random variable, 86 double sided = Laplace, 87 memoryless property, 104 moment generating function, 95 moments, 95
F
random variable, 182 factorial function, 106 factorial moment, 51 failure rate, 124 constant, 125 Erlang, 149 Pareto, 149 Weibull, 149 FARIMA | fractional ARIMA, 385 ltered Poisson process, 256 Fourier series as characteristic function, 98 as power spectral density, 217 Fourier transform as bivariate characteristic function, 163 as multivariate characteristic function, 225 as univariate characteristic function, 97 inversion formula, 191, 214 multivariate inversion formula, 229 of correlation function, 191 table, 214 fractional ARIMA process, 385 fractional Brownian motion, 368 fractional Gaussian noise, 370 fractional integrating lter, 385 F
G
gambler's ruin, 284 gamma function, 106 gamma random variable, 106
characteristic function, 97, 112 generalized, 145 moment generating function, 112 moments, 111 with scale parameter, 107 Gaussian pulse, 97 Gaussian random process, 259 fractional, 369 Gaussian random variable, 88 ccdf approximation, 145 cdf, 119 cdf related to error function, 123 characteristic function, 97 complex, 238 complex circularly symmetric, 238 Craig's formula for ccdf, 180 moment generating function, 96 moments, 93 Gaussian random vector, 226 characteristic function, 227 complex circularly symmetric, 238 joint density, 229 multivariate moments, Wick's theorem, 241 generalized gamma random variable, 145 generator matrix, 293 geometric random variable, 41 mean, variance, and pgf, 76 memoryless property, 75 geometric series, 26
H
H -sssi, 367
Herglotz's theorem, 313 Hilbert space, 304 Holder's inequality, 317 Hurst parameter, 365 hypergeometric random variable, 353 derivation, 360
I
ideal gas, 1, 144 identically distributed random variables, 39 impossible event, 12 impulse function, 129 inclusion-exclusion formula, 14, 158 increment process, 367 increments of a random process, 250 independent events more than two events, 16 pairwise, 17, 22 two events, 16 independent identically distributed (i.i.d.), 39 independent increments, 250 independent random variables, 38 example where uncorrelated does not imply independence, 80
July 19, 2002
400
Index
jointly continuous, 163 more than two, 39 indicator function, 36 inner product, 241, 304 inner-product space, 304 integrating lter, 385 intensity of a Poisson process, 250 interarrival times, 252 intersection of sets, 2 inverse Fourier transform, 214
limit properties of }, 15 linear estimators, 230, 309 linear time-invariant system, 195 long-range dependence, 380 LOTUS | law of the unconscious statistician, 47 LRD | long-range dependence, 380 LTI | linear time-invariant (system), 195 Lyapunov's inequality, 333 derived from Holder's inequality, 317 derived from Jensen's inequality, 80
J
M
J-WSS | jointly wide-sense stationary, 195 Jacobian, 235 Jacobian formulas, 235 Jensen's inequality, 80 joint characteristic function, 225 joint cumulative distribution function, 157 joint density, 158, 172 joint probability mass function, 43 jointly continuous random variables bivariate, 158 multivariate, 172 jointly wide-sense stationary, 195 jump times of a Poisson process, 251
K
Kolmogorov, 1
Kolmogorov's backward equation, 292
Kolmogorov's consistency theorem, 267
Kolmogorov's extension theorem, 267
Kolmogorov's forward equation, 291
Kronecker delta, 284
L
Laplace random variable, 87
  variance and moment generating function, 111
Laplace transform, 95
law of large numbers
  mean square, for 2nd-order self-similar sequences, 370
  mean square, uncorrelated, 300
  mean square, wide-sense stationary sequences, 301
  strong, 333
  weak, for independent random variables, 333
  weak, for uncorrelated random variables, 61, 324
law of the unconscious statistician, 47
law of total conditional probability, 287, 297
law of total probability, 19, 21, 164, 175, 181
  discrete conditioned on continuous, 275
  for continuous random variables, 166
  for discrete random variables, 64
  for expectation (discrete random variables), 70
M
MA process, 383
Marcum Q function, 147, 180
marginal cumulative distributions, 157
marginal density, 160
marginal probability, 156
marginal probability mass functions, 44
Markov chain
  absorbing barrier, 283
  birth–death process (discrete time), 283
  conservative, 292
  construction, 269
  continuous time, 289
  discrete time, 279
  equilibrium distribution, 285
  gambler's ruin, 284
  generator matrix, 293
  Kolmogorov's backward equation, 292
  Kolmogorov's forward equation, 291
  model for queue with finite buffer, 283
  model for queue with infinite buffer, 283, 296
  n-step transition matrix, 285
  n-step transition probabilities, 284
  pure birth process (discrete time), 283
  random walk (construction), 279
  random walk (continuous-time), 295
  random walk (definition), 282
  rate matrix, 293
  reflecting barrier, 283
  sojourn time, 296
  state space, 281
  state transition diagram, 281
  stationary distribution, 285
  symmetric random walk (construction), 280
  time homogeneous (continuous time), 290
  time-homogeneous (discrete time), 281
  transition probabilities (continuous time), 289
  transition probabilities (discrete time), 281
  transition probability matrix, 282
Markov process, 297
Markov's inequality, 60, 100, 115
Matlab commands
  besseli, 146
  chi2cdf, 146
  chi2inv, 356
  erf, 123
  erfc, 123
  erfinv, 348
  factorial, 78
  gamma, 106
  nchoosek, 53, 78
  ncx2cdf, 146
  normcdf, 119
  norminv, 348
  tinv, 354
matrix exponential, 294
matrix inverse formula, 247
Maxwell random variable
  as square root of chi-squared, 144
  cdf, 143
  related to generalized gamma, 145
  speed of particle in ideal gas, 144
mean, 45
mean function, 188
mean square convergence, 299
mean squared error, 76, 81, 201, 230
mean time to failure, 123
mean vector, 223
mean-square ergodic theorem
  for WSS processes, 209
  for WSS sequences, 301
mean-square law of large numbers
  for uncorrelated random variables, 300
  for wide-sense stationary sequences, 301
  for WSS processes, 209
median, 104
memoryless property
  exponential random variable, 104
  geometric random variable, 75
mgf: moment generating function, 94
minimum mean squared error, 230, 309
Minkowski's inequality, 304, 318
mixed random variable, 127
MMSE: minimum mean squared error, 230
modified Bessel function of the first kind, 146
  properties, 147
moment, 49
moment generating function, 101
moment generating function (mgf), 94
moment
  central, 49
  factorial, 51
monotonic sequence property, 316
monotonicity
  of E, 80, 99
  of P, 13
Mother Nature, 12
moving average process: see MA process, 383
MSE: mean squared error, 230
MTTF: mean time to failure, 123
multinomial coefficient, 77
multinomial theorem, 77
mutually exclusive sets, 3
mutually independent events, 17
N
Nakagami random variable, 144, 247
  as square root of chi-squared, 144
negative binomial random variable, 80
noncentral chi-squared random variable
  as squared non-zero-mean Gaussian, 112, 122, 143
  cdf (series form), 146
  density (closed form using Bessel function), 146
  density (series form), 114
  moment generating function, 112, 114
  noncentrality parameter, 112
  square root of = Rice, 146
noncentral Rayleigh random variable, 146
  square of = noncentral chi-squared, 146
noncentrality parameter, 112, 114
norm preserving, 314
norm
  L^p, 303
  matrix, 241
  vector, 225
normal random variable: see Gaussian, 88
null set, 2
O
occurrence times, 252
odds, 78
Ornstein–Uhlenbeck process, 274
orthogonal increments, 322
orthogonality principle
  general statement, 309
  in relation to conditional expectation, 233
  in the derivation of linear estimators, 231
  in the derivation of the Wiener filter, 202
P
pairwise disjoint sets, 5
pairwise independent events, 17
Paley–Wiener condition, 205
paradox of continuous random variables, 90
parallelogram law, 304, 320
Pareto failure rate, 149
Pareto random variable, 149
Parseval's equation, 206
Pascal random variable = negative binomial, 80
Pascal's triangle, 54
pdf: probability density function, 85
permanence of form argument, 104, 389
pgf: probability generating function, 50
pmf: probability mass function, 42
Poisson approximation of binomial, 57, 340
Poisson process, 250
  arrival times, 252
  as a Markov chain, 289
  filtered, 256
  independent increments, 250
  intensity, 250
  interarrival times, 252
  marked, 255
  occurrence times, 252
  rate, 250
  shot noise, 256
  thinned, 271
Poisson random variable, 42
  mean, 45
  mean, variance, and pgf, 51
  second moment and variance, 49
population mean, 345
positive definite, 225
positive semidefinite, 225
  function, 215
posterior probabilities, 21
power spectral density, 191
  for discrete-time processes, 217
  nonnegativity, 198, 207
power
  dissipated in a resistor, 191
  expected instantaneous, 191
  expected time-average, 206
probability density function (pdf), 85
  generalized, 129
  nonimpulsive, 129
  purely impulsive, 129
probability generating function (pgf), 50
probability mass function (pmf), 42
probability measure, 12, 266
probability space, 22
probability written as an expectation, 45
projection, 308
  onto the unit ball, 308
  theorem, 310, 311
Q
Q function
  Gaussian, 145
  Marcum, 147
quadratic mean convergence, 299
queue: see Markov chain, 283
R
random process, 185
random sum, 177
random variable
  complex-valued, 236
  continuous, 85
  definition, 33
  discrete, 36
  integer-valued, 36
  precise definition, 72
  traditional interpretation, 33
random variables
  identically distributed, 39
  independent, 38, 39
  uncorrelated, 58
random vector, 172
random walk
  approximation of the Wiener process, 262
  construction, 279
  definition, 282
  symmetric, 280
  with barrier at the origin, 283
rate matrix, 293
rate of a Poisson process, 250
Rayleigh random variable
  as square root of chi-squared, 144
  cdf, 142
  distance from origin, 87, 144
  generalized, 144
  moments, 111
  related to generalized gamma, 145
  square of = chi-squared, 144
reflecting state, 283
reliability function, 123
renewal equation, 257
  derivation, 272
renewal function, 257
renewal process, 256, 343
resistor, 191
Rice random variable, 146, 246
  square of = noncentral chi-squared, 146
Riemann sum, 196, 220
Riesz–Fischer theorem, 304, 310, 312
S
sample mean, 57, 345
sample path, 185
sample space, 5, 12
sample standard deviation, 349
sample variance, 349
  as unbiased estimator, 350
sampling with replacement, 352
sampling without replacement, 352, 360
scale parameter, 107, 145
second-order self similarity, 370
self similarity, 365
set difference, 3
shot noise, 256
σ-algebra, 22
σ-field, 22, 72, 177, 270
signal-to-noise ratio, 199
Simon's formula, 183
Skorohod representation theorem, 335
  derivation, 335
SLLN: strong law of large numbers, 333
slowly varying function, 381
Slutsky's theorem, 328, 360
smoothing property, 321
SNR: signal-to-noise ratio, 199
sojourn time, 296
spectral distribution, 313
spectral factorization, 205
spectral representation, 315
spontaneous generation, 283
square root of a nonnegative definite matrix, 240
standard deviation, 49
standard normal density, 88
state space of a Markov chain, 281
state transition diagram: see Markov chain, 281
stationary distribution, 285
stationary increments, 367
stationary random process, 189
  Markov chain example, 277
statistical independence, 16
Stirling's formula, 142, 340, 386, 390
  derivation, 142
stochastic process, 185
strictly stationary process, 189
  Markov chain example, 277
strong law of large numbers, 333
student's t, 109, 182
  cdf converges to normal cdf, 355
  density converges to normal density, 109
  generalization of Cauchy, 109
  moments, 110, 111
subset, 2
substitution law, 66, 70, 165, 175
sure event, 12
symmetric function, 189
symmetric matrix, 223, 239
T
t: see student's t, 109
thermal noise and the central limit theorem, 1
thinned Poisson process, 271
time-homogeneity: see Markov chain, 281
trace of a matrix, 225, 241
transition matrix: see Markov chain, 282
transition probability: see Markov chain, 281
triangle inequality
  for L^p random variables, 303
  for numbers, 303
U
unbiased estimator, 345
uncorrelated random variables, 58
  example that are not independent, 80
uniform random variable (continuous), 85
  cdf, 119
uniform random variable (discrete), 37
union bound, 15
  derivation, 27, 28
union of sets, 2
unit impulse, 129
unit-step function, 36, 206
V
variance, 49
Venn diagrams, 2
W
Wallis's formula, 109
weak law of large numbers, 1, 58, 61, 208, 324, 333
  compared with the central limit theorem, 328
Weibull failure rate, 149
Weibull random variable, 105, 143
  moments, 111
  related to generalized gamma, 145
white noise, 193
  infinite power, 193
whitening filter, 204
Wick's theorem, 241
wide-sense stationarity
  continuous time, 191
  discrete time, 217
Wiener filter, 203, 309
  causal, 203
Wiener integral, 261, 307
  normality, 339
Wiener process, 257
  approximation using random walk, 261
  as a Markov process, 297
  defined for negative and positive time, 274
  independent increments, 257
  normality, 277
  relation to Ornstein–Uhlenbeck process, 274
  self similarity, 365, 387
  standard, 257
  stationarity of its increments, 387
Wiener–Hopf equation, 203
Wiener–Khinchin theorem, 207
  alternative derivation, 212
WLLN: weak law of large numbers, 58, 61
WSS: wide-sense stationary, 191
Z
z transform, 383
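Continuous Random Variables
uniform[a, b]
$f_X(x) = \dfrac{1}{b-a}$ and $F_X(x) = \dfrac{x-a}{b-a}$, $a \le x \le b$.
$E[X] = \dfrac{a+b}{2}$, $\operatorname{var}(X) = \dfrac{(b-a)^2}{12}$, $M_X(s) = \dfrac{e^{sb} - e^{sa}}{s(b-a)}$.
exponential, exp($\lambda$)
$f_X(x) = \lambda e^{-\lambda x}$ and $F_X(x) = 1 - e^{-\lambda x}$, $x \ge 0$.
$E[X] = 1/\lambda$, $\operatorname{var}(X) = 1/\lambda^2$, $E[X^n] = n!/\lambda^n$.
$M_X(s) = \lambda/(\lambda - s)$, $\operatorname{Re} s < \lambda$.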
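Laplace($\lambda$)
$f_X(x) = \dfrac{\lambda}{2} e^{-\lambda |x|}$.
$E[X] = 0$, $\operatorname{var}(X) = 2/\lambda^2$. $M_X(s) = \lambda^2/(\lambda^2 - s^2)$, $-\lambda < \operatorname{Re} s < \lambda$.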
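Gaussian or normal, $N(m, \sigma^2)$
$f_X(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\dfrac{1}{2}\left(\dfrac{x-m}{\sigma}\right)^2\right]$.
$E[X] = m$, $\operatorname{var}(X) = \sigma^2$, $M_X(s) = e^{sm + s^2\sigma^2/2}$.
$F_X(x)$ = normcdf(x, m, sigma).
$E[(X-m)^{2n}] = 1 \cdot 3 \cdots (2n-3)(2n-1)\,\sigma^{2n}$.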
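As a rough numerical companion to the Gaussian entry, the sketch below evaluates the cdf through the error function and the even central moments through the odd-number product. It is written in Python rather than Matlab, and the function names `normal_cdf` and `even_central_moment` are illustrative assumptions, not names from the text.

```python
import math


def normal_cdf(x, m=0.0, sigma=1.0):
    """CDF of N(m, sigma^2), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))


def even_central_moment(n, sigma=1.0):
    """E[(X - m)^(2n)] = 1 * 3 * ... * (2n - 1) * sigma^(2n)."""
    odd_product = math.prod(range(1, 2 * n, 2))  # empty product is 1 when n = 0
    return odd_product * sigma ** (2 * n)
```

For instance, `even_central_moment(2, sigma)` reproduces the familiar fourth central moment $3\sigma^4$.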
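gamma($p, \lambda$)
$f_X(x) = \lambda \dfrac{(\lambda x)^{p-1} e^{-\lambda x}}{\Gamma(p)}$, $x > 0$,
where $\Gamma(p) := \int_0^\infty x^{p-1} e^{-x}\,dx$, $p > 0$.
Recall that $\Gamma(p) = (p-1) \cdot \Gamma(p-1)$, $p > 1$.
$F_X(x)$ = gamcdf(x, p, 1/lambda).
$E[X^n] = \dfrac{\Gamma(n+p)}{\lambda^n\,\Gamma(p)}$. $M_X(s) = \left(\dfrac{\lambda}{\lambda - s}\right)^p$, $\operatorname{Re} s < \lambda$.
Note that gamma($1, \lambda$) is the same as exp($\lambda$).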
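The moment formula for gamma($p, \lambda$) can be checked numerically with the standard-library gamma function. The following is a minimal Python sketch; the helper name `gamma_moment` is an assumption made for illustration.

```python
import math


def gamma_moment(n, p, lam):
    """E[X^n] for X ~ gamma(p, lam): Gamma(n + p) / (lam^n * Gamma(p))."""
    return math.gamma(n + p) / (lam ** n * math.gamma(p))


def gamma_variance(p, lam):
    """var(X) = E[X^2] - (E[X])^2, which simplifies to p / lam^2."""
    return gamma_moment(2, p, lam) - gamma_moment(1, p, lam) ** 2
```

With $p = 1$ the moments reduce to $n!/\lambda^n$, in agreement with the exp($\lambda$) entry.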
Erlang($m, \lambda$) := gamma($m, \lambda$), $m$ = integer
Since $\Gamma(m) = (m-1)!$,
$f_X(x) = \lambda \dfrac{(\lambda x)^{m-1} e^{-\lambda x}}{(m-1)!}$ and $F_X(x) = 1 - \sum_{k=0}^{m-1} \dfrac{(\lambda x)^k}{k!}\, e^{-\lambda x}$, $x > 0$.
Note that Erlang($1, \lambda$) is the same as exp($\lambda$).
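Because the Erlang cdf is a finite sum, it can be evaluated directly without special functions. A small Python sketch follows; the name `erlang_cdf` is hypothetical and the code assumes $m$ is a positive integer.

```python
import math


def erlang_cdf(x, m, lam):
    """F_X(x) = 1 - sum_{k=0}^{m-1} (lam*x)^k / k! * exp(-lam*x) for x > 0."""
    if x <= 0:
        return 0.0
    partial = sum((lam * x) ** k / math.factorial(k) for k in range(m))
    return 1.0 - partial * math.exp(-lam * x)
```

Setting `m = 1` leaves only the $k = 0$ term, so the function returns $1 - e^{-\lambda x}$, the exp($\lambda$) cdf.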
chi-squared($k$) := gamma($k/2, 1/2$)
If $k$ is an even integer, then chi-squared($k$) is the same as Erlang($k/2, 1/2$).
Since $\Gamma(1/2) = \sqrt{\pi}$, for $k = 1$, $f_X(x) = \dfrac{e^{-x/2}}{\sqrt{2\pi x}}$, $x > 0$.
Since $\Gamma\left(\dfrac{2m+1}{2}\right) = \dfrac{(2m-1) \cdots 5 \cdot 3 \cdot 1}{2^m} \sqrt{\pi}$, for $k = 2m+1$, $f_X(x) = \dfrac{x^{m-1/2} e^{-x/2}}{(2m-1) \cdots 5 \cdot 3 \cdot 1\, \sqrt{2\pi}}$, $x > 0$.
$F_X(x)$ = chi2cdf(x, k).
Note that chi-squared(2) is the same as exp(1/2).
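One way to sanity-check the closed form for odd $k$ is to compare it with the general gamma($k/2, 1/2$) density. The Python sketch below does this; `gamma_pdf` and `chi2_pdf_odd` are illustrative names, not from the text.

```python
import math


def gamma_pdf(x, p, lam):
    """Density of gamma(p, lam) for x > 0."""
    return lam * (lam * x) ** (p - 1) * math.exp(-lam * x) / math.gamma(p)


def chi2_pdf_odd(x, m):
    """Closed-form chi-squared density for k = 2m + 1 degrees of freedom, x > 0."""
    odd_product = math.prod(range(1, 2 * m, 2))  # (2m-1)...5*3*1, equal to 1 for m = 0
    return x ** (m - 0.5) * math.exp(-x / 2) / (odd_product * math.sqrt(2 * math.pi))
```

Evaluating both functions at the same $x$ with `p = m + 0.5` and `lam = 0.5` should give matching values up to rounding.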
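Rayleigh($\lambda$)
$f_X(x) = \dfrac{x}{\lambda^2} e^{-(x/\lambda)^2/2}$ and $F_X(x) = 1 - e^{-(x/\lambda)^2/2}$, $x \ge 0$.
$E[X] = \lambda\sqrt{\pi/2}$, $E[X^2] = 2\lambda^2$, $\operatorname{var}(X) = \lambda^2(2 - \pi/2)$.
$E[X^n] = 2^{n/2} \lambda^n\, \Gamma(1 + n/2)$.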
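Weibull($p, \lambda$)
$f_X(x) = \lambda p x^{p-1} e^{-\lambda x^p}$ and $F_X(x) = 1 - e^{-\lambda x^p}$, $x > 0$.
$E[X^n] = \dfrac{\Gamma(1 + n/p)}{\lambda^{n/p}}$.
Note that Weibull($2, \lambda$) is the same as Rayleigh($1/\sqrt{2\lambda}$) and that Weibull($1, \lambda$) is the same as exp($\lambda$).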
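The stated link between the Weibull and Rayleigh families can be verified by comparing cdfs directly. The following Python sketch assumes the parameterizations given in this table; the function names are hypothetical.

```python
import math


def weibull_cdf(x, p, lam):
    """F_X(x) = 1 - exp(-lam * x^p) for x > 0."""
    return 1.0 - math.exp(-lam * x ** p) if x > 0 else 0.0


def rayleigh_cdf(x, lam):
    """F_X(x) = 1 - exp(-(x/lam)^2 / 2) for x >= 0."""
    return 1.0 - math.exp(-((x / lam) ** 2) / 2.0) if x >= 0 else 0.0


def weibull_moment(n, p, lam):
    """E[X^n] = Gamma(1 + n/p) / lam^(n/p)."""
    return math.gamma(1.0 + n / p) / lam ** (n / p)
```

With `p = 2`, `weibull_cdf(x, 2, lam)` and `rayleigh_cdf(x, 1 / math.sqrt(2 * lam))` agree because $\lambda x^2 = (x/\sigma)^2/2$ when $\sigma = 1/\sqrt{2\lambda}$.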
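Cauchy($\lambda$)
$f_X(x) = \dfrac{\lambda/\pi}{\lambda^2 + x^2}$, $F_X(x) = \dfrac{1}{\pi} \tan^{-1}\left(\dfrac{x}{\lambda}\right) + \dfrac{1}{2}$.
$E[X]$ = undefined, $E[X^2] = \infty$, $\varphi_X(\nu) = e^{-\lambda |\nu|}$.
Odd moments are not defined; even moments are infinite. Since the first moment is not defined, central moments, including the variance, are not defined.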
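Since the Cauchy cdf has a closed-form inverse, quantiles follow by solving $u = F_X(x)$ for $x$. The Python sketch below is one possible rendering; the inverse formula is derived here from the cdf above rather than quoted from the text, and the function names are assumptions.

```python
import math


def cauchy_cdf(x, lam):
    """F_X(x) = (1/pi) * arctan(x / lam) + 1/2."""
    return math.atan(x / lam) / math.pi + 0.5


def cauchy_quantile(u, lam):
    """Inverse of cauchy_cdf for 0 < u < 1: x = lam * tan(pi * (u - 1/2))."""
    return lam * math.tan(math.pi * (u - 0.5))
```

The quartiles land at $\pm\lambda$, which gives a scale measure that exists even though the variance does not.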