Discrete Stochastic Processes and Optimal Filtering
Jean-Claude Bertein Roger Ceschi
First published in France in 2005 by Hermes Science/Lavoisier entitled "Processus stochastiques discrets et filtrages optimaux"
First published in Great Britain and the United States in 2007 by ISTE Ltd

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 6 Fitzroy Square, London W1T 5DX, UK
ISTE USA, 4308 Patrice Road, Newport Beach, CA 92663, USA
www.iste.co.uk

© ISTE Ltd, 2007
© LAVOISIER, 2005

The rights of Jean-Claude Bertein and Roger Ceschi to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data
Bertein, Jean-Claude.
[Processus stochastiques discrets et filtrages optimaux. English]
Discrete stochastic processes and optimal filtering / Jean-Claude Bertein, Roger Ceschi.
p. cm. Includes index.
"First published in France in 2005 by Hermes Science/Lavoisier entitled 'Processus stochastiques discrets et filtrages optimaux'."
ISBN 978-1-905209-74-3
1. Signal processing--Mathematics. 2. Digital filters (Mathematics) 3. Stochastic processes. I. Ceschi, Roger. II. Title.
TK5102.9.B465 2007  621.382'2--dc22  2007009433

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 13: 978-1-905209-74-3

Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire.
To our families

We wish to thank Mme Florence François for having typed the manuscript, and M. Stephen Hazlewood, who carried out the translation of the book.
Table of Contents

Preface  xi
Introduction  xiii

Chapter 1. Random Vectors  1
1.1. Definitions and general properties  1
1.2. Spaces L1(dP) and L2(dP)  20
1.2.1. Definitions  20
1.2.2. Properties  22
1.3. Mathematical expectation and applications  23
1.3.1. Definitions  23
1.3.2. Characteristic functions of a random vector  34
1.4. Second order random variables and vectors  39
1.5. Linear independence of vectors of L2(dP)  47
1.6. Conditional expectation (concerning random vectors with density function)  51
1.7. Exercises for Chapter 1  57

Chapter 2. Gaussian Vectors  63
2.1. Some reminders regarding random Gaussian vectors  63
2.2. Definition and characterization of Gaussian vectors  66
2.3. Results relative to independence  68
2.4. Affine transformation of a Gaussian vector  72
2.5. The existence of Gaussian vectors  74
2.6. Exercises for Chapter 2  85

Chapter 3. Introduction to Discrete Time Processes  93
3.1. Definition  93
3.2. WSS processes and spectral measure  105
3.2.1. Spectral density  106
3.3. Spectral representation of a WSS process  110
3.3.1. Problem  110
3.3.2. Results  111
3.3.2.1. Process with orthogonal increments and associated measurements  111
3.3.2.2. Wiener stochastic integral  113
3.3.2.3. Spectral representation  114
3.4. Introduction to digital filtering  115
3.5. Important example: autoregressive process  128
3.6. Exercises for Chapter 3  134

Chapter 4. Estimation  141
4.1. Position of the problem  141
4.2. Linear estimation  144
4.3. Best estimate – conditional expectation  156
4.4. Example: prediction of an autoregressive process AR (1)  165
4.5. Multivariate processes  166
4.6. Exercises for Chapter 4  175

Chapter 5. The Wiener Filter  181
5.1. Introduction  181
5.1.1. Problem position  182
5.2. Resolution and calculation of the FIR filter  183
5.3. Evaluation of the least error  185
5.4. Resolution and calculation of the IIR filter  186
5.5. Evaluation of least mean square error  190
5.6. Exercises for Chapter 5  191

Chapter 6. Adaptive Filtering: Algorithm of the Gradient and the LMS  197
6.1. Introduction  197
6.2. Position of problem  199
6.3. Data representation  202
6.4. Minimization of the cost function  204
6.4.1. Calculation of the cost function  208
6.5. Gradient algorithm  211
6.6. Geometric interpretation  214
6.7. Stability and convergence  218
6.8. Estimation of gradient and LMS algorithm  222
6.8.1. Convergence of the algorithm of the LMS  225
6.9. Example of the application of the LMS algorithm  225
6.10. Exercises for Chapter 6  234

Chapter 7. The Kalman Filter  237
7.1. Position of problem  237
7.2. Approach to estimation  241
7.2.1. Scalar case  241
7.2.2. Multivariate case  244
7.3. Kalman filtering  245
7.3.1. State equation  245
7.3.2. Observation equation  246
7.3.3. Innovation process  248
7.3.4. Covariance matrix of the innovation process  248
7.3.5. Estimation  250
7.3.6. Riccati's equation  258
7.3.7. Algorithm and summary  260
7.4. Exercises for Chapter 7  262

Table of Symbols and Notations  281
Bibliography  283
Index  285
Preface
Discrete optimal filtering applied to stationary and non-stationary signals allows us to process, in the most efficient manner possible according to chosen criteria, all of the problems that we might meet in situations of extraction of noisy signals. This constitutes the necessary stage in the most diverse domains: the calculation of orbits or the guidance of aircraft in the aerospace or aeronautic domain, the calculation of filters in telecommunications or in command systems, or again in seismic signal processing – the list is not exhaustive. Furthermore, the study and the results obtained from discrete signals lend themselves easily to computer implementation.

In their book, the authors have taken pains to stress educational aspects, preferring this to displays of erudition; all of the preliminary mathematics and probability theory necessary for a sound understanding of optimal filtering is treated in a rigorous fashion. It should not be necessary to turn to other works to acquire a sound knowledge of the subjects studied. Thanks to this work, the reader will be able not only to understand discrete optimal filtering but also to go deeper into the different aspects of this wide field of study.
Introduction
The object of this book is to present the bases of discrete optimal filtering in a progressive and rigorous manner. The optimal character is understood in the sense that we always choose the criterion of minimizing the L^2 norm of the error.

Chapter 1 tackles random vectors, their principal definitions and properties.

Chapter 2 covers the subject of Gaussian vectors. Given the practical importance of this notion, the definitions and results are accompanied by numerous commentaries and explanatory diagrams.

Chapter 3 is by its very nature more "physical" than the preceding ones and can be considered as an introduction to digital filtering. Results that will be essential for what follows are given here.

Chapter 4 provides the prerequisites essential for the construction of optimal filters. The results obtained on projections in Hilbert spaces constitute the cornerstone of future demonstrations.

Chapter 5 covers the Wiener filter, an electronic device well adapted to processing second order stationary signals. Practical calculations of such filters, with finite or infinite impulse responses, are developed.

Adaptive filtering, which is the subject of Chapter 6, can be considered as a relatively direct application of the deterministic or stochastic gradient method. At the end of the process of adaptation or convergence, the Wiener filter is again encountered.
The book is completed with a study of Kalman filtering, which allows stationary or non-stationary signal processing; from this point of view we can say that it generalizes Wiener's optimal filter.

Each chapter ends with a series of exercises with answers, and worked examples are also supplied using Matlab software, which is well adapted to signal processing problems.
Chapter 1
Random Vectors
1.1. Definitions and general properties

If we remember that R^n = {x = (x_1,…,x_n) ; x_j ∈ R, j = 1 to n}, the set of real n-tuples can be fitted with two laws:

R^n × R^n → R^n, (x, y) → x + y    and    R × R^n → R^n, (λ, x) → λx

making it a vector space of dimension n. The basis implicitly considered on R^n will be the canonical basis e_1 = (1, 0,…, 0),…, e_n = (0,…, 0, 1), and x ∈ R^n expressed in this basis will be denoted as the column vector x = (x_1,…,x_n)^T.
Definition of a real random vector

Beginning with a basic definition, without concerning ourselves for the moment with its rigor: we can say simply that a real vector X = (X_1,…,X_n)^T linked to a physical or biological phenomenon is random if the value taken by this vector is unknown as long as the phenomenon is not completed.

For typographical reasons, the vector will often instead be written X^T = (X_1,…,X_n), or even X = (X_1,…,X_n) when there is no risk of confusion.

In other words, given a random vector X and B ⊂ R^n, we do not know if the assertion (also called the event) (X ∈ B) is true or false. However, we do usually know the "chance" that X ∈ B; this is denoted P(X ∈ B) and is called the probability of the event (X ∈ B).

After completion of the phenomenon, the result (also called the realization) will be denoted x = (x_1,…,x_n)^T, or x^T = (x_1,…,x_n), or even x = (x_1,…,x_n) when there is no risk of confusion.
An exact definition of a real random vector of dimension n will now be given. We take as given that:
– Ω = basic space: the set of all possible results (or trials) ω linked to a random phenomenon;
– a = σ-algebra (of events) on Ω, recalling the axioms:
1) Ω ∈ a;
2) if A ∈ a then the complement A^c ∈ a;
3) if (A_j, j ∈ J) is a countable family of events then ∪_{j∈J} A_j is an event, i.e. ∪_{j∈J} A_j ∈ a;
– R^n = space of observables;
– B(R^n) = Borel algebra on R^n, which contains all the open sets of R^n; this is the smallest σ-algebra on R^n with this property.

DEFINITION.– X is said to be a real random vector of dimension n defined on (Ω, a) if X is a measurable mapping (Ω, a) → (R^n, B(R^n)), i.e. ∀B ∈ B(R^n), X^{-1}(B) ∈ a.

When n = 1 we talk about a random variable (r.v.). In the following, the event X^{-1}(B) is also denoted {ω | X(ω) ∈ B} and even more simply (X ∈ B).
PROPOSITION.– In order for X to be a real random vector of dimension n (i.e. a measurable mapping (Ω, a) → (R^n, B(R^n))), it is necessary and sufficient that each component X_j, j = 1 to n, is a real r.v. (i.e. is a measurable mapping (Ω, a) → (R, B(R))).

ABRIDGED DEMONSTRATION.– It suffices to consider X^{-1}(B_1 × … × B_n) where B_1,…,B_n ∈ B(R), as we can show that B(R^n) = B(R) ⊗ … ⊗ B(R), where B(R) ⊗ … ⊗ B(R) denotes the σ-algebra generated by the measurable blocks B_1 × … × B_n.

Now X^{-1}(B_1 × … × B_n) = X_1^{-1}(B_1) ∩ … ∩ X_n^{-1}(B_n), which belongs to a if and only if each term belongs to a, that is to say if each X_j is a real r.v.
DEFINITION.– X = X_1 + iX_2 is said to be a complex random variable defined on (Ω, a) if the real and imaginary parts X_1 and X_2 are real random variables, that is to say X_1 and X_2 are measurable mappings (Ω, a) → (R, B(R)).

EXAMPLE.– With a real random vector X = (X_1,…,X_n) and a real n-tuple u = (u_1,…,u_n) ∈ R^n we can associate the complex r.v.:

e^{i ∑_j u_j X_j} = cos ∑_j u_j X_j + i sin ∑_j u_j X_j

The study of this random variable will be taken up again when we define the characteristic functions.

Law P_X of the random vector X
First of all we assume that the σ-algebra a is provided with a measure P, i.e. a mapping P : a → [0,1] verifying:
1) P(Ω) = 1;
2) for every countable family (A_j, j ∈ J) of pairwise disjoint events:

P(∪_{j∈J} A_j) = ∑_{j∈J} P(A_j)
DEFINITION.– We call the law of the random vector X the "image measure P_X of P through the mapping X", i.e. the measure on B(R^n) defined in the following way:

∀B ∈ B(R^n): P_X(B) = ∫_B dP_X(x_1,…,x_n) = P(X^{-1}(B)) = P(ω | X(ω) ∈ B) = P(X ∈ B)

(the second equality is the definition). Terms 1 and 2 on the one hand, and terms 3, 4 and 5 on the other, are different notations for the same mathematical notion.

Figure 1.1. Measurable mapping X: for B ∈ B(R^n), X^{-1}(B) ∈ a
It is important to observe that, as the measure P is given on a, P_X(B) is calculable for all B ∈ B(R^n) because X is measurable.

The space R^n, provided with the Borel algebra B(R^n) and then with the law P_X, is denoted (R^n, B(R^n), P_X).
NOTE.– As far as the basic and the exact definitions are concerned, the basic definition of random vectors is obviously much simpler and more intuitive, and can happily be used in basic applications of probability calculations. On the other hand, in more theoretical or sophisticated studies, and notably in those calling into play several random vectors X, Y, Z,…, considering the latter as mappings defined on the same space (Ω, a), i.e. X, Y, Z,… : (Ω, a) → (R^n, B(R^n)), will often prove to be useful, even indispensable.

Figure 1.2. Family of measurable mappings

In effect, the expressions and calculations calling into play several (or all) of these vectors can be written without ambiguity using the space (Ω, a, P). Precisely, the events linked to X, Y, Z,… are among the elements A of a (and the probabilities of these events are measured by P).
Let us give two examples:

1) If there are two random vectors X, Y : (Ω, a, P) → (R^n, B(R^n)) and given B and B′ ∈ B(R^n), the event (X ∈ B) ∩ (Y ∈ B′) (for example) can be translated by X^{-1}(B) ∩ Y^{-1}(B′) ∈ a.

2) If there are three r.v. X, Y, Z : (Ω, a, P) → (R, B(R)) and given a ∈ R*_+, let us try to express the event (Z ≥ a − X − Y). Let us state U = (X, Y, Z) and B = {(x, y, z) ∈ R^3 | x + y + z ≥ a}, where B, a Borel set of R^3, represents the half-space bounded by the plane (Π) not containing the origin 0 and based on the triangle ABC.

Figure 1.3. Example of a Borel set of R^3

U is (Ω, a) → (R^3, B(R^3)) measurable and:

(Z ≥ a − X − Y) = (U ∈ B) = U^{-1}(B) ∈ a
NOTE ON THE SPACE (Ω, a, P).– We said that if we took as given Ω, then a on Ω, and then P on a, we could consider the vectors X, Y, Z,… as measurable mappings:

(Ω, a, P) → (R^n, B(R^n))

This way of introducing the different concepts is the easiest to understand, but it rarely corresponds to real probability problems. In general, (Ω, a, P) is not specified, or even given, before "X, Y, Z,… measurable mappings". On the contrary, given the random physical or biological quantities X, Y, Z,… of R^n, it is by starting from the latter that (Ω, a, P) and "X, Y, Z,… measurable mappings defined on (Ω, a, P)" are simultaneously introduced. (Ω, a, P) is an artificial space intended to serve as a link between X, Y, Z,…

What has just been set out may seem exceedingly abstract, but fortunately the general random vectors as they have just been defined are rarely used in practice. In any case, and as far as we are concerned, we will only have to manipulate in what follows the far more specific and concrete notion of a "random vector with a density function".

DEFINITION.– We say that the law P_X of the random vector X has a density if there is a mapping f_X : (R^n, B(R^n)) → (R, B(R)), positive and measurable, called the density of P_X, such that ∀B ∈ B(R^n):

P(X ∈ B) = P_X(B) = ∫_B dP_X(x_1,…,x_n) = ∫_B f_X(x_1,…,x_n) dx_1 … dx_n
VOCABULARY.– Sometimes we write dP_X(x_1,…,x_n) = f_X(x_1,…,x_n) dx_1 … dx_n, and we say also that the measure P_X admits the density f_X with respect to the Lebesgue measure on R^n. We also say that the random vector X admits the density f_X.

NOTE.– ∫_{R^n} f_X(x_1,…,x_n) dx_1 … dx_n = P(X ∈ R^n) = 1.

For example, let X = (X_1, X_2, X_3) be the random vector of density f_X(x_1, x_2, x_3) = K x_3 1_Δ(x_1, x_2, x_3), where Δ is the half-ball defined by x_1^2 + x_2^2 + x_3^2 ≤ R^2 with x_3 ≥ 0. We easily obtain, via a passage to spherical coordinates:

1 = ∫_Δ K x_3 dx_1 dx_2 dx_3 = K πR^4/4,   whence K = 4/(πR^4)
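As a quick numerical cross-check of this normalization (an illustrative Python sketch; the book's own worked examples use Matlab), a Monte Carlo estimate of ∫_Δ K x_3 dx_1 dx_2 dx_3 with K = 4/(πR^4) should come out close to 1:

```python
import math
import random

random.seed(0)

R = 1.0
K = 4 / (math.pi * R**4)          # claimed normalizing constant

# Monte Carlo estimate of the integral of K*x3 over the half-ball Δ:
# sample uniformly in the box [-R,R] x [-R,R] x [0,R] enclosing Δ.
N = 200_000
box_volume = (2 * R) * (2 * R) * R
acc = 0.0
for _ in range(N):
    x1 = random.uniform(-R, R)
    x2 = random.uniform(-R, R)
    x3 = random.uniform(0, R)
    if x1 * x1 + x2 * x2 + x3 * x3 <= R * R:   # inside the half-ball
        acc += K * x3

estimate = acc / N * box_volume
print(estimate)   # close to 1
```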
Marginals

Let X = (X_1,…,X_n)^T be the random vector which has the law P_X and the probability density f_X.

DEFINITION.– The r.v. X_j, which is the j-th component of X, is called the j-th marginal of X, and the law P_{X_j} of X_j is called the law of the j-th marginal.

If we know P_X, we know how to find the laws P_{X_j}.
In effect, ∀B ∈ B(R):

P(X_j ∈ B) = P[(X_1 ∈ R) ∩ … ∩ (X_j ∈ B) ∩ … ∩ (X_n ∈ R)]
= ∫_{R×…×B×…×R} f_X(x_1,…,x_j,…,x_n) dx_1 … dx_j … dx_n

and, using the Fubini theorem:

= ∫_B dx_j ∫_{R^{n−1}} f_X(x_1,…,x_j,…,x_n) dx_1 … dx_n (except dx_j)

The equality applying for all B, we obtain:

f_{X_j}(x_j) = ∫_{R^{n−1}} f_X(x_1,…,x_j,…,x_n) dx_1 … dx_n (except dx_j)

NOTE.– Reciprocally: except in the case of independent components, the knowledge of the P_{X_j} does not lead to that of P_X.
EXAMPLE.– Let us consider:

1) A Gaussian pair Z^T = (X, Y) of probability density:

f_Z(x, y) = (1/2π) exp(−(x^2 + y^2)/2)

We obtain the densities of the marginals:

f_X(x) = ∫_{−∞}^{+∞} f_Z(x, y) dy = (1/√(2π)) exp(−x^2/2)   and
f_Y(y) = ∫_{−∞}^{+∞} f_Z(x, y) dx = (1/√(2π)) exp(−y^2/2)

2) A second, non-Gaussian random pair W^T = (U, V) whose probability density f_W is defined by:

f_W(u, v) = 2 f_Z(u, v) if uv ≥ 0;   f_W(u, v) = 0 if uv < 0

Let us calculate the marginals:

f_U(u) = ∫_{−∞}^{+∞} f_W(u, v) dv = ∫_{−∞}^{0} 2 f_Z(u, v) dv if u ≤ 0
       = ∫_{0}^{+∞} 2 f_Z(u, v) dv if u > 0

From which we easily come to f_U(u) = (1/√(2π)) exp(−u^2/2). In addition we obtain f_V(v) = (1/√(2π)) exp(−v^2/2).

CONCLUSION.– We can clearly see from this example that the marginal densities (identical in 1 and 2) do not determine the densities of the vectors (different in 1 and 2).
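The non-uniqueness above can also be verified numerically. The sketch below (illustrative Python, using a simple midpoint Riemann sum) integrates f_W over v and recovers the standard Gaussian density for the marginal of U, even though f_W itself is not Gaussian:

```python
import math

def f_Z(x, y):
    # density of the Gaussian pair
    return math.exp(-(x * x + y * y) / 2) / (2 * math.pi)

def f_W(u, v):
    # the modified, non-Gaussian density of the pair W = (U, V)
    return 2 * f_Z(u, v) if u * v >= 0 else 0.0

def marginal_U(u, h=0.01, vmax=8.0):
    # midpoint Riemann sum of f_W(u, v) over v in [-vmax, vmax]
    n = int(2 * vmax / h)
    return sum(f_W(u, -vmax + (k + 0.5) * h) for k in range(n)) * h

for u in (-1.3, 0.4, 2.0):
    gauss = math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    print(u, marginal_U(u), gauss)   # the marginal matches the N(0,1) density
```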
Probability distribution function

DEFINITION.– We call the mapping F_X : (x_1,…,x_n) → F_X(x_1,…,x_n), from R^n to [0,1], the distribution function of the random vector X^T = (X_1,…,X_n). It is defined by:

F_X(x_1,…,x_n) = P((X_1 ≤ x_1) ∩ … ∩ (X_n ≤ x_n))

and in integral form, since X is a vector with a probability density:

F_X(x_1,…,x_n) = ∫_{−∞}^{x_1} … ∫_{−∞}^{x_n} f_X(u_1,…,u_n) du_1 … du_n

Some general properties:
– ∀j = 1 to n, the mapping x_j → F_X(x_1,…,x_n) is non-decreasing;
– F_X(x_1,…,x_n) → 1 when all the variables x_j → +∞;
– F_X(x_1,…,x_n) → 0 if at least one of the variables x_j → −∞;
– if (x_1,…,x_n) → f_X(x_1,…,x_n) is continuous, then ∂^n F_X / ∂x_n … ∂x_1 = f_X.

EXERCISE.– Determine the probability distribution of the pair (X, Y) of density f(x, y) = K xy on the rectangle Δ = [1,3] × [2,4], and state precisely the value of K.
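One way to approach the exercise numerically (an illustrative Python sketch; it only checks the normalization condition giving K, not the full distribution function asked for) is to estimate the integral of xy over the rectangle with a midpoint rule and invert it:

```python
# K must make the density integrate to 1 over Δ = [1,3] x [2,4]:
# estimate the integral of x*y over Δ and take its reciprocal.
def integral_xy(h=0.01):
    n1 = int(2 / h)   # cells along x in [1,3]
    n2 = int(2 / h)   # cells along y in [2,4]
    s = 0.0
    for i in range(n1):
        x = 1 + (i + 0.5) * h
        for j in range(n2):
            y = 2 + (j + 0.5) * h
            s += x * y
    return s * h * h

total = integral_xy()
K = 1 / total
print(total, K)   # total = 24.0, so K = 1/24
```

The midpoint rule is exact here because xy is bilinear on each cell, which makes the numerical answer a reliable check on the hand calculation.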
Independence

DEFINITION.– We say that a family of r.v. X_1,…,X_n is an independent family if ∀J ⊂ {1, 2,…, n} and for every family of B_j ∈ B(R):

P(∩_{j∈J} (X_j ∈ B_j)) = ∏_{j∈J} P(X_j ∈ B_j)

As R ∈ B(R), it is easy to verify, by making certain Borel sets equal to R, that the definition of independence is equivalent to the following:

∀B_j ∈ B(R): P(∩_{j=1}^{n} (X_j ∈ B_j)) = ∏_{j=1}^{n} P(X_j ∈ B_j)

again equivalent to:

∀B_j ∈ B(R): P(X ∈ B_1 × … × B_n) = ∏_{j=1}^{n} P(X_j ∈ B_j)

i.e., by introducing the laws of probabilities:

∀B_j ∈ B(R): P_X(B_1 × … × B_n) = ∏_{j=1}^{n} P_{X_j}(B_j)

NOTE.– This law of probability P_X (defined on B(R^n) = B(R) ⊗ … ⊗ B(R)) is the tensor product of the laws of probabilities P_{X_j} (defined on B(R)). Symbolically we write this as P_X = P_{X_1} ⊗ … ⊗ P_{X_n}.
NOTE.– Let X_1,…,X_n be a family of r.v. If this family is independent, the r.v. are independent pairwise, but the converse is false.

PROPOSITION.– Let X = (X_1,…,X_n) be a real random vector admitting the probability density f_X, the components X_1,…,X_n admitting the densities f_{X_1},…,f_{X_n}. In order for the family of components to be an independent family, it is necessary and sufficient that:

f_X(x_1,…,x_n) = ∏_{j=1}^{n} f_{X_j}(x_j)

DEMONSTRATION (in the simplified case where f_X is continuous).–

– If (X_1,…,X_n) is an independent family:

F_X(x_1,…,x_n) = P(∩_{j=1}^{n} (X_j ≤ x_j)) = ∏_{j=1}^{n} P(X_j ≤ x_j) = ∏_{j=1}^{n} F_{X_j}(x_j)

By deriving the two extreme members:

f_X(x_1,…,x_n) = ∂^n F_X(x_1,…,x_n) / ∂x_n … ∂x_1 = ∏_{j=1}^{n} ∂F_{X_j}(x_j)/∂x_j = ∏_{j=1}^{n} f_{X_j}(x_j);

– reciprocally, if f_X(x_1,…,x_n) = ∏_{j=1}^{n} f_{X_j}(x_j), then for B_j ∈ B(R), j = 1 to n:

P(∩_{j=1}^{n} (X_j ∈ B_j)) = P(X ∈ ∏_{j=1}^{n} B_j) = ∫_{∏ B_j} f_X(x_1,…,x_n) dx_1 … dx_n
= ∫_{∏ B_j} ∏_{j=1}^{n} f_{X_j}(x_j) dx_j = ∏_{j=1}^{n} ∫_{B_j} f_{X_j}(x_j) dx_j = ∏_{j=1}^{n} P(X_j ∈ B_j)

NOTE.– The equality f_X(x_1,…,x_n) = ∏_{j=1}^{n} f_{X_j}(x_j) defines f_X, a function of n variables, as the tensor product of the functions of one variable f_{X_j}. Symbolically we write f_X = f_{X_1} ⊗ … ⊗ f_{X_n} (not to be confused with the ordinary product f = f_1 f_2 ⋯ f_n defined by f(x) = f_1(x) f_2(x) ⋯ f_n(x)).
EXAMPLE.– Let X = (X_1, X_2) be the random pair of density:

(1/2π) exp(−(x_1^2 + x_2^2)/2)

As

(1/2π) exp(−(x_1^2 + x_2^2)/2) = (1/√(2π)) exp(−x_1^2/2) · (1/√(2π)) exp(−x_2^2/2)

and as (1/√(2π)) exp(−x_1^2/2) and (1/√(2π)) exp(−x_2^2/2) are the densities of X_1 and of X_2, these two components X_1 and X_2 are independent.
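The factorization criterion can be illustrated by simulation (Python sketch; the intervals B1 and B2 below are arbitrary choices, not taken from the text): since the density of the pair factorizes, the probability of a product set should equal the product of the marginal probabilities.

```python
import random

random.seed(1)

N = 400_000
B1 = (-0.5, 1.0)   # arbitrary Borel intervals for the check
B2 = (0.2, 2.0)

joint = n1 = n2 = 0
for _ in range(N):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    in1 = B1[0] <= x1 <= B1[1]
    in2 = B2[0] <= x2 <= B2[1]
    joint += in1 and in2
    n1 += in1
    n2 += in2

# P(X in B1 x B2) versus P(X1 in B1) * P(X2 in B2): the two estimates agree
print(joint / N, (n1 / N) * (n2 / N))
```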
DEFINITION.– Two random vectors X = (X_1,…,X_n) and Y = (Y_1,…,Y_p) are said to be independent if:

∀B ∈ B(R^n) and B′ ∈ B(R^p): P((X ∈ B) ∩ (Y ∈ B′)) = P(X ∈ B) P(Y ∈ B′)

The sum of independent random variables
NOTE.– We are frequently led to calculate the probability P that a function of n given r.v. X_1,…,X_n verifies a certain inequality. Let us denote this probability P(inequality). Let us assume that the random vector X = (X_1,…,X_n) possesses a probability density f_X(x_1,…,x_n). The method of obtaining P(inequality) consists of determining the set B ∈ B(R^n) of the (x_1,…,x_n) which verify the inequality. We thus obtain:

P(inequality) = ∫_B f_X(x_1,…,x_n) dx_1 … dx_n

EXAMPLES.–

1) P(X_1 + X_2 ≤ z) = P((X_1, X_2) ∈ B) = ∫_B f_X(x_1, x_2) dx_1 dx_2, where B = {(x, y) ∈ R^2 | x + y ≤ z} is the half-plane below the line x + y = z.
2) P(X_1 + X_2 ≤ a − X_3) = P((X_1, X_2, X_3) ∈ B) = ∫_B f_X(x_1, x_2, x_3) dx_1 dx_2 dx_3, where B is the half-space containing the origin 0 and limited by the plane placed on the triangle ABC of equation x + y + z = a.

3) P(Max(X_1, X_2) ≤ z) = P((X_1, X_2) ∈ B) = ∫_B f_X(x_1, x_2) dx_1 dx_2, where B = {(x, y) ∈ R^2 | x ≤ z and y ≤ z} (the quarter-plane below and to the left of the point (z, z)).
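The recipe P(inequality) = ∫_B f_X can be illustrated numerically. The sketch below (Python; it assumes, purely for illustration, two independent standard Gaussian components, a choice not made in the text) compares a Monte Carlo estimate of P(X_1 + X_2 ≤ z) with the value of the half-plane integral, which in that special case has the closed form Φ(z/√2):

```python
import math
import random

random.seed(2)

z = 0.7
N = 300_000

# Monte Carlo estimate of P(X1 + X2 <= z) for independent N(0,1) components
count = sum(random.gauss(0, 1) + random.gauss(0, 1) <= z for _ in range(N))
mc = count / N

# For this choice X1 + X2 ~ N(0, 2), so the half-plane integral equals
# Phi(z / sqrt(2)) = 0.5 * (1 + erf(z / 2))
exact = 0.5 * (1 + math.erf(z / 2))
print(mc, exact)
```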
Starting with example 1) we will show the following.
PROPOSITION.– Let X and Y be two real independent r.v. of probability densities respectively f_X and f_Y. The r.v. Z = X + Y admits a probability density f_Z defined as:

f_Z(z) = (f_X ∗ f_Y)(z) = ∫_{−∞}^{+∞} f_X(x) f_Y(z − x) dx

DEMONSTRATION.– Let us start from the probability distribution of Z:

F_Z(z) = P(Z ≤ z) = P(X + Y ≤ z) = P((X, Y) ∈ B)

(where B is the half-plane defined in example 1) above)

= ∫_B f(x, y) dx dy = (independence) ∫_B f_X(x) f_Y(y) dx dy
= ∫_{−∞}^{+∞} f_X(x) dx ∫_{−∞}^{z−x} f_Y(y) dy

Stating y = u − x:

= ∫_{−∞}^{+∞} f_X(x) dx ∫_{−∞}^{z} f_Y(u − x) du = ∫_{−∞}^{z} du ∫_{−∞}^{+∞} f_X(x) f_Y(u − x) dx

The mapping u → ∫_{−∞}^{+∞} f_X(x) f_Y(u − x) dx being continuous, F_Z(z) is a primitive of it, and:

F_Z′(z) = f_Z(z) = ∫_{−∞}^{+∞} f_X(x) f_Y(z − x) dx
NOTE.– If (for example) the support of f_X and f_Y is R_+, i.e. if f_X(x) = f_X(x) 1_[0,∞[(x) and f_Y(y) = f_Y(y) 1_[0,∞[(y), we easily arrive at:

f_Z(z) = ∫_0^z f_X(x) f_Y(z − x) dx

EXAMPLE.– X and Y are two independent exponential r.v. of parameter λ. Let us take as given Z = X + Y.

For z ≤ 0: f_Z(z) = 0.

For z ≥ 0:

f_Z(z) = ∫_{−∞}^{+∞} f_X(x) f_Y(z − x) dx = ∫_0^z λe^{−λx} λe^{−λ(z−x)} dx = λ^2 z e^{−λz}

and f_Z(z) = λ^2 z e^{−λz} 1_[0,∞[(z).
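The density λ^2 z e^{−λz} can be cross-checked by simulation (illustrative Python sketch; λ and the evaluation point z₀ are arbitrary choices): the empirical distribution of X + Y should match the distribution function obtained by integrating f_Z, namely F_Z(z₀) = 1 − e^{−λz₀}(1 + λz₀).

```python
import math
import random

random.seed(3)

lam = 1.5
N = 300_000

# simulate Z = X + Y for independent exponentials of parameter lam
samples = [random.expovariate(lam) + random.expovariate(lam) for _ in range(N)]

# compare the empirical CDF at z0 with the integral of f_Z(z) = lam^2 z e^{-lam z}
z0 = 1.0
empirical = sum(s <= z0 for s in samples) / N
theoretical = 1 - math.exp(-lam * z0) * (1 + lam * z0)
print(empirical, theoretical)
```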
1.2. Spaces L^1(dP) and L^2(dP)

1.2.1. Definitions

The family of r.v. X : ω → X(ω), (Ω, a, P) → (R, B(R)), forms a vector space on R, denoted ε.

Two vector subspaces of ε play a particularly important role and these are what will be defined.

The definitions would in effect be the final element in the construction of the Lebesgue integral of measurable mappings, but this construction will not be given here and we will be able to progress without it.

DEFINITION.– We say that two random variables X and X′ defined on (Ω, a) are almost surely equal, and we write X = X′ a.s., if X = X′ except possibly on an event N of zero probability (that is to say N ∈ a and P(N) = 0).

We note:
– X̃ = {class (of equivalence) of r.v. X′ almost surely equal to X};
– 0̃ = {class (of equivalence) of r.v. almost surely equal to 0}.

We can now give:
– the definition of L^1(dP) as the vector space of first order random variables;
– the definition of L^2(dP) as the vector space of second order random variables:

L^1(dP) = {r.v. X | ∫_Ω |X(ω)| dP(ω) < ∞}
L^2(dP) = {r.v. X | ∫_Ω X^2(ω) dP(ω) < ∞}
where, in these expressions, the r.v. are defined except on a zero-probability event; otherwise said, the r.v. X are any representatives of the classes X̃, because by construction the integrals of the r.v. are not modified if we modify the latter on zero-probability events.

Note on the inequality ∫_Ω |X(ω)| dP(ω) < ∞

Introducing the two positive random variables X^+ = Sup(X, 0) and X^− = Sup(−X, 0), we can write X = X^+ − X^− and |X| = X^+ + X^−.

Let X ∈ L^1(dP); we thus have:

∫_Ω |X(ω)| dP(ω) < ∞ ⇔ ∫_Ω X^+(ω) dP(ω) < ∞ and ∫_Ω X^−(ω) dP(ω) < ∞

So, if X ∈ L^1(dP), the integral

∫_Ω X(ω) dP(ω) = ∫_Ω X^+(ω) dP(ω) − ∫_Ω X^−(ω) dP(ω)

is defined without ambiguity.

NOTE.– L^2(dP) ⊂ L^1(dP). In effect, given X ∈ L^2(dP), following Schwarz's inequality:

(∫_Ω |X(ω)| dP(ω))^2 ≤ ∫_Ω X^2(ω) dP(ω) ∫_Ω dP(ω) < ∞
22
Discrete Stochastic Processes and Optimal Filtering
EXAMPLE.– Let $X$ be a Gaussian r.v. (density $\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2 \right)$). This belongs to $L^1(dP)$ and to $L^2(dP)$.

Let $Y$ be a Cauchy r.v. (density $\frac{1}{\pi\left(1 + x^2\right)}$). This does not belong to $L^1(dP)$, and thus does not belong to $L^2(dP)$ either.
1.2.2. Properties

– $L^1(dP)$ is a Banach space; we will not use this property in what follows;
– $L^2(dP)$ is a Hilbert space. We give here the properties without demonstration.

* We can equip $L^2(dP)$ with the scalar product defined by:
$$\forall X, Y \in L^2(dP) \quad \langle X, Y \rangle = \int_\Omega X(\omega) Y(\omega)\, dP(\omega)$$

This expression is well defined because, following Schwarz’s inequality:
$$\left( \int_\Omega \left| X(\omega) Y(\omega) \right| dP(\omega) \right)^2 \le \int_\Omega X^2(\omega)\, dP(\omega) \int_\Omega Y^2(\omega)\, dP(\omega) < \infty$$
and the axioms of the scalar product are immediately verifiable.

* $L^2(dP)$ is a vector space normed by:
$$\| X \| = \sqrt{\langle X, X \rangle} = \sqrt{\int_\Omega X^2(\omega)\, dP(\omega)}$$

It is easy to verify that:
$$\forall X, Y \in L^2(dP) \quad \| X + Y \| \le \| X \| + \| Y \|$$
$$\forall X \in L^2(dP) \ \text{and} \ \forall \lambda \in \mathbb{R} \quad \| \lambda X \| = |\lambda|\, \| X \|$$

As far as the second axiom is concerned:
– if $X = 0$ then $\| X \| = 0$;
– if $\| X \| = \left( \int_\Omega X^2(\omega)\, dP(\omega) \right)^{1/2} = 0$ then $X = 0$ a.s. (or $\dot{X} = \dot{0}$).

* $L^2(dP)$ is a complete space for the norm $\| \cdot \|$ defined above. (Every Cauchy sequence $X_n$ converges to some $X \in L^2(dP)$.)
1.3. Mathematical expectation and applications

1.3.1. Definitions

We are studying a general random vector (not necessarily with a density function):
$$X = (X_1, \ldots, X_n) : (\Omega, \mathcal{A}, P) \to \left( \mathbb{R}^n, \mathcal{B}\left( \mathbb{R}^n \right) \right).$$

Furthermore, we give ourselves a measurable mapping:
$$\Psi : \left( \mathbb{R}^n, \mathcal{B}\left( \mathbb{R}^n \right) \right) \to \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right)$$

$\Psi \circ X$ (also denoted $\Psi(X)$ or $\Psi(X_1, \ldots, X_n)$) is a measurable mapping (thus an r.v.) defined on $(\Omega, \mathcal{A})$:
$$(\Omega, \mathcal{A}, P) \xrightarrow{\ X\ } \left( \mathbb{R}^n, \mathcal{B}\left( \mathbb{R}^n \right), P_X \right) \xrightarrow{\ \Psi\ } \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right)$$

DEFINITION.– Under the hypothesis $\Psi \circ X \in L^1(dP)$, we call mathematical expectation of the random variable $\Psi \circ X$ the expression $\mathrm{E}(\Psi \circ X)$ defined as:
$$\mathrm{E}(\Psi \circ X) = \int_\Omega (\Psi \circ X)(\omega)\, dP(\omega)$$
or, to remind ourselves that $X$ is a vector:
$$\mathrm{E}\left( \Psi(X_1, \ldots, X_n) \right) = \int_\Omega \Psi\left( X_1(\omega), \ldots, X_n(\omega) \right) dP(\omega)$$

NOTE.– This definition of the mathematical expectation of $\Psi \circ X$ is well adapted to general problems or to those of a more theoretical orientation; in particular, it is by using it that we construct $L^2(dP)$, the Hilbert space of second order r.v. In practice, however, it is the law $P_X$ (the image of the measure $P$ under the mapping $X$), and not $P$ itself, that we know. We thus want to use the law $P_X$ to express $\mathrm{E}(\Psi \circ X)$, that is, to transfer the calculation of $\mathrm{E}(\Psi \circ X)$ from the space $(\Omega, \mathcal{A}, P)$ to the space $\left( \mathbb{R}^n, \mathcal{B}\left( \mathbb{R}^n \right), P_X \right)$.

In order to simplify the writing in the theorem that follows (and as will often occur in the remainder of this work), $(X_1, \ldots, X_n)$, $(x_1, \ldots, x_n)$ and $dx_1 \ldots dx_n$ will often be denoted $X$, $x$ and $dx$ respectively.
Transfer theorem

Let us assume $\Psi \circ X \in L^1(dP)$; we thus have:
$$\mathrm{E}(\Psi \circ X) = \int_\Omega (\Psi \circ X)(\omega)\, dP(\omega) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x)$$

In particular, if $P_X$ admits a density $f_X$:
$$\mathrm{E}(\Psi \circ X) = \int_{\mathbb{R}^n} \Psi(x) f_X(x)\, dx \quad \text{and} \quad \mathrm{E}X = \int_{\mathbb{R}} x f_X(x)\, dx.$$
Moreover $\Psi \in L^1(dP_X)$.

DEMONSTRATION.–
– The equality is true if $\Psi = 1_B$ with $B \in \mathcal{B}\left( \mathbb{R}^n \right)$:
$$\mathrm{E}(\Psi \circ X) = \mathrm{E}(1_B \circ X) = P_X(B) = \int_{\mathbb{R}^n} 1_B(x)\, dP_X(x) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x).$$

– The equality is still true if $\Psi$ is a simple measurable mapping, that is to say if $\Psi = \sum_{j=1}^m \lambda_j 1_{B_j}$ where the $B_j \in \mathcal{B}\left( \mathbb{R}^n \right)$ and are pairwise disjoint.

We have in effect:
$$\mathrm{E}(\Psi \circ X) = \sum_{j=1}^m \lambda_j \mathrm{E}\left( 1_{B_j} \circ X \right) = \sum_{j=1}^m \lambda_j P_X\left( B_j \right) = \sum_{j=1}^m \lambda_j \int_{\mathbb{R}^n} 1_{B_j}(x)\, dP_X(x)$$
$$= \int_{\mathbb{R}^n} \left( \sum_{j=1}^m \lambda_j 1_{B_j}(x) \right) dP_X(x) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x)$$

– If we now assume that $\Psi$ is a positive measurable mapping, we know that it is the limit of an increasing sequence of positive simple measurable mappings $\Psi_p$. We thus have:
$$\int_\Omega \left( \Psi_p \circ X \right)(\omega)\, dP(\omega) = \int_{\mathbb{R}^n} \Psi_p(x)\, dP_X(x) \quad \text{with } \Psi_p \uparrow \Psi.$$

$\Psi_p \circ X$ is also a positive increasing sequence which converges to $\Psi \circ X$ and, by taking the limits of the two members when $p \uparrow \infty$, we obtain, according to the monotone convergence theorem:
$$\int_\Omega (\Psi \circ X)(\omega)\, dP(\omega) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x).$$

– If $\Psi$ is a measurable mapping of any sort, we again use the decomposition $\Psi = \Psi^+ - \Psi^-$ and $|\Psi| = \Psi^+ + \Psi^-$.

Furthermore, it is clear that $(\Psi \circ X)^+ = \Psi^+ \circ X$ and $(\Psi \circ X)^- = \Psi^- \circ X$.

It emerges that:
$$\mathrm{E}\left| \Psi \circ X \right| = \mathrm{E}(\Psi \circ X)^+ + \mathrm{E}(\Psi \circ X)^- = \mathrm{E}\left( \Psi^+ \circ X \right) + \mathrm{E}\left( \Psi^- \circ X \right)$$
i.e. according to what we have already seen:
$$= \int_{\mathbb{R}^n} \Psi^+(x)\, dP_X(x) + \int_{\mathbb{R}^n} \Psi^-(x)\, dP_X(x) = \int_{\mathbb{R}^n} \left| \Psi(x) \right| dP_X(x)$$

As $\Psi \circ X \in L^1(dP)$, we can deduce from this that $\Psi \in L^1(dP_X)$ (reciprocally, if $\Psi \in L^1(dP_X)$ then $\Psi \circ X \in L^1(dP)$).

In particular $\mathrm{E}(\Psi \circ X)^+$ and $\mathrm{E}(\Psi \circ X)^-$ are finite, and
$$\mathrm{E}(\Psi \circ X) = \mathrm{E}\left( \Psi^+ \circ X \right) - \mathrm{E}\left( \Psi^- \circ X \right) = \int_{\mathbb{R}^n} \Psi^+(x)\, dP_X(x) - \int_{\mathbb{R}^n} \Psi^-(x)\, dP_X(x) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x)$$
NOTE.– (which is an extension of the preceding note) In certain works the notion of “a random vector as a measurable mapping” is not developed, as it is judged to be too abstract. In this case the integral
$$\int_{\mathbb{R}^n} \Psi(x)\, dP_X(x) = \int_{\mathbb{R}^n} \Psi(x) f_X(x)\, dx \quad (\text{if } P_X \text{ admits the density } f_X)$$
is given as the definition of $\mathrm{E}(\Psi \circ X)$.

EXAMPLES.–

1) Let the “random Gaussian vector” be $X^T = (X_1, X_2)$ of density:
$$f_X(x_1, x_2) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2} \frac{1}{1-\rho^2} \left( x_1^2 - 2\rho x_1 x_2 + x_2^2 \right) \right)$$
where $\rho \in \left]-1, 1\right[$, and let the mapping $\Psi$ be $(x_1, x_2) \to x_1 x_2^3$.

The condition:
$$\int_{\mathbb{R}^2} \left| x_1 x_2^3 \right| \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2} \frac{1}{1-\rho^2} \left( x_1^2 - 2\rho x_1 x_2 + x_2^2 \right) \right) dx_1\, dx_2 < \infty$$
is easily verifiable and:
$$\mathrm{E} X_1 X_2^3 = \int_{\mathbb{R}^2} x_1 x_2^3 \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2} \frac{1}{1-\rho^2} \left( x_1^2 - 2\rho x_1 x_2 + x_2^2 \right) \right) dx_1\, dx_2$$
2) Given a random Cauchy variable of density $f_X(x) = \frac{1}{\pi\left(1 + x^2\right)}$:
$$\frac{1}{\pi} \int_{\mathbb{R}} \frac{|x|}{1 + x^2}\, dx = +\infty \quad \text{thus } X \notin L^1(dP) \text{ and } \mathrm{E}X \text{ is not defined.}$$

Let us consider next the transformation $\Psi$ which consists of “rectifying and clipping” the r.v. $X$: $\Psi(x) = |x|$ if $|x| \le K$ and $\Psi(x) = K$ if $|x| > K$.

Figure 1.4. Rectifying and clipping operation

$$\int_{\mathbb{R}} \Psi(x)\, dP_X(x) = \frac{1}{\pi} \left( \int_{-K}^{K} \frac{|x|}{1 + x^2}\, dx + \int_{-\infty}^{-K} \frac{K}{1 + x^2}\, dx + \int_{K}^{\infty} \frac{K}{1 + x^2}\, dx \right)$$
$$= \frac{1}{\pi} \left( \ln\left( 1 + K^2 \right) + 2K \left( \frac{\pi}{2} - \arctan K \right) \right) < \infty.$$

Thus $\Psi \circ X \in L^1(dP)$ and:
$$\mathrm{E}(\Psi \circ X) = \int_{-\infty}^{+\infty} \Psi(x)\, dP_X(x) = \frac{1}{\pi} \left( \ln\left( 1 + K^2 \right) + 2K \left( \frac{\pi}{2} - \arctan K \right) \right).$$
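A quick numerical cross-check of this closed form (the clipping level $K = 2$ and the quadrature parameters are arbitrary choices; the tail of the truncated integral is added back exactly via $P(|X| > R)$):

```python
import math

K = 2.0  # illustrative clipping level

def psi(x):
    # rectify (absolute value) then clip at level K
    return min(abs(x), K)

def cauchy_density(x):
    return 1.0 / (math.pi * (1.0 + x * x))

# midpoint rule on [-R, R], plus the exact tail contribution K * P(|X| > R)
R, n = 500.0, 200000
h = 2 * R / n
integral = sum(psi(-R + (i + 0.5) * h) * cauchy_density(-R + (i + 0.5) * h)
               for i in range(n)) * h
tail = 2 * K * (0.5 - math.atan(R) / math.pi)
numeric = integral + tail

closed_form = (math.log(1 + K ** 2) + 2 * K * (math.pi / 2 - math.atan(K))) / math.pi
```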
DEFINITION.– Given $np$ r.v. $X_{jk}$ ($j = 1$ to $p$, $k = 1$ to $n$) belonging to $L^1(dP)$, we define the mean of the matrix
$$\left[ X_{jk} \right] = \begin{pmatrix} X_{11} & \cdots & X_{1n} \\ \vdots & & \vdots \\ X_{p1} & \cdots & X_{pn} \end{pmatrix} \quad \text{by} \quad \mathrm{E}\left[ X_{jk} \right] = \begin{pmatrix} \mathrm{E}X_{11} & \cdots & \mathrm{E}X_{1n} \\ \vdots & & \vdots \\ \mathrm{E}X_{p1} & \cdots & \mathrm{E}X_{pn} \end{pmatrix}.$$

In particular, given a random vector $X^T = (X_1, \ldots, X_n)$ verifying $X_j \in L^1(dP)$ $\forall j = 1$ to $n$, we state:
$$\mathrm{E}[X] = \begin{pmatrix} \mathrm{E}X_1 \\ \vdots \\ \mathrm{E}X_n \end{pmatrix} \quad \text{or} \quad \mathrm{E}\left[ X^T \right] = (\mathrm{E}X_1, \ldots, \mathrm{E}X_n).$$

Mathematical expectation of a complex r.v.

DEFINITION.– Given a complex r.v. $X = X_1 + i X_2$, we say that $X \in L^1(dP)$ if $X_1$ and $X_2 \in L^1(dP)$. If $X \in L^1(dP)$ we define its mathematical expectation as:
$$\mathrm{E}(X) = \mathrm{E}X_1 + i\, \mathrm{E}X_2.$$

Transformation of random vectors
We are studying a real random vector $X = (X_1, \ldots, X_n)$ with a probability density $f_X(x) 1_D(x) = f_X(x_1, \ldots, x_n) 1_D(x_1, \ldots, x_n)$ where $D$ is an open set of $\mathbb{R}^n$.

Furthermore, we give ourselves the mapping:
$$\alpha : x = (x_1, \ldots, x_n) \to y = \alpha(x) = \left( \alpha_1(x_1, \ldots, x_n), \ldots, \alpha_n(x_1, \ldots, x_n) \right)$$

We assume that $\alpha$ is a $C^1$-diffeomorphism of $D$ onto an open set $\Delta$ of $\mathbb{R}^n$, i.e. that $\alpha$ is bijective and that $\alpha$ and $\beta = \alpha^{-1}$ are of class $C^1$.

Figure 1.5. Transformation of a random vector

The random vector $Y = (Y_1, \ldots, Y_n) = \left( \alpha_1(X_1, \ldots, X_n), \ldots, \alpha_n(X_1, \ldots, X_n) \right)$ takes its values in $\Delta$ and we wish to determine $f_Y(y) 1_\Delta(y)$, its probability density.

PROPOSITION.–
$$f_Y(y) 1_\Delta(y) = f_X(\beta(y)) \left| \mathrm{Det}\, J_\beta(y) \right| 1_\Delta(y)$$

DEMONSTRATION.– Given $\Psi \in L^1(dy)$:
$$\mathrm{E}(\Psi(Y)) = \int_{\mathbb{R}^n} \Psi(y) f_Y(y) 1_\Delta(y)\, dy.$$
Furthermore:
$$\mathrm{E}(\Psi(Y)) = \mathrm{E}\Psi(\alpha(X)) = \int_{\mathbb{R}^n} \Psi(\alpha(x)) f_X(x) 1_D(x)\, dx.$$

By applying the change of variables theorem in multiple integrals and by denoting the Jacobian matrix of the mapping $\beta$ as $J_\beta(y)$, we arrive at:
$$= \int_{\mathbb{R}^n} \Psi(y) f_X(\beta(y)) \left| \mathrm{Det}\, J_\beta(y) \right| 1_\Delta(y)\, dy.$$

Finally, the equality:
$$\int_{\mathbb{R}^n} \Psi(y) f_Y(y) 1_\Delta(y)\, dy = \int_{\mathbb{R}^n} \Psi(y) f_X(\beta(y)) \left| \mathrm{Det}\, J_\beta(y) \right| 1_\Delta(y)\, dy$$
has validity for all $\Psi \in L^1(dy)$; we deduce from it, using Haar’s lemma, the formula we are looking for:
$$f_Y(y) 1_\Delta(y) = f_X(\beta(y)) \left| \mathrm{Det}\, J_\beta(y) \right| 1_\Delta(y)$$

IN PARTICULAR.– If $X$ is an r.v. and the mapping $\alpha : x \to y = \alpha(x)$ is a $C^1$-diffeomorphism of $D \subset \mathbb{R}$ onto $\Delta \subset \mathbb{R}$, the equality of the proposition becomes:
$$f_Y(y) 1_\Delta(y) = f_X(\beta(y)) \left| \beta'(y) \right| 1_\Delta(y)$$

EXAMPLE.– Let the random ordered pair be $Z = (X, Y)$ of probability density:
$$f_Z(x, y) = \frac{1}{x^2 y^2} 1_D(x, y) \quad \text{where} \quad D = \left]1, \infty\right[ \times \left]1, \infty\right[ \subset \mathbb{R}^2$$

Furthermore, we allow the $C^1$-diffeomorphism $\alpha$ of $D$ onto $\Delta = \left\{ (u, v) : \frac{1}{u} < v < u \right\}$ defined by:
$$\alpha : (x, y) \in D \to \left( u = \alpha_1(x, y) = xy,\ v = \alpha_2(x, y) = \frac{x}{y} \right) \in \Delta$$
$$\beta : (u, v) \in \Delta \to \left( x = \beta_1(u, v) = \sqrt{uv},\ y = \beta_2(u, v) = \sqrt{\frac{u}{v}} \right) \in D$$

$$J_\beta(u, v) = \frac{1}{2} \begin{pmatrix} \sqrt{\dfrac{v}{u}} & \sqrt{\dfrac{u}{v}} \\[6pt] \dfrac{1}{\sqrt{uv}} & -\dfrac{\sqrt{u}}{v^{3/2}} \end{pmatrix} \quad \text{and} \quad \left| \mathrm{Det}\, J_\beta(u, v) \right| = \frac{1}{2v}.$$

The vector $W = \left( U = XY,\ V = \dfrac{X}{Y} \right)$ thus admits the probability density:
$$f_W(u, v) 1_\Delta(u, v) = f_Z\left( \beta_1(u, v), \beta_2(u, v) \right) \left| \mathrm{Det}\, J_\beta(u, v) \right| 1_\Delta(u, v) = \frac{1}{\left( \sqrt{uv} \right)^2 \left( \sqrt{u/v} \right)^2} \cdot \frac{1}{2v}\, 1_\Delta(u, v) = \frac{1}{2 u^2 v} 1_\Delta(u, v)$$
NOTE.– Reciprocally, the vector $W = (U, V)$, of probability density $f_W(u, v) 1_\Delta(u, v)$ and whose components are dependent, is transformed by $\beta$ into the vector $Z = (X, Y)$, of probability density $f_Z(x, y) 1_D(x, y)$ and whose components are independent.
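The Jacobian computation and the density identity can be sketched numerically: below, a central finite-difference approximation of $\mathrm{Det}\, J_\beta$ is compared with the closed form $\frac{1}{2v}$ at one point of $\Delta$ (the point $(u_0, v_0) = (3, 1.5)$ and the step $h$ are arbitrary choices):

```python
import math

def beta(u, v):
    # inverse mapping beta = alpha^{-1}: (u, v) -> (x, y) = (sqrt(uv), sqrt(u/v))
    return math.sqrt(u * v), math.sqrt(u / v)

def det_jacobian_fd(u, v, h=1e-6):
    # central finite differences for the 2x2 Jacobian matrix of beta
    xu = (beta(u + h, v)[0] - beta(u - h, v)[0]) / (2 * h)
    xv = (beta(u, v + h)[0] - beta(u, v - h)[0]) / (2 * h)
    yu = (beta(u + h, v)[1] - beta(u - h, v)[1]) / (2 * h)
    yv = (beta(u, v + h)[1] - beta(u, v - h)[1]) / (2 * h)
    return xu * yv - xv * yu

u0, v0 = 3.0, 1.5                      # a point of the open set Delta
fd = abs(det_jacobian_fd(u0, v0))
closed = 1.0 / (2 * v0)

# density identity f_W(u,v) = f_Z(beta(u,v)) * |Det J_beta(u,v)|
x0, y0 = beta(u0, v0)
f_Z_val = 1.0 / (x0 ** 2 * y0 ** 2)
f_W_val = f_Z_val * fd
```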
1.3.2. Characteristic functions of a random vector

DEFINITION.– We call the characteristic function of a random vector $X^T = (X_1, \ldots, X_n)$ the mapping $\varphi_X : (u_1, \ldots, u_n) \in \mathbb{R}^n \to \varphi_X(u_1, \ldots, u_n)$ defined by:
$$\varphi_X(u_1, \ldots, u_n) = \mathrm{E} \exp\left( i \sum_{j=1}^n u_j X_j \right) = \int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n$$

(The definition of $\mathrm{E}\Psi(X_1, \ldots, X_n)$ is applied with
$$\Psi(X_1, \ldots, X_n) = \exp\left( i \sum_{j=1}^n u_j X_j \right)$$
and the integration theorem with respect to the image measure.)

$\varphi_X$ is thus the Fourier transform of $f_X$, which can be denoted $\varphi_X = \mathcal{F}(f_X)$.

(In analysis, it is preferable to write:
$$\mathcal{F}(f_X)(u_1, \ldots, u_n) = \int_{\mathbb{R}^n} \exp\left( -i \sum_{j=1}^n u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n.)$$

Some general properties of the Fourier transform:
– $\left| \varphi_X(u_1, \ldots, u_n) \right| \le \int_{\mathbb{R}^n} f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n = \varphi_X(0, \ldots, 0) = 1$;
– the mapping $(u_1, \ldots, u_n) \in \mathbb{R}^n \to \varphi_X(u_1, \ldots, u_n)$ is continuous;
– the mapping $\mathcal{F} : f_X \to \varphi_X$ is injective.

Very simple example

The random vector $X$ takes its values from within the hypercube $\Delta = [-1, 1]^n$ and it admits the probability density:
$$f_X(x_1, \ldots, x_n) = \frac{1}{2^n} 1_\Delta(x_1, \ldots, x_n)$$
(note that the components $X_j$ are independent).

$$\varphi_X(u_1, \ldots, u_n) = \frac{1}{2^n} \int_\Delta \exp\left( i(u_1 x_1 + \ldots + u_n x_n) \right) dx_1 \ldots dx_n = \frac{1}{2^n} \prod_{j=1}^n \int_{-1}^{+1} \exp\left( i u_j x_j \right) dx_j = \prod_{j=1}^n \frac{\sin u_j}{u_j}$$
where, in this last expression and thanks to the extension by continuity, we replace $\frac{\sin u_1}{u_1}$ by $1$ if $u_1 = 0$, $\frac{\sin u_2}{u_2}$ by $1$ if $u_2 = 0$, etc.
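For a single coordinate, this closed form is easy to verify by quadrature of $\mathrm{E}\,e^{iuX_j}$ for $X_j$ uniform on $[-1,1]$ (a sketch; the tested values of $u$ are arbitrary choices):

```python
import cmath
import math

def phi_numeric(u, n=20000):
    # midpoint quadrature of (1/2) * integral over [-1, 1] of exp(i u x) dx
    h = 2.0 / n
    return sum(cmath.exp(1j * u * (-1 + (k + 0.5) * h)) for k in range(n)) * h / 2

vals = {u: phi_numeric(u) for u in (0.5, 1.7, 3.0)}
```

The imaginary parts cancel by symmetry and each value matches $\sin u / u$.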
Fourier transform inversion

As shall be seen later in the work, there are excellent reasons (simplified calculations) for studying certain questions using characteristic functions rather than probability densities, but we often need to revert back to densities. The problem which arises is that of the invertibility of the Fourier transform $\mathcal{F}$, which is studied in specialized courses. It will be enough here to remember one condition.

PROPOSITION.– If $\int_{\mathbb{R}^n} \left| \varphi_X(u_1, \ldots, u_n) \right| du_1 \ldots du_n < \infty$ (i.e. $\varphi_X \in L^1(du_1 \ldots du_n)$), then $\mathcal{F}^{-1}$ exists and:
$$f_X(x_1, \ldots, x_n) = \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} \exp\left( -i \sum_{j=1}^n u_j x_j \right) \varphi_X(u_1, \ldots, u_n)\, du_1 \ldots du_n.$$
In addition, the mapping $(x_1, \ldots, x_n) \to f_X(x_1, \ldots, x_n)$ is continuous.

EXAMPLE.– Given a Gaussian r.v. $X \sim N\left( m, \sigma^2 \right)$, i.e. such that $f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2 \right)$, and assuming that $\sigma \ne 0$, we obtain:
$$\varphi_X(u) = \exp\left( ium - \frac{u^2 \sigma^2}{2} \right).$$

It is clear that $\varphi_X \in L^1(du)$ and:
$$f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \exp(-iux)\, \varphi_X(u)\, du.$$
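This inversion formula can be checked numerically on the Gaussian example: a truncated quadrature of $\frac{1}{2\pi}\int e^{-iux}\varphi_X(u)\,du$ recovers the density at a point (the parameters $m = 1$, $\sigma = 2$, the evaluation point and the truncation are arbitrary choices):

```python
import cmath
import math

m, sigma = 1.0, 2.0

def phi(u):
    # characteristic function of N(m, sigma^2)
    return cmath.exp(1j * u * m - (u * sigma) ** 2 / 2)

def f_inverted(x, U=10.0, n=20000):
    # truncated inversion integral (1/2pi) * int_{-U}^{U} exp(-iux) phi(u) du
    h = 2 * U / n
    s = sum(cmath.exp(-1j * (-U + (k + 0.5) * h) * x) * phi(-U + (k + 0.5) * h)
            for k in range(n))
    return (s * h / (2 * math.pi)).real

x0 = 0.5
recovered = f_inverted(x0)
direct = math.exp(-0.5 * ((x0 - m) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)
```

The truncation at $U = 10$ is harmless here because $\varphi_X$ decays like $e^{-2u^2}$.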
Properties and mappings of characteristic functions

1) Independence

PROPOSITION.– In order for the components $X_j$ of the random vector $X^T = (X_1, \ldots, X_n)$ to be independent, it is necessary and sufficient that:
$$\varphi_X(u_1, \ldots, u_n) = \prod_{j=1}^n \varphi_{X_j}(u_j).$$

DEMONSTRATION.– Necessary condition:
$$\varphi_X(u_1, \ldots, u_n) = \int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n$$
Thanks to the independence:
$$= \int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) \prod_{j=1}^n f_{X_j}(x_j)\, dx_1 \ldots dx_n = \prod_{j=1}^n \varphi_{X_j}(u_j).$$

Sufficient condition: we start from the hypothesis:
$$\int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n = \int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) \prod_{j=1}^n f_{X_j}(x_j)\, dx_1 \ldots dx_n$$
from which we deduce $f_X(x_1, \ldots, x_n) = \prod_{j=1}^n f_{X_j}(x_j)$, i.e. the independence,
since the Fourier transform $f_X \to \varphi_X$ is injective.

NOTE.– We must not confuse this result with that which concerns the sum of independent r.v., and which is stated in the following manner.

If $X_1, \ldots, X_n$ are independent r.v., then:
$$\varphi_{\sum_j X_j}(u) = \prod_{j=1}^n \varphi_{X_j}(u).$$

If there are for example $n$ independent random variables $X_1 \sim N\left( m_1, \sigma_1^2 \right), \ldots, X_n \sim N\left( m_n, \sigma_n^2 \right)$ and $n$ real constants $\lambda_1, \ldots, \lambda_n$, the note above enables us to determine the law of the random variable $\sum_{j=1}^n \lambda_j X_j$.

In effect the r.v. $\lambda_j X_j$ are independent and:
$$\varphi_{\sum_j \lambda_j X_j}(u) = \prod_{j=1}^n \varphi_{\lambda_j X_j}(u) = \prod_{j=1}^n \varphi_{X_j}\left( \lambda_j u \right) = \prod_{j=1}^n e^{i u \lambda_j m_j - \frac{1}{2} u^2 \lambda_j^2 \sigma_j^2} = e^{i u \sum_j \lambda_j m_j - \frac{1}{2} u^2 \sum_j \lambda_j^2 \sigma_j^2}$$
and thus:
$$\sum_{j=1}^n \lambda_j X_j \sim N\left( \sum_j \lambda_j m_j,\ \sum_j \lambda_j^2 \sigma_j^2 \right).$$
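This stability of the Gaussian family can be illustrated by simulation (the particular $m_j$, $\sigma_j$, $\lambda_j$, the seed and the sample size are arbitrary choices): the empirical mean and variance of $\sum_j \lambda_j X_j$ should approach $\sum_j \lambda_j m_j$ and $\sum_j \lambda_j^2 \sigma_j^2$.

```python
import random

random.seed(0)
params = [(1.0, 0.5), (-2.0, 1.0), (0.5, 2.0)]   # pairs (m_j, sigma_j)
lams = [2.0, 1.0, -1.0]                           # constants lambda_j

N = 100000
samples = [sum(lam * random.gauss(m, sig) for lam, (m, sig) in zip(lams, params))
           for _ in range(N)]

mean = sum(samples) / N
var = sum((s - mean) ** 2 for s in samples) / N

theory_mean = sum(lam * m for lam, (m, sig) in zip(lams, params))      # -0.5
theory_var = sum(lam ** 2 * sig ** 2 for lam, (m, sig) in zip(lams, params))  # 6.0
```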
2) Calculation of the moments of the components $X_j$ (up to the 2nd order, for example)

Let us assume $\varphi_X \in C^2\left( \mathbb{R}^n \right)$.

In applying Lebesgue’s theorem (whose hypotheses are immediately verifiable) once, we obtain:
$$\forall k = 1 \text{ to } n \quad \frac{\partial \varphi_X}{\partial u_k}(0, \ldots, 0) = \left( \int_{\mathbb{R}^n} i x_k \exp\left( i \sum_j u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n \right)_{(u_1 = 0, \ldots, u_n = 0)}$$
$$= i \int_{\mathbb{R}^n} x_k f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n = i\, \mathrm{E} X_k$$
i.e. $\mathrm{E} X_k = -i \dfrac{\partial \varphi_X}{\partial u_k}(0, \ldots, 0)$.

By applying this theorem a second time, we have:
$$\forall k, \ell \in \{1, 2, \ldots, n\} \quad \mathrm{E} X_k X_\ell = -\frac{\partial^2 \varphi_X}{\partial u_\ell\, \partial u_k}(0, \ldots, 0).$$
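These derivative formulas are easy to check by finite differences on the scalar Gaussian characteristic function (the values $m = 0.7$, $\sigma = 1.3$ and the step $h$ are arbitrary choices; the expected moments are $\mathrm{E}X = m$ and $\mathrm{E}X^2 = m^2 + \sigma^2$):

```python
import cmath

m, sigma = 0.7, 1.3

def phi(u):
    # characteristic function of N(m, sigma^2)
    return cmath.exp(1j * u * m - (u * sigma) ** 2 / 2)

h = 1e-4
# E X = -i * phi'(0), by a central first difference
d1 = (phi(h) - phi(-h)) / (2 * h)
EX = (-1j * d1).real
# E X^2 = -phi''(0), by a central second difference
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h ** 2
EX2 = (-d2).real
```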
1.4. Second order random variables and vectors

Let us begin by recalling the definitions and usual properties relative to 2nd order random variables.

DEFINITIONS.– Given $X \in L^2(dP)$ of probability density $f_X$, $\mathrm{E}X^2$ and $\mathrm{E}X$ have a value. We call variance of $X$ the expression:
$$\mathrm{Var}\, X = \mathrm{E}X^2 - (\mathrm{E}X)^2 = \mathrm{E}(X - \mathrm{E}X)^2$$

We call standard deviation of $X$ the expression $\sigma(X) = \sqrt{\mathrm{Var}\, X}$.

Now let two r.v. be $X$ and $Y \in L^2(dP)$. By using the scalar product $\langle \cdot, \cdot \rangle$ on $L^2(dP)$ defined in section 1.2 we have:
$$\mathrm{E}XY = \langle X, Y \rangle = \int_\Omega X(\omega) Y(\omega)\, dP(\omega)$$
and, if the vector $Z = (X, Y)$ admits the density $f_Z$, then:
$$\mathrm{E}XY = \int_{\mathbb{R}^2} x y\, f_Z(x, y)\, dx\, dy.$$

We have already established, by applying Schwarz’s inequality, that $\mathrm{E}XY$ actually has a value.

DEFINITION.– Given two r.v. $X, Y \in L^2(dP)$, we call the covariance of $X$ and $Y$ the expression $\mathrm{Cov}(X, Y) = \mathrm{E}XY - \mathrm{E}X\, \mathrm{E}Y$.

Some observations or easily verifiable properties:
– $\mathrm{Cov}(X, X) = \mathrm{Var}\, X$;
– $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$;
– if $\lambda$ is a real constant, $\mathrm{Var}(\lambda X) = \lambda^2\, \mathrm{Var}\, X$;
– if $X$ and $Y$ are two independent r.v., then $\mathrm{Cov}(X, Y) = 0$, but the reciprocal is not true;
– if $X_1, \ldots, X_n$ are pairwise independent r.v.:
$$\mathrm{Var}(X_1 + \ldots + X_n) = \mathrm{Var}\, X_1 + \ldots + \mathrm{Var}\, X_n$$

Correlation coefficients
The $\mathrm{Var}\, X_j$ (always positive) and the $\mathrm{Cov}(X_j, X_k)$ (positive or negative) can take extremely high algebraic values. Sometimes it is preferable to use the (normalized) “correlation coefficients”:
$$\rho(j, k) = \frac{\mathrm{Cov}(X_j, X_k)}{\sqrt{\mathrm{Var}\, X_j}\, \sqrt{\mathrm{Var}\, X_k}}$$
whose properties are as follows:

1) $\rho(j, k) \in [-1, 1]$.

In effect, let us assume (solely to simplify the expressions) that $X_j$ and $X_k$ are centered, and let us study the 2nd degree trinomial in $\lambda$:
$$T(\lambda) = \mathrm{E}\left( \lambda X_j - X_k \right)^2 = \lambda^2 \mathrm{E}X_j^2 - 2\lambda\, \mathrm{E}(X_j X_k) + \mathrm{E}X_k^2 \ge 0$$

$T(\lambda) \ge 0$ $\forall \lambda \in \mathbb{R}$ if and only if the discriminant $\Delta = \left( \mathrm{E}\, X_j X_k \right)^2 - \mathrm{E}X_j^2\, \mathrm{E}X_k^2$ is negative or zero, i.e. $\left( \mathrm{Cov}(X_j, X_k) \right)^2 \le \mathrm{Var}\, X_j\, \mathrm{Var}\, X_k$ (i.e. $\rho(j, k) \in [-1, 1]$). This is also Schwarz’s inequality.

Furthermore, we can make clear that $\rho(j, k) = \pm 1$ if and only if $\exists\, \lambda_0 \in \mathbb{R}$ such that $X_k = \lambda_0 X_j$ a.s. In effect, by replacing $X_k$ with $\lambda_0 X_j$ in the definition of $\rho(j, k)$, we obtain $\rho(j, k) = \pm 1$. Reciprocally, if $\rho(j, k) = 1$ (for example), that is to say if $\Delta = 0$, then $\exists\, \lambda_0 \in \mathbb{R}$ such that $X_k = \lambda_0 X_j$ a.s.

(If $X_j$ and $X_k$ are not centered, we replace in what has gone before $X_j$ by $X_j - \mathrm{E}X_j$ and $X_k$ by $X_k - \mathrm{E}X_k$.)

2) If $X_j$ and $X_k$ are independent, $\mathrm{E}X_j X_k = \mathrm{E}X_j\, \mathrm{E}X_k$, so $\mathrm{Cov}(X_j, X_k) = 0$ and $\rho(j, k) = 0$. However, the reciprocal is in general false, as is proven in the following example.

Let $\Theta$ be a uniform random variable on $[0, 2\pi[$, that is to say $f_\Theta(\theta) = \frac{1}{2\pi} 1_{[0, 2\pi[}(\theta)$.

In addition let two r.v. be $X_j = \sin \Theta$ and $X_k = \cos \Theta$. We can easily verify that $\mathrm{E}X_j$, $\mathrm{E}X_k$ and $\mathrm{E}X_j X_k$ are zero; thus $\mathrm{Cov}(X_j, X_k)$ and $\rho(j, k)$ are zero. However, $X_j^2 + X_k^2 = 1$ and the r.v. $X_j$ and $X_k$ are dependent.
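A simulation sketch of this “uncorrelated but dependent” pair (seed and sample size are arbitrary choices): the empirical covariance of $\sin\Theta$ and $\cos\Theta$ is near zero, while the deterministic constraint $X_j^2 + X_k^2 = 1$ holds exactly, exhibiting the dependence.

```python
import math
import random

random.seed(1)
N = 100000
xs, ys = [], []
for _ in range(N):
    theta = random.uniform(0, 2 * math.pi)
    xs.append(math.sin(theta))
    ys.append(math.cos(theta))

mx, my = sum(xs) / N, sum(ys) / N
cov = sum(x * y for x, y in zip(xs, ys)) / N - mx * my

# dependence shows up beyond covariance: sin^2 + cos^2 = 1 for every sample
max_dev = max(abs(x * x + y * y - 1) for x, y in zip(xs, ys))
```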
Second order random vectors

DEFINITION.– We say that a random vector $X^T = (X_1, \ldots, X_n)$ is second order if $X_j \in L^2(dP)$ $\forall j = 1$ to $n$.

DEFINITION.– Given a second order random vector $X^T = (X_1, \ldots, X_n)$, we call the covariance matrix of this vector the symmetric matrix:
$$\Gamma_X = \begin{pmatrix} \mathrm{Var}\, X_1 & \cdots & \mathrm{Cov}(X_1, X_n) \\ \vdots & & \vdots \\ \mathrm{Cov}(X_n, X_1) & \cdots & \mathrm{Var}\, X_n \end{pmatrix}$$

If we return to the definition of the expectation of a matrix of r.v., we see that we can express it as $\Gamma_X = \mathrm{E}\left[ (X - \mathrm{E}X)(X - \mathrm{E}X)^T \right]$.

We can also observe that $\Gamma_{X - \mathrm{E}X} = \Gamma_X$.

NOTE.– Second order complex random variables and vectors: we say that a complex random variable $X = X_1 + i X_2$ is second order if $X_1$ and $X_2 \in L^2(dP)$. The covariance of two centered second order complex random variables $X = X_1 + i X_2$ and $Y = Y_1 + i Y_2$ has a natural definition:
$$\mathrm{Cov}(X, Y) = \mathrm{E} X \overline{Y} = \mathrm{E}(X_1 + i X_2)(Y_1 - i Y_2) = \mathrm{E}(X_1 Y_1 + X_2 Y_2) + i\, \mathrm{E}(X_2 Y_1 - X_1 Y_2)$$
and the decorrelation condition is thus:
$$\mathrm{E}(X_1 Y_1 + X_2 Y_2) = \mathrm{E}(X_2 Y_1 - X_1 Y_2) = 0.$$

We say that a complex random vector $X^T = (X_1, \ldots, X_j, \ldots, X_n)$ is second order if every component $X_j = X_{1j} + i X_{2j}$, $j \in \{1, \ldots, n\}$, is a second order complex random variable. The covariance matrix of a second order complex centered random vector is defined by:
$$\Gamma_X = \begin{pmatrix} \mathrm{E}\left| X_1 \right|^2 & \cdots & \mathrm{E} X_1 \overline{X_n} \\ \vdots & & \vdots \\ \mathrm{E} X_n \overline{X_1} & \cdots & \mathrm{E}\left| X_n \right|^2 \end{pmatrix}$$

If we are not intimidated by its dense expression, we can express these definitions for non-centered complex random variables and vectors without any difficulty.

Let us return to real random vectors.

DEFINITION.– We call the symmetric matrix $\mathrm{E}\left[ X X^T \right]$ the second order moment matrix. If $X$ is centered, $\Gamma_X = \mathrm{E}\left[ X X^T \right]$.
Affine transformation of a second order vector

Let us denote the space of matrices with $p$ rows and $n$ columns as $M(p, n)$.

PROPOSITION.– Let $X^T = (X_1, \ldots, X_n)$ be a random vector of expectation vector $m^T = (m_1, \ldots, m_n)$ and of covariance matrix $\Gamma_X$. Furthermore, let there be a matrix $A \in M(p, n)$ and a certain vector $B^T = (b_1, \ldots, b_p)$.

The random vector $Y = A X + B$ possesses $A m + B$ as a mean value vector and $\Gamma_Y = A \Gamma_X A^T$ as a covariance matrix.

DEMONSTRATION.–
$$\mathrm{E}[Y] = \mathrm{E}[A X + B] = \mathrm{E}[A X] + B = A m + B.$$
In addition, noting for example that $\mathrm{E}\left[ (A X)^T \right] = \mathrm{E}\left[ X^T A^T \right] = m^T A^T$:
$$\Gamma_Y = \Gamma_{A X + B} = \Gamma_{A X} = \mathrm{E}\left[ A(X - m) \left( A(X - m) \right)^T \right] = A\, \mathrm{E}\left[ (X - m)(X - m)^T \right] A^T = A \Gamma_X A^T$$
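A small numerical sketch of $\Gamma_Y = A \Gamma_X A^T$ with plain-list matrices (the particular $\Gamma_X$, reused from the example later in this chapter, and the matrix $A$ are arbitrary choices):

```python
def matmul(A, B):
    # product of two matrices represented as lists of rows
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

Gamma_X = [[4.0, 2.0, 0.0],
           [2.0, 1.0, 0.0],
           [0.0, 0.0, 3.0]]
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]

# covariance of Y = A X + B (B does not change the covariance)
Gamma_Y = matmul(matmul(A, Gamma_X), transpose(A))
```

The result is a $2 \times 2$ symmetric matrix, as a covariance matrix must be.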
For what follows, we will also need the following easy result.

PROPOSITION.– Let $X^T = (X_1, \ldots, X_n)$ be a second order random vector, of covariance matrix $\Gamma_X$. Then:
$$\forall\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n \quad \Lambda^T \Gamma_X \Lambda = \mathrm{Var}\left( \sum_{j=1}^n \lambda_j X_j \right)$$

DEMONSTRATION.–
$$\Lambda^T \Gamma_X \Lambda = \sum_{j,k} \mathrm{Cov}(X_j, X_k)\, \lambda_j \lambda_k = \sum_{j,k} \mathrm{E}\left[ \left( X_j - \mathrm{E}X_j \right) \left( X_k - \mathrm{E}X_k \right) \right] \lambda_j \lambda_k$$
$$= \mathrm{E}\left( \sum_j \lambda_j \left( X_j - \mathrm{E}X_j \right) \right)^2 = \mathrm{E}\left( \sum_j \lambda_j X_j - \mathrm{E}\left( \sum_j \lambda_j X_j \right) \right)^2 = \mathrm{Var}\left( \sum_j \lambda_j X_j \right)$$

CONSEQUENCE.– $\forall \Lambda \in \mathbb{R}^n$ we always have $\Lambda^T \Gamma_X \Lambda \ge 0$.

Let us recall in this context the following algebraic definitions:
– if $\Lambda^T \Gamma_X \Lambda > 0$ $\forall\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \ne (0, \ldots, 0)$, we say that $\Gamma_X$ is positive definite;
– if $\exists\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \ne (0, \ldots, 0)$ such that $\Lambda^T \Gamma_X \Lambda = 0$, we say that $\Gamma_X$ is positive semi-definite.

NOTE.– In this work the notion of vector appears in two different contexts and, in order to avoid confusion, let us return for a moment to some vocabulary definitions.
1) We call a random vector of $\mathbb{R}^n$ (or random vector with values in $\mathbb{R}^n$) every n-tuple of random variables $X^T = (X_1, \ldots, X_n)$ (also written $X = (X_1, \ldots, X_n)$).

$X$ is a vector in the sense that for each $\omega \in \Omega$ we obtain an n-tuple $X(\omega) = (X_1(\omega), \ldots, X_n(\omega))$ which belongs to the vector space $\mathbb{R}^n$.

2) Every random vector $X = (X_1, \ldots, X_n)$ of $\mathbb{R}^n$ of which all the components $X_j$ belong to $L^2(dP)$ we call a second order random vector.

In this context, the components $X_j$ are themselves vectors, since they belong to the vector space $L^2(dP)$.

Thus, in what follows, when we speak of linear independence, of scalar product or of orthogonality, it is necessary to point out clearly to which vector space, $\mathbb{R}^n$ or $L^2(dP)$, we are referring.
1.5. Linear independence of vectors of $L^2(dP)$

DEFINITION.– We say that $n$ vectors $X_1, \ldots, X_n$ of $L^2(dP)$ are linearly independent if
$$\lambda_1 X_1 + \ldots + \lambda_n X_n = 0 \ \text{a.s.} \ \Rightarrow \ \lambda_1 = \ldots = \lambda_n = 0$$
(here $0$ is the zero vector of $L^2(dP)$).

DEFINITION.– We say that the $n$ vectors $X_1, \ldots, X_n$ of $L^2(dP)$ are linearly dependent if $\exists\, \lambda_1, \ldots, \lambda_n$, not all zero, such that $\lambda_1 X_1 + \ldots + \lambda_n X_n = 0$ a.s.

In particular, $X_1, \ldots, X_n$ will be linearly dependent on an event $A$ of positive probability if $\exists\, \lambda_1, \ldots, \lambda_n$, not all zero, such that $\lambda_1 X_1(\omega) + \ldots + \lambda_n X_n(\omega) = 0$ $\forall \omega \in A$.

EXAMPLE.– Given the three measurable mappings $X_1, X_2, X_3 : \left( [0, 2], \mathcal{B}([0, 2]), d\omega \right) \to \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right)$ defined by:
$$X_1(\omega) = \omega, \quad X_2(\omega) = 2\omega, \quad X_3(\omega) = 3\omega \quad \text{on } [0, 1[$$
and
$$X_1(\omega) = e^{-(\omega - 1)}, \quad X_2(\omega) = 2, \quad X_3(\omega) = -2\omega + 5 \quad \text{on } [1, 2[$$

Figure 1.6. Three random variables

The three mappings are evidently measurable and belong to $L^2(d\omega)$, so they are 3 vectors of $L^2(d\omega)$.

These 3 vectors are linearly dependent on the set $A = [0, 1[$, of probability $\frac{1}{2}$:
$$-5 X_1(\omega) + 1\, X_2(\omega) + 1\, X_3(\omega) = 0 \quad \forall \omega \in A$$
Covariance matrix and linear independence

Let $\Gamma_X$ be the covariance matrix of $X = (X_1, \ldots, X_n)$, a second order vector.

1) If $\Gamma_X$ is positive definite: $X_1^* = X_1 - \mathrm{E}X_1, \ldots, X_n^* = X_n - \mathrm{E}X_n$ are then linearly independent vectors of $L^2(dP)$.

In effect:
$$\Lambda^T \Gamma_X \Lambda = \mathrm{Var}\left( \sum_j \lambda_j X_j \right) = \mathrm{E}\left( \sum_j \lambda_j X_j - \mathrm{E}\left( \sum_j \lambda_j X_j \right) \right)^2 = \mathrm{E}\left( \sum_j \lambda_j \left( X_j - \mathrm{E}X_j \right) \right)^2$$
so $\sum_j \lambda_j \left( X_j - \mathrm{E}X_j \right) = 0$ a.s. gives $\Lambda^T \Gamma_X \Lambda = 0$, which implies, since $\Gamma_X$ is positive definite, that $\lambda_1 = \ldots = \lambda_n = 0$.

We can also say that $X_1^*, \ldots, X_n^*$ generate a hyperplane of $L^2(dP)$ of dimension $n$ that we can represent as $H\left( X_1^*, \ldots, X_n^* \right)$.

In particular, if the r.v. $X_1, \ldots, X_n$ are pairwise uncorrelated (thus a fortiori if they are stochastically independent), we have:
$$\Lambda^T \Gamma_X \Lambda = \sum_j \mathrm{Var}\, X_j \cdot \lambda_j^2 = 0 \ \Rightarrow\ \lambda_1 = \ldots = \lambda_n = 0$$
thus, in this case, $\Gamma_X$ is positive definite and $X_1^*, \ldots, X_n^*$ are again linearly independent.
NOTE.– If $\mathrm{E}\left[ X X^T \right]$, the second order moment matrix, is positive definite, then $X_1, \ldots, X_n$ are linearly independent vectors of $L^2(dP)$.

2) If now $\Gamma_X$ is positive semi-definite: $X_1^* = X_1 - \mathrm{E}X_1, \ldots, X_n^* = X_n - \mathrm{E}X_n$ are then linearly dependent vectors of $L^2(dP)$.

In effect:
$$\exists\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \ne (0, \ldots, 0) \ \text{such that} \ \Lambda^T \Gamma_X \Lambda = \mathrm{Var}\left( \sum_j \lambda_j X_j \right) = 0$$
that is to say:
$$\exists\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \ne (0, \ldots, 0) \ \text{such that} \ \sum_j \lambda_j \left( X_j - \mathrm{E}X_j \right) = 0 \ \text{a.s.}$$

EXAMPLE.– We consider $X^T = (X_1, X_2, X_3)$, a second order random vector of $\mathbb{R}^3$, admitting $m^T = (3, -1, 2)$ as mean value vector and
$$\Gamma_X = \begin{pmatrix} 4 & 2 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 3 \end{pmatrix}$$
as covariance matrix. We state that $\Gamma_X$ is positive semi-definite. In taking for example $\Lambda^T = (1, -2, 0)$, we verify that $\Lambda^T \Gamma_X \Lambda = 0$, that is $\mathrm{Var}\left( X_1 - 2X_2 + 0\, X_3 \right) = 0$, and thus $X_1^* - 2 X_2^* = 0$ a.s.

Figure 1.7. Vector $X^*(\omega)$ and vector $X^*$: when $\omega$ describes $\Omega$, the 2nd order random vector $X^*(\omega) = \left( X_1^*(\omega), X_2^*(\omega), X_3^*(\omega) \right)^T$ of $\mathbb{R}^3$ describes the vertical plane $(\Pi)$ passing through the straight line $(\Delta)$ of equation $x_1 = 2 x_2$, while the vectors $X_1^*, X_2^*, X_3^*$ of $L^2(dP)$ generate $H\left( X_1^*, X_2^*, X_3^* \right)$, a subspace of $L^2(dP)$ of dimension 2.
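The semi-definiteness claim of this example is a one-line quadratic-form computation; a sketch with plain lists:

```python
Gamma = [[4.0, 2.0, 0.0],
         [2.0, 1.0, 0.0],
         [0.0, 0.0, 3.0]]

def quad_form(lam, G):
    # Lambda^T Gamma Lambda
    n = len(G)
    return sum(lam[i] * G[i][j] * lam[j] for i in range(n) for j in range(n))

zero_direction = quad_form([1.0, -2.0, 0.0], Gamma)  # vanishes: X1* - 2 X2* = 0 a.s.
other = quad_form([1.0, 0.0, 0.0], Gamma)            # Var X1 = 4 > 0
```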
1.6. Conditional expectation (concerning random vectors with density function)

Given that $X$ is a real r.v. and $Y = (Y_1, \ldots, Y_n)$ is a real random vector, we do not assume that $X$ and $Y$ are independent, and we assume that the vector $Z = (X, Y_1, \ldots, Y_n)$ admits a probability density $f_Z(x, y_1, \ldots, y_n)$. In this section, we will use as required the notations $Y$ or $(Y_1, \ldots, Y_n)$, and $y$ or $(y_1, \ldots, y_n)$. Let us recall to begin with that $f_Y(y) = \int_{\mathbb{R}} f_Z(x, y)\, dx$.

Conditional probability

We want, for all $B \in \mathcal{B}(\mathbb{R})$ and all $(y_1, \ldots, y_n) \in \mathbb{R}^n$, to define and calculate the probability that $X \in B$ knowing that $Y_1 = y_1, \ldots, Y_n = y_n$. We denote this quantity $P\left( (X \in B) \mid (Y_1 = y_1) \cap \ldots \cap (Y_n = y_n) \right)$ or more simply $P\left( X \in B \mid y_1, \ldots, y_n \right)$.

Take note that we cannot, as in the case of discrete variables, write:
$$P\left( (X \in B) \mid (Y_1 = y_1) \cap \ldots \cap (Y_n = y_n) \right) = \frac{P\left( (X \in B) \cap (Y_1 = y_1) \cap \ldots \cap (Y_n = y_n) \right)}{P\left( (Y_1 = y_1) \cap \ldots \cap (Y_n = y_n) \right)}$$
The quotient here is indeterminate and equals $\frac{0}{0}$.

For $j = 1$ to $n$, let us note $I_j = \left[ y_j, y_j + h \right[$. We write:
$$P\left( X \in B \mid y_1, \ldots, y_n \right) = \lim_{h \to 0} P\left( (X \in B) \mid (Y_1 \in I_1) \cap \ldots \cap (Y_n \in I_n) \right)$$
$$= \lim_{h \to 0} \frac{P\left( (X \in B) \cap (Y_1 \in I_1) \cap \ldots \cap (Y_n \in I_n) \right)}{P\left( (Y_1 \in I_1) \cap \ldots \cap (Y_n \in I_n) \right)} = \lim_{h \to 0} \frac{\int_B dx \int_{I_1 \times \ldots \times I_n} f_Z(x, u_1, \ldots, u_n)\, du_1 \ldots du_n}{\int_{I_1 \times \ldots \times I_n} f_Y(u_1, \ldots, u_n)\, du_1 \ldots du_n} = \int_B \frac{f_Z(x, y)}{f_Y(y)}\, dx$$
It is thus natural to say that the conditional density of the random variable $X$ knowing $(y_1, \ldots, y_n)$ is the function:
$$x \to f(x \mid y) = \frac{f_Z(x, y)}{f_Y(y)} \quad \text{if } f_Y(y) \ne 0$$

We can disregard the set of $y$ for which $f_Y(y) = 0$, for its measure (in $\mathbb{R}^n$) is zero. Let us state that $A = \left\{ (x, y) \mid f_Y(y) = 0 \right\}$; we observe:
$$P\left( (X, Y) \in A \right) = \int_A f_Z(x, y)\, dx\, dy = \int_{\{y \mid f_Y(y) = 0\}} du \int_{\mathbb{R}} f_Z(x, u)\, dx = \int_{\{y \mid f_Y(y) = 0\}} f_Y(u)\, du = 0,$$
so $(X, Y)$ almost surely avoids the set where $f_Y$ vanishes.

Finally, we have obtained a family (indexed by the $y$ verifying $f_Y(y) > 0$) of probability densities $f(x \mid y)$ $\left( \int_{\mathbb{R}} f(x \mid y)\, dx = 1 \right)$.
Conditional expectation

Let the random vector always be $Z = (X, Y_1, \ldots, Y_n)$ of density $f_Z(x, y)$, and let $f(x \mid y)$ always be the probability density of $X$ knowing $y_1, \ldots, y_n$.

DEFINITION.– Given a measurable mapping $\Psi : \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right) \to \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right)$, under the hypothesis $\int_{\mathbb{R}} \left| \Psi(x) \right| f(x \mid y)\, dx < \infty$ (that is to say $\Psi \in L^1\left( f(x \mid y)\, dx \right)$), we call the conditional expectation of $\Psi(X)$ knowing $(y_1, \ldots, y_n)$ the expectation of $\Psi(X)$ calculated with the conditional density $f(x \mid y) = f(x \mid y_1, \ldots, y_n)$, and we write:
$$\mathrm{E}\left( \Psi(X) \mid y_1, \ldots, y_n \right) = \int_{\mathbb{R}} \Psi(x) f(x \mid y)\, dx$$

$\mathrm{E}\left( \Psi(X) \mid y_1, \ldots, y_n \right)$ is a certain value, depending on $(y_1, \ldots, y_n)$, and we denote it $\hat{g}(y_1, \ldots, y_n)$ (this notation will be of use in Chapter 4).

DEFINITION.– We call the conditional expectation of $\Psi(X)$ with respect to $Y = (Y_1, \ldots, Y_n)$ the r.v. $\hat{g}(Y_1, \ldots, Y_n) = \mathrm{E}\left( \Psi(X) \mid Y_1, \ldots, Y_n \right)$ (also denoted $\mathrm{E}\left( \Psi(X) \mid Y \right)$) which takes the value $\hat{g}(y_1, \ldots, y_n) = \mathrm{E}\left( \Psi(X) \mid y_1, \ldots, y_n \right)$ when $(Y_1, \ldots, Y_n)$ takes the value $(y_1, \ldots, y_n)$.

NOTE.– As we do not distinguish between two r.v. that are equal a.s., we will still call the conditional expectation of $\Psi(X)$ with respect to $Y_1, \ldots, Y_n$ any r.v. $\hat{g}'(Y_1, \ldots, Y_n)$ such that $\hat{g}'(Y_1, \ldots, Y_n) = \hat{g}(Y_1, \ldots, Y_n)$ almost surely; that is to say, except possibly on a set $A$ such that $P(A) = \int_A f_Y(y)\, dy = 0$.

PROPOSITION.– If $\Psi(X) \in L^1(dP)$ (i.e. $\int_{\mathbb{R}} \left| \Psi(x) \right| f_X(x)\, dx < \infty$), then $\hat{g}(Y) = \mathrm{E}\left( \Psi(X) \mid Y \right) \in L^1(dP)$ (i.e. $\int_{\mathbb{R}^n} \left| \hat{g}(y) \right| f_Y(y)\, dy < \infty$).
DEMONSTRATION.–
$$\int_{\mathbb{R}^n} \left| \hat{g}(y) \right| f_Y(y)\, dy = \int_{\mathbb{R}^n} \left| \mathrm{E}\left( \Psi(X) \mid y \right) \right| f_Y(y)\, dy \le \int_{\mathbb{R}^n} f_Y(y)\, dy \int_{\mathbb{R}} \left| \Psi(x) \right| f(x \mid y)\, dx$$

Using Fubini’s theorem:
$$= \int_{\mathbb{R}^{n+1}} \left| \Psi(x) \right| f_Y(y) f(x \mid y)\, dx\, dy = \int_{\mathbb{R}^{n+1}} \left| \Psi(x) \right| f_Z(x, y)\, dx\, dy = \int_{\mathbb{R}} \left| \Psi(x) \right| dx \int_{\mathbb{R}^n} f_Z(x, y)\, dy = \int_{\mathbb{R}} \left| \Psi(x) \right| f_X(x)\, dx < \infty$$
Principal properties of conditional expectation

The hypotheses of integrability having been verified:
1) $\mathrm{E}\left( \mathrm{E}\left( \Psi(X) \mid Y \right) \right) = \mathrm{E}\left( \Psi(X) \right)$;
2) if $X$ and $Y$ are independent, $\mathrm{E}\left( \Psi(X) \mid Y \right) = \mathrm{E}\left( \Psi(X) \right)$;
3) $\mathrm{E}\left( \Psi(X) \mid X \right) = \Psi(X)$;
4) successive conditional expectations:
$$\mathrm{E}\left( \mathrm{E}\left( \Psi(X) \mid Y_1, \ldots, Y_n, Y_{n+1} \right) \mid Y_1, \ldots, Y_n \right) = \mathrm{E}\left( \Psi(X) \mid Y_1, \ldots, Y_n \right);$$
5) linearity:
$$\mathrm{E}\left( \lambda_1 \Psi_1(X) + \lambda_2 \Psi_2(X) \mid Y \right) = \lambda_1 \mathrm{E}\left( \Psi_1(X) \mid Y \right) + \lambda_2 \mathrm{E}\left( \Psi_2(X) \mid Y \right).$$

The demonstrations, which in general are easy, may be found in the exercises. Let us note in particular that, as far as the first property is concerned, it is sufficient to re-write the demonstration of the last proposition after stripping it of absolute values. The chapter on quadratic mean estimation will make the notion of conditional expectation more concrete.

EXAMPLE.– Let $Z = (X, Y)$ be a random couple of probability density $f_Z(x, y) = 6 x y (2 - x - y) 1_\Delta(x, y)$ where $\Delta$ is the square $[0, 1] \times [0, 1]$.

Let us calculate $\mathrm{E}(X \mid Y)$. We have successively:
– $f(y) = \int_0^1 f(x, y)\, dx = \int_0^1 6 x y (2 - x - y)\, dx$, i.e. $f(y) = \left( 4y - 3y^2 \right) 1_{[0,1]}(y)$;
– $f(x \mid y) = \dfrac{f(x, y)}{f(y)} = \dfrac{6 x (2 - x - y)}{4 - 3y} 1_{[0,1]}(x)$ with $y \in [0, 1]$;
– $\mathrm{E}(X \mid y) = \int_0^1 x f(x \mid y)\, dx \cdot 1_{[0,1]}(y) = \dfrac{5 - 4y}{2(4 - 3y)} 1_{[0,1]}(y)$.

Thus:
$$\mathrm{E}(X \mid Y) = \frac{5 - 4Y}{2(4 - 3Y)} 1_{[0,1]}(Y).$$

We also have:
$$\mathrm{E}(X) = \mathrm{E}\left( \mathrm{E}(X \mid Y) \right) = \int_0^1 \mathrm{E}(X \mid y) f(y)\, dy = \int_0^1 \frac{5 - 4y}{2(4 - 3y)} \left( 4y - 3y^2 \right) dy = \frac{7}{12}.$$
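The three steps of this worked example can be reproduced by quadrature (the grid size and the test point $y_0 = 0.3$ are arbitrary choices): the marginal $f(y_0)$, the conditional expectation $\mathrm{E}(X \mid y_0)$ and the overall mean $\mathrm{E}X = 7/12$ all come out of midpoint sums over the unit square.

```python
def f_Z(x, y):
    # joint density 6xy(2 - x - y) on the unit square
    return 6 * x * y * (2 - x - y)

n = 400
h = 1.0 / n
grid = [(i + 0.5) * h for i in range(n)]

# E X by midpoint quadrature over the unit square
EX = sum(x * f_Z(x, y) for x in grid for y in grid) * h * h

# E(X | y0) from the conditional density, against the closed form (5-4y)/(2(4-3y))
y0 = 0.3
f_y0 = sum(f_Z(x, y0) for x in grid) * h          # marginal: 4*y0 - 3*y0**2
EX_given_y0 = sum(x * f_Z(x, y0) for x in grid) * h / f_y0
closed = (5 - 4 * y0) / (2 * (4 - 3 * y0))
```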
1.7. Exercises for Chapter 1

Exercise 1.1.

Let $X$ be an r.v. of distribution function:
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{2} & \text{if } 0 \le x \le 2 \\ 1 & \text{if } x > 2 \end{cases}$$

Calculate the probabilities:
$$P\left( X^2 \le X \right);\quad P\left( X \le 2X^2 \right);\quad P\left( X + X^2 \le \tfrac{3}{4} \right).$$

Exercise 1.2.

Given the random vector $Z = (X, Y)$ of probability density $f_Z(x, y) = K \dfrac{1}{y x^4} 1_\Delta(x, y)$, where $K$ is a real constant and where
$$\Delta = \left\{ (x, y) \in \mathbb{R}^2 \mid x, y > 0;\ y \le x;\ y > \frac{1}{x} \right\},$$
determine the constant $K$ and the densities $f_X$ and $f_Y$ of the r.v. $X$ and $Y$.

Exercise 1.3.

Let $X$ and $Y$ be two independent random variables of uniform density on the interval $[0, 1]$:
1) Determine the probability density $f_Z$ of the r.v. $Z = X + Y$;
2) Determine the probability density $f_U$ of the r.v. $U = XY$.
Exercise 1.4.
Let X and Y be two independent r.v. of uniform density on the interval
[0,1] .
Determine the probability density fU of the r.v. U = X Y . Solution 1.4.
y
xy = 1
1
xy < u
A
B
0
u
x
1
U takes its values in [ 0,1] Let FU be the distribution function of
U:
– if
u ≤ 0 FU ( u ) = 0 ; if u ≥ 1 FU ( u ) = 1 ;
– if
u ∈ ]0,1[ : FU ( u ) = P (U ≤ u ) = P ( X Y ≤ u ) = P ( ( X , Y ) ∈ Bu ) ;
where Bu = A ∪ B is the cross-hatched area of the figure. Thus FU ( u ) =
∫B
u
f( X ,Y ) ( x, y ) dx dy = ∫
Bu
f X ( x ) fY ( y ) dx dy
Random Vectors
1
u
u
0
= ∫ dx dy + ∫ dx ∫ A
x
dy = u + u ∫
1 dx
= u (1 − n u )
x
u
59
.
⎛ 0 if x ∈ ]-∞,0] ∪ [1, ∞[ ⎜− nu x ∈ ]0,1[ ⎝
Finally fU ( u ) = FU′ ( u ) = ⎜
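A quick simulation (an illustration under the stated uniform assumptions; sample size arbitrary) confirms the distribution function $F_U(u) = u(1 - \ln u)$:

```python
import numpy as np

rng = np.random.default_rng(1)
u_prod = rng.random(500_000) * rng.random(500_000)   # U = XY, X and Y uniform on [0,1]

for u0 in (0.1, 0.5, 0.9):
    emp = np.mean(u_prod <= u0)          # empirical F_U(u0)
    theo = u0 * (1 - np.log(u0))         # u (1 - ln u)
    print(abs(emp - theo) < 0.005)       # True for each u0
```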
Exercise 1.5.

Under consideration are three r.v. $X, Y, Z$, independent and of the same law $N(0,1)$, that is to say admitting the same density $\dfrac{1}{\sqrt{2\pi}}\exp\Big(-\dfrac{x^2}{2}\Big)$.

Determine the probability density $f_U$ of the real random variable (r.r.v.) $U = \big(X^2 + Y^2 + Z^2\big)^{1/2}$.

Solution 1.5.

Let $F_U$ be the distribution function of $U$:

– if $u \le 0$: $F_U(u) = P\big(\big(X^2+Y^2+Z^2\big)^{1/2} \le u\big) = 0$;

– if $u > 0$: $F_U(u) = P\big((X,Y,Z) \in S_u\big)$, where $S_u$ is the ball of $\mathbb{R}^3$ centered on $(0,0,0)$ and of radius $u$:

$$F_U(u) = \iiint_{S_u} f_{(X,Y,Z)}(x,y,z)\,dx\,dy\,dz = \frac{1}{(2\pi)^{3/2}} \iiint_{S_u} \exp\Big(-\frac{1}{2}\big(x^2+y^2+z^2\big)\Big)\,dx\,dy\,dz$$

and, by passing to spherical coordinates:

$$= \frac{1}{(2\pi)^{3/2}} \int_0^{2\pi} d\theta \int_0^{\pi} \sin\varphi\,d\varphi \int_0^u r^2 \exp\Big(-\frac{r^2}{2}\Big)\,dr = \frac{1}{(2\pi)^{3/2}}\; 2\pi \cdot 2 \int_0^u r^2 \exp\Big(-\frac{r^2}{2}\Big)\,dr$$

and, as $r \to r^2\exp\big(-\frac{r^2}{2}\big)$ is continuous:

$$f_U(u) = \begin{cases} 0 & \text{if } u < 0 \\ F_U'(u) = \dfrac{2}{\sqrt{2\pi}}\; u^2 \exp\Big(-\dfrac{u^2}{2}\Big) & \text{if } u \ge 0 \end{cases}$$

Exercise 1.6.
1a) Verify that $\forall a > 0$,

$$f_a(x) = \frac{1}{\pi}\,\frac{a}{a^2 + x^2}$$

is a probability density (called Cauchy's density).

1b) Verify that the corresponding characteristic function is $\varphi_X(u) = \exp\big(-a\,|u|\big)$.

1c) Given a family of independent r.v. $X_1, \dots, X_n$ of density $f_a$, find the density of the r.v.

$$Y_n = \frac{X_1 + \dots + X_n}{n}.$$

What do we notice?

2) By considering Cauchy random variables, verify that we can have the equality $\varphi_{X+Y}(u) = \varphi_X(u)\,\varphi_Y(u)$ with $X$ and $Y$ dependent.
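The point of 1c) is that $Y_n$ has the same Cauchy density $f_a$ as each $X_j$: averaging does not concentrate the law. A simulation sketch (the scale $a = 2$ and the sample sizes are arbitrary choices) checks that $P(|Y_n| \le a) \approx 1/2$, exactly as for a single Cauchy($a$) draw, whose quartiles are $\pm a$:

```python
import numpy as np

rng = np.random.default_rng(2)
a = 2.0
# Cauchy(a) draws via the inverse CDF: x = a tan(pi (U - 1/2))
x = a * np.tan(np.pi * (rng.random((100_000, 25)) - 0.5))
y = x.mean(axis=1)              # Y_25: mean of 25 i.i.d. Cauchy(a) draws

frac = np.mean(np.abs(y) <= a)  # quartiles of Cauchy(a) are +/- a
print(abs(frac - 0.5) < 0.01)   # True: no law of large numbers here
```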
Exercise 1.7.

Show that $M = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 1 & 2 \\ 3 & 2 & 1 \end{pmatrix}$ is not a covariance matrix.

Show that $M = \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ is a covariance matrix.

Verify from this example that the property of "not being correlated with" for a family of r.v. is not transitive.
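A numerical way to see this — assuming only the standard fact that a symmetric matrix is a covariance matrix if and only if it is positive semi-definite:

```python
import numpy as np

M1 = np.array([[1, 2, 3], [2, 1, 2], [3, 2, 1]], float)
M2 = np.array([[1, 0.5, 0], [0.5, 1, 0], [0, 0, 1]], float)

print(np.linalg.eigvalsh(M1).min() < 0)    # True: a negative eigenvalue,
                                           # so M1 is not a covariance matrix
print(np.linalg.eigvalsh(M2).min() >= 0)   # True: eigenvalues 0.5, 1, 1.5
```

In $M_2$, $X_1$ is uncorrelated with $X_3$ and $X_3$ with $X_2$, yet $\mathrm{Cov}(X_1,X_2) = 0.5 \neq 0$ — the announced non-transitivity.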
Exercise 1.8.

Show that the random vector $X^T = (X_1, X_2, X_3)$ of expectation $EX^T = (7, 0, 1)$ and of covariance matrix

$$\Gamma_X = \begin{pmatrix} 10 & -1 & 4 \\ -1 & 1 & -1 \\ 4 & -1 & 2 \end{pmatrix}$$

belongs almost surely (a.s.) to a plane of $\mathbb{R}^3$.
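The claim can be watched numerically (a verification sketch, not the requested proof): $\Gamma_X$ is singular, and an eigenvector $n$ for the eigenvalue 0 satisfies $\mathrm{Var}(n^T X) = n^T\Gamma_X n = 0$, so $n^T(X - m) = 0$ a.s. — the equation of the plane:

```python
import numpy as np

gamma = np.array([[10, -1, 4], [-1, 1, -1], [4, -1, 2]], float)
m = np.array([7.0, 0.0, 1.0])

w, V = np.linalg.eigh(gamma)
n = V[:, 0]                          # eigenvector of the smallest eigenvalue
print(abs(w[0]) < 1e-9)              # True: gamma is singular
print(np.allclose(gamma @ n, 0))     # True: Var(n . X) = n^T gamma n = 0,
                                     # so X stays a.s. in the plane n.x = n.m
```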
Exercise 1.9.

We are considering the random vector $U = (X,Y,Z)$ of probability density $f_U(x,y,z) = K\,xyz\,(3-x-y-z)\,1_\Delta(x,y,z)$, where $\Delta$ is the cube $[0,1]\times[0,1]\times[0,1]$.

1) Calculate the constant $K$.

2) Calculate the conditional probability $P\Big( X \in \Big[\dfrac{1}{4}, \dfrac{1}{2}\Big] \,\Big|\, Y = \dfrac{1}{2},\, Z = \dfrac{3}{4} \Big)$.

3) Determine the conditional expectation $E\big(X^2 \mid Y, Z\big)$.
Chapter 2

Gaussian Vectors

2.1. Some reminders regarding random Gaussian vectors

DEFINITION.– We say that a real r.v. $X$ is Gaussian, of expectation $m$ and of variance $\sigma^2$, if its law of probability $P_X$:

– admits the density $f_X(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\dfrac{(x-m)^2}{2\sigma^2}\Big)$ if $\sigma^2 \neq 0$ (using a double integral calculation, for example, we can verify that $\int_{\mathbb{R}} f_X(x)\,dx = 1$);

– is the Dirac measure $\delta_m$ if $\sigma^2 = 0$.

[Figure: on the left, the bell-shaped density $f_X$ centered at $m$; on the right, the Dirac measure $\delta_m$ at $m$.]

Figure 2.1. Gaussian density and Dirac measure

If $\sigma^2 \neq 0$, we say that $X$ is a non-degenerate Gaussian r.v.

If $\sigma^2 = 0$, we say that $X$ is a degenerate Gaussian r.v.; $X$ is in this case a "certain r.v." taking the value $m$ with probability 1.

$EX = m$, $\mathrm{Var}\,X = \sigma^2$: this can be verified easily by using the probability distribution function. As we have already observed, in order to specify that an r.v. $X$ is Gaussian of expectation $m$ and of variance $\sigma^2$, we will write $X \sim N(m, \sigma^2)$.

Characteristic function of $X \sim N(m, \sigma^2)$

Let us begin by determining the characteristic function of $X_0 \sim N(0,1)$:

$$\varphi_{X_0}(u) = E\big(e^{iuX_0}\big) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{iux}\,e^{-x^2/2}\,dx.$$

We can easily see that the theorem of derivation under the sum sign can be applied:

$$\varphi'_{X_0}(u) = \frac{i}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{iux}\,x\,e^{-x^2/2}\,dx.$$

Following this by integration by parts:

$$= \frac{i}{\sqrt{2\pi}}\Bigg( \Big[-e^{iux}\,e^{-x^2/2}\Big]_{-\infty}^{+\infty} + \int_{-\infty}^{+\infty} iu\,e^{iux}\,e^{-x^2/2}\,dx \Bigg) = -u\,\varphi_{X_0}(u).$$

The resolution of the differential equation $\varphi'_{X_0}(u) = -u\,\varphi_{X_0}(u)$ with the condition $\varphi_{X_0}(0) = 1$ leads us to the solution $\varphi_{X_0}(u) = e^{-u^2/2}$.

For $X \sim N(m, \sigma^2)$ with $\sigma^2 \neq 0$, we can write:

$$\varphi_X(u) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{+\infty} e^{iux}\,e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2}\,dx.$$

By changing the variable $y = \dfrac{x-m}{\sigma}$, which brings us back to the preceding case, we obtain $\varphi_X(u) = e^{\,ium - \frac{1}{2}u^2\sigma^2}$.

If $\sigma^2 = 0$, that is to say if $P_X = \delta_m$, then $\varphi_X(u) = e^{\,ium}$ (Fourier transform in the sense of the distribution $\delta_m$), so that in all cases ($\sigma^2 \neq 0$ or $= 0$):

$$\varphi_X(u) = e^{\,ium - \frac{1}{2}u^2\sigma^2}.$$

NOTE.– Given the r.v. $X \sim N(m, \sigma^2)$, we can write:

$$f_X(x) = \frac{1}{(2\pi)^{1/2}\,(\sigma^2)^{1/2}} \exp\Big(-\frac{1}{2}\,(x-m)\big(\sigma^2\big)^{-1}(x-m)\Big)$$

$$\varphi_X(u) = \exp\Big(ium - \frac{1}{2}\,u\,\sigma^2\,u\Big)$$

These are the expressions that we will find again for Gaussian vectors.
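The closed form of $\varphi_X$ is easy to check against the empirical characteristic function of simulated data — a sketch in which $m$, $\sigma$ and $u$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
m, sigma, u = 1.5, 2.0, 0.7
x = rng.normal(m, sigma, 500_000)

emp = np.mean(np.exp(1j * u * x))                    # empirical E e^{iuX}
theo = np.exp(1j * u * m - 0.5 * u**2 * sigma**2)    # e^{ium - u^2 sigma^2 / 2}
print(abs(emp - theo) < 0.01)                        # True
```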
2.2. Definition and characterization of Gaussian vectors

DEFINITION.– We say that a real random vector $X^T = (X_1, \dots, X_n)$ is Gaussian if $\forall (a_0, a_1, \dots, a_n) \in \mathbb{R}^{n+1}$ the r.v. $a_0 + \sum_{j=1}^n a_j X_j$ is Gaussian (in this definition we can assume that $a_0 = 0$, and this will be sufficient in general).

A random vector $X^T = (X_1, \dots, X_n)$ is thus not Gaussian if we can find an $n$-tuple $(a_1,\dots,a_n) \neq (0,\dots,0)$ such that the r.v. $\sum_{j=1}^n a_j X_j$ is not Gaussian, and for this it suffices to find an $n$-tuple such that $\sum_{j=1}^n a_j X_j$ is not an r.v. of density.

EXAMPLE.– We allow ourselves an r.v. $X \sim N(0,1)$ and a discrete r.v. $\varepsilon$, independent of $X$ and such that:

$$P(\varepsilon = 1) = \frac{1}{2} \quad \text{and} \quad P(\varepsilon = -1) = \frac{1}{2}.$$

We state that $Y = \varepsilon X$. By using what has already been discussed, we will show through an exercise that, although $Y$ is an r.v. $N(0,1)$, the vector $(X,Y)$ is not a Gaussian vector.

PROPOSITION.– In order for a random vector $X^T = (X_1,\dots,X_n)$ of expectation $m^T = (m_1,\dots,m_n)$ and of covariance matrix $\Gamma_X$ to be Gaussian, it is necessary and sufficient that its characteristic function (c.f.) $\varphi_X$ be defined by:

$$\varphi_X(u_1,\dots,u_n) = \exp\Big( i\sum_{j=1}^n u_j m_j - \frac{1}{2}\,u^T \Gamma_X u \Big) \quad \big(\text{where } u^T = (u_1,\dots,u_n)\big)$$
DEMONSTRATION.–

$$\varphi_X(u_1,\dots,u_n) = E\exp\Big(i\sum_{j=1}^n u_j X_j\Big) = E\exp\Big(i\cdot 1\cdot\sum_{j=1}^n u_j X_j\Big)$$

= characteristic function of the r.v. $\sum_{j=1}^n u_j X_j$ at the value 1, that is to say $\varphi_{\sum_j u_j X_j}(1)$; and

$$\varphi_{\sum_j u_j X_j}(1) = \exp\Big( i\cdot 1\cdot E\Big(\sum_{j=1}^n u_j X_j\Big) - \frac{1}{2}\,1^2\,\mathrm{Var}\Big(\sum_{j=1}^n u_j X_j\Big) \Big)$$

if and only if the r.v. $\sum_{j=1}^n u_j X_j$ is Gaussian.

Finally, since $\mathrm{Var}\big(\sum_{j=1}^n u_j X_j\big) = u^T \Gamma_X u$, we arrive indeed at:

$$\varphi_X(u_1,\dots,u_n) = \exp\Big( i\sum_{j=1}^n u_j m_j - \frac{1}{2}\,u^T \Gamma_X u \Big).$$

NOTATION.– We can see that the characteristic function of a Gaussian vector $X$ is entirely determined when we know its expectation vector $m$ and its covariance matrix $\Gamma_X$. If $X$ is such a vector, we will write $X \sim N_n(m, \Gamma_X)$.

PARTICULAR CASE.– $m = 0$ and $\Gamma_X = I_n$ (unit matrix): $X \sim N_n(0, I_n)$ is called a standard Gaussian vector.
2.3. Results relative to independence

PROPOSITION.–

1) if the vector $X^T = (X_1,\dots,X_n)$ is Gaussian, all its components $X_j$ are thus Gaussian r.v.;

2) if the components $X_j$ of a random vector $X$ are Gaussian and independent, the vector $X$ is thus also Gaussian.

DEMONSTRATION.–

1) We write $X_j = 0 + \dots + 0 + X_j + 0 + \dots + 0$.

2)

$$\varphi_X(u_1,\dots,u_n) = \prod_{j=1}^n \varphi_{X_j}(u_j) = \prod_{j=1}^n \exp\Big(iu_j m_j - \frac{1}{2}\,u_j^2\sigma_j^2\Big),$$

which we can still express as $\exp\big(i\sum_{j=1}^n u_j m_j - \frac{1}{2}u^T\Gamma_X u\big)$ with

$$\Gamma_X = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix}.$$

NOTE.– As we will see later, "the components $X_j$ are Gaussian and independent" is not a necessary condition for the random vector $X^T = (X_1,\dots,X_j,\dots,X_n)$ to be Gaussian.

PROPOSITION.– If $X^T = (X_1,\dots,X_j,\dots,X_n)$ is a Gaussian vector of covariance $\Gamma_X$, we have the equivalence: $\Gamma_X$ diagonal $\Leftrightarrow$ the r.v. $X_j$ are independent.

DEMONSTRATION.–

$$\Gamma_X = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix} \Leftrightarrow \varphi_X(u_1,\dots,u_n) = \prod_{j=1}^n \varphi_{X_j}(u_j)$$

This is a necessary and sufficient condition of independence of the r.v. $X_j$.

Let us sum up these two simple results schematically:

– if $X^T = (X_1,\dots,X_j,\dots,X_n)$ is a Gaussian vector, its components $X_j$ are Gaussian r.v.; conversely, Gaussian components make $X$ a Gaussian vector if (sufficient condition) the r.v. $X_j$ are independent, but not in general — even if $\Gamma_X$ is diagonal;

– when $X$ is a Gaussian vector, the r.v. $X_j$ are independent $\Leftrightarrow \Gamma_X$ is diagonal; without the hypothesis "$X$ Gaussian", only "the r.v. $X_j$ are independent $\Rightarrow \Gamma_X$ is diagonal" remains.

NOTE.– A Gaussian vector $X^T = (X_1,\dots,X_j,\dots,X_n)$ is evidently of the 2nd order. In effect each component $X_j$ is Gaussian and belongs to $L^2(dP)$:

$$\int x^2\,\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-m)^2}{2\sigma^2}}\,dx < \infty.$$

We can generalize the last proposition and replace the Gaussian r.v. by Gaussian vectors.
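The example announced above ($Y = \varepsilon X$, treated again in Exercise 2.3) can be watched numerically — a sketch with arbitrary sample sizes: both marginals are $N(0,1)$ and $\mathrm{Cov}(X,Y) = 0$, yet $X+Y$ has an atom at 0, which rules out joint Gaussianity:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 200_000)
eps = rng.choice([-1.0, 1.0], 200_000)     # independent sign, P = 1/2 each
y = eps * x                                # Y is again N(0,1)

print(abs(np.mean(x * y)) < 0.02)               # True: Cov(X, Y) ~ 0
print(abs(np.mean(x + y == 0.0) - 0.5) < 0.01)  # True: P(X + Y = 0) = 1/2,
                                                # impossible for a Gaussian vector
```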
Let us consider for example three random vectors:

$$X^T = (X_1,\dots,X_n)\ ;\quad Y^T = (Y_1,\dots,Y_p)\ ;\quad Z^T = (X_1,\dots,X_n,\, Y_1,\dots,Y_p)$$

and state

$$\Gamma_Z = \begin{pmatrix} \Gamma_X & \mathrm{Cov}(X,Y) \\ \mathrm{Cov}(Y,X) & \Gamma_Y \end{pmatrix}$$

where $\mathrm{Cov}(X,Y)$ is the matrix of the coefficients $\mathrm{Cov}(X_j, Y_\ell)$ and where $\mathrm{Cov}(Y,X) = \big(\mathrm{Cov}(X,Y)\big)^T$.

PROPOSITION.– If $Z^T = (X_1,\dots,X_n,Y_1,\dots,Y_p)$ is a Gaussian vector of covariance matrix $\Gamma_Z$, we have the equivalence:

$\mathrm{Cov}(X,Y) =$ zero matrix $\Leftrightarrow X$ and $Y$ are two independent Gaussian vectors.

DEMONSTRATION.–

$$\Gamma_Z = \begin{pmatrix} \Gamma_X & 0 \\ 0 & \Gamma_Y \end{pmatrix} \Leftrightarrow \varphi_Z(u_1,\dots,u_n,u_{n+1},\dots,u_{n+p}) = \exp\Big( i\sum_{j=1}^{n+p} u_j m_j - \frac{1}{2}\,u^T \begin{pmatrix} \Gamma_X & 0 \\ 0 & \Gamma_Y \end{pmatrix} u \Big) = \varphi_X(u_1,\dots,u_n)\,\varphi_Y(u_{n+1},\dots,u_{n+p}),$$

which is a necessary and sufficient condition for the independence of the vectors $X$ and $Y$.

NOTE.– Given $Z^T = (X^T, Y^T, U^T, \dots)$ where $X, Y, U, \dots$ are r.v. or random vectors:

– "$Z$ is a Gaussian vector" is a stronger hypothesis than "$X$ Gaussian and $Y$ Gaussian and $U$ Gaussian, etc.";

– "$X$ Gaussian and $Y$ Gaussian and $U$ Gaussian, etc., and their covariances (or covariance matrices) are zero" does not imply that $Z^T = (X^T, Y^T, U^T, \dots)$ is a Gaussian vector.
vector. EXAMPLE.– Given that
X , Y , Z three r.v. ∼ N ( 0,1) , find the law of the vector
W T = (U , V ) or U = X + Y + Z and V = λ X − Y with λ ∈
( X ,Y , Z ) a, b ∈ aU + bV = ( a + λ b ) X + ( a − λ b ) Y + aZ W T = (U , V ) is a Gaussian vector.
the
independence,
the
vector
To determine this entirely we must know m = EW
W ∼ N 2 ( m, ΓW ) .
is
: because of
Gaussian
is a Gaussian r.v. Thus
and ΓW and we will have
It follows on easily:
EW T = ( EU , EV ) = ( 0, 0 )
⎛
and ΓW = ⎜
Var U
⎝ Cov (V ,U )
and
Cov (U ,V ) ⎞ ⎛ 3 λ −1 ⎞ ⎟=⎜ Var V ⎠ ⎝ λ − 1 λ 2 + 1⎟⎠
In effect:
Var U = EU 2 = E ( X + Y + Z ) = EX 2 + EY 2 + EZ 2 = 3 2
Var V = EV 2 = E ( λ X − Y ) = λ 2 EX 2 + EY 2 = λ 2 + 1 2
Cov (U , V ) = E ( X + Y + Z )( λ X − Y ) = λ EX 2 − EY 2 = λ − 1
72
Discrete Stochastic Processes and Optimal Filtering
Particular case:
λ = 1 ⇔ ΓW
diagonal ⇔ U and V are independent.
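The computed matrix $\Gamma_W$ is easy to confirm by simulation — a sketch in which $\lambda = 2$ and the sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 2.0
x, y, z = rng.normal(size=(3, 300_000))
u, v = x + y + z, lam * x - y

emp = np.cov(np.vstack([u, v]))                    # empirical covariance of (U, V)
theo = np.array([[3.0, lam - 1.0], [lam - 1.0, lam**2 + 1.0]])
print(np.allclose(emp, theo, atol=0.05))           # True
```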
2.4. Affine transformation of a Gaussian vector

We can generalize to vectors the following result on Gaussian r.v.:

If $Y \sim N(m, \sigma^2)$ then $\forall a, b \in \mathbb{R}$, $aY + b \sim N\big(am + b,\ a^2\sigma^2\big)$.

By modifying the notation a little, $N(am+b,\, a^2\sigma^2)$ becoming $N(am+b,\, a\,\mathrm{Var}Y\,a)$, we can already imagine how this result is going to extend to Gaussian vectors.

PROPOSITION.– Given a Gaussian vector $Y \sim N_n(m, \Gamma_Y)$, $A$ a matrix belonging to $M(p,n)$ and a certain vector $B \in \mathbb{R}^p$, then $AY + B$ is a Gaussian vector $\sim N_p\big(Am + B,\ A\Gamma_Y A^T\big)$.

DEMONSTRATION.–

$$AY + B = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{\ell 1} & \cdots & a_{\ell n} \\ \vdots & & \vdots \\ a_{p1} & \cdots & a_{pn} \end{pmatrix} \begin{pmatrix} Y_1 \\ \vdots \\ Y_i \\ \vdots \\ Y_n \end{pmatrix} + \begin{pmatrix} b_1 \\ \vdots \\ b_\ell \\ \vdots \\ b_p \end{pmatrix} = \begin{pmatrix} \vdots \\ \sum_{i=1}^n a_{\ell i} Y_i + b_\ell \\ \vdots \end{pmatrix}$$

– this is indeed a Gaussian vector (of dimension $p$) because every linear combination of its components is an affine combination of the r.v. $Y_1,\dots,Y_i,\dots,Y_n$, and by hypothesis $Y^T = (Y_1,\dots,Y_n)$ is a Gaussian vector;

– furthermore, we have seen that if $Y$ is a 2nd order vector:

$$E(AY+B) = A\,EY + B = Am + B \quad \text{and} \quad \Gamma_{AY+B} = A\,\Gamma_Y A^T.$$

EXAMPLE.– Given $(n+1)$ independent r.v. $Y_j \sim N(\mu, \sigma^2)$, $j = 0$ to $n$, it emerges that $Y^T = (Y_0, Y_1,\dots,Y_n) \sim N_{n+1}(m, \Gamma_Y)$ with $m^T = (\mu,\dots,\mu)$ and

$$\Gamma_Y = \begin{pmatrix} \sigma^2 & & 0 \\ & \ddots & \\ 0 & & \sigma^2 \end{pmatrix}.$$

Furthermore, given the new r.v. $X_j$ defined by:

$$X_1 = Y_0 + Y_1,\ \dots,\ X_n = Y_{n-1} + Y_n,$$

the vector $X^T = (X_1,\dots,X_n)$ is Gaussian, for

$$\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 \\ 0 & 1 & 1 & \cdots & 0 \\ & & \ddots & \ddots & \\ 0 & \cdots & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} Y_0 \\ \vdots \\ Y_n \end{pmatrix};$$

more precisely, following the preceding proposition, $X \sim N_n\big(Am,\ A\Gamma_Y A^T\big)$.

NOTE.– If in this example we assume $\mu = 0$ and $\sigma^2 = 1$, we are certain that the vector $X$ is Gaussian even though its components $X_j$ are not independent. In effect, we have for example $\mathrm{Cov}(X_1, X_2) \neq 0$ because:

$$EX_1X_2 = E(Y_0+Y_1)(Y_1+Y_2) = EY_1^2 = 1 \quad \text{and} \quad EX_1\,EX_2 = E(Y_0+Y_1)\,E(Y_1+Y_2) = 0.$$
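The identity $\Gamma_{AY+B} = A\Gamma_Y A^T$ can be checked on this sliding-sum example — a sketch with $n = 4$, $\Gamma_Y = I$ and an arbitrary sample size:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = np.zeros((n, n + 1))
for k in range(n):
    A[k, k] = A[k, k + 1] = 1.0           # X_k = Y_{k-1} + Y_k

Y = rng.normal(size=(n + 1, 400_000))     # Gamma_Y = identity
X = A @ Y                                 # Gaussian with covariance A A^T

print(np.allclose(np.cov(X), A @ A.T, atol=0.05))  # True: 2 on the diagonal,
                                                   # 1 on the first off-diagonals
```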
2.5. The existence of Gaussian vectors

NOTATION.– $u^T = (u_1,\dots,u_n)$, $x^T = (x_1,\dots,x_n)$ and $m^T = (m_1,\dots,m_n)$.

We are interested here in the existence of Gaussian vectors, that is to say the existence of laws of probability on $\mathbb{R}^n$ having Fourier transforms of the form:

$$\exp\Big( i\sum_j u_j m_j - \frac{1}{2}\,u^T\Gamma u \Big)$$

PROPOSITION.– Given a vector $m^T = (m_1,\dots,m_n)$ and a matrix $\Gamma \in M(n,n)$, symmetric and positive semi-definite, there is a unique probability $P_X$ on $\mathbb{R}^n$ of Fourier transform:

$$\int_{\mathbb{R}^n} \exp\Big(i\sum_{j=1}^n u_j x_j\Big)\,dP_X(x_1,\dots,x_n) = \exp\Big(i\sum_{j=1}^n u_j m_j - \frac{1}{2}\,u^T\Gamma u\Big)$$

In addition:

1) if $\Gamma$ is invertible, $P_X$ admits on $\mathbb{R}^n$ the density:

$$f_X(x_1,\dots,x_n) = \frac{1}{(2\pi)^{n/2}\,(\mathrm{Det}\,\Gamma)^{1/2}}\,\exp\Big(-\frac{1}{2}\,(x-m)^T\Gamma^{-1}(x-m)\Big);$$

2) if $\Gamma$ is non-invertible (of rank $r < n$), the r.v. $X_1 - m_1, \dots, X_n - m_n$ are linearly dependent. We can still say that $\omega \to X(\omega) - m$ a.s. takes its values on a hyperplane $(\Pi)$ of $\mathbb{R}^n$, or that the probability $P_X$ loads a hyperplane $(\Pi)$ and does not admit a density function on $\mathbb{R}^n$.
DEMONSTRATION.–

1) Let us begin by recalling a result from linear algebra:

$\Gamma$ being symmetric, we can find an orthonormal basis of $\mathbb{R}^n$ formed from eigenvectors of $\Gamma$; let us call $(V_1,\dots,V_n)$ this basis. By denoting the eigenvalues of $\Gamma$ as $\lambda_j$, we thus have $\Gamma V_j = \lambda_j V_j$, where the $\lambda_j$ are solutions of the equation $\mathrm{Det}(\Gamma - \lambda I) = 0$.

Some consequences. Let us first note

$$\Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix} \quad \text{and} \quad V = (V_1,\dots,V_j,\dots,V_n)$$

(where the $V_j$ are column vectors).

– $\Gamma V_j = \lambda_j V_j$ ($j = 1$ to $n$) equates to $\Gamma V = V\Lambda$ and, the matrix $V$ being orthogonal ($VV^T = V^TV = I$), $\Gamma = V\Lambda V^T$.

– Let us demonstrate that if, in addition, $\Gamma$ is invertible, the $\lambda_j$ are $\neq 0$ and $\geq 0$, and thus the $\lambda_j$ are $> 0$.

The $\lambda_j$ are $\neq 0$: in effect, $\Gamma$ being invertible,

$$0 \neq \mathrm{Det}\,\Gamma = \mathrm{Det}\,\Lambda = \prod_{j=1}^n \lambda_j.$$

The $\lambda_j$ are $\geq 0$: let us consider in effect the quadratic form $u \to u^T\Gamma u$ ($\geq 0$ since $\Gamma$ is positive semi-definite). In the basis $(V_1,\dots,V_n)$, $u$ is written $(\bar u_1,\dots,\bar u_n)$ with $\bar u_j = \langle V_j, u\rangle$, and the quadratic form is written

$$u \to (\bar u_1,\dots,\bar u_n)\,\Lambda \begin{pmatrix} \bar u_1 \\ \vdots \\ \bar u_n \end{pmatrix} = \sum_j \lambda_j\,\bar u_j^2 \geq 0,$$

from which we get the predicted result.

Let us now demonstrate the proposition.

2) Let us look at the general case, that is to say in which $\Gamma$ is not necessarily invertible (recall simply that the eigenvalues $\lambda_j$ are $\geq 0$).
Let us consider $n$ independent r.v. $Y_j \sim N(0, \lambda_j)$. We know that the vector $Y^T = (Y_1,\dots,Y_n)$ is Gaussian, as well as the vector $X = VY + m$ (proposition from the preceding section); more precisely, $X \sim N\big(m,\ \Gamma = V\Lambda V^T\big)$.

The existence of Gaussian vectors of given expectation and of given covariance matrix is thus clearly proven.

Furthermore, we have seen that if $X$ is $N_n(m, \Gamma)$, its characteristic function (Fourier transform of its law) is $\exp\big(i\sum_j u_j m_j - \frac{1}{2}u^T\Gamma u\big)$. We thus in fact have:

$$\int_{\mathbb{R}^n} \exp\Big(i\sum_j u_j x_j\Big)\,dP_X(x_1,\dots,x_n) = \exp\Big(i\sum_j u_j m_j - \frac{1}{2}\,u^T\Gamma u\Big).$$

Uniqueness of the law: this ensues from the injectivity of the Fourier transformation.
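This existence proof doubles as the standard simulation recipe: diagonalize $\Gamma$, draw independent $Y_j \sim N(0, \lambda_j)$, and set $X = VY + m$. A sketch, using the $\Gamma$ of the example further below taken with $q = 1$ (seed and sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
m = np.array([1.0, 0.0, -2.0])
gamma = np.array([[3.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])       # symmetric, positive semi-definite

lam, V = np.linalg.eigh(gamma)            # gamma = V diag(lam) V^T
# Y_j ~ N(0, lam_j) independent, then X = V Y + m ~ N(m, gamma)
Y = np.sqrt(np.clip(lam, 0, None))[:, None] * rng.normal(size=(3, 500_000))
X = V @ Y + m[:, None]

print(np.allclose(X.mean(axis=1), m, atol=0.02))   # True
print(np.allclose(np.cov(X), gamma, atol=0.05))    # True
```

The `clip` guards against tiny negative round-off in the eigenvalues when $\Gamma$ is only semi-definite.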
3) Let us conclude by clarifying the role played by the invertibility of $\Gamma$.

a) If $\Gamma$ is invertible, all the eigenvalues $\lambda_j\ (= \mathrm{Var}\,Y_j)$ are $> 0$ and the vector $Y^T = (Y_1,\dots,Y_n)$ admits the density:

$$f_Y(y_1,\dots,y_n) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi\lambda_j}}\exp\Big(-\frac{y_j^2}{2\lambda_j}\Big) = \frac{1}{(2\pi)^{n/2}\big(\prod_{j=1}^n\lambda_j\big)^{1/2}}\exp\Big(-\frac{1}{2}\,y^T\Lambda^{-1}y\Big)$$

As far as the vector $X = VY + m$ is concerned: the affine transformation $y \to x = Vy + m$ is invertible, has $y = V^{-1}(x-m)$ as its inverse and $\mathrm{Det}\,V = \pm 1$ ($V$ orthogonal) as its Jacobian. Furthermore $\prod_{j=1}^n \lambda_j = \mathrm{Det}\,\Lambda = \mathrm{Det}\,\Gamma$.

By applying the theorem on the transformation of a random vector by a $C^1$-diffeomorphism, we obtain the probability density of the vector $X$:

$$f_X(x_1,\dots,x_n) = f_Y\big(V^{-1}(x-m)\big) = \frac{1}{(2\pi)^{n/2}(\mathrm{Det}\,\Gamma)^{1/2}}\exp\Big(-\frac{1}{2}\,(x-m)^T\big(V^T\big)^{-1}\Lambda^{-1}V^{-1}(x-m)\Big)$$

and, as $\Gamma = V\Lambda V^T$:

$$f_X(x_1,\dots,x_n) = \frac{1}{(2\pi)^{n/2}(\mathrm{Det}\,\Gamma)^{1/2}}\exp\Big(-\frac{1}{2}\,(x-m)^T\Gamma^{-1}(x-m)\Big)$$

b) If rank $\Gamma = r < n$, let us rank the eigenvalues of $\Gamma$ in decreasing order: $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_r > 0$ and $\lambda_{r+1} = 0, \dots, \lambda_n = 0$.

Then $Y_{r+1} = 0$ a.s., $\dots$, $Y_n = 0$ a.s. and, almost surely, $X = VY + m$ takes its values in $(\Pi)$, the hyperplane of $\mathbb{R}^n$ image of $\varepsilon = \{y = (y_1,\dots,y_r,0,\dots,0)\}$ by the affine mapping $y \to Vy + m$.

NOTE.– Given a random vector
$X^T = (X_1,\dots,X_n) \sim N_n(m, \Gamma_X)$, and supposing that we have to calculate an expression of the form:

$$E\Psi(X) = \int_{\mathbb{R}^n} \Psi(x)\,f_X(x)\,dx = \int_{\mathbb{R}^n} \Psi(x_1,\dots,x_n)\,f_X(x_1,\dots,x_n)\,dx_1\dots dx_n$$

In general the density $f_X$, and consequently the proposed calculation, are rendered complex by the dependence of the r.v. $X_1,\dots,X_n$.

Let $\lambda_1,\dots,\lambda_n$ be the eigenvalues of $\Gamma_X$ and $V$ the orthogonal matrix which diagonalizes $\Gamma_X$.

We have $X = VY + m$ with $Y^T = (Y_1,\dots,Y_n)$, the $Y_j$ being independent and $\sim N(0,\lambda_j)$, and the proposed calculation can be carried out under the simpler form:

$$E\Psi(X) = E\Psi(VY+m) = \int_{\mathbb{R}^n} \Psi(Vy+m)\Bigg(\prod_{j=1}^n \frac{1}{\sqrt{2\pi\lambda_j}}\,e^{-\frac{y_j^2}{2\lambda_j}}\Bigg)dy_1\dots dy_n$$
EXAMPLE.–

1) The expression of a normal case: let the Gaussian vector $X^T = (X_1, X_2) \sim N_2(0, \Gamma_X)$ where $\Gamma_X = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ with $\rho \in\, ]-1,1[$.

$\Gamma_X$ is invertible and

$$f_X(x_1,x_2) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Big(-\frac{1}{2(1-\rho^2)}\big(x_1^2 - 2\rho\,x_1 x_2 + x_2^2\big)\Big).$$

[Figure: the bell-shaped graph of $f_X$ over the $(x_1,x_2)$ plane, of maximum $\frac{1}{2\pi\sqrt{1-\rho^2}}$; the intersections of the graph of $f_X$ with the horizontal planes are the ellipses $\varepsilon$ of equation $x_1^2 - 2\rho\,x_1x_2 + x_2^2 = C$ (constants).]

Figure 2.2. Example of the density of a Gaussian vector
2) We give ourselves the Gaussian vector $X^T = (X_1, X_2, X_3)$ with:

$$m^T = (1, 0, -2) \quad \text{and} \quad \Gamma = \begin{pmatrix} 3 & 0 & q \\ 0 & 1 & 0 \\ q & 0 & 1 \end{pmatrix}.$$

Because of Schwarz's inequality $\big(\mathrm{Cov}(X_1,X_3)\big)^2 \le \mathrm{Var}\,X_1\,\mathrm{Var}\,X_3$, we must suppose $|q| \le \sqrt{3}$.

We wish to study the density $f_X(x_1,x_2,x_3)$ of vector $X$.

Eigenvalues of $\Gamma$:

$$\mathrm{Det}(\Gamma - \lambda I) = \begin{vmatrix} 3-\lambda & 0 & q \\ 0 & 1-\lambda & 0 \\ q & 0 & 1-\lambda \end{vmatrix} = (1-\lambda)\big(\lambda^2 - 4\lambda + 3 - q^2\big).$$

From which we obtain the eigenvalues, ranked in decreasing order:

$$\lambda_1 = 2 + \sqrt{1+q^2}\ ,\quad \lambda_2 = 1\ ,\quad \lambda_3 = 2 - \sqrt{1+q^2}$$

a) if $|q| < \sqrt{3}$ then $\lambda_1 > \lambda_2 > \lambda_3 > 0$, $\Gamma$ is invertible and $X$ has a probability density in $\mathbb{R}^3$ given by:

$$f_X(x_1,x_2,x_3) = \frac{1}{(2\pi)^{3/2}(\lambda_1\lambda_2\lambda_3)^{1/2}}\exp\Big(-\frac{1}{2}\,(x-m)^T\Gamma^{-1}(x-m)\Big);$$

b) if $q = \sqrt{3}$, then $\lambda_1 = 4$, $\lambda_2 = 1$, $\lambda_3 = 0$ and $\Gamma$ is non-invertible, of rank 2.

Let us find the orthogonal matrix $V$ which diagonalizes $\Gamma$, by writing $\Gamma V_j = \lambda_j V_j$. For $\lambda_1 = 4$, $\lambda_2 = 1$, $\lambda_3 = 0$ we obtain respectively the eigenvectors

$$V_1 = \begin{pmatrix} \sqrt{3}/2 \\ 0 \\ 1/2 \end{pmatrix},\quad V_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},\quad V_3 = \begin{pmatrix} -1/2 \\ 0 \\ \sqrt{3}/2 \end{pmatrix}$$

and the orthogonal matrix $V = (V_1\ V_2\ V_3)$ $\big(VV^T = V^TV = I\big)$.

Given the independent r.v. $Y_1 \sim N(0,4)$ and $Y_2 \sim N(0,1)$, and given the r.v. $Y_3 = 0$ a.s., we have:

$$X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} \sqrt{3}/2 & 0 & -1/2 \\ 0 & 1 & 0 \\ 1/2 & 0 & \sqrt{3}/2 \end{pmatrix}\begin{pmatrix} Y_1 \\ Y_2 \\ 0 \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \\ -2 \end{pmatrix}$$

or, by calling $X^* = (X_1^*, X_2^*, X_3^*)^T$ the vector $X$ after centering:

$$X_1^* = \frac{\sqrt{3}}{2}\,Y_1\ ,\quad X_2^* = Y_2\ ,\quad X_3^* = \frac{1}{2}\,Y_1$$

We can further deduce that $X_3^* = \dfrac{1}{\sqrt{3}}\,X_1^*$.
[Figure: the plane $(\Pi)$ in the axes $(x_1, x_2, x_3)$, containing the axis $0x_2$ and the vector $U$.]

Figure 2.3. The plane $(\Pi)$ is the support of the probability $P_X$

Thus, the vector $X^*$ describes almost surely the plane $(\Pi)$ containing the axis $0x_2$ and the vector $U^T = (\sqrt{3}, 0, 1)$. The plane $(\Pi)$ is the support of the probability $P_X$.
Probability and conditional expectation

Let us develop a simple case as an example. Let the Gaussian vector $Z^T = (X,Y) \sim N_2(0, \Gamma_Z)$. In stating

$$\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}\,X\ \mathrm{Var}\,Y}}$$

and $\mathrm{Var}\,X = \sigma_1^2$, $\mathrm{Var}\,Y = \sigma_2^2$, the density of $Z$ is written:

$$f_Z(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\Bigg(-\frac{1}{2(1-\rho^2)}\Big(\frac{x^2}{\sigma_1^2} - 2\rho\,\frac{xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\Big)\Bigg).$$

Conditional density of $X$ knowing $Y = y$:

$$f(x \mid y) = \frac{f_Z(x,y)}{f_Y(y)} = \frac{f_Z(x,y)}{\int f_Z(x,y)\,dx} = \frac{\dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\Bigg[-\dfrac{1}{2(1-\rho^2)}\Big(\dfrac{x^2}{\sigma_1^2} - 2\rho\,\dfrac{xy}{\sigma_1\sigma_2} + \dfrac{y^2}{\sigma_2^2}\Big)\Bigg]}{\dfrac{1}{\sqrt{2\pi}\,\sigma_2}\exp\Bigg[-\dfrac{1}{2}\,\dfrac{y^2}{\sigma_2^2}\Bigg]}$$

$$= \frac{1}{\sigma_1\sqrt{2\pi}\sqrt{1-\rho^2}}\exp\Bigg[-\frac{1}{2\sigma_1^2(1-\rho^2)}\Big(x - \rho\,\frac{\sigma_1}{\sigma_2}\,y\Big)^2\Bigg]$$

$x$ being a real variable and $y$ a fixed numeric value, we can recognize a Gaussian density. More precisely: the conditional law of $X$, knowing $Y = y$, is $N\Big(\rho\,\dfrac{\sigma_1}{\sigma_2}\,y,\ \sigma_1^2\big(1-\rho^2\big)\Big)$.

We see in particular that $E(X \mid y) = \rho\,\dfrac{\sigma_1}{\sigma_2}\,y$ and that $E(X \mid Y) = \rho\,\dfrac{\sigma_1}{\sigma_2}\,Y$.

In Chapter 4, we will see more generally that if $(X, Y_1,\dots,Y_n)$ is a Gaussian vector, $E(X \mid Y_1,\dots,Y_n)$ is written in the form $\lambda_0 + \sum_{j=1}^n \lambda_j Y_j$.
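The formula $E(X \mid Y) = \rho\frac{\sigma_1}{\sigma_2}Y$ can be observed by conditioning on a thin band of $Y$-values — a sketch in which $\sigma_1$, $\sigma_2$, $\rho$, the band and the seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(8)
s1, s2, rho = 2.0, 1.0, 0.6
g1, g2 = rng.normal(size=(2, 1_000_000))
y = s2 * g1
x = s1 * (rho * g1 + np.sqrt(1 - rho**2) * g2)   # (X, Y) ~ N_2(0, Gamma_Z)

y0 = 0.8
band = np.abs(y - y0) < 0.05                     # condition on Y close to y0
print(abs(x[band].mean() - rho * (s1 / s2) * y0) < 0.05)  # True
```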
2.6. Exercises for Chapter 2

Exercise 2.1.

We are looking at a circular target $D$ of center 0 and of radius $R$, which is used for archery. The couple $Z = (X,Y)$ represents the coordinates of the point of impact of the arrow on the target support; we assume that the r.v. $X$ and $Y$ are independent and follow the same law $N(0, 4R^2)$.

1) What is the probability that the arrow reaches the target?

2) How many times must one fire the arrow in order that, with a probability $\geq 0.9$, the target is reached at least once (we give $\ln 10 \approx 2.305$)?

3) Let us assume that we fire 100 times at the target; calculate the probability that the target is reached at least 20 times. Hint: use the central limit theorem.

Solution 2.1.

1) $X$ and $Y$ being independent, the probability density of $Z = (X,Y)$ is

$$f_Z(x,y) = f_X(x)\,f_Y(y) = \frac{1}{8\pi R^2}\exp\Big(-\frac{x^2+y^2}{8R^2}\Big)$$

and

$$P(Z \in D) = \frac{1}{8\pi R^2}\iint_D \exp\Big(-\frac{x^2+y^2}{8R^2}\Big)dx\,dy.$$

Using a change from Cartesian to polar coordinates:

$$= \frac{1}{8\pi R^2}\int_0^{2\pi}d\theta\int_0^R e^{-\frac{\rho^2}{8R^2}}\,\rho\,d\rho = \frac{2\pi}{8\pi R^2}\Big[-4R^2\,e^{-\frac{\rho^2}{8R^2}}\Big]_0^R = 1 - e^{-\frac{1}{8}}$$

2) At each shot $k$, we associate a Bernoulli r.v. $U_k \sim b(p)$ defined by:

– $U_k = 1$ if the arrow reaches the target (probability $p = 1 - e^{-1/8}$);

– $U_k = 0$ if the arrow does not reach the target (probability $1-p$).

In $n$ shots, the number of impacts is given by the r.v. $U = U_1 + \dots + U_n \sim B(n,p)$ and

$$P(U \geq 1) = 1 - P(U = 0) = 1 - (1-p)^n.$$

We are thus looking for $n$ which verifies $1 - (1-p)^n \geq 0.9 \Leftrightarrow (1-p)^n \leq 0.1$, i.e.

$$n \geq -\frac{\ln 10}{\ln(1-p)} = -\frac{\ln 10}{\ln e^{-1/8}} = 8\ln 10 \approx 18.44,$$

that is to say $n \geq 19$.

3) By using the previous notations, we are looking to calculate $P(U \geq 20)$ with $U = U_1 + \dots + U_{100}$, which is to say:

$$P(U_1 + \dots + U_{100} \geq 20) = P\Bigg(\frac{U_1+\dots+U_{100} - 100\mu}{\sqrt{100}\,\sigma} \geq \frac{20 - 100\mu}{\sqrt{100}\,\sigma}\Bigg)$$

with $\mu = 1 - e^{-1/8} \approx 0.1175$ and $\sigma = \Big(\big(1-e^{-1/8}\big)\,e^{-1/8}\Big)^{1/2} \approx 0.32$, i.e.

$$P\Big(S \geq \frac{8.25}{3.2}\Big) = P(S \geq 2.58) = 1 - F_0(2.58)$$

where $S$ is an r.v. $N(0,1)$ and $F_0$ is the distribution function of the r.v. $N(0,1)$.

Finally $P(U \geq 20) = 1 - 0.9951 \approx 0.005$.
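Parts 2) and 3) can be cross-checked numerically — a sketch; the binomial simulation estimates the exact tail, to be compared with the rough CLT value $\approx 0.005$ obtained above:

```python
import numpy as np
from math import exp

p = 1 - exp(-1 / 8)          # probability that one arrow hits, ~ 0.1175

n = 1                        # part 2: smallest n with 1 - (1-p)^n >= 0.9
while 1 - (1 - p) ** n < 0.9:
    n += 1
print(n)                     # 19, consistent with n >= 8 ln 10 ~ 18.44

rng = np.random.default_rng(9)
tail = np.mean(rng.binomial(100, p, 500_000) >= 20)   # part 3: P(U >= 20)
print(tail < 0.02)           # True: a small probability, of the same order
                             # as the CLT estimate (which neglects skewness)
```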
Exercise 2.2.

Given $X_1,\dots,X_n$ $n$ independent r.v. of law $N(0,1)$ and given $2n$ real constants $a_1,\dots,a_n;\ b_1,\dots,b_n$:

1) Show that the r.v. $Y = \sum_{j=1}^n a_j X_j$ and $Z = \sum_{j=1}^n b_j X_j$ are independent if and only if $\sum_{j=1}^n a_j b_j = 0$.

2) Deduce from this that, if the r.v. $X_1,\dots,X_n$ are $n$ independent r.v. of law $N(0,1)$,

$$\bar X = \frac{1}{n}\sum_{j=1}^n X_j \quad \text{and} \quad Y_K = X_K - \bar X \quad \big(\text{where } K \in \{1,2,\dots,n\}\big)$$

are independent. For $K \neq \ell$, are $Y_K$ and $Y_\ell$ independent r.v.?

Solution 2.2.

1) $U = (Y,Z)$ is evidently a Gaussian vector ($\forall \lambda$ and $\mu \in \mathbb{R}$, the r.v. $\lambda Y + \mu Z$ is evidently a Gaussian r.v.). In order for $Y$ and $Z$ to be independent, it is thus necessary and sufficient that:

$$0 = \mathrm{Cov}(Y,Z) = EYZ = \sum_j a_j b_j\,EX_j^2 = \sum_j a_j b_j$$

2) To simplify the expression, let us take $K = 1$ as an example:

$$\bar X = \frac{1}{n}X_1 + \dots + \frac{1}{n}X_n\ ;\quad Y_1 = \Big(1 - \frac{1}{n}\Big)X_1 - \frac{1}{n}X_2 - \dots - \frac{1}{n}X_n$$

and

$$\sum_{j=1}^n a_j b_j = \frac{1}{n}\Big(1 - \frac{1}{n}\Big) - (n-1)\,\frac{1}{n^2} = 0.$$

– To simplify, let us take $K = 1$ and $\ell = 2$:

$$Y_1 = \Big(1-\frac{1}{n}\Big)X_1 - \frac{1}{n}X_2 - \dots - \frac{1}{n}X_n\ ;\quad Y_2 = -\frac{1}{n}X_1 + \Big(1-\frac{1}{n}\Big)X_2 - \dots - \frac{1}{n}X_n$$

and

$$\sum_{j=1}^n a_j b_j = -2\Big(1-\frac{1}{n}\Big)\frac{1}{n} + (n-2)\,\frac{1}{n^2} = -\frac{1}{n} < 0,$$

thus $Y_1$ and $Y_2$ are dependent.
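A simulation view of 2), with $n = 5$ an arbitrary choice: $\bar X$ is uncorrelated with — hence, being jointly Gaussian, independent of — each $Y_K$, while $\mathrm{Cov}(Y_1, Y_2) = -1/n \neq 0$:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 5
X = rng.normal(size=(n, 400_000))
Xbar = X.mean(axis=0)
Y = X - Xbar                                     # Y_K = X_K - Xbar

print(abs(np.mean(Xbar * Y[0])) < 0.005)         # True: Cov(Xbar, Y_1) = 0
print(abs(np.mean(Y[0] * Y[1]) + 1 / n) < 0.01)  # True: Cov(Y_1, Y_2) = -1/n,
                                                 # so Y_1 and Y_2 are dependent
```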
Exercise 2.3.

We give a real r.v. $X \sim N(0,1)$ and a discrete r.v. $\varepsilon$ such that

$$P(\varepsilon = -1) = \frac{1}{2} \quad \text{and} \quad P(\varepsilon = +1) = \frac{1}{2}.$$

We suppose $X$ and $\varepsilon$ independent. We state $Y = \varepsilon X$:

– by using distribution functions, verify that $Y \sim N(0,1)$;

– show that $\mathrm{Cov}(X,Y) = 0$;

– is the vector $U = (X,Y)$ Gaussian?

Solution 2.3.

1)

$$F_Y(y) = P(Y \leq y) = P(\varepsilon X \leq y) = P\Big((\varepsilon X \leq y) \cap \big((\varepsilon = 1) \cup (\varepsilon = -1)\big)\Big) = P\Big(\big((\varepsilon X \leq y) \cap (\varepsilon = 1)\big) \cup \big((\varepsilon X \leq y) \cap (\varepsilon = -1)\big)\Big)$$

Because of the incompatibility of the two events linked by the union:

$$= P\big((X \leq y) \cap (\varepsilon = 1)\big) + P\big((-X \leq y) \cap (\varepsilon = -1)\big)$$

Because of the independence of $X$ and $\varepsilon$:

$$= P(X \leq y)\,P(\varepsilon = 1) + P(-X \leq y)\,P(\varepsilon = -1) = \frac{1}{2}\big(P(X \leq y) + P(-X \leq y)\big)$$

Finally, thanks to the parity of the density of the law $N(0,1)$:

$$= P(X \leq y) = F_X(y);$$

2) $\mathrm{Cov}(X,Y) = EXY - EX\,EY = E\varepsilon X^2 - \underbrace{EX}_{0}\,E\varepsilon X = \underbrace{E\varepsilon}_{0}\,EX^2 = 0$;

3) $X + Y = X + \varepsilon X = X(1+\varepsilon)$; thus

$$P(X+Y = 0) = P\big(X(1+\varepsilon) = 0\big) = P(1+\varepsilon = 0) = \frac{1}{2}.$$

We can deduce that the r.v. $\lambda X + \mu Y$ (with $\lambda = \mu = 1$) is not Gaussian: its law admits no density, nor is it a Dirac measure $\big(P_{X+Y}(\{0\}) = \frac{1}{2}\big)$. Thus the vector $U = (X,Y)$ is not Gaussian.
Exercise 2.4.

Given a real r.v. $X \sim N(0,1)$ and given a real $a > 0$:

1) Show that the real r.v. $Y$ defined by

$$Y = \begin{cases} X & \text{if } |X| < a \\ -X & \text{if } |X| \geq a \end{cases}$$

is also a real r.v. $\sim N(0,1)$. (Hint: show the equality of the distribution functions $F_Y = F_X$.)

2) Verify that

$$\mathrm{Cov}(X,Y) = 1 - \frac{4}{\sqrt{2\pi}}\int_a^{\infty} x^2\,e^{-x^2/2}\,dx$$

Solution 2.4.

1) $F_Y(y) = P(Y \leq y) = P\Big((Y \leq y) \cap \big((|X| < a) \cup (|X| \geq a)\big)\Big)$.

Distributivity and then incompatibility $\Rightarrow$

$$= P\big((Y \leq y) \cap (|X| < a)\big) + P\big((Y \leq y) \cap (|X| \geq a)\big) = P\big((X \leq y) \cap (|X| < a)\big) + P\big((-X \leq y) \cap (|X| \geq a)\big)$$

Because $f_X(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$ is even, $-X$ has the same law as $X$ (and $|-X| = |X|$), so

$$P\big((-X \leq y) \cap (|X| \geq a)\big) = P\big((X \leq y) \cap (|X| \geq a)\big)$$

and finally

$$F_Y(y) = P\big((X \leq y) \cap (|X| < a)\big) + P\big((X \leq y) \cap (|X| \geq a)\big) = P(X \leq y) = F_X(y);$$

2) $EX = 0$ and $EY = 0$, thus:

$$\mathrm{Cov}(X,Y) = EXY = \int_{-a}^{a} x^2 f_X(x)\,dx - \int_{-\infty}^{-a} x^2 f_X(x)\,dx - \int_a^{\infty} x^2 f_X(x)\,dx$$

$$= \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx - 2\int_{-\infty}^{-a} x^2 f_X(x)\,dx - 2\int_a^{\infty} x^2 f_X(x)\,dx$$

The 1st term equals $EX^2 = \mathrm{Var}\,X = 1$. The remaining terms, because of the parity of the integrated function, sum to $-4\int_a^{\infty} x^2 f_X(x)\,dx$, from which we obtain the result.
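Using $\int_a^\infty x^2 e^{-x^2/2}\,dx = a\,e^{-a^2/2} + \sqrt{\pi/2}\,\mathrm{erfc}(a/\sqrt{2})$ (integration by parts), the formula of 2) can be verified by simulation — a sketch with $a = 1$ an arbitrary choice:

```python
import numpy as np
from math import erfc, exp, sqrt, pi

rng = np.random.default_rng(11)
a = 1.0
x = rng.normal(0, 1, 1_000_000)
y = np.where(np.abs(x) < a, x, -x)       # Y = X if |X| < a, -X otherwise

emp = np.mean(x * y)                     # = Cov(X, Y) since EX = EY = 0
# closed form of int_a^inf x^2 e^{-x^2/2} dx:
integ = a * exp(-a**2 / 2) + sqrt(pi / 2) * erfc(a / sqrt(2))
theo = 1 - 4 / sqrt(2 * pi) * integ
print(abs(emp - theo) < 0.01)            # True (theo ~ -0.60 for a = 1)
```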
Exercise 2.5.

Let $Z = \begin{pmatrix} X \\ Y \end{pmatrix}$ be a Gaussian vector of expectation vector $m = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$ and of covariance matrix $\Gamma_Z = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}$, which is to say $Z \sim N_2(m, \Gamma_Z)$.

1) Give the law of the random variable $X - 2Y$.

2) Under what conditions on the constants $a$ and $b$ is the random variable $aX + bY$ independent of $X - 2Y$ and of variance 1?

Solution 2.5.

1) $X \sim N(0,1)$ and $Y \sim N(1,1)$; as $Z$ is a Gaussian vector, $X - 2Y$ is a Gaussian r.v., of expectation $-2$ and of variance

$$\mathrm{Var}(X - 2Y) = \mathrm{Var}\,X - 4\,\mathrm{Cov}(X,Y) + 4\,\mathrm{Var}\,Y = 1 - 2 + 4 = 3;$$

precisely, $X - 2Y \sim N(-2, 3)$.

2) As $\begin{pmatrix} X - 2Y \\ aX + bY \end{pmatrix}$ is a Gaussian vector (write the definition), $X - 2Y$ and $aX + bY$ are independent $\Leftrightarrow \mathrm{Cov}(X-2Y,\, aX+bY) = 0$. Now

$$\mathrm{Cov}(X-2Y,\, aX+bY) = a\,\mathrm{Var}\,X + b\,\mathrm{Cov}(X,Y) - 2a\,\mathrm{Cov}(X,Y) - 2b\,\mathrm{Var}\,Y = a + \frac{b}{2} - a - 2b = -\frac{3}{2}\,b,$$

i.e. $b = 0$. As $1 = \mathrm{Var}(aX + bY) = \mathrm{Var}\,aX = a^2\,\mathrm{Var}\,X$: $a = \pm 1$.
Exercise 2.6.

We are looking at two independent r.v. $X$ and $Y$, and we assume that $X$ admits a probability density $f_X(x)$ and that $Y \sim N(0,1)$.

Determine the r.v. $E\big(e^{XY} \mid X\big)$.

Solution 2.6.

$$E\big(e^{XY} \mid x\big) = E\,e^{xY} = \int e^{xy}\,\frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,dy = e^{x^2/2}\int \frac{1}{\sqrt{2\pi}}\,e^{-(y-x)^2/2}\,dy$$

As $y \to \frac{1}{\sqrt{2\pi}}\,e^{-(y-x)^2/2}$ is a probability density (that of an r.v. $\sim N(x,1)$), we finally obtain

$$E\big(e^{XY} \mid X\big) = e^{X^2/2}.$$
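A one-line check of the result at a fixed value $X = x_0$ ($x_0 = 0.7$ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(12)
x0 = 0.7
y = rng.normal(0, 1, 2_000_000)
emp = np.mean(np.exp(x0 * y))                  # E(e^{x0 Y})
print(abs(emp - np.exp(x0**2 / 2)) < 0.01)     # True: equals e^{x0^2 / 2}
```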
Chapter 3

Introduction to Discrete Time Processes

3.1. Definition

A discrete time process is a family of r.v.

$$X_T = \big\{X_{t_j} \mid t_j \in T \subset \mathbb{R}\big\}$$

where $T$, called the time base, is a countable set of instants. $X_{t_j}$ is the r.v. of the family considered at the instant $t_j$. Ordinarily, the $t_j$ are uniformly spread and distant by one unit of time; in the sequel $T$ will be equal to $\mathbb{N}$, $\mathbb{Z}$ or $\mathbb{N}^*$, and the processes will still be denoted $X_T$ or, if we wish to be precise, $X_{\mathbb{N}}$, $X_{\mathbb{Z}}$ or $X_{\mathbb{N}^*}$.

In order to be able to study correctly some sets of r.v. $X_j$ of $X_T$, and not only the r.v. $X_j$ individually, it is in our interest to consider the latter as mappings defined on the same set, and this leads us to an exact definition.

DEFINITION.– Any family $X_T$ of measurable mappings

$$X_j : \omega \in (\Omega, a) \longrightarrow X_j(\omega) \in \big(\mathbb{R}, B(\mathbb{R})\big) \quad \text{with } j \in T \subset \mathbb{R}$$

is called a real discrete time stochastic process.

We also say that the process is defined on the fundamental space $(\Omega, a)$.

In general a process $X_T$ is associated with a real phenomenon, that is to say that the $X_j$ represent (random) physical, biological, etc. values — for example the intensity of electromagnetic noise coming from a certain star.

For a given $\omega$, that is to say after the phenomenon has been performed, we obtain the values $x_j = X_j(\omega)$.

DEFINITION.– $x_T = \{x_j \mid j \in T\}$ is called the realization or trajectory of the process $X_T$.

[Figure: a trajectory — the values $x_{-1}, x_0, x_1, x_2, \dots, x_j$ taken by the r.v. $X_{-1}, X_0, X_1, X_2, \dots, X_j$, plotted against the instants $t = -1, 0, 1, 2, \dots, j$.]

Figure 3.1. A trajectory
Laws

We defined the laws $P_X$ of the real random vectors $X^T = (X_1,\dots,X_n)$ in Chapter 1. These laws are measures defined on the Borel algebra $B(\mathbb{R}^n) = B(\mathbb{R}) \otimes \dots \otimes B(\mathbb{R})$ of $\mathbb{R}^n$.

The finite sets $(X_i,\dots,X_j)$ of r.v. of $X_T$ are random vectors and, as we will be employing nothing but sets such as these in the following chapters, the considerations of Chapter 1 will be sufficient for the studies that we envisage.

However, $X_T \in \mathbb{R}^T$ and in certain problems we cannot avoid the following additional sophistication:

1) construction of a $\sigma$-algebra $B\big(\mathbb{R}^T\big) = \underset{j \in T}{\otimes}\, B(\mathbb{R}_j)$ on $\mathbb{R}^T$;

2) construction of laws on $B\big(\mathbb{R}^T\big)$ (Kolmogorov's theorem).

Stationarity

DEFINITION.– We say that a process $X_{\mathbb{Z}} = \{X_j \mid j \in \mathbb{Z}\}$ is stationary if $\forall i, j, p \in \mathbb{Z}$ the random vectors $(X_i,\dots,X_j)$ and $(X_{i+p},\dots,X_{j+p})$ have the same law, i.e. $\forall B_i,\dots,B_j \in B(\mathbb{R})$ (in the drawing the Borelians are intervals):

$$P\big((X_{i+p} \in B_i) \cap \dots \cap (X_{j+p} \in B_j)\big) = P\big((X_i \in B_i) \cap \dots \cap (X_j \in B_j)\big)$$

[Figure: the instants $i, i+1, \dots, j$ and their translates $i+p, i+1+p, \dots, j+p$ on the time axis $t$.]
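Stationarity makes all translated blocks share one law, hence in particular the same moments. A sketch on the process $X_j = e_j + \frac{1}{2}e_{j-1}$, with $e_j$ i.i.d. $N(0,1)$ — a standard stationary example, not one from the book: covariances computed at different dates coincide:

```python
import numpy as np

rng = np.random.default_rng(13)
e = rng.normal(size=(20_000, 51))
X = e[:, 1:] + 0.5 * e[:, :-1]        # many trajectories of 50 instants

c = lambda i, j: np.mean(X[:, i] * X[:, j])   # the process is centered

print(abs(c(3, 3) - c(30, 30)) < 0.08)   # True: both ~ 1 + 0.25 = 1.25
print(abs(c(3, 4) - c(30, 31)) < 0.08)   # True: both ~ 0.5
print(abs(c(3, 5)) < 0.05)               # True: zero beyond lag 1
```

Only the deviation $j - i$ matters, which anticipates the wide sense notion below.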
Wide sense stationarity

DEFINITION.– We say that a process $X_T$ is centered if $EX_j = 0 \quad \forall j \in T$.

DEFINITION.– We say that a process $X_T$ is of the second order if $X_j \in L^2(dP) \quad \forall j \in T$.

Let us remember that if $X_j \in L^2 \ \forall j \in T$, then $X_j \in L^1$ and $\forall i, j \in T$, $E|X_i X_j| < \infty$. Thus, the following definition is meaningful.

DEFINITION.– Given $X_\mathbb{Z}$ a real 2nd order process, we call the covariance function of this process the mapping:
$$\Gamma : i, j \mapsto \Gamma(i, j) = \mathrm{Cov}(X_i, X_j)$$
We call the autocorrelation function of this process the mapping:
$$R : i, j \mapsto R(i, j) = E X_i X_j$$
These two mappings obviously coincide if $X_\mathbb{Z}$ is centered. We can recognize here notions introduced in the context of random vectors, but here, as the indices $\dots, i, \dots, j, \dots$ represent instants, we can expect in general that when the deviations $|i - j|$ increase, the values $\Gamma(i, j)$ and $R(i, j)$ decrease.

DEFINITION.– We say that the process $X_\mathbb{Z}$ is wide sense stationary (WSS) if:
– it is of the 2nd order;
– the mapping $j \mapsto m(j) = EX_j$ is constant;
– $\forall i, j, p \in \mathbb{Z}$, $\Gamma(i+p, j+p) = \Gamma(i, j)$.

In this case $\Gamma(i, j)$ is instead written $C(j - i)$.

Relationship linking the two types of stationarity

A stationary process is not necessarily of the 2nd order, as we see with a process $X_\mathbb{Z}$ in which we choose for the $X_j$ independent r.v. with Cauchy's law:
$$f_{X_j}(x) = \frac{a}{\pi\,(a^2 + x^2)} \quad (a > 0)$$
for which $EX_j$ and $EX_j^2$ are not defined.
A "stationary process which is also of the 2nd order" (or a process of the 2nd order which is also stationary) must not be confused with a WSS process. It is clear that if a process of the 2nd order is stationary, it is then WSS. In effect:
$$EX_{j+p} = \int_\mathbb{R} x\, dP_{X_{j+p}}(x) = \int_\mathbb{R} x\, dP_{X_j}(x) = EX_j$$
and:
$$\Gamma(i+p, j+p) = \int_{\mathbb{R}^2} xy\, dP_{X_{i+p}, X_{j+p}}(x, y) - EX_{i+p}\, EX_{j+p} = \int_{\mathbb{R}^2} xy\, dP_{X_i, X_j}(x, y) - EX_i\, EX_j = \Gamma(i, j)$$
The inverse implication "wide sense stationarity (WSS) $\Rightarrow$ stationarity" is false in general. However, it is true in the case of Gaussian processes.

Ergodicity

Let $X_\mathbb{Z}$ be a WSS process.

DEFINITION.– We say that the expectation of $X_\mathbb{Z}$ is ergodic if:
$$EX_0 = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{N} X_j(\omega) \quad \text{a.s. (almost surely)}$$
We say that the autocorrelation function of $X_\mathbb{Z}$ is ergodic if:
$$\forall n \in \mathbb{Z} \qquad K(j, j+n) = E X_j X_{j+n} = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{N} X_j(\omega)\, X_{j+n}(\omega) \quad \text{a.s.}$$

That is to say, except possibly for $\omega$ in a set $\mathcal{N}$ of zero probability, or equivalently with the exception of trajectories whose probability of appearing is zero, we have for any trajectory $x$:
$$EX_0 = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{+N} x_j \quad \text{(ergodicity of 1st order)}$$
$$E X_j X_{j+n} = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{+N} x_j\, x_{j+n} \quad \text{(ergodicity of 2nd order)}$$

With the condition that the process $X_\mathbb{Z}$ is ergodic, we can then replace a mathematical expectation by a mean in time.

Here is a sufficient condition of ergodicity of 1st order.

PROPOSITION (Strong law of large numbers).– If the $X_j$ ($j \in \mathbb{Z}$) form a sequence of independent r.v. of the same law and if $E|X_0| < \infty$, then:
$$EX_0 = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{+N} X_j(\omega) \quad \text{a.s.}$$
NOTE.– Let us suppose that the r.v. $X_j$ are independent Cauchy r.v. of probability density $\frac{a}{\pi(a^2 + x^2)}$ ($a > 0$). By using the characteristic function technique, we can verify that the r.v. $Y_N = \frac{1}{2N+1} \sum_{j=-N}^{+N} X_j$ has the same law as $X_0$; thus $Y_N$ cannot converge a.s. to a constant $EX_0$. Indeed $E|X_0| = +\infty$, so the strong law does not apply.
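The contrast between the two cases can be illustrated numerically. The sketch below (assuming NumPy is available; the distributions and sample sizes are illustrative choices, not from the text) compares the running mean of an integrable i.i.d. sequence, which settles on the expectation, with the running mean of i.i.d. Cauchy r.v., which keeps the same Cauchy law for every $N$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# i.i.d. r.v. with E|X_0| < infinity: the strong law applies and the
# running time average converges a.s. to EX_0 (here EX_0 = 3)
gauss = rng.normal(loc=3.0, scale=1.0, size=N)
gauss_means = np.cumsum(gauss) / np.arange(1, N + 1)

# i.i.d. standard Cauchy r.v.: E|X_0| = +infinity; the running mean Y_N
# has the same Cauchy law as X_0 for every N and never settles down
cauchy = rng.standard_cauchy(size=N)
cauchy_means = np.cumsum(cauchy) / np.arange(1, N + 1)
```

Plotting `gauss_means` against `cauchy_means` makes the failure of ergodicity in the Cauchy case visible: the former flattens near 3, the latter keeps jumping at every scale.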
EXAMPLE.– We are looking at the process $X_\mathbb{Z}$ which consists of the r.v. $X_j = A\cos(\lambda j + \Theta)$, where $A$ is a real constant and where $\Theta$ is an r.v. of uniform probability density $f_\Theta(\theta) = \frac{1}{2\pi} 1_{[0, 2\pi[}(\theta)$. Let us verify that $X_\mathbb{Z}$ is a WSS process:
$$EX_j = \int_0^{2\pi} A\cos(\lambda j + \theta)\, f_\Theta(\theta)\, d\theta = \frac{A}{2\pi} \int_0^{2\pi} \cos(\lambda j + \theta)\, d\theta = 0$$
$$\Gamma(i, j) = K(i, j) = E X_i X_j = \int_0^{2\pi} A\cos(\lambda i + \theta)\, A\cos(\lambda j + \theta)\, f_\Theta(\theta)\, d\theta = \frac{A^2}{2\pi} \int_0^{2\pi} \cos(\lambda i + \theta) \cos(\lambda j + \theta)\, d\theta = \frac{A^2}{2} \cos\big(\lambda(j - i)\big)$$
and $X_\mathbb{Z}$ is in fact WSS.
Keeping with this example, we are going to verify the ergodicity.

Ergodicity of expectation
$$\lim_N \frac{1}{2N+1} \sum_{j=-N}^{+N} A\cos(\lambda j + \theta) \quad (\text{with } \theta \text{ fixed} \in [0, 2\pi[)$$
$$= \lim_N \frac{A\cos\theta}{2N+1} \sum_{j=-N}^{N} \cos\lambda j = \lim_N \frac{2A\cos\theta}{2N+1} \Big( \sum_{j=0}^{N} \cos\lambda j - \frac{1}{2} \Big)$$
$$= \lim_N \frac{2A\cos\theta}{2N+1} \Big( \mathrm{Re} \sum_{j=0}^{N} e^{i\lambda j} - \frac{1}{2} \Big) = \lim_N \frac{2A\cos\theta}{2N+1} \Big( \mathrm{Re}\, \frac{1 - e^{i\lambda(N+1)}}{1 - e^{i\lambda}} - \frac{1}{2} \Big)$$

If $\lambda \neq 2k\pi$, the parenthesis is bounded, and the limit is zero and equal to $EX_0$. Therefore, the expectation is ergodic.

Ergodicity of the autocorrelation function
$$\lim_N \frac{1}{2N+1} \sum_{j=-N}^{+N} A\cos(\lambda j + \theta)\, A\cos\big(\lambda(j+n) + \theta\big) \quad (\text{with } \theta \text{ fixed} \in [0, 2\pi[)$$
$$= \lim_N \frac{A^2}{2N+1} \sum_{j=-N}^{+N} \cos(\lambda j + \theta)\, \cos\big(\lambda(j+n) + \theta\big)$$
$$= \lim_N \frac{A^2}{2}\, \frac{1}{2N+1} \sum_{j=-N}^{+N} \big( \cos(\lambda(2j+n) + 2\theta) + \cos\lambda n \big)$$
$$= \lim_N \frac{A^2}{2}\, \frac{1}{2N+1}\, \mathrm{Re}\Big( e^{i(\lambda n + 2\theta)} \sum_{j=-N}^{+N} e^{i 2\lambda j} \Big) + \frac{A^2}{2} \cos\lambda n$$

The first limit is still zero and $\frac{A^2}{2} \cos\lambda n = K(j, j+n)$. Thus, the autocorrelation function is ergodic.
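The two time averages above can be checked numerically on a single simulated trajectory. The following sketch (NumPy assumed; the values of $A$, $\lambda$, $n$ and the window size are illustrative choices) computes the time mean and the time-averaged autocorrelation and compares them with $EX_0 = 0$ and $\frac{A^2}{2}\cos\lambda n$:

```python
import numpy as np

rng = np.random.default_rng(0)
A, lam = 2.0, 0.9                 # amplitude and pulsation, with lam != 2*k*pi
theta = rng.uniform(0, 2*np.pi)   # one draw of Theta fixes one trajectory

N = 200_000
j = np.arange(-N, N + 1)
x = A * np.cos(lam * j + theta)   # trajectory x_j = A cos(lam*j + theta)

time_mean = x.mean()              # (1/(2N+1)) sum_j x_j  ->  EX_0 = 0
n = 3
time_acf = np.mean(x[:-n] * x[n:])          # time average of x_j x_{j+n}
theory = 0.5 * A**2 * np.cos(lam * n)       # K(j, j+n) = (A^2/2) cos(lam*n)
```

With $\lambda \neq 2k\pi$ the oscillating sums stay bounded, so both time averages agree with the ensemble values to within $O(1/N)$.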
Two important processes in signal processing

Markov process

DEFINITION.– We say that $X_\mathbb{Z}$ is a discrete Markov process if:
– $\forall B \in \mathcal{B}(\mathbb{R})$;
– $\forall t_1, \dots, t_{j+1} \in \mathbb{Z}$ with $t_1 < t_2 < \dots < t_j < t_{j+1}$;
– $\forall x_1, \dots, x_{j+1} \in \mathbb{R}$:
$$P\big(X_{t_{j+1}} \in B \mid X_{t_j} = x_j, \dots, X_{t_1} = x_1\big) = P\big(X_{t_{j+1}} \in B \mid X_{t_j} = x_j\big),$$
an equality that more briefly can be written:
$$P\big(X_{t_{j+1}} \in B \mid x_j, \dots, x_1\big) = P\big(X_{t_{j+1}} \in B \mid x_j\big).$$

We can say that if $t_j$ represents the present instant, then for the study of $X_\mathbb{Z}$ towards the future (instants $> t_j$), the information $\{(X_{t_j} = x_j), \dots, (X_{t_1} = x_1)\}$ brings nothing more than the information $(X_{t_j} = x_j)$.
Markov processes are often associated with phenomena beginning at instant $0$ for example, and we thus choose the probability law $\Pi_0$ of the r.v. $X_0$.

The conditional probabilities $P\big(X_{t_{j+1}} \in B \mid x_j\big)$ are called transition probabilities.

In what follows, we suppose $t_j = j$.

DEFINITION.– We say that the transition probability is stationary if $P\big(X_{j+1} \in B \mid x_j\big)$ is independent of $j$ $\big(= P(X_1 \in B \mid x_0)\big)$.
Here is an example of a Markov process that is often met in practice. $X_\mathbb{Z}$ is defined by the r.v. $X_0$ and the recurrence relation $X_{j+1} = f(X_j, N_j)$, where the $N_j$ are independent r.v. which are also independent of the r.v. $X_0$, and where $f : \mathbb{R}^2 \to \mathbb{R}$ is a Borel function.

Thus, let us show that $\forall B \in \mathcal{B}(\mathbb{R})$:
$$P\big(X_{j+1} \in B \mid x_j, x_{j-1}, \dots, x_0\big) = P\big(X_{j+1} \in B \mid x_j\big)$$
$$\Leftrightarrow P\big(f(X_j, N_j) \in B \mid x_j, x_{j-1}, \dots, x_0\big) = P\big(f(X_j, N_j) \in B \mid x_j\big)$$
$$\Leftrightarrow P\big(f(x_j, N_j) \in B \mid x_j, x_{j-1}, \dots, x_0\big) = P\big(f(x_j, N_j) \in B \mid x_j\big)$$

This equality will be verified if the r.v. $N_j$ is independent of $(X_{j-1} = x_{j-1}) \cap \dots \cap (X_0 = x_0)$.
Now the recurrence relation leads us to expressions of the form:
$$X_1 = f(X_0, N_0), \quad X_2 = f(X_1, N_1) = f\big(f(X_0, N_0), N_1\big) = f_2(X_0, N_0, N_1), \quad \dots, \quad X_j = f_j(X_0, N_0, \dots, N_{j-1})$$
which proves that $N_j$, being independent of $X_0, N_0, \dots, N_{j-1}$, is also independent of $X_0, X_1, \dots, X_{j-1}$ (and even of $X_j$).
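The construction $X_{j+1} = f(X_j, N_j)$ and its unrolled form $X_j = f_j(X_0, N_0, \dots, N_{j-1})$ can be sketched in a few lines. The particular Borel function below, $f(x, n) = x/2 + n$, is a hypothetical choice for illustration (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, n):
    # a hypothetical Borel function f: R^2 -> R, here f(x, n) = x/2 + n
    return 0.5 * x + n

steps = 10
noise = rng.normal(size=steps)   # N_0, ..., N_{steps-1}, independent of X_0
X = np.empty(steps + 1)
X[0] = rng.normal()              # the initial r.v. X_0
for j in range(steps):
    X[j + 1] = f(X[j], noise[j])     # X_{j+1} = f(X_j, N_j)

# unrolling the recurrence exhibits X_j = f_j(X_0, N_0, ..., N_{j-1})
unrolled = 0.5**steps * X[0] + sum(0.5**(steps - 1 - k) * noise[k]
                                   for k in range(steps))
```

Since `unrolled` depends only on $X_0$ and $N_0, \dots, N_{j-1}$, the independence of $N_j$ from the past trajectory follows exactly as in the argument above.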
Gaussian process

DEFINITION.– We say that a process $X_\mathbb{Z}$ is Gaussian if $\forall S = (i, \dots, j) \subset \mathbb{Z}$, the random vector $X_S = (X_i, \dots, X_j)$ is a Gaussian vector, which as we will remember is denoted $X_S \sim N_n(m_S, \Gamma_{X_S})$.

We see in particular that as soon as we know that a process $X_\mathbb{Z}$ is Gaussian, its law is entirely determined by its expectation function $j \mapsto m(j)$ and its covariance function $i, j \mapsto \Gamma(i, j)$. Such a process is denoted $X \sim N\big(m(j), \Gamma(i, j)\big)$.

A Gaussian process is obviously of the 2nd order; furthermore, if it is a WSS process it is then stationary, and to realize this it is sufficient to write the probability density:
$$f_{X_S}(x_i, \dots, x_j) = \frac{1}{(2\pi)^{\frac{j-i+1}{2}}\, (\mathrm{Det}\, \Gamma_{X_S})^{\frac{1}{2}}} \exp\Big( -\frac{1}{2} (x - m_S)^T\, \Gamma_S^{-1}\, (x - m_S) \Big)$$
of whatever vector $X_S$ is extracted from the process.
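Because the law of a Gaussian process is fixed by $m(j)$ and $\Gamma(i, j)$, one can sample any extracted vector $X_S$ directly from these two functions. A minimal sketch, assuming NumPy and using the illustrative covariance $\Gamma(i,j) = e^{-|j-i|}$ (chosen for this example, and positive definite), factors $\Gamma = L L^T$ and applies $L$ to standard normals:

```python
import numpy as np

rng = np.random.default_rng(3)
S = np.arange(8)                       # indices (i, ..., j) of the extracted vector
m_S = np.zeros(len(S))                 # expectation function m(j) = 0 (centered)
# illustrative covariance function Gamma(i, j) = e^{-|j-i|}
Gamma = np.exp(-np.abs(S[:, None] - S[None, :]))

L = np.linalg.cholesky(Gamma)          # Gamma = L L^T
X_S = m_S + L @ rng.normal(size=len(S))  # one sample of X_S ~ N_n(m_S, Gamma)
```

Averaging the outer products of many such samples would reproduce `Gamma`, which is one way to check a simulator of a Gaussian process against its covariance function.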
Linear space associated with a process

Given $X$ a WSS process, we denote by $H_X$ the family of finite linear combinations of the r.v. of $X$. That is to say:
$$H_X = \Big\{ \sum_{j \in S} \lambda_j X_j \;\Big|\; S \text{ finite} \subset \mathbb{Z} \Big\}$$

DEFINITION.– We call linear space associated with the process $X$ the family $H_X$ augmented by the limits in $L^2$ of the elements of $H_X$. The linear space is denoted $\bar{H}_X$.

NOTES.–
1) $H_X \subset \bar{H}_X \subset L^2(dP)$ and $\bar{H}_X$ is a closed vector space of $L^2(dP)$.
2) Let us suppose that $X$ is a stationary Gaussian process. All the linear combinations of the r.v. $X_j$ of $X$ are Gaussian, and the limits in $L^2$ are equally Gaussian. In effect, we easily verify that if a set of r.v. $X_n \sim N(m_n, \sigma_n^2)$ converges in $L^2$ towards an r.v. $X$ of expectation $m$ and of variance $\sigma^2$, then $m_n$ and $\sigma_n^2$ converge towards $m$ and $\sigma^2$ respectively, and $X \sim N(m, \sigma^2)$.
Delay operator

The process $X$ being given, we examine the operators $T^n$ ($n \in \mathbb{N}^*$) on $H_X$ defined by:
$$T^n : \sum_{j \in S} \lambda_j X_j \mapsto \sum_{j \in S} \lambda_j X_{j-n} \quad (S \text{ finite} \subset \mathbb{Z})$$
DEFINITION.– $T^n$ is called the delay operator of order $n$.
Properties of the delay operator:
– $T^n$ is linear from $H_X$ into $H_X$;
– $\forall n, m \in \mathbb{N}^*$, $T^n T^m = T^{n+m}$;
– $T^n$ conserves the scalar product of $L^2$, that is to say $\forall I$ and $J$ finite $\subset \mathbb{Z}$:
$$\Big\langle T^n\Big(\sum_{i \in I} \lambda_i X_i\Big),\; T^n\Big(\sum_{j \in J} \mu_j X_j\Big) \Big\rangle = \Big\langle \sum_{i \in I} \lambda_i X_i,\; \sum_{j \in J} \mu_j X_j \Big\rangle.$$

EXTENSION.– $T^n$ extends to all of $\bar{H}_X$ in the following way. Let $Z \in \bar{H}_X$ and let $Z_p \in H_X$ be a sequence of r.v. which converges towards $Z$ in $L^2$; $Z_p$ is in particular a Cauchy sequence of $H_X$ and, by isometry of $T^n$, $T^n(Z_p)$ is also a Cauchy sequence of $H_X$, which, since $\bar{H}_X$ is complete, converges in $\bar{H}_X$. It is simple to verify that $\lim_p T^n(Z_p)$ is independent of the particular sequence $Z_p$ which converges towards $Z$. It is natural to state:
$$\forall Z \in \bar{H}_X \qquad T^n(Z) = \lim_p T^n(Z_p)$$
where $Z_p \in H_X$ is any sequence which converges towards $Z$.

DEFINITION.– We can also say that $\bar{H}_X$ is the space generated by the process $X$.

3.2. WSS processes and spectral measure
In this section it will be interesting to note the influence of the temporal spacing between the r.v. on the spectral density. For this reason we momentarily consider a WSS process $X_\theta = \{X_{j\theta} \mid j \in \mathbb{Z}\}$, where $\theta$ is a constant and $j\theta$ has the significance of a duration.
3.2.1. Spectral density
DEFINITION.– We say that the process $X_\theta$ possesses a spectral density if its covariance $C(n\theta) = C\big((j-i)\theta\big) = E X_{i\theta} X_{j\theta} - E X_{i\theta}\, E X_{j\theta}$ can be written in the form:
$$C(n\theta) = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} \exp\big(2i\pi(n\theta)u\big)\, S_{XX}(u)\, du$$
$S_{XX}(u)$ is then called the spectral density of the process $X_\theta$.

PROPOSITION.– Under the hypothesis $\sum_{n=-\infty}^{+\infty} |C(n\theta)| < \infty$:
1) the process $X_\theta$ admits a spectral density $S_{XX}$;
2) $S_{XX}$ is continuous, periodic of period $\frac{1}{\theta}$, real and even.
Figure 3.2. Covariance function and spectral density of a process
NOTE.– The covariance function $C$ is not defined (and in particular does not equal zero) outside the values $n\theta$.

DEMONSTRATION.– Taking into account the hypotheses, the series:
$$\sum_{p=-\infty}^{+\infty} C(p\theta)\, \exp\big(-2i\pi(p\theta)u\big)$$
converges uniformly on $\mathbb{R}$ and defines a continuous and $\frac{1}{\theta}$-periodic function $S(u)$. Furthermore:
$$\int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} \sum_{p=-\infty}^{+\infty} C(p\theta)\, \exp\big(-2i\pi(p\theta)u\big)\, \exp\big(2i\pi(n\theta)u\big)\, du = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} S(u)\, \exp\big(2i\pi(n\theta)u\big)\, du$$

The uniform convergence and the orthogonality in $L^2\big(-\frac{1}{2\theta}, \frac{1}{2\theta}\big)$ of the complex exponentials enable us to conclude that:
$$C(n\theta) = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} \exp\big(2i\pi(n\theta)u\big)\, S(u)\, du \quad \text{and that} \quad S_{XX}(u) = S(u).$$

To finish, $C(n\theta)$ is a covariance function, thus:
$$C(-n\theta) = C(n\theta)$$
and we can deduce from this that $S_{XX}(u) = \sum_{p=-\infty}^{+\infty} C(p\theta)\, \exp\big(-2i\pi(p\theta)u\big)$ is real and even (we also have $S_{XX}(u) = C(0) + 2\sum_{p=1}^{\infty} C(p\theta) \cos 2\pi(p\theta)u$).

EXAMPLE.– The covariance $C(n\theta) = \sigma^2 e^{-\lambda|n|\theta}$ ($\lambda > 0$) of a process $X_\theta$ in fact verifies the condition of the proposition, and $X_\theta$ admits the spectral density:
$$S_{XX}(u) = \sigma^2 \sum_{n=-\infty}^{+\infty} e^{-\lambda|n|\theta - 2i\pi(n\theta)u} = \sigma^2 \Big( \sum_{n=0}^{\infty} e^{-\lambda n\theta - 2i\pi(n\theta)u} + \sum_{n=0}^{\infty} e^{-\lambda n\theta + 2i\pi(n\theta)u} - 1 \Big)$$
$$= \sigma^2 \Big( \frac{1}{1 - e^{-\lambda\theta - 2i\pi\theta u}} + \frac{1}{1 - e^{-\lambda\theta + 2i\pi\theta u}} - 1 \Big) = \sigma^2\, \frac{1 - e^{-2\lambda\theta}}{1 + e^{-2\lambda\theta} - 2 e^{-\lambda\theta} \cos 2\pi\theta u}$$
White noise

DEFINITION.– We say that a centered WSS process $X_\theta$ is a white noise if its covariance function $C(n\theta) = C\big((j-i)\theta\big) = E X_{i\theta} X_{j\theta}$ verifies:
$$C(0) = E X_{j\theta}^2 = \sigma^2 \quad \forall j \in \mathbb{Z}, \qquad C(n\theta) = 0 \ \text{ if } n \neq 0.$$

The function $C$ in fact verifies the condition of the preceding proposition and:
$$S_{XX}(u) = \sum_{n=-\infty}^{+\infty} C(n\theta)\, \exp\big(-2i\pi(n\theta)u\big) = C(0) = \sigma^2.$$
Figure 3.3. Covariance function and spectral density of a white noise

We often meet "Gaussian white noises": these are Gaussian processes which are also white noises; the families of r.v. extracted from such processes are independent and $\sim N(0, \sigma^2)$.
More generally we have the following result, which we will use without demonstration.

Herglotz theorem

In order for a mapping $n\theta \mapsto C(n\theta)$ to be the covariance function of a WSS process, it is necessary and sufficient that there exists a positive measure $\mu_X$ on $\mathcal{B}\big(\big[-\frac{1}{2\theta}, \frac{1}{2\theta}\big]\big)$, which is called the spectral measure, such that:
$$C(n\theta) = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} \exp\big(2i\pi(n\theta)u\big)\, d\mu_X(u).$$

In this statement we no longer assume that $\sum_{n=-\infty}^{\infty} |C(n\theta)| < \infty$.
If $\sum_{n=-\infty}^{+\infty} |C(n\theta)| < \infty$, we again find the starting statement, with $d\mu_X(u) = S_{XX}(u)\, du$ (a statement that we can complete by saying that the spectral density $S_{XX}(u)$ is positive).
3.3. Spectral representation of a WSS process
In this section we explain the steps enabling us to arrive at the spectral representation of a process. In order not to obscure these steps, the demonstrations of the results, which are quite long without being difficult, are not given.
3.3.1. Problem
The object of spectral representation is:
1) To study the integrals (called Wiener integrals) of the type $\int_S \varphi(u)\, dZ_u$, obtained as limits, in a sense to be clarified, of expressions of the form $\sum_j \varphi(u_j)\big(Z_{u_j} - Z_{u_{j-1}}\big)$, where $S$ is a bounded interval of $\mathbb{R}$, $\varphi$ is a mapping with complex values (plus other conditions), and $Z_S = \{Z_u \mid u \in S\}$ is a 2nd order process with orthogonal increments (abbreviated as p.o.i.) whose definition will be given in what follows.
2) The construction of the Wiener integral being carried out, to show that reciprocally, if we are given a WSS process $X_\theta$, we can find a p.o.i. $Z_S = \big\{Z_u \mid u \in S = \big[-\frac{1}{2\theta}, \frac{1}{2\theta}\big]\big\}$ such that $\forall j \in \mathbb{Z}$, $X_{j\theta}$ may be written as a Wiener integral:
$$X_{j\theta} = \int_S e^{2i\pi(j\theta)u}\, dZ_u.$$
NOTE.– $\int_S \varphi(u)\, dZ_u$ and $\int_S e^{2i\pi(j\theta)u}\, dZ_u$ will not be ordinary Stieltjes integrals (and it is this which motivates a particular study). In effect, let us state:
– $\sigma = \{\dots, u_{j-1}, u_j, u_{j+1}, \dots\}$ a subdivision of $S$;
– $|\sigma| = \sup_j |u_j - u_{j-1}|$ the module of the subdivision $\sigma$;
– $I_\sigma = \sum_{u_j \in \sigma} \varphi(u_j)\big(Z_{u_j} - Z_{u_{j-1}}\big)$.

$\forall \sigma$, the expression $I_\sigma$ is in fact defined; it is a 2nd order r.v. with complex values. However, the process $Z_S$ not being a priori of bounded variation, the ordinary limit $\lim_{|\sigma| \to 0} I_\sigma$, i.e. the limit for a given trajectory $u \mapsto Z_u(\omega)$, does not exist, and $\int_S \varphi(u)\, dZ_u$ cannot be an ordinary Stieltjes integral.

The r.v. $\int_S \varphi(u)\, dZ_u$ will be by definition the limit in $L^2$ of the family $I_\sigma$ when $|\sigma| \to 0$, if this limit exists, i.e.:
$$\lim_{|\sigma| \to 0} E\Big| I_\sigma - \int_S \varphi(u)\, dZ_u \Big|^2 = 0.$$
This is still sometimes written:
$$\int_S \varphi(u)\, dZ_u = L^2\text{-}\lim_{|\sigma| \to 0} I_\sigma.$$
3.3.2. Results
3.3.2.1. Process with orthogonal increments and associated measures
$S$ designates here a bounded interval of $\mathbb{R}$.
DEFINITION.– We call a random process with continuous parameter and base $S$ any family of r.v. $Z_u$, the parameter $u$ describing $S$. This process will be denoted $Z_S = \{Z_u \mid u \in S\}$. Furthermore, we say that such a process is:
– centered if $EZ_u = 0 \quad \forall u \in S$;
– of the 2nd order if $E|Z_u|^2 < \infty$ (i.e. $Z_u \in L^2(dP)$);
– continuous in $L^2$ if $E(Z_{u+\Delta u} - Z_u)^2 \to 0$ when $\Delta u \to 0$, $\forall u$ and $u + \Delta u \in S$ (we also speak of right continuity in $L^2$ when $\Delta u > 0$, or of left continuity when $\Delta u < 0$).

In what follows $Z_S$ will be centered, of the 2nd order and continuous in $L^2$.
DEFINITION.– We say that the process $Z_S$ has orthogonal increments ($Z_S$ is a p.o.i.) if $\forall u_1, u_2, u_3, u_4 \in S$ with $u_1 < u_2 \leq u_3 < u_4$:
$$\big\langle Z_{u_4} - Z_{u_3},\, Z_{u_2} - Z_{u_1} \big\rangle_{L^2(dP)} = E\big(Z_{u_4} - Z_{u_3}\big)\big(Z_{u_2} - Z_{u_1}\big) = 0.$$

We say that $Z_S$ is a process with orthogonal and stationary increments ($Z_S$ is a p.o.s.i.) if $Z_S$ is a p.o.i. and if in addition $\forall u_1, u_2, u_3, u_4$ with $u_4 - u_3 = u_2 - u_1$ we have:
$$E\big(Z_{u_4} - Z_{u_3}\big)^2 = E\big(Z_{u_2} - Z_{u_1}\big)^2.$$
PROPOSITION.– To every p.o.i. $Z_S$ which is right continuous in $L^2$, we can associate:
– a non-decreasing function $F$ on $S$ such that $F(u') - F(u) = E(Z_{u'} - Z_u)^2$ if $u < u'$;
– a measure $\mu$ on $\mathcal{B}(S)$ such that $\forall u, u' \in S$ with $u < u'$:
$$\mu\big(\,]u, u']\,\big) = F(u') - F(u^-).$$
3.3.2.2. Wiener stochastic integral

Let $Z_S$ still be a p.o.i., right continuous in $L^2$, and $\mu$ the associated measure.

PROPOSITION.– Given $\varphi \in L^2(\mu)$ with complex values:
1) The limit $L^2\text{-}\lim_{|\sigma| \to 0} \Big( \sum_{u_j \in \sigma} \varphi(u_j)\big(Z_{u_j} - Z_{u_{j-1}}\big) \Big)$ exists. This is by definition Wiener's stochastic integral $\int_S \varphi(u)\, dZ_u$.
2) Given $\varphi$ and $\psi \in L^2(\mu)$ with complex values, we have the property:
$$E \int_S \varphi(u)\, dZ_u\; \overline{\int_S \psi(u)\, dZ_u} = \int_S \varphi(u)\, \overline{\psi(u)}\, d\mu(u),$$
in particular $E\Big| \int_S \varphi(u)\, dZ_u \Big|^2 = \int_S |\varphi(u)|^2\, d\mu(u)$.

Idea of the demonstration

Let us denote by $\varepsilon$ the vector space of step functions with complex values. We begin by proving the proposition for functions $\varphi, \psi, \dots \in \varepsilon$:
$$\varphi(u) = \sum_j a_j\, 1_{]u_{j-1}, u_j]}(u) \quad \text{and (if } \varphi \in \varepsilon\text{)} \quad \int_S \varphi(u)\, dZ_u = \sum_j \varphi(u_j)\big(Z_{u_j} - Z_{u_{j-1}}\big).$$

We next establish the result in the general case by using the fact that $\varepsilon \,(\subset L^2(\mu))$ is dense in $L^2(\mu)$, i.e. $\forall \varphi \in L^2(\mu)$ we can find a sequence $\varphi_n \in \varepsilon$ such that:
$$\|\varphi - \varphi_n\|^2_{L^2(\mu)} = \int_S |\varphi(u) - \varphi_n(u)|^2\, d\mu(u) \to 0 \quad \text{when } n \to \infty.$$
3.3.2.3. Spectral representation

We start with $X_\theta$, a WSS process. Following Herglotz's theorem, we know that its covariance function $n\theta \mapsto C(n\theta)$ is written:
$$C(n\theta) = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} e^{2i\pi(n\theta)u}\, d\mu_X(u)$$
where $\mu_X$ is the spectral measure on $\mathcal{B}\big(\big[-\frac{1}{2\theta}, \frac{1}{2\theta}\big]\big)$.

PROPOSITION.– If $X_\theta$ is a centered WSS process of covariance function $n\theta \mapsto C(n\theta)$ and of spectral measure $\mu_X$, there exists a unique p.o.i. $Z_S = \big\{Z_u \mid u \in S = \big[-\frac{1}{2\theta}, \frac{1}{2\theta}\big]\big\}$ such that:
$$\forall j \in \mathbb{Z} \qquad X_{j\theta} = \int_S e^{2i\pi(j\theta)u}\, dZ_u.$$

Moreover, the measure associated with $Z_S$ is the spectral measure $\mu_X$. The expression of the $X_{j\theta}$ as Wiener integrals is called the spectral representation of the process.

NOTE.– By applying the property stated in 2) of the preceding proposition:
$$E X_{j\theta}\, X_{(j+n)\theta} = E \int_S e^{2i\pi(j\theta)u}\, dZ_u\; \overline{\int_S e^{2i\pi((j+n)\theta)u}\, dZ_u} = \int_S e^{-2i\pi(n\theta)u}\, d\mu_X(u) = C(-n\theta) = C(n\theta).$$
3.4. Introduction to digital filtering

We suppose again that $\theta = 1$. Given a WSS process $X$ and a sequence of real numbers $h = \{h_j \in \mathbb{R} \mid j \in \mathbb{Z}\}$, we are interested in the operation which makes a new process $Y$ correspond to $X$, defined by:
$$\forall K \in \mathbb{Z} \qquad Y_K = \sum_{j=-\infty}^{+\infty} h_j X_{K-j} = \Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) X_K$$
($h_0 T^0$ is also denoted $h_0 1$, where $1$ is the identity mapping of $L^2$ into $L^2$).

In what follows we will always assume that $\sum_{j=-\infty}^{+\infty} |h_j| < \infty$; this condition is generally denoted $h \in \ell^1$ and is called (for reasons which will be explained later) the condition of stability.

DEFINITION.– We say that the process $Y$ is the transform (or filtration) of the process $X$ by the filter $H(T) = \sum_{j=-\infty}^{+\infty} h_j T^j$, and we write $Y = H(T)X$.

NOTES.–
1) The filter $H(T)$ is entirely determined by the sequence of coefficients $h = \{h_j \in \mathbb{R} \mid j \in \mathbb{Z}\}$ and, according to the case in hand, we will speak of the filter $H(T)$, of the filter $h$, or again of the filter $(\dots, h_{-m}, \dots, h_{-1}, h_0, \dots, h_n, \dots)$.
2) The expression "$\forall K \in \mathbb{Z}$, $Y_K = \sum_{j=-\infty}^{+\infty} h_j X_{K-j}$" is the definition of the convolution product (noted $*$) of $X$ by $h$, which is also written $Y = h * X$, or again $\forall K \in \mathbb{Z}$, $Y_K = (h * X)_K$.
3) Given that $X$ is a WSS process and $\bar{H}_X$ is the associated linear space, it is clear that the r.v. $Y_K = \sum_{j=-\infty}^{+\infty} h_j X_{K-j}$ remain in $\bar{H}_X$ and that the process $Y$ is also WSS.

Causal filter

Physically, for whatever $K$ is given, $Y_K$ can only depend on the previous r.v. $X_{K-j}$ in the wide sense, i.e. with $j \in \mathbb{N}$. A filter $H(T)$ which realizes this condition is called causal or feasible. Amongst these causal filters, we can further distinguish two major classes:
1) Filters with finite impulse response (FIR), such that:
$$\forall K \in \mathbb{Z} \qquad Y_K = \sum_{j=0}^{N} h_j X_{K-j}$$
the schematic representation of which follows.

Figure 3.4. Schema of a FIR filter
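The convolution $Y_K = \sum_j h_j X_{K-j}$ defining a causal FIR filter can be sketched with `np.convolve` (NumPy assumed; the filter coefficients and the white-noise input are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
h = np.array([0.5, 0.3, 0.2])    # a hypothetical causal FIR filter h_0, h_1, h_2
X = rng.normal(size=1000)        # one trajectory of a WSS process (white noise here)

# Y_K = sum_j h_j X_{K-j}: the convolution product Y = h * X
# (np.convolve implicitly takes X_{K-j} = 0 for K - j < 0)
Y = np.convolve(X, h, mode="full")[:len(X)]
```

Keeping the first `len(X)` samples of the full convolution gives exactly the causal sums $h_0 X_K + h_1 X_{K-1} + h_2 X_{K-2}$ with zero initial conditions.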
2) Filters with infinite impulse response (IIR), such that:
$$\forall K \in \mathbb{Z} \qquad Y_K = \sum_{j=0}^{\infty} h_j X_{K-j}$$

NOTES.–
1) Let us explain the role played by the operator $T$: at any particular instant $K$, it replaces $X_K$ with $X_{K-1}$; we can also say that $T$ blocks the r.v. $X_{K-1}$ for a unit of time and restores it at instant $K$.
2) Let $H(T)$ be an IIR filter. At the instant $K$:
$$Y_K = \sum_{j=0}^{\infty} h_j X_{K-j} = h_0 X_K + \dots + h_K X_0 + h_{K+1} X_{-1} + \dots$$
For a process $X$ beginning at the instant $0$, we will thus have:
$$\forall K \in \mathbb{N} \qquad Y_K = \sum_{j=0}^{K} h_j X_{K-j}$$
Example of filtering of a Gaussian process

Let us consider the Gaussian process $X \sim N\big(m(j), \Gamma(i, j)\big)$ and the FIR filter $H(T)$ defined by $h = (\dots, 0, \dots, 0, h_0, \dots, h_N, 0, \dots)$. We immediately verify that the process $Y = H(T)X$ is Gaussian. Let us consider for example the filtering of $X \sim N\big(0, e^{-|j-i|}\big)$ specified by:
$$\forall K \in \mathbb{Z} \qquad Y_K = \sum_{j=0}^{1} h_j X_{K-j} = -X_K + 2X_{K-1}$$

$Y$ is a Gaussian process. Let us determine its parameters:
$$m_Y(j) = EY_j = 0$$
$$\Gamma_Y(i, j) = E Y_i Y_j = E\big( (-X_i + 2X_{i-1})(-X_j + 2X_{j-1}) \big) = E X_i X_j - 2E X_{i-1} X_j - 2E X_i X_{j-1} + 4E X_{i-1} X_{j-1} = 5e^{-|j-i|} - 2e^{-|j-i+1|} - 2e^{-|j-i-1|}$$
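The closed form for $\Gamma_Y$ can be verified with a small matrix computation: writing the FIR filter as a banded matrix $H$ acting on $(X_0, \dots, X_{n-1})$, the covariance of the filtered vector is $H \Gamma_X H^T$. A sketch (NumPy assumed):

```python
import numpy as np

n = 12
idx = np.arange(n)
Gamma_X = np.exp(-np.abs(idx[:, None] - idx[None, :]))   # Gamma_X(i,j) = e^{-|j-i|}

# banded matrix of the FIR filter Y_K = -X_K + 2 X_{K-1}, for K = 1, ..., n-1
H = np.zeros((n - 1, n))
for K in range(1, n):
    H[K - 1, K] = -1.0
    H[K - 1, K - 1] = 2.0

Gamma_Y = H @ Gamma_X @ H.T      # covariance matrix of (Y_1, ..., Y_{n-1})

def gamma_y(i, j):
    # closed form from the text: 5 e^{-|j-i|} - 2 e^{-|j-i+1|} - 2 e^{-|j-i-1|}
    d = j - i
    return 5*np.exp(-abs(d)) - 2*np.exp(-abs(d + 1)) - 2*np.exp(-abs(d - 1))
```

Every entry of `Gamma_Y` matches `gamma_y` evaluated at the corresponding pair of instants, confirming the hand computation above.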
Inverse filter of a causal filter

DEFINITION.– We say that a causal filter $H(T)$ is invertible if there is a filter, denoted $(H(T))^{-1}$ and called the inverse filter of $H(T)$, such that for any WSS process $X$ we have:
$$X = H(T)\big( (H(T))^{-1} X \big) = (H(T))^{-1}\big( H(T) X \big) \quad (*)$$
If such a filter exists, the equality $Y = H(T)X$ is equivalent to the equality $X = (H(T))^{-1} Y$. Furthermore, $(H(T))^{-1}$ is defined by a sequence of coefficients $h' = \{h'_j \in \mathbb{R} \mid j \in \mathbb{Z}\}$ and we have the convolution product $X = h' * Y$.

In order to find the inverse filter $(H(T))^{-1}$, i.e. in order to find the sequence of coefficients $h' = \{h'_j \in \mathbb{R} \mid j \in \mathbb{Z}\}$, we write that the sequence of equalities $(*)$ is equivalent to:
$$\forall K \in \mathbb{Z} \qquad X_K = \Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) \Big( \Big( \sum_{j=-\infty}^{+\infty} h'_j T^j \Big) X_K \Big) = \Big( \sum_{j=-\infty}^{+\infty} h'_j T^j \Big) \Big( \Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) X_K \Big)$$
or even to:
$$\Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) \Big( \sum_{j=-\infty}^{+\infty} h'_j T^j \Big) = \Big( \sum_{j=-\infty}^{+\infty} h'_j T^j \Big) \Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) = 1$$
EXAMPLE.– We are examining the causal filter $H(T) = 1 - hT$.
1) If $|h| < 1$, $H(T)$ admits the inverse filter $(H(T))^{-1} = \sum_{j=0}^{\infty} h^j T^j$.

To see this we must verify that, given $X_K$ the r.v. at instant $K$ of a WSS process $X$, we have:
$$(1 - hT)\Big( \Big( \sum_{j=0}^{\infty} h^j T^j \Big) X_K \Big) = X_K \quad (\text{equality in } L^2)$$
$$\Leftrightarrow \lim_N (1 - hT)\Big( \sum_{j=0}^{N} h^j T^j \Big) X_K = X_K \Leftrightarrow \big(1 - h^{N+1} T^{N+1}\big) X_K - X_K = -h^{N+1} X_{K-(N+1)} \to 0 \ \text{when } N \uparrow \infty$$
which is verified if $|h| < 1$ since $\big\| X_{K-(N+1)} \big\|_{L^2} = \big( E X_0^2 \big)^{1/2}$.

We should also note that $(H(T))^{-1}$ is causal.
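The inversion can be demonstrated numerically by filtering a trajectory with $1 - hT$ and then recovering it with a truncated version of $\sum_{j \geq 0} h^j T^j$; the neglected tail is of order $h^{M+1}$. A sketch (NumPy assumed; $h$, the trajectory and the truncation order $M$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
h = 0.6                          # |h| < 1, so H(T) = 1 - hT is invertible
X = rng.normal(size=500)

# forward filter: Y_K = X_K - h X_{K-1}   (taking X_{-1} = 0)
Y = X - h * np.concatenate(([0.0], X[:-1]))

# truncated inverse filter sum_{j=0}^{M} h^j T^j applied to Y
M = 60
X_rec = np.convolve(Y, h**np.arange(M + 1), mode="full")[:len(Y)]
```

With $h = 0.6$ and $M = 60$ the truncation error $h^{M+1}$ is around $10^{-14}$, so the trajectory is recovered to machine precision.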
2) If $|h| > 1$, let us write $(1 - hT) = -hT\big(1 - \frac{1}{h} T^{-1}\big)$, thus:
$$(1 - hT)^{-1} = \Big( 1 - \frac{1}{h} T^{-1} \Big)^{-1} \Big( -\frac{1}{h} T^{-1} \Big).$$
As the operators commute and $\big|\frac{1}{h}\big| < 1$:
$$(1 - hT)^{-1} = -\frac{T^{-1}}{h} \sum_{j=0}^{\infty} \frac{T^{-j}}{h^j} = -\sum_{j=0}^{\infty} \frac{T^{-(j+1)}}{h^{j+1}}.$$
However, this inverse has no physical reality and it is not causal (the "lead operators" $T^{-(j+1)}$ are not causal).
3) If $|h| = 1$, $(1 - T)$ and $(1 + T)$ are not invertible.
Transfer function of a digital filter

DEFINITION.– We call transfer function of a digital filter $H(T) = \sum_{j=-\infty}^{+\infty} h_j T^j$ the function $H(z) = \sum_{j=-\infty}^{+\infty} h_j z^{-j}$, $z \in \mathbb{C}$.

We recognize the definition given in analysis of a Laurent series if we permute $z$ and $z^{-1} = \frac{1}{z}$. As a consequence of this permutation, the transfer functions (sums of the series) will sometimes be written by using the variable $z^{-1}$. We also say that $H(z)$ is the $z$-transform of the digital sequence $h = (\dots, h_{-m}, \dots, h_0, \dots, h_n, \dots)$.

Let us be more precise about the domain of definition of $H(z)$; it is the domain of convergence $K$ of the Laurent series. We already know that $K$ is an annulus of center $0$ and thus has the form:
$$K = \{z \mid 0 \leq r < |z| < R\}$$

Moreover, any circle of the complex plane of center $0$ and radius $\rho$ is denoted by $C(0, \rho)$.
$K$ contains $C(0,1)$ because, owing to the stability hypothesis $\sum_{j=-\infty}^{+\infty} |h_j| < \infty$, the series $\sum_{j=-\infty}^{+\infty} h_j z^{-j}$ converges absolutely $\forall z \in C(0,1)$.

Figure 3.5. Convergence domain of transfer function
The singularities $\sigma_j$ of $H(z)$ verify $|\sigma_j| \leq r$ or $|\sigma_j| \geq R$, and there will be at least one singularity of $H(z)$ on $C(0, r)$ and another on $C(0, R)$ (if not, $K$, the holomorphy domain of $H(z)$, could be enlarged).

If the filter is now causal:
– if it is an IIR filter, then $H(z) = \sum_{j=0}^{\infty} h_j z^{-j}$, so $H(z)$ is holomorphic in $K = \{z \mid 0 \leq r < |z|\}$ ($R = +\infty$);
– if it is an FIR filter, then $H(z) = \sum_{j=0}^{N} h_j z^{-j}$, so $H(z)$ is holomorphic in $K = \{z \mid 0 < |z|\}$ (plane punctured at $0$).
We observe above all that the singularities $\sigma_j$ of the transfer function of a stable, causal filter all have modulus strictly less than 1.

Figure 3.6. Convergence domain of $H(z)$ of an IIR causal filter and convergence domain of $H(z)$ of an FIR causal filter
NOTE.– In the case of a Laurent series $\sum_{j=-\infty}^{+\infty} h_j z^{-j}$ (i.e., in the case of a digital filter $h = \{\dots, h_{-m}, \dots, h_0, \dots, h_n, \dots\}$), its domain of convergence $K$ and thus its sum $H(z)$ are determined in a unique manner, that is to say that the couple $\big(H(z), K\big)$ is associated with the filter.

Reciprocally, if, given $H(z)$, we wish to obtain the filter $h$, it is necessary to begin by specifying the domain in which we wish to expand $H(z)$, because for different domains $K$ we obtain different Laurent expansions having $H(z)$ as sum.

This can be summed up by the double implication $\big(H(z), K\big) \leftrightarrow h$.
z transform
(
)
Given the couple H ( z ) , K , we wish to find filter h .
H being holomorphic in K , we can apply Laurent’s formula: ∀j ∈
hj =
1 2iπ
∫Γ
H ( z) +
z − j +1
dz
where (homotopy argument) Γ is any contour of K and encircling 0 . The integral can be calculated by the residual method or even, since we have a choice of contour
Γ , by choosing Γ = C ( 0,1) and by parameterizing and calculating the integral ∀j ∈
hj =
1 2iπ
iθ ijθ ∫Γ H ( e ) e dθ . +
In order to determine the $h_j$, we can also expand the function $H(z)$ in Laurent series by making use of the usual known expansions.

SUMMARY EXAMPLE.– Let the stable causal filter $H(T) = 1 - hT$ with $|h| < 1$, of transfer function $H(z) = 1 - h z^{-1}$ defined on $\mathbb{C} - \{0\}$. We have seen that it is invertible and that its inverse, equally causal and stable, is $R(T) = \sum_{j=0}^{\infty} h^j T^j$.

The transfer function of the inverse filter is thus:
$$R(z) = \sum_{j=0}^{\infty} h^j z^{-j} = \frac{1}{1 - h z^{-1}} \quad \text{defined on } \{z \mid |z| > |h|\}$$
(note also that $R(z) = \frac{1}{H(z)}$).

Figure 3.7. Definition domain of $H(z)$ and definition domain of $R(z)$

Having $R(z) = \frac{1}{1 - h z^{-1}}$ on $\{z \mid |z| > |h|\}$, let us find (as an exercise) the Laurent expansion of $R(z)$, i.e. the coefficients $h_j$ of $z^{-j}$. Using the Laurent formula:
$$h_j = \frac{1}{2i\pi} \int_{\Gamma^+} R(z)\, z^{j-1}\, dz = \frac{1}{2i\pi} \int_{\Gamma^+} \frac{z^j}{z - h}\, dz$$
where $\Gamma$ is a contour belonging to $\{z \mid |z| > |h|\}$.
By applying the residue theorem:
if $j \geq 0$:
$$h_j = 2i\pi \cdot \frac{1}{2i\pi} \Big( \text{residue of } \frac{z^j}{z - h} \text{ in } h \Big) = \lim_{z \to h} (z - h)\, \frac{z^j}{z - h} = h^j$$
if $j < 0$, writing $\frac{z^j}{z - h} = \frac{1}{z^{|j|}(z - h)}$:
$$h_j = 2i\pi \cdot \frac{1}{2i\pi} \Big[ \Big( \text{residue of } \frac{1}{z^{|j|}(z - h)} \text{ in } 0 \Big) + \Big( \text{residue of } \frac{1}{z^{|j|}(z - h)} \text{ in } h \Big) \Big] = 0.$$
is a WSS process and
linear space; we are still considering the filter
H ( z) =
+∞
∑
j =−∞
h j z − j with
+∞
∑
hj
H
X
is the associated
H (T ) of transfer function
hj < ∞ .
j =−∞
So: 1)
∀K ∈
⎛ +∞ ⎞ j ⎜⎜ ∑ q jT ⎟⎟ X K = ⎝ j =−∞ ⎠
That is to say that the r.v. YK =
H
X
+∞
∑
j =−∞
+∞
∑ q j X K − j converges in H X .
j =−∞
h j X K − j of the filtered process remain in
; we say that the filter is stable.
2) The filtered process Y is WSS. 3) The spectral densities of X
SYY ( u ) = H ( −2iπ u )
2
and of Y are linked by the relationship:
S XX ( u )
Introduction to Discrete Time Processes
127
DEMONSTRATION.–
1) We have to show that $\forall K \in \mathbb{Z}$, there exists an r.v. $Y_K \in \bar{H}_X \subset L^2(dP)$ such that the sequence $N \mapsto \sum_{j=-N}^{N} h_j X_{K-j}$ converges for the norm of $\bar{H}_X$ towards $Y_K$ when $N \uparrow \infty$. As $\bar{H}_X$ is a Banach space, it is sufficient to verify the normal convergence, namely:
$$\sum_{j=-\infty}^{+\infty} \big\| h_j X_{K-j} \big\| = \sum_{j=-\infty}^{+\infty} |h_j|\, \big( E X_{K-j}^2 \big)^{1/2} < \infty$$
which is true as a result of the stability hypothesis $\sum_{j=-\infty}^{+\infty} |h_j| < \infty$ and of the wide sense stationarity: $E X_{K-j}^2 = \sigma^2 + m^2$.
2) We must verify that $E Y_K$ is independent of $K$ and that $\mathrm{Cov}(Y_i, Y_j)$ has the form $C_Y(j - i)$, which is immediate.
3)
$$C_Y(j - i) = \mathrm{Cov}(Y_i, Y_j) = \sum_{\ell, \ell'} h_\ell\, h_{\ell'}\, \mathrm{Cov}\big( X_{j-\ell}, X_{i-\ell'} \big)$$
and, by using the definition of $S_{XX}(u)$:
$$C_Y(j - i) = \sum_{\ell, \ell'} h_\ell\, h_{\ell'} \int_{-1/2}^{1/2} \exp\big( 2i\pi \big( (j - \ell) - (i - \ell') \big) u \big)\, S_{XX}(u)\, du$$

It is easy to verify that we can invert the symbols $\sum$ and $\int$, in such a way that:
$$C_Y(j - i) = \int_{-1/2}^{1/2} \exp\big( 2i\pi (j - i) u \big) \Big( \sum_{\ell, \ell'} h_\ell\, h_{\ell'}\, \exp\big( 2i\pi (\ell' - \ell) u \big) \Big) S_{XX}(u)\, du$$
$$= \int_{-1/2}^{1/2} \exp\big( 2i\pi (j - i) u \big) \Big| \sum_{\ell} h_\ell\, \exp(2i\pi \ell u) \Big|^2 S_{XX}(u)\, du = \int_{-1/2}^{1/2} \exp\big( 2i\pi (j - i) u \big)\, \big| H\big(e^{-2i\pi u}\big) \big|^2\, S_{XX}(u)\, du$$
and in going back to the definition of $S_{YY}(u)$, we in fact have $S_{YY}(u) = \big| H\big(e^{-2i\pi u}\big) \big|^2\, S_{XX}(u)$.
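The relation $S_{YY}(u) = |H(e^{-2i\pi u})|^2 S_{XX}(u)$ can be checked exactly for white-noise input, where $S_{XX}(u) = \sigma^2$ and $C_Y(n) = \sigma^2 \sum_\ell h_\ell h_{\ell+n}$. A sketch (NumPy assumed; the FIR coefficients are a hypothetical example):

```python
import numpy as np

sigma2 = 2.0
h = np.array([1.0, -0.5, 0.25])     # a hypothetical causal FIR filter
u = np.linspace(-0.5, 0.5, 201)

# transfer function on the unit circle: H(z) = sum_j h_j z^{-j}, z = e^{-2*i*pi*u}
z = np.exp(-2j * np.pi * u)
H = sum(h[j] * z**(-j) for j in range(len(h)))

# covariance of the filtered white noise: C_Y(n) = sigma2 * sum_l h_l h_{l+n}
C_Y = sigma2 * np.correlate(h, h, mode="full")     # lags n = -2, ..., 2
n = np.arange(-(len(h) - 1), len(h))
S_YY = (C_Y[:, None]
        * np.exp(-2j*np.pi * n[:, None] * u[None, :])).sum(axis=0).real
```

Summing the finite Fourier series of $C_Y$ reproduces $\sigma^2 |H|^2$ pointwise, which is precisely the proposition for this input.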
3.5. Important example: autoregressive process

DEFINITION.– We call autoregressive process of degree $d \in \mathbb{N}^*$ any centered WSS process $X_\mathbb{Z}$ which verifies $\forall K \in \mathbb{Z}$:
$$X_K = \sum_{j=1}^{d} h_j X_{K-j} + B_K$$
where $B_\mathbb{Z}$ is a white noise of power $E B_K^2 = \sigma^2$.

The family of autoregressive processes of degree $d$ is denoted by $AR(d)$. Thus $\forall K$, $X_K$ is obtained from the $d$ previous values $X_{K-d}, \dots, X_{K-1}$ (modulo the r.v. $B_K$), which can be carried out using the following schema:
Figure 3.8. Autoregressive filter
The equality of the definition can be written $H(T)X = B$, where we have stated that:
$$H(T) = 1 - \sum_{j=1}^{d} h_j T^j.$$
This means that we can obtain $X$ by the filtering of $B$ through the inverse filter $(H(T))^{-1}$, whose schema has already been given above (modulo the direction of the arrows).

PROPOSITION.–
1) Every process $X$ ($AR(d)$) generated by the noise $B$ and by the filter $H(T)$ possesses the spectral density:
$$S_{XX}(u) = \frac{\sigma^2}{\big| H\big(\exp(-2i\pi u)\big) \big|^2}$$
(where the polynomial $H$ has no root of modulus 1).
2) Reciprocally: every WSS process which is centered and possesses a spectral density of the preceding form is autoregressive, of degree equal to the degree of $H$.
DEMONSTRATION.–

1) The proposition on filtering and the relation B = H(T)X, with S_B(u) = σ², lead to the first result announced.

Furthermore, let us suppose that H possesses a root z_0 = exp(−2iπu_0) of modulus 1 and let us state z = exp(−2iπu). Using Taylor's expansion in the proximity of z_0, we should obtain:

H(z) = H′(z_0)(z − z_0) + ... or even H(exp(−2iπu)) = constant × (u − u_0) + ...

and the mapping u → S_XX(u) = σ²/|H(exp(−2iπu))|² could not be integrable in the proximity of u_0 ... as a spectral density must be.

2) If S_XX(u) = σ²/|H(exp(−2iπu))|², the process H(T)X admits the constant spectral density σ² and, as it is centered, it is a white noise B.

PARTICULAR CASE.– Autoregressive process of degree 1, i.e. of the form:

X_K = h X_{K−1} + B_K, i.e. (1 − hT)X_K = B_K   (E)
We notice to begin with that:

1) X is a Markov process: ∀B ∈ B(ℝ):

P(X_K ∈ B | X_{K−1} = α, X_{K−2} = β, ...) = P(hα + B_K ∈ B | X_{K−1} = α, X_{K−2} = β, ...)

and, as B_K is independent of X_{K−1}, X_{K−2}, ...,

= P(hα + B_K ∈ B)
= P(hX_{K−1} + B_K ∈ B | X_{K−1} = α) = P(X_K ∈ B | X_{K−1} = α)

2) If B is a Gaussian white noise, X is itself Gaussian.

Expression of X, solution of (E):

1) We are looking for X, the WSS process solution of (E):

– if |h| = 1, there is no WSS process X which will satisfy (E).

In effect, let us suppose for example that h = 1 and reiterate n times the relation of recurrence; we then obtain:

X_K − X_{K−n−1} = B_K + B_{K−1} + ... + B_{K−n}

and E(X_K − X_{K−n−1})² = E(B_K + B_{K−1} + ... + B_{K−n})² = (n + 1)σ².

However, if the process were WSS, we would also have, ∀n ∈ ℕ:

E(X_K − X_{K−n−1})² = E X_K² + E X²_{K−n−1} − 2 E X_K X_{K−n−1} ≤ 4 E X_K²

which is bounded. We see then that X cannot be WSS.
Let us now suppose that |h| ≠ 1; we would like, if (1 − hT) is an invertible operator, to obtain X_K = (1 − hT)^{−1} B_K;

– if |h| > 1: by writing (1 − hT) = −hT(1 − (1/h)T^{−1}), as |1/h| < 1, we see that we can expand (1 − (1/h)T^{−1})^{−1} (thus we can also expand (1 − hT)^{−1}) in series of powers of T^{−1} (lead operator), but the filter we obtain being non-causal we must reject the solution X obtained;

– if |h| < 1, i.e. if the root of H(z) = 1 − hz^{−1} has a modulus less than 1, we know that the operator (1 − hT) is invertible and that

(1 − hT)^{−1} = Σ_{j=0}^{∞} h^j T^j   (causal filter).

X_K = (1 − hT)^{−1} B_K = Σ_{j=0}^{∞} h^j B_{K−j}

is then the unique solution of (1 − hT)X_K = B_K.

In this form the wide sense stationarity of X is evident. In effect, the B_j being centered and orthogonal:

Var X_K = Σ_{j=0}^{∞} E(h^j B_{K−j})² = σ²/(1 − h²)

Moreover, for n ∈ ℕ:

Cov(X_i, X_{i+n}) = E X_i X_{i+n} = E(Σ_{j=0}^{∞} h^j B_{i−j} Σ_{ℓ=0}^{∞} h^ℓ B_{i+n−ℓ}) = σ² Σ_{j=0}^{∞} h^j h^{j+n} = σ² h^n/(1 − h²)
Finally, ∀n ∈ ℤ:

C(n) = Cov(X_i, X_{i+n}) = σ² h^{|n|}/(1 − h²).

[Figure 3.9. Graph of C(n), covariance function of a process AR(1) (h ∈ ]0,1[)]

– spectral density S_XX(u) of X:

S_XX(u) = Σ_{n=−∞}^{+∞} C(n) exp(−2iπnu) = σ²/(1 − h²) Σ_{n=−∞}^{+∞} h^{|n|} exp(−2iπnu)
= σ²/(1 − h²) [1/(1 − h exp(−2iπu)) + 1/(1 − h exp(2iπu)) − 1]
= σ²/(1 − 2h cos 2πu + h²)
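These two expressions of the second-order structure of an AR(1) process can be confronted numerically; the sketch below (assumed values h = 0.6, σ² = 2, truncation order N = 200 chosen arbitrarily) sums the Fourier series of C(n) and compares it with the closed form of S_XX(u):

```python
import math

# Numerical check (a sketch, assumed parameters) that the AR(1) covariance
# C(n) = sigma^2 * h^|n| / (1 - h^2) has the spectral density
# S_XX(u) = sigma^2 / (1 - 2*h*cos(2*pi*u) + h^2).
h, sigma2 = 0.6, 2.0

def C(n):
    return sigma2 * h ** abs(n) / (1 - h ** 2)

def S(u):
    return sigma2 / (1 - 2 * h * math.cos(2 * math.pi * u) + h ** 2)

N = 200  # truncation: the neglected tail is of order h^N
for u in [0.0, 0.1, 0.3, 0.5]:
    # the symmetric sum of C(n) exp(-2i pi n u) reduces to a cosine series
    series = sum(C(n) * math.cos(2 * math.pi * n * u) for n in range(-N, N + 1))
    assert abs(series - S(u)) < 1e-9
```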
2) General solution of (E): this is the sum of the solution found of the equation with second member X_K − hX_{K−1} = B_K, i.e. Σ_{j=0}^{∞} h^j B_{K−j}, and of the general solution of the equation without second member X_K − hX_{K−1} = 0, i.e. A h^K where A is any r.v. The general solution

X_K = Σ_{j=0}^{∞} h^j B_{K−j} + A h^K

is no longer WSS, except if A = 0.

3.6. Exercises for Chapter 3

Exercise 3.1.

Study the stationarity of the Gaussian process X for which E X(K) = m(K) is constant and Cov(X_j, X_K) = min(j, K).

Exercise 3.2.
We are considering the real sequence h_n defined by:

h_n = 2^n if n < 0 and h_n = 1/4^n if n ≥ 0.

1) Determine the convergence domain of the Laurent series Σ_{n=−∞}^{+∞} h_n z^n.

2) If h = {h_n | n ∈ ℤ} is a digital filter, determine its transfer function H(z) by clarifying its definition domain.

Solution 3.2.

1) Σ_{n=−∞}^{+∞} h_n z^n = Σ_{n=−∞}^{−1} (2z)^n + Σ_{n=0}^{∞} (z/4)^n = Σ_{n=1}^{∞} (1/(2z))^n + Σ_{n=0}^{∞} (z/4)^n

The series converges if |z| > 1/2 and if |z| < 4, thus in the annulus K = {z | 1/2 < |z| < 4}.

2) H(z) = Σ_{n=−∞}^{+∞} h_n z^{−n} = Σ_{n=1}^{∞} (z/2)^n + Σ_{n=0}^{∞} (1/(4z))^n

The series converges if |z| < 2 and if |z| > 1/4, thus in the annulus K′ = {z | 1/4 < |z| < 2}.

In K′:

H(z) = (1/(1 − z/2) − 1) + 1/(1 − 1/(4z)) = 7z/((2 − z)(4z − 1)).
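The closed form can be checked numerically; the sketch below compares the truncated series with 7z/((2 − z)(4z − 1)) at sample points of the annulus K′ (the truncation order N = 200 is an arbitrary choice):

```python
# A sketch checking Solution 3.2 numerically: inside the annulus
# 1/4 < |z| < 2 the truncated series sum_{n>=1}(z/2)^n + sum_{n>=0}(1/(4z))^n
# should agree with the closed form 7z/((2 - z)(4z - 1)).
def H_series(z, N=200):
    return (sum((z / 2) ** n for n in range(1, N))
            + sum((1 / (4 * z)) ** n for n in range(N)))

def H_closed(z):
    return 7 * z / ((2 - z) * (4 * z - 1))

for z in [0.5, 1.0, -1.2, 0.3 + 0.8j]:
    assert abs(H_series(z) - H_closed(z)) < 1e-9
```

At z = 1, for instance, both sides equal 7/3.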
Exercise 3.3.

Develop H(z) = (16 − 6z)/((2 − z)(4 − z)) in series (of Laurent) of powers of z in the three following domains:

– {z | |z| < 2}
– {z | 2 < |z| < 4}
– {z | |z| > 4}

H(z) representing each time a transfer function, clarify in the three cases if the corresponding filter is stable and if it is causal.

Solution 3.3.

H(z) = 2/(2 − z) + 4/(4 − z) = 1/(1 − z/2) + 1/(1 − z/4)

– If |z| < 2:

H(z) = Σ_{n=0}^{∞} (1/2^n + 1/4^n) z^n = Σ_{n=−∞}^{0} (2^n + 4^n) z^{−n}

The filter is stable, for Σ_{n=0}^{∞} (1/2^n + 1/4^n) < ∞, but it is not causal since the series contains positive powers of z;

– If 2 < |z| < 4, we write:

H(z) = (−2/z)/(1 − 2/z) + 1/(1 − z/4) = −Σ_{n=1}^{∞} 2^n z^{−n} + Σ_{n=0}^{∞} (z/4)^n = Σ_{n=1}^{∞} (−2^n) z^{−n} + Σ_{n=−∞}^{0} 4^n z^{−n}

the filter is neither stable nor causal;

– If |z| > 4, we write:

H(z) = (−2/z)/(1 − 2/z) + (−4/z)/(1 − 4/z) = Σ_{n=1}^{∞} −(2^n + 4^n) z^{−n}

the filter is unstable and causal.
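The three expansions can be checked against the closed form, one sample point per annulus (a sketch; the truncation order N = 300 is arbitrary):

```python
# A sketch checking Solution 3.3: each series expansion of
# H(z) = (16 - 6z)/((2 - z)(4 - z)) is compared with the closed form
# at one sample point of its annulus of convergence.
def H(z):
    return (16 - 6 * z) / ((2 - z) * (4 - z))

N = 300

# |z| < 2: H(z) = sum_{n>=0} (1/2^n + 1/4^n) z^n  (stable, non-causal filter)
z = 0.8
s = sum((2 ** -n + 4 ** -n) * z ** n for n in range(N))
assert abs(s - H(z)) < 1e-9

# 2 < |z| < 4: H(z) = -sum_{n>=1} 2^n z^-n + sum_{n>=0} 4^-n z^n
z = 3.0
s = (-sum(2 ** n * z ** -n for n in range(1, N))
     + sum(4 ** -n * z ** n for n in range(N)))
assert abs(s - H(z)) < 1e-9

# |z| > 4: H(z) = -sum_{n>=1} (2^n + 4^n) z^-n  (causal, unstable filter)
z = 5.0
s = -sum((2 ** n + 4 ** n) * z ** -n for n in range(1, N))
assert abs(s - H(z)) < 1e-9
```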
Exercise 3.4.

We are examining a Gaussian white noise B (let us remember that the B_K are independent Gaussian r.v.; E B_K = 0 and Var B_K = σ²). Moreover, we allow two real numbers α and β which are different and which verify |α| < 1 and |β| < 1.

1) Construct a stationary centered process X such that

X_K = α X_{K−1} + B_K − β B_{K−1}, K ∈ ℤ

and determine its spectral density S_XX(u).

2) Let us denote the linear space generated by the r.v. X_n, n ≤ 0, as H^X, and the linear space generated by the r.v. B_n, n ≤ 0, as H^B. Verify that H^X = H^B.

3) We state Y_K = Σ_{n=0}^{∞} β^n X_{K−n}, K ∈ ℤ. Express Y_K according to the white noise and deduce from it the best linear approximation of Y_K (K ≥ 1) expressed with the help of the X_n, n ≤ 0.

4) Show that the r.v. Y_K are Gaussian and centered, and calculate their covariances.

Solution 3.4.
1) The equality defining X_K allows us to write (1 − αT)X_K = (1 − βT)B_K, and the operator (1 − αT) is invertible as |α| < 1:

X_K = (1 − αT)^{−1}(1 − βT)B_K = (Σ_{n=0}^{∞} α^n T^n)(1 − βT)B_K

Thus X_K = B_K + Σ_{n=1}^{∞} α^{n−1}(α − β)B_{K−n}, and X is in fact stationary.

Furthermore, the process X is generated from B by the filter (1 − αT)^{−1}(1 − βT). Thus, according to the theorem on filtering:

S_XX(u) = |1 − β exp(−2iπu)|²/|1 − α exp(−2iπu)|² σ².

2) According to 1), X_K ∈ H^B, thus H^X ⊆ H^B. Reciprocally, ∀K, B_K = (1 − βT)^{−1}(1 − αT)X_K and, starting from calculations similar to those previously performed, B_K ∈ H^X; thus H^B ⊆ H^X and H^X = H^B.

3) Y_K = Σ_{n=0}^{∞} β^n X_{K−n} = (Σ_{n=0}^{∞} β^n T^n)X_K = (1 − βT)^{−1}X_K

Thus Y_K = (1 − βT)^{−1}(1 − αT)^{−1}(1 − βT)B_K and, as the operators can be permutated,

Y_K = (1 − αT)^{−1}B_K = Σ_{n=0}^{∞} α^n B_{K−n}

Since H^X = H^B, the best linear approximation of Y_K is:

proj_{H^B} Y_K = proj_{H^B}(Σ_{n=0}^{∞} α^n B_{K−n}) = Σ_{n=0}^{∞} α^{n+K} B_{−n} = α^K Σ_{n=0}^{∞} α^n B_{−n} = α^K Y_0 = α^K Σ_{n=0}^{∞} β^n X_{−n}

4) Since Y_K = Σ_{n=0}^{∞} α^n B_{K−n}, the Y_K are centered Gaussian r.v.

Moreover, for K ≥ j:

Cov(Y_j, Y_K) = Σ_{m=0}^{∞} Σ_{n=0}^{∞} α^{m+n} E(B_{K−n}B_{j−m}) = Σ_{m=0}^{∞} α^{2m+K−j} E B²_{j−m} = α^{K−j} Σ_{m=0}^{∞} α^{2m} σ² = α^{K−j} σ²/(1 − α²)
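A sketch checking this solution at the level of impulse-response coefficients (assumed values α = 0.7, β = −0.4, σ² = 1): the coefficients c_n of X_K = Σ c_n B_{K−n} must satisfy the ARMA recursion, and the covariance of Y follows from the orthogonality of the B_n:

```python
# A sketch checking Solution 3.4 (assumed parameters): X_K = sum_n c_n B_{K-n}
# with c_0 = 1, c_n = alpha^(n-1)*(alpha - beta), and Y_K = sum_n alpha^n B_{K-n};
# the covariance of Y follows by orthogonality of the white noise B.
alpha, beta, sigma2 = 0.7, -0.4, 1.0
N = 400

c = [1.0] + [alpha ** (n - 1) * (alpha - beta) for n in range(1, N)]

# (1 - alpha*T) X = (1 - beta*T) B  <=>  c_0 = 1, c_1 - alpha*c_0 = -beta,
# and c_n - alpha*c_{n-1} = 0 for n >= 2
assert abs(c[1] - alpha * c[0] + beta) < 1e-12
for n in range(2, N):
    assert abs(c[n] - alpha * c[n - 1]) < 1e-12

# Cov(Y_j, Y_K) = sigma^2 sum_m alpha^m alpha^(m + K - j)
#              = alpha^(K-j) * sigma^2 / (1 - alpha^2)
def cov_Y(lag):
    return sigma2 * sum(alpha ** m * alpha ** (m + lag) for m in range(N))

for lag in [0, 1, 3]:
    assert abs(cov_Y(lag) - alpha ** lag * sigma2 / (1 - alpha ** 2)) < 1e-9
```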
Exercise 3.5.

Let X be a process verifying Σ_{n=0}^{∞} b_n X_{K−n} = B_K (b_n ∈ ℝ), where B is a white noise of power σ². In addition we state b(z) = Σ_{n=0}^{∞} b_n z^n.

1) Show that, if j < K:

E X_j B_K = σ²/(2iπ) ∮_{C+} z^{K−j−1}/b(z) dz

(the integral of the complex variable z where C = {z | |z| = 1}).

2) Verify that if b(z) does not possess a root in the disk {z | |z| ≤ 1}, then ∀j < K, X_j ⊥ B_K (E X_j B_K = 0).

Solution 3.5.

1) E X_j B_K = Σ_{n=0}^{∞} b_n E X_j X_{K−n} and, by definition of the spectral density S_XX(u) of X:

E X_j X_{K−n} = Cov(X_j, X_{K−n}) = ∫_{−1/2}^{1/2} exp(2iπ(j − K + n)u) S_XX(u) du

Moreover, since (Σ_{n=0}^{∞} b_n T^n)X_K = B_K, X is obtained by filtering B (of spectral density σ²) by the filter of transfer function 1/b(z), and by the theorem on filtering:

S_XX(u) = σ²/|b(exp(−2iπu))|²

from where:

E X_j B_K = σ² ∫_{−1/2}^{1/2} exp(2iπ(j − K)u) Σ_{n=0}^{∞} b_n exp(2iπnu)/|b(exp(−2iπu))|² du
= σ² ∫_{−1/2}^{1/2} exp(2iπ(j − K)u) b(exp(2iπu))/|b(exp(−2iπu))|² du
= σ² ∫_{−1/2}^{1/2} exp(2iπ(j − K)u) · 1/b(exp(−2iπu)) du

In stating z = exp(−2iπu), dz = −2iπ z du, and finally:

E X_j B_K = σ²/(2iπ) ∮_{C+} z^{K−j−1}/b(z) dz

2) If b(z) does not possess a root in {z | |z| ≤ 1}, the function to be integrated is holomorphic inside the open disk D(0,1) (note that K − j − 1 ≥ 0 for j < K) and, using Cauchy's theorem, E X_j B_K = 0.
Chapter 4
Estimation
4.1. Position of the problem

We are examining two discrete time processes X_{ℕ*} = (X_1, ..., X_j, ...) and Y_{ℕ*} = (Y_1, ..., Y_j, ...):

– of the 2nd order;
– not necessarily wide sense stationary (WSS) (thus they do not necessarily have a spectral density).

X_{ℕ*} is called the state process; it is the process (physical, for example) that we are seeking to estimate, but it is not accessible directly.

Y_{ℕ*} is called the observation process; it is the process we observe (we observe a trajectory y_{ℕ*} = (y_1, ..., y_j, ...) which allows us to estimate the corresponding trajectory x_{ℕ*} = (x_1, ..., x_j, ...)).

A traditional example is the following:

X_{ℕ*} = (X_1, ..., X_j, ...)
Y_{ℕ*} = X_{ℕ*} + U_{ℕ*} = (X_1 + U_1, ..., X_j + U_j, ...)

where U_{ℕ*} is also a random process.

We thus say that the state process is perturbed by a parasite noise U_{ℕ*} (perturbation due to its measurement, transmission, etc.).

In what follows, the hypotheses and data below will be admitted:

– ∀j ∈ ℕ*, X_j and Y_j ∈ L²(dP);
– ∀i, j ∈ ℕ* × ℕ*, we know E X_j, Cov(X_i, Y_j) and Cov(Y_i, Y_j).

PROBLEM.– Having observed (or registered) a trajectory y_{ℕ*} of Y_{ℕ*} up to the instant K − 1, we want, for a given instant p, to determine the value "x̂_p which best approaches x_p (unknown)".

[Figure 4.1. Three trajectories: y_{ℕ*} = (y_1, ..., y_j, ...), x̂_{ℕ*} = (x̂_1, ..., x̂_j, ...) and x_{ℕ*} = (x_1, ..., x_j, ...), which is unknown]
If:

– p < K − 1 we speak of smoothing;
– p = K − 1 we speak of filtering;
– p > K − 1 we speak of prediction.

NOTE 1.– In the case of prediction, it is possible that we need only consider the process Y_{ℕ*}, as predicting y_p for p > K − 1 is already a problem.

NOTE 2.– Concerning the expression "x̂_p which best approaches x_p": we will see that the hypotheses (knowledge of variances and covariances) allow us to determine X̂_p, the 2nd order r.v. which best approaches X_p in quadratic mean, i.e. the r.v. X̂_p which is such that

E(X_p − X̂_p)² = Min_{Z ∈ L²} E(X_p − Z)²,

which is a result bearing on the means of the r.v. and not on the realizations. However, even if it were only because of the Bienaymé-Tchebychev inequality:

P(|X_p − X̂_p| ≥ C) ≤ E(X_p − X̂_p)²/C² = A

we see that we obtain a result based on the numerical realizations, since this inequality signifies exactly that at instant p, the unknown value x_p will belong to the known interval ]x̂_p − C, x̂_p + C[ with a probability higher than 1 − A.

This chapter is an introduction to Kalman filtering, for which we will have to consider the best estimation of the r.v. X_K (and also possibly of the r.v. Y_K) having observed Y_1, ..., Y_{K−1}, thus assuming that p = K.

SUMMARY.– Being given the observation process Y_{ℕ*}, considered up to the instant K − 1, any estimation Z of X_K will have the form Z = g(Y_1, ..., Y_{K−1}) where g: ℝ^{K−1} → ℝ is a Borel mapping. The problem that we will ask ourselves in the following sections is: how to find the best estimation in terms of quadratic mean X̂_{K|K−1} of X_K, i.e. the r.v. which makes the mapping Z → E(X_K − Z)² minimal (i.e. to find the function ĝ which renders E(X_K − g(Y_1, ..., Y_{K−1}))² minimal). We will have X̂_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}).
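The interval reading of the Bienaymé-Tchebychev inequality above can be illustrated with assumed numerical values for the error variance and the estimate:

```python
# Illustration (assumed values) of NOTE 2: from the Bienayme-Tchebychev
# inequality, the unknown value x_p lies in ]x_hat - C, x_hat + C[ with
# probability at least 1 - E(X_p - X_hat_p)^2 / C^2.
mse = 0.25     # assumed error variance E(X_p - X_hat_p)^2
x_hat = 3.0    # assumed numerical estimate
C = 2.0
lower_bound = 1 - mse / C ** 2
print((x_hat - C, x_hat + C), "with probability >=", lower_bound)
# prints (1.0, 5.0) with probability >= 0.9375
```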
4.2. Linear estimation

The fundamental space that we define below has already been introduced in Chapter 3, but in a different context.

DEFINITION.– The vector space of the linear combinations of the r.v. 1, Y_1, ..., Y_{K−1} is called the linear space of observation up to instant K − 1 and is denoted H^Y_{K−1} (or H(1, Y_1, ..., Y_{K−1})), i.e.:

H^Y_{K−1} = {λ_0 1 + Σ_{j=1}^{K−1} λ_j Y_j | λ_0, ..., λ_{K−1} ∈ ℝ}.

Since the r.v. 1, Y_1, ..., Y_{K−1} ∈ L²(dP), H^Y_{K−1} is a vector subspace (closed, as the number of r.v. is finite) of L²(dP). We can also say that H^Y_{K−1} is a Hilbert subspace of L²(dP).

We are focusing here on the problem stated in the preceding section, but with a simplified hypothesis: g is linear, which means that the envisaged estimators Z of X_K are of the form:

Z = g(Y_1, ..., Y_{K−1}) = λ_0 + Σ_{j=1}^{K−1} λ_j Y_j

and thus belong to H^Y_{K−1}.

The problem presents itself as: find the r.v., denoted X̂_{K|K−1}, which renders minimal the mapping Z ∈ H^Y_{K−1} → E(X_K − Z)² (i.e., find λ̂_0, λ̂_1, ..., λ̂_{K−1} which render minimal:

(λ_0, λ_1, ..., λ_{K−1}) → E(X_K − (λ_0 + Σ_{j=1}^{K−1} λ_j Y_j))²).

We will have X̂_{K|K−1} = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j Y_j.

DEFINITION.–

C(λ_0, λ_1, ..., λ_{K−1}) = E(X_K − (λ_0 + Σ_{j=1}^{K−1} λ_j Y_j))²

is called the "cost function".

The solution is given by the following result, relative to Hilbert spaces.

THEOREM.–

– There exists a unique r.v. X̂_{K|K−1} = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j Y_j which renders the mapping Z ∈ H^Y_{K−1} → E(X_K − Z)² minimal.

– X̂_{K|K−1} is the orthogonal projection of X_K on H^Y_{K−1} (which is also denoted proj_{H^Y_{K−1}} X_K). That is to say X_K − X̂_{K|K−1} ⊥ H^Y_{K−1}.

[Figure 4.2. Orthogonal projection of the vector X_K on H^Y_{K−1}]
This theorem being admitted, we finish off the problem by calculating λ̂_0, λ̂_1, ..., λ̂_{K−1}.

PROPOSITION.– Let us represent the covariance matrix of the vector Y = (Y_1, ..., Y_{K−1}) as Γ_Y.

1) The coefficients λ̂_0, λ̂_1, ..., λ̂_{K−1} of X̂_{K|K−1} = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j Y_j verify:

λ̂_0 = E X_K − Σ_{j=1}^{K−1} λ̂_j E Y_j and Γ_Y (λ̂_1, ..., λ̂_{K−1})^T = (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T

and, if Γ_Y is invertible:

(λ̂_1, ..., λ̂_{K−1})^T = Γ_Y^{−1} (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T;

2) X̃_{K|K−1} = X_K − X̂_{K|K−1} is a centered r.v. which represents the estimation error. We have:

Var X̃_{K|K−1} = Var(X_K − X̂_{K|K−1}) = E(X_K − X̂_{K|K−1})² = Var X_K − Σ_{i,j} λ̂_i λ̂_j Cov(Y_i, Y_j)

and, if Γ_Y is invertible:

= Var X_K − [Cov(X_K, Y_j)]^T Γ_Y^{−1} [Cov(X_K, Y_j)]
DEMONSTRATION.–

1) X_K − X̂_{K|K−1} ⊥ H^Y_{K−1} ⇔ X_K − X̂_{K|K−1} ⊥ 1, Y_1, ..., Y_{K−1}

– X_K − X̂_{K|K−1} ⊥ 1 ⇔ E(X_K − X̂_{K|K−1}) · 1 = E(X_K − (λ̂_0 + Σ_{j=1}^{K−1} λ̂_j Y_j)) = 0

i.e. E X_K = λ̂_0 + Σ_j λ̂_j E Y_j;   (1)

– X_K − X̂_{K|K−1} ⊥ Y_i ⇔ E(X_K − X̂_{K|K−1}) Y_i = E(X_K − (λ̂_0 + Σ_j λ̂_j Y_j)) Y_i = 0

i.e. E X_K Y_i = λ̂_0 E Y_i + Σ_j λ̂_j E Y_j Y_i   (2)

We take λ̂_0 = E X_K − Σ_j λ̂_j E Y_j from (1) and carry it into (2). It becomes:

E X_K Y_i = (E X_K − Σ_j λ̂_j E Y_j) E Y_i + Σ_j λ̂_j E Y_j Y_i
= E X_K E Y_i + Σ_j λ̂_j (E Y_j Y_i − E Y_j E Y_i)

That is to say:

∀i = 1 to K − 1: Σ_j λ̂_j Cov(Y_j, Y_i) = Cov(X_K, Y_i)

or, in the form of a matrix:

Γ_Y (λ̂_1, ..., λ̂_{K−1})^T = (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T.
– If Γ_Y is non-invertible:

Let us recall the equivalences: Γ_Y non-invertible ⇔ Γ_Y is only semi-definite positive ⇔ the r.v. Y_1 − E Y_1, ..., Y_{K−1} − E Y_{K−1} are linearly dependent in L² ⇔ dim H(Y_1 − E Y_1, ..., Y_{K−1} − E Y_{K−1}) < K − 1.

Under this hypothesis, there exists an infinity of tuples (λ̂_1, ..., λ̂_{K−1}) (and thus also an infinity of λ̂_0) which verify the last matrix equality, but all the expressions λ̂_0 + Σ_j λ̂_j Y_j are equal to the same r.v. X̂_{K|K−1}, according to the uniqueness of the orthogonal projection on a Hilbert subspace.

– If Γ_Y is invertible:

The r.v. Y_1 − E Y_1, ..., Y_{K−1} − E Y_{K−1} are linearly independent in L², the coefficients λ̂_0, λ̂_1, ..., λ̂_{K−1} are unique and we obtain:

(λ̂_1, ..., λ̂_{K−1})^T = Γ_Y^{−1} (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T and λ̂_0 = E X_K − Σ_{j=1}^{K−1} λ̂_j E Y_j

2) X_K − X̂_{K|K−1} is centered (obvious).

X_K = (X_K − X̂_{K|K−1}) + X̂_{K|K−1} and, as X_K − X̂_{K|K−1} ⊥ X̂_{K|K−1}, according to Pythagoras' theorem:

E(X_K − X̂_{K|K−1})² = E X_K² − E X̂²_{K|K−1} = E X_K² − E(λ̂_0 + Σ_j λ̂_j Y_j)²
and, since λ̂_0 = E X_K − Σ_j λ̂_j E Y_j:

E(X_K − X̂_{K|K−1})² = E X_K² − E(E X_K + Σ_j λ̂_j (Y_j − E Y_j))²
= E X_K² − (E X_K)² − 2 E X_K Σ_j λ̂_j E(Y_j − E Y_j) − Σ_{i,j} λ̂_i λ̂_j E(Y_i − E Y_i)(Y_j − E Y_j)

(the middle term being zero). From which:

E(X_K − X̂_{K|K−1})² = Var X_K − Σ_{i,j} λ̂_i λ̂_j Cov(Y_i, Y_j)

i.e., in the form of a matrix:

Var X_K − (λ̂_1, ..., λ̂_{K−1}) Γ_Y (λ̂_1, ..., λ̂_{K−1})^T.

In addition, if Γ_Y is invertible, since (λ̂_1, ..., λ̂_{K−1})^T = Γ_Y^{−1} (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T, it becomes:

E(X_K − X̂_{K|K−1})² = Var X_K − (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1})) Γ_Y^{−1} (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T
NOTE.– If Cov(X_K, Y_1) = 0, ..., Cov(X_K, Y_{K−1}) = 0, the r.v. Y_j bring no further information in order to estimate the r.v. X_K in quadratic mean. Furthermore, by going back to the preceding formula:

(λ̂_1, ..., λ̂_{K−1})^T = Γ_Y^{−1} (0, ..., 0)^T = (0, ..., 0)^T and X̂_{K|K−1} = λ̂_0 = E X_K.

We rediscover the known result: being given an r.v. X ∈ L², the constant which minimizes Z → E(X − Z)² is X̂ = E X.
DEFINITION.– The hyperplane of ℝ^K of equation x = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j y_j is called the regression plane of X_K in Y_1, ..., Y_{K−1}.

Practically:

1) The statistical hypotheses on the processes X_{ℕ*} and Y_{ℕ*} have enabled us to calculate the numerical values λ̂_0, λ̂_1, ..., λ̂_{K−1} and thus to obtain the regression plane x = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j y_j (the y_j and x covering ℝ).

2) We want to know the value x_K taken by X_K; we gather the observations y_1, ..., y_{K−1} and we thus deduce the sought estimation x̂_{K|K−1} (this time they are determined values).

3) We are assured that the true value x_K taken by the r.v. X_K is in the interval ]x̂_{K|K−1} − C, x̂_{K|K−1} + C[ with a probability greater than:

1 − E(X_K − X̂_{K|K−1})²/C²

a value that we can calculate using the formula from the preceding proposition.
PARTICULAR CASE.– We are going to estimate X_2 from the sole r.v. of observation Y_1, i.e. we are going to find X̂_{2|1} = λ̂_0 + λ̂_1 Y_1 which minimizes E(X_2 − (λ_0 + λ_1 Y_1))². According to the proposition:

λ̂_1 = (Var Y_1)^{−1} Cov(X_2, Y_1) and λ̂_0 = E X_2 − (Var Y_1)^{−1} Cov(X_2, Y_1) E Y_1

Thus:

X̂_{2|1} = E X_2 + (Cov(X_2, Y_1)/Var Y_1)(Y_1 − E Y_1).

[Figure 4.3. Regression line: 1. we trace the regression line x = λ̂_0 + λ̂_1 y; 2. we measure the value y_1, realization of the r.v. Y_1; 3. we choose x̂_{2|1} to approximate (linearly and in q.m.) the true but unknown value x_2]

Value of the variance of the estimation error:

E X̃²_{2|1} = E(X_2 − X̂_{2|1})² = Var X_2 − Cov(X_2, Y_1)(Var Y_1)^{−1} Cov(X_2, Y_1)
= Var X_2 (1 − (Cov(X_2, Y_1))²/(Var X_2 Var Y_1)).
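A small numerical instance of the proposition (with assumed covariances, K − 1 = 2 observations and a hand-rolled 2 × 2 inverse) can be sketched as follows:

```python
# A numeric instance (hypothetical moments) of the proposition:
# Gamma_Y * (lambda_1, lambda_2)^T = (Cov(X_K, Y_1), Cov(X_K, Y_2))^T,
# lambda_0 = E X_K - sum_j lambda_j E Y_j, and
# error variance = Var X_K - c^T Gamma_Y^{-1} c.
gamma = [[2.0, 1.0], [1.0, 2.0]]       # Cov(Y_i, Y_j), assumed
c = [1.0, 0.5]                          # Cov(X_K, Y_j), assumed
var_x, ex, ey = 3.0, 1.0, [0.5, -0.5]   # assumed moments of X_K and the Y_j

# explicit 2x2 inverse applied to c
det = gamma[0][0] * gamma[1][1] - gamma[0][1] * gamma[1][0]
lam1 = (gamma[1][1] * c[0] - gamma[0][1] * c[1]) / det
lam2 = (gamma[0][0] * c[1] - gamma[1][0] * c[0]) / det
lam0 = ex - lam1 * ey[0] - lam2 * ey[1]
err_var = var_x - (c[0] * lam1 + c[1] * lam2)

assert abs(lam1 - 0.5) < 1e-12 and abs(lam2) < 1e-12
assert abs(lam0 - 0.75) < 1e-12
assert abs(err_var - 2.5) < 1e-12
```

The estimate here is x̂ = 0.75 + 0.5 y_1, with error variance reduced from 3 to 2.5 by the observation.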
NOTE.– It may be interesting to note the parallel which exists between the problem of the best estimation in quadratic mean of X_K and that of the best approximation in L² of a function h by a trigonometric polynomial. We state B([0,T]) = Borel algebra of the interval [0,T] and give a table of the correspondences:

– H^Y_{K−1} ⊂ L²(Ω, a, P) ↔ H(e_{−K}, ..., e_0, ..., e_K) ⊂ L²([0,T], B([0,T]), dt);
– X_K ↔ h;
– X̂_{K|K−1} ↔ ĥ;
– X_K − X̂_{K|K−1} ↔ h − ĥ;
– L²(dP) = {r.v. X | E X² < ∞}, with the scalar product <X, Y> = E XY = ∫_Ω X(ω)Y(ω) dP(ω) ↔ L²(dt) = {f Borel function | ∫_0^T f²(t) dt < ∞}, with the scalar product <f, g> = ∫_0^T f(t)g(t) dt;
– for j = 1 to K − 1: Y_j ∈ L²(dP) ↔ for j = −K to K: e_j(t) = exp(2iπjt/T) ∈ L²(dt);
– linear space H^Y_{K−1} = H(1, Y_1, ..., Y_{K−1}) ↔ linear space H(e_{−K}, ..., e_0, ..., e_K);
– problem: being given the r.v. X_K ∈ L²(dP), find λ̂_0, λ̂_1, ..., λ̂_{K−1}, thus X̂_{K|K−1}, which minimizes E(X_K − (λ_0 + Σ_{j=1}^{K−1} λ_j Y_j))² ↔ problem: being given the function h ∈ L²(dt), find λ̂_{−K}, ..., λ̂_K, thus ĥ, which minimizes ∫_0^T |h(t) − Σ_{j=−K}^{K} λ_j e_j(t)|² dt.

In the problem of the best approximation of a function by a trigonometric polynomial, the coefficients λ̂_j have a very simple expression because the e_j form an orthonormal basis of H(e_{−K}, ..., e_K) and we have:

λ̂_j = (1/T) ∫_0^T h(t) e̅_j(t) dt = C_j, the Fourier coefficients of h.
Variant of the preceding proposition

We are now considering the linear space of observation

H^Y_{K−1} = {Σ_{j=1}^{K−1} λ_j Y_j | λ_j ∈ ℝ}

(without the constant 1) and we are thus seeking the r.v. X̂_{K|K−1} = Σ_{j=1}^{K−1} λ̂_j Y_j which minimizes the mapping Z ∈ H^Y_{K−1} → E(X_K − Z)².

Let us state M_Y = [E Y_i Y_j], the matrix of the 2nd order moments of the random vector (Y_1, ..., Y_{K−1}). We have the following proposition.

PROPOSITION.–

1) The λ̂_j verify M_Y (λ̂_1, ..., λ̂_{K−1})^T = (E X_K Y_1, ..., E X_K Y_{K−1})^T and, if M_Y is invertible:

(λ̂_1, ..., λ̂_{K−1})^T = M_Y^{−1} (E X_K Y_1, ..., E X_K Y_{K−1})^T.

2) E(X_K − X̂_{K|K−1})² = E X_K² − Σ_{i,j} λ̂_i λ̂_j E Y_i Y_j and, if M_Y is invertible:

= E X_K² − (E X_K Y_1, ..., E X_K Y_{K−1}) M_Y^{−1} (E X_K Y_1, ..., E X_K Y_{K−1})^T.

From now on, and in all that follows in this work, the linear space of observation at the instant K − 1 will be H^Y_{K−1} = {Σ_{j=1}^{K−1} λ_j Y_j | λ_j ∈ ℝ}.

INNOVATION.– Let (Y_K)_{K ∈ ℕ*} be a discrete process which (as will be the case in Kalman filtering) can be the observation process of another process (X_K)_{K ∈ ℕ*}, and let us state Ŷ_{K|K−1} = proj_{H^Y_{K−1}} Y_K; Ŷ_{K|K−1} is thus the best linear estimate in quadratic mean of the r.v. Y_K.

DEFINITION.– The r.v. I_K = Y_K − Ŷ_{K|K−1} is called the innovation at instant K (≥ 2). The family of r.v. {I_2, ..., I_K, ...} is called the innovation process.
4.3. Best estimate – conditional expectation

We are seeking to improve the result by considering as estimations of X_K not only the linear functions Σ_{j=1}^{K−1} λ_j Y_j of the r.v. Y_1, ..., Y_{K−1} but the general functions g(Y_1, ..., Y_{K−1}).

PROPOSITION.– The family of r.v. H′^Y_{K−1} = {g(Y_1, ..., Y_{K−1}) | g: ℝ^{K−1} → ℝ Borel function; g(Y_1, ..., Y_{K−1}) ∈ L²} is a closed vector subspace of L².

DEMONSTRATION.–

Let us note again L²(dP) = {r.v. Z | E Z² < ∞} = Hilbert space equipped with the scalar product: ∀Z_1, Z_2 ∈ L²(dP), <Z_1, Z_2>_{L²(dP)} = E Z_1 Z_2.

Furthermore, f_Y(y_1, ..., y_{K−1}) designating the density of the vector Y = (Y_1, ..., Y_{K−1}), in order to simplify the expressions let us state dμ = f_Y(y_1, ..., y_{K−1}) dy_1 ... dy_{K−1} and let us introduce the new Hilbert space:

L²(dμ) = {g: ℝ^{K−1} → ℝ Borel function | ∫_{ℝ^{K−1}} g²(y_1, ..., y_{K−1}) dμ < ∞}.

This is equipped with the scalar product: ∀g_1, g_2 ∈ L²(dμ), <g_1, g_2>_{L²(dμ)} = ∫_{ℝ^{K−1}} g_1(y_1, ..., y_{K−1}) g_2(y_1, ..., y_{K−1}) dμ.

Let us finally consider the linear mapping:

Ψ: g ∈ L²(dμ) → g(Y) = g(Y_1, ..., Y_{K−1}) ∈ L²(dP)

We notice that Ψ conserves the scalar product (and the norm):

<g_1(Y), g_2(Y)>_{L²(dP)} = E g_1(Y) g_2(Y) = ∫_{ℝ^{K−1}} g_1(y) g_2(y) dμ = <g_1, g_2>_{L²(dμ)}

From the hypothesis, H′^Y_{K−1} ⊂ L²(dP); let us verify that H′^Y_{K−1} is a vector subspace of L²(dP).

Let Z_1 and Z_2 ∈ H′^Y_{K−1} and two constants λ_1 and λ_2 ∈ ℝ; g_1 ∈ L²(dμ) is such that Z_1 = g_1(Y) and g_2 ∈ L²(dμ) is such that Z_2 = g_2(Y). Thus:

λ_1 Z_1 + λ_2 Z_2 = λ_1 Ψ(g_1) + λ_2 Ψ(g_2) = Ψ(λ_1 g_1 + λ_2 g_2)

and, as λ_1 g_1 + λ_2 g_2 ∈ L²(dμ), H′^Y_{K−1} is in fact a vector subspace of L²(dP).
Let us show next that H′^Y_{K−1} is closed in L²(dP).

Given Z_p = g_p(Y) = Ψ(g_p) a sequence of H′^Y_{K−1} which converges towards Z ∈ L²(dP), let us verify that Z ∈ H′^Y_{K−1}:

g_p(Y) is a Cauchy sequence of H′^Y_{K−1} and, because of the isometry, g_p is a Cauchy sequence of L²(dμ), which thus converges towards a function g ∈ L²(dμ), i.e.:

‖g_p − g‖²_{L²(dμ)} = ∫_{ℝ^{K−1}} (g_p(y) − g(y))² dμ = E(g_p(Y) − g(Y))² → 0 when p ↑ ∞.

As the limit of g_p(Y) is unique, g(Y) = Z, that is to say Z ∈ H′^Y_{K−1} and H′^Y_{K−1} is closed.

Finally, H′^Y_{K−1} is a Hilbert subspace of L²(dP).

Let us return to our problem, i.e. estimating the r.v. X_K. The best estimator X̂′_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}) ∈ H′^Y_{K−1} of X_K, that is to say the estimator which minimizes E(X_K − g(Y_1, ..., Y_{K−1}))², is (always in accordance with the theorem already cited about Hilbert spaces) the orthogonal projection of X_K on H′^Y_{K−1}, i.e.:

X̂′_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}) = proj_{H′^Y_{K−1}} X_K.

[Figure 4.4. Orthogonal projection of the vector X_K on H′^Y_{K−1}]

[Figure 4.5. Best linear estimation and best estimation: X_K in L²(dP), its projections X̂_{K|K−1} on H^Y_{K−1} and X̂′_{K|K−1} on H′^Y_{K−1}, with the error norms (E(X_K − X̂_{K|K−1})²)^{1/2} and (E(X_K − X̂′_{K|K−1})²)^{1/2}]
In Figure 4.5, the r.v. (vectors of L²) are represented by dots and the norms of the estimation errors are represented by segments. It is clear that we have the inclusions H^Y_{K−1} ⊂ H′^Y_{K−1} ⊂ L²(dP); thus, a priori, being given X_K ∈ L²(dP) − H′^Y_{K−1}, X̂′_{K|K−1} will be a better approximation of X_K than X̂_{K|K−1}, which we can visualize in Figure 4.5.

Finally, to resolve the problem posed entirely, we are looking to calculate X̂′_{K|K−1}.

PROPOSITION.– X̂′_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}) = proj_{H′^Y_{K−1}} X_K is the conditional expectation E(X_K | Y_1, ..., Y_{K−1}).

DEMONSTRATION.–

1) Let us verify to begin with that the r.v. g(Y_1, ..., Y_{K−1}) = E(X_K | Y_1, ..., Y_{K−1}) ∈ L²(dP):

(g(y_1, ..., y_{K−1}))² = (∫ x f(x | y) dx)²

and, by the Schwarz inequality:

≤ ∫ x² f(x | y) dx · ∫ 1² f(x | y) dx = ∫ x² f(x | y) dx
recalling that ∫ f(x | y) dx = 1; thus:

E g²(Y_1, ..., Y_{K−1}) = ∫_{ℝ^{K−1}} g²(y_1, ..., y_{K−1}) f_Y(y) dy ≤ ∫_{ℝ^{K−1}} f_Y(y) dy ∫ x² f(x | y) dx.

By stating here again U = (X_K, Y_1, ..., Y_{K−1}) and f_U(x, y) = f_Y(y) f(x | y), we have from Fubini's theorem:

E(g(Y_1, ..., Y_{K−1}))² ≤ ∫ x² dx ∫_{ℝ^{K−1}} f_U(x, y) dy = ∫ x² f_X(x) dx = E X_K² < ∞

We thus have g(Y_1, ..., Y_{K−1}) ∈ L²(dP) and also, being given the definition of H′^Y_{K−1}, g(Y_1, ..., Y_{K−1}) ∈ H′^Y_{K−1}.

2) In order to show that g(Y_1, ..., Y_{K−1}) = E(X_K | Y_1, ..., Y_{K−1}) is the orthogonal projection X̂′_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}) = proj_{H′^Y_{K−1}} X_K, it suffices, as this projection is unique, to verify the orthogonality X_K − E(X_K | Y_1, ..., Y_{K−1}) ⊥ H′^Y_{K−1}, i.e.:

∀g(Y_1, ..., Y_{K−1}) ∈ H′^Y_{K−1}: X_K − E(X_K | Y_1, ..., Y_{K−1}) ⊥ g(Y_1, ..., Y_{K−1})
⇔ E X_K g(Y_1, ..., Y_{K−1}) = E(E(X_K | Y_1, ..., Y_{K−1}) g(Y_1, ..., Y_{K−1})).

Now, the first member:

E X_K g(Y_1, ..., Y_{K−1}) = ∫_{ℝ^K} x g(y) f_U(x, y) dx dy = ∫_{ℝ^K} x g(y) f(x | y) f_Y(y) dx dy

and, by applying Fubini's theorem:

= ∫_{ℝ^{K−1}} (∫ x f(x | y) dx) g(y) f_Y(y) dy

which is equal to the 2nd member E(E(X_K | Y_1, ..., Y_{K−1}) g(Y_1, ..., Y_{K−1})), and the proposition is demonstrated.
Practically, the random vector U = (X_K, Y_1, ..., Y_{K−1}) being associated with a physical, biological, etc., phenomenon, the realization of this phenomenon gives us K − 1 numerical values y_1, ..., y_{K−1} and the final responses to the problem will be the numerical values:

– x̂_{K|K−1} = Σ_{j=1}^{K−1} λ̂_j y_j in the case of the linear estimate;
– x̂′_{K|K−1} = E(X_K | y_1, ..., y_{K−1}) in the case of the general estimate.

We show now that in the Gaussian case X̂_{K|K−1} and X̂′_{K|K−1} coincide. The following proposition demonstrates this more precisely.

PROPOSITION.– If the vector U = (X_K, Y_1, ..., Y_{K−1}) is Gaussian, we have the equality between r.v.:

X̂′_{K|K−1} = X̂_{K|K−1} + E(X_K − Σ_{j=1}^{K−1} λ̂_j Y_j).

DEMONSTRATION.–

(X_K, Y_1, ..., Y_{K−1}) Gaussian vector ⇒ (X_K − Σ_{j=1}^{K−1} λ̂_j Y_j, Y_1, ..., Y_{K−1}) is equally Gaussian.

Let us state V = X_K − Σ_{j=1}^{K−1} λ̂_j Y_j.

V is orthogonal to H^Y_{K−1}, thus E V Y_j = 0 ∀j = 1 to K − 1, and the two vectors V and (Y_1, ..., Y_{K−1}) are uncorrelated.

We know that if the vector (V, Y_1, ..., Y_{K−1}) is Gaussian and V and (Y_1, ..., Y_{K−1}) are uncorrelated, then V and (Y_1, ..., Y_{K−1}) are independent.

FINALLY.–

E(X_K | Y_1, ..., Y_{K−1}) = E(Σ_{j=1}^{K−1} λ̂_j Y_j + V | Y_1, ..., Y_{K−1}) = Σ_{j=1}^{K−1} λ̂_j Y_j + E(V | Y_1, ..., Y_{K−1})

As V and (Y_1, ..., Y_{K−1}) are independent:

E(X_K | Y_1, ..., Y_{K−1}) = Σ_{j=1}^{K−1} λ̂_j Y_j + E V.

EXAMPLE.– Let U = (X_K, Y_{K−1}) = (X, Y) be a Gaussian couple of density:

f_U(x, y) = (1/(π√3)) exp(−(2/3)(x² − xy + y²)).

We wish to determine E(X | Y).
Y admits the density:
(
)
⎛ 2 ⎞ exp ⎜ − x 2 − xy + y 2 ⎟ dx π 3 ⎝ 3 ⎠ 2 ⎛ 2⎛ ⎛ y2 ⎞ 1 y⎞ ⎞ exp ⎜ − ⎟ exp ⎜ − ⎜ x − ⎟ ⎟ dx = ⎜ 3⎝ 2 ⎠ ⎟⎠ π 3∫ ⎝ 2 ⎠ ⎝ ⎛ y2 ⎞ 1 1 ⎛ 2 ⎞ exp ⎜ − ⎟ exp ⎜ − u 2 ⎟ du = ∫ ⎜ 2 ⎟ 3π 2π ⎝ 3 ⎠ ⎝ ⎠ 2
fY ( y ) = ∫
1
⎛ y2 ⎞ 1 exp ⎜ − ⎟ = 2π ⎝ 2 ⎠
⎛ y2 ⎞ f Z ( x, y ) 1 ⎛ 2 ⎞ exp ⎜ − x 2 − xy + y 2 ⎟ 2π exp ⎜ ⎟ = fY ( y ) π 3 ⎝ 3 ⎠ ⎝ 2 ⎠ 2 ⎛ 2⎛ 2 y⎞ ⎞ exp ⎜ − ⎜ x − ⎟ ⎟ = ⎜ 3⎝ 3π 2 ⎠ ⎟⎠ ⎝ ⎛ 2⎞ 1 1 y ⎜ ⎟. exp − x− = 2 ⎟ 3 ⎜ 3 2 i 2π i 4 ⎝ ⎠ 4
(
f ( x y) =
(
Thus, knowing Y = y , X
E ( X y) = y
2
(
)
⎛ ⎝
)
follows a law N
and E X Y = Y
(Here EV = E ⎜ X −
)
2
( y 2 , 34)
; that is to say:
1 (linear function of Y ; λˆ = ).
1 ⎞ Y ⎟ = 0 for X and Y are centered.) 2 ⎠
2
Estimation
165
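The conclusion E(X | Y = y) = y/2 can be checked numerically by integrating the joint density (a sketch; the integration grid is an arbitrary choice):

```python
import math

# Numerical check (a sketch) of the example: with joint density
# f_U(x, y) = (1/(pi*sqrt(3))) * exp(-(2/3)*(x^2 - x*y + y^2)),
# the conditional expectation is E(X | Y = y) = y/2.
def f_U(x, y):
    return math.exp(-(2.0 / 3.0) * (x * x - x * y + y * y)) / (math.pi * math.sqrt(3.0))

def cond_mean(y, lo=-12.0, hi=12.0, n=4000):
    # E(X | Y = y) = (integral of x f_U(x, y) dx) / (integral of f_U(x, y) dx),
    # both approximated by the midpoint rule
    dx = (hi - lo) / n
    xs = [lo + (k + 0.5) * dx for k in range(n)]
    num = sum(x * f_U(x, y) for x in xs) * dx
    den = sum(f_U(x, y) for x in xs) * dx
    return num / den

for y in [-1.0, 0.0, 0.8, 2.0]:
    assert abs(cond_mean(y) - y / 2) < 1e-6
```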
defined
by
4.4. Example: prediction of an autoregressive process AR (1) Let
∀K ∈
us
consider
XK =
the
WSS
process
X
∞
∑ q j BK − j and solution of the equation j =∞
X K = qX K −1 + BK
with q which is real such that q < 1 and where BZ is a white noise of power
EBK2 = σ 2 . In the preceding chapter we calculated its covariance function and obtained:
EX i X i + n
n
q =σ . 1 − q2 2
Having observed the r.v. $X_1, \ldots, X_{K-1}$, we are seeking the best linear estimate in the quadratic mean, $\hat{X}_{K+\ell \mid K-1}$, of $X_{K+\ell}$ (prediction horizon $\ell \geq 0$): $\hat{X}_{K+\ell \mid K-1} = \sum_{j=1}^{K-1} \hat{\lambda}_j X_j$ and the $\hat{\lambda}_j$ verify:

$\begin{pmatrix} E X_1 X_1 & \cdots & E X_1 X_{K-1} \\ \vdots & & \vdots \\ E X_{K-1} X_1 & \cdots & E X_{K-1} X_{K-1} \end{pmatrix} \begin{pmatrix} \hat{\lambda}_1 \\ \vdots \\ \hat{\lambda}_{K-1} \end{pmatrix} = \begin{pmatrix} E X_{K+\ell} X_1 \\ \vdots \\ E X_{K+\ell} X_{K-1} \end{pmatrix}$

i.e.

$\begin{pmatrix} 1 & q & \cdots & q^{K-2} \\ q & 1 & \cdots & q^{K-3} \\ \vdots & & \ddots & \vdots \\ q^{K-2} & \cdots & & 1 \end{pmatrix} \begin{pmatrix} \hat{\lambda}_1 \\ \vdots \\ \hat{\lambda}_{K-1} \end{pmatrix} = \begin{pmatrix} q^{K+\ell-1} \\ q^{K+\ell-2} \\ \vdots \\ q^{\ell+1} \end{pmatrix}$
We have the solution $(\hat{\lambda}_1, \ldots, \hat{\lambda}_{K-2}, \hat{\lambda}_{K-1}) = (0, \ldots, 0, q^{\ell+1})$ and this solution is unique as the determinant of the matrix is equal to $(1 - q^2)^{K-2} \neq 0$.

Thus $\hat{X}_{K+\ell \mid K-1} = \hat{\lambda}_{K-1} X_{K-1} = q^{\ell+1} X_{K-1}$.
We see that the prediction of the r.v. $X_{K+\ell}$ only uses the last r.v. observed, i.e. here $X_{K-1}$. The variance of the estimation error equals:

$E\big(X_{K+\ell} - \hat{X}_{K+\ell \mid K-1}\big)^2 = E\big(X_{K+\ell} - q^{\ell+1} X_{K-1}\big)^2$
$\qquad = E X_{K+\ell}^2 + q^{2(\ell+1)} E X_{K-1}^2 - 2 q^{\ell+1} E X_{K+\ell} X_{K-1} = \frac{\sigma^2}{1-q^2}\big(1 - q^{2(\ell+1)}\big).$
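The normal equations above can be checked numerically: with the Toeplitz matrix $(q^{|i-j|})$ and the right-hand side $(q^{K+\ell-1}, \ldots, q^{\ell+1})^T$, the solution concentrates on the last coordinate. A sketch (the values of $q$, $K$ and $\ell$ are arbitrary):

```python
import numpy as np

q, K, ell = 0.7, 6, 2                            # any |q| < 1, any horizon ell
idx = np.arange(K - 1)
R = q ** np.abs(idx[:, None] - idx[None, :])     # Toeplitz matrix (q^{|i-j|})
rhs = q ** (K + ell - np.arange(1, K))           # (q^{K+ell-1}, ..., q^{ell+1})

lam = np.linalg.solve(R, rhs)
# Only the coefficient of the last observed r.v. is non-zero: q^{ell+1}
print(np.round(lam, 12))
```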
4.5. Multivariate processes

In certain practical problems, we may have to consider a state process $X_{(j \in \mathbb{N}^*)}$ and an observation process $Y_{(j \in \mathbb{N}^*)}$ which are such that:

$\forall j \in \mathbb{N}^* \quad X_j = \begin{pmatrix} X_j^1 \\ \vdots \\ X_j^n \end{pmatrix} \quad \text{and} \quad Y_j = \begin{pmatrix} Y_j^1 \\ \vdots \\ Y_j^m \end{pmatrix}$

where $\forall j$ the components of $X_j$ and $Y_j$ belong to $L^2$.

We thus say that:
– $X_j$ and $Y_j$ are multivectors (vectors because their components belong to the vector space $L^2$; multi because $X_j$ and $Y_j$ are sets of several such vectors);
– $n$ is the order of the multivector $X_j$ and $m$ is the order of the multivector $Y_j$;
– $X_j \in (L^2)^n$ and $Y_j \in (L^2)^m$;
– $X_{(j \in \mathbb{N}^*)}$ and $Y_{(j \in \mathbb{N}^*)}$ are multivariate processes, the processes considered until this point (with values in $\mathbb{R}$) being called scalar.
Operations on the multivectors:
– we can add two multivectors of the same order, and if $X$ and $X' \in (L^2)^n$ then $X + X' \in (L^2)^n$;
– we can multiply a multivector by a real constant, and if $X \in (L^2)^n$ and $\lambda \in \mathbb{R}$ then $\lambda X \in (L^2)^n$;
– scalar product of two multivectors, not necessarily of the same order: i.e. $X \in (L^2)^n$ and $Y \in (L^2)^m$.

We state $\langle X, Y \rangle = E X Y^T \in M(n, m)$ where $M(n, m)$ is the space of the matrices of $n$ rows and $m$ columns. The matrix of $M(n, m)$ which is identically zero is denoted $O_{nm}$.

DEFINITION.– We say that the multivectors $X$ and $Y$ are orthogonal if $\langle X, Y \rangle = O_{nm}$ and we write $X \perp Y$.
NOTE.– If $X$ and $Y$ are orthogonal, $Y$ and $X$ are also. We state $\|X\|^2 = \langle X, X \rangle = E X X^T$.

$\|X\|^2$ being a positive definite matrix, we know that there exists a symmetric positive definite matrix, denoted $\|X\|$, such that $\|X\| \cdot \|X\| = \|X\|^2$. Nevertheless, in what follows we will only use $\|\cdot\|^2$.

NOTE.– The set of multivectors of the same order ($(L^2)^m$ for example) could be equipped with a vector space structure. On this space the symbol $\|\cdot\|$ previously defined would be a norm. Here, however, we are considering the set of multivectors of order $n$ or $m$. This set is not a vector space and thus cannot be equipped with a norm. Thus, for us, in what follows $\|X\|^2$ will not have the significance "norm of $X$ squared". For the same reason, it is only through misuse of language that we will speak of the scalar product $\langle X, Y \rangle$.
Linear observation space

Thus, let the multivariate state process $X_{(j \in \mathbb{N}^*)}$ verify $\forall j \in \mathbb{N}^*$, $X_j \in (L^2)^n$, and the multivariate observation process $Y_{(j \in \mathbb{N}^*)}$ verify $\forall j \in \mathbb{N}^*$, $Y_j \in (L^2)^m$.

By generalization of the definition given in section 4.2, we note:

$H_{K-1}^Y = H(Y_1, \ldots, Y_{K-1}) = \Big\{ \sum_{j=1}^{K-1} \Lambda_j Y_j \;\Big|\; \Lambda_j \in M(n, m) \Big\}$

and we say again that $H_{K-1}^Y$ is the linear space of observation until the instant $K - 1$.
NOTE.– The elements of $H_{K-1}^Y$ must be multivectors of order $n$, for it is from among them that we will choose the best estimate of $X_K$, a multivector of order $n$. $H_{K-1}^Y$ is thus adapted to $X_K$.

NOTATIONS.–
– Orthogonal of $H_{K-1}^Y$: this is the set, denoted $H_{K-1}^{Y,\perp}$, of the multivectors $V$ verifying $V \in H_{K-1}^{Y,\perp}$ if and only if $V$ is orthogonal to $H_{K-1}^Y$.
– $0_H$: the multivector of $H_{K-1}^Y$ whose $n$ components are the zero r.v.
Problem of best estimate

Generalizing the problem developed in section 4.2 to the case of multivariate processes, we are seeking to approximate $X_K = (X_K^1, \ldots, X_K^n)^T$ by the elements $Z = (Z^1, \ldots, Z^n)^T$ of $H_{K-1}^Y$, the distance between $X_K$ and $Z$ being:

$\mathrm{tr}\,\|X_K - Z\|^2 \quad \text{where} \quad \mathrm{tr}\,\|X_K - Z\|^2 = \mathrm{tr}\,E(X_K - Z)(X_K - Z)^T = \sum_{j=1}^{n} E\big(X_K^j - Z^j\big)^2$

($\mathrm{tr}$ signifies "trace of the matrix $\|X_K - Z\|^2$").

The following result generalizes the theorem of projection on Hilbert subspaces and brings with it the solution.
THEOREM.–
– $\hat{X}_{K \mid K-1} = \sum_{j=1}^{K-1} \hat{\Lambda}_j Y_j$ is the unique element of $H_{K-1}^Y$ which minimizes the mapping $Z \to \mathrm{tr}\,\|X_K - Z\|^2$;
– $\hat{X}_{K \mid K-1}$ is the orthogonal projection of $X_K$ on $H_{K-1}^Y$, i.e. $X_K - \hat{X}_{K \mid K-1} \perp H_{K-1}^Y$, which is to say again: $\langle X_K - \hat{X}_{K \mid K-1}, Y_j \rangle = O_{nm}$ $\forall j = 1$ to $K-1$.

We can provide an image of this theorem using the following schema in which all the vectors which appear are, in fact, multivectors of order $n$.
[Figure: orthogonal projection of $X_K$ on the plane $H_{K-1}^Y$, showing $\hat{X}_{K \mid K-1}$, an arbitrary element $Z$, and the error $X_K - \hat{X}_{K \mid K-1}$ perpendicular to the plane.]

Figure 4.6. Orthogonal projection of multivector $X_K$ on $H_{K-1}^Y$

NOTATION.– In what follows all the orthogonal projections (exclusively on $H_{K-1}^Y$) will be denoted indifferently: $\hat{X}_{K \mid K-1}$ or $\mathrm{proj}_{H_{K-1}^Y} X_K$; $\hat{Y}_{K \mid K-1}$ or $\mathrm{proj}_{H_{K-1}^Y} Y_K$, etc.
From this theorem we deduce the following properties:

P1) Given $X_K$ and $X_K' \in (L^2)^n$, then $\widehat{(X + X')}_{K \mid K-1} = \hat{X}_{K \mid K-1} + \hat{X}'_{K \mid K-1}$.

In effect:

$\forall j = 1$ to $K-1$: $\langle X_K - \hat{X}_{K \mid K-1}, Y_j \rangle = O_{nm}$ and $\langle X_K' - \hat{X}'_{K \mid K-1}, Y_j \rangle = O_{nm}$.

Thus:

$\forall j = 1$ to $K-1$: $\langle X_K + X_K' - \big(\hat{X}_{K \mid K-1} + \hat{X}'_{K \mid K-1}\big), Y_j \rangle = O_{nm}$.

In addition, since the orthogonal projection of $X_K + X_K'$ is unique, we in fact have:

$\widehat{(X + X')}_{K \mid K-1} = \hat{X}_{K \mid K-1} + \hat{X}'_{K \mid K-1}$

P2) Given $X_K \in (L^2)^n$ and a matrix $H \in M(m, n)$; then $\widehat{(HX)}_{K \mid K-1} = H \hat{X}_{K \mid K-1}$.

It is enough to verify that $H X_K - H \hat{X}_{K \mid K-1} \perp H_{K-1}^Y$, since the orthogonal projection (here on the space $H_{K-1}^Y$) is unique.

Now by hypothesis

$\langle X_K - \hat{X}_{K \mid K-1}, Y_j \rangle = E\big( (X_K - \hat{X}_{K \mid K-1})\, Y_j^T \big) = O_{nm}.$
Thus also

$H \Big( E\big( (X_K - \hat{X}_{K \mid K-1})\, Y_j^T \big) \Big) = E\Big( H (X_K - \hat{X}_{K \mid K-1})\, Y_j^T \Big) = O_{mm}$

and, by associativity of the matrix product,

$E\Big( \big( H (X_K - \hat{X}_{K \mid K-1}) \big) Y_j^T \Big) = \langle H X_K - H \hat{X}_{K \mid K-1}, Y_j \rangle = O_{mm}$

and we have indeed $H X_K - H \hat{X}_{K \mid K-1} \perp H_{K-1}^Y$.

These properties are going to be used in what follows.

Innovation process $I_{(K \in \mathbb{N}^*)}$

With Kalman filtering in mind, we are supposing here that $X_{(K \in \mathbb{N}^*)}$ and $Y_{(K \in \mathbb{N}^*)}$ are the two multivariate processes stated earlier, linked by equations of state and equations of observation:
$\begin{cases} X_{K+1} = A(K)\, X_K + C(K)\, N_K \\ Y_K = H(K)\, X_K + G(K)\, W_K \end{cases}$

where $A(K) \in M(n, n)$; $C(K) \in M(n, r)$ ($r$ being the order of the noise $N_K$); $H(K) \in M(m, n)$; $G(K) \in M(m, p)$, and where $N_{(K \in \mathbb{N}^*)}$ and $W_{(K \in \mathbb{N}^*)}$ are noises (multivariate processes) satisfying a certain number of hypotheses, of which the only one necessary here is:

$\forall j = 1 \text{ to } K-1 \quad \langle W_K, Y_j \rangle = E W_K Y_j^T = O_{pm}$
1) If $n = m$:

$Y_K$ and $\hat{Y}_{K \mid K-1}$ are two multivectors of the same order $m$. The difference $Y_K - \hat{Y}_{K \mid K-1}$ thus has a sense and, in accordance with the definition given in section 4.2, we define the innovation at the instant $K \geq 2$ by

$I_K = Y_K - \hat{Y}_{K \mid K-1}$

Let us now express $I_K$ in the form which will be useful to us in the future. By the second equation of state:

$I_K = Y_K - \mathrm{proj}_{H_{K-1}^Y} \big( H(K)\, X_K + G(K)\, W_K \big).$

By using property P1 first and then P2:

$I_K = Y_K - H(K)\, \hat{X}_{K \mid K-1} - \widehat{(G(K) W_K)}_{K \mid K-1}$

(if $p \neq m$ and $p \neq n$, $\widehat{(G(K) W)}_{K \mid K-1}$ is not equal to $G(K)\, \hat{W}_{K \mid K-1}$; moreover, this last matrix product has no meaning).

To finish, let us verify that $\widehat{(G(K) W_K)}_{K \mid K-1} = 0_H$.

By definition of the orthogonal projection:

$\langle G(K) W_K - \widehat{(G(K) W_K)}_{K \mid K-1}, Y_j \rangle = O_{mm} \quad \forall j = 1 \text{ to } K-1$

By hypothesis on the noise $W_{(K \in \mathbb{N}^*)}$:

$\langle G(K) W_K, Y_j \rangle = G(K) \langle W_K, Y_j \rangle = O_{mm} \quad \forall j = 1 \text{ to } K-1$
We can deduce from this that $\langle \widehat{(G(K) W_K)}_{K \mid K-1}, Y_j \rangle = O_{mm}$ $\forall j = 1$ to $K-1$, i.e. $\widehat{(G(K) W_K)}_{K \mid K-1} \in H_{K-1}^{Y, \perp}$; since it also belongs to $H_{K-1}^Y$, this is to say: $\widehat{(G(K) W_K)}_{K \mid K-1} = 0_H$.

Finally $I_K = Y_K - \hat{Y}_{K \mid K-1} = Y_K - H(K)\, \hat{X}_{K \mid K-1}$.

2) If $n \neq m$:

$Y_K$ and $\hat{Y}_{K \mid K-1}$ are multivectors of different orders, so $Y_K - \hat{Y}_{K \mid K-1}$ has no meaning and we directly define $I_K = Y_K - H(K)\, \hat{X}_{K \mid K-1}$.

Finally, and in all cases ($n$ equal to or different from $m$):
DEFINITION.– We name innovation at instant $K \geq 2$ the multivector of order $m$ defined by $I_K = Y_K - H(K)\, \hat{X}_{K \mid K-1}$. ($I_K \in H_{K-1}^{Y,\perp}$.)

NOTE.– We must not confuse innovation with the following.

DEFINITIONS.– We call prediction error of state at instant $K$ the multivector of order $n$ defined by $\tilde{X}_{K \mid K-1} = X_K - \hat{X}_{K \mid K-1}$.

We call error of filtering at instant $K$ the multivector of order $n$ defined by $\tilde{X}_{K \mid K} = X_K - \hat{X}_{K \mid K}$.

Properties of innovation
1) $I_K \perp Y_j \quad \forall j = 1 \text{ to } K-1$;
2) $I_{K'} \perp I_K \quad \forall K \text{ and } K' \geq 2 \text{ with } K \neq K'$.
DEMONSTRATION.–

1) $I_K = Y_K - H(K)\, \hat{X}_{K \mid K-1} = H(K)\, X_K + G(K)\, W_K - H(K)\, \hat{X}_{K \mid K-1}$

thus:

$\langle I_K, Y_j \rangle = \langle H(K)\big(X_K - \hat{X}_{K \mid K-1}\big) + G(K)\, W_K, Y_j \rangle.$

By using the associativity of the matrix product, since

$\langle H(K)\big(X_K - \hat{X}_{K \mid K-1}\big), Y_j \rangle = H(K) \langle X_K - \hat{X}_{K \mid K-1}, Y_j \rangle = O_{mm}$

and since

$\langle G(K)\, W_K, Y_j \rangle = G(K) \langle W_K, Y_j \rangle = O_{mm}$

we have in fact $\langle I_K, Y_j \rangle = O_{mm}$ and $I_K \perp Y_j$.

2) Without loss of generality, let us suppose for example $K' > K$:

$\langle I_{K'}, I_K \rangle = \langle I_{K'}, Y_K - H(K)\, \hat{X}_{K \mid K-1} \rangle$

and this scalar product equals $O_{mm}$ as $I_{K'} \in H_{K'-1}^{Y,\perp}$ and $Y_K - H(K)\, \hat{X}_{K \mid K-1} \in H_K^Y$ ($Y_K \in H_K^Y$ and $H(K)\, \hat{X}_{K \mid K-1} \in H_{K-1}^Y$).

4.6. Exercises for Chapter 4

Exercise 4.1.

Given a family of second order r.v. $X, Y_1, \ldots, Y_K, \ldots$, we wish to estimate $X$ starting from the $Y_j$ and we state: $\hat{X}_K = E(X \mid Y_1, \ldots, Y_K)$.
Verify that $E(\hat{X}_{K+1} \mid Y_1, \ldots, Y_K) = \hat{X}_K$.

(We say that the process $\hat{X}_{(K \in \mathbb{N}^*)}$ is a martingale with respect to the sequence of the $Y_K$.)

Exercise 4.2.

Let $\{U_j \mid j \in \mathbb{N}\}$ be a sequence of independent r.v. of the second order, of law $N(0, \sigma^2)$, and let $\theta$ be a real constant.

We define a new sequence $\{X_j \mid j \in \mathbb{N}\}$ by: $X_1 = U_1$; $X_j = \theta U_{j-1} + U_j$ if $j \geq 2$.

1) Show that $\forall K \in \mathbb{N}^*$, the vector $X^K = (X_1, \ldots, X_K)$ is Gaussian.
2) Specify the mean, the covariance matrix and the probability density of this vector.
3) Determine the best prediction in quadratic mean of $X_{2+P}$ at instant $K = 2$, i.e. calculate $E(X_{2+P} \mid X_1, X_2)$.
Solution 4.2.

1) Let us consider the lower triangular matrix $A = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \theta & 1 & \ddots & \vdots \\ & \ddots & \ddots & 0 \\ 0 & & \theta & 1 \end{pmatrix}$ belonging to $M(K, K)$.

By stating $U^K = (U_1, \ldots, U_K)$, we can write $X^K = A U^K$. The vector $U^K$ being Gaussian (Gaussian and independent components), the same can be said for the vector $X^K$.

2) $E X^K = E A U^K = A\, E U^K = 0$ and $\Gamma_{X^K} = A (\sigma^2 I) A^T = \sigma^2 A A^T$ ($I$ = identity matrix).

Furthermore: $\mathrm{Det}\,\Gamma_{X^K} = \mathrm{Det}(\sigma^2 A A^T) = \sigma^{2K}$ and $\Gamma_{X^K}$ is invertible.

We obtain

$f_{X^K}(x_1, \ldots, x_K) = \frac{1}{(2\pi)^{K/2}\, \sigma^K} \exp\Big(-\frac{1}{2} x^T \Gamma_{X^K}^{-1} x\Big).$

3) The vector $(X_1, X_2, X_{2+P})$ is Gaussian; thus the best prediction of $X_{2+P}$ is the best linear prediction, which is to say:

$\hat{X}_{2+P} = E(X_{2+P} \mid X_1, X_2) = \mathrm{proj}_H X_{2+P}$

where $H$ is the linear space generated by the r.v. $X_1$ and $X_2$.

Thus $\hat{X}_{2+P} = \hat{\lambda}_1 X_1 + \hat{\lambda}_2 X_2$ with $\begin{pmatrix} \hat{\lambda}_1 \\ \hat{\lambda}_2 \end{pmatrix} = \Gamma_{X^2}^{-1} \begin{pmatrix} \mathrm{Cov}(X_{2+P}, X_1) \\ \mathrm{Cov}(X_{2+P}, X_2) \end{pmatrix}$;

now $\mathrm{Cov}(X_j, X_K) = E X_j X_K = \theta \sigma^2$ if $|K - j| = 1$ and $\mathrm{Cov}(X_j, X_K) = 0$ if $|K - j| > 1$.

Thus, if $P > 1$: $\begin{pmatrix} \mathrm{Cov}(X_{2+P}, X_1) \\ \mathrm{Cov}(X_{2+P}, X_2) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and $\hat{X}_{2+P} = 0$.

If $P = 1$:

$\begin{pmatrix} \hat{\lambda}_1 \\ \hat{\lambda}_2 \end{pmatrix} = \frac{1}{\sigma^2} \begin{pmatrix} 1 + \theta^2 & -\theta \\ -\theta & 1 \end{pmatrix} \begin{pmatrix} 0 \\ \theta \sigma^2 \end{pmatrix} = \begin{pmatrix} -\theta^2 \\ \theta \end{pmatrix}$ and $\hat{X}_3 = -\theta^2 X_1 + \theta X_2.$
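The coefficients for $P = 1$ can be verified directly from the covariances ($\Gamma_{X^2}$ has entries $\sigma^2$, $\theta\sigma^2$, $(1+\theta^2)\sigma^2$); the test values of $\theta$ and $\sigma^2$ below are arbitrary:

```python
import numpy as np

theta, sigma2 = 0.8, 2.0                       # arbitrary test values

Gamma = sigma2 * np.array([[1.0, theta],
                           [theta, 1.0 + theta**2]])   # cov of (X_1, X_2)
c = sigma2 * np.array([0.0, theta])            # (Cov(X_3,X_1), Cov(X_3,X_2))

lam = np.linalg.solve(Gamma, c)
print(lam)                                     # (-theta^2, theta)
```

Note that the optimal coefficients do not depend on $\sigma^2$, since both $\Gamma_{X^2}$ and the covariance vector are proportional to it.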
Exercise 4.3.

We are considering the state system

$\begin{cases} X_{K+1} = A(K)\, X_K + C(K)\, N_K & (1) \\ Y_K = H(K)\, X_K + G(K)\, W_K & (2) \end{cases}$

where $A(K) \in M(n, n)$; $C(K) \in M(n, r)$; $H(K) \in M(m, n)$; $G(K) \in M(m, p)$ and where $X_0, N_K, W_K$ (for $K \geq 0$) are multivectors of the second order such that $\forall j \leq K$, $W_K$ is orthogonal to $X_0, N_0, \ldots, N_{j-1}, W_0, \ldots, W_{j-1}$.

Show that

$\forall j \leq K \quad \langle H(j)\big(X_j - \hat{X}_{j \mid j-1}\big), W_K \rangle = O_{mp}.$
Solution 4.3.

$\langle H(j)\big(X_j - \hat{X}_{j \mid j-1}\big), W_K \rangle = \Big\langle H(j)\Big( A(j-1) X_{j-1} + C(j-1) N_{j-1} - \sum_{i=1}^{j-1} \hat{\Lambda}_i \big(H(i) X_i + G(i) W_i\big) \Big), W_K \Big\rangle$

(where the $\hat{\Lambda}_i$ are the optimal matrices of $M(n, m)$).

Taking into account the hypotheses of orthogonality of the subject, this scalar product reduces to

$\Big\langle H(j)\Big( A(j-1) X_{j-1} - \sum_{i=1}^{j-1} \hat{\Lambda}_i H(i) X_i \Big), W_K \Big\rangle.$

Furthermore, by iterating the recurrence relation (1), we see that $X_i$ expresses itself according to $X_{i-1}, N_{i-1}$, then $X_{i-2}, N_{i-2}, N_{i-1}, \ldots$ and finally according to $X_0, N_0, N_1, \ldots, N_{i-1}$.

Thus, $H(j) A(j-1) X_{j-1}$ and $H(j) \hat{\Lambda}_i H(i) X_i$ are multivectors of order $m$ of which each of the $m$ "components" only consists of r.v. orthogonal to each of the $p$ "components" of $W_K$, multivector of order $p$. Finally, we have in fact

$\langle H(j)\big(X_j - \hat{X}_{j \mid j-1}\big), W_K \rangle = O_{mp}.$
Chapter 5

The Wiener Filter

5.1. Introduction

Wiener filtering is a method of estimating a signal perturbed by an additive noise. The response of this filter to the noisy signal, correlated with the signal to estimate, is optimal in the sense of the minimum in $L^2$. The filter must be practically realizable and stable if possible; as a consequence, its impulse response must be causal and its poles must lie inside the unit circle. Wiener filtering is often used because of its simplicity, but it requires the signals to be analyzed to be WSS processes. Examples of applications: speech processing, oil exploration, swell motion, etc.
5.1.1. Problem position

[Figure: block diagram — the signal $X_K$ and the noise $W_K$ enter a summer $\Sigma$ whose output $Y_K$ feeds the unknown filter $h$, which outputs $Z_K$.]

Figure 5.1. Representation for the transmission. $h$ is the impulse response of the filter that we are going to look for

In Figure 5.1, $X_K$, $W_K$ and $Y_K$ represent the three entry processes, $h$ being the impulse response of the filter and $Z_K$ the output of the filter, which will give $\hat{X}_K$, the estimate at instant $K$ of $X_K$, when the filter is optimal. All the signals are necessarily WSS processes. We will call:

– $Y = (Y_K\; Y_{K-1} \cdots Y_j \cdots Y_{K-N+1})^T$ the representative vector of the process, of length $N$, at the input of the filter, of realization $y = (y_K\; y_{K-1} \cdots y_j \cdots y_{K-N+1})^T$;
– $h = (h_0\; h_1 \cdots h_{N-1})^T$ the vector of the coefficients of the impulse response, which we could identify with the vector $\lambda$ of Chapter 4;
– $X_K$ the sample to be estimated at instant $K$;
– $\hat{X}_K$ the estimate of $X_K$ at instant $K$;
– $Z_K$ the output of the filter at this instant: $Z_K = h^T Y$.

The criterion used is the traditional least mean square criterion. The filter is optimal when:

$\min_h E(X_K - Z_K)^2 = E\big(X_K - \hat{X}_K\big)^2$

The problem consists of obtaining the vector $h$ which minimizes this error.
5.2. Resolution and calculation of the FIR filter

The error is written:

$\varepsilon_K = X_K - h^T Y$ with $h \in \mathbb{R}^N$ and $Y \in (L^2)^N$.

We have a cost function $C$ to be minimized, which is the mapping from $\mathbb{R}^N$ to $\mathbb{R}$:

$(h_0, h_1, \ldots, h_{N-1}) \to C(h_0, h_1, \ldots, h_{N-1}) = E(\varepsilon_K^2)$

The vector $\hat{h} = h_{\text{optimal}}$ is such that $\nabla_h C = 0$: given $C = E(X_K - h^T Y)^2$ (a scalar), then $\nabla_h C = -2 E(\varepsilon_K Y)$ (a vector $N \times 1$).

NOTE.– This is the theorem of projection on Hilbert spaces; obviously, this is the principle of orthogonality again.
This least mean square error will be minimal when:

$E(\varepsilon_K Y) = 0$, i.e. when $h = \hat{h}$.

By using the expression of $\varepsilon_K$:

$E\big( (X_K - \hat{h}^T Y)\, Y \big) = 0$;

all the components of the vector are zero (or $E\big( (X_K - \hat{X}_K)\, Y \big) = 0$). Hence $E(X_K Y) = E(Y Y^T)\, \hat{h}$.

We will call:

– the cross-correlation vector $r$ ($N \times 1$):

$r = E\big( X_K (Y_K\; Y_{K-1} \cdots Y_{K-N+1})^T \big)$

– $R$ the autocorrelation matrix ($N \times N$) of the observable data:

$R = E \begin{pmatrix} Y_K \\ Y_{K-1} \\ \vdots \\ Y_{K-N+1} \end{pmatrix} (Y_K\; Y_{K-1} \cdots Y_{K-N+1}) = E\big(Y\, Y^T\big)$

and

$r = R\, \hat{h}$: the Wiener-Hopf equation in matrix form.
NOTE.– By taking line $j \in [0, N-1]$, we obtain the Wiener-Hopf equation:

$r_{XY}(j) = E\big(X_K Y_{K-j}\big) = \sum_{i=0}^{N-1} \hat{h}_i R_{YY}(j - i) \quad \forall j \in [0, N-1]$

If the matrix $R$ is non-singular, we draw from this: $\hat{h} = R^{-1} r$.

5.3. Evaluation of the least error

According to the projection theorem:
$E\big( (X_K - \hat{X}_K)\, Y \big) = 0$ and $E\big( (X_K - \hat{X}_K)\, \hat{X}_K \big) = 0$.

Thus, the least error takes the form:

$C_{\min} = \min E(\varepsilon_K^2) = E\big( (X_K - \hat{X}_K)\, X_K \big) = E(X_K^2) - E\big(\hat{X}_K X_K\big).$

However, $\hat{X}_K = \hat{h}^T Y$; thus

$C_{\min} = \min E(\varepsilon_K^2) = R_{XX}(0) - \hat{h}^T r.$

Knowing the autocorrelation matrix $R$ of the data at the entry of the filter and the cross-correlation vector $r$, we can deduce from this the optimal filter of impulse response $\hat{h}$ and the lowest least mean square error for a given order $N$ of the filter.
APPLICATION EXAMPLE.– Give the coefficients of the Wiener filter for $N = 2$ if the autocorrelation function of the signal to be estimated is written $R_{XX}(K) = a^{|K|}$, $0 < a < 1$, and that of the noise: $R_{WW}(K) = \delta(K)$ (white noise). The signal to be estimated is not correlated with the noise ($X \perp W$).

Since $R_{YY} = R_{XX} + R_{WW}$:

$R = \begin{pmatrix} 2 & a \\ a & 2 \end{pmatrix}; \quad r = \begin{pmatrix} 1 \\ a \end{pmatrix}.$

We deduce from this:

$\hat{h} = \Big( \frac{2 - a^2}{4 - a^2} \;\; \frac{a}{4 - a^2} \Big)^T \quad \text{and} \quad \min E(\varepsilon_K^2) = 1 - \frac{2}{4 - a^2}.$
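The example can be reproduced numerically for any admissible $a$ (the value below is arbitrary):

```python
import numpy as np

a = 0.5                                      # any 0 < a < 1
R = np.array([[2.0, a], [a, 2.0]])           # R_YY = R_XX + R_WW
r = np.array([1.0, a])

h = np.linalg.solve(R, r)                    # Wiener-Hopf solution
c_min = 1.0 - h @ r                          # R_XX(0) - h^T r
print(h, c_min)
```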
Let us return to our calculation of the FIR filter. The filter that we have just obtained is of the form:

$\hat{h} = (\hat{h}_0\; \hat{h}_1 \cdots \hat{h}_{N-1})^T$

of finite length $N$. Its transfer function is written:

$H(z) = \sum_{i=0}^{N-1} \hat{h}_i z^{-i}$

with an output-input relationship of the form $\hat{X}(z) = H(z)\, Y(z)$.

Let us enlarge this class of FIR type filters and give a method for obtaining IIR type filters.

5.4. Resolution and calculation of the IIR filter

In order to do this we are going to carry out a pre-whitening of the observation signal.
Firstly, let us recall a definition: we say that a rational function $A(z)$ represents a minimum phase system if $A(z)$ and $\frac{1}{A(z)}$ are analytical in the set $\{z \mid |z| \geq 1\}$, i.e. if the zeros and poles of $A(z)$ are within the unit disk. Furthermore, a minimum phase system and its inverse are stable.

Paley-Wiener theorem

Given a function $S_{YY}(z)$ verifying, when $z = e^{i\omega}$:
– $S_{YY}(e^{i\omega}) = \sum_{n=-\infty}^{\infty} s_n e^{-in\omega}$: a real function and $\geq 0$;
– $\int_0^{2\pi} \big| \ln S_{YY}(e^{i\omega}) \big|\, d\omega < \infty$.

Then there is a causal sequence $a_n$, of $z$ transform $A(z)$, which verifies:

$S_{YY}(z) = \sigma_{\varepsilon}^2\, A(z)\, A(z^{-1}).$

$\sigma_{\varepsilon}^2$ represents the variance of a white noise and $A(z)$ represents a minimum phase system. In addition, the factorization of $S_{YY}(z)$ is unique.

$A(z)$ being a minimum phase system, $\frac{1}{A(z)}$ is causal and analytical in $\{z \mid |z| \geq 1\}$. Since the coefficients $a_n$ of the filter $A(z)$ are real:

$S_{YY}(e^{i\omega}) = \sigma_{\varepsilon}^2\, A(e^{i\omega})\, A(e^{-i\omega}) = \sigma_{\varepsilon}^2\, A(e^{i\omega})\, \overline{A(e^{i\omega})} = \sigma_{\varepsilon}^2\, \big| A(e^{i\omega}) \big|^2$
That is to say:

$\sigma_{\varepsilon}^2 = \frac{1}{\big| A(e^{i\omega}) \big|^2}\, S_{YY}(e^{i\omega}).$

The filter $\frac{1}{A(z)}$ thus whitens the process $Y_K$, $K \in \mathbb{Z}$.

Schematically: the white noise $\varepsilon_K$, of spectral density $\sigma_{\varepsilon}^2$, passes through $A(z)$ to give $Y_K$, of spectral density $S_{YY}$; conversely, $Y_K$ passes through $\frac{1}{A(z)}$ to give back $\varepsilon_K$.

NOTE.– $|A(z)|^2 = A(z) \cdot A(z^{-1})$ if the coefficients of $A(z)$ are real.

At present, having pre-whitened the entry, the problem comes back to finding a filter $B(z)$ in the following manner: the cascade $Y \to \frac{1}{A(z)} \to \varepsilon \to B(z) \to Z$ must be equivalent to $Y \to H(z) \to Z$.

Thus $B(z) = A(z) \cdot H(z)$.

$A(z)$ being known from $S_{YY}(z)$, and $H(z)$ having to be optimal, $B(z)$ must thus also be optimal.
Let us apply the Wiener-Hopf equation to the filter $B(z)$:

$r_{X\varepsilon}(j) = \sum_i \hat{b}_i R_{\varepsilon\varepsilon}(j - i).$

As $\varepsilon$ is white, $r_{X\varepsilon}(j) = \hat{b}_j\, \sigma_{\varepsilon}^2$, thus

$\hat{b}_j = \frac{r_{X\varepsilon}(j)}{\sigma_{\varepsilon}^2}$

and $B(z) = \sum_{j=0}^{\infty} \hat{b}_j z^{-j}$ for $B(z)$ causal. Thus

$B(z) = \frac{1}{\sigma_{\varepsilon}^2} \sum_{j=0}^{\infty} r_{X\varepsilon}(j)\, z^{-j}.$

The sum represents the $z$ transform of the cross-correlation $r_{X\varepsilon}(j)$ for the indices $j \geq 0$, which we will write $[S_{X\varepsilon}(z)]_+$. Thus:

$B(z) = \frac{1}{\sigma_{\varepsilon}^2}\, \big[ S_{X\varepsilon}(z) \big]_+$

We must now establish a relationship between $S_{X\varepsilon}(z)$ and $S_{XY}(z)$. In effect, we can write:

$R_{XY}(K) = E\big( X_{n+K}\, Y_n \big) = E\Big( X_{n+K} \sum_{i=0}^{\infty} a_i\, \varepsilon_{n-i} \Big)$

$R_{XY}(K) = \sum_{i=0}^{\infty} a_i\, R_{X\varepsilon}(K + i)$
which can also be written:

$R_{XY}(K) = \sum_{i=-\infty}^{0} a_{-i}\, R_{X\varepsilon}(K - i) = a_{-K} * R_{X\varepsilon}(K).$

By taking the $z$ transform of the two members:

$S_{XY}(z) = A(z^{-1})\, S_{X\varepsilon}(z)$

and it emerges:

$H(z) = \frac{1}{\sigma_{\varepsilon}^2\, A(z)} \Big[ \frac{S_{XY}(z)}{A(z^{-1})} \Big]_+$
5.5. Evaluation of the least mean square error

This least mean square error is written:

$C_{\min} = E(\varepsilon_K X_K) = R_{\varepsilon X}(0)$ when $h = \hat{h}$

which can also be written:

$C_{\min} = E\big( (X_K - \hat{X}_K)\, X_K \big) = R_{XX}(0) - E\big( \hat{h}^T Y X_K \big)$

i.e. $C_{\min} = R_{XX}(0) - \hat{h}^T r$, which we have already seen with the FIR filter.

However, this time, the number of elements in the sum is infinite:

$C_{\min} = R_{XX}(0) - \sum_{i=0}^{\infty} \hat{h}_i R_{XY}(i)$

or:

$C_{\min} = R_{XX}(0) - \sum_{i=0}^{\infty} \hat{h}_i R_{YX}(-i).$

By bringing out a convolution:

$C_{\min} = R_{XX}(0) - \big[ \hat{h}_j * R_{YX}(j) \big]_{j=0}.$

This expression can also be written, using the $z$ transform:

$C_{\min} = \frac{1}{2\pi j} \oint_{C(0,1)} \big( S_{XX}(z) - H(z)\, S_{YX}(z) \big)\, z^{-1}\, dz$
5.6. Exercises for Chapter 5

Exercise 5.1.

Our task is to estimate a signal $X_K$ whose autocorrelation function is:

$R_{XX}(K) = \frac{1}{2}\delta(K) + \frac{1}{4}\big[\delta(K+1) + \delta(K-1)\big]$

The measures $y_K = x_K + n_K$ of the process $Y_K$ are filtered by a Wiener filter of response $h$. The noise $N_K$ is orthogonal to the signal $X_K$ and:

$R_{nn}(K) = \frac{1}{2}\delta(K).$

1) Give the response of the 2nd order Wiener filter (FIR).
2) Give the least mean square error obtained.

Solution 5.1.

1) $\hat{h} = R^{-1} r = (7/15 \;\; 2/15)^T$;
2) $C_{\min} = \sigma_X^2 - r^T \hat{h} = 7/30$, with $\sigma_X^2 = R_{XX}(0) = 1/2$.
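The solution can be verified by forming $R_{YY}$ and $r$ from the given correlations ($R_{YY}(0) = 1/2 + 1/2 = 1$, $R_{YY}(\pm 1) = 1/4$ since the noise is white):

```python
import numpy as np

R = np.array([[1.0, 0.25], [0.25, 1.0]])     # autocorrelation of Y
r = np.array([0.5, 0.25])                    # r(j) = R_XX(j), since X ⊥ N

h = np.linalg.solve(R, r)
c_min = 0.5 - r @ h                          # sigma_X^2 - r^T h
print(h, c_min)                              # (7/15, 2/15) and 7/30
```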
Exercise 5.2.

We propose calculating a 2nd order FIR filter. $Y_K$, the input to the filter, has the form $Y_K = X_K + W_K$, where $X_K$ is the signal emitted and $W_K$ is a white noise orthogonal to $X_K$ (the processes are all wide sense stationary (WSS)). Knowing the statistical autocorrelations:

$R_{XX}(K) = a^{|K|}$ and $R_{WW}(K) = N \delta(K)$

and knowing $\hat{h} = R^{-1} r$ ($\hat{h}$: the optimal $h$), with:

$R = E \begin{pmatrix} Y_K \\ Y_{K-1} \\ \vdots \\ Y_{K-N+1} \end{pmatrix} (Y_K\; Y_{K-1} \cdots Y_{K-N+1}) = E\big(Y\, Y^T\big)$

$r = E\big( X_K (Y_K\; Y_{K-1} \cdots Y_{K-N+1})^T \big)$

1) Give the 2 components of the vector $\hat{h}$ representing the impulse response.
2) Give the least mean square error.
3) Give the shape of this error for $N = 1$ and $0 < a < 1$.
4) We now want to calculate an IIR type optimal filter. By considering the same data as already given, give the transfer function of the filter.
5) Give the impulse response.
6) Give the least mean square error.

NOTE.– We can state: $b + b^{-1} = \frac{1}{N}\big(a^{-1} - a\big) + a^{-1} + a.$

Solution 5.2.

1) $\hat{h} = \frac{1}{(1+N)^2 - a^2} \big( 1 + N - a^2 \;\; aN \big)^T$

2) $C_{\min} = 1 - \frac{1 + N - a^2 + a^2 N}{(1+N)^2 - a^2}$
3) See Figure 5.2.

Figure 5.2. Path of the error function or cost according to parameter $a$

4) $H(z) = \frac{A}{\sigma_{\varepsilon}^2} \cdot \frac{1}{1 - b z^{-1}}$ with $A = \frac{1 - a^2}{1 - ab}$ and $\sigma_{\varepsilon}^2 = \frac{Na}{b}$

5) $h_n = c\, b^n$ for $n \geq 0$, with $c = \frac{(1 - a^2)\, b}{N a\, (1 - ab)}$

6) $C_{\min} = 1 - \frac{c}{1 - ab}$
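The hint can be checked numerically: solving $b + b^{-1} = \frac{1}{N}(a^{-1} - a) + a^{-1} + a$ for the root with $|b| < 1$ and setting $\sigma_{\varepsilon}^2 = Na/b$ (which matches the $z$-term of the numerator) does factor $S_{YY}$ as $\sigma_{\varepsilon}^2 A(z) A(z^{-1})$ with $A(z) = \frac{1 - bz^{-1}}{1 - az^{-1}}$. A sketch with arbitrary values:

```python
import numpy as np

a, N = 0.6, 0.5                              # arbitrary 0 < a < 1 and noise power

# b: root of modulus < 1 of  b + 1/b = (1/N)(1/a - a) + 1/a + a
s = (1 / a - a) / N + 1 / a + a
b = (s - np.sqrt(s**2 - 4)) / 2
sigma_eps2 = N * a / b                       # variance of the whitening noise

# Check the spectral factorization on the unit circle:
# S_YY = (1-a^2)/|1-a e^{-iw}|^2 + N  vs  sigma_eps2 |1-b e^{-iw}|^2/|1-a e^{-iw}|^2
w = np.linspace(0, 2 * np.pi, 512, endpoint=False)
e = np.exp(-1j * w)
S_yy = (1 - a**2) / np.abs(1 - a * e)**2 + N
S_fact = sigma_eps2 * np.abs(1 - b * e)**2 / np.abs(1 - a * e)**2
print(b, np.max(np.abs(S_yy - S_fact)))
```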
Exercise 5.3. [SHA 88]

Let $\{X_K \mid K = 1 \text{ to } N\}$ be a set of $N$ random variables, emitted by a source, such that $E(X_K) = 0$ and $\mathrm{cov}(X_i, X_j) = \sigma_x^2$ $\forall i, j$.

At the reception, we obtain the digital sequence $y_K = x_K + w_K$, a realization of the process $Y_K = X_K + W_K$ where $W_K$ is a centered white noise of variance $\sigma_w^2$.

1) Give the Wiener filter depending on $N$ and $\gamma$, by stating $\gamma = \sigma_x^2 / \sigma_w^2$ as the signal-to-noise ratio.
2) Give the least mean square error as a function of $\sigma_x^2$, $N$ and $\gamma$.

NOTE.– We can use the Wiener-Hopf equation.

Solution 5.3.

1) $\hat{h}_j = \frac{\gamma}{1 + N\gamma}$

2) $C_{\min} = \frac{\sigma_x^2}{1 + N\gamma}$
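Here $R = \sigma_x^2 J + \sigma_w^2 I$ ($J$ the all-ones matrix) and $r = \sigma_x^2 \mathbf{1}$, so the solution is constant over the taps; this is easy to confirm (the test values are arbitrary):

```python
import numpy as np

N, sx2, sw2 = 5, 2.0, 0.5                    # arbitrary test values
gamma = sx2 / sw2

R = sx2 * np.ones((N, N)) + sw2 * np.eye(N)  # R_YY
r = sx2 * np.ones(N)                         # r(j) = sigma_x^2 for every j

h = np.linalg.solve(R, r)
c_min = sx2 - r @ h
print(h[0], c_min)
```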
Exercise 5.4.

Same exercise as Exercise 5.2, where $R_{XX}(K) = 0.4^{|K|}$ and $R_{WW}(K) = \delta(K)$:

1) Give the 3 components of the vector $\hat{h}$.
2) Give the least mean square error.

Solution 5.4.

1) $\hat{h} = (0.4778 \;\; 0.1028 \;\; 0.0212)^T$
2) $C_{\min} = 0.4778$
Chapter 6

Adaptive Filtering: Algorithm of the Gradient and the LMS

6.1. Introduction

By adaptive processing, we have in mind a particular, yet very broad, class of optimization algorithms which are activated in real time in distance information transmission systems. The properties of adaptive algorithms are such that, on the one hand, they allow the optimization of a system and its adaptation to its environment without outside intervention and, on the other hand, this optimization is also assured in the presence of environmental fluctuation over time. It is also to be noted that the success of adaptive techniques is such that we no longer meet them only in telecommunications, but also in such diverse domains as submarine detection, perimetric detection, shape recognition, antenna arrays, seismology, bio-medical instrumentation, speech and image processing, identification of control systems, etc. Amongst the applications cited above, different configurations arise.
[Figure: block diagram — the input $X_K$ feeds a delay $T$ whose output $Y_K$ enters the adaptive filter; the filter output $Z_K$ is subtracted from the desired signal $D_K$ to give the error $\varepsilon_K$.]

Figure 6.1. Prediction

[Figure: block diagram — $X_K$ feeds both an unknown system (output $D_K$) and the adaptive filter (output $Z_K$); the difference gives $\varepsilon_K$.]

Figure 6.2. Identification

[Figure: block diagram — $X_K$ passes through a system and additive noises before entering the adaptive filter; the delayed input provides $D_K$, and $\varepsilon_K = D_K - Z_K$.]

Figure 6.3. Deconvolution

[Figure: block diagram — the primary input $X_K + N_K$ provides $D_K$; a correlated noise reference $N_K'$ feeds the adaptive filter, and $\varepsilon_K = D_K - Z_K$.]

Figure 6.4. Cancellation
In the course of these few pages, we will explain the principle of adaptive filtering and establish the first mathematical results. We will limit ourselves, to begin with, to WSS processes and to the algorithm called the deterministic gradient and to the LMS algorithm. We will also give a few examples concerning non-recursive linear adaptive filtering. Later, we will broaden this concept to non-stationary signals by presenting Kalman filtering in the following chapter.

6.2. Position of problem [WID 85]

Starting from observations (or measures) taken at instant $K$ (that we will denote $y_K$: realizations) of a process $X_K$ issued from a sensor or from an unknown system, we want to perform:
– a prediction on the signal; or
– an identification of an unknown system; or
– a deconvolution (or inverse filtering); or
– a cancellation of echoes.
To achieve this, we will carry out an optimization, in the sense of the least mean squares, by minimizing the error obtained in the different cases.

EXAMPLE.– Given the following predictor:

[Figure: block diagram — $X_K$ enters a delay $T$; the delayed signal $Y_K$ feeds the adaptive filter whose output $Z_K$ is subtracted from $X_K$ to give $\varepsilon_K$.]

Figure 6.5. Predictor

The 3 graphs below represent:
1. the input $X_K$, observed by $x_K$: the signal to be predicted;
2. the output of the filter $Z_K$, observed by $z_K$;
3. the residual error $\varepsilon_K$.

It is clearly apparent that $\varepsilon_K$ tends towards 0 after a certain time, at the end of which the filter converges.
[Figure: three time plots over 500 samples — the input $x_K$, the output of the filter $z_K$, and the error $\varepsilon_K$, which decays towards 0.]

Figure 6.6. Graphs of input, output and error. These graphs are the result of continuous time processes
6.3. Data representation

The general shape of an adaptive filter might be the following.

[Figure: the inputs $Y_K^0, Y_K^1, \ldots, Y_K^{m-1}$ are weighted by the coefficients $\lambda_K^0, \lambda_K^1, \ldots, \lambda_K^{m-1}$ and summed to give the output signal.]

Figure 6.7. Theoretical schema with multiple inputs

The input signal can be simultaneously the result of several sensors (case of adaptive antennae, for example), or it can represent the different samples, taken at different instants, of a single signal. We will take as notation:
– multiple input: $Y^K = (Y_K^0\; Y_K^1 \ldots Y_K^{m-1})^T$;
– single input: $Y^K = (Y_K\; Y_{K-1} \ldots Y_{K-m+1})^T$.
In the case of a single input, which we will consider next, we will have the following configuration.

[Figure: tapped delay line — $Y_K$ passes through delays $T$ giving $Y_{K-1}, \ldots, Y_{K-m+1}$; the taps are weighted by $\lambda_K^0, \lambda_K^1, \ldots, \lambda_K^{m-1}$ and summed to give $Z_K$, which is subtracted from $D_K$ to give $\varepsilon_K$.]

Figure 6.8. Schema of predictor principle

$X_K$, $Y_K$, $Z_K$, $D_K$ and $\varepsilon_K$ represent the signal to be predicted, the filter input, the filter output, the desired output and the error signal respectively. Let us write the output $Z_K$:

$Z_K = \sum_{i=0}^{m-1} \lambda_K^i\, Y_{K-i}$

By calling $\lambda_K$ the weight vector, or coefficient vector, at instant $K$, also written in the form $\lambda_K = (\lambda_K^0\; \lambda_K^1 \ldots \lambda_K^{m-1})^T$, we can use a single vectorial notation:

$Z_K = Y^{K\,T} \lambda_K = \lambda_K^T Y^K.$
Our system not being perfect, we obtain an error written as:

$\varepsilon_K = D_K - Z_K$

where $D_K$ represents the desired output (or $X_K$ here), that is to say, the random variable that we are looking to estimate. The criterion that we have chosen to exploit is that of the least squares: it consists of choosing the best vector $\lambda$, the one which will minimize the least mean square error $E(\varepsilon_K^2)$, or the cost function $C(\lambda_K)$. When the vector $\lambda$ is fixed, the quadratic error of this cost function does not depend on $K$, because of the stationarity of the signals. We call $\hat{\lambda}$ the vector $\lambda$ which minimizes this cost function.

6.4. Minimization of the cost function

If our system (filter) is linear and non-recursive, we will have a quadratic cost function, and this can be represented by an elliptical paraboloid (in dimension 2), or a hyperparaboloid if the dimension is higher. We will call isocosts the graphs, or same-level cost surfaces, i.e. the surfaces defined by:

$S_g = \big\{ \lambda = (\lambda^0, \lambda^1, \ldots, \lambda^{m-1}) \in \mathbb{R}^m \;\big|\; C(\lambda) = g \big\}, \quad g \text{ a real constant}$

Let us give as an example the equation of the isocosts in the case of a second order filter:

$S_g = \big\{ \lambda = (\lambda^0, \lambda^1) \in \mathbb{R}^2 \;\big|\; C(\lambda) = E(D_K - Z_K)^2 = E\big( D_K - (\lambda^0 X_{K-1} + \lambda^1 X_{K-2}) \big)^2 = g \big\}$
Using the stationarity of $X_K$, we obtain after development the equation of the isocosts $S_g$:

$E(X_K^2)\,(\lambda^0)^2 + E(X_K^2)\,(\lambda^1)^2 + 2\, E(X_K X_{K-1})\, \lambda^0 \lambda^1 - 2\, E(D_K X_{K-1})\, \lambda^0 - 2\, E(D_K X_{K-2})\, \lambda^1 + E(D_K^2) = g$
NOTE.– By identification, we easily find the coefficients of the ellipse equation in its traditional form:

$a(\lambda^0)^2 + b(\lambda^1)^2 + c\, \lambda^0 \lambda^1 + d\, \lambda^0 + e\, \lambda^1 + f = 0$

NOTE.– Still because of the stationarity of $X_K$, we see that the coefficients arising in the expression of $C(\lambda)$ are independent of $K$, and this finding is valid for a filter of any order. This signifies that $C(\lambda)$ depends only on the value $\lambda$ and not on the instant $K$ at which we are considering this value. Otherwise said, given two instants $K \neq j$, if $\lambda_K = \lambda_j$ then $C(\lambda_K) = C(\lambda_j)$; the latter can be summarized by saying that the cost function itself does not depend on time.

Let us illustrate such a cost function.
[Figure: paraboloid cost surface over the plane $(\lambda^0, \lambda^1)$, with its elliptical isocost projections.]

Figure 6.9. Representation of the cost function ([MOK 00] for the line graph)
Let us go back to the cost function at the instant $K$, where $\varepsilon_K$ is not stationary:

$C(\lambda_K) = E(\varepsilon_K^2) = E\big\{ (D_K - Z_K)^2 \big\} = E\big\{ (D_K - \lambda_K^T Y^K)^2 \big\}$

The latter, for any $\lambda$ and independently of time, can still be written:

$C(\lambda) = E\big\{ (D_K - \lambda^T Y^K)^2 \big\}$

The minimum of this function is reached when:

$\nabla_\lambda C(\lambda) = \mathrm{grad}\, C(\lambda) = \Big( \frac{\partial C(\lambda)}{\partial \lambda^0}, \ldots, \frac{\partial C(\lambda)}{\partial \lambda^{m-1}} \Big)^T = 0$ (the zero vector in $\mathbb{R}^m$).

Now

$\nabla_\lambda C(\lambda) = E\big\{ (D_K - \lambda^T Y^K)\,(-2\, Y^K) \big\} = -2\, E(\varepsilon_K Y^K)$

Thus

$\nabla_\lambda C(\lambda) = -2\, E(\varepsilon_K Y^K) = -2\, E\big\{ (D_K - \hat{\lambda}^T Y^K)\, Y^K \big\} = 0$

for $\lambda = \lambda_{\text{optimal}} = \hat{\lambda}$.

In what follows, we will denote by $\hat{\lambda} = (\hat{\lambda}^0\; \hat{\lambda}^1 \ldots \hat{\lambda}^{m-1})^T$ the family of optimal coefficients, that is to say, the coefficients that render $\nabla_\lambda C(\lambda) = \mathrm{grad}\, C(\lambda)$ void and which thus minimize $C(\lambda)$.
Adaptive Filtering: Algorithm of the Gradient and the LMS
207
We find again the traditional result: the error is orthogonal to the observations (principle of orthogonality, or projection theorem): ε_K ⊥ Y_K.

Let us state R = E(Y_K Y_K^T) and p = E(D_K Y_K).

R = E(Y_K Y_K^T) is the autocorrelation matrix of the input signal:

R = E( Y_K^2, Y_K Y_{K−1}, ..., Y_K Y_{K−m+1} ; Y_{K−1} Y_K, Y_{K−1}^2, ..., Y_{K−1} Y_{K−m+1} ; ... ; Y_{K−m+1} Y_K, Y_{K−m+1} Y_{K−1}, ..., Y_{K−m+1}^2 )

p = E(D_K Y_K) is the cross-correlation column vector between the desired response and the input signal:

p = E(D_K Y_K) = E( D_K Y_K   D_K Y_{K−1}   ...   D_K Y_{K−m+1} )^T
Thus, setting the gradient of the cost function to zero:

E(D_K Y_K) − E(Y_K Y_K^T) λ̂ = 0, i.e. p − R λ̂ = 0

NOTE.– This is the Wiener-Hopf equation. The vector which satisfies this equation is the optimal vector:

λ̂ = R^(−1) p, if R is invertible. This value of λ does not depend on the instant K.
6.4.1. Calculation of the cost function
C(λ) = E(D_K^2) + λ^T E(Y_K Y_K^T) λ − 2 E(D_K Y_K^T) λ, thus:

C(λ) = E(D_K^2) + λ^T R λ − 2 p^T λ

For λ̂, the optimal value of λ, the minimum cost value is written:

C_min = C(λ̂) = E(D_K^2) − p^T λ̂

NOTE.– It is interesting to notice that the error and the input signal Y are not correlated when λ = λ̂. In effect:

ε_K = D_K − λ^T Y_K

By multiplying the two members by Y_K and by taking the mathematical expectation, we obtain:

E(ε_K Y_K) = p − E(Y_K Y_K^T) λ = p − R λ

For the optimal value λ = λ̂ we have:

E(ε_K Y_K) = 0
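The Wiener-Hopf equation and the orthogonality principle are easy to check numerically. Below is a minimal sketch in Python/NumPy (the book's own listings use Matlab); the two-tap signal model is an illustrative assumption, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative WSS input and a desired response built from it
y = rng.standard_normal(n)
Y = np.stack([y[1:], y[:-1]])        # regressors Y_K = (Y_K, Y_{K-1})^T
d = 0.8 * y[1:] + 0.3 * y[:-1]       # desired response D_K

R = Y @ Y.T / Y.shape[1]             # sample estimate of E(Y_K Y_K^T)
p = Y @ d / Y.shape[1]               # sample estimate of E(D_K Y_K)
lam_hat = np.linalg.solve(R, p)      # Wiener-Hopf: R lam_hat = p

eps = d - lam_hat @ Y                # error sequence at the optimum
print(Y @ eps / Y.shape[1])          # E(eps_K Y_K): ~ the zero vector
```

Because λ̂ solves the sample normal equations exactly, the empirical correlation p − Rλ̂ between the error and the observations vanishes to machine precision.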
Example of calculation of a filter

The following system is an adaptive filter capable of identifying a phase-shifter system. φ is a deterministic angle.
X_K = Y_K = sin( 2πK/N + ∅ )  and  D_K = 2 sin( 2πK/N + ∅ − φ )

[Block diagram: X_K = Y_K passes through a delay T; the taps λ^0 and λ^1 are summed to form Z_K, which is subtracted from D_K to give ε_K]
Figure 6.10. Schema of principle of an adaptive filter identifying a phase-shifter system
If ∅ is uniformly spread on [0, 2π], we showed in Chapter 3 that Y_K is wide-sense stationary (WSS). Let us calculate the elements of matrix R:

E(Y_n Y_{n−K}) = E[ sin(2πn/N + ∅) sin(2π(n−K)/N + ∅) ] = 0.5 cos(2πK/N),  K ∈ {0, 1}

E(D_n Y_{n−K}) = E[ 2 sin(2πn/N − φ + ∅) sin(2π(n−K)/N + ∅) ] = cos(2πK/N − φ)
The autocorrelation matrix R of the input data and the cross-correlation vector p are written:

R = E( Y_K^2, Y_K Y_{K−1} ; Y_{K−1} Y_K, Y_{K−1}^2 ) = ( 0.5, 0.5 cos(2π/N) ; 0.5 cos(2π/N), 0.5 )

p = E( D_K Y_K   D_K Y_{K−1} )^T = ( cos φ   cos(2π/N − φ) )^T

The cost is written:

C(λ) = 0.5 ( (λ^0)^2 + (λ^1)^2 ) + λ^0 λ^1 cos(2π/N) − 2 λ^0 cos φ − 2 λ^1 cos(2π/N − φ) + 2
Thus, we obtain:

λ̂ = R^(−1) p = ( 2 / sin(2π/N) ) ( sin(2π/N − φ)   sin φ )^T

C(λ̂) = E(D_K^2) − p^T λ̂, and here the calculation gives us C(λ̂) = 0.
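The closed-form solution above can be verified numerically; this Python sketch (with illustrative values of N and φ) builds R and p, solves for λ̂, and checks that the minimum cost vanishes:

```python
import numpy as np

N, phi = 12, 0.7                                # illustrative period and phase shift
c = 2 * np.pi / N

R = 0.5 * np.array([[1.0, np.cos(c)],
                    [np.cos(c), 1.0]])          # E(Y_K Y_K^T)
p = np.array([np.cos(phi), np.cos(c - phi)])    # E(D_K Y_K)

lam_hat = np.linalg.solve(R, p)
closed_form = (2 / np.sin(c)) * np.array([np.sin(c - phi), np.sin(phi)])

C_min = 2.0 - p @ lam_hat                       # E(D_K^2) = 2 for the amplitude-2 sine
print(lam_hat, closed_form, C_min)              # C_min vanishes
```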
In this section, we have given the method for obtaining λ̂ and C_min. As we can see, this method does not even assume the existence of a physical filter, but it requires:

– knowledge of the constituents of p and R;

– carrying out some calculations, notably the inversion of the matrix R.

In the following sections, we will seek to free ourselves of these requirements.

6.5. Gradient algorithm
We have seen previously that the optimal vector λ, which is to say the one that minimizes the cost C(λ), is written:

λ̂ = R^(−1) p

Now, to resolve this equation, we have to invert the autocorrelation matrix. That can involve major calculations if this matrix R is not a Toeplitz matrix. It is a Toeplitz matrix if R_(i,j) = c_(i−j), with c representing the autocorrelation of the process.

Let us examine the evolution of the cost C(λ) previously traced. Let λ_K be the vector of coefficients (or weights) at instant K. If we wish to arrive at the optimal λ, we must make λ_K evolve at each iteration by taking into account its relative position between the instants K and K + 1. For a given cost C(λ_j), the gradient of C(λ_j) with regard to the vector λ_j = ( λ_j^0 λ_j^1 ... λ_j^(m−1) )^T is normal to the level curve of C(λ_j).

In order for the algorithm to converge, it must obviously hold, for K > j, that C(λ_K) < C(λ_j). In addition, as we have already written, the minimum will be attained when:

∇_λ C(λ) = 0
From here we get the idea of writing that the larger the gradient, the more distant we will be from the minimum, and that it suffices to modify the vector of coefficients in a recursive manner in the following fashion:

λ_{K+1} = λ_K + μ ( −∇_{λ_K} C(λ_K) )  (equality in R^m)

which we can call the algorithm of the deterministic gradient. At instant K:

∇_{λ_K} C(λ_K) = −2 E(ε_K Y_K) = −2 ( p − R λ_K )

with Y_K = ( Y_K Y_{K−1} ... Y_{K−m+1} )^T, the notation of the process that we saw at the beginning of Chapter 4, and with μ a parameter which acts on the stability and rapidity of convergence to λ̂.
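The recursion above can be sketched in a few lines of Python (R, p and μ are illustrative choices): starting from λ_0 = 0 and repeating λ_{K+1} = λ_K + 2μ(p − Rλ_K) drives λ_K towards R^(−1) p:

```python
import numpy as np

R = np.array([[3.0, 1.0], [1.0, 3.0]])   # illustrative autocorrelation matrix
p = np.array([5.0, 7.0])                 # illustrative cross-correlation vector
mu = 0.1                                 # step size (below 1/gamma_max = 1/4 here)

lam = np.zeros(2)
for _ in range(200):
    grad = -2 * (p - R @ lam)            # deterministic gradient of C(lambda)
    lam = lam + mu * (-grad)             # lambda_{K+1} = lambda_K - mu * grad

print(lam, np.linalg.solve(R, p))        # both are (approximately) the same vector
```

No matrix inversion appears in the loop; only products with R and p are needed, which is precisely the appeal of the method.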
Theoretical justification

If the mapping λ_K = ( λ_K^0 λ_K^1 ... λ_K^(m−1) )^T → C(λ_K) is of class C^1(R^m), we have the equality:

C(λ_{K+1}) − C(λ_K) = ⟨ ∇_{λ_K} C(λ_K) , λ_{K+1} − λ_K ⟩ + o( ‖λ_{K+1} − λ_K‖ )

where ⟨ , ⟩ and ‖ ‖ designate the scalar product and the norm in R^m respectively.

Thus, if λ_{K+1} is close enough to λ_K, we have the approximation:

C(λ_{K+1}) − C(λ_K) ≈ ⟨ ∇_{λ_K} C(λ_K) , λ_{K+1} − λ_K ⟩
From which we deduce in particular that the variation C(λ_{K+1}) − C(λ_K) of C(λ_K) is maximal if the vectors ∇_{λ_K} C(λ_K) and λ_{K+1} − λ_K are colinear. In order to obtain the minimum of C(λ_K) as quickly as possible, we place ourselves in this situation and, ∀K, we write:

λ_{K+1} − λ_K = μ ( −∇_{λ_K} C(λ_K) )

i.e.:

λ_{K+1} = λ_K + μ ( −∇_{λ_K} C(λ_K) )
Furthermore, by using the expression:

λ_{K+1} = λ_K + 2μ E(ε_K Y_K)

we can write:

∀n ≥ 1:  λ_{K+n} = λ_K + 2μ Σ_{j=0}^{n−1} E(ε_{K+j} Y_{K+j})

However, the multivariate process ε_{K+j} Y_{K+j} of order m is not WSS, thus it is not ergodic, and we cannot write:

λ_{K+n} = λ_K + 2μ n E(ε_K Y_K)

Moreover, the expression:

λ_{K+1} = λ_K + 2μ E(ε_K Y_K)

is unexploitable on a practical plane. By using the gradient method, we have succeeded in avoiding the inversion of the R matrix, but we have assumed that the numerical values of the correlations composing the elements of R and p, which determine the quadratic form C(λ), are known. In general, these numerical values are unknown; so we are going to attempt to estimate them, which is the reason for the following section.

6.6. Geometric interpretation
Let us give another expression to the cost function at instant K. We have found:

C(λ_K) = E(D_K^2) + λ_K^T R λ_K − 2 p^T λ_K

and:

C(λ̂) = E(D_K^2) − p^T λ̂

with p = R λ̂, the Wiener solution of ∇_λ C(λ) = 0.
The cost can be put in the form:

C(λ_K) = C(λ̂) + λ̂^T p + λ_K^T R λ_K − 2 λ_K^T p
       = C(λ̂) + (λ̂ − λ_K)^T p + λ_K^T R λ_K − λ_K^T p
       = C(λ̂) + (λ̂ − λ_K)^T p + λ_K^T R (λ_K − λ̂)
       = C(λ̂) + (λ̂ − λ_K)^T R λ̂ + (λ_K − λ̂)^T R λ_K
       = C(λ̂) + (λ̂ − λ_K)^T R (λ̂ − λ_K)

or:

C(λ_K) = C(λ̂) + (λ_K − λ̂)^T R (λ_K − λ̂)

Let us state α_K = λ_K − λ̂ (the origin of the axes is at present λ̂); it becomes:

C(λ̂ + α_K) = C(λ̂) + α_K^T R α_K

and easily:

∇_α C(λ̂ + α)_K = 2 R α_K

the index K representing the instant at which we are considering the gradient.
Let us simplify the preceding expressions to find simple geometric interpretations by changing the base. Matrix R being symmetric, it is diagonalizable by an orthogonal matrix Q, that is to say:

Γ = Q^(−1) R Q, with Q^T = Q^(−1) and Γ = diag( γ^0, ..., γ^(m−1) ), where the γ^i are the eigenvalues of R.

Let us bring R = Q Γ Q^(−1) into the last cost expression:

C(λ̂ + α_K) = C(λ̂) + α_K^T Q Γ Q^(−1) α_K

and, by noting u_K = Q^(−1) α_K:

C(λ̂ + Q u_K) = C(λ̂) + u_K^T Γ u_K = C(λ̂) + Σ_{i=0}^{m−1} γ^i (u_K^i)^2

and:

∇_u C(λ̂ + Qu)_K = 2 Γ u_K = 2 ( γ^0 u_K^0   γ^1 u_K^1   ...   γ^(m−1) u_K^(m−1) )^T

where u_K^i is the i-th component of u at instant K.
This expression is interesting because, when only one of the components of ∇_u C(λ̂ + Qu)_K is non-zero, the vector thus formed, always normal to the level curves of C(λ̂ + Q u_K), will carry the gradient vector. So this vector will form one of the principal axes of the ellipses (or hyperellipses). As a consequence, the vectors u_K are expressed along the principal axes of the hyperellipses.
These principal axes equally represent the eigenvectors of R. In effect, when we reduce a quadratic form, which we do by diagonalizing, we establish the principal axes of the hyperellipses by calculating the eigenvectors of the matrix R when the cost expression C is in the form:

Cte + α_K^T R α_K

NOTE 1.– When m = 2 or 3, the orthogonal matrix Q is associated with a rotation in R^2 or R^3 aligned with the base of the eigenvectors of R.

NOTE 2.– ∇_u C(λ̂ + Qu)_K = Q^(−1) ∇_α C(λ̂ + α)_K
Let us illustrate this representation with an example. Let:

R = ( 3, 1 ; 1, 3 ),  p = ( 5  7 )^T  and  E(D_K^2) = 20.

Thus we obtain:

Γ = ( 2, 0 ; 0, 4 ),  λ̂ = ( 1  2 )^T  and  C(λ̂) = 1
The eigenvectors of R allow us to construct a unitary matrix Q. Let:

Q = (1/√2) ( 1, 1 ; −1, 1 )

and:

C(λ̂ + α_K) = C(λ̂) + α_K^T R α_K

NOTE.– Q always has the same shape and always takes the same values if we choose the unit vectors as base vectors. This holds to the very special shape of R (Toeplitz). See the line graph in the coordinates (λ^0, λ^1), (α^0, α^1) and (u^0, u^1) later.
Figure 6.11. Line graph of the cost function and of the different axes ([BLA 06] for the line graph of the ellipse)
Figure 6.12. Line graph of “important reference points”
With u_K = Q^(−1) α_K, i.e.:

u^0 = (1/√2) ( α^0 − α^1 )
u^1 = (1/√2) ( α^0 + α^1 )
6.7. Stability and convergence
Let us now study the stability and the convergence of the algorithm of the deterministic gradient. By taking the recursive expressions of the coefficient vector and by the translation α_K = λ_K − λ̂, the following expressions:

λ_{K+1} = λ_K + μ ( −∇_{λ_K} C(λ_K) )
λ̂ = R^(−1) p
∇_{λ_K} C(λ_K) = −2 ( p − R λ_K )

enable us to write:

α_{K+1} = ( I_d − 2μR ) α_K    (I_d: identity matrix)

By writing R in the form R = Q Γ Q^(−1) and by premultiplying α_{K+1} by Q^(−1), we obtain:

Q^(−1) α_{K+1} = u_{K+1} = ( I_d − 2μΓ ) u_K
Thus:

u_K = ( I_d − 2μΓ )^K u_0

or, componentwise: u_{K+1}^i = ( 1 − 2μγ^i ) u_K^i, and:

∀i:  u_K^i = ( 1 − 2μγ^i )^K u_0^i

Thus, the algorithm is stable and convergent if and only if:

∀i:  lim_{K→∞} ( 1 − 2μγ^i )^K = 0

thus, if and only if:

1 − 2μγ^i ∈ ]−1, 1[, i.e. 0 < μ < 1/γ^i

In addition, this must finally be verified for every i, which gives the condition:

0 < μ < 1/γ_max

We thus obtain:

lim_{K→∞} λ_K = λ̂
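The condition 0 < μ < 1/γ_max can be observed directly on the decoupled recursion u_K^i = (1 − 2μγ^i)^K u_0^i (a Python sketch; the eigenvalues are illustrative):

```python
import numpy as np

gammas = np.array([2.0, 4.0])          # illustrative eigenvalues of R
u0 = np.array([1.0, 1.0])

def u_after(mu, n_iter=300):
    # componentwise: u_K^i = (1 - 2 mu gamma^i)^K u_0^i
    return (1 - 2 * mu * gammas) ** n_iter * u0

print(u_after(0.2))   # mu < 1/gamma_max = 0.25: both modes decay towards 0
print(u_after(0.3))   # mu > 1/gamma_max: the mode of gamma = 4 diverges
```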
The illustration which follows gives us an idea of the evolution of the cost and of the convergence of λK .
Figure 6.13. Line graph of several cost functions and the principal axes “u”
The same calculation example as before, but with a noisy input. This is a question of constructing a phase shifter with a noise canceller. ∅ is uniformly spread on [0, 2π] and φ, which is deterministic, illustrates a known phase difference.

X_K = sin( 2πK/N + ∅ ),  Y_K = X_K + b_K,  D_K = 2 sin( 2πK/N − φ + ∅ )

[Block diagram: the noise b_K is added to X_K to form Y_K, which feeds the two-tap filter (λ^0, λ^1) of Figure 6.10; its output Z_K is subtracted from D_K to give ε_K]
Figure 6.14. Schema of the principle of the phase shifter (see Figure 6.10) with noise input
b_K being a noise, centered and independent from the input:

E( b_{K−i} b_{K−j} ) = σ^2 δ_{i,j}

E( Y_K Y_{K−n} ) = E[ ( sin(2πK/N + ∅) + b_K ) ( sin(2π(K−n)/N + ∅) + b_{K−n} ) ] = 0.5 cos(2πn/N) + σ^2 δ_{0,n}

E( D_K Y_{K−n} ) = E[ 2 sin(2πK/N − φ + ∅) ( sin(2π(K−n)/N + ∅) + b_{K−n} ) ] = cos(2πn/N − φ)

The autocorrelation matrix of the data Y_K and the cross-correlation vector p are written:
R = ( 0.5 + σ^2, 0.5 cos(2π/N) ; 0.5 cos(2π/N), 0.5 + σ^2 )

p = E( D_K Y_K   D_K Y_{K−1} )^T = ( cos φ   cos(2π/N − φ) )^T
Thus, we obtain:

λ̂ = R^(−1) p = (1/Δ) ( 2(1 + 2σ^2) cos φ − ( cos φ + cos(4π/N − φ) )  ;  −2 cos(2π/N) cos φ + 2(1 + 2σ^2) cos(2π/N − φ) )^T
with:

Δ = ( 1 + 2σ^2 )^2 − cos^2( 2π/N )

and:

C(λ̂) = C_min = [ (1 + 2σ^2)(1 + 4σ^2) − 2σ^2 ( 2 cos^2 φ + cos(4π/N − 2φ) ) − 1 ] / Δ

with:

C(λ) = 2 + (1 + 2σ^2) · 0.5 ( (λ^0)^2 + (λ^1)^2 ) + λ^0 λ^1 cos(2π/N) − 2 λ^0 cos φ − 2 λ^1 cos(2π/N − φ)
or:

C(λ̂ + α_K) = C(λ̂) + α_K^T R α_K

and:

C(λ̂ + Q u_K) = C(λ̂) + u_K^T Γ u_K

See the line graph in the reference points (λ^0, λ^1), (α^0, α^1) and (u^0, u^1) above.

6.8. Estimation of gradient and LMS algorithm
We can consider estimates p̃ and R̃ of p and R in the calculation of the gradient. We have adopted the notation R̃ and p̃, and not R̂ and p̂, as the criterion is no longer the traditional criterion "min L^2" but an approximation of this latter.
We had:

∇_{λ_K} C(λ_K) = −2 ( p − R λ_K )

Thus, we are going to consider its estimate:

∇̃_{λ_K} C(λ_K) = −2 ( p̃ − R̃ λ_K )

The estimated values will be the observed data. Let:

p̃ = y_K d_K  and  R̃ = y_K y_K^T

thus:

∇̃_{λ_K} C(λ_K) = −2 ε_K y_K

and:

λ_{K+1} = λ_K + 2μ ε_K y_K

This recursive expression on λ_K amounts to suppressing the calculation of the expectation; in effect:

λ_{K+1} = λ_K + 2μ E(ε_K Y_K)

becomes:

λ_{K+1} = λ_K + 2μ ε_K y_K

called the LMS algorithm, or stochastic gradient (a class of algorithms which includes the LMS). Now, it happens that the successive iterations of this recursive algorithm themselves achieve the mathematical expectation included in this formula by statistical averaging [MAC 81].
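In compact form, one LMS iteration is the single line λ ← λ + 2μ ε_K y_K. A minimal Python sketch on synthetic data (the noiseless identification setting is an illustrative assumption, not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
m, mu, n = 2, 0.05, 5000
lam_true = np.array([0.7, -0.2])        # unknown filter to be identified

lam = np.zeros(m)
x = rng.standard_normal(n)
for k in range(m - 1, n):
    y_k = x[k - m + 1:k + 1][::-1]      # regressor (y_K, y_{K-1})^T
    d_k = lam_true @ y_k                # noiseless desired response
    eps_k = d_k - lam @ y_k             # a priori error
    lam = lam + 2 * mu * eps_k * y_k    # LMS update

print(lam)                              # ~ lam_true
```

Note that no expectation is computed anywhere: each iteration uses only the current samples, and the averaging is achieved implicitly over the iterations.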
To be put into operation, this algorithm needs knowledge of the couple D_K and Z_K at each incremental step. We have knowledge of these at all instants thanks to the filtering, as Z_K = λ_K^T Y_K and z_K = λ_K^T y_K by considering the data; and we know, obviously, the reference D_K.

We can write, for n ∈ N*:

λ_{K+n} = λ_K + (2μn) (1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j}

with y_{K+j} = ( y_{K+j}   y_{K−1+j}   ...   y_{K−m+1+j} )^T, if μ is constant at each step of the iteration.
If (ε_K Y_K) is ergodic and μ constant, the expression:

λ_{K+n} = λ_K + (2μn) (1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j}

is such that lim_{K→∞} λ_K does not exist.
Let us suppose that (ε_K Y_K) is ergodic but that μ varies with the instant K; thus:
λ_{K+n} = λ_K + (2μn) (1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j}

becomes:

λ_{K+n} = λ_K + (2μ_n n) (1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j}

As:

(1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j} → E( Y_K ε_K ) = cte

in order that λ_n tends to a limit, μ_n must decrease faster than α/n (α = constant). We thus rediscover a relation very close to that obtained in section 6.5:

λ_{K+n} = λ_K + 2 μ_n n E( ε_K Y_K )
6.8.1. Convergence of the LMS algorithm

The study of the convergence of this algorithm is much more delicate than that of the deterministic gradient. The reader is invited to refer to the bibliography, and to [MAC 95] in particular, for more information.
6.9. Example of the application of the LMS algorithm
Let us recall the modeling of an AR process.
[Block diagram: the white noise B_K drives the AR model; the output X_K is fed back through delays T with coefficients a_1, a_2, ..., a_M and subtracted at the input summing node]

Thus:

B_K = Σ_{n=0}^{M} a_n X_{K−n}
By multiplying the two members by X_{K−l} and by taking the expectations, it becomes:

E( X_{K−l} Σ_{n=0}^{M} a_n X_{K−n} ) = E( X_{K−l} B_K )

If l > 0 then X_{K−l} ⊥ B_K, as B_K is a white noise and, among the samples X_{K−l}, only X_K (l = 0) depends on B_K. Thus, by stating:

E( X_j X_m ) = r_{j−m}
Σ_{n=0}^{M} a_n r_{n−l} = 0  for l > 0

and:

Σ_{n=0}^{M} a_n r_n = E( X_K B_K ) = E( ( B_K − Σ_{n=1}^{M} a_n X_{K−n} ) B_K ) = σ_B^2

By noting a_0 = 1 and by using the matrix expression, this becomes:

( r_0, r_1, ..., r_M ; r_1, r_0, ..., r_{M−1} ; ... ; r_M, r_{M−1}, ..., r_0 ) ( 1, a_1, ..., a_M )^T = ( σ_B^2, 0, ..., 0 )^T

where the first row corresponds to l = 0 and the following rows to l ∈ [1, M].
AR process of order 1. Let the following AR process be X_K = −a X_{K−1} + B_K, where B_K is a centered white noise of variance σ_B^2. The problem consists of estimating the constant a using an adaptive filter.

[Block diagram: B_K = X_K + a X_{K−1}, with X_K fed back through a delay T and a gain a]

Knowing B_K and X_{K−1}, the problem consists of estimating X_K (or a).
The preceding results allow us to write:

r_0 + a_1 r_1 = σ_B^2
r_1 + a_1 r_0 = 0

from where:

a_1 = a = −r_1 / r_0  and  σ_B^2 = σ_X^2 ( 1 − a^2 )
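These two relations can be checked by simulating the AR(1) recursion (a Python sketch; the values of a and σ_B are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
a, sigma_b, n = -0.6, 1.0, 200_000

x = np.zeros(n)
b = sigma_b * rng.standard_normal(n)
for k in range(1, n):
    x[k] = -a * x[k - 1] + b[k]        # X_K = -a X_{K-1} + B_K

x = x[1000:]                           # discard the transient
r0 = np.mean(x * x)                    # sigma_X^2
r1 = np.mean(x[1:] * x[:-1])           # r_1

print(-r1 / r0, a)                     # a = -r1 / r0
print(r0 * (1 - a**2), sigma_b**2)     # sigma_B^2 = sigma_X^2 (1 - a^2)
```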
Let us estimate this value of the parameter a with the help of a predictor and by using an LMS algorithm.

[Block diagram: Y_K = X_{K−1}, obtained by delaying X_K, is weighted by λ to form Z_K; the error is ε_K = D_K − Z_K with D_K = X_K]

ε_K = D_K − λ X_{K−1}, where Y_K = X_{K−1} and D_K = X_K, with ε_K ⊥ Z_K (principle of orthogonality), i.e.:

E( ( X_K − λ̂ X_{K−1} ) X_{K−1} ) = 0

or r_1 = λ̂ r_0, from where λ̂ = r_1 / r_0 = −a.
By using the Wiener optimal solution directly, with R = r_0 and p = r_1, we obtain R λ̂ = p, i.e. λ̂ = r_1 / r_0.

Let C(λ̂) = E(D_K^2) − p^T λ̂, which gives us:

C(λ̂) = σ_X^2 ( 1 − a^2 )

This minimum cost is also equal to σ_B^2.
Below is an example processed with Matlab. For an AR process of order 2, we have:

ε_K = D_K − λ^0 X_{K−1} − λ^1 X_{K−2}

and:

E( ( X_K − λ̂^0 X_{K−1} − λ̂^1 X_{K−2} ) ( X_{K−1}   X_{K−2} )^T ) = ( 0   0 )^T
Thus:

λ̂^0 = ( r_1 r_0 − r_1 r_2 ) / ( r_0^2 − r_1^2 )  and  λ̂^1 = ( r_2 r_0 − r_1^2 ) / ( r_0^2 − r_1^2 )

or, by using the Wiener solution:

R = ( r_0, r_1 ; r_1, r_0 )  and  p = ( r_1   r_2 )^T  with  R λ̂ = p

See the following example using Matlab software.
SUMMARY.– We have shown that the algorithm of the gradient, through its recursivity, resolves the Wiener-Hopf expression by calculating the mean. However, it needs twice the amount of calculations of a transverse filter, as we have to calculate, on the one hand:

ε_K = d_K − λ_K^T y_K, with its m multiplications and m additions,

and on the other hand:

λ_{K+1} = λ_K + 2μ ε_K y_K, with its m + 1 multiplications and m additions.

The complexity is thus of order 2m. We have also shown that the algorithm of the stochastic gradient is the simplest of all those which optimize the same least-squares criterion. In contrast, it converges more slowly than the so-called exact least-squares algorithms.

Examples processed using Matlab software

Example of adaptive filtering (AR of order 1)

The objective consists of estimating the coefficient of a predictor of order 1 by using the LMS algorithm of an adaptive filter. The process is constructed by an AR model of the 1st order with a white noise which is centered, Gaussian, and has a variance (sigmav)^2. The problem returns to that of finding the best coefficient which gives us the sample to be predicted.
% Predictor of order 1
clear all; close all;
N=500; t=0:N;
a=-rand(1);                       % value to be estimated
sigmav=0.1;                       % standard deviation of noise
r0=(sigmav)^2/(1-a^2);            % E[u(k)^2]
r1=-a*r0;                         % represents p
wopt=r1/r0;                       % optimal Wiener solution
Jmin=r0-r1*wopt;
mu=0.1;                           % convergence parameter
w(1)=0; u(1)=0;
vk=sigmav*randn(size(t));
for k=1:length(t)-1;
  u(k+1)=-a*u(k)+vk(k+1);
  e(k+1)=u(k+1)-w(k)*u(k);
  w(k+1)=w(k)+2*mu*u(k)*e(k+1);
  E(k+1)=e(k+1)^2;                % instantaneous square error
  J(k+1)=Jmin+(w(k)-wopt)'*r0*(w(k)-wopt);
end
% line graph
subplot(3,1,1)
plot(t,w,'k',t,wopt*ones(size(t)),'k',t,a*ones(size(t)),'k'); grid on
title('estimation of lambda, lambda opt. and "a"')
subplot(3,1,2)
plot(t,E,'k',t,J,'k',t,Jmin*ones(size(t)),'k'); grid on
axis([0 N 0 max(E)])
title('inst. err., cost and min cost')
subplot(3,1,3)
plot(w,E,'k',w,J,'k'); grid on
axis([0 1.2*wopt 0 max(J)])
title('inst. err. and cost acc. to lambda')
Figure 6.15. Line graph of important data of AR process of order 1

Another example (AR of order 2)

The objective consists of estimating the coefficients of a predictor of order 2 by using the algorithm of the stochastic gradient of an adaptive filter. The process is constructed by an AR model of the 2nd order with a white noise which is centered, Gaussian, and has a variance (sigmav)^2. The problem returns to that of finding the best coefficients which give us the sample to be predicted.
% Predictor of order 2
clear all; close all;
N=1000; t=0:N;
a1=-0.75;     % value to be estimated
a2=0.9;       % idem
sigmav=0.2;   % standard deviation of noise
r0=((1+a2)*((sigmav)^2))/(1+a2-a1^2+a2*(a1^2)-a2^2-a2^3);  % E[u(k)^2]
r1=(-a1*r0)/(1+a2);                 % represents p(1)
r2=(r0*(a1^2-a2^2-a2))/(1+a2);      % represents p(2)
w1opt=(r0*r1-r1*r2)/(r0^2-r1^2);
w2opt=(r0*r2-r1^2)/(r0^2-r1^2);
wopt=[w1opt w2opt]';  % optimal Wiener solution
p=[r1 r2]';
Jmin=r0-p'*wopt;
R=[r0 r1;r1 r0];
mu=0.2;       % convergence parameter
w1(1)=0; w2(1)=0; w1(2)=0; w2(2)=0;
u(1)=0; u(2)=0;
vk=sigmav*randn(size(t));
for k=2:length(t)-1;
  u(k+1)=-a1*u(k)-a2*u(k-1)+vk(k+1);
  e(k+1)=u(k+1)-w1(k)*u(k)-w2(k)*u(k-1);
  w1(k+1)=w1(k)+2*mu*u(k)*e(k+1);
  w2(k+1)=w2(k)+2*mu*u(k-1)*e(k+1);
  w(:,k)=[w1(k) w2(k)]';
  J(k+1)=Jmin+(w(:,k)-wopt)'*R*(w(:,k)-wopt);
end
% line graph
w(:,N)
delta=a1^2-4*a2;
z1=(-a1+(delta^.5))/2;
z2=(-a1-(delta^.5))/2;
subplot(2,2,1)
plot(t,w1,'k',t,w1opt*ones(size(t)),'b',t,a1*ones(size(t)),'r'); grid on
title('est. lambda0, lambda0 opt. and "a1"')
subplot(2,2,2)
plot(t,w2,'k',t,w2opt*ones(size(t)),'b',t,a2*ones(size(t)),'r'); grid on
title('est. lambda1, lambda1 opt. and "a2"')
subplot(2,2,3)
plot(t,J,'-',t,Jmin*ones(size(t)),'r'); grid on
axis([0 N 0 max(J)])
title('Cost and min Cost')
subplot(2,2,4)
plot(w1,J,'b',w2,J,'r'); grid on
title('evolution of coefficients acc. to Cost')

Figure 6.16. Line graph of important data of AR process of order 2

6.10. Exercises for Chapter 6

Exercise 6.1. [WID 85]

An adaptive filter is characterized by:

– R = ( 2, 1 ; 1, 2 ), the correlation matrix of the data;

– p = ( 7  8 )^T, the intercorrelation vector;

– E(D_K^2) = 42, D_K being the desired output.
1) Give the cost expression C.
2) Calculate the optimal vector λ̂.
3) Give the expression of the minimum cost C(λ̂).
4) Calculate the eigenvalues of R.
5) Determine the eigenvectors in such a way that the matrix Q of the eigenvectors is "normalized" (that is to say QQ^T = I), these vectors representing the principal axes of the family of ellipses.
6) Give the limits of μ, the convergence parameter used in the algorithm of the stochastic gradient.

Solution 6.1.

1) C = 2λ_1^2 + 2λ_2^2 + 2λ_1λ_2 − 14λ_1 − 16λ_2 + 42.
2) λ̂ = ( 2  3 )^T.
3) C(λ̂) = 4.
4) γ_1 = 1, γ_2 = 3.
5) u_1 = (1/√2) ( 1  −1 )^T, u_2 = (1/√2) ( 1  1 )^T.
6) 0 < μ < 1/3.
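The solution can be checked in a few lines of Python:

```python
import numpy as np

R = np.array([[2.0, 1.0], [1.0, 2.0]])
p = np.array([7.0, 8.0])
E_D2 = 42.0

lam_hat = np.linalg.solve(R, p)        # question 2: (2, 3)
C_min = E_D2 - p @ lam_hat             # question 3: 4
gammas = np.linalg.eigvalsh(R)         # question 4: (1, 3)
mu_max = 1 / gammas.max()              # question 6: 1/3

print(lam_hat, C_min, gammas, mu_max)
```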
Chapter 7
The Kalman Filter
7.1. Position of the problem

The aim of the filtering that we are going to study consists of "best estimating", in the sense of the classic criterion of least mean squares, a discrete process X_K governed by an equation of the form:

X_{K+1} = A(K) X_K + C(K) N_K  (state equation)

This process (physical, biological, etc.), called the state process, is what interests the user. It represents, for example, the position, speed and acceleration of a moving object. This process is inaccessible directly and it is studied by means of a process Y_K governed by an equation of the form:

Y_K = H(K) X_K + G(K) W_K  (observation equation)

Y_K is called the observation process.
N_K and W_K are the system noise and the measurement noise respectively, and will be explained in more detail in what follows.

The Kalman filter, with its creation, brought into widespread use the optimal filter for non-stationary systems. It is also recursive: the prediction X̂_{K+1|K} is obtained starting from the filtering at the preceding instant, X̂_{K|K}, and the filtering X̂_{K+1|K+1} from its prediction X̂_{K+1|K} and from the measurement of the process Y_{K+1} at the instant at which we are making our estimation.

Moreover, if the observable system is known and linear, the objective consists of, starting from the measurements of the system, determining the best possible estimate in the sense of the criterion specified above. If the observable system is known but non-linear, an approximate solution can be given by effecting a linearization of the equations of state and of the observations around the last estimated value. If the system is not perfectly known and linear, the problem is more complicated because we must introduce and estimate, in the state vector, the components inherent to this system. This case will not be studied in this chapter. In the same fashion, we will not tackle the case where the noises are colored, or that in which there is a correlation between the system noise and the measurement noise. The reader can find additional information in the bibliography ([GIM 82] and [RAD 84]).

Preliminaries in the scalar case

We have demonstrated that the best estimate of a process, starting from an observation function g, which is to say X̂ = ĝ(Y_1, ..., Y_K), is the conditional expectation of the magnitude X knowing the set of random observation variables Y_1, ..., Y_K, represented by the orthogonal projection of X on the Hilbert space that we have defined:

X̂ = ĝ(Y_1, ..., Y_K) = Proj_{H_K^Y} X = E( X | Y_1, ..., Y_K )
However, if the vector ( X, Y_1, ..., Y_K ) is Gaussian, then we have seen that the estimate X̂ of X is an affine function of the variables Y_j:

X̂ = λ̂^0 + Σ_{j=1}^{K} λ̂^j Y_j
In order to approach Kalman filtering in a simple way, we are going to begin by grappling with the problem of linear estimation in the scalar case, applied to linear prediction. The shape of the recursive estimation obtained will allow a better grasp of the multivariate case.

Let us consider a set of random variables Y_1, Y_2, ..., Y_j, ..., Y_{K−1}, with Y_j the variable observed at instant j and Y_0 = 0 by convention.

Let us recall that we denote by H_{K-1}^Y the real vector space generated by these random variables, i.e.:

H_{K-1}^Y = { Σ_{j=1}^{K−1} λ_j Y_j : λ_j ∈ R }
Example of linear estimation [HAY 91]

The best linear estimation, in the sense of the least mean square error, of a random variable Y_K, starting from the observations making up H_{K-1}^Y, can be done by the following linear predictor:
[Block diagram: the delayed samples Y_{K−1}, Y_{K−2}, ..., Y_{K−(K−1)} are weighted by λ_1, λ_2, ..., λ_{K−1} and summed to form Ŷ_{K|K−1}]
Figure 7.1. Schema of the principle of the linear estimator
The prediction error is then written:

I_K = Y_K − Ŷ_{K|K−1}

(which we could compare with ε_K in the adaptive filter) for a predictor filter of order K − 1, and it is easily constructed by the above arrangement. The output of the filter can be interpreted as the best estimate at instant K, knowing the data of the process Y_1, ..., Y_{K−1}. Thus, we can interpret ŷ_{K|K−1}, the result of Ŷ_{K|K−1}, as the output of a predictor of order K − 1 whose input would be made up of the observations y_1, y_2, ..., y_{K−1}: measurements of the Y_j.

The principle of orthogonality shows us that this "error" I_K is orthogonal to H_{K-1}^Y and may be interpreted as the new information brought by Y_K, from which comes the name of "innovation" error. Thus, we will name this prediction error: the innovation.
7.2. Approach to estimation

7.2.1. Scalar case

It is clear that we can give an estimate of the magnitude of a process based on past observations of this process. In the expression of the innovation:

I_K = Y_K − Σ_{i=1}^{K−1} λ̂_i Y_{K−i}

Y_K represents the magnitude to be estimated (see the predictor) and:

Σ_{i=1}^{K−1} λ̂_i Y_{K−i} = Proj_{H_{K-1}^Y} Y_K = Ŷ_{K|K−1}

represents the estimation, and:

I_K = Y_K − Ŷ_{K|K−1}

In the same way, if we call:

X̂_{K|K} = Proj_{H_K^Y} X_K

the estimate of a process at instant K, starting from the measurements y_1, ..., y_K, ... of the process Y_1, ..., Y_K, ..., we can write:

X̂_{K|K} = Σ_{j=1}^{K} b_j Y_j, estimate of X_K.
Let us write the innovation at instants 1, 2, ..., K:

I_K = Y_K − Σ_{i=1}^{K−1} λ_i^{K−1} Y_{K−i}

with λ_i^{K−1}: coefficients of the predictor of order K − 1:

I_1 = Y_1  (with Ŷ_{1|0} = 0)
I_2 = Y_2 − λ_1^1 Y_1
I_3 = Y_3 − λ_1^2 Y_2 − λ_2^2 Y_1
...
I_K = Y_K − λ_1^{K−1} Y_{K−1} − ... − λ_{K−1}^{K−1} Y_1

This expression can be put in the form I = M Y, with M an invertible triangular matrix (because det M = 1). Thus Y = M^(−1) I.

As a consequence, each vector I can be written according to the vectors Y = ( Y_1, ..., Y_K )^T and inversely: H_K^Y = H_K^I.

Thus X̂_{K|K} = b′ · Y = b′ M^(−1) I, where b′ = ( b′_1, ..., b′_K )^T is a vector of dimension K and I = ( I_1, ..., I_K )^T the innovation vector.
It is clear that the equality X̂_{K|K} = b′ M^(−1) I can also be put in the form:

X̂_{K|K} = Σ_{j=1}^{K} d_j I_j

Let us now show that:

d_j = E( X_K I_j ) / E( I_j I_j ),  j ∈ [1, K]
Demonstration

We know that X_K − X̂_{K|K} ∈ H_K^{Y,⊥}. We have:

X_K − X̂_{K|K} ⊥ Y_j,  ∀j ∈ [1, K]

and, since Ŷ_{j|j−1} ∈ H_{j−1}^Y ⊂ H_K^Y, it also comes:

X_K − X̂_{K|K} ⊥ Ŷ_{j|j−1}

Thus X_K − X̂_{K|K} ⊥ Y_j − Ŷ_{j|j−1} = I_j,  ∀j ∈ [1, K].

That is to say: E( X_K I_j ) = E( X̂_{K|K} I_j ).

From which finally: E( X_K I_j ) = E( X̂_{K|K} I_j ) = Σ_{i=1}^{K} d_i E( I_i I_j ), and since I_i ⊥ I_j if i ≠ j, it becomes:

d_j = E( X_K I_j ) / E( I_j I_j )

Let us exploit the expression of the filtering: X̂_{K|K} = Σ_{j=1}^{K} d_j I_j
and:

X̂_{K|K} = Σ_{j=1}^{K−1} d_j I_j + d_K I_K

From our first results, the sum of the K − 1 terms also represents an estimation, and:

X̂_{K|K} = X̂_{K−1|K−1} + d_K I_K

which shows that the estimate at instant K is written according to the estimate at instant K − 1 and a corrective term depending on instant K. This recursive estimation procedure is the foundation of Kalman filtering.

7.2.2. Multivariate case

We are going at present to consider the vector magnitudes seen in Chapter 4, which is to say:

– X_K: multivector of order n, X_K ∈ (L^2)^n;
– Y_K: multivector of order m, Y_K ∈ (L^2)^m;
– I_K: multivector of order m, I_K ∈ (L^2)^m.
The relationship between the Y_j and the I_j:

I_K = Y_K − H(K) X̂_{K|K−1}, or I_K = Y_K − H(K) Σ_{j=1}^{K−1} Λ̂_j Y_j

Reciprocally, by writing Y_K according to the I_K, it becomes (with X̂_{1|0} = 0):

Y_1 = I_1
Y_2 = I_2 + H(2) Λ̂_1 I_1
Y_3 = I_3 + H(3) Λ̂_1 I_1 + H(3) Λ̂_2 I_2 + H(3) Λ̂_2 H(2) Λ̂_1 I_1

Thus Y_K is expressed according to I_K, I_{K−1}, ..., I_1.
7.3. Kalman filtering

Vector or multivariate approach. Given:

– X_K: state multivector (n × 1);
– x_K: state vector of results;
– Y_K: multivector of observations (m × 1);
– y_K: vector of observation results.
7.3.1. State equation

X_{K+1} = A(K) X_K + C(K) N_K

with A(K) the state matrix (n × n), a deterministic matrix, and N_K the process noise vector (l × 1), which we choose centered, white, and of correlation matrix (covariance matrix in the general case):

E( N_K N_j^T ) = δ_{K,j} Q_K,  (l × l): correlation matrix of the process noise vector N_K

C(K): (n × l): deterministic matrix

7.3.2. Observation equation
Y_K = H(K) X_K + G(K) W_K

with H(K): matrix of measurements or of observations (m × n), a deterministic matrix; and W_K: measurement noise vector (p × 1), which we choose, like N_K, centered, white, and of correlation matrix (covariance matrix in the general case):

E( W_K W_j^T ) = δ_{K,j} R_K,  (p × p): correlation matrix of the measurement noise vector W_K

G(K): (m × p): deterministic matrix

The noises N_K and W_K are independent and, as they are centered:

E( N_K W_j^T ) = 0,  ∀K and j

We will suppose, in what follows, that W_K ⊥ X_0.
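Before deriving the filter, it may help to see the two equations run in code. The following Python sketch simulates a linear state-space model; the matrices, dimensions and noise levels are illustrative assumptions (a constant-velocity model observed in position), not values from the book:

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps = 50

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])                 # state matrix A(K) (position / velocity)
C = np.eye(2)                              # C(K)
H = np.array([[1.0, 0.0]])                 # H(K): only the position is observed
G = np.eye(1)                              # G(K)
Q = 0.01 * np.eye(2)                       # correlation matrix of N_K
Rw = 0.25 * np.eye(1)                      # correlation matrix of W_K

x = np.zeros(2)
xs, ys = [], []
for _ in range(n_steps):
    n_k = rng.multivariate_normal(np.zeros(2), Q)    # system noise N_K
    w_k = rng.multivariate_normal(np.zeros(1), Rw)   # measurement noise W_K
    x = A @ x + C @ n_k                              # state equation
    y = H @ x + G @ w_k                              # observation equation
    xs.append(x.copy()); ys.append(y.copy())

print(len(xs), len(ys))
```

The filtering problem of this chapter is to recover the hidden trajectory xs from the noisy scalar measurements ys alone.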
By iteration of the state equation, we can write:

X_K = Φ(K, 0) X_0 + Σ_{i=1}^{K−1} Φ(K, i+1) N_i, with Φ(K, j): transition matrix.

We obtain from this transition equation, by multiplying the two members by W_j:

X_K ⊥ W_j,  ∀K, j > 0

By using the equation of observations:

Y_j ⊥ W_K for 0 ≤ j ≤ K − 1, and Y_j ⊥ N_K for 0 ≤ j ≤ K

The problem of the estimation can now be expressed simply in the following way. Knowing that A(K) is the state matrix of the system, that H(K) is the measurement matrix, and the results y_i of the Y_i, i ∈ [1, K], obtain the estimations x_j of the X_j:

– if 1 ≤ j < K, we say that the estimation is a smoothing;
– if j = K, we say that the estimation is a filtering;
– if j > K, we say that the estimation is a prediction.

NOTE.– The matrices C(K) and G(K) do not play an essential role, in the measure where the powers of the noises appear in the elements of the matrices Q_K and R_K respectively. However, the reader will be able to find analogies with the notations used in "Processus stochastiques et filtrage de Kalman", by the same authors, which examines the continuous case.
7.3.3. Innovation process
The innovation process has already been defined as:

$I_K = Y_K - H(K)\,\mathrm{Proj}_{H_{K-1}^Y} X_K = Y_K - H(K)\,\hat{X}_{K|K-1} \quad : (m \times 1)$

and:

$H_{K-1}^Y = \Big\{ \sum_{j=0}^{K-1} \Lambda_j Y_j \;\Big|\; \Lambda_j \text{ matrix } (n \times m) \Big\}$

By this choice of $\Lambda_j$, the space $H_{K-1}^Y$ is adapted to the order of the state multivector $X_K$, and $\mathrm{Proj}_{H_{K-1}^Y} X_K = \hat{X}_{K|K-1}$ has the same order as $X_K$.

Thus $I_K$ represents the influx of information between the instants $K-1$ and $K$. Reminder of properties established earlier:

$I_K \perp Y_j$ and $I_K \perp I_j$ for $j \in [1, K-1]$

We will come back to the innovation to bring out its physical meaning.

7.3.4. Covariance matrix of the innovation process
Between two measurements, the dynamics of the system lead to an evolution of the state quantities. So the prediction of the state vector at instant $K$, knowing the measurements $(Y_1, \ldots, Y_{K-1})$, which is to say $\hat{X}_{K|K-1}$, is written according to the filtering at instant $K-1$:

$\hat{X}_{K|K-1} = E\big( X_K \mid Y_1, \ldots, Y_{K-1} \big) = \mathrm{Proj}_{H_{K-1}^Y} X_K$
$= \mathrm{Proj}_{H_{K-1}^Y} \big( A(K-1)\,X_{K-1} + C(K-1)\,N_{K-1} \big)$
$= A(K-1)\,\hat{X}_{K-1|K-1} + 0$

that is:

$\hat{X}_{K|K-1} = A(K-1)\,\hat{X}_{K-1|K-1}$
Only the information deriving from a new measurement at instant $K$ will enable us to reduce the estimation error at this same instant. Thus, with $H(K)$ representing, in a certain fashion, the measurement operator (or at the very least its effect), the quantity:

$Y_K - H(K)\,\hat{X}_{K|K-1}$

will represent the influx of information between two instants of observation. It is for this reason that this quantity is called the innovation. We observe, furthermore, that $I_K$ and $Y_K$ have the same order. By exploiting the observation equation we deduce:

$I_K = H(K)\big( X_K - \hat{X}_{K|K-1} \big) + G(K)\,W_K$

and:

$I_K = H(K)\,\tilde{X}_{K|K-1} + G(K)\,W_K$

where $\tilde{X}_{K|K-1} = X_K - \hat{X}_{K|K-1}$ is called the prediction error. The covariance matrix of the innovation is finally expressed as:

$\mathrm{Cov}\,I_K = E\big( I_K I_K^T \big) = E\Big[ \big( H(K)\,\tilde{X}_{K|K-1} + G(K)\,W_K \big)\big( H(K)\,\tilde{X}_{K|K-1} + G(K)\,W_K \big)^T \Big]$

that is to say:

$\mathrm{Cov}\,I_K = H(K)\,P_{K|K-1}\,H^T(K) + G(K)\,R_K\,G^T(K)$

where $P_{K|K-1} = E\big( \tilde{X}_{K|K-1}\,\tilde{X}_{K|K-1}^T \big)$ is called the covariance matrix of the prediction error.
A recurrence formula on the matrices $P_{K|K-1}$ will be developed in Appendix A.
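In the scalar case ($H$, $G$, $P_{K|K-1}$ and $R$ all scalars) the formula reduces to $\mathrm{Cov}\,I_K = H^2 P_{K|K-1} + G^2 R$, which a short Monte Carlo sketch can confirm (the numerical values below are arbitrary, chosen only for illustration):

```python
import random

random.seed(1)
H, G, P_pred, R = 2.0, 1.0, 0.5, 0.3

# draw innovations I = H*Xtilde + G*W, with Xtilde ~ N(0, P_pred) and W ~ N(0, R)
samples = [H * random.gauss(0.0, P_pred ** 0.5) + G * random.gauss(0.0, R ** 0.5)
           for _ in range(200000)]
cov_est = sum(s * s for s in samples) / len(samples)

print(cov_est)  # close to H^2 * P_pred + G^2 * R = 2.3
```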
7.3.5. Estimation
In the scalar case, we established a relationship between the estimate of a quantity $X_K$ and the innovations $I_K$. We can quite obviously extend this approach to the case of multivariate processes, that is to say we can write:

$\hat{X}_{i|K} = \sum_{j=1}^{K} d_j(i)\,I_j$

where $d_j(i)$ is a matrix $(n \times m)$. Let us determine the matrices $d_j(i)$:

since $E\big( \tilde{X}_{i|K}\,I_j^T \big) = E\big( ( X_i - \hat{X}_{i|K} )\,I_j^T \big) = 0 \quad \forall j \in [1,K]$

we have: $E\big( X_i\,I_j^T \big) = E\big( \hat{X}_{i|K}\,I_j^T \big)$

furthermore, knowing the form of $\hat{X}_{i|K}$, we have:

$E\big( X_i\,I_j^T \big) = E\Big( \sum_{p=1}^{K} d_p(i)\,I_p\,I_j^T \Big)$

Then, since $I_j \perp I_p \quad \forall j \neq p$ and $j, p \in [1,K]$:

$E\big( X_i\,I_j^T \big) = d_j(i)\,E\big( I_j\,I_j^T \big) = d_j(i)\,\mathrm{Cov}\,I_j$

Finally: $d_j(i) = E\big( X_i\,I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1}$.
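Because the innovations are mutually orthogonal, each coefficient $d_j$ is obtained independently of the others, one correlation at a time. A scalar Monte Carlo sketch (the coefficients 3 and 0.5 are made-up values):

```python
import random

random.seed(2)
N = 100000
I1 = [random.gauss(0.0, 1.0) for _ in range(N)]   # Cov I1 = 1
I2 = [random.gauss(0.0, 2.0) for _ in range(N)]   # Cov I2 = 4
X = [3.0 * a + 0.5 * b for a, b in zip(I1, I2)]   # X built on the innovations

# d_j = E(X I_j) (Cov I_j)^(-1), estimated empirically
d1 = sum(x * i for x, i in zip(X, I1)) / sum(i * i for i in I1)
d2 = sum(x * i for x, i in zip(X, I2)) / sum(i * i for i in I2)

print(d1, d2)  # close to 3 and 0.5
```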
We thus obtain:

$\hat{X}_{i|K} = \sum_{j=1}^{K} E\big( X_i I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j = \sum_{j=1}^{K-1} E\big( X_i I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j + E\big( X_i I_K^T \big)\big( \mathrm{Cov}\,I_K \big)^{-1} I_K$

We are now going to give the Kalman equations. Let us apply the preceding equality to the filtering $\hat{X}_{K+1|K+1}$; we obtain:

$\hat{X}_{K+1|K+1} = \sum_{j=1}^{K+1} E\big( X_{K+1} I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j = \sum_{j=1}^{K} E\big( X_{K+1} I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j + E\big( X_{K+1} I_{K+1}^T \big)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

The state equation reminds us that:

$X_{K+1} = A(K)\,X_K + C(K)\,N_K$

and we know that $N_K \perp I_j$. Thus:

$E\big( X_{K+1} I_j^T \big) = A(K)\,E\big( X_K I_j^T \big)$
The estimate of $X_{K+1}$ knowing the measurement at this instant $K+1$ is thus expressed:

$\hat{X}_{K+1|K+1} = A(K) \sum_{j=1}^{K} E\big( X_K I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j + E\big( X_{K+1} I_{K+1}^T \big)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

The term under the sigma (sum) sign can be written $\hat{X}_{K|K}$. Let us exploit the expression:

$I_{K+1} = H(K+1)\,\tilde{X}_{K+1|K} + G(K+1)\,W_{K+1}$

This gives us:

$\hat{X}_{K+1|K+1} = A(K)\,\hat{X}_{K|K} + E\big( X_{K+1} I_{K+1}^T \big)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

which is also written:

$\hat{X}_{K+1|K+1} = A(K)\,\hat{X}_{K|K} + E\Big( X_{K+1} \big( H(K+1)\,\tilde{X}_{K+1|K} + G(K+1)\,W_{K+1} \big)^T \Big) \big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

In addition, we have shown that the best estimation at a given instant, knowing the past measurements, which we write $\hat{X}_{K+1|K}$, is equal to the projection of $X_{K+1}$ on $H_K^Y$, i.e.:
$\hat{X}_{K+1|K} = \mathrm{Proj}_{H_K^Y} X_{K+1} = \mathrm{Proj}_{H_K^Y}\big( A(K)\,X_K + C(K)\,N_K \big)$

and as:

$Y_j \perp N_K \quad \forall j \in [1,K]$

it becomes $\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K}$ (the matrix $A(K)$ being square). We can consider this equation as the one which describes the dynamics of the system independently of the measurements, and as one of the equations of the Kalman filter.

In addition, $X_K \perp W_j \quad \forall K,\ j > 0$; it becomes, for the filtering:

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + E\big( X_{K+1}\,\tilde{X}_{K+1|K}^T \big)\,H^T(K+1)\,\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$
As:

$\hat{X}_{K+1|K} \perp \tilde{X}_{K+1|K}$

then:

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + E\Big( \big( X_{K+1} - \hat{X}_{K+1|K} \big)\,\tilde{X}_{K+1|K}^T \Big)\,H^T(K+1)\,\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

thus:

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + P_{K+1|K}\,H^T(K+1)\,\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$
DEFINITION.– We call the Kalman gain the function $K$ defined (here at instant $K+1$) by:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\,\big( \mathrm{Cov}\,I_{K+1} \big)^{-1}$

with:

$\mathrm{Cov}\,I_{K+1} = H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1)$

From which, by putting $\mathrm{Cov}\,I_{K+1}$ back into the expression of $K(K+1)$, we obtain:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\Big( H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1) \Big)^{-1}$

We notice that this calculation does not require direct knowledge of the measurements of $Y_K$. This expression of the gain intervenes, quite obviously, in the algorithm of the Kalman filter, and we can write:

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + K(K+1)\big( Y_{K+1} - H(K+1)\,\hat{X}_{K+1|K} \big)$

This expression of the best filtering represents another equation of the Kalman filter. We observe that the "effect" of the gain is essential. Indeed, if the measurement is very noisy, which means that the elements of the matrix $R_K$ are large, then the gain will be relatively small and the impact of this measurement on the calculation of the filtering will be minimized. On the other hand, if the measurement is not very noisy, we will have the inverse effect: the gain will be large and its effect on the filtering will be appreciable.
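This behavior is easiest to see in the scalar case, where, with $H = G = 1$, the gain reduces to $K = P_{K+1|K}/(P_{K+1|K} + R)$. A sketch (the numerical values are arbitrary):

```python
def scalar_gain(p_pred, r):
    """Scalar Kalman gain with H = G = 1: K = P/(P + R)."""
    return p_pred / (p_pred + r)

p_pred = 1.0
print(scalar_gain(p_pred, 0.01))   # quiet measurement: gain close to 1
print(scalar_gain(p_pred, 100.0))  # very noisy measurement: gain close to 0
```

A gain near 1 means the filter trusts the new measurement almost entirely; a gain near 0 means the measurement barely moves the estimate.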
We are now going "to estimate" this filtering by calculating the error that we commit, that is to say, by calculating the covariance matrix of the filtering error. Let us recall that $\hat{X}_{K+1|K+1}$ is the best of the filterings, in the sense that it minimizes the mapping:

$Z \in H_{K+1}^Y \;\longmapsto\; \mathrm{tr}\,\big\| X_{K+1} - Z \big\|^2 = \mathrm{tr}\,E\Big[ \big( X_{K+1} - Z \big)\big( X_{K+1} - Z \big)^T \Big] \in \mathbb{R}$

The minimum is thus:

$\mathrm{tr}\,\big\| X_{K+1} - \hat{X}_{K+1|K+1} \big\|^2 = \mathrm{tr}\,E\big( \tilde{X}_{K+1|K+1}\,\tilde{X}_{K+1|K+1}^T \big)$

NOTATION.– In what follows, the matrix $E\big( \tilde{X}_{K+1|K+1}\,\tilde{X}_{K+1|K+1}^T \big)$ is denoted $P_{K+1|K+1}$ and is called the covariance matrix of the filtering error.

We now give a simple relationship linking the matrices $P_{K+1|K+1}$ and $P_{K+1|K}$. We observe, by using the filtering equation first and the observation equation next:
$\tilde{X}_{K+1|K+1} = X_{K+1} - \hat{X}_{K+1|K+1}$
$= X_{K+1} - \hat{X}_{K+1|K} - K(K+1)\big( Y_{K+1} - H(K+1)\,\hat{X}_{K+1|K} \big)$
$= X_{K+1} - \hat{X}_{K+1|K} - K(K+1)\big( H(K+1)\,X_{K+1} + G(K+1)\,W_{K+1} - H(K+1)\,\hat{X}_{K+1|K} \big)$
$= \big( I_d - K(K+1)\,H(K+1) \big)\,\tilde{X}_{K+1|K} - K(K+1)\,G(K+1)\,W_{K+1}$
where $I_d$ is the identity matrix. By bringing this expression of $\tilde{X}_{K+1|K+1}$ into $P_{K+1|K+1}$ and by using the fact that $\tilde{X}_{K+1|K} \perp W_{K+1}$, we have:

$P_{K+1|K+1} = \big( I_d - K(K+1)\,H(K+1) \big)\,P_{K+1|K}\,\big( I_d - K(K+1)\,H(K+1) \big)^T + K(K+1)\,G(K+1)\,R_{K+1}\,G^T(K+1)\,K^T(K+1)$

an expression which, since:

$\mathrm{Cov}\,I_{K+1} = G(K+1)\,R_{K+1}\,G^T(K+1) + H(K+1)\,P_{K+1|K}\,H^T(K+1)$

can be written:

$P_{K+1|K+1} = \Big( K(K+1) - P_{K+1|K}\,H^T(K+1)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} \Big)\,\big( \mathrm{Cov}\,I_{K+1} \big)\,\Big( K(K+1) - P_{K+1|K}\,H^T(K+1)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} \Big)^T$
$+ \Big( I_d - P_{K+1|K}\,H^T(K+1)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} H(K+1) \Big)\,P_{K+1|K}$

However, we have seen that:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1}$

So the first term of the second member of the expression is zero and the sought relationship is finally:

$P_{K+1|K+1} = \big( I_d - K(K+1)\,H(K+1) \big)\,P_{K+1|K}$

This "updating" of the covariance matrix by iteration is another equation of the Kalman filter.
Another approach to calculate this minimum [RAD 84]. We notice that the penultimate expression of $P_{K+1|K+1}$ can be put in the form:

$P_{K+1|K+1} = \Big( K(K+1) - P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1) \Big)\,J(K+1)\,\Big( K(K+1) - P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1) \Big)^T$
$+ \Big( I_d - P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1)\,H(K+1) \Big)\,P_{K+1|K}$

with:

$J(K+1) = H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1) = \mathrm{Cov}\,I_{K+1}$

Only the first term of $P_{K+1|K+1}$ depends on $K(K+1)$, and it is of the form $M J M^T$, symmetric, with $J$ positive. So this term has a positive or zero trace and:

$P_{K+1|K+1} = M J M^T + \Big( I_d - P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1)\,H(K+1) \Big)\,P_{K+1|K}$

The minimum of the trace will thus be reached when $M$ is zero, thus:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1)$

where:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\Big( H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1) \Big)^{-1}$

a result which we have already obtained! In these conditions, when:

$P_{K+1|K+1} = \big( I_d - K(K+1)\,H(K+1) \big)\,P_{K+1|K}$
we obtain the minimum of $\mathrm{tr}\,P_{K+1|K+1}$.

It is important to note that $K$, the Kalman gain, and $P_{K|K}$, the covariance matrix of the estimation error, are independent of the quantities $Y_K$. We can also write the best "prediction", i.e. $\hat{X}_{K+1|K}$, according to the preceding prediction. Thus:

$\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K-1} + A(K)\,K(K)\big( Y_K - H(K)\,\hat{X}_{K|K-1} \big)$

As for the "best" filtering, the best prediction is written according to the preceding predicted estimate, corrected by the innovation brought along by the measurement $Y_K$ and weighted by the gain. This Kalman equation is used not in filtering but in prediction. We must now establish a relationship on the evolution of the covariance matrix of the estimation errors.

7.3.6. Riccati's equation
Let us write an evolution relationship between the covariance matrix of the filtering error and the covariance matrix of the prediction error:

$P_{K|K-1} = E\big( \tilde{X}_{K|K-1}\,\tilde{X}_{K|K-1}^T \big)$

or, by incrementation:

$P_{K+1|K} = E\big( \tilde{X}_{K+1|K}\,\tilde{X}_{K+1|K}^T \big)$

with:

$\tilde{X}_{K+1|K} = X_{K+1} - \hat{X}_{K+1|K}$
Furthermore, we know that:

$\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K-1} + A(K)\,K(K)\,I_K$

giving the prediction at instant $K+1$, and:

$X_{K+1} = A(K)\,X_K + C(K)\,N_K$

just as:

$I_K = Y_K - H(K)\,\hat{X}_{K|K-1}$

The combination of these expressions gives us:

$\tilde{X}_{K+1|K} = A(K)\big( X_K - \hat{X}_{K|K-1} \big) - A(K)\,K(K)\big( Y_K - H(K)\,\hat{X}_{K|K-1} \big) + C(K)\,N_K$

but $Y_K = H(K)\,X_K + G(K)\,W_K$, thus:

$\tilde{X}_{K+1|K} = A(K)\big( X_K - \hat{X}_{K|K-1} \big) - A(K)\,K(K)\,H(K)\big( X_K - \hat{X}_{K|K-1} \big) - A(K)\,K(K)\,G(K)\,W_K + C(K)\,N_K$

$\tilde{X}_{K+1|K} = \big( A(K) - A(K)\,K(K)\,H(K) \big)\,\tilde{X}_{K|K-1} - A(K)\,K(K)\,G(K)\,W_K + C(K)\,N_K$

We can now write $P_{K+1|K}$ by observing that:

$\tilde{X}_{K|K-1} \perp N_K$ and $\tilde{X}_{K|K-1} \perp W_K$

NOTE.– Please note that $\tilde{X}_{K+1|K}$ is not orthogonal to $W_K$.
Thus:

$P_{K+1|K} = \big( A(K) - A(K)\,K(K)\,H(K) \big)\,P_{K|K-1}\,\big( A(K) - A(K)\,K(K)\,H(K) \big)^T + C(K)\,Q_K\,C^T(K) + A(K)\,K(K)\,G(K)\,R_K\,G^T(K)\,K^T(K)\,A^T(K)$

This expression of the covariance matrix of the prediction error can be put in the form:

$P_{K+1|K} = A(K)\,P_{K|K}\,A^T(K) + C(K)\,Q_K\,C^T(K)$

This equality, independent of $Y_K$, is called Riccati's equation, with:

$P_{K|K} = \big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}$

which represents the covariance matrix of the filtering error, equally independent of $Y_K$. See Appendix A for details of the calculation.

7.3.7. Algorithm and summary
The algorithm presents itself in the following form, with the initial conditions:
$P_{0|0}$ and $\hat{X}_{0|0}$ given, as well as the matrices $A(K)$, $Q_K$, $H(K)$, $R_K$, $C(K)$ and $G(K)$.

1) Calculation phase independent of $Y_K$. Effectively, starting from the initial conditions, we perceive that the recursion which acts on the gain $K(K+1)$ and on the covariance matrices of the prediction and filtering errors, $P_{K+1|K}$ and $P_{K+1|K+1}$, does not require knowledge of the observation process. Thus, the calculation of these matrices can be done without knowledge of the measurements. As for the measurements, they come into play in the calculation of the innovation and in that of the filtering or of the prediction.
$P_{K+1|K} = A(K)\,P_{K|K}\,A^T(K) + C(K)\,Q_K\,C^T(K)$

$K(K+1) = P_{K+1|K}\,H^T(K+1)\Big( H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1) \Big)^{-1}$

$P_{K+1|K+1} = \big( I_d - K(K+1)\,H(K+1) \big)\,P_{K+1|K}$

$\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K}$

or $K(K+1) = P_{K+1|K+1}\,H^T(K+1)\Big( G(K+1)\,R_{K+1}\,G^T(K+1) \Big)^{-1}$ if $G(K+1)\,R_{K+1}\,G^T(K+1)$ is invertible.

2) Calculation phase taking into account the results $y_K$ of the process $Y_K$:
$I_{K+1} = Y_{K+1} - H(K+1)\,\hat{X}_{K+1|K}$

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + K(K+1)\,I_{K+1}$

It is by using a new measurement that the calculated innovation, weighted by the gain at the same instant, allows us to know the best filtering.
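The two phases can be sketched in the scalar case ($A = H = C = G = 1$), here used to estimate a constant drowned in unit-variance noise, the same situation as the first Matlab example at the end of the chapter (function and variable names are illustrative):

```python
import random

def kalman_scalar(measurements, q=0.0, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter with A = H = C = G = 1, following the two
    calculation phases above."""
    x, p = x0, p0
    for y in measurements:
        # phase 1: independent of the measurements
        p_pred = p + q                 # P(K+1|K) = A P(K|K) A' + C Q C'
        k = p_pred / (p_pred + r)      # gain K(K+1)
        p = (1.0 - k) * p_pred         # P(K+1|K+1)
        x_pred = x                     # Xest(K+1|K) = A Xest(K|K)
        # phase 2: innovation, then best filtering
        innov = y - x_pred             # I(K+1)
        x = x_pred + k * innov         # Xest(K+1|K+1)
    return x, p

random.seed(0)
constant = 3.7
ys = [constant + random.gauss(0.0, 1.0) for _ in range(500)]
x_hat, p = kalman_scalar(ys)
print(x_hat, p)  # x_hat close to 3.7; with q = 0, p = 1/501 after 500 steps
```

With $q = 0$ the error variance obeys $p \mapsto p/(p+1)$, so after $n$ measurements $p = 1/(n+1)$: the filter here simply computes a running average.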
[Block diagram of the filter: the measurement $Y_{K+1}$ enters a summing node together with $-H(K+1)\,\hat{X}_{K+1|K}$; the resulting innovation is weighted by the gain $K(K+1)$ and added to the prediction $\hat{X}_{K+1|K}$ to produce $\hat{X}_{K+1|K+1}$; the filtered estimate $\hat{X}_{K|K}$ is fed back through $A(K)$ to form the next prediction.]

Figure 7.2. Schema of the principle of the Kalman filter
Important additional information may be obtained in [HAY 91].

NOTE.– If we had conceived a Kalman predictor, we would have obtained the expression of the prediction seen at the end of section 7.3.5:

$\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K-1} + A(K)\,K(K)\underbrace{\big( Y_K - H(K)\,\hat{X}_{K|K-1} \big)}_{I_K}$

NOTE.– When the state and measurement equations are no longer linear, a similar solution exists and can be found in other works. The filter then takes the name of the extended Kalman filter.

7.4. Exercises for Chapter 7

Exercise 7.1.
Given the state equation:

$X_{K+1} = A\,X_K + N_K$

where the state matrix $A$ is the identity matrix of dimension 2 and $N_K$ is the system noise, whose covariance matrix is written $Q = \sigma^2 I_d$ ($I_d$: identity matrix).

The system is observed by the scalar equation:

$Y_K = X_K^1 + X_K^2 + W_K$

where $X_K^1$ and $X_K^2$ are the components of the vector $X_K$ and where $W_K$ is the measurement noise, of variance $R = \sigma_1^2$.

$P_{0|0} = I_d$ and $\hat{X}_{0|0} = 0$ are the initial conditions.

1) Give the expression of the Kalman gain $K(1)$ at instant 1 according to $\sigma^2$ and $\sigma_1^2$.

2) Give the estimate $\hat{X}_{1|1}$ of $X_1$ at instant 1 according to $K(1)$ and the first measurement $Y_1$.
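Before looking at the printed solution, the first Kalman step for this system can be worked out numerically; in the sketch below the values $\sigma^2 = 0.5$ and $\sigma_1^2 = 0.25$ are arbitrary test values:

```python
def first_gain(sigma2, sigma12):
    """K(1) for this system: P(1|0) = P(0|0) + Q = (1 + sigma2) Id,
    H = (1 1), so Cov I1 = 2*(1 + sigma2) + sigma12 and
    K(1) = P(1|0) H' (Cov I1)^(-1)."""
    p = 1.0 + sigma2
    cov_i = 2.0 * p + sigma12
    return (p / cov_i, p / cov_i)

g = first_gain(0.5, 0.25)
print(g)  # (1 + sigma2)/(2 + 2*sigma2 + sigma12) times (1, 1)'
```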
Solution 7.1.

1) $K(1) = \dfrac{1+\sigma^2}{2+2\sigma^2+\sigma_1^2}\begin{pmatrix} 1 \\ 1 \end{pmatrix}$

2) $\hat{X}_{1|1} = K(1)\,Y_1$

Exercise 7.2.
We are considering the movement of a particle. $x_1(t)$ represents the position of the particle and $x_2(t)$ its speed:

$x_1(t) = \int_0^t x_2(\tau)\,d\tau + x_1(0)$

By deriving this expression and by noting that, with a unit time step:

$x_2(t) = \dfrac{dx_1(t)}{dt} \approx x_1(K+1) - x_1(K)$

we assume that the speed can be represented by:

$X_K^2 = X_{K-1}^2 + N_{K-1}$

where $N_K$ is a Gaussian stationary noise which is centered and of variance 1. The position is measured by $y_K$, result of the process $Y_K$. This measurement adds a Gaussian stationary noise, which is centered and of variance 1:

$Y(K) = H(K)\,X(K) + W_K$

We assume that $R_K$, the covariance matrix (of dimension 1) of the measurement noise, is equal to 1.
1) Give the matrices $A$, $Q$ (covariance matrix of the system noise) and $H$.

2) Taking as initial conditions $\hat{X}_0 = \hat{X}_{0|0} = 0$ and $P_{0|0} = I_d$ (identity matrix), give $\hat{x}_{1|1}$, the first estimation of the state vector.

Solution 7.2.

1) $A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$; $Q = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$; $H = (1 \quad 0)$

2) $\hat{x}_{1|1} = \begin{pmatrix} 2/3 \\ 1/3 \end{pmatrix} y_1$
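This result can be checked step by step with the 2x2 arithmetic written out by hand (a verification sketch in plain Python):

```python
# P(1|0) = A P(0|0) A' + Q with A = [[1,1],[0,1]], P(0|0) = Id, Q = [[0,0],[0,1]]
A = [[1.0, 1.0], [0.0, 1.0]]
AAt = [[A[0][0]**2 + A[0][1]**2, A[0][0]*A[1][0] + A[0][1]*A[1][1]],
       [A[1][0]*A[0][0] + A[1][1]*A[0][1], A[1][0]**2 + A[1][1]**2]]
Pp = [[AAt[0][0], AAt[0][1]], [AAt[1][0], AAt[1][1] + 1.0]]  # add Q

# H = (1 0), R = 1: Cov I1 = H Pp H' + R = Pp[0][0] + 1
cov_i = Pp[0][0] + 1.0
K = (Pp[0][0] / cov_i, Pp[1][0] / cov_i)  # gain Pp H' (Cov I1)^(-1)

print(K)  # (2/3, 1/3): x_hat(1|1) = K y1
```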
Exercise 7.3. [RAD 84]
We want to estimate two target positions using one measurement. These positions $X_K^1$ and $X_K^2$ form the state vector:

$X_K = \big( X_K^1 \quad X_K^2 \big)^T$

The process noise is zero. The measurement of the process $Y$, carried out on the sum of the positions, is affected by a noise $W$ of mean value zero and of variance $R$:

$Y_K = X_K^1 + X_K^2 + W_K$

In order to simplify the calculation, we will place ourselves in the case of an immobile target:

$X_{K+1} = X_K = X$

The initial conditions are:

– $P_{0|0} = \mathrm{Cov}(X, X) = I_d$ (identity matrix)
– $R = 0.1$

– $y = 2.9$ (measurement) and $\hat{X}_{0|0} = (0 \quad 0)^T$

1) Give the state matrix $A$ and the observation matrix $H$.

2) Give the Kalman gain $K$.

3) Give the covariance matrix of the estimation error.

4) Give the estimation, in the sense of the minimum in $L^2$, of the state vector $X_K$.

5) If $x = x_K = (1 \quad 2)^T$, give the estimation error $\tilde{x} = \tilde{x}_{K|K} = x_K - \hat{x}_{K|K}$.

6) Compare the variances of the estimation errors of $X_K^1$ and $X_K^2$ and conclude.

Solution 7.3.
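The numerical answers below can be reproduced in a few lines (a sketch of the single Kalman step, with the data $R = 0.1$, $y = 2.9$ of the statement):

```python
R, y = 0.1, 2.9
cov_i = 1.0 + 1.0 + R              # H P(1|0) H' + R with H = (1 1), P(1|0) = Id
K = (1.0 / cov_i, 1.0 / cov_i)     # gain (1/2.1, 1/2.1)'
x_hat = (K[0] * y, K[1] * y)       # estimate (2.9/2.1, 2.9/2.1)'
x = (1.0, 2.0)
err = (x[0] - x_hat[0], x[1] - x_hat[1])
var_err = 1.0 - K[0]               # diagonal of P(1|1) = (Id - K H) Id

print(x_hat, err, var_err)
```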
1) $A = I_d$; $H = (1 \quad 1)$

2) $K = \big( 1/2.1 \quad 1/2.1 \big)^T$

3) $P_{1|1} = \dfrac{1}{2.1}\begin{pmatrix} 1.1 & -1 \\ -1 & 1.1 \end{pmatrix}$

4) $\hat{x}_{1|1} = \big( 2.9/2.1 \quad 2.9/2.1 \big)^T$

5) $\tilde{x}_K = \big( \tilde{x}_K^1 \quad \tilde{x}_K^2 \big)^T = \big( -0.38 \quad 0.62 \big)^T$

6) $\mathrm{var}\,\tilde{X}_K^1 = \mathrm{var}\,\tilde{X}_K^2 = 0.52$

Exercise 7.4.
Given the state equation of dimension “1” (the state process is a scalar process):
$X_{K+1} = X_K$
The state is observed by two measurements:

$Y_K = \begin{pmatrix} Y_K^1 \\ Y_K^2 \end{pmatrix}$, affected by the noise $W_K = \begin{pmatrix} W_K^1 \\ W_K^2 \end{pmatrix}$

The measurement noise is characterized by its covariance matrix:

$R_K = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}$

The initial conditions are:

$P_{0|0} = 1$ (covariance of the estimation error at instant 0) and $\hat{X}_{0|0} = 0$ (estimate of $X$ at instant 0).

Let us state $D = \sigma_1^2 + \sigma_2^2 + \sigma_1^2\,\sigma_2^2$.

1) Give the expression of $K(1)$, the Kalman gain at instant 1, according to $\sigma_1$, $\sigma_2$ and $D$.

2) Give the estimate $\hat{X}_{1|1}$ of $X_1$ at instant 1 according to the measurements $Y_1^1$, $Y_1^2$ and $\sigma_1$, $\sigma_2$ and $D$.

3) By stating $\sigma^2 = \dfrac{\sigma_1^2\,\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$, give $P_{1|1}$, the covariance of the filtering error at instant 1, according to $\sigma$.
Solution 7.4.

1) $K(1) = \begin{pmatrix} \dfrac{\sigma_2^2}{D} & \dfrac{\sigma_1^2}{D} \end{pmatrix}$

2) $\hat{X}_{1|1} = \big( \sigma_2^2\,Y_1^1 + \sigma_1^2\,Y_1^2 \big)/D$

3) $P_{1|1} = \dfrac{\sigma^2}{1+\sigma^2}$
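A quick check: starting from $P_{1|0} = 1$, the innovation covariance is the 2x2 matrix $\begin{pmatrix} 1+\sigma_1^2 & 1 \\ 1 & 1+\sigma_2^2 \end{pmatrix}$, whose determinant is exactly $D$; inverting it by hand reproduces the printed answers (the variances 0.5 and 2.0 are arbitrary test values):

```python
s1, s2 = 0.5, 2.0                       # test values for sigma1^2, sigma2^2
D = s1 + s2 + s1 * s2

# Cov I1 = H H' + R = [[1+s1, 1], [1, 1+s2]]; det = (1+s1)(1+s2) - 1 = D
det = (1.0 + s1) * (1.0 + s2) - 1.0
# K(1) = (1 1) (Cov I1)^(-1), worked out with the 2x2 inverse
K = (((1.0 + s2) - 1.0) / det, ((1.0 + s1) - 1.0) / det)

sigma2 = s1 * s2 / (s1 + s2)
P11 = 1.0 - (K[0] + K[1])               # (1 - K H) P(1|0)

print(K, P11)  # K = (s2/D, s1/D); P11 = sigma^2/(1 + sigma^2)
```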
Exercise 7.5.
The fixed distance $r$ of an object is evaluated by two radar measurements of different qualities. The first measurement gives the result:

$y_1 = r + n_1$, measurement of the process $Y = X + N_1$

where we know that the noise $N_1$ is such that $E(N_1) = 0$ and $\mathrm{var}(N_1) = \sigma_1^2 = 10^{-2}$. The second measurement gives:

$y_2 = r + n_2$, measurement of the process $Y = X + N_2$

with $E(N_2) = 0$ and $\mathrm{var}(N_2) = w$ (scalar). The noises $N_1$ and $N_2$ are independent.

1) Give the estimate $\hat{r}_1$ of $r$ that we obtain from the first measurement.

2) Refine this estimate by using the second measurement. We will call $\hat{r}_2$ this new estimate, which we will express according to $w$.

3) Draw the graph $\hat{r}_2(w)$ and justify its appearance.
Solution 7.5.

1) $\hat{r}_1 = \hat{x}_{1|1} = y_1$

2) $\hat{r}_2 = \hat{x}_{2|2} = y_1 + \dfrac{\sigma_1^2}{\sigma_1^2 + w}\,( y_2 - y_1 ) = \dfrac{100\,w\,y_1 + y_2}{100\,w + 1}$
3) See Figure 7.3.

Figure 7.3. Line graph of the evolution of the estimate according to the power of the noise $w$, parameterized by the magnitude of the measurements
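The limiting behavior that the graph illustrates can be checked numerically: as $w \to 0$ the fused estimate trusts the second (noiseless) measurement, and as $w \to \infty$ it falls back on the first. A sketch with illustrative measurement values:

```python
def r_hat2(y1, y2, w, s1=1e-2):
    """Fused estimate: r2 = y1 + s1/(s1 + w) * (y2 - y1)."""
    return y1 + s1 / (s1 + w) * (y2 - y1)

y1, y2 = 2.8, 3.0   # illustrative measurements
print(r_hat2(y1, y2, 1e-9))  # w tiny: close to y2
print(r_hat2(y1, y2, 1e6))   # w huge: close to y1
```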
Appendix A. Resolution of Riccati's equation

Let us show that:

$P_{K+1|K} = A(K)\,P_{K|K}\,A^T(K) + C(K)\,Q_K\,C^T(K)$

Let us take again the developed expression of the covariance matrix of the prediction error of section 7.3.6:

$P_{K+1|K} = A(K)\big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}\,\big( A(K) - A(K)\,K(K)\,H(K) \big)^T + C(K)\,Q_K\,C^T(K) + A(K)\,K(K)\,G(K)\,R_K\,G^T(K)\,K^T(K)\,A^T(K)$
with:

$K(K) = P_{K|K-1}\,H^T(K)\big( \mathrm{Cov}\,I_K \big)^{-1}$

and:

$\mathrm{Cov}\,I_K = H(K)\,P_{K|K-1}\,H^T(K) + G(K)\,R_K\,G^T(K)$

By replacing $K(K)$ and $\mathrm{Cov}\,I_K$ by their expressions in the recursive writing of $P_{K+1|K}$, we are going to be able to simplify the expression of the covariance matrix of the prediction error. To lighten the expressions, we are going to eliminate the index $K$ when there is no ambiguity, by noting $P_1 = P_{K+1|K}$, $P_0 = P_{K|K-1}$ and $I = I_K$:

$P_1 = A\big( I_d - KH \big) P_0 \big( A - AKH \big)^T + C\,Q\,C^T + A\,K\,G\,R\,G^T K^T A^T$

$K = P_0\,H^T \big( \mathrm{Cov}\,I \big)^{-1}$

$\mathrm{Cov}\,I = H\,P_0\,H^T + G\,R\,G^T$

Thus:

$G\,R\,G^T = \mathrm{Cov}\,I - H\,P_0\,H^T$

$K\,G\,R\,G^T K^T = P_0 H^T \big( \mathrm{Cov}\,I \big)^{-1}\big( \mathrm{Cov}\,I - H P_0 H^T \big)\big( \mathrm{Cov}\,I \big)^{-T} H P_0^T$
$= \big( P_0 H^T - P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 H^T \big)\big( \mathrm{Cov}\,I \big)^{-T} H P_0^T$
$= P_0 H^T \big( \mathrm{Cov}\,I \big)^{-T} H P_0^T - P_0 H^T \big( \mathrm{Cov}\,I \big)^{-1} H P_0 H^T \big( \mathrm{Cov}\,I \big)^{-T} H P_0^T$

Expanding $P_1$:

$P_1 = A P_0 A^T - A K H P_0 A^T - A P_0 H^T K^T A^T + A K H P_0 H^T K^T A^T + C Q C^T + A\Big( P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T - P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T \Big) A^T$
i.e., replacing $K$ by its expression:

$P_1 = A P_0 A^T - \underbrace{A P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 A^T}_{A K H P_0 A^T} - A P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T A^T + A P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T A^T + C Q C^T$
$+ A\Big( P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T - P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T \Big) A^T$

The 3rd and 6th terms cancel each other out, and the 4th and 7th terms also cancel each other out, which leaves:

$P_1 = A P_0 A^T - A K H P_0 A^T + C Q C^T$

or:

$P_1 = A\big( I_d - K H \big) P_0\,A^T + C Q C^T$

that is:

$P_{K+1|K} = A(K)\underbrace{\big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}}_{P_{K|K}}\,A^T(K) + C(K)\,Q_K\,C^T(K)$

Thus:

$P_{K+1|K} = A(K)\,P_{K|K}\,A^T(K) + C(K)\,Q_K\,C^T(K)$ = covariance matrix of the prediction error

with:
$P_{K|K} = \big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}$ = covariance matrix of the filtering error.

This result will be demonstrated in Appendix B.

NOTE.– As mentioned in section 7.3.7, knowing the initial conditions and the Kalman gain, the updating of the covariance matrices $P_{K|K-1}$ and $P_{K|K}$ can be made in an iterative manner.
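The iteration described in this note can be sketched in the scalar case ($H = C = G = 1$): no measurement enters the loop, and $P_{K+1|K}$ converges to the fixed point of Riccati's equation (the values of $a$, $q$ and $r$ below are arbitrary):

```python
def riccati_iterate(a, q, r, p0, n):
    """Scalar Riccati recursion: P(K+1|K) = a^2 (1 - K(K)) P(K|K-1) + q,
    with gain K(K) = P(K|K-1)/(P(K|K-1) + r); independent of the measurements."""
    p = p0
    for _ in range(n):
        k = p / (p + r)
        p = a * a * (1.0 - k) * p + q
    return p

p_inf = riccati_iterate(a=0.9, q=0.1, r=1.0, p0=1.0, n=200)
print(p_inf)  # satisfies p = a^2 * p * r/(p + r) + q at the fixed point
```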
Appendix B
We are going to arrive at this result starting from the definition of $P_{K|K}$ and by using the expression of the function $K$ already obtained.

NOTE.– Differently from the calculation developed in section 7.3.6, we will not show that the $\mathrm{tr}\,P_{K|K}$ obtained is minimal.

Another way of showing the following result:

$P_{K|K} = E\big( \tilde{X}_{K|K}\,\tilde{X}_{K|K}^T \big) = P_{K|K-1} - K(K)\,H(K)\,P_{K|K-1} = \big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}$

Demonstration
Starting from the definition of the covariance matrix of the filtering error, i.e.:

$P_{K|K} = E\big( \tilde{X}_{K|K}\,\tilde{X}_{K|K}^T \big)$

it becomes, with $\tilde{X}_{K|K} = X_K - \hat{X}_{K|K}$ and $\hat{X}_{K|K} = \hat{X}_{K|K-1} + K(K)\,I_K$:

$\tilde{X}_{K|K} = \underbrace{X_K - \hat{X}_{K|K-1}}_{\tilde{X}_{K|K-1}} - K(K)\,I_K$

Let us now use these results to calculate $P_{K|K}$:

$P_{K|K} = P_{K|K-1} - K(K)\,E\big( I_K\,\tilde{X}_{K|K-1}^T \big) - E\big( \tilde{X}_{K|K-1}\,I_K^T \big)\,K^T(K) + K(K)\,E\big( I_K I_K^T \big)\,K^T(K)$
We observe that:

$E\big( \tilde{X}_{K|K-1}\,I_K^T \big) = E\big( ( X_K - \hat{X}_{K|K-1} )\,I_K^T \big)$

but $I_K \perp I_j$ and $I_K \perp Y_j$ for $j \in [1, K-1]$, thus $\hat{X}_{K|K-1} \perp I_K$. Given:

$E\big( \tilde{X}_{K|K-1}\,I_K^T \big) = E\big( X_K\,I_K^T \big) = E\big( A^{-1}(K)\,( X_{K+1} - C(K)\,N_K )\,I_K^T \big)$

thus:

$E\big( X_K\,I_K^T \big) = E\big( A^{-1}(K)\,X_{K+1}\,I_K^T \big)$

for $E(N_K) = 0$. However, we have seen elsewhere that:

$E\big( X_{K+1}\,I_K^T \big) = E\Big( \big( A(K)\,X_K + C(K)\,N_K \big)\big( H(K)\,\tilde{X}_{K|K-1} + G(K)\,W_K \big)^T \Big) = E\big( A(K)\,X_K\,\tilde{X}_{K|K-1}^T\,H^T(K) \big)$

as $N_K \perp W_K$ and $N_K \perp \tilde{X}_{K|K-1} = X_K - \hat{X}_{K|K-1}$.

Furthermore, for $\hat{X}_{K|K-1} \perp \tilde{X}_{K|K-1}$:

$E\big( X_K\,\tilde{X}_{K|K-1}^T \big) = E\Big( \big( \hat{X}_{K|K-1} + \tilde{X}_{K|K-1} \big)\,\tilde{X}_{K|K-1}^T \Big) = P_{K|K-1}$
Thus it becomes:

$E\big( \tilde{X}_{K|K-1}\,I_K^T \big) = P_{K|K-1}\,H^T(K)$

thus:

$P_{K|K} = P_{K|K-1} - K(K)\,H(K)\,P_{K|K-1}^T - P_{K|K-1}\,H^T(K)\,K^T(K) + K(K)\big( \mathrm{Cov}\,I_K \big)\,K^T(K)$

with $K(K) = P_{K|K-1}\,H^T(K)\big( \mathrm{Cov}\,I_K \big)^{-1}$; after simplification, and noting that $P_{K|K-1} = P_{K|K-1}^T$ (symmetric, or Hermitian matrix if the elements are complex):

$P_{K|K} = P_{K|K-1} - K(K)\,H(K)\,P_{K|K-1}$

or:

$P_{K|K} = \big[ I_d - K(K)\,H(K) \big]\,P_{K|K-1}$

QED

Examples treated using Matlab software

First example of Kalman filtering
The objective is to estimate an unknown constant drowned in noise. This constant is measured using a noisy sensor. The noise is centered, Gaussian and of variance equal to 1. The initial conditions are equal to 0 for the estimate and equal to 1 for the variance of the estimation error.
clear
t=0:500;
R0=1;
constant=rand(1);
n1=randn(size(t));
y=constant+n1;
subplot(2,2,1)
%plot(t,y(1,:));
plot(t,y,'k'); % in B&W
grid
title('sensor')
xlabel('time')
axis([0 500 -max(y(1,:)) max(y(1,:))])
R=R0*std(n1)^2; % variance of the measurement noise
P(1)=1; % initial condition on the variance of the estimation error
x(1)=0;
for i=2:length(t)
  K=P(i-1)*inv(P(i-1)+R);
  x(i)=x(i-1)+K*(y(:,i)-x(i-1));
  P(i)=P(i-1)-K*P(i-1);
end
err=constant-x;
subplot(2,2,2)
plot(t,err,'k');
grid
title('error');
xlabel('time')
axis([0 500 -max(err) max(err)])
subplot(2,2,3)
plot(t,x,'k',t,constant*ones(size(t)),'k'); % constant extended to a vector for plotting
title('x estimated')
xlabel('time')
axis([0 500 0 max(x)])
grid
subplot(2,2,4)
plot(t,P,'k'); % in B&W
grid, axis([0 100 0 max(P)])
title('variance of estimation error')
xlabel('time')
Figure 7.3. Line graph of measurement, error, best filtering and variance of error
Second example of Kalman filtering

The objective of this example is to extract a damped sine curve from the noise. The state vector is a two-component column vector:

X1=10*exp(-a*t).*cos(w*t)
X2=10*exp(-a*t).*sin(w*t)

The system noise is centered, Gaussian and of variances var(u1) and var(u2).
The measurement noise is centered, Gaussian and of variances var(v1) and var(v2).

Initial conditions: the components of the state vector are zero at the origin and the covariance of the estimation error is initialized at 10 times the identity matrix.

Note: the proposed program is not the shortest and most rapid in the sense of CPU time; it is detailed to allow a better understanding.

clear
%simulation
a=0.05;
w=1/2*pi;
Te=0.005;
Tf=30;
Ak=exp(-a*Te)*[cos(w*Te) -sin(w*Te);sin(w*Te) cos(w*Te)]; %state matrix
Hk=eye(2); % observation matrix
t=0:Te:Tf;
%X1
X1=10*exp(-a*t).*cos(w*t);
%X2
X2=10*exp(-a*t).*sin(w*t);
Xk=[X1;X2]; % state vector
% measurement noise
sigmav1=100;
sigmav2=10;
v1=sigmav1*randn(size(t));
v2=sigmav2*randn(size(t));
Vk=[v1;v2];
Yk=Hk*Xk+Vk; % measurement vector
% covariance matrix of the measurement noise
Rk=[var(v1) 0;0 var(v2)];
%initialization
sigmau1=0.1; % process noise
sigmau2=0.1; %idem
u1=sigmau1*randn(size(t));
u2=sigmau2*randn(size(t));
%Uk=[sigmau1*randn(size(X1));sigmau2*randn(size(X2))];
Uk=[u1;u2];
Xk=Xk+Uk;
sigq=.01;
Q=sigq*[var(u1) 0;0 var(u2)];
sigp=10;
P=sigp*eye(2); % covariance matrix of estimation error P(0,0)
% line graph
subplot(2,3,1)
%plot(t,X1,t,X2);
plot(t,X1,'k',t,X2,'k') % in B&W
axis([0 Tf -max(abs(Xk(1,:))) max(abs(Xk(1,:)))])
title('state vect. x1&x2')
subplot(2,3,2)
%plot(t,Vk(1,:),t,Vk(2,:),'r')
plot(t,Vk(1,:),t,Vk(2,:)); % in B&W
axis([0 Tf -max(abs(Vk(1,:))) max(abs(Vk(1,:)))])
title('meas. noise w1&w2')
subplot(2,3,3)
%plot(t,Yk(1,:),t,Yk(2,:),'r');
plot(t,Yk(1,:),t,Yk(2,:)); % in B&W
axis([0 Tf -max(abs(Yk(1,:))) max(abs(Yk(1,:)))])
title('observ. proc. y1&y2')
Xf=[0;0];
%%estimation and prediction by Kalman
for k=1:length(t);
  %%prediction
  Xp=Ak*Xf; % Xp=Xest(k+1,k) and Xf=Xest(k,k)
  Pp=Ak*P*Ak'+Q; % Pp=P(k+1,k) and P=P(k)
  Gk=Pp*Hk'*inv(Hk*Pp*Hk'+Rk); % Gk=Gk(k+1)
  Ik=Yk(:,k)-Hk*Xp; % Ik=I(k+1)=innovation
  % best filtering
  Xf=Xp+Gk*Ik; % Xf=Xest(k+1,k+1)
  P=(eye(2)-Gk*Hk)*Pp; % P=P(k+1)
  X(:,k)=Xf;
  P1(:,k)=P(:,1); %1st column of P
  P2(:,k)=P(:,2); %2nd column of P
end
err1=X1-X(1,:);
err2=X2-X(2,:);
%% line graph
subplot(2,3,4)
%plot(t,X(1,:),t,X(2,:),'r')
plot(t,X(1,:),'k',t,X(2,:),'k') % in B&W
axis([0*Tf Tf -max(abs(X(1,:))) max(abs(X(1,:)))])
title('filtered x1&x2')
subplot(2,3,5)
%plot(t,err1,t,err2)
plot(t,err1,'k',t,err2,'k') % in B&W
axis([0 Tf -max(abs(err1)) max(abs(err1))])
title('errors')
subplot(2,3,6)
%plot(t,P1(1,:),'r',t,P2(2,:),'b',t,P1(2,:),'g',t,P2(1,:),'y')
plot(t,P1(1,:),'k',t,P2(2,:),'k',t,P1(2,:),t,P2(1,:),'b')
axis([0 Tf/10 0 max(P1(1,:))])
title('covar. matrix filter. error') % p11, p22, p21 and p12
Figure 7.4. Line graphs of noiseless signals, noise measurements, filtration, errors and variances
Table of Symbols and Notations

$\mathbb{N}, \mathbb{R}, \mathbb{C}$: numerical sets
$L^2$: space of square-summable functions
a.s.: almost surely
$E$: mathematical expectation
r.v.: random variable
r.r.v.: real random variable
$X_n \xrightarrow{a.s.} X$: convergence a.s. of the sequence $X_n$ to $X$
$\langle \cdot, \cdot \rangle_{L^2}$: scalar product in $L^2$
$\| \cdot \|_{L^2}$: norm in $L^2$
Var: variance
Cov: covariance
$\cdot \wedge \cdot$: $\min(\cdot, \cdot)$
$X \sim N(m, \sigma^2)$: normal law of mean $m$ and of variance $\sigma^2$
$A^T$: transposed matrix
$H_K^Y$: Hilbert space generated by the scalar or multivariate process $Y$
$\mathrm{Proj}_{H_K^Y}$: projection on the Hilbert space generated by $Y$ ($t \le K$)
$X_T$: stochastic process defined on $T$ (time describes $T$)
p.o.i.: process with orthogonal increments
p.o.s.i.: process with orthogonal and stationary increments
$\hat{X}_{K|K-1}$: prediction at instant $K$ knowing the measurements of the process $Y_K$ at instants 1 to $K-1$
$\tilde{X}_{K|K-1}$: prediction error
$\hat{X}_{K|K}$: filtering at instant $K$ knowing the measurements at instants 1 to $K$
$\tilde{X}_{K|K}$: filtering error
$\nabla_\lambda C$: gradient of the function $C(\lambda)$
$\{X \mid P\}$: the set of elements $X$ which verify the property $P$
$1_D$: indicator function of a set $D$
Bibliography

[BER 98] BERTEIN J.-C. and CESCHI R., Processus stochastiques et filtrage de Kalman, Hermès, 1998.
[BLA 06] BLANCHET G. and CHARBIT M., Digital Signal and Image Processing using MATLAB, ISTE, 2006.
[CHU 87] CHUI C.K. and CHEN G., Kalman Filtering, Springer-Verlag, 1987.
[GIM 82] GIMONET B., LABARRERE M. and KRIEF J.P., Le filtrage et ses applications, Cépaduès éditions, 1982.
[HAY 91] HAYKIN S., Adaptive Filter Theory, Prentice Hall, 1991.
[MAC 81] MACCHI O., "Le filtrage adaptatif en télécommunications", Annales des Télécommunications, 36, no. 11-12, 1981.
[MAC 95] MACCHI O., Adaptive Processing: The LMS Approach with Applications in Transmissions, John Wiley, New York, 1995.
[MET 72] METIVIER M., Notions fondamentales de la théorie des probabilités, Dunod, 1972.
[MOK 00] MOKHTARI M., MATLAB et Simulink pour étudiants et ingénieurs, Springer, 2000.
[RAD 84] RADIX J.-C., Filtrages et lissages statistiques optimaux linéaires, Cépaduès éditions, 1984.
[SHA 88] SHANMUGAN K.S. and BREIPOHL A.M., Random Signals, John Wiley & Sons, 1988.
[THE 92] THERRIEN C.W., Discrete Random Signals and Statistical Signal Processing, Prentice Hall, 1992.
[WID 85] WIDROW B. and STEARNS S.D., Adaptive Signal Processing, Prentice Hall, 1985.
Index

A, B
adaptive filtering 197
algebra 3
analytical 187
autocorrelation function 96
autoregressive process 128
Bienaymé-Tchebychev inequality 143
Borel algebra 3

C
cancellation 199
Cauchy sequence 158
characteristic functions 4
coefficients 182
colinear 213
convergence 218
convergent 219
correlation coefficients 41
cost function 204
covariance 40
covariance function 107
covariance matrix 258
covariance matrix of the innovation process 248
covariance matrix of the prediction error 249
cross-correlation 184

D
deconvolution 199
degenerate Gaussian 64
deterministic gradient 225
deterministic matrix 245
diffeomorphism 31
diphaser 209

E
eigenvalues 75, 215
eigenvectors 75
ergodicity 98
ergodicity of expectation 100
ergodicity of the autocorrelation function 100
expectation 67

F, G
filtering 143, 247
Fubini's theorem 162
Gaussian vectors 13
gradient algorithm 211

H, I
Hilbert spaces 145
Hilbert subspace 144
identification 199
IIR filter 186
  causal 187
  minimum phase 187
  orthogonal 192
impulse response 182
independence 13, 246
innovation 240
innovation process 172, 248

K, L
Kalman gain 254
least mean square 184
linear observation space 168
linear space 104
LMS algorithm 222
lowest least mean square error 185

M, N
marginals 9
Markov process 101
matrix of measurements 246
measure 5
measurement noise 238
measurement noise vector 246
minimum phase 187
multivariate 245, 250
multivariate processes 166
multivector 245

O
observations 245
orthogonal matrix 216
orthogonal projection 238

P
Paley-Wiener 187
prediction 143, 199, 247, 258
prediction error 249
predictor 200
pre-whitening 186
principal axes 215
probability distribution function 12
process noise vector 245
projection 183

Q, R
quadratic form 216
random variables 194
random vector 1, 3
random vector with a density function 8
regression plane 151
Riccati's equation 258, 260

S
Schwarz inequality 160
second order stationarity 96
second order stationary processes 199
singular 185
smoothing 143, 247
spectral density 106
stability 218
stable 219
state matrix 245
stationary processes 181
stochastic process 94
system noise 238

T
Toeplitz 211, 216
trace 257
trajectory 94
transfer function 121
transition equation 247
transition matrix 247

U-Z
unitary matrix Q 216
variance 39
white noise 109, 185
Wiener filter 181