Discrete Stochastic Processes and Optimal Filtering
Jean-Claude Bertein Roger Ceschi
First published in France in 2005 by Hermes Science/Lavoisier entitled "Processus stochastiques discrets et filtrages optimaux"
First published in Great Britain and the United States in 2007 by ISTE Ltd

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 6 Fitzroy Square, London W1T 5DX, UK
ISTE USA, 4308 Patrice Road, Newport Beach, CA 92663, USA
www.iste.co.uk

© ISTE Ltd, 2007
© LAVOISIER, 2005

The rights of Jean-Claude Bertein and Roger Ceschi to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data
Bertein, Jean-Claude.
[Processus stochastiques discrets et filtrages optimaux. English]
Discrete stochastic processes and optimal filtering / Jean-Claude Bertein, Roger Ceschi.
p. cm. Includes index.
"First published in France in 2005 by Hermes Science/Lavoisier entitled 'Processus stochastiques discrets et filtrages optimaux'."
ISBN 978-1-905209-74-3
1. Signal processing--Mathematics. 2. Digital filters (Mathematics) 3. Stochastic processes. I. Ceschi, Roger. II. Title.
TK5102.9.B465 2007  621.382'2--dc22  2007009433

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 13: 978-1-905209-74-3

Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire.
To our families

We wish to thank Mme Florence François for having typed the manuscript, and M. Stephen Hazlewood, who carried out the translation of the book.
Table of Contents

Preface  xi
Introduction  xiii

Chapter 1. Random Vectors  1
1.1. Definitions and general properties  1
1.2. Spaces L1(dP) and L2(dP)  20
1.2.1. Definitions  20
1.2.2. Properties  22
1.3. Mathematical expectation and applications  23
1.3.1. Definitions  23
1.3.2. Characteristic functions of a random vector  34
1.4. Second order random variables and vectors  39
1.5. Linear independence of vectors of L2(dP)  47
1.6. Conditional expectation (concerning random vectors with density function)  51
1.7. Exercises for Chapter 1  57

Chapter 2. Gaussian Vectors  63
2.1. Some reminders regarding random Gaussian vectors  63
2.2. Definition and characterization of Gaussian vectors  66
2.3. Results relative to independence  68
2.4. Affine transformation of a Gaussian vector  72
2.5. The existence of Gaussian vectors  74
2.6. Exercises for Chapter 2  85

Chapter 3. Introduction to Discrete Time Processes  93
3.1. Definition  93
3.2. WSS processes and spectral measure  105
3.2.1. Spectral density  106
3.3. Spectral representation of a WSS process  110
3.3.1. Problem  110
3.3.2. Results  111
3.3.2.1. Process with orthogonal increments and associated measurements  111
3.3.2.2. Wiener stochastic integral  113
3.3.2.3. Spectral representation  114
3.4. Introduction to digital filtering  115
3.5. Important example: autoregressive process  128
3.6. Exercises for Chapter 3  134

Chapter 4. Estimation  141
4.1. Position of the problem  141
4.2. Linear estimation  144
4.3. Best estimate – conditional expectation  156
4.4. Example: prediction of an autoregressive process AR (1)  165
4.5. Multivariate processes  166
4.6. Exercises for Chapter 4  175

Chapter 5. The Wiener Filter  181
5.1. Introduction  181
5.1.1. Problem position  182
5.2. Resolution and calculation of the FIR filter  183
5.3. Evaluation of the least error  185
5.4. Resolution and calculation of the IIR filter  186
5.5. Evaluation of least mean square error  190
5.6. Exercises for Chapter 5  191

Chapter 6. Adaptive Filtering: Algorithm of the Gradient and the LMS  197
6.1. Introduction  197
6.2. Position of problem  199
6.3. Data representation  202
6.4. Minimization of the cost function  204
6.4.1. Calculation of the cost function  208
6.5. Gradient algorithm  211
6.6. Geometric interpretation  214
6.7. Stability and convergence  218
6.8. Estimation of gradient and LMS algorithm  222
6.8.1. Convergence of the algorithm of the LMS  225
6.9. Example of the application of the LMS algorithm  225
6.10. Exercises for Chapter 6  234

Chapter 7. The Kalman Filter  237
7.1. Position of problem  237
7.2. Approach to estimation  241
7.2.1. Scalar case  241
7.2.2. Multivariate case  244
7.3. Kalman filtering  245
7.3.1. State equation  245
7.3.2. Observation equation  246
7.3.3. Innovation process  248
7.3.4. Covariance matrix of the innovation process  248
7.3.5. Estimation  250
7.3.6. Riccati's equation  258
7.3.7. Algorithm and summary  260
7.4. Exercises for Chapter 7  262

Table of Symbols and Notations  281
Bibliography  283
Index  285
Preface
Discrete optimal filtering applied to stationary and non-stationary signals allows us to process, in the most efficient manner possible according to chosen criteria, all of the problems that we might meet in situations of extraction of noisy signals. This constitutes the necessary stage in the most diverse domains: the calculation of orbits or the guidance of aircraft in the aerospace or aeronautic domain, the calculation of filters in telecommunications or in command systems, or again in seismic signal processing – the list is not exhaustive. Furthermore, the study and the results obtained from discrete signals lend themselves easily to computer implementation.

In their book, the authors have taken pains to stress educational aspects, preferring this to displays of erudition; all of the preliminary mathematics and probability theory necessary for a sound understanding of optimal filtering is treated in a rigorous fashion. It should not be necessary to turn to other works to acquire a sound knowledge of the subjects studied. Thanks to this work, the reader will be able not only to understand discrete optimal filtering but also to go deeper into the different aspects of this wide field of study.
Introduction
The object of this book is to present the bases of discrete optimal filtering in a progressive and rigorous manner. The optimal character is understood in the sense that we always choose the criterion of minimizing the L^2 norm of the error.

Chapter 1 tackles random vectors, their principal definitions and properties.

Chapter 2 covers the subject of Gaussian vectors. Given the practical importance of this notion, the definitions and results are accompanied by numerous commentaries and explanatory diagrams.

Chapter 3 is by its very nature more "physical" than the preceding ones and can be considered as an introduction to digital filtering. Results that will be essential for what follows are given here.

Chapter 4 provides the prerequisites essential for the construction of optimal filters. The results obtained on projections in Hilbert spaces constitute the cornerstone of future demonstrations.

Chapter 5 covers the Wiener filter, an electronic device well adapted to processing second order stationary signals. Practical calculations of such filters, with finite or infinite impulse responses, are developed.

Adaptive filtering, which is the subject of Chapter 6, can be considered as a relatively direct application of the deterministic or stochastic gradient method. At the end of the process of adaptation or convergence, the Wiener filter is again encountered.
The book is completed with a study of Kalman filtering, which allows stationary or non-stationary signal processing; from this point of view we can say that it generalizes Wiener's optimal filter.

Each chapter ends with a series of exercises with answers, and worked examples are also supplied using Matlab software, which is well adapted to signal processing problems.
Chapter 1
Random Vectors
1.1. Definitions and general properties

If we remember that R^n = {x = (x_1,…,x_n) ; x_j ∈ R, j = 1 to n}, the set of real n-tuples can be fitted with two laws:

R^n × R^n → R^n, (x, y) → x + y    and    R × R^n → R^n, (λ, x) → λx

making it a vector space of dimension n. The basis implicitly considered on R^n will be the canonical basis e_1 = (1, 0,…, 0),…, e_n = (0,…, 0, 1), and x ∈ R^n expressed in this basis will be denoted as the column vector x = (x_1,…,x_n)^T.
Definition of a real random vector

Beginning with a basic definition, without concerning ourselves for the moment with its rigor: we can say simply that a real vector X = (X_1,…,X_n)^T linked to a physical or biological phenomenon is random if the value taken by this vector is unknown as long as the phenomenon is not completed.

For typographical reasons, the vector will often instead be written X^T = (X_1,…,X_n), or even X = (X_1,…,X_n) when there is no risk of confusion.

In other words, given a random vector X and B ⊂ R^n, we do not know if the assertion (also called the event) (X ∈ B) is true or false. However, we do usually know the "chance" that X ∈ B; this is denoted P(X ∈ B) and is called the probability of the event (X ∈ B).

After completion of the phenomenon, the result (also called the realization) will be denoted x = (x_1,…,x_n)^T, or x^T = (x_1,…,x_n), or even x = (x_1,…,x_n) when there is no risk of confusion.
An exact definition of a real random vector of dimension n will now be given. We take as given that:
– Ω = basic space: the set of all possible results (or trials) ω linked to a random phenomenon;
– a = σ-algebra (of events) on Ω, recalling the axioms:
1) Ω ∈ a;
2) if A ∈ a then the complement A^c ∈ a;
3) if (A_j, j ∈ J) is a countable family of events then ∪_{j∈J} A_j is an event, i.e. ∪_{j∈J} A_j ∈ a;
– R^n = space of observables;
– B(R^n) = Borel algebra on R^n, which contains all the open sets of R^n; this is the smallest σ-algebra on R^n with this property.

DEFINITION.– X is said to be a real random vector of dimension n defined on (Ω, a) if X is a measurable mapping (Ω, a) → (R^n, B(R^n)), i.e. ∀B ∈ B(R^n), X^{-1}(B) ∈ a.

When n = 1 we talk about a random variable (r.v.). In the following, the event X^{-1}(B) is also denoted {ω | X(ω) ∈ B} and even more simply (X ∈ B).
PROPOSITION.– In order for X to be a real random vector of dimension n (i.e. a measurable mapping (Ω, a) → (R^n, B(R^n))), it is necessary and sufficient that each component X_j, j = 1 to n, is a real r.v. (i.e. is a measurable mapping (Ω, a) → (R, B(R))).

ABRIDGED DEMONSTRATION.– It suffices to consider X^{-1}(B_1 × … × B_n) where B_1,…,B_n ∈ B(R), as we can show that B(R^n) = B(R) ⊗ … ⊗ B(R), where B(R) ⊗ … ⊗ B(R) denotes the σ-algebra generated by the measurable blocks B_1 × … × B_n.

Now X^{-1}(B_1 × … × B_n) = X_1^{-1}(B_1) ∩ … ∩ X_n^{-1}(B_n), which belongs to a if and only if each term belongs to a, that is to say if each X_j is a real r.v.
DEFINITION.– X = X_1 + iX_2 is said to be a complex random variable defined on (Ω, a) if the real and imaginary parts X_1 and X_2 are real random variables, that is to say X_1 and X_2 are measurable mappings (Ω, a) → (R, B(R)).

EXAMPLE.– With a real random vector X = (X_1,…,X_n) and a real n-tuple u = (u_1,…,u_n) ∈ R^n we can associate the complex r.v.:

e^{i ∑_j u_j X_j} = cos ∑_j u_j X_j + i sin ∑_j u_j X_j

The study of this random variable will be taken up again when we define the characteristic functions.

Law P_X of the random vector X
First of all we assume that the σ-algebra a is provided with a measure P, i.e. a mapping P : a → [0,1] verifying:
1) P(Ω) = 1;
2) for every countable family (A_j, j ∈ J) of pairwise disjoint events:

P(∪_{j∈J} A_j) = ∑_{j∈J} P(A_j)
DEFINITION.– We call the law of the random vector X the "image measure P_X of P through the mapping X", i.e. the measure on B(R^n) defined in the following way:

∀B ∈ B(R^n): P_X(B) = ∫_B dP_X(x_1,…,x_n) = P(X^{-1}(B)) = P(ω | X(ω) ∈ B) = P(X ∈ B)

(the second equality is the definition). Terms 1 and 2 on the one hand, and terms 3, 4 and 5 on the other, are different notations for the same mathematical notion.

Figure 1.1. Measurable mapping X: for B ∈ B(R^n), X^{-1}(B) ∈ a
It is important to observe that, as the measure P is given on a, P_X(B) is calculable for all B ∈ B(R^n) because X is measurable.

The space R^n, provided with the Borel algebra B(R^n) and then with the law P_X, is denoted (R^n, B(R^n), P_X).
NOTE.– As far as the basic and the exact definitions are concerned, the basic definition of random vectors is obviously much simpler and more intuitive, and can happily be used in basic applications of probability calculations. On the other hand, in more theoretical or sophisticated studies, and notably in those calling into play several random vectors X, Y, Z,…, considering the latter as mappings defined on the same space (Ω, a), i.e. X, Y, Z,… : (Ω, a) → (R^n, B(R^n)), will often prove to be useful, even indispensable.

Figure 1.2. Family of measurable mappings

In effect, the expressions and calculations calling into play several (or all) of these vectors can be written without ambiguity using the space (Ω, a, P). Precisely, the events linked to X, Y, Z,… are among the elements A of a (and the probabilities of these events are measured by P).
Let us give two examples:

1) If there are two random vectors X, Y : (Ω, a, P) → (R^n, B(R^n)) and given B and B′ ∈ B(R^n), the event (X ∈ B) ∩ (Y ∈ B′) (for example) can be translated by X^{-1}(B) ∩ Y^{-1}(B′) ∈ a.

2) If there are three r.v. X, Y, Z : (Ω, a, P) → (R, B(R)) and given a ∈ R*_+, let us try to express the event (Z ≥ a − X − Y). Let us state U = (X, Y, Z) and B = {(x, y, z) ∈ R^3 | x + y + z ≥ a}, where B, a Borel set of R^3, represents the half-space bounded by the plane (Π) not containing the origin 0 and based on the triangle ABC.

Figure 1.3. Example of a Borel set of R^3

U is (Ω, a) → (R^3, B(R^3)) measurable and:

(Z ≥ a − X − Y) = (U ∈ B) = U^{-1}(B) ∈ a
NOTE ON THE SPACE (Ω, a, P).– We said that if we took as given Ω, then a on Ω, and then P on a, we could consider the vectors X, Y, Z,… as measurable mappings:

(Ω, a, P) → (R^n, B(R^n))

This way of introducing the different concepts is the easiest to understand, but it rarely corresponds to real probability problems. In general, (Ω, a, P) is not specified, or even given, before "X, Y, Z,… measurable mappings". On the contrary, given the random physical or biological quantities X, Y, Z,… of R^n, it is by starting from the latter that (Ω, a, P) and "X, Y, Z,… measurable mappings defined on (Ω, a, P)" are simultaneously introduced. (Ω, a, P) is an artificial space intended to serve as a link between X, Y, Z,…

What has just been set out may seem exceedingly abstract, but fortunately the general random vectors as they have just been defined are rarely used in practice. In any case, and as far as we are concerned, we will only have to manipulate in what follows the far more specific and concrete notion of a "random vector with a density function".

DEFINITION.– We say that the law P_X of the random vector X has a density if there is a mapping f_X : (R^n, B(R^n)) → (R, B(R)), positive and measurable, called the density of P_X, such that ∀B ∈ B(R^n):

P(X ∈ B) = P_X(B) = ∫_B dP_X(x_1,…,x_n) = ∫_B f_X(x_1,…,x_n) dx_1 … dx_n
VOCABULARY.– Sometimes we write dP_X(x_1,…,x_n) = f_X(x_1,…,x_n) dx_1 … dx_n, and we say also that the measure P_X admits the density f_X with respect to the Lebesgue measure on R^n. We also say that the random vector X admits the density f_X.

NOTE.– ∫_{R^n} f_X(x_1,…,x_n) dx_1 … dx_n = P(X ∈ R^n) = 1.

For example, let X = (X_1, X_2, X_3) be the random vector of density f_X(x_1, x_2, x_3) = K x_3 1_Δ(x_1, x_2, x_3), where Δ is the half-ball defined by x_1^2 + x_2^2 + x_3^2 ≤ R^2 with x_3 ≥ 0. We easily obtain, via a passage to spherical coordinates:

1 = ∫_Δ K x_3 dx_1 dx_2 dx_3 = K πR^4/4,   whence K = 4/(πR^4)
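As a quick numerical cross-check of this normalization (an illustrative Python sketch; the book's own worked examples use Matlab), a Monte Carlo estimate of ∫_Δ K x_3 dx_1 dx_2 dx_3 with K = 4/(πR^4) should come out close to 1:

```python
import math
import random

random.seed(0)

R = 1.0
K = 4 / (math.pi * R**4)          # claimed normalizing constant

# Monte Carlo estimate of the integral of K*x3 over the half-ball Δ:
# sample uniformly in the box [-R,R] x [-R,R] x [0,R] enclosing Δ.
N = 200_000
box_volume = (2 * R) * (2 * R) * R
acc = 0.0
for _ in range(N):
    x1 = random.uniform(-R, R)
    x2 = random.uniform(-R, R)
    x3 = random.uniform(0, R)
    if x1 * x1 + x2 * x2 + x3 * x3 <= R * R:   # inside the half-ball
        acc += K * x3

estimate = acc / N * box_volume
print(estimate)   # close to 1
```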
Marginals

Let X = (X_1,…,X_n)^T be the random vector which has the law P_X and the probability density f_X.

DEFINITION.– The r.v. X_j, which is the j-th component of X, is called the j-th marginal of X, and the law P_{X_j} of X_j is called the law of the j-th marginal.

If we know P_X, we know how to find the laws P_{X_j}.
In effect, ∀B ∈ B(R):

P(X_j ∈ B) = P[(X_1 ∈ R) ∩ … ∩ (X_j ∈ B) ∩ … ∩ (X_n ∈ R)]
= ∫_{R×…×B×…×R} f_X(x_1,…,x_j,…,x_n) dx_1 … dx_j … dx_n

and, using the Fubini theorem:

= ∫_B dx_j ∫_{R^{n−1}} f_X(x_1,…,x_j,…,x_n) dx_1 … dx_n (except dx_j)

The equality applying for all B, we obtain:

f_{X_j}(x_j) = ∫_{R^{n−1}} f_X(x_1,…,x_j,…,x_n) dx_1 … dx_n (except dx_j)

NOTE.– Reciprocally: except in the case of independent components, the knowledge of the P_{X_j} does not lead to that of P_X.
EXAMPLE.– Let us consider:

1) A Gaussian pair Z^T = (X, Y) of probability density:

f_Z(x, y) = (1/2π) exp(−(x^2 + y^2)/2)

We obtain the densities of the marginals:

f_X(x) = ∫_{−∞}^{+∞} f_Z(x, y) dy = (1/√(2π)) exp(−x^2/2)   and
f_Y(y) = ∫_{−∞}^{+∞} f_Z(x, y) dx = (1/√(2π)) exp(−y^2/2)

2) A second, non-Gaussian random pair W^T = (U, V) whose probability density f_W is defined by:

f_W(u, v) = 2 f_Z(u, v) if uv ≥ 0;   f_W(u, v) = 0 if uv < 0

Let us calculate the marginals:

f_U(u) = ∫_{−∞}^{+∞} f_W(u, v) dv = ∫_{−∞}^{0} 2 f_Z(u, v) dv if u ≤ 0
       = ∫_{0}^{+∞} 2 f_Z(u, v) dv if u > 0

From which we easily come to f_U(u) = (1/√(2π)) exp(−u^2/2). In addition we obtain f_V(v) = (1/√(2π)) exp(−v^2/2).

CONCLUSION.– We can clearly see from this example that the marginal densities (identical in 1 and 2) do not determine the densities of the vectors (different in 1 and 2).
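The non-uniqueness above can also be verified numerically. The sketch below (illustrative Python, using a simple midpoint Riemann sum) integrates f_W over v and recovers the standard Gaussian density for the marginal of U, even though f_W itself is not Gaussian:

```python
import math

def f_Z(x, y):
    # density of the Gaussian pair
    return math.exp(-(x * x + y * y) / 2) / (2 * math.pi)

def f_W(u, v):
    # the modified, non-Gaussian density of the pair W = (U, V)
    return 2 * f_Z(u, v) if u * v >= 0 else 0.0

def marginal_U(u, h=0.01, vmax=8.0):
    # midpoint Riemann sum of f_W(u, v) over v in [-vmax, vmax]
    n = int(2 * vmax / h)
    return sum(f_W(u, -vmax + (k + 0.5) * h) for k in range(n)) * h

for u in (-1.3, 0.4, 2.0):
    gauss = math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    print(u, marginal_U(u), gauss)   # the marginal matches the N(0,1) density
```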
Probability distribution function

DEFINITION.– We call the mapping F_X : (x_1,…,x_n) → F_X(x_1,…,x_n), from R^n to [0,1], the distribution function of the random vector X^T = (X_1,…,X_n). It is defined by:

F_X(x_1,…,x_n) = P((X_1 ≤ x_1) ∩ … ∩ (X_n ≤ x_n))

and in integral form, since X is a vector with a probability density:

F_X(x_1,…,x_n) = ∫_{−∞}^{x_1} … ∫_{−∞}^{x_n} f_X(u_1,…,u_n) du_1 … du_n

Some general properties:
– ∀j = 1 to n, the mapping x_j → F_X(x_1,…,x_n) is non-decreasing;
– F_X(x_1,…,x_n) → 1 when all the variables x_j → +∞;
– F_X(x_1,…,x_n) → 0 if at least one of the variables x_j → −∞;
– if (x_1,…,x_n) → f_X(x_1,…,x_n) is continuous, then ∂^n F_X / ∂x_n … ∂x_1 = f_X.

EXERCISE.– Determine the probability distribution of the pair (X, Y) of density f(x, y) = K xy on the rectangle Δ = [1,3] × [2,4], and state precisely the value of K.
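One way to approach the exercise numerically (an illustrative Python sketch; it only checks the normalization condition giving K, not the full distribution function asked for) is to estimate the integral of xy over the rectangle with a midpoint rule and invert it:

```python
# K must make the density integrate to 1 over Δ = [1,3] x [2,4]:
# estimate the integral of x*y over Δ and take its reciprocal.
def integral_xy(h=0.01):
    n1 = int(2 / h)   # cells along x in [1,3]
    n2 = int(2 / h)   # cells along y in [2,4]
    s = 0.0
    for i in range(n1):
        x = 1 + (i + 0.5) * h
        for j in range(n2):
            y = 2 + (j + 0.5) * h
            s += x * y
    return s * h * h

total = integral_xy()
K = 1 / total
print(total, K)   # total = 24.0, so K = 1/24
```

The midpoint rule is exact here because xy is bilinear on each cell, which makes the numerical answer a reliable check on the hand calculation.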
Independence

DEFINITION.– We say that a family of r.v. X_1,…,X_n is an independent family if ∀J ⊂ {1, 2,…, n} and for every family of B_j ∈ B(R):

P(∩_{j∈J} (X_j ∈ B_j)) = ∏_{j∈J} P(X_j ∈ B_j)

As R ∈ B(R), it is easy to verify, by making certain Borel sets equal to R, that the definition of independence is equivalent to the following:

∀B_j ∈ B(R): P(∩_{j=1}^{n} (X_j ∈ B_j)) = ∏_{j=1}^{n} P(X_j ∈ B_j)

again equivalent to:

∀B_j ∈ B(R): P(X ∈ B_1 × … × B_n) = ∏_{j=1}^{n} P(X_j ∈ B_j)

i.e., by introducing the laws of probabilities:

∀B_j ∈ B(R): P_X(B_1 × … × B_n) = ∏_{j=1}^{n} P_{X_j}(B_j)

NOTE.– This law of probability P_X (defined on B(R^n) = B(R) ⊗ … ⊗ B(R)) is the tensor product of the laws of probabilities P_{X_j} (defined on B(R)). Symbolically we write this as P_X = P_{X_1} ⊗ … ⊗ P_{X_n}.
NOTE.– Let X_1,…,X_n be a family of r.v. If this family is independent, the r.v. are independent pairwise, but the converse is false.

PROPOSITION.– Let X = (X_1,…,X_n) be a real random vector admitting the probability density f_X, the components X_1,…,X_n admitting the densities f_{X_1},…,f_{X_n}. In order for the family of components to be an independent family, it is necessary and sufficient that:

f_X(x_1,…,x_n) = ∏_{j=1}^{n} f_{X_j}(x_j)

DEMONSTRATION (in the simplified case where f_X is continuous).–

– If (X_1,…,X_n) is an independent family:

F_X(x_1,…,x_n) = P(∩_{j=1}^{n} (X_j ≤ x_j)) = ∏_{j=1}^{n} P(X_j ≤ x_j) = ∏_{j=1}^{n} F_{X_j}(x_j)

By deriving the two extreme members:

f_X(x_1,…,x_n) = ∂^n F_X(x_1,…,x_n) / ∂x_n … ∂x_1 = ∏_{j=1}^{n} ∂F_{X_j}(x_j)/∂x_j = ∏_{j=1}^{n} f_{X_j}(x_j);

– reciprocally, if f_X(x_1,…,x_n) = ∏_{j=1}^{n} f_{X_j}(x_j), then for B_j ∈ B(R), j = 1 to n:

P(∩_{j=1}^{n} (X_j ∈ B_j)) = P(X ∈ ∏_{j=1}^{n} B_j) = ∫_{∏ B_j} f_X(x_1,…,x_n) dx_1 … dx_n
= ∫_{∏ B_j} ∏_{j=1}^{n} f_{X_j}(x_j) dx_j = ∏_{j=1}^{n} ∫_{B_j} f_{X_j}(x_j) dx_j = ∏_{j=1}^{n} P(X_j ∈ B_j)

NOTE.– The equality f_X(x_1,…,x_n) = ∏_{j=1}^{n} f_{X_j}(x_j) defines f_X, a function of n variables, as the tensor product of the functions of one variable f_{X_j}. Symbolically we write f_X = f_{X_1} ⊗ … ⊗ f_{X_n} (not to be confused with the ordinary product f = f_1 f_2 ⋯ f_n defined by f(x) = f_1(x) f_2(x) ⋯ f_n(x)).
EXAMPLE.– Let X = (X_1, X_2) be the random pair of density:

(1/2π) exp(−(x_1^2 + x_2^2)/2)

As

(1/2π) exp(−(x_1^2 + x_2^2)/2) = (1/√(2π)) exp(−x_1^2/2) · (1/√(2π)) exp(−x_2^2/2)

and as (1/√(2π)) exp(−x_1^2/2) and (1/√(2π)) exp(−x_2^2/2) are the densities of X_1 and of X_2, these two components X_1 and X_2 are independent.
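The factorization criterion can be illustrated by simulation (Python sketch; the intervals B1 and B2 below are arbitrary choices, not taken from the text): since the density of the pair factorizes, the probability of a product set should equal the product of the marginal probabilities.

```python
import random

random.seed(1)

N = 400_000
B1 = (-0.5, 1.0)   # arbitrary Borel intervals for the check
B2 = (0.2, 2.0)

joint = n1 = n2 = 0
for _ in range(N):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    in1 = B1[0] <= x1 <= B1[1]
    in2 = B2[0] <= x2 <= B2[1]
    joint += in1 and in2
    n1 += in1
    n2 += in2

# P(X in B1 x B2) versus P(X1 in B1) * P(X2 in B2): the two estimates agree
print(joint / N, (n1 / N) * (n2 / N))
```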
DEFINITION.– Two random vectors X = (X_1,…,X_n) and Y = (Y_1,…,Y_p) are said to be independent if:

∀B ∈ B(R^n) and B′ ∈ B(R^p): P((X ∈ B) ∩ (Y ∈ B′)) = P(X ∈ B) P(Y ∈ B′)

The sum of independent random variables
NOTE.– We are frequently led to calculate the probability P that a function of n given r.v. X_1,…,X_n verifies a certain inequality. Let us denote this probability P(inequality). Let us assume that the random vector X = (X_1,…,X_n) possesses a probability density f_X(x_1,…,x_n). The method of obtaining P(inequality) consists of determining the set B ∈ B(R^n) of the (x_1,…,x_n) which verify the inequality. We thus obtain:

P(inequality) = ∫_B f_X(x_1,…,x_n) dx_1 … dx_n

EXAMPLES.–

1) P(X_1 + X_2 ≤ z) = P((X_1, X_2) ∈ B) = ∫_B f_X(x_1, x_2) dx_1 dx_2, where B = {(x, y) ∈ R^2 | x + y ≤ z} is the half-plane below the line x + y = z.
2) P(X_1 + X_2 ≤ a − X_3) = P((X_1, X_2, X_3) ∈ B) = ∫_B f_X(x_1, x_2, x_3) dx_1 dx_2 dx_3, where B is the half-space containing the origin 0 and limited by the plane placed on the triangle ABC of equation x + y + z = a.

3) P(Max(X_1, X_2) ≤ z) = P((X_1, X_2) ∈ B) = ∫_B f_X(x_1, x_2) dx_1 dx_2, where B = {(x, y) ∈ R^2 | x ≤ z and y ≤ z} (the quarter-plane below and to the left of the point (z, z)).
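The recipe P(inequality) = ∫_B f_X can be illustrated numerically. The sketch below (Python; it assumes, purely for illustration, two independent standard Gaussian components, a choice not made in the text) compares a Monte Carlo estimate of P(X_1 + X_2 ≤ z) with the value of the half-plane integral, which in that special case has the closed form Φ(z/√2):

```python
import math
import random

random.seed(2)

z = 0.7
N = 300_000

# Monte Carlo estimate of P(X1 + X2 <= z) for independent N(0,1) components
count = sum(random.gauss(0, 1) + random.gauss(0, 1) <= z for _ in range(N))
mc = count / N

# For this choice X1 + X2 ~ N(0, 2), so the half-plane integral equals
# Phi(z / sqrt(2)) = 0.5 * (1 + erf(z / 2))
exact = 0.5 * (1 + math.erf(z / 2))
print(mc, exact)
```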
Starting with example 1) we will show the following.
PROPOSITION.– Let X and Y be two real independent r.v. of probability densities respectively f_X and f_Y. The r.v. Z = X + Y admits a probability density f_Z defined as:

f_Z(z) = (f_X ∗ f_Y)(z) = ∫_{−∞}^{+∞} f_X(x) f_Y(z − x) dx

DEMONSTRATION.– Let us start from the probability distribution of Z:

F_Z(z) = P(Z ≤ z) = P(X + Y ≤ z) = P((X, Y) ∈ B)

(where B is the half-plane defined in example 1) above)

= ∫_B f(x, y) dx dy = (independence) ∫_B f_X(x) f_Y(y) dx dy
= ∫_{−∞}^{+∞} f_X(x) dx ∫_{−∞}^{z−x} f_Y(y) dy

Stating y = u − x:

= ∫_{−∞}^{+∞} f_X(x) dx ∫_{−∞}^{z} f_Y(u − x) du = ∫_{−∞}^{z} du ∫_{−∞}^{+∞} f_X(x) f_Y(u − x) dx

The mapping u → ∫_{−∞}^{+∞} f_X(x) f_Y(u − x) dx being continuous, F_Z(z) is a primitive of it, and:

F_Z′(z) = f_Z(z) = ∫_{−∞}^{+∞} f_X(x) f_Y(z − x) dx
NOTE.– If (for example) the support of f_X and f_Y is R_+, i.e. if f_X(x) = f_X(x) 1_[0,∞[(x) and f_Y(y) = f_Y(y) 1_[0,∞[(y), we easily arrive at:

f_Z(z) = ∫_0^z f_X(x) f_Y(z − x) dx

EXAMPLE.– X and Y are two independent exponential r.v. of parameter λ. Let us take as given Z = X + Y.

For z ≤ 0: f_Z(z) = 0.

For z ≥ 0:

f_Z(z) = ∫_{−∞}^{+∞} f_X(x) f_Y(z − x) dx = ∫_0^z λe^{−λx} λe^{−λ(z−x)} dx = λ^2 z e^{−λz}

and f_Z(z) = λ^2 z e^{−λz} 1_[0,∞[(z).
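The density λ^2 z e^{−λz} can be cross-checked by simulation (illustrative Python sketch; λ and the evaluation point z₀ are arbitrary choices): the empirical distribution of X + Y should match the distribution function obtained by integrating f_Z, namely F_Z(z₀) = 1 − e^{−λz₀}(1 + λz₀).

```python
import math
import random

random.seed(3)

lam = 1.5
N = 300_000

# simulate Z = X + Y for independent exponentials of parameter lam
samples = [random.expovariate(lam) + random.expovariate(lam) for _ in range(N)]

# compare the empirical CDF at z0 with the integral of f_Z(z) = lam^2 z e^{-lam z}
z0 = 1.0
empirical = sum(s <= z0 for s in samples) / N
theoretical = 1 - math.exp(-lam * z0) * (1 + lam * z0)
print(empirical, theoretical)
```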
1.2. Spaces L^1(dP) and L^2(dP)

1.2.1. Definitions

The family of r.v. X : ω → X(ω), (Ω, a, P) → (R, B(R)), forms a vector space on R, denoted ε.

Two vector subspaces of ε play a particularly important role and these are what will be defined.

The definitions would in effect be the final element in the construction of the Lebesgue integral of measurable mappings, but this construction will not be given here and we will be able to progress without it.

DEFINITION.– We say that two random variables X and X′ defined on (Ω, a) are almost surely equal, and we write X = X′ a.s., if X = X′ except possibly on an event N of zero probability (that is to say N ∈ a and P(N) = 0).

We note:
– X̃ = {class (of equivalence) of r.v. X′ almost surely equal to X};
– 0̃ = {class (of equivalence) of r.v. almost surely equal to 0}.

We can now give:
– the definition of L^1(dP) as the vector space of first order random variables;
– the definition of L^2(dP) as the vector space of second order random variables:

L^1(dP) = {r.v. X | ∫_Ω |X(ω)| dP(ω) < ∞}
L^2(dP) = {r.v. X | ∫_Ω X^2(ω) dP(ω) < ∞}
where, in these expressions, the r.v. are defined except on a zero-probability event; otherwise said, the r.v. X are any representatives of the classes X̃, because by construction the integrals of the r.v. are not modified if we modify the latter on zero-probability events.

Note on the inequality ∫_Ω |X(ω)| dP(ω) < ∞

Introducing the two positive random variables X^+ = Sup(X, 0) and X^− = Sup(−X, 0), we can write X = X^+ − X^− and |X| = X^+ + X^−.

Let X ∈ L^1(dP); we thus have:

∫_Ω |X(ω)| dP(ω) < ∞ ⇔ ∫_Ω X^+(ω) dP(ω) < ∞ and ∫_Ω X^−(ω) dP(ω) < ∞

So, if X ∈ L^1(dP), the integral

∫_Ω X(ω) dP(ω) = ∫_Ω X^+(ω) dP(ω) − ∫_Ω X^−(ω) dP(ω)

is defined without ambiguity.

NOTE.– L^2(dP) ⊂ L^1(dP). In effect, given X ∈ L^2(dP), following Schwarz's inequality:

(∫_Ω |X(ω)| dP(ω))^2 ≤ ∫_Ω X^2(ω) dP(ω) ∫_Ω dP(ω) < ∞
22
Discrete Stochastic Processes and Optimal Filtering
EXAMPLE.– Let $X$ be a Gaussian r.v. (density $\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2 \right)$). This belongs to $L^1(dP)$ and to $L^2(dP)$.

Let $Y$ be a Cauchy r.v. (density $\frac{1}{\pi\left(1 + x^2\right)}$). This does not belong to $L^1(dP)$, and thus does not belong to $L^2(dP)$ either.
1.2.2. Properties

– $L^1(dP)$ is a Banach space; we will not use this property in what follows;
– $L^2(dP)$ is a Hilbert space. We give here the properties without demonstration.

* We can equip $L^2(dP)$ with the scalar product defined by:
$$\forall X, Y \in L^2(dP) \quad \langle X, Y \rangle = \int_\Omega X(\omega) Y(\omega)\, dP(\omega)$$

This expression is well defined because, following Schwarz’s inequality:
$$\left( \int_\Omega \left| X(\omega) Y(\omega) \right| dP(\omega) \right)^2 \le \int_\Omega X^2(\omega)\, dP(\omega) \int_\Omega Y^2(\omega)\, dP(\omega) < \infty$$
and the axioms of the scalar product are immediately verifiable.

* $L^2(dP)$ is a vector space normed by:
$$\| X \| = \sqrt{\langle X, X \rangle} = \sqrt{\int_\Omega X^2(\omega)\, dP(\omega)}$$

It is easy to verify that:
$$\forall X, Y \in L^2(dP) \quad \| X + Y \| \le \| X \| + \| Y \|$$
$$\forall X \in L^2(dP) \ \text{and} \ \forall \lambda \in \mathbb{R} \quad \| \lambda X \| = |\lambda|\, \| X \|$$

As far as the second axiom is concerned:
– if $X = 0$ then $\| X \| = 0$;
– if $\| X \| = \left( \int_\Omega X^2(\omega)\, dP(\omega) \right)^{1/2} = 0$ then $X = 0$ a.s. (or $\dot{X} = \dot{0}$).

* $L^2(dP)$ is a complete space for the norm $\| \cdot \|$ defined above. (Every Cauchy sequence $X_n$ converges to some $X \in L^2(dP)$.)
1.3. Mathematical expectation and applications

1.3.1. Definitions

We are studying a general random vector (not necessarily with a density function):
$$X = (X_1, \ldots, X_n) : (\Omega, \mathcal{A}, P) \to \left( \mathbb{R}^n, \mathcal{B}\left( \mathbb{R}^n \right) \right).$$

Furthermore, we give ourselves a measurable mapping:
$$\Psi : \left( \mathbb{R}^n, \mathcal{B}\left( \mathbb{R}^n \right) \right) \to \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right)$$

$\Psi \circ X$ (also denoted $\Psi(X)$ or $\Psi(X_1, \ldots, X_n)$) is a measurable mapping (thus an r.v.) defined on $(\Omega, \mathcal{A})$:
$$(\Omega, \mathcal{A}, P) \xrightarrow{\ X\ } \left( \mathbb{R}^n, \mathcal{B}\left( \mathbb{R}^n \right), P_X \right) \xrightarrow{\ \Psi\ } \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right)$$

DEFINITION.– Under the hypothesis $\Psi \circ X \in L^1(dP)$, we call mathematical expectation of the random variable $\Psi \circ X$ the expression $\mathrm{E}(\Psi \circ X)$ defined as:
$$\mathrm{E}(\Psi \circ X) = \int_\Omega (\Psi \circ X)(\omega)\, dP(\omega)$$
or, to remind ourselves that $X$ is a vector:
$$\mathrm{E}\left( \Psi(X_1, \ldots, X_n) \right) = \int_\Omega \Psi\left( X_1(\omega), \ldots, X_n(\omega) \right) dP(\omega)$$

NOTE.– This definition of the mathematical expectation of $\Psi \circ X$ is well adapted to general problems or to those of a more theoretical orientation; in particular, it is by using it that we construct $L^2(dP)$, the Hilbert space of second order r.v. In practice, however, it is the law $P_X$ (the image of the measure $P$ under the mapping $X$), and not $P$ itself, that we know. We thus want to use the law $P_X$ to express $\mathrm{E}(\Psi \circ X)$, that is, to transfer the calculation of $\mathrm{E}(\Psi \circ X)$ from the space $(\Omega, \mathcal{A}, P)$ to the space $\left( \mathbb{R}^n, \mathcal{B}\left( \mathbb{R}^n \right), P_X \right)$.

In order to simplify the writing in the theorem that follows (and as will often occur in the remainder of this work), $(X_1, \ldots, X_n)$, $(x_1, \ldots, x_n)$ and $dx_1 \ldots dx_n$ will often be denoted $X$, $x$ and $dx$ respectively.
Transfer theorem

Let us assume $\Psi \circ X \in L^1(dP)$; we thus have:
$$\mathrm{E}(\Psi \circ X) = \int_\Omega (\Psi \circ X)(\omega)\, dP(\omega) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x)$$

In particular, if $P_X$ admits a density $f_X$:
$$\mathrm{E}(\Psi \circ X) = \int_{\mathbb{R}^n} \Psi(x) f_X(x)\, dx \quad \text{and} \quad \mathrm{E}X = \int_{\mathbb{R}} x f_X(x)\, dx.$$
Moreover $\Psi \in L^1(dP_X)$.

DEMONSTRATION.–
– The equality is true if $\Psi = 1_B$ with $B \in \mathcal{B}\left( \mathbb{R}^n \right)$:
$$\mathrm{E}(\Psi \circ X) = \mathrm{E}(1_B \circ X) = P_X(B) = \int_{\mathbb{R}^n} 1_B(x)\, dP_X(x) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x).$$

– The equality is still true if $\Psi$ is a simple measurable mapping, that is to say if $\Psi = \sum_{j=1}^m \lambda_j 1_{B_j}$ where the $B_j \in \mathcal{B}\left( \mathbb{R}^n \right)$ and are pairwise disjoint.

We have in effect:
$$\mathrm{E}(\Psi \circ X) = \sum_{j=1}^m \lambda_j \mathrm{E}\left( 1_{B_j} \circ X \right) = \sum_{j=1}^m \lambda_j P_X\left( B_j \right) = \sum_{j=1}^m \lambda_j \int_{\mathbb{R}^n} 1_{B_j}(x)\, dP_X(x)$$
$$= \int_{\mathbb{R}^n} \left( \sum_{j=1}^m \lambda_j 1_{B_j}(x) \right) dP_X(x) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x)$$

– If we now assume that $\Psi$ is a positive measurable mapping, we know that it is the limit of an increasing sequence of positive simple measurable mappings $\Psi_p$. We thus have:
$$\int_\Omega \left( \Psi_p \circ X \right)(\omega)\, dP(\omega) = \int_{\mathbb{R}^n} \Psi_p(x)\, dP_X(x) \quad \text{with } \Psi_p \uparrow \Psi.$$

$\Psi_p \circ X$ is also a positive increasing sequence which converges to $\Psi \circ X$ and, by taking the limits of the two members when $p \uparrow \infty$, we obtain, according to the monotone convergence theorem:
$$\int_\Omega (\Psi \circ X)(\omega)\, dP(\omega) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x).$$

– If $\Psi$ is a measurable mapping of any sort, we again use the decomposition $\Psi = \Psi^+ - \Psi^-$ and $|\Psi| = \Psi^+ + \Psi^-$.

Furthermore, it is clear that $(\Psi \circ X)^+ = \Psi^+ \circ X$ and $(\Psi \circ X)^- = \Psi^- \circ X$.

It emerges that:
$$\mathrm{E}\left| \Psi \circ X \right| = \mathrm{E}(\Psi \circ X)^+ + \mathrm{E}(\Psi \circ X)^- = \mathrm{E}\left( \Psi^+ \circ X \right) + \mathrm{E}\left( \Psi^- \circ X \right)$$
i.e. according to what we have already seen:
$$= \int_{\mathbb{R}^n} \Psi^+(x)\, dP_X(x) + \int_{\mathbb{R}^n} \Psi^-(x)\, dP_X(x) = \int_{\mathbb{R}^n} \left| \Psi(x) \right| dP_X(x)$$

As $\Psi \circ X \in L^1(dP)$, we can deduce from this that $\Psi \in L^1(dP_X)$ (reciprocally, if $\Psi \in L^1(dP_X)$ then $\Psi \circ X \in L^1(dP)$).

In particular $\mathrm{E}(\Psi \circ X)^+$ and $\mathrm{E}(\Psi \circ X)^-$ are finite, and
$$\mathrm{E}(\Psi \circ X) = \mathrm{E}\left( \Psi^+ \circ X \right) - \mathrm{E}\left( \Psi^- \circ X \right) = \int_{\mathbb{R}^n} \Psi^+(x)\, dP_X(x) - \int_{\mathbb{R}^n} \Psi^-(x)\, dP_X(x) = \int_{\mathbb{R}^n} \Psi(x)\, dP_X(x)$$
NOTE.– (which is an extension of the preceding note) In certain works the notion of “a random vector as a measurable mapping” is not developed, as it is judged to be too abstract. In this case the integral
$$\int_{\mathbb{R}^n} \Psi(x)\, dP_X(x) = \int_{\mathbb{R}^n} \Psi(x) f_X(x)\, dx \quad (\text{if } P_X \text{ admits the density } f_X)$$
is given as the definition of $\mathrm{E}(\Psi \circ X)$.

EXAMPLES.–

1) Let the “random Gaussian vector” be $X^T = (X_1, X_2)$ of density:
$$f_X(x_1, x_2) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2} \frac{1}{1-\rho^2} \left( x_1^2 - 2\rho x_1 x_2 + x_2^2 \right) \right)$$
where $\rho \in \left]-1, 1\right[$, and let the mapping $\Psi$ be $(x_1, x_2) \to x_1 x_2^3$.

The condition:
$$\int_{\mathbb{R}^2} \left| x_1 x_2^3 \right| \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2} \frac{1}{1-\rho^2} \left( x_1^2 - 2\rho x_1 x_2 + x_2^2 \right) \right) dx_1\, dx_2 < \infty$$
is easily verifiable and:
$$\mathrm{E} X_1 X_2^3 = \int_{\mathbb{R}^2} x_1 x_2^3 \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2} \frac{1}{1-\rho^2} \left( x_1^2 - 2\rho x_1 x_2 + x_2^2 \right) \right) dx_1\, dx_2$$
2) Given a random Cauchy variable of density $f_X(x) = \frac{1}{\pi\left(1 + x^2\right)}$:
$$\frac{1}{\pi} \int_{\mathbb{R}} \frac{|x|}{1 + x^2}\, dx = +\infty \quad \text{thus } X \notin L^1(dP) \text{ and } \mathrm{E}X \text{ is not defined.}$$

Let us consider next the transformation $\Psi$ which consists of “rectifying and clipping” the r.v. $X$: $\Psi(x) = |x|$ if $|x| \le K$ and $\Psi(x) = K$ if $|x| > K$.

Figure 1.4. Rectifying and clipping operation

$$\int_{\mathbb{R}} \Psi(x)\, dP_X(x) = \frac{1}{\pi} \left( \int_{-K}^{K} \frac{|x|}{1 + x^2}\, dx + \int_{-\infty}^{-K} \frac{K}{1 + x^2}\, dx + \int_{K}^{\infty} \frac{K}{1 + x^2}\, dx \right)$$
$$= \frac{1}{\pi} \left( \ln\left( 1 + K^2 \right) + 2K \left( \frac{\pi}{2} - \arctan K \right) \right) < \infty.$$

Thus $\Psi \circ X \in L^1(dP)$ and:
$$\mathrm{E}(\Psi \circ X) = \int_{-\infty}^{+\infty} \Psi(x)\, dP_X(x) = \frac{1}{\pi} \left( \ln\left( 1 + K^2 \right) + 2K \left( \frac{\pi}{2} - \arctan K \right) \right).$$
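A quick numerical cross-check of this closed form (the clipping level $K = 2$ and the quadrature parameters are arbitrary choices; the tail of the truncated integral is added back exactly via $P(|X| > R)$):

```python
import math

K = 2.0  # illustrative clipping level

def psi(x):
    # rectify (absolute value) then clip at level K
    return min(abs(x), K)

def cauchy_density(x):
    return 1.0 / (math.pi * (1.0 + x * x))

# midpoint rule on [-R, R], plus the exact tail contribution K * P(|X| > R)
R, n = 500.0, 200000
h = 2 * R / n
integral = sum(psi(-R + (i + 0.5) * h) * cauchy_density(-R + (i + 0.5) * h)
               for i in range(n)) * h
tail = 2 * K * (0.5 - math.atan(R) / math.pi)
numeric = integral + tail

closed_form = (math.log(1 + K ** 2) + 2 * K * (math.pi / 2 - math.atan(K))) / math.pi
```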
DEFINITION.– Given $np$ r.v. $X_{jk}$ ($j = 1$ to $p$, $k = 1$ to $n$) belonging to $L^1(dP)$, we define the mean of the matrix
$$\left[ X_{jk} \right] = \begin{pmatrix} X_{11} & \cdots & X_{1n} \\ \vdots & & \vdots \\ X_{p1} & \cdots & X_{pn} \end{pmatrix} \quad \text{by} \quad \mathrm{E}\left[ X_{jk} \right] = \begin{pmatrix} \mathrm{E}X_{11} & \cdots & \mathrm{E}X_{1n} \\ \vdots & & \vdots \\ \mathrm{E}X_{p1} & \cdots & \mathrm{E}X_{pn} \end{pmatrix}.$$

In particular, given a random vector $X^T = (X_1, \ldots, X_n)$ verifying $X_j \in L^1(dP)$ $\forall j = 1$ to $n$, we state:
$$\mathrm{E}[X] = \begin{pmatrix} \mathrm{E}X_1 \\ \vdots \\ \mathrm{E}X_n \end{pmatrix} \quad \text{or} \quad \mathrm{E}\left[ X^T \right] = (\mathrm{E}X_1, \ldots, \mathrm{E}X_n).$$

Mathematical expectation of a complex r.v.

DEFINITION.– Given a complex r.v. $X = X_1 + i X_2$, we say that $X \in L^1(dP)$ if $X_1$ and $X_2 \in L^1(dP)$. If $X \in L^1(dP)$ we define its mathematical expectation as:
$$\mathrm{E}(X) = \mathrm{E}X_1 + i\, \mathrm{E}X_2.$$

Transformation of random vectors
We are studying a real random vector $X = (X_1, \ldots, X_n)$ with a probability density $f_X(x) 1_D(x) = f_X(x_1, \ldots, x_n) 1_D(x_1, \ldots, x_n)$ where $D$ is an open set of $\mathbb{R}^n$.

Furthermore, we give ourselves the mapping:
$$\alpha : x = (x_1, \ldots, x_n) \to y = \alpha(x) = \left( \alpha_1(x_1, \ldots, x_n), \ldots, \alpha_n(x_1, \ldots, x_n) \right)$$

We assume that $\alpha$ is a $C^1$-diffeomorphism of $D$ onto an open set $\Delta$ of $\mathbb{R}^n$, i.e. that $\alpha$ is bijective and that $\alpha$ and $\beta = \alpha^{-1}$ are of class $C^1$.

Figure 1.5. Transformation of a random vector

The random vector $Y = (Y_1, \ldots, Y_n) = \left( \alpha_1(X_1, \ldots, X_n), \ldots, \alpha_n(X_1, \ldots, X_n) \right)$ takes its values in $\Delta$ and we wish to determine $f_Y(y) 1_\Delta(y)$, its probability density.

PROPOSITION.–
$$f_Y(y) 1_\Delta(y) = f_X(\beta(y)) \left| \mathrm{Det}\, J_\beta(y) \right| 1_\Delta(y)$$

DEMONSTRATION.– Given $\Psi \in L^1(dy)$:
$$\mathrm{E}(\Psi(Y)) = \int_{\mathbb{R}^n} \Psi(y) f_Y(y) 1_\Delta(y)\, dy.$$
Furthermore:
$$\mathrm{E}(\Psi(Y)) = \mathrm{E}\Psi(\alpha(X)) = \int_{\mathbb{R}^n} \Psi(\alpha(x)) f_X(x) 1_D(x)\, dx.$$

By applying the change of variables theorem in multiple integrals and by denoting the Jacobian matrix of the mapping $\beta$ as $J_\beta(y)$, we arrive at:
$$= \int_{\mathbb{R}^n} \Psi(y) f_X(\beta(y)) \left| \mathrm{Det}\, J_\beta(y) \right| 1_\Delta(y)\, dy.$$

Finally, the equality:
$$\int_{\mathbb{R}^n} \Psi(y) f_Y(y) 1_\Delta(y)\, dy = \int_{\mathbb{R}^n} \Psi(y) f_X(\beta(y)) \left| \mathrm{Det}\, J_\beta(y) \right| 1_\Delta(y)\, dy$$
has validity for all $\Psi \in L^1(dy)$; we deduce from it, using Haar’s lemma, the formula we are looking for:
$$f_Y(y) 1_\Delta(y) = f_X(\beta(y)) \left| \mathrm{Det}\, J_\beta(y) \right| 1_\Delta(y)$$

IN PARTICULAR.– If $X$ is an r.v. and the mapping $\alpha : x \to y = \alpha(x)$ is a $C^1$-diffeomorphism of $D \subset \mathbb{R}$ onto $\Delta \subset \mathbb{R}$, the equality of the proposition becomes:
$$f_Y(y) 1_\Delta(y) = f_X(\beta(y)) \left| \beta'(y) \right| 1_\Delta(y)$$

EXAMPLE.– Let the random ordered pair be $Z = (X, Y)$ of probability density:
$$f_Z(x, y) = \frac{1}{x^2 y^2} 1_D(x, y) \quad \text{where} \quad D = \left]1, \infty\right[ \times \left]1, \infty\right[ \subset \mathbb{R}^2$$

Furthermore, we allow the $C^1$-diffeomorphism $\alpha$ of $D$ onto $\Delta = \left\{ (u, v) : \frac{1}{u} < v < u \right\}$ defined by:
$$\alpha : (x, y) \in D \to \left( u = \alpha_1(x, y) = xy,\ v = \alpha_2(x, y) = \frac{x}{y} \right) \in \Delta$$
$$\beta : (u, v) \in \Delta \to \left( x = \beta_1(u, v) = \sqrt{uv},\ y = \beta_2(u, v) = \sqrt{\frac{u}{v}} \right) \in D$$

$$J_\beta(u, v) = \frac{1}{2} \begin{pmatrix} \sqrt{\dfrac{v}{u}} & \sqrt{\dfrac{u}{v}} \\[6pt] \dfrac{1}{\sqrt{uv}} & -\dfrac{\sqrt{u}}{v^{3/2}} \end{pmatrix} \quad \text{and} \quad \left| \mathrm{Det}\, J_\beta(u, v) \right| = \frac{1}{2v}.$$

The vector $W = \left( U = XY,\ V = \dfrac{X}{Y} \right)$ thus admits the probability density:
$$f_W(u, v) 1_\Delta(u, v) = f_Z\left( \beta_1(u, v), \beta_2(u, v) \right) \left| \mathrm{Det}\, J_\beta(u, v) \right| 1_\Delta(u, v) = \frac{1}{\left( \sqrt{uv} \right)^2 \left( \sqrt{u/v} \right)^2} \cdot \frac{1}{2v}\, 1_\Delta(u, v) = \frac{1}{2 u^2 v} 1_\Delta(u, v)$$
NOTE.– Reciprocally, the vector $W = (U, V)$, of probability density $f_W(u, v) 1_\Delta(u, v)$ and whose components are dependent, is transformed by $\beta$ into the vector $Z = (X, Y)$, of probability density $f_Z(x, y) 1_D(x, y)$ and whose components are independent.
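The Jacobian computation and the density identity can be sketched numerically: below, a central finite-difference approximation of $\mathrm{Det}\, J_\beta$ is compared with the closed form $\frac{1}{2v}$ at one point of $\Delta$ (the point $(u_0, v_0) = (3, 1.5)$ and the step $h$ are arbitrary choices):

```python
import math

def beta(u, v):
    # inverse mapping beta = alpha^{-1}: (u, v) -> (x, y) = (sqrt(uv), sqrt(u/v))
    return math.sqrt(u * v), math.sqrt(u / v)

def det_jacobian_fd(u, v, h=1e-6):
    # central finite differences for the 2x2 Jacobian matrix of beta
    xu = (beta(u + h, v)[0] - beta(u - h, v)[0]) / (2 * h)
    xv = (beta(u, v + h)[0] - beta(u, v - h)[0]) / (2 * h)
    yu = (beta(u + h, v)[1] - beta(u - h, v)[1]) / (2 * h)
    yv = (beta(u, v + h)[1] - beta(u, v - h)[1]) / (2 * h)
    return xu * yv - xv * yu

u0, v0 = 3.0, 1.5                      # a point of the open set Delta
fd = abs(det_jacobian_fd(u0, v0))
closed = 1.0 / (2 * v0)

# density identity f_W(u,v) = f_Z(beta(u,v)) * |Det J_beta(u,v)|
x0, y0 = beta(u0, v0)
f_Z_val = 1.0 / (x0 ** 2 * y0 ** 2)
f_W_val = f_Z_val * fd
```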
1.3.2. Characteristic functions of a random vector

DEFINITION.– We call the characteristic function of a random vector $X^T = (X_1, \ldots, X_n)$ the mapping $\varphi_X : (u_1, \ldots, u_n) \in \mathbb{R}^n \to \varphi_X(u_1, \ldots, u_n)$ defined by:
$$\varphi_X(u_1, \ldots, u_n) = \mathrm{E} \exp\left( i \sum_{j=1}^n u_j X_j \right) = \int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n$$

(The definition of $\mathrm{E}\Psi(X_1, \ldots, X_n)$ is applied with
$$\Psi(X_1, \ldots, X_n) = \exp\left( i \sum_{j=1}^n u_j X_j \right)$$
and the integration theorem with respect to the image measure.)

$\varphi_X$ is thus the Fourier transform of $f_X$, which can be denoted $\varphi_X = \mathcal{F}(f_X)$.

(In analysis, it is preferable to write:
$$\mathcal{F}(f_X)(u_1, \ldots, u_n) = \int_{\mathbb{R}^n} \exp\left( -i \sum_{j=1}^n u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n.)$$

Some general properties of the Fourier transform:
– $\left| \varphi_X(u_1, \ldots, u_n) \right| \le \int_{\mathbb{R}^n} f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n = \varphi_X(0, \ldots, 0) = 1$;
– the mapping $(u_1, \ldots, u_n) \in \mathbb{R}^n \to \varphi_X(u_1, \ldots, u_n)$ is continuous;
– the mapping $\mathcal{F} : f_X \to \varphi_X$ is injective.

Very simple example

The random vector $X$ takes its values from within the hypercube $\Delta = [-1, 1]^n$ and it admits the probability density:
$$f_X(x_1, \ldots, x_n) = \frac{1}{2^n} 1_\Delta(x_1, \ldots, x_n)$$
(note that the components $X_j$ are independent).

$$\varphi_X(u_1, \ldots, u_n) = \frac{1}{2^n} \int_\Delta \exp\left( i(u_1 x_1 + \ldots + u_n x_n) \right) dx_1 \ldots dx_n = \frac{1}{2^n} \prod_{j=1}^n \int_{-1}^{+1} \exp\left( i u_j x_j \right) dx_j = \prod_{j=1}^n \frac{\sin u_j}{u_j}$$
where, in this last expression and thanks to the extension by continuity, we replace $\frac{\sin u_1}{u_1}$ by $1$ if $u_1 = 0$, $\frac{\sin u_2}{u_2}$ by $1$ if $u_2 = 0$, etc.
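For a single coordinate, this closed form is easy to verify by quadrature of $\mathrm{E}\,e^{iuX_j}$ for $X_j$ uniform on $[-1,1]$ (a sketch; the tested values of $u$ are arbitrary choices):

```python
import cmath
import math

def phi_numeric(u, n=20000):
    # midpoint quadrature of (1/2) * integral over [-1, 1] of exp(i u x) dx
    h = 2.0 / n
    return sum(cmath.exp(1j * u * (-1 + (k + 0.5) * h)) for k in range(n)) * h / 2

vals = {u: phi_numeric(u) for u in (0.5, 1.7, 3.0)}
```

The imaginary parts cancel by symmetry and each value matches $\sin u / u$.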
Fourier transform inversion

As shall be seen later in the work, there are excellent reasons (simplified calculations) for studying certain questions using characteristic functions rather than probability densities, but we often need to revert back to densities. The problem which arises is that of the invertibility of the Fourier transform $\mathcal{F}$, which is studied in specialized courses. It will be enough here to remember one condition.

PROPOSITION.– If $\int_{\mathbb{R}^n} \left| \varphi_X(u_1, \ldots, u_n) \right| du_1 \ldots du_n < \infty$ (i.e. $\varphi_X \in L^1(du_1 \ldots du_n)$), then $\mathcal{F}^{-1}$ exists and:
$$f_X(x_1, \ldots, x_n) = \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} \exp\left( -i \sum_{j=1}^n u_j x_j \right) \varphi_X(u_1, \ldots, u_n)\, du_1 \ldots du_n.$$
In addition, the mapping $(x_1, \ldots, x_n) \to f_X(x_1, \ldots, x_n)$ is continuous.

EXAMPLE.– Given a Gaussian r.v. $X \sim N\left( m, \sigma^2 \right)$, i.e. such that $f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2 \right)$, and assuming that $\sigma \ne 0$, we obtain:
$$\varphi_X(u) = \exp\left( ium - \frac{u^2 \sigma^2}{2} \right).$$

It is clear that $\varphi_X \in L^1(du)$ and:
$$f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \exp(-iux)\, \varphi_X(u)\, du.$$
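This inversion formula can be checked numerically on the Gaussian example: a truncated quadrature of $\frac{1}{2\pi}\int e^{-iux}\varphi_X(u)\,du$ recovers the density at a point (the parameters $m = 1$, $\sigma = 2$, the evaluation point and the truncation are arbitrary choices):

```python
import cmath
import math

m, sigma = 1.0, 2.0

def phi(u):
    # characteristic function of N(m, sigma^2)
    return cmath.exp(1j * u * m - (u * sigma) ** 2 / 2)

def f_inverted(x, U=10.0, n=20000):
    # truncated inversion integral (1/2pi) * int_{-U}^{U} exp(-iux) phi(u) du
    h = 2 * U / n
    s = sum(cmath.exp(-1j * (-U + (k + 0.5) * h) * x) * phi(-U + (k + 0.5) * h)
            for k in range(n))
    return (s * h / (2 * math.pi)).real

x0 = 0.5
recovered = f_inverted(x0)
direct = math.exp(-0.5 * ((x0 - m) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)
```

The truncation at $U = 10$ is harmless here because $\varphi_X$ decays like $e^{-2u^2}$.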
Properties and mappings of characteristic functions

1) Independence

PROPOSITION.– In order for the components $X_j$ of the random vector $X^T = (X_1, \ldots, X_n)$ to be independent, it is necessary and sufficient that:
$$\varphi_X(u_1, \ldots, u_n) = \prod_{j=1}^n \varphi_{X_j}(u_j).$$

DEMONSTRATION.– Necessary condition:
$$\varphi_X(u_1, \ldots, u_n) = \int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n$$
Thanks to the independence:
$$= \int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) \prod_{j=1}^n f_{X_j}(x_j)\, dx_1 \ldots dx_n = \prod_{j=1}^n \varphi_{X_j}(u_j).$$

Sufficient condition: we start from the hypothesis:
$$\int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n = \int_{\mathbb{R}^n} \exp\left( i \sum_{j=1}^n u_j x_j \right) \prod_{j=1}^n f_{X_j}(x_j)\, dx_1 \ldots dx_n$$
from which we deduce $f_X(x_1, \ldots, x_n) = \prod_{j=1}^n f_{X_j}(x_j)$, i.e. the independence,
since the Fourier transform $f_X \to \varphi_X$ is injective.

NOTE.– We must not confuse this result with that which concerns the sum of independent r.v., and which is stated in the following manner.

If $X_1, \ldots, X_n$ are independent r.v., then:
$$\varphi_{\sum_j X_j}(u) = \prod_{j=1}^n \varphi_{X_j}(u).$$

If there are for example $n$ independent random variables $X_1 \sim N\left( m_1, \sigma_1^2 \right), \ldots, X_n \sim N\left( m_n, \sigma_n^2 \right)$ and $n$ real constants $\lambda_1, \ldots, \lambda_n$, the note above enables us to determine the law of the random variable $\sum_{j=1}^n \lambda_j X_j$.

In effect the r.v. $\lambda_j X_j$ are independent and:
$$\varphi_{\sum_j \lambda_j X_j}(u) = \prod_{j=1}^n \varphi_{\lambda_j X_j}(u) = \prod_{j=1}^n \varphi_{X_j}\left( \lambda_j u \right) = \prod_{j=1}^n e^{i u \lambda_j m_j - \frac{1}{2} u^2 \lambda_j^2 \sigma_j^2} = e^{i u \sum_j \lambda_j m_j - \frac{1}{2} u^2 \sum_j \lambda_j^2 \sigma_j^2}$$
and thus:
$$\sum_{j=1}^n \lambda_j X_j \sim N\left( \sum_j \lambda_j m_j,\ \sum_j \lambda_j^2 \sigma_j^2 \right).$$
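This stability of the Gaussian family can be illustrated by simulation (the particular $m_j$, $\sigma_j$, $\lambda_j$, the seed and the sample size are arbitrary choices): the empirical mean and variance of $\sum_j \lambda_j X_j$ should approach $\sum_j \lambda_j m_j$ and $\sum_j \lambda_j^2 \sigma_j^2$.

```python
import random

random.seed(0)
params = [(1.0, 0.5), (-2.0, 1.0), (0.5, 2.0)]   # pairs (m_j, sigma_j)
lams = [2.0, 1.0, -1.0]                           # constants lambda_j

N = 100000
samples = [sum(lam * random.gauss(m, sig) for lam, (m, sig) in zip(lams, params))
           for _ in range(N)]

mean = sum(samples) / N
var = sum((s - mean) ** 2 for s in samples) / N

theory_mean = sum(lam * m for lam, (m, sig) in zip(lams, params))      # -0.5
theory_var = sum(lam ** 2 * sig ** 2 for lam, (m, sig) in zip(lams, params))  # 6.0
```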
2) Calculation of the moments of the components $X_j$ (up to the 2nd order, for example)

Let us assume $\varphi_X \in C^2\left( \mathbb{R}^n \right)$.

In applying Lebesgue’s theorem (whose hypotheses are immediately verifiable) once, we obtain:
$$\forall k = 1 \text{ to } n \quad \frac{\partial \varphi_X}{\partial u_k}(0, \ldots, 0) = \left( \int_{\mathbb{R}^n} i x_k \exp\left( i \sum_j u_j x_j \right) f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n \right)_{(u_1 = 0, \ldots, u_n = 0)}$$
$$= i \int_{\mathbb{R}^n} x_k f_X(x_1, \ldots, x_n)\, dx_1 \ldots dx_n = i\, \mathrm{E} X_k$$
i.e. $\mathrm{E} X_k = -i \dfrac{\partial \varphi_X}{\partial u_k}(0, \ldots, 0)$.

By applying this theorem a second time, we have:
$$\forall k, \ell \in \{1, 2, \ldots, n\} \quad \mathrm{E} X_k X_\ell = -\frac{\partial^2 \varphi_X}{\partial u_\ell\, \partial u_k}(0, \ldots, 0).$$
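These derivative formulas are easy to check by finite differences on the scalar Gaussian characteristic function (the values $m = 0.7$, $\sigma = 1.3$ and the step $h$ are arbitrary choices; the expected moments are $\mathrm{E}X = m$ and $\mathrm{E}X^2 = m^2 + \sigma^2$):

```python
import cmath

m, sigma = 0.7, 1.3

def phi(u):
    # characteristic function of N(m, sigma^2)
    return cmath.exp(1j * u * m - (u * sigma) ** 2 / 2)

h = 1e-4
# E X = -i * phi'(0), by a central first difference
d1 = (phi(h) - phi(-h)) / (2 * h)
EX = (-1j * d1).real
# E X^2 = -phi''(0), by a central second difference
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h ** 2
EX2 = (-d2).real
```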
1.4. Second order random variables and vectors

Let us begin by recalling the definitions and usual properties relative to 2nd order random variables.

DEFINITIONS.– Given $X \in L^2(dP)$ of probability density $f_X$, $\mathrm{E}X^2$ and $\mathrm{E}X$ have a value. We call variance of $X$ the expression:
$$\mathrm{Var}\, X = \mathrm{E}X^2 - (\mathrm{E}X)^2 = \mathrm{E}(X - \mathrm{E}X)^2$$

We call standard deviation of $X$ the expression $\sigma(X) = \sqrt{\mathrm{Var}\, X}$.

Now let two r.v. be $X$ and $Y \in L^2(dP)$. By using the scalar product $\langle \cdot, \cdot \rangle$ on $L^2(dP)$ defined in section 1.2 we have:
$$\mathrm{E}XY = \langle X, Y \rangle = \int_\Omega X(\omega) Y(\omega)\, dP(\omega)$$
and, if the vector $Z = (X, Y)$ admits the density $f_Z$, then:
$$\mathrm{E}XY = \int_{\mathbb{R}^2} x y\, f_Z(x, y)\, dx\, dy.$$

We have already established, by applying Schwarz’s inequality, that $\mathrm{E}XY$ actually has a value.

DEFINITION.– Given two r.v. $X, Y \in L^2(dP)$, we call the covariance of $X$ and $Y$ the expression $\mathrm{Cov}(X, Y) = \mathrm{E}XY - \mathrm{E}X\, \mathrm{E}Y$.

Some observations or easily verifiable properties:
– $\mathrm{Cov}(X, X) = \mathrm{Var}\, X$;
– $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$;
– if $\lambda$ is a real constant, $\mathrm{Var}(\lambda X) = \lambda^2\, \mathrm{Var}\, X$;
– if $X$ and $Y$ are two independent r.v., then $\mathrm{Cov}(X, Y) = 0$, but the reciprocal is not true;
– if $X_1, \ldots, X_n$ are pairwise independent r.v.:
$$\mathrm{Var}(X_1 + \ldots + X_n) = \mathrm{Var}\, X_1 + \ldots + \mathrm{Var}\, X_n$$

Correlation coefficients
The $\mathrm{Var}\, X_j$ (always positive) and the $\mathrm{Cov}(X_j, X_k)$ (positive or negative) can take extremely high algebraic values. Sometimes it is preferable to use the (normalized) “correlation coefficients”:
$$\rho(j, k) = \frac{\mathrm{Cov}(X_j, X_k)}{\sqrt{\mathrm{Var}\, X_j}\, \sqrt{\mathrm{Var}\, X_k}}$$
whose properties are as follows:

1) $\rho(j, k) \in [-1, 1]$.

In effect, let us assume (solely to simplify the expressions) that $X_j$ and $X_k$ are centered, and let us study the 2nd degree trinomial in $\lambda$:
$$T(\lambda) = \mathrm{E}\left( \lambda X_j - X_k \right)^2 = \lambda^2 \mathrm{E}X_j^2 - 2\lambda\, \mathrm{E}(X_j X_k) + \mathrm{E}X_k^2 \ge 0$$

$T(\lambda) \ge 0$ $\forall \lambda \in \mathbb{R}$ if and only if the discriminant $\Delta = \left( \mathrm{E}\, X_j X_k \right)^2 - \mathrm{E}X_j^2\, \mathrm{E}X_k^2$ is negative or zero, i.e. $\left( \mathrm{Cov}(X_j, X_k) \right)^2 \le \mathrm{Var}\, X_j\, \mathrm{Var}\, X_k$ (i.e. $\rho(j, k) \in [-1, 1]$). This is also Schwarz’s inequality.

Furthermore, we can make clear that $\rho(j, k) = \pm 1$ if and only if $\exists\, \lambda_0 \in \mathbb{R}$ such that $X_k = \lambda_0 X_j$ a.s. In effect, by replacing $X_k$ with $\lambda_0 X_j$ in the definition of $\rho(j, k)$, we obtain $\rho(j, k) = \pm 1$. Reciprocally, if $\rho(j, k) = 1$ (for example), that is to say if $\Delta = 0$, then $\exists\, \lambda_0 \in \mathbb{R}$ such that $X_k = \lambda_0 X_j$ a.s.

(If $X_j$ and $X_k$ are not centered, we replace in what has gone before $X_j$ by $X_j - \mathrm{E}X_j$ and $X_k$ by $X_k - \mathrm{E}X_k$.)

2) If $X_j$ and $X_k$ are independent, $\mathrm{E}X_j X_k = \mathrm{E}X_j\, \mathrm{E}X_k$, so $\mathrm{Cov}(X_j, X_k) = 0$ and $\rho(j, k) = 0$. However, the reciprocal is in general false, as is proven in the following example.

Let $\Theta$ be a uniform random variable on $[0, 2\pi[$, that is to say $f_\Theta(\theta) = \frac{1}{2\pi} 1_{[0, 2\pi[}(\theta)$.

In addition let two r.v. be $X_j = \sin \Theta$ and $X_k = \cos \Theta$. We can easily verify that $\mathrm{E}X_j$, $\mathrm{E}X_k$ and $\mathrm{E}X_j X_k$ are zero; thus $\mathrm{Cov}(X_j, X_k)$ and $\rho(j, k)$ are zero. However, $X_j^2 + X_k^2 = 1$ and the r.v. $X_j$ and $X_k$ are dependent.
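A simulation sketch of this “uncorrelated but dependent” pair (seed and sample size are arbitrary choices): the empirical covariance of $\sin\Theta$ and $\cos\Theta$ is near zero, while the deterministic constraint $X_j^2 + X_k^2 = 1$ holds exactly, exhibiting the dependence.

```python
import math
import random

random.seed(1)
N = 100000
xs, ys = [], []
for _ in range(N):
    theta = random.uniform(0, 2 * math.pi)
    xs.append(math.sin(theta))
    ys.append(math.cos(theta))

mx, my = sum(xs) / N, sum(ys) / N
cov = sum(x * y for x, y in zip(xs, ys)) / N - mx * my

# dependence shows up beyond covariance: sin^2 + cos^2 = 1 for every sample
max_dev = max(abs(x * x + y * y - 1) for x, y in zip(xs, ys))
```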
Second order random vectors

DEFINITION.– We say that a random vector $X^T = (X_1, \ldots, X_n)$ is second order if $X_j \in L^2(dP)$ $\forall j = 1$ to $n$.

DEFINITION.– Given a second order random vector $X^T = (X_1, \ldots, X_n)$, we call the covariance matrix of this vector the symmetric matrix:
$$\Gamma_X = \begin{pmatrix} \mathrm{Var}\, X_1 & \cdots & \mathrm{Cov}(X_1, X_n) \\ \vdots & & \vdots \\ \mathrm{Cov}(X_n, X_1) & \cdots & \mathrm{Var}\, X_n \end{pmatrix}$$

If we return to the definition of the expectation of a matrix of r.v., we see that we can express it as $\Gamma_X = \mathrm{E}\left[ (X - \mathrm{E}X)(X - \mathrm{E}X)^T \right]$.

We can also observe that $\Gamma_{X - \mathrm{E}X} = \Gamma_X$.

NOTE.– Second order complex random variables and vectors: we say that a complex random variable $X = X_1 + i X_2$ is second order if $X_1$ and $X_2 \in L^2(dP)$. The covariance of two centered second order complex random variables $X = X_1 + i X_2$ and $Y = Y_1 + i Y_2$ has a natural definition:
$$\mathrm{Cov}(X, Y) = \mathrm{E} X \overline{Y} = \mathrm{E}(X_1 + i X_2)(Y_1 - i Y_2) = \mathrm{E}(X_1 Y_1 + X_2 Y_2) + i\, \mathrm{E}(X_2 Y_1 - X_1 Y_2)$$
and the decorrelation condition is thus:
$$\mathrm{E}(X_1 Y_1 + X_2 Y_2) = \mathrm{E}(X_2 Y_1 - X_1 Y_2) = 0.$$

We say that a complex random vector $X^T = (X_1, \ldots, X_j, \ldots, X_n)$ is second order if every component $X_j = X_{1j} + i X_{2j}$, $j \in \{1, \ldots, n\}$, is a second order complex random variable. The covariance matrix of a second order complex centered random vector is defined by:
$$\Gamma_X = \begin{pmatrix} \mathrm{E}\left| X_1 \right|^2 & \cdots & \mathrm{E} X_1 \overline{X_n} \\ \vdots & & \vdots \\ \mathrm{E} X_n \overline{X_1} & \cdots & \mathrm{E}\left| X_n \right|^2 \end{pmatrix}$$

If we are not intimidated by its dense expression, we can express these definitions for non-centered complex random variables and vectors without any difficulty.

Let us return to real random vectors.

DEFINITION.– We call the symmetric matrix $\mathrm{E}\left[ X X^T \right]$ the second order moment matrix. If $X$ is centered, $\Gamma_X = \mathrm{E}\left[ X X^T \right]$.
Affine transformation of a second order vector

Let us denote the space of matrices with $p$ rows and $n$ columns as $M(p, n)$.

PROPOSITION.– Let $X^T = (X_1, \ldots, X_n)$ be a random vector of expectation vector $m^T = (m_1, \ldots, m_n)$ and of covariance matrix $\Gamma_X$. Furthermore, let there be a matrix $A \in M(p, n)$ and a certain vector $B^T = (b_1, \ldots, b_p)$.

The random vector $Y = A X + B$ possesses $A m + B$ as a mean value vector and $\Gamma_Y = A \Gamma_X A^T$ as a covariance matrix.

DEMONSTRATION.–
$$\mathrm{E}[Y] = \mathrm{E}[A X + B] = \mathrm{E}[A X] + B = A m + B.$$
In addition, noting for example that $\mathrm{E}\left[ (A X)^T \right] = \mathrm{E}\left[ X^T A^T \right] = m^T A^T$:
$$\Gamma_Y = \Gamma_{A X + B} = \Gamma_{A X} = \mathrm{E}\left[ A(X - m) \left( A(X - m) \right)^T \right] = A\, \mathrm{E}\left[ (X - m)(X - m)^T \right] A^T = A \Gamma_X A^T$$
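A small numerical sketch of $\Gamma_Y = A \Gamma_X A^T$ with plain-list matrices (the particular $\Gamma_X$, reused from the example later in this chapter, and the matrix $A$ are arbitrary choices):

```python
def matmul(A, B):
    # product of two matrices represented as lists of rows
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

Gamma_X = [[4.0, 2.0, 0.0],
           [2.0, 1.0, 0.0],
           [0.0, 0.0, 3.0]]
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]

# covariance of Y = A X + B (B does not change the covariance)
Gamma_Y = matmul(matmul(A, Gamma_X), transpose(A))
```

The result is a $2 \times 2$ symmetric matrix, as a covariance matrix must be.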
For what follows, we will also need the following easy result.

PROPOSITION.– Let $X^T = (X_1, \ldots, X_n)$ be a second order random vector, of covariance matrix $\Gamma_X$. Then:
$$\forall\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n \quad \Lambda^T \Gamma_X \Lambda = \mathrm{Var}\left( \sum_{j=1}^n \lambda_j X_j \right)$$

DEMONSTRATION.–
$$\Lambda^T \Gamma_X \Lambda = \sum_{j,k} \mathrm{Cov}(X_j, X_k)\, \lambda_j \lambda_k = \sum_{j,k} \mathrm{E}\left[ \left( X_j - \mathrm{E}X_j \right) \left( X_k - \mathrm{E}X_k \right) \right] \lambda_j \lambda_k$$
$$= \mathrm{E}\left( \sum_j \lambda_j \left( X_j - \mathrm{E}X_j \right) \right)^2 = \mathrm{E}\left( \sum_j \lambda_j X_j - \mathrm{E}\left( \sum_j \lambda_j X_j \right) \right)^2 = \mathrm{Var}\left( \sum_j \lambda_j X_j \right)$$

CONSEQUENCE.– $\forall \Lambda \in \mathbb{R}^n$ we always have $\Lambda^T \Gamma_X \Lambda \ge 0$.

Let us recall in this context the following algebraic definitions:
– if $\Lambda^T \Gamma_X \Lambda > 0$ $\forall\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \ne (0, \ldots, 0)$, we say that $\Gamma_X$ is positive definite;
– if $\exists\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \ne (0, \ldots, 0)$ such that $\Lambda^T \Gamma_X \Lambda = 0$, we say that $\Gamma_X$ is positive semi-definite.

NOTE.– In this work the notion of vector appears in two different contexts and, in order to avoid confusion, let us return for a moment to some vocabulary definitions.
1) We call a random vector of $\mathbb{R}^n$ (or random vector with values in $\mathbb{R}^n$) every n-tuple of random variables $X^T = (X_1, \ldots, X_n)$ (also written $X = (X_1, \ldots, X_n)$).

$X$ is a vector in the sense that for each $\omega \in \Omega$ we obtain an n-tuple $X(\omega) = (X_1(\omega), \ldots, X_n(\omega))$ which belongs to the vector space $\mathbb{R}^n$.

2) Every random vector $X = (X_1, \ldots, X_n)$ of $\mathbb{R}^n$ of which all the components $X_j$ belong to $L^2(dP)$ we call a second order random vector.

In this context, the components $X_j$ are themselves vectors, since they belong to the vector space $L^2(dP)$.

Thus, in what follows, when we speak of linear independence, of scalar product or of orthogonality, it is necessary to point out clearly to which vector space, $\mathbb{R}^n$ or $L^2(dP)$, we are referring.
1.5. Linear independence of vectors of $L^2(dP)$

DEFINITION.– We say that $n$ vectors $X_1, \ldots, X_n$ of $L^2(dP)$ are linearly independent if
$$\lambda_1 X_1 + \ldots + \lambda_n X_n = 0 \ \text{a.s.} \ \Rightarrow \ \lambda_1 = \ldots = \lambda_n = 0$$
(here $0$ is the zero vector of $L^2(dP)$).

DEFINITION.– We say that the $n$ vectors $X_1, \ldots, X_n$ of $L^2(dP)$ are linearly dependent if $\exists\, \lambda_1, \ldots, \lambda_n$, not all zero, such that $\lambda_1 X_1 + \ldots + \lambda_n X_n = 0$ a.s.

In particular, $X_1, \ldots, X_n$ will be linearly dependent on an event $A$ of positive probability if $\exists\, \lambda_1, \ldots, \lambda_n$, not all zero, such that $\lambda_1 X_1(\omega) + \ldots + \lambda_n X_n(\omega) = 0$ $\forall \omega \in A$.

EXAMPLE.– Given the three measurable mappings $X_1, X_2, X_3 : \left( [0, 2], \mathcal{B}([0, 2]), d\omega \right) \to \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right)$ defined by:
$$X_1(\omega) = \omega, \quad X_2(\omega) = 2\omega, \quad X_3(\omega) = 3\omega \quad \text{on } [0, 1[$$
and
$$X_1(\omega) = e^{-(\omega - 1)}, \quad X_2(\omega) = 2, \quad X_3(\omega) = -2\omega + 5 \quad \text{on } [1, 2[$$

Figure 1.6. Three random variables

The three mappings are evidently measurable and belong to $L^2(d\omega)$, so they are 3 vectors of $L^2(d\omega)$.

These 3 vectors are linearly dependent on the set $A = [0, 1[$, of probability $\frac{1}{2}$:
$$-5 X_1(\omega) + 1\, X_2(\omega) + 1\, X_3(\omega) = 0 \quad \forall \omega \in A$$
Covariance matrix and linear independence

Let $\Gamma_X$ be the covariance matrix of $X = (X_1, \ldots, X_n)$, a second order vector.

1) If $\Gamma_X$ is positive definite: $X_1^* = X_1 - \mathrm{E}X_1, \ldots, X_n^* = X_n - \mathrm{E}X_n$ are then linearly independent vectors of $L^2(dP)$.

In effect:
$$\Lambda^T \Gamma_X \Lambda = \mathrm{Var}\left( \sum_j \lambda_j X_j \right) = \mathrm{E}\left( \sum_j \lambda_j X_j - \mathrm{E}\left( \sum_j \lambda_j X_j \right) \right)^2 = \mathrm{E}\left( \sum_j \lambda_j \left( X_j - \mathrm{E}X_j \right) \right)^2$$
so $\sum_j \lambda_j \left( X_j - \mathrm{E}X_j \right) = 0$ a.s. gives $\Lambda^T \Gamma_X \Lambda = 0$, which implies, since $\Gamma_X$ is positive definite, that $\lambda_1 = \ldots = \lambda_n = 0$.

We can also say that $X_1^*, \ldots, X_n^*$ generate a hyperplane of $L^2(dP)$ of dimension $n$ that we can represent as $H\left( X_1^*, \ldots, X_n^* \right)$.

In particular, if the r.v. $X_1, \ldots, X_n$ are pairwise uncorrelated (thus a fortiori if they are stochastically independent), we have:
$$\Lambda^T \Gamma_X \Lambda = \sum_j \mathrm{Var}\, X_j \cdot \lambda_j^2 = 0 \ \Rightarrow\ \lambda_1 = \ldots = \lambda_n = 0$$
thus, in this case, $\Gamma_X$ is positive definite and $X_1^*, \ldots, X_n^*$ are again linearly independent.
NOTE.– If $\mathrm{E}\left[ X X^T \right]$, the second order moment matrix, is positive definite, then $X_1, \ldots, X_n$ are linearly independent vectors of $L^2(dP)$.

2) If now $\Gamma_X$ is positive semi-definite: $X_1^* = X_1 - \mathrm{E}X_1, \ldots, X_n^* = X_n - \mathrm{E}X_n$ are then linearly dependent vectors of $L^2(dP)$.

In effect:
$$\exists\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \ne (0, \ldots, 0) \ \text{such that} \ \Lambda^T \Gamma_X \Lambda = \mathrm{Var}\left( \sum_j \lambda_j X_j \right) = 0$$
that is to say:
$$\exists\, \Lambda^T = (\lambda_1, \ldots, \lambda_n) \ne (0, \ldots, 0) \ \text{such that} \ \sum_j \lambda_j \left( X_j - \mathrm{E}X_j \right) = 0 \ \text{a.s.}$$

EXAMPLE.– We consider $X^T = (X_1, X_2, X_3)$, a second order random vector of $\mathbb{R}^3$, admitting $m^T = (3, -1, 2)$ as mean value vector and
$$\Gamma_X = \begin{pmatrix} 4 & 2 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 3 \end{pmatrix}$$
as covariance matrix. We state that $\Gamma_X$ is positive semi-definite. In taking for example $\Lambda^T = (1, -2, 0)$, we verify that $\Lambda^T \Gamma_X \Lambda = 0$, that is $\mathrm{Var}\left( X_1 - 2X_2 + 0\, X_3 \right) = 0$, and thus $X_1^* - 2 X_2^* = 0$ a.s.

Figure 1.7. Vector $X^*(\omega)$ and vector $X^*$: when $\omega$ describes $\Omega$, the 2nd order random vector $X^*(\omega) = \left( X_1^*(\omega), X_2^*(\omega), X_3^*(\omega) \right)^T$ of $\mathbb{R}^3$ describes the vertical plane $(\Pi)$ passing through the straight line $(\Delta)$ of equation $x_1 = 2 x_2$, while the vectors $X_1^*, X_2^*, X_3^*$ of $L^2(dP)$ generate $H\left( X_1^*, X_2^*, X_3^* \right)$, a subspace of $L^2(dP)$ of dimension 2.
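The semi-definiteness claim of this example is a one-line quadratic-form computation; a sketch with plain lists:

```python
Gamma = [[4.0, 2.0, 0.0],
         [2.0, 1.0, 0.0],
         [0.0, 0.0, 3.0]]

def quad_form(lam, G):
    # Lambda^T Gamma Lambda
    n = len(G)
    return sum(lam[i] * G[i][j] * lam[j] for i in range(n) for j in range(n))

zero_direction = quad_form([1.0, -2.0, 0.0], Gamma)  # vanishes: X1* - 2 X2* = 0 a.s.
other = quad_form([1.0, 0.0, 0.0], Gamma)            # Var X1 = 4 > 0
```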
1.6. Conditional expectation (concerning random vectors with density function)

Given that $X$ is a real r.v. and $Y = (Y_1, \ldots, Y_n)$ is a real random vector, we do not assume that $X$ and $Y$ are independent, and we assume that the vector $Z = (X, Y_1, \ldots, Y_n)$ admits a probability density $f_Z(x, y_1, \ldots, y_n)$. In this section, we will use as required the notations $Y$ or $(Y_1, \ldots, Y_n)$, and $y$ or $(y_1, \ldots, y_n)$. Let us recall to begin with that $f_Y(y) = \int_{\mathbb{R}} f_Z(x, y)\, dx$.

Conditional probability

We want, for all $B \in \mathcal{B}(\mathbb{R})$ and all $(y_1, \ldots, y_n) \in \mathbb{R}^n$, to define and calculate the probability that $X \in B$ knowing that $Y_1 = y_1, \ldots, Y_n = y_n$. We denote this quantity $P\left( (X \in B) \mid (Y_1 = y_1) \cap \ldots \cap (Y_n = y_n) \right)$ or more simply $P\left( X \in B \mid y_1, \ldots, y_n \right)$.

Take note that we cannot, as in the case of discrete variables, write:
$$P\left( (X \in B) \mid (Y_1 = y_1) \cap \ldots \cap (Y_n = y_n) \right) = \frac{P\left( (X \in B) \cap (Y_1 = y_1) \cap \ldots \cap (Y_n = y_n) \right)}{P\left( (Y_1 = y_1) \cap \ldots \cap (Y_n = y_n) \right)}$$
The quotient here is indeterminate and equals $\frac{0}{0}$.

For $j = 1$ to $n$, let us note $I_j = \left[ y_j, y_j + h \right[$. We write:
$$P\left( X \in B \mid y_1, \ldots, y_n \right) = \lim_{h \to 0} P\left( (X \in B) \mid (Y_1 \in I_1) \cap \ldots \cap (Y_n \in I_n) \right)$$
$$= \lim_{h \to 0} \frac{P\left( (X \in B) \cap (Y_1 \in I_1) \cap \ldots \cap (Y_n \in I_n) \right)}{P\left( (Y_1 \in I_1) \cap \ldots \cap (Y_n \in I_n) \right)} = \lim_{h \to 0} \frac{\int_B dx \int_{I_1 \times \ldots \times I_n} f_Z(x, u_1, \ldots, u_n)\, du_1 \ldots du_n}{\int_{I_1 \times \ldots \times I_n} f_Y(u_1, \ldots, u_n)\, du_1 \ldots du_n} = \int_B \frac{f_Z(x, y)}{f_Y(y)}\, dx$$
It is thus natural to say that the conditional density of the random variable $X$ knowing $(y_1, \ldots, y_n)$ is the function:
$$x \to f(x \mid y) = \frac{f_Z(x, y)}{f_Y(y)} \quad \text{if } f_Y(y) \ne 0$$

We can disregard the set of $y$ for which $f_Y(y) = 0$, for its measure (in $\mathbb{R}^n$) is zero. Let us state that $A = \left\{ (x, y) \mid f_Y(y) = 0 \right\}$; we observe:
$$P\left( (X, Y) \in A \right) = \int_A f_Z(x, y)\, dx\, dy = \int_{\{y \mid f_Y(y) = 0\}} du \int_{\mathbb{R}} f_Z(x, u)\, dx = \int_{\{y \mid f_Y(y) = 0\}} f_Y(u)\, du = 0,$$
so $(X, Y)$ almost surely avoids the set where $f_Y$ vanishes.

Finally, we have obtained a family (indexed by the $y$ verifying $f_Y(y) > 0$) of probability densities $f(x \mid y)$ $\left( \int_{\mathbb{R}} f(x \mid y)\, dx = 1 \right)$.
Conditional expectation

Let the random vector always be $Z = (X, Y_1, \ldots, Y_n)$ of density $f_Z(x, y)$, and let $f(x \mid y)$ always be the probability density of $X$ knowing $y_1, \ldots, y_n$.

DEFINITION.– Given a measurable mapping $\Psi : \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right) \to \left( \mathbb{R}, \mathcal{B}(\mathbb{R}) \right)$, under the hypothesis $\int_{\mathbb{R}} \left| \Psi(x) \right| f(x \mid y)\, dx < \infty$ (that is to say $\Psi \in L^1\left( f(x \mid y)\, dx \right)$), we call the conditional expectation of $\Psi(X)$ knowing $(y_1, \ldots, y_n)$ the expectation of $\Psi(X)$ calculated with the conditional density $f(x \mid y) = f(x \mid y_1, \ldots, y_n)$, and we write:
$$\mathrm{E}\left( \Psi(X) \mid y_1, \ldots, y_n \right) = \int_{\mathbb{R}} \Psi(x) f(x \mid y)\, dx$$

$\mathrm{E}\left( \Psi(X) \mid y_1, \ldots, y_n \right)$ is a certain value, depending on $(y_1, \ldots, y_n)$, and we denote it $\hat{g}(y_1, \ldots, y_n)$ (this notation will be of use in Chapter 4).

DEFINITION.– We call the conditional expectation of $\Psi(X)$ with respect to $Y = (Y_1, \ldots, Y_n)$ the r.v. $\hat{g}(Y_1, \ldots, Y_n) = \mathrm{E}\left( \Psi(X) \mid Y_1, \ldots, Y_n \right)$ (also denoted $\mathrm{E}\left( \Psi(X) \mid Y \right)$) which takes the value $\hat{g}(y_1, \ldots, y_n) = \mathrm{E}\left( \Psi(X) \mid y_1, \ldots, y_n \right)$ when $(Y_1, \ldots, Y_n)$ takes the value $(y_1, \ldots, y_n)$.

NOTE.– As we do not distinguish between two r.v. that are equal a.s., we will still call the conditional expectation of $\Psi(X)$ with respect to $Y_1, \ldots, Y_n$ any r.v. $\hat{g}'(Y_1, \ldots, Y_n)$ such that $\hat{g}'(Y_1, \ldots, Y_n) = \hat{g}(Y_1, \ldots, Y_n)$ almost surely; that is to say, except possibly on a set $A$ such that $P(A) = \int_A f_Y(y)\, dy = 0$.

PROPOSITION.– If $\Psi(X) \in L^1(dP)$ (i.e. $\int_{\mathbb{R}} \left| \Psi(x) \right| f_X(x)\, dx < \infty$), then $\hat{g}(Y) = \mathrm{E}\left( \Psi(X) \mid Y \right) \in L^1(dP)$ (i.e. $\int_{\mathbb{R}^n} \left| \hat{g}(y) \right| f_Y(y)\, dy < \infty$).
DEMONSTRATION.–
$$\int_{\mathbb{R}^n} \left| \hat{g}(y) \right| f_Y(y)\, dy = \int_{\mathbb{R}^n} \left| \mathrm{E}\left( \Psi(X) \mid y \right) \right| f_Y(y)\, dy \le \int_{\mathbb{R}^n} f_Y(y)\, dy \int_{\mathbb{R}} \left| \Psi(x) \right| f(x \mid y)\, dx$$

Using Fubini’s theorem:
$$= \int_{\mathbb{R}^{n+1}} \left| \Psi(x) \right| f_Y(y) f(x \mid y)\, dx\, dy = \int_{\mathbb{R}^{n+1}} \left| \Psi(x) \right| f_Z(x, y)\, dx\, dy = \int_{\mathbb{R}} \left| \Psi(x) \right| dx \int_{\mathbb{R}^n} f_Z(x, y)\, dy = \int_{\mathbb{R}} \left| \Psi(x) \right| f_X(x)\, dx < \infty$$
Principal properties of conditional expectation

The hypotheses of integrability having been verified:
1) $\mathrm{E}\left( \mathrm{E}\left( \Psi(X) \mid Y \right) \right) = \mathrm{E}\left( \Psi(X) \right)$;
2) if $X$ and $Y$ are independent, $\mathrm{E}\left( \Psi(X) \mid Y \right) = \mathrm{E}\left( \Psi(X) \right)$;
3) $\mathrm{E}\left( \Psi(X) \mid X \right) = \Psi(X)$;
4) successive conditional expectations:
$$\mathrm{E}\left( \mathrm{E}\left( \Psi(X) \mid Y_1, \ldots, Y_n, Y_{n+1} \right) \mid Y_1, \ldots, Y_n \right) = \mathrm{E}\left( \Psi(X) \mid Y_1, \ldots, Y_n \right);$$
5) linearity:
$$\mathrm{E}\left( \lambda_1 \Psi_1(X) + \lambda_2 \Psi_2(X) \mid Y \right) = \lambda_1 \mathrm{E}\left( \Psi_1(X) \mid Y \right) + \lambda_2 \mathrm{E}\left( \Psi_2(X) \mid Y \right).$$

The demonstrations, which in general are easy, may be found in the exercises. Let us note in particular that, as far as the first property is concerned, it is sufficient to re-write the demonstration of the last proposition after stripping it of absolute values. The chapter on quadratic mean estimation will make the notion of conditional expectation more concrete.

EXAMPLE.– Let $Z = (X, Y)$ be a random couple of probability density $f_Z(x, y) = 6 x y (2 - x - y) 1_\Delta(x, y)$ where $\Delta$ is the square $[0, 1] \times [0, 1]$.

Let us calculate $\mathrm{E}(X \mid Y)$. We have successively:
– $f(y) = \int_0^1 f(x, y)\, dx = \int_0^1 6 x y (2 - x - y)\, dx$, i.e. $f(y) = \left( 4y - 3y^2 \right) 1_{[0,1]}(y)$;
– $f(x \mid y) = \dfrac{f(x, y)}{f(y)} = \dfrac{6 x (2 - x - y)}{4 - 3y} 1_{[0,1]}(x)$ with $y \in [0, 1]$;
– $\mathrm{E}(X \mid y) = \int_0^1 x f(x \mid y)\, dx \cdot 1_{[0,1]}(y) = \dfrac{5 - 4y}{2(4 - 3y)} 1_{[0,1]}(y)$.

Thus:
$$\mathrm{E}(X \mid Y) = \frac{5 - 4Y}{2(4 - 3Y)} 1_{[0,1]}(Y).$$

We also have:
$$\mathrm{E}(X) = \mathrm{E}\left( \mathrm{E}(X \mid Y) \right) = \int_0^1 \mathrm{E}(X \mid y) f(y)\, dy = \int_0^1 \frac{5 - 4y}{2(4 - 3y)} \left( 4y - 3y^2 \right) dy = \frac{7}{12}.$$
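The three steps of this worked example can be reproduced by quadrature (the grid size and the test point $y_0 = 0.3$ are arbitrary choices): the marginal $f(y_0)$, the conditional expectation $\mathrm{E}(X \mid y_0)$ and the overall mean $\mathrm{E}X = 7/12$ all come out of midpoint sums over the unit square.

```python
def f_Z(x, y):
    # joint density 6xy(2 - x - y) on the unit square
    return 6 * x * y * (2 - x - y)

n = 400
h = 1.0 / n
grid = [(i + 0.5) * h for i in range(n)]

# E X by midpoint quadrature over the unit square
EX = sum(x * f_Z(x, y) for x in grid for y in grid) * h * h

# E(X | y0) from the conditional density, against the closed form (5-4y)/(2(4-3y))
y0 = 0.3
f_y0 = sum(f_Z(x, y0) for x in grid) * h          # marginal: 4*y0 - 3*y0**2
EX_given_y0 = sum(x * f_Z(x, y0) for x in grid) * h / f_y0
closed = (5 - 4 * y0) / (2 * (4 - 3 * y0))
```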
1.7. Exercises for Chapter 1

Exercise 1.1.

Let $X$ be an r.v. of distribution function:
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{2} & \text{if } 0 \le x \le 2 \\ 1 & \text{if } x > 2 \end{cases}$$

Calculate the probabilities:
$$P\left( X^2 \le X \right);\quad P\left( X \le 2X^2 \right);\quad P\left( X + X^2 \le \tfrac{3}{4} \right).$$

Exercise 1.2.

Given the random vector $Z = (X, Y)$ of probability density $f_Z(x, y) = K \dfrac{1}{y x^4} 1_\Delta(x, y)$, where $K$ is a real constant and where
$$\Delta = \left\{ (x, y) \in \mathbb{R}^2 \mid x, y > 0;\ y \le x;\ y > \frac{1}{x} \right\},$$
determine the constant $K$ and the densities $f_X$ and $f_Y$ of the r.v. $X$ and $Y$.

Exercise 1.3.

Let $X$ and $Y$ be two independent random variables of uniform density on the interval $[0, 1]$:
1) Determine the probability density $f_Z$ of the r.v. $Z = X + Y$;
2) Determine the probability density $f_U$ of the r.v. $U = XY$.
Exercise 1.4.
Let X and Y be two independent r.v. of uniform density on the interval
[0,1] .
Determine the probability density fU of the r.v. U = X Y . Solution 1.4.
y
xy = 1
1
xy < u
A
B
0
u
x
1
U takes its values in [ 0,1] Let FU be the distribution function of
U:
– if
u ≤ 0 FU ( u ) = 0 ; if u ≥ 1 FU ( u ) = 1 ;
– if
u ∈ ]0,1[ : FU ( u ) = P (U ≤ u ) = P ( X Y ≤ u ) = P ( ( X , Y ) ∈ Bu ) ;
where Bu = A ∪ B is the cross-hatched area of the figure. Thus FU ( u ) =
∫B
u
f( X ,Y ) ( x, y ) dx dy = ∫
Bu
f X ( x ) fY ( y ) dx dy
Random Vectors
1
u
u
0
= ∫ dx dy + ∫ dx ∫ A
x
dy = u + u ∫
1 dx
= u (1 − n u )
x
u
59
.
⎛ 0 if x ∈ ]-∞,0] ∪ [1, ∞[ ⎜− nu x ∈ ]0,1[ ⎝
Finally fU ( u ) = FU′ ( u ) = ⎜
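A quick simulation (an illustration under the stated uniform assumptions; sample size arbitrary) confirms the distribution function $F_U(u) = u(1 - \ln u)$:

```python
import numpy as np

rng = np.random.default_rng(1)
u_prod = rng.random(500_000) * rng.random(500_000)   # U = XY, X and Y uniform on [0,1]

for u0 in (0.1, 0.5, 0.9):
    emp = np.mean(u_prod <= u0)          # empirical F_U(u0)
    theo = u0 * (1 - np.log(u0))         # u (1 - ln u)
    print(abs(emp - theo) < 0.005)       # True for each u0
```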
Exercise 1.5.

Under consideration are three r.v. $X, Y, Z$, independent and of the same law $N(0,1)$, that is to say admitting the same density $\dfrac{1}{\sqrt{2\pi}}\exp\Big(-\dfrac{x^2}{2}\Big)$.

Determine the probability density $f_U$ of the real random variable (r.r.v.) $U = \big(X^2 + Y^2 + Z^2\big)^{1/2}$.

Solution 1.5.

Let $F_U$ be the distribution function of $U$:

– if $u \le 0$: $F_U(u) = P\big(\big(X^2+Y^2+Z^2\big)^{1/2} \le u\big) = 0$;

– if $u > 0$: $F_U(u) = P\big((X,Y,Z) \in S_u\big)$, where $S_u$ is the ball of $\mathbb{R}^3$ centered on $(0,0,0)$ and of radius $u$:

$$F_U(u) = \iiint_{S_u} f_{(X,Y,Z)}(x,y,z)\,dx\,dy\,dz = \frac{1}{(2\pi)^{3/2}} \iiint_{S_u} \exp\Big(-\frac{1}{2}\big(x^2+y^2+z^2\big)\Big)\,dx\,dy\,dz$$

and, by passing to spherical coordinates:

$$= \frac{1}{(2\pi)^{3/2}} \int_0^{2\pi} d\theta \int_0^{\pi} \sin\varphi\,d\varphi \int_0^u r^2 \exp\Big(-\frac{r^2}{2}\Big)\,dr = \frac{1}{(2\pi)^{3/2}}\; 2\pi \cdot 2 \int_0^u r^2 \exp\Big(-\frac{r^2}{2}\Big)\,dr$$

and, as $r \to r^2\exp\big(-\frac{r^2}{2}\big)$ is continuous:

$$f_U(u) = \begin{cases} 0 & \text{if } u < 0 \\ F_U'(u) = \dfrac{2}{\sqrt{2\pi}}\; u^2 \exp\Big(-\dfrac{u^2}{2}\Big) & \text{if } u \ge 0 \end{cases}$$

Exercise 1.6.
1a) Verify that $\forall a > 0$,

$$f_a(x) = \frac{1}{\pi}\,\frac{a}{a^2 + x^2}$$

is a probability density (called Cauchy's density).

1b) Verify that the corresponding characteristic function is $\varphi_X(u) = \exp\big(-a\,|u|\big)$.

1c) Given a family of independent r.v. $X_1, \dots, X_n$ of density $f_a$, find the density of the r.v.

$$Y_n = \frac{X_1 + \dots + X_n}{n}.$$

What do we notice?

2) By considering Cauchy random variables, verify that we can have the equality $\varphi_{X+Y}(u) = \varphi_X(u)\,\varphi_Y(u)$ with $X$ and $Y$ dependent.
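The point of 1c) is that $Y_n$ has the same Cauchy density $f_a$ as each $X_j$: averaging does not concentrate the law. A simulation sketch (the scale $a = 2$ and the sample sizes are arbitrary choices) checks that $P(|Y_n| \le a) \approx 1/2$, exactly as for a single Cauchy($a$) draw, whose quartiles are $\pm a$:

```python
import numpy as np

rng = np.random.default_rng(2)
a = 2.0
# Cauchy(a) draws via the inverse CDF: x = a tan(pi (U - 1/2))
x = a * np.tan(np.pi * (rng.random((100_000, 25)) - 0.5))
y = x.mean(axis=1)              # Y_25: mean of 25 i.i.d. Cauchy(a) draws

frac = np.mean(np.abs(y) <= a)  # quartiles of Cauchy(a) are +/- a
print(abs(frac - 0.5) < 0.01)   # True: no law of large numbers here
```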
Exercise 1.7.

Show that $M = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 1 & 2 \\ 3 & 2 & 1 \end{pmatrix}$ is not a covariance matrix.

Show that $M = \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ is a covariance matrix.

Verify from this example that the property of "not being correlated with" for a family of r.v. is not transitive.
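A numerical way to see this — assuming only the standard fact that a symmetric matrix is a covariance matrix if and only if it is positive semi-definite:

```python
import numpy as np

M1 = np.array([[1, 2, 3], [2, 1, 2], [3, 2, 1]], float)
M2 = np.array([[1, 0.5, 0], [0.5, 1, 0], [0, 0, 1]], float)

print(np.linalg.eigvalsh(M1).min() < 0)    # True: a negative eigenvalue,
                                           # so M1 is not a covariance matrix
print(np.linalg.eigvalsh(M2).min() >= 0)   # True: eigenvalues 0.5, 1, 1.5
```

In $M_2$, $X_1$ is uncorrelated with $X_3$ and $X_3$ with $X_2$, yet $\mathrm{Cov}(X_1,X_2) = 0.5 \neq 0$ — the announced non-transitivity.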
Exercise 1.8.

Show that the random vector $X^T = (X_1, X_2, X_3)$ of expectation $EX^T = (7, 0, 1)$ and of covariance matrix

$$\Gamma_X = \begin{pmatrix} 10 & -1 & 4 \\ -1 & 1 & -1 \\ 4 & -1 & 2 \end{pmatrix}$$

belongs almost surely (a.s.) to a plane of $\mathbb{R}^3$.
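The claim can be watched numerically (a verification sketch, not the requested proof): $\Gamma_X$ is singular, and an eigenvector $n$ for the eigenvalue 0 satisfies $\mathrm{Var}(n^T X) = n^T\Gamma_X n = 0$, so $n^T(X - m) = 0$ a.s. — the equation of the plane:

```python
import numpy as np

gamma = np.array([[10, -1, 4], [-1, 1, -1], [4, -1, 2]], float)
m = np.array([7.0, 0.0, 1.0])

w, V = np.linalg.eigh(gamma)
n = V[:, 0]                          # eigenvector of the smallest eigenvalue
print(abs(w[0]) < 1e-9)              # True: gamma is singular
print(np.allclose(gamma @ n, 0))     # True: Var(n . X) = n^T gamma n = 0,
                                     # so X stays a.s. in the plane n.x = n.m
```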
Exercise 1.9.

We are considering the random vector $U = (X,Y,Z)$ of probability density $f_U(x,y,z) = K\,xyz\,(3-x-y-z)\,1_\Delta(x,y,z)$, where $\Delta$ is the cube $[0,1]\times[0,1]\times[0,1]$.

1) Calculate the constant $K$.

2) Calculate the conditional probability $P\Big( X \in \Big[\dfrac{1}{4}, \dfrac{1}{2}\Big] \,\Big|\, Y = \dfrac{1}{2},\, Z = \dfrac{3}{4} \Big)$.

3) Determine the conditional expectation $E\big(X^2 \mid Y, Z\big)$.
Chapter 2

Gaussian Vectors

2.1. Some reminders regarding random Gaussian vectors

DEFINITION.– We say that a real r.v. $X$ is Gaussian, of expectation $m$ and of variance $\sigma^2$, if its law of probability $P_X$:

– admits the density $f_X(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\dfrac{(x-m)^2}{2\sigma^2}\Big)$ if $\sigma^2 \neq 0$ (using a double integral calculation, for example, we can verify that $\int_{\mathbb{R}} f_X(x)\,dx = 1$);

– is the Dirac measure $\delta_m$ if $\sigma^2 = 0$.

[Figure: on the left, the bell-shaped density $f_X$ centered at $m$; on the right, the Dirac measure $\delta_m$ at $m$.]

Figure 2.1. Gaussian density and Dirac measure

If $\sigma^2 \neq 0$, we say that $X$ is a non-degenerate Gaussian r.v.

If $\sigma^2 = 0$, we say that $X$ is a degenerate Gaussian r.v.; $X$ is in this case a "certain r.v." taking the value $m$ with probability 1.

$EX = m$, $\mathrm{Var}\,X = \sigma^2$: this can be verified easily by using the probability distribution function. As we have already observed, in order to specify that an r.v. $X$ is Gaussian of expectation $m$ and of variance $\sigma^2$, we will write $X \sim N(m, \sigma^2)$.

Characteristic function of $X \sim N(m, \sigma^2)$

Let us begin by determining the characteristic function of $X_0 \sim N(0,1)$:

$$\varphi_{X_0}(u) = E\big(e^{iuX_0}\big) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{iux}\,e^{-x^2/2}\,dx.$$

We can easily see that the theorem of derivation under the sum sign can be applied:

$$\varphi'_{X_0}(u) = \frac{i}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{iux}\,x\,e^{-x^2/2}\,dx.$$

Following this by integration by parts:

$$= \frac{i}{\sqrt{2\pi}}\Bigg( \Big[-e^{iux}\,e^{-x^2/2}\Big]_{-\infty}^{+\infty} + \int_{-\infty}^{+\infty} iu\,e^{iux}\,e^{-x^2/2}\,dx \Bigg) = -u\,\varphi_{X_0}(u).$$

The resolution of the differential equation $\varphi'_{X_0}(u) = -u\,\varphi_{X_0}(u)$ with the condition $\varphi_{X_0}(0) = 1$ leads us to the solution $\varphi_{X_0}(u) = e^{-u^2/2}$.

For $X \sim N(m, \sigma^2)$ with $\sigma^2 \neq 0$, we can write:

$$\varphi_X(u) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{+\infty} e^{iux}\,e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2}\,dx.$$

By changing the variable $y = \dfrac{x-m}{\sigma}$, which brings us back to the preceding case, we obtain $\varphi_X(u) = e^{\,ium - \frac{1}{2}u^2\sigma^2}$.

If $\sigma^2 = 0$, that is to say if $P_X = \delta_m$, then $\varphi_X(u) = e^{\,ium}$ (Fourier transform in the sense of the distribution $\delta_m$), so that in all cases ($\sigma^2 \neq 0$ or $= 0$):

$$\varphi_X(u) = e^{\,ium - \frac{1}{2}u^2\sigma^2}.$$

NOTE.– Given the r.v. $X \sim N(m, \sigma^2)$, we can write:

$$f_X(x) = \frac{1}{(2\pi)^{1/2}\,(\sigma^2)^{1/2}} \exp\Big(-\frac{1}{2}\,(x-m)\big(\sigma^2\big)^{-1}(x-m)\Big)$$

$$\varphi_X(u) = \exp\Big(ium - \frac{1}{2}\,u\,\sigma^2\,u\Big)$$

These are the expressions that we will find again for Gaussian vectors.
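The closed form of $\varphi_X$ is easy to check against the empirical characteristic function of simulated data — a sketch in which $m$, $\sigma$ and $u$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
m, sigma, u = 1.5, 2.0, 0.7
x = rng.normal(m, sigma, 500_000)

emp = np.mean(np.exp(1j * u * x))                    # empirical E e^{iuX}
theo = np.exp(1j * u * m - 0.5 * u**2 * sigma**2)    # e^{ium - u^2 sigma^2 / 2}
print(abs(emp - theo) < 0.01)                        # True
```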
2.2. Definition and characterization of Gaussian vectors

DEFINITION.– We say that a real random vector $X^T = (X_1, \dots, X_n)$ is Gaussian if $\forall (a_0, a_1, \dots, a_n) \in \mathbb{R}^{n+1}$ the r.v. $a_0 + \sum_{j=1}^n a_j X_j$ is Gaussian (in this definition we can assume that $a_0 = 0$, and this will be sufficient in general).

A random vector $X^T = (X_1, \dots, X_n)$ is thus not Gaussian if we can find an $n$-tuple $(a_1,\dots,a_n) \neq (0,\dots,0)$ such that the r.v. $\sum_{j=1}^n a_j X_j$ is not Gaussian, and for this it suffices to find an $n$-tuple such that $\sum_{j=1}^n a_j X_j$ is not an r.v. of density.

EXAMPLE.– We allow ourselves an r.v. $X \sim N(0,1)$ and a discrete r.v. $\varepsilon$, independent of $X$ and such that:

$$P(\varepsilon = 1) = \frac{1}{2} \quad \text{and} \quad P(\varepsilon = -1) = \frac{1}{2}.$$

We state that $Y = \varepsilon X$. By using what has already been discussed, we will show through an exercise that, although $Y$ is an r.v. $N(0,1)$, the vector $(X,Y)$ is not a Gaussian vector.

PROPOSITION.– In order for a random vector $X^T = (X_1,\dots,X_n)$ of expectation $m^T = (m_1,\dots,m_n)$ and of covariance matrix $\Gamma_X$ to be Gaussian, it is necessary and sufficient that its characteristic function (c.f.) $\varphi_X$ be defined by:

$$\varphi_X(u_1,\dots,u_n) = \exp\Big( i\sum_{j=1}^n u_j m_j - \frac{1}{2}\,u^T \Gamma_X u \Big) \quad \big(\text{where } u^T = (u_1,\dots,u_n)\big)$$
DEMONSTRATION.–

$$\varphi_X(u_1,\dots,u_n) = E\exp\Big(i\sum_{j=1}^n u_j X_j\Big) = E\exp\Big(i\cdot 1\cdot\sum_{j=1}^n u_j X_j\Big)$$

= characteristic function of the r.v. $\sum_{j=1}^n u_j X_j$ at the value 1, that is to say $\varphi_{\sum_j u_j X_j}(1)$; and

$$\varphi_{\sum_j u_j X_j}(1) = \exp\Big( i\cdot 1\cdot E\Big(\sum_{j=1}^n u_j X_j\Big) - \frac{1}{2}\,1^2\,\mathrm{Var}\Big(\sum_{j=1}^n u_j X_j\Big) \Big)$$

if and only if the r.v. $\sum_{j=1}^n u_j X_j$ is Gaussian.

Finally, since $\mathrm{Var}\big(\sum_{j=1}^n u_j X_j\big) = u^T \Gamma_X u$, we arrive indeed at:

$$\varphi_X(u_1,\dots,u_n) = \exp\Big( i\sum_{j=1}^n u_j m_j - \frac{1}{2}\,u^T \Gamma_X u \Big).$$

NOTATION.– We can see that the characteristic function of a Gaussian vector $X$ is entirely determined when we know its expectation vector $m$ and its covariance matrix $\Gamma_X$. If $X$ is such a vector, we will write $X \sim N_n(m, \Gamma_X)$.

PARTICULAR CASE.– $m = 0$ and $\Gamma_X = I_n$ (unit matrix): $X \sim N_n(0, I_n)$ is called a standard Gaussian vector.
2.3. Results relative to independence

PROPOSITION.–

1) if the vector $X^T = (X_1,\dots,X_n)$ is Gaussian, all its components $X_j$ are thus Gaussian r.v.;

2) if the components $X_j$ of a random vector $X$ are Gaussian and independent, the vector $X$ is thus also Gaussian.

DEMONSTRATION.–

1) We write $X_j = 0 + \dots + 0 + X_j + 0 + \dots + 0$.

2)

$$\varphi_X(u_1,\dots,u_n) = \prod_{j=1}^n \varphi_{X_j}(u_j) = \prod_{j=1}^n \exp\Big(iu_j m_j - \frac{1}{2}\,u_j^2\sigma_j^2\Big),$$

which we can still express as $\exp\big(i\sum_{j=1}^n u_j m_j - \frac{1}{2}u^T\Gamma_X u\big)$ with

$$\Gamma_X = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix}.$$

NOTE.– As we will see later, "the components $X_j$ are Gaussian and independent" is not a necessary condition for the random vector $X^T = (X_1,\dots,X_j,\dots,X_n)$ to be Gaussian.

PROPOSITION.– If $X^T = (X_1,\dots,X_j,\dots,X_n)$ is a Gaussian vector of covariance $\Gamma_X$, we have the equivalence: $\Gamma_X$ diagonal $\Leftrightarrow$ the r.v. $X_j$ are independent.

DEMONSTRATION.–

$$\Gamma_X = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix} \Leftrightarrow \varphi_X(u_1,\dots,u_n) = \prod_{j=1}^n \varphi_{X_j}(u_j)$$

This is a necessary and sufficient condition of independence of the r.v. $X_j$.

Let us sum up these two simple results schematically:

– if $X^T = (X_1,\dots,X_j,\dots,X_n)$ is a Gaussian vector, its components $X_j$ are Gaussian r.v.; conversely, Gaussian components make $X$ a Gaussian vector if (sufficient condition) the r.v. $X_j$ are independent, but not in general — even if $\Gamma_X$ is diagonal;

– when $X$ is a Gaussian vector, the r.v. $X_j$ are independent $\Leftrightarrow \Gamma_X$ is diagonal; without the hypothesis "$X$ Gaussian", only "the r.v. $X_j$ are independent $\Rightarrow \Gamma_X$ is diagonal" remains.

NOTE.– A Gaussian vector $X^T = (X_1,\dots,X_j,\dots,X_n)$ is evidently of the 2nd order. In effect each component $X_j$ is Gaussian and belongs to $L^2(dP)$:

$$\int x^2\,\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-m)^2}{2\sigma^2}}\,dx < \infty.$$

We can generalize the last proposition and replace the Gaussian r.v. by Gaussian vectors.
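The example announced above ($Y = \varepsilon X$, treated again in Exercise 2.3) can be watched numerically — a sketch with arbitrary sample sizes: both marginals are $N(0,1)$ and $\mathrm{Cov}(X,Y) = 0$, yet $X+Y$ has an atom at 0, which rules out joint Gaussianity:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 200_000)
eps = rng.choice([-1.0, 1.0], 200_000)     # independent sign, P = 1/2 each
y = eps * x                                # Y is again N(0,1)

print(abs(np.mean(x * y)) < 0.02)               # True: Cov(X, Y) ~ 0
print(abs(np.mean(x + y == 0.0) - 0.5) < 0.01)  # True: P(X + Y = 0) = 1/2,
                                                # impossible for a Gaussian vector
```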
Let us consider for example three random vectors:

$$X^T = (X_1,\dots,X_n)\ ;\quad Y^T = (Y_1,\dots,Y_p)\ ;\quad Z^T = (X_1,\dots,X_n,\, Y_1,\dots,Y_p)$$

and state

$$\Gamma_Z = \begin{pmatrix} \Gamma_X & \mathrm{Cov}(X,Y) \\ \mathrm{Cov}(Y,X) & \Gamma_Y \end{pmatrix}$$

where $\mathrm{Cov}(X,Y)$ is the matrix of the coefficients $\mathrm{Cov}(X_j, Y_\ell)$ and where $\mathrm{Cov}(Y,X) = \big(\mathrm{Cov}(X,Y)\big)^T$.

PROPOSITION.– If $Z^T = (X_1,\dots,X_n,Y_1,\dots,Y_p)$ is a Gaussian vector of covariance matrix $\Gamma_Z$, we have the equivalence:

$\mathrm{Cov}(X,Y) =$ zero matrix $\Leftrightarrow X$ and $Y$ are two independent Gaussian vectors.

DEMONSTRATION.–

$$\Gamma_Z = \begin{pmatrix} \Gamma_X & 0 \\ 0 & \Gamma_Y \end{pmatrix} \Leftrightarrow \varphi_Z(u_1,\dots,u_n,u_{n+1},\dots,u_{n+p}) = \exp\Big( i\sum_{j=1}^{n+p} u_j m_j - \frac{1}{2}\,u^T \begin{pmatrix} \Gamma_X & 0 \\ 0 & \Gamma_Y \end{pmatrix} u \Big) = \varphi_X(u_1,\dots,u_n)\,\varphi_Y(u_{n+1},\dots,u_{n+p}),$$

which is a necessary and sufficient condition for the independence of the vectors $X$ and $Y$.

NOTE.– Given $Z^T = (X^T, Y^T, U^T, \dots)$ where $X, Y, U, \dots$ are r.v. or random vectors:

– "$Z$ is a Gaussian vector" is a stronger hypothesis than "$X$ Gaussian and $Y$ Gaussian and $U$ Gaussian, etc.";

– "$X$ Gaussian and $Y$ Gaussian and $U$ Gaussian, etc., and their covariances (or covariance matrices) are zero" does not imply that $Z^T = (X^T, Y^T, U^T, \dots)$ is a Gaussian vector.
vector. EXAMPLE.– Given that
X , Y , Z three r.v. ∼ N ( 0,1) , find the law of the vector
W T = (U , V ) or U = X + Y + Z and V = λ X − Y with λ ∈
( X ,Y , Z ) a, b ∈ aU + bV = ( a + λ b ) X + ( a − λ b ) Y + aZ W T = (U , V ) is a Gaussian vector.
the
independence,
the
vector
To determine this entirely we must know m = EW
W ∼ N 2 ( m, ΓW ) .
is
: because of
Gaussian
is a Gaussian r.v. Thus
and ΓW and we will have
It follows on easily:
EW T = ( EU , EV ) = ( 0, 0 )
⎛
and ΓW = ⎜
Var U
⎝ Cov (V ,U )
and
Cov (U ,V ) ⎞ ⎛ 3 λ −1 ⎞ ⎟=⎜ Var V ⎠ ⎝ λ − 1 λ 2 + 1⎟⎠
In effect:
Var U = EU 2 = E ( X + Y + Z ) = EX 2 + EY 2 + EZ 2 = 3 2
Var V = EV 2 = E ( λ X − Y ) = λ 2 EX 2 + EY 2 = λ 2 + 1 2
Cov (U , V ) = E ( X + Y + Z )( λ X − Y ) = λ EX 2 − EY 2 = λ − 1
72
Discrete Stochastic Processes and Optimal Filtering
Particular case:
λ = 1 ⇔ ΓW
diagonal ⇔ U and V are independent.
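The computed matrix $\Gamma_W$ is easy to confirm by simulation — a sketch in which $\lambda = 2$ and the sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 2.0
x, y, z = rng.normal(size=(3, 300_000))
u, v = x + y + z, lam * x - y

emp = np.cov(np.vstack([u, v]))                    # empirical covariance of (U, V)
theo = np.array([[3.0, lam - 1.0], [lam - 1.0, lam**2 + 1.0]])
print(np.allclose(emp, theo, atol=0.05))           # True
```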
2.4. Affine transformation of a Gaussian vector

We can generalize to vectors the following result on Gaussian r.v.:

If $Y \sim N(m, \sigma^2)$ then $\forall a, b \in \mathbb{R}$, $aY + b \sim N\big(am + b,\ a^2\sigma^2\big)$.

By modifying the notation a little, $N(am+b,\, a^2\sigma^2)$ becoming $N(am+b,\, a\,\mathrm{Var}Y\,a)$, we can already imagine how this result is going to extend to Gaussian vectors.

PROPOSITION.– Given a Gaussian vector $Y \sim N_n(m, \Gamma_Y)$, $A$ a matrix belonging to $M(p,n)$ and a certain vector $B \in \mathbb{R}^p$, then $AY + B$ is a Gaussian vector $\sim N_p\big(Am + B,\ A\Gamma_Y A^T\big)$.

DEMONSTRATION.–

$$AY + B = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{\ell 1} & \cdots & a_{\ell n} \\ \vdots & & \vdots \\ a_{p1} & \cdots & a_{pn} \end{pmatrix} \begin{pmatrix} Y_1 \\ \vdots \\ Y_i \\ \vdots \\ Y_n \end{pmatrix} + \begin{pmatrix} b_1 \\ \vdots \\ b_\ell \\ \vdots \\ b_p \end{pmatrix} = \begin{pmatrix} \vdots \\ \sum_{i=1}^n a_{\ell i} Y_i + b_\ell \\ \vdots \end{pmatrix}$$

– this is indeed a Gaussian vector (of dimension $p$) because every linear combination of its components is an affine combination of the r.v. $Y_1,\dots,Y_i,\dots,Y_n$, and by hypothesis $Y^T = (Y_1,\dots,Y_n)$ is a Gaussian vector;

– furthermore, we have seen that if $Y$ is a 2nd order vector:

$$E(AY+B) = A\,EY + B = Am + B \quad \text{and} \quad \Gamma_{AY+B} = A\,\Gamma_Y A^T.$$

EXAMPLE.– Given $(n+1)$ independent r.v. $Y_j \sim N(\mu, \sigma^2)$, $j = 0$ to $n$, it emerges that $Y^T = (Y_0, Y_1,\dots,Y_n) \sim N_{n+1}(m, \Gamma_Y)$ with $m^T = (\mu,\dots,\mu)$ and

$$\Gamma_Y = \begin{pmatrix} \sigma^2 & & 0 \\ & \ddots & \\ 0 & & \sigma^2 \end{pmatrix}.$$

Furthermore, given the new r.v. $X_j$ defined by:

$$X_1 = Y_0 + Y_1,\ \dots,\ X_n = Y_{n-1} + Y_n,$$

the vector $X^T = (X_1,\dots,X_n)$ is Gaussian, for

$$\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 \\ 0 & 1 & 1 & \cdots & 0 \\ & & \ddots & \ddots & \\ 0 & \cdots & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} Y_0 \\ \vdots \\ Y_n \end{pmatrix};$$

more precisely, following the preceding proposition, $X \sim N_n\big(Am,\ A\Gamma_Y A^T\big)$.

NOTE.– If in this example we assume $\mu = 0$ and $\sigma^2 = 1$, we are certain that the vector $X$ is Gaussian even though its components $X_j$ are not independent. In effect, we have for example $\mathrm{Cov}(X_1, X_2) \neq 0$ because:

$$EX_1X_2 = E(Y_0+Y_1)(Y_1+Y_2) = EY_1^2 = 1 \quad \text{and} \quad EX_1\,EX_2 = E(Y_0+Y_1)\,E(Y_1+Y_2) = 0.$$
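The identity $\Gamma_{AY+B} = A\Gamma_Y A^T$ can be checked on this sliding-sum example — a sketch with $n = 4$, $\Gamma_Y = I$ and an arbitrary sample size:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = np.zeros((n, n + 1))
for k in range(n):
    A[k, k] = A[k, k + 1] = 1.0           # X_k = Y_{k-1} + Y_k

Y = rng.normal(size=(n + 1, 400_000))     # Gamma_Y = identity
X = A @ Y                                 # Gaussian with covariance A A^T

print(np.allclose(np.cov(X), A @ A.T, atol=0.05))  # True: 2 on the diagonal,
                                                   # 1 on the first off-diagonals
```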
2.5. The existence of Gaussian vectors

NOTATION.– $u^T = (u_1,\dots,u_n)$, $x^T = (x_1,\dots,x_n)$ and $m^T = (m_1,\dots,m_n)$.

We are interested here in the existence of Gaussian vectors, that is to say the existence of laws of probability on $\mathbb{R}^n$ having Fourier transforms of the form:

$$\exp\Big( i\sum_j u_j m_j - \frac{1}{2}\,u^T\Gamma u \Big)$$

PROPOSITION.– Given a vector $m^T = (m_1,\dots,m_n)$ and a matrix $\Gamma \in M(n,n)$, symmetric and positive semi-definite, there is a unique probability $P_X$ on $\mathbb{R}^n$ of Fourier transform:

$$\int_{\mathbb{R}^n} \exp\Big(i\sum_{j=1}^n u_j x_j\Big)\,dP_X(x_1,\dots,x_n) = \exp\Big(i\sum_{j=1}^n u_j m_j - \frac{1}{2}\,u^T\Gamma u\Big)$$

In addition:

1) if $\Gamma$ is invertible, $P_X$ admits on $\mathbb{R}^n$ the density:

$$f_X(x_1,\dots,x_n) = \frac{1}{(2\pi)^{n/2}\,(\mathrm{Det}\,\Gamma)^{1/2}}\,\exp\Big(-\frac{1}{2}\,(x-m)^T\Gamma^{-1}(x-m)\Big);$$

2) if $\Gamma$ is non-invertible (of rank $r < n$), the r.v. $X_1 - m_1, \dots, X_n - m_n$ are linearly dependent. We can still say that $\omega \to X(\omega) - m$ a.s. takes its values on a hyperplane $(\Pi)$ of $\mathbb{R}^n$, or that the probability $P_X$ loads a hyperplane $(\Pi)$ and does not admit a density function on $\mathbb{R}^n$.
DEMONSTRATION.–

1) Let us begin by recalling a result from linear algebra:

$\Gamma$ being symmetric, we can find an orthonormal basis of $\mathbb{R}^n$ formed from eigenvectors of $\Gamma$; let us call $(V_1,\dots,V_n)$ this basis. By denoting the eigenvalues of $\Gamma$ as $\lambda_j$, we thus have $\Gamma V_j = \lambda_j V_j$, where the $\lambda_j$ are solutions of the equation $\mathrm{Det}(\Gamma - \lambda I) = 0$.

Some consequences. Let us first note

$$\Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix} \quad \text{and} \quad V = (V_1,\dots,V_j,\dots,V_n)$$

(where the $V_j$ are column vectors).

– $\Gamma V_j = \lambda_j V_j$ ($j = 1$ to $n$) equates to $\Gamma V = V\Lambda$ and, the matrix $V$ being orthogonal ($VV^T = V^TV = I$), $\Gamma = V\Lambda V^T$.

– Let us demonstrate that if, in addition, $\Gamma$ is invertible, the $\lambda_j$ are $\neq 0$ and $\geq 0$, and thus the $\lambda_j$ are $> 0$.

The $\lambda_j$ are $\neq 0$: in effect, $\Gamma$ being invertible,

$$0 \neq \mathrm{Det}\,\Gamma = \mathrm{Det}\,\Lambda = \prod_{j=1}^n \lambda_j.$$

The $\lambda_j$ are $\geq 0$: let us consider in effect the quadratic form $u \to u^T\Gamma u$ ($\geq 0$ since $\Gamma$ is positive semi-definite). In the basis $(V_1,\dots,V_n)$, $u$ is written $(\bar u_1,\dots,\bar u_n)$ with $\bar u_j = \langle V_j, u\rangle$, and the quadratic form is written

$$u \to (\bar u_1,\dots,\bar u_n)\,\Lambda \begin{pmatrix} \bar u_1 \\ \vdots \\ \bar u_n \end{pmatrix} = \sum_j \lambda_j\,\bar u_j^2 \geq 0,$$

from which we get the predicted result.

Let us now demonstrate the proposition.

2) Let us look at the general case, that is to say in which $\Gamma$ is not necessarily invertible (recall simply that the eigenvalues $\lambda_j$ are $\geq 0$).
Let us consider $n$ independent r.v. $Y_j \sim N(0, \lambda_j)$. We know that the vector $Y^T = (Y_1,\dots,Y_n)$ is Gaussian, as well as the vector $X = VY + m$ (proposition from the preceding section); more precisely, $X \sim N\big(m,\ \Gamma = V\Lambda V^T\big)$.

The existence of Gaussian vectors of given expectation and of given covariance matrix is thus clearly proven.

Furthermore, we have seen that if $X$ is $N_n(m, \Gamma)$, its characteristic function (Fourier transform of its law) is $\exp\big(i\sum_j u_j m_j - \frac{1}{2}u^T\Gamma u\big)$. We thus in fact have:

$$\int_{\mathbb{R}^n} \exp\Big(i\sum_j u_j x_j\Big)\,dP_X(x_1,\dots,x_n) = \exp\Big(i\sum_j u_j m_j - \frac{1}{2}\,u^T\Gamma u\Big).$$

Uniqueness of the law: this ensues from the injectivity of the Fourier transformation.
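This existence proof doubles as the standard simulation recipe: diagonalize $\Gamma$, draw independent $Y_j \sim N(0, \lambda_j)$, and set $X = VY + m$. A sketch, using the $\Gamma$ of the example further below taken with $q = 1$ (seed and sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
m = np.array([1.0, 0.0, -2.0])
gamma = np.array([[3.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])       # symmetric, positive semi-definite

lam, V = np.linalg.eigh(gamma)            # gamma = V diag(lam) V^T
# Y_j ~ N(0, lam_j) independent, then X = V Y + m ~ N(m, gamma)
Y = np.sqrt(np.clip(lam, 0, None))[:, None] * rng.normal(size=(3, 500_000))
X = V @ Y + m[:, None]

print(np.allclose(X.mean(axis=1), m, atol=0.02))   # True
print(np.allclose(np.cov(X), gamma, atol=0.05))    # True
```

The `clip` guards against tiny negative round-off in the eigenvalues when $\Gamma$ is only semi-definite.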
3) Let us conclude by clarifying the role played by the invertibility of $\Gamma$.

a) If $\Gamma$ is invertible, all the eigenvalues $\lambda_j\ (= \mathrm{Var}\,Y_j)$ are $> 0$ and the vector $Y^T = (Y_1,\dots,Y_n)$ admits the density:

$$f_Y(y_1,\dots,y_n) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi\lambda_j}}\exp\Big(-\frac{y_j^2}{2\lambda_j}\Big) = \frac{1}{(2\pi)^{n/2}\big(\prod_{j=1}^n\lambda_j\big)^{1/2}}\exp\Big(-\frac{1}{2}\,y^T\Lambda^{-1}y\Big)$$

As far as the vector $X = VY + m$ is concerned: the affine transformation $y \to x = Vy + m$ is invertible, has $y = V^{-1}(x-m)$ as its inverse and $\mathrm{Det}\,V = \pm 1$ ($V$ orthogonal) as its Jacobian. Furthermore $\prod_{j=1}^n \lambda_j = \mathrm{Det}\,\Lambda = \mathrm{Det}\,\Gamma$.

By applying the theorem on the transformation of a random vector by a $C^1$-diffeomorphism, we obtain the probability density of the vector $X$:

$$f_X(x_1,\dots,x_n) = f_Y\big(V^{-1}(x-m)\big) = \frac{1}{(2\pi)^{n/2}(\mathrm{Det}\,\Gamma)^{1/2}}\exp\Big(-\frac{1}{2}\,(x-m)^T\big(V^T\big)^{-1}\Lambda^{-1}V^{-1}(x-m)\Big)$$

and, as $\Gamma = V\Lambda V^T$:

$$f_X(x_1,\dots,x_n) = \frac{1}{(2\pi)^{n/2}(\mathrm{Det}\,\Gamma)^{1/2}}\exp\Big(-\frac{1}{2}\,(x-m)^T\Gamma^{-1}(x-m)\Big)$$

b) If rank $\Gamma = r < n$, let us rank the eigenvalues of $\Gamma$ in decreasing order: $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_r > 0$ and $\lambda_{r+1} = 0, \dots, \lambda_n = 0$.

Then $Y_{r+1} = 0$ a.s., $\dots$, $Y_n = 0$ a.s. and, almost surely, $X = VY + m$ takes its values in $(\Pi)$, the hyperplane of $\mathbb{R}^n$ image of $\varepsilon = \{y = (y_1,\dots,y_r,0,\dots,0)\}$ by the affine mapping $y \to Vy + m$.

NOTE.– Given a random vector
$X^T = (X_1,\dots,X_n) \sim N_n(m, \Gamma_X)$, and supposing that we have to calculate an expression of the form:

$$E\Psi(X) = \int_{\mathbb{R}^n} \Psi(x)\,f_X(x)\,dx = \int_{\mathbb{R}^n} \Psi(x_1,\dots,x_n)\,f_X(x_1,\dots,x_n)\,dx_1\dots dx_n$$

In general the density $f_X$, and consequently the proposed calculation, are rendered complex by the dependence of the r.v. $X_1,\dots,X_n$.

Let $\lambda_1,\dots,\lambda_n$ be the eigenvalues of $\Gamma_X$ and $V$ the orthogonal matrix which diagonalizes $\Gamma_X$.

We have $X = VY + m$ with $Y^T = (Y_1,\dots,Y_n)$, the $Y_j$ being independent and $\sim N(0,\lambda_j)$, and the proposed calculation can be carried out under the simpler form:

$$E\Psi(X) = E\Psi(VY+m) = \int_{\mathbb{R}^n} \Psi(Vy+m)\Bigg(\prod_{j=1}^n \frac{1}{\sqrt{2\pi\lambda_j}}\,e^{-\frac{y_j^2}{2\lambda_j}}\Bigg)dy_1\dots dy_n$$
EXAMPLE.–

1) The expression of a normal case: let the Gaussian vector $X^T = (X_1, X_2) \sim N_2(0, \Gamma_X)$ where $\Gamma_X = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ with $\rho \in\, ]-1,1[$.

$\Gamma_X$ is invertible and

$$f_X(x_1,x_2) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Big(-\frac{1}{2(1-\rho^2)}\big(x_1^2 - 2\rho\,x_1 x_2 + x_2^2\big)\Big).$$

[Figure: the bell-shaped graph of $f_X$ over the $(x_1,x_2)$ plane, of maximum $\frac{1}{2\pi\sqrt{1-\rho^2}}$; the intersections of the graph of $f_X$ with the horizontal planes are the ellipses $\varepsilon$ of equation $x_1^2 - 2\rho\,x_1x_2 + x_2^2 = C$ (constants).]

Figure 2.2. Example of the density of a Gaussian vector
2) We give ourselves the Gaussian vector $X^T = (X_1, X_2, X_3)$ with:

$$m^T = (1, 0, -2) \quad \text{and} \quad \Gamma = \begin{pmatrix} 3 & 0 & q \\ 0 & 1 & 0 \\ q & 0 & 1 \end{pmatrix}.$$

Because of Schwarz's inequality $\big(\mathrm{Cov}(X_1,X_3)\big)^2 \le \mathrm{Var}\,X_1\,\mathrm{Var}\,X_3$, we must suppose $|q| \le \sqrt{3}$.

We wish to study the density $f_X(x_1,x_2,x_3)$ of vector $X$.

Eigenvalues of $\Gamma$:

$$\mathrm{Det}(\Gamma - \lambda I) = \begin{vmatrix} 3-\lambda & 0 & q \\ 0 & 1-\lambda & 0 \\ q & 0 & 1-\lambda \end{vmatrix} = (1-\lambda)\big(\lambda^2 - 4\lambda + 3 - q^2\big).$$

From which we obtain the eigenvalues, ranked in decreasing order:

$$\lambda_1 = 2 + \sqrt{1+q^2}\ ,\quad \lambda_2 = 1\ ,\quad \lambda_3 = 2 - \sqrt{1+q^2}$$

a) if $|q| < \sqrt{3}$ then $\lambda_1 > \lambda_2 > \lambda_3 > 0$, $\Gamma$ is invertible and $X$ has a probability density in $\mathbb{R}^3$ given by:

$$f_X(x_1,x_2,x_3) = \frac{1}{(2\pi)^{3/2}(\lambda_1\lambda_2\lambda_3)^{1/2}}\exp\Big(-\frac{1}{2}\,(x-m)^T\Gamma^{-1}(x-m)\Big);$$

b) if $q = \sqrt{3}$, then $\lambda_1 = 4$, $\lambda_2 = 1$, $\lambda_3 = 0$ and $\Gamma$ is non-invertible, of rank 2.

Let us find the orthogonal matrix $V$ which diagonalizes $\Gamma$, by writing $\Gamma V_j = \lambda_j V_j$. For $\lambda_1 = 4$, $\lambda_2 = 1$, $\lambda_3 = 0$ we obtain respectively the eigenvectors

$$V_1 = \begin{pmatrix} \sqrt{3}/2 \\ 0 \\ 1/2 \end{pmatrix},\quad V_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},\quad V_3 = \begin{pmatrix} -1/2 \\ 0 \\ \sqrt{3}/2 \end{pmatrix}$$

and the orthogonal matrix $V = (V_1\ V_2\ V_3)$ $\big(VV^T = V^TV = I\big)$.

Given the independent r.v. $Y_1 \sim N(0,4)$ and $Y_2 \sim N(0,1)$, and given the r.v. $Y_3 = 0$ a.s., we have:

$$X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} \sqrt{3}/2 & 0 & -1/2 \\ 0 & 1 & 0 \\ 1/2 & 0 & \sqrt{3}/2 \end{pmatrix}\begin{pmatrix} Y_1 \\ Y_2 \\ 0 \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \\ -2 \end{pmatrix}$$

or, by calling $X^* = (X_1^*, X_2^*, X_3^*)^T$ the vector $X$ after centering:

$$X_1^* = \frac{\sqrt{3}}{2}\,Y_1\ ,\quad X_2^* = Y_2\ ,\quad X_3^* = \frac{1}{2}\,Y_1$$

We can further deduce that $X_3^* = \dfrac{1}{\sqrt{3}}\,X_1^*$.
[Figure: the plane $(\Pi)$ in the axes $(x_1, x_2, x_3)$, containing the axis $0x_2$ and the vector $U$.]

Figure 2.3. The plane $(\Pi)$ is the support of the probability $P_X$

Thus, the vector $X^*$ describes almost surely the plane $(\Pi)$ containing the axis $0x_2$ and the vector $U^T = (\sqrt{3}, 0, 1)$. The plane $(\Pi)$ is the support of the probability $P_X$.
Probability and conditional expectation

Let us develop a simple case as an example. Let the Gaussian vector $Z^T = (X,Y) \sim N_2(0, \Gamma_Z)$. In stating

$$\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}\,X\ \mathrm{Var}\,Y}}$$

and $\mathrm{Var}\,X = \sigma_1^2$, $\mathrm{Var}\,Y = \sigma_2^2$, the density of $Z$ is written:

$$f_Z(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\Bigg(-\frac{1}{2(1-\rho^2)}\Big(\frac{x^2}{\sigma_1^2} - 2\rho\,\frac{xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\Big)\Bigg).$$

Conditional density of $X$ knowing $Y = y$:

$$f(x \mid y) = \frac{f_Z(x,y)}{f_Y(y)} = \frac{f_Z(x,y)}{\int f_Z(x,y)\,dx} = \frac{\dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\Bigg[-\dfrac{1}{2(1-\rho^2)}\Big(\dfrac{x^2}{\sigma_1^2} - 2\rho\,\dfrac{xy}{\sigma_1\sigma_2} + \dfrac{y^2}{\sigma_2^2}\Big)\Bigg]}{\dfrac{1}{\sqrt{2\pi}\,\sigma_2}\exp\Bigg[-\dfrac{1}{2}\,\dfrac{y^2}{\sigma_2^2}\Bigg]}$$

$$= \frac{1}{\sigma_1\sqrt{2\pi}\sqrt{1-\rho^2}}\exp\Bigg[-\frac{1}{2\sigma_1^2(1-\rho^2)}\Big(x - \rho\,\frac{\sigma_1}{\sigma_2}\,y\Big)^2\Bigg]$$

$x$ being a real variable and $y$ a fixed numeric value, we can recognize a Gaussian density. More precisely: the conditional law of $X$, knowing $Y = y$, is $N\Big(\rho\,\dfrac{\sigma_1}{\sigma_2}\,y,\ \sigma_1^2\big(1-\rho^2\big)\Big)$.

We see in particular that $E(X \mid y) = \rho\,\dfrac{\sigma_1}{\sigma_2}\,y$ and that $E(X \mid Y) = \rho\,\dfrac{\sigma_1}{\sigma_2}\,Y$.

In Chapter 4, we will see more generally that if $(X, Y_1,\dots,Y_n)$ is a Gaussian vector, $E(X \mid Y_1,\dots,Y_n)$ is written in the form $\lambda_0 + \sum_{j=1}^n \lambda_j Y_j$.
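The formula $E(X \mid Y) = \rho\frac{\sigma_1}{\sigma_2}Y$ can be observed by conditioning on a thin band of $Y$-values — a sketch in which $\sigma_1$, $\sigma_2$, $\rho$, the band and the seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(8)
s1, s2, rho = 2.0, 1.0, 0.6
g1, g2 = rng.normal(size=(2, 1_000_000))
y = s2 * g1
x = s1 * (rho * g1 + np.sqrt(1 - rho**2) * g2)   # (X, Y) ~ N_2(0, Gamma_Z)

y0 = 0.8
band = np.abs(y - y0) < 0.05                     # condition on Y close to y0
print(abs(x[band].mean() - rho * (s1 / s2) * y0) < 0.05)  # True
```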
2.6. Exercises for Chapter 2

Exercise 2.1.

We are looking at a circular target $D$ of center 0 and of radius $R$, which is used for archery. The couple $Z = (X,Y)$ represents the coordinates of the point of impact of the arrow on the target support; we assume that the r.v. $X$ and $Y$ are independent and follow the same law $N(0, 4R^2)$.

1) What is the probability that the arrow reaches the target?

2) How many times must one fire the arrow in order that, with a probability $\geq 0.9$, the target is reached at least once (we give $\ln 10 \approx 2.305$)?

3) Let us assume that we fire 100 times at the target; calculate the probability that the target is reached at least 20 times. Hint: use the central limit theorem.

Solution 2.1.

1) $X$ and $Y$ being independent, the probability density of $Z = (X,Y)$ is

$$f_Z(x,y) = f_X(x)\,f_Y(y) = \frac{1}{8\pi R^2}\exp\Big(-\frac{x^2+y^2}{8R^2}\Big)$$

and

$$P(Z \in D) = \frac{1}{8\pi R^2}\iint_D \exp\Big(-\frac{x^2+y^2}{8R^2}\Big)dx\,dy.$$

Using a change from Cartesian to polar coordinates:

$$= \frac{1}{8\pi R^2}\int_0^{2\pi}d\theta\int_0^R e^{-\frac{\rho^2}{8R^2}}\,\rho\,d\rho = \frac{2\pi}{8\pi R^2}\Big[-4R^2\,e^{-\frac{\rho^2}{8R^2}}\Big]_0^R = 1 - e^{-\frac{1}{8}}$$

2) At each shot $k$, we associate a Bernoulli r.v. $U_k \sim b(p)$ defined by:

– $U_k = 1$ if the arrow reaches the target (probability $p = 1 - e^{-1/8}$);

– $U_k = 0$ if the arrow does not reach the target (probability $1-p$).

In $n$ shots, the number of impacts is given by the r.v. $U = U_1 + \dots + U_n \sim B(n,p)$ and

$$P(U \geq 1) = 1 - P(U = 0) = 1 - (1-p)^n.$$

We are thus looking for $n$ which verifies $1 - (1-p)^n \geq 0.9 \Leftrightarrow (1-p)^n \leq 0.1$, i.e.

$$n \geq -\frac{\ln 10}{\ln(1-p)} = -\frac{\ln 10}{\ln e^{-1/8}} = 8\ln 10 \approx 18.44,$$

that is to say $n \geq 19$.

3) By using the previous notations, we are looking to calculate $P(U \geq 20)$ with $U = U_1 + \dots + U_{100}$, which is to say:

$$P(U_1 + \dots + U_{100} \geq 20) = P\Bigg(\frac{U_1+\dots+U_{100} - 100\mu}{\sqrt{100}\,\sigma} \geq \frac{20 - 100\mu}{\sqrt{100}\,\sigma}\Bigg)$$

with $\mu = 1 - e^{-1/8} \approx 0.1175$ and $\sigma = \Big(\big(1-e^{-1/8}\big)\,e^{-1/8}\Big)^{1/2} \approx 0.32$, i.e.

$$P\Big(S \geq \frac{8.25}{3.2}\Big) = P(S \geq 2.58) = 1 - F_0(2.58)$$

where $S$ is an r.v. $N(0,1)$ and $F_0$ is the distribution function of the r.v. $N(0,1)$.

Finally $P(U \geq 20) = 1 - 0.9951 \approx 0.005$.
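Parts 2) and 3) can be cross-checked numerically — a sketch; the binomial simulation estimates the exact tail, to be compared with the rough CLT value $\approx 0.005$ obtained above:

```python
import numpy as np
from math import exp

p = 1 - exp(-1 / 8)          # probability that one arrow hits, ~ 0.1175

n = 1                        # part 2: smallest n with 1 - (1-p)^n >= 0.9
while 1 - (1 - p) ** n < 0.9:
    n += 1
print(n)                     # 19, consistent with n >= 8 ln 10 ~ 18.44

rng = np.random.default_rng(9)
tail = np.mean(rng.binomial(100, p, 500_000) >= 20)   # part 3: P(U >= 20)
print(tail < 0.02)           # True: a small probability, of the same order
                             # as the CLT estimate (which neglects skewness)
```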
Exercise 2.2.

Given $X_1,\dots,X_n$ $n$ independent r.v. of law $N(0,1)$ and given $2n$ real constants $a_1,\dots,a_n;\ b_1,\dots,b_n$:

1) Show that the r.v. $Y = \sum_{j=1}^n a_j X_j$ and $Z = \sum_{j=1}^n b_j X_j$ are independent if and only if $\sum_{j=1}^n a_j b_j = 0$.

2) Deduce from this that, if the r.v. $X_1,\dots,X_n$ are $n$ independent r.v. of law $N(0,1)$,

$$\bar X = \frac{1}{n}\sum_{j=1}^n X_j \quad \text{and} \quad Y_K = X_K - \bar X \quad \big(\text{where } K \in \{1,2,\dots,n\}\big)$$

are independent. For $K \neq \ell$, are $Y_K$ and $Y_\ell$ independent r.v.?

Solution 2.2.

1) $U = (Y,Z)$ is evidently a Gaussian vector ($\forall \lambda$ and $\mu \in \mathbb{R}$, the r.v. $\lambda Y + \mu Z$ is evidently a Gaussian r.v.). In order for $Y$ and $Z$ to be independent, it is thus necessary and sufficient that:

$$0 = \mathrm{Cov}(Y,Z) = EYZ = \sum_j a_j b_j\,EX_j^2 = \sum_j a_j b_j$$

2) To simplify the expression, let us take $K = 1$ as an example:

$$\bar X = \frac{1}{n}X_1 + \dots + \frac{1}{n}X_n\ ;\quad Y_1 = \Big(1 - \frac{1}{n}\Big)X_1 - \frac{1}{n}X_2 - \dots - \frac{1}{n}X_n$$

and

$$\sum_{j=1}^n a_j b_j = \frac{1}{n}\Big(1 - \frac{1}{n}\Big) - (n-1)\,\frac{1}{n^2} = 0.$$

– To simplify, let us take $K = 1$ and $\ell = 2$:

$$Y_1 = \Big(1-\frac{1}{n}\Big)X_1 - \frac{1}{n}X_2 - \dots - \frac{1}{n}X_n\ ;\quad Y_2 = -\frac{1}{n}X_1 + \Big(1-\frac{1}{n}\Big)X_2 - \dots - \frac{1}{n}X_n$$

and

$$\sum_{j=1}^n a_j b_j = -2\Big(1-\frac{1}{n}\Big)\frac{1}{n} + (n-2)\,\frac{1}{n^2} = -\frac{1}{n} < 0,$$

thus $Y_1$ and $Y_2$ are dependent.
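A simulation view of 2), with $n = 5$ an arbitrary choice: $\bar X$ is uncorrelated with — hence, being jointly Gaussian, independent of — each $Y_K$, while $\mathrm{Cov}(Y_1, Y_2) = -1/n \neq 0$:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 5
X = rng.normal(size=(n, 400_000))
Xbar = X.mean(axis=0)
Y = X - Xbar                                     # Y_K = X_K - Xbar

print(abs(np.mean(Xbar * Y[0])) < 0.005)         # True: Cov(Xbar, Y_1) = 0
print(abs(np.mean(Y[0] * Y[1]) + 1 / n) < 0.01)  # True: Cov(Y_1, Y_2) = -1/n,
                                                 # so Y_1 and Y_2 are dependent
```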
Exercise 2.3.

We give a real r.v. $X \sim N(0,1)$ and a discrete r.v. $\varepsilon$ such that

$$P(\varepsilon = -1) = \frac{1}{2} \quad \text{and} \quad P(\varepsilon = +1) = \frac{1}{2}.$$

We suppose $X$ and $\varepsilon$ independent. We state $Y = \varepsilon X$:

– by using distribution functions, verify that $Y \sim N(0,1)$;

– show that $\mathrm{Cov}(X,Y) = 0$;

– is the vector $U = (X,Y)$ Gaussian?

Solution 2.3.

1)

$$F_Y(y) = P(Y \leq y) = P(\varepsilon X \leq y) = P\Big((\varepsilon X \leq y) \cap \big((\varepsilon = 1) \cup (\varepsilon = -1)\big)\Big) = P\Big(\big((\varepsilon X \leq y) \cap (\varepsilon = 1)\big) \cup \big((\varepsilon X \leq y) \cap (\varepsilon = -1)\big)\Big)$$

Because of the incompatibility of the two events linked by the union:

$$= P\big((X \leq y) \cap (\varepsilon = 1)\big) + P\big((-X \leq y) \cap (\varepsilon = -1)\big)$$

Because of the independence of $X$ and $\varepsilon$:

$$= P(X \leq y)\,P(\varepsilon = 1) + P(-X \leq y)\,P(\varepsilon = -1) = \frac{1}{2}\big(P(X \leq y) + P(-X \leq y)\big)$$

Finally, thanks to the parity of the density of the law $N(0,1)$:

$$= P(X \leq y) = F_X(y);$$

2) $\mathrm{Cov}(X,Y) = EXY - EX\,EY = E\varepsilon X^2 - \underbrace{EX}_{0}\,E\varepsilon X = \underbrace{E\varepsilon}_{0}\,EX^2 = 0$;

3) $X + Y = X + \varepsilon X = X(1+\varepsilon)$; thus

$$P(X+Y = 0) = P\big(X(1+\varepsilon) = 0\big) = P(1+\varepsilon = 0) = \frac{1}{2}.$$

We can deduce that the r.v. $\lambda X + \mu Y$ (with $\lambda = \mu = 1$) is not Gaussian: its law admits no density, nor is it a Dirac measure $\big(P_{X+Y}(\{0\}) = \frac{1}{2}\big)$. Thus the vector $U = (X,Y)$ is not Gaussian.
Exercise 2.4.

Given a real r.v. $X \sim N(0,1)$ and given a real $a > 0$:

1) Show that the real r.v. $Y$ defined by

$$Y = \begin{cases} X & \text{if } |X| < a \\ -X & \text{if } |X| \geq a \end{cases}$$

is also a real r.v. $\sim N(0,1)$. (Hint: show the equality of the distribution functions $F_Y = F_X$.)

2) Verify that

$$\mathrm{Cov}(X,Y) = 1 - \frac{4}{\sqrt{2\pi}}\int_a^{\infty} x^2\,e^{-x^2/2}\,dx$$

Solution 2.4.

1) $F_Y(y) = P(Y \leq y) = P\Big((Y \leq y) \cap \big((|X| < a) \cup (|X| \geq a)\big)\Big)$.

Distributivity and then incompatibility $\Rightarrow$

$$= P\big((Y \leq y) \cap (|X| < a)\big) + P\big((Y \leq y) \cap (|X| \geq a)\big) = P\big((X \leq y) \cap (|X| < a)\big) + P\big((-X \leq y) \cap (|X| \geq a)\big)$$

Because $f_X(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$ is even, $-X$ has the same law as $X$ (and $|-X| = |X|$), so

$$P\big((-X \leq y) \cap (|X| \geq a)\big) = P\big((X \leq y) \cap (|X| \geq a)\big)$$

and finally

$$F_Y(y) = P\big((X \leq y) \cap (|X| < a)\big) + P\big((X \leq y) \cap (|X| \geq a)\big) = P(X \leq y) = F_X(y);$$

2) $EX = 0$ and $EY = 0$, thus:

$$\mathrm{Cov}(X,Y) = EXY = \int_{-a}^{a} x^2 f_X(x)\,dx - \int_{-\infty}^{-a} x^2 f_X(x)\,dx - \int_a^{\infty} x^2 f_X(x)\,dx$$

$$= \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx - 2\int_{-\infty}^{-a} x^2 f_X(x)\,dx - 2\int_a^{\infty} x^2 f_X(x)\,dx$$

The 1st term equals $EX^2 = \mathrm{Var}\,X = 1$. The remaining terms, because of the parity of the integrated function, sum to $-4\int_a^{\infty} x^2 f_X(x)\,dx$, from which we obtain the result.
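Using $\int_a^\infty x^2 e^{-x^2/2}\,dx = a\,e^{-a^2/2} + \sqrt{\pi/2}\,\mathrm{erfc}(a/\sqrt{2})$ (integration by parts), the formula of 2) can be verified by simulation — a sketch with $a = 1$ an arbitrary choice:

```python
import numpy as np
from math import erfc, exp, sqrt, pi

rng = np.random.default_rng(11)
a = 1.0
x = rng.normal(0, 1, 1_000_000)
y = np.where(np.abs(x) < a, x, -x)       # Y = X if |X| < a, -X otherwise

emp = np.mean(x * y)                     # = Cov(X, Y) since EX = EY = 0
# closed form of int_a^inf x^2 e^{-x^2/2} dx:
integ = a * exp(-a**2 / 2) + sqrt(pi / 2) * erfc(a / sqrt(2))
theo = 1 - 4 / sqrt(2 * pi) * integ
print(abs(emp - theo) < 0.01)            # True (theo ~ -0.60 for a = 1)
```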
Exercise 2.5.

Let $Z = \begin{pmatrix} X \\ Y \end{pmatrix}$ be a Gaussian vector of expectation vector $m = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$ and of covariance matrix $\Gamma_Z = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}$, which is to say $Z \sim N_2(m, \Gamma_Z)$.

1) Give the law of the random variable $X - 2Y$.

2) Under what conditions on the constants $a$ and $b$ is the random variable $aX + bY$ independent of $X - 2Y$ and of variance 1?

Solution 2.5.

1) $X \sim N(0,1)$ and $Y \sim N(1,1)$; as $Z$ is a Gaussian vector, $X - 2Y$ is a Gaussian r.v., of expectation $-2$ and of variance

$$\mathrm{Var}(X - 2Y) = \mathrm{Var}\,X - 4\,\mathrm{Cov}(X,Y) + 4\,\mathrm{Var}\,Y = 1 - 2 + 4 = 3;$$

precisely, $X - 2Y \sim N(-2, 3)$.

2) As $\begin{pmatrix} X - 2Y \\ aX + bY \end{pmatrix}$ is a Gaussian vector (write the definition), $X - 2Y$ and $aX + bY$ are independent $\Leftrightarrow \mathrm{Cov}(X-2Y,\, aX+bY) = 0$. Now

$$\mathrm{Cov}(X-2Y,\, aX+bY) = a\,\mathrm{Var}\,X + b\,\mathrm{Cov}(X,Y) - 2a\,\mathrm{Cov}(X,Y) - 2b\,\mathrm{Var}\,Y = a + \frac{b}{2} - a - 2b = -\frac{3}{2}\,b,$$

i.e. $b = 0$. As $1 = \mathrm{Var}(aX + bY) = \mathrm{Var}\,aX = a^2\,\mathrm{Var}\,X$: $a = \pm 1$.
Exercise 2.6.

We are looking at two independent r.v. $X$ and $Y$, and we assume that $X$ admits a probability density $f_X(x)$ and that $Y \sim N(0,1)$.

Determine the r.v. $E\big(e^{XY} \mid X\big)$.

Solution 2.6.

$$E\big(e^{XY} \mid x\big) = E\,e^{xY} = \int e^{xy}\,\frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,dy = e^{x^2/2}\int \frac{1}{\sqrt{2\pi}}\,e^{-(y-x)^2/2}\,dy$$

As $y \to \frac{1}{\sqrt{2\pi}}\,e^{-(y-x)^2/2}$ is a probability density (that of an r.v. $\sim N(x,1)$), we finally obtain

$$E\big(e^{XY} \mid X\big) = e^{X^2/2}.$$
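A one-line check of the result at a fixed value $X = x_0$ ($x_0 = 0.7$ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(12)
x0 = 0.7
y = rng.normal(0, 1, 2_000_000)
emp = np.mean(np.exp(x0 * y))                  # E(e^{x0 Y})
print(abs(emp - np.exp(x0**2 / 2)) < 0.01)     # True: equals e^{x0^2 / 2}
```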
Chapter 3

Introduction to Discrete Time Processes

3.1. Definition

A discrete time process is a family of r.v.

$$X_T = \big\{X_{t_j} \mid t_j \in T \subset \mathbb{R}\big\}$$

where $T$, called the time base, is a countable set of instants. $X_{t_j}$ is the r.v. of the family considered at the instant $t_j$. Ordinarily, the $t_j$ are uniformly spread and distant by one unit of time; in the sequel $T$ will be equal to $\mathbb{N}$, $\mathbb{Z}$ or $\mathbb{N}^*$, and the processes will still be denoted $X_T$ or, if we wish to be precise, $X_{\mathbb{N}}$, $X_{\mathbb{Z}}$ or $X_{\mathbb{N}^*}$.

In order to be able to study correctly some sets of r.v. $X_j$ of $X_T$, and not only the r.v. $X_j$ individually, it is in our interest to consider the latter as mappings defined on the same set, and this leads us to an exact definition.

DEFINITION.– Any family $X_T$ of measurable mappings

$$X_j : \omega \in (\Omega, a) \longrightarrow X_j(\omega) \in \big(\mathbb{R}, B(\mathbb{R})\big) \quad \text{with } j \in T \subset \mathbb{R}$$

is called a real discrete time stochastic process.

We also say that the process is defined on the fundamental space $(\Omega, a)$.

In general a process $X_T$ is associated with a real phenomenon, that is to say that the $X_j$ represent (random) physical, biological, etc. values — for example the intensity of electromagnetic noise coming from a certain star.

For a given $\omega$, that is to say after the phenomenon has been performed, we obtain the values $x_j = X_j(\omega)$.

DEFINITION.– $x_T = \{x_j \mid j \in T\}$ is called the realization or trajectory of the process $X_T$.

[Figure: a trajectory — the values $x_{-1}, x_0, x_1, x_2, \dots, x_j$ taken by the r.v. $X_{-1}, X_0, X_1, X_2, \dots, X_j$, plotted against the instants $t = -1, 0, 1, 2, \dots, j$.]

Figure 3.1. A trajectory
Laws

We defined the laws $P_X$ of the real random vectors $X^T = (X_1,\dots,X_n)$ in Chapter 1. These laws are measures defined on the Borel algebra $B(\mathbb{R}^n) = B(\mathbb{R}) \otimes \dots \otimes B(\mathbb{R})$ of $\mathbb{R}^n$.

The finite sets $(X_i,\dots,X_j)$ of r.v. of $X_T$ are random vectors and, as we will be employing nothing but sets such as these in the following chapters, the considerations of Chapter 1 will be sufficient for the studies that we envisage.

However, $X_T \in \mathbb{R}^T$ and in certain problems we cannot avoid the following additional sophistication:

1) construction of a $\sigma$-algebra $B\big(\mathbb{R}^T\big) = \underset{j \in T}{\otimes}\, B(\mathbb{R}_j)$ on $\mathbb{R}^T$;

2) construction of laws on $B\big(\mathbb{R}^T\big)$ (Kolmogorov's theorem).

Stationarity

DEFINITION.– We say that a process $X_{\mathbb{Z}} = \{X_j \mid j \in \mathbb{Z}\}$ is stationary if $\forall i, j, p \in \mathbb{Z}$ the random vectors $(X_i,\dots,X_j)$ and $(X_{i+p},\dots,X_{j+p})$ have the same law, i.e. $\forall B_i,\dots,B_j \in B(\mathbb{R})$ (in the drawing the Borelians are intervals):

$$P\big((X_{i+p} \in B_i) \cap \dots \cap (X_{j+p} \in B_j)\big) = P\big((X_i \in B_i) \cap \dots \cap (X_j \in B_j)\big)$$

[Figure: the instants $i, i+1, \dots, j$ and their translates $i+p, i+1+p, \dots, j+p$ on the time axis $t$.]
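Stationarity makes all translated blocks share one law, hence in particular the same moments. A sketch on the process $X_j = e_j + \frac{1}{2}e_{j-1}$, with $e_j$ i.i.d. $N(0,1)$ — a standard stationary example, not one from the book: covariances computed at different dates coincide:

```python
import numpy as np

rng = np.random.default_rng(13)
e = rng.normal(size=(20_000, 51))
X = e[:, 1:] + 0.5 * e[:, :-1]        # many trajectories of 50 instants

c = lambda i, j: np.mean(X[:, i] * X[:, j])   # the process is centered

print(abs(c(3, 3) - c(30, 30)) < 0.08)   # True: both ~ 1 + 0.25 = 1.25
print(abs(c(3, 4) - c(30, 31)) < 0.08)   # True: both ~ 0.5
print(abs(c(3, 5)) < 0.05)               # True: zero beyond lag 1
```

Only the deviation $j - i$ matters, which anticipates the wide sense notion below.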
Wide sense stationarity

DEFINITION.– We say that a process $X_T$ is centered if $EX_j = 0 \quad \forall j \in T$.

DEFINITION.– We say that a process $X_T$ is of the second order if $X_j \in L^2(dP) \quad \forall j \in T$.

Let us remember that if $X_j \in L^2 \ \forall j \in T$, then $X_j \in L^1$ and $\forall i, j \in T$, $E|X_i X_j| < \infty$. Thus, the following definition is meaningful.

DEFINITION.– Given $X_\mathbb{Z}$ a real 2nd order process, we call the covariance function of this process the mapping:
$$\Gamma : i, j \mapsto \Gamma(i, j) = \mathrm{Cov}(X_i, X_j)$$
We call the autocorrelation function of this process the mapping:
$$R : i, j \mapsto R(i, j) = E X_i X_j$$
These two mappings obviously coincide if $X_\mathbb{Z}$ is centered. We can recognize here notions introduced in the context of random vectors, but here, as the indices $\dots, i, \dots, j, \dots$ represent instants, we can expect in general that when the deviations $|i - j|$ increase, the values $\Gamma(i, j)$ and $R(i, j)$ decrease.

DEFINITION.– We say that the process $X_\mathbb{Z}$ is wide sense stationary (WSS) if:
– it is of the 2nd order;
– the mapping $j \mapsto m(j) = EX_j$ is constant;
– $\forall i, j, p \in \mathbb{Z}$, $\Gamma(i+p, j+p) = \Gamma(i, j)$.

In this case $\Gamma(i, j)$ is instead written $C(j - i)$.

Relationship linking the two types of stationarity

A stationary process is not necessarily of the 2nd order, as we see with a process $X_\mathbb{Z}$ in which we choose for the $X_j$ independent r.v. with Cauchy's law:
$$f_{X_j}(x) = \frac{a}{\pi\,(a^2 + x^2)} \quad (a > 0)$$
for which $EX_j$ and $EX_j^2$ are not defined.
A "stationary process which is also of the 2nd order" (or a process of the 2nd order which is also stationary) must not be confused with a WSS process. It is clear that if a process of the 2nd order is stationary, it is then WSS. In effect:
$$EX_{j+p} = \int_\mathbb{R} x\, dP_{X_{j+p}}(x) = \int_\mathbb{R} x\, dP_{X_j}(x) = EX_j$$
and:
$$\Gamma(i+p, j+p) = \int_{\mathbb{R}^2} xy\, dP_{X_{i+p}, X_{j+p}}(x, y) - EX_{i+p}\, EX_{j+p} = \int_{\mathbb{R}^2} xy\, dP_{X_i, X_j}(x, y) - EX_i\, EX_j = \Gamma(i, j)$$
The inverse implication "wide sense stationarity (WSS) $\Rightarrow$ stationarity" is false in general. However, it is true in the case of Gaussian processes.

Ergodicity

Let $X_\mathbb{Z}$ be a WSS process.

DEFINITION.– We say that the expectation of $X_\mathbb{Z}$ is ergodic if:
$$EX_0 = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{N} X_j(\omega) \quad \text{a.s. (almost surely)}$$
We say that the autocorrelation function of $X_\mathbb{Z}$ is ergodic if:
$$\forall n \in \mathbb{Z} \qquad K(j, j+n) = E X_j X_{j+n} = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{N} X_j(\omega)\, X_{j+n}(\omega) \quad \text{a.s.}$$

That is to say, except possibly for $\omega$ in a set $\mathcal{N}$ of zero probability, or equivalently with the exception of trajectories whose probability of appearing is zero, we have for any trajectory $x$:
$$EX_0 = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{+N} x_j \quad \text{(ergodicity of 1st order)}$$
$$E X_j X_{j+n} = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{+N} x_j\, x_{j+n} \quad \text{(ergodicity of 2nd order)}$$

With the condition that the process $X_\mathbb{Z}$ is ergodic, we can then replace a mathematical expectation by a mean in time.

Here is a sufficient condition of ergodicity of 1st order.

PROPOSITION (Strong law of large numbers).– If the $X_j$ ($j \in \mathbb{Z}$) form a sequence of independent r.v. of the same law and if $E|X_0| < \infty$, then:
$$EX_0 = \lim_{N \uparrow \infty} \frac{1}{2N+1} \sum_{j=-N}^{+N} X_j(\omega) \quad \text{a.s.}$$
NOTE.– Let us suppose that the r.v. $X_j$ are independent Cauchy r.v. of probability density $\frac{a}{\pi(a^2 + x^2)}$ ($a > 0$). By using the characteristic function technique, we can verify that the r.v. $Y_N = \frac{1}{2N+1} \sum_{j=-N}^{+N} X_j$ has the same law as $X_0$; thus $Y_N$ cannot converge a.s. to a constant $EX_0$. Indeed $E|X_0| = +\infty$, so the strong law does not apply.
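The contrast between the two cases can be illustrated numerically. The sketch below (assuming NumPy is available; the distributions and sample sizes are illustrative choices, not from the text) compares the running mean of an integrable i.i.d. sequence, which settles on the expectation, with the running mean of i.i.d. Cauchy r.v., which keeps the same Cauchy law for every $N$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# i.i.d. r.v. with E|X_0| < infinity: the strong law applies and the
# running time average converges a.s. to EX_0 (here EX_0 = 3)
gauss = rng.normal(loc=3.0, scale=1.0, size=N)
gauss_means = np.cumsum(gauss) / np.arange(1, N + 1)

# i.i.d. standard Cauchy r.v.: E|X_0| = +infinity; the running mean Y_N
# has the same Cauchy law as X_0 for every N and never settles down
cauchy = rng.standard_cauchy(size=N)
cauchy_means = np.cumsum(cauchy) / np.arange(1, N + 1)
```

Plotting `gauss_means` against `cauchy_means` makes the failure of ergodicity in the Cauchy case visible: the former flattens near 3, the latter keeps jumping at every scale.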
EXAMPLE.– We are looking at the process $X_\mathbb{Z}$ which consists of the r.v. $X_j = A\cos(\lambda j + \Theta)$, where $A$ is a real constant and where $\Theta$ is an r.v. of uniform probability density $f_\Theta(\theta) = \frac{1}{2\pi} 1_{[0, 2\pi[}(\theta)$. Let us verify that $X_\mathbb{Z}$ is a WSS process:
$$EX_j = \int_0^{2\pi} A\cos(\lambda j + \theta)\, f_\Theta(\theta)\, d\theta = \frac{A}{2\pi} \int_0^{2\pi} \cos(\lambda j + \theta)\, d\theta = 0$$
$$\Gamma(i, j) = K(i, j) = E X_i X_j = \int_0^{2\pi} A\cos(\lambda i + \theta)\, A\cos(\lambda j + \theta)\, f_\Theta(\theta)\, d\theta = \frac{A^2}{2\pi} \int_0^{2\pi} \cos(\lambda i + \theta) \cos(\lambda j + \theta)\, d\theta = \frac{A^2}{2} \cos\big(\lambda(j - i)\big)$$
and $X_\mathbb{Z}$ is in fact WSS.
Keeping with this example, we are going to verify the ergodicity.

Ergodicity of expectation
$$\lim_N \frac{1}{2N+1} \sum_{j=-N}^{+N} A\cos(\lambda j + \theta) \quad (\text{with } \theta \text{ fixed} \in [0, 2\pi[)$$
$$= \lim_N \frac{A\cos\theta}{2N+1} \sum_{j=-N}^{N} \cos\lambda j = \lim_N \frac{2A\cos\theta}{2N+1} \Big( \sum_{j=0}^{N} \cos\lambda j - \frac{1}{2} \Big)$$
$$= \lim_N \frac{2A\cos\theta}{2N+1} \Big( \mathrm{Re} \sum_{j=0}^{N} e^{i\lambda j} - \frac{1}{2} \Big) = \lim_N \frac{2A\cos\theta}{2N+1} \Big( \mathrm{Re}\, \frac{1 - e^{i\lambda(N+1)}}{1 - e^{i\lambda}} - \frac{1}{2} \Big)$$

If $\lambda \neq 2k\pi$, the parenthesis is bounded, and the limit is zero and equal to $EX_0$. Therefore, the expectation is ergodic.

Ergodicity of the autocorrelation function
$$\lim_N \frac{1}{2N+1} \sum_{j=-N}^{+N} A\cos(\lambda j + \theta)\, A\cos\big(\lambda(j+n) + \theta\big) \quad (\text{with } \theta \text{ fixed} \in [0, 2\pi[)$$
$$= \lim_N \frac{A^2}{2N+1} \sum_{j=-N}^{+N} \cos(\lambda j + \theta)\, \cos\big(\lambda(j+n) + \theta\big)$$
$$= \lim_N \frac{A^2}{2}\, \frac{1}{2N+1} \sum_{j=-N}^{+N} \big( \cos(\lambda(2j+n) + 2\theta) + \cos\lambda n \big)$$
$$= \lim_N \frac{A^2}{2}\, \frac{1}{2N+1}\, \mathrm{Re}\Big( e^{i(\lambda n + 2\theta)} \sum_{j=-N}^{+N} e^{i 2\lambda j} \Big) + \frac{A^2}{2} \cos\lambda n$$

The first limit is still zero and $\frac{A^2}{2} \cos\lambda n = K(j, j+n)$. Thus, the autocorrelation function is ergodic.
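The two time averages above can be checked numerically on a single simulated trajectory. The following sketch (NumPy assumed; the values of $A$, $\lambda$, $n$ and the window size are illustrative choices) computes the time mean and the time-averaged autocorrelation and compares them with $EX_0 = 0$ and $\frac{A^2}{2}\cos\lambda n$:

```python
import numpy as np

rng = np.random.default_rng(0)
A, lam = 2.0, 0.9                 # amplitude and pulsation, with lam != 2*k*pi
theta = rng.uniform(0, 2*np.pi)   # one draw of Theta fixes one trajectory

N = 200_000
j = np.arange(-N, N + 1)
x = A * np.cos(lam * j + theta)   # trajectory x_j = A cos(lam*j + theta)

time_mean = x.mean()              # (1/(2N+1)) sum_j x_j  ->  EX_0 = 0
n = 3
time_acf = np.mean(x[:-n] * x[n:])          # time average of x_j x_{j+n}
theory = 0.5 * A**2 * np.cos(lam * n)       # K(j, j+n) = (A^2/2) cos(lam*n)
```

With $\lambda \neq 2k\pi$ the oscillating sums stay bounded, so both time averages agree with the ensemble values to within $O(1/N)$.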
Two important processes in signal processing

Markov process

DEFINITION.– We say that $X_\mathbb{Z}$ is a discrete Markov process if:
– $\forall B \in \mathcal{B}(\mathbb{R})$;
– $\forall t_1, \dots, t_{j+1} \in \mathbb{Z}$ with $t_1 < t_2 < \dots < t_j < t_{j+1}$;
– $\forall x_1, \dots, x_{j+1} \in \mathbb{R}$:
$$P\big(X_{t_{j+1}} \in B \mid X_{t_j} = x_j, \dots, X_{t_1} = x_1\big) = P\big(X_{t_{j+1}} \in B \mid X_{t_j} = x_j\big),$$
an equality that more briefly can be written:
$$P\big(X_{t_{j+1}} \in B \mid x_j, \dots, x_1\big) = P\big(X_{t_{j+1}} \in B \mid x_j\big).$$

We can say that if $t_j$ represents the present instant, then for the study of $X_\mathbb{Z}$ towards the future (instants $> t_j$), the information $\{(X_{t_j} = x_j), \dots, (X_{t_1} = x_1)\}$ brings nothing more than the information $(X_{t_j} = x_j)$.
Markov processes are often associated with phenomena beginning at instant $0$ for example, and we thus choose the probability law $\Pi_0$ of the r.v. $X_0$.

The conditional probabilities $P\big(X_{t_{j+1}} \in B \mid x_j\big)$ are called transition probabilities.

In what follows, we suppose $t_j = j$.

DEFINITION.– We say that the transition probability is stationary if $P\big(X_{j+1} \in B \mid x_j\big)$ is independent of $j$ $\big(= P(X_1 \in B \mid x_0)\big)$.
Here is an example of a Markov process that is often met in practice. $X_\mathbb{Z}$ is defined by the r.v. $X_0$ and the recurrence relation $X_{j+1} = f(X_j, N_j)$, where the $N_j$ are independent r.v. which are also independent of the r.v. $X_0$, and where $f : \mathbb{R}^2 \to \mathbb{R}$ is a Borel function.

Thus, let us show that $\forall B \in \mathcal{B}(\mathbb{R})$:
$$P\big(X_{j+1} \in B \mid x_j, x_{j-1}, \dots, x_0\big) = P\big(X_{j+1} \in B \mid x_j\big)$$
$$\Leftrightarrow P\big(f(X_j, N_j) \in B \mid x_j, x_{j-1}, \dots, x_0\big) = P\big(f(X_j, N_j) \in B \mid x_j\big)$$
$$\Leftrightarrow P\big(f(x_j, N_j) \in B \mid x_j, x_{j-1}, \dots, x_0\big) = P\big(f(x_j, N_j) \in B \mid x_j\big)$$

This equality will be verified if the r.v. $N_j$ is independent of $(X_{j-1} = x_{j-1}) \cap \dots \cap (X_0 = x_0)$.
Now the recurrence relation leads us to expressions of the form:
$$X_1 = f(X_0, N_0), \quad X_2 = f(X_1, N_1) = f\big(f(X_0, N_0), N_1\big) = f_2(X_0, N_0, N_1), \quad \dots, \quad X_j = f_j(X_0, N_0, \dots, N_{j-1})$$
which proves that $N_j$, being independent of $X_0, N_0, \dots, N_{j-1}$, is also independent of $X_0, X_1, \dots, X_{j-1}$ (and even of $X_j$).
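The construction $X_{j+1} = f(X_j, N_j)$ and its unrolled form $X_j = f_j(X_0, N_0, \dots, N_{j-1})$ can be sketched in a few lines. The particular Borel function below, $f(x, n) = x/2 + n$, is a hypothetical choice for illustration (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, n):
    # a hypothetical Borel function f: R^2 -> R, here f(x, n) = x/2 + n
    return 0.5 * x + n

steps = 10
noise = rng.normal(size=steps)   # N_0, ..., N_{steps-1}, independent of X_0
X = np.empty(steps + 1)
X[0] = rng.normal()              # the initial r.v. X_0
for j in range(steps):
    X[j + 1] = f(X[j], noise[j])     # X_{j+1} = f(X_j, N_j)

# unrolling the recurrence exhibits X_j = f_j(X_0, N_0, ..., N_{j-1})
unrolled = 0.5**steps * X[0] + sum(0.5**(steps - 1 - k) * noise[k]
                                   for k in range(steps))
```

Since `unrolled` depends only on $X_0$ and $N_0, \dots, N_{j-1}$, the independence of $N_j$ from the past trajectory follows exactly as in the argument above.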
Gaussian process

DEFINITION.– We say that a process $X_\mathbb{Z}$ is Gaussian if $\forall S = (i, \dots, j) \subset \mathbb{Z}$, the random vector $X_S = (X_i, \dots, X_j)$ is a Gaussian vector, which as we will remember is denoted $X_S \sim N_n(m_S, \Gamma_{X_S})$.

We see in particular that as soon as we know that a process $X_\mathbb{Z}$ is Gaussian, its law is entirely determined by its expectation function $j \mapsto m(j)$ and its covariance function $i, j \mapsto \Gamma(i, j)$. Such a process is denoted $X \sim N\big(m(j), \Gamma(i, j)\big)$.

A Gaussian process is obviously of the 2nd order; furthermore, if it is a WSS process it is then stationary, and to realize this it is sufficient to write the probability density:
$$f_{X_S}(x_i, \dots, x_j) = \frac{1}{(2\pi)^{\frac{j-i+1}{2}}\, (\mathrm{Det}\, \Gamma_{X_S})^{\frac{1}{2}}} \exp\Big( -\frac{1}{2} (x - m_S)^T\, \Gamma_S^{-1}\, (x - m_S) \Big)$$
of whatever vector $X_S$ is extracted from the process.
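Because the law of a Gaussian process is fixed by $m(j)$ and $\Gamma(i, j)$, one can sample any extracted vector $X_S$ directly from these two functions. A minimal sketch, assuming NumPy and using the illustrative covariance $\Gamma(i,j) = e^{-|j-i|}$ (chosen for this example, and positive definite), factors $\Gamma = L L^T$ and applies $L$ to standard normals:

```python
import numpy as np

rng = np.random.default_rng(3)
S = np.arange(8)                       # indices (i, ..., j) of the extracted vector
m_S = np.zeros(len(S))                 # expectation function m(j) = 0 (centered)
# illustrative covariance function Gamma(i, j) = e^{-|j-i|}
Gamma = np.exp(-np.abs(S[:, None] - S[None, :]))

L = np.linalg.cholesky(Gamma)          # Gamma = L L^T
X_S = m_S + L @ rng.normal(size=len(S))  # one sample of X_S ~ N_n(m_S, Gamma)
```

Averaging the outer products of many such samples would reproduce `Gamma`, which is one way to check a simulator of a Gaussian process against its covariance function.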
Linear space associated with a process

Given $X$ a WSS process, we denote by $H_X$ the family of finite linear combinations of the r.v. of $X$. That is to say:
$$H_X = \Big\{ \sum_{j \in S} \lambda_j X_j \;\Big|\; S \text{ finite} \subset \mathbb{Z} \Big\}$$

DEFINITION.– We call linear space associated with the process $X$ the family $H_X$ augmented by the limits in $L^2$ of the elements of $H_X$. The linear space is denoted $\bar{H}_X$.

NOTES.–
1) $H_X \subset \bar{H}_X \subset L^2(dP)$ and $\bar{H}_X$ is a closed vector space of $L^2(dP)$.
2) Let us suppose that $X$ is a stationary Gaussian process. All the linear combinations of the r.v. $X_j$ of $X$ are Gaussian, and the limits in $L^2$ are equally Gaussian. In effect, we easily verify that if a set of r.v. $X_n \sim N(m_n, \sigma_n^2)$ converges in $L^2$ towards an r.v. $X$ of expectation $m$ and of variance $\sigma^2$, then $m_n$ and $\sigma_n^2$ converge towards $m$ and $\sigma^2$ respectively, and $X \sim N(m, \sigma^2)$.
Delay operator

The process $X$ being given, we examine the operators $T^n$ ($n \in \mathbb{N}^*$) on $H_X$ defined by:
$$T^n : \sum_{j \in S} \lambda_j X_j \mapsto \sum_{j \in S} \lambda_j X_{j-n} \quad (S \text{ finite} \subset \mathbb{Z})$$
DEFINITION.– $T^n$ is called the delay operator of order $n$.
Properties of the delay operator:
– $T^n$ is linear from $H_X$ into $H_X$;
– $\forall n, m \in \mathbb{N}^*$, $T^n T^m = T^{n+m}$;
– $T^n$ conserves the scalar product of $L^2$, that is to say $\forall I$ and $J$ finite $\subset \mathbb{Z}$:
$$\Big\langle T^n\Big(\sum_{i \in I} \lambda_i X_i\Big),\; T^n\Big(\sum_{j \in J} \mu_j X_j\Big) \Big\rangle = \Big\langle \sum_{i \in I} \lambda_i X_i,\; \sum_{j \in J} \mu_j X_j \Big\rangle.$$

EXTENSION.– $T^n$ extends to all of $\bar{H}_X$ in the following way. Let $Z \in \bar{H}_X$ and let $Z_p \in H_X$ be a sequence of r.v. which converges towards $Z$ in $L^2$; $Z_p$ is in particular a Cauchy sequence of $H_X$ and, by isometry of $T^n$, $T^n(Z_p)$ is also a Cauchy sequence of $H_X$, which, since $\bar{H}_X$ is complete, converges in $\bar{H}_X$. It is simple to verify that $\lim_p T^n(Z_p)$ is independent of the particular sequence $Z_p$ which converges towards $Z$. It is natural to state:
$$\forall Z \in \bar{H}_X \qquad T^n(Z) = \lim_p T^n(Z_p)$$
where $Z_p \in H_X$ is any sequence which converges towards $Z$.

DEFINITION.– We can also say that $\bar{H}_X$ is the space generated by the process $X$.

3.2. WSS processes and spectral measure
In this section it will be interesting to note the influence of the temporal spacing between the r.v. on the spectral density. For this reason we momentarily consider a WSS process $X_\theta = \{X_{j\theta} \mid j \in \mathbb{Z}\}$, where $\theta$ is a constant and $j\theta$ has the significance of a duration.
3.2.1. Spectral density
DEFINITION.– We say that the process $X_\theta$ possesses a spectral density if its covariance $C(n\theta) = C\big((j-i)\theta\big) = E X_{i\theta} X_{j\theta} - E X_{i\theta}\, E X_{j\theta}$ can be written in the form:
$$C(n\theta) = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} \exp\big(2i\pi(n\theta)u\big)\, S_{XX}(u)\, du$$
$S_{XX}(u)$ is then called the spectral density of the process $X_\theta$.

PROPOSITION.– Under the hypothesis $\sum_{n=-\infty}^{+\infty} |C(n\theta)| < \infty$:
1) the process $X_\theta$ admits a spectral density $S_{XX}$;
2) $S_{XX}$ is continuous, periodic of period $\frac{1}{\theta}$, real and even.
Figure 3.2. Covariance function and spectral density of a process
NOTE.– The covariance function $C$ is not defined (and in particular does not equal zero) outside the values $n\theta$.

DEMONSTRATION.– Taking into account the hypotheses, the series:
$$\sum_{p=-\infty}^{+\infty} C(p\theta)\, \exp\big(-2i\pi(p\theta)u\big)$$
converges uniformly on $\mathbb{R}$ and defines a continuous and $\frac{1}{\theta}$-periodic function $S(u)$. Furthermore:
$$\int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} \sum_{p=-\infty}^{+\infty} C(p\theta)\, \exp\big(-2i\pi(p\theta)u\big)\, \exp\big(2i\pi(n\theta)u\big)\, du = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} S(u)\, \exp\big(2i\pi(n\theta)u\big)\, du$$

The uniform convergence and the orthogonality in $L^2\big(-\frac{1}{2\theta}, \frac{1}{2\theta}\big)$ of the complex exponentials enable us to conclude that:
$$C(n\theta) = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} \exp\big(2i\pi(n\theta)u\big)\, S(u)\, du \quad \text{and that} \quad S_{XX}(u) = S(u).$$

To finish, $C(n\theta)$ is a covariance function, thus:
$$C(-n\theta) = C(n\theta)$$
and we can deduce from this that $S_{XX}(u) = \sum_{p=-\infty}^{+\infty} C(p\theta)\, \exp\big(-2i\pi(p\theta)u\big)$ is real and even (we also have $S_{XX}(u) = C(0) + 2\sum_{p=1}^{\infty} C(p\theta) \cos 2\pi(p\theta)u$).

EXAMPLE.– The covariance $C(n\theta) = \sigma^2 e^{-\lambda|n|\theta}$ ($\lambda > 0$) of a process $X_\theta$ in fact verifies the condition of the proposition, and $X_\theta$ admits the spectral density:
$$S_{XX}(u) = \sigma^2 \sum_{n=-\infty}^{+\infty} e^{-\lambda|n|\theta - 2i\pi(n\theta)u} = \sigma^2 \Big( \sum_{n=0}^{\infty} e^{-\lambda n\theta - 2i\pi(n\theta)u} + \sum_{n=0}^{\infty} e^{-\lambda n\theta + 2i\pi(n\theta)u} - 1 \Big)$$
$$= \sigma^2 \Big( \frac{1}{1 - e^{-\lambda\theta - 2i\pi\theta u}} + \frac{1}{1 - e^{-\lambda\theta + 2i\pi\theta u}} - 1 \Big) = \sigma^2\, \frac{1 - e^{-2\lambda\theta}}{1 + e^{-2\lambda\theta} - 2 e^{-\lambda\theta} \cos 2\pi\theta u}$$
White noise

DEFINITION.– We say that a centered WSS process $X_\theta$ is a white noise if its covariance function $C(n\theta) = C\big((j-i)\theta\big) = E X_{i\theta} X_{j\theta}$ verifies:
$$C(0) = E X_{j\theta}^2 = \sigma^2 \quad \forall j \in \mathbb{Z}, \qquad C(n\theta) = 0 \ \text{ if } n \neq 0.$$

The function $C$ in fact verifies the condition of the preceding proposition and:
$$S_{XX}(u) = \sum_{n=-\infty}^{+\infty} C(n\theta)\, \exp\big(-2i\pi(n\theta)u\big) = C(0) = \sigma^2.$$
Figure 3.3. Covariance function and spectral density of a white noise

We often meet "Gaussian white noises": these are Gaussian processes which are also white noises; the families of r.v. extracted from such processes are independent and $\sim N(0, \sigma^2)$.
More generally we have the following result, which we will use without demonstration.

Herglotz theorem

In order for a mapping $n\theta \mapsto C(n\theta)$ to be the covariance function of a WSS process, it is necessary and sufficient that there exists a positive measure $\mu_X$ on $\mathcal{B}\big(\big[-\frac{1}{2\theta}, \frac{1}{2\theta}\big]\big)$, which is called the spectral measure, such that:
$$C(n\theta) = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} \exp\big(2i\pi(n\theta)u\big)\, d\mu_X(u).$$

In this statement we no longer assume that $\sum_{n=-\infty}^{\infty} |C(n\theta)| < \infty$.
If $\sum_{n=-\infty}^{+\infty} |C(n\theta)| < \infty$, we again find the starting statement, with $d\mu_X(u) = S_{XX}(u)\, du$ (a statement that we can complete by saying that the spectral density $S_{XX}(u)$ is positive).
3.3. Spectral representation of a WSS process
In this section we explain the steps enabling us to arrive at the spectral representation of a process. In order not to obscure these steps, the demonstrations of the results, which are quite long without being difficult, are not given.
3.3.1. Problem
The object of spectral representation is:
1) To study the integrals (called Wiener integrals) of the type $\int_S \varphi(u)\, dZ_u$, obtained as limits, in a sense to be clarified, of expressions of the form $\sum_j \varphi(u_j)\big(Z_{u_j} - Z_{u_{j-1}}\big)$, where $S$ is a bounded interval of $\mathbb{R}$, $\varphi$ is a mapping with complex values (plus other conditions), and $Z_S = \{Z_u \mid u \in S\}$ is a 2nd order process with orthogonal increments (abbreviated as p.o.i.) whose definition will be given in what follows.
2) The construction of the Wiener integral being carried out, to show that reciprocally, if we are given a WSS process $X_\theta$, we can find a p.o.i. $Z_S = \big\{Z_u \mid u \in S = \big[-\frac{1}{2\theta}, \frac{1}{2\theta}\big]\big\}$ such that $\forall j \in \mathbb{Z}$, $X_{j\theta}$ may be written as a Wiener integral:
$$X_{j\theta} = \int_S e^{2i\pi(j\theta)u}\, dZ_u.$$
NOTE.– $\int_S \varphi(u)\, dZ_u$ and $\int_S e^{2i\pi(j\theta)u}\, dZ_u$ will not be ordinary Stieltjes integrals (and it is this which motivates a particular study). In effect, let us state:
– $\sigma = \{\dots, u_{j-1}, u_j, u_{j+1}, \dots\}$ a subdivision of $S$;
– $|\sigma| = \sup_j |u_j - u_{j-1}|$ the module of the subdivision $\sigma$;
– $I_\sigma = \sum_{u_j \in \sigma} \varphi(u_j)\big(Z_{u_j} - Z_{u_{j-1}}\big)$.

$\forall \sigma$, the expression $I_\sigma$ is in fact defined; it is a 2nd order r.v. with complex values. However, the process $Z_S$ not being a priori of bounded variation, the ordinary limit $\lim_{|\sigma| \to 0} I_\sigma$, i.e. the limit for a given trajectory $u \mapsto Z_u(\omega)$, does not exist, and $\int_S \varphi(u)\, dZ_u$ cannot be an ordinary Stieltjes integral.

The r.v. $\int_S \varphi(u)\, dZ_u$ will be by definition the limit in $L^2$ of the family $I_\sigma$ when $|\sigma| \to 0$, if this limit exists, i.e.:
$$\lim_{|\sigma| \to 0} E\Big| I_\sigma - \int_S \varphi(u)\, dZ_u \Big|^2 = 0.$$
This is still sometimes written:
$$\int_S \varphi(u)\, dZ_u = L^2\text{-}\lim_{|\sigma| \to 0} I_\sigma.$$
3.3.2. Results
3.3.2.1. Process with orthogonal increments and associated measures
$S$ designates here a bounded interval of $\mathbb{R}$.
DEFINITION.– We call a random process with continuous parameter and base $S$ any family of r.v. $Z_u$, the parameter $u$ describing $S$. This process will be denoted $Z_S = \{Z_u \mid u \in S\}$. Furthermore, we say that such a process is:
– centered if $EZ_u = 0 \quad \forall u \in S$;
– of the 2nd order if $E|Z_u|^2 < \infty$ (i.e. $Z_u \in L^2(dP)$);
– continuous in $L^2$ if $E(Z_{u+\Delta u} - Z_u)^2 \to 0$ when $\Delta u \to 0$, $\forall u$ and $u + \Delta u \in S$ (we also speak of right continuity in $L^2$ when $\Delta u > 0$, or of left continuity when $\Delta u < 0$).

In what follows $Z_S$ will be centered, of the 2nd order and continuous in $L^2$.
DEFINITION.– We say that the process $Z_S$ has orthogonal increments ($Z_S$ is a p.o.i.) if $\forall u_1, u_2, u_3, u_4 \in S$ with $u_1 < u_2 \leq u_3 < u_4$:
$$\big\langle Z_{u_4} - Z_{u_3},\, Z_{u_2} - Z_{u_1} \big\rangle_{L^2(dP)} = E\big(Z_{u_4} - Z_{u_3}\big)\big(Z_{u_2} - Z_{u_1}\big) = 0.$$

We say that $Z_S$ is a process with orthogonal and stationary increments ($Z_S$ is a p.o.s.i.) if $Z_S$ is a p.o.i. and if in addition $\forall u_1, u_2, u_3, u_4$ with $u_4 - u_3 = u_2 - u_1$ we have:
$$E\big(Z_{u_4} - Z_{u_3}\big)^2 = E\big(Z_{u_2} - Z_{u_1}\big)^2.$$
PROPOSITION.– To every p.o.i. $Z_S$ which is right continuous in $L^2$, we can associate:
– a non-decreasing function $F$ on $S$ such that $F(u') - F(u) = E(Z_{u'} - Z_u)^2$ if $u < u'$;
– a measure $\mu$ on $\mathcal{B}(S)$ such that $\forall u, u' \in S$ with $u < u'$:
$$\mu\big(\,]u, u']\,\big) = F(u') - F(u^-).$$
3.3.2.2. Wiener stochastic integral

Let $Z_S$ still be a p.o.i., right continuous in $L^2$, and $\mu$ the associated measure.

PROPOSITION.– Given $\varphi \in L^2(\mu)$ with complex values:
1) The limit $L^2\text{-}\lim_{|\sigma| \to 0} \Big( \sum_{u_j \in \sigma} \varphi(u_j)\big(Z_{u_j} - Z_{u_{j-1}}\big) \Big)$ exists. This is by definition Wiener's stochastic integral $\int_S \varphi(u)\, dZ_u$.
2) Given $\varphi$ and $\psi \in L^2(\mu)$ with complex values, we have the property:
$$E \int_S \varphi(u)\, dZ_u\; \overline{\int_S \psi(u)\, dZ_u} = \int_S \varphi(u)\, \overline{\psi(u)}\, d\mu(u),$$
in particular $E\Big| \int_S \varphi(u)\, dZ_u \Big|^2 = \int_S |\varphi(u)|^2\, d\mu(u)$.

Idea of the demonstration

Let us denote by $\varepsilon$ the vector space of step functions with complex values. We begin by proving the proposition for functions $\varphi, \psi, \dots \in \varepsilon$:
$$\varphi(u) = \sum_j a_j\, 1_{]u_{j-1}, u_j]}(u) \quad \text{and (if } \varphi \in \varepsilon\text{)} \quad \int_S \varphi(u)\, dZ_u = \sum_j \varphi(u_j)\big(Z_{u_j} - Z_{u_{j-1}}\big).$$

We next establish the result in the general case by using the fact that $\varepsilon \,(\subset L^2(\mu))$ is dense in $L^2(\mu)$, i.e. $\forall \varphi \in L^2(\mu)$ we can find a sequence $\varphi_n \in \varepsilon$ such that:
$$\|\varphi - \varphi_n\|^2_{L^2(\mu)} = \int_S |\varphi(u) - \varphi_n(u)|^2\, d\mu(u) \to 0 \quad \text{when } n \to \infty.$$
3.3.2.3. Spectral representation

We start with $X_\theta$, a WSS process. Following Herglotz's theorem, we know that its covariance function $n\theta \mapsto C(n\theta)$ is written:
$$C(n\theta) = \int_{-\frac{1}{2\theta}}^{\frac{1}{2\theta}} e^{2i\pi(n\theta)u}\, d\mu_X(u)$$
where $\mu_X$ is the spectral measure on $\mathcal{B}\big(\big[-\frac{1}{2\theta}, \frac{1}{2\theta}\big]\big)$.

PROPOSITION.– If $X_\theta$ is a centered WSS process of covariance function $n\theta \mapsto C(n\theta)$ and of spectral measure $\mu_X$, there exists a unique p.o.i. $Z_S = \big\{Z_u \mid u \in S = \big[-\frac{1}{2\theta}, \frac{1}{2\theta}\big]\big\}$ such that:
$$\forall j \in \mathbb{Z} \qquad X_{j\theta} = \int_S e^{2i\pi(j\theta)u}\, dZ_u.$$

Moreover, the measure associated with $Z_S$ is the spectral measure $\mu_X$. The expression of the $X_{j\theta}$ as Wiener integrals is called the spectral representation of the process.

NOTE.– By applying the property stated in 2) of the preceding proposition:
$$E X_{j\theta}\, X_{(j+n)\theta} = E \int_S e^{2i\pi(j\theta)u}\, dZ_u\; \overline{\int_S e^{2i\pi((j+n)\theta)u}\, dZ_u} = \int_S e^{-2i\pi(n\theta)u}\, d\mu_X(u) = C(-n\theta) = C(n\theta).$$
3.4. Introduction to digital filtering

We suppose again that $\theta = 1$. Given a WSS process $X$ and a sequence of real numbers $h = \{h_j \in \mathbb{R} \mid j \in \mathbb{Z}\}$, we are interested in the operation which makes a new process $Y$ correspond to $X$, defined by:
$$\forall K \in \mathbb{Z} \qquad Y_K = \sum_{j=-\infty}^{+\infty} h_j X_{K-j} = \Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) X_K$$
($h_0 T^0$ is also denoted $h_0 1$, where $1$ is the identity mapping of $L^2$ into $L^2$).

In what follows we will always assume that $\sum_{j=-\infty}^{+\infty} |h_j| < \infty$; this condition is generally denoted $h \in \ell^1$ and is called (for reasons which will be explained later) the condition of stability.

DEFINITION.– We say that the process $Y$ is the transform (or filtration) of the process $X$ by the filter $H(T) = \sum_{j=-\infty}^{+\infty} h_j T^j$, and we write $Y = H(T)X$.

NOTES.–
1) The filter $H(T)$ is entirely determined by the sequence of coefficients $h = \{h_j \in \mathbb{R} \mid j \in \mathbb{Z}\}$ and, according to the case in hand, we will speak of the filter $H(T)$, of the filter $h$, or again of the filter $(\dots, h_{-m}, \dots, h_{-1}, h_0, \dots, h_n, \dots)$.
2) The expression "$\forall K \in \mathbb{Z}$, $Y_K = \sum_{j=-\infty}^{+\infty} h_j X_{K-j}$" is the definition of the convolution product (noted $*$) of $X$ by $h$, which is also written $Y = h * X$, or again $\forall K \in \mathbb{Z}$, $Y_K = (h * X)_K$.
3) Given that $X$ is a WSS process and $\bar{H}_X$ is the associated linear space, it is clear that the r.v. $Y_K = \sum_{j=-\infty}^{+\infty} h_j X_{K-j}$ remain in $\bar{H}_X$ and that the process $Y$ is also WSS.

Causal filter

Physically, for whatever $K$ is given, $Y_K$ can only depend on the previous r.v. $X_{K-j}$ in the wide sense, i.e. with $j \in \mathbb{N}$. A filter $H(T)$ which realizes this condition is called causal or feasible. Amongst these causal filters, we can further distinguish two major classes:
1) Filters with finite impulse response (FIR), such that:
$$\forall K \in \mathbb{Z} \qquad Y_K = \sum_{j=0}^{N} h_j X_{K-j}$$
the schematic representation of which follows.

Figure 3.4. Schema of a FIR filter
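The convolution $Y_K = \sum_j h_j X_{K-j}$ defining a causal FIR filter can be sketched with `np.convolve` (NumPy assumed; the filter coefficients and the white-noise input are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
h = np.array([0.5, 0.3, 0.2])    # a hypothetical causal FIR filter h_0, h_1, h_2
X = rng.normal(size=1000)        # one trajectory of a WSS process (white noise here)

# Y_K = sum_j h_j X_{K-j}: the convolution product Y = h * X
# (np.convolve implicitly takes X_{K-j} = 0 for K - j < 0)
Y = np.convolve(X, h, mode="full")[:len(X)]
```

Keeping the first `len(X)` samples of the full convolution gives exactly the causal sums $h_0 X_K + h_1 X_{K-1} + h_2 X_{K-2}$ with zero initial conditions.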
2) Filters with infinite impulse response (IIR), such that:
$$\forall K \in \mathbb{Z} \qquad Y_K = \sum_{j=0}^{\infty} h_j X_{K-j}$$

NOTES.–
1) Let us explain the role played by the operator $T$: at any particular instant $K$, it replaces $X_K$ with $X_{K-1}$; we can also say that $T$ blocks the r.v. $X_{K-1}$ for a unit of time and restores it at instant $K$.
2) Let $H(T)$ be an IIR filter. At the instant $K$:
$$Y_K = \sum_{j=0}^{\infty} h_j X_{K-j} = h_0 X_K + \dots + h_K X_0 + h_{K+1} X_{-1} + \dots$$
For a process $X$ beginning at the instant $0$, we will thus have:
$$\forall K \in \mathbb{N} \qquad Y_K = \sum_{j=0}^{K} h_j X_{K-j}$$
Example of filtering of a Gaussian process

Let us consider the Gaussian process $X \sim N\big(m(j), \Gamma(i, j)\big)$ and the FIR filter $H(T)$ defined by $h = (\dots, 0, \dots, 0, h_0, \dots, h_N, 0, \dots)$. We immediately verify that the process $Y = H(T)X$ is Gaussian. Let us consider for example the filtering of $X \sim N\big(0, e^{-|j-i|}\big)$ specified by:
$$\forall K \in \mathbb{Z} \qquad Y_K = \sum_{j=0}^{1} h_j X_{K-j} = -X_K + 2X_{K-1}$$

$Y$ is a Gaussian process. Let us determine its parameters:
$$m_Y(j) = EY_j = 0$$
$$\Gamma_Y(i, j) = E Y_i Y_j = E\big( (-X_i + 2X_{i-1})(-X_j + 2X_{j-1}) \big) = E X_i X_j - 2E X_{i-1} X_j - 2E X_i X_{j-1} + 4E X_{i-1} X_{j-1} = 5e^{-|j-i|} - 2e^{-|j-i+1|} - 2e^{-|j-i-1|}$$
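The closed form for $\Gamma_Y$ can be verified with a small matrix computation: writing the FIR filter as a banded matrix $H$ acting on $(X_0, \dots, X_{n-1})$, the covariance of the filtered vector is $H \Gamma_X H^T$. A sketch (NumPy assumed):

```python
import numpy as np

n = 12
idx = np.arange(n)
Gamma_X = np.exp(-np.abs(idx[:, None] - idx[None, :]))   # Gamma_X(i,j) = e^{-|j-i|}

# banded matrix of the FIR filter Y_K = -X_K + 2 X_{K-1}, for K = 1, ..., n-1
H = np.zeros((n - 1, n))
for K in range(1, n):
    H[K - 1, K] = -1.0
    H[K - 1, K - 1] = 2.0

Gamma_Y = H @ Gamma_X @ H.T      # covariance matrix of (Y_1, ..., Y_{n-1})

def gamma_y(i, j):
    # closed form from the text: 5 e^{-|j-i|} - 2 e^{-|j-i+1|} - 2 e^{-|j-i-1|}
    d = j - i
    return 5*np.exp(-abs(d)) - 2*np.exp(-abs(d + 1)) - 2*np.exp(-abs(d - 1))
```

Every entry of `Gamma_Y` matches `gamma_y` evaluated at the corresponding pair of instants, confirming the hand computation above.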
Inverse filter of a causal filter

DEFINITION.– We say that a causal filter $H(T)$ is invertible if there is a filter, denoted $(H(T))^{-1}$ and called the inverse filter of $H(T)$, such that for any WSS process $X$ we have:
$$X = H(T)\big( (H(T))^{-1} X \big) = (H(T))^{-1}\big( H(T) X \big) \quad (*)$$
If such a filter exists, the equality $Y = H(T)X$ is equivalent to the equality $X = (H(T))^{-1} Y$. Furthermore, $(H(T))^{-1}$ is defined by a sequence of coefficients $h' = \{h'_j \in \mathbb{R} \mid j \in \mathbb{Z}\}$ and we have the convolution product $X = h' * Y$.

In order to find the inverse filter $(H(T))^{-1}$, i.e. in order to find the sequence of coefficients $h' = \{h'_j \in \mathbb{R} \mid j \in \mathbb{Z}\}$, we write that the sequence of equalities $(*)$ is equivalent to:
$$\forall K \in \mathbb{Z} \qquad X_K = \Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) \Big( \Big( \sum_{j=-\infty}^{+\infty} h'_j T^j \Big) X_K \Big) = \Big( \sum_{j=-\infty}^{+\infty} h'_j T^j \Big) \Big( \Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) X_K \Big)$$
or even to:
$$\Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) \Big( \sum_{j=-\infty}^{+\infty} h'_j T^j \Big) = \Big( \sum_{j=-\infty}^{+\infty} h'_j T^j \Big) \Big( \sum_{j=-\infty}^{+\infty} h_j T^j \Big) = 1$$
EXAMPLE.– We are examining the causal filter $H(T) = 1 - hT$.
1) If $|h| < 1$, $H(T)$ admits the inverse filter $(H(T))^{-1} = \sum_{j=0}^{\infty} h^j T^j$.

To see this we must verify that, given $X_K$ the r.v. at instant $K$ of a WSS process $X$, we have:
$$(1 - hT)\Big( \Big( \sum_{j=0}^{\infty} h^j T^j \Big) X_K \Big) = X_K \quad (\text{equality in } L^2)$$
$$\Leftrightarrow \lim_N (1 - hT)\Big( \sum_{j=0}^{N} h^j T^j \Big) X_K = X_K \Leftrightarrow \big(1 - h^{N+1} T^{N+1}\big) X_K - X_K = -h^{N+1} X_{K-(N+1)} \to 0 \ \text{when } N \uparrow \infty$$
which is verified if $|h| < 1$ since $\big\| X_{K-(N+1)} \big\|_{L^2} = \big( E X_0^2 \big)^{1/2}$.

We should also note that $(H(T))^{-1}$ is causal.
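The inversion can be demonstrated numerically by filtering a trajectory with $1 - hT$ and then recovering it with a truncated version of $\sum_{j \geq 0} h^j T^j$; the neglected tail is of order $h^{M+1}$. A sketch (NumPy assumed; $h$, the trajectory and the truncation order $M$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
h = 0.6                          # |h| < 1, so H(T) = 1 - hT is invertible
X = rng.normal(size=500)

# forward filter: Y_K = X_K - h X_{K-1}   (taking X_{-1} = 0)
Y = X - h * np.concatenate(([0.0], X[:-1]))

# truncated inverse filter sum_{j=0}^{M} h^j T^j applied to Y
M = 60
X_rec = np.convolve(Y, h**np.arange(M + 1), mode="full")[:len(Y)]
```

With $h = 0.6$ and $M = 60$ the truncation error $h^{M+1}$ is around $10^{-14}$, so the trajectory is recovered to machine precision.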
2) If $|h| > 1$, let us write $(1 - hT) = -hT\big(1 - \frac{1}{h} T^{-1}\big)$, thus:
$$(1 - hT)^{-1} = \Big( 1 - \frac{1}{h} T^{-1} \Big)^{-1} \Big( -\frac{1}{h} T^{-1} \Big).$$
As the operators commute and $\big|\frac{1}{h}\big| < 1$:
$$(1 - hT)^{-1} = -\frac{T^{-1}}{h} \sum_{j=0}^{\infty} \frac{T^{-j}}{h^j} = -\sum_{j=0}^{\infty} \frac{T^{-(j+1)}}{h^{j+1}}.$$
However, this inverse has no physical reality and it is not causal (the "lead operators" $T^{-(j+1)}$ are not causal).
3) If $|h| = 1$, $(1 - T)$ and $(1 + T)$ are not invertible.
Transfer function of a digital filter

DEFINITION.– We call transfer function of a digital filter $H(T) = \sum_{j=-\infty}^{+\infty} h_j T^j$ the function $H(z) = \sum_{j=-\infty}^{+\infty} h_j z^{-j}$, $z \in \mathbb{C}$.

We recognize the definition given in analysis of a Laurent series if we permute $z$ and $z^{-1} = \frac{1}{z}$. As a consequence of this permutation, the transfer functions (sums of the series) will sometimes be written by using the variable $z^{-1}$. We also say that $H(z)$ is the $z$-transform of the digital sequence $h = (\dots, h_{-m}, \dots, h_0, \dots, h_n, \dots)$.

Let us be more precise about the domain of definition of $H(z)$; it is the domain of convergence $K$ of the Laurent series. We already know that $K$ is an annulus of center $0$ and thus has the form:
$$K = \{z \mid 0 \leq r < |z| < R\}$$

Moreover, any circle of the complex plane of center $0$ and radius $\rho$ is denoted by $C(0, \rho)$.
$K$ contains $C(0,1)$ because, owing to the stability hypothesis $\sum_{j=-\infty}^{+\infty} |h_j| < \infty$, the series $\sum_{j=-\infty}^{+\infty} h_j z^{-j}$ converges absolutely $\forall z \in C(0,1)$.

Figure 3.5. Convergence domain of transfer function
The singularities $\sigma_j$ of $H(z)$ verify $|\sigma_j| \leq r$ or $|\sigma_j| \geq R$, and there will be at least one singularity of $H(z)$ on $C(0, r)$ and another on $C(0, R)$ (if not, $K$, the holomorphy domain of $H(z)$, could be enlarged).

If the filter is now causal:
– if it is an IIR filter, then $H(z) = \sum_{j=0}^{\infty} h_j z^{-j}$, so $H(z)$ is holomorphic in $K = \{z \mid 0 \leq r < |z|\}$ ($R = +\infty$);
– if it is an FIR filter, then $H(z) = \sum_{j=0}^{N} h_j z^{-j}$, so $H(z)$ is holomorphic in $K = \{z \mid 0 < |z|\}$ (plane punctured at $0$).
We observe above all that the singularities $\sigma_j$ of the transfer function of a stable, causal filter all have modulus strictly less than 1.

Figure 3.6. Convergence domain of $H(z)$ of an IIR causal filter and convergence domain of $H(z)$ of an FIR causal filter
NOTE.– In the case of a Laurent series $\sum_{j=-\infty}^{+\infty} h_j z^{-j}$ (i.e., in the case of a digital filter $h = \{\dots, h_{-m}, \dots, h_0, \dots, h_n, \dots\}$), its domain of convergence $K$ and thus its sum $H(z)$ are determined in a unique manner, that is to say that the couple $\big(H(z), K\big)$ is associated with the filter.

Reciprocally, if, given $H(z)$, we wish to obtain the filter $h$, it is necessary to begin by specifying the domain in which we wish to expand $H(z)$, because for different domains $K$ we obtain different Laurent expansions having $H(z)$ as sum.

This can be summed up by the double implication $\big(H(z), K\big) \leftrightarrow h$.
z transform
(
)
Given the couple H ( z ) , K , we wish to find filter h .
H being holomorphic in K , we can apply Laurent’s formula: ∀j ∈
hj =
1 2iπ
∫Γ
H ( z) +
z − j +1
dz
where (homotopy argument) Γ is any contour of K and encircling 0 . The integral can be calculated by the residual method or even, since we have a choice of contour
Γ , by choosing Γ = C ( 0,1) and by parameterizing and calculating the integral ∀j ∈
hj =
1 2iπ
iθ ijθ ∫Γ H ( e ) e dθ . +
In order to determine the $h_j$, we can also expand the function $H(z)$ in Laurent series by making use of the usual known expansions.

SUMMARY EXAMPLE.– Let the stable causal filter $H(T) = 1 - hT$ with $|h| < 1$, of transfer function $H(z) = 1 - h z^{-1}$ defined on $\mathbb{C} - \{0\}$. We have seen that it is invertible and that its inverse, equally causal and stable, is $R(T) = \sum_{j=0}^{\infty} h^j T^j$.

The transfer function of the inverse filter is thus:
$$R(z) = \sum_{j=0}^{\infty} h^j z^{-j} = \frac{1}{1 - h z^{-1}} \quad \text{defined on } \{z \mid |z| > |h|\}$$
(note also that $R(z) = \frac{1}{H(z)}$).

Figure 3.7. Definition domain of $H(z)$ and definition domain of $R(z)$

Having $R(z) = \frac{1}{1 - h z^{-1}}$ on $\{z \mid |z| > |h|\}$, let us find (as an exercise) the Laurent expansion of $R(z)$, i.e. the coefficients $h_j$ of $z^{-j}$. Using the Laurent formula:
$$h_j = \frac{1}{2i\pi} \int_{\Gamma^+} R(z)\, z^{j-1}\, dz = \frac{1}{2i\pi} \int_{\Gamma^+} \frac{z^j}{z - h}\, dz$$
where $\Gamma$ is a contour belonging to $\{z \mid |z| > |h|\}$.
By applying the residue theorem:
if $j \geq 0$:
$$h_j = 2i\pi \cdot \frac{1}{2i\pi} \Big( \text{residue of } \frac{z^j}{z - h} \text{ in } h \Big) = \lim_{z \to h} (z - h)\, \frac{z^j}{z - h} = h^j$$
if $j < 0$, writing $\frac{z^j}{z - h} = \frac{1}{z^{|j|}(z - h)}$:
$$h_j = 2i\pi \cdot \frac{1}{2i\pi} \Big[ \Big( \text{residue of } \frac{1}{z^{|j|}(z - h)} \text{ in } 0 \Big) + \Big( \text{residue of } \frac{1}{z^{|j|}(z - h)} \text{ in } h \Big) \Big] = 0.$$
is a WSS process and
linear space; we are still considering the filter
H ( z) =
+∞
∑
j =−∞
h j z − j with
+∞
∑
hj
H
X
is the associated
H (T ) of transfer function
hj < ∞ .
j =−∞
So: 1)
∀K ∈
⎛ +∞ ⎞ j ⎜⎜ ∑ q jT ⎟⎟ X K = ⎝ j =−∞ ⎠
That is to say that the r.v. YK =
H
X
+∞
∑
j =−∞
+∞
∑ q j X K − j converges in H X .
j =−∞
h j X K − j of the filtered process remain in
; we say that the filter is stable.
2) The filtered process Y is WSS. 3) The spectral densities of X
SYY ( u ) = H ( −2iπ u )
2
and of Y are linked by the relationship:
S XX ( u )
Introduction to Discrete Time Processes
127
DEMONSTRATION.–
1) We have to show that $\forall K \in \mathbb{Z}$, there exists an r.v. $Y_K \in \bar{H}_X \subset L^2(dP)$ such that the sequence $N \mapsto \sum_{j=-N}^{N} h_j X_{K-j}$ converges for the norm of $\bar{H}_X$ towards $Y_K$ when $N \uparrow \infty$. As $\bar{H}_X$ is a Banach space, it is sufficient to verify the normal convergence, namely:
$$\sum_{j=-\infty}^{+\infty} \big\| h_j X_{K-j} \big\| = \sum_{j=-\infty}^{+\infty} |h_j|\, \big( E X_{K-j}^2 \big)^{1/2} < \infty$$
which is true as a result of the stability hypothesis $\sum_{j=-\infty}^{+\infty} |h_j| < \infty$ and of the wide sense stationarity: $E X_{K-j}^2 = \sigma^2 + m^2$.
2) We must verify that $E Y_K$ is independent of $K$ and that $\mathrm{Cov}(Y_i, Y_j)$ has the form $C_Y(j - i)$, which is immediate.
3)
$$C_Y(j - i) = \mathrm{Cov}(Y_i, Y_j) = \sum_{\ell, \ell'} h_\ell\, h_{\ell'}\, \mathrm{Cov}\big( X_{j-\ell}, X_{i-\ell'} \big)$$
and, by using the definition of $S_{XX}(u)$:
$$C_Y(j - i) = \sum_{\ell, \ell'} h_\ell\, h_{\ell'} \int_{-1/2}^{1/2} \exp\big( 2i\pi \big( (j - \ell) - (i - \ell') \big) u \big)\, S_{XX}(u)\, du$$

It is easy to verify that we can invert the symbols $\sum$ and $\int$, in such a way that:
$$C_Y(j - i) = \int_{-1/2}^{1/2} \exp\big( 2i\pi (j - i) u \big) \Big( \sum_{\ell, \ell'} h_\ell\, h_{\ell'}\, \exp\big( 2i\pi (\ell' - \ell) u \big) \Big) S_{XX}(u)\, du$$
$$= \int_{-1/2}^{1/2} \exp\big( 2i\pi (j - i) u \big) \Big| \sum_{\ell} h_\ell\, \exp(2i\pi \ell u) \Big|^2 S_{XX}(u)\, du = \int_{-1/2}^{1/2} \exp\big( 2i\pi (j - i) u \big)\, \big| H\big(e^{-2i\pi u}\big) \big|^2\, S_{XX}(u)\, du$$
and in going back to the definition of $S_{YY}(u)$, we in fact have $S_{YY}(u) = \big| H\big(e^{-2i\pi u}\big) \big|^2\, S_{XX}(u)$.
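The relation $S_{YY}(u) = |H(e^{-2i\pi u})|^2 S_{XX}(u)$ can be checked exactly for white-noise input, where $S_{XX}(u) = \sigma^2$ and $C_Y(n) = \sigma^2 \sum_\ell h_\ell h_{\ell+n}$. A sketch (NumPy assumed; the FIR coefficients are a hypothetical example):

```python
import numpy as np

sigma2 = 2.0
h = np.array([1.0, -0.5, 0.25])     # a hypothetical causal FIR filter
u = np.linspace(-0.5, 0.5, 201)

# transfer function on the unit circle: H(z) = sum_j h_j z^{-j}, z = e^{-2*i*pi*u}
z = np.exp(-2j * np.pi * u)
H = sum(h[j] * z**(-j) for j in range(len(h)))

# covariance of the filtered white noise: C_Y(n) = sigma2 * sum_l h_l h_{l+n}
C_Y = sigma2 * np.correlate(h, h, mode="full")     # lags n = -2, ..., 2
n = np.arange(-(len(h) - 1), len(h))
S_YY = (C_Y[:, None]
        * np.exp(-2j*np.pi * n[:, None] * u[None, :])).sum(axis=0).real
```

Summing the finite Fourier series of $C_Y$ reproduces $\sigma^2 |H|^2$ pointwise, which is precisely the proposition for this input.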
3.5. Important example: autoregressive process

DEFINITION.– We call autoregressive process of degree $d \in \mathbb{N}^*$ any centered WSS process $X_\mathbb{Z}$ which verifies $\forall K \in \mathbb{Z}$:
$$X_K = \sum_{j=1}^{d} h_j X_{K-j} + B_K$$
where $B_\mathbb{Z}$ is a white noise of power $E B_K^2 = \sigma^2$.

The family of autoregressive processes of degree $d$ is denoted by $AR(d)$. Thus $\forall K$, $X_K$ is obtained from the $d$ previous values $X_{K-d}, \dots, X_{K-1}$ (modulo the r.v. $B_K$), which can be carried out using the following schema:
Figure 3.8. Autoregressive filter
The equality of the definition can be written $H(T)X = B$, where we have stated that:
$$H(T) = 1 - \sum_{j=1}^{d} h_j T^j.$$
This means that we can obtain $X$ by the filtering of $B$ through the inverse filter $(H(T))^{-1}$, whose schema has already been given above (modulo the direction of the arrows).

PROPOSITION.–
1) Every process $X$ ($AR(d)$) generated by the noise $B$ and by the filter $H(T)$ possesses the spectral density:
$$S_{XX}(u) = \frac{\sigma^2}{\big| H\big(\exp(-2i\pi u)\big) \big|^2}$$
(where the polynomial $H$ has no root of modulus 1).
2) Reciprocally: every WSS process which is centered and possesses a spectral density of the preceding form is autoregressive, of degree equal to the degree of $H$.
DEMONSTRATION.–

1) The proposition on filtering and the relation B = H(T)X, with S_B(u) = σ², lead to the first result announced.

Furthermore, let us suppose that H possesses a root z_0 = exp(−2iπu_0) of modulus 1 and let us state z = exp(−2iπu). Using Taylor's expansion in the proximity of z_0, we should obtain:

H(z) = H′(z_0)(z − z_0) + ... or even H(exp(−2iπu)) = constant × (u − u_0) + ...

and the mapping u → S_XX(u) = σ²/|H(exp(−2iπu))|² could not be integrable in the proximity of u_0 ... as a spectral density must be.

2) If S_XX(u) = σ²/|H(exp(−2iπu))|², the process H(T)X admits the constant spectral density σ² and, as it is centered, it is a white noise B.

PARTICULAR CASE.– Autoregressive process of degree 1, i.e. of the form:

X_K = h X_{K−1} + B_K, i.e. (1 − hT)X_K = B_K   (E)
We notice to begin with that:

1) X is a Markov process: ∀B ∈ B(ℝ):

P(X_K ∈ B | X_{K−1} = α, X_{K−2} = β, ...) = P(hα + B_K ∈ B | X_{K−1} = α, X_{K−2} = β, ...)

and, as B_K is independent of X_{K−1}, X_{K−2}, ...,

= P(hα + B_K ∈ B)
= P(hX_{K−1} + B_K ∈ B | X_{K−1} = α) = P(X_K ∈ B | X_{K−1} = α)

2) If B is a Gaussian white noise, X is itself Gaussian.

Expression of X, solution of (E):

1) We are looking for X, the WSS process solution of (E):

– if |h| = 1, there is no WSS process X which will satisfy (E).

In effect, let us suppose for example that h = 1 and reiterate n times the relation of recurrence; we then obtain:

X_K − X_{K−n−1} = B_K + B_{K−1} + ... + B_{K−n}

and E(X_K − X_{K−n−1})² = E(B_K + B_{K−1} + ... + B_{K−n})² = (n + 1)σ².

However, if the process were WSS, we would also have, ∀n ∈ ℕ:

E(X_K − X_{K−n−1})² = E X_K² + E X²_{K−n−1} − 2 E X_K X_{K−n−1} ≤ 4 E X_K²

which is bounded. We see then that X cannot be WSS.
Let us now suppose that |h| ≠ 1; we would like, if (1 − hT) is an invertible operator, to obtain X_K = (1 − hT)^{−1} B_K;

– if |h| > 1: by writing (1 − hT) = −hT(1 − (1/h)T^{−1}), as |1/h| < 1, we see that we can expand (1 − (1/h)T^{−1})^{−1} (thus we can also expand (1 − hT)^{−1}) in series of powers of T^{−1} (lead operator), but the filter we obtain being non-causal we must reject the solution X obtained;

– if |h| < 1, i.e. if the root of H(z) = 1 − hz^{−1} has a modulus less than 1, we know that the operator (1 − hT) is invertible and that

(1 − hT)^{−1} = Σ_{j=0}^{∞} h^j T^j   (causal filter).

X_K = (1 − hT)^{−1} B_K = Σ_{j=0}^{∞} h^j B_{K−j}

is then the unique solution of (1 − hT)X_K = B_K.

In this form the wide sense stationarity of X is evident. In effect, the B_j being centered and orthogonal:

Var X_K = Σ_{j=0}^{∞} E(h^j B_{K−j})² = σ²/(1 − h²)

Moreover, for n ∈ ℕ:

Cov(X_i, X_{i+n}) = E X_i X_{i+n} = E(Σ_{j=0}^{∞} h^j B_{i−j} Σ_{ℓ=0}^{∞} h^ℓ B_{i+n−ℓ}) = σ² Σ_{j=0}^{∞} h^j h^{j+n} = σ² h^n/(1 − h²)
Finally, ∀n ∈ ℤ:

C(n) = Cov(X_i, X_{i+n}) = σ² h^{|n|}/(1 − h²).

[Figure 3.9. Graph of C(n), covariance function of a process AR(1) (h ∈ ]0,1[)]

– spectral density S_XX(u) of X:

S_XX(u) = Σ_{n=−∞}^{+∞} C(n) exp(−2iπnu) = σ²/(1 − h²) Σ_{n=−∞}^{+∞} h^{|n|} exp(−2iπnu)
= σ²/(1 − h²) [1/(1 − h exp(−2iπu)) + 1/(1 − h exp(2iπu)) − 1]
= σ²/(1 − 2h cos 2πu + h²)
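These two expressions of the second-order structure of an AR(1) process can be confronted numerically; the sketch below (assumed values h = 0.6, σ² = 2, truncation order N = 200 chosen arbitrarily) sums the Fourier series of C(n) and compares it with the closed form of S_XX(u):

```python
import math

# Numerical check (a sketch, assumed parameters) that the AR(1) covariance
# C(n) = sigma^2 * h^|n| / (1 - h^2) has the spectral density
# S_XX(u) = sigma^2 / (1 - 2*h*cos(2*pi*u) + h^2).
h, sigma2 = 0.6, 2.0

def C(n):
    return sigma2 * h ** abs(n) / (1 - h ** 2)

def S(u):
    return sigma2 / (1 - 2 * h * math.cos(2 * math.pi * u) + h ** 2)

N = 200  # truncation: the neglected tail is of order h^N
for u in [0.0, 0.1, 0.3, 0.5]:
    # the symmetric sum of C(n) exp(-2i pi n u) reduces to a cosine series
    series = sum(C(n) * math.cos(2 * math.pi * n * u) for n in range(-N, N + 1))
    assert abs(series - S(u)) < 1e-9
```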
2) General solution of (E): this is the sum of the solution found of the equation with second member X_K − hX_{K−1} = B_K, i.e. Σ_{j=0}^{∞} h^j B_{K−j}, and of the general solution of the equation without second member X_K − hX_{K−1} = 0, i.e. A h^K where A is any r.v. The general solution

X_K = Σ_{j=0}^{∞} h^j B_{K−j} + A h^K

is no longer WSS, except if A = 0.

3.6. Exercises for Chapter 3

Exercise 3.1.

Study the stationarity of the Gaussian process X for which E X(K) = m(K) is constant and Cov(X_j, X_K) = min(j, K).

Exercise 3.2.
We are considering the real sequence h_n defined by:

h_n = 2^n if n < 0 and h_n = 1/4^n if n ≥ 0.

1) Determine the convergence domain of the Laurent series Σ_{n=−∞}^{+∞} h_n z^n.

2) If h = {h_n | n ∈ ℤ} is a digital filter, determine its transfer function H(z) by clarifying its definition domain.

Solution 3.2.

1) Σ_{n=−∞}^{+∞} h_n z^n = Σ_{n=−∞}^{−1} (2z)^n + Σ_{n=0}^{∞} (z/4)^n = Σ_{n=1}^{∞} (1/(2z))^n + Σ_{n=0}^{∞} (z/4)^n

The series converges if |z| > 1/2 and if |z| < 4, thus in the annulus K = {z | 1/2 < |z| < 4}.

2) H(z) = Σ_{n=−∞}^{+∞} h_n z^{−n} = Σ_{n=1}^{∞} (z/2)^n + Σ_{n=0}^{∞} (1/(4z))^n

The series converges if |z| < 2 and if |z| > 1/4, thus in the annulus K′ = {z | 1/4 < |z| < 2}.

In K′:

H(z) = (1/(1 − z/2) − 1) + 1/(1 − 1/(4z)) = 7z/((2 − z)(4z − 1)).
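The closed form can be checked numerically; the sketch below compares the truncated series with 7z/((2 − z)(4z − 1)) at sample points of the annulus K′ (the truncation order N = 200 is an arbitrary choice):

```python
# A sketch checking Solution 3.2 numerically: inside the annulus
# 1/4 < |z| < 2 the truncated series sum_{n>=1}(z/2)^n + sum_{n>=0}(1/(4z))^n
# should agree with the closed form 7z/((2 - z)(4z - 1)).
def H_series(z, N=200):
    return (sum((z / 2) ** n for n in range(1, N))
            + sum((1 / (4 * z)) ** n for n in range(N)))

def H_closed(z):
    return 7 * z / ((2 - z) * (4 * z - 1))

for z in [0.5, 1.0, -1.2, 0.3 + 0.8j]:
    assert abs(H_series(z) - H_closed(z)) < 1e-9
```

At z = 1, for instance, both sides equal 7/3.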
Exercise 3.3.

Develop H(z) = (16 − 6z)/((2 − z)(4 − z)) in series (of Laurent) of powers of z in the three following domains:

– {z | |z| < 2}
– {z | 2 < |z| < 4}
– {z | |z| > 4}

H(z) representing each time a transfer function, clarify in the three cases if the corresponding filter is stable and if it is causal.

Solution 3.3.

H(z) = 2/(2 − z) + 4/(4 − z) = 1/(1 − z/2) + 1/(1 − z/4)

– If |z| < 2:

H(z) = Σ_{n=0}^{∞} (1/2^n + 1/4^n) z^n = Σ_{n=−∞}^{0} (2^n + 4^n) z^{−n}

The filter is stable, for Σ_{n=0}^{∞} (1/2^n + 1/4^n) < ∞, but it is not causal since the series contains positive powers of z;

– If 2 < |z| < 4, we write:

H(z) = (−2/z)/(1 − 2/z) + 1/(1 − z/4) = −Σ_{n=1}^{∞} 2^n z^{−n} + Σ_{n=0}^{∞} (z/4)^n = Σ_{n=1}^{∞} (−2^n) z^{−n} + Σ_{n=−∞}^{0} 4^n z^{−n}

the filter is neither stable nor causal;

– If |z| > 4, we write:

H(z) = (−2/z)/(1 − 2/z) + (−4/z)/(1 − 4/z) = Σ_{n=1}^{∞} −(2^n + 4^n) z^{−n}

the filter is unstable and causal.
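The three expansions can be checked against the closed form, one sample point per annulus (a sketch; the truncation order N = 300 is arbitrary):

```python
# A sketch checking Solution 3.3: each series expansion of
# H(z) = (16 - 6z)/((2 - z)(4 - z)) is compared with the closed form
# at one sample point of its annulus of convergence.
def H(z):
    return (16 - 6 * z) / ((2 - z) * (4 - z))

N = 300

# |z| < 2: H(z) = sum_{n>=0} (1/2^n + 1/4^n) z^n  (stable, non-causal filter)
z = 0.8
s = sum((2 ** -n + 4 ** -n) * z ** n for n in range(N))
assert abs(s - H(z)) < 1e-9

# 2 < |z| < 4: H(z) = -sum_{n>=1} 2^n z^-n + sum_{n>=0} 4^-n z^n
z = 3.0
s = (-sum(2 ** n * z ** -n for n in range(1, N))
     + sum(4 ** -n * z ** n for n in range(N)))
assert abs(s - H(z)) < 1e-9

# |z| > 4: H(z) = -sum_{n>=1} (2^n + 4^n) z^-n  (causal, unstable filter)
z = 5.0
s = -sum((2 ** n + 4 ** n) * z ** -n for n in range(1, N))
assert abs(s - H(z)) < 1e-9
```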
Exercise 3.4.

We are examining a Gaussian white noise B (let us remember that the B_K are independent Gaussian r.v.; E B_K = 0 and Var B_K = σ²). Moreover, we allow two real numbers α and β which are different and which verify |α| < 1 and |β| < 1.

1) Construct a stationary centered process X such that

X_K = α X_{K−1} + B_K − β B_{K−1}, K ∈ ℤ

and determine its spectral density S_XX(u).

2) Let us denote the linear space generated by the r.v. X_n, n ≤ 0, as H^X, and the linear space generated by the r.v. B_n, n ≤ 0, as H^B. Verify that H^X = H^B.

3) We state Y_K = Σ_{n=0}^{∞} β^n X_{K−n}, K ∈ ℤ. Express Y_K according to the white noise and deduce from it the best linear approximation of Y_K (K ≥ 1) expressed with the help of the X_n, n ≤ 0.

4) Show that the r.v. Y_K are Gaussian and centered, and calculate their covariances.

Solution 3.4.
1) The equality defining X_K allows us to write (1 − αT)X_K = (1 − βT)B_K, and the operator (1 − αT) is invertible as |α| < 1:

X_K = (1 − αT)^{−1}(1 − βT)B_K = (Σ_{n=0}^{∞} α^n T^n)(1 − βT)B_K

Thus X_K = B_K + Σ_{n=1}^{∞} α^{n−1}(α − β)B_{K−n}, and X is in fact stationary.

Furthermore, the process X is generated from B by the filter (1 − αT)^{−1}(1 − βT). Thus, according to the theorem on filtering:

S_XX(u) = |1 − β exp(−2iπu)|²/|1 − α exp(−2iπu)|² σ².

2) According to 1), X_K ∈ H^B, thus H^X ⊆ H^B. Reciprocally, ∀K, B_K = (1 − βT)^{−1}(1 − αT)X_K and, starting from calculations similar to those previously performed, B_K ∈ H^X; thus H^B ⊆ H^X and H^X = H^B.

3) Y_K = Σ_{n=0}^{∞} β^n X_{K−n} = (Σ_{n=0}^{∞} β^n T^n)X_K = (1 − βT)^{−1}X_K

Thus Y_K = (1 − βT)^{−1}(1 − αT)^{−1}(1 − βT)B_K and, as the operators can be permutated,

Y_K = (1 − αT)^{−1}B_K = Σ_{n=0}^{∞} α^n B_{K−n}

Since H^X = H^B, the best linear approximation of Y_K is:

proj_{H^B} Y_K = proj_{H^B}(Σ_{n=0}^{∞} α^n B_{K−n}) = Σ_{n=0}^{∞} α^{n+K} B_{−n} = α^K Σ_{n=0}^{∞} α^n B_{−n} = α^K Y_0 = α^K Σ_{n=0}^{∞} β^n X_{−n}

4) Since Y_K = Σ_{n=0}^{∞} α^n B_{K−n}, the Y_K are centered Gaussian r.v.

Moreover, for K ≥ j:

Cov(Y_j, Y_K) = Σ_{m=0}^{∞} Σ_{n=0}^{∞} α^{m+n} E(B_{K−n}B_{j−m}) = Σ_{m=0}^{∞} α^{2m+K−j} E B²_{j−m} = α^{K−j} Σ_{m=0}^{∞} α^{2m} σ² = α^{K−j} σ²/(1 − α²)
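A sketch checking this solution at the level of impulse-response coefficients (assumed values α = 0.7, β = −0.4, σ² = 1): the coefficients c_n of X_K = Σ c_n B_{K−n} must satisfy the ARMA recursion, and the covariance of Y follows from the orthogonality of the B_n:

```python
# A sketch checking Solution 3.4 (assumed parameters): X_K = sum_n c_n B_{K-n}
# with c_0 = 1, c_n = alpha^(n-1)*(alpha - beta), and Y_K = sum_n alpha^n B_{K-n};
# the covariance of Y follows by orthogonality of the white noise B.
alpha, beta, sigma2 = 0.7, -0.4, 1.0
N = 400

c = [1.0] + [alpha ** (n - 1) * (alpha - beta) for n in range(1, N)]

# (1 - alpha*T) X = (1 - beta*T) B  <=>  c_0 = 1, c_1 - alpha*c_0 = -beta,
# and c_n - alpha*c_{n-1} = 0 for n >= 2
assert abs(c[1] - alpha * c[0] + beta) < 1e-12
for n in range(2, N):
    assert abs(c[n] - alpha * c[n - 1]) < 1e-12

# Cov(Y_j, Y_K) = sigma^2 sum_m alpha^m alpha^(m + K - j)
#              = alpha^(K-j) * sigma^2 / (1 - alpha^2)
def cov_Y(lag):
    return sigma2 * sum(alpha ** m * alpha ** (m + lag) for m in range(N))

for lag in [0, 1, 3]:
    assert abs(cov_Y(lag) - alpha ** lag * sigma2 / (1 - alpha ** 2)) < 1e-9
```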
Exercise 3.5.

Let X be a process verifying Σ_{n=0}^{∞} b_n X_{K−n} = B_K (b_n ∈ ℝ), where B is a white noise of power σ². In addition we state b(z) = Σ_{n=0}^{∞} b_n z^n.

1) Show that, if j < K:

E X_j B_K = σ²/(2iπ) ∮_{C+} z^{K−j−1}/b(z) dz

(the integral of the complex variable z where C = {z | |z| = 1}).

2) Verify that if b(z) does not possess a root in the disk {z | |z| ≤ 1}, then ∀j < K, X_j ⊥ B_K (E X_j B_K = 0).

Solution 3.5.

1) E X_j B_K = Σ_{n=0}^{∞} b_n E X_j X_{K−n} and, by definition of the spectral density S_XX(u) of X:

E X_j X_{K−n} = Cov(X_j, X_{K−n}) = ∫_{−1/2}^{1/2} exp(2iπ(j − K + n)u) S_XX(u) du

Moreover, since (Σ_{n=0}^{∞} b_n T^n)X_K = B_K, X is obtained by filtering B (of spectral density σ²) by the filter of transfer function 1/b(z), and by the theorem on filtering:

S_XX(u) = σ²/|b(exp(−2iπu))|²

from where:

E X_j B_K = σ² ∫_{−1/2}^{1/2} exp(2iπ(j − K)u) Σ_{n=0}^{∞} b_n exp(2iπnu)/|b(exp(−2iπu))|² du
= σ² ∫_{−1/2}^{1/2} exp(2iπ(j − K)u) b(exp(2iπu))/|b(exp(−2iπu))|² du
= σ² ∫_{−1/2}^{1/2} exp(2iπ(j − K)u) · 1/b(exp(−2iπu)) du

In stating z = exp(−2iπu), dz = −2iπ z du, and finally:

E X_j B_K = σ²/(2iπ) ∮_{C+} z^{K−j−1}/b(z) dz

2) If b(z) does not possess a root in {z | |z| ≤ 1}, the function to be integrated is holomorphic inside the open disk D(0,1) (note that K − j − 1 ≥ 0 for j < K) and, using Cauchy's theorem, E X_j B_K = 0.
Chapter 4
Estimation
4.1. Position of the problem

We are examining two discrete time processes X_{ℕ*} = (X_1, ..., X_j, ...) and Y_{ℕ*} = (Y_1, ..., Y_j, ...):

– of the 2nd order;
– not necessarily wide sense stationary (WSS) (thus they do not necessarily have a spectral density).

X_{ℕ*} is called the state process; it is the process (physical, for example) that we are seeking to estimate, but it is not accessible directly.

Y_{ℕ*} is called the observation process; it is the process we observe (we observe a trajectory y_{ℕ*} = (y_1, ..., y_j, ...) which allows us to estimate the corresponding trajectory x_{ℕ*} = (x_1, ..., x_j, ...)).

A traditional example is the following:

X_{ℕ*} = (X_1, ..., X_j, ...)
Y_{ℕ*} = X_{ℕ*} + U_{ℕ*} = (X_1 + U_1, ..., X_j + U_j, ...)

where U_{ℕ*} is also a random process.

We thus say that the state process is perturbed by a parasite noise U_{ℕ*} (perturbation due to its measurement, transmission, etc.).

In what follows, the hypotheses and data below will be admitted:

– ∀j ∈ ℕ*, X_j and Y_j ∈ L²(dP);
– ∀i, j ∈ ℕ* × ℕ*, we know E X_j, Cov(X_i, Y_j) and Cov(Y_i, Y_j).

PROBLEM.– Having observed (or registered) a trajectory y_{ℕ*} of Y_{ℕ*} up to the instant K − 1, we want, for a given instant p, to determine the value "x̂_p which best approaches x_p (unknown)".

[Figure 4.1. Three trajectories: y_{ℕ*} = (y_1, ..., y_j, ...), x̂_{ℕ*} = (x̂_1, ..., x̂_j, ...) and x_{ℕ*} = (x_1, ..., x_j, ...), which is unknown]
If:

– p < K − 1 we speak of smoothing;
– p = K − 1 we speak of filtering;
– p > K − 1 we speak of prediction.

NOTE 1.– In the case of prediction, it is possible that we need only consider the process Y_{ℕ*}, as predicting y_p for p > K − 1 is already a problem.

NOTE 2.– Concerning the expression "x̂_p which best approaches x_p": we will see that the hypotheses (knowledge of variances and covariances) allow us to determine X̂_p, the 2nd order r.v. which best approaches X_p in quadratic mean, i.e. the r.v. X̂_p which is such that

E(X_p − X̂_p)² = Min_{Z ∈ L²} E(X_p − Z)²,

which is a result bearing on the means of the r.v. and not on the realizations. However, even if it were only because of the Bienaymé-Tchebychev inequality:

P(|X_p − X̂_p| ≥ C) ≤ E(X_p − X̂_p)²/C² = A

we see that we obtain a result based on the numerical realizations, since this inequality signifies exactly that at instant p, the unknown value x_p will belong to the known interval ]x̂_p − C, x̂_p + C[ with a probability higher than 1 − A.

This chapter is an introduction to Kalman filtering, for which we will have to consider the best estimation of the r.v. X_K (and also possibly of the r.v. Y_K) having observed Y_1, ..., Y_{K−1}, thus assuming that p = K.

SUMMARY.– Being given the observation process Y_{ℕ*}, considered up to the instant K − 1, any estimation Z of X_K will have the form Z = g(Y_1, ..., Y_{K−1}) where g: ℝ^{K−1} → ℝ is a Borel mapping. The problem that we will ask ourselves in the following sections is: how to find the best estimation in terms of quadratic mean X̂_{K|K−1} of X_K, i.e. the r.v. which makes the mapping Z → E(X_K − Z)² minimal (i.e. to find the function ĝ which renders E(X_K − g(Y_1, ..., Y_{K−1}))² minimal). We will have X̂_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}).
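The interval reading of the Bienaymé-Tchebychev inequality above can be illustrated with assumed numerical values for the error variance and the estimate:

```python
# Illustration (assumed values) of NOTE 2: from the Bienayme-Tchebychev
# inequality, the unknown value x_p lies in ]x_hat - C, x_hat + C[ with
# probability at least 1 - E(X_p - X_hat_p)^2 / C^2.
mse = 0.25     # assumed error variance E(X_p - X_hat_p)^2
x_hat = 3.0    # assumed numerical estimate
C = 2.0
lower_bound = 1 - mse / C ** 2
print((x_hat - C, x_hat + C), "with probability >=", lower_bound)
# prints (1.0, 5.0) with probability >= 0.9375
```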
4.2. Linear estimation

The fundamental space that we define below has already been introduced in Chapter 3, but in a different context.

DEFINITION.– The vector space of the linear combinations of the r.v. 1, Y_1, ..., Y_{K−1} is called the linear space of observation up to instant K − 1 and is denoted H^Y_{K−1} (or H(1, Y_1, ..., Y_{K−1})), i.e.:

H^Y_{K−1} = {λ_0 1 + Σ_{j=1}^{K−1} λ_j Y_j | λ_0, ..., λ_{K−1} ∈ ℝ}.

Since the r.v. 1, Y_1, ..., Y_{K−1} ∈ L²(dP), H^Y_{K−1} is a vector subspace (closed, as the number of r.v. is finite) of L²(dP). We can also say that H^Y_{K−1} is a Hilbert subspace of L²(dP).

We are focusing here on the problem stated in the preceding section, but with a simplified hypothesis: g is linear, which means that the envisaged estimators Z of X_K are of the form:

Z = g(Y_1, ..., Y_{K−1}) = λ_0 + Σ_{j=1}^{K−1} λ_j Y_j

and thus belong to H^Y_{K−1}.

The problem presents itself as: find the r.v., denoted X̂_{K|K−1}, which renders minimal the mapping Z ∈ H^Y_{K−1} → E(X_K − Z)² (i.e., find λ̂_0, λ̂_1, ..., λ̂_{K−1} which render minimal:

(λ_0, λ_1, ..., λ_{K−1}) → E(X_K − (λ_0 + Σ_{j=1}^{K−1} λ_j Y_j))²).

We will have X̂_{K|K−1} = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j Y_j.

DEFINITION.–

C(λ_0, λ_1, ..., λ_{K−1}) = E(X_K − (λ_0 + Σ_{j=1}^{K−1} λ_j Y_j))²

is called the "cost function".

The solution is given by the following result, relative to Hilbert spaces.

THEOREM.–

– There exists a unique r.v. X̂_{K|K−1} = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j Y_j which renders the mapping Z ∈ H^Y_{K−1} → E(X_K − Z)² minimal.

– X̂_{K|K−1} is the orthogonal projection of X_K on H^Y_{K−1} (which is also denoted proj_{H^Y_{K−1}} X_K). That is to say X_K − X̂_{K|K−1} ⊥ H^Y_{K−1}.

[Figure 4.2. Orthogonal projection of the vector X_K on H^Y_{K−1}]
This theorem being admitted, we finish off the problem by calculating λ̂_0, λ̂_1, ..., λ̂_{K−1}.

PROPOSITION.– Let us represent the covariance matrix of the vector Y = (Y_1, ..., Y_{K−1}) as Γ_Y.

1) The coefficients λ̂_0, λ̂_1, ..., λ̂_{K−1} of X̂_{K|K−1} = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j Y_j verify:

λ̂_0 = E X_K − Σ_{j=1}^{K−1} λ̂_j E Y_j and Γ_Y (λ̂_1, ..., λ̂_{K−1})^T = (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T

and, if Γ_Y is invertible:

(λ̂_1, ..., λ̂_{K−1})^T = Γ_Y^{−1} (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T;

2) X̃_{K|K−1} = X_K − X̂_{K|K−1} is a centered r.v. which represents the estimation error. We have:

Var X̃_{K|K−1} = Var(X_K − X̂_{K|K−1}) = E(X_K − X̂_{K|K−1})² = Var X_K − Σ_{i,j} λ̂_i λ̂_j Cov(Y_i, Y_j)

and, if Γ_Y is invertible:

= Var X_K − [Cov(X_K, Y_j)]^T Γ_Y^{−1} [Cov(X_K, Y_j)]
DEMONSTRATION.–

1) X_K − X̂_{K|K−1} ⊥ H^Y_{K−1} ⇔ X_K − X̂_{K|K−1} ⊥ 1, Y_1, ..., Y_{K−1}

– X_K − X̂_{K|K−1} ⊥ 1 ⇔ E(X_K − X̂_{K|K−1}) · 1 = E(X_K − (λ̂_0 + Σ_{j=1}^{K−1} λ̂_j Y_j)) = 0

i.e. E X_K = λ̂_0 + Σ_j λ̂_j E Y_j;   (1)

– X_K − X̂_{K|K−1} ⊥ Y_i ⇔ E(X_K − X̂_{K|K−1}) Y_i = E(X_K − (λ̂_0 + Σ_j λ̂_j Y_j)) Y_i = 0

i.e. E X_K Y_i = λ̂_0 E Y_i + Σ_j λ̂_j E Y_j Y_i   (2)

We take λ̂_0 = E X_K − Σ_j λ̂_j E Y_j from (1) and carry it into (2). It becomes:

E X_K Y_i = (E X_K − Σ_j λ̂_j E Y_j) E Y_i + Σ_j λ̂_j E Y_j Y_i
= E X_K E Y_i + Σ_j λ̂_j (E Y_j Y_i − E Y_j E Y_i)

That is to say:

∀i = 1 to K − 1: Σ_j λ̂_j Cov(Y_j, Y_i) = Cov(X_K, Y_i)

or, in the form of a matrix:

Γ_Y (λ̂_1, ..., λ̂_{K−1})^T = (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T.
– If Γ_Y is non-invertible:

Let us recall the equivalences: Γ_Y non-invertible ⇔ Γ_Y is only semi-definite positive ⇔ the r.v. Y_1 − E Y_1, ..., Y_{K−1} − E Y_{K−1} are linearly dependent in L² ⇔ dim H(Y_1 − E Y_1, ..., Y_{K−1} − E Y_{K−1}) < K − 1.

Under this hypothesis, there exists an infinity of tuples (λ̂_1, ..., λ̂_{K−1}) (and thus also an infinity of λ̂_0) which verify the last matrix equality, but all the expressions λ̂_0 + Σ_j λ̂_j Y_j are equal to the same r.v. X̂_{K|K−1}, according to the uniqueness of the orthogonal projection on a Hilbert subspace.

– If Γ_Y is invertible:

The r.v. Y_1 − E Y_1, ..., Y_{K−1} − E Y_{K−1} are linearly independent in L², the coefficients λ̂_0, λ̂_1, ..., λ̂_{K−1} are unique and we obtain:

(λ̂_1, ..., λ̂_{K−1})^T = Γ_Y^{−1} (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T and λ̂_0 = E X_K − Σ_{j=1}^{K−1} λ̂_j E Y_j

2) X_K − X̂_{K|K−1} is centered (obvious).

X_K = (X_K − X̂_{K|K−1}) + X̂_{K|K−1} and, as X_K − X̂_{K|K−1} ⊥ X̂_{K|K−1}, according to Pythagoras' theorem:

E(X_K − X̂_{K|K−1})² = E X_K² − E X̂²_{K|K−1} = E X_K² − E(λ̂_0 + Σ_j λ̂_j Y_j)²
and, since λ̂_0 = E X_K − Σ_j λ̂_j E Y_j:

E(X_K − X̂_{K|K−1})² = E X_K² − E(E X_K + Σ_j λ̂_j (Y_j − E Y_j))²
= E X_K² − (E X_K)² − 2 E X_K Σ_j λ̂_j E(Y_j − E Y_j) − Σ_{i,j} λ̂_i λ̂_j E(Y_i − E Y_i)(Y_j − E Y_j)

(the middle term being zero). From which:

E(X_K − X̂_{K|K−1})² = Var X_K − Σ_{i,j} λ̂_i λ̂_j Cov(Y_i, Y_j)

i.e., in the form of a matrix:

Var X_K − (λ̂_1, ..., λ̂_{K−1}) Γ_Y (λ̂_1, ..., λ̂_{K−1})^T.

In addition, if Γ_Y is invertible, since (λ̂_1, ..., λ̂_{K−1})^T = Γ_Y^{−1} (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T, it becomes:

E(X_K − X̂_{K|K−1})² = Var X_K − (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1})) Γ_Y^{−1} (Cov(X_K, Y_1), ..., Cov(X_K, Y_{K−1}))^T
NOTE.– If Cov(X_K, Y_1) = 0, ..., Cov(X_K, Y_{K−1}) = 0, the r.v. Y_j bring no further information in order to estimate the r.v. X_K in quadratic mean. Furthermore, by going back to the preceding formula:

(λ̂_1, ..., λ̂_{K−1})^T = Γ_Y^{−1} (0, ..., 0)^T = (0, ..., 0)^T and X̂_{K|K−1} = λ̂_0 = E X_K.

We rediscover the known result: being given an r.v. X ∈ L², the constant which minimizes Z → E(X − Z)² is X̂ = E X.
DEFINITION.– The hyperplane of ℝ^K of equation x = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j y_j is called the regression plane of X_K in Y_1, ..., Y_{K−1}.

Practically:

1) The statistical hypotheses on the processes X_{ℕ*} and Y_{ℕ*} have enabled us to calculate the numerical values λ̂_0, λ̂_1, ..., λ̂_{K−1} and thus to obtain the regression plane x = λ̂_0 + Σ_{j=1}^{K−1} λ̂_j y_j (the y_j and x covering ℝ).

2) We want to know the value x_K taken by X_K; we gather the observations y_1, ..., y_{K−1} and we thus deduce the sought estimation x̂_{K|K−1} (this time they are determined values).

3) We are assured that the true value x_K taken by the r.v. X_K is in the interval ]x̂_{K|K−1} − C, x̂_{K|K−1} + C[ with a probability greater than:

1 − E(X_K − X̂_{K|K−1})²/C²

a value that we can calculate using the formula from the preceding proposition.
PARTICULAR CASE.– We are going to estimate X_2 from the sole r.v. of observation Y_1, i.e. we are going to find X̂_{2|1} = λ̂_0 + λ̂_1 Y_1 which minimizes E(X_2 − (λ_0 + λ_1 Y_1))². According to the proposition:

λ̂_1 = (Var Y_1)^{−1} Cov(X_2, Y_1) and λ̂_0 = E X_2 − (Var Y_1)^{−1} Cov(X_2, Y_1) E Y_1

Thus:

X̂_{2|1} = E X_2 + (Cov(X_2, Y_1)/Var Y_1)(Y_1 − E Y_1).

[Figure 4.3. Regression line: 1. we trace the regression line x = λ̂_0 + λ̂_1 y; 2. we measure the value y_1, realization of the r.v. Y_1; 3. we choose x̂_{2|1} to approximate (linearly and in q.m.) the true but unknown value x_2]

Value of the variance of the estimation error:

E X̃²_{2|1} = E(X_2 − X̂_{2|1})² = Var X_2 − Cov(X_2, Y_1)(Var Y_1)^{−1} Cov(X_2, Y_1)
= Var X_2 (1 − (Cov(X_2, Y_1))²/(Var X_2 Var Y_1)).
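A small numerical instance of the proposition (with assumed covariances, K − 1 = 2 observations and a hand-rolled 2 × 2 inverse) can be sketched as follows:

```python
# A numeric instance (hypothetical moments) of the proposition:
# Gamma_Y * (lambda_1, lambda_2)^T = (Cov(X_K, Y_1), Cov(X_K, Y_2))^T,
# lambda_0 = E X_K - sum_j lambda_j E Y_j, and
# error variance = Var X_K - c^T Gamma_Y^{-1} c.
gamma = [[2.0, 1.0], [1.0, 2.0]]       # Cov(Y_i, Y_j), assumed
c = [1.0, 0.5]                          # Cov(X_K, Y_j), assumed
var_x, ex, ey = 3.0, 1.0, [0.5, -0.5]   # assumed moments of X_K and the Y_j

# explicit 2x2 inverse applied to c
det = gamma[0][0] * gamma[1][1] - gamma[0][1] * gamma[1][0]
lam1 = (gamma[1][1] * c[0] - gamma[0][1] * c[1]) / det
lam2 = (gamma[0][0] * c[1] - gamma[1][0] * c[0]) / det
lam0 = ex - lam1 * ey[0] - lam2 * ey[1]
err_var = var_x - (c[0] * lam1 + c[1] * lam2)

assert abs(lam1 - 0.5) < 1e-12 and abs(lam2) < 1e-12
assert abs(lam0 - 0.75) < 1e-12
assert abs(err_var - 2.5) < 1e-12
```

The estimate here is x̂ = 0.75 + 0.5 y_1, with error variance reduced from 3 to 2.5 by the observation.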
NOTE.– It may be interesting to note the parallel which exists between the problem of the best estimation in quadratic mean of X_K and that of the best approximation in L² of a function h by a trigonometric polynomial. We state B([0,T]) = Borel algebra of the interval [0,T] and give a table of the correspondences:

– H^Y_{K−1} ⊂ L²(Ω, a, P) ↔ H(e_{−K}, ..., e_0, ..., e_K) ⊂ L²([0,T], B([0,T]), dt);
– X_K ↔ h;
– X̂_{K|K−1} ↔ ĥ;
– X_K − X̂_{K|K−1} ↔ h − ĥ;
– L²(dP) = {r.v. X | E X² < ∞}, with the scalar product <X, Y> = E XY = ∫_Ω X(ω)Y(ω) dP(ω) ↔ L²(dt) = {f Borel function | ∫_0^T f²(t) dt < ∞}, with the scalar product <f, g> = ∫_0^T f(t)g(t) dt;
– for j = 1 to K − 1: Y_j ∈ L²(dP) ↔ for j = −K to K: e_j(t) = exp(2iπjt/T) ∈ L²(dt);
– linear space H^Y_{K−1} = H(1, Y_1, ..., Y_{K−1}) ↔ linear space H(e_{−K}, ..., e_0, ..., e_K);
– problem: being given the r.v. X_K ∈ L²(dP), find λ̂_0, λ̂_1, ..., λ̂_{K−1}, thus X̂_{K|K−1}, which minimizes E(X_K − (λ_0 + Σ_{j=1}^{K−1} λ_j Y_j))² ↔ problem: being given the function h ∈ L²(dt), find λ̂_{−K}, ..., λ̂_K, thus ĥ, which minimizes ∫_0^T |h(t) − Σ_{j=−K}^{K} λ_j e_j(t)|² dt.

In the problem of the best approximation of a function by a trigonometric polynomial, the coefficients λ̂_j have a very simple expression because the e_j form an orthonormal basis of H(e_{−K}, ..., e_K) and we have:

λ̂_j = (1/T) ∫_0^T h(t) e̅_j(t) dt = C_j, the Fourier coefficients of h.
Variant of the preceding proposition

We are now considering the linear space of observation

H^Y_{K−1} = {Σ_{j=1}^{K−1} λ_j Y_j | λ_j ∈ ℝ}

(without the constant 1) and we are thus seeking the r.v. X̂_{K|K−1} = Σ_{j=1}^{K−1} λ̂_j Y_j which minimizes the mapping Z ∈ H^Y_{K−1} → E(X_K − Z)².

Let us state M_Y = [E Y_i Y_j], the matrix of the 2nd order moments of the random vector (Y_1, ..., Y_{K−1}). We have the following proposition.

PROPOSITION.–

1) The λ̂_j verify M_Y (λ̂_1, ..., λ̂_{K−1})^T = (E X_K Y_1, ..., E X_K Y_{K−1})^T and, if M_Y is invertible:

(λ̂_1, ..., λ̂_{K−1})^T = M_Y^{−1} (E X_K Y_1, ..., E X_K Y_{K−1})^T.

2) E(X_K − X̂_{K|K−1})² = E X_K² − Σ_{i,j} λ̂_i λ̂_j E Y_i Y_j and, if M_Y is invertible:

= E X_K² − (E X_K Y_1, ..., E X_K Y_{K−1}) M_Y^{−1} (E X_K Y_1, ..., E X_K Y_{K−1})^T.

From now on, and in all that follows in this work, the linear space of observation at the instant K − 1 will be H^Y_{K−1} = {Σ_{j=1}^{K−1} λ_j Y_j | λ_j ∈ ℝ}.

INNOVATION.– Let (Y_K)_{K ∈ ℕ*} be a discrete process which (as will be the case in Kalman filtering) can be the observation process of another process (X_K)_{K ∈ ℕ*}, and let us state Ŷ_{K|K−1} = proj_{H^Y_{K−1}} Y_K; Ŷ_{K|K−1} is thus the best linear estimate in quadratic mean of the r.v. Y_K.

DEFINITION.– The r.v. I_K = Y_K − Ŷ_{K|K−1} is called the innovation at instant K (≥ 2). The family of r.v. {I_2, ..., I_K, ...} is called the innovation process.
4.3. Best estimate – conditional expectation

We are seeking to improve the result by considering as estimations of X_K not only the linear functions Σ_{j=1}^{K−1} λ_j Y_j of the r.v. Y_1, ..., Y_{K−1} but the general functions g(Y_1, ..., Y_{K−1}).

PROPOSITION.– The family of r.v. H′^Y_{K−1} = {g(Y_1, ..., Y_{K−1}) | g: ℝ^{K−1} → ℝ Borel function; g(Y_1, ..., Y_{K−1}) ∈ L²} is a closed vector subspace of L².

DEMONSTRATION.–

Let us note again L²(dP) = {r.v. Z | E Z² < ∞} = Hilbert space equipped with the scalar product: ∀Z_1, Z_2 ∈ L²(dP), <Z_1, Z_2>_{L²(dP)} = E Z_1 Z_2.

Furthermore, f_Y(y_1, ..., y_{K−1}) designating the density of the vector Y = (Y_1, ..., Y_{K−1}), in order to simplify the expressions let us state dμ = f_Y(y_1, ..., y_{K−1}) dy_1 ... dy_{K−1} and let us introduce the new Hilbert space:

L²(dμ) = {g: ℝ^{K−1} → ℝ Borel function | ∫_{ℝ^{K−1}} g²(y_1, ..., y_{K−1}) dμ < ∞}.

This is equipped with the scalar product: ∀g_1, g_2 ∈ L²(dμ), <g_1, g_2>_{L²(dμ)} = ∫_{ℝ^{K−1}} g_1(y_1, ..., y_{K−1}) g_2(y_1, ..., y_{K−1}) dμ.

Let us finally consider the linear mapping:

Ψ: g ∈ L²(dμ) → g(Y) = g(Y_1, ..., Y_{K−1}) ∈ L²(dP)

We notice that Ψ conserves the scalar product (and the norm):

<g_1(Y), g_2(Y)>_{L²(dP)} = E g_1(Y) g_2(Y) = ∫_{ℝ^{K−1}} g_1(y) g_2(y) dμ = <g_1, g_2>_{L²(dμ)}

From the hypothesis, H′^Y_{K−1} ⊂ L²(dP); let us verify that H′^Y_{K−1} is a vector subspace of L²(dP).

Let Z_1 and Z_2 ∈ H′^Y_{K−1} and two constants λ_1 and λ_2 ∈ ℝ; g_1 ∈ L²(dμ) is such that Z_1 = g_1(Y) and g_2 ∈ L²(dμ) is such that Z_2 = g_2(Y). Thus:

λ_1 Z_1 + λ_2 Z_2 = λ_1 Ψ(g_1) + λ_2 Ψ(g_2) = Ψ(λ_1 g_1 + λ_2 g_2)

and, as λ_1 g_1 + λ_2 g_2 ∈ L²(dμ), H′^Y_{K−1} is in fact a vector subspace of L²(dP).
Let us show next that H′^Y_{K−1} is closed in L²(dP).

Given Z_p = g_p(Y) = Ψ(g_p) a sequence of H′^Y_{K−1} which converges towards Z ∈ L²(dP), let us verify that Z ∈ H′^Y_{K−1}:

g_p(Y) is a Cauchy sequence of H′^Y_{K−1} and, because of the isometry, g_p is a Cauchy sequence of L²(dμ), which thus converges towards a function g ∈ L²(dμ), i.e.:

‖g_p − g‖²_{L²(dμ)} = ∫_{ℝ^{K−1}} (g_p(y) − g(y))² dμ = E(g_p(Y) − g(Y))² → 0 when p ↑ ∞.

As the limit of g_p(Y) is unique, g(Y) = Z, that is to say Z ∈ H′^Y_{K−1} and H′^Y_{K−1} is closed.

Finally, H′^Y_{K−1} is a Hilbert subspace of L²(dP).

Let us return to our problem, i.e. estimating the r.v. X_K. The best estimator X̂′_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}) ∈ H′^Y_{K−1} of X_K, that is to say the estimator which minimizes E(X_K − g(Y_1, ..., Y_{K−1}))², is (always in accordance with the theorem already cited about Hilbert spaces) the orthogonal projection of X_K on H′^Y_{K−1}, i.e.:

X̂′_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}) = proj_{H′^Y_{K−1}} X_K.

[Figure 4.4. Orthogonal projection of the vector X_K on H′^Y_{K−1}]

[Figure 4.5. Best linear estimation and best estimation: X_K in L²(dP), its projections X̂_{K|K−1} on H^Y_{K−1} and X̂′_{K|K−1} on H′^Y_{K−1}, with the error norms (E(X_K − X̂_{K|K−1})²)^{1/2} and (E(X_K − X̂′_{K|K−1})²)^{1/2}]
In Figure 4.5, the r.v. (vectors of L²) are represented by dots and the norms of the estimation errors are represented by segments. It is clear that we have the inclusions H^Y_{K−1} ⊂ H′^Y_{K−1} ⊂ L²(dP); thus, a priori, being given X_K ∈ L²(dP) − H′^Y_{K−1}, X̂′_{K|K−1} will be a better approximation of X_K than X̂_{K|K−1}, which we can visualize in Figure 4.5.

Finally, to resolve the problem posed entirely, we are looking to calculate X̂′_{K|K−1}.

PROPOSITION.– X̂′_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}) = proj_{H′^Y_{K−1}} X_K is the conditional expectation E(X_K | Y_1, ..., Y_{K−1}).

DEMONSTRATION.–

1) Let us verify to begin with that the r.v. g(Y_1, ..., Y_{K−1}) = E(X_K | Y_1, ..., Y_{K−1}) ∈ L²(dP):

(g(y_1, ..., y_{K−1}))² = (∫ x f(x | y) dx)²

and, by the Schwarz inequality:

≤ ∫ x² f(x | y) dx · ∫ 1² f(x | y) dx = ∫ x² f(x | y) dx
recalling that ∫ f(x | y) dx = 1; thus:

E g²(Y_1, ..., Y_{K−1}) = ∫_{ℝ^{K−1}} g²(y_1, ..., y_{K−1}) f_Y(y) dy ≤ ∫_{ℝ^{K−1}} f_Y(y) dy ∫ x² f(x | y) dx.

By stating here again U = (X_K, Y_1, ..., Y_{K−1}) and f_U(x, y) = f_Y(y) f(x | y), we have from Fubini's theorem:

E(g(Y_1, ..., Y_{K−1}))² ≤ ∫ x² dx ∫_{ℝ^{K−1}} f_U(x, y) dy = ∫ x² f_X(x) dx = E X_K² < ∞

We thus have g(Y_1, ..., Y_{K−1}) ∈ L²(dP) and also, being given the definition of H′^Y_{K−1}, g(Y_1, ..., Y_{K−1}) ∈ H′^Y_{K−1}.

2) In order to show that g(Y_1, ..., Y_{K−1}) = E(X_K | Y_1, ..., Y_{K−1}) is the orthogonal projection X̂′_{K|K−1} = ĝ(Y_1, ..., Y_{K−1}) = proj_{H′^Y_{K−1}} X_K, it suffices, as this projection is unique, to verify the orthogonality X_K − E(X_K | Y_1, ..., Y_{K−1}) ⊥ H′^Y_{K−1}, i.e.:

∀g(Y_1, ..., Y_{K−1}) ∈ H′^Y_{K−1}: X_K − E(X_K | Y_1, ..., Y_{K−1}) ⊥ g(Y_1, ..., Y_{K−1})
⇔ E X_K g(Y_1, ..., Y_{K−1}) = E(E(X_K | Y_1, ..., Y_{K−1}) g(Y_1, ..., Y_{K−1})).

Now, the first member:

E X_K g(Y_1, ..., Y_{K−1}) = ∫_{ℝ^K} x g(y) f_U(x, y) dx dy = ∫_{ℝ^K} x g(y) f(x | y) f_Y(y) dx dy

and, by applying Fubini's theorem:

= ∫_{ℝ^{K−1}} (∫ x f(x | y) dx) g(y) f_Y(y) dy

which is equal to the 2nd member E(E(X_K | Y_1, ..., Y_{K−1}) g(Y_1, ..., Y_{K−1})), and the proposition is demonstrated.
Practically, the random vector U = (X_K, Y_1, ..., Y_{K−1}) being associated with a physical, biological, etc., phenomenon, the realization of this phenomenon gives us K − 1 numerical values y_1, ..., y_{K−1} and the final responses to the problem will be the numerical values:

– x̂_{K|K−1} = Σ_{j=1}^{K−1} λ̂_j y_j in the case of the linear estimate;
– x̂′_{K|K−1} = E(X_K | y_1, ..., y_{K−1}) in the case of the general estimate.

We show now that in the Gaussian case X̂_{K|K−1} and X̂′_{K|K−1} coincide. The following proposition demonstrates this more precisely.

PROPOSITION.– If the vector U = (X_K, Y_1, ..., Y_{K−1}) is Gaussian, we have the equality between r.v.:

X̂′_{K|K−1} = X̂_{K|K−1} + E(X_K − Σ_{j=1}^{K−1} λ̂_j Y_j).

DEMONSTRATION.–

(X_K, Y_1, ..., Y_{K−1}) Gaussian vector ⇒ (X_K − Σ_{j=1}^{K−1} λ̂_j Y_j, Y_1, ..., Y_{K−1}) is equally Gaussian.

Let us state V = X_K − Σ_{j=1}^{K−1} λ̂_j Y_j.

V is orthogonal to H^Y_{K−1}, thus E V Y_j = 0 ∀j = 1 to K − 1, and the two vectors V and (Y_1, ..., Y_{K−1}) are uncorrelated.

We know that if the vector (V, Y_1, ..., Y_{K−1}) is Gaussian and V and (Y_1, ..., Y_{K−1}) are uncorrelated, then V and (Y_1, ..., Y_{K−1}) are independent.

FINALLY.–

E(X_K | Y_1, ..., Y_{K−1}) = E(Σ_{j=1}^{K−1} λ̂_j Y_j + V | Y_1, ..., Y_{K−1}) = Σ_{j=1}^{K−1} λ̂_j Y_j + E(V | Y_1, ..., Y_{K−1})

As V and (Y_1, ..., Y_{K−1}) are independent:

E(X_K | Y_1, ..., Y_{K−1}) = Σ_{j=1}^{K−1} λ̂_j Y_j + E V.

EXAMPLE.– Let U = (X_K, Y_{K−1}) = (X, Y) be a Gaussian couple of density:

f_U(x, y) = (1/(π√3)) exp(−(2/3)(x² − xy + y²)).

We wish to determine E(X | Y).
Y admits the density:
(
)
⎛ 2 ⎞ exp ⎜ − x 2 − xy + y 2 ⎟ dx π 3 ⎝ 3 ⎠ 2 ⎛ 2⎛ ⎛ y2 ⎞ 1 y⎞ ⎞ exp ⎜ − ⎟ exp ⎜ − ⎜ x − ⎟ ⎟ dx = ⎜ 3⎝ 2 ⎠ ⎟⎠ π 3∫ ⎝ 2 ⎠ ⎝ ⎛ y2 ⎞ 1 1 ⎛ 2 ⎞ exp ⎜ − ⎟ exp ⎜ − u 2 ⎟ du = ∫ ⎜ 2 ⎟ 3π 2π ⎝ 3 ⎠ ⎝ ⎠ 2
fY ( y ) = ∫
1
⎛ y2 ⎞ 1 exp ⎜ − ⎟ = 2π ⎝ 2 ⎠
⎛ y2 ⎞ f Z ( x, y ) 1 ⎛ 2 ⎞ exp ⎜ − x 2 − xy + y 2 ⎟ 2π exp ⎜ ⎟ = fY ( y ) π 3 ⎝ 3 ⎠ ⎝ 2 ⎠ 2 ⎛ 2⎛ 2 y⎞ ⎞ exp ⎜ − ⎜ x − ⎟ ⎟ = ⎜ 3⎝ 3π 2 ⎠ ⎟⎠ ⎝ ⎛ 2⎞ 1 1 y ⎜ ⎟. exp − x− = 2 ⎟ 3 ⎜ 3 2 i 2π i 4 ⎝ ⎠ 4
(
f ( x y) =
(
Thus, knowing Y = y , X
E ( X y) = y
2
(
)
⎛ ⎝
)
follows a law N
and E X Y = Y
(Here EV = E ⎜ X −
)
2
( y 2 , 34)
; that is to say:
1 (linear function of Y ; λˆ = ).
1 ⎞ Y ⎟ = 0 for X and Y are centered.) 2 ⎠
2
Estimation
165
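The conclusion E(X | Y = y) = y/2 can be checked numerically by integrating the joint density (a sketch; the integration grid is an arbitrary choice):

```python
import math

# Numerical check (a sketch) of the example: with joint density
# f_U(x, y) = (1/(pi*sqrt(3))) * exp(-(2/3)*(x^2 - x*y + y^2)),
# the conditional expectation is E(X | Y = y) = y/2.
def f_U(x, y):
    return math.exp(-(2.0 / 3.0) * (x * x - x * y + y * y)) / (math.pi * math.sqrt(3.0))

def cond_mean(y, lo=-12.0, hi=12.0, n=4000):
    # E(X | Y = y) = (integral of x f_U(x, y) dx) / (integral of f_U(x, y) dx),
    # both approximated by the midpoint rule
    dx = (hi - lo) / n
    xs = [lo + (k + 0.5) * dx for k in range(n)]
    num = sum(x * f_U(x, y) for x in xs) * dx
    den = sum(f_U(x, y) for x in xs) * dx
    return num / den

for y in [-1.0, 0.0, 0.8, 2.0]:
    assert abs(cond_mean(y) - y / 2) < 1e-6
```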
defined
by
4.4. Example: prediction of an autoregressive process AR (1) Let
∀K ∈
us
consider
XK =
the
WSS
process
X
∞
∑ q j BK − j and solution of the equation j =∞
X K = qX K −1 + BK
with q which is real such that q < 1 and where BZ is a white noise of power
EBK2 = σ 2 . In the preceding chapter we calculated its covariance function and obtained:
EX i X i + n
n
q =σ . 1 − q2 2
Having observed the r.v. $X_1, \ldots, X_{K-1}$, we are seeking the best linear estimate in the quadratic mean, $\hat{X}_{K+\ell \mid K-1}$, of $X_{K+\ell}$ (prediction horizon $\ell \geq 0$): $\hat{X}_{K+\ell \mid K-1} = \sum_{j=1}^{K-1} \hat{\lambda}_j X_j$ and the $\hat{\lambda}_j$ verify:

$\begin{pmatrix} E X_1 X_1 & \cdots & E X_1 X_{K-1} \\ \vdots & & \vdots \\ E X_{K-1} X_1 & \cdots & E X_{K-1} X_{K-1} \end{pmatrix} \begin{pmatrix} \hat{\lambda}_1 \\ \vdots \\ \hat{\lambda}_{K-1} \end{pmatrix} = \begin{pmatrix} E X_{K+\ell} X_1 \\ \vdots \\ E X_{K+\ell} X_{K-1} \end{pmatrix}$

i.e.

$\begin{pmatrix} 1 & q & \cdots & q^{K-2} \\ q & 1 & \cdots & q^{K-3} \\ \vdots & & \ddots & \vdots \\ q^{K-2} & \cdots & & 1 \end{pmatrix} \begin{pmatrix} \hat{\lambda}_1 \\ \vdots \\ \hat{\lambda}_{K-1} \end{pmatrix} = \begin{pmatrix} q^{K+\ell-1} \\ q^{K+\ell-2} \\ \vdots \\ q^{\ell+1} \end{pmatrix}$
We have the solution $(\hat{\lambda}_1, \ldots, \hat{\lambda}_{K-2}, \hat{\lambda}_{K-1}) = (0, \ldots, 0, q^{\ell+1})$ and this solution is unique as the determinant of the matrix is equal to $(1 - q^2)^{K-2} \neq 0$.

Thus $\hat{X}_{K+\ell \mid K-1} = \hat{\lambda}_{K-1} X_{K-1} = q^{\ell+1} X_{K-1}$.
We see that the prediction of the r.v. $X_{K+\ell}$ only uses the last r.v. observed, i.e. here $X_{K-1}$. The variance of the estimation error equals:

$E\big(X_{K+\ell} - \hat{X}_{K+\ell \mid K-1}\big)^2 = E\big(X_{K+\ell} - q^{\ell+1} X_{K-1}\big)^2$
$\qquad = E X_{K+\ell}^2 + q^{2(\ell+1)} E X_{K-1}^2 - 2 q^{\ell+1} E X_{K+\ell} X_{K-1} = \frac{\sigma^2}{1-q^2}\big(1 - q^{2(\ell+1)}\big).$
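The normal equations above can be checked numerically: with the Toeplitz matrix $(q^{|i-j|})$ and the right-hand side $(q^{K+\ell-1}, \ldots, q^{\ell+1})^T$, the solution concentrates on the last coordinate. A sketch (the values of $q$, $K$ and $\ell$ are arbitrary):

```python
import numpy as np

q, K, ell = 0.7, 6, 2                            # any |q| < 1, any horizon ell
idx = np.arange(K - 1)
R = q ** np.abs(idx[:, None] - idx[None, :])     # Toeplitz matrix (q^{|i-j|})
rhs = q ** (K + ell - np.arange(1, K))           # (q^{K+ell-1}, ..., q^{ell+1})

lam = np.linalg.solve(R, rhs)
# Only the coefficient of the last observed r.v. is non-zero: q^{ell+1}
print(np.round(lam, 12))
```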
4.5. Multivariate processes

In certain practical problems, we may have to consider a state process $X_{(j \in \mathbb{N}^*)}$ and an observation process $Y_{(j \in \mathbb{N}^*)}$ which are such that:

$\forall j \in \mathbb{N}^* \quad X_j = \begin{pmatrix} X_j^1 \\ \vdots \\ X_j^n \end{pmatrix} \quad \text{and} \quad Y_j = \begin{pmatrix} Y_j^1 \\ \vdots \\ Y_j^m \end{pmatrix}$

where $\forall j$ the components of $X_j$ and $Y_j$ belong to $L^2$.

We thus say that:
– $X_j$ and $Y_j$ are multivectors (vectors because their components belong to the vector space $L^2$; multi because $X_j$ and $Y_j$ are sets of several such vectors);
– $n$ is the order of the multivector $X_j$ and $m$ is the order of the multivector $Y_j$;
– $X_j \in (L^2)^n$ and $Y_j \in (L^2)^m$;
– $X_{(j \in \mathbb{N}^*)}$ and $Y_{(j \in \mathbb{N}^*)}$ are multivariate processes, the processes considered until this point (with values in $\mathbb{R}$) being called scalar.
Operations on the multivectors:
– we can add two multivectors of the same order, and if $X$ and $X' \in (L^2)^n$ then $X + X' \in (L^2)^n$;
– we can multiply a multivector by a real constant, and if $X \in (L^2)^n$ and $\lambda \in \mathbb{R}$ then $\lambda X \in (L^2)^n$;
– scalar product of two multivectors, not necessarily of the same order: i.e. $X \in (L^2)^n$ and $Y \in (L^2)^m$.

We state $\langle X, Y \rangle = E X Y^T \in M(n, m)$ where $M(n, m)$ is the space of the matrices of $n$ rows and $m$ columns. The matrix of $M(n, m)$ which is identically zero is denoted $O_{nm}$.

DEFINITION.– We say that the multivectors $X$ and $Y$ are orthogonal if $\langle X, Y \rangle = O_{nm}$ and we write $X \perp Y$.
NOTE.– If $X$ and $Y$ are orthogonal, $Y$ and $X$ are also. We state $\|X\|^2 = \langle X, X \rangle = E X X^T$.

$\|X\|^2$ being a positive definite matrix, we know that there exists a symmetric positive definite matrix, denoted $\|X\|$, such that $\|X\| \cdot \|X\| = \|X\|^2$. Nevertheless, in what follows we will only use $\|\cdot\|^2$.

NOTE.– The set of multivectors of the same order ($(L^2)^m$ for example) could be equipped with a vector space structure. On this space the symbol $\|\cdot\|$ previously defined would be a norm. Here, however, we are considering the set of multivectors of order $n$ or $m$. This set is not a vector space and thus cannot be equipped with a norm. Thus, for us, in what follows $\|X\|^2$ will not have the significance "norm of $X$ squared". For the same reason, it is only through misuse of language that we will speak of the scalar product $\langle X, Y \rangle$.
Linear observation space

Thus, let the multivariate state process $X_{(j \in \mathbb{N}^*)}$ verify $\forall j \in \mathbb{N}^*$, $X_j \in (L^2)^n$, and the multivariate observation process $Y_{(j \in \mathbb{N}^*)}$ verify $\forall j \in \mathbb{N}^*$, $Y_j \in (L^2)^m$.

By generalization of the definition given in section 4.2, we note:

$H_{K-1}^Y = H(Y_1, \ldots, Y_{K-1}) = \Big\{ \sum_{j=1}^{K-1} \Lambda_j Y_j \;\Big|\; \Lambda_j \in M(n, m) \Big\}$

and we say again that $H_{K-1}^Y$ is the linear space of observation until the instant $K - 1$.
NOTE.– The elements of $H_{K-1}^Y$ must be multivectors of order $n$, for it is from among them that we will choose the best estimate of $X_K$, a multivector of order $n$. $H_{K-1}^Y$ is thus adapted to $X_K$.

NOTATIONS.–
– Orthogonal of $H_{K-1}^Y$: this is the set, denoted $H_{K-1}^{Y,\perp}$, of the multivectors $V$ verifying $V \in H_{K-1}^{Y,\perp}$ if and only if $V$ is orthogonal to $H_{K-1}^Y$.
– $0_H$: the multivector of $H_{K-1}^Y$ whose $n$ components are the zero r.v.
Problem of best estimate

Generalizing the problem developed in section 4.2 to the case of multivariate processes, we are seeking to approximate $X_K = (X_K^1, \ldots, X_K^n)^T$ by the elements $Z = (Z^1, \ldots, Z^n)^T$ of $H_{K-1}^Y$, the distance between $X_K$ and $Z$ being:

$\mathrm{tr}\,\|X_K - Z\|^2 \quad \text{where} \quad \mathrm{tr}\,\|X_K - Z\|^2 = \mathrm{tr}\,E(X_K - Z)(X_K - Z)^T = \sum_{j=1}^{n} E\big(X_K^j - Z^j\big)^2$

($\mathrm{tr}$ signifies "trace of the matrix $\|X_K - Z\|^2$").

The following result generalizes the theorem of projection on Hilbert subspaces and brings with it the solution.
THEOREM.–
– $\hat{X}_{K \mid K-1} = \sum_{j=1}^{K-1} \hat{\Lambda}_j Y_j$ is the unique element of $H_{K-1}^Y$ which minimizes the mapping $Z \to \mathrm{tr}\,\|X_K - Z\|^2$;
– $\hat{X}_{K \mid K-1}$ is the orthogonal projection of $X_K$ on $H_{K-1}^Y$, i.e. $X_K - \hat{X}_{K \mid K-1} \perp H_{K-1}^Y$, which is to say again: $\langle X_K - \hat{X}_{K \mid K-1}, Y_j \rangle = O_{nm}$ $\forall j = 1$ to $K-1$.

We can provide an image of this theorem using the following schema in which all the vectors which appear are, in fact, multivectors of order $n$.
[Figure: orthogonal projection of $X_K$ on the plane $H_{K-1}^Y$, showing $\hat{X}_{K \mid K-1}$, an arbitrary element $Z$, and the error $X_K - \hat{X}_{K \mid K-1}$ perpendicular to the plane.]

Figure 4.6. Orthogonal projection of multivector $X_K$ on $H_{K-1}^Y$

NOTATION.– In what follows all the orthogonal projections (exclusively on $H_{K-1}^Y$) will be denoted indifferently: $\hat{X}_{K \mid K-1}$ or $\mathrm{proj}_{H_{K-1}^Y} X_K$; $\hat{Y}_{K \mid K-1}$ or $\mathrm{proj}_{H_{K-1}^Y} Y_K$, etc.
From this theorem we deduce the following properties:

P1) Given $X_K$ and $X_K' \in (L^2)^n$, then $\widehat{(X + X')}_{K \mid K-1} = \hat{X}_{K \mid K-1} + \hat{X}'_{K \mid K-1}$.

In effect:

$\forall j = 1$ to $K-1$: $\langle X_K - \hat{X}_{K \mid K-1}, Y_j \rangle = O_{nm}$ and $\langle X_K' - \hat{X}'_{K \mid K-1}, Y_j \rangle = O_{nm}$.

Thus:

$\forall j = 1$ to $K-1$: $\langle X_K + X_K' - \big(\hat{X}_{K \mid K-1} + \hat{X}'_{K \mid K-1}\big), Y_j \rangle = O_{nm}$.

In addition, since the orthogonal projection of $X_K + X_K'$ is unique, we in fact have:

$\widehat{(X + X')}_{K \mid K-1} = \hat{X}_{K \mid K-1} + \hat{X}'_{K \mid K-1}$

P2) Given $X_K \in (L^2)^n$ and a matrix $H \in M(m, n)$; then $\widehat{(HX)}_{K \mid K-1} = H \hat{X}_{K \mid K-1}$.

It is enough to verify that $H X_K - H \hat{X}_{K \mid K-1} \perp H_{K-1}^Y$, since the orthogonal projection (here on the space $H_{K-1}^Y$) is unique.

Now by hypothesis

$\langle X_K - \hat{X}_{K \mid K-1}, Y_j \rangle = E\big( (X_K - \hat{X}_{K \mid K-1})\, Y_j^T \big) = O_{nm}.$
Thus also

$H \Big( E\big( (X_K - \hat{X}_{K \mid K-1})\, Y_j^T \big) \Big) = E\Big( H (X_K - \hat{X}_{K \mid K-1})\, Y_j^T \Big) = O_{mm}$

and, by associativity of the matrix product,

$E\Big( \big( H (X_K - \hat{X}_{K \mid K-1}) \big) Y_j^T \Big) = \langle H X_K - H \hat{X}_{K \mid K-1}, Y_j \rangle = O_{mm}$

and we have indeed $H X_K - H \hat{X}_{K \mid K-1} \perp H_{K-1}^Y$.

These properties are going to be used in what follows.

Innovation process $I_{(K \in \mathbb{N}^*)}$

With Kalman filtering in mind, we are supposing here that $X_{(K \in \mathbb{N}^*)}$ and $Y_{(K \in \mathbb{N}^*)}$ are the two multivariate processes stated earlier, linked by equations of state and equations of observation:
$\begin{cases} X_{K+1} = A(K)\, X_K + C(K)\, N_K \\ Y_K = H(K)\, X_K + G(K)\, W_K \end{cases}$

where $A(K) \in M(n, n)$; $C(K) \in M(n, r)$ ($r$ being the order of the noise $N_K$); $H(K) \in M(m, n)$; $G(K) \in M(m, p)$, and where $N_{(K \in \mathbb{N}^*)}$ and $W_{(K \in \mathbb{N}^*)}$ are noises (multivariate processes) satisfying a certain number of hypotheses, of which the only one necessary here is:

$\forall j = 1 \text{ to } K-1 \quad \langle W_K, Y_j \rangle = E W_K Y_j^T = O_{pm}$
1) If $n = m$:

$Y_K$ and $\hat{Y}_{K \mid K-1}$ are two multivectors of the same order $m$. The difference $Y_K - \hat{Y}_{K \mid K-1}$ thus has a sense and, in accordance with the definition given in section 4.2, we define the innovation at the instant $K \geq 2$ by

$I_K = Y_K - \hat{Y}_{K \mid K-1}$

Let us now express $I_K$ in the form which will be useful to us in the future. By the second equation of state:

$I_K = Y_K - \mathrm{proj}_{H_{K-1}^Y} \big( H(K)\, X_K + G(K)\, W_K \big).$

By using property P1 first and then P2:

$I_K = Y_K - H(K)\, \hat{X}_{K \mid K-1} - \widehat{(G(K) W_K)}_{K \mid K-1}$

(if $p \neq m$ and $p \neq n$, $\widehat{(G(K) W)}_{K \mid K-1}$ is not equal to $G(K)\, \hat{W}_{K \mid K-1}$; moreover, this last matrix product has no meaning).

To finish, let us verify that $\widehat{(G(K) W_K)}_{K \mid K-1} = 0_H$.

By definition of the orthogonal projection:

$\langle G(K) W_K - \widehat{(G(K) W_K)}_{K \mid K-1}, Y_j \rangle = O_{mm} \quad \forall j = 1 \text{ to } K-1$

By hypothesis on the noise $W_{(K \in \mathbb{N}^*)}$:

$\langle G(K) W_K, Y_j \rangle = G(K) \langle W_K, Y_j \rangle = O_{mm} \quad \forall j = 1 \text{ to } K-1$
We can deduce from this that $\langle \widehat{(G(K) W_K)}_{K \mid K-1}, Y_j \rangle = O_{mm}$ $\forall j = 1$ to $K-1$, i.e. $\widehat{(G(K) W_K)}_{K \mid K-1} \in H_{K-1}^{Y, \perp}$; since it also belongs to $H_{K-1}^Y$, this is to say: $\widehat{(G(K) W_K)}_{K \mid K-1} = 0_H$.

Finally $I_K = Y_K - \hat{Y}_{K \mid K-1} = Y_K - H(K)\, \hat{X}_{K \mid K-1}$.

2) If $n \neq m$:

$Y_K$ and $\hat{Y}_{K \mid K-1}$ are multivectors of different orders, so $Y_K - \hat{Y}_{K \mid K-1}$ has no meaning and we directly define $I_K = Y_K - H(K)\, \hat{X}_{K \mid K-1}$.

Finally, and in all cases ($n$ equal to or different from $m$):
DEFINITION.– We name innovation at instant $K \geq 2$ the multivector of order $m$ defined by $I_K = Y_K - H(K)\, \hat{X}_{K \mid K-1}$. ($I_K \in H_{K-1}^{Y,\perp}$.)

NOTE.– We must not confuse innovation with the following.

DEFINITIONS.– We call prediction error of state at instant $K$ the multivector of order $n$ defined by $\tilde{X}_{K \mid K-1} = X_K - \hat{X}_{K \mid K-1}$.

We call error of filtering at instant $K$ the multivector of order $n$ defined by $\tilde{X}_{K \mid K} = X_K - \hat{X}_{K \mid K}$.

Properties of innovation
1) $I_K \perp Y_j \quad \forall j = 1 \text{ to } K-1$;
2) $I_{K'} \perp I_K \quad \forall K \text{ and } K' \geq 2 \text{ with } K \neq K'$.
DEMONSTRATION.–

1) $I_K = Y_K - H(K)\, \hat{X}_{K \mid K-1} = H(K)\, X_K + G(K)\, W_K - H(K)\, \hat{X}_{K \mid K-1}$

thus:

$\langle I_K, Y_j \rangle = \langle H(K)\big(X_K - \hat{X}_{K \mid K-1}\big) + G(K)\, W_K, Y_j \rangle.$

By using the associativity of the matrix product, since

$\langle H(K)\big(X_K - \hat{X}_{K \mid K-1}\big), Y_j \rangle = H(K) \langle X_K - \hat{X}_{K \mid K-1}, Y_j \rangle = O_{mm}$

and since

$\langle G(K)\, W_K, Y_j \rangle = G(K) \langle W_K, Y_j \rangle = O_{mm}$

we have in fact $\langle I_K, Y_j \rangle = O_{mm}$ and $I_K \perp Y_j$.

2) Without loss of generality, let us suppose for example $K' > K$:

$\langle I_{K'}, I_K \rangle = \langle I_{K'}, Y_K - H(K)\, \hat{X}_{K \mid K-1} \rangle$

and this scalar product equals $O_{mm}$ as $I_{K'} \in H_{K'-1}^{Y,\perp}$ and $Y_K - H(K)\, \hat{X}_{K \mid K-1} \in H_K^Y$ ($Y_K \in H_K^Y$ and $H(K)\, \hat{X}_{K \mid K-1} \in H_{K-1}^Y$).

4.6. Exercises for Chapter 4

Exercise 4.1.

Given a family of second order r.v. $X, Y_1, \ldots, Y_K, \ldots$, we wish to estimate $X$ starting from the $Y_j$ and we state: $\hat{X}_K = E(X \mid Y_1, \ldots, Y_K)$.
Verify that $E(\hat{X}_{K+1} \mid Y_1, \ldots, Y_K) = \hat{X}_K$.

(We say that the process $\hat{X}_{(K \in \mathbb{N}^*)}$ is a martingale with respect to the sequence of the $Y_K$.)

Exercise 4.2.

Let $\{U_j \mid j \in \mathbb{N}\}$ be a sequence of independent r.v. of the second order, of law $N(0, \sigma^2)$, and let $\theta$ be a real constant.

We define a new sequence $\{X_j \mid j \in \mathbb{N}\}$ by: $X_1 = U_1$; $X_j = \theta U_{j-1} + U_j$ if $j \geq 2$.

1) Show that $\forall K \in \mathbb{N}^*$, the vector $X^K = (X_1, \ldots, X_K)$ is Gaussian.
2) Specify the mean, the covariance matrix and the probability density of this vector.
3) Determine the best prediction in quadratic mean of $X_{2+P}$ at instant $K = 2$, i.e. calculate $E(X_{2+P} \mid X_1, X_2)$.
Solution 4.2.

1) Let us consider the lower triangular matrix $A = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \theta & 1 & \ddots & \vdots \\ & \ddots & \ddots & 0 \\ 0 & & \theta & 1 \end{pmatrix}$ belonging to $M(K, K)$.

By stating $U^K = (U_1, \ldots, U_K)$, we can write $X^K = A U^K$. The vector $U^K$ being Gaussian (Gaussian and independent components), the same can be said for the vector $X^K$.

2) $E X^K = E A U^K = A\, E U^K = 0$ and $\Gamma_{X^K} = A (\sigma^2 I) A^T = \sigma^2 A A^T$ ($I$ = identity matrix).

Furthermore: $\mathrm{Det}\,\Gamma_{X^K} = \mathrm{Det}(\sigma^2 A A^T) = \sigma^{2K}$ and $\Gamma_{X^K}$ is invertible.

We obtain

$f_{X^K}(x_1, \ldots, x_K) = \frac{1}{(2\pi)^{K/2}\, \sigma^K} \exp\Big(-\frac{1}{2} x^T \Gamma_{X^K}^{-1} x\Big).$

3) The vector $(X_1, X_2, X_{2+P})$ is Gaussian; thus the best prediction of $X_{2+P}$ is the best linear prediction, which is to say:

$\hat{X}_{2+P} = E(X_{2+P} \mid X_1, X_2) = \mathrm{proj}_H X_{2+P}$

where $H$ is the linear space generated by the r.v. $X_1$ and $X_2$.

Thus $\hat{X}_{2+P} = \hat{\lambda}_1 X_1 + \hat{\lambda}_2 X_2$ with $\begin{pmatrix} \hat{\lambda}_1 \\ \hat{\lambda}_2 \end{pmatrix} = \Gamma_{X^2}^{-1} \begin{pmatrix} \mathrm{Cov}(X_{2+P}, X_1) \\ \mathrm{Cov}(X_{2+P}, X_2) \end{pmatrix}$;

now $\mathrm{Cov}(X_j, X_K) = E X_j X_K = \theta \sigma^2$ if $|K - j| = 1$ and $\mathrm{Cov}(X_j, X_K) = 0$ if $|K - j| > 1$.

Thus, if $P > 1$: $\begin{pmatrix} \mathrm{Cov}(X_{2+P}, X_1) \\ \mathrm{Cov}(X_{2+P}, X_2) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and $\hat{X}_{2+P} = 0$.

If $P = 1$:

$\begin{pmatrix} \hat{\lambda}_1 \\ \hat{\lambda}_2 \end{pmatrix} = \frac{1}{\sigma^2} \begin{pmatrix} 1 + \theta^2 & -\theta \\ -\theta & 1 \end{pmatrix} \begin{pmatrix} 0 \\ \theta \sigma^2 \end{pmatrix} = \begin{pmatrix} -\theta^2 \\ \theta \end{pmatrix}$ and $\hat{X}_3 = -\theta^2 X_1 + \theta X_2.$
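The coefficients for $P = 1$ can be verified directly from the covariances ($\Gamma_{X^2}$ has entries $\sigma^2$, $\theta\sigma^2$, $(1+\theta^2)\sigma^2$); the test values of $\theta$ and $\sigma^2$ below are arbitrary:

```python
import numpy as np

theta, sigma2 = 0.8, 2.0                       # arbitrary test values

Gamma = sigma2 * np.array([[1.0, theta],
                           [theta, 1.0 + theta**2]])   # cov of (X_1, X_2)
c = sigma2 * np.array([0.0, theta])            # (Cov(X_3,X_1), Cov(X_3,X_2))

lam = np.linalg.solve(Gamma, c)
print(lam)                                     # (-theta^2, theta)
```

Note that the optimal coefficients do not depend on $\sigma^2$, since both $\Gamma_{X^2}$ and the covariance vector are proportional to it.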
Exercise 4.3.

We are considering the state system

$\begin{cases} X_{K+1} = A(K)\, X_K + C(K)\, N_K & (1) \\ Y_K = H(K)\, X_K + G(K)\, W_K & (2) \end{cases}$

where $A(K) \in M(n, n)$; $C(K) \in M(n, r)$; $H(K) \in M(m, n)$; $G(K) \in M(m, p)$ and where $X_0, N_K, W_K$ (for $K \geq 0$) are multivectors of the second order such that $\forall j \leq K$, $W_K$ is orthogonal to $X_0, N_0, \ldots, N_{j-1}, W_0, \ldots, W_{j-1}$.

Show that

$\forall j \leq K \quad \langle H(j)\big(X_j - \hat{X}_{j \mid j-1}\big), W_K \rangle = O_{mp}.$
Solution 4.3.

$\langle H(j)\big(X_j - \hat{X}_{j \mid j-1}\big), W_K \rangle = \Big\langle H(j)\Big( A(j-1) X_{j-1} + C(j-1) N_{j-1} - \sum_{i=1}^{j-1} \hat{\Lambda}_i \big(H(i) X_i + G(i) W_i\big) \Big), W_K \Big\rangle$

(where the $\hat{\Lambda}_i$ are the optimal matrices of $M(n, m)$).

Taking into account the hypotheses of orthogonality of the subject, this scalar product reduces to

$\Big\langle H(j)\Big( A(j-1) X_{j-1} - \sum_{i=1}^{j-1} \hat{\Lambda}_i H(i) X_i \Big), W_K \Big\rangle.$

Furthermore, by iterating the recurrence relation (1), we see that $X_i$ expresses itself according to $X_{i-1}, N_{i-1}$, then $X_{i-2}, N_{i-2}, N_{i-1}, \ldots$ and finally according to $X_0, N_0, N_1, \ldots, N_{i-1}$.

Thus, $H(j) A(j-1) X_{j-1}$ and $H(j) \hat{\Lambda}_i H(i) X_i$ are multivectors of order $m$ of which each of the $m$ "components" only consists of r.v. orthogonal to each of the $p$ "components" of $W_K$, multivector of order $p$. Finally, we have in fact

$\langle H(j)\big(X_j - \hat{X}_{j \mid j-1}\big), W_K \rangle = O_{mp}.$
Chapter 5

The Wiener Filter

5.1. Introduction

Wiener filtering is a method of estimating a signal perturbed by an additive noise. The response of this filter to the noisy signal, correlated with the signal to estimate, is optimal in the sense of the minimum in $L^2$. The filter must be practically realizable and stable if possible; as a consequence, its impulse response must be causal and its poles must lie inside the unit circle. Wiener filtering is often used because of its simplicity, but it requires the signals to be analyzed to be WSS processes. Examples of applications: speech processing, oil exploration, swell motion, etc.
5.1.1. Problem position

[Figure: block diagram — the signal $X_K$ and the noise $W_K$ enter a summer $\Sigma$ whose output $Y_K$ feeds the unknown filter $h$, which outputs $Z_K$.]

Figure 5.1. Representation for the transmission. $h$ is the impulse response of the filter that we are going to look for

In Figure 5.1, $X_K$, $W_K$ and $Y_K$ represent the three entry processes, $h$ being the impulse response of the filter and $Z_K$ the output of the filter, which will give $\hat{X}_K$, the estimate at instant $K$ of $X_K$, when the filter is optimal. All the signals are necessarily WSS processes. We will call:

– $Y = (Y_K\; Y_{K-1} \cdots Y_j \cdots Y_{K-N+1})^T$ the representative vector of the process, of length $N$, at the input of the filter, of realization $y = (y_K\; y_{K-1} \cdots y_j \cdots y_{K-N+1})^T$;
– $h = (h_0\; h_1 \cdots h_{N-1})^T$ the vector of the coefficients of the impulse response, which we could identify with the vector $\lambda$ of Chapter 4;
– $X_K$ the sample to be estimated at instant $K$;
– $\hat{X}_K$ the estimate of $X_K$ at instant $K$;
– $Z_K$ the output of the filter at this instant: $Z_K = h^T Y$.

The criterion used is the traditional least mean square criterion. The filter is optimal when:

$\min_h E(X_K - Z_K)^2 = E\big(X_K - \hat{X}_K\big)^2$

The problem consists of obtaining the vector $h$ which minimizes this error.
5.2. Resolution and calculation of the FIR filter

The error is written:

$\varepsilon_K = X_K - h^T Y$ with $h \in \mathbb{R}^N$ and $Y \in (L^2)^N$.

We have a cost function $C$ to be minimized, which is the mapping from $\mathbb{R}^N$ to $\mathbb{R}$:

$(h_0, h_1, \ldots, h_{N-1}) \to C(h_0, h_1, \ldots, h_{N-1}) = E(\varepsilon_K^2)$

The vector $\hat{h} = h_{\text{optimal}}$ is such that $\nabla_h C = 0$: given $C = E(X_K - h^T Y)^2$ (a scalar), then $\nabla_h C = -2 E(\varepsilon_K Y)$ (a vector $N \times 1$).

NOTE.– This is the theorem of projection on Hilbert spaces; obviously, this is the principle of orthogonality again.
This least mean square error will be minimal when:

$E(\varepsilon_K Y) = 0$, i.e. when $h = \hat{h}$.

By using the expression of $\varepsilon_K$:

$E\big( (X_K - \hat{h}^T Y)\, Y \big) = 0$;

all the components of the vector are zero (or $E\big( (X_K - \hat{X}_K)\, Y \big) = 0$). Hence $E(X_K Y) = E(Y Y^T)\, \hat{h}$.

We will call:

– the cross-correlation vector $r$ ($N \times 1$):

$r = E\big( X_K (Y_K\; Y_{K-1} \cdots Y_{K-N+1})^T \big)$

– $R$ the autocorrelation matrix ($N \times N$) of the observable data:

$R = E \begin{pmatrix} Y_K \\ Y_{K-1} \\ \vdots \\ Y_{K-N+1} \end{pmatrix} (Y_K\; Y_{K-1} \cdots Y_{K-N+1}) = E\big(Y\, Y^T\big)$

and

$r = R\, \hat{h}$: the Wiener-Hopf equation in matrix form.
NOTE.– By taking line $j \in [0, N-1]$, we obtain the Wiener-Hopf equation:

$r_{XY}(j) = E\big(X_K Y_{K-j}\big) = \sum_{i=0}^{N-1} \hat{h}_i R_{YY}(j - i) \quad \forall j \in [0, N-1]$

If the matrix $R$ is non-singular, we draw from this: $\hat{h} = R^{-1} r$.

5.3. Evaluation of the least error

According to the projection theorem:
$E\big( (X_K - \hat{X}_K)\, Y \big) = 0$ and $E\big( (X_K - \hat{X}_K)\, \hat{X}_K \big) = 0$.

Thus, the least error takes the form:

$C_{\min} = \min E(\varepsilon_K^2) = E\big( (X_K - \hat{X}_K)\, X_K \big) = E(X_K^2) - E\big(\hat{X}_K X_K\big).$

However, $\hat{X}_K = \hat{h}^T Y$; thus

$C_{\min} = \min E(\varepsilon_K^2) = R_{XX}(0) - \hat{h}^T r.$

Knowing the autocorrelation matrix $R$ of the data at the entry of the filter and the cross-correlation vector $r$, we can deduce from this the optimal filter of impulse response $\hat{h}$ and the lowest least mean square error for a given order $N$ of the filter.
APPLICATION EXAMPLE.– Give the coefficients of the Wiener filter for $N = 2$ if the autocorrelation function of the signal to be estimated is written $R_{XX}(K) = a^{|K|}$, $0 < a < 1$, and that of the noise: $R_{WW}(K) = \delta(K)$ (white noise). The signal to be estimated is not correlated with the noise ($X \perp W$).

Since $R_{YY} = R_{XX} + R_{WW}$:

$R = \begin{pmatrix} 2 & a \\ a & 2 \end{pmatrix}; \quad r = \begin{pmatrix} 1 \\ a \end{pmatrix}.$

We deduce from this:

$\hat{h} = \Big( \frac{2 - a^2}{4 - a^2} \;\; \frac{a}{4 - a^2} \Big)^T \quad \text{and} \quad \min E(\varepsilon_K^2) = 1 - \frac{2}{4 - a^2}.$
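The example can be reproduced numerically for any admissible $a$ (the value below is arbitrary):

```python
import numpy as np

a = 0.5                                      # any 0 < a < 1
R = np.array([[2.0, a], [a, 2.0]])           # R_YY = R_XX + R_WW
r = np.array([1.0, a])

h = np.linalg.solve(R, r)                    # Wiener-Hopf solution
c_min = 1.0 - h @ r                          # R_XX(0) - h^T r
print(h, c_min)
```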
Let us return to our calculation of the FIR filter. The filter that we have just obtained is of the form:

$\hat{h} = (\hat{h}_0\; \hat{h}_1 \cdots \hat{h}_{N-1})^T$

of finite length $N$. Its transfer function is written:

$H(z) = \sum_{i=0}^{N-1} \hat{h}_i z^{-i}$

with an output-input relationship of the form $\hat{X}(z) = H(z)\, Y(z)$.

Let us enlarge this class of FIR type filters and give a method for obtaining IIR type filters.

5.4. Resolution and calculation of the IIR filter

In order to do this we are going to carry out a pre-whitening of the observation signal.
Firstly, let us recall a definition: we say that a rational function $A(z)$ represents a minimum phase system if $A(z)$ and $\frac{1}{A(z)}$ are analytical in the set $\{z \mid |z| \geq 1\}$, i.e. if the zeros and poles of $A(z)$ are within the unit disk. Furthermore, a minimum phase system and its inverse are stable.

Paley-Wiener theorem

Given a function $S_{YY}(z)$ verifying, when $z = e^{i\omega}$:
– $S_{YY}(e^{i\omega}) = \sum_{n=-\infty}^{\infty} s_n e^{-in\omega}$: a real function and $\geq 0$;
– $\int_0^{2\pi} \big| \ln S_{YY}(e^{i\omega}) \big|\, d\omega < \infty$.

Then there is a causal sequence $a_n$, of $z$ transform $A(z)$, which verifies:

$S_{YY}(z) = \sigma_{\varepsilon}^2\, A(z)\, A(z^{-1}).$

$\sigma_{\varepsilon}^2$ represents the variance of a white noise and $A(z)$ represents a minimum phase system. In addition, the factorization of $S_{YY}(z)$ is unique.

$A(z)$ being a minimum phase system, $\frac{1}{A(z)}$ is causal and analytical in $\{z \mid |z| \geq 1\}$. Since the coefficients $a_n$ of the filter $A(z)$ are real:

$S_{YY}(e^{i\omega}) = \sigma_{\varepsilon}^2\, A(e^{i\omega})\, A(e^{-i\omega}) = \sigma_{\varepsilon}^2\, A(e^{i\omega})\, \overline{A(e^{i\omega})} = \sigma_{\varepsilon}^2\, \big| A(e^{i\omega}) \big|^2$
That is to say:

$\sigma_{\varepsilon}^2 = \frac{1}{\big| A(e^{i\omega}) \big|^2}\, S_{YY}(e^{i\omega}).$

The filter $\frac{1}{A(z)}$ thus whitens the process $Y_K$, $K \in \mathbb{Z}$.

Schematically: the white noise $\varepsilon_K$, of spectral density $\sigma_{\varepsilon}^2$, passes through $A(z)$ to give $Y_K$, of spectral density $S_{YY}$; conversely, $Y_K$ passes through $\frac{1}{A(z)}$ to give back $\varepsilon_K$.

NOTE.– $|A(z)|^2 = A(z) \cdot A(z^{-1})$ if the coefficients of $A(z)$ are real.

At present, having pre-whitened the entry, the problem comes back to finding a filter $B(z)$ in the following manner: the cascade $Y \to \frac{1}{A(z)} \to \varepsilon \to B(z) \to Z$ must be equivalent to $Y \to H(z) \to Z$.

Thus $B(z) = A(z) \cdot H(z)$.

$A(z)$ being known from $S_{YY}(z)$, and $H(z)$ having to be optimal, $B(z)$ must thus also be optimal.
Let us apply the Wiener-Hopf equation to the filter $B(z)$:

$r_{X\varepsilon}(j) = \sum_i \hat{b}_i R_{\varepsilon\varepsilon}(j - i).$

As $\varepsilon$ is white, $r_{X\varepsilon}(j) = \hat{b}_j\, \sigma_{\varepsilon}^2$, thus

$\hat{b}_j = \frac{r_{X\varepsilon}(j)}{\sigma_{\varepsilon}^2}$

and $B(z) = \sum_{j=0}^{\infty} \hat{b}_j z^{-j}$ for $B(z)$ causal. Thus

$B(z) = \frac{1}{\sigma_{\varepsilon}^2} \sum_{j=0}^{\infty} r_{X\varepsilon}(j)\, z^{-j}.$

The sum represents the $z$ transform of the cross-correlation $r_{X\varepsilon}(j)$ for the indices $j \geq 0$, which we will write $[S_{X\varepsilon}(z)]_+$. Thus:

$B(z) = \frac{1}{\sigma_{\varepsilon}^2}\, \big[ S_{X\varepsilon}(z) \big]_+$

We must now establish a relationship between $S_{X\varepsilon}(z)$ and $S_{XY}(z)$. In effect, we can write:

$R_{XY}(K) = E\big( X_{n+K}\, Y_n \big) = E\Big( X_{n+K} \sum_{i=0}^{\infty} a_i\, \varepsilon_{n-i} \Big)$

$R_{XY}(K) = \sum_{i=0}^{\infty} a_i\, R_{X\varepsilon}(K + i)$
which can also be written:

$R_{XY}(K) = \sum_{i=-\infty}^{0} a_{-i}\, R_{X\varepsilon}(K - i) = a_{-K} * R_{X\varepsilon}(K).$

By taking the $z$ transform of the two members:

$S_{XY}(z) = A(z^{-1})\, S_{X\varepsilon}(z)$

and it emerges:

$H(z) = \frac{1}{\sigma_{\varepsilon}^2\, A(z)} \Big[ \frac{S_{XY}(z)}{A(z^{-1})} \Big]_+$
5.5. Evaluation of the least mean square error

This least mean square error is written:

$C_{\min} = E(\varepsilon_K X_K) = R_{\varepsilon X}(0)$ when $h = \hat{h}$

which can also be written:

$C_{\min} = E\big( (X_K - \hat{X}_K)\, X_K \big) = R_{XX}(0) - E\big( \hat{h}^T Y X_K \big)$

i.e. $C_{\min} = R_{XX}(0) - \hat{h}^T r$, which we have already seen with the FIR filter.

However, this time, the number of elements in the sum is infinite:

$C_{\min} = R_{XX}(0) - \sum_{i=0}^{\infty} \hat{h}_i R_{XY}(i)$

or:

$C_{\min} = R_{XX}(0) - \sum_{i=0}^{\infty} \hat{h}_i R_{YX}(-i).$

By bringing out a convolution:

$C_{\min} = R_{XX}(0) - \big[ \hat{h}_j * R_{YX}(j) \big]_{j=0}.$

This expression can also be written, using the $z$ transform:

$C_{\min} = \frac{1}{2\pi j} \oint_{C(0,1)} \big( S_{XX}(z) - H(z)\, S_{YX}(z) \big)\, z^{-1}\, dz$
5.6. Exercises for Chapter 5

Exercise 5.1.

Our task is to estimate a signal $X_K$ whose autocorrelation function is:

$R_{XX}(K) = \frac{1}{2}\delta(K) + \frac{1}{4}\big[\delta(K+1) + \delta(K-1)\big]$

The measures $y_K = x_K + n_K$ of the process $Y_K$ are filtered by a Wiener filter of response $h$. The noise $N_K$ is orthogonal to the signal $X_K$ and:

$R_{nn}(K) = \frac{1}{2}\delta(K).$

1) Give the response of the 2nd order Wiener filter (FIR).
2) Give the least mean square error obtained.

Solution 5.1.

1) $\hat{h} = R^{-1} r = (7/15 \;\; 2/15)^T$;
2) $C_{\min} = \sigma_X^2 - r^T \hat{h} = 7/30$, with $\sigma_X^2 = R_{XX}(0) = 1/2$.
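The solution can be verified by forming $R_{YY}$ and $r$ from the given correlations ($R_{YY}(0) = 1/2 + 1/2 = 1$, $R_{YY}(\pm 1) = 1/4$ since the noise is white):

```python
import numpy as np

R = np.array([[1.0, 0.25], [0.25, 1.0]])     # autocorrelation of Y
r = np.array([0.5, 0.25])                    # r(j) = R_XX(j), since X ⊥ N

h = np.linalg.solve(R, r)
c_min = 0.5 - r @ h                          # sigma_X^2 - r^T h
print(h, c_min)                              # (7/15, 2/15) and 7/30
```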
Exercise 5.2.

We propose calculating a 2nd order FIR filter. $Y_K$, the input to the filter, has the form $Y_K = X_K + W_K$, where $X_K$ is the signal emitted and $W_K$ is a white noise orthogonal to $X_K$ (the processes are all wide sense stationary (WSS)). Knowing the statistical autocorrelations:

$R_{XX}(K) = a^{|K|}$ and $R_{WW}(K) = N \delta(K)$

and knowing $\hat{h} = R^{-1} r$ ($\hat{h}$: the optimal $h$), with:

$R = E \begin{pmatrix} Y_K \\ Y_{K-1} \\ \vdots \\ Y_{K-N+1} \end{pmatrix} (Y_K\; Y_{K-1} \cdots Y_{K-N+1}) = E\big(Y\, Y^T\big)$

$r = E\big( X_K (Y_K\; Y_{K-1} \cdots Y_{K-N+1})^T \big)$

1) Give the 2 components of the vector $\hat{h}$ representing the impulse response.
2) Give the least mean square error.
3) Give the shape of this error for $N = 1$ and $0 < a < 1$.
4) We now want to calculate an IIR type optimal filter. By considering the same data as already given, give the transfer function of the filter.
5) Give the impulse response.
6) Give the least mean square error.

NOTE.– We can state: $b + b^{-1} = \frac{1}{N}\big(a^{-1} - a\big) + a^{-1} + a.$

Solution 5.2.

1) $\hat{h} = \frac{1}{(1+N)^2 - a^2} \big( 1 + N - a^2 \;\; aN \big)^T$

2) $C_{\min} = 1 - \frac{1 + N - a^2 + a^2 N}{(1+N)^2 - a^2}$
3) See Figure 5.2.

Figure 5.2. Path of the error function or cost according to parameter $a$

4) $H(z) = \frac{A}{\sigma_{\varepsilon}^2} \cdot \frac{1}{1 - b z^{-1}}$ with $A = \frac{1 - a^2}{1 - ab}$ and $\sigma_{\varepsilon}^2 = \frac{Na}{b}$

5) $h_n = c\, b^n$ for $n \geq 0$, with $c = \frac{(1 - a^2)\, b}{N a\, (1 - ab)}$

6) $C_{\min} = 1 - \frac{c}{1 - ab}$
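The hint can be checked numerically: solving $b + b^{-1} = \frac{1}{N}(a^{-1} - a) + a^{-1} + a$ for the root with $|b| < 1$ and setting $\sigma_{\varepsilon}^2 = Na/b$ (which matches the $z$-term of the numerator) does factor $S_{YY}$ as $\sigma_{\varepsilon}^2 A(z) A(z^{-1})$ with $A(z) = \frac{1 - bz^{-1}}{1 - az^{-1}}$. A sketch with arbitrary values:

```python
import numpy as np

a, N = 0.6, 0.5                              # arbitrary 0 < a < 1 and noise power

# b: root of modulus < 1 of  b + 1/b = (1/N)(1/a - a) + 1/a + a
s = (1 / a - a) / N + 1 / a + a
b = (s - np.sqrt(s**2 - 4)) / 2
sigma_eps2 = N * a / b                       # variance of the whitening noise

# Check the spectral factorization on the unit circle:
# S_YY = (1-a^2)/|1-a e^{-iw}|^2 + N  vs  sigma_eps2 |1-b e^{-iw}|^2/|1-a e^{-iw}|^2
w = np.linspace(0, 2 * np.pi, 512, endpoint=False)
e = np.exp(-1j * w)
S_yy = (1 - a**2) / np.abs(1 - a * e)**2 + N
S_fact = sigma_eps2 * np.abs(1 - b * e)**2 / np.abs(1 - a * e)**2
print(b, np.max(np.abs(S_yy - S_fact)))
```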
Exercise 5.3. [SHA 88]

Let $\{X_K \mid K = 1 \text{ to } N\}$ be a set of $N$ random variables, emitted by a source, such that $E(X_K) = 0$ and $\mathrm{cov}(X_i, X_j) = \sigma_x^2$ $\forall i, j$.

At the reception, we obtain the digital sequence $y_K = x_K + w_K$, a realization of the process $Y_K = X_K + W_K$ where $W_K$ is a centered white noise of variance $\sigma_w^2$.

1) Give the Wiener filter depending on $N$ and $\gamma$, by stating $\gamma = \sigma_x^2 / \sigma_w^2$ as the signal-to-noise ratio.
2) Give the least mean square error as a function of $\sigma_x^2$, $N$ and $\gamma$.

NOTE.– We can use the Wiener-Hopf equation.

Solution 5.3.

1) $\hat{h}_j = \frac{\gamma}{1 + N\gamma}$

2) $C_{\min} = \frac{\sigma_x^2}{1 + N\gamma}$
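Here $R = \sigma_x^2 J + \sigma_w^2 I$ ($J$ the all-ones matrix) and $r = \sigma_x^2 \mathbf{1}$, so the solution is constant over the taps; this is easy to confirm (the test values are arbitrary):

```python
import numpy as np

N, sx2, sw2 = 5, 2.0, 0.5                    # arbitrary test values
gamma = sx2 / sw2

R = sx2 * np.ones((N, N)) + sw2 * np.eye(N)  # R_YY
r = sx2 * np.ones(N)                         # r(j) = sigma_x^2 for every j

h = np.linalg.solve(R, r)
c_min = sx2 - r @ h
print(h[0], c_min)
```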
Exercise 5.4.

Same exercise as Exercise 5.2, where $R_{XX}(K) = 0.4^{|K|}$ and $R_{WW}(K) = \delta(K)$:

1) Give the 3 components of the vector $\hat{h}$.
2) Give the least mean square error.

Solution 5.4.

1) $\hat{h} = (0.4778 \;\; 0.1028 \;\; 0.0212)^T$
2) $C_{\min} = 0.4778$
Chapter 6

Adaptive Filtering: Algorithm of the Gradient and the LMS

6.1. Introduction

By adaptive processing, we have in mind a particular, yet very broad, class of optimization algorithms which are activated in real time in distance information transmission systems. The properties of adaptive algorithms are such that, on the one hand, they allow the optimization of a system and its adaptation to its environment without outside intervention and, on the other hand, this optimization is also assured in the presence of environmental fluctuation over time. It is also to be noted that the success of adaptive techniques is such that we no longer meet them only in telecommunications, but also in such diverse domains as submarine detection, perimetric detection, shape recognition, antenna arrays, seismology, bio-medical instrumentation, speech and image processing, identification of control systems, etc. Amongst the applications cited above, different configurations arise.
[Figure: block diagram — the input $X_K$ feeds a delay $T$ whose output $Y_K$ enters the adaptive filter; the filter output $Z_K$ is subtracted from the desired signal $D_K$ to give the error $\varepsilon_K$.]

Figure 6.1. Prediction

[Figure: block diagram — $X_K$ feeds both an unknown system (output $D_K$) and the adaptive filter (output $Z_K$); the difference gives $\varepsilon_K$.]

Figure 6.2. Identification

[Figure: block diagram — $X_K$ passes through a system and additive noises before entering the adaptive filter; the delayed input provides $D_K$, and $\varepsilon_K = D_K - Z_K$.]

Figure 6.3. Deconvolution

[Figure: block diagram — the primary input $X_K + N_K$ provides $D_K$; a correlated noise reference $N_K'$ feeds the adaptive filter, and $\varepsilon_K = D_K - Z_K$.]

Figure 6.4. Cancellation
In the course of these few pages, we will explain the principle of adaptive filtering and establish the first mathematical results. We will limit ourselves, to begin with, to WSS processes and to the algorithm called the deterministic gradient and to the LMS algorithm. We will also give a few examples concerning non-recursive linear adaptive filtering. Later, we will broaden this concept to non-stationary signals by presenting Kalman filtering in the following chapter.

6.2. Position of problem [WID 85]

Starting from observations (or measures) taken at instant $K$ (that we will denote $y_K$: realizations) of a process $X_K$ issued from a sensor or from an unknown system, we want to perform:
– a prediction on the signal; or
– an identification of an unknown system; or
– a deconvolution (or inverse filtering); or
– a cancellation of echoes.
To achieve this, we will carry out an optimization, in the sense of the least mean squares, by minimizing the error obtained in the different cases.

EXAMPLE.– Given the following predictor:

[Figure: block diagram — $X_K$ enters a delay $T$; the delayed signal $Y_K$ feeds the adaptive filter whose output $Z_K$ is subtracted from $X_K$ to give $\varepsilon_K$.]

Figure 6.5. Predictor

The 3 graphs below represent:
1. the input $X_K$, observed by $x_K$: the signal to be predicted;
2. the output of the filter $Z_K$, observed by $z_K$;
3. the residual error $\varepsilon_K$.

It is clearly apparent that $\varepsilon_K$ tends towards 0 after a certain time, at the end of which the filter converges.
[Figure: three time plots over 500 samples — the input $x_K$, the output of the filter $z_K$, and the error $\varepsilon_K$, which decays towards 0.]

Figure 6.6. Graphs of input, output and error. These graphs are the result of continuous time processes
6.3. Data representation

The general shape of an adaptive filter might be the following.

[Figure: the inputs $Y_K^0, Y_K^1, \ldots, Y_K^{m-1}$ are weighted by the coefficients $\lambda_K^0, \lambda_K^1, \ldots, \lambda_K^{m-1}$ and summed to give the output signal.]

Figure 6.7. Theoretical schema with multiple inputs

The input signal can be simultaneously the result of several sensors (case of adaptive antennae, for example), or it can represent the different samples, taken at different instants, of a single signal. We will take as notation:
– multiple input: $Y^K = (Y_K^0\; Y_K^1 \ldots Y_K^{m-1})^T$;
– single input: $Y^K = (Y_K\; Y_{K-1} \ldots Y_{K-m+1})^T$.
In the case of a single input, which we will consider next, we will have the following configuration.

[Figure: tapped delay line — $Y_K$ passes through delays $T$ giving $Y_{K-1}, \ldots, Y_{K-m+1}$; the taps are weighted by $\lambda_K^0, \lambda_K^1, \ldots, \lambda_K^{m-1}$ and summed to give $Z_K$, which is subtracted from $D_K$ to give $\varepsilon_K$.]

Figure 6.8. Schema of predictor principle

$X_K$, $Y_K$, $Z_K$, $D_K$ and $\varepsilon_K$ represent the signal to be predicted, the filter input, the filter output, the desired output and the error signal respectively. Let us write the output $Z_K$:

$Z_K = \sum_{i=0}^{m-1} \lambda_K^i\, Y_{K-i}$

By calling $\lambda_K$ the weight vector, or coefficient vector, at instant $K$, also written in the form $\lambda_K = (\lambda_K^0\; \lambda_K^1 \ldots \lambda_K^{m-1})^T$, we can use a single vectorial notation:

$Z_K = Y^{K\,T} \lambda_K = \lambda_K^T Y^K.$
Our system not being perfect, we obtain an error written as:

$\varepsilon_K = D_K - Z_K$

where $D_K$ represents the desired output (or $X_K$ here), that is to say, the random variable that we are looking to estimate. The criterion that we have chosen to exploit is that of the least squares: it consists of choosing the best vector $\lambda$, the one which will minimize the least mean square error $E(\varepsilon_K^2)$, or the cost function $C(\lambda_K)$. When the vector $\lambda$ is fixed, the quadratic error of this cost function does not depend on $K$, because of the stationarity of the signals. We call $\hat{\lambda}$ the vector $\lambda$ which minimizes this cost function.

6.4. Minimization of the cost function

If our system (filter) is linear and non-recursive, we will have a quadratic cost function, and this can be represented by an elliptical paraboloid (in dimension 2), or a hyperparaboloid if the dimension is higher. We will call isocosts the graphs, or same-level cost surfaces, i.e. the surfaces defined by:

$S_g = \big\{ \lambda = (\lambda^0, \lambda^1, \ldots, \lambda^{m-1}) \in \mathbb{R}^m \;\big|\; C(\lambda) = g \big\}, \quad g \text{ a real constant}$

Let us give as an example the equation of the isocosts in the case of a second order filter:

$S_g = \big\{ \lambda = (\lambda^0, \lambda^1) \in \mathbb{R}^2 \;\big|\; C(\lambda) = E(D_K - Z_K)^2 = E\big( D_K - (\lambda^0 X_{K-1} + \lambda^1 X_{K-2}) \big)^2 = g \big\}$
Using the stationarity of $X_K$, we obtain after development the equation of the isocosts $S_g$:

$E(X_K^2)\,(\lambda^0)^2 + E(X_K^2)\,(\lambda^1)^2 + 2\, E(X_K X_{K-1})\, \lambda^0 \lambda^1 - 2\, E(D_K X_{K-1})\, \lambda^0 - 2\, E(D_K X_{K-2})\, \lambda^1 + E(D_K^2) = g$
NOTE.– By identification, we easily find the coefficients of the ellipse equation in its traditional form:

$a(\lambda^0)^2 + b(\lambda^1)^2 + c\, \lambda^0 \lambda^1 + d\, \lambda^0 + e\, \lambda^1 + f = 0$

NOTE.– Still because of the stationarity of $X_K$, we see that the coefficients arising in the expression of $C(\lambda)$ are independent of $K$, and this finding is valid for a filter of any order. This signifies that $C(\lambda)$ depends only on the value $\lambda$ and not on the instant $K$ at which we are considering this value. Otherwise said, given two instants $K \neq j$, if $\lambda_K = \lambda_j$ then $C(\lambda_K) = C(\lambda_j)$; the latter can be summarized by saying that the cost function itself does not depend on time.

Let us illustrate such a cost function.
[Figure: paraboloid cost surface over the plane $(\lambda^0, \lambda^1)$, with its elliptical isocost projections.]

Figure 6.9. Representation of the cost function ([MOK 00] for the line graph)
Let us go back to the cost function at the instant $K$, where $\varepsilon_K$ is not stationary:

$C(\lambda_K) = E(\varepsilon_K^2) = E\big\{ (D_K - Z_K)^2 \big\} = E\big\{ (D_K - \lambda_K^T Y^K)^2 \big\}$

The latter, for any $\lambda$ and independently of time, can still be written:

$C(\lambda) = E\big\{ (D_K - \lambda^T Y^K)^2 \big\}$

The minimum of this function is reached when:

$\nabla_\lambda C(\lambda) = \mathrm{grad}\, C(\lambda) = \Big( \frac{\partial C(\lambda)}{\partial \lambda^0}, \ldots, \frac{\partial C(\lambda)}{\partial \lambda^{m-1}} \Big)^T = 0$ (the zero vector in $\mathbb{R}^m$).

Now

$\nabla_\lambda C(\lambda) = E\big\{ (D_K - \lambda^T Y^K)\,(-2\, Y^K) \big\} = -2\, E(\varepsilon_K Y^K)$

Thus

$\nabla_\lambda C(\lambda) = -2\, E(\varepsilon_K Y^K) = -2\, E\big\{ (D_K - \hat{\lambda}^T Y^K)\, Y^K \big\} = 0$

for $\lambda = \lambda_{\text{optimal}} = \hat{\lambda}$.

In what follows, we will denote by $\hat{\lambda} = (\hat{\lambda}^0\; \hat{\lambda}^1 \ldots \hat{\lambda}^{m-1})^T$ the family of optimal coefficients, that is to say, the coefficients that render $\nabla_\lambda C(\lambda) = \mathrm{grad}\, C(\lambda)$ void and which thus minimize $C(\lambda)$.
Adaptive Filtering: Algorithm of the Gradient and the LMS
207
We find again the traditional result: the error is orthogonal to the observations (principle of orthogonality, or projection theorem): ε_K ⊥ Y_K.

Let us state R = E(Y_K Y_K^T) and p = E(D_K Y_K).

R = E(Y_K Y_K^T) is the autocorrelation matrix of the input signal:

R = E( Y_K^2, Y_K Y_{K−1}, ..., Y_K Y_{K−m+1} ; Y_{K−1} Y_K, Y_{K−1}^2, ..., Y_{K−1} Y_{K−m+1} ; ... ; Y_{K−m+1} Y_K, Y_{K−m+1} Y_{K−1}, ..., Y_{K−m+1}^2 )

p = E(D_K Y_K) is the cross-correlation column vector between the desired response and the input signal:

p = E(D_K Y_K) = E( D_K Y_K   D_K Y_{K−1}   ...   D_K Y_{K−m+1} )^T
Thus, setting the gradient of the cost function to zero:

E(D_K Y_K) − E(Y_K Y_K^T) λ̂ = 0, i.e. p − R λ̂ = 0

NOTE.– This is the Wiener-Hopf equation. The vector which satisfies this equation is the optimal vector:

λ̂ = R^(−1) p, if R is invertible. This value of λ does not depend on the instant K.
6.4.1. Calculation of the cost function
C(λ) = E(D_K^2) + λ^T E(Y_K Y_K^T) λ − 2 E(D_K Y_K^T) λ, thus:

C(λ) = E(D_K^2) + λ^T R λ − 2 p^T λ

For λ̂, the optimal value of λ, the minimum cost value is written:

C_min = C(λ̂) = E(D_K^2) − p^T λ̂

NOTE.– It is interesting to notice that the error and the input signal Y are not correlated when λ = λ̂. In effect:

ε_K = D_K − λ^T Y_K

By multiplying the two members by Y_K and by taking the mathematical expectation, we obtain:

E(ε_K Y_K) = p − E(Y_K Y_K^T) λ = p − R λ

For the optimal value λ = λ̂ we have:

E(ε_K Y_K) = 0
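The Wiener-Hopf equation and the orthogonality principle are easy to check numerically. Below is a minimal sketch in Python/NumPy (the book's own listings use Matlab); the two-tap signal model is an illustrative assumption, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative WSS input and a desired response built from it
y = rng.standard_normal(n)
Y = np.stack([y[1:], y[:-1]])        # regressors Y_K = (Y_K, Y_{K-1})^T
d = 0.8 * y[1:] + 0.3 * y[:-1]       # desired response D_K

R = Y @ Y.T / Y.shape[1]             # sample estimate of E(Y_K Y_K^T)
p = Y @ d / Y.shape[1]               # sample estimate of E(D_K Y_K)
lam_hat = np.linalg.solve(R, p)      # Wiener-Hopf: R lam_hat = p

eps = d - lam_hat @ Y                # error sequence at the optimum
print(Y @ eps / Y.shape[1])          # E(eps_K Y_K): ~ the zero vector
```

Because λ̂ solves the sample normal equations exactly, the empirical correlation p − Rλ̂ between the error and the observations vanishes to machine precision.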
Example of calculation of a filter

The following system is an adaptive filter capable of identifying a phase-shifter system. φ is a deterministic angle.
X_K = Y_K = sin( 2πK/N + ∅ )  and  D_K = 2 sin( 2πK/N + ∅ − φ )

[Block diagram: X_K = Y_K passes through a delay T; the taps λ^0 and λ^1 are summed to form Z_K, which is subtracted from D_K to give ε_K]
Figure 6.10. Schema of principle of an adaptive filter identifying a phase-shifter system
If ∅ is uniformly spread on [0, 2π], we showed in Chapter 3 that Y_K is wide-sense stationary (WSS). Let us calculate the elements of matrix R:

E(Y_n Y_{n−K}) = E[ sin(2πn/N + ∅) sin(2π(n−K)/N + ∅) ] = 0.5 cos(2πK/N),  K ∈ {0, 1}

E(D_n Y_{n−K}) = E[ 2 sin(2πn/N − φ + ∅) sin(2π(n−K)/N + ∅) ] = cos(2πK/N − φ)
The autocorrelation matrix R of the input data and the cross-correlation vector p are written:

R = E( Y_K^2, Y_K Y_{K−1} ; Y_{K−1} Y_K, Y_{K−1}^2 ) = ( 0.5, 0.5 cos(2π/N) ; 0.5 cos(2π/N), 0.5 )

p = E( D_K Y_K   D_K Y_{K−1} )^T = ( cos φ   cos(2π/N − φ) )^T

The cost is written:

C(λ) = 0.5 ( (λ^0)^2 + (λ^1)^2 ) + λ^0 λ^1 cos(2π/N) − 2 λ^0 cos φ − 2 λ^1 cos(2π/N − φ) + 2
Thus, we obtain:

λ̂ = R^(−1) p = ( 2 / sin(2π/N) ) ( sin(2π/N − φ)   sin φ )^T

C(λ̂) = E(D_K^2) − p^T λ̂, and here the calculation gives us C(λ̂) = 0.
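The closed-form solution above can be verified numerically; this Python sketch (with illustrative values of N and φ) builds R and p, solves for λ̂, and checks that the minimum cost vanishes:

```python
import numpy as np

N, phi = 12, 0.7                                # illustrative period and phase shift
c = 2 * np.pi / N

R = 0.5 * np.array([[1.0, np.cos(c)],
                    [np.cos(c), 1.0]])          # E(Y_K Y_K^T)
p = np.array([np.cos(phi), np.cos(c - phi)])    # E(D_K Y_K)

lam_hat = np.linalg.solve(R, p)
closed_form = (2 / np.sin(c)) * np.array([np.sin(c - phi), np.sin(phi)])

C_min = 2.0 - p @ lam_hat                       # E(D_K^2) = 2 for the amplitude-2 sine
print(lam_hat, closed_form, C_min)              # C_min vanishes
```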
In this section, we have given the method for obtaining λ̂ and C_min. As we can see, this method does not even assume the existence of a physical filter, but it requires:

– knowledge of the constituents of p and R;

– carrying out some calculations, notably the inversion of the matrix R.

In the following sections, we will seek to free ourselves of these requirements.

6.5. Gradient algorithm
We have seen previously that the optimal vector λ, which is to say the one that minimizes the cost C(λ), is written:

λ̂ = R^(−1) p

Now, to resolve this equation, we have to invert the autocorrelation matrix. That can involve major calculations if this matrix R is not a Toeplitz matrix. It is a Toeplitz matrix if R_(i,j) = c_(i−j), with c representing the autocorrelation of the process.

Let us examine the evolution of the cost C(λ) previously traced. Let λ_K be the vector of coefficients (or weights) at instant K. If we wish to arrive at the optimal λ, we must make λ_K evolve at each iteration by taking into account its relative position between the instants K and K + 1. For a given cost C(λ_j), the gradient of C(λ_j) with regard to the vector λ_j = ( λ_j^0 λ_j^1 ... λ_j^(m−1) )^T is normal to the level curve of C(λ_j).

In order for the algorithm to converge, it must obviously hold, for K > j, that C(λ_K) < C(λ_j). In addition, as we have already written, the minimum will be attained when:

∇_λ C(λ) = 0
From here we get the idea of writing that the larger the gradient, the more distant we will be from the minimum, and that it suffices to modify the vector of coefficients in a recursive manner in the following fashion:

λ_{K+1} = λ_K + μ ( −∇_{λ_K} C(λ_K) )  (equality in R^m)

which we can call the algorithm of the deterministic gradient. At instant K:

∇_{λ_K} C(λ_K) = −2 E(ε_K Y_K) = −2 ( p − R λ_K )

with Y_K = ( Y_K Y_{K−1} ... Y_{K−m+1} )^T, the notation of the process that we saw at the beginning of Chapter 4, and with μ a parameter which acts on the stability and rapidity of convergence to λ̂.
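The recursion above can be sketched in a few lines of Python (R, p and μ are illustrative choices): starting from λ_0 = 0 and repeating λ_{K+1} = λ_K + 2μ(p − Rλ_K) drives λ_K towards R^(−1) p:

```python
import numpy as np

R = np.array([[3.0, 1.0], [1.0, 3.0]])   # illustrative autocorrelation matrix
p = np.array([5.0, 7.0])                 # illustrative cross-correlation vector
mu = 0.1                                 # step size (below 1/gamma_max = 1/4 here)

lam = np.zeros(2)
for _ in range(200):
    grad = -2 * (p - R @ lam)            # deterministic gradient of C(lambda)
    lam = lam + mu * (-grad)             # lambda_{K+1} = lambda_K - mu * grad

print(lam, np.linalg.solve(R, p))        # both are (approximately) the same vector
```

No matrix inversion appears in the loop; only products with R and p are needed, which is precisely the appeal of the method.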
Theoretical justification

If the mapping λ_K = ( λ_K^0 λ_K^1 ... λ_K^(m−1) )^T → C(λ_K) is of class C^1(R^m), we have the equality:

C(λ_{K+1}) − C(λ_K) = ⟨ ∇_{λ_K} C(λ_K) , λ_{K+1} − λ_K ⟩ + o( ‖λ_{K+1} − λ_K‖ )

where ⟨ , ⟩ and ‖ ‖ designate the scalar product and the norm in R^m respectively.

Thus, if λ_{K+1} is close enough to λ_K, we have the approximation:

C(λ_{K+1}) − C(λ_K) ≈ ⟨ ∇_{λ_K} C(λ_K) , λ_{K+1} − λ_K ⟩
From which we deduce in particular that the variation C(λ_{K+1}) − C(λ_K) of C(λ_K) is maximal if the vectors ∇_{λ_K} C(λ_K) and λ_{K+1} − λ_K are colinear. In order to obtain the minimum of C(λ_K) as quickly as possible, we place ourselves in this situation and, ∀K, we write:

λ_{K+1} − λ_K = μ ( −∇_{λ_K} C(λ_K) )

i.e.:

λ_{K+1} = λ_K + μ ( −∇_{λ_K} C(λ_K) )
Furthermore, by using the expression:

λ_{K+1} = λ_K + 2μ E(ε_K Y_K)

we can write:

∀n ≥ 1:  λ_{K+n} = λ_K + 2μ Σ_{j=0}^{n−1} E(ε_{K+j} Y_{K+j})

However, the multivariate process ε_{K+j} Y_{K+j} of order m is not WSS, thus it is not ergodic, and we cannot write:

λ_{K+n} = λ_K + 2μ n E(ε_K Y_K)

Moreover, the expression:

λ_{K+1} = λ_K + 2μ E(ε_K Y_K)

is unexploitable on a practical plane. By using the gradient method, we have succeeded in avoiding the inversion of the R matrix, but we have assumed that the numerical values of the correlations composing the elements of R and p, which determine the quadratic form C(λ), are known. In general, these numerical values are unknown; so we are going to attempt to estimate them, which is the reason for the following section.

6.6. Geometric interpretation
Let us give another expression to the cost function at instant K. We have found:

C(λ_K) = E(D_K^2) + λ_K^T R λ_K − 2 p^T λ_K

and:

C(λ̂) = E(D_K^2) − p^T λ̂

with p = R λ̂, the Wiener solution of ∇_λ C(λ) = 0.
The cost can be put in the form:

C(λ_K) = C(λ̂) + λ̂^T p + λ_K^T R λ_K − 2 λ_K^T p
       = C(λ̂) + (λ̂ − λ_K)^T p + λ_K^T R λ_K − λ_K^T p
       = C(λ̂) + (λ̂ − λ_K)^T p + λ_K^T R (λ_K − λ̂)
       = C(λ̂) + (λ̂ − λ_K)^T R λ̂ + (λ_K − λ̂)^T R λ_K
       = C(λ̂) + (λ̂ − λ_K)^T R (λ̂ − λ_K)

or:

C(λ_K) = C(λ̂) + (λ_K − λ̂)^T R (λ_K − λ̂)

Let us state α_K = λ_K − λ̂ (the origin of the axes is at present λ̂); it becomes:

C(λ̂ + α_K) = C(λ̂) + α_K^T R α_K

and easily:

∇_α C(λ̂ + α)_K = 2 R α_K

the index K representing the instant at which we are considering the gradient.
Let us simplify the preceding expressions to find simple geometric interpretations by changing the base. Matrix R being symmetric, it is diagonalizable by an orthogonal matrix Q, that is to say:

Γ = Q^(−1) R Q, with Q^T = Q^(−1) and Γ = diag( γ^0, ..., γ^(m−1) ), where the γ^i are the eigenvalues of R.

Let us bring R = Q Γ Q^(−1) into the last cost expression:

C(λ̂ + α_K) = C(λ̂) + α_K^T Q Γ Q^(−1) α_K

and, by noting u_K = Q^(−1) α_K:

C(λ̂ + Q u_K) = C(λ̂) + u_K^T Γ u_K = C(λ̂) + Σ_{i=0}^{m−1} γ^i (u_K^i)^2

and:

∇_u C(λ̂ + Qu)_K = 2 Γ u_K = 2 ( γ^0 u_K^0   γ^1 u_K^1   ...   γ^(m−1) u_K^(m−1) )^T

where u_K^i is the i-th component of u at instant K.
This expression is interesting because, when only one of the components of ∇_u C(λ̂ + Qu)_K is non-zero, the vector thus formed, always normal to the level curves of C(λ̂ + Q u_K), will carry the gradient vector. So this vector will form one of the principal axes of the ellipses (or hyperellipses). As a consequence, the vectors u_K are expressed along the principal axes of the hyperellipses.
These principal axes equally represent the eigenvectors of R. In effect, when we reduce a quadratic form, which we do by diagonalizing, we establish the principal axes of the hyperellipses by calculating the eigenvectors of the matrix R when the cost expression C is in the form:

Cte + α_K^T R α_K

NOTE 1.– When m = 2 or 3, the orthogonal matrix Q is associated with a rotation in R^2 or R^3 aligned with the base of the eigenvectors of R.

NOTE 2.– ∇_u C(λ̂ + Qu)_K = Q^(−1) ∇_α C(λ̂ + α)_K
Let us illustrate this representation with an example. Let:

R = ( 3, 1 ; 1, 3 ),  p = ( 5  7 )^T  and  E(D_K^2) = 20.

Thus we obtain:

Γ = ( 2, 0 ; 0, 4 ),  λ̂ = ( 1  2 )^T  and  C(λ̂) = 1
The eigenvectors of R allow us to construct a unitary matrix Q. Let:

Q = (1/√2) ( 1, 1 ; −1, 1 )

and:

C(λ̂ + α_K) = C(λ̂) + α_K^T R α_K

NOTE.– Q always has the same shape and always takes the same values if we choose the unit vectors as base vectors. This holds to the very special shape of R (Toeplitz). See the line graph in the coordinates (λ^0, λ^1), (α^0, α^1) and (u^0, u^1) later.
Figure 6.11. Line graph of the cost function and of the different axes ([BLA 06] for the line graph of the ellipse)
Figure 6.12. Line graph of “important reference points”
With u_K = Q^(−1) α_K, i.e.:

u^0 = (1/√2) ( α^0 − α^1 )
u^1 = (1/√2) ( α^0 + α^1 )
6.7. Stability and convergence
Let us now study the stability and the convergence of the algorithm of the deterministic gradient. By taking the recursive expressions of the coefficient vector and by the translation α_K = λ_K − λ̂, the following expressions:

λ_{K+1} = λ_K + μ ( −∇_{λ_K} C(λ_K) )
λ̂ = R^(−1) p
∇_{λ_K} C(λ_K) = −2 ( p − R λ_K )

enable us to write:

α_{K+1} = ( I_d − 2μR ) α_K    (I_d: identity matrix)

By writing R in the form R = Q Γ Q^(−1) and by premultiplying α_{K+1} by Q^(−1), we obtain:

Q^(−1) α_{K+1} = u_{K+1} = ( I_d − 2μΓ ) u_K
Thus:

u_K = ( I_d − 2μΓ )^K u_0

or, componentwise: u_{K+1}^i = ( 1 − 2μγ^i ) u_K^i, and:

∀i:  u_K^i = ( 1 − 2μγ^i )^K u_0^i

Thus, the algorithm is stable and convergent if and only if:

∀i:  lim_{K→∞} ( 1 − 2μγ^i )^K = 0

thus, if and only if:

1 − 2μγ^i ∈ ]−1, 1[, i.e. 0 < μ < 1/γ^i

In addition, this must finally be verified for every i, which gives the condition:

0 < μ < 1/γ_max

We thus obtain:

lim_{K→∞} λ_K = λ̂
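The condition 0 < μ < 1/γ_max can be observed directly on the decoupled recursion u_K^i = (1 − 2μγ^i)^K u_0^i (a Python sketch; the eigenvalues are illustrative):

```python
import numpy as np

gammas = np.array([2.0, 4.0])          # illustrative eigenvalues of R
u0 = np.array([1.0, 1.0])

def u_after(mu, n_iter=300):
    # componentwise: u_K^i = (1 - 2 mu gamma^i)^K u_0^i
    return (1 - 2 * mu * gammas) ** n_iter * u0

print(u_after(0.2))   # mu < 1/gamma_max = 0.25: both modes decay towards 0
print(u_after(0.3))   # mu > 1/gamma_max: the mode of gamma = 4 diverges
```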
The illustration which follows gives us an idea of the evolution of the cost and of the convergence of λK .
Figure 6.13. Line graph of several cost functions and the principal axes “u”
The same calculation example as before, but with a noisy input. This is a question of constructing a phase shifter with a noise canceller. ∅ is uniformly spread on [0, 2π] and φ, which is deterministic, illustrates a known phase difference.

X_K = sin( 2πK/N + ∅ ),  Y_K = X_K + b_K,  D_K = 2 sin( 2πK/N − φ + ∅ )

[Block diagram: the noise b_K is added to X_K to form Y_K, which feeds the two-tap filter (λ^0, λ^1) of Figure 6.10; its output Z_K is subtracted from D_K to give ε_K]
Figure 6.14. Schema of the principle of the phase shifter (see Figure 6.10) with noise input
b_K being a noise, centered and independent from the input:

E( b_{K−i} b_{K−j} ) = σ^2 δ_{i,j}

E( Y_K Y_{K−n} ) = E[ ( sin(2πK/N + ∅) + b_K ) ( sin(2π(K−n)/N + ∅) + b_{K−n} ) ] = 0.5 cos(2πn/N) + σ^2 δ_{0,n}

E( D_K Y_{K−n} ) = E[ 2 sin(2πK/N − φ + ∅) ( sin(2π(K−n)/N + ∅) + b_{K−n} ) ] = cos(2πn/N − φ)

The autocorrelation matrix of the data Y_K and the cross-correlation vector p are written:
R = ( 0.5 + σ^2, 0.5 cos(2π/N) ; 0.5 cos(2π/N), 0.5 + σ^2 )

p = E( D_K Y_K   D_K Y_{K−1} )^T = ( cos φ   cos(2π/N − φ) )^T
Thus, we obtain:

λ̂ = R^(−1) p = (1/Δ) ( 2(1 + 2σ^2) cos φ − ( cos φ + cos(4π/N − φ) )  ;  −2 cos(2π/N) cos φ + 2(1 + 2σ^2) cos(2π/N − φ) )^T
with:

Δ = ( 1 + 2σ^2 )^2 − cos^2( 2π/N )

and:

C(λ̂) = C_min = [ (1 + 2σ^2)(1 + 4σ^2) − 2σ^2 ( 2 cos^2 φ + cos(4π/N − 2φ) ) − 1 ] / Δ

with:

C(λ) = 2 + (1 + 2σ^2) · 0.5 ( (λ^0)^2 + (λ^1)^2 ) + λ^0 λ^1 cos(2π/N) − 2 λ^0 cos φ − 2 λ^1 cos(2π/N − φ)
or:

C(λ̂ + α_K) = C(λ̂) + α_K^T R α_K

and:

C(λ̂ + Q u_K) = C(λ̂) + u_K^T Γ u_K

See the line graph in the reference points (λ^0, λ^1), (α^0, α^1) and (u^0, u^1) above.

6.8. Estimation of gradient and LMS algorithm
We can consider estimates p̃ and R̃ of p and R in the calculation of the gradient. We have adopted the notation R̃ and p̃, and not R̂ and p̂, as the criterion is no longer the traditional criterion "min L^2" but an approximation of this latter.
We had:

∇_{λ_K} C(λ_K) = −2 ( p − R λ_K )

Thus, we are going to consider its estimate:

∇̃_{λ_K} C(λ_K) = −2 ( p̃ − R̃ λ_K )

The estimated values will be the observed data. Let:

p̃ = y_K d_K  and  R̃ = y_K y_K^T

thus:

∇̃_{λ_K} C(λ_K) = −2 ε_K y_K

and:

λ_{K+1} = λ_K + 2μ ε_K y_K

This recursive expression on λ_K amounts to suppressing the calculation of the expectation; in effect:

λ_{K+1} = λ_K + 2μ E(ε_K Y_K)

becomes:

λ_{K+1} = λ_K + 2μ ε_K y_K

called the LMS algorithm, or stochastic gradient (a class of algorithms which includes the LMS). Now, it happens that the successive iterations of this recursive algorithm themselves achieve the mathematical expectation included in this formula by statistical averaging [MAC 81].
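In compact form, one LMS iteration is the single line λ ← λ + 2μ ε_K y_K. A minimal Python sketch on synthetic data (the noiseless identification setting is an illustrative assumption, not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
m, mu, n = 2, 0.05, 5000
lam_true = np.array([0.7, -0.2])        # unknown filter to be identified

lam = np.zeros(m)
x = rng.standard_normal(n)
for k in range(m - 1, n):
    y_k = x[k - m + 1:k + 1][::-1]      # regressor (y_K, y_{K-1})^T
    d_k = lam_true @ y_k                # noiseless desired response
    eps_k = d_k - lam @ y_k             # a priori error
    lam = lam + 2 * mu * eps_k * y_k    # LMS update

print(lam)                              # ~ lam_true
```

Note that no expectation is computed anywhere: each iteration uses only the current samples, and the averaging is achieved implicitly over the iterations.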
To be put into operation, this algorithm needs knowledge of the couple D_K and Z_K at each incremental step. We have knowledge of these at all instants thanks to the filtering, as Z_K = λ_K^T Y_K and z_K = λ_K^T y_K by considering the data; and we know, obviously, the reference D_K.

We can write, for n ∈ N*:

λ_{K+n} = λ_K + (2μn) (1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j}

with y_{K+j} = ( y_{K+j}   y_{K−1+j}   ...   y_{K−m+1+j} )^T, if μ is constant at each step of the iteration.
If (ε_K Y_K) is ergodic and μ constant, the expression:

λ_{K+n} = λ_K + (2μn) (1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j}

is such that lim_{K→∞} λ_K does not exist.
Let us suppose that (ε_K Y_K) is ergodic but that μ varies with the instant K; thus:
λ_{K+n} = λ_K + (2μn) (1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j}

becomes:

λ_{K+n} = λ_K + (2μ_n n) (1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j}

As:

(1/n) Σ_{j=0}^{n−1} y_{K+j} ε_{K+j} → E( Y_K ε_K ) = cte

in order that λ_n tends to a limit, μ_n must decrease faster than α/n (α = constant). We thus rediscover a relation very close to that obtained in section 6.5:

λ_{K+n} = λ_K + 2 μ_n n E( ε_K Y_K )
6.8.1. Convergence of the LMS algorithm

The study of the convergence of this algorithm is much more delicate than that of the deterministic gradient. The reader is invited to refer to the bibliography, and to [MAC 95] in particular, for more information.
6.9. Example of the application of the LMS algorithm
Let us recall the modeling of an AR process.
[Block diagram: the white noise B_K drives the AR model; the output X_K is fed back through delays T with coefficients a_1, a_2, ..., a_M and subtracted at the input summing node]

Thus:

B_K = Σ_{n=0}^{M} a_n X_{K−n}
By multiplying the two members by X_{K−l} and by taking the expectations, it becomes:

E( X_{K−l} Σ_{n=0}^{M} a_n X_{K−n} ) = E( X_{K−l} B_K )

If l > 0 then X_{K−l} ⊥ B_K, as B_K is a white noise and, among the samples X_{K−l}, only X_K (l = 0) depends on B_K. Thus, by stating:

E( X_j X_m ) = r_{j−m}
Σ_{n=0}^{M} a_n r_{n−l} = 0  for l > 0

and:

Σ_{n=0}^{M} a_n r_n = E( X_K B_K ) = E( ( B_K − Σ_{n=1}^{M} a_n X_{K−n} ) B_K ) = σ_B^2

By noting a_0 = 1 and by using the matrix expression, this becomes:

( r_0, r_1, ..., r_M ; r_1, r_0, ..., r_{M−1} ; ... ; r_M, r_{M−1}, ..., r_0 ) ( 1, a_1, ..., a_M )^T = ( σ_B^2, 0, ..., 0 )^T

where the first row corresponds to l = 0 and the following rows to l ∈ [1, M].
AR process of order 1. Let the following AR process be X_K = −a X_{K−1} + B_K, where B_K is a centered white noise of variance σ_B^2. The problem consists of estimating the constant a using an adaptive filter.

[Block diagram: B_K = X_K + a X_{K−1}, with X_K fed back through a delay T and a gain a]

Knowing B_K and X_{K−1}, the problem consists of estimating X_K (or a).
The preceding results allow us to write:

r_0 + a_1 r_1 = σ_B^2
r_1 + a_1 r_0 = 0

from where:

a_1 = a = −r_1 / r_0  and  σ_B^2 = σ_X^2 ( 1 − a^2 )
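These two relations can be checked by simulating the AR(1) recursion (a Python sketch; the values of a and σ_B are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
a, sigma_b, n = -0.6, 1.0, 200_000

x = np.zeros(n)
b = sigma_b * rng.standard_normal(n)
for k in range(1, n):
    x[k] = -a * x[k - 1] + b[k]        # X_K = -a X_{K-1} + B_K

x = x[1000:]                           # discard the transient
r0 = np.mean(x * x)                    # sigma_X^2
r1 = np.mean(x[1:] * x[:-1])           # r_1

print(-r1 / r0, a)                     # a = -r1 / r0
print(r0 * (1 - a**2), sigma_b**2)     # sigma_B^2 = sigma_X^2 (1 - a^2)
```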
Let us estimate this value of the parameter a with the help of a predictor and by using an LMS algorithm.

[Block diagram: Y_K = X_{K−1}, obtained by delaying X_K, is weighted by λ to form Z_K; the error is ε_K = D_K − Z_K with D_K = X_K]

ε_K = D_K − λ X_{K−1}, where Y_K = X_{K−1} and D_K = X_K, with ε_K ⊥ Z_K (principle of orthogonality), i.e.:

E( ( X_K − λ̂ X_{K−1} ) X_{K−1} ) = 0

or r_1 = λ̂ r_0, from where λ̂ = r_1 / r_0 = −a.
By using the Wiener optimal solution directly, with R = r_0 and p = r_1, we obtain R λ̂ = p, i.e. λ̂ = r_1 / r_0.

Let C(λ̂) = E(D_K^2) − p^T λ̂, which gives us:

C(λ̂) = σ_X^2 ( 1 − a^2 )

This minimum cost is also equal to σ_B^2.
Below is an example processed with Matlab. For an AR process of order 2, we have:

ε_K = D_K − λ^0 X_{K−1} − λ^1 X_{K−2}

and:

E( ( X_K − λ̂^0 X_{K−1} − λ̂^1 X_{K−2} ) ( X_{K−1}   X_{K−2} )^T ) = ( 0   0 )^T
Thus:

λ̂^0 = ( r_1 r_0 − r_1 r_2 ) / ( r_0^2 − r_1^2 )  and  λ̂^1 = ( r_2 r_0 − r_1^2 ) / ( r_0^2 − r_1^2 )

or, by using the Wiener solution:

R = ( r_0, r_1 ; r_1, r_0 )  and  p = ( r_1   r_2 )^T  with  R λ̂ = p

See the following example using Matlab software.
SUMMARY.– We have shown that the algorithm of the gradient, through its recursivity, resolves the Wiener-Hopf expression by calculating the mean. However, it needs twice the amount of calculations of a transverse filter, as we have to calculate, on the one hand:

ε_K = d_K − λ_K^T y_K, with its m multiplications and m additions,

and on the other hand:

λ_{K+1} = λ_K + 2μ ε_K y_K, with its m + 1 multiplications and m additions.

The complexity is thus of order 2m. We have also shown that the algorithm of the stochastic gradient is the simplest of all those which optimize the same least-squares criterion. In contrast, it converges more slowly than the so-called exact least-squares algorithms.

Examples processed using Matlab software

Example of adaptive filtering (AR of order 1)

The objective consists of estimating the coefficient of a predictor of order 1 by using the LMS algorithm of an adaptive filter. The process is constructed by an AR model of the 1st order with a white noise which is centered, Gaussian, and has a variance (sigmav)^2. The problem returns to that of finding the best coefficient which gives us the sample to be predicted.
% Predictor of order 1
clear all; close all;
N=500; t=0:N;
a=-rand(1);                       % value to be estimated
sigmav=0.1;                       % standard deviation of noise
r0=(sigmav)^2/(1-a^2);            % E[u(k)^2]
r1=-a*r0;                         % represents p
wopt=r1/r0;                       % optimal Wiener solution
Jmin=r0-r1*wopt;
mu=0.1;                           % convergence parameter
w(1)=0; u(1)=0;
vk=sigmav*randn(size(t));
for k=1:length(t)-1;
  u(k+1)=-a*u(k)+vk(k+1);
  e(k+1)=u(k+1)-w(k)*u(k);
  w(k+1)=w(k)+2*mu*u(k)*e(k+1);
  E(k+1)=e(k+1)^2;                % instantaneous square error
  J(k+1)=Jmin+(w(k)-wopt)'*r0*(w(k)-wopt);
end
% line graph
subplot(3,1,1)
plot(t,w,'k',t,wopt*ones(size(t)),'k',t,a*ones(size(t)),'k'); grid on
title('estimation of lambda, lambda opt. and "a"')
subplot(3,1,2)
plot(t,E,'k',t,J,'k',t,Jmin*ones(size(t)),'k'); grid on
axis([0 N 0 max(E)])
title('inst. err., cost and min cost')
subplot(3,1,3)
plot(w,E,'k',w,J,'k'); grid on
axis([0 1.2*wopt 0 max(J)])
title('inst. err. and cost acc. to lambda')
Figure 6.15. Line graph of important data of AR process of order 1

Another example (AR of order 2)

The objective consists of estimating the coefficients of a predictor of order 2 by using the algorithm of the stochastic gradient of an adaptive filter. The process is constructed by an AR model of the 2nd order with a white noise which is centered, Gaussian, and has a variance (sigmav)^2. The problem returns to that of finding the best coefficients which give us the sample to be predicted.
% Predictor of order 2
clear all; close all;
N=1000; t=0:N;
a1=-0.75;     % value to be estimated
a2=0.9;       % idem
sigmav=0.2;   % standard deviation of noise
r0=((1+a2)*((sigmav)^2))/(1+a2-a1^2+a2*(a1^2)-a2^2-a2^3);  % E[u(k)^2]
r1=(-a1*r0)/(1+a2);                 % represents p(1)
r2=(r0*(a1^2-a2^2-a2))/(1+a2);      % represents p(2)
w1opt=(r0*r1-r1*r2)/(r0^2-r1^2);
w2opt=(r0*r2-r1^2)/(r0^2-r1^2);
wopt=[w1opt w2opt]';  % optimal Wiener solution
p=[r1 r2]';
Jmin=r0-p'*wopt;
R=[r0 r1;r1 r0];
mu=0.2;       % convergence parameter
w1(1)=0; w2(1)=0; w1(2)=0; w2(2)=0;
u(1)=0; u(2)=0;
vk=sigmav*randn(size(t));
for k=2:length(t)-1;
  u(k+1)=-a1*u(k)-a2*u(k-1)+vk(k+1);
  e(k+1)=u(k+1)-w1(k)*u(k)-w2(k)*u(k-1);
  w1(k+1)=w1(k)+2*mu*u(k)*e(k+1);
  w2(k+1)=w2(k)+2*mu*u(k-1)*e(k+1);
  w(:,k)=[w1(k) w2(k)]';
  J(k+1)=Jmin+(w(:,k)-wopt)'*R*(w(:,k)-wopt);
end
% line graph
w(:,N)
delta=a1^2-4*a2;
z1=(-a1+(delta^.5))/2;
z2=(-a1-(delta^.5))/2;
subplot(2,2,1)
plot(t,w1,'k',t,w1opt*ones(size(t)),'b',t,a1*ones(size(t)),'r'); grid on
title('est. lambda0, lambda0 opt. and "a1"')
subplot(2,2,2)
plot(t,w2,'k',t,w2opt*ones(size(t)),'b',t,a2*ones(size(t)),'r'); grid on
title('est. lambda1, lambda1 opt. and "a2"')
subplot(2,2,3)
plot(t,J,'-',t,Jmin*ones(size(t)),'r'); grid on
axis([0 N 0 max(J)])
title('Cost and min Cost')
subplot(2,2,4)
plot(w1,J,'b',w2,J,'r'); grid on
title('evolution of coefficients acc. to Cost')

Figure 6.16. Line graph of important data of AR process of order 2

6.10. Exercises for Chapter 6

Exercise 6.1. [WID 85]

An adaptive filter is characterized by:

– R = ( 2, 1 ; 1, 2 ), the correlation matrix of the data;

– p = ( 7  8 )^T, the intercorrelation vector;

– E(D_K^2) = 42, D_K being the desired output.
1) Give the cost expression C.
2) Calculate the optimal vector λ̂.
3) Give the expression of the minimum cost C(λ̂).
4) Calculate the eigenvalues of R.
5) Determine the eigenvectors in such a way that the matrix Q of the eigenvectors is "normalized" (that is to say QQ^T = I), these vectors representing the principal axes of the family of ellipses.
6) Give the limits of μ, the convergence parameter used in the algorithm of the stochastic gradient.

Solution 6.1.

1) C = 2λ_1^2 + 2λ_2^2 + 2λ_1λ_2 − 14λ_1 − 16λ_2 + 42.
2) λ̂ = ( 2  3 )^T.
3) C(λ̂) = 4.
4) γ_1 = 1, γ_2 = 3.
5) u_1 = (1/√2) ( 1  −1 )^T, u_2 = (1/√2) ( 1  1 )^T.
6) 0 < μ < 1/3.
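The solution can be checked in a few lines of Python:

```python
import numpy as np

R = np.array([[2.0, 1.0], [1.0, 2.0]])
p = np.array([7.0, 8.0])
E_D2 = 42.0

lam_hat = np.linalg.solve(R, p)        # question 2: (2, 3)
C_min = E_D2 - p @ lam_hat             # question 3: 4
gammas = np.linalg.eigvalsh(R)         # question 4: (1, 3)
mu_max = 1 / gammas.max()              # question 6: 1/3

print(lam_hat, C_min, gammas, mu_max)
```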
Chapter 7
The Kalman Filter
7.1. Position of the problem

The aim of the filtering that we are going to study consists of "best estimating", in the sense of the classic criterion of least mean squares, a discrete process X_K governed by an equation of the form:

X_{K+1} = A(K) X_K + C(K) N_K  (state equation)

This process (physical, biological, etc.), called the state process, is what interests the user. It represents, for example, the position, speed and acceleration of a moving object. This process is inaccessible directly and it is studied by means of a process Y_K governed by an equation of the form:

Y_K = H(K) X_K + G(K) W_K  (observation equation)

Y_K is called the observation process.
N_K and W_K are the system noise and the measurement noise respectively, and will be explained in more detail in what follows.

The Kalman filter, with its creation, brought into widespread use the optimal filter for non-stationary systems. It is also recursive: the prediction X̂_{K+1|K} is obtained starting from the filtering at the preceding instant, X̂_{K|K}, and the filtering X̂_{K+1|K+1} from its prediction X̂_{K+1|K} and from the measurement of the process Y_{K+1} at the instant at which we are making our estimation.

Moreover, if the observable system is known and linear, the objective consists of, starting from the measurements of the system, determining the best possible estimate in the sense of the criterion specified above. If the observable system is known but non-linear, an approximate solution can be given by effecting a linearization of the equations of state and of the observations around the last estimated value. If the system is not perfectly known and linear, the problem is more complicated because we must introduce and estimate, in the state vector, the components inherent to this system. This case will not be studied in this chapter. In the same fashion, we will not tackle the case where the noises are colored, or that in which there is a correlation between the system noise and the measurement noise. The reader can find additional information in the bibliography ([GIM 82] and [RAD 84]).

Preliminaries in the scalar case

We have demonstrated that the best estimate of a process, starting from an observation function g, which is to say X̂ = ĝ(Y_1, ..., Y_K), is the conditional expectation of the magnitude X knowing the set of random observation variables Y_1, ..., Y_K, represented by the orthogonal projection of X on the Hilbert space that we have defined:

X̂ = ĝ(Y_1, ..., Y_K) = Proj_{H_K^Y} X = E( X | Y_1, ..., Y_K )
However, if the vector ( X, Y_1, ..., Y_K ) is Gaussian, then we have seen that the estimate X̂ of X is an affine function of the variables Y_j:

X̂ = λ̂^0 + Σ_{j=1}^{K} λ̂^j Y_j
In order to approach Kalman filtering in a simple way, we are going to begin by grappling with the problem of linear estimation in the scalar case, applied to linear prediction. The shape of the recursive estimation obtained will allow a better grasp of the multivariate case.

Let us consider a set of random variables Y_1, Y_2, ..., Y_j, ..., Y_{K−1}, with Y_j the variable observed at instant j and Y_0 = 0 by convention.

Let us recall that we denote by H_{K-1}^Y the real vector space generated by these random variables, i.e.:

H_{K-1}^Y = { Σ_{j=1}^{K−1} λ_j Y_j : λ_j ∈ R }
Example of linear estimation [HAY 91]

The best linear estimation, in the sense of the least mean square error, of a random variable Y_K, starting from the observations making up H_{K-1}^Y, can be done by the following linear predictor:
[Block diagram: the delayed samples Y_{K−1}, Y_{K−2}, ..., Y_{K−(K−1)} are weighted by λ_1, λ_2, ..., λ_{K−1} and summed to form Ŷ_{K|K−1}]
Figure 7.1. Schema of the principle of the linear estimator
The prediction error is then written:

I_K = Y_K − Ŷ_{K|K−1}

(which we could compare with ε_K in the adaptive filter) for a predictor filter of order K − 1, and it is easily constructed by the above arrangement. The output of the filter can be interpreted as the best estimate at instant K, knowing the data of the process Y_1, ..., Y_{K−1}. Thus, we can interpret ŷ_{K|K−1}, the result of Ŷ_{K|K−1}, as the output of a predictor of order K − 1 whose input would be made up of the observations y_1, y_2, ..., y_{K−1}: measurements of the Y_j.

The principle of orthogonality shows us that this "error" I_K is orthogonal to H_{K-1}^Y and may be interpreted as the new information brought by Y_K, from which comes the name of "innovation" error. Thus, we will name this prediction error: the innovation.
7.2. Approach to estimation

7.2.1. Scalar case

It is clear that we can give an estimate of the magnitude of a process based on past observations of this process. In the expression of the innovation:

I_K = Y_K − Σ_{i=1}^{K−1} λ̂_i Y_{K−i}

Y_K represents the magnitude to be estimated (see the predictor) and:

Σ_{i=1}^{K−1} λ̂_i Y_{K−i} = Proj_{H_{K-1}^Y} Y_K = Ŷ_{K|K−1}

represents the estimation, and:

I_K = Y_K − Ŷ_{K|K−1}

In the same way, if we call:

X̂_{K|K} = Proj_{H_K^Y} X_K

the estimate of a process at instant K, starting from the measurements y_1, ..., y_K, ... of the process Y_1, ..., Y_K, ..., we can write:

X̂_{K|K} = Σ_{j=1}^{K} b_j Y_j, estimate of X_K.
Let us write the innovation at instants 1, 2, ..., K:

I_K = Y_K − Σ_{i=1}^{K−1} λ_i^{K−1} Y_{K−i}

with λ_i^{K−1}: coefficients of the predictor of order K − 1:

I_1 = Y_1  (with Ŷ_{1|0} = 0)
I_2 = Y_2 − λ_1^1 Y_1
I_3 = Y_3 − λ_1^2 Y_2 − λ_2^2 Y_1
...
I_K = Y_K − λ_1^{K−1} Y_{K−1} − ... − λ_{K−1}^{K−1} Y_1

This expression can be put in the form I = M Y, with M an invertible triangular matrix (because det M = 1). Thus Y = M^(−1) I.

As a consequence, each vector I can be written according to the vectors Y = ( Y_1, ..., Y_K )^T and inversely: H_K^Y = H_K^I.

Thus X̂_{K|K} = b′ · Y = b′ M^(−1) I, where b′ = ( b′_1, ..., b′_K )^T is a vector of dimension K and I = ( I_1, ..., I_K )^T the innovation vector.
It is clear that the equality X̂_{K|K} = b′ M^(−1) I can also be put in the form:

X̂_{K|K} = Σ_{j=1}^{K} d_j I_j

Let us now show that:

d_j = E( X_K I_j ) / E( I_j I_j ),  j ∈ [1, K]
Demonstration

We know that X_K − X̂_{K|K} ∈ H_K^{Y,⊥}. We have:

X_K − X̂_{K|K} ⊥ Y_j,  ∀j ∈ [1, K]

and, since Ŷ_{j|j−1} ∈ H_{j−1}^Y ⊂ H_K^Y, it also comes:

X_K − X̂_{K|K} ⊥ Ŷ_{j|j−1}

Thus X_K − X̂_{K|K} ⊥ Y_j − Ŷ_{j|j−1} = I_j,  ∀j ∈ [1, K].

That is to say: E( X_K I_j ) = E( X̂_{K|K} I_j ).

From which finally: E( X_K I_j ) = E( X̂_{K|K} I_j ) = Σ_{i=1}^{K} d_i E( I_i I_j ), and since I_i ⊥ I_j if i ≠ j, it becomes:

d_j = E( X_K I_j ) / E( I_j I_j )

Let us exploit the expression of the filtering: X̂_{K|K} = Σ_{j=1}^{K} d_j I_j
and:

X̂_{K|K} = Σ_{j=1}^{K−1} d_j I_j + d_K I_K

From our first results, the sum of the K − 1 terms also represents an estimation, and:

X̂_{K|K} = X̂_{K−1|K−1} + d_K I_K

which shows that the estimate at instant K is written according to the estimate at instant K − 1 and a corrective term depending on instant K. This recursive estimation procedure is the foundation of Kalman filtering.

7.2.2. Multivariate case

We are going at present to consider the vector magnitudes seen in Chapter 4, which is to say:

– X_K: multivector of order n, X_K ∈ (L^2)^n;
– Y_K: multivector of order m, Y_K ∈ (L^2)^m;
– I_K: multivector of order m, I_K ∈ (L^2)^m.
The relationship between the Y_j and the I_j:

I_K = Y_K − H(K) X̂_{K|K−1}, or I_K = Y_K − H(K) Σ_{j=1}^{K−1} Λ̂_j Y_j

Reciprocally, by writing Y_K according to the I_K, it becomes (with X̂_{1|0} = 0):

Y_1 = I_1
Y_2 = I_2 + H(2) Λ̂_1 I_1
Y_3 = I_3 + H(3) Λ̂_1 I_1 + H(3) Λ̂_2 I_2 + H(3) Λ̂_2 H(2) Λ̂_1 I_1

Thus Y_K is expressed according to I_K, I_{K−1}, ..., I_1.
7.3. Kalman filtering

Vector or multivariate approach. Given:

– X_K: state multivector (n × 1);
– x_K: state vector of results;
– Y_K: multivector of observations (m × 1);
– y_K: vector of observation results.
7.3.1. State equation

X_{K+1} = A(K) X_K + C(K) N_K

with A(K) the state matrix (n × n), a deterministic matrix, and N_K the process noise vector (l × 1), which we choose centered, white, and of correlation matrix (covariance matrix in the general case):

E( N_K N_j^T ) = δ_{K,j} Q_K,  (l × l): correlation matrix of the process noise vector N_K

C(K): (n × l): deterministic matrix

7.3.2. Observation equation
Y_K = H(K) X_K + G(K) W_K

with H(K): matrix of measurements or of observations (m × n), a deterministic matrix; and W_K: measurement noise vector (p × 1), which we choose, like N_K, centered, white, and of correlation matrix (covariance matrix in the general case):

E( W_K W_j^T ) = δ_{K,j} R_K,  (p × p): correlation matrix of the measurement noise vector W_K

G(K): (m × p): deterministic matrix

The noises N_K and W_K are independent and, as they are centered:

E( N_K W_j^T ) = 0,  ∀K and j

We will suppose, in what follows, that W_K ⊥ X_0.
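Before deriving the filter, it may help to see the two equations run in code. The following Python sketch simulates a linear state-space model; the matrices, dimensions and noise levels are illustrative assumptions (a constant-velocity model observed in position), not values from the book:

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps = 50

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])                 # state matrix A(K) (position / velocity)
C = np.eye(2)                              # C(K)
H = np.array([[1.0, 0.0]])                 # H(K): only the position is observed
G = np.eye(1)                              # G(K)
Q = 0.01 * np.eye(2)                       # correlation matrix of N_K
Rw = 0.25 * np.eye(1)                      # correlation matrix of W_K

x = np.zeros(2)
xs, ys = [], []
for _ in range(n_steps):
    n_k = rng.multivariate_normal(np.zeros(2), Q)    # system noise N_K
    w_k = rng.multivariate_normal(np.zeros(1), Rw)   # measurement noise W_K
    x = A @ x + C @ n_k                              # state equation
    y = H @ x + G @ w_k                              # observation equation
    xs.append(x.copy()); ys.append(y.copy())

print(len(xs), len(ys))
```

The filtering problem of this chapter is to recover the hidden trajectory xs from the noisy scalar measurements ys alone.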
By iteration of the state equation, we can write:

X_K = Φ(K, 0) X_0 + Σ_{i=1}^{K−1} Φ(K, i+1) N_i, with Φ(K, j): transition matrix.

We obtain from this transition equation, by multiplying the two members by W_j:

X_K ⊥ W_j,  ∀K, j > 0

By using the equation of observations:

Y_j ⊥ W_K for 0 ≤ j ≤ K − 1, and Y_j ⊥ N_K for 0 ≤ j ≤ K

The problem of the estimation can now be expressed simply in the following way. Knowing that A(K) is the state matrix of the system, that H(K) is the measurement matrix, and the results y_i of the Y_i, i ∈ [1, K], obtain the estimations x_j of the X_j:

– if 1 ≤ j < K, we say that the estimation is a smoothing;
– if j = K, we say that the estimation is a filtering;
– if j > K, we say that the estimation is a prediction.

NOTE.– The matrices C(K) and G(K) do not play an essential role, in the measure where the powers of the noises appear in the elements of the matrices Q_K and R_K respectively. However, the reader will be able to find analogies with the notations used in "Processus stochastiques et filtrage de Kalman", by the same authors, which examines the continuous case.
7.3.3. Innovation process
The innovation process has already been defined as:

$I_K = Y_K - H(K)\,\mathrm{Proj}_{H_{K-1}^Y} X_K = Y_K - H(K)\,\hat{X}_{K|K-1} \quad : (m \times 1)$

and:

$H_{K-1}^Y = \Big\{ \sum_{j=0}^{K-1} \Lambda_j Y_j \;\Big|\; \Lambda_j \text{ matrix } (n \times m) \Big\}$

By this choice of $\Lambda_j$, the space $H_{K-1}^Y$ is adapted to the order of the state multivector $X_K$, and $\mathrm{Proj}_{H_{K-1}^Y} X_K = \hat{X}_{K|K-1}$ has the same order as $X_K$.

Thus $I_K$ represents the influx of information between the instants $K-1$ and $K$. Reminder of properties established earlier:

$I_K \perp Y_j$ and $I_K \perp I_j$ for $j \in [1, K-1]$

We will come back to the innovation to bring out its physical meaning.

7.3.4. Covariance matrix of the innovation process
Between two measurements, the dynamics of the system lead to an evolution of the state quantities. So the prediction of the state vector at instant $K$, knowing the measurements $(Y_1, \ldots, Y_{K-1})$, which is to say $\hat{X}_{K|K-1}$, is written according to the filtering at instant $K-1$:

$\hat{X}_{K|K-1} = E\big( X_K \mid Y_1, \ldots, Y_{K-1} \big) = \mathrm{Proj}_{H_{K-1}^Y} X_K$
$= \mathrm{Proj}_{H_{K-1}^Y} \big( A(K-1)\,X_{K-1} + C(K-1)\,N_{K-1} \big)$
$= A(K-1)\,\hat{X}_{K-1|K-1} + 0$

that is:

$\hat{X}_{K|K-1} = A(K-1)\,\hat{X}_{K-1|K-1}$
Only the information deriving from a new measurement at instant $K$ will enable us to reduce the estimation error at this same instant. Thus, with $H(K)$ representing, in a certain fashion, the measurement operator (or at the very least its effect), the quantity:

$Y_K - H(K)\,\hat{X}_{K|K-1}$

will represent the influx of information between two instants of observation. It is for this reason that this quantity is called the innovation. We observe, furthermore, that $I_K$ and $Y_K$ have the same order. By exploiting the observation equation we deduce:

$I_K = H(K)\big( X_K - \hat{X}_{K|K-1} \big) + G(K)\,W_K$

and:

$I_K = H(K)\,\tilde{X}_{K|K-1} + G(K)\,W_K$

where $\tilde{X}_{K|K-1} = X_K - \hat{X}_{K|K-1}$ is called the prediction error. The covariance matrix of the innovation is finally expressed as:

$\mathrm{Cov}\,I_K = E\big( I_K I_K^T \big) = E\Big[ \big( H(K)\,\tilde{X}_{K|K-1} + G(K)\,W_K \big)\big( H(K)\,\tilde{X}_{K|K-1} + G(K)\,W_K \big)^T \Big]$

that is to say:

$\mathrm{Cov}\,I_K = H(K)\,P_{K|K-1}\,H^T(K) + G(K)\,R_K\,G^T(K)$

where $P_{K|K-1} = E\big( \tilde{X}_{K|K-1}\,\tilde{X}_{K|K-1}^T \big)$ is called the covariance matrix of the prediction error.
A recurrence formula on the matrices $P_{K|K-1}$ will be developed in Appendix A.
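In the scalar case ($H$, $G$, $P_{K|K-1}$ and $R$ all scalars) the formula reduces to $\mathrm{Cov}\,I_K = H^2 P_{K|K-1} + G^2 R$, which a short Monte Carlo sketch can confirm (the numerical values below are arbitrary, chosen only for illustration):

```python
import random

random.seed(1)
H, G, P_pred, R = 2.0, 1.0, 0.5, 0.3

# draw innovations I = H*Xtilde + G*W, with Xtilde ~ N(0, P_pred) and W ~ N(0, R)
samples = [H * random.gauss(0.0, P_pred ** 0.5) + G * random.gauss(0.0, R ** 0.5)
           for _ in range(200000)]
cov_est = sum(s * s for s in samples) / len(samples)

print(cov_est)  # close to H^2 * P_pred + G^2 * R = 2.3
```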
7.3.5. Estimation
In the scalar case, we established a relationship between the estimate of a quantity $X_K$ and the innovations $I_K$. We can quite obviously extend this approach to the case of multivariate processes, that is to say we can write:

$\hat{X}_{i|K} = \sum_{j=1}^{K} d_j(i)\,I_j$

where $d_j(i)$ is a matrix $(n \times m)$. Let us determine the matrices $d_j(i)$:

since $E\big( \tilde{X}_{i|K}\,I_j^T \big) = E\big( ( X_i - \hat{X}_{i|K} )\,I_j^T \big) = 0 \quad \forall j \in [1,K]$

we have: $E\big( X_i\,I_j^T \big) = E\big( \hat{X}_{i|K}\,I_j^T \big)$

furthermore, knowing the form of $\hat{X}_{i|K}$, we have:

$E\big( X_i\,I_j^T \big) = E\Big( \sum_{p=1}^{K} d_p(i)\,I_p\,I_j^T \Big)$

Then, since $I_j \perp I_p \quad \forall j \neq p$ and $j, p \in [1,K]$:

$E\big( X_i\,I_j^T \big) = d_j(i)\,E\big( I_j\,I_j^T \big) = d_j(i)\,\mathrm{Cov}\,I_j$

Finally: $d_j(i) = E\big( X_i\,I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1}$.
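Because the innovations are mutually orthogonal, each coefficient $d_j$ is obtained independently of the others, one correlation at a time. A scalar Monte Carlo sketch (the coefficients 3 and 0.5 are made-up values):

```python
import random

random.seed(2)
N = 100000
I1 = [random.gauss(0.0, 1.0) for _ in range(N)]   # Cov I1 = 1
I2 = [random.gauss(0.0, 2.0) for _ in range(N)]   # Cov I2 = 4
X = [3.0 * a + 0.5 * b for a, b in zip(I1, I2)]   # X built on the innovations

# d_j = E(X I_j) (Cov I_j)^(-1), estimated empirically
d1 = sum(x * i for x, i in zip(X, I1)) / sum(i * i for i in I1)
d2 = sum(x * i for x, i in zip(X, I2)) / sum(i * i for i in I2)

print(d1, d2)  # close to 3 and 0.5
```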
We thus obtain:

$\hat{X}_{i|K} = \sum_{j=1}^{K} E\big( X_i I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j = \sum_{j=1}^{K-1} E\big( X_i I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j + E\big( X_i I_K^T \big)\big( \mathrm{Cov}\,I_K \big)^{-1} I_K$

We are now going to give the Kalman equations. Let us apply the preceding equality to the filtering $\hat{X}_{K+1|K+1}$; we obtain:

$\hat{X}_{K+1|K+1} = \sum_{j=1}^{K+1} E\big( X_{K+1} I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j = \sum_{j=1}^{K} E\big( X_{K+1} I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j + E\big( X_{K+1} I_{K+1}^T \big)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

The state equation reminds us that:

$X_{K+1} = A(K)\,X_K + C(K)\,N_K$

and we know that $N_K \perp I_j$. Thus:

$E\big( X_{K+1} I_j^T \big) = A(K)\,E\big( X_K I_j^T \big)$
The estimate of $X_{K+1}$ knowing the measurement at this instant $K+1$ is thus expressed:

$\hat{X}_{K+1|K+1} = A(K) \sum_{j=1}^{K} E\big( X_K I_j^T \big)\big( \mathrm{Cov}\,I_j \big)^{-1} I_j + E\big( X_{K+1} I_{K+1}^T \big)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

The term under the sigma (sum) sign can be written $\hat{X}_{K|K}$. Let us exploit the expression:

$I_{K+1} = H(K+1)\,\tilde{X}_{K+1|K} + G(K+1)\,W_{K+1}$

This gives us:

$\hat{X}_{K+1|K+1} = A(K)\,\hat{X}_{K|K} + E\big( X_{K+1} I_{K+1}^T \big)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

which is also written:

$\hat{X}_{K+1|K+1} = A(K)\,\hat{X}_{K|K} + E\Big( X_{K+1} \big( H(K+1)\,\tilde{X}_{K+1|K} + G(K+1)\,W_{K+1} \big)^T \Big) \big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

In addition, we have shown that the best estimation at a given instant, knowing the past measurements, which we write $\hat{X}_{K+1|K}$, is equal to the projection of $X_{K+1}$ on $H_K^Y$, i.e.:
$\hat{X}_{K+1|K} = \mathrm{Proj}_{H_K^Y} X_{K+1} = \mathrm{Proj}_{H_K^Y}\big( A(K)\,X_K + C(K)\,N_K \big)$

and as:

$Y_j \perp N_K \quad \forall j \in [1,K]$

it becomes $\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K}$ (the matrix $A(K)$ being square). We can consider this equation as the one which describes the dynamics of the system independently of the measurements, and as one of the equations of the Kalman filter.

In addition, $X_K \perp W_j \quad \forall K,\ j > 0$; it becomes, for the filtering:

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + E\big( X_{K+1}\,\tilde{X}_{K+1|K}^T \big)\,H^T(K+1)\,\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$
As:

$\hat{X}_{K+1|K} \perp \tilde{X}_{K+1|K}$

then:

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + E\Big( \big( X_{K+1} - \hat{X}_{K+1|K} \big)\,\tilde{X}_{K+1|K}^T \Big)\,H^T(K+1)\,\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$

thus:

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + P_{K+1|K}\,H^T(K+1)\,\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} I_{K+1}$
DEFINITION.– We call the Kalman gain the function $K$ defined (here at instant $K+1$) by:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\,\big( \mathrm{Cov}\,I_{K+1} \big)^{-1}$

with:

$\mathrm{Cov}\,I_{K+1} = H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1)$

From which, by putting $\mathrm{Cov}\,I_{K+1}$ back into the expression of $K(K+1)$, we obtain:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\Big( H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1) \Big)^{-1}$

We notice that this calculation does not require direct knowledge of the measurements of $Y_K$. This expression of the gain intervenes, quite obviously, in the algorithm of the Kalman filter, and we can write:

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + K(K+1)\big( Y_{K+1} - H(K+1)\,\hat{X}_{K+1|K} \big)$

This expression of the best filtering represents another equation of the Kalman filter. We observe that the "effect" of the gain is essential. Indeed, if the measurement is very noisy, which means that the elements of the matrix $R_K$ are large, then the gain will be relatively small and the impact of this measurement on the calculation of the filtering will be minimized. On the other hand, if the measurement is not very noisy, we will have the inverse effect: the gain will be large and its effect on the filtering will be appreciable.
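This behavior is easiest to see in the scalar case, where, with $H = G = 1$, the gain reduces to $K = P_{K+1|K}/(P_{K+1|K} + R)$. A sketch (the numerical values are arbitrary):

```python
def scalar_gain(p_pred, r):
    """Scalar Kalman gain with H = G = 1: K = P/(P + R)."""
    return p_pred / (p_pred + r)

p_pred = 1.0
print(scalar_gain(p_pred, 0.01))   # quiet measurement: gain close to 1
print(scalar_gain(p_pred, 100.0))  # very noisy measurement: gain close to 0
```

A gain near 1 means the filter trusts the new measurement almost entirely; a gain near 0 means the measurement barely moves the estimate.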
We are now going "to estimate" this filtering by calculating the error that we commit, that is to say, by calculating the covariance matrix of the filtering error. Let us recall that $\hat{X}_{K+1|K+1}$ is the best of the filterings, in the sense that it minimizes the mapping:

$Z \in H_{K+1}^Y \;\longmapsto\; \mathrm{tr}\,\big\| X_{K+1} - Z \big\|^2 = \mathrm{tr}\,E\Big[ \big( X_{K+1} - Z \big)\big( X_{K+1} - Z \big)^T \Big] \in \mathbb{R}$

The minimum is thus:

$\mathrm{tr}\,\big\| X_{K+1} - \hat{X}_{K+1|K+1} \big\|^2 = \mathrm{tr}\,E\big( \tilde{X}_{K+1|K+1}\,\tilde{X}_{K+1|K+1}^T \big)$

NOTATION.– In what follows, the matrix $E\big( \tilde{X}_{K+1|K+1}\,\tilde{X}_{K+1|K+1}^T \big)$ is denoted $P_{K+1|K+1}$ and is called the covariance matrix of the filtering error.

We now give a simple relationship linking the matrices $P_{K+1|K+1}$ and $P_{K+1|K}$. We observe, by using the filtering equation first and the observation equation next:
$\tilde{X}_{K+1|K+1} = X_{K+1} - \hat{X}_{K+1|K+1}$
$= X_{K+1} - \hat{X}_{K+1|K} - K(K+1)\big( Y_{K+1} - H(K+1)\,\hat{X}_{K+1|K} \big)$
$= X_{K+1} - \hat{X}_{K+1|K} - K(K+1)\big( H(K+1)\,X_{K+1} + G(K+1)\,W_{K+1} - H(K+1)\,\hat{X}_{K+1|K} \big)$
$= \big( I_d - K(K+1)\,H(K+1) \big)\,\tilde{X}_{K+1|K} - K(K+1)\,G(K+1)\,W_{K+1}$
where $I_d$ is the identity matrix. By bringing this expression of $\tilde{X}_{K+1|K+1}$ into $P_{K+1|K+1}$ and by using the fact that $\tilde{X}_{K+1|K} \perp W_{K+1}$, we have:

$P_{K+1|K+1} = \big( I_d - K(K+1)\,H(K+1) \big)\,P_{K+1|K}\,\big( I_d - K(K+1)\,H(K+1) \big)^T + K(K+1)\,G(K+1)\,R_{K+1}\,G^T(K+1)\,K^T(K+1)$

an expression which, since:

$\mathrm{Cov}\,I_{K+1} = G(K+1)\,R_{K+1}\,G^T(K+1) + H(K+1)\,P_{K+1|K}\,H^T(K+1)$

can be written:

$P_{K+1|K+1} = \Big( K(K+1) - P_{K+1|K}\,H^T(K+1)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} \Big)\,\big( \mathrm{Cov}\,I_{K+1} \big)\,\Big( K(K+1) - P_{K+1|K}\,H^T(K+1)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} \Big)^T$
$+ \Big( I_d - P_{K+1|K}\,H^T(K+1)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1} H(K+1) \Big)\,P_{K+1|K}$

However, we have seen that:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\big( \mathrm{Cov}\,I_{K+1} \big)^{-1}$

So the first term of the second member of the expression is zero and the sought relationship is finally:

$P_{K+1|K+1} = \big( I_d - K(K+1)\,H(K+1) \big)\,P_{K+1|K}$

This "updating" of the covariance matrix by iteration is another equation of the Kalman filter.
Another approach to calculate this minimum [RAD 84]. We notice that the penultimate expression of $P_{K+1|K+1}$ can be put in the form:

$P_{K+1|K+1} = \Big( K(K+1) - P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1) \Big)\,J(K+1)\,\Big( K(K+1) - P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1) \Big)^T$
$+ \Big( I_d - P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1)\,H(K+1) \Big)\,P_{K+1|K}$

with:

$J(K+1) = H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1) = \mathrm{Cov}\,I_{K+1}$

Only the first term of $P_{K+1|K+1}$ depends on $K(K+1)$, and it is of the form $M J M^T$, symmetric, with $J$ positive. So this term has a positive or zero trace and:

$P_{K+1|K+1} = M J M^T + \Big( I_d - P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1)\,H(K+1) \Big)\,P_{K+1|K}$

The minimum of the trace will thus be reached when $M$ is zero, thus:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\,J^{-1}(K+1)$

where:

$K(K+1) = P_{K+1|K}\,H^T(K+1)\Big( H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1) \Big)^{-1}$

a result which we have already obtained! In these conditions, when:

$P_{K+1|K+1} = \big( I_d - K(K+1)\,H(K+1) \big)\,P_{K+1|K}$
we obtain the minimum of $\mathrm{tr}\,P_{K+1|K+1}$.

It is important to note that $K$, the Kalman gain, and $P_{K|K}$, the covariance matrix of the estimation error, are independent of the quantities $Y_K$. We can also write the best "prediction", i.e. $\hat{X}_{K+1|K}$, according to the preceding prediction. Thus:

$\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K-1} + A(K)\,K(K)\big( Y_K - H(K)\,\hat{X}_{K|K-1} \big)$

As for the "best" filtering, the best prediction is written according to the preceding predicted estimate, corrected by the innovation brought along by the measurement $Y_K$ and weighted by the gain. This Kalman equation is used not in filtering but in prediction. We must now establish a relationship on the evolution of the covariance matrix of the estimation errors.

7.3.6. Riccati's equation
Let us write an evolution relationship between the covariance matrix of the filtering error and the covariance matrix of the prediction error:

$P_{K|K-1} = E\big( \tilde{X}_{K|K-1}\,\tilde{X}_{K|K-1}^T \big)$

or, by incrementation:

$P_{K+1|K} = E\big( \tilde{X}_{K+1|K}\,\tilde{X}_{K+1|K}^T \big)$

with:

$\tilde{X}_{K+1|K} = X_{K+1} - \hat{X}_{K+1|K}$
Furthermore, we know that:

$\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K-1} + A(K)\,K(K)\,I_K$

giving the prediction at instant $K+1$, and:

$X_{K+1} = A(K)\,X_K + C(K)\,N_K$

just as:

$I_K = Y_K - H(K)\,\hat{X}_{K|K-1}$

The combination of these expressions gives us:

$\tilde{X}_{K+1|K} = A(K)\big( X_K - \hat{X}_{K|K-1} \big) - A(K)\,K(K)\big( Y_K - H(K)\,\hat{X}_{K|K-1} \big) + C(K)\,N_K$

but $Y_K = H(K)\,X_K + G(K)\,W_K$, thus:

$\tilde{X}_{K+1|K} = A(K)\big( X_K - \hat{X}_{K|K-1} \big) - A(K)\,K(K)\,H(K)\big( X_K - \hat{X}_{K|K-1} \big) - A(K)\,K(K)\,G(K)\,W_K + C(K)\,N_K$

$\tilde{X}_{K+1|K} = \big( A(K) - A(K)\,K(K)\,H(K) \big)\,\tilde{X}_{K|K-1} - A(K)\,K(K)\,G(K)\,W_K + C(K)\,N_K$

We can now write $P_{K+1|K}$ by observing that:

$\tilde{X}_{K|K-1} \perp N_K$ and $\tilde{X}_{K|K-1} \perp W_K$

NOTE.– Please note that $\tilde{X}_{K+1|K}$ is not orthogonal to $W_K$.
Thus:

$P_{K+1|K} = \big( A(K) - A(K)\,K(K)\,H(K) \big)\,P_{K|K-1}\,\big( A(K) - A(K)\,K(K)\,H(K) \big)^T + C(K)\,Q_K\,C^T(K) + A(K)\,K(K)\,G(K)\,R_K\,G^T(K)\,K^T(K)\,A^T(K)$

This expression of the covariance matrix of the prediction error can be put in the form:

$P_{K+1|K} = A(K)\,P_{K|K}\,A^T(K) + C(K)\,Q_K\,C^T(K)$

This equality, independent of $Y_K$, is called Riccati's equation, with:

$P_{K|K} = \big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}$

which represents the covariance matrix of the filtering error, equally independent of $Y_K$. See Appendix A for details of the calculation.

7.3.7. Algorithm and summary
The algorithm presents itself in the following form, with the initial conditions:
$P_{0|0}$ and $\hat{X}_{0|0}$ given, as well as the matrices $A(K)$, $Q_K$, $H(K)$, $R_K$, $C(K)$ and $G(K)$.

1) Calculation phase independent of $Y_K$. Effectively, starting from the initial conditions, we perceive that the recursion which acts on the gain $K(K+1)$ and on the covariance matrices of the prediction and filtering errors, $P_{K+1|K}$ and $P_{K+1|K+1}$, does not require knowledge of the observation process. Thus, the calculation of these matrices can be done without knowledge of the measurements. As for the measurements, they come into play in the calculation of the innovation and in that of the filtering or of the prediction.
$P_{K+1|K} = A(K)\,P_{K|K}\,A^T(K) + C(K)\,Q_K\,C^T(K)$

$K(K+1) = P_{K+1|K}\,H^T(K+1)\Big( H(K+1)\,P_{K+1|K}\,H^T(K+1) + G(K+1)\,R_{K+1}\,G^T(K+1) \Big)^{-1}$

$P_{K+1|K+1} = \big( I_d - K(K+1)\,H(K+1) \big)\,P_{K+1|K}$

$\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K}$

or $K(K+1) = P_{K+1|K+1}\,H^T(K+1)\Big( G(K+1)\,R_{K+1}\,G^T(K+1) \Big)^{-1}$ if $G(K+1)\,R_{K+1}\,G^T(K+1)$ is invertible.

2) Calculation phase taking into account the results $y_K$ of the process $Y_K$:
$I_{K+1} = Y_{K+1} - H(K+1)\,\hat{X}_{K+1|K}$

$\hat{X}_{K+1|K+1} = \hat{X}_{K+1|K} + K(K+1)\,I_{K+1}$

It is by using a new measurement that the calculated innovation, weighted by the gain at the same instant, allows us to know the best filtering.
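The two phases can be sketched in the scalar case ($A = H = C = G = 1$), here used to estimate a constant drowned in unit-variance noise, the same situation as the first Matlab example at the end of the chapter (function and variable names are illustrative):

```python
import random

def kalman_scalar(measurements, q=0.0, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter with A = H = C = G = 1, following the two
    calculation phases above."""
    x, p = x0, p0
    for y in measurements:
        # phase 1: independent of the measurements
        p_pred = p + q                 # P(K+1|K) = A P(K|K) A' + C Q C'
        k = p_pred / (p_pred + r)      # gain K(K+1)
        p = (1.0 - k) * p_pred         # P(K+1|K+1)
        x_pred = x                     # Xest(K+1|K) = A Xest(K|K)
        # phase 2: innovation, then best filtering
        innov = y - x_pred             # I(K+1)
        x = x_pred + k * innov         # Xest(K+1|K+1)
    return x, p

random.seed(0)
constant = 3.7
ys = [constant + random.gauss(0.0, 1.0) for _ in range(500)]
x_hat, p = kalman_scalar(ys)
print(x_hat, p)  # x_hat close to 3.7; with q = 0, p = 1/501 after 500 steps
```

With $q = 0$ the error variance obeys $p \mapsto p/(p+1)$, so after $n$ measurements $p = 1/(n+1)$: the filter here simply computes a running average.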
[Block diagram of the filter: the measurement $Y_{K+1}$ enters a summing node together with $-H(K+1)\,\hat{X}_{K+1|K}$; the resulting innovation is weighted by the gain $K(K+1)$ and added to the prediction $\hat{X}_{K+1|K}$ to produce $\hat{X}_{K+1|K+1}$; the filtered estimate $\hat{X}_{K|K}$ is fed back through $A(K)$ to form the next prediction.]

Figure 7.2. Schema of the principle of the Kalman filter
Important additional information may be obtained in [HAY 91].

NOTE.– If we had conceived a Kalman predictor, we would have obtained the expression of the prediction seen at the end of section 7.3.5:

$\hat{X}_{K+1|K} = A(K)\,\hat{X}_{K|K-1} + A(K)\,K(K)\underbrace{\big( Y_K - H(K)\,\hat{X}_{K|K-1} \big)}_{I_K}$

NOTE.– When the state and measurement equations are no longer linear, a similar solution exists and can be found in other works. The filter then takes the name of the extended Kalman filter.

7.4. Exercises for Chapter 7

Exercise 7.1.
Given the state equation:

$X_{K+1} = A\,X_K + N_K$

where the state matrix $A$ is the identity matrix of dimension 2 and $N_K$ is the system noise, whose covariance matrix is written $Q = \sigma^2 I_d$ ($I_d$: identity matrix).

The system is observed by the scalar equation:

$Y_K = X_K^1 + X_K^2 + W_K$

where $X_K^1$ and $X_K^2$ are the components of the vector $X_K$ and where $W_K$ is the measurement noise, of variance $R = \sigma_1^2$.

$P_{0|0} = I_d$ and $\hat{X}_{0|0} = 0$ are the initial conditions.

1) Give the expression of the Kalman gain $K(1)$ at instant 1 according to $\sigma^2$ and $\sigma_1^2$.

2) Give the estimate $\hat{X}_{1|1}$ of $X_1$ at instant 1 according to $K(1)$ and the first measurement $Y_1$.
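Before looking at the printed solution, the first Kalman step for this system can be worked out numerically; in the sketch below the values $\sigma^2 = 0.5$ and $\sigma_1^2 = 0.25$ are arbitrary test values:

```python
def first_gain(sigma2, sigma12):
    """K(1) for this system: P(1|0) = P(0|0) + Q = (1 + sigma2) Id,
    H = (1 1), so Cov I1 = 2*(1 + sigma2) + sigma12 and
    K(1) = P(1|0) H' (Cov I1)^(-1)."""
    p = 1.0 + sigma2
    cov_i = 2.0 * p + sigma12
    return (p / cov_i, p / cov_i)

g = first_gain(0.5, 0.25)
print(g)  # (1 + sigma2)/(2 + 2*sigma2 + sigma12) times (1, 1)'
```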
Solution 7.1.

1) $K(1) = \dfrac{1+\sigma^2}{2+2\sigma^2+\sigma_1^2}\begin{pmatrix} 1 \\ 1 \end{pmatrix}$

2) $\hat{X}_{1|1} = K(1)\,Y_1$

Exercise 7.2.
We are considering the movement of a particle. $x_1(t)$ represents the position of the particle and $x_2(t)$ its speed:

$x_1(t) = \int_0^t x_2(\tau)\,d\tau + x_1(0)$

By deriving this expression and by noting that, with a unit time step:

$x_2(t) = \dfrac{dx_1(t)}{dt} \approx x_1(K+1) - x_1(K)$

we assume that the speed can be represented by:

$X_K^2 = X_{K-1}^2 + N_{K-1}$

where $N_K$ is a Gaussian stationary noise which is centered and of variance 1. The position is measured by $y_K$, result of the process $Y_K$. This measurement adds a Gaussian stationary noise, which is centered and of variance 1:

$Y(K) = H(K)\,X(K) + W_K$

We assume that $R_K$, the covariance matrix (of dimension 1) of the measurement noise, is equal to 1.
1) Give the matrices $A$, $Q$ (covariance matrix of the system noise) and $H$.

2) Taking as initial conditions $\hat{X}_0 = \hat{X}_{0|0} = 0$ and $P_{0|0} = I_d$ (identity matrix), give $\hat{x}_{1|1}$, the first estimation of the state vector.

Solution 7.2.

1) $A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$; $Q = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$; $H = (1 \quad 0)$

2) $\hat{x}_{1|1} = \begin{pmatrix} 2/3 \\ 1/3 \end{pmatrix} y_1$
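This result can be checked step by step with the 2x2 arithmetic written out by hand (a verification sketch in plain Python):

```python
# P(1|0) = A P(0|0) A' + Q with A = [[1,1],[0,1]], P(0|0) = Id, Q = [[0,0],[0,1]]
A = [[1.0, 1.0], [0.0, 1.0]]
AAt = [[A[0][0]**2 + A[0][1]**2, A[0][0]*A[1][0] + A[0][1]*A[1][1]],
       [A[1][0]*A[0][0] + A[1][1]*A[0][1], A[1][0]**2 + A[1][1]**2]]
Pp = [[AAt[0][0], AAt[0][1]], [AAt[1][0], AAt[1][1] + 1.0]]  # add Q

# H = (1 0), R = 1: Cov I1 = H Pp H' + R = Pp[0][0] + 1
cov_i = Pp[0][0] + 1.0
K = (Pp[0][0] / cov_i, Pp[1][0] / cov_i)  # gain Pp H' (Cov I1)^(-1)

print(K)  # (2/3, 1/3): x_hat(1|1) = K y1
```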
Exercise 7.3. [RAD 84]
We want to estimate two target positions using one measurement. These positions $X_K^1$ and $X_K^2$ form the state vector:

$X_K = \big( X_K^1 \quad X_K^2 \big)^T$

The process noise is zero. The measurement of the process $Y$, carried out on the sum of the positions, is affected by a noise $W$ of mean value zero and of variance $R$:

$Y_K = X_K^1 + X_K^2 + W_K$

In order to simplify the calculation, we will place ourselves in the case of an immobile target:

$X_{K+1} = X_K = X$

The initial conditions are:

– $P_{0|0} = \mathrm{Cov}(X, X) = I_d$ (identity matrix)
– $R = 0.1$

– $y = 2.9$ (measurement) and $\hat{X}_{0|0} = (0 \quad 0)^T$

1) Give the state matrix $A$ and the observation matrix $H$.

2) Give the Kalman gain $K$.

3) Give the covariance matrix of the estimation error.

4) Give the estimation, in the sense of the minimum in $L^2$, of the state vector $X_K$.

5) If $x = x_K = (1 \quad 2)^T$, give the estimation error $\tilde{x} = \tilde{x}_{K|K} = x_K - \hat{x}_{K|K}$.

6) Compare the variances of the estimation errors of $X_K^1$ and $X_K^2$ and conclude.

Solution 7.3.
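The numerical answers below can be reproduced in a few lines (a sketch of the single Kalman step, with the data $R = 0.1$, $y = 2.9$ of the statement):

```python
R, y = 0.1, 2.9
cov_i = 1.0 + 1.0 + R              # H P(1|0) H' + R with H = (1 1), P(1|0) = Id
K = (1.0 / cov_i, 1.0 / cov_i)     # gain (1/2.1, 1/2.1)'
x_hat = (K[0] * y, K[1] * y)       # estimate (2.9/2.1, 2.9/2.1)'
x = (1.0, 2.0)
err = (x[0] - x_hat[0], x[1] - x_hat[1])
var_err = 1.0 - K[0]               # diagonal of P(1|1) = (Id - K H) Id

print(x_hat, err, var_err)
```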
1) $A = I_d$; $H = (1 \quad 1)$

2) $K = \big( 1/2.1 \quad 1/2.1 \big)^T$

3) $P_{1|1} = \dfrac{1}{2.1}\begin{pmatrix} 1.1 & -1 \\ -1 & 1.1 \end{pmatrix}$

4) $\hat{x}_{1|1} = \big( 2.9/2.1 \quad 2.9/2.1 \big)^T$

5) $\tilde{x}_K = \big( \tilde{x}_K^1 \quad \tilde{x}_K^2 \big)^T = \big( -0.38 \quad 0.62 \big)^T$

6) $\mathrm{var}\,\tilde{X}_K^1 = \mathrm{var}\,\tilde{X}_K^2 = 0.52$

Exercise 7.4.
Given the state equation of dimension “1” (the state process is a scalar process):
$X_{K+1} = X_K$
The state is observed by two measurements:

$Y_K = \begin{pmatrix} Y_K^1 \\ Y_K^2 \end{pmatrix}$, affected by the noise $W_K = \begin{pmatrix} W_K^1 \\ W_K^2 \end{pmatrix}$

The measurement noise is characterized by its covariance matrix:

$R_K = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}$

The initial conditions are:

$P_{0|0} = 1$ (covariance of the estimation error at instant 0) and $\hat{X}_{0|0} = 0$ (estimate of $X$ at instant 0).

Let us state $D = \sigma_1^2 + \sigma_2^2 + \sigma_1^2\,\sigma_2^2$.

1) Give the expression of $K(1)$, the Kalman gain at instant 1, according to $\sigma_1$, $\sigma_2$ and $D$.

2) Give the estimate $\hat{X}_{1|1}$ of $X_1$ at instant 1 according to the measurements $Y_1^1$, $Y_1^2$ and $\sigma_1$, $\sigma_2$ and $D$.

3) By stating $\sigma^2 = \dfrac{\sigma_1^2\,\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$, give $P_{1|1}$, the covariance of the filtering error at instant 1, according to $\sigma$.
Solution 7.4.

1) $K(1) = \begin{pmatrix} \dfrac{\sigma_2^2}{D} & \dfrac{\sigma_1^2}{D} \end{pmatrix}$

2) $\hat{X}_{1|1} = \big( \sigma_2^2\,Y_1^1 + \sigma_1^2\,Y_1^2 \big)/D$

3) $P_{1|1} = \dfrac{\sigma^2}{1+\sigma^2}$
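A quick check: starting from $P_{1|0} = 1$, the innovation covariance is the 2x2 matrix $\begin{pmatrix} 1+\sigma_1^2 & 1 \\ 1 & 1+\sigma_2^2 \end{pmatrix}$, whose determinant is exactly $D$; inverting it by hand reproduces the printed answers (the variances 0.5 and 2.0 are arbitrary test values):

```python
s1, s2 = 0.5, 2.0                       # test values for sigma1^2, sigma2^2
D = s1 + s2 + s1 * s2

# Cov I1 = H H' + R = [[1+s1, 1], [1, 1+s2]]; det = (1+s1)(1+s2) - 1 = D
det = (1.0 + s1) * (1.0 + s2) - 1.0
# K(1) = (1 1) (Cov I1)^(-1), worked out with the 2x2 inverse
K = (((1.0 + s2) - 1.0) / det, ((1.0 + s1) - 1.0) / det)

sigma2 = s1 * s2 / (s1 + s2)
P11 = 1.0 - (K[0] + K[1])               # (1 - K H) P(1|0)

print(K, P11)  # K = (s2/D, s1/D); P11 = sigma^2/(1 + sigma^2)
```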
Exercise 7.5.
The fixed distance $r$ of an object is evaluated by two radar measurements of different qualities. The first measurement gives the result:

$y_1 = r + n_1$, measurement of the process $Y = X + N_1$

where we know that the noise $N_1$ is such that $E(N_1) = 0$ and $\mathrm{var}(N_1) = \sigma_1^2 = 10^{-2}$. The second measurement gives:

$y_2 = r + n_2$, measurement of the process $Y = X + N_2$

with $E(N_2) = 0$ and $\mathrm{var}(N_2) = w$ (scalar). The noises $N_1$ and $N_2$ are independent.

1) Give the estimate $\hat{r}_1$ of $r$ that we obtain from the first measurement.

2) Refine this estimate by using the second measurement. We will call $\hat{r}_2$ this new estimate, which we will express according to $w$.

3) Draw the graph $\hat{r}_2(w)$ and justify its appearance.
Solution 7.5.

1) $\hat{r}_1 = \hat{x}_{1|1} = y_1$

2) $\hat{r}_2 = \hat{x}_{2|2} = y_1 + \dfrac{\sigma_1^2}{\sigma_1^2 + w}\,( y_2 - y_1 ) = \dfrac{100\,w\,y_1 + y_2}{100\,w + 1}$
3) See Figure 7.3.

Figure 7.3. Line graph of the evolution of the estimate according to the power of the noise $w$, parameterized by the magnitude of the measurements
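The limiting behavior that the graph illustrates can be checked numerically: as $w \to 0$ the fused estimate trusts the second (noiseless) measurement, and as $w \to \infty$ it falls back on the first. A sketch with illustrative measurement values:

```python
def r_hat2(y1, y2, w, s1=1e-2):
    """Fused estimate: r2 = y1 + s1/(s1 + w) * (y2 - y1)."""
    return y1 + s1 / (s1 + w) * (y2 - y1)

y1, y2 = 2.8, 3.0   # illustrative measurements
print(r_hat2(y1, y2, 1e-9))  # w tiny: close to y2
print(r_hat2(y1, y2, 1e6))   # w huge: close to y1
```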
Appendix A. Resolution of Riccati's equation

Let us show that:

$P_{K+1|K} = A(K)\,P_{K|K}\,A^T(K) + C(K)\,Q_K\,C^T(K)$

Let us take again the developed expression of the covariance matrix of the prediction error of section 7.3.6:

$P_{K+1|K} = A(K)\big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}\,\big( A(K) - A(K)\,K(K)\,H(K) \big)^T + C(K)\,Q_K\,C^T(K) + A(K)\,K(K)\,G(K)\,R_K\,G^T(K)\,K^T(K)\,A^T(K)$
with:

$K(K) = P_{K|K-1}\,H^T(K)\big( \mathrm{Cov}\,I_K \big)^{-1}$

and:

$\mathrm{Cov}\,I_K = H(K)\,P_{K|K-1}\,H^T(K) + G(K)\,R_K\,G^T(K)$

By replacing $K(K)$ and $\mathrm{Cov}\,I_K$ by their expressions in the recursive writing of $P_{K+1|K}$, we are going to be able to simplify the expression of the covariance matrix of the prediction error. To lighten the expressions, we are going to eliminate the index $K$ when there is no ambiguity, by noting $P_1 = P_{K+1|K}$, $P_0 = P_{K|K-1}$ and $I = I_K$:

$P_1 = A\big( I_d - KH \big) P_0 \big( A - AKH \big)^T + C\,Q\,C^T + A\,K\,G\,R\,G^T K^T A^T$

$K = P_0\,H^T \big( \mathrm{Cov}\,I \big)^{-1}$

$\mathrm{Cov}\,I = H\,P_0\,H^T + G\,R\,G^T$

Thus:

$G\,R\,G^T = \mathrm{Cov}\,I - H\,P_0\,H^T$

$K\,G\,R\,G^T K^T = P_0 H^T \big( \mathrm{Cov}\,I \big)^{-1}\big( \mathrm{Cov}\,I - H P_0 H^T \big)\big( \mathrm{Cov}\,I \big)^{-T} H P_0^T$
$= \big( P_0 H^T - P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 H^T \big)\big( \mathrm{Cov}\,I \big)^{-T} H P_0^T$
$= P_0 H^T \big( \mathrm{Cov}\,I \big)^{-T} H P_0^T - P_0 H^T \big( \mathrm{Cov}\,I \big)^{-1} H P_0 H^T \big( \mathrm{Cov}\,I \big)^{-T} H P_0^T$

Expanding $P_1$:

$P_1 = A P_0 A^T - A K H P_0 A^T - A P_0 H^T K^T A^T + A K H P_0 H^T K^T A^T + C Q C^T + A\Big( P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T - P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T \Big) A^T$
i.e., replacing $K$ by its expression:

$P_1 = A P_0 A^T - \underbrace{A P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 A^T}_{A K H P_0 A^T} - A P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T A^T + A P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T A^T + C Q C^T$
$+ A\Big( P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T - P_0 H^T ( \mathrm{Cov}\,I )^{-1} H P_0 H^T ( \mathrm{Cov}\,I )^{-T} H P_0^T \Big) A^T$

The 3rd and 6th terms cancel each other out, and the 4th and 7th terms also cancel each other out, which leaves:

$P_1 = A P_0 A^T - A K H P_0 A^T + C Q C^T$

or:

$P_1 = A\big( I_d - K H \big) P_0\,A^T + C Q C^T$

that is:

$P_{K+1|K} = A(K)\underbrace{\big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}}_{P_{K|K}}\,A^T(K) + C(K)\,Q_K\,C^T(K)$

Thus:

$P_{K+1|K} = A(K)\,P_{K|K}\,A^T(K) + C(K)\,Q_K\,C^T(K)$ = covariance matrix of the prediction error

with:
$P_{K|K} = \big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}$ = covariance matrix of the filtering error.

This result will be demonstrated in Appendix B.

NOTE.– As mentioned in section 7.3.7, knowing the initial conditions and the Kalman gain, the updating of the covariance matrices $P_{K|K-1}$ and $P_{K|K}$ can be made in an iterative manner.
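The iteration described in this note can be sketched in the scalar case ($H = C = G = 1$): no measurement enters the loop, and $P_{K+1|K}$ converges to the fixed point of Riccati's equation (the values of $a$, $q$ and $r$ below are arbitrary):

```python
def riccati_iterate(a, q, r, p0, n):
    """Scalar Riccati recursion: P(K+1|K) = a^2 (1 - K(K)) P(K|K-1) + q,
    with gain K(K) = P(K|K-1)/(P(K|K-1) + r); independent of the measurements."""
    p = p0
    for _ in range(n):
        k = p / (p + r)
        p = a * a * (1.0 - k) * p + q
    return p

p_inf = riccati_iterate(a=0.9, q=0.1, r=1.0, p0=1.0, n=200)
print(p_inf)  # satisfies p = a^2 * p * r/(p + r) + q at the fixed point
```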
Appendix B
We are going to arrive at this result starting from the definition of $P_{K|K}$ and by using the expression of the function $K$ already obtained.

NOTE.– Differently from the calculation developed in section 7.3.6, we will not show that the $\mathrm{tr}\,P_{K|K}$ obtained is minimal.

Another way of showing the following result:

$P_{K|K} = E\big( \tilde{X}_{K|K}\,\tilde{X}_{K|K}^T \big) = P_{K|K-1} - K(K)\,H(K)\,P_{K|K-1} = \big( I_d - K(K)\,H(K) \big)\,P_{K|K-1}$

Demonstration
Starting from the definition of the covariance matrix of the filtering error, i.e.:

$P_{K|K} = E\big( \tilde{X}_{K|K}\,\tilde{X}_{K|K}^T \big)$

it becomes, with $\tilde{X}_{K|K} = X_K - \hat{X}_{K|K}$ and $\hat{X}_{K|K} = \hat{X}_{K|K-1} + K(K)\,I_K$:

$\tilde{X}_{K|K} = \underbrace{X_K - \hat{X}_{K|K-1}}_{\tilde{X}_{K|K-1}} - K(K)\,I_K$

Let us now use these results to calculate $P_{K|K}$:

$P_{K|K} = P_{K|K-1} - K(K)\,E\big( I_K\,\tilde{X}_{K|K-1}^T \big) - E\big( \tilde{X}_{K|K-1}\,I_K^T \big)\,K^T(K) + K(K)\,E\big( I_K I_K^T \big)\,K^T(K)$
We observe that:

$E\big( \tilde{X}_{K|K-1}\,I_K^T \big) = E\big( ( X_K - \hat{X}_{K|K-1} )\,I_K^T \big)$

but $I_K \perp I_j$ and $I_K \perp Y_j$ for $j \in [1, K-1]$, thus $\hat{X}_{K|K-1} \perp I_K$. Given:

$E\big( \tilde{X}_{K|K-1}\,I_K^T \big) = E\big( X_K\,I_K^T \big) = E\big( A^{-1}(K)\,( X_{K+1} - C(K)\,N_K )\,I_K^T \big)$

thus:

$E\big( X_K\,I_K^T \big) = E\big( A^{-1}(K)\,X_{K+1}\,I_K^T \big)$

for $E(N_K) = 0$. However, we have seen elsewhere that:

$E\big( X_{K+1}\,I_K^T \big) = E\Big( \big( A(K)\,X_K + C(K)\,N_K \big)\big( H(K)\,\tilde{X}_{K|K-1} + G(K)\,W_K \big)^T \Big) = E\big( A(K)\,X_K\,\tilde{X}_{K|K-1}^T\,H^T(K) \big)$

as $N_K \perp W_K$ and $N_K \perp \tilde{X}_{K|K-1} = X_K - \hat{X}_{K|K-1}$.

Furthermore, for $\hat{X}_{K|K-1} \perp \tilde{X}_{K|K-1}$:

$E\big( X_K\,\tilde{X}_{K|K-1}^T \big) = E\Big( \big( \hat{X}_{K|K-1} + \tilde{X}_{K|K-1} \big)\,\tilde{X}_{K|K-1}^T \Big) = P_{K|K-1}$
Thus it becomes:

$E\big( \tilde{X}_{K|K-1}\,I_K^T \big) = P_{K|K-1}\,H^T(K)$

thus:

$P_{K|K} = P_{K|K-1} - K(K)\,H(K)\,P_{K|K-1}^T - P_{K|K-1}\,H^T(K)\,K^T(K) + K(K)\big( \mathrm{Cov}\,I_K \big)\,K^T(K)$

with $K(K) = P_{K|K-1}\,H^T(K)\big( \mathrm{Cov}\,I_K \big)^{-1}$; after simplification, and noting that $P_{K|K-1} = P_{K|K-1}^T$ (symmetric, or Hermitian matrix if the elements are complex):

$P_{K|K} = P_{K|K-1} - K(K)\,H(K)\,P_{K|K-1}$

or:

$P_{K|K} = \big[ I_d - K(K)\,H(K) \big]\,P_{K|K-1}$

QED

Examples treated using Matlab software

First example of Kalman filtering
The objective is to estimate an unknown constant drowned in noise. This constant is measured using a noisy sensor. The noise is centered, Gaussian and of variance equal to 1. The initial conditions are equal to 0 for the estimate and equal to 1 for the variance of the estimation error.
clear
t=0:500;
R0=1;
constant=rand(1);
n1=randn(size(t));
y=constant+n1;
subplot(2,2,1)
%plot(t,y(1,:));
plot(t,y,'k'); % in B&W
grid
title('sensor')
xlabel('time')
axis([0 500 -max(y(1,:)) max(y(1,:))])
R=R0*std(n1)^2; % variance of the measurement noise
P(1)=1; % initial condition on the variance of the estimation error
x(1)=0;
for i=2:length(t)
  K=P(i-1)*inv(P(i-1)+R);
  x(i)=x(i-1)+K*(y(:,i)-x(i-1));
  P(i)=P(i-1)-K*P(i-1);
end
err=constant-x;
subplot(2,2,2)
plot(t,err,'k');
grid
title('error');
xlabel('time')
axis([0 500 -max(err) max(err)])
subplot(2,2,3)
plot(t,x,'k',t,constant*ones(size(t)),'k'); % constant extended to a vector for plotting
title('x estimated')
xlabel('time')
axis([0 500 0 max(x)])
grid
subplot(2,2,4)
plot(t,P,'k'); % in B&W
grid, axis([0 100 0 max(P)])
title('variance of estimation error')
xlabel('time')
Figure 7.3. Line graph of measurement, error, best filtering and variance of error
Second example of Kalman filtering

The objective of this example is to extract a damped sine curve from the noise. The state vector is a two-component column vector:

X1=10*exp(-a*t).*cos(w*t)
X2=10*exp(-a*t).*sin(w*t)

The system noise is centered, Gaussian and of variances var(u1) and var(u2).
The measurement noise is centered, Gaussian and of variances var(v1) and var(v2).

Initial conditions: the components of the state vector are zero at the origin and the covariance of the estimation error is initialized at 10 times the identity matrix.

Note: the proposed program is not the shortest and most rapid in the sense of CPU time; it is detailed to allow a better understanding.

clear
%simulation
a=0.05;
w=1/2*pi;
Te=0.005;
Tf=30;
Ak=exp(-a*Te)*[cos(w*Te) -sin(w*Te);sin(w*Te) cos(w*Te)]; %state matrix
Hk=eye(2); % observation matrix
t=0:Te:Tf;
%X1
X1=10*exp(-a*t).*cos(w*t);
%X2
X2=10*exp(-a*t).*sin(w*t);
Xk=[X1;X2]; % state vector
% measurement noise
sigmav1=100;
sigmav2=10;
v1=sigmav1*randn(size(t));
v2=sigmav2*randn(size(t));
Vk=[v1;v2];
Yk=Hk*Xk+Vk; % measurement vector
% covariance matrix of the measurement noise
Rk=[var(v1) 0;0 var(v2)];
%initialization
sigmau1=0.1; % process noise
sigmau2=0.1; %idem
u1=sigmau1*randn(size(t));
u2=sigmau2*randn(size(t));
%Uk=[sigmau1*randn(size(X1));sigmau2*randn(size(X2))];
Uk=[u1;u2];
Xk=Xk+Uk;
sigq=.01;
Q=sigq*[var(u1) 0;0 var(u2)];
sigp=10;
P=sigp*eye(2); % covariance matrix of estimation error P(0,0)
% line graph
subplot(2,3,1)
%plot(t,X1,t,X2);
plot(t,X1,'k',t,X2,'k') % in B&W
axis([0 Tf -max(abs(Xk(1,:))) max(abs(Xk(1,:)))])
title('state vect. x1&x2')
subplot(2,3,2)
%plot(t,Vk(1,:),t,Vk(2,:),'r')
plot(t,Vk(1,:),t,Vk(2,:)); % in B&W
axis([0 Tf -max(abs(Vk(1,:))) max(abs(Vk(1,:)))])
title('meas. noise w1&w2')
subplot(2,3,3)
%plot(t,Yk(1,:),t,Yk(2,:),'r');
plot(t,Yk(1,:),t,Yk(2,:)); % in B&W
axis([0 Tf -max(abs(Yk(1,:))) max(abs(Yk(1,:)))])
title('observ. proc. y1&y2')
Xf=[0;0];
%%estimation and prediction by Kalman
for k=1:length(t);
  %%prediction
  Xp=Ak*Xf; % Xp=Xest(k+1,k) and Xf=Xest(k,k)
  Pp=Ak*P*Ak'+Q; % Pp=P(k+1,k) and P=P(k)
  Gk=Pp*Hk'*inv(Hk*Pp*Hk'+Rk); % Gk=Gk(k+1)
  Ik=Yk(:,k)-Hk*Xp; % Ik=I(k+1)=innovation
  % best filtering
  Xf=Xp+Gk*Ik; % Xf=Xest(k+1,k+1)
  P=(eye(2)-Gk*Hk)*Pp; % P=P(k+1)
  X(:,k)=Xf;
  P1(:,k)=P(:,1); %1st column of P
  P2(:,k)=P(:,2); %2nd column of P
end
err1=X1-X(1,:);
err2=X2-X(2,:);
%% line graph
subplot(2,3,4)
%plot(t,X(1,:),t,X(2,:),'r')
plot(t,X(1,:),'k',t,X(2,:),'k') % in B&W
axis([0*Tf Tf -max(abs(X(1,:))) max(abs(X(1,:)))])
title('filtered x1&x2')
subplot(2,3,5)
%plot(t,err1,t,err2)
plot(t,err1,'k',t,err2,'k') % in B&W
axis([0 Tf -max(abs(err1)) max(abs(err1))])
title('errors')
subplot(2,3,6)
%plot(t,P1(1,:),'r',t,P2(2,:),'b',t,P1(2,:),'g',t,P2(1,:),'y')
plot(t,P1(1,:),'k',t,P2(2,:),'k',t,P1(2,:),t,P2(1,:),'b')
axis([0 Tf/10 0 max(P1(1,:))])
title('covar. matrix filter. error') % p11, p22, p21 and p12
Figure 7.4. Line graphs of noiseless signals, noise measurements, filtration, errors and variances
Table of Symbols and Notations

$\mathbb{N}, \mathbb{R}, \mathbb{C}$: numerical sets
$L^2$: space of square-summable functions
a.s.: almost surely
$E$: mathematical expectation
r.v.: random variable
r.r.v.: real random variable
$X_n \xrightarrow{a.s.} X$: convergence a.s. of the sequence $X_n$ to $X$
$\langle \cdot, \cdot \rangle_{L^2}$: scalar product in $L^2$
$\| \cdot \|_{L^2}$: norm in $L^2$
Var: variance
Cov: covariance
$\cdot \wedge \cdot$: $\min(\cdot, \cdot)$
$X \sim N(m, \sigma^2)$: normal law of mean $m$ and of variance $\sigma^2$
$A^T$: transposed matrix
$H_K^Y$: Hilbert space generated by the scalar or multivariate process $Y$
$\mathrm{Proj}_{H_K^Y}$: projection on the Hilbert space generated by $Y$ ($t \le K$)
$X_T$: stochastic process defined on $T$ (time describes $T$)
p.o.i.: process with orthogonal increments
p.o.s.i.: process with orthogonal and stationary increments
$\hat{X}_{K|K-1}$: prediction at instant $K$ knowing the measurements of the process $Y_K$ at instants 1 to $K-1$
$\tilde{X}_{K|K-1}$: prediction error
$\hat{X}_{K|K}$: filtering at instant $K$ knowing the measurements at instants 1 to $K$
$\tilde{X}_{K|K}$: filtering error
$\nabla_\lambda C$: gradient of the function $C(\lambda)$
$\{X \mid P\}$: the set of elements $X$ which verify the property $P$
$1_D$: indicator function of a set $D$
Bibliography

[BER 98] BERTEIN J.-C. and CESCHI R., Processus stochastiques et filtrage de Kalman, Hermès, 1998.
[BLA 06] BLANCHET G. and CHARBIT M., Digital Signal and Image Processing using MATLAB, ISTE, 2006.
[CHU 87] CHUI C.K. and CHEN G., Kalman Filtering, Springer-Verlag, 1987.
[GIM 82] GIMONET B., LABARRERE M. and KRIEF J.P., Le filtrage et ses applications, Cépaduès éditions, 1982.
[HAY 91] HAYKIN S., Adaptive Filter Theory, Prentice Hall, 1991.
[MAC 81] MACCHI O., "Le filtrage adaptatif en télécommunications", Annales des Télécommunications, 36, no. 11-12, 1981.
[MAC 95] MACCHI O., Adaptive Processing: The LMS Approach with Applications in Transmissions, John Wiley, New York, 1995.
[MET 72] METIVIER M., Notions fondamentales de la théorie des probabilités, Dunod, 1972.
[MOK 00] MOKHTARI M., MATLAB et Simulink pour étudiants et ingénieurs, Springer, 2000.
[RAD 84] RADIX J.-C., Filtrages et lissages statistiques optimaux linéaires, Cépaduès éditions, 1984.
[SHA 88] SHANMUGAN K.S. and BREIPOHL A.M., Random Signals, John Wiley & Sons, 1988.
[THE 92] THERRIEN C.W., Discrete Random Signals and Statistical Signal Processing, Prentice Hall, 1992.
[WID 85] WIDROW B. and STEARNS S.D., Adaptive Signal Processing, Prentice Hall, 1985.
Index

A, B
adaptive filtering 197
algebra 3
analytical 187
autocorrelation function 96
autoregressive process 128
Bienaymé-Tchebychev inequality 143
Borel algebra 3

C
cancellation 199
Cauchy sequence 158
characteristic functions 4
coefficients 182
colinear 213
convergence 218
convergent 219
correlation coefficients 41
cost function 204
covariance 40
covariance function 107
covariance matrix 258
covariance matrix of the innovation process 248
covariance matrix of the prediction error 249
cross-correlation 184

D
deconvolution 199
degenerate Gaussian 64
deterministic gradient 225
deterministic matrix 245
diffeomorphism 31
diphaser 209

E
eigenvalues 75, 215
eigenvectors 75
ergodicity 98
ergodicity of expectation 100
ergodicity of the autocorrelation function 100
expectation 67

F, G
filtering 143, 247
Fubini's theorem 162
Gaussian vectors 13
gradient algorithm 211

H, I
Hilbert spaces 145
Hilbert subspace 144
identification 199
IIR filter 186
  causal 187
  minimum phase 187
  orthogonal 192
impulse response 182
independence 13, 246
innovation 240
innovation process 172, 248

K, L
Kalman gain 254
least mean square 184
linear observation space 168
linear space 104
LMS algorithm 222
lowest least mean square error 185

M, N
marginals 9
Markov process 101
matrix of measurements 246
measure 5
measurement noise 238
measurement noise vector 246
minimum phase 187
multivariate 245, 250
multivariate processes 166
multivector 245

O
observations 245
orthogonal matrix 216
orthogonal projection 238

P
Paley-Wiener 187
prediction 143, 199, 247, 258
prediction error 249
predictor 200
pre-whitening 186
principal axes 215
probability distribution function 12
process noise vector 245
projection 183

Q, R
quadratic form 216
random variables 194
random vector 1, 3
random vector with a density function 8
regression plane 151
Riccati's equation 258, 260

S
Schwarz inequality 160
second order stationarity 96
second order stationary processes 199
singular 185
smoothing 143, 247
spectral density 106
stability 218
stable 219
state matrix 245
stationary processes 181
stochastic process 94
system noise 238

T
Toeplitz 211, 216
trace 257
trajectory 94
transfer function 121
transition equation 247
transition matrix 247

U-Z
unitary matrix Q 216
variance 39
white noise 109, 185
Wiener filter 181