Springer Monographs in Mathematics
Malempati M. Rao
Stochastic Processes – Inference Theory Second Edition
More information about this series at http://www.springer.com/series/3733
Malempati M. Rao Department of Mathematics University of California Riverside, CA, USA
ISSN 1439-7382    ISSN 2196-9922 (electronic)
ISBN 978-3-319-12171-0    ISBN 978-3-319-12172-7 (eBook)
DOI 10.1007/978-3-319-12172-7
Springer Cham Heidelberg New York Dordrecht London
Library of Congress Control Number: 2014953897
Mathematics Subject Classification (2010): 60Gxx, 60H05, 60H30, 60J25, 62J02, 62MXX
1st Edition: © Kluwer Academic Publishers 2000
2nd Edition: © Springer International Publishing Switzerland 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To Professors
Ulf Grenander and Tom S. Pitcher whose fundamental and deep contributions shaped stochastic inference
Preface to the Second Edition
This new edition of 'Stochastic Inference' preserves the treatment and format of the original edition, as detailed in the preface to the first edition, which is appended, but contains the following new material. After the few typographical corrections noted earlier, the main new addition, given in reasonable detail, is the topic of 'Regression analysis for processes'. Scanning the literature of this very popular and important subject, one finds that it was given a formal definition, perhaps for the first time, by H. Cramér, abstracting from several examples, in his well-known and original volume Mathematical Methods of Statistics (1946). It states that the regression of a random variable (or vector) Y on another such variable or vector X, when their joint probability distribution is known and Y has one moment, is the conditional expectation E(Y|X) = g_Y(X). It is somewhat surprising that the foundations of the subject for random processes (or random measures) have not been treated in the literature (as far as I could find in an extensive search), although 'Regression Analyses' for numerous problems were treated, mostly for Gaussian processes. However, the problem is deeper and is considered in this edition. To begin with the subject, one has to consider the difficulty (and even nonavailability) of a standard procedure for obtaining the g_Y(X) defined above. After noting the nontriviality of the computational problem, detailed in my earlier book, Conditional Measures and Applications (second edition, 2006), it is perhaps appropriate that I should discuss the foundations of this subject. Consequently Chapter VI of this volume is expanded to include this topic in some reasonable detail. The problem of linearity of g_Y(·), first discussed in this generality by M. Kanter, is also treated; he identified it as an aspect of the symmetric stable class (a key part of the infinitely divisible family), which thus includes the Gaussian case so important for many applications. An aspect of the problem was considered by E. Lukacs and R. G. Laha (1964) in characterizing some processes such as the Gamma, Poisson and the like, to which they devoted a modest chapter in their book. Most of the books on the subject assume that g_Y(X) is linear and obtain computational procedures for evaluating the constants in the model. In this new edition, I have expanded Chapter VI with three additional sections on Regression and some new applications. These are devoted to characterizing the linearity of the regression function g_Y(X), as
a part of the (symmetric) stable processes (with extended treatment by C. D. Hacken) as well as my extensions for random measures and integrals. Also specializations and applications of linear regression with (linear) constraints and related problems are discussed. The work as presented, along with the included Problems Section, shows their use in some key econometric applications, well treated by J. S. Chipman (2011), leading to further analysis of 'ridge regression' in the subject. After this, Chapters VII-IX retain the treatment as before, centering on filtering, prediction and nonparametric estimation, all in the context of processes. These include harmonizable classes. In all the cases, the subject is illustrated by examples from different processes. As noted before, I still have some computer problems, and I hope that these are not distracting to the reader. As noted in the earlier Preface, I had also been in correspondence with Tom Pitcher since his Pacific Journal paper (1965), which was extended considerably by Robert Rosenberg (1970) in his Carnegie-Mellon thesis. The correspondence continued even when I moved to UCR; by then he had moved to the U. of Hawaii, and we never met although our correspondence continued. I hoped to present him a copy of this book with a surprise dedication along with Grenander, but I learned that he had passed away by then. His continued encouraging correspondence and deep mathematical contributions are signified in my inclusive dedication of this volume. The new edition of this work is again done at UCR with departmental help, especially the technical computer assistance of James Marberry. I hope that the new edition with added material will further stimulate workers in the subject and help expand the area of Stochastic Inference.

Riverside, CA
August, 2014
M. M. Rao
Preface to the First Edition
The material accumulated and presented in this volume can be explained easily. At the start of my graduate studies in the early 1950s, I came across Grenander's (1950) thesis, and was much attracted to the entire subject considered there. I then began preparing in the necessary mathematics to appreciate and possibly make some contributions to the area. Thus after a decade of learning and some publications on the way, I wanted to write a modest monograph complementing Grenander's fundamental memoir. So I took a sabbatical leave from my teaching position at Carnegie-Mellon University, encouraged by an Air Force Grant for the purpose, followed by a couple of years' more learning opportunity at the Institute for Advanced Study, to complete the project. As I progressed, the plan grew larger, needing substantial background material which was made into an independent initial volume in 1979. In its preface I said: "My intention was to present the following material as the first part of a book treating the Inference Theory of stochastic processes, but the latter account has now receded to a distant future," namely for two more decades! Meanwhile, a much enlarged second edition of that early work has appeared (1995), and now I am able to present the main part of the original plan. In fact, while this effort took on the form of a life's project, in developing all the necessary backup material during the long gestation period I have written some seven books and directed several theses on related topics that helped me appreciate the main subject much better. It is now termed 'stochastic inference', as an abbreviation as well as a homage to Grenander's "Stochastic processes and statistical inference". Let me explain the method adopted in preparing this work. At the outset, it became clear that there can be no compromise with the mathematics of inference theory. One observes that, broadly speaking, inference has theoretical, practical, philosophical, and interpretative aspects. But these components are also present in other scientific studies. However, for inference theory all these parts are founded on sound mathematical principles, a violation of which leads to unintended controversies. Thus the primary concern here is with the mathematical ramifications of the subject, and the work is illustrated with a number of important examples, many of independent interest. It is noted that, as a basis of the classical statistical inference, two original sources are visible. The crucial idea on hypothesis testing is found in the formulation of the Neyman-Pearson lemma, which itself has a firm backing in the calculus of variations. All later developments
of the subject are extensions of this result. Similarly, the basic idea of estimating a parameter of a distribution is founded on Fisher's formulation of the maximum likelihood (ML). The classical inference theory (for finite samples) grew out of these two fundamental principles. But the subject of stochastic processes deals with infinite collections of random variables, and there is a real barrier to cross here in order to apply the classical ideas. The necessary new accomplishment is the formulation by Grenander, who showed that the (abstract) Radon-Nikodým derivative must replace the likelihood function of the finite sample case, leading the way to stochastic inference. Now the determination of the general likelihood ratio (or the RN-derivative) involves an intricate analysis for which one has to employ several different technical tools. This was successfully done by Grenander himself and it was advanced further by Pitcher. Thus for a proper understanding of the subject, a greater preparation and a concerted effort are needed, and to aid this, the previous concepts are restated at various places. I shall now indicate, with some outlines of the material [more detailed summaries are at the beginning of chapters], how my original plan is executed. The first three chapters contain the basic inference theory, as described above, from the point of view of adapting it to stochastic processes. Thus the work in Chapter I begins with the question of distinguishability (or 'identifiability') of a pair of probability measures or distributions, before a hypothesis testing question can be raised. It is better to know this fact in the theory since, as shown by example, there exist unequal distributions which cannot be distinguished. So conditions are presented for distinguishability. Then the inference problem and its decision theoretic setup, as a unifying formulation of both testing and estimation, are discussed. Chapter II is devoted to detailed analysis, applications, and extensions of the Neyman-Pearson theory, and its many connections with other parts of analysis such as control theory and vector integral (differential) calculus, as well as extensions to composite hypotheses leading to new questions. The classical Hölder, Jensen, and Liapounov inequalities are shown to be consequences of this extension of the NP-lemma. The Bayesian idea to reduce the composite hypotheses to a simple hypothesis and a simple alternative setup, as well as the hierarchical (or multistage) prior methodology, are shown to lead to the hypothesis testing problems on stochastic processes. The difficulties encountered in the composite case in the original testing theory are explained vividly by a complete solution (due to Linnik) of the Behrens-Fisher problem, and this is included to show how new
methods are required to answer difficult questions of Inference. The detailed mathematical ideas explained in this chapter give an appreciation of inference problems leading directly to stochastic processes. Chapter III concentrates on estimation of parameters. Here optimal estimation (and prediction) relative to quadratic and, more generally, convex (or only increasing) loss functions is explored. Bayes estimation and (non linear) prediction are shown to be closely related. A detailed analysis, including lower bounds for the corresponding risk functions, is given. Further, existence and asymptotic properties of ML estimators of parameters of discrete indexed processes, as well as sequential sampling, introducing stopping time concepts, are studied. A considerable amount of work here appears for the first time in a book. From this point on, the rest of the chapters deal with stochastic processes. Thus the material in Chapters I-III is of general interest, and can be read even by those with less mathematical preparation, by skipping the proofs, since the general content will be appreciated by anyone interested in Inference. It can also be used for a one-semester graduate course. Chapters IV and V concern major problems of stochastic inference for classes of processes, based on likelihood ratios and the fundamental Neyman-Pearson-Grenander theorem. In Chapter IV, both the continuous and discrete parameter processes are treated. New techniques are needed in the continuous index case. They include Karhunen-Loève expansions, Hellinger distances, and martingale convergence theory. These have been detailed, and the work continued, on second order processes with several illustrative examples. Sequential testing is considered, and the optimal character of the sequential probability ratio test for these processes is given. Here stochastic integrals and Brownian motion play key roles and they are detailed. Then weighted unbiased linear least squares prediction for stochastic flows driven by such classes as harmonizable or orthogonal increment processes is studied. In the discrete index case, sharper results are obtained for processes satisfying nth order difference equations, together with properties of (least squares) estimators such as consistency and limit distributions. Chapter V deals with Gaussian and other special classes of processes. Here new tools with reproducing kernel Hilbert spaces, so perfected by Parzen, are used, and Pitcher's work on admissible means and their structure is included. Also Gaussian dichotomy, and likelihood ratios for processes of independent increments, of diffusion types, of jump Markov, and of infinitely divisible classes, are studied. A number of new results, properties, and methods are presented, resulting in the longest chapter of the book.
Next, Chapter VI is concerned with the important problem of sampling continuous parameter second order processes. The Kotel'nikov-Shannon theory and some generalizations, all for non stationary processes such as harmonizable and Cramér types, are treated. Some of the results are also extended to (isotropic) harmonizable random fields indexed by $\mathbb{R}^n$, $n > 1$, with indications to LCA groups. Several examples are included. Chapter VII takes up extensions of some of the problems of Chapter V, for simple vs composite hypotheses and then for composite vs composite hypotheses, finding likelihood ratios in both cases. Here one has to appeal to tools such as semi-group and evolutionary operations, and the methods are somewhat abstract but quite important. These are needed for Pitcher's and his associates' works, in obtaining detailed results for such classes as seen here. An illustration to statistical communication theory is given. Another proof of Gaussian dichotomy, an extension of Girsanov's theorem, and multiple Wiener chaos are included. The latter are of interest in the recent applications of stochastic analysis to financial mathematics. Chapter VIII takes up the general concept of linear and non linear prediction, already introduced in Chapter III. In the non linear study, the existence, and strong as well as point-wise convergence, of best predictors on subsets are established when the loss function is convex, employing some elementary aspects of Orlicz and Köthe space analysis. Then, for the linear least squares prediction, the Cramér-Hida approach via multiplicity theory is detailed with illustrations. Turning to linear filtering problems using Bochner's formulation, one has to obtain the unknown input observing the output when the filter is an integro-difference-differential operator and the output is, for instance, harmonizable. Characterizations are given for achieving this. Specializing the filter to signal plus noise models, sharper conditions and results can be obtained with the Kalman filtering. The main results of this theory, first for discrete and then mostly for continuous parameter problems leading to stochastic differential equations (SDEs), are presented. Then the non linear theory is treated in detail. These involve just the first order SDEs, but the analysis uses classical PDEs and their stochastic counterparts, bringing this into the frontiers of research in stochastic analysis. Here, the basic (bi)spectral (or covariance) functions of the processes are needed but are often unknown to the experimenter, and they have to be estimated. So this is considered in the final Chapter IX. While the estimation of spectra is relatively well developed for stationary processes, for the non stationary case, the main thrust here, much more work is needed. As an important illustration, solutions for
a class of harmonizable processes are considered, using a resampling method, necessarily involving more observations. However, conditions for asymptotic unbiasedness, consistency, and limit distributions are obtained for bispectral density estimators. Also problems on estimation and structural analysis of processes depending on certain summability methods, isolated by Kampé de Fériet and Frenkiel, Parzen, and Rozanov, are considered. Many potentially solvable problems are indicated for future research, here and throughout the book. Each chapter has exercises, with copious hints, complementing its work. Parts of Chapters IV-IX can be studied in graduate seminars. The numbering system is standard and is the same as in the companion volume (1995). Thus all items are serially noted, starting afresh in each section. For instance, IV.3.2 is the second item in Section 3 of Chapter IV, and in a chapter its number is dropped. In a section the chapter and section numbers are also omitted, giving only the last one. A prerequisite for studying this work is an acquaintance with real analysis, and some exposure to basic probability such as the first parts of Chapters 2, 4, and 5 of the author's text (1984). A prior knowledge of statistics is not essential but beneficial. As already stated, this work is largely influenced by Grenander's thesis. The only published books on the topics considered here are his monograph (1981) and the two volume work by Liptser and Shiryayev (1977). However, the overlap between these and what follows is small. All sources are discussed in the Bibliographical notes. Although the presentation is reworked from publications, I have tried to credit the original authors fully, and hopefully I have been successful in this effort. In the initial stages of this project, I received helpful comments and encouragement from Tom Pitcher. A previous draft of this book was read by Ulf Grenander, and his comments, questions, suggestions, and encouragement have been invaluable. For this I express my deep gratitude to both of them. Stochastic inference uses many aspects of analysis, but also opens up many new areas. Recalling a view expressed by Hadamard in discussing Poincaré's work in 1921: "the center of modern mathematics is in the theory of partial differential equations", it may be said equally that: "the center of modern probability is in the theory of stochastic inference". I hope that this point is reflected, to some extent, in the following pages. The final preparation took over two years of intense work. I do not know typing, and it was a struggle for me to compose all of this material using AMS-TeX; but somehow I did it by myself, possibly with
many imperfections. Initial chapter-wise 'TeXing' was shown to me by my colleague Yûichirô Kakihara. Numerous difficulties with the computer were softened by Jan Carter, and final formatting and pagination were also assisted by Lambert Timmermans. My daughters Leela and Uma have given their spare time to typing the References as well as some text. For all this help, I am very grateful. Part of the preparation of the manuscript was done with a UCR sabbatical leave last year. Finally, I hope that this book plays a role in consolidating the past and progressing into the future of inference theory.

Riverside, CA
January, 2000
M.M. Rao
CONTENTS

Chapter I: Introduction and Preliminaries
  1.1 The problem of inference
  1.2 Testing a hypothesis
  1.3 Distinguishability of hypotheses
  1.4 Estimation of parameters
  1.5 Inference as a decision problem
  1.6 Complements and exercises
  Bibliographical notes

Chapter II: Principles of Hypothesis Testing
  2.1 Testing simple hypotheses
  2.2 Reduction of composite hypotheses
  2.3 Composite hypotheses with iterated weights
  2.4 Bayesian methodology for applications
  2.5 Further results on composite hypotheses
  2.6 Complements and exercises
  Bibliographical notes

Chapter III: Parameter Estimation and Asymptotics
  3.1 Loss functions of different types
  3.2 Existence and other properties of estimators
  3.3 Some principles of estimation
  3.4 Asymptotics in estimation methodology
  3.5 Sequential estimation
  3.6 Complements and exercises
  Bibliographical notes

Chapter IV: Inference for Classes of Processes
  4.1 Testing methods for second order processes
  4.2 Sequential testing of processes
  4.3 Weighted unbiased linear least squares prediction
  4.4 Estimation in discrete parameter models
  4.5 Asymptotic properties of estimators
  4.6 Complements and exercises
  Bibliographical notes

Chapter V: Likelihood Ratios for Processes
  5.1 Sets of admissible signals or translates
  5.2 General Gaussian processes
  5.3 Independent increment and jump Markov processes
  5.4 Infinitely divisible processes
  5.5 Diffusion type processes
  5.6 Complements and exercises
  Bibliographical notes

Chapter VI: Sampling and Regression for Processes
  6.1 Kotel'nikov-Shannon methodology
  6.2 Band limited sampling
  6.3 Analyticity of second order processes
  6.4 Periodic sampling of processes and fields
  6.5 Remarks on optional sampling
  6.6 Regression for random processes and its basis
  6.7 Regression for random measures and integrals
  6.8 Processes admitting linear regression
  6.9 Linear regression with constraints
  6.10 Complements and exercises
  Bibliographical notes

Chapter VII: More on Stochastic Inference
  7.1 Absolute continuity of families of probability measures
  7.2 Likelihood ratios for families of non Gaussian measures
  7.3 Extension to two parameter families of measures
  7.4 Likelihood ratios in statistical communication theory
  7.5 The general Gaussian dichotomy and Girsanov's theorem
  7.6 Complements and exercises
  Bibliographical notes

Chapter VIII: Prediction and Filtering of Processes
  8.1 Predictors and projections
  8.2 Least squares prediction: the Cramér-Hida approach
  8.3 Linear filtering: Bochner's formulation
  8.4 Kalman-Bucy filters: the linear case
  8.5 Kalman-Bucy filters: the nonlinear case
  8.6 Complements and exercises
  Bibliographical notes

Chapter IX: Nonparametric Estimation for Processes
  9.1 Spectra for classes of second order processes
  9.2 Asymptotically unbiased estimation of bispectra
  9.3 Resampling procedure and consistent estimation
  9.4 Associated spectral estimation for a class of processes
  9.5 Limit distributions of (bi)spectral function estimators
  9.6 Complements and exercises
  Bibliographical notes

Bibliography
Notation index
Author index
Subject index
Chapter I Introduction and Preliminaries
An outline of the stochastic inference problem, in general terms, is presented in this chapter. This includes the notions of distinctness of hypotheses to be tested as well as the associated parameter estimation from observations. Then, how both these questions can be unified into a broad framework of a decision theory is discussed. These ideas will be elaborated later on and then their application to various classes of stochastic processes will take the center stage.
1.1 The problem of inference

If a dynamical system, or an experiment, is observed for a duration of "time" $T = [a, b]$, symbolized by the data $\{X_t, t \in T\}$, one of the key problems here is to predict a future value $X_{t_0}$ $(t_0 > b)$. Since the future cannot be precisely foreseen, an element of uncertainty exists and demands a definition in a suitable mathematical language. This implies that one should (a) find a model to describe the experiment, and (b) verify or test its validity for the process under observation. The first point leads to the formulation of a probability structure on which the system is based, and thereafter the suitability of the model needs to be tested, as desired in point (b). These general statements are made mathematically meaningful as follows. Consider all possible outcomes of an experiment and represent them as points of a space $\Omega$. Let a collection of various combinations of outcomes of conceivable interest to the experiment be denoted by $\mathcal{A}$, which, for mathematical convenience, is enlarged to a $\sigma$-algebra $\Sigma$ (of subsets of $\Omega$). The members of $\Sigma$ are termed events. The uncertainty associated with each event is measured by a numerical function $P(\cdot)$, such that $0 \le P(A) \le 1$ for each $A \in \Sigma$. This $P$ should be determined
on the basis of the experiment itself. Hence it is a part of the data of the system. The physical (or natural) reasons induce an additivity property of $P(\cdot)$ on mutually exclusive events, with $P(\Omega) = 1$ for the sure event $\Omega$. A continuity condition, namely $\sigma$-additivity, is usually imposed on $P$ as a mathematical convenience, and this will be in force in all the following work. The resulting triple $(\Omega, \Sigma, P)$ is called a probability space and is hence a mathematical model on which all the ensuing analysis is essentially based. Thus the observed process $X_t$ will be regarded as a family of mappings $X_t : \Omega \to \mathbb{R}^n$ $(n \ge 1)$ which associate (a vector of) numerical values with each outcome. This allows us to translate properties of the abstract model into concrete objects on $\mathbb{R}^n$ whose fine structure is utilized to answer most questions about the system (or experiment) being analyzed. A mathematical condition imposed on each $X_t$, which is automatic for all natural experiments, is that the inverse image of each interval (or box) is in $\Sigma$; then $X_t$ is termed a random variable, and $\{X_t, t \in T\}$ a stochastic process (or simply a process). To discuss the problem in familiar terms, one associates with $P$, and the process $\{X_t, t \in T\}$, a family of finite dimensional (nice) functions on $\mathbb{R}^n$ as follows. Let $a \le t_1 < t_2 < \dots < t_n \le b$ and define, for each vector $(x_1, x_2, \dots, x_n) \in \mathbb{R}^n$,

$$F_{t_1,\dots,t_n}(x_1, \dots, x_n) = P[\omega : X_{t_1}(\omega) < x_1, \dots, X_{t_n}(\omega) < x_n]. \tag{1}$$
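For a concrete feeling of (1), the following minimal Python sketch evaluates such a finite dimensional distribution function numerically; it is a hypothetical illustration (NumPy, SciPy, and the Brownian covariance $r(s, t) = \min(s, t)$ are assumptions chosen here, not taken from the text).

```python
# Sketch: evaluate F_{t_1,...,t_n}(x_1,...,x_n) of (1) for a centered Gaussian
# process with covariance r(s, t) = min(s, t) (Brownian motion) -- an assumed
# example, chosen because its finite dimensional distributions are explicit.
import numpy as np
from scipy.stats import multivariate_normal

def fdd(times, x):
    """F_{t_1,...,t_n}(x_1,...,x_n) = P[X_{t_1} < x_1, ..., X_{t_n} < x_n]."""
    cov = np.minimum.outer(times, times)   # covariance matrix (min(t_i, t_j))
    return multivariate_normal(mean=np.zeros(len(times)), cov=cov).cdf(x)

t = np.array([0.5, 1.0, 2.0])
x = np.array([0.3, -0.2, 1.1])
print(f"F_(0.5, 1, 2)(0.3, -0.2, 1.1) = {fdd(t, x):.4f}")
```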
The family of functions $\{F_{t_1,\dots,t_n}, t_i \in T\}$, called finite dimensional distributions, inherits many important properties of $P$ through the $X_t$'s and in fact satisfies the compatibility or consistency conditions:

$$\lim_{x_n \to \infty} F_{t_1,\dots,t_n}(x_1, \dots, x_n) = F_{t_1,\dots,t_{n-1}}(x_1, \dots, x_{n-1}), \tag{2}$$

and for any permutation $(i_1, \dots, i_n)$ of $(1, \dots, n)$ of the suffixes in (1), one has

$$F_{t_{i_1},\dots,t_{i_n}}(x_{i_1}, \dots, x_{i_n}) = F_{t_1,\dots,t_n}(x_1, \dots, x_n). \tag{3}$$
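Both conditions can be checked directly on the hypothetical sketch above; the following lines (reusing the assumed fdd, t and x of that fragment) verify (2) and (3) numerically.

```python
# Continuing the sketch above (same fdd, t and x).
import numpy as np

# Condition (2): letting x_n -> +infinity recovers the (n-1)-dimensional marginal.
lhs = fdd(t, np.array([0.3, -0.2, 1e6]))
rhs = fdd(t[:2], x[:2])
assert abs(lhs - rhs) < 1e-4          # tolerance for the numerical Gaussian cdf

# Condition (3): permuting the suffixes together with their arguments changes nothing.
perm = [2, 0, 1]
assert abs(fdd(t[perm], x[perm]) - fdd(t, x)) < 1e-4
print("compatibility conditions (2) and (3) hold numerically")
```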
Now condition (3), although mysterious at first sight, simply indicates the fact that the events (= subsets) in the domain of $P$ in (1) have the same (set) intersection even after an arbitrary permutation. Moreover $0 \le F_{t_1,\dots,t_n}(x_1, \dots, x_n) \le 1$, with the additional properties that (i) $\lim_{x_i \to -\infty} F_{t_1,\dots,t_n}(x_1, \dots, x_n) = 0$ for each $i = 1, \dots, n$, (ii) $\lim_{x_i \to +\infty,\, 1 \le i \le n} F_{t_1,\dots,t_n}(x_1, \dots, x_n) = 1$, and (iii) the $F$'s are nondecreasing in the sense that the $n$-dimensional increment $\Delta F \ge 0$. Actually it is a (nontrivial) fact that the formulation of the model $(\Omega, \Sigma, P)$ and a process $\{X_t, t \in T\}$ on it are equivalent to having a (consistent)
family of distribution functions on $\mathbb{R}^n$, $n \ge 1$. We now give a precise form of this result, due to Kolmogorov [1], for a clear understanding of the second point leading to the inference problem:

1. Theorem. Let $\{F_{t_1,\dots,t_n}, t_i \in T \subset \mathbb{R}\}$ be a compatible family of distribution functions on $\mathbb{R}^n$, $n \ge 1$. Let $\Omega = \mathbb{R}^T$, the space of all real valued functions on $T$, and $\Sigma_T$ be the smallest $\sigma$-algebra containing all sets of the form $\{\omega \in \Omega : -\infty < \omega(t) < a, t \in T, a \in \mathbb{R}\}$. Then there exists a unique probability function $P$ on $\Sigma_T$ and, if $X_t : \Omega \to \mathbb{R}$ is the (coordinate) mapping defined as $X_t(\omega) = \omega(t)$, the family $\{X_t, t \in T\}$ is a stochastic process on $(\Omega, \Sigma, P)$, such that for each $n \ge 1$ and $t_1 < \dots < t_n$, $t_i \in T$, and $x_i \in \mathbb{R}$, $i = 1, \dots, n$,

$$P[\omega : X_{t_1}(\omega) < x_1, \dots, X_{t_n}(\omega) < x_n] = F_{t_1,\dots,t_n}(x_1, \dots, x_n). \tag{4}$$
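As a numerical gloss on the theorem (again a hypothetical sketch under the same assumed Brownian covariance, not a construction taken from the text), one may simulate the coordinate process on a finite grid and check that the empirical probabilities reproduce the prescribed family, as in (4).

```python
# Continuing the same sketch: sample paths of the canonical (coordinate) process
# on the grid t, using the Cholesky factor of the assumed covariance, and compare
# the empirical probability with the prescribed distribution, as in (4).
import numpy as np

rng = np.random.default_rng(0)
cov = np.minimum.outer(t, t)
chol = np.linalg.cholesky(cov)
paths = rng.standard_normal((200_000, 3)) @ chol.T   # each row: one outcome omega

empirical = np.mean(np.all(paths < x, axis=1))       # P[X_{t_i} < x_i, i = 1, 2, 3]
print(f"empirical {empirical:.4f}  vs.  F(t, x) = {fdd(t, x):.4f}")
```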
This result was originally proved by Kolmogorov [1], and most books have it discussed in some form. A detailed version, exactly as above, is established in the author's monograph (cf. Rao [21], p. 15) and will not be repeated here. The point of this assertion is the essential equivalence of the model $(\Omega, \Sigma, P)$ with an observable process $\{X_t, t \in T\}$ on it, together with the (compatible) family of distributions $\{F_{t_1,\dots,t_n}, t_i \in T, n \ge 1\}$. It is thus the latter family that is central to modeling, and the abstract version can be replaced by, what is termed, a function space version, which gives rise to the same finite dimensional distribution family. In what follows either form will be used according to convenience. [However, the reader should note that it is not always advisable (or even possible) to replace a given probability space by a function space representation, as seen, for instance, by considering various functionals of the $X_t$-process.] Suppose now that the system's uncertainty is measured by different sets of people (or instruments), labeled by $I$, where the same basic couple $(\Omega, \Sigma)$ can be retained. Thus each one will be able to produce a (possibly) different probability function $P_\theta$, $\theta \in I$. Then equations (1) or (4) determine a family $\{F^\theta_{t_1,\dots,t_n}, t_i \in T, n \ge 1\}$, depending on $\theta \in I$, for the same observation process $X_t$. It is hence necessary to find conditions in order that the observed process determines the true $P_{\theta_0}$ that governs the experiment. Several criteria can be (and have been) proposed to find the "true" or most "likely" function $P_{\theta_0}$. This set of methods leading to an "optimal or desirable" solution constitutes the inference problem. There are several suggested "principles" for the "best" inference. Since it is clear that no single method can work for all situations, and since different people advocate different procedures as "best", they sometimes lead to controversies. Here we present methods
that are applicable to most of these groups, and concentrate mainly on the mathematical analysis that applies to large sets of problems. These include hypothesis testing, estimation, prediction, filtering and sampling the observations, as well as regression, among others.

1.2 Testing a hypothesis

Suppose that a dynamical system or a (natural) phenomenon is modeled by probability triples $\{(\Omega, \Sigma, P_\theta), \theta \in I\}$, devised by different experimenters, denoted (or indexed) by $I$. If now the system is observed, designated by a process $\{X_t, t \in T\}$ on $\Omega$, it is desired to verify the statement that $P_{\theta_0}$, for some $\theta = \theta_0$ in $I$, is the true probability function governing the system. In view of the discussion of the last part of the preceding section, the problem may also be considered for convenience with the distribution functions $\{F^\theta_{t_1,\dots,t_n}, t_i \in T, n \ge 1, \theta \in I\}$. Then in simple terms it is given as follows. First suppose that $\theta \ne \theta'$ implies $F^\theta \ne F^{\theta'}$ (or equivalently $P_\theta \ne P_{\theta'}$), so that the latter functions differ on some (measurable) set. This is no restriction on the problem since one removes only a redundancy. Then the hypothesis that $\theta = \theta_0$ is rejected if there exists a set $A \in \Sigma$ (often called a critical region) such that $P_{\theta_0}(A) \le \alpha_0$, a prescribed number $0 < \alpha_0 < 1$, and among all sets satisfying this constraint one chooses an $A$ for which $P_\theta(A)$ is a maximum when $\theta \ne \theta_0$. The rationale here is that one should reject $\theta = \theta_0$ only if there is a strong reason for accepting the alternative hypothesis (taking only a small chance or probability, $\alpha_0$, of rejecting the "true" hypothesis), and that is reflected in the maximization of the probability of the second hypothesis (or equivalently minimizing $P_\theta(A^c)$, the probability of accepting a wrong alternative; here and elsewhere $A^c$ denotes the complement of the set $A$). These assertions are often given in terms of the distributions $F^\theta$, using relations (1) or (4) of the preceding section, by supposing that $A$ can be taken as a cylinder set with base in $\mathbb{R}^n$, i.e., if $A = B \times \mathbb{R}^{I - I_n}$ where $B \subset \mathbb{R}^{I_n} = \mathbb{R}^n$, a Borel set, so that

$$P_\theta(A) = \int_B dF^\theta_{t_1,\dots,t_n}, \tag{1}$$

in this new notation. It is clear that there are additional assumptions to be made here in order that such critical regions exist. In the next chapter various subsidiary conditions will be formulated under which the existence of such $A$ is assured. The number $\alpha_0$ is called the size of the critical region and $\beta = \sup_{\theta \ne \theta_0} P_\theta(A)$ is termed the power of the test in the traditional (statistical) language. In formulating the above procedure, the underlying feeling is that an observed process comes from that part where the governing probability
function assigns the highest value, and thus the critical region should distinguish the hypothesis and its alternative. The distinguishability property is not as simple as it appears (see the example below), and it will be discussed further in the next section. The problem of accepting the hypothesis $\theta = \theta_0$ or its alternative $\theta \ne \theta_0$ can be considered more generally. Namely, divide the set $I$ of $\theta$'s into two mutually exclusive classes $H_0$ and $H_1$, carefully, so that $\theta_0 \in H_0$ and $\theta_1 \in H_1$ imply $P_{\theta_0} \ne P_{\theta_1}$ for an element of $H_0$ and one of $H_1$. If either of the sets $H_0$, $H_1$ is not a singleton, the corresponding hypothesis is called composite, and the singleton case is termed a simple hypothesis. For composite hypotheses it will be necessary to consider further restrictions, usually motivated by the variational calculus, suitable to the situation at hand. These problems are treated at some length in the following chapter. To understand a possible difficulty, we present an example of indistinguishability.

1. Example. Let $F$ be a distribution on $\mathbb{R}$ with jumps at $x_1, x_2, x_3$, where $x_1 < x_2 < x_3$, of sizes $p_1 > 0$, $p_2 > 0$, $p_3 \ge 0$ and $p_1 + p_2 + p_3 = q \le 1$. If $B \subset \mathbb{R}$ is a Borel set and $B_0 = \{x_1, x_2, x_3\}$, let $B_1 = B_0 \cap B$ and $B_2 = B - B_1$, so that $B = B_1 \cup B_2$, a disjoint union. Define three distributions $G_1, G_2, G_3$ by letting $G_j(x_i) - G_j(x_i-) = q_{ji} \ge 0$, $q_{j1} + q_{j2} + q_{j3} = q$, $j = 1, 2, 3$, with $q_{ii} = p_i$ but $q_{ij} \ne p_j$ for $i \ne j$, and for other values set

$$G_j(x) = \begin{cases} F(x), & -\infty < x < x_1; \\ F(x) + G_j(x_1) - F(x_1), & x_1 \le x < x_2; \\ F(x) + G_j(x_2) - F(x_2), & x_2 \le x < x_3; \\ F(x) + G_j(x_3) - F(x_3), & x_3 \le x < \infty. \end{cases}$$

Then $F \ne G_j$, $j = 1, 2, 3$, and on $B_2$ one has

$$\int_{B_2} dF(x) = \int_{B_2} dG_j(x), \quad j = 1, 2, 3.$$

If $B_1 = \{x_i\}$, $i = 1, 2, 3$, then for some $j = 1, 2, 3$ one has

$$\int_{B_1} dF(x) = p_j = \int_{B_1} dG_j(x),$$

and if in the above equation $B_1$ is replaced by $B_0 - \{x_i\}$ it again holds if $p_j$ is replaced by $q - p_j$. Since by construction the integrals agree on $B_0$ for all $j$, it follows that

$$\int_B dF(x) = \int_B dG_j(x),$$

for some $j$, for any Borel set $B$. If now $H_0$ is the simple hypothesis that $F$ is true, and $H_1$ is the alternative that $\{G_1, G_2, G_3\}$ is correct, then $F \ne G_j$ for all $j$. Since there is no nonempty Borel
set on which the distributions of $H_0$ and $H_1$ give distinct probabilities, the hypotheses cannot be distinguished! Here $H_0$ is simple and $H_1$ is composite; but a modification of the construction shows that the difficulty persists when both hypotheses are composite. This is done as follows. Choose the numbers $p_i, q_{ji}$ above to satisfy $p_2 < q_{32} < 2p_2$, so that $p_1 + q_{2i} - q_{3i} \ge 0$ for all $i$. Consider $H_0 = \{F_1^*, F_2^*\}$, $H_1 = \{G_1^*, G_2^*\}$, where $F_1^* = F$, $F_2^* = F + G_2 - G_3$, $G_1^* = G_1$, $G_2^* = G_2$. Then one verifies that all the distributions are unequal, and $H_0$ and $H_1$ are disjoint, but for any Borel subset $B$ of $\mathbb{R}$ as above one has

$$\int_B dF_i^*(x) = \int_B dG_j^*(x),$$

for some $i, j$. Thus the hypotheses again cannot be distinguished. These computations imply the following result due to A. Berger [1]:

2. Proposition. Let $H_0, H_1$ be distinct sets of inequivalent distributions on $\mathbb{R}$, having at least two discontinuities. If $(\mathrm{card}(H_0)) \cdot (\mathrm{card}(H_1)) \ge 3$, then $H_0$ and $H_1$ cannot always be distinguished by a (nonempty) Borel set.

If the observational process $X = (X_1, \dots, X_n)$ is a finite vector, then one can consider its $n$-dimensional distribution $F_X^\theta$ for each $\theta \in I$, and additional conditions need be postulated in dealing with the testing problem. However, if the process consists of an infinite number (or a continuum) of observable variables, then there may not be a nontrivial $F_X^\theta$. This is the case even if $X = (X_1, X_2, \dots)$, all the $X_i$ being independent with a common distribution $G_\theta$. Indeed there need not exist a nontrivial $F_X^\theta$ satisfying, for $x = (x_1, x_2, \dots)$,

$$F_X^\theta(x) = \prod_{i=1}^{\infty} G_\theta(x_i), \tag{2}$$
1.3 Distinguishability of hypotheses
7
the distinguishability problem which is evidently a basic requirement for the testing part of inference theory. Very little attention seems to have been paid in the literature even though it is not entirely simple. We therefore take it up in the next section. 1.3 Distinguishability of hypotheses The discussion of the preceding paragraph indicates that for a pair of hypotheses represented by the sets H0 , H1 , a mere disjointness does not guarantee the distinguishability of the distributions which is thus somewhat subtle. To restate the concept precisely, the set of probability measures {Pθ , θ ∈ I = H0 ∪ H1 , H0 ∩ H1 = ∅} on the basic measurable space (Ω, Σ) is termed distinguishable (or identifiable) if (i) Pθ = Pθ for all distinct θ, θ ∈ I and (ii) there is at least one A ∈ Σ such that Pθ (A) = Pθ (A) for θ ∈ H0 and θ ∈ H1 . Under further conditions, satisfied in many applications, this difficulty is eliminated and a positive solution is obtained. The following result, the first part of which is due to A. Berger [1] and the second part to A. Berger and Wald [1], exemplifies this point. It is stated for X taking values in R, but it clearly holds if X is a (finite) vector with values in some Rn . Its purpose is to expose the reader to unexpected hardships lurking beneath the surface of this simple looking problem. 1. Theorem. Let {Pθ , θ ∈ H0 ∪ H1 = I} be the underlying probability family of the observable random variable X. Suppose one of the following conditions holds: (i) Each of the distinct hypotheses H0 and H1 is a countable set and that the distributions Fθ (x) = Pθ [X < x] are distinct and continuous on R, but otherwise I = H0 ∪ H1 is unrestricted; (ii) I ⊂ R, and each of the distributions is absolutely continuous d with density pθ (x)[= dx (Pθ [X < x])] such that θ → pθ (x) is continuous for each x and there is a locally integrable (i.e., on bounded intervals) g : R → R+ dominating pθ (i.e., pθ (x) ≤ g(x), x ∈ R). Then there is a Borel set B which distinguishes Fθ for the disjoint θ-sets H0 and H1 , denoting the hypothesis and its alternative. It may be noted that there is no topology on I in the first part but I is only countable, whereas in the second part it inherits the topology from R but I can be uncountable. In either case X is a finite vector. Moreover, the distinguishing set may be chosen to have small (Lebesgue) measure as the following demonstration shows. This property is also found useful for later work. Proof of (i) Let H0 : {Fi , i ≥ 1}, H1 : {Gj , j ≥ 1} be countable sets and let Dij = Fi − Gj . Renumber this countable set as K1 , K2 , . . . . By
8
I. Introduction and Preliminaries
hypothesis Ki (xi ) = 0 for some xi ∈ R and Fi , Gj being distribution functions (d.f.s) Ki (x) → 0 as x → ±∞, so that there is also a yi ∈ R, xi < yi such that 0 = K(xi ) = K(yi ) = 0. Let zi = sup{x ∈ [xi , yi ], Ki (x) = K(yj )}. Then for each δ > 0 there is a point x ∈ R satisfying |zi − x| < δ, and Ki (x) = Ki (zi ). This zi is a “change point” of Ki which has a “δ-neighborhood around zi ”. We now construct a Borel set W satisfying |Ji (W )| > 0 for all i and of arbitrarily small Lebesgue measure, where Ji (W ) = W dKi . Thus let ε > 0 and we find a sequence of Borel sets {Wn , n ≥ 1} such that each Wi is a finite disjoint union of intervals and |Wi | < 2εi where |Wi | = Leb.M eas(Wi ). Start with W1 such that Ji (W1 ) = 0, for 1 ≤ i ≤ i1 but Ji+1 (W1 ) = 0 for i > i1 . Then choose W2 such that Ji (W2 ) = 0 for i ≤ i2 and Ji (Wr ) = 0, for i ≤ ir , but Jr+1 (Wr ) = 0, and Ji (Wr+1 ) = 0 for i ≤ ir+1 . The construction is done inductively. If Ji (Wr ) = 0 for all i, then Wr satisfies the requirements. In the general case we proceed as follows. Let x1 be a point of change of K1 and let a δ(= ε/2)-neighborhood of it be as obtained at the beginning. Call it W1 . Suppose W1 , . . . , Wr have been chosen to satisfy the desired conditions, for the induction procedure. Then Ji (Ws ) = 0, i ≤ is , s = 1, . . . , r and Js+1 (Ws ) = 0. Now set t = is + 1, and let xt be a change point of Kt . Since Wr is a finite disjoint union of intervals, with 0 < δr < ε/2r , choose a δr neighborhood of xt so that these intervals do not contain any boundary points of Wr (other than xt ). The construction ensures that there is an ηs > 0 such that |Ji (Ws )| > ηs , i ≤ is , s ≤ r.
(1)
∞ If x1 , . . . , xt are the change points of K1 , . . . , Kt , let i=1 αi = α < 1, 0 < αi+1 < αi , and xt ∈ (at , bt ) be arbitrary. Then by the continuity of K1 , . . . , Kir in [at , bt ], we may choose δr > 0 small enough so that each subinterval I(δr ) ⊂ [at , bt ] satisfies |Ji [I(δr )]| < ηs αr , s ≤ r, i ≤ ir .
(2)
Let now Δr be an interval of change of Kt in the δr -neighborhood of xt , so that its length verifies |Δr | < δr < ε/2r , and |Jt (Δr )| > 0. With this we can define Wr+1 (to complete the induction) as: Wr+1 = Wr ∪ Δr , if Δr ∩ interior(Wr ) = ∅; = Wr − Δr , otherwise.
(3)
9
1.3 Distinguishability of hypotheses
Then we get from the conditions of construction |Ji (Wr+1 )| ≥ |Ji (Wr )| − |Ji (Δr )| > ηr − ηr αr > 0, i ≤ ir , and since Jt (Ws ) = 0, using (3), we have |Jt (Wr+1 )| + |Ji (Δr )| > 0. ∞ From this sequence, define W = W1 ∪ ∪∞ r=1 Δr − ∪r=1 Δr , where Δr has no common interior points with Wr and Δr has such points (if any) with Wr (and =∅ otherwise). Then 0 < |W | < ε and by choice of ηr s and αr s, one has
|Ji (Δr )| < |Ji (Ws )|αr , i ≤ is , s = 1, . . . , r; r = 1, 2, . . .
(4)
For any fixed i, find indices ik−1 , ik such that ik−1 < i ≤ ik . Hence by (4) |Ji (W )| ≥ |Ji (Wk )| − > |Ji (Wk )| −
∞ j=0 ∞
|Ji (Δk+j )| |Ji (Wk )|αk+j .
(5)
j=0
Moreover
∞
|Ji (Δk+j )| < |Ji (Wk )|
j=0
so that |Ji (W )| > |Ji (Wk )|(1 −
∞
αk+j
j=0 ∞
αk+j ) ≥ 0.
(6)
j=0
This W is a Borel set of the desired kind. We omit the proof of Part (ii) which is also long. The result serves as motivation for us. The problem of distinguishability is thus a nontrivial one especially when H0 or H1 is composite. Further, in studies on stochastic processes both discrete and continuous as well as the mixed cases appear frequently. Poisson and Gaussian processes are examples of the first two and classes of infinitely divisible processes (e.g., those with independent increments) under all the Pθ measures contain the mixed cases. However in many of the problems studied below, the distinguishability property is seen to be automatically included when the critical regions are constructed. Also the result extends to n-dimensions and then to infinite (or function space) case
10
I. Introduction and Preliminaries
since by Theorem 1.1 one only has to consider all its finite dimensional distributions. However, the details will not be included here. Having noted the importance and nontriviality of distinguishing composite hypotheses, we now proceed to another related question of inference, namely looking at the indexes of {Pθ , θ ∈ I} from a different perspective in the following section. It has to do with estimating the unknown parameter θ from the observable random variable X, and the criteria for “optimal” methods for the purpose. Many procedures of importance for our work on processes will be undertaken in Chapter III. Some desirable properties of testing hypotheses will be discussed in considerable detail in Chapter II and applied to several problems and applications later on in the book. 1.4 Estimation of parameters If X is a random variable on (Ω, Σ, Pθ , θ ∈ I) then its expectation E(X) = Ω XdPθ , will be a function of θ. More generally, using the Fundamental Theorem of Probability (c.f., e.g., Rao [15], p.19) one gets E(f (X)) = Ω f (X)dPθ = g(θ) so that g : I → R, a function depending on f for which the integral is finite, will enable us to get new (and additional) “information” about the parameter θ of Pθ . In particular, if one can find a suitable function f (X) that “locates” θ, then our model will be completely known. However, f (X) is a random variable and as such it can only “approximate” θ with a certain probability, but not necessarily with probability one. This method thus opens up new possibilities for exploration. The procedure is known as estimation of the parameter θ of I, and it may be described as follows. Let X : Ω → Rn be a random variable (or vector) governed by the model {(Ω, Σ, Pθ ), θ ∈ I}. If f : Rn → I is a measurable function relative to the Borel σ-algebra B n of Rn and a fixed σ-algebra I of I, then the function Y = f (X) is called an estimator of θ (with values in I) and it is desired that the probability distribution Qθ of Y derived from Pθ , places “maximum” probability about θ ∈ I. Thus for each A∈I (1) Qθ (A) = Pθ (X ∈ f −1 (A)), depends on θ, and f is to be chosen in such a way that if θ = θ0 is the true value, one would like to have Qθ0 (A) ≥ Qθ (A) for all θ ∈ I and A ∈ I. A related idea when I ⊂ R, is to find a pair of functions f1 , f2 : Rn → I such that the random interval [f1 (X), f2 (X)] ⊂ I contains θ and has the property that its Lebesgue length is shortest and Pθ -probability is a maximum; and this method is termed an interval estimation. It has generalizations if I ⊂ Rn and in other spaces, but for now we discuss the problem of a single f – the “point” estimation. The “best” choice
1.4 Estimation of parameters
11
of f [or a pair (f1 , f2 )] with such a property is the subject of parameter estimation. In particular it is desirable to have Qθ concentrate (or degenerate) at θ if the number of components of X is (or tends to) infinity. This is an asymptotic result. However, these properties should be stated more precisely for a rigorous mathematical analysis. The “closeness” of f (X) and θ in I will be discussed concretely. Suppose I ⊂ Rm and W : Rm × Rm → R+ is a measurable function (with W (0, 0) = 0) relative to the Borel σ-algebras of the three spaces involved. Then W (f (X), θ) stands for the error committed when X is observed and f (X) is the estimator of θ. Since this is again a random variable, one considers its average value, i.e., if we set θˆ = f (X), as: ˆ W (f (X), θ)dPθ R(θ, θ) = E(W (f (X), θ)) = Ω = W (f (x), θ)dF θ (x),
(2)
Rn
is the new value. Here W (·, ·) is called a loss function to gauge the error between θˆ and θ, and its average loss, R(·, ·), is termed the risk function. The observed value of f (X), denoted f (x), is the estimate of θ for the case under consideration. The estimator θˆ of θ is said to be the closest or(“best”) relative to the given loss function W if the risk ˆ θ) is a minimum among all such estimators. From the mathematR(θ, ical tractability point of view as well as other practical reasons, one usually takes W (x, y) = L(x − y) where L(·) is a monotone (or a conˆ θ) = 0 if θˆ takes the vex) function vanishing at the origin so that L(θ, value θ, and is nonnegative otherwise. For instance, x → L(x) = |x|2 , the quadratic loss function, is a popular one for many applications. Different methods (or types) of estimation and the desirable properties of estimators will be discussed in some detail in Chapter III. Indeed even as n → ∞, θˆn may converge (in some sense) only to a random variable with θ as a key parameter. In such cases the distribution of the limiting variable, and the asymptotic analysis in general will also be important in the study and it will be considered. Specific applications of these ideas for processes will occupy much of the work in this book. The preceding discussion (and especially (2)), leads to another related problem, namely the prediction or extrapolation, and filtering of a process. This can be stated in general terms as follows. Suppose that {Xt , t ∈ [a, b]} is observed and Xt0 , t0 > b, is to be predicted relative to a loss function L(·). In other words, we have to find a function Y0 of {Xt , t ∈ [a, b]}, called a functional, such that E(L(Y0 − Xt0 )) ≤ E(L(Y − Xt0 )),
(3)
12
I. Introduction and Preliminaries
among all such functionals Y for which the right side of (3) is finite. If Y is a linear function of the Xt (e.g., it can be an integral such as b b Y = a Xt f (t)dμ(t) or = a f (t)dXt, or the corresponding sums if t varies in a countable index set), then it is termed a linear prediction problem, and nonlinear otherwise. Here one has to prove the existence, t ). It is uniqueness, and a possible method of computation of Y (= X 0 of interest to note that now we estimate a random variable X t0 by a random function Y , which contrasts the previous discussion of estimating a constant, namely θ. This aspect of the subject will be considered under various conditions in later chapters. Moreover, methods and solutions depend on the type of loss functions used as well as the particular processes considered. Related problems of filtering and signal plus noise models will be prominent in these studies, and will be made precise later. The former comment is also applicable for both linear and nonlinear prediction problems on stochastic processes, and they have an interesting analogy with a type of “Bayes estimation”. 1.5 Inference as a decision problem The basic ideas of hypothesis testing and estimation can be abstracted and unified in a general frame work. This was done by Wald (cf., his book [4]) who has then presented an overview of the generalization and its consequences in his address to the International Congress of Mathematicians in that year. Here we outline the subject from a point of view that applies to stochastic processes. Consider a random variable (or a vector) X with the probability family {Pθ , θ ∈ I} where I = H0 ∪ H1 , is a disjoint union and that {Pθ , θ ∈ H0 } and {Pθ , θ ∈ H1 } are distinguishable in the sense of Section 3 above, so that there exists at least one set A ∈ Σ which separates H0 and H1 . Suppose there is a Borel set B ⊂ Rn such that A = X −1 (B). Now the test procedure is that one rejects H0 when ω ∈ A, and hence accepts H1 . If d0 and d1 denote these two decisions, let δ : Ω → {d0 , d1 } be a mapping defined as δ(A) = d0 and δ(Ac ) = d1 . The mapping δ describing these actions is termed a decision function, and αδ (θ) = Pθ (A) = Pθ ◦ δ −1 ({d0 }) is the probability of rejecting H0 when Pθ is the underlying measure, i.e., θ ∈ H0 . One would like to have it as small as possible. But this implies accepting H1 so that βδ (θ ) = Pθ (A) = 1 − Pθ (Ac ) = 1 − Pθ ◦ δ −1 ({d1 }) where Pθ (Ac ) is the probability of rejecting H1 when the underlying true measure is Pθ , i.e., θ ∈ H1 . Since the probability of an incorrect action for a given δ is αδ (θ) + (1 − βδ (θ)), this cannot always be made small, satisfying the constraint that Pθ (A) + Pθ (Ac ) = 1 for all θ ∈ I. Thus one takes supθ∈H0 αδ (θ) ≤ α0 , and then chooses A so that the power βδ (θ) is as
13
1.5 Inference as a decision problem
large as possible. If the intuitively desirable property that α0 ≤ βδ (θ) is also possible, then the test is termed unbiased. The above discussion may be stated alternatively as follows on assuming that the incorrect decision incurs some losses to the experimenter, say b1 (θ), b2 (θ) > 0: ⎧ 0, ⎪ ⎪ ⎪ ⎨ b (θ) > 0, 1 L(δ(ω), θ) = ⎪ b2 (θ) > 0, ⎪ ⎪ ⎩ 0,
for θ ∈ H0 , ω ∈ A, for θ ∈ H1 , ω ∈ A, for θ ∈ H0 , ω ∈ Ac ,
(1)
for θ ∈ H1 , ω ∈ Ac .
If b1 = b2 = 1, then it is said to be a simple loss function. Since L(·, ·) defines a random variable, one considers again only its average value, termed the risk, for the ensuing analysis. Thus R(δ, θ) = Eθ (L(δ, θ)) = b1 (θ)αδ (θ) + b2 (θ)(1 − βδ (θ)),
(2)
where αδ (θ) = Pθ ◦ δ −1 ({d0 }) is the size and βδ (θ) = Pθ ◦ δ −1 ({d1 }) is the power of the test. This formulation motivates a generalization of the test procedure. Comparing it with the work of the preceding section, it is immediately seen that both problems can be given a unified formulation as well as an extension. Thus if I has more than two points so that H0 or H1 is composite, then one may reduce the problem to “simple” hypotheses by introducing some weight functions to average or combine the θ’s. This idea will be discussed in some detail in the next chapter. However, the general problem considered above is to minimize the probabilities of the two types of errors appearing in (2), which is equivalent to minimizing the risk function (δ, θ) → R(δ, θ) by finding a suitable decision function (if it exists) as θ varies in I. Thus if D is the set of all decision functions δ, then R : D × I → R+ is to be minimized, and this being a function of two variables the resulting variational problem needs a deeper scrutiny. In the last section it was suggested that by “controlling” the type I error, αδ (θ), one should minimize the type II error, (1 − βδ (θ)), or equivalently maximize the power βδ (θ). On surveying some general methods of (abstract) analysis, one finds that the problem is related to a zero sum two-person game, already well-developed by von Neumann. This fact was noticed by Wald who then went on to apply it, with suitable modifications and extensions, to statistical inference problems. We now indicate this shift and its consequences for the analysis under consideration. Two players I and II with respective plans, called strategies, AI and AII , and a payoff function k : AI × AII → R, play a game where I chooses a point a ∈ AI and II chooses a point b ∈ AII without
Two players I and II, with respective plans, called strategies, AI and AII, and a payoff function k : AI × AII → R, play a game where I chooses a point a ∈ AI and II chooses a point b ∈ AII without any knowledge of the choice of I. Then I gets the amount k(a, b) (say, dollars) and II gets −k(a, b), so that the sum is zero. Thus I tries to maximize the (winning fortune) k(a, b) by a suitable choice of a ∈ AI, and II at the same time tries to minimize the (losses) −k(a, b) by a suitable choice of b ∈ AII. If there exists an element (or strategy) a0 ∈ AI for I and an element b0 ∈ AII for II such that

min_b max_a k(a, b) = max_a min_b k(a, b) = k(a0, b0),        (3)
then the game is said to have a value (= v, say), which will thus be fair to both players. If AI and AII are finite sets (of strategies), then von Neumann has already proved that such a game always has a value. But if these are infinite sets, additional conditions are needed for the existence of a value (where in (3) 'max' and 'min' are replaced by 'sup' and 'inf'). Consequently these sets are often taken to be topological, and then the finiteness can be replaced by compactness of the spaces of strategies.

Further extensions suggest themselves. Indeed, since the strategies a, b can be regarded as points at which a pair of probability measures are concentrated, one can embed the problem in a larger one by considering the classes M1(AI) and N1(AII) of all probability measures on certain (given) σ-algebras AI and AII of the respective spaces, containing all the one-point sets. Then the payoff function is replaced by the corresponding average relative to a ξ ∈ M1(AI) and an η ∈ N1(AII):

K(ξ, η) = ∫_{AII} ∫_{AI} k(a, b) dξ(a) dη(b),        (4)
and one considers the analog of (3):

V = inf_{η∈N1(AII)} sup_{ξ∈M1(AI)} K(ξ, η) = sup_{ξ∈M1(AI)} inf_{η∈N1(AII)} K(ξ, η),        (5)
provided the equality in (5) holds, so that the common value V may again be called the value of the (generalized) game. The elements of AI, AII are termed pure strategies and, in the generalized case, since probability measures are involved, the ξ, η are called randomized strategies. This brings in a natural topological structure. In fact, if B(AI) is the space of real bounded AI-measurable functions, which is a Banach space under the uniform norm, then its adjoint space M(AI) is the space of additive set functions on AI with finite variation (as norm). But M1(AI) is a subset of its unit ball, which is a "weak-star" compact subspace. Thus M1(AI) inherits the induced topology, and a similar statement holds for N1(AII)(⊂ N(AII)).
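[For a finite game, the value in (3)–(5) and optimal randomized strategies can be computed by linear programming; a sketch follows, in which the payoff matrix is an arbitrary illustrative choice and scipy.optimize.linprog is assumed available:

# Value of a zero-sum two-person game over randomized strategies.
import numpy as np
from scipy.optimize import linprog

K = np.array([[1.0, -1.0,  3.0],
              [2.0,  4.0, -2.0]])      # k(a, b): what I receives from II

def maximin(K):
    m, n = K.shape                     # variables (xi_1..xi_m, v); maximize v
    c = np.r_[np.zeros(m), -1.0]
    A_ub = np.c_[-K.T, np.ones(n)]     # v <= sum_a xi_a k(a, b) for every b
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return -res.fun, res.x[:m]

v, xi = maximin(K)                     # sup_xi inf_eta K(xi, eta)
w, eta = maximin(-K.T)                 # II's problem; w = -inf_eta sup_xi
print("maximin =", round(v, 6), " minimax =", round(-w, 6))  # equal: the value
print("optimal xi:", xi.round(4), " optimal eta:", eta.round(4))

That the two printed numbers agree is von Neumann's theorem for finite (hence compact) strategy spaces.]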
Other usable conditions on sequential convergence of measures in M1(AI) and N1(AII) suggest themselves, and they allow the generalized two-person zero-sum game to admit a value, by extending von Neumann's theorem.

These considerations are employed in inference theory by means of the following identification. One of the players, say I, is the experimenter, and the second one, II, is the ignorance (or lack of knowledge) of the parameter θ, personified as a malevolent opponent, also termed, enigmatically, 'nature'. Thus in the inference problem the experimenter is to play the game with such an adversary, and takes the conservative view that one should minimize the maximum (average) risk in making the decision by using (2), or (4) in the randomized case. A rule that minimizes the maximum average risk is also called a minimax decision, and the theory can proceed by seeking conditions for the existence and uniqueness of sets of such rules. Some of these ideas will be briefly discussed again in the next chapter. It may be noted that this identification of the inference problem with game theory has helped enlarge the scope of statistical decisions, although the solutions now are based only on averages and not on pointwise comparisons. Even so, actual applications to specific problems of interest in the subject have lagged behind. In one of his last contributions, Wald [3], who was the originator and prime contributor of this extended technology, concluded: "While the general decision theory has been developed to a considerable extent and many results of great generality are available, ...., the mathematical difficulties in obtaining explicit solutions are still great, but it is hoped that [in future research such] solutions will be worked out in a great variety of problems," (p. 242). This difficulty is especially acute in the case of stochastic processes, which usually deal with infinitely many random variables. Consequently most of what follows is devoted to solving such problems, and we concentrate on hypothesis testing, estimation, prediction, filtering, and related questions of greater specificity than abstract discussions of the general (decision) theory, although the latter will be kept in the background as it gives an overview of the subject.

1.6 Complements and exercises

1(a) If a hypothesis H0 and its alternative H1 are both simple, so that each consists of a single distinct distribution, show that they are always distinguishable. [Indeed, there exists an interval (−∞, x0) on which they are different.]
(b) Suppose H0 is simple and H1 has two elements, all being different. Show again that there is at least one interval (a, b) that distinguishes H0 and H1. [Hint: If H0 = {F} and H1 = {G1, G2} with F(x0) ≠ G1(x0) and F(x1) ≠ G2(x1), then a = min(x0, x1), b = max(x0, x1) will be a solution. This problem complements the assertions of Theorem 3.1.]
2. We can strengthen 1(a) if the distributions there are restricted. Thus let H0 = {F}, H1 = {G} be simple, and suppose that F and G are absolutely continuous distributions on the line with densities f, g (relative to the Leb. measure). If the set A = {x : f(x) ≠ g(x)} has positive Lebesgue measure, then show that there is a (Borel) set A0 that distinguishes H0 and H1, for which, moreover, both the type I and type II errors are strictly less than 1/2 each. [Hints: Consider the cases (i) ∫_{[f≤g]} g(x) dx > 1/2, and (ii) ∫_{[f>g]} f(x) dx > 1/2. In case (i) choose (Borel) sets A1 ⊂ [f ≤ g] satisfying ∫_{A1} g(x) dx = 1/2 > ∫_{A1} f(x) dx, with A2 ⊂ [f ≤ g] − A1 such that 0 < ∫_{A2} g(x) dx < 1/2 − ∫_{A1} f(x) dx; and verify that A0 = A1 ∪ A2 satisfies the requirements. (ii) is similar.]

3. Let X = (X1, X2, . . . ) be an infinite vector (or sequence) of independent random variables on some (Ω, Σ, P), and let x = (x1, x2, . . . ), xi ∈ R, be any given real vector. Consider F(x) = P[Xi < xi, i = 1, 2, . . . ] = lim_{n→∞} Π_{i=1}^n FXi(xi), where FXi(·) is the distribution of Xi. Verify that F(x) > 0 or F(x) = 0 accordingly as Σ_{i=1}^∞ (1 − FXi(xi)) < ∞ or = ∞, and that there exist distributions FXi, each with support R, such that {x : F(x) = 0} contains a cylinder set with base in finite dimensions of finite positive volume. [This problem strengthens the assertion of equation (2) of Section 2.]

4. The following elementary limit relations will be used in the ensuing work as needed, and the reader should verify them to gain facility. If Xn, Yn, n ≥ 1, are sequences of random variables on some (Ω, Σ, P), then we write Xn →P X (and Yn →D Y) if for each ε > 0, lim_{n→∞} P[|Xn − X| > ε] = 0 (and FYn(x) = P[Yn < x] → FY(x) = P[Y < x] for each x that is a continuity point of FY), so that Xn → X in probability (and Yn → Y in distribution). A sequence {Xn, n ≥ 1} is bounded in probability if for each ε > 0 there is a constant Mε > 0 such that lim sup_{n→∞} P[|Xn| ≥ Mε] ≤ ε. If Xn − Yn →P 0, then one also writes Xn =P Yn. The following statements hold for the Xn and Yn sequences.
(a) Xn →P X ⟹ Xn →D X, and the converse implication holds iff X is a constant random variable.
(b) Xn →D X ⟹ {Xn, n ≥ 1} is bounded in probability.
(c) (Slutsky) Xn →D 0 and {Yn, n ≥ 1} bounded in probability ⟹ XnYn →P 0; and if Yn →D Y, then (Xn + Yn) →D Y.
(d) (Slutsky) Xn →D X, Yn →D a (a ∈ R) ⟹ XnYn →D aX, and if a ≠ 0 then Xn/Yn →D X/a; also Xn →P X, Yn →P Y ⟹ XnYn →P XY, and Xn/Yn →P X/Y whenever P[Y = 0] = 0.
(e) Let αn ↓ 0, βn ↓ 0 and 0 ≠ b ∈ R. If now Xn/αn →D X and (Yn − b)/βn →D Y, show that Xn/(αnYn) →D X/b. [Hint: Apply (d).]
(f) Xn →D X, Xn =P Yn ⟹ Yn →D X.
(g) If (Xn, Yn) →D (X, Y), so that FXn,Yn(x, y) → FX,Y(x, y) at all continuity points (x, y) ∈ R² of F, and if P[Y = 0] = 0, then Xn/Yn →D X/Y; and generalize this as: if (x, y) → h(x, y) is any real Borel function on R² such that its discontinuity points have F-measure zero, then h(Xn, Yn) →D h(X, Y). [Thus h(x, y) = x/y includes the preceding case, and the statement also contains (b)–(d).]

5. Let F be the set of all distributions on R. Define the Lévy metric d on F × F as:

d(F, G) = inf{ε > 0 : F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε, x ∈ R}.

Then verify (or accept the fact) that (F, d) is a complete metric space, and show that the set of discrete distributions in F is everywhere dense in this metric. [Hint: Observe that Xn →D X ⇔ d(FXn, FX) → 0 as n → ∞, and that for any random variable X there exists a sequence Xn of simple random variables such that Xn(ω) → X(ω), ∀ω ∈ Ω.]
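[The Lévy metric of Exercise 5 is easy to approximate numerically; the following sketch brackets the infimum by bisection on a finite grid standing in for all of R, with the two Gaussian distributions being illustrative choices:

# Approximate Levy distance d(F, G) of Exercise 5 on a grid.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 4001)

def levy(F, G, tol=1e-6):
    def ok(eps):                       # the defining condition at level eps
        return np.all((F(x - eps) - eps <= G(x)) & (G(x) <= F(x + eps) + eps))
    lo, hi = 0.0, 1.0
    while not ok(hi):                  # find an upper bracket first
        hi *= 2.0
    while hi - lo > tol:               # bisect toward the infimum
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

F = lambda t: norm.cdf(t, loc=0.0)
G = lambda t: norm.cdf(t, loc=0.5)
print("d(F, G) ~", round(levy(F, G), 4))

Since d metrizes convergence in distribution (the hint above), shrinking the location shift makes the computed distance go to zero.]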
Bibliographical notes

Setting up a stochastic model to describe an experiment and analyzing its structural properties belong to the theory of probability, while drawing consequences from the model belongs to inference theory. However, new principles are needed here to draw conclusions, and they are often motivated by the various types of experiments under study. As such, these principles and their importance in practice may be subject to disagreement. This is part of the formulation of inference and must be kept in view for most of the work that follows. Here we shall concentrate on the mathematical analysis of the methodology, without subscribing to particular viewpoints and/or the resulting claims. Such disagreements appear in most applications, and inference practice is no exception.

The hypothesis testing and estimation problems have been discussed in various forms by R. A. Fisher [1], J. Neyman and E. S. Pearson [1-2], and a generalization in a decision-theoretic framework was given by A. Wald [4], who then identified it with J. von Neumann's game theory solution of 1929. The latter was first applied to economic behavior and detailed in the book by von Neumann and Morgenstern [1]. Thereafter these matters have been presented from different vantage points in the books by Lehmann [1], Wald [4], and especially (unhurriedly) in Blackwell and Girshick [1].
The latter account is also given with examples and helpful comments, and the following statement on page 121, at the beginning of serious applications, is of interest: "It was pointed out that no single principle thus far advanced seems to be compelling to insure a universal agreement on a rule .... However, while disagreement might exist on what to do in a given situation, it might be possible to get full agreement on what not to do." Thus it is a conservative view to approach the subject with this methodology.

The distinguishability of hypotheses is basic for inference, and it is not sufficiently emphasized in the literature, although an interesting problem was solved by A. Berger [1], and a related case by A. Berger and Wald [1]. We have included this work and emphasized its relevance in Sections 2 and 3. This property is always assumed in all of inference theory, whether or not it is stated explicitly in the following analysis. Using a slightly different definition, Hoeffding and Wolfowitz [1] have considered a similar problem for sequential decisions, based on independent and identically distributed observations, and obtained sharp conditions for distinguishability in that case.

The game theory work of von Neumann attracted the attention of several mathematicians and mathematical economists in the 1950s. The results for zero-sum two-person games apparently did not produce the hoped-for breakthroughs in economics, although they have clarified the structure of the underlying problems. [More complicated models need be (and are being) considered.] Its induction into statistics has created a similar response. This is especially highlighted by the fact that the two persons (experimenter and 'nature') in statistical games cannot both justifiably optimize their "payoffs". As a result more principles have been formulated, and many of these have not yet progressed to the stochastic process level (as the latter involves genuinely infinite collections of random variables). This is reflected even in Wald's statement quoted at the end of Section 5. We thus focus on the essential parts of inference theory outlined in this chapter that allow relatively deep mathematical analysis, in the way Linnik [1] considers a classical problem with statistical as well as analytic techniques, to be examined in the next chapter. The complements section here, and in the ensuing work, supplements the textual discussion and presents facts that are of some general interest as well as of use in certain applications in later studies. In particular, Exercise 6.4 is such a result, and it is based on material from Mann and Wald [2] and Chernoff [1].
Chapter II Principles of Hypothesis Testing
This chapter is devoted to some serious aspects of hypothesis testing problems, including both the simple and composite cases. These consist of the fundamental lemma of Neyman-Pearson, in its abstract version due to Grenander, a few of its applications, and a technique for reducing composite hypotheses by means of weights. The latter contains a detailed Bayes methodology with iterated priors and some uniformity conditions that admit extensions to stochastic processes. Some of these considerations are classical, but they are seen to allow sharper analysis, in contrast with a use of the general (decision) theory, and they are examined carefully in the first four sections, which also contain vector analysis approaches. It may be noted that this work demands the employment of deeper mathematical tools in solving some fundamental questions, such as the Behrens-Fisher problem, and this is detailed in the fifth section. The last section is devoted to complementing the preceding results, with exercises often given with hints.
2.1 Testing simple hypotheses

Let us start with distinguishable simple hypotheses H0 and H1. As indicated in Exercise 1.6.1, if the underlying probability measures are distinct (the only nontrivial case), then they are always distinguishable. Consequently, we can present conditions that minimize the probabilities of errors of types I and II. More precisely, we establish the following quite general and fundamental result, based on a key special case due to Neyman and Pearson [2]; the abstract version is established by Grenander [1]. This result will be considered in various forms, and its long shadow over different parts of mathematics will be discussed, to bring its deep and enduring significance into focus.
1. Theorem. If P0 and P1 are distinguishable probability measures on a measurable (also sometimes called, conveniently but inappropriately, "sample") space (Ω, Σ), representing the "simple" hypothesis H0 : {P0} and the (simple) alternative H1 : {P1}, then a "best critical region" A0 ∈ Σ of a given "size" at most α (0 < α < 1) exists and has maximal "power". Mathematically, this means P0(A0) ≤ α, and for all A ∈ Σ satisfying P0(A) ≤ P0(A0) one has P1(A0) ≥ P1(A).

Proof. The argument actually gives a construction of the desired region A0. Since no restrictions are placed on the probability measures P0 and P1 (except the exclusion of the trivial case that they are identical), the tool at our disposal is the Lebesgue-Radon-Nikodým (or LRN-) theorem. The set A0 is obtained in two different ways using this theorem. Both are given below, as they are useful in future extensions.

By the LRN-theorem noted above, the probability measure P1 can be uniquely decomposed as P1 = P1c + P1s relative to P0, where P1c is P0-continuous and P1s is P0-singular, so that there is a set B0 ∈ Σ on which P1s lives while P1c lives on B0c (= Ω − B0). Moreover the Radon-Nikodým (or RN-) derivative f = dP1c/dP0 exists P0-uniquely and has support in B0c. Thus consider

Ak0 = {ω : f(ω) ≥ k} ∪ B0 (∈ Σ),        (1)

where k is chosen such that P0(Ak0) ≤ α, which is possible since P0(Ak0) → 0 as k → ∞ by the finiteness of P0. With such a k ≥ 0, Ak0 = A0 is asserted to be the desired region. Before verifying this, we present a slightly different construction of Ak0.

Let μ = P0 + P1, which is a finite measure on Σ dominating both Pi, i = 0, 1. Then μ = P0 + P1c + P1s, and all three measures on the right are absolutely continuous relative to μ. This is the only property of μ that is used, and any σ-finite measure dominating both P0, P1 will suffice for the following argument. Let f0 = dP0/dμ, g1 = dP1c/dμ, and g2 = dP1s/dμ be the RN-derivatives, so that g = dP1/dμ = g1 + g2, a.e., holds. By the Lebesgue decomposition property, the class of μ-null sets is contained in the class of P0-null sets, so that the support of g1 is a.e. a subset of the support of f0, and the supports of g2 and f0 have an intersection of μ- (hence P0-) measure zero. This implies

B0 = {ω : (g2/f0)(ω) > 0} = {ω : (g2/f0)(ω) = ∞}.

Moreover, by the chain rule for RN-derivatives, one has

f = dP1c/dP0 = g1/f0, a.e. [μ].        (2)
21
2.1 Testing simple hypotheses
Thus for any k ≥ 0,

{ω : g(ω) ≥ kf0(ω)} = {ω : ((g1 + g2)/f0)(ω) ≥ k}
                    = {ω : (g1/f0)(ω) ≥ k} ∪ {ω : (g2/f0)(ω) > 0}
                    = {ω : (g1/f0)(ω) ≥ k} ∪ B0 = Ak0.        (3)
Then (1) and (3) imply that A0 (= Ak0) is uniquely defined by the ratio g/f0 and does not depend on the auxiliary (dominating) measure μ.

It is to be shown that A0 is the best critical region. Let A ∈ Σ be any other event such that P0(A) ≤ P0(A0) ≤ α. Then, using the form (3) with μ dominating both P0 and P1 for convenience, consider Ak0 = (Ak0 − A ∩ Ak0) ∪ (A ∩ Ak0), a disjoint union, so that

P1(Ak0) = P1(Ak0 − A ∩ Ak0) + P1(A ∩ Ak0)
        = ∫_{Ak0 − A∩Ak0} g dμ + P1(A ∩ Ak0)
        ≥ k ∫_{Ak0 − A∩Ak0} f0 dμ + P1(A ∩ Ak0), using (3) for Ak0,
        = kP0[Ak0 − A ∩ Ak0] + P1(A ∩ Ak0)
        = k[P0(Ak0) − P0(A ∩ Ak0)] + P1(A ∩ Ak0)
        ≥ k[P0(A) − P0(A ∩ Ak0)] + P1(A ∩ Ak0), since P0(A) ≤ P0(Ak0),
        = kP0(A − A ∩ Ak0) + P1(A ∩ Ak0)
        ≥ ∫_{A − A∩Ak0} g dμ + P1(A ∩ Ak0), by (3) on (Ak0)c,
        = P1(A − A ∩ Ak0) + P1(A ∩ Ak0) = P1(A).        (4)
Thus Ak0 is the best (= most powerful) critical region. Note that if P0 ⊥ P1, then in the above P0(A0) = P0(B0) = 0 and P1(B0) = 1. [Theorem 1 holds for any finite measures, as the proof shows.]

If (Ω, Σ) is specialized to (Rn, B) and P0, P1 are distributions on Rn, both absolutely continuous relative to the Lebesgue measure with densities f0 and g (i.e., μ = Leb. meas.), then we can use (3) to describe the critical region through these densities. Note that in this case one can find a k ≥ 0 such that P0(Ak0) = α for each given 0 < α < 1.
The same form of the critical region holds if P0, P1 are discrete distributions with densities f0, g relative to the counting measure μ, although the last equality for a given α need not hold for any k ≥ 0. Thus the above result reduces to the following form, which was originally discovered through calculus-of-variations methods.

2. Corollary. (Neyman and Pearson) If F, G are a pair of distributions on Rn with respective (Lebesgue) densities f, g, then for each 0 < α < 1 the set Ak0 = {x ∈ Rn : g(x) ≥ kf(x)} is the best critical region of size α for the distinguishable H0 : {F} and H1 : {G}, having maximum power.

It must be noted that all the subsequent generalizations to composite hypotheses and other (vector) modifications are essentially motivated by (and based on) this result and its method of proof. To emphasize its significance, we now present some alternative forms, simple extensions, and interesting applications. Indeed the comparison of the measures P0 and P1, via the existence of a most powerful region, has several nontrivial and interesting consequences in mathematics, and some of them will be given below for a proper appreciation. Moreover, this important extension of the Neyman-Pearson lemma, due to Grenander [1], leads to certain essential parts of abstract as well as concrete results in mathematics, including control theory. Some of these results will be discussed.

Consider the indicator function ϕk = χ_{A0} of (3), where A0 = Ak0. Then 0 ≤ ϕk ≤ 1, and it can be considered as a critical (or "test") function. Let C be the class of all measurable functions ψ : Ω → [0, 1], termed the critical or test class. Then ϕk ∈ C, and the above theorem can be stated in the following alternative form. The members ψ of C which can take all values of [0, 1] are termed randomized test functions, and those which take only the two extreme values, as ϕk here, are called nonrandomized test functions.

3. Theorem. Let H0 = {P0} and H1 = {P1} be distinguishable simple hypotheses and C be the class of all critical functions. Then for any 0 < α < 1 and k ≥ 0 chosen so that β = E0(ϕk) ≤ α, one has, for all ϕ ∈ C with E0(ϕ) ≤ β, the relation E1(ϕ) ≤ E1(ϕk); moreover, ϕk is essentially unique in the sense that any other "most powerful" ϕ0 ∈ C satisfies {ω : ϕ0(ω) ≠ ϕk(ω)} ⊂ {ω : (dP1/dP0)(ω) = k}. Here Ei(f) = ∫_Ω f dPi, i = 0, 1, the expected value of f.
Proof. The argument is just a translation of the above result. It will be sketched for completeness. Thus let ϕ ∈ C be any test function such that E0(ϕ) ≤ β = E0(ϕk) ≤ α. Rewriting the preceding proof with f0 = dP1/dP0, and noting that on Ak0 one has f0 ≥ k and ϕk − ϕ ≥ 0, while on (Ak0)c one has ϕk − ϕ = −ϕ ≤ 0, it follows that

∫_Ω (ϕk − ϕ)(f0 − k) dP0 = ∫_{Ak0} (1 − ϕ)(f0 − k) dP0 + ∫_{(Ak0)c} (−ϕ)(f0 − k) dP0 ≥ 0.        (5)

Hence

E1(ϕk − ϕ) = ∫_Ω (ϕk − ϕ) dP1 ≥ k ∫_Ω (ϕk − ϕ) dP0 ≥ 0,

as asserted. The last statement is obtained from this easily.
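[A concrete instance of Theorem 1 and Corollary 2 — a sketch, with the two Gaussian densities and the size α as illustrative assumptions. Since g/f is increasing here, the region {g ≥ kf} is a half-line, and its power can be compared with any competing region of equal size:

# Best critical region {x : g(x) >= k f(x)} for f = N(0,1), g = N(1,1).
from scipy.stats import norm

f = lambda x: norm.pdf(x, loc=0.0)     # density under H0
g = lambda x: norm.pdf(x, loc=1.0)     # density under H1
alpha = 0.05

c = norm.ppf(1 - alpha, loc=0.0)       # {g >= k f} = {x >= c}; size alpha
k = g(c) / f(c)                        # the Neyman-Pearson constant
power = 1 - norm.cdf(c, loc=1.0)       # P1 of the best critical region

c2 = norm.ppf(1 - alpha / 2)           # a competing two-sided region {|x| >= c2}
power2 = (1 - norm.cdf(c2, loc=1.0)) + norm.cdf(-c2, loc=1.0)

print(f"k = {k:.3f}, NP power = {power:.4f}")
print(f"equal-size two-sided region: power = {power2:.4f} (no larger)")

The printed comparison illustrates the inequality P1(A0) ≥ P1(A) of Theorem 1.]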
For an application of the results, we recall that a measure μ : Σ → R̄+ is said to have the Darboux property if for any A ∈ Σ and any β > 0 such that μ(A) ≥ β there exists a B ∈ Σ(A) = {A ∩ C : C ∈ Σ} satisfying μ(B) = β. If A ⊂ Σ is an algebra and μi : A → R̄+, i = 0, 1, are a pair of measures, we say that, for each α ∈ R+, a set A ∈ A is of size α if μ0(A) = α, and it has power μ1(A). As an abstraction of Corollary 2, one says that A ∈ A is most (least) powerful of size α if μ0(A) = α > 0 and for any B ∈ A of size α it is true that μ1(B) ≤ (≥) μ1(A). By induction, the Darboux property extends to any finite collection, i.e., if A ∈ A, μ0(A) = α > 0 and Σ_{i=1}^n αi = α, αi > 0, then there exist Ai ∈ A, A = ∪_{i=1}^n Ai, such that μ0(Ai) = αi. Also if A is most powerful, A1 ∈ A(A), μ0(A1) = α1 ≤ α, and B ∈ A(Ac) with μ0(B) = α1, then we have:

μ1(B) ≤ μ1(A1).        (6)

This holds since (A − A1) ∪ B has size α, and A being most powerful we must have μ1((A − A1) ∪ B) ≤ μ1(A), hence μ1(A) − μ1(A1) + μ1(B) ≤ μ1(A), as asserted.

The Darboux property has other interesting implications in this context, and we discuss some of them now. [The basic structure of measures with this property can be found, for instance, in Dinculeanu [1], pp. 25-31, where it is shown to be more general than nonatomicity. Evidently the Lebesgue measure in Rn has the Darboux property.] Since the least powerful case is obtained by reversing the inequalities in the most powerful procedure, only the latter will usually be demonstrated in what follows. Let us start with a combinatorial assertion on partitioning of elements from the algebra A.

4. Lemma. Let μ0, μ1 be a pair of (not necessarily finite) measures on an algebra A ⊂ Σ, of which μ0 has the Darboux property on A; and suppose A ∈ A is a most powerful region of size α > 0. If A1 ∈ A(A) is of size β ≤ α and B ∈ A(Ac) is of size kβ, k ≥ 1, then μ1(B) ≤ kμ1(A1).
Proof. In case μ1(B) = ∞, then by the Darboux property of μ0 we can clearly find an element B1 ∈ A(B) of size μ0(B1) ≤ β such that μ1(B1) = ∞; and since A1 ⊂ A, the most powerful region by hypothesis, we get by (6) that μ1(B1) ≤ μ1(A1) ≤ μ1(A), so that μ1(A1) = ∞, and the result holds.

So let μ1(B) < ∞, and let ε > 0 be given. Using the Darboux property of μ0, we can partition B as B = ∪_{i=1}^n Bi, where n > [1/ε] (the latter being the integral part) and μ0(Bi) = kβ/n. Then for at least one Bi, say B1 after relabeling if need be, we have μ1(B1) ≤ εμ1(B). For otherwise μ1(B) = Σ_{i=1}^n μ1(Bi) > nεμ1(B) ≥ μ1(B), by the choice of n, a contradiction. Thus if 0 ≤ δ ≤ kβ/n, then one can find a D ∈ A(B) (D is such a B1 above) satisfying μ0(D) = δ and μ1(D) ≤ εμ1(B). Express kβ = m1δ + δ1 (0 < δ1 < δ), where m1 is an integer. If Ai ∈ A are chosen to satisfy μ0(Ai) = δ, consider A∗ = ∪_{i=1}^{m1} Ai and B∗ = ∪_{i=1}^{m1} Bi∗ ∪ D, where μ0(Bi∗) = δ, μ0(D) = δ1, and μ1(D) ≤ εμ1(B∗). All this is possible by the above reduction. If p = min_{1≤i≤m1} μ1(Ai), then p ≥ μ1(Bi∗) for all i by (6). Consequently

μ1(B∗) ≤ m1p + εμ1(D) ≤ kμ1(A∗) + εμ1(D).        (7)

Since ε > 0 is arbitrary, (7) implies the desired inequality.

We can now present an extension of (6) as follows.

5. Theorem. Let A ∈ A be a most powerful region of size α, 0 < α < ∞, for the (possibly σ-finite) measures μ0, μ1 on A, of which μ0 has the Darboux property. If A1, . . . , An ∈ A, μ0(Ai) = αi, (p1, . . . , pn) is a discrete probability distribution, and Σ_{i=1}^n piαi = α, then Σ_{i=1}^n piμ1(Ai) ≤ μ1(A).

Proof. The idea of the proof is to reduce it to the case where all Ai ⊂ A or all Ai ⊃ A, so that in the resulting form all the Ai will have the same size as A, and the assertion will then follow easily. We present the computations in two steps, observing that some pi can be zero.

1. Since no particular inclusions are given between the sets, consider the pair A − A1, A1 − A and their measures β1 = μ0(A − A1), β2 = μ0(A1 − A). If β1 ≥ β2, then for any A∗ ∈ A(A − A1) ⊂ A(A1c) of size β2 we have μ1(A1) = μ1(A ∩ A1) + μ1(A1 − A) ≤ μ1(A ∩ A1) + μ1(A∗), by (6), since A∗ ⊂ A − (A1 ∩ A). Thus, replacing A1 by A∗ ∪ (A ∩ A1) if necessary, we may assume that A1 ⊂ A in this case. Similarly, if β1 < β2, let A∗ ∈ A(A1 − A) ⊂ A(Ac) be of size β1. Then we may replace A1 − A by A∗, so that for each i we can assume that either Ai ⊂ A or Ai ⊃ A. Likewise we continue for all the Ai.
2. The next reduction is to show that the above inclusions may be taken to hold simultaneously for all i. Indeed, consider the sets A1 ⊃ A, A2 ⊂ A. Let β1 = μ0(A1 − A), β2 = μ0(A − A2). If p1β1 ≥ p2β2, then let A1′ = A1 − A∗, where A∗ ∈ A(A1 − A) is of size p2β2/p1, and let A2′ = A. Then we have

p1μ0(A1′) + p2μ0(A2′) = p1[μ0(A1) − μ0(A∗)] + p2[μ0(A − A2) + μ0(A2)]
                      = p1μ0(A1) − p2β2 + p2μ0(A2) + p2β2
                      = p1μ0(A1) + p2μ0(A2).        (8)

Hence we get for the μ1-calculation

p1μ1(A1′) + p2μ1(A2′) = p1μ1(A1 − A∗) + p2μ1((A − A2) ∪ A2)
                      = p1μ1(A1) + p2μ1(A2) − p1μ1(A∗) + p2μ1(A − A2)
                      ≥ p1μ1(A1) + p2μ1(A2),        (9)

since p1μ1(A∗) ≤ p2μ1(A − A2) by Lemma 4. In case p1β1 < p2β2, we proceed similarly, replacing A1 by A and taking A∗ ∈ A(A − A2) of size p1β1/p2; letting A2 be changed to A2 ∪ A∗, one gets the same inequality. Now one can iterate the procedure until either all Ai ⊃ A or all Ai ⊂ A, so that by (8) we have the equation Σ_{i=1}^n piμ0(Ai) = α, and since the same inclusion holds for all Ai it follows that μ0(Ai) = α, i ≥ 1. Hence, using the hypothesis that A is the most powerful region, one gets from (9) the inequality Σ_{i=1}^n piμ1(Ai) ≤ μ1(A), as asserted.

This result may be immediately extended to a countable collection of regions.

6. Corollary. Let A, Ai ∈ A, i ≥ 1, and μ0, μ1 be as in the theorem. If {pi, i ≥ 1} is a discrete probability distribution such that Σ_{i=1}^∞ piμ0(Ai) = μ0(A), where A is a most powerful region, then again one has:

Σ_{i=1}^∞ piμ1(Ai) ≤ μ1(A).        (10)
Proof. The nontrivial case is when μ1(A) < ∞. But then for any finite integer N, since A is most powerful, the above theorem implies Σ_{i=1}^N piμ1(Ai) ≤ μ1(A) [by dropping some terms on the left there], so that the series in (10) is convergent and is bounded by μ1(A), which does not depend on N. Hence the result follows on letting N → ∞.
As observed before, by reversing the inequalities in the preceding argument, the corresponding statements are valid for least powerful regions. The above corollary raises the question as to whether the result holds for an arbitrary collection of critical regions. The following assertion provides an answer, which will be utilized for some interesting applications.

7. Theorem. Let C ⊂ Σ be an algebra and S be a set indexing all the distinct elements of C. Suppose that 𝒮 is a σ-algebra of subsets of S containing all finite sets, and μi : C → R̄+, i = 0, 1, are a pair of (distinct) measures such that μ0 has the Darboux property. If the mappings s → μi(As), s ∈ S, i = 0, 1, are 𝒮-measurable for all As ∈ C, and P : 𝒮 → R+ is a probability such that for a given 0 ≤ α < ∞ one has

∫_S μ0(As) dP(s) = α,        (11)

then for any most (least) powerful region A ∈ C of size α (i.e., μ0(A) = α) it follows that

∫_S μ1(As) dP(s) ≤ (≥) μ1(A).        (12)
Proof. We consider the most powerful A (its existence follows from Theorem 1); the least powerful case, being similar, is left to the reader. For nontriviality, it will be assumed that μ1(A) < ∞ in (12). Now α = 0 implies μ0(As) = 0 for a.a. (s)[P], and A also has size zero. Further, A being most powerful, μ1(As) ≤ μ1(A) for a.a. (s). Integrating this inequality relative to P on S one gets (12) in this case. So let α > 0.

Since the integrals in (11) and (12) are finite, they can be approximated by elementary functions (cf., e.g., Rao [17], p. 140). Thus, given 0 < ε < 1 and ε1 > 0, there exists a finite or countable measurable partition {Si, i ≥ 1} of S such that for some 0 < η1 < ε1 and si ∈ Si one has for (12)

∫_S μ1(As) dP(s) = Σ_{i≥1} μ1(Asi)P(Si) − η1,        (13)

and for (11)

α = ∫_S μ0(As) dP(s) = Σ_{i≥1} μ0(Asi)P(Si) + η, |η| ≤ εα.        (14)
This can be done by refining the two sequences obtained for the two integrals and adjusting the error terms η, η1. Now if η > 0, we can use (10) for regions of total weighted size (α − η), with a most powerful region of that size contained in A; and if η < 0, then apply the same result for a most powerful region A∗ ∈ C of size (α + |η|), A∗ ⊃ A. Thus in either case one gets for (13):

∫_S μ1(As) dP(s) ≤ Σ_i μ1(Asi)P(Si) ≤ μ1(A∗)
                 = μ1(A) + μ1(A∗ − A) ≤ μ1(A) + εμ1(A),        (15)

since μ1(A∗) − μ1(A) < α + |η| − α ≤ εα ≤ εμ1(A), by Lemma 4.
Now, ε > 0 being arbitrary, (12) follows.

Remark. This theorem, as well as the preceding one, used the Darboux property crucially in the proofs. Actually, the results do not hold without some such hypothesis, as simple counterexamples show.

An application of (12) gives several known inequalities. The key problem here is to find a most (least) powerful region, and Corollary 2 will often be useful in this task. First we derive an interesting consequence for the purpose.

8. Proposition. Let F : [a, ∞) → R+ be an absolutely continuous function with derivative F′ ≥ 0 nonincreasing (nondecreasing). If X : Ω → [a, ∞) is a random variable with probability distribution G, then one has

E(F(X)) = ∫_a^∞ F(x) dG(x) ≤ (≥) F(∫_a^∞ x dG(x)) = F(E(X)).        (16)
Proof. We again prove the result for one case and leave the parenthetical part to the reader. [In fact these are the Jensen inequalities for concave and convex functions respectively, and the proof may be compared with the classical arguments usually found in the literature, cf., e.g., Rao [15], p. 15.] To deduce (16) from (11) and (12), choose μ0, μ1 as: μ0((a, x)) = x − a; μ1((a, x)) = F(x) − F(a), x ≥ a, both of which may be extended to Lebesgue-Stieltjes measures on [a, ∞) (cf., e.g., Rao [17], p. 95). In order to use (12), it is necessary to find a most powerful region of a given size α = (x − a) (say).
Since F′ ≥ 0 is nonincreasing, the graph of F does not "hold water" (i.e., it increases slowly). Hence the most powerful size α region is the interval [a, x) itself, since the increments of F on other sets of size α are no larger than on this one. [One can also find the 'best' region using Corollary 2 with g(x) = F′(x) and f(x) = 1 there, but it is simpler to exploit the geometric property just noted.] Thus for (12) consider Aω = (a, X(ω)) and S = [a, ∞). Then

∫_Ω μ0(Aω) dP(ω) = ∫_Ω (X(ω) − a) dP(ω) = ∫_a^∞ s dG(s) − a = E(X) − a = α = (x − a),

so that (X taking values in [a, ∞)) E(X) = ∫_a^∞ s dG(s) = x. Hence

E(F(X) − F(a)) = ∫_a^∞ (F(s) − F(a)) dG(s)
               = ∫_a^∞ μ1([a, s]) dG(s)
               ≤ μ1([a, x]) = F(x) − F(a).        (17)
Since now G(∞) − G(a) = 1, (17) gives (16) after cancelling the finite number F(a) on both sides. The second part is similar.

An intriguing consequence of the idea of a most powerful test is included in Theorem 7, and the inequalities in (16) actually characterize (see below) the concavity and convexity of the function F. Let us first show how the Liapounov and Hölder inequalities are implied by (16), and then present the stated characterization.

Applications. 1. In the above proposition, let a = 0 and F(x) = x^k, 0 ≤ k < 1, so that F satisfies its hypothesis. Let X be a positive random variable. Then (16) becomes

E(X^k) ≤ (E(X))^k.        (18)

Letting k = p/v, 0 < p < v, and taking X = Z^v, where Z is another positive random variable, (18) gives the Liapounov inequality (cf., e.g., Cramér [1], p. 255):

[E(Z^p)]^{1/p} ≤ [E(Z^v)]^{1/v}.        (19)

2. The preceding inequality implies the Hölder inequality for all positive exponents. In fact, let Z be a discrete random variable with P[Z = zi] = pi, so that (19) becomes, with p = 1,

Σ_i pi zi ≤ (Σ_i pi zi^v)^{1/v}.
Setting v′ = v/(v − 1), pi = ti^{v′}/Σ_j tj^{v′}, and zi = yi/ti^{v′−1}, one gets after an easy simplification

Σ_i ti yi ≤ (Σ_i yi^v)^{1/v} (Σ_i ti^{v′})^{1/v′}.        (20)

From this, by a standard approximation of the Lebesgue integral, the general case for functions obtains.
If instead v = 1, so that 0 < p < 1 in (19), a similar simplification gives the corresponding (lower) Hölder result, with the inequality in (20) reversed.

As noted already, the inequality (16) is obtainable for concave (convex) functions. But the characterization of the latter functions generally depends on the deep Lebesgue-Vitali differentiation theorem (cf., e.g., Rao [17], p. 242). Here we sketch a proof based on Theorem 7 and the present point of view.

9. Theorem. Let F : R+ → R be a continuous increasing function with F(0) = 0. Then F is concave iff it is subadditive, i.e.,

F(x + y) ≤ F(x) + F(y), ∀x, y ∈ R+,        (21)

and F is convex iff it is superadditive, i.e.,

F(x + y) ≥ F(x) + F(y), ∀x, y ∈ R+.        (22)
Proof. Suppose that (21) holds; then 0 ≤ (F(x + y) − F(x))/y ≤ F(y)/y, y > 0. This implies that the difference quotients (hence the right and left derivatives) of F at every x are bounded by that at x = 0. Moreover, if μ1((x, y)) = F(y) − F(x) ≥ 0, then μ1((x, x + y)) ≤ μ1((0, y)) for all x, by the preceding statement. As in the last proposition, it then follows that [0, y] is the most powerful critical region of a given size y for the problem, since F(0) = 0, where one takes μ0 as the Lebesgue measure and μ1 as defined in terms of F just indicated. Then for any random variable X : Ω → R+ the inequality (16) obtains. In particular, if X takes just two values x1, x2 with P[X = xi] = pi, i = 1, 2, p1 + p2 = 1, then (16) gives F(p1x1 + p2x2) ≥ p1F(x1) + p2F(x2), so that F(·) is concave.

Conversely, let F(·) be concave and increasing, and let X be a two-valued random variable as above with x1 = 0, x2 = x > 0. Then 0 ≤ F′ and (16) applies, so that F(p2x) ≥ p2F(x), 0 < p2 ≤ 1. Dividing by p2 and writing q = 1/p2 ≥ 1, and replacing x by y, this gives qF(y) ≥ F(qy). Thus for any x > 0 and z = p2x, 0 ≤ p2 < 1, one gets F(x + z) = F((1 + p2)x) ≤ (1 + p2)F(x)
= F(x) + p2F(x) ≤ F(x) + F(p2x) = F(x) + F(z). This implies (21), since x and 0 ≤ p2 < 1 are arbitrary. The second assertion is similarly established with (16).

Remark. In all the work on obtaining most powerful critical regions of size α, the Darboux property of μ0 is employed. Moreover, the proof of inequality (12) uses an adjustment of the approximation based on this property. However, if μ0 does not have this property, then easy counterexamples can be constructed to show that (12) need not hold anymore. Note also that there are integral representations of convex (concave) functions which use the Lebesgue differentiation theorem and are nontrivial (cf., e.g., Rao [17], Thm. 5.2.10). The characterizations given by (21) and (22) are the most useful in applications.
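[A numerical illustration of Theorem 9 — a sketch, with F(x) = x^0.6 an arbitrary continuous increasing choice with F(0) = 0:

# F(x) = x**0.6: check subadditivity (21) and two-point concavity from (16).
import numpy as np

F = lambda x: x ** 0.6
rng = np.random.default_rng(1)
x, y, p = 10 * rng.random(100_000), 10 * rng.random(100_000), rng.random(100_000)

tol = 1e-12                                 # guard against rounding noise
print("subadditive:", np.all(F(x + y) <= F(x) + F(y) + tol))
print("concave    :", np.all(F(p*x + (1-p)*y) >= p*F(x) + (1-p)*F(y) - tol))

Replacing the exponent 0.6 by one larger than 1 reverses both properties, in line with (22).]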
In the preceding work, not only are the hypotheses H0, H1 simple, but they are described by scalar measures. One can ask whether these results hold if they consist of signed or vector-valued set functions. Several complications arise, and only certain weaker conclusions can be given. We now discuss briefly some of these results.

Recall that if μ, ν : Σ → R are σ-additive, then ν is μ-continuous if their variations |ν|, |μ| have that property (cf., e.g., Rao [17], p. 227); or equivalently, |μ|(A) = 0 ⇒ |ν|(A) = 0, A ∈ Σ. Thus ν ≪ μ means ν ≪ |μ|. Then by the RN-theorem dν/dμ = (dν/d|μ|)·(d|μ|/dμ) = g·f, where (with g = dν/d|μ|, f = d|μ|/dμ) |f| = 1, a.e. [μ]. Also a function h is μ-integrable iff it is |μ|-integrable. Moreover the mapping T : h → ∫_Ω h dμ = ∫_Ω hf d|μ| is well-defined and linear, with ‖h‖1 = ∫_Ω |h| d|μ| (by definition). [However, T need not be order preserving.] The statement of Theorem 1 now takes the following form.

10. Proposition. Let μ0, μ1 : Σ → R be a pair of signed (= real) measures such that |μ0| and |μ1| are distinguishable and μ1 ≪ μ0. Then for each 0 ≤ α ≤ |μ0|(Ω), the set Ak0,α defined by

Ak0,α = {ω : f1(ω) ≥ kf0(ω)} (∈ Σ), |μ0|(Ak0,α) ≤ α,        (23)

where fi = dμi/d|μ0|, i = 0, 1, has the property that for all A ∈ Σ with |μ0|(A) = |μ0|(Ak0,α) one has μ1(Ak0,α) ≥ μ1(A), i.e., Ak0,α is the best critical region for the pair μi, i = 0, 1.
The proof is identical with that of Theorem 1, using the fact that integrals relative to |μ0| are order preserving to derive the last inequality for sets, μ1 being just a real measure. The details are left to the reader.
This form of the result is useful since it admits an extension when μ0 itself is a vector measure with n components. The problem will be restated for that purpose. This point of view is also employed for testing certain composite hypotheses to be discussed later, emphasizing the fundamental nature of the Neyman-Pearson theory.

Let μ0 = (μ01, · · · , μ0n) : Σ → Rn be a vector of signed measures and μ1 : Σ → R be another signed measure. If one is given an initial constraint μ0i(A) ≤ αi, i = 1, . . . , n, then it is desired to find a critical region A0 ∈ Σ, if it exists, such that for all A ∈ Σ satisfying μ0i(A) = μ0i(A0) we have μ1(A0) ≥ μ1(A); i.e., A0 is the region on which μ1 is maximized subject to the above constraint. This may be reformulated as the previous problem. Indeed let μ = Σ_{i=1}^n |μ0i|, so that μ0i ≪ μ. Suppose that μ1 ≪ μ0i, i = 1, . . . , n, and consider for any vector a = (a1, . . . , an) the scalar measure νa = Σ_{i=1}^n aiμ0i : Σ → R. Note that νa ≪ μ, and if f0i = dμ0i/dμ and fa = dνa/dμ, then fa = Σ_{i=1}^n aif0i and g = dμ1/dμ exist. Moreover μ1 ≪ νa ≪ μ holds. The problem can be solved simply by means of Proposition 10. A reformulated statement follows.
11. Proposition. Let μ0 : Σ → Rn, μ1 : Σ → R be nonatomic and distinguishable measures such that μ1 ≪ μ0, meaning μ1 ≪ μ0i, i = 1, . . . , n. If μ = Σ_{i=1}^n |μ0i| : Σ → R+ is the variation measure induced by the components of the vector measure μ0, then the region A0,a defined by

A0,a = {ω : g(ω) ≥ fa(ω)} = {ω : g(ω) ≥ Σ_{i=1}^n aif0i(ω)},        (24)
where the vector a = (a1, . . . , an) is chosen such that μ0i(A0,a) = αi, i = 1, . . . , n, necessarily satisfies μ1(A0,a) ≥ μ1(A) for all A ∈ Σ meeting the constraint νa(A) = Σ_{i=1}^n aiμ0i(A) = Σ_{i=1}^n aiαi.

As an application, the following result, essentially due to Neyman and Pearson [1,2], is obtained.

12. Proposition. Let Pθ0, Pθ : Σ → R+ be distinguishable probability measures for H0 : {Pθ0} and a composite alternative H1 : {Pθ, θ ∈ [a, b] − {θ0} ⊂ R}, absolutely continuous relative to a σ-finite nonatomic measure μ with densities fθ. Suppose that θ → fθ(ω) is smooth, in that it is n-times differentiable for each ω ∈ Ω and is dominated by a μ-integrable function h, i.e., |∂^k fθ(ω)/∂θ^k| ≤ h(ω), k = 1, . . . , n − 1. Then the best critical region A0,a is given as

A0,a = {ω : (∂^n fθ/∂θ^n)|_{θ=θ0} ≥ Σ_{i=0}^{n−1} ai (∂^i fθ/∂θ^i)|_{θ=θ0}},        (25)
where a = (a0, . . . , an−1) is a vector of constants chosen to satisfy the constraints ∫_{A0,a} fθ0 dμ = c and ∫_{A0,a} (∂^i fθ/∂θ^i)|_{θ=θ0} dμ = 0, i = 1, . . . , n − 1.

Proof. This follows from Proposition 11, by taking g = (∂^n fθ/∂θ^n)|_{θ=θ0} and f0,i+1 = (∂^i fθ/∂θ^i)|_{θ=θ0}, i = 0, . . . , n − 1, with α1 = c and αi = 0, i = 2, . . . , n. Then A0,a of (25) becomes (24), as desired.

Remarks. 1. It should be observed that the smoothness information is built into the best critical region, the case n = 1 having been included in Theorem 1. Also note that if Pθ ≪ Pθ0 for all θ, then fθ = dPθ/dPθ0, and one has fθ0 = 1 with μ = Pθ0 in this computation. The region given in Proposition 11 is motivated by this result, to solve a problem with a composite alternative. The general case, in which both H0 and H1 are composite, will be considered in the following sections.

2. In general the equation νa(A) = αa = Σ_{i=1}^n aiαi, used in the above proposition, may not be solvable if a μ0i is an arbitrary measure. The equation holds if all the μ0i (or μ) have no atoms. In this case μ0 : Σ → Rn has a range that can be precisely characterized. This was done by Liapounov [1], who showed that for any nonatomic vector measure the range in Rn is a compact convex set. The result is not true if μ has atoms, although its range is always a bounded set in any dimension. When applicable, it is quite powerful, and we have occasion (especially see Section 5 below) to invoke it for some important conclusions.

Further analysis of the problem, when both μ0 and μ1 are nonatomic, has been continued by several authors, who looked at the situation as one of optimization, since μ1(A) (= ∫_A g dμ, with μ = Leb. meas.) is maximized subject to the constraint μ0i(A) = αi. Now Proposition 11 gives only a sufficient condition, and one may ask whether it is also necessary; and then look for a generalization of the problem in which, moreover, μ1(A) is a vector in Rm for A ∈ Σ, the optimization being about ϕ ◦ μ1, where ϕ : Rm → R is a continuous function. These two questions have been treated by Dantzig and Wald [1], and by Chernoff and Scheffé [1], respectively. Here we briefly indicate their works, since there is an immediate generalization of the latter to optimal control of systems of measures in an arbitrary number of dimensions, leading to interesting mathematical studies. But the discussion will be limited here to relevant aspects of inference theory.

Assume that both μ0 and μ1 are nonatomic, so that the vector (μ1, μ0) has range identifiable with a set in Rn+1. By the cited Liapounov theorem the range is a compact convex set N, and the same is true of the range M of μ0 in Rn. The constraint c = μ0(A) fixes a point in M.
Identifying it as a vector (0, c) = (0, c1, . . . , cn) ∈ N ⊂ Rn+1, and noting that a line parallel to the (n + 1)th axis through this point meets N in a closed (bounded) interval, the maximum of μ1(A) occurs at an end point of this set, and (by convexity) it can be identified with the pointwise inequality. Thus Dantzig and Wald [1] use the (finite dimensional) geometric properties of the range space to prove the existence of the optimal set A0,a. The necessity of the condition (i.e., of the inequalities) in this case is to be obtained pointwise, and a measure-theoretic proof is not available for the purpose. So another geometrical argument, considerably involved, is presented by these authors to conclude that, for the asserted maximum, the inequalities must hold outside a set of (the (n + 1)-dimensional Leb.) measure zero.

Next, the case in which, under the same hypothesis of nonatomicity, both μ0 and μ1 are n- and m-vector measures with values in Rn and Rm was considered by Chernoff and Scheffé [1]. The maximization of ϕ ◦ μ1(A) is treated for ϕ : Rm → R a continuously differentiable (concave) function and A ∈ A = {A ∈ Σ : μ0(A) = c}, with μ1(A) a point in a given closed subset Z of Rm, extending the methodology of Dantzig and Wald. However, some essentially new problems arise, leading to a separate necessary and then a sufficient condition for its solution. The optimal (or best) region for this generalized problem is given by:

A0,a,b = {ω : Σ_{i=1}^m ai gi(ω) ≥ Σ_{j=1}^n bj fj(ω)},        (26)

where gi = dμ1i/dμ, fj = dμ0j/dμ, the μ being the dominating (here the standard Leb.) measure; the ai are the directional derivatives of ϕ at an interior point x0 in the range of μ1, and the bj are constants to be determined from the initial conditions. Note that if m = 1, ϕ(x) = x ∈ R, and the set Z = Rm itself, then (26) reduces to (24). When m > 1, the new problems and their resolution are detailed, along with an interesting application, in Chernoff and Scheffé [1]. Indeed, the significance of this paper is not fully appreciated in the literature, especially since it raises and answers, for the first time, a question of comparing two vector measures.

The above extended discussion and Proposition 12 prompt a study of the general problem when Pθ(A) is not necessarily differentiable in θ ∈ H1, but a composite alternative (along with the above H0) should be considered. In this case one may view the mapping θ → Pθ(A), A ∈ Σ, as a function from I = H0 ∪ H1 into B(I), the space of bounded measurable functions on I relative to a given σ-algebra containing all finite sets, which becomes a Banach space under the uniform norm. Now the space B(I) is not finite dimensional, and the hypothesis testing problem becomes P : (θ, A) → Pθ(A) for θ ∈ H0 and for θ ∈ H1, A ∈ Σ. Stated differently, this is just a pair of mappings P0 : Σ → B(H0) and P1 : Σ → B(H1), so that P0, P1 are vector-valued measures.
These will be denoted by μ, ν, to avoid confusion with probability measures. The testing problem then relates to a hypothesis (vector) measure μ and an alternative (vector) measure ν. This is exactly in accord with the Chernoff-Scheffé finite vector generalization of the Neyman-Pearson lemma, carried to infinite dimensions. Their restriction to finite vectors comes from the fact that theirs is one of the first such extensions, motivated by a natural application at hand. The general composite hypotheses case needs the full (infinite dimensional) version, since B(I) is finite dimensional only if I is finite.

We indicate a limitation of the previous techniques in this context. Observe that both the Dantzig-Wald and Chernoff-Scheffé works depend crucially on Liapounov's theorem on the range of vector measures, as well as on the (Euclidean) geometry available in those spaces. The boundedness of the range of a vector (just as of a signed) measure is valid (as an application of the classical uniform boundedness principle) in any Banach space. But even if the vector measures μ, ν have nonatomic variations (which need not be finite), their ranges in general are neither convex nor compact. Now Uhl [1] has shown that for range spaces of the vector measures μ, ν which are either reflexive or have separable adjoints (more generally, spaces having the so-called RN-property), with variation measures |μ|, |ν| finite and nonatomic, the norm closures of μ(Σ), ν(Σ) are convex and compact. He also gave counterexamples showing that μ(Σ) or ν(Σ) in L1([0, 1], Leb.) is neither convex nor compact. [A detailed discussion of the range problem of vector measures can be found in the last chapter of the book by Diestel and Uhl [1], and also in an extensive treatment in Kluvánek and Knowles [1].] In our case the range space B(I) does not have the RN-property, and thus additional difficulties arise.

One of the essential consequences of Liapounov's theorem is that the convexity and compactness properties assure that all the extreme points of the range set lie in the set itself. That property (along with the geometry) translates into the construction of the optimal regions given by the sets in (24)–(26). In the infinite dimensional case the following analog, due to Tweddle [1], is available.

13. Proposition. Let μ : Σ → X, an arbitrary Banach space, be a σ-additive function (i.e., a vector measure). Then an extreme point of the (norm) closed convex hull of the range μ(Σ) is in μ(Σ) itself. Moreover, if ∫_Ω f dμ, 0 ≤ f ≤ 1 (the integral here is defined in the sense of Dunford and Schwartz [1], IV.10), is an extreme point of the closed convex hull of μ(Σ), then f = χA for some A ∈ Σ, outside of a μ-null set B, meaning sup{‖μ(F)‖ : F ∈ Σ(B)} = 0.
A proof of this result is omitted here, since it will not be employed later; details may be found in the Diestel-Uhl book referred to above. In the particular cases considered previously, the set A has been characterized when X = Rn and μ is nonatomic. The corresponding results in the infinite dimensional cases do not hold, but such characterizations would be of interest in the present study. (See also Moedomo and Uhl [1] for some cases.) However these are not available. An extension of Theorem 1 for certain range spaces X, Y of μ, ν can and will be discussed here, to illuminate the significance of the problem. Note that X = B(H0), Y = B(H1) in our applications, which need not be the same Banach space, although this is not a real hurdle, as they may be considered as subspaces of their tensor product, for example.

Since the optimal critical region in Theorem 1 above was obtained by means of a Radon-Nikodým theorem, one may first look for an extension of the latter result to the case of vector measures. The point here is to identify the problem as that of a simple (vector) hypothesis versus a simple (vector) alternative. The difficulty is thereby shifted to finding the corresponding RN-densities. However, most of the literature on the subject concerns ν as a vector measure but μ a scalar. In finite dimensions the problem may be reduced to the scalar case by working with each component, as was done implicitly in the previous studies, but such a procedure is inadequate for the general study. The infinite dimensional case involves a procedure of the following type. If μ, ν : Σ → X are vector measures, then by a classical result due to Bartle, Dunford, and Schwartz (cf., Dunford-Schwartz [1], IV.10.5) there exists a finite positive measure λ on Σ which dominates both μ, ν, in the sense that the class of λ-null sets is contained in each of the classes of μ- and ν-null sets. [The null set of a vector measure is defined at the end of Prop. 13 above. That both μ, ν take values in the same space is not a real restriction, as already noted.] However, an additional assumption is needed to obtain μ, ν as indefinite integrals of certain X-valued functions relative to λ. The best possible condition for this is as follows. Consider the averaged range:

AE(μ) = {μ(B)/λ(B) : 0 < λ(B), B ∈ Σ(E)}, E ∈ Σ.        (27)
Then AE(μ) ⊂ X should be relatively compact, i.e., the closure of AE(μ) should be (norm) compact. If X is finite dimensional, then this is just boundedness of the set. The result is proved in this form by Maynard ([1], Theorem 8). Using this type of condition (hence μ(A) = ∫_A f dλ holds for a measurable function f : Ω → X with a separable range, called strongly measurable), one goes on to a Radon-Nikodým theorem of ν relative to a second vector measure μ, to obtain an analog of Theorem 1.
Thus, if μ, ν satisfy the condition that for each E ∈ Σ the sets AE(μ), AE(ν) are relatively compact, and ν ≪ μ in the sense that the class of ν-null sets contains the corresponding μ-null class, then there exists a (not necessarily unique) function T : ω → T(ω) ∈ B(X), the space of all bounded linear operators on X, with T(·)x : Ω → X strongly measurable for each x ∈ X, such that

ν(A) = ∫_A T(ω) dμ(ω)        (28)

holds. Here the integral is what is called the "Bartle bilinear integral". It was established (slightly) more generally by the author, in different ways, in (Rao [7] and [9]). Each of the methods involves a detailed technical computation. This result is employed for a certain subclass of Banach spaces, including the B(I) recalled after Proposition 13.

At this point, one considers the long-awaited abstraction of the Chernoff-Scheffé optimal region (26), for which we need a usable partial ordering in the spaces X admitted here. Thus let X be a Banach space whose adjoint space X∗ has a positive cone P in it (so X = B(I) is included). Then one says that a pair of elements x1, x2 ∈ X satisfies the order x1 ≺ x2 if for each x∗ ∈ P one has x∗(x1) ≤ x∗(x2). This is a partial order. Let K = {K ∈ B(X) : K∗(P) ⊂ P}. This set is not empty. With this (somewhat necessarily long) preamble, the following is a form of the infinite dimensional version of the Chernoff-Scheffé result, in Grenander's form of the extended Neyman-Pearson fundamental lemma.

14. Proposition. Let μ, ν : Σ → X be σ-additive and distinguishable, and let λ be a positive finite dominating measure for both, where X is a Banach space having a positive cone in its adjoint X∗ relative to which a partial order is definable in X as above. Suppose that AE(μ), AE(ν) are relatively compact subsets of X for each E ∈ Σ, where these "average range classes" over E are defined by (27), so that there exist strongly measurable f, g : Ω → X such that

μ(A) = ∫_A f dλ;  ν(A) = ∫_A g dλ,  A ∈ Σ,        (29)
hold. If ν ≪ μ and T(·) is an RN-derivative of ν relative to μ given by (28), then for each K ∈ K and e ∈ X the set AK defined by

AK = {ω : T(ω)f(ω) ≺ g(ω), T(ω) = K},        (30)

is an optimal region whenever (it is nonempty and) the constraint μ(AK) = e holds; thus for any B ∈ Σ with μ(B) = e one has ν(B) ≺ ν(AK), where the partial ordering is defined with respect to the positive cone in the space X∗.
We have presented the ingredients in terms of compactness conditions, and the inequalities in terms of a partial order, so that the proof of Theorem 1 goes through with only a few simple modifications, vector integration being used in lieu of the Lebesgue integral there. For completeness, a brief account will be included to illustrate the distinction in the case of vector integrals.

Proof. Note that the strong measurability of f, g implies AK ∈ Σ. If B and AK are as given, and x∗ ∈ P, then it should be shown that x∗ ◦ ν(AK) ≥ x∗ ◦ ν(B). For this, as in the proof of Theorem 1, it suffices to verify the inequality on AK − AK ∩ B and B − AK ∩ B. So consider

x∗ ◦ ν(AK − AK ∩ B) = x∗(∫_{AK − AK∩B} T(ω)f(ω) dλ)
  = ∫_{AK − AK∩B} x∗(Kf(ω)) dλ(ω), by the definition of AK in (30) (x∗ and the integral commute),
  = K∗x∗(∫_{AK − AK∩B} f(ω) dλ(ω)), since (x∗K)(x) = (K∗x∗)(x) = y∗(x), y∗ ∈ P,
  = ∫_{B − AK∩B} y∗(f(ω)) dλ(ω), since y∗ is continuous and μ(B) = μ(AK),
  ≥ ∫_{B − AK∩B} y∗(g(ω)) dλ(ω), since y∗ ∈ P and (30),
  = y∗ ◦ ν(B − AK ∩ B),

which implies ν(B) ≺ ν(AK), as desired.

This was originally indicated in (Rao [4I], Thm. 1), and with more detail in (Rao [7], Thm. 5.1) after establishing (28). If the class of Banach spaces is restricted, then the measures can be more general; but if the spaces are more general, then the measures have to be restricted, as in this argument. Regarding the former point, if X is taken as C(S), the space of continuous real functions on a compact Stone space S, then there is a corresponding general RN-theorem due to Wright [1], and since C(S) is a Banach lattice the same type of argument works. An analogous result can be established in which, in exchange for the compactness condition on the sets in (29), one assumes that μ, ν are "modular" (i.e., do not "twist" vectors) measures and the integral is defined with
an order continuity topology. A form suitable in this context is detailed, for instance, in (Rao [18], Prop. 6.5.3). These general results, specialized to inference theory, still need further investigation and so cannot be pursued here. But the corresponding study has a potential for use in inference, and only future work can bear this out. We terminate the discussion at this point, since a real unsolved problem is the construction of the RN-derivatives needed in most applications of these existence results. However, special techniques will be considered in Chapter IV and thereafter for classes of processes, and their RN-derivatives (or likelihood ratios) are obtained there. The following account is devoted to particular types of problems and methods that reduce the composite case to the simple one discussed above, using various weighted averaging and related techniques of interest in mathematical analysis.

2.2 Reduction of composite hypotheses

To motivate a different approach from the one discussed above, consider H0 and H1 having more than one element, and set I = H0 ∪ H1 as before. If {Pθ, θ ∈ H0} and {Pθ′, θ′ ∈ H1} are the corresponding distinguishable hypotheses, let π0 and π1 be weight functions on H0 and H1, so that the πi ≥ 0 are measures on Hi, i = 0, 1 (considered as sets with σ-algebras Hi containing the single-point sets). Then the new measures on (Ω, Σ) defined (θ → Pθ(A) being assumed Hi-measurable) by

Pπ0(A) = ∫_{H0} Pθ(A) dπ0(θ);  Pπ1(A) = ∫_{H1} Pθ′(A) dπ1(θ′),  A ∈ Σ,        (1)
(1) are σ-additive and Pπi (Ω) = 1 iff πi (Hi ) = 1, i = 0, 1. But the latter restriction is unnecessary. If Pθ Pθ for all θ ∈ H0 , θ ∈ H1 , then it θ is clear that Pπ1 Pπ0 and so let fθ ,θ = dP dPθ be the RN-density. By the Neyman-Pearson-Grenander theorem, then the critical region to distinguish the weighted measures (or hypotheses) Pπi (or Hi ) i = 1, 2, is given by Wk = {ω : fθ ,θ (ω) ≥ k}, (2) where k is chosen to satisfy a constraint Pπ0 (Wk ) ≤ α. If π0 and π1 concentrate on θ0 and θ1 with weights α0 and β0 , then fθ ,θ = αβ00 f¯, dP where f¯ = π1 , and (2) becomes dPπ0
W k = {ω : f¯(ω) ≥ k
α0 }, β0
(3)
which is of the same form as in the simple hypotheses case. Since Wk depends on π0 , π1 and the weights are at our disposal, it is natural to
introduce additional conditions to maximize the differences between H_0 and H_1 by choosing different classes of measures (or weights) π_0, π_1. The procedure is referred to as the "Bayes method", because it is an extension of a very simple but interesting rewriting of a conditional probability identity, observed by the Rev. T. Bayes from a philosophical point of view and published posthumously in 1764-'65, far earlier than when conditional measures were rigorously defined and studied. The nomenclature used is, in the current notation, the following: if A, B are events of positive probability, then the conditional probabilities of A given B, and of B given A, are defined as:

P(A|B) = P(A ∩ B)/P(B);  P(B|A) = P(A ∩ B)/P(A).   (4)

Eliminating P(A ∩ B) gives what is called the "Bayes formula":

P(A|B) = P(B|A)P(A)/P(B).   (5)
Here P(A) is termed the "prior" probability, P(B|A)/P(B) the likelihood, and P(A|B) the "posterior" probability, the names having certain philosophical origins. These do not make sense if either P(A) = 0 or P(B) = 0, and then the formula (5) runs into trouble. In case the probability distributions are determined by a pair of random variables X, Y into R and are absolutely continuous relative to the Lebesgue measure with strictly positive densities f_X, f_Y, then Kolmogorov ([1], Sec. 5.3), who introduced the general conditioning concept rigorously, has given the corresponding formula in this special case as:

f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / ∫_R f_{Y|X}(y|x) f_X(x) dx = f_{Y|X}(y|x) f_X(x) / f_Y(y),   (6)

and termed it Bayes' theorem for (absolutely) continuous distributions. This is a one-dimensional formula and is obtained from (5) with an application of the classical Lebesgue-Vitali differentiation theorem (cf., e.g., Rao [17], Thm. 5.2.6); it is more involved in higher dimensions (e.g., when X, Y are vectors), and even in the absolutely continuous case the differentiation theory needed to continue with the above formula is in general a difficult subject (cf. Hayes and Pauc [1] in this connection; we discuss the matter further in Section 2.3 below). These two special cases of conditioning form the basis of all the current "Bayesian" methodology.
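Formula (6) is straightforward to evaluate numerically. The following Python sketch normalizes f_{Y|X}(y|x) f_X(x) by its integral to obtain the posterior density; the Gaussian pair and the grid below are illustrative assumptions, not a model treated in the text.

    # Sketch of Kolmogorov's Bayes formula (6) for densities:
    # f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / integral of the numerator over x.
    # The Gaussian model is a hypothetical choice for illustration only.
    import numpy as np

    def posterior_density(y, x_grid, f_X, f_Y_given_X):
        """Return f_{X|Y}(. | y) on x_grid by normalizing the numerator of (6)."""
        numer = f_Y_given_X(y, x_grid) * f_X(x_grid)
        denom = np.trapz(numer, x_grid)      # the marginal f_Y(y) by quadrature
        return numer / denom

    # Example: X ~ N(0,1) prior, Y | X = x ~ N(x, 0.5^2)
    f_X = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    f_Y_given_X = lambda y, x: np.exp(-(y - x)**2 / 0.5) / np.sqrt(np.pi * 0.5)

    xs = np.linspace(-5, 5, 2001)
    post = posterior_density(1.0, xs, f_X, f_Y_given_X)
    print("posterior mean of X given Y=1:", np.trapz(xs * post, xs))  # about 0.8

For this conjugate choice the posterior mean has the closed form y/(1 + 0.25), which the quadrature reproduces; for non-conjugate densities the same normalization step applies unchanged.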
In the general case, the calculation of conditional probabilities is a difficult question for which as yet no satisfactory algorithm is available. [See, e.g., Rao [18], Chapter 3, where many of the problems and the present state of affairs are discussed in considerable detail.] Ad hoc computations result in multiple answers for the same problem; in fact a "Borel paradox" was already illustrated by Kolmogorov in Sec. 5.2 of his book [1], and more serious pathologies are given in the work just cited. The existence of such conditional measures is assured by the RN-theorem, but the calculation is not generally a simple matter, and not much effort is spent in the existing literature on this question. The problem is treated in a few cases, as in the author's paper [19], some of which is detailed in Sections 5.6, 5.7, and 7.5 of the author's book [18] on conditioning and its applications. We now show how these conditional densities (in two special cases) may be exploited in reducing the composite case to the modified simple hypotheses problem. It may be noted that Wald [1] has already considered the reduction of composite hypotheses to simple ones using different weights, but allowing H_0 and H_1 to (possibly) have common points. One of his formulations will be illustrated after the following:

1. Definition. Let H_0 ⊂ I be a set of parameters, where I is the index (or parameter) space for {P_θ, θ ∈ I} on (Ω, Σ), and let H_1 : K_c ⊂ I be a set defined on I by a function θ → ϕ(θ) at some constant level c, i.e., K_c = {θ ∈ I : ϕ(θ) = c}. Then a critical region A_0 ∈ Σ of size α, in the sense that sup_{θ∈H_0} P_θ(A_0) = α, is said to have a uniformly best average power relative to a weight function μ(·) on (a σ-algebra K of) K_c if, for any B ∈ Σ of size α as above, and if θ → P_θ(A) is μ-measurable for each A ∈ Σ, then

∫_{K_c} P_θ(B) dμ(θ) ≤ ∫_{K_c} P_θ(A_0) dμ(θ).   (7)
The existence of such sets A_0 may be shown for a multivariate normal family. In general such an A_0 need not exist, since even distinguishability has been given up in exchange for the assumption that (7) has a solution. Some variations of this concept, and their extensions with applications to asymptotically multivariate normal classes, have been considered in several cases. However, it is possible to present a unified version using the decision-theoretic framework outlined in Section I.5, which can be restated as follows. If X is a random variable (or vector), then an experimenter may consider a function δ(X) to decide on θ ∈ I, the parameter set. Thus δ(X) ∈ {d_0, d_1}, the decision space. Denoting the loss function by L(·, ·), one wants to have L(θ, δ(X)) = 0 when θ ∈ H_0 and δ(X) = d_0;
or when θ ∈ H_1 and δ(X) = d_1, but L(θ, δ(X)) > 0 otherwise, since an error is clearly committed then. The average loss, or risk, is given by R(θ, δ) = E_θ(L(θ, δ(X))), θ ∈ I, where E_θ is the expectation computed relative to P_θ. Let μ_i(·) be the weight function on H_i, i = 0, 1, as above. Then one may reduce the problem to the simple hypotheses case by letting

r(μ_i, δ) = ∫_{H_i} R(θ, δ) dμ_i(θ),  i = 0, 1.   (8)

Since H_0 and H_1 are disjoint sets (by distinguishability), one defines μ on H_0 ∪ H_1 = I by taking it to be μ_i on H_i, i = 0, 1. Thus μ is a uniquely defined measure on I (i.e., μ(H_i ∩ ·) = μ_i(·), i = 0, 1), and (8) is expressed as:

r(μ, δ) = ∫_I R(θ, δ) dμ(θ).   (9)
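For a finite parameter set the integrals in (8) and (9) are just weighted sums, and the quality of a decision rule can be read off directly. The following Python sketch is an illustration under invented assumptions (X | θ ~ N(θ, 1), a 0-1 loss, and threshold rules δ_c(X) = d_1 iff X > c); it is not a general algorithm.

    # Sketch of (8)-(9): integrated risk of threshold rules delta_c, with
    # X | theta ~ N(theta, 1). All parameter values are illustrative only.
    import numpy as np
    from scipy.stats import norm

    H0 = {0.0: 0.5, 0.5: 0.5}     # mu_0: weights on H0 = {0, 0.5}
    H1 = {2.0: 1.0}               # mu_1: weight on H1 = {2}

    def risk(theta, c, in_H0):
        # 0-1 loss: error prob. is P(X > c) under H0 and P(X <= c) under H1
        return 1 - norm.cdf(c - theta) if in_H0 else norm.cdf(c - theta)

    def r(c):
        # r(mu, delta_c) = integral over I of R(theta, delta_c) dmu, cf. (9)
        return (sum(w * risk(t, c, True) for t, w in H0.items())
                + sum(w * risk(t, c, False) for t, w in H1.items()))

    cs = np.linspace(-1, 3, 401)
    print("approximate Bayes threshold:", cs[np.argmin([r(c) for c in cs])])

Minimizing r(c) over the grid approximates the Bayes solution of (10) within this restricted family of rules.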
Suppose μ(I) = 1, a probability measure. If δ(·) can be chosen to make r(μ, δ) as small as possible, so that there is a function δ = δ_0 for which one has

r(μ, δ_0) ≤ inf{r(μ, δ) : δ ∈ D},   (10)

D being the set of decision functions formed for X, then δ_0 is called a "Bayes solution" of the problem relative to the weight function μ. This is analogous to the case described by Definition 1. On the other hand, when no μ is available, and if there is a decision function δ_1 that minimizes the maximum risk, i.e.,

sup_θ R(θ, δ_1) ≤ inf_δ sup_θ R(θ, δ),   (11)

then δ_1 is called a minimax solution of the decision problem. Under some conditions one can find a sequence of weights (also termed a priori probabilities) μ_n such that the corresponding Bayes solutions converge to the minimax solution. The interest in these classes of solutions comes from the so-called completeness property, given in the following statement.

2. Theorem. Let L : I × D → R^+ be a loss function which is bounded and lower semicontinuous in the product topology of I × D, where I is given the metric topology defined by ‖θ − θ'‖ = var(P_θ − P_{θ'}), and D ⊂ [0, 1]^{[0,1]} carries the topology of pointwise convergence. If I is separable and closed in this description, then the class of Bayes solutions relative to a priori probabilities μ on I is complete, in the sense that for any decision function δ which is not a Bayes solution there is a δ* ∈ D which is a Bayes solution relative to some μ_0 such that r(μ_0, δ*) ≤ r(μ, δ) holds.
This is a particular version of a general theorem due to Wald ([4], Chapter 3, Sec. 3.6). Since the result is not utilized in the ensuing work, its proof will not be included here. A useful consequence of this statement is that if a best test is known to exist from other methods, one can then find a prior probability μ relative to which a Bayes solution is obtained that coincides with the existing result. An obvious weakness of this assertion is that there is no recipe for obtaining such a test, in contrast to the Neyman-Pearson method. A reinterpretation of (9) is useful as well as significant for our study later, especially for the filtering and prediction problems treated in Chapter VIII. In more detail, (9) may be expressed as:

r(μ, δ) = ∫_I E_θ(L(θ, δ(X))) dμ(θ)
 = ∫_I ∫_Ω L(θ, δ(X(ω))) dP_θ(ω) dμ(θ)
 = ∫_{I×Ω} L(θ, δ(X(ω))) μ̃(dω, dθ)
 = ∫_{Ω̃} L(θ̃, δ(X̃)) dμ̃,   (12)

where Ω̃ = I × Ω, θ̃ : (θ, ω) → θ, X̃ : (θ, ω) → X(ω), Σ̃ = I ⊗ Σ, and μ̃ : A × B → ∫_A P_θ(B) dμ(θ), A ∈ I, B ∈ Σ, is a probability measure. Here I is a σ-algebra of I on which μ(·) is defined. Now θ̃ and X̃ can be considered as random variables on the same measure space (Ω̃, Σ̃, μ̃). Since P_θ(·) is a probability measure on Σ for each θ ∈ I, and θ → P_θ(A) is by assumption a measurable function on (I, I) with μ a probability on I, it is easily verified that μ̃ is again a probability on (Ω̃, Σ̃). In this way θ̃ is seen as a random variable, and it is "approximated" by δ(X̃) in the sense of making r(μ, δ) as small as possible. This shows that putting a (normalized) weight function μ on the parameter space is essentially equivalent to treating the parameter θ as a random variable (i.e., Θ(·) = id.), and such an interpretation is often used in the Bayesian methodology. In this view, P_θ(·) = P(·|θ) is taken as a regular conditional probability function on (Ω̃, Σ̃, μ̃).

2.3 Composite hypotheses with iterated weights

As noted already, the use of weights in a composite hypothesis testing problem is primarily to reduce it to the case of simple hypotheses. When the weights μ(·) are finite and normalized to be probabilities, which typically are not known, they can again be considered as coming
from another family {μ_τ, τ ∈ J}. Iterating this method (with weights) gives the following form of the 'posterior' probability function for families of priors on priors. Some motivational applications in statistical practice are indicated by J. Berger ([1], pp. 180-195) under the name "hierarchical or multistage priors". Recall that hypothesis testing is a two-decision problem, so the decision function δ(X) ∈ {d_1, d_2}, i.e., it is representable as the function d_1 χ_A + d_2 χ_{A^c} for some A ∈ Σ. Taking the loss function L(·, ·) to be simple (it takes only the values 0, 1), and d_1 = 1, d_2 = 0 for simplicity, we get R(θ, δ) = E_θ(χ_A) = P_θ(A), where P_θ(·) will be written as P(·|θ) hereafter; it can be considered as a regular conditional probability for each θ ∈ I, i.e., P(·|θ) is a probability and P(A|·) is I-measurable for each A ∈ Σ, so that θ is regarded as a value of a random variable Θ on the measurable space (I, I) into itself. Then relation (12) of the preceding section becomes

P_{μ_τ}(A × B) = ∫_B P(A|θ) dμ(θ|τ),  B ∈ I,   (1)

where μ(·|τ) is a "prior" probability on I, a member of a family {μ_τ = μ(·|τ), τ ∈ J}. It is easy to verify that P_{μ_τ}(·) is a σ-additive measure, or probability, on the product σ-algebra Σ ⊗ I (cf., e.g., Neveu [1], Prop. III.2.1, or Rao [17], Thm. 8.1.1). If there is a (new) weight ν on J, one can reduce the composite family of the μ_τ-measures by using this function ν, and thereby obtain another probability P̃. Thus these iterated weights give a sequence of measures, each of which reflects the experimenter's thinking as time progresses, thereby changing the underlying model (or the probability space) by augmentation. As a motivation, suppose that {P(·|θ), θ ∈ I} is dominated by a fixed σ-finite measure λ on Σ. Then (1) becomes (after using Tonelli's theorem)

P_{μ_τ}(A × B) = ∫_A ∫_B f(x|θ) dμ_τ(θ) dλ(x),  A ∈ Σ, B ∈ I.   (2)

Hence dP_{μ_τ}(· × B)/dλ = g_B(·|μ_τ) exists and is given (with g = g_I) by

g(x) = ∫_I f(x|θ) dμ_τ(θ),  τ ∈ J,   (3)

for almost all x, with μ_τ as a second-stage (or second subjective) weight function. Here g(x) is again the (marginal) density of X, but the joint density of X, Θ and the others will be different.
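The mixture (3) is the quantity most often needed in practice. The sketch below computes g(x) by quadrature; the Gaussian likelihood and Gamma-type prior are hypothetical choices made only for illustration.

    # Sketch of (3): g(x) = integral over I of f(x|theta) dmu_tau(theta).
    # Likelihood and prior below are illustrative assumptions.
    import numpy as np
    from scipy.stats import norm, gamma

    def marginal_density(x, tau, thetas=np.linspace(0.01, 10, 2000)):
        """g(x) with f(x|theta) = N(theta, 1) density, prior mu_tau = Gamma(tau)."""
        f = norm.pdf(x, loc=thetas, scale=1.0)    # f(x|theta) on a theta-grid
        prior = gamma.pdf(thetas, a=tau)          # density of mu_tau w.r.t. dtheta
        return np.trapz(f * prior, thetas)

    print(marginal_density(x=2.0, tau=3.0))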
Now consider a sequence of weights from time 1 to time T, namely {μ_1(dθ_1|θ_0), . . . , μ_T(dθ_T |θ_0, . . . , θ_{T−1})}, where each depends (possibly) on all the previous parameters. Then the combined prior can be given precisely under the following mathematical setup: μ_i : I_i × I_0 × I_1 × · · · × I_{i−1} → R^+, where the parameter sets are the measurable spaces (I_i, I_i), i = 0, 1, . . . , T; μ_i(·|θ_0, . . . , θ_{i−1}) is a probability on I_i for each (θ_0, . . . , θ_{i−1}); and μ_i(B|θ_0, . . . , θ_{i−1}) is measurable relative to I_0 ⊗ · · · ⊗ I_{i−1} for each B ∈ I_i. This makes the μ_i regular conditional probabilities, and the combined weight (or prior) on ⊗_{i=1}^T I_i × I_0 → R^+ is obtained as an extension of (1), for θ_0 ∈ A and B_i ∈ I_i, as:

μ(B_1 × · · · × B_T |θ_0) = χ_A(θ_0) ∫_{B_1} μ_1(dθ_1|θ_0) · · · ∫_{B_T} μ_T(dθ_T |θ_0, θ_1, . . . , θ_{T−1}).   (4)
Using an extension of the argument for (1), it can be verified that (4) is well-defined and that μ(·|θ_0) is a (regular conditional) probability on ⊗_{i=1}^T I_i for each θ_0 ∈ A ∈ I_0. [Indeed, this has a unique extension onto the "cylinder σ-algebra", denoted ⊗_{i=1}^∞ I_i, for infinite products, by a classical theorem due to C. Ionescu Tulcea [1] (cf., e.g., Neveu [1], p.162, for a proof), and there are suitable extensions of the latter result if there are uncountably many such μ_i measures. See, e.g., Rao and Sazonov [1] and related references therein.] If each μ_i(·|θ_0, . . . , θ_{i−1}) depends only on the immediate predecessor parameter, μ_i(·|θ_{i−1}), so that {θ_i, i ≥ 0} forms a Markov process, the formula simplifies slightly. In any case, with (4) it is relatively easy to obtain analogs of (1) and (3) when there are iterated weights. The result is given here, leaving the easy proof to the reader, and we employ a convenient (but not strictly correct) notation. [Strictly speaking, the measures on product spaces should be expressed by a different symbol, as in (9)-(11) below.]

P_{θ_0}(×_{i=1}^T B_i × A) = ∫_{×_{i=1}^T B_i} P(A|θ_0, . . . , θ_T) μ(dθ_1, . . . , dθ_T |θ_0),   (5)
and when P(·|θ_0, . . . , θ_T) is dominated by a fixed σ-finite measure λ on Σ, one has the corresponding density g(·|θ_0) (with B_i = I_i) as:

g(x|θ_0) = ∫_{I_1×···×I_T} f(x|θ_0, . . . , θ_T) μ(dθ_1, . . . , dθ_T |θ_0).   (6)

If θ_0 is a known constant, then g(·|θ_0) becomes the marginal density of X obtained from the joint density of (X, Θ_1, . . . , Θ_T), and one gets the "posterior density" π(·|x, θ_0), relative to a "multistage prior" μ, from the equation

g(x|θ_0) π(θ_1, . . . , θ_T |x, θ_0) = f(x, θ_1, . . . , θ_T |θ_0),  π(θ_1, . . . , θ_T |x, θ_0) = f_1(θ_1|x, θ_0) · · · f_T(θ_T |x, θ_0, . . . , θ_{T−1}).   (7)
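The combined prior (4) lends itself to simulation: sampling θ_1, . . . , θ_T sequentially from the conditionals realizes μ(·|θ_0), and averaging the likelihood gives a Monte Carlo estimate of (6). The Markov-type Gaussian conditionals below are hypothetical choices for illustration.

    # Monte Carlo sketch of the multistage prior (4) and the mixture (6).
    # The conditionals mu_i(.|previous theta) = N(previous theta, 1) and the
    # likelihood (depending on theta_T only) are illustrative assumptions.
    import numpy as np
    rng = np.random.default_rng(0)

    def sample_prior_chain(theta0, T):
        thetas = [theta0]
        for _ in range(T):
            thetas.append(rng.normal(loc=thetas[-1], scale=1.0))
        return thetas

    def g(x, theta0, T=3, n_draws=20000):
        # g(x|theta0) ~ average of f(x|theta_T^(k)) over prior chains k
        vals = []
        for _ in range(n_draws):
            th = sample_prior_chain(theta0, T)
            vals.append(np.exp(-(x - th[-1])**2 / 2) / np.sqrt(2 * np.pi))
        return np.mean(vals)

    print(g(x=1.0, theta0=0.0))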
It should be emphasized that (7) is not a definition of f but is deduced from (4) (established from the key theorem of Ionescu Tulcea noted above), since conditional probability is defined once and for all by (1); the rest must be proved using it, and no new "definitions" of densities may be employed without also showing consistency of the concepts. This is elaborated further, leading to Proposition 1 below. The preceding discussion immediately raises a question about the 'posterior density' π(θ|x) defined by (7): is it a true conditional density of some measure P̃ on a suitable product σ-algebra, in the framework of the Kolmogorov model, so that valid probability statements can be made with it? This is nontrivial, since the respective measures have to be constructed. So the argument is detailed here to avoid ambiguities, and it will be used later to illuminate a subtle difficulty appearing in the converse direction. If P is a probability on a measurable space (Ω, Σ) and B ⊂ Σ is a σ-algebra (generated by a random variable Z : Ω → R, for instance, so that B = Z^{−1}(R), where R is the Borel σ-algebra of R), then the conditional probability given B, written P^B(·)(ω) (also written in the special case as P(·|Z(ω) = x)), is an essentially unique σ-additive, B-measurable function satisfying the identity (P_B = P|B):

∫_B P^B(A)(ω) dP_B(ω) = P(A ∩ B),  A ∈ Σ, B ∈ B.   (8)
The existence and essential uniqueness of P^B(·) is a consequence of the RN-theorem. Here P^B(·)(·) = P(·|·) : Σ × Ω → R^+ is called a regular conditional probability when P^B(·)(ω) is a probability measure for each ω ∈ Ω and P^B(A)(·) is B-measurable for each A ∈ Σ. In the case of P(·|·) defined by (7), i.e., π(·|·), it is to be shown that there is a suitable probability space (Ω̃, Σ̃, P̃) such that (8) holds on it with π(·|·) as its integrand. For this let Ω̃ = Ω × I, Σ̃ = Σ ⊗ I, and let P̃ be defined by

P̃(A × B) = ∫_{A×B} f(x, θ) dλ(x) dμ(θ),   (9)
where f is the joint density of X and Θ relative to the given (dominating) product measure λ ⊗ μ. Taking for B ⊂ Σ̃ the cylinder σ-algebra σ(X) = {A_1 × I : A_1 ∈ Σ} generated by X, where X : Ω̃ → R^n and Θ : Ω̃ → I are the (coordinate) random variables, X(ω, θ) = x, Θ(ω, θ) = θ ∈ I, we have to show that π(·|x) = P̃^B(·)(ω̃) satisfies (8). Let A = (−∞, x) = ×_{i=1}^n (−∞, x_i) and B = I. Then

P̃(ω̃ : X(ω̃) ∈ A, Θ(ω̃) ∈ I) = ∫_A ∫_I f(t, θ) dμ(θ) dλ(t)
 = ∫_A g(x) dλ(x),
and similarly, taking A = R^n and B ∈ I, one gets

P̃(X ∈ R^n, Θ ∈ B) = ∫_B h(θ) dμ(θ),

where h(·) [g(·)] is the marginal density of Θ [of X] relative to μ [λ]. Thus the vector (X, Θ) has P̃ as the governing probability, with g and h as marginal densities. Now let p_1 : Ω̃ → Ω, p_2 : Ω̃ → I be the coordinate projections, and for any A ∈ Σ̃ let A(p_1(ω̃)) = {θ ∈ I : (p_1(ω̃), θ) ∈ A} denote the section of A at p_1(ω̃). Define

P̃(A|ω̃) = ∫_{A(p_1(ω̃))} π(θ|x) dμ(θ),  ω̃ = (ω, θ) ∈ Ω̃.   (10)
It is claimed that π(·|x), the posterior probability, determines the regular conditional measure P̃(·|ω̃) relative to B ⊂ Σ̃. For this one needs to show that (i) P̃(·|ω̃) is σ-additive for each ω̃ ∈ Ω̃, and (ii) P̃(A|·) is B-measurable and satisfies (8). To verify (i), it is sufficient to consider a disjoint sequence of measurable rectangles C_n ∈ Σ × I whose union C is also in Σ × I, since the latter is a semi-algebra, and a σ-additive real function on it has a unique σ-additive extension to Σ ⊗ I by a standard result in measure theory (cf., e.g., Rao [17], Thm. 6.1.3). Thus

C = A × B = ∪_{n=1}^∞ (A_n × B_n) = ∪_{n=1}^∞ C_n,

and the disjointness of the rectangles on the right side implies (cf. again Rao [17], Lemma 6.1.1) that, for each pair, either the A_n's are equal and the B_n's are disjoint, or vice versa. Considering the whole sequence, since its union must again be a rectangle, this implies that either all the A_n's are equal to A and the B_n's are disjoint, or all the B_n's are equal to B and the A_n's are disjoint. Consequently A × B = A × ∪_{n=1}^∞ B_n, or = ∪_{n=1}^∞ A_n × B, and in either case an application of the p_1-projection gives:

p_1(C) = A = ∪_{n=1}^∞ A_n.
The second case being trivial, one has for the sections:

C(p_1(ω̃)) = ∪_{i=1}^∞ C_i(p_1(ω̃)),

which is a disjoint union. Note that C(p_1(ω̃)) = ∅ when ω̃ ∉ C. Hence

P̃(∪_{i=1}^∞ C_i |ω̃) = ∫_{∪_{i=1}^∞ C_i(p_1(ω̃))} π(θ|x) dμ(θ)
 = Σ_{i=1}^∞ ∫_{C_i(p_1(ω̃))} π(θ|x) dμ(θ)
 = Σ_{i=1}^∞ P̃(C_i |ω̃),
as required for (i). Next, for (ii): by (7) and (10) the B-measurability follows from the Fubini-Tonelli theorem, so that only (8) needs to be verified. Thus let B ∈ B ⊂ Σ̃ (B being a cylinder set, B = A_1 × I for some A_1 ∈ Σ). Then for any A ∈ Σ̃, consider

∫_B P̃(A|ω̃) dP̃(ω̃) = ∫_B [ ∫_{A(p_1(ω̃))} π(θ|x) dμ(θ) ] dP̃(ω̃)
 = ∫_{A_1} ∫_{A(p_1(ω̃))} π(θ|x) dμ(θ) ( ∫_I f(x, θ') dμ(θ') ) dλ(x)
 = ∫_{A_1} ∫_{A(p_1(ω̃))} π(θ|x) g(x) dμ(θ) dλ(x)
 = ∫_{A_1} ∫_{A(p_1(ω̃))} f(x, θ) dμ(θ) dλ(x), by (7),
 = ∫_{B∩A} dP̃(ω̃) = P̃(A ∩ B).   (11)
Thus π(·|x) is a version of P̃(·|ω̃), as desired. [Regarding this construction of P̃ see also Dubins and Freedman [1], p.551, as well as Chow and Teicher [1], p.211, Example 2. Note that there are no separability restrictions on B, Σ̃, or P̃ in this derivation.] Summarizing this discussion, the result may be stated, for reference, as follows.

1. Proposition. Let {(Ω, Σ, P_θ), θ ∈ I} be a family of probability spaces with P_θ absolutely continuous relative to a fixed σ-finite measure λ, uniformly in θ, having a density f(ω|θ) = (dP_θ/dλ)(ω). If μ is some a priori probability (or a weight) on (I, I), let {π(θ|ω), θ ∈ I, ω ∈ Ω} be a posterior density relative to λ ⊗ μ on (Ω × I, Σ ⊗ I), given by (7). Then there exists a probability measure P̃ on this product measurable space and a regular conditional probability function relative to a σ-algebra B ⊂ Σ ⊗ I such that π(·|·) is a version of the latter. In fact, P̃ is defined by (9) and is absolutely continuous relative to λ ⊗ μ.
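On a finite space the content of Proposition 1 can be checked by direct summation; the two-point model below is an invented illustration, with counting measures playing the roles of λ and μ.

    # Finite sanity check of Proposition 1: on Omega x I, build P~ as in (9),
    # form the posterior pi(theta|w) = f(w|theta) mu(theta) / g(w) as in (7),
    # and verify the defining identity (8)/(11) by summation. Numbers invented.
    Omega, I = [0, 1], ["t0", "t1"]
    f = {(0, "t0"): 0.7, (1, "t0"): 0.3,     # f(w|theta): a density for each theta
         (0, "t1"): 0.2, (1, "t1"): 0.8}
    mu = {"t0": 0.4, "t1": 0.6}              # prior weight on I

    g = {w: sum(f[(w, t)] * mu[t] for t in I) for w in Omega}      # marginal of X
    post = {(t, w): f[(w, t)] * mu[t] / g[w] for w in Omega for t in I}

    A = {(0, "t0"), (1, "t1")}               # a set in the product sigma-algebra
    A1 = [0, 1]                              # B = A1 x I, a cylinder set in B
    lhs = sum(post[(t, w)] * g[w] for w in A1 for t in I if (w, t) in A)
    rhs = sum(f[(w, t)] * mu[t] for (w, t) in A if w in A1)
    print(abs(lhs - rhs) < 1e-12)            # True: pi(.|w) is the conditional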
Having constructed the probability P̃ governing the posterior function π(·|·), which depends on the auxiliary weight function μ, it will be of interest to consider some applications of this methodology to practical problems. However, there are some real computational barriers to overcome in order to implement the procedure in a mathematically satisfactory (i.e., correct) way. These have sometimes been "justified" by plausible reasoning. Therefore it is necessary to understand the underlying theoretical basis rather clearly. These and related questions arising from this procedure will now be discussed in the following section, with a nontrivial (and natural) example which clarifies the contribution of the ideas to inference theory.

2.4 Bayesian methodology for applications

It was already seen in Section 2 above that the Bayes critical region for a simple H_0 versus a simple alternative H_1 is of the same form as that given by the classical (i.e., Neyman-Pearson-Grenander) region, and both coincide when the (priors or) weights are equal on H_0 and H_1 (cf. (3) of Sec. 2). Thus even here the regions are different in the two procedures if the priors differ. When H_0 or H_1 is composite, the differences are quite pronounced. Thus let H_0 and H_1 be composite hypotheses that also stand for disjoint subsets of the parameter space I of the probability family {P_θ, θ ∈ I} on a fixed measurable space (Ω, Σ). Suppose there is a fixed σ-finite measure λ : Σ → R̄^+ that dominates the family, so that f(·|θ) = (dP_θ/dλ)(·) denotes the corresponding density. The testing problem in the decision-theoretic framework, relative to a loss function (θ, δ) → W(θ, δ(x)), is then given as:

r(θ, δ) = ∫_I ∫_Ω W(θ, δ(X(ω))) f(ω|θ) dλ(ω) dμ(θ),   (1)

where μ(·) is the prior probability on I that gives positive mass to both H_0 and H_1, so that the composite hypotheses are reduced to the simple ones, as discussed before (cf., e.g., Sec. 2). Here r(θ, δ) is the average loss measuring the error, with W(θ, d_i) = 0 for θ ∈ H_i, i = 0, 1, and > 0 otherwise. Taking W to be simple, i.e., taking only the two values {0, 1}, one has

r(θ, δ) = ∫_Ω f(ω|θ) δ(X(ω)) dλ(ω), θ ∈ H_0;  r(θ, δ) = ∫_Ω f(ω|θ)(1 − δ(X(ω))) dλ(ω), θ ∈ H_1.   (2)

In the previous terminology, for θ ∈ H_0, r(θ, δ) is the size of the test (0 ≤ δ ≤ 1), and for θ ∈ H_1, 1 − r(θ, δ) is the power. Note that since 0 < μ(H_i) < 1, the restriction of μ to H_i is not a probability, and each should perhaps be termed a weight on H_i, i = 0, 1. Let us restate the resulting critical regions, given in (2) and (3) of Section 2 above, putting them in context with the "Bayesian"
terminology. Now the point of the Bayesian solution is to choose a δ that minimizes the risk function in (2), i.e., that minimizes both the (Type I and Type II) error losses. The key difference in the points of view here should be noted. In the Bayesian procedure one minimizes a certain convex combination of the probabilities of the two types of errors, whereas in the Neyman-Pearson setup the Type I error probability is controlled and the Type II error probability is minimized, or equivalently, the power is maximized. These procedures rarely coincide, even when the structures of the critical regions are similar, as seen in (2) and (3) of Section 2. They are simply two different types of solutions, to be chosen according to the situation at hand. Let us elaborate on this further. To appreciate it clearly, let us derive (3) of Section 2 by the following alternative argument. Suppose W_0 is the critical region of the Bayesian procedure. Then the risk incurred in rejecting H_0 when it is true, with prior probability π_0, and in accepting it when H_1 is true, with prior probability π_1, is to be minimized. So the decision function δ, or the critical region W_0, must minimize the probability p_{W_0} given, when P_i ≪ λ, i = 0, 1, by:

p_{W_0} = π_0 P_0(W_0) + π_1 P_1(W_0^c)
 = π_0 ∫_{W_0} f_0(ω) dλ(ω) + π_1 ∫_{W_0^c} f_1(ω) dλ(ω)
 = π_1 + ∫_{W_0} (π_0 f_0 − π_1 f_1)(ω) dλ(ω).   (3)
Clearly p_{W_0} is smallest if W_0 = {ω : π_0 f_0(ω) ≤ π_1 f_1(ω)} = {ω : (f_1/f_0)(ω) ≥ π_0/π_1}, and δ = χ_{W_0} is the Bayes solution of the problem. In the Neyman-Pearson approach, on the other hand, one gets the most powerful region, using the same weights, as

W^k = {ω : (f_1/f_0)(ω) ≥ k π_0/π_1} = {ω : (f_1/f_0)(ω) ≥ k'},   (4)

as seen in Section 2. Here k (= k(α, π_0, π_1)) ≥ 0 is chosen such that P_0(W^k) ≤ α, the size, and the two regions coincide only if k = 1. In other words, the control of the Type I error probability and the consequent maximization of power are replaced by a minimization of the "generalized" risk function as in (1) above. Thus the variational problems of the earlier approach (with connections to general control theory, as noted in Section 1) are replaced by a different procedure for the "combined risk". The two methods therefore solve different types of inference problems. This distinction, often overlooked in comparative discussions, is important and should be kept in mind, as the two represent conceptually different mathematical models.
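The two solutions can be placed side by side numerically. The sketch below uses the illustrative assumption f_0 = N(0,1), f_1 = N(1,1), for which the likelihood ratio f_1/f_0 at x equals exp(x − 1/2) and is increasing in x, so each region is a half-line.

    # Bayes region W0 = {f1/f0 >= pi0/pi1} versus the Neyman-Pearson region
    # W^k = {f1/f0 >= k'} with k' calibrated so that P0(W^k) = alpha.
    import numpy as np
    from scipy.stats import norm

    pi0, pi1, alpha = 0.5, 0.5, 0.05
    bayes_cut = 0.5 + np.log(pi0 / pi1)   # f1/f0 >= pi0/pi1  <=>  x >= bayes_cut
    np_cut = norm.ppf(1 - alpha)          # P0(X >= c) = alpha <=>  c = z_{1-alpha}
    print("Bayes cut-off:", bayes_cut)          # 0.5 here (the k = 1 case)
    print("Neyman-Pearson cut-off:", np_cut)    # about 1.645: a different region

With equal priors the Bayes region rejects far more readily than the size-0.05 Neyman-Pearson region, illustrating that the two criteria answer different questions.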
It is of interest to note that the critical region given by (4) may be restated simply in terms of posterior probabilities, to be taken as a basis of such Bayes procedures in the composite cases. In fact, if H_0, H_1 are composite and μ(·) is a prior on I such that 0 < μ(H_i) < 1, i = 0, 1, then the critical region obtained from (3) can be expressed as:

W = {ω : ∫_{H_0} f(ω|θ) dμ(θ) ≤ ∫_{H_1} f(ω|θ) dμ(θ)},   (5)

where the f_i(·) of (3) are replaced by ∫_{H_i} f(·|θ) dμ(θ), and (5) reduces to the earlier case if H_0, H_1 are singletons with 0 < π_i = μ(H_i) < 1, i = 0, 1. Consequently the Bayes solution is simply the decision function δ = χ_W of (5). But, dividing through by the marginal g(ω) = ∫_I f(ω|θ) dμ(θ), assumed a.e. [λ]-positive for nontriviality, the corresponding inequality is in terms of the probabilities of H_i, i = 0, 1, "after the experiment" (or conditional on ω) for the prior weight μ on I. Writing P_μ(·|ω̃) for the posterior measure relative to μ ⊗ λ, ω̃ = (ω, θ), (5) is expressible as:

W̃ = {ω̃ : P_μ(H_0|ω̃) ≤ P_μ(H_1|ω̃)},  ω̃ = (ω, θ).   (6)

Thus the decision function solution, in the Bayes terminology, is again δ = χ_{W̃}, the set W̃ being given by (6). But P_μ(·|ω̃) has to be identified with a conditional measure on (Ω̃, Σ̃, P̃), where Ω̃ = Ω × I, Σ̃ = Σ ⊗ I, and P̃ is defined, for each A ∈ I and B ∈ Σ, as

P̃_μ(A × B) = ∫_{B×I} P_μ(A|x) dλ(x) = ∫_{B×I} P̃_μ(A × Ω|ω̃) dP̃_B(ω̃),   (7)

with B = σ(B × I : B ∈ Σ), a cylinder σ-algebra as in (10) of Section 2.3 and Proposition 3.1 there. Then P_μ(A|x) may be "identified" with P̃_μ(A|ω̃) as a conditional probability. The symbol P_μ(H_i|ω) is not meaningful on (Ω, Σ, P). This conceptual distinction is important in making probability statements about the "posterior" measures on probability spaces, at least from a correct mathematical point of view. We state this, in summary, for reference as follows.

1. Proposition. Let {(Ω, Σ, P_θ); θ ∈ H_0 ∪ H_1 = I} be a family of probability measures, where H_0 and H_1 denote a composite hypothesis and a composite alternative, which are supposed distinguishable. If μ(·) is a prior probability for both H_0 and H_1, reducing them to simple cases, then either H_0 or H_1 is decided, based on a sample point ω ∈ W or ω ∈ W^c respectively, where the critical region W is given by (5) above. This is the Bayes solution of the hypothesis testing (or decision) problem. It reduces to the Neyman-Pearson-Grenander critical region when both H_0 and H_1 are simple and when the prior weights 0 < π_i =
μ(H_i) < 1, i = 0, 1, are chosen to satisfy the constraint for (4), or when π_0 = π_1. [By the complete class theorem (cf. Thm. 2.2), there exists a μ_0 that makes this identification possible.]

Remark. If M_1^+(I) is the set of all prior probabilities on I, then the integrals based on this class serve a purpose quite analogous to L. Schwartz's theory of distributions in containing (weak) solutions of families of PDEs. It will be interesting to push this analogy much further, borrowing the relevant results from abstract analysis, to investigate the resulting Bayesian inferences, aside from the so-called complete class statements for subfamilies, which are useful but for which there is no constructive method of finding them in any problem. This aspect has not been thoroughly investigated in the literature.

The preceding discussion implies that, for a Bayesian analysis (inference), it is necessary in many cases to evaluate the posterior probabilities of various events on appropriate spaces. But these are conditional probabilities, and as such one encounters serious evaluational (or computational) difficulties, which we now indicate. Since in the Bayesian setup the parameter θ becomes a value of a random variable Θ, the pair (x, θ) becomes the value taken by the stochastic vector (or process) (X, Θ) on the space (Ω̃, Σ̃, P̃_μ), and the conditional measure P̃_μ(·|ω̃) must also satisfy the description of Proposition 3.1, especially equation (11) there. Consequently it is the RN-density of P̃_μ relative to its restriction to the cylinder σ-algebra B ⊂ Σ̃, in the notation introduced above. To appreciate the implications of this identification, and to clarify matters, we add an example. In fact, even Cramér and Leadbetter ([1], p.220), in a standard treatment of level crossing probabilities (evaluating conditional level crossing probabilities given a value of a previous state of the process), while recognizing the technical problem involved, adopted a simpler method, advancing a heuristic reasoning without giving a mathematically proper formulation. They used a "horizontal window" method, one among several discussed by Kac and Slepian [1], each giving a quite different solution. It was also favored by the latter authors, who noted that it is related to certain concepts in statistical physics. To illuminate the subtlety of the problem, we include a brief account of one of their examples on an ergodic stationary Gaussian process. Also, in Grenander ([3], p.373) a "diagonal method" is presented, with complete rigor, for applications in Metric Pattern Theory, where the conditioning is desired to produce a particular value of such an expectation and not merely some version of an equivalence class. Its use in the present analysis is discussed further in the bibliographical notes at the end of the chapter. We devote some space to discussing the problem here since it is important, and in many places one of the computational methods (often one going with the horizontal window procedure) is presented as the "exact" conditional probability in the original Kolmogorov setup.
cedures) is found as the “exact” conditional probability in the original Kolmogorov setup. 2. Example. Let {Xt , t ∈ R} be a Gaussian process of the following description. For any finite set t1 < t2 < · · · < tn of points from R (Xt1 , . . . , Xtn ) is an n-dimensional Gaussian random vector with means 2 zero and the covariance matrix (r(ti , tj ), 1 ≤ i, j ≤ n) = (e−(ti −tj ) , 1 ≤ i, j ≤ n). This is equivalent to saying that for any linear combination n i=1 ci Xti = Zc (say), Zc is normally distributed with mean zero and P [Zc < z] =
1 (2πσc2 )
z
−
e
v2 2 2σc
dv,
(8)
−∞
where σ_c² = Σ_{i,j=1}^n r(t_i, t_j) c_i c_j in the above. This implies that the X_t process is "stationary and ergodic" and that the covariance r(s, t) = r(s − t) is continuously differentiable. We use just this ergodicity property of the stationary Gaussian process, and not the specific covariance function r given above as illustration. Moreover, one has (X_{t+h} − X_t)/h → Y_t in mean as h → 0, for each t ∈ R, for this type of process. Then {Y_t, t ∈ R}, called the derived process, is also Gaussian. It may be verified immediately that E(X_t Y_t) = 0, t ∈ R, but E(X_s Y_t) ≠ 0 for s ≠ t, so that X_t and Y_t are independent but X_s and Y_t are not, for s ≠ t. The pair (X_s, Y_t) is a planar Gaussian random variable, and in the Bayesian setup X_t is an observable and Y_t is a 'parameter' with a normal prior distribution with mean zero and variance 0 < α² < ∞ (say). That such a process exists on some probability space follows from the Kolmogorov existence Theorem I.1.1, and since r(s) → r(0) as s → 0 it is also seen that X_t → X_0 and Y_t → Y_0 in mean as t → 0. The problem now is to calculate the "posterior" distribution of Y_t given an observed value of X_t. Specifically, find the posterior probability P[Y_0 < y|X_0 = a] = F_{Y_0}(y|a). We now show that, using the previous information on the Y_t-process, a direct calculation of F_{Y_0}(y|a) yields distinctly different values depending on the method used (see, e.g., the Bayes formula (6) of Section 2.2 above). In other words, the answer depends on the type of computation employed! Let A_y = [Y_0 < y] and B_a = [X_0 = a]. These are measurable sets. Since P(B_a) = 0 and P(A_y|B_a) is to be calculated, we approximate the desired probability by a sequence P(A_y|B_n), where P(B_n) > 0 and B_a = ∩_n B_n. Now consider

P(A_y|B_a) = lim_{n→∞} P(A_y|B_n) = lim_n P(A_y ∩ B_n)/P(B_n),
whenever this limit exists. The problem here is that, in the last expression, the limits of both the numerator and the denominator are zero, and their ratio is to be found whenever possible. We now construct a number of such sequences for which the desired limit in fact exists (and turns out to be different in each case). Thus consider m ∈ R, δ > 0, and let

B_δ^m = {ω ∈ Ω : X_t(ω) = a + mt for some 0 ≤ t ≤ δ(1 + m²)^{−1/2}}.

This means B_δ^m is the set of points ω such that X_t(ω) passes through the line segment y = a + mt, of slope m and length δ, at some time t. Now, for fixed m, letting δ ↓ 0 through a sequence, one has B_a = ∩_{δ>0} B_δ^m. Then one finds (omitting some tedious calculations):

lim_{δ↓0} P(A_y|B_δ^m) = lim_{δ↓0} [P(A_y ∩ B_δ^m ∩ A_m^c) + P(A_y ∩ B_δ^m ∩ A_m)] / [P(A_m^c ∩ B_δ^m) + P(A_m ∩ B_δ^m)],
   using the independence of X_0 and Y_0, with f_{X_0}, f_{Y_0} the normal densities, f_{Y_0} ~ N(0, α²), and the common factor f_{X_0}(a) cancelling,
 = [∫_{−∞}^y |v − m| e^{−v²/(2α²)} dv] / [2α² e^{−m²/(2α²)} + m ∫_{−m}^m e^{−v²/(2α²)} dv]
 = P(A_y|B_a).   (9)
The details of the omitted computations for (9) were originally indicated in Kac and Slepian [1] and are detailed in the author's book (Rao [18], pp.69-72). If one uses another type of approximating sequence, B_δ defined by a circle centered at (a, 0) and of radius δ, i.e., B_δ = {ω : (X_t(ω) − a)² + t² < δ²}, then again B_a = ∩_{δ>0} B_δ as δ ↓ 0 through a sequence. One gets the "circular window" (or c.w.) approximation as:

P(A_y|B_a)|_{c.w.} = lim_{δ↓0} P(A_y|B_δ) = [∫_{−∞}^y (1 + v²)^{1/2} e^{−v²/(2α²)} dv] / [∫_R (1 + v²)^{1/2} e^{−v²/(2α²)} dv].   (10)
Evidently one can use other approximations, and as m ∈ R varies, (9) and then (10) yield uncountably many distinct values for the same problem of calculating the posterior probability. If in (9) one lets m → ∞, the result is a "vertical window" (or v.w.) approximation, and it becomes

P(A_y|B_a)|_{v.w.} = (2πα²)^{−1/2} ∫_{−∞}^y e^{−v²/(2α²)} dv,   (11)

a result implied directly by the mere use of the independence of X_0 and Y_0, disregarding the other information on Y_t. Letting m → 0 one gets a "horizontal window" (or h.w.) approximation from (9) as

P(A_y|B_a)|_{h.w.} = (2α²)^{−1} ∫_{−∞}^y |v| e^{−v²/(2α²)} dv.   (12)

Note that, because of the "stationarity" of the X_t-process and its eventual independence from Y_0 as t → 0, the final result does not involve 'a'. How can one select the correct solution (if any) from all these values of P(A_y|B_a)? The answer (from a purely mathematical point of view) is that the problem is ill-posed, i.e., P(A_y|B_a) is not well-defined enough to use the "Bayes type formula" (6) of Section 2.2. To incorporate all the "prior information" one must consider P[A_y | lim_{t↓0} X_t = a], and to employ Kolmogorov's theory one has to use the conditioning σ-algebra B = σ(B_δ^m : δ > 0, m ∈ R), which must be clearly specified. Then P^B(A_y) satisfies the basic equation

∫_B P^B(A_y)(ω) dP(ω) = P(A_y ∩ B),  B ∈ B.

The desired value is then P^B(A_y) evaluated at X_0(ω) = a. Thus the problem (of paradoxical solutions) is deeper, and an extended discussion is included in (Rao [18], Chapter 3), where some exact methods of evaluating these probabilities for a class of processes are also found. In general there is no algorithm available for a correct evaluation of conditional probabilities and expectations. Ad hoc methods used in such studies, based on selected (subjective) procedures, necessarily lead to different answers for the same question. This should be kept in mind in using the "posterior probabilities" in practical problems. Ryll-Nardzewski [1] has also noted unease about the evaluation of conditional probabilities by plausible arguments, and gave a rigorous approach for an interesting class of problems via the RN-theorem, i.e., the Kolmogorov approach, which however does not include those of the above Example.
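The disagreement between (11) and (12) is easy to exhibit numerically; in the sketch below the values α = 1 and y = 0.5 are illustrative choices only.

    # Numerical comparison of the vertical-window (11) and horizontal-window
    # (12) values of P(Y0 < y | X0 = a); alpha^2 and y are illustrative.
    import numpy as np

    alpha2, y = 1.0, 0.5
    v = np.linspace(-40, y, 200001)
    kern = np.exp(-v**2 / (2 * alpha2))

    vw = np.trapz(kern, v) / np.sqrt(2 * np.pi * alpha2)      # formula (11)
    hw = np.trapz(np.abs(v) * kern, v) / (2 * alpha2)         # formula (12)
    print("vertical window:", vw, " horizontal window:", hw)  # distinct values

Here the vertical window gives about 0.69 while the horizontal window gives about 0.56: two "posteriors" for the same event, which is precisely the ill-posedness discussed above.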
In the next section we consider certain restrictions under which composite hypotheses may be tested with standard methods that exclude such multiple solutions.

2.5 Further results on composite hypotheses

In order to continue with the analysis of composite hypotheses, we need to examine classes of tests classified as similar, unbiased, uniformly most powerful, and others. These are also needed for a discussion of the Behrens-Fisher problem, usually thought of as a question that goes to the core of the subject and illuminates its various parts. It was noted in Proposition 1.14, and in the discussion following its proof, that the composite hypothesis testing problem is identifiable with measures taking values in (usually infinite dimensional) vector spaces, and a critical region to distinguish them can be given. The results in this general setup are not in a form that may be applied to specific cases, since the construction of critical regions such as those given by (30) of Section 1 above depends explicitly on finding the (as yet unavailable) RN-derivatives of general vector measures. We therefore resort to some methods applicable to several concrete situations. This is one of the reasons for introducing special classifications and studying particular families of distributions, which nevertheless illustrate many fundamental aspects of inference theory.

1. Definition. Let {(Ω, Σ, P_θ), θ ∈ H_0 ∪ H_1 = I} be the basic model of the problem, where the H_i, i = 0, 1, are distinguishable (composite) hypotheses. If X : Ω → R^n is a random vector and 0 ≤ δ(X) ≤ 1 is the associated decision (or test) function [it is randomized unless δ takes only the two values {0, 1}], then δ(·) is termed unbiased of size 0 ≤ α ≤ 1 if (i) sup_{θ∈H_0} E_θ(δ(X)) ≤ α and (ii) inf_{θ∈H_1} E_θ(δ(X)) ≥ α, where E_θ is, as usual, the expectation operator for the measure P_θ. The test is termed similar of size α if, for some subset H'_0 of H_0, called a 'sub-hypothesis', one has E_θ(δ(X)) = α for all θ ∈ H'_0, not depending on θ ∈ H_0 − H'_0, the latter being termed 'nuisance parameters'. [See the elaboration of this concept immediately below.] In case H_0 is a singleton {θ_0} (i.e., simple) but H_1 is composite, and there is a test function δ such that E_{θ_0}(δ(X)) = α and E_θ(δ(X)) ≥ α for all θ ∈ H_1, then such a δ(·) is termed uniformly most powerful (UMP).

Of these three concepts, the one on 'similarity' perhaps appears unmotivated, and the following explanation may help in understanding it. Suppose the parameter space I is topological, H_0 is a closed subset, and the power function θ → β_δ(θ) = E_θ(δ(X)) is continuous. If the test is unbiased, then β_δ(θ) ≤ α on H_0 and > α on H_1. Then, by continuity, β_δ(θ) = α on the boundary ∂H_0 ⊂ H_0, implying that the test is similar on ∂H_0. [There exist trivial unbiased tests of a given size α ≤ 1, namely δ(X) = α, without regard to any optimality properties.] If H_0 is simple and P_{θ_0} is nonatomic, then a UMP test (if it exists) is both unbiased and similar. Note also that, for a similar test,
E_θ(δ(X)) is independent of θ ∈ H'_0. [Likewise, if (S, S) is a measurable space and T : Ω → S is a measurable mapping (T^{−1}(S) ⊂ Σ), termed a statistic when T = ϕ(X) for some Borel function ϕ and an abstract random variable X : Ω → S, then T is similar for H'_0 ⊂ H_0 provided P_θ ∘ T^{−1} is independent of θ ∈ H'_0.] In general, however, a nontrivial similar or unbiased test may not exist. (See, e.g., Exercise 6.2(a) below.) But when they exist, the construction of similar regions with such properties needs some deeper mathematical tools, as will become clear from the following account.

The method of construction of nontrivial similar regions to be given is based on the one proposed by Neyman for families {(Ω, Σ, P_θ), θ ∈ H_0 ∪ H_1 = I} that admit a sufficient statistic. This condition potentially simplifies the problem. Let us recall the latter concept. A σ-algebra B ⊂ Σ is said to be sufficient for {P_θ, θ ∈ I} if for each bounded measurable function f : Ω → R there is a B-measurable function f̃ (not depending on θ) such that

∫_A E_θ^B(f) dP_θ = ∫_A f dP_θ = ∫_A f̃ dP_θ,  A ∈ B.   (1)

This means E_θ^B(f) = f̃ is independent of θ, which is thus a very special but important property of B, originally introduced by R. A. Fisher. [Clearly B = Σ is always a trivial sufficient σ-algebra.] A statistic T : Ω → Ω̃, where (Ω̃, Σ̃) is a measurable space, is sufficient if T^{−1}(Σ̃) = B (= σ(T)) ⊂ Σ is a sufficient σ-algebra in the above sense. Thus the sufficiency concept refers fundamentally to the existence of certain special σ-subalgebras of the model family {(Ω, Σ, P_θ), θ ∈ I}. In the particular case that the family {P_θ, θ ∈ I} is dominated by a fixed σ-finite measure λ, so that f_θ = dP_θ/dλ, an operational form of (1) can be given by the following factorization criterion due to P. R. Halmos and L. J. Savage (in the present form it is due to R. R. Bahadur), when B = σ(T). A statistic T : Ω → Ω̃ is sufficient for {P_θ, θ ∈ I}, dominated by a fixed σ-finite measure λ, iff the densities f_θ admit a factorization:

f_θ(ω) = (dP_θ/dλ)(ω) = (g_θ ∘ T)(ω) h(ω), a.e. [λ],   (2)

where g_θ : Ω̃ → R^+ is measurable (Σ̃) and h : Ω → R^+ is measurable (Σ) but independent of θ. [Note that the range space of the statistic T and the parameter space I have little (if any) relation in the final result (2).] An immediate consequence of (2) is that if, for a θ_0 ∈ I, f_{θ_0}(ω) > 0 for a.a. (ω) in the support of the family (in particular in Ω itself), and Ω̃ = Ω, then T, as the identity mapping on Ω, is a sufficient statistic,
since f_θ(ω) = (g_θ ∘ T)(ω) h(ω), with g_θ = f_θ/f_{θ_0} and h = f_{θ_0} in (2). This corresponds to the obvious fact that the largest σ-algebra Σ of the underlying model is always sufficient for the family. However, one usually excludes this trivial case in discussing sufficiency. Also, just as functions of random variables are used, considerable interest attaches to functions of sufficient statistics in both testing and estimation problems, as a means of reducing the "sample data" to its essentials without losing any of its distributional properties. In this context one may want to consider a minimal sufficient statistic of the family, if it exists. This is defined as follows: if T_i, i = 1, 2, are a pair of statistics, then T_1 is said to be dominated by T_2, written T_1 ≺ T_2, if T_1 is constant on the sets where T_2 is constant; and if T is sufficient for the family {f_θ, θ ∈ I}, it is minimal sufficient if it is dominated by every sufficient statistic of the family. [In the literature it is also sometimes termed a necessary statistic.] In general a minimal sufficient statistic need not exist. But it does exist if the family of measures on Σ is absolutely continuous relative to a σ-finite measure. A proof and a related detailed treatment (with references) of sufficiency can be found in (Rao [18], Chapter 6; the above result (2) is Corollary 6.3.5 there). Now taking B = σ(T), where T is a sufficient statistic, we have the following important relation as an immediate consequence of the definition in (1):

E_θ^{B_T}(δ(X)) = E_θ(δ(X)|T) = ϕ(T),  θ ∈ I,   (3)

for a measurable ϕ : Ω̃ → R, where δ(X) : Ω → R is a test statistic. Suppose instead only that T is sufficient for the subfamily {P_θ, θ ∈ H_0}. Then (3) implies that ϕ(T) is independent of θ just on H_0, and Neyman's proposal is to consider tests of size α as those for which ϕ(t) = α for a.a. (t). This indeed gives a similar test as in Definition 1, since

α = E_θ(ϕ(T)) = E_θ(E^{B_T}(δ(X))) = E_θ(δ(X)),  θ ∈ H_0,   (4)

where we use (1) with A = Ω, f = δ(X) and f̃ = ϕ(T). A test δ(·) for which E_θ(δ(X)|T) = α, a.e. [P_θ], θ ∈ H_0, is said to follow Neyman's rule (or to have a "Neyman structure", an uncouth term also used in the literature).
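In a discrete setting Neyman's rule can be carried out exactly. The following sketch is an illustration under invented assumptions (X_1, X_2 i.i.d. Bernoulli(p), T = X_1 + X_2 sufficient, α = 0.10): within each slice {T = t} the conditional law is free of p, so setting E(δ(X)|T = t) = α for every t forces E_p(δ(X)) = α for all p, i.e., similarity as in (4).

    # A randomized test with Neyman structure for X1, X2 iid Bernoulli(p).
    # Given T = 1, the conditional law puts mass 1/2 on (1,0) and (0,1),
    # independently of p; the allocation of rejection mass is illustrative.
    alpha = 0.10

    def delta(x1, x2):
        t = x1 + x2
        if t == 1:
            return 2 * alpha if (x1, x2) == (1, 0) else 0.0
        return alpha    # T = 0 or 2: single point, randomize with prob. alpha

    def size(p):
        probs = {(a, b): (p if a else 1 - p) * (p if b else 1 - p)
                 for a in (0, 1) for b in (0, 1)}
        return sum(delta(a, b) * pr for (a, b), pr in probs.items())

    print([round(size(p), 12) for p in (0.1, 0.5, 0.9)])   # all equal to alpha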
For a class of probability measures, including those admitting a sufficient statistic, we present a construction of similar tests, thereby obtaining a generalization of Neyman's rule, following Linnik [1]. Consider probability densities f_θ, including those given by (2), for a family {P_θ, θ ∈ I}. Suppose there is a sufficient statistic T for the subfamily obtained by replacing I by its subset H_0, so that (4) holds. It should be observed that sufficient statistics exist for several classes of probability densities, such as the exponential, although not for all families (e.g., the Cauchy family f_θ(x) = (1/π)(1 + (x − θ)²)^{−1}, −∞ < θ < ∞, x ∈ R, does not admit such statistics). Here an 'exponential family' stands for the class of f_θ given by densities, relative to a fixed σ-finite measure λ on R^n, of the form

f_θ(x) = c(θ) exp(Σ_{i=1}^k a_i(θ) T_i(x)) h(x),  θ ∈ R^m, x ∈ R^n,   (5)

where c(θ) > 0, h(x) ≥ 0, and a_i(θ), T_i(x) ∈ R are measurable relative to the respective σ-algebras in I and R^n. But Linnik's more inclusive family is given by the following strongly integrable densities f_θ relative to a λ as above (cf. Linnik [1], p.74):

f_θ(x) = Σ_{i=1}^q g_i(T(x), θ) h_i(x),   (6)

where T(·) : R^n → R^k defines a statistic, the g_i and h_i are real measurable functions with g_i h_i integrable for λ for each θ ∈ H_0, and f_θ ≥ 0 is a probability density relative to the same σ-finite measure λ. If q = 1 this reduces to (2), and to (5) if moreover g(T(x), θ) is a suitable exponential function. We can obtain a similar region for this larger family, assuming that the dominating measure λ(·) is nonatomic. An example of (6) is given by:

f_θ(x) = exp{Σ_{i=1}^n ϕ_i(x) ψ_i(θ)} Σ_{j=1}^m g_j(x) h_j(θ).
Now we present the following comprehensive theorem, essentially due to Linnik, to indicate an important aspect of the problem:

2. Theorem. Let {(Ω, Σ, P_θ), θ ∈ H_0 ∪ H_1 = I} be a distinguishable family of composite hypotheses H_0, H_1, where P_θ is absolutely continuous relative to a nonatomic σ-finite measure λ with a (strongly integrable) density f_θ (for θ ∈ H_0) given by (6), T(X) being a statistic. Then for any 0 < α < 1 there exists a (nontrivial) nonrandomized similar test of size α for the family.

Outline of Proof. The idea of the proof is to express f_θ as a finite linear (not convex) combination of probability densities, each of which admits T(·) as a sufficient statistic. Then Neyman's rule is applied to each of them to obtain a similar test, using the procedures of Proposition 1.12 and its generalizations discussed in the last part of Sec. 1. Next these tests (or regions) are "glued" together for the final desired solution
of the statement. For the proof we can and do replace (Ω, Σ) by the range of X, namely (R^m, R^m), where R^m is the Borel σ-algebra of R^m. This is always possible by Theorem I.1.1 (the "sample space representation"). Actually, X can take values in a complete separable metric space, also termed a Polish space, for the arguments below. Thus (Ω, Σ) is (R^m, R^m), or can be a Polish measurable space. Here are the nontrivial mathematical details.

1. Let g_j^± be the positive and negative parts of g_j, and similarly h_j^±, so that g_j = g_j^+ − g_j^− and h_j = h_j^+ − h_j^−. Writing g̃_j and h̃_j for these four parts (some may be zero), (6) can be expressed as:

f_θ(x) = Σ_{i=1}^n g̃_i(T(x), θ) h̃_i(x) = Σ_{i=1}^n ε_i k_{iθ}(x) (say),   (7)
where q ≤ n ≤ 4q, ε_i = ±1, and k_{iθ}(x) ≥ 0. Moreover, since f_θ is strongly integrable in the sense that ∫_Ω |g_j h_j|(x) dλ(x) < ∞ for all j, which implies in turn that c_i(θ) = ∫_Ω k_{iθ}(x) dλ(x) < ∞, we have

f_θ(x) = Σ_{i=1}^n ε_i c_i(θ) k̃_{iθ}(x),   (8)

with k̃_{iθ} as given by (9) below.
Integrating (8) relative to λ gives Σ_{i=1}^n ε_i c_i(θ) = 1 for all θ ∈ H_0. Here and in the following one can take, for convenience, c_i(θ) > 0 for all θ ∈ H_0, i ≥ 1. Then

k̃_{iθ}(x) = (1/c_i(θ)) |g̃_i(T(x), θ) h̃_i(x)|   (9)

is a probability density relative to λ for each i = 1, . . . , n, and the important factorization criterion (2) implies that the statistic T(X) is sufficient for the densities (9), θ ∈ H_0, i = 1, . . . , n. Consequently the conditional distribution k̃_{iθ}(·|T = t) of the system given T is independent of θ ∈ H_0, and is denoted k̃_i(·|t). Now the sufficient statistic T may be assumed to have its range B as a Borel set of R^k (k < m), and hence the conditional distribution has a version that is regular (and even "proper"; cf., e.g., Rao [18], Sec. 5.4, especially Corollary 5.4.7). This important property will be used in the following argument. First note that k̃_i(·|t) is given by the formula

∫_A k̃_{iθ}(x) dλ(x) = K_{iθ}(A) = ∫_{B̃} k̃_i(A|t) dK̃_{iθ}(t),  B̃ = T^{−1}(B),   (10)

with T : R^m → R^k, where K̃_{iθ}(·) is the restriction of K_{iθ}(·) to B̃, and A ∈ Σ. Since λ is nonatomic, it follows that K_{iθ} has the same property,
and by (10) so is k̃_i(·|t), for each t in the range of T. Fix t and consider the vector measure (k̃_1(·|t), · · · , k̃_n(·|t)) on Σ into the positive orthant of R^n. This is a nonatomic (finite dimensional) vector measure, and moreover each of its components is a probability. By the Liapounov [1] theorem (discussed after Prop. 1.13) its range is a compact convex set, and hence, given a vector (α, · · · , α), 0 < α < 1, in R^n, there exists a set A_t such that k̃_i(A_t|t) = α, 1 ≤ i ≤ n. Consequently (10) implies, with A = A_t and B̃ = Ω there, that K_{iθ}(A_t) = α, 1 ≤ i ≤ n, for any θ ∈ H_0.

2. It is next necessary to get a measurable set A, determined by the family {A_t, t ∈ Range(T) = R_T}, by a "gluing process", such that K_{iθ}(A) = α. In fact, let A be the disjoint sum of the A_t's, i.e., if Ã_t = t × A_t, then A = ∪_{t∈R_T} Ã_t. Note that if there is a Borel set B ⊂ Ω × R_T, then each section B_t of B will be measurable for Σ, and there is a (unique) non-cartesian product measure β on the product measurable space above, such that

β(B) = ∫_{R_T} k̃(B_t, t) dk̄_{iθ}(t).

(See, e.g., Rao [17], p.335.) But we need the converse procedure here! Given k̃(·, t) and k̄_{iθ}(·), and a measurable family {A_t, t ∈ R_T}, find a measurable subset A whose sections satisfy the above equation. This is the gluing process noted above. If the A_t are "smooth" manifolds, then such a set is obtainable using a classical "gluing theorem" from manifold theory (cf., e.g., Abraham, Marsden, and Ratiu [1], p.501). But this is not assumed here. However, the result appears extendable to Borel sets, replacing "smooth" by "Borel"; this is not readily available in the literature, although Linnik [1] implies it. We proceed with an alternative procedure by considering the completed σ-algebras of Σ and of R_T for the probability measure K_{iθ} and the set A obtained above. Since every σ-finite measure is localizable, by a property of such measures (cf. Zaanen [1], p.262, Lemma α), A is equivalent to the supremum of the above class, denoted A*. It is then possible to obtain (through a lifting map ρ) that ρ(A*) = A and ρ(A*)(t) = A_t for all t. (See Tulcea and Tulcea [1] on localizability and the existence of a lifting, with which one can drop the separability conditions.) This machinery seems necessary if the range of T is uncountable, which is the most important case for applications. Thus, with such a measurable A whose sections are equivalent to the A_t, we find that K_{iθ}(A) = α, i = 1, . . . , n. Hence, substituting this in (8) and integrating over A, one has:

∫_A f_θ(x) dλ(x) = Σ_{i=1}^n K_{iθ}(A) ε_i c_i(θ) = α,  θ ∈ H_0.   (11)
This A is a similar region of size α, giving a nonrandomized test.

Remarks. 1. In general a similar test on H_0 may not be unbiased in the sense of Definition 1 above. However, if a similar test is also UMP, then it is automatically unbiased. Thus in seeking tests with the latter property similarity plays a useful role, and the preceding theorem is of assistance in this task. The sufficiency concept, first introduced by R. A. Fisher in the early 1920s, thus has a crucial role to play in the (extended) Neyman rule.

2. As the proof shows, the nonatomicity of the measures is an important step enabling the employment of the Liapounov theorem. This, as well as the "gluing" process, is useful for problems of applications, and a constructive procedure for A is desirable. It is not yet available. If q = 1, then n = 1 in the above proof (g ≥ 0, h ≥ 0), and similar regions may be constructed using the procedures of the Neyman-Pearson lemma or its extensions discussed in Section 1.

3. The nonatomicity hypothesis excludes applications of the theorem to discrete distributions, or those of mixed type. This condition can be dropped if the class of admitted densities is restricted. For instance, if the class is exponential (relative to some σ-finite measure) and the parameter range is wide enough to include an open rectangle, then the existence of similar tests, which necessarily obey Neyman's rule, can be proved with the assistance of certain results on (multiple) Laplace transforms. This possibility will be indicated in the complements section together with some related examples. Several other aspects of similar tests, using deep mathematical tools, can be found in Linnik's monograph [1].

4. Since the uniqueness of similar tests (when they exist) is not implied by the above theorem, one can consider a weight function μ(·) on H_1 and reduce it to a simple hypothesis. Then it is possible to choose an optimum similar test δ(·) from among those that maximize ∫_{H_1} E_θ(δ(X)) dμ(θ). The existence of such an optimum δ is usually very difficult to establish constructively, and there is again no uniqueness.

We now discuss another question on composite hypotheses for which similar tests will play a central role. The case in point is the famous Behrens-Fisher problem. For this the following terminology is used.

3. Definition. Let {P_θ, θ ∈ I ⊂ R^k} be a family of probability measures on a fixed measurable space, where θ = (θ_1, . . . , θ_k). Suppose that γ_i : I → R are linearly independent functions, i = 1, . . . , k, and that the hypothesis H_0 is given only for a subset (γ_1, . . . , γ_q) ∈ J_q ⊂ R^q (q < k), leaving (γ_{q+1}, . . . , γ_k) free. The latter functions are often termed (after H. Hotelling) nuisance parameters of the problem. If δ(X) is a test function based on X for the hypothesis H_0 (with alternative H_1 = I − H_0),
and if the power function θ → E_θ(δ(X)) = ϕ(θ_1, . . . , θ_k) depends on θ only through (but is not completely independent of) γ_1, . . . , γ_q, then it is said to have an invariant power function. [The trivial (invariant) test δ(X) = α, 0 < α < 1, is excluded from all further considerations here and below, as it is of no use.] Note that γ_i(θ) = θ_i, i = 1, . . . , k, is possible in the above, and then H_0 denotes (γ_1, . . . , γ_q) ∈ J_q while (γ_{q+1}, . . . , γ_k) are the nuisance parameters.

A simple but important example of this situation is provided by the following famous Behrens-Fisher problem. Let X = (X_1, . . . , X_{n_1}) be independent identically distributed (i.i.d.) random variables, or a random sample, where each X_i is normally distributed with mean μ_1 ∈ R and variance σ_1² > 0, denoted N(μ_1, σ_1²). These two parameters uniquely determine a normal (Gaussian) distribution. Similarly, let Y = (Y_1, . . . , Y_{n_2}) be another random sample, from N(μ_2, σ_2²), independent of the first one. Here θ = (μ_1, μ_2, σ_1², σ_2²) ∈ I ⊂ R^4, and γ_1(θ) = μ_1 − μ_2, γ_2(θ) = μ_1, γ_3(θ) = σ_1², γ_4(θ) = σ_2², with H_0 : γ_1(θ) = 0 and H_1 : γ_1(θ) ≠ 0. The parameters γ_2, γ_3, γ_4, or equivalently μ_1, σ_1² > 0, σ_2² > 0, are the nuisance parameters. It is desired to find a test 0 ≤ δ(X, Y) ≤ 1 for H_0 (i.e., equality of means). Does there exist a similar test, an unbiased test, or an invariant one? This simple-sounding problem presents an entirely new phenomenon, and the existing inference theory, detailed so far, does not provide a satisfactory solution. Generalizations of the problem suggest themselves, with "linear hypotheses and regression analysis" among others, but a solution of the particular case introduced here is essential to advance the theory. A penetrating analysis of this subject has been conducted by Linnik. We indicate highlights of his work to convey its depth.

Let X̄, Ȳ, S_x², S_y² be the sample means and variances of the independent random samples X, Y noted above. Then it is well known (and also easily obtained from the definition) that these four random variables form a (vector) sufficient statistic for (μ_1, μ_2, σ_1², σ_2²) of the family. It is desired to test the hypothesis H_0 : μ_1 − μ_2 = 0 against the alternative H_1 : μ_1 − μ_2 ≠ 0, and we seek a test that is (naturally) invariant under translations and scale changes, i.e., x → αx + β should yield the same test. This implies that the critical region should be determined by a function G of the form G(|x̄ − ȳ|/s_x, s_x/s_y) ≥ 0, where G is a real measurable function on the sample space. This may be seen as follows. Since the critical region C is determined by (x̄, ȳ, s_x², s_y²), by translation invariance C would be defined by (x̄ − ȳ, 0, s_x², s_y²). But by symmetry this is further restricted to (|x̄ − ȳ|, 0, s_x², s_y²). Now it is to be invariant under change of scale, so that, replacing X, Y by X/S_x, Y/S_x, it is seen that the region is determined by (|x̄ − ȳ|/s_x, 0, 1, s_x/s_y). In other words,
for each realization (i.e., for random sample values), the region is determined in the positive quadrant of the variables $\xi = |\bar x - \bar y|/s_x \ (\ge 0)$ and $\eta = s_x/s_y$, and is symmetric about the ($y$- or) $\eta$-axis. We finally impose a natural and convenient restriction: for any two realizations with $|\bar x - \bar y| < |\bar x' - \bar y'|$, $s_x = s_x'$, $s_y = s_y'$, if the first point belongs to $C$ then both points belong to $C$. This means that every line parallel to the $\eta$-axis meets the boundary $\partial C$ of $C$ in at most one point. Indeed, if $(\xi_1, \eta_0)$, $(\xi_2, \eta_0)$ are two points of the line $\eta = \eta_0$ meeting $\partial C$, let $0 \le \xi_1 < \xi_2$. If there were a $\xi_3$ with $\xi_1 < \xi_3 < \xi_2$ and $(\xi_3, \eta_0) \notin C$, then the corresponding sample values of the sufficient statistics, say with $|\bar x' - \bar y'| > |\bar x - \bar y|$ and $s_x'/s_y' = s_x/s_y$, would satisfy $|\bar x' - \bar y'|/s_x' > |\bar x - \bar y|/s_x$. But by the scale invariance of the sufficient statistics, replacing $X, Y$ by $kX, kY$, so that $s_x$ is replaced by $k s_x$, etc., we get, on taking $k = s_x'/s_x$,
$$ \frac{|\bar x' - \bar y'|}{s_x'} > \frac{|k\bar x - k\bar y|}{k s_x} = \frac{|\bar x - \bar y|}{s_x}, $$
which must correspond to a point of $C$, and hence $\xi_3 \in C$, which is impossible.
So $C$ must be of the form $|\bar x - \bar y|/s_x \ \ge\ \varphi(s_x/s_y)$ for some measurable real function $\varphi$, assumed to have finite values for nontriviality. In other words, this may be expressed as $G(|\bar x - \bar y|/s_x,\ s_x/s_y) \ge 0$ for some measurable real $G$, namely $G(a, b) = |a| - \varphi(b) \ge 0$ here. The functional form of $G$ or $\varphi$ can be quite complicated; they are only known to be Borel functions.
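As a small illustration of the invariance reduction just described, the following sketch (the sample sizes, parameters and variable names are ours, not the text's) computes the pair $(\xi, \eta)$ from two normal samples and checks its invariance under a common affine change:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical independent normal samples of unequal sizes.
x = rng.normal(loc=1.0, scale=2.0, size=11)   # N(mu1, sigma1^2), n1 = 11
y = rng.normal(loc=1.0, scale=0.5, size=8)    # N(mu2, sigma2^2), n2 = 8

s_x, s_y = x.std(ddof=1), y.std(ddof=1)
xi = abs(x.mean() - y.mean()) / s_x           # |xbar - ybar| / s_x
eta = s_x / s_y                               # s_x / s_y

# Invariance check: (xi, eta) is unchanged under x -> a*x + b, y -> a*y + b.
a, b = 3.7, -2.2
xs, ys = a * x + b, a * y + b
xi2 = abs(xs.mean() - ys.mean()) / xs.std(ddof=1)
eta2 = xs.std(ddof=1) / ys.std(ddof=1)
assert np.isclose(xi, xi2) and np.isclose(eta, eta2)
print(xi, eta)
```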
The existence of a desired test that is also similar for $H_0$ rests on important work of Linnik [1], and is as follows.

4. Theorem. If $X, Y$ are independent random samples of sizes $n_1, n_2$, of different parity, from $N(\mu_1, \sigma_1^2)$, $N(\mu_2, \sigma_2^2)$ respectively, then there exists a (non randomized) similar test of size $\alpha$, $0 < \alpha < 1$, for the Behrens-Fisher problem, whose critical region is determined by the vector $(|\bar x - \bar y|/s_x,\ s_x/s_y)$ in the positive quadrant of $\mathbb{R}^2$, so that the region is determined by a measurable function $G$ of the form described above.

Proof. We present the argument in steps, to illuminate the method of attack and to illustrate the finer points of the analysis.

1. Since $X = (X_1, \dots, X_{n_1})$ and $Y = (Y_1, \dots, Y_{n_2})$ are two independent random samples from the two normal populations, we may reduce the problem by considering the sufficient statistics; as already noted, the test depends on the vector $((\bar X - \bar Y)/S_x,\ S_x/S_y)$. The well-known properties of normal distributions (from elementary statistical analysis) imply that $\bar X - \bar Y, S_x^2, S_y^2$ are mutually independent r.v.s, so that we can derive the joint distribution of $(\bar X - \bar Y)/S_x$ and $S_x/S_y$ from the densities of the above r.v.s, which are respectively normal, chi-squared, and chi-squared. An elementary but tedious integration with change of variables (which we leave as an instructive exercise) gives the joint density of these r.v.s, denoted $\xi, \eta$, which depends only on the nuisance parameters, as:
$$ f_{\xi,\eta}(u,v\mid\theta) = c(n_1,n_2)\,\theta^{n_2/2}(1+\theta)^{(n_1+n_2-2)/2}\, \frac{v^{n_1-2}}{[\theta^2+\theta(1+u^2+v^2)+v^2]^{(n_1+n_2-1)/2}},\qquad (u,v)\in R, \tag{12} $$
where
$$ c(n_1,n_2)=\frac{2\,\Gamma\!\big(\tfrac{n_1+n_2-2}{2}\big)}{\Gamma\!\big(\tfrac{n_1-1}{2}\big)\,\Gamma\!\big(\tfrac{n_2-1}{2}\big)},\qquad \theta=\frac{n_2\sigma_1^2}{n_1\sigma_2^2}, $$
and $R = \{(u,v): u\in\mathbb{R},\ v\in\mathbb{R}^+\}$. Now a similar (non randomized) test $\delta(\xi,\eta)$ of size $\alpha$, $0<\alpha<1$, should satisfy the integral equation
$$ \int_R \delta(u,v)\,f_{\xi,\eta}(u,v\mid\theta)\,du\,dv=\alpha,\qquad \theta>0, $$
which in detail becomes
$$ \int_R \frac{\delta(u,v)\,v^{n_1-2}}{[\theta^2+\theta(1+u^2+v^2)+v^2]^{(n_1+n_2-1)/2}}\,du\,dv = \frac{\alpha}{c(n_1,n_2)}\,\theta^{-n_2/2}(1+\theta)^{-(n_1+n_2-2)/2}. \tag{13} $$
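The $\theta$-dependence on the right of (13) can be checked numerically: with $\delta \equiv 1$, the left side of (13) is the total integral of the kernel, so the product formed below should be (approximately) constant in $\theta$. A minimal sketch, assuming the exponents as reconstructed in (12)-(13) and hypothetical sizes $n_1, n_2$:

```python
import numpy as np
from scipy.integrate import dblquad

n1, n2 = 11, 8  # hypothetical sample sizes of different parity

def kernel(v, u, theta):
    # v^(n1-2) / [theta^2 + theta(1+u^2+v^2) + v^2]^((n1+n2-1)/2)
    return v**(n1 - 2) / (theta**2 + theta*(1 + u**2 + v**2) + v**2) ** ((n1 + n2 - 1) / 2)

def total(theta):
    # integrate over R = {(u, v): u in R, v > 0}
    val, _ = dblquad(kernel, -np.inf, np.inf, 0.0, np.inf, args=(theta,))
    return val

for theta in (0.5, 1.0, 2.0, 5.0):
    ratio = total(theta) * theta**(n2 / 2) * (1 + theta)**((n1 + n2 - 2) / 2)
    print(theta, ratio)   # approximately the same constant for every theta
```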
Thus a test function $0\le\delta(\cdot,\cdot)\le 1$ satisfying this equation for all $\theta>0$ should be found, where $\delta$ should also satisfy the symmetry and invariance conditions prescribed earlier as well as be an indicator function of a set in $R$. These competing conditions make the existence and determination of such a $\delta$ an intricate task. We now sketch the format of the solution.

2. To solve for $\delta$, it will be useful to factorize the denominator of the expression in (13). Thus let $A = A(u,v)$, $B = B(u,v)$ be the roots of the equation
$$ \theta^2+\theta(1+u^2+v^2)+v^2=(\theta+A)(\theta+B)=0. $$
Writing the equation as $\theta(1+\theta)+\theta u^2+(1+\theta)v^2=0$ and dividing by $\theta(1+\theta)$ gives
$$ -1=\frac{u^2}{1+\theta}+\frac{v^2}{\theta}, $$
and, letting $\theta=-D$, $D\ge 0$, one has
$$ \frac{v^2}{D}-\frac{u^2}{1-D}=1. $$
This is a family of confocal conics: for $0<D<1$ the loci are hyperbolas, $D=0$ gives $v=0$, $D=1$ gives $u=0$, and $D>1$ gives ellipses.
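A quick numerical confirmation of this factorization (a sketch; the tolerances and sampling are arbitrary): the roots of $D^2 - D(1+u^2+v^2) + v^2 = 0$ satisfy $0 \le A \le 1 \le B$, with $AB = v^2$ and $A + B = 1 + u^2 + v^2$ on $R$:

```python
import numpy as np

def conic_roots(u, v):
    # A(u, v), B(u, v): roots of D^2 - D(1 + u^2 + v^2) + v^2 = 0,
    # obtained from theta^2 + theta(1+u^2+v^2) + v^2 = 0 via theta = -D.
    s = 1 + u*u + v*v
    disc = np.sqrt(s*s - 4*v*v)          # discriminant is nonnegative
    return (s - disc) / 2, (s + disc) / 2

rng = np.random.default_rng(1)
for u, v in rng.uniform(-5, 5, size=(1000, 2)):
    v = abs(v) + 1e-9                    # keep (u, v) in the strip R
    A, B = conic_roots(u, v)
    assert -1e-9 <= A <= 1 + 1e-9 and B >= 1 - 1e-9
    assert np.isclose(A * B, v * v) and np.isclose(A + B, 1 + u*u + v*v)
```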
Thus $A(u,v)=D$ gives a family of hyperbolas, and $B(u,v)=D$ a family of ellipses for $D\ge 1$. The regions covered by these curves $A$, $B$ are important for the computations. Then (13) becomes
$$ \int_R \frac{\delta(u,v)\,v^{n_1-2}}{[(\theta+A)(\theta+B)]^{(n_1+n_2-1)/2}}\,du\,dv=\frac{\alpha}{c(n_1,n_2)}\,\theta^{-n_2/2}(1+\theta)^{-(n_1+n_2-2)/2}, \tag{14} $$
for $\theta>0$. If $\delta(u,v)-\alpha=\varphi(u,v)$, then we seek a solution of (14) expressed as
$$ \int_R \frac{\varphi(u,v)\,v^{n_1-2}}{(\theta+A)^N(\theta+B)^N}\,du\,dv=0, \tag{15} $$
where $N=\frac{n_1+n_2-1}{2}$. If $n_i\ge 2$, $i=1,2$, and $n_1, n_2$ are of different parity, then one should find a function $\varphi$ taking only the two values $-\alpha$, $1-\alpha$ and satisfying (15). Changing the variables $(u,v)\to(A,B)$ in (15), one finds the Jacobian
$$ \frac{\partial(u,v)}{\partial(A,B)}=-\frac14\,\frac{B-A}{\sqrt{AB(B-1)(1-A)}}. $$
Thus one
seeks the region corresponding to a nonsingular solution. Consequently $\varphi(u,v)\to\varphi_1(A,B)$ gives
$$ \int_\Pi \frac{\varphi_1(A,B)\,(AB)^{(n_1-3)/2}\,(B-A)}{\sqrt{(1-A)(B-1)}\,(\theta+A)^N(\theta+B)^N}\,dA\,dB=0, \tag{16} $$
where $\Pi$ is the strip $\{(A,B): 0\le A\le 1,\ B\ge 1\}$ onto which $R$ is mapped.

3. To solve this for $\varphi_1$, one expresses $\Pi=\bigcup_{k=0}^\infty \pi_k$ as disjoint half strips, $\pi_k=\{(A,B): 1-2^{-k}\le A<1-2^{-k-1},\ 1\le B<\infty\}$, and considers $\varphi_1$ of (16) on each of these $\pi_k$, $k=0,1,2,\dots$. Using the partial fraction technique, one gets $N$ integrals (since there are $N$ such functions) of the form
$$ \int_a^b I(A,B)\,p_{mki}(A,B)\,dA=\alpha\int_a^b p_{mki}(A,B)\,dA,\qquad \text{a.a. } B\in[c,d], $$
and (a.a. = almost all)
$$ \int_c^d I(A,B)\,p_{mki}(A,B)\,dB=\alpha\int_c^d p_{mki}(A,B)\,dB,\qquad \text{a.a. } A\in[a,b]. $$
Here $I(A,B)$ is the indicator function of the partition of $\pi_k$ into a countable number of rectangles, obtained by splitting each $\pi_k$ horizontally, and $p_{mki}$ is a positive function satisfying the integral in (16) when restricted to each of these strips. Thus in each rectangle there are $N$ sets of integrals of the above type, and they satisfy the hypothesis of Liapounov's theorem (somewhat modified to the situation at hand); this implies that a solution exists in each such rectangular region. Since there are only countably many such disjoint rectangles, we can uniquely define a measurable function $\varphi_1$ satisfying (16), and hence there is a $\varphi$ satisfying (15). It depends only on $u, v$, so that $\varphi(\xi,\eta)$ defines a similar test for all $\theta>0$, i.e., for all $\sigma_i^2>0$, $i=1,2$.

Remarks. 1. The details of the omitted computations can be found in Linnik ([1], Chapter X), but the structure of the proof is exactly as given here. A direct extension to a finite number of samples is possible.

2. If $n_1=n_2=n$, a simpler solution was given by Bartlett; it was generalized by Scheffé, by Welch, and by Wald (cf. also Exercise 5 below). There are several other randomized similar tests satisfying Neyman's rule. These facts are also found in Linnik's monograph.

3. One may ask whether the $\varphi$ or $G$ of the theorem can be chosen not merely measurable, as given there, but actually continuous, or with some other regularity properties. The problem is quite sensitive, and negative solutions to these questions emerge from Linnik's researches. We state the precise result below, without its (not surprisingly) complicated proof. Note also that the exponential nature of the distribution is fully utilized, indicating that for such a family some of the finer analytical work can be extended, employing the theory of (multiple) Laplace transforms and of (several) complex variables in a nontrivial and essential way, as seen from the work of Linnik and his associates. Typically the boundary of the region determined by $\varphi$ above plays a key role. Let us call the following regularity property of $\varphi$ for the null hypothesis simply "null-regularity": suppose that around every point of the segment $0\le v\le v_0$ ($v_0\ge 1$) of the $v$-axis there is a circle such that the conditional probability of rejecting the null hypothesis $H_0$, given that the sample point $(\xi=u,\ \eta=v)$ falls in the circle, is zero. Then the test $\varphi$ is termed null-regular. We then have:

5. Theorem. For the Behrens-Fisher problem, (i) there are no nontrivial non randomized critical regions determined by a continuous $G$ of Theorem 4, and (ii) there are no randomized homogeneous (like $G$) similar regions
having the null-regularity property.

As already noted, the proof of each of these assertions is quite involved, and is established by supposing the existence of such a region and deriving a contradiction, i.e., by an indirect argument. It may be observed that the existence of a critical region can be established under suitable conditions by identifying the composite hypotheses as vector measures and invoking Proposition 1.14. But that evidently cannot provide the detailed information which incorporates the particular structure of the problem, as discussed above. However, this work could motivate a further analysis of the vector measure point of view in this study.

6. A Comment. Before ending this analysis, one should perhaps note that R. A. Fisher seems to have recognized the difficulties involved with problems of nuisance parameters in general, and proposed a new theory called "fiducial probability". Its details have not been worked out fully enough for others to continue satisfactorily. As is the case with most new proposals, people found counterexamples from the contemporary point of view, and much misunderstanding resulted. Only Segal [1] and Tukey [1] seem to have made a brief attempt to understand the subject mathematically, but the work has not been pursued further by these or other mathematicians. In some respects this is similar to the Feynman integral of the early 1950s, which did not meet the standards of mathematical rigor; but, unlike Fisher's case, a great deal of mathematical research has fortunately gone into a new (rigorous) interpretation (with the active advice and collaboration of Feynman himself) by probabilists and mathematical physicists. It is to be hoped that a similar serious effort between the concerned researchers here will materialize in future developments, with a useful contribution to the study of composite hypotheses. Such a result would advance inference theory, whose numerous difficult problems have (except for Linnik's school) not been answered to the satisfaction of most practitioners of the subject.

2.6 Complements and exercises

1(a) Let $F, G: \mathbb{R}^+\to\mathbb{R}^+$ be differentiable functions such that the ratio of derivatives $G'/F'\ (\ge 0)$ is nonincreasing. Show that for any r.v. $X:\Omega\to\mathbb{R}$ one has $G^{-1}(E(G(X)))\le F^{-1}(E(F(X)))$, finite or not. [Hint: Observe that the function $G\circ F^{-1}$ satisfies the hypothesis of Prop. 1.8.]

(b) Deduce Liapounov's inequality from (a) by considering $F(x)=x^k$, $G(x)=x^v$, $0<v<k$; i.e., $[E(|X|^v)]^{1/v}\le[E(|X|^k)]^{1/k}$ for any r.v. $X$.

(c) Let $F:\mathbb{R}^+\to\mathbb{R}^+$ be nondecreasing, $F(0)=0$, with the left derivative $F'$ nonincreasing (nondecreasing). If $X:\Omega\to\mathbb{R}^+$
is a random variable, show that for any $\sigma$-algebra $\mathcal{B}\subset\Sigma$ one has $E(F(X))\le (\ge)\ E(F(E^{\mathcal{B}}(X)))$. [Hint: Use (a), replacing $E$ by the conditional expectation $E^{\mathcal{B}}$, and then note that $E(E^{\mathcal{B}}(Y))=E(Y)$ for any r.v. $Y\ge 0$. If $F(x)=x^2$ and $\mathcal{B}=\sigma(Y)$ for an r.v. $Y$, the resulting inequality is useful in the theory of estimation, as was noted independently by D. Blackwell and C. R. Rao in the middle 1940s for that purpose; it is discussed in the next chapter in detail.]

2(a) (Similar regions need not exist.) Let $p(x|\theta)=\theta x^{\theta-1}$, $0\le x\le 1$, be a family of probability densities, $\theta\in I=\{\theta: 1\le\theta<\infty\}$. If $H_0=\{\theta=n: n=1,2,\dots\}\subset I$ and $H_1=I-H_0$, show that there is no nontrivial similar test of size $0<\alpha<1$ for $H_0$. [Hint: Use the Weierstrass approximation theorem suitably.]

(b) Let $X:\Omega\to\mathbb{R}$ be a random variable on $(\Omega,\Sigma,P)$ with a probability density $f_X(x|\theta)$, $\theta\in H_0\cup H_1=I\subset\mathbb{R}^k$, relative to a $\sigma$-finite measure $\lambda$ on $(\mathbb{R},\mathcal{B})$, of the form
$$ f_X(x|\theta)=\exp\Big\{\sum_{i=1}^N u_i(x)\psi_i(\theta)\Big\}\,\sum_{i=1}^M h_i(x)g_i(\theta), $$
and let $X=(X_1,\dots,X_n)$, $n>N$, be a random sample, whose distribution then is
$$ f_X(x_1,\dots,x_n|\theta)=\prod_{i=1}^n f_{X_i}(x_i|\theta) =\exp\Big\{\sum_{i=1}^N T_i(x_1,\dots,x_n)\psi_i(\theta)\Big\}\times \sum_{i_1,\dots,i_n=1}^M h_{i_1}(x_1)\cdots h_{i_n}(x_n)\,g_{i_1}(\theta)\cdots g_{i_n}(\theta), $$
with $T_i(x_1,\dots,x_n)=\sum_{j=1}^n u_i(x_j)$. Verify that this family satisfies the hypothesis of Theorem 5.2, so that it admits a nontrivial similar region of size $0<\alpha<1$ for $\theta\in H_0$. If the second factor in $f_X(\cdot|\theta)$ reduces to unity, then one has the exponential family, and $T=(T_1,\dots,T_N)$ gives a sufficient statistic for it.
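For a concrete (hypothetical) instance of this structure, the Gamma family is exponential with $u_1(x)=\log x$, $u_2(x)=x$, so the vector sufficient statistic for a sample is $T=(\sum_j \log x_j,\ \sum_j x_j)$. A minimal sketch:

```python
import numpy as np

# Gamma(a, b) density is proportional to exp{(a-1) log x - x/b},
# an exponential family with u1(x) = log x, u2(x) = x.
rng = np.random.default_rng(2)
x = rng.gamma(shape=2.5, scale=1.3, size=50)  # hypothetical sample
T = (np.sum(np.log(x)), np.sum(x))            # the sufficient statistic
print(T)
```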
3. In view of 2(a), one may consider approximate similar regions, as follows. If $X$ is a random variable whose distribution $P_\theta$, $\theta\in I$, admits a statistic $T(X)$ having an absolutely continuous density with a bounded derivative, then an approximate similar test of size $0<\alpha<1$ exists, in the sense that for any $0<\varepsilon<1$ there is a measurable set (or region) $A_\varepsilon$ in the range space of $X$ satisfying $|P_\theta(A_\varepsilon)-\alpha|<\varepsilon$ for all $\theta\in I$. [This result is based on a simplified version of a result due to Besicovitch, and can be established by approximating the integral of the density of $T(X)$ uniformly by a sequence of polygonal figures in $\mathbb{R}^2$; a set $B_\varepsilon$ is then constructed as a union of a finite but large number of intervals, which gives the desired $A_\varepsilon$ as $T^{-1}(B_\varepsilon)$. The details are also included in Linnik's book [1]. However, the region thus constructed is generally very complicated and hence of limited utility. This can be seen even for the example of Problem 2(a) above.]

4. A family $\{P_\theta,\theta\in I\}$ of probability measures is said to be (boundedly) complete if for each (bounded) measurable $f:\Omega\to\mathbb{R}$, $\int_\Omega f\,dP_\theta=0$ for all $\theta\in I$ implies $f=0$ a.e. $[P_\theta]$. Invoking the classical theory of Laplace-Stieltjes transforms, one can show that if $dP_\theta=p_\theta\,d\lambda$ relative to a $\sigma$-finite measure $\lambda:\mathcal{B}\to\bar{\mathbb{R}}^+$ on $(\Omega,\Sigma)=(\mathbb{R}^n,\mathcal{B})$, where $p_\theta$ is an exponential function, then one has the completeness property. The family of densities in Exercise 2(a) above has the bounded completeness property. These general classes are useful for constructing similar tests, complementing the earlier work, as we now illustrate. Thus let $p_\theta$ be an exponential family of a random variable $X$ having $T(X)$ as a (vector) sufficient statistic for $\theta$ (cf. eq. (5) of Section 5), so that
$$ p_\theta(x)=c(\theta)\exp\Big(\sum_{i=1}^k a_i(\theta)T_i(x)\Big)h(x),\qquad \theta\in I\subset\mathbb{R}^k,\ x\in\mathbb{R}^n. $$
If $I$ is a nonempty open set, then the family of densities $p_\theta^T$ of $T$ is complete in the sense that for any $p_\theta^T$-integrable $f$, $E_\theta(f(T(X)))=0$ for all $\theta\in I$ implies $f=0$ a.e. $(p_\theta^T)$. More generally, if $\{P_\theta,\theta\in I\}$ is a probability family governing $X$ for which a sufficient statistic $T$ exists, then any (randomized) test $\delta$ of $X$ for $\theta\in I$ is similar of size $\alpha$, $0<\alpha<1$, whenever the family is boundedly complete for the open parameter set $\emptyset\ne I\subset\mathbb{R}^k$. [Hints: It suffices to consider the case that $I$ is an open box, that $\theta\to(a_1,\dots,a_k)(\theta)$ is one-to-one, and that $f\ge 0$. The hypothesis implies that $\int_{\mathbb{R}^k}\exp\big(\sum_{i=1}^k a_i(\theta)t_i\big)f(t)\,d\mu_T(t)=0$, where $\mu_T$ is the measure induced by $T$ and $h$; now use a suitable (multidimensional) version of the uniqueness theorem for LS-transforms (cf. Widder [1], p. 336). For the last part, note that $\delta(X)-\alpha$ is a bounded Borel function.]

5. [The Behrens-Fisher problem for equal sample sizes.] Let $X_i\sim N(\mu_1,\sigma_1^2)$ and $Y_i\sim N(\mu_2,\sigma_2^2)$, $i=1,\dots,n$, be independent samples. Then the following critical region (obtained by Bartlett and Scheffé in the early 1940s) is a solution of the Behrens-Fisher problem:
$$ A_k=\Big\{\frac{\sqrt{n(n-1)}\,|\bar X-\bar Y|}{\sqrt{\sum_{i=1}^n (X_i-Y_i-(\bar X-\bar Y))^2}}>k\Big\}, $$
where, as usual, $k>0$ is chosen to satisfy $P_\theta(A_k)=\alpha$ when $\theta\in H_0=\{\mu_1-\mu_2=0,\ \sigma_i^2>0,\ i=1,2\}$, with $H_1=\{\mu_1\ne\mu_2,\ \sigma_i^2>0,\ i=1,2\}$, and in general $\theta=(\mu_i,\sigma_i^2,\ i=1,2)$ is the parameter vector.
[Hints: Verify that the random variable defining $A_k$ has a 'Student's $t$' distribution with $n-1$ degrees of freedom. This type of region has been characterized by Linnik, whose work shows the profound differences that occur when the sample sizes are equal and when they are different for this problem.]
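A minimal computational sketch of this critical region (the function and variable names are ours, not the text's):

```python
import numpy as np
from scipy import stats

def bartlett_scheffe(x, y, alpha=0.05):
    """Equal-sample-size Behrens-Fisher test via paired differences.

    Rejects H0: mu1 = mu2 when the statistic defining A_k exceeds the
    upper alpha/2 quantile of Student's t with n-1 degrees of freedom.
    """
    assert len(x) == len(y)
    n = len(x)
    d = x - y                              # D_i = X_i - Y_i
    num = np.sqrt(n * (n - 1)) * abs(d.mean())
    den = np.sqrt(np.sum((d - d.mean()) ** 2))
    t = num / den                          # has a t_{n-1} law under H0
    k = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return t, t > k

rng = np.random.default_rng(3)
x = rng.normal(0.0, 2.0, size=20)          # hypothetical equal-size samples
y = rng.normal(0.0, 0.5, size=20)
print(bartlett_scheffe(x, y))
```

The statistic is the usual paired-difference $t$; Linnik's point is precisely that no comparably simple region exists when $n_1 \ne n_2$.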
Bibliographical notes

The fundamental notion of a most (or least) powerful region of a given size for comparing two measures is due to Neyman and Pearson [1,2], with its abstract version due to Grenander [1]. Although an extension of this result for a scalar versus a finite vector (dominating) measure already appears in the early paper of Neyman and Pearson, a comparison of two (finite) vector valued nonatomic measures is the next significant extension, and it is due to Chernoff and Scheffé [1]. This result has wide applications to control theory and elsewhere, but this fact was not immediately noticed. These results, and the myriad difficulties of an infinite dimensional extension, are still not fully resolved. A simpler account and several open problems were noted by the author (cf. Rao [7] and [9]). An interesting and significant application, for a pair of scalar measures where the dominating one has the Darboux property, is found in Mann [1], which shows how the Hölder and Liapounov inequalities, as well as characterizations of convexity and concavity (also Jensen's inequality), follow from the Neyman-Pearson fundamental idea. We have presented all these facts in an extended form in a somewhat long first section, to show the fecundity of the basic idea and the mathematical consequences that have not been adequately appreciated in the literature.

All of the preceding work is for simple hypotheses. The composite case presents several new subtleties and difficulties. Some proposals to reduce the problem to the simple case have been advanced in practice. One of them is to introduce weights, in terms of a measure on the parameter space, and bring it to the simple case by averaging. But these weights are intended only to facilitate the computations and are otherwise arbitrary. To make a somewhat structured pattern out of this, a Bayesian procedure, extending an old philosophical argument due to the Rev. T. Bayes, is advanced. The (normalized) weights are then called a priori probabilities, and arguments have been introduced to amplify these ideas. Observe that the weights, interpreted as prior probabilities, necessarily depend on some parameters which cannot be known, as in the
previous case. Following the same procedure, one can then place "second and higher stage" or "hierarchical" priors. A natural extension of this idea is an iteration of these numerical weights, so that each posterior probability is again a regular conditional measure, and the "multi stage" posterior can be obtained on a product measurable space in accordance with a classical theorem due to Ionescu Tulcea [1]. This point has been presented in some detail in Sections 2 and 3. The difficulties inherent in the use of prior probabilities are made transparent by such an extension. If each prior is determined by just one extra step, then the resulting model becomes a simple Markov process, and the direction of the process for inference is illuminated.

Even if a single stage (or ordinary) prior is used, its selection is essentially unreliable and arbitrary, as the following incident, which occurred in 1960, indicates. It was between the well-known research thinker of subjective probability, who is also a Bayesian architect, namely Prof. L. J. Savage, and the author. This was the first time I met him (at the San Francisco airport, on our way to the IMS meetings at Stanford), and as a young researcher I wanted to make an acquaintance. The following is the essential conversation:

R: Prof. Savage, my name is Rao, and I would like to introduce myself.
S: I met you before, as I recall.
R: I don't think so; in fact, I am ...
S: No, let me tell you. You did your graduate work at Chapel Hill, North Carolina, and you studied Statistics.
R: No, I have in fact never crossed the Mason-Dixon line, and I never had an opportunity to enroll in a Statistics Department in the United States.
S: (Laughs) I assumed that usually a Rao studies Statistics and that an Indian student goes to Chapel Hill for such a study. That is based on my subjective probability. But you proved to be a counterexample to both assumptions. (And with a smile he later extended me a ride to the Stanford meetings, through his colleague Lester Dubins.)

This episode indicates that even an outstanding subjectivist thinker, L. J. Savage, could not necessarily formulate a sustainable prior probability. The present day discussions are based on essentially personal probabilities, but this remains a difficult problem that stays with the subject. Additionally, the calculation of these conditional measures is technically a difficult matter, as seen in Section 4. It is thus evident that the contribution of Bayesian ideas to simplifying or solving the problems of composite hypotheses is not yet a complete answer, although it helps to bring in many new ideas and results of decision theory. Even the latter has not yet produced definitive solutions of outstanding problems (such
as those of Section 5) of the subject. However, from a purely mathematical point of view, assuming the availability of 'reasonable' priors, the resulting analysis can resemble work close to Markov type processes, as seen in Section 3. More interestingly, however, the introduction of the Bayesian idea (with hierarchical priors) shows that even the classical (finite dimensional) problems of inference lead to (non independent) stochastic processes and to inferences on them. We may therefore consider its employment in this spirit. It should also be noted that our discussion is based on the classical Kolmogorov model. Using other approaches, as advocated by, e.g., Rényi [1], one can obtain similar (not the same) solutions. The diagonal method employed in Grenander's General Pattern Theory [3], Section 7.3 in particular (see also Hwang [1]), important in that situation, is different from Kolmogorov's conditioning model, to which our treatment is attached. We chose the latter because the material discussed here, and throughout this volume, is devoted to the basic mathematical analysis of stochastic inference, employing the results presented in the companion volume (Rao [21]) as well as other works from the same point of view (e.g., Doob [2], Grenander [1,2], Shiryayev [1]).

As a next step in the extension of the Neyman-Pearson formulation, we studied similar regions and sufficient statistics. These results depend on deeper mathematical analysis. The classical Behrens-Fisher problem and the general work with nuisance parameters show that the Bayesian and other simplifying ideas do not yet seem to be sufficient. The most serious study of these questions was conducted chiefly by Linnik and his students in the 1960s, and his monograph [1] is so far the unique volume presenting a thorough analysis of the available work on these difficult matters. We have included two key aspects of the results from this admirable account in Section 5, to show how essentially new procedures are needed in the composite hypothesis case. A few additions are also included in the complements and exercises section.

It is hoped that our treatment of the principles of hypothesis testing shows that the seemingly "controversial" aspects of inference theory concern the formulations of the practical problems, which also arise in other applications of mathematics, and that there is nothing special about these questions in statistical inference. We now consider in the same manner a mathematical analysis of another aspect of inference, namely the theory of estimation, leading to prediction and filtering among others.
Chapter III

Parameter Estimation and Asymptotics
In stochastic modeling, probability distributions often involve unknown parameters, and this chapter deals with some principles of estimation of such parameters. That involves various classes of loss functions, and the study concentrates on certain desirable properties of estimators which are (known) functions of random variables. These include a detailed mathematical analysis of Bayes and maximum likelihood estimation as well as (nonlinear) prediction. Also treated are the asymptotics of the methodology to be explored for particular types of processes later on, and a relatively short account of sequential estimation together with some important complements.
3.1 Loss functions of different types

Let $\{(\Omega,\Sigma,P_\theta),\theta\in I\}$ be the basic probability model for an experiment, and $X:\Omega\to\mathbb{R}^k$ a (vector) random variable governed by $P_\theta$; i.e., the distribution of $X$ is given by $F_X(x|\theta)=P_\theta[X<x]$, $\theta\in I$, $x\in\mathbb{R}^k$, the inequality being taken component-wise for vectors. It was indicated in Section I.3 that, since typically $\theta$ is unknown, one needs to find a function $g:\mathbb{R}^k\to I$ of $X$ whose distribution places maximum probability at or around $\theta$ in $I$, or alternatively such that the "error" of $g(X)$ from $\theta$ is a minimum relative to some criterion of measuring the absolute error (or the "loss" incurred due to that error). Such a measurable function $g(X)$, involving no unknown $\theta$, is termed an estimator of $\theta$, and is thus a random variable. [If $X(\omega)=x$ is the observed value of $X$, then $g(x)=g(X)(\omega)$ is usually called an estimate (of $\theta$), which is thus not a random variable.] Both these types of loss functions can be given a unified mathematical formulation, as follows.
A mapping $L: I\times I\to\mathbb{R}^+$ is called a loss function if $L(x,\theta)=0$ for $x=\theta$, and if it is jointly measurable relative to a $\sigma$-algebra $\mathcal{I}$ of $I$, given in advance (usually with the basic model, for which points are measurable). Some loss functions often considered in the literature are the following. If $\varphi:\mathbb{R}\to\mathbb{R}^+$ is a monotone nondecreasing function such that $\varphi(x)=\varphi(-x)$ and $\varphi(0)=0$, then $L(x,\theta)=a(\theta)\varphi(x-\theta)$, $a:I\to\mathbb{R}^+$, is a candidate. The $\varphi$-functions include the following: (i) $\varphi(\cdot)$ convex; (ii) $\varphi(x)=|x|^p$, $0<p<\infty$; (iii) for $a,b\in\mathbb{R}-\{0\}$ with $|a|<|b|$,
$$ \varphi_{a,b}(x)=\begin{cases} 0, & |x|<|a|,\\[2pt] \dfrac{|x|-|a|}{|b|-|a|}, & |a|\le|x|<|b|,\\[2pt] 1, & |x|\ge|b|; \end{cases} $$
and (iv) for $\alpha\in\mathbb{R}^+$,
$$ \varphi(x)=\begin{cases} 0, & |x|<\alpha,\\ 1, & |x|\ge\alpha. \end{cases} $$
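A small sketch of the families (ii)-(iv) just listed (the particular values of $a$, $b$, $\alpha$ are hypothetical):

```python
import numpy as np

def phi_power(x, p=2.0):
    return np.abs(x) ** p                        # case (ii)

def phi_ramp(x, a=1.0, b=3.0):                   # case (iii), with |a| < |b|
    t = np.abs(x)
    return np.clip((t - abs(a)) / (abs(b) - abs(a)), 0.0, 1.0)

def phi_indicator(x, alpha=1.0):                 # case (iv)
    return (np.abs(x) >= alpha).astype(float)

# Each candidate is symmetric, nondecreasing in |x|, and vanishes at 0.
x = np.linspace(-4, 4, 9)
for phi in (phi_power, phi_ramp, phi_indicator):
    assert np.allclose(phi(x), phi(-x))          # phi(x) = phi(-x)
    assert phi(np.array([0.0]))[0] == 0.0
```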
Most commonly one considers $\varphi(\cdot)$ to be a convex function which is continuous and increasing on $\mathbb{R}^+$, with $a(\theta)=1$. The structure of continuous convex functions implies (cf. Theorem II.1.9)
$$ \varphi(x+y)\ \ge\ \varphi(x)+\varphi(y),\qquad \forall x,y\in\mathbb{R}^+. \tag{1} $$
It may be shown that such a $\varphi$ is representable (since $\varphi(0)=0$) as
$$ \varphi(x)=\int_0^x h(t)\,dt, \tag{2} $$
where $h:\mathbb{R}\to\mathbb{R}^+$ is a nondecreasing function which can be taken as the right (or left) derivative of $\varphi$, existing at each point (cf., e.g., Rao [17], p. 242). On the other hand, for a discontinuous convex function, $h(t)=+\infty$ for $t\ge x_0$, for some $x_0$. In particular, if $\varphi(x)=|x|^2$ one has a quadratic loss function. Because of its mathematical simplicity, this is very popular among practitioners and was recommended initially by Gauss precisely for this (and, he even said, for no other) reason. Although more general loss functions can be devised, their usefulness will be limited if suitable mathematical techniques are not available. However, one finds the most general loss functions for which good analytical tools are available to be the convex classes, and accordingly these will receive the major attention in what follows. Since $L(g(X),\theta):\Omega\to\mathbb{R}^+$ is a random variable, it will be different for different observed values $X(\omega)$ of $X$, rendering it of limited use.
Hence one usually considers an averaged value for the comparison of different estimators $g(X)$ of $\theta$. Such a function is defined by
$$ R_L(g,\theta)=\int_\Omega L(g(X),\theta)\,dP_\theta = E_\theta(L(g(X),\theta)) =\int_{\mathbb{R}^k} L(g(x),\theta)\,dF_X(x|\theta),\qquad \theta\in I. \tag{3} $$
Here $R_L(\cdot,\cdot)$ is called the risk function associated with the loss $L(\cdot,\cdot)$.
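Since the risk (3) is an expectation, it can be approximated by simulation. A minimal sketch for the $N(\theta,1)$ model (all names and parameters are ours):

```python
import numpy as np

def risk(estimator, theta, loss, n=20, reps=20000, seed=0):
    """Monte Carlo approximation of R_L(g, theta) = E_theta[L(g(X), theta)]
    for a sample of size n from N(theta, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(theta, 1.0, size=(reps, n))
    est = np.apply_along_axis(estimator, 1, X)
    return loss(est, theta).mean()

sq = lambda g, t: (g - t) ** 2
print(risk(np.mean, theta=1.0, loss=sq))    # ~ 1/n = 0.05
print(risk(np.median, theta=1.0, loss=sq))  # ~ (pi/2)/n, larger
```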
A function $g_0$ that makes $R_L(g,\theta)$ a minimum at $\theta=\theta_0$ is termed a locally best estimator of $\theta$, relative to $R_L$ (or the loss function $L$), at $\theta_0$, and is best if it is so uniformly in $\theta\in I$. The existence (and uniqueness) of such estimators can only be established under additional conditions, which are considered below.

3.2 Existence and other properties of estimators

We now treat the existence problem for the (locally best) estimators of $\theta\in I\subset\mathbb{R}$ when $L(g(x),\theta)=W(g(x)-\theta)$, where $W:\mathbb{R}\to\mathbb{R}^+$ is a convex (hence absolutely continuous) function satisfying $W(0)=0$ and (for nontriviality) not constant. Then $W$ has a right and a left derivative at each point, the two agreeing at all but a countable set of points of $\mathbb{R}$. The optimality criteria for estimators used in large parts of inference studies can be stated simply as:

(i) $g_0$ is locally best if
$$ R_L(g_0,\theta)=\inf\{R_L(g,\theta): g\in\mathcal{G}\}, \tag{1} $$
where $\mathcal{G}$ consists of estimators $g$ for which $R_L(g,\theta)<\infty$, $\theta\in I$;

(ii) $g_0$ is minimax if
$$ \sup_{\theta\in I} R_L(g_0,\theta)=\inf_{g\in\mathcal{G}}\sup_{\theta\in I} R_L(g,\theta); \tag{2} $$

and (iii) $g_0$ is "good" if for each $c>0$,
$$ P_\theta[|g_0(X)-\theta|\le c]=\sup_{g\in\mathcal{G}} P_\theta[|g(X)-\theta|\le c]. \tag{3} $$
This last one is actually included in (i) if we take $L_c(g,\theta)=\chi_{[|g-\theta|>c]}$. It is thus evident that the optimality criteria depend on the methodology of variational calculus. We consider here the convex (and later the monotone) case in detail, to illuminate the mathematical structure of these problems in finding best estimators. It will appear from the following work that the more general the loss function, the less general
is the class of distributions admitted for detailed analytical solutions, so that the "sizes" of classes of loss functions and those of distribution functions seem to vary in opposite directions. Thus let $w_+$ and $w_-$ be the right and left derivatives of the convex loss function $W$ discussed at the beginning of this section, so that
$$ W(x)=\int_0^x w_+(t)\,dt=\int_0^x w_-(t)\,dt,\qquad x\ge 0. \tag{4} $$
For a Borel measure $\mu$ on $\mathbb{R}$, if $\int_{\mathbb{R}} W(x-y)\,d\mu(x)<\infty$, $y\in\mathbb{R}$, then for all $a>0$, by (4), one has
$$ a\int_{x>y} w_\pm(x-y)\,d\mu(x)\ \le\ \int_{x>y}[W(x-y+a)-W(x-y)]\,d\mu(x)<\infty, \tag{5} $$
and
$$ U(y)=\int_{\mathbb{R}} W(x-y)\,d\mu(x),\qquad y\in\mathbb{R}, \tag{6} $$
is a continuous (strictly) convex function if $W(\cdot)$ is. Further, from the nontriviality of $W$ we deduce that $U(y)<\infty$ for $y\in\mathbb{R}$ and $\lim_{|y|\to\infty} U(y)=\infty$. Hence the set on which $U(\cdot)$ attains its minimum (it has no maximum) is a nonempty compact interval $J\subset\mathbb{R}$. At each point $t\in J$ one also has, for the right (left) derivative $u_+$ ($u_-$) of $U$, $u_+(t)\ge 0$ ($u_-(t)\le 0$). These properties are used in establishing the following technical characterization of $J$, to be used later.

1. Proposition. The compact nonempty minimizing set $J$ of $U$ defined above is such that $y\in J$ iff the following pair of integral inequalities holds:
$$ \int_{x\le y} w_+(y-x)\,d\mu(x)\ \ge\ \int_{x>y} w_+(x-y)\,d\mu(x), \tag{7a} $$
$$ \int_{x<y} w_-(y-x)\,d\mu(x)\ \le\ \int_{x\ge y} w_-(x-y)\,d\mu(x). \tag{7b} $$
Further, $J$ degenerates to a single point if $W(\cdot)$ is strictly convex. In any case, if $\mu$ is nonatomic, or if $w_+(t)=w_-(t)=w(t)$ (say) with $w(0)=0$, then (7a) and (7b) reduce to the single equation
$$ \int_{x<y} w(y-x)\,d\mu(x)=\int_{x>y} w(x-y)\,d\mu(x). \tag{8} $$
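Before the proof, a numerical illustration of (8), a sketch with an arbitrary sampled measure: for $W(t)=t^2$ (so $w(t)=2t$) the balancing point is the sample mean, and for $W(t)=|t|$ (so $w=1$ off the origin) it is the sample median.

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.exponential(scale=2.0, size=100001)  # a skewed empirical "mu"

def balance(y, w):
    # left side minus right side of (8) for the empirical measure
    left = np.sum(w(y - sample[sample < y]))
    right = np.sum(w(sample[sample > y] - y))
    return left - right

y_mean, y_med = sample.mean(), np.median(sample)
print(balance(y_mean, lambda t: 2 * t))            # ~ 0 at the mean
print(balance(y_med, lambda t: np.ones_like(t)))   # ~ 0 at the median
```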
Proof. To establish (7a) and (7b), we only need to calculate the right and left derivatives of U (·) and use the fact that on J they are respectively positive and negative (hence = 0 if the derivative of W exists).
Thus for any sequence $a_n\downarrow 0$ one has, on considering the difference quotients,
$$ V_n(x-y)=[W(x-y-a_n)-W(x-y)]/a_n\ \uparrow\ -w_-(x-y), \tag{9} $$
as $n\to\infty$. Hence, by the monotone convergence theorem, one has
$$ u_-(y)=\lim_{n\to\infty}\frac{U(y-a_n)-U(y)}{a_n} =\lim_{n\to\infty}\int_{\mathbb{R}} V_n(y-x)\,d\mu(x) =-\int_{\mathbb{R}} w_+(x-y)\,d\mu(x) =\int_{x\le y} w_+(y-x)\,d\mu(x)-\int_{x>y} w_+(x-y)\,d\mu(x)\ \ge\ 0,\qquad y\in J, $$
since $w_+(-y)=-w_+(y)$. This gives (7a), and (7b) is similar. If $W(\cdot)$ is strictly convex, then so is $U(\cdot)$, and hence $J$ consists of a single point. Finally, if $\mu(\cdot)$ is nonatomic, then every countable set has $\mu$-measure zero, so $w_+(t)=w_-(t)$ for a.a. $t$, and equality holds in (7a) and (7b). The same clearly holds if $W'(t)=w(t)$ exists at each point and $W'(0)=0$. Thus (8) holds in either of these cases. Note that $\mu$ need not be finite here.

Remark. The above proposition holds without change if the $y$-values form an open non-degenerate interval $I\subset\mathbb{R}$ instead of the whole line. Then $\lim_{y\to\partial I} U(y)$ exists and is $\ge U(y)$, $\forall y\in I$, the former value being finite or not, and $J\subset I$. Here $\partial I$ denotes the end points (the boundary) of $I$, and $U(y)<\infty$, $\forall y\in I$.

This result can be used to establish the existence of Bayes estimators [and, in another sense, a solution of an optimal (nonlinear) prediction problem to be derived later]. First we recall a relevant formulation of Bayes estimators. Let $X:\Omega\to\mathbb{R}^k$ and $\Theta: I\to I$ be measurable mappings (or r.v.'s) relative to the Borel $\sigma$-algebras of $\mathbb{R}^k$ and $\mathcal{I}\ (\subset\mathbb{R})$, where $\{(\Omega,\Sigma,P_\theta),\theta\in I\}$ is the basic probability space. Let $\mathcal{D}$ be the collection of measurable functions $\delta:\mathbb{R}^k\to I$. If $\mu$ is a (prior) probability on $(I,\mathcal{I})$ governing $\Theta$, and $F_X(\cdot|\theta)$ is the distribution of $X$ with $\theta$ as its parameter (a value of $\Theta$), consider $\Omega'=\Omega\times I$, $\Sigma'=\Sigma\otimes\mathcal{I}$, and
$$ \tilde P: A\times B\ \mapsto\ \int_B P(A|\theta)\,d\mu(\theta)=\int_B\int_A dP_\theta\,d\mu(\theta). $$
This gives the joint measure space $(\Omega',\Sigma',\tilde P)$. Thus $P_\theta(\cdot)=P(\cdot|\theta)$ is a regular conditional probability of $X$ given $\Theta$, and $X,\Theta$ can be regarded as random variables on $\Omega'$ (i.e., $X(\omega')=X(\omega)\times\theta_0$ and $\Theta(\omega')=\Theta(i)\times a_0$, where $\theta_0, a_0$ are arbitrarily fixed elements of $I$, $\mathbb{R}^k$ in the ranges of $X$ and $\Theta$). Under this identification, which was also noted in Section II.3, the following concept can be introduced.
2. Definition. Let $X$ and $\Theta$ be as above. Then an element $\delta_0(\cdot)\in\mathcal{D}$ is called a Bayes estimator of the random parameter $\Theta$ relative to a convex loss function $W(\cdot)$ if (writing $E(\cdot)$ for expectation relative to $\tilde P$, with $\mathcal{D}$ as introduced above)
$$ E(W(\Theta-\delta_0(X)))=\inf\{E(W(\Theta-\delta(X))):\ \delta\in\mathcal{D}\}<\infty. \tag{10} $$
We now establish the existence and determination of Bayes estimators relative to convex loss, as a consequence of Proposition 1. Indeed, one has a characterization of these estimators, as follows.

3. Theorem. Let $X,\Theta$ be the random and parametric variables above, where $I=\mathbb{R}$ for simplicity, and let $W(\cdot)$ be a symmetric convex function on $\mathbb{R}$ with $W(0)=0$. Assume that $X(\Omega)$ is a Borel subset of $\mathbb{R}^k$. Then there exists a Bayes estimator $\delta(X)$ of $\Theta$. In fact, for each $x=X(\omega)$, let $F(\cdot|x)$ be a regular conditional distribution of $\Theta$ [a posterior distribution of $\Theta$] given $X$ (which exists when $X(\Omega)$ is a Borel set), and let $J(x)$ be the corresponding nonempty compact minimizing interval for the risk function $U(\cdot|x)$; [one may define $\delta(x)$, for instance, as the left end point of $J(x)$ for each $x$, the observed value of the random variable $X$]. Such an estimator $\delta(X)$ satisfies, for each sample value, the pair of integral inequalities
$$ \int_{\theta\ge\delta(x)} w_+(\theta-\delta(x))\,F(d\theta|x)\ \ge\ \int_{\theta<\delta(x)} w_+(\delta(x)-\theta)\,F(d\theta|x), $$
$$ \int_{\theta>\delta(x)} w_-(\theta-\delta(x))\,F(d\theta|x)\ \le\ \int_{\theta\le\delta(x)} w_-(\delta(x)-\theta)\,F(d\theta|x). \tag{11} $$
Further, (11) has a unique solution if $W(\cdot)$ is strictly convex. Finally, (11) reduces to a single equation if $W'$ exists everywhere and $W'(0)=0$, so that $\delta(x)$ is a solution of the integral equation
$$ \int_{\theta>\delta(x)} W'(\theta-\delta(x))\,F(d\theta|x)=\int_{\theta<\delta(x)} W'(\delta(x)-\theta)\,F(d\theta|x). \tag{12} $$
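Before the proof, a minimal numerical sketch of the theorem in the conjugate Beta-Bernoulli model (the prior parameters are hypothetical): for $W(t)=t^2$ the solution of (12) is the posterior mean, and for $W(t)=|t|$ it is the posterior median.

```python
import numpy as np
from scipy import stats

# Hypothetical model: Theta ~ Beta(2, 3) prior, X | theta a Bernoulli(theta)
# sample of size n; then F(dtheta | x) is Beta(2 + s, 3 + n - s).
a0, b0 = 2.0, 3.0
rng = np.random.default_rng(5)
x = rng.binomial(1, 0.7, size=40)
s, n = x.sum(), x.size

posterior = stats.beta(a0 + s, b0 + n - s)
delta_sq = posterior.mean()      # Bayes estimate under t^2 loss
delta_abs = posterior.ppf(0.5)   # Bayes estimate under |t| loss (median)
print(delta_sq, delta_abs)
```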
Proof. Everything except the measurability is a direct consequence of Proposition 1, since the fact that X(Ω) is a Borel set implies the regularity of the conditional measure F (·|x) for each x (cf. e.g., Rao [19], Chapter 5). Also the condition on X(Ω) is not a restriction for our problem since the range space Rk × R is a complete metric (or a “Polish”) space and even in more general cases X may be taken as a coordinate function via Kolmogorov’s theorem (cf. Theorem I.1.1). We now prove the desired measurability of δ(·), taking it as in the statement
for definiteness. Observe that $U(\cdot|x)$ of (6), with the measure $F(\cdot|x)$ in place of $\mu(\cdot)$ there, is convex. Hence it follows from the definition of $u_+(\cdot|x)$ that $\delta(x)\le a$ iff $u_+(a|x)\ge 0$, for $x\in I=\mathbb{R}$. But by definition, for any $a_n\downarrow 0$ one has
$$ u_+(a|x)=\lim_{n\to\infty}\,[U(a+a_n|x)-U(a|x)]/a_n,\qquad x\in\mathbb{R}. \tag{13} $$
Now, by the regularity of $F(\cdot|x)$, the function $x\mapsto U(a|x)$ is equivalent to a Borel function for each $a$, and hence the right side limit in (13), and so the left side, is a measurable function of $x$. It follows that the set $\{x:\delta(x)\le a\}=\{x: u_+(a|x)\ge 0\}$ is measurable, so that $\delta(X)$ is a Bayes estimator by the previous definition and the proposition.

From (1) and (11) or (12), it is clear that the risk function for the Bayes estimator $\delta_0$ is given, in the case $L(\cdot,\cdot)=W(\cdot)$, by
$$ R_W(\delta_0,\Theta)=\int_{\Omega'} W(\Theta-\delta_0(X))\,d\tilde P(\omega') =\int_I\int_{\mathbb{R}^k} W(\theta-\delta_0(x))\,F(dx|\theta)\,\mu(d\theta), \tag{14} $$
where $\delta_0(x)$ is a solution of (11) or (12) and $\mu$ is the prior probability on $(I,\mathcal{I})$ governing $\Theta$. In most problems, the actual evaluation of $R_W(\delta,\Theta)$ depends on the ability to solve the respective integral inequalities, or the equation, and then to use the solution in (14). Since this is a nontrivial task in many cases, one seeks best lower bounds for risk functions which could be used for inference. We shall present a solution of such bounds for an appreciation of the problem. Let $M^{k'}(\tilde P)$ denote the following set, with $k'\ge 1$ ($E(\cdot|\cdot)$ being the conditional expectation on $\Omega'$):
$$ M^{k'}(\tilde P)=\{m\in L^{k'}(\Omega',\Sigma',\tilde P):\ E(m(\Theta,X)|X=x)=0,\ \text{a.a. } x\}. \tag{15} $$
Note that for $k_1\le k_2$ one has $M^{k_1}(\tilde P)\supset M^{k_2}(\tilde P)$, since $\tilde P$ is a finite measure. The class of functions in $M^{k'}(\tilde P)$ can be constructed in many ways, one of which will be indicated later.

4. Theorem. Let $X,\Theta$ be as above and suppose that the symmetric (nontrivial) convex loss function $W$, vanishing at the origin, is given. Let $A_0$ be the set of $k\ge 1$ such that $W^{1/k}$ is convex. Then for any estimator $\delta(X)\in\mathcal{D}$, the best lower bound for the Bayes risk $R_W(\delta,\Theta)$ is given by ($E(\cdot)$ is expectation relative to $\tilde P$):
$$ R_W(\delta,\Theta)=E(W(\Theta-\delta(X)))\ \ge\ \sup_{k\in A_0}\ W\!\left(\frac{E(\Theta\,m(\Theta,X))}{E(|m(\Theta,X)|)}\right)\left[\frac{E(|m(\Theta,X)|)}{\|m(\Theta,X)\|_{k'}}\right]^{k}, \tag{16} $$
where $k'\ge 1$ is the conjugate exponent satisfying $k^{-1}+k'^{-1}=1$, $\|m(\Theta,X)\|_{k'}^{k'}=E(|m(\Theta,X)|^{k'})$, the case $k'=\infty$ is interpreted as the essential supremum of $m(\Theta,X)\in M^{k'}$, and $\frac00$ is taken as $0$.

Proof. One may assume that $m\in M^{k'}$ is such that $E(|m(\Theta,X)|)>0$. Let $W'=W^{1/k}$, $k\in A_0$, and consider $k>1$. Then $W'$ has the same properties as $W$, so that by Jensen's inequality one gets
$$ E\big(|m(\Theta,X)|\,W'(\Theta-\delta(X))\big) \ \ge\ E(|m(\Theta,X)|)\,W'\!\left(\frac{|E(m(\Theta,X)(\Theta-\delta(X)))|}{E(|m(\Theta,X)|)}\right) = E(|m(\Theta,X)|)\,W'\!\left(\frac{E(m(\Theta,X)\,\Theta)}{E(|m(\Theta,X)|)}\right), \tag{17} $$
since $m(\Theta,X)\in M^{k'}$, so that $E(m(\Theta,X)\delta(X))=E(\delta(X)E(m(\Theta,X)|X))=0$. Now, applying Hölder's inequality to the left side of (17) and using the fact that $W'^{\,k}=W$, one obtains from (17)
$$ \|m(\Theta,X)\|_{k'}^{\,k}\,E(W(\Theta-\delta(X)))\ \ge\ W\!\left(\frac{E(\Theta\,m(\Theta,X))}{E(|m(\Theta,X)|)}\right)[E(|m(\Theta,X)|)]^{k}. \tag{18} $$
This is (16) for $k>1$ when $A_0$ is a singleton. It also holds if $k=1$, so that $k'=\infty$, when the corresponding form of Hölder's inequality is used. Finally, if $A_0$ is not a singleton, the statement again follows if we show that the last quantity in (16) is monotone increasing in $k$ on $A_0$. But this is a consequence of the Liapounov inequality (cf. Sec. II.1, application 1 following Prop. 8 there). Indeed, according to that inequality, $\|m(\Theta,X)\|_v$ is an increasing function of $v$, and hence decreases as $v\downarrow 1$; but $k'>1$ and, as $k\uparrow\infty$, $k'=k/(k-1)\downarrow 1$, so $\|m(\Theta,X)\|_{k'}$ is decreasing. It follows from (17), (18) and the fact that the last factor in (16) is increasing that one can even replace `sup' by `lim' there, establishing (16).

As an illustration, consider the case $W(t)=|t|^k$, $k\ge 1$. Then (16) reduces to
$$ E(|\Theta-\delta(X)|^k)\ \ge\ \big[\,|E(\Theta\,m(\Theta,X))|\,/\,\|m(\Theta,X)\|_{k'}\big]^{k}, \tag{19} $$
3.2 Existence and other properties of estimators
with equality iff |θ − δ(x)|k = c|m(θ, x)|k for some c > 0 and θ − δ(x)m(θ, x) is of constant sign for a.a.(θ, x). The last statement is a standard result for the equality condition in H¨ older’s inequality (and since W (t) = |t|, Jensen’s inequality plays no part here). In this case we can assert that there is an element m(Θ, X)∗ ∈ M k (P˜ ), k ≥ 1, (uniquely if k > 1) such that
|E(Θm(Θ, X)∗ )| inf{E(|Θ − δ(X)| ) : δ ∈ D} = m(Θ, X)∗ k k
k .
(20)
An interesting point of these bounds is that they do not involve the actual estimator $\delta(X)\in\mathcal{D}$, and are valid for all (including the best) such functions. We now illustrate one method of constructing elements of $M^{k'}$ above. Let the posterior distribution $F(\cdot|x)$ have a density $f(\cdot|x)$ relative to the Lebesgue measure. Consider $m(\theta,x)=\partial\log f(\theta|x)/\partial x_i$ for each $\theta\in I$, $x_i\in\mathbb{R}$, when $f(\theta|\cdot)$ is so differentiable. If, moreover, the density $f(\cdot|x)$ is dominated by a $\mu$-integrable function $g(\cdot)$ independent of $x$, then it is seen that $E(m(\Theta,X)|X=x)=0$ for a.a. $x$, and if $g^{k'}$ is integrable then $m(\Theta,X)\in M^{k'}(\tilde P)$. Even the differentiability and domination conditions here can be relaxed by using difference operations, as indicated in Theorem 5 below (cf., e.g., Rao [1], p. 394, for more details).

The same mathematical analysis employed in Theorem 4 can also be used to obtain the corresponding results for non-Bayesian problems; we now indicate how this is done. Let $X$ be a random (vector) variable on $\{(\Omega,\Sigma,P_\theta),\theta\in I\}$ as before, and let $\delta$ be an element of $\mathcal{D}$. It is a (locally) best estimator of $\theta$ relative to a convex loss function $W(\cdot)$ if (1) holds in the form ($E_\theta$ being expectation relative to $P_\theta$)
$$ R_W(\delta_0,\theta)=\inf\{E_\theta(W(\delta-\theta)):\ \delta\in\mathcal{D}\}. \tag{1'} $$
Then the following result, in which we take $M^{k'}(P)=\{m(X,\theta)\in L^{k'}(P):\ E_\theta(m(X,\theta))=0\}$ and $\hat\theta=\delta_0(X)$, is the analog of Theorem 4. For convenience and variety, let us define $m(X,\theta)$ in a different manner, which has much flexibility; this method is due to Kiefer [1]. Thus let $I\subset\mathbb{R}$ be the parameter set and $I_\theta=\{h:\theta+h\in I\}$ for an arbitrarily fixed $\theta\in I$. Let $\xi_1,\xi_2$ be a pair of probability measures on the Borel $\sigma$-algebra of $I_\theta$, and suppose that the distribution $F_\theta$ of the random vector $X$ in $\mathbb{R}^k$ has a density $f(\cdot;\theta)$ relative to a fixed $\sigma$-finite measure $\mu$ on $(\mathbb{R}^k,\mathcal{B})$. If $S_\theta=\{x: f(x;\theta)>0\}$ is the carrier of $f(\cdot;\theta)$ for each $\theta\in I$, where the density $f(\cdot;\cdot)$ is jointly measurable, define on
$S_\theta$ a (necessarily) measurable function $D_{\xi_1,\xi_2}$ by
$$ D_{\xi_1,\xi_2}(x;\theta)=f(x;\theta)^{-1}\int_{I_\theta} f(x;\theta+h)\,d(\xi_1-\xi_2)(h),\qquad x\in S_\theta,\ \theta\in I. \tag{21} $$
It is evident that $E_\theta(|D_{\xi_1,\xi_2}|)<\infty$ for all $\theta\in I$, and by the Fubini theorem one finds that $E_\theta(D_{\xi_1,\xi_2})=0$. Let $M^{k'}(P)$ be the collection of all such $D_{\xi_1,\xi_2}$ that are in $L^{k'}(P)$. If $\delta(X)\ (=\hat\theta(X))$ is any estimator of $\theta$ for which $E_\theta(W(\hat\theta))<\infty$, then, letting $E_\theta(\hat\theta)=\alpha(\theta)$ ($\alpha(\theta)-\theta=b(\theta)$ is called the bias of the estimator), one has, using Fubini's theorem again,
$$ E_\theta[(\hat\theta-\theta)D_{\xi_1,\xi_2}]=E_\theta(\hat\theta\,D_{\xi_1,\xi_2}) =\int_{S_\theta}\hat\theta(x)\int_{I_\theta} f(x;\theta+h)\,d(\xi_1-\xi_2)(h)\,d\mu(x) =\int_{I_\theta}\alpha(\theta+h)\,d(\xi_1-\xi_2)(h). \tag{22} $$
The desired result can now be given as follows.

5. Theorem. Let $W(\cdot)$ be a symmetric convex function vanishing at the origin ($W\not\equiv 0$) for which $W^{1/k}(\cdot)$ has the same properties, and let $X$ be a random vector in $\mathbb{R}^k$ whose distributions $\{F_\theta(\cdot),\theta\in I\}$ have densities $\{f(\cdot;\theta),\theta\in I\}$ relative to a $\sigma$-finite measure on $(\mathbb{R}^k,\mathcal{B})$, the elements of $M^{k'}(P)$ being defined by (21). Then for any estimator $\hat\theta$ of $\theta$ satisfying $E_\theta(W(\hat\theta))<\infty$, one has the best lower bound at $\theta_0\in I$:
$$ R_W(\hat\theta,\theta_0)\ \ge\ \sup_{\xi_1\ne\xi_2}\ W\!\left(\frac{\int_{I_\theta}\alpha(\theta+h)\,d(\xi_1-\xi_2)(h)}{E_\theta(|D_{\xi_1,\xi_2}|)}\right)\times \left[\frac{E_\theta(|D_{\xi_1,\xi_2}|)}{\big(E_\theta(|D_{\xi_1,\xi_2}|^{k'})\big)^{1/k'}}\right]^{k}, \tag{23} $$
where $k\ge 1$, $k'=k/(k-1)$, and the case $k=1$ ($k'=\infty$) is interpreted via the essential supremum, as in the last result.
The result when $W(t)=t^2$ and $\alpha(\theta)=\theta$ was obtained by Kiefer [1]; the proof, being nearly identical with that of Theorem 4, is left to the reader. In the case that $W(t)=t^2$, $\xi_1$ puts all its mass at a single point $h\ne 0$, and $\xi_2$ puts all its mass at $0$, with $\alpha(\theta)=\theta$, then (23) becomes
$$ R_2(\hat\theta,\theta)=E_\theta(\hat\theta-\theta)^2 \ \ge\ \lim_{h\to 0}\ \frac{h^2}{\displaystyle\int_{\mathbb{R}^k}\Big(\frac{f(x;\theta+h)-f(x;\theta)}{f(x;\theta)}\Big)^2 f(x;\theta)\,d\mu(x)} =\frac{1}{\displaystyle\int_{\mathbb{R}^k}\Big(\frac{\partial f(x;\theta)}{\partial\theta}\Big)^2\frac{d\mu(x)}{f(x;\theta)}} =\frac{1}{E_\theta\Big(\dfrac{\partial\log f(X;\theta)}{\partial\theta}\Big)^2}. \tag{24} $$
This is the classical Cramér-Rao lower bound. As Kiefer remarks, the flexibility of using different measures $\xi_i$ in (21) can improve the lower bound in (23). For example, if $f(x;\theta)=\exp(-(x-\theta))$ for $x\ge\theta$, and $=0$ otherwise, and $\hat\theta=\min(X_1,\dots,X_n)-\frac1n$, where $X_1,\dots,X_n$ is a random sample from a distribution with the above density on the line, with $\mu$ as the Lebesgue measure, then a simple computation shows that $E_\theta(\hat\theta)=\theta$ and that (24) is a strict inequality. However, if $d\xi_1(h)=m\,e^{-mh}\,dh$ for $0<h<\infty$, and $=0$ elsewhere, with $\xi_2$ as before, then equality holds in (23). Thus the various choices of the "extraneous" measures $\xi_i$ allow us to improve the lower bound.
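A small numerical sketch of the passage from (23) to (24) in the Gaussian case; the closed form used below is our own direct computation for $n$ observations from $N(\theta,1)$, not a quotation from the text:

```python
import numpy as np

# For N(theta,1)^n the Fisher information is n, so (24) gives
# E_theta(hat - theta)^2 >= 1/n. With xi_1 at a point h and xi_2 at 0,
# E_theta(D^2) = exp(n h^2) - 1 (a Gaussian likelihood-ratio computation),
# so the finite-h bound below increases to the Cramer-Rao value as h -> 0.
n = 25

def finite_h_bound(h):
    return h**2 / (np.exp(n * h**2) - 1.0)

for h in (1.0, 0.3, 0.1, 0.01):
    print(h, finite_h_bound(h))   # tends to 1/n = 0.04 as h -> 0
```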
In contrast to the Bayesian case, the construction of (locally) best estimators for convex loss is more involved, and will be discussed in the next section, where other methods and principles of obtaining estimators are also considered. Before turning to that aspect, we show how the Bayes estimation problem can be viewed from a different perspective, namely its relation to nonlinear prediction. This will be of importance for stochastic inference proper. The problem of Bayes estimation relative to convex (or any other) loss can be recast as that of estimating a random variable $\Theta$, given the (observable) random vector $X=(X_{t_1},\dots,X_{t_n})$, all regarded as defined on the enlarged triple $(\Omega',\Sigma',\tilde P)$, which is obtained using the conditional distribution (or measure) of $X$ given $\Theta$, whose prior (or marginal) measure is also known. It is then a problem of predicting (or estimating) $X_{t_0}$ [or $\Theta$] given $X_{t_1},\dots,X_{t_n}$ ($t_0\ne t_i$), relative to the convex loss function $W$, minimizing the expected loss (= risk) by a best nonlinear predictor $\delta(X_{t_1},\dots,X_{t_n})$. In the new terminology we can state it as: given a stochastic sequence (or process) $X_{t_i}$, $i=1,\dots,n$, on a probability space $(\Omega,\Sigma,P)$, predict the future value of (or extrapolate) $X_{t_0}$, based on the present $X_{t_n}$ and the past $X_{t_i}$, $i=1,\dots,n-1$. Thus one finds the best $\delta(X_{t_1},\dots,X_{t_n})=\delta_n(X)$ (say), such that $E(W(X_{t_0}-\delta_n(X)))$ is a minimum, computed over all $\delta_n(X)$ for which $E(W(\delta_n(X)))<\infty$. Theorem 3 then gives the complete solution to the problem. [Here the prior and posterior measures are used only to obtain the underlying probability space $(\Omega,\Sigma,P)$ of the model, and further mention of these terms or concepts is neither needed nor relevant.] Indeed, the solution is unique in the case that $W(\cdot)$ is strictly convex. Then the mapping $Q_n: X_{t_0}\to\delta_n(X)$ is well-defined, and $Q_n^2(X_{t_0})=Q_n(\delta_n(X))=\delta_n(X)$, since the best predictor of $\delta_n(X)$ based on the same 'observation' $\delta_n(X)$ is itself. Thus $Q_n$ is an idempotent (but generally nonlinear) operator on the space of "$W(\cdot)$-integrable" functions on $(\Omega,\Sigma,P)$. This fact leads to an interesting insight into the prediction problem, and its characterization will be obtained later, in Chapter VIII. We now present the solution in the new terminology for convenient reference. The condition that all the estimators or predictors $\delta_n(X)$ be $W(\cdot)$-integrable (i.e., $E(W(\delta_n(X)))<\infty$) leads to a more inclusive class of function spaces [if $W(t)=|t|^p$, these are the familiar Lebesgue spaces], called the Orlicz spaces. For simplicity, we first consider the case $W(t)=|t|^p$, $p>1$.

6. Theorem. Let $X=(X_1,\dots,X_n)$ be a random vector (to be observed) on $(\Omega,\Sigma,P)$ and $X_{n+1}$ a random variable (to be predicted), where $E(W(X_i))<\infty$, $i=1,\dots,n+1$, for the loss function $t\mapsto W(t)=|t|^p$, $1<p<\infty$. If $\mathcal{F}_n$ is the $\sigma$-algebra generated by $X_1,\dots,X_n$, then the unique (nonlinear) best predictor $\delta_n(X)=\delta(X_1,\dots,X_n)$ is the element of $L^p(\Omega,\mathcal{F}_n,P)$ that satisfies the integral equation, relative to the conditional probability measure $P^{\mathcal{F}_n}$ (as in (11) and (12)),
$$ \int_{[X_{n+1}>\delta_n(X)]} (X_{n+1}-\delta_n(X))^{p-1}\,dP^{\mathcal{F}_n} =\int_{[X_{n+1}<\delta_n(X)]} (\delta_n(X)-X_{n+1})^{p-1}\,dP^{\mathcal{F}_n}. \tag{24} $$
More generally, if $W(\cdot)$ is a symmetric strictly convex differentiable function with derivative $w$ vanishing only at the origin, then the best predictor $\delta_n(X)$ is the unique solution of
$$ \int_{[X_{n+1}>\delta_n(X)]} w(X_{n+1}-\delta_n(X))\,dP^{\mathcal{F}_n} =\int_{[X_{n+1}<\delta_n(X)]} w(\delta_n(X)-X_{n+1})\,dP^{\mathcal{F}_n}. \tag{25} $$
In particular, if $W(t)=t^2$, the unique solution is given by $\delta_n(X)=E^{\mathcal{F}_n}(X_{n+1})$, the conditional mean of $X_{n+1}$ given $X_1,\dots,X_n$ (or, what is the same, given $\mathcal{F}_n$).

Here equation (24) is clearly a special case of (25), and the latter is simply a reformulation of (12). Regarding the last statement, note that (24) with $p=2$ becomes
$$ \int_A (X_{n+1}-\delta_n(X))\,dP^{\mathcal{F}_n}=\int_{A^c}(\delta_n(X)-X_{n+1})\,dP^{\mathcal{F}_n}, \tag{26} $$
where $A=[X_{n+1}\ge\delta_n(X)]$, $A^c=\Omega-A$, and the integral in (24) is unchanged with this $A$ in lieu of $[X_{n+1}>\delta_n(X)]$. Now $\delta_n(X)$ is $\mathcal{F}_n$-measurable, and hence acts as a constant for the integral relative to $P^{\mathcal{F}_n}$. Thus (26) reduces to
$$ \delta_n(X)=\int_\Omega X_{n+1}\,dP^{\mathcal{F}_n}=E^{\mathcal{F}_n}(X_{n+1}),\qquad \text{a.e. } [P], \tag{27} $$
as desired.
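A simulation sketch of (27) for a hypothetical Gaussian AR(1) sequence, $X_{t+1}=\rho X_t+\varepsilon_{t+1}$, where the conditional mean predictor is available in closed form ($E(X_{n+1}|\mathcal{F}_n)=\rho X_n$):

```python
import numpy as np

rho, n_paths = 0.8, 200000
rng = np.random.default_rng(6)
x_n = rng.normal(size=n_paths)                  # stationary-scale state
x_next = rho * x_n + rng.normal(size=n_paths)   # one-step evolution

for predictor in (lambda x: rho * x,   # conditional mean, risk ~ 1.0
                  lambda x: x,         # naive carry-forward, ~ 1.04
                  lambda x: 0.0 * x):  # unconditional mean, ~ 1.64
    print(np.mean((x_next - predictor(x_n)) ** 2))
# The first (conditional-mean) risk is the smallest, as (27) asserts.
```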
The integrals relative to $P^{\mathcal{F}_n}$ are in the sense of Lebesgue when one can assume that it has "regular versions" [this is the case if one considers the $X_n$ as coordinate functions], but they are valid as stated when interpreted as vector integrals in the sense of Dunford and Schwartz ([1], IV.10). These points have been discussed in detail elsewhere [cf., e.g., Rao [12], Sec. II.3, and more recently Rao [18], Sec. 7.3]. Mathematically there is no escape from using this (vector) integration in our study unless further and severe restrictions are imposed.

In Section 1, general loss functions which need not be convex were considered, so it is natural to discuss the corresponding estimation or prediction problem. We already remarked (see the discussion after (3) of this section) that the more general the loss function, the less general the class of distributions [or underlying models] that can be admitted. This will now be illustrated for monotone loss functions, but for random sequences [or processes] that are somewhat close to the Gaussian (i.e., processes whose finite dimensional distributions have certain "regression characteristics" similar to those of the Gaussian family). For such a class, an explicit solution of the best predictor problem can be presented. The possibility of such a result was first observed by Sherman [1]; a simpler argument, for a slightly more general case, is included here, following the author's recent note (Rao [22]). The class of random sequences (or processes) considered are those which have symmetric stable distributions possessing unique modes. Recall that a distribution $F$ is unimodal if there is a real number $a$ such that $F(x)$ is convex for $x<a$ and concave for $x>a$. A classical theorem of Khintchine's (see, e.g., Gnedenko and Kolmogorov [1], p. 157) asserts that a distribution function $F$ on $\mathbb{R}$ is unimodal iff the function $V: x\mapsto V(x)=F(x)-xF'(x)$ is a distribution, where $F'$ is a right (or left) derivative, which always exists. Since a convex (concave) function on an open interval is absolutely continuous, a unimodal distribution $F$ is absolutely continuous at all points except possibly at the mode. We then have the following result on the best (nonlinear) predictor with respect to a fairly general loss function, where the admitted process of observations is restricted, but can be general if the loss function is correspondingly restricted:
7. Theorem. Let $X=(X_1,\dots,X_n)$ be an integrable random (observable) vector on $(\Omega,\Sigma,P)$ and $X_{n+1}$ an integrable random variable, to be predicted relative to an increasing continuous loss function $\varphi$ vanishing at the origin. Suppose that the conditional distribution of $X_{n+1}$, given the past and present $X_i$, $i\le n$, has a regular version which is unimodal and symmetric about its conditional mean. Then the best predictor of $X_{n+1}$ is just the conditional mean, i.e., $\delta_n(X)=E^{\mathcal{F}_n}(X_{n+1})$, and the expected loss of the error of prediction relative to the general loss function $\varphi$ satisfies
$$ R_\varphi(\delta_n(X),X_{n+1})=E(\varphi(\delta_n(X)-X_{n+1}))\ \le\ E(\varphi(\delta(X)-X_{n+1})) \tag{28} $$
for all estimators $\delta(X)$ of $X_{n+1}$, where $\mathcal{F}_n$ is the $\sigma$-algebra generated by the vector of observations $X$. In particular, the statement holds if $(X_1,\dots,X_{n+1})$ is a Gaussian vector.

This result is an immediate consequence of the following variational calculus argument, analogous to that of Proposition 1 in the convex case. In fact, one replaces the probability measure by a (regular) conditional probability distribution and then takes the expectation, using the identity $E(\psi(X))=E(E(\psi(X)|Y))$ for any random variable $Y$ and any integrable (or only positive) random variable $\psi(X)$, so that inequality (28) is satisfied for all other estimators $\delta(X)$. There is no measurability problem here, and in fact the solution is explicitly obtainable.

8. Proposition. Let $\varphi$ be the loss function of the above theorem, and let $F$ be a distribution function which is symmetric and unimodal about the origin, i.e., (i) $F(x)=1-F(-x)$, and (ii) $F$ has its unique mode at the origin. Then we have the key integral inequality for $\varphi$:
$$ \int_{\mathbb{R}}\varphi(x-0)\,dF(x)\ \le\ \int_{\mathbb{R}}\varphi(x-a)\,dF(x) \tag{29} $$
for all $a\in\mathbb{R}$.
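Inequality (29) is easy to check numerically; a sketch for the standard normal $F$ and the increasing, non-convex loss $\varphi(t)=|t|^{1/2}$ (both choices are ours):

```python
import numpy as np
from scipy.integrate import quad

phi = lambda t: np.sqrt(np.abs(t))                      # increasing, phi(0)=0
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # symmetric unimodal

def expected_loss(a):
    val, _ = quad(lambda x: phi(x - a) * f(x), -np.inf, np.inf)
    return val

base = expected_loss(0.0)
for a in (-2.0, -0.5, 0.5, 2.0):
    assert base <= expected_loss(a)   # (29): the shift a = 0 is best
print(base)
```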
Proof. It suffices to prove the result for absolutely continuous $F$, since the general case can be obtained by a standard approximation argument. In fact, by Exercise I.6.5, a distribution can be approximated in the Lévy metric by a discrete distribution with a finite number of jumps, and the latter can again be approximated by an absolutely continuous distribution outside a set of arbitrarily small (Lebesgue) measure. By another standard device, $\varphi$ can also be allowed to have jump discontinuities. Under these convenient reductions, it suffices to
consider the problem for $F$ satisfying $dF(x)=f(x)\,dx$. The hypothesis on $F$ then implies that the density $f$ satisfies $f(x)=f(-x)$ and $f(x)\le f(0)$, $x\in\mathbb{R}$, since the unique mode is at the origin. Also let $\varphi_a(x)=\varphi(x-a)$ for an arbitrarily fixed $a\in\mathbb{R}$. With these preliminary simplifications, we now establish (29) for absolutely continuous $F$ and continuous $\varphi$. First let $a\ge 0$, and (the reader may draw a picture, to follow the computations quickly and easily) consider:
$$ \int_{\mathbb{R}}[\varphi_a-\varphi](x)f(x)\,dx =\int_{-\infty}^{a/2}[\varphi(x-a)-\varphi(x)]f(x)\,dx+\int_{a/2}^{\infty}[\varphi(a-x)-\varphi(x)]f(x)\,dx $$
$$ =\int_0^\infty(\varphi_a-\varphi)\big(\tfrac a2-t\big)f\big(\tfrac a2-t\big)\,dt+\int_0^\infty(\varphi_a-\varphi)\big(\tfrac a2+u\big)f\big(\tfrac a2+u\big)\,du, $$
where we let $x=\frac a2-t$ in the first and $x=\frac a2+u$ in the second integrand,
$$ =\int_0^\infty[\varphi_a-\varphi]\big(\tfrac a2-t\big)f\big(\tfrac a2-t\big)\,dt-\int_0^\infty[\varphi_a-\varphi]\big(\tfrac a2-u\big)f\big(\tfrac a2+u\big)\,du, $$
since by the symmetry of $\varphi$ we have $(\varphi_a-\varphi)(\tfrac a2-u)=-(\varphi_a-\varphi)(\tfrac a2+u)$,
$$ =\int_0^\infty(\varphi_a-\varphi)\big(\tfrac a2-t\big)\big[f\big(\tfrac a2-t\big)-f\big(\tfrac a2+t\big)\big]\,dt\ \ge\ 0, \tag{30a} $$
since $f$ has its unique mode at $0$ and $\varphi$ is increasing.

Next, let $a<0$, use a similar split of the integral at $\frac a2$, and with a change of variables one gets:
$$ \int_{\mathbb{R}}[\varphi_a-\varphi](x)f(x)\,dx =\int_0^\infty[\varphi_a-\varphi]\big(\tfrac a2-u\big)f\big(\tfrac a2-u\big)\,du+\int_0^\infty[\varphi_a-\varphi]\big(u+\tfrac a2\big)f\big(u+\tfrac a2\big)\,du $$
$$ =\int_0^\infty[\varphi-\varphi_a]\big(\tfrac a2-u\big)\big[f\big(\tfrac a2+u\big)-f\big(\tfrac a2-u\big)\big]\,du\ \ge\ 0, \tag{30b} $$
as before.
Thus (30a) and (30b) imply (29), because of the earlier reduction, and hence, with the remarks before the proposition, establish (28) as well, as desired.

Remark. A discontinuous $\varphi$ can be approximated pointwise by continuous $\varphi_n$ as follows. For instance, if $\varphi(t)=0$ for $0\le t<t_0$ and $\varphi(t)=t$ for $t\ge t_0$, let $\varphi_n(t)=t$ for $t\ge t_0$ and $\varphi_n(t)=t_0(t/t_0)^n$ for $0\le t<t_0$; then $\varphi_n(t)\to\varphi(t)$. The argument evidently extends.

Since in the general theory the construction of `best' estimators is a nontrivial problem, it is useful to impose various additional constraints and to develop suitable methods for that purpose. We take up this problem in the next section.

3.3 Some principles of estimation

Since the evaluation of the risk, for a given loss function $L(\cdot,\cdot)$, of an optimal (or best) estimator $\delta(X)$ of a parameter $\theta$ involves a knowledge of its distribution, which is not simple in most cases (cf. Theorem 2.6), one may consider finding "best" lower bounds for the risk in selecting a good estimator (cf. Theorem 2.5). The class $\mathcal{D}$ of all estimators of $\theta$ is typically too large, and one may therefore look for a subclass $\mathcal{D}_0$ satisfying an additional constraint, namely those estimators $\delta(X)$ for which $E_\theta(\delta(X))=\theta$, $\forall\theta\in I$, of the model $\{(\Omega,\Sigma,P_\theta),\theta\in I\}$. It is possible that $\mathcal{D}_0$ is empty, but otherwise it is a useful collection, called the unbiased class of estimators of $\theta$. Note that $\delta(X)$ must have a finite expectation for this definition to be meaningful. At least for large samples $X=\{X_n, n\ge 1\}$ this class will be of interest (hence should be nonempty) in inference theory. In that case, if $\delta_n(X)$ is an estimator for an observable $X=(X_1,\dots,X_n)$ of $n$ random variables, then $\delta_n$ is called asymptotically unbiased if
$$ \lim_n E_\theta(\delta_n(X))=\theta,\qquad \theta\in I. \tag{1} $$
This is a very weak requirement, and one should have the more useful (and somewhat stronger) condition called consistency of the estimator sequence: δ_n(X) → θ in probability, i.e., for any ε > 0,
$$\lim_{n\to\infty}P[|\delta_n(X)-\theta|\ge\varepsilon]=0,\qquad\theta\in I,\tag{2}$$
which is a requirement of the weak law of large numbers when δ_n(X) is a sample average. We consider the asymptotic properties of estimators in the next section, but study the case of finite expectation here in some detail.
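For instance, for the sample mean δ_n(X) = (X_1 + ··· + X_n)/n of i.i.d. N(θ, 1) observations, both (1) and (2) hold; the following is a minimal simulation sketch of the convergence in (2) (the sample sizes, θ, ε and replication count are illustrative choices, not from the text):

```python
# Minimal sketch of (1)-(2): for i.i.d. N(theta, 1) samples, the sample mean
# delta_n is unbiased and P[|delta_n - theta| >= eps] -> 0 as n grows.
import numpy as np

rng = np.random.default_rng(0)
theta, eps, reps = 2.0, 0.1, 5000

for n in (10, 100, 1000):
    # reps independent samples of size n; delta_n is the row-wise mean
    delta_n = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(delta_n - theta) >= eps))  # empirical tail probability
```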
If δ(X) is an unbiased estimator of θ, and ℬ ⊂ Σ is any σ-algebra, then from the functional equation E_θ(E_θ^ℬ(δ(X))) = E_θ(δ(X)) = θ, we note that every such function E_θ^ℬ(δ(X)) satisfies the unbiasedness constraint. However, this need not be an estimator, since it can depend on the unknown parameter θ. But we have seen in Section II.5 that when the family {P_θ, θ ∈ I} admits a sufficient statistic T, then (cf. (3) of Sec. II.5) one has, for any estimator δ(X) of θ with finite expectation,
$$E_\theta(\delta(X)\,|\,T)=E_\theta^{\mathcal B_T}(\delta(X))=\varphi(T),\qquad\theta\in I.\tag{3}$$
Thus ϕ(T), which does not contain the unknown θ, qualifies as an estimator, and moreover
$$E_\theta(\varphi(T))=E_\theta[E_\theta^{\mathcal B_T}(\delta(X))]=E_\theta(\delta(X)),\tag{4}$$
so that ϕ(T) is unbiased if δ(X) is. Here ℬ_T is the σ-algebra generated by T, and ℬ_T ⊂ Σ. Consequently, for any symmetric (non-constant) convex loss function W and an unbiased estimator δ(X) = θ̂ of θ of the model {P_θ, θ ∈ I} admitting a sufficient statistic T, one has, by the conditional Jensen inequality (cf. Exercise II.6.1(c)):
$$R(\hat\theta,\theta)=E_\theta(W(\hat\theta-\theta))\ge E_\theta\big(W(E_\theta^{\mathcal B_T}(\hat\theta-\theta))\big)=E_\theta(W(\varphi(T)-\theta))=R(\varphi(T),\theta).\tag{5}$$
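As a concrete check of (5) with W(t) = t², take X_1, ..., X_n i.i.d. Bernoulli(θ), the unbiased estimator δ(X) = X_1, and the sufficient statistic T = Σ_i X_i, for which E(X_1|T) = T/n by symmetry; a minimal simulation sketch (parameters illustrative, not from the text):

```python
# Rao-Blackwell illustration of (5) with W(t) = t^2: conditioning the
# unbiased estimator delta(X) = X_1 on the sufficient statistic T = sum(X_i)
# gives phi(T) = T/n, which reduces the quadratic risk by a factor of n.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 20, 100000

X = rng.binomial(1, theta, size=(reps, n))
delta = X[:, 0]             # unbiased: E(X_1) = theta
phi_T = X.mean(axis=1)      # phi(T) = E(X_1 | T) = T/n, also unbiased
print(np.mean((delta - theta) ** 2))   # ~ theta(1-theta)   = 0.21
print(np.mean((phi_T - theta) ** 2))   # ~ theta(1-theta)/n = 0.0105
```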
Thus the risk is reduced by using ϕ(T) of (3) instead of δ(X). This inequality for W(t) = t² is the classical Rao-Blackwell theorem noted earlier. Following the proof of Theorem 2.4 (or 2.5), one notes that the lower bound there does not depend on δ(X) = θ̂, and hence it serves the same purpose for R(ϕ(T), θ). It follows that ϕ(T) is a better estimator than θ̂, in that it has a smaller risk, although it may still not attain the lower bound and hence need not be optimal. This situation will be analyzed further, along with constructions of such estimators in some important cases.

The optimality of unbiased estimators based on sufficient statistics has an interesting characterization. The following result, when the risk function is the variance, was discussed by Fraser [1]. It has a complete generalization for convex loss functions, but the variance case serves as a motivation for this study. If δ_i(X), i = 1, 2, are two unbiased estimators of θ, then U(X) = δ_1(X) − δ_2(X) is an unbiased estimator of zero, i.e., E_θ(U(X)) = 0, ∀θ ∈ I. Let 𝒰 be the class of all such estimators. Then one has:
1. Proposition. Let X be a random vector on {(Ω, Σ, P_θ), θ ∈ I} admitting a sufficient statistic T (= T(X)). Then a minimum variance unbiased estimator δ_0 (= δ_0(T) necessarily) of θ exists iff E_θ(δ_0U) = 0, θ ∈ I, for all U ∈ 𝒰 ∩ L²(P_θ).

Proof. Noting that W(t) = t² here, it follows from (5) that one can restrict the statement to the class of unbiased estimators which are functions of the sufficient statistic T, since otherwise E_θ^{ℬ_T}(δ(X)) improves δ(X). Now let δ_0(T) be an unbiased minimum variance estimator of θ, and let U ∈ 𝒰 ∩ L²(P_θ). Then δ(X) = δ_0(T) + cU is also an unbiased estimator of θ for all c ∈ ℝ, and its variance is at least that of δ_0(T). Then one has
$$E_\theta(\delta_0(T)^2)=\mathrm{Var}_\theta(\delta_0(T))+\theta^2\le \mathrm{Var}_\theta(\delta(X))+\theta^2=E_\theta(\delta(X)^2)=E_\theta[\delta_0^2(T)+c^2U^2+2cU\delta_0(T)].\tag{6}$$
This implies, for all 0 ≠ c ∈ ℝ, that c²E_θ(U²) + 2cE_θ(Uδ_0(T)) ≥ 0. Hence, dividing through by c² and transposing,
$$-\frac{2\,\mathrm{sgn}(c)}{|c|}\,E_\theta(U\delta_0(T))\le E_\theta(U^2)<\infty.\tag{7}$$
[As usual, the signum function is given by x → sgn(x) = 1 for x > 0; = 0 for x = 0; = −1 for x < 0.] Letting c → 0, it is seen that (7) necessarily implies E_θ(Uδ_0(T)) = 0, θ ∈ I, as desired. Conversely, if the condition E_θ(Uδ_0(T)) = 0, ∀U ∈ 𝒰 ∩ L²(P_θ), holds, where δ_0(T) is an unbiased estimator with finite variance, then for any other unbiased estimator δ(X) of θ with finite variance we have U = δ_0(T) − δ(X) ∈ 𝒰, and hence, using E_θ(Uδ_0(T)) = 0,
$$E_\theta(\delta^2(X))=E_\theta((\delta(X)-\delta_0(T)+\delta_0(T))^2)=E_\theta(U^2)+E_\theta(\delta_0^2(T))>E_\theta(\delta_0^2(T)),$$
unless U = 0 a.e. [P_θ]. This shows that such a δ_0(T) is also unique when it exists.

In the non-quadratic case of W one cannot expect a simple computation as above, but, abstracting the basic ideas, an analogous result can still be established; it demands a slightly more advanced argument. Such a result was given by Linnik and Rukhin [1]. It uses some facts of the theory of Orlicz spaces from Krasnoselskii and Rutickii [1]. A slightly improved version of it, using an updating of this work (cf. Rao
and Ren [1], p.278), will now be given. It is only by considering the general (here the convex) case that one can really appreciate the underlying variational calculus basis of this work.

We recall that an Orlicz space L^W(μ) (= L^W(Ω, Σ, μ)) is the set of all measurable functions f : Ω → ℝ such that ∫_Ω W(af) dμ < ∞ for some a > 0, where W is a symmetric nonnegative convex function satisfying W(0) = 0, such as our loss function. It is easily seen that this is a (real) vector space, and it becomes the familiar Lebesgue space L^p(μ) if W(t) = |t|^p. There are two (equivalent) functionals on this space, called (i) the gauge norm N_W(·) and (ii) the Orlicz norm ‖·‖_W, defined respectively as follows. For an f ∈ L^W(μ) one has:
$$\text{(i)}\quad N_W(f)=\inf\Big\{k>0:\int_\Omega W\Big(\frac fk\Big)\,d\mu\le1\Big\};$$
and
$$\text{(ii)}\quad \|f\|_W=\sup\Big\{\Big|\int_\Omega fg\,d\mu\Big|:\int_\Omega V(g)\,d\mu\le1\Big\},$$
where V is the "complementary function" to W, defined as V(y) = sup{x|y| − W(x) : x ≥ 0}. They satisfy N_W(f) ≤ ‖f‖_W ≤ 2N_W(f), and L^W(μ) is a complete normed vector space under either norm when equivalent functions are, as usual, identified. For convenience of our applications we consider the gauge norm, and we write L^W(μ) for (L^W(μ), N_W(·)). Also, E_θ(W(f)) < ∞ does not imply E_θ(W(2f)) < ∞ when W(·) grows too fast (e.g., W(t) = e^{|t|} − 1), and so we work in the subset M^W = {f ∈ L^W(P_θ) : ∫_Ω W(αf) dP_θ < ∞, ∀α > 0}. Then M^W is a closed vector subspace, and bounded functions are norm dense in it. If W(2t) ≤ kW(t), t ≥ t_0 ≥ 0, then W is said to be of class Δ₂, and in this case it is seen that M^W = L^W(P_θ) itself. The desired extension of the above proposition can be given as:

2. Theorem. Let X be a random (vector) variable on {(Ω, Σ, P_θ), θ ∈ I} admitting a sufficient statistic T (= T(X)) (and I ⊂ ℝ, open, for simplicity). Then an unbiased estimator δ_0 (= δ_0(T)) of θ, lying in M^W relative to a convex loss function W having a continuous derivative W′, exists with minimum risk R(δ_0, θ) iff
$$E_\theta(g\,W'(\delta_0-\theta))=0,\qquad\theta\in I,\ g\in\mathcal U\cap M^W,\tag{8}$$
where 𝒰 is the set of unbiased estimators of zero.
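For orientation, when W(t) = t² (so W′(t) = 2t and W ∈ Δ₂), condition (8) becomes
$$E_\theta(g\,W'(\delta_0-\theta))=2E_\theta(g\,\delta_0)-2\theta E_\theta(g)=2E_\theta(g\,\delta_0)=0,$$
since E_θ(g) = 0 for g ∈ 𝒰; thus (8) reduces precisely to the orthogonality condition of Proposition 1.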
Proof. Since g, δ_0 ∈ M^W, it follows that for each α ∈ ℝ, δ_0 + αg ∈ M^W, and hence the functional I_θ(·) given by
$$R(\delta_0+\alpha g,\theta)=I_\theta(\delta_0+\alpha g)=\int_\Omega W(\delta_0-\theta+\alpha g)\,dP_\theta$$
is well-defined. Then U : α → I_θ(δ_0 + αg) is a nonnegative convex function, and, as the proof of Prop. 2.1 shows, the set of its minimum points is a nonempty convex compact set J ⊂ ℝ. Moreover U has right and left derivatives, and since W′ exists it is seen that the derivative U′ also exists at all points and vanishes on J. If δ_0 is the minimum risk unbiased estimator (and δ_0 + αg is also an unbiased estimator for all α ∈ ℝ), then R(δ_0, θ) ≤ R(δ_0 + αg, θ) for any α, with a minimum at α = 0. So 0 ∈ J, and hence U′(0) = 0 for each such g ∈ 𝒰, δ_0 having minimum risk. To find U′(0), observe that
$$\frac{U(\alpha)-U(0)}{\alpha}=\int_\Omega\frac{W(\delta_0+\alpha g-\theta)-W(\delta_0-\theta)}{\alpha}\,dP_\theta.\tag{9}$$
Here the integrand is dominated by W′(|δ_0| + |g|)(|δ_0| + |g|), which is integrable, since (|δ_0| + |g|) ∈ M^W and, by the Orlicz space results, W′(|δ_0| + |g|) ∈ L^V(P_θ), so that the Hölder inequality of these spaces applies (V being the complementary function to W, already noted). Hence expanding (by the mean-value theorem), integrating, and using the dominated convergence theorem, one gets the directional derivative as
$$0=I_\theta'(\delta_0)=U'(0)=\int_\Omega g\,W'(\delta_0-\theta)\,dP_\theta,\qquad g\in\mathcal U\cap M^W,\ \theta\in I.$$
Thus (8) is necessarily true if δ_0 is a minimum risk unbiased estimator in M^W. Conversely, if W′ exists and is continuous, and if δ_0′ is any other unbiased estimator of θ in M^W, then g = δ_0′ − δ_0 ∈ 𝒰 ∩ M^W. Thus
$$U'(0)=\int_\Omega g\,W'(\delta_0-\theta)\,dP_\theta,$$
and by hypothesis this is zero for all g ∈ 𝒰 ∩ M^W. This implies U′(0) = lim_{α→0}[I_θ(δ_0 + αg) − I_θ(δ_0)]/α = 0, and since the convex function U has no maximum on the open set ℝ but has a minimum, we see that I_θ(δ_0) = R(δ_0, θ) is the minimum risk for the estimator δ_0 in M^W.

As a direct consequence of this result one has the following statement. But an interesting alternative argument, due to Linnik and Rukhin [1], will also be included for variety.
3. Corollary. Let W be a symmetric convex loss function of class Δ₂ (so M^W = L^W(P_θ)) which is twice differentiable. Let X and T be as in the above theorem, and let δ_0 (= δ_0(T)) be an unbiased estimator of θ with R(δ_0, θ) < ∞. Then it is a minimum risk unbiased estimator iff E_θ(W′(δ_0 − θ)U) = 0 for all U ∈ 𝒰 ∩ L^W(P_θ).

Proof. The direct part is the same as in the above proof, using just one derivative. Here one notes that W(2t) ≤ KW(t), and hence by convexity
$$KW(x)\ge W(2x)=\int_0^{2x}W'(t)\,dt\ge\int_x^{2x}W'(t)\,dt\ge xW'(x),\qquad x\ge0,$$
since W′(·) is increasing. Thus UW′(U), δ_0W′(δ_0) ∈ L¹(P_θ), and hence (|δ_0| + |U|)W′(|δ_0| + |U|) ∈ L¹(P_θ), which serves as a dominating function. For the converse, let the second derivative of W also exist, and let δ′ be another unbiased estimator in L^W, so that δ′ − δ_0 ∈ 𝒰 ∩ L^W(P_θ) (because of Δ₂); one then has, by the mean-value theorem (with 0 ≤ ξ ≤ 1),
$$R(\delta',\theta)=E_\theta(W(\delta'-\theta))=E_\theta\Big[W(\delta_0-\theta)+(\delta'-\delta_0)W'(\delta_0-\theta)+\tfrac12(\delta'-\delta_0)^2W''(\delta_0-\theta+\xi(\delta'-\delta_0))\Big]\ge E_\theta(W(\delta_0-\theta))+0=R(\delta_0,\theta),$$
since for any twice differentiable convex W one has W″ ≥ 0, and δ′ − δ_0 ∈ 𝒰 ∩ L^W(P_θ), so that by hypothesis the middle term vanishes. Thus δ_0 has the minimum risk, a fact that cannot be inferred from (5), among all unbiased estimators δ′ for which R_W(δ′, θ) < ∞.

In all the preceding results E_θ(δ) = θ is used for simplicity, but E_θ(δ) = g(θ) is possible, where g : I → ℝ; one then uses δ − g(θ) in place of δ − θ, and the work extends. This is especially relevant if I is an abstract set, as seen below. Another derivation of this assertion, through the methods of mathematical programming, is already contained in the interesting work of Isii [1]. We also add an important comment on the above two statements and a related one.

4. Discussion. It is interesting to note that, in the theory of Orlicz spaces, when the symmetric convex function W, vanishing at 0, has a continuous derivative W′ which satisfies W′(x) > 0 for x > 0, then the norm N_W(·) is (weakly, or Gâteaux) differentiable at all points of M^W − {0}, and the derivative at f_0 ∈ M^W − {0} is given by (cf., e.g., Rao and Ren [1], p.280):
$$G(f_0;f)=\frac{dN_W^\theta(f_0+tf)}{dt}\Big|_{t=0}=\frac{\displaystyle\int_\Omega f\,W'\Big(\frac{f_0}{N_W(f_0)}\Big)\,dP_\theta}{\displaystyle\int_\Omega\frac{f_0}{N_W(f_0)}\,W'\Big(\frac{f_0}{N_W(f_0)}\Big)\,dP_\theta}.$$
Since the denominator is always positive, we get, when f_0 is an unbiased estimator of θ and f an unbiased estimator of zero, that f_0 also minimizes the norm functional N_W^θ(·) as well as the risk R(·, θ). Superficially, N_W^θ(·) does not have a "statistical interpretation" as a minimum risk functional, but the mathematical minimization problem is the same in both cases in M^W. In the case that W(t) = |t|^p, the risk R_p(δ, θ) = E_θ(|δ − θ|^p) and the norm N_p^θ(δ − θ) = ‖δ − θ‖_p are related as R_p(δ, θ) = ‖δ − θ‖_p^p. In the convex case, R_W(·, θ) = E_θ(W(· − θ)) is a modular functional, and the norm is a Minkowski functional, N_W^θ(δ − θ) = inf{k > 0 : R_W((δ − θ)/k, θ) ≤ 1}. This is only a technical distinction, and they serve the same purpose. Some statistical practitioners may prefer the modular form. However, if W ∈ Δ₂, then both the norm and the modular functional define the same topology in M^W = L^W(P_θ), in the sense that if f_0, f_n ∈ M^W, then N_W^θ(f_n − f_0) → 0 as n → ∞ iff R_W(f_n − f_0) → 0. This is an ancient result (cf., e.g., Rao and Ren [1], Theorem 12, p.83).

It is also of interest to observe that if there is a uniformly minimum variance unbiased estimator δ for {P_θ, θ ∈ I}, then Bahadur [1] has shown that the system admits a sufficient statistic T, and hence δ is a function of T. [This is a sort of converse to the Rao-Blackwell theorem. It should also hold for the convex case of the preceding two results, but we shall not consider the problem here. However, the analysis of efficient estimation (under the standard regularity conditions) given in De Groot and Rao [2] shows that the probability distributions from {P_θ, θ ∈ I} belong to the exponential family when quadratic loss is employed and equality holds in the lower bounds. The family then actually admits a sufficient statistic, as required in the converse.]

Again, as in the case of Bayes estimation, the construction problem for minimum risk unbiased estimators demands a careful study, inviting more refined techniques than the previous ones. We present some results here. For a motivation, consider W(t) = |t|^p, 1 < p < ∞. It will be seen that the form of the best estimator is similar in both this special and the more general convex case, although the latter uses some additional mathematical tools. The problem treated here will be for a dominated family. Thus it is supposed that there is a fixed (σ-finite) measure λ such that dP_θ = p(·|θ) dλ, θ ∈ I, an abstract set, and it is desired to find an unbiased estimator δ(X) : Ω → ℝ of a real function h(θ), of locally minimum risk R_W(δ(X), θ) = E_θ(W(δ(X) − h(θ))) at θ = θ_0. Thus E_θ(δ(X)) = h(θ), ∀θ ∈ I, and R_W(δ, θ) has a local minimum at θ_0. Let M_W be the
class of all such estimators. We first present a characterization of this set when W(t) = |t|^p, denoted M_p (similar to M^p in Section 2 above): M_p = {f ∈ L^p(P_{θ_0}) : E_θ(f) = h(θ), θ ∈ I}, together with the set of random variables
$$\Big\{D(\cdot,\theta)=\frac{p(\cdot|\theta)}{p(\cdot|\theta_0)}:\int_\Omega|D(\omega|\theta)|^q\,dP_{\theta_0}(\omega)<\infty\Big\}.$$
Consequently h(θ) = E_θ(f) = ∫_Ω f D(·, θ) p(·|θ_0) dλ. Let P_{θ_0} = ν for convenience. [If I ⊂ ℝ then we would simply take h as the identity function, to compare with the previous work.] Hereafter we build our estimators in terms of the random functions {D(·, θ), θ ∈ I} and the fixed measure ν, together with the unbiasedness condition relative to h(θ) above. The desired result was obtained by Barankin [1], as follows:

5. Theorem. The set M_p, 1 < p < ∞, is nonempty iff there is a constant K ≥ 0 such that for each θ_i ∈ I and a_i ∈ ℝ, i = 1, ..., n, n ≥ 1, one has, with q = p/(p − 1),
$$\Big|\sum_{i=1}^na_ih(\theta_i)\Big|\le K\,\Big\|\sum_{i=1}^na_iD(\cdot,\theta_i)\Big\|_q.\tag{10}$$
When this condition holds, M_p has only one element f_0, which satisfies ‖f_0‖_p = K_0 = inf{K : (10) is true}, and so f_0 is the unique unbiased (locally) minimum risk estimator of h(θ) at θ_0. Moreover, in this case it is possible to find a sequence f_n ∈ L^p(ν) satisfying ‖f_n − f_0‖_p → 0, where the f_n are certain finite linear combinations from the process {D(·, θ), θ ∈ I}.

The direct part is simple. Indeed, when there is an unbiased estimator f of h(θ), then from h(θ) = E_θ(f) = E_{θ_0}(f D(·, θ)), (10) is implied by the Hölder inequality. The converse is deep, and here results from Functional Analysis are needed: they assert the existence of a continuous linear functional ℓ(·) on the linear span of {D(·, θ), θ ∈ I} ⊂ L^q(ν) (via the Hahn-Banach theorem), and then, by a (Riesz) representation theorem, ℓ(·) is given for 1 < q < ∞ as h(θ) = ℓ(D(·, θ)) = ∫_Ω f_0 D(·, θ) dν for a unique f_0 ∈ L^p(ν). The rest is a consequence of these assertions. There is also an easy analog for the case p = ∞ (so q = 1), but not for p = 1 (so q = ∞), since the Riesz theorem fails in L^∞(ν): it does not give a point function to qualify as an estimator.

We now give an extended result for applications and reference. Instead of presenting the actual details of the above case, we can at once consider a wider class of convex functions containing the preceding results at almost no extra cost. It will show how some related
ideas come into play while the basic structure is illuminated. Here the Lebesgue spaces are replaced by Orlicz spaces, introduced for Theorem 2 above. Recall that a symmetric convex loss function W has a complementary function V of the same type, and W ∈ Δ₂ means that it does not "grow too fast", i.e., W(2t) ≤ KW(t), t ≥ t_0 > 0. This implies (and is actually equivalent to) the condition that V does not "grow too slowly", i.e., V(2x) ≥ ℓV(x), x ≥ x_0 > 0. Here K > 1, ℓ > 1 are some constants. This property is opposite to the Δ₂-condition and is called the ∇₂-condition, and so V ∈ ∇₂. It is an interesting fact that the Orlicz space L^W(P_θ) is reflexive iff W ∈ Δ₂ ∩ ∇₂ (cf. Rao and Ren [1], p.113). This corresponds to L^p(P_θ), 1 < p < ∞, and it will be needed in our full extension of Barankin's result to convex functions. Also, one should note that the adjoint space satisfies (L^W(P_θ), N_W(·))* = (L^V(P_θ), ‖·‖_V), so the gauge and Orlicz norms alternate between the given space and its dual. [The same (gauge) norm can be used in both spaces after a normalization of W and V, but we shall not go into this finer point here. In the Lebesgue case this is automatic, and one has (L^p(ν))* = L^q(ν), 1 < p < ∞; this fact is decisive in Barankin's work.] Moreover, the general result also shows how abstract analysis is essential for the existence and determination of estimators when employing loss functions other than the quadratic one [e.g., W(t) = |t|^{p_1} + |t|^{p_2}, p_1, p_2 ≥ 1]. This reinforces Gauss's remark on using the latter with elementary analysis in these problems. For convenience, we present the existence and construction of best estimators separately. We start with existence.

6. Theorem. Let W be a convex loss function such that its derivative W′ exists, is continuous, W′(x) > 0 for x > 0, and W ∈ Δ₂ ∩ ∇₂. Let M_W be the class of all locally minimum risk (at θ = θ_0) unbiased estimators of h(θ) ∈ ℝ, based on a random vector of observations X of the model {(Ω, Σ, P_θ), θ ∈ I}, where dP_θ = p(·|θ) dλ for a fixed σ-finite dominating measure λ on Σ. Then M_W is nonempty iff there is a constant C > 0 such that for each θ_i ∈ I, a_i ∈ ℝ, i = 1, ..., n, n ≥ 1,
$$\Big|\sum_{i=1}^na_ih(\theta_i)\Big|\le C\,\Big\|\sum_{i=1}^na_iD(\cdot,\theta_i)\Big\|_V,\tag{11}$$
where ‖·‖_V is the Orlicz norm of L^V(P_{θ_0}), V being the complementary function of W. When M_W is nonempty, it has only one element δ_0 under the given conditions, and
$$R_W\Big(\frac{\delta_0-h(\theta_0)}{C_0},\theta_0\Big)=\int_\Omega W\Big(\frac{\delta_0-h(\theta_0)}{C_0}\Big)\,d\nu=1,$$
where C_0 = inf C satisfying (11).

Proof. The arguments for this and the previous case run on similar lines.
Thus if there is a δ ∈ M_W, then one has
$$\Big|\sum_{i=1}^na_ih(\theta_i)\Big|=\Big|\int_\Omega\delta\Big(\sum_{i=1}^na_iD(X,\theta_i)\Big)\,d\nu\Big|\le N_W(\delta)\,\Big\|\sum_{i=1}^na_iD(X,\theta_i)\Big\|_V,$$
by the Hölder inequality for Orlicz spaces (cf., e.g., Rao and Ren [1], p.62). Thus taking C = N_W(δ) here, one obtains (11). Only the converse is nontrivial. Suppose now (11) holds. Then by a classical result due to Hahn (cf. Hille and Phillips [1], p.31, Thm. 2.7.7, with a short proof), there exists a continuous linear functional ℓ(·) on (L^V(P_{θ_0}), ‖·‖_V) such that ℓ(D(X, θ_i)) = h(θ_i), i = 1, ..., n, whose norm is bounded by C. Since V ∈ Δ₂ also, such an ℓ admits a representation (cf. Rao and Ren, loc. cit., Corollary 4.1.9) with a unique δ_0 ∈ L^W(P_{θ_0}) satisfying
$$h(\theta)=\ell(D(X,\theta))=\int_\Omega\delta_0\,D(X,\theta)\,dP_{\theta_0},\tag{12}$$
for all D(X, θ) ∈ L^V(P_{θ_0}). It follows from (12) that E_θ(δ_0) = h(θ), and that N_W(δ_0) = ‖ℓ‖ = C_0 = inf{C : (11) holds}. Hence δ_0 ∈ M_W. But by the Orlicz space theory, since W′ is continuous and W′(x) > 0 for x > 0, one concludes that L^W(P_{θ_0}) is strictly convex (cf. Rao and Ren [1], p.268, Thm. 7.1.3), and hence the convex set M_W can have only a single estimator of norm C_0. Now the minimal elements for the norm functional N_W and for the modular R_W(δ_0, θ_0) = ∫_Ω W(δ_0 − h(θ_0)) dP_{θ_0} on M_W are related (see Discussion 4 above) since W ∈ Δ₂, and we get
$$R_W\Big(\frac{\delta_0-h(\theta_0)}{C_0},\theta_0\Big)=\int_\Omega W\Big(\frac{\delta_0-h(\theta_0)}{C_0}\Big)\,dP_{\theta_0}=1,$$
by the definition of the gauge norm. This establishes all the assertions.

Remark. If W(t) = |t|^p and p = 2, then Barankin [1] has shown that (10) (hence (11)) includes all the known lower bounds, by choosing the θ_i suitably to satisfy the necessary regularity conditions on the {D(X, θ), θ ∈ I}-set; they cover the Cramér-Rao as well as the Bhattacharyya (series) bounds.

Since the above existence proof uses the Hahn (and Banach) theorem, the construction of optimal estimators (i.e., those that attain the lower bound) becomes a separate issue. This needs additional analysis, and the δ_0 of the theorem can be approximated in mean by known simple
functions. The idea of such a construction comes from the equality condition in the Hölder inequality, which is the basic ingredient of (11) (and (10)). This means one should choose the sets {a_i^m, i = 1, ..., n_m} and {θ_i^m, i = 1, ..., n_m} there to find a sequence of estimators δ_m that approximate δ_0 in the desired form. The precise statement can be enunciated and proved as follows.

7. Theorem. Let the loss function W be as in Theorem 6, and let V be its complementary function, so that there is a unique unbiased estimator δ_0 of h(θ), (h : I → ℝ), with minimum risk C_0 at θ_0. Suppose also that V has the same properties as W. If {a_i^m, θ_i^m, i = 1, ..., n_m} are chosen to satisfy
$$\lim_{m\to\infty}\frac{\big|\sum_{i=1}^{n_m}a_i^mh(\theta_i^m)\big|}{\big\|\sum_{i=1}^{n_m}a_i^mD(X,\theta_i^m)\big\|_V}=C_0,\tag{13}$$
then the estimators δ_m defined by
$$\delta_m=\frac{\sum_{i=1}^{n_m}a_i^mh(\theta_i^m)}{\big\|\sum_{i=1}^{n_m}a_i^mD(X,\theta_i^m)\big\|_V}\cdot V'\Big(\frac{\big|\sum_{i=1}^{n_m}a_i^mD(X,\theta_i^m)\big|}{\big\|\sum_{i=1}^{n_m}a_i^mD(X,\theta_i^m)\big\|_V}\Big)\cdot\mathrm{sgn}\Big(\sum_{i=1}^{n_m}a_i^mD(X,\theta_i^m)\Big)=\alpha_m\cdot V'(\beta_m)\cdot\gamma_m\ (\text{say}),\tag{14}$$
satisfies δ_m → δ_0 in mean, i.e., as m → ∞,
$$R_W(\delta_m-\delta_0)\to0\quad[\text{and }R_W(\delta_m,\theta_0)\to R_W(\delta_0,\theta_0)],\tag{15}$$
where δ_0 is the (desired) unbiased estimator of h(θ) of minimum risk at θ_0.

Proof. The argument here uses slightly more advanced tools than for the case W(t) = |t|^p, which is itself nontrivial. With the hypothesis (cf. (13)) that |α_n| → C_0, it will first be shown that there are continuous linear functionals ℓ_n on L^W(ν) which converge to an ℓ ∈ (L^W(ν))* such that ℓ_n(δ_0) → ℓ(δ_0) = N_W(δ_0); then we verify that ℓ_n(δ_n) → ℓ(δ_0), from which the desired conclusion will be deduced. Both these assertions are based on an interesting theorem due to V. Šmulian for general smooth Banach spaces, as follows. The norm of a Banach space 𝒳 is differentiable at a point x ∈ 𝒳 iff every sequence of continuous linear functionals x_n* on 𝒳 of unit norm with the property that x_n*(x) → ‖x‖ satisfies ‖x_n* − x_m*‖ → 0. [A discussion of this result, with relevant references, is in Dunford and Schwartz [1], p.472.] Hereafter we exclude the true but trivial case that C_0 = 0. The details of the proof of our result are given in two steps.
1. Let Y_n ∈ L^W(ν) be of unit norm, and let δ_n = α_nY_n. Since N_V(β_n) = 1 in (14), the Hölder inequality in Orlicz spaces implies that
$$\int_\Omega|Y_n\beta_n|\,d\nu\le1,\tag{16}$$
with equality iff Y_nβ_n has constant sign and |Y_n| = V′(|β_n|) or |β_n| = W′(|Y_n|) a.e. The last result follows from the fact that when W ∈ Δ₂ ∩ ∇₂ (so L^W(ν) is reflexive), equality holds in the (Young) inequality between W and V iff (x, y) is a point on the curve y = V′(x) [or x = W′(y)] (cf. Rao and Ren [1], p.80). It therefore follows that |Y_n| = V′(|β_n|), i.e., Y_n = V′(β_n)γ_n, gives equality in (16) (V′ ≥ 0 always), and hence δ_n = α_nY_n = α_nV′(β_n)γ_n. Thus δ_n defined by (14) is in L^W(ν), and in fact N_W(δ_n) = |α_n|, again as a consequence of the equality in (16). Hence, by (13), N_W(δ_n) → C_0. To show that δ_n → δ_0 in mean (or, equivalently, in norm) we need the Šmulian theorem recalled above. Consider the linear functional ℓ_n represented by β_nr_n, with r_n = sgn(α_n) (cf. (14)); then
$$\ell_n(\delta_0)=\int_\Omega\delta_0\,\beta_nr_n\,d\nu=\frac{r_n}{\big\|\sum_{i=1}^na_i^nD(X,\theta_i^n)\big\|_V}\int_\Omega\delta_0\sum_{i=1}^na_i^nD(X,\theta_i^n)\,d\nu=r_n\,\frac{\sum_{i=1}^na_i^nh(\theta_i^n)}{\big\|\sum_{i=1}^na_i^nD(X,\theta_i^n)\big\|_V}=|\alpha_n|\to C_0.$$
Since ‖β_nr_n‖_V = 1, it follows that ‖ℓ_n‖ ≤ 1, and in fact equality holds, because ℓ_n(Y_n) = 1 and N_W(Y_n) = 1 by the preceding paragraph. Now ℓ_n(δ_0) → C_0 = N_W(δ_0) = ℓ(δ_0) for a functional ℓ. By the Hahn-Banach theorem we get ‖ℓ‖ = 1. However, the conditions on W (and W′) imply that the norm N_W(·) is differentiable at each point of L^W(ν) − {0} (cf. Rao and Ren [1], p.281, Corollary 7.2.4). Hence, by the above noted result of Šmulian's, ℓ_n → ℓ in norm. So, by the representation theorem for ℓ ∈ (L^W(ν))*, there is a unique Y_0 ∈ L^V(ν), with ‖ℓ‖ = ‖Y_0‖_V, such that
$$\ell(\delta_0)=\int_\Omega\delta_0Y_0\,d\nu.\tag{17}$$
This means there is equality in Hölder's inequality, and hence δ_0Y_0 must have constant sign and Y_0 = V′(δ_0) a.e. Also ℓ_n ↔ β_nr_n correspond to each other uniquely by (17), and hence β_nr_n → Y_0 in norm. With these
facts we deduce, in the next step, the crucial conclusion that δ_n → δ_0 in mean.

2. Again consider the linear functionals ℓ_n and the random variables β_n. We have, using the notations introduced in (14),
$$\ell_n(\delta_n)=\int_\Omega\delta_n\beta_nr_n\,d\nu=|\alpha_n|\int_\Omega\beta_nV'(|\beta_n|)\gamma_n\,d\nu=|\alpha_n|\to C_0,\tag{18}$$
as n → ∞, by the equality in Hölder's inequality and since the integral is unity. Hence we get
$$|\ell(\delta_n)-C_0|\le|\ell(\delta_n)-\ell_n(\delta_n)|+|\ell_n(\delta_n)-C_0|\le\|\ell_n-\ell\|\,N_W(\delta_n)+|\ell_n(\delta_n)-C_0|\to0,\tag{19}$$
by (18), the boundedness of N_W(δ_n) (= |α_n|), and the convergence ‖ℓ_n − ℓ‖ → 0 established in the preceding step. Now (19) may be written (on using β_nr_n → β_0 (say) in norm) as:
$$1=\lim_{n\to\infty}\ell\Big(\frac{\delta_n}{C_0}\Big)=\lim_{n\to\infty}\frac1{C_0}\int_\Omega\delta_n\beta_nr_n\,d\nu=\lim_{n\to\infty}\int_\Omega\Big(\frac{\delta_n}{\alpha_n}\Big)\beta_nr_n\,d\nu.\tag{20}$$
But L^W(ν) is reflexive, with adjoint space (L^V(ν), ‖·‖_V) having exactly similar properties (V ∈ Δ₂ ∩ ∇₂, V′ continuous, etc.), and so δ_n/C_0 ∈ L^W(ν) may be regarded as defining a continuous linear functional on L^V(ν), with L^W(ν) as its adjoint space. Hence Šmulian's theorem applies to this situation again and yields that δ_n/C_0 → δ̄_0 (say) and β_nr_n → β_0 in norm, with 1 = ∫_Ω β_0δ̄_0 dν. The equality in Hölder's inequality obtains once again, so that δ̄_0 = W′(|β_0|) sgn(β_0), or β_0 = V′(δ̄_0). But we also have δ_n → C_0δ̄_0 in mean; hence C_0δ̄_0 = δ_0. We have thus shown that δ_n → δ_0 in mean in L^W(ν). Note that here we use the fact that, when W ∈ Δ₂, norm and mean convergences are equivalent, which was invoked several times before. This simply means R_W(δ_n − δ_0, θ_0) → 0, as desired.

Remark. As the proof indicates, Šmulian's theorem plays a key role, and the norm differentiability of both the L^W- and L^V-spaces was crucial. One may attempt to extend Barankin's proof, with a clever manipulation of the conjugate exponents p and q as in his work, but this will be
difficult in the present context, since the inverse functions of W′ and V′ cannot be factored as in the Lebesgue case. In any event, it should be noted that the construction problem (even when the risk is the variance) is interesting and nontrivial.

Analogs of these results in the context of Bayes estimation depend on constructing a family of auxiliary random functions D_i(X, Θ). Unlike the preceding case, this set of functions should be chosen with a view to having the corresponding desirable properties, for which there are many possibilities. One case was discussed in Rao [5], but it will be omitted here. However, the corresponding statement for (nonlinear) prediction given in Theorem 2.6 is obtained with a constructive solution by means of an integral equation. We outline an extension to multidimensions. Readers interested in asymptotic properties can go now to Section 3.4.

The following result is a form of Theorem 2.3 in multidimensions, i.e., the parameter vector Θ takes values in ℝ^k, k > 1, and the observation vector X in ℝ^n. We note that the posterior distribution of Θ, given X = x, is a regular conditional measure (as discussed before), and ‖·‖ is the Euclidean norm of ℝ^k. Recall that the Bayes estimator δ* of Θ on the product space (Ω′, Σ, P̃), as noted in Section 2.2, is given by:
$$E(W(\|\Theta-\delta^*(X)\|))=\inf\{E(W(\|\Theta-\delta(X)\|)):\delta\in\mathcal D\},\tag{21}$$
where 𝒟 is the set of all measurable functions δ(X) for which the right side of (21) is finite.

8. Theorem. Let W ∈ Δ₂ ∩ ∇₂ be a convex loss function satisfying the same conditions as in Theorems 6 and 7, and let X be an observable random vector, valued in ℝ^n, with the basic model {(Ω, Σ, P_θ), θ ∈ ℝ^k}, where the index parameter is the value of a random variable Θ ∈ ℝ^k. Then a Bayes estimator δ*(X) : Ω → ℝ^k of Θ is determined as the unique solution of the integral equation
$$\int_{\mathbb R^k}W'(\|\theta-\delta^*(x)\|)\,\|\theta-\delta^*(x)\|^{-1}(\theta,y)\,P(d\theta|x)=\int_{\mathbb R^k}W'(\|\theta-\delta^*(x)\|)\,\|\theta-\delta^*(x)\|^{-1}(\delta^*(x),y)\,P(d\theta|x),\tag{22}$$
for almost all x (relative to the marginal distribution of X), where 0 ≠ y ∈ ℝ^k is an arbitrarily fixed vector and (·,·) is the scalar product of ℝ^k. More generally, the result holds if the Euclidean norm of ℝ^k is replaced by any smooth norm, i.e., one that is differentiable both in ℝ^k and in its adjoint space; then (22) takes the form:
$$\int_{\mathbb R^k}W'(\|\theta-\delta^*(x)\|)\,\ell(y(x))\,P(d\theta|x)=0,\qquad\text{a.a.}\ (x),\tag{23}$$
where
$$\ell(y(x))=\frac d{du}\,\|\theta-\delta^*(x)+u\,y(x)\|\Big|_{u=0}.$$
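As a check on (22), if W(t) = t² and the norm is Euclidean, then W′(‖θ − δ*(x)‖)‖θ − δ*(x)‖^{−1} = 2, and (22) reduces to
$$\int_{\mathbb R^k}(\theta,y)\,P(d\theta|x)=(\delta^*(x),y),\qquad 0\ne y\in\mathbb R^k,$$
i.e., δ*(x) = ∫_{ℝ^k} θ P(dθ|x) = E(Θ|X = x), the posterior mean, as expected for quadratic loss.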
Proof. It will first be shown that (22) is the same as (23) if the smooth norm of the latter is derived from a scalar product; then (23) is established. One should also note that, in (22), if k = 1, then (x, y) is just the product xy, and, cancelling y from both sides, it is seen that (22) becomes the integral equation (12) of Sec. 2. We thus establish (23). [The argument to be given is valid for spaces of any dimension, and so we write 𝒳 and 𝒴 for ℝ^n and ℝ^k, a pair of reflexive spaces with differentiable norms.]

Consider ℓ(y(x)) = (d/du)‖δ̄(x) + uy(x)‖ |_{u=0}, where δ̄(x) = θ − δ*(x). The (directional) derivative can be evaluated as:
$$\frac d{du}\big[\|\bar\delta+uy\|^2\big]=\frac d{du}\big[(\bar\delta+uy,\bar\delta+uy)\big]=2u(y,y)+2(\bar\delta,y).$$
Since the left side is 2‖δ̄ + uy‖ (d/du)‖δ̄ + uy‖, we get, on equating both sides,
$$\ell(y)=\frac d{du}\|\bar\delta+uy\|\Big|_{u=0}=\|\bar\delta\|^{-1}(\bar\delta,y).\tag{24}$$
Since δ̄ = θ − δ* ≠ 0 (the nontrivial case), (24) is well-defined, and substituting it in (23) one gets (22). Thus we need to establish the general case of (23). This is very similar to the result of Theorem 2.3, and we include the necessary details. Fix x ∈ ℝ^n = 𝒳 arbitrarily, and let μ(·) = P(·|x), a regular probability measure on (𝒴, ℬ), and ν = P_{θ_0}. Consider a minimization of
$$U(\delta)=\int_{\mathcal Y}W(\|\theta-\delta\|)\,d\mu,\tag{25}$$
as δ(X) varies over 𝒟, the convex set of estimators in L^W_𝒴(ν). Now the set 𝒟_0 ⊂ 𝒟 over which U(δ) is a minimum is convex and bounded (i.e., is contained in a ball) in this reflexive space. To see that it is also closed, let δ_n ∈ 𝒟 and δ_n → δ_0 in mean. Then we claim that U(δ_n) → U(δ_0) and δ_0 ∈ 𝒟_0. Indeed, if f_n = θ − δ_n, so that f_n → f_0 = θ − δ_0, then for any 0 < α < 1,
$$f_n=\frac\alpha2(2f_0)+(1-\alpha)f_0+\frac\alpha2\Big[\frac2\alpha(f_n-f_0)\Big],$$
and by the convexity of W one gets
$$\int_\Omega W(f_n)\,d\nu\le\frac\alpha2\int_\Omega W(2f_0)\,d\nu+(1-\alpha)\int_\Omega W(f_0)\,d\nu+\frac\alpha2\int_\Omega W\Big(\frac2\alpha(f_n-f_0)\Big)\,d\nu.\tag{26}$$
All the terms on the right are finite since W ∈ Δ₂ and since ‖f_n − f_0‖ ≤ ‖δ_n − δ_0‖ → 0. Here one uses the fact that mean and norm convergences are equivalent in L^W(ν). Thus the last term in (26) tends to zero. Therefore, on letting n → ∞ and then α → 0, one has
$$\limsup_{n\to\infty}\int_\Omega W(f_n)\,d\nu\le\int_\Omega W(f_0)\,d\nu\le\liminf_{n\to\infty}\int_\Omega W(f_n)\,d\nu\tag{27}$$
(using Fatou's lemma for the last inequality, since f_n → f_0 in measure also). Hence lim_{n→∞} U(δ_n) = U(δ_0). To see that 𝒟_0 is nonempty, let α_0 = inf{U(δ) : δ ∈ 𝒟} < ∞, so there exists a sequence δ_n, n ≥ 1, such that U(δ_n) = α_n ↓ α_0. Now {δ_n, n ≥ 1} is a bounded set of L^W_𝒴(ν), which (by reflexivity) is weakly sequentially compact. If δ_0 is its weak limit, then by the weak (= strong here) completeness of 𝒟, δ_0 ∈ 𝒟, and hence α_0 ≤ U(δ_0) ≤ lim_n U(δ_n) = lim_n α_n = α_0, so that δ_0 is a minimal element; then δ_0 ∈ 𝒟_0, as desired. Finally, by the hypothesis on W and ‖·‖, we conclude that L^W_𝒴(ν) (as well as L^V_{𝒴*}(ν)) is a strictly convex space, so that 𝒟_0 can have exactly one element, and
$$U(\delta_0)=\int_\Omega W(\|\Theta-\delta_0(X)\|)\,d\nu=\alpha_0.\tag{28}$$
Using the differentiability hypothesis more effectively, (23) will now be established. Note that the norm of L^W_𝒴(ν) is differentiable (since ‖·‖ is smooth and L^W_ℝ(ν) has a differentiable norm, L^W_𝒴(ν) has the same property by classical results), and the space is reflexive. Thus, if δ̄ = θ − δ(x), and since
$$\ell_u(y)=\frac d{du}W(\|\bar\delta+uy\|)=W'(\|\bar\delta+uy\|)\,\frac d{du}\|\bar\delta+uy\|,$$
one can differentiate (along y ∈ 𝒴) under the integral sign, since W′(‖δ̄‖ + ‖y‖)(‖δ̄‖ + ‖y‖) serves as a dominating integrable function (cf. Theorem 2), to get (23) from (28). Also, it is known from the classical theory that ℓ_u(·) is a linear functional. It remains to show that the δ* of (23) (or (22)), with ν_x(·), can be taken to be a measurable function on (𝒳, ℬ). Instead of considering it directly, as in Theorem 2.3, we use the latter result. For this, let y* ∈ 𝒴* be an arbitrary element, so that
$$\|\Theta-\delta(X)\|_{\mathcal Y}=\sup_{\|y^*\|\le1}|y^*(\Theta-\delta(X))|=\sup_{\|y^*\|\le1}|y^*(\Theta)-y^*(\delta(X))|.\tag{29}$$
Now y*(Θ) is real and may be considered as a one-dimensional parameter. Then by the proof of Theorem 2.3 there is a measurable function x → δ_{y*}(x) that minimizes the corresponding functional, and this is unique. Also, the same argument there shows that (y_1* + y_2*)(Θ) is minimized by δ_{(y_1*+y_2*)}(x), and, using the uniqueness, we see that y* → δ_{y*}(x) is a linear function of y* for each x, which is also seen to be continuous (by the argument of (25)-(27)). Hence it defines an element of 𝒴** = 𝒴, and δ_{y*}(x) can be expressed as y*(δ*(x)) for some δ*(x) ∈ 𝒴 which is measurable in x. This means y*(δ*) is measurable, so that δ* is weakly measurable, and since 𝒴 is separable this is equivalent to saying that δ* is a measurable (also termed "strongly measurable") function. Hence the minimization problem has a solution y*(δ*(X)). By the uniqueness of the solution to (23), this means that δ*(X) coincides with the δ_0(X) there. Thus it is a multidimensional Bayes solution.

Remark. The above result holds if ℝ^k and ℝ^n are replaced by 𝒴 and 𝒳, reflexive (separable) Banach spaces with norms differentiable both in the spaces and in their adjoints. In particular, if 𝒳, 𝒴 are uniformly convex with smooth norms, the above conditions are satisfied. This is the result considered in De Groot and Rao ([2], Sec. 4). There the spaces are treated for the measures dP̃(θ, x) = dP(θ|x)π(x), where π(·) is the marginal distribution of X, which then implies (23), without a detailed explanation. A different type of extension to the multiparameter (even infinite dimensional) case was considered by Kozek [1], using Fenchel-Orlicz spaces and their results as the tools. Except for the one-dimensional parameter, these results are not the same.

The preceding theorem again has an immediate analog of Theorem 2.6 for multidimensional nonlinear prediction problems, with 𝒳 = 𝒴. We state it for completeness as follows:

9. Theorem. Let W, 𝒳 be as above and let X = (X_1, ..., X_n) be given (multidimensional) vectors, X_i ∈ L^W_𝒳(P) on (Ω, Σ, P). If X_{n+1} ∈ L^W_𝒳(P) is to be predicted with minimal average risk defined by W, and the conditional probability distribution P^{ℬ(X)}(·) is regular, then the best predictor δ(X) of X_{n+1} is the unique solution of the following integral equation:
$$\int_\Omega W'(\|X_{n+1}-\delta(X)\|)\,\frac d{du}\big(\|X_{n+1}-\delta(X)+uZ\|\big)\Big|_{u=0}(\omega)\,dP^{\mathcal B(X)}(\omega)=0,\tag{30}$$
for all ℬ(X)-measurable Z in L^W_𝒳(P), where ℬ(X) is the Borel σ-algebra generated by X.
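As an illustration, for quadratic loss W(t) = t² in the scalar case, the integrand in (30) becomes W′(|X_{n+1} − δ(X)|)·sgn(X_{n+1} − δ(X))·Z = 2(X_{n+1} − δ(X))Z, so (30) reduces to the familiar orthogonality relation
$$\int_\Omega(X_{n+1}-\delta(X))\,Z\,dP^{\mathcal B(X)}=0\quad\text{for all }\mathcal B(X)\text{-measurable }Z,$$
whose unique solution is δ(X) = E(X_{n+1}|ℬ(X)), the classical least-squares predictor.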
We next outline a pair of direct methods of estimation of the parameters using other principles, popular in applications.
Instead of estimating a parameter value θ, one can consider finding the smallest random set A(X) that covers θ with probability greater than a prescribed level; i.e., given ε > 0, we require
$$P_\theta[A_\varepsilon(X)\ni\theta]\ge1-\varepsilon,\qquad\theta\in I.\tag{31}$$
Such an A_ε(X) is called a confidence region, and it is a confidence interval if θ ∈ ℝ and A_ε(X) is of the form (δ_1(X), δ_2(X)). It is also termed an interval estimator, the earlier one often being called a point estimator, since there one estimates only a point θ of the parameter space. Thus, by definition, for each observed X = x, A_ε(x) ∋ θ iff x ∈ B_θ, or A_ε(x) = {θ ∈ I : B_θ ∋ x}, and (31) is equivalent to:
$$P_\theta[A_\varepsilon(X)\ni\theta]=P_\theta[X\in B_\theta]\ge1-\varepsilon,\qquad\theta\in I.\tag{32}$$
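As a quick numerical check of the coverage requirement (31)-(32), here is a sketch assuming the standard Gaussian location model, X_1, ..., X_n i.i.d. N(θ, 1), with the usual interval A_ε(X) = (X̄ − z/√n, X̄ + z/√n), z the (1 − ε/2)-quantile of N(0, 1) (parameters illustrative):

```python
# Empirical coverage of the classical normal-mean confidence interval,
# checking P_theta[A_eps(X) contains theta] >= 1 - eps as in (31)-(32).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta, n, eps, reps = 1.5, 25, 0.05, 20000
z = norm.ppf(1 - eps / 2)

xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
print(np.mean(np.abs(xbar - theta) <= z / np.sqrt(n)))  # ~ 1 - eps = 0.95
```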
Typically A_ε(x) is found as the region given by the Neyman-Pearson lemma. Indeed, if, for H_0 : θ = θ_0, B_{θ_0} is the complement of the critical (i.e., is the acceptance) region, then A_ε(x) = {θ_0 ∈ I : x ∈ B_{θ_0}}. Thus there is this connection between the problems of hypothesis testing and confidence regions. Therefore one can introduce concepts of "unbiasedness", regions with nuisance parameters, and the like. However, not much serious work is available on these extensions, since confidence sets are, by nature, less tightly defined than the pointwise concepts of hypothesis testing. Here we shall merely present the notions for completeness and omit further discussion. In the Bayesian context this is expressed as: if P(·|x) is the posterior probability distribution of Θ after observing X = x, then the corresponding confidence statement is given by P(Θ ∈ A(x)|X = x) ≥ 1 − ε. If ϑ is a nuisance parameter, let A(θ_0) be an acceptance region based on the observation X = x, in that B(x) = {θ : x ∈ A(θ)} iff A(θ) = {x : θ ∈ B(x)}, and P_{θ,ϑ}[B(X) ∋ θ_0] ≥ 1 − α, ∀θ, ϑ ∈ I; the family B(X) is unbiased at level 1 − α if P_{θ,ϑ}(B(X) ∋ θ′) ≤ 1 − α, ∀θ ∈ A(θ′) and ∀ϑ. For a discussion and the rationale of these concepts, the reader may be referred to Lehmann [1].

A more operational and potentially useful concept in many situations is the principle (and the consequent methods) of maximum likelihood. This is based on the natural principle that a random observation comes from the event of maximum probability in the underlying model. Thus
if X is a random variable (or vector) governed by (a family of) probability distributions F_X(·, θ) depending on a parameter θ ∈ I, then one selects that distribution (or θ) for which F(x, θ) − F(x−, θ) is a maximum. In particular, if dF(x, θ) = f(x, θ) dμ(x) (for a dominated family), then the observation is taken to be governed by the density f(x, θ̂), where f(x, θ̂) ≥ f(x, θ), θ ∈ I. The methods for finding such a maximum, if it exists, naturally come from the variational calculus. Thus a Borel function x → θ̂(x) making f(x, θ̂(x)) a maximum for each observed value of X is called a maximum likelihood estimator (or MLE) of θ. Then the existence, uniqueness, and various properties of the estimator θ̂(X) are the relevant subjects of study.

If such maxima exist and θ̂ is unique, then it is the MLE of θ, for which
$$g(\theta)=E_\theta(\hat\theta(X))=\int_\Omega\hat\theta(X)\,dP_\theta,\qquad\theta\in I,$$
when the expectation exists, need not equal θ, so that θ̂(X) need not be an unbiased estimator. But a useful property in its favor is its relation to sufficiency: if there is a sufficient statistic T(X) for the family {f(·, θ), θ ∈ I}, so that (see (2) of Sec. II.5) f(x, θ) = g(T(x), θ)h(x), θ ∈ I, where h(·) is a suitable function of x alone, then, since the maximization of the left side relative to θ for fixed x is the same as maximizing g(T(x), θ) relative to θ, one finds that a maximum likelihood estimator, if unique, is a function of the sufficient statistic; indeed, in the presence of sufficiency an MLE can be chosen to be a function of the sufficient statistic even if it is not unique (no recipe for finding such a desirable one routinely is available, however).

Moreover, the MLE also has an "invariance" property. Namely, if θ̂ is an MLE of θ ∈ I and h : I → I′ is a mapping with range I′ (i.e., I′ = h(I)), then one can consider g(x, h(θ)) as a density newly parameterized by h(θ) ∈ I′. The MLE of h(θ) is usually "defined" as h(θ̂). In fact, for each observation x, one has f(x, θ̂) = sup_{θ∈I} f(x, θ) = sup_{θ∈∪_{i∈I′}h^{−1}({i})} f(x, θ). But the h^{−1}({i}) are disjoint for distinct i, since h is a function, and the estimator θ̂ ∈ I = ∪_i h^{−1}({i}) belongs to exactly one of the sets on the right side. Thus, if i_0 = h(θ̂) is this value, we get f(x, θ̂) = g(x, i_0) ≤ sup_{i∈I′} g(x, i) = g(x, î). On the other hand, g(x, i) ≤ f(x, θ̂) for all i ∈ I′, so that there is equality. Consequently the MLE of h(θ) is h(θ̂), and the invariance property is obtained. [For instance, if θ̂ = X̄ is the MLE of the Bernoulli parameter θ, then X̄/(1 − X̄) is the MLE of the odds h(θ) = θ/(1 − θ).] These properties and their relation to the well-known variational methods make this procedure quite attractive in applications, and moreover it is found to
possess a number of other desirable properties. Because of this circumstance, R. A. Fisher, who introduced these matters in the early 1920s, emphasized their importance in statistical methodology. It must be noted, however, that later analyses showed the existence of some pathological aspects related to this method (especially for small samples) if it is used uncritically. However, it does have many more useful consequences than its competitors if the sample size is large and the observations are independent (or the dependence is "moderate"). These questions will be treated in detail later on. The asymptotic (= large sample) analysis for estimation and prediction will be introduced in the next section, and the ramifications will appear throughout the rest of this work. Many of the other methods, known as least squares, minimum chi-squared, information theoretic, and the like, tend to have asymptotically analogous properties. But some at least of the other methods do not have the well-defined operational machinery that the MLE has, and this usually works in favor of the latter.

3.4 Asymptotics in estimation methodology

The large sample properties start with consistency as a basic requirement for the estimators. The more desirable, but also more difficult, ones concern the limiting distributions and the rates of convergence of these estimators. We start with the first of these properties. Recall that a sequence of estimators θ̂_n(X), n ≥ 1, of θ, based on a sample of size n, i.e., X = (X_1, ..., X_n), is said to be consistent (strongly consistent) if θ̂_n →^P θ (respectively, θ̂_n → θ a.e.) as n → ∞. Since θ̂_n(X) is a random variable for each n, it can rarely equal θ, and a desirable property is that, in the limit, it should tend to θ in one of these senses. Thus the asymptotic theory becomes a key part of estimation study. Here we first present conditions in order that an MLE θ̂(X) of θ exist and have this property. Typically the restrictions are put on the density functions f(·, θ), since f(X, θ) and its integrals relative to the dominating measure μ directly link the work with certain functions of θ. Further, one usually starts with a random sample X, where the components X_i are independent with a common distribution (or law), so as to invoke quickly the (strong or weak) law of large numbers in the computations. Here we present a general result without assuming independence, and illustrate its utility with an important application.

Let X : Ω → ℝ^n be a random vector (or a sample of size n) whose distributions are given as {F_n(·, θ), θ ∈ I ⊂ ℝ}, where I is an open interval, and suppose that dF_n(x, θ) = f_n(x, θ) dμ_n(x), where μ_n = ⊗_{i=1}^n μ_i is a (Cartesian) product of σ-finite measures μ_i = μ on ℝ which dominates the F_n(·, θ) for all θ ∈ I. In applications μ is
usually taken to be either the Lebesgue measure (continuous case) or a counting measure (discrete distributions). Also, f_n(·, θ) is called a likelihood function of X. It is convenient to introduce some concepts.

1. Definition. A sequence of estimators θ̂_n (= θ̂_n(X)), n ≥ 1, of θ is termed weakly asymptotically efficient if ϕ_n = ∂ log⁺ f_n/∂θ exists [here and below, log⁺ a = log a if a > 0, and = 0 if a = 0, as usual], with f_n(X, θ) = f_n(X_1, ..., X_n; θ), and if c_n(θ) = E_θ(ϕ_n²), then
$$\sqrt{c_n(\theta)}\,(\hat\theta_n-\theta)\overset{P}{=}\frac{W_n}{V_n},\tag{1}$$
where W_n, V_n, n ≥ 1, are sequences of random variables (on the same probability space as X) with the properties E_θ(W_n) → 0, E_θ(W_n²) → 1, E_θ(V_n) → 1, and P[V_n = 0] → 0 as n → ∞. This concept is termed a "wide sense" one by Wald [2] in case V_n = 1 a.e., n ≥ 1, and a strict sense one if moreover W_n has a limiting normal distribution.

The following theorem is an extension of his result to non-independent sequences, with weaker hypotheses, so as to apply also to certain non-stationary cases, illustrated below, for which the previous results are not applicable. Even so, only sufficient conditions are available for all such general assertions on this topic.

2. Theorem. Let X = {X_n, n ≥ 1} be a sequence of random variables (or a process) with an underlying space {(Ω, Σ, P_θ), θ ∈ I ⊂ ℝ}, where I is an open interval. Suppose that the finite dimensional distributions have densities relative to a (product) dominating σ-finite measure μ_n, so that for each n, f_n(x, θ) is an n-dimensional density, and suppose that the following conditions hold, wherein the preceding notation is used:

Condition (a). ∂^i f_n/∂θ^i, i = 1, 2, exist for all θ ∈ I and are dominated by integrable functions G_n^i(X), i = 1, 2, so that E_θ(|∂^i f_n/∂θ^i|) ≤ E_θ(G_n^i(X)) < ∞, i = 1, 2, θ ∈ I.

Condition (b). c_n(θ) = E_θ(ϕ_n²(X, θ)) < ∞ and c_n(θ) → ∞ as n → ∞, for all θ ∈ I, where ϕ_n(X, θ) = (∂ log⁺ f_n/∂θ)(X, θ).

Condition (c). ϕ′_n(·) (which exists by (a)) satisfies a uniform Lipschitz-type condition of order α, in that for each β > 0 there is an 0 < α ≤ 1 such that |ϕ′_n(θ) − ϕ′_n(θ_0)| ≤ |θ − θ_0|^α M_n(θ, θ_0) a.e., and
$$\sup_nE_\theta\Big(\sup_\theta\frac{M_n(\theta,\theta_0)}{c_n(\theta)}\Big)<\infty,\qquad\text{for all }|\theta-\theta_0|<\beta,\ (\theta,\theta_0\in I).$$
Condition (d). Given 0 < δ < 1, there is an ε (= ε_δ > 0) and an n_0 such that
$$P_\theta\Big[\Big|\frac{\varphi_n'(\theta)}{c_n(\theta)}\Big|\ge\varepsilon\Big]>1-\delta,\qquad\forall\theta\in I,\ n\ge n_0.$$

Condition (e). θ → P_θ is 1-1 and continuous, i.e., the variation of (P_{θ_1} − P_{θ_2}) → 0 as θ_1 − θ_2 → 0, and P_{θ_1} = P_{θ_2} iff θ_1 = θ_2.

Then the maximum likelihood equation ϕ_n(x, θ) = 0 has a root θ̂_n which is a consistent estimator of θ, i.e., θ̂_n →^P θ, and which moreover is asymptotically efficient in the weak sense.

Proof. The method of proof in all such results is to express the function ϕ_n in a short Taylor expansion in θ for fixed x, and to show, using the conditions in the hypothesis, that the various terms behave in the desired manner. Additional restrictions are needed for uniqueness of the MLE and for the strict sense (or a.e.) convergence of the θ̂_n to θ (cf., e.g., the last part of Theorem 5 below). Condition (e) is analogous to the distinguishability of hypotheses discussed in Section I.3. Thus consider, for a θ_0 ∈ I, the expansion (omitting x from the display):
$$\varphi_n(\theta)=\varphi_n(\theta_0)+(\theta-\theta_0)\varphi_n'(\theta_0)+(\theta-\theta_0)U_n(\theta),\tag{2}$$
where U_n(θ) = ϕ′_n(θ_0 + δ_n(θ − θ_0)) − ϕ′_n(θ_0) is the remainder term, with 0 < δ_n < 1. It follows from Condition (a) again that
$$E_\theta(\varphi_n(\theta))=\int_{\mathbb R^n}\frac\partial{\partial\theta}f_n\,dx=0,\qquad E_\theta(\varphi_n^2(\theta))=-E_\theta(\varphi_n'(\theta))=c_n(\theta).\tag{3}$$
Now (2) may be rewritten as
$$\frac{\varphi_n(\theta)}{\varphi_n'(\theta_0)}=B_n+(\theta-\theta_0)+\tilde B_n(\theta-\theta_0),\tag{4}$$
where we set
$$B_n=\frac{\varphi_n(\theta_0)}{c_n(\theta_0)}\cdot\frac{c_n(\theta_0)}{\varphi_n'(\theta_0)},\qquad\tilde B_n=\frac{c_n(\theta_0)\,U_n(\theta)}{\varphi_n'(\theta_0)\,c_n(\theta_0)},$$
and these are simplified as follows. Since E_{θ_0}(ϕ_n(θ_0)/c_n(θ_0)) = 0 and E_{θ_0}((ϕ_n(θ_0)/c_n(θ_0))²) = (c_n(θ_0))^{-1} → 0 as n → ∞ by Condition (b), it follows by Čebyšev's inequality that ϕ_n(θ_0)/c_n(θ_0) →^P 0. Also, by Condition (d), ϕ′_n(θ_0)/c_n(θ_0) is bounded away from zero in probability. Hence, given ε > 0, there is a K_ε > 0 such that
$$P_{\theta_0}\Big[\Big|\frac{c_n(\theta_0)}{\varphi_n'(\theta_0)}\Big|\ge K_\varepsilon\Big]<\varepsilon,\tag{5}$$
so that, by Exercise I.6.4 [(a) and (c)], B_n →^P 0. Regarding B̃_n, consider its second factor. For 0 < α ≤ 1 and β > 0 as in Condition (c), we have
$$\Big|\frac{U_n(\theta)}{c_n(\theta_0)}\Big|\le|\theta-\theta_0|^\alpha\,\frac{M_n(\theta,\theta_0)}{c_n(\theta_0)},$$
and, for |θ − θ_0| < β [M_n/c_n being bounded in probability], a simplification of (4) as:
$$\frac{\varphi_n(\theta)}{\varphi_n'(\theta_0)}=B_n+(\theta-\theta_0)+(\theta-\theta_0)^{1+\alpha}\bar B_n,\tag{6}$$
where B̄_n = [c_n(θ_0)/ϕ′_n(θ_0)]·[M_n(θ, θ_0)/c_n(θ_0)]·δ̃_n, 0 < δ̃_n < 1, which is bounded in probability for all |θ − θ_0| < β. Consequently, given 0 < ε_i < β, i = 1, 2, if we let p_1 = P_{θ_0}[|B_n| > ε_1^{1+α}] and p_2 = P_{θ_0}[|B̄_n| ≥ K_{ε_2}], then, from the fact that B_n →^P 0, one can choose n_0 (= n_0(ε_1)) such that n ≥ n_0 ⇒ p_1 ≤ ε_2/2. Also, from the computations for B̄_n above, we can find an n_1 (= n_1(ε_2)) so that n ≥ n_1 ⇒ p_2 ≤ ε_2/2. Hence, for n_2 ≥ max(n_0, n_1), consider, for n ≥ n_2, the event
$$S_n=\{X=(X_1,\ldots,X_n):|B_n|<\varepsilon_1^{1+\alpha},\ |\bar B_n|<K_{\varepsilon_2}\}.$$
Then P_{θ_0}(S_n^c) ≤ p_1 + p_2 ≤ ε_2, and so P_{θ_0}(S_n) ≥ 1 − ε_2. Consequently, for θ = θ_0 ± ε_1, the first and last terms of (6) are < (1 + K_{ε_2})ε_1^{1+α} in absolute value on S_n, with probability > 1 − ε_2. Now choose ε_1 such that (1 + K_{ε_2})ε_1^α < 1. Then for n ≥ n_2 the sign of the expression on the right side of (6) for θ = θ_0 ± ε_1 is determined by the middle term, so that
$$\frac{\varphi_n(\theta)}{\varphi_n'(\theta_0)}\ \begin{cases}>0,&\text{if }\theta=\theta_0+\varepsilon_1,\\ <0,&\text{if }\theta=\theta_0-\varepsilon_1.\end{cases}\tag{7}$$
Since ϕ_n(θ), being differentiable in θ, is continuous in θ, the left side of (6) must have a zero; i.e., ϕ_n(θ)/ϕ′_n(θ_0) has at least one root θ̂_n in the interval (θ_0 − ε_1, θ_0 + ε_1) if n ≥ n_2, with probability > 1 − ε_2. But by Condition (e), θ → P_θ is 1-1 and continuous, and ε_i > 0, i = 1, 2, are arbitrary. Hence the (log-)likelihood equation ϕ_n(θ) = 0 has a root θ̂_n in (θ_0 − ε_1, θ_0 + ε_1), and θ̂_n →^P θ_0 as n → ∞.

It remains to establish the efficiency assertion. Thus let θ̂_n be a root of ϕ_n(θ) = 0. Substituting this in (4) and rearranging the right side, with V_n = −ϕ′_n(θ_0)/c_n(θ_0), Z_n = √(c_n(θ_0))(θ̂_n − θ_0), and W_n = ϕ_n(θ_0)/√(c_n(θ_0)), we get
$$W_n=V_nZ_n+\rho_nZ_n,\tag{8}$$
where ρ_n = −U_n(θ̂_n)/c_n(θ_0). By hypothesis one has
$$|\rho_n|\le|\hat\theta_n-\theta_0|^\alpha\,\frac{M_n(\hat\theta_n,\theta_0)}{c_n(\theta_0)}.\tag{9}$$
Since M_n/c_n is bounded in probability, it follows from (9), on noting that 0 < α ≤ 1, that ρ_n →^P 0, as θ̂_n → θ_0 in probability. Hence V_n =^P (V_n + ρ_n), and E_{θ_0}(W_n) = 0, E_{θ_0}(W_n²) = 1. But then W_n is bounded in probability [since its means and variances are bounded], E_{θ_0}(V_n) = 1, and, by Condition (d), V_n is bounded away from zero in probability. Thus (8) implies that Z_n is also bounded in the same way, and one concludes at once that
$$\sqrt{c_n(\theta_0)}\,(\hat\theta_n-\theta_0)=Z_n\overset{P}{=}\frac{W_n}{V_n},$$
and hence θ̂_n is asymptotically efficient in the weak sense by Definition 1, as desired.

Remark. The above proof actually shows that, under the conditions of the theorem, every root of the equation ϕ_n(θ) = 0 is consistent and asymptotically efficient in the weak sense for θ in the open interval I. It is of interest to present an application of this result that could not be obtained from the previously known theorems of Wald [2], Cramér [1], Kulldorff [1] and others.

3. Example. Consider the process {X_n, n ≥ 0} on {(Ω, Σ, P_α), α ∈ I ⊂ ℝ} given by:
$$X_n=\alpha X_{n-1}+u_n,\qquad\alpha\in I,\tag{10}$$
where the u_n are independent standard Gaussian N(0, 1) random variables (the "disturbances" or "noise") for all n, with u_n = 0 for n ≤ 0, and I a bounded open interval containing (−2, 2). This model is a favorite one among stock market analysts, where X_n is the value of a stock at time n and α is an unknown parameter with possible values |α| ≤ 1 or |α| > 1 (but bounded). If |α| < 1 the process is said to be "stable", if |α| = 1 "unstable", and if |α| > 1 an "explosive" process of considerable interest. [The terms in quotes can (but will not) be defined at this point. It should be noted, though, that "stable" does not imply stationarity.] Of course, the model may be used for other (general) random walk situations. We verify here that the hypothesis of Theorem 2 is satisfied, so that the MLE α̂_n of α exists, is consistent, and is asymptotically efficient in the weak sense.
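Before the verification, a minimal simulation sketch may help fix ideas. Since ϕ_n(α) = Σ_i(X_i − αX_{i−1})X_{i−1} is linear in α (cf. (12) below), the likelihood equation ϕ_n(α) = 0 has the explicit root α̂_n = Σ_iX_iX_{i−1}/Σ_iX_{i−1}²; the parameter values and sample size below are illustrative choices:

```python
# Simulation sketch of the model (10): X_n = alpha * X_{n-1} + u_n with
# u_n i.i.d. N(0,1) and X_0 = 0. The ML equation phi_n(alpha) = 0 is linear
# in alpha, with the explicit root shown in mle_alpha below.
import numpy as np

rng = np.random.default_rng(3)

def mle_alpha(alpha, n):
    X = np.zeros(n + 1)                       # X_0 = 0, since u_n = 0 for n <= 0
    for i in range(1, n + 1):
        X[i] = alpha * X[i - 1] + rng.normal()
    return np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)

for alpha in (0.5, 1.0, 1.1):                 # "stable", "unstable", "explosive"
    print(alpha, mle_alpha(alpha, n=500))     # estimates close to the true alpha
```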
Let μ be the Lebesgue measure; f_n is then given by:
$$f_n(x_1,\ldots,x_n;\alpha)=(2\pi)^{-\frac n2}\exp\Big[-\frac12\sum_{i=1}^n(x_i-\alpha x_{i-1})^2\Big],\qquad x_0=0.\tag{11}$$
Since I is an open bounded interval, it is clear that Condition (a) holds for the f_n of (11). Next, the (log-)likelihood function is given as:
$$\varphi_n(\alpha)=\frac{\partial}{\partial\alpha}\log f_n(X)=\sum_{i=1}^nu_iX_{i-1}.\tag{12}$$
Using the linear (difference) relation (10) between the X_n, we get by iteration
$$X_n=\sum_{i=1}^n\alpha^{i-1}u_{n-i+1},\tag{13}$$
so that X_n and u_{n+1} are independently distributed. Hence (12) and (13) imply
$$c_n(\alpha)=E_\alpha(\varphi_n^2(\alpha))=\sum_{i=2}^n\sum_{j=1}^{i-1}\alpha^{2(j-1)}\ \ge\ n-1.\tag{14}$$
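As a check on (14) and on the limits used below, summing the inner geometric series gives, for α² ≠ 1,
$$c_n(\alpha)=\sum_{i=2}^n\frac{1-\alpha^{2(i-1)}}{1-\alpha^2},\qquad\text{so that}\qquad\frac{c_n(\alpha)}n\to\frac1{1-\alpha^2}\quad(|\alpha|<1),$$
while for |α| = 1 each inner sum equals i − 1, whence c_n(α) = n(n − 1)/2 and c_n(α)/n² → 1/2.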
So c_n(α) → ∞ and Condition (b) holds, while (e) is clearly true. The remaining two conditions are verified as follows.

From (12) one finds that ϕ′_n(α) = −Σ_{i=1}^n X_{i−1}² for all α, so that ϕ′_n(α) − ϕ′_n(α′) = 0, and Condition (c) is also trivial, since one can take M_n = 1, for instance. Finally, for the key Condition (d), note that ϕ′_n(α)/c_n(α) has mean −1. To show it is bounded away from zero in probability, consider the cases |α| < 1, |α| = 1 and |α| > 1 separately, where c_n(α) is given by (14). One needs the limit behavior of Σ_{i=1}^n X_{i−1}²/c_n(α) in each of the cases, and this is somewhat involved. We present the result without always including the intermediate elementary (often long) computations.

If |α| < 1, then from (14) it is seen, after an algebraic simplification, that lim_{n→∞} c_n(α)/n = (1 − α²)^{−1}, and that Var(ϕ′_n(α)) = O(n). Then ϕ′_n(α)/c_n(α) →^P −1, since Var(ϕ′_n(α)/c_n(α)) → 0 as n → ∞. Consequently this function is bounded away from zero in probability, and Condition (d) holds. If |α| = 1, then c_n(α)/n² → 1/2, and (1/n²)Σ_{i=1}^n X_{i−1}² →^D Z, a random variable bounded away from zero in probability. This is a nontrivial consequence of a limit theorem of Erdős and Kac [1], so that Condition (d) is again satisfied. If |α| > 1, the preceding result does not apply, but by a detailed
113
n 2 analysis of cn1(α) i=1 Xi−1 , it may be shown (cf., Rao [2], Lemma 15, and also Section IV.1 below) that this random quantity converges in probability to a random variable, a.e., and is bounded away from zero in probability. Thus Condition (d) is satisfied in all cases. Hence by Theorem 2, the likelihood function ϕn (α) = 0 has a root α ˆ n which is a consistent and asymptotically efficient (in the weak sense) estimator. Since (12) implies that the ML equation has only one root, it is unique in this problem. One may note that the verification of conditions [especially Conditions (d) and (e)] can present nontrivial individual studies demanding a detailed analysis in applications. This example raises at once two questions. First, is there a multi parameter analog of Theorem 2? Second, the special form of the model equation (10) indicates that it is a linear stochastic difference equation, and the simplifications above used only the fact that the “nose” process {un , n ≥ 1} consists of independent random variables with a common distribution having two finite moments, but not necessarily normal. Can one use other methods (e.g., least-squares) in estimation theory, bypassing the ML estimation, and prove similar results for more general processes? There is a positive answer to each of these questions. We consider the first one here, and postpone the second one to the next chapter wherein other specific structures are studied. It is now necessary to present an analog of Definition 1 for the kparameter version. This and the subsequent result will involve considerations of ranks of certain covariance matrices. Thus we start with the following: 4. Definition. A sequence of k-vector estimators {θˆn , n ≥ 1} of θ = (θ1 , . . . , θn ) of {fn (·, θ), θ ∈ I k ⊂ Rk } is asymptotically efficient in the wide sense if there exist random vectors {Wn , n ≥ 1} such that limn Eθ (Wn ) = 0, limn Eθ (Wn Wn ) = Ik , the identity matrix of order P k (Wn is the transpose of Wn ), and that (θˆn − θ)Bn (θ)=Wn where Bn2 (θ) = Γn (θ) = (cij , 1 ≤ i, j ≤ k), cij (θ, n) = Eθ (ϕi (θ, n)ϕj (θ, n)), ∂ log+ fn (x, θ) = ∂i∂θi ϕ(θ, n), 1 ≤ i ≤ k, defining and ϕi (θ, n)(·) = ∂θ i the (log)likelihood function ϕ(θ, n)(·). [For simplicity ϕ(θ, n) is also called a likelihood function and Bn is the positive square root of Γn .] The following is a multi parameter analog of the preceding theorem which we state without proof, using the above notations: 5. Theorem. Let {fn (·, θ), θ ∈ I k ⊂ Rk } be the finite dimensional densities relative to μn of a sequence of random variables {Xn , n ≥ 1}, as in Theorem 2 on {(Ω, Σ, Pθ ), θ ∈ I k } where I k is a non degenerate open interval (or box) of Rk . Suppose the fn satisfy the following conditions for n ≥ 1: 2 fn (x,θ) ∂ Condition 1. ∂θ fn (x, θ), ∂ ∂θ , i, j = 1, . . . , k exist for a.a. (x) i i ∂θj
114
III. Parameter Estimation and Asymptotics
and θ ∈ I k and that their absolute values are dominated by Pθ -integrable functions Gn and Hn of x for all i, j. Condition 2. c(θ, n) = maxi cii (θ, n) → ∞ as n → ∞, ∀θ ∈ I k . Condition 3. For θ ∈ I k , limn→∞ c(θ, n)−1 Γn (θ) exists as a nonsingular matrix. ϕij (θ,n) ] = 0, 1 ≤ i, j ≤ Condition 4. For each θ ∈ I k , limn→∞ V arθ [ c(θ,n) n. Condition 5. For θ, θ ∈ I k , and a given β > 0, there is an 0 < α ≤ 1 such that k α |ϕij (θ, n) − ϕij (θ , n)| ≤ [ (θi − θi )2 ] 2 Mij (θ, θ , n), a.a.(x) i=1
k M (θ,θ ,n) and Eθ [supθ,θ ijc(θ,n) ] < ∞, whenever i=1 (θi − θi )2 < β. Condition 6. The mapping θ → Pθ is 1-1 and continuous when the total variation measure set-wise limit is used on the right, and the Euclidean metric on I k . Then the ML vector equation ϕn (θ) = (ϕi (θ, n), i = 1, . . . , k) = 0 has a root θˆn which is a consistent estimator of the vector θ ∈ I k , and which is asymptotically efficient in the wide sense. Moreover, if the matrix Ψn (θ) = [ϕij (θ, n), 1 ≤ i, j ≤ k] has the property that c(θ0 , n)−1 Ψn (θ) is negative definite for all θ ∈ I k , a non degenerate open convex set as above, and for all large n with probability one, then the consistent estimator θˆn of θ is also unique.
The proof of this result is very similar to that of Theorem 2. The demand of nonsingularity for the limit matrices makes it more restrictive than Theorem 2; for instance, it can be used only in the "stable" analog of Example 4 above. However, this still extends Wald's [2] theorem to multiple dimensions. Details of the proof will be left as an instructive exercise. It is possible to weaken the conditions of the above theorem to apply to some "unstable" processes, but a satisfactory generalization of Theorem 2 that applies to all unstable as well as explosive processes is still not available. This point will be illustrated for a certain class of processes in the next chapter.

We now present an approximation of the solutions of (nonlinear) prediction problems with convex loss, based on a finite number of observations, using the conclusions of Theorem 3.8, as a final item of this section. It will show how asymptotic analysis comes into play in different forms in stochastic inference problems. Using the previous notation, we have the following result, stated for the real valued case, but it holds if the range of the random variables is multidimensional, as in Theorem 3.9. [Extensions are treated in Chapter VIII later.]
6. Theorem. Let $W$ be a convex loss function such that (i) $W \in \Delta_2$, (ii) $W'$ is continuous, and (iii) for each $\varepsilon > 0$ there are $k_\varepsilon > 1$, $x_\varepsilon \geq 0$ such that $W((1+\varepsilon)x) \geq k_\varepsilon W(x)$, $x \geq x_\varepsilon$. Let $\{X_n, n \geq 1; Y\}$ be a set of random variables on $(\Omega, \Sigma, P)$ whose joint distributions are thus known. If $\mathcal{B}_n = \sigma(X_1, \ldots, X_n)$ and $\mathcal{B}_0 = \sigma(X_n, n \geq 1)$ are the σ-algebras generated by the variables shown, then $Y_n$ and $Y_0$, the best predictors of $Y$ based on $(X_1, \ldots, X_n)$ and on the whole set $\{X_n, n \geq 1\}$, exist, i.e.,

$$E(W(Y - Y_n)) = \inf\{E(W(Y - Z)) : Z \in L^W(\mathcal{B}_n)\},$$

and similarly for $Y_0$ if $\mathcal{B}_0$ replaces $\mathcal{B}_n$. Moreover, $Y_n \to Y_0$ in mean, or equivalently, $E(W(Y_n - Y_0)) \to 0$ as $n \to \infty$.

Proof. We first observe that $W$ satisfying conditions (i)–(iii) restricts both its exponential growth and a slow linear increase. More precisely, these imply that $W \in \Delta_2 \cap \nabla_2$ (cf. Rao and Ren [1], p. 284), and the corresponding Orlicz space $L^W(\Omega, \Sigma, P) = L^W(\Sigma)$ is reflexive and strictly convex. More is true: it is also uniformly convex in both of its norms (cf. again Rao and Ren, loc. cit., p. 295). With this preliminary information, consider the spaces $L^W(\mathcal{B}_n) \subset L^W(\mathcal{B}_{n+1}) \subset L^W(\mathcal{B}_0) \subset L^W(\Sigma)$, which are closed linear manifolds. Since $Y \in L^W(\Sigma)$, by Theorem 3.8 there is a unique $Y_n \in L^W(\mathcal{B}_n)$ which is closest to $Y$, $n = 0, 1, \ldots$. Note that $\mathcal{B}_0 = \sigma(\cup_n \mathcal{B}_n) = \sigma(X_n, n \geq 1)$. Thus it only remains to show that $Y_n \to Y_0$ in mean. Let us consider the convex functional $U$ defined as:

$$U(Z) = \int_\Omega W(|Y - Z|)\, dP, \quad Z \in L^W(\mathcal{B}). \qquad (15)$$

Then $U(Y_0)$ and $U(Y_n)$ are the minimum values in $L^W(\mathcal{B}_i)$, $i = 0, n$, as $Z$ varies in these subspaces, and $0 \leq U(Y_0) \leq U(Y_n) \leq U(Y_1)$. Thus $\{Y_n, n \geq 1\}$ is contained in a ball of $L^W(\Sigma)$. We observe, as already noted before, that the norm and modular convergences in a reflexive $L^W(\Sigma)$-space are equivalent. Hence the above statement is the same as saying that $0 \leq N_W(Y - Y_0) \leq N_W(Y - Y_n) \leq N_W(Y - Y_1)$. Then if $\alpha_n = N_W(Y - Y_n)$, we claim that $\alpha_n \downarrow \alpha_0 = N_W(Y - Y_0)$. Indeed, let $Z_n = Y - Y_n$, so that $N_W(Z_n) = \alpha_n \geq \alpha_0 = N_W(Z_0)$. Now $\cup_n L^W(\mathcal{B}_n) \subset L^W(\mathcal{B}_0)$ and the former set is dense in the latter. Hence for any $\varepsilon > 0$ there is a $Y_\varepsilon \in \cup_n L^W(\mathcal{B}_n)$ such that $N_W(Y_0 - Y_\varepsilon) < \varepsilon$. Consequently, since $Y_\varepsilon \in L^W(\mathcal{B}_{n_0})$ (for some $n_0$, say), one has:

$$\alpha_0 \leq \alpha_n \leq \alpha_{n_0} = N_W(Z_{n_0}) \leq N_W(Y - Y_\varepsilon) \leq N_W(Y - Y_0) + N_W(Y_0 - Y_\varepsilon) \leq \alpha_0 + \varepsilon, \quad n \geq n_0.$$

It follows that $\alpha_n \to \alpha_0$, and $\{Z_n = Y - Y_n \in L^W(\Sigma), n \geq 1\}$ is contained in a ball.
Since $L^W(\Sigma)$ is reflexive, the set $\{Z_n, n \geq 1\}$ is relatively weakly compact (by classical results), so it has a weakly convergent subsequence $Z_{n_i} \to Z_0 \in L^W(\Sigma)$. This implies, from $Z_{n_i} = Y - Y_{n_i}$, that $Y_{n_i} \to Y_0'$ weakly. Since $Y_{n_i} \in L^W(\mathcal{B}_{n_i}) \subset L^W(\mathcal{B}_0)$, which is weakly complete, $Y_0' \in L^W(\mathcal{B}_0)$. However, we have $Z_0 = Y - Y_0'$, so that

$$\alpha_0 \leq N_W(Z_0) \leq \liminf_i N_W(Z_{n_i}) = \lim_i \alpha_{n_i}.$$
Thus $N_W(Y - Y_0') = \alpha_0$, and so $Y_0'$ is also a minimal element. By the strict convexity of $L^W(\mathcal{B}_0)$, the minimal element is unique, whence $Y_0' = Y_0$, a.e. But the argument can be repeated for each infinite subsequence of $\{Z_n, n \geq 1\}$, and the above argument shows that they all have the same limit $Z_0$, so that the whole sequence converges weakly to it, and moreover $\alpha_0 = N_W(Z_0) = \lim_n N_W(Z_n)$. Now in a uniformly convex Banach space, weak sequential convergence together with convergence of the norms implies strong convergence of the sequence, i.e., $N_W(Z_n - Z_0) \to 0$ (cf., e.g., Dunford and Schwartz [1], II.4.28). This is equivalent to saying that $N_W(Y_n - Y_0) \to 0$ as $n \to \infty$. Hence if $\varepsilon_n = N_W(Y_n - Y_0)$, then for large $n$, $0 < \varepsilon_n < 1$, and we have

$$E(W(Y_n - Y_0)) = \int_\Omega W\Big(\varepsilon_n \frac{Y_n - Y_0}{\varepsilon_n}\Big)\, dP \leq \varepsilon_n \int_\Omega W\Big(\frac{Y_n - Y_0}{\varepsilon_n}\Big)\, dP \leq \varepsilon_n \cdot 1, \qquad (16)$$

by definition of the norm $N_W(\cdot)$, together with the convexity of $W$. Since $\varepsilon_n \to 0$ as $n \to \infty$, this gives the desired conclusion.

Remarks. 1. If the $X_n$ take values in $\mathcal{X}$, a uniformly convex Banach space, then $L^W_{\mathcal{X}}(\Sigma)$ is also uniformly convex. Replacing $W(Y - Y_n)$ by $W(\|Y - Y_n\|)$, where $\|\cdot\|$ is the norm of $\mathcal{X}$, all the abstract results used in the above proof remain valid. So the theorem also holds in multidimensions, exactly as Theorems 3.8 and 3.9. This result is essentially detailed in (Rao [3], Thm. 5).

2. The conditions on $W$ in Theorem 3.8 are that (i) $W \in \Delta_2 \cap \nabla_2$ and (ii) $W'$ exists and is continuous. The present hypothesis is somewhat stronger. However, as is well known (Rao and Ren [1], Thm. 7.3.2 on p. 297), a convex loss function $W \in \Delta_2 \cap \nabla_2$ is equivalent to a $\bar{W}$ that satisfies the hypothesis of the above theorem. Here equivalence means that there exist $0 < a \leq b < \infty$ and $t_0 \geq 0$ verifying

$$W(at) \leq \bar{W}(t) \leq W(bt), \quad \forall t \geq t_0. \qquad (17)$$

This implies that the norms are equivalent, and with the new (equivalent) loss function $\bar{W}$ the obtained results can be interpreted in terms of those of $W$ in "equivalence classes". We avoided these further discussions, using a strengthened hypothesis of the theorem, for brevity.
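For the quadratic loss $W(t) = t^2$ (which lies in $\Delta_2 \cap \nabla_2$), the best predictors are conditional expectations, and the theorem's conclusion $E(W(Y_n - Y_0)) \to 0$ can be checked numerically. The following minimal sketch (an editorial illustration with an assumed toy model, not from the original text) takes $Y = \sum_i a_i X_i$ with i.i.d. $X_i$, so that $Y_n = E(Y \mid X_1,\ldots,X_n)$ is a partial sum and the prediction error has a closed form to compare against:

```python
import numpy as np

# Toy model (an assumption for illustration): Y = sum_i a_i X_i with iid X_i,
# so B_0 = sigma(all X_i), Y_0 = Y, and Y_n = E(Y | X_1..X_n) is a projection.
rng = np.random.default_rng(1)
M = 50
a = 1.0 / (np.arange(1, M + 1) ** 1.5)     # square-summable coefficients
X = rng.standard_normal((10_000, M))        # Monte Carlo sample paths
Y = X @ a

for n in (1, 5, 10, 25, 50):
    Y_n = X[:, :n] @ a[:n]                  # the best predictor given X_1..X_n
    mc = np.mean((Y_n - Y) ** 2)            # E(W(Y_n - Y_0)) by Monte Carlo
    exact = np.sum(a[n:] ** 2)              # closed form for this toy model
    print(f"n={n:<3} MC={mc:.5f} exact={exact:.5f}")
```

The error decreases to zero as $n \to \infty$, mirroring the mean convergence asserted by the theorem.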
A natural question now is to ask whether $Y_n \to Y_0$ pointwise a.e., and not merely in the mean. The answer is in the affirmative, but the proof uses the preceding result together with a considerable amount of additional argument. This will be discussed in Section VII.1; for now we turn briefly to sequential analysis.

3.5 Sequential estimation

An account of sequential estimation methodology will be indicated here for later reference. Suppose $\{X_n, n \geq 1\}$ are independent random variables with a common distribution, i.e., observations independently taken on a fixed $X$. In sequential experimentation, instead of taking a fixed set of observations (decided beforehand), one observes a random variable $X_n$ at stage $n$, and a (sequential) plan specifies whether to terminate the experiment or to take the next observation $X_{n+1}$, $n \geq 1$. Thus the number $N$ of observations depends on the previous ones, $X_1, \ldots, X_{N-1}$, and hence is a random variable itself. In fact, $N$ is measurable relative to $\{\sigma(X_1,\ldots,X_n), n \geq 1\}$, and we want the plan to be such that $P[N < \infty] = 1$. More explicitly, the event $[N = n] \in \sigma(X_1,\ldots,X_n)$, $n \geq 1$, and the sampling should terminate after a finite number of the $X_i$'s with probability one. A basic identity in this theory, due to Wald, states that if the $X_n$, $n \geq 1$, are independent with a common distribution having one moment, and $S_N = X_1 + \cdots + X_N$, then $E(S_N) = E(X_1)E(N)$. A slightly more general version of it is given by Wolfowitz [1], and we present it with a simpler proof, using an observation due to Neveu [1]:

1. Proposition. Let $\{X_n, n \geq 1\}$ be a sequence of independent random variables with a common mean, and $\sup_n E(|X_n|) < \infty$. If $N$ is the number of observations taken according to a sequential sampling plan, $E(N) < \infty$, and $S_n = \sum_{i=1}^n X_i$, then

$$E(S_N) = E(N)E(X_1). \qquad (1)$$

If, further, the $X_n$ also have a common second moment, and $E(N^2) < \infty$, then

$$E(S_N^2) = E(N)\,\mathrm{Var}\,X_1 + [E(X_1)]^2 E(N^2). \qquad (2)$$
Proof. Here we establish (1) using an identity which is useful for (2), as well as for other cases to be verified by the reader. The auxiliary result is the following. Suppose $Y_1, Y_2, \ldots$ is a sequence of integrable random variables on $(\Omega, \Sigma, P)$ and $N$ is an integer valued bounded random variable such that
$[N = k] \in \sigma(Y_1, \ldots, Y_k) = \mathcal{F}_k$, i.e., the event $[N = k]$ is determined by the present and past only, as in the proposition. Then we have, on setting $Y_0 = 0$, the equation:

$$E(Y_N) = E\Big(\sum_{0 \leq k < N} E^{\mathcal{F}_k}(Y_{k+1} - Y_k)\Big). \qquad (3)$$
In establishing (3), we use repeatedly a basic identity of conditional expectations, namely, $E(X) = E(E^{\mathcal{B}}(X))$ for any integrable (or nonnegative) random variable $X$ and any σ-algebra $\mathcal{B} \subset \Sigma$, which is a consequence of the definition:

$$\int_A X\, dP = \int_A E^{\mathcal{B}}(X)\, dP_{\mathcal{B}}, \quad A \in \mathcal{B}.$$
Take $A = \Omega$, and set $P|\mathcal{B} = P_{\mathcal{B}}$. Also, if $N_n = \min(N, n)$, then for each $k$: $[N_n = k] = [N = k]$ if $k < n$; $[N_n = n] = [N \geq n]$; and $[N_n = k] = \emptyset$ if $k > n$. Thus $[N_n = k] \in \mathcal{F}_k$ for all $k$. If $Y_n' = Y_{N_n}$ (and note that $N \leq n_0 < \infty$ for some integer $n_0$, since $N$ is bounded), then for $0 \leq n \leq n_0$ we have:

$$Y_{n+1}' = \sum_{k=1}^n Y_k \chi_{[N=k]} + Y_{n+1}\chi_{[N>n]} = Y_n' + Y_n\chi_{[N=n]} + Y_{n+1}\chi_{[N>n]} - Y_n\chi_{[N>n-1]} = Y_n' + Y_{n+1}\chi_{[N>n]} - Y_n\chi_{[N>n]}.$$

Hence

$$Y_{n+1}' - Y_n' = (Y_{n+1} - Y_n)\chi_{[N>n]}. \qquad (4)$$
Taking expectations, and using the above basic identity of conditioning with $\mathcal{B} = \mathcal{F}_k$, and noting the fact that $[N > k] = [N \leq k]^c \in \mathcal{F}_k$, one gets from (4), after a telescopic cancellation,

$$E(Y_{n_0}') = \sum_{k=0}^{n_0 - 1} E\big[\chi_{[N>k]}\, E^{\mathcal{F}_k}(Y_{k+1} - Y_k)\big] = E\Big(\sum_{0 \leq k < N} E^{\mathcal{F}_k}(Y_{k+1} - Y_k)\Big). \qquad (5)$$
Since $\min(N, n_0) = N$ and $Y_{n_0}' = Y_N$, (5) is the same as (3). We now use (3) to prove (1) (as well as some other results), where $\mathcal{F}_k = \sigma(X_1,\ldots,X_k)$ and $Y_n = S_n = \sum_{i=1}^n X_i$. Then $S_n$ is $\mathcal{F}_n$-measurable, and using the independence of $X_{n+1}$ and $\mathcal{F}_n$, so that $E^{\mathcal{F}_n}(X_{n+1}) = E(X_{n+1})$, we get

$$E^{\mathcal{F}_n}(S_{n+1}) = E^{\mathcal{F}_n}(S_n) + E^{\mathcal{F}_n}(X_{n+1}) = S_n + E(X_{n+1}). \qquad (6)$$
Taking $N_n = \min(N, n)$, (3) gives

$$E(S_{N_n}) = E\Big(\sum_{0 \leq k < N_n} [E^{\mathcal{F}_k}(S_{k+1}) - S_k]\Big) = E\Big(\sum_{0 \leq k < N_n} E(X_{k+1})\Big), \ \text{by (6)}, = \mu E(N_n), \ \text{since } E(X_k) = \mu, \forall k, \text{ by hypothesis}. \qquad (7)$$
This establishes (1) for bounded sample sizes, and to get the general result we only need to let $n \to \infty$, so that the right side tends to $\mu E(N)$ by the monotone convergence theorem. For the left side, $S_{N_n} \to S_N$ a.e., and $|S_{N_n}| \leq \sum_{k=1}^\infty |X_k|\chi_{[N \geq k]}$. But the last dominating function is integrable, because

$$\sum_{k=1}^\infty E(|X_k|\chi_{[N \geq k]}) = \sum_{k=1}^\infty E(|X_k|)\,E(\chi_{[N \geq k]}),$$

since $[N \geq k] = [N < k]^c \in \mathcal{F}_{k-1}$ and $X_k$ is independent of $\mathcal{F}_{k-1}$,

$$\leq \sup_k E(|X_k|) \sum_{k=1}^\infty P([N \geq k]) = \sup_k E(|X_k|)\, E(N) < \infty, \qquad (8)$$
by hypothesis. So, letting $n \to \infty$ in (7), we can interchange the limit and expectation on the left side too, by dominated convergence, whence (1) holds as stated.

For (2), set $X_n' = X_n - \mu$, $Y_n = (\sum_{i=1}^n X_i')^2$, and $N_n = \min(N, n)$ in (3). Here again the interchange of limit and integrand as $n \to \infty$ needs care. The details are left as an exercise.

An alternative proof of (1), in the original (Wald) case that the $X_i$ are independent integrable random variables with a common distribution. This proof, due to Blackwell [1], is instructive and will be given here. It is a beautiful application of the classical Kolmogorov strong law of large numbers (SLLN). Thus let $S_n = \sum_{i=1}^n X_i$, and let $N$ be a positive integer valued random variable, both defined (for convenience) on the same basic probability space as before. Let $N_1, \ldots, N_k$ be independent observations on $N$. Similarly consider independent variables $S_{N_1}, \ldots, S_{N_k}$, identically distributed as $S_N$. Then

$$\frac{S_{N_1} + \cdots + S_{N_k}}{k} = \frac{S_{N_1} + \cdots + S_{N_k}}{N_1 + \cdots + N_k} \cdot \frac{N_1 + \cdots + N_k}{k} = \frac{\sum_{i=1}^{N_1 + \cdots + N_k} X_i}{N_1 + \cdots + N_k} \cdot \frac{N_1 + \cdots + N_k}{k}. \qquad (9)$$

By the SLLN, the last factor in (9) tends to $E(N)$ with probability one, as $k \to \infty$. On the other hand, $\frac{1}{n}\sum_{i=1}^n X_i \to E(X_1)$, as $n \to \infty$, with probability one, by the same SLLN. Thus there is a set $A \in \Sigma$, $P(A) = 0$, such that for $\omega \in A^c$ the numerical sequences $\frac{1}{n}\sum_{i=1}^n X_i(\omega) \to E(X_1)$, as $n \to \infty$. Consequently every infinite subsequence of this big sequence of numbers also converges to the same limit. Hence, in particular, the subsequences $\frac{1}{\sum_{i=1}^k N_i(\omega)}\sum_{i=1}^{(N_1+\cdots+N_k)(\omega)} X_i(\omega) \to E(X_1)$ for each $\omega \in A^c$, as $k \to \infty$. This implies that the first factor on the right of (9) also tends to the number $E(X_1)$ with probability one. Hence the left side of (9) must converge with probability one to the number $E(X_1)E(N)$, as $k \to \infty$. But the $S_{N_i}$ are independent, with the same distribution as $S_N$. So it follows, by the converse part of the SLLN, that the $S_{N_i}$ have a finite mean, which is $E(S_N)$, and which must equal the constant on the right side (cf., e.g., Rao [15], Thm. 2.3.7 on p. 62). This means $E(S_N) = E(X_1)E(N)$, which is (1).
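A quick numerical check of (1) is also possible by direct simulation. In the sketch below (an editorial illustration; the normal increments, the threshold rule, and all numerical values are assumptions, chosen so that $[N = n]$ depends only on $X_1, \ldots, X_n$ and $N$ is bounded), the two sides of Wald's equation are compared:

```python
import numpy as np

# Stopping rule (illustrative assumption): stop at the first n with S_n >= 5,
# or at n = 200, so N is a bounded stopping time of the observation process.
rng = np.random.default_rng(2)
mu = 0.3
reps, cap, level = 20_000, 200, 5.0
SN, NN = np.empty(reps), np.empty(reps)
for r in range(reps):
    s, n = 0.0, 0
    while n < cap:
        s += rng.normal(mu, 1.0)   # X_i ~ N(mu, 1), independent
        n += 1
        if s >= level:
            break
    SN[r], NN[r] = s, n
print("E(S_N)      ~", SN.mean())
print("E(X_1)*E(N) ~", mu * NN.mean())
```

The two Monte Carlo averages agree up to sampling error, as (1) requires.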
Several other demonstrations of (1) are also available in the literature. In fact, one can consider the problem as one of "mixtures" of probability distributions, and other moments of $S_N$ can be calculated. (See, for instance, Rao [18], p. 49, where $N$ was taken as a "Poisson type" variable, so that it has all moments.) We shall present an extension of this reasoning for certain uncorrelated [not independent] random sequences in an exercise in the next section.

The stopping variable $N$, with events $[N = k]$ depending only on $X_i$, $i = 1, \ldots, k$, is called a stopping time of the (sequential) process, and it plays an important role in the theory of several types of stochastic processes. The complications arising from this method, compared to fixed sampling plans, are rewarded by the fact that for inference there are substantial (monetary) savings in taking observations (indeed, if there is a cost involved in sampling the process). This is because in the sequential case, at each stage, one either accepts or rejects a hypothesis, or takes an additional observation only if no decision is reached at that stage. Thus the relation between testing and sampling (or estimation) is close. It is therefore useful to illustrate the method by a simple example. We do this in the following with a Bernoulli random variable, slightly elaborating on one considered by Wald [5].

2. Example. Let $X_1, X_2, \ldots$ be i.i.d. observations on a Bernoulli random variable $X$, $P(X = 1) = p = 1 - P(X = 0)$, $0 < p < 1$. If $f(x_i|p)$ is the probability of $X_i = x_i$ ($x_i = 0, 1$), then for each $m$ the probability of observing $x_1, \ldots, x_m$ is given by:

$$f(x_1, \cdots, x_m|p) = p^{\sum_{i=1}^m x_i}(1 - p)^{m - \sum_{i=1}^m x_i}.$$
If $d_m = \sum_{i=1}^m x_i$, the number of 1's in the $m$ observations, and $H_0 : p = p_0$ vs $H_1 : p = p_1$, then the Neyman–Pearson theory (cf. Chapter II) implies that the likelihood ratio to be considered is given by

$$L(x) = L(x_1, \cdots, x_m) = \frac{f(x_1, \cdots, x_m|p_1)}{f(x_1, \cdots, x_m|p_0)} = \frac{p_1^{d_m}(1-p_1)^{m-d_m}}{p_0^{d_m}(1-p_0)^{m-d_m}}, \qquad (10)$$
and one accepts $H_1$ if $L(x) > A$, accepts $H_0$ if $L(x) < B$, for some $0 < B < A$, and makes no decision but takes an additional observation if $B \leq L(x) \leq A$. [Thus we have a three-decision problem here, in contrast to the classical fixed-sample two-decision case.] Usually one chooses $B < 1$ and $A > 1$, so that $B < A$ is obtained. From the earlier work one notes that if $L(x) \geq A$, then $H_1$ is accepted, and $L(x) \leq B$ leads to $H_0$'s acceptance. If $\alpha$ is the size of the critical region $S_m = \{x : L(x) \geq A\}$ for $H_0$, and $1 - \beta$ is the power of the test at $p_1$ ($\beta$ being the type II error probability of $S_m^c$), then we have:

$$1 - \beta \geq \int_{S_m} dP_{p_1} = \int_{S_m} L(x)\, dP_{p_0} \geq A \int_{S_m} dP_{p_0} = A\alpha.$$
Hence $A \leq \frac{1-\beta}{\alpha}$. Similarly we find $B \geq \frac{\beta}{1-\alpha}$ (with equality, usually, in the case that the distributions are of continuous type). [Since $\alpha, \beta$ are the type I and type II error probabilities, they are taken small, with $\alpha + \beta < 1$, so that the relation $A > B$ assumed above follows and is thus a natural condition.] Using these relations we now find the limits for the Bernoulli case at hand. Thus, with these bounds for $A, B$, we accept $H_0$ if $L(x) \leq \frac{\beta}{1-\alpha}$ ($\leq B$), accept $H_1$ if $L(x) \geq \frac{1-\beta}{\alpha}$ ($\geq A$), and continue sampling if $\frac{\beta}{1-\alpha} < L(x) < \frac{1-\beta}{\alpha}$. Using (10) for $L(x)$ and taking logs on both sides, one finds, on rearranging, the continuation region as:

$$\frac{\log\frac{\beta}{1-\alpha}}{\log\frac{p_1(1-p_0)}{p_0(1-p_1)}} + m\,\frac{\log\frac{1-p_0}{1-p_1}}{\log\frac{p_1(1-p_0)}{p_0(1-p_1)}} \;<\; d_m \;<\; \frac{\log\frac{1-\beta}{\alpha}}{\log\frac{p_1(1-p_0)}{p_0(1-p_1)}} + m\,\frac{\log\frac{1-p_0}{1-p_1}}{\log\frac{p_1(1-p_0)}{p_0(1-p_1)}}. \qquad (11)$$

Denoting the left side member of (11) by $a_m$ and the right side by $r_m$, one continues sampling if $[a_m] < d_m < (r_m)$,
where $[a_m]$ is the integral part of $a_m$ and $(r_m)$ is the smallest integer $\geq r_m$. Note that these $a_m$ and $r_m$ are simply given by:

$$a_m = h_1 + ms, \quad r_m = h_2 + ms, \qquad (12)$$
where $h_1$ and $h_2$ are the corresponding quantities displayed in (11), and

$$s = \frac{\log\frac{1-p_0}{1-p_1}}{\log\frac{p_1(1-p_0)}{p_0(1-p_1)}} \qquad (13)$$

is the slope of the two parallel lines (or boundaries) of the acceptance and rejection regions. After obtaining (12), Wald ([5], p. 100) remarks in a footnote: "it can be shown that (the slope) $s$ lies between $p_0$ and $p_1$", i.e., $0 < p_0 < s < p_1 < 1$. Since this is not entirely obvious, we include here a simple proof of it. Indeed, $p_0 < p_1$ implies $p_0(1-p_1) < p_1(1-p_0)$, or, since $p_1 - p_0 > 0$, one has:

$$\Big(\frac{1-p_1}{1-p_0}\Big)^{p_1 - p_0} < \Big(\frac{p_1}{p_0}\Big)^{p_1 - p_0}, \qquad (14)$$

which may be written as:

$$\Big(\frac{1-p_1}{1-p_0}\Big)^{1-p_0}\Big(\frac{1-p_0}{1-p_1}\Big)^{1-p_1} < \Big(\frac{p_1}{p_0}\Big)^{p_1}\Big(\frac{p_0}{p_1}\Big)^{p_0}. \qquad (14')$$

We observe that for $0 < p_0 < p_1 < 1$ one always has:

$$\text{(a)} \ \Big(\frac{1-p_1}{1-p_0}\Big)^{1-p_0} < \Big(\frac{p_0}{p_1}\Big)^{p_0}; \qquad \text{(b)} \ \Big(\frac{1-p_0}{1-p_1}\Big)^{1-p_1} < \Big(\frac{p_1}{p_0}\Big)^{p_1}, \qquad (15)$$
and then multiplying these two inequalities gives (14'). To see the truth of (15), consider (a). Let $y(p_1)$ and $\tilde{y}(p_1)$ be the left and right sides, and replace $p_1$ by $p$. Then $0 < p_0 < p < 1$, and both $y(\cdot)$, $\tilde{y}(\cdot)$ are differentiable functions on the open interval and continuous on the closed interval $[p_0, 1]$. Also, if $z(p) = \tilde{y}(p) - y(p)$, then the derivative $z'(p)$ exists and is found at once to be:

$$z'(p) = \Big(\frac{1-p_0}{1-p}\Big)^{p_0} - \Big(\frac{p_0}{p}\Big)^{1+p_0} > 0, \quad p_0 < p < 1.$$

Consequently $z(p)$ is strictly increasing as $p$ goes from $p_0$ to $1$. Since $z(p_0) = 0$ and $z(1) = p_0^{p_0} > 0$, we must have $z(p) > 0$ on the open interval, which implies (a). An identical argument for (b), replacing $p_0$ by $p$, shows that on the closed interval $[0, p_1]$ the corresponding function is strictly decreasing and vanishes at $p_1$, so it is positive on $[0, p_1)$, and the desired inequality follows.
Next consider (15)(a) and write it as:

$$\frac{1-p_1}{1-p_0} < \Big(\frac{p_0(1-p_1)}{p_1(1-p_0)}\Big)^{p_0}.$$

Taking logs and multiplying by $-1$, we get

$$\log\frac{1-p_0}{1-p_1} > p_0\Big[\log\frac{p_1}{p_0} - \log\frac{1-p_1}{1-p_0}\Big],$$

giving $p_0 < s$ of (13). Similarly, taking logs in (15)(b) and rewriting shows $s < p_1$, establishing Wald's remark. It should be observed that (14) and (14') are the key motivations for (15), and the latter directly gives the desired inequalities. Analogous results for testing that the variance of a normal distribution has a given value when the mean is known, and other problems, are detailed in Wald's book [5]. It should be noted that sequential procedures are best utilized when observations are independently taken. The general case demands assumptions on conditional densities (as we show in an exercise), and then the elegance of the theory is somewhat diminished.
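As a small computational companion to Example 2 (an editorial sketch; the function name `sprt_bernoulli` and all numerical values are assumptions for illustration), the continuation region (11)–(12), with Wald's approximate limits $A = (1-\beta)/\alpha$, $B = \beta/(1-\alpha)$, can be run directly:

```python
import numpy as np

def sprt_bernoulli(p0, p1, alpha, beta, p_true, rng, max_n=10_000):
    lam = np.log(p1 * (1 - p0) / (p0 * (1 - p1)))  # common denominator in (11)
    s = np.log((1 - p0) / (1 - p1)) / lam          # the slope (13)
    h1 = np.log(beta / (1 - alpha)) / lam          # intercepts of (12)
    h2 = np.log((1 - beta) / alpha) / lam
    d = 0
    for m in range(1, max_n + 1):
        d += rng.random() < p_true                 # one Bernoulli observation
        if d >= h2 + m * s:                        # d_m left the band upward
            return "accept H1", m
        if d <= h1 + m * s:                        # d_m left the band downward
            return "accept H0", m
    return "no decision", max_n

rng = np.random.default_rng(3)
for p_true in (0.3, 0.5):
    runs = [sprt_bernoulli(0.3, 0.5, 0.05, 0.05, p_true, rng) for _ in range(5)]
    print(p_true, runs)
```

One can observe the typically small sample sizes at termination, the practical attraction of the sequential plan noted above; the slope computed in the sketch indeed lies between $p_0$ and $p_1$, as just proved.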
We now present an analog of Theorem 2.5 for the lower bound, with convex loss, for the risk function of an estimator of a parametric function $\theta \mapsto g(\theta)$ of $\theta$ appearing in the density $f(\cdot|\theta)$ of the random variable under observation. As in the classical case, let $h_N(X) = h(X_1,\cdots,X_N)$ be an estimator of a real function $g$ of $\theta$, the parameter appearing in the model $\{(\Omega,\Sigma,P_\theta), \theta \in I \subset \mathbb{R}\}$, based on the observations $(X_1,\ldots,X_N)$ just before termination of the experiment (and hence before reaching a decision of acceptance or rejection of the hypothesis). It is taken that $E_\theta(h_N(X)) = g(\theta)$. If $g(\theta) = \theta$, one has the unbiasedness property. Here we state the result without the regularity conditions, by using Kiefer's method (adapted to the present context), as given in Section 2. Thus consider the density function $f_n(x|\theta) = f_n(x_1,\cdots,x_n|\theta)$, for the value $N = n$, and an auxiliary class of functions on $S_{\theta,n} \times I_\theta$, where $S_{\theta,n}$ is the carrier of $f_n(x|\theta)$ and $I_\theta = \{t : \theta + t \in I\}$, defined by:

$$D_{\xi_1,\xi_2}(x, n|\theta) = \frac{1}{f_n(x|\theta)} \int_{I_\theta} f_n(x|\theta + t)\, d(\xi_1 - \xi_2)(t), \qquad (17)$$

for $x \in S_{\theta,n}$, $\theta \in I$, and a pair of distinct probability measures $\xi_i$, $i = 1, 2$, on the Borel σ-algebra of $I_\theta$. In case each $f_n(\cdot|\theta)$ is differentiable in $\theta$, one takes $D(x, n|\theta) = \frac{\partial \log f_n(x|\theta)}{\partial \theta}$, and this is seen to be included in (17) if the $\xi_i$ are chosen suitably. Let $M^{k'}(\tilde{P})$ be the collection of all such
functions $D_{\xi_1,\xi_2}$ which are in $L^{k'}(\tilde{P}_\theta)$ for all $\theta$, where $\tilde{P}_\theta$ is determined by $P_\theta(\cdot|N = n)$ and $P(N = n)$, as a mixture, considered before (cf., e.g., Rao [18], p. 47). It is again true that

$$E_\theta[(h(X) - g(\theta))D_{\xi_1,\xi_2}] = E_\theta(h(X)D_{\xi_1,\xi_2}) = \int_{I_\theta} g(\theta + t)\, d(\xi_1 - \xi_2)(t).$$
Here $E_\theta(\cdot)$ is calculated on $\tilde{\Omega} = \Omega \times I$ and $\tilde{P}_\theta$ on the product σ-algebra of the former. In case the variables $X_1, X_2, \ldots$ are independent and identically distributed, so that $f_n(x|\theta) = \prod_{i=1}^n f(x_i|\theta)$, and $f(\cdot|\theta)$ is differentiable in $\theta$, then $D(x, n|\theta) = \sum_{i=1}^n \frac{\partial \log f(x_i|\theta)}{\partial \theta}$. Thus $D(X, n|\theta)$ is a sum of $n$ independent identically distributed random variables, where
$f_n(\cdot|\theta)$ is always taken to be a Borel function. If $E_\theta\big[\big(\frac{\partial \log f(X_1|\theta)}{\partial\theta}\big)^2\big]$ is positive and finite on $(\Omega, \Sigma, P_\theta)$ — which is therefore the same for all $i$ — and under the further conditions that differentiation under the expectation is permissible, as well as $E_\theta(N) < \infty$, we get by Proposition 1 the following:

$$E_\theta(D(X, N|\theta)) = E_\theta(N)\, E_\theta\Big(\frac{\partial \log f(X_1|\theta)}{\partial\theta}\Big) = E_\theta(N)\cdot 0 = 0, \qquad (18)$$

and (cf. (2))

$$E_\theta(D(X, N|\theta)^2) = E_\theta(N)\cdot E_\theta\Big[\Big(\frac{\partial \log f(X_1|\theta)}{\partial\theta}\Big)^2\Big]. \qquad (19)$$
Then the same reasoning as in Theorem 2.5 carries over verbatim, and establishes the following result (without assuming independence of the $X_i$):

3. Theorem. Let $W(\cdot)$ be a symmetric convex loss function ($\not\equiv 0$) such that $W^{\frac{1}{k}}$ has the same properties for $k \geq 1$. Let $(X_1,\ldots,X_N)$ be random variables observed under a sequential sampling plan, so that $N$ is a random variable. Let $f_n(x_1,\cdots,x_n|\theta)$ be the density, relative to a σ-finite measure, of $(X_1,\ldots,X_n)$ at $N = n$, depending on $\theta \in I$ of $\{(\Omega,\Sigma,P_\theta), \theta \in I \subset \mathbb{R}\}$, and suppose that $h_N(X) = h(X_1,\cdots,X_N)$ is an estimator of its expected value, $g(\theta)$. Then the best lower bound for the corresponding risk function $R_W$ of the loss $W$ is given by:

$$R_W(h_N(X), g(\theta)) \geq \sup_{\xi_1 \neq \xi_2} W\Big(\frac{\int_{I_\theta} g(\theta + t)\, d(\xi_1 - \xi_2)(t)}{E_\theta(|D_{\xi_1,\xi_2}(X, N|\theta)|)}\Big) \times \Big[\frac{E_\theta(|D_{\xi_1,\xi_2}(X, N|\theta)|)}{E_\theta(|D_{\xi_1,\xi_2}(X, N|\theta)|^{k'})^{\frac{1}{k'}}}\Big]^k, \qquad (20)$$
where $k \geq 1$, $k' = \frac{k}{k-1}$, and for $k = 1$ one takes the essential supremum in the last factor. In particular, if $W(t) = t^2$ and the sampling plan consists of independent observations, if the standard regularity conditions hold on $f$, and $0 < E_\theta\big[\big(\frac{\partial \log f(X_1|\theta)}{\partial\theta}\big)^2\big] < \infty$, $E_\theta(N) < \infty$, then (20) becomes:

$$\mathrm{Var}(h(X))_\theta \geq \frac{g'(\theta)^2}{E_\theta(N)\, E_\theta\Big[\Big(\frac{\partial \log f(X_1|\theta)}{\partial\theta}\Big)^2\Big]}. \qquad (21)$$
The last simplification, of (20) to (21), is the same as the one given following Theorem 2.5, after using (19) in the denominator. This was originally obtained by Wolfowitz [1]. It should be noted that even if $W(t) = |t|^p$, $p > 2$, a similar analysis is possible only after Proposition 1(b) is extended to other moments. In any case, (20) is the most inclusive result, even without the regularity conditions, and is essentially the same in all procedures. If $P[N = n_0] = 1$, then $E_\theta(N) = n_0$ and (21) becomes the classical Cramér–Rao bound. If $W(t) = t^2$ and $f(\cdot|\theta)$ is taken as an exponential distribution, then one can discuss efficient estimators $h_N(X)$ (of its expectation), in the sense that there is equality in (21). (The regularity conditions are almost automatic in this case.) Some results on efficient estimation will be included in the complements section below. Instead of considering a single parameter and/or first order derivatives of $f_n(\cdot|\theta)$, one can also study multiple parameters and higher order derivatives. Such studies are available in the literature, both for fixed and for sequential sampling procedures, and a brief indication will appear in the exercises.

Born out of important practical applications — to terminate sampling when a decision can be reached — Sequential Analysis has highlighted the random stopping of a process. This in turn has greatly assisted (if not originated) the optional stopping theory of stochastic processes, especially martingales, which is now an important chapter in the subject. [For a view of the work, from Doob's optional stopping theory of martingales on, one may consult Doob [2], Chapter VII, or the companion volume Rao [21], Chapter IV, among others.] We shall omit further discussion, since their use in general stochastic processes has so far been limited, and the preceding analysis seems to apply to a large collection of problems, as seen from the work in the ensuing chapters.

3.6 Complements and exercises

1. Consider the loss function $W$ given by $W(t) = (|t| + t^2)^2$. Let $(X_1,\ldots,X_n)$ be a random sample (i.e., the $X_i$ are independent with a common distribution) from $X$ whose density is normal, $N(\theta, 1)$, so that $f(x|\theta) = (2\pi)^{-\frac12}\exp[-\frac12(x-\theta)^2]$. Verify that the MLE of $\theta$ is $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n X_i$, and that the corresponding risk is $R_W(\hat{\theta},\theta) = \frac{1}{n} + \frac{3}{n^2} + 2\big(\frac{16}{2\pi n^3}\big)^{\frac12}$. Find the lower bound given by Theorem 2.5 using $D(x, n|\theta) = \frac{\partial \log f(x_1,\cdots,x_n|\theta)}{\partial\theta}$, and conclude that there is strict inequality for all $n \geq 1$, although if $W(t) = t^2$ the Cramér–Rao lower bound is attained. (Thus the efficiency concept very much depends on the type of loss function used in finding the risk of an estimator; a numerical check of the risk formula is sketched below.)
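The following hedged Monte Carlo sketch (an editorial illustration; it uses $\hat{\theta} - \theta \sim N(0, 1/n)$, the exact distribution under the stated model, plus $E|Z|^3 = 2\sqrt{2/\pi}$ for $Z \sim N(0,1)$) checks the risk formula of Exercise 1:

```python
import numpy as np

# W(t) = (|t| + t^2)^2 = t^2 + 2|t|^3 + t^4, so with t = theta_hat - theta ~
# N(0, 1/n) the risk should be 1/n + 3/n^2 + 2*sqrt(16/(2*pi*n^3)).
rng = np.random.default_rng(4)
theta, reps = 0.0, 1_000_000
for n in (5, 20, 100):
    t = rng.normal(theta, 1.0 / np.sqrt(n), reps) - theta
    mc = np.mean((np.abs(t) + t ** 2) ** 2)
    formula = 1 / n + 3 / n ** 2 + 2 * np.sqrt(16 / (2 * np.pi * n ** 3))
    print(f"n={n:<4} MC={mc:.6f} formula={formula:.6f}")
```

The agreement follows from the standard normal absolute moments; only the sampling error separates the two columns.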
2. In contrast to the preceding, there is more flexibility in attaining the bounds for Bayes risk based on convex loss functions, as illustrated here. Indeed, let $X = (X_1,\ldots,X_n)$ be a vector whose distribution is given by $dF(x_1,\cdots,x_n|\theta) = f(x_1,\cdots,x_n|\theta)\, d\mu(x_1,\cdots,x_n)$, $\theta \in \mathbb{R}$. Suppose it admits a real valued sufficient statistic $T$ with density $p(\cdot|\theta)$. If $\nu$ is the prior density of $\Theta$, and $\tilde{p}(\cdot|t)$ is the posterior density of $\Theta$ given $T = t$, suppose $\frac{\partial\tilde{p}}{\partial t}$ exists. Verify that if

$$m(\theta|t) = \frac{\partial \log p(t|\theta)}{\partial t} - \frac{\int_{\mathbb{R}} \frac{\partial p(t|\theta)}{\partial t}\,\nu(\theta)\, d\theta}{\int_{\mathbb{R}} p(t|\theta)\nu(\theta)\, d\theta},$$

then $E(m(\Theta, T)|T) = 0$, a.e. In particular, if $f(\cdot|\theta)$ is an exponential density, so that $p(t|\theta) = c(\theta)\exp[\theta t]h(t)$, then $m(\Theta|T) = \Theta - E(\Theta|T)$ gives the best bound, and $\delta(X) = E(\Theta|T(X))$ is a Bayes estimator for the quadratic loss function.

3. If $f(x|\theta)\, d\mu(x) = dF(x|\theta)$ is a distribution of a random (vector) variable $X$ for $\theta \in I$, a nonempty open convex set in $\mathbb{R}^k$, then the usual standard regularity conditions are: (i) the carrier $S_\theta$ of $f(x|\theta)$ is independent of $\theta$, so $S_\theta = S$; (ii) $\frac{\partial f(x|\theta)}{\partial\theta_i}$ exists for $x \in S$, $\theta \in I$, and $i = 1,\ldots,k$; (iii) $\frac{\partial f(x|\theta)}{\partial\theta_i}$ is dominated by a μ-integrable function; and (iv) $D_i(x|\theta) = \frac{\partial \log f(x|\theta)}{\partial\theta_i}$, $i = 1,\ldots,k$, $x \in S$, are linearly independent for all $\theta \in I$. Establish the following statements for a vector estimator $\delta(X) = (\delta_1(X),\ldots,\delta_\ell(X))$ of its expectation, where $D = (D_1,\ldots,D_k)$, $\mathrm{Cov}(X, Y)$ is the variance–covariance matrix of $X, Y$, and

$$V_\theta = \mathrm{Cov}(\delta(X),\delta(X)); \quad \Lambda_\theta = \mathrm{Cov}(D, D); \quad U_\theta = \mathrm{Cov}(\delta, D):$$

(i) $V_\theta \geq U_\theta\Lambda_\theta^{-1}U_\theta'$, where, between symmetric matrices $A, B$, $A \geq B$ means that $A - B$ is a positive (semi-)definite matrix; (ii) if there is equality in (i) (at $\theta = \theta_0$) for $\theta \in I$, so that $\delta(X)$ is a (locally) efficient estimator of $\theta$ with quadratic variation as the risk function (i.e., $R_2(\delta,\theta) = \mathrm{Cov}(\delta,\delta)$), then at most $k+1$ of $\delta_1,\ldots,\delta_\ell$ are linearly independent (with probability one) estimators of their expectations $a_i(\theta)$, $i = 1,\ldots,\ell$; and if $(\delta_1,\ldots,\delta_k)$ is an efficient (non constant)
unbiased estimator of $a_1(\theta),\ldots,a_k(\theta)$, then the latter are linearly independent over $\theta \in I$, with the mapping $\theta \mapsto (a_1(\theta),\ldots,a_k(\theta))$ being one-to-one; (iii) the estimators $(\delta_1,\ldots,\delta_k)$, noted in (ii), form a jointly minimal sufficient set for $\theta$, as well as MLEs of $(a_1,\ldots,a_k)(\theta)$, and moreover $f(\cdot|\theta)$ belongs to the generalized exponential family:

$$f(x|\theta(a)) = \exp\Big[\sum_{j=1}^k \psi_j(a)\delta_j(x) - h(a) + g(x)\Big],$$

where $\theta$ is regarded as a function of $a$, due to the last statement of (ii). [Hint: This result needs some careful analysis; cf. De Groot and Rao [2] for details.] (iv) Suppose in (ii) $\ell = 1$ and $k > 1$, with $D_i(x|\theta) = f(x|\theta)^{-1}\frac{\partial^i f(x|\theta)}{\partial\theta^i}$ (so that $D_1 = \frac{\partial \log f}{\partial\theta}$), where $\theta$ is now one-dimensional, but the likelihood function is $k$-times differentiable in $\theta$. Then the bound of (i) in this case becomes

$$V_\theta \geq \sum_{i,j=1}^k \lambda^{ij}(\theta)u_i(\theta)u_j(\theta), \quad \Lambda_\theta^{-1} = (\lambda^{ij}(\theta)), \ U_\theta = (u_i(\theta)).$$

These are called Bhattacharyya bounds of $k$th order. Diagonalizing $\Lambda_\theta$, it can be seen that the right side series is a sum of squares and thus increases with $k$, the first term being the Cramér–Rao bound. [Regarding this part, for a slightly different version, as well as the multiparameter case, see Rao [1] for a detailed description. Also, the work of Fend [1] indicates that if $\delta(X)$ is efficient in the sense that a bound for $k > 1$ is attained, then $\delta(X)$ need not be an MLE, in contrast to the case $k = 1$.]

4.(a) In the notation of 3(ii) above, $\delta(X)$ is an efficient estimator of $a(\theta)$ for all $\theta \in I \subset \mathbb{R}$ (assuming the regularity conditions and $k$ derivatives) iff there exist constants $(b_1(\theta),\ldots,b_k(\theta))$, $b_k \neq 0$, such that (for the $k$th Bhattacharyya bound)
$$\delta(X) - a(\theta) = \sum_{i=1}^k b_i(\theta)D_i(X|\theta), \quad \theta \in I.$$

When this holds, the density $f(\cdot|\theta)$ satisfies the partial differential equation (with $\mu$ = Lebesgue measure)

$$b_k(\theta)\frac{\partial^k f}{\partial\theta^k} + \cdots + b_1(\theta)\frac{\partial f}{\partial\theta} + (a(\theta) - \delta(x))f = 0,$$
subject to the boundary conditions $E_\theta(\delta(X)) = a(\theta)$, $E_\theta\big(\frac{1}{f}\frac{\partial^i f}{\partial\theta^i}\big) = 0$ ($i = 1,\ldots,k$), and $\int_S f(x|\theta)\, dx = 1$. In case the density $f$ is of a generalized exponential family (as in the preceding Exercise 3(iii)), substituting it in the above PDE, it can be expressed as $f(x|\theta) = \exp[u(x)g(\theta) + v(x)]$, and $\delta(x)$ is a polynomial of degree at most $k$. [Hint: By the regularity conditions $\Lambda_\theta$ is nonsingular, and $a(\theta)$, $b_i(\theta)$ are all continuous in $\theta$. See also Fend [1] for the last comment.]

(b) We present a first application of this to a sequential sampling plan from an exponential distribution, motivated by the above problem. Suppose, therefore, that $f(x|\theta) = c(\theta)\exp[\sum_{i=1}^k \theta_i h_i(x)]$, $\theta = (\theta_1,\ldots,\theta_k) \in I \subset \mathbb{R}^k$, $I$ an open convex set and $c(\theta) > 0$. Let $D_1,\ldots,D_k$ be defined as in 3(i), and suppose we have a sequential sampling plan with $X_1, X_2, \ldots$ independent, with the above common exponential distribution. Let $\delta_{iN}(X) = \sum_{j=1}^N h_i(X_j)$, $i = 1,\ldots,k$, and $D_i(X|\theta) = \frac{\partial \log f_N(X|\theta)}{\partial\theta_i} = \delta_{iN}(X) + N\xi_i$, $\xi_i = \frac{c_i(\theta)}{c(\theta)}$, where $c_i(\theta) = \frac{\partial c(\theta)}{\partial\theta_i}$ and $E_\theta(N) < \infty$. Call the sampling plan linear if there are constants $\alpha_i, \beta, \gamma$, $i = 1,\ldots,k$, not all zero, such that $P[\gamma + \beta N + \sum_{i=1}^k \alpha_i\delta_{iN}(X) = 0] = 1$. With this setup and (a), establish that a parametric function $g(\xi)$ has an unbiased efficient estimator at $\xi^0$ iff there exist constants $a_1,\ldots,a_k$ such that $g(\xi) - g(\xi^0) = E_\xi(N)\sum_{i=1}^k a_i(\xi_i - \xi_i^0)$. [This problem uses (a) above and the result of 3 also. For details see De Groot and Rao [2].]

5.(a) Complete the proof of Theorem 4.5. (b) Consider the following application of (a); a numerical sketch of the root classification involved appears after this exercise. Let $\{X_n, n \geq 1\}$ be a stochastic sequence on $(\Omega,\Sigma,P_\theta)$, $\theta = (\alpha_1,\ldots,\alpha_k) \in I \subset \mathbb{R}^k$, where $I$ is a nonempty bounded open convex set and
$$X_n = \sum_{i=1}^k \alpha_i X_{n-i} + u_n, \qquad (*)$$

such that the $k$ roots of the characteristic equation

$$t^k - \alpha_1 t^{k-1} - \cdots - \alpha_k = 0 \qquad (+)$$

are simple, and the $u_n$ are independent $N(0,1)$ for $n \geq 0$, with $u_n = 0$ a.e. for $n < 0$. Show that the MLEs of the $\alpha_i$, $i = 1,\ldots,k$, exist, are unique, and are consistent (as well as efficient in the wide sense), by (a), if all the characteristic roots of (+) are either strictly inside the unit circle or strictly outside the unit circle. However, Condition 3 of (a) fails if some root is on the unit circle, or if some are inside and some outside the unit circle. [The details involve several computations in verifying the hypothesis of Theorem 4.5, and they may be found in Rao [4(b)]. The case of all roots inside the unit circle was treated by Mann and Wald [1], and the case where all the roots are outside the circle was treated by Anderson [2]. Using the least squares method in lieu of the ML method, these authors needed only a moment condition, but not the normality of the $u_n$'s. We shall return to this problem again later.]
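The sketch promised in Exercise 5(b) (an editorial illustration; the coefficient vectors are assumptions chosen to exhibit the three regimes) classifies the roots of (+) numerically:

```python
import numpy as np

def char_roots(alphas):
    # Roots of t^k - a_1 t^{k-1} - ... - a_k = 0, via numpy's polynomial solver.
    return np.roots([1.0] + [-a for a in alphas])

for alphas in ([0.5, 0.3],      # both roots inside the unit circle (stable)
               [2.5, -1.5],     # roots 1.5 and 1.0: one on the unit circle
               [3.0, -2.25]):   # double root 1.5: outside, but not simple
    r = char_roots(alphas)
    print(alphas, "|roots| =", np.round(np.abs(r), 4))
```

The second and third cases illustrate, respectively, a root on the unit circle (where Condition 3 fails) and a violation of the simple-root hypothesis of (+).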
6.(a) Complete the proof of Proposition 5.1(b). (b) Using an extension of Blackwell's method, Wald's equation of sequential analysis can be generalized to some uncorrelated variables. Let $X_1, X_2, \ldots$ be a sequence of uncorrelated random variables with a common mean $\mu$ and uniformly bounded variances. Under a sequential sampling plan, if $S_N = X_1 + \cdots + X_N$ is observed, where $E(N^2) < \infty$, show that Wald's equation $E(S_N) = \mu E(N)$ holds for this [not necessarily independent] sequence. (This is an analog of Proposition 5.1 in which independence is deleted but the hypothesis on moments is strengthened slightly. However, in the classical case the existence of one moment of $X_1$ already implies that $E(N^k) < \infty$ for all $k$, by a theorem of Stein [1].) [Hints: To use Blackwell's method, let $S_{N_i}$, $i = 1,\ldots$, be independent and identically distributed as $S_N$, where the $N_i$ are also independent copies of $N$. Consider again

$$\frac{1}{k}\sum_{i=1}^k S_{N_i} = \frac{\sum_{i=1}^{N_1+\cdots+N_k} X_i}{\sum_{i=1}^k N_i} \cdot \frac{1}{k}\sum_{i=1}^k N_i.$$

Now $\frac{1}{k}\sum_{i=1}^k N_i \to E(N)$, a.e. and in mean, by the SLLN as before. Since $E(X_i^2) \leq M < \infty$, we can apply Rajchman's SLLN to conclude that $\frac{1}{n}\sum_{i=1}^n X_i \to E(X_1) = \mu$ a.e. and in $L^2(P)$ (cf. Rao [15], Thm. 2.3.4 on page 59). Hence, as in the original proof, every infinite subsequence of this sequence, and then the first factor on the right, tends to $\mu$ as $k \to \infty$, a.e. and in $L^2(P)$. Since also $\frac{1}{k}\sum_{i=1}^k N_i^2 \to a^2 = E(N^2) < \infty$, a.e. and in mean, by the Rajchman SLLN one concludes that $\frac{1}{k}\sum_{i=1}^k N_i \to E(N)$ in $L^2(P)$. Thus the right side tends to $\mu E(N)$ a.e. and in $L^2(P)$. This implies that the left side tends to $\mu E(N)$ a.e. and in $L^2(P)$, and therefore it is uniformly integrable as well. Then integrating term by term (and interchanging the limit and expectation) is valid by Vitali's theorem, so the left side becomes $E(S_N)$ identically, whence the desired equation $E(S_N) = E(X_1)E(N)$ holds.]

7. A more general (but somewhat difficult, as regards verifying the conditions in applications) form of Wald's equation is due to Wolfowitz [1]. Let $X_1, X_2, \ldots$ be a sequence of integrable random variables observed under a sequential sampling plan such that the conditional means $\nu_i = E(X_i|N \geq i)$ satisfy $P(N \geq i) > 0$, where $N$ is the sample size. If $\nu_i' = E(|X_i - \nu_i|\,|N \geq i)$, and $\sum_{i=1}^\infty (\nu_1' + \cdots + \nu_i')P(N = i) < \infty$, then one can conclude that $E(S_N - \sum_{i=1}^N \nu_i) = 0$. [Hint: A straightforward calculation is used, with the last condition for the justification
of a rearrangement of the double series occurring in the simplification, and no other restriction on the dependence of the Xi is required.]
Bibliographical notes

The classical inference theory formulates problems mainly in terms of hypothesis testing and (parameter) estimation. The analysis proceeds in depth and detail on the subject at hand, although an overview as well as a unification can be provided within the decision theoretic setup. Since the preceding chapter is devoted to the testing aspect, the present one is concerned mainly with estimation problems, mostly concentrating on convex loss functions. Section 1 contains the basics of the loss functions that are employed hereafter. Unbiasedness is considered as a mathematical constraint used for a simplification and study of various properties of estimators. Then Section 2 is devoted to a detailed analysis centering around the existence, unicity and calculation of Bayes type estimators, and their lower bounds for the related risk functions, all of which are based on convex (and more general) loss functions, for single parameter cases. Most of the work here is taken from De Groot and Rao [1], with some improvements. The close relationship between Bayes type estimation and nonlinear prediction is noted, and an important aspect is given, bringing in also some of Sherman's work [1] in this context for processes that have features analogous to a Gaussian class. Then the unbiasedness constraint is discussed in more detail in Section 3, where the point estimation problems, including the maximum likelihood (principles and) methods, primarily originating from Fisher [1], are treated for the single as well as some multiparameter cases. The work here also exemplifies how the methods of nonlinear analysis have to be employed in estimation theory. A construction of best estimators for convex loss functions demands a greater use of abstract analysis, especially some ideas and results of Orlicz spaces. These are given here perhaps for the first time in a book on inference theory. The work is influenced by the papers of Barankin [1] and Linnik and Rukhin [1], as well as a general treatment by the author (Rao [1] and [5]). The latter uses Orlicz space analysis (with "gauge norms"), as Barankin's is for the $L^p(P)$-spaces. A slightly different approach, with the Orlicz norm and Young's (in)equality results, leading to Fenchel–Orlicz type analysis, was used by Kozek [1]. We follow the earlier direct approach, as it is somewhat simpler. It should be noted that the classical Cramér–Rao and Rao–Blackwell theorems are an important motivation for much of the study with convex loss functions, although the methods used have to be considerably different,
as one finds it easier to employ some (relatively simple) results from Orlicz spaces. These spaces are particularly useful and interesting in estimation, as well as in other parts of stochastic analysis, and we have tried to familiarize the reader with them. Asymptotic properties of the MLE are very important for stochastic processes, since the latter involve, as a rule, infinitely many random variables. An important property is the consistency of estimator sequences, first rigorously discussed in Cramér's classic [1] for independent random variables, and extended by Wald [2] to a (weakly) dependent class. These results have been generalized to a more inclusive class of (even "explosive") processes by the author (Rao [4(b)]), and Theorem 4.2, as well as the work following it, is taken from the latter. Some results on prediction sequences, especially Theorem 4.6, are a slightly more detailed version of the work appearing in De Groot and Rao [2]. A brief account of sequential estimation, following Wald [4], and lower bounds containing Wolfowitz's [1] work for convex loss functions, is also included here. The complements section has some additional results of interest to the main theme. For instance, an analysis of the regularity conditions in the Cramér–Rao inequality for multiple parameters was adapted from De Groot and Rao [2], and the single parameter case, with a slight weakening of the conditions, was later discussed in Wijsman [1] and Joshi [1], showing that under somewhat weaker conditions the classical lower bound may be reached by certain nonexponential families. Under such conditions, some results, including their sequential extensions, given as Exercises 6.3 and 6.4, are found in De Groot and Rao [2], and the multiparameter MLE problem, outlined as Exercise 6.5, is in Rao [4(b)]. The interesting method for Wald's equation by Blackwell [1] admits an extension to uncorrelated variables. This is detailed in Exercise 6.6(b) because of its other possibilities. It does not seem to have appeared in print before. Extending Wolfowitz's [1] work, Seth [1] has considered sequential estimation for Bhattacharyya's bounds, and a readable account can be found there. Also, series type lower bounds for other loss functions and for $k$ parameters are detailed in the author's paper (Rao [1]). Starting with the next chapter, applications and possible extensions of this and the preceding chapters will be made to broad classes of stochastic processes.
Chapter IV Inference for Classes of Processes
This chapter is devoted to specific problems of inference for both continuous and discrete indexed processes. Hypothesis testing, estimation, and certain (unbiased) weighted prediction problems, together with some calculations of likelihood ratios for processes, are detailed. In the discrete indexed cases, an analysis of the asymptotic properties of estimators for some classes is also given. Principles outlined in the preceding chapters on classical (finite sample) cases are utilized and improved upon for the types of processes considered. The sequential testing aspect is included, with an extended treatment, as it motivates solving several new and important questions using stopping times, both in probability and in inference theories. Processes defined by difference equations, and estimation problems for their parameters, are also treated. These indicate the depth of, and give a feeling for, the general theory. This chapter contains an essential and important part of the analysis studied in the present work. Various aspects of this study will be analyzed in greater detail for many specialized problems in the ensuing chapters, as they play key roles.
4.1 Testing methods for second order processes

Second order processes, for which an experimenter can assume the existence of mean and covariance functions, form a large class. However, except for Gaussian processes, one cannot generally determine a class from its mean and covariance functions alone. Consequently we interpret the desired property broadly, and postulate conditions on the first two moments of the statistics (i.e., certain functions of the observations) of interest in inference theory. Moreover, for most of the following work, only simple hypotheses (and some immediate extensions) are considered,
since the likelihood function, necessary for the Neyman–Pearson–Grenander (NPG) theorem, has to be defined relative to an infinite collection of random variables, in sharp contrast to the classical statistical inference problems, where test functions are based on finite sets of observations. In the case of processes, the (overall) likelihood function has to be approximated (in a precise sense) by finite dimensional ones, and this is a nontrivial problem. It was originally solved by Grenander [1], and that fundamental result will be established here in different forms. Then some applications and related results will be presented.

Thus, if $(\Omega, \Sigma; P, Q)$ is the basic probability model for the experiment, where $P$ and $Q$ are the hypothesis and the alternative measures on $\Sigma$, let $\{X_1, X_2, \ldots\}$ be the observed sequence, i.e., $X_n : \Omega \to \mathbb{R}$, and the process (each $X_n$ being measurable for $\Sigma$) is governed by $P$ under $H_0$ and by $Q$ under $H_1$. Let $\mathcal{F}_n = \sigma(X_1,\cdots,X_n)$ be the σ-algebra determined by the first $n$ observables, so that $\mathcal{F}_n \subset \mathcal{F}_{n+1} \subset \Sigma$. The measures $P$ and $Q$ are assumed distinguishable in the sense of Section I.3. [This will be automatic from the NPG critical region, except perhaps in the trivial case that $P = Q$.] Consider the measurable set given by Theorem II.1.1: let $Q = Q^c + Q^s$ be the Lebesgue decomposition of $Q$ relative to $P$, so that $Q^c \ll P$ and $Q^s \perp P$, in that there is a $B_0 \in \Sigma$, $P(B_0) = 0$, such that $Q^c$ is supported in $B_0^c$ ($= \Omega - B_0$), while $Q^s$ is supported by $B_0$, the $P$-singular set. If $f(\omega) = \frac{dQ^c}{dP}(\omega)$, the critical region is then found to be:

$$A_0^k = \{\omega : f(\omega) \geq k\} \cup B_0 \in \Sigma, \qquad (1)$$

where $k > 0$ is chosen to satisfy $P(A_0^k) \leq \alpha$, the size of the region. The basic problem now is to approximate $f$, or $A_0^k$, from the observed segment $(X_1,\cdots,X_n)$ for $n$ large enough. We now present a solution approximating $f$, thereby giving a method of calculating the Radon–Nikodým derivative of $Q$ relative to $P$ in some important cases. This will be illustrated later.

1. Theorem. Let $(\Omega, \Sigma; P, Q)$ be the probability model for the hypothesis and alternative. If $\{X_1, X_2, \ldots\}$ is a sequence of observables and $\mathcal{F}_n = \sigma(X_1,\cdots,X_n)$, let $P_n = P|\mathcal{F}_n$, $Q_n = Q|\mathcal{F}_n$, $f_n = \frac{dQ_n^c}{dP_n}$, the Radon–Nikodým derivative of the absolutely continuous part of $Q_n$ relative to $P_n$, and $f_\infty = \frac{dQ_\infty^c}{dP_\infty}$, where $P_\infty = P|\mathcal{F}_\infty$ and $Q_\infty = Q|\mathcal{F}_\infty$, with $\mathcal{F}_\infty = \sigma(\cup_{n=1}^\infty \mathcal{F}_n)$. Then $f_n(\omega) \to f_\infty(\omega)$ for a.a. $\omega \in B_0^c$ (relative to both the $P$ and $Q^c$ measures), and $f_n(\omega) \to \infty$ for a.a. $\omega \in B_0$ (relative to both the $P$ and $Q^s$ measures).

In particular, if $\mathcal{F}_\infty = \Sigma$, then $f_\infty(\omega) = \frac{dQ^c}{dP}(\omega)$ for a.a. $\omega \in B_0^c$ (and $= \infty$ for a.a. $\omega \in B_0$). Hence, if $A_n^k$ is the set of (1) with $f_n$, then $A_n^k \to A_0^k$ as $n \to \infty$, in the sense that $\chi_{A_n^k} \to \chi_{A_0^k}$ pointwise a.e.
Proof. It should be noted at the outset that $f_n$ is $\mathcal{F}_n$-measurable; hence $f_n(\omega) = f_n(X_1,\cdots,X_n)(\omega)$, i.e., a function of $X_1,\ldots,X_n$ only. Similarly, $f_\infty$ is a function of the sequence $\{X_n, n \geq 1\}$, since it is $\mathcal{F}_\infty$-adapted (or measurable). The proof is essentially the same as that of a theorem of Andersen and Jessen [1], available in many books, including the author's (cf. Rao [21], p. 142), but the self-contained details will be presented here, with slight modifications, to appreciate the type of reasoning. [An alternative argument using martingale theory is possible, as indicated later, and it is Grenander's original proof.]

Let $f_* = \liminf_n f_n$, $f^* = \limsup_n f_n$. Then $f_* \leq f^*$ a.e. $[P_\infty]$, and both limits are $\mathcal{F}_\infty$-measurable. The result that $f_n \to f_\infty = f_* = f^*$ a.e. will follow if the $\mathcal{F}_\infty$-set $\{\omega : f_*(\omega) < f^*(\omega)\}$ is shown to be $P_\infty$-null. Here are the details. Since

$$\{\omega : f_*(\omega) < f^*(\omega)\} = \bigcup_{r_1, r_2 \ \text{rationals}} \{\omega : f_*(\omega) \leq r_1 < r_2 \leq f^*(\omega)\} = \bigcup N_{r_1,r_2} \ \text{(say)}$$

is a countable union, it suffices to show that each of the sets $N_{r_1,r_2}$ ($\in \mathcal{F}_\infty$) is $P$-null. Let $K_{r_1} = \{\omega : f_*(\omega) \leq r_1\}$, $L_{r_2} = \{\omega : f^*(\omega) \geq r_2\}$. The following inequality is basic for establishing the desired assertion:

(*) For all $A \in \mathcal{F}_\infty$ and any real numbers $r < s$, we have $Q(K_r \cap A) \leq rP(K_r \cap A)$; $Q(L_s \cap A) \geq sP(L_s \cap A)$.

Indeed, let $K_n = \{\omega : \inf_{m \geq 1} f_{n+m} < r_n\}$, $r_n \downarrow r$. Then $K_n \in \mathcal{F}_\infty$. It can be expressed as a disjoint union: $K_n = \cup_{m \geq 1} K_{nm}$, where

$$K_{nm} = \{\omega : f_{n+m}(\omega) \text{ falls below } r_n \text{ for the first time}\} = \{\omega : f_{n+j}(\omega) \geq r_n, 1 \leq j \leq m-1,\ f_{n+m}(\omega) < r_n\} \in \mathcal{F}_{n+m}. \qquad (2)$$

Since $K_1 \supset K_2 \supset \cdots$ and $r_n \downarrow r$, we have $K_r = \cap_{n \geq 1} K_n$, as well as $K_{nm} \subset \{\omega : f_{n+m}(\omega) < r_n\}$. Let $A \in \cup_{n \geq 1}\mathcal{F}_n$, whence for some $n_0$, $A \in \mathcal{F}_{n_0}$, and for all $m \geq 1$, $n \geq n_0$, with (2), one has $K_{nm} \cap A \in \mathcal{F}_{n+m}$. Consequently

$$Q(K_n \cap A) = Q\big[\bigcup_{m \geq 1} K_{nm} \cap A\big] = \sum_{m=1}^\infty Q(K_{nm} \cap A), \ \text{since the } K_{nm} \text{ are disjoint},$$
$$= \sum_{m=1}^\infty Q_{n+m}(K_{nm} \cap A), \ \text{since } Q|\mathcal{F}_{n+m} = Q_{n+m},$$
$$= \sum_{m=1}^\infty Q_{n+m}^c(K_{nm} \cap A), \ \text{since } Q_n = Q_n^c \text{ on } K_n,$$
$$= \sum_{m=1}^\infty \int_{K_{nm} \cap A} f_{n+m}\, dP_{n+m}, \ \text{since } P_{n+m} = P|\mathcal{F}_{n+m},$$
$$\leq \sum_{m=1}^\infty r_n P(K_{nm} \cap A) = r_n P(K_n \cap A). \qquad (3)$$
Letting $n \to \infty$, and noting that $K_n \downarrow K_r$, we get

$$Q(K_r \cap A) = \lim_n Q(K_n \cap A) \leq \lim_n r_n P(K_n \cap A), \ \text{by (3)}, = rP(K_r \cap A). \qquad (4)$$
The first inequality of (*) is thus true for all $A \in \cup_{n \geq 1}\mathcal{F}_n$. If now $\nu(\cdot)$ is defined on this union, an algebra, as:

$$\nu(A) = rP(K_r \cap A) - Q(K_r \cap A), \qquad (5)$$
which by (4) is a nonnegative σ-additive function, it has a unique σ-additive extension (by the classical Hahn extension theorem) onto $\sigma(\cup_{n \geq 1}\mathcal{F}_n) = \mathcal{F}_\infty$. So (5) holds on $\mathcal{F}_\infty$. Consequently the first inequality of (*) is proved. The second one is established by a similar argument. The desired pointwise limit is obtained from (*) as follows. Since $N_{r_1,r_2} = K_{r_1} \cap L_{r_2}$, take $A = N_{r_1,r_2}$ itself in (5), with $r = r_1 < r_2 = s$. Then we get from (*)

$$r_1 P(N_{r_1,r_2}) \geq Q(N_{r_1,r_2}) \geq r_2 P(N_{r_1,r_2}). \qquad (6)$$
This holds iff $P(N_{r_1,r_2}) = 0$, and hence $f_* = f^*$ a.e. $[P_\infty]$, as desired.

It remains to show that $f_\infty = \frac{dQ_\infty^c}{dP_\infty}$, a.e. $[P]$. By definition, $f_n = \frac{dQ_n^c}{dP_n}$, and it is shown above that $f_n \to f_\infty$ a.e. So by Fatou's lemma

$$\int_\Omega f_\infty\, dP \leq \liminf_n \int_\Omega f_n\, dP = \liminf_n Q_n^c(\Omega) \leq 1.$$
Hence $f_\infty$ is finite a.e. Let $N_1 = \{\omega : f_n(\omega) \not\to f_\infty(\omega)\}$ and $N_2 = \{\omega : f_\infty(\omega) = +\infty\}$. Then $P(N_i) = 0$, $i = 1, 2$; set $\Omega_0 = \Omega - (N_1 \cup N_2) \in \mathcal{F}_\infty$, so that for all $\omega \in \Omega_0$, $f_\infty(\omega) < \infty$. Let $\varepsilon > 0$ be given, and consider the elementary function

$$f_\varepsilon = \sum_{n=-\infty}^\infty n\varepsilon\, \chi_{D_{n\varepsilon}}, \quad D_{n\varepsilon} = \{\omega : n\varepsilon \leq f_\infty(\omega) < (n+1)\varepsilon\} \in \Sigma.$$
Then $0 \leq (f_\infty - f_\varepsilon)(\omega) \to 0$ as $\varepsilon \downarrow 0$ (uniformly in $\omega \in \Omega_0$). Thus

$$f_\varepsilon(\omega) \leq f_\infty(\omega) \leq f_\varepsilon(\omega) + \varepsilon, \quad \omega \in \Omega_0. \qquad (7)$$
Take $r = (n+1)\varepsilon$, $s = n\varepsilon$, $K_n = \{\omega : f_\infty(\omega) \leq (n+1)\varepsilon\}$, $L_n = \{\omega : f_\infty(\omega) \geq n\varepsilon\}$ in (*), so that $D_{n\varepsilon} \subset K_n \cap L_n \cap \Omega_0$, and

$$n\varepsilon P[D_{n\varepsilon} \cap A \cap \Omega_0] \leq Q[D_{n\varepsilon} \cap A \cap \Omega_0] \leq (n+1)\varepsilon P[D_{n\varepsilon} \cap A \cap \Omega_0]. \qquad (8)$$

Summing over $n$ and noting that the $D_{n\varepsilon}$ are disjoint, we get

$$\int_{\Omega_0 \cap A} f_\varepsilon\, dP_\infty \leq Q(\Omega_0 \cap A) \leq \varepsilon P(\Omega_0) + \int_{\Omega_0 \cap A} f_\varepsilon\, dP_\infty, \quad A \in \mathcal{F}_\infty. \qquad (9)$$
Hence (7) and (9) yield

$$Q(\Omega_0 \cap A) - \varepsilon \leq \int_{\Omega_0 \cap A} f_\infty\, dP_\infty \leq Q(\Omega_0 \cap A) + \varepsilon. \qquad (10)$$
Letting $\varepsilon \downarrow 0$, and observing that $Q_\infty^c(\cdot) = Q(\Omega_0 \cap \cdot)$, (10) reduces to

$$\int_A f_\infty\, dP_\infty = \int_{\Omega_0 \cap A} f_\infty\, dP_\infty = Q_\infty^c(A), \quad A \in \mathcal{F}_\infty. \qquad (11)$$
It follows that $N = N_1 \cup N_2$ is the singular set of $Q_\infty$ relative to $P_\infty$, and then $f_\infty = \frac{dQ_\infty^c}{dP_\infty}$, a.e. $[P]$. In case $\Sigma = \mathcal{F}_\infty$, we get $P_\infty = P$, $Q_\infty = Q$, and if also $Q \ll P$, then $Q(N) = 0$ as well. In the general case, that $Q$ is not necessarily $P$-continuous, one can describe the behavior of $f_\infty$ on the singular set $N$ as follows. This set satisfies $P(N) = 0$ but $Q^s(N) > 0$, where $Q = Q^c + Q^s$, $Q^s \perp P$ (by the Lebesgue decomposition relative to $P$), and $N$ qualifies as the $B_0$ of the statement of the theorem. But then $B_0 = \cup N_{r_1,r_2}$, and on $B_0$ one has $0 \leq f_* \leq r$ (rationals $r$); similarly $f^* \geq s$, for all rationals $s$. So $f_* = 0$ and $f^* = +\infty$ on $B_0$. Thus the limit $f_\infty$ can be defined as:

$$\lim_n f_n(\omega) = \begin{cases} f_\infty(\omega), & \text{for } \omega \in B_0^c \\ +\infty, & \text{for } \omega \in B_0. \end{cases} \qquad (12)$$
Alternatively this may be stated as follows (since $Q^s(B_0^c) = 0$ and $P(B_0^c) > 0$, plus $f_\infty(\omega) = 0$ for all $\omega \in C \subset B_0^c$ with $Q^c(C) = 0$): $f_n(\omega) \to f_\infty(\omega)$, $\omega \in B_0^c$, relative to both measures $P$ and $Q^c$, while $f_n(\omega) \to +\infty$ for a.a. $\omega \in B_0$ relative to the $Q^s$-measure. From this it follows that $\chi_{A_n^k} \to \chi_{A_0^k}$ pointwise a.e. $[P]$. This is just the last assertion.
There is another form of the above result that may be better suited for applications. For this, let $\mu = P + Q$, which is the smallest dominating measure on $\Sigma$ for both $P$ and $Q$. Let $\mathcal{F}_n = \sigma(X_1,\cdots,X_n)$ be as before, and let $P_n$, $Q_n$ and $\mu_n = \mu|\mathcal{F}_n$ be the restrictions, so that $g_n = \frac{dQ_n}{d\mu_n}$ and $h_n = \frac{dP_n}{d\mu_n}$ exist, since $Q_n \ll \mu_n$ and $P_n \ll \mu_n$. Then, by the preceding theorem, $g_n \to g_\infty$ and $h_n \to h_\infty$ a.e. $[\mu_\infty]$, and they are measurable for $\mathcal{F}_\infty$; whence $\frac{g_n}{h_n} \to \frac{g_\infty}{h_\infty}$ a.e. $[\mu_\infty]$ on the set $A = [h_\infty > 0]$, with $\mu(A) > 0$, and the ratio is a.e. finite on $\Omega$, since all $\mu$-null sets are also $P$- and $Q$-null. However, using the calculus of Radon–Nikodým derivatives, one can verify that $\frac{g_n}{h_n} = \frac{dQ_n^c}{dP_n} + \frac{dQ_n^s}{dP_n} = f_n + 0 \to f_\infty = \frac{g_\infty}{h_\infty}$ a.e. $[\mu]$, and also relative to the measures $P$, $Q^c$. The same result holds if $\mu$ is replaced by any other σ-finite measure $\nu$ such that $P, Q \ll \nu$ [or even a sequence $\nu_n$ dominating $P_n, Q_n$ and satisfying $\nu_n = \nu_{n+1}|\mathcal{F}_n$], where, if $\nu : \cup_n \mathcal{F}_n \to \mathbb{R}^+$, then $\nu_n = \nu|\mathcal{F}_n$ and it is extendible to a measure (not merely an additive function; see the technical remark below). It turns out that in all cases the sequence $\{f_n, \mathcal{F}_n, n \geq 1\}$ is a "positive supermartingale", and so it converges a.e., by the theorem. Thus one has:

2. Corollary. Under the hypothesis of the theorem, let $g_n$, $h_n$ be the densities of $Q_n$, $P_n$, respectively, relative to a dominating σ-finite measure $\mu_n$ ($= \mu|\mathcal{F}_n$). Then the likelihood ratios $\frac{g_n}{h_n} \to f_\infty = \frac{dQ_\infty^c}{dP_\infty}$ a.e., where $f_\infty$ has the properties described in the theorem.

This is essentially the form proven in Grenander ([1], [2]). In some cases one may take $\mu_n$ to be the Lebesgue measure, so that the $g_n$, $h_n$ are the usual densities. However, an additional argument is then needed, since the Lebesgue measure on $\mathbb{R}^n$ does not necessarily extend to a measure on the infinite dimensional space $(\Omega, \Sigma)$ in its canonical representation, which is being used. The associated technical problem will now be discussed, to clarify the situation.

3. A technical remark. In many applications, it is useful to consider the function space representation of the model $(\Omega, \Sigma, P)$ for an observable process $\{X_n, n \geq 1\}$ (as in Theorem I.1.1). This means $\Omega = \mathbb{R}^{\mathbb{Z}^+}$ and $\Sigma$ = the σ-algebra generated by sets of the form $\{\omega \in \Omega : (\omega(n_1),\cdots,\omega(n_k)) \in B \subset \mathbb{R}^k\}$, where $B$ is a Borel set. These are cylinder sets with finite dimensional bases $B$. Then the distribution functions of the random variables $X_n(\omega) = \omega(n)$, which are now the coordinate functions, determine the probability measure $P$ and satisfy:

$$F_{n_1,\cdots,n_k}(x_1,\cdots,x_k) = P[\omega : X_{n_1}(\omega) < x_1, \cdots, X_{n_k}(\omega) < x_k].$$

These distributions $\{F_{n_1,\cdots,n_k}, n_i \in \mathbb{Z}^+\}$ form a consistent family and, by Kolmogorov's Theorem I.1.1, this is equivalent to having such a prob-
ability model. In fact, for $k \geq 1$,

$$\tilde{P}_k(A) = P \circ (X_{n_1},\cdots,X_{n_k})^{-1}(A) = \int \cdots \int_A dF_{n_1,\cdots,n_k}(x_1,\cdots,x_k) \qquad (13)$$
define the image measures on the range spaces ($\mathbb{R}^k$ here). If $\mathcal{B}_n$ denotes the Borel σ-algebra of $\mathbb{R}^n$, and $\mathcal{F}_n = \pi_n^{-1}(\mathcal{B}_n)$, where $\pi_n : \mathbb{R}^{\mathbb{Z}^+} \to \mathbb{R}^n$ is the coordinate projection, then $\mathcal{F}_n = \sigma(X_1,\cdots,X_n)$, in the notation of the preceding proof. Suppose now an experimenter is working with the above basic model $(\Omega, \Sigma, P)$ as $H_0$, and has an alternative hypothesis $H_1$, which states that $G_n : \mathbb{R}^n \to \mathbb{R}$ is an $n$-dimensional distribution, such that the family is consistent (i.e., the $(n-1)$th marginal of $G_n$ is $G_{n-1}$, etc.). Next consider $\tilde{P}_n$ and $G_n$ for each $n$, obtaining the likelihood function $f_n = \frac{dG_n^c}{d\tilde{P}_n}$, and defining the critical region $A_n^k$ with this $f_n$. Can the hypothesis testing be done for these two families, assuming distinguishability? A positive answer is provided by the following (nontrivial) technical reduction to the preceding case.

Since $\pi_n : \mathbb{R}^{\mathbb{Z}^+} \to \mathbb{R}^n$ is onto, $\pi_n^{-1}$ is a one-to-one set mapping, so that $Q_n$ on $\mathcal{F}_n = \pi_n^{-1}(\mathcal{B}_n)$ can be uniquely defined by the equation $Q_n(\pi_n^{-1}(B)) = \int_B dG_n$, $B \in \mathcal{B}_n$, and the $Q_n$ are probability measures satisfying $Q_n = Q_{n+1}|\mathcal{F}_n$, $n \geq 1$. But, by definition, $\mathcal{F}_n \subset \mathcal{F}_{n+1}$, so that on the algebra $\mathcal{F}_0 = \cup_{n \geq 1}\mathcal{F}_n$ one can uniquely define an additive function $Q$ such that $Q|\mathcal{F}_n = Q_n$. It will be σ-additive in most applications, but not always. [A good sufficient condition for σ-additivity of $Q$ is that each $Q_n$ can be approximated from below by a "compact class" in $\mathcal{F}_n$. Details of this condition and several related results can be found in the book (Rao [12], Chapter III).] However, in our case we can directly solve the convergence problem as follows. The finitely additive $Q$ can be uniquely decomposed into $Q = Q^a + Q^p$, by the Yosida–Hewitt theorem, where $Q^a$ is σ-additive and $Q^p$ is purely finitely additive on $\mathcal{F}_0$ (cf., e.g., Rao [17], p. 182). Now apply Theorem 1 to $P_\infty$ and $Q_\infty^a$ on $\mathcal{F}_0$. Then one obtains $f_n = \frac{dQ_n^c}{dP_n} \to f_\infty = \frac{dQ_\infty^{ac}}{dP_\infty}$ a.e. A proof of this statement, even for more general measures, can be found in the book just cited, Theorem 21, p. 305, and will be omitted. One may observe that $f_n \to f_\infty$ a.e., and, if also $\{f_n, n \geq 1\}$ is uniformly integrable (for which $\int_\Omega f_n^r\, dP \leq K < \infty$, $r > 1$, $n \geq 1$, is sufficient), then the last convergence is in $L^1$-mean as well, and $Q$ is consequently σ-additive. This will be better appreciated when the problem is approached from the point of view of martingale convergence theory. [The relevant convergence theorems will be recalled later.] This conclusion is of interest here, since it strengthens the preceding theorem.
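The dichotomy in Theorem 1 — $f_n$ tending to a finite limit off $B_0$ and diverging on it — can be visualized in the simplest product-measure setting. The sketch below (an editorial illustration; the Gaussian product measures and the mean sequences are assumptions, and the $\sum m_n^2$ criterion is the classical Kakutani dichotomy for such products) tracks $\log f_n$ along a path sampled under $P$: when $\sum m_n^2 < \infty$ the log likelihood ratio converges to a finite limit, while in the singular case it drifts to $-\infty$ under $P$ (sampling under $Q$ would instead drive $f_n \to +\infty$, as on the set $B_0$ of the theorem).

```python
import numpy as np

# P: product of N(0,1); Q: product of N(m_n, 1).  The Gaussian likelihood
# ratio satisfies log f_n = sum_{i<=n} (m_i x_i - m_i^2 / 2).
rng = np.random.default_rng(5)
n = 20_000
x = rng.standard_normal(n)                       # one sample path under P
cases = (("sum m_n^2 < inf", 1.0 / np.arange(1, n + 1)),
         ("sum m_n^2 = inf", 0.2 * np.ones(n)))
for label, m in cases:
    logf = np.cumsum(m * x - m ** 2 / 2)
    print(label, "log f_n at n=100,1000,20000:",
          np.round(logf[[99, 999, -1]], 3))
```

This is exactly the behavior (12) describes, transported to the product-measure model.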
For a continuously indexed process, the above theorem is not immediately applicable, since the index is not a countable (even if linearly ordered) set. However, for many second order processes the problem can be reduced to the preceding case, as shown below. The idea here is to replace the whole process by (linear) combinations of a fixed, suitably chosen, countable set of random variables for which Theorem 1 is applicable. [The general case will be treated after this one.] It is based on the Karhunen–Loève representation, which we now describe. Thus, let $\{X(t), t \in I \subset \mathbb{R}\}$ be a second order (scalar) process with mean function $t \mapsto m(t) = E(X(t))$ and a continuous covariance function $(s,t) \mapsto r(s,t) = E[(X(s) - m(s))\overline{(X(t) - m(t))}]$. Then $r(s,t)$ is positive definite, and if $I$ is a compact interval, assumed hereafter, then, since $\int_I\int_I |r(s,t)|^2\, ds\, dt < \infty$, by the classical Mercer theorem (cf., e.g., Riesz and Sz.-Nagy [1], p. 245) it can be represented by a uniformly convergent series ('bar' for complex conjugate):

$$r(s,t) = \sum_{i=1}^\infty \frac{\psi_i(s)\bar{\psi}_i(t)}{\lambda_i}, \quad \lambda_i > 0, \qquad (14)$$

where the $\psi_i(\cdot)$ are continuous functions satisfying the integral equation

$$\psi(t) = \lambda \int_I r(s,t)\psi(s)\, ds, \qquad (15)$$
∞ 1 and i=1 λi < ∞. Here the λi are the eigenvalues (counted according to their multiplicity) and ψi are the corresponding eigenfunctions of the “kernel” r, and {ψn , n ≥ 1} forms a complete orthonormal set in the Lebesgue space L2 (I), with Lebesgue measure, satisfying (15). This classical result and its relation to the Hilbert-Schmidt theory of symmetric kernels is nicely treated in the above reference, and their properties are needed here. First we consider the case that E(X(t)) = m(t) = 0 so that the X(t) are centered. Now define the random variables ξn = λn X(t)ψ¯n (t) dt, n ≥ 1, (16) I
where the integral is obtained using Fubini’s theorem, since X(t, ω) is jointly measurable (r(·, ·) being jointly continuous) in (t, ω). [Alternatively it may be regarded as a Bochner integral.] In any case, we get ¯ ¯ E(ξn ξ m ) = λm λn E(X(s)X(t))ψ n (s)ψm (t) ds dt I
I
141
4.1 Testing methods for second order processes
= λn λm [ r(s, t)ψn (s) ds]ψ¯m (t) dt I I λm = ψn (t)ψ¯m (t) dt, by (15), λn I λm = δnm . λn n It follows (on expanding inner products) that Xn (t) = k=1 ξk ψ√kλ(t) → k ∞ X(t) in L2 (P ), by (14), where X(t) = n=1 ξn ψ√nλ(t) , and conversely if n
X(t) is given by this series, converging in mean, then E(X(s)X(t)) = ∞ n (t) limn E(Xn (s)X n (t)) = n=1 ψn (s)ψ holds. If E(X(t)) = m(t) = 0, λn then the above argument applied to Y (t) = X(t) − m(t) establishes the following classical Karhunen-Lo`eve representation: 4. Proposition. If {X(t), t ∈ I} is a second order process with E(X(t)) = m(t), and a continuous covariance function r(·, ·) on a compact interval I, then X(t) = m(t) +
∞
ψ (t) ξn √n , t ∈ I, λn n=1
(17)
holds uniformly in t, and the convergence is in L2 (P ) where the λn > 0 and ψn are the eigenvalues and the corresponding (complete orthonormal in L2 (I)) eigenfunctions of the kernel r, satisfying (15), and hence the ξn are orthonormal in L20 (P ), given by (16). Let F = σ(X(t) ∈ I), F∞ = σ(ξn , n ≥ 1) be the σ-algebras generated by the random variables shown. Since each X(t) is a linear combination of the ξn , by (17), it follows that the X(t) are F∞ -measurable for each t ∈ I so that F ⊂ F∞ . On the other hand each ξn is F-measurable, by (16), for n ≥ 1, so that F∞ ⊂ F and hence F = F∞ ⊂ Σ. If P˜ = P |F, it is then determined by the X(t) as well as the ξn . Thus using (17), we can transfer the testing problem for measures P and ˜ on F = F∞ ) to the sequence {ξn , n ≥ 1}, to find the Q, (or P˜ , Q ˜c likelihood function f∞ = ddQP˜ by the approximation procedure of The˜c dQ
orem 1 with Fn = σ(ξ1 , · · · , ξn ). Consequently, if fn = dP˜n where n ˜ n = Q|F ˜ n , then fn → f∞ a.e. as in Theorem 1, and f∞ P˜n = P˜ |Fn , Q is the desired likelihood function. This method will now be illustrated to gain an insight into the type of calculations needed for some test problems. 5. Example. Let {X(t), t ∈ [0, 1]} be a Gaussian process with mean
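Since the whole method rests on the eigenpairs of (15), a brief numerical sketch (not from the text) may be helpful. It approximates the integral equation by the Nyström method on a midpoint grid; the grid size, the test kernel $\min(s,t)$, and the name `kl_eigenpairs` are illustrative assumptions, and the book's $\lambda_n$ is the reciprocal of the matrix eigenvalue.

```python
import numpy as np

def kl_eigenpairs(r, m=400):
    """Approximate eigenpairs of psi = lambda * int_0^1 r(s,t) psi(s) ds,
    equation (15), by the Nystrom method on a midpoint grid."""
    t = (np.arange(m) + 0.5) / m
    K = r(t[:, None], t[None, :]) / m     # kernel times quadrature weight 1/m
    mu, v = np.linalg.eigh(K)             # symmetric matrix eigenproblem
    order = np.argsort(mu)[::-1]
    lam = 1.0 / mu[order]                 # lambda_n of (15) = 1/mu_n
    psi = v[:, order] * np.sqrt(m)        # approximately L2([0,1])-normalized
    return t, lam, psi

# Check on r(s,t) = min(s,t), where lambda_n = (n - 1/2)^2 * pi^2 exactly.
t, lam, psi = kl_eigenpairs(lambda s, u: np.minimum(s, u))
print(lam[:3], [(k - 0.5)**2 * np.pi**2 for k in (1, 2, 3)])
```

Only the leading eigenpairs are reliable at a fixed grid size; the tail must be refined with a finer grid.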
5. Example. Let $\{X(t), t \in [0,1]\}$ be a Gaussian process with mean function $0$ and covariance function $r_b$ ($b \neq 0$), given by

$$r_b(s,t) = \begin{cases} \dfrac{\cosh bs\,\cosh b(1-t)}{b\sinh b}, & \text{for } s \le t,\\[4pt] \dfrac{\cosh bt\,\cosh b(1-s)}{b\sinh b}, & \text{for } t \le s.\end{cases} \qquad (18)$$

That this defines a covariance function follows from the well-known fact that any function of the form $(s,t) \mapsto r(s,t) = u(\min(s,t))\,v(\max(s,t))$ is a covariance function on $T \times T$ if $u, v \ge 0$ and $u/v$ is strictly increasing on $T \subset \mathbb{R}$. (Cf., e.g., Rao [25], p. 340; and one can also verify this by computing the matrix $R_n = (r(s_i, s_j), 1 \le i, j \le n)$, $n \ge 1$, and showing that its determinant $\det R_n > 0$, which in this case has a simple pattern.) Now the problem is to test the hypotheses $H_0: b = b_0 > 0$ vs $H_1: b \in [b_0 + \varepsilon, B]$, where $\varepsilon > 0$ is given, making the hypotheses distinguishable. (The same procedure applies if $b_0 < 0$, and then $H_1: b \in [B_1, b_0 - \varepsilon]$, but $b_0 = 0$ is excluded.) To employ Proposition 4, it is necessary to find the eigenvalues $\lambda_n^i$ and the corresponding eigenfunctions $\psi_n^i$ relative to the hypotheses $H_i$, $i = 0, 1$. This may be done as follows. Consider the integral equation with $r_b$ as its symmetric kernel:

$$\psi(t) = \lambda\int_0^1 r_b(s,t)\psi(s)\,ds. \qquad (19)$$

Substituting (18) here and differentiating, it is seen that (19) is equivalent to an ordinary second order linear differential equation with suitable boundary conditions at $0$ and $1$:

$$\psi''(t) = (b^2 - \lambda)\psi(t); \qquad \psi'(0) = \psi'(1) = 0. \qquad (19')$$

Solving this equation, it is immediately found that

$$\lambda_n = n^2\pi^2 + b^2,\ n = 0, 1, 2, \ldots;\qquad \psi_0(t) = 1,\ \psi_n(t) = \sqrt{2}\cos n\pi t,\ n = 1, 2, \ldots. \qquad (20)$$

Define the coordinate (or observable) random variables $Z_n = \int_0^1 X(t)\psi_n(t)\,dt$. Then the $Z_n$ are orthogonal (hence independent here) Gaussian random variables with $E(Z_n) = 0$ and $E(Z_n^2) = \frac{1}{\lambda_n}$. Note that in this particular case the eigenfunctions do not depend on $b$, and only the $\lambda_n$ do. Hence, writing $\lambda_n^i$ for the $\lambda_n$ of (20) under the hypotheses $H_i$, $i = 0, 1$, we can
calculate the likelihood functions $f_n$ on $\mathcal{F}_n = \sigma(Z_1, \cdots, Z_n)$ (by setting $Z_n(\omega) = z_n$) as:

$$f_n(\omega) = \Big[\prod_{i=1}^n \frac{\lambda_i^1}{\lambda_i^0}\Big]^{\frac12}\exp\Big\{-\frac12\sum_{i=1}^n z_i^2(\lambda_i^1 - \lambda_i^0)\Big\}.$$

Then, by the preceding work, $f_n \to f_\infty$ a.e. $[\tilde P]$. Since the series

$$\sum_{n=1}^{\infty} E_i(Z_n^2)(\lambda_n^1 - \lambda_n^0) = (b_1^2 - b_0^2)\sum_{n=1}^{\infty}\frac{1}{\lambda_n^i} < \infty,$$

and ($E_i$ = expectation, $\mathrm{Var}_i$ = variance under $H_i$)

$$\sum_{n=1}^{\infty}\mathrm{Var}_i(Z_n^2)(\lambda_n^1 - \lambda_n^0)^2 = 2(b_1^2 - b_0^2)^2\sum_{n=1}^{\infty}\frac{1}{(\lambda_n^i)^2} < \infty,$$

the series $\sum_{n=1}^\infty Z_n^2(\lambda_n^1 - \lambda_n^0)$ converges with probability one (by a standard result in probability theory), under both $\tilde P$ and $\tilde Q$ [they are equivalent measures], and similarly

$$\prod_{n=1}^{\infty}\frac{\lambda_n^1}{\lambda_n^0} = \lim_{n\to\infty}\frac{\prod_{i=1}^n\big(1 + \frac{b_1^2}{(i\pi)^2}\big)}{\prod_{i=1}^n\big(1 + \frac{b_0^2}{(i\pi)^2}\big)} = \frac{a(b_1)}{a(b_0)} > 0 \ \text{(say)}$$

exists. Thus

$$f_\infty = \Big[\frac{a(b_1)}{a(b_0)}\Big]^{\frac12}\exp\Big\{-\frac12\sum_{n=1}^{\infty}Z_n^2(\lambda_n^1 - \lambda_n^0)\Big\}$$

exists a.e. Then the critical region $A_0^k$ of (1) is given (on taking logs) by:

$$A_0^k = \Big\{\omega: (b_1^2 - b_0^2)\sum_{n=1}^{\infty}Z_n^2(\omega) \le k\Big\}, \qquad (21)$$

for a suitable $k > 0$, since $\lambda_n^1 - \lambda_n^0 = b_1^2 - b_0^2 > 0$. The same result holds for all $b_1 > b_0 > 0$, so that the set $A_0^k$ is a (one-sided) uniformly most powerful critical region (and the inequality in (21) should be reversed if $b_1 < b_0$ for a similar conclusion).
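As a numerical companion to Example 5 (not part of the text), one can simulate a path with the covariance (18), form the coordinates $Z_n$ with the cosine eigenfunctions of (20), and evaluate the statistic appearing in (21); the grid size and the truncation level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
b0, m, N = 2.0, 400, 60
t = (np.arange(m) + 0.5) / m
S = np.minimum(t[:, None], t[None, :]); T = np.maximum(t[:, None], t[None, :])
R = np.cosh(b0 * S) * np.cosh(b0 * (1 - T)) / (b0 * np.sinh(b0))  # kernel (18)
X = rng.multivariate_normal(np.zeros(m), R)        # one realization under H0

psi = np.vstack([np.ones(m)] +
                [np.sqrt(2) * np.cos(n * np.pi * t) for n in range(1, N)])
Z = psi @ X / m                  # Z_n = int_0^1 X(t) psi_n(t) dt (Riemann sum)
stat = (Z ** 2).sum()            # the statistic entering the region (21)
print(stat, sum(1 / (n**2 * np.pi**2 + b0**2) for n in range(N)))  # vs E_0 value
```

Under $H_0$ the statistic concentrates near $\sum_n 1/\lambda_n^0$, and larger values of $b$ shrink it, in accordance with the one-sided region (21).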
A related problem (due to Grenander [1]), which is generalized in the next chapter, will be discussed here since it serves as a motivation for that work. With the notation of the above example, let $\{X(t), t \in [0,1]\}$ be a Gaussian process with mean zero and a continuous covariance function $r(\cdot,\cdot)$. Suppose that $H_0: r(s,t) = r_0(s,t)$ vs $H_1: r(s,t) = \sigma^2 r_0(s,t)$, $\sigma \neq 1$. If $\lambda_n^i$ and $\psi_n^i$ are as in (19) for $r$, then it is clear that $\lambda_n^2 = \sigma^{-2}\lambda_n^1$, and if $Z_n^i = \int_0^1 X(t)\psi_n^i\,dt$ are the observable coordinates of the process, then the $Z_n^i \sim N(0, \frac{1}{\lambda_n^i})$ are independent and the corresponding likelihood ratio becomes:

$$f_n(\omega) = \Big(\frac{1}{\sigma^2}\Big)^{\frac n2}\exp\Big\{-\frac12\sum_{k=1}^n \lambda_k^1 z_k^2\Big(\frac{1}{\sigma^2} - 1\Big)\Big\}.$$

By the preceding work $f_n \to f_\infty$ a.e. $[\tilde P]$, and since $\sum_{k=1}^\infty \lambda_k^1 Z_k^2$ converges with probability one, $f_\infty = 0$ ($\infty$) according as $\sigma^2 > 1$ ($< 1$), and in either case the corresponding probabilities are mutually singular, so that $H_0$ and $H_1$ can be distinguished with probability one. If a transformation $T_\sigma: X \to \sigma X$ is considered instead, then the same conclusions obtain since $r(s,t) = \sigma^2 r_0(s,t)$. In case the process is BM, so that $r(s,t) = \min(s,t)$, the corresponding result was established by Cameron and Martin [2] from a different point of view. On the other hand, Example 5 above shows that nontrivial likelihood functions can be obtained, with equivalence of the measures, at least when the covariances are triangular but not scalar multiples of each other. An interesting generalization of this result for distinct triangular covariances, especially for affine linear transformations of the BM, will be considered in the next chapter. Several other examples, each demanding a special nontrivial treatment, have been detailed in Grenander [1],[2], which will greatly assist the reader's appreciation of the subject.

The preceding elegant argument can be utilized, via Proposition 4, only when the eigenvalues and eigenfunctions $\lambda_n, \psi_n$ of the covariance kernel can be explicitly calculated. That is relatively easy when the problem is converted into a differential equation with suitable (two point) boundary conditions, such as those given in (19'). This equivalence is a classical result in the Hilbert-Schmidt theory of linear homogeneous integral equations. In general, however, one has to obtain these $\lambda_n, \psi_n$ by other means, and it is not easy. [See, e.g., Riesz and Sz.-Nagy [1], Sections 95 and 96.] We encounter these situations even in relatively simple cases, as seen in the important Ornstein-Uhlenbeck (or O.U.) process. We now consider this process because of its many applications. It will be used in other illustrations as well, again following Grenander [1,2].

6. An O.U. Process example. The O.U. process $\{X(t), t \in [a,b]\}$ is real Gaussian with mean $m(t)$ and covariance $r(s,t) = \sigma^2\exp[-\beta|s-t|]$, $\beta > 0$, $\sigma > 0$. Since $r(\cdot,\cdot)$ is a continuous positive definite symmetric kernel (this follows from the fact that $t \mapsto e^{-\beta|t|}$ is the characteristic function of a Cauchy distribution with parameter $\beta > 0$), we can consider as before ($a = 0$, $b = 1$, $\sigma = 1$, taken for simplicity) the integral equation:

$$\psi(t) = \lambda\int_0^1 e^{-\beta|s-t|}\psi(s)\,ds$$
$$= \lambda\Big[e^{-\beta t}\int_0^t e^{\beta s}\psi(s)\,ds + e^{\beta t}\int_t^1 e^{-\beta s}\psi(s)\,ds\Big],$$

so that differentiating relative to $t$ one gets ($'$ for $\frac{d}{dt}$):

$$\psi'(t) = \lambda\beta\Big[-e^{-\beta t}\int_0^t e^{\beta s}\psi(s)\,ds + e^{\beta t}\int_t^1 e^{-\beta s}\psi(s)\,ds\Big].$$

The boundary conditions are somewhat unfamiliar, and one sees, from the above two expressions for $\psi, \psi'$, the following as the two point boundary conditions:

$$\psi(0) - \frac1\beta\psi'(0) = 0 = \psi(1) + \frac1\beta\psi'(1). \qquad (22)$$

Then the previous procedure converts the integral equation into the differential equation:

$$\psi''(t) + 2\beta\Big(\lambda - \frac\beta2\Big)\psi(t) = 0. \qquad (23)$$

Let $\alpha^2 = \beta(2\lambda - \beta) > 0$, so that $\lambda = \frac{\alpha^2 + \beta^2}{2\beta} > 0$. (The case $2\lambda \le \beta$ leads to $\lambda \le 0$, which is inadmissible for the positive definite kernel $r$.) Then the solution of (23) becomes ($i = \sqrt{-1}$)

$$\psi(t) = c_1 e^{i\alpha t} + c_2 e^{-i\alpha t}, \qquad (24)$$

and substitution of this in (22) gives the pair of equations:

$$c_1\Big(1 - \frac{i\alpha}{\beta}\Big) + c_2\Big(1 + \frac{i\alpha}{\beta}\Big) = 0,$$
$$c_1 e^{i\alpha}\Big(1 + \frac{i\alpha}{\beta}\Big) + c_2 e^{-i\alpha}\Big(1 - \frac{i\alpha}{\beta}\Big) = 0. \qquad (25)$$

For a nontrivial solution of this equation in $c_1, c_2$, the determinant of the coefficients must vanish, which implies that

$$e^{2i\alpha} = \frac{(\beta - i\alpha)^2}{(\beta + i\alpha)^2} = \frac{(\beta - i\alpha)^4}{(\beta^2 + \alpha^2)^2}. \qquad (26)$$

Consider the real part of this equation:

$$\cos 2\alpha = \frac{\alpha^4 + \beta^4 - 6\alpha^2\beta^2}{(\alpha^2 + \beta^2)^2}. \qquad (27)$$

(The imaginary part gives the sine function, which is obtainable from this equation immediately, and need not be discussed.)
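Although (27) has no closed-form roots, the positive zeros $\alpha_n$ are easy to locate numerically. The following sketch (not from the text) scans for sign changes and refines them with Brent's method; since only the real part of (26) is used here, any spurious roots would have to be screened against the full equation, and $\beta = 1$ is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import brentq

beta = 1.0  # illustrative value
g = lambda a: np.cos(2*a) - (a**4 + beta**4 - 6*(a*beta)**2) / (a**2 + beta**2)**2
grid = np.linspace(1e-3, 30.0, 60001)          # alpha = 0 is a trivial zero
vals = g(grid)
roots = [brentq(g, grid[i], grid[i + 1])
         for i in range(len(grid) - 1) if vals[i] * vals[i + 1] < 0]
lam = [(a**2 + beta**2) / (2 * beta) for a in roots[:5]]  # eigenvalues lambda_n
print(np.round(roots[:5], 4), np.round(lam, 4))
```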
Let $\alpha_n$ be the zeros of this (transcendental) equation which, written as an infinite power series, shows that there are infinitely many roots, of which only the positive ones are of interest here. Then the eigenvalues of the real $r(\cdot,\cdot)$ are $\lambda_n = \frac{\alpha_n^2 + \beta^2}{2\beta} > 0$, and the corresponding eigenfunctions $\psi_n$ are real and orthogonal, given from (24)-(27) as: $\psi_n(t) = (c_1 + c_2)\cos\alpha_n t = c\cos\alpha_n t$ (say). (Because of the previous observation, an explicit form of these functions cannot be written.) Then $c$ is chosen to normalize $\psi_n$, and suppose this is done. As before, one defines $Z_n = \int_0^1 X(t)\psi_n(t)\,dt$. Then the $Z_n$ are independent Gaussian random variables. To test the hypothesis $H_0: m(t) = 0$ vs $H_1: m(t) \neq 0$, the observable (or coordinate) $Z_n$ have means zero under $H_0$, and $a_n = \int_0^1 m(t)\psi_n(t)\,dt$ under $H_1$, with the same variance under both hypotheses, $\mathrm{var}\,Z_n = \frac{1}{\lambda_n}$. The finite dimensional likelihood ratio is given by

$$f_n(\omega) = \exp\Big[-\frac12\sum_{i=1}^n \lambda_i a_i^2 + \sum_{i=1}^n \lambda_i a_i Z_i(\omega)\Big].$$

If $\sum_{i=1}^\infty \lambda_i a_i^2 < \infty$, then by a classical Kolmogorov theorem the series $\sum_{i=1}^\infty \lambda_i a_i Z_i$ converges with probability one, and $f = \lim_{n\to\infty} f_n$ exists in the same sense. Now if $g_n(t) = -\sum_{i=1}^n \lambda_i a_i \psi_i(t)$, then $g_n \to g$ in $L^2([0,1])$, and one gets in a similar manner

$$f_n = \exp\Big[-\int_0^1 g_n(t)\Big(X(t) - \frac{m(t)}{2}\Big)\,dt\Big] \to \exp\Big[-\int_0^1 g(t)\Big(X(t) - \frac{m(t)}{2}\Big)\,dt\Big] = f\ \Big(= \frac{dQ}{dP}\Big). \qquad (28)$$

Thus if $\sum_{i=1}^\infty \lambda_i a_i^2 < \infty$, then $Q \sim P$; and if this condition fails, $\sum_{i=1}^\infty \lambda_i a_i Z_i$ diverges with probability one (by the same Kolmogorov theorem), so that $Q \perp P$. It may be noted that, since by Mercer's theorem $\sum_{i=1}^\infty \frac{1}{\lambda_i} < \infty$, the case that $m(t) = a \neq 0$, a constant, leads to the singular case that $Q \perp P$, and the hypotheses can then be distinguished with probability one based on a single realization. (This fact is a special case of Theorem V.1.1 to be established later; cf. also Section VII.2.)
The elegant method, illustrated in both problems above, can be employed in practical cases only if $\lambda_n, \psi_n$ are explicitly calculated. But in the case of the O.U. process it is seen that our equation (27) cannot be solved for its roots easily, and hence (28) is not effectively used. However, it is possible to find a (slightly weaker) alternative procedure that still gives a reasonably satisfactory solution.

An alternative procedure. Consider the O.U. process $\{X(t), t \in [0,1]\}$ as before, with mean functions under $H_0$: $m(t) = 0$, and under $H_1$: $m(t) \neq 0$ but satisfying a uniform Lipschitz condition of order one, and with the same covariance $r(s,t) = \exp[-\beta|s-t|]$. Recall that a function $g$ satisfies a uniform Lipschitz condition of order $\alpha > 0$ if there is an absolute constant $C > 0$ such that $|g(s) - g(t)| \le C|s-t|^\alpha$, and $C = 0$ if $g = a$, a constant. Let $t_1^{(n)} < \ldots < t_n^{(n)}$ be a partition $\pi_n$ of $[0,1]$ at the $n$th stage, the $\pi_n$ being ordered by refinement, so that as $|\pi_n| \to 0$ the partition points form a dense set of the unit interval. For instance the binary subdivision is adequate. Since the covariance function and the means under both hypotheses are continuous and the process is Gaussian, using the form of $r$ one can verify that the process has continuous sample paths with probability one. Indeed this was proved by Doob ([1], pages 304-5), who observed that the O.U. process $X = \{X(t), t \in \mathbb{R}\}$ has the property that if

$$Y(t) = \sqrt{t}\,X\Big(\frac{1}{2\beta}\log t\Big), \qquad t > 0, \qquad (29)$$

then (after a computation which is left to the reader as an exercise)

$$E(Y(s+t) - Y(s)) = 0; \qquad E(|Y(s+t) - Y(s)|^2) = \sigma_0^2\, t, \qquad (30)$$

and for $s_1 < s_2 \le t_1 < t_2$ the increments $Y(s_2) - Y(s_1)$ and $Y(t_2) - Y(t_1)$ are uncorrelated (hence independent) since $Y = \{Y(t), t \in \mathbb{R}\}$ is Gaussian. Thus the $Y$-process is a Brownian motion, and the latter is well known to have a.a. continuous sample paths, as already proved by Wiener in the 1920s. (See also the sketch after Theorem 2.6 below.)

For a partition $\pi_n$, as above, let $X_i = X(t_i^{(n)})$, $\rho_i = \exp[-\beta(t_{i+1}^{(n)} - t_i^{(n)})]$, and $m_i = m(t_i^{(n)})$, for fixed $n$. Then $\{X_1, \ldots, X_n\}$ are jointly normal random variables of the O.U. process, and the $n$-dimensional density $f_{\pi_n}^m$ with mean $m$ can be written as:

$$f_{\pi_n}^m(x_1, \cdots, x_n) = (2\pi)^{-\frac n2}\prod_{i=1}^{n-1}(1 - \rho_i^2)^{-\frac12}\exp\Big\{-\frac12(x_1 - m_1)^2 - \frac12\sum_{i=1}^{n-1}\frac{[x_{i+1} - m_{i+1} - \rho_i(x_i - m_i)]^2}{1 - \rho_i^2}\Big\}. \qquad (31)$$

Hence the "log likelihood function" of the hypotheses $H_0: m = 0$ vs $H_1: m \neq 0$, with $X_i(\omega) = x_i$, is given for each such partition by:

$$\log\frac{f_{\pi_n}^m}{f_{\pi_n}^0}(\omega) = m_1 x_1 - \frac{m_1^2}{2} + \sum_{i=1}^{n-1}\frac{x_{i+1} - \rho_i x_i}{1 + \rho_i}\cdot\frac{m_{i+1} - \rho_i m_i}{1 - \rho_i} - \frac12\sum_{i=1}^{n-1}\frac{1 - \rho_i}{1 + \rho_i}\Big[\frac{m_{i+1} - \rho_i m_i}{1 - \rho_i}\Big]^2. \qquad (32)$$

The right side terms are Riemann sums which converge in $L^1([0,1])$-mean (hence in probability), as $|\pi_n| \to 0$, to the function $f(\cdot)$ given by:

$$\log f = \frac{X(0)}{2}(m(0)-c) + \frac{X(1)}{2}(m(1)+c) - \frac{m^2(1)}{2} + \frac12\int_0^1 X(t)(\beta m(t) - c + c\beta)\,dt - \frac{\beta}{4}\int_0^1 m^2(t)\,dt - \frac{c\beta}{2}\int_0^1 m(t)\,dt - \frac{c^2\beta}{4}, \qquad (33)$$

where $c \ge 0$ is the (smallest) constant in the Lipschitz condition on $m$. Consequently, from (33), the critical region for the hypotheses is given by

$$A_0^k = \Big\{\omega: X(0,\omega)(m(0) - c) + X(1,\omega)(m(1) + c) + \int_0^1 X(t,\omega)(\beta m(t) - c + c\beta)\,dt > k\Big\}, \qquad (34)$$

where $k$ is chosen to satisfy the size of the region, $P(A_0^k) = \alpha$. When $m = a$, a constant, then $c = 0$ in the above computation. If $H_1: m = a > 0$, then (34) becomes

$$A_0^{k_1} = \Big\{\omega: X(0,\omega) + X(1,\omega) + \beta\int_0^1 X(t,\omega)\,dt \ge k_1\Big\}, \qquad (34')$$

for a suitable $k_1$, which is a (one sided) uniformly most powerful critical region (the inequality being reversed if $a < 0$).
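The statistic in (34') is easy to form from a discretized realization. The following sketch (not part of the text) simulates an O.U. path under $H_0$ and evaluates $X(0) + X(1) + \beta\int_0^1 X(t)\,dt$; the grid size and $\beta$ are illustrative assumptions, and the threshold $k_1$ would in practice be calibrated to the desired size $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, m = 1.0, 500
t = np.linspace(0.0, 1.0, m)
R = np.exp(-beta * np.abs(t[:, None] - t[None, :]))  # O.U. covariance, sigma = 1
X = rng.multivariate_normal(np.zeros(m), R)          # one path under H0: m = 0
stat = X[0] + X[-1] + beta * X.mean()                # statistic of (34'),
print(stat)                                          # integral by a Riemann sum;
# compare with k1: large values favor H1: m = a > 0
```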
The above procedure shows that the $\{\pi_n, n \ge 1\}$ are only partially ordered, and the $f_{\pi_n} = \frac{dQ_{\pi_n}}{dP_{\pi_n}}$ converge in the mean but not pointwise. As counter examples show, one has only such a weaker statement when a complete ordering is not available. To include these problems one seeks just convergence in probability for these likelihood ratios. This is covered, even with mean convergence, by employing a general technique based on Hellinger integrals, which we now describe.

If $P$ and $Q$ are a pair of probability measures on a measurable space $(\Omega, \Sigma)$, and $\mu$ is a dominating ($\sigma$-finite) measure for both (specifically, take $\mu = P + Q$), then the Hellinger "distance" between $P$ and $Q$ is defined by:

$$H(P,Q) = \int_\Omega\sqrt{\frac{dP}{d\mu}\cdot\frac{dQ}{d\mu}}\,d\mu = (f,g) = \int_\Omega fg\,d\mu, \qquad (35)$$
where $f^2, g^2$ are the Radon-Nikodým derivatives of $P$ and $Q$ respectively relative to $\mu$. If $\tilde\mu$ is another such dominating measure, then $\mu \ll \tilde\mu$, so that one has

$$H(P,Q) = \int_\Omega fg\,d\mu = \int_\Omega fg\,\frac{d\mu}{d\tilde\mu}\,d\tilde\mu = \int_\Omega\sqrt{\frac{dP}{d\tilde\mu}\cdot\frac{dQ}{d\tilde\mu}}\,d\tilde\mu = \int_\Omega\sqrt{dP\,dQ},$$

and hence $H(P,Q)$ does not depend on the particular dominating $\mu$ or $\tilde\mu$. The last expression above is the Hellinger integral. Note that $H(P,Q)$ is not a true distance since it does not satisfy the triangle inequality. However, $0 \le H(P,Q) = H(Q,P) \le 1$, the last being a consequence of the CBS-inequality. Moreover, $H(P,Q) = 0$ iff $Q \perp P$, and $H(P,Q) = 1$ iff $P = Q$ (by the equality conditions in the CBS-inequality, and since $\min(f,g) = 0$ if $P \perp Q$). But a true distance between $P, Q$, say $\rho(P,Q)$, is obtained from (35) by considering the $L^2(\mu)$-metric:

$$\rho(P,Q) = \|f - g\|_{2,\mu} = [(f,f) + (g,g) - 2(f,g)]^{\frac12} = [2(1 - H(P,Q))]^{\frac12}, \qquad (36)$$

and this can be used to translate the Hellinger distance to the $L^2(\mu)$-metric. The expression $H(P,Q)$ will be useful in deciding the singularity or equivalence (or non-singularity) of $P, Q$ in many problems, with (35) and (36), complementing the result of Theorem 1 in obtaining the likelihood ratio $\frac{dQ^c}{dP}$, and strengthening the alternative method employed in Example 6 above.

7. Theorem. Let $\{(\Omega, \Sigma_\alpha, \Sigma, {P \atop Q}), \alpha \in I\}$ be a probability model for testing a hypothesis and its alternative, where $I$ is a directed index set with a partial ordering denoted by '$\le$', and $\Sigma_\alpha \subset \Sigma_\beta \subset \Sigma$ for $\alpha \le \beta$ in $I$, the $\Sigma_\alpha$ being $\sigma$-algebras. If $P_\alpha = P|\Sigma_\alpha$, $Q_\alpha = Q|\Sigma_\alpha$, and (for simplicity) $\Sigma = \sigma(\cup_\alpha\Sigma_\alpha)$, then the corresponding Hellinger distances satisfy the limit relation $H(P,Q) = \lim_\alpha H(P_\alpha, Q_\alpha)$, so that $P \perp Q$ iff $H(P,Q) = 0$, which trivially holds when $H(P_\alpha, Q_\alpha) = 0$ for some $\alpha \in I$.
Proof. We first establish the result, based on an elementary argument due to Brody [1], as stated, and then show how it can be reformulated when $P_\alpha, Q_\alpha$ are image measures of $P, Q$ on finite dimensional spaces such as $\mathbb{R}^\alpha \cong \mathbb{R}^n$. This will enable a direct application of the theorem.
The argument is facilitated by the following auxiliary relation:

(*) For any probability measures $P_1, P_2$ on $(\Omega, \Sigma)$ one has

$$H(P_1, P_2) = \inf\Big\{\sum_k\sqrt{P_1(A_k)P_2(A_k)}: A_k \in \Sigma, \text{ disjoint}, \ \bigcup_k A_k = \Omega\Big\}.$$

In fact, let $\mu = P_1 + P_2$ and $f_i = \frac{dP_i}{d\mu}$. Then, for any real $t > 1$ and integers $m, n$, consider the sets

$$A_{m,n} = \{\omega: t^{2(m-1)} \le f_1(\omega) < t^{2m},\ t^{2(n-1)} \le f_2(\omega) < t^{2n}\},$$

so that the $A_{m,n} \in \Sigma$ are disjoint, and if $B = \cup_{m,n}A_{m,n}$, then $B \in \Sigma$. Then one has

$$\int_{A_{m,n}}\sqrt{f_1f_2}\,d\mu \ge t^{m+n-2}\mu(A_{m,n}), \qquad (37)$$

and $P_1(A_{m,n}) = \int_{A_{m,n}}f_1\,d\mu \le t^{2m}\mu(A_{m,n})$. Similarly $P_2(A_{m,n}) \le t^{2n}\mu(A_{m,n})$. On the other hand, since $\mu(B^c) = 0$ and the $A_{m,n}$ are disjoint, it follows that

$$H(P_1,P_2) = \sum_{m,n}\int_{A_{m,n}}\sqrt{f_1f_2}\,d\mu \ge \sum_{m,n}\frac{1}{t^2}\,t^{m+n}\mu(A_{m,n})$$
$$\ge \frac{1}{t^2}\sum_{m,n}\sqrt{P_1(A_{m,n})P_2(A_{m,n})}, \text{ by (37) and (35)},$$
$$\ge \frac{1}{t^2}\inf\Big\{\sum_k\sqrt{P_1(A_k)P_2(A_k)}: A_k \in \Sigma, \text{ disjoint}\Big\}. \qquad (38)$$

Letting $t \downarrow 1$, this gives the lower inequality for (*). The opposite inequality is simple. Indeed, for any partition $\{A_k\}_k \subset \Sigma$ of $\Omega$, one notes that

$$H(P_1,P_2) = \sum_k\int_{A_k}\sqrt{f_1f_2}\,d\mu \le \sum_k\Big(\int_{A_k}f_1\,d\mu\Big)^{\frac12}\Big(\int_{A_k}f_2\,d\mu\Big)^{\frac12}, \text{ by the CBS-inequality},$$
$$= \sum_k\sqrt{P_1(A_k)P_2(A_k)}.$$

Taking the infimum over all such partitions of $\Omega$ yields the desired inequality, which with (38) establishes (*).
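Relation (*) can be watched numerically. In the sketch below (not from the text), $P_1$ is the uniform law on $(0,1)$ and $P_2$ has density $2t$ there, so that $H(P_1,P_2) = \int_0^1\sqrt{2t}\,dt = 2\sqrt2/3$; the partition sums $\sum_k\sqrt{P_1(A_k)P_2(A_k)}$ decrease to this value under refinement, as (*) asserts. The measures chosen are illustrative assumptions.

```python
import numpy as np

P1 = lambda a, b: b - a          # P1(A) for an interval A = (a, b)
P2 = lambda a, b: b**2 - a**2    # P2(A), density 2t on (0, 1)
for n in (2, 8, 32, 128):        # finer and finer partitions of (0, 1)
    edges = np.linspace(0.0, 1.0, n + 1)
    h = sum(np.sqrt(P1(a, b) * P2(a, b))
            for a, b in zip(edges[:-1], edges[1:]))
    print(n, h)
print("limit:", 2 * np.sqrt(2) / 3)   # = H(P1, P2), the infimum in (*)
```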
Since $\alpha \le \beta \Rightarrow \Sigma_\alpha \subset \Sigma_\beta$, and $\Sigma_\alpha$ has fewer partitions of $\Omega$ than $\Sigma_\beta$, it follows that $H(P_\alpha, Q_\alpha) \ge H(P_\beta, Q_\beta) \ge H(P,Q)$, so that the positive monotone decreasing net has a limit, and hence

$$0 \le H(P,Q) \le \lim_\alpha H(P_\alpha, Q_\alpha) \le 1. \qquad (39)$$

It is to be shown that there is equality in the middle, to complete the proof. Now by (*), given an $\varepsilon > 0$, one can find a partition $\{A_k\}_k \subset \Sigma$ such that

$$\sum_k\sqrt{P(A_k)Q(A_k)} \le H(P,Q) + \frac\varepsilon4, \qquad (40)$$

and an $n_0(= n_\varepsilon)$ such that

$$\sum_{|k|\ge n_0}P(A_k) < \frac\varepsilon4, \qquad \sum_{|k|\ge n_0}Q(A_k) < \frac\varepsilon4. \qquad (41)$$

Since the square root function is continuous on $\mathbb{R}^+$, for each integer $k$ we can find an $0 < \eta_k < \frac\varepsilon3 2^{-(|k|+2)}$ such that

$$\sqrt{(P(A_k) + \eta_k)(Q(A_k) + \eta_k)} \le \sqrt{P(A_k)Q(A_k)} + \frac\varepsilon3\,2^{-(|k|+2)}. \qquad (42)$$

But $\Sigma = \sigma(\cup_\alpha\Sigma_\alpha)$. So, for the finite measure $\mu = P + Q$ and each $A_k \in \Sigma$, one can find a $B_k \in \cup_\alpha\Sigma_\alpha$ (hence $B_k \in \Sigma_{\alpha_k}$ for some $\alpha_k \in I$) satisfying $\mu(A_k\Delta B_k) < \eta_k$. Thus

$$P(A_k\Delta B_k) < \eta_k, \qquad Q(A_k\Delta B_k) < \eta_k. \qquad (43)$$

By directedness of $I$, each finite set of elements in $I$ has an upper bound in $I$, and so there is a $\beta \ge \alpha_k$, $|k| < n_0$, with all $B_k \in \Sigma_\beta$, $|k| < n_0$. Since $P_\beta = P|\Sigma_\beta$, we have:

$$P_\beta\Big(\bigcup_{|k|<n_0}B_k\Big) = P\Big(\bigcup_{|k|<n_0}B_k\Big) \ge P\Big(\bigcup_{|k|<n_0}(B_k \cap A_k)\Big) = \sum_{|k|<n_0}[P(A_k) - P(A_k - B_k)]$$
$$\ge \sum_{|k|<n_0}[P(A_k) - P(A_k\Delta B_k)] > \sum_{|k|<n_0}(P(A_k) - \eta_k), \text{ by (43)},$$
$$\ge \Big(1 - \frac\varepsilon4\Big) - \sum_k\frac\varepsilon3\,2^{-(|k|+2)}, \text{ by (41)}, \;=\; 1 - \frac\varepsilon2. \qquad (44)$$

Replacing $P$ by $Q$ in the above procedure, one gets similarly

$$Q_\beta\Big(\bigcup_{|k|<n_0}B_k\Big) > 1 - \frac\varepsilon2. \qquad (45)$$

Hence, if $B_{n_0} = \Omega - \bigcup_{|k|<n_0}B_k$, then

$$P_\beta(B_{n_0}) < \frac\varepsilon2, \qquad Q_\beta(B_{n_0}) < \frac\varepsilon2. \qquad (46)$$

Since $\Omega = \cup_k B_k$, one can estimate $H(P_\beta, Q_\beta)$ as:

$$H(P_\beta, Q_\beta) \le \sum_k\sqrt{P_\beta(B_k)Q_\beta(B_k)} \le \sum_{|k|<n_0}\sqrt{P_\beta(B_k)Q_\beta(B_k)} + \frac\varepsilon2, \text{ by (46)},$$
$$\le \sum_{|k|<n_0}\sqrt{(P(A_k)+\eta_k)(Q(A_k)+\eta_k)} + \frac\varepsilon2, \text{ since } B_k \subset A_k\cup(A_k\Delta B_k), \text{ and then using (43)},$$
$$\le \sum_{|k|<n_0}\Big[\sqrt{P(A_k)Q(A_k)} + \frac\varepsilon3\,2^{-(|k|+2)}\Big] + \frac\varepsilon2, \text{ by (42)},$$
$$\le H(P,Q) + \frac\varepsilon4 + \frac\varepsilon4 + \frac\varepsilon2 = H(P,Q) + \varepsilon, \text{ by (40)}.$$

Since $\varepsilon > 0$ is arbitrary, one has the opposite inequality (and hence equality) for the limit relation in (39). The last statement of the theorem is obvious. $\Box$
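Theorem 7 can also be checked in the simplest product situation, which anticipates Remark 1 below: for coordinates that are i.i.d. $N(0,1)$ under $P$ and i.i.d. $N(\theta,1)$ under $Q$, the Hellinger distance factorizes, $H(P_n,Q_n) = e^{-n\theta^2/8} \downarrow 0$, so $P \perp Q$ for $\theta \neq 0$. The sketch below is not part of the text, and the normal family is an illustrative assumption; it verifies the one-coordinate factor by quadrature.

```python
import numpy as np
from scipy.integrate import quad

theta = 0.3
phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
h1, _ = quad(lambda x: np.sqrt(phi(x) * phi(x - theta)), -12, 12)
print(h1, np.exp(-theta**2 / 8))      # one-coordinate Hellinger factor
for n in (1, 10, 100, 1000):          # H(P_n, Q_n) decreases to H(P, Q) = 0
    print(n, h1 ** n)
```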
8. Remarks. 1. Let $\{X_t, t \in I\}$ be a real stochastic process on a probability space $(\Omega, \Sigma, P)$, and $F$ the collection of all finite subsets of $I$ directed by inclusion. Thus for $\alpha, \beta \in F$ let $\alpha \le \beta$ iff $\alpha \subset \beta$, so that for any finite collection $\alpha_1, \ldots, \alpha_k \in F$, $\beta = \cup_{j=1}^k\alpha_j \in F$ and $\alpha_i \subset \beta$, $i = 1, \ldots, k$. Now consider, for $\alpha = (t_1, \ldots, t_k)$, the vector $X_\alpha = (X_{t_1}, \ldots, X_{t_k})$ whose distribution (= image measure) is $\tilde P_\alpha = P \circ X_\alpha^{-1}$, i.e., $\tilde P_\alpha(A) = P(X_\alpha \in A)$ for any Borel set $A \subset \mathbb{R}^\alpha$. If $Q$ is another such (alternative hypothesis) measure on $\Sigma$, then we take $\Sigma_\alpha = \sigma(X_{t_i}, t_i \in \alpha) \subset \Sigma$, and $\Sigma_\alpha \subset \Sigma_\beta$ whenever $\alpha \le \beta$. If $P_\alpha = P|\Sigma_\alpha$ and $Q_\alpha = Q|\Sigma_\alpha$, we have, with the notation of the preceding proof,

$$H(P_\alpha, Q_\alpha) = \int_\Omega\sqrt{f_\alpha(\omega)g_\alpha(\omega)}\,d\mu_\alpha \qquad (\mu_\alpha = P_\alpha + Q_\alpha = \mu|\Sigma_\alpha)$$
$$= \int_{\mathbb{R}^\alpha}\sqrt{f_\alpha(x)g_\alpha(x)}\,d(\mu_\alpha\circ X_\alpha^{-1})(x) = \int_{\mathbb{R}^\alpha}\sqrt{f_\alpha(x)g_\alpha(x)}\,d\tilde\mu(x),$$

by the fundamental law of probability (cf., e.g., Rao [15], p. 19),

$$= H(\tilde P_\alpha, \tilde Q_\alpha). \qquad (47)$$

Hence $H(\bar P, \bar Q) = \lim_\alpha H(P_\alpha, Q_\alpha) = \lim_\alpha H(\tilde P_\alpha, \tilde Q_\alpha)$, with $\bar P = P|\tilde\Sigma_0$ and $\bar Q = Q|\tilde\Sigma_0$, where $\tilde\Sigma_0 = \sigma(X_t, t \in I)$. Thus the above theorem can be applied to the distribution functions $\tilde P_\alpha, \tilde Q_\alpha$, although they are defined on $(\mathbb{R}^\alpha, \mathcal{B}_\alpha)$, which are not subsets of $(\Omega, \Sigma)$. However, they may be identified using a canonical representation (cf., e.g., Rao [12], pp. 208-9), which is explained using the martingale language. For applications, the above identification in (47) is sufficient. If only the $(\tilde P_\alpha, \tilde Q_\alpha)$ are given, satisfying a compatibility condition, then one needs to use Theorem I.1.1 and find $(\Omega, \Sigma, {P \atop Q})$ to get $\sigma$-additive $(P,Q)$ on $\Omega = \mathbb{R}^I$ and its cylinder $\sigma$-algebra $\Sigma$, so that $X_\alpha: \Omega \to \mathbb{R}^\alpha$ and $\tilde P_\alpha = P \circ X_\alpha^{-1}$, $\tilde Q_\alpha = Q \circ X_\alpha^{-1}$, as indicated in the technical Remark 3.

2. It is worthy of note that the above theorem is valid, with essentially the same proof, if $H(P,Q)$ is replaced by

$$H_\gamma(P,Q) = \int_\Omega\Big(\frac{dP}{d\mu}\Big)^\gamma\Big(\frac{dQ}{d\mu}\Big)^{1-\gamma}d\mu = \int_\Omega f^\gamma g^{1-\gamma}\,d\mu \ (= H_{1-\gamma}(Q,P)), \qquad 0 < \gamma < 1,$$

instead of $\gamma = \frac12$ (see e.g., Rao [12], pp. 205-209). Using Hölder's (in place of the CBS) inequality there, one finds $0 \le H_\gamma(P,Q) \le 1$, with $H_\gamma(P,Q) = 0$ iff $P \perp Q$. Moreover,

$$H_\gamma(P,Q) = \int_\Omega\Big(\frac fg\Big)^\gamma g\,d\mu = \int_\Omega h^\gamma g\,d\mu \text{ (say)} = \int_A e^{\gamma\log h}\,dQ + \int_{A^c}e^{\gamma\log h}\,dQ,$$
where $A = \{\omega: \log h(\omega) \le 0\}$, and hence, as $\gamma \downarrow 0$ through a sequence, $\lim_{\gamma\downarrow0}H_\gamma(P,Q) = H_0(P,Q)$ exists. Further, using Fatou's lemma, one concludes that $\lim_{\gamma\to0}H_\gamma(P,Q) = 1$ iff $P \ll Q$. But by the above theorem (and also see (47)), $H_\gamma(P,Q) = \inf_{\alpha\in F}H_\gamma(P_\alpha,Q_\alpha)\,[= \lim_\alpha H_\gamma(P_\alpha,Q_\alpha)]$. Consequently, $P \ll Q$ iff $\lim_{\gamma\to0}H_\gamma(P_\alpha,Q_\alpha) = 1$ uniformly in $\alpha \in F$. This fact will be of use in the next chapter.

We next illustrate an application of the above with the following:

9. Example. Let $\{X(t), t \in [0,1]\}$ be a Gaussian process under both hypotheses $H_0$ and $H_1$, with mean and covariance functions given as $H_0: m(t) = 0$ vs $H_1: m(t) \neq 0$, both having the same covariance function $r(\cdot,\cdot)$. To apply the above procedure, let $\pi_n: 0 = t_0^{(n)} < t_1^{(n)} < \cdots < t_{2^n}^{(n)} = 1$ be a partition which under refinement ordering becomes dense in $[0,1]$. Specifically, let $t_i^{(n)} = \frac{i}{2^n}$, $i = 0, 1, \ldots, 2^n$. Set $m_i^{(n)} = m(t_i^{(n)})$, $\sigma_{ij}^{(n)} = r(t_i^{(n)}, t_j^{(n)})$, for $n \ge 1$. It is supposed that the matrix $(\sigma_{ij}^{(n)})$ is nonsingular for each $n$, and let $\lambda_i^{(n)}$ be the (necessarily positive) eigenvalues of $(\sigma_{ij}^{(n)})$. Then, with the classical integral evaluations of the multivariate normal distributions, one gets for the Hellinger distances of $P_{\pi_n}, Q_{\pi_n}$ the following:

$$H(P_{\pi_n}, Q_{\pi_n}) = \int_{\mathbb{R}^{2^n}}\!\!\cdots\!\int \exp\Big\{-\frac14(x_n - \alpha_n)'(\sigma_{ij}^{(n)})^{-1}(x_n - \alpha_n)\Big\}\,\frac{\exp\{-\frac14 x_n'(\sigma_{ij}^{(n)})^{-1}x_n\}}{(2\pi)^{2^{n-1}}\,[\det(\sigma_{ij}^{(n)})]^{1/2}}\,dx_n. \qquad (48)$$

Here $\alpha_n = (m_1^{(n)}, \cdots, m_{2^n}^{(n)})'$ and $x_n$ are the mean and suitable integration vectors (prime denoting transpose). This may be evaluated with a change of variables, and one gets the result as (cf., e.g., Cramér [1], p. 118):

$$H(P_{\pi_n}, Q_{\pi_n}) = \exp\Big\{-\frac18\,\alpha_n'(\sigma_{ij}^{(n)})^{-1}\alpha_n\Big\} = \exp\Big\{-\frac18\sum_{i=1}^{2^n}\frac{1}{\lambda_i^{(n)}}(m_i^{(n)})^2\Big\}. \qquad (49)$$

This converges to a finite positive limit iff $\lim_{n\to\infty}\sum_{i=1}^{2^n}\frac{1}{\lambda_i^{(n)}}(m_i^{(n)})^2 < \infty$ as $|\pi_n| \to 0$. One can give conditions on the mean vector $m$ in order that this series converges, getting $P \perp Q$ in the opposite case.
A similar computation can also be applied if $r$ is replaced by different covariances $r_1, r_2$ in the above hypotheses, and conditions can be obtained for the (non-)singularity of $P, Q$. We state the result, leaving the verification to the reader. Thus let $r_1^{(n)} = (r_1(t_i^{(n)}, t_j^{(n)}))$, $r_2^{(n)} = (r_2(t_i^{(n)}, t_j^{(n)}))$, and similarly let $m_1^{(n)}, m_2^{(n)}$ be the corresponding mean vectors. Then the probability densities satisfy $dP_{\pi_n} = f_{1,\cdots,2^n}(x)\,dx$, where $f$ is a multivariate normal density with mean vector $m_1^{(n)}$ and covariance matrix $r_1^{(n)}$. Similarly let $dQ_{\pi_n} = g_{1,\cdots,2^n}(x)\,dx$, corresponding to $m_2^{(n)}, r_2^{(n)}$. The Hellinger integral is then given by

$$H(P_{\pi_n}, Q_{\pi_n}) = \{|M_n^{-1}|\,|M_n - N_nM_n^{-1}N_n|\}^{\frac14}\exp\Big\{-\frac18(m_1^{(n)} - m_2^{(n)})'(M_n - N_nM_n^{-1}N_n)^{-1}(m_1^{(n)} - m_2^{(n)})\Big\}, \qquad (50)$$

where $M_n = \frac12(r_1^{(n)} + r_2^{(n)})$, $N_n = \frac12(r_1^{(n)} - r_2^{(n)})$, and $|\cdot|$ denotes the determinant. Thus, as $|\pi_n| \to 0$, $H(P_{\pi_n}, Q_{\pi_n}) \to H(P,Q)$, which defines the nonsingular case if this limit is nonzero, and the singular case otherwise. Taking $m_1 = 0$, $m_2 \neq 0$, $r_1 = r_2$ here, it reduces to (49). The simplification (50) is given in Kraft [1] (there are some obvious misprints there, which are corrected for (50)). Similar computations will be used in the proof of Theorem 5.2.1 below. Actually, as seen later, for Gaussian processes the case of nonsingularity becomes equivalence of measures, and thus the above result can be refined. Also, there is another method, based on Aronszajn's [1] theory of reproducing kernels, which was made into a powerful tool in inference theory by Parzen [1]; this will be considered later, in Chapter V. We now discuss certain other results in the next section to explain the richness of problems arising in this study.

4.2 Sequential testing of processes

Here we indicate how some of the above ideas may be modified and considered for sequential testing. The problem can be formulated as follows. Suppose a process $\{X(t), t \in I \subset \mathbb{R}\}$ is being observed, governed by the model $(\Omega, \Sigma, {P \atop Q})$ as before, but now, instead of taking a complete realization, one chooses a suitable real function $f(X(t), t)$ and stops observing the process when $f$ exceeds one of two given bounds $A > B$. Namely, if $f(X(t), t) \ge A$ accept $H_0$, if $f(X(t), t) \le B$ accept $H_1$, and continue observation if $B < f(X(t), t) < A$, which is thus a three decision problem. Then the epoch $T$ at which the experiment is terminated depends on the realization until that point, so that the event $[T \le t]$ is determined by $\{X(s), s \le t\}$, $t \in I$, and hence $T$ is a real random variable, called a stopping time. Thus one has to analyze the new "stopped" process $Y(t) = f(X(t), t)\chi_{[t<T]} + f(X(T), T)\chi_{[t\ge T]} = f(X(T\wedge t), T\wedge t)$, $t \in I$, and this new process may change the structure of the original one. If $I$ is a set of positive integers, and $X_n$ is a partial sum of a sequence of independent (or orthogonal) random variables, we already discussed such a problem in Section III.5 and calculated the first two moments of $Y(T)$ there. The corresponding analysis for the continuous time hypothesis testing case will now be considered.

The discrete indexed partial sum process can be identified with a random walk, and its continuous version may be seen as a martingale. We have hinted at the appearance of martingales at several places in the preceding sections, and will now present some necessary results after outlining the sequential testing problem, which serves as a motivation for the type of (martingale) results that are appealed to here.

Following the previous work on non sequential testing, one considers a function of the likelihood ratio $\tilde f(X(t)) = \frac{dQ^c}{dP}\big|_{\mathcal{F}_t}$, $\mathcal{F}_t = \sigma(X(s), s \le t)$ [usually $f(X(t), t) = \log\tilde f$]; given two numbers $A > B$, the process is observed until either $f(X(t), t) \ge A$ or $\le B$, and $H_0$ or $H_1$ is decided, so that the critical region will be $[f(X(t), t) \ge A]$ or $[f(X(t), t) \le B]$, with probabilities $\alpha_0$ and $\alpha_1$ ($\alpha_0 + \alpha_1 < 1$), and $1 - \alpha_0 - \alpha_1$ as the probability of the third (or continuation) region $[B < f(X(t), t) < A]$. If $T$ denotes the stopping time for the decisions of $H_0$ or $H_1$, then the desired optimality criterion is that, if another procedure is used with respective probabilities $\alpha_0^*$ and $\alpha_1^*$ such that $\alpha_i^* \le \alpha_i$, $i = 0, 1$, then $E_i^*(T) \ge E_i(T)$, $i = 0, 1$, must hold, where $E_i\,(E_i^*)$ is the expectation under the original (alternative) procedure for the hypotheses $H_i$, $i = 0, 1$, and $E_i(T)\,[E_i^*(T)]$ is the expected sample size. In the classical discrete time case, the sequential probability ratio test (SPRT), analogous to the Neyman-Pearson lemma, establishes the optimality of the above indicated sequential procedure for the $S_N$-process. It is thus clear that $E_i(T)$, and even the distribution of $T$ itself, will play a crucial role in this analysis. The most important case, from the point of view of real applications, is that $T$ is finite but unbounded, and the subject is developed with this property in view.

Since the stopped process $Y(t) = f(X(T\wedge t))$, $t \in I$, is a key object of study, it will be necessary to consider the analog of the $S_n$ (namely the $X(t)$)-process and the stopped versions $S_N$ or $Y(t)$. We thus treat (a large set of) the cases when the $Y$-process is a martingale, or one closely related to it, with the classical random walk as a motivating example in both studies, extending the partial sum sequence of independent random variables. Then we consider the problem for stopped processes suggested by (and for) the sequential analysis. Consequently we begin with the latter concept and present some properties to use for the SPRT optimality proof (namely Theorem 8 below, as the final item of this section).
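A discrete skeleton of this three-decision rule may clarify the mechanics; the sketch below is not from the text. The observations are i.i.d. normal, the tracked function is the log likelihood ratio of $N(\theta_1,1)$ against $N(0,1)$, and the exit levels are symmetric; all of these, and the convention that large values favor $H_1$, are illustrative assumptions (sign conventions vary).

```python
import numpy as np

rng = np.random.default_rng(3)
theta1 = 0.5
logA, logB = np.log(20.0), -np.log(20.0)   # exit levels (toy choice)
log_lr, T = 0.0, 0
while logB < log_lr < logA:
    x = rng.normal(theta1, 1.0)            # data generated under the alternative
    log_lr += theta1 * x - theta1**2 / 2   # log of N(theta1,1)/N(0,1) ratio
    T += 1
print(T, "accept H1" if log_lr >= logA else "accept H0")
```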
Thus, if $\{X(t), t \in I\}$ is a stochastic process on $(\Omega, \Sigma, P)$, of integrable random variables, with $I$ as a partially ordered index set, the ordering denoted by '$\le$', then it is called a (sub)martingale if for any $s, t \in I$, $s < t$, one has

$$E(X(t)|X(r), r \le s) = (\ge)\,X(s), \quad \text{a.e.} \qquad (1)$$

It is a super martingale if the left side is $\le X(s)$. Intuitively, if $X(t)$ is regarded as a gambler's fortune at 'time' $t$, then the martingale condition says that the expected fortune at a future time $t'$, having known the past through the time $t$, is the same as that at time $t$, and hence it is also termed a fair game. The sub martingale in this interpretation becomes a favorable game (to the gambler), and the super martingale an unfavorable game to the gambler (or a favorable one to the house). Mathematically, the condition that $X(r), r \le s$, is known means simply that the $\sigma$-algebra $\mathcal{F}_s = \sigma(X(r), r \le s)$ is given. Then (1) is equivalent to stating:

$$\int_A X(t)\,dP = (\ge)\int_A X(s)\,dP, \qquad A \in \mathcal{F}_s, \qquad (2)$$

or alternatively

$$E^{\mathcal{F}_s}(X(t)) = (\ge)\,X(s), \quad \text{a.e.} \qquad (3)$$

If $I$ is the set of integers, then the process is a (sub)martingale sequence, and if $I \subset \mathbb{R}$, it is a continuous parameter (sub)martingale. Taking $A = \Omega$ in (2), it follows that $t \mapsto E(X(t))$ is a constant for martingales, and a nondecreasing (nonincreasing) function for sub(super)martingales. A simple example of a martingale is the sequence of partial sums of a sequence of independent random variables with means zero. More precisely, one has:

1. Lemma. Let $\{X_n, n \ge 1\} \subset L^1(P)$ be an adapted process. Then it is a martingale iff $X_n = \sum_{k=1}^n Y_k$, $n \ge 1$, where $E^{\mathcal{F}_k}(Y_{k+1}) = 0$, $k \ge 1$ [and hence, if the $Y_k$'s are independent, $E^{\mathcal{F}_k}(Y_{k+1}) = E(Y_{k+1}) = 0$ for all $k \ge 1$], where $\mathcal{F}_k = \sigma(X_1, \cdots, X_k) = \sigma(Y_1, \cdots, Y_k)$.

Proof. Indeed, if the $X_n$-sequence is a martingale, then letting $Y_1 = X_1$ and $Y_{n+1} = X_{n+1} - X_n$, one has $X_n = \sum_{k=1}^n Y_k$, and

$$E^{\mathcal{F}_n}(Y_{n+1}) = E^{\mathcal{F}_n}(X_{n+1}) - X_n = 0, \quad \text{a.e.}$$

Conversely, if $\{Y_n, n \ge 1\}$ has the above property, then it is immediate that $\{X_n = \sum_{k=1}^n Y_k, \mathcal{F}_n, n \ge 1\}$ is a martingale. The parenthetical statement is clear, since $Y_{n+1}$ is independent of $\mathcal{F}_n$, so that $E^{\mathcal{F}_n}(Y_{n+1}) = E(Y_{n+1}) = 0$, $n \ge 1$. $\Box$

Remark. If the $Y_n$, $n \ge 1$, are independent identically distributed random variables with mean $a(= E(Y_n))$ and $Y_n' = Y_n - a$, then $X_n' = \sum_{k=1}^n Y_k'$ satisfies the above hypothesis, and hence $\{X_n' = \sum_{k=1}^n Y_k - na, n \ge 1\}$ is a martingale.

Evidently, Lemma 1 and the preceding comment should be interpreted differently for continuous parameter processes, since sums must be replaced by suitable stochastic integrals, which will be defined later in this section and used freely thereafter in the book. Abstracting (1), it is said that $\{X(t), \mathcal{F}_t, t \in I\}$ is an adapted (real) process if $X(t)$ is $\mathcal{F}_t$-measurable for each $t$, and it is a (sub)martingale if each $X(t) \in L^1(\Omega, \mathcal{F}_t, P)$ and satisfies (2) or (3) for $\mathcal{F}_t \subset \mathcal{F}_{t'} \subset \Sigma$, $t < t'$. The nondecreasing family, or net, $\{\mathcal{F}_t, t \in I\}$ is called a (standard) filtration if $I \subset \mathbb{R}$ (and $\mathcal{F}_t = \cap_{t'>t,\,t'\in I}\mathcal{F}_{t'} = \mathcal{F}_{t+0}$) for each $t \in I$, and (for convenience) each $\mathcal{F}_t$ is completed for $P$. It is evident that the measurability problem for continuous parameter processes becomes nontrivial, and one has to pay special attention to it in this regard.

First we establish an analog of Theorem 1.1, a basic convergence result for (sub)martingales, as it also plays an important role in our applications. It is due to Doob [2]. (In fact, the likelihood ratio sequence forms a sub martingale, as shown after this proof.) But this cannot be deduced directly from Theorem 1.1, since, as we shall see, one of the measures $P, Q$ on $\Sigma$ there need not be $\sigma$-additive, and it is necessary to find other useful substitutes. For simplicity we treat the discrete parameter case. The continuous parameter version can be obtained from this work.

2. Theorem. Let $\{X_n, \mathcal{F}_n, n \ge 1\}$ be a sub martingale on $(\Omega, \Sigma, P)$ such that $\sup_n E(|X_n|) < \infty$. Then $X_n \to X_\infty$ a.e. and $E(|X_\infty|) \le \liminf_n E(|X_n|)$.

Proof. There are several proofs of this assertion since Doob's original argument, which is based on an ingenious combinatorial "up crossings inequality". Here we give a different proof, using a functional analytic method that is perhaps conceptually simpler than the original, because it actually reduces this result to that of Theorem 1.1. [The original proof is also given below.] The argument is presented in two steps, first treating the martingale case (the main part) and then extending it to sub martingales via a "Doob decomposition". Here are the details.

I. Let $\{X_n, \mathcal{F}_n, n \ge 1\}$ be a martingale, $\sup_n E(|X_n|) < \infty$, and consider $\nu_n: A \mapsto \int_A X_n\,dP_n$, $A \in \mathcal{F}_n$, $P_n = P|\mathcal{F}_n$. Then the martingale property implies $\nu_n = \nu_{n+1}|\mathcal{F}_n$, and each $\nu_n$ is a signed measure. On the algebra $\mathcal{F}_0 = \cup_n\mathcal{F}_n$, define $\nu$ by $\nu(A) = \nu_n(A)$, since $A \in \mathcal{F}_0$ belongs to $\mathcal{F}_n$ for some $n$, which is clearly possible. Then $\nu$ is additive, bounded, and $\nu|\mathcal{F}_n = \nu_n$, $n \ge 1$. However, $\nu$ need not be $\sigma$-additive. [It is $\sigma$-additive iff the $X_n$-sequence is also uniformly integrable.] So we use a Stone representation from abstract analysis (cf., e.g., Dunford and Schwartz [1], p. 312), in a slightly refined form, as follows.
According to this result, if $ba(\Omega, \Sigma)$ is the space of bounded additive real set functions on $(\Omega, \Sigma)$, which is a Banach space under the variation norm, then there exists a totally disconnected compact Hausdorff space $S$ (also called a Stone space), with $\mathcal{B}$ denoting its Baire $\sigma$-algebra generated by the algebra of the clopen (= closed-open) sets, such that $ba(\Omega, \Sigma)$ can be isometrically and isomorphically mapped onto $ca(S, \mathcal{B})$, the (Banach) space of $\sigma$-additive real bounded regular set functions on $(S, \mathcal{B})$ with the variation norm. The isomorphism is induced by a measure preserving set mapping $\tau: \Sigma \to \mathcal{B}$, such that $\tau(A) = \tilde A \in \mathcal{B}$, for $A \in \Sigma$, uniquely. A refinement of this result, needed here, is to specify $S$ more particularly as a set whose points are $0$-$1$ valued additive functions (or measures) on $\Sigma$. The necessary detail is given in the companion volume (Rao [1], pp. 26-27), according to which the set mapping $\tau$ is induced by a point mapping $u: \Omega \to S$, so that $u^{-1}(\tau(A)) = A$, $\forall A \in \Sigma$. With this, let $\hat P = P \circ u^{-1}$, $\hat\nu_n = \nu_n \circ u^{-1}$, and $\mathcal{B}_n = \tau(\mathcal{F}_n)$. Then one has $\hat P_n = \hat P|\mathcal{B}_n$ and $\hat\nu_n \ll \hat P_n$, $n \ge 1$. If $g_n = \frac{d\hat\nu_n}{d\hat P_n}$ is the RN-derivative, then

$$\hat\nu_n(B) = \nu_n(u^{-1}(B)) = \int_{u^{-1}(B)}X_n\,dP_n = \int_B g_n\,d\hat P_n = \int_B g_n\,d(P_n\circ u^{-1}) = \int_{u^{-1}(B)}g_n\circ u\,dP_n, \qquad B \in \mathcal{B}_n. \qquad (4)$$

Since $u^{-1}(\mathcal{B}_n) = \tau^{-1}(\mathcal{B}_n) = \mathcal{F}_n$ (equality because the $\mathcal{F}_n$ are complete), we get the key relation $X_n = g_n \circ u$ a.e. Also, $\hat\nu = \nu \circ u^{-1}$ is automatically $\sigma$-additive on $\mathcal{B}_0 = \cup_n\mathcal{B}_n$ (since $S$ is a Stone space), which has a unique extension to $\mathcal{B} = \sigma(\mathcal{B}_0)$ by a classical Hahn theorem, and $\hat\nu|\mathcal{B}_n = \hat\nu_n$ (but $\hat\nu$ need not be $\hat P$-continuous). Thus, by Theorem 1.1, $g_n \to g_\infty$ a.e. $[\hat P]$, and $g_\infty = \frac{d\hat\nu^c}{d\hat P}$. Let $\hat N$ be the exceptional set, $N = u^{-1}(\hat N)$, $N_n = \{\omega: g_n\circ u(\omega) \neq X_n(\omega)\}$, $N_0 = N \cup \bigcup_{n=1}^\infty N_n$, whence $P(N_0) = 0$. Define $X_\infty = (g_\infty\circ u)\chi_{\Omega - N_0}$. Clearly $X_\infty$ is $\mathcal{F}_\infty$-measurable, and $|X_n - X_\infty|(\omega) = |g_n - g_\infty|(u(\omega)) = |g_n - g_\infty|(s) \to 0$ for $s = u(\omega) \in S - \hat N$. Hence $X_n \to X_\infty$ a.e., and (by Fatou's lemma) $E(|X_\infty|) \le \liminf_n E(|X_n|)$.

II. Next, let $\{X_n, n \ge 1\}$ be a sub martingale as in the statement. Then it can be expressed as:

$$X_n = X_n' + \sum_{j=2}^n[E^{\mathcal{F}_{j-1}}(X_j) - X_{j-1}] = X_n' + A_n \text{ (say)}, \qquad n \ge 2, \qquad (5)$$
and let $A_1 = 0$. Now $\{X_n', \mathcal{F}_n, n \ge 1\}$ is a martingale, and $A_n \ge 0$ is $\mathcal{F}_{n-1}$-adapted. Indeed, since the $X_n$-sequence is a sub martingale, $A_n \ge 0$ a.e., and it is increasing since each of its summands is nonnegative. Further,

$$E^{\mathcal{F}_n}(X_{n+1}') = E^{\mathcal{F}_n}(X_{n+1} - A_{n+1}) = E^{\mathcal{F}_n}(X_{n+1}) - \sum_{j=2}^{n+1}[E^{\mathcal{F}_{j-1}}(X_j) - X_{j-1}] = X_n - \sum_{j=2}^n[E^{\mathcal{F}_{j-1}}(X_j) - X_{j-1}] = X_n',$$

so that $\{X_n', \mathcal{F}_n, n \ge 1\}$ is a martingale. [This is the Doob decomposition of a sub martingale.] The converse is immediate, since (5) implies

$$E^{\mathcal{F}_n}(X_{n+1}) = X_n' + \sum_{j=2}^n[E^{\mathcal{F}_{j-1}}(X_j) - X_{j-1}] + (A_{n+1} - A_n) \ge X_n' + \sum_{j=2}^n[E^{\mathcal{F}_{j-1}}(X_j) - X_{j-1}] = X_n, \quad \text{a.e.}$$

Hence

$$E(A_n) = E(X_n) - E(X_n') \le \sup_n E(|X_n|) - E(X_1') = K < \infty,$$

for all $n \ge 1$. But $A_n \le A_{n+1} \to A$ a.e., and by the monotone convergence theorem $E(A) \le K < \infty$, so that $A$ is integrable. Also $X_n' = X_n - A_n \Rightarrow \sup_n E(|X_n'|) \le \sup_n E(|X_n|) + E(A) < \infty$. Hence, by Step I (for martingales), $X_n' \to X_\infty'$ a.e. Consequently $X_n = X_n' + A_n \to X_\infty' + A = X_\infty$ (say) exists a.e., and Fatou's lemma gives $E(|X_\infty|) \le \liminf_n E(|X_n|) < \infty$. This is the final assertion of the theorem.

Alternative proof. (Doob's original method.) As noted earlier, this depends on the following key combinatorial inequality:

Up crossings lemma. For the sub martingale $\{X_k, \mathcal{F}_k, 1 \le k \le n\}$ and $-\infty < a < b < \infty$, let $U_{a,b}^{(n)}$ be the number of up crossings of the interval $[a,b]$ by the given sequence; i.e., for the numerical sequence $\{X_k(\omega), 1 \le k \le n\}$, if $k_1(\omega) = \min\{i: X_i(\omega) \le a\}$, $k_2(\omega) = \min\{i > k_1(\omega): X_i(\omega) \ge b\}$, and, by induction, $k_{2j+1}(\omega) = \min\{i > k_{2j}(\omega): X_i(\omega) \le a\}$, $k_{2j+2}(\omega) = \min\{i > k_{2j+1}(\omega): X_i(\omega) \ge b\}$, then let $\beta_n(\omega) = \max\{j: k_{2j}(\omega) \le n\}$, where $\min\{\emptyset\} = n$ and $\max\{\emptyset\} = 0$.
The integer valued random variable $\beta_n = U_{a,b}^{(n)}$ above then satisfies the up crossings inequality:

$$E(\beta_n) \le \frac{1}{b-a}\int_{[X_n\ge a]}(X_n - a)\,dP \le \frac{E(|X_n|) + |a|}{b-a}. \qquad (5')$$

If (5') is granted, then the conclusion is obtained immediately, as follows. Now $U_{a,b}^{(n)} \le U_{a,b}^{(n+1)} \to U_{a,b}$ (say) as $n \to \infty$, for any $-\infty < a < b < \infty$. If the $X_n$-sequence did not tend to a limit, then for some pair of numbers $a < b$, $U_{a,b} = \infty$. But, by hypothesis and (5'),

$$E(U_{a,b}^{(n)}) \le \frac{\sup_n E(|X_n|) + |a|}{b-a} < \infty,$$

so that $U_{a,b} = \infty$ can be true only on a set of $P$-measure zero. Now if

$$N_{a,b} = \{\omega: X_*(\omega) = \liminf_n X_n(\omega) < a < b < \limsup_n X_n(\omega) = X^*(\omega)\},$$

one has $P(N_{a,b}) = 0$. Hence

$$N = \{\omega: X_*(\omega) < X^*(\omega)\} = \bigcup_{a<b;\ a,b\ \text{rational}}N_{a,b}$$

has $P$-measure zero, so that $X_* = X^* = X_\infty$ a.e., and $X_n \to X_\infty$ a.e. By Fatou's lemma the last inequality obtains.

So it remains to establish the inequality (5'). Define counting variables $u_1, \ldots, u_n$ by $u_i(\omega) = 1$ iff the sequence $(X_1(\omega), \ldots, X_{i-1}(\omega))$ completes an up crossing of the interval $[a,b]$, so that $u_i$ is $\sigma(X_1, \cdots, X_{i-1})$-measurable, $i \ge 2$. Thus $u_i = \chi_{\cup_{j=1}^{i-1}A_j^i}$, with $A_j^i = \{\omega: k_{2j}(\omega) < i \le k_{2j+1}(\omega)\}$. Let $\tilde X = \sum_{i=3}^n u_i(X_i - X_{i-1})$. By the sub martingale inequality one has $\int_{[u_i=1]}(X_i - X_{i-1})\,dP \ge 0$, whence $E(\tilde X) \ge 0$. Writing $\beta_n$ for $U_{a,b}^{(n)}$, we may assume $\beta_n(\omega) > 0$ for some $\omega \in \Omega$, for nontriviality, so that $k_{2\beta_n}(\omega) \le n$, $k_{2\beta_n+1}(\omega) \le n$, and $k_{2\beta_n+2}(\omega) > n$. Then

$$\tilde X(\omega) = (X_{k_3} - X_{k_2})(\omega) + \cdots + (X_{k_{2\beta_n-1}} - X_{k_{2\beta_n-2}})(\omega) + (X_n - X_{k_{2\beta_n}})(\omega)$$
$$\le (a-b)(\beta_n(\omega) - 1) + (X_n - X_{k_{2\beta_n}})(\omega) \le (a-b)\beta_n(\omega) + (X_n(\omega) - a),$$

since $X_{k_{2\beta_n}} \ge b > a$, $X_n > a$, and $k_{2\beta_n+2} > n$, a.e. Hence

$$0 \le E(\tilde X) \le (a-b)E(\beta_n) + E(X_n - a)^+ \le (a-b)E(\beta_n) + \sup_n E(|X_n|) + |a|.$$

This implies (5') as asserted.

Remark. To identify the limit element as the RN-derivative of the $P$-continuous part of $\nu$, additional work, as outlined in Technical Remark 1.3 above, is needed. This is in contrast to the fact that Theorem 1.1 directly describes this limit, where $\nu$ is moreover assumed to be $\sigma$-additive. These two major methods have this crucial distinction, and in our applications both versions are used. We omit further discussion of this point here. But see Johansen and Karush [1], where an extension of the Andersen-Jessen method is given to identify $X_\infty$ with a certain RN-derivative, in the general additive case also. We shall sketch in Exercise 6.4 below that $X_\infty = \frac{d\nu_\infty^c}{dP_\infty}$ follows from this work and the following proposition.

Let us show that a sequence of likelihood ratios of a process $\{X_n, n \ge 1\}$ on $(\Omega, \Sigma, {P \atop Q})$ can be formulated as (super) martingales, for which the preceding convergence theorem applies. Thus, for each $n$ and a Borel set $A \subset \mathbb{R}^n$, consider the image measures:

$$\tilde Q_n(A) = Q\circ(X_1, \cdots, X_n)^{-1}(A)\ (= Q[(X_1, \cdots, X_n) \in A]), \qquad \tilde P_n(A) = P\circ(X_1, \cdots, X_n)^{-1}(A). \qquad (6)$$
(7)
where x = (x1 , . . . , xn ) = (X1 , . . . , Xn )(ω). We assert that the process {fn , Fn , n ≥ 1} is a nonnegative martingale. Indeed if A ∈ Bn , then g −1 (A) ∈ Fn ⊂ Fn+1 , and we have: E Fn (fn+1 ) dPn = fn+1 dPn+1 −1 gn (A) g −1 (A) = gn+1 (x, xn+1 ) dP˜n+1 (x, xn+1 ) A×R ˜ n+1 (x, xn+1 ) = dQ A×R
163
4.2 Sequential testing of processes
˜ n (x) dQ
=
A
gn (x) dP˜n
=
A
=
−1 gn (A)
fn (ω) dPn .
(8)
Since the extreme integrands are Fn -measurable and gn−1 (Bn ) = Fn , this proves the martingale property of the fn -sequence and the non negativity is evident. In case the finite dimensional distributions (or the image measures) ˜ n on Bn have densities pn , qn relative to product σ-finite measures ˜ Pn , Q μn = ⊗ni=1 λi where each λi is a Borel measure on (R, B), then the likelihood ratios defined as fn = pqnn ≥ 0 will only form a super martingale in general. This is seen as follows. (In applications one often has λi to be either Lebesgue or counting measures on R.) We start with the integral representation of the conditional expectation, assuming that the image measures are regular so that the following integrals are in Lebesgue’s sense (otherwise one needs to use the Dunford-Schwartz integrals and we will have trouble in some of the manipulations below for a rigorous justification) where we let x = (y1 , . . . , yn+1 ) = (x , yn+1 )Rn × R. If ω ∈ Ω is arbitrarily fixed, denote (X1 , . . . , Xn )(ω ) = x here. Then Fn E (fn+1 )(ω ) = ( fn+1 dP Fn )(ω ) Ω qn+1 = (x) dPn+1 (yn+1 |x ), using the image law, R pn+1 qn+1 (x)pn+1 (yn+1 |x ) dλn+1 (yn+1 ) = p n+1 R qn+1 pn+1 (x) dλn+1 (yn+1 ), (x) ≤ p pn (x ) R n+1 by the regularity of conditioning and Tulcea’s theorem as in Sec. II.3, qn (x ) = fn (x ) = pn = fn (X1 , · · · , Xn )(ω ).
(9)
Inequality appears here, since on a set of positive μn -measure pn (but not qn ) may vanish, and on such a set the ratio can be arbitrarily large. This shows that the fn -sequence is a nonnegative super martingale or that {−fn , Fn , n ≥ 1} is a negative sub martingale, and in either case E(fn ) ≤ 1. We summarize the above in the following:
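A Monte Carlo check of the martingale property (7) and of the a.e. behavior (not part of the text): with $P$ making the $X_i$ i.i.d. $N(0,1)$ and $Q$ making them i.i.d. $N(0.3,1)$ (illustrative assumptions), $E_P(f_n) = 1$ for every $n$, while the individual paths satisfy $f_n \to 0$ a.e. $[P]$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, paths = 0.3, 50, 50_000
x = rng.standard_normal((paths, n))                  # samples under P
log_f = np.cumsum(theta * x - theta**2 / 2, axis=1)  # log f_k along each path
f = np.exp(log_f)
print(f[:, [0, 9, 49]].mean(axis=0))  # each near 1: martingale mean under P
print(np.median(f[:, -1]))            # far below 1: f_n -> 0 a.e. [P]
```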
We summarize the above in the following:

3. Proposition. The likelihood ratio sequences, given by (7) (or (9)), form positive (super) martingales and hence converge a.e.

The convergence result in this form was originally established by Grenander [1], who moreover described the limit element explicitly with a special argument. We next present stopped analogs of martingales and show how the earlier results of Section III.5 may be obtained from this work, while moving on to new applications of continuous parameter processes.

At the beginning of this section, we defined a stopped process $\{Y(t), t \ge 0\}$ from a given one $\{X(t), \mathcal{F}_t, t \ge 0\}$, taking $I = \mathbb{R}^+$ for simplicity, by a stopping time $T$, where $Y(t) = X(T\wedge t)$, $t \ge 0$. The first question is to find the $\sigma$-algebra to which each $Y(t)$ is adapted. Since $X(t)$ is $\mathcal{F}_t$-adapted, $[T \le t] \in \mathcal{F}_t$, and $T$ is $\Sigma$-measurable, the composition $Y(t) = X \circ (T\wedge t)$ is only measurable for $\Sigma$. But is it adapted to a smaller filtration? A precise answer can be given, using some additional terminology, as follows. By definition, a stopping time $T$ is defined relative to the given family $\{\mathcal{F}_t, t \ge 0\}$, or, shortly, to the filtration. For simplicity, let this be a standard filtration. Now one can define the class of events "prior to $T$" as $\mathcal{F}(T) = \{A \in \Sigma: A\cap[T \le t] \in \mathcal{F}_t, t \ge 0\}$, a collection that is known up to the (random) time $T$. If $T = t$, a constant, then clearly $\mathcal{F}(T) = \mathcal{F}_t$. It may be verified that $\mathcal{F}(T)$ is a $\sigma$-algebra relative to which $T$ is measurable, that $\mathcal{F}(T\wedge t) \subset \mathcal{F}(T\wedge t')$ for $t < t'$, and in fact $X(T)$ is $\mathcal{F}(T)$-measurable, so that $X(T\wedge t)$ is $\mathcal{F}(T\wedge t)$-adapted whenever $X(t)$ is right continuous (defined below). We now present the following result on stopped (sub)martingales, which will be of particular interest in the projected treatment of sequential analysis. It is due to Doob [2].

4. Theorem. Let $\{X(t), \mathcal{F}_t, t \ge 0\}$ be a (sub)martingale and $T$ a stopping time of the standard filtration $\{\mathcal{F}_t, t \ge 0\}$. Then the stopped process $\{Y(t) = X(T\wedge t), \mathcal{F}(T\wedge t), t \ge 0\}$ is again a (sub)martingale whenever $X(t) = X(t+0) = \lim_{n\to\infty}X(t + \frac1n)$ a.e. for all $t$, i.e., $X(t)$ is right continuous. More generally, if $T_1, T_2$ are stopping times of the filtration, $T_1 \le T_2$, then the new process $\{X(T_i), \mathcal{F}(T_i), i = 1, 2\}$, defined as $X(T_i)(\omega) = X(T_i(\omega), \omega)$, $\omega \in \Omega$, is a (sub)martingale whenever (i) $\{X(t), \mathcal{F}_t, t \ge 0\}$ is a right continuous process, and (ii) (either $T_2 \le t_0$ or) the given process is uniformly integrable.

Proof. The first part is a special case of the second, taking $T_1 = T\wedge t$ and $T_2 = T\wedge t'$, $t < t'$, so that we only need to verify the latter. By the right continuity of $X(t)$ and of the filtration (the 'standard' part of the condition), it suffices to establish the result when $T_1$ and $T_2$ take finitely many values (or are simple times). For if this is granted, then one can find such
simple times $T_i^n \downarrow T_i$, and $\{X\circ T, T \in \mathcal{T}\}$ is a uniformly integrable set when the given process is uniformly integrable, where $\mathcal{T}$ is the set of all simple stopping times of the filtration, as is to be shown in the last paragraph.

Let $t_1 < t_2 < \cdots < t_n$ be the distinct values taken on by $T_1$ and $T_2$. Since by definition $X(T_i)$ is equal to $X_{t_j}$, $1 \le j \le n$, in pieces ($i = 1, 2$), and the latter are in $L^1(P)$, it is clear that $X(T_i) \in L^1(P)$. Also, if $A \in \mathcal{F}_{t_k}$, we have

$$\int_{A\cap[T_1\ge t_k]}X_{t_k}\,dP = \int_{A\cap[T_1=t_k]}X_{t_k}\,dP + \int_{A\cap[T_1>t_k]}X_{t_k}\,dP$$
$$\le \int_{A\cap[T_1=t_k]}X(T_1)\,dP + \int_{A\cap[T_1>t_k]}X_{t_{k+1}}\,dP, \quad \text{since } [T_1 > t_k] \in \mathcal{F}_{t_k},$$
$$\le \int_{A\cap[T_1=t_k]}X(T_1)\,dP + \int_{A\cap[T_1=t_{k+1}]}X(T_1)\,dP + \int_{A\cap[T_1>t_{k+1}]}X_{t_{k+2}}\,dP$$
$$\le \cdots \le \sum_{j=k}^n\int_{A\cap[T_1=t_j]}X(T_1)\,dP = \int_{A\cap[T_1\ge t_k]}X(T_1)\,dP. \qquad (10)$$

On the other hand, $T_1 \le T_2$ implies $[T_1 = t_k] \subset [T_2 \ge t_k]$, and hence:

$$\int_{A\cap[T_1=t_k]}X(T_1)\,dP = \int_{A\cap[T_1=t_k]\cap[T_2\ge t_k]}X_{t_k}\,dP \le \int_{A\cap[T_1=t_k]\cap[T_2\ge t_k]}X(T_2)\,dP, \text{ by (10)}, = \int_{A\cap[T_1=t_k]}X(T_2)\,dP.$$

Summing over $k = 1, \ldots, n$, this becomes:

$$\int_A X(T_1)\,dP \le \int_A X(T_2)\,dP, \qquad A \in \mathcal{F}(T_1), \qquad (11)$$

which is the sub martingale property of $\{X(T_i), \mathcal{F}(T_i), i = 1, 2\}$.

Finally, since one can find simple stopping times $T_i^{(n)} \downarrow T_i$ as $n \to \infty$ [e.g. $T_i^{(n)} = \sum_{j=0}^{2^n-1}\frac{j+1}{2^n}\chi_{[\frac{j}{2^n}\le T_i<\frac{j+1}{2^n}]}$ will do], and $\mathcal{F}(T_i^{(n)}) \supset \mathcal{F}(T_i)$,
(11) is extended to the $T_i$. Indeed, that $\{X_t, t \ge 0\}$ is a uniformly integrable sub martingale easily implies $X_t \to X_\infty$ a.e. and in $L^1(P)$. Consequently $\{X(T_i^{(n)}), n \ge 0\}$ is also uniformly integrable. Then $X(T_i^{(n)}) \to X(T_i)$ a.e., so that we can interchange the limits and integrals in (11). This implies all the statements. $\Box$

Remarks. 1. Many properties of stopped martingales are available in books on the general theory of the subject. [Cf., e.g., Doob [2]; and for an account that we use here, one may consult the companion volume, Rao [21], Chapters IV and V.]

2. The uniform integrability hypothesis is satisfied if the process is in some ball of $L^p(P)$, $1 < p < \infty$.

For our work we need to establish the integrability of $X(T)$ for any stopping time. In the discrete parameter case one can give simple sufficient conditions for a martingale, motivated by Lemma 1 above. In fact, if $\{X_n, \mathcal{F}_n, n \ge 1\}$ is a martingale, then $X_n = \sum_{k=1}^n Y_k$ with $E^{\mathcal{F}_k}(Y_{k+1}) = 0$; let $E(T) < \infty$. Hence

$$E(|X\circ T|) = \sum_{k=1}^\infty\int_{[T=k]}|X_k|\,dP \le \sum_{k=1}^\infty\sum_{j=1}^k\int_{[T=k]}|Y_j|\,dP = \sum_{j=1}^\infty\int_{\cup_{k=j}^\infty[T=k]}|Y_j|\,dP$$
$$= \sum_{j=1}^\infty\int_{[T\ge j]}|Y_j|\,dP, \text{ by Tonelli's theorem}, \;=\; \sum_{j=1}^\infty E(\chi_{[T\ge j]}|Y_j|)$$
$$= \sum_{j=1}^\infty E(E^{\mathcal{F}_{j-1}}(\chi_{[T\ge j]}|Y_j|)) = \sum_{j=1}^\infty E(\chi_{[T\ge j]}E^{\mathcal{F}_{j-1}}(|Y_j|)), \quad \text{since } [T \ge j] \in \mathcal{F}_{j-1}.$$

Now suppose that $E^{\mathcal{F}_{j-1}}(|Y_j|) \le K_0\chi_{[T\ge j]}$ a.e. for a constant $K_0 > 0$. Then we get

$$E(|X(T)|) \le K_0\sum_{j=1}^\infty E(\chi_{[T\ge j]}) = K_0\sum_{j=1}^\infty P[T \ge j] = K_0E(T). \qquad (12)$$
Thus, if $E(T) < \infty$ and $E^{\mathcal{F}_j}(|Y_{j+1}|) \le K_0\chi_{[T\ge j]}$ a.e., then $X(T)$ is integrable. Taking $T_1 = 1$ and $T_2 = T$ in the above, we can state the following consequence of the theorem for applications.

5. Corollary. If $\{X_n, \mathcal{F}_n, n \ge 1\}$ is a martingale, and if for each $n \ge 1$, $E^{\mathcal{F}_n}(|X_{n+1} - X_n|) \le K\chi_{[T\ge n]}$ a.e. and $E(T) < \infty$, then for any such stopping time $T$ of $\{\mathcal{F}_n\}$ we have $E(X(T)) = E(X_1)$.

We use this to obtain the Wald identities of sequential analysis.

6. Theorem. (a) Let $\{X_n, n \ge 1\}$ be independent random variables with means $\mu$, and $\sup_n E(|X_n|) < \infty$. Then, for any stopping time $T$ of $\sigma(X_1, \cdots, X_n)$, $n \ge 1$, with $E(T) < \infty$, we have $E(S_T) = \mu E(T)$, where $S_n = \sum_{i=1}^n X_i$.

(b) If $\varphi_n(z) = E(e^{zX_n})$ exists and is $\neq 0$ for some complex $z$ such that $|\mathrm{Re}(z)X_n|\chi_{[T\ge n]} \le K_1$ a.e. and $|\varphi_n(z)| \ge 1$, then, for any stopping time $T$ as in (a) with $E(T) < \infty$, we have the identity:

$$E\big(e^{zS_T}[\Pi_{i=1}^T\varphi_i(z)]^{-1}\big) = 1.$$

In addition, if all the $X_n$ have a common distribution, this becomes $E(e^{zS_T}\varphi(z)^{-T}) = 1$.

Proof. (a) We have already proved this part in Proposition III.5.1, and here we deduce it from Corollary 5. Indeed, let $S_n' = \sum_{k=1}^n(X_k - \mu)$, a partial sum of independent integrable random variables with means zero, and if $S_n = \sum_{k=1}^n X_k$, then $\{S_n', \mathcal{F}_n, n \ge 1\}$ is a martingale, where $\mathcal{F}_n = \sigma(S_1, \cdots, S_n) = \sigma(X_1, \cdots, X_n)$. Also

$$E^{\mathcal{F}_n}(|S_{n+1}' - S_n'|) = E(|X_{n+1} - \mu|) \le K + |\mu| < \infty,$$

since $\mathcal{F}_n$ and $X_{n+1}$ are independent. It follows from Corollary 5 that $0 = E(S_1') = E(S_T') = E(S_T - T\mu) = E(S_T) - \mu E(T)$.

(b) By hypothesis $\varphi_n(z)$ exists and is $\neq 0$ for some $z$, so that if $V_n^z = \frac{e^{z\sum_{i=1}^nX_i}}{\Pi_{i=1}^n\varphi_i(z)}$, then $\{V_n, \mathcal{F}_n, n \ge 1\}$ is a martingale, since [putting $V_n = V_n^z$]

$$E^{\mathcal{F}_n}(V_{n+1}) = V_n\,E\Big(\frac{e^{zX_{n+1}}}{\varphi_{n+1}(z)}\Big) = V_n, \quad \text{a.e.},$$

for $n \ge 1$, and $E(V_n) = 1$. Thus the result will follow from the Corollary if we can show that $E^{\mathcal{F}_n}(|V_{n+1} - V_n|) \le K\chi_{[T\ge n]}$, since $E(T) < \infty$ by hypothesis. But the condition on $\mathrm{Re}(zX_n)$ and the independence of the $X_n$'s imply

$$E^{\mathcal{F}_n}(|V_{n+1} - V_n|) = |V_n|\,E\Big(\Big|\frac{e^{zX_{n+1}}}{\varphi_{n+1}(z)} - 1\Big|\Big) \le |V_n|\big(1 + E(e^{\mathrm{Re}(z)X_{n+1}})\big) \le |V_n|\,C_1 \le C_1\,\frac{e^{|\mathrm{Re}(z)X_{n+1}|}}{\prod_{i=1}^n|\varphi_i(z)|} \le C_1e^{K_1}\chi_{[T\ge n]}.$$

Hence we get $E(|V(T)|) \le C_2\sum_{n=1}^\infty P[T\ge n] = C_2E(T) < \infty$, as in (12). Consequently $1 = E(V_1) = E(V(T)) = E\big(\frac{e^{zS_T}}{\Pi_{i=1}^T\varphi_i(z)}\big)$. The last comment is immediate. $\Box$

Note that if $P[T \le t_0] = 1$, then the results of both parts follow at once. But if we consider $T_k = T\wedge k$, apply the special case just obtained, and let $k \to \infty$, then we may hope to get the general assertion. However, one has to invoke the dominated convergence theorem for this limit to be correctly established, and essentially all the above analysis will be required to find such a dominating function. It is also clear that the condition on $\mathrm{Re}(zX_n)$ employed above is somewhat artificial, and so for applications it has to be replaced by a (perhaps) stronger condition that can easily be verified.
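Theorem 6(a) is easy to test by simulation. The sketch below (not from the text) uses $N(\mu,1)$ increments and the first passage of the partial sums above level $1$, capped to keep $T$ bounded; all numerical choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, cap, reps = 0.2, 10_000, 20_000
ST = np.empty(reps); T = np.empty(reps)
for r in range(reps):
    s, n = 0.0, 0
    while s < 1.0 and n < cap:      # T = first n with S_n >= 1 (capped)
        s += rng.normal(mu, 1.0)
        n += 1
    ST[r], T[r] = s, n
print(ST.mean(), mu * T.mean())     # Wald: E(S_T) = mu * E(T), up to MC error
```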
We now turn to an application of Theorem 4 for continuous parameter processes, especially the Brownian Motion (or BM), to indicate how much additional analysis is needed. For instance, an analog of (12) demands that the sums be replaced by suitable "stochastic integrals". The following discussion is based on the work of Dvoretzky, Kiefer and Wolfowitz [1], elaborated by Shiryayev [1]. For a convenient reference, we recall briefly the Brownian Motion and a few of its immediate properties, to be used here and later on. Thus a process $\{X(t), t \in I \subset \mathbb{R}\}$ is termed a Brownian Motion (BM) if it is a Gaussian process with characteristics: (i) for each $t_1 < \cdots < t_n$, $t_i \in I$, the increments $X(t_{i+1}) - X(t_i)$, $i = 1, \ldots, n-1$, are independent normally distributed random variables with means zero and variances $\sigma^2(t_{i+1} - t_i)$, and (ii) the sample paths $t \mapsto X(t,\omega)$ are continuous for almost all $\omega$. The existence of such a process can be obtained from Theorem I.1.1 with some additional (nontrivial) work, but we outline a direct construction, because of the important continuity condition (ii), used in applications. Thus let $\xi_0, \xi_1, \ldots$ be a sequence of independent $N(0,1)$ random variables on a probability triple $(\Omega, \Sigma, P)$, which can be taken, e.g., as the one given by the classical Fubini-Jessen theorem. Then define on $[0,1]\times\Omega$ a function by the series:

$$X(t)(\omega) = \sum_{n=0}^{\infty}\xi_n(\omega)\int_0^t H_n(u)\,du = \sum_{n=0}^{\infty}\xi_n(\omega)\psi_n(t), \text{ (say)} \qquad (13)$$

where $\{H_n, n \ge 0\}$ is the complete set of Haar functions in $L^2([0,1])$. Here $h_0 \equiv 1$, and $h_{n,k} = \sqrt{2^{n-1}}\,(\chi_{A_{n,2k-1}} - \chi_{A_{n,2k}})$, $A_{n,i} = [\frac{i-1}{2^n}, \frac{i}{2^n})$, $k = 1, \ldots, 2^{n-1}$;
1, . . . , 2n ; and Hn is relabeled from hn,k such that H0 = h0 , H1 = h1,1 , H2 = h2,1 , H3 = h2,2 , etc. It can then be verified that {Hn , n ≥ 0} is a complete orthonormal sequence in L2 ([0, 1]) with Lebesgue measure, and the ψn are linearly independent continuous (but not orthogonal) functions, (called Schauder functions). Using the fact that u2
2 − 2 P [|ξn | > u] < /u for a normal density, and Bn = [|ξn | > πe √ ∞ 3 log n] satisfies n=1 P (Bn ) < ∞ so that by the Borel-Cantelli lemma, P [lim supn Bn ] = 0, and the series (13) is uniformly convergent in t for a.a. (ω). Since ψn is a bounded continuous function, this yields the result that t → X(t) is continuous with probability one. Then using Parseval’s identity with {Hn , n ≥ 0} it is seen, after a straightforward but tedious computation, that for 0 = t0 < t1 < · · · < tn < 1 we have:
ϕ_n(u_1, ⋯, u_n) = E(e^{iu_1 X(t_1) + iu_2(X(t_2)−X(t_1)) + ⋯ + iu_n(X(t_n)−X(t_{n−1}))})
  = Π_{j=1}^n e^{−u_j²(t_j − t_{j−1})/2} = Π_{j=1}^n E(e^{iu_j(X(t_j)−X(t_{j−1}))}).

This shows the existence of a BM on [0, 1]. It is extended to all of R by considering a countable collection of independent copies X^{(n)}(t) of X(t): define X(t) = X^{(1)}(t), 0 ≤ t ≤ 1, and inductively, if X(t), 0 ≤ t ≤ n, is defined, let X(t) = X^{(n+1)}(t − n) + X(n), n ≤ t < n + 1. This gives X(t), t ∈ R^+ (with X(0) = 0), and on R^− let X(−t) be an independent copy of the process just defined on R^+. The underlying probability space is the Fubini-Jessen product space again. It is clear that the details are nontrivial, but we have included the necessary sketch, which is independent of Theorem I.1.1. The same arguments show that if {X(t), t ∈ R^+} is a BM, then so are {X(t + α) − X(α), t ≥ 0} for any α ≥ 0, and {σ^{−1}X(σ²t), t ≥ 0}, σ² > 0. In particular, {X(σ²t), t ≥ 0} is a BM with E(X(σ²t)²) = σ²t. Numerous properties of this process are studied in the literature. For instance, another important fact, to be used below, is the existence of its quadratic variation: if X(t), t ∈ R^+, is a BM with a scale parameter σ² > 0, then
lim_{n→∞} Σ_{k=1}^{2^n} (X(k/2^n) − X((k−1)/2^n))² = σ², a.e.   (14)
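The series (13) and the property (14) are easy to visualize numerically. The following Python sketch is an illustrative aid, not part of the original development: the truncation level, grid, and seed are arbitrary choices, and the partial sum agrees with a true BM only at dyadic points up to the truncation scale.

```python
# Illustrative sketch of the construction (13) and the quadratic variation (14).
# Assumptions: sigma^2 = 1, series truncated at a finite level, t in [0, 1].
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 2**10 + 1)

def schauder_partial_sum(t, levels, rng):
    """Partial sum of (13): psi_0(t) = t, plus Schauder 'tents' from the Haar system."""
    X = rng.standard_normal() * t                       # integral of h_0 = 1
    for n in range(levels):
        for k in range(2**n):
            left, right = k / 2**n, (k + 1) / 2**n
            # psi_{n,k} is a tent on [left, right] of peak height 2^(-n/2 - 1)
            tent = np.clip(np.minimum(t - left, right - t), 0.0, None) * 2**(n / 2)
            X += rng.standard_normal() * tent
    return X

X = schauder_partial_sum(t, levels=10, rng=rng)
for m in (4, 6, 8, 10):                                 # dyadic partitions, cf. (14)
    incr = np.diff(X[::2**(10 - m)])
    print(f"sum of squared increments over 2^{m} intervals: {(incr**2).sum():.3f}")
```

As m grows, the printed sums settle near σ² = 1, in line with (14).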
This fact implies that the BM cannot have finite variation, for a.a. (ω), on any nondegenerate interval. This is because (14) and the (uniform) continuity of the sample paths t → X(t) imply

0 < σ² = lim_{n→∞} Σ_{k=1}^{2^n} (X(k/2^n) − X((k−1)/2^n))²
  ≤ lim_{n→∞} [max_{1≤k≤2^n} |X(k/2^n) − X((k−1)/2^n)|] × Σ_{k=1}^{2^n} |X(k/2^n) − X((k−1)/2^n)|.   (15)
The first factor on the right tends to zero by the uniform continuity, so that the second factor must tend to infinity a.e. Because of this fact, ∫_a^b f(t) dX(t) cannot be defined in the classical Lebesgue-Stieltjes sense. Therefore we introduce a new concept as well as an appropriate integral, to obtain an analog of (12), establish Wald's identities in this (continuous parameter) context for BM, and then prove the optimal character of the sequential probability ratio test for this process. This is the goal, and we proceed as follows.

The desired stochastic integral is defined using an extended boundedness principle originally due to Bochner [1]: If X = {X(t), F_t, t ∈ I ⊂ R} is a process, and f : I × Ω → R is any simple function, f(t, ω) = Σ_{i=0}^{n−1} a_i(ω)χ_{[t_i, t_{i+1})}(t), t_0 < t_1 < ⋯ < t_n, t_i ∈ I, where a_i is F_{t_i}-adapted, then X is said to be L^{p,q}-bounded relative to a σ-finite measure μ : B_I ⊗ Σ → R^+ whenever there is an absolute constant C(= C_{p,q,μ} > 0) such that

E(|τf|^q) = E(|Σ_{i=0}^{n−1} a_i(X(t_{i+1}) − X(t_i))|^q) ≤ C ∫_{I×Ω} |f(t, ω)|^p dμ(t, ω)   (16)
holds for 0 < p, q < ∞. (X then qualifies to be a stochastic integrator, as seen below.) If p = q = 2, this is L^{2,2}-boundedness, which we use often, but the general case is also needed if X is a stable process without second moments. We now show that a BM process X is L^{2,2}-bounded, and define a stochastic integral ∫_I f dX uniquely with μ = λ ⊗ P, λ = Lebesgue measure, for all f ∈ L²(μ); it is called the Itô integral. If f is nonstochastic, then we have the original L^{2,2}-boundedness and the resulting integral obtained from (16) is the classical Wiener integral. A general analysis of this concept and its key role in stochastic integration is detailed in the companion volume (Rao [21], Sec. VI.2). Let us verify that the BM process {X(t), F_t, t ≥ 0} on (Ω, Σ, P) constructed above is L^{2,2}-bounded relative to μ = λ ⊗ P, where λ is the Lebesgue measure. In fact, let f = Σ_{i=0}^{n−1} a_i χ_{[t_i, t_{i+1})}, 0 ≤ t_0 < t_1 < ⋯ < t_n < ∞, a_i ∈ B(Ω, F_{t_i}), the last space being the set of bounded real F_{t_i}-measurable functions on Ω. Note that B_t = σ(X(s), s ≤ t) ⊂ F_t, where {F_t, t ≥ 0} is a standard filtration. Now (16) becomes, on
using the identities E(h) = E(E^G(h)) for any σ-algebra G ⊂ Σ, h ∈ L¹(P), and E^G(hg) = gE^G(h) for any G-measurable bounded g:

E(|τf|²) = E[|Σ_{i=0}^{n−1} a_i(X(t_{i+1}) − X(t_i))|²]
 = Σ_{i=0}^{n−1} E[a_i²(X(t_{i+1}) − X(t_i))²] + Σ_{i≠j} E[a_i a_j(X(t_{i+1}) − X(t_i))(X(t_{j+1}) − X(t_j))]
 = Σ_{i=0}^{n−1} E[a_i² E^{F_{t_i}}(X(t_{i+1}) − X(t_i))²] + 2Σ_{0≤i<j≤n−1} E[a_i a_j(X(t_{i+1}) − X(t_i)) E^{F_{t_j}}(X(t_{j+1}) − X(t_j))],
   since F_{t_i} ⊂ F_{t_j} ⊂ Σ,
 = Σ_{i=0}^{n−1} E(a_i² E(X(t_{i+1}) − X(t_i))²) + 0,
   since X(t_{i+1}) − X(t_i) is independent of F_{t_i}, with mean 0,
 = Σ_{i=0}^{n−1} σ²(t_{i+1} − t_i)E(a_i²)
 = σ² E(Σ_{i=0}^{n−1} a_i²(t_{i+1} − t_i)) = σ² ∫_{I×Ω} |f(t, ω)|² dμ(t, ω),
where μ = λ ⊗ P. Thus (16) holds with equality and C = σ². This means the mapping τ : S ⊂ L²(I × Ω, B_I ⊗ Σ, μ) → L²(P) is bounded and linear. Further, it is a linear isometry (this fact will be used below), where S is the set of simple functions, subject to measurability for the filtration. Consequently it has a unique distance (here norm)-preserving extension to S̄, a subspace of L²(μ) with the property that for each t ∈ I, f(t, ·) is F_t-measurable. It is denoted τf = ∫_I f(t, ·) dX(t). The σ-subalgebra P generated by the set of all bounded functions with this property (relative to a filtration) is called predictable, and S̄ = L²(Ω′, O, μ), where Ω′ = I × Ω, O = B_I ⊗ P. If the f do not depend on ω, so that P = {∅, Ω}, then (16) becomes

τf = ∫_I f(t) dX(t),  E(|τf|²) = σ² ∫_I |f(t)|² dt.   (17)
This particular integral is the classical Wiener integral, and the general case of "predictable integrands" (t, ω) → f(t, ω) is the Itô integral. It can be verified that S̄ contains all the left continuous f ∈ L²(μ) for a.a. (ω). Other properties will be included as the need arises. If I = [a, t) we set Y(t) = Y(a) + ∫_a^t f(s) dX(s), and it is also symbolically expressed as the differential equation

dY(t) = f(t) dX(t), with Y(a) as boundary condition.   (18)
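The isometry (17) is what makes the Wiener integral computable in practice. A quick Monte Carlo check, with an arbitrarily chosen integrand and grid (illustrative only, not part of the text), runs as follows.

```python
# Monte Carlo check of the isometry (17): E|tau f|^2 = sigma^2 * int |f|^2 dt.
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
t = np.linspace(0.0, 1.0, 513)
dt = np.diff(t)
f = np.sin(2 * np.pi * t[:-1])                 # deterministic step integrand

n_paths = 20000
dX = rng.standard_normal((n_paths, dt.size)) * np.sqrt(sigma**2 * dt)
tau_f = dX @ f                                 # tau f = sum_i f(t_i)(X(t_{i+1}) - X(t_i))

print("E|tau f|^2 (Monte Carlo):", (tau_f**2).mean())
print("sigma^2 * int |f|^2 dt  :", sigma**2 * (f**2 * dt).sum())
```

Both printed values should be close to 1/2, the integral of sin² over [0, 1].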
This preparation allows us to establish Wald's identities for the standard BM (i.e., σ² = 1) process, a continuous extension of Theorem 6.

7. Theorem. Let {X(t), F_t, t ≥ 0} be a standard BM process on (Ω, Σ, P) with the standard filtration {F_t = ∩_{s>t} F_s, t ≥ 0}, and X(t) is F_t-adapted. Then for any stopping time T of the filtration such that E(T) < ∞, one has the following identities:

(i) E(X(T)) = 0;  E[X(T)²] = E(T).   (19)
(ii) If T is a bounded stopping time of the filtration, or more generally a first exit time in the sense that T = inf{t ≥ 0 : |X(t)| = A√(t + B)}, where 0 ≤ A < 1 and 0 < B < ∞, then one has:

E(exp[λX(T) − λ²T/2]) = 1,  λ ∈ R.   (20)
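Before turning to the proof, the identities (19)-(20), and the exit-time formula (22) derived below, lend themselves to a direct simulation. The following sketch is illustrative only: it uses Euler time-stepping with arbitrary parameters, and the discrete boundary crossing introduces a small overshoot bias.

```python
# Simulation sketch of (19), (20) and (22) for T = inf{t : |X(t)| = A*sqrt(t+B)}.
import numpy as np

rng = np.random.default_rng(2)
A, B, dt, lam = 0.5, 1.0, 1e-3, 0.7
exit_T, exit_X = [], []

for _ in range(2000):
    t, x = 0.0, 0.0
    while abs(x) < A * np.sqrt(t + B):      # stop at the moving boundary
        x += np.sqrt(dt) * rng.standard_normal()
        t += dt
    exit_T.append(t); exit_X.append(x)

T, X = np.array(exit_T), np.array(exit_X)
print("E(X(T))   ~", X.mean(), "(theory: 0)")
print("E(X(T)^2) ~", (X**2).mean(), "  E(T) ~", T.mean(),
      "  ((22): A^2*B/(1-A^2) =", A**2 * B / (1 - A**2), ")")
print("E(exp(lam*X(T) - lam^2*T/2)) ~",
      np.exp(lam * X - lam**2 * T / 2).mean(), "(theory: 1)")
```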
Proof. (i) Since [T ≥ t] = [T < t]^c ∈ F_t, the function χ_{[T≥t]} is predictable, and using dμ = dt ⊗ dP as the dominating measure of the boundedness principle, one has

E[(∫_{R^+} χ_{[T≥t]} dX(t))²] ≤ C ∫_{R^+} ∫_Ω χ²_{[T≥t]} dt dP = C ∫_{R^+} P[T ≥ t] dt = CE(T) < ∞.

Thus the integral ∫_{R^+} χ_{[T≥t]} dX(t) = ∫_0^T dX(t) = X(T) is meaningful. Hence

E(X(T)) = E(E^{F_t}(∫_{R^+} χ_{[T≥t]} dX(t))) = E(∫_{R^+} χ_{[T≥t]} E^{F_t}(dX(t))) = 0,

and similarly, since the quadratic variation of the standard BM on [0, t] is t, one gets:

E[X(T)²] = E(∫_{R^+} χ²_{[T≥t]} E^{F_t}(dX(t))²) = E(∫_{R^+} χ²_{[T≥t]} dt) = E(T).
Here in the last two expressions we used the isometry of the Itô integral noted earlier for the BM process. [In detail, we can approximate the predictable function χ_{[T≥t]} by simple functions and use the isometry.] This establishes (19).

(ii) The bounded case of T is easy. In fact, let N < ∞ be a number such that P[T ≤ N] = 1. If ϕ_t = exp[λX(t) − λ²t/2], then {ϕ_t, F_t, t ≥ 0} is a positive martingale, because ϕ_t is certainly F_t-adapted and for s < t we have:

E^{F_s}(ϕ_t) = e^{−λ²t/2 + λX(s)} E^{F_s}(e^{λ(X(t)−X(s))}) = e^{−λ²t/2 + λX(s)} · e^{λ²(t−s)/2} = ϕ_s, a.e.,   (21)
since X(t) − X(s) is independent of F_s and X(t) − X(s) ∼ N(0, t − s), so that its well-known moment generating function is employed. Thus the ϕ_t-process is a positive martingale. Now T is a bounded stopping time and t → ϕ_t is continuous with probability one. So Theorem 4 applies with T_1 = 1 ≤ T_2 = T, giving (writing ϕ_t = ϕ(t)) E(ϕ(T)) = E(ϕ(T_1)) = E(ϕ(1)) = 1, as asserted. Suppose now that T is the first exit time, as in the statement. Then, considering a dense denumerable set (e.g., the rationals), it is seen that T is a stopping time of the filtration. However, it is not bounded, and we first need to show (the nonobvious fact) that E(T) < ∞. For this, consider T_1 = inf{t ≥ 0 : |X(t)| = A}, A < ∞, and let T_{1n} = min(T_1, n). Then T_{1n} is bounded, so that by (19), since |X(T_{1n})|² ≤ A², we get A² ≥ E(X(T_{1n})²) = E(T_{1n}). Letting n → ∞, it follows that E(T_1) ≤ A² < ∞, so that T_1 < ∞ a.e.; and then by (i), since now |X(T_1)| = A, one has E(X(T_1)²) = E(T_1) = A². This result will be used for the T of the theorem. Thus, letting T_{2n} = min(T, n), one has:

E(T_{2n}) = E(X(T_{2n})²) ≤ E(A²(T_{2n} + B)), by definition of T, = A²(E(T_{2n}) + B) = A²E(T_{2n}) + A²B.
Rearranging and noting that A² < 1, it follows that

E(T_{2n}) ≤ A²B/(1 − A²) < ∞.
Now let n → ∞ to get, by the monotone convergence, E(T) ≤ A²B/(1 − A²), so that P[T < ∞] = 1, and then by (i):

E(T) = E(X(T)²) = ∫_{[T<∞]} X(T)² dP = ∫_{[T<∞]} A²(T + B) dP = A²(E(T) + B).

It finally follows that

E(T) = A²B/(1 − A²) < ∞.   (22)
Then for any N < ∞ we have

E(e^{λX(T) − λ²T/2}) = E((·)(χ_{[T≤N]} + χ_{[T>N]})) = 1 + ∫_{[T>N]} e^{λX(N) − λ²N/2} dP,   (23)

by definition of T. But one also has

∫_{[T>N]} e^{λX(N) − λ²N/2} dP ≤ e^{λA√(N+B) − λ²N/2} P[T > N].   (24)
Letting N → ∞ and using (24) in (23), one obtains (20) as desired. Remark. A comparison of Theorems 6 and 7 shows that in the discrete case the partial sum process of independent sequences can have a general (common) distribution while the continuous case is restricted to the BM process. Yet the level of sophistication of the argument needed in the latter is higher. This distinction is again seen vividly in the next result on the optimality of the Sequential Probability Ratio Test (SPRT). [More involved analysis with (sub)martingale theory can then be considered for other types of processes. Below BM is standard.] The problem considered for the SPRT is the analog of testing the equality of means, and in the continuous case, it is a question of testing whether there is “drift” for a BM process. More precisely, we consider
a BM process {X(t), F_t, t ≥ 0}, X(0) = 0, and let Y(t) = θt + X(t), or equivalently dY(t) = θ dt + dX(t) as a symbolic differential equation. Then E(Y(t)) = θt and Cov(Y(r), Y(s)) = Cov(X(r), X(s)) = min(r, s). The hypothesis H_0 : θ = θ_0 = 0 is to be tested against H_1 : θ = θ_1 ≠ 0. Now if P, Q are the respective probabilities of these hypotheses, then the likelihood ratio of the Y(t)-process on [0, t] should be calculated for any t > 0. We can obtain the likelihood ratios of all finite dimensional sets by employing the alternative procedure given in Example 1.6. Thus let π_n : 0 = s_0 < s_1 < ⋯ < s_n = t be a partition of [0, t]. Then the n-dimensional likelihood ratio is found simply and explicitly (after setting Y_k = Y(s_k)) as:

log[f^{θ_1}_{π_n}(y_1, ⋯, y_n)/f^{θ_0}_{π_n}(y_1, ⋯, y_n)] = −(1/2) Σ_{k=1}^n { [(y_k − θ_1 s_k) − (y_{k−1} − θ_1 s_{k−1})]²/(s_k − s_{k−1}) − (y_k − y_{k−1})²/(s_k − s_{k−1}) },

since θ_0 = 0. The actual multivariate normal density in this case simplifies as given because the covariance matrix (Cov(Y_i, Y_j)) has a simple pattern. Thus, writing L^n_t for the left side and simplifying after noting that y_k = Y(s_k)(ω), one gets:

L^n_t(ω) = −(1/2) Σ_{k=1}^n [θ²(s_k − s_{k−1}) − 2θ(Y(s_k) − Y(s_{k−1}))](ω) = θY(t)(ω) − (1/2)θ²t,

and hence, as n → ∞ (or |π_n| → 0), the likelihood ratio tends to f_t under both P, Q in probability. Thus

L_t = log (dQ/dP)|_{F_t} = θY(t) − (1/2)θ²t = θX(t) + (1/2)θ²t.   (25)

Here θ = θ_0 = 0 under the hypothesis, and θ = θ_1 ≠ 0 under the alternative. [In Chapter V we present explicit constructions of likelihood ratios for many types of processes that are more general than the BM class considered above.] In the sequential testing procedure, given two numbers a < b, one observes the process Y(t), t ≥ 0, as long as a < L_t < b and terminates the experiment, accepting H_0 when L_t ≤ a, or H_1 if L_t ≥ b. Thus, if L^x_t = x + L_t, let T^x_{a,b} be the first exit time of the L_t-process, which means:

T^x_{a,b} = inf{t ≥ 0 : L^x_t ∉ (a, b)},  a ≤ x ≤ b.
Let α(x) = Q[L^x(T^x_{a,b}) = a] and β(x) = P[L^x(T^x_{a,b}) = b] (the probabilities of exit at a and at b, respectively). Now by (25), L^x_t satisfies the symbolic equation dL^x_t = (1/2)θ² dt + θ dX(t), and so it is a diffusion process. But from the general theory of the latter (cf. e.g., Dynkin [1], Vol. 2, Example on p. 44, and the subsequent analysis there), one finds that α(·), β(·) satisfy the differential equations:

(i) α″ + α′ = 0;  and (ii) β″ − β′ = 0,   (26)

with the boundary conditions α(a) = 1 = β(b), α(b) = 0 = β(a). The solutions are immediately seen to be:

α(x) = (e^{a+b−x} − e^a)/(e^b − e^a);  β(x) = (e^x − e^a)/(e^b − e^a).   (27)
Further, the expected (or average) exit times of T^x_{a,b}, namely E_i(T^x_{a,b}) = m_i(x), i = 0, 1, for H_0 and H_1, are finite and satisfy the corresponding differential equations with boundary conditions:

m_0″ − m_0′ = −2,  m_0(a) = m_0(b) = 0;
m_1″ + m_1′ = −2,  m_1(a) = m_1(b) = 0,  a ≤ x ≤ b.   (28)
The solutions again are immediately obtained as:

m_0(x) = 2[(e^b − e^x)(b − a)/(e^b − e^a) − b + x],
m_1(x) = 2[(e^b − e^{a+b−x})(b − a)/(e^b − e^a) + a − x].   (28')
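These closed forms are easy to verify. The short sketch below (illustrative only; the interval endpoints are arbitrary) evaluates (27) and (28') on a grid and confirms the boundary values required by (26) and (28).

```python
# Evaluation sketch of the closed forms (27) and (28') with arbitrary a, b.
import numpy as np

a, b = -1.5, 2.0
x = np.linspace(a, b, 5)
ea, eb = np.exp(a), np.exp(b)

alpha = (np.exp(a + b - x) - ea) / (eb - ea)                       # (27)
beta  = (np.exp(x) - ea) / (eb - ea)                               # (27)
m0 = 2 * ((eb - np.exp(x)) * (b - a) / (eb - ea) - b + x)          # (28')
m1 = 2 * ((eb - np.exp(a + b - x)) * (b - a) / (eb - ea) + a - x)  # (28')

print("alpha(a), alpha(b):", alpha[0], alpha[-1])   # expect 1, 0
print("beta(a),  beta(b): ", beta[0], beta[-1])     # expect 0, 1
print("m0 at ends:", m0[0], m0[-1], " m1 at ends:", m1[0], m1[-1])  # expect all 0
# Differentiating by hand: alpha'' + alpha' = 0 and m0'' - m0' = -2, as in (26), (28).
```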
The equations (26) and (28) are from the general diffusion theory (Dynkin [1], and in a different form Mandl [1], Ch. V), and the technical details will not be reproduced here. Thus prepared, we can present the optimality result of the SPRT for the BM process. For convenience, let T denote the set of all stopping times T of the filtration such that E_i(T) < ∞, i = 0, 1, and let D be the set of decision functions δ(·) = i, denoting that H_i is accepted using a stopping time T, i = 0, 1, so that one has:

α(δ) = Q(ω : δ(ω) = 1) ≤ α;  β(δ) = P(ω : δ(ω) = 0) ≤ β.   (29)
We can now present the final result.

8. Theorem. Let α, β be the prescribed error probabilities of the hypotheses H_0, H_1, such that α + β < 1. If T ∈ T, and ā < b̄ are the boundaries determined only by α, β (with explicit expressions given below),
and δ is the corresponding decision function to terminate the sequential procedure for the standard BM, then there is an optimal stopping time T* ∈ T for testing H_0 : θ_0 = 0 vs. H_1 : θ_1 ≠ 0, in the sense that

E_i(T*) ≤ E_i(T),  i = 0, 1, ∀T ∈ T.   (30)

In fact, T* is the first exit time of the log likelihood process {L_t, F_t, t ≥ 0}, L_0 = 0, given by (25), defined as:

T* = inf{t ≥ 0 : L_t ∉ (ā, b̄)},

where ā = log(α/(1 − β)) and b̄ = log((1 − α)/β). Moreover, the optimal expected value of T* is explicitly obtained as:

E_1(T*) = 2w(α, β);  E_0(T*) = 2w(β, α),   (31)

where w(x, y) = (1 − x) log((1 − x)/y) + x log(x/(1 − y)), 0 < x, y < 1. Thus H_0 is accepted when L_t ≤ ā, and H_1 is accepted when L_t ≥ b̄.
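The quantities in the theorem are explicit enough to simulate directly. The following sketch (not the author's; an Euler discretization with coarse steps, and symmetric α = β chosen so that the two expressions in (31) coincide numerically) compares simulated mean exit times with 2w(α, β).

```python
# Sketch: SPRT for BM drift, theta_0 = 0 vs theta_1 = 1 (parameters arbitrary).
import numpy as np

def w(x, y):
    return (1 - x) * np.log((1 - x) / y) + x * np.log(x / (1 - y))

alpha = beta = 0.05
theta1, dt = 1.0, 0.01
a_bar, b_bar = np.log(alpha / (1 - beta)), np.log((1 - alpha) / beta)
rng = np.random.default_rng(3)

def mean_exit_time(theta, n_paths=400):
    total = 0.0
    for _ in range(n_paths):
        L, t = 0.0, 0.0
        while a_bar < L < b_bar:
            dY = theta * dt + np.sqrt(dt) * rng.standard_normal()
            L += theta1 * dY - 0.5 * theta1**2 * dt     # dL_t from (25)
            t += dt
        total += t
    return total / n_paths

print("boundaries a_bar, b_bar:", a_bar, b_bar)
print("E_0(T*) ~", mean_exit_time(0.0),    " vs 2*w(beta, alpha) =", 2 * w(beta, alpha))
print("E_1(T*) ~", mean_exit_time(theta1), " vs 2*w(alpha, beta) =", 2 * w(alpha, beta))
```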
Proof. For any T ∈ T, we have, with the log likelihood process L_t = (1/2)θ²t + θX(t):

E_1(L_T) = (1/2)θ_1² E_1(T) + θ_1 E_1(X(T)) = (1/2)θ_1² E_1(T), by (19).   (32)
Next, adapting the procedure of Theorem II.1.1 to the present situation with L_t = log f_t, and setting δ = 0 for H_0 and δ = 1 for H_1, we have the following simplification:

E_1(L_T) = −E_1(log (dP/dQ)|F(T)), so that, since [δ = 1] = [δ = 0]^c ∈ F(T),
 = −∫_{[δ=1]} log(dP/dQ) dQ − ∫_{[δ=0]} log(dP/dQ) dQ,
   since at T the observation is terminated and H_1 or H_0 is decided,
 = −∫_Ω log(dP/dQ) Q(dω|δ = 1)·Q[δ = 1] − ∫_Ω log(dP/dQ) Q(dω|δ = 0)·Q[δ = 0],
   since Q[δ = i] > 0, i = 0, 1,
 ≥ −Q[δ = 1] log(∫_Ω (dP/dQ) Q(dω|δ = 1)) − Q[δ = 0] log(∫_Ω (dP/dQ) Q(dω|δ = 0)),
   by the (conditional) Jensen inequality applied to the convex function −log x,
 = −Q[δ = 1] log[∫_{[δ=1]} (dP/dQ) dQ / Q[δ = 1]] − Q[δ = 0] log[∫_{[δ=0]} (dP/dQ) dQ / Q[δ = 0]],
   using the elementary definition of conditional expectation (cf., e.g., Rao [18], p. 5),
 = −Q[δ = 1] log(P[δ = 1]/Q[δ = 1]) − Q[δ = 0] log(P[δ = 0]/Q[δ = 0])
 = −(1 − α) log((1 − β)/(1 − α)) − α log(β/α), by (29),
 = w(α, β),   (34)
where w(·, ·) is the above expression, the same as in the theorem. Similarly, by interchanging P and Q in the above computation, one obtains

E_0(L_T) = E_0(log f_T) ≥ w(β, α).   (35)
However, by (27) we have:

Q[δ = 0] = α(0) = (e^{ā+b̄} − e^{ā})/(e^{b̄} − e^{ā}),  P[δ = 1] = β(0) = (1 − e^{ā})/(e^{b̄} − e^{ā}),

so that on substitution and simplification one gets

E_1(T*) = 2w(α, β),   (36)

and

E_0(T*) = 2w(β, α).   (37)
Thus

E_1(L_{T*}) = (1/2)θ_1² E_1(T*) = θ_1² w(α, β);  E_0(L_{T*}) = θ_0² w(β, α).   (38)
Substituting (34)-(37) in (32) we obtain (30), as well as (31). Remark. It can be shown that the expected (or average) sample sizes Ei (T ∗ ), i = 0, 1, for given 0 < α, β, α+β < 1, are much smaller than the
fixed sample size experiments to achieve the same error probabilities α, β (using Theorem II.1.1). In fact, for the BM noise, if 0 < α, β < 0.03 and t(α, β) denotes the corresponding fixed sample size, then it is reported in Shiryayev [1] that E_i(T*) ≤ (17/30) t(α, β), i = 0, 1. On the other hand, in the discrete parameter case, the corresponding first exit time T* will be optimal iff the error probabilities for the decision rule are exactly equal to α and β. If there is strict inequality, then simple examples show that there are optimal times better than the first exit time. Further discussion of these matters may be found, e.g., in Shiryayev [1], and it will not be included here.

One of the important observations to be made on sequential inference is that it uses, as well as motivates, deeper results from advanced topics in probability theory (e.g., the optional stopping and sampling theorems of martingales are guided by, and perhaps originate from, these results). Several other techniques extending the above considerations lead to new areas and directions (e.g., refined stochastic calculus, meaning stochastic differential equations and control theory, or the stochastic calculus of variations), making up separate branches of analysis. We thus leave the present topic at this point and turn to some problems of stochastic inference related to unbiased prediction, motivated by the work on estimation theory presented in the preceding chapter.

4.3 Weighted unbiased linear least squares prediction

We consider a fairly general (linear) and important least squares weighted unbiased prediction problem of the following type. Let the process {X(t), t ∈ R} represent an additive model expressed as:

X(t) = Y(t) + Z(t),   (1)
where Y(t) is a "signal" and Z(t) is a "noise" (possibly complex valued) process. Suppose that both Y(t) and Z(t) are of second order and uncorrelated (i.e., Y(t), Z(t) ∈ L²(P)), and let

E(Y(t)) = M(t) = Σ_{i=1}^m α_i g_i(t),   (2)
where the α_i (∈ C) are unknown constants but the g_i are given smooth functions, E(Z(t)) = 0, r(s, t) = E(Z(s)Z̄(t)) is assumed known, and the noise is either L^{2,2}-bounded or differentiable in mean. Suppose that the X(t)-process is observed continuously on an interval [a, b], and the problem is to obtain the best linear unbiased weighted estimator X̂(t_0) of Y(t_0) for t_0 > b, by finding a suitable (complex) weight function
p(·, t_0) of bounded variation defining a Bochner integral

X̂(t_0) = ∫_a^b X(t) dp(t, t_0),  E(X̂(t_0)) = M(t_0),   (3)
and such that E(|Y(t_0) − X̂(t_0)|²) is a minimum. This problem was briefly discussed by Grenander [1] for stationary processes Y(t), Z(t), and was treated in a more general form, when X(t) is given as a solution of a linear stochastic differential equation, by Dolph and Woodbury [1] (cf. also Grenander's monograph [2], pp. 202-208). A simple extension of some of their interesting results will now be presented, to focus on its potential use for second order processes that may be nonstationary, largely complementing the study of the preceding two sections. Also, this should be contrasted with a special nonlinear prediction problem solved in Theorem III.2.7 for a discrete set of observations. We introduce a new technique useful in the subject.

Let us first consider the case where the process is weakly differentiable. Later on we show how the results can be modified in the case of orthogonal increment processes, or the BM itself. Let us recall the mean differential concept (briefly mentioned prior to Theorem 2.7). A process {X(t), t ∈ R} is said to have a mean-derivative (in L²(P)) at t, denoted X′(t) or (dX/dt)(t), if

E(|(X(t + h) − X(t))/h − X′(t)|²) → 0, as h → 0, t ∈ R.   (4)

This is equivalent to saying that the covariance function (s, t) → r(s, t) of the second order process X(t) under consideration is differentiable at (t, t) on the diagonal (as can be verified without difficulty). It may then be noted that Cov(X′(s), X′(t)) = (∂²r/∂s∂t)(s, t). Higher order derivatives, denoted X^{(n)}(t), are defined by induction for n > 1. We then consider the following differential equation, in lieu of (1):

(L_t X)(t) = Σ_{j=0}^n α_j(t) X^{(n−j)}(t) = Z′(t),  (α_0 ≠ 0)   (5)
where the α_i are (n − 1) times continuously differentiable real functions defining the linear differential operator L_t, the 'noise' process Z(·) also has a mean-derivative, and Z is itself representable as a simple vector-valued integral:

Z(t) = ∫_R e^{itλ} dξ(λ).   (6)
Here ξ : B → L²_0(P)(= {f ∈ L²(P) : E(f) = 0}) is σ-additive (or, equivalently, is L^{2,2}-bounded) with the property that E(ξ(A)ξ̄(B)) =
μ(A, B) defines a spectral bimeasure, i.e., μ(·, B), μ(A, ·) are σ-additive on the Borel σ-algebra B of R into C. Such a process Z(t) is called (weakly) harmonizable. If the integrator function ξ has orthogonal values in L²_0(P), so that μ(A, B) = μ̃(A ∩ B), then Z(t) becomes a (weakly or Khintchine) stationary process with μ̃ as its spectral measure, which is actually a Borel measure on B → R^+. Second order processes are subject to a considerable amount of the familiar Hilbert space geometry and analysis, which we exploit in the ensuing presentation. It is verified without much difficulty that Z of (6) has the (mean-)derivative iff its bimeasure satisfies the moment condition:

|∫*_R ∫*_R λλ′ μ(dλ, dλ′)| < ∞,   (7)
where this double integral relative to a bimeasure is a strict Morse-Transue integral. The latter has a slightly weaker definition than the usual Lebesgue-Stieltjes integral, and was developed fully by Morse and Transue [1]. Briefly, this can be recalled as follows. If f, g : R → C are Borel functions and μ : B × B → C is a bimeasure, then (f, g) is MT-integrable relative to μ provided: (i) f is μ(·, B)-integrable and g is μ(A, ·)-integrable in the Lebesgue sense for each pair A, B ∈ B; (ii) for the (complex) measures ν_1^g : A → ∫_R g(y)μ(A, dy) and ν_2^f : B → ∫_R f(x)μ(dx, B), f is ν_1^g-integrable and g is ν_2^f-integrable, and

∫_R f(x)ν_1^g(dx) = ∫_R g(y)ν_2^f(dy) [= ∫_R ∫_R (f(x), g(y))μ(dx, dy)],   (8)
where the double integral in brackets is the common value, defined to be the MT-integral. It is the strict MT-integral if, moreover, (8) holds for the restriction μ_{AB} = μ|(B(A) × B(B)) for each pair A, B ∈ B, and is denoted with a 'star':

∫*_A ∫*_B (f, g)μ(dx, dy) = ∫_A f(x)ν_1^B(dx) = ∫_B g(y)ν_2^A(dy),   (9)

where ν_2^A = ν_2^f|B(A) and ν_1^B = ν_1^g|B(B).
Without this additional restriction the dominated convergence theorem can fail for μ. The necessary detailed account, with (counter)examples, is given in Chang and Rao [1]; it will be utilized here. The reason for this detour is that μ need not have finite Vitali (i.e., the usual) variation, but it always has (the weaker) finite Fréchet variation. Recall that the Vitali (Fréchet) variation of μ, denoted |μ| (‖μ‖), is given (the first and last lines being the definitions) by:

‖μ‖(R, R) = sup{|Σ_{i,j=1}^n a_i b̄_j μ(A_i, B_j)| : A_i, B_j disjoint Borel sets, |a_i| ≤ 1, |b_j| ≤ 1, complex}
 ≤ sup{Σ_{i,j=1}^n |μ(A_i, B_j)| : A_i, B_j as above}
 = |μ|(R, R).   (10)
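For a bimeasure supported on finitely many atoms, both variations in (10) reduce to finite computations, and the inequality can be strict. The following toy Python computation (illustrative only, with an arbitrary matrix of atom masses) makes this concrete.

```python
# Toy computation of the two variations in (10) for mu(A_i, B_j) = M[i, j].
import itertools
import numpy as np

M = np.array([[1.0, -1.0, 0.5],
              [0.5,  1.0, -1.0]])

vitali = np.abs(M).sum()
# Over real scalars |a_i| <= 1, |b_j| <= 1 the supremum is attained at +-1:
frechet = max(abs(np.array(a) @ M @ np.array(b))
              for a in itertools.product((-1.0, 1.0), repeat=M.shape[0])
              for b in itertools.product((-1.0, 1.0), repeat=M.shape[1]))
print("Frechet variation:", frechet, " <= Vitali variation:", vitali)
```

Here the Fréchet variation is 4.0 while the Vitali variation is 5.0, an instance of the inequality in (10).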
It should be remarked that if |μ|(R, R) < ∞, then the integrals (8) and (9) coincide with the Lebesgue-Stieltjes integrals (for signed measures), and in this case the Z(t)-process given by (6) is termed strongly harmonizable, which is equivalent to the concept introduced by Loève [1]; and the weak harmonizability is the same as Bochner's V-boundedness [2]. [One should also observe that in (8) and (9), when the integrals exist they are automatically equal, and the latter is not an additional assumption; but for simplicity we use the concept as stated above.] These notions and their distinctions were first discussed in Rao [13]. Both harmonizabilities include weak stationarity. They will be used here because most of the earlier work with the stationarity hypothesis extends to the harmonizable cases, using suitably modified arguments. We also indicate later some changes needed if the Z-process is of orthogonal or independent increments, for which neither the mean-derivative nor the representation (6) holds. [However, the Z(t) we use will always satisfy an L^{2,2}-boundedness condition, and hence no serious problems arise due to these changes, shown below.] When X satisfies (5), it is called a stochastic flow of order n driven by the (harmonizable) noise process Z. On the other hand, X given by (1) is a (more general) signal plus noise model, and is used for linear unbiased prediction as defined there, among others. Both these models are analyzed in what follows since, although they are distinct in appearance, they are also related. This is because X̃ = X − Y of (1) will satisfy (5) under appropriate conditions. For later use, when Z is not necessarily differentiable, we express (5) as:

∫_{R^+} ϕ(t)(L_t X)(t) dt = ∫_{R^+} ϕ(t)Z′(t) dt = ∫_{R^+} ϕ(t) dZ(t),   (11)
for all continuous ϕ with compact supports. Then the left side integral is defined as either of the right side ones in (11), where the middle integral is (by Fubini’s theorem) well-defined, or may also be understood as a Bochner integral, and the extreme right side one is a simple stochastic integral since the Z-process in all cases considered above is L2,2 -bounded as is easily seen. Thus (5) can and will be interpreted as (11). The following properties of the operator Lt from the theory of differential equations will be employed here. Thus if u and v are two
deterministic functions on R such that

(L_t u)(t) = v(t),  u^{(k)}(t_0) = 0, 0 ≤ k ≤ n − 1, t_0 ∈ R^+,   (12)

then there is a unique solution given by

u(t) = Σ_{k=1}^n ϕ_k(t) ∫_{t_0}^t [W_k(ϕ_1, ⋯, ϕ_n)(s)/W(ϕ_1, ⋯, ϕ_n)(s)] v(s) ds = ∫_{t_0}^t R(s, t)v(s) ds (say).   (13)
Here (ϕ_1, ⋯, ϕ_n) are the linearly independent solutions of the homogeneous equation L_t u = 0, W(ϕ_1, ⋯, ϕ_n) is the Wronskian determinant of the system, and W_k is the same as W with its kth column replaced by (0, …, 0, 1). [W never vanishes since the ϕ_k's are linearly independent; cf., e.g., Coddington and Levinson [1], pp. 87-88.] Also, by the classical theory, the kernel R(·, ·) is an extended Green function (it reduces to the familiar Green function if all the α_i are constants; the general form is also termed a Riemann function). The classical theory further implies (cf. e.g., Ince [1], Ch. XI, particularly p. 256, or Hurewicz [1], p. 54, Theorem 13) that the kernel R satisfies the following conditions of interest for our computations below (in fact, R can be considered alternatively as a unique solution of the ensuing relation (14)):

(i) R(s, ·) is (n − 1) times differentiable and satisfies the formal adjoint equation in M_t of L_t, namely

(M_t R_s)(t) = Σ_{k=0}^n (−1)^k (α_k(t)R_s(t))^{(n−k)} = 0,   (14)

with the boundary conditions at t_0 (cf. (12)):

(D_2^k R)(t, t) = 0, 0 ≤ k ≤ n − 2;  (D_2^{n−1} R)(t_0, t_0) = (−1)^{n−1}(α_0(t_0))^{−1},
where D_1 = ∂/∂s and inductively D_1^k = D_1(D_1^{k−1}); similarly D_2 = ∂/∂t and the D_2^k are defined relative to the first and second variables of the kernel R;
(ii) (L_t R(·, t_0))(t) = 0;
(iii) (D_1^k R)(t_0, t_0) = 0, 0 ≤ k ≤ n − 2;
(iv) (D_1^{n−1} R)(t_0, t_0) = (α_0(t_0))^{−1}.

Thus the function R* defined by R*(s, t) = R(t, s) is a solution of (L_t R*_{t_0})(t) = 0, which is established using the so-called Lagrange identity (cf., e.g., Coddington and Levinson [1], p. 86):

∫_{t_0}^t [f(s)L_s(g)(s) − g(s)M_s(f)(s)] ds = Σ_{k=0}^{n−1} g^{(k)}(s)[Σ_{j=k}^{n−1} (−1)^{j−k}(α_{n−j−1}(s)f(s))^{(j−k)}] |_{s=t_0}^{s=t}.
Using these properties, we can represent the solution of (5) (the stochastic flow) as (the integral being the same as (13) with u = X and v = Z′ there):

X(t) = ∫_{t_0}^t R_t(s) dZ(s)  (= ∫_{t_0}^t R(t, s)Z′(s) ds).   (15)
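For the simplest case n = 1, the representation (15) can be checked directly against a pathwise solution of the equation. The sketch below (illustrative; it assumes L_t X = X′ + cX with constant c, BM noise, and an Euler grid, all arbitrary choices) compares the two computations on the same noise path.

```python
# Sketch of the flow representation (15) for n = 1: here R(t, s) = exp(-c(t - s)).
import numpy as np

rng = np.random.default_rng(4)
c, T, N = 0.8, 2.0, 2000
t = np.linspace(0.0, T, N + 1)
dt = t[1] - t[0]
dZ = np.sqrt(dt) * rng.standard_normal(N)

X_euler = np.zeros(N + 1)                 # dX = -c X dt + dZ, X(0) = 0
for i in range(N):
    X_euler[i + 1] = X_euler[i] * (1 - c * dt) + dZ[i]

X_green = np.array([np.exp(-c * (ti - t[:i])) @ dZ[:i] if i else 0.0
                    for i, ti in enumerate(t)])
print("max |Euler - Green| (should be O(dt)):", np.max(np.abs(X_euler - X_green)))
```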
A description of the second order characteristics of the flow, when the α_i are C^{n−1}-smooth, is given by the following result, which shows the covariance function to be far smoother than one might expect:

1. Theorem. Let {X(t), t ∈ R} be a process (or flow) governed by (5) or (11). Then its covariance function r (exists and) is of triangular type, in the sense that it is representable as:

r(s, t) = ∫*_R ∫*_R g(s, λ)ḡ(t, λ′) dG(λ, λ′),   (16)

where G is a (necessarily positive definite) function of finite Fréchet variation, g is strictly MT-integrable for G, and r is 2n − 2 times continuously differentiable, while its (2n − 1)st derivative may have a jump-discontinuity at s = t given by

lim_{ε↓0} [(D_1^{2n−1} r)(t, s + ε) − (D_1^{2n−1} r)(t, s − ε)] = (−1)^n (α_0(t))^{−2}.   (17)
Proof. That the covariance function r : (s, t) → E(X(s)X̄(t)) [using E(Z(t)) = 0 = E(Z′(t)), so that by (15) E(X(t)) = 0] is of triangular type follows from the computation:

r(s, t) = ∫_{t_0}^s R_s(u)[∫_{t_0}^t R̄_t(v) E(Z′(u)Z̄′(v)) dv] du,
  the interchange of the integrals and expectation being valid by an appropriate use of Fubini's theorem,
 = ∫_{t_0}^s ∫_{t_0}^t R_s(u)R̄_t(v)[∫*_R ∫*_R e^{iuλ−ivλ′} λλ′ dF(λ, λ′)] du dv,
  where F(·, ·) is the bimeasure of Z,
 = ∫*_R ∫*_R [∫_R e^{iuλ} R̃_s(u) du][∫_R e^{−ivλ′} R̄̃_t(v) dv] dG(λ, λ′),
  where R̃_s(u) = (χ_{[t_0,s)} R_s)(u) and dG(λ, λ′) = λλ′ dF(λ, λ′),
 = ∫*_R ∫*_R k̂_s(λ) k̂̄_t(λ′) dG(λ, λ′), (say).   (18)
Here k̂ = [ · ] is a bounded continuous function which is G-integrable in the strict MT-sense, and hence r has the stated representation. Since, by definition, r(s, t)(= r̄(t, s)) and R are both hermitian, it is seen that conditions (i)-(iv) preceding the theorem hold for r. But the operators L_s, L_t and the integrals for r commute, so that property (ii) for R_s implies that (L_s r)(s, t_0) = 0 = (L_t r)(s, t_0). Hence, by (17) (or equivalently (18)), D_1^k r exists for 0 ≤ k ≤ n − 1; and then, since (D_1^n R)(s, t) = −Σ_{k=1}^n α_k(t)(D_1^{n−k} R)(s, t), D_1^n r also exists. By the hermitian property of r, D_2^k r, 0 ≤ k ≤ n, exists. Also, by hypothesis, the α_i are C^{n−1}-smooth. So, using induction, one can continue the differentiation for n < k ≤ 2n − 1. Then, for the last one,

lim_{ε↓0} (D_1^k r)(s − ε, s) = ∫_{t_0}^s ∫_{t_0}^s (D_1^k R)(s, u)R(s, v) du dv + (−1)^{2n−1} δ_k^{2n−1}/α_0²(s).

(Cf. Ince [1], p. 255.) Here δ_k is the Kronecker delta. This implies the last statement on taking k = 2n − 1.

Interpreting Z′(t) dt as dZ(t) in the above integrals, the preceding statements hold for orthogonal increment processes with easy modifications, as we now show. Thus suppose that the Z(t)-process above is not stationary or harmonizable but has orthogonal increments (or is a BM-like process), so that for s < t, Z(s) ⊥ (Z(t) − Z(s)). For simplicity, again let E(Z(t)) = 0. Then r(s, t) = E(Z(s)Z̄(t)) satisfies:
(= r(min(s, t))),
(19)
and if H(s) = r(s, s), then for s ≤ t, H(s) ≤ H(t). It may be called the variance function. Thus if X(t) is a solution of (5) or (11) with Lt , the differential operator, as given before, then its integral representation (15) becomes: t X(t) = R(t, s) dZ(s), (20) t0
which is a simple stochastic integral since it is immediately seen that Z is L2,2 -bounded relative to the (σ-finite) dominating measure μ ⊗ P where μ([ti+1 , ti )) = E(|Z(ti+1 ) − Z(ti )|2 ). If the Z(t)-process is
186
IV Inference for Classes of Processes
the BM then μ is the Lebesgue measure, and one gets the covariance representation in lieu of (16) or (18), as follows:
s∧t
R(s, u)R(t, u) dμ(u),
r(s, t) =
(21)
t0
where R is the generalized Green or Riemann function. With these modifications the preceding argument of the theorem yields the following result. 2. Proposition. Let {X(t), t ≥ t0 } be a second order process satisfying (11) where the noise is one of orthogonal increments (or a BM process). Then its covariance function r is representable as (21). Moreover, r is (2n − 2) times continuously differentiable and its (2n − 1)st derivative also exists but may have a jump discontinuity on the diagonal s = t, given by (17), and further (Mt Lt r)(t, s) = 0. The last statement may be directly verified using properties (i)(iv) for the kernel R. Complete details of this evaluation are given in Grenander [2], pp.203-205, and will not be reproduced here. Note that if we define h(t, v) = (χ[t0 ,t) R)(t, v), then the covariance (21) can be expressed as: ¯ v) dμ(v). h(s, v)h(t,
r(s, t) =
(22)
R
If h is replaced by a jointly measurable function that is square integrable relative to μ, then the covariance (and also the process) is said to be of Karhunen class. The more general r (and its process)given by (16) is termed of Cram´er class. Thus it is an interesting characteristic of second order (nth -degree) stochastic flows driven by harmonizable noises that they belong to a Cram´er class, and those driven by orthogonal noises belong to a Karhunen class. It may also be noted that if μ of (22) concentrates at a point v0 with mass p, then (22) implies that √ r(s, t) = f (s)f¯(t) where f (s) = ph(s, v0 ), and this is the covariance of the process X(t) = f (t)ξ where E(ξ) = 0, V arξ = 1 and the latter is (strongly) harmonizable iff f is the Fourier transform of a function on R. This suggests applications of the above results for ‘factorizable’ covariances. The idea can be pursued further in the context of harmonizable noises but we will omit its discussion here. We now turn to the prediction problem outlined in (1)-(3). Let us restate it in light of the preceding work on flows. Thus to incorporate the ideas of (1) and (5), allowing the signal and noise processes to be correlated as well, we assume the stochastic signal to satisfy E(Y (t)) = m M (t) = k=1 ak gk (t), ai ∈ C, the gk being smooth. It is desired
187
4.3 Weighted unbiased linear least squares prediction
to predict X(t0 ), t0 > b from the observation X(t), a ≤ t ≤ b, by a function: n b ˆ X (i) (t) dpi (t, t0 ), (23) X(t0 ) = i=0
a
where pi (·, t0 ), 0 ≤ i ≤ n are (complex) functions of bounded variation and X (i) is the ith -order mean-derivative of X as before, and the unbiˆ 0 )) = M (t0 ) is maintained. We want to find asedness constraint E(X(t ˆ ˆ X(t0 ) such that E(|X(t0 ) − Y (t0 )|2 ) is a minimum. The solution of a very general form of this problem is as follows: 3. Theorem. Let the signal-plus-noise model be given as X(t) = m Y (t) + Z(t), −∞ < a ≤ t ≤ b < ∞, where E(Y (t)) = i=1 ai gi (t), ai ∈ ¯ rzy (s, t) = E(Z(s)Y¯ (t)), and C, E(Z(t)) = 0, ryz (s, t) = E(Y (s)Z(t)), ¯ in which the conry (s, t) = Cov(Y (s), Y (t)), rz (s, t) = E(Z(s)Z(t)) stants ai are unknown but the parameters gi , ry , rz , ryz , rzy are known, and n-times continuously differentiable so that X (k) exists for all 0 ≤ k ≤ n in the mean. Then a best linear weighted unbiased least squares estimator, of the form (23), exists iff the pk minimize the functional: 0 ≤ J0 (p) =
n
b
b
(D1k D2 K)(s, t) dpk (s, t0 ) d¯ p (t, t0 )
a
k,=0
a
+ ry (t0 , t0 ) + 2 2
m
λj (t0 )gj (t0 )−
j=1
n
b
[Re(D1k K1 )(t, t0 ) +
a
k=0
m
(k)
λj (t0 )gj (t)] dpk (t, t0 ), j=1 (24)
where K = ry + rz + ryz + rzy , and K1 = ry + ryz ; and where λj , j = 1, . . . , m are Lagrange multipliers. The pk giving the minimum in (24) necessarily satisfy the (Fredholm-type) linear integral equation system: Re(D1k K1 )(t, t0 )
+
m
(k) λj (t0 )gj (t)
j=1
=
n
b
(D1k D2 )(s, t) d¯ pk (s, t0 ),
a
=0
(25) for k = 0, . . . , n. The corresponding minimum mean-square error σ 2 = ˆ 0 ) − Y (t0 )|2 ) is given by σ 2 = J0 (p) where E(|X(t J0 (p) = ry (t0 , t0 )+
m j=1
λj (t0 )gj (t0 )−
n k=0
b
Re(D1k K1 )(t, t0 ) d¯ pk (t, t0 ).
a
(26)
188
IV Inference for Classes of Processes
∂ ∂ [As before, D1 = ∂s , D2 = ∂t and D1k , D2k are higher order differential operators, and Re=real part.]
Remark. After presenting the proof, we shall specialize this result if Z is an orthogonal increment (or BM) process and find that with n = 0, 1 and ryz = 0 = rzy it coincides with the main theorem due to Dolph and Woodbury [1]. Also note that if X − Y satisfies (5) then by Theorem 1 (or Proposition 2) the smoothness hypothesis assumed here is automatically satisfied. Later an interesting application will be discussed to explain the potential of this theorem. Proof. The unbiasedness constraint applied to (23) yields m
ˆ 0 )) aj gj (t0 ) = M (t0 ) = E(X(t
j=1
=
n
b
E(X (k) (t)) dpk (t, t0 ), since under the
a
k=0
present hypothesis the integral and expectation commute, b n m (k) ai gi (t) dpk (t, t0 ), since E(X(t)) = M (t), = a
k=0 j=1
since the derivative and expectation commute.
(27)
Now (27) is an identity in the ai so that the coefficients of ai can be identified. This yields the set of equations n k=0
b
(k)
gj (t) dpk (t, t0 ) = gj (t0 ),
j = 1, . . . , m.
(28)
a
But (27) can also be written as: M (t0 ) =
n k=0
b
M (k) (t) dpk (t, t0 ),
(29)
a
k
2 ˆ where M (k) (t) = ddtM k (t). Let J = E(|X(t0 ) − Y (t0 )| ). To minimize ˆ we may express it, with Y˜ (k) (t) = Y (k) (t) − this J relative to X, (k) M (t), k = 0, . . . , n, as:
ˆ 0 ) − M (t0 ) − Y˜ (t0 )|2 ) J = E(|X(t n b (Y˜ (k) (t) + Z (k) (t)) dpk (t, t0 )] − Y˜ (t0 )|2 ), = E([| k=0
a
since X = Y + Z and use (23) plus (29), n b (Y˜ (k) (t) + Z (k) (t)) dpk (t, t0 )|2 ) + E(|Y˜ (t0 )|2 ) = E(| a
k=0
− 2ReE(
n b
k=0
(Y˜ (k) (t) + Z (k) (t) dpk (t, t0 ) · Y˜ (t0 )).
(30)
a
The three terms on the right of (30) may be simplified separately using the partial differential operators D1 , D2 applied to the covariances of Y, Z-processes. Since E(Y˜ (k) (s)Y¯ () (t)) = (D1k D2 ry )(s, t); and also E(Y˜ (k) (s)Z¯ () (t) = (D1k D2 ryz )(s, t), and similarly for rzy and rz , the first term on the right of (30) becomes: E(|
n
n
)|2 ) =
(
k=0
b
b
(D1k D2 K)(s, t) dpk (s, t0 )d¯ p (t, t0 ), (31)
a
k,=0
a
where K = ry + rz + ryz + rzy . The second term is simply E(|Y˜ (t0 )|2 ) = E(|Y (t0 ) − M (t0 )|2 ) = ry (t0 , t0 ),
(32)
and the last term becomes on setting K1 = ry + ryz : E(
n
) · Y˜ (t0 )) =
(
k=0
n k=0
b
(D1k K1 )(t, t0 ) dpk (t, t0 ).
(33)
a
Substituting (31)-(33) in (30), and writing J for J(p), p = (p0 , · · · , pn ), with the constraint (28), and Lagrange multipliers λ1 (t0 ), . . . , λm (t0 ), one gets the minimization functional as: 0 ≤ J0 (p) = J + 2
m
λj (t0 )(gj (t0 ) −
j=1
=
n k,=0
b
n k=0
(k)
gj (t) dpk (t, t0 ))
a
b
(D1k D2 K)(s, t) dpk (s, t0 ) d¯ p (t, t0 )+ a
ry (t0 , t0 ) + 2
a m
λj (t0 )gj (t0 ) − 2
j=1 m
b
(k)
λj (t0 )gj (t)] dpk (t, t0 ).
n k=0
b
[Re(D1k K1 )(t, t0 )+
a
(34)
j=1
We now need to minimize J0 (p) as p varies over a bounded (convex) set of functions of bounded variation on [a, b]. The methods of
variational calculus are not easy to apply here. However, the problem is made considerably simpler by using the following special but important device which unfortunately has no motivation, but which is suggested by some (deterministic) control problems. Namely, suppose the (related) linear system of (n + 1) integral equations has a solution (˜ pk , k = 0, . . . , n): m n b (k) k λj (t0 )gj (t) = (D1k D2 K)(s, t) d¯ pk (s, t0 ), Re(D1 K1 )(t, t0 ) + j=1
=0
a
(35) for each t ∈ [a, b]. Then we claim that p˜k minimizes (34). To prove the claim, suppose the pk , 0 ≤ k ≤ n, are any set of functions of bounded variation, obeying (28), that may be used in (34). Then, since the p˜k satisfy (35), we can substitute the right side of (35) for the left side term in the square brackets of the last of (34). Using the fact that D1k D2 = D2 D1k because of the continuity of these derivatives, one gets the following: n b b J0 (p) = (D1k D2 K)(s, t)d(pk − p˜k )(s, t0 ) d(pk − p˜k )(t, t0 ) k,=0
a
a
+ ry (t0 , t0 ) + 2 −
n k,=0
a
b
m
λj (t0 )gj (t0 )
j=1 b
pk (t, t0 ). (D1k D2 K)(s, t) d˜ pk (s, t0 ) d˜
(36)
a
The pk , subject to (28), are arbitrary functions of bounded variation, and the first and last terms of (36) are nonnegative by (31), the middle term being independent of pk . Thus J0 (p) ≥ 0 will be a minimum iff pk = p˜k , 0 ≤ k ≤ n, establishing the claim. So (24) and (25) follow. Finally the mean-square error is given as σ 2 = J0 (p) when pk = p˜k , 0 ≤ k ≤ n is used in (36), which is precisely (26). Analogous to Proposition 2, we can immediately present the corresponding assertion of the preceding result when the Z(t)-process is of orthogonal increments or BM, with variance function H(·). This will be recorded for ready reference, where as before, K1 = ry + ryz , K = ry + H + 2ryz and all processes are taken real for simplicity. The following is then the desired result. 4. Proposition. Let X(t) = Y (t)+Z(t), a ≤ t ≤ b, be the real observed signal plus noise of orthogonal increments model where E(Y (t)) = m j=1 aj gj (t), aj ∈ R, and let H(·) be the variance function of the Zprocess. Then the best linear unbiased weighted least squares predictor
ˆ 0 ) of Y (t0 ) of the form (or estimator) X(t ˆ 0) = X(t
b
X(t) dp(t, t0 ),
t0 > b,
(37)
a
relative to a real weight function p of bounded variation on [a, b], exists whenever p(·, t0 ) is a solution of the linear integral equation: K1 (t, t0 ) +
m
b
λj (t0 )gj (t) =
K(s, t) dp(t, t0 ),
(38)
a
j=1
where the λj (t0 ) are the Lagrange multipliers and
b
gj (t) dp(t, t0 ) = gj (t0 ),
j = 1, . . . , m.
(39)
a
When p is a solution of (38), then the minimum mean-square error of ˆ of (37) is given by σ 2 where X 2
σ = ry (t0 , t0 ) +
m
λj (t0 )gj (t0 ) −
b
K(t, t0 ) dp(t, t0 ).
(40)
a
j=1
Remark. In case the signal and noise are uncorrelated so that ryz = 0, then K = ry + H. If Y is a deterministic signal then ry = 0 also and then (38) reduces to m j=1
b
H(s ∧ t) dp(t, t0 ).
λj (t0 )gj (t) =
(41)
a
In particular, if m = 1, Y = α (a constant) so that g1 = 1, then b b λ1 (t0 ) = a H(s ∧ t) dp(s, t0 ), a dp(s, t0 ) = 1. These two equations can be used to solve for the two unknowns λ1 and p. In general, (38) and (39) have (m + 1) linearly independent equations and there are m λj (unknown) Lagrange multipliers and 1 (unknown) p, which in principle can always be determined. It should be emphasized that the key mathematical problem for a solution of the prediction question is an evaluation of the weight functions pk and the Lagrange multipliers λj from the system of integral equations (25) or even (38). This is a nontrivial step and is significant in itself for the variational calculations. To explain this point clearly, we include a simple illustration.
192
IV Inference for Classes of Processes
5. Example. Consider the process {X(t), t ≥ t0 } given by the first order Langevin type equation: dX + α(t)X(t) = Z (t), lim X(t) = X(a) (in mean). t→a dt
(42)
Then the solution of this equation is given on [a, t] by: −A(t)
t
X(t) − X(a) = e
A(u)
e
t
Z (u) du; A(t) =
a
α(u) du,
(43)
a
where α(·) is a nonstochastic continuous function. Since E(Z(t)) = 0 = E(Z (t)) = 0 we have E(X(t) − X(a)) = 0, and the covariance r(s, t) of X(t) − X(a) is given by
s
∗t
r(s, t) = a
eA(s)+A(t)−A(u)−A(v) dG(u, v).
a
Suppose now Z has orthogonal increments, so that s∧t eA(s)+A(t)−2A(u) dμ(u), r(s, t) =
(44)
a
where μ(B) = E(( B dZ(u))2 ). For a further simplified illustration, let Z be BM so that dμ(u) = du and (44) becomes
s∧t
r(s, t) = eA(s)+A(t)
e−2A(u) du,
(45)
a d d + α(t), Mt = − dt + and then (Ms Ls r)(s, t) = 0. Here n = 1, Lt = dt α(t) in the notations used for Theorem 1. The Riemann (or generalized Green) function R(s, t) for this problem is a solution of Ms R = 0 subject to the boundary conditions that for t > s, limt→∞ R(s, t) = 0 = limt→a R(s, t). Then one finds after a calculation that
R(s, t) = e−[A(s)+A(t)] ·
k(t) , (t > s) m0
k(s) , (t < s), m0 ∞ ∞ where k(s) = s e2A(u) du and m0 = a e2A(u) du > 0. With these notations and letting ϕ1 (s) = e−A(s) , ϕ2 (s) = ϕ1 (s)k(s), one notes that, on excluding the uninteresting case that k = constant, the ϕi , i = 1, 2 are linearly independent and differentiable with = r(s, t)
R(s, t) = ϕ1 (max(s, t))ϕ2 (min(s, t)).
4.3 Weighted unbiased linear least squares prediction
193
ˆ 0 ), the best predictor, one has to obtain p(·, t0 ) which is To obtain X(t a solution of the integral equation (38). Now let us express (42) in the form X = Y + Z. Thus integrating one gets: X(t) = X(a) −
t
α(u)X(u) du + Z(t) = Y (t) + Z(t) (say).
(46)
a
Then in the notation of (38), K1 = ry , K = ry + H, and since E(X(t) − X(a)) = 0 (and Z is uncorrelated with the Y -process) M (t) = ag(t), and because M (t) is a constant we can take g(t) = 1 so that (39) gives p(b, t0 ) = p(a, t0 ) + 1. Thus (38) gives
b
[ry (s, t) + H(s ∧ t)] dp(s, t0 )
ry (t, t0 ) + λ1 = a
to be solved for p, λ1 . Since the ry and H are differentiable in s we get the above as: b ∂r ∂K (t, t0 ) = (s, t) dp(s, t0 ), ∂t a ∂t (an explicit differentiation is clearly possible, but we write it formally to note the real problem here). This equation is of the form, given a function f and a kernel K(= ry + H), consider the integral equation
b
K(s, t) dp(s, t0 ),
f (t) =
(47)
a
subject to a boundary condition. When once p is obtained the desired solution is given by (37). A related problem was answered by Dolph and Woodbury [1] and we state the result for comparison. 6. Proposition. Let r be a symmetric kernel of the form r(s, t) = ϕ1 (s ∨ t)ϕ2 (s ∧ t), a ≤ s, t ≤ b where ϕ1 (b) = 0 = ϕ2 (a), and ϕ1 , ϕ2 are linearly independent satisfying a second order linear homogeneous differential equation Ls r = 0 for s = t, in that Lt h = (p(t)h ) +q(t)h = 0. Suppose the differential equation possesses a Green function over an interval containing [a, b], and moreover lim [
ε0
∂r 1 ∂r (s + ε, s) − (s − ε, s)] = − . ∂s ∂s p(s)
Now consider the integral equation f (t) =
b
r(s, t) dw(s), a
(48)
194
IV Inference for Classes of Processes
where f is twice continuously differentiable. Then w exists and a solution of (48) is given by w (t) = −(Lt f )(t), with the boundary conditions p(b) [ϕ (b)f (b) − ϕ1 (b)f (b)] ϕ1 (b) 1 p(a) [ϕ (a)f (a) − ϕ2 (a)f (a)], w(a) ¯ = ϕ2 (a) 2 w(b) ¯ =−
where w(t) ¯ = w(t + 0) − w(t − 0). The proof is based on the Lagrange identity noted before Theorem 1. This result has an extension to the equation of the form
b
[r1 + r2 ](s, t) dw(s),
f (t) =
(49)
a
where f and r1 satisfy the hypotheses of the preceding proposition and moreover if r3 = Lt r2 then the resolvent kernel r˜3 of r3 is differentiable in the first variable. Then again the weight function is differentiable, and a solution of (49) can be obtained so that equations of the form (38) with m = 2 may be solved when these additional conditions are satisfied. We shall not present the details which are somewhat involved and have little probabilistic insight, although the subject is motivated by the latter. The situation is clearly more complicated for the higher order equations which nevertheless are of interest in this study. The mathematical problem may be restated as: given a smooth function f and a (smooth) kernel K(·, ·) when is f representable as an integral of a signed measure (in the higher order case a vector measure) with K as its kernel? This is similar to (but different from) the Riesz representation theory. Such questions typically arise from studies in probability theory and lead to interesting analyses. For a related problem and its extensions, the reader can refer to Rosenberg [1], and Masani and Rosenberg [1]. It should also be observed that the weights used in the prediction problem here are analogous to, but distinct from, the Bayes priors. For one thing, the weights need only be of bounded variation and not necessarily positive so that they cannot be normalized to qualify for (prior) probabilities. More importantly, unlike the priors, these weights are not prescribed in advance but have to be obtained as solutions of certain (Fredholm) integral equations to give optimal least squares predictors, and this is a distinctly different (and generally difficult) problem in itself. 7. An Illustration. Consider a particular case of (42) with a nonstochastic signal Y and the noise Z as the O.U. process, studied in Example 1.6 earlier. This is a Gaussian process with mean zero and
195
4.3 Weighted unbiased linear least squares prediction
˜ = X − Y , and α = 0 covariance rz (s, t) = e−β|s−t| , β > 0. Thus let X ˜ X ˜ ˜ in (42) so that ddt = Z , or X(t) = X(a) + Z(t). Taking a = 0, b > ˜ = 0 so that X(0) = a1 , a.e., we get (42) as: 0, Y (a) = a1 + a2 t, X(0) X(t) = Y (t) + Z(t) = a1 + a2 t + Z(t). Hence g1 (t) = 1, g2 (t) = t, ry = 0, ryz = rzy = 0 and the integral equation of the problem becomes: f (t, t0 ) = λ1 (t0 ) + λ2 (t0 )t =
0
b
rz (s, t) dw(s, t0 ).
(50)
For this equation Proposition 6 applies, and using the notation after (48), we get: f (t, t0 ) =
0
b
e−β|s−t|
∂w (s, t0 ) ds+ ∂s
¯ t0 )e−βt , w(b, ¯ t0 )e−β(b−t) + w(0, and (39) becomes
b
0 b
s 0
∂w (s, t0 ) ds + w(b, ¯ t0 ) + w(0, ¯ t0 ) = 1 ∂s
(51)
∂w ¯ t0 ) = t0 . (s, t0 ) ds + bw(b, ∂s
(52)
Since Lt h = h − β 2 h = 0, with limt→∞ h(t) = 0 = limt→∞ h(t), one finds, after using Proposition 6, that β ∂w (s, t0 ) = [λ1 (t0 ) + λ2 (t0 )]. ∂s 2
(53)
Substituting the value of w ¯ and (53) in (51)-(52), and simplifying one gets (after a straightforward but tedious algebra): 8βb2 + 24βb + 24 − 12βt0 (bβ + 2) , β 3 b2 + 8b2 β 2 + 24bβ + 24 12 −bβ + 2t0 β . λ2 (t0 ) = b b2 β 2 + 6bβ + 12
λ1 (t0 ) =
Using these values in (53) together with w(b, ¯ t0 ), w(0, ¯ t0 ) one finds that w is given by: (w =)
2 2 ∂w β 8b β + 24bβ + 24 + 12t0 β(bβ + 2)(1 + (s, t0 ) = ∂s 2 b3 β 3 + 8b2 β 2 + 24bβ + 24
2s b
−
t t0 )
.
Using this in (37) and (40), one obtains X̂(t_0) and the mean-square error. The above procedure can also be used with an extended version of Proposition 6 if the signal is stochastic but uncorrelated with the (noise) O.U. process Z. Suppose, for instance, the signal process Y is such that E(Y(t)) = a_1 + a_2 t and r_y(s, t) = |s| ∧ |t|. Then the integral equation to be solved becomes

f(t, t_0) = λ_1(t_0) + λ_2(t_0)t + r_y(t, t_0) − w̄(b, t_0)t = ∫_0^b [r_y + r_z](s, t)(∂w/∂s)(s, t_0) ds + w̄(b, t_0)e^{−β(b−t)} + w̄(0, t_0)e^{−βt}.

This can be solved by the same procedure as before, but the computations are even more involved. The details will not be included here; they may be found in the paper by Dolph and Woodbury [1]. Since we have presented the general theory of linear prediction and its significant applicational potential in this section, let us proceed to other aspects (and models) of stochastic inference. In the next two sections, discrete parameter processes will be studied, to introduce another class of problems of interest and intensity in the subject.

4.4 Estimation in discrete parameter models

The processes to be considered here are observed at equally spaced (taken to be unit) times, so that such a process may also be regarded as a time series. This restriction enables us to borrow several (specialized) classical results to study the limiting behavior of various sequences of statistics. This is because in the continuous parameter problems the process is observed at a dense set of points of an interval and the desired approximation is made by suitable continuity properties. On the other hand, in the discrete time case, the set of observed values cannot become dense in any (infinite) compact time set. Hence the possible observational points remain countably infinite, and the structural parameters of the model have to be estimated based on a finite sample. Thus, when the number of observations increases, the asymptotic behavior of the estimators and their limit distributions for testing problems become the central objects of study. We make this point explicit by considering stochastic models defined by general (linear) difference equations, and proceed to the asymptotic analysis, which sharpens the work of Section III.4. Consider a process {X_n, n ∈ Z} defined by a difference equation:

X_n = Σ_{i=1}^k a_i X_{n−i} + u_n,   (1)
where the structural parameters (a_1, …, a_k) ∈ R^k are unknown and the u_n denote the 'noise' sequence. Thus the current value X_n (to be observed) depends linearly on the k preceding (known) ones, together with an unobservable disturbance u_n whose probabilistic structure is assumed (partly) known. The problem here is to estimate the a_i from a large set of observations X_n, n = 1, 2, …, (i.e., as n → ∞) and make inferences based on the estimators â_i(n). Now (1) may be identified with a signal-plus-noise model
X_n = Y_n + u_n,  (Y_n = Σ_{i=1}^k a_i X_{n−i})   (2)
where the 'signal' Y_n contains the (unknown) parameters a_i, analogous to the problem of the preceding section. However, the two lead to different types of analyses, even with the least squares method, which we shall use here. In fact, in the former case an estimation of the parameters a_i was not needed, but the weights (and Lagrange multipliers) had to be calculated; whereas here the weights are taken as given (set equal to unity) and the problem shifts to finding the â_i(n) and their asymptotic properties. These are two aspects of the same problem. We use the explicit representation (1). Suppose that u_n ∼ N(0, 1) and that u_1, …, u_n are independent, so that the log likelihood function is given by:

log f_n = L_n = Const. − (1/2) Σ_{j=1}^n [X_j − Σ_{i=1}^k a_i X_{j−i}]²,
where for simplicity we assume that X_{−j} = 0, j = 1, …, k (or given constants). Now, to maximize L_n for the maximum likelihood estimators (MLE) â_i(n) of the a_i, one may differentiate it relative to the a_i and set the resulting expressions to zero for the extrema, obtaining the usual "normal equations" as:

Σ_{i=1}^k â_i Σ_{m=[i,j]+1}^n X_{m−i} X_{m−j} = Σ_{m=j+1}^n X_m X_{m−j},  j = 1, …, k,   (3)
where [i, j] = max(i, j). [If X−j = αj ( = 0), j = 1, . . . , k, then the above system of equations starts from m = 1, and this makes little difference in the asymptotic analysis.] Note that the MLE a ˆi (n) of Ln are the same as those that minimize the quadratic form in Ln , and thus are the same as the least squares estimators for (1). Now using
the notation C^n_{ij} = Σ_{m=[i,j]+1}^n X_{m−i}X_{m−j} and A^n_i = Σ_{m=i+1}^n u_m X_{m−i}, (3) (with (1)) may be written compactly as:

C^n(â(n) − a)′ = (A^n)′,   (4)

where C^n = (C^n_{ij}, 1 ≤ i, j ≤ k), A^n = (A^n_1, …, A^n_k), and the estimator vector is â(n) = (â_1(n), …, â_k(n)), with prime denoting transposition of a matrix or a vector. From this one obtains (uniquely) that

(â(n) − a)′ = (C^n)^{−1}(A^n)′, or â(n) = a + A^n(C^n)^{−1},   (5)
where (C^n)^{−1} is the (generalized or Moore-Penrose) inverse of C^n. Actually, it will be seen below that for large enough n, C^n is nonsingular with probability one, and the problem is to investigate the properties of the estimators, such as unbiasedness, consistency, and their limit distributions. Note that the expressions for â(n) − a are not linear functions of the observations, unlike the work in the preceding section. If C^n were nonstochastic, then (5) would imply (since E(A^n_i) = 0) that the â(n) are unbiased (E(â(n)) = a); but in the present (stochastic) case this is not true, and one has to study their asymptotic properties for large n. Also, for the following work, since one can use the least squares method, it is not necessary that the u_n be N(0, 1). Thus hereafter it will only be assumed that the u_n are i.i.d. with means zero, unit variances, and a common distribution positive and continuous at 0. The linear model (1), as described, implies that if F_n = σ(u_k, k ≤ n), then F_n ⊂ F_{n+1} and X_n is F_n-adapted for all n. Now, to continue with the analysis, one has to consider the characteristic equation of the difference equation, as in the classical nonstochastic case, and keep track of the location of the roots of this characteristic equation relative to the unit circle of the complex plane. Thus let ϕ(z) = 0 be this equation, where

ϕ(z) = z^k − a_1 z^{k−1} − ⋯ − a_k = Π_{i=1}^k (z − z_i) = Π_{i=1}^{h_1}(z − z_i) · Π_{j=h_1+1}^k (z − z_j),   (6)
where the roots z_i may be repeated. We suppose that |z_i| ≤ 1, i = 1, …, h_1, and |z_j| > 1, j = h_1 + 1, …, k. If all |z_i| > 1, then the solution process X_n of (1) is called explosive, and if all |z_i| ≤ 1 it is nonexplosive. In the latter case, if |z_i| < 1 for all i the process is termed stable, and if |z_i| = 1 for all i it is unstable. The behavior of the solution process is different in all three cases, and the analysis has to be carried out separately by decomposing the process to reflect these divisions. To explain the structure of the problem in its simplest form, suppose that the roots of (6) are distinct and X_n = 0 for n ≤ 0. Then an explicit
solution of (1) is given as (cf., Jordan [1], p. 564, or Mann and Wald [1], p. 177):
$$X_n = \sum_{r=1}^{n}\sum_{j=1}^{k}\lambda_jz_j^{\,n-r}u_r,\qquad n = 0,1,2,\dots\tag{7}$$
where the $\lambda_j$ are constants satisfying
$$\delta_{1n} = \sum_{j=1}^{k}\lambda_jz_j^{\,n-1},\qquad n = 1,0,-1,\dots,-(k-2),$$
$\delta_{ij}$ being the Kronecker delta. (Thus $\sum_{j=1}^{k}\lambda_j = 1$.) For simplicity, let there exist a $\rho = z_1$, the (unique) maximal root, where $|\rho|>1$. Now (7) can be expressed as:
$$X_n = \sum_{j=1}^{k}\lambda_jX_{j,n},\qquad X_{j,n} = \sum_{r=1}^{n}z_j^{\,n-r}u_r,\ 1\le j\le k.\tag{8}$$
As noted above, the location of $\rho$ plays a key role in the whole analysis, and it will therefore be of prime interest to estimate it, since it is an unknown parameter (a function of $(a_1,\dots,a_k)$ itself). Using (8) in (1), one finds that
$$X_n - \rho X_{n-1} = \sum_{j=1}^{k}\lambda_j(X_{j,n}-\rho X_{j,n-1}) = u_n + \sum_{j=2}^{k}\lambda_j(z_j-\rho)X_{j,n-1} = v_n\ (\text{say}).\tag{9}$$
Now (9) is a first order difference equation in the $X_n$'s, and $\rho$ is a parameter to be estimated. The 'new noise' process $\{v_n, n = 0,1,2,\dots\}$ is a function of the $u_n$'s and $\mathcal F_n = \sigma(X_j, j\le n) = \sigma(v_j, j\le n) = \sigma(u_j, j\le n)$, but although $E(v_n^2)<\infty$, $\forall n$, the $v_n$-sequence is neither an independent nor a martingale difference sequence ($E^{\mathcal F_{n-1}}(v_n) = \sum_{j=2}^{k}\lambda_j(z_j-\rho)X_{j,n-1}\neq 0$). Still one may apply the least squares method and study the resulting estimator $\hat\rho_n$ given by:
$$\hat\rho_n = \frac{\sum_{r=1}^{n}X_rX_{r-1}}{\sum_{r=1}^{n}X_{r-1}^2} = \rho + \frac{\sum_{r=1}^{n}v_rX_{r-1}}{\sum_{r=1}^{n}X_{r-1}^2} = \rho + \frac{R_n}{Q_n}\ (\text{say}).\tag{10}$$
An immediate question is to determine whether $\hat\rho_n$ is (strongly) consistent, and then to find its limit distribution. We present a complete (positive) solution of both these problems in the next section. A related problem is to answer the same question for the vector $\hat a(n)$ given by (5). This will be treated there as well, and the work shows the intricacies attending such a specific class of processes requiring sharper answers. Some follow-up problems are also noted for a future analysis. In the same way, the covariance function of the noise or the spectral functions of the processes, used essentially for the solutions of the prediction problems in the preceding section, are generally unknown and have to be estimated (at least) consistently, and in fact their limit distributions should be investigated. These are not simple and will be discussed further in (the last) Chapter IX. Here we shall confine attention to the processes generated by a stochastic difference equation.
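Before taking up the asymptotics, here is a small simulation (ours, not the author's; the root values are arbitrary) of the estimator (10) for $k = 2$ with one root $\rho = 1.2$ outside and one root $0.5$ inside the unit circle. The first order least squares ratio $\hat\rho_n$ of (10) settles at the maximal root $\rho$, as Theorem 1 of the next section asserts.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, z2 = 1.2, 0.5                  # maximal root outside, second root inside the unit circle
a1, a2 = rho + z2, -rho * z2        # z^2 - a1 z - a2 = (z - rho)(z - z2)

n = 60                              # explosive growth of X_n limits the usable horizon
x = np.zeros(n + 2)
for m in range(2, n + 2):
    x[m] = a1 * x[m - 1] + a2 * x[m - 2] + rng.standard_normal()

X = x[2:]
for nn in (10, 20, 40, 60):
    rhohat = np.sum(X[1:nn] * X[:nn - 1]) / np.sum(X[:nn - 1] ** 2)
    print(nn, rhohat)               # the estimator (10); settles at rho = 1.2
```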
4.5 Asymptotic properties of estimators

Under the same conditions as in the preceding section, we first establish the following comprehensive result on an estimator of the maximal root that is greater than unity.

Theorem 1. Let $\{X_n, n\ge 0\}$ be a process defined by (1) of the preceding section, with a maximal root $\rho$, and suppose that $\hat\rho_n$ is defined by (10) there. If the hypotheses there are satisfied (i.e., the noise process $u_n$ consists of i.i.d. random variables with $E(u_n) = 0$, $E(u_n^2) = 1$, and $|\rho|>1$, the roots being simple), then (a) $\lim_{n\to\infty}\hat\rho_n = \rho$ with probability one (i.e., $\hat\rho_n$ is a strongly consistent estimator), and (b) supposing also that $|\rho|>1>|z_i|$, $i = 2,\dots,k$, we have that
$$\lim_{n\to\infty}P\Big[\frac{\lambda_1|\rho|^n}{\rho^2-1}(\hat\rho_n-\rho) < x\Big] = F(x)\tag{1}$$
exists at all continuity points of $F$ (and at all $x\in\mathbb R$ if the distribution of $u_n$ is continuous). The limit distribution $F$ depends on that of the noise $u_n$. In particular, if the common distribution of the $u$'s is $N(0,1)$, then $F(\cdot)$ is a Cauchy distribution. Moreover one also has:
$$\lim_{n\to\infty}P\Big[\Big(\sum_{i=1}^{n}X_{i-1}^2\Big)^{1/2}(\hat\rho_n-\rho) < x\Big] = \tilde F(x),\tag{2}$$
for all $x\in\mathbb R$, where $\tilde F$ is normal with mean zero and a finite positive variance, depending only on the roots $z_i$.

Remark. We shall sketch the essential details of the proof to illustrate the basic method, which generalizes. Although these involve some messy computations, and hence are long, the basic probability results needed are elementary, in that only the (first) Borel–Cantelli lemma and the convergence of series are used. Also the limit distribution depends on that of the disturbances, in contrast to the classical central limit theory when there is no root outside the unit circle. The limit distribution is Gaussian when the $u_n$'s are $N(0,\sigma^2)$ and the normalizing factor is also a random function and not $|\rho|^n(\rho^2-1)^{-1}$, which is somewhat nonclassical. Some readers may find it more convenient to glance at the outline, as a similar method will be sketched in Section IX.5, and return to study these arguments later.
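The contrast between the two normalizations in (1) and (2) can be seen in a quick Monte Carlo experiment. The sketch below is ours, with parameter values of our own choosing and $k = 1$, so that $\lambda_1 = 1$ and $s(n) = |\rho|^n(\rho^2-1)^{-1}$: the deterministically normalized error shows Cauchy-like heavy tails, while the randomly normalized one behaves like $N(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, reps = 1.5, 50, 10000
s_n = abs(rho) ** n / (rho ** 2 - 1)        # s(n) with k = 1, lambda_1 = 1

det_norm = np.empty(reps); rand_norm = np.empty(reps)
for i in range(reps):
    x = np.zeros(n + 1)
    for m in range(1, n + 1):
        x[m] = rho * x[m - 1] + rng.standard_normal()
    rhohat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    det_norm[i] = s_n * (rhohat - rho)
    rand_norm[i] = np.sqrt(np.sum(x[:-1] ** 2)) * (rhohat - rho)

print("deterministic s(n): median |.| =", np.median(np.abs(det_norm)))  # ~1 for a standard Cauchy
print("random S(n)      : std dev    =", np.std(rand_norm))             # ~1 for N(0,1)
```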
Proof. (a) For convenience and clarity we present the argument in a series of steps. To simplify, let $s(n) = |\lambda_1|\,|\rho|^n(\rho^2-1)^{-1}$ and
$$V_n = (\rho^2-1)^{\frac12}\sum_{i=1}^{n}\rho^{-i}u_i.$$
Then by the classical Kolmogorov two series theorem $V_n\to V$ a.e., as $n\to\infty$, and $P[V = 0] = 0$, since $P[u_n = 0] = 0$ and the distribution of $V_n$ is a convolution. We wish to show that $\frac{R_n}{Q_n}\to 0$ with probability one, using the notation in (10) of the last section.

1. $\lim_{n\to\infty}\frac{1}{s^2(n)}Q_n = V^2$, a.e., where $V$ is the random variable defined just above as the limit of the $V_n$.
Indeed, using the representation (8) of $X_n$ of the last section, we have
$$\frac{1}{s^2(n)}Q_n = \frac{\lambda_1^2}{s^2(n)}\sum_{j=1}^{n}X_{1,j-1}^2 + \frac{1}{s^2(n)}\sum_{m=1}^{n}\Big[\sum_{q,q'=2}^{k}\lambda_q\lambda_{q'}X_{q,m-1}X_{q',m-1}\Big].\tag{3}$$
The first term on the right side of (3) can be written as a function of the $u_n$'s, after an elementary but somewhat tedious algebra, as:
$$\frac{\lambda_1^2}{s(n)^2}\sum_{j=1}^{n}X_{1,j-1}^2 = \frac{\rho^2-1}{\rho^2}\Big(\sum_{j=1}^{n-1}\rho^{-j+1}u_j\Big)^2 - \tilde R_n,\tag{4}$$
where
$$\tilde R_n = \frac{\rho^2-1}{\rho^{2n}}\Big[\sum_{j=1}^{n-1}u_j^2 + 2\Big(\rho\sum_{j=1}^{n-2}u_ju_{j+1} + \cdots + \rho^{n-2}u_1u_{n-1}\Big)\Big].$$
The first term here has expected value $O\big(\frac{n}{\rho^{2n}}\big)$ and is nonnegative, which, by Markov's inequality and the (first) Borel–Cantelli lemma, tends to zero with probability one. The second term of $\tilde R_n$ is a partial
sum of independent random variables with means zero and variances of the order $O\big(\frac{n(n-1)}{\rho^{2n}}\big)$, and hence by Čebyšev's inequality and the Borel–Cantelli lemma again, it tends to zero with probability one. The second term of (3) is dominated (use the CBS-inequality) by the product:
$$\Big[\frac{1}{s^2(n)}\sum_{m=1}^{n}\lambda_q^2X_{q,m-1}^2\cdot\frac{1}{s^2(n)}\sum_{m=1}^{n}\lambda_{q'}^2X_{q',m-1}^2\Big]^{\frac12},\qquad q,q'\ge 2.\tag{5}$$
Each of the factors inside the square root in (5) tends to zero with probability one, since each is positive with mean given by ($z_q$ being a characteristic root dominated by $\rho$):
$$\frac{\lambda_q^2(\rho^2-1)^2}{\lambda_1^2\rho^{2n}}\sum_{m=2}^{n}\sum_{j=1}^{m-1}z_q^{2(m-1-j)} = O\Big(n^2\Big(\frac{z_q}{\rho}\Big)^{2n}\Big),\tag{6}$$
and since $|\frac{z_q}{\rho}|<1$ this converges to zero exponentially fast. So by the Borel–Cantelli lemma this (and similarly the second) factor converges to zero with probability one. These three statements imply the assertion of this step.

2. $\lim_{n\to\infty}\frac{1}{s^2(n)}R_n = 0$ with probability one.
For, by substituting for $v_n$ in terms of the $u_n$ from (9) of the last section, one has:
$$\frac{1}{s^2(n)}\sum_{m=1}^{n}v_mX_{m-1} = \frac{1}{s^2(n)}\sum_{m=1}^{n}u_mX_{m-1} + \sum_{j=2}^{k}(z_j-\rho)\lambda_j\frac{1}{s^2(n)}\sum_{m=1}^{n}X_{j,m-1}X_{m-1}.\tag{7}$$
The first term on the right side of (7) has mean zero, and its variance is found to be
$$\frac{1}{s^4(n)}\sum_{m=2}^{n}\sum_{j=1}^{m-1}\big(z_q^{\,m-1-j}\big)^2 \le \frac{(\rho^2-1)^4\,n^2}{\lambda_1^4\,\rho^{2n}} = O\Big(\frac{n^2}{\rho^{2n}}\Big).\tag{8}$$
Hence by Čebyšev's inequality and the Borel–Cantelli lemma, the term tends to zero with probability one. The last term on the right is dominated (using the CBS-inequality) by the square-root of the product:
$$\Big(\frac{1}{s^2(n)}\sum_{m=1}^{n}X_{j,m-1}^2\Big)\Big(\frac{1}{s^2(n)}\sum_{m=1}^{n}X_{m-1}^2\Big),$$
for $j = 2,3,\dots,k$. Now the first term of the product tends to zero a.e., by the argument of (6) above, and the second factor tends to $V^2$ a.e., so that the product, and hence the right side of (7), tends to zero a.e., which is the assertion of the step.

3. Since by (10) of the last section we have
$$\hat\rho_n-\rho = \frac{R_n}{Q_n} = \Big(\frac{R_n}{s^2(n)}\Big)\Big[\frac{Q_n}{s^2(n)}\Big]^{-1}\to 0,\ \text{a.e.},\tag{9}$$
due to the above two steps, part (a) of the theorem is proved.

(b) Let us turn to the limit distribution of the estimator $\hat\rho_n$, with the additional restriction on the roots, namely $|\rho|>1>|z_j|$, $j = 2,\dots,k$. We again continue the essential details in steps. The normalizing factor here is $s(n)$ (it would have been $\sqrt n$ if $|\rho|<1$), and it is to be shown that $s(n)(\hat\rho_n-\rho)\xrightarrow{D}\xi$, whose distribution $F(\cdot)$ is the function to be obtained as in the statement. Thus (9) above should be improved to conclude that $s(n)(\hat\rho_n-\rho) = (R_n/s(n))[Q_n/s(n)^2]^{-1}\xrightarrow{D}\xi$, and that $\xi$ is not a constant, so that $\frac{R_n}{s(n)}\not\to 0$ in probability. The stochastic order relations recorded in Exercise I.6.4 will be employed freely from now on without further reference. Now the result of Step 1 above can be written in the current notation as:
$$\frac{Q_n}{s^2(n)} = V_n^2 + o_p(1).$$
To simplify $\frac{R_n}{s(n)}$, note that (substituting for $v_n$ and recalling $\rho = z_1$)
$$\frac{R_n}{s(n)} = \sum_{m=1}^{n}\frac{u_mX_{m-1}}{s(n)} + \sum_{j=2}^{k}\sum_{i=1}^{k}\lambda_i\lambda_j(z_j-\rho)\sum_{m=1}^{n}\frac{X_{i,m-1}X_{j,m-1}}{s(n)} = A_n + B_n,\quad\text{(say)}.\tag{10}$$
4. $A_n = a^nU_nV_n + o_p(1)$ for large $n$, where $V_n$ is defined at the beginning of the proof, $U_n = (\rho^2-1)^{\frac12}\sum_{m=2}^{n}\rho^{-(n-m+1)}u_m$, and $a = \operatorname{sgn}(\rho)$.
In fact, using the representation (8) of the preceding section, one has
$$A_n = \frac{\lambda_1}{s(n)}\sum_{m=2}^{n}u_mX_{1,m-1} + \sum_{j=2}^{k}\frac{\lambda_j}{s(n)}\sum_{m=2}^{n}u_mX_{j,m-1}.\tag{11}$$
For each $j = 2,\dots,k$, the last term of (11) has mean zero and variance $\sum_{m=2}^{n}E(X_{j,m-1}^2/s^2(n))\to 0$ as $n\to\infty$, as seen in Step 1 (cf., (6)), so
that it is $o_p(1)$. Regarding the first term of (11), one can simplify it directly to get:
$$a^nU_nV_n - \frac{\rho^2-1}{\rho^2}\sum_{m=2}^{n}\rho^{-(n-m)}u_m\sum_{i=m}^{n}\rho^{-i+1}u_i.$$
Here the first term is the desired one, and the second can be expressed as:
$$\frac{\rho^2-1}{|\rho|}\Big[\frac{1}{|\rho|^n}\sum_{m=2}^{n}u_m^2 + \frac{1}{|\rho|^{n+1}}\sum_{m=2}^{n-1}u_{m+1}u_m + \cdots + \frac{1}{a^{n+1}|\rho|^{2(n-1)}}u_nu_2\Big].$$
Now $\frac{n}{|\rho|^n}\cdot\frac{1}{n}\sum_{m=2}^{n}u_m^2\to 0$ a.e. (by the strong law of large numbers), since $|\rho|>1$, and the remaining expression also tends to $0$, since it has mean zero and variance of order $O(n\rho^{-2n})$, so by Čebyšev's inequality one infers that it is $o_p(1)$. This implies the assertion of the step.

5. Next consider
$$B_n = (\rho^2-1)^{\frac12}a^nV_n\sum_{j=2}^{k}\lambda_j(z_j-\rho)\sum_{m=2}^{n}\rho^{-(n-m+1)}X_{j,m-1} + o_p(1).$$
Let the second sum on the right side be denoted by $A_{jn}^*$. Now we need to separate the terms with $\rho$ in $A_{jn}^*$, and show that the rest tends to zero in probability. This computation is elementary, but depends on a more involved algebraic manipulation than in the earlier cases. Thus we may express $B_n$ as:
$$B_n = \frac{\lambda_1}{s(n)}\sum_{j=2}^{k}(z_j-\rho)\lambda_j\sum_{m=1}^{n}X_{1,m-1}X_{j,m-1} + \sum_{q,j=2}^{k}\lambda_j\lambda_q(z_j-\rho)\frac{1}{s(n)}\sum_{m=1}^{n}X_{q,m-1}X_{j,m-1}.\tag{12}$$
The second term of (12) is $o_p(1)$, after using the CBS-inequality and the fact that $|z_j|<1<|\rho|$, exactly as in (5). The first term is written (for each $j$) as:
$$\frac{\rho^2-1}{|\rho|^n}\sum_{m=2}^{n}X_{j,m-1}\sum_{r=1}^{m-1}\rho^{m-1-r}u_r = a^n(A_{jn}^*V_n - B_{jn}^*),$$
where
$$A_{jn}^*V_n = (\rho^2-1)\sum_{m=2}^{n}\rho^{-(n-m+1)}X_{j,m-1}\sum_{r=1}^{n}\rho^{-r}u_r,\tag{13}$$
and
$$B_{jn}^* = (\rho^2-1)\sum_{m=2}^{n}\rho^{-(n-m+1)}X_{j,m-1}\sum_{r=m}^{n}\rho^{-r}u_r = \frac{\rho^2-1}{\rho}\Big[\rho^{-n}\sum_{r=1}^{n-1}X_{j,r}u_{r+1} + \rho^{-n-1}\sum_{r=1}^{n-2}X_{j,r}u_{r+2} + \cdots + \rho^{-(2n-2)}X_{j,1}u_n\Big].$$
Using the simple inequality $\sigma^2(Y+Z)\le[\sigma(Y)+\sigma(Z)]^2$, where $\sigma(\cdot)$ is the standard deviation operator, one finds
$$\operatorname{Var}B_{jn}^*\le\Big(\frac{\rho^2-1}{\rho}\Big)^2\Big(\frac{\rho^{-n}\,n(n-1)}{1-z_0^2}\Big)^2 = O\Big(\frac{n^4}{\rho^{2n}}\Big),\tag{14}$$
with $z_0 = \max_{j\ge2}|z_j|<1$. Since $E(B_{jn}^*) = 0$, this implies that $B_{jn}^* = o_p(1)$, $j = 2,\dots,k$. From (12)-(14) we get the assertion of the step.

6. With (10)-(14) we have thus simplified $A_n$ and $B_n$ to get
$$\frac{R_n}{s(n)} = a^n\Big[U_nV_n + \sum_{j=2}^{k}\lambda_j(z_j-\rho)A_{jn}^*V_n\Big] + o_p(1),\tag{15}$$
where
$$A_{jn}^* = \Big(\frac{\rho^2-1}{\rho^2}\Big)^{\frac12}\Big[u_1z_j^{\,n-2}\sum_{r=2}^{n}(\rho z_j)^{-(n-r)} + u_2z_j^{\,n-3}\sum_{r=3}^{n}(\rho z_j)^{-(n-r)} + \cdots + u_{n-1}\Big].\tag{16}$$
We now combine all the above results to establish a key fact in the following step.

7. The random vector $(V_n,U_n,A_{2n}^*,\cdots,A_{kn}^*)$ converges in distribution to a nondegenerate vector, denoted $(V,U,W_2,\cdots,W_k)$, where $V_n$ is the same as in Step 1 and $A_{jn}^*$ is given by (16).
In fact, it was already noted in Step 1 that $V_n\to V$ a.e., and hence in distribution. Thus it is enough to show that the vector $Y_n = (U_n,A_{2n}^*,\dots,A_{kn}^*)$ converges to $(U,W_2,\dots,W_k) = Y$ (say) in the desired sense, and that the latter is independent of $V$. Now $Y_n\xrightarrow{D}Y$ iff the random variable $c\cdot Y_n\xrightarrow{D}c\cdot Y$ for all vectors $c\in\mathbb R^k$. For
this we show, equivalently, that the joint characteristic function (ch.f.) $\varphi_n$ of $c\cdot Y_n$, given by
$$\varphi_n(t) = E(\exp[it(c_1U_n + c_2A_{2n}^* + \cdots + c_kA_{kn}^*)]),\tag{17}$$
converges for all $t\in\mathbb R$ and all $c\in\mathbb R^k$. In the definition of $A_{jn}^*$ we need to distinguish two cases: $\rho z_j\neq 1$ and $\rho z_j = 1$. Thus if $\rho z_j\neq 1$, then (16) can be expressed as:
$$A_{jn}^* = (1-\rho z_j)^{-1}\Big[U_n - z_j(\rho^2-1)^{\frac12}\sum_{i=1}^{n}z_j^{\,n-i}u_i\Big].\tag{18}$$
Hence, using the value of $U_n$ from Step 4, one gets:
$$c_1U_n + c_2A_{2n}^* + \cdots + c_kA_{kn}^* = c_1\big(\rho^{-1}U_{n-1}+\rho^{-1}(\rho^2-1)^{\frac12}u_n\big) + \sum_{j=2}^{k}c_j\Big[\frac{(z_j-\rho)\lambda_j}{1-\rho z_j}U_{n-1} - (\rho^2-1)^{\frac12}X_{j,n}\Big] = c_1(\alpha U_{n-1}+\beta u_n) + \sum_{j=2}^{k}c_j\gamma_jX_{j,n-1},\tag{19}$$
where $\alpha,\beta,\gamma_2,\dots,\gamma_k$ are constants that depend only on the roots $\rho,z_2,\dots,z_k$, and $(c_1,\dots,c_k)\in\mathbb R^k$ is an arbitrarily fixed vector. Since $X_{j,n} = \sum_{m=1}^{n}z_j^{\,n-m}u_m$ and $|z_j|<1$, $j = 2,\dots,k$, it is seen that $X_{j,n}\xrightarrow{D}\tilde X_j$ (say). Indeed the ch.f. of $X_{j,n}$ tends to that of $\tilde X_j$, because
$$E(e^{itX_{j,n}}) = E\Big(\prod_{m=1}^{n}e^{itz_j^{n-m}u_m}\Big) = \prod_{m=1}^{n}E\big(e^{itz_j^{n-m}u_m}\big),\ \text{by independence of the } u_m\text{'s},$$
$$\qquad = \prod_{m=0}^{n-1}E\big(e^{itz_j^{m}u_m}\big),\ \text{since the } u_m \text{ are i.i.d.},\ = E\big(e^{it\sum_{m=0}^{n-1}z_j^mu_m}\big)\to E(e^{it\tilde X}),\tag{20}$$
where $\sum_{m=0}^{n-1}z_j^mu_m\to\tilde X$ a.e. (since $|z_j|<1$), and hence in distribution, establishing the assertion. We already know that $V_n\to V$ a.e. and $U_n\xrightarrow{D}U$ (the distribution of $u_m$ being the same for all $m$). It follows from (19) and (20) that $A_{jn}^*\xrightarrow{D}(1-\rho z_j)^{-1}[U - z_j(\rho^2-1)^{\frac12}\tilde X_j]$. Substituting this in (17) implies the statement of the step in this case. If, however, $\rho z_j = 1$, then the corresponding form of (16) becomes:
$$A_{jn}^* = \Big(\frac{\rho^2-1}{\rho^2}\Big)^{\frac12}\sum_{r=1}^{n-1}(n-r)z_j^{\,n-r-1}u_r = \alpha_1\tilde X_{j,n},\quad\text{(say)}.$$
Then by the same argument as in (20), one has for $j = 2,\dots,k$:
$$E(e^{it\tilde X_{j,n}}) = E\big(e^{it\sum_{r=1}^{n-1}(n-r)z_j^{n-r-1}u_r}\big)\to E(e^{itX_j}),\tag{21}$$
where $\sum_{r=1}^{n-1}(n-r)z_j^{\,n-r-1}u_r\to X_j$ a.e. (again by the two series theorem), since $|z_j|<1$, and hence it converges in distribution. Thus in both cases the random sequence $(V_n,U_n,A_{2n}^*,\cdots,A_{kn}^*)\xrightarrow{D}(V,U,W_2,\cdots,W_k)$, where $W_j = \tilde X_j$ or $X_j$ as the cases occur. This establishes the assertion completely, since $u_n$ has a nondegenerate distribution, so that the last vector does also.

8. $V$ is independent of $U$ and of the $W_j$, $j = 2,\dots,k$.
Indeed, let $V_n^* = \sum_{r=1}^{[n/2]}\rho^{-r}u_r$ and $V_n^{**} = \sum_{r=[n/2]+1}^{n}\rho^{-r}u_r$, where $[\frac n2]$ is the integer part of $\frac n2$. Then by our earlier analysis $V_n^{**}\to 0$ a.e., and so $V_n = V_n^* + o_p(1)$. On the other hand,
$$\tilde A_{jn} = \sum_{r=[n/2]+1}^{n}u_rz_j^{\,n-r},\qquad j = 1,\dots,k,\ \big(z_1 = \tfrac1\rho\big),$$
are independent of $V_n^*$ for each $n$, and $\tilde A_{jn}\xrightarrow{D}\tilde X_j$, $j = 1,\dots,k$, by the preceding step. Hence the limits $V$ and $\tilde X_j$ must also be independent (in both cases).

9. $s(n)\cdot\frac{R_n}{Q_n}\xrightarrow{D}\big(\alpha U+\beta u+\sum_{j=2}^{k}\gamma_jW_j\big)\big/V$ as $n\to\infty$, where $\alpha,\beta,\gamma_j$ are functions of the roots above.
For, by the previous analysis, $\frac{R_n}{s(n)}\xrightarrow{D}\big(\alpha U+\beta u+\sum_{j=2}^{k}\gamma_jW_j\big)V$ and $\frac{Q_n}{s(n)^2}\to V^2$ (actually a.e.). Hence by the result of I.6.4(g), the ratio converges in distribution to the quantity on the right side above. Moreover, the numerator and the denominator variables are mutually independent. Since $|a| = 1$, the $V_n$, $\tilde A_{jn}$ and $U_n$ have zero means and the same variances after multiplying by $a^n$, so it has no effect on the limit distributions. This establishes the main statements of (b). If now the $u_n$ are $N(0,1)$, then the above analysis shows that $U_n$ and the $X_{j,n}$ are linear functions of independent $u_n$'s, so that they are jointly normally distributed, and the same is true of $V_n$; both have means zero and finite positive variances which are functions of the roots $z_j$. It follows that the ratio has a Cauchy distribution, as asserted. Finally, if the normalizing factor is the random variable $S(n) = \big(\sum_{i=1}^{n}X_{i-1}^2\big)^{\frac12}$, instead of some (exponentially increasing) constant function of $\rho$, then from the work of Step 1 we get $\frac{S(n)}{s(n)}\to V$ a.e., so that
$$S(n)(\hat\rho_n-\rho) = \frac{S(n)}{s(n)}\big(s(n)(\hat\rho_n-\rho)\big)$$
$$\xrightarrow{D}V\Big[\alpha U+\beta u+\sum_{j=2}^{k}\gamma_jW_j\Big]\Big/V = \alpha U+\beta u+\sum_{j=2}^{k}\gamma_jW_j,$$
and the last has a normal distribution with mean zero and a finite positive variance. □

Remarks. 1. The detailed discussion, albeit with somewhat compressed computations, is included to show how the limiting analysis is carried out and how the solution process is decomposed into components depending on the location of the roots relative to the unit circle. The distinctness of the roots, and that there is one outside and the rest inside the circle, are used to simplify the computations. It is also important to note, by Steps 6-9, that $\frac{R_n}{s(n)}$, $\frac{Q_n}{s^2(n)}$ have a joint limit distribution in which the components are not independent, but that their ratio has a limit distribution which is the ratio of a pair of independent random variables. For $k = 1$ with the normal noise assumption, both Anderson [1] and White [1] have computed the joint limit ch.f. of this vector, and it will not factor, as expected. A separate argument is needed to find the distribution of the ratio from the joint ch.f., and this leads to an interesting analysis, to be discussed later (after Theorem 3 below). Examining this work, and that of Anderson [1] as well as White [1], Stigum [1] has made the important observation that the representation (8) of the preceding section implies that the analysis can be carried out piecewise for the series, depending on the location of the roots. Thus the solution process should be divided into explosive, unstable, and stable parts, and the previous work can be generalized. In the paper noted above, he gave a compressed account of these ideas. This was extended and elaborated by Lai and Wei [1] on the consistency part, which will be described below. The case of the roots on the unit circle presents additional problems, to be discussed later.

2. Part (b) of the theorem, on the limit distribution of the estimators, is (not surprisingly) more involved. The first order case, $k = 1$, was considered originally by White [1], who showed, under the normal disturbances assumption, that the estimator has a limit Cauchy distribution. The corresponding result when all the roots are outside the unit circle, or all are inside of it, was investigated, immediately thereafter, by Anderson [1] under essentially the same conditions as here. He observed that the case where some of the roots lie inside and some outside would be "much more involved", and an aspect of the latter is the above theorem given by the author (cf., Rao [2]). An important point of these results, first noted by Anderson
[1], is that when there is an explosive component in the solution, the limit distribution of the estimators depends on the distribution of the errors $u_n$, and the invariance principles of Probability Theory are not applicable. The consistency result corresponding to (5) of the preceding section, when $k = 2$ and one root is outside and one inside the unit circle, was also treated by the author in the above reference. The general case will now be discussed. However, the joint limit distributions of the estimators are still not completely settled in the subject.

The (strong) consistency of the estimators defined by (5) of the preceding section, in the general case that places no restrictions on the roots, can be considered by extending the method of proof of Theorem 1, using the decompositions of the solution process. The computations in the above work (particularly in part (a)) did not use the full force of the i.i.d. assumption on the $u_n$'s. Expressions such as $\sum_{i=1}^{n}u_iX_{i-1}$ form (not an independent sum but) a "martingale transform", and with enough moments all the work can be carried out when the $u_n$'s are such that this transform condition holds. Since now the $u_n$ do not have the same distribution, one has to assume more than two moments (as suggested by the classical Liapounov central limit theorem for non-i.i.d. variables). In the earlier cases, although in the literature the conclusions were stated as (weak) consistency, the computations there show that (via the Borel–Cantelli lemmas) they are really strong consistency results. [In the case $k = 1$, the use of the martingale argument was illustrated in the textbook (Rao [15], Sec. 6.1).] An extension of the (strong) consistency result, when the $u_n$ are martingale differences and the roots are inside the unit circle, has been obtained by Anderson and Taylor [1] for a generalization. The strong consistency part of this "blueprint" suggested by Stigum [1] has been successfully carried out by Lai and Wei [1], obtaining the following result, in which the $u_n$ are martingale differences with uniformly bounded $2+\delta$ $(\delta>0)$ moments (corresponding to the Liapounov condition noted above).

3. Theorem. Suppose that the $X_n$-process is defined by a $k$th order stochastic difference scheme of the preceding section:
$$X_n = \sum_{i=1}^{k}a_iX_{n-i} + u_n,\tag{22}$$
where the $u_n$ form a martingale difference sequence, in the sense that $E^{\mathcal F_{n-1}}(u_n) = 0$, $n\ge1$, $\mathcal F_n = \sigma(u_1,\cdots,u_n)$, and also satisfy the bound $\sup_nE^{\mathcal F_{n-1}}(|u_n|^{2+\delta})<\infty$ a.e. for some $\delta>0$. Then the least squares estimators $\hat a_n = (\hat a_1(n),\cdots,\hat a_k(n))$ of the structural parameter vector $a = (a_1,\cdots,a_k)$, given by (5) of the last section, namely
$(\hat a_n - a)' = (C^n)^{-1}(A^n)'$, satisfy:
$$\lim_{n\to\infty}\hat a_n = a,\quad\text{a.e.}\tag{23}$$
The idea of the proof is to first establish the consistency result when all the roots are on or inside, and then when all are outside, the unit circle, using the corresponding estimates from the (by now well-understood) martingale theory. Then one decomposes the solution sequence $X_n$ of (22) into the explosive and nonexplosive parts corresponding to the location of the roots, and applies the previous results. The calculations are (still) quite involved, and we refer the reader for complete details to the paper by Lai and Wei [1] referred to above.
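To illustrate Theorem 3, the sketch below (our construction; the conditional-variance function is an arbitrary bounded choice ensuring the $2+\delta$ moment bound) runs the least squares estimator on a second order scheme whose noise is a martingale difference sequence but not i.i.d.

```python
import numpy as np

rng = np.random.default_rng(3)
a1, a2 = 0.4, 0.3                  # both roots of z^2 - 0.4 z - 0.3 lie inside the unit circle
n = 20000

x = np.zeros(n + 2)
for m in range(2, n + 2):
    sigma = 1.0 + 0.5 * np.tanh(x[m - 1])          # F_{m-1}-measurable, bounded in [0.5, 1.5]
    x[m] = a1 * x[m - 1] + a2 * x[m - 2] + sigma * rng.standard_normal()

X = x[2:]
Z = np.column_stack([X[1:-1], X[:-2]])             # lagged regressors (X_{m-1}, X_{m-2})
ahat = np.linalg.solve(Z.T @ Z, Z.T @ X[2:])
print(ahat)                                         # near (0.4, 0.3), as Theorem 3 asserts
```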
An essentially complete solution of the (strong) consistency problem, given by Theorem 3, may be considered adequate for many applications. Replacing "martingale difference" by "independence", one can then ask for the corresponding result of part (b) of Theorem 1. The situation here is even more complicated. We briefly present the status of the problem and a quick description of the available results. As noted in Remark 2 above, if some roots of the characteristic polynomial of a stochastic difference equation are outside the unit circle, then the limit distributions of the estimators depend on those of the noise. This is exemplified in Theorem 1(b) above, as well as by the earlier works of Anderson [2] and of White [1], so that the limit laws are not generally functionals of Brownian processes. However, as shown in White's work, the situation is more interesting in the case that none of the roots is outside the unit circle, since then the invariance principle applies. In the case $k = 1$ and $a = 1$ (the unit root case), White has shown that the limit distribution of the estimator $\hat a_n$ is the ratio of a pair of (nonlinear) Brownian functionals, and obtained the ch.f. of the limit variable. [If the roots are inside the unit circle and the noise variables $u_n$ are i.i.d., then the limit distribution was found to be (jointly) normal, even in the $k$th order case, already by Mann and Wald [1] under high moment assumptions, and this was improved by Anderson [2] with the second moment condition as in Theorem 1.] The work implies that the invariance principle holds when there is no root outside the unit circle. Consequently, by choosing a convenient distribution for the $u_n$ (usually $N(0,1)$), one can derive the limit ch.f. of the estimators. By inverting the latter one can, in principle, find that distribution. Unfortunately, this (Fourier) inversion is often quite difficult. For instance, in the case $k = 1$, the limit ch.f. obtained by White [1] could be considered even for the more general case that $|a| = 1$. Here an alternative method, not inversion, found by Cramér ([1], p. 317) to obtain the distributions of the ratios directly (under some analyticity conditions on the ch.f.), is applicable. Although a multidimensional extension is available (cf., Rao [14], pp. 623-631, where a bivariate version is presented which may be generalized to the $k$-variate case), the end product is still complicated. Using White's calculation of the ch.f., an explicit form of the limit distribution of $\hat\alpha_n$ when $|\alpha| = 1$ (not just $\alpha = +1$ or $-1$, which are also obtainable by the same method) has been found (cf. Rao [26]), and it can be simplified to get the other cases by setting $\alpha = 1$ and $\alpha = -1$ in the former result. [However, each of the cases $\alpha = 1$, $\alpha = -1$ and $|\alpha| = 1$ is different and gives a distinct limit distribution, as one should expect.] The $k$-dimensional case with this method has not yet been solved.

Extending the preceding results, Chan and Wei [1] have considered the $k$th-order case when the roots satisfy $|z_i|\le1$. We shall briefly describe their result, which exemplifies the above noted problems vividly. The method is again to simplify the "ratios of matrices" in (5) of the last section, and to find a joint limit distribution for the vector consisting of the numerator and denominator variables, which will be functions of Brownian motion resulting from the invariance principle (also termed the functional central limit theorem), even when multiple roots are allowed. Here is an outline of the problem. Consider again the model
$$X_n = \sum_{i=1}^{k}a_iX_{n-i} + u_n,\qquad n = 1,2,\dots,\tag{24}$$
where $\{u_n, n\ge1\}$ is the noise, $u_n$ being independent of $X_m$, $m\le n-1$. Suppose that the distinct roots of the characteristic equation
$$\varphi(z) = z^k - a_1z^{k-1} - \cdots - a_k = 0\tag{25}$$
on the unit circle are $z_1,\cdots,z_\ell$, with multiplicities $d_1, d_2, c_1,\cdots,c_{\ell-2}$. Among these, let the unit roots (say $z_1 = +1$) be $d_1$ in number, the negative unit roots (say $z_2 = -1$) be $d_2$ in number, and the pairs of complex (conjugate) roots (say $z_j = e^{\pm i\theta_{j-2}}$) be $c_{j-2}$ in number, $j = 3,\dots,\ell$. If $r = k - (d_1 + d_2 + 2(c_1+\cdots+c_{\ell-2}))$, then the remaining $r$ roots are inside the unit circle (so $|z_j|<1$ for the other roots). The component corresponding to the last $r$ roots denotes the stable process, for which the Mann–Wald theory, with refinements by Anderson [2], will give a complete limiting distribution, namely the multivariate normal. The normalizing factors are different for each of these groups. To incorporate the information on the roots, it is first convenient to decompose the solution process $X_n$ into parts corresponding to the location of the roots, analogous to (7) and (8) of Section 4 above. Now the characteristic equation (25) can be expressed as
$$\varphi(z) = (z-1)^{d_1}(z+1)^{d_2}\prod_{j=1}^{\ell-2}\big[(z-e^{i\theta_j})(z-e^{-i\theta_j})\big]^{c_j}\varphi_1(z),\tag{26}$$
where $\varphi_1(z)$ is a polynomial with the remaining $r$ roots in the interior of the circle. Using vector notation, with $\mathbf X_n = (X_n,\cdots,X_{n-k+1})'$ (a column vector), (24) can be written as
$$\mathbf X_n = A\mathbf X_{n-1} + \mathbf u_n,\tag{27}$$
where $\mathbf u_n = (u_n,0,\cdots,0)'$ and the $k\times k$ coefficient matrix $A$ is
$$A = \begin{pmatrix} a_1,\ \dots,\ a_{k-1} & a_k\\ I_{k-1} & 0\end{pmatrix},$$
$a_k\neq 0$, $I_{k-1}$ being the $(k-1)\times(k-1)$ identity matrix, so that $A$ is nonsingular. To obtain the key decomposition with (26), we use a classical result (known as the Sylvester determinant, or its equivalent, the resultant of a pair of polynomial equations, cf., Conkwright [1]), according to which the polynomials
$$p_i(z) = z^{r_i} + a_{i1}z^{r_i-1} + \cdots + a_{ir_i},\qquad i = 1,\dots,k_0,$$
have no common root iff the coefficient matrices $M_i$ ($r_i\times k$), given by
$$M_i = \begin{pmatrix}1 & a_{i1} & a_{i2} & \dots & a_{ir_i} & 0 & \dots & 0\\ 0 & 1 & a_{i1} & \dots & a_{i,r_i-1} & a_{ir_i} & \dots & 0\\ \vdots & \vdots & \vdots & \ddots & & & \ddots & \vdots\\ 0 & 0 & \dots & 1 & a_{i1} & \dots & \dots & a_{ir_i}\end{pmatrix},$$
have the property that
$$M = (M_1',\cdots,M_{k_0}')'\tag{28}$$
is nonsingular. If we take $p_1(z) = (z-1)^{d_1}$, $p_2(z) = (z+1)^{d_2}$, $p_3(z) = (z-e^{i\theta_1})^{c_1},\cdots,p_{2\ell-1}(z) = \varphi_1(z)$, $(k_0 = 2\ell-1)$, and the $M_i$ as the corresponding coefficient matrices, then the $k\times k$ matrix $M$ given by (28) for these polynomials is nonsingular. If we set (from (28))
$$M\mathbf X_n = (M_1\mathbf X_n,\cdots,M_{k_0}\mathbf X_n)' = (Y_{1n},\cdots,Y_{k_0n})',\quad\text{(say)}\tag{29}$$
then the desired decomposition is obtained, since, for instance, relative to the polynomial $p_j(z)$ one verifies directly that
$$L^{r_j}p_j(L^{-1})Y_{jn} = u_n,\qquad j = 1,\dots,k_0,\tag{30}$$
where $LX_n = X_{n-1}$ is the unit delay operator. This implies, if the multiplicities are one each, that (30) will be a first order difference equation. Moreover one can show (with considerable computation) that $\|M_i\mathbf X_n\|^2 = O_p(n^{2d_i})$, the $d_i$ being the multiplicity, whereas for the
interior roots this is of order $O_p(n)$. Thus the normalizing factors will be different for these components, and the combined one is chosen as a block diagonal matrix, each of its diagonal elements being a function of $n$ tending to infinity at the rates shown, namely $n^{d_1}, n^{d_2},\dots,\sqrt n$. We shall denote it by $N_n = \operatorname{diag}(J_{n1},J_{n2},\cdots,M_n)$, where $J_{n1}$ is $d_1\times d_1$, $J_{n2}$ is $d_2\times d_2$, ..., $M_n$ is $r\times r$. The normalizing $k\times k$ matrix then will be $N_nM$. With this outline one has the following assertion, due to Chan and Wei [1]:

Theorem 4. If the $X_n$-process is a solution of the difference equation (22), with all the roots of the associated characteristic equation (25) on or interior to the unit circle, then for the least squares (vector) estimator $\hat a(n)$ of $a$ one has:
$$(N_nM)^{-1}(\hat a(n)-a)\xrightarrow{D}(F_1(B_1),\cdots,F_{k-1}(B_{k-1}),G(B_k)),$$
where the vector on the right side consists of Brownian functionals, the last one being independent of the rest, and the normalizing matrix factor $N_n$ is determined by the roots of (25), the diagonal elements of $N_n$ tending to infinity at the indicated rates.

As is clear from the statement, the needed computations, included in the Chan–Wei paper, are long and involved. It is clear, however, that the results are not yet in final form for applications. Several special cases are discussed in the literature. (For an extended account of such applications, the reader may consult Fuller [1] and his recent updated book [2], where some approximations to certain limit distributions are also included.) In particular, the exact "closed form" of the limit distribution of the estimator has not been obtained in the general case. Even for the first order case with $|\alpha| = 1$ it is available in closed form (cf. Rao [26]), but it is complicated. The corresponding exact limit distribution for the case $k>1$ should ideally involve just the conditions that specify the roots which are on the unit circle and those inside of it. If the normalizing matrix $N_n$ is random, as in the last part of Theorem 1, can the work be simplified? To answer these problems, considerably more research than so far completed is needed for a comprehensive picture. We therefore have to conclude this aspect of the foregoing account at present.
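The passage from (24) to (27) is mechanical and can be checked numerically: the eigenvalues of the companion matrix $A$ are exactly the roots of (25), which is how one would classify the components in practice. A minimal sketch (ours; the coefficients are hypothetical, chosen so that one root lies on the unit circle and the rest inside, the situation of Theorem 4):

```python
import numpy as np

a = np.array([1.2, 0.15, -0.35])   # hypothetical k = 3: phi(z) = (z - 1)(z + 0.5)(z - 0.7)
k = len(a)

A = np.zeros((k, k))               # the companion matrix of (27)
A[0, :] = a
A[1:, :-1] = np.eye(k - 1)

for z in np.linalg.eigvals(A):     # eigenvalues of A = roots of (25)
    r = abs(z)
    where = "on" if np.isclose(r, 1.0) else ("inside" if r < 1 else "outside")
    print(f"z = {z:.3f}, |z| = {r:.3f} ({where} the unit circle)")
```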
4.6 Complements and exercises

1. Let $\{X(t), t>0\}$ be an Ornstein–Uhlenbeck (or O.U.) process, so that it is Gaussian with mean 0 and covariance $r(s,t) = \alpha e^{-\beta|s-t|}$ for $\alpha>0$ and $\beta>0$. [It is related to the Brownian motion $\{B(t), t>0\}$ by the simple linear stochastic differential equation $dX(t) = -\beta X(t)\,dt + dB(t)$.] If $Y(t) = \sqrt t\,X(\frac{1}{2\beta}\log t)$, $t>0$, then verify that the increments $Y(t)-Y(s)$, $0<s<t$, of the $Y$-process are independent with $N(0,\sigma^2|s-t|)$ distributions. Deduce that the $Y$-process is Gaussian, and hence that it and the $X(t)$-process have continuous sample paths. [Hint: Show that for $0<s_1<s_2\le t_1<t_2$ the moment generating function is given by
$$E\big(e^{\alpha_1(Y(s_2)-Y(s_1))+\alpha_2(Y(t_2)-Y(t_1))}\big) = \exp\{\tfrac12(s_2-s_1)\alpha_1^2 + \tfrac12(t_2-t_1)\alpha_2^2\}.]$$

2. The O.U. process plays an important role in applications, and so the following characterization, due to Doob [1], is of interest. Thus let $\{X(t), t\in\mathbb R\}$ be a Markov process which is stationary with zero mean and a finite variance (the mean function is a constant due to stationarity). Then it is an O.U. process iff for each distinct pair of time points $s,t\in\mathbb R$, $X(s)$ and $X(t)$ have a nonsingular bivariate Gaussian distribution with positive covariance, i.e., $\operatorname{Cov}(X(s),X(t))>0$. [Hints: For the converse, the Markov property (and the classical Tulcea theorem, cf., Neveu [1], p. 162, or Rao [21], p. 276), and the fact that any two $X(t)$'s have a bivariate normal density, imply here that all finite dimensional distributions are jointly normal, and hence the process is stationary and Gaussian. Moreover, for $t_1<t_2<t_3$ the covariance function $\rho$ satisfies the (Cauchy) functional equation $\rho(t_3-t_1) = \rho(t_2-t_1)\rho(t_3-t_2)$, and hence $\rho(t) = \alpha e^{-\beta|t|}$, for some $\alpha>0$ and $\beta>0$, as a solution of this equation.]

3. Let $\{B(t), 0\le t\le1\}$ be a Brownian motion. Verify that $B(t)-tB(1)$ and $(1-t)B(\frac{t}{1-t})$, $0\le t\le1$, have the same distribution. [The former is called the pinned BM or a Brownian bridge. Hint: Calculate the ch.f.'s.] The existence of BM was obtained earlier using Haar functions and a series representation in Section 2 (following Theorem 6). Verify that the same argument can be used with an arbitrary complete orthonormal (con) sequence $\{\varphi_n, n\ge1\}\subset L^2(0,1)$, where, with i.i.d. random variables $\xi_n\sim N(0,1)$, we set
$$X(t) = \sum_{n=1}^{\infty}\xi_n\int_0^t\varphi_n(u)\,du,\qquad 0\le t\le1,$$
and show that this series converges absolutely with probability one, defining a Gaussian process with covariance $r(s,t) = \min(s,t)$. [The uniform convergence of this series is established using the result that the partial sums here form a martingale, and then invoking the martingale maximal inequalities, valid for any con system. See Shepp [1], Sec. 9, in this connection.]
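A quick numerical companion to Exercise 3 (our sketch; the cosine system used here is just one convenient con sequence in $L^2(0,1)$): truncating the series at $N$ terms and averaging over draws reproduces the Brownian covariance $\min(s,t)$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, reps = 200, 4000
s, t = 0.3, 0.7

# con system on L^2(0,1): phi_0 = 1, phi_n(u) = sqrt(2) cos(n*pi*u);
# integrated: int_0^tau phi_0 = tau, int_0^tau phi_n = sqrt(2)*sin(n*pi*tau)/(n*pi)
def X_at(tau, xi):
    n = np.arange(1, N + 1)
    return xi[0] * tau + np.sum(xi[1:] * np.sqrt(2) * np.sin(n * np.pi * tau) / (n * np.pi))

acc = 0.0
for _ in range(reps):
    xi = rng.standard_normal(N + 1)
    acc += X_at(s, xi) * X_at(t, xi)
print(acc / reps)        # ~ min(s, t) = 0.3, the Brownian covariance
```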
4. Let $\{X_n,\mathcal F_n, n\ge1\}$ be a submartingale with $\sup_nE(|X_n|)<\infty$ on $(\Omega,\Sigma,\mathcal F_n,P, n\ge1)$, as in Theorem 2.2. Define $\nu_n: A\mapsto\int_AX_n\,dP$, $A\in\mathcal F_n$, and let $\nu$ be defined on $\mathcal F_0 = \bigcup_n\mathcal F_n$ as $\nu(A) = \nu_n(A)$ if $A\in\mathcal F_n$ (see the first proof of that theorem). Then $\nu$ is well-defined and is additive on the algebra $\mathcal F_0$. It may be uniquely extended to an additive function on $\mathcal F_\infty = \sigma(\mathcal F_0)$, denoted by the same symbol. If $P_\infty = P|\mathcal F_\infty$, let $\nu = \nu_1+\nu_2$ be the Yosida–Hewitt decomposition, where $\nu_1$ is $\sigma$-additive and $\nu_2$ is purely finitely additive, both on $\mathcal F_\infty$. If $\nu_1^c$ is the $P_\infty$-continuous part of $\nu_1$ (the classical Lebesgue decomposition) and $X_\infty = \frac{d\nu_1^c}{dP_\infty}$, then show that $X_n\to X_\infty$ a.e. [Hints: It suffices to consider the case that all the $X_n\ge0$ (by a generalized Jordan decomposition), so that $\nu_i\ge0$. Also the $\nu_i|\mathcal F_n = \nu_{in}$ are $\sigma$-additive. Then verify that the sequence $\{Z_n = \frac{d\nu_{2n}}{dP_n},\mathcal F_n, n\ge1\}$ is a positive supermartingale, so that $Z_n\to Z_\infty$ a.e., by Theorem 2.2. But $0\le\int_AZ_n\,dP\le\int_AZ_{n-1}\,dP\le\nu_{2(n-1)}(A) = \nu_2(A)$. This implies the result that $X_\infty = \frac{d\nu_1^c}{dP_\infty}$ a.e. For more details, see the companion volume (Rao [21], p. 147).]

5. The limit distributions of estimators of parameters of the stochastic equations of Sections 4 and 5 are usually more difficult than one may expect, since they involve ratios of sequences, and different methods should be of interest. As discussed after Theorem 5.3, a rarely used technique with ch.f.'s, indicated by Cramér ([1], p. 317) in the one-dimensional case, can be stated for the bivariate problem as follows. Let $(X_i,Y_i)$, $i = 1,2$, be pairs of random variables with $P[Y_i>0] = 1$, $i = 1,2$, all having absolutely continuous distributions. If $H(x_1,x_2) = P[\frac{X_1}{Y_1}<x_1,\frac{X_2}{Y_2}<x_2]$ is the joint distribution, so that it has a density, and if the joint ch.f. $\varphi$ of the $X_i,Y_i$, $i = 1,2$, namely
$$\varphi(t_1,t_2,t_3,t_4) = E[\exp(it_1X_1+it_2Y_1+it_3X_2+it_4Y_2)],$$
is available, then show that
$$h(x_1,x_2) = \frac{\partial^2H}{\partial x_1\partial x_2}(x_1,x_2) = \Big(\frac{1}{2\pi i}\Big)^2\int_{\mathbb R^2}\frac{\partial^2\varphi}{\partial t_2\partial t_4}\Big|_{\substack{t_1=\tau_1,\,t_2=-x_1\tau_1\\ t_3=\tau_2,\,t_4=-x_2\tau_2}}\,d\tau_1\,d\tau_2,$$
whenever the integral exists and is uniformly convergent with respect to $(x_1,x_2)\in\mathbb R^2$. If $X_1,X_2,Y_1,Y_2$ are also mutually independent, then the result reduces to Cramér's ([4], Theorem 16). The work needs many details, and a more general form is given in Rao [14]. Taking $t_3 = 0$, $t_4 = 0$, one gets the one-dimensional version of Cramér's, noted earlier, as:
$$g(x) = \frac{d}{dx}P\Big[\frac{X_1}{Y_1}<x\Big] = \frac{1}{2\pi i}\int_{\mathbb R}\frac{\partial\varphi}{\partial u}(t,u)\Big|_{u=-tx}\,dt,$$
where the integral is uniformly convergent in $x$.
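The one-dimensional formula is easy to sanity-check numerically. In the degenerate illustration below (ours) we take $X_1\sim N(0,1)$ and $Y_1\equiv1$, so that $\varphi(t,u) = \exp(-t^2/2+iu)$ and $X_1/Y_1\sim N(0,1)$; a Riemann sum of the stated integral recovers the standard normal density.

```python
import numpy as np

t = np.linspace(-12.0, 12.0, 100001)
dt = t[1] - t[0]
for x in (0.0, 0.5, 1.0):
    dphi_du = 1j * np.exp(-t**2 / 2 - 1j * t * x)      # dphi/du evaluated at u = -t x
    g = (dphi_du.sum() * dt) / (2j * np.pi)            # (1/2 pi i) int dphi/du dt
    print(x, g.real, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))   # the two columns agree
```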
6. This problem illustrates the utility of the preceding formula in obtaining an explicit limit distribution of the estimator in the first order case: $X_n = \alpha X_{n-1}+u_n$, where $\{u_n, n\ge1\}$ is an i.i.d. sequence with mean zero and variance one. The least squares estimator $\hat\alpha_n$ is given by (cf., eq. (11) in Sec. 4 with $k = 1$, so that $\rho = \alpha$ there):
$$\hat\alpha_n = \frac{\sum_{r=1}^{n}X_rX_{r-1}}{\sum_{r=1}^{n}X_{r-1}^2} = \alpha + \frac{\sum_{r=1}^{n}u_rX_{r-1}}{\sum_{r=1}^{n}X_{r-1}^2} = \alpha + \frac{Q_n}{R_n}\quad\text{(say)}.$$
Then classical probability theory implies that the invariance principle works for the model (meaning the $u_n$ can be any i.i.d. sequence with two moments and the limit distribution is the same) if $|\alpha|\le1$, and that it fails if $|\alpha|>1$ (the limit then depends on the distribution of the $u_n$'s). Thus to find the limit in the case $|\alpha|\le1$ one can choose a computationally convenient distribution for the $u_n$ with two moments, and we therefore take $u_n\sim N(0,1)$. The other result, for $|\alpha|>1$, depends on the distribution of the $u_n$, and so we may as well consider $N(0,1)$ in all cases. Assuming that $u_m = 0$ for $m\le0$, so $X_m = 0$ also, verify that
$$\varphi_n(u,v) = E(\exp[iuQ_n+ivR_n]) = E(\exp[iX'(Au+Bv)X]) = D(n)^{-\frac12},$$
where $X = (X_1,\dots,X_n)'$ is the (column) vector of observations and $A,B$ are the matrices of the quadratic forms of $Q_n = \sum_{r=1}^{n}(X_r-\alpha X_{r-1})X_{r-1}$ and $R_n = \sum_{r=1}^{n}X_{r-1}^2$. Here $D(n)$ is the determinant obtained in evaluating the normal integral above, and is given by
$$D(n) = \frac{1-r_n}{r_n-s_n}r_n^n + \frac{1-s_n}{s_n-r_n}s_n^n,$$
where $r_n, s_n$ are defined by
$$r_n, s_n = \frac12\Big[1+\alpha^2+2i\alpha u-2iv \pm \big\{(1-\alpha^2)^2 - 4i\alpha(1-\alpha^2)u + 4(1-\alpha^2)u^2 - 4i(1+\alpha^2)v + 8\alpha uv - 4v^2\big\}^{\frac12}\Big].$$
From this verify, on letting $\tilde u = \frac{u}{\beta(n)}$, $\tilde v = \frac{v}{\beta(n)^2}$ in the above, with $\beta(n) = [n(1-\alpha^2)^{-1}]^{\frac12}$ if $|\alpha|<1$; $= \big(\frac{n^2}{2}\big)^{\frac12}$ if $|\alpha| = 1$; and $= |\alpha|^n(\alpha^2-1)^{-1}$ if $|\alpha|>1$, that
$$\lim_{n\to\infty}\varphi(\tilde u,\tilde v) = \begin{cases} e^{(iv-\frac{u^2}{2})}, & \text{if } |\alpha|<1,\\[2pt] e^{-\frac{iu\alpha}{\sqrt2}}\Big(\cos\sqrt{2iv} - \frac{iu\alpha}{\sqrt{2iv}}\sin\sqrt{2iv}\Big)^{-\frac12}, & \text{if } |\alpha| = 1,\\[2pt] (1-2iv+u^2)^{-\frac12}, & \text{if } |\alpha|>1.\end{cases}$$
Note that the limit, which is a ch.f. in all cases, depends explicitly on $\alpha$ when $|\alpha| = 1$ and gives different functions for $\alpha = +1$, $\alpha = -1$, and still another ch.f. for the combined case $|\alpha| = 1$. [There are some tedious computations involved here, and for more details the reader may consult White [1].]

7. Use the last formula (Cramér's result) to obtain the density $g$ of the limit distribution of $\hat\alpha_n$ of Exercise 6, as:
$$g(x) = \frac{1}{2\pi i}\int_{\mathbb R}\frac{\partial\varphi}{\partial u}(t,-tx)\,dt = \begin{cases}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}, & \text{if } |\alpha|<1,\\[2pt] \frac{1}{\pi(1+x^2)}, & \text{if } |\alpha|>1,\\[2pt] f(x,\alpha), & \text{if } |\alpha| = 1,\end{cases}$$
where $f(x,\alpha)$ has a complicated form, as follows:
$$f(x,\alpha) = \frac{1}{\sqrt{8\pi^2}}\int_{\mathbb R}\frac{\rho(x,t)}{[r(x,t)]^{\frac32}}\cos\big(\delta(x,t)-\theta(x,t)\big)\,\frac{\chi_{\mathbb R_+^2}(x,t)+\chi_{\mathbb R_+^2}(-x,t)}{\sqrt{tx}}\,dt;$$
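The Brownian-functional character of the unit root limit discussed above is easy to visualize by simulation. The sketch below (ours) compares $n(\hat\alpha_n-1)$ for $\alpha = 1$ with the well-known ratio representation $(B(1)^2-1)/(2\int_0^1B(t)^2\,dt)$, obtained from a discretized Brownian path; by the invariance principle the two samples should be close in distribution.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 400, 4000
stat_ar = np.empty(reps); stat_bm = np.empty(reps)
for i in range(reps):
    x = np.cumsum(rng.standard_normal(n))                # unit root: X_m = X_{m-1} + u_m
    rhohat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    stat_ar[i] = n * (rhohat - 1.0)

    w = np.cumsum(rng.standard_normal(n)) / np.sqrt(n)   # discretized Brownian path on [0,1]
    stat_bm[i] = (w[-1] ** 2 - 1.0) / (2.0 * np.mean(w ** 2))

print(np.percentile(stat_ar, [10, 50, 90]))              # the two rows nearly coincide,
print(np.percentile(stat_bm, [10, 50, 90]))              # illustrating the invariance principle
```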
the functions $r,\theta,\delta$ contain the parameter $\alpha$ and are expressed in terms of certain trigonometric and hyperbolic functions of $t$ and $x$. These are given in Rao [26], and some results related to Exercise 5 above are in Rao [14]. [The expressions will not be reproduced here, and the reader is referred to the cited papers.]

8. The limit distributions in the $k$th order schemes ($k\ge2$), indicated in Theorem 5.4, involve iterated Brownian motions and their functionals. These may be approximated by certain 'Monte-Carlo' methods. For this it will be necessary to have some simplifications of the associated iterated Brownian (or Wiener) integrals. Here the new element is to find relations between multiple (and, for these, iterated) integrals, which we now briefly discuss. Consider $\{X(t), t\in[0,T]\}$, a BM, a partition $\pi_n: 0 = t_0<t_1<\cdots<t_n = T$, and a simple function $H:[0,T]\times[0,T]\to\mathbb R$, defined as $H_n(s,t) = \sum_{j=1}^{n}\sum_{k=1}^{n}a_{jk}\chi_{[t_{j-1},t_j)\times[t_{k-1},t_k)}$. Let
$$\tau(X) = \int_{[0,T]^2}H_n(s,t)\,dX(s)\,dX(t) = \sum_{j=1}^{n}\sum_{k=1}^{n}a_{jk}\big(X(t_j)-X(t_{j-1})\big)\big(X(t_k)-X(t_{k-1})\big).$$
It may be verified that $\tau(X)$ is uniquely defined (does not depend on the partition) and that there is $d\mu(s,t)$, a dominating measure for the Bochner boundedness principle, which may be taken as $\mu(B) = \lambda_1(\pi(B\cap\Delta)) + \lambda_2(B\cap\Delta^c)$
for all Borel sets $B\subset[0,T]^2$, where $\lambda_1$ is the Lebesgue measure, $\pi$ is the coordinate projection of the rectangle $[0,T]^2$ onto $[0,T]$, $\Delta$ the diagonal, and $\lambda_2$ is the planar Lebesgue measure on $[0,T]^2$, so that one has, for some $C>0$,
$$E(|\tau(X)|^2)\le C\int_{[0,T]^2}H(s,t)^2\,d\mu(s,t),\tag{+}$$
whence $\tau$ extends uniquely to an integral on $L^2([0,T]^2,d\mu)$, since such simple functions are dense in the latter. [In fact, by a direct procedure using the above expression for $\tau(X)$, one shows immediately that $E(\tau_H(X)) = \int_0^TH(s,s)\,ds$ and $\operatorname{Var}(\tau_H(X)) = 2\int_0^T\int_0^TH^2(s,t)\,ds\,dt$, being the expectation of a chi-square random variable. Hence one can also define the (double) integral using Cauchy sequences, $H_n\to H$ in $L^2([0,T]^2, ds\,dt)$, for an $H$ in this latter space.] Verify that for any $H\in L^2([0,T]^2, ds\,dt)$ (using the tensor product notation to avoid confusion with a repeated integration) the map
$$\tau_H(X) = \int_{[0,T]^2}H(s,t)\,dX(s)\otimes dX(t)$$
is linear and, although not Gaussian, defines a double Wiener integral. The function $H$ is square integrable, also termed a kernel, and if $H(s,t) = f(s)g(t)$, a degenerate symmetric one, verify that $H$ is $dX(s)\otimes dX(t)$-integrable and one has
$$\int_{[0,T]^2}f(s)g(t)\,dX(s)\otimes dX(t) = \Big(\int_0^Tf(s)\,dX(s)\Big)^2.$$
If $H$ is of finite Vitali variation on the rectangle, i.e.,
$$V(H) = \sup_{\pi_n}\sum_{i,j}|\Delta H(t_i,t_j)| = \sup_{\pi_n}\sum_{i,j=1}^{n}\big|H(t_i,t_j)-H(t_{i-1},t_j)-H(t_i,t_{j-1})+H(t_{i-1},t_{j-1})\big|,\quad \pi_n\ \text{a partition of } [0,T],$$
is finite, one can use integration by parts, formally, to get
$$\tau_H(X) = H(T,T)X^2(T) - X(T)\int_0^TX(s)\,H(ds,T) - X(T)\int_0^TX(t)\,H(T,dt) + \int_0^T\int_0^TX(s)X(t)\,H(ds,dt).$$
But this has to be shown rigorously, first for simple functions $H$, and then (with $L^{2,2}$-boundedness or otherwise) for the general case. [It should be noted that the formal procedure of the Lebesgue theory is not valid here, since $X$ does not have (even locally) bounded variation, and so an alternative argument, as above, is needed to justify the result.]
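The moment formulas for $\tau_H(X)$ quoted above can be checked by simulating the defining Riemann-type sums. In the sketch below (ours; the kernel is an arbitrary smooth symmetric choice), the sample mean matches $\int_0^TH(s,s)\,ds$, the sample variance matches $2\int\!\!\int H^2$, and for a degenerate kernel $f\otimes f$ the quadratic form factorizes exactly into $(\int f\,dX)^2$.

```python
import numpy as np

rng = np.random.default_rng(8)
T, n, reps = 1.0, 200, 5000
dt = T / n
mid = (np.arange(n) + 0.5) * dt
H = np.exp(-np.abs(mid[:, None] - mid[None, :]))     # a smooth symmetric kernel on [0,T]^2

vals = np.empty(reps)
for i in range(reps):
    dX = rng.standard_normal(n) * np.sqrt(dt)        # Brownian increments
    vals[i] = dX @ H @ dX                            # discrete tau_H(X)

print("mean:", vals.mean(), " vs ", np.sum(np.diag(H)) * dt)       # int_0^T H(s,s) ds
print("var :", vals.var(),  " vs ", 2 * np.sum(H**2) * dt * dt)    # 2 * int int H^2

f = np.sin(np.pi * mid)                              # degenerate kernel f(s) f(t)
dX = rng.standard_normal(n) * np.sqrt(dt)
print(dX @ np.outer(f, f) @ dX, (f @ dX) ** 2)       # identical: (int f dX)^2
```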
9. If $\{X(t), t\in[0,T]\}$ is a BM, verify rigorously, as in Exercise 8, the truth of the following formula for the iterated Brownian process integrals. Let $X^{(0)}(t) = X(t)$ and, for $n\ge1$, $X^{(n)}(T) = \int_0^TX^{(n-1)}(t)\,dt$. Then a formal use of double and repeated integrals and induction on $n$, applied to the BM, gives (cf., Exercise VII.6.5 for a further discussion of such multiple integrals):
$$X^{(n)}(T) = \int_0^T\frac{(T-s)^n}{n!}\,dX(s).$$

10. It is of interest to observe here that a BM process $X(t)$ defines a $\sigma$-additive $L^2(P)$-valued (or vector) measure with orthogonal (even independent) values. Hence $dX(s)\times dX(t)$ is a bimeasure, meaning that it is separately $\sigma$-additive in each component; but in this case the bimeasure has (locally) bounded Vitali variation in the $L^2(P)$-norm, and hence defines a $\sigma$-additive (vector) measure on the Borel $\sigma$-algebra of $[0,T]^2$ into $L^2(P)$. Consequently one can employ the Dunford–Schwartz integration. If $X(t)$ is not a BM but defines a vector measure into an $L^2(P)$, for instance if it has orthogonal values, then $dX(s)\times dX(t)$ is only a bimeasure having a (locally) finite Fréchet variation, and then one has to employ the Morse–Transue integration for vector integrals. But one has to add some restrictions on general kernels $H$ in order that the dominated convergence theorem be true. (See Chang and Rao [1], Sec. 3, in the scalar case, and some extensions to the vector case in Dobrakov [1] and earlier in the Czechoslovak Math. J.) If $H(s,t) = \sum_{i=1}^{m}f_i(s)g_i(t)$, a simple kernel, then the theory of the above references extends. The reader is asked to extend these ideas to multiple stochastic MT-type integrals. [On the double product stochastic integral, see Shepp [1], Sec. 9, and Rosiński and Szulga [1], where conditions for the $\sigma$-additivity of the bimeasures $dX(s)\times dX(t)$ are discussed. Another possibility is to consider processes satisfying the $L^{2,2}$-bounded condition in the plane, and to obtain the product integrals with it. These ideas will be useful in studying the corresponding problems of inference on random fields, in the direction given in Yadrenko [1], Chapter IV.]
Bibliographical notes

As noted in the text, the basic result on approximating the Radon–Nikodým derivative (or likelihood ratio) of a process, using finite collections of random variables, given in Theorem 1.1, is due to Grenander [1], and we have given a streamlined and slightly sharper form using an approach from Andersen and Jessen [1]. Several examples included are also motivated by (and follow from) the treatment in Grenander's monograph [2]. Theorem 1.6 is an extension of Kakutani's [1] important theorem, which will again be shown to play a fundamental role in the Gaussian dichotomy in Chapter VII later. Our proof here is based on Brody's [1] somewhat simpler argument. The treatment in the text of sequential testing for processes, with martingale methods, shows how it is an essential new step from (and an extension of) the classical Neyman–Pearson theory. It also shows a motivation for stopped martingale convergence extensions. Consequently we presented a few results from the latter, for an appreciation of the direction of developments of the subject. We included a detailed analysis of the existence of Brownian motion, as well as of its (or of Wiener's) integration via Bochner's $L^{2,2}$-boundedness principle, for an economical and general presentation, as discussed in the companion volume (cf., Rao [21], especially Chapter VI). The results of Theorems 3.7-3.8 are essentially due to Dvoretzky, Kiefer and Wolfowitz [1], but our treatment is influenced by (and based on) Shiryayev's [1] monograph.

The next item in our presentation is the unbiased weighted linear prediction, based on the least squares principle, for a general (stochastic) signal plus noise model. We allow the noise to be (weakly) stationary or even harmonizable. For the "white" noise case, such an account is given by Dolph and Woodbury [1], with a detailed analysis linking it with the classical ordinary differential equations. Here we included some extensions of it, following a recent treatment by the author (Rao [20]). There are a number of open problems awaiting solution in the analysis of systems of (linear) integral equations. Here we restricted the work strictly to stochastic analysis, but more can be done using the theory of ordinary differential equations. Indeed there is much interest in this (somewhat neglected) approach, as noted in Grenander's monograph [2], and the present account should be of particular interest to readers who may want to pursue the subject further along these lines. This corresponds to a study of differential equations in (an infinite dimensional) Hilbert space.

Turning to specific parameter models, to show how really different and difficult problems confront the analysis, the $k$th-order stochastic difference equations are considered as the main topic. The first rigorous treatment for stable process solutions of these equations is due to Mann and Wald [1], with higher moment restrictions than necessary, but clearly in the tradition of a pioneering study. Extensions to non stable process solutions present many difficulties. The first complete treatment of the first order case is due to White [1], which was refined and extended by Anderson [2], whose interest in a rigorous mathematical treatment is evidenced in his book on Time Series Analysis [3]. Some aspects of higher order schemes have also been studied in the former, if all the roots of the associated characteristic equation are either outside, or all are inside, the unit circle of the complex plane. The author [2] has treated the case that one root is outside and the rest inside the circle, with a detailed analysis for the case k = 2. The maximal root here plays a significant role, and the process has to be decomposed relative to its location. Thus an estimation of that root and its limit distribution have been included there. To show the method in the general case, a detailed asymptotic analysis of the estimator of that root is given in the text as Theorem 5.1, from the above work. The result was analyzed by Stigum [1], who presented a blueprint for the subject in the general case. That was completed rigorously, in further generality of the distributions of the errors, by Lai and Wei [1], since in all the previous studies the errors were assumed to be i.i.d. with zero means and finite variances. They assumed that the error process consists of martingale differences with uniformly bounded $2+\delta>2$ moments, and the (strong) consistency of the estimators of the structural parameters was then established, with no more restrictions. We included it as Theorem 5.3 in the text, for the i.i.d. case, for simplicity. Then the joint limit distributions of these estimators in the stable and unstable (i.e., the roots are inside or on the unit circle) cases were obtained by Chan and Wei [1], which is included in the text as Theorem 5.4. Both these results basically decompose the solution process into parts corresponding to roots outside the unit circle, on the circle, and those inside. The consistency was obtained in this way, and then only the stable-unstable (i.e., excluding the explosive case) solutions were treated in the work by Chan and Wei. Here the multiplicity of roots presents an additional problem. Explicit distributions have not been obtained, but the limit joint distribution is shown to be that of a $k$-dimensional vector whose components are Brownian functionals (since the invariance principle of Probability Theory applies also in the unstable case, as previously noted by White for k = 1). The present author has obtained an explicit distribution of the estimator for the boundary case with k = 1 (cf., Rao [26]), using an interesting method due to Cramér [1]. Here the parameter merely satisfies $|\alpha| = 1$. The special cases $\alpha = 1,-1$ can be obtained from it, but the general form is still complicated. As noted in the text, this has been discussed in different forms by Fuller [1], and in more detail in his recent book [2]. It is clear from this discussion that, as Wald [3] observed in his address to the International Congress of Mathematicians, special problems present many hurdles and have to be resolved with specifically devised methods. Our presentation of Section 5 is intended to be an illustration of this phenomenon, which shows how special problems of importance in applications can present a need to find interesting (nontrivial) techniques.

The complements section has some additional results, elaborating the text. The representation in Exercise 3 is due to Shepp [1], and that of Exercise 5 is taken from the author's [14] extension of Cramér's theorem ([1], p. 317). The last two problems illustrate the way that Brownian functionals appear in limit distributions of estimators of the structural parameters, which also lead to double (and multiple) Wiener integrals. Here tensor products of the Brownian motion, and more generally of (local semi-)martingales, have to be studied. Although an $L^{2,2}$-bounded process leads to an associated vector measure, the tensor products present new problems. They are not simple extensions of the one-dimensional case, as shown by Rosiński and Szulga [1]; although for the BM one can use various special techniques (e.g., series expansions), the general case in which $X$ is $L^{2,2}$-bounded presents difficulties for a study of the tensor product $dX(s)\otimes dX(t)$. The Morse–Transue integral appears well suited for such an extension. If the latter is successfully carried out, then the corresponding tensor products of multiply indexed processes (or random fields), with the work of Cairoli and Walsh [1] for the planar BM and of Green [1] for planar quasimartingale integrals, will be the next step, which appears quite feasible. These are some of the problems raised by the work of this chapter. Employing the ideas and methods of Section 5, we shall present in the last chapter (Chapter IX below) some results on consistent (nonparametric) estimators and their limit distributions, under suitable conditions, for bispectral densities of certain harmonizable processes, which contain the stationary classes. Since in all the preceding analysis likelihood ratios of processes are fundamental, we now turn to calculations of these functions for several important classes encountered in applications, in the following chapter and later, which complement the above study.
Chapter V

Likelihood Ratios for Processes
As seen in the preceding work on inference theory of processes, likelihood ratios play a prominent role, particularly for testing problems. Consequently the major part of this chapter will be devoted to finding densities for probability measures induced by broad classes of stochastic processes. These include processes with independent increments, jump Markov processes, and those that are infinitely divisible. Also considered in this work are diffusion types of processes and some applications. All these start with (and are suggested by) Gaussian processes, and so we establish results including dichotomy theorems as well as likelihood ratios for them under several different sets of conditions, together with a few stationary cases. As a motivation for (and also an interest in) the subject, we start with a treatment of the important problem of (admissible) means of processes within their function space representations (these means can be regarded as deterministic signals). The analysis presented in this chapter involves some interesting mathematical ideas, and the reader should persevere with patience, since a rich collection of problems, applications, and new directions is suggested in this work. We include many illustrations in the text, and some further results in the final complements section, often with hints.
5.1 Sets of admissible signals or translates

Let $\{X_t, t\in I\}$ be a real process on $(\Omega,\Sigma,P)$, with mean zero and covariance $r(s,t) = E(X_sX_t)$, where $I$ is an index set. If $f: I\to\mathbb R$ is a mapping, consider a new process $Y_t = f(t)+X_t$, $t\in I$. Then $E(Y_t) = f(t)$ and the $Y_t$-process has the same covariance $r(s,t)$. If $(T_fX)_t = Y_t$, then the probability measure governing the $Y$-process, say $P_f$, is given by $P_f = P\circ T_f^{-1}$, and a standard problem here is to test
the hypothesis that the (generally nonconstant) signal $f$ is present in the 'output' $Y_t$, for the nontrivial case that $P_f$ and $P$ are not mutually singular, and then to analyze the structure of the set of signals $f$. Such a function $f$ is called an admissible mean, or translate (signal), of the process $X$. Let $M_P$ be the set of all admissible means. One should describe its geometry and/or 'size' properties. The problem is made more concrete by replacing $(\Omega,\Sigma,P)$ with its canonical representation, as follows. Let $\Omega = \mathbb R^I$, $\Sigma$ = the $\sigma$-algebra generated by the cylinder sets of $\Omega$, and $X_t(\omega) = \omega(t)$, $\omega\in\Omega$, so that $X_t$ is the coordinate function and $\Sigma = \sigma(X_t, t\in I)$, the smallest $\sigma$-algebra relative to which each $X_t$ is (measurable or) a random variable. Then $(\Omega,\Sigma,P)$ is said to be canonically represented, and for each Borel set $A\subset\mathbb R^n$,
$$P[\omega: (X_{t_1},\cdots,X_{t_n})\in A] = \int\cdots\int_A dF_{t_1,\cdots,t_n}(x_1,\cdots,x_n),\tag{1}$$
where $F_{t_1,\cdots,t_n}$ is the finite dimensional distribution of the given process. [This is the basic Kolmogorov representation, cf., Theorem I.1.1, and it is convenient.] In this form the admissible mean $f\in\mathbb R^I = \Omega$ is in $M_P$, and $M_P\subset\Omega$. Thus $f\in M_P$ iff $P_f\ll P$, so that the derivative $\frac{dP_f}{dP}$ exists. It is seen that if $f_i\in M_P$, $i = 1,2$, then $f_1+f_2\in M_P$ (by the chain rule). In fact, for any bounded measurable $h:\mathbb R\to\mathbb R$ we have
Ω
= Ω
h(ω + f1 ) dPf2 h(ω + f1 ) h(ω)
= Ω
dPf2 dP dP
dPf2 dPf1 (ω + f1 ) dP, dP dP
so that f1 + f2 ∈ MP . However, neither MP ∈ Σ nor the linearity of MP need be true. Observe that for f, g ∈ MP , Pg Pf iff Pg−f P . Indeed, just as above, with the canonical representation that is adapted, for any bounded measurable h : R → R and ω, f ∈ Ω, h(ω + f ) is defined. Since Pf = P ◦ Tf−1 and Pf P then
$$\int_\Omega h(\omega+f)\,dP = \int_\Omega h(\omega)\,dP_f = \int_\Omega h(\omega)\Big(\frac{dP_f}{dP}\Big)\,dP.\tag{2}$$
Thus one finds $f\in M_P$ by comparing both the integrals on the right. On the other hand, if $P_g\ll P_f$ and $\rho(\omega) = \frac{dP_g}{dP_f}$, which exists a.e. $[P_f]$, one has
$$\int_\Omega h(\omega)\,dP_{g-f}(\omega) = \int_\Omega h(\omega+g-f)\,dP(\omega) = \int_\Omega h(\omega-f)\,dP_g(\omega) = \int_\Omega h(\omega-f)\rho(\omega)\,dP_f(\omega) = \int_\Omega h(\omega)\rho(\omega+f)\,dP(\omega),\ \text{by (2)},\tag{3}$$
where the change of variable technique is employed repeatedly. Since $h$ is arbitrary, this implies $\frac{dP_{g-f}}{dP} = \rho(\omega+f)$, so that $g-f\in M_P$. Thus the problem of testing the equivalence of the admissible means $f,g$ is the same as testing the hypotheses $H_0: g-f = 0$ vs $H_1: g-f\neq 0$. It is therefore useful to consider the structure of $M_P$, since, for instance, in a communication network, for the noise process $X$, the set $M_P$ denotes the collection of possible signals that may be transmitted and for which a nontrivial test theory can be developed. Thus if $f,g\in M_P$, it is desirable to know whether $f+g$, $\alpha f$, $\alpha f+\beta g$, $\alpha+\beta = 1$, $\alpha\ge0$, are again admissible signals, so that $M_P$ is either a positive cone, or a convex (or even a linear) subset of $\Omega$, which is important for the design engineer in planning the communication system. This can also be viewed as a key problem in (nontrivial) testing theory for the hypothesis $H_0: P$ vs. the composite alternative $H_1: P_f$, $f\in M_P$. The study of composite hypotheses in stochastic inference is both highly important and involved. Therefore we analyze this special case in some detail, essentially following the basic researches of Pitcher ([3]-[6]).

We first introduce a technique of considerable power and elegance, recognized and effectively used in this context by Parzen ([1] and [5]), namely the reproducing kernel Hilbert spaces (RKHS) of Aronszajn [1]. It was employed as an effective tool later by Neveu [2], who also called it the Aronszajn space, and we use both names. This concept is based on a symmetric (or hermitian) positive definite function $K: T\times T\to\mathbb F$ ($\mathbb F$ the real or complex scalars), called a kernel, where $T$ is a set. Thus for each $n\ge1$, $a_i\in\mathbb F$, we have $\sum_{i,j=1}^{n}K(t_i,t_j)a_i\bar a_j\ge0$; consider the set $\mathcal H_1 = \{f: f = \sum_{i=1}^{n}c_iK(s_i,\cdot),\ c_i\in\mathbb F,\ n\ge1\}\subset\mathbb F^T$. Then $\mathcal H_1$ is a linear space over $\mathbb F$, and one can introduce the (semi-)inner product in it:
$$\langle f,g\rangle = \sum_{i=1}^{n}\sum_{j=1}^{m}K(s_i,t_j)c_i\bar d_j = \sum_{i=1}^{n}c_i\bar g(s_i) = \sum_{j=1}^{m}f(t_j)\bar d_j,\qquad f,g\in\mathcal H_1,\ \text{(say)}.\tag{4}$$
It is immediately seen that !·, ·" is well defined (i.e., does not depend on the representation of f or g), is sesqilinear, and has the following important properties: [dropping “semi” hereafter, using equivalence classes] (a)f ∈ H1 =⇒ f (t) = !f, K(t, ·)", (5) (b)!K(s, ·), K(t, ·)" = K(s, t), s, t ∈ T. Moreover, the linear span sp{K(s, ·), s ∈ T } = H1 . Let HK be the completion of H1 under the norm · derived from the inner product (4). Then using (a) of (5) one has |f (t)| ≤ f K(t, ·), t ∈ T,
(6)
so that f = 0 ⇒ f (t) = 0, ∀t ∈ T . The set (HK , · ) is the RKHS or the Aronszajn space, and is uniquely determined by the kernel K on T and the property (a) of (5). This technique will initially be illustrated on Gaussian processes, and then the general case be considered. However, even here we need to use a relatively deep result, namely, the following due independently to Feldman [1] and H´ ajek [1]. Its ramifications and a generalization will be detailed in Chapter VII. P 1. Theorem Let (Ω, Σ, Q ) be a model where P and Q are Gaussian probability measures on Σ. Then either P ⊥ Q or P ≡ Q.
Proof. Among the many possible, the following short proof, due to K¨ uhn and Liese [1], is adapted. It is based on the Hellinger integral (cf., the discussion for Theorem IV.1.7 and the detailed Remark IV.1.8). The result of that remark will be recalled here for a convenient reference. Let T be a set, Γ the collection of all of its finite subsets directed by inclusion, and BT be the cylinder σ-algebra of RT as in Kolmogorov’s Theorem I.1.1. If P, Q are Gaussian measures on BT let Pα , Qα be their marginals (or projections) onto Rα , Bα where Bα is the Borel σ-algebra of Rα , for each α ∈ Γ. Setting μ = P + Q (or any σ-finite μ dominating 1 2β 1 dQ 21−β dμ be the Hellinger both (P, Q)) let Hβ (P, Q) = Ω dP dμ dμ integral of the measures P, Q and 0 < β < 1. Then it was seen that H0 (P, Q) = limβ→0 Hβ (P, Q) exists and has the properties: (1) 0 ≤ Hβ (P, Q) ≤ 1, 0 < β < 1, and P ⊥ Q iff Hβ (P, Q) = 0 for some (and then for all) 0 < β < 1, (2) Q P iff H0 (P, Q) = 1, (recall H0 (P, Q) = lim Hβ (P, Q)) β→0
227
5.1 Sets of admissible signals or translates
(3) Hβ (P, Q) = inf α∈Γ Hβ (Pα , Qα ), and, as a consequence of (2) and (3), one has (4) Q P iff limβ→0 Hβ (Pα , Qα ) = 1 uniformly in α ∈ Γ. We use (3)-(4) for Pα , Qα which are finite dimensional Gaussian distributions on (Rα , Bα ). Depending on the degeneracies, it is clear that in any finite dimensional space Rα two Gaussian distributions Pα , Qα are either singular or are mutually absolutely continuous. Hence if Pα ⊥ Qα for some α ∈ Γ then P ⊥ Q as seen in the proof of Theorem IV.1.7, and it is just property (1) above. So let Pα Qα Pα (equivalence) for all α ∈ Γ. Remark IV.1.8 shows that to calculate Hβ (Pα , Qβ ) one can use a convenient coordinate system in Rα . Now since Pα and Qα should be non degenerate, it may be assumed that the covariance matrices of these distributions (in the chosen coordinate system) can be simultaneously diagonalyzed using elementary linear algebra. In fact we can (and do) assume, in such a (coordinate) system, that Qα has mean zero, identity covariance matrix, and Pα has a diagonal covariance matrix of positive entries, with a (not necessarily zero) mean vector. From this reduction the likelihood ratio is given by dPα (x) = dQα
n(α)
Πi=1
exp{− 12 biα (xi − aiα )2 } , n(α) 1 1 2 Πi=1 exp{− x } i 2π 2 biα 2π
where aiα ∈ R, biα > 0 and x = (x1 , . . . , xn(α) ). A short calculation shows that Hβ (Pα , Qα ) = exp(−Sα (β)/2) where
n(α)
Sα (β) =
[log(1 − β + βbiα ) − β log biα +
i=1
β(1 − β)a2iα biα ]. 1 − β + βbiα
In particular n(α) 1 1 1 + biα a2iα biα − log biα + [log Sα ( ) = ] ) 2 2 2 2(1 + b iα i=1
If H 12 (P, Q) = 0, then P ⊥ Q by property (1) above. Suppose that H 12 (P, Q) > 0. But 0 < inf α∈Γ H 12 (Pα , Qα ≤ 1 implies supα∈Γ Sα ( 12 ) = S < ∞, so that c1 = inf biα > 0, i,α∈Γ
c2 = sup biα < ∞. i,α∈Γ
Also one has the following important numerical inequalities: for all x x ∈ [c1 , c2 ], β(1−β)x 1−β+βx ≤ βK1 2(1+x) and log(1 − β + βx) − β log x ≤
228
V. Likelihood Ratios for Processes
1 βK2 (log 1+x 2 − 2 log x) for some constants K1 , K2 > 0. Let K = max(K1 , K2 ) so that we obtain
1 Sα (β) ≤ βKSα ( ) ≤ βKS, 0 < β < 1, 2 and all α ∈ Γ. Consequently limβ→0 Hβ (Pα , Qα ) = 1 uniformly in α ∈ Γ, since the right side is e0 . Hence property (4) implies P Q and, by interchanging P and Q, also Q P so that the measures P, Q are equivalent. After this dichotomy, it is necessary to calculate the likelihood ratios, and that is a more involved problem. We consider this question in various forms in the rest of this section and later on for other (non Gaussian) processes. This result will be utilized in the following analysis on Gaussian families. If {Xt , t ∈ T } ⊂ L20 (P ) is a process on (Ω, Σ, P ) with E(Xt ) = ¯ t ), let L = sp{X ¯ 0, r(s, t) = E(Xs X t , t ∈ T } be the closed linear span 2 ¯ t ). Then of the process in L0 (P ). Consider f : t → f (t) = E(Y X f (t) = 0, t ∈ T iff Y = 0 a.e., since the Xt are the generators of L. Hence u : f → Y is one-to-one and onto from Hr to L where Hr is the Aronszajn space determined by r. Moreover, ¯ s ) = f (s), s ∈ T, !f, r(s, ·)" = E(u(f )u(r(s, ·))) = E(Y X
(7)
so that r(s, ·) ↔ Xs is in one-to-one correspondence. Clearly !·, ·" is an inner product effecting the above bijective (linear) correspondence. We can now establish the following intrinsic characterization of the space of admissible signals if the noise {Xt , t ∈ T } is a Gaussian process with mean zero and covariance r. 2. Proposition. If the noise X is Gaussian Xt ∈ L20 (P ), t ∈ T ⊂ R ¯ t ), then MP = Hr so that MP is a Hilbert space. and r(s, t) = E(Xs X Proof. The argument proceeds by showing that MP −→ L −→ Hr and then using the bijection u : Hr → L, we obtain L → MP so that MP ⊂ Hr ⊂ MP will be shown, and the result is nontrivial. Thus in the forward direction, let f ∈ MP be any element and Pf (= P ◦ Tf−1 ) be the corresponding probability measure on Σ. Since Pf P , let dP Y = dPf and B0 = σ(Xt , t ∈ T ) with B denoting its P -completion. Then Y is B-measurable, and one has ¯ t ). ¯ ¯ t Y dP = E(Y X f (t) = (8) Xt dPf = X Ω
Ω
√
Note that Y > 0 a.e., and Y 2 = P (Ω) = 1. Now it is claimed that Y ∈ L(⊂ L2 (P )) itself. For, consider the functional f : L → C defined
229
5.1 Sets of admissible signals or translates
by f (Xt ) = Ω Xt dPf . It is clear that f is linear, and we show that it is continuous, so that by the classical Riesz representation and (8) the desired conclusion obtains. Since L is not a lattice, we need to use a special argument, borrowed from Neveu [2]. Let {Fj , j ∈ I} be an increasing family of σ-algebras such that σ(∪j Fj ) = B, and let Pj , Pf j be restrictions of P, Pf to Fj . If Hj (P, Pf ) denotes the Hellinger distance on Fj , given by Hj (P, Pf ) =
Ω
(dPj dPf j )
1 2
= Ω
dPf j dPj dPj
,
(9)
then by Theorem IV.1.7, Hj (P, Pf ) is monotone, limj Hj (P, Pf ) = c exists and c = 0 iff Pf ⊥ P . Since Pf P here, we must have 0 < c ≤ 1 by Theorem 1. Let i0 ∈ I be chosen as i0 = {t} so that Fi0 = Ft = σ(Xt ). But we can evaluate the Hj functional, since P, Pf are Gaussian measures determining the distributions of Xt and f (t) + Xt , as (put a = r(t, t) for convenience):
1
0 < c ≤ Hi0 (P, Pf ) = (dPt dPf t ) 2 Ω (x−f (t))2 x2 dx = [e− 2a e− 2a ] √ , by the fundamental law 2πa R of Probability (cf.,e.g., Rao [15], p.19), 1 = exp(− f 2 (t)), (cf., eq. (50) of Sec. IV.1). (10) 8a If c20 = −8 log c > 0, then (10) implies f (t)2 ≤ c20 a so that, t ∈ R being arbitrary, |f (t)| = |f (Xt )| ≤ c0 Xt 2 . Hence f is continuous and by the previously recalled Riesz theorem f ↔ Y uniquely (f ∈ (L)∗ = L) and Y ∈ L. Now let g = u−1 (Y ) so that g ∈ Hr by (7). Also r(t, ·) = u−1 (Xt ). Thus ¯ t ) = f (t), t ∈ T. g(t) = !g, r(t, ·)" = E(Y X Hence f = g ∈ Hr and MP ⊂ Hr . For the opposite inclusion, let Z ∈ L and f˜ = u−1 (Z)(∈ Hr ). Since L and Hr are in bijective correspondence, it suffices to show that f˜ ∈ MP . Therefore define a measure PZ on Σ by the equation dPZ = KeZ dP where K −1 = Ω eZ dP = E(eZ ) and since eZ > 0, PZ ≡ P . Now Z is normal with mean zero, and variance E(Z 2 ), so 1
2
1 = PZ (Ω) = KE(eZ ) = Ke 2 E(Z ) ,
230
V. Likelihood Ratios for Processes 1
2
so that K = e− 2 E(Z ) , by the Gaussian moment generating function of Z. But for any t ∈ R and V ∈ L we have 2 1 tV e dPZ = e(tV +Z)− 2 E(Z ) dP Ω
Ω − 12 E(Z 2 )
=e
1 2
1
2
e 2 E(tV +Z) , as in (10),
2
= e 2 t E(V )+tE(V Z) = etV dPh , where h = E(V Z).
(11)
Ω
Taking V = Xt in (11), we conclude from the uniqueness theorem for bilateral Laplace transforms that PZ = Ph , the Gaussian measure with ¯ t Z) and variance E(|Xt |2 ) = r(t, t) so that h ∈ MP . mean h(t) = E(X However, for t ∈ T one has h(t) = !u−1 (Z), u−1 (Xt )", by (7), = !f˜, r(t, ·)", by definition of f˜, = f˜(t), by the reproducing property. Hence f˜ = h ∈ MP and Hr ⊂ MP . Thus MP = Hr so that the set of admissible means is not merely a vector space, but has even an inner product relative to which it is complete. The following is a useful byproduct of the above computation which is recorded for a convenient reference. 3. Corollary. For a Gaussian noise process {Xt , t ∈ T ⊂ R} on (Ω, Σ, P ) and f ∈ MP , an admissible mean, there exists a unique random variable Z ∈ L = sp{X ¯ t , t ∈ T } such that f (t) = E(ZXt ) and the likelihood ratio of Pf relative to P is given by 1 dPf = exp[Z − E(|Z|2 )], a.e. [P ]. dP 2
(12)
Remark. We note that the assumption of Gaussian noise process played a key role in using Theorem 1 for (10) and also for the uniqueness assertion in (11). Since the Aronszajn space technique depends mainly on a positive definite kernel, one might hope that the result holds for second order processes that need not be Gaussian. However, in the non-Gaussian case the set MP need not even be linear, as shown by an example below. For the general problem other types of conditions are needed and we study them in some detail for a better understanding
5.1 Sets of admissible signals or translates
231
of the subject. An alternative proof of the above proposition and its corollary may be found in Pitcher [1] along with many useful results. We consider certain other aspects of MP revealing several structural properties of Gaussian processes. 4. Theorem. Let {Xt , t ∈ T ⊂ R} be a Gaussian process on (Ω, Σ, P ) in canonical form, with MP as the space of its admissible mean values (a Hilbert space by Proposition 2). For each f ∈ MP , there is a linear functional (= f ) : Ω = RT → C which is measurable (i.e., (a1 ω1 + a2 ω2 ) = a1 (ω1 ) + a2 (ω2 ), ωi ∈ Ω, ai ∈ C and f (·) is a random variable) such that
1 dPf = exp[`f (. ) − Cf ], a.e., Cf = E(`2f ). dP 2
(12’)
¯ t ) is continuous on T × T , then the If the covariance r(s, t) = E(Xs X following additional statements hold: (a) Each f ∈ MP is continuous and, if r(·, ·) is also bounded and nondegenerate, then MP ⊂ Cb (T ) the set of scalar bounded continuous functions on T , but P ∗ (MP ) = 0 where P ∗ is the outer measure generated by P . [However, if T is locally compact then MP is an Fσδ -set and hence it belongs to the Borel completion of the cylinder σ-algebra Σ, hence P -measurable.] In case, r(·, ·) is degenerate, one has P (MP ) = 1. (b) If r(·, ·) vanishes at infinity on the locally compact space T × T , then MP ⊂ C0 (T ), a space of continuous functions vanishing at infinity with the uniform norm,and the embedding j of MP (with its inner product topology) into C0 (T ) is continuous. Moreover, if j ∗ : ˜ P = i∗ (C0 (T )∗ ) ⊂ C0 (T )∗ → MP∗ ∼ = MP is the adjoint mapping and M ˜ P , there is a unique regular signed MP∗ ∼ = MP , then for each f ∈ M (hence bounded) Borel measure F (= Ff ), generated by a function of bounded variation (denoted by the same symbol for simplicity), on the (Borel) σ-algebra of T such that r(s, t) F (ds),
f (t) =
(13)
T
˜ P ). In case of (13), the functional (and holding only for elements of M f of (12’) can be represented as (a Lebesgue integral): f (ω) =
ω(t) Ff (dt), for a.a. (ω),
(14)
T
˜ P is a (non closed) dense subspace of MP in its (Hilbertian) and M metric topology.
232
V. Likelihood Ratios for Processes
Proof. The formulas (12) and (12’) are the same. In fact, since in (12) n n Z ∈ L = sp{X ¯ , t ∈ T }, there is a sequence Z = t n i=1 ci Xti which 2 converges in L (P )-norm to Z. Also by the canonical representation of the process on Ω = RT ⇒ Xt (ω) = ω(t), it is clear that Z is linear on the vector space Ω since Zn has that property for each n. Thus if : ω → Z(ω) and C = E(Z 2 ) in (12), it becomes (12’). Moreover, if (Z , C ) is another such pair satisfying (12’), then Z − C = Z − C a.e., and since E(Z) = E(Z ) = 0 we deduce that C = C and then Z = Z a.e. So by Proposition 2, the representation (12’) holds as asserted. (a) Now suppose that r(·, ·) is also continuous and bounded. In fact, for f ∈ MP one has |f (t) − f (t0 )| ≤ f r(t, ·) − r(t0 , ·) → 0, as t → t0 ,
(15)
for each t0 ∈ T ( · being the norm of Hr ), and when r is bounded so is f since |f (t)| = |(f, r(t, ·))| ≤ f r(t, t), and vanishes at infinity if r does. Regarding the “size” of MP (= Hr ) consider with the identification map u of Hr and L(⊂ L20 (P )), for a m ˜ = u(f˜) ∈ L) resulting in function f˜ = j=1 aj r(tj , ·) ∈ Hr (X u(f˜) =
m
aj u(¯ r(tj , ·)) =
j=1
m
˜ f˜) (say). ¯ t = (X, aj X j
(16)
j=1
˜ f˜) = u(f˜) is a random variable for each Here the symbolic pairing (X, n ˜ f ∈ Hr . But for ω = i=1 di r(ti , ·) ∈ Hr , (16) becomes ˜ f˜)(ω) = (X,
m
aj ω ¯ (tj ) =
j=1
n
¯ d¯i f˜(ti ),
(17)
i=1
˜ f˜)(ω) defines by definition of Hr , and for each fixed ω ∈ Hr (= MP ), (X, an inner product (although not for all ω). In any case the latter is a random variable. Also since r is continuous and T separable, the RKHS Hr is separable. Let {ψn , n ≥ 1} be a complete orthonormal set in Hr . Then for each ω ∈ Hr one has the Fourier expansion: ω=
∞
(ω, ψn )ψn ,
n=1
∞
|(ω, ψn )|2 < ∞.
(18)
n=1
But the functions ω → (ω, ψn )ψn (t) define a collection of orthogonal (hence independent) Gaussian random variables. Indeed, one has: E[(·, ψi )ψi (t)] = ψi (t) (ω, ψi ) dP Ω
233
5.1 Sets of admissible signals or translates
= ψi (t) ( ω(s)ψi (s) ds) dP (ω) Ω T = ψi (t) ( Xs (ω) dP (ω))ψi (s) ds, T
Ω
since Xs (ω) = ω(s) = 0,
since the P -measure is centered.
Similarly, (the interchange of integrals here and below being legitimate) ¯ t (ω)ψi (s)ψ¯j (t) ds dt] dP E[(·, ψi )ψi (s)(·, ψj )ψj (t)] = [ Xs (ω)X Ω T T ¯ ¯ t dP ]ds dt = ψi (s)ψj (t)[ Xs X Ω T T ψi (s)ψ¯j (t)r(s, t) ds dt = T T = ψ¯j (t)(r(s, ·), ψi (·)) dt T = ψ¯j (t)ψi (t) dt, by the reproducing T
property of Hr , = δij ,
since the ψj are orthonormal.
(The δij is Kronecker’s symbol.) Hence by the classical Kolmogorov two series theorem one has P ∗ (A) = 0, where A = {ω :
∞
|(ω, ψi )|2 < ∞} ⊃ Hr
i=1
because the (ω → ψi ) define independent N (0, 1) random variables. Thus the set MP = Hr of admissible means is “thin”, meaning it has measure zero. If r(·, ·) is degenerate and T = [a, b] is a finite interval, then MP is finite dimensional, and evidently P (A) = 1 as well as A = Hr (= MP ). (b) If r is continuous and vanishes at infinity, then by (15) MP ⊂ C0 (T ) and the embedding is continuous since j(f ) = sup |f (t)| ≤ f sup t
t
r(t, t) ≤ K0 f < ∞.
˜ be the subspace of MP as in the statement, where Hr is identified Let M with its (Hilbert) adjoint. Thus if v ∗ ∈ C0 (T )∗ ⊂ Hr∗ ∼ = Hr , it has a continuous extension to Hr , and so v ∗ (g) = (f, g) for a unique f ∈
234
V. Likelihood Ratios for Processes
˜ P ) by the Riesz theorem. On the other hand, by Hr (v ∗ ↔ f ∈ M another classical Riesz representation for C0 (T )∗ , there is a unique Baire function Ff (of bounded variation) such that (f, g) = v ∗ (g) =
g(s) Ff (ds), g ∈ C0 (T ),
(19)
T
with Ff having the properties given in the theorem. Since r(t, ·) ∈ C0 (T ) and {r(t, ·), t ∈ T } generates Hr , it follows upon taking g = r(t, ·) that v ∗ (g) = f (t) due to the reproducing property of Hr . The continuity of r further implies that the process X = {Xt , t ∈ T } is (jointly) measurable on T × Ω for B(T ) ⊗ Σ (B(T ) is the Borel σalgebra of T ) so that T Xt Ff (dt) is a well-defined random variable (by Fubini’s theorem) and one has
2
Xt F (dt)| ) =
E(|
r(s, t) F¯f (dt)
Ff (ds)
T
T
T
f (s) Ff (ds) < ∞, by (19).
=
(20)
T
Hence for each t ∈ T , it follows that Ω
¯ ¯ t dP. ( Xs Ff (ds))Xt dP = f (t) = YX
(21)
Ω
T
By denseness of {Xt , t ∈ T } in L, we conclude that Y = Xs Ff (ds), T ¯ ¯ and that E(Y Xt ) = f (t) = Ω Xt dPf . Consequently, by Corollary 3 above, one has Y (ω) = f (ω) =
ω(s) Ff (ds), a.a. (ω).
(22)
T
˜ is a non This is (13). Since C0 (T )∗ = Hr , it can be concluded that M closed norm dense subspace of MP . Easy counter examples can also be constructed (see below) to this effect. Remarks. 1. If in the above analysis, T = [a, b], a compact interval, r is a continuous covariance with {λn , n ≥ 1} and {ψn , n ≥ 1} as its eigenvalues and eigenfunctions (these exist as discussed in Sec. IV.1), then one has (X, ψn )(ω) = T ω(t)ψn (t) dt, obtainable from (17). However, if r is merely bounded (and continuous) but T is not compact, then in the representation (19), there is a (unique) set function Ff (·) on B(T ) which is only finitely additive so that the further analysis in (20)-(22) need not hold.
235
5.1 Sets of admissible signals or translates
2. It should also be noted that, if f (·) is represented by (14) with an Ff of bounded variation, in (12’) one obtains (as observed by Pitcher [1] already): 2 C = E(f ) = E[| Xt dFf (t)|2 ] T = E[ Xs Xt dFf (s) dF¯f (t)] T T dFf (s) r(s, t) dF¯f (t) = T T f (s) dFf (s), by (13). = T
Even under the best of circumstances, if MP is infinite dimensional, ˜ M will not be closed. The following example, also due to Pitcher, illustrates this point vividly. 5. Example. Let {Xt , t ∈ R} be a stationary process with its covariance function, r, twice continuously differentiable. Define a sequence Fn of ‘rectangle’ functions given by: ⎧ if t ≤ − n1 , ⎪ ⎨ 0, Fn (t) = −2n, if |t| < n1 , ⎪ ⎩ 0, if t ≥ n1 . Then fn (s) = T r(s − t) dFn (t) = 2n[r(t + n1 ) − r(s − n1 )]. It is seen ˜ is a Cauchy sequence with limit h(s) = dr (s). If that {fn , n ≥ 1} ⊂ M ds h ∈ Hr , then one gets (by the Helly-Bray theorem) h(s) = r(s − t) dF (t). (23) T 2
However, if r(s) = e−s , then (23) becomes 2 −s2 −(s−t)2 −s2 −2se = e dF (t) = e e2st−t dF (t). T
(24)
T 2
Thus cancelling the factor e−s and differentiating the resulting equation (24) twice relative to s, one gets 2 0= t2 e2st−t dF (t), T
which implies that F must be a function with a jump at the origin of size 2 2 C and constant elsewhere. But then (24) becomes −2se−s = e−s C
236
V. Likelihood Ratios for Processes
˜ = MP , and a priori not for any s ∈ R, which is impossible. Thus M closed. ˜ P , is not easily The representing Ff of f of (13) for the elements of M obtainable. However, an important class of covariances of triangular type, considered in Example IV.1.5 including the important Brownian motion and the O.U. process, can be characterized relatively easily. We present this to emphasize the nontrivial nature of the problems of inference for stochastic processes. The result is essentially due to Varberg [2]. 6. Lemma. Let r(s, t) = u(s ∧ t)v(s ∨ t), −∞ < a ≤ s, t ≤ b < ∞, where u, v ≥ 0, uv is strictly increasing, and u, v are differentiable with derivatives of bounded variation on T = [a, b]. Then a continuous f : T → R admits the representation (13) for this kernel r with a (unique) function Ff of bounded variation iff f is differentiable with its derivative f of bounded variation and f (a) = 0 in case u(a) = 0. When these conditions hold, the representing measure Ff is given by Ff (t) = − a
t
1 dλf (s), v(s)
t ∈ T,
where for a < t < b, ( fv ) v(t)f (t) − f (t)v (t) λf (t) = = u (t), v(t)u (t) − u(t)v (t) (v) and λf (a)u(a) = f (a), λf (b) = 0. Proof. As noted in Example IV.1.5, the r given in the statement is a covariance (termed of triangular type). For the BM with T = [0, 1] it is obtained with u(t) = t, v(t) = 1 and for the O.U. process with u(t) = eβt , v(t) = e−βt . To proceed with the proof, (13) implies, on integrating by parts and using the fact that the derivatives u , v exist by hypothesis,
t
b
u(s) dF (s) + u(t) v(s) dF (s) a t t = −v(t)u(a)F (a) − v(t) F (s)u (s) ds+
f (t) = v(t)
u(t)F (b)v(b) − u(t)
a b
F (s)v (s) ds.
t
Since u , v and F are of bounded variation, these integrals exist and by the (Lebesgue) fundamental theorem of calculus f (·) exists and is also of bounded variation. Clearly f (a) = 0 if u(a) = 0.
5.1 Sets of admissible signals or translates
237
It is the converse that is of interest since it gives the desired explicit form for F . By the uniqueness of the representation, it suffices to verify that the Ff of the statement satisfies (13). In fact, substituting it in the integral of (13) one obtains, on using integration by parts again and a simplification, b t b u(s) dλf (s) − u(t) r(s, t) dFf (s) = −v(t) dλf (s) a a v(s) t t u(a) f u u + v(t) ( ) (s)[( ) ]−1 (s)( ) (s) ds = v(t)λf (a) v(a) v v a v t f f (a) + v(t) d( )(s) = v(t) v(a) v a = f (t), since ( fv ) is of bounded variation. Thus Ff satisfies (13). For the triangular covariances considered here we can restate Theorem 4 in a simplified form and to use later on for further analysis, especially in the next section (and for an extension, see Proposition VII.1.4). It is convenient to first give a continuity property of the sample paths of the Gaussian processes characterized by continuous triangular covariances. 7. Lemma. Let {Xt , t ∈ [a, b]} be a Gaussian process with continuous mean and triangular covariance functions m and r = uv where u v is strictly increasing (and no differentiability hypotheses, but u, v ≥ Yt −m(α(t)) 0). If α(t) = u(t) v(t) , Yt = Xα(t) , and Zt = v(α(t)) , then {Zt , t ∈ −1 −1 + [α (a), α (b)]} ⊂ R is a Brownian Motion process obtained from the X-process by a (strict) time change (namely α(·)). Moreover, the X-process has independent increments and thus is Markovian. Proof. By hypothesis, α(·) is one-to-one, order preserving and continuous on [α−1 (a), α−1 (b)] onto [a, b]. Let α−1 (a) < t1 < · · · < tm < α−1 (b) and consider si = α(ti ), so a < si < si+1 < b. It is clear that the Yt , Zt -processes are Gaussian, E(Zt ) = 0 and for t1 , t2 ∈ [α−1 (a), α−1 (b)] one has E(Zt1 Zt2 ) =
r(s1 , s2 ) = min(t1 , t2 ). v(s1 )v(s2 )
Consequently the Z-process is a BM. Letting ui = u(si ), vi = v(si ) and u0 = 0 = vn+1 , un+1 = 1 = v0 , si < si+1 , we get with Rn = (r(si , sj ), 1 ≤ i, j ≤ n) that |Rn | =
n 3 i=1
(ui vi−1 − ui−1 vi ) > 0,
238
V. Likelihood Ratios for Processes
where |Rn | is the determinant of the matrix Rn . This again shows that the triangular function defined in Lemma 6 is positive definite and hence is a covariance. The finite dimensional density of Y˜t = Xt − m(t) is given by: 1
fs1 ,... ,sn (x1 , · · · , xn ) = (2π)− 2 |Rn |− 2 × n (vk−1 xk − vk xk−1 )2 1 . exp − 2 vk−1 vk (uk vk−1 − vk uk−1 ) n
k=1
This shows that the density of Xs2 − Xs1 and Xs3 − Xs2 , s1 < s2 < s3 factors so that the differences are independent. It follows that the Xprocess has independent increments, and hence is Markovian. Now {Y˜t , σ(Ys , s ≤ t), t ∈ [a, b]} is easily verified to be a martingale. 8. Important remark. Now a BM process has a.a. continuous sample paths. This is not a simple property, and it needs a separate proof. We indicated it after the proof of Theorem IV.2.6 along with a construction of BM. (For another proof, see the companion volume, Rao [21], p.201.) This implies from the relation Xα(t) = m(α(t))+v(α(t))Zt that the Xt -process also has a.a. continuous paths. Moreover, since a BM has almost no paths of bounded variation, the same is true of the Xt -process and hence the integral relative to X cannot be defined in Lebesgue’s sense. However, using a formal integration by parts techb b nique, one can define a h(t) dX(t) = X(t)h(t)|ba − a X(t)h (t) dt for all differentiable h vanishing at the boundary points and extending it for all C([a, b]) using a classical argument due to Paley-Wiener-Zygmund [1]. But the preceding relation for simple functions shows that X is L2,2 -bounded, and hence the integral can be defined directly as seen in Sec. IV.2 (just preceding Theorem IV.2.7). Consequently the Ff (·) of Lemma 6, and the stochastic integral recalled, we can reformulate Theorem 4 when r is as in Lemma 6, covering the BM, Brownian bridge, as well as the O.U. processes among others. 9. Theorem. Let {Xt , t ∈ T = [a, b]} be a Gaussian process on a canonically represented space (Ω, Σ, P ), with mean zero and a continuous triangular covariance r(s, t) = u(s ∧ t)v(s ∨ t), s, t ∈ T , where u, v ≥ 0, uv strictly increasing, the derivatives u , v exist and of bounded variation (in particular if u, v are twice continuously differentiable). Then for each f ∈ MP such that f (a) = 0 when u(a) = 0, f exists and is of bounded variation (so (13) holds for the f ), one has (12’) as: 1 dPf (x) = exp{ dP 2
T
( fv ) 2xt − f (t) ] − C1 + C2 x(a)}, (t) d[ ( uv ) v(t)
(12”)
239
5.1 Sets of admissible signals or translates 2
f (a) f (a) where C1 = 2u(a)v(a) , and C2 = u(a) with 00 taken as 0, and the integral in (12”) is the stochastic integral for the L2,2 -bounded process X.
Proof. Substituting the expression for F (t) from Lemma 6 in (12’), and evaluating the resulting integral (use by parts), one obtains (12”). This form of the general result will be of interest in the next section for concrete illustrations. It also serves as a motivation for the ensuing general treatment of the subject. The fact that P and Q(= Pf ) are Gaussian measures on (Ω, Σ) was used in an essential manner in the above Theorems 4 and 9. If even one of the measures is not Gaussian then, as seen from the next example (noted by Skorokhod [3] in another context), the set MP of translates need not be convex, much less a vector space. We again take (Ω, Σ) to be a canonically represented couple in the illustration with Ω = RT and Σ as the cylinder σ-algebra of Ω. 10. Example. Let P : Σ → R+ be a Gaussian probability and MP be the set of all admissible translates of P . Consider f0 ∈ Ω − MP , so that f0 = 0. Then Pf ⊥ P by definition (of f0 ), so that αf0 ∈ / MP for any α = 0, by Theorem 4. If a = b, then Paf0 ⊥ Pbf0 , and define a mixture Q on Σ as: Q=
∞
αk Pkf0 , αk > 0,
k=−∞
∞
αk = 1.
(25)
k=−∞
Then Q is a probability measure on Σ which is not Gaussian, and f0 ∈ MQ . By construction, tf0 ∈ MQ iff t is an integer. Thus if 0 < t < 1, then tf0 ∈ / MQ so that MQ is not convex. However, it is a semi-group under addition. But Proposition 2 fails for Q. In the general case, the analysis of the set MP of admissible means has to employ other techniques. We analyze this point with the following result, due to Pitcher [3], to illustrate the problems involved. 11. Proposition. Let {Xt , t ∈ T = [a, b] ⊂ R} be a centered second order process with a continuous covariance function r. Then the set of 1 admissible translates MP satisfies the inclusion MP ⊂ R 2 (L2 (T, dt)) in the sense that for each f∈ MP there is an h ∈ L2 (T, dt) such that 1 f = R 2 h where (Rg)(s) = T r(s, t) g(t) dt, g ∈ L2 (T, dt); so R is the integral operator determined by the kernel r. Proof. First note that Rg is well defined for each g ∈ L2 (T, dt) and Rg ∈ L2 (T, dt). In fact, since r(·, ·) ∈ L2 (T × T, ds dt) one has
2
|Rg| (t) dt = T
|
T
T
r(s, t)g(s) ds|2 dt
240
V. Likelihood Ratios for Processes
2 ≤ [ |r(s, t)| ds |g(s)|2 ds] dt, T
T
T
by the CBS inequality, |r(s, t)|2 ds dtg22 < ∞. = T
Then the equation
F (ω, h) =
(26)
T
ω(t)h(t) dt, ω ∈ Ω = RT , h ∈ L2 (T, dt),
T
is well-defined, and moreover for any h, g ∈ L2 (T, dt) one has (with Fubini’s theorem): F (ω, h)F (ω, g) dP = [ ω(t)¯ ω (t) dP ]h(s)¯ g (t) ds dt Ω T T Ω = r(s, t)h(s)¯ g (t) ds dt T T = (Rh)(t)¯ g (t) dt T 1
1
= (Rh, g) = (R 2 h, R 2 g),
(27)
1
where (f, g) is the inner product of L2 (T, dt) and R 2 is the unique selfadjoint positive square root of the positive definite R (i.e., (Rh, h) > 0, h = 0). Thus F (·, h) ∈ L2 (P ) for each h ∈ L2 (T, dt). Define a new 1 (semi-)inner product (·, ·)R by the equation (h, g)R = (R 2 h, g) so that 1 1 1 h22,R = (h, R 2 h) gives a semi-norm for L2 (T, dt). If R 2 hn → R 2 h in L2 (T, dt) or equivalently hn → h in · 2,R , then F (·, hn ) → F (·, h) in L2 (P ), so that there is a subsequence F (·, hni ) → F (·, h) a.e.[P ] and in L2 (P ). Suppose now f ∈ MP , so that Pf P and F (·, hni ) → F (·, h), a.e. [P ] also. If an = T f (t)hn (t) dP , then F (ω, hn ) dPf = F (ω + f, hn ) dP Ω Ω (ω + f )(t)hn (t) dt dP (ω) = Ω T = [ ω(t) dP ]hn (t) dt + [ f (t)hn (t) dt]dP T
Ω
Ω
T
= 0 + an . Similarly (as in (27)) F (ω, hn )F (ω, hn ) dP = (Rhn , hn ). Ω
(28)
241
5.1 Sets of admissible signals or translates
If Yn = F (·, hn ) − an , then 2 |Yn − Ym | dP = |F (ω, hn ) − F (ω, hm )|2 dP → 0, Ω
Ω
as n, m → ∞, i.e., as hn → h in · 2,R . Since F (·, hni ) → F (·, h), a.e., and in L2 (P ), we have ani → a. But ani = (f, hni ) and so (f, ·) is (sequentially and then generally) continuous on L2 (T, dt) with metric · 2,R . So a = a(h) must be of the form (by Riesz representation) 1 (h, g)R for a unique g = gf ∈ L2 (T, dt). This implies that f = R 2 g 1 and hence MP ⊂ R 2 (L2 (T, dt)) as desired. 1
Remark. One can show in the Gaussian case that MP = R 2 (L2 (T, dt)) in the above proposition under the same conditions while there is proper inclusion in general. As Example 10 implies, even for the family MP of admissible translates of P , one is not always certain of establishing a nontrivial theory of hypothesis testing for {Pf , f ∈ MP }, especially in the composite case. In other words, can Pf and Ptf , t ∈ R, have such a test theory for H0 : Pf vs Ht : Ptf , t = 1 so that Ptf Pf ( P )? For an inclusion of these problems in our study, it is necessary to find conditions on P in order that MP is a (convex) cone, or a linear set, complementing the assertions of Theorem 4 for non Gaussian families. It is desirable to understand this situation well before studying the general case of {Pα , α ∈ I}, a family of measures on (Ω, Σ), which can only be considered later. This circle of ideas was investigated, in detail, primarily by T. S. Pitcher, and the ensuing analysis on these problems is based mostly on his work. As in Proposition 7 above, we start with a second order real process with a canonical representation, for convenience. Thus let X = {Xt , t ∈ T = [a, b] ⊂ R} be a real process on (Ω, Σ, P ), such that E(Xt ) = 0 and E(Xs Xt ) = r(s, t), a continuous covariance, so that X ∈ L2 (dt dP ). Then, as in the preceding case, the kernel r(·, ·) ∈ L2 (ds dt), and hence has eigenvalues {λn , n ≥ 1, λn > 0} and the (corresponding) eigenfunctions {ψn , n ≥ 1} which form a complete orthonormal set in L2 (T, dt). These are used to replace the process {Xt , t ∈ T } by a countable set of random variables which determine the process, as detailed in the first section of Chapter IV. Thus one has ψn and λn satisfying (T = [a, b]): b ψn (t) = λn r(s, t)ψn (s) ds, (29) a
so that if 1 Xn = √ λn
b
Xt ψn (t) dt, a
(30)
242
V. Likelihood Ratios for Processes
then the sequence {Xn , n ≥ 1} forms the (observable) coordinate family whose finite dimensional distributions determine (and are determined by) the given process. It is immediate that E(Xn ) = 0, E(Xm Xn ) = δmn and as noted before Xt =
∞
ψn (t) Xn √ , t ∈ T, λn n=1
(31)
the series converging in L2 (P )-mean for each t. Also note that for each f ∈ MP = M (X) if fn = √1λ T f (t)ψn (t) dt, then (MP ⊂ n
1
R 2 L2 (T, dt)) ∞
|fn |2 =
n=1
T
T
T
T
∞ ψn (s)ψn (t) f (t)f (s) λn n=1
ds dt
r(s, t)f (s)f (t) ds dt < ∞,
=
(32)
(by Mercer’s theorem). Thus {fn , n ≥ 1} ∈ 2 , the Lebesgue sequence (Hilbert) space. Without assuming that the X-process is Gaussian, we can present a general result, essentially due to Pitcher [3], on the linearity of M (X) as: 12. Theorem. Let X = {Xt , t ∈ T = [a, b]} be a centered continuous in L2 (P ) process with a (non-degenerate) covariance r. Consider {Xn , fn , n ≥ 1} defined above. Let Pn be the distribution (on Rn ) of {Xi , 1 ≤ i ≤ n}, n ≥ 1, which is absolutely continuous relative to the Lebesgue measure with density pn . Suppose that these densities satisfy, for each n ≥ 1 the conditions: (i) pn (x) ≥ 0, n ≥ 1, a.e., (ii) lim|xj |→∞ pn (x1 , · · · , xj , · · · , xn ) = 0, for all xi , (i = j), 1 ≤ i ≤ n, n ≥ 1, n (iii) ∂p ∂xj , 1 ≤ j ≤ n, exists and satisfies the bound n j=1
Rn
∂ log pn ∂xj
2 (x) dPn (x) ≤ K0 < ∞,
where K0 > 0 is independent of n. Then the set M (X) is a positive cone (or positively linear). If, in addition, each of the densities pn is symmetric about the origin then M (X) is linear. Remark. As the proof below shows, one needs to have essentially new ideas (involving semi-group techniques) to conclude that f ∈ M (X) ⇒
5.1 Sets of admissible signals or translates
243
tf ∈ M (X) for all t > 0. They were found and employed with considerable effectiveness by Pitcher in several publications. Some of them will be employed in our analysis as the need arises. It may be observed that the conditions of the theorem are similar in many respects to (but sharper than) those used in the familiar asymptotic studies of maximum likelihood estimators (cf., e.g., Sections III.4 and IV.5). Proof.The idea here is to associate a semi-group of contractive linear operators on L2 (P ) for each f ∈ M (X) and extend it to tf for t ≥ 0. Thus let f = {fn , n ≥ 1} ∈ 2 as in (32) and t > 0, so that the likelihood ratio 2 t Yn (ω)
=
pn (X1 (ω) − tf1 , · · · , Xn (ω) − tfn ) , ω ∈ Ω, pn (X1 (ω), · · · , Xn (ω))
(33)
has the property that {t Yn2 , Fn , n ≥ 1} is a positive martingale on (Ω, Σ, P ), where Fn = σ(X1 , · · · , Xn ). [This was shown after Theorem IV.2.2 in detail.] Hence by the conditional Jensen inequality for concave functions, one deduces that {t Yn , Fn , n ≥ 1} is a positive super martingale with the property that E(t Yn2 ) = 1 for all n, so that it is uniformly integrable and t Yn → t Y∞ a.e. [P ], and in L1 (P ) as well, by the familiar martingale convergence theorem. [The pointwise convergence of a positive super martingale holds without further hypothesis, but the L2 (P ) boundedness just gives uniform integrability and hence only the L1 (P )-mean convergence (and not L2 (P ) convergence). The L2 (P ) convergence would be true, however, if the sequence were a sub martingale instead, c.f, e.g., Rao [25], p.120, which ours is not.] We now associate the desired semi-group of operators on L2 (P ), and show that it is strongly continuous to achieve our goal as the key part of the proof. We do this in three steps for convenience. 1. Let Vn (t) : L2 (P ) → L2 (P ) be defined by Vn (t)g(X1 , · · · , Xn ) = t Yn · g(X1 − tf1 , · · · , Xn − tfn ),
(34)
where g is a Borel function on Rn , n ≥ 1. Then g(X1 − tf1 , · · · , Xn − tfn ) is Fn -adapted for t ≥ 0 and n since the fn are non stochastic. We observe that t → Vn (t), n ≥ 1 forms a strongly continuous semigroup of linear isometries on L2 (Ω, Fn , P ) for each n, and moreover that Vn (t)g(X) → V (t)g(X), a.e. and, because of the uniform integrability, 2 also in L1 (P ), as n → ∞, for all bounded g(X) ∈ ∪∞ n=1 L (Ω, Fn , P ) where X = {Xn , n ≥ 1}. Now 2 2 2 2 Vn (t)g (X1 , · · · , Xn ) dP = t Yn g (X1 − tf1 , · · · , Xn − tfn ) dPn Ω Ω = g 2 (X1 − tf1 , · · · , Xn − tfn )× Ω
244
V. Likelihood Ratios for Processes
dPn (X − tf ), by (33) and (34), g 2 (X1 , · · · , Xn ) dPn , =
(35)
Ω
by a change of variables. This shows that Vn (t) is an isometry on the dense subset ∪n≥1 L2 (FN , P ); it is taken (for convenience) that Σ = σ(Xt , t ∈ T ) = σ(∪n≥1 Fn ). The semi-group property is likewise verified as follows. Let t1 , t2 ≥ 0 and consider for g, h as above: 2 g(X)(Vn (t1 )Vn (t2 )h(X)) dPn = g(X + t1 f )[Vn (t2 )h(X)]2 dPn Ω Ω = g(X + t1 f + t2 f )h2 (X) dPn Ω = g(X)[Vn (t1 + t2 )h(X)]2 dPn . Ω (36) This implies that Vn (t1 )Vn (t2 ) = Vn (t1 +t2 ) on a dense subset of L2 (P ) and hence the operators (isometries by (35)) have unique extensions (using the same symbols) to all of L2 (P ) with the same properties. We conclude that {Vn (t), t ≥ 0} is a semi-group of isometries on L2 (P ) for each n. Taking t1 = t2 = 1 here shows immediately that with f, g ∈ M (X) also f + g ∈ M (X) so that the set M (X) is a semigroup under addition. Since the process X has a continuous covariance function, it follows that X is a “separable” and a continuous in mean random function. But then σ(∪n Fn ) = Σ is countably generated so that L2 (P ) can be taken to be a separable space for this proof. With this observation, it follows that t → Vn (t)g(X) is (weakly and hence) strongly measurable. But the semi-group property implies that t → Vn (t)g(X) is continuous by a classical theorem of Sz.-Nagy (see Hille-Phillips [1], p.588 for a simple proof, or Dunford-Schwartz [1], p.616); whence {Vn (t), t ≥ 0} is a strongly continuous semi-group of isometries. This establishes the desired property of the Vn (t). It then follows for each g(X), in the dense subset noted above, that Vn (t)g(X) → V (t)g(X), a.e. and uniformly, so the convergence is also in L1 (P )-mean defining the family {V (t), t ≥ 0} on a dense set, and from the point-wise convergence (by Fatou’s lemma) V (t)g(X)2 ≤ lim inf n→∞ Vn (t)g(X)2 ≤ g(X)2 , or V (t) ≤ 1. So each of these operators has a unique bound-preserving extension to L2 (P ). Then to complete the proof, it should be shown that {V (t), t ≥ 0} is also a semi-group of isometries on L2 (P ), the strong continuity being a consequence of the uniform (strong) convergence. The semi-group property of V (t) is the difficult part and it should be obtained from the corresponding fact of the Vn (t). We now establish this key result using the conditions (i)-(iii) of the hypothesis at this point.
5.1 Sets of admissible signals or translates
245
2. The behavior of the limit set of the sequence {Vn (t), n ≥ 1} of semi-groups, namely {V (t), t ≥ 0}, cannot be deduced directly. It depends on an important, and by now classical, theorem due to Trotter (for a detailed proof, see Kato [1], p.503, or Goldstein [3], p.46). The ∞ setup for this is as follows. Let Rn = 0 e−λt Vn (t) dt (as a strong integral), and let R(λ) be similarly defined with V (t) in place of Vn (t). Both these operators are contractive and strongly continuous. Moreover, the classical Hille-Yosida theory implies that Rn (λ) is a resolvent for each n and that the strong convergence of Vn (t) → V (t) implies that Rn (λ)h(X) → R(λ)h(X), λ > 0, for each bounded Borel h depending on a finite number of coordinates, i.e., h is m-dimensional, for some m ≥ 1. Such functions form a dense set in L2 (P ), as noted in Step 1 above. Then the desired result follows if we show that the limit R(λ) is a resolvent in L2 (F∞ , P ) = L2 (σ(∪n Fn ), P ). By the above noted theorem of Trotter’s, this holds if R(λ) satisfies: (a) R(λ) − R(λ ) = (λ − λ )R(λ)R(λ ), (b) λn R(λ)n ≤ K0 < ∞, n ≥ 1, (c)
lim λR(λ) = I, (strongly in L2 ),
λ→∞
for all λ, λ > 0. These conditions will now be verified in our case. Now Rn (λ), being the resolvent of a strongly continuous semi-group of operators, already satisfies (a) on L2 (P ) by the classical theory for each n. Hence letting n → ∞, the identity remains valid for the limit elements. Regarding (b) for each bounded Borel h, h(X) ∈ L2 (P ) and since R(λ)n h(X) = R(λ)(Rn−1 (λ)h(X)), and a simple estimate on the n-fold integral shows that R(λ)n h(X)2 ≤ λ1n h(X)2 , so that the corresponding inequality holds with K0 = 1 for h(X) in a dense subset of L2 (P ). Consequently (b) is also valid. We now verify (c) for g on Rn , n ≥ 1, with bounded first partial derivatives. These g(X) again constitute a dense set in L2 (P ), and it suffices to verify (c) for such g. Thus by definition: ∞ d λRn (λ) = (−e−λt ) dt t Yn g(X1 − tf1 , · · · , Xn − tfn ) dt 0 ∞ n ∂g −λt e t Yn ( fi (X1 − tf1 , · · · , Xn − tfn ) dt = g(X) − ∂xi 0 i=1 √ ∞ ∂ pn −λt n − e {[(σi=1 fi )g(X1 − tf1 , · · · , Xn − tfn )]× ∂xi 0 1
pn (x)− 2 } dt,
(37)
with integration by parts, which is valid for the strong (or Bochner) ∞ ∂g integrals. But t Yn 2 = 1, ∂x is bounded, and i=1 fi2 < ∞. Hence i
246
V. Likelihood Ratios for Processes
∞ the first integral of (37) is bounded by C1 0 e−λt dt = Cλ1 for some C1 > 0 independent of n. The second integral can be bounded, with the CBS-inequality for the terms in brackets as: √ ∂ pn { } dPn = { fi (x1 − tf1 , · · · , xn − tfn )}2 dx1 · · · dxn ∂xi Ω Rn i=1 √ n n ∂ pn 2 2 ≤ fi ( ) dx1 · · · dxn ∂xi n i=1 i=1 R
2
≤ K0
n
∞
fi2 < ∞, by (iii) of hypothesis.
i=1
Hence the norm of this term is similarly bounded by 0. Then
C2 λ
for some C2 >
λR(λ)g(X) − g(X)2 = lim (λRn (λ) − I)g(X)2 n
C1 + C2 → 0, ≤ λ as λ → ∞. Since such g(X)’s form a norm dense set in L2 (P ), condition (c) of Trotter’s approximation theorem is verified. So we can now conlude that {V (t), t ≥ 0} is a strongly continuous contractive semigroup in L2 (P ). It remains to establish that V (t) is an isometry. 3. Since for each t, t Yn 2 = 1, and we have seen that t Yn → Qt a.e., (and in L1 (P )-norm) it follows that t Yn2 → Q2t a.e. But when dP dP t = 1, Yn2 = dPf |Fn → dPf |F∞ , a.e., by the Andersen-Jessen theorem dP (cf., Theorem IV.1.1) for any f ∈ M (X). Hence Q21 = dPf |F∞ , a.e. Now if g is any (real) bounded Borel function, in Rn , then (35) is valid if Vn (t)2 there is replaced by Q21 = V (1)2 and thus for a bounded Borel g (depending on a finite number of coordinates) and 0 < t < 1: g(X)22 = V (1)g(X)22 = V (1 − t)V (t)g(X)22 , (semi-group property of V (t), ≤ V (1 − t)2 V (t)g(X)22 ≤ V (t)g(X)22 ≤ g(X)22 ,
(38)
since V (t) is a contraction. Thus there is equality. Since g(X) is in a dense subset of L2 (P ), the unique extensions of V (t) on to L2 (P ) (again denoted V (t)) are also isometries. Hence one can conclude that {V (t), t ≥ 0} is a strongly continuous semi-group of isometries on L2 (P ). Next taking g = 1 in the above computation, and taking
247
5.2 General Gaussian processes
another such Borel h on Rn so that h(X1 , · · · , Xn ) is Fn -adapted, one gets: √ (V (t)1)2 h(X1 , · · · , Xn ) dP = V (t) h(X1 + tf1 , · · · , Xn + tfn )22 , Ω
as in (35), h(x1 + tf1 , · · · , xn + tfn ) dPn (x), = Rn
by the isometry (38), h(x1 , · · · , xn ) dPn (x + tf ) = n R = h(X1 , · · · , Xn ) dPtf (X).
(39)
Ω
But such h(X)’s form a dense set, and so we can conclude that Ptf P dP as well as dPtf = (V (t)1)2 a.e., and tf ∈ M (X). Since M (X) was already noted to be a semi-group, it follows on replacing f by mf here that (m + t)f ∈ M (X); whence M (X) is a positive cone. In case each pn is symmetric about the origin, so that pn (−x) = pn (x), the above analysis shows {V (−t), t ≥ 0} is also a semi-group (and thus {V (t), t ∈ R} is a group). Thus with each f ∈ M (X), af ∈ M (X), for all a ∈ R, and M (X) becomes a linear space. Remark. This result may be contrasted with Proposition 2 and Theorem 4, where the process was Gaussian. Most of the special properties of the latter are not available in the general case, and the methods are necessarily different, albeit they are sufficient conditions. Another type of result, based on the ideas of Skorokhod’s [3] will be discussed in the complements. Incidentally, in the latter paper it was stated that the process described by Example 5 above constitutes a counter example to the conditions of the preceding theorem. However, one may verify that it does not satisfy the hypothesis, and in fact it can not contradict the result which is true. We present some additional material related to this circle of ideas in the complements section if Ω is a Hilbert space instead of the (canonically represented) function space RT . Since for inference theory, one needs to find expressions for likelihood ratios, we consider this problem for three broad classes of processes in the rest of this chapter, after a detailed analysis of Gaussian case. 5.2 General Gaussian processes Since exact forms of finite dimensional distributions of Gaussian processes are available, and they are uniquely determined by their mean
248
V. Likelihood Ratios for Processes
and covariance functions, it follows that an inference analysis can be restricted to these parameters. The functional form allows a detailed investigation, and sharp results are obtainable (cf., e.g., Theorem 1.4). Thus we present here some explicit forms of likelihood ratios of these processes. So motivated one can base an extended treatment of more general classes, and this will be done in the rest of the chapter. We start with a class of Gaussian processes having (possibly) different means and triangular covariance functions, extending Theorem 1.9, giving an explicit expression for the corresponding likelihood ratio which supplements Example IV.1.5 (see also the remarks following it). The next result is essentially due to Varberg [2]. 1. Theorem. Let {Xt , t ∈ T = [a, b]} be a canonically represented proP r1 r2 ) where P = Pm , Q = Pm are a pair of Gaussian probcess on (Ω, Σ, Q 1 2 ability measures with continuous admissible mean functions (relative to P0ri ) and each mi of bounded variation with ri as continuous triangular covariances (i.e., ri (s, t) = ui (s ∧ t)vi (s ∨ t), s, t ∈ T, i = 1, 2, ui , vi ≥ 0) r2 r1 having derivatives of bounded variation on T . Then Pm ∼ Pm iff 2 1 (1) u1 (t)v1 (t) − u1 (t)v1 (t) = u2 (t)v2 (t) − u2 (t)v2 (t), t ∈ T, (i.e., the Wronskians W (ui , vi ), i = 1, 2 are equal) and (2) {u1 (a), v1 (a)} ⊂ (0, ∞); or u1 (a) = 0 = v1 (a) and m1 (a) = m2 (a). When these conditions hold, the likelihood ratio is given by r1 r2 dPm dPm dP0r2 1 −m2 2 = (X − m ) · (Xt − m1 ), t 2 r1 dPm dP0r1 dP0r1 1
(1)
where the right side product stands for the following expressions: 1 dP0r2 [D2 X 2 (a)+ r1 (X) = D1 exp dP0 2 d u1 X 2 (t) ! dt ( v2 ) , d d u2 v1 (t)v2 (t) T dt ( v 2
with
D1 = D2 =
2 (b) 2 [ vv11 (a)v (b)v2 (a) ] ,
if u1 (a) = 0,
2 (b) 1 2 [ uu21 (a)v (a)v1 (b) ] ,
if u1 (a) = 0,
1
if u1 (a) = 0,
0, ( u21v2
−
1 u1 v1 )(a),
if u1 (a) = 0.
(2)
249
5.2 General Gaussian processes
The expression for the second factor is just (12”) of the last section, namely r1 dPm (X) = exp C1 + C2 X(a)+ dP0r1 d m 1 2X(t) − f (t) ! dt ( v ) , d d u 2 T dt v(t) (v)
(3)
with C1 , C2 defined there, and the last symbol in [ ] of (3) is a stochastic integral (cf., Remark 1.8). The proof depends on a smoothness property of the parameters, mean and covariance, of a Gaussian process, which was discovered by Baxter [1]. It admits refinements when the general case is specialized to Markovian or stationary families. Here is Baxter’s basic result: 2. Proposition. For a Gaussian process {X(t), t ∈ T = [a, b] ⊂ R} with mean m and covariance r, suppose that the derivative m exists is bounded, and r is continuous on the square T × T with uniformly bounded (mixed) second derivatives just off the diagonal of the square. Then with probability one fr (s) ds, (4) lim un (X(t)) = n→∞
T
exists, where 2 n
un (X(t)) =
[X(a +
k=1
k(b − a) (k − 1)(b − a) 2 ) − X(a + )] , n 2 2n
(5)
and fr (t) = D− (t) − D+ (t), the left (right) derivatives of r on the diagonal which are to exist and are given as: D± (t) =
lim
s→t+ (t− )
r(t, t) − r(s, t) . t−s
(6)
Proof. The idea of proof is simple. Namely, if An = {ω : |un (X(t))(ω)− −n 2 E(un (X(t)))| ≥ εn } where εn = n2 ∞ , then it suffices to verify (by the first Borel-Cantelli lemma) that n=1 P (An ) < ∞ so P [supn An ] = 0 and that the mean, E(un (X(T ))), tends to the right side integral of (4). This is verified by the following (not so simple) computation. For convenience m(t) = 0 may be assumed, since otherwise consider (b−a)(k−1) ˜ X(t) = X(t) − m(t) and if ΔXk (t) = X(a + b−a ) 2n ) − X(a + 2n with a similar expression for Δmk so that 2 n
k=1
2 n
2
(Δmk ) ≤ max |Δmk | k
k=1
|Δmk | → 0,
250
V. Likelihood Ratios for Processes
as n → ∞ since |Δmk | → 0 uniformly and m is of bounded variation (indeed m exists). Consequently by the CBS-inequality 2n
2 ˜k Δmk ΔX
2 n
≤
k=1
2 n
2
(Δmk )
k=1
˜ k )2 → 0, (ΔX
k=1
˜ with probability one if (4) holds for the X-process. But then 2 n
k=1
2 n
2
(ΔXk ) =
2 n
2
˜k ) + 2 (ΔX
k=1
2 n
˜k + Δmk · ΔX
k=1
(Δmk )2
k=1
and the last two terms tend to zero with probability one so that X(t) ˜ and X(t)-processes satisfy (4), the limit being independent of (the bounded differentiable) m. Thus let m = 0 and for simplicity take also a = 0, b = 1. 2n Consider un (X(t)) = k=1 (ΔXk )2 . Let ajk = E(ΔXj ΔXk ) and 2n note that E(un (X(t))) = k=1 akk . Also recall that for a Gaussian random variable Y with mean zero, one has E(Y 4 ) = 3V ar2 Y . Hence 2 n
E(u2n (X(t))) =
3a2kk + 2
(ajj akk + 2a2jk ).
(7)
1≤j
k=1
Now for the event An defined above the following simple upper bound ˘ for its probability is obtained by the Ceby˘ sev inequality: 2n V ar(un (X(t))) n2 2n = 2 [E(u2n (X(t))) − (E(un (X(t))))2 ] n 2n 2n 2 akk + 4 a2jk ], by (7), = 2 [2 n n
P (An ) ≤
1≤j
k=1
=
2n [2( n2
2n
a2kk + 2
=
2n 2 n2
j,k=1
a2jk )
j =k
k=1
2n
a2j,k =
2n bn (say). n2
(8)
It suffices to show that 2n bn is bounded in n using the remaining hy∞ pothesis. It then follows that n=1 P (An ) < ∞ and completes the argument.
251
5.2 General Gaussian processes
Given the (mixed) second partial derivatives of r are uniformly bounded by M (say) in the set T × T − {diagonal}, we use the Taylor series expansion to get for 1 ≤ j = k ≤ 2n : k−1 j−1 k j , ) + r( n , n )− 2n 2 n 2 2 k−1 j k j−1 r( n , n ) − r( n , n )| 2 2 2 2 1 n ≤ 3M ( ) . 2
|ajk | = |r(
Similarly for 1 ≤ j = k ≤ 2n − 1, one has 2
E(un (X(t)) ) =
n −1 2
k=1 1
fr (
k 1 ) + O(1) 2n 2n
→
fr (t) dt,
0
as n → ∞,
since the right side is a Riemann sum which converges to the integral. In view of the initial simplification, this establishes (4). Remarks. 1. If r is differentiable on all of T ×T , so that f = 0, then the result is not of much interest, since then it even excludes the BM whose covariance is given by r(s, t) = min(s, t), 0 ≤ s, t ≤ 1. In the latter case one finds that f (t) ≡ 1, and (4) reduces to the classical result due to P. L`evy, namely n lim (ΔXk )2 = 1, a.e. (9) n→∞
k=1
The same result, (9), also holds if r(s, t) = min(s(1 − t), t(1 − s)), 0 ≤ s, t ≤ 1. For the (general continuous) triangular covariances (4) holds with (possibly) f ≡ 1. 2. If P, Q are a pair of probability measures on (Ω, Σ) and {Xt , t ∈ T = [a, b]} is a process governed by both, suppose P, Q have zero means and continuous covariances r and ρ so that EP (Xt ) = 0 = EQ (Xt ) and r(s, t) = EP (Xs Xt ), ρ(s, t) = EQ (Xs Xt ). Let fr , fρ be the corresponding functions given by (5) for the process with covariances r and ρ. Then by the above proposition for any a < c ≤ b one gets
c
lim un (Xt ) =
n→∞
fr (s) ds,
a.e.[P ]
fρ (s) ds,
a.e.[Q].
a
= a
c
252
V. Likelihood Ratios for Processes
If Q P Q so that they are equivalent measures (hence have the same class of null sets), then both limits must be the same. Thus c c fr (s) ds = fρ (s) ds, a < c ≤ b. a
a
But then by the Lebesgue version of the fundamental theorem of calculus fr = fρ a.e. (and everywhere if they are continuous). This consequence (noted by Varberg) is of interest in the proof of the above theorem and elsewhere. 3. Suppose only that B1 = T fr1 (s) ds = T fr2 (s) ds = B2 . Then under the hypothesis of the proposition, un (Xt ) → Bi a.e. [Pri ]. Since B1 = B2 the classes of null sets of P ri , i = 1, 2, must be different so that the measures are not equivalent. But then by the Gaussian dichotomy (Proposition 1.1) the measures must be singular. This situation will be considered in more detail in the following work. Proof of Theorem 1. Given the equivalence of Pfr2 and P0r1 , one has by Remark 2 of Baxter’s proposition that fr1 = fr2 a.e. This means Condition 1 holds. If P r2 (X(a) = 0) = 0(or1) accordingly as u2 (a) = 0(= 0); and X(a) = 0(= 1) a.e. [P r1 ] accordingly as u1 (a) = 0(or = 0). Since the measures are equivalent, condition (2) also holds. Thus the proof of necessity is a consequence of Proposition 2. We have to make detailed computations for the sufficiency to which we now turn. Since the desired result follows if it is shown, under the given conditions, that each of the factors on the right side of (1) exists and is positive so (by the chain rule for the RN-derivatives) we can consider these factors individually. But the second factor on the right was already studied (even more generally) in Theorem 1.4. Thus it is only necessary to establish the first factor, namely, the equivalence of P0r1 and P0r2 when the means are equal (taken as zero for convenience). With this reduction, we can follow the method (of the converse part) of proof of Theorem 1.1 when the covariances satisfy the conditions given here. ri So we study the finite dimensional distributions P0α , i = 1, 2 and since they are equivalent, it will be first shown that the moment generating ri functions of both P0α are connected in a convenient way and show that the corresponding likelihood ratios converge to a positive limit under r dP 2 both measures P0ri , i = 1, 2, which will then give dP0r1 , to get (1). The 0 explicit detailed computations are somewhat tedious and we outline the essential steps leaving out some (algebraic) simplifications. 1. By the current hypothesis, the associated Baxter functions satisfy fr1 = fr2 (= w (say)). Since for the given ri , i = 1, 2, the transformed process defined by Xt − mγi (t) Zt = vi (γi (t))
253
5.2 General Gaussian processes
is a BM (see Lemma 1.7) where γi (t) = uvii (t) ↑, and a BM has continuous sample paths (m being continuous), the Xt -process also has a.a. continuous sample paths. So Ω ⊂ RT can (and will) be restricted to C(T ) = C([a, b]), the continuous function space, for the following analysis. Let p(·) be a simple function on T , and consider the moment generating function (m.g.f.) of T Xt dp(t) relative to P0r2 . Thus for Rα suppose the points of α are given by the (dyadic) partition of T , namely, α : a = t1n < · · · < tnn = b where tkn = a+ k(b−a) 2n , Xkn = X 2kn , and the corresponding covariance matrices of the process: Rin = (ri (tj , tk ), 1 ≤ j, k ≤ 2n ), i = 1, 2. Denote by Δpk = p(tkn )−p(tk−1n ), and the vectors Δpk = (Δp1 , · · · , Δpn ), Xn = (Xt1n , · · · , Xtnn ), and consider the finite ri dimensional m.g.f’s for P0α , i = 1, 2. Using the well-known likelihood dP
r2
ratio of dP0α r1 in the following, one has on T = [a, b] [since C(T ) is a 0α separable space and t → Xt is continuous with probability one, there are no measurability problems in the ensuing work]: ∩t∈T [|Xt |≤M ]
exp[ Xt dp(t)] dP0r2 T
=
n
lim exp[
∩t∈α [|Xt |≤M ] n→∞
= lim
n→∞
2
∩t∈α [|Xt |≤M ]
r2 Xtkn Δpk ] dP0α
k=1 r2 exp(Xn Δpn ) dP0α ,
by the dominated convergence theorem, and prime denotes transpose, dP r2 r1 = lim exp(Xn Δpn ) 0α r1 · dP0α , n→∞ ∩ dP [|X |≤M ] 0α t∈α t |R1n | exp(xn Δpn + = lim n→∞ |R2n | ∩t∈α [|xt |≤M ] 1 −1 r1 −1 x (R − R2n )xn ) dP0α . 2 n 1n
(10)
In the last line of (10), the finite dimensional likelihood ratio along with the fundamental law of probability are used to move the integration to r2 r1 the range (or state) space of the process to go from P0α to P0α . Next one calculates the limiting values in (10) using all the hypothesis on the functions ui and vi . First one can simplify the determinantal ratio (with Taylor’s expansions) which is relatively easy; and it is found that |R1n | D12 = limn→∞ |R where D1 is as in the statement. The difficult part 2n |
254
V. Likelihood Ratios for Processes
−1 −1 is to find the limit of xn (R1n − R2n )xn = Hn (say). This is written as
Hn =
x21n x21n − + (u1 v1 )(t1n ) (u2 v2 )(t1n )
2 (v1 (tk−1n )xkn − v1 (tkn )xk−1n )2 v1 (tk−1n )v1 (tkn )w ˆ1 (tkn ) n
k=1
(v2 (tk−1n )xkn − v2 (tkn )xk−1n )2 ! v2 (tk−1n )v2 (tkn )w ˆ2 (tkn ) = Jn + Kn + Ln , −
(11)
where Jn is the first term on the right of (11) before the summation, and the sum is split into the other two terms as follows, where we set w ˆi (tkn ) = vi (tkn )ui (tk−1n ), i = 1, 2. Thus 2 [w ˆ2 (tkn ) − w ˆ1 (tkn )][v2 (tk−1n )xkn − v2 (tkn )xk−1n ]2 n
Kn =
k=1
v2 (tk−1n )v2 (tkn )w ˆ1 (tkn )w ˆ2 (tkn )
,
and 2 n
Ln =
w ˆ2 (tkn ) v2 (tk−1n )v2 (tkn )[v1 (tk−1n )xkn − v2 (tkn )xk−1n ]2
k=1
− v1 (tk−1n )v1 (tkn )[v2 (tk−1n )xkn − v2 (tkn )xk−1n ]2 }× ! ˆ1 (tkn )w ˆ2 (tkn )}−1 . {v1 (tk−1n )v1 (tkn )v2 (tk−1n )v2 (tkn )w This formidable expression can be simplified, as in Baxter’s result, after a substantial computation, as n → ∞ to obtain the following stochastic integral (for details see Varberg [1], pp. 757-8): 2
LHS(11) = D2 X (a) +
f (t) d T
X 2 (t) v1 (t)v2 (t)
,
(12)
where f (t) is the integrand of the (stochastic) integral in (2). Substituting this in (10) gives
exp( Xt (ω) dp(t)) dP0r2 (ω) = ∩t∈T [|Xt |≤M ] T 1 exp[ Xt dp(t) + D2 X 2 (a)+ D1 2 ∩t∈T [|Xt |≤M ] T 2 Xt ]dP0r1 . f (t) d (13) v1 (t)v2 (t) T
Since the integrand in (13) is positive, one can let $M\to\infty$ to get (2) by the monotone convergence theorem. Then by the uniqueness theorem for m.g.f.'s one gets

$$\frac{dP_0^{r_2}}{dP_0^{r_1}}(X) = D_1\exp\Big[\frac12\Big(D_2X^2(a) + \int_T f(t)\,d\Big(\frac{X^2(t)}{v_1(t)v_2(t)}\Big)\Big)\Big], \tag{14}$$

which is (2) in full generality. With (1) and (2), on substituting the expression for $\frac{dP_{m_1-m_2}^{r_1}}{dP_0^{r_1}}(X - m_1)$ from Theorem 1.9, one finally obtains the complete expression (3).
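As a quick numerical companion to a key step of this proof, the following NumPy sketch (parameter values illustrative; mean taken as zero) checks that for a triangular O.U.-type covariance $r = uv$, the process rescaled by $v$ and run in the clock $\gamma = u/v$ has covariance $\min(s,t)$, i.e., is a BM as used via Lemma 1.7; the inverse time change is written out explicitly here.

```python
import numpy as np

# Triangular O.U. covariance r(s,t) = u(min)v(max), with illustrative values.
sigma2, beta = 1.5, 0.8
u = lambda t: sigma2 * np.exp(beta * t)
v = lambda t: np.exp(-beta * t)
gamma_inv = lambda s: np.log(s / sigma2) / (2 * beta)   # inverse of gamma = u/v

s = np.linspace(0.5, 3.0, 6)          # grid in the new (BM) clock
t = gamma_inv(s)                      # corresponding original times
S, T = np.meshgrid(t, t, indexing="ij")
r = u(np.minimum(S, T)) * v(np.maximum(S, T))
cov_Z = r / np.outer(v(t), v(t))      # covariance of Z_s = X_{gamma^{-1}(s)} / v(...)
print(np.allclose(cov_Z, np.minimum.outer(s, s)))   # True: Z is a BM
```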
An immediate application of this theorem to the O.U. processes gives an interesting supplement to the analysis of Example IV.1.6, and it can be presented as follows.

3. Example. (a) Recall that an O.U. process is a stationary Gaussian process with mean zero and covariance $r(s,t) = \sigma^2\exp[-\beta|s-t|]$, $0\le s,t\le a$, where $\beta>0$, $\sigma^2>0$ are constants. If $\beta_i,\sigma_i^2$, $i=1,2$, relate to a pair of O.U. processes, so that $r_i = u_iv_i$ with $u_i(t) = \sigma_i^2\exp(\beta_it)$, $v_i(t) = \exp(-\beta_it)$, then the conditions given in the theorem show that the measures $P_0^{r_i}$, $i=1,2$, are equivalent iff $\sigma_1^2\beta_1 = \sigma_2^2\beta_2 = K$ (say). The likelihood ratio is then obtained from (2), where $D_1$, $D_2$ and the integrand, denoted $f$, are found to be:

$$D_1^2 = \frac{\beta_2}{\beta_1},\qquad D_2 = \frac{2}{K}(\beta_1-\beta_2),\qquad f(t) = -\frac{\beta_2-\beta_1}{K}\exp[-(\beta_1+\beta_2)t].$$

Substituting these in (2) and using integration by parts, one gets

$$\frac{dP_0^{r_2}}{dP_0^{r_1}}(X) = \sqrt{\frac{\beta_2}{\beta_1}}\,\exp\Big\{-\frac{1}{2K}\Big[(\beta_2-\beta_1)\big(X^2(0)+X^2(a)\big) + (\beta_2^2-\beta_1^2)\int_0^a X^2(t)\,dt\Big]\Big\}.$$
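This closed form is easy to sanity-check by simulation. The following NumPy sketch (illustrative parameters satisfying $\sigma_1^2\beta_1 = \sigma_2^2\beta_2 = K$) samples one discretized O.U. path under $P_0^{r_1}$ and compares the likelihood ratio above with the exact finite dimensional Gaussian likelihood ratio on the same grid; the two should agree increasingly well as the mesh is refined.

```python
import numpy as np
rng = np.random.default_rng(0)

a, beta1, beta2, K = 1.0, 1.0, 2.0, 1.0
sig1, sig2 = K/beta1, K/beta2            # variances sigma_i^2
n = 400
t = np.linspace(0.0, a, n + 1)

def ou_cov(s2, b):
    return s2 * np.exp(-b * np.abs(np.subtract.outer(t, t)))

R1, R2 = ou_cov(sig1, beta1), ou_cov(sig2, beta2)
X = np.linalg.cholesky(R1 + 1e-12*np.eye(n+1)) @ rng.standard_normal(n+1)

# Closed form of Example 3(a):
intX2 = np.sum((X[1:]**2 + X[:-1]**2)/2 * np.diff(t))   # trapezoid rule
quad = (beta2-beta1)*(X[0]**2 + X[-1]**2) + (beta2**2-beta1**2)*intX2
lr_formula = np.sqrt(beta2/beta1) * np.exp(-quad/(2*K))

# Exact finite dimensional Gaussian likelihood ratio on the same grid:
_, d1 = np.linalg.slogdet(R1); _, d2 = np.linalg.slogdet(R2)
q = X @ (np.linalg.solve(R2, X) - np.linalg.solve(R1, X))
lr_grid = np.exp(0.5*(d1 - d2) - 0.5*q)
print(lr_formula, lr_grid)               # nearly equal for fine grids
```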
This example is also discussed by Striebel [1]. [There is a small numerical discrepancy there.] See also Duttweiler and Kailath [1],[2], Kailath and Weinert [1], and Kailath, Geesy and Weinert [1] for related work.

(b) An even simpler example concerns the equivalence of the BM and the Brownian bridge: both have means zero, with covariances $r_1(s,t) = \min(s,t)$ and $r_2(s,t) = u(s\wedge t)v(s\vee t)$ for $0\le s,t\le a<1$, where $u(s) = s$, $v(s) = 1-s$. In this case $D_1^2 = \frac{1}{1-a}$, $D_2 = 0$, $f(t) = -1$,
and since $r_i(0,0) = 0$, $i=1,2$, so that $X(0) = 0$ a.e. under both measures, the likelihood ratio is given by:

$$\frac{dP_0^{r_2}}{dP_0^{r_1}} = \frac{1}{\sqrt{1-a}}\,\exp\Big[-\frac{x^2(a)}{2(1-a)}\Big].$$
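This simple ratio can also be verified numerically. The sketch below (NumPy; window $[0,a]$ with $a=0.5$, grid size illustrative) samples a BM path and compares the formula with the exact finite dimensional likelihood ratio of the two Gaussian laws on the grid; moderate grid sizes keep the min-kernel solves well conditioned.

```python
import numpy as np
rng = np.random.default_rng(1)

a, n = 0.5, 300
t = np.linspace(1e-4, a, n)
R1 = np.minimum.outer(t, t)                    # BM covariance
R2 = np.minimum.outer(t, t) - np.outer(t, t)   # Brownian bridge on [0,1]

X = np.linalg.cholesky(R1 + 1e-12*np.eye(n)) @ rng.standard_normal(n)

lr_formula = np.exp(-X[-1]**2/(2*(1 - a))) / np.sqrt(1 - a)

_, d1 = np.linalg.slogdet(R1); _, d2 = np.linalg.slogdet(R2)
q = X @ (np.linalg.solve(R2, X) - np.linalg.solve(R1, X))
lr_grid = np.exp(0.5*(d1 - d2) - 0.5*q)
print(lr_formula, lr_grid)                     # the two nearly coincide
```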
The fact that the covariance function $r_i$ of the O.U. process is stationary, so that it has the spectral (or Fourier) representation

$$r_i(s,t) = \sigma_i^2e^{-\beta_i|s-t|} = \frac{\sigma_i^2\beta_i}{\pi}\int_{\mathbb{R}} e^{-ix|s-t|}\,\frac{dx}{\beta_i^2+x^2}$$
(if $\sigma_i^2 = 1$, this is the characteristic function of the Cauchy density, which thus appears here as the spectral density), motivates a study of conditions on the spectral functions, instead of the covariances, for the equivalence and singularity of the measures. We discuss this aspect of the problem briefly. The additional knowledge of stationarity calls for a sharper result on the dichotomy problem for the resulting likelihood ratios. The following result, due to Gladyshev [1], is representative of the situation, and we include it for comparison. [Recall that if $\{X_t, t\in\mathbb{R}\}$ is a (weakly) stationary process with a continuous covariance function $r$, then it admits the representation (by Bochner's theorem on positive definite functions) $r(s,t) = r(s-t) = \int_{\mathbb{R}} e^{iu(s-t)}\,dF(u)$, where $F$ is the spectral distribution (a bounded Borel measure) and $f = \frac{dF}{du}$ is the spectral density whenever it exists.]

4. Theorem. Let $(P,Q)$ be Gaussian measures on $(\Omega,\Sigma)$ with zero means and continuous stationary covariances $r_k$ having spectral densities $f_k$, $k=1,2$, for a process $\{X_t, t\in\mathbb{R}\}$. If the $f_k$ satisfy the growth condition

$$f_k(u) = c_k|u|^{\alpha_k} + O(|u|^{\alpha_k-2}),\quad k=1,2,\quad c_1\cdot c_2\neq 0,$$

the $\alpha_k$ being real, and $\lim_{|u|\to\infty}\frac{f_1}{f_2}(u)\neq 1$, then $P\perp Q$.
It should be observed that the orthogonality of the measures $P, Q$ may obtain even when $\lim_{|u|\to\infty}\frac{f_1}{f_2}(u) = 1$, so that for the presence of the dichotomy further conditions are necessary. We omit this specialization here, and also the proof of the theorem, referring the reader to Gladyshev's paper, where the result is established by first extending Baxter's proposition and applying it to the present situation. A detailed study of the dichotomy problem for stationary Gaussian processes is given in Rozanov's [1] memoir. It is clear from these computations that, when the covariances of the Gaussian processes are different, the work cannot be materially simplified. In
fact the approximations here lead to a stochastic integral. If the covariance functions are more general than the triangular ones, then it is necessary to use more sophisticated results from abstract analysis, as already seen in Theorem 1.12. To understand this structure better, we now include some additional results and the corresponding likelihood ratios. There is also an integral form of triangular covariances covering a large class of Gaussian processes, as will appear from the work below. It signifies an aspect analogous to that of Section IV.4.

Observe that if $P_{m_i}^{r_i}$, $i=1,2$, are equivalent measures, then by the chain rule for RN-derivatives one has, for a.a. $(\omega)$:

$$\frac{dP_{m_2}^{r_2}}{dP_{m_1}^{r_1}}(\omega) = \frac{dP_{m_2}^{r_2}}{dP_{m_1}^{r_2}}(\omega)\,\frac{dP_{m_1}^{r_2}}{dP_{m_1}^{r_1}}(\omega) = \frac{dP_{m_2-m_1}^{r_2}}{dP_0^{r_2}}(\omega)\,\frac{dP_0^{r_2}}{dP_0^{r_1}}(\omega). \tag{15}$$

Thus $P_{m_2}^{r_2}$ is equivalent to $P_{m_1}^{r_1}$ iff both $P_{m_2}^{r_2}$ is equivalent to $P_{m_1}^{r_2}$ and $P_{m_1}^{r_2}$ is equivalent to $P_{m_1}^{r_1}$. By Proposition 1.2 we deduce that $m_1, m_2$ are admissible means of $P_0^{r_2}$, so that $\delta = m_2 - m_1$ is also one, by the linearity of that set. Consequently $P_\delta^{r_2}$ is equivalent to $P_0^{r_1}$. This fact may be stated for convenient reference as follows:

5. Proposition. If $P_{m_i}^{r_i}$, $i=1,2$, are Gaussian measures with means $m_i$ and covariances $r_i$, then $P_{m_2}^{r_2}\equiv P_{m_1}^{r_1}$ iff $P_\delta^{r_2}\equiv P_0^{r_1}$, where $\delta = m_2 - m_1$.
Since by Proposition 1.2 the conditions for the equivalence (or existence) of the first factor on the right side of (15) are known, it is now necessary to find similar conditions for $P_0^{r_2}\equiv P_0^{r_1}$. This is a far deeper problem, and considerable insight is obtained by abstract analysis via Aronszajn space technology. [Mathematically this is on the level of the analysis of the Behrens-Fisher problem that we discussed in Chapter II, based on Linnik's penetrating study.] Its use and effectiveness in the present work were brought out by Parzen [1] and refined by Neveu [2], using a different notation. We follow this technique to elucidate its role in our theory and eventually obtain the likelihood ratio, in Theorem 12 below. In order to present an equivalent version of the above proposition, with hypotheses on the covariance functions, it will be useful to introduce an "entropy function" for measuring the distinctness of probabilities, in place of the Hellinger "distance" used in the proof of Theorem 1.1. It is borrowed from information theory and highlights the covariance functions more directly than the earlier one. Thus if $P, Q$ are probability measures on $(\Omega,\Sigma)$ and $\mu$ is a dominating (σ-finite) measure on $\Sigma$ (e.g., $(P+Q)/2$), with $f = \frac{dP}{d\mu}$, $g = \frac{dQ}{d\mu}$, let $I$ be the information functional defined by:

$$I = I(P,Q) = \int_\Omega (g-f)\log\frac{g}{f}\,d\mu = \int_\Omega\log\frac{dQ}{dP}\,dQ + \int_\Omega\log\frac{dP}{dQ}\,dP = I(Q,P), \tag{16}$$
which thus does not depend on $\mu$. Similarly, if $\mathcal{F}_\alpha\subset\Sigma$ is a σ-algebra and $P_\alpha, Q_\alpha, \mu_\alpha$ are the restrictions of $P, Q, \mu$ to $\mathcal{F}_\alpha$, let $f_\alpha, g_\alpha$ be the corresponding densities and $I_\alpha$ the resulting number given by (16). If $\{\mathcal{F}_\alpha, \alpha\in J\}$ is a directed or filtering set of σ-algebras from $\Sigma$ ($J$ being a directed set and $\mathcal{F}_\alpha\subset\mathcal{F}_\beta$ for $\alpha<\beta$ in $J$), then it is seen, by the conditional Jensen inequality applied to the convex functions $\varphi_1(x) = x\log x$ and $\varphi_2(x) = \log\frac1x$, $x>0$, that $I_\alpha\le I_\beta$ if $\alpha<\beta$; whence it is a monotone nondecreasing functional, and one can verify that $\lim_\alpha I_\alpha = I\le\infty$, where $\Sigma$ is replaced by $\sigma(\cup_{\alpha\in J}\mathcal{F}_\alpha)$. [The necessary computation uses a martingale convergence theorem, since $\{u_\alpha = \frac{dQ_\alpha}{dP_\alpha}, \mathcal{F}_\alpha, \alpha\in J\}$ forms a martingale on $(\Omega,\Sigma,P)$; the details are standard and are found, e.g., in Rao [12], p. 213.] Also it follows from the definition that $I(P,Q)<\infty$ implies $Q\equiv P$, just as $H(P,Q)=1$ does. But by Theorem 1.1, $H(P,Q)>0\Rightarrow H(P,Q)=1$ for Gaussian measures, and similarly it is shown that, in this case, $I(P,Q)<\infty$ iff $Q\equiv P$, and $Q\perp P$ iff $I_\alpha(P,Q) = I(P_\alpha,Q_\alpha) = \infty$ for some $\alpha\in J$ (or $I(P,Q)=\infty$). This is also detailed in the above reference and will be used without reproducing the algebra. Thus Theorem 1.1 can also be proved using this information functional $I$, as was originally done by Hájek [1].

If now $\alpha = (t_1,\ldots,t_n)$ is a finite set of points of the index set $T$ for a segment of the observed process $\{X_t, t\in T\}$, let $X = (X_{t_1},\cdots,X_{t_n})$ and let $r_{jn} = [E_{P^{r_j}}(X_{t_i}X_{t_k}),\ 1\le i,k\le n]$, $j=1,2$, be the $n\times n$ covariance matrices. Then the (elementary) finite dimensional likelihood ratio is given, for $x = X(\omega)$, by

$$\frac{dQ_\alpha}{dP_\alpha}(x) = p_\alpha(x) = \sqrt{\frac{|r_{1n}|}{|r_{2n}|}}\,\exp\Big\{-\frac12\big[x(r_{2n}^{-1} - r_{1n}^{-1})x'\big]\Big\}, \tag{17}$$
where $x'$ is the transpose of the row vector $x$ and $|r_{jn}|$ is the determinant of the corresponding matrix. In the proof of Theorem 1.1 we simplified this expression by simultaneously diagonalizing the matrices $r_{jn}$; here we proceed differently, using the $I_\alpha$-functional with (17) to keep track of the covariances. Thus (16) with (17) becomes:

$$I_\alpha = I_\alpha(P,Q) = E_{P_0^{r_2}}(\log p_\alpha(X)) - E_{P_0^{r_1}}(\log p_\alpha(X)) = \frac12\big(E_{P_0^{r_2}} - E_{P_0^{r_1}}\big)\big(X(r_{1n}^{-1} - r_{2n}^{-1})X'\big)$$
$$= \frac12\big(E_{P_0^{r_2}} - E_{P_0^{r_1}}\big)\big(\mathrm{tr}[X(r_{1n}^{-1} - r_{2n}^{-1})X']\big) = \frac12\,\mathrm{tr}\big[\big(E_{P_0^{r_2}} - E_{P_0^{r_1}}\big)\big((r_{1n}^{-1} - r_{2n}^{-1})XX'\big)\big] = \frac12\,\mathrm{tr}\big(r_{2n}^{-1}r_{1n} + r_{1n}^{-1}r_{2n} - 2\,\mathrm{id}\big), \tag{18}$$
where $\mathrm{id}$ is the identity matrix, $\mathrm{tr}$ is the trace, $E_{P_0^{r_i}}$ denotes expectation relative to the measure $P^{r_i}$, and where we used the standard computation for the expectations of covariances in Gaussian integrals. The crucial discovery here, due to Parzen [1], is that the expression in the last line of (18) can be identified as an element of the tensor product of the (finite dimensional) RKHSs $H_{r_{1n}}$ and $H_{r_{2n}}$. We now consider the desired product space. If $H_{r_i}$ is the RKHS for the covariance kernel $r_i$, then their tensor product is defined through $(r_1\otimes r_2)(s_1,s_2,t_1,t_2) = r_1(s_1,t_1)r_2(s_2,t_2)$, $s_i,t_i\in T$, $i=1,2$, so that $r_1\otimes r_2$ is again a covariance kernel (by Schur's lemma). Let $g_i\in H_i$, $i=1,2$, and let $g = g_1\otimes g_2$ be defined on $T\times T$ by $g(t_1,t_2) = g_1(t_1)g_2(t_2)$, with $\|g\|_{H_1\otimes H_2} = \|g_1\|_{H_1}\|g_2\|_{H_2}$ derived from the inner product

$$(g,\ r_1\otimes r_2(\cdot,\cdot,t_1,t_2)) = (g_1, r_1(\cdot,t_1))(g_2, r_2(\cdot,t_2)) = g_1(t_1)g_2(t_2) = g(t_1,t_2). \tag{19}$$
The space $H_{r_1}\otimes H_{r_2}$ is the closure, under the norm $\|\cdot\|_{r_1\otimes r_2}$, of finite linear combinations of elements of the form $g$; it is the tensor product of the spaces $H_{r_i}$, $i=1,2$, and is again a Hilbert space. It is easily verified that $H_{r_1\otimes r_2} = H_{r_1}\otimes H_{r_2}$. As an example, for the triangular kernel $r(s,t) = \sigma^2\min(s,t)$, $0\le s,t\le1$, the space $H_r\subset C_0([0,1])$ of absolutely continuous functions vanishing at the origin, with square integrable derivatives — i.e. (with $f' = \frac{df}{du}$), $f\in H_r$ iff

$$f(t) = \int_0^t f'(u)\,du = \int_0^1 f'(u)\chi_{[0,t]}(u)\,du$$

— with the inner product

$$(f,g) = \frac{1}{\sigma^2}\int_0^1 f'(u)g'(u)\,du,$$

is the RKHS for this kernel $r$. Since $\frac{\partial r}{\partial s}(s,t) = \sigma^2\chi_{[0,t]}(s)$, one gets $r(\cdot,t)\in H_r$ for each $t$, and $(f, r(\cdot,t)) = f(t)$. This is the classical
Wiener space. Here, replacing $\sigma$ by $\sigma_i>0$, $i=1,2$, and calling the resulting kernels $r_i$, one gets after a simple computation for (18):

$$I_\alpha(P,Q) = \frac{n}{2}\Big(\frac{\sigma_1}{\sigma_2} - \frac{\sigma_2}{\sigma_1}\Big)^2, \tag{20}$$
which is unbounded as $\alpha$ varies over all finite subsets of $[0,1]$ if $\sigma_1\neq\sigma_2$. Resuming the general discussion, it follows from the above construction that if $\alpha<\alpha'$ (i.e., $\alpha\subset\alpha'$) are finite sets, then $H_{r_{i\alpha}}\subset H_{r_{i\alpha'}}$, and $\{H_{r_{1\alpha}\otimes r_{2\alpha}}\}_{\alpha\in J}$ forms an increasingly nested set of subspaces of $H_{r_1\otimes r_2}$. Moreover, $\{\|r_1 - r_2\|_{H_{r_{1\alpha}\otimes r_{2\alpha}}}\}$ forms a monotone increasing net and hence has a limit (finite or not) by the general RKHS theory (cf. Aronszajn [1], Theorem I on p. 362). Consequently $I(P,Q) = \lim_\alpha I_\alpha(P,Q) = \lim_\alpha\|r_{1\alpha} - r_{2\alpha}\|_{H_{r_{1\alpha}\otimes r_{2\alpha}}}$ exists, finite or not. It is finite iff $P$ and $Q$ are equivalent, and the measures are singular if $I(P,Q) = \infty$. Thus for equivalence one must have $r_1 - r_2\in H_{r_1\otimes r_2}$. The converse implication is established by showing that the non-finiteness of the preceding limit implies that there is a set $A\in\cap_\alpha\mathcal{F}_\alpha$ on which $P$ has arbitrarily small value while $Q$ has a value close to unity, so that $P\perp Q$ on $\Sigma$. In particular, we have by (20) the well-known result that scale-different BM processes (i.e., $\sigma_1\neq\sigma_2$) always determine singular measures. This discussion allows us to state Proposition 5 in the following more convenient but equivalent form, involving conditions only on the means and covariances:

6. Theorem. The Gaussian probability measures $P_{m_i}^{r_i}$, $i=1,2$, on $(\Omega,\Sigma)$, with means $m_i$ and covariances $r_i$, are equivalent iff (a) $\delta = m_1 - m_2\in H_{r_1}$, (b) $H_{r_1} = H_{r_2}$, and (c) $r_1 - r_2\in H_{r_1\otimes r_2}$ $(= H_{r_1}\otimes H_{r_2})$, the equalities between the spaces denoting isometric isomorphisms.
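Before proceeding, a small numeric illustration of (18) and (20): for scale-different Wiener kernels the information functional grows linearly in the number of sample points, so the measures cannot be equivalent. The sketch below (NumPy; grid values illustrative) evaluates the trace formula of (18) directly and compares it with (20).

```python
import numpy as np

# I_alpha from (18) for r_i = sigma_i^2 * min(s,t) on a finite grid alpha.
def I_alpha(ts, s1, s2):
    M = np.minimum.outer(ts, ts)
    r1, r2 = s1**2 * M, s2**2 * M
    A = np.linalg.solve(r2, r1) + np.linalg.solve(r1, r2) - 2*np.eye(len(ts))
    return 0.5 * np.trace(A)

s1, s2 = 1.0, 1.3
for n in (5, 20, 80):
    ts = np.linspace(0.1, 1.0, n)
    # trace formula vs. (20); both grow linearly in n, signalling singularity
    print(n, I_alpha(ts, s1, s2), 0.5*n*(s1/s2 - s2/s1)**2)
```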
The spaces $H_r$ can be given a more interesting representation if $T\subset\mathbb{R}$ is a compact interval and $r$ is (left or right) continuous on $T\times T$, so that $H_r$ is separable, since then $r$ may be shown to admit a (generalized) triangular form. This is utilized to obtain a sharper form of the preceding theorem, which automatically includes the triangular covariances treated earlier. The result has methodological interest and reveals the structure of the problem vividly, in addition to unifying many of the previous formulations. We present the work for any $r$ for which $H_r$ is separable.

7. Theorem. Let $P_{m_i}^{r_i}$, $i=1,2$, be a pair of Gaussian measures on $(\mathbb{R}^T,\mathcal{B}_T)$ with means $m_i$ and covariances $r_i$. Then they are equivalent iff there exist a (σ-finite) measure space $(\Omega,\Sigma,\nu)$ and an $R\in L^2(\Omega\times\Omega,\Sigma\otimes\Sigma,\nu\otimes\nu)$ satisfying the conditions:
(i) $R(\omega,\omega') = \bar R(\omega',\omega)$ for a.a. $(\omega,\omega')$, and, for $A : L^2(\nu)\to L^2(\nu)$ defined by $Af = \int_\Omega R(\cdot,\omega')f(\omega')\,d\nu(\omega')$ ($A$ will be Hilbert-Schmidt), $-1$ does not belong to the spectrum $\sigma(A)$ of $A$;
(ii) $(r_1 - r_2)(u,v) = \int_\Omega\int_\Omega\Psi(u,\omega)\bar\Psi(v,\omega')R(\omega,\omega')\,d\nu\,d\nu$, with $r_2$ having a representation $r_2(u,v) = \int_\Omega\Psi(u,\omega)\bar\Psi(v,\omega)\,d\nu(\omega)$ relative to the family $\{\Psi(u,\cdot), u\in T\}\subset L^2(\nu)$; and
(iii) there is a $g\in L^2(\nu)$ such that

$$(m_1 - m_2)(u) = \int_\Omega\Psi(u,\omega)\bar g(\omega)\,d\nu(\omega).$$
[The family $\{\Psi(u,\cdot), u\in T\}$ need not be unique, but each such collection determining $r_2$ in (ii) has the same cardinality and satisfies (iii).]

The proof, to be given after Proposition 8, is helped by the following auxiliary decomposition of $r_2$, which is more general than the Mercer series representation (the latter demands continuity of the kernel everywhere), and which explains the structure of $H_r$ using only the separability hypothesis. We present the result here since it has independent interest. The procedure is motivated by the work of Cramér [5] (with references to his earlier contributions) and of Hida [1], both of which use the classical Hellinger-Hahn expansion instead. [A reader may skip the following discussion and proceed to the statement of Proposition 8 for the facts needed in the proof of the theorem. However, the result will also be found useful for the linear prediction considered in Chapter VIII.] Thus let $T\subset\mathbb{R}$ be as given, $K : T\times T\to\mathbb{C}$ a positive definite kernel, and $H_K$ its RKHS, with the inner product defined in (4) of Section 1. Then $\{K(\cdot,t), t\in T\}$ is dense in that space. Let $H_t = \overline{\mathrm{sp}}\{K(\cdot,s), s\le t\}$, so that $H_t\subset H_{t'}$ for $t\le t'$. It may and will be supposed that $H_t = \bigcup_{s<t}H_s$; the Hellinger-Hahn theory then yields elements $f_n(t)\in H_K$, $n\ge1$, such that, for each $n$, the nondecreasing
function $\mu_n$, defined by $\mu_n([t_1,t_2])\ (= \mu_n(t_2) - \mu_n(t_1)) = \|f_n(t_2) - f_n(t_1)\|^2$, is uniquely extendible to a Borel measure on $\mathbb{R}$, and one has, for each $x\in H_K$:

$$\psi_n^x(t) = \frac{d\langle x, f_n(\cdot)\rangle}{d\mu_n}(t)\ \ \text{a.e.}[\mu_n],\qquad \|x\|^2 = \sum_n\int_T|\psi_n^x(t)|^2\,d\mu_n(t). \tag{21}$$
Taking $x = K(\cdot,t)\in H_K$ for any fixed $t$ in (21), let the corresponding $\psi_n^x$ be denoted by $\psi_n(t,\cdot)$, $t\in T$, and define $\mu = \sum_n\frac{1}{2^{|n|}}\mu_n$. Then $\mu$ is a Borel measure on $T$ and $\mu_n\ll\mu$, $n\ge1$. If $L^2(T,\mu)$ is the resulting Lebesgue space on $T$, then $\tilde\psi_n(t,\cdot) = \psi_n(t,\cdot)\big(\frac{d\mu_n}{d\mu}\big)^{1/2}(\cdot)\in L^2(T,\mu)$, and the second equation of (21) gives:

$$\sum_n\int_T|\psi_n(s,t)|^2\,d\mu_n(t) = \sum_n\int_T|\tilde\psi_n(s,t)|^2\,d\mu(t) = \int_T\sum_n\langle\tilde\psi_n(s,x),\tilde\psi_n(s,x)\rangle\,d\mu(x) = \langle K(\cdot,s),K(\cdot,s)\rangle = K(s,s). \tag{22}$$

This representation is of interest here. With it (using polarization) one can express the kernel as:

$$\sum_n\int_T\langle\tilde\psi_n(s,x),\tilde\psi_n(t,x)\rangle\,d\mu(x) = K(s,t). \tag{23}$$

For simplicity, let $\Psi(s,\cdot) = (\tilde\psi_1(s,\cdot),\tilde\psi_2(s,\cdot),\cdots)$ be an infinite vector, so that (23) can be written simply, using the inner product notation of the sequence space $(\ell^2,(\cdot,\cdot))$, as

$$\int_T(\Psi(s,x),\Psi(t,x))_{\ell^2}\,d\mu(x) = K(s,t). \tag{23'}$$
(23’)
T
It is known (and easy to verify) that such a function space can be 2 (T, μ; 2 ) of vector valued (here 2 -valued) functions on identified as L T with norm K(s, s). Now let F = sp{Ψ(t, ¯ ·), t ∈ T } ⊂ L2 (T, μ; 2 ), and T ¯ HK = {g ∈ C : |g(t)| = | (Ψ(t, x), u(x))2 dμ(x)| < ∞, u ∈ F}, T
263
5.2 General Gaussian processes
with g2 = T (u(x), u(x))2 dμ(x), then it follows that K(·, t) ∈ HK for t ∈ T by (23’). Also !g, K(·, t)" = T (Ψ(t, x), u(x))2 dμ(x) = g(t) ¯ K is an RKHS for the kernel K. But the general theory of so that H Aronszajn’s [1] implies that K defines uniquely such a space so that ¯ K , and the mapping τ : HK → F is an isometric isomorphism. HK = H This is the desired representation for the proof of the theorem. It may be noted that if K is continuous on a compact interval T , then by Mercer’s theorem it can be expanded as: K(s, t) =
∞ ψn (s)ψ¯n (t) , λ n n=1
(24)
the series converging uniformly and absolutely. If we consider μ as a measure concentrated on N with μ({n}) = λ1n so that (24) becomes ¯ n))2 dμ(n), then it denotes a special case of K(s, t) = N (ψ(s, n), ψ(t, (23’). We present the general statement established above for reference as follows. 8. Proposition. Let K : T × T → C be a covariance function such that the associated RKHS HK is separable where T ⊂ R. Then there exists a family of vector functions Ψ(t, ·) = {ψn (t, ·), n ≥ 1}, t ∈ T and a Borel measure μ on T such that Ψ(t, ·) ∈ L2 (T, μ; 2 ) in terms of which K is representable as: K(s, t) = (Ψ(s, x), Ψ(t, x))2 dμ(x). (25) T
The vector functions Ψ(s, ·), s ∈ T and the measure μ may not be unique, but all such (Ψ(t, ·), μ) determine K and HK uniquely and the cardinality of the components determining Ψ remains the same. Important remarks. 1. It may be observed that if Ψ(t, ·) is a scalar, ¯ x) dμ(x), which includes the trathen we have K(s, t) = R Ψ(s, x)Ψ(t, ditional triangular covariance with μ absolutely continuous relative to the Lebesgue measure. 2. The following notational simplification of (25) can be made. Let Ω = R × Z, Σ = B ⊗ P where P is the power set of the integers Z, ˜ λ) = and let ν = μ ⊗ α where α is the counting measure. Then Ψ(t, {ψn (t, x), n ∈ Z} = Ψ(t, λ, n), (λ, n) ∈ Ω. Hence 2 ˜ ω)Ψ(t, ˜ ω) dν(ω), Ψ(t, ·)2,ν = Ψ(t, Ω
and
K(s, t) = Ω
˜ ω)Ψ(t, ˜ ω) dν(ω), s, t ∈ T, Ψ(s,
(25’)
264
V. Likelihood Ratios for Processes
HK = {g : g(t) =
Ω
˜ ω)¯ Ψ(t, u(ω) dν(ω), u ∈ F ⊂ L2 (Ω, ν)}, (26)
˜ ·), t ∈ T } and g2 = where F = sp{ ¯ Ψ(t, |u(ω)|2 dν(ω). Thus Ω τ : g → u is an isometry between HK and F. This form of (25’) is convenient for the proof of the theorem below. 3. An interesting application of the form (26) is the following on a general characterization of admissible means, complementing the work of Section 1. Thus let a function f : T → C be termed an (generally) admissible mean relative to a positive definite kernel K : T × T → C if it is the mean of a second order process {Xt , t ∈ T } on some probability space (Ω, Σ, P ) with covariance K. Since one can always take the process to be Gaussian with a given positive definite kernel as its covariance and zero mean by Kolmogorov’s existence theorem (cf. Theorem I.1.1), this is the same as saying that if Pf is the measure on Σ of the Xt -process and if P is the measure of the Yt = Xt − f -process, then Pf P . Thus it is another way of stating the concept introduced before. Also K1 (s, t) = f (s)f¯(t) evidently defines a (degenerate) positive definite function, and we have E(Ys Y¯t ) = K(s, t) − K1 (s, t). Then f is an admissible mean of P , i.e., iff f ∈ MP = HK by Proposition 1.2. But by (26) above, HK can be realized as: f ∈ HK iff, with K represented by (25’), f can be represented as the integral f (t) = Ω
˜ ω)u(ω) dν(ω), u ∈ F, Ψ(t,
with f K = u2 . But u ∈ HK1 always, and from (25)-(26) where now = 1 concentrating at one point and u = 1 so that f K1 = μ({λ}) 1 2 [ R 1 dμ] = 1. Since K − K1 is positive definite iff H1 ⊂ H by Aronszajn’s theory [1], pp.354-5, Theorems I-II), f K ≤ f K1 = 1. Thus we have shown that f is admissible relative to a covariance kernel K iff f K ≤ 1. In this form Ylvisaker [1] proved this by a somewhat different argument. Note that if K is also a continuous stationary covariance so that K(s, t) = R ei(s−t)u dG(u) for a unique bounded isλ nondecreasing G, then f is admissible iff with Ψ(s, λ) = e in (25) so that (26) gives f ∈ HK iff f (t) = R Ψ(s, λ)u(λ) dG(λ). This means f is the Fourier transform of dF (λ) = u(λ) dG(λ), F being a function of bounded variation. This sharp form of the result was obtained directly by Balakrishnan [1]. Note that (25) uses a series representation of r and hence is not good enough for the Fourier representation in the RKHS setup. In particular if r(s, t) = R R eisx−ity dF (x, y) for harmonizable processes for admissibility of f only that f R ≤ 1 is concluded but its corresponding representation in the stationary case is not given, and
5.2 General Gaussian processes
265
a different argument for such specializations is needed. [As Ylvisaker notes, the RKHS argument uses only the Aronszajn theorem which does not depend on any property of T and it can be any point set. This is the generality, but then the structure of f could not be made more precise since the additional information on r is not utilized. However, the corresponding result can be obtained, using different techniques (cf. Rao [23] and Exercise 6.9 for more detail.)] 4. The measure μ and the functions ψn (t, ·) are obtained in the general case by the Hellinger-Hahn theory when HK is separable. If the last condition is dropped, one has to consider a more advanced analysis based on the general spectral theory of a normal operator in a Hilbert space due to Plessner and Rokhlin [1]. The details of this in the context of processes are not yet available. So it will not be discussed further, but it points out the essential need to invoke deep mathematical tools even in such “naturally simple” problems. Proof of Theorem 7. Using the notation introduced for the above result, we first observe that each element V ∈ F ⊗ F corresponds uniquely to a Hilbert-Schmidt (HS)-operator U on F (cf., Schatten [1], pp.35-36), and moreover each such U is representable by a kernel K0 ∈ L2 (T × T, μ ⊗ μ; 2 ) ∩(F ⊗ F) where F = τ (Hk ). In fact if F ∈ Hk⊗k , then there is a (not necessarily positive definite) unique K0 such that ˜ ˜ y))2 dμ(x) dμ(y), F (u, v) = (Ψ(u, x), K0 (x, y)Ψ(v, (27) T
T
and K0 is hermitian if F is. All of this is an easy consequence of the theorem in Schatten referred to above. Putting Ω = T × Z, ν = μ ⊗ α, as in the above remark (α= counting measure), (27) is expressed as: ˜ ω )G(ω, ω ) dν(ω) dν(ω ), ˜ F (u, v) = (27’) Ψ(u, ω)Ψ(v, Ω
Ω
for a $G\in L^2(\nu\otimes\nu)$ in the new notation ($G = K_0$). With these identifications, we show that the conditions of Theorem 6 are equivalent to those of Theorem 7, which will establish the result. Let $F = r_2 - r_1$, and let $A$ be the (integral) operator corresponding to $F$ acting on $L^2(\nu)$. Then, by the theorem recalled above, $A$ has associated with it a kernel $G$ ($= R$ of the theorem). To verify the equivalence of the present conditions with those of Theorem 6, consider the isometric isomorphism $\tau : H_{r_1}\to\mathcal{F}$ defined after (25'), in which $K$ is taken as $r_1$. Then $\tau(\delta) = g\in\mathcal{F}$, and this is equivalent to (iii). Also, $r_2 - r_1\in H_{r_1\otimes r_2}$ is equivalent to showing that $(\tau\otimes\tau)(r_2 - r_1)\in\mathcal{F}\otimes\mathcal{F}$, since $r_2 - r_1$ is hermitian. Then by (27') there is an $R\in L^2(\nu\otimes\nu)$, which in fact lies in the subset $(\tau\otimes\tau)(H_{r_1\otimes r_2})$. This is (c), which is thus equivalent to
(ii). Finally, to establish the equivalence of (i) and (b), recall that $H_{r_1} = H_{r_2}$ iff (cf. Aronszajn [1], p. 354) there are constants $a_i>0$ such that $a_1r_1\prec r_2\prec a_2r_1$, where, between a pair of positive definite kernels $k_1, k_2$, $k_1\prec k_2$ means that $k_2 - k_1$ is positive definite. Now since $R\in L^2(\nu\otimes\nu)$ implies that the corresponding operator $A$ determined by it is HS, let $\{\lambda_n, f_n;\ n\ge1\}$ be the eigenvalues and the corresponding (normalized) eigenfunctions of $A$. If $F_n(t) = \int_\Omega\tilde\Psi(t,\omega)f_n(\omega)\,d\nu(\omega)$, then (cf. Remark 2 above) $F_n\in H_{r_1}$ and $\tau(F_n) = f_n$, $n\ge1$. Further, $\langle F_n,F_n\rangle_{H_{r_1}} = (f_n,f_n)_{L^2(\nu)} = 1$. Since $r_1(\cdot,t)\in H_{r_1}$, we get $\langle r_1(\cdot,t),F_n\rangle = F_n(t)$, and by (ii):

$$r_2(u,v) = r_1(u,v) + \int_\Omega\int_\Omega\tilde\Psi(u,\omega)\bar{\tilde\Psi}(v,\omega')R(\omega,\omega')\,d\nu(\omega)\,d\nu(\omega') = \int_\Omega\tilde\Psi(u,\omega)\Big[\bar{\tilde\Psi}(v,\omega) + \int_\Omega\bar{\tilde\Psi}(v,\omega')R(\omega,\omega')\,d\nu(\omega')\Big]\,d\nu(\omega) = \int_\Omega\tilde\Psi(u,\omega)\bar g(v,\omega)\,d\nu(\omega)\ \ \text{(say)}, \tag{28}$$
where $g\in\mathcal{F}\subset L^2(\nu)$. Hence $r_2\in H_{r_1}$, so that $H_{r_2}\subset H_{r_1}$. By a similar argument we get $H_{r_1}\subset H_{r_2}$, so that there is equality. Moreover, $r_2$ and $g$ correspond to each other uniquely (cf. (26)). To see that $-1\notin\sigma(A)$, consider

$$\langle r_2(\cdot,t),F_n\rangle_{H_{r_1}} = \int_\Omega g(t,\omega)\bar f_n(\omega)\,d\nu(\omega) = F_n(t) + \int_\Omega(A\tilde\Psi)(t,\omega)\bar f_n(\omega)\,d\nu(\omega),\ \text{by (28)}, = F_n(t) + \lambda_n\int_\Omega\tilde\Psi(t,\omega)\bar f_n(\omega)\,d\nu(\omega) = F_n(t)[1+\lambda_n].$$
(29)
On the other hand, condition (b) of Theorem 6 is equivalent to $a_1r_1\prec r_2\prec a_2r_1$ for some constants $a_i>0$, and hence (29) implies $a_1\langle(r_1(\cdot,\cdot),F_n),F_n\rangle\le\langle(r_2(\cdot,\cdot),F_n),F_n\rangle\le a_2\langle(r_1(\cdot,\cdot),F_n),F_n\rangle$, which reduces to

$$0 < a_1\le 1+\lambda_n\le a_2 < \infty,\quad n\ge1.$$
(30)
Hence $\lambda_n\neq-1$ for any $n$, and since $A$ is necessarily compact (with $0$ as the only possible limit point of its spectrum), it follows that $-1\notin\sigma(A)$. Since
the argument is reversible, condition (b) of Theorem 6 and (i) here are equivalent, so that all the conditions of the present result are equivalent to those of Theorem 6, as desired.

Note. If the covariance is already given in a generalized triangular form, i.e., of the type (25'), the result holds and no multiplicity or Hellinger-Hahn theory is needed. Indeed, such a direct application was made by Park [1] when $r_1$ is a triangular covariance on $(T,\mu)$, where $T\subset\mathbb{R}^n$ and $\mu$ is the Lebesgue measure. In our case $r_1$ is a general covariance but $T\subset\mathbb{R}$, so that the Hellinger-Hahn representation has to be (and was) invoked, and $\mu$ is a σ-finite Borel measure. We present a specialization for BM as an example. If $T\subset\mathbb{R}^n$, a result corresponding to Proposition 8 is not immediately available, and an assumption of triangular (or "factorizable") covariance, i.e., the representation (25'), seems desirable so that Theorem 6 can still be employed. Taking $r_1$ as the covariance of the BM (so it is triangular: $r_1(u,v) = \int_0^1\chi_{[0,u]}(t)\chi_{[0,v]}(t)\,dt$), we can present conditions for the equivalence of an arbitrary Gaussian measure $P_0^{r_2}$ with the $P_0^{r_1}$ of the BM, first obtained by Shepp [1] using a different method, as follows. [Here $\Omega = [0,1]$ and $d\nu = dx$, the Lebesgue measure, with $K = r_1$ in (25').]

9. Corollary. Let $P_0^r$ correspond to the BM and let $P_m^s$ be an arbitrary Gaussian measure, both on $(\Omega,\Sigma)$, $\Omega = \mathbb{R}^{[0,1]}$. Then they are mutually equivalent iff there is a hermitian $R\in L^2([0,1]^2, dx\,dy)$ such that for $0\le u,v\le1$ (with $r(u,v) = \min(u,v)$):
(i) $s(u,v) = r(u,v) + \int_0^u\int_0^v R(x,y)\,dx\,dy$;
(ii) if $A$ is the operator determined by $R$ on $L^2([0,1],dx)$, then $-1\notin\sigma(A)$; and
(iii) there is a $g\in L^2([0,1],dx)$ such that $m(t) = \int_0^t g(u)\,du$.
Note that from (i) and (iii) it follows that $s, m$ are differentiable, and in fact $R(u,v) = \frac{\partial^2s}{\partial u\partial v}(u,v)$ and $g(u) = \frac{dm}{du}(u)$ a.e. Thus in the case of BM the equivalence conditions on the mean and covariance can be given explicitly. The above conditions are a specialization of those in Theorem 7, but were discovered by Shepp [1] by a different procedure, without using the RKHS techniques. A direct RKHS proof of this case of Shepp's followed immediately, by Kailath [1]. Since Theorem 7 shows that the kernels $r_1, r_2$ determine HS operators, and, when their difference belongs to $H_{r_1\otimes r_2}$, $r_1 - r_2$ defines a similar operator, it is of interest to find conditions for the equivalence in terms of the latter transformations. The following is such a result; it slightly extends a theorem due to Pitcher [7] (cf. also Root [1], p. 302). Our demonstration is again based on the RKHS technique and follows from the preceding work.

10. Theorem. Let $P_0^{r_i}$, $i=1,2$, be a pair of Gaussian measures on
$(\Omega,\Sigma)$ with means $0$ and covariances $r_i$, where $r_1$ is strictly positive definite and $r_i\in L^2([0,1]^2, dx\,dy)$. If $R_i : f\mapsto R_if = \int_0^1 r_i(\cdot,t)f(t)\,dt$, $f\in L^2([0,1],dt)$, are the corresponding (necessarily HS) operators, then $P_0^{r_1}\sim P_0^{r_2}$ iff $I - R_1^{-1/2}R_2R_1^{-1/2}$ is HS, or equivalently, there is an HS operator $J$ such that $R_1 - R_2 = R_1^{1/2}JR_1^{1/2}$.

Proof. By the result recalled from Schatten's book above, $R_i$, $i=1,2$, are HS operators on $L^2([0,1],dx)$, and both of their kernels can be expressed in series form. For the strictly positive definite $r_1$ one has:

$$r_1(u,v) = \sum_{i=1}^\infty\alpha_ig_i(u)g_i(v),\qquad g_i\in L^2([0,1],dx), \tag{31}$$
where $\{\alpha_i, g_i, i\ge1\}$ are the eigenvalues and the corresponding (normalized) eigenfunctions, with $\sum_{i=1}^\infty\alpha_i^2<\infty$, the series in (31) converging in the mean of $L^2([0,1]^2, dx\,dy)$. If $\nu(\{n\}) = \alpha_n$, then $\nu$ is a measure on $\mathbb{N}$, and, with $g(u,n) = g_n(u)$, (31) becomes:

$$r_1(u,v) = \int_{\mathbb{N}} g(u,n)\bar g(v,n)\,d\nu(n),\quad u,v\in[0,1], \tag{32}$$

which is a "triangular" covariance. If $H_{r_1}$ is the corresponding RKHS, as in (26), then $h\in H_{r_1}$ iff

$$h(t) = \int_{\mathbb{N}} g(t,n)\bar a(n)\,d\nu(n), \tag{33}$$
for some $a\in\overline{\mathrm{sp}}\{\bar g(t,\cdot), t\in[0,1]\}\subset L^2(\mathbb{N},\nu)$, with norm given by $\|h\|^2 = \int_{\mathbb{N}}|a(n)|^2\,d\nu(n)$. In the present case the norm can be calculated explicitly as follows. Let $f\in L^2([0,1],dx)$, so that

$$a : n\mapsto a(n) = \int_0^1 g(t,n)f(t)\,dt$$

is well-defined by the CBS-inequality, and $a\in L^2(\mathbb{N},\nu)$. Let

$$\tilde f = \int_0^1 r_1(\cdot,t)f(t)\,dt = \int_{\mathbb{N}} g(\cdot,n)\Big[\int_0^1\bar g(t,n)f(t)\,dt\Big]\,d\nu(n).$$

Hence

$$\|\tilde f\|^2 = \int_{\mathbb{N}}|a(n)|^2\,d\nu(n) = \sum_{n=1}^\infty\alpha_n\int_0^1 g(u,n)\bar f(u)\,du\int_0^1\bar g(t,n)f(t)\,dt = \int_0^1\int_0^1\Big[\sum_{n=1}^\infty\alpha_ng(u,n)\bar g(t,n)\Big]f(t)\bar f(u)\,dt\,du,$$

by the mean convergence,

$$= \int_0^1\int_0^1 r_1(t,u)f(t)\bar f(u)\,dt\,du,\ \text{by (31)},\ = \int_0^1(R_1f)(t)\bar f(t)\,dt = (R_1f,f) = \|R_1^{1/2}f\|_2^2. \tag{34}$$
Thus $\|\tilde f\| = \|R_1^{1/2}f\|_2 < \infty$ and $\tilde f\in H_{r_1}$. By hypothesis $r_1$ is strictly positive definite, so that $R_1^{-1}$ exists (as an unbounded operator); the same is true of $R_1^{-1/2}$. Since the process is Gaussian, by Proposition 1.2, $H_{r_1}$ is precisely the set of admissible means of $P_0^{r_1}$. By Proposition 1.11, $R_1^{1/2}(L^2([0,1],dt))$ is exactly the set of admissible means in the Gaussian case. Moreover, $\tau : f\mapsto\hat f = R_1^{1/2}f\in H_{r_1}$, $\forall f\in L^2([0,1],dt)$, satisfies $\|\tilde f\| = \|\tau(f)\|\le\|R_1^{1/2}\|\,\|f\|_2 < \infty$, and $\tau$ is one-to-one, continuous and onto. Thus $\tau^{-1}$ is also continuous, by the closed graph theorem. [Note that $R_1^{1/2}(L^2([0,1],dt))$ is a Hilbert space in the $\|\cdot\|$ norm but incomplete in $\|\cdot\|_2$.] We now show that this result follows from Theorem 6 (or 7). If $H_{r_1}$ is the RKHS relative to the kernel $r_1$, $K : [0,1]^2\to\mathbb{C}$ is another positive definite kernel, and $K(\cdot,t)\in H_{r_1}$, $t\in[0,1]$, associate an operator $A$ by the equation $(Af)(t) = (f, K(\cdot,t))$, $f\in H_{r_1}$. Then $A$ is well defined and has a bounded extension to $L^2([0,1],dt)$. It is self-adjoint iff $K$ is hermitian. Also $A = I$ (identity) if $K = r_1$ (cf. Aronszajn [1], p. 372). By Theorem 6, $P_0^{r_1}\sim P_0^{r_2}$ iff $r_1 - r_2\in H_{r_1\otimes r_2}$ and $H_{r_1} = H_{r_2}$. Let $K = r_2$ from now on, so that $A$ corresponds to $r_2$. Then $V = I - A : H_{r_1}\to H_{r_1}$ is determined by $r_1 - r_2\in H_{r_1\otimes r_2}$, so that it is HS (by the result from Schatten [1] recalled above). With $R_i : L^2([0,1],dt)\to L^2([0,1],dt)$ defined by the covariance kernels $r_1, r_2$, consider the mappings:

$$L^2([0,1],dt)\ \xrightarrow{R_1,R_2}\ L^2([0,1],dt)\ \xrightarrow{\tau}\ H_{r_1}\ \xrightarrow{I,A}\ H_{r_1}\ \xrightarrow{\tau^{-1}}\ L^2([0,1],dt), \tag{35}$$
where $\tau$ is the isomorphism obtained above. Combining the various mappings on the corresponding spaces, one gets $R_1 = \tau^{-1}\circ\tau$ and $\tau\circ\tau^{-1} = I$ on $H_{r_1}$, corresponding to $A = I$ in this case. Similarly $R_2 = \tau^{-1}A\tau$. Consequently

$$R_1 - R_2 = \tau^{-1}(I - A)\tau = \tau^{-1}V\tau\ \text{(say)} = R_1^{1/2}JR_1^{1/2}, \tag{36}$$
where $J = R_1^{-1/2}\tau^{-1}V\tau R_1^{-1/2}$ is a bounded operator, and it is HS iff $R_1 - R_2$ is, i.e., iff $r_1 - r_2\in H_{r_1\otimes r_2}$, or $P_0^{r_1}\sim P_0^{r_2}$. Since $R_1^{-1}$ exists, one has, with $B = R_1^{-1/2}R_2R_1^{-1/2}$, that $I - J = B$ is densely defined and bounded on $L^2([0,1],dt)$, and it is HS iff the measures $P_0^{r_i}$, $i=1,2$, are equivalent.

Remark. For the above result we assumed that the means $m_1 = 0 = m_2$ for both measures. In the contrary case, as seen in Theorem 6, one needs the additional condition $m_1 - m_2\in R_1^{1/2}(L^2(T,\mu))$ for equivalence. Since $R_1^{-1}$ is generally unbounded, it will be necessary to work on $H_{r_1}$, where the operators determined by $r_1, r_2$ are HS and are connected to the $R_i$ by equation (36).

There is some further information on the structure of the operators $R_i$ on $L^2(T,\mu)$ of (36): namely, they may be simultaneously diagonalized, somewhat analogous to the classical case of a pair of positive definite matrices in linear algebra, especially as used in multivariate normal theory (cf., e.g., Roy [1], p. 146, or Anderson [1], p. 341). The corresponding result for Gaussian processes is also useful; we present it here and employ it in finding an explicit form of the likelihood ratios of $P_{m_i}^{r_i}$, $i=1,2$, and, more generally, in Section VII.3 later. This is a main reason for our study. If the $r_i$ are triangular, we already have such a result, due to Varberg [1]. Here is an auxiliary diagonalization result, in a slightly generalized version of the one given by Kadota [1].

11. Proposition. Let $(T,\mathcal{T},\mu)$ be a σ-finite measure space and $K_i : T\times T\to\mathbb{C}$ covariance kernels such that $K_i\in L^2(T\times T,\mu\otimes\mu)$, $i=1,2$, with $K_1$ strictly positive definite. If $R_i$ is the associated HS operator of $K_i$, so that

$$(R_if)(t) = \int_T K_i(s,t)f(s)\,d\mu(s),\quad f\in L^2(T,\mu), \tag{37}$$

suppose that $B = R_1^{-1/2}R_2R_1^{-1/2}$ has a bounded extension $\tilde B$ to $L^2(T,\mu)$ having a discrete spectrum. If $\{\alpha_n, f_n, n\ge1\}$ are the eigenvalues and corresponding normalized eigenfunctions of $\tilde B$, then the following
simultaneous diagonalization of the kernels $K_i$, $i=1,2$, holds:

$$K_1(s,t) = \sum_{n=1}^\infty(R_1^{1/2}f_n)(s)\overline{(R_1^{1/2}f_n)(t)},\qquad K_2(s,t) = \sum_{n=1}^\infty\alpha_n(R_1^{1/2}f_n)(s)\overline{(R_1^{1/2}f_n)(t)}, \tag{38}$$

where both series converge in the norm of $L^2(T\times T,\mu\otimes\mu)$.
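Proposition 11 translates directly into finite dimensional linear algebra. The following NumPy sketch (with $K_2$ a bridge-like perturbation of the Wiener kernel, chosen purely for illustration) builds $B = R_1^{-1/2}R_2R_1^{-1/2}$ from discretized integral operators and verifies both series of (38) as exact matrix identities.

```python
import numpy as np

n = 60
t = np.linspace(0.05, 1.0, n)
h = t[1] - t[0]
K1 = np.minimum.outer(t, t) * h                  # discretized integral operator R_1
K2 = (np.minimum.outer(t, t) - 0.4*np.outer(t, t)) * h

w, U = np.linalg.eigh(K1)
R1h  = U @ np.diag(np.sqrt(w))  @ U.T            # R_1^{1/2}
R1hi = U @ np.diag(1/np.sqrt(w)) @ U.T           # R_1^{-1/2}

B = R1hi @ K2 @ R1hi                             # B = R_1^{-1/2} R_2 R_1^{-1/2}
alpha, F = np.linalg.eigh(B)                     # eigenvalues / eigenfunctions
G = R1h @ F                                      # columns g_n = R_1^{1/2} f_n

# Both kernels are simultaneously diagonalized as in (38):
print(np.allclose(G @ G.T, K1, atol=1e-8))
print(np.allclose(G @ np.diag(alpha) @ G.T, K2, atol=1e-8))
```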
Proof. By hypothesis $\tilde Bf_n = \alpha_nf_n$ ($(f_n,f_n) = 1$), and with $g_n = R_1^{1/2}f_n$ ($\in L^2(T,\mu)$) one has $R_2R_1^{-1}g_n = \alpha_ng_n$, $n\ge1$; moreover the $g_n$ are linearly independent (since $R_1^{-1}$ exists). Using tensor notation, define the simple operators $\pi_n = f_n\otimes\bar f_n$ on $L^2(T,\mu)$, of rank 1, by $\pi_nh = (h,f_n)f_n$. Then $\pi_n^2 = \pi_n$, and, since $\{f_n, n\ge1\}$ is a complete orthonormal set, the $\pi_n$ form a decomposition of the identity in $L^2(T,\mu)$:

$$I = \sum_{n=1}^\infty\pi_n, \tag{39}$$

in that $f = \sum_{n=1}^\infty\pi_nf$, $f\in L^2(T,\mu)$. Similarly, let $B_n = \sum_{k=1}^n g_k\otimes\bar g_k$; then $B = \lim_{n\to\infty}B_n$ (strongly) is a bounded linear operator. Indeed, $B_n$ being obviously bounded and linear, it converges strongly since, for $m<n$ and $h\in L^2(T,\mu)$:

$$\|(B_n - B_m)h\|_2^2 = \Big\|\sum_{k=m+1}^n(g_k\otimes\bar g_k)h\Big\|_2^2 = \Big\|\sum_{k=m+1}^n(h,g_k)\bar g_k\Big\|_2^2 = \Big\|\sum_{k=m+1}^n(R_1^{1/2}h,f_k)R_1^{1/2}f_k\Big\|_2^2,\ \text{since } R_1^{1/2}\text{ is self adjoint},$$
$$\le\|R_1^{1/2}\|^2\Big\|\sum_{k=m+1}^n(R_1^{1/2}h,f_k)\bar f_k\Big\|_2^2\ \to 0,\ \text{by Parseval's equation}.$$

Hence $B_n\to B$ strongly, and $B$ is bounded and linear, given by:

$$Bh = \sum_{k=1}^\infty(g_k\otimes\bar g_k)h = \sum_{k=1}^\infty R_1^{1/2}(f_k\otimes\bar f_k)R_1^{1/2}h = R_1^{1/2}\Big(\sum_{k=1}^\infty f_k\otimes\bar f_k\Big)R_1^{1/2}h = R_1^{1/2}IR_1^{1/2}h = R_1h,\quad\forall h\in L^2(T,\mu),$$
(40)
on using (39). This implies $B = R_1$, and (40) is the same as the first part of (38). In fact, for any $f,h\in L^2(T,\mu)$ we have

$$(R_1h,f) = \sum_{n=1}^\infty(h,g_n)(g_n,f).$$

Using (37), this may be expressed as the absolutely converging series:

$$\int_T\int_T K_1(s,t)h(s)\bar f(t)\,d\mu(s)\,d\mu(t) = \sum_{n=1}^\infty\int_T h(s)(R_1^{1/2}f_n)(s)\,d\mu(s)\times\int_T(R_1^{1/2}f_n)(t)\bar f(t)\,d\mu(t)$$
$$= \int_T\int_T\Big(\sum_{n=1}^\infty(R_1^{1/2}f_n)(s)(R_1^{1/2}f_n)(t)\Big)h(s)\bar f(t)\,d\mu(s)\,d\mu(t),$$
where the interchange of the infinite sum and integral is easily justified, as each of the series on the right converges in mean. Since $f,h\in L^2(T,\mu)$ are arbitrary, the integrands can be identified a.e., giving the first series of (38). The second series is established as follows. By hypothesis, if $\beta = \|\tilde B\| < \infty$, then, $\tilde B$ being self-adjoint, the classical spectral theorem implies:

$$\tilde B = \int_{-\beta}^\beta\alpha\,dE_\alpha = \sum_{n=1}^\infty\alpha_n(f_n\otimes\bar f_n), \tag{41}$$

where the HS hypothesis on $\tilde B$ is used, the sum on the right converging strongly. Recalling the definition of $B$, (41) is equivalent to

$$R_1^{-1/2}R_2R_1^{-1/2}h = \sum_{n=1}^\infty\alpha_n(f_n\otimes\bar f_n)h,\quad h\in L^2(T,\mu). \tag{42}$$

Replacing $h$ by $R_1^{1/2}f$ in (42) and noting that $R_1^{1/2}(L^2(T,\mu))$ is dense in (real) $L^2(T,\mu)$, one has, on pre-multiplying both sides by $R_1^{1/2}$, that

$$R_2f = \sum_{n=1}^\infty\alpha_n\big(R_1^{1/2}f_n\otimes R_1^{1/2}f_n\big)f. \tag{43}$$
Since $f$ is arbitrary in $L^2(T,\mu)$, (43) implies, as in the preceding case, the second part of (38), as desired.

Note. This result was established by Kadota [1] for $T = [a,b]$, a compact interval, with $\mu$ the Lebesgue measure. Also (40) and (43) imply (with $g_n = R_1^{1/2}f_n$ in the above notation) that

$$(R_1 - R_2)h = \sum_{n=1}^\infty(1-\alpha_n)(g_n\otimes\bar g_n)h. \tag{44}$$
However, in the RKHS norm one has (cf. (34))

$$(g_n,g_m)_{H_{K_1}} = (R_1^{-1/2}g_n, R_1^{-1/2}g_m)_2 = (f_n,f_m)_2 = \delta_{mn}.$$

Since $R_1 - R_2$ is HS on $H_{K_1}$, where $K_1, K_2$ are covariance functions of Gaussian processes, the general HS theory gives $\sum_{n=1}^\infty(1-\alpha_n)^2 < \infty$ (see Schatten [1], p. 32). [On the other hand, if $B$ above is of trace class, then $\sum_{n=1}^\infty|1-\alpha_n| < \infty$ holds; cf. p. 41 of the same book.] These facts will be useful in deriving the likelihood ratio of these measures, to which we now turn.

We first recall some notation from the preceding work, to obtain a series representation of a second order process for use in the following. Thus replace the given process $\{X_t, t\in T\subset\mathbb{R}\}$ by a series with a countable set of observable coordinates (cf. Section IV.1) and apply the methods of Chapter IV. Since $P_0^{r_i}$, $i=1,2$, are now equivalent Gaussian measures and the $r_i$ are left (or right) continuous, we can, by the preceding proposition, use the simultaneous diagonalization and obtain a fixed set of (normalized) eigenfunctions $\{f_n, n\ge1\}$ corresponding to the eigenvalues $\{\alpha_n, n\ge1\}$ of the HS operator $B = R_1^{-1/2}R_2R_1^{-1/2}$ determined by $R_1 - R_2\in L^2(T\times T,\mu\otimes\mu)$. It is precisely for this application that Proposition 11 was established. Here the $R_i$ are the integral operators determined by the $r_i$, $r_1$ being strictly positive definite. We take $T\subset\mathbb{R}$ and $\mu$ as Lebesgue measure, to fully utilize the work of Section 1. [The case of means $m_i\neq0$ will be considered shortly.] Let $g_n = R_1^{1/2}f_n$; then by (38)

$$r_1(u,v) = \sum_{n=1}^\infty g_n(u)\bar g_n(v);\qquad r_2(u,v) = \sum_{n=1}^\infty\alpha_ng_n(u)\bar g_n(v), \tag{45}$$

where $r_1 - r_2\in H_{r_1\otimes r_2}$ and $\sum_{n=1}^\infty(1-\alpha_n)^2 < \infty$. Now define the observable coordinates as

$$\xi_n = \int_T X_tf_n(t)\,dt,\quad n\ge1;\qquad X_t = \sum_{n=1}^\infty\xi_ng_n(t), \tag{46}$$
which are well-defined (since the $X_t$-process is right continuous, it can be taken "measurable and separable," so the integral exists), with $E_1(\xi_n) = 0 = E_2(\xi_n)$, $E_1(|\xi_n|^2) = 1$, and $E_2(|\xi_n|^2) = \alpha_n$, the series converging in $L^2(T,dt)$-mean. Here $E_i$ denotes expectation relative to $P_0^{r_i}$, $i=1,2$. If the means $m_i\neq0$, then we may consider the case of mean $0$ for the first and $\delta = m_1 - m_2$ for the second measure, as in equation (1). Then it is necessary and sufficient that $\delta$ be an admissible mean for $P_0^{r_2}$, so that $\delta\in H_{r_2}$. By Theorem 1.4, if $r_2$ vanishes at infinity for the locally compact $T$, then there is a unique regular bounded Borel measure $\beta$ on $T$ such that

$$\delta(t) = \int_T r_2(s,t)\,d\beta(s). \tag{47}$$
With this review of the earlier work, the likelihood ratio unifying several results of various authors can be given as follows (it also subsumes Theorem 1, although it cannot be as sharp as in that special case):

12. Theorem. Suppose that $\{X_t, t\in T\subset\mathbb{R}\}$ is a process on $(\Omega,\Sigma)$ which is Gaussian relative to the measures $P_{m_i}^{r_i}$, $i=1,2$, where $r_1$ is strictly positive definite, both $r_i$ are right continuous, $r_2$ vanishes at infinity, and the measures are equivalent, so that the representations (45)-(47) hold. Then the likelihood ratio is given by

$$\frac{dP_{m_2}^{r_2}}{dP_{m_1}^{r_1}}(X) = (D_1D_2)^{-1/2}\exp\Big\{\int_T X_t\,d\beta(t) - \frac12\sum_{n=1}^\infty\Big[\xi_n^2\,\frac{1-\alpha_n}{\alpha_n} - (1-\alpha_n)\Big]\Big\}, \tag{48}$$

where the positive constants $D_1, D_2$ are defined by

$$D_1 = \prod_{i=1}^\infty\alpha_ie^{(1-\alpha_i)};\qquad D_2 = \exp\Big[\int_T\int_T r_2(u,v)\,d\beta(u)\,d\beta(v)\Big]. \tag{49}$$
The integral in (48) is a simple standard vector (or Bochner) integral, and the series converges a.e. relative to either (hence both) probability measures $P_{m_i}^{r_i}$.

Proof. Let $\mathcal{F}_n = \sigma(\xi_k, k\le n)$ and $\mathcal{F} = \sigma(\cup_n\mathcal{F}_n)$ be the σ-algebras shown, completed for the $P_{m_i}^{r_i}$ (the same symbols being kept, for brevity). If $\mathcal{T}$ is the Borel σ-algebra of $T$, then $\{X_t(\omega), t\in T, \omega\in\Omega\}$ is jointly measurable for $\mathcal{F}\otimes\mathcal{T}$. By hypothesis, the $P_0^{r_i}|\mathcal{F}_n$, $i=1,2$, are equivalent for all $n$. Let $p = \frac{dP_0^{r_2}}{dP_0^{r_1}}$ and $p_n = E^{\mathcal{F}_n}(p)$. Then $\{p_n,\mathcal{F}_n, n\ge1\}$ is a uniformly integrable martingale and $p_n\to p$ a.e.
and in $L^2(P_0^{r_1})$. For finding $p$, we can calculate $p_n$ and then let $n\to\infty$ in what follows. Let $f_n^i$ be the $n$-dimensional normal densities of $(\xi_1,\ldots,\xi_n)$ under $P_0^{r_i}$. Then our hypothesis implies:

$$p_n(\omega) = \frac{f_n^2}{f_n^1}(\omega) = \big(\Pi_{i=1}^n\alpha_i\big)^{-1/2}\exp\Big[-\frac12\sum_{i=1}^n\xi_i^2(\omega)\Big(\frac{1}{\alpha_i}-1\Big)\Big],\quad n\ge1,$$

since the $\xi_i$ are independent and normal with means zero and variances $1$ and $\alpha_i$ relative to $P_0^{r_1}$, $P_0^{r_2}$ respectively. Although $p_n(\omega)\to p(\omega)$ for a.a. $(\omega)$, the factors on the right side need not converge individually. [Recall that $\Pi_{i=1}^\infty\alpha_i$ exists iff $\sum_{i=1}^\infty|1-\alpha_i|<\infty$; but we only have $\sum_{n=1}^\infty(1-\alpha_n)^2<\infty$. If the stronger convergence takes place, then the infinite product is exactly the Fredholm determinant of $B$. When the stronger condition holds, the $P_0^{r_i}$ are termed strongly equivalent.] Under the present (weaker) hypothesis we can introduce a convergence factor, to get a "regularized Fredholm determinant" (cf. Gohberg and Kreĭn [1], pp. 166-167). Thus, express $p_n(\cdot)$ as:

$$p_n(\omega) = \Big[\Pi_{i=1}^n\big(1-(1-\alpha_i)\big)e^{(1-\alpha_i)}\Big]^{-1/2}\exp\Big\{-\frac12\sum_{i=1}^n\Big[\frac{1-\alpha_i}{\alpha_i}\,\xi_i^2(\omega) - (1-\alpha_i)\Big]\Big\}.$$

Since $\sum_{i=1}^\infty(1-\alpha_i)^2 < \infty$, we get $D_1 = \Pi_{i=1}^\infty\alpha_ie^{(1-\alpha_i)} > 0$, which converges (and $=$ the regularized determinant of $B$). The second factor also converges for a.a. $(\omega)$, by Kolmogorov's two series theorem. Hence

$$p(\omega) = D_1^{-1/2}\exp\Big\{-\frac12\sum_{i=1}^\infty\Big[\frac{1-\alpha_i}{\alpha_i}\,\xi_i^2(\omega) - (1-\alpha_i)\Big]\Big\}$$
for a.a. $(\omega)$. It remains to remove the condition of zero means. For this we use the chain rule in the form (cf. (1) of Theorem 1):

$$\frac{dP_{m_2}^{r_2}}{dP_{m_1}^{r_1}}(\omega) = \frac{dP_{m_2-m_1}^{r_2}}{dP_0^{r_2}}(\omega)\,\frac{dP_0^{r_2}}{dP_0^{r_1}}(\omega)$$

for a.a. $(\omega)$. Also $P_{m_2}^{r_2}\sim P_{m_1}^{r_1}$ iff $P_{m_2-m_1}^{r_2}\sim P_0^{r_2}$ and $P_0^{r_2}\sim P_0^{r_1}$. Now by Corollary 1.3,

$$\frac{dP_{m_2-m_1}^{r_2}}{dP_0^{r_2}} = \exp\big[Y - \tfrac12E(Y^2)\big],$$
for a unique $Y\in\overline{\mathrm{sp}}\{X_t, t\in T\}\subset L^2(P_0^{r_2})\cap L^2(P_0^{r_1})$. Since $\delta = m_2 - m_1\in H_{r_2}$, $T$ is a locally compact set, and $r_2$ (hence $r_1$ also) vanishes at "$\infty$," one has, as in (47), the representation

$$\delta(t) = \int_T r_2(s,t)\,d\beta(s)$$

for a unique regular bounded Borel measure $\beta$ on $T$; moreover, $Y = \int_T X_t\,d\beta(t)$ (see (14) in Theorem 1.4). It now follows, by a simplification of $D_2 = \exp(E(Y^2))$, that

$$D_2 = \exp\Big[\int_T\int_T\int_\Omega X_u\bar X_v\,dP_0^{r_2}\,d\beta(u)\,d\beta(v)\Big] = \exp\Big[\int_T\int_T r_2(u,v)\,d\beta(u)\,d\beta(v)\Big] > 0,$$

so that

$$p(\omega) = (D_1D_2)^{-1/2}\exp\Big\{\Big(\int_T X_t\,d\beta(t) - \frac12\sum_{i=1}^\infty\Big[\xi_i^2\,\frac{1-\alpha_i}{\alpha_i} - (1-\alpha_i)\Big]\Big)(\omega)\Big\},$$

which establishes (48).

With this result it is possible to test the hypothesis $H_0 : P_{m_1}^{r_1}$ vs $H_1 : P_{m_2}^{r_2}$ for a process $\{X_t, t\in T\subset\mathbb{R}\}$, based on a realization, by using Grenander's theorem (Theorem IV.1.1) to obtain the critical region $A_k$ as:
$$A_k = \Big\{\omega : \Big(\int_T X_t\,d\beta(t)\Big)(\omega) - \frac12\sum_{i=1}^\infty\Big[\xi_i^2(\omega)\Big(\frac{1-\alpha_i}{\alpha_i}\Big) - (1-\alpha_i)\Big]\ \ge k\Big\},$$

where $k$ is chosen to have the prescribed size $\varepsilon$: $P_{m_1}^{r_1}(A_k) = \varepsilon$, $0<\varepsilon<1$. In theory, therefore, this solves the original problem of testing two Gaussian measures (simple hypothesis vs simple alternative), with a nontrivial test theory iff the measures are equivalent, by the dichotomy theorem. To implement this in a given problem, however, one must have algorithms to find the eigenvalues and eigenfunctions $\alpha_n, f_n$ (and hence $\xi_n$) for the HS operator $B = R_1^{-1/2}R_2R_1^{-1/2}$. Only an indication of an iterative evaluation with RKHS methods is available in Parzen [2], and some examples were worked out in several presentations of Kailath and his associates. However, a really usable and computationally feasible procedure has yet to be found.
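Although a general-purpose procedure is lacking, the discretized eigensystem makes the test based on (48) easy to prototype. The following NumPy sketch (zero means, so $\beta = 0$; $r_1 = \min$ and a bridge-like $r_2$, all choices illustrative) computes the series statistic, calibrates the critical value $k$ of $A_k$ by simulation under $H_0$, and estimates the power under $H_1$.

```python
import numpy as np
rng = np.random.default_rng(2)

n = 60
t = np.linspace(0.05, 1.0, n); h = t[1] - t[0]
R1 = np.minimum.outer(t, t) * h                    # quadrature form of R_1
R2 = (np.minimum.outer(t, t) - 0.4*np.outer(t, t)) * h

w, U = np.linalg.eigh(R1)
R1hi = U @ np.diag(1/np.sqrt(w)) @ U.T             # R_1^{-1/2}
alpha, F = np.linalg.eigh(R1hi @ R2 @ R1hi)        # spectrum of B
fn = R1hi @ F                                      # discrete eigenfunctions f_n

def stat(X):                                       # series part of (48)
    xi = np.sqrt(h) * (fn.T @ X)                   # xi_n ~ int X_t f_n(t) dt
    return -0.5 * np.sum(xi**2 * (1 - alpha)/alpha - (1 - alpha))

L1 = np.linalg.cholesky(R1/h + 1e-12*np.eye(n))    # paths under H0
L2 = np.linalg.cholesky(R2/h + 1e-12*np.eye(n))    # paths under H1
S0 = np.array([stat(L1 @ rng.standard_normal(n)) for _ in range(4000)])
S1 = np.array([stat(L2 @ rng.standard_normal(n)) for _ in range(4000)])
k = np.quantile(S0, 0.95)                          # critical value, size 0.05
print(np.mean(S0 >= k), np.mean(S1 >= k))          # ~0.05 under H0; larger under H1
```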
After this detailed analysis of general Gaussian processes, it is possible to go on to related results in specialized cases, such as triangular covariances, stationary processes, and other types, an indication of which has just been included above. We shall discuss these classes further in Chapter VII. But for the present general account we proceed with other (not necessarily Gaussian) classes for which the corresponding likelihood ratios can be calculated.

5.3 Independent increment and jump Markov processes

A Gaussian process is uniquely determined by its mean and covariance functions, and hence the analysis can be based on these two parametric functions. For other general processes such simple characteristics are not available. To utilize the hypothesis of independent increments, one looks at the Fourier transforms, or appropriate characteristic function(al)s on suitable function spaces, and searches for a useful substitute; this is supplied by the famous Lévy-Khintchine formula (for such processes), which has two key parameters analogous to the above. Consequently we start with this result. Since every process with independent increments is a Markov process, we also include some general results on the latter for likelihood ratios, extending the present considerations.

If $\{X_t, a\le t\le b\}$ is a segment of the process $\{X_t, t\in T\subset\mathbb{R}\}$ with independent increments, then, for each partition $a\le t_0<t_1<\cdots<t_n\le b$, $X_b - X_a = \sum_{k=1}^n(X_{t_k} - X_{t_{k-1}})$ is a sum of independent random variables. If the process has no fixed discontinuities, then, considering the characteristic functions of the process, with

$$\Phi_{s,t}(u) = E\big(e^{iu(X_t - X_s)}\big),\quad a\le s<t\le b,$$

one has $\Phi_{a,b} = \prod_{i=1}^n\Phi_{t_{i-1},t_i}$. In case $\lim_{s\to t}\Phi_{s,t}(u) = 1$ uniformly in $t,u$ for $s,t\in[a,b]\subset T$, $X_t$ is infinitely divisible. Although in general this need not be true for all independent increment processes [one has to subtract certain suitable nonrandom functions, called centering functions $f$, so that $Z_t = X_t - f(t)$ has the desirable properties except at a countable set of fixed discontinuities], the situation becomes simpler if the process is stochastically continuous, i.e., $\lim_{|s-t|\to0}P[|X_s - X_t|>\varepsilon] = 0$, $s,t\in[a,b]$, for each $\varepsilon>0$. We therefore assume this condition hereafter, so that $X_t - X_a$ is infinitely divisible for each $a<t$. Consequently the classical Lévy-Khintchine formula is available for such processes, and it is given as:

$$\Phi_{a,t}(u) = \exp\Big\{iu\gamma_{at} + \int_{\mathbb{R}}\Big(e^{iux} - 1 - \frac{iux}{1+x^2}\Big)\frac{1+x^2}{x^2}\,dG_{at}(x)\Big\}, \tag{1}$$

for a constant $\gamma_{at}$ ($\in\mathbb{R}$) and a nondecreasing right continuous bounded function $G_{at}(\cdot)$ with $G_{at}(-\infty) = 0$; the representation is
unique. Moreover, the stochastic continuity of the process also implies that $t\mapsto\gamma_{at}$ and $t\mapsto G_{at}(x)$, $x\in\mathbb{R}$, are continuous. Considering $\Phi_{s,t}(u) = \frac{\Phi_{a,t}(u)}{\Phi_{a,s}(u)}$, $u\in\mathbb{R}$, we deduce that $\Phi_{s,t}$ is an infinitely divisible characteristic function as in (1), with $(\gamma_{st}, G_{st})$ as its (unique) pair of parameters. Hence a comparison of these two representations, along with the uniqueness property, gives $\gamma_{st} = \gamma_{at} - \gamma_{as}$ and $G_{st} = G_{at} - G_{as}\ge0$, so that both $G_{\cdot t}(x)$ and $G_{as}(\cdot)$ are nondecreasing. But then $G_t(0+) - G_t(0-) = \sigma_t^2\ge0$ is the size of a (possible) jump of $G_t$ at the origin, and the integrand in (1) at $x = 0$ has the value $-\frac{u^2}{2}$. So, writing $\tilde M_t(x) = G_t(x) - \sigma_t^2\chi_{[0,\infty)}(x)$, one has $\tilde M_t(\cdot)$ continuous at the origin and nondecreasing, and hence (1) can be written in the following alternative form, with an important probabilistic interpretation to be discussed later:

$$\Phi_{a,t}(u) = \exp\Big\{iu\gamma_{at} - \frac{u^2}{2}\sigma_t^2 + \int_{\mathbb{R}}\Big(e^{iux} - 1 - \frac{iux}{1+x^2}\Big)\frac{1+x^2}{x^2}\,d_x\tilde M_t(x)\Big\}. \tag{2}$$

Hereafter, for simplicity, we take $a = 0$ [a fixed initial point] and $X_0 = 0$ [or an infinitely divisible variable]. With these assumptions (and dropping the dependence on the fixed initial value $0$), we get

$$\Phi_{s,t}(u) = \exp\Big\{i(\gamma_t - \gamma_s)u - \frac{u^2}{2}(\sigma_t^2 - \sigma_s^2) + \int_{\mathbb{R}}\Big(e^{iux} - 1 - \frac{iux}{1+x^2}\Big)\frac{1+x^2}{x^2}\,d_x\big(\tilde M_t(x) - \tilde M_s(x)\big)\Big\}. \tag{3}$$

If the increments are also stationary, so that the distribution of $X_{t+h} - X_t$ depends only on $h$, then we get from the above relation $\gamma_{s+t} = \gamma_s + \gamma_t$ and $\tilde M_{s+t}(x) = \tilde M_s(x) + \tilde M_t(x)$. Because of the continuity properties, these Cauchy functional equations imply (this would be true even if the functions were only measurable, the conclusion then being that they are necessarily continuous in $t$) that $\gamma_t = t\gamma$ and $\tilde M_t(x) = t\tilde M(x)$, so that (1) becomes:

$$\Phi_t(u) = \exp\Big\{it\gamma u - \frac{tu^2\sigma^2}{2} + t\int_{\mathbb{R}}\Big(e^{iux} - 1 - \frac{iux}{1+x^2}\Big)\,dM(x)\Big\}, \tag{4}$$

where $dM(x) = \frac{1+x^2}{x^2}\,d\tilde M(x)$. For a general theory of processes with independent increments see Doob ([2], Sections VIII.6-7), and for processes
with stationary independent increments, the recent book by Bertoin [1] may be consulted. [Some properties of the parametric functions $(\gamma_t,\sigma_t,M_t(\cdot))$ follow from the more general infinitely divisible processes treated in Theorem 4.1 and Proposition 4.2 below.] In our stochastic inference study it is often necessary to consider the finite dimensional distributions: if $0\le t_1<\cdots<t_n<\infty$, the joint distribution, or characteristic function, of $(X_{t_1},\ldots,X_{t_n})$ is desired. The corresponding version of (2) is $\Phi_{X_{t_1},\ldots,X_{t_n}}(u_1,\cdots,u_n) = E(e^{i(u_1X_{t_1}+\cdots+u_nX_{t_n})})$, and the multivariate analog of (2) is, on letting $(u,x) = u_1x_1+\cdots+u_nx_n$ for the vectors $u = (u_1,\cdots,u_n)$, $x = (x_1,\cdots,x_n)$ (so that $(u,X)$ is infinitely divisible), the following:

$$\Phi_t(u) = \exp\Big\{i(\gamma,u) - \frac12(u,A(t)u) + \int_{\mathbb{R}^n}\Big(e^{i(u,x)} - 1 - \frac{i(u,x)}{1+(x,x)}\Big)\,M(t,du)\Big\}, \tag{5}$$

where $\gamma = (\gamma_{t_1},\cdots,\gamma_{t_n})\in\mathbb{R}^n$, $A(t) = (a_{ij}(t),\ 1\le i,j\le n)$ is an $n\times n$ positive definite matrix, and $M(t,\cdot)$ is a finite non-negative measure on $\mathbb{R}^n$. The same holds if $X_t$ takes values in $\mathbb{R}^n$ and $t\in\mathbb{R}^+$, so that $\gamma_t = (\gamma_t^1,\cdots,\gamma_t^n)$, with $A_t$ a positive definite matrix whose elements depend continuously on $t$. In this case the analog of (3) becomes:

$$\Phi_t(u) = \exp\Big\{i(u,\gamma(t)) - \frac12(u,A(t)u) + \int_{\mathbb{R}^n}\Big(e^{i(u,x)} - 1 - \frac{i(u,x)}{1+(x,x)}\Big)\,M(t,dx)\Big\}, \tag{5'}$$

where $\gamma : \mathbb{R}^+\to\mathbb{R}^n$ is a continuous function, $A : \mathbb{R}^+\to\mathbb{R}^{n\times n}$ is a positive definite $n\times n$ matrix function with $(u,A(\cdot)u)$ continuous and increasing, and $M(t,\cdot)$ is a measure for which $\int_{\mathbb{R}^n}\frac{(x,x)}{1+(x,x)}\,M(t,dx)<\infty$, continuous in $t\in\mathbb{R}^+$. We thus have three parameters $(\gamma,A,M)$ that determine a stochastically continuous process with independent increments. For us the representation (2) or (3) is of immediate interest, since each such process is uniquely determined by the parameter functions $(\gamma_t,G_t)$ or $(\gamma_t,A_t,M_t)$, which is (superficially) similar to the Gaussian case in a different context. Now the characteristic function (ch.f.) $\Phi_t$ is a product, which corresponds to a sum of independent random variables, whose ch.f.s are $\exp\{i(u,\gamma_t) - \frac12(u,A(t)u)\}$, the ch.f. of an $n$-dimensional Gaussian distribution, and the remaining factor (after approximating the integral by suitable Riemann-Stieltjes sums), which is the
ch.f. of a sum of (an infinite collection of) independent Poisson random measures. This may be summarized as follows:

1. Theorem. Let $\{X_t, t\ge0\}$ be a stochastically continuous process (on $(\Omega,\Sigma,P)$) with independent increments (hence with ch.f. (2)). Then it can be represented as

$$X_t = Y_t + Z_t,\quad t\ge0, \tag{6}$$

on the same probability space (which may be assumed rich enough to support all these processes), where $\{Y_t, t\ge0\}$ is a (stochastically continuous) Gaussian process with independent increments, continuous mean function $t\mapsto\gamma_t$ and variance function $t\mapsto\sigma_t^2$, and where the $Z_t$-process, independent of the $Y_t$-process, also has independent increments, with a finite number of jumps (of unit size) in any finite interval, each jump count being Poisson distributed. Further, if the $X_t$-process has stationary increments, then so do the $Y_t$- and $Z_t$-processes. In fact, if $\tilde Y_t = Y_t - t\gamma$, it becomes a martingale with continuous sample paths and independent (stationary) increments, and thus is a BM; and, for $t>s$,

$$P[Z_t - Z_s = k] = e^{-c(t-s)}\,\frac{c^k(t-s)^k}{k!},\quad c>0,\ k=0,1,\ldots$$

obtains.
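The decomposition (6) is easily visualized by simulation. The following NumPy sketch (parameter values illustrative) generates the increments of $X = Y + Z$, with $Y$ a BM with drift and $Z$ an independent unit-jump Poisson component, and checks the first two moments of the stationary independent increments.

```python
import numpy as np
rng = np.random.default_rng(3)

n, dt = 100_000, 1.0/100_000
gam, sigma, c = 0.5, 1.2, 3.0                    # illustrative parameters
dY = gam*dt + sigma*np.sqrt(dt)*rng.standard_normal(n)   # Gaussian part
dZ = rng.poisson(c*dt, size=n).astype(float)             # unit Poisson jumps
dX = dY + dZ

# Stationary independent increments: mean and variance scale with dt.
print(dX.mean()/dt, gam + c)             # ~ gamma + c
print(dX.var()/dt, sigma**2 + c)         # ~ sigma^2 + c  (up to O(dt))
```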
Since a product of ch.f.s corresponds to a sum of independent random variables, the above representation is a translation of this fact from (4); the precise details involve tedious computations, which are omitted as they are not essential for the following work. (That the $\tilde Y_t$-process with continuous sample paths is a BM is a classical characterization due to P. Lévy; a martingale proof can be found in Doob [2], p. 384.) However, it is a remarkable fact that processes without moments, such as the Cauchy process, are infinitely divisible, and (6) is valid for them too, the $Y_t$-component then being absent. This highlights the importance in this study of the $Z_t$-process, whose sample paths are always discrete. It is better appreciated when a further decomposition of the $Z_t$-process is given. Thus, for each Borel set $A\subset\mathbb{R}$, let

$$\nu(t,A) = \sum_{t_k\le t}\chi_{[X_{t_k+0} - X_{t_k-0}\in A]},$$

so that $\nu(t,A)$ is the number of jumps of the $X_t$-process on $[0,t]$ with values in $A$. If $A$ is bounded away from zero, i.e., $A\subset\mathbb{R}-\{|x|<\varepsilon\} = \mathbb{R}_\varepsilon$ for some $\varepsilon>0$, then the integer valued random variable $\nu(t,A)<\infty$ with probability one. (For simplicity we take $X_t$ to be real valued.) Moreover, if $A_1,\ldots,A_n$ are disjoint Borel subsets of $\mathbb{R}_\varepsilon$, then $\nu(t,A_i)$, $i=1,\cdots,n$, are independent, $\nu(t,A)\ge\nu(s,A)$ for each $s<t$, and $\{\nu(t,A), t\ge0\}$
has independent increments for each $A$. If $\pi(t,A) = E(\nu(t,A))$, then $\pi(t,\cdot) : \mathcal{B}_\varepsilon\to\mathbb{R}^+$ is a finite measure, where $\mathcal{B}_\varepsilon$ is the Borel σ-algebra of $\mathbb{R}_\varepsilon$. Also $\pi(\cdot,A)$ is increasing and hence defines a (Stieltjes) measure, denoted by the same symbol. This means $\pi(\cdot,A)$ and $\pi(B,\cdot)$ are Stieltjes type measures; consequently $\pi(\cdot,\cdot)$ is a bimeasure. If it were allowed to take real or complex values, it need not define a measure on the product σ-algebra $\mathcal{B}(\mathbb{R}^+)\otimes\mathcal{B}(\mathbb{R})$. Fortunately, here it is non-negative valued, and in this case (since $\pi(\cdot,A)$ and $\pi(B,\cdot)$ are "Radon measures") it does have a unique extension to a measure on the product σ-algebra. [For a proof of this nontrivial fact see Berg, Christensen and Ressel [1], p. 24; a direct argument is also possible in this special case.] Because of this property, a stochastic (Stieltjes) integral with respect to $\nu(t,\cdot)$ can be defined directly, using the pointwise definition (and Fubini's theorem), since $E(\nu(\cdot,\cdot)) = \pi(\cdot,\cdot)$ is a measure. [This is a special case of the stochastic integral discussed in Section IV.2 with Bochner's boundedness principle, to be used in more general situations in later applications.] For a detailed analysis of such random measures $\nu$ and applications, see Rao [18].

As noted above, a generalized Poisson process is a non-negative integer valued random set function $\nu$ on $\mathcal{B}(\mathbb{R}^+)\otimes\mathcal{B}(\mathbb{R})$ that may be realized as follows. Let $C$ ($= \mathbb{R}^+\times\mathbb{R}$) be a space, $\mathcal{C}$ ($= \mathcal{B}(\mathbb{R}^+)\otimes\mathcal{B}(\mathbb{R})$) a σ-algebra containing all one point sets, and $\pi : \mathcal{C}\to\bar{\mathbb{R}}^+$ a measure. Then, for any finite set of integers $(r_1,\ldots,r_n)$ and non-overlapping ($=$ disjoint except perhaps for boundary points) sets $C_i\in\mathcal{C}$, we have

$$P[\nu(C_i) = r_i,\ i=1,\ldots,n] = \prod_{i=1}^n p(\pi(C_i), r_i),$$

where

$$p(x,a) = \begin{cases} e^{-x}\dfrac{x^a}{a!}, & 0<x<\infty,\ a<\infty\ \text{an integer},\\ 1, & x=\infty,\ a=\infty,\ \text{or}\ x=0,\ a=0,\\ 0, & \text{elsewhere}. \end{cases} \tag{7}$$
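A Poisson random set function with a given finite intensity can be sampled by the standard two-step recipe (total count, then i.i.d. scattering), consistent with (7). The sketch below (NumPy; $\pi$ = Lebesgue $\times$ standard normal on $C = [0,T]\times\mathbb{R}$, an illustrative choice) checks the mean and variance of the counts in a window.

```python
import numpy as np
from math import erf, sqrt
rng = np.random.default_rng(4)

lam, T = 5.0, 2.0                                 # pi(C) = lam * T

def sample_counts(t_max, a, b, reps=2000):
    out = np.empty(reps)
    for i in range(reps):
        N = rng.poisson(lam * T)                  # N ~ Poisson(pi(C))
        times = rng.uniform(0.0, T, N)            # points i.i.d. ~ pi/pi(C)
        marks = rng.standard_normal(N)
        out[i] = np.sum((times <= t_max) & (a < marks) & (marks <= b))
    return out

counts = sample_counts(1.0, 0.0, 1.0)
p = 0.5*(erf(1/sqrt(2)) - erf(0.0))               # P(0 < mark <= 1)
print(counts.mean(), lam*1.0*p)                   # empirical vs pi-mass of the window
print(counts.var(), lam*1.0*p)                    # Poisson: variance = mean
```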
That such a process exists can be deduced from Kolmogorov's classical theorem by verifying the consistency conditions (cf. Theorem I.1.1) abstractly on $(C,\mathcal{C},\pi)$. In such a case $\pi$ is called a rate or intensity measure of the Poisson random set function. We now present a likelihood ratio for equivalent measures $P_i$ determined by a pair of Poisson processes $X_i$ with independent increments and rate measures $\pi_i$, $i=1,2$. Let us start with a useful special case, in which the Gaussian component $Y = 0$ in the representation (6). This is essentially due to Brown [1] (see also the related results in
Gikhman-Skorokhod [1], Newman [2] and Brockett-Tucker [1]). The original argument is streamlined, utilizing the results of the preceding sections and others.

2. Proposition. Let Pᵢ be the Poisson measures determined by the processes Xᵢ of (6), having finite rate measures πᵢ on (C, 𝒞), i = 1, 2. If π₁ ≪ π₂ with f = dπ₁/dπ₂, then P₁ ≪ P₂ and one has the likelihood ratio given by:
$$\frac{dP_1}{dP_2}(X_2) = \exp[-\{\pi_1(C) - \pi_2(C)\}] \prod_{i=1}^{X_2(C)} f(t_i), \tag{8}$$
where (t₁, t₂, . . .) is the countable set of values taken by X₁(C, ω) for a.a. (ω).

Proof. Let h be the right side quantity of (8), which is a random variable, and it is to be shown that for any A ∈ B(R⁺) ⊗ B(R)
$$P_1(A) = \int_A h\, dP_2 = \int_{R^+ \times \Omega} \chi_A\, h\, dP_2.$$
If E_{Pᵢ} is the expectation operator under Pᵢ and A is taken as a generator (A = [X₂(C) = k]), then the above statement becomes (k = 1, . . . , n)
$$P_1([X_2(C) = k]) = E_{P_2}(\chi_{[X_2(C)=k]}\, h) = E_{P_2}\big(E_{P_2}[\chi_{[X_2(C)=k]}\, h \mid X_2(C)]\big), \tag{9}$$
where the operator E_{P₂}[·|·] is the conditional expectation. We evaluate the latter for a fixed X₂(C) = n and then simplify (9). Since one has P₂[X₂(C) = n] = e^{−π₂(C)} π₂(C)ⁿ/n!, the necessary conditional probability measure is given by (B ∈ 𝒞):
$$\begin{aligned}
P_2(X_2(B) = k \mid X_2(C) = n) &= \frac{P_2(X_2(B) = k,\ X_2(C - B) = n - k)}{P_2(X_2(C) = n)}\\
&= \frac{P_2(X_2(B) = k)\, P_2(X_2(B^c) = n - k)}{P_2(X_2(C) = n)},\\
&\qquad \text{since } X_2(B),\ X_2(B^c) \text{ are independent,}\\
&= e^{-\pi_2(B) - \pi_2(B^c) + \pi_2(C)}\, \frac{n!}{k!(n-k)!}\, \frac{\pi_2(B)^k \pi_2(B^c)^{n-k}}{\pi_2(C)^n}\\
&= \frac{n!}{k!(n-k)!}\, \frac{\pi_2(B)^k \pi_2(B^c)^{n-k}}{\pi_2(C)^n},\quad \text{since } \pi_2(B) + \pi_2(B^c) = \pi_2(C),\\
&= \binom{n}{k} \left(\frac{\pi_2(B)}{\pi_2(C)}\right)^{k} \left(\frac{\pi_2(B^c)}{\pi_2(C)}\right)^{n-k}.
\end{aligned} \tag{10}$$
Let h₁ denote the last product in (8), which is the only random quantity, and let
$$h_2 = \left(\frac{\pi_1(B)}{\pi_2(B)}\right)^{X_2(B)} \left(\frac{\pi_1(B^c)}{\pi_2(B^c)}\right)^{X_2(C)-k};$$
then h₁ = h₂ a.e., since each of the fractions corresponds to the ordinary likelihood ratio for finite Poisson densities when X₂(C) = n, and ∏_{i=1}^{X₂(C)} f(tᵢ) is exactly the above product. [In fact,
$$\prod_{i=1}^{X_2(C)} f(t_i) = \prod_{i=1}^{X_2(B)} f(t_i) \prod_{i=1}^{X_2(B^c)} f(t_i) = \left(\frac{\pi_1(B)}{\pi_2(B)}\right)^{X_2(B)} \left(\frac{\pi_1(B^c)}{\pi_2(B^c)}\right)^{X_2(B^c)},$$
since X₂(C) = X₂(B) + X₂(B^c); on X₂(C) = n the product is just the ratio of the (finite product) Poisson densities with parameters f(tᵢ), and on disjoint sets they take independent values.] Consider therefore the right side of (9) with h₂ for the factor h₁ in h. Using the probability for X₂(C) = n, with n arbitrary, one finds:
$$\begin{aligned}
E_{P_2}\big(E_{P_2}[\chi_{[X_2(B)=k]}\, h \mid X_2(C)]\big) &= E_{P_2}\big(E_{P_2}[\chi_{[X_2(B)=k]}\, e^{-(\pi_1-\pi_2)(C)}\, h_2 \mid X_2(C)]\big)\\
&= e^{-(\pi_1-\pi_2)(C)}\, E_{P_2}\Big[\Big(\frac{\pi_1(B)}{\pi_2(B)}\Big)^{k} \Big(\frac{\pi_1(B^c)}{\pi_2(B^c)}\Big)^{X_2(C)-k} P(X_2(B) = k \mid X_2(C))\Big]\\
&= e^{-(\pi_1-\pi_2)(C)} \sum_{n \ge k} \Big(\frac{\pi_1(B)}{\pi_2(B)}\Big)^{k} \Big(\frac{\pi_1(B^c)}{\pi_2(B^c)}\Big)^{n-k} \binom{n}{k}\, \frac{\pi_2(B)^k \pi_2(B^c)^{n-k}}{\pi_2(C)^n}\, e^{-\pi_2(C)}\, \frac{\pi_2(C)^n}{n!}\\
&= e^{-\pi_1(C) + \pi_1(B^c)}\, \frac{\pi_1(B)^k}{k!} = e^{-\pi_1(B)}\, \frac{\pi_1(B)^k}{k!} = P_1[X_2(B) = k],
\end{aligned}$$
which establishes (9) and hence (8).
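Proposition 2 is easy to check numerically. The sketch below (Python; the rate measures are illustrative choices of mine, not from the text) simulates the jump points of a Poisson process under P₂ on C = [0, 1], evaluates h from (8), and verifies both E_{P₂}(h) = 1 and E_{P₂}(h; X(C) = k) = P₁[X(C) = k].

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup on C = [0, 1]: pi2 = 2 * Lebesgue, and f = dpi1/dpi2 = 1 + t,
# so that pi1(C) = 2 * int_0^1 (1 + t) dt = 3.
pi2_C, pi1_C = 2.0, 3.0
f = lambda t: 1.0 + t

def h(points):
    """Likelihood ratio (8): exp(-(pi1(C) - pi2(C))) * prod_i f(t_i)."""
    return math.exp(-(pi1_C - pi2_C)) * np.prod(f(points))

n_sim, k = 100_000, 2
lr = np.empty(n_sim)
counts = np.empty(n_sim, dtype=int)
for i in range(n_sim):
    counts[i] = rng.poisson(pi2_C)                    # number of jumps under P2
    lr[i] = h(rng.uniform(0.0, 1.0, size=counts[i]))  # normalized pi2 is uniform on C

print(lr.mean())                                      # ~ 1.0, i.e. E_{P2}(h) = 1
print(lr[counts == k].sum() / n_sim,                  # E_{P2}(h; X(C) = k) ...
      math.exp(-pi1_C) * pi1_C ** k / math.factorial(k))  # ... equals P1[X(C) = k]
```

Unlike the Gaussian measures studied in the preceding sections, one can present simple examples showing that for general measures no dichotomy result can hold. For Poisson processes, under an additional condition, such a dichotomy can still be obtained. The next assertion is useful in this direction, and is also of interest in itself.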
3. Theorem. Let P₁, P₂ be Poisson measures determined by the processes Z¹, Z² of (6) with σ-finite rate measures π₁, π₂ on (C, 𝒞). Then P₁ ≪ P₂ iff the following three conditions hold:
(i) π₁ ≪ π₂ with f = dπ₁/dπ₂,
(ii) πᵢ([x : |f(x) − 1| > a]) < ∞, i = 1, 2, for all a > 0 (i.e., ∫_{[|f−1|>a]} |f − 1| dπ₂ < ∞ for all a > 0), and
(iii) ∫_{[x : |f(x)−1| ≤ a]} (f(x) − 1)² dπ₂(x) < ∞ for some a > 0.

Proof. Suppose the above conditions (i)-(iii) hold. By (iii) one can find 1 > a > 0 such that the integral there is finite. If Cₙ = {x : |f(x) − 1| > a/n}, and Bₙ = Cₙ₊₁ − Cₙ is the disjunctification, so that Bₙ = {x : a/n ≥ |f(x) − 1| > a/(n+1)}, let Dₙ = ∪_{k=1}^{n} B_k. Then by (ii) πᵢ(Dₙ) < ∞, i = 1, 2, for all n, and B_a = ∪_{k=1}^{∞} B_k = [x : 0 < |f(x) − 1| ≤ a]. Now consider the restriction σ-algebras 𝓑ₙ = 𝒞 ∩ Dₙ ⊂ 𝓑ₙ₊₁. If Pᵢⁿ = Pᵢ|𝓑ₙ, then the pair (Pᵢ, πᵢ), i = 1, 2, being finite measures on 𝓑ₙ, satisfies the conditions (i) and (ii) of Proposition 2, and hence P₁ⁿ ≪ P₂ⁿ holds for each n, with density:
$$g_n = \frac{dP_1^n}{dP_2^n} = \prod_{i=1}^{n} Y_i, \quad n \ge 1, \tag{11}$$
where Yᵢ = exp[−(π₁ − π₂)(Bᵢ)] ∏_{t∈Bᵢ} f(t), i = 1, 2, . . . , as in (8), on each Bᵢ. But by our earlier work {gₙ, 𝓑ₙ, n ≥ 1} forms a positive martingale, and hence gₙ → g a.e. [P₂]. We now assert, by (iii), that E_{P₂}(g) = 1, which then implies the desired conclusion. In fact, if βₙ(t) = E_{P₂}(e^{it log gₙ}), which is the ch.f. of the random variable log gₙ, then by (iii) f(x) − 1 = o(x), and so the Taylor expansion for the first two terms gives
$$e^{it \log(1 + (f-1))} = 1 + it(f - 1) - \frac{1}{2}(t^2 + it)(f - 1)^2 + o(f-1)^2.$$
Hence, substituting this for gₙ and letting n → ∞ so that f(x) → 1 (cf. the definition of Bₙ), one finds βₙ(t) → β(t) for each t, and β is continuous at t = 0. Consequently, by the classical Lévy continuity theorem the limit β(·) is also a ch.f. In particular β(0) = E_{P₂}(g) = 1. Since gₙ ≥ 0 and gₙ → g ≥ 0 a.e. with E_{P₂}(gₙ) = 1 = E_{P₂}(g), it now follows from Scheffé's lemma (cf., e.g., Rao [15], p.25) that {gₙ, n ≥ 1} is uniformly integrable and hence converges in L¹(P₂)-mean. Thus one gets, as n → ∞,
$$P_1(A) = \int_A g_n\, dP_2^n \to \int_A g\, dP_2, \quad A \in \mathcal{B}_m,\ m \ge 1,$$
and then for all A ∈ σ(∪_{n=1}^{∞} 𝓑ₙ). This allows us to conclude that P₁ ≪ P₂ on 𝓑_∞ = 𝒞 ∩ B_a, which is equivalent to the desired conclusion.
Conversely, if any one of the three conditions is violated, then one can construct a (cylinder) set in the space D(R⁺) = Ω of right continuous functions with left limits (the canonical form, or the path space, of the Poisson processes with their measures P₁, P₂) for which one of the measures vanishes but the other takes a positive value. Consequently P₁ is not P₂-continuous. This construction is not difficult (although not trivial). It is not essential for the following work and will be omitted. (The necessary detail may be found in Gikhman-Skorokhod [1], Theorem 7.3, or Brockett-Tucker [1], p.25.)

As a consequence of the above theorem, one has the following restricted dichotomy result for Poisson processes:

4. Corollary. Suppose that Pᵢ is a Poisson measure (determined by a process) with a σ-finite rate measure πᵢ on (C, 𝒞), i = 1, 2, and that π₁ ∼ π₂ (i.e., equivalence). Then either P₁ ∼ P₂ or P₁ ⊥ P₂; and P₁ ∼ P₂ iff for some a > 0
$$\int_{[|f-1|>a]} |f - 1|\, d\pi_2 + \int_{[|f-1|\le a]} |f - 1|^2\, d\pi_2 < \infty,$$
while P₁ ⊥ P₂ iff the sum of these integrals is divergent for all a > 0.

The following simple illustration explains the situation.

5. Example. Let C = N, the natural numbers, and let 𝒞 be its power set. Let P₁, P₂ be a pair of Poisson processes with σ-finite rate measures πᵢ defined as π₁({n}) = n + 1, π₂({n}) = n. Since the πᵢ vanish only on empty sets, they are (trivially) equivalent. Moreover f(n) = dπ₁/dπ₂(n) = 1 + 1/n. If B_a = [n : |f(n) − 1| > a], it is a finite set for any a, so for any a > 0, B_a^c = C − B_a is infinite and π₁(B_a) < ∞. However,
$$\int_{B_a^c} |f - 1|^2\, d\pi_2 = \sum_{n:\, 1/n \le a} \frac{1}{n^2}\, n = \sum_{n:\, 1/n \le a} \frac{1}{n} = \infty,$$
and hence P₁ ⊥ P₂.
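The divergence asserted in Example 5 is easy to see numerically; the following sketch (Python, with an illustrative threshold a of my choosing) accumulates the Corollary 4 sum over B_a^c and shows it growing without bound.

```python
import numpy as np

# Example 5: f(n) - 1 = 1/n and pi2({n}) = n, so each term of the Corollary 4
# sum over B_a^c = [n : 1/n <= a] is (1/n^2) * n = 1/n -- a harmonic tail.
a = 0.1
n = np.arange(1, 10_000_001)
terms = 1.0 / n[1.0 / n <= a]            # i.e. n >= 10 when a = 0.1
partial = np.cumsum(terms)
print(partial[99], partial[9_999], partial[-1])  # keeps growing like log n
```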
The likelihood ratio for these processes, when P₁ ≪ P₂ holds, can be given. We shall present a result on this problem for a stochastically continuous process with independent increments. Let us first rewrite (5) or (5′) for n = 1, to reflect the Poisson and Gaussian components and their possible means. Thus the ch.f. u → Φₜ(u) = E(e^{iuXₜ}) may be written as:
$$\Phi_t(u) = \exp\Big\{iu\gamma_t - \frac{u^2}{2}\sigma_t^2 + iu\mu_t + \int_{R-\{0\}} (e^{iux} - 1)\, dM_t(x)\Big\}, \tag{12}$$
where μₜ = ∫_{R−{0}} x/(1 + x²) dMₜ(x) ∈ R, and the Xₜ-process is uniquely determined by the (parameter) triple (γₜ, σₜ², Mₜ) with 0 ≤ σₜ² and Mₜ a measure. Exactly as in Theorem 1, the ch.f. given by (12) represents the sum of an independent Gaussian process with ch.f. exp{i(γₜ − μₜ)u − ½σₜ²u²} and a Poisson process having the (σ-finite) rate measure M(t, ·), both components with independent increments. Here γₜ − μₜ is the mean
and σₜ² is the variance function of the Gaussian component (since the covariance is given as r(s, t) = σ²(s ∧ t)). The idea of obtaining the equivalence for the probability measures Pᵢ corresponding to the processes Xⁱ, i = 1, 2, is to find conditions for the corresponding Gaussian components Yⁱ (using the work of Section 2) and the Poisson components Zⁱ, i = 1, 2, both of which are mutually independent. But if Gⁱ and νⁱ are the measures induced by Yⁱ and Zⁱ, then the joint (image) measure (or distribution) of Xⁱ is the ordinary cartesian product Gⁱ ⊗ νⁱ (due to independence). Then classical measure theory shows (cf., e.g., Rao [17], Proposition 6.2.5 on p.333) that G¹ ⊗ ν¹ ≪ G² ⊗ ν² iff G¹ ≪ G² and ν¹ ≪ ν² (although a similar result for singularity need not hold). Thus our conditions will be chosen to satisfy these, by combining Theorem 3 above with a specialization of Proposition 2.5 or Theorem 2.7. First a preliminary simplification is useful.

6. Lemma. Suppose for each T > 0 the σ-finite rate measures πᵢ of Zⁱ, i = 1, 2, satisfy (0 < t < T):
(i) π₁ ∼ π₂ with density f(t, x) = dπ₁/dπ₂(t, x), where for f,
$$\text{(ii)}\quad J_t(\pi_1, \pi_2) = \int_{[0,T] \times R - \{0\}} \big(1 - \sqrt{f(t, x)}\big)^2\, d\pi_2(t, x) < \infty. \tag{13}$$
Then one has the useful consequence:
$$\int_{R-\{0\}} \frac{|x|}{1 + x^2}\, |\pi_1 - \pi_2|(t, dx) < \infty, \quad 0 \le t \le T < \infty, \tag{14}$$
so that μ¹ₜ − μ²ₜ of (12) (the μⁱₜ correspond to the Poisson components Zⁱ, i = 1, 2) exists and is finite for all t.

Proof. Since πᵢ(t, ·), i = 1, 2, are σ-finite by assumption, if π(t, ·) = (π₁ + π₂)(t, ·) is a similar measure, and if fᵢ(t, x) = dπᵢ/dπ(t, x), i = 1, 2, it follows that f = f₁/f₂, and (13) implies for each t the following:
$$0 \le \int_{R-\{0\}} \big(\sqrt{f_1} - \sqrt{f_2}\big)^2(t, x)\, \pi(t, dx) = \int_{R-\{0\}} \big[(f_1 + f_2) - 2\sqrt{f_1 f_2}\big](t, x)\, \pi(t, dx) < \infty. \tag{15}$$
Next consider
$$\begin{aligned}
\Big[\int_{R-\{0\}} \frac{|x|}{1+x^2}\, |\pi_1 - \pi_2|(t, dx)\Big]^2 &= \Big[\int_{R-\{0\}} \frac{|x|}{1+x^2}\, |f_1 - f_2|(t, x)\, \pi(t, dx)\Big]^2\\
&= \Big[\int_{R-\{0\}} |x| \big[(\sqrt{f_1} - \sqrt{f_2})(\sqrt{f_1} + \sqrt{f_2})\big](t, x)\, \frac{\pi(t, dx)}{1+x^2}\Big]^2\\
&\le \int_{R-\{0\}} \frac{(\sqrt{f_1} - \sqrt{f_2})^2}{1+x^2}(t, x)\, \pi(t, dx) \times \int_{R-\{0\}} \frac{x^2 (\sqrt{f_1} + \sqrt{f_2})^2}{1+x^2}(t, x)\, \pi(t, dx),\\
&\qquad\qquad \text{(CBS-inequality)}\\
&\le J_t(\pi_1, \pi_2) \int_{R-\{0\}} \frac{2x^2 (f_1 + f_2)}{1+x^2}(t, x)\, \pi(t, dx) < \infty,
\end{aligned} \tag{16}$$
by hypothesis, since Jₜ(π₁, π₂) < ∞. Consequently, for the μⁱₜ of (12),
$$|\mu^1_t - \mu^2_t| = \Big|\int_{R-\{0\}} \frac{x}{1+x^2}\, (\pi_1(t, dx) - \pi_2(t, dx))\Big| \le \int_{R-\{0\}} \frac{|x|}{1+x^2}\, |\pi_1 - \pi_2|(t, dx) < \infty,$$
by (16), as desired.

If π₁ ∼ π₂ is assumed (with σ-finiteness), then by Theorem 3 (or Corollary 4) the measures corresponding to the Poisson components Z¹ and Z² are either equivalent or singular, and those relative to Y¹ and Y² always have the property of equivalence or singularity (the Gaussian dichotomy). Equivalence holds in the latter iff (since the σᵢ²(t), i = 1, 2, are increasing) σ₁²(t) = σ₂²(t) (= σ²(t), say), and γ₁ₜ − μ₁ₜ − (γ₂ₜ − μ₂ₜ) = δₜ must be an admissible mean of P_{σ(t)} by Theorem 1.4. Similarly, when π₁ ∼ π₂ is assumed (and also σ-finiteness), the corresponding Poisson measures satisfy P̃₁ ∼ P̃₂ iff (cf. Corollary 4)
$$\int_{[|f-1|>a]} |f - 1|\, d\pi_2 + \int_{[|f-1|\le a]} |f - 1|^2\, d\pi_2 < \infty, \tag{17}$$
where f = dπ₁/dπ₂ = f₁/f₂ of (16). Conversely, the measures are orthogonal if any one of the above conditions is violated, as already noted earlier. This involves some detailed computations, included in the above references. Thus the general result may be summarized as follows; the last part, on the form of the likelihood ratio, is due to Gikhman and Skorokhod ([1], Theorem 7.3):
7. Theorem. Let {Xₜⁱ, t ∈ [0, T], i = 1, 2} be stochastically continuous processes with independent increments, so that by Theorem 1, Xₜⁱ = Yₜⁱ + Zₜⁱ, t ≥ 0, i = 1, 2, where the Yⁱ and Zⁱ are mutually independent Gaussian and Poisson independent increment processes, uniquely determined by the three parameters (γᵢₜ, σᵢₜ², πᵢ), where γᵢₜ ∈ R and σᵢₜ² ≥ 0 are those of Yⁱ, and the πᵢ are the (σ-finite) rate measures of Zⁱ. If Pᵢ is the measure determined by the Xⁱ-process on the canonically represented space D([0, T]) ⊂ R^{[0,T]} of right continuous functions with left limits, and if the measures π₁ ∼ π₂, so that f(t, x) = dπ₁/dπ₂(t, x) is defined, then P₁ ∼ P₂ or P₁ ⊥ P₂ holds. Moreover, P₁ ∼ P₂ iff the following three conditions are true:
(i) σ₁²(t) = σ₂²(t) (= σ²(t), say), t ≥ 0,
(ii) ∫_{[0,T]×R−{0}} (1 − √f(t, x))² π(t, dx) < ∞, and
(iii) γ₁ₜ − μ₁ₜ − (γ₂ₜ − μ₂ₜ) = δₜ (say) is an admissible mean for the Gaussian independent increment component process with mean zero and variance function σ²(·); or equivalently (cf. Theorem 1.4) there is a unique function g ∈ L²([0, T], σ²(dt)) such that
(iii′) δₜ = ∫₀ᵗ g(u) dσ²(u).
[This is just the representation (13) of Theorem 1.4 after an integration by parts, using the fact that the covariance function concentrates on [0, s ∧ t].] Further, when these conditions are satisfied, the likelihood ratio is given by
$$\begin{aligned}
\frac{dP_1}{dP_2}(X^2(\cdot)) = \exp\Big\{&\int_0^T g(s)\, dY_s^2 - \frac{1}{2}\int_0^T g^2(s)\, \sigma^2(ds)\\
&+ \int_0^T \int_{R-\{0\}} \log f(t, x)\Big[Z^{2*}(dt, dx) - \frac{\pi_2(dt, dx)}{1 + \log^2 f(t, x)}\Big]\\
&+ \int_0^T \int_{R-\{0\}} \Big[\frac{\log f(t, x)}{1 + \log^2 f(t, x)} - f(t, x) + 1\Big]\, \pi_2(dt, dx)\Big\}
\end{aligned} \tag{18}$$
for a.a. [P₂] sample paths of Xₜ², where Z^{2*} = Z² − π₂, which has independent Poisson increments.

Proof. In view of the preceding discussion, it is only necessary to obtain formula (18) when conditions (i)-(iii) are satisfied. Here we add a sketch of the argument, following Gikhman and Skorokhod [1]. The idea of proof is similar to that of the special case treated in Proposition 2 above. So if h(·) denotes the right side of (18), then it should be verified that E_{P₂}(φ(∫₀ᵀ u(t) dXₜ²)h) = E_{P₁}(φ(∫₀ᵀ u(t) dXₜ¹)) for a sufficiently large class of bounded Borel functions φ : R → R, determining the σ-algebra generated by {Xₜ², t ≥ 0}, and continuous functions u(·). Here it is convenient to take φ to be a trigonometric polynomial, or, using linear approximations, simply let φ(∫₀ᵀ u(t) dXₜ²) = e^{i∫₀ᵀ u(t) dXₜ²} for the
same purpose. We then show that the Fourier transform of Xₜ¹ equals the corresponding result in the above equation with this φ. Thus (with (5′)), using the whole realization, together with the fact that the πᵢ will be σ-finite when the processes are stochastically continuous (proved in Theorem 4.1 later, but we can use this here), consider:
$$\begin{aligned}
E\big(e^{i\int_0^T u(t)\, dX_t^2}\, h\big) &= E\Big(\exp\Big[i\int_0^T u(t)\, dY_t^2 + i\int_0^T u(t)\, dZ_t^2\Big]\, h\Big), \quad \text{by (6),}\\
&= E\Big(\exp\Big[i\Big(\int_0^T u(t)\, dY_t^2\Big) + \int_0^T g(s)\, dY_s^2\Big]\Big) \times\\
&\quad E\Big(\exp\Big\{i\int_0^T\!\!\int_R u(t)x\Big(Z^{2*}(dt, dx) - \frac{\pi_2(dt, dx)}{1+x^2}\Big)\\
&\qquad\quad + \int_0^T\!\!\int_R \log f(t, x)\Big[Z^{2*}(dt, dx) - \frac{\pi_2(dt, dx)}{1 + \log^2 f(t, x)}\Big]\Big\}\Big) \times\\
&\quad \exp\Big\{-\frac{1}{2}\int_0^T g^2(s)\, d\sigma(s) + \int_0^T\!\!\int_R \Big[\frac{\log f(t, x)}{1 + \log^2 f(t, x)} - f(t, x) + 1\Big]\pi_2(dt, dx)\Big\},
\end{aligned} \tag{19}$$
obtained by substituting the expression for h and using the fact that Y² and (Z^{2*} − π₂) are independent, as well as the formula:
$$E\big[e^{vZ^{2*}(dt, dx)}\big] = \exp\big[(e^v - 1)\pi_2(dt, dx)\big].$$
Also the values of the process Z^{2*} are independent random variables. This may be used in the simplification of (19). If one notes the hypothesis that dπ₁ = exp[log f] dπ₂ and simplifies (19), after some routine calculations one finds that the result reduces to E(exp[i∫₀ᵀ u(t) dXₜ¹]), and since u(·) is an arbitrary continuous function, this implies the truth of (18) itself. The actual calculations are tedious but do not involve new ideas.

In applications, where these results are of interest in testing simple hypotheses vs. simple alternatives, it is necessary to obtain the rate measures πᵢ of the Poisson components, as well as the other parameters γᵢₜ, μᵢₜ, σᵢₜ, to verify the conditions of the above theorem and ascertain the equivalence of the measures before finding the likelihood ratios. It is to be noted that if the Xⁱ-processes have, moreover, stationary increments, starting at the origin, then one uses (4) and the fact that γₜ = tγ, σₜ² = tσ², and μₜ = tμ. With these values one has the corresponding formula in (18).
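The Gaussian factor in (18) can be sanity-checked by discretization. The sketch below (Python; σ²(t) = t and the function g are hypothetical choices of mine) verifies by Monte Carlo that exp{∫₀ᵀ g dYₛ² − ½∫₀ᵀ g² dσ²} has P₂-expectation one, as a likelihood ratio must.

```python
import numpy as np

rng = np.random.default_rng(4)

# Discretized check of the Gaussian factor of (18), taking sigma^2(t) = t
# (Brownian case) and an illustrative deterministic g in L^2([0, T], dt).
T, n_steps, n_paths = 1.0, 200, 20_000
dt = T / n_steps
t = np.linspace(0.0, T, n_steps, endpoint=False)
g = np.sin(2 * np.pi * t) + 0.5

dY = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))  # increments of Y^2 under P2
stoch_int = dY @ g                                          # ~ int_0^T g(s) dY_s^2
lr = np.exp(stoch_int - 0.5 * np.sum(g ** 2) * dt)          # the exponential factor
print(lr.mean(), lr.std() / np.sqrt(n_paths))               # mean ~ 1 within MC error
```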
Leaving these restatements (cf., e.g., Newman [2] and Skorokhod [2]), we turn to a brief discussion of a similar problem for Markov processes, since they are a natural generalization of the additive class considered above.

Recall that a process {Xₜ, t ∈ I ⊂ R} on (Ω, Σ, P) with values in a set S, called the state space, is Markovian if for t₁ < t₂ < · · · < tₙ, tᵢ ∈ I, the conditional probability satisfies the (system of) equations:
$$P[X_{t_n} \in A \mid X_{t_1}, \dots, X_{t_{n-1}}] = P[X_{t_n} \in A \mid X_{t_{n-1}}] \tag{20}$$
with probability one (w.p.1); so it states (roughly) that the conditional distribution of the present, given the complete past, depends only on the immediate past. It is not too hard to verify that this is also equivalent to the (seemingly stronger) condition:
$$P[X_t \in A \mid X_r,\ r \le s < t](x) = P[X_t \in A \mid X_s](x), \quad x \in S, \tag{20'}$$
w.p.1. Denote the right side by p(s, x; t, A) for r, s, t ∈ I and for all A ∈ 𝒮, a given σ-algebra of S. A simple example of a Markov process is the partial sum process of a sequence of any independent random variables on (Ω, Σ, P). A generalization of this example to the continuous parameter case is a process with independent increments. The existence of a Markov process is established again by means of (the Kolmogorov) Theorem I.1.1. A convenient reference to conditioning, as well as proofs of these statements, is Rao [18] (cf. particularly Sections 9.4-9.5). Thus the preceding results deal with a particular class of Markov processes. If the state space of the Xₜ-process is a fixed countable set, then the corresponding process is called a Markov chain, and (20) or (20′) becomes
$$P[X_t = k \mid X_r,\ r \le s](j) = P[X_t = k \mid X_s](j)\ \ (= p_{s,t}(j, k),\ \text{say}). \tag{21}$$
Note that both p_{s,t} and p(s, ·; t, ·), being conditional probabilities, are random variables, and equations (20), (21) are only supposed to hold w.p.1. The exceptional null sets depend on all the conditioned variables; if this set is empty or, what is essentially the same, there is a fixed null set outside of which these relations hold for all s < t, x ∈ S, and A ∈ 𝒮, then the mappings (s, x; t, A) → p(s, x; t, A) and (s, j; t, k) → p_{s,t}(j, k) = p(s, j; t, k) are called transition probability functions. If these functions depend only on t − s (so they are functions of only three variables), then the process (or chain) is termed Markovian with stationary transitions. There is also the Chapman-Kolmogorov equation, satisfied by any Markov process. It is fundamental to much of the analysis on this subject and is given here for transition probability functions: for any x, y ∈ S, A ∈ 𝒮, one has
$$p(s, x; t, A) = \int_S p(u, y; t, A)\, p(s, x; u, dy), \quad s < u < t. \tag{22}$$
Intuitively, this means that the probability of a (Markovian) particle starting from state x at time s and going to a state in A at a later time t is the probability of starting from x at time s and visiting some intermediate state y at a later time u before landing in the set A at time t. This equation connects the probabilistic and analytic aspects of Markov processes, as a result of which the subject has grown enormously and occupies a large part of stochastic theory. We shall consider only a small portion here. [An extensive treatment of inference problems (both estimation and testing) is given by Billingsley [1] for Markov processes with a finite state space and stationary transition probabilities.]

We first consider the important special case that the Markov process has stationary transitions, so that p(s, x; t, A) is denoted p(t − s; x, A), and moreover the state space S is assumed finite, i.e., the process is a (finite) chain. Then (22) becomes, on setting u = t − s > 0 (and, being a conditional probability, Σⱼ p(u; i, j) = 1),
$$0 \le p(u + v; i, k) = \sum_j p(u; i, j)\, p(v; j, k), \quad u, v > 0, \tag{23}$$
and writing P(u) = [p(u; i, j), i, j ∈ S] as a (square) matrix, having non-negative entries with row sums equal to one, (23) may be stated compactly using matrix multiplication as:
$$P(u + v) = P(u)P(v), \quad u, v > 0, \tag{24}$$
where P(0) = I, the identity, by definition. A number of important properties result from this "semi-group" property. It can be shown that lim_{u→∞} p(u; i, j) exists for all i, j, and if P(·) is assumed (component-wise) continuous at t = 0, so that
$$\lim_{u \to 0} p(u; i, i) = 1, \tag{25}$$
then by (23) lim_{u→0} p(u; i, j) = 0 for i ≠ j. Also (23) implies that p(·; i, j) is continuous everywhere, and under (25) p(u; i, i) > 0 for all u > 0 unless it is zero for all u. These and several consequences of the Markov property lead to many useful analytical properties of the transition functions, and they in turn are reflected in the corresponding versions of the sample paths of the process {Xₜ, t ≥ 0}. For instance, if p(·; i, j) is continuous at zero, then it is actually differentiable on R⁺. In fact, letting (i ≠ j)
$$0 \le -p'(0; i, i) = \lim_{u \to 0} \frac{1 - p(u; i, i)}{u} = q_i \le \infty,$$
$$\lim_{u \to 0} \frac{p(u; i, j)}{u} = q_{ij} = p'(0; i, j), \tag{26}$$
then, letting Q = [q_{ij}] where q_{ii} = −q_i, one has the following system of differential equations (in matrix form, with qᵢ < ∞ for all i, while q_{ij} is always finite):
$$P'(t) = QP(t), \quad P(0) = I. \tag{27}$$
[It should also be noted that qᵢ < ∞ holds for all finite chains we are discussing; when S is (denumerably) infinite, then supᵢ qᵢ < ∞ iff the limit in (25) holds uniformly in i; but we shall not go into details since these are not essential here.] The unique (standard) solution of this system is given by (noting that (27) is equivalent to d(e^{−tQ}P(t))/dt = 0)
$$P(t) = e^{tQ}, \quad t \ge 0, \tag{28}$$
where e^{tQ} = Σ_{n=0}^{∞} tⁿQⁿ/n! by definition, with the matrix sum, and the right side converges if, for instance, ‖Q‖ ≤ C₀ < ∞ where ‖Q‖ = sup_{‖x‖=1} |(Qx, x)|. [For proofs of these statements, cf., e.g., Doob [2], Chapter VI.] This Q-matrix (or operator) plays the same kind of parametric role in Markov chain analysis that the triple (γₜ, σₜ², π) did in the independent increment case. Moreover, it also has a probabilistic interpretation, which will now be given since it is useful in finding the corresponding likelihood ratios.
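The role of Q is easy to illustrate numerically; the sketch below (Python with scipy, for a hypothetical 3-state generator of my choosing) computes P(t) = e^{tQ} of (28) and checks the semigroup relation (24) together with the derivative interpretation (26).

```python
import numpy as np
from scipy.linalg import expm

# A hypothetical 3-state Q-matrix: off-diagonal q_ij >= 0 and q_ii = -q_i,
# so each row sums to zero, as around (26)-(27).
Q = np.array([[-2.0, 1.5, 0.5],
              [ 0.3, -0.8, 0.5],
              [ 1.0, 2.0, -3.0]])

def P(t):
    """Transition matrix P(t) = e^{tQ}, the standard solution (28) of P' = QP."""
    return expm(t * Q)

u, v = 0.4, 0.7
# Semigroup property (24): P(u + v) = P(u) P(v).
print(np.allclose(P(u + v), P(u) @ P(v)))                    # True
# Each P(t) is stochastic: nonnegative entries, row sums one.
print(P(u).min() >= 0, np.allclose(P(u).sum(axis=1), 1.0))   # True True
# Small-h difference quotient (P(h) - I)/h recovers Q, i.e. (26).
h = 1e-6
print((P(h) - np.eye(3)) / h)                                # ~ Q
```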
Let {Xₜ, t ≥ 0} be a Markov chain with a finite state space, denoted {1, 2, . . . , N}, and, as in the above, let the process be assumed right continuous, with stationary transitions that are continuous at t = 0 (i.e., (25) holds). It then follows from the general theory that a.a. sample paths are step functions. Moreover, for each α > 0 one has
$$P[X_s = i,\ t_0 \le s \le t_0 + \alpha \mid X_{t_0} = i] = e^{-q_i \alpha}. \tag{29}$$
This is obtained by noting that, if the left side probability is ϕ(α), then the Markov property implies ϕ(α + β) = ϕ(α)ϕ(β), α, β > 0. The solution of this functional equation is ϕ(α) = e^{−qᵢα}, where qᵢ > 0 is the same as that in (26) [qᵢ = 0 corresponds to ϕ(α) = 1]. Using this, whose proof may be found in Doob ([2], p.244), we show that a chain is representable as a pair of sequences of random variables, which also gives a probabilistic meaning to the qᵢ. [Picturesque names are also given to these states, to remember them easily: if qᵢ = 0 the state i is termed absorbing, if qᵢ = ∞ it is called instantaneous, and if 0 < qᵢ < ∞ it is referred to as a stable state.] Let Z₁ be the constant value of the process Xₜ until the first jump at time T₁, and by induction let Zₙ be the value of the process after time
Tₙ₋₁ until the next jump at time Tₙ, n ≥ 1 (Tₙ₋₁ ≤ Tₙ). Thus if ν(t) = max{n : Tₙ < t}, then one has Xₜ = Z_{ν(t)}. The general theory of such discontinuous Markov processes shows that the conditional measures are given by:
$$P[T_{n+1} - T_n > a \mid T_1 = t_1, \dots, T_n = t_n,\ Z_1 = z_1, \dots, Z_{n+1} = z_{n+1}] = e^{-q_{z_{n+1}} a} \tag{30}$$
and
$$P[Z_{n+1} = i \mid T_1 = t_1, \dots, T_n = t_n,\ Z_1 = z_1, \dots, Z_n = z_n] = \frac{q_{z_n i}}{q_{z_n}}. \tag{31}$$
If Fₜ = σ(Xₛ, s ≤ t), then (Zₙ, Tₙ) : Ω → S × (0, ∞), and Fₜ = σ(Zₙ, Tₙ : 1 ≤ n ≤ ν(t)) also. Thus Xₛ, 0 ≤ s ≤ t, and (Zₙ, Tₙ, n = 1, . . . , ν(t)) represent the same process; the latter are the 'observable coordinates', and they allow a calculation of likelihood ratios in a manner similar to Theorem 3 (or 7). At this point it is desirable to recall the canonical space representing the process, i.e., (Ω, Σ, P). The measures are defined on the (infinite dimensional) space of the process (S^{R⁺} with its usual cylindrical σ-algebra, but more specifically for the above given representation) as follows. Let S̃ₙ = (S × R⁺)ⁿ × S and take Ω = ∏_{n≥1} S̃ₙ, with ∏ for the cartesian product. One takes B(S̃ₙ) = (𝒫(S) ⊗ B(R⁺))ⁿ ⊗ 𝒫(S), where the first factor is the nth power of the product σ-algebra shown in parentheses, 𝒫(S) is the σ-algebra of S, taken as the power set here, and B(R⁺) is the Borel σ-algebra of R⁺. From this one defines Σ as the cylinder σ-algebra, as in the standard Kolmogorov setup. It may be noted that this way of defining Ω, and hence its Σ, is needed since the S̃ₙ are nested but are not subsets of a common space, so that Ω = ∪_{n=1}^{∞} S̃ₙ is not well-defined. [However each S̃ₙ is the range of a (finite dimensional) projection Πₙ of Ω onto S̃ₙ, and Σ = σ(∪_{n≥1} Πₙ⁻¹(B(S̃ₙ))).] With this setting, we now define Pᵢ, i = 1, 2, and then obtain the likelihood ratio.

Let Pᵢ, i = 1, 2, be the measures induced by the given process {Xₜ, t ≥ 0} on (Ω, Σ) under the hypothesis and its alternative. Thus if Pᵢ[X₀ = i₀] = πᵢ(i₀), i₀ ∈ S, are the initial distributions, then the Markovian character with stationary transitions is expressed (for any 0 = t₀ < t₁ < · · · < tₙ) by
$$P_i[X_{t_0} = i_0,\ X_{t_1} = i_1, \dots, X_{t_n} = i_n] = \pi_i(i_0)\, p^i_{i_0 i_1}(t_1) \cdots p^i_{i_{n-1} i_n}(t_n - t_{n-1}), \tag{32}$$
for i = 1, 2. These are extended from cylinders to the general case on Σ. [Here one uses an extension of Kolmogorov's theorem due to Ionescu Tulcea, as found in, e.g., Doob [2], p.613, or Neveu [1], p.168, or Rao [21], p.226.] In our case p^i_{jk}(t) = e^{−q^i_{jk} t}, i = 1, 2. The following result, due essentially to Albert [1], can now be presented; it will be generalized later.

8. Theorem. Let {Xₜ, t ≥ 0} be a right continuous finite state Markov chain with stationary transitions, under the hypothesis P₁ and alternative P₂, with initial distributions π₁, π₂ respectively. Suppose that the corresponding q-functions q¹, q² are defined by (26). If π₁ ≪ π₂ with g = dπ₁/dπ₂, and q²_{jk} > 0 implies q¹_{jk} > 0, then P₁ ≪ P₂, and the likelihood ratio is given, when Xₜ, or equivalently (Z₀ = i₀, Z₁ = i₁, · · · , Zₙ = iₙ; t₁, · · · , tₙ) for ν(t) = n, is available on Fₜ from [0, t), as:
$$\frac{dP_1}{dP_2}(X_t) = g(i_0)\, \exp[(q^2_{i_{\nu(t)}} - q^1_{i_{\nu(t)}})t] \prod_{j=0}^{\nu(t)-1} \frac{q^1_{i_j i_{j+1}}}{q^2_{i_j i_{j+1}}}\, \exp\{(t_{j+1} - t_j)[(q^1_{i_{\nu(t)}} - q^2_{i_{\nu(t)}}) - (q^1_{i_j} - q^2_{i_j})]\}, \tag{33}$$
for a.a. [P₂] and for 0 = t₀ < t₁ < · · · < tₙ < t < ∞.

Proof. One may verify that (33) is the desired ratio by the procedure adapted for the proof of Proposition 2. An alternative and somewhat simpler argument, borrowed from Grenander ([2], p.308), will be given here. Since for each t, ν(t) is a (finite) integer, and Xₜ can be represented as a finite collection (Z₁, T₁, · · · , Z_{ν(t)}, T_{ν(t)}), also called an embedded process, one may use (32) in obtaining (33) by evaluating the conditional probabilities of the chain. We use the embedded process for this purpose. Writing π(i, j) = q_{ij}/q_i, the transition probability of the embedded process is given by
$$P[Z_{n+1} = j,\ T_{n+1} - T_n = \alpha_j \mid Z_n = i,\ T_n - T_{n-1} = \alpha_k] = \pi(i, j)\, q_i e^{-q_i \alpha_j} = q_{ij}\, e^{-q_i \alpha_j}. \tag{34}$$
Now let ν(t) = n, 0 = t₀ < t₁ < · · · < tₙ < t < ∞, and let the chain be in state iₖ at time Tₖ. Then the probability measure is obtained as
$$\begin{aligned}
P[Z_0 &= i_0, \dots, Z_{n-1} = i_{n-1},\ T_{n-1} \le \alpha_{n-1},\ Z_n = i_n]\\
&\quad [\,= P[Z_0 = i_0]\, e^{-t q_{i_0}},\ \text{if there is no jump in } [0, t)\,]\\
&= P[Z_0 = i_0]\, q_{i_0 i_1} e^{-q_{i_0}(t_1 - t_0)} \cdots q_{i_{n-1} i_n}\, e^{-q_{i_{n-1}}(t_n - t_{n-1})}\, e^{-(t - t_n) q_{i_n}},\\
&\qquad \text{this being the probability of no jump in } (0, t_1),\ \text{none in } (t_1, t_2),\ \dots,\ \text{and up to } (t_n, t),\\
&\qquad \text{where we used formula (34),}\\
&= P[Z_0 = i_0]\, e^{-t q_{i_n}} \prod_{k=0}^{n-1} q_{i_k i_{k+1}}\, e^{-(q_{i_k} - q_{i_n})(t_{k+1} - t_k)},
\end{aligned} \tag{35}$$
after cancellation and regrouping of the terms.
We use this expression for P₁ and P₂ under the given hypotheses. Now suppose that the q-functions and the initial measures are denoted by π_k(i₀) = P_k[X₀ = i₀] and qᵢᵏ, q_{ij}ᵏ, k = 1, 2. Then by hypothesis π₁ ≪ π₂ with g = dπ₁/dπ₂ as density, and q²_{ij} > 0 ⇒ q¹_{ij} > 0. Hence the corresponding measures Pₙᵏ on Fₜ (with ν(t) = n) satisfy, by (35), Pₙ¹ ≪ Pₙ². The case considered, for ν(t) = n, consists of only a finite number of states, and hence dPₙ¹/dPₙ²(x) = fₙ(x) is simply the standard likelihood ratio obtained from (35) (as in the finite dimensional case), with 0 = t₀ < t₁ < · · · < tₙ < t < ∞, n > 0:
$$f_n(x) = g(x)\, \exp[(q^2_{i_n} - q^1_{i_n})t] \prod_{j=0}^{n-1} \frac{q^1_{i_j i_{j+1}}}{q^2_{i_j i_{j+1}}}\, \exp[(t_{j+1} - t_j)((q^1_{i_n} - q^2_{i_n}) - (q^1_{i_j} - q^2_{i_j}))]. \tag{36}$$
But {fₙ, Fₙ, n ≥ 1} is a simple nonnegative martingale. Consequently, it converges a.e. as n → ∞, by Grenander's Theorem IV.1.1. [Here we are using, as we may, the observable coordinates {Zₙ, Tₙ, n ≥ 1} instead of the original process consisting of an uncountable set of random variables Xₜ.] If n = 0, the result is true by (35) (the first line). Replacing n by ν(t) in (36), the corresponding result, denoted fₜ, is the expression given in (33), which is Fₜ-adapted. Even here, the general limit denotes dP^{1c}/dP², the density of the absolutely continuous part of P¹ relative to P². Further conditions are needed, if S is not finite, to conclude that P¹ ≪ P², such as uniform integrability of fₜ relative to P². It may be useful to observe that, if S is not finite, although q_{ij} < ∞ for i ≠ j, one has to assume additionally that qᵢ < ∞ for all i.
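For a finite chain the likelihood ratio can be evaluated directly from the observable coordinates, multiplying the sojourn and jump factors of (35) under each hypothesis; regrouping the exponents as above then yields (33)/(36). A minimal sketch follows (Python; the generators, initial laws, and the observed path are all hypothetical choices of mine):

```python
import numpy as np

def path_log_lr(states, times, t, Q1, Q2, pi1, pi2):
    """log dP1/dP2 for a finite-state chain path observed on [0, t]:
    states i_0,...,i_n, jump times 0 < t_1 < ... < t_n <= t, built from
    the sojourn-time/jump description as in (35)."""
    def log_density(Q, pi):
        q = -np.diag(Q)                      # exit rates q_i
        ll = np.log(pi[states[0]])           # initial distribution factor
        s = 0.0
        for k in range(len(times)):
            i, j = states[k], states[k + 1]
            ll += -q[i] * (times[k] - s) + np.log(Q[i, j])  # hold in i, jump i -> j
            s = times[k]
        ll += -q[states[-1]] * (t - s)       # no jump after the last one
        return ll
    return log_density(Q1, pi1) - log_density(Q2, pi2)

# Hypothetical generators and initial laws on states {0, 1, 2}.
Q1 = np.array([[-1.0, 0.6, 0.4], [0.2, -0.5, 0.3], [0.7, 0.3, -1.0]])
Q2 = np.array([[-1.5, 1.0, 0.5], [0.4, -0.9, 0.5], [0.6, 0.9, -1.5]])
pi1, pi2 = np.array([0.5, 0.3, 0.2]), np.array([1 / 3] * 3)

# An observed path: starts in 0, jumps to 2 at 0.8 and to 1 at 1.9, watched on [0, 3].
print(path_log_lr([0, 2, 1], [0.8, 1.9], 3.0, Q1, Q2, pi1, pi2))
```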
A generalization of the above is to consider Markov processes of purely discontinuous or jump type, having only jumps (of random sizes). The process {Xₜ, Fₜ, t ≥ 0} is again assumed to have stationary transition probabilities satisfying the following conditions:
(i) P[X_{t+s} ∈ A | Xₛ](ξ) = p(t, ξ; A); lim_{t→0} p(t, ξ; {ξ}) = 1, ∀ξ ∈ S,
(ii) lim_{t→0+} (1 − p(t, ξ; {ξ}))/t = q(ξ) ≥ c > 0, ∀ξ ∈ S,
(iii) lim_{t→0+} p(t, ξ; A)/t = q(ξ, A), ∀A ∈ B(S − {ξ}),
where S denotes the state space (⊂ R) and B(S) is its Borel σ-algebra. The Xₜ-process, taken to be right continuous, can be represented by a countable set of observable coordinates (again termed an embedded process) as follows. Let T₁, T₂, . . . be the (random) instants at which the process takes jumps of sizes Z₁, Z₂, . . . respectively, so that
0 ≤ t < T1 ,
Xt = Z2 , T1 ≤ t < T2 , . . . , Xt = Zn , Tn−1 ≤ t < Tn , n ≥ 1. The jump times are finite when q(ξ) ≥ c > 0. If ν(t) = max{k : Tk < t}, the number of jumps up to time t (ν(t) = 0 if T1 ≥ t), then the above description implies that Xt = Zν(t) . The general theory of such jump Markov processes shows (cf., Doob [2], Section VI.2) that the sequence {Zn , n ≥ 1} is a Markov process with transition probabilities given by P [Tn+1 − Tn > r|T1 , . . . , Tn , Z1 , . . . , Zn+1 ] = e−q(Zn+1 )r
(37)
q(Zn , A) q(Zn ) = π(Zn , A) (say).
(38)
and P [Zn+1 ∈ A|T1 , . . . , Tn , Z1 , . . . , Zn ] =
Moreover {Zn , Tn+1 − Tn , n ≥ 1} is a Markov process with transitions given by P [Zn+1 ∈ A, Tn+1 − Tn ∈ B|Zn , Tn − Tn−1 ](x, a) π(x, dy) q(y)e−q(y)r dr, A ∈ B(S), B ∈ B(R). = A
(39)
B
These q-functions are also termed transition intensities. With this general setup, if we have a pair of processes (or equivalently a process governed by two measures) then it is desired to find the likelihood ratio of these measures. This question was primarily treated by Billingsley [1]. We can obtain it from a finite set of observations on (Zn , Tn ) as a consequence of our general Theorem IV.1.1 in lieu of an independent argument given by him for this particular case. Suppose that P1 , P2 are the probability measures on (Ω, Σ) governing the given jump Markov process under the hypothesis H1 and the alternative H2 . If πi are the initial distributions and qi (·), qi (ξ, ·), are
297
5.3 Independent increment and jump Markov processes
the corresponding q-functions satisfying (37)-(39), under the hypotheses Hi , i = 1, 2, suppose that π1 π2 and q1 (ξ, ·) q2 (ξ, ·) on the σdπ1 algebra S of the state space S with the (RN)-derivatives δ(ξ) = dπ (ξ) 2
1 (ξ,·) and δ(ξ, η) = dq dq2 (ξ,·) (η). Then we can present the following theorem from Billingsley [1].
9. Theorem. Suppose we have a right continuous jump Markov pro1 cess {Xt , Ft , t ≥ 0} on (Ω, Σ, P P2 ) with stationary transitions πi (·, ·), and initial distributions πi ,i=1,2, all of which are equivalent on Ft , t ≥ 0. Then Pi |Ft ,i=1,2 are equivalent and the likelihood ratio is given by, ν(t) dP1 (Z1 , · · · , Zν(t) ) = δ(Z1 ) exp{ (Tk+1 − Tk ) (q2 (Zk ) − q1 (Zk )) × dP2 i=1
3 (t − Tν(t) ) q2 (Zν(t)+1 ) − q1 (Zν(t)+1 ) } δ(Zk , Zk+1 ),
ν(t)
(40)
k=1
where Tν(t) = 0 if T1 ≥ 1. The same expression (40) holds if all the Radon-Nikod´ym derivatives are understood as the continuous parts dP c dπ c of their Lebesgue decompositions, i.e., dP12 , δ(·) = dπ12 and δ(ξ, ·) = dq1c (ξ,·) dq2 (ξ,·) .
Sketch of Proof. The argument is very similar to that of the preceding result. Since the Zn are not necessarily discrete, the σ-algebras are more general. As before, {Z1 , Z2 , . . . , T1 , T2 , . . . } form observable coordinates of the process {Xt , Ft , t ≥ 0} and the Ft s are replaced by the Fn of the observable coordinates in the same manner as in that result. If π1 π2 and q1 (ξ, ·) q2 (ξ, ·), one can replace the ratios
qz1
j ,zj+1
qz2
in
j ,zj+1
(33) by the appropriate RN-derivatives δ(zj , zj+1 ) and similarly with dπ1 δ(·) for dπ since the Zn s are not necessarily discrete. With this change 2 the proof is nearly identical with the preceding one. In fact the RNdensities given by (33) and (40) are quite similar. Without the absolute continuity assumption, the general case follows by taking the δ(ξ, ·)s as the densities of the absolutely continuous parts of the measures with the Lebesgue decomposition, the result follows, and the argument can now be left to the reader. As noted above, an independent proof of the result is also found in Billingsley [1]. Several types of Markov processes can be treated. [An example of the so-called birth-and-death process, having a countable state space, is sketched in the complements and exercises section.] In case these processes have continuous state spaces, sufficient conditions can be given, to obtain similar results. A class of Markov processes for which such
conditions are natural is that of the diffusion processes. These will be discussed later. Next we turn to the infinitely divisible class, which already came up at the beginning of this section and which generalizes the processes of independent increments. These need not be Markovian, and one has to use some new ideas. So this topic is discussed in the following section.

5.4 Infinitely divisible processes

It was noted at the beginning of the last section that decomposable (or additive, or independent increment) processes without fixed discontinuities are infinitely divisible. Here we consider processes for which the latter conclusion holds (without their necessarily being decomposable). Thus {Xₜ, t ∈ T} is infinitely divisible (i.d.) if every finite dimensional distribution of the given (real) process has the same property. Equivalently, using the Lévy-Khintchine representation for the characteristic functions (ch.f.s) of its finite dimensional distributions [wherein one separates the Gaussian component, if present, for convenience of analysis], the following holds: if λ = (t₁, . . . , tₙ), tᵢ ∈ T, then the ch.f. of X_λ = (X_{t₁}, . . . , X_{tₙ}) is given, for any u = (u₁, . . . , uₙ) ∈ Rⁿ, by
$$\Phi(u_1, \dots, u_n) = \Phi_\lambda(u) = E(e^{i(u, X_\lambda)}) = \exp\Big\{i(a^\lambda, u) - \frac{1}{2}(R^\lambda u, u) + \int_{R^n - \{0\}} [e^{i(u, x)} - 1 - i(u, b(x))]\, Q^\lambda(dx)\Big\}, \tag{1}$$
where a^λ = (a(t₁), . . . , a(tₙ)) ∈ Rⁿ, (u, x) is the inner product in Rⁿ, and b(x) = (b₁(x), . . . , bₙ(x)) with bᵢ(x) = xᵢ if |xᵢ| ≤ 1, = 1 if xᵢ > 1, and = −1 for xᵢ < −1. Here R^λ = (R(tᵢ, tⱼ), 1 ≤ i, j ≤ n) denotes a positive definite matrix, which is the covariance of the Gaussian component, and a^λ its mean. The function Q^λ : B(R₀^λ) → R̄⁺ is the Lévy measure of the system; it is localizable (and not necessarily finite), but satisfies ∫_{R₀^λ} |x|²/(1 + |x|²) dQ^λ(x) < ∞, where R₀^λ = R^λ − {0}, the Euclidean space with the origin deleted. Here we need to present a certain nontrivial preparation before the likelihood ratios can be intelligently discussed. As already noted in the preceding section, an i.d. process is uniquely determined by the system of triples {a^λ, R^λ, Q^λ, λ ∈ Λ}, where Λ is the directed set of all finite collections of elements of T, directed by the inclusion ordering, i.e., λ < λ′ iff λ ⊂ λ′. Now to formulate our problem precisely, it is desirable to show that the measure P, canonically represented by the process (by Kolmogorov's basic theorem I.1.1), is uniquely determined by a triple (a, R, Q) derived from the above system of triples. This is not obvious but true. To see that it can be done,
first note that in (1) the parameters {a^λ, R^λ, λ ∈ Λ}, defining a consistent family of distributions (or ch.f.s), determine a Gaussian process with mean function a and covariance function R that is independent of the second family, which is given by the factor with the integrals in (1). The latter is also an i.d. ch.f. for each λ. The i.d. property follows from the fact that on taking a^λ = 0 and R^λ = 0 in (1) the Φ is still i.d. [In general, a product of ch.f.s can be i.d. without some or all the factors being i.d., and it is the particular form of the ch.f. given by (1) that allows this useful conclusion.] Moreover, as in Theorem 3.1, one has the representation Xₜ = Yₜ + Zₜ, t ∈ T, where {Yₜ, t ∈ T} is a Gaussian process (hence i.d.) and {Zₜ, t ∈ T} is an independent (of the Yₜ) process which is i.d. The basic Theorem I.1.1 implies that there is such a process on the same probability space (which may be taken rich enough to support both processes, as otherwise it can be enlarged by adjunction in a standard way, cf., e.g., Rao [21], pp.98-99). Thus E(Yₜ) = aₜ and Cov(Yₛ, Yₜ) = R(s, t) hold, so that a^λ, R^λ are obtained from the mean and covariance functions a(·), R(·, ·). The second factor is likewise obtained from a measure Q determined by the family {Q^λ, λ ∈ Λ}, but this needs a substantially more involved argument, which will now be supplied.

Consider the family of Lévy measures {Q^λ, B(R₀^λ), λ ∈ Λ}, where for λ < λ′, p_{λ,λ′} : R^{λ′} → R^λ denotes the coordinate projection. Since the vector X_{λ′} contains X_λ as a sub collection, it follows that Φ_λ(u) = Φ_{λ′}(u, 0) by the uniqueness of the representation (1). We assert that this implies Q^λ(A) = Q^{λ′}(p⁻¹_{λ,λ′}(A)), ∀A ∈ B(R₀^λ), since p⁻¹_{λ,λ′}(A) ∈ B(R₀^{λ′}). For, the right side integral (= I, say) of (1), with λ = (t₁, · · · , tₙ), λ′ = (λ, tₙ₊₁) and uₙ₊₁ = 0, becomes
$$I = \int_{R_0^{\lambda'}} \Big(e^{i(u,\, p_{\lambda,\lambda'}(x))} - 1 - i\sum_{j=1}^{n} b_j(x)u_j\Big)\, dQ^{\lambda'}(x),$$
and since the integrand vanishes on the set {x : xᵢ = 0, ∀i ≤ n} (as uₙ₊₁ = 0), the integral may be restricted to the complement of that set, so that
$$I = \int_{p^{-1}_{\lambda,\lambda'}(R_0^\lambda)} (\cdots)\, dQ^{\lambda'}(x) = \int_{R_0^\lambda} (\cdots)\, \big(Q^{\lambda'} \circ p^{-1}_{\lambda,\lambda'}\big)(dx) = \int_{R_0^\lambda} (\cdots)\, Q^\lambda(dx),$$
with the same integrand throughout. This implies, by the uniqueness of the representation, that Q^λ = Q^{λ′} ∘ p⁻¹_{λ,λ′}, as asserted, so that {Q^λ, B(R₀^λ), p_{λ,λ′}, (λ, λ′) ∈ Λ × Λ} is a compatible family (or a projective system) of σ-finite (and not necessarily finite) measures. In standard treatments these are compatible systems
of probability measures, and Q^λ would then be the marginal probability of Q^{λ′}; but now a more careful argument, using the definition of the system employed here, is needed. With this we can present the following important and beautiful result, due to Maruyama [1]. It is required in the ensuing analysis. We include a somewhat different and detailed proof which explains the key issues involved. For simplicity let 𝓑_λ = B(R₀^λ) and let 𝓑_T be the cylinder σ-algebra, i.e., the smallest σ-algebra relative to which all the coordinate projections p_λ : R₀^T → R^λ, λ ∈ Λ, are measurable; 𝓑_λ is the Borel σ-algebra of the Euclidean space R₀^λ, and since R is σ-compact, each measure Q^λ is σ-finite.

1. Theorem. Let {Xₜ, t ∈ T ⊂ R} be an infinitely divisible process with its Lévy system of (compatible) measures {Q^λ, 𝓑_λ, p_{λ,λ′}, λ < λ′}. Then there exists a measure Q : 𝓑_T → R̄⁺ such that Q ∘ p_λ⁻¹ = Q^λ, λ ∈ Λ. Moreover, Q will be σ-finite if the Xₜ-process is stochastically continuous.
Proof. It was just seen that Q^λ = Q^{λ′} ∘ p⁻¹_{λ,λ′}, so that the family is consistent, as in Kolmogorov's theorem. Let 𝓑_λ* = p_λ⁻¹(𝓑_λ), a σ-algebra of R₀^T, and set 𝓑₀* = ∪_{λ∈Λ} 𝓑_λ*. Then 𝓑₀* is an algebra, and the function Q₀ is defined by the equation Q₀(A*) = Q^λ(A_λ), where A* ∈ 𝓑₀* implies A* = p_λ⁻¹(A_λ) for some A_λ ∈ 𝓑_λ. This fixes Q₀ unambiguously, and it is additive, since Λ is directed. The proof is the same as that found in Bochner ([1], p.119), with a more detailed version in Rao ([21], p.18, Proposition 1). Now it suffices to show that Q₀ is σ-additive on 𝓑₀*, so that Q₀ has a measure extension to 𝓑_T = σ(𝓑₀*) by the classical Carathéodory theorem (cf., e.g., Rao [17], p.41, Theorem 10 (iii)).

Now we establish the σ-additivity of Q₀ on 𝓑₀*. Note that each Q^λ on 𝓑_λ is (inner) regular in the sense that for each A ∈ 𝓑_λ with Q^λ(A) < ∞ and each ε > 0, there is a compact set C ⊂ A such that Q^λ(A − C) < ε, which is a consequence of the definition of Q^λ in (1). Hence {R₀^λ, 𝓑_λ, Q^λ} is a regular σ-finite measure space and {Q^λ, p_{λ,λ′}, λ < λ′ ∈ Λ} is a regular projective system. Also the mappings p_{λ,λ′} : R^{λ′} → R^λ are continuous and onto. This system satisfies the hypothesis of Choksi's [1] extension of Bochner's theorem to σ-finite measures (cf. also Métivier [1], p.252, Théorème 5.1), so that the system admits a regular (projective) limit (R₀^T, 𝓑_T, Q), 𝓑_T = σ(𝓑₀*), whence Q|𝓑_λ* = Q^λ ∘ p_λ⁻¹, λ ∈ Λ. This is the desired result. Unlike the case where the system consists of probability measures, the thus obtained (R₀^T, 𝓑_T, Q) need not be σ-finite (as shown by an example at the end of the proof), although Q is again regular. [It is what is called a localizable space, i.e., each sub collection of 𝓑_T has a supremum relative to Q, cf., e.g., Rao [17], p.70. The difficulty comes in because R₀^T is not necessarily locally compact when T is uncountable. The problem does
not disappear even if each Q^λ were finite but not uniformly bounded.] It thus remains to show that Q is σ-finite if the Xₜ-process is stochastically continuous.

It is natural that one should translate the stochastic continuity of the Xₜ-process to the ch.f. formula (1), to discuss the properties of the measure Q and where it is located. Now Xₛ →ᴾ Xₜ, so that Xₛ − Xₜ →ᴾ 0 as s → t, and the latter is equivalent to Xₛ − Xₜ →ᴰ 0, which in turn is equivalent to the ch.f. ϕ_{s,t}(u) = E(e^{iu(Xₛ − Xₜ)}) satisfying lim_{s→t} ϕ_{s,t}(u) = 1, uniformly in u belonging to compact sets (i.e., locally uniformly). Using (1), the ch.f. of Xₛ − Xₜ is calculated, with λ = (s, t) there, as follows:
$$\begin{aligned}
\varphi_{s_1, s_2}(u_1, u_2) &= E(e^{i(u_1 X_{s_1} + u_2 X_{s_2})})\\
&= \exp\Big\{i(u_1 a(s_1) + u_2 a(s_2)) - \frac{1}{2}\sum_{j,k=1}^{2} u_j u_k R(s_j, s_k)\\
&\qquad + \int_{R^\lambda} \big(e^{i(u_1 x + u_2 y)} - 1 - i(b_1(x, y)u_1 + b_2(x, y)u_2)\big)\, dQ^\lambda(x, y)\Big\}.
\end{aligned} \tag{2}$$
Taking u₂ = −u₁, s₁ = s, s₂ = t, one gets, if ψ_{s,t} denotes the distinguished (unique) logarithm of ϕ so that ψ_{s,t}(0) = 0 locally uniformly:
$$\psi_{s,t}(u) = iu(a(s) - a(t)) - \frac{u^2}{2}(R(s, s) + R(t, t) - 2R(s, t)) + \int_{R^\lambda} \big[e^{iu(x-y)} - 1 - iu(b_1(x, y) - b_2(x, y))\big]\, dQ^\lambda(x, y). \tag{3}$$
To get rid of the linear terms, consider
$$\psi_{s,t}(u) + \psi_{s,t}(-u) = -4\int_{R^\lambda} \sin^2\frac{u(x-y)}{2}\, dQ^\lambda(x, y) - u^2(R(s, s) + R(t, t) - 2R(s, t)) \to 0 \text{ as } s \to t, \tag{4}$$
by (2) and (3). But the positive definiteness of R and the positivity of Q^λ imply that each of the terms on the right tends to zero, so that (i) R is continuous on the diagonal, whence everywhere, and (ii)
$$\lim_{s \to t} \int_{R^\lambda} \sin^2\frac{u(x-y)}{2}\, dQ^\lambda(x, y) = 0. \tag{5}$$
This will be used in deducing the desired property for Q. Consider the auxiliary functions
$$\Phi_\varepsilon(u) = \frac{1}{\varepsilon}\int_0^\varepsilon \sin^2 ux\, dx; \qquad \Psi(v) = \frac{v^2}{1 + v^2}, \quad \varepsilon > 0. \tag{6}$$
ε (x) It is immediate that ΦΨ(x) is bounded above and below by constants c1 , c2 > 0 for all x. Hence Φε (x) ≥ c1 Ψ(x) and so by integrating (5) relative to u on [0, ε] and averaging, it still tends to zero as s → t, and then 1 ε 2 u λ sin Ψ(x − y) dQλ (x, y) (x − y) du dQ (x, y) ≥ c1 2 Rλ ε 0 Rλ = Is,t (say), → 0,
as s → t. But then for any δ > 0 since Ψ is increasing and symmetric, one has δ2 Ψ(x − y) dQλ (x, y) ≥ Qλ (|x − y| ≥ δ), (7) Is,t ≥ 1 + δ2 |x−y|≥δ and moreover, since
x2 1+x2
≥
x2 2
Is,t
for |x| ≤ 1, one has
1 ≥ Ψ(x − y) dQ (x, y) ≥ 2 |x−y|≤1
λ
|x−y|≤1
(x − y)2 dQλ (x, y). (8)
Since I_{s,t} → 0 as s → t, both (7) and (8) imply that Q^λ concentrates continuously along the diagonal of R₀^λ, locally uniformly. This fact allows us to conclude that Q is σ-finite, as follows. Let D_λ = p_λ⁻¹(R₀^λ), λ ∈ Λ, and X₀ = ∪_{r∈S∩Λ} D_r, where S is the set of all rationals of R. Since each D_r is σ-finite for Q, and this is a countable union, it follows that X₀ ∈ 𝓑_T and is σ-finite. Thus it suffices to show, under the current hypothesis, that Q concentrates on X₀. For this let A ∈ 𝓑_T be any element satisfying Q(A) < ∞. Since 𝓑_T is a cylinder σ-algebra, for this proof one may assume that A is also a cylinder set, so that it is of the form Aₜ = {ω ∈ R₀^T : ω(t) ∈ [a, b]}. If t is rational, then Aₜ ∈ 𝓑_T(X₀) = {X₀ ∩ B : B ∈ 𝓑_T}, and is σ-finite. Suppose then t is irrational. Let c, d ∈ R be such that [c, d] ⊃ [a, b], and consider B_r = {ω ∈ R₀^T : ω(r) ∈ [c, d]}, where r ∈ S. Then B = B_r ∈ 𝓑_T(X₀) and AₜΔB_r ∈ 𝓑_T, so that if δ = |c − d| ∧ |a − b| > 0, one has
$$Q(A \Delta B) = Q^{t,r}(|x - y| \ge \delta). \tag{9}$$
Now, using (7) and (8), choose r here such that |t − r| is arbitrarily small, so that Q^{t,r}((x, y) : |x − y| ≥ δ) < ε. By the preceding paragraph and the density of the rationals this is clearly possible. Hence A differs from B by a set of arbitrarily small, hence zero, Q-measure. Since B ∈ 𝓑_T(X₀) and B is σ-finite, it follows that Q is carried by X₀. Hence Q is σ-finite, as asserted.

We now note, by the following simple example, that some such condition as stochastic continuity is needed in the last part of this theorem.
5.4 Infinitely divisible processes
Indeed let {Xt , t ∈ [a, b], a < b} be a process of mutually independent i.d. random variables without the Gaussian component. Then At = {ω : Xt (ω) = 0, Xs (ω) = 0, s = t} ∈ BT , a cylinder, and Qt (At ) = Q(At ) > 0, being the support of the L´evy measure for Xt . Then the support Supp(Q) = ∪t∈[a,b] At is not σ-finite, since it is a disjoint union of an uncountable collection of the At of positive measure. It is of interest to remark that the projective limit measure Q of an infinitely divisible process is localizable, although not σ-finite, an instance where the distinction is clearly exhibited by a natural application. The preceding proof has further information on the continuity of the other parameters a, R. In fact (4) already implies that R is continuous, and the result on a is given by 2. Proposition. If {Xt , t ∈ T ⊂ R} is a stochastically continuous i.d. process, then the parametric functions a, R of (1) are continuous. Proof. Only the continuity of a need be verified. This can be done using the continuity property of the measure Q on the diagonal of each Rs,t 0 , established in (7) and (8). So consider ψs,t given by (3) which tends to 0 as s → t locally uniformly in u. In (4), it was shown that R(s, t) → R(t, t) as s → t and then the Qs,t tends to a measure concentrating on x = y. Using this fact we assert that the integral in (3) tends to zero as s → t which implies that the remaining term must go to zero, i.e., as → at . Now the integral in (3) can be split into two parts for each 0 < δ ≤ 21 , as: ku (x, y) dQs,t (x, y), + (10) I1 + I2 = |x−y|≤δ
|x−y|>δ
where the kernel ku (x, y) = eiu(x−y) − 1 − iu(b1 (x, y) − b2 (x, y)). Then I2 can be seen to be (since |bi (x, y)| ≤ 1): |I2 | ≤ 2(1 + |u|)Qs,t (|x − y| > δ) → 0, by (7) as s → t. Next consider I1 of (10). Define the sets A1 = {|x|, |y| ≤ 1, |x − y| ≤ δ}, A2 = {|x|, |y| > 1, |x − y| ≤ δ} and similarly A3 , A4 when |y| > 1, |x| ≤ 1 and x, y interchanged. Let Jj = Ak kuj (x, y) dQs,t (x, y), j = 1, . . . 4. Then since Qs,t (|x − y| ≤ δ, x = 0 = y) < ∞ one has |u|2 |J1 | ≤ (x − y)2 dQs,t (x, y) 2 |x−y|≤δ ≤
u2 δ 2 s,t Q (|x − y| ≤ δ) 2
304
V. Likelihood Ratios for Processes
and similarly |J2 | + |J3 | + |J4 | ≤ 3δ|u|[Qs (|x| > 1) + Qt (|y| > 1)] = O(δ). Hence |I1 | ≤ O(δ) and since δ > 0 is arbitrary, I1 → 0, as s → t, locally uniformly in u. This implies that as → at as s → t which is the desired conclusion. Remarks. 1. The result has been detailed here to show that the Poisson process plays a fundamental role in the theory of infinitely divisible process. Indeed, it is known in the classical treatments of the i.d. distributions that the ch.f. of every such function is a pointwise limit of a sequence of products of Poisson ch.f.s. This indicates its place in obtaining all the i.d. distributions as appropriate convolutions of just the Poisson class which includes the Cauchy family! 2. The Poisson measures have the property that they are non negative and take mutually independent values on disjoint sets. They are called “completely random” by Kingman [1] who gave a detailed and readable analysis of their structure with indications to some real life applications. Such measures will be considered in the following work. Consider a mapping π : S → L0 (P ), where (S, S) is a measurable space with S having also all the singletons, and L0 (P ) is the set of all real random variables on (Ω, Σ, P ). Then π is called a completely random measure, if (i) for each A ∈ S, π(A) ≥ 0 a.e., (ii) A1 , . . . , An ∈ S, disjoint, implies {π(Ai ), i = 1, . . . , n} are mutually independent, ∞ and ∞ (iii) for disjoint An ∈ S with A = ∪n=1 An ∈ S, one has π(A) = n=1 π(An ) in probability (or equivalently, in view of independence, with probability one). If also there is a measurable decomposition {Sn , n ≥ 1} of S (i.e., S = ∪∞ n=1 Sn , Sn ∩ Sm = ∅, m = n, Sn ∈ Σ), such that P (π(Sn )) > 0 for each n, then π(·) will be called a completely random σ-finite measure. This implies, in particular, that the function ¯ + defined, for each t > 0, by μt : S → R e−μt (A) = E(e−tπ(A) ),
(11)
is a measure on S and is σ-finite since μt (A) = 0(∞) iff π(A) = 0(∞) a.e. Let us illustrate the concept before applying it to our problems. Example. The process called ‘white noise’ satisfies the above hypothesis, and it is as follows. Let S ⊂ Rd (1 ≤ d < ∞) be a non empty set with S0 as the δ-ring of bounded Borel sets of S0 (so σ(S0 ) = B(S), the Borel σ-algebra of S, using the Euclidean metric). Consider a collection {π(A) : A ∈ S0 } of Gaussian random variables with means zero and covariance μ(A ∩ B) for A, B ∈ S0 where μ is the Lebesgue measure in Rd . Such a process exists by Theorem I.1.1 if Ω = RS0 , π(A, ω) =
305
5.4 Infinitely divisible processes
ω(A) ∈ R, ω ∈ Ω, and for any A1 , . . . , An ∈ S0 , {π(An ), i = 1, . . . , n} is assigned an n-dimensional Gaussian distribution with mean zero and the positive definite symmetric matrix (μ(Ai ∩ Aj ), 1 ≤ i, j ≤ n) as its covariance, which determines the probability measure P on the cylinder σ-algebra Σ of Ω, so that (Ω, Σ, P ) is the desired space. Now for ∞ A ∩ B = ∅ one has π(A)+π(B) i ∈ ∞= π(A ∪ B) and if A = ∪i=1 Ai , A, A S0 , Ai disjoint, then π(A) = i=1 π(Ai ), the series converging in L2 (P ) and hence in probability. Similarly taking S to be countable, d = 1, one can define the Poisson k where μ is a counting measure (or process if P [π(A) = k] = e−μ(A) μ(A) k! a more general σ-finite measure on (S, S) with S0 as the δ-ring of sets of finite μ-measure, and μ is again termed the intensity of the Poisson random measure π so that E(π(A)) = μ(A), A ∈ S0 ). More generally, the structure of a class of completely random σ-finite measures π(·), which include the above examples, has been analyzed by Kingman [1] for the L0 (P )-valued (and by Feldman [2] for L0 (P ; H)-valued where H is a separable Hilbert space) case. In the former it was shown that such a π admits a Lebesgue type decomposition: π = πa + πd ,
(12)
where πa and πd are completely random and ∞ mutually independent such that πd is purely atomic, i.e., πd = n=1 Xn δn with Xn ≥ 0 mutually independent among themselves, δn being the Dirac measure, and πa is nonatomic, i.e., P [πa ({s}) = 0] = 1, s ∈ S. Moreover πa is infinitely divisible, which is seen as follows. Since πa is nonatomic, for any A ∈ S0 and n ≥ 1 there exist Anj ∈ S0 such that A = ∪nj=1 Anj , disjoint, and E(πa (Anj )) =
μ(A) n .
Hence for any c > 0 one has
P [πa (Anj ) ≥ c] = P [1 − e−πa (Anj ) ≥ 1 − e−c ] ≤
E(1 − e−πa (Anj ) ) , by Markov’s inequality, 1 − e−c μ(A)
1 − e− n = , 1 − e−c
(13)
and hence lim_{n→∞} max_{1≤j≤n} P[π_a(A_{nj}) ≥ c] = 0, so that the π_a(A_{nj}) are infinitesimal. Since π_a(A) = Σ_{j=1}^{n} π_a(A_{nj}) for every n ≥ 1, this implies the infinite divisibility (cf. Doob [2], p.132, Theorem 4.1; and also Gnedenko and Kolmogorov [1], p.96). Now the simpler, countably valued component π_d can be analyzed using the method of Theorem 3.8. We therefore need to concentrate on the relatively deeper analysis required to study π_a, i.e., the general Poisson i.d. processes which have no Gaussian component and whose
Lévy measure is σ-finite. Thus π_a can be represented (via the Lévy-Khintchine formula) as:
$$\Phi(t) = E(e^{-t\pi(A)}) = \exp\Big\{-\int_{R^+} k(t, x)\, dQ_A(x)\Big\}, \tag{14}$$
where k(t, x) is the Lévy integral kernel, simplified by setting u = it in (10); after a computation it may be shown to be k(t, x) = (1 − e^{−tx})/(1 − e^{−t}), obtained differently and directly in Kingman [1]. Here Q_A(·) is a regular Borel measure for each A ∈ 𝒮₀, and Q_{(·)}(B) is a regular Borel measure on 𝒮₀ for each B ∈ B(R⁺). These statements may be combined by saying that A, B → Q_A(B) = Q(A, B) is a regular bimeasure. Since Q ≥ 0, this implies, by results in Real Analysis (cf. also Berg, Christensen, and Ressel [1], p.24), that Q admits an extension Q̃ to a measure on the product σ-algebra 𝒮 ⊗ B(R⁺) → R̄⁺ such that Q̃(A × B) = Q(A, B) for all A ∈ 𝒮₀ and B ∈ B(R̄⁺), where (S, 𝒮) and (R⁺, B(R⁺)) can be more general "standard Borel spaces", i.e., such that their σ-algebras are countably generated and σ-isomorphic to the σ-algebras of complete separable metric spaces, also known as Polish spaces. In other words, except for the stated isomorphisms, the concrete spaces considered above are essentially the best possible objects. [This was already remarked in the last section, in the work on independent increment processes, and the detail is also spelled out in Feldman [2].]

With this information, a certain stochastic integral relative to π can be defined. Since the study is of i.d. processes X, it is desirable to have the latter property reflected in a stochastic integral I_X(f) for each real Q-measurable f, where Q is the Lévy measure of X. However, this and the linearity of the integral are found to be competing claims, especially if one demands the natural condition that f → I_X(f) be continuous in the sense of the topology in these spaces. So it becomes necessary to forego linearity in favor of the i.d. property. [Thus our boundedness principle, which must retain linearity, is not directly applicable.] Now let X be a Poisson random measure with intensity μ, and consider L⁰(μ), the set of real μ-measurable functions on T ⊂ R, topologized with convergence in measure, in terms of which it becomes a complete linear metric space if (as usual) μ-equivalent functions are identified. Next consider the vector subspace S_ϕ = {f ∈ L⁰(μ) : ρ_ϕ(f) = ∫_T ϕ(f) dμ < ∞}, where ϕ : R → R⁺ is a bounded increasing continuous concave function with ϕ(x) = 0 iff x = 0. For instance ϕ(x) = x²/(1 + x²) and ϕ(x) = |x|/(1 + |x|) are often used with T = R. It may be verified that (S_ϕ, ρ_ϕ) is a complete vector space (cf., e.g., Rao [17], Exercise 4.1.6 and Theorem 3.2.6). Finally, define
the integral functional I_X for simple f = Σ_{i=1}^{n} aᵢχ_{Aᵢ}, as:
$$I_X(f) = \sum_{i=1}^{n} a_i \big(X_{A_i} - b_i(a)\mu(A_i)\big), \tag{15}$$
where bᵢ(x) = xᵢ/(1 + |xᵢ|²), or bᵢ(x) = xᵢχ_{[|xᵢ|≤1]} + χ_{[xᵢ>1]} − χ_{[xᵢ<−1]}, so that |bᵢ(x)| ≤ 1 in both cases. These bᵢ(·) will be used in our analysis. The definition (15) is useful here since I_X(f) is clearly i.d.; if {fₙ, n ≥ 1} ⊂ S_ϕ and fₙ → f in μ-measure, then one verifies that I_X(fₙ) → I_X(f) in probability, a real analysis argument (going to a subsequence for a.e. convergence, etc.) shows that the function I_X(f) is uniquely defined, and each I_X(f), f ∈ S_ϕ, is i.d. Also I_X(a₁f + a₂g) = a₁I_X(f) + a₂I_X(g) + α, where α is a constant (depending on f, g). This is again shown by considering step functions and then taking limits. However, the unicity and all the limit operations are based on the i.d. theory and the uniqueness of the Lévy-Khintchine representation. All the details, even when X takes values in a separable Hilbert space, have been discussed by Feldman [2] (see also Maruyama [1], Sec. 2 on this point). We merely state the result for reference in the following:
T
and we call it the i.d. integral.]
4. Remarks. 1. It is of interest to make a comment on the distinctions between the i.d. integral above and the stochastic integrals we discussed earlier. Thus consider a Poisson measure $\pi$ with intensity $\mu$, so that $E(\pi(A)) = \mu(A) = \mathrm{Var}(\pi(A)) = E(\pi(A) - \mu(A))^2$, and $\pi(A), \pi(B)$ are independent for $A, B \in \mathcal{B}(R_0)$ of finite $\mu$-measure with $A \cap B = \emptyset$. Consequently one has, for $A, B \in S_0$:
\[ E(\pi(A)\pi(B)) = E\big( [\pi(A - B) + \pi(A \cap B)][\pi(B - A) + \pi(B \cap A)] \big) = \mu(A \cap B) + \mu(A)\mu(B), \]
after a simplification. Hence, writing $\pi^*(A) = \pi(A) - \mu(A)$, $A \in S_0$, one gets immediately $E(\pi^*(A)\pi^*(B)) = \mu(A \cap B)$, so that $\pi^*(A), \pi^*(B)$ are independent for
disjoint $A, B$. This property is analogous to the white noise case. Using these results it is possible to define the stochastic integral $\int_A f\, d\pi^*$, $A \in S_0$, with the boundedness principle, since for simple $S$-measurable $f$ one has
\[ E\Big( \int_A f\, d\pi^* \Big)^2 \le \int_A |f|^2\, d\mu, \tag{16'} \]
(and actually there is equality here). However, this integral is not defined for all $f \in S_\varphi$, which is essential in the Lévy representation. To extend it to include such $f$, which is desirable, and to have also the continuity of the integral as a functional of $f$, one is brought back to (15). If $\int_T \frac{|f|}{1 + f^2}\, d\mu < \infty$, then it is expressible as:
\[ I(f)_A = \int_A f(t)\Big[ dX_t - \frac{f}{1 + f^2}\, d\mu \Big]. \]
2. If $f \in S_\varphi$, then evidently $\mu\{x : |f(x)| > \varepsilon\} < \infty$ for each $\varepsilon > 0$, and also
\[ \int_{[|f| \le 1]} f^2\, d\mu \le 2 \int_{[|f| \le 1]} \frac{f^2}{1 + f^2}\, d\mu < \infty, \]
and these two conditions imply (hence are equivalent to) the above one for any measurable $f \in S_\varphi$. Thus $I(f)$ is an i.d. random variable.
A result corresponding to Theorem 3.1, in the i.d. context, called the Lévy–Itô theorem, can be given as follows.
6. Theorem. Let $X = \{X_t, t \in T \subset \mathbb{R}\}$ be an i.d. process with the Lévy–Khintchine parametric functions $(\gamma, R, Q)$. Then $X$ admits an a.e. unique decomposition $X = Y + Z$, where $Y, Z$ are mutually independent, with $Y$ the Gaussian component having mean and covariance functions $(\gamma, R)$, so that
\[ E(e^{i(u, Y_\lambda)}) = \exp\Big\{ i(u, \gamma_\lambda) - \frac{1}{2}(R_\lambda u, u) \Big\}, \quad u \in \mathbb{R}^\lambda, \]
and $Z$ is a Poisson component such that there is a random measure $\nu$ (called the Lévy–Itô measure) with intensity $Q$, satisfying
\[ Z_A = \int_{R_0^T} x \Big[ \nu(A, dx) - \frac{1}{1 + |x|^2}\, Q(A, dx) \Big], \tag{17} \]
for all $A \in \mathcal{B}(T)$ for which $\int_{A \times R_0^T} \frac{|x|^2}{1 + |x|^2}\, Q(dt, dx) < \infty$, i.e., $Z_A$ is given by the i.d. integral defined in (16) with $f(x) = x$ there. An independent proof of this result, when the process is (separable) Hilbert space valued, is given in Feldman [2]. We use only a special case of it, and refer to the above paper for the abstract version.
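As an illustrative aside (not part of the original development), the following minimal Python sketch draws Monte Carlo samples of $I_X(f)$ for a simple function $f = \sum_i a_i \chi_{A_i}$ and a Poisson random measure $X$ with intensity $\mu = \lambda \times$ Lebesgue, using the centering $b(x) = x/(1 + x^2)$ of (15); all numerical values and names are our own hypothetical choices.

    import numpy as np

    rng = np.random.default_rng(0)

    def b(x):
        # centering function b(x) = x / (1 + x^2) of (15); |b(x)| <= 1
        x = np.asarray(x, dtype=float)
        return x / (1.0 + x**2)

    def id_integral_simple(a, cell_lengths, lam, n_rep=100_000):
        """Monte Carlo draws of I_X(f), f = sum_i a_i chi_{A_i}, per (15),
        for a Poisson random measure X with intensity mu = lam * Lebesgue."""
        a = np.asarray(a, dtype=float)
        mu = lam * np.asarray(cell_lengths, dtype=float)      # mu(A_i)
        counts = rng.poisson(mu, size=(n_rep, len(mu)))       # X(A_i), independent
        return (counts - b(a) * mu) @ a   # sum_i a_i (X(A_i) - b_i(a) mu(A_i))

    draws = id_integral_simple(a=[0.5, -1.0, 2.0], cell_lengths=[0.2, 0.3, 0.5], lam=10.0)
    print(draws.mean(), draws.var())      # draws of a generalized Poisson (i.d.) variable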
It should be noted that $Q$, as seen from Theorem 1 above, is defined on the cylinder σ-algebra of $R_0^T$. The representation of $X_t$ in this case may be restated as follows. For every $x \in R_0^T$, $p_\lambda(x) \in \mathbb{R}^\lambda$, and taking $\lambda = \{t\}$ this simply means $x(t) \in \mathbb{R}$. On the other hand, each point $a$ of $\mathbb{R}$ is a value of some $x \in R_0^T$, i.e., $a = x(t)$ for some $t \in T$. Hence the Lévy representation of $X_t$ can be restated, exhibiting all the parameters, as:
\[ Z_t = \int_{[x : |x(t)| > 0]} x(t)\, \nu(t, dx) - \int_{[x : |x(t)| > 0]} b(x(t))\, Q(t, dx), \tag{18} \]
in terms of (16), where $|b(x)| \le 1$ is as defined earlier, and $Q(t, dx)$ is written for $Q(A, dx)$ when $A = (0, t]$. This representation clearly exhibits the fact that $Q$ is the Lévy measure on $(R_0^T, \mathcal{B}(T))$ and $Q(t, \cdot) = Q \circ p_{\{t\}}^{-1}$, in the previous abstract notation. Also, (18) will be used here as it shows how all the features of the problem are exactly incorporated. Thus, if $T = [0, t]$, one has the full formula including the Gaussian $Y_t$ as:
\[ X_t = Y_t + \int_0^t \Big[ \int_{R_0^T} x(s)\, \nu(ds, dx) - \int_{R_0^T} b(x(s))\, Q(ds, dx) \Big]. \tag{19} \]
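To make the decomposition of the theorem concrete, here is a small sampling sketch for an i.d. law with a *finite* discrete Lévy measure (a simplification we choose for illustration; the atoms, masses and seed are hypothetical): a Gaussian component plus Poisson jump counts, centered as in (17).

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_id(gamma, sigma2, atoms, masses, n):
        """Draws X = Y + Z from an i.d. law with Gaussian part (gamma, sigma2) and
        finite discrete Levy measure Q = sum_k masses[k] * delta_{atoms[k]}."""
        x = np.asarray(atoms, dtype=float)
        m = np.asarray(masses, dtype=float)
        y = rng.normal(gamma, np.sqrt(sigma2), n)           # Gaussian component Y
        counts = rng.poisson(m, size=(n, len(m)))           # Poisson jump counts
        centering = ((x / (1.0 + x**2)) * m).sum()          # int b(x) Q(dx), as in (17)
        z = counts @ x - centering                          # Poisson component Z
        return y + z

    draws = sample_id(gamma=0.0, sigma2=1.0, atoms=[-1.0, 2.0], masses=[0.5, 0.3], n=100_000)
    print(draws.mean(), draws.var())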
An integral representation of a general continuous Poisson measure, similar to (18), will be stated for comparison:
6'. Theorem. Let $X = \{X_t, t \in T\}$ be a stochastically continuous Poisson i.d. process, with $\nu(\cdot, \cdot)$ as its Lévy–Itô measure. Then for any $A \in \mathcal{B}(T)$ it can be expressed as:
\[ X_A = \beta(A) + \int_{A \times R_0^T} f(t, x)\Big[ \nu(dt, dx) - \frac{1}{1 + |f(t, x)|^2}\, Q(dt, dx) \Big], \tag{20} \]
where $f(t, x) = x(t)$ (i.e., $f : T \times R_0^T \to \mathbb{R}$) satisfies
\[ \int_{A \times R_0^T} \frac{|f(t, x)|^2}{1 + |f(t, x)|^2}\, Q(dt, dx) < \infty \]
for all compact sets $A \subset T$, and $\beta(\cdot)$ is a deterministic measure on $\mathcal{B}(T)$.
This result, which is just (18) in a slightly different form, was established in the scalar case by Kingman ([1], p. 70) and by Feldman ([2], p. 29) if $X$ takes values in a separable Hilbert space. It will be useful in computing the likelihood ratio of the process relative to a given measure $P_1$ and an alternative $P_2$ on $(\Omega, \Sigma)$. For simplicity, the processes will be denoted $X^i$, with probability measures $P_i$ on $\Sigma$, $i = 1, 2$. The expression for $dP_2/dP_1$ based on a realization of $X^i$ is desired. As in the earlier section, it is first calculated when in (20) $A \times R_0^T$ is replaced by a set of the form $(0, t] \times \cap_{s \in T}\{x : |x(s)| \ge \varepsilon\}$, so that it has finite $Q$-measure, and then $\varepsilon \downarrow 0$ by a suitable limiting process. Since it was already noted that the integral in (20) can be considered as a difference between the (linear) stochastic integral and a Stieltjes integral, and since the dominated convergence criterion holds for both, one can expect that the general case is obtained by letting $\varepsilon \downarrow 0$. However, a dominating function for the stochastic part is not available, and one needs a substitute for the final argument, which is somewhat involved. Also note that, since the Lévy–Itô measure can be split into nonatomic and discrete parts (cf. (11)) and the latter may be handled separately with relative ease, we consider the more difficult continuous case. Consequently the intensity measure $\mu_t$, given by $E(\nu((0, t], B)) = \mu_t(B)$, will be nonatomic. Thus, assuming this simplification is already made, we again use the same symbols $\nu_i(t, \cdot), Q_i(t, \cdot)$ for the nonatomic case and present a solution. This result is due to Briggs [1]:
where E1 = {x : |ρ(x) − 1| > 12 }, E2 = E1c . Then P2 P1 and moreover writing νi∗ = ν − Qi , dP2 1 (X ) = exp log ρ(x) ν1∗ (dx)+ dP1 E2 log ρ(x) ν1 (dx)+ E1 (1 − ρ(x) + log ρ(x)) Q1 (dx)+ E2 ! (1 − ρ(x)) Q1 (dx) . (22) E1
Conversely, if P2 P1 , then (21) holds, and hence the density (22) obtains. In case Q2 ∼ Q1 and (21) is true, then P2 ∼ P1 and (22) holds. Moreover, interchanging the measures suffixed by 1 and 2, one obtains 2 1 the ratio dP dP2 (X ) by the same formula by interchanging Q1 and Q2 1 and replacing ρ by ρ˜ = dQ dQ2 .
Remark. The nonatomicity of the measures $X^i$ is assumed in order to get the explicit expression (22). If merely the equivalence conclusion ($Q_1 \sim Q_2 \Leftrightarrow P_1 \sim P_2$) is desired, then the $\beta_i$ components will also be present, to take account of the atoms, but they can be absorbed into the $Q_i$. This was established by Feldman ([2], Theorem 6.6). Also, if the (stochastically continuous) processes have independent increments, then even the nonatomicity separation is not necessary, as seen in Theorem 3.7. For general i.d. processes, one can decompose the $X_i$ into continuous (or atom free), discrete, and deterministic parts, all of which are mutually independent, compute the likelihood ratios, and then take a convolution to obtain the general case; this clearly involves some additional work, but it is standard. Consequently it is enough to consider the crucial part, which is the nonatomic case, and we sketch the argument due to Briggs [1].
Outline of proof. First consider the case that the $Q_i$ are finite, namely for the processes $X^{i,\varepsilon}$, defined with arbitrarily fixed $\varepsilon > 0$, as:
\[ X_t^{i,\varepsilon} = \int_{[x : |x(t)| > \varepsilon]} x\, \nu_i(dx) - \int_{[x : |x(t)| > \varepsilon]} b(x)\, Q_i(dx), \quad i = 1, 2, \]
which are i.d., and let $P_i^\varepsilon$ be the corresponding measures on $(\Omega, \Sigma)$ determined by these processes. Let $R_\varepsilon = \{x \in R_0^T : |x| \ge \varepsilon\}$, and $\mathcal{B}_{\varepsilon,T}$ be the resulting trace σ-algebra of $\mathcal{B}(T)$, i.e., $\mathcal{B}_{\varepsilon,T} = \{B \cap R_\varepsilon : B \in \mathcal{B}(T)\}$, on which the $Q_i$ are finite. First, (22) is established for the $X^{i,\varepsilon}$, $i = 1, 2$, by the following steps:
(i) For any $A \in \mathcal{B}_{\varepsilon,T}$ and integer $k \ge 1$, we obtain
\[ E\Big( \chi_{[\nu_1(A) = k]} \exp\Big( \int_A \log \rho(x)\, \nu_1(dx) \Big) \Big) = e^{-Q_1(A)}\, \frac{Q_2(A)^k}{k!}. \]
(ii) Let $\{A_j, 1 \le j \le n\} \subset \mathcal{B}_{\varepsilon,T}$ be a measurable partition of $R_\varepsilon$ and $k_j \ge 0$ be integers. Then one establishes (as in the proof of Proposition 3.2)
\[ \int_\Omega \chi_{\cap_{j=1}^n [\nu_1(A_j) = k_j]} \exp\Big\{ \int_{R_\varepsilon} \log \rho(x)\, \nu_1(dx) + Q_1(R_\varepsilon) - Q_2(R_\varepsilon) \Big\}\, dP_1 = \int_\Omega \chi_{\cap_{j=1}^n [\nu_2(A_j) = k_j]}\, dP_2. \]
(iii) The preceding is extended to all sets of the form $[X_t^{1,\varepsilon} \in A]$, which are cylinders in $\Sigma$ ($A \in \mathcal{B}_{\varepsilon,T}$), and then to all sets in $\Sigma$.
(iv) This is shown to be equivalent to the result that
\[ P_2^\varepsilon(A) = \int_A h_\varepsilon\, dP_1^\varepsilon, \quad A \in \mathcal{B}_{\varepsilon,T}, \]
where $h_\varepsilon$ is the integrand on the left side of the integral in (ii). Then (22) will follow for the processes $X^{i,\varepsilon}$, for each $\varepsilon > 0$.
(v) Since $Q_i(R_0^T) = \infty$ is the new case, one has to use the hypothesis (21) and show, by a detailed analysis using various estimates, that $h_\varepsilon \to h$ in probability. The result is similar to that of Theorem IV.1.1. However, since $\{h_\varepsilon, \mathcal{B}_{\varepsilon,T}, \varepsilon > 0\}$ is not generally a martingale, that result does not apply. The details need much care and may be found in Briggs ([1], pp. 184-198). The converse parts are established by an indirect argument. We again have to refer to her original work.
In the literature, most of the effort is directed to finding sufficient conditions for the equivalence/singularity (and occasionally the mixed cases, as was done by Newman [2] for independent increment processes), but rarely to obtaining the likelihood ratios. See also Veeh [1] for a sufficient condition, related to Newman's work, on the equivalence of i.d. processes. It is for this reason that the general i.d. case is discussed here in some detail, as this clarification may help the reader to do further work on these problems. Before Briggs' work the i.d. likelihood ratio question was completely open; in fact, in their long survey article, Gikhman and Skorokhod ([1], p. 121) state about the i.d. case: "only [sufficient] conditions for absolute continuity of [probability] measures are studied, but formulae for the densities are not given, because none have been found". Thus the above result should indicate the necessary effort expected, with a solution to an important part, and help for any future research on this problem.
Another related question of interest here is to find the distribution of the likelihood ratio $\frac{dP_2}{dP_1}$ for the given hypothesis, i.e., relative to $P_1$, which is needed to apply the test procedure with the Neyman–Pearson–Grenander theorem (cf. Theorem II.1.1). This is still an area to be explored. To show what may be accomplished, we go back to the independent increment case. For Gaussian processes this can be studied with great success, as already seen in Section 2, and for the additive case this was considered by Brockett, Hudson and Tucker [1]; their result will be described, as it serves as an interesting step for the type of problems raised.
Let $X^i$ be an i.d. process with parametric functions $\{\gamma^i, R^i, Q^i\}$. Since in the calculations one often uses $\frac{dQ_2}{dQ_1}$, one has to assume the existence of this derivative, i.e., $Q_2 \ll Q_1$. The measures $Q_i$ need not be σ-finite (although, being localizable, the expression can still be shown to be meaningful); for simplicity we assume that the processes are stochastically continuous, so that by Theorem 1 the $Q_i$ will be σ-finite. Then, combining the results of Theorem 2.7 and Theorem 7 above, we find for the $P_i$ that $P_1 \sim P_2$ iff (i) $Q_1 \sim Q_2$; (ii) $R_1 = R_2 = \sigma^2(t)$, $t \ge 0$ (say); (iii) $\int_{R_0} (1 - \rho(x))^2\, dQ_2(x) < \infty$; and (iv)
\[ \gamma_t^1 - \gamma_t^2 = \int_{R_0} \frac{x}{1 + x^2}\, (Q_t^1 - Q_t^2)(dx) + \int_0^t p(v)\, d\sigma^2(v). \]
[This form of equivalence for additive processes was given by Brockett and Tucker [1].] Here $p \in L^2([0, T], d\sigma^2)$, $R_0 = \mathbb{R}_0^{[0,T]}$, and $\rho(x) = \frac{dQ_1}{dQ_2}(x)$. With these results, the following is the desired assertion.
8. Theorem. Let $X^i = \{X_t^i, t \in [0, T]\}$ be a pair of additive (or independent increment) stochastically continuous processes on $(\Omega, \Sigma)$ with (induced) equivalent measures $P_i$, $i = 1, 2$, so that $P_1 \sim P_2$. If $(\gamma_i, \sigma^2, Q_i)$ are the parametric triples for their ch.f.s, $f = \frac{dP_1}{dP_2}$ is the likelihood ratio, and $\rho(x) = \frac{dQ_1}{dQ_2}(x)$, $x \in R_0$ ($= \mathbb{R}_0^{[0,T]}$), then the distribution of $f$ is given in terms of its ch.f. on $(\Omega, \Sigma, P_1)$ and the (parametric) space $R_0$ as:
\[ \Phi_f(u) = E(e^{iuf}) = \exp\Big\{ iu\Big[ \int_{R_0} \Big( 1 - \rho(x) + \frac{\log \rho(x)}{1 + (\log \rho(x))^2} \Big)\, dQ_2(x) - \frac{1}{2} \int_0^T p^2(v)\, d\sigma^2(v) \Big] - \frac{u^2}{2} \int_0^T p^2(v)\, d\sigma^2(v) + \int_{R_0} \Big( e^{iuv} - 1 - \frac{iuv}{1 + v^2} \Big)\, Q \circ g^{-1}(dv) \Big\}, \tag{23} \]
where $p \in L^2([0, T], d\sigma^2)$, $g = \log \rho$, and
\[ \gamma_1(t) - \gamma_2(t) = \int_{R_0} \frac{x}{1 + x^2}\, (Q_1 - Q_2)(dx) + \int_0^t p(v)\, d\sigma^2(v). \]
Consequently (by the Lévy–Khintchine representation) $f$ has an i.d. distribution.
In all the above work we assumed the stochastic continuity of the process to insure the σ-finiteness of the rate measure $Q$, by Theorem 1. If all the $Q_\lambda$'s are not σ-finite, then $Q$ need not even be localizable, and there exist completely random measures with this pathological behavior, as the following example, due to Kingman [1], shows.
9. Example. There exists a probability space $(\Omega, \Sigma, P)$ and a completely random measure $\nu : \mathcal{B} \times \Omega \to \bar{\mathbb{R}}_+$ with the following properties: (i) $\nu(\cdot, \omega)$ is σ-finite for each $\omega \in \Omega$; (ii) $P[\nu(A, \cdot) = \infty] = 1$ if $A \in \mathcal{B}$ is uncountable; and (iii) $P[\nu(A, \cdot) = 0] = 1$ if $A$ is countable.
Indeed, let $(\Omega, \Sigma, P)$ be the Lebesgue unit interval, and $(R_0, \mathcal{B})$ the Borelian unit interval. Let $\nu : \mathcal{B} \times \Omega \to \bar{\mathbb{R}}_+$ be defined as:
\[ \nu(A, \omega) = \sum_{r \in S} \chi_A(\omega + r), \tag{24} \]
where S is the set of rationals of R. Then ν is a completely random measure that satisfies (i) and (ii) with its intensity measure Q given by:
\[ Q(A) = E(\nu(A)) = \sum_{r \in S} E(\chi_A(\cdot + r)) = \sum_{r \in S} \int_\Omega \chi_A(\omega + r)\, P(d\omega) = \sum_{r \in S} P(A - r) = 0\ (\infty) \text{ if } A \text{ is countable (uncountable).} \tag{25} \]
This shows that $\nu(A)$ is not integrable for any uncountable set $A$. Actually, the truth of the stronger property (ii) can be seen as follows. In fact, (24) implies, for any integers $k, n \ge 1$,
\[ \nu(A, \omega) \ge \sum_{i=1}^n \chi_A\Big( \omega + r + \frac{i}{k} \Big), \tag{26} \]
so that on letting $k \to \infty$ the right side of (26) tends to $n\chi_A(\omega + r)$ in $L^1(P)$. Hence (26) gives
\[ P[\nu(A) \ge n] \ge P\Big[ \omega : \sum_{i=1}^n \chi_A\Big( \omega + r + \frac{i}{k} \Big) \ge n \Big] \to P[\omega : n\chi_A(\omega + r) \ge n], \quad \text{as } k \to \infty, \]
and the latter equals $P[\omega \in A - r] \ge P(A \cap [r, r + 1])$, for some $r \in S$, $= \alpha > 0$ (say). Thus $P[\nu(A) = \infty] \ge \alpha > 0$. Let $B = [\omega : \nu(A, \omega) = \infty]$, so that $B$ is uncountable and $P[B] > 0$. But by (24), $\omega \in B$, $r \in S$ such that $\omega + r \in (0, 1)$ implies $\omega + r \in B$. Thus $F(x) = P(B \cap (0, x])$ gives $F(x + r) - F(r) = F(x)$, $0 < x < x + r < 1$, $r \in S$. Consequently $F(x + y) = F(x) + F(y)$ for all $x, y \in (0, 1)$, $x + y < 1$. This functional equation has the solution $F(x) = xF(1) = xP(B)$, so that $\chi_B(x) = F'(x) = P(B) = \alpha > 0$ for almost all $x$, implying $\alpha = 1$, and (ii) is valid.
The above results are indicative of the ideas that have to be developed in this subject. We now turn to the final part of this chapter, on analogous problems for diffusion type processes.
5.5 Diffusion type processes
In the preceding sections we have obtained likelihood ratios for processes that are Gaussian, additive, Markovian, or infinitely divisible.
In some cases, especially the Markovian one, they can take discrete values, as Section 3 shows. These may be replaced by processes with continuous paths, retaining the Markovian character, and it is also desired to calculate appropriate likelihood ratios. Such classes are typically solutions of stochastic (differential) equations. The mathematical tools are somewhat different, but these models are important in such applications as physics, biology, and lately also economics (especially in finance studies). All of these come under the rubric of diffusion types, individually having large literatures. Some of them will be discussed here.
In Section IV.2 we introduced stochastic integration relative to an $L^{2,2}$-bounded process (cf. after Theorem IV.2.6), and evaluated the quadratic variation of the BM. The latter will be recast here for an $L^{2,2}$-bounded class. This is used in the applications below. Recall that a process $X = \{X_t, \mathcal{F}_t, t \in [a, b] = T\}$ is $L^{2,2}$-bounded (in the generalized sense) if for each simple function $f = \sum_{i=0}^{n-1} g_i \chi_{(t_i, t_{i+1}]}$, with $g_i$ being $\mathcal{F}_{t_i}$-measurable and $a = t_0 < t_1 < \cdots < t_n = b$, there is a fixed constant $C > 0$ and a σ-finite measure $\alpha$ on $\mathcal{B}(T) \otimes \Sigma$ such that
\[ E\Big( \int_T f(t)\, dX_t \Big)^2 = E\Big( \sum_{i=0}^{n-1} g_i (X_{t_{i+1}} - X_{t_i}) \Big)^2 \le C \int_{T \times \Omega} |f(t, \omega)|^2\, d\alpha(t, \omega) \tag{1} \]
holds. Then $\tau f = \int_T f\, dX_t$ is the stochastic integral, defined for all $f \in L^2(\alpha)$. The $L^{2,2}$-bounded integrators include the BM (with equality in (1) and $C = 1$), the orthogonal increment processes, as well as the square integrable (semi)martingales; in fact, all the currently known integration processes. The quadratic variation of a process $X$, denoted $[X]$, is defined as follows, where the partition $\pi : a = t_0 < t_1 < \cdots < t_n = b$ is used ($\operatorname{p\,lim}$ = limit in probability):
\[ [X]_t = \operatorname{p\,lim}_{|\pi| \to 0} \sum_{i=0}^{n-1} (X_{t_{i+1}} - X_{t_i})^2, \]
if this limit exists, where $|\pi| = \max_i (t_{i+1} - t_i)$ is the norm of the partition. The following simple computation shows that the limit exists for all $L^{2,2}$-bounded processes. Indeed, let $t_i = a + \frac{(b-a)i}{2^n}$, so that $\pi$ is a dyadic partition, and consider
\[ X_b^2 - X_a^2 = \sum_{i=0}^{2^n - 1} (X_{t_{i+1}} - X_{t_i})^2 + 2 \sum_{i=0}^{2^n - 1} X_{t_i}(X_{t_{i+1}} - X_{t_i}) = \sum_{i=0}^{2^n - 1} (\Delta_{I_n^i} X)^2 + 2 \int_T f_n(t)\, dX_t, \tag{2} \]
where $I_n^i = (t_i, t_{i+1}]$, $\Delta_{I_n^i} X = X_{t_{i+1}} - X_{t_i}$, $t_i = a + \frac{(b-a)i}{2^n}$, and $f_n = \sum_{i=0}^{2^n - 1} \chi_{I_n^i} X_{t_i}$, a simple function. By the $L^{2,2}$-boundedness this integral exists as $n \to \infty$, and so the middle term of (2) has a limit in probability; this implies that $[X]_t$ exists for any locally bounded (i.e., on compacts of $T$) $L^{2,2}$-bounded process, and is increasing in $t$. If there are two such processes $X$ and $Y$ on the same filtration, then evidently their linear combination has a similar property, and one can define their quadratic covariation, denoted $[X, Y]$, using the polarization identity of the classical Hilbert space theory, for any closed interval $[a, b] \subset T$:
\[ [X, Y]_t = \frac{1}{2}\big( [X + Y]_t - [X]_t - [Y]_t \big), \quad t \in [a, b]. \tag{3} \]
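The dyadic computation in (2) is easy to see numerically: for a simulated Brownian path, the sum of squared increments over ever finer dyadic partitions of $[0, 1]$ stabilizes near $t = 1$. A minimal sketch follows (the grid sizes and seed are arbitrary choices of ours):

    import numpy as np

    rng = np.random.default_rng(3)

    N = 2**20                                  # fine grid on [0, 1]
    dB = rng.normal(0.0, np.sqrt(1.0 / N), N)
    B = np.concatenate([[0.0], np.cumsum(dB)])

    for n in (6, 10, 14, 18, 20):              # dyadic partitions with 2^n cells
        path = B[::2**(20 - n)]
        print(n, np.sum(np.diff(path)**2))     # -> [B]_1 = 1 as n grows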
It follows that $[\cdot, \cdot]$ exists, is of (locally) bounded variation, and is bi- (or sesqui-) linear. Moreover, if $X$ or $Y$ is itself of locally bounded variation, then $[X]_t$ (whence $[X, Y]_t$) vanishes. This follows from the definition $[X]_t = [X, X]_t$ in the case of the first expression, and from the CBS-inequality for $[X, Y]_t$, since for $t > 0$
\[ |[X, Y]_t| \le \sqrt{[X]_t \cdot [Y]_t}, \quad \text{a.e.}, \tag{4} \]
in essentially the same way as in the classical Hilbert space theory (as was noted by Kunita and Watanabe [1]; see also, e.g., Rao [21], p. 383). We need the very useful stochastic integration by parts formula, due to Itô [1], which plays a fundamental role in continuous parameter stochastic analysis.
1. Theorem. Let $X = \{X_t, \mathcal{F}_t, t \in [0, T]\}$ be a right continuous process with left limits which is $L^{2,2}$-bounded. If $f : \mathbb{R} \to \mathbb{R}$ is a twice continuously differentiable function, then $\{f(X_t), \mathcal{F}_t, t \ge 0\}$ is locally an $L^{2,2}$-bounded process, and one has the formula, for $t > 0$:
\[ f(X_t) - f(X_0) = \int_{0+}^t f'(X_{s-})\, dX_s + \frac{1}{2} \int_{0+}^t f''(X_{s-})\, d[X]_s + \sum_{0 < s \le t} \big[ f(X_s) - f(X_{s-}) - f'(X_{s-}) \Delta X_s \big] - \frac{1}{2} \sum_{0 < s \le t} f''(X_{s-})(\Delta X_s)^2, \tag{5} \]
where $\Delta X_s = X_s - X_{s-}$ is the jump, $f' = \frac{df}{dx}$, $f'' = \frac{d^2 f}{dx^2}$, and the sums converge a.e. In particular, if the process has continuous sample paths, then the sums in (5) vanish, and in that case one has
\[ f(X_t) - f(X_0) = \int_{0+}^t f'(X_s)\, dX_s + \frac{1}{2} \int_{0+}^t f''(X_s)\, d[X]_s. \tag{6} \]
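Formula (6) can be checked numerically on a Brownian path with, say, $f(x) = x^2$, approximating the stochastic integral by left-point (Itô) Riemann sums and using $[B]_t = t$; the grid size and seed below are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(4)

    n, T = 200_000, 1.0
    dt = T / n
    dB = rng.normal(0.0, np.sqrt(dt), n)
    B = np.concatenate([[0.0], np.cumsum(dB)])

    f = lambda x: x**2
    fp = lambda x: 2.0 * x            # f'
    fpp = lambda x: 2.0 + 0.0 * x     # f''

    ito = np.sum(fp(B[:-1]) * dB)             # left-point sums: int f'(B) dB
    qv = 0.5 * np.sum(fpp(B[:-1]) * dt)       # (1/2) int f''(B) d[B], with [B]_t = t
    print(f(B[-1]) - f(B[0]), ito + qv)       # the two sides of (6) nearly agree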
A proof of this basic result (even in multidimensions) is found in virtually every book on stochastic calculus, and so we omit it, referring, e.g., to the companion volume (Rao [21], pp. 473 and 395), Protter [1], pp. 71-74, or Revuz and Yor [1], p. 139. We can now present another technical result, of constant use in problems involving change of measures, due essentially to Girsanov [1]. It plays a key role in both the testing and estimation aspects of inference, and therefore the details are included.
2. Theorem. Let $(\Omega, \Sigma, \mathcal{F}_t, P)$ and $(\Omega, \Sigma, \mathcal{F}_t, Q)$, $t \in [a, b]$, be a pair of probability spaces where the filtration is complete and right continuous in the usual sense, i.e., $\mathcal{F}_t \subset \mathcal{F}_{t'}$ for $t < t'$, $\mathcal{F}_t = \cap_{s > t} \mathcal{F}_s$, and each $\mathcal{F}_t$ is complete for $P$. Suppose the restrictions of $P$ and $Q$ to $\mathcal{F}_t$, $t \in [a, b]$, are equivalent, i.e., $P_{\mathcal{F}_t} \sim Q_{\mathcal{F}_t}$. If $X = \{X_t, \mathcal{F}_t, t \ge a\}$ is a continuous local martingale on $(\Omega, \Sigma, P)$, then
\[ Y_t = X_t - \int_a^t Z_s^{-1}\, d[X, Z]_s \tag{7} \]
is a continuous local martingale on $(\Omega, \Sigma, Q)$. Here one takes a continuous version (which exists) of the martingale $Z = \{Z_t = \frac{dQ_{\mathcal{F}_t}}{dP_{\mathcal{F}_t}}, \mathcal{F}_t, t \ge a\}$. In case $X$ is a BM on $(\Omega, \Sigma, P)$, then $Y$ is a BM on $(\Omega, \Sigma, Q)$.
Remark. This result will be useful in obtaining the likelihood ratios of measures $\nu_i$, $i = 1, 2$, if each is equivalent to the BM measure $\mu$, since then one can apply the chain rule and compute $\frac{d\nu_2}{d\nu_1} = \frac{d\nu_2}{d\mu} \frac{d\mu}{d\nu_1}$. In a number of important applications $\nu_i \sim \mu$ can be verified, and then this result applies, as the ensuing work shows.
Proof. First note that the hypothesis is weaker than $Q \sim P$, or even $Q \ll P$ or $P \ll Q$; e.g., consider $P, Q$ as Gaussian measures with zero means and covariances $r, \sigma^2 r$, $\sigma^2 \ne 1$, so that $P_t \sim Q_t$, $\forall t$, but $P \perp Q$, as seen in an example following Prop. 2.5, where for convenience we denote by $P_t, Q_t$ the restrictions of $P, Q$ to $\mathcal{F}_t$, and $a = 0$. Let $Z_t = \frac{dQ_t}{dP_t}$, and observe that $\{Z_t, \mathcal{F}_t, t \ge 0\}$ is a strictly
positive martingale on $(\Omega, \Sigma, P)$, where we may assume for simplicity that $\Sigma = \mathcal{F}_\infty$. In fact, for $s < t$ and $A \in \mathcal{F}_s \subset \mathcal{F}_t$, one has
\[ \int_A E^{\mathcal{F}_s}(Z_t)\, dP_s = \int_A Z_t\, dP_t = Q_t(A) = Q_s(A) = \int_A Z_s\, dP_s, \]
so that $E^{\mathcal{F}_s}(Z_t) = Z_s$ a.e. $[P]$, and since $P_t \sim Q_t$, also $Z_t > 0$ a.e. $[P]$. Also, the martingale convergence theorem implies that $Z_t \to Z_\infty$ a.e. and $E(Z_\infty) \le \liminf_{t \to \infty} E(Z_t)$, with equality iff the martingale is uniformly integrable, which in turn is equivalent to $Q \ll P$ by Vitali's convergence theorem.
Now observe that for any (local) martingale $U = \{U_t, \mathcal{F}_t, t \ge 0\}$ on $(\Omega, \Sigma, Q)$, the process $\{Z_t U_t, \mathcal{F}_t, t \ge 0\}$ is a (local) martingale on $(\Omega, \Sigma, P)$, and conversely. Indeed, if $A \in \mathcal{F}_s$, $s < t$, then by change of variables, since $dQ_t = Z_t\, dP_t$, one finds
\[ \int_A Z_t U_t\, dP_t = \int_A U_t\, dQ_t = \int_A U_s\, dQ_s = \int_A U_s Z_s\, dP_s, \]
by the martingale property of the $U$-process for $Q$, so that the same holds for $\{Z_t U_t, \mathcal{F}_t, t \ge 0\}$ on $(\Omega, \Sigma, P)$. In the local case, by definition of this concept, there is a sequence $T_n \uparrow \infty$ of stopping times of the filtration $\{\mathcal{F}_t, t \ge 0\}$ (cf. Section IV.2) such that $\{U_{t \wedge T_n}, \mathcal{F}_{t \wedge T_n}, t \ge 0\}$ is a martingale on $(\Omega, \Sigma, Q)$ for each $n \ge 1$. Then, by the above case, $\{(ZU)_{t \wedge T_n}, \mathcal{F}_{t \wedge T_n}, t \ge 0\}$ is a martingale on $(\Omega, \Sigma, P)$ for each $n$. It now follows from the optional stopping theorem (cf. Theorem IV.2.4), by letting $n \to \infty$, that the assertion holds in the local case. The converse is seen by writing $Z_t' = \frac{dP_t}{dQ_t}$ ($= \frac{1}{Z_t}$) and using $P_t \sim Q_t$, since the argument implies that $\{Z_t' V_t, \mathcal{F}_t, t \ge 0\}$ is a (local) martingale on $(\Omega, \Sigma, Q)$ if $\{V_t, \mathcal{F}_t, t \ge 0\}$ is such on $(\Omega, \Sigma, P)$. Hence, taking $V_t = U_t Z_t$ here, we get $\{U_t, \mathcal{F}_t, t \ge 0\}$ to be a $Q$-martingale, as asserted.
To establish (7), let $A_t = \int_0^t Z_s^{-1}\, d[X, Z]_s$. Since $[X, Z]$ is of locally bounded variation, as already noted, the pointwise Stieltjes integral exists and defines $A_t$ as a process of locally bounded variation. In fact, let $T_n = \inf\{t > 0 : Z_t \le \frac{1}{n}\}$, which is a stopping time of $\{\mathcal{F}_t, t \ge 0\}$. Then $T_n \uparrow T \le \infty$; to see that $T = \infty$, and conclude that $A_t < \infty$ a.e., note that for $T_n \ge t$ one has $|\int_0^t Z_s^{-1}\, d[X, Z]_s| \le n\, |[X, Z]|_t < \infty$ by (4). Also $\{Z_{t \wedge T_n}, \mathcal{F}_t, t \ge 0\}$ is again a $P$-martingale by Theorem IV.2.4. Now
\[ \int_{[T \le t]} Z_t\, dP_t \le \int_{[T_n \le t]} Z_t\, dP_t = \int_{[T_n \le t]} Z_{t \wedge T_n}\, dP_t, \quad \text{by the martingale property,} \]
\[ = \int_{[T_n \le t]} Z_{T_n}\, dP_t \le \frac{1}{n} \to 0. \]
Since $Z_t \ge 0$, it follows that $P_t[T \le t] = 0$, and we can take $Z_t = 0$ on $[T \le t]$ so that $Q_t[T \le t] = 0$; $t > 0$ being arbitrary, one gets $P[T < \infty] = 0$, whence $A_t$ is locally of bounded variation. We now apply Itô's formula (a two dimensional analog of (5)) with $f(x, y) = xy$ and conclude that:
\[ Z_t(X - A)_t - Z_0(X - A)_0 = \int_0^t (X - A)_s\, dZ_s + \int_0^t Z_s\, d(X - A)_s + [Z, X - A]_t \]
\[ = \int_0^t (X - A)_s\, dZ_s + \int_0^t Z_s\, dX_s - \int_0^t Z_s\, dA_s + [Z, X]_t = \int_0^t (X - A)_s\, dZ_s + \int_0^t Z_s\, dX_s, \tag{8} \]
since $dA_t = Z_t^{-1}\, d[Z, X]_t$, so that $\int_0^t Z_s\, dA_s = [Z, X]_t$.
But $\{Z_t, \mathcal{F}_t, t \ge 0\}$ and $\{X_t, \mathcal{F}_t, t \ge 0\}$ are local martingales, so that the sum on the right is a local martingale; consequently the left side of (8) is one also. Since $\{Z_t, \mathcal{F}_t, t \ge 0\}$ is a (local) martingale on $(\Omega, \Sigma, P)$, the process $\{X_t - A_t, \mathcal{F}_t, t \ge 0\}$ must, by the preceding paragraph, be a (local) martingale on $(\Omega, \Sigma, Q)$, which is (7). Finally, if the $X_t$-process is a BM, then $[X]_t = t$, and $X$ has a.a. continuous sample functions, as shown in Section IV.2; hence the process $X_t - A_t$ has a.a. continuous paths. Moreover, its quadratic variation is given by
\[ [X - A]_t = [X - A, X - A]_t = [X]_t + [A]_t - 2[X, A]_t = [X]_t = t, \]
by the bilinearity of $[\cdot, \cdot]$ and the fact that $A$ is locally of finite variation, whence $[A]_t = 0 = [X, A]_t$ a.e. This implies that $Y_t = X_t - A_t$ is a continuous local martingale with quadratic variation $t$. By a classical characterization of such processes, due to P. Lévy (cf., e.g., Rao [21], V.3.19, or Revuz and Yor [1], IV.3.7), $\{Y_t, \mathcal{F}_t, t \ge 0\}$ is a BM on $(\Omega, \Sigma, Q)$, as asserted.
It should be mentioned that this result, in the case that $X$ is a BM and the martingale integral is replaced by the classical Wiener integral, was established by Cameron and Martin before the general stochastic calculus was well developed. It was first extended by Girsanov [1], obtaining the last part of the theorem in the modern spirit, and the (local) martingale extensions emerged later, as stochastic analysis became firmly founded and was extended in the 1970s. The theorem generalizes without much difficulty if $X$ is a semimartingale, and perhaps even to $L^{p,q}$-bounded processes. The above result is sufficient for the present applications. We thus turn to diffusion processes. After introducing the concept, its relation to Girsanov's theorem will be explained.
3. Definition. A process $Y = \{Y_t, \mathcal{F}_t, t \ge 0\}$ is said to be of diffusion type relative to the BM $X = \{X_t, \mathcal{F}_t, t \ge 0\}$ if $Y$ is a solution of the stochastic equation
\[ dY_t = b(t)\, dt + a(t)\, dX_t \]
(9)
or equivalently of the integral equation
\[ Y_t - Y_0 = \int_0^t b(s)\, ds + \int_0^t a(s)\, dX_s, \tag{9'} \]
where $a(t), b(t)$ are both adapted to $\mathcal{F}_t$ for all $t \ge 0$; then $b(t)$ is called a drift and $a(t)$ a diffusion coefficient. [A diffusion process is a "strong" Markov process with continuous sample paths, where "strong" means that the Markov property holds when the constant times are replaced by stopping times in its definition.] This notion assumes the existence of a solution of (9), for which some Lipschitz type growth and integrability conditions have to be satisfied.
Let $Z = \{Z_t, \mathcal{F}_t, t \ge 0\}$ be a continuous local martingale such that $P[Z_t > 0] = 1$ for all $t$. Then the $Z_t$-process can be expressed as an exponential. Indeed, let $L_t = L_0 + \int_0^t Z_s^{-1}\, dZ_s$ ('$L$' for the logarithm of $Z_t$). Then $\{L_t, \mathcal{F}_t, t \ge 0\}$ is a continuous local semimartingale (the integral is defined, $Z_t$ being a locally $L^{2,2}$-bounded process). In terms of differentials this becomes $dL_t = Z_t^{-1}\, dZ_t$, and consequently one gets
\[ Z_t = Z_0 + \int_0^t Z_s\, dL_s. \tag{10} \]
This is well defined, since the $L_t$-process is locally $L^{2,2}$-bounded and hence qualifies as an integrator. The solution of this is $Z_t = \mathcal{E}_{Z_0}(L)_t$, where the exponential $\mathcal{E}_{Z_0}(L)_t = Z_0 \exp\{L_t - L_0 - \frac{1}{2}[L]_t\}$. [If $[L]_t = 0$ for all $t$, then this is the usual exponential solution $= Z_0 e^{L_t}$ of ODE theory.] That this is the solution of (10) was first obtained by Doléans-Dade [1] (even for general semimartingales), and a slightly specialized version by Isaacson [1]. In our case this is a direct consequence of Itô's formula (5), taking $f(x) = \log x$ and the process as $Z_t$. Thus
\[ f(Z_t) - f(Z_0) = \int_0^t \frac{1}{Z_s}\, dZ_s - \frac{1}{2} \int_0^t \frac{1}{Z_s^2}\, d[Z]_s. \]
Hence, taking exponentials and setting $L_t = \log Z_0 + \int_0^t \frac{1}{Z_s}\, dZ_s$, one gets
\[ Z_t = \exp\big[ L_t - \tfrac{1}{2}[L]_t \big], \tag{11} \]
where the definition of quadratic variation is used to deduce that $[L]_t = \int_0^t \frac{1}{Z_s^2}\, d[Z]_s$. The representation is also unique, since if there were another such, their difference would be a constant martingale starting at $0$, hence identically zero. Thus $dQ_t = \mathcal{E}_{Z_0}(L)_t\, dP_t$ holds. Using the definition of covariation (see (3)), one has
\[ [X, Y]_t = \operatorname{p\,lim}_{|\pi| \to 0} \sum_{i=0}^{n-1} (X_{t_{i+1}} - X_{t_i})(Y_{t_{i+1}} - Y_{t_i}), \]
and this implies (after approximating the integrals using (1)) that $[X, L]_t = \int_0^t \frac{1}{Z_s}\, d[X, Z]_s$. Hence the representation (7) in Girsanov's theorem can be expressed as:
\[ Y_t = X_t - \int_0^t \frac{1}{Z_s}\, d[X, Z]_s = X_t - [X, L]_t, \tag{12} \]
where $L_t$ is given above as the "logarithm of $Z_t$". Since the exponential $\mathcal{E}_{Z_0}(L)_t \ne 0$, one can express $dP_t = (\mathcal{E}_{Z_0}(L)_t)^{-1}\, dQ_t$, and the new local martingale $\{(\mathcal{E}_{Z_0}(L)_t)^{-1}, \mathcal{F}_t, t \ge 0\}$ can be simplified using (12) as follows. Now (12) is true for any given (local) martingale $X$, where $Z_t$ is a fixed strictly positive (local) martingale which thus defines $L_t$ abstractly. So we may let $X = L$ in (12), and denote the new process by $\tilde L_t = L_t - [L, L]_t$; by definition of $[\cdot, \cdot]$ one finds $[\tilde L, \tilde L] = [L, L]$. Hence from (11) it follows that
\[ \mathcal{E}_{Z_0}(-\tilde L)_t = \exp\big\{ -\tilde L_t - \tfrac{1}{2}[\tilde L, \tilde L]_t \big\} = \exp\big\{ -L_t + [L, L]_t - \tfrac{1}{2}[L, L]_t \big\} = \exp\big\{ -L_t + \tfrac{1}{2}[L, L]_t \big\} = \mathcal{E}_{Z_0}(L)_t^{-1}. \]
We state this result for reference (see, e.g., Revuz and Yor [1], VIII.1.7) as:
4. Proposition. Let $P, Q$ be a pair of probability measures on $(\Omega, \Sigma)$ and $\{\mathcal{F}_t, t \ge 0\}$ a standard filtration with $P_t, Q_t$ as restrictions to $\mathcal{F}_t$ and $P_t \sim Q_t$, so that $\frac{dQ_t}{dP_t} = Z_t > 0$ a.e. (whence $Z = \{Z_t, \mathcal{F}_t, t \ge 0\}$ is a $P$-martingale). Suppose that each $\mathcal{F}_t$ is completed for $P_t$ (hence also
for $Q_t$). Let $L_t = \log Z_0 + \int_0^t \frac{1}{Z_s}\, dZ_s$ be the local martingale obtained from the $Z_t$-process. Then one has $Z_t = \mathcal{E}_{Z_0}(L)_t$ given by (11) (so that $dQ_t = \mathcal{E}_{Z_0}(L)_t\, dP_t$). Moreover, for any given continuous local martingale $X = \{X_t, \mathcal{F}_t, t \ge 0\}$, one has
\[ \tilde X_t = X_t - \int_0^t \frac{1}{Z_s}\, d[X, Z]_s = X_t - [X, L]_t \tag{13} \]
to be a continuous local $Q$-martingale, and $dP_t = \mathcal{E}_{Z_0}(-\tilde L)_t\, dQ_t$ holds.
The relation $\mathcal{E}_{Z_0}(L)_t\, \mathcal{E}_{Z_0}(-\tilde L)_t = 1$ is actually a special case of a more general identity, which we state for comparison. Since $Z_0 > 0$, by replacing $Z_t$ with $Z_t - Z_0$ as in (11), we can replace $Z_0$ by $1$ in the exponential expression. Thus, if $X^i = \{X_t^i, \mathcal{F}_t, t \ge 0\}$, $i = 1, 2$, are a pair of semimartingales with $[X^1, X^2]_t$ as their covariation process, then one has Yor's formula:
\[ \mathcal{E}(X^1)_t\, \mathcal{E}(X^2)_t = \mathcal{E}(X^1 + X^2 + [X^1, X^2])_t, \quad t \ge 0. \tag{14} \]
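Formula (14) is easy to sanity-check numerically in the simplest continuous case $X^1 = X^2 = B$, a BM, where both sides equal $\exp(2B_t - t)$; the discretization below is our own illustrative device, not part of the text.

    import numpy as np

    rng = np.random.default_rng(6)

    n, T = 100_000, 1.0
    t = np.linspace(0.0, T, n + 1)
    B = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(T / n), n))])

    exp_mart = lambda X, qv: np.exp(X - 0.5 * qv)   # E(X)_t for continuous X, [X] = qv
    lhs = exp_mart(B, t) * exp_mart(B, t)           # E(X^1)_t E(X^2)_t with X^i = B
    rhs = exp_mart(2 * B + t, 4 * t)                # E(X^1 + X^2 + [X^1, X^2])_t
    print(np.max(np.abs(lhs - rhs)))                # ~ 0 up to rounding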
We shall sketch a proof in the exercises (cf. Exercise 6.6). An important feature of all these results is that they have a common filtration.
Our first application of Theorem 2 is calculating the density (or likelihood ratio) of a measure governing a diffusion process that is absolutely continuous (or equivalent) relative to the BM measure. Thus consider the process $X = \{X_t, \mathcal{F}_t, t \ge 0\}$ given as a solution of the SDE
\[ dX_t = a_t\, dt + dB_t, \quad 0 \le t \le T, \tag{15} \]
where $B = \{B_t, \mathcal{F}_t, t \ge 0\}$ is the BM on $(\Omega, \Sigma, P)$. It will follow from later analysis that the solution of (15) exists under conditions such as $\int_0^T |a_t|\, dt < \infty$ a.e. Let $Q$ be the measure governing $X$. Following Liptser and Shiryayev [1], but slightly simplifying it, we establish the following. [$(\Omega, \Sigma)$ will be taken canonical hereafter.]
0
T
1 as dBs + 2
0
T
a2s ds}, a.e.,
(16)
provided {at , Ft , t ≥ 0} is a locally bounded measurable process, so that 2 t a ds, 0 a2s ds exist with probability one. Moreover, if the B is a BM 0 s on (Ω, Σ, PT ) then X is a BM on (Ω, Σ, QT ).
Proof. The main task is to construct a strictly positive (local) martingale $Z = \{Z_t, \mathcal{F}_t, t \ge 0\}$ on $(\Omega, \Sigma, P)$ such that $E_P(Z_t) = 1$, whence a new probability $Q$ on $(\Omega, \Sigma)$ can be defined. It should have the property $Q_t \sim P_t$, or more particularly $dQ_t = Z_t\, dP_t$, and govern the solution process $X$. For this crucial requirement we consider (without motivation!) the process $Z_t = \exp\{-\int_0^t a_s\, dB_s - \frac{1}{2}\int_0^t a_s^2\, ds\}$ and verify the various conditions. It is evidently $\mathcal{F}_t$-adapted and $P[Z_t > 0] = 1$. Recall that the moment generating function of the normal random variable $V \sim N(0, 1)$ is given by $E(e^{uV}) = e^{\frac{1}{2}u^2}$. This fact implies, for simple functions of the form $a = \sum_{i=0}^{n-1} a_{t_i} \chi_{(t_i, t_{i+1}]}$, where $0 = t_0 < \cdots < t_n \le t \le T < \infty$, and the BM process $B$ ($a_t$ being $\mathcal{F}_t$-adapted):
\[ E(Z_t) = E\Big( E^{\mathcal{F}_t}\Big( \exp\Big\{ -\sum_{i=0}^{n-1} a_{t_i}(B_{t_{i+1}} - B_{t_i}) - \frac{1}{2} \sum_{i=0}^{n-1} a_{t_i}^2 (t_{i+1} - t_i) \Big\} \Big) \Big) = 1. \tag{17} \]
In fact, since $B$ has independent Gaussian increments, $B_s - B_t \sim N(0, |s - t|)$, one has
\[ E^{\mathcal{F}_t}\Big( \exp\Big( -\int_0^t a_s\, dB_s \Big) \Big) = E^{\mathcal{F}_t}\Big[ \prod_{i=0}^{n-1} E^{\mathcal{F}_{t_i}}\big( e^{-a_{t_i}(B_{t_{i+1}} - B_{t_i})} \big) \Big] = E^{\mathcal{F}_t}\Big[ \prod_{i=0}^{n-1} e^{\frac{1}{2} a_{t_i}^2 (t_{i+1} - t_i)} \Big] = e^{\frac{1}{2} \int_0^t a_s^2\, ds}, \]
since $a_s$ is $\mathcal{F}_s$-adapted.
This implies (17) for simple functions. The general case then follows by a familiar approximation of the locally integrable function $a$. Moreover, $Z = \{Z_t, \mathcal{F}_t, t \ge 0\}$ is a (local) martingale, since for $0 < s < t < T$ and any bounded Borel $a$ one has
\[ E^{\mathcal{F}_s}(Z_t) = E^{\mathcal{F}_s}\Big[ Z_s\, e^{-\int_s^t a_u\, dB_u - \frac{1}{2}\int_s^t a_u^2\, du} \Big] = Z_s\, E^{\mathcal{F}_s}\Big( e^{-\int_s^t a_u\, dB_u - \frac{1}{2}\int_s^t a_u^2\, du} \Big) = Z_s, \quad \text{a.e.,} \]
the last factor being unity by the same (moment generating) trick as before. With this process in hand, define $Q_t$ by $dQ_t = Z_t\, dP_t$. Then $Q_t \sim P_t$, $\forall t \ge 0$, and by Theorem 2, $\{X_t, \mathcal{F}_t, t \ge 0\}$ is a BM on $(\Omega, \Sigma, Q_T)$, where $Q_T$ is determined by the consistent family $\{Q_t, t \ge 0\}$. To ensure the σ-additivity of $Q_T$ we took the measure space to be canonical, $\Omega = \mathbb{R}^{[0,T]}$, with $\Sigma$ the cylinder σ-algebra. Thus $\omega = x \in \mathbb{R}^{[0,T]}$, and one finds that $Q_T$ is the measure governing the $X$-process, since for any cylinder set $A$
\[ Q_T[X_T \in A] = \int_{[X_T \in A]} Z_T\, dP_T, \]
implying $E^{\mathcal{F}_t}(Z_T) = Z_t$ a.e., and
\[ \frac{dQ_T}{dP_T}(x) = Z_T^{-1} = \exp\Big\{ \int_0^T a_s\, dB_s + \frac{1}{2} \int_0^T a_s^2\, ds \Big\}, \]
so that (16) follows.
Finally, since $B$ is a BM on $(\Omega, \Sigma, P_T)$, we express $Z$, satisfying an integral (or a differential) equation, using Itô's formula (5), to verify (7). Thus let $f(u) = e^u$ and $U_t = -\int_0^t a_s\, dB_s - \frac{1}{2}\int_0^t a_s^2\, ds$, so that $[U]_t = \int_0^t a_s^2\, ds$ (because $[B]_t = t$) and $Z_t = f(U_t)$:
\[ Z_t - 1 = f(U_t) - f(U_0) = \int_0^t f'(U_s)\, dU_s + \frac{1}{2} \int_0^t f''(U_s)\, d[U]_s = \int_0^t Z_s\, dU_s + \frac{1}{2} \int_0^t Z_s a_s^2\, ds \]
\[ = -\int_0^t Z_s a_s\, dB_s - \frac{1}{2} \int_0^t Z_s a_s^2\, ds + \frac{1}{2} \int_0^t Z_s a_s^2\, ds = -\int_0^t Z_s a_s\, dB_s. \tag{18} \]
Consequently,
\[ [Z, B]_t = \Big[ 1 - \int_0^{(\cdot)} Z_s a_s\, dB_s,\ B \Big]_t = 0 - \int_0^t Z_s a_s\, d[B]_s = -\int_0^t Z_s a_s\, ds. \]
It follows that
\[ B_t - \int_0^t Z_s^{-1}\, d[Z, B]_s = B_t + \int_0^t a_s\, ds = X_t, \quad \text{by (15)}, \]
which, in view of (7), shows that $X$ is a BM on $(\Omega, \Sigma, Q_T)$.
It should be noted that, in general, the consistent family $\{Q_t, \mathcal{F}_t, t \ge 0\}$ on $(\Omega, \Sigma)$ determines only an additive set function. It is for this reason that one replaces the abstract system by the canonical system, in which the $Q_t, P_t$ are regular on $\mathcal{F}_t$, allowing $Q_T$ to be a probability measure (cf., e.g., Rao [21], Theorem I.3.4 for detail). Thus the canonical representation is usually adopted in applications to avoid this (possible) difficulty.
We next consider a general diffusion type process of the form
\[ dX_t = a_t\, dB_t + b_t\, dt \]
(19)
where $\{a_t, b_t, \mathcal{F}_t, t \ge 0\}$ are processes adapted (also termed nonanticipative) to the same filtration. Thus $a_t \ge 0$ is the diffusion and $b_t \in \mathbb{R}$ the drift coefficient process in this case also. The main idea here is to reduce (19) to (15) relative to a new BM on $(\Omega, \Sigma, \tilde Q_T)$, where $\tilde Q_T \sim P_T$ (or at least $\tilde Q_T \ll P_T$), for suitable diffusion and drift coefficient processes which are functions of the given $a_t, b_t$. Consider a new measure $\tilde Q_T$ defined by $d\tilde Q_T = \tilde Z_T\, dP_T$, where
\[ \tilde Z_T = \exp\Big\{ -\int_0^T \Big(\frac{b}{a}\Big)_s\, dB_s - \frac{1}{2} \int_0^T \Big(\frac{b}{a}\Big)_s^2\, ds \Big\}, \tag{20} \]
and let
\[ \tilde B_t = B_t + \int_0^t \Big(\frac{b}{a}\Big)_s\, ds. \tag{21} \]
By the above result $\tilde B = \{\tilde B_t, \mathcal{F}_t, t \ge 0\}$ is a BM on $(\Omega, \Sigma, \tilde Q_T)$, and moreover
\[ a_t\, d\tilde B_t = a_t\, dB_t + b_t\, dt = dX_t, \quad \text{by (19)}. \]
Hence one has
\[ \frac{d\tilde Q_T}{dP_T}\Big|_{\mathcal{F}_t} = E^{\mathcal{F}_t}\Big[ \exp\Big\{ \int_0^T \Big(\frac{b}{a}\Big)_s\, dB_s + \frac{1}{2} \int_0^T \Big(\frac{b}{a}\Big)_s^2\, ds \Big\} \Big], \tag{22} \]
giving the desired solution. This will be recorded as follows (and in $dP_T/d\tilde Q_T$ one uses the fact that $X_t$ is a BM relative to the $\tilde Q_T$-measure):
6. Proposition. Let the diffusion type process (19) be given, where the functions $a, b$ are adapted to the BM filtration. Suppose also that $b$ and $\frac{b}{a}$ are locally integrable a.e. and $a > 0$. Then the $X_t$-process is a BM on $(\Omega, \Sigma, \tilde Q_T)$, where $d\tilde Q_T = \tilde Z_T\, dP_T$, the $\tilde Z_T$ being given by (20) and the likelihood ratio by (22), and moreover one has:
\[ \frac{dP_T}{d\tilde Q_T} = \exp\Big\{ -\int_0^T \Big(\frac{b}{a}\Big)(X_s)\, dX_s + \frac{1}{2} \int_0^T \Big(\frac{b}{a}\Big)^2(X_s)\, ds \Big\}. \]
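For completeness, here is how a path of the diffusion type equation (19) is typically generated numerically: an Euler-Maruyama sketch under our own illustrative coefficients (the book's construction itself is measure-theoretic, not numerical).

    import numpy as np

    rng = np.random.default_rng(8)

    def euler_maruyama(a, b, x0, T, n):
        """One path of dX_t = a(t, X_t) dB_t + b(t, X_t) dt, as in (19)."""
        dt = T / n
        t = np.linspace(0.0, T, n + 1)
        x = np.empty(n + 1)
        x[0] = x0
        dB = rng.normal(0.0, np.sqrt(dt), n)
        for i in range(n):
            x[i + 1] = x[i] + b(t[i], x[i]) * dt + a(t[i], x[i]) * dB[i]
        return t, x

    # e.g. a mean-reverting drift with unit diffusion (arbitrary example)
    t, x = euler_maruyama(a=lambda t, x: 1.0, b=lambda t, x: -2.0 * x, x0=1.0, T=1.0, n=1000)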
The same argument extends if (19) is replaced by a pair of equations with different drifts but the same diffusion:
\[ dX_t^i = a_t\, dB_t + b_t^i\, dt, \quad i = 1, 2. \tag{23} \]
Since each $X^i$ is a BM on $(\Omega, \Sigma, Q_T^i)$, one can obtain the likelihood ratio of the measures of the $X^i$-processes. Let $a_t = a(t, X_t^i)$, $b_t^i = b^i(t, X_t^i)$,
and since $Q_T^1 \sim P_T \sim Q_T^2$, with (20) and (22) one gets (as is easily verified from the preceding):
\[ \frac{dQ_T^2}{dQ_T^1} = \exp\Big\{ -\int_0^T \Big( \frac{b^1 - b^2}{a} \Big)(s, X_s^1)\, dX_s^1 + \frac{1}{2} \int_0^T \Big( \frac{(b^1)^2 - (b^2)^2}{a^2} \Big)(s, X_s)\, ds \Big\}. \tag{24} \]
This formula is of interest for testing the hypothesis $H_0 : b^1 = b^2$ against the alternative $H_1 : b^1 \ne b^2$ using the Neyman–Pearson–Grenander theorem. All the other likelihood ratios are employed for similar test procedures.
Both these results have multidimensional extensions. We present a form of this as a final item of this type of analysis. First it is convenient to state the definition of the BM useful in such a case. Let $B = \{B_t, \mathcal{F}_t, t \ge 0\}$ be a vector valued process on $(\Omega, \Sigma, P)$, i.e., $B : \mathbb{R}_+ \times \Omega \to H$, where $(H, (\cdot, \cdot))$ is a separable Hilbert space. It is called a (vector) Brownian Motion if (i) $P[B_0 = 0] = 1$; (ii) $t \to B_t(\omega)$ is continuous for a.a. $\omega \in \Omega$; (iii) it has independent increments; and (iv) for $0 \le s < t$, $B_t - B_s$ has mean zero and covariance $E[(x, B_t - B_s)(y, B_t - B_s)] = (t - s)(Lx, y)$, $x, y \in H$, where $L$ is a selfadjoint positive definite (linear) operator or matrix on $H$ with finite trace. The last condition means $\sum_{n=1}^\infty (Le_n, e_n) < \infty$ for some (and then any) orthonormal basis $\{e_n, n \ge 1\}$. [If $H = \mathbb{R}^n$, one can take $L = \mathrm{id}$, the identity matrix, and these conditions characterize the BM. In the infinite dimensional case the condition on $L$ is essential.] With this concept at hand it can be shown that $B$ satisfies an $L^{2,2}$-boundedness condition, so that the stochastic integrals can be defined, extending the finite dimensional case (cf. Rao [21], Sec. 6.3). Thus (19) becomes
\[ dX_t^i = a(t, X_t^i)\, dB_t + b^i(t, X_t^i)\, dt, \quad i = 1, 2, \tag{25} \]
where $a(\cdot, \cdot)$ is the diffusion and the $b^i(\cdot, \cdot)$ are the drift operators on the (state) space $H$, each being $\mathcal{F}_t$-adapted. Although the proofs depend on results from some (standard) Hilbert space theory, the general statements are essentially the same as in the scalar case. To give a feeling for the multidimensional version, we present a proposition corresponding to Theorem 5, which, in case $H = \mathbb{R}^n$, depends on a nontrivial extension of Girsanov's theorem from Stroock and Varadhan (cf. [1], Theorem 6.4.2), and on its Hilbert space generalization due to Krinik (cf. [1], Theorem 6.2).
$1, 2$, are given in a separable Hilbert space $H$, where $B = \{B_t, \mathcal{F}_t, 0 \le t \le T\}$ is an $H$-valued BM, as defined above, relative to a trace class operator $L$. To assure the existence of the $X^i$, one assumes that (i) $b^i(\cdot, \omega)$ is (locally) bounded and (strongly) measurable on $\mathbb{R}_+$ for a.a. $\omega \in \Omega$, and (ii) $a(t, \omega)$ is bounded and strongly elliptic, i.e., there exist constants $C_i > 0$ such that for all $x \in H$ [$(x, x) = \|x\|^2$ in $H$]
\[ C_1(x, x) \le (x, a(t, \omega) a^*(t, \omega) x) \le C_2(x, x), \quad t \ge 0,\ \omega \in \Omega, \tag{26} \]
holds, where $a^*$ is the adjoint of $a$. [These conditions not only assure the existence of $X^i$ satisfying (25), but (26) implies that $a^{-1}$ is also well defined.] Suppose further that the drift coefficients $b^i$ are representable, for some uniformly bounded $\mathcal{F}_t$-adapted $g^i(t, \cdot)$, as $b^i = aLg^i$, $i = 1, 2$. Then the measures $Q_T^i$ induced by the $X^i$ satisfy $Q_T^1 \sim Q_T^2$ (i.e., are equivalent), and their likelihood ratio is given by
\[ \frac{dQ_T^2}{dQ_T^1}\Big|_{\mathcal{F}_t} = \exp\Big\{ \int_0^t \big( (a^*)^{-1}(u)(g^2 - g^1)(u),\ dX_u^1 \big) - \frac{1}{2} \int_0^t \big[ (g^2(u), Lg^2(u)) - (g^1(u), Lg^1(u)) \big]\, du \Big\}. \tag{27} \]
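A finite-dimensional special case of (27) is easy to check by simulation: take $H = \mathbb{R}^d$ with $L = I$, $a = I$ and constant $g^1, g^2$ (all values below are hypothetical choices of ours), so the exponent reduces to $(g^2 - g^1)\cdot X_T^1 - \frac{1}{2}T(|g^2|^2 - |g^1|^2)$, and the ratio integrates to one under $Q_T^1$.

    import numpy as np

    rng = np.random.default_rng(10)

    d, T, n_rep = 3, 1.0, 200_000
    g1 = np.array([0.2, -0.1, 0.4])
    g2 = np.array([0.5, 0.0, -0.3])

    W = rng.normal(0.0, np.sqrt(T), size=(n_rep, d))
    X1_T = W + g1 * T                               # X^1 under Q^1: drift b^1 = g^1
    log_lr = X1_T @ (g2 - g1) - 0.5 * T * (g2 @ g2 - g1 @ g1)
    print(np.exp(log_lr).mean())                    # ~ 1 : E_{Q^1}(dQ^2/dQ^1) = 1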
This is an exact generalization of Theorem 5, and the many details needed for a complete proof can be found in Krinik's work referred to above. The preceding account also shows the intimate relationship between the likelihood ratio theory of diffusions and the SDE solutions, leading to different and important aspects of stochastic analysis.
5.6 Complements and exercises
1. Let $(\Omega, \Sigma, \mathcal{F}_n, n \ge 1)$ be a filtered measurable space carrying two probability measures $P$ and $Q$ such that
$P_n = P|_{\mathcal{F}_n} \sim Q_n = Q|_{\mathcal{F}_n}$, $n \ge 1$. If $f_n = \frac{dQ_n}{dP_n}$ and $g_n = \frac{dP_n}{dQ_n}$, suppose that $f_n \to f$ in $P$-probability and $g_n \to g$ in $Q$-probability, where $f, g$ are random variables. If $\mathcal{F}_\infty = \sigma(\cup_n \mathcal{F}_n)$, show that $P_\infty \sim Q_\infty$. [This result is useful in applications of Theorem IV.1.1, as well as in the theory of Section 5 of the present chapter.]
2. This and the next two problems are on admissible translates. Let $\Omega$ be a separable Hilbert space with $\Sigma$ as its Borel σ-algebra and $P$ a probability measure on it. An element $f \in \Omega$ is an admissible mean for $P$ if for each $A \in \Sigma$, $P_f(A) = P(A - f)$ defines a $P$-continuous measure, with $A - f$ denoting the vector difference set, which is in $\Sigma$. If $M_P$ is the set of admissible means of $P$, related to Proposition 1.1, show that there exists a positive definite Hilbert–Schmidt operator $S : \Omega \to \Omega$
such that $M_P \subset S\Omega$, in the sense that for each $f \in M_P$ there is an $h \in \Omega$ satisfying $f = Sh$ (the inclusion may be strict, however). [Hints: Consider the Fourier transform $C : \ell \to \int_\Omega e^{i(\ell, f)}\, dP(f)$, $\ell \in \Omega'$, the adjoint of $\Omega$, with $(\cdot, \cdot)$ the duality pairing. This exists and is continuous in the "topology of Sazonov", i.e., there is a positive definite finite trace operator $B : \Omega' \to \Omega$ such that $C(\ell) \to C(\ell_0)$ as $(B(\ell - \ell_0), (\ell - \ell_0)) \to 0$ (cf. Bourbaki [1], Chapitre IX, p. 92). If now $\rho_f = \frac{dP_f}{dP}$, $f \in M_P$, then as $(B\ell, \ell) \to 0$ and $n \to \infty$ one has
\[ |C_f(\ell) - 1|^2 \le 2n\, \mathrm{Re}(1 - C(\ell)) + 4 \int_\Omega (\rho_f - \rho_f^n)\, dP \to 0, \]
where $C_f(\ell) = \int_\Omega e^{i(\ell, \omega)} \rho_f(\omega)\, dP(\omega)$ and $\rho_f^n = \rho_f \chi_{[|f| \le n]}$. Verify that $x_f : \ell \to (\ell, B^{-\frac{1}{2}} f)$, $\ell \in \Omega'$, is a continuous linear functional on $\Omega$, so that there is a unique $g \in \Omega$ with $g = B^{-\frac{1}{2}} f$. Taking $B = S^2$, which is Hilbert–Schmidt, one finds $f \in S\Omega$. The semi-group property under addition follows from $\frac{dP_{f_1 + f_2}}{dP}(\omega) = \frac{dP_{f_1}}{dP}(\omega)\, \frac{dP_{f_2}}{dP}(\omega - f_1)$, a.a. $\omega$. See also Skorokhod [3] in connection with this result.]
3. Let $(\Omega, \Sigma, P)$ and $M_P$ be as in the above problem. We outline conditions for the vector space structure of $M_P$ and present the likelihood ratios. This simplifies some work of Skorokhod [3] and also generalizes it at the same time, using a slightly different method. (Cf. Rao [4(f)], where more details can be found.) Let $I$ be the set of all finite dimensional subspaces of $\Omega$, directed by inclusion, i.e., for $\alpha, \beta \in I$, $\alpha < \beta$ iff $\Omega_\alpha \subset \Omega_\beta$ (thus $\alpha \leftrightarrow \Omega_\alpha$). Let $\Pi_{\alpha\beta} : \Omega_\beta \to \Omega_\alpha$ for $\alpha < \beta$ and $\Pi_\alpha : \Omega \to \Omega_\alpha$ be the coordinate (or orthogonal) projections. Then $\{\Omega_\alpha, \Pi_{\alpha\beta}, \alpha < \beta \in I\}$ is a compatible family, or projective system, whose limit exists and is homeomorphic to $\Omega$ when $\Omega$ is endowed with its weak topology. Since $\Omega$ is separable, one may identify it isomorphically with $\mathbb{R}^{\mathbb{N}}$, and $\Omega_\alpha$ with $\mathbb{R}^n$ if $\mathrm{card}(\alpha) = n$ (cf. Schwartz [1], p. 180). Let $P^\alpha = P \circ \Pi_\alpha^{-1}$ be the image of $P$ on $\mathbb{R}^n$ ($\equiv \Omega_\alpha$), which thus is a regular probability on $\mathcal{B}_\alpha$, the Borel σ-algebra of $\mathbb{R}^n$. If $P_\alpha = P|_{\mathcal{F}_\alpha}$, $\mathcal{F}_\alpha$ being the trace algebra of $\Sigma$ on $\Omega_\alpha$, then one can identify $P$ as the projective limit of the $P^\alpha \circ \Pi_\alpha^{-1}$, by the general theory. For each $t \ge 0$ and $a \in M_P$, let $Q^t = P_{ta}$, $Q_\alpha^t = Q^t|_{\mathcal{F}_\alpha}$, denoted $Q_n^t$ if $\mathrm{card}(\alpha) = n$. Then $Q^t$ is the corresponding projective limit of $(Q_\alpha^t, \Pi_\alpha)$. Now $P \perp Q^t$ if $P_\alpha \perp Q_\alpha^t$ for some $\alpha$, but $P_\alpha \sim Q_\alpha^t$ for each $\alpha \in I$ need not imply $P \sim Q^t$. Assume that (*) $Q_\alpha^t$ and $P^\alpha$ on $\mathcal{B}_\alpha$ are both continuous relative to the Lebesgue measure $\mu_n$, so that $Q_\alpha^t, P_\alpha$ are dominated by the σ-finite measure $\mu_n \circ \Pi_n^{-1} = \mu^n$ (say). Let $f_n = \frac{dQ_n^t}{d\mu_n}$, $g_n = \frac{dP^n}{d\mu_n}$, and $h_n = \frac{f_n}{g_n}$, $n \ge 1$.
(a) Show that $\{h_n, \mathcal{F}_n, n \ge 1\}$ is a martingale, $h_n \to h$ a.e., and moreover $h = \frac{dQ^t}{dP}$ iff the martingale is uniformly $P$-integrable. [Compare
with Exercise 1 above.]
(b) Let $\tilde g_n = \frac{dQ_n^t}{d\mu_n}$, $\tilde f_n = \frac{dP^n}{d\mu_n}$, $\tilde h_n = \frac{\tilde g_n}{\tilde f_n}$. Then $h_n = \tilde h_n \circ \Pi_n$, but $\{\tilde h_n, \mathcal{B}_n, n \ge 1\}$ is not a martingale, as the spaces are not nested. Verify that
\[ Q_n^t \circ \Pi_n^{-1}(A) = \int_A \tilde f_n(x - \Pi_n(ta))\, d\mu_n(x), \]
and $h_n(\omega, a) = \frac{\tilde f_n(\Pi_n \omega - t\Pi_n a)}{\tilde f_n(\Pi_n \omega)}$, a.e. $[\mu_n]$, so that $f_n = \tilde f_n \circ \Pi_n$, and the domain of $\tilde f_n$ has the usual differential structure.
(c) Assume further that (+) $\tilde f_n$ is continuously differentiable in $\mathbb{R}^n$ and $\tilde f_n > 0$ a.e. $[\mu_n]$. Let the gradient be $\nabla \tilde f_n = (\frac{\partial \tilde f_n}{\partial x_1}, \cdots, \frac{\partial \tilde f_n}{\partial x_n})$, which exists and is given by the directional derivative $((\nabla \tilde f_n)(x), y) = \frac{d \tilde f_n(x + sy)}{ds}\big|_{s=0}$, where the scalar product notation in $\mathbb{R}^n$ is used. Define $h_n^*, \tilde h_n^*$ for $a \in M_P$ by
\[ \tilde h_n^*(x, \Pi_n a) = h_n^*(\omega, a) = \frac{(\nabla \tilde f_n(\Pi_n \omega), \Pi_n a)}{\tilde f_n(\Pi_n \omega)}, \quad x = \Pi_n \omega,\ \omega \in \Omega. \]
Show that $\{h_n^*(\cdot, a), \mathcal{F}_n, n \ge 1\}$ is a martingale on $(\Omega, \Sigma, P)$ and
\[ \int_0^t h_n^*(\omega - sa, a)\, ds = -\log h_n(\omega, a). \tag{**} \]
(d) We now present a general condition for the above martingale to converge to $h^*$, as follows. Let $\Phi$ be a twice continuously differentiable nice Young function [also called an N-function], so that $\Phi(t) = 0$ iff $t = 0$, and it is a nonnegative symmetric convex function on $\mathbb{R}$ satisfying (i) $|t\Phi''(t)| \le C < \infty$, $\forall t \ge 0$, and (ii) for its complementary N-function $\Psi$, defined by $\Psi(x) = \sup\{|x|y - \Phi(y) : y \ge 0\}$, one has for some $\beta > 0$
\[ \sup_n \int_\Omega \Psi(\beta h_n^*(\omega, a))\, dP(\omega) \le C_1 < \infty. \]
Under these conditions it is asserted that $ta \in M_P$ for all $t \in \mathbb{R}$, and also $h(\cdot, ta) = \frac{dQ^t}{dP}$, which is given by:
\[ h(\omega, ta) = \exp\Big\{ -\int_0^t h^*(\omega - sa, a)\, ds \Big\}, \quad \text{a.e. } [P]. \]
Moreover, $P_{ta} \sim P$. We remark that if $\Phi(t) = |t| \log^+ |t|$, so that $\Psi(t) = e^{|t|} - |t| - 1$, then this reduces to the case treated in Skorokhod [3]. But $\Psi(t) = e^{t^2} - 1$ also satisfies (i), (ii), although its complementary
V. Likelihood Ratios for Processes
function Φ cannot be expressed by an explicit formula. (Regarding the analysis of Young functions and their properties, may refer to, e.g., one n (ω) ) dPn (ω). Rao and Ren [1].) [Hints: Consider In (t) = Ω Φ ( fnf(ω+ta) Then the hypothesis implies In is differentiable and satisfies In (t) ≤ CC1 C β In (t) + β . By integrating and using Gronwall’s inequality, it will be found that In (t) ≤ max(
C CC1 , ) exp[(Φ (1) + 1)t]. β β
Using the growth condition (i) on Φ, and the definition of In (t) it is seen that fn (ω − ta) dPn ≤ sup In (t) < ∞, sup Φ fn (ω) n n Ω on compact t-intervals. Hence by the de la Valle´e Poussin criterion (cf., e.g., Rao and Ren [1],p.3), {hn (·, a), n ≥ 1} is uniformly integrable, and ta then hn (ω, ta) → h(ω, ta) = dP dP (ω), a.e.[P ], for any t ∈ R, so ta ∈ MP and MP is linear. (e) Next with a further justification using (**), show that the likelihood ratio to be (taking Σ = σ(∪α Fα ) for convenience): t dPta (ω) = h(ω, ta) = exp − h∗ (ω − sa, a) ds , a.e.[P ] dP 0 It may be noted that the smoothness hypothesis of the densities played a key role not only in showing Pta ∼ P , but also in the derivation of the likelihood ratio and the conclusion that it belongs to the exponential family. It will be interesting to find the distribution of the ratio and to determine its (possible) membership in an i.d. family, analogous to that found in Theorem 5.8.] 4. The preceding result gives sufficient conditions for the set MP of admissible means of P to be linear. But it is not a simple problem to decide whether a given element a ∈ Ω is in MP , even when Ω is a separable Hilbert space. The functional hn in Problem 3 can be defined for many a ∈ Ω − MP . The following example, due to Skorokhod [3], illustrates this point for a non Gaussian measure Q. First let P be a Gaussian measure on (Ω, Σ) with mean zero and covariance operator B which is necessarily Hilbert-Schmidt; so it has a discrete set of eigenvalues (with possibly a limit point zero), and a countable set of eigenfunctions denoted x1 , x2 , . . . , (xn ∈ Ω). Let Ωn be the linear span of x1 , . . . , xn and Ωn = Ω⊥ n , and suppose that dQ = f dP is defined with the continuous f given by f (x) = C exp{−[|x|2 − 1]−1 }χ[|x|>1]
331
5.6 Complements and exercises
where Ω f dP = C −1 . If pn , qn are the densities of the marginals of P, Q on Ωn , and rn denotes their likelihood ratio, it can be expressed as: qn (x) = rn (|x|2 ) = f (x + y) dQn (y). pn (x) n Ω t More explicitly, if gn is the density so −∞ gn (v) dv = Pn [x : |x|2 < t], verify the following: 1 rn (t) = C e− u+v−1 gn (v) dv, [u+v>1]
and using the definition of hn of Exercise 3(c), show that for a ∈ B(Ω) the following holds: hn (x, a) = 2(Πn x, a)
rn ((Πn x, x)) − (B −1 Πn x, a). rn ((Πn x, x))
Next claim that (with the CBS-inequality), there is a constant 0 < D < ∞ such that 1 [rn (t)]2 1 2 ≤C e− (u+v−1) gn (v) dv 4 rn (t) [u+v>1] (u + v − 1) ≤ C 2 sup t4 e−t = D < ∞. [t>0]
Conclude that h2n (x, a) dQ(x) ≤ 8D(Ba, a) + 2C(B −1 a, a) < ∞, ∀a ∈ B(Ω). Ω
So the martingale {hn (·, a), Fn , n ≥ 1} is uniformly integrable and hence hn → h, a.e.[Q] for all such a. However, if Qa (A) = Q(A − a) and if Qa Q, then one must have A fa (x) dP (x) = A f (x − a) dP (x), A ∈ Σ. Since P is Gaussian, for the f given above, this last equation is impossible; so Qa ⊥ Q for all 0 = a ∈ B(Ω), and the latter set can be even dense in Ω. 5. A simple countable state space Markov chain problem, extending Theorem 3.9, is the birth and death process which is an analog of the random walk. It is described as a process moving from state n after a random time to either state n − 1, or n, and is characterized by the stationary transitions Pij (h), h ≥ 0, (the left side below) as follows. ⎧ if j = i + 1 λi h + o(h), ⎪ ⎪ ⎪ ⎨ 1 − (λi + μi )h + o(h), j = i P [X(s + h) = j|X(s) = i] = ⎪ if j = i − 1 μi h + o(h), ⎪ ⎪ ⎩ if h = 0, δij ,
332
V. Likelihood Ratios for Processes
and where μ0 = 0, λ0 > 0, μi , λi > 0, i = 1, 2, . . . , with initial distribution P [X(0) = k] = qk so that pk (t) = P [X(t) = k] =
∞
qi Pik (t).
i=0
The Pij (t) satisfy the Chapman-Kolmogorov equation. One finds if P (t) denotes the infinite matrix (Pij (t)) and p(t) = (p1 (t), p2 (t), · · · , ) the infinite vector, then dp(t) dt = A p(t) denotes a system of the associated differential equations with A as the infinitesimal generator of P (t) satisfying the Chapman-Kolmogorov (or semi-group) equation where ⎛ −λ
⎜ ⎜ ⎜ A=⎜ ⎜ ⎜ ⎝
0
μ1 0 0 .. . ...
λ0 0 0 ...⎞ −(λ1 + μ1 ) λ1 0 ... ⎟ ⎟ μ2 −(λ2 + μ2 ) λ2 ... ⎟ ⎟ 0 μ3 −(λ3 + μ3 ) . . . ⎟ .. ⎟ .. ⎠ . . ·
which is a generalization of (27) of Section 3. This generator acts on the summable sequence space 1 and its domain consists of those vectors x ∈ 1 for which Ax ∈ 1 . The process moves by a jump to left or right related to birth or death so that Xt+0 − Xt−0 = ±1 respectively. If at time t the births and deaths are denoted by B(t) and D(t) then Xt = 1 + B(t) − D(t) and the total population at that time is thus St = ti ≤t Xti where the jump occurs at ti . Specializing Theorem 3.9 and assuming μi = μ; λi = λ for all i, and Hj : (λ, μ) = (λj , μj ), j = 1, 2 are the simple hypotheses with Pj as the corresponding probability measures, verify that the likelihood ratio may be obtained as: dP2 λ2 μ2 (St ) = ( )B(t) ( )D(t) exp{−(λ2 + μ2 − λ1 − μ1 )St }. dP1 λ1 μ1 What is the corresponding form in the general case? Other cases when μi = iμ; λi = iλ are also of interest in population growth models. [See e.g., Feller [1], pp. 454 ff, and Grenander [2], p.319.] 6. Complete the details of Yor’s formula for the general case with the sketch below. Let {Xti , Ft , t ≥ 0}, i = 1, 2 be semimartingales with [X 1 , X 2 ]t , denoting (locally) their quadratic covariation, and E(X i ) as the exponential martingale of X i , E(0)t = 1. Then it is asserted that E(X 1 )t E(X 2 )t = E(X 1 + X 2 + [X 1 , X 2 ])t , t ≥ 0.
(*)
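Returning briefly to Exercise 5 before the proof sketch of (*): its likelihood ratio is easy to evaluate on a simulated record of births and deaths. The following Gillespie-type sketch uses constant rates; rates, horizon and seed are our own hypothetical choices, and the running sum S is the accumulated state at jump times, as in the exercise.

    import numpy as np

    rng = np.random.default_rng(9)

    def simulate_birth_death(lam, mu, x0, T):
        """Constant-rate birth-death chain (lambda_i = lam, mu_i = mu, mu_0 = 0):
        returns the birth count B(T), death count D(T) and the sum S of states
        at jump times up to T."""
        t, x, B, D, S = 0.0, x0, 0, 0, 0.0
        while True:
            rate = lam + (mu if x > 0 else 0.0)
            t += rng.exponential(1.0 / rate)
            if t > T:
                return B, D, S
            if x > 0 and rng.random() < mu / rate:
                x -= 1; D += 1
            else:
                x += 1; B += 1
            S += x

    def log_lr(B, D, S, l1, m1, l2, m2):
        # log of (dP2/dP1)(S_t) from Exercise 5
        return B * np.log(l2 / l1) + D * np.log(m2 / m1) - (l2 + m2 - l1 - m1) * S

    B, D, S = simulate_birth_death(lam=1.0, mu=0.5, x0=1, T=5.0)
    print(B, D, log_lr(B, D, S, l1=1.0, m1=0.5, l2=1.2, m2=0.4))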
First observe that $Y_t^i = \mathcal{E}(X^i)_t$ is the unique solution of
\[ Y_t^i = 1 + \int_0^t Y_{s-}^i\, dX_s^i, \quad i = 1, 2. \tag{+} \]
Consider (by Itô's formula)
\[ d(Y^1 Y^2)_t = Y_{t-}^1\, dY_t^2 + Y_{t-}^2\, dY_t^1 + d[Y^1, Y^2]_t, \tag{**} \]
and then verify the following by using (+):
\[ [Y^1, Y^2]_t = \Big[ \int_0^{(\cdot)} Y_{s-}^1\, dX_s^1,\ \int_0^{(\cdot)} Y_{s-}^2\, dX_s^2 \Big]_t, \qquad d[Y^1, Y^2]_t = Y_{t-}^1 Y_{t-}^2\, d[X^1, X^2]_t. \]
Hence from (**) deduce that
\[ d(Y^1 Y^2)_t = Y_{t-}^1 Y_{t-}^2\, d(X^1 + X^2 + [X^1, X^2])_t \]
holds, and so (*) is true.
7. The following is a multidimensional extension of Girsanov's formula. Let $a = (a_{ij}, 1 \le i, j \le d)$, $b = (b_i, 1 \le i \le d)$, $c = (c_i, 1 \le i \le d)$ be defined on $\mathbb{R}_+ \times \Omega$ with real valued components, such that $b$, $c$ and $(\sum_{i,j=1}^d a_{ij} b_i c_j)$ are bounded and progressively measurable, i.e., measurable for $\mathcal{B}((0, t)) \otimes \Sigma$ for all $t \ge 0$, on a standard filtered space $(\Omega, \Sigma, \mathcal{F}_t, t \ge 0)$. Let $X = (X^1, \ldots, X^d)$, $X^i = \{X_t^i, \mathcal{F}_t, t \ge 0\}$, be a diffusion process with diffusion coefficient $a$, positive definite, and drift $b$. Verify that for each $t \ge 0$,
\[ Z_t = \exp\Big\{ \sum_{i=1}^d \int_0^t c_i(u)\, d\tilde X_u^i - \frac{1}{2} \sum_{i,j=1}^d \int_0^t c_i(u)\, a_{ij}(u)\, c_j(u)\, du \Big\} \]
is a $P$-martingale, i.e., on $(\Omega, \Sigma, P)$, where $\tilde X_t^i = X_t^i - \int_0^t b_i(u)\, du$ (note that $P[Z_t > 0] = 1$, $t \ge 0$). Let $dQ_t = Z_t\, dP_t$ on $\mathcal{F}_t$. Verify that $\{Q_t, \mathcal{F}_t, t \ge 0\}$ is a compatible (or projective) family of regular probability measures (on the canonical system) that has a limit $Q$ on $(\Omega, \Sigma)$, where we can and do take $\Sigma = \mathcal{F}_\infty = \sigma(\cup_{t>0} \mathcal{F}_t)$. Finally, show that $X$ is again a diffusion on $(\Omega, \Sigma, \mathcal{F}_t, Q)$ with the same diffusion coefficient $a$ but the new drift $\tilde b = b + ac$. A special form of $a, b, c$ above, taken as $a(t, X_t), b(t, X_t), c(t, X_t)$ where $X = (X_t^1, \cdots, X_t^d)$, is often called the Cameron–Martin–Girsanov formula. Originally Cameron and Martin ([1], and later in several papers) established it when $a(t) = \tilde a(t) X_t$ with $\tilde a, b, c$ nonstochastic, and showed in that case that
\[ E\Big\{ \exp\Big( -\int_0^T (X_t, \tilde a(t)\, dX_t) \Big) \Big\} = \exp\Big[ \frac{1}{2} \int_0^T \mathrm{tr}(G(t))\, dt \Big], \]
where $G$ is the unique solution of the Riccati differential equation $\frac{dG}{dt} = 2\tilde a(t) - G^2(t)$, $G(0) = 0$. [The details and numerous applications of this result may be found in Liptser–Shiryayev [1], Ch. 7, and Stroock–Varadhan [1], Ch. 6.]
8. We present an analog of Theorem III.3.3, obtaining lower bounds for a sequential estimator of a diffusion process under regularity conditions similar to those given there. Thus let $\{B_t, \mathcal{F}_t, t \ge 0\}$ be Brownian motion and consider
\[ dX_t = a(t, \theta, X_t)\, dB_t + b(t, \theta, X_t)\, dt, \]
where $a(\cdot, \theta, \cdot), b(\cdot, \theta, \cdot)$ are nonanticipative and $\log(\frac{b}{a})(\cdot, \theta, \cdot)$ is locally integrable a.e. on $(\Omega, \Sigma, \mathcal{F}_t, P)$ for each $\theta \in I$, an open interval. Suppose that the coefficient processes are differentiable relative to $\theta$, and the so derived processes are also locally integrable. Let $\tau$ be an $\mathcal{F}_t$-stopping time, so that, according to a sampling plan, the process is observed on the stochastic interval $[\![0, \tau)$. Let the likelihood ratio of $X_t$ relative to the $B_t$-process be given by
\[ \varphi(t, X) = \exp\Big\{ \int_0^t \Big(\frac{b}{a}\Big)(s, \theta, X_s)\, dB_s + \frac{1}{2} \int_0^t \Big(\frac{b}{a}\Big)^2(s, \theta, X_s)\, ds \Big\}. \]
If an estimator $\delta(\tau(X))$ of $\theta$ is given and $E_\theta[\delta(\tau(X))\varphi(\theta, X)] = f(\theta)$ exists and is differentiable relative to $\theta$, let $W(\delta - \theta)$ be the loss suffered in estimating $\theta$ by $\delta$, relative to a nonnegative convex function $W$, symmetric and vanishing at the origin, such that $W^{\frac{1}{k}}$ is also a convex function with the same properties, $k \ge 1$ (e.g., $W(x) = |x|^k$, $k \ge 1$). Show that the best lower bound for the risk $R(\theta, \delta) = E_\theta(W(\delta - \theta))$ is given by
\[ R(\theta, \delta) \ge W\Bigg( \frac{f'(\theta)}{E_\theta\big( \int_0^{\tau(X_t)} \frac{d}{d\theta}\big(\frac{b}{a}\big)^2(\theta, X_s)\, ds \big)} \times \frac{E_\theta\big( \big| \int_0^{\tau(X_t)} \frac{d}{d\theta}\big(\frac{b}{a}\big)(\theta, X_s)\, ds \big| \big)^{\frac{1}{k}}}{E_\theta\big( \int_0^{\tau(X_t)} \frac{d}{d\theta}\big(\frac{b}{a}\big)^2(\theta, X_s)^{k'}\, ds \big)} \Bigg), \]
where $k \ge 1$, $k' = \frac{k}{k-1}$. [The method of proof is a continuous parameter extension of that given in III.5.3, and may be generalized to eliminate the regularity conditions with the ideas of Theorem III.5.3. Cf. also Liptser and Shiryayev [1], p. 284.]
9. We include here a characterization of a function $m : \mathbb{R} \to \mathbb{C}$ to be the mean of a harmonizable process, sharpening the contents of the "Important remarks" following Proposition 2.8 on admissible means.
Thus let $\{Y_t, t \in \mathbb{R}\}$ be a process on $(\Omega, \Sigma, P)$ with $r(s, t) = E(Y_s \bar Y_t) = \int_{\mathbb{R}} \int_{\mathbb{R}} e^{isu - itv}\, dF(u, v)$, so that it is a strongly harmonizable covariance. Let $m : \mathbb{R} \to \mathbb{C}$ be a function. Verify that there is a harmonizable process $\{X_t = m_t + Y_t, t \in \mathbb{R}\}$ having $m_t$ as the mean of the $X_t$-process iff $m_t = \int_{\mathbb{R}} e^{itu}\, d\mu(u)$, where $\mu$ is the (signed) measure on $\mathbb{R}$ defined by
\[ \mu(A) = \int_{\mathbb{R}} \bar g(u)\, F(A, du); \qquad \int_{\mathbb{R}} \int_{\mathbb{R}} g(u) \bar g(v)\, dF(u, v) \le 1, \]
all the integrals being in the Lebesgue–Stieltjes sense. [Hints: Since $X_t = \int_{\mathbb{R}} e^{itu}\, dZ(u) \Rightarrow m_t = E(X_t) = \int_{\mathbb{R}} e^{itu}\, d\mu(u)$, where $d\mu(u) = E(dZ(u))$ is a signed measure, $\ell : f \to \ell(f) = \int_{\mathbb{R}} f(u)\, d\mu(u)$ is a bounded linear functional on $L^2(F)$, the space of $F$-integrable (in the MT-sense) functions $f : \mathbb{R} \to \mathbb{C}$ with $(f, f)_F = \int_{\mathbb{R}} \int_{\mathbb{R}} f(u) \bar f(v)\, dF(u, v) < \infty$. Then, by the Riesz representation theorem, $\ell(f) = (f, g)_F$ for a unique $g \in L^2(F)$. Verify that
\[ \mu(A) = \ell(\chi_A) = (\chi_A, g)_F = \int_{\mathbb{R}} \bar g(u)\, F(A, du). \]
Conversely, if Yt = Xt − mt satisfies E(Yt ) = 0, E(Ys Y¯t ) = r(s, t), harmonizable, then r is positive definite if m is given by the above representation. The result also holds if R is replaced by a locally compact abelian group, and for the weakly harmonizable case if the double integral is in the strict MT-sense. The details in the general case are in Rao [23]. The concepts are further discussed at the start of Section 6.1 below.]
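As a quick illustration of this characterization (a minimal added sketch; the spectral data are chosen only for definiteness): in the stationary case $F$ concentrates on the diagonal, and the constraint reduces to $\int_{\mathbb{R}} |g|^2\, d\tilde F \le 1$. Take $\tilde F$ to be Lebesgue measure on $[0,1]$ and $g \equiv 1$, so that $\int_0^1 |g|^2\, d\tilde F = 1 \le 1$, and the admissible mean is the Fourier transform of $g\, d\tilde F$:
$$ m_t = \int_0^1 e^{itu}\, du = \frac{e^{it} - 1}{it}, \quad t \neq 0, \qquad m_0 = 1, $$
a bounded continuous function which, by the exercise, is the mean of a (stationary, hence harmonizable) process $X_t = m_t + Y_t$ with the given spectral measure.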
Bibliographical notes

The material in this chapter plays a key role in all of stochastic inference, and for this reason we have included a detailed treatment of obtaining likelihood ratios for several classes of processes. It is generally a nontrivial step to find these ratios in many problems, and the necessary new methods in the non-Gaussian case involve bringing in different types of mathematical tools. For the most part, the Gaussian problems indicate what techniques are likely to be fruitful and also the desirable types of expected end results. The fecund notion of admissible translates was formulated by Pitcher, who has also obtained the deepest results on this topic. They will be important in composite hypothesis testing as well (to be studied in Chapter VII), and the work in Section 1 is inspired by, and includes some crucial results from, his work. The studies both in the Gaussian case and on general admissible
means are based on the work of Pitcher ([1],[3]), which is fundamental. Additional results when $\Omega$ is a Hilbert space are due to Skorokhod [3]; see especially Exercises 6.2 and 6.3, which are extensions of this paper.

Section 2 is exclusively devoted to Gaussian processes from a general point of view. There are many refinements in the stationary case, a systematic treatment of which is in Rozanov [1], but we did not specialize except for some illustrations, since the aim here is the general picture. Our analysis is inspired by the original and general methods of Grenander [1] and his recent influential monograph [2]. We also used the RKHS methods in this work, the effectiveness of which was first discovered and well exposed by Parzen ([1] and later). Its use is also found in Kailath [3] and in several other publications (cf. references), including those of his coworkers. The Gaussian processes with triangular covariances, which generalize Brownian motion in several ways, have been analyzed by Varberg ([1],[2]), and we have included some of his results in this section. The presentation of the material is unified, and the treatment, for the most part, is adapted from the author's articles (cf. Rao ([4(e)],[4(f)]),[11] in particular). Proposition 2.11 plays a key role in Section VII.3 later. Kadota's [1] theorem finds an abstract extension in the work of Baker [1], who obtained a complete result in that setting.

The first general case of a not necessarily Gaussian process is the infinitely divisible class, which may be regarded as a continuous random walk, though the processes may or may not have moments. The characterizing parameters of this class come from their Lévy-Khintchine representations. This leads to a new method of attack, since such processes can be decomposed into mutually independent components consisting of a Gaussian and a generalized Poisson process. The work on Poisson processes presents new challenges, and the key initial results, given as Proposition 2 and Theorem 3 of Section 3, are due to Brown [1]. Moreover, the work of Gikhman and Skorokhod [1] has also been of special interest here. This analysis leads to Markov processes, which have a refined theory. We presented a brief account related to the likelihood ratio calculations for these. It is important to find a representation of a continuous parameter process with a suitable countable set of observable coordinates, which plays a key role even in the Gaussian case. This idea was originally found and emphasized by Grenander [1]; it is more natural as well as preferable to considering countable dense separable sets, when available. It is used here. Theorem 3.8 is due to Albert [1] and Theorem 3.9 is from Billingsley [1].

Section 4 examines the processes that admit the Lévy-Khintchine representation, namely the infinitely divisible class. Here the situation is more complicated, and this explains why stochastic continuity plays an
essential role. The basic result is Theorem 4.1, due to Maruyama [1]. Its significance has not been adequately noted in the past. The demonstration here relies on some results from the projective limit theory of nonfinite (σ-finite) measure spaces, primarily discussed by Choksi [1] and extended by Métivier [1]. This is emphasized in the text. Also the analysis due to Kingman [1], and a more abstract version by Feldman [2], are of interest here. The main likelihood ratio is obtained for the generalized Poisson case. It was not available before (as noted by Gikhman and Skorokhod), and the key result here is Theorem 4.7, due to Briggs [1]. The distribution of the likelihood ratios is the next step in this process, and the only available work seems to be that due to Brockett, Hudson and Tucker [1] for independent increment processes. Their result is presented in Theorem 4.8. This and other properties in the section show that there are several challenging problems awaiting solutions.

If one considers specialized processes, naturally sharper results can be obtained, even for the absolute continuity of translates, refining the sufficient conditions to have necessary ones as well. We note the following as an example, due to Mizumachi and Sato [1], which generalizes several earlier works. Suppose the processes with independent increments are sequences, $X = (X_1, X_2, \cdots)$ and $Y = (a_1 Y_1, a_2 Y_2, \cdots)$, where $X, Y$ are independent and the $X_n$ are i.i.d. with a twice differentiable density $f$ (for the Lebesgue measure), $a_n \in \mathbb{R}$, and the $Y_n$ are also i.i.d. Let $P_{X+Y}, P_X$ be the corresponding (product) measures. If the equivalence is known, then the likelihood ratios are easily calculated. The following result on equivalence is obtained by these authors:

Theorem. (a) Suppose $E(D \log f(X))^2 < \infty$, $E(Y_n^2) < \infty$, $E(Y_n) = 0$. Then $P_{X+Y} \sim P_X$ iff $\sum_{k \ge 1} a_k^2 < \infty$. (b) If now $E(Y_n^3) = 0$, $E(Y_n^4) < \infty$, $E(D^2 \log f(X))^4 < \infty$, then $P_{X+Y} \sim P_X$ iff $\sum_{k \ge 1} a_k^4 < \infty$.

Here $D = \frac{d}{dx}$ is the differential operator and $D^2 = D(D)$. The hypothesis of (b) is stronger than that of (a), and is satisfied if both the $X, Y$ sequences are Gaussian; this case was first established by Kakutani [1], with some later generalizations by others. The proof uses all the special properties of the sequences, and it shows what refinements are possible. See also Golosov [1] for another such study.

The continuous Markov processes are treated as solutions of diffusion equations driven by independent increment martingales, especially Brownian motion. This leads to the study of solutions of stochastic differential equations. An important tool in obtaining the likelihood ratios here is the Girsanov (-Cameron-Martin) formula. It is presented as Theorems 5.2 and 5.5. Here we have also taken advantage of the ideas and results of the Liptser-Shiryayev [1] monograph. Further we
indicated how a (not necessarily finite dimensional) vector valued diffusion can be treated in this analysis, as in Theorem 5.7, which is due to Krinik [1]. This shows what type of new research is needed in extending the present study. It is thus clear that in each class Gaussian ideas still play a part, and so they are treated in some detail and depth.

In the last section we included some useful results as complements. The assertions of Exercises 6.2 and 6.3, as well as 6.9, with some detailed sketches, have special significance for admissible translates, and Exercise 6.4 shows the difficulty of deciding whether a given vector can be a translate. Exercise 6.8 is a form of the Cramér-Rao-Wolfowitz inequality for continuous parameter processes. Further work is evidently needed here. These cases also show that important and detailed analysis will be necessary, as in other branches of mathematics such as PDE, and moreover that it is advantageous to have a broader view of the subject. Additional problems and results will be noted when we consider inference on parameters of processes for composite hypotheses in Chapter VII. One hopes that these results stimulate serious research on the problems raised.

An important next step, for applications, is to find finite dimensional approximations to the likelihood ratios obtained here. Although some are immediately found from this work, no algorithms have been available, and these should be considered as a next item of research. We also note that a number of problems of statistical interest have been considered in Basawa and Rao [1]. So far only a simple hypothesis versus a simple alternative has been treated. The composite case presents interesting challenges, demanding extensions of the above work. We will discuss it in Chapter VII, after considering some sampling theory of processes, which is already of immediate interest for applications of the preceding analysis.
Chapter VI

Sampling and Regression for Processes
Instead of observing a complete segment of a process on an interval, it is evidently desirable to consider suitable subsets, preferably countable ones, if they reflect the essential characteristics of the process on the bigger segment. A basic result in this direction for second order processes is the one due independently to Kotel'nikov and Shannon, and we present some results of this type for stationary as well as some general processes in Section 1. The work will be specialized to its band-limited and analytical aspects in the following two sections. Further, a detailed analysis of periodic sampling, which is often used in engineering applications, is given in Section 4, along with an extension to random fields on $\mathbb{R}^n$, as well as a brief account for indexes from LCA groups. Next we consider the regression problems in some detail, for both random processes and the corresponding measures. Some remarks on optional sampling of processes are included in the next section. As in the preceding chapters, the complements section is devoted to further important results with detailed sketches. Most of the discussion on sampling is conducted for second order continuous parameter processes, while the corresponding treatment of regression problems is somewhat more general.
6.1 Kotel'nikov-Shannon methodology

From a practical (or applicational) point of view, it is desirable (and cost effective) if a finite or countable set of points can be selected for observing a continuous parameter process, instead of a full segment containing an uncountable set of observations, provided the probabilistic behavior of the given process can still be determined. In fact, model building for applications, as well as their fundamental analysis, should start with
continuous indexing parameters for a complete understanding. The usual discretization methods to approximate the continuous versions cannot be based on ad hoc arguments, but have to be rigorously justified. Conditions under which this may be accomplished for broad classes will be the main concern here. The following work relates to certain types of processes, a considerable part of those with two moments, for which a deep analysis is already available.

Thus let $\{X_t, t \in \mathbb{R}\}$ be a second order process on $(\Omega, \Sigma, P)$, and suppose one observes it at a countable set of points $\{t_i, i \ge 1\}$, referred to as sampling times. Typically they will not form a dense set. These points are usually of two types: (i) periodic sampling, i.e., observations are taken at equally spaced points such as $t_n = nh$ for some fixed $h > 0$ and $n = 0, \pm 1, \pm 2, \ldots$, or (ii) not necessarily periodic but all the (distinct) $t_i$ belonging to bounded subsets of $\mathbb{R}$. If the covariance structure is assumed known (from prior knowledge of the experiment), conditions should be found on these parameters so that suitable linear combinations of $\{X_{t_i}, i \ge 1\}$ will determine the process as far as the two moments are concerned. It was noted in Proposition V.2.8 that any right (or left) continuous covariance function $r: T \times T \to \mathbb{C}$ (so that the RKHS $\mathcal{H}_r$ is separable) can be represented in a generalized "triangular form" relative to a σ-finite measure $\mu$ (and a $T \subset \mathbb{R}$) as:
$$ r(s,t) = \int_T (\Psi(s,x), \Psi(t,x))_{\ell^2}\, d\mu(x), \tag{1} $$
where $(\cdot, \cdot)$ is the inner product of an $\ell^2$-sequence space. This representation (derived independently by Cramér and Hida) is obtained from the classical Hellinger-Hahn theory of (separable) Hilbert spaces. The special case that $\ell^2$ is one dimensional is of immediate interest; it was directly established and studied by Karhunen, and may be stated in a generalized form (see also Section IV.3) as:
$$ r(s,t) = \int_T \int_T g(s,x)\,\overline{g(t,y)}\, d\rho(x,y), \tag{2} $$
where $\rho: T \times T \to \mathbb{C}$ is a (necessarily) positive definite function of (the standard or Vitali) bounded variation. Then there is a second order (even Gaussian) process $X = \{X_t, t \in T\}$ on $(\Omega, \Sigma, P)$ with mean zero and covariance $r$ of (2), by the basic (Kolmogorov) existence Theorem I.1.1. Moreover, the process itself can be represented as:
$$ X_t = \int_T g(t,x)\, dZ(x), \tag{3} $$
relative to a stochastic measure $Z: \mathcal{B}(T) \to L_0^2(P)$, with $E(Z(A)) = 0$ and $E(Z(A)\bar Z(B)) = \int_A \int_B d\rho(x,y)$. This is termed the Cramér representation, and if $Z(\cdot)$ has orthogonal values, then it is termed the Karhunen representation, the one corresponding to (1). We can consider the more general form (2), which needs no extra effort. It should be observed that if $g(t,x) = e^{itx}$, $T = \mathbb{R}$, then (2) gives what was termed a (strongly) harmonizable covariance, with (3) as the corresponding stochastic integral representation; moreover, if $\rho$ concentrates on the diagonal $x = y$, so that it defines a positive bounded measure, then it gives the (weakly) stationary covariance:
$$ r(s,t) = \int_{\mathbb{R}} e^{i(s-t)x}\, d\tilde\rho(x) = \tilde r(s-t), \tag{4} $$
where, in the corresponding integral (3), $Z(\cdot)$ satisfies $E(Z(A)\bar Z(B)) = \int_{A \cap B} d\tilde\rho$, i.e., $Z(\cdot)$ is orthogonally valued. There are generalizations of (2) if $\rho$ is only of the weaker Fréchet variation finite, instead of Vitali's, but for now we concentrate on the above classes for the sampling problem already stated. [The general theory is detailed in different aspects by Cramér [5], Rao [13], and Chang and Rao [1].] Here the above stated forms are taken as the basic building blocks, and the subject is developed from them. The measure functions $\rho, \tilde\rho$ of (2) and (4) are termed the spectral bimeasures and spectral measures of $r$ and $\tilde r$ respectively. [See Section 2 below, characterizing (2)–(4) for harmonizable classes.] The first result on sampling the processes may be presented as follows (its extensions as well as specializations will then be studied later in the chapter):

1. Theorem. Let $X = \{X_t, t \in \mathbb{R}\}$ be a second order process, $X_t \in L_0^2(P)$, whose covariance $r(s,t) = E(X_s \bar X_t)$ admits the representation (2) with $\rho$ as its spectral (Lebesgue-Stieltjes) bimeasure. If $\mathcal{L}(X) = \overline{sp}\{X_t, t \in \mathbb{R}\} \subset L_0^2(P)$ and similarly $\mathcal{M} = \overline{sp}\{g(t, \cdot): t \in \mathbb{R}\} \subset L^2(\rho)$, where
$$ L^2(\rho) = \Big\{h: \mathbb{R} \to \mathbb{C},\ (h,h) = \int_{\mathbb{R}}\int_{\mathbb{R}} h(x)\bar h(y)\, d\rho(x,y) < \infty\Big\} \tag{5} $$
is the (semi-)inner product space defined for $(\cdot, \cdot)$ and $\rho$, which is a Hilbert space if the double integrals in (5) are in the weak, or Morse-Transue, sense as defined in Section IV.3 [otherwise $L^2(\rho)$ is taken as the completion, as in Cramér [6], p. 336, for the above (semi-)inner product], suppose $g(\cdot, x)$ is infinitely differentiable and $\frac{\partial^n g(t, \cdot)}{\partial t^n} \in L^2(\rho)$, $n = 0, 1, 2, \ldots$, $t \in \mathbb{R}$ (so that the $X_t$ of (3) is given for the $g(t, \cdot)$ family). Then, for any sequence $\{t_i, i \ge 1\}$ of times in a bounded part of $\mathbb{R}$ with
infinitely many distinct $t_i$, the sampled collection $\{X_{t_i}, i \ge 1\}$ determines $\mathcal{L}(X)$, in the sense that its closed linear span $\mathcal{M} = \overline{sp}\{X_{t_i}\} \subset L_0^2(P)$ satisfies $\mathcal{M} = \mathcal{L}(X)$, i.e., each $X_t \in \mathcal{L}(X)$ is a.e. equal to an element of $\mathcal{M}$ for each $t \in \mathbb{R}$.

Proof. Given the process $X$ with covariance $r$, and $\rho$ as its spectral bimeasure satisfying (2), one has the representation (3) as well as its converse. This is the key result to be used, and it is due essentially to Cramér [6]. A somewhat quicker argument using the RKHS techniques is given in Chang and Rao ([1], Theorem 7.1 on p. 53). Since we have already reviewed the RKHS construction in Section V.1, employed it in the analysis of Section V.2, and since the result is needed for this proof, let us briefly recall the basic ideas of how (3) is obtained. Using $\rho$, let $\beta(A, B) = \int_A^* \int_B d\rho(x,y)$ be the induced bimeasure on $\mathcal{B}(\mathbb{R}) \times \mathcal{B}(\mathbb{R})$. Since $\beta$ is positive definite, by Kolmogorov's existence theorem one can construct a vector measure $Z: \mathcal{B}(\mathbb{R}) \to L_0^2(P)$ such that $E(Z(A)\bar Z(B)) = \beta(A, B)$ and the correspondence $j: \beta(A, \cdot) \leftrightarrow Z(A)$ is one-to-one. If $\mathcal{H}_\beta$ is the RKHS of countably additive scalar set functions on $\mathcal{B}(\mathbb{R})$, then one finds that
$$ (Z(A), Z(B))_{L_0^2(P)} = (j\beta(A,\cdot), j\beta(B,\cdot))_{L_0^2(P)} = (\beta(A,\cdot), \beta(B,\cdot))_{\mathcal{H}_\beta} = \beta(A, B), $$
so that the correspondence is an isometry. This extends, through simple functions with the dominated convergence, to all $\beta$-(strict MT) integrable functions, giving
$$ (f_1, f_2) = \int_A^* \int_B f_1(x)\bar f_2(y)\, d\beta(x,y) = \Big( \int_A f_1(x)\, dZ(x), \int_B f_2(y)\, dZ(y) \Big), \tag{5'} $$
where the integrals on the left are in the strict MT-sense and those on the right are in the (by now classical) Dunford-Schwartz sense. Replacing $f_i$ by $g(t, \cdot)$ here, the isometry still survives and gives (3). The converse, that (3) implies (2), is easier. We use this representation in executing the proof. [If $\beta$ has finite Vitali variation, the '$*$' can be dropped.] Thus the given covariance function $r$ is expressible as
$$ r(s,t) = \int_{\mathbb{R}}\int_{\mathbb{R}} g(s,x)\,\overline{g(t,y)}\, d\rho(x,y), \tag{6} $$
where $\rho$, the spectral bimeasure, admits an extension to a complex measure on the Borel σ-algebra of $\mathbb{R} \times \mathbb{R} (= \mathbb{R}^2)$, and further, by hypothesis, $g^{(n)}(t, \cdot)$ exists and $g^{(n)}(t, \cdot) \in L^2(\rho)$, $t \in \mathbb{R}$, so that the following integral is well-defined:
$$ \int_{\mathbb{R}}\int_{\mathbb{R}} g^{(n)}(s,x)\,\overline{g^{(m)}(t,y)}\, d\rho(x,y), \qquad m, n = 0, 1, \ldots. \tag{7} $$
This implies that $\frac{\partial^{m+n} r}{\partial s^m \partial t^n}(s,t)$ exists, is continuous (in $t$), and the derivative commutes with the integral. Hence $r$ is an analytic function. But then from (3) one can immediately verify that $X_t^{(n)} \big(= \frac{\partial^n X_t}{\partial t^n}\big)$ is defined in $L^2(P)$-mean for all $n$. Thus it is an analytic random process. Since $E(X_s \bar X_t) = r(s,t)$, (2) and (6) imply that $\|X_t\|_{2,P} = \|g(t, \cdot)\|_{2,\rho}$, or that $\tau: X_t \to g(t, \cdot)$ is an isometry between $L_0^2(P)$ and $L^2(\rho)$. It is easy to deduce from the definitions that $\{g(t, \cdot), t \in \mathbb{R}\}$ generates $L^2(\rho)$. Now let $\{t_i, i \ge 1\}$ be a bounded countable set of (an infinite number of) distinct points of $\mathbb{R}$, as in the theorem, so it has a convergent subsequence $t_j \to t_0 \in \mathbb{R}$. Consider $\mathcal{L}(\rho) = \overline{sp}\{g(t_j, \cdot), j \ge 1\} \subset L^2(\rho)$. Also note that $g^{(n)}(t_0, \cdot) \in L^2(\rho)$, $n \ge 0$. It suffices to show that $L^2(\rho) = \mathcal{L}(\rho)$ to complete the proof, since by the isometry one then obtains $\mathcal{M} = \mathcal{L}(X)$. So let $t \in \mathbb{R}$ and consider $g(t, \cdot) \in L^2(\rho)$. We assert that $g(t, \cdot) \in \mathcal{L}(\rho)$, which implies that $L^2(\rho) \subset \mathcal{L}(\rho) (\subset L^2(\rho))$, giving the desired conclusion. From (6) and the hypothesis on $g^{(n)}(t, \cdot)$ one has
$$ \frac{\partial^n r}{\partial s^n}(s,t)\Big|_{s=t_0} = \int_{\mathbb{R}}\int_{\mathbb{R}} g^{(n)}(t_0, x)\,\overline{g(t,y)}\, d\rho(x,y) = \frac{\partial}{\partial s}\int_{\mathbb{R}}\int_{\mathbb{R}} g^{(n-1)}(s,x)\,\overline{g(t,y)}\, d\rho(x,y)\Big|_{s=t_0}, $$
because the derivative at $t_0$ may be calculated by taking a sequence (which can be $t_j$ itself) tending to $t_0$. Since $g(t_j, \cdot) \in \mathcal{L}(\rho)$, it follows that $g'(t_0, \cdot) \in \mathcal{L}(\rho)$, and by induction $g^{(n)}(t_0, \cdot) \in \mathcal{L}(\rho)$, $n \ge 1$. But by the Taylor series expansion
$$ g(t,x) = \sum_{n=0}^{\infty} \frac{(t - t_0)^n}{n!}\, g^{(n)}(t_0, x) \in \mathcal{L}(\rho). $$
Hence $g(t, \cdot) \in \mathcal{L}(\rho)$ as claimed. By the isometry one then has
$$ X_t = \sum_{n=0}^{\infty} \frac{(t - t_0)^n}{n!}\, X_{t_0}^{(n)} \in \mathcal{M}, $$
as desired.

The conditions on the $g$-functions are satisfied for a class of strongly harmonizable processes, namely those with spectral bimeasures having moment generating functions. More precisely, the following statement holds:

2. Corollary. Let $\{X_t, t \in \mathbb{R}\}$ be a strongly harmonizable process with spectral bimeasure $\rho$ (which thus determines a complex Lebesgue-Stieltjes measure on $\mathbb{R}^2$), such that
$$ \Big| \int_{\mathbb{R}}\int_{\mathbb{R}} e^{sx + ty}\, d\rho(x,y) \Big| < \infty, \qquad s, t \in \mathbb{R}. \tag{8} $$
Then, for any sample $\{X_{t_i}, i \ge 1\}$ observed at points $\{t_i, i \ge 1\} \subset \mathbb{R}$, bounded and with infinitely many distinct $t_i$, one has $\overline{sp}\{X_{t_i}, i \ge 1\} = \mathcal{L}(X) = \overline{sp}\{X_t, t \in \mathbb{R}\} \subset L_0^2(P)$.

Since $g(t,x) = e^{itx}$ is infinitely differentiable and all of its derivatives are integrable relative to $\rho$, because of the moment condition (8), the conclusion follows from the theorem at once.

The hypothesis (8) is essentially the best for the analyticity of a strongly harmonizable process, in the sense that it is nearly necessary. This will be elaborated in Section 3 below, after discussing band-limitedness in the next section. The above result raises the question of finding other (simpler) conditions on the bimeasure $\rho$ that allow us to derive useful conclusions related to the Kotel'nikov-Shannon expansion. This will be considered in the next sections.
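One concrete spectral bimeasure satisfying (8) may be recorded here (a minimal illustration; the Gaussian density is chosen only for definiteness). The rank-one bimeasure
$$ d\rho(x,y) = e^{-(x^2+y^2)/2}\, dx\, dy \ \Rightarrow\ \int_{\mathbb{R}}\int_{\mathbb{R}} e^{sx+ty}\, d\rho(x,y) = 2\pi\, e^{(s^2+t^2)/2} < \infty, $$
$$ r(s,t) = \int_{\mathbb{R}}\int_{\mathbb{R}} e^{isx - ity}\, d\rho(x,y) = \varphi(s)\,\overline{\varphi(t)}, \qquad \varphi(t) = \sqrt{2\pi}\, e^{-t^2/2}, $$
is the covariance of the process $X_t = \xi\varphi(t)$, $E(|\xi|^2) = 1$; here any bounded infinite set of distinct sampling points (e.g., $t_i = \frac{1}{i}$, $i \ge 1$) determines $\mathcal{L}(X)$, as the corollary asserts.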
We first establish:

3. Theorem. Let $\{X_t, t \in \mathbb{R}\}$ be a weakly harmonizable process with zero mean and spectral bimeasure $\rho$. Then for each $\varepsilon > 0$ there exists a bounded Borel set $A (= A_\varepsilon) \subset \mathbb{R}$ with diameter $\alpha_0 (= \mathrm{diam}(A_\varepsilon) > 0)$ and an integer $n_0$ such that $\|\rho\|(A^c, A^c) = \sup_{B \in \mathcal{B}(A^c)} \rho(B, B) < \frac{\varepsilon}{4}$ ($A^c = \mathbb{R} - A$), and for $\alpha > \alpha_0$:
$$ \|X_t - X_t^n\|_{2,P} \le \frac{C(t)}{n(\alpha - \alpha_0)} + \varepsilon, \tag{9} $$
where $X_t^n$ is a Kotel'nikov-Shannon type partial sum, namely:
$$ X_t^n = \sum_{k=-n}^{n} a_k(t; \alpha)\, X\Big(\frac{k\pi}{\alpha}\Big), \qquad t \in \mathbb{R}, \tag{9'} $$
and $0 < C(t) < \infty$, bounded on compact $t$-sets; the coefficients $a_k$ are explicitly obtained as
$$ a_k(t; \alpha) = \frac{\sin(t\alpha - k\pi)}{t\alpha - k\pi}. \tag{10} $$
If $\rho$ itself has bounded support in $\mathbb{R}^2$, then one can take $\varepsilon = 0$ in (9), and in this case $X_t^n \to X_t$ not only in mean, but also a.e., uniformly for $t$ in compact sets.

Remark. In the strongly harmonizable case $\rho$ has finite Vitali variation, which is $\sup_{B_1, B_2 \in \mathcal{B}(\mathbb{R})} |\rho(B_1, B_2)|$, and in either case the 'sup' should not be omitted, because $\rho$ need not be positive.

Proof. Since there is no monotone convergence theorem for bimeasure integrals, we convert this problem into a stochastic integral representation and use certain properties of vector measures. Thus, for $A_n \in \mathcal{B}(\mathbb{R})$,
from the fact that $\rho$ is induced by a stochastic measure (cf. (3)), so that $\rho(A, B) = E(Z(A)\bar Z(B))$, we have
$$ 0 \le \rho(A_n, A_n) = E(|Z(A_n)|^2) = E\Big(\Big|\int_{\mathbb{R}} \chi_{A_n}\, dZ\Big|^2\Big) \le \limsup_n E\Big(\Big|\int_{\mathbb{R}} \chi_{A_n}\, dZ\Big|^2\Big) \le \|Z\|^2(\mathbb{R}) < \infty. \tag{12} $$
[Here $\|Z\|(A) = \sup\{\|\sum_{i=1}^n a_i Z(A_i)\|_2: |a_i| \le 1,\ A_i \subset A \text{ disjoint}\}$ is the semi-variation of $Z$.] Then for any $\varepsilon > 0$, $\exists A_0 \in \mathcal{B}(\mathbb{R})$ such that
$$ \rho(A_0^c, A_0^c) = \|Z(A_0^c)\|_2^2 \le \|Z\|^2(A_0^c) \le \Big(\frac{\varepsilon}{16}\Big)^2. $$
Next observe that
$$ X_t = \int_{\mathbb{R}} e^{itu}\, dZ(u), \qquad \mathbb{R} = A_0 \cup A_0^c, $$
$$ = \int_{\mathbb{R}} e^{itu}\, Z(A_0 \cap du) + \int_{\mathbb{R}} e^{itu}\, Z(A_0^c \cap du) = \int_{\mathbb{R}} e^{itu}\, \tilde Z_1(du) + \int_{\mathbb{R}} e^{itu}\, \tilde Z_2(du) = X_t^1 + X_t^2, \ \text{(say)} \tag{13} $$
where both $\{X_t^j, t \in \mathbb{R}\}$, $j = 1, 2$, are weakly harmonizable. Moreover,
$$ \|X_t - X_t^1\|_2 = \Big\|\int_{\mathbb{R}} e^{itu}\, \tilde Z_2(du)\Big\|_2 \le \|Z\|(A_0^c) \le 4 \sup_{B \in \mathcal{B}(A_0^c)} \|Z(B)\|_2, \quad \text{cf. Dunford-Schwartz ([1], IV.10.4),} $$
$$ < 4 \cdot \frac{\varepsilon}{16} = \frac{\varepsilon}{4}. \tag{14} $$
It will now be shown that $X_t^1$ can be replaced by an $X_t^n$ of the desired type for (9). Indeed, the function $u \mapsto f_t(u) = e^{itu}$ can be approximated by classical methods as follows. Let $\alpha > \alpha_0 (= \mathrm{diam}(A_0))$. Then
$$ F_n(z) = \Big| e^{izu} - \sum_{k=-n}^{n} e^{iu\frac{k\pi}{\alpha}}\, \frac{\sin \alpha(z - \frac{k\pi}{\alpha})}{\alpha(z - \frac{k\pi}{\alpha})} \Big| < \frac{L_0(z)\,\alpha}{(\alpha - \alpha_0)\, n}, \tag{15} $$
where $L_0(z) = L_0(u + iv)$ is bounded for $z$ in bounded domains of the complex plane (cf. Timon [1], Sec. 4.3; see also its current use in Piranashvilli [1]). Now define
$$ X_t^{n_0} = \sum_{k=-n_0}^{n_0} X_{\frac{k\pi}{\alpha}}\, \frac{\sin \alpha(t - \frac{k\pi}{\alpha})}{\alpha(t - \frac{k\pi}{\alpha})}. \tag{16} $$
To see that this gives the desired approximant, let $a_k(t; \alpha)$ be the coefficient of $X_{\frac{k\pi}{\alpha}}$ in (16), and consider
$$ \|X_t - X_t^n\|_2 \le \Big\| \int_{A_0} \Big( e^{itu} - \sum_{k=-n}^{n} e^{iu\frac{k\pi}{\alpha}} a_k(t; \alpha) \Big) \tilde Z_1(du) \Big\|_2 + \|X_t^2 - (X_t^2)^n\|_2, \quad \text{from (16) with obvious notation,} $$
$$ \le F_n(t)\, \|Z\|(A_0) + \|X_t^2\|_2 + \|(X_t^2)^n\|_2 \le \frac{L_0(t)\,\alpha}{(\alpha - \alpha_0)n}\, \|Z\|(\mathbb{R}) + \frac{\varepsilon}{4} + \|(X_t^2)^n\|_2, \tag{17} $$
by (14) and (15). Next, for the upper estimation of the last term, one has
$$ \|(X_t^2)^n\|_2 = \Big\| \sum_{k=-n}^{n} a_k(t; \alpha)\, X^2_{\frac{k\pi}{\alpha}} \Big\|_2 = \Big\| \int_{\mathbb{R}} \Big[ \sum_{k=-n}^{n} e^{iu\frac{k\pi}{\alpha}} a_k(t; \alpha) \Big] \tilde Z_2(du) \Big\|_2 $$
$$ \le \sup_{u \in \mathbb{R}} \Big| \sum_{k=-n}^{n} e^{iu\frac{k\pi}{\alpha}} a_k(t; \alpha) \Big|\, \|\tilde Z_2\|(\mathbb{R}) \le \Big[ 1 + \frac{L_0(t)\,\alpha}{(\alpha - \alpha_0)n} \Big] \frac{\varepsilon}{4}, \quad \text{by (15) and (14).} \tag{18} $$
Take $n = n_0 \ge \big[\frac{2L_0(t)\alpha}{\alpha - \alpha_0}\big]$ and put it in (17) and (18) to get
$$ \|X_t - X_t^n\|_2 \le \Big(1 + \frac{1}{2}\Big)\frac{\varepsilon}{4} + \frac{\varepsilon}{4} = \frac{5\varepsilon}{8} < \varepsilon. $$
If $C(t) = L_0(t)\, \|Z\|(\mathbb{R})$, this gives (9). Finally, if $\rho$ has bounded support, then it can be enclosed in a big enough relatively compact rectangle $A_0 \times A_0$, and hence $X_t^2 = 0$ in the above decomposition, so that one can put $\varepsilon = 0$ there. In this case, using Čebyšev's inequality and the first Borel-Cantelli lemma,
(since $\sum_{n \ge 1} \frac{1}{n^2} < \infty$) the a.e. convergence for $t$ in compact sets is an immediate consequence.

Remark. For strongly harmonizable processes an analogous result has been established by Piranashvilli [1], in which case the above computation may be slightly simplified, and the use of vector integration may be avoided. [See Exercise 7 for the classical Kotel'nikov-Shannon formula and compare it with (9') above.]
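The mechanism of (9')-(10) can be seen in the simplest possible case (a small illustration; the single-frequency process is chosen only for definiteness). Take $X_t = \xi e^{i\lambda t}$ with $E(|\xi|^2) = 1$ and a fixed $|\lambda| < \alpha$, so that the spectral mass is a point at $(\lambda, \lambda)$; then
$$ \sum_{k=-\infty}^{\infty} X_{\frac{k\pi}{\alpha}}\, \frac{\sin(t\alpha - k\pi)}{t\alpha - k\pi} = \xi \sum_{k=-\infty}^{\infty} e^{i\lambda \frac{k\pi}{\alpha}}\, \frac{\sin(t\alpha - k\pi)}{t\alpha - k\pi} = \xi\, e^{i\lambda t} = X_t, $$
by the classical cardinal series for $e^{i\lambda t}$, $|\lambda| < \alpha$ (the same expansion underlying (15) above); thus the sampled values $X_{k\pi/\alpha}$ reproduce the process exactly, with $\varepsilon = 0$, as the bounded-support case of the theorem asserts.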
The methodology of the preceding proof can be employed to get another sampling theorem for a class of Cramér-type processes, if the $g$-function in its representation is analytic and satisfies some reasonable growth conditions. The desired result can be presented as follows.

4. Theorem. Let $X = \{X_t, t \in \mathbb{R}\}$ be a (weak) Cramér class process relative to a $g$-family and a spectral bimeasure $\rho$ as in (2) and (3). Suppose that (i) $g(\cdot, u)$, $u \in \mathbb{R}$, admits an extension to an entire function, (ii) $c_n(u) = \frac{\partial^n g}{\partial t^n}(t, u)|_{t=0} \Rightarrow c^*(u) = \limsup_{n \to \infty} |c_n(u)|^{\frac{1}{n}} \le \alpha_0$, finite, and there is an integer $m \ge 0$ such that
$$ |g(z, u)| \le L(u)(1 + |z|^m)\, e^{c^*(u)|y|}, \qquad z = x + iy, \quad L(\cdot) \in L^2(\rho). $$
Then, for $\alpha > \alpha_0$, $q \ge m$ and $\beta < \frac{\alpha - \alpha_0}{q}$, if $X_t^n$ is defined by
$$ X_t^n = \sum_{k=-n}^{n} X_{\frac{k\pi}{\alpha}}\, \frac{\sin(\alpha t - k\pi)}{\alpha t - k\pi}\, \frac{\sin^q \beta(t - \frac{k\pi}{\alpha})}{\beta^q (t - \frac{k\pi}{\alpha})^q}, \tag{19} $$
one has
$$ \|X_t - X_t^n\|_2 \le \frac{c_0(t, \alpha, \beta)}{n}, \tag{20} $$
for a positive constant (bounded on compact $t$-sets) $c_0(t, \alpha, \beta) < \infty$. Moreover, $X_t^n \to X_t$ also in the a.e. sense, uniformly for $t$ in compact sets.

Proof. This is an abstraction of the argument of the preceding result, and the details will be outlined. The key idea is to find an approximation of $g(t, \cdot)$, under the given conditions, as in the case of the exponential in the last proof. The desired approximation is now based on a classical theorem due to M. L. Cartwright, as modified by Piranashvilli [1], that is extended to the case at hand. Thus, under conditions (i) and (ii), one has
$$ |v_n(z, u)| = \Big| g(z, u) - \sum_{k=-n}^{n} g\Big(\frac{k\pi}{\alpha}, u\Big)\, \frac{\sin(\alpha z - k\pi)}{\alpha z - k\pi}\, \frac{\sin^q \beta(z - \frac{k\pi}{\alpha})}{\beta^q (z - \frac{k\pi}{\alpha})^q} \Big| < \frac{L(u)\,\tilde L_q(z)\,\alpha}{\beta^q(\alpha - \alpha_0 - \beta q)\, n} \Big[ \Big(\frac{\alpha}{n}\Big)^q + \Big(\frac{\alpha}{n}\Big)^{q-m} \Big], \tag{21} $$
where $L(\cdot)$ is as in (ii) and $\tilde L_q(z)$ is a positive number for $z$ in bounded sets of the complex plane. Thus, letting $\zeta_n(t) = X_t - X_t^n$ and $h_n(\alpha, z, q)$ be the coefficient of $L(u)$ in (21), one has $\zeta_n(t) = \int_{\mathbb{R}} v_n(t, u)\, dZ(u)$, where $|v_n(t, u)| \le L(u)\, h_n(\alpha, t, q)$, and then $L(\cdot)$ is integrable for $Z(\cdot)$. Hence, by the vector integral calculus, it follows that
$$ \|\zeta_n(t)\|_2 \le 4 \sup\Big\{ \Big\| \int_A L(u)\, h_n(\alpha, t, q)\, dZ(u) \Big\|_2 : A \in \mathcal{B}(\mathbb{R}) \Big\} $$
$$ \le \frac{4 \tilde L_q(t)\,\alpha}{\beta^q(\alpha - \alpha_0 - \beta q)\, n} \Big[ \Big(\frac{\alpha}{n}\Big)^q + \Big(\frac{\alpha}{n}\Big)^{q-m} \Big] \times \sup_{B \in \mathcal{B}(\mathbb{R})} \Big\| \int_B L(u)\, dZ(u) \Big\|_2 = M_0 \Big[ \frac{\tilde L_q(t)\,\alpha}{\beta^q(\alpha - \alpha_0 - \beta q)\, n} \Big], \ \text{(say),} \tag{22} $$
where $M_0$ is an absolute constant. The right side tends to zero as $n \to \infty$ for $t$ in bounded sets. Let $c_0(t, \alpha, q)$ be the coefficient of $\frac{1}{n}$ in (22). Then it gives (20), as desired. The last statement follows as in Theorem 3, for the same reasons.

The above two results suggest that a number of classical (deterministic) approximation results may be extended to the stochastic context, and possibly to other processes at equally spaced intervals of suitable width $h$. This, and the usual periodic sampling methodology, will also be analyzed in detail in later sections. A special difficulty to watch here is the so-called "aliasing effect", in that two different processes may agree at the fixed periodically sampled points. Hence conditions must be found to avoid this problem. We first discuss some results related to the 'band-limited' case and then proceed to the analytical properties of processes implied by such a condition.

6.2 Band limited sampling

If $\{X_t, t \in \mathbb{R}\}$ is a mean zero stationary process with a continuous covariance $r(s,t) = \tilde r(s-t)$, then by the classical Bochner theorem
$$ \tilde r(s-t) = \int_{\mathbb{R}} e^{i(s-t)u}\, dF(u), \qquad s, t \in \mathbb{R}, \tag{1} $$
where F is a bounded positive nondecreasing function, determining a (bounded) Borel measure. Such an F is termed the spectral (measure)
function of the process, and if moreover the support of $F$ is contained in a bounded interval $(-a, a)$, the process is usually called band-limited. More generally, suppose $r$ is the covariance function of a second order process admitting a (two-dimensional) Fourier transform:
$$ r(s,t) = \int_{\mathbb{R}}^* \int_{\mathbb{R}} e^{isx - ity}\, dF(x,y), \qquad s, t \in \mathbb{R}, \tag{2} $$
relative to a (necessarily positive definite) bimeasure $F$, which however may only have finite Fréchet (not Vitali) variation, so that the symbol is a (strict) MT-integral as noted in the preceding section, and the process is weakly harmonizable. Again $F$ is termed a spectral bimeasure, and if its support is contained in a bounded rectangle $(-a, a) \times (-b, b)$, then the process (by analogy) can and will be termed band-limited. Thus the concept extends to harmonizable processes. The following result characterizes continuous covariances $r$ that are Fourier transforms of such spectral bimeasures, and motivates how other extensions (e.g., the Cramér type processes) can be similarly analyzed.

1. Theorem. A continuous covariance function $r: \mathbb{R} \times \mathbb{R} \to \mathbb{C}$ is the Fourier transform of a bimeasure $F: \mathcal{B}(\mathbb{R}) \times \mathcal{B}(\mathbb{R}) \to \mathbb{C}$ (so that it is weakly harmonizable) iff $\|r\| < \infty$, where ($f \in L^1(\mathbb{R}) \Rightarrow \hat f(t) = \int_{\mathbb{R}} e^{itx} f(x)\, dx$)
$$ \|r\| = \sup\Big\{ \Big| \int_{\mathbb{R}}\int_{\mathbb{R}} r(s,t)\, f(s)\bar g(t)\, ds\, dt \Big| : \|\hat f\|_\infty \le 1,\ \|\hat g\|_\infty \le 1,\ f, g \in L^1(\mathbb{R}) \Big\}. \tag{3} $$
Here and below, $L^p(\mathbb{R})$ is the Lebesgue space on the Lebesgue line $\mathbb{R}$.

Remark. The condition $\|r\| < \infty$ of (3) is termed weak V-boundedness, and is an extended version of a one-dimensional concept originally formulated by Bochner [2]. The point of this result is that the (two-dimensional) Fourier transform is characterized by a simple analytical condition, namely (3), without reference to the "abstract" definition of harmonizability.

Proof. The "if" part is immediate. Indeed, let $r$ admit the representation (2). Then for any $f, g \in L^1(\mathbb{R})$ one has
$$ \Big| \int_{\mathbb{R}}\int_{\mathbb{R}} r(s,t)\, f(s)\bar g(t)\, ds\, dt \Big| = \Big| \int_{\mathbb{R}}\int_{\mathbb{R}} f(s)\bar g(t) \Big[ \int_{\mathbb{R}}^*\int_{\mathbb{R}} e^{isx - ity}\, dF(x,y) \Big] ds\, dt \Big| $$
$$ = \Big| \int_{\mathbb{R}}\int_{\mathbb{R}} \hat f(x)\,\overline{\hat g(y)}\, dF(x,y) \Big|, \quad \text{by a form of Fubini's theorem for the MT-integrals,} $$
$$ \le \|\hat f\|_\infty\, \|\hat g\|_\infty\, \|F\|(\mathbb{R}, \mathbb{R}), \quad \text{by a property of the MT-integrals.} \tag{4} $$
Here $\|F\|(\mathbb{R}, \mathbb{R})$ is the Fréchet variation of the bimeasure (which is always finite), given by:
$$ \|F\|(\mathbb{R}, \mathbb{R}) = \sup\Big\{ \Big| \sum_{i,j=1}^{n} a_i \bar a_j \int_{A_i}\int_{B_j} dF(x,y) \Big| : A_i, B_j \in \mathcal{B}(\mathbb{R}) \text{ disjoint},\ |a_i| \le 1,\ a_i \in \mathbb{C} \Big\}. $$
Thus (4) implies (3), so that $r$ is (weakly) $V$-bounded.

For the converse, suppose $\|r\| < \infty$. If $H: f \to \hat f$ is the Fourier transform, consider the functional $T: C_0(\mathbb{R}) \times C_0(\mathbb{R}) \to \mathbb{C}$ defined by $T(\hat f, \hat g) = (H^{-1}(\hat f), H^{-1}(\hat g))$ for $f, g \in L^1(\mathbb{R})$, $C_0(\mathbb{R})$ being the space of continuous functions vanishing off compact sets, and $(\cdot, \cdot)$ given by:
$$ (f, g) = \int_{\mathbb{R}}\int_{\mathbb{R}} r(s,t)\, f(s)\bar g(t)\, ds\, dt. \tag{5} $$
This $T$ is a bounded bilinear functional on $\hat L^1(\mathbb{R}) \times \hat L^1(\mathbb{R})$, since by (3)
$$ \sup\{ |T(\hat f, \hat g)| : \|\hat f\|_\infty \le 1,\ \|\hat g\|_\infty \le 1 \} = \|r\| < \infty. $$
Now $T$ admits a bound preserving extension to all of $C_0(\mathbb{R}) \times C_0(\mathbb{R})$, endowed with the uniform norm, by the standard (Hahn-Banach type) extension results. Then, by another representation theorem due to F. Riesz for multilinear functionals (cf. Dunford-Schwartz [1], Chapter VI, and Dobrakov [1] for extensions), there is a unique bounded bimeasure $F$, necessarily positive definite (cf. Chang and Rao [1], p. 21), on $\mathcal{B}(\mathbb{R}) \times \mathcal{B}(\mathbb{R})$ such that the following holds:
$$ T(\hat f, \hat g) = \int_{\mathbb{R}}^*\int_{\mathbb{R}} \hat f(x)\,\overline{\hat g(y)}\, dF(x,y). \tag{6} $$
Then (5) and (6) imply
$$ \int_{\mathbb{R}}\int_{\mathbb{R}} r(s,t)\, f(s)\bar g(t)\, ds\, dt = (f, g) = T(\hat f, \hat g) = \int_{\mathbb{R}}^*\int_{\mathbb{R}} \Big[ \int_{\mathbb{R}}\int_{\mathbb{R}} e^{isx - ity} f(s)\bar g(t)\, ds\, dt \Big] dF(x,y). $$
Subtracting, and using again a form of Fubini's theorem, one gets
$$ \int_{\mathbb{R}}\int_{\mathbb{R}} \Big[ r(s,t) - \int_{\mathbb{R}}^*\int_{\mathbb{R}} e^{isx - ity}\, dF(x,y) \Big] f(s)\bar g(t)\, ds\, dt = 0. \tag{7} $$
Since $f, g \in L^1(\mathbb{R})$ are arbitrary, this implies that $[\ \cdot\ ] = 0$ a.e., and because of the continuity of $r$ it vanishes identically. Thus $r$ admits the representation (2).
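A quick check of the notion may be useful (a minimal illustration; the stationary case is chosen only because the computation is one line). Every continuous stationary covariance is $V$-bounded, since, by (1) of this section and the classical Fubini theorem,
$$ \Big| \int_{\mathbb{R}}\int_{\mathbb{R}} \tilde r(s-t)\, f(s)\bar g(t)\, ds\, dt \Big| = \Big| \int_{\mathbb{R}} \hat f(x)\,\overline{\hat g(x)}\, d\tilde F(x) \Big| \le \|\hat f\|_\infty\, \|\hat g\|_\infty\, \tilde F(\mathbb{R}), $$
so that $\|r\| \le \tilde F(\mathbb{R}) < \infty$; the bimeasure produced by the theorem is then the one concentrating on the diagonal, recovering Bochner's representation.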
The preceding analysis also applies to the case that $F$ is of bounded (Vitali) variation, with the following modifications. Consider $r$ as a function on $\mathbb{R}^2$ and (3) replaced by $|r|(\mathbb{R}^2) < \infty$, where
$$ |r|(\mathbb{R}^2) = \sup\Big\{ \Big| \int_{\mathbb{R}^2} r(x)\, f(x)\, d\mu(x) \Big| : f \in L^1(\mathbb{R}^2),\ \|\hat f\|_\infty \le 1 \Big\}, \tag{8} $$
with $\mu$ as the planar Lebesgue measure, and $f: \mathbb{R}^2 \to \mathbb{C}$ (not necessarily of the product form $f(x_1, x_2) = f_1(x_1) f_2(x_2)$ for $x = (x_1, x_2) \in \mathbb{R}^2$). Then, by another (classical) Bochner theorem with $x$ as above, there is an $F: \mathbb{R}^2 \to \mathbb{C}$ of bounded (Vitali) variation such that ($\tau \cdot x = sx_1 - tx_2$ for $\tau = (s, -t)$)
$$ r(x) = \int_{\mathbb{R}^2} e^{i\tau \cdot x}\, dF(s,t), \tag{9} $$
and since $r$ is positive definite, $F$ is also. Then the preceding work gives the following:

2. Corollary. A continuous covariance function $r: \mathbb{R}^2 \to \mathbb{C}$ is strongly harmonizable (or the Fourier transform of a positive definite function $F: \mathbb{R}^2 \to \mathbb{C}$ of bounded (Vitali) variation satisfying (9)) iff $|r|(\mathbb{R}^2) < \infty$, or, more explicitly,
$$ \Big| \int_{\mathbb{R}^2} r(s,t)\, f(s,t)\, d\mu(s,t) \Big| \le K \|\hat f\|_\infty, \qquad f \in L^1(\mathbb{R}^2), \tag{10} $$
for some constant $0 < K < \infty$.

Thus the class of processes for which a general band-limited concept can be defined is precisely the (weakly or strongly) harmonizable family. The importance of condition (3) or (10) is enhanced by obtaining the corresponding stochastic Fourier representation of the $X_t$-process itself. This is seen as follows. When $r$ is $V$-bounded, one has, for any bounded Borel $f, g$ with compact supports,
$$ \int_{\mathbb{R}}\int_{\mathbb{R}} r(s,t)\, f(s)\bar g(t)\, ds\, dt = E\Big[ \int_{\mathbb{R}}\int_{\mathbb{R}} X_s \bar X_t\, f(s)\bar g(t)\, ds\, dt \Big] = E\Big( \int_{\mathbb{R}} X_s f(s)\, ds\ \overline{\int_{\mathbb{R}} X_t g(t)\, dt} \Big), $$
so that, on taking $f = g$, this becomes
$$ E\Big( \Big| \int_{\mathbb{R}} X_s f(s)\, ds \Big|^2 \Big) = \int_{\mathbb{R}}\int_{\mathbb{R}} r(s,t)\, f(s)\bar f(t)\, ds\, dt \le K \|\hat f\|_\infty^2, \quad \text{by (3) [or (10)], for some } 0 < K < \infty. $$
Hence
$$ \Big\| \int_{\mathbb{R}} X_s f(s)\, ds \Big\|_2 \le \sqrt{K}\, \|\hat f\|_\infty. \tag{11} $$
This implies that the set $\{ \int_{\mathbb{R}} X_s f(s)\, ds,\ f \in L^1(\mathbb{R}),\ \|\hat f\|_\infty \le 1 \} \subset L_0^2(P)$ is bounded, or equivalently it is relatively weakly compact. Consequently, by another representation theorem due to F. Riesz on $C_0(\mathbb{R}) \to L_0^2(P)$ (in both cases (3) and (10)), there is a unique vector measure $Z: \mathcal{B}(\mathbb{R}) \to L_0^2(P)$ such that
$$ \int_{\mathbb{R}} X_s f(s)\, ds = T(\hat f) = \int_{\mathbb{R}} \hat f(t)\, dZ(t) = \int_{\mathbb{R}} \Big[ \int_{\mathbb{R}} e^{itu} f(u)\, du \Big] dZ(t), $$
where the left side is the standard vector Lebesgue [or Bochner] integral and the right side is the Dunford-Schwartz integral. Hence, as before, one has
$$ \int_{\mathbb{R}} \Big[ X_t - \int_{\mathbb{R}} e^{itu}\, dZ(u) \Big] f(t)\, dt = 0, \qquad f \in L^1(\mathbb{R}). \tag{12} $$
It follows that
$$ X_t = \int_{\mathbb{R}} e^{itu}\, dZ(u), \qquad t \in \mathbb{R}, \tag{13} $$
which is the integral representation of a weakly (or strongly) harmonizable process, with $E(Z(A)\bar Z(B)) = \int_A^*\int_B dF(s,t)$, where $F$ defines a bimeasure (a signed measure) on $\mathcal{B}(\mathbb{R}^2)$. The importance of this result is that one can approximate $e^{itu}$ by a (finite or infinite) series, as in the proof of Theorem 1.3 (cf. eq. (15)), leading to several such representations using different classical approximations and/or "sampling theorems". They give rise to stochastic sampling versions of various kinds. For instance, the covariance function $r$ can be replaced by $r(s,t) = \tilde r(s,t) f(s) f(t)$ for a measurable $f \ge 0$ such that $\int_{\mathbb{R}} \tilde r(t,t) f^2(t)\, dt < \infty$. Then, under similar conditions (of $V$-boundedness), $r$ is a Fourier transform of a bimeasure, and
more particularly, if $\tilde r$ and $f$ are such transforms themselves, then $r$ is the Fourier transform of a convolution. So the band-limitedness of a process can be defined, and it leads to some new developments in (stochastic) sampling theory. Such a generalization of band-limited results was proposed by Zakai [1], and the idea is further explored by several others (see e.g., Lee [1]). We shall omit a discussion of extending the classical sampling theory, found e.g. in Higgins [1] and Zayed [1], to the stochastic context, since the novelty there is mostly in the deterministic case, and much less in its probabilistic counterpart. An informative exposition is given by Pogány [1]. However, some results related to the above works will be included in later sections of this chapter.

6.3 Analyticity of second order processes

For second order processes, especially the harmonizable classes, we shall consider the band-limitedness property and its relation to the analyticity of sample paths, which explains our preoccupation with the former in the preceding section. This point was already noted immediately after Corollary 1.2. The following result for stationary processes was obtained by Belyaev [1], and for the (strongly) harmonizable case it was established by Swift [1].

1. Theorem. A strongly harmonizable process $\{X_t, t \in \mathbb{R}\}$ with spectral (signed) measure $F$ is analytic in an open neighborhood of the origin iff $F$ has a moment generating function in such a neighborhood of the complex plane. In particular, if the strongly harmonizable process is band-limited, then it is analytic in the complex plane.

Proof. The sufficiency is a consequence of Corollary 1.2. In fact, if $F$ has a moment generating function, then
$$ \Big| \int_{\mathbb{R}}\int_{\mathbb{R}} e^{s_1 x + s_2 y}\, dF(x,y) \Big| < \infty, \qquad |s_j| < a_j,\ j = 1, 2, \tag{1} $$
and hence F has all moments finite. Thus r is infinitely differentiable (by the dominated convergence theorem) so that it has an infinite Taylor series expansion converging (uniformly and absolutely) in the rectangle (−a1 , a1 ) × (−a2 , a2 ). So r is analytic, and then Xt has mean square derivatives of all orders. It is analytic as in Theorem 1.1. For the converse, we assert that the analyticity of r in a rectangular region, as above, implies that its spectral measure function F satisfies (1). Since F determines a signed (or complex) measure on R2 , by hypothesis, it is bounded and (by the Jordan decomposition) can be expressed as a linear combination of nonnegative measure functions of bounded (Vitali) variation, F1 , . . . , F4 . Hence, it suffices to establish
(1) for one of them, say $F_1$. So let
$$ r_1(s,t) = \int_{\mathbb{R}}\int_{\mathbb{R}} e^{isx - ity}\, dF_1(x,y), \tag{2} $$
and by hypothesis $r_1$ is analytic. Thus it admits a (uniformly and absolutely) convergent power series expansion
$$ r_1(s_1, s_2) = \sum_{j,k=0}^{\infty} \frac{\partial^{j+k} r_1}{\partial s_1^j \partial s_2^k}(0, 0)\, \frac{s_1^j s_2^k}{j!\, k!}, \tag{3} $$
for $|s_m| < p_m$, $m = 1, 2$, a rectangular neighborhood of $(0,0) \in \mathbb{R}^2$, following some standard results on characteristic functions of probability theory. Now $r_1$ of (2) is infinitely differentiable, and it is the Fourier transform of a bounded (positive) measure. This implies that $F_1$ has all (absolute) moments finite, since the integral is in Lebesgue's sense. If $\alpha_{j,k} = \int_{\mathbb{R}}\int_{\mathbb{R}} x^j y^k\, dF_1(x,y)$, then $\frac{\partial^{j+k} r_1}{\partial s_1^j \partial s_2^k}(0,0) = i^{j-k}\, \alpha_{j,k}$. Because $\alpha_{2j,2k} \ge 0$, the absolute moments $\beta_{j,k}$ of $F_1$ are dominated by the even moments as follows. Using the elementary inequality $|ab| \le (a^2 + b^2)/2$, one has
$$ |x|^k\, |x|^{k-1}\, |y|^j\, |y|^{j-1} \le \frac{1}{4}\, (x^{2k} + x^{2k-2})(y^{2j} + y^{2j-2}), $$
and hence
$$ \frac{\beta_{2k-1, 2j-1}}{(2k-1)!\,(2j-1)!} \le \frac{1}{4} \Big[ \frac{\alpha_{2k,2j}}{(2k)!\,(2j)!}\,(2k)(2j) + \frac{\alpha_{2k-2,2j-2}}{(2k-2)!\,(2j-2)!} + \frac{\alpha_{2k,2j-2}}{(2k)!\,(2j-2)!}\, \frac{2k}{2j-1} + \frac{\alpha_{2k-2,2j}}{(2k-2)!\,(2j)!}\, \frac{2j}{2k-1} \Big]. $$
Substituting this estimate, and using the even coefficient terms of the absolutely convergent series (3), one finds that $\sum_{j,k=0}^{\infty} \beta_{k,j}\, \frac{s_1^k s_2^j}{k!\, j!}$ converges absolutely in the same open neighborhood of $(0,0) \in \mathbb{R}^2$. This implies immediately that
$$ \int_{\mathbb{R}}\int_{\mathbb{R}} e^{|x s_1| + |y s_2|}\, dF_1(x,y) < \infty. $$
It follows from this that (1) holds for F1 and by the initial reduction, it holds for F itself. Thus the condition is also necessary. Finally, in the band-limited case, F has all moments finite (i.e., its moment generating function exists), and the sufficiency of the result implies that the process must be analytic.
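Both directions can be seen in miniature in the stationary case (a small illustration; the two spectral densities are chosen only for definiteness). The Gaussian density $e^{-x^2/2}$ has a moment generating function everywhere and yields the entire covariance $\tilde r(u) = \sqrt{2\pi}\, e^{-u^2/2}$, while the Cauchy density has none, and correspondingly
$$ \tilde r(u) = \int_{\mathbb{R}} \frac{e^{iux}}{\pi(1 + x^2)}\, dx = e^{-|u|} $$
fails to be differentiable (hence analytic) at $u = 0$, so the associated stationary process cannot be analytic there.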
In the stationary case, $F$ is positive and bounded on $\mathbb{R}$, and the result follows from the analytic theory of characteristic functions. Perhaps for this reason Belyaev states that the "proof is obvious", and the above is an extension of that remark. In fact, the argument extends to analytic covariance functions of processes that are not necessarily harmonizable. The following result illustrates this point, and it is due to Belyaev himself.

2. Theorem. Let $r: \mathbb{R} \times \mathbb{R} \to \mathbb{C}$ be a covariance function of a process $\{X_t, t \in \mathbb{R}\}$ which is also analytic in a neighborhood of a point $(t_0, t_0) \in \mathbb{R} \times \mathbb{R}$. Then the process itself is analytic in a neighborhood of $t_0 \in \mathbb{R}$. If moreover the process is Gaussian, then the condition is necessary as well.

Proof. From the second order properties of a random process, we deduce that the infinite differentiability of the covariance $r$ in an open rectangle implies that the process is also infinitely differentiable in the mean in the same rectangle. Hence $X_t^{(n)} = \frac{d^n X_t}{dt^n}$ exists in mean for all $n$, and the analyticity of $r$ around $(t_0, t_0)$ implies that it has a power series expansion as in (3) around this point. If now we set
$$ X_t^n = \sum_{k=0}^{n} X_{t_0}^{(k)}\, \frac{(t - t_0)^k}{k!}, $$
then the above statements are equivalent to:
$$ E(|X_t - X_t^n|^2) = \sum_{j,k=n+1}^{\infty} \frac{\partial^{j+k} r}{\partial s^j \partial t^k}(t_0, t_0)\, \frac{(t - t_0)^{j+k}}{j!\, k!} \to 0, \tag{4} $$
as $n \to \infty$, which is true, so that $X_t^n \to X_t$ in mean. We assert that $\lim_{n \to \infty} X_t^n = X_t$ exists with probability one, by means of the (first) Borel-Cantelli lemma. Indeed, consider
$$ \xi_t^n = \sum_{k=n}^{\infty} X_{t_0}^{(k)}\, \frac{(t - t_0)^k}{k!}. \tag{5} $$
It suffices to verify $\sum_{n=0}^{\infty} E(|\xi_t^n|^2) < \infty$ for the desired conclusion, because of the Čebyšev inequality. With (4) and a rearrangement of the absolutely convergent power series, one finds
$$ \sum_{n=0}^{\infty} E(|\xi_t^n|^2) = \sum_{j,k=0}^{\infty} [\min(j,k) + 1]\, \Big| \frac{\partial^{j+k} r}{\partial s^j \partial t^k}(t_0, t_0) \Big|\, \frac{|t - t_0|^{j+k}}{j!\, k!} \le C + \sum_{j,k=1}^{\infty} \Big| \frac{\partial^{j+k} r}{\partial s^j \partial t^k}(t_0, t_0) \Big|\, \frac{|t - t_0|^{j+k}}{(j-1)!\,(k-1)!} < \infty, $$
where $0 < C < \infty$ is a constant. Hence the desired pointwise a.e. convergence follows in the same open neighborhood.

Suppose now that the process is Gaussian as well. We need to show that the analyticity of the $X_t$ in a neighborhood of $t_0$ leads to the same property of $r$. This is simple, since by the analyticity of the former at $t_0$ (using (5))
$$ X_t = \sum_{k=0}^{\infty} X_{t_0}^{(k)}\, \frac{(t - t_0)^k}{k!} = X_t^n + \xi_t^n, \quad \text{cf. (5),} \tag{6} $$
and since the series converges in mean, one has
$$ \Big| \sum_{j,k=0}^{\infty} E(X_{t_0}^{(j)} \bar X_{t_0}^{(k)})\, \frac{(t - t_0)^{j+k}}{j!\, k!} \Big| < \infty. $$
Hence
$$ \sum_{j,k=0}^{\infty} E(X_{t_0}^{(j)} \bar X_{t_0}^{(k)})\, \frac{(s - t_0)^j (t - t_0)^k}{j!\, k!} $$
also converges absolutely, since $E(|\xi_t^n|^2) \to 0$ as $n \to \infty$. Thus we obtain that
$$ r(s,t) = E(X_s \bar X_t) = \sum_{j,k=0}^{\infty} E(X_{t_0}^{(j)} \bar X_{t_0}^{(k)})\, \frac{(s - t_0)^j (t - t_0)^k}{j!\, k!} $$
converges uniformly and absolutely. Consequently $r$ is analytic.

Remark. If all moment functions are analytic, then the second order process itself is analytic, as the above result shows. However, as already noted by Belyaev ([1], p. 405), a random process may be analytic without the analyticity of its (less than 2) moment functions.

As an immediate consequence of Theorem 2, one can obtain the Kotel'nikov-Shannon expansion for band-limited strongly harmonizable processes. Recall that, for now, band-limitedness means that the spectral bimeasure vanishes outside of a bounded rectangle in the complex plane. It then implies that such a process is analytic (cf. Corollary 1.2 or Theorem 1 above). For such a process we have in fact the following:

3. Proposition. If $\{X_t, t \in \mathbb{R}\}$ is a strongly harmonizable process with a band-limited bispectral function, then
$$ X_t = \sum_{k=-\infty}^{\infty} X_{\frac{k\pi}{\alpha}}\, \frac{\sin(\alpha t - k\pi)}{\alpha t - k\pi}, \tag{7} $$
in mean and a.e., for $\alpha > \beta$, where $\beta$ is the diameter of the spectral domain of the process.

Proof. If $r$ is the covariance function of the process, then by hypothesis
$$ r(s,t) = \int_a^b \int_c^d e^{isx - ity}\, dF(x,y), $$
where $\beta \ge \max(b - a, d - c)$. Here $(a,b) \times (c,d) \supset \mathrm{support}(F) = S_F$, and we let $\beta = \mathrm{diam}(S_F)$. In particular, one may replace $a, c$ by $-\alpha$ and $b, d$ by $\alpha$, so that, using (13) of the preceding section, one has
$$ X_t = \int_{-\alpha}^{\alpha} e^{itx}\, dZ(x), \tag{8} $$
where $E(Z(A)\bar Z(B)) = \int_{A \times B} dF(x,y)$, and $Z$ is the vector (also termed "stochastic spectral") measure of the process. But now, from classical approximation theory results, it is known that
$$ e^{itx} = \sum_{k=-\infty}^{\infty} e^{ik\frac{\pi x}{\alpha}}\, \frac{\sin(t\alpha - k\pi)}{t\alpha - k\pi}, $$
for each $|x| < \alpha$, as already seen in the proof of Theorem 1.3 (cf. (15) there). The series converges pointwise and in $L^2(F)$. Substituting this in (8) and using the dominated convergence theorem for vector measures (cf. Dunford-Schwartz [1], Theorem IV.10.10), one gets (7) to converge in mean. But by considering the partial sums, and using the Čebyšev inequality and the Borel-Cantelli lemma (as in Theorem 1.3, since the variance of the tail part is $O(\frac{1}{n^2})$), the pointwise a.e. convergence follows.

This result was given by Belyaev [1], by a long computation, just for stationary processes, from first principles. It should be noted that the series expansion (7) can be obtained by the same argument for weakly harmonizable processes also, if the support of the spectral bimeasure has finite diameter, as seen in the last part of Theorem 1.3. However, the analyticity of the $X_t$-process then cannot be concluded, although several types of series representations are possible, since the (infinite) differentiability of the covariance function $r$ (now represented by a strict MT-integral) is not yet available.

In the above discussion, the support of $F$ has been of interest, and to understand it better we present its structure for a class of harmonizable processes. More precisely, note that the support $S_F \subset \mathbb{R} \times \mathbb{R}$ of a bimeasure $F$ is the set of points $(x,y)$ such that for each neighborhood $U_1 \times U_2$ of $(x,y)$ the variation $|F|$ is positive, i.e., $|F|(U_1, U_2) > 0$.
Thus $S_F$ is the smallest closed set of $\mathbb{R} \times \mathbb{R}$ outside of which $F$ vanishes. We then have the following characterization of the support for a class of processes. This will also be of interest in understanding relations between processes with periodic harmonizable covariance functions and certain stationary families.

4. Proposition. Let $\{X_t, t \in \mathbb{R}\}$ be a weakly harmonizable process with mean zero and covariance $r$ satisfying $r(s,t) = r(s + \alpha, t + \alpha)$ for some fixed $\alpha > 0$ (i.e., it is periodic with period $\alpha$), and with spectral function $F$. Then
$$ S_F = \Big\{ (x,y) \in \mathbb{R} \times \mathbb{R} : x - y = \frac{2\pi k}{\alpha},\ k \in \mathbb{Z} \Big\}. \tag{9} $$
Conversely, if the support of the spectral bimeasure F of a harmonizable process is SF , given by (9), then the covariance function r is periodic with period α > 0. A proof will be outlined in the complements (cf., Exercise 4), where some related discussion on properties of this class will also be included. Here we just concentrate on certain other aspects of sampling. 6.4 Periodic sampling of processes and fields As noted earlier, to observe a process X = {Xt , t ∈ R} at intervals of fixed length, h > 0 so that {Xnh , n = 0, ±1, ±2, . . . } only is recorded, it is termed periodic sampling of a process with period length h > 0. This h should be properly chosen so that for each t ∈ R, Xt is (linearly) determined by the Xnh ’s, and no “aliasing” should occur. To implement this idea, we translate the problem from the “time domain” to the “spectral or frequency domain” for certain second order processes, starting with a stationary class and then extending the work. This allows us to bring in some powerful tools from Fourier analysis, as will now be demonstrated. We consider both the processes and fields, i.e., the index set is Rn , n > 1, and also more generally but briefly if the index is an LCA group G in the latter part of this section. Thus let {Xt , t ∈ R} be a second order mean continuous process ¯ t ) = r˜(s, t) = whose second moment is stationary, meaning E(Xs X r(s − t), and hence is uniquely representable (by Bochner’s theorem) as r(s − t) = eix(s−t) dF (x), s, t ∈ R, (1) R
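The stationary diagonal case may help fix ideas (a small illustration; the process is chosen only for definiteness). A mean-square $\alpha$-periodic weakly stationary process has its spectral measure carried by the lattice $\frac{2\pi}{\alpha}\mathbb{Z}$, so that
$$ X_t = \sum_{k \in \mathbb{Z}} \xi_k\, e^{\frac{2\pi i k t}{\alpha}}, \qquad E(\xi_k \bar\xi_l) = 0\ (k \neq l), \quad \sum_k E(|\xi_k|^2) < \infty, $$
and its covariance is $\alpha$-periodic; viewed as a harmonizable covariance, this spectral mass sits on the diagonal $x = y$, the $k = 0$ part of the set (9).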
for a bounded non-decreasing (spectral measure) function F . Then Xt = eitx dZ(x), t ∈ R, (2) R
6.4 Periodic sampling of processes and fields
359
¯ where Z has orthogonal increments with E(Z(A)Z(B)) = A ∩ B dF (x), as shown in Section 1. Let L(X) = sp{Xt , t ∈ R} ⊂ L2 (P ) and L2 (F ) = {f : R → C, R |f |2 (x) dF (x) = f 22,F < ∞}. It was already noted that L(X) and L2 (F ) are isometrically isomorphic, i.e., for each Y ∈ L(X) there exists a unique v ∈ L2 (F ) such that Y ↔ v and Y 2,P = v 2,F . Now if M = sp{Xnh , n = 0, ±1, ±2, . . . }, it is desirable to find conditions on F (hence on r) such that for a given h > 0, M = L(X). Since M ⊂ L(X) is always true, and equality obtains only under certain conditions on the spectral support (using the stationarity hypothesis), we consider SF = {x ∈ R : F (x) > 0}, the closed set, as the support of F . The following key result, due to Lloyd [1], gives a characterization of the desired equality. Its significance and a connection with approximation problems will be discussed later. 1. Theorem. Let {Xt , t ∈ R} ⊂ L2 (P ) be a mean continuous weakly stationary process and h > 0 be given. Then for each t ∈ R Xt =
∞
atn Xnh
(3)
k=−∞
for some constants atn (i.e., Xt is linearly determined by the sample) iff the translates of the support by h units form a wandering collection, in the sense that {SF − nh−1 , n ∈ Z} is a disjoint family where the vector difference SF − nh−1 = {x − nh−1 : x ∈ SF } is the translated set. Since by the isomorphism, Xnh ↔ einh(·) correspond to each other, and {x → eitx , t ∈ R} generates L2 (F ), it is enough to show that each eitx is determined linearly by {einhx , n ∈ Z}. Hence if N = sp{einh(·) , n ∈ Z}, a closed subspace of L2 (F ), it is to be shown that N = L2 (F ) under the given condition (of wandering of translates of SF ). Now if Π : L2 (F ) → N is the orthogonal projection, we find an explicit form of the operator Π, and then show, under the hypothesis, that it is the identity. Consequently we first present a general form of the projection operator in the following technical statement: 2. Proposition. The orthogonal projection Π : L2 (F ) → N is given by v(x + nh−1 ) dF (x + nh−1 ) (Πv)(x) = n∈Z , a.e.[F ], v ∈ L2 (F ). (4) −1 ) n∈Z dF (x + nh Proof. Observe that the elements of N , the set containing x → eitx , t ∈ R, n ∈ Z are periodic and of period h1 , and hence so are their linear combinations. By approximation all functions from L2 (F ) which are
360
VI. Sampling and Regression for Processes
periodic with period h1 are in N . Clearly f ∈ N ⇒ f¯ ∈ N and real f, g ∈ N ⇒ f ∨ g ∈ N , 1 ∈ N . Hence the real elements of N form a lattice and N is a complete normed linear space. Then by a known result (cf., e.g., the companion volume Rao [21], Theorems II.2.5 and II.2.1), it follows that the orthogonal (=contractive here) projection onto N is just a conditional expectation relative to a unique σ-algebra Bs ⊂ B(R) (and N = L2 (Bs , F )) so that Π = E Bs . Hence
v dF =
A
E Bs (v) dF (= Hv (A), (say)), A ∈ Bs ,
(5)
A
and of course E Bs is always a positive contractive projection. It remains to find an explicit form of E Bs (v) and show that it is given by (4). v However by (5) this is simply dH dF on Bs . To evaluate this, let G0 , G1 on Bs be defined (without motivation unfortunately!) by G0 (A) = G1 (A) =
n
A
n
A
dF (x + nh−1 ),
A ∈ Bs ,
v(x + nh−1 dF (x + nh−1 ), v ∈ L2 (F ), A ∈ Bs .
It is clear that G0 , G1 are measures on Bs and are absolutely continuous relative to F . Hence if Fn (dx) = F (dx + nh−1 ), vn (x) = v(x + nh−1 ), then dFn dG0 (x) = (x), dF dF n
dG1 dFn (x) = (x), vn (x) dF dF n
(6)
Using the chain rule for Radon-Nikod´ ym derivatives, since G1 G0 , it follows that dG1 dG1 dG0 (x). (x) = dG0 dF dF Substituting (6) in this equation, one gets (4) immediately. Remark. In this particular case one can also establish (4) directly, as was done by Lloyd [1] instead of invoking the result about conditional expectations. However, the identification with the general result shows the structure more clearly. Also it is common to take the “characters” as e2πix instead of eix as we did here. This change of scale has no bearing on the following computations. It will yield symmetrical formulas if Plancherel’s identity is used in the work. But this is not needed, and we retain the simpler form of the characters until such calculations are performed.
361
6.4 Periodic sampling of processes and fields
Proof of Theorem 1. Suppose that L(X) = M so that each Xt is determined linearly by Xnh , n ∈ Z, for any t ∈ R. In particular let t = ξh where ξ is an irrational number. Thus Xξh is a linear combination of {Xnh , n ∈ Z} and by isomorphism v(x) = eixξh corresponds to Xξh for v ∈ N . By Proposition 2, and the hypothesis that L(X) = M, one has Πv = v and (4) gives v(x) = where fn =
dFn dF
eixξh +
n∈Z−{0}
1+
fn (x)eiξh(x+nh
n∈Z−{0}
−1
)
, a.e.[F ],
fn (x)
≥ 0. Cross multiplying and simplifying, this gives
(1 − einξ )fn (x) = 0, a.e. [F ].
(7)
n∈Z−{0}
Since ξ is irrational, (1 − einξ ) has nonvanishing (positive) real part for each n, (7) implies fn (x) = 0 a.e. [F ], for n = 0. Thus Fn ⊥ F, n = 0 and there is a singular set Nn , Fn (Nn ) = 0 and F is supported in Nn . If N = ∩n =0 Nn , then F is supported by N and Fn (N ) = 0, n ∈ Z − {0}. But by definition, the support of Fn is Sn which is a translate of SF , the support of F , so that Nn = N − nh−1 , and F (Nn ) = 0, n = 0, with Fm (Nn ) = 0, n = m. However, SF = N ∩(∩n =0 Nnc ) is the support of F and hence is disjoint from Sn = SF − nh−1 . This shows that the support is a wandering set, as desired. For the converse, let Sn = supp(Fn ) be wandering. Then Sm ∩ Sn = ∅, m = n and Sn ∩ SF = ∅, n = 0. Thus Fn ⊥ F, n = 0. But fn = dFn v+0 2 dF = 0, a.e., n = 0. By (4) then v ∈ L (F ) ⇒ Πv = 1+0 = v ∈ N . Thus L2 (F ) = N which means that L(X) = M, as asserted. Remark. In the above computations ξ was irrational, but we could have allowed it to be rational, such as ξ = pq , p, q being relatively prime. / {0, ±q, ±2q, . . . }. Then SF − nh−1 will be disjoint for all n ∈ The preceding result can be translated into a series form, using ideas of Fourier series, of Kotel’nikov-Shannon type as follows. We again follow Lloyd [1]. Thus with the above notation, set SF = S, Sn its translate, and kt (x) =
χS (x + nh−1 )eit(x+nh
−1
)
, x ∈ R, t ∈ R,
n
=
χSn (x)eit(x+nh
n itx
=e
χS +
n∈Z−{0}
−1
)
χSn (x)eit(x+nh
−1
)
.
(8)
362
VI. Sampling and Regression for Processes
Since the Sn are disjoint Borel sets, |kt (x)| ≤ 1 and kt (·) is a periodic function of period h1 . Also kt (x) = eitx a.a. x ∈ S and kt ∈ N [F (S c ) = 0]. Hence eitx dZ(x) =
Xt =
R
R
kt (x) dZ(x).
(9)
Expanding kt formally in Fourier series, one finds kt (x) ∼ K(t − nh)ei(t+nh) n
where K(t) = h S eitx dx and the Fourier coefficients of kt (·) are found to be: h−1 h−1 −1 ixnh h e kt (x) dx = h eixnh χS (x + rh−1 )eit(x+rh ) dx 0
0
r
=h
Note that |K(t)| ≤ h express Xt as a series
R
dx ≤ 1 since diam(S) ≤
S
Xt =
χS (x)eix(t−nh) dx = K(t − nh).
K(t − nh)Xnh ,
1 h.
Then one can
t ∈ R,
n
provided the series converges in mean (for which additional restrictions are needed). For instance, if SF is a disjoint union of intervals (xα , xα ), α = 1, 2, . . . , then one finds K(t) = h
∞ eixα t − eixα t , t ∈ R − {0}, it α=1
= hμ(S),
t = 0.
1 1 Here μ(·) is the Lebesgue measure. In case the support SF = (− 2h , 2h ), t πt πt t then one gets K(t) = h sin h , (or = h sin h if the scale is changed to t → πt) so that an analog of the Kotel’nikov-Shannon series holds. General sufficient conditions, using classical summability results of Fourier series, one can obtain the following:
3. Proposition. Suppose that the support SF of the spectral distribution F of a stationary process {Xt , t ∈ R} is open and is a wandering set of diameter h1 (h > 0). Then the sampling series is (C, 1)-summable to Xt , i.e., Xt = l.i.mn→∞
n k=−n
(1 −
|k| )Xnh K(t − kh), t ∈ R. n
(10)
6.4 Periodic sampling of processes and fields
363
In fact, if supt |tK(t)| < ∞, then the sampling series actually converges in mean, so that in (10) the factor (1 − |k| n ) can be dropped. Proof. Representation (10) is a consequence of some classical results in trigonometric series, and the isomorphism of the time domain and the frequency domain elements. In fact, since kt (x) = eitx , x ∈ SF , is continuous and bounded in SF , the (C, 1)-partial sums are bounded for kt , so that by the dominated convergence, the series converges (C, 1) in N (cf., e.g., Zygmund [1],p.41) and hence (10) holds by the isomorphism given in the discussion following (2). In case supt |tK(t)| < ∞, the ordinary partial sums of the formal Fourier series for kt converges to kt on SF (cf. Zygmund [1], p.43) and then the difference between the (C, 1) and the ordinary partial sum series being bounded and converging pointwise a.e. to zero, the series converges in N . By the isomorphism, the result again follows as before. The preceding considerations for the stationary class can be extended to some harmonizable families of processes. We indicate a few of them in the complements, and an abstraction at the end of this section. Here an extension to random fields is discussed, i.e., the time axis R is replaced by Rn , n > 1. This presents certain new lines of study in the investigations. Thus consider X : Rn → L2 (P ) as a mapping (or a ‘curve’), termed a random field. Suppose E(Xt ) = 0 ¯ t ) = r˜(s, t), the covariance, so that X is stationary if and E(Xs X r˜(s, t) = r(s − t). Then again one has the representation eix·t dZ(x), t ∈ Rn , (11) Xt = Rn
[x·t = \sum_{i=1}^n x_i t_i is the dot product], Z having orthogonal values. A new property of interest here is the isotropy of the field, which means that \tilde r(·,·) is invariant under rotations (in addition to the translation invariance due to stationarity), so that \tilde r(gs, gt) = r(g(s - t)) = r(s - t) for all orthogonal matrices g : R^n → R^n. This gives r a specialized integral representation, which is also due to Bochner. It is given by

    \tilde r(s, t) = r(s - t) = 2^ν Γ(\tfrac{n}{2}) \int_{R_+} \frac{J_ν(λ|s - t|)}{(λ|s - t|)^ν} dG(λ),   (12)

with ν = \frac{n-2}{2}, G being a unique Borel measure on R_+ = [0, ∞), and |s - t| = \sqrt{(s - t)·(s - t)}, n ≥ 2, the Euclidean length. Here J_ν is the Bessel function of the first kind of order ν. The problem is again to sample the process at a discrete set of points which respects the isotropy property of the random field while avoiding the "aliasing" effect, or equivalently obtaining an approximation of the field by this set of
points with arbitrarily prescribed error bounds. As may be expected, the solution involves certain properties of Bessel functions, but a complete solution is possible. In fact, we can present the result even for the harmonizable isotropic case, which has considerable interest since it enlarges the applicational prospects beyond the stationary fields. For both cases an integral representation, analogous to (11), for isotropic fields is desired. To clearly reflect the new property of the random function, or of its covariance, it is useful to obtain an equivalent series expression for (12), using the addition formula for Bessel functions. Anticipating the future relations, let us introduce (at first without motivation) the corresponding harmonizable concept, to continue with the new study.

4. Definition. A covariance function r : R^n × R^n → C (n > 1) is weakly (strongly) harmonizable isotropic if there is a positive definite bimeasure β : B(R_+) × B(R_+) → C, such that
    r(s, t) = α_n² \sum_{m=0}^∞ \sum_{ℓ=1}^{h(m,n)} S_m^ℓ(u) S_m^ℓ(v) \int_{R_+} \int_{R_+}^* \frac{J_{m+ν}(xs) J_{m+ν}(yt)}{(xs)^ν (yt)^ν} dβ(x, y),   (13)

where the various symbols have the following meanings:
(i) s = (s, u), t = (t, v), the spherical polar coordinates of s, t ∈ R^n,
(ii) S_m^ℓ(·), 1 ≤ ℓ ≤ h(m, n) = \frac{(2m + 2ν)(m + 2ν - 1)!}{(2ν)!\, m!}, m ≥ 1,
(iii) α_n > 0, α_n² = 2^{2ν} Γ(\tfrac{n}{2}) π^{n/2},
and the integral in (13) is in the strict MT (ordinary Lebesgue) sense for the weakly (strongly) harmonizable case, the series converging absolutely. As usual, a random field X : R^n → L_0²(P) is weakly (strongly) harmonizable isotropic if its covariance function has the corresponding property. In the event that the field is stationary and isotropic, the β of (13) concentrates on the diagonal of R_+ × R_+, and the series simplifies to (12), after employing standard identities of Bessel functions and of the (ultraspherical) polynomials related to S_m^ℓ(·). Since this is not entirely obvious, the detailed computation will be included for convenience. Thus (13) becomes, with \tilde β(A ∩ B) = β(A, B) and ω_n denoting the surface area of the unit sphere in R^n,
    r(s, t) = α_n² \sum_{m=0}^∞ \sum_{ℓ=1}^{h(m,n)} S_m^ℓ(u) S_m^ℓ(v) \int_{R_+} \frac{J_{m+ν}(xs) J_{m+ν}(xt)}{x^{2ν} (st)^ν} d\tilde β(x)
    = \frac{α_n²}{ω_n} \sum_{m=0}^∞ h(m, n) \frac{C_m^ν(\cos⟨u, v⟩)}{C_m^ν(1)} \int_{R_+} \frac{J_{m+ν}(xs) J_{m+ν}(xt)}{x^{2ν} (st)^ν} d\tilde β(x),

by the addition formula for spherical harmonics, C_m^ν(·) being the ultraspherical polynomial of order ν ≥ 0 and ⟨u, v⟩ denoting the angle between the vectors u and v,

    = \frac{α_n²}{ω_n} \int_{R_+} \sum_{m=0}^∞ \frac{(m + ν) J_{m+ν}(xs) J_{m+ν}(xt)}{x^{2ν} (st)^ν} C_m^ν(\cos⟨u, v⟩) d\tilde β(x)
    = \frac{α_n²}{ω_n} \int_{R_+} \frac{J_ν(xρ)}{(2xρ)^ν Γ(ν)} d\tilde β(x),   (14)

by again using the addition formula for Bessel functions, where ρ² = s² + t² - 2st \cos⟨u, v⟩ (cf., e.g., Lebedev [1], p.124). This is equivalent to (12). Note that r(s, t) = r(s - t) depends only on |s|, |t| and \cos⟨u, v⟩, so that it represents the isotropy.

The relation between (13) and (12) can be further seen from another equivalent form, due to Swift ([1], p.586). It is obtained using certain other Bessel function identities. The result is stated for comparison, although (13) will be used for most of the computations below. The alternative form is thus:

    r(s, t) = 2^ν Γ(\tfrac{n}{2}) \int_{R_+} \int_{R_+}^* \frac{J_ν(|xs - yt|)}{|xs - yt|^ν} dF(x, y),   (13')

where |·| denotes the Euclidean length of a vector, and F is the spectral bimeasure as before. The interest in (13) is that it can be rewritten compactly as a covariance in Cramér's form, and the integral representation, incorporating the isotropy property, can then be obtained immediately. It is then used in the sampling problem for fields that we are interested in. This is just another form of expressing (13) in such a way that one can invoke the following well-known result.

5. Theorem. Let (S, \mathcal S) be a measurable space and F : \mathcal S × \mathcal S → C be a positive definite bimeasure. Suppose that r : T × T → C is a (generalized triangular or Cramér type) covariance function relative to a given family of Borel functions g_t : S → C and F, so that

    r(s, t) = \int_S \int_S^* g_s(x) \bar g_t(y) dF(x, y),   s, t ∈ T.   (15)

Then a random field X : T → L_0²(P) exists with covariance r, given by (15), admitting a unique representation as

    X_t = \int_S g_t(x) dZ(x),   t ∈ T,   (16)
relative to a vector measure Z : \mathcal S → L_0²(P) satisfying

    E(Z(A) \overline{Z(B)}) = F(A, B) = \int_A \int_B^* dF(x, y),   A, B ∈ \mathcal S,   (17)
where the integral in (16) is in the Dunford-Schwartz sense. On the other hand, every field X : T → L_0²(P) given by (16), relative to a vector measure Z satisfying (17), has its covariance function r of the generalized triangular form (15).

The existence of a process (or a field) with the properties demanded in the above result can be obtained by use of the basic Kolmogorov existence theorem (Theorem I.1.1). The representation from (15) to (16) (and conversely) may be expeditiously obtained by means of RKHS methods, as detailed in, e.g., Chang and Rao ([1], Section 7). It should be noted that both the sets S, T in the above theorem are completely general. In our applications they are chosen suitably to deduce the desired representations. In the stationary case S = T = R and F is a bounded Borel measure on the diagonal of S × S (= R × R) (or T = Z, S = [0, 2π)). By a redefinition, it will be shown that (13) is expressible as (15), and the desired representation then follows from this result immediately. Here are the details.

Let \tilde N = {(m, ℓ) ∈ N × N : 1 ≤ ℓ ≤ h(m, n), m ≥ 0} with h(0, n) = 1. If ξ denotes the counting measure on the natural numbers N, let ζ be defined on the power set P(\tilde N) of \tilde N by ζ(A, B) = ξ(A ∩ B). Then it is immediate that ζ is a positive bimeasure which therefore extends to a (σ-finite) measure, as noted already (cf. Berg, Christensen and Ressel [1], p.24). Now define \tilde F : P(N) × P(N) × B(R_+) × B(R_+) → C (P(\tilde N) ⊂ P(N) × P(N)) by setting \tilde F(A_1, A_2; B_1, B_2) = ζ(A_1, A_2) · F(B_1, B_2). It is a bimeasure in the pairs (A_1, A_2) and (B_1, B_2) [or a multimeasure of order four in all the components], which is not a measure, since F need not be one. Next let \tilde g : \tilde N × R^n × R_+ → C be given by

    \tilde g(m, ℓ; s, x) = S_m^ℓ(u) \frac{J_{m+ν}(xs)}{(xs)^ν},   (18)

where s = (s, u), S_m^ℓ is the ℓth spherical harmonic of order m ≥ 0, and ν = \frac{n-2}{2} ≥ 0. With these identifications, it is immediately seen that (13) takes the form:
    r(s, t) = \int_S \int_S^* \tilde g(m, ℓ; s, x) \overline{\tilde g(m', ℓ'; t, y)} d\tilde F(x, y),   (19)

where S = \tilde N × R_+, s = (s, u), t = (t, v). But then the representation (17) holds, and it is given as follows. There is a stochastic
measure Z : S → L_0²(P) such that

    X_t = α_n \int_S \tilde g(m, ℓ; t, x) dZ(m, ℓ; x) = α_n \sum_{(m,ℓ) ∈ \tilde N} \int_{R_+} \tilde g(m, ℓ; t, x) dZ,   (20)

and Z satisfies

    E(Z(A_1, B_1) \overline{Z(A_2, B_2)}) = \tilde F(A_1, A_2; B_1, B_2).   (21)
Since ζ on P(\tilde N) is a measure obtained through the counting measure, one has, on writing Z(m, ℓ; B) = Z_m^ℓ(B), the representation (21) as:

    E(Z_m^ℓ(B_1) \overline{Z_{m'}^{ℓ'}(B_2)}) = δ_{mm'} δ_{ℓℓ'} F(B_1, B_2),   (22)

where δ_{mm'} is the Kronecker delta. Hence (20) becomes, with t = (t, v),

    X_t = α_n \sum_{m=0}^∞ \sum_{ℓ=1}^{h(m,n)} S_m^ℓ(v) \int_{R_+} \frac{J_{m+ν}(xt)}{(xt)^ν} Z_m^ℓ(dx),   (23)
where the family of vector measures Z_m^ℓ on B(R_+) satisfies (22). Thus we have the desired integral representation of the weakly (or strongly) harmonizable isotropic random field X : R^n → L_0²(P), which is recorded for reference as:
6. Theorem. Let X : R^n → L_0²(P) be a weakly (strongly) harmonizable isotropic random field. It admits the spectral representation (23), with F as its spectral bimeasure (signed measure), the series converging in L_0²(P)-mean. On the other hand, a random field X given by (23), with its random spectral (or representing) measure Z satisfying (22), is harmonizable isotropic, and its covariance function is of the generalized triangular form given by (13) (or (13')), or equivalently by (19). The field is stationary iff the F of (22) concentrates on the diagonal of R_+ × R_+.

The representation of the harmonizable isotropic field given by (23) is used in developing the sampling results corresponding to the known Kotel'nikov-Shannon representation for processes presented in Section 1. This is obtained as follows. Just as in the case of one-dimensional time, it is sufficient to consider the band-limited problem for harmonizable fields also, for the purpose of sampling. This is because Theorem 1.3 has a simple extension that enables us to concentrate on the fields whose spectral bimeasure is supported by a bounded rectangle. The desired approximation may be given as:
7. Proposition. Let X : R^n → L_0²(P) be a weakly harmonizable random field with spectral bimeasure β(·,·). Then for any ε > 0 there is a bounded Borel set A_ε ⊂ R^n and a weakly harmonizable field X_ε : R^n → L_0²(P) such that ‖β‖(R^n × R^n - A_ε × A_ε) < ε and

    ‖X(t) - X_ε(t)‖_2 < ε,   t ∈ R^n,   (24)

where the spectrum of X_ε lies in A_ε × A_ε.

The proof is very similar to that of Theorem 1.3 (until (14) there) and will be left to the reader as an exercise. In view of this result, we now present a sampling theorem for a weakly harmonizable isotropic random field whose spectral bimeasure is confined to a bounded rectangle, obtained in (Rao [11], p.207); it extends the corresponding stationary isotropic case in Yadrenko [1].

8. Theorem. Let {X_t, t ∈ R^n} be a weakly harmonizable random field whose spectrum is supported in [0, a] × [0, a] ⊂ R_+ × R_+ for some a > 0. Then for each t ∈ R^n, t = (t, u) in spherical polar coordinates, X_t is determined by a countable set of observations at |t| = t_k, as:

    X_{(t,u)} = 2α_n \sum_{k=1}^∞ \frac{λ_k J_ν(t)}{(λ_k² - t²) J_{ν+1}(λ_k)} T_{λ_k/a} X_{(t,u)},   (25)

where

    T_r X_{(t,u)} = \int_{S_n} \frac{X_{(r,v)}}{[r² - 2tr \cos⟨u, v⟩ + t²]^{n/2}} dμ_n(v)
is a standard simple stochastic integral whose integrand contains a Poisson kernel, α_n² = 2^{2ν} Γ(n/2) π^{n/2}, J_ν is the Bessel function of order ν (= (n-2)/2 ≥ 0), μ_n is the surface (Lebesgue) measure of the unit sphere S_n ⊂ R^n, and the λ_k > 0 are the simple roots of J_ν(λ) = 0, arranged in increasing order, the series converging in L²(P)-mean.

Proof. To begin with, the integral representation (23) of the weakly harmonizable isotropic random field, whose spectral bimeasure is contained in [0, a] × [0, a], will be simplified using some properties of the Bessel functions J_ν. If the λ_k > 0 are the (simple) roots of J_ν(λ) = 0, arranged as λ_k < λ_{k+1}, then by the classical theory the following orthogonality relations hold:

    \int_0^a J_ν(λ_m \tfrac{x}{a}) J_ν(λ_{m'} \tfrac{x}{a}) x dx = δ_{mm'} \frac{a²}{2} J_{ν+1}²(λ_m),   (26)

(cf., e.g., Lebedev [1], or Watson [1], for this and other formulas used below). Also, if {φ_m, m ≥ 1} is the orthonormal system determined by J_ν, then

    φ_m(x) = \frac{\sqrt{2x}\, J_ν(λ_m x/a)}{a J_{ν+1}(λ_m)},   0 ≤ x ≤ a,
and any piecewise continuous f can be expanded in a Fourier-Bessel series relative to the φ_m; in particular, if f(x) = J_ν(γx), one gets the series, converging in mean,

    J_ν(γx) = \sum_{k=1}^∞ c_k φ_k(x),   (27)

where

    c_k = \frac{2}{a² J_{ν+1}²(λ_k)} \int_0^a x J_ν(γx) J_ν(λ_k \tfrac{x}{a}) dx.   (28)
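As a purely illustrative numerical aside (added for this exposition, not part of the original argument), the coefficients (28) can be computed directly; the sketch below assumes the scipy library is available, uses an integer order nu so that scipy's zero routine applies, and implements the expansion in the classical normalization J_ν(λ_k x/a), which is the form to which the coefficients (28) correspond. All parameter values are arbitrary choices.

```python
import numpy as np
from scipy import special, integrate

nu, a, gam = 1, 1.0, 7.3                  # order nu, interval [0, a], test frequency
lam = special.jn_zeros(nu, 40)            # first 40 positive zeros of J_nu, ascending

def c(k):
    # coefficient c_k of (28): (2 / (a^2 J_{nu+1}(lam_k)^2)) int_0^a x J_nu(gam x) J_nu(lam_k x / a) dx
    integrand = lambda x: x * special.jv(nu, gam * x) * special.jv(nu, lam[k] * x / a)
    val, _ = integrate.quad(integrand, 0.0, a)
    return 2.0 * val / (a**2 * special.jv(nu + 1, lam[k])**2)

x = np.linspace(1e-6, a, 400)
series = sum(c(k) * special.jv(nu, lam[k] * x / a) for k in range(len(lam)))
exact = special.jv(nu, gam * x)
# (27) asserts convergence in mean, so we report a weighted L2 error on [0, a]
ms_err = np.sqrt(np.sum((series - exact)**2 * x) * (x[1] - x[0]))
print(f"mean-square error with 40 terms: {ms_err:.2e}")
```

The truncated series matches J_ν(γx) in mean square; near the endpoint x = a pointwise agreement degrades, which is consistent with the mean (rather than uniform) convergence asserted in (27).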
Using these relations in (23), we shall eventually produce the desired expansion (25). For simplicity, let Y_ν(x) = x^{-ν} J_ν(x). Since (23) contains J_{m+ν}, we need to connect the zeros of J_ν(·) with the higher order Bessel functions, using a classical formula due to Sonine. This becomes (cf. Watson [1], p.373), on rewriting in our format,

    J_{m+ν}(rx) = \sum_{k=1}^∞ \frac{2 Y_ν(r)}{(λ_k² - r²) Y_{ν+1}(λ_k)} \Big(\frac{r}{λ_k}\Big)^{m+ν} J_{m+ν}(λ_k x)
                = 2 J_ν(r) \sum_{k=1}^∞ \frac{λ_k}{J_{ν+1}(λ_k)(λ_k² - r²)} \Big(\frac{r}{λ_k}\Big)^m J_{m+ν}(λ_k x).   (29)
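Since the extraction of (29) from the printed page is delicate, a numerical check may reassure the reader; the sketch below (an addition for illustration, assuming scipy, with arbitrary nu and r) evaluates the second form of (29) for m = 0, 1, 2 and compares it with J_{m+ν}(rx) at interior points 0 < x < 1. For m = 0 the identity follows exactly from Lommel's integral; convergence is slow (the error shrinks roughly like 1/K in the number K of terms).

```python
import numpy as np
from scipy import special

nu, r = 1, 2.5                          # order, and a fixed r that is not a zero of J_nu
lam = special.jn_zeros(nu, 400)         # positive zeros of J_nu
x = np.linspace(0.05, 0.95, 7)          # interior points, 0 < x < 1

for m in range(3):
    # k-th coefficient of the second form of (29)
    coef = 2.0 * special.jv(nu, r) * lam * (r / lam) ** m / (
        special.jv(nu + 1, lam) * (lam ** 2 - r ** 2))
    rhs = coef @ special.jv(m + nu, np.outer(lam, x))   # sum over k of coef_k J_{m+nu}(lam_k x)
    lhs = special.jv(m + nu, r * x)
    print(f"m={m}: max |lhs - rhs| = {np.max(np.abs(lhs - rhs)):.2e}")
```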
Using (27)-(29) in (23), with t = (t, u), |t| = t, one gets

    X_{(t,u)} = α_n \sum_{m=0}^∞ \sum_{ℓ=1}^{h(m,n)} S_m^ℓ(u) \int_0^a \frac{J_{m+ν}(tx)}{(tx)^ν} dZ_m^ℓ(x)
              = α_n \sum_{m=0}^∞ \sum_{ℓ=1}^{h(m,n)} \sum_{k=1}^∞ \Big(\frac{t}{λ_k}\Big)^m S_m^ℓ(u) \frac{2λ_k J_ν(t)}{(λ_k² - t²) J_{ν+1}(λ_k)} × \int_0^1 \frac{J_{m+ν}(λ_k x)}{(λ_k x)^ν} dZ_m^ℓ(ax).   (30)
But the S_m^ℓ are orthogonal on the sphere S_n ⊂ R^n relative to the surface measure μ_n, and hence, writing γ_n² = 2^{n-1} Γ(n/2) π^{n/2}, one finds from the first line of (30)

    \int_{S_n} X_{(λ_k/a, v)} S_m^ℓ(v) dμ_n(v) = γ_n \int_0^1 \frac{J_{m+ν}(λ_k x)}{(λ_k x)^ν} dZ_m^ℓ(ax).

With this, the second line of (30) simplifies to:

    X_{(t,u)} = α_n \sum_{k=1}^∞ \frac{2λ_k J_ν(t)}{(λ_k² - t²) J_{ν+1}(λ_k)} × \int_{S_n} \Big[ \sum_{m=0}^∞ \sum_{ℓ=1}^{h(m,n)} \Big(\frac{t}{λ_k}\Big)^m S_m^ℓ(u) S_m^ℓ(v) X_{(λ_k/a, v)} \Big] dμ_n(v).   (31)
One may reduce it further by using the ultraspherical polynomials C_m^ν(·). In fact, the [ ] in the integrand of (31) becomes

    [ ] = \sum_{m=0}^∞ \Big(\frac{t}{λ_k}\Big)^m h(m, n) \frac{C_m^ν(\cos⟨u, v⟩)}{C_m^ν(1)} X_{(λ_k/a, v)}.   (32)
But for k large enough, t/λ_k < 1 (in fact λ_k ∼ kπ and λ_{k+1} - λ_k → π as k → ∞), so that (32) is well-defined, and it can be simplified with the generating function formula for the C_m^ν as:

    [ ] = \frac{1 - t²}{(1 - 2t \cos⟨u, v⟩ + t²)^{n/2}} X_{(λ_k/a, v)}.   (33)
Substituting this in (31), it becomes (25), as asserted.

Remarks. 1. If X_t is moreover stationary, this reduces to the corresponding result due to Yadrenko ([1], p.196). However, the series representations are not unique, since one may use other orthonormal sequences and produce the corresponding expansions with the above procedure. (For several such expansions, even for processes, see, e.g., Cambanis and Liu [1].) In a sense, the procedure used for (25) may be regarded as 'natural', as it is analogous to the Kotel'nikov-Shannon expansion in the one-dimensional time.

2. It is also possible to generalize the Lloyd method of Theorem 1 above, with wandering supports, at least for the sufficiency part. We shall indicate a possibility of it in the complements.

As a final item here, let us sketch an abstract result from deterministic approximation (or sampling) theory, due to Kluvánek [1], to show how a number of such results may be extended to the stochastic case with simple modifications, using standard facts from vector measures and the isomorphism as in Theorem 1. It illustrates the underlying group structure of many of these representations. Thus let G be an LCA (= locally compact abelian) group and H ⊂ G be a discrete subgroup with (discrete) annihilator Λ = H^⊥ = {γ : ⟨y, γ⟩ = 1, y ∈ H} ⊂ Γ (= \hat G), the dual group of G. Then the quotient Γ/Λ = \hat H (as the dual of H) is compact. Let μ_G, μ_Γ, μ_Λ and μ_{\hat H} be the respective Haar measures on (the LCA groups) G, Γ, Λ, \hat H, which may (and will) be normalized to satisfy the following equation for integrable functions F : Γ → C and cosets \tilde γ (= γ + Λ ∈ \hat H):

    \int_Γ F(γ) dμ_Γ(γ) = \int_{\hat H} \Big[ \int_Λ F(γ + λ) dμ_Λ(λ) \Big] dμ_{\hat H}(\tilde γ).   (34)
This is a specialization of a fundamental result, known as the Weil-Mackey-Bruhat (or WMB) formula in Harmonic Analysis, valid actually for all locally compact groups G and closed subgroups Λ with quotient spaces G/Λ on which a "quasi-invariant" measure μ_{\hat H} satisfying (34) exists. [For details and applications to Probability Theory, the reader may consult, e.g., Rao [12], Sec. V.5, especially p.268.] Now \hat H is a compact abelian group, as the dual of a discrete subgroup H, so that μ_{\hat H}(\hat H) = 1. This abstract formulation appears unrelated initially, but will be seen to be very useful and interesting in stochastic analysis.

For our purpose, let Ω be a Borel subset of Γ containing exactly one element from each coset \tilde γ (= γ + Λ ∈ \hat H), i.e., Ω ∩ \tilde γ is a singleton for each γ ∈ Γ. In the above formulation, Ω = Γ/Λ, but other choices for which (34) is valid are possible, and they actually appear in applications. When G = R (= \hat R = Γ), one takes H = {· · · , -2h, -h, 0, h, 2h, · · · }, 0 < h ∈ G, and Ω = (-α, α] (for some α = α_h > 0) identified as a compact subgroup of R, with addition (mod 2α) as the group operation. Define a function φ : G → C by

    φ(t) = \int_Γ χ_Ω(γ) ⟨t, γ⟩ dμ_Γ(γ),   t ∈ G,   (35)
which is the inverse Fourier transform of χ_Ω, ⟨t, ·⟩ being the group character (= e^{it(·)} for G = R). With these notations one can state the following useful result of Kluvánek's [1]:

9. Theorem. Let f ∈ L²(G, μ_G) be such that its Fourier transform \hat f vanishes off Ω (i.e., \hat f|_{Ω^c} = 0). Then f is a.e. equivalent to a continuous function on G, and if f itself is continuous, with φ of (35):

    f(t) = \sum_{y ∈ H} f(y) φ(t - y),   t ∈ G,   (36)

the series converging uniformly on G and also in L²(G, μ_G)-norm. Moreover,

    ‖f‖_2² = \sum_{y ∈ H} |f(y)|².   (37)
The proof depends on properties of the continuous φ, which include ‖φ‖_2 = 1, φ(0) = 1, φ(H - {0}) = 0 and \int_G φ(t) \overline{φ(t - y)} dμ_G(t) = 0, 0 ≠ y ∈ H. The norm equation (37) is a consequence of Plancherel's theorem. The details will be omitted. [However, a long sketch is included in the complements.] Let us present its stochastic analog and applications.

We consider a random field {X_t, t ∈ G} of Cramér-type relative to a family {f_t, t ∈ G} ⊂ L²(Γ, μ_Γ), where G is an LCA group with dual
\hat G = Γ. Suppose that Ω and H are chosen as above, that \hat f_t|_{Ω^c} = 0, t ∈ G, and that f : G × Γ → C is jointly (Borel) measurable. Consequently

    X_t = \int_Γ f_t(γ) dZ(γ),   t ∈ G,   (Γ = \hat G)   (38)

for a unique stochastic measure Z : B(Γ) → L_0²(P), as seen in Section 1 above, where E(X_t) = 0 and

    r(s, t) = E(X_s \bar X_t) = \int_Γ \int_Γ^* f_s(γ) \bar f_t(γ') dβ(γ, γ'),

relative to a bimeasure β, with X_t ↔ f_t in one-one correspondence and ‖X_t‖_{2,P} = ‖f_t‖_{2,β}. Since f_s : Γ → C, s ∈ G, satisfies the hypothesis of Theorem 9, one gets, by the isomorphism just noted, the following representation:

    X_t = \sum_{y ∈ H} X_y φ(t - y),   (39)
n∈Z
X nπ α
sin ((s + 1)(αt − πn)) − sin s(αt − πn) αt − πn
(40)
373
6.4 Periodic sampling of processes and fields
the series converging in L2 (P )-mean and also a.e. Taking s = 0, this reduces to the original Kotel’nikov-Shannon sampling theorem and our extension is based on a computation from Higgins [1]. It may be observed that if the process is harmonizable and band-limited to the set Ω × Ω then the isomorphism result is valid in this case and hence the same conclusion can be drawn. Example B. Let G = Rn , H = α1 Zn , a scaled integer lattice (α > 0), and Ω = G/H ⊥ , identified as {x = (x1 , . . . , xn ) : |xi | ≤ α}. Then simplifying the integral below using spherical polar coordinates, one gets πk 1 eiy·(x− α ) dy ϕ(x) = (2α)n Ω π n J n (|αx − πk|) = ( )2 2 n 2 |αx − πx| 2 and for the stationary field {Xt , t ∈ G} whose spectrum is confined to Ω, (39) becomes:
Xt =
k=(k1 ,... ,kn )
X( πk1 ,··· , πkn ) α
α
J n2 (|αt − πk|) n
|αx − πk| 2
,
(41)
the series converging in L2 (P )-mean. This example is differently obtained in Yadrenko ([1], p.201). The next one is another specialization of (39) for discrete parameter processes. Example C. Let G = Z, H = kZ, k > 1, an integer, Λ = {e2πijn k, j = 0, 1, . . . , k − 1, n ∈ Z} and Γ = {einx , 0 ≤ x < 2π, n ∈ Z}, the circle group. Let Ω = Γ/Λ = [0, 2π k ) so that k ϕ(n) = 2π
2π k
0
einy dy =
2πin k (e k − 1). 2πin
Since L2 (G, μG ) = 2 now, let f ∈ 2 be such that fˆ|Ωc = 0, i.e., −inx fˆ(x) = f (n) = 0, x ∈ Ωc . Then by (39), for a stationary ne sequence {Xn , n ∈ Z} with spectral measure supported in Ω, one has Xm =
2πi e k (m−nk) − 1 k , m ∈ Z. Xkn 2π i(m − kn)
(42)
n∈Z
This representation is also based on a computation from Higgins [1]. We shall omit further examples, and conclude the sampling of processes by
374
VI. Sampling and Regression for Processes
periodically selected observations. Some additional results are sketched in the complements. So far we have been considering real or complex valued processes only. But there are substantial applications of processes with values in certain groups such as SO(n), of orthogonal, or GL(n), of n × n non singular matrices, as exposed with great detail in Grenander and Miller [2]. These should be the next key item of research in this subject, and we have not discussed them here for lack of known detailed sampling theorems. 6.5 Remarks on optional sampling Let X = {Xt , t ∈ I ⊂ R} be a process. In contrast to the work of the preceding sections, of sampling the process at certain periodic or other suitable fixed times to know the structure, one may consider sampling at random times, and then study properties of the new process, and try to establish its relation with the original one. It is clear that the previous methods will have to be replaced by a different set since, for instance, stationarity or harmonizability notions may not remain the same under optional sampling. In the new approach, the group structure of the index set may be relaxed, and only certain order properties are needed. We now make the problem concrete and briefly explain the difference with the preceding ideas. In a realization of the process, suppose one merely wants to observe it at random times {Tj , j ∈ J ⊂ I} where for j < j , Tj ≤ Tj and then the observed set becomes {Yj = X ◦Tj , j ∈ J}. From this we would like to obtain properties of the X-process itself. It is natural and meaningful to demand that the random times should be determined only by the prior observations. This implies that the Tj should be stopping (also sometimes termed Markov) times determined by the original process X. Thus let Ft = σ(Xs , s ≤ t) and denote the given process as {Xt , Ft , t ∈ I} and Tj : Ω → I be stopping times of the increasing (or filtering to the right) family {Ft , t ∈ I} of the underlying probability space (Ω, Σ, P ) with I as (for simplicity) a subset of R. However, even if the X-process is Brownian Motion, the optionally sampled Yj -process need not be a BM (J ⊂ R). Thus only for a certain class of processes such a sampling can be made so as to reflect properties of the original one. This will be possible for a class of (sub) martingale families. Thus the problem is to find conditions such that a submartingale {Xt , Ft , t ∈ I} under an increasing set of stopping times (also termed optionals) of {Ft , t ∈ I} goes into {Yj = X ◦ Tj , F(Tj ), j ∈ J}, to remain a submartingale, and to have some properties reflected from the X-process. Here F(Tj ) = σ{A ∩[Tj ≤ i] : A ∈ Ft , t ∈ I} is called the σ-algebra of events prior to
6.6 Regression for random processes and its basis
375
the stopping time Tj . It is seen that this reduces to Ftj for constant times Tj = tj . This indicates that the problem becomes obviously technical, and such processes have key applications in gambling and related questions whereas stationary and harmonizable classes are of interest in signal extraction and associated studies. Recall that {Xt , Ft , t ∈ I ⊂ R} is a (sub)martingale if E(|Xt |) < ∞ and for any s, t ∈ I, s < t ⇒ E Fs (Xt )(≥) = Xs a.e., or in words the conditional expectation of Xt given the past up to s is (at least) equal to Xs . The composition Yj = X ◦Tj is generally nonlinear, and so there are some technical problems (mainly of a measure theoretical nature) for analysis of the continuous parameter case. If I is a discrete set or t → Xt is continuous a.e., then these are slightly simpler. We present a result for these latter types to illustrate the kind of difficulties that arise. [This is just a different form of Theorem IV.2.4.] 1. Proposition. Let X = {Xt , Ft , t ∈ I ⊂ R} be a uniformly integrable (sub)martingale where either I is countable or t → Xt is a.e. continuous and ∩t>s Ft = Fs , s, t ∈ I. If {Tj , j ∈ J} is a stopping time process of the family {Ft , t ∈ I}, J ⊂ R, then the optionally sampled process {Yj = X ◦ Tj , F(Tj ), j ∈ J} is also a uniformly integrable (sub)martingale. This says that under the above integrability condition, the observed process retains the (sub)martingale property. The proof is available in the companion volume (cf., Rao [21], Prop. 4.2.3) for the martingale case, (and the submartingale case is obtained by the same methods). This was originally established by Doob (cf. [2], Sec. VII.11). We shall not reproduce the proof here, as it depends on several nontrivial properties of these processes. The detailed analysis becomes more technical and branches out into a different direction (namely stochastic calculus). In fact, the reader can easily get it from Theorem IV.2.4. Thus far we have not discussed another important aspect of this analysis, namely the regression problem for random processes as well as the corresponding (random) measures. This will now be taken up in the next sections because of its potential interest in this subject. 6.6 Regression for random processes and its basis If X and Y are a pair of random variables (or [finite] vectors) on a probability space (Ω, Σ, P ) whose joint distribution is or can be specified, suppose that some probability distributional structure of X is assumed known (or can be specified) as well as the existence of the first moment of Y may be taken to exist, then a key problem is to ‘predict’ the values of Y given X with the properties specified above.
376
VI. Sampling and Regression for Processes
It implies that one should find the coditional expectation, E(Y |X) of Y given X. For this no moments of X need be demanded and just the existence of one moment of Y suffices. After studying some special problems, this conditional expectation as a function of X is defined as the Regression of Y over X, by Cram´er (1946) and Wilks (1962) as well as most people working, e.g., in Statistical Analysis and Econometrics, as in Chipman (2011), and others. In all of these sources a functional type of gY (X) = E(Y |X) is assumed and the resulting subjects developped. The actual form of this gY (X), which always exists, is not really simple and was cosidered by J. L. Doob (1953,p.603) in a case which is generalized by Dynkin (1960,p.6), and will be given here (called a Doob-Dynkin lemma in my book, Rao (1981,p.4), with a more elaborate explanation and details in my later book, Rao (2005, p.31)), in outline for an appreciation of the theoretical problem involved. 1. Lemma Let (Ωi , Σi ), i=1,2, be measure spaces and f : Ω1 → Ω2 be a function measurable for (Σ1 , Σ2 ). Further let A = f −1 (Σ2 )(⊂ Σ1 ), and g : Ω1 → R be any function. Then g is A-measurable (relative to the natural Borel algebra R and Σ1 ) if and only if there is a measurable function h : Ω2 → R, for (Σ2 , B) such that g = h(f ). Proof. Let (Ωi , Σi ), i=1,2 be measurable spaces and f : Ω1 → Ω2 be a function measurable for (Σ1 , Σ2 ). If A = f −1 (Σ2 ) ⊂ Σ1 , so that f always satisfies the asserted relation, then the converse implication needs to be established. n Suppose that g = i=1 ai χAi , Ai ∈ A, disjoint. From the measurability of f , there exist Si ∈ Σ2 satisfying Ai = f −1 (Si ) which may not be disjoint. Let Ti be a disjuntification of the Si sequence, then it is seen that f −1 (Tj ) = Aj , since the Ai are disjoint. If hi = Σni=1 ai χTi , so that h is measurable, it is seen that h ◦ f = g, and the result holds in this cse. The general case follows from this if S0 is the set of points where the hn sequence converges, on using the fact that gn = hn ◦ f , and that gn → g, one sees that g = h ◦ f , from which the general result is derived with an easy approximation as used in the standard measure theoretical conclusions, also detailed in my book noted above. The point of this result is to consider various forms of the function g, and to show how different classes of regression problems are motivated in the work that follows, and discuss the linear case, or other forms that appear in applications. The linear and vector types of it are particularly important in applications. In fact, the latter type of relation was noted by the well-known Econometrician, Ragnar Frisch who then has raised the problem of characterizing the linearity of regression of Y on X when X = aξ + α, Y = bξ + β where a, b ∈ R and ξ, α, β are independent
377
6.6 Regression for random processes and its basis
random variables. After some particular results by H. V. Allen and E. Fix, the following more general characterization was obtained by M. Kanter. This will describe the central aspect of the problem. [For nontriviality, X, Y are taken to be linearly independent. The solutions obtained by the earlier authors is described a little later.] The general form of the function g given in the above lemma is not simple. For instance g(x) = ax for some constant a obviously implies a linear regression. But in the simple case, with a = 1, the characterization problem (as asked for, by Girshick and Savage in 1952), has a direct proof which is based on a (relatively deep) characterization of uniform integrability of a set of integrable random variables, by de la Vallee Poussin obtained in 1915, (detailed, e.g., in (Rao(1984) as Theorem 1.4.5) asserting that a collection A = {Xt , t ∈ I} ⊂ L1 (P ) is uniformly integrable if and only if there exists a positive symmetric convex function φ on R such that φ(x)/x → ∞ as x → ∞ and φ(2x) ≤ Kφ(x) for some constant K > 0, so that sup{E(φ(Xt )) : t ∈ I} < ∞.
(1)
In our case I has just two points and so it is trivially uniformly integrable. So g(x) = x, implies by the conditional Jensen’s inequality E(φ(X1 )|X2 ) > φ(E(X)) = E(X),
(2)
strict inequality with probability 1, unless X1 = X2 , a.e. The point of this is to emphasize the nontriviality of the linearity of the regression problem and its structure. This implies that E(V |X) = gV (X) = X is too stringent a demand. We thus have to relax the condition on gV (·) and discuss the resulting regression problems. Here are some related possibilities. The vector (Y, X) may be replaced by an L1 (P ) sequence {Xk , k ≥ 1} and ask for its behavior studied under the name weak martingale, if: E(Xn |Xm ) = Xm , a.e., n > m ≥ 1
(3,)
and one can study the properties of the Xn -sequence. It has an interesting consequence which can be considerd separately, (see, e.g., Rao (2007) for more detailed analysis). Such a sequence has also sevaral applications. Returning to the regression problem, it is an elementary result that the vector (X, Y ) distributed jointly as a bivariate normal 2 with means mX and mY , correlation ρ(X, Y ), and varianves σX , σY2 then E(Y |X = x) = gY (x) = mY + ρ(X, Y )
mY (x − mX ), mX
(4)
378
VI. Sampling and Regression for Processes
and it shows that more involved expressions for gV (X) can lead to a significant elaboration of the problem which will amplify Frisch’s question stated above. The structure of the problem is illustrated by Fix’s solution described as follows. If X, Y are as in Frisch’s formuation, and I ⊂ R+ (or its reflection in the origin), then the regression function is linear if and only if either ξ = a (a constant) and α = 0 or the Fourier transforms (= ch.f.’s) φξ , φα are given as: φX (t) = exp{−t2 u|t|ν }, 1 < ν ≤ 2, u > 0,
(5)
and that φα (t) = exp{−u| t|ν } with the same ranges as above ( both {α, xi } having ν moments). The significance of this result is that the random variables {α, ξ} may be of a more general distribution class. Indeed recalling the structural properties based on the classical L´evy-Khintchine representation of an infinitely divisible distribution, the α, ξ both belong to a ‘stable’ class (a subclass of the infinitely divisible random variables that includes the Gaussians). The point of this result is that the linearity of regression problem is far more general than the (commonly assumed) Gaussian class. The first comprehensive result on the linearity problem was obtained by M. Kanter (1972) which will be described for a motivation as follows. Recall that an integrable random variable ξ is symmetric stable if its characteristic function is of the form φξ (t) = exp{−c|t|p } for some c > 0, 1 < p ≤ 2, t ∈ R so that p = 2 reduces to the Gaussian case. Then Kanter’s result in our context may be described as follows: Let {ξi , 1 ≤ i ≤ n} be stable independent symmetric random variables of n n index p ∈ (1, 2], and X = i=1 ai ξi , Y = j=1 bj ξ, then the regression of Y on X is linear in that E(Y |X) = λX for some constant λ which depends on the parameters ai , bi , i = 1, · · · , n. A related result on the linearity of regression for such variables will be included in the complements section. An interesting extension of the above noted regression results, due to C.D. Hardin (1982), on the problem will be detailed here, explaining the structure in more detail starting with a key property of the Gaussian process. It is said that a random vector X = (X1 , · · · , Xn ) is spherically distributed if X and AX have the same distribution for all orthogonal n×n matrices A. Now a basic study on spherical distributions is given by Kelker (1970) which includes naturally the Gaussians, but is not limited to them. A process X = {Xt , t ∈ T } is spherically generated if each of its finite set of vectors {Xt1 , · · · , Xtn } is spherically distributed. Thus each centered Gaussian process is spherically distributed. A process
6.6 Regression for random processes and its basis
379
X = {Xt , t ∈ T } is termed spherically generated if each finite dimensional vector {Xt1 , · · · , Xtn } is equivalent to a spherical vector in the sense that it has the same distribution (or characteristic function = the Fourier Stieltzes transform) as a spherical vector. Thus each centered Gaussian process is spherically distributed but there also exist other processes with this property. The following characterization of spherical vectors is of interest in this study, and it is from Hardin (1982). 2. Lemma. Let X, Y be a pair of random variables in Lp (P ), p ≥ 1, of unit norms and E(X|Y ) = 0. Then {X, Y } is a spherical vector. Proof. Taking 0/0 = 0 here the result is to be established if both X, Y are nonzero elements. Also it is well-known that conditional expectation is a contractive projection on the Lp (P ) spaces (cf. e.g. Rao (1975)). Under these conditions a classical result due to Kakutani (1939) can be utilized in a representation of this operator as follows. The desired result is that in a normed linear space of at least three dimensions can also be given an inner-product if and only if every two dimensional subspace is the range of a contractive projection. Since conditional expectation is a contractive projection, as already noted, and the mapping X → (X, Y )(Y, Y )−1 Y is seen to be such a projection, then E(X|Y ) = CY is the desired form, where the quantity C = (X, Y )/(Y, Y ) is a desired constant. It will now be shown (the key conclusion) that (X, Y ) is a spherical vector, to proceed for the linear regression property. Assume that X, Y are both of unit norm, for simplicity, and let E(X|Y ) = 0 so that X and Y are orthogonal, i.e., X ⊥ Y by the above reduction. It should be shown that the vector (X, Y ) is spherical, utilizing the Kakutani (linear) representation of the contractive projection (here the conditional expectation), which now becomes for any non-zero vectors {X, Y } the representation E(X|Y ) =
(X, Y ) Y, (Y, Y )
(6)
so that for non-vanishing {X, Y } the projection E(X|Y ) = 0. By choosing suitable (non-zero) vectors for which E(X|Y ) = 0 should imply their spherical charactor. This needs a detailed computation as follows. Consider for an arbitrary θ the computation: (X, X cos θ + Y sin θ) (X cos θ + Y sin θ) X cos θ + Y sin θ 2 = cos θ(X cos θ + Y sin θ), (7)
E(X|X cos θ + Y sin θ) =
380
VI. Sampling and Regression for Processes
To obtain finer properties of the random vector (X, Y ) it is desirable to use their chaqracteristic functions and use their differentiability properties. Let φ, ψθ be characteristic functions of (X, Y ); < X, X cos θ+Y sin θ > for an arbitrarily fixed θ. and note that ψθ (t, s) = φ(t + s cos θ, s sin θ). It is now a useful property of characteristic functions, detailed in Lukacs and Laha (1964, Thm. 6.1.1;p. 103) implying that a pair of random vectors (Z1 , Z2 ) ∈ L1 (P ) has the linear regression property if and only if its joint characteristic function φ satisfies the differential equation: ∂f (0, λ2 ) ∂f (0, λ2 ) =c , ∂λ1 ∂λ2
(8)
holding for all real λ2 and a constant c. Using the relation between ψ, φ above, the equation (8) can be reduced (on recalling φ(s, t) = E(eisX+itY )) to: ∂ φ(r cos θ, r sin θ) ∂s ∂ ∂ = cos θ[cos θ φ(r cos θ, r sin θ) + sin θ φ(r cos θ, r sin θ)]. ∂s ∂t
(9)
This and the earlier result on characteristic functions related to linear regression from the Lukacs-Laha account, implies that the quantity in the square braces in (9) must vanish for cos θ = 0, and this is true for almost all θ by the linear independence of X, Y as assumed in the begining. With this a chain rule of the quantity in the square brackets, shows that it is a.a. never zero. This implies that (X, Y ) is a spherical vector as asserted. The following result, due to Hardin (1982), shows the usefulness of spherically distributed random processes in studies of the linearity of regression questions. 3. Theorem. Let {Xt , t ∈ T } be a stochastic process in Lp (P ), p ≥ 1, and have at least 3 linearly independent elements if p = 1, or 2 elements if p = 2. Then the process has the linear regression property if and only if it is spherically distributed. Proof. First it is asserted that each spherically generated process in L1 (P ) has the linear regression property. Thus if (X0 , X1 , · · · , Xn ) is linearly independent, then there is a spherical vector (Y0 , Y1 , · · · , Yn ), necessarily linearly independent, such that they both belong to the same subspace if X0 , Y0 are replaced by zero elements. But the sphericity of the Y -process also implies that the order preserving property of conditioning shows, due to sphericity, that E(±Y0 |Y1 , · · · , Yn ) are equal
381
6.6 Regression for random processes and its basis
and hence equal to zero a.e. Since for the problem one can assume both the Xi and hence the Yi are linearly independent, each subset being (linearly) generated by the corresponding X and Y subsets, it follows that the following holds: n n n E(X0 |X1 , · · · , Xn ) = E( ai Yi |Y1 , · · · , Yn ) = aj Yj = bj X j , i=0
j=1
j=1
with probability one. Hence the sufficiency assertion holds. For the converse let at least two (or three) Xi ’s be linearly independent, and not necessarily Gaussian, but for which the linearity of regression is true. It is to be shown that these exist a spherical process generating the given one for which the linearity of regression is valid, using Lemma 2 above. It is assumed that there is a contractive projection onto the range of X1 , · · · , Xn which in this case is the conditional expectation in the space of at least three dimensions, and the Kakutani result on the existence of (a conntractive or equivalently) an orthogonal projection onto the (finite dimensional Hilbert space) range genarated by the finite number of random variables under consideration. These vectors may be taken to be orthonormal by the usual Gram-Schmidt process on this finite dimensional space, the procedure gauranteed by the Kakutani theorem. Suppose then the sequence (X0 , · · · , Xn ) belongs to the space determined by the (orthonormalized) Yk , 1 ≤ k ≤ n. It is to be shown that the (Y0 , · · · , Yn ) so obtained is a spherical vector to establish the theorem. k−1 Thus let Y0 = X0 ( X0 )−1 for k > 1, set Yk = X0 − j=0 (Xk , Yj )Yj using the inner product notation, and then normalize it, so that the Yj are orthonormal. Then the Xj are in the linear manifold determned by the Yi , 0 ≤ i ≤ j, and it is asserted that {Yj , 0 ≤ i ≤ n} is a spherical vector. n If Yn = i=1 ui Yi for a unit coordinate vector u = (u1 , · · · , un ), then since Y0 ⊥ Yk , k ≥ 1 it follows by Lemma 2 above that the conditional expectation E(Y0 |Yu ) = 0. Hence it can be concluded that (Y0 , Yu ) is a spherical vector. Since this argument is valid for any finite subset of the process, it follows that the given process has the linear regression property (if and) only when it is spherically generated, as asserted. Remark. In spite of the above result, it should be noted that there exist particular processes that have linear regression property without being spherically generated, so that the regression property is more general than sphericity. The following example will exemplify this point. 4. Example. Let {X, Y } be a ‘process’ of independent symmetic p−stable (in the sense of L´evy-Khintchine) random variables where 1 ≤ q < p < 2, so that the two element ‘process’ generates a two
382
VI. Sampling and Regression for Processes
dimensional subspace of Lq (P ). This process has the linear regression property but it is asserted not to be spherically distributed. To see this note that the class of symmetric random variables form a linear space, and by Kanter’s result recalled above that for a pair of jointly p-stable random variables the linear regession property holds. Hence it is to be verified that the (X, Y ) is not spherically distributed. This may be seen by contradiction as follows. Since (X, Y ) is an independent spherical vector of order p, then so will be (aX + bY, cX + dY ), and its characteristic function φ must be radial and hence φ(cos θ, sin θ) will be a constant in θ. But a simple computation with these substitutions shows that φ(cos θ, sin θ) = exp[−k(|a cos θ + c sin θ|p + |b cos θ + d sin θ|p )] for some k > 0. Some not entirely simple argument shows that this will be impossible unless p = 2, the Gaussian case which is excluded. (The omitted details of computation may be found in Hardin(1982).) Thus the regression problem is closely related to the ‘stable class processes’ that include the Gaussians but generally encompassing the latter, which is often considered to be central. This implies just φ(cos θ, sin θ) = exp[−k(|a cos θ + c sin θ|p + |b cos θ + d sin θ|p ), (10) for some k > 0. But this holds only if the exponent of the exponential is a constant. A not entirely simple argument (detailed in Hardin) shows that it is true only if | sin θ|p + | cos θ|p = 1, and that too (if and) only if p = 2. Thus we conclude that linear regression property obtains here without the pair (X, Y ) being spherically generated. Recall that a random variable X has a symmetric stable distribution of index 0 < q ≤ 2 if φX (t) = exp{−c|t|q } for some 0 < c < ∞, so that (centered) Gaussians are symmetric stable of index q = 2. An interesting extension of this class, called subGaussian, is a process {Xt , t ∈ R} whose finite dimensional characteristic functions are given by the equatons: p − ln E(exp{itX}) = [− ln E{exp{itY })}] 2 , t ∈ R, (11) and the Y -process is termed the governing family. This is a generalization of the Gaussian class and is somewhat more inclusive. It forms a subclass of the symmetric stable processes family. An interesting account with historical detail is found in Feller’s Volume II (1966). The sub-Gaussian class was later introduced as it is found useful in relation to the linearity of Regression problems. In fact the following assertion largely ties up the preceding concepts with the linearity of regression in a comprehensive way:
6.6 Regression for random processes and its basis
383
4. Generalized Theorem. Let {Xt , t ∈ T ⊂ R} be a symmetric stable process of at least three linearly independent random variables. Then the following statements are equivalent: 1. The process has the linear regression property; 2. The process is spherically generated; 3. The process has the subGaussian property. The point of this result is that the (linear) regression property is inherently based on the conditioning concept and it inherits the generality and the computational problems (and ambiguities and traps) imbeded in the subject. To avoid the problems, at the outset, many people assume typically the relation E(Y |X) = gY (X) to be a linear function of X and term the problem (linear) regression by assumption (although it cannot be a property of classes of processes, as seen from the above work. Even such an important and standard book as Wilks (1962),p.83), takes the linearity of the gX (·) as definition, whereas Cram´er ((1946), 270–276, and later) details how conditioning came to be the natural concept for regression problems and shows how lineariy of regressions are considered as useful procedures in applications. But the problem is deeper and the difficulties of computing conditionals appear in different forms and details of studies. To emphasize this question let us restate the general form of symmetric α-stable distributions in a slightly more general form and a few properties ending up with a chaqracterization of the linearity of regression, essentially due to Miller (1978), using the criterion given in Lukacs and Laha (1964) already referred to, and employed. Let us restate the symmetric α-stable (or SαS, for short) process in n-dimensions, or in Rn , to explain the structures more vividly. It is convenient and easier to state the symmetric stable distribution in Rn in terms of its characteristic function (ch.f.). Thus X : Ω → Rn is symmetric α-stable (0 < α ≤ 2) if its ch.f. φ : Rn → R is representable as: Z φ(t) = exp{− | < x, t > | α dF (x)} S
where F is called the spectral measure of the process on the sphere S = {x : x = 1}. Extending the covariance in the centered Gausissian case with unit variance, define using the Fourier integral representation of its covariance, now termed covariation, denoted C(i,j) , is given by xi (xj )α−1 dF (x), (12) C(i,j) = S
where 0 < α ≤ 2, and F is the spectral distribution on the Borel sets of the unit sphere S of Rn . With this preamble the linearity of regression for the SαS processes can be presented as follows.
5. Proposition. Let {X_0, X_1, · · · , X_n} be jointly SαS distributed random variables with at least three linearly independent elements and spectral measure F on the Borel σ-algebra of the unit sphere S of their (n+1)-dimensional range space. Then the linearity of regression holds, in the sense that

    E(X_0|X_1, · · · , X_n) = \sum_{i=1}^n A_i X_i,   a.e.,   (13)

if and only if for all real constants B_1, · · · , B_n one has

    \int_S \Big( x_0 - \sum_{i=1}^n A_i x_i \Big) \Big( \sum_{i=1}^n B_i x_i \Big)^{α-1} dF(x_0, x_1, · · · , x_n) = 0,   (14)

where F is the spectral distribution of the SαS process under consideration. The constants A_i, i = 1, · · · , n, are moreover uniquely determined if the X_i, i = 1, · · · , n, are linearly independent.

The coefficients A_i, i = 1, · · · , n, are determined using least squares or other methods, and that constitutes a separate problem by itself. In case n = 2 the following explicit recipe is available. Thus if E(X_0|X_1, X_2) = a_1 X_1 + a_2 X_2, then (a_1, a_2) are the solutions of the pair of equations:

    a_1 C_{i1} + a_2 C_{i2} = C_{0i},   i = 1, 2,

where the C_{ij} = \int_S x_i (x_j)^{α-1} dF(x) are the 'covariations' between X_i, X_j, i, j = 0, 1, 2, already defined above. The details are standard (but not entirely simple), and may be found in Miller (1978).

It was already noted in Section 2.4 that there are some problems in evaluating conditional expectations, especially in evaluating them for use in practical applications. Since regression is inherently a concept based on conditional expectation, one should expect the same anomaly to appear in some form here too. This will now be exemplified with a simple (known) standard example and its impact on applications. Recall that if X : Ω → R is any integrable random variable on a probability space (Ω, Σ, P) and B ⊂ Σ is a sub σ-algebra, then the conditional expectation E^B(X) of X is given, via the Radon-Nikodým theorem, as:

    ν_X(A) = \int_A X dP = \int_A \tilde X dP_B,   A ∈ B,
where \tilde X = \frac{dν_X}{dP_B}, the stated derivative, and \tilde X = E^B(X) is the conditional expectation of X relative to (or given) B. Taking A = Ω in the above, one has the fundamental relation E(\tilde X) = E(X). If B = σ(Y), the σ-algebra generated by Y, and F_{X|Y}(·) denotes the conditional distribution of X given Y, with density f_{X|Y}(·|y), then, if X has one moment, we can express the conditional expectation as:

    E(X|Y = y) = \int_R x dF_{X|Y}(x|y) = \int_R x f_{X|Y}(x|y) dx.   (15)
This may be stated in an extended form, using the standard results in the theory of Measure and Integration, as

    E(X^n|Y = y) = \int_R x^n dF_{X|Y}(x|y) = \int_R x^n f_{X|Y}(x|y) dx,   (16)
and E(X^n) = E(E(X^n|Y)) for all n ≥ 1 for which the inner quantity (the conditional expectation of X^n) exists by (15) and (16). But the following example shows that there can be a difficulty in this application, analogous to that noticed earlier in the Kac-Slepian problems. The example is given here to show that certain structural difficulties exist in Regression, as in Conditioning itself, which is perhaps not unexpected, as the former is basically a specialization of the latter. For the example, let

    f_{X|Y}(x|y) = (π/y)^{-1/2} e^{-x² y},   y > 0, x ∈ R,

so that E(X^n|Y = y) < ∞, n ≥ 1. But if f_Y(y) = (πy)^{-1/2} e^{-y}, y > 0, then E(X^n) does not exist, and (16) fails for all n ≥ 1, since the marginal distribution of X is Cauchy. Note that the regression of X on Y is thus not (properly) defined, although the regression of Y on X can be defined. This exhibits some of the inherent difficulties, which may not be overlooked in applications of the subject.

Remark. The regression function E(X|Y) = g_X(Y) is quite general in form, being merely measurable relative to the σ(Y)-algebra and integrable relative to the probability P_X(·). The functional form of g_X(·) is therefore often taken as part of the prescription of the problem, and thus is a 'polynomial', a trigonometric, or a linear form, based on the (assumed) physical background. We now consider another important aspect of the regression problem that again has both theoretical and practical sides.
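The pathology is easy to see numerically; the following sketch (an illustration added here, not from the original text; numpy assumed) simulates the pair above, for which every conditional moment of X given Y = y is finite, while the running sample means of the Cauchy marginal X fail to settle.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 1_000_000
y = rng.gamma(shape=0.5, scale=1.0, size=m)      # f_Y(y) = (pi*y)^(-1/2) e^{-y}
x = rng.normal(0.0, np.sqrt(1.0 / (2.0 * y)))    # f_{X|Y}(x|y) = (y/pi)^(1/2) e^{-y x^2}

# Conditionally, E(X^2 | Y = y) = 1/(2y) is finite for every y > 0,
# yet the marginal of X is standard Cauchy, so E(X^n) fails to exist:
for k in (10**4, 10**5, 10**6):
    print(k, x[:k].mean())                       # running means do not stabilize
```

The mixture of the Gaussian conditional law over this Gamma(1/2, 1) radial law is exactly the standard Cauchy, which is the source of the failure of (16).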
6.7 Regression for random measures and integrals

To formulate the problem, it is useful to recall the concept of random measures on the δ-ring of bounded Borel sets of R^n, denoted B_0(R^n). Let Z : B_0(R^n) → L^p(P), p ≥ 1, be independently valued on disjoint sets and countably additive in the topology of convergence in probability on disjoint elements of B_0(R^n), so that for disjoint A_n ∈ B_0(R^n) one has

    Z(∪_{n≥1} A_n) = \sum_{n≥1} Z(A_n),   (1)

the series converging in probability; when Z(·) is independently valued, this convergence is with probability one. Moreover, since p ≥ 1, this convergence is also in the norm of L^p(P), by a classical theorem due to P. Lévy. Further, the random measure Z(·) is here of the symmetric p-stable class, so there exists a σ-additive measure c : B_0(R^n) → R_+ such that the characteristic function φ_Z(·) has the following representation:
(2)
by the same L´evy theorem. The fact that finite c(·) is actually σadditive needs an additional non-trivial argument as follows. Let Ai ∈ B0 (Rn ), i = 1, 2 be disjoint. Substituting them in (2) and recalling that the φ here never vanishes being an infinitely divisible characteristic function, one can take (natural) logarithms of φ to get −c(A1 ∪ A2 )|t|p = −(c(A1 ) + c(A2 ))|t|p , t ∈ R, c(Ai ) ≥ 0.
(3)
From this, since t ∈ R is arbitrary, it is deduced that c(·) is additive. But if An ↓ ∅, An ∈ B0 (Rn ) so that Z(An ) → Z(∅) = 0 in probability and hence φZ(An ) → φ0 (t) = 1, t ∈ R it follows that c(An ) → 0 implying that c(·) is σ-additive. It is generally σ-finite, also called the L´evy measure governing Z(·). With this preamble we can now present a solution of the regression problem for (real) rendom measures. There is a symmetrization method in probability which allows extension of the real case to complex problems as well without much difficulty, and consider the real case so that we present here the symmetric stable class as in the following result: 1. Theorem Let Z : B0 (Rn ) → Lp (P ), p ∈ (1, 2] be a symmetric random measure. Then it has the linear regression property in the sense that E(Z(A)|Z(B)) = aZ(B) where the real constant a = a(A, B) depends only on the Borel sets A, B. Proof. From the integrability and symmetry conditions of the hypothesis, E(Z(A)) = 0, and its characteristic function is differentiable.
387
6.7 Regression for random measures and integrals
First let p > 1. If X = Z(A), Y = Z(B), suppose P (Y = 0) > 0, since otherwise the result is true as Y will be independent of all the X defined, and the desired assertion follows. Consider first a simplification: E(Z(A)|Z(B)) = E[Z(A − A ∩ B) + Z(A ∩ B)|Z(B)] = E[Z(A) − Z(A ∩ B)] + E(Z(A ∩ B)|Z(B)) since Z(·) has independent increments, = [E(Z(A)) − E(Z(A ∩ B))] + E(Z(A ∩ B)|Z(B)), = h(A, B), (say). It is to be shown that the h(A, B) is a constant multiple of Z(B), the constant depending only on A, B, which establishes the result. This will be verified on using Lemma 6.1 above and the fact that Z(·) is symmetric p-stable. Let D = A∩B ∈ B (R), and i = (−1). Consider with D = A ∩ B ∈ B (R): E(Z(D)eitZ(B) ) = E(Z(D)eit[Z(D)+Z(B−D)] ) = E(Z(D)eitZ(D) )E(eitZ(B−D) ), by independent increments of Z(·), 1 d = E( (eitZ(D) ))E(eitZ(B−D) ) i dt d = −i (E(eitZ(D) ))E(eitZ(B−D) ), dt d since now and E commute, dt p p d = −i (e−c(D)|t| )e−c(B−D)|t| , dt Z(·) being symmetric p-stable, p
= i[c(D)e−c(D)|t| p|t|p−1 ]e−c(B−D)|t|
p
p
c(D) d(e−c(D)|t| [ ], = ic(B) dt since D ⊂ B, and the additivity of c(·), =
c(D) E(Z(B)eitZ(B) , as before. c(B)
(4)
Now let us simplify the left side using (2) with Y = Z(B) and X = eitZ(B) Z(D), as follows. E(Z(D)eitZ(B) ) = E(E(Z(D)eitZ(B) |Z(B))) = E(eitZ(B) E(Z(D)|Z(B))) = E(E itZ(B) gD (B)), (say),
(5)
388
VI. Sampling and Regression for Processes
where gD (B) is the random variable coming from Lemma 6.1 above and is a function of Z(B) alone. Thus (4) and (5) imply the following: E(eitZ(B) [gD (B) −
c(D) )Z(B)]) = 0, t ∈ R, c(B)
(6)
so that the expression in [ ] is just a function of Z(B). Hence if we set V = Z(B) since gD (B) and k(V ) = gD (B) − c(D) c(B) Z(B) are both 1 elements of L (P ), (6) implies: 0 =E(eitV k(V )) eitV dQV , = Ω
t ∈ R,
where QV is a measure on the image space of random element kV given by Lemma 6.1 has its Fourier transform vanishing, so that the measure must vanish identically. Now letting a = c(D) c(B) ∈ R, the regression of Z(A) on Z(B) is linear for any A, B ∈ B0 (Rn ). The case p = 1 follows in the same way on using the fact that one uses d −|t| (e ) to be taken as e−|t| sgn (t), as is done in analysis while differendt tiating an absolute value which effectively rules out Cauchy measures.. Thus the theorem follows. Remarks. 1. Specializing the above result by replacing Z(A) with X and Y as any random variables where E(X) exists, and writing E(X|Y ) = gX (Y ) so that E(gX (Y )) = E(X) holds, it is clear that gX (Y ) cannot be a polynomial of order k > 1 and such powers of any regression such as k bj Y j , (7) gX (Y ) = j=1
will only be meaningful if E(Y ) < ∞ is hypothesized as well. In the above work it was assumed that E(Z(A)) exists which thus rules out Cauchy measures as observed earlier. 2. The above theorem includes some earlier work of Lukacs and Laha (1964) and was motivated by it. In this connection the studies of Rosinski (1984) on random measures will be of interest in extending these ideas. The preceding analysis of random measures and regression, is a general begining, and naturally there are other methods of such measures arising in applications. An important motivation is to generate new measures from symmetric stable measures in the form of indefinite integrals for classes of functions, and consider the analogous regression questions for them. This introduces some new and interesting ideas on j
these problems, which will be discussed briefly to show the growth of the subject. First we need to recall the corresponding integral of scalar functions for some stable measures. If f : R^n → R is a simple function representable as f = Σ_{j=1}^{n} a_j χ_{A_j}, with disjoint Borel sets A_j, 1 ≤ j ≤ n, let, as usual,

∫_A f dZ = Σ_{j=1}^{n} a_j Z(A ∩ A_j),      (8)
which is seen to be well-defined and gives a random set function on the Borel sets. This is standard but not trivial, and is discussed in books such as Dunford and Schwartz (1958); the integral can be extended to more general functions by the usual procedures, for suitable inclusive classes such as L^p(P), to be specified below as needed. The general method is to consider simple functions f_n : R^n → R which converge pointwise to f and for which {∫_A f_n dZ, n ≥ 1} is Cauchy in L^p(P), or in another (metric) function space under consideration; the (unique) limit, denoted Y^f(A), is then, for Borel sets A,

Y^f(A) = ∫_A f dZ = lim_{n→∞} ∫_A f_n dZ.      (9)
That Y^f(·) is well defined and does not depend on the sequence {f_n, n ≥ 1} is not trivial, but is established in such books as the Dunford–Schwartz volume noted above. The interesting behavior here is that if Z(·) is a symmetric stable measure then Y^f(·) inherits the same property, which is also of great interest. It is established as follows. Starting with a simple function f_m as in the definition of Y^f(·) above, consider its characteristic function, computed as:

φ_{Y^{f_m}(A)}(t) = E(e^{it Σ_{j=1}^{k_m} a_{jm} Z(A∩A_{jm})})
  = exp{−Σ_{j=1}^{k_m} |a_{jm} t|^p c(A ∩ A_{jm})}, by (2) above,
  = exp{−|t|^p ∫_A |f_m|^p(v) dc(v)}
  → exp{−|t|^p ∫_A |f|^p(v) dc(v)}, by the Lévy continuity theorem.

Hence it can be concluded that

φ_{Y^f(A)}(t) = E(e^{itY^f(A)}) = exp{−|t|^p ∫_A |f|^p(v) dc(v)},      (10)
which implies that Y^f(·) is also a p-stable random measure. Now using an important result due to Schilder (1970) it can be inferred that

‖Y^f(A)‖_p = [∫_A |f|^p(v) dc(v)]^{1/p}.
It follows that both Y^f(·) and Z(·) are absolutely continuous relative to the measure c(·), which acts as a controlling measure for both random set functions, for each f ∈ L^p(c). Thus the following result is established:

2. Theorem. Let Z : B_0(R^n) → L^p(P), p ∈ (1, 2], be a symmetric p-stable random measure with Lévy controlling measure c : B_0(R^n) → R̄^+. Then the stochastic integrals {Y^f(A) = ∫_A f dZ, A ∈ B_0(R^n)} are well defined, and each Y^f(·) is a p-stable random measure having control c̃ : A ↦ ∫_A |f|^p dc, A ∈ B_0(R^n). Moreover these integrals generate a subspace which is isometrically isomorphic to L^p(c). Further, the random measures Y^f, f ∈ L^p(c), have the linear regression property, so that

E(Y^f(A)|Y^f(B)) = αY^f(B),  A, B ∈ B_0(R^n),      (11)

where α depends on (A, B, f) only.

It is possible to consider an extension of this result to pairs of measures Y^f, Y^g and ask for the linearity of their regressions. The following result presents a solution to this problem with a slight modification of the above argument. It is stated here, leaving the proof to interested readers, to give a flavor of the problems of interest in this study.

3. Proposition. Let Z : B_0(R^n) → L^p(P), 1 < p ≤ 2, be a symmetric p-stable measure with c : B_0(R^n) → R^+ as the Lévy measure controlling it. Then for any pair of elements f, g ∈ L^p(c) the corresponding random measures {Y^f, Y^g}, which are symmetric p-stable, have the linear regression property:

E(Y^f(A)|Y^g(B)) = βY^g(B), a.e.,  A, B ∈ B_0(R^n),
(12)
where the constant β is determined by A, B, f, g only. Such a result is useful in treating jointly symmetric p-stable random measures in multiple linear regression problems. The proof is not entirely elementary, but is of the same kind as discussed above, and is left to the reader as a good exercise. Multiple regression problems extending the above work are also of interest; they will be considered briefly in the complements section below. A particular class of processes admitting linear regression will be of interest for applications and further analysis, and it is treated in the final section to round out the present account of the problem.
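As a numerical aside (not in the original text), the slope of Theorem 1 can be probed by simulation. The following minimal sketch, assuming NumPy and the classical Chambers–Mallows–Stuck representation for symmetric stable variates, builds a symmetric p-stable random measure on three disjoint cells with illustrative Lévy weights and compares binned conditional means of Z(A) given Z(B) with the predicted slope c(A∩B)/c(B); the heavy tails make extreme bins noisy, so only central quantile bins are used.

```python
import numpy as np

rng = np.random.default_rng(0)

def sym_stable(alpha, size):
    """Chambers-Mallows-Stuck sampler for standard symmetric
    alpha-stable variates (characteristic function exp(-|t|**alpha))."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

p = 1.7                        # stability index in (1, 2]
c = np.array([0.5, 1.2, 0.8])  # illustrative Levy weights of 3 disjoint cells
n = 200_000

# Z on each cell: independent symmetric p-stable with scale c_i**(1/p)
cells = c[:, None] ** (1 / p) * sym_stable(p, (3, n))
ZB = cells[0] + cells[1]       # B = cell0 ∪ cell1
ZA = cells[1] + cells[2]       # A = cell1 ∪ cell2, so D = A ∩ B = cell1

a_theory = c[1] / (c[0] + c[1])   # c(D)/c(B) from Theorem 1

# Compare binned means of Z(A) with a_theory * (binned means of Z(B))
qs = np.quantile(ZB, np.linspace(0.1, 0.9, 9))
for lo, hi in zip(qs[:-1], qs[1:]):
    m = (ZB >= lo) & (ZB < hi)
    print(round(ZA[m].mean(), 4), round(a_theory * ZB[m].mean(), 4))
```

The two printed columns should agree up to Monte Carlo error, illustrating that the regression slope depends only on the Lévy weights, not on the particular realization.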
6.8 Processes admitting linear regression

It is usually assumed in key applications that the underlying processes have linear regression functions, and they are commonly taken to be Gaussian. This property will now be established in detail, to illuminate both the structure and the special nature of this process. It is useful to begin with a simple case as follows.

1. Proposition. Let X = (X_1, · · · , X_n) and Y = (Y_1, · · · , Y_m) be a pair of vectors which are jointly Gaussian with mean vectors μ_x, μ_y, cross-covariance matrix R_xy (of order n × m), and covariance matrices R_xx, R_yy (of orders n and m). Then the regression function E(X|Y) is linear and is given by

E(X|Y) = μ_x + A(Y − μ_y),      (1)

and the conditional covariance of X − E(X|Y) is R_xx − R_xy R_yy^{−1} R_xy^*, which is nonstochastic. Here the matrix R_yy^{−1} is interpreted in the Moore–Penrose (or generalized) sense in case R_yy is singular.
Proof. For simplicity assume that X, Y are centered Gaussian vectors, and let R_xy, R_yy be the covariance matrices as in the statement; suppose also that R_yy is nonsingular. Using the standard (Gram–Schmidt) orthogonalization procedure one can find a matrix A such that Y and Z = X − AY are orthogonal, so Z ⊥ Y, or E(ZY^*) = 0. Written in full, this gives:

E(XY^*) = A E(YY^*).
(2)
This implies R_xy = A R_yy, defining the matrix A in terms of the covariance matrices. It then follows that for this A the vectors Y and Z are orthogonal centered vectors, and being jointly Gaussian, they are independent. This crucial property implies that E(Z|Y) = E(Z) = 0, so that

E(X|Y) = μ_x + A(Y − μ_y),      (3)

implying the linearity of the regression. The conditional covariance of X given Y is immediately obtained:

E[(X − E(X|Y))(X − E(X|Y))^*] = E(ZZ^*|Y) = R_xx − R_xy R_yy^{−1} R_xy^*.      (4)

This implies the result.

It will now be shown that this result extends to general continuous Gaussian martingales, which include Brownian motion as a subclass.
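Before turning to that extension, here is a minimal numerical check of Proposition 1 in the centered case — a sketch, not the author's method; the factor-model covariance and the dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Joint Gaussian (X, Y) with X in R^2, Y in R^3, covariance from a factor model
G = rng.standard_normal((5, 5))
S = G @ G.T                       # covariance of the stacked vector (X, Y)
Rxx, Rxy, Ryy = S[:2, :2], S[:2, 2:], S[2:, 2:]

A = Rxy @ np.linalg.inv(Ryy)      # regression matrix of Proposition 1
cond_cov = Rxx - A @ Rxy.T        # eq. (4): Rxx - Rxy Ryy^{-1} Rxy*

# Monte Carlo check: Z = X - AY is uncorrelated with Y (hence independent,
# being Gaussian), and its covariance equals the conditional covariance
XY = rng.multivariate_normal(np.zeros(5), S, size=400_000)
X, Y = XY[:, :2], XY[:, 2:]
Z = X - Y @ A.T
print(np.round(Z.T @ Y / len(Y), 3))        # ≈ 0 (orthogonality E(ZY*) = 0)
print(np.round(np.cov(Z.T) - cond_cov, 3))  # ≈ 0 (matches eq. (4))
```

The check makes visible the key step of the proof: once E(ZY^*) = 0 is enforced by the choice of A, Gaussianity upgrades orthogonality to independence.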
2. Theorem. Let Y = {Y_t, F_t, t ≥ 0} be a continuous real Gaussian process which is also a martingale, and let X be a functional of the Y process in the sense that for any finite collection of t-points, (X, Y_{t_1}, · · · , Y_{t_n})
is an (n + 1)-element Gaussian vector. Then the regression functional of X given the process is linear and has the representation:

E(X|Y_s, 0 ≤ s ≤ t) = E(X|Y_0) + ∫_{0+}^{t} K(t, s) dY_s,      (5)
where the kernel K(t, s) is a nonstochastic, jointly measurable, locally integrable function, and the stochastic integral in (5) is well defined, satisfying the Bochner L^{2,2}-boundedness principle. [This principle was discussed in detail on pp. 170ff.]

Proof. The argument is an extension of the special case in the above proposition, but it depends on the martingale measure due to Doléans-Dade and a related construction, so the details follow. For a t > 0 consider the dyadic partition of the interval [0, t] given by 0 = t^n_0 < t^n_1 < · · · < t^n_{2^n} = t, with t^n_k = kt/2^n, k = 0, 1, · · · , 2^n. Let F^n_t be the σ-algebra generated by Y_{t^n_k}, k = 0, 1, · · · , 2^n, so that F^n_t ↑ F_t; the hypothesis of path continuity of Y implies that F_t is the σ-algebra generated by Y_s, 0 ≤ s ≤ t. If X^n_t = E(X|F^n_t), then {X^n_t, F^n_t, n ≥ 1} is clearly a martingale, bounded in L^2(P), since

E((X^n_t)^2) = E([E(X|F^n_t)]^2) ≤ E(X^2) < ∞,  n ≥ 1.      (6)
Thus the process {X^n_t, n ≥ 1} is a uniformly integrable martingale. Hence by the L^2(P)-martingale convergence theorem, X^n_t = E(X|F^n_t) → E(X|F_t), a.e. and in L^2(P)-mean. Now Proposition 1 above can be applied to {X, (Y_0, Y_{t^n_1} − Y_{t^n_0}, · · · , Y_{t^n_{2^n}} − Y_{t^n_{2^n−1}})}, so that we get the representation:

E(X|F^n_t) = E(X|Y_0) + Σ_{j=1}^{2^n−1} L_n(t, t^n_j)(Y_{t^n_{j+1}} − Y_{t^n_j}),      (7)
where L_n(·, ·) is a product moment; since the Y-process is continuous, L_n(·, s) has the same property, and L_n(t, ·) is a Borel function. By the martingale convergence theorem the left side converges, both pointwise a.e. and in L^2(P)-mean, to E(X|F_t); it must then be shown that the right side tends to the desired limit, which needs further analysis. Let K_n(·, ·) be the simple function defined as:

K_n(t, s) = Σ_{j=1}^{2^n−1} L_n(t, t^n_j) χ_{[t^n_j, t^n_{j+1})}(s).      (8)
Clearly K_n is jointly measurable and integrable on compact rectangles. With it, (7) can be given the integral form:

E(X|F^n_t) = E(X|Y_0) + ∫_{0+}^{t} K_n(t, s) dY_s, a.e.,      (9)
with probability one. It is to be shown that the kernel sequence {K_n, n ≥ 1} converges to a fixed K(t, s); this depends on properties of the square integrable martingale, via the key Doléans-Dade measure associated with a submartingale. The details are sketched as follows. For the simple function f_n(s, ω) = Σ_{j=0}^{n} a_j χ_{(t_j, t_{j+1}]×A_j}(s, ω), A_j ∈ F_{t_j}, 0 < t_0 < · · · < t_{n+1} < t, one has:

E(|∫_0^t f_n(s) dY_s|^2)
  = Σ_{j=0}^{n} a_j^2 E(χ_{A_j} E^{F_{t_j}}((Y_{t_{j+1}} − Y_{t_j})^2))
    + 2 Σ_{0≤j<j′≤n} a_j a_{j′} E[χ_{A_j∩A_{j′}} (Y_{t_{j+1}} − Y_{t_j}) E^{F_{t_{j′}}}((Y_{t_{j′+1}} − Y_{t_{j′}}))]
  = Σ_{j=0}^{n} a_j^2 E[χ_{A_j}(Y_{t_{j+1}} − Y_{t_j})^2] + 0, since martingale differences are centered,
  = Σ_{j=0}^{n} a_j^2 μ((t_j, t_{j+1}] × A_j)
  = ∫_{(0,t)×Ω} |f_n(s, ω)|^2 dμ(s, ω),      (10)
where μ is the Doléans-Dade measure determined by the Y-submartingale (cf. Rao (1995), p. 363). [For Brownian motion this is just a product measure, and the work simplifies.] The isometry obtained in (10) can be extended to all m, n ≥ 0 as:

E((X^n_t − X^m_t)^2) = ∫_{(0,t)×Ω} |K_n − K_m|^2(t, s) dμ(s, ω),      (11)
the measure μ being the Doléans-Dade measure as before. If μ′ is the marginal measure of μ, then (11) becomes

E[(X^n_t − X^m_t)^2] = ∫_{0+}^{t} |K_n(t, s) − K_m(t, s)|^2 dμ′ → 0,

as m, n → ∞, since {X^n_t, n ≥ 1} is Cauchy. This implies that K_n(t, ·) → K(t, ·) in L^2(μ′). To conclude that K(·, ·) is the desired kernel, it is to
be observed that it is jointly measurable. For this the following idea will be helpful. Let u_n(t) = (1 + [2^n t])/2^n, where [x] denotes the integral part of x > 0. Then u_n(t) ↓ t as n → ∞, so that K(u_n(t), s) → K(t, s) as n → ∞. But also note that

K_m(u_n(t), s) = Σ_r χ_{[u_n(t)=r]} K_m(r, s),      (12)
where the sum extends over all (dyadic) rationals r. It follows that the left side is jointly measurable in (t, s), and hence so is its limit K(t, s). This yields the main representation, and the Gaussian property, as in the preceding result, gives the desired conclusion as follows. By that result, E(X|Y_0) = αY_0 + β for some constants α, β, the Gaussian property making it a linear function of Y_0. Consequently

X − E(X|Y_0) = X − (αY_0 + β) = ∫_{0+}^{t} K(t, s) dY_s

has a covariance involving only the variances and covariances of X and Y_s, s ≥ 0, and is deterministic, as in the finite dimensional case. A useful consequence of the above result is that, in the particular case that (X, Y) is a jointly normal (or vector Gaussian) process, the regression of X on Y is always linear, so that E(X|Y) = g_X(Y) is a linear function of Y; this may generally be written as X = αY + ε, where ε is a centered process, independent of Y, with sufficient integrability. Specializing this, by taking Y = g(X) for a linear g(·) so that the linearity of regression is valid, a great many applications are found, for instance in statistical and econometric problems. Note that here one has the connecting equation, when X is integrable and Y is any random variable for which the joint distribution is specified: E(X) = E(E(X|Y)). Difficulties may arise, however, as the following simple example makes clear; it is included for some introspection on the resulting work. Recall the relation between the conditional expectation operator and the absolute expectation: if (Ω, Σ, P) is a probability space, B ⊂ Σ is a (sub) σ-algebra and X : Ω → R is a P-integrable random variable, then

E(X) = ∫_Ω X dP = ∫_Ω E^B(X) dP_B = E(E^B(X)),      (13)
and if F_X(·|y) is the conditional distribution of X given Y, with density f_{X|Y}(·|y), then the above equation (13) implies, in familiar terms, on assuming densities relative to the Lebesgue measure:

E(X) = ∫_R x f_X(x) dx = ∫_R [∫_R x f_{X|Y}(x|y) dx] f_Y(y) dy = E(E(X|Y)).      (14)
Here one requires that the conditional measures are 'regular'. However, the equation f_{X|Y}(x|y)f_Y(y) = f_{X,Y}(x, y) cannot serve as an identity, since easy examples show (one is in the Complements Section below) that there are cases of f_{X|Y}, f_Y which do not satisfy the above equation for E(X). Thus there are problems in the evaluation of conditioning which have not been satisfactorily resolved. Since adopting some evaluation is necessary for practical problems, regression analysis is pursued with one, and some developments will be indicated as another aspect of this work in the concluding section of this chapter.

6.9 Linear regression with constraints

The preceding three sections detail general conditions for the regression function to be linear, and show that the problem is closely related to the general symmetric stable class of processes. A very popular special family is the Gaussian class, the most used process in many real-life applications. Restricting to this class, we present here some key results, with references for further analysis and detail. Since most of what is considered in real applications is linear regression, and by the work of the preceding sections jointly Gaussian processes always have the linear regression property, much of the work on applied problems is based on these classes; thus the main applicational model can be set forth as follows. If (X, Y) is a (finite) Gaussian vector, then it has the linear regression property, which is expressible as:

E(Y|X) = g(X) = AX,
(1)
for some suitable matrix A representing the general linear form. This may be restated in the more familiar way:

Y = AX + ε,
(2)
where the added random vector ε satisfies E(ε|X) = 0 (this can be weakened); it is usually termed the random error vector of the model and is assumed to have enough moments, at least two. This model is of central interest in the general area called 'Econometrics', and here we draw briefly on the major analysis in Chipman (2011) to
point out the great applicational potential of this subject (of linear regression) under the Gaussian assumption, the desired linearity being naturally present. It is now possible to reinterpret the model (2) above as a linear equation in X and Y, with A a matrix to be estimated from observations of (X, Y), assuming some usable (and natural) conditions on ε, whose distributional structure is known. One can also impose constraints on the coefficient matrix A, appropriate to least squares or other methods. Here one such procedure will be discussed as an illustration: the least-squares technique, which for the Gaussian distribution is essentially equivalent to maximum likelihood for the present problems. Thus the linear regression equation (2), with a suitable matrix A, is estimated by minimizing the dispersion of the 'error' vector introduced above (hence the name least squares); in the Gaussian case this is essentially the 'maximum likelihood' estimation method proposed by R. A. Fisher, briefly discussed in Chapter III. Thus in (2) above, suppose the vector (Y, X) is observed and one wants to estimate the matrix A so as to minimize the covariance of the 'error' vector. An example explains this point clearly, showing also how the corresponding applications lead to separate studies demanding some advanced analysis. The problem in (2) is to find A when the vector (Y, X) is 'observed' or known, and the error ε is assumed to have a known structure, usually taken to be a centered random vector with a positive definite dispersion matrix, known except for an unknown positive factor; this assumption may be generalized. Here the presentation follows Chipman (2011, Chapter 4) and gives the flavor of the problem, referring the details to his book. [The notation is slightly altered to conform with our discussion.]

1. Theorem. Let the model (2) be restated as

Y = AX + ε;  E(ε) = 0;  E(εε′) = a²Ω,      (3)
where Ω is nonsingular, and suppose the constraint ΨA = α holds, where Ψ is a matrix of full rank defining the constraints on A. Then an estimator of A for the model (3), unbiased and of minimum dispersion, can be explicitly calculated.

The details use properties of (not only the Gaussian distribution but) some simple aspects of matrices, including those of the idempotent class. They are omitted here, referring them, with several important extensions and applications of this and related results, to Chipman's book. This is given here to inform the reader of the potential of regression analysis as a key part of the Inference Theory of processes.
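Since the theorem above is stated without an explicit formula, the following sketch shows one standard closed form — the classical restricted generalized least squares estimator — under the familiar vector-parameter specialization Y = Xβ + ε with constraint ψβ = α. All names, dimensions and the constraint are illustrative assumptions, not Chipman's notation.

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 200, 4
X = rng.standard_normal((n, k))
beta_true = np.array([1.0, -2.0, 0.5, 0.5])
Omega = np.eye(n)                        # error dispersion (up to a scalar)
Psi = np.array([[0.0, 0.0, 1.0, -1.0]])  # constraint: beta_3 = beta_4
alpha = np.array([0.0])
Y = X @ beta_true + rng.standard_normal(n)

W = np.linalg.inv(Omega)
XtWX_inv = np.linalg.inv(X.T @ W @ X)
beta_gls = XtWX_inv @ X.T @ W @ Y        # unrestricted GLS estimate

# Restricted GLS: project the unrestricted estimate onto {Psi beta = alpha}
adj = XtWX_inv @ Psi.T @ np.linalg.inv(Psi @ XtWX_inv @ Psi.T)
beta_res = beta_gls - adj @ (Psi @ beta_gls - alpha)

print(beta_res)            # estimate satisfying the constraint
print(Psi @ beta_res)      # = alpha exactly
```

The restricted estimate is the minimum-dispersion unbiased linear estimator under the exact constraint, which is the content of the theorem in this special case.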
The book by Chipman is a rigorously executed volume to be studied by readers of this subject, and for possible extensions of Linear Regression Analysis. There are further refinements of this problem in which the constraints are themselves random, going under the name 'ridge estimation'. This is detailed in Chipman's volume (Chapter 4) on constrained problems where, as in the so-called 'Bayesian inference procedures', the constraints are allowed to be random with distributions available to the experimenter; we shall be content with pointing out that the analysis is available to interested readers. We conclude the discussion with an important item on linear and polynomial regression that need not come from the Gaussian model but may come from another, necessarily belonging to the infinitely divisible class (of which the stable family is a key part), including for instance the Poisson and Gamma families, as a final item of this section (and chapter) before turning to the related complements. Recall that if (X, Y) is a pair of (real) random variables with one moment, then E(Y|X) = g_Y(X) is the regression of Y on X, even if X has no moments. If (X, Y) is jointly Gaussian, then Y = βX + ε with ε independent of X, so one still has a linear regression formulation. In case

g_Y(X) = Σ_{j=1}^{n} a_j X^j,      (3)
then by analogy it is termed an nth-degree polynomial regression when the corresponding expectations exist. Multiplying (3) by e^{itX} and taking expectations, it is seen from the uniqueness properties of Fourier transforms that the polynomial regression is well-defined if the pair (Y, X^n) is integrable. This leads us to formulate some useful regression results for classes of processes which need not be Gaussian but for which the desired Fourier results hold, leading to several types of problems — called polynomial, trigonometric and other (not necessarily linear) regression problems — with some applicational potential. Let (Y, X^k), k = 1, · · · , n, be integrable random variables. Suppose that the regression function is given by E(Y|X) = g_Y(X) = Σ_{k=0}^{n} a_k X^k, termed a polynomial regression of order n, so that (Y, X^k) is integrable for 1 ≤ k ≤ n; it is termed simply a constant regression if n = 0. Using standard Fourier analysis, the following characterization of a polynomial regression can be obtained.

2. Theorem. Let (Y, X, X^2, · · · , X^n) be an integrable random vector. Then Y has a polynomial regression of order n in the sense defined
above if and only if the relation between Fourier transforms given by

E(Y e^{itX}) = Σ_{k=0}^{n} a_k E(X^k e^{itX})      (4)
holds for all t ∈ R and real constants a_k, k ≥ 0. The proof is a simple and standard application of the uniqueness of Fourier transforms and can be left to the reader. The point of this result, originally due to Lukacs and Laha (1964), is a nice application of Fourier analysis to this problem. If n = 0 here, then one has a constant (or trivial) regression. The following final result of this section indicates another type of linear regression problem, of interest in some structural analyses of symmetric stable processes of index p > 1, including the Gaussians.

3. Theorem. Let X, Y be a pair of independent symmetric stable random variables of order 1 < p ≤ 2. Let U = aX + bY and V = cX + dY, where the coefficients {a, b, c, d} are all nonzero and ad ≠ bc. Then the linear regression E(V|U) = αU holds for some constant α.

We omit the proof of this result as well, observing that it is discussed in Lukacs and Laha (1964), and conclude the discussion of these problems at this point, adding some other results in the Complements Section below, where some related problems supplementing the above work are also included.

6.10 Complements and exercises

1. Let F be the spectral bimeasure of a strongly harmonizable process {X_t, t ∈ R} with support S, i.e., the variation measure |F| determined by F is supported by S. Let T_h A = {(x_1 + h, x_2 + h) : (x_1, x_2) ∈ A} denote the translate of A by h units in R^2. Using the same procedure as in the proof of Theorem 4.1, show that the sampling theorem holds for such processes (i.e., L^2(X) = s̄p{X_t, t ∈ R} = s̄p{X_{nh}, n ∈ Z}) if the sets {T_{nh^{−1}} S, n ∈ Z} form a wandering sequence (i.e., T_{mh^{−1}} S ∩ T_{nh^{−1}} S = ∅ if m ≠ n).

2. Here is a multidimensional extension of Lloyd's theorem. Suppose {X_t, t ∈ R} is a centered weakly stationary process with values in C^n, n > 1, i.e., α·X_t, ∀α ∈ C^n, is a stationary process, or, what is the same, E(X_t) = 0 and the covariance matrix r(s, t) = E(X_s X_t^*) = ∫_{R^n} e^{i(s−t)·λ} dF(λ) for an n × n positive definite matrix valued measure F. [Here '∗' denotes conjugate transpose.] Assume further that F′(λ) = dF/dτ(λ) exists a.e., where τ = tr(F) is a measure. Note that the range of F′(λ) may not be all of C^n if it is not of full rank; suppose it is periodic of period h^{−1}, i.e., for all λ with T_{mh^{−1}}λ ∈ supp(τ), the ranges of F′(λ) and F′(T_{mh^{−1}}λ) are equal a.e. [τ], for all m ∈ Z. Under
these conditions show that L^2(X) = S̄P{X_t, t ∈ R}, the closed span with matrix coefficients (it is the same as the analogous set S̄P{X_{mh}, m ∈ Z}), iff the translates {T_{kh^{−1}} S, k ∈ Z} form a wandering sequence, where S is the support of τ. [The method of proof is analogous to that of Theorem 4.1, but because of the possibility of F not being of full rank, some additional care and argument are necessary. For further details, cf. Pourahmadi [1].]

3. This and the next exercise deal with the structure of the support sets of spectral (bi)measures for a class of harmonizable processes whose covariances are periodic. A second order process X = {X_t, t ∈ I ⊂ R}, with mean function m(·) and second moment r(s, t) = E(X_s X̄_t), is periodically correlated if for some T_0 > 0, m(t + T_0) = m(t) and r(s + T_0, t + T_0) = r(s, t), ∀s, t ∈ I for which s + T_0, t + T_0 ∈ I, so that given k ∈ Z one has m(t + kT_0) = m(t) and r(s + kT_0, t + kT_0) = r(s, t) whenever all these quantities are defined. Consequently the correlation function is also periodic, justifying the name. Clearly every centered weakly stationary process belongs to this class for every T_0 ∈ R. Verify that if I = Z, then every periodically correlated X with period T_0 is strongly harmonizable (in fact, is a T_0-dimensional stationary sequence), but that this statement is false if I = R. [Consider X_t = A(t)Y_t, t ∈ R, where the Y_t-process is centered and stationary, and A(t) is periodic (and deterministic) but not the Fourier transform of an L^1(R)-function.] In the case I = Z, show that the spectrum is supported by the lines y = x − 2πk/T_0, k = −T_0 + 1, . . . , T_0 − 1, with x, y ∈ [−π, π]. [See Yaglom [4] for a comprehensive treatment of the subject.]

4. Suppose X = {X_t, t ∈ R} is a periodically correlated and weakly harmonizable process with period T_0 > 0. Then its covariance function r satisfies r(s + T_0, t + T_0) = r(s, t), ∀s, t ∈ R. If F is its spectral bimeasure with support S_F, then verify that S_F is contained in the set (∗) {(x, y) ∈ R^2 : x − y = 2πk/T_0, k ∈ Z}; and conversely, if the support S_F (of the bimeasure of a weakly harmonizable process X) is contained in the set (∗) of R^2, show that X is periodically correlated. [Hints. First verify that X_t is the L^2(P)-limit of a sequence of strongly harmonizable X^n_t-processes, the limit being uniform on compact sets. This implies r_n(s, t) → r(s, t) as n → ∞, uniformly in (s, t) on compact rectangles, where r_n is the covariance function of {X^n_t, t ∈ R}. Then note that for large enough n, r_n is also periodic with period T_0. If F_n is the spectral bimeasure of the X^n_t-process, verify with a careful argument (since a Helly-Bray type argument is not applicable) that F_n(A, B) → F(A, B) for all bounded Borel sets A, B. This implies S_{F_n} ⊂ S_F for large n. If r has support in (∗), then verify that r is periodic of period T_0, using the key fact that r_n is the Fourier transform of F_n. Conversely, if r is periodically correlated, then so is r_n for large n,
and observe that

r_n(s, t) = (1/(2N + 1)) Σ_{k=−N}^{N} r_n(s + kT_0, t + kT_0)
  = ∫_R ∫_R e^{i(sx−ty)} [sin((N + ½)T_0(x − y)) / (2(N + ½) sin(T_0(x − y)/2))] dF_n(x, y),
where the Dirichlet kernel in the last integral equals 1 for (x, y) in the set given by (∗). Conclude by a limit argument, as N → ∞, that r_n has its support in (∗), and hence r has the same property. In connection with this result, see Hurd [1] for the strongly harmonizable case, and for its extension to the weakly harmonizable case, cf. Chang and Rao [2].]

5. Complete the following sketch of Kluvánek's proof of the result given as Theorem 4.9 in the text. If φ(x) = ∫_Γ χ_Ω(γ)⟨x, γ⟩ dμ_Γ(γ), then φ(· − y) is the Fourier transform of ⟨−y, ·⟩χ_Ω ∈ L^2(Γ, μ_Γ), so that by Plancherel's theorem and the WMB-formula, for each y ∈ H, with the notation of Theorem 4.9, one has:

∫_G φ(x)φ̄(x − y) dμ_G(x) = ∫_Γ χ_Ω(γ)⟨−y, γ⟩χ̄_Ω(γ) dμ_Γ(γ)
  = ∫_Γ ⟨y, γ⟩χ_Ω(γ) dμ_Γ(γ)
  = ∫_{Γ/Λ} ⟨y, γ̃⟩ dμ_{Γ/Λ}(γ̃)
  = 0,  y ∈ H − {0}.      (+)
Thus φ(H − {0}) = 0 and φ(0) = 1, as well as ‖φ‖_2 = ‖χ_Ω‖_2 = 1 (Plancherel's formula). Next, any f ∈ L^2(G, μ_G) with f̂|_{Ω^c} = 0 has the expansion (for suitable coefficients a_y to be determined):

f̂(γ) = Σ_{y∈H} a_y ⟨y, γ⟩χ_Ω(γ),  (L^2(Γ, μ_Γ)-convergence).      (*)
By (+) and the next line, {φ_y = φ(· − y), y ∈ H} is an orthonormal set, φ̂ = χ_Ω, and φ̂_y(γ) = ⟨y, γ⟩χ_Ω(γ). But by the unitary mapping (again due to the Plancherel transform), (*) gives

f = Σ_{y∈H} a_y φ_{−y},  and  ‖f‖_2^2 = Σ_{y∈H} |a_y|^2.
To evaluate a_y in (*), by inversion, f(x) = ∫_Γ f̂(γ)⟨x, γ⟩ dμ_Γ(γ), so f is continuous a.e. (the integral exists because f̂ ∈ L^2(Ω, μ_Ω) ⊂ L^1(Ω, μ_Ω), and the series in (*) also converges in L^1(Ω, μ_Ω)). Now if f is continuous, then for x ∈ G one has:

f(x) = ∫_Γ f̂(γ)⟨x, γ⟩ dμ_Γ(γ)
  = ∫_Γ ⟨x, γ⟩ [Σ_{y∈H} a_y ⟨y, γ⟩χ_Ω(γ)] dμ_Γ(γ)
  = Σ_{y∈H} a_y ∫_Γ ⟨x + y, γ⟩χ_Ω(γ) dμ_Γ(γ)
  = Σ_{y∈H} a_y φ_{−y}(x).      (*+)
The convergence also holds uniformly on G, since for arbitrary finite H_1 ⊂ H, by the L^1(G)-convergence of the sum,

|f − Σ_{y∈H_1} a_y φ_{−y}|(x)

can be made small by taking H_1 ⊂ H large enough. With x = y_0 ∈ H, we get by (*) that f(y_0) = a_{y_0}, so that Theorem 4.9 follows.

6. Let G be the dyadic group of infinite sequences of {0, 1} with coordinatewise addition (mod 2) as the group operation, which with the product topology becomes a compact abelian (metric) group, with Haar measure the product measure induced by μ({0}) = μ({1}) = ½. The characters of this (compact) group are known as Walsh functions, which are defined in different ways: if x = {x_i, i ∈ Z}, y = {y_i, i ∈ Z} ∈ G, then a Walsh function is ψ_y(x) = ±1 according as Σ_n y_{1−n} x_n is even or odd. [There is another slick way of defining these, due to R. Paley, using Rademacher functions, but for now the above suffices.] Let H = {s/2^k, s = 0, 1, . . . }, where s/2^k is given a finite dyadic expansion and k > 0 is an integer. Let Γ = Ĝ as usual and Ω = Γ/Ĥ, which can be identified as Ω = [0, 2^k). Then φ(x) = 2^{−k} ∫_Ω ψ_y(x) dy = χ_{(0,2^{−k})}(x), since the Haar measure here is the normalized Lebesgue measure on Ω. If {X_t, t ∈ R} is a stationary process with mean zero and covariance given by

r(s, t) = ∫_G ψ_s(x)ψ_t(x) dμ_G(x),
then verify, as an application of Kluvánek's result above and Theorem 4.10, that the sampling series is given by

X_t = Σ_{s=0}^{∞} X_{s/2^k} χ_{[s/2^k, (s+1)/2^k)}(t),  a.e.
Thus X_t in this case is an elementary function, i.e., it is constant on subintervals, so that such processes are relatively simple.

7. Let G = R, H = {nπ/α, n ∈ Z}, α > 0, so that Ω = G/H = (−α, α). Verify that, in the notation of the preceding exercise, φ(x) = sin(αx)/(αx), and that for f ∈ L^2(R) such that f̂(t) = 0 for |t| > α [such an f is an entire function of exponential type, by a classical theorem due to Paley and Wiener], one has the original Cauchy formula

f(x) = Σ_{n∈Z} f(nπ/α) · sin(αx − nπ)/(αx − nπ).      (*)
Now let X = {X_t, t ∈ R} be a (weakly) stationary process with its spectral measure ρ supported in Ω above, and show, using the isomorphism between L^2(ρ) and L(X) = s̄p{X_t, t ∈ R} ⊂ L^2(P), that the Kotel'nikov-Shannon formula is given by:

X_t = Σ_{n∈Z} X_{nπ/α} · sin(αt − nπ)/(αt − nπ),      (+)
the series converging in mean. [Hint: Take f(x) = e^{itx} in Cauchy's formula (*).]

8. For a finite optional sampling of a martingale sequence, the following sharper statement holds. Let {X_n, F_n, 1 ≤ n ≤ m} be a martingale and {T_k, 1 ≤ k ≤ n_0} a stopping time sequence of {F_n, 1 ≤ n ≤ m}. Verify that {Y_j = X ∘ T_j, F(T_j), 1 ≤ j ≤ n_0} is a martingale and that E^{F_1}(X_m) = E^{F_1}(Y_j) = X_1, a.e., 1 ≤ j ≤ n_0. [This result is actually true for a linearly ordered finite index set. The details of proof involve some properties of conditional expectations. However, it is a special case of a result given in the companion volume, cf. Rao [21], p. 235.]

9. Example 4 in Section 6 above involves some detailed (nontrivial) computations. Complete the omitted details to arrive at equation (10) there, consulting Hardin (1982) if necessary; this is an example of linear regression that does not come from a spherically distributed family and thus not from a Gaussian class.

10. The conditional expectation of X given Y satisfies the equation E(X) = E(E(X|Y)). This is a property of conditioning but does not characterize it, as the following example (details to be supplied) shows. Suppose that the joint distribution of (X, Y) has a density with well-defined marginals satisfying f_{X,Y}(x, y) = f_{X|Y}(x|y)f_Y(y); then f_{X,Y}(x, y) may have several conditionals, as the following example
shows. This was already noted by P. Ennis (1973). Thus let (X, Y) have the joint density:

f_{X,Y}(x, y) = (1/π) exp{−y(1 + x²)},  x ∈ R, y > 0.
Then the marginal f_Y and conditional f_{X|Y} are easily calculated. Verify that the conditional expectation E(X^n|Y = y) is defined for all n ≥ 1, but E(X^n) does not exist for any n ≥ 1. This shows that there are some hidden peculiarities in the conditioning analysis, and particular attention is needed in applications.

11. This problem deals with an important extension of the linear regression analysis with constraints treated in Section 9 above, now called 'ridge regression', which has many applications but is also often misunderstood in its implications and even its formulation. Recall that the regression model with linear restrictions may be given as Y = Xβ + ε, where the matrix X denotes the 'controls', β is an unknown parameter matrix to be estimated, and ε is a vector standing for the random errors; the coefficient matrix β is constrained by ψβ = α and is to be estimated using the observed random vector Y. In some problems the assumed restriction may be known only approximately, subject to another error, so that α = ψβ + η, where the pair ε, η are uncorrelated, since the errors arise from different sources. Putting these together, the new (perhaps more 'realistic') model becomes:

(Y′, α′)′ = (X′, ψ′)′β + (ε′, η′)′,

the column vectors written as stacked (transposed) vectors. This is now a new linear model with ε, η orthogonal, and their joint covariance matrix is block diagonal with E(εε′) = σ²A and E(ηη′) = τ²B, in which A, B are
404
VI. Sampling and Regression for Processes
g_Y(X) = E(Y), so that it does not depend on the values of X, it is termed a constant regression. If X, Y are independent, then this happens trivially, with only the integrability of Y required. A simple manipulation of conditional expectations shows that constant regression obtains if (and only if) E(Y e^{itX}) = E(Y)E(e^{itX}) for all t ∈ R. The latter property can be used in characterizing probability distribution families that admit linear regression for subclasses of infinitely divisible distributions. Let L = Σ_{j=1}^{n} X_j and Q = Σ_{i,j=1}^{n} a_{ij} X_i X_j. Suppose that the constants satisfy n Σ_{j=1}^{n} a_{jj} ≠ Σ_{i,j=1}^{n} a_{ij}. Then verify that the random sample {X_1, · · · , X_n} is from a Gamma distribution if and only if S = Q/L² has a constant regression on L. It thus follows that the regression concept, rigorously defined as a conditioning relation, plays several key and interesting roles in both the theory and the applications. We thus conclude the discussion of these problems at this point.
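As an addendum (not in the original text), the constant-regression property in Problem 12 can be probed by simulation. The sketch below rests only on the classical fact that for a Gamma sample the total L is independent of the normalized vector X/L, so S = Q/L² should show a flat conditional mean across bins of L; the particular quadratic form used is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

n, reps = 5, 400_000
X = rng.gamma(shape=2.0, scale=1.5, size=(reps, n))  # Gamma sample of size n
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # a fixed quadratic form

L = X.sum(axis=1)
Q = np.einsum('ri,ij,rj->r', X, A, X)                # Q = x' A x per replicate
S = Q / L**2

# Bin by L and compare conditional means of S: they should be flat,
# reflecting the constant regression E(S|L) = E(S) for Gamma samples
edges = np.quantile(L, np.linspace(0, 1, 11))
for lo, hi in zip(edges[:-1], edges[1:]):
    m = (L >= lo) & (L < hi)
    print(round(lo, 2), round(S[m].mean(), 4))
```

Replacing the Gamma draws by, say, lognormal ones makes the binned means visibly drift with L, illustrating the characterizing nature of the property.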
Bibliographical notes

Approximating functions by series, using a countable set of linearly independent (often orthogonal, in the Hilbert space context) elements with suitable coefficients, is a well-known and useful procedure in (deterministic) functional analysis. Applications to signal detection problems, using Kotel'nikov-Shannon type series within the context of stochastic processes, are especially popular in communication, electrical and radio engineering work. One uses the classical results from approximation theory for second order processes having band-limited spectral functions, to which they are immediately adaptable. We have presented various aspects of this methodology in this chapter, where the actual sources are already given. The basic analysis of Piranashvilli [1] has been extended in Chang and Rao [3] for harmonizable processes, and the presentation in the text follows the latter work. The characterization of harmonizable covariances is taken from the recent paper (cf. Rao [23]). The survey on band-limited processes by Pogány [1] (kindly sent to the author), containing much useful Russian work unavailable in English, should be noted here for the benefit of western readers. The condition of band-limitedness leads to analyticity properties of processes; we have included some of this in Section 2. Theorem 3.1 is due to Swift [1] and Theorem 3.2 to Belyaev [1]. The basic result in periodic sampling is Theorem 4.1. It and several related propositions are taken from Lloyd [1]. Our presentation is slightly different and is applicable to the strongly harmonizable case
also. These ideas take different forms for (isotropic) random fields. An extended notion of harmonizable isotropy was introduced by the author (cf. Rao [27]), and Theorem 4.8 is taken from that paper. In the stationary isotropic case much work was pioneered by Yadrenko [1], who pointed out that there are no nontrivial stationary isotropic fields satisfying the standard Laplacian (in the L^2(P)-sense). To remedy this deficiency, isotropic harmonizability was originally introduced by the author in [27]. Indeed there are plenty of nontrivial harmonizable isotropic fields satisfying the Laplacian in the above sense, as examples in Swift [1] show. Thus the corresponding sampling theory also becomes important, and therefore we included an extended treatment. Even without isotropy, sampling for random fields has considerable interest. Some of this is given in the text as an extension of the deterministic analysis. Thus Kluvánek's [1] sampling theorem on LCA groups has been included and illustrated in its general form for random fields having band-limited spectra. Next, we indicated a different notion of sampling, at random times, for processes having a distinct distributional structure, contrasting it with the second moment analysis. This type of sampling does not depend on Hilbertian geometry or its tools, leading instead to martingale processes. Optional sampling is natural in the context of gambling processes, which correspond to the fortunes of a player who can participate at random times, skipping games in between. The basic result here is Doob's optional sampling theorem, and a special aspect of it is outlined here to point out the difference between this analysis and the preceding sampling theory. The conditioning concept is at the root of the regression problem, which is so important in applications; but the foundations of the subject have not been analyzed in the literature, and hence the next sections are devoted here to remedying this situation. As a consequence, Sections 6.6–6.9 treat various distinct aspects of regression and the foundations of the subject, especially dealing with characterizations of linearity for processes belonging to the infinitely divisible class; the analysis uses some interesting and deep results of Hardin (1982), starting with Kanter's (1972) original insight that the problem belongs to the Stable Class of the infinitely divisible family. A vast applicational potential, restricted to the familiar Gaussian family, is available in the literature. In fact, in most treatments and applications the Gaussian family (process) is taken as the basis, as seen in such classic works as Wilks (1962), leading all the way to the reference work of Chipman (2011). But using Fourier analysis, the problem was treated somewhat differently in Lukacs and Laha (1964), giving some generality and insight. Both these points of view are dis-
cussed and detailed in Sections 6.6–6.9 here, and moreover some constrained regressions are also considered. This includes 'ridge regression'; extensions to constrained regression are detailed, with a rigorous treatment, in Chipman's fundamental monograph, which should be consulted for many types of related problems also based on constraints. Recently two mathematicians, D. R. Jensen and D. E. Ramirez (2008), seem to have overlooked the constraints in the formulation of 'ridge regression' and published several counter-examples and related numerical computations to dispute the applied regression results; but the troubles could have been eliminated had they followed Chipman's research and the earlier theoretical work in the journal Linear Algebra and its Applications (1990), pp. 55–74. We discussed the result in Problem 11, to clarify some of the difficulties. The complements section contains a few other results, including a detailed sketch of a proof of Kluvánek's theorem, which may not be readily available to many readers. We also indicated how Fourier analysis and constrained regression can lead to some characterization problems, particularly for some stable classes, pioneered by Lukacs and Laha (1964), the only book that seems to have ventured into regression problems, treating them as an outgrowth of conditioning. As the impact of this work in stochastic analysis indicates, statements dealing with representations such as the Kotel'nikov-Shannon series have a mathematical content mainly in their classical versions; the corresponding (second order) stochastic counterparts, although interesting for applications, contain only a minor amount of new probabilistic ideas, particularly related to the appropriate isomorphism mappings. We therefore conclude this chapter without multiplying the extensions of the deterministic work, detailed in, for instance, Higgins [1] or Zayed [1], and proceed to other aspects of inference in the remaining chapters. The other real and notable impact is the regression analysis, as well as the stable classes, to be carefully analyzed in this context.
Chapter VII More on Stochastic Inference
Most of the work in Chapters IV and V on inference has focussed on a simple hypothesis versus a simple alternative. When either (or both) of the latter is composite, several new problems arise. In this chapter, we consider some of these questions in detail. The results again depend on likelihood ratios, and an extension of the Neyman-Pearson-Grenander theorem is once more of importance. Parameter estimation plays a key role, often involving uncountable sets of measures, and some of the work of Pitcher and his associates is presented. This utilizes certain abstract methods, and moreover includes some results appearing here for the first time. The Gaussian dichotomy in an alternative form (without separability restrictions) is considered. A general Girsanov theorem is also established because these assertions are needed in many important applications, including the Wiener-Itˆ o chaos, white noise and the Wick products as complements. With further detailed analysis on the more tractable Gaussian processes, the following work substantially advances that of the earlier chapters.
7.1 Absolute continuity of families of probability measures

Although we discussed the general methodology of composite hypothesis testing in Chapter II in some detail, its systematic application to stochastic processes has not been made. The case of simple hypothesis versus simple alternative is treated in Chapter V with some specificity. Here we present many results on the general testing problem, which may demand somewhat sharper tools than the former. As seen in Chapter II, the composite hypothesis testing case can be identified with that of simple versus simple for (possibly infinite) vector measures, and therefore the desired likelihood ratio, when it can be
computed, would furnish an appropriate critical region. However, in many cases these particular ratios (or the Radon-Nikodým derivatives) either cannot be computed or do not even exist. So it becomes necessary to consider specialized problems to obtain concrete solutions. Consequently, let us motivate with a particular extension of the classical Neyman-Pearson lemma, first to a class of simple hypotheses versus composite alternatives, at an abstract level applicable to certain processes, again following Grenander [1]. After giving the result, the basic conditions and conclusions, together with their relevance in the context of processes, are discussed, to explain how this work must proceed from that point onwards.

1. Proposition. Let {P_α, α ∈ A ⊂ R} on (Ω, Σ, P) be a family of probability measures governing a process {X_t, t ∈ R} on the basic probability space. Suppose that P_α ∼ P for all α ∈ A and that the densities f(·, α) = dP_α/dP satisfy the conditions (in which P = P_{α_0}):

(i) ∂f/∂α exists a.e. and |∂f/∂α| ≤ F, E(F) = ∫_Ω F dP < ∞,

(ii) W_{k,k′} = {ω : f(ω, α) ≥ k + k′ (∂f/∂α)|_{α=α_0}(ω)},
(1)
is independent of α, where the constants k, k′ are chosen to satisfy the constraints: for a given 0 < ε < 1 one has (a) P(W_{k,k′}) = ε, and (b) (∂P_α(W_{k,k′})/∂α)|_{α=α_0} = 0. Then W_{k,k′} is the best critical region for testing the simple hypothesis H_0 : P_{α_0} = P versus the composite alternative H_1 : P ∈ {P_α, α ∈ A − {α_0}}, in the sense that if S ∈ Σ is another set with P(S) ≤ ε and (∂P_α(S)/∂α)|_{α_0} = 0, then P_α(W_{k,k′}) ≥ P_α(S), ∀α ∈ A (and thus the test is also unbiased).

Proof. The argument is essentially the same as that given for Theorem II.1.1, because of the strong assumptions, and we present it for completeness. Indeed, since P(S) = P_{α_0}(S) ≤ P_{α_0}(W) = P(W), where W = W_{k,k′}, consider

P_α(W − W∩S) =
∫_{W−W∩S} f(ω, α) dP
  ≥ kP(W − W∩S) + k′ ∫_{W−W∩S} (∂f/∂α)(ω, α)|_{α=α_0} dP
  = kP(W − W∩S) + k′ (∂/∂α)(P_α(W − W∩S))|_{α=α_0}, by (i),
  ≥ kP(S − S∩W) + k′ (∂/∂α)(P_α(S − S∩W))|_{α=α_0}, by (i) and the choice of S,
  = ∫_{S−S∩W} [k + k′ (∂f/∂α)(ω, α)|_{α=α_0}] dP
  ≥ ∫_{S−S∩W} f(ω, α) dP, by (ii),
= P_α(S − S∩W). Adding P_α(S∩W) to both sides gives the asserted inequality.

Several comments on this result are in order. In the above argument, the constraints P(W_{kk′}) = ε and (∂P_α(W_{kk′})/∂α)|_{α=α_0} = 0 are used to determine the two constants k, k′ in the definition of the critical region W_{kk′}, and the actual values 0 < ε < 1 and 0 are not essential. But the condition that W_{kk′} be independent of α is a strong restriction. However, we give an example below to show that there are processes for which it holds, and the example also explains the difficulties associated with it. As noted in Chapter II, one can put a weight function on the (composite) parameter space and reduce the result to the simple versus simple (weighted) alternative, and then try to choose those weights to maximize the power (or minimize the possible losses incurred) of this test. That procedure leads to a different aspect of the test problem. If one replaces the constraint (∂P_α(W_{kk′})/∂α)|_{α=α_0} = a given constant, by the condition that (∂²P_α(W_k)/∂α²)|_{α=α_0}
is a maximum, where W_k is defined exactly as in the simple versus composite alternative case, then again the previous procedure applies. The two resulting tests are termed type A_1 and type A by Neyman and Pearson in their classical studies in the 1930s. We now give an example to show that the above proposition applies to a class of Gaussian processes. It is also due to Grenander [1].

2. Example. Let {X_t, t ∈ [a, b] = I} be a Gaussian process on (Ω, Σ, P) with mean m(·) and a known continuous covariance function r. It is desired to test the hypothesis H_0 : m(t) ≡ 0 versus the composite alternative H_1 : m(t) = αa(t), where α ≠ 0. Since r is given, we may employ the Karhunen-Loève representation of the process as a series, as presented in Proposition IV.1.4:

X_t = m(t) + Σ_{n=1}^{∞} ϕ_n(t) ξ_n/√λ_n,      (2)
which converges in L2 (P ), uniformly in t ∈ I, where λn (> 0) and ϕn are the eigenvalues and the corresponding eigenfunctions (forming a complete set in L2 (I)) of the continuous kernel r. Then ξn ∈ L2 (P ) are orthonormal random variables. To apply Proposition 1, let Pαa be the probability measure corresponding to the mean function αa(·) and P for α = 0. We assume now that Pαa ∼ P . From these conditions
410
VII. More on Stochastic Inference
one obtains the following: If ak = I a(t)ϕk (t) dt, then as shown in ∞ Section IV.1, Pαa ∼ P necessarily implies i=1 λi a2i < ∞, and hence ∞ i=1 ξi ai λi converges in mean and a.e. Thus the likelihood ratio is given by f (ω, α) = exp{α
∞
ξi (ω)λi ai −
i=1
∞ α2 λi s2i }, 2 i=1
(3)
which is derived by the same procedure used in the examples there. To ∂f invoke Proposition 1, note that ∂α exists and is found from (3) to be ∞
∞
i=1
i=1
∂f ξi (ω)λi ai − α λi a2i ]. (ω, α) = f (ω, α)[ ∂α
(4)
∞ variable with Thus if Z= i=1 ξi ai λi , which is a centered ∞ Gaussian ∞ 2 2 2 2 a λ < ∞, since σ = λ a < ∞, and if α is variance i i i=1 i i i=1 restricted to a bounded interval, say |α| ≤ 1, then F defined as F = ∞ ∂f |. Thus to const.e|Z| [|Z| + i=1 a2i λi ] is integrable and dominates | ∂α apply Proposition 1, it remains to verify that the set 1
Wkk = [e− 2 α
2
σ 2 +αZ
≥ k + k Z]
can be chosen to be independent of α. Indeed, let β = kα and Y = k + k Z. Then k
α , k
γ = 12 α2 σ 2 +
Wkk = [eβY −γ ≥ Y ] =[
eβY eβY ≥ eγ , Y > 0] ∪[ < eγ , Y < 0]. Y Y
(5)
If γ0 = eγ depending only on k, k , it is possible to choose the latter to ∂P (W ) satisfy P (Wkk ) = ε(0 < ε < 1 given), and αa∂α kk |α=0 = 0. Since Wkk depends only on the known λi and the pair of exactly determined k, k satisfying the two constraints (with |α| = 1), they can be selected so that the set (5) gives a uniformly most powerful region. Thus there exist nontrivial classes of processes for which Proposition 1 applies. It is clear from this work that obtaining likelihood ratios for broad families of (not necessarily Gaussian) processes is crucial for the inference theory. Even for problems on finite sets of random variables, as discussed cogently and in detail by Birnbaum [1], the likelihood function (as well as the ‘principle’) will be essential, and this is true for processes as well. Their calculation is moreover a nontrivial task, particularly for the latter. Here we present some important results on this
7.1 Absolute continuity of families of probability measures
411
problem. In Section V.1, conditions were presented for the equivalence of Pf and P when f is a translate of the process X = {Xt , t ∈ I = [a, b]} taken in its canonical form, i.e., Ω = RI , Σ = its cylinder σ-algebra, and Pf is the probability measure of Y = X + f, f ∈ Ω with P = P0 so that Pf (A) = P [X + f ∈ A], A ∈ Σ. The geometry of the set MP of all such f for which Pf P , called the admissible translates of P , was analyzed in Theorem V.1.12. If X is Gaussian, then we found that MP is linear and has an inner product in terms of which it becomes dP a Hilbert space. We also then obtained the likelihood ratio dPf for this class in Theorem V.1.4. We now restate (and include a different proof of) a slightly specialized but sharp version, due also to Pitcher [1], leading to results for a large class of non Gaussian processes. 3. Theorem. Let {Xt , t ∈ [a, b] = I} be a Gaussian process on (Ω, Σ, P ) represented in the canonical form as above, [i.e., Ω = RI , Σ = the cylinder σ-algebra of Ω] with means zero, a continuous covariance function r and a (Σ-)measurable f ∈ Ω. Then f is an admissible mean of the process X iff af (∈ Ω) is also one for each a ∈ R. When this holds, the likelihood ratio is obtained as: a2 dPaf (x) = exp[aϕ(x) − C], C ≥ 0, x ∈ Ω, a.e., dP 2
(6)
where ϕ(·) is a linear functional on Ω. A sharper form of (6) can be given for admissible means of integral type (i.e., of the form (8) below) as: dPf 1 (x) = exp[ x(t) dF (t) − f (t) dF (t)], (7) dP 2 I I for a function F : I → R of bounded variation where f, F are related as f (s) = r(s, t) dF (t). (8) I
In the opposite direction, if there is an F : I → R of bounded variation such that 1 dPf (x) = exp[ x(t) dF (t) − C], (9) dP 2 I then necessarily C = I f (t) dF (t) where f is given by (8). Remark. Observe that not every admissible mean f of P admits the representation (8), shown by Example V.1.5, and thus the particular form given in (7) is for a subclass of such means. These will be characterized in Proposition 4 below. This is what makes the present case more concrete and adds interest to the former abstract version. [Here and
412
VII. More on Stochastic Inference
later, we use the coordinate representation Xt (x) = x(t), x ∈ Ω freely.] Proof. If Paf ∼ P for all a ∈ R, then Pf ∼ P trivially, and we only need to consider the converse which follows from Theorem IV.1.4, established using the RKHS techniques. Here we present a martingale proof, due to Pitcher, which is of methodological interest adding further insight into the structure of the problem. By hypothesis Σ = σ(∪α∈I πα−1 (Bα )) where I is the collection of finite subsets of I (directed by inclusion) and πα : Ω → Rα is the coordinate projection onto, with Bα as the Borel σ-algebra of Rα . Then Σα = πα−1 (Bα ) ⊂ Σβ for α ⊂ β in I, and {Σα , α ∈ I} is an increasing directed (or filtering) family of σ-subalgebras of Σ which they generate. If G : Ω → R is any (bounded) Σα -measurable function, then by a standard measure theory result (cf., e.g., the companion volume, Rao [21], Lemma II.2.10) there exists a (bounded) Borel function h : Rα → R such that G = h ◦ πα . Let Xt1 , . . . , XtN be a set of (can assume these are linearly independent) random variables with (t1 , . . . , tN ) ∼ = α, which has a (nonsingular) N -dimensional Gaussian distribution with means zero and covariance matrix SN = (sN ij , 1 ≤ i, j ≤ N ). Then by the image (or fundamental) law of probability one has:
G(x) dP (x) = Ω
1 (2π)N |SN |
1
−1
h(u)e− 2 u SN
u
du,
(10)
RN
with u = (u1 , · · · , uN ) ∈ RN (prime for the transpose and | · | for the determinant). For a given f ∈ Ω, let N
ϕN (x) =
aN ij x(ti )f (tj ),
N
CN =
i,j=1
aN i,j=1 f (ti )f (tj ) ≥ 0,
i,j=1
−1 = (aN where SN ij , 1 ≤ i, j ≤ N ). Hence for any λ ∈ R, one has by a change of variables in (10):
λϕN (x)− 12 λ2 CN
G(x)e Ω
dP (x) =
1 × (2π)N |SN | 1
RN
−1
h(u1 + λf (t1 ), · · · , uN + λf (tN )) e− 2 u SN G(x + λf ) dP (x), by the image law,
= Ω = Ω
G(x) dPλf , by change of
u
du
7.1 Absolute continuity of families of probability measures
variables (as in the begining of Sec. V.1), dPλf dP, since Pλf P . G(x) = dP Ω
413
(11)
Taking λ = 1 and noting that such G form a dense set in L1 (P ), 1 one gets from (11), and because eϕN (x)− 2 CN is Σα -measurable, that 1 1 dP E Σα ( dPf ) = eϕN (x)− 2 CN for a.a. (x) [P ], so {gα = eϕN − 2 CN , Σα , α ∈ I} is a positive uniformly integrable martingale. Hence it converges in L1 (P ), by a known convergence theorem (cf., e.g., Rao [12], Theorem 4.4.6, on p. 207). Since the Σα generate Σ it also follows from the dP same theorem that gα → dPf in L1 (P ) as α ∞. But under P , the mapping x → −x is measure preserving so that replacing x by −x and 1 noting ϕN (−x) = −ϕN (x), we deduce that {e−ϕN (x)− 2 CN , N ≥ 1} also has the same property and hence considering subsequences and multiplying both, we deduce that the product converges a.e. But this 1 1 means eϕN (x)− 2 CN · e−ϕN (x)− 2 CN = e−CN converges, or that CN → C ≥ 0 and C < ∞, by the uniform integrability of the gα set. This shows that eϕN (x) → eϕ(x) for a.a. (x) and in L1 (P ) as well. It follows dP (by the argument of the quoted mean convergence result) that dPλf = 2 exp[λϕ(x) − λ2 C], (a.a. x ∈ Ω, [P ]), for a uniquely defined functional ϕ : Ω → R and C ≥ 0. Since each ϕN (·) is linear, it is seen immediately that the limit ϕ(·) is also linear, establishing (6). [It may be noted that by using dyadic rationals in I, one can also use just a single sequence in the above discussion avoiding directed sets, as was done by Pitcher [1], but then we have to add an argument showing that the end result does not depend on such special sequences.] We now derive the sharper form (7) using the structure of Ω and the interval I as well as the crucial fact that, in the Gaussian case, the conditional expectation is a positive linear contractive projection acting on L2 (P ) is also an orthogonal projection. Thus let ϕN , SN , ΣN be as above (cf., (10) and (11)) for α taken as N -points. Then noting that for f given by (8) one has: f (t) dF (t) = r(s, t) dF (s) dF (t) ≥ 0, (12) I
I
I
by the positive definiteness of r and since F is bounded (r continuous), this integral is also finite. Hence by Fubini’s theorem: 2 ( x(t) dF (t)) dP (x) = [ x(s)x(t) dP ] dF (s) dF (t) Ω I I I Ω r(s, t) dF (s) dF (t) < ∞, by (12). = I
I
414
VII. More on Stochastic Inference
So Y = I x(t) dF (t) is well-defined and is in L2 (P ). We now show the key property that E ΣN (Y ) = ϕN a.e. using the fact that on L2 (P ) the last conditional expectation is simply an orthogonal projection onto the linear manifold determined by {x(t), t ∈ α}. For this one notes the special form of ϕN (·) with {x(t), t ∈ α} used to define it and gets, on the one hand
Ω
ϕN (x)x(tk ) dP (x) =
N Ω i,j=1
N
=
aN ij x(ti )f (tj )x(tk ) dP (x)
aN ij r(ti , tk )f (tj )
i,j=1
= f (tk ), −1 since (aN , ij ) = (r(ti , tj ))
and on the other hand: Y (x)x(tk ) dP (x) = x(t)x(tk ) dF (t) dP (x) Ω Ω I = r(t, tk ) dF (t) = f (tk ).
(13)
(14)
I
Since tk ∈ α is arbitrary (13) and (14) imply, by the linearity of the integral and subtraction, that 0= Ω
(Y − ϕN )(x)x(tN ) dP (x) = (Y − ϕN , Xtk ), tk ∈ α,
which implies that Y − ϕN ⊥ MN , the linear manifold spanned by {Xtk , tk ∈ α}. Now using the fact that our process is Gaussian we deduce finally the crucial conclusion that E ΣN (Y − ϕN ) = 0 and since ϕN ∈ MN one has E ΣN (Y ) = ϕN a.e. Thus {ϕN , ΣN , N ≥ 0} is a uniformly integrable martingale and hence as before ϕN → Y a.e. and in both L1 (P ) and L2 (P )-means. From this one gets immediately CN =
E(ϕ2N )
2
→ E(Y ) =
f (t) dF (t) ≥ 0. I
It remains to show that the likelihood ratio is that given by (7). By (11), taking G(x) = 1 and λ = 1 we have
ϕN (x)−
[e Ω
CN 2
2
CN
] dP (x) = e
Ω
e2ϕN (x)−2CN dP (x)
(15)
415
7.1 Absolute continuity of families of probability measures
CN
=e
Ω
dP2f = eCN → eC , by (15). (16)
However, in (11) with G(x) = χA (x), A ∈ ∪α∈F Σα so that A ∈ ΣN for some α = αN , one has C ϕN (x)− 2N e dP (x) = dPf (x), A
A
whence the martingale {ϕN − C2N , ΣN , N ≥ 1} converges a.e. as well as in L1 (P ). It was already established in (16) that CN → C so that ϕN → Y a.e. and in L1 (P ). This implies, as in the first part, that dPf Y −C 2 where Y is the simple stochastic (or Bochner) integral dP = e x(t) dF (t). So (7) is established. I C d dP For the opposite direction, let dPf = eϕ− 2 = eY − 2 also, where Y is as in (9). Consequently, ϕ − Y = 12 (C − d), a constant. Since E(ϕ) = 0 = E(Y ), the constant is zero so that C = d. However, using 1 the computation (13) and the fact that ϕN → ϕ in L (P ) we conclude that for any arbitrarily fixed t ∈ I, I ϕ(x)x(t) dP (x) = f (t). Since ϕ = Y , proved above, one has ϕ(x)x(s) dP (x) f (s) = Ω = [ x(t) dF (t)]x(s) dP (x) Ω I = r(s, t) dF (t). (17) I
So from the definition of Y [=
2
I
x(t) dF (t)], it follows that
ϕ (x) dP (x) =
C=d= Ω
2
Y dP = Ω
f (t) dF (t).
(18)
I
Consequently (17) and (18) establish the converse part and thus all the assertions hold. In the preceding work, the fact that I is a bounded interval is used in an essential way in showing that the set {CN , N ≥ 1} is a convergent sequence with limit 0 ≤ C < ∞. If I is not bounded, then C = ∞ is possible and then the likelihood ratio becomes zero so that Pf ⊥ P . A careful review of the proof reveals that this is in fact true as the reader can verify (cf., Exercise 1). The representing F giving the mean f (= r(·, t) dF (t)) can be constructed for many triangular covariances. A I large class for which this is explicitly obtainable is already implied
416
VII. More on Stochastic Inference
by our earlier results (cf., Lemma V.1.6 and Exercise V.6.3). It is desirable, however, to answer the general question: given r(·, ·) what f s are representable as (8) relative to an F ∈ (C(I))∗ ? Motivated by the method of proof of Theorem VI.2.1, we can present a solution of this problem as follows. 4. Proposition. Let r be a (strictly) nondegenerate continuous covariance function of a canonically represented Gaussian process X = {Xt , t ∈ I} on (Ω, Σ, P ) with means zero where I ⊂ R is a compact interval. Then a bounded Borel function f : I → R is an admissible mean of X iff ˜ r ∞ ≤ 1, h ∈ L1 (I) < ∞, (19) sup | f (t)h(t) dt| : h I
˜ r (t) = where h
I
r(s, t)h(t) dt.
Proof. If f is an admissible mean, then by the above theorem it is given by (8) for some function F of bounded variation on I. Consequently, | f (t)h(t) dt| = | h(t) dt r(s, t) dF (s)| I I I ˜ r (s)| = | dF (s)h I
˜ r ∞ , ≤ F 1 h
(20)
by H¨older’s inequality, where F 1 is the variation of F and ˜ |hr (s)| = |r(s, t)||h(t)| dt ≤ h 1 sup |r(s, t)| = h 1 |r|∞ (say), s,t∈I
I
˜ r ∞ ≤ |r|∞ h 1 < ∞. Hence setting K = F 1 < ∞ in so that h (20), the supremum in (19) is bounded by K. ˜ r (t) = 0 for a.a. t ∈ I Conversely, let (19) hold. First observe that h implies ˜ 0 = h(t)hr (t) dt = h(t) dt r(s, t)h(s) ds I I I r(s, t)h(s)h(t) ds dt, = I
I
and since r(·, ·) is strictly positive definite (used here), this implies that ˜ r is one-to-one and linear on h = 0 a.e. Thus the mapping H : h → h 1 L (I) → C(I), since r is continuous. Now consider Tf defined by ˜ r ) = (H −1 (h ˜ r )) (= (h)), Tf (h
(21)
7.1 Absolute continuity of families of probability measures
417
where (h) = I f (t)h(t) dt. Thus Tf is unambiguously defined and ˜ r : h ∈ L1 (I)} ⊂ by (19) this linear functional on the subspace {h C(I) is bounded. Hence it has a bound preserving (i.e., Hahn-Banach) extension to all of C(I). Then by a classical Riesz representation, for each element of the dual of C(I), there is an F (uniquely) of bounded variation on the compact I (F determines a signed Borel measure) such that ˜ r (t) dF (t). ˜ Tf (hr ) = h I
This and (21) together imply ˜ ˜ hr (t) dF (t) = Tf (hr ) = (h) = f (t)h(t) dt. I
Then
I
[f (t) − I
r(s, t) dF (s)]h(t) dt = 0, h ∈ L1 (I).
I
It follows that
r(s, t) dF (t),
f (t) =
a.a.(t).
I
Thus (8) holds for all t ∈ I, since r is continuous. It should be noted that an actual calculation of F for given f and r satisfying (19) is not trivial. From another point of view, this amounts to solving the integral equation(8) for F , and we illustrate this for a class of problems. Example A. Suppose r is a triangular covariance of a Gauss-Markov process, [such as the O.U. (Ornstein-Uhlenbeck) process] whose covariance is given as r(s, t) = u(s ∧ t)v(s ∨ t), where −∞ < a ≤ s, t ≤ b < ∞, uv ↑, and ( uv ) , (the derivative) exists (v > 0). Then (8) can be solved if f is an admissible mean, f exists and is of bounded variation with u(a) = 0 ⇒ f (a) = 0. In fact by Lemma V.1.6, under these conditions F of (8) is given by F (t) = − a
where λ(t) =
( fv ) (u v)
t
dλ(s) , v(s)
a < t < b,
(t) with λ(a)u(a) = f (a), λ(b) = 0 so that λ has
a discontinuity at both a and b. As seen before, integrating (by parts) one finds t b f (a) f + v(t) r(s, t) dF (s) = v(t) d( )(s) v(a) v a a
418
VII. More on Stochastic Inference
and for the O.U. process, r(s, t) = e−β|s−t| , β > 0. It is then observed that (19) holds for f = 1 so that 1 is an admissible mean for this O.U. process. Here u(t) = eβt , v(t) = e−βt and one finds λ(t) = − 12 e−βt . Thus
b
r(s, t) dF (s) = a
v(t) + v(t) v(a) −β(t−a)
=e
a
t
1 d( )(s) v
+ 1 − e−β(t−a) = 1 = f (t),
as was to be verified. We would like to proceed from translations of means in the admissible classes to some nonlinear functionals of the Gaussian X, and then to more general non Gaussian processes using certain geometric properties instead of the analytical forms. The first type will be discussed in the remainder of this section, and the second in the next section. Observe that, in the preceding work, we have studied the transformations (Tα f )(x) = x + αm, f (u) = u ∈ R of a Gaussian process X = {Xt , t ∈ I} in canonical form on (Ω, Σ, P ), E(Xt) = 0, E(Xs Xt ) = r(s, t). This can be generalized by (Tα f )(x) = (h ◦ πn )(x + αm) for f : Ω → R depending on a finite number of coordinates [also termed a cylindrical or “tame” function in the literature], as noted at the beginning of proof of Theorem 3, so that h : Rn → R is a Borel function and α ∈ R. In this form Tα is linear on this class of functions, but (Tα f )(x) is more complicated as a function of x. When f is bounded, we choose h to be also bounded. The class {Tα }α , defined on the set of all such cylindrical functions F constituting a dense collection in Lp (P ), 1 ≤ p < ∞, and depending on a parameter α ∈ R, will be of interest in the following work which extends the translation class studied earlier. To motivate the general definition of the Tα s desired above, consider a random variable X with distribution FX so that FX (u) = P [X < u] and for its translate P [X + a < u] = P [X < u − a] = FX (u − a). (u) is its density, then the likelihood ratio is given by If p(u) = FX p(u−a) La (u) = p(u) . In case p is differentiable with derivative p and p > 0, then letting ϕ(u) =
p (u) p(u)
one has
a p (u − b) log La (u) = − db p(u − b) a0 ϕ(u − b) db, =
(22)
0
and a likelihood ratio is obtained. Now to proceed further, (22) may be written more generally for any smooth function h (e.g., infinitely
7.1 Absolute continuity of families of probability measures
419
differentiable with compact supports) formally as: ϕ(u)h(u)p(u) du = − p (u)h(u) du R R ∂p(u − α) = h(u)|α=0 du ∂α R ∂ = p(u − α)h(u) du |α=0 ∂α R ∂ = p(u)h(u + α) du |α=0 ∂α R ∂ = (Tα h)(u)p(u) du |α=0 . ∂α R Hence we have a formula for ϕ as: ∂ ϕ(u)h(u) dP = (Tα h)(u) dP (u) |α=0 . ∂α R R
(23)
This formal computation can be justified in the translation problem that is considered above, but it can be taken as a starting point for the general case. Thus we want to find a ϕ satisfying (23) for a sufficiently “rich” class of functions h ∈ F , because for such a ϕ which is a solution of (23), the likelihood ratio will be obtained from (22) at once: α dPα (x) = exp[− ϕ(x − β) dβ]. (24) dP 0 Note that if α = 0 then P0 corresponds to P itself. Interest in this formula is enhanced by the manner in which α appears. This program is successfully implemented in a series of publications by Pitcher, and we shall include some salient features of this important work, for both the Gaussian and other classes. We now outline this program. We begin with a certain functional of the Brownian Motion (or BM) process in order to make detailed computations that will indicate what conditions may reasonably be imposed later in a general case. Moreover, the present analysis has independent interest. Thus let {Bt , t ∈ I} be a BM and for simplicity take I = [0, 1], the unit interval. Consider the process X = {Xt , t ∈ I} as a solution of the stochastic integral equation: t t a(s, Xs ) ds + b(s) dBs, t ∈ I; X0 = 0, (25) Xt = X0 + 0
0
where a : I ×R → R and b : I → R are continuous functions such that b is bounded away from 0 and ∞, (e.g., 0 < ε ≤ b(t) ≤ 1ε ) and ax (t, x) =
420
VII. More on Stochastic Inference
∂a ∂x (t, x)is
continuous. Moreover, a satisfies a uniform (on I) Lipschitz condition so that the integrals in (25) are well-defined and the solution X of this equation exists uniquely (cf., e.g., the companion volume, Rao [21], Corollary VI.4.8, on p.503, or Doob [2], Section VI.3 on p. 273). However, we assume somewhat more for the present illustrative purposes, namely, eixy A(y, t) dy, t ∈ I, (26) a(t, x) = R
exists (is real) and also ax (t, x) = with
iyeixy A(y, t) dy, R
R
(1 + |y|)|A(y, t)| dy ≤ K < ∞.
(27)
Further let m : I → R be an absolutely continuous function satisfying m (t)2 dt < ∞. (28) m(0) = 0, I
Let F be the set of cylindrical functions f : Ω → R such that f = h ◦ πn , n ≥ 1 where h is bounded with bounded derivatives. Here (26)(28) are introduced for technical reasons. Such a function class F is dense in Lp (P ), p ≥ 1. Define mappings Tα : F → F by the expression: (Tα f )(x) = h ◦ πn (x + αm),
x ∈ Ω.
Thus (Tα f ) ∈ Lp (P ) and is a positive linear operator. Also by our hypothesis α (f ) = (Tα f )(x) dP (x), α ∈ R, Ω
is well-defined, and Tα 1 = 1. So α (·) is a positive bounded linear functional on F . Hence it has a bound preserving extension to all of Lp (P ). Taking p = 1 and using the Riesz representation theorem, we get a unique probability measure Pα on Σ such that f (x) dPα (x), α ∈ R, f ∈ L1 (P ). (29) α (f ) = Ω
Our additional conditions (26)–(28) are imposed in order to ensure (as seen from the work below) that Pα ∼ P (= P0 ). Conditions for existence of solutions of (25) alone do not always imply that Pα ∼ P for all α;
7.1 Absolute continuity of families of probability measures
421
indeed Tα f may not even be measurable. Such pathological behavior of (α, x) → (Tα f )(x) has been noted by Cameron [1] long ago, and our additional restrictions avoid these problems. A method will now be presented which produces a functional ϕ(·) satisfying (24) for the above family of mappings Tα and that it leads to Pα ∼ P, ∀α, as well as enabling us to find the likelihood function. For this purpose we first establish a key technical result on a special ϕ (say ϕ0 ) determined by the diffusion process given by (25): 5. Proposition. With the above notation and assumptions, there is a ϕ0 satisfying: ∂ ϕ0 (x)(Tα f )(x) dP (x) = ( (Tα f )(x) dP (x)), ∀f ∈ F . (30) ∂α Ω Ω In fact, ϕ0 is given by ϕ0 (x) = I
m (t) − m(t)ax (t, x(t)) dBt , b(t)
(31)
which is a well-defined stochastic integral (cf. Sec. IV.2) since b(·) is bounded away from zero and infinity. Proof. We start with a particularly simple cylindrical function f , and Tα f , show that it satisfies (30), and then proceed to extend the result for an algebra of such elements in L1 (P ), thereby establishing that ϕ0 defined by (31) is the correct solution. Consider (without a motivation at this point) the cylinder function f ∈ Ω, given by f = h ◦ πV where the vector V = (t1 , . . . , tn , t) and n h(u1 , · · · , tn , t) = exp[i j=1 λj uj + λu], so that f (x, λ, t) = exp{i
n
λj x(tj ) + λx(t)}.
j=1
Next let ϕ0 (x)f (x, λ, t) dP (x)− g(λ, t) = Ω ⎤ ⎡ n λj m(tj ) + λm(t)⎦ f (x, λ, t) dP (x), i⎣ j=1
(32)
Ω
where for later use we take tj as ordered, i.e., 0 < t1 < · · · < tn < 1, tn ≤ t ≤ 1, and g(λ, 0) = 0 with ϕ0 as in (31). We then set λ = α and (Tα f )(x) = f (x + αm) so that the second integral in (32) becomes
422
VII. More on Stochastic Inference
n with f (x) = h(πn (x)) = exp[i j=1 λj x(tj )] and the first one corresponds to the left side of (30). The idea is to establish that g(λ, t) = 0, for tn ≤ t ≤ 1, by showing that supλ g(λ, t) = v(t) satisfies a homogeneous Gronwall inequality (so v ≡ 0). Next by induction it will be seen to be true for all n (for n = 0, we set t0 = 0 and by definition g(λ, 0) = 0). It will prove (30) for such simple cylindrical functions, and the result is extended there after. We now present the details involving some tedious computations. This is necessary for the variational calculus techniques used. To show that |g(λ, t)| = 0 at each point t ∈ [tn , 1], we estimate the right derivative of g(λ, t) at t and show that its maximum is arbitrarily small and hence it must be zero. This is the key part and here are the crucial details. Now g is jointly continuous in (λ, t). Since the diffusion process X has a.a. continuous sample paths, we can appeal to the bounded convergence theorem in simplifying and extending the integral in (32). Let us first estimate the right derivative of g(λ, ·) at each point, + denoted ∂∂tg , which we compute (with Δx(t) = x(t+δ)−x(t); Δm(t) = m(t + δ) − m(t), δ > 0) as follows: ∂ (Tλ f )(x)|λ=α dP (x) Ω ∂λ
∂+g g(λ, t + δ) − g(λ, t) (λ, t) = lim + ∂t δ δ→0 eiλΔx(t) − 1 ϕ0 (x)f (x, λ, t) = lim dP (x) δ δ→0+ Ω ⎞ ⎛ n eiλΔx(t) − 1 ⎠ ⎝ dP (x) λj m(tj ) + λm(t) f (x, λ, t) −i δ Ω j=1 Δm(t) ) − iλ( f (x, λ, t) dP (x) δ Ω eiλΔx(t)−1 dP (x) , f (x, λ, t) − iλΔm(t) δ Ω = I1 + I2 + I3 + I4 , (say), (33) where from (25) we get Δx(t) as: t+δ Δx(t) = Xt+δ − Xt = a(s, x(s)) ds + t
t+δ
b(s) dBs.
(34)
t
To simplify (33) we use (34) and the hypothesis on the (drift and diffusion) coefficient functions {a, b}. Let us consider the first term I1 on the right of (33), the work on the others being similar. Expanding ∂a (t, x)) one has: (using Fubini’s theorem and setting au (t, x) = ∂u t+δ 1 I1 = lim+ ϕ0 (x)f (x, λ, t) iλ [a(s, x(s)) ds + b(s) dBs ] δ→0 δ Ω t
423
7.1 Absolute continuity of families of probability measures
t+δ λ2 b(s) dBs )2 dP (x) ( 2 t ϕ0 (x)f (x, λ, t)a(t, x(t) dP (x)+ = iλ −
Ω
iλ lim δ→0+ δ
t+δ
ds t 2
Ω
f (x, λ, t)[m (s) − m(s)au (s, x(s))] dP (x)
λ b2 (s) ds f (x, λ, t)× − lim δ→0+ 2δ t Ω t m (s) − m(s)au (s, x(s)) dBs dP (x), b(s) 0 = J1 + J2 + J3 (say). t+δ
(35)
Here in the first term we used the dominated convergence, and in the other two an interchange of the order of integrals as well as simple properties of the classical Brownian integral. The middle two limits respectively are simplified as: iλΔm(t) J2 = lim + δ δ→0
f (x, λ, t) dP (x) Ω
− iλm(t) and
λ2 b2 (t) J3 = − 2
Ω
f (x, λ, t)au(t, x(t)) dP (x),
Ω
ϕ0 (x)f (x, λ, t) dP (x).
(36)
In an analogous manner one finds for I2 and I4 : lim
δ→0+
eiλΔx(t) − 1 dP (x) = f (x, λ, t) δ Ω
−
f (x, λ, t)[iλau(t, x(t))
Ω 2 2
λ b (t) ] dP (x). 2
(37)
Substituting (35)–(37) into (33), one gets: (λb(t))2 g(λ, t ∂+ g(λ, t) = − + iλ ϕ0 (x)f (x, λ, t)a(t, x(t)) dP (x) ∂t 2 Ω f (x, λ, t)au(t, x(t))[(1 + iλ)m(t) − iλ Ω
+i
n j=1
λj m(tj )] dP (x).
(38)
424
VII. More on Stochastic Inference
Using the fact that a is the Fourier transform of A we find, on interchanging the Lebesgue and the P -integrations: ∂+g (λ, t) = iλ ∂t
R
g(λ + y, t)A(y, t) dy −
(λb(t))2 g(λ, t). 2
(39)
But |g(λ, t)| ≤ B + C|λ| for some constants B, C ∈ R+ (cf. (32)) and hence ∂+ ∂+ (|g(λ, t)|2) = (g(λ, t)¯ g(λ, t)) ∂t ∂t g(λ + y, t)A(y, t) dy
= 2Re(iλ¯ g (λ, t)) −
R 2
(λb(t)|g(λ, t)|) 2
≤ (B1 |λ| + C1 |λ|2 )|g(λ, t)| −
λε(|g(λ, t)|)2 . 2
(40)
Here we used the fact that ε ≤ b(t) ≤ 1ε (b(t) > 0 is bounded above and below for some ε > 0). Since g(λ, 0) = 0 by definition, if |g(λ, t)| has a positive maximum, for a fixed λ, at tλ > 0 for the first time, then there is a subinterval J ⊂ (0, tλ ] over which |g(λ, ·)| is increasing and + so ∂∂t (|g(λ, τ )|2) ≥ 0 at each point τ ∈ J. Then this inequality must also hold for a minimum of |g(λ, ·)|, for this interval. If |g(λ, τ )| > 0, then (40) implies by dividing through by the positive quantity, that 2 |g(λ, τ )| ≤ B2 + C |λ| for some constants. Since max |g(λ, ·)| is approximated by such g(λ, τ ), the same inequality holds for all τ ∈ J, so that g(λ, t) is bounded. Since what we want to show about g(λ, ·) is a local property we can also assume that λ is confined to a compact interval and we do it from now on. Let v(t) = supλ |g(λ, t)|. Then by the preceding computations it follows that v is a bounded function. We shall show presently that it is actually continuous. First observe that (40) gives 0≤
(ελ|g(λ, t)|)2 ∂+ (|g(λ, t)|2) ≤ B3 |λg(λ, t)|v(t) − . ∂t 2
(41)
Indeed if C0 is the maximum of |λ| on its compact interval then we have the trivial upper estimate of (B1 |λ| + C1 |λ|2 )|g(λ, t)| for each t in the above interval as: (B1 |λ| + C1 |λ|2 )|g(λ, t)| ≤ (B1 C0 v(t) + C1 C0 v(t))|λg(λ, t)|
425
7.1 Absolute continuity of families of probability measures
Substituting this in (40) gives on setting B3 = (B1 + C1 )C0 the following: ∂+ (ε|λg(λ, t)|)2 0≤ (|g(λ, t)|2) ≤ B3 |λg(λ, t)|v(t) − . (41) ∂t 2 Consequently |λg(λ, t)| ≤ B4 v(t) ≤ B5 , and then B4 v(t)2 from (41). Further |v(t) − v(t0 )| ≤
∂+ (|g(λ, t)|2) ∂t
≤
2B5 + | sup |g(λ, t)| − sup |g(λ, t0 )|| |λ0 | |λ|≤|λ0 | |λ|≤|λ0 |
which can be made small by first choosing |λ0 | large and then taking t close to t0 , proving that v(·) is continuous. But then we also have on integration of the inequality in (41): 2
|g(λ, t)| ≤ B6
0
t
2
2
v (s) ds, ⇒ v (t) ≤ B6
t
v 2 (s) ds,
0
which is a homogeneous Gronwall inequality. Hence v(t) = 0 and so g(λ, t) = 0. The result may be restated for the original set of t points as g(λ, t) = 0 for n ≤ N implies that g(λ, tN ) = 0. So we can conclude by induction that g(λ, t) = 0 for all t ∈ [0, 1], and hence that g(λ, t) = 0 as desired. This shows that ϕ0 is a solution of (30) for f = h ◦ παN where N h(u1 , · · · , uN ) = exp{i j=1 λj uj } with αN identified as an N -point index, and then from (32) we have g = 0 so that:
b
dα a
Ω
ϕ0 (x)(Tα f )(x) dP (x) =
Ω
(Tb f − Ta f )(x) dP (x)
Ω
b
(Tα f )(x) dα dP (x),
= a
(42)
for the above type of f ∈ F . By a standard reasoning (with Fubini’s ˆ is bounded theorem) this holds for all h whose Fourier transform h and has bounded support. By linearity the result is also true for the algebra of functions generated by the above type as well as (obviously) constants. Since such functions separate points of RN the result is also true for all bounded continuous functions on RN , N ≥ 1, by the StoneWeierstrass theorem. Thus the relation (42) holds for all f ∈ F since each such function can be approximated by a sequence fn = hn ◦ παn with hn of the above type. Differentiating (42) relative to b, and noting that α → Tα f is continuous in L2 (P ) for each f ∈ F , we get finally the validity of equation (30).
426
VII. More on Stochastic Inference
It may be noted that ϕ0 depends only on Bt and Xt so that if C is the σ-algebra generated by the process {Xt , t ∈ [0, 1]}, then (Tα f )(Xt) is C-measurable for each f ∈ F although ϕ0 (Xt ) need not be. Since E(ϕ0 ) = E(E C (ϕ0 )), replacing ϕ0 by E C (ϕ0 ) = ϕ (say), we see that (30) holds if ϕ0 is replaced by ϕ which is C-measurable. The family {Tα , α ∈ R} is a group of transformations on F . To present conditions for mutual absolute continuity of {Pα , α ∈ R} and to obtain their likelihood ratios, it will be useful to associate, as in Theorem V.1.12, a semi-group of (linear) operators on F with the Tα family that has better integrability properties. Thus define Vf (α) : F → F , by the equation Vf (α)g = exp{
1 2
0
α
T−b f db}T−α g,
f, g ∈ F , α ∈ R+ .
(43)
Since the Tα form a group and they commute, one finds that for each f, Vf (0) = id., and moreover 1 α Vf (α + β)g = exp{ T−b f db}× 2 0 1 β T−(α+b) f d(α + b)}T−α Tβ g, exp{ 2 0 by the commutativity af all oprators in sight, = Vf (α)Tα [Vf (β)Tβ ]T−α (T−β g) = Vf (α)Vf (β)g. Thus the {Vf (α), α ∈ R+ } indeed forms a semi-group on F . Since 1 ∈ F and Tα 1 = 1, we have Vf (α)g = Vf (α)1 T−α g. Also let T f −T f D(Tα f )(x) = limh→0 α+h h α (x) which exists where the differential operator D is defined on a dense set of (bounded) functions of L1 (P ), a a and one observes that D( 0 T−b f db) = 0 D(T−b f db = f − T−b f . This is seen by integrals by suitable sums. Then the a approximating a n note that D( 0 T−b f db) = n( 0 T−b f db)n−1 (f − T−a f ). Hence we have the following simplification: D(Vf (α)g) = D(Vf (α)1 T−α g) = D(Vf (α)1)(T−α g) + (Vf (α)1)D(T−α g) 1 = (Vf (α)1) (f − T−α f )T−α g + (Vf (α)1)D(T−α g) 2 1 = (f Vf (α)g − Vf (α)gT−α f ) + (Vf (α)1)D(T−α g). 2
(44)
7.1 Absolute continuity of families of probability measures
427
Here the formal computation can be justified by the topology of L1 (P ) and the dominated convergence theorem for limits with f, g ∈ F . Thus one finds, since Vf (α)g ∈ F : f 1 ( − ϕ)Vf (α)g dP = f Vf (α)g dP − D(Vf (α)g) dP, 2 Ω Ω 2 Ω by Proposition 5, since E(ϕTα f ) = D(E(Tα f )) and Tα f may be replaced by any g ∈ F, f D(Vf (α)g) dP = ( )Vf (α)g dP − Ω 2 Ω ∂ = Vf (α)g dP, (45) ∂α Ω where we have used (44) in the last line.
[A similar computation will be seen, in the next section, to hold even when Tα is replaced by a two parameter operator family {Tαβ , α < β} satisfying an evolution equation identity analogous to the ChapmanKolmogorov equation.] In what follows here and later the semi-group {Vf (α), α ≥ 0} becomes an important technical tool in the analysis. The essential part played by the operator Vf (α) is seen from the following facts. First one establishes, after considering ϕN = ϕ∧N and sequences fn ∈ F , |fn | ≤ N which converge to ϕN satisfying Vfn (α)g → VN (α)g as n → ∞, defining {VN (α), α ≥ 0} as a strongly continuous contractive semi-group in L2 (P ) with generator AN given by AN f = ( 12 ϕN − D)f, f ∈ L2 (P ). Thus one obtains the following result: 6. Proposition. The semi-group {VN (α), α ≥ 0} converges in L2 (P ) as N → ∞ to a strongly continuous semi-group {V (α), α ≥ 0} with generator A which is the strong limit of AN introduced above (Af = ( 12 f − D)f, f ∈ F ), and the generator of V (α) extends A. The technical details of the proposition are tedious, and need a careful application of the standard semi-group theory with estimates utilizing the fact that we have a diffusion process. These are given in Pitcher [2] and we include an outline in the Complements section. With this, the general statement that we are after may be presented as the following comprehensive theorem whose significance for inference theory will soon become clear. 7. Theorem. Consider the diffusion process given by (25) with the conditions (26)–(28). Then the corresponding probability measures {Pα , α ∈ R} of (29), on the canonically represented (Ω, Σ, P ), (P0 = P ), governing the process have the properties: (i) the Pα are mutually absolutely continuous,
428
VII. More on Stochastic Inference
(ii) the related group of measurable linear transformations {Tα , α ∈ R} defined on the algebra of bounded functions F ⊂ Lp (P ), p ≥ 1, have bound-preserving extensions to all of L1 (P ) satisfying Tα (f g) = (Tα f )(Tα g) if at least one of f, g is bounded, and α α (iii) the likelihood functions are dP dP (x) = exp{ 0 T−b ϕ(x) db}, for any measurable version of (a, x) → T−a ϕ(x) where ϕ = E C (ϕ0 ) is given by (30). Moreover, the semi-group {V (α), α ≥ 0} of Proposition 6 determines a strongly continuous unitary group on L2 (P ), given by V (α)f = dPα dP T−α f
with generator A = ( 21 ϕ − D).
Sketch of proof. By definition, the generator A of our {Tα , α ≥ 0} is densely defined on L2 (P ), and hence the range of iA−iI is all of L2 (P ). Similarly, if T˜α = T−α and ϕ˜ = −ϕ, then again A is the generator of the semi-group {T˜α , α ≥ 0} and the range of iA + iI is L2 (P ). But iA is a symmetric operator so that, by what precedes, (iA)∗ = iA, or it is self-adjoint. Hence V (α), α ∈ R given in the last part is a unitary group with generator A. With this fact, we assert that Pα ∼ P for all α. In fact let fn ∈ F , fn ↓ 0 a.e. [Pα ]. Then by (29), Tα fn ↓ 0 a.e. [P ]. But since Tα is an isometry on L2 (P ) we have: 0←
2
Ω
|fn | dP =
Ω
=
Ω
= Ω
|Tβ fn |2 dP |Vfn (β − α)Tβ fn |2 dP, by definition of Vfn (α), D(β − α)2 |Tα fn |2 dP, by Proposition 5
and the dominated convergence theorem. Thus fn ↓ 0 relative to Pβ as well, for any β(= α). Since F ⊂ L2 (P ) is dense, one can extend the result to all of L2 (P ) by means of a classical theorem due to Banach (cf., Dunford and Schwartz [1], p.332). This gives (i) and shows the crucial role played by the V (α)-family of operators. Also (ii) follows from definition of Tα on F and then its extension to all of L2 (P ) is obtained as above, with Banach’s theorem. The next step is to derive the likelihood ratio. Let α > 0, N > M > 0, and consider the truncation of ϕ: ϕN,M (x) = ϕ(x)χ[−M,N] (ϕ(x)) + N χ[ϕ(x)>N] (x) − M χ[ϕ(x)<−M ] (x). choose fn,M ∈ F such that limn→∞ fn,M = ϕN,M a.e. But Now α α T fn,M db → 0 T−b ϕN,M db in L1 (P ) and hence for a subsequence −b 0 it converges a.e. For each M choose nM such that fnM ,M → ϕN in
7.2 Likelihood ratios for families of non Gaussian measures
429
L2 (P ) as M → ∞, and VfnM ,N (α)1 − VϕN,M (α)1 1 → 0 as M → ∞, where Vf (α) is defined in (43). Let DN (α) be the limit of this Cauchy sequence. Then DN (α) = lim VfnM ,M (α)1 = VϕN (α)1 M →∞
→ Vϕ (α)1, as N → ∞, dPα . = dP Squaring the last two expressions and recalling the definition of Vϕ (α) (cf. (43)) one gets the desired likelihood ratio. It should be emphasized that ϕ plays a crucial role in all the above analysis, and Proposition 5 is a basic tool in this work. If the Xt process was Gaussian with mean 0 and a continuous 1covariance r, then m is an admissible mean for this process iff m(t) = 0 r(s, t) dF (s) for a function F of bounded variation by Theorem 3 above (cf., also Theorem 1 V.1.4), and by (7) ϕ(x) = 0 x(t) dF (t) which is a special form of (31). The infinitesimal generator then is given by Af = ( 12 ϕ − D)f . This formula has been a motivation for the above analysis and for more general diffusion processes. However, the method itself is capable of application for processes that are not necessarily related to the BM or Gaussian processes. This is the content of the next section. 7.2 Likelihood ratios for families of non Gaussian measures We now abstract the essential features of the above analysis that did not depend specifically on the properties of Gaussian distributions. The whole treatment depended basically on three objects, namely a group {Tα , α ∈ R} of linear operators acting on an algebra F of measurable functions on (Ω, Σ, P ), its generator (or differential operator) D and a function ϕ : Ω → R related to the family of probability measures {Pα , α ∈ R}, with P0 = P , under consideration. We take these as building blocks of the composite hypothesis testing (as well as the estimation) problem, and hence for the general analysis assume that they are subject to the following conditions, in order to analyze the structure while obtaining their likelihood ratios: C1 . There exists a group of linear operators {Tα , α ∈ R} on F → F such that for all f ∈ F the following L1 (P )-limit Th f − f = Df, h→0 h lim
430
VII. More on Stochastic Inference
exists, α → (DTα )f is continuous, and DTα f 1 = O(eK|α| ) for an absolute constant K > 0. C2 . There is a ϕ ∈ L1 (P ) such that Ω ϕf dP = Ω Df dP, ∀f ∈ F . The work in the preceding section shows that when the {Pα , α ∈ R} is a family derived from a diffusion process, both (C1 ), and (C2 ) hold. We give some other examples below. It should also be observed that the algebra F of functions containing constants is important in studying properties of operators on Lp (P )-spaces, since they represent “measurable subspaces” of immense use, and their closures in any of these topologies are Banach lattices. (See, e.g., Rao [12], Theorem 2.2.5 on their structure.) In fact the following analysis could be modified and carried out with F , as such a lattice (i.e., the linear set is closed under ‘max’ and ‘min’ containing constants), allowing us to get another set of conditions. But this involves some new computations, and so will not be treated here. Using certain standard arguments, one shows that the (differential) operator D as well as the group {Tα } have extensions to a larger domain Δ containing bounded functions, and that D is a closed operator in L1 (P ) for which (C2 ) again holds on this larger set Δ (the extended operators will be denoted by the same symbol). There is an analog of Proposition 6 of the preceding section in this generality. We present this and also a result corresponding to Theorem 1.7 on likelihood ratios. Analogously we define Vf (α) : F → F by the equation: Vf (α)g = exp{
0
α
T−b f db}T−α g,
f, g ∈ F ,
(1)
and obtain the following: 1. Proposition. With the above notation and conditions (C1 ) and (C2 ), one has α α α T−b f db ∈ Δ, D T−b f db = DT−b f db = f − T−α f, 0
0
0
and Vf (α)g ∈ Δ as well as D(Vf (α)g) = (f − T−α f )Vf (α)g + (Vf (α)1)D(T−α g). Moreover, Vf (α)g has the L1 (P )-derivative and satisfies ∂ Vf (α)g dP = (f − ϕ)Vf (α)g dP, ∂α Ω Ω where ϕ is as in (C2 ).
(2)
7.2 Likelihood ratios for families of non Gaussian measures
431
Proof. This is an abstract version of Proposition 1.5, and with the present assumptions it is simpler than the former, and we can sketch the proof as follows. Regarding the first statement, use suitable approximating sums of the integrals, and note that the (differential) operator D is closed. So one can interchange the limit and D to obtain the result as: N 1 α n ( T−b f db) T−α g → (f − T−α f )Vf (α)g D n! 0 n=0 + (Vf (α)1)D(T−α g), in L1 (P ). Next consider (2)in which α, h ∈ R: α exp( αα+h T−b f db) − 1 [Vf (α + h) − Vf (α)]g = exp( × T−b f db) h h 0 (T−(α+h) − T−α )g+ α+h exp( α T−b f db) − 1 T−α g+ h [T−(α+h) − T−α ]g . h Now with condition (C1 ), one shows, employing the dominated convergence criterion and suitable upper estimates, that the first two terms converge in L1 (P ) to zero, and the last term then tends to −D(T−α g). Finally, using this and the first part, one obtains (f − ϕ)Vf (α)g dP = [f Vf (α)g − D(Vf (α)g)] dP Ω Ω = [(T−α f Vf (α)g − (Vf (α)1)D(T−α g)] dP Ω ∂ = Vf (α)g dP, ∂α Ω exactly as in the previous proposition. With the key relation (2) thus established, we can proceed by truncating ϕ and choosing sequences fn ∈ F such that Vfn (α)g → V (α)g in L1 (P ), defining a semi-group {V (α), α ≥ 0} with generator A whose domain contains Δ. We state the desired result and use it for some likelihood ratio calculations. The thus obtained family {V (α), α ≥ 0} is a strongly continuous positive contractive semi-group on L1 (P ) such that V (α)(f g) = V (α)f T−α g, ∀f, g ∈ F and has a generator A with domain at least as
432
VII. More on Stochastic Inference
large as Δ, and Af = (ϕ − D)f . This may be extended to the negative line by considering −T−α , −D and −ϕ to obtain {V (−α), α ≥ 0} having the same properties so that it is a semi-group with generator −A satisfying −Af = (−ϕ + D)f . However, V (−α) need not be (V (α))−1 . Thus {V (α), α ∈ R} will not generally form a group. This is made clear by the following: 2. Example. Let Ω = (−π, π], Σ = Borel σ-algebra of Ω and dP dx (x) = p(x) = C exp{ π 2−1 }, x ∈ Ω, where C > 0 is a suitable constant. Let 2 −x (Tα f )(x) = f (x − α)χ[x≥π+α] (x) + f (2π + x − α)χ[x<π+α] (x), and F is df the set of continuously differentiable functions on Ω. Then Df = − dx and ϕ(x) = ( pp )(x). In this case one finds (V (α)f )(x) = and
(V (−α)f )(x) =
p(x − α) p(x)
p(x + α) p(x)
12 χ[x≥π−α] (x),
12 f (x + α)χ[x≤π−α] (x).
But clearly V (α)V (−α)1 = 1 and in fact they do not have inverses. Thus it is necessary to find additional conditions for {V (α), α ∈ R} to form a group. The latter property is important, in view of the following assertion. 3.Theorem. If {V (α), α ∈ R} of Proposition 1 forms a group, then all measures {Pα , α ∈ R} the V (α) are isometries on L2 (P ), the probability are mutually absolutely continuous, and Ω f dPα = Ω Tα f dP, f ∈ F . α Moreover the likelihood ratio is given by V (α)1 = dP dP , (as Tα 1 = 1). Proof. Since the V (α) form a group V (−α) = V (α)−1 , and being contractions, this implies that each V (α) is an isometry. Let fn ↓ by the preceding property T fn dP = 0 a.e., for fn ∈ F . Then α Ω V (α)(T f ) dP = V (α)1f dP , where the latter follows from a α n n Ω Ω property of the V (α). Since fn ↓ 0, by the dominated convergence theorem these integrals decrease to zero. But α (f ) = Ω Tα f dP, f ∈ ∞ F , defines a bounded linear functional on L (P ) and since α (fn ) → 0, it follows that α (f ) = Ω f dPα where Pα is a probability measure, by standard results in Real Analysis, and then this extends to all of L1 (P ). Consequently one has on using the isometry property of V (α), f dPα = Tα f dP = V (α)(Tα f ) dP Ω Ω Ω V (α)1f dP, f ∈ L1 (P ) by a property of V (α). = Ω
433
7.2 Likelihood ratios for families of non Gaussian measures
It follows from this that Pα ∼ P, α ∈ R and moreover a.e.
dPα dP
= V (α)1,
This result demands finding conditions under which the V (α) form a group. The following provides a pair of useful sufficient conditions for this. 4. Proposition. Suppose that conditions (C1 ) and (C2 ) hold for the model (Ω, Σ, Pα , α ∈ R) and let {V (α), α ∈ R} be the corresponding family of operators on F ⊂ L1 (P ). If for some numbers K > 0, ε > 0, and N0 > 0, one of the two following inequalities holds
[x:ϕ(x)≥N]
ϕ(x) dP (x); or −
ϕ(x) dP (x) [x:ϕ(x)≤−N]
≤ Ke−εN , N ≥ N0 , then {V (α), α ∈ R} is a group of isometries on L1 (P ). Sketch of Proof. Let eα (x) = (V (−α)V (α)1, α ∈ R. Since the V (α)s are positive contractions, we have 0 ≤ eα ≤ e0 = 1. To see that eα = 1 must also hold, suppose the first condition of the hypothesis is true, and let ϕN = ϕ ∧ N . Then for 0 < α < ε α V (−b)(ϕN − ϕ)VN (b)1 db 1 eα − 1 1 = lim N→∞ 0 α ≤ lim sup (ϕN − ϕ)VN (b)1 1 db, N→∞
0
since V (·) is a contraction, α eNb e−Nε db = 0, ≤ lim sup K N→∞
0
using a growth property of VN (α)1, for all α ∈ R. The second condition of the hypothesis gives, after a similar computation, the same conclusion. In a number of problems, we also need to assume additionally a restriction, suggested by (29) of Section 1 for the special family {Tα , α ∈ R}, in the general case: (C3 ) There exist probability measures Pα on Σ, α ∈ R, such that Tα f dP = f dPα , ∀f ∈ F ⊂ L∞ (P ). Ω
Ω
Note that this is the same (by the Riesz representation applied in (29)) as demanding fn ↓ 0 ⇒ Tα fn ↓ 0 a.e. so that the Pα s are σ-additive.
434
VII. More on Stochastic Inference
An interesting consequence of conditions (C1 ) − (C3 ) above is a version of the Cram´er-Rao inequality for processes and it can be presented as follows: 5. Theorem. Let the conditions (C1 ) − (C3 ) hold on the family {(Ω, Σ, Pα ), Tα , F , ϕ}. Suppose that ϕ ∈ Lq (P ), 1 < q < ∞, and also that Pα ∼ P, α ∈ R. If X : Ω → R is an estimator (i.e., a random variable) of α and if J Ω |X|p dPα dα < ∞ for some J, an interval q of R containing ‘0 , p = q−1 , then one has, with g(α) = Eα (X) the expectation of X relative to Pα whose derivative g exists: |g (α)|p Eα (|X − g(α)|p ) ≥ p . [ Ω |ϕ|q dP ] q
(3)
In particular, if p = 2 and X is an unbiased estimator so that g(α) = α, one has the usual form of the lower bound, namely: Eα (X − α)2 ≥ [
ϕ2 dP ]−1 .
(4)
Ω
Proof. Suppose first that X is bounded (so X ∈ F ), and consider X dPα = XV (α)1 dP Ω Ω α
XV (b)ϕ dP db, by Theorem 3. =
g(α) =
0
(5)
Ω
On the other hand, by the discussion following the proof of Theorem 1, one has ∂ V (α)ϕ dP = V (α)1 dP = 0. (6) ∂α Ω Ω Also (5) implies that the derivative g (α) exists for a.a. (α), and hence
g (α) =
XV (α)ϕ dP
Ω
Ω
= = Ω = Ω
(X − g(α))V (α)ϕ dP, by (6), (X − g(α))V (α)1(T−α ϕ) dP (X − g(α))(T−α ϕ) dPα .
435
7.2 Likelihood ratios for families of non Gaussian measures
But now by H¨ older’s inequality we get
|g (α)| ≤
|X − g(α)| dPα p
Ω
p1
|T−α ϕ| dPα q
Ω
1q .
(7)
By (C3 ), since 1 < q < ∞, starting with simple functions and then using dominated convergence, one finds T−α to be an isometry on Lq (P ) so that |T−α ϕ|q dPα = |ϕ|q dP. Ω
Ω
Substituting this in (7) and dividing appropriately one gets (3) in this case. Suppose next that X is not necessarily bounded, and consider its truncation XN = Xχ[|X|≤N] . Then XN dPα = lim XN V (b)ϕ dP db N 0 Ω α Tb XN ϕ dP db = lim N 0 Ω α = Tb Xϕ dP db.
g(α) = lim N
Ωα
0
Ω
(α) = g (α) a.e. on J, so that But limN gN |g (α)|p |X − g(α)|p dPα ≥ lim N p N [ |ϕ|q dP ] q Ω Ω
|g (α)|p = p , [ Ω |ϕ|q dP ] q
establishing (3) in general. If p = 2 and g(α) = α, then (3) clearly reduces to (4). The result admits an extension to convex loss functions as in Section III.3, but it will be omitted. Here we consider some applications of the above work to show how the general theory may be implemented which usually need further nontrivial computations. Since ϕ plays an essential part in all the above work, we start with its construction for a class of processes. The method used has some novelties. Thus, let {Xt , t ∈ I} be a stochastic process which is canonically represented on (Ω, Σ, P ), i.e., Ω = RI etc. Let J = {tn ∈ I} be a sequence of points such that {Xtn , tn ∈ J} is dense in Lp (P ), 1 < p < ∞,
436
VII. More on Stochastic Inference
and assume that the finite dimensional distributions of Xt1 , . . . , Xtn are absolutely continuous with densities pn (·, . . . , ·). Consider the algebra of cylindrical functions F of the form f = h ◦ πn where h is a bounded Borel function, and πn : Ω → Rn is the coordinate projection. If (τα x)(t) = x(t) + αm(t) where x, m ∈ Ω, define the family of operators Tα : F → Lp (P ) by the equation (Tα f )(x) = h ◦ πn (x + αm)(= h(x(t1 ) + αm(t1 ), · · · , x(tn ) + αm(tn ))) so that (Tα f )(x) = (f ◦ τα )x. We now define ϕn and show that its limit gives ϕ. For this one has to assume that (B1 ) qnj (u1 , · · · , un ) = (
1 ∂pn )(u1 , · · · , un ), pn ∂uj
1 ≤ j ≤ n, n ≥ 1,
exists and qnj ∈ Lp (pn du) [du = du1 · · · dun ]. n Define ϕn = − j=1 m(tj )qnj (x(t1 ), · · · , x(tn )). Then one verifies easily that Ω ϕn f dP = Ω Df dP, ∀f ∈ F . Also, ϕn is measurable relative to Bn = σ(Xt1 , · · · , Xtn ) and E Bn (ϕn+1 ) = ϕn a.e. [P ]. Thus {ϕn , Bn , n ≥ 1} is a martingale in Lp (P ). Suppose moreover that (B2 ) sup |ϕn |p dP ≤ K < ∞; n
Ω
theso {ϕn , n ≥ 1} is uniformly integrable. Then, the martingale p ory implies that ϕn → ϕ a.e. and in L (P ), so that Ω ϕf dP = Df dP, f ∈ F . These conditions are satisfied if the X-process is Ω Gaussian, but can hold for many others. With this ϕ at hand, one can apply the preceding theory, construct likelihood ratios, or obtain a Cram´er-Rao inequality to estimators X = α ˆ and the like. Again consider {Tα , α ∈ R}, a group of automorphisms of F , the latter being an algebra of bounded measurable functions in L1 (P ) on (Ω, Σ, P ). Suppose that {Pα , α ∈ R}(P0 = P ) is a family of probability measures for which conditions (C1 ) − (C3 ) above hold, and that 1 α Pα ∼ P, α ∈ R. If V (α)f = dP dP T−α f , it can be extended to L (P ) preα = dP (x), serving all the properties. Since Tα 1 = 1, we get (V (α)1)(x) dP 1 and it is L (dα)-continuous for a.e. (α). If q(x) = R (V (α)1)(x) dα, then the Pα -family is termed dissipative whenever q(x) < ∞ a.a.(x), conservative when q(x) = ∞ a.a.(x), and mixed otherwise. Simple examples of these concepts are as follows. If X is a trivial one element process with density p(t)(> 0), and Xa = X + a is the translate, (i.e., Ta X = X + a) then its density pa (·) is given by pa (x) = p(x − a), and dP p(t−a) 1 a a (t) = and q(t) = (t) da = p(t) , so that {Xa , a ∈ R} so dP dP p(a) R dP is dissipative. On the other hand, if {Xt , t ∈ R} is stationary, and a (Ta X)(t) = X(t + a), then dP (t) = 1, and it is conservative. A mixdP ture system is similarly obtained by taking a convex combination of the above two. The first example above admits the following generalization:
7.3 Extension to two parameter families of measures
437
6. Proposition. Let X = {Xt , t ∈ I} be a process on the canonical space (Ω, Σ, P ), and (Tα x)(t) = x(t) + αm(t) for some measurable m ∈ Ω = RI . Suppose conditions (C1 ) − (C3 ) hold for {Tα , α ∈ R} and Pα ∼ P . Then, X is dissipative. Proof. Let t0 be a point such that m(t0 ) = 0. Since Pα ∼ P , we get P [Xt0 = 0] = 1. Then with q(·) defined as above, we have for any −∞ < a < b < ∞: χ[x:a≤x(t0 ) 0 a.a.(x), and q(x)g(x(t0 )) dP (x) = dP (x) g(x(t0 + α)) dα Ω Ω R G(x) dP (x) ∈ R. = Ω
Hence, |q(x)| < ∞ a.e. Many other examples may be constructed, but the flavor of the subject is clear from this work. In the next section we show how these ideas can be extended to multiparameter (or more general composite hypotheses) cases. 7.3 Extension to two parameter families of measures The work in the above two sections is strongly influenced by various properties of semi-groups of operators. This naturally suggests that these ideas may be generalized to n-parameter semi-groups as found, for instance, in Hille and Phillips ([1], p.534 and Chapter 25). Here,
438
VII. More on Stochastic Inference
we only consider the two parameter case. But we can study a ‘nonlinear type semi-group’ of linear operators, now called evolution operator families. This indicates different directions along which the theory can progress using some extensions of the previous technical tools. Our treatment is based on a thesis due to Velman [1] which generalizes the basic theme of Pitcher’s presented in the preceding two sections. We also indicate a crucial application of this extension. The starting point is the problem of testing hypotheses, for a family {Pα , α ∈ R} on (Ω, Σ, P ), that H0 : α ∈ I1 , vs H1 : α ∈ I2 where I1 ∩ I2 = ∅, Ij ⊂ R are intervals. For this we first have to find conditions such that Pα , α ∈ I1 and Pβ , β ∈ I2 are mutually absodP lutely continuous and then find their likelihood ratio dPαβ to solve the composite (distinguishable) test problem. Typically Pα = P |Σα with Σα ⊂ Σ determined by the process which depends on the hypothesis of α ∈ Ij . Even though P0 = P may be taken, Pα ∼ P, α ∈ I1 , and Pβ ∼ P, β ∈ Iβ , do not imply Pβ ∼ Pα because there need be no inclusion relationship between Σα and Σβ , and Pα ∼ Pβ on Σαβ = σ(Σα ∪ Σβ ) is not necessarily true. So a serious difficulty arises, and an extended study is needed for a satisfactory solution of the probdP c dP c lem since dPαβ and dPαβ will not provide a complete answer, unlike in the case where H0 or H1 is simple. As already indicated, we proceed to solve the above problem by extending the ideas of the preceding section as follows. Let (Ω, Σ, P ) be a canonically represented probability space and F0 be the algebra of bounded cylindrical functions from L1 (P ) and for each α ∈ R, let Fα ⊂ F0 be a subalgebra containing constants. We already observed (cf. just prior to Proposition 2.1 above), the closures (under any of the Lp -norms) of Fα and F are Banach lattices, and in fact are measurable subspaces so that, if F¯αp , F¯0p are Lp -closures, 1 ≤ p ≤ ∞, then these are precisely Lp (Σα , Pα ) and Lp (Σ0 , P ) where Σα is a σsubalgebra of Σ and Pα = P |Σα . We now associate a family of operators T (α, β) : Fβ → Fα and, what is decisive, these are assumed to be positivity preserving isometric isomorphisms that satisfy the evolution (or Chapman-Kolmogorov type) identity, i.e., for any α ≤ β ≤ γ in R, T (α, β)T (β, γ)f = T (α, γ)f,
f ∈ Fγ .
(1)
Thus, T (α, α) = id. From this point on, we assume various smoothness conditions, analogous to those of Section 2, so that the analysis can be extended. These are the following four and no further motivation will be added: (A1 )
T (α + h, α)f − f = Dα f, h→0 h lim
f ∈ Fα , (2)
7.3 Extension to two parameter families of measures
439
exists in L1 (P ) and α → Dα T (α, β)f is continuous in L1 (P ) for each f ∈ Fβ . (A2 ) For each α, the adjoint Dα∗ exists with domain F0 , a subalgebra of bounded functions in L1 (P ), and range L1 (P ), so that f (Dα g) dP = (Dα∗ f )g dP, f ∈ F0 , g ∈ Fα . Ω
Ω
We set ψα = E Σα (Dα∗ 1), so that ψα f dP = (Dα∗ 1)f dP = Dα f dP, f ∈ Fα , Ω
Ω
(3)
Ω
and ψα is Σα -measurable. (A3 ) There is a nondegenerate compact interval I ⊂ R such that α → ψα is L1 (P )-continuous for α ∈ I. Finally, we need to impose some growth conditions on the ψα family: (A4 ) There exist compact subintervals I+ , I− ⊂ I, positive constants C, ε, N0 such that for N ≥ N0 and α ∈ I− or β ∈ I+ : (4-) { χ[x:ψα (x)≤−N] (x)ψα (x) dP (x) ≤ Ce−εN , Ω
or
Ω
χ[x:ψβ (x)≥N] (x)ψβ (x) dP (x)} ≤ Ce−εN .
(4+)
These conditions are the analogs of (C1 ) − (C3 ) of Section 2, and are sufficient to carry forward the corresponding work with a careful analysis of the related estimates. We indicate the main results and then present an example where these assumptions are naturally satisfied. The Fα and their completions in Lp (Σα ) have the following (nontrivial) approximation property often employed in this analysis: 1. Lemma. Let {fα , L1 (Σα ), α ∈ J} be an adapted family, of elements with the index set J as a compact interval, in the sense that fα ∈ L1 (Σα ) and α → fα is L1 (P )-continuous. Then for each ε > 0, there exists a uniformly bounded adapted family {gα , Fα , α ∈ J} such that (i) (α, β) → T (α, β)gβ is L1 (P )-continuous on R × J, (ii) β → Dα T (α, β)gβ is L1 (P )-continuous on J, and (iii) fα − gα 1 < ε, α ∈ J. The approximation contained in this assertion is stronger than the usual results given in Real Analysis for integrable functions. It is established with the following procedure. First, each fα ∈ L1 (Σα ) is approximated by a gα ∈ Fα within ε-distance (by the standard analysis). Next, it is shown that each gα of the type desired in the lemma can
440
VII. More on Stochastic Inference
be approximated by a function of the form T (α, β)f in a neighborhood of β ∈ J. Then with the compactness of J, this is extended to J by finding a suitable finite covering of J. The final result is obtained by a convex combination of these approximants. We omit the actual detail which is tedious, but essentially standard. Based on this lemma, it is possible to replace any such adapted set {fα , L1 (Σα ), α ∈ J} by another collection {gα , Fα , α ∈ J} with better boundedness properties. We call such a family a standard modification indexed by the (same) compact interval J. One can then obtain systematically (but nontrivially) the desired conditions on the T (α, β)family to find a new collection, denoted by W (α, β) and termed the W (α, β)-family, corresponding to the V (α)-collection of Section 2. Thus, for each α, Dα admits an extension with domain Δα ⊃ Fα , such that, using the duality pairing notation (·, ·), one finds (Dα f, g) = (f, Dα∗ g), f ∈ Δα , g ∈ F , and f, g ∈ Δα ⇒ f g ∈ Δα , Dα (f g) = (Dα f )g + f (Dα g); as a differential operator, Dα is closed in the sense that fn ∈ Δα , fn → f , boundedly a.e. and Dα fn → g in L1 (P ) ⇒ f ∈ Δα and Dα f = g, as well as Dα f n = nf n−1 Dα f . [Dα∗ is the dual of Dα .] With these properties, we can now introduce 2. Definition. Let {fα , Fα , α ∈ J}, J compact, be a family of functions in standard modification. Then for α ≥ β, α, β ∈ J, we set for each f ∈ Fα : α T (α, b)fb db}T (α, β)g, f ∈ Fβ ∩ Fα , (5) Wf (α, β)g = exp{− β
where the T (α, β) are the f is a generic notation for Some properties of the (5) can now be described.
operators given by (1). [Here the subscript the standard modification used.] operators Wf (α, β) : Fβ → Fα defined by Its domain contains Δβ (⊃ Fβ ), and
Wf (α, β)g = Wf (α, β)1 T (α, β)g, g ∈ Fβ . Using computations similar to those of the last section, one can verify: (i)
Dα Wf (α, β)g = Wf (α, β)1(Dα T (α, β)g) α T (α, b)fb db Wf (α, β)g. − Dα β
(ii) For each g ∈ Fβ ∂ Wf (α, β)g = (Dα − fα )Wf (α, β)g, ∂α
7.3 Extension to two parameter families of measures
441
∂ Wf (α, β)1 = Wf (α, β)fβ , ∂β and the above together with ψα of (3) imply: ∂ Wf (α, β)g dP = (ψα − fα )Wf (α, β)g dP. ∂α Ω Ω
(6)
With these properties, one extends the Wf (α, β)-operators to those forming a class of positive contractive evolution family that does not depend on the (standard) modifications, denoted by the ‘f ’s. As in the case of V (α)-operators, we proceed to employ a similar method (although needing more involved computations) to obtain a W (α, β)-family. Thus let fn = {fnα , α ∈ J}, n = 1, 2, . . . , be a sequence of standard modifications on the same index J, all fnα ≥ M > −∞ such that fnα − (ψα ∨ (−N )) 1 → 0 uniformly in α as n → ∞. Then {Wfn (α, β)g, n ≥ 1} converges in L1 (P ) to a function WN (α, β)g uniformly in α, β ∈ J, for each g ∈ Fβ , the limit being independent of the sequence used. Here the set {ψα , α ∈ R} is the same as that given in (A2 ), i.e., ψα = E Σα (Dα∗ 1) which depends only on the Fα -family. The thus obtained WN (α, β) have the following desirable properties, namely: WN (α, α) = id., on L1 (Σα ), WN (α, β) is a positive contraction and satisfies WN (α, β)f g = WN (α, β)f T (α, β)g,
f ∈ L1 (Σβ ), g ∈ L∞ (Σα ).
Moreover, for α ≥ δ ≥ β, they satisfy the ‘anti-evolution’ identity, i.e., WN (α, β)f = WN (α, δ)WN (δ, β)f, and γ → WN (α, γ)hγ is continuous in L1 (Σγ )-norm for any family of strongly continuous standard modifications {hr , r ∈ J} as well as WN (·, β)f, f ∈ L1 (Σβ ), being continuous in mean. These properties are used to show that the WN (α, β)-family converges strongly to W (α, β) as N → ∞, and this gives the desired collection of operators for which the following relations hold. First, W (α, β) : L1 (Σβ ) → L1 (Σα ), and the set {W (α, β), α ≥ β ∈ J} is a positive contractive antievolution family of linear operators such that for all f ∈ L1 (Σβ ), g ∈ L∞ (Σβ ) we have: W (α, β)f g = W (α, β)f T (α, β)g.
(7)
Establishing these relations is not trivial, although it is an extension of the work of Section 2. In fact Velman [1] spends almost half of his paper to verifying these properties. It should be noted, however, that,
442
VII. More on Stochastic Inference
in spite of the lengthy computations needed, the work does not depend on other mathematical techniques or tools. In the above discussion, we approximated ψα ∨(−N ) by certain families of standard modifications, and by taking limits got {WN (α, β), α ≥ β ∈ J}. In exactly the same way, one can approximate ψα ∧ N , and ˜ N (α, β), α ≤ β ∈ J}, and this obtain the corresponding operators {W is also a positive contractive set satisfying again the evolution identity. Then letting N → ∞, one gets analogously a family of opera˜ (α, β), α ≤ β ∈ J}. In case W (α, β)W ˜ (α, β) = id., then the tors {W whole family is one of isometries (since each is a contraction), and one can drop the ‘tilde’ from the latter. Moreover it is true that W (α, β) = (W (β, α)−1 . This is similar to {V (α), α ∈ R} being a group, in the last section. As in that problem, the isometry property does not always hold, and one has to restrict the family for this to be true. Note that so far we only needed (A1 ) − (A3 ) in the above work. However, using condition (A4 ) we can also conclude this last property as in the earlier case. We state this result precisely for reference as: 3. Theorem. If conditions (A1 )−(A4 ) hold, then the family of positive operators {W (α, β), α, β ∈ J} constructed above forms an isometric isomorphism on L1 (Σβ ) onto L1 (Σα ) for each α, β ∈ J, and in fact for α ≤ β, W (α, β)W (β, α) = id., α, β ∈ J each having a unit norm. It is still necessary to construct measures (depending both on α and β) that represent T (α, β). This implies constructing a set isomorphism Q(α, β) from Σβ onto Σα involving a new step. The idea here is quite similar to the representation of (strictly) stationary processes (cf., e.g., Doob [2], Chapter XI, Section 1). Once this is obtained, then the image measure of P under Q(α, β) will produce the desired probability family, namely, Pαβ (·) = P ◦ Q(α, β)−1 (·). It then follows that Pαβ ∼ Pαα , leading to the likelihood ratios, fulfilling our ultimate target. Thus we now present this construction, explaining briefly how this is rigorously established. Later, an elaboration of how several mean continuous second order processes admit such representations will be explained to complete this material. Here are the details. Recall, from Real Analysis, that a mapping T : Ω → Ω is a measurable (point) transformation (relative to a σ-algebra Σ of Ω) if T −1 (Σ) ⊂ Σ, and we define (T f )(ω) = f (T −1 {ω}), ω ∈ Ω, T −1 being a (set) transformation on the power set P of Ω into itself. If (Ω, Σ, P ) is a measure space, and (more generally) if τ : Σ → Σ is a transformation, it is termed measurable and measure preserving when (i) τ (Σ) ⊂ Σ, (ii) P (τ (A)) = P (A), and τ (Σ) is also a σ-algebra. Then τ is sometimes termed a σ-homomorphism. [If τ = T −1 , it preserves arbitrary set operations, and in the general case one assumes this for just countable
443
7.3 Extension to two parameter families of measures
operations.] A set A ∈ Σ is τ -invariant if τ (A) and A differ by a P -null set and so has the same measure. E.g., ∅ and Ω are always τ -invariant as well as every set that differs from these by P -null sets. If the only τ -invariant sets are just these two trivial sets [so τ really ‘mixes up’ all points of Ω], then τ is called metrically-transitive or ergodic. With every set transformation τ one can associate a (point) transformation T on functions on the space Ω, but not necessarily conversely. [See, e.g., Rao [17], Prop. 10.1.2 on p. 503.] This T may be explicitly constructed simply as follows. Let r be a rational number, and consider Ar = τ {ω : f (ω) ≤ r} so that Ar ∈ Σ for measurable f : Ω → R. Clearly Ar and limr→∞ Ar = Ω, ∩r Ar = ∅. Define T by setting (T f )(ω) = inf{r : ω ∈ Ar } where, as usual, inf{∅} = ∞. Then T f is well defined, and is measurable since [(T f )(ω) < a] = ∪ {ω : f (ω) ≤ r} = ∪ Ar , a ∈ R. r
r
(8)
The right side is at most a countable union and so is in Σ. Observe that if τ is an isomorphism of Ω into itself, then the above reduces to the familiar (Tf)(ω) = f(τ⁻¹{ω}). In case τ takes equivalence classes of measurable sets into equivalence classes, then T takes equivalence classes of measurable functions into such equivalence classes under a given probability measure P on Σ. Actually, for the set mapping τ it is always possible to choose a member of the equivalence class such that all set operations are preserved. This is done by the well-known lifting operation. (The existence [and properties] of such an operation is a key theorem of Real Analysis, and we need not dwell on it here. See, e.g., Rao [17], Chapter 8 for an elementary treatment, and Ionescu Tulcea and Ionescu Tulcea [1] for a general theory of the subject.) As an example, if τ is a metrically transitive (or ergodic) transformation on Σ, and T is the corresponding mapping defined above, then X_n = TⁿX₀, n ∈ Z, defines a strictly stationary sequence. In the continuous parameter case, we consider a translation group of measure preserving isomorphisms {τ_t, t ∈ R} and let {T_t, t ∈ R} be the associated family, constructed above, operating on measurable functions. Then the latter is also a group and {X_t = T_tX₀, t ∈ R} becomes a strictly stationary process, where (T_tX₀)(ω) = X₀(τ_t{ω}), so that in both cases the finite dimensional distributions of the process (or sequence) under time shift remain the same. On the other hand, if we start with a group of isometric translations {U_t, t ∈ R} and an X₀ ∈ L²(P), then {X_t = U_tX₀, t ∈ R} is a weakly (or Khintchine) stationary process, and the U_t's are unitary in L²(P). Several properties of such families are of great interest in probability theory, and, for instance, Doob ([2], Chapters X–XI) details many of these classes. If the U_t-family is a contractive weakly continuous positive definite set, then X_t is weakly
harmonizable (cf. Rao [13], p. 330). Here, taking these as motivation and background facts, we consider evolution families determined by a class of measurable set isomorphisms Q(α,β) that are of primary interest in our study. In view of the existence of a lifting operation, it is enough if all the following statements on Q(α,β) hold a.e.

4. Definition. Let {Σ_α, α ∈ R} be a family of σ-subalgebras of a complete probability space (Ω, Σ, P) and for α ≤ β suppose Q(β,α) : Σ_α → Σ_β is a set isomorphism preserving countable operations (or is a σ-isomorphism). [Then Q(β,α) is also uniformly continuous between the complete (or Fréchet) metric spaces (Σ_α, ρ) → (Σ_β, ρ), where ρ(A,B) = P(AΔB), the measure of the symmetric difference.] Let Q′(β,α) be the corresponding mappings on the measurable function spaces L⁰(Σ_α) → L⁰(Σ_β), constructed as in (8), so that the Q′-family is uniquely defined by the Q-family.

The following properties of Q′, inherited from Q, are easily inferred. (i) If f_n ∈ L⁰(Σ_α) and f_n → f a.e., then Q′(β,α)f_n ∈ L⁰(Σ_β) and Q′(β,α)f_n → Q′(β,α)f a.e.; Q′ is linear, and Q′(β,α)χ_A = χ_{Q(β,α)A}. (ii) ‖Q′(β,α)f‖_∞ = ‖f‖_∞, Q′(β,α)(fg) = (Q′(β,α)f)(Q′(β,α)g), and the Q′'s are also positive operators.

We constructed operators W(α,β) which correspond to Q′(α,β), and need to find the σ-isomorphic mappings Q(α,β), a result in the converse direction. Their existence is obtained from the following result.

5. Theorem. Let W(α,β) : L¹(Σ_β) → L¹(Σ_α) be a positive isometric transformation, so that ‖W(α,β)f‖₁ = ‖f‖₁, f ∈ L¹(Σ_β). Then there exist a set isomorphism Q(α,β) : Σ_β → Σ_α and a measurable function h(α,β) : Ω → R⁺ such that, with Q′ as the operators induced by Q, we have

(W(α,β)f)(x) = h(α,β)(x)(Q′(α,β)f)(x).  (9)

Moreover, if P_{αα} = P|Σ_α and P_{αβ} = P_{αα} ∘ Q⁻¹(α,β), α ≤ β, then

h(α,β)(x) = (dP_{αβ}/dP_{αα})(x) = (W(α,β)1)(x),  for a.a. x ∈ Ω.  (10)
Proof. Let A, B ∈ Σ_β be disjoint and define Q(α,β)(A) = {x : W(α,β)χ_A(x) ≠ 0}. Now writing μ = P_{αα}, we get

‖χ_A + χ_B‖₁ + ‖χ_A − χ_B‖₁ = μ(A) + μ(B) + μ(AΔB) = 2μ(A) + 2μ(B),  A ∩ B = ∅,
  = 2(‖χ_A‖₁ + ‖χ_B‖₁).

By the L¹(P)-isometry of W(α,β), this implies:

‖W(α,β)(χ_A + χ_B)‖₁ + ‖W(α,β)(χ_A − χ_B)‖₁ = 2[‖W(α,β)χ_A‖₁ + ‖W(α,β)χ_B‖₁].  (11)

Since W(α,β) is positive, f = W(α,β)χ_A ≥ 0, g = W(α,β)χ_B ≥ 0, a.e. Hence if C = [f ≥ g ≥ 0], D = [f < g], then C ∪ D = Ω and (11) implies, on writing it out and cancelling terms, that

∫_C g dμ + ∫_D f dμ = 0,  (12)
so that gχ_C = 0 = fχ_D a.e., whence fg = 0 a.e. So W(α,β)χ_A, W(α,β)χ_B have a.e. disjoint supports. Therefore Q(α,β)(A ∪ B) = Q(α,β)(A) + Q(α,β)(B), a measurable a.e. disjoint union. By the L¹-continuity of W this extends, showing that Q(α,β)(∪_{n=1}^∞ A_n) = ∪_{n=1}^∞ Q(α,β)(A_n), an a.e. disjoint union, and moreover,

Q(α,β)(Ω) = Q(α,β)[(Ω − A) ∪ A] = Q(α,β)(Ω − A) ∪ Q(α,β)(A),  (13)

so that we get Q(α,β)(Ω − A) = Q(α,β)(Ω) − Q(α,β)(A), from the definition. It follows that μ(Q(α,β)(A)) = 0 ⇒ W(α,β)χ_A = 0 a.e., and since W is positive, μ(A) = 0. Thus Q(α,β) : Σ_β → Σ_α is a set isomorphism of the type asserted. If we let h(α,β) = W(α,β)1 (≥ 0), a measurable function, then from the above properties h(α,β) = W(α,β)χ_A + W(α,β)χ_{Aᶜ} ⇒ h(α,β)(x) = W(α,β)χ_A(x) for a.a. x ∈ supp(W(α,β)χ_A), since the terms on the right side have a.e. disjoint supports. It follows that

(W(α,β)χ_A)(x) = h(α,β)(x)χ_{Q(α,β)A}(x), a.e.,  (14)

and considering linear combinations and recalling that Q′(α,β)χ_A = χ_{Q(α,β)A}, we conclude that W(α,β)f = h(α,β)Q′(α,β)f a.e., f ∈ L¹(Σ_β). Also μ(A) = 0 ⇒ μ(Q(α,β)⁻¹(A)) = 0, so that P_{αβ} ≪ P_{αα}. Hence for A ∈ Q(α,β)(Σ_β) we have

μ(A) = P_{αα}(A) = P_{αβ}(Q(α,β)(A)) = ∫_{Q(α,β)(A)} (dP_{αβ}/dP_{αα}) dP_{αα}.  (15)
But then

μ(A) = ‖χ_A‖₁ = ‖W(α,β)χ_A‖₁ = ∫_Ω h(α,β)χ_{Q(α,β)A} dP_{αα}.  (16)

Since h(α,β) and dP_{αβ}/dP_{αα} are both Σ_α-measurable, and agree by (15) and (16) on the σ-algebra Q(α,β)(Σ_β) ⊂ Σ_α, it follows that they agree a.e. Hence both (9) and (10) hold.
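The content of Theorem 5 is easy to visualize on a finite probability space, where a positive L¹(μ)-isometry factors as a density multiplying a point permutation. The following small numerical sketch (in Python with NumPy; the space size, measure and permutation are arbitrary illustrative choices, not taken from the text) constructs such a W and checks the isometry together with the factorization (9) and h = W1 of (10):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    mu = rng.random(n) + 0.1; mu /= mu.sum()     # the measure P_aa on a finite Omega
    perm = rng.permutation(n)                    # plays the role of Q(alpha,beta)

    # W f = h * (f o perm^{-1}); the isometry ||W f||_1 = ||f||_1 forces
    # h(perm(y)) * mu(perm(y)) = mu(y), i.e. h is the likelihood ratio in (10).
    h = np.empty(n)
    h[perm] = mu / mu[perm]

    def W(f):
        Qf = np.empty(n); Qf[perm] = f           # (Q'f)(x) = f(perm^{-1}(x))
        return h * Qf                            # the factorization (9)

    f = rng.normal(size=n)
    assert np.isclose((np.abs(W(f)) * mu).sum(), (np.abs(f) * mu).sum())
    assert np.allclose(W(np.ones(n)), h)         # h = W 1, as in (10)
    print("isometry and factorization verified")

Here the permutation plays the role of the set isomorphism Q(α,β), and the density h is forced by the isometry, exactly as in the proof above.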
Remark. Equation (11) for Lᵖ(μ), 0 < p ≠ 2 < ∞, gives the same conclusion more inclusively (cf. Royden [1], p. 168). Its use for all isometries of Lᵖ-spaces was first studied by Lamperti [1], by Goldstein [2] for Orlicz spaces, and more generally by Fleming, Goldstein, and Jamison [1], where other references can be found. The L¹(P)-case is simple, as seen in the above argument.

The following useful consequence is recorded for ready reference, as it deals with the connections between W(α,β), T(α,β) and Q(α,β).

6. Proposition. (a) Using the preceding notation and assumptions we have:

∫_Ω T(β,α)f dμ = ∫_Ω f dP_{αβ} = ∫_Ω E^{Σ_α}(T(β,α)f) dP_{αα},  (17)

and dP_{αβ}/dP_{αα} = W(α,β)1. The dual operator W(α,β)* of W(α,β) is an extension of T(β,α), so that W(α,β)*f = T(β,α)f, f ∈ F_α.
(b) If e(α,β) = W(α,β)W(β,α)1, and setting e(β,α) = 1, then for all Σ_β-measurable f,

e(α,β)T(α,β)f = Q′(α,β)f,  (18)

and for all f ∈ L^∞(Σ_β), one has

W(β,α)*f = Q′(α,β)f.  (19)
Proof. Since both T(β,α) and W(α,β) are positive and the latter is an isometry, so that ‖W(α,β)h‖₁ = ‖h‖₁, take h = T(β,α)f (f = f⁺ − f⁻ ∈ F_α, by considering f± and subtracting) to get

∫_Ω T(β,α)f dμ = ∫_Ω W(α,β)(T(β,α)f) dμ
  = ∫_Ω (W(α,β)1) f dμ, see below,  (20)
  = ∫_Ω (dP_{αβ}/dP_{αα}) f dP_{αα}
  = ∫_Ω f dP_{αβ}, by Theorem 5, with P_{αα} = μ.
In equation (20) we used W(α,β)(f₁f₂) = (W(α,β)f₁)T(α,β)f₂, by (7), and then let f₁ = 1 and f₂ = T(β,α)f, so that W(α,β)1·(T(β,α)f) = W(α,β)1·T(α,β)T(β,α)f = W(α,β)1·T(α,α)f, and T(α,α) = id. This proves (17), and the next statement is then clear since W(α,β)1 is Σ_α-measurable. The assertion about W(α,β)*f follows from the isometry of W together with (20), as (for g ∈ L¹(Σ_β), f ∈ F_α):

∫_Ω g T(β,α)f dμ = ∫_Ω W(α,β)(g T(β,α)f) dμ
  = ∫_Ω (W(α,β)g) f dμ
  = ∫_Ω g (W(α,β)*f) dμ,
and g being arbitrary in L¹(Σ_β), the integrands can be identified a.e. Finally, if e(β,α) = 1, then for each f in the range of W(α,β), i.e., f = W(α,β)g, g ∈ L¹(Σ_α), we have

∫_Ω e(α,β)f dμ = ∫_Ω e(α,β)W(α,β)g dμ
  = ∫_Ω W(α,β)(e(β,α)g) dμ
  = ∫_Ω W(α,β)(1·g) dμ, by hypothesis,
  = ∫_Ω g dμ, since W is an L¹(μ)-isometry.
This shows that on the range of W(α,β), e(α,β) = 1 as well. From this, (18) is deduced thus. With the relations (9) and (10):

(W(α,β)1)Q′(α,β)f = W(α,β)(1·f) = (W(α,β)1)T(α,β)f
  = (W(α,β)1)e(α,β)T(α,β)f, since e(α,β) = 1 on supp(W(α,β)1).  (21)

Since both factors vanish off the support of W(α,β)1, they must agree a.e., because of (21), giving (18). Finally, for (19), let f ∈ L^∞(Σ_β), g ∈ L¹(Σ_α). Then
∫_Ω (W(β,α)*f)g dμ = ∫_Ω (W(β,α)g)f dμ
  = ∫_Ω W(α,β)((W(β,α)g)f) dμ, by isometry,
  = ∫_Ω (W(α,β)W(β,α)g)Q′(α,β)f dμ, by the product relation for W,
  = ∫_Ω e(α,β)gQ′(α,β)f dμ = ∫_Ω gQ′(α,β)f dμ, by (18).
Since g is arbitrary, this establishes the result (19).

With these structural results, it is possible to obtain a general statement about the desired likelihood ratios in the following:

7. Theorem. Suppose conditions (A₁)–(A₄) hold, so that W(α,β) and W(β,α) are both isometries on J, a compact nondegenerate interval. Then for each pair α, β ∈ J, the measures P_{αβ} ∼ P_{αα}, and

dP_{αβ}/dP_{αα} = W(α,β)1;  dP_{αα}/dP_{αβ} = Q′(α,β)W(β,α)1,  (22)

(Q, Q′ as in Def. 4). Moreover, T(α,β)f = Q′(α,β)f for f ∈ F_β, and

(∂/∂β)W(α,β)1 = (W(α,β)1)Q′(α,β)ψ_β,  α, β ∈ J,  (23)
where ψ_β is as in Condition (A₂).

Proof. Because of the strengthened hypothesis, both W(α,β) and W(β,α) are isometries, and by Proposition 6, P_{αβ} ≪ P_{αα}, which implies the first half of (22). For the second half, let P_{αβ}(A) = 0, A ∈ Σ_α. Then, writing μ for P_{αα},

0 = P_{αβ}(A) = ∫_Ω χ_A W(α,β)1 dμ
  = ∫_Ω W(α,β)*χ_A dμ
  = ∫_Ω Q′(β,α)χ_A dμ, by (19).  (24)

Hence Q′(β,α)χ_A = 0 a.e. [μ]. Also, for the same A,

μ(A) = ∫_Ω W(β,α)χ_A dμ, by the isometry,
  = ∫_Ω (W(β,α)1)Q′(β,α)χ_A dμ = 0,

as a consequence of (24). Hence P_{αα} ≪ P_{αβ}, so that P_{αβ} ∼ P_{αα}. On the other hand, e(α,β) = W(α,β)W(β,α)1 = 1 and

(W(α,β)1)Q′(α,β)(W(β,α)1) = [(W(α,β)1)e(α,β)T(α,β)](W(β,α)1), by (18),
  = W(α,β)(W(β,α)1), as in the first and last lines of (21),
  = e(α,β)1 = 1.

Hence Q′(α,β)W(β,α)1 = (W(α,β)1)⁻¹ = dP_{αα}/dP_{αβ}, and since P_{αα} ∼ P_{αβ}, the second half of (22) is also true. Finally, (23) may be established using Theorem 3, and is left to the reader.
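As an elementary illustration of Theorems 3 and 7, one may realize the W-family on a finite Ω by multiplication with likelihood ratios of a parametrized family of equivalent measures; the evolution identity, the isometry, and the formula dP_{αβ}/dP_{αα} = W(α,β)1 can then all be verified directly. A sketch (the Gaussian-shaped weights below are only an illustrative choice):

    import numpy as np

    x = np.linspace(-2.0, 2.0, 8)

    def P(a):                                   # a family of equivalent measures
        w = np.exp(-(x - a) ** 2)
        return w / w.sum()

    def W(a, b):                                # W(a,b): L1(P_b) -> L1(P_a)
        return lambda f: (P(b) / P(a)) * f      # multiplication by dP_b/dP_a

    f = np.random.default_rng(1).normal(size=x.size)
    a, b, c = 0.0, 0.5, 1.0
    assert np.allclose(W(a, b)(W(b, c)(f)), W(a, c)(f))   # evolution identity
    assert np.allclose(W(a, b)(W(b, a)(f)), f)            # W(a,b)W(b,a) = id
    assert np.isclose((np.abs(W(a, b)(f)) * P(a)).sum(), (np.abs(f) * P(b)).sum())
    assert np.allclose(W(a, b)(np.ones(x.size)), P(b) / P(a))  # W(a,b)1 of (22)
    print("evolution family verified")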
The above operator analysis of processes, or of their underlying probability measures on function spaces, applies also to a large class of mean continuous second order families, even on their sample paths. This was already indicated prior to Definition 4 for stationary processes. For instance, if X_t = U_tX₀ where X₀ ∈ L²(P) and {U_t, t ∈ R} is a strongly (= weakly here) continuous unitary group of operators, then employing its spectral decomposition (Stone's theorem), one obtains an integral representation as:

X_t = ∫_R e^{itλ} dE_λX₀ = ∫_R e^{itλ} dZ(λ),  (25)

where {E_λ, λ ∈ R} is the classical resolution of the identity, and the above symbol denotes a spectral integral which may be identified as a stochastic integral, since {Z(λ) = E_λX₀, λ ∈ R} is easily seen to be a process with orthogonal increments and hence satisfies Bochner's boundedness principle. It follows that

r(s,t) = ∫_Ω X_s X̄_t dP
  = ∫_R ∫_R e^{isλ−itλ′} d[∫_Ω Z(λ)Z̄(λ′) dP]
  = ∫_R e^{i(s−t)λ} dF(λ),
where F(λ) = ∫_Ω |Z(λ)|² dP ↑ and is bounded. The converse can also be proved using the isometry of L²(F) and of the closed span sp{X_t, t ∈ R} ⊂ L²(P). We now show that this argument extends to a class of nonstationary processes, as a final item of this section; then, in the next section, an important application to communication theory will be given, illustrating the earlier work by obtaining likelihood ratios explicitly.

8. Theorem. Let {X_t, t ∈ R} ⊂ L²(P) be a mean continuous process on (Ω, Σ, P), so that for each t ∈ R, E(|X_s − X_t|²) → 0 as s → t. Let H(X) = sp{X_t, t ∈ R} be the closed linear span in L²(P) determined by the process. Then there exist a Y₀ ∈ H(X), a densely defined closed operator A_t on H(X), t ∈ R, and a strongly continuous unitary group {U_s, s ∈ R} of operators on H(X) commuting with A_t for each t ∈ R, in terms of which one has the representation:

X_t = A_tU_tY₀,  t ∈ R.  (26)
Moreover, every such process has a triangular covariance (as defined in Chapter V) and hence is of Karhunen class.

Remark. If the X_t-process is weakly stationary, then A_t = id. for all t, and it has other properties when additional conditions are satisfied by the family. For instance, if it is harmonizable, then {A_t, t ∈ R} has a certain positive definiteness property. A more detailed analysis of the latter specialization is found in Chang and Rao [2].

Proof. The mean continuity hypothesis implies that the space H(X) is separable. Then the classical Hilbert space theory assures that H(X) is isometrically isomorphic to a separable L²(μ) on some measure space (S, S, μ). Now let {ϕ_n, n ≥ 1} and {f_n, n ≥ 1} be any fixed complete orthonormal bases of H(X) and L²(μ) respectively, both of the same cardinality. If τ : ϕ_n → f_n, then τ can be extended linearly from H(X) onto L²(μ) which, denoted by the same symbol, is the isometric onto mapping. By polarization, it preserves inner products in both spaces. Now each X_t ∈ H(X) can be expanded in a (Fourier) series as:

X_t = Σ_{n=1}^∞ a_n(t)ϕ_n,  a_n(t) = (X_t, ϕ_n) = ∫_Ω X_t ϕ̄_n dP,
  = Σ_{n=1}^∞ a_n(t)τ(f_n) = τ(Σ_{n=1}^∞ a_n(t)f_n),  (27)

the last by the linearity and continuity of τ, since by Parseval's relation ‖X_t‖²_{2,P} = Σ_{n=1}^∞ |a_n(t)|² < ∞. If A ∈ S, define Z(A) = τ(χ_A). It is easily seen that Z(·) is a σ-additive H(X)-valued measure, and for simple functions f ∈ L²(μ),

τ(f) = ∫_S f(u) dZ(u),  (28)

holds. But then it can be extended to all f ∈ L²(μ) with the properties of the Dunford-Schwartz integration. Let g(t,u) = Σ_{n=1}^∞ a_n(t)f_n(u), so that g(t,·) ∈ L²(μ), t ∈ R, and then (27) and (28) imply:

X_t = ∫_S g(t,u) dZ(u) = ∫_S e^{itu} g̃(t,u) dZ(u),  (29)
where g̃(t,u) = e^{−itu}g(t,u) and g̃(t,·) ∈ L²(μ) again. Moreover, for A, B ∈ S, one has

(Z(A), Z(B)) = (τ(χ_A), τ(χ_B))_{H(X)} = (χ_A, χ_B)_μ = μ(A ∩ B).  (30)

So Z has orthogonal increments. Such a process, given by (29), is of Karhunen class. Since any two separable Hilbert spaces are isomorphic, and only the isomorphism properties are used here, we can replace (S, S, μ) by (R, B, ν), the Lebesgue line, and hence L²(μ) by L²(ν). From now on this will be assumed to have been done. If {Y_t, t ∈ R} is defined by

Y_t = ∫_R e^{itλ} dZ(λ),  t ∈ R,

then it is weakly stationary, as already noted (cf. (25)). Also it can be expressed as Y_t = U_tY₀, for a weakly continuous group {U_t, t ∈ R} of unitary operators on H(X), and by Stone's theorem U_t = ∫_R e^{itλ} E(dλ), where E(·) denotes the resolution of the identity. If we define

A_t = ∫_R g̃(t,λ) E(dλ),  (31)

with g̃(t,·) ∈ L²(ν) given in (29) (with changed notation), then Z(A) may also be identified as E(A)Y₀ in the above representation. [Since only the isomorphism is essential, all these formulations are possible.] It follows that A_t is a closed densely defined linear operator on H(X) commuting with E(·) and hence with U_s (cf. Riesz-Sz.Nagy [1], p. 351, on this classical result). Hence, using (31),

A_tU_tY₀ = A_t(∫_R e^{itv} E(dv)Y₀)
  = ∫_R g̃(t,λ) E(dλ)(∫_R e^{itv} E(dv)Y₀)
  = ∫_R e^{itλ} g̃(t,λ) E(dλ)Y₀, by a property of the spectral integral,
  = ∫_R e^{itλ} g̃(t,λ) Z(dλ) = X_t, by (29).
This shows that the representation (26) holds. Finally, to see that a process X_t in the separable H(X), given by (26), necessarily has a triangular covariance, recall that A_t and U_s commute, and hence A_t and the spectral family E(·) of U_s also commute. It then follows from an important theorem of von Neumann and Riesz (cf. Riesz-Sz.Nagy [1], footnote on p. 351) that A_t is a Borel function ϕ_t of U_t, and then (again by the spectral theorem) one has:

A_t = ϕ_t(U_t) = ∫_R ϕ_t(λ) E(dλ).  (32)

Hence we have

X_t = A_tU_tY₀ = ∫_R ϕ_t(λ) E(dλ) ∫_R e^{itv} E(dv)Y₀ = ∫_R e^{itλ}ϕ_t(λ) dZ(λ).  (33)

Letting g(t,λ) = e^{itλ}ϕ_t(λ) and μ(A) = (Z(A), Z(A))_{H(X)}, it is seen that g(t,·) ∈ L²(μ) and

r(s,t) = (X_s, X_t) = ∫_R g(s,λ)ḡ(t,λ) μ(dλ),
so that r is a triangular covariance function.

9. Discussion. If the covariance function r has the property that r(s,·) and r(·,t) are of finite variation, then one can associate a bimeasure β : (A,B) → ∫_A ∫_B r(ds,dt) and construct an RKHS with it. One then shows that the X_t-process must be of Karhunen class. [This is Theorem 5(i) in Chang and Rao [1].] However, in Theorem 8 above this finite variation condition on r is not imposed; the mean continuity (hence separability of H(X)) was assumed instead. Processes of the form (33) [or equivalently (26)] are also called oscillatory. The class of oscillatory processes coincides with the Karhunen class, as is easily
seen. They were treated under the name “deformed stationary process” by Mandrekar [1], and the above analysis is somewhat different from (but is motivated by) his work. In Chang and Rao [2], this is further analyzed when the X_t-process is weakly harmonizable, so that {A_t, t ∈ R} has other properties, including its positive definiteness. Some applications of oscillatory processes were treated in Priestley [1], who assumed at the outset the representation (33), with ϕ₀(λ) = 1 as a practical normalization and ϕ_t(·) slowly varying locally at each t. Thus the structure of processes represented by Theorem 8 has a potential for various concrete applications. We now turn to constructing processes satisfying the conditions of Theorem 7, which is one of the main results of this section.

7.4 Likelihood ratios in statistical communication theory

Let the process {X_t, t ∈ T} denote a (stochastic) signal plus noise model containing an unknown parameter:

X_t = Y_t + αZ_t,  t ∈ T ⊂ R, α ∈ R,  (1)
where Y_t is the signal and Z_t the noise, assumed to be independent Gaussian processes with a known covariance structure. Here the parameter α is to be tested or estimated, based on a realization of the X_t-process. For simplicity, let these processes be centered. We now convert this information into a form that represents a measure on the function space in which the processes live. More precisely, let K₁ and K₂ be the covariance functions on T × T → C of the processes Y and Z, let E(Y_t) = 0 = E(Z_t), t ∈ T, and let K₁ be strictly positive definite (or nondegenerate). We recall a simultaneous diagonalization of the covariance kernels from the earlier established Proposition V.2.11, a generalized version of Kadota's [1] result. Thus let R_i be the integral operators defined by the K_i, so that

(R_if)(t) = ∫_T K_i(s,t)f(s) dμ(s),  f ∈ L²(T,μ),  (2)
where (T,μ) is a σ-finite measure space. The additional condition we impose is that B = R₁^{−1/2}R₂R₁^{−1/2} has a bounded extension, denoted by the same symbol, to all of L²(T,μ), as required in the above quoted proposition. The importance of this assumption is that B is then a Hilbert-Schmidt operator. So if (τ_n, ϕ_n), n ≥ 1, are the eigenvalues and the corresponding normalized eigenfunctions of B (so τ_n ≥ 0), then we
have the simultaneous diagonalization of K₁, K₂ as:

K₁(s,t) = Σ_{n=1}^∞ (R₁^{1/2}ϕ_n)(s)(R₁^{1/2}ϕ̄_n)(t) = Σ_{n=1}^∞ g_n(s)ḡ_n(t),
K₂(s,t) = Σ_{n=1}^∞ τ_n(R₁^{1/2}ϕ_n)(s)(R₁^{1/2}ϕ̄_n)(t) = Σ_{n=1}^∞ τ_n g_n(s)ḡ_n(t),  (3)
where g_n = R₁^{1/2}ϕ_n, both series converging in the norm of L²(T × T, μ ⊗ μ). This is just the content of the above noted proposition, proved in Chapter V. Also Σ_{n=1}^∞ τ_n² < ∞. If B is of trace class, we actually get Σ_{n=1}^∞ τ_n < ∞. As noted there, the latter condition holds if T = [a,b], a compact interval with μ as Lebesgue measure. This is Kadota's original result. For simplicity of this application, we assume (T,μ) as above, so that B is of trace class, as well as both K₁, K₂ strictly positive definite (hence τ_n > 0 for all n). With the ϕ_n at our disposal, we may replace the Y, Z (and hence the X) processes by the corresponding (countable set of) their observable coordinates, namely Y_n = (Y, ϕ_n) = ∫_T Y_tϕ_n(t) dμ(t), and similarly Z_n = (Z, ϕ_n), n ≥ 1, all these being on the canonically represented probability space (Ω, Σ, P), where Ω = R^T. Then we have E(Y_n) = 0 = E(Z_n), E(Y_n²) = 1, E(Z_n²) = τ_n. It is convenient to introduce the following:

1. Definition. Let Ω̃ = Ω × Ω, Σ̃ = Σ ⊗ Σ, and P̃ = P ⊗ P. Let F be the set of cylindrical functions f on Ω̃ of the form:

f(y,z) = h_{2n} ∘ π_{2n}(y,z) = h_{2n}(y₁, ..., y_n; z₁, ..., z_n),

where π_m : Ω̃ → R^m is the coordinate projection and h_m : R^m → R is a bounded function with bounded continuous first (partial) derivatives. Let F_α be the subspace of F such that f ∈ F_α iff f is of the form:

f(y,z) = h̃_n ∘ π_n(y + αz) = h̃_n(y₁ + αz₁, ..., y_n + αz_n) = f_α(y,z), (say),  (4)
so that {Fα , α ∈ R} is a one parameter family of algebras of functions. To proceed with the analysis, we define T (α, β) : Fβ → Fα by the equation T (α, β)fβ = fα , ∀α, β ∈ R where fα is given by (4). Then T (α, β) is an isometric isomorphism of Fβ onto Fα under the uniform norm. Also it is a positive operator and verifies the evolution equation: T (α, β)T (β, γ)f = T (α, γ)f, α ≤ β ≤ γ, f ∈ Fγ .
(5)
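The diagonalization (3) used above is straightforward to carry out numerically once the kernels are discretized. The following sketch assumes T = [0,1] with Lebesgue measure on a uniform grid, and uses two illustrative covariances (a Brownian-motion kernel for K₁ and an exponential kernel for K₂; these specific kernels and the grid size are not from the text):

    import numpy as np

    m = 200
    t = (np.arange(m) + 0.5) / m                      # grid on T = [0,1]
    w = 1.0 / m                                       # quadrature weight for mu
    K1 = np.minimum.outer(t, t)                       # illustrative K1 (BM kernel)
    K2 = np.exp(-np.abs(np.subtract.outer(t, t)))     # illustrative K2

    R1, R2 = K1 * w, K2 * w                           # discretized operators of (2)
    lam, U = np.linalg.eigh(R1)
    R1h = (U * np.sqrt(lam.clip(min=0))) @ U.T        # R1^{1/2}
    R1mh = (U * (1.0 / np.sqrt(lam.clip(min=1e-12)))) @ U.T   # R1^{-1/2}
    B = R1mh @ R2 @ R1mh                              # B = R1^{-1/2} R2 R1^{-1/2}
    tau, Phi = np.linalg.eigh((B + B.T) / 2)          # eigenvalues tau_n
    G = R1h @ Phi                                     # columns g_n = R1^{1/2} phi_n
    print("largest tau_n:", np.round(np.sort(tau)[::-1][:4], 4))
    print("K1 error:", np.abs(K1 - G @ G.T / w).max())          # first line of (3)
    print("K2 error:", np.abs(K2 - (G * tau) @ G.T / w).max())  # second line of (3)

The reconstruction errors are at round-off level, and the decay of the τ_n reflects the trace class property assumed above (modulo the usual quadrature-weighting conventions of such discretizations).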
We then can introduce the measures P_{αβ}, the σ-algebras Σ_α determined by F_α, and the spaces Lᵖ(Σ_α), exactly as in the preceding section. Let h_{ni} = ∂h_n/∂x_i, the i-th partial derivative of h_n in the representation of f of (4). Then

(D_αf_α)(y,z) = Σ_{i=1}^n z_i h_{ni}(y + αz),  (6)

where f_α(x) = f(y + αz) = h_n(y₁ + αz₁, ..., y_n + αz_n) in the definition of (4), since h_n is continuously differentiable. Also D_αT(α,β)f_β is L¹-continuous in α and bounded, so that condition (A₁) holds. We need to verify that the T(α,β), D_α also satisfy conditions (A₂)–(A₄) of the last section. To make the necessary computations for this purpose, we now use the Gaussian hypothesis. Note that the orthogonal random variables Y_n, Z_n then become mutually independent, Y_n is N(0,1) and Z_n is N(0,τ_n). We indicate the steps necessary for these calculations, leaving some routine details as exercises.

If we define a functional ϕ_n : Ω̃ → R as ϕ_n(x) = Σ_{i=1}^n (Y_iZ_i)(x), x ∈ Ω̃, then (by independence) it follows that ϕ_n → ϕ a.e. and in L²(P̃). Also ‖ϕ‖₂² = Σ_{n=1}^∞ τ_n. On the other hand, if f : Ω̃ → R, f ∈ F, so that f = h_n ∘ π_n, let

(D_α*f)(x) = −Σ_{i=1}^n (∂h_n/∂α)(y_i + αz_i)|_{α=0} + ϕ(x)f(x)
  = −Σ_{i=1}^n z_i(x)h_{ni}(x) + ϕ(x)f(x),  x = (y,z) ∈ Ω̃.  (7)
Ω
f Dα g dP =
Ω
(Dα∗ f )g dP,
(8)
and that ψα = E Σα (ϕ) holds, so condition (A2 ) will be verified. Indeed (8) is established with a straight forward computation by simplifying the right side, using (7) and the image law of probability, with f = hn ◦ πn , g = kn ◦ πn as follows: RHS(8) =
˜ Ω
(Dα∗ f )g dP˜
=−
[
n
R2n i=1 1
zi h2ni (y1 , · · · , yn ; z1 , · · · , zn )]kn (z1 , · · · , zn )×
(τ1 · · · τn )− 2 − 1 Pni=1 (y12 + zi2 ) τ e 2 dydz (2π)n
456
VII. More on Stochastic Inference
n
1
(τ1 · · · τn )− 2 =− h2n (y, z) zi kni (z)] × (2π)n R2n i=1 1
Pn
2
z2 i
e− 2 i=1 (yi + τ ) dy dz, after integrating by parts and simplying, f Dα g dP˜ . = ˜ Ω
In a similar manner, one can verify condition (A3 ), by the following calculation. Define ψαn = E Σα (ϕn ), in the above notation. Then one has: ψαn (y1 + αz, · · · , yn + αzn ) = α
n i=1
τi (yi + αzi )2 − 1], [ 1 + α2 τi (1 + α2 τi )
(9)
and ψαn → ψα in L2 (P˜ ) as n → ∞. Further α → ψα is L1 (P˜ )continuous. This is proved by observing that ϕn → ϕ in L2 (P˜ ) so that ψαn → E Σα (ϕ) also in mean as n → ∞. So we establish (9) by using f (x) = hn ◦ πn (y + αz), and ∂ ˜ f ϕn dP = fβ dP˜ |β=α . ∂β Ω˜ ˜ Ω To simplify the right side, consider fβ dP = hn (y + αz)× ˜ Ω
R2n
1
(τ1 · · · τn )− 2 − 12 Pni=1 (yi2 + zτi ) i dy dz e (2π)n n z2 i 1 − hn (z) [2(1 + β 2 τi )]− 2 e 2(1+β2 τi ) dzi . = Rn
2
i=1
Differentiating this relative to β, evaluating it at β = α, and collecting terms, one gets (9). The formal differentiation and its interchange with the integral is easily justified. Finally the L2 (P˜ )-continuity of ψα is shown by the fact that E Σα is a bounded (in fact contractive) operator, and so ψα − ψβ 2 ≤ 2 ϕ − ϕn 2 + ψα − ψβ 2 → 0 as n → ∞. The verification of condition (A4 ) needs a more sustained computation. We assert that for α ∈ J, a compact interval, there exist numbers ε > 0, C > 0 and N0 > 1 such that |ψα | dP˜ ≤ Ce−εN , N ≥ N0 , (10) [|ψα |≥N]
457
7.4 Likelihood ratios in statistical communication theory
which implies (A4 ). Since ex ↑ for x ↑> 1δ for a δ > 0, (10) will follow from the relation − δN ˜ 2 |ψα | dP ≤ e eδ|ψα | dP˜ , (11) δx
˜ Ω
[|ψα |>N]
if we show that the moment generating function t → E(etψα ), α ∈ J, a compact interval, is bounded for t in a nondegenerate neighborhood of the origin since eδ|ψα | ≤ eδψα + e−δψα . For this, express (9) as: ψαn =
n
σi (α)(Ui2 − 1)
(12)
i=1
where σi (α) =
τi 1+α2 τi
and Ui are independent N (0, 1) representing the 2
τn , (the right side > 0) stochastic term there. If |a0 | < inf n inf α∈J 1+α 2|α|τn so that 1 − 2a0 ασn > 0, for all α ∈ J and the above a, then one has ˜ Σ, ˜ P˜ ) given (for δ = a0 ) by: the expected value on (Ω,
E(ea0 ψα ) =
n
1
[e−a0 ασi (α) (1 − 2a0 ασi (α))− 2 ].
(13)
i=1
It may be shown that the right side of (13) is bounded in α ∈ J, if |a| is chosen as above, (further detail is outlined in Exercise 4) and this implies the condition (A4 ). With these preliminary considerations we can present the desired result also due essentially to Velman [1]: 2. Theorem. Consider the process {Xt = Yt + αZt , t ∈ [a, b]} where the Y, Z are centered independent Gaussian signal and noise processes with known strictly positive definite covariances K1 , K2 respectively. If α ∈ J, a compact interval, and R1 , R2 are the associated integral −1 −1 (hence HS)-operators such that B = R1 2 R2 R1 2 is of trace class with eigenvalues τn and normalized eigenfunctions ϕn , then for any α, β ∈ J, the induced probability measures Pαβ of the Xt -process are such that (i) Pαβ ∼ Pαα , and (ii) a version of their likelihood ratio is given as: 1 n 1 + α2 τ i 2 dPαβ (x) = lim × n→∞ dPαα 1 + β 2 τi i=1 τi β 2 − α2 (yi + αzi ) , exp 2 (1 + α2 τi )(1 + β 2 τi )
(14)
˜ where xi (= (yi , zi )) are the observable coordinates of for a.a. x ∈ Ω, the process Xt determined by {ϕn , n ≥ 1} on [a, b].
458
VII. More on Stochastic Inference
Proof. We sketch the details to show how the abstract versions of the earlier conditions are verified in this key application. Our hypothesis implies that conditions (A1 ) − (A4 ) hold, as already seen in the preceding analysis. Hence, using the notation of the last section, we deduce that both W (α, β) and W (β, α) are isometries on L1 (Σβ ) onto L1 (Σα ) in both directions. Consequently, Pαβ ∼ Pαα on Σα , and, by Theorem 2.7, we also have: dPαβ = W (α, β)1. dPαα It suffices therefore to show that W (α, β)1 is given by the right side limit of (14). This is verified by first showing that there is a uniformly integrable martingale converging a.e. to a limit which then is equivalent to W (α, β)1. Thus let Σαn = σ(Y1 + αZ1 , · · · , Yn + αZn )(⊂ Σα ), and Σα∞ = σ(∪n Σαn ). Suppose that f is any bounded and Σαn -measurable function. Then (writing μ for Pαα ) we have:
˜ Ω
T (β, α)f dμ = =
˜ Ω ˜ Ω
f W (α, β) dμ, by definition of T (β, α), (cf., (5)), f E Σαn (W (α, β)1) dμ,
(15)
which is the key connecting link between the T and W operators here. The left side integral is simplified by using the Gaussian hypothesis together with a calculation as in (8), to obtain (going to the image measure space and returning back to the original canonical space):
˜ Ω
n 1 + α2 τ i 2 1
× 1 + β 2 τi β 2 − α2 τi (yi + αzi ) dμ. exp 2 (1 + α2 τi )(1 + β 2 τi ) (16)
T (β, α)f dμ =
˜ Ω
f
i=1
Since the f above is arbitrary, the integrands of (15) and (16) can be identified a.e. This shows that {E Σαn (W (α, β)1), n ≥ 1} is a positive martingale closed on the right by W (α, β)1 which is in L1 (Σα ). Hence it converges a.e. and in L1 (Σα ), by the standard theory (cf., e.g., Doob [2], p. 319, or for this form Rao [12], p. 184). Moreover, limn E Σαn (W (α, β)1) = E Σα∞ (W (α, β)1) a.e. It remains to show that W (α, β)1 is Σα∞ -measurable, which will then imply that W (α, β)1 = limn E Σαn (W (α, β)1) a.e., and proves (14). To see that this holds, note that the observable coordinates yn + αzn , n ≥ 1} generate Σα by our construction, and hence ∪n L1 (Σαn ) is
7.5 The general Gaussian dichotomy and Girsanov’s theorem
459
dense in L1 (Σα ). Since W (α, β)1 ∈ L1 (Σα ), there exist gn ∈ L1 (Σαn ) such that gn → W (α, β)1 in L1 (Σα ). Consequently, one has: E Σα∞ (W (α, β)1) − W (α, β)1 1 ≤ E Σα∞ (W (α, β)1) − E Σαn (W (α, β)1) 1 + E Σαn (W (α, β)1) − gn 1 + gn − W (α, β)1 1 ≤ E Σα∞ (W (α, β)1) − E Σαn (W (α, β)1) 1 + 2 gn − W (α, β)1 1 → 0, as n → ∞, by the martingale convergence theorem recalled above, and the way the gn were chosen. Thus W (α, β)1 is equivalent to a Σα∞ -measurable function as desired. It is possible to consider other types of hypotheses for this model and obtain their associated likelihood ratios. However, we omit further discussion on these examples. Since the properties of Gaussian processes have been crucial here, we now include another proof of their general dichotomy that has been of constant use in this analysis. 7.5 The general Gaussian dichotomy and Girsanov’s theorem We have already seen in Theorem V.1.1 a proof of the Gaussian dichotomy, and here we include an alternative method, employed for a classical theorem due to Kakutani [1] on product measures, that extends (nontrivially) to our case. The treatment essentially follows the recent work due to Vakhania and Tarieladze [1]. After presenting this general dichotomy result, we consider, among equivalent Gaussian measures (or processes), conditions for a Gaussian process to be equivalent to Brownian Motion (or BM), since the latter is widely studied and well-understood. The main result here is to represent such a process as a functional of the BM, and this is essentially Girsanov’s theorem or some extension of it. We wish to discuss these ideas in some detail in this section. The fundamental Kakutani theorem, originally proved for countable products, has been extended in different directions by various authors. For instance, Zaanen ([1], p.170) has given a version for the systems of product probability measures that are not necessarily absolutely continuous, and Brody [1] considered singularity for not necessarily product but directed systems of probability measures (these are projective or consistent systems). [A simplified account of the latter is in Rao [12],
460
VII. More on Stochastic Inference
p.206.] Another version of the theorem will be given here. This as well as the preceding form are designed for the Gaussian dichotomy. The present result is again a slight variation which is for directed products, derived from a not necessarily countable collection of probability measures. Let us motivate this with a classical elementary result. Recall that a distribution F on Rn is Gaussian for a random vector X with mean m and nonsingular covariance C if it is given by: 1 1 dF (x) = [(2π)n |C|]− 2 exp{− (C−1 (x − m, x − m)} dx, 2
(1)
where (·, ·) is the inner product in Rn . This is equivalent to stating that for each y ∈ Rn , the scalar random variable (X, y) is distributed as N ((b, y), (Cy, y)), and it can be expressed as: (x, y) dF (x) = (m, y) = m(y) (say), (2) Rn
and
(x, y)(x, z) dF (x) = (Cy, z) + m(y)m(z),
(3)
Rn
where r : (y, z) → r(y, z) = (Cy, z) is the covariance bilinear functional on Rn × Rn , and m(·) is the (linear) mean functional on Rn . It is also seen that the Fourier transform of F is given by: 1 Fˆ (u) = exp{im(u) − r(u, u)}, u ∈ Rn . 2
(4)
These formulas make sense even if X : Ω → H, a Hilbert space, and we restate them precisely in the context of an infinite dimensional vector space with X as a stochastic process to be considered in the following work. The L´evy uniqueness theorem holds in this context as well. Thus if S is a set and G ⊂ RS is a vector subspace of the real functions on S, and ΣG = σ(∪f ∈G f −1 (B)), the σ-algebra generated by G, where B is the Borel σ-algebra of R, then a probability measure P on ΣG is termed Gaussian with mean functional m : G → R and covariance bilinear functional r : G × G → R when P ◦ f −1 : B → R+ , has a Gaussian distribution with mean m(f ) and variance r(f, f ), ∀f ∈ G. The relations (2)-(4) become: f dP ; rP (f, g) = f g dP − m(f )m(g), (5) mP (f ) = S
S
and Pˆ (f ) =
1 eif dP = exp{imP (f ) − rP (f, f )}. 2 S
(6)
7.5 The general Gaussian dichotomy and Girsanov’s theorem
461
n n It is seen that i,j=1 ai aj rP (fi fj ) = S ( i=1 ai fi )2 dP ≥ 0, for all fi ∈ G and ai ∈ R, n ≥ 1, so that r is positive definite defining an inner product on G, and P, Pˆ determine each other uniquely. Let Hr be the completion of the inner product space (G, r), analogous to the RKHS construction already employed before in Section V.2. If Hr can be continuously embedded into L2 (S, ΣG , P ) = L2 (P ) or equivalently r(·, ·) is a continuous bilinear form in L2 (P ), then by a classical theorem due to Riesz, there exists a positive definite symmetric operator A on Hr → L2 (P ), such that rP (f, g) = (Af, g)P , so that A : G → L2 (P ) where Hr = sp(A(G)) ⊂ L2 (P ). When the domain of A contains {f − mP (f ), f ∈ G}, the Hr consists of Gaussian random variables with means zero. Thus every element of L2 (P ), hence of A(G), can and will be regarded as a random variable on (S, ΣG , P ) in the following analysis. Before stating the dichotomy result, we observe that if {gi , i ∈ I} is (by a set of pth power integrable random variables and I is the directed inclusion) set of all non void finite subsets of I, and if gJ = i∈J gi ∈ Lp (P ) with limit Lp (P ), then suppose the net {gJ , J ∈ I} is Cauchy in p g (necessarily g ∈ L (P )). In this case we put g = J∈I gJ and call it the Lp (P )-product limit of the given gi -set. Note that if the gi are mutually independent and gi ∈ Lp (P ), then clearly gJ ∈ Lp (P ), J ∈ I, and thus the term Lp (P )-product limit is meaningful. We use it only in this case below. The desired general result can now be given as follows. P 1. Theorem. Let (S, ΣG , Q ) be a pair of Gaussian probability measure spaces on (S, ΣG ) where ΣG is the σ-algebra generated by the vector space G ⊂ RS of functions on a set S, with means mP , mQ and covariances rP , rQ on G and G × G respectively. Then P and Q are either mutually equivalent or singular. They are equivalent iff the following three conditions hold: (i) (mQ − mP )(f ) = (Af, h0 )P , f ∈ G, for some fixed h0 ∈ HP (= HrP ), (ii) there exists a bounded linear injective operator B : HP → HP such that rQ (f, g) = (BAf, Ag)P , f, g ∈ G,
(iii) D = B − I : HP → HP is Hilbert-Schmidt. In case P ∼ Q, i.e., conditions (i)–(iii) hold, let {1 + λn , n ≥ 1} be the nonzero eigenvalues (with multiplicities) of D, and {fn , n ≥ 1} be the corresponding (complete) set of orthonormal eigenvectors in HP . Then λn > 0, and the sequence (writing “ ln for natural log): 1 ξn = exp{− [(λ−1 − 1)fn2 + ln λn ]}, 2 n
n ≥ 1,
462
VII. More on Stochastic Inference
has the L1 (P )-product limit ξ, and in fact ∞ dQ ξn × ξ= = dP n=1 1 exp{A−1 h0 − (A−1 h0 , A−1 h0 )P }. 2 Further, we have λ = limn = A , and ξ p dP < ∞, 1 < p < bλ
(7)
(8)
S
where bλ =
λ λ−1
for λ > 1 and = ∞, if 0 < λ ≤ 1.
Remark. The first of the assertions is the same as Theorem V.1.1, the second is similar to that of Theorem V.2.6, and the last one is analogous to Theorem V.2.12. All of them used the RKHS technique, and the spaces Hr there were assumed separable. The proof here is different and depends on no separability restrictions. It also does not explicitly use any properties of the conditioning concept. In general, both these techniques complement each other and reveal interesting insights into this, by now classical, result. The proof is based on the following auxiliary facts, devised for Kakutani’s theorem and used by him for the same purpose, which we now present in a form needed in the proof. The next lemma is essentially from Kakutani’s original paper. 2. Lemma. Let {0 < ξi , i ∈ I} be a set of mutually independent random variables on (S, S, P ) such that E(ξi) = 1, i ∈ I, and 1 i∈I E(ξi ) > 0. Then the ξi -set has the L (P )-product limit ξ such that P [ξ > 0] > 0, and if moreover ξi > 0, a.e., ∀i, then so is ξ. √ Proof. √ Let I1 = {i ∈ I : 0 ≤ E( ξi ) < 1}, and I2 = I − I1 . Since E( ξi ) ≤ 1 and by hypothesis 0< E( ξi ) = E( ξi ) · E( ξi ) = E( ξi ) ≤ 1, i∈I
i∈I1
i∈I2
i∈I1
one has I1 to be at most countable, since otherwise the product must vanish which is impossible. So let I1 = N for simplicity, and gn = n E( ξj ) → g > 0, by hypothesis. Now for n > m, gn − gm = j=1 n positive j=m+1 E( ξj ) → 1 as m, n → ∞ the infinite product of √ numbers converges to a positive number which implies neven√E( ξn ) → = ξi then by 1 (cf., e.g., Stromberg [1], p.413). Hence, if h n i=1 √ n independence E(hn ) = i=1 E( ξi ), and one has E(hn − hm )2 = 2 − 2
n j=m+1
E( ξj ) → 0, m, n → ∞.
7.5 The general Gaussian dichotomy and Girsanov’s theorem
463
∞ √ So hn → h in L2 (P ), and h ∈ L2 (P ). Moreover, h = k=1 ξk > 0 ∞ in measure, so that h = exp[ 21 k=1 ln ξk ] > 0 in measure, and h > 0 a.e., if each ξk > 0 a.e. here. It follows that in any case the ξi -set has an L1 (P )-product limit. A result on HS operators is also needed for the following work: 3. Lemma. If a continuous linear operator D on a Hilbert space H satisfies the inequality πF DπF 2 ≤ c for a fixed constant c > 0 where πF : H → F is a contractive (=orthogonal) projection with range F ∈ FH , the directed (by inclusion) set of all finite dimensional subspaces of H, and · 2 is the HS norm, then D itself is an HS operator. Proof. Recall that the set of all HS operators on H is itself a Hilbert space H2 under the norm derived from the inner product (C1 , C2 ) = ∗ ∗ tr(CC ∗ ). (Elementary tr(C1 C2 ) (=trace of C1 C2 ) and C 2 = proofs of some properties of H2 used here may be found in Schatten [1].) By the fact that FH is directed, and {πF , F ∈ FH } is thus an ordered set of finite rank orthogonal projections, it follows from the so called metric approximation property of H (i.e., the identity operator can be strongly approximated on compact subsets of H by finite rank operators) we get πF h − h → 0 as F in FH , h ∈ H. Replacing h by DπF h this implies, after a triangular inequality split, that πF DπF h − Dh → 0 for each h ∈ ∪F ∈FH F ⊂ H, a dense subset, and then the same holds by continuity for all h ∈ H, F ∈ FH . But a bounded set in a Hilbert (in fact in any reflexive) space is relatively weakly compact, and so the closed ball Kc ⊂ H2 of radius c > 0 (and center 0) is weakly compact. But the weak topology of H2 is weaker than the weak operator topology in a Hilbert space so that Kc is also a closed subset in B(H) in its weak operator topology. [B(H) denotes the Banach space of all bounded linear operators on H under the uniform operator norm.] But πF DπF ∈ Kc , ∀F ∈ FH , and we have πF DπF h → Dh in H; whence πF DπF → D strongly. So D ∈ Kc . Thus D is HS and moreover D 2 ≤ c, as asserted. In order to compare the above theorem with Kakutani’s product measure dichotomy, we establish the latter in a form convenient for our purpose. 4. Theorem. Let (Ω, Σ, μν ) be probability spaces and {(Ωi , Σi ), i ∈ I, } be a system of measurable spaces, together with ξi : Ω → Ωi , i ∈ I as measurable mappings such that Σ = σ(∪i∈I ξi−1 (Σi )). Suppose that {ξi , i ∈ I} is a set of independent random variables on the spaces (Ω, Σ, μν ) and let νi = ν ◦ ξi−1 and μi = μ ◦ ξi−1 be the image probabilities on Σi so that ν = ⊗i νi , and similarly for μ. Then either μ ⊥ ν or ν μ; i.e., the dichotomy holds. [In case νi ∼ μi , i ∈ I, one has
464
VII. More on Stochastic Inference
μ ⊥ ν or ν ∼ μ again.] In either event it is true that ρ(μi , νi ) = 0 (> 0), μ ⊥ ν(ν μ or ν ∼ μ) ⇔
(9)
i∈I
[ρ(·, ·) here denotes the Hellinger distance already used before, cf. Thedνi ◦ ξi , it follows orem IV.1.7), and for the second alternative, if gi = dμ i 1 that {gi , i ∈ I} has an L (μ)-product limit, and moreover dνi dν = gi = ( ◦ ξi ). dμ dμi i∈I
(10)
i∈I
Proof. Note that we essentially expressed (Ω, Σ) = ⊗i∈I (Ωi , Σi ). Also ξi : Ω → Ωi , are the coordinate projections of the original result, and this form is useful for a convenient comparison that will be made below with Theorem 1, after its demonstration. It was seen in Theorem IV.1.7 that with μJ = ⊗i∈J μi and similarly νJ , J ∈ I, the collection of all (nonvoid) finite subsets of I directed by inclusion, we have ρ(μJ , νJ ) ≤ ρ(μJ , νJ ) for J ⊂ J and that μ ⊥ ν iff ρ(μ, ν) = 0. Since, by independence of ξi , we get ρ(μi , νi ) = ρ(μi , νi ) = ρ(μ, ν), lim ρ(μJ , νJ ) = lim J
J
i∈J
i∈I
it follows that ν ⊥ μ iff i∈I ρ(μi , νi ) = 0. If μ ⊥ ν, then by the above ρ(μ, ν) = i∈I ρ(μi , νi ) > 0. On the other hand, note that dνi √ ◦ ξi dμ E( gi ) = dμ i Ω dνi dμ ◦ ξi−1 = −1 dμi ξi (Ωi ) dνi = dμi = ρ(μi , νi ), dμi Ωi √ and hence the hypothesis implies i∈I E( gi ) > 0. So by Lemma 2, the net {gi , i ∈ I} has an L1 (μ)-product limit g, and g = i∈I gi > 0 on a set of positive μ-measure. But for any A ∈ ∪i∈I ξi−1 (Σi ) ⊂ Σ, one has A ∈ ξi−1 (Σi ) for some i ∈ I, and then g dμ = gi dμi = ν(A), [νi = ν|ξi−1 (Σi )]. (11) A
A
7.5 The general Gaussian dichotomy and Girsanov’s theorem
465
Since the above class of sets generates Σ and ν is σ-additive, this implies dν ν μ and dμ = g a.e. In case νi ∼ μi , i ∈ I, so that gi > 0 a.e., the above work and Lemma 2 imply that g > 0 a.e. Hence it follows that ν ∼ μ. Remark. Although we have not mentioned the conditional expectation ym derivative in this proof, the fact that gi in (11) is the Radon-Nikod´ −1 of νi relative to μi is equivalent to the statement that gi = E ξi (Σi ) (g) a.e. However, no other properties of conditional expectations have been used or will be needed. We now proceed to establish Theorem 1, with Lemmas 2 and 3. First it is convenient to record the following pair of elementary facts about Gaussian integrals. If ϕa,t (·) denotes the normal density and Φa,t its distribution function with mean a ∈ R and variance t > 0, using 1
ϕa,t (x) = (2πt)− 2 exp[−
1 (x − a)2 ], 2t
we find on using the known integral for its moment generating function, that p s s ϕb,s 1 (x)ϕa,t (x) dx = [p( )p−1 + (1 − p)( )p ]− 2 × ϕ t t a,t R p(1 − p) (a − b)2 exp{− }, 2 pt + (1 − p)s (12) and this is finite iff pt + (1 − p)s > 0. Note that Φa,t , Φb,s define equivalent measures iff st > 0 or degenerate at the same point (i.e., s = 0 or t = 0 and a = b). In particular, if p = 12 , from (12) we get the Hellinger integral as: ρ(Φa,t , Φb,s ) =
√
s 1 s 1 1 (a − b)2 }. 2[( )− 2 + ( ) 2 ]− 2 exp{− t t 4(t + s)
(13)
With this preparation, we proceed to establish Theorem 1 and then compare it with Theorem 4. Proof of Theorem 1. Now by Theorem IV.1.7, for any probability measures μ, ν, the Hellinger integral ρ(μ, ν) vanishes iff μ ⊥ ν and hence the same must be true in our case too. Therefore, it suffices to show that ρ(μ, ν) > 0(i.e., μ ⊥ ν) implies ν ∼ μ for the Gaussian measures iff conditions (i)–(iii) hold, and then (7)-(8) will be derived as a byproduct of our computations. Thus let f ∈ G be arbitrary and consider νf = ν ◦ f −1 , μf = μ ◦ f −1 , the image measures on R which by hypothesis are Gaussian with means
466
VII. More on Stochastic Inference
mν (f ), mμ (f ) and variances rν (f, f ), rμ(f, f ) respectively. Since by Theorem IV.1.7 and (13) above, excluding the true and trivial case that rμ (f, f ) = 0 which implies (mμ (f ))2 ≤ S f 2 dμ = rμ (f, f ) = 0 or mμ (f ) = 0 so that mν (f ) = 0 = rν (f, f ), for ρ(μ, ν) > 0 we can conveniently take rμ (f, f ) = 1, and set m(f ) = mν (f ) − mμ (f ). Then one has: √ 1 1 1 0 < ρ(μ, ν) ≤ ρ(μf , νf ) = 2[rν (f, f )− 2 + rν (f, f ) 2 ]− 2 × exp[− Setting a = rν (f, f ), b =
m(f )2 1+rν (f,f ) ,
m(f )2 ]. 4(1 + rν (f, f ))
the above inequality implies
ρ4 (μ, ν) ≤ 4(a + 2 + a−1 )−1 e−b = so that b=
4a e−b ≤ e−b , (a + 1)2
m(f )2 ≤ ln ρ−4 (μ, ν), 1+a
and also from the fact ρ4 (μ, ν) ≤
(14)
4a (a+1)2 ,
(15)
(16)
that
1 4 ρ (μ, ν) ≤ a ≤ 4ρ−4 (μ, ν). 4
(17)
Thus (17) and (16) (the former with polarization identity in Hilbert space) imply that rν (·, ·) and m(f ) are bounded bilinear and linear functionals respectively. Hence by the (unique) Riesz representations, invoked already, conditions (i) and (ii) hold. [Recall that by (17) the operator B for rν (f, g) = (BAf, Ag) satisfies inf h =1 |rν (f, g)| ≥ 1 4 ρ (μ, ν) > 0 so that B is invertible.] Thus it only remains to show 4 that (iii) also holds, and this involves some work especially a use of Lemma 3. Indeed consider F ∈ FH , the directed set of finite dimensional subspaces of H, as before, and suppose that as a function space F ⊂ G also. Then by the uniqueness part of the Riesz representations of rν and mν one has BF = πF BπF verifying rν (f, g) = (BF Af, Ag)μ,
f, g ∈ F,
(18)
and the functions in F can be regarded as random variables on S. Since BF is a finite rank operator, let f1 , . . . , fn be its eigenvectors and s1 , . . . , sn , the corresponding eigenvalues, relative to the equivalent Gaussian measures μF , νF on ΣF ⊂ ΣG . This implies that the fi are
467
7.5 The general Gaussian dichotomy and Girsanov’s theorem
N (m(fi ), si ), i = 1, . . . , n independent random variables. Hence from the decreasing property of the Hellinger integrals and the distributional properties of the fi , one has n √ −1 1 m(fk )2 1 [ 2(sk 2 + sk2 )− 2 exp[− 0 < ρ(μ, ν) ≤ ρ(μF , νF ) = ]. 4(1 + sk ) k=1
Squaring and taking ‘logs’, we get on transposition and dropping the term with m(fk )2 , the key inequality: n
1 1 −1 [ln( (sk 2 + sk2 ))] ≤ ln ρ−2 (μ, ν). 2
(19)
k=1
But 0 < sk ≤ BF ≤ B , and using the following elementary numerical inequality (exactly as in the proof of Theorem V.1.1 where a similar result was employed): √ 1 1 1 (x − 1)2 ≤ ( x + 1)2 (x + 1) ln[ (x− 2 + x 2 )], 2
x > 0,
one gets the crucial relation for DF = BF − IF : DF 2H2 =
n
(sk − 1)2 ≤ ( B + 1)2 ( B + 1) ln ρ−2 (μ, ν) < ∞.
k=1
The right side is independent of F and letting F , we conclude with Lemma 3 that D is HS, proving (iii). Thus ρ(μ, ν) > 0 implies conditions (i)–(iii). It remains to show that these in turn imply that ν ∼ μ. We now associate mutually independent positive random variables ξi that have an L1 (μ)-product limit ξ, so that by Theorem 4, they determine a measure ν1 equivalent to μ with density ξ; and then it will be shown that ν1 = ν to complete the argument. Since D = B − I is HS by (iii), there exist eigenvalues si > 0 and the corresponding orthogonal eigenvectors {fi , i ∈ I} of B forming a base for Hμ (relative to μ), such that by (ii) and (iii) (a) 0 < si ≤ B ;
(b)
(si − 1)2 < ∞.
(20)
i∈I
So there exist at most a countable set of indices from I for which si − 1 = 0, and then I1 = {i ∈ I : |si − 1| + |(h0 , fi )μ | > 0}
468
VII. More on Stochastic Inference
is also countable where h0 represents the mean m in the statement. Thus if we let ξi = exp[−
1 (fi − (fi , h0 )2μ ) − fi2 + ln si ], 2si
i ∈ I,
then the ξi > 0 a.e., and are integrable as well as mutually √ independent are independent Gaussian. Thus ξi ∈ L2 (μ) relative to μ, since the f i √ and i∈I E( ξi ) > 0. Consequently ξ = i∈I ξi = i∈I1 ξi > 0 a.e. 1 and is in L1 (μ) by Lemma 2. If we define ν1 : Σg → R+ by dν dμ = ξ, then it is σ-additive and, by Theorem 4, ν1 ∼ μ. It thus remains to show that ν1 = ν, and then establish (7) and (8). For the equality of measures, it suffices to verify that their Fourier transforms are equal, and by the uniqueness of the latter, the result follows. Now by (i) mν (f ) − mμ (f ) = (Af, h0 )μ = (Af, B −1 g0 )μ , g0 = Bh0 , (f, fi )μ (B −1 g0 , fi )μ , Parsevaal’s identity, = i∈I (21) where we used (·, ·)μ to indicate which of the measures μ, ν is used here. For the same reason one also gets rν (f, f ) = (BAf, Af )μ =
(si (Af, fi)2μ ,
(22)
i∈I
and expanding f in a Fourier series relative to the same complete orthogonal set {fi , i ∈ I} yields, f = mμ (f ) +
(Af, fi )μ fi .
i∈I
Finally consider the Fourier transform with (21)-(23): νˆ1 (f ) =
S
=
eif dν1 P
ei(mμ (f )+
j∈I (Af,fj )μ fj )
S
=
j∈I
ξi dμ, using (23) for f ,
j∈I 1
ei[mμ (f )− 2 (Af,fj )μ fj ] dν1j
S
1 = exp[imν (f ) − rν (f, f )] = νˆ(f ). 2
(23)
7.5 The general Gaussian dichotomy and Girsanov’s theorem
469
Hence by the previous reduction, ν = ν1 ∼ μ. The same argument also establishes (7) except that we wrote P, Q for μ, ν. So it only remains to show (8) is true. exWe have already seen in (12) that E(ξip ) < ∞, and an explicit pression was obtained for it. Since 0 < ξ = i∈I ξi = i∈I1 ξi = n limn→∞ j=1 ξij , in L1 (μ) where I1 = {ij : j ≥ 1} is countable so that the sequence of the partial products is uniformly integrable, we get by the Vitali convergence theorem that p
ξ dμ = lim S
n→∞
= lim
n→∞
⎛ =⎝
∞
n S j=1 n j=1
S
ξij dμ,
1 < p < bλ ,
ξij dμ, by independence, ⎞− 12
(psp−1 − (p − 1)spj )⎠ j
×
j=1 ∞ p(p − 1) (g0 , fj )2μ ]. exp[ 2 p − (p − 1)sj j=1
(24)
Since (sn −1) → 0 as n → ∞, the last term of (24) converges. Regarding the first factor, on using the following elementary inequality 1 − (ps(p−1) − (p − 1)sp ) ≤ c(s − 1)2 , for some constant 0 < c < ∞ for sin a neighborhood of 1. It also converges because of the fact that i∈I1 (si − 1)2 < ∞. Hence ξ ∈ L1 (μ), 1 < p < bλ . This is just (8) in a different notation. Regarding this result, we include some observations as follows: 5. Discussion. Theorem 1 above can be thought of as one on equivalence of a pair of Gaussian measures μ, ν when conditions (i)-(iii) hold. This is the hard part of the demonstration. On the other hand, if μ, ν are Gaussian measures on some measure space and conditions (i)(iii) hold, then Kakutani’s theorem implies that the measures must be equivalent. Suppose now one has a pair of arbitrary Gaussian measures μ, ν on some measure space which is generated by a fixed countable set of independent (relative to both μ, ν) Gaussian random variables, will the measures be equivalent? That such a sequence of random variables exists, if the covariance kernels of μ, ν are simultaneously diagonalizable (by Proposition V.2.11 and the work in Chapter V) indicates that without some condition such as simultaneous diagonalizability, the common
470
VII. More on Stochastic Inference
system of countable collection desired need not exist. But if this is assumed to exist, does it imply the diagonalization as noted above? A (possibly negative) solution is not available at this time. On the other hand a positive answer will allow a direct application of Kakutani’s theorem, implying that the Gaussian dichotomy is directly obtainable from it. In the Gaussian case, since two nonsingular such measures μ, ν are always equivalent, if one of them (say μ) is specialized to BM (or the Wiener measure) what additional conditions are needed to conclude that ν is also a measure of the same type? This problem has interesting applications (e.g., in finance mathematics and elsewhere), since the general form of the density given by (7) or by Theorem V.2.12 is not simple enough for an immediate use. This has been solved by Girsanov [1] restricting the general theory to diffusion processes, and extensions have been made by Shepp [1], Hitsuda [1] and others. Here we include a brief account of this work together with an application. A generalized form of Girsanov’s result has already been established in Theorems V.5.2 and V.5.5. Now we present an alternative and a shorter proof of the original theorem following Hida and Hitsuda [1], illuminating the previous considerations. The last part of the following theorem is due to Navikov [1]. 6. Theorem. Let {Xt , Ft , t ≥ 0} be the BM, {f (s, ·), Fs, s ≥ 0} be a path-wise (Lebesgue) integrable process (adapted to the same filtration), both on a canonically represented probability space (Ω, Σ, P ), with Ω = R[0,T ] as usual. For any 0 < s < t < T consider t 1 t 2 f = exp{ f (u) dXu − f (u) du}, (25) Zs,t 2 s s f which is well-defined. Suppose that E(Z0,T ) = 1 for any fixed T . If f dQT = Z0,T dP so that QT is also a probability measure on Σ, let t Yt = X t − f (u) du, (26) 0
then {Yt , Ft , 0 ≤ t ≤ T } is a BM on (Ω, Σ, QT ). Moreover, the condif ) = 1 holds whenever tion E(Z0,T E(exp[
1 2
T 0
f 2 (u) du]) < ∞.
(27)
f is Ft -adapted and nonnegative, and we obProof. By definition Z0,t serve that it is a uniformly integrable martingale. This is a consequence
471
7.5 The general Gaussian dichotomy and Girsanov’s theorem
of the basic Itˆ o formula (cf., Theorem V.5.1 and we need its two dimensional version; for a detailed discussion one may see the companion 1 2 volume, Rao [21],p.401). It can be given with g(x, y) = eax− 2 a y as ∂g etc. denote partial derivatives): follows (gx = ∂x
˜ t, A˜t ) − g(B ˜ 0 , A˜0 ) = g(B
t ˜ s , A˜s ) dB ˜s + ˜s , A˜s ) dA˜s g x (B g y (B 0 0 1 t ˜s , A˜s ) dB ˜ s. + gxx (B (28) 2 0 t
˜ t = t f (u) dXu, A˜t = t f 2 (u) du = B ˜ s . Writing this out, where B 0 0 ˜ ˜ one has the following (since Bt = 0 = At a.e.): 1 t 2 ˜ ˜ ˜ ˜ ˜ ag(Bs, As ) dBs − a g(Bs , As ) dA˜s 2 0 0 1 t 2 ˜ ˜ + a g(Bs , As ) dA˜s , 2 0
˜ t, A˜t ) − g(0, 0) = g(B
t
˜ s , A˜s ) dB ˜ s . But the right side is a ˜ t, A˜t ) − 1 = a t g(B so that g(B 0 ˜t -process is such. In particular, martingale for each a ∈ R since the B f ˜ ˜ , 0 ≤ t ≤ T , which is Ft taking a = 1 so that g(Bt , At )|a=1 = Z0,t f f Ft adapted, is a martingale and moreover E (Z0,T ) = Z0.t a.e. To see that the Yt -process is BM, it suffices to show that the increments Yt − Ys for 0 < s < t are independent and normally distributed as N (0, |t − s|). This may be established by finding the conditional characteristic function of the increment extending an idea of P. L´evy’s in deducing the BM property for certain martingales (cf., e.g., Rao [21], p.402). Thus consider ϕs,t (u) = EQT [exp(iu(Yt − Ys ))|Fs ].
(29)
To evaluate this, we note that for any random variable V ≥ 0 (or V is QT -integrable), and A ∈ Fs , so that E Fs (V ) is defined, one has: EQT (χA E Fs (V )) = EQT (χA V ) f ), by definition of QT , = E(χA V Z0,T
= E[χA
f E Fs (Z0,T ) f Z0,s
f E Fs (V Z0,T )],
f since {Z0,t , Ft , t ≥ 0} is a martingale, f f )E Fs (V Zs,T )], = E[χA E Fs (Z0,T
472
VII. More on Stochastic Inference f since Z0,s is Fs -adapted, f f E Fs (V ZsT )], = E[χA Z0,T
by the averaging property of conditioning, f = EQT [χA E Fs (V Zs,T )].
(30)
Since A ∈ Fs is arbitrary, we can identify the Fs -measurable integrands f Fs in (30), and conclude EQ (V ) = E Fs (V Zs,T ) a.e. Using this represenT tation, we can evaluate (29) with Itˆo’s formula used before, expanding it in terms of the P -measure by replacing V with the increment Yt − Ys for 0 < s < t < T as follows. f |Fs ] ϕs,t (u) = E[exp(iu(Yt − Ys ))Zs,t t t = E[exp( (iu + f (v)) dXv − (iuf (v)+ s
s
f2 (v)) dv|Fs], a.e.[P ], [QT ]. 2
(31)
Now apply Itˆ o’s formula to an exponential function defined as g1 (x, y) = (ia+b)x−(ia+ 2b )y suitably to get: e Fs
ϕs,t (h) = E u
1+
s
t
exp
u
(ih + f (v)) dXv − s
f2 (v)) dv (ih + f (u)) dXu− (ihf (v) + 2 s u u t
exp (ih + f (v)) dXv − (ihf (v)+ s 2
s
s
h2 f (v)) dv (− du 2 2 2 t h ϕs,u (u) du, a.e., =1− 2 s
(32)
since the middle term vanishes as Xt is a BM, and the integrand is Fu -adapted so an application of E Fs reduces it to zero. Now ϕs,s (h) = 1, 0 ≤ s ≤ T, h ∈ R, and hence (32) with this boundary condition has the following unique solution: ϕs,t (h) = exp{−
h2 (t − s)2 }, a.e. [P ] or QT . 2
This is independent of Fs so that it is the characteristic function of Yt − Ys which is of N (0, |t − s|) and is independent of Yu , u ≤ s. Taking
7.5 The general Gaussian dichotomy and Girsanov’s theorem
473
s = 0 we find that each Yt is normal, and for 0 < t1 < t2 < t3 < T , it follows that (Yt3 − Yt2 ) to be independent of Ft2 , as noted above, and hence of (Yt2 − Yt1 ). Thus the Yt -process is a continuous Gaussian process with independent and stationary increments, so that it is a BM. We now have to verify (27) which was needed to the crucial conclusion that QT is a probability measure. This is related to the Wald identity established in Theorem IV.2.7. By that result, we can get E(exp[Xτb − 12 τb ]) = 1 if τb is the first exit time of the BM at level b. There we used the two sided exit, but the method works if τb = inf{t ≥ 0 : Xb = t + b} also, from which one concludes that E(eτb /2 ) = e−b . This method has been detailed in Liptser and Shiryayev ([1], p.217). However, we use a shorter argument, also employed by Revuz and Yor ([1], p.308). First t t note that in (25), Mt = 0 f (s) dXs − 12 0 f 2 (s) ds, is a local martint gale, and since the Xt -process is BM, one gets M t = 12 0 f 2 (s) ds. Also since M T is well-defined and by (27) it has all moments finite (since the moment generating function exists), we can invoke an inequality for martingale differences (the continuous parameter version of Burkholder-Davis-Gundy result, cf., e.g., the companion volume, Rao [21], p.542) to conclude that MT∗ p ≤ const. M T p < ∞ for p ≥ 1 where Mt∗ = supt≤T |Xt |. This implies that the Mt -process is uniformly integrable and hence is a martingale. Now 1 1 1 1 exp( MT ) = E(M )T2 exp( M T ) 2 , 2 2
where E(M )t = exp{Mt − 12 M t } is the exponential martingale. By the CBS-inequality, the above equation becomes on taking expectations: 2 1 1 E(exp( MT )) ≤ E[E(M )T ]E[exp( M T ] 2 2 1 ≤ E(exp( M T )) < ∞, 2
(33)
using the hypothesis and the fact that {E(M )t, Ft , t ≥ 0} is a supermartingale. Thus 1 = E(E(M )0 ) ≥ E(E(M )t) for all t ≥ 0. Since (−Xt )-process is also BM, this shows that, replacing MT by (−Mt ), the same result holds for the (−Xt )-process. It follows that the Mt f collection is uniformly integrable and {Z0,t , Ft , t ≥ 0} is a uniformly inf f ) = E(Z0,0 ) = 1, tegrable martingale. Thus (27) is valid, giving E(Z0,T as desired. Remark. It may be shown that (27) is a good sufficient condition for (25) to be valid in that the former can be seen to be invalid if 12 is
474
VII. More on Stochastic Inference
replaced by ( 12 − δ) for some δ > 0. On the other hand there are t f )= examples showing that, even if E(exp( 12 0 f 2 (s) ds)) = ∞, E(Z0,T 1 can hold. A detailed analysis of these matters is given in, e.g., Revuz and Yor ([1], Chapter VIII). An interesting application of Girsanov’s theorem is to represent the likelihood ratio of a Gaussian process (or its measure on a canonically represented measure space) that is absolutely continuous (hence equivalent) relative to a BM. We present a result, due to Hitsuda [1], as it fits in our discussion of the subject of this chapter, and also will be useful for a comparison with the preceding work. 7. Theorem. Let X = {Xt , Ft , t ∈ [0, T ]} be a BM process on a canonical space (Ω, Σ, P ), Ω = R[0,T ] , etc., and Y = {Yt , Ft , t ∈ [0, T ]} be a Gaussian process on (Ω, Σ, Q) such that Q is dominated by (hence equivalent to) P with the likelihood ratio dQ dP = ϕ. Then Y and ϕ can be represented as: t s t (s, u) dXu ds − a(s) ds, t ∈ [0, T ], (34) Yt = X t − 0
0
0
where $\ell(\cdot,\cdot)$ is a triangular and square integrable kernel, i.e., $\ell(s,u) = 0$ for $s < u$ and $\ell \in L^2([0,T]^2, ds\,dt)$, called the Volterra kernel, $a \in L^2([0,T), dt)$; and
$$\varphi = \exp\Big\{\int_0^T\Big[\int_0^s \ell(s,u)\,dX_u + a(s)\Big]\,dX_s - \frac12\int_0^T\Big(\int_0^s \ell(s,u)\,dX_u + a(s)\Big)^2 ds\Big\}. \qquad (35)$$
The representation (34) of $Y$ is essentially unique, in that the kernel $\ell$, the measurable function $a$, and the BM process $X$ are unique except for sets of measure zero (or indistinguishable in the case of processes). Conversely, if a process $Y$ is represented by (34) relative to a BM process $X$ whose measure is $P$, then its induced measure $Q$ on the canonical space $(\Omega, \Sigma)$ satisfies $Q \sim P$, and then the likelihood ratio $\frac{dQ}{dP} = \varphi$ is given by (35). If the process $Y$ is centered, then $a = 0$ is taken in (34) and (35).
We now comment on the significance of this result and its relation with other studies on the subject. First observe that for any Gaussian measures $\mu, \nu$ on the canonical space $(\Omega, \Sigma)$ satisfying the equivalences $\nu \sim \mu \sim P$, where $P$ is the BM measure, one can obtain the likelihood ratio $\frac{d\nu}{d\mu} = \frac{d\nu}{dP}\big/\frac{d\mu}{dP}$ explicitly with (35) and the chain rule for the RN-derivatives. This has been calculated separately, with suitable
conditions on means and covariances of $\mu, \nu$, in a number of cases by Shepp [1]. Since no restrictions on means and covariances are imposed here, the resulting formulas are different in both these studies.
We give here a mere outline of the proof of the above theorem, in which Girsanov's formula plays a key role. Since the hypothesis implies $Q \sim P$, let $\varphi = \frac{dQ}{dP}$ $(> 0$, a.e.$[P])$, and $M_t = E_Q^{\mathcal F_t}(\frac1\varphi)$ where $\mathcal F_t = \sigma(X_s, s \le t) \subset \Sigma$. Then it is verified that $\{M_t, \mathcal F_t, t \ge 0\}$ is a (local) martingale, and an important result to be utilized now is that, since $\mathcal F_t$ is the σ-algebra of the BM and $M_t$ is $\mathcal F_t$-adapted, such a martingale admits a representation as an exponential:
$$M_t = \exp\Big\{\int_0^t f(s)\,dX_s - \frac12\int_0^t f^2(s)\,ds\Big\}, \quad t \in [0,T], \qquad (36)$$
with an Itô-integral. This important assertion depends on the fact that each $\mathcal F_t$-measurable function is representable as an infinite series of multiple Wiener integrals on $\Omega$. When this is established, Girsanov's theorem produces the result (34) for a suitable Volterra kernel, as desired. With this, (36) gives, via Itô's formula, the representation (35) after a detailed analysis. The asserted uniqueness is then a consequence of the properties of the (Volterra) kernel and the integral appearing in the statement. Since the complete details are long and several preliminaries are needed, we shall not include them here, especially because they are available in a recent volume by Hida and Hitsuda ([1], Section 6.4). [However, a longer outline is included as Exercise 6 below. See also Exercises 7 and 8 on related important BM representations.] The theory of Volterra integral equations and the resulting operators is also available, with a nice presentation, in a volume by Gohberg and Kreĭn [1], to which we refer the reader for further information on these matters.
For positive (semi)martingales, such (stochastic) integral representations are not surprising, as the following result implies in a general context. [We observe that an adapted process $X = \{X_t, \mathcal F_t, t \ge 0\}$, relative to a standard filtration $\mathcal F_t, t \ge 0$, is a (local) semimartingale if it can be expressed as $X_t = Y_t + Z_t$ where $\{Y_t, \mathcal F_t, t \ge 0\}$ is a (local) martingale and $\{Z_t, \mathcal F_t, t \ge 0\}$ is a process of bounded variation on each compact interval. The desired semimartingale decomposition and its analysis can be found in the companion volume (cf., Rao [21], Section V.2, p.364ff).]
8. Proposition. Let $\{X_t, \mathcal F_t, t \ge 0\}$ be a positive right continuous semimartingale on $(\Omega, \Sigma, P)$ satisfying $P[\inf_t X_t > 0] = 1$. Then it admits an integral representation as:
$$X_t = X_0 + \int_0^t X_{s-}\,dN_s, \qquad (37)$$
relative to a semimartingale $\{N_t, \mathcal F_t, t \ge 0\}$, whose right continuous version is used. In fact the representing $N_t$-process can be taken as:
$$N_t = \int_0^t (X_{s-})^{-1}\,dX_s, \qquad (38)$$
or equivalently, the $X_t$-process may be considered as an exponential martingale given by $X_t = X_0\,\mathcal E(N)_t$, $t \ge 0$, with $\mathcal E(N)_t$ defined just preceding (33) above.
A more general version of this result is found, for instance, in Rao ([24], Proposition 7.3), with applications to financial market models. Details of the latter need considerable background concepts and terminology, and we shall not include them here. It may be noted, however, that the result of Theorem 7 is more refined in the case it treats, and so sharper techniques, mentioned in the outline, are needed in comparison to the general cases. Thus, except for the complements to follow, which however also contain important discussions on Wiener chaos and Wick products, we end the present general account of the subject and turn to some interesting aspects of filtering and asymptotic properties of (certain nonparametric) estimators in the remaining two chapters of this volume.

7.6 Complements and exercises

1. In Theorem 1.3, the Gaussian process $\{X_t, t \in [a,b]\}$ was given if the observational interval is finite. By setting $a_n = a \vee (-n)$, $b_n = b \wedge n$, $n > 0$, the corresponding general statement may be obtained in the following form for nonfinite $a, b$: Consider the restriction of the process to the finite interval $[a_n, b_n]$ and suppose that the thus restricted measures $P_n^f$ and $P_n$, generated by the process on $[a_n, b_n]$, satisfy $P_n^f \sim P_n$ for each $n$. Let the corresponding likelihood ratio be given, as in Theorem 1.3, by:
$$\frac{dP_n^f}{dP_n} = \exp[\varphi_n(x) - C_n], \quad C_n \ge 0. \qquad (*)$$
Then show that $C_n \to C$ as $n \to \infty$, and that $P_f \perp P$ if $C = \infty$. In case $C < \infty$, verify that $\varphi_n \to \varphi$ and $P_f \sim P$, whose likelihood ratio is given by $(*)$ with this $\varphi$ and $C$. [This is again a careful extension of the proof given in that theorem.]
2. Establish Proposition 1.6 with the following sketch. (a) Using the fact that for the strongly continuous contractive semi-group of operators $\{V_N(\alpha), \alpha \ge 0\}$ with generator extending (or containing) $A_N$, one
has for each $f \in F$:
$$V(\alpha)f = f + \lim_{N\to\infty}\int_0^\alpha V_N(b)A_N f\,db = f + \int_0^\alpha V(b)Af\,db,$$
whenever $V_N(\alpha)f \to V(\alpha)f$ in $L^2(P)$, and that $\{V(\alpha), \alpha \ge 0\}$ is strongly continuous with generator extending $A$. (b) For any $k_N \uparrow \infty$, note that
$$\int_0^{k_N} e^{-ax}\,V_N(a)f\,da \to \int_0^\infty e^{-ax}\,V(a)f\,da,$$
as $N \to \infty$, and it suffices to verify that, with
$$(x-A)\int_0^{k_N} e^{-ax}\,V_N(a)f\,da = \frac12(\varphi_N - \varphi)\int_0^{k_N} e^{-ax}\,V_N(a)f\,da - e^{-k_N x}V_N(k_N)f + f,$$
the left side goes to $f$. The second term on the right goes to zero since $V_N(\cdot)$ is bounded. Also $\|\varphi - \varphi_N\|_2 \le \|\varphi_0 - (\varphi_0)_N\|_2$, where $\varphi = E^F(\varphi_0)$, $F = \sigma(X_t, t \ge 0)$, and $\varphi_0$ is the sum of the two stochastic integrals over $[0,1]$ given in Prop. 1.5 (cf. there for the explicit integrands), written $\varphi_0 = \psi + \eta$ (say). Since $0 \le \varphi_0 - (\varphi_0)_N \le (\psi - \psi_N) + (\eta - \eta_N)$, verify that $\|\psi - \psi_N\|_2 \le e^{-CN}$ for some $C > 0$, and that $\eta(t) - \eta_N(t) = \int_0^t F(s)\xi_N(s)\,dX_s$, where $\xi_N = \chi_{[\int_0^t F(s)\,dX_s \ge N]}$, with $F(t)$ denoting the integrand of $\eta$ (the second term of $\varphi_0$) and $|F(t)| \le C_1 < \infty$. Next show:
$$P[\eta(t) \ge N+1] \le C_1^2\int_0^t P[\eta(s) > N]\,ds,$$
implying that $P[\eta(t) \ge N] \le \frac{(C_2 t)^N}{N!}$ for some $0 < C_2 < \infty$ (upon iteration). Finally deduce that
$$\|\eta - \eta_N\|_2^2 \le \sum_{k \ge N}\frac{C_2^k e^{C_2}}{k!}(k+1-N)^2 \le \frac{C_2^N}{(N-2)!} \to 0,$$
by Stirling's formula for $N!$. Thus $\|\varphi_0 - (\varphi_0)_N\|_2 \to 0$, as desired.
3. This problem gives an application of the major results of Section 1 (or 2) and indicates the need for the generality of the hypotheses considered there. Let $\{X_t^i, \mathcal F_t, t \in [a,b] = I\}$, $i = 1,2$, be independent Gaussian processes on $(\Omega, \Sigma, P)$ with mean and covariance functions $m^i, r^i$, $i = 1,2$, such that $m^i \in L^2(I, dt)$, $r^i \in L^2(I \times I, ds\,dt)$. If $R^i$ is the integral operator determined by $r^i$, i.e., $(R^i f)(s) = \int_I r^i(s,t)f(t)\,dt$, $f \in L^2(I, dt)$, so that it is compact, let $\{\lambda_n^i, \varphi_n^i, n \ge$
$1\}$, $i = 1,2$, be the eigenvalues and the corresponding eigenfunctions of $R^i$, and suppose $\lambda_n^i > 0$, $i = 1,2$, $n \ge 1$. Consider the vector processes $Z_t = (X_t^1, X_t^2)$ and $Z_{\alpha t} = (X_t^1\cos\alpha + X_t^2\sin\alpha,\ -X_t^1\sin\alpha + X_t^2\cos\alpha)$, with $\alpha \in (-\pi,\pi)$. It is desired to test the hypothesis $H_0: \alpha = 0$ against the composite alternative $H_1: \alpha \ne 0$ (i.e., testing independence), for which one needs the likelihood ratio $\frac{dP_\alpha}{dP}$, where $P_\alpha, P (= P_0)$ are the corresponding probability measures on $(\Omega, \Sigma)$ induced by the vector processes $Z_{\alpha t}$ and $Z_t$. Verify that the conditions of Theorem 1.7 are satisfied by the following procedure, where $X_t^i: \Omega \to \mathbb R$, $Z_t: \Omega \to \mathbb R^2$ are the real and vector (planar valued) processes: (a) Let $\mathcal L = sp\{1,\ X_f^i = \int_I X_t^i f(t)\,dt,\ f \in L^2(I,dt),\ i = 1,2\} \subset L^2(P)$. Define the operators $T_\alpha: \mathcal L \to \mathcal L$ by the equations $T_\alpha X_f^1 = X_f^1\cos\alpha + X_f^2\sin\alpha$; $T_\alpha X_f^2 = -X_f^1\sin\alpha + X_f^2\cos\alpha$. Note that $D(T_\alpha X_f^1) = -X_f^1\sin\alpha + X_f^2\cos\alpha$; $D(T_\alpha X_f^2) = -X_f^1\cos\alpha - X_f^2\sin\alpha$, so that $\alpha \to D(T_\alpha X_f^i)$ is $L^2(P)$-continuous. The observable coordinates $X_n^i = \frac{1}{\sqrt{\lambda_n^i}}\int_I (X^i - m^i)(t)\,\varphi_n^i(t)\,dt$, $n \ge 1$, $i = 1,2$, form a complete
orthonormal set of $\bar{\mathcal L}$, the closure of $\mathcal L \subset L^2(P)$. (b) Suppose that the following series converge:
$$\text{(i)}\quad \sum_{j=1}^\infty \frac{1}{\lambda_j^i}\Big(\int_I m^k(t)\varphi_j^i(t)\,dt\Big)^2, \quad 1 \le i \ne k \le 2;$$
$$\text{(ii)}\quad \sum_{j,j'=1}^\infty \Big(\frac{\lambda_j^1 - \lambda_{j'}^2}{\lambda_j^1\lambda_{j'}^2}\Big)^2\Big(\int_I (\varphi_j^1\varphi_{j'}^2)(t)\,dt\Big)^2.$$
Then the induced measures $P_\alpha, P$ are mutually absolutely continuous, and Theorem 1.7 gives $\frac{dP_\alpha}{dP}$. [Hints: Verify that there is a $\Phi \in L^2(P)$ such that $\int_\Omega \Phi W_1 W_2\,dP = \int_\Omega (W_1 DW_2 + W_2 DW_1)\,dP$, $\forall W_i \in \mathcal L$, and then show that the hypothesis of Theorem 1.7 is satisfied after a careful computation. It may be seen that the convergence of each of the series in (i) is equivalent to the mutual absolute continuity of the measures associated with $X_t^i - m^i(t) + \alpha m^k(t)$, $1 \le i \ne k \le 2$. For details of these assertions and other information, the reader may refer to Pitcher [5], where some other interesting examples are also found.]
4. Here we sketch details for the existence of the m.g.f. $E(e^{\delta\psi_\alpha})$, for small $\delta \in \mathbb R$, required for Theorem 4.2. Using the notation of the text, let $\psi_{\alpha n}$ be the sum defined by equation (12), so that $\psi_{\alpha n} \to \psi_\alpha$
in $L^2(P)$-mean. Verify the result with the following outline. Choose $\delta$ such that $|\delta| < \inf_n\inf_{\alpha\in J}\frac{1+\alpha^2\tau_n}{2|\alpha|\tau_n}$, and consider
$$\int_\Omega e^{\delta\psi_{\alpha n}}\,d\mu = \int_{\mathbb R^n} e^{\delta\alpha\sum_{j=1}^n \sigma_j(\alpha)(u_j^2-1)}\,d\mu(u) = \prod_{j=1}^n e^{-\delta\alpha\sigma_j(\alpha)}\big(1 - 2\delta\alpha\sigma_j(\alpha)\big)^{-\frac12} = \prod_{j=1}^n b_j(\alpha),\ \text{(say)}. \qquad (*)$$
But $|1 - b_j(\alpha)| = O\big((\delta\alpha\sigma_j(\alpha))^2\big)$, and
$$\sum_{j\ge1}(\delta\alpha\sigma_j(\alpha))^2 = \delta^2\alpha^2\sum_{j\ge1}\frac{\tau_j^2}{(1+\alpha^2\tau_j)^2} < \infty,$$
uniformly in $\alpha \in J$, since the trace class condition implies $\sum_{j\ge1}\tau_j < \infty$. So the product $(*)$ converges as $n \to \infty$, and we can also take $\delta$ such that $1 - 2\delta\alpha\sigma_j(\alpha) > \varepsilon > 0$ for all $\alpha \in J$ and $j \ge 1$. For instance $0 < |\delta| < \frac14$ and $\varepsilon = \frac18$ will satisfy the requirements. So $b_j(\alpha) < e^{-\delta\alpha\sigma_j(\alpha)}\varepsilon^{-\frac12}$, and hence there is an $n_0$ such that $\prod_{j\ge n_0} b_j(\alpha) < 1$, $\forall\alpha \in J$. The finite product of $n_0 - 1$ terms, each continuous in $\alpha$, is uniformly bounded for $\alpha$ in the compact $J$, so there is a constant $C > 0$ such that $\prod_{j\ge1} b_j(\alpha) < C$, $\alpha \in J$. Then by Fatou's lemma, since $\psi_{\alpha n} \to \psi_\alpha$, we get
$$\int_\Omega e^{\delta\psi_\alpha}\,d\mu \le \lim_{n\to\infty}\prod_{j=1}^n b_j(\alpha) < C$$
for all $\alpha \in J$, as desired.
5. Let $T$ be a Hausdorff space and $G \subset \mathbb R^T$ be a vector subspace of functions separating points of $T$. If $\mu, \nu$ are probability measures on $\mathcal B_T$, the Borel σ-algebra of $T$, that are inner regular (hence Radon) on $\mathcal B_T$, then they are termed Gaussian if for each $f \in G$ the image measures $\mu\circ f^{-1}$ and $\nu\circ f^{-1}$ are Gaussian on $\mathbb R$. Verify that $\mu, \nu$ have the dichotomy property, i.e., either $\nu \perp \mu$ or $\nu \sim \mu$, and in any case $\frac{d\nu_c}{d\mu}$, the RN-derivative of the continuous part of $\nu$, is measurable relative to $\Sigma_G$, the σ-algebra generated by $G$, $\Sigma_G = \sigma(\cup_{f\in G} f^{-1}(\mathcal B_{\mathbb R}))$. [Hints: If $\nu \not\perp \mu$, then show (by inner regularity) that for each compact $K \subset T$, $\mu(K) = 0 \Rightarrow \nu(K) = 0$. Now since $G$ separates points of $T$, each compact $K = \cap\{B : K \subset B \in \Sigma_G,\ B \text{ closed}\}$, and by regularity $\mu(K) = \mu(B_K)$ for some such $B$. If $\mu_G = \mu|\Sigma_G$, the restriction, and similarly $\nu_G$, then by Theorem 5.1 $\mu_G, \nu_G$ have the Gaussian dichotomy, and thus $\nu_G \ll \mu_G \Rightarrow 0 = \mu(K) = \mu(B_K) \Rightarrow \nu(B_K) = \nu(K) = 0$. Observe that $\frac{d\nu}{d\mu} = \frac{d\nu_G}{d\mu_G}$, which implies the last part. This result is also due to Vakhania and Tarieladze [1], where other extensions to general topological vector spaces are formulated.]
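To see what Exercise 5 yields in a familiar case (an added illustration, using only the definitions of the exercise), take $T = C([0,1])$ with the uniform topology, and let
$$G = \Big\{x \mapsto \sum_{i=1}^n c_i\,x(t_i) : c_i \in \mathbb R,\ t_i \in [0,1],\ n \ge 1\Big\},$$
the span of the coordinate evaluations, which separates points of $T$. Then $\mu, \nu$ are Gaussian in the above sense exactly when they are the measures of Gaussian processes with continuous paths, $\Sigma_G$ is the cylinder σ-algebra, and the exercise recovers the Gaussian dichotomy of Chapter V for such process measures.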
6. Here we present a series representation of a square integrable random variable, adapted to the σ-algebra generated by BM, which was needed in Theorem 5.7, taking $Y$ as centered for simplicity. Thus if $\{X_t, t \ge 0\}$ is a BM and $\mathcal F_t = \sigma(X_s, s \le t)$, let $Y$ be an $\mathcal F_t$-adapted random variable satisfying $E(Y^2) < \infty$. Then $Y$ can be expressed as an a.e. convergent series:
$$Y = \sum_{n=1}^\infty I_n(k); \qquad \sum_{n=1}^\infty \|I_n(k)\|_2^2 < \infty, \qquad (*)$$
where $I_n(k)$ is an $n$-dimensional Wiener integral, defined as follows:
$$I_n(k) = \int\cdots\int_{\mathbb R_+^n} k(t_1,\cdots,t_n)\,dX_{t_n}\cdots dX_{t_1}, \qquad (+)$$
with $k(t_1,\cdots,t_n) = 0$ if $t_i < t_j$ for any $1 \le i < j \le n$, an $n$-dimensional Volterra kernel (see (b) below), and the $I_n(k)$ are orthogonal in $L^2(P)$. Establish $(*)$ with the following outline.
(a) Let $k: [0,T)^n \to \mathbb R$ be defined, for $0 = \tau_1 < \cdots < \tau_r$ and for $i_n < r$, as $k(t_1,\cdots,t_n) = a_{i_1,\cdots,i_n}$ if $\tau_{i_1} \le t_1 < \tau_{i_1+1} < \cdots < \tau_{i_n} \le t_n < \tau_{i_n+1}$; and $= 0$ otherwise, and set
$$I_n(k) = \int\cdots\int_{[0,T)^n} k(t_1,\cdots,t_n)\,dX_{t_1}\cdots dX_{t_n}, \quad (T \le \infty).$$
Note that $E(I_n(k)) = 0$, $n \ge 1$, and $E(I_n(k)I_m(\ell)) = \delta_{mn}(k,\ell)_n$, where
$$(k,\ell)_n = \int_0^T\int_0^{u_1}\cdots\int_0^{u_{n-1}} k(u_1,\cdots,u_n)\,\ell(u_1,\cdots,u_n)\,du_n\cdots du_1.$$
Verify that $I_n(k) \in L^2(P)$ for such $k$.
(b) If $k: [0,T)^n \to \mathbb R$ is a general Volterra kernel, i.e., a Borel measurable mapping that vanishes "above the diagonal", meaning $k(t_1,\cdots,t_n) = 0$ if $t_i < t_j$ for any $i < j$, then it can be approximated in $L^2([0,T)^n, dt_1\cdots dt_n)$ by simple kernels of the above type, and one defines $I_n(k)$ as the limit of the simple $I_n(k_i)$ given above, where $E((I_n(k_i) - I_n(k_j))^2) = \|k_i - k_j\|_n^2 \to 0$ as $i,j \to \infty$, the $k_i$ being Cauchy in $L^2([0,T)^n, dt_1\cdots dt_n)$. The thus defined $I_n(k)$ is the $n$-ple Wiener (or BM) integral. Verify that it is linear and that the $I_n(k)$ form an orthogonal set in $L^2(P)$, starting with simple $k$.
(c) Since the BM has a.a. continuous sample paths, it is useful to replace $\Omega$ with $C([0,T])$, the space of continuous functions, as a
subset of $\mathbb R^{[0,T]}$, and again denote this contracted space by $\Omega$, with its cylinder σ-algebra and its induced probability measure $P$ concentrating on this (new) $\Omega$. With this setup everything works, and we can assert the completeness of the functionals $\{I_n, n \ge 1\}$ in $L^2(P)$; but this fact uses a corresponding completeness property of Hermite polynomials in $L^2(\mathbb R, \mu_t)$, where $\mu_t$ is $N(0,t)$, $t \ge 0$. The latter is as follows. Recalling that a Hermite polynomial, of degree $n$ in $x$, is given by:
$$H_n(t,x) = \frac{(-t)^n}{n!}\exp\Big(\frac{x^2}{2t}\Big)\frac{d^n}{dx^n}\Big(\exp\Big(-\frac{x^2}{2t}\Big)\Big), \quad n = 0,1,2,\ldots,$$
verify that for $0 \le s < t$ the key relation:
$$H_n(t-s, X_t - X_s) = \int_s^t\int_s^{t_1}\cdots\int_s^{t_{n-1}} k(t_1,\cdots,t_n)\,dX_{t_n}\cdots dX_{t_1},$$
where $k(t_1,\cdots,t_n) = 1$ for $t \ge t_1 \ge \cdots \ge t_n \ge s$; and $= 0$ otherwise. The evaluation of this integral uses some combinatorial ideas with induction on $n$. Now verify that $\{H_n(t,\cdot), n \ge 0\}$ is a complete orthonormal set in $L^2(\mathbb R, \mu_t)$. Next deduce, for $T \le \infty$, that
$$H_n\Big(\int_0^T f^2(s)\,ds,\ \int_0^T f(s)\,dX_s\Big) = \int_0^T\int_0^{t_1}\cdots\int_0^{t_{n-1}} k(t_1,\cdots,t_n)\,dX_{t_1}\cdots dX_{t_n},$$
where $k(t_1,\cdots,t_n) = f(t_1)f(t_2)\cdots f(t_n)$ for $t_1 \ge t_2 \ge \cdots \ge t_n$, and $= 0$ otherwise, is a Volterra kernel. By (a), $H_n(T,\cdot) \perp H_m(T,\cdot)$ if $m \ne n$.
(d) Let $\mathcal H_n(T) = \{I_n(k) : k \in L^2([0,T]^n, dt_1\cdots dt_n)\}$, where $k$ is a Volterra kernel, and for convenience set $\mathcal H_0 = \mathbb R$. Observe that $E^{\mathcal F_t}(I_n(k)) = I_n(k')$, where $k' = k$ for $t_i < t$, and $= 0$ otherwise. With this description show that for $f \in L^2(\Omega, \Sigma, P)$, on the space $(\Omega = C([0,T)), \Sigma, P)$ of the BM, the Wiener expansion holds:
$$f = \sum_{i=1}^\infty I_i(k_i); \qquad \sum_{i=1}^\infty \|k_i\|_2^2 < \infty, \qquad (**)$$
where $k_i(t_1,\cdots,t_i)$ is a Volterra kernel on $[0,t_i]^i$, $0 < t_1 < \cdots < t_i$, and $\mathcal T = \{t_1, t_2, \ldots\}$ is any countable dense set of $[0,T)$ such that $\Sigma = \sigma(X_{t_i} : t_i \in \mathcal T)$, and the $k_i$ are uniquely determined by $f$. Moreover, $L^2(P) = \oplus_{n=0}^\infty \mathcal H_n(T)$ is a direct sum decomposition, which is often termed the Wiener-Itô chaos form.
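As a quick illustration of $(**)$ (a standard computation added here for orientation, using only the definitions above), consider $f = X_T^2$ with $T < \infty$. Since $\int_0^T\int_0^{t_1} dX_{t_2}\,dX_{t_1} = \int_0^T X_{t_1}\,dX_{t_1} = \frac12(X_T^2 - T)$ by Itô's formula, one has the two-term expansion
$$X_T^2 = T + 2\int_0^T\int_0^{t_1} dX_{t_2}\,dX_{t_1},$$
so the chaos components of $X_T^2$ are the constant $T \in \mathcal H_0$ and an element of $\mathcal H_2(T)$ with kernel $k_2 \equiv 2$ on the simplex $\{t_1 \ge t_2\}$; all other $k_i$ vanish.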
(e) If $\{Y_t, \mathcal F_t, t \ge 0\}$ is a martingale on $(\Omega, \Sigma, P)$, the Wiener measure space considered above, induced by $\{X_t, \mathcal F_t, t \ge 0\}$, then there exists a process $\{f_t, \mathcal F_t, t \ge 0\}$ such that
$$Y_t = \int_0^t f_s\,dX_s, \quad t \ge 0, \qquad (+*)$$
where the integral exists as a standard Itô integral, and in fact with the representation of $(**)$ one has (compare it with Prop. 5.8)
$$f_s = k_1(s) + \sum_{n=2}^\infty \int_0^s\int_0^{t_2}\cdots\int_0^{t_{n-1}} k_n(s, t_2,\cdots,t_n)\,dX_{t_n}\cdots dX_{t_2},$$
for $f_s \in L^2(P)$, $k_1(s)$ a constant; and then the result $(+*)$ is extended to all $f$ for which this (Itô-)integral exists. It is the representation asserted in $(*)$, and it was needed to employ Girsanov's formula to establish Theorem 5.7. [For a further detailed discussion, see Hitsuda [1] or Hida and Hitsuda [1].]
7. A problem of immediate interest, related to the integrals of the preceding exercise, is to find conditions under which the multiple Wiener integral may be expressed as a repeated integral for possible evaluations, i.e., to have a kind of stochastic Fubini theorem. As in other contexts, there are several stochastic extensions of such a result, and we include one here for applications, and comment on some other possibilities later in the Bibliographical notes. Thus let $I(f)$ be the multiple Wiener integral, so that
$$I(f) = \int\cdots\int_{[0,T]^n} f(t_1,\cdots,t_n)\,dX_{t_1}\cdots dX_{t_n},$$
where $f \in L^2([0,T]^n, dt_1\cdots dt_n)$. Consider the symmetrized function $\tilde f$ of $f$, defined as:
$$\tilde f(t_1,\cdots,t_n) = \frac{1}{n!}\sum_{(n)} f(t_{i_1},\cdots,t_{i_n}),$$
where $(n) = (i_1,\cdots,i_n)$ runs over the permutations of the indexes $(1,\cdots,n)$. Then one has
$$I(f) = n!\int_0^T\int_0^{t_n}\cdots\int_0^{t_3}\int_0^{t_2} \tilde f(t_1,\cdots,t_n)\,dX_{t_1}\,dX_{t_2}\cdots dX_{t_{n-1}}\,dX_{t_n}.$$
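For instance (a small worked case, added for orientation), take $n = 2$: then $\tilde f(t_1,t_2) = \frac12(f(t_1,t_2) + f(t_2,t_1))$ and the assertion reads
$$I(f) = \int\!\!\int_{[0,T]^2} f(t_1,t_2)\,dX_{t_1}\,dX_{t_2} = 2\int_0^T\Big(\int_0^{t_2}\tilde f(t_1,t_2)\,dX_{t_1}\Big)\,dX_{t_2},$$
the inner Wiener integral being, as a function of $t_2$, adapted, so that the outer one is an ordinary Itô integral.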
The result may be obtained first for simple functions and then extended to the general case of $f$ (for which the Itô integral exists) by
approximation. [This is also given by Itô [2], along with the development of the subject, after analyzing the original work of Wiener [1] on polynomial chaos.]
8. Using Exercise 6(d) (cf. also (7)), we define now a stochastic integral relative to a BM for integrands $f_t \in L^2(P)$, $t \le T$, not necessarily adapted to the Brownian filtration, using the so-called Wick products, obtaining a more complete generalization of the Wiener integral than the main Itô extension used so far. By 6(d), every $Y_t \in L^2(P)$ has a Wiener expansion $(**)$, and if it is $\mathcal F_t$-adapted, where $\{B_t, \mathcal F_t, t \ge 0\}$ is the BM, then as in 6(e) we have
$$\int_0^T Y_t\,dB_t = \sum_{n=1}^\infty \int_0^T\Big[\int_0^{t}\int_0^{t_1}\cdots\int_0^{t_{n-1}} k_n(t, t_1,\cdots,t_n)\,dB_{t_n}\cdots dB_{t_1}\Big]\,dB_t.$$
If, however, $Y_t$ is not $\mathcal F_t$-adapted, then the right side is still well-defined, but the left side symbol is undefined. It will be given the value of the right, and is usually called the Skorokhod integral, denoted $\int_0^T Y_t\,\delta B_t$, which thus, when $Y_t$ is adapted to $\mathcal F_t$, becomes equal to the standard Itô integral. [This was defined by Skorokhod [4], and also independently by Hitsuda [2]. It was shown in the companion volume (Rao [21], p.532) that it obeys the generalized Bochner boundedness principle, after some nontrivial computations; thus it is a new stochastic integral having the dominated convergence and other usual properties.] Here we give an alternative form of this integral through the white noise $W_t\ (= ``\frac{dB_t}{dt}")$ calculus, considered as a generalized random function. The following outline describes this procedure, which leads to a new growth of the subject, and may be used to extend the work of Section V.5.
(a) Recall that, if $S$ is the Schwartz space of rapidly decreasing $C^\infty$ functions on $\mathbb R$, then $f \in S$ iff $|x^k f^{(n)}(x)| \to 0$ as $|x| \to \infty$ for each of the integers $k, n \ge 0$, where $f^{(n)} = \frac{d^n f}{dx^n}$. With a topology defined by the sequence of norms $\|\cdot\|_{n,k}$, where $\|f\|_{n,k} = [\int_{\mathbb R} |x^k f^{(n)}(x)|^2\,dx]^{\frac12}$ (so $f_m \to f$ in this topology iff $\|f_m - f\|_{n,k} \to 0$ as $m \to \infty$ for each $n,k$), $S$ becomes a countably normed complete vector space, and $S \subset L^2(\mathbb R, dx)$ is dense. Let $S^*$ be the adjoint of $S$, so that $S \subset L^2(\mathbb R, dx) \subset S^*$ with continuous embeddings. The space $S$ is an example of a 'nuclear space'. We take $\Omega = S^*$. More precisely, in the Kolmogorov canonical representation, $S^* = \Omega \subset \mathbb R^S$, and the coordinate function $X: \Omega \to \mathbb R$ is defined, for each $f \in S$ as an index set, by $X_f(\omega) = \omega(f) = (\omega, f)$ in the duality pairing, so that $X_f = (\cdot, f)$ is a random variable for each $f \in S$. Since $S$ is dense in $L^2(\mathbb R, dx)$, for each $f^t = \chi_{(0,t)} \in L^2(\mathbb R, dx)$, $t > 0$, there exist $f_m \in S$ such that $f_m \to f^t$ in $L^2(\mathbb R, dx)$ as $m \to \infty$, and $X_{f_m} \to X_{f^t}$ pointwise. Now we introduce a Gaussian probability measure on $(\Omega, \mathcal B)$, where $\mathcal B = \sigma(S^*)$ is the cylinder σ-algebra of $\Omega$. For each $f \in S$, let $C(f) = e^{-\frac12\|f\|_0^2}$, $\|\cdot\|_0$ being the usual
norm of $L^2(\mathbb R, dx)$. Then $C(\cdot)$ is a positive definite continuous function on $S$ with $C(0) = 1$, and hence by the Bochner-Minlos theorem (cf., e.g., Gel'fand and Vilenkin [1], Thm. 2 on p.350) there is a unique probability measure $\mu$ on $(\Omega, \mathcal B)$ such that
$$C(f) = \int_\Omega e^{iX_f}\,d\mu = e^{-\frac12\|f\|_0^2}, \quad f \in S. \qquad (*)$$
Thus our basic triple is $(\Omega = S^*, \mathcal B, \mu)$, and let $L^2(\mu)$ denote the Hilbert space on it. On this space $X_{f^t}$ is identified as follows. By the bounded convergence theorem $X_{f_m} \to X_{f^t}$ as $m \to \infty$, a.e. and in $L^2(\mu)$. From this one has, with $(*)$,
$$E(e^{iX_{f_m}}) = e^{-\frac12\|f_m\|_0^2} \to e^{-\frac12\|f^t\|_0^2} = e^{-\frac t2},$$
R
= (−f , B) = (f, B ),
f ∈ S,
in the weak (or Schwartz distributional) sense. This implies W = B t on S. Extending W to L2 (R, dx) we get Wt = W (f t ) = dB dt and since 2 Bt ∈ L (μ) we need to describe the space where B = W lives. It is termed the white noise process and Wt : S → L1 (μ), i.e., Wt (f ) is an integrable random variable. For each f ∈ S, W (f ) is normally distributed with mean zero and variance f 20 and W lives in a space containing L2 (μ) denoted (S)∗ , adjoint space to (S), to be defined,and is dense in L2 (μ). We now describe the space (S) and hence its adjoint (S)∗ . Note that the Schwartz space of test functions S, above, can be alternatively described. Consider the densely defined differential operd2 2 2 ator A = − dx 2 + x + 1 on L (R, dx), for which a Hermite function is an eigenfunction, i.e., one has Aen = (n + 2)en , n = 0, 1, 2, . . . , 1
x2
n −x2
e is a Hermite function, A−1 where en (x) = (−1)n (π)− 4 2− 2 e 2 d dx n 2 is a bounded operator on L (R, dx) with A−1 = 12 and for any p > 12 , A−p is HS. If f p = Ap f 2 , f ∈ L2 (R, dx) which is seen 1 to be ( n≥0 (2n + 2)2p (f, en )2 ) 2 where (·, ·) is the inner product of L2 (R, dx), then Sp = {f ∈ L2 (R, dx) : f p < ∞} is a Hilbert space n
and one finds $S = \cap_{p\ge0} S_p$ with $S_0 = L^2(\mathbb R, dx)$, the $p$ being integers ($S$ is the 'projective limit' of the $S_p$'s, in the terminology used in such cases, cf., e.g., Chapter I, Rao [21] on these limits). In the stochastic case of $L^2(\mu)$, with the above differential operator $A$, we associate the "second quantization operator" $\tilde A$, defined on $L^2(\mu)$ as follows: for each $f$ which thus admits the Wiener expansion as in Exercise 6(d) above, i.e., $f = \sum_{n\ge0} I_n(g_n)$, define
$$\tilde A f = \sum_{n\ge0} I_n(A^{\otimes n} g_n)$$
for all $f$ satisfying $\sum_{n\ge0} n!\,(A^{\otimes n}g_n, A^{\otimes n}g_n)_0 < \infty$, where $(\cdot,\cdot)_0$ is the inner product of $L^2(\mathbb R, dx)$. Then $\tilde A^{-1}$ is a bounded linear operator on $L^2(\mu)$ with unit norm, and $(\tilde A)^{-p}$, $p > 1$, is HS. Let $\|\xi\|_p = (\tilde A^p\xi, \tilde A^p\xi)^{\frac12}_{L^2(\mu)}$, and $(S_p) = \{\xi \in L^2(\mu) : \|\xi\|_p < \infty\}$, with $\|\cdot\|_0 = \|\cdot\|_{L^2(\mu)}$. Then $(S_p)$ is a Hilbert space, and $(S) = \cap_{p\ge1}(S_p)$ is a countably Hilbert space (a nuclear space). Moreover, one has $(S) \subset L^2(\mu) \subset (S)^*$, the last being the adjoint space of the first, $L^2(\mu)$ being identified with its adjoint, and the embeddings are continuous. [For details, see Kuo [1], p.20.] Here $(S)$ has exactly the same properties as $S$ above, and $W_t = W(f^t) = B_t' = (\frac{dB_t}{dt})$ takes values (or lives) in $(S)^*$.
(b) We now introduce a new product in $(S)^*$ to define the desired generalization of the Itô integral for anticipative functions. Using the symmetrizations of Exercise 7, for $\tilde f, \tilde g \in L^2(\mu)$ we have
$$\tilde f = \sum_{n\ge0} a_n\tilde I_n(f_n) = \sum_{n\ge0}\int_{\mathbb R^n} f_n(t)\,dB_t^{\otimes n},$$
$$\tilde g = \sum_{n\ge0} b_n\tilde I_n(g_n) = \sum_{n\ge0}\int_{\mathbb R^n} g_n(t)\,dB_t^{\otimes n},$$
where $B_t^{\otimes n}$ is the tensor product of $B_t$ with itself $n$ times, i.e., a shorthand for the notation used before. These representations are unique. We now define the Wick product of $\tilde f, \tilde g$, denoted $\tilde f \diamond \tilde g$, as:
$$\tilde f \diamond \tilde g = \sum_{m,n\ge0} a_m b_n\,\tilde I_{m+n} = \sum_{n\ge0}\int_{\mathbb R^n}\Big(\sum_{i+j=n}\tilde f_i\tilde g_j\Big)(t)\,dB_t^{\otimes n},$$
provided $\|\tilde f \diamond \tilde g\|_{L^2(\mu)} < \infty$. If $F, G \in (S)^*$, so that
$$F = \sum_\alpha a_\alpha H_\alpha, \qquad G = \sum_\beta b_\beta H_\beta,$$
where each $\alpha$ (resp. $\beta$) is a finite set $n_1, \ldots, n_k$ corresponding to such finite products of Hermite functions $e_{n_i}(\omega)$, with $|\alpha| = n_1 + \cdots + n_k$ (all these functions form a complete orthonormal system in $L^2(\mu)$), then
$$F \diamond G = \sum_{\alpha,\beta} a_\alpha b_\beta H_{\alpha+\beta}.$$
This product always exists in $(S)^*$. Moreover the Wick product on $(S)^*$ is commutative, associative, and distributive; and $\tilde f \diamond \tilde g \in (S)^*$.
(c) Let $Z_t \in (S)^*$, $\tilde f \in (S)$. Then $(Z, \tilde f): \mathbb R \times \Omega \to \mathbb R$ is jointly measurable, and if $(Z_{(\cdot)}, \tilde f)(\omega) \in L^1(\mathbb R, dx)$ for a.a. $\omega$, for each such $\tilde f$, then $Z_{(\cdot)}$ is weak* integrable and $\int_{\mathbb R} Z_t\,dt \in (S)^{***}$. However, in our case, since $(S)$ is nuclear, the integral belongs to $(S)^*$ itself, an analog of "Pettis's integrability". We then have the following result. If $Y_t$ is Skorokhod integrable and $W$ is the white noise, then $Y \diamond W$ is weakly (or Pettis or $(S)^*$-) integrable, and on any interval $[a,b]$ one has
$$\int_a^b Y_t\,\delta B_t = \int_a^b (Y\diamond W)_t\,dt,$$
and in case $Y_t$ is adapted to the $B_t$-filtration, then $(Y\diamond W)_t$ becomes the ordinary product, and writing $W_t\,dt$ as $dB_t$, the right side (hence the left side) integral reduces to the Itô integral. For the general case we need to use the whole machinery; the proof thus involves several details. [See Kuo [1], p.246, or Holden, Øksendal, Ubøe, and Zhang [1], p.52.] We included all of this to show the ideas in extending the nonanticipative integrands in Itô's definition to the anticipative (i.e., the general) case. For a different approach to anticipating integrands and the corresponding calculus, see Nualart ([1], Chapter I). The use of Wick products with Itô's multiple Wiener integration, as well as the Itô-Wick calculus, were detailed in Dobrushin and Minlos [1], and independently, in a different context, in Hida and Ikeda [1].
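As a single worked instance tying Exercises 6-8 together (an added illustration; the computation uses only the definitions above), take the anticipating integrand $Y_t \equiv B_T$ on $[0,T]$. Then
$$\int_0^T B_T\,\delta B_t = \int_0^T (B_T \diamond W)_t\,dt = B_T \diamond B_T = \tilde I_2(\chi_{[0,T]}\otimes\chi_{[0,T]}) = 2\int_0^T\!\!\int_0^{t_1} dB_{t_2}\,dB_{t_1} = B_T^2 - T,$$
by the chaos computation noted after $(**)$ in Exercise 6(d). The pathwise product would instead give $B_T\int_0^T dB_t = B_T^2$, with mean $T$; the Wick correction $-T$ restores the zero mean that the Skorokhod integral shares with the Itô integral, and for adapted integrands the two coincide.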
Bibliographical notes
As in the earlier work on simple hypotheses, the Neyman-Pearson fundamental lemma for composite hypotheses applicable to processes, given in Proposition 1.1, is also due to Grenander [1]. The calculations of likelihood ratios in the latter case are more involved even when only one of the two hypotheses is composite. We have treated the results in a graded fashion in Sections 1 and 2, and most of them are taken from Pitcher's fundamental work, as noted already in the text proper. The
key point here is the realization that suitable (semi-)group operations should be introduced and the composite hypothesis testing problem translated into this framework and solved, i.e., so as to be able to apply the general Grenander formulation of the Neyman-Pearson lemma. We have presented some salient features of Pitcher's work, explaining the crucial points involved in the solution. A precursor, involving advanced tools, is the classical Behrens-Fisher composite hypothesis problem in the finite dimensional case, discussed in some detail in Section II.5, exemplifying the early ideas for problems awaiting solutions in the case of processes. These considerations have been extended when both the hypothesis and the alternative are composite, for processes that need not be Gaussian. This is detailed in Section 3, and a key application to statistical communication theory is presented in Section 4. The material here, in a somewhat streamlined form with slightly enlarged explanations, is mainly from Velman [1]. It was a thesis written under Pitcher's direction, which extends some of his work; however, this was not widely known until now, and we have therefore included it here for the benefit of other researchers. The analysis leads to some evolution type operators to be used crucially, highlighting some new techniques and details necessary for the subject. The inclusion was made possible by the kind communication of the material to the author by Dr. Velman himself, at the suggestion of Prof. Pitcher, some years ago, and the author is glad to be able to present it for publication in this monograph. It should be noted that in Sections 1-3 topological algebra methods are utilized more than ever before, since this illuminates completely the structure of the problem. Perhaps in the future such techniques will find better use in the subject.
Now in the above mentioned application to statistical communication, we considered both the signal and noise to be independent Gaussian, involving a parameter in their distributions to be tested. The fact that for any pair of Gaussian measures the dichotomy holds plays a crucial role in these calculations. We have established this dichotomy in Chapter V, and several other proofs of this classical Hájek-Feldman theorem exist. For another method, see C. R. Rao and Varadarajan [1], and yet another in Alekseev [1]. Here the relation between Kakutani's dichotomy for an independent (not necessarily Gaussian) collection and the Gaussian dichotomy has been explored in more detail. In another proof of that theorem, given in Guichardet [1], he remarks, on p.105, that under a specialization Kakutani's [1] theorem can be applied to the Gaussian dichotomy, but he did not present details of the idea. In the recent demonstration, due to Vakhania and Tarieladze [1], this result is proved basing it precisely on Kakutani's theorem, which we followed.
Similarly, we included an alternative form (in addition to the one in Section V.5) of Girsanov's theorem, along the lines given in Hitsuda [1], especially the version in Hida and Hitsuda [1]. Moreover some new applications of Girsanov's theorem are to be found in finance mathematics. Although we could not include the latter here for space reasons, they may be seen, for instance, in an article of the author's (cf., Rao [24], where other references on the subject can be found). Also the integral representation of continuous (local) martingales adapted to a Brownian filtration of σ-algebras is of interest in establishing a large class of likelihood ratios of Gaussian processes, as evidenced in Theorem 5.7. Still much remains to be done in this area, as no algorithm for actual calculations exists.
Multiple stochastic integrals appear in many applications. For example, they were already encountered in Exercise IV.6.9 as well as in the work at the end of Section IV.5. For this reason we have treated the material related to such integrals at some length in Exercise 6.6 above. All of this is closely related to Itô's extension of Wiener's [1] polynomial chaos. A detailed treatment of the latter is also found in Neveu [2]. In the complements we included additional applications (cf., Exercises 6.2 and 6.3) from Pitcher's work to elaborate the treatment of Section 2. The stochastic Fubini type result has many forms. A very general result on this topic appears in Green [1], where he treats a two parameter version for quasimartingales using the general Bochner boundedness principle (cf., Rao [21], Sec. VI.2). On the other hand, allowing the integral in $I_n(f)$ of Exercise 6.7 to be anticipative, as in Exercise 6.8, an extension of Itô's integral is defined, and a Fubini theorem is given in Berger and Mizel [1]. By this new method they obtained a different Fubini theorem with a correction term, so that the theorem takes a new form. It will be interesting to see how the results of Exercise 8 and of Berger and Mizel appear under Green's general hypothesis. These extensions enrich the subject. After having noted these possibilities, we conclude this general study and proceed to certain other specialized aspects of the subject in the remainder of this work.
Chapter VIII Prediction and Filtering of Processes
This chapter is devoted to a different class of applications, complementing the preceding work. The first section contains a comparative analysis of general prediction operations relative to a convex loss function, and its relation to projection operators. This is refined, in the next section, for least squares prediction with the Cramér-Hida method. Then Section 3 treats linear filters as formulated by Bochner [2]. The results are specialized and sharpened in Section 4 for linear Kalman-Bucy filters, of interest in many applications. Then in Section 5 we consider nonlinear filtering, which is a counterpart of the preceding, showing that there are many new possibilities, as well as illustrating the essential use of the general theory of SDEs in this subject. Thus Sections 3-5 contain mathematical glimpses of some of the vast filter technology. Finally some related complements are included as exercises, often with sketches of proof.
8.1 Predictors and projections
Let $\{X_t, t \in I\}$ be a process on $(\Omega, \Sigma, P)$ such that $X_t \in L^p(P)$, $p \ge 1$, $t \in I$. The general problem here is to predict the closest value $\hat X_{t_0}$ of $X_{t_0}$, $t_0 \notin I$, based on the observed process $X_t, t \in I$, relative to some measure of closeness of $|X_{t_0} - \hat X_{t_0}|$. The basic idea here is similar to that employed in Section III.3 on estimation of an unknown parameter, with the difference that the "parameter" $X_{t_0}$ is now a random variable. But this is analogous to the familiar Bayes estimation already studied (cf., Section III.2). We restate the earlier considerations as employed in the present context and then analyze the problem in detail, since the observed collection $\{X_t, t \in I\}$ now typically consists of an infinite set of elements.
Let $W: \mathbb R \to \mathbb R^+$ be a symmetric convex function vanishing at the origin. If $X_{t_0}$ is as above, $E(W(X_{t_0})) < \infty$, then a real (Borel) function $X$ of $\{X_t, t \in I\}$ such that $E(W(X)) < \infty$ is termed the best predictor of $X_{t_0}$ if the following relation holds:
$$E(W(X_{t_0} - X)) = \inf\{E(W(X_{t_0} - Y)) : Y = g(X_t, t \in I)\}. \qquad (1)$$
The existence and uniqueness of such $X$, as well as approximating it by certain simpler functionals (e.g., those that depend on finite sets of the $X_t$ instead of the $g$ above), is of importance in the subject. Now $W(x) = |x|^p$, $p \ge 1$, is an important example, which is often employed in applications, and this will always be illustrative of the following work. Indeed, if $W(x) = x^2$, (1) is simply the traditional least squares prediction problem, but the general case reveals the structure of the subject better, as we already explained and utilized in Sections III.3-III.4. Also note that the best predictor $X$ in (1) need not be a linear function of the observed $X_t$'s, and therefore it is termed a nonlinear predictor to emphasize this fact. The linear case will be discussed later.
To motivate the subject, first we discuss the classical least squares prediction. Thus $W(x) = x^2$, and let $\{X_t, t \in T = [a,b] \subset \mathbb R\}$ be an observed process. If $\mathcal B_T = \sigma(X_s, s \in T)$, then by (1) it is desired to find $Y_0 = f(X_s, s \in T)$, a Borel function of the observations (hence $\mathcal B_T$-measurable) with $Y_0 \in L^2(\mathcal B_T)$, to be the best predictor of $X_{t_0}$, $t_0 > b$. That such a $Y_0$ exists uniquely and is given by $Y_0 = Q(X_{t_0})$, where $Q$ is the orthogonal projection of $L^2(\Sigma)$ onto $L^2(\mathcal B_T)$, is a classical result of the least squares approximation theory. Moreover $Q = E^{\mathcal B_T}$, the conditional expectation relative to $\mathcal B_T$. These statements are in fact special cases of Theorem III.2.6, which was established in the analysis of Bayes estimation in Chapter III. Note that $Y_0$ will not be a linear function of the $X_t$'s unless the process is Gaussian. However, replacing $L^2(\mathcal B_T)$ by the (generally smaller) closed linear span of the $X_t$'s, one gets $Y_0$ as a linear function, and then the resulting problem is termed a linear prediction, which we discuss later, since one has to use different (usually Fourier analysis type) techniques. In the present case, taking a countable set from $T$ and assuming the process to be continuous in $L^2(P)$-mean, we can approximate $Y_0$. In fact the following assertion holds.
1. Proposition. Let $\mathcal B_n = \sigma(X_{t_1}, \cdots, X_{t_n})$, the σ-algebra determined by the observations $X_{t_1}, \ldots, X_{t_n}$, so that, adding new observations, one has $\mathcal B_n \subset \mathcal B_{n+1} \subset \Sigma$, where the process $\{X_t, t \in T, X_{t_0}\} \subset L^2(\Omega, \Sigma, P)$ is mean continuous. Then with $Y_n = E^{\mathcal B_n}(X_{t_0})$, $n \ge 1$, one has $Y_n \to Y_0 = E^{\mathcal B_T}(X_{t_0})$ in mean as well as point-wise a.e., if we take a sequence denoted by $t_1 < t_2 < \cdots$.
Proof. In fact, since $\mathcal B_n \subset \mathcal B_{n+1}$ implies $E^{\mathcal B_n} = E^{\mathcal B_n}E^{\mathcal B_{n+1}}$, we have $Y_n = E^{\mathcal B_n}(X_{t_0}) = E^{\mathcal B_n}(E^{\mathcal B_{n+1}}(X_{t_0})) = E^{\mathcal B_n}(Y_{n+1})$, so that $\{Y_n, \mathcal B_n, n \ge 1\}$ is a uniformly integrable martingale. Hence the classical martingale convergence theorem implies both statements of the proposition.
It is natural to consider the problem for a convex loss function $W(\cdot)$, since there is no intrinsic reason to consider the quadratic case, other than the simplicity in computations, as observed already by Gauss (cf., Section III.1). Thus, as in our earlier work on estimation, we consider the prediction problem again for more general convex loss functions $W$, and in later sections return to second order processes for an in depth analysis (especially with filtering theory) of standard applications. The present general account will also serve as an incentive for future investigations, besides being useful for further applications itself; different methods, essential in these extensions, are illuminating.
To make the problem somewhat familiar and simpler to solve, we restate (1) as follows. If $W(x) = |x|^p$, $p \ge 1$, then the set of functions $f: \Omega \to \mathbb R$ for which $E(W(f)) < \infty$ is the standard $L^p(\Omega, \Sigma, P)$ or $L^p(\Sigma)$, with norm $\|f\|_W = (\int_\Omega W(f)\,dP)^{\frac1p}$. The corresponding collection with a general convex $W(\cdot)$ is the space $L^W(\Omega, \Sigma, P)$ or $L^W(\Sigma)$ of all $f: \Omega \to \mathbb R$ such that $E(W(kf)) < \infty$ for some $k (= k_f) > 0$. This is the precise generalization of the $L^p(\Sigma)$-space, with a (gauge) norm $\|\cdot\|_W$ given by
$$\|f\|_W = \inf\Big\{k > 0 : \int_\Omega W\Big(\frac fk\Big)\,dP \le 1\Big\}.$$
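As a quick check (a worked one-liner added here), for $W(x) = |x|^p$, $1 \le p < \infty$, the gauge norm reduces to the usual Lebesgue norm:
$$\inf\Big\{k > 0 : \int_\Omega \Big|\frac fk\Big|^p dP \le 1\Big\} = \inf\{k > 0 : k \ge \|f\|_p\} = \|f\|_p,$$
so the Orlicz space notation below genuinely extends, and never conflicts with, the $L^p$ case.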
The change in definitions of norms transforms (1) into a somewhat different type, but gives an equivalent solution. Thus $(L^W(\Sigma), \|\cdot\|_W)$, the Orlicz space, becomes the familiar Lebesgue space when $W(x) = |x|^p$, $p \ge 1$, and, utilizing the norm symbol, (1) reduces to finding $X$ such that
$$\|X_{t_0} - X\|_W = \inf\{\|X_{t_0} - Y\|_W : Y \in L^W(\Sigma),\ Y = g(X_t, t \in I)\}.$$
Here $Y$, a function of the (observable) process $\{X_t, t \in I\}$ in $L^W$, is understood as one measurable relative to $\mathcal B = \sigma(X_t, t \in I)(\subset \Sigma)$, the σ-algebra generated by the $X_t$'s, and thus $L^W(\mathcal B) \subset L^W(\Sigma)$ is a closed subspace. Thus (1) is reformulated as: find an $X \in L^W(\mathcal B)$ such that
$$\|X_{t_0} - X\|_W = \inf\{\|X_{t_0} - Y\|_W : Y \in L^W(\mathcal B)\}. \qquad (2)$$
It may be noted that (2) and (1) are identical if $W(x) = |x|^p$, but take on different forms when $W(\cdot)$ is a more general (symmetric) convex function. This is not surprising, since solutions of these problems depend on (the type of) loss functions. [In the general case of $W$, they are the norm and 'modular' functionals, which coincide when $W(x) = |x|^p$, $p > 1$.] However one can use the same format as in the power case, but one needs some detailed aspects of Convex Analysis, especially "subdifferentials" and their properties in that context. This additional effort, needed to obtain the stated form of our application, does not seem worthwhile. [See, e.g., Kozek [1], who has to spend additional space and explanation to present a solution of the $p$th power type for just a lower bound of an estimator.] Moreover, an advantage of (2) is that one can immediately use the familiar and simple normed space techniques. This is why we stay with this easy (and still relatively general) form and present a solution of the problem which coincides with the classical case when $W(x) = |x|^p$, which is our principal concern. See Corollary 5 below, where the modular view is also discussed.
If $W$ is strictly convex and $W(2x) \le KW(x)$, $x \ge x_0 \ge 0$, called a $\Delta_2$-condition (e.g., $W(x) = |x|^p$, $1 < p < \infty$), then $L^W(\mathcal B)$ is a closed strictly convex set, and the classical Banach space results imply that there exists a unique minimal element $X \in L^W(\mathcal B)$ for any given $X_{t_0} \in L^W(\Sigma)$. Consequently the mapping $\pi_{\mathcal B}: X_{t_0} \to X$ is well-defined on $L^W(\Sigma) \to L^W(\mathcal B)$, and is called the prediction operator. [In the general case, for each $X_{t_0}$ such a unique $X$ may not exist, and hence the operator $\pi_{\mathcal B}$ need not be well-defined, as one may verify by considering $W(x) = |x|$. Also generally $\pi_{\mathcal B}$ depends on $X_{t_0}$.]
The operator $\pi_{\mathcal B}$ above, when defined, still need not be linear, but may have the following properties, somewhat analogous to the conditional expectation operator. Thus consider the statements:
(i) $\pi_{\mathcal B}(aX) = a\pi_{\mathcal B}(X)$, a.e., for $a \in \mathbb R$;
(ii) $\pi_{\mathcal B} = I$, the identity, on $L^W(\mathcal B)$;
(iii) $\pi_{\mathcal B}(X + Y) = \pi_{\mathcal B}(X) + Y$, a.e., for $Y \in L^W(\mathcal B)$;
(iv) $\pi_{\mathcal B}(XY) = Y\pi_{\mathcal B}(X)$, a.e., if $Y \in L^W(\mathcal B)$ and $XY \in L^W(\Sigma)$;
(v) if $\mathcal B_1 \subset \mathcal B_2 \subset \Sigma$ are σ-algebras, then $\pi_{\mathcal B_1}(\pi_{\mathcal B_2}(X)) = \pi_{\mathcal B_1}(X)$, a.e., for $X \ge 0$ a.e.;
(vi) if $X \ge 0$, a.e., then $\pi_{\mathcal B}(X) \ge 0$, a.e.;
(vii) if $X$ is bounded, then $\pi_{\mathcal B}(X)$ is also bounded and $supp(\pi_{\mathcal B}(X))$ is a.e. contained in $supp(X)$. [$supp(X)$ is the support set of $X$.]
Let $\mathcal P$ be the set of functions $W$ for which $\pi_{\mathcal B}\ (= \pi_{\mathcal B}^x: L^W(\Sigma) \to L^W(\mathcal B))$ exists and has properties (i)-(vii) above. Then $\mathcal P$ is not empty, and in fact contains the functions $W: x \to W(x) = |x|^p$, $1 < p < \infty$. It is actually possible to define $\pi_{\mathcal B}^x$ in the general case as a closed (not necessarily bounded) linear projection on $L^W(\Sigma)$ with range in $L^W(\mathcal B)$ for
which all the above statements are true. This is based on the properties of "quasi-complements". These are typically non-unique, but can be employed for a finite (or at most a countable) collection of $X_{t_0}$'s. This is discussed in the author's paper (cf., Rao [3]), where $\pi_{\mathcal B}^x$ was termed a "closed conditional expectation" (cf., Definition 4 on p.107 there). It should also be remarked here that a conditional expectation operator itself, on a Banach function space based on a measure space, need not be bounded, much less a contraction, unless the norm is restricted to having a so-called Jensen property. This is however satisfied for the usual norm functionals in the Lebesgue and Orlicz spaces. [See, for an example illustrating this pathology and the significance of Jensen's property, Rao [10], p.342 and p.348.] The requirement that $\pi_{\mathcal B}^x$ be independent of $x$ (i.e., is the same operator defined on all of $L^W(\Sigma)$), in exchange for dropping the linearity, restricts the class of $W$'s, but it is of interest in applications as well. For the most part below we drop the '$x$' from $\pi_{\mathcal B}^x$, and the reader can understand the work for the narrower class of $W$'s from $\mathcal P$. The distinction is important to recognize, and substituting one for the other can lead to incorrect analysis and misinterpretations.
The verification of the above properties is easy for the most part. For instance, if $\mathcal B_1 \subset \mathcal B_2$, so that $L^W(\mathcal B_1) \subset L^W(\mathcal B_2)$, and since the infimum on a larger set is smaller than that on a smaller one, (v) follows. The others may be verified in the general case with quasi-complements, or also directly. Moreover, $(\pi_{\mathcal B})^2 = \pi_{\mathcal B}$, although it need not be linear. However there is an interesting (nontrivial) relation between this operator and the conditional expectation $E^{\mathcal B}$, the latter being a linear contractive projection on $L^W(P)$. It will be recorded in the following proposition. This and several other properties below are due to Andô and Amemiya [1] when $W(x) = |x|^p$, $1 < p < \infty$.
2. Proposition. Let $W'$ be the right derivative of the convex function $W$ on $\mathbb R^+$ of class $\mathcal P$, and for convenience set $W'(-x) = -W'(x)$ for $x \ge 0$. If the space $L^W(\Sigma)$ is reflexive and $W'$ is continuous, then for each $X \in L^W(\Sigma)$, bounded, and $\mathcal B \subset \Sigma$, a σ-algebra, one has
$$E^{\mathcal B}(W'(X - \pi_{\mathcal B}(X))) = 0, \quad \text{a.e.} \qquad (3)$$
In particular, if $W(x) = |x|^p$, $1 < p < \infty$, this is automatic.
Remark. If $W(x) = x^2$, then $W'$ is linear, and (3) implies that $E^{\mathcal B}(X) = \pi_{\mathcal B}(X)$, since $\pi_{\mathcal B}(X)$ is $\mathcal B$-measurable, so that $E^{\mathcal B} = \pi_{\mathcal B}$, and the prediction operator is linear. Conditions under which $\pi_{\mathcal B}$ is linear when $W(x) = ax^2$ $(a \ne 1)$ will be discussed later, since it helps to understand the (nonlinear) prediction problem better. Also it is known that $L^W(\Sigma)$ is reflexive if there are constants $x_0 \ge 0$, $K > 0$, $\delta > 0$
such that $W(2x) \le KW(x)$ [called the $\Delta_2$-condition] and $W(2x) \ge (2+\delta)W(x)$, $x \ge x_0$ [called the anti-$\Delta_2$ or $\nabla_2$-condition]. [See, e.g., Rao and Ren [1], p.23 and p.112.]
Proof of Proposition. For $X \in L^W(\Sigma)$ and $\mathcal B \subset \Sigma$, let $Y = \pi_{\mathcal B}(X)$. Then it is a basic fact of the Orlicz space theory that $W'(X-Y)$ and $ZW'(X)$ are integrable for all $Z \in L^W(\Sigma)$. Moreover, the function $\varphi: t \to \|X_0 + tZ\|_W$, where $X_0 = X - Y \ne 0$, is differentiable at $t = 0$ (and if $X_0 = 0$ then (3) is true and trivial). The (strong or Fréchet) derivative $\varphi'(0)$ can be calculated, and it is known to be (cf., the preceding reference, p.280):
$$\varphi'(0) = \frac{1}{k_0}\int_\Omega Z\,W'\Big(\frac{X_0}{\|X_0\|_W}\Big)\,dP, \quad Z \in L^W(\mathcal B), \qquad (4)$$
where $k_0 = \int_\Omega \frac{X_0}{\|X_0\|_W}\,W'\Big(\frac{X_0}{\|X_0\|_W}\Big)\,dP > 0$. This is not trivial, and a detailed computation is given in that reference. But by definition of $Y$, $\varphi(\cdot)$ has a unique minimum at $t = 0$, and hence $\varphi'(0) = 0$. In particular, taking $Z = \chi_A$, $A \in \mathcal B$, in (4), one obtains, since $0 < k_0 < \infty$ and $\chi_A \in L^W(\mathcal B)$,
$$\int_A W'\Big(\frac{X_0}{\|X_0\|_W}\Big)\,dP = 0, \quad \forall A \in \mathcal B. \qquad (5)$$
But $L^W(\Sigma) \subset L^1(\Sigma)$, so that (5) implies (3), by the very definition of the conditional expectation $E^{\mathcal B}$.
Another useful property of $\pi_{\mathcal B}$ is that the prediction $\pi_{\mathcal B_n}$ becomes better as $\mathcal B_n \subset \mathcal B_{n+1}$ is considered, which is intuitively clear; but a precise (nontrivial) statement is provided by the following:
3. Proposition. (a) Let $\mathcal B_1 \subset \mathcal B_2 \subset \cdots \subset \mathcal B_n \subset \Sigma$ be σ-subalgebras, and $X \in L^W(\Sigma)$, where the convex function $W$ satisfies both the $\Delta_2$ as well as $\nabla_2$-conditions (so $L^W(P)$ is reflexive). If $X_k = \pi_{\mathcal B_k}(X)$ and $A_k \in \mathcal B_k$ are disjoint, such that $\Omega = \cup_{k=1}^n A_k$ (i.e., a refined partition of $\Omega$), then:
$$\|X - X_1\|_W \ge \Big\|X - \sum_{k=1}^n X_k\chi_{A_k}\Big\|_W = d_0 \ge \|X - X_n\|_W. \qquad (6)$$
(b) If $W$ satisfies both $\Delta_2$ and $\nabla_2$ as in (a), then for any given $\varepsilon > 0$ there is an integer $n_\varepsilon > 1$ such that for $n \ge n_\varepsilon$ there are a corresponding partition $\{A_1, \cdots, A_k\}$, as in (6), with $A_i \in \mathcal B_{n+i}$, $\Omega = \cup_{i=1}^k A_i$, for which one has:
$$\Big\|X_n - \sum_{i=1}^k X_{n+i}\chi_{A_i}\Big\|_W < \varepsilon. \qquad (7)$$
In particular, (7) holds if $L^W(\Sigma)$ is a uniformly convex space, so that $W(x) = |x|^p$, $1 < p < \infty$, is included.
Proof. We first establish (6). Thus let $Y_n = \sum_{k=1}^n X_k\chi_{A_k}$. The last inequality of (6) is immediate from definition, since both $X_n, Y_n \in L^W(\mathcal B_n)$ and $X_n$ is the unique closest element by (2). Regarding the other inequality, since $W$ satisfies the $\Delta_2$-condition, the definition of the gauge norm implies that
$$1 = \int_\Omega W\Big(\frac{X - Y_n}{d_0}\Big)\,dP = \sum_{i=1}^n\int_{A_i} W\Big(\frac{(X - X_i)\chi_{A_i}}{d_0}\Big)\,dP = \sum_{i=1}^n\int_\Omega W\Big(\frac{[X - \pi_{\mathcal B_i}(X)]\chi_{A_i}}{d_0}\Big)\,dP,$$
by property (iv) of $\pi_{\mathcal B_i}$, since $W \in \mathcal P$,
$$= \sum_{i=1}^n\int_{A_i} W\Big(\frac{d_i}{d_0}\cdot\frac{X_{d_i} - \pi_{\mathcal B_i}(X_{d_i})}{d_i}\Big)\,dP,$$
by property (i) of $\pi_{\mathcal B_i}$, where $X_{d_i} = X\chi_{A_i}$ and $d_i = \|X\chi_{A_i} - \pi_{\mathcal B_i}(X\chi_{A_i})\|_W$,
$$\le \sum_{i=1}^n\int_{A_i} W\Big(\frac{d_i}{d_0}\cdot\frac{[X_{d_i} - \pi_{\mathcal B_1}(X)]\chi_{A_i}}{d_i}\Big)\,dP, \quad \text{since } L^W(\mathcal B_1) \subset L^W(\mathcal B_i),$$
$$= \int_\Omega W\Big(\frac{X - \pi_{\mathcal B_1}(X)}{d_0}\Big)\,dP.$$
Hence $d_0 \le \|X - \pi_{\mathcal B_1}(X)\|_W$, establishing (6).
A Digression. We now establish (7) after a discussion of some abstract analysis to be used here. Recall that a Banach space $(\mathcal X, \|\cdot\|)$ is termed uniformly convex if for each $0 < \varepsilon \le 2$ and any pair of elements $x_i \in \mathcal X$, $\|x_i\| = 1$, $i = 1,2$, with $\|x_1 - x_2\| > \varepsilon$, there is a $\delta_\varepsilon > 0$ (depending just on $\varepsilon$) such that $\|x_1 + x_2\| < 2(1 - \delta_\varepsilon)$. It is well-known that $\mathcal X = L^p(\mu)$, $1 < p < \infty$, is uniformly convex for any measure $\mu$ on $(\Omega, \Sigma)$, and its norm is (uniformly) Fréchet differentiable. In any uniformly convex space $(\mathcal X, \|\cdot\|)$, if a sequence $\{x_n, n \ge 1\} \subset \mathcal X$ satisfies $\|x_n\| \to \|x\|$ and $x^*(x_n) \to x^*(x)$ for each $x^* \in \mathcal X^*$, then $\|x_n - x\| \to 0$. This is stated as: (k): weak convergence plus convergence of norms of a sequence in $\mathcal X$ implies the strong (or norm) convergence of the sequence ['k' for Kadec, who introduced it]. The last property also holds with a somewhat weaker hypothesis than uniform convexity, of interest in applications, as follows. Let the norm of the adjoint space $\mathcal X^*$ of $\mathcal X$ be Fréchet (or F-)differentiable (which is automatic, indeed uniformly, if $\mathcal X$ is uniformly convex); then again the strong convergence holds
under the above conditions. The hypothesis (k) implies reflexivity of $\mathcal X$ (but not uniform convexity). These statements are well-known in abstract analysis (cf., e.g., Day [1], pp. 112-113), and the particular implication with F-differentiability is also proved separately by the author (cf., Rao [8], Theorem 2.1). We observe that if $\mathcal X = L^W(\Sigma)$ with $W$ satisfying $\Delta_2$ and $\nabla_2$ simultaneously, then $L^W(\Sigma)$ is reflexive and $(L^W(\Sigma))^*$ has an F-differentiable norm (cf., Rao and Ren [1], Sections 7.2 and 7.3, where it is also noted that such a space is not necessarily uniformly convex but is isomorphic to a uniformly convex Orlicz space). Consequently, if $f_n \in L^W(\Sigma)$, $f_n \to f$ weakly, and $\|f_n\|_W \to \|f\|_W$, then $\|f_n - f\|_W \to 0$. With this auxiliary information at hand, let us now complete:
Proof of (b). By (a) we have
$$d_n = \|X - X_n\|_W \ge \|X - Y_n\|_W \ge \|X - Y\|_W = d_0, \qquad (8)$$
where $X_n = \pi_{\mathcal B_n}(X)$, and we assert that $X_n \to Y$ strongly. In fact, this is deduced as follows. Let $Z_n = X - X_n$, so that $d_n = \|Z_n\|_W \ge d_0$. To see that there is equality in the limit, note that for any $\varepsilon > 0$ we can find $Y_\varepsilon \in \cup_{n\ge1} L^W(\mathcal B_n)$, a dense subspace of $L^W(\mathcal B_\infty)$, such that $\|Y - Y_\varepsilon\|_W < \varepsilon$. So $d_0 \le \|X - Y_\varepsilon\|_W \le \|X - Y\|_W + \|Y - Y_\varepsilon\|_W < d_0 + \varepsilon$. Since $Y_\varepsilon \in L^W(\mathcal B_n)$ for some $n$, this implies $d_0 \le d \le d_n \le d_0 + \varepsilon$, where $d = \lim_n d_n$, and hence $d = d_0$, from the arbitrariness of $\varepsilon > 0$. Now $\{Z_n, n \ge 1\}$ is in a ball of radius $d_1$ and thus is bounded in $L^W(\Sigma)$. The latter is a reflexive space, and it is classical that closed balls (or all closed bounded sets) are weakly sequentially compact in reflexive spaces. Consequently there is a weakly convergent subsequence $\{Z_{n_k}, k \ge 1\}$ with limit $Z \in L^W(\Sigma)$. But $d_{n_k} = \|Z_{n_k}\|_W \to d_0$ and $X = X_{n_k} + Z_{n_k} \to Y_0 + Z$ weakly for some $Y_0 \in L^W(\mathcal B_\infty)$, with $\|Z\|_W = \|X - Y_0\|_W = d_0$. By the uniqueness of the minimal element in $L^W(\mathcal B_\infty)$, $Y_0 = Y$ a.e. This implies that each subsequence of $\{Z_n, n \ge 1\}$ has a further convergent subsequence with the same limit $Z$, so that the whole sequence converges to the same limit $Z$, and then $X_n \to Y$ weakly. Also $\|Z_n\|_W \to \|Z\|_W$, so that (by the preceding abstract analysis) $Z_n \to Z$ strongly, and hence $X_n \to Y$ in norm. This argument applied to (8) shows that $X - Y_n \to X - Y$ strongly, and consequently $Y_n \to Y \leftarrow X_n$ also strongly as $n \to \infty$. Hence (7) follows. The discussion prior to this proof shows that the last statement is true as well.
We first discuss the probabilistic meaning of the function $Y_n$ in (6) and (7), and then restate the proposition for a better perspective. Here one uses the concept of a (simple) stopping time of the increasing sequence of σ-algebras, or filtration, $\{\mathcal B_n, n \ge 1\}$ of $\Sigma$. Thus a mapping
$\tau: \Omega \to \mathbb N \cup\{\infty\}$ is termed a stopping time (or an optional) of the filtration $\{\mathcal B_n, n \in \mathbb N\}$ if $[\tau = n] \in \mathcal B_n$, $n \ge 1$, which is equivalent to stating $[\tau \le n] \in \mathcal B_n$, $n \ge 1$. Also $\tau$ is simple if it only takes finitely many finite values. The class $\mathcal B(\tau) = \{A \in \Sigma : A \cap [\tau = n] \in \mathcal B_n, \forall n \in \mathbb N\}$, which is seen to be a σ-algebra, is termed the set of events prior to $\tau$. The following statements are special cases of those found in many standard works on the calculus of stopping times (cf., e.g., the companion volume Rao [21], p.242, and Sec. V.4.2). A simple stopping time $\tau$ of the filtration is thus of the form:
$$\tau = \sum_{k=1}^n k\chi_{A_k} + n\chi_{A_{n+1}}, \quad A_k \in \mathcal B_k,\ \cup_{k=1}^{n+1} A_k = \Omega,$$
where the $A_k$ are disjoint, so that $A_k = [\tau = k]$. One writes $X_\tau$ for $(X_\tau)(\omega) = X_{\tau(\omega)}(\omega)$, $\omega \in \Omega$. It is then seen that for an adapted process $\{X_n, \mathcal B_n, n \ge 1\}$ the composition $X_\tau$ is $\mathcal B(\tau)$-measurable (hence a random variable), and the following maximal inequality is true for any $0 < a < \infty$:
$$P\big[\max_{n\in\mathbb N} |X_n| > a\big] \le \frac1a\,\sup\{E(|X_\tau|) : \tau \text{ simple}\}. \qquad (9)$$
In fact, for each $n \in \mathbb N$, the set $A_n = \{\max_{k\le n}|X_k| > a\}$ can be expressed as a disjoint union $A_n = \cup_{k=1}^n\{|X_j| \le a,\ j \le (k-1),\ |X_k| > a\} = \cup_{k=1}^n B_k$ (say), and let $\tau_1 = \sum_{k=1}^n k\chi_{B_k}$. Then $\tau_1$ is a simple stopping time, and using all such times of the filtration one has:
$$\sup_\tau E(|X_\tau|) \ge E(|X_{\tau_1}|) \ge E(|X_{\tau_1}|\chi_{A_n}) \ge aP(A_n).$$
Dividing by $a$ and letting $n \to \infty$, (9) follows.
With this concept of stopping times, we can restate (6) and (7) as:
3'. Proposition. Let $\mathcal B_1 \subset \mathcal B_2 \subset \cdots \subset \mathcal B_n \subset \Sigma$ be σ-algebras and $X \in L^W(\Sigma)$, $W \in \Delta_2 \cap \nabla_2$. If $X_k = \pi_{\mathcal B_k}(X)$, then there exists a simple stopping time $\tau$ of $\{\mathcal B_k, 1 \le k \le n\}$ such that
$$\|X - X_1\|_W \ge \|X - X_\tau\|_W \ge \|X - X_n\|_W. \qquad (8')$$
Moreover, for any $\varepsilon > 0$, there exist an $n_\varepsilon$ and a simple stopping time $\tau (= \tau_\varepsilon) \ge n_\varepsilon$ of the finite filtration such that
$$\|X - X_\tau\|_W \le \|X - X_{n_\varepsilon}\|_W, \quad \text{and} \quad \|X_\tau - X_{n_\varepsilon}\|_W < \varepsilon. \qquad (7')$$
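To connect (8') with (6) explicitly (a brief added remark, using only the notation above): for the partition $\{A_k\}_{k=1}^n$ of Proposition 3(a), the map $\tau = \sum_{k=1}^n k\chi_{A_k}$ is a simple stopping time of $\{\mathcal B_k\}$, and
$$X_\tau = \sum_{k=1}^n X_k\chi_{A_k} = Y_n,$$
so the middle term of (6) is exactly $\|X - X_\tau\|_W$; thus (8') is (6) read through stopped predictors, and (7') similarly restates (7).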
Using these two propositions, it is possible to establish the point-wise convergence also of the prediction sequence $\{X_n = \pi_{\mathcal B_n}(X), n \ge 1\}$. Actually we can abstract the ideas here and prove a result that includes both the $L^p$ and $L^W$ spaces as well as some others. This is a class of Banach function spaces on a probability space (σ-finite measures can be admitted, but we shall restrict to the finite case) that are endowed with a partial ordering, i.e., they are Banach lattices. These include the familiar $L^p(\Sigma)$ and $L^W(\Sigma)$ spaces. Here the norm on measurable real functions is axiomatically prescribed by abstracting the properties of the concrete Lebesgue and Orlicz spaces, and hence must satisfy monotonicity and certain limit relations on increasing sequences. We describe the spaces in a convenient form for Theorem 4 below.
Let $\mathcal M$ be the set of all real measurable functions on $(\Omega, \Sigma, P)$. It is classical that $\mathcal M$ becomes a complete metric space under the metric derived from $(f,g) \to E\big(\frac{|f-g|}{1+|f-g|}\big)$, $f, g \in \mathcal M$, which is equivalent to convergence in probability. Let $\rho: \mathcal M \to \bar{\mathbb R}^+$ be a positive homogeneous subadditive functional (i.e., a norm) which additionally has the Fatou property. Thus: (i) $\rho(f + bg) \le \rho(f) + |b|\rho(g)$, $f, g \in \mathcal M$, $b \in \mathbb R$; (ii) $\rho(f) = \rho(|f|)$; (iii) $\rho(f) = 0$ iff $f = 0$ a.e.; and (iv) $0 \le f_n \in \mathcal M$, $f_n \uparrow f \Rightarrow \rho(f_n) \uparrow \rho(f)$. (The last is the so-called Fatou property.) Then the set $L^\rho(\Sigma) = \{f \in \mathcal M : \rho(f) < \infty\}$ becomes a Banach lattice, so that $\rho(f_n - f_m) \to 0$ for $f_m, f_n \in L^\rho(\Sigma)$ implies there is an $f \in L^\rho(\Sigma)$ satisfying $\rho(f_n - f) \to 0$, and $0 \le f \le g$ a.e. $\Rightarrow \rho(f) \le \rho(g)$. For instance, if $\rho(f) = \|f\|_W$ [or $\rho(f) = \|f\|_p$, $1 \le p < \infty$], then the above definition gives $L^\rho = L^W$ [or $= L^p$]. Additionally, we assume from now on that $L^\rho(\Sigma)$ contains all $P$-essentially bounded functions from $\mathcal M$ and that $L^\rho(\Sigma) \subset L^1(\Sigma)$, both of which are automatic for the spaces $L^W(\Sigma)$ [or $L^p(\Sigma)$] on the probability triple $(\Omega, \Sigma, P)$. The space $L^\rho(\Sigma)$ is called a Riesz space; if moreover the inclusions are continuous, it is sometimes termed a normed Köthe space. With such a $\rho$ there is an associate norm $\rho'$, given by $\rho'(f) = \sup\{|\int_\Omega fg\,dP| : \rho(g) \le 1\}$, which also satisfies conditions (i)-(iv), and $L^{\rho'}(\Sigma)$ is a subspace (under a natural identification) of the adjoint space $(L^\rho(\Sigma))^*$. These spaces, with various applications, have been extensively studied by Zaanen and his students. We only need a tiny part of that theory here, and refer to a more detailed treatment in the text book by Zaanen ([1], Chapter 15).
For Banach lattices, a weaker condition than uniform convexity is uniform monotonicity: $L^\rho$ is uniformly monotone if for each $\varepsilon > 0$ there exists $\delta (= \delta_\varepsilon) > 0$ such that for $0 \le f_i \in L^\rho$, $\rho(f_1) = 1$, $\rho(f_1 + f_2) < 1 + \delta \Rightarrow \rho(f_2) < \varepsilon$; or equivalently, if $0 \le f_2 \le f_1$ and $\rho(f_1) = 1$, $\rho(f_1 - f_2) > \varepsilon \Rightarrow \rho(f_2) \le 1 - \delta$. Such spaces are weakly sequentially complete, and, as an example, $W \in \Delta_2 \Rightarrow L^W$ is uniformly monotone (but not necessarily uniformly convex). A few other properties that
are needed will be listed here. The norm $\rho(\cdot)$ is said to be absolutely continuous if $f_n \downarrow 0 \Rightarrow \rho(f_n) \downarrow 0$; in this case not only are simple functions dense in $L^\rho(\Sigma)$, but $(L^\rho(\Sigma))^* = L^{\rho'}(\Sigma)$ under the standard isometric identification, and moreover $L^\rho$ is reflexive iff both norms $\rho, \rho'$ are absolutely continuous (and then both have the Fatou property). It is strictly convex if $\rho(\alpha f + \beta g) < \alpha\rho(f) + \beta\rho(g)$ for $0 < \alpha = 1 - \beta < 1$ and $f \ne g$ on a set of positive $P$-measure. These are detailed in Zaanen [1], and a finer analysis of the adjoint spaces may be found in Gretsky [1]. Thus if $\rho(\cdot) = \|\cdot\|_W$, it is an absolutely continuous norm iff $W \in \Delta_2$, and the space $L^W$ is reflexive iff $W \in \Delta_2 \cap \nabla_2$, so that $(L^\rho(\Sigma))^{**} = (L^{\rho'}(\Sigma))^* = L^\rho(\Sigma)$ in this specialization.
Let $U: L^\rho(\Sigma) \to \mathbb R^+$ be a mapping, abstractly given, and satisfying the conditions: (i) $U(0) = 0$; (ii) $U(\alpha f + \beta g) \le \alpha U(f) + \beta U(g)$ for any $f, g \in L^\rho(\Sigma)$ with $0 \le \alpha = 1 - \beta \le 1$; (iii) $U(f + g) \ge U(f) + U(g)$; (iv) (strongly) continuous, i.e., $U(f_n) \to 0$ if (iff) $\rho(f_n) \to 0$ as $n \to \infty$. Hereafter such $U$ will be termed a (strong) modular on $L^\rho(\Sigma)$. Also if $U(|f_n - f|) \to 0$ as $n \to \infty$, then we say $f_n \to f$ in $U$-mean. It is immediate that $U(f) = \int_\Omega W(f)\,dP$, $\|f\|_W \le 1$, $W \in \Delta_2$, satisfies (i)-(iv) and defines a strong modular. In particular $W(x) = |x|^p$, $p < \infty$, is covered. Moreover, if $\rho$ is absolutely continuous with the Fatou property, then $(L^\rho(\Sigma))^* = L^{\rho'}(\Sigma)$, and one can also verify that it is weakly sequentially complete, i.e., every weak Cauchy sequence converges to an element in $L^\rho(\Sigma)$ (via the classical Vitali-Hahn-Saks theorem). With this background, we can now present the desired result on the convergence of the prediction sequences. It essentially follows the format of Bru and Heinich [1]:
4. Theorem. Let $L^\rho(\Sigma)$ be a uniformly monotone normed Köthe space introduced above, $U: L^\rho(\Sigma) \to \mathbb R^+$ be the strong modular functional just defined, and $\rho$ an absolutely continuous (function) norm with property (k), which thus is strictly convex. Suppose moreover that $U$ is additive on pairs of functions of disjoint supports and that $U$-convergence is equivalent to $\rho$-convergence. Then for each $X \in L^\rho(\Sigma)$ there is a unique element $Y_n \in L^\rho(\mathcal B_n)$ closest to $X$ for the modular, i.e., $U(|X - Y_n|) = \inf\{U(|X - Y|) : Y \in L^\rho(\mathcal B_n)\}$, where $\mathcal B_n = \sigma(X_1, \cdots, X_n)$, $n \ge 1$, the $X_k$, $1 \le k \le \infty$, being the observed sequence (so that $Y_\infty \in L^\rho(\mathcal B_\infty)$, $\mathcal B_\infty = \sigma(X_n, n \ge 1)$). Further, $Y_n \to Y_\infty$ in $U$-mean as well as point-wise a.e., as $n \to \infty$.
Proof. The argument uses the fact that a bounded sequence in $L^\rho$, under the present hypothesis, is relatively weakly sequentially compact and (using weak sequential completeness) has a convergent subsequence determining the desired $Y_n$ (with the strict convexity condition). This implies that $Y_n$ is closest to $X$ in the modular sense. The other
The other assumptions imply that $Y_n \to Y_\infty$ weakly and that their modulars also converge, leading to the $U$-mean convergence; the pointwise convergence is then deduced with the help of Proposition 3$'$. We now fill in the details.

Let $a = \inf\{U(|X - Y|) : Y \in L^\rho(\mathcal{B})\}$ for a given $X \in L^\rho(\Sigma)$; then $0 \le a < \infty$. Choose $X_n \in L^\rho(\mathcal{B})$ such that (by definition of the infimum) $a = \lim_{n\to\infty} U(|X - X_n|)$. So $\{X_n, n \ge 1\} \subset L^\rho(\mathcal{B}) \subset L^1(\mathcal{B})$ is a bounded set, and we assert that it is relatively weakly compact in $L^1(\mathcal{B})$. Since the measure space is finite, the latter is well known to be equivalent to uniform integrability (cf., e.g., Dunford and Schwartz [1], IV.8.11), and hence we verify the latter condition. If $B_n \in \mathcal{B}$ is any sequence satisfying $P(B_n) \to 0$, let $X_n' = X_n \chi_{B_n^c}$. Then it is evident that for all $\omega \in \Omega = B_n \cup B_n^c$:

$$|X - X_n|\chi_{B_n}(\omega) + |X - X_n'|(\omega) = |X|\chi_{B_n}(\omega) + |X - X_n|(\omega). \tag{10}$$

Since $P(B_n^c) \to 1$ and $X_n' \in L^\rho(\mathcal{B})$, we see (by the Fatou property of the norm) that the $X_n'$ also form a minimizing sequence, i.e., $a = \lim_{n\to\infty} E(U(|X - X_n|)) = \lim_{n\to\infty} E(U(|X - X_n'|))$. On the other hand, since $\rho$ is an absolutely continuous norm, $E(U(X\chi_{B_n})) \to 0$. Consequently, one has

$$a = \lim_{n\to\infty} E(U(|X - X_n'|)) \le \liminf_{n\to\infty} E(U(|X - X_n| + |X|\chi_{B_n})) = \lim_{n\to\infty} E(U(|X - X_n|)) = a, \tag{11}$$
since $\rho(X\chi_{B_n}) \to 0$. Thus there is equality throughout. On the other hand, from the left side of (10) and (11), together with the uniform monotonicity of $\rho$ and the equivalence of the $U$-mean and $\rho$-convergences, we conclude that $\rho((X - X_n)\chi_{B_n}) \to 0$ as $n \to \infty$. But from the inequality $|X_n\chi_{B_n}| \le |X - X_n|\chi_{B_n} + |X|\chi_{B_n}$ and the fact that $L^\rho$ is continuously embedded in $L^1$, one deduces, since the sequence $\{B_n, n \ge 1\} \subset \mathcal{B}$ is arbitrary with just $P(B_n) \to 0$, that $\{X_n, n \ge 1\}$ is uniformly integrable as a set in $L^1(\mathcal{B})$. So it is relatively weakly sequentially compact, and hence has a weakly convergent subsequence $X_{n_k} \to X_0$, with $X_0 \in L^1(\mathcal{B})$ since the latter space is weakly sequentially complete. Thus for each simple function $f \in L^\infty(\mathcal{B})$, we get (by the weak convergence)

$$E(X_{n_k} f) \to E(X_0 f). \tag{12}$$
Our assumptions imply $(L^\rho(\Sigma))^* = L^{\rho'}(\Sigma)$, and since the embeddings $L^\infty \subset L^\rho \subset L^1$ are continuous, the simple functions of $L^{\rho'}(\Sigma)$ are norm determining. So (12) implies that $X_{n_k}$ converges weakly to $X_0$ in $L^\rho(\mathcal{B})$. Moreover, by the Fatou property (of $\rho$, and hence) of $U$, one gets

$$a \le E(U(|X - X_0|)) \le \liminf_{k\to\infty} E(U(|X_{n_k} - X|)) = a,$$
so that $X_0$ is a best predictor of $X$. Every infinite subsequence of the $X_n$'s determines such an $X_0$, and by the strict convexity of $\rho$ and of $U$ we deduce that any two such limits $X_0, X_0'$ satisfy $X_0' = X_0$ a.e.; hence each convergent subsequence has the same limit, implying that the whole sequence $X_n \to X_0$ weakly in $L^\rho(\mathcal{B})$. But then $X - X_n \to X - X_0$ weakly and $\rho(X - X_n) \to \rho(X - X_0)$, and we deduce that $X_0$ is the unique element of $L^\rho(\mathcal{B})$ which is a best predictor of $X$, with $E(U(|X - X_0|)) = a$. This establishes the existence of a unique minimal element for each given $X$.

Next suppose that $\mathcal{B}_n \subset \mathcal{B}_{n+1} \subset \Sigma$ is a sequence of $\sigma$-subalgebras, and consider the corresponding best predictors $Y_n \in L^\rho(\mathcal{B}_n)$ of $X$, whose existence has just been established. Then $\{Y_n, n \ge 1\} \subset L^\rho(\mathcal{B}_\infty) = \overline{\cup_n L^\rho(\mathcal{B}_n)}$ is a sequence with the properties (since the union is dense in the former) that $E(U(|X - Y_\infty|)) \le \lim_n E(U(|X - Y_n|))$, and that [$L^\rho(\Sigma)$ being reflexive] $X - Y_n \to X - Y_0$ weakly as well as in norm. Consequently, with the assumed condition (k) of the $L^\rho$-space, used here for the first time, this sequence is Cauchy with limit $X - Y_0$. By the strict convexity of ($\rho$ and of) $U$, these limits are unique, and so we get the result that $Y_n \to Y_\infty$ ($= Y_0$ a.e.) in $U$-mean.

The pointwise convergence is obtained as in Proposition 3$'$, using simple stopping times; this may be sketched as follows. Let $Y_* = \liminf_n Y_n$ and $Y^* = \limsup_n Y_n$ for the best predictors $Y_n$. Then there exist simple optionals $\tau_n \uparrow \infty$ and $\bar\tau_n \uparrow \infty$ of the filtration $\{\mathcal{B}_n, n \ge 1\}$ such that $Y_{\tau_n} \to Y_*$ and $Y_{\bar\tau_n} \to Y^*$ a.e. Indeed, by definition, for given $\varepsilon > 0$ there exist a subsequence $n_k$, $k \ge 1$, and an $n_0$ such that $P[|Y^* - Y_{n_k}| < \varepsilon] > 1 - \varepsilon$ if $n_k \ge n_0$. Then let $\bar\tau_k = \inf\{k > n_0 : |Y^* - Y_{n_k}| < \varepsilon\}$; this is well defined since $P[|Y^* - Y_{n_k}| < \varepsilon] > 1 - \varepsilon$ for a suitable $n_0$, exactly as in (9). These $\bar\tau_k$ satisfy the desired conditions. Similarly we find $\tau_k$, with a (possibly different) subsequence $n_k'$, such that the second limit holds. Hence

$$P[Y^* - Y_* > \varepsilon] \le P\big[\lim_k |Y_{\bar\tau_k} - Y_{\tau_k}| \ge \tfrac{\varepsilon}{2}\big] \le \frac{2}{\varepsilon}\,E(|Y_{\bar\tau_k} - Y_{\tau_k}|), \ \text{as in (9)}, \le \frac{2}{\varepsilon}\,\lim_k \rho(Y_{\bar\tau_k} - Y_{\tau_k})\,\rho'(1) = 0,$$

since the $Y_n$'s form a Cauchy sequence and a H\"older type inequality is valid for the $L^\rho$-spaces. Finally, letting $\varepsilon \downarrow 0$ through a sequence, we get $Y^* = Y_* = Y_\infty$ a.e. This establishes all statements. $\Box$
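To see what the theorem asserts in the most familiar setting, the following numerical sketch (all names and parameter choices are illustrative assumptions, not from the text) approximates the best predictor of $X$ given $\mathcal{B}_1 = \sigma(X_1)$ for the modular $U(f) = E|f|^p$ of the $L^p$ case treated in Corollary 5 below, by minimizing the modular over (discretized) $\mathcal{B}_1$-measurable functions; for $p = 2$ the minimizer is the conditional expectation.

```python
# Numerical sketch of the modular best predictor in the L^p case (illustrative).
# Minimize U(|X - Y|) = E|X - Y|^p over Y measurable w.r.t. sigma(X_1).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n = 20000
X1 = rng.normal(size=n)
X = X1**2 + 0.5 * rng.normal(size=n)        # target to be predicted

def best_predictor(p, bins=25):
    """Approximate argmin_Y E|X - Y|^p over sigma(X_1)-measurable Y by
    discretizing X_1 into bins; the modular is separable across bins."""
    edges = np.quantile(X1, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, X1) - 1, 0, bins - 1)
    Y = np.empty(n)
    for b in range(bins):
        xs = X[idx == b]
        # one-dimensional strictly convex problem for p > 1: unique minimizer
        c = minimize_scalar(lambda c: np.mean(np.abs(xs - c)**p)).x
        Y[idx == b] = c
    return Y

for p in (2.0, 1.5):
    Y = best_predictor(p)
    print(p, np.mean(np.abs(X - Y)**p))     # attained modular value a
```

For $p = 2$ the per-bin minimizer is the bin mean, recovering the conditional expectation; for other $p > 1$ the strict convexity guarantees the uniqueness asserted in the theorem.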
As noted already, the conditions of this theorem hold for Young functions $W \in \Delta_2 \cap \nabla_2$ (in particular for $W(x) = |x|^p$, $1 < p < \infty$), and hence the following result, simpler to state, is a consequence of the above. As discussed in the Digression, we can assume that the derivative $W'$ is continuous (with $W'(x) > 0$ for $x > 0$), in view of the known structure theory of Orlicz spaces (cf., e.g., Rao and Ren [1], p. 297), and that the adjoint space has an F-differentiable norm. This gives the following consequence:

5. Corollary. Let $W \in \Delta_2 \cap \nabla_2$ be a strictly convex Young function and $L^W(\Sigma)$ the corresponding Orlicz space on $(\Omega, \Sigma, P)$. Then for each $X \in L^W(\Sigma)$ and $\mathcal{B}_n \subset \mathcal{B}_{n+1} \subset \Sigma$, there is a best predictor $Y_n \in L^W(\mathcal{B}_n)$ relative to the modular $U$ defined as $U(f) = \int_\Omega W(f)\,dP$, and moreover $Y_n \to Y_\infty$ ($\in L^W(\mathcal{B}_\infty)$, $\mathcal{B}_\infty = \sigma(\cup_n \mathcal{B}_n)$) in $U$-mean as well as pointwise a.e., as $n \to \infty$. In particular, taking $W(x) = |x|^p$, $1 < p < \infty$, the statement holds for $L^p(\Sigma)$ with $U(f) = \int_\Omega |f|^p\,dP$.

If only $W \in \Delta_2$ with $W'(x) > 0$ for $x > 0$ and continuous, there is a similar result on the existence (and then the convergence of a certain sequence) of best predictors, which need not be unique. It is then possible to develop some analogs of the above results, as in Shintani and And\^o [1], and the reader is referred to that work for further details in the case $W(x) = |x|$.

The preceding analysis shows that $\pi_B^x$, the prediction operator on $L^\rho(\Sigma)$ with range $L^\rho(\mathcal{B})$, is either (i) a closed (not necessarily bounded) linear operator depending on $X$, or (ii) $\pi_B$, independent of $X$ but not necessarily linear. The latter is well defined if $L^\rho(\Sigma)$ is also a strictly convex space. The question of the linearity of $\pi_B$ in the second case, when $L^\rho(\Sigma)$ is not (isomorphic to) a Hilbert space, is of intrinsic interest, since linear analysis is simpler and well understood. This question was primarily investigated by And\^o [1] for the $L^p(\Sigma)$ spaces, and extended to $L^W(\Sigma)$ by the author (cf. Rao [4(d)]). Here we outline a solution of this problem for comparison. The result will not be needed in the applications considered in this volume, but it reveals an intrinsically nontrivial structure of these operators which should be recognized. It may be noted that for the existence of a closest element to $X$, we used only the fact that $L^\rho(\mathcal{B})$ is closed and convex. The problem is meaningful for any closed subspace $\mathcal{S}$ of $L^W(\Sigma)$, where $W \in \Delta_2 \cap \nabla_2$, and the prediction mapping $\pi_{\mathcal{S}}$ on $L^W(\Sigma)$ with range $\mathcal{S}$ is idempotent but generally nonlinear. Thus $\|X - \pi_{\mathcal{S}}(X)\|_W \le \|X - Y\|_W \le \|X\|_W$, $\forall Y \in \mathcal{S}$. If $\pi_{\mathcal{S}}$ is linear, then this implies $(I - \pi_{\mathcal{S}})$ is a contractive (linear) projection. Consequently, the adjoint operator $Q = (I - \pi_{\mathcal{S}})^*$, acting on $(L^W(\Sigma))^*$ with range $\mathcal{S}^\perp$, the annihilator of $\mathcal{S}$, is a contractive projection (cf., e.g., Dunford and Schwartz [1], p. 72). But a classical discussion of these matters shows that $\mathcal{S}^\perp$ is isometrically equivalent to the adjoint $(L^W(\Sigma)/\mathcal{S})^*$ of the quotient space. This leads to characterizing the quotient spaces $L^W(\Sigma)/\mathcal{S}$, or equivalently the structure
of the space $\mathcal{S}^\perp$ which may be the range of a contractive projection. A solution to the problem can be presented as follows; it is due to And\^o [1] when $W(x) = |x|^p$, $0 < p \ne 2 < \infty$ ($p = 2$ being classical), to Douglas [1] when $W(x) = |x|$, and to the author [4(d)] when $W \in \Delta_2$. For a clear understanding, we include a general proposition which is valid for many Banach spaces but which is stated for the function spaces of immediate interest in this discussion.

6. Proposition. Let $\mathcal{M} \subset L^W(\Sigma)$ be a closed linear manifold admitting a prediction operator $\pi_{\mathcal{M}}$ (i.e., for each $X \in L^W(\Sigma)$ there exists a unique $Y \in \mathcal{M}$ such that $\|X - Y\|_W = \inf\{\|X - Z\|_W : Z \in \mathcal{M}\}$; such an $\mathcal{M}$ is usually termed a \v{C}eby\v{s}ev subspace). Then $\pi_{\mathcal{M}}$ is linear iff $\mathcal{M}$ has a complementary subspace $\mathcal{N}$ (i.e., a closed linear manifold with direct sum $\mathcal{M} \oplus \mathcal{N} = L^W(\Sigma)$ and $\mathcal{M} \cap \mathcal{N} = \{0\}$) such that $\mathcal{N}$ is the range of a contractive projection on $L^W(\Sigma)$.

Proof. If $\pi_{\mathcal{M}}$ is linear, then as already noted above, $Q = I - \pi_{\mathcal{M}}$ is a contractive projection and clearly $\mathcal{N} = Q(L^W(\Sigma))$ is a complementary manifold. Conversely, if $\mathcal{N}$ is a complementary manifold that is the range of a contractive projection $Q$, then the operator $\pi_{\mathcal{M}} = I - Q$ is a (linear) projection with range $\mathcal{M}$ ($=$ null space of $Q$) and $\mathcal{N}$ as its null space. So

$$\|X - \pi_{\mathcal{M}}(X)\|_W = \|Q(X)\|_W = \|Q(X - Z)\|_W \le \|X - Z\|_W, \qquad \forall Z \in \mathcal{M}.$$
Thus $\pi_{\mathcal{M}}(X)$ is a best ($=$ closest) predictor of $X$ for the \v{C}eby\v{s}ev subspace $\mathcal{M}$, and $\pi_{\mathcal{M}} : X \to \pi_{\mathcal{M}}(X)$ is a linear prediction operator. $\Box$

The question of characterizing the subspaces $\mathcal{M}$ admitting linear prediction operators onto them (or admitting contractive projections onto $\mathcal{M}^\perp$) now takes center stage. This becomes essentially useless if we additionally demand that $\mathcal{M} = L^W(\mathcal{B})$ for some $\sigma$-algebra $\mathcal{B} \subset \Sigma$, since then, for instance if $P$ is a diffuse measure and $L^W$ is not (isomorphic to) a Hilbert space, $\mathcal{M}$ must be trivial, as can be verified by examples. It is the isomorphism that allows generality. [Recall that a reflexive Orlicz space $L^W(\Sigma)$ is isomorphic to a uniformly convex $L^{\tilde W}(\Sigma)$ for an equivalent Young function $\tilde W$, where the former need not even be strictly convex, but $L^{\tilde W}$ has all the pleasant properties!] In this sense the following result has sufficient interest and content. Its proof is long and will not be included here, for the reasons already noted.

7. Theorem. Let $\mathcal{M} \subset L^W(\Sigma)$ be a \v{C}eby\v{s}ev subspace with $\pi_{\mathcal{M}}$ as a prediction operator on it. Then $\pi_{\mathcal{M}}$ is linear iff the quotient space $L^W/\mathcal{M}$ is isometrically isomorphic to an $L^{\tilde W}(\tilde{\mathcal{B}})$ on some measure space $(\tilde\Omega, \tilde{\mathcal{B}}, \tilde P)$, the mapping preserving maximal support sets (in the usual sense) of the concerned spaces.

This result, as well as the earlier comments, shows that if one wants to attach a linear operator to $\pi_B^x$ at $X$ (and extend it for at most a countable collection of $X$'s), one should settle for something less: for a closed (not necessarily bounded) operator, and then the theory of "quasi-complements" enters. These problems (carrying a certain nonuniqueness with them) lead to involved constructions even in abstract classical analysis, and therefore have less applicational potential. On the other hand, if the classes of processes are restricted or specialized (e.g., Gaussian), then $\pi_B = E^{\mathcal{B}}$ can hold, as noted in Theorem III.2.7. Thus we omit further treatment of the general theory, and turn to methods associated with linear problems of special interest, applicable to numerous second order processes, in the following sections.

8.2 Least squares prediction: the Cram\'er-Hida approach

As a continuation of the preceding work, we consider second order (not necessarily Gaussian) processes for linear prediction, using interesting new ideas and methods developed independently by Cram\'er and Hida in the early 1960s. The necessary mathematical tools and formulas have already been discussed for the proof of Theorem V.2.7, and we shall use them as needed; Hilbert space techniques will be exploited. First we restate the Hellinger-Hahn representation in a sharper form. For this the following reduction is of special interest.

Thus let $\{X_t, t \in G\} \subset L^2(\Sigma)$ be a second order process on $(\Omega, \Sigma, P)$, where the index set is $G \subset \mathbb{R}$ or $\mathbb{Z}$, so that both the continuous and discrete parameter cases are considered at the same time. Let $\mathcal{H}_t = \overline{sp}\{X_s, s \le t, s, t \in G\}$, $t \in G$, be the closed linear span in $L^2(\Sigma)$, $\mathcal{H}_{-\infty} = \cap_{t\in G}\, \mathcal{H}_t$, and $\mathcal{H}_\infty = \overline{sp}\{X_s, s \in G\}\ (= \overline{sp}(\cup_t \mathcal{H}_t))$. The space $\mathcal{H}_{-\infty}$ is termed the remote past, $\mathcal{H}_t$ the past and present, and $\mathcal{H}_\infty$ the total space of the process, all these being Hilbert subspaces of $L^2(\Sigma)$. If $\mathcal{H}_{-\infty} = \{0\}$ the process is termed purely nondeterministic; if $\mathcal{H}_{-\infty} = \mathcal{H}_\infty$, it is deterministic; and the general case, where $\{0\} \ne$
$\mathcal{H}_{-\infty} \subsetneq \mathcal{H}_\infty$, is simply nondeterministic. It is fortunate that the general case can be divided into the preceding two parts. The first such reduction was found by H. Wold in 1938 for $G \subset \mathbb{Z}$, thereafter called the Wold decomposition; the general formulation, for both $G \subset \mathbb{R}$ and $G \subset \mathbb{Z}$ as well as the multidimensional case, was then provided by H. Cram\'er. The case $G \subset \mathbb{R}$ is usually called the Cram\'er decomposition, since it is not an obvious extension of the discrete case. For both index sets it was found, however, that a second order process decomposes into an (orthogonal) sum of the two preceding types.
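Before the formal statement, a small simulation may help fix ideas (a sketch; the components and parameter choices are illustrative assumptions, not from the text): a sequence that is the orthogonal sum of a deterministic random-phase sinusoid and a purely nondeterministic moving average, so that the one-step prediction error comes entirely from the nondeterministic part.

```python
# Illustrative sketch of the Wold decomposition X_n = Y_n + Z_n (G = Z):
# Y_n deterministic (random-phase sinusoid, exactly linearly predictable
# from its past), Z_n purely nondeterministic (an MA(1) average).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
phase = rng.uniform(0, 2 * np.pi)
t = np.arange(n)
Y = np.sqrt(2) * np.cos(0.7 * t + phase)          # deterministic component
eps = rng.normal(size=n + 1)
Z = eps[1:] + 0.5 * eps[:-1]                      # purely nondeterministic MA(1)
X = Y + Z

# Y_n satisfies the exact linear recursion Y_n = 2 cos(0.7) Y_{n-1} - Y_{n-2},
# so its one-step prediction error is 0; the innovation of X is that of Z.
Y_hat = 2 * np.cos(0.7) * Y[1:-1] - Y[:-2]
print("deterministic part error:", np.max(np.abs(Y[2:] - Y_hat)))  # ~ 0
print("empirical mean of Y_t Z_t:", np.mean(Y * Z))                # ~ 0
```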
More precisely, we have the following general result, due to Cram\'er:

1. Proposition. Let $\{X_t, t \in G\} \subset L^2(\Sigma)$ be a process with $\mathcal{H}_t$, $t \in G$, as the Hilbert space representing its past and present. Then it can be uniquely decomposed as

$$X_t = Y_t + Z_t, \qquad t \in G, \tag{1}$$
where $\{Y_t, t \in G\}$ is deterministic and $\{Z_t, t \in G\}$ is purely nondeterministic, such that $Y_t, Z_t \in \mathcal{H}_t$ and $Y_s \perp Z_t$ for all $s, t \in G$.

Proof. Let $Q_t : \mathcal{H}_\infty \to \mathcal{H}_t$, $t \le \infty$, be the orthogonal projections; since $X_t \in \mathcal{H}_t \subset \mathcal{H}_\infty$, let $Y_t = Q_{-\infty}X_t$, $t \in G$, and $Z_t = X_t - Y_t$. Then $\{Y_t, t \in G\} \subset \mathcal{H}_{-\infty}$ and $Z_t = (I - Q_{-\infty})X_t \in \mathcal{H}_{-\infty}^\perp$, so that $Y_s \perp Z_t$, $\forall s, t \in G$, since $\mathcal{H}_\infty = Q_{-\infty}(\mathcal{H}_\infty) \oplus (I - Q_{-\infty})(\mathcal{H}_\infty)$. We claim that $X_t = Y_t + Z_t$, $t \in G$, is the desired decomposition.

Indeed, let $\mathcal{K}_t = \overline{sp}\{Z_s, s \le t, s \in G\} \subset \mathcal{H}_t$. Since $Z_t \in \mathcal{H}_{-\infty}^\perp$, $\forall t \in G$, we get $\mathcal{K}_t \subset \mathcal{H}_{-\infty}^\perp$, and so $\mathcal{K}_{-\infty} = \cap_{t\in G}\mathcal{K}_t \subset \mathcal{H}_{-\infty}^\perp$. But $\mathcal{K}_t \subset \mathcal{H}_t \Rightarrow \mathcal{K}_{-\infty} \subset \mathcal{H}_{-\infty} \cap \mathcal{H}_{-\infty}^\perp = \{0\}$, whence $\{Z_t, t \in G\}$ is purely nondeterministic.

As for the $Y_t$-process, let $\mathcal{K}'_t = \overline{sp}\{Y_s, s \le t, s \in G\} \subset \mathcal{H}_{-\infty}$. But $\mathcal{K}_t \subset \mathcal{H}_{-\infty}^\perp$, so that $\mathcal{K}_t \perp \mathcal{K}'_t$. By construction $X_t = Y_t + Z_t$, so $\mathcal{H}_t \subset \mathcal{K}'_t \oplus \mathcal{K}_t$; and $\mathcal{K}_t \subset \mathcal{H}_t$, $\mathcal{K}'_t \subset \mathcal{H}_{-\infty} \subset \mathcal{H}_t \Rightarrow \mathcal{K}'_t \oplus \mathcal{K}_t \subset \mathcal{H}_t$. Hence $\mathcal{H}_t = \mathcal{K}'_t \oplus \mathcal{K}_t$. Since $\mathcal{H}_{-\infty} \subset \mathcal{H}_t = \mathcal{K}'_t \oplus \mathcal{K}_t$, $t \in G$, with $\cap_t \mathcal{K}_t = \{0\}$ and $\mathcal{K}'_t \subset \mathcal{H}_{-\infty}$, taking intersections we conclude that $\mathcal{H}_{-\infty} \subset \mathcal{K}'_{-\infty} \subset \mathcal{K}'_t \subset \mathcal{H}_{-\infty}$, so that $\mathcal{K}'_t = \mathcal{H}_{-\infty}$. Thus $\{Y_t, t \in G\}$ is deterministic.

Finally, if $Y_t + Z_t = Y'_t + Z'_t$, $t \in G$, are two such decompositions, then $Y_t - Y'_t = Z'_t - Z_t$, with $Y_t - Y'_t \in \mathcal{K}'_{-\infty} = \mathcal{H}_{-\infty}$ and $Z'_t - Z_t \in \mathcal{H}_{-\infty}^\perp$. Since the intersection of these subspaces is $\{0\}$, we conclude that $Y_t = Y'_t$ and $Z_t = Z'_t$, and the unicity of the decomposition follows. $\Box$

A consequence of this separation is that the deterministic component contributes nothing new to the prediction analysis, since a knowledge of the remote past gives everything about the future; it therefore suffices to concentrate on the purely nondeterministic part. Hence, for simplicity, in the following we assume that the process $\{X_t, t \in G\}$ is itself purely nondeterministic, although there is as yet no simple recipe for separating the two parts in a practical application. Moreover, we assume that the total space generated by the process is separable. A sufficient condition for the latter is that the covariance function of a centered process be continuous, or that the $X_t$-process be left continuous with right limits, which we assume in the applications below. Thus our standing assumptions from now on are: $E(X_t) = 0$, and if $\mathcal{H}_t = \overline{sp}\{X_s, s \le t, s \in G\} \subset L^2(\Sigma)$, then (i) $\mathcal{H}_{-\infty} = \{0\}$, (ii) $\mathcal{H}_\infty$ is separable, and (iii) $(s, t) \mapsto r(s, t) = E(X_s X_t)$ is Borel measurable, which is automatic when the process is mean continuous. We take for convenience that the family $\{\mathcal{H}_t, t \in G\}$ is right continuous, in that $\mathcal{H}_t = \mathcal{H}_{t+0} = \cap_{n=1}^\infty \mathcal{H}_{t+\frac1n}$, replacing it with the latter if necessary. These restrictions limit our study only slightly, and only when $G \subset \mathbb{R}$.

Now with this setup one can associate a family $\pi_t : \mathcal{H}_\infty \to \mathcal{H}_t$ of orthogonal projections with ranges $\mathcal{H}_t$. The preceding conditions translate into the following properties of $\{\pi_t, t \in G\}$: (i$'$) $\pi_\infty = \mathrm{id.}$; (ii$'$) $\pi_s \le \pi_t$ for $s \le t$ (i.e., $(\pi_t - \pi_s)^2 = \pi_t - \pi_s$); (iii$'$) $\pi_{t+0} = \pi_t$ (i.e., the mapping $t \mapsto \pi_t\ (\in B(\mathcal{H}_\infty))$ is right continuous); and (iv$'$) $\lim_{t\to-\infty} \|\pi_t x\|_2 = 0$ for $x \in \mathcal{H}_\infty$. Such a family is called a resolution of the identity operator, and we can invoke the classical Hellinger-Hahn theorem (cf., e.g., the book by Stone [1], Section 7.2). It is now necessary to translate this result from the $\pi_t$-family to the $X_t$-process to get a useful integral representation, eventually via Karhunen's theorem as already seen in the preceding chapters. This critical application of Hilbert space theory to stochastic analysis was made independently by Hida [1] in the context of Gaussian processes, and by Cram\'er ([2],[3]) for general second order processes. The following account is adapted from their works. In the past, special methods (depending on [vector] Fourier analysis) were employed for weakly stationary processes by several authors; but these do not extend to the nonstationary case, and the latter needs this powerful new tool. We include this method and at once get a solution of the linear least squares prediction problem from the representation. The $\mathcal{H}_t$ spaces can also be constructed from the covariance kernel, using the RKHS methods in the "frequency domain", the results then being translated to the "time domain" via an isomorphism; this interplay has significance in the context of Gaussian processes. That is Hida's approach. We consider the problem directly in the "time domain" for general second order processes; this is Cram\'er's approach. The former was used in Section V.2, and both methods are useful and essentially equivalent. For simplicity we proceed with the latter.

The Hellinger-Hahn theorem implies the following important assertions. In the discrete case (i.e., $G \subset \mathbb{Z}$) the analysis becomes considerably easier, so we concentrate exclusively on the continuous parameter case (i.e., $G \subset \mathbb{R}$), with an occasional remark on the discrete version. The resolution family $\{\pi_t, t \in G\}$ introduced above is used. Let $A = \int_G \lambda\, \pi(d\lambda)$ be a spectral integral, i.e., $Ax = \int_G \lambda\,(\pi(d\lambda)\, x)$, $x \in \mathcal{H}_\infty$, a Dunford-Schwartz integral. Then $A$ is a bounded self-adjoint operator on $\mathcal{H}_\infty$, and there are sequences of orthogonal vectors $\{\xi_n, n \ge 1\}$ and $\{\zeta_{jk}, j, k \ge 1\}$ in $\mathcal{H}_\infty$ with the following properties:
(a) $\xi_n$, $\zeta_{jk}$ are all mutually orthogonal for all $n, j, k$;
(b) if $Z_n((a, b]) = \int_a^b \pi(d\lambda)\, \xi_n$, then $Z_n(\cdot)$ extends to a vector measure on the Borel $\sigma$-algebra $\mathcal{B}$ of $G$ with orthogonal values;
(c) if $F_n(t) = \|\pi_t \xi_n\|_2^2$, $n \ge 1$, then $F_n$ is a bounded nondecreasing left continuous function determining a finite Borel measure $\nu_n : B \mapsto \int_B dF_n(t)$, $B \in \mathcal{B}$, such that $\nu_{n+1} \ll \nu_n$, $n \ge 1$;
(d) if $\lambda_j$ are the eigenvalues of the operator $A$ defined above, then $\zeta_{jk}$, $k \ge 1$, are its corresponding eigenfunctions, i.e., $A\zeta_{jk} = \lambda_j \zeta_{jk}$;
(e) if $\mathcal{M}_n = \{h \in \mathcal{H}_\infty : h = \int_G f(\lambda)\,Z_n(d\lambda),\ f \in L^2(\nu_n)\}$, where the $Z_n$ and $\nu_n$ are as in (b) and (c), and $\mathcal{N}_{jk} = sp\{\zeta_{jk}\}$, the one-dimensional subspace spanned by $\zeta_{jk}$, then $\pi_t(\mathcal{M}_n) \subset \mathcal{M}_n$, $t \in G$, and $\mathcal{H}_\infty = \mathcal{M} \oplus \mathcal{N}$, where $\mathcal{M} = \oplus_{n\ge1}\mathcal{M}_n$, $\mathcal{N} = \oplus_{j,k\ge1}\mathcal{N}_{jk}$.

All of this is a consequence of the Hellinger-Hahn theorem applied to the bounded self-adjoint operator $A$ determined by the increasing family $\{\mathcal{H}_t, t \in G\}$ of Hilbert subspaces of $L^2(\Sigma)$ obtained from the given process $\{X_t, t \in G\}$. With this we can produce an integral representation of the $X_t$-process. In fact, using (e) above, since $X_t \in \mathcal{H}_t$ for each $t$, we have

$$X_t = \sum_{n\ge1} h_n(t) + \sum_{j,k\ge1} g_{jk}(t) = \sum_{n\ge1}\int_G f_n(t,\lambda)\,Z_n(d\lambda) + \sum_{j,k\ge1} a_{jk}(t)\,\zeta_{jk}, \tag{2}$$

with $h_n(t) \in \mathcal{M}_n$, $g_{jk}(t) \in \mathcal{N}_{jk}$,
where $f_n(t,\cdot) \in L^2(\nu_n)$ and $\sum_{j,k\ge1} |a_{jk}(t)|^2 < \infty$. But $\pi_t X_t = X_t \in \mathcal{H}_t$, and by the definition of $Z_n(\cdot)$ in (b), $\pi_t Z_n(B) = 0$ for $B \subset G \cap [t, \infty)$. Consequently (2) becomes

$$X_t = \pi_t X_t = \sum_{n\ge1}\int_{[\lambda\le t]} f_n(t,\lambda)\,Z_n(d\lambda) + \sum_{\lambda_j\le t,\ j,k\ge1} a_{jk}(t)\,\zeta_{jk}, \tag{3}$$

as the representation. This may be stated in a better form with the following notation. Let $N_j$ be the number (finite or countably infinite) of linearly independent $\zeta_{jk}$, $k \ge 1$, and $N' = \sup_j N_j$. If $N''$ is the number of nonzero $\xi_n$ of (a) above, let $N = \max(N', N'')$, called the multiplicity (of $A$ in the classical theory of operators, and also in our case) of the $X_t$-process. In the discrete case we always have $N = 1$, as can be verified. [If the process is weakly stationary, then also one can show that $N = 1$.] It may be noted, from the general theory, that $N$ is uniquely determined even though the $\xi_n$, $\zeta_{jk}$ are not. In (3), the integral part comes from the so-called continuous spectrum of the operator $A$ defined by the $\pi_t$-family, and the $\lambda_j$ belong to its discrete spectrum. If $C, D$ denote these two sets, then $C \cap D = \emptyset$ and $C \cup D \subset \bar G$. Now, to simplify (3), define

$$\tilde Z_n(B) = Z_n(B \cap C) + \sum_{\lambda_j \in B \cap D,\ j,k\ge1} \zeta_{jk}, \qquad B \in \mathcal{B}, \tag{4}$$

and

$$g_n(t,\lambda) = f_n(t,\lambda)\,\chi_C(\lambda) + \sum_{j\ge1} a_{jn}(t)\,\chi_{[\lambda=\lambda_j\in D]}(\lambda), \tag{5}$$
for $n = 1, 2, \ldots, N$. Note that each $\tilde Z_n(\cdot)$ still has orthogonal values. Set $\tilde\rho_n(B) = \|\tilde Z_n(B)\|_2^2$, defining a Borel measure, so that $g_n(t,\cdot) \in L^2(\bar G, \tilde\rho_n)$. With this discussion, we can summarize the above work; in this form the result was given independently by Cram\'er and Hida.

2. Theorem. Let $\{X_t, t \in G\} \subset L^2(\Sigma)$ be a left continuous process with right limits which is purely nondeterministic (so that $\mathcal{H}_\infty$, in the preceding notation, is separable). Then there is a unique number $N \in \{1, 2, \ldots, \infty\}$, determined by the process (which equals 1 if $G \subset \mathbb{Z}$), such that

$$X_t = \sum_{n=1}^{N} \int_{[\lambda\le t]} g_n(t,\lambda)\,\tilde Z_n(d\lambda), \qquad t \in G, \tag{6}$$

where, with $\delta_{mn}$ as the Kronecker delta,
(a) $E(\tilde Z_n(B)) = 0$, $E(\tilde Z_m(B_1)\,\tilde Z_n(B_2)) = \delta_{mn}\,\tilde\rho_n(B_1 \cap B_2)$,
(b) $E(X_t^2) = \sum_{n=1}^{N} \int_{[\lambda\le t]} g_n^2(t,\lambda)\,\tilde\rho_n(d\lambda) < \infty$,
(c) $\tilde\rho_{n+1} \ll \tilde\rho_n \ll \cdots \ll \tilde\rho_1$,
(d) the Hilbert space $\mathcal{H}_t = \mathcal{H}_t(X)$ is the orthogonal direct sum $\mathcal{H}_t = \oplus_{n=1}^{N} \mathcal{H}_t(\tilde Z_n)$, where $\mathcal{H}_t(\tilde Z_n) = \overline{sp}\{\tilde Z_n(s), s \le t,\ s, t \in \bar G\}$.

The representation (6) is canonical in the sense that $X_t$ cannot have a similar representation with properties (a)-(d) for a smaller value of $N$. Further, the best linear least squares predictor of $X_{t_0}$, having observed $\{X_s, s \le t < t_0,\ s, t, t_0 \in G\}$, is given by

$$\hat X(t, t_0) = \sum_{n=1}^{N} \int_{[\lambda\le t]} g_n(t_0,\lambda)\,\tilde Z_n(d\lambda), \tag{7}$$

with the mean square error of prediction

$$\sigma^2(t, t_0) = E(X_{t_0} - \hat X(t, t_0))^2 = \sum_{n=1}^{N} \int_t^{t_0} g_n(t_0,\lambda)^2\,\tilde\rho_n(d\lambda). \tag{8}$$
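In the discrete multiplicity-one case, (6)-(8) can be checked by direct computation. The following sketch (an illustrative construction with a hypothetical response kernel, not from the text) builds a process from orthonormal innovations, truncates the sum as in (7), and compares the empirical prediction error with (8).

```python
# Discrete multiplicity-one illustration of (6)-(8): X_t = sum_{k<=t} g(t,k) Z_k
# with orthonormal innovations Z_k; the best linear predictor given the past up
# to time t truncates the sum, and the MSE is sum_{t<k<=t0} g(t0,k)^2.
import numpy as np

rng = np.random.default_rng(2)
T = 12
g = lambda t, k: 0.8 ** (t - k) * (1 + 0.1 * k)   # hypothetical response kernel
M = 100_000                                       # Monte Carlo replications
Z = rng.normal(size=(M, T + 1))                   # innovations

t, t0 = 7, 10
X_t0  = sum(g(t0, k) * Z[:, k] for k in range(t0 + 1))
X_hat = sum(g(t0, k) * Z[:, k] for k in range(t + 1))      # predictor (7)
mse_emp = np.mean((X_t0 - X_hat) ** 2)
mse_thy = sum(g(t0, k) ** 2 for k in range(t + 1, t0 + 1)) # formula (8)
print(mse_emp, mse_thy)   # the two agree up to Monte Carlo error
```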
Since in (6) each of the terms on the right is orthogonal to the others, it follows that $\mathcal{H}_t \subset \oplus_{n=1}^{N}\mathcal{H}_t(\tilde Z_n)$. The opposite inclusion is simple, since each $\mathcal{H}_t(\tilde Z_n) \subset \mathcal{H}_t$ [called a "cyclic subspace"] and these are orthogonal. Thus (d) follows. For the least squares prediction, a classical Riesz theorem says that the closest element to $X_{t_0}$ in any closed convex set of a Hilbert space is its image under the (orthogonal) projection onto that set. Thus $\pi_t(X_{t_0})\ (\in \mathcal{H}_t)$ is the best predictor of $X_{t_0}$. Hence (7) follows from (6) on applying $\pi_t$ and using the fact that $\pi_t \perp (\pi_{t_0} - \pi_t)$. The same comment establishes (8) also, since the $\tilde Z_n$ are mutually orthogonal. $\Box$

The kernels $g_n(t,\lambda)$ are sometimes termed response functions corresponding to the orthogonally valued $\tilde Z_n$, which in turn are termed 'innovation' elements. In the weakly stationary case $N = 1$, and taking $G = \mathbb{R}$ one can show that

$$X_t = \int_{-\infty}^{t} g(t - \lambda)\,Z(d\lambda), \tag{9}$$

with $\mathcal{H}_t(X) = \mathcal{H}_t(Z)$; hence it is seen that the process has a canonical representation. Although the $\tilde\rho_n$ in (8) need not be unique [indeed, one can multiply and divide by a suitable positive function], the equivalence class to which each $\tilde\rho_n$ belongs is unique. [Hence, if $R_n$ is the set of measures equivalent to $\tilde\rho_n$, then $R_n \ll R_m$, $m < n$, meaning that for every $\mu_n \in R_n$, $\mu_m \in R_m$ one has $\mu_n \ll \mu_m$. In this sense the $R_n$ classes are unique.]

3. Remark. Following the proof of Proposition V.2.8, we noted that a covariance function $K$ on a set $T$, having the property that the associated RKHS $\mathcal{H}_K$ is separable, is (representable as) a Karhunen covariance relative to a family of vector functions. An analogous statement holds for second order processes, and may be deduced from the above theorem. In fact, let $G(t,\cdot) = (g_1(t,\cdot), \cdots, g_N(t,\cdot))$ and $\bar Z = (\tilde Z_1, \cdots, \tilde Z_N)^*$ (the row and column $N$-vectors), $t \in \hat G$, and $\rho(B) = E(\bar Z(B)^* \bar Z(B))$, $\bar Z$ having orthogonal values. Then (6) implies

$$X_t = \int_{[\lambda\le t]} G(t,\lambda)\,\bar Z(d\lambda), \tag{10}$$
and

$$K(s,t) = E(X_s X_t) = \int^{s\wedge t} G(s,\lambda)\,G^*(t,\lambda)\,\rho(d\lambda). \tag{11}$$
Thus if $\mathcal{H}_\infty$ is separable, then the $X_t$-process is a Karhunen process relative to the (vector) response function $G(\cdot,\cdot)$ and the spectral measure $\bar Z(\cdot)$. This also shows the essential equivalence of the frequency and time domain analyses for a large class of second order processes. They include the weakly stationary (as well as, after some work, the harmonizable) families, all forming Karhunen processes. In this sense the latter is a very wide class!

We now discuss a concrete application, with suitable specialization, of Theorem 2, especially of multiplicity one. There are several applications of the multiplicity theory in the works of Hida [1] (cf. also Hida and Hitsuda [1]) and of Cram\'er [2] (cf. also his lucid survey [5]). The following account follows the latter treatment; it indicates the general and unifying nature of the above representation of second order processes.

4. Example. When $N = 1$, the relation (6) with $G = [a, \infty)$ can be expressed simply as

$$X_t = \int_a^t g(t,\lambda)\,Z(d\lambda),$$
where $E(X_t) = 0$, and

$$r(s,t) = E(X_s X_t) = \int_a^{s\wedge t} g(s,\lambda)\,g(t,\lambda)\,\rho(d\lambda). \tag{12}$$
Suppose now that $\rho(\cdot)$ is absolutely continuous (relative to Lebesgue measure) with density $\rho'(u) = f(u) \ge 0$, positive on a set of positive measure and piecewise continuous on compact intervals. Also suppose that the kernel $g(t,u)$ and its partial derivative $\frac{\partial g}{\partial t}(t,u)$ are bounded and continuous for $t \ge u \ge a$, and (in lieu of $g(t,u) > 0$, for convenience) that $g(t,t) = 1$, $t \in G$. It is then clear that the covariance function $r$ given by (12) is continuous and that its partial derivatives exist, but are continuous only off the diagonal $s = t$ of $(a,\infty) \times (a,\infty)$. In fact, we easily find

$$\lim_{s\uparrow t} \frac{r(s,t) - r(t,t)}{s-t} = \int_a^t g(t,u)\,\frac{\partial g}{\partial t}(t,u)\,f(u)\,du + f(t),$$

and

$$\lim_{s\downarrow t} \frac{r(s,t) - r(t,t)}{s-t} = \int_a^t g(t,u)\,\frac{\partial g}{\partial t}(t,u)\,f(u)\,du,$$
so that on the diagonal the partial derivatives have a jump of size $f(t) > 0$. In particular, if $g(t,u) = p(t)q(u)$ with $g(t,t) = 1$, we get $g(t,u) = \frac{p(t)}{p(u)}$, and

$$r(s,t) = p(s)\,p(t)\int_a^{s\wedge t} \frac{f(v)}{p^2(v)}\,dv.$$

This implies, for $s < t < u$, that $r(s,u)\,r(t,t) = r(s,t)\,r(t,u)$, and hence the correlation $R(s,t) = \frac{r(s,t)}{\sqrt{r(s,s)\,r(t,t)}}$ satisfies

$$R(s,u) = R(s,t)\,R(t,u), \tag{13}$$
a well-known functional relation, showing that Xt is a wide sense Markov process which in the Gaussian case will be (strictly) Markovian.
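The factorization criterion and (13) are easy to check numerically. The following sketch (with illustrative choices of $p$ and $f$, not from the text) evaluates $r$ from (12) for $g(t,u) = p(t)/p(u)$ and verifies the multiplicative property of $R$.

```python
# Check the wide-sense Markov property (13) for the triangular covariance
# r(s,t) = p(s) p(t) * integral_a^{min(s,t)} f(v)/p(v)^2 dv   (illustrative p, f)
import numpy as np
from scipy.integrate import quad

a = 0.0
p = lambda t: np.exp(0.3 * t)      # hypothetical choices of p and f
f = lambda v: 1.0 + v

def r(s, t):
    val, _ = quad(lambda v: f(v) / p(v) ** 2, a, min(s, t))
    return p(s) * p(t) * val

def R(s, t):
    return r(s, t) / np.sqrt(r(s, s) * r(t, t))

s, t, u = 1.0, 2.0, 3.0
print(R(s, u), R(s, t) * R(t, u))   # equal: R(s,u) = R(s,t) R(t,u)
```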
If $a = -\infty$, $f(u) = c > 0$, and $p(u) > 0$, then the solution of (13) in this case is $R(s,t) = e^{-\frac{c}{2}|s-t|}$, i.e., the process is of Ornstein-Uhlenbeck type. If moreover $g(t,u) \equiv 1$, $Z((-\infty, a)) = 0$, and $f(u) = 1$, then $X_t = Z((a,t])$, so the process is Brownian motion. In case $g(t,u) = g(t-u)$, $f(u) = 1$, and $a = -\infty$, the $X_t$-process is weakly stationary, as already noted. Thus (6) unifies many of these types. We shall indicate, in Exercise 1(i), a sufficient condition on the $g_n$'s in order that (6) represent a process of multiplicity one.

Linear least squares prediction for random fields, e.g., $\{X_t, t \in \mathbb{R}^2\}$, is now a natural question, and an aspect of it for weakly stationary fields has been considered by Chiang [1]. Indeed it will be of interest to study analogs of second order processes of Karhunen type, if the representation (6) can be generalized in the following manner. Let $\{X_{t_1,t_2}, (t_1, t_2) \in G_1 \times G_2\}$ be a random field with the $G_i \subset \mathbb{R}$, $i = 1, 2$, as intervals. Set $\mathcal{H}^1_{t_1}(X) = \overline{sp}\{X_{s_1,s_2} : s_1 \le t_1, s_2 \in G_2\}$ (similarly $\mathcal{H}^2_{t_2}(X)$), and let $\mathcal{H}_{t_1,t_2}(X) = \overline{sp}\{X_{s_1,s_2} : s_i \le t_i, i = 1, 2\}$, with $\mathcal{H}_\infty = \overline{sp}\{X_{t_1,t_2} : t_i \in G_i, i = 1, 2\}$, all in $L^2(P)$. Note that $\mathcal{H}_{t_1,t_2}(X) = \mathcal{H}^1_{t_1}(X) \cap \mathcal{H}^2_{t_2}(X)$. Suppose that $\mathcal{H}_\infty(X)$ is separable. If $\pi^i_{t_i} : \mathcal{H}_\infty(X) \to \mathcal{H}^i_{t_i}(X)$, $i = 1, 2$, are the orthogonal projections, then $\{\pi^i_{t_i}, t_i \in G_i, i = 1, 2\}$ are resolutions of the identity. Suppose now that the two resolutions are mutually commuting and uniformly bounded (by one), and let $\pi_{t_1 t_2} = \pi^1_{t_1} \otimes \pi^2_{t_2}$, in the sense that $\pi_{(A\times B)} = \pi^1_A \pi^2_B$ for all Borel sets $A \subset G_1$, $B \subset G_2$. Then $\pi_{t_1 t_2}$ defines a resolution of the identity in the Hilbert space $\mathcal{H}_\infty$; and if $A_i = \int_{G_1\times G_2} \lambda_i\, \pi(d\lambda_1, d\lambda_2)$, $i = 1, 2$, then for any bounded Borel function $f : G_1 \times G_2 \to \mathbb{R}$ one can define an operator $f(A_1, A_2) = \int_{G_1\times G_2} f(\lambda_1,\lambda_2)\, \pi(d\lambda_1, d\lambda_2)$ uniquely by the operational calculus. (For a general version of this type of spectral analysis, see, e.g., Kluv\'anek and Kov\'a\v{r}\'ikov\'a [1].) Thus we have an operator $f(A_1, A_2)$ for such $f$ of two variables, with its resolution of the identity $\{\pi_C : C \in \mathcal{B}^2\}$, where $\mathcal{B}^2$ is the Borel $\sigma$-algebra of $G_1 \times G_2$ (in the sense of Dunford's well-known spectral calculus), and a related spectral calculus is also available (cf. McGhee and Picard ([1], Sec. 7)). If the Hellinger-Hahn theorem is extended to such a family and an appropriate multiplicity theory is established, then Theorem 2 may be extended to this class and the prediction problem relative to some (e.g., lexicographic) ordering can be solved. At present this has not been systematically explored, and it seems to present an interesting possibility of extending Chiang's work noted above. Indeed, multiparameter linear prediction theory seems to be in its infancy, and should be the next topic for study. Equally (or even more) important is to consider the corresponding problems for (isotropic harmonizable) random fields to obtain the best predictor, extending the method of Sec. VI.4. Here the regions to be treated will not be quadrants but quite different sets,
and these would be of primary interest in applications. So we leave the subject here, and turn to linear filtering problems of considerable generality and interest in applications.

8.3 Linear filtering: Bochner's formulation

The problem of filtering can be stated in general terms as follows. If $\{X_t, t \in T\}$ and $\{Y_t, t \in T\}$ are a pair of processes in $L^2(\Omega, \Sigma, P)$ indexed by a group $T$, suppose that $\Lambda$ is a linear operator satisfying

$$\Lambda X_t = Y_t, \qquad t \in T, \tag{1}$$

and that $\Lambda$ commutes with translations (when defined) on $T$, so that if $X_t$ is weakly stationary then so is $Y_t$ (although not necessarily conversely). Then (the deterministic) $\Lambda$ is called a (linear) filter, and the problem of interest is this: if $Y_t$ is the observed process, and $X_t$ is an "input" giving the "output" $Y_t$ through the operator $\Lambda$ as in (1), under what general assumptions can we recover the $X_t$-process from the $Y_t$, i.e., find conditions to "invert $\Lambda$" and obtain the input $X_t$? Except in simple cases, the naive operator inversion $X_t = \Lambda^{-1}Y_t$ is not possible, and we present general solutions for large sets of problems. The idea, once again, is to translate the problem into a frequency domain of the process if possible and solve it there, as is done in the weakly stationary case. We have seen in the last section that such a possibility exists for Karhunen processes in general. Consequently, we first present a large class of processes that can be recognized as of Karhunen class, and then present a solution of the linear filtering problem for the harmonizable classes, which belong to this general family. The next result, on the structure of second order processes, has independent interest.

1. Theorem. Let $\{X_t, t \in T\} \subset L^2(\Omega, \Sigma, P)$ and $\mathcal{H}(X) = \overline{sp}\{X_t, t \in T\}$. If $\mathcal{H}(X)$ is separable, then the $X_t$-family is of Karhunen class relative to a set $\{g(t,\cdot), t \in T\}$ of square integrable functions on a measure space $(S, \mathcal{S}, \mu)$ and an orthogonally valued $L^2(P)$-valued measure $Z$, such that

$$X_t = \int_S g(t,u)\,Z(du), \quad t \in T, \qquad \|Z(B)\|_2^2 = \mu(B),\ B \in \mathcal{S}. \tag{2}$$
In particular, the conclusion holds in each of the following cases: (i) the underlying probability space $(\Omega, \Sigma, P)$ is separable; (ii) $T \subset \mathbb{R}^k$, $1 \le k < \infty$, and the moment function $r : (s,t) \mapsto E(X_s \bar X_t)$ is continuous on the diagonal of $T \times T$; (iii) $T$ is a countable set.
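For a finite index set, the construction in the proof below reduces to a matrix factorization. The following sketch (illustrative, with a hypothetical covariance kernel) produces the $g(t,\cdot)$ and the orthogonally valued $Z$ of (2) from a Cholesky factor of the covariance matrix.

```python
# Finite-index illustration of the Karhunen representation (2): given the
# covariance r(s,t), factor r = G G^T; then X_t = sum_u G[t,u] Z_u with
# orthonormal Z_u realizes (2), with mu the counting measure.
import numpy as np

rng = np.random.default_rng(3)
T = np.linspace(0.0, 1.0, 6)
r = np.exp(-np.abs(T[:, None] - T[None, :]))     # hypothetical covariance kernel
G = np.linalg.cholesky(r)                        # g(t, u) = G[t, u]

M = 200_000
Z = rng.normal(size=(len(T), M))                 # orthonormal "spectral" values
X = G @ Z                                        # realizations of the process
print(np.max(np.abs(np.cov(X) - r)))             # empirical covariance ~ r
```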
Proof. Let $\alpha = \dim(\mathcal{H}(X))$, which by hypothesis satisfies $\alpha \le \aleph_0$. The argument below shows that if $\alpha < \infty$ the result is true and simple, so we consider only the case $\alpha = \aleph_0$. Now if $(S, \mathcal{S}, \mu)$ is any separable probability space with $L^2(\mu)$ infinite dimensional (e.g., the Lebesgue unit interval), then $L^2(\mu)$ is a separable Hilbert space of dimension $\aleph_0$, and since it is classical that any two Hilbert spaces of the same dimension are isomorphic (cf., e.g., Na\u{i}mark ([1], p. 95)), $\mathcal{H}(X) \cong L^2(\mu)$, i.e., they are isomorphic. Indeed, let $\{\varphi_n, n \ge 1\}$ and $\{f_n, n \ge 1\}$ be arbitrarily fixed complete orthonormal sets (or bases) of these two spaces, and $\tau : f_n \to \varphi_n$ the isomorphic correspondence, which is extended linearly onto these spaces and which, by polarization, can be taken to preserve inner products. In fact, if $g \in L^2(\mu)$, so that $g = \sum_{n=1}^\infty a_n(g) f_n$ [$a_n(g) = (g, f_n)$], $\|g\|_2^2 = \sum_{n=1}^\infty |a_n(g)|^2 < \infty$, then let $g' = \sum_{n=1}^\infty a_n(g)\,\varphi_n \in \mathcal{H}(X)$; $\tau(g) = g'$ is the desired isometric isomorphism between the spaces, preserving inner products. In particular, if $g'$ corresponds to $X_t$, this correspondence becomes

$$X_t = \tau\Big(\sum_{n=1}^{\infty} a_n(X_t)\, f_n\Big). \tag{3}$$

If $A \in \mathcal{S}$, set $\tau(\chi_A) = Z(A)$. Then $Z : \mathcal{S} \to L^2(P)$ is $\sigma$-additive, and for simple functions $f \in L^2(\mu)$:

$$\tau(f) = \int_S f(u)\,Z(du). \tag{4}$$

By standard results on vector measure extensions (cf., e.g., Dunford-Schwartz [1], Section IV.10), this holds for all $f \in L^2(\mu)$. Setting $g(t,u) = \sum_{n=1}^\infty a_n(X_t)\, f_n(u)$, so that $g(t,\cdot) \in L^2(\mu)$, $t \in T$, one gets from (3) and (4):

$$X_t = \int_S g(t,u)\,Z(du), \qquad t \in T. \tag{5}$$

Moreover, for $A, B \in \mathcal{S}$ we have, using inner products:

$$E(Z(A)\overline{Z(B)}) = (\tau(\chi_A), \tau(\chi_B))_{\mathcal{H}(X)} = (\chi_A, \chi_B)_{L^2(\mu)} = \mu(A \cap B).$$

Hence $Z(\cdot)$ takes orthogonal values on disjoint sets, so that (5) implies that the $X_t$-process, indexed by $T$, is of Karhunen class relative to $\{g(t,\cdot), t \in T\} \subset L^2(\mu)$. [Generally, not much can be said about $g(\cdot,\cdot)$; additional information available in specializations has to be utilized in deriving various properties of $g$, as the following work demonstrates.]
If $(\Omega, \Sigma, P)$ is separable, then so is $L^2(P)$, and $\mathcal{H}(X)$, as a subspace of a separable metric space, has the same property, so the result follows. If $T \subset \mathbb{R}^k$ and the product moment is continuous on the diagonal of $T \times T$, then one notes that it is continuous on the whole square, so that $r(\cdot,\cdot)$ is continuous, implying that the $X_t$-process is mean continuous. Then, as is well known, $\mathcal{H}(X)$ is separable (as already observed by Cram\'er [6], p. 330). If $T$ is countable, so that $\mathcal{H}(X)$ is countably generated, we conclude again that $\mathcal{H}(X)$ is separable. $\Box$

Remark. For second order processes that are not of Karhunen class, one has thus to consider those processes $X$ for which the space $\mathcal{H}(X)$ spanned by them in $L^2(P)$ is not separable. The problem here is that regularity properties of the family $\{g(t,\cdot), t \in T\}$ are difficult to establish; they (as well as the nonuniqueness of $Z$) depend on the (possible) special features of the $X_t$-process, as we now see.

Recall that a second order process $\{X_t, t \in G\}$ ($G = \mathbb{R}$ or $= \mathbb{Z}$) is harmonizable if it is the Fourier transform of a vector measure $Z$ on the Borel $\sigma$-algebra $\mathcal{B}$ of $\hat G$ ($= \mathbb{R}$, or $= (0, 2\pi]$ when $G = \mathbb{Z}$). More precisely, let $Z : \mathcal{B} \to L^2(P)$ be a $\sigma$-additive function (i.e., a vector measure). Then $X_t$ is harmonizable if

$$X_t = \int_{\hat G} e^{it\lambda}\,Z(d\lambda), \qquad t \in G,$$

where the integral is in the sense of Dunford and Schwartz, which we have used several times before. If $Z(\cdot)$ has orthogonal values on disjoint sets, then the $X_t$-process is weakly stationary. Hence one can find an expression for the product moment $r(s,t) = E(X_s \bar X_t)$ by approximation of the integral. Thus if $\beta(A, B) = E(Z(A)\overline{Z(B)})$, then $\beta : \mathcal{B} \times \mathcal{B} \to \mathbb{C}$ is a bimeasure, in the sense that $\beta(\cdot, B)$ and $\beta(A, \cdot)$ are $\sigma$-additive, but $\beta(\cdot,\cdot)$ can fail to be a signed measure. In case $\beta(\cdot,\cdot)$ determines a $\sigma$-additive (complex) measure, the corresponding $X_t$-process is termed strongly harmonizable; if $\beta(\cdot,\cdot)$ is merely a bimeasure, it is called weakly harmonizable. The distinction is that in the former case $\beta$ has finite Vitali (i.e., the usual two dimensional) variation, given by

$$|\beta|(\hat G, \hat G) = \sup\Big\{\sum_{i,j=1}^{n} |\beta(A_i, A_j)| : A_i \in \mathcal{B}\ \text{disjoint},\ n \ge 1\Big\} < \infty, \tag{6}$$

while in the general case the positive definite $\beta$ has only finite Fr\'echet variation:

$$\|\beta\|(\hat G, \hat G) = \sup\Big\{\Big|\sum_{i,j=1}^{n} a_i \bar a_j\, \beta(A_i, A_j)\Big| : A_i \in \mathcal{B}\ \text{disjoint},\ |a_i| \le 1,\ n \ge 1\Big\} < \infty. \tag{7}$$
Clearly

$$\|\beta\|(\hat G, \hat G) \le |\beta|(\hat G, \hat G) \le \infty,$$

and $\|\beta\|(\hat G, \hat G) < \infty$ always, while it may be shown that $|\beta|(\hat G, \hat G) = \infty$ is possible. A detailed discussion, along with (counter)examples verifying all these statements, is found in Chang and Rao [1]. The strongly harmonizable case was introduced by Lo\`eve in the mid 1940s (and later, in more detail, in his book [1]) without the qualification "strongly"; the more general weakly harmonizable case was introduced by Bochner [2], who called it a V-bounded process ('V' for general, or Fr\'echet, variation). A classification was needed for a finer analysis of the subject, and it was introduced by the author in [13]. In fact this distinction becomes crucial when one studies the subject in the frequency domain, and hence for much of the analysis that follows. Note that in the weakly stationary case $\beta(A, B) = \beta(A \cap B) \ge 0$, and $\beta$ gives a bounded Borel measure on $\hat G$; for the strongly harmonizable case $\beta$ is a complex measure; and for weakly harmonizable processes it is not even $\sigma$-additive (but only a [much weaker] bimeasure), although in all cases it is positive definite. This raises the question of how one can carry out the frequency domain analysis in the weakly and strongly harmonizable cases. In fact, if the process is strongly harmonizable, then the product moment function $r$ is given by

$$r(s,t) = \int_{\hat G}\int_{\hat G} e^{isu - itu'}\,\beta(du, du'), \tag{8}$$

which can be taken as a Lebesgue integral, whereas in the weakly harmonizable case one only has

$$r(s,t) = \int_{\hat G}^{*}\int_{\hat G}^{*} e^{isu - itu'}\,\beta(du, du'), \tag{9}$$
not as a Lebesgue integral, but as the weaker Morse-Transue (or MT) integral. Here the '$*$' on the integral signs signals that an automatic application of Lebesgue's theory is not possible. The necessary detail (called the strict MT-integral for (9)) has been given in Chang and Rao [1], and we shall use some results from it in the following treatment. In both these cases $\beta$ is termed a spectral bimeasure and its counterpart $Z$ is called the stochastic spectral measure of the process. Thus if $X_t$ is harmonizable (in either sense) then so is $Y_t = \Lambda X_t$ (cf. Proposition 2 below).
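As a sanity check on these notions, a strongly harmonizable process is easy to simulate on a finite frequency grid (an illustrative discretization, not from the text): take $Z$ with correlated point masses, so that $\beta$ is a genuine complex (matrix) measure, and compare the empirical $r(s,t)$ with (8).

```python
# Discrete-frequency sketch of a strongly harmonizable process:
# X_t = sum_k exp(i t u_k) Z_k with correlated complex weights Z_k, so that
# beta(j,k) = E(Z_j conj(Z_k)) is a genuine complex measure and
# r(s,t) = sum_{j,k} exp(i s u_j - i t u_k) beta(j,k), as in (8).
import numpy as np

rng = np.random.default_rng(4)
u = np.array([0.5, 1.1, 2.3])                     # frequency grid (hypothetical)
C = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.0, 0.2],
              [0.0, 0.2, 1.0]])                   # beta as a positive definite matrix
L = np.linalg.cholesky(C)

M = 100_000
W = (rng.normal(size=(3, M)) + 1j * rng.normal(size=(3, M))) / np.sqrt(2)
Z = L @ W                                         # E(Z Z^*) = C, nonorthogonal values

s, t = 0.7, 1.9
X_s = np.exp(1j * s * u) @ Z
X_t = np.exp(1j * t * u) @ Z
r_emp = np.mean(X_s * np.conj(X_t))
r_thy = np.exp(1j * s * u) @ C @ np.conj(np.exp(1j * t * u))
print(r_emp, r_thy)                               # agree up to Monte Carlo error
```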
It is important to note that in (8) and (9) $\beta$ is positive definite (but not positive), a property inherited from that of $r$. If moreover $\beta \ge 0$, then it automatically determines a (positive) measure, and all the integrals become Lebesgue integrals. Further, if it is inner regular for compact sets, then one can show that there is a positive inner regular measure $\mu$ such that $\beta(A, B) = \mu(A \times B)$, whence $L^2(\hat G \times \hat G, \mathcal{B} \otimes \mathcal{B}, \mu)$ will be a Hilbert space. This statement need not be valid if $\beta$ determines just a signed (or complex) measure, where one uses the Lebesgue theory. [One then replaces it by its variation, and can embed $L^2(\beta)$ as a dense subspace.] However, if the weaker (strict) MT-integration is used, and $L^2(\beta)$ is defined as

$$L^2(\beta) = \Big\{f : (f,f) = \int_{\hat G}^{*}\int_{\hat G}^{*} f(u)\,\bar f(v)\,\beta(du, dv) < \infty\Big\}, \tag{10}$$
where $(\cdot,\cdot)$ is a semi-inner product, then $L^2(\beta)$ will be a Hilbert space once equivalence classes are identified. In case $\beta$ determines a scalar (or complex) measure (i.e., strong harmonizability), $L^2(\beta)$ will be a pre-Hilbert space, and can be completed to a Hilbert space (as was done in this case by Cram\'er [6]); the analysis then proceeds in the resulting frequency domain spaces. Without using the MT or the Lebesgue integration, and defining the spectral domain directly as the set of $f$ for which $\int_{\hat G} f(u)\,Z(du)$ exists as a vector integral (so that the space is somewhat larger than $L^2(\beta)$ in the strongly harmonizable case), Mehlman [1] showed differently that this spectral domain is complete, and hence a Hilbert space, even for the strongly harmonizable class, corresponding to the completed version in Cram\'er's work. Here Mehlman uses the Dunford-Schwartz integration theory and a "dilation theorem" of harmonizable processes to a stationary process on a larger Hilbert space; invoking the known completeness of the latter space, he shows that his spectral domain is the range of an orthogonal projection of the Hilbert space of the stationary process, which is ultimately similar to the earlier works. We shall not pause to establish these statements. A comprehensive account is available in Chang and Rao [1]; some additional details, as well as extensions, are in the works by Kakihara [1] and Swift [1]. In these sources vector valued processes are also treated. We recall some of these facts as needed in the filtering work and other applications later.

A simple example of the operator $\Lambda$ of (1), commuting with translations, is the classical difference filter, given by

$$\Lambda X_t = \sum_{j=1}^{N} a_j\, X_{t-\tau_j}, \qquad t \in \mathbb{Z},\ a_j \in \mathbb{C}, \tag{11}$$
and another is the integral filter

$$\Lambda X_t = \int_G a(\tau)\,X_{t-\tau}\,d\tau, \qquad t \in \mathbb{R}. \tag{12}$$
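A short numerical sketch of the difference filter (11), with the translations $\tau_j = j$ and illustrative coefficients (anticipating the spectral characteristic $F$ defined after (16) below): applied to a weakly stationary input, the output spectral density is $|F|^2$ times that of the input, up to edge effects.

```python
# Sketch of the difference filter (11) with tau_j = j applied to a weakly
# stationary AR(1) input; the output periodogram is approximately |F(v)|^2
# times the input periodogram, with F(v) = sum_j a_j e^{-i j v}.
import numpy as np

rng = np.random.default_rng(5)
n = 2**16
X = np.zeros(n)
for t in range(1, n):                        # AR(1) input, weakly stationary
    X[t] = 0.6 * X[t - 1] + rng.normal()

a = np.array([1.0, -0.9, 0.3])               # filter coefficients (hypothetical)
Y = a[0] * X[2:] + a[1] * X[1:-1] + a[2] * X[:-2]

v = np.fft.rfftfreq(len(Y)) * 2 * np.pi
F = a[0] + a[1] * np.exp(-1j * v) + a[2] * np.exp(-2j * v)
S_X = np.abs(np.fft.rfft(X[2:]))**2 / len(Y)
S_Y = np.abs(np.fft.rfft(Y))**2 / len(Y)
band = (v > 1.0) & (v < 1.2)                 # average over a band to reduce noise
print(S_Y[band].mean(), (np.abs(F[band])**2 * S_X[band]).mean())  # approx equal
```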
In the same way one can consider the differential filter, in which the $X_t$'s are taken to have mean-square derivatives. Recall that $\frac{dX_t}{dt} = U_t$ (say) is defined as the element (if it exists) of $L^2(P)$ satisfying

$$\lim_{h\to0} \Big\| \frac{X_{t+h} - X_t}{h} - U_t \Big\|_2 = 0, \qquad t \in G = \mathbb{R},\ h \in \mathbb{R} - \{0\}.$$

In the harmonizable case such a derivative exists if we have

$$\lim_{h\to0} \Big\| \frac{1}{h}(X_{t+h} - X_t) - \int_{\hat G} e^{itv}\,iv\,Z(dv) \Big\|_2 = \lim_{h\to0} \Big\| \int_{\hat G} e^{itv}\Big(\frac{e^{ivh} - 1}{h} - iv\Big)\,Z(dv) \Big\|_2 = 0. \tag{13}$$

This holds whenever $\int_{\hat G} v\,Z(dv)$ exists as a Dunford-Schwartz integral, so that the integrand is dominated by $2|v|$, and the vector integration theory implies that (13) is legitimate, giving $\frac{dX_t}{dt} = i\int_{\hat G} v\,e^{itv}\,Z(dv)$. Iterating the procedure, higher order mean-square derivatives can be defined, and one has

$$\frac{d^k X_t}{dt^k} = \int_{\hat G} e^{itv}\,(iv)^k\,Z(dv), \tag{14}$$

whenever the vector integral exists as an element of $L^2(P)$. If one sets $\tilde Z(A) = \int_A (iv)^k\,Z(dv)$, which is again a vector measure in $A\ (\in \mathcal{B}(\hat G))$, one finds from (14) that the $k$-th derivative of a harmonizable process is again a (new) harmonizable process, provided $v^k$ is integrable relative to the original stochastic spectral measure $Z$. This is an extension of the well-known stationary case. An equivalent condition for this existence is that the covariance function $r$ have the partial derivative of order $2k$ at each point of the diagonal, so that

$$E(X_s^{(k)} \bar X_t^{(k)}) = \frac{\partial^{2k} r}{\partial s^k\,\partial t^k}(s,t) = \int_{\hat G}^{*}\int_{\hat G}^{*} (iv)^k\,(-iv')^k\,F(dv, dv'), \tag{15}$$

as may be verified. With this concept, the difference-differential filter $\Lambda$, of order $(m, N)$, can be given as

$$\Lambda X_t = \sum_{k=0}^{m}\sum_{j=1}^{N} a_j(\tau_j)\,X^{(k)}_{t-\tau_j}$$
$$= \sum_{k=0}^{m}\sum_{j=1}^{N} \int_{\hat G} a_j(\tau_j)\,(iv)^k\,e^{itv}\,Z(dv) = \int_{\hat G} F(v)\,e^{itv}\,Z(dv) \quad \text{(say)}, \tag{16}$$
where $F(v)$, defined by the integrand here, is called the spectral characteristic of the filter. Actually, everything stated above can equally be formulated if the process $X_t$ is $k$-vector valued, i.e., $X_t(\omega) \in \mathbb{C}^k$. The only difference is that the linear spans now have $k \times k$ matrix coefficients (i.e., elements of $B(\mathbb{C}^k)$), and $E(X_s \bar X_t)$ is replaced by $E(X_s X_t^*) = (E(X_s^j \bar X_t^{j'}),\ 1 \le j, j' \le k)$. When no additional arguments are needed, we include this vector case below because of its applicational potential.

An explicit form of the filter operator $\Lambda$ to be analyzed is the combination of the above three examples. It is termed the integro-difference-differential operator, and is given by (writing $X(t)$ for $X_t$ for convenience):

$$\Lambda X(t) = \sum_{p_1=0}^{m_1}\cdots\sum_{p_n=0}^{m_n} \int_G A_{p_1,\ldots,p_n}(\tau)\; \frac{\partial^{p_1+\cdots+p_n} X(t-\tau)}{\partial t_1^{p_1}\cdots \partial t_n^{p_n}}\; \Gamma_{p_1,\ldots,p_n}(d\tau). \tag{17}$$
Here $t = (t_1, \ldots, t_n) \in G$, $\tau = (\tau_1, \ldots, \tau_n) \in G$, and the partial derivatives are as defined in (13) [in mean-square], with $A_{p_1,\ldots,p_n}(t) \in B(\mathbb{C}^k)$ a matrix of complex bounded Borel functions, and $\Gamma_{p_1,\ldots,p_n}$ a complex Borel measure on $\mathcal{B}(G)$, the integrals in (17) being in the (vector or) Bochner sense. Specializing $\Gamma$ to be of discrete, continuous, or mixed type, one gets the previous examples as particular cases. Mainly, $\Lambda$ is a closed linear operator, and later we can drop the restriction that it commutes with the translation operator (i.e., that $\Lambda$ and $V_t$, a unitary operator, commute). We include the following simple result for convenient reference.

2. Proposition. Let $\{X_t, t \in G = \mathbb{R}^n\ \text{or}\ \mathbb{Z}^n\}$ be weakly harmonizable with $Z_x$ and $\beta_x$ as its stochastic spectral measure and spectral bimeasure respectively, such that $\int_{\hat G} |v_1|^{p_1}\cdots|v_n|^{p_n}\,Z_x(dv)$ exists. If $\Lambda X_t = Y_t$ is the filter equation given by (17), then $\{Y_t, t \in G\}$ is also weakly harmonizable, with $Z_y$ and $\beta_y$ as its respective stochastic spectral measure and spectral bimeasure, satisfying

$$Z_y(A) = \int_A F(v)\,Z_x(dv), \qquad A \in \mathcal{B}(\hat G), \tag{18}$$

and

$$\beta_y(A, B) = \int_A^{*}\int_B^{*} F(v)\,\beta_x(dv, dv')\,F(v')^{*}, \qquad A, B \in \mathcal{B}(\hat G), \tag{19}$$
where $F$ is the ($k \times k$ matrix) spectral characteristic of the filter $\Lambda$, given by

$$F(v) = \sum_{p_1=0}^{m_1}\cdots\sum_{p_n=0}^{m_n} (iv_1)^{p_1}\cdots(iv_n)^{p_n} \int_G A_{p_1,\ldots,p_n}(\tau)\,e^{-i\langle\tau, v\rangle}\,\Gamma_{p_1,\ldots,p_n}(d\tau), \tag{20}$$
and $\hat G = \mathbb{R}^n$ or $(0, 2\pi]^n$ as appropriate, $\langle\tau, v\rangle$ being the inner product in Euclidean $n$-space. [Here $F^*$ is the conjugate transpose of $F$.]

Proof. Both (18) and (19) follow if $Y_t = \Lambda X_t$ is shown to be harmonizable under the given conditions. The following sketch suffices to establish them. Substituting (17) into the filter equation, we get, using the fact that the mean-square derivatives exist:

$$Y_t = \Lambda X_t = \sum_{p_1=0}^{m_1}\cdots\sum_{p_n=0}^{m_n} \int_{\hat G} (iv_1)^{p_1}\cdots(iv_n)^{p_n} \int_G e^{i\langle t-u,\, v\rangle}\, A_{p_1,\ldots,p_n}(u)\,\Gamma_{p_1,\ldots,p_n}(du)\,Z_x(dv), \tag{21}$$
where we employed Fubini's theorem after applying a continuous linear functional $x_1^*\ (\in (\mathbb{C}^k)^*)$ to both sides and observing that $x_1^* \circ Z_x$ is a signed measure, for which the classical (scalar) Fubini theorem applies. Since $x_1^*$ is arbitrary, the above computation is valid (cf. Dunford-Schwartz [1], III.11.13). Then (20) and (21) immediately yield the relation

$$Y_t = \int_{\hat G} e^{i\langle t, v\rangle}\,F(v)\,Z_x(dv) = \int_{\hat G} e^{i\langle t, v\rangle}\,Z_y(dv).$$
This verifies (18), and then (19) is a consequence of the above on using a change of variables technique. $\Box$

Note. If $X_t$ is weakly stationary, then $\beta_x(A, B) = \beta_x(A \cap B)$, so that (19) reduces to a single integral; then $\beta_y(A, B) = \beta_y(A \cap B)$ as well, so $Z_y$ has orthogonal values. Thus the $Y_t$-process (or field) is of the same type. Similarly, if $X_t$ is strongly harmonizable, then $\beta_x$ has finite Vitali variation, which implies that the MT-integral in (19) reduces to the Lebesgue concept, so that $\beta_y$ also has finite Vitali variation, whence $Y_t$ is strongly harmonizable as well. Note that in (17) we have the key property that $\Lambda$ commutes with translations, and this is used in (18) and (19) to conclude the above representation of $Y_t$. In general, if $J$ is a linear operator, even a bounded one, and $X_t$ is stationary, then $Y_t = JX_t$ is only weakly harmonizable (neither stationary nor strongly harmonizable). This is seen from the integral representation of $X_t$, which gives $JX_t = \int_{\hat G} e^{i\langle t,v\rangle}\,JZ_x(dv)$ by a known property of the vector integral; and $(JZ)(\cdot)$ need neither have orthogonal values, nor need $(A, B) \mapsto (JZ(A), JZ(B)) = \tilde\beta(A, B)$ have finite Vitali variation, as easy examples show. Thus (17) appears to be essentially the most general form of a linear filter that commutes with translations, as observed already by Bochner ([2], p. 8).

The precursor of this filtering problem is a pioneering study by Nagabhushanam [1], where both the output and the input processes are weakly stationary and $\Lambda$ is a difference or an integral filter. The Hilbert space techniques at the center of this analysis were well positioned in Karhunen's [1] fundamental work, whose significance for our study has already been evidenced in Theorem 1 above. This motivated Bochner to introduce the key concept of V-boundedness, not available in the previous studies, and to seek conditions on the general filter for a stationary solution to exist when the output is stationary. The fundamental idea of Bochner's is to treat a second order process as a function into a Hilbert space, such as $L^2(P)$, and to find conditions for it to be the Fourier transform of a vector measure. The study was generalized to the harmonizable families only after a long period, when the structure of these processes had become better understood. An intermediate step in this procedure, for strongly harmonizable processes, was taken by Kelsh [1], who systematically extended Nagabhushanam's work to this case, including some $k$-vector process results.

We now characterize the solutions of the general filter equation (17). To include the $k$-vector case, one uses an analog of the fact that for a measurable function $g : G \to \mathbb{C}$ with $g(x) \ne 0$, $g(x)^{-1}$ is well defined and measurable. In the case of a vector or matrix valued measurable function $g$, one can define $g^{-1}$ as a "generalized inverse" with an analogous property. This is stated precisely as follows, since it will help present the desired solution. One can assume that the matrix $g(x)$ is square, because a rectangular one can be filled out with zeros to make it square. With this simplification, a square matrix valued function $g$ is said to have a generalized (or Moore-Penrose) inverse $g^{-1}$ at $x$ if the following four conditions hold: (i) $(g g^{-1} g)(x) = g(x)$, (ii) $(g^{-1} g g^{-1})(x) = g^{-1}(x)$, (iii) $(g g^{-1})^*(x) = (g g^{-1})(x)$, and (iv)
$(g^{-1} g)^*(x) = (g^{-1} g)(x)$. It is clear that these are automatic if $g(x)$ is nonsingular. In the general (including singular) case it is known that, for each $g$, there exists a unique $g^{-1}$ satisfying conditions (i)-(iv). [For details and applications of this concept in statistics and elsewhere, many books are available; see, e.g., C. R. Rao and Mitra [1], and for a simple constructive existence proof and immediate properties, see Chipman and Rao [1].] Also note that the elements of $g^{-1}$ are solutions of a finite number of polynomial equations in Borel functions, obtained from the above conditions; hence the elements of the generalized inverse are Borel functions. Moreover, $gg^{-1}$ is a hermitian idempotent matrix (hence positive semidefinite), and so it represents an orthogonal projection which is dominated by the identity matrix. Consequently, $(gg^{-1})^2 = (gg^{-1}) \le I$ implies that its diagonal elements satisfy

$$0 \le \sum_{j=1}^{k} a_{ij}\,a_{ji} = \sum_{j=1}^{k} |a_{ij}|^2 \le 1,$$

so that $|a_{ij}| \le 1$, $1 \le i, j \le k$. In particular $\int_{\hat G} (g g^{-1})(v)\,Z(dv)$ exists.
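The four conditions above are exactly those implemented by standard numerical libraries; the following quick sketch (illustrative) verifies them, and the projection property of $gg^{-1}$, for a singular matrix.

```python
# Verify the Moore-Penrose conditions (i)-(iv) and the projection property of
# g g^{-1} for a singular matrix, using numpy's pinv (which computes exactly
# this generalized inverse).
import numpy as np

g = np.array([[1.0, 2.0], [2.0, 4.0]])      # rank 1, hence singular
gi = np.linalg.pinv(g)

print(np.allclose(g @ gi @ g, g))               # (i)
print(np.allclose(gi @ g @ gi, gi))             # (ii)
print(np.allclose((g @ gi).T.conj(), g @ gi))   # (iii)
print(np.allclose((gi @ g).T.conj(), gi @ g))   # (iv)

P = g @ gi                                  # hermitian idempotent: a projection
print(np.allclose(P @ P, P), np.all(np.abs(P) <= 1 + 1e-12))
```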
These simple properties suffice to establish a general result. The desired solution, recovering the input represented by equation (17), is characterized by the following:

3. Theorem. Let $\{Y_t = \Lambda X_t, t \in G\}$ be a filter equation given by (17), where $\{Y_t, t \in G\}$ is weakly harmonizable with the spectral characteristic matrix function $F(\cdot)$ given by (20) and the spectral bimeasure $\beta_y$ by (19). Then there exists a weakly harmonizable input $\{X_t, t \in G\}$ iff the following three conditions hold:
(i) $\int_A^{*}\int_A^{*} (I - FF^{-1})(v)\,\beta_y(dv, dv')\,(I - FF^{-1})^{*}(v') = 0$, $A \in \mathcal{B}(\hat G)$;
(ii) $\int_{\hat G}^{*}\int_{\hat G}^{*} F^{-1}(v)\,\beta_y(dv, dv')\,F^{-1}(v')^{*}$ exists;
(iii) $\int_{\hat G}^{*}\int_{\hat G}^{*} \tilde F_{p_1,\ldots,p_n}(v)\,\beta_y(dv, dv')\,\tilde F^{*}_{p_1,\ldots,p_n}(v')$ exists for $0 \le p_j \le m_j$, $1 \le j \le n$, where $\tilde F_{p_1,\ldots,p_n}(v) = |v_1|^{p_1}\cdots|v_n|^{p_n}\,F^{-1}(v)$.

When these conditions obtain, the input $X_t$ is given by

$$X_t = \int_{\hat G} e^{i\langle t, v\rangle}\,F^{-1}(v)\,Z_y(dv), \qquad t \in G, \tag{22}$$

where $Z_y$ is the stochastic spectral measure of the harmonizable output $Y_t$. The solution is unique iff $F(v)$ is nonsingular for each $v \in \hat G$. However, there is only one weakly harmonizable solution satisfying $\Lambda X_t = Y_t$, $t \in G$, for which

$$\|I - F^{-1}F\|_{2,\beta_x}^2 = \mathrm{tr}\int_{\hat G}^{*}\int_{\hat G}^{*} (I - F^{-1}F)(v)\,\beta_x(dv, dv')\,(I - F^{-1}F)^{*}(v') = 0, \tag{23}$$
where $\beta_x$ is the spectral bimeasure of the weakly harmonizable solution $\{X_t, t \in G\}$.

Proof. For the sufficiency, we note by the previous analysis that $Z_x$, defined by the equation

$$Z_x(A) = \int_A F^{-1}(v)\,Z_y(dv), \qquad A \in \mathcal{B}(\hat G), \tag{24}$$
is a stochastic measure. A change of variables formula for vector measures is valid (it is first established for simple functions and extended to the general case by the dominated convergence theorem for the vector, or Dunford-Schwartz, integrals), and this will be used below, often without comment. Now (18) and condition (iii) of the hypothesis imply that $\int_{\hat G} |v_1|^{p_1}\cdots|v_n|^{p_n}\,Z_x(dv)$ exists. Hence $X_t$ given by (22) is meaningful and simplifies to

$$X_t = \int_{\hat G} e^{i\langle t, v\rangle}\,Z_x(dv), \qquad t \in G.$$

Consequently $\{X_t, t \in G\}$ is weakly harmonizable. Moreover, the preceding integrability condition on $Z_x$ implies that $\frac{\partial^{p_1+\cdots+p_n} X_t}{\partial t_1^{p_1}\cdots\partial t_n^{p_n}}$ exists in the mean-square sense, and one has (using the change of variables!)

$$\Lambda X_t = \int_{\hat G} e^{i\langle t, v\rangle}\,F(v)\,Z_x(dv) = \int_{\hat G} e^{i\langle t, v\rangle}\,F(v)\,F^{-1}(v)\,Z_y(dv).$$
ˆ Hence the matricial inner product of Yt −ΛXt is given for each A ∈ B(G) by
−1 (I−F F )(v)Zy (dv), (I − F F −1 )(v)Zy (dv) A A ∗ (I − F F −1 )(v)βy (dv, dv )(I − F F −1 )∗ (v ) = 0. = A
A
Taking traces of this matrix, one gets the norm, which is thus zero.

For the converse, suppose that $X_t$ has the necessary partial derivatives and $Y_t = \Lambda X_t$; we show that conditions (i)-(iii) must hold. Consider

$$Y_t = \int_{\hat G} e^{i\langle t, v\rangle}\,F(v)\,Z_x(dv) = \int_{\hat G} e^{i\langle t, v\rangle}\,Z_y(dv), \qquad t \in G.$$
It follows that the relation between the integrals here is also valid for all trigonometric polynomials, and this yields (24). Since $FF^{-1}$ is bounded, we deduce that

$$\int_{\hat G} |v_1|^{p_1}\cdots|v_n|^{p_n}\,(F^{-1}F)(v)\,Z_x(dv)\ \text{exists and belongs to}\ L^2(P, \mathbb{C}^k).$$

From this we get that $\int_{\hat G} |v_1|^{p_1}\cdots|v_n|^{p_n}\,F^{-1}(v)\,Z_y(dv) \in L^2(P, \mathbb{C}^k)$. This gives (iii), and

$$\int_{\hat G} (F^{-1}F)(v)\,Z_x(dv) = \int_{\hat G} F^{-1}(v)\,Z_y(dv),$$

whence (ii). Since (23) is valid, we have by a similar argument that

$$\int_A (I - FF^{-1})(v)\,Z_y(dv) = 0, \qquad A \in \mathcal{B}(\hat G).$$
Taking the (matricial) inner product again, this gives (i). Thus (i)-(iii) are necessary and sufficient for the existence of a (mean-square) solution of $\Lambda X_t = Y_t$, $t \in G$.

It remains to consider the uniqueness part. If $F(v)$ is nonsingular, the result is immediate, since for any other weakly harmonizable solution $\tilde X_t$ with $Z_{\tilde x}$ as its stochastic spectral measure one has

$$Z_y(A) = \int_A F(v)\,Z_x(dv) = \int_A F(v)\,Z_{\tilde x}(dv), \qquad A \in \mathcal{B}(\hat G), \tag{25}$$

so that $F(v)\,Z_x(dv) = F(v)\,Z_{\tilde x}(dv)$, whence $Z_x = Z_{\tilde x}$, i.e., $X_t = \tilde X_t$, $t \in G$. If $F$ is singular at some $v = v_0 \in \hat G$, then there exists $0 \ne W \in \mathbb{C}^k$ annihilated by $F(v_0)$, and clearly we can find a unit vector $V_0 \in L^2(P)$ orthogonal to the $X_t^j$, $j = 1, \ldots, k$. Set $\xi_t = e^{i\langle t, v_0\rangle}\,V_0\,W$, $t \in G$. Then this is a weakly stationary family whose stochastic measure concentrates at the point $v_0$; conditions (i)-(iii) hold for $\xi_t$, and $\Lambda\xi_t = 0$, $t \in G$. If $\tilde X_t = X_t + \xi_t$, then $\tilde X_t$ is a weakly harmonizable collection and $\Lambda\tilde X_t = Y_t$ holds, so that $X_t \ne \tilde X_t$ are two solutions. In fact there are as many solutions as there are points $v_0$ at which $F$ is singular. However, we now show that among all these there is just one element satisfying (23).

Thus let $X_t$ and $\tilde X_t$ be two harmonizable solutions of (17) with spectral bimeasures $\beta_x$ and $\beta_{\tilde x}$, both satisfying (23), so that

$$\|I - F^{-1}F\|_{2,\beta_x} = 0 = \|I - F^{-1}F\|_{2,\beta_{\tilde x}}. \tag{26}$$

Also, the equations corresponding to (25) give a pair of expressions for $Z_y$, namely

$$\int_A F(v)\,Z_x(dv) = Z_y(A) = \int_A F(v)\,Z_{\tilde x}(dv). \tag{27}$$
On the other hand, for any $A \in \mathcal{B}(\hat G)$ one has

$$\Big\|\int_{\hat G} \chi_A\,(I - F^{-1}F)(v)\,Z_{\tilde x}(dv)\Big\|_2^2 = \mathrm{tr}\int_A^{*}\int_A^{*} (I - F^{-1}F)(v)\,\beta_{\tilde x}(dv, dv')\,(I - F^{-1}F)^{*}(v') = \|(I - F^{-1}F)\chi_A\|_{2,\beta_{\tilde x}}^2 \le \|I - F^{-1}F\|_{2,\beta_{\tilde x}}^2 = 0, \tag{28}$$

since the trace of a positive definite matrix, as above, increases with $A$. Using this (and the change of variables) we get

$$\tilde X_t = \int_{\hat G} e^{i\langle t, v\rangle}\,Z_{\tilde x}(dv) = \int_{\hat G} e^{i\langle t, v\rangle}\,(F^{-1}F)(v)\,Z_{\tilde x}(dv), \ \text{by (28)}, = \int_{\hat G} e^{i\langle t, v\rangle}\,F^{-1}(v)\,Z_y(dv), \ \text{by (27)}, = X_t, \quad t \in G, \ \text{by (22)}.$$
= Xt , t ∈ G, by (22). Thus the solutions subject to (23) agree in mean-square norm, and hence under this restriction there is only one that satisfies (17), up to equivalence. The above condition becomes somewhat simpler and more intuitive if n = 1, k = 1, and m1 = m. In this case the spectral characteristic F is scalar, and let Q = {v : F (v) = 0} be the zero set of F , and ˆ − Q. By definition of the generalized inverse, F −1 = 0 on Qc = G Q. With these simplifications, the preceding theorem reduces to the following: 4. Corollary. Let Yt = ΛXt , t ∈ G(= R, or Z) be weakly harmonizable and βy be its spectral bimeasure. Then the spectral characteristic is given explicitly by: F (v) =
m p=0
p
(iv)
Ap (u)e−iuv Γp (du).
(29)
G
Let $Q$ be the zero set of $F$. Then a weakly harmonizable input $\{X_t, t \in G\}$ of the filter equation $\Lambda X_t = Y_t$ exists iff
(i') $\beta_y(Q, Q) = 0$,
(ii') $\int_{Q^c}\int_{Q^c} |F(v)\bar F(v')|^{-1}\,\beta_y(dv, dv') < \infty$,
(iii') $\int_{Q^c}\int_{Q^c} |vv'|^p\,|F(v)\bar F(v')|^{-1}\,\beta_y(dv, dv') < \infty$, $0 \le p \le m$.
When these conditions hold, we have a solution given by
$$X_t = \int_{Q^c} e^{itv}\,F^{-1}(v)\,Z_y(dv), \qquad t \in G,$$
and there is only one solution (except for mean-square equivalence) of the filter equation satisfying $\beta_x(Q, Q) = 0$. [If $Q = \emptyset$, then clearly there is a unique solution.]

We now present a detailed application for $\Lambda$ as a difference operator with $G = \mathbb{Z}$ but $k \ge 1$ and $m_1 = \ldots = m_n = 0$, with $\Gamma$ placing unit masses at $(m+1)$ points, so that
$$\Lambda X_n = \sum_{j=0}^{m} a_j X_{n-j}, \qquad n \in \mathbb{Z}, \quad a_0 = \mathrm{id}, \tag{30}$$
and each $a_j$ is a $k \times k$ constant matrix. Then the spectral characteristic is a (trigonometric) polynomial given by $F(v) = \sum_{j=0}^{m} a_j e^{ijv}$. In this particular case, the input of the filter equation can be given a sharper form as follows:

5. Theorem. Let $\Lambda X_n = Y_n$ be defined by (30), where $Y_n$ belongs to a weakly harmonizable $\mathbb{C}^k$-valued output sequence. Then the input $\{X_n, n \in G\}$ has the following description. If $F(v) = \sum_{j=0}^{m} a_j v^j$ is the spectral characteristic of $\Lambda$ and $Q = \{v : \det(F(v)) = 0\}$, let $D$ be the smallest open set containing the roots of the polynomial $\det(F)$ on the unit circle in $\mathbb{C}$, namely $\{v_j = e^{iv_j} \in Q\}$. If (i) $|v_j| > 1$ for all $j$, or those with $|v_j| = 1$ satisfy $\beta_y(D, D) = 0$, then $X_n = \sum_{\ell \ge 0} b_\ell Y_{n-\ell}$ for some constant matrix coefficients $b_\ell$, so that the input is obtained from the past and present values of the output, whence the filter is termed physically realizable, or causal; (ii) if the roots do not satisfy (i) but $\beta_y(D, D) = 0$ still holds, then the solution $X_n$ depends on the $Y_n$ values corresponding to the past, present, and some future, so that the filter is not physically realizable (or is not causal).

Proof. We consider the cases on the (finite number of) roots $v_j$ of the polynomial $\det F(v)$ separately, namely: (i) all the roots are outside the unit circle, i.e., $|v_j| > 1$; (ii) all are outside or on the circle, but those on the circle satisfy the condition $\beta_y(D, D) = 0$; and (iii) some roots satisfy $|v_j| \le 1$ and still $\beta_y(D, D) = 0$.

(i) $|v_j| > 1$, $\forall j$. Since $G = \mathbb{Z}$, $\hat G = (0, 2\pi]$ and $F(v)$ is a nonsingular matrix. So $(FF^{-1})(v) = (F^{-1}F)(v) = \mathrm{id}$, $v \in \hat G$. Now by Theorem 3, $\Lambda X_n = Y_n$ has a unique solution given by
$$X_n = \int_{\hat G} e^{inv}\,\big(F(e^{-iv})\big)^{-1}\,Z_y(dv), \qquad n \in G. \tag{31}$$
Let $F^{-1}(v) = (q_{j\ell})$ and $\Delta_{j\ell}(v)$ be the $(j\ell)$th cofactor of the determinant $\det F(v)$, so that $q_{j\ell}(v) = \frac{\Delta_{j\ell}(v)}{\det F(v)}$, and the roots satisfy $|v_j| \ge \alpha > 1$, $\forall j$. So both the numerator and the denominator, and hence $q_{j\ell}$, are analytic in the open disc $\{v : |v| < \alpha\}$. By the Taylor expansion
$$(\det F(v))^{-1} = \sum_{p \ge 0} b_p v^p, \tag{32}$$
where the $b_p$ are suitable coefficients, and the series converges uniformly and absolutely in compact subsets of the above disc. In particular this holds on the unit circle. Hence (31) becomes, on using the standard properties of the integral (cf. Dunford-Schwartz [1], Sec. IV.10):
$$X_n^j = \sum_{\ell=1}^{k} \int_{\hat G} e^{inv}\,q_{j\ell}(e^{-iv})\,Z_y^\ell(dv) = \sum_{\ell=1}^{k}\sum_{p=0}^{\infty} b_p \int_{\hat G} \Delta_{j\ell}(e^{-iv})\,e^{i(n-p)v}\,Z_y^\ell(dv), \text{ by (32)},$$
$$= \sum_{\ell=1}^{k}\sum_{p=0}^{\infty}\sum_{s=0}^{m_{j\ell}} c_s^{j\ell} \int_{\hat G} e^{i(n-p-s)v}\,Z_y^\ell(dv), \quad \text{since } \Delta_{j\ell}(v) = \sum_{s=0}^{m_{j\ell}} c_s^{j\ell} v^s \text{ is a polynomial},$$
$$= \sum_{\ell=1}^{k}\sum_{p=0}^{\infty}\sum_{s=0}^{m_{j\ell}} b_p c_s^{j\ell}\,Y_{n-p-s}^\ell.$$
This implies that $X_n$ is determined linearly by $\{Y_r, r \le n\}$, and the filter is causal in the case that all roots are outside the unit circle.

(ii) Under the hypothesis that $\beta_y(D, D) = 0$ for the roots on the unit circle, we observe that the same situation as in (i) occurs again. In fact, by Theorem 3, the solution exists and is given by (31). Then by hypothesis we have
$$X_n^j = \sum_{\ell=1}^{k} \int_{\hat G \cap D^c} e^{inv}\,q_{j\ell}(e^{-iv})\,Z_y^\ell(dv), \qquad n \in \mathbb{Z}. \tag{33}$$
Since on $\hat G \cap D^c$ we again have the analyticity as in (i), the same conclusion holds.

(iii) In this case the bimeasure $\beta_y$ charges inside or on the unit circle. Then $q_{j\ell}(v)$ is meromorphic, and using partial fractions we get
a Laurent-series-type expansion with both positive and negative powers. This will be elaborated under our hypothesis as follows. We divide the roots into groups: (a) those that are outside the unit circle, (b) those that are at $v = 0$, and (c) those that are in $\{v : 0 < |v| < 1\}$, since $\beta_y$ does not charge the unit roots by hypothesis. [See the remark below about the difficulties arising in the case of unit roots with possible multiplicities.] Observe that group (a) is covered by part (i) above, so only (b) and (c) need be considered. In group (b), since $q_{j\ell}(v) = \frac{\Delta_{j\ell}(v)}{\det(F(v))}$, let $d\ (\ge 0)$ be the largest integer such that $v^d$ divides (the polynomial) $\Delta_{j\ell}(v)$, $1 \le j, \ell \le k$, and similarly let $d_1$ be the largest integer for which $v^{d_1}$ divides $\det(F(v))$. Then $d_1 \ge d + 1$. Indeed, in the contrary case one must have $d_1 \le d$, and if we now let $v \to 0$, then
$$\lim_{v \to 0}(q_{j\ell}(v)) = \lim_{v \to 0} F^{-1}(v) = B \ \text{(say)}$$
exists. But $\lim_{v \to 0} F(v) = F(0)$ also exists, and hence both of the following limits exist:
$$\lim_{v \to 0} F(v)F^{-1}(v) = F(0)B, \qquad \lim_{v \to 0} F^{-1}(v)F(v) = BF(0).$$
Consequently, $F(v)F^{-1}(v) = I$ must hold for all $v$ close to $0$. This contradicts the fact that in (b), $\det(F(0)) = 0$, so that $F^{-1}(v)$ cannot exist near $0$. Thus $d_1 \ge d + 1$ holds. So $\frac{\Delta_{j\ell}(v)}{\det(F(v))} = \frac{1}{v^{d_1-d}}\,h(v)$ with $d_1 - d \ge 1$ and some $h$ analytic on $|v| = 1$. Then the series has at least one negative power of $v$, so that (33) implies that $X_n^j$ contains at least one future value of $Y_m$, $m \ge n + 1$; whence $\Lambda$ is not causal. In group (c), factors of the form $(v_j - v)^p$, $p < 0$, are present, and so the expression (33) will involve not only the past and present, but also some future terms of the output. Thus $\Lambda$ is not causal, and only (i) and (ii) give a physically realizable filter.

Remark. In case there are some roots on the unit circle that are charged by $\beta_y$ (so $\beta_y(D, D) > 0$), the above method fails, but one can use an Euler summability procedure (cf., e.g., Hardy [1], p. 236), which is easily extendible to the Hilbert space case needed here. With this, as Nagabhushanam ([1], p. 449) observed, extending an earlier idea due to H. Wold for the stationary case, there are cases in which the filter can be causal and cases in which it is not. Indeed, if the roots have higher multiplicity, then there are real difficulties; even in simple multidimensional cases there are complications. The resulting problem needs further investigation and has not yet been settled. The continuous parameter case presents a new set of questions as well as new insights, and they will be examined briefly.
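Before taking that up, the causal inversion in part (i) of Theorem 5 can be made concrete numerically. The following minimal Python sketch (the filter coefficients $a_j$ and all numerical values are illustrative assumptions, not from the text) computes the Taylor coefficients $b_p$ of $(\det F)^{-1}$ as in (32) for a scalar ($k = 1$) difference filter whose roots lie outside the unit circle, and checks that the input is recovered from past and present outputs only:

import numpy as np

# Scalar difference filter Lambda X_n = sum_j a_j X_{n-j} = Y_n with
# F(z) = 1 - 0.5 z + 0.06 z^2; both roots (5 and 10/3) lie outside the
# unit circle, so by Theorem 5(i) the filter is causal.
a = np.array([1.0, -0.5, 0.06])
roots = np.roots(a[::-1])                 # np.roots expects highest degree first
assert np.all(np.abs(roots) > 1.0), "a root lies inside or on the unit circle"

# Power-series inversion, cf. (32): b_0 = 1/a_0 and, for p >= 1,
# b_p = -(1/a_0) * sum_{j=1}^{min(p,m)} a_j b_{p-j}.
P = 50                                    # truncation order of the series
b = np.zeros(P)
b[0] = 1.0 / a[0]
for p in range(1, P):
    b[p] = -sum(a[j] * b[p - j] for j in range(1, min(p, len(a) - 1) + 1)) / a[0]

# Check on simulated data: X_n is rebuilt from {Y_r, r <= n} alone.
rng = np.random.default_rng(0)
X = rng.standard_normal(500)
Y = np.array([a @ X[n - 2:n + 1][::-1] for n in range(2, 500)])  # Y_n = sum a_j X_{n-j}
X_rec = np.array([sum(b[p] * Y[n - p] for p in range(min(n + 1, P)))
                  for n in range(len(Y))])
print(np.max(np.abs(X_rec[P:] - X[2 + P:])))   # negligibly small

If a root is moved inside the unit circle, the expansion acquires negative powers and the reconstruction would require future outputs, in line with part (ii) of the proof above.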
6. Proposition. Let $\{\Lambda X_t = Y_t, t \in G = \mathbb{R}^n\}$ be the filter equation given by (17) for which conditions (i)-(iii) of Theorem 3 hold, so that a harmonizable input family $X_t$, $t \in G$, exists. Suppose moreover that the (generalized) inverse of the spectral characteristic $F$ of $\Lambda$ is the Fourier transform of a function $g \in L^1(G, \mathbb{C}^k)$, in the sense that $\|\hat g - (F^{-1})^*\|_{2,\beta_y} = 0$. Then the input process is representable as:
$$X_t = \int_G g(t-u)\,Y_u\,du, \tag{34}$$
and in particular, when $G = \mathbb{R}$, the filter is physically realizable iff $g(u) = 0$ for all $u < 0$.

Proof. By Theorem 3, the solution input process exists, and with the additional assumption here one has
$$\Big\|\int_{\hat G} (F^{-1} - \hat g^*)(v)\,Z_y(dv)\Big\|_2 = \|F^{-1} - \hat g^*\|_{2,\beta_y} = \|\hat g - (F^{-1})^*\|_{2,\beta_y} = 0. \tag{35}$$
Consider $\hat g^*$ in place of $F^{-1}$ in (22) to obtain:
$$\int_G g(u)\,Y_{t-u}\,du = \int_G g(u)\int_{\hat G} e^{i\langle t-u,v\rangle}\,Z_y(dv)\,du = \int_{\hat G} e^{i\langle t,v\rangle}\Big(\int_G e^{-i\langle u,v\rangle}g(u)\,du\Big)Z_y(dv),$$
by an argument for Fubini's theorem used before,
$$= \int_{\hat G} e^{i\langle t,v\rangle}\,\hat g(v)^*\,Z_y(dv) = \int_{\hat G} e^{i\langle t,v\rangle}\,F^{-1}(v)\,Z_y(dv), \text{ by hypothesis}, = X_t, \text{ by (35) and (22)}.$$
This gives (34). Now taking $G = \mathbb{R}$ and rewriting (34), we get:
$$X_t = \int_{\mathbb{R}} g(t-u)\,Y_u\,du = \int_{-\infty}^{t} g(t-u)\,Y_u\,du + \int_{t}^{\infty} g(t-u)\,Y_u\,du,$$
and the solution is causal iff the second term vanishes. This can happen for an arbitrarily given output process satisfying the filter equation iff $g(u) = 0$ for $u < 0$, as asserted.

To find conditions on $F^{-1}$ such that its Fourier transform vanishes on the left half axis, one has to consider an appropriate Hardy
class, and a suitable Paley-Wiener type result on (generalized inverses of) matrix valued functions. This is an important problem in function theory, and it leads to several nontrivial and interesting questions for investigation. We thus point out that this new set of questions awaits solutions, and we have to leave the discussion at this point.

There is also a related problem on signal extraction, of great interest in applications, that can be put in the form of a linear filter. We again illustrate it here to emphasize how additional information demands refinements of, and then new, methods of the general model. This is already witnessed by the preceding two results. Here we consider the additive, or signal plus noise, model with $G = \mathbb{R}$ or $\mathbb{Z}$:
$$Y_t = S_t + N_t, \qquad t \in G, \tag{36}$$
where $\{S_t, t \in G\}$ is the signal and $\{N_t, t \in G\}$ the noise process, both of which are assumed to be (weakly) harmonizable and, for simplicity, orthogonal to each other, so that $Y_t$ is likewise a harmonizable family. This may be transferred into a simple filter model treated above, as follows, on taking $G = \mathbb{Z}$ for illustration. Let
$$\tilde\Lambda = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}, \qquad X_t = \begin{pmatrix} S_t \\ N_t \end{pmatrix}, \qquad \bar X_t = \begin{pmatrix} Y_t \\ 0 \end{pmatrix},$$
so that (36) becomes $\bar X_t = \tilde\Lambda X_t$. The problem is this: after observing the $Y_t$-process and knowing the covariance structures of $Y_t$ and $N_t$, one should find an "optimal" (least squares) estimator of the (unknown) signal at any instant $t \in G$. By Corollary 4 above, $\tilde X_t$ may be obtained. Indeed, the spectral characteristic of the model $\bar X_t = \tilde\Lambda X_t$ is $F(v) = \tilde\Lambda$, which is a constant. Then (31) becomes
$$\tilde X_n = \int_{\hat G} e^{inv}\,\tilde\Lambda^{-1}\,Z_{\bar x}(dv) = \int_{\hat G} e^{inv}\begin{pmatrix} \frac12 & 0 \\ \frac12 & 0 \end{pmatrix} Z_{\bar x}(dv) = \begin{pmatrix} \frac12 Y_n \\ \frac12 Y_n \end{pmatrix}, \tag{37}$$
where the generalized inverse of $\tilde\Lambda$ is substituted and the spectral stochastic measure $Z_{\bar x} = \binom{Z_y}{0}$ of the output $\bar X_n$-process is used. Thus the general theory tells us to take half the observed value for the signal, but it does not utilize the characteristics of the noise process. To remedy this omission, and to utilize the full information, we consider the problem from the point of view of least squares estimation and obtain an optimal value of the signal as follows. This was already done by Grenander [1] for the stationary case, and by Parzen [2] when the signal and noise are both Gaussian (but not necessarily stationary), where RKHS methods produced sharp results. The result obtained by him is not an estimator of the signal $S_a$ at $a\,(\in G)$, but the likelihood
ratio of the signal-plus-noise Gaussian versus the pure noise Gaussian process, and a test for the (non)existence of the signal from the output (see also Chapter VII for similar analysis). Here we proceed to obtain the best estimator of the signal at any given point $a\,(\in G)$. The following result contains a solution of this problem.

7. Theorem. Let $\{X_t = S_t + N_t, t \in G\}$ be a weakly harmonizable signal plus noise process (or field), where $S_t$ is the signal and $N_t$ the noise, which are harmonizably correlated, with values in $\mathbb{C}^k$. Suppose their spectral bimeasures $\beta_x$, $\beta_s$, $\beta_n$ and $\beta_{sn}$ are known. Then for any $a \in G$, the linear least-squares estimator $\hat S_a$ of the signal at $a$ is given by the equation:
$$\hat S_a = \int_{\hat G} H_a(v)\,Z_x(dv), \tag{38}$$
where $H_a(\cdot)\,(\in L^2(\beta_x))$ is a solution of the (Wiener-Hopf type) matricial integral equation
$$\int_{\hat G} H_a(v)\,\mu(dv) = \int_{\hat G} e^{i\langle a,v\rangle}\,\nu(dv). \tag{39}$$
Here $\mu$ and $\nu$ are the $k \times k$-matricial ($\sigma$-additive) measures ($\mu : \mathcal{B}(\hat G) \to B(\mathbb{C}^k)$, and similarly $\nu$) determined by the bimeasures as ($\beta_{sn}^*(A \times B) = \bar\beta_{ns}(B \times A)$, the conjugate transpose):
$$\mu(\cdot) = (\beta_s + \beta_n + \beta_{sn} + \beta_{sn}^*)(\cdot, \hat G), \qquad \nu(\cdot) = (\beta_s + \beta_{sn})(\cdot, \hat G).$$
The optimal solution under these conditions is also unique. Moreover, the expected error covariance matrix $\sigma_a^2 = E[(S_a - \hat S_a)(S_a - \hat S_a)^*]$ is given by:
$$\sigma_a^2 = \int_{\hat G}\int_{\hat G} \big[e^{i\langle a, v-v'\rangle}\,\beta_s(dv, dv') - H_a(v)\,\beta_x(dv, dv')\,H_a^*(v')\big]. \tag{40}$$
Proof. The result is established by using the Hilbert space geometry in a relatively simple manner, and we sketch the details. Let $\mathcal{H}_X = \overline{\mathrm{sp}}\{X_t, t \in G\} \subset L^2(P, \mathbb{C}^k)$ be the closed linear span (with matrix coefficients) representing the observed family $X_t$. For any $a \in G$, the unknown signal $S_a\,(\in L^2(P, \mathbb{C}^k))$ at $a$ is estimated by the closest element $\hat S_a \in \mathcal{H}_X$, according to the least-squares method, i.e., $\|S_a - \hat S_a\|_2 = \min\{\|S_a - Y\|_2 : Y \in \mathcal{H}_X\}$, and the classical Hilbert space results also imply that the minimum exists uniquely. In fact, $\hat S_a$ is the element of $\mathcal{H}_X$ such that $S_a - \hat S_a \in \mathcal{H}_X^\perp$, and it is given by
$$((S_a - \hat S_a, X_t)) = 0, \quad \text{or} \quad ((S_a, X_t)) = ((\hat S_a, X_t)), \quad \forall t \in G, \tag{41}$$
so that the unique $\hat S_a$ is determined by solving this system of equations, using the rest of the hypothesis. [Here $((\cdot,\cdot))$ denotes the inner product of $L^2(P, \mathbb{C}^k)$, i.e., the trace of the (gramian) matrix $E(X_s X_t^*)$.] Since the $X_t$, $S_t$, $N_t$ families are now harmonizable, let $Z_x$, $Z_s$, $Z_n$ be their stochastic spectral measures, so that
$$\int_{\hat G} e^{i\langle t,v\rangle}\,Z_x(dv) = \int_{\hat G} e^{i\langle t,v\rangle}\,Z_s(dv) + \int_{\hat G} e^{i\langle t,v\rangle}\,Z_n(dv), \qquad t \in G. \tag{42}$$
Denoting the corresponding spectral bimeasures by $\beta_x$, $\beta_s$, $\beta_n$, and $\beta_{sn}$ (the last one for the cross product of $S_t$, $N_s$, which is also given weakly harmonizable), one gets:
$$E(S_r N_t^*) = \int_{\hat G}\int_{\hat G} e^{i(\langle r,v\rangle - \langle t,v'\rangle)}\,\beta_{sn}(dv, dv'), \qquad (r, t \in G).$$
But every element of $\mathcal{H}_X$ has an integral representation relative to $Z_x$ (first for finite sums and then in the general case by approximation, with the Dunford-Schwartz integration theory), so that the minimal element $\hat S_a \in \mathcal{H}_X$ of (41) is obtainable as
$$\hat S_a = \int_{\hat G} H_a(v)\,Z_x(dv), \tag{43}$$
for a unique $\mathcal{B}(\hat G)$-measurable and $Z_x$-integrable (in the D-S sense) matrix function $H_a$. Substituting this in (41) and using the relation
$$\beta_{sx}(A, B) = E(Z_s(A)Z_x(B)^*) = E(Z_s(A)Z_s(B)^*) + E(Z_s(A)Z_n(B)^*) = \beta_s(A, B) + \beta_{sn}(A, B),$$
one obtains the integral equation for $H_a$ as:
$$\int_{\hat G}\int_{\hat G} e^{i[\langle a,v\rangle - \langle t,v'\rangle]}\,(\beta_s + \beta_{sn})(dv, dv') = \int_{\hat G}\int_{\hat G} H_a(v)\,e^{-i\langle t,v'\rangle}\,(\beta_s + \beta_n + \beta_{sn} + \beta_{sn}^*)(dv, dv'), \qquad t \in G. \tag{44}$$
Substituting the measures $\mu$ and $\nu$ of the statement and using the $\sigma$-additivity of the bimeasures in their components, (44) reduces to (39). Finally, the expected error covariance is given by
$$\sigma_a^2 = E[(S_a - \hat S_a)S_a^*], \quad (\text{since } S_a - \hat S_a \perp \mathcal{H}_X, \text{ and } \hat S_a \in \mathcal{H}_X),$$
$$= E(S_a S_a^*) - E(\hat S_a \hat S_a^*), \quad (\text{since } \hat S_a \perp (\hat S_a - S_a)), = \int_{\hat G}\int_{\hat G} \big\{e^{i\langle a, v-v'\rangle}\,\beta_s(dv, dv') - H_a(v)\,\beta_x(dv, dv')\,H_a^*(v')\big\}.$$
This is precisely (40).

In case $\mu$, $\nu$ are absolutely continuous relative to the Lebesgue measure on $\hat G$, with (matrix) densities $\mu'$, $\nu'$, then (39) becomes:
$$\int_{\hat G} H_a(v)\,\mu'(v)\,dv = \int_{\hat G} e^{i\langle a,v\rangle}\,\nu'(v)\,dv. \tag{45}$$
If moreover the signal and noise are uncorrelated, then both $\mu'$, $\nu'$ are positive (semi-)definite matrices, and using the generalized inverse notation we can consider:
$$H_a(v) = e^{i\langle a,v\rangle}\,\nu'(v)\,\mu'^{-1}(v), \tag{46}$$
and observe that this $H_a$ satisfies (45) provided that $\nu'(v)$ is in the range of the orthogonal projection $(\mu'\mu'^{-1})(v)$ for a.a. $v$. However, by definition $\mu' = \beta_x' = (\beta_s + \beta_n)'$ and $\nu' = \beta_s'$ in this case, so that $(\mu'\mu'^{-1})(v)$ is a projection containing the range of $\mu'(v)$, which, by the positive (semi-)definiteness of $\beta_s'$ and $\beta_n'$, contains the range of $\nu'(v)$. Thus (46) satisfies (45), and by the uniqueness it is the only solution. Thus we have the following result, due to Kelsh [1]:

8. Corollary. If the signal and noise are orthogonal, and the rest of the hypothesis of the theorem holds, then one has
$$\hat S_a = \int_{\hat G} H_a(v)\,Z_x(dv), \tag{47}$$
where $H_a$ is explicitly given by (46). If moreover the process is scalar (weakly) stationary, then (47) becomes
$$\hat S_a = \int_{\hat G} \frac{f(v)}{(f+g)(v)}\,e^{iav}\,Z_x(dv), \tag{48}$$
where $f$, $g$ are respectively the spectral densities of the signal and noise processes.
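As a concrete illustration of (48), the following minimal Python sketch applies the multiplier $f/(f+g)$ on the discrete Fourier grid to an observed path $Y = S + N$; the AR(1) signal, the white noise, and all parameter values are illustrative assumptions, not from the text:

import numpy as np

rng = np.random.default_rng(1)
n, phi = 4096, 0.9

# Simulate an AR(1) signal S_t = phi*S_{t-1} + e_t and add white noise.
e = rng.standard_normal(n)
S = np.zeros(n)
for t in range(1, n):
    S[t] = phi * S[t - 1] + e[t]
N = rng.standard_normal(n)
Y = S + N

# Spectral densities on the FFT grid v_k = 2*pi*k/n (common constants cancel
# in the ratio f/(f+g)):
v = 2 * np.pi * np.fft.fftfreq(n)
f = 1.0 / np.abs(1 - phi * np.exp(-1j * v)) ** 2   # AR(1) spectral density
g = np.ones(n)                                     # white-noise density

# Apply H(v) = f/(f+g) to the Fourier content of Y, cf. (48).
S_hat = np.real(np.fft.ifft((f / (f + g)) * np.fft.fft(Y)))

print(np.mean((Y - S) ** 2), np.mean((S_hat - S) ** 2))  # filtering reduces the MSE

The estimator attenuates the frequencies where the noise density dominates and passes those where the signal density does, which is precisely the content of the ratio $f/(f+g)$ in (48).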
The result (48) was originally established by Grenander [1], and it was taken as a basis for various extensions. The preceding pair of results shows that there is much interest in specializations and in obtaining concrete formulas in filtering theory. Indeed, there are many possibilities here.
We have already considered such a result, with a certain type of weighted signal, as prediction in Section IV.3. Another interesting specialization arises when the signal (or input) and observation (or output) are given by a system of (difference or differential) equations with constant coefficients. Pioneering work on these problems for certain classes of processes (especially Gaussian) was done by Kalman, and later by Kalman and Bucy, followed by other workers with (particularly) engineering interests. For a better understanding of the method, we consider first the discrete case, and then its continuous parameter analog together with some nonlinear extensions of this new model.

8.4 Kalman-Bucy filters: the linear case

We begin by specializing the filter equation $Y_t = \Lambda X_t$, $t \in G$, as a guide for the present considerations. Recall that by a particular choice of the filter $\Lambda$, we obtained from the original equation, $Y_t = S_t + N_t$, the signal plus noise model. Suppose now that the signal is also subject to a disturbance by another noise $M_t$, usually assumed independent of the $N_t$-process, and that $S_t$ may depend on some past values. This makes the model, in the discrete case for instance:
$$S_t = A_t S_{t-1} + g_t + M_t, \qquad Y_t = B_t S_t + h_t + N_t, \qquad t = 1, \ldots, n, \tag{1}$$
where $M_t$, $N_t$ are (at first) orthogonal noise processes, $S_0 = \xi$ is a random variable with mean $a_0$ and second moment (matrix) $V_0$, $g_t$, $h_t$ are deterministic vector sequences, and $A_t$, $B_t$ are given nonrandom matrices of appropriate orders. Thus (1) becomes, on reducing it to a single equation:
$$Y_t = B_t A_t S_{t-1} + B_t g_t + h_t + (B_t M_t + N_t) = C_t S_{t-1} + k_t + \bar N_t \ \text{(say)}, \qquad t = 1, \ldots, n. \tag{2}$$
So it is again of the form $Y_t = \tilde S_t + \tilde N_t$, as before. But the fact that the signal $\tilde S_t$ is given by a (linear) difference equation and the noise $\tilde N_t$ is of a similar nature, with the auxiliary coefficients $A_t$, $B_t$ known together with a knowledge of the covariance structure of the various processes, allows the corresponding mathematics to be developed further after using the resulting structure of the general filter. A fortunate consequence of these separate representations is that they help to develop an interesting, nontrivial structure theory involving a Riccati (difference and differential) equation of classical analysis, which was an isolated result but which has now taken on a special significance. Later, it was also found that the system of equations given
by (1) has an interpretation as representing the dynamics of the state of the system (for the signal) and the motion of the system being observed (the output). Moreover, in the continuous parameter version, this gives a purposeful application for systems of linear (and then nonlinear) stochastic differential equations. We therefore consider a key aspect of this technology, and note that it has an interesting analogy with (sequential) estimation and testing problems. This work is based on a set of recursive relations solvable at each time using the preceding observations. This useful property was first established by Kalman [1] and extended by Kalman and Bucy [1]. This specialization of the filter (of Bochner's general notion) from $\Lambda X_t = Y_t$ to (1) is now called the Kalman filter. It enables us to obtain a large set of specialized and new results connecting stochastic (control) theory and its well-known deterministic counterpart. For this reason, the American Mathematical Society awarded its Steele prize to Kalman (cf. AMS Notices 34 (1986), p. 228) for such a useful and interesting advance. We thus include an aspect of these ideas, following in outline the recent rigorous but compactly presented work of Bensoussan [1]. A more elementary version, with further motivation and applications, is given in Davis and Vinter [1], and for a particular Kalman result (Theorem 3 below) we also utilize the latter authors' treatment.

At the outset it should be emphasized that the optimality, construction, and essentially everything else to follow here is based on the quadratic optimality (or least-squares) criterion, and hence on Hilbert space geometry. This is employed without further comment or explanation, both because it is simpler and because explicit formulas are obtainable for implementation in practice. The following classical result, due to F. Riesz, is given for immediate use. Its proof may be found in standard books (e.g., Dunford-Schwartz [1], p. 249), but is outlined here because it is essential for most of our work. [We have already used the result before.]

1. Theorem. If $\{X_t, t \in I\} \subset \mathcal{X}$, a Hilbert space, and $S = \overline{\mathrm{sp}}\{X_t, t \in I\}$ is the closed subspace generated by the $X_t$'s, then for each $X_0 \in \mathcal{X}$ there is a unique $X^* \in S$ such that ($\|\cdot\|$ denoting the norm of $\mathcal{X}$)
$$\|X_0 - X^*\| = \inf\{\|X_0 - Z\| : Z \in S\} \quad \text{and} \quad (X_0 - X^*, X_t) = 0, \ t \in I.$$
Moreover, the mapping $\pi_S : X_0 \to X^*$ is linear, idempotent, and a contraction. The null space $\mathcal{N} = \{X \in \mathcal{X} : \pi_S(X) = 0\}$ of $\pi_S$ is a closed linear manifold, and $\mathcal{X} = S \oplus \mathcal{N}$ is a direct sum, i.e., for each $X \in \mathcal{X}$ one has $X = Y + Z$ uniquely, with $Y \in S$, $Z \in \mathcal{N}$, and $Y \perp Z$, meaning $(Y, Z) = 0$.

Proof. [As the following argument shows, the result holds for any closed linear manifold $S \subset \mathcal{X}$.] Let $0 \le \delta = \inf\{\|X_0 - Y\| : Y \in S\}$, so
that there exist $Y_n \in S$ satisfying $\|X_0 - Y_n\| \to \delta$. It is noted that $\{Y_n, n \ge 1\}$ is a Cauchy sequence. Indeed, by the parallelogram identity of Hilbert space, if $U_m = X_0 - Y_m$, then
$$\|Y_m - Y_n\|^2 = \|U_m - U_n\|^2 = 2(\|U_m\|^2 + \|U_n\|^2) - \|U_m + U_n\|^2 = 2(\|U_m\|^2 + \|U_n\|^2) - 4\Big\|X_0 - \frac{Y_m + Y_n}{2}\Big\|^2 \le 2(\|U_m\|^2 + \|U_n\|^2) - 4\delta^2 \to 2(\delta^2 + \delta^2) - 4\delta^2 = 0,$$
as $m, n \to \infty$, since $\frac{Y_m + Y_n}{2} \in S$. Let $X^*$ be the limit in $S$ of this Cauchy sequence, so that $\|X_0 - Y_n\| \to \|X_0 - X^*\| = \delta$. This $X^* \in S$ is unique, since if $\tilde X \in S$ is another, with $\|X_0 - \tilde X\| = \delta$, then $\frac{X^* + \tilde X}{2} \in S$ and
$$\delta \le \Big\|X_0 - \frac{X^* + \tilde X}{2}\Big\| \le \frac{1}{2}\big(\|X_0 - X^*\| + \|X_0 - \tilde X\|\big) \le \delta.$$
If $X^* \neq \tilde X$, there will be strict inequality above, since $\mathcal{X}$ is strictly convex. Since this is impossible, $X^* = \tilde X$ must be true.

Next observe that $X_0 - X^* = Z \perp S$. Indeed, for any $Y \in S$ and $a$ in the scalar field of $\mathcal{X}$, $aY \in S$ and $Y_a = X^* + aY \in S$. So by definition $\|X_0 - Y_a\| \ge \delta$. Hence, setting $Y' = X_0 - X^*$,
$$0 \le \|X_0 - Y_a\|^2 - \|X_0 - X^*\|^2 = \|Y' - aY\|^2 - \|Y'\|^2 = \|aY\|^2 - 2\Re(aY, Y').$$
Taking $a = \lambda \neq 0$ here and dividing by $|\lambda|^2$, it follows that this can hold iff $(Y, Y') = 0$, which verifies our assertion on taking $Y = X_t$, $t \in I$. Clearly $X_0 = X^* + (X_0 - X^*) \in S \oplus \mathcal{N}$ is an orthogonal decomposition for all $X_0 \in \mathcal{X}$, so that $\mathcal{X} = S \oplus \mathcal{N}$ is true.

Let $\pi_S : X_0 \to X^*$. Then this mapping satisfies $\pi_S^2(X_0) = \pi_S(X^*) = X^* = \pi_S(X_0)$, so it is idempotent, and it is clearly linear. Moreover, with $X_0 = X^* + Z \in S \oplus \mathcal{N}$, one has
$$\|\pi_S(X_0)\|^2 = \|X^*\|^2 \le \|X^*\|^2 + \|Z\|^2 = \|X^* + Z\|^2 = \|X_0\|^2,$$
and hence it is a contractive (same as an orthogonal) projection.

Remark. The result also holds if instead the scalar field of $\mathcal{X}$ is the set of $m \times m$ matrices $B(\mathbb{R}^m)$. E.g., if $Y = (Y_1, \ldots, Y_m)^*$, $Y_i \in L^2(P)$, and $L^2_{\mathbb{R}^m}(P) = \{f : \Omega \to \mathbb{R}^m, \|f\|^2 = \sum_{i=1}^{m}\|f_i\|^2 < \infty\}$, so that $((f,g)) = \sum_{i=1}^{m}(f_i, g_i)$, then $(L^2_{\mathbb{R}^m}(P), \|\cdot\|)$ is a Hilbert space with $B(\mathbb{R}^m)$ as
the coefficient field, and here $\mathbb{R}$ can be replaced by $\mathbb{C}$. Then the closed subspace $S$ generated by the $X_t \in L^2_{\mathbb{R}^m}(P)$ will be the closed finite linear span of the $X_t$'s with coefficients as matrices from $B(\mathbb{R}^m)$. The above result holds in this case as well. In fact it is true for $m = \infty$ also, with a proper interpretation of the various operations, and the relevant space will then be a "normal Hilbert module", for which one may refer to Kakihara [1]. Here we only consider the case $m < \infty$. A simple illustration of finding $X^*$ will be included to make the discussion concrete.

2. Example. Let $I = \{1, \ldots, n\}$ and $S_n = \overline{\mathrm{sp}}\{X_1, \ldots, X_n\}$ be the generated finite dimensional subspace of $L^2(P)$. For $X_0 \in L^2(P)$, let $X^* = \pi_{S_n}(X_0) \in S_n$. Then $(X_0 - X^*, X_i) = 0$, $i = 1, \ldots, n$. So $X^* = \sum_{i=1}^{n} a_i X_i$, and the $a_i$ are to be determined. Let $a = (a_1, \ldots, a_n)$, $\xi = (X_1, \ldots, X_n)$ be vectors and $r = (E(X_0 X_j), 1 \le j \le n)$, $S = (E(X_i X_j), 1 \le i, j \le n)$ be the matrices of expectations. From the (often termed normal) equations $(X_0, X_i) = (X^*, X_i)$, $i = 1, \ldots, n$, we find $r^t = S a^t$, or $a^t = S^{-1} r^t$, assuming that the $X_i$ are linearly independent, where $a^t$, $r^t$ are transposes of the row vectors $a$, $r$. Then the desired $X^*$ is obtained as:
$$X^* = \xi a^t = \xi S^{-1} r^t. \tag{3}$$
Moreover, if $Z = X_0 - X^*$ is the error of estimation, then its second moment is given by:
$$E(Z^2) = (Z, Z) = (X_0, X_0 - X^*), \text{ since } X^* \perp (X_0 - X^*), = E(X_0^2) - (X_0, \xi S^{-1} r^t) = E(X_0^2) - r S^{-1} r^t. \tag{4}$$
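A minimal numerical check of (3) and (4) may be sketched as follows in Python (the covariance matrix below is an illustrative assumption, not from the text):

import numpy as np

rng = np.random.default_rng(2)
C = np.array([[2.0, 0.8, 0.5, 0.2],      # covariance of (X0, X1, X2, X3),
              [0.8, 1.0, 0.3, 0.1],      # chosen positive definite
              [0.5, 0.3, 1.0, 0.4],
              [0.2, 0.1, 0.4, 1.0]])
Z = rng.multivariate_normal(np.zeros(4), C, size=100_000)
X0, Xi = Z[:, 0], Z[:, 1:]

r = C[0, 1:]                             # r = (E(X0 Xj))_j
S = C[1:, 1:]                            # S = (E(Xi Xj))_{ij}
a = np.linalg.solve(S, r)                # normal equations: a^t = S^{-1} r^t
X_star = Xi @ a                          # X* = xi a^t, cf. (3)

print(np.mean((X0 - X_star) ** 2))       # sample error second moment
print(C[0, 0] - r @ a)                   # E(X0^2) - r S^{-1} r^t, cf. (4)

The two printed values agree up to sampling error, confirming that the residual second moment is given by the closed form (4).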
The problem now is to find a recursive relation to calculate $X^*$ and the second moments of the errors of estimation when the output is given by the filter relation (1) for the inputs $X_i$. A solution of that particular filter equation will suggest how several other generalizations can be obtained. Thus Kalman's basic approach to solving the problem may be presented, with a detailed description which involves some tedious computations, as:

3. Theorem. Let the filter equation be given recursively by a pair of vector relations of finite dimension represented as:
$$X_{k+1} = A_k X_k + B_k u_k + C_k \varepsilon_k, \qquad Y_k = H_k X_k + G_k \varepsilon_k, \qquad k = 0, 1, \ldots, n-1. \tag{5}$$
Here $A_k$, $B_k$, $C_k$, $G_k$, $H_k$ are given deterministic matrices and $Y_k$ is the output, with $X_k$ the unknown input to be estimated optimally, relative to the quadratic error criterion, from the observed output as $k$ varies, i.e., sequentially. The $u_k$ are the "exogenous" (or known at time $k$) vectors, and the $\varepsilon_k$ are the (random) noise vectors whose covariance structure alone is assumed given, which for simplicity is taken to be $E(\varepsilon_k) = 0$, $E(\varepsilon_k \varepsilon_k^t) = \mathrm{id}$, mutually uncorrelated, with $E(X_0 \varepsilon_k^t) = 0$ and $(G_k G_k^t)$ nonsingular. Then the optimal estimators $\hat X_{k,S(k)} = \pi_{S(k)}(X_k) \in S(k) = \overline{\mathrm{sp}}\{Y_0, Y_1, \ldots, Y_{k-1}\}$ can be obtained recursively from the system of equations: $\hat X_0 = m_0 = E(X_0)$, and for $k \ge 1$,
$$\hat X_{k+1,S(k)} = A_k \hat X_{k,S(k-1)} + B_k u_k + K(k)\big[Y_k - H_k \hat X_{k,S(k-1)}\big], \tag{6}$$
where $K(k)$, termed the "Kalman gain" matrix (usually non-square), is given by a recursion scheme for $k = 0, 1, \ldots$ as ($H^t$ is the transpose of $H$):
$$K(k) = [A_k Q_k H_k^t + C_k G_k^t][H_k Q_k H_k^t + G_k G_k^t]^{-1}, \tag{7}$$
and where the covariance matrices $Q_k = E(X_k - \hat X_{k,S(k-1)})(X_k - \hat X_{k,S(k-1)})^t$ satisfy a Riccati difference equation system with $Q_0 = E(X_0 X_0^t)$ (given) and, for $k \ge 0$,
$$Q_{k+1} = A_k Q_k A_k^t + C_k C_k^t - [A_k Q_k H_k^t + C_k G_k^t][H_k Q_k H_k^t + G_k G_k^t]^{-1}[A_k Q_k H_k^t + C_k G_k^t]^t. \tag{8}$$
The error [also termed "innovation"] sequence $V_k = Y_k - H_k \hat X_{k,S(k-1)}$ satisfies $V_j \perp V_k$, $j \neq k$, and
$$E(V_k V_k^t) = (H_k Q_k H_k^t + G_k G_k^t). \tag{9}$$
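Before turning to the proof, here is a minimal Python sketch of the recursion (6)-(8); all model matrices are illustrative assumptions, not from the text. Following Remark 3 below, the error covariance is propagated through the sum-of-squares form (18) rather than (8), and the common noise vector $\varepsilon_k$ of (5) is taken three-dimensional, with $C_k$ loading its first two components and $G_k$ its third, so that $C_k G_k^t = 0$ here:

import numpy as np

rng = np.random.default_rng(3)
A = np.array([[0.9, 0.1], [0.0, 0.8]])               # A_k (time-invariant here)
C = np.hstack([0.5 * np.eye(2), np.zeros((2, 1))])   # C_k
H = np.array([[1.0, 0.0]])                           # H_k
G = np.array([[0.0, 0.0, 1.0]])                      # G_k; G_k G_k^t = 1 is nonsingular

X = rng.standard_normal(2)                           # X_0 with E(X_0 X_0^t) = id
Xhat, Q = np.zeros(2), np.eye(2)                     # m_0 = 0, Q_0 = id
for k in range(200):
    eps = rng.standard_normal(3)                     # E(eps_k eps_k^t) = id
    Y = H @ X + G @ eps                              # output, cf. (5)
    S = H @ Q @ H.T + G @ G.T                        # innovation covariance, cf. (9)
    K = (A @ Q @ H.T + C @ G.T) @ np.linalg.inv(S)   # Kalman gain (7)
    Xhat = A @ Xhat + K @ (Y - H @ Xhat)             # recursion (6), with u_k = 0
    Q = (A - K @ H) @ Q @ (A - K @ H).T + (C - K @ G) @ (C - K @ G).T   # form (18)
    X = A @ X + C @ eps                              # state transition, cf. (5)

print(Q)   # (near) steady-state error covariance of the one-step predictor

Note that, as Remark 2 below observes, the gain $K(k)$ and $Q_k$ are computed from the model matrices alone; only the update of the estimator itself uses the observation $Y_k$.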
Proof. Case 1. Let $E(X_0) = m_0 = 0$ and $u_k = 0$, $k \ge 0$. Observe that $S(k)$ is spanned (with matrix coefficients) by $Y_j$, $1 \le j \le k$, which is also identifiable as the space spanned by all the (scalar) components $\{Y_j^i, 1 \le j \le k, 1 \le i \le r\}$ (say). Under this identification, writing $S(k)$ for this space, we use $\pi_{S(k)}$ as the orthogonal projection of $L^2(P)$ with range $S(k)$. Thus (5) becomes
$$Y_k^i = (H_k X_k)^i + (G_k \varepsilon_k)^i, \qquad i = 1, \ldots, r, \tag{10}$$
and $\varepsilon_k \perp X_k$, $(H_k X_k)^i \in S(k)$, so that $\varepsilon_k^i \perp X_k^j$, $1 \le i, j \le r$. Here the component-wise operations are being used to gain facility with the vector quantities to be used soon. Thus $\hat X^i_{k,S(k)} = \pi_{S(k)} X_k^i$ and $\pi_{S(k)}\varepsilon_k^i = 0$, so that (10) implies
$$Y_k^i = \sum_{j=1}^{\ell} h_k^{ij} X_k^j + \sum_{j=1}^{\ell} g_k^{ij} \varepsilon_k^j, \qquad i = 1, \ldots, r,$$
where $H_k = (h_k^{ij})$, $G_k = (g_k^{ij})$, $1 \le i \le r$, $1 \le j \le \ell$, and hence
$$\hat Y^i_{k,S(k-1)} = \pi_{S(k-1)} Y_k^i = \sum_{j=1}^{\ell} h_k^{ij}\,\hat X^j_{k,S(k-1)} + 0, \qquad i = 1, \ldots, r.$$
Thus in vector notation this becomes
$$\hat Y_{k,S(k-1)} = H_k \hat X_{k,S(k-1)}, \tag{11}$$
which implies that $H_k$ and the projection "commute".

We now obtain a recursion relation between $\hat X_k$ and $\hat X_{k,S(k-1)}$ using (3) and the above simplification. In fact, employing the orthogonal decomposition of $S(k)$, a very special case of Theorem 1, we have
$$\hat X_k^i = \hat X_{k-1}^i + (\hat X_k^i - \hat X_{k-1}^i) \in S(k-1) \oplus (S(k) \ominus S(k-1)),$$
with $\hat X_{k-1}^i \in S(k-1)$ and $\hat X_k^i - \hat X_{k-1}^i \in S(k) \ominus S(k-1) \perp S(k-1)$. Since $S(k)$ is spanned by the $Y_j^i$, $1 \le i \le r$, $1 \le j \le k$, and contains the differences $\hat X_k^i - \hat X_{k-1}^i$, the latter is a linear combination of the $Y_j^i$ that span $S(k) \ominus S(k-1)$. Let $\pi_{k,k-1} : S(k) \to S(k-1)$ denote the orthogonal projection shown, so
$$Y_k^i = \pi_{k,k-1} Y_k^i + (Y_k^i - \pi_{k,k-1} Y_k^i) = \hat Y^i_{k,S(k-1)} + (Y_k^i - \hat Y^i_{k,S(k-1)}) \in S(k-1) \oplus (S(k) \ominus S(k-1)),$$
since the $Y_k^i - \hat Y^i_{k,S(k-1)}$ span $S(k) \ominus S(k-1)$. We now use (3) for $\hat X_k - \hat X_{k-1}$ in place of $X^*$ and $Y_k - \hat Y_{k,S(k-1)}$ for $\xi$ there, together with (11), to get
$$(\hat X_k - \hat X_{k-1}) = E[X_k(Y_k - \hat Y_{k,S(k-1)})^t]\,\big[E(Y_k - \hat Y_{k,S(k-1)})(Y_k - \hat Y_{k,S(k-1)})^t\big]^{-1}\,\big[Y_k - H_k \hat X_{k,S(k-1)}\big]. \tag{12}$$
We now simplify (12) and show that it reduces to (6) when $u_k = 0$ and $m_0 = 0$, from which (7) will follow with an easy modification. To simplify the writing, let $\tilde Y_{k,k-1} = Y_k - \hat Y_{k,S(k-1)}$ and $\tilde X_{k,k-1} = X_k - \hat X_{k,S(k-1)}$. Then
$$\tilde Y_{k,k-1} = (H_k X_k + G_k \varepsilon_k) - H_k \hat X_{k,S(k-1)}, \text{ by (11)}, = H_k \tilde X_{k,k-1} + G_k \varepsilon_k = V_k, \tag{13}$$
with the definition of $V_k$ in the statement. Hence
$$E(X_k \tilde Y_{k,k-1}^t) = E(X_k \tilde X_{k,k-1}^t)H_k^t + E(X_k \varepsilon_k^t)G_k^t = Q_k H_k^t + 0, \quad \text{since } X_k \perp \varepsilon_k, \ \hat X_{k,S(k-1)} \perp \tilde X_{k,k-1}.$$
Similarly,
$$E(\tilde Y_{k,k-1}\tilde Y_{k,k-1}^t) = H_k Q_k H_k^t + G_k G_k^t. \tag{14}$$
Since $u_k = 0$, applying $\pi_{S(k)}$ and noting that the latter "commutes" with $A_k$, $C_k$, and that $\pi_{S(k)}\varepsilon_{k+1} = 0$, we get
$$\hat X_{k+1,S(k)} = A_k \hat X_{k,S(k)} + C_k \hat\varepsilon_{k,S(k)}.$$
Using the decomposition of $S(k)$ into $S(k-1)$ and its complement in $S(k)$, we get $\hat\varepsilon_k = \pi_{S(k)}\varepsilon_k \in S(k)$, which can be simplified as (cf. Example 2 above):
$$\hat\varepsilon_k = \hat\varepsilon_{k,S(k)} = E(\varepsilon_k \tilde Y_{k,k-1}^t)\,\big[E(\tilde Y_{k,k-1}\tilde Y_{k,k-1}^t)\big]^{-1}\,\tilde Y_{k,k-1}. \tag{15}$$
But since the $\varepsilon_k$ are uncorrelated and have unit covariance matrix,
$$E(\varepsilon_k \tilde Y_{k,k-1}^t) = E(\varepsilon_k \varepsilon_k^t)G_k^t + E(\varepsilon_k \tilde X_{k,k-1}^t)H_k^t = G_k^t + 0.$$
Hence (12) becomes
$$\hat X_{k+1,S(k)} = A_k\big[\hat X_{k,S(k-1)} + Q_k H_k^t(H_k Q_k H_k^t + G_k G_k^t)^{-1}\tilde Y_{k,k-1}\big] + C_k G_k^t(H_k Q_k H_k^t + G_k G_k^t)^{-1}\tilde Y_{k,k-1}, \tag{16}$$
which is (6), with $K(k)$ given by (7), when $u_k = 0$. We next derive (8) by computing the covariance of $\hat X_{k+1,S(k)}$. With $\tilde X_{k+1,k}$ as defined above, one has:
$$\tilde X_{k+1,k} = A_k \tilde X_{k,k-1} + C_k \varepsilon_k - K(k)\big(Y_k - H_k \hat X_{k,S(k-1)}\big) = A_k \tilde X_{k,k-1} - K(k)\big[H_k \tilde X_{k,k-1} + G_k \varepsilon_k\big] + C_k \varepsilon_k, \text{ by (5)},$$
$$= (A_k - K(k)H_k)\tilde X_{k,k-1} + (C_k - K(k)G_k)\varepsilon_k. \tag{17}$$
Since $Q_k$ is the covariance of $\tilde X_{k,k-1}$, this gives
$$Q_{k+1} = (A_k - K(k)H_k)\,Q_k\,(A_k - K(k)H_k)^t + (C_k - K(k)G_k)(C_k - K(k)G_k)^t, \tag{18}$$
where the cross product term vanishes due to the fact that $\varepsilon_k \perp \tilde X_{k,k-1}$. This is (8) when (7) is substituted for $K(k)$. Since (9) is now obvious, the result holds in this case.

Case 2. Let either $m_0 \neq 0$ or $u_k \neq 0$, $k \ge 1$. Then $m_{k+1} = E(X_{k+1}) = A_k m_k + B_k u_k$ and $E(Y_k) = H_k m_k$. Let
$$Y_k' = Y_k - H_k m_k, \qquad X_k' = X_k - m_k,$$
so that $Y_k'$, $X_k'$ and $\varepsilon_k$ all have means zero. Then
$$X_{k+1}' = A_k X_k' + C_k \varepsilon_k, \qquad Y_k' = H_k(X_k - m_k) + G_k \varepsilon_k = H_k X_k' + G_k \varepsilon_k.$$
Since $\hat X_{0,S(-1)}' = m_0' = 0$ by assumption, the result of Case 1 applies and gives
$$\hat X_{k+1,S(k)}' = A_k \hat X_{k,S(k-1)}' + K(k)\big(Y_k' - H_k \hat X_{k,S(k-1)}'\big).$$
But $\hat X_{k,S(k-1)} = \hat X_{k,S(k-1)}' + m_k$, as is easily seen from the definition of the minimal element, and $E(X_k - \hat X_{k,S(k-1)})(X_k - \hat X_{k,S(k-1)})^t = Q_k$, whereas $K(k)$ is the same. Also $Y_k' - H_k \hat X_{k,S(k-1)}' = Y_k - H_k \hat X_{k,S(k-1)}$. Hence
$$\hat X_{k+1,S(k)} = A_k \hat X_{k,S(k-1)} + K(k)\big(Y_k - H_k \hat X_{k,S(k-1)}\big) + B_k u_k,$$
with $\hat X_{0,S(-1)} = m_0$, $u_0 = 0$. This is (6) in the general case, and then the same Riccati equation for $Q_k$ holds. Thus all the assertions of the theorem are established.

Remarks. 1. It is to be noted that if $\{Y_k, k \ge 0\}$ is a Gaussian process, then every finite subset is jointly normal and, with elementary computations, it is seen (and well known) that the conditional expectation $E(Y_k|Y_j, j \le k-1) = \sum_{j=0}^{k-1} a_j Y_j$ is linear, and this implies that
$$\pi_{S(k)}(Y_{k+1}) = \sum_{j=0}^{k} a_j Y_j = E(Y_{k+1}|Y_j, j \le k).$$
Thus the preceding formulas and the analysis hold automatically in the Gaussian case, with $\pi_{S(k)}$ being $E(\cdot|Y_j, j \le k)$, the conditional expectation. However, if the $Y_k$-process is not Gaussian then, as seen in Proposition 1.1, the conditional expectation is an optimal nonlinear functional, and the results of Theorem 3 are no longer valid. We shall discuss some aspects of the nonlinear problem in the next section.

2. An interesting part of formulas (6) and (7) is that the known matrices $A_k$, $B_k$, $C_k$, $G_k$ determine $K(k)$ and $Q_k$ independently of the observations. Hence the knowledge of the estimator $\hat X_{k,S(k-1)}$ may be used along with the observation $Y_k$ to obtain the next estimator by a simple substitution. It is this property (of using the previous estimator for computing the next one) that makes Kalman's procedure valuable in applications. [See Exercise 6.4 for a sketch of a result related to (6)-(8).]

3. It may also be noted that computing $Q_k$ through the Riccati equation is not simple, since it involves inversions of matrices. Also, this is a quadratic equation, and when it is solved one of the solutions may not be admissible (e.g., if it is not positive semi-definite). Further, $K(k)$ is given by (7), but $Q_k$ should be obtained not from (8) but from (18), as the latter is a sum of positive (semi-)definite terms, which is more convenient in computations. Also, by iteration, the formulas of the model can be obtained for $\hat X_{k+\ell,S(k)}$, for an integer $\ell\,(\ge 1)$ units of lead. This is discussed from an applicational viewpoint, for instance, in the book by Kwakernaak and Sivan ([1], p. 530).

Next we turn to the continuous parameter version of model (5), which involves (stochastic) linear differential equations, with extensions to some nonlinear aspects in the following section. Thus a continuous analog of (5) is given by:
$$dX_t = [F_1(t)X_t + F_2(t)Y_t + F_3(t)]\,dt + G_1(t)\,d\varepsilon_t, \qquad dY_t = [H_1(t)Y_t + H_2(t)X_t + H_3(t)]\,dt + G_2(t)\,d\eta_t, \qquad X_0 = \xi, \quad Y_0 = 0, \tag{19}$$
where the input $X_t$ takes values in $\mathbb{R}^n$ and the output $Y_t$ in $\mathbb{R}^m$; $\varepsilon_t$, $\eta_t$ are vector-valued noise processes of appropriate dimensions, and the $F_i$, $G_i$, $H_i$ are given deterministic Borel measurable matrix functions of suitable orders, bounded on compact sets. Here we assume the $\varepsilon_t$, $\eta_t$ to be $L^{2,2}$-bounded vector processes, so that (19) makes sense in integrated form (see the definition and immediate properties of the concept just prior to Theorem IV.2.7, where it is seen that BM and many other processes [e.g., square integrable martingales as well as processes with orthogonal increments] are included in this concept, which is an extension of one due to Bochner). Our problem then is to
show that the system (19) has a unique solution under mild conditions on the given matrices, and to obtain (with the Kalman-Bucy method) optimal estimators of the (unknown) input and the covariances of the corresponding error process. We thus consider the twin problems: (i) the existence and uniqueness of the solution of the system with the given initial data, and (ii) the desired recursive formulas or SDEs for the estimators and the error covariances when the $\varepsilon_t$, $\eta_t$-processes are BMs.

First, to show that (19) has a unique solution, let us express it as a single first order vector (linear) differential equation. Put $Z_t = (1, X_t, Y_t)^*$ (a column vector, with '$*$' for transpose here), and similarly $u_t = (1, \varepsilon_t, \eta_t)^*$. Also consider the block square matrices:
$$\tilde A = \begin{pmatrix} 0 & 0 & 0 \\ F_3 & F_1 & F_2 \\ H_3 & H_2 & H_1 \end{pmatrix}, \qquad \tilde B = \begin{pmatrix} 0 & 0 & 0 \\ 0 & G_1 & 0 \\ 0 & 0 & G_2 \end{pmatrix}.$$
Then (19) may be expressed as
$$dZ_t = \tilde A(t)Z_t\,dt + \tilde B(t)\,du_t. \tag{20}$$
This can be solved as an easy consequence of the classical Banach contraction mapping principle, as follows. The $u_t$-process satisfies the $L^{2,2}$-boundedness condition, and $\tilde B(t)$ is bounded on bounded sets. So one notes that $V_t = \int_0^t \tilde B(s)\,du_s$, $t \ge 0$, is a well-defined $L^{2,2}$-bounded process, $dV_t = \tilde B(t)\,du_t$ holds, and it does not depend on $Z_t$ or $\tilde A_t$. Let $\xi_t = Z_t - V_t$, $t \ge 0$. Now (20) shows that $\xi_t$ is differentiable, and one has:
$$\frac{d\xi_t}{dt} = \tilde A(t)Z_t = \tilde A(t)\xi_t + g(t) = K(t, \xi_t) \ \text{(say)}, \tag{21}$$
where $g(t) = \tilde A(t)V_t = \tilde A(t)\int_0^t \tilde B(s)\,du_s$ is continuous if $\tilde A(t)$ is, and is (Borel) measurable in any case (but does not depend on the $Z_t$-process). Using an argument of Daleckiĭ and Kreĭn ([1], Chapter III, Sec. 1), we now establish the existence and uniqueness of the solution of (20), or equivalently (21), with $\tilde A(\cdot)$ being continuous. The desired result will be stated as:

4. Proposition. Suppose the (inhomogeneous) linear SDE (21) is such that $\tilde A(\cdot)$ is a locally (i.e., on compact sets) bounded Borel measurable matrix function, and the $V_t$- (or the $u_t$-)process is $L^{2,2}$-bounded relative to a Radon measure $\mu$ on $\mathbb{R}^+ \times \Omega$ (hence $\sigma$-finite; this is satisfied if $u_t$ is a BM, in which case $d\mu = dt \otimes dP$). Then the SDE (21) has a unique solution with the given initial value $\xi_0 = Z(0)$. Moreover, the
solution in the case of (19) with $F_2 = F_3 = 0$ (in the first equation) is explicitly given by:
$$\xi_t = \exp\Big[\int_0^t F_1(s)\,ds\Big]\Big[X_0 + \int_0^t e^{-\int_0^s F_1(u)\,du}\,G_1(s)\,d\varepsilon_s\Big]. \tag{22}$$
Proof. Consider (21) in the integrated form:
$$\xi_t - \xi_0 = \int_0^t \tilde A(s)\xi_s\,ds + \int_0^t g(s)\,ds,$$
or equivalently,
$$\xi_t = h(t) + \int_0^t \tilde A(s)\xi_s\,ds, \tag{23}$$
where $h(t) = \xi_0 + \int_0^t g(s)\,ds = \xi_0 + \int_0^t \tilde A(s)V_s\,ds$, the last quantity being understood as a vector (or Bochner) integral. Denote by $(T\xi)_t$ the right side of (23). Now $K(t, \xi_t)$ of (21) is $L^2(P)$-bounded on compact $t$-intervals, $K(t, \cdot)$, $0 < t < a < \infty$, satisfies a Lipschitz condition, and the $\xi_t$-process is locally bounded, being differentiable. To see the former property, we make the following computation, in which some standard properties of vector integration are used:
$$\|(T\xi^1)_t - (T\xi^2)_t\|_2 = \Big\|\int_0^t \tilde A(s)(\xi_s^1 - \xi_s^2)\,ds\Big\|_2 \le \int_0^t \|\tilde A(s)(\xi_s^1 - \xi_s^2)\|_2\,ds = \int_0^t \|\tilde A(s)(Z_s^1 - Z_s^2)\|_2\,ds, \text{ since } \xi_t^i = Z_t^i - V_t,$$
$$\le M_a \int_0^t \sup_{0 \le s \le a}\|Z_s^1 - Z_s^2\|_2\,ds, \quad 0 < t < a < \infty, \le M_a t\,\|Z^1 - Z^2\|_{2,a} = M_a t\,\|\xi^1 - \xi^2\|_{2,a} < \infty. \tag{24}$$
Here $\|Z^1 - Z^2\|_{2,a} = \sup_{0 \le t \le a}\|Z_t^1 - Z_t^2\|_2$, and $M_a$ is a bound for $\|\tilde A(\cdot)\|$ on $[0, a]$. Applying the operator $T$ once more, (24) gives
$$\|(T^2\xi^1)_t - (T^2\xi^2)_t\|_2 \le \frac{(M_a t)^2}{2}\,\|\xi^1 - \xi^2\|_{2,a}, \tag{25}$$
and by iterating the procedure one has
$$\|(T^n\xi^1)_t - (T^n\xi^2)_t\|_2 \le \frac{(M_a t)^n}{n!}\,\|\xi^1 - \xi^2\|_{2,a}. \tag{26}$$
This means that for some $n_0$, $T^{n_0}$ is a contraction on the closed ball $\{\xi : \|\xi - \xi^0\|_{2,a} \le a\}$, and by the Banach contraction mapping principle (cf., e.g., Daleckiĭ and Kreĭn [1], p. 53), $T^{n_0}$, and hence $T$ itself, has a unique fixed point $\xi_t$, which is the (unique) solution of (21). Moreover, $\xi_t = \lim_{n\to\infty}(T^n\xi^0)_t$ (the value $\xi^0$ being a point of the ball). In fact one obtains, by repeated substitution,
$$\xi_t = g(t) + \int_0^t \tilde A(s)g(s)\,ds + \cdots + \int_0^t \tilde A(t_1)\,dt_1\int_0^{t_1}\tilde A(t_2)\,dt_2\cdots\int_0^{t_{n-1}}\tilde A(t_n)\xi_{t_n}\,dt_n + \cdots = g(t) + \sum_{n=2}^{\infty} g_n(t),$$
where $g_n$ is the $n$th displayed integral; the solution is thus not simple. In the case of (19) with $F_2 = F_3 = 0$, we can get a simple explicit solution, treating the equation in "Stratonovich's sense" (i.e., the symmetrized Itô equation, given as $Y_t \circ dX_t = Y_t\,dX_t + \frac12 d\langle X, Y\rangle_t$, where the left side is, by definition, a Stratonovich differential, the right side Itô's, and the last term $\langle X, Y\rangle_t$ is the quadratic covariation; the next section [cf. Prop. 5.2 there] has more details), for which the rules of the classical ODE calculus apply (cf., e.g., the companion volume, Rao [21], pp. 418-422). Thus the 'integrating factor' becomes:
$$M(t) = \exp\Big[-\int_0^t F_1(s)\,ds\Big].$$
Multiplying the preceding equation by $M(t)$ to get an exact differential $d(X_t M(t))$, one has, with the initial value $X_0$:
$$X_t M(t) = X_0 + \int_0^t M(s)G_1(s)\,d\varepsilon_s,$$
which after dividing by $M(t)$ gives (22).
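A quick numerical confirmation of (22) in the scalar, constant-coefficient case may be sketched as follows in Python (all parameter values are illustrative assumptions, not from the text; since the diffusion coefficient here is deterministic, the Itô and Stratonovich readings coincide):

import numpy as np

rng = np.random.default_rng(4)
F1, G1, X0, T, n = -0.5, 0.3, 1.0, 1.0, 100_000
dt = T / n
t = np.linspace(0.0, T, n + 1)
dW = rng.standard_normal(n) * np.sqrt(dt)      # BM increments d(eps_s)

# Euler-Maruyama path of dX_t = F1 X_t dt + G1 d(eps_t).
X = np.empty(n + 1)
X[0] = X0
for k in range(n):
    X[k + 1] = X[k] + F1 * X[k] * dt + G1 * dW[k]

# Explicit solution (22), with the stochastic integral as a cumulative sum.
I = np.concatenate([[0.0], np.cumsum(np.exp(-F1 * t[:-1]) * G1 * dW)])
X_explicit = np.exp(F1 * t) * (X0 + I)

print(np.max(np.abs(X - X_explicit)))          # small discretization error

The two paths agree up to the discretization error of the Euler scheme, as expected for this linear equation with deterministic coefficients.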
The method can be generalized even to matrix equations such as (20) or (21), which are linear stochastic differential equations; in fact, it is a special case of a result given in (Rao [28], p. 237). However, for the general matrix case, noncommutativity and other problems arise. Also, the difficulty with the above solution is that an estimate at time $t' > t$ cannot be obtained from that available at time $t$. For this purpose a specialized and detailed procedure (the Kalman-Bucy methodology) is needed, and it crucially employs the orthogonality and the particular form of the equations (19), as well as various properties of the projection operators. Since in the continuous parameter case the sums (of Theorem 3) become (stochastic) integrals, that procedure also leads to other difficulties. However, if the process is Gaussian, then orthogonality becomes independence, and the projections coincide with conditional expectations. Under this hypothesis, Theorem 3 admits a (nontrivial) extension, and we present a solution for a wide class of Gaussian processes, obtaining different integral representations of conditional means, using both functional analytic and probabilistic techniques, which explains the nontrivial nature of conditioning.

Before proceeding with the analysis, we should record a key difficulty related to computations based on conditional expectations: the ever-present Kac-Slepian paradox, arising when one deals with continuous random variables. This was already noted in Section II.4, and it exists even if one is dealing with Gaussian families. One of the frequently used methods in the literature for computing conditional expectations such as $E(X|Y = y)$, where $Y$ has a continuous distribution, is to use the "horizontal window" definition, i.e., to settle for $\lim_{\delta_i \downarrow 0} E(X|y_i \le Y_i < y_i + \delta_i, i = 1, \ldots, k)$. The paradox is that the final result depends on the "window" and is not necessarily the desired $E(X|\sigma(Y_1, \ldots, Y_k))(y_1, \ldots, y_k)$ as originally defined by the RN-theorem. [It is similar to the directional derivative of the classical differential calculus when the direction is prescribed.]

Employing this window definition, one finds from the well-known and easy computation that, when $(X, Y_1, \ldots, Y_k)$ are jointly normally distributed, then $\hat X = E(X|Y_1 = y_1, \ldots, Y_k = y_k)$ is the regression of $X$ on the vector $Y = y$, which is linear, with the error moment $E(X - \hat X)^2$ a constant (cf., e.g., Cramér [1], p. 314, and Wilks [1], p. 169). This property plays a crucial role in the Kalman-Bucy approach, always using the "horizontal window" definition without explicit mention, and also in all the books (known to the author) on filter analysis. Thus, with the possibility that such a definition of conditioning does not necessarily conform with the original Kolmogorov concept used in presently accepted Probability Theory, we include the following account of what may be regarded as computational feasibility. [This is slightly different from the Markov process analysis, where the existence of regular conditional or transition probabilities is established and thereafter a version of them is assumed as given, since there are as yet no constructive methods available for the computational calcu-
lus of conditional expectations.] We show that in some filtering problems conditional expectations may be explicitly obtained under certain conditions. The next lemma, due to Loève ([1], p. 344), shows that for a class (termed by him 'relative conditional expectations,' a rather non-probabilistic name) the result can be given as a ratio of (often) simpler conditional expectations, and this will be used in a specialized form in representing an optimal filter. He originally introduced the concept for generalizing 'sufficient statistics' to 'sufficient $\sigma$-algebras,' relative to a family of probability measures on the base space.

5. Lemma (Loève). Let $Z \ge 0$ be an integrable random variable on $(\Omega, \Sigma, P)$ and $\mu(A) = \int_A Z\,dP$, $A \in \Sigma$. If $X$ is $\mu$-integrable, so that $\nu_X(A) = \int_A X\,d\mu$, $A \in \mathcal{B}$, is a (signed) measure on $\mathcal{B} \subset \Sigma$, a $\sigma$-algebra, then the mapping $E^{\mathcal{B}}_\mu : X \to \frac{d\nu_X}{d\mu_{\mathcal{B}}}$, $\mu_{\mathcal{B}} = \mu|\mathcal{B}$, is well-defined and positive linear (a "relative conditional expectation," since taking $Z = 1$ a.e. $[P]$ this becomes $E^{\mathcal{B}}_\mu = E^{\mathcal{B}}_P$, the usual conditional expectation for the basic probability $P$), and one has
$$E^{\mathcal{B}}_\mu(X) = \frac{E^{\mathcal{B}}_P(ZX)}{E^{\mathcal{B}}_P(Z)}, \quad \text{a.e. } [\mu]. \tag{27}$$
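On a finite probability space, (27) can be verified directly; the following small Python check (all numbers are illustrative assumptions, not from the text) computes both sides on the atoms of a two-set partition generating $\mathcal{B}$:

import numpy as np

P_ = np.array([0.1, 0.2, 0.1, 0.25, 0.15, 0.2])   # base probability on 6 points
Z = np.array([0.5, 1.0, 2.0, 1.5, 0.5, 1.0])      # density of mu w.r.t. P
X = np.array([1.0, -1.0, 3.0, 2.0, 0.0, 4.0])
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]   # partition generating B

for B in blocks:
    lhs = np.sum((X * Z * P_)[B]) / np.sum((Z * P_)[B])        # E_mu^B(X) on the atom
    rhs = (np.sum((Z * X * P_)[B]) / P_[B].sum()) / \
          (np.sum((Z * P_)[B]) / P_[B].sum())                  # E_P^B(ZX) / E_P^B(Z)
    print(lhs, rhs)                                            # equal on each atom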
Proof. Using the change of variables in the Radon-Nikodým calculus, one obtains (27) as follows. For $A \in \mathcal{B}$,
$$\mu_{\mathcal{B}}(A) = \int_A Z\,dP = \int_A E^{\mathcal{B}}_P(Z)\,dP_{\mathcal{B}}, \tag{28}$$
or equivalently $d\mu_{\mathcal{B}} = E^{\mathcal{B}}_P(Z)\,dP_{\mathcal{B}}$, and since $E^{\mathcal{B}}_\mu(X)$ is well-defined, being the Radon-Nikodým derivative of $\nu_X$ with respect to the finite measure $\mu_{\mathcal{B}}$, we have
$$\int_A E^{\mathcal{B}}_\mu(X)\,d\mu_{\mathcal{B}} = \int_A X\,d\mu = \int_A XZ\,dP = \int_A E^{\mathcal{B}}_P(XZ)\,dP_{\mathcal{B}}. \tag{29}$$
Substituting for $d\mu_{\mathcal{B}}$ in terms of $dP_{\mathcal{B}}$ as noted above, (29) becomes
$$\int_A E^{\mathcal{B}}_\mu(X)\,E^{\mathcal{B}}_P(Z)\,dP_{\mathcal{B}} = \int_A E^{\mathcal{B}}_P(XZ)\,dP_{\mathcal{B}}, \qquad A \in \mathcal{B}.$$
Since the integrands with respect to the same measure $dP_{\mathcal{B}}$ are $\mathcal{B}$-measurable and $A \in \mathcal{B}$ is arbitrary, one may identify them a.e. $[P]$. Hence
$$E^{\mathcal{B}}_\mu(X)\,E^{\mathcal{B}}_P(Z) = E^{\mathcal{B}}_P(XZ), \quad \text{a.e. } [P].$$
But $\mu \ll P$, so that each $P$-null set is $\mu$-null, and hence we can divide through by $E^{\mathcal{B}}_P(Z)$, which gives (27), since $\mu(A) = 0 \Rightarrow Z\chi_A = 0$ a.e. $[P]$ and then $E^{\mathcal{B}}_P(Z\chi_A) = 0$ a.e. $[P]$. So this holds a.e. $[\mu]$.

It will be of interest to express (27) in terms of integrals, to be used in the applications below. For this we observe that $E^{\mathcal{B}}_P(\chi_A) = P^{\mathcal{B}}(A)$ is the conditional probability measure. Similarly, $E^{\mathcal{B}}_\mu(\chi_A) = \mu^{\mathcal{B}}(A)$, $A \in \Sigma$, is also a conditional measure on $(\Omega, \Sigma, \mu)$. If $f$ is a simple function (for $\Sigma$), then
$$\text{(a)}\quad E^{\mathcal{B}}_\mu(f) = \int_\Omega f(\omega)\,\mu^{\mathcal{B}}(d\omega), \qquad \text{(b)}\quad E^{\mathcal{B}}_P(f) = \int_\Omega f(\omega)\,P^{\mathcal{B}}(d\omega). \tag{30}$$
These formulas can be extended to all bounded ($\Sigma$-)measurable functions with the Dunford-Schwartz definition of a vector integral. When this is done, (29) holds for all such $f$, and substitution in (27) yields:
$$E^{\mathcal{B}}_\mu(f) = \int_\Omega f(\omega)\,\mu^{\mathcal{B}}(d\omega) = \int_\Omega (fZ_1)(\omega)\,P^{\mathcal{B}}(d\omega) = \frac{\int_\Omega fZ\,dP^{\mathcal{B}}}{\int_\Omega Z\,dP^{\mathcal{B}}},$$
where $Z_1 = \frac{Z}{E^{\mathcal{B}}_P(Z)}$ is measurable for ($\Sigma$). This implies in particular, taking $f = \chi_A$, that:
$$\mu^{\mathcal{B}}(A) = \int_A Z_1\,dP^{\mathcal{B}}, \qquad A \in \Sigma. \tag{31}$$
Now if $\mu^{\mathcal{B}}$ and $P^{\mathcal{B}}$ are considered as (positive) vector measures on $\Sigma$ into $L^1(\Omega, \mathcal{B}, P_{\mathcal{B}})$, then they have the (total) variation measures given as:
$$|\mu^{\mathcal{B}}|(A) = \int_\Omega \mu^{\mathcal{B}}(A)\,d\mu_{\mathcal{B}}, \text{ by definition}, = \int_\Omega E^{\mathcal{B}}_\mu(\chi_A)\,E^{\mathcal{B}}_P(Z)\,dP_{\mathcal{B}}, \text{ by (28)}, = \int_\Omega E^{\mathcal{B}}_P(Z\chi_A)\,dP_{\mathcal{B}}, \text{ by (27)}, = \int_A Z\,dP = \mu(A), \text{ by (28)}.$$
A similar computation (or taking $Z = 1$) shows that $|P^{\mathcal{B}}|(A) = P(A)$. Thus the vector measures $\mu^{\mathcal{B}}$, $P^{\mathcal{B}}$ into $L^1(P)$ have finite variation measures $\mu$, $P$, which moreover have the property that $\mu \ll P$. Although $P^{\mathcal{B}}(\cdot)$ (and also $\mu^{\mathcal{B}}$ if $\mu$ is finite) is a vector measure (i.e., $\sigma$-additive) in $L^p(P)$ for $1 \le p < \infty$, it need not have finite variation for $1 < p < \infty$, in contrast to the case $p = 1$ considered above. Now to $P^{\mathcal{B}} : \Sigma \to L^1(\mathcal{B}, P)$, which is $P$-continuous and has finite variation, a Radon-Nikodým theorem due to Phillips ([1], Theorem 5.1) can be applied, provided the set $S = \{f_A = \frac{P^{\mathcal{B}}(A)}{P(A)} : A \in \Sigma_0\}$, where $\Sigma_0 = \{A \in \Sigma : P(A) > 0\}$, is shown to be relatively weakly sequentially compact. This is the best known condition on the subject. If this is satisfied, then by the just cited theorem there is a $P$-unique, strongly measurable $q(\cdot,\cdot)$ such that
$$P^{\mathcal{B}}(A)(\cdot) = \int_A q(\omega, \cdot)\,P(d\omega), \qquad A \in \Sigma, \tag{32}$$
holds (as a Bochner integral). However, the desired compactness condition is equivalent to the uniform integrability of $S \subset L^1(P)$ (cf. Dunford-Schwartz [1], IV.8.11), since $P$ (as well as $\mu$) is a finite measure. We now verify the latter form of the condition for $P^{\mathcal{B}}$, i.e., by definition we need to show that (with $P_{\mathcal{B}} = P|\mathcal{B}$):
$$\lim_{\alpha \to \infty} \int_{[f_A > \alpha]} f_A\,dP_{\mathcal{B}} = 0, \quad \text{uniformly in } A \in \Sigma_0.$$
So consider, for any $A \in \Sigma_0$:
$$\int_{[f_A > \alpha]} f_A\,dP_{\mathcal{B}} = \frac{1}{P(A)}\int_{[f_A > \alpha]} E^{\mathcal{B}}_P(\chi_A)\,dP_{\mathcal{B}}, \text{ by definition}, = \frac{1}{P(A)}\int_{[P^{\mathcal{B}}(A) > \alpha P(A)]} \chi_A\,dP, \text{ since } [f_A > \alpha] \in \mathcal{B},$$
$$= \frac{P(A \cap [P^{\mathcal{B}}(A) > \alpha P(A)])}{P(A)} = P([P^{\mathcal{B}}(A) > \alpha P(A)]\,|\,A), \text{ by definition of } P(\cdot|A),$$
$$\le \frac{1}{\alpha P(A)}\int_\Omega P^{\mathcal{B}}(A)(\omega')\,P(d\omega'|A), \text{ by the conditional Markov inequality}, \le \frac{1}{\alpha P(A)}\,P(\Omega|A), \text{ since } 0 \le P^{\mathcal{B}}(A) \le 1 \text{ a.e.}, = \frac{1}{\alpha P(A)}, \quad A \in \Sigma_0, \text{ since } P(\Omega|A) = 1,$$
which goes to zero uniformly in $A \in \Sigma_0$ as $\alpha \to \infty$. Hence we have (by Phillips' Radon-Nikodým theorem) that $\frac{dP^{\mathcal{B}}}{dP}(\omega) = q(\omega, \cdot)$ exists a.e.,
and $q$ is strongly measurable. Using the change of variables again, (31) then implies, for each bounded ($\Sigma$-)measurable $f$, since $d\mu = Z\,dP$, on letting $\tilde q = Zq$ (which has the same measurability properties as $q$), that
$$\Big(\int_\Omega f\,d\mu^{\mathcal{B}}\Big)(\cdot) = \frac{\int_\Omega f(\omega)\,q(\omega, \cdot)\,d\mu(\omega)}{\int_\Omega q(\omega, \cdot)\,d\mu(\omega)} = \frac{\int_\Omega f(\omega)\,\tilde q(\omega, \cdot)\,dP(\omega)}{\int_\Omega \tilde q(\omega, \cdot)\,dP(\omega)}. \tag{33}$$
We record this result, for reference as well as later use, as:

6. Proposition. Let $\mathcal{B} \subset \Sigma$ be a $\sigma$-algebra and $\mu : \Sigma \to \mathbb{R}^+$ be a $P$-continuous measure on $(\Omega, \Sigma, P)$. Then there exists a strongly measurable (relative to $\Sigma \otimes \mathcal{B}$) function $\tilde q(\cdot,\cdot) \ge 0$ such that (33) holds a.e. $[P]$.

Remark. If $P^{\mathcal{B}}$ is assumed to be a regular conditional measure, so that it may be written as $P(\cdot, \omega)$, an ordinary probability with $P(A, \cdot)$ a $\mathcal{B}$-measurable function, then Meyer [1] has obtained a result related to the above proposition under additional restrictions, such as $\mathcal{B}$ being essentially countably generated. (See also Kallianpur and Striebel [1] for a similar result under related restrictions.) Both of them make no reference to Loève's lemma. However, Zakai [2] refers to Loève, and derives formula (27) independently. As seen above, the result is true without these restrictions. However, the additional assumptions allow a direct construction of the density $q(\cdot,\cdot)$, and the vector measure theory is not needed. The identification of the problem with abstract analysis gives a better insight into the structure (and a general form) of formula (33) and similar ones. In what follows we specialize this result to specific Kalman filters, where the separability and regularity conditions are available. The representation (31) and the relevance of Phillips's form of the Radon-Nikodým theorem in this problem were discussed in Rao [9], although the sharper form (33) was not noted there.

We now consider the linear filter problem as an extension of the linear regression obtained in the classical studies of multivariate normal distribution analysis, which used the (horizontal window) conditioning method without ever suspecting the availability of the other 'window' procedures that were exhibited by Kac and Slepian [1]. We omit further discussion of the 'windows' below. Recall that if $(Y_0, Y_1, \ldots, Y_n)$ is a random vector, distributed jointly normally with mean $m = (m_0, m_1, \ldots, m_n)$ and (nonsingular) covariance matrix $B = (b_{ij}, 0 \le i, j \le n)$, then the conditional density of $Y_0$ given $Y_i = y_i$, $i = 1, \ldots, n$, exists and is again normal, with mean $m_0 + \sum_{i=1}^{n}\beta_i(y_i - m_i)$ and variance $\alpha^2 = \frac{\det(B)}{\det(B_1)}$, where $B_1 = (b_{ij}, 1 \le i, j \le n)$, so that $\det(B_1)$ is the cofactor of $b_{00}$ in $\det(B)$, and $\beta_i = \sum_{j=1}^{n} c_{ij}b_{j0}$, with $B_1^{-1} = (c_{ij}, 1 \le i, j \le n)$. Here $Y_0$ can also be a sub-vector itself, and a similar result holds. An interesting point here is that this conditional normal density has its mean vector depending linearly on the $y_i$, while its variance is a constant, independent of the $y_i$ (cf., e.g., Anderson [1], p. 29). Thus the standard least squares estimation theory implies that the best linear element $g(y_1, \ldots, y_n)$ minimizing $E|Y_0 - g(y_1, \ldots, y_n)|^2$ is $m_0 + \sum_{i=1}^{n}\beta_i(y_i - m_i)$, with $\alpha^2$ as its minimum (dispersion or) mean squared error.
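These recalled formulas can be checked numerically; in the following short Python sketch the covariance matrix $B$ and mean $m$ are illustrative assumptions, not from the text:

import numpy as np

B = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.2],
              [0.3, 0.2, 1.0]])          # covariance of (Y0, Y1, Y2)
m = np.array([0.5, -1.0, 2.0])           # mean vector (m0, m1, m2)

B1 = B[1:, 1:]                           # covariance of (Y1, Y2)
beta = np.linalg.solve(B1, B[1:, 0])     # regression coefficients beta_i

y = np.array([0.0, 1.5])                 # observed values (y1, y2)
cond_mean = m[0] + beta @ (y - m[1:])    # m0 + sum_i beta_i (y_i - m_i): linear in y
cond_var = B[0, 0] - B[1:, 0] @ beta     # constant, independent of y

print(cond_mean, cond_var)
print(np.linalg.det(B) / np.linalg.det(B1))   # equals cond_var = alpha^2

The last two printed numbers coincide, illustrating the identity $\alpha^2 = \det(B)/\det(B_1)$ for the conditional variance.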
This interesting fact is extended in the Kalman-Bucy filtering model, with some generalizations, as follows. We consider a slight specialization of (19), restate it, and present its solution in the following fundamental:

7. Theorem. Consider the Kalman-Bucy model for linear filtering:
$$dX_t = [F_1(t)X_t + f_2(t)]\,dt + d\varepsilon_t, \qquad dY_t = [H_1(t)X_t + h_2(t)]\,dt + d\eta_t, \qquad t \ge 0, \tag{34}$$
with initial conditions $X_0 = \xi$, $Y_0 = 0$, $X_t : \Omega \to \mathbb{R}^n$, $Y_t : \Omega \to \mathbb{R}^m$, and noise (vector) processes $\varepsilon_t$, $\eta_t$ which are mutually independent BMs on the same stochastic base, centered, with covariances $Q(t)$, $R(t)$ (where $R(t) \ge cI$, $c > 0$, so that $R^{-1}(t)$ exists for all $t$); $f_2$, $h_2$ (as well as $F_1$, $H_1$) are nonstochastic Borel measurable vectors (matrices) of appropriate sizes. Here $\xi$ is a Gaussian vector with mean $m_0$ and covariance matrix $\tilde P_0$, independent of both noise processes, and $X_t$, $Y_t$ are jointly normal. Then a unique optimal (in the least squares sense) filter $\hat X_t$ exists. Moreover, if the error matrix is $P(t) = E([X_t - \hat X_t][X_t - \hat X_t]^*)$, the quantities $\hat X_t = E(X_t|Y_s, s \le t)$ and $P(t)$ are recursively obtained as (unique) solutions, respectively, of the stochastic differential equation:
$$d\hat X_t = [F_1(t)\hat X_t + f_2(t)]\,dt + P(t)H_1^*(t)R^{-1}(t)\big[dY_t - (H_1(t)\hat X_t + h_2(t))\,dt\big], \qquad \hat X_0 = m_0, \tag{35}$$
and the (deterministic) Riccati differential equation:
$$\frac{dP(t)}{dt} = F_1(t)P(t) + P(t)F_1^*(t) - P(t)H_1^*(t)R^{-1}(t)H_1(t)P(t) + Q(t), \qquad P(0) = \tilde P_0. \tag{36}$$
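A minimal scalar Euler discretization of (35)-(36) may clarify how the two equations work in tandem; all coefficient values below are illustrative assumptions, not from the text, with $f_2 = h_2 = 0$ for brevity:

import numpy as np

rng = np.random.default_rng(5)
F1, H1, Q, R, P0 = -1.0, 1.0, 0.4, 0.1, 1.0
T, n = 5.0, 50_000
dt = T / n

X = np.sqrt(P0) * rng.standard_normal()            # X_0 = xi, mean m_0 = 0
Xhat, P, err2 = 0.0, P0, 0.0
for k in range(n):
    deps = np.sqrt(Q * dt) * rng.standard_normal() # signal noise increment
    deta = np.sqrt(R * dt) * rng.standard_normal() # observation noise increment
    dY = H1 * X * dt + deta                        # observation increment, cf. (34)
    X += F1 * X * dt + deps                        # state, cf. (34)
    Xhat += F1 * Xhat * dt + (P * H1 / R) * (dY - H1 * Xhat * dt)   # filter (35)
    P += (2 * F1 * P - (P * H1) ** 2 / R + Q) * dt # Riccati equation (36)
    err2 += (X - Xhat) ** 2 * dt / T

print(P, err2)   # P(T): theoretical error variance; err2: one-path time average

Once $P(t)$ reaches its steady state, the single-path time-averaged squared error is a rough empirical proxy for it, which is what the printed comparison suggests.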
Remark. This result can be established using the ideas and methods of deterministic control theory, as well as by probabilistic procedures. Since our point of view is that of probability theory, we present the latter argument, which is also found useful in extending the model. The
control method is in Bensoussan [1]. Here we essentially follow the treatment of Liptser and Shiryayev [1]. [Caution: $P(t)$ is a matrix, not a probability.]

Our strategy of proof is first to establish the result in the scalar case, i.e., $m = n = 1$, and then to extend it to the (finite dimensional) vector processes. A key step in this procedure is to obtain an integral representation of the conditional expectation $E(X_t|Y_s, s \le t)$ in terms of a simple Wiener-type stochastic integral relative to the $Y_t$-process with a deterministic kernel $K(t, s)$, so that it is of the form $\int_0^t K(t, s)\,dY_s$. This plays the same role as the linear regression function recalled from the classical (finite dimensional) case just prior to the statement of the theorem. Then the general vector form of the result will be obtained as an extension of the scalar case, where most of the detail can be better appreciated. We shall see that the general form can be given for other processes, such as diffusions, with special "regression" representations, giving rise instead to Itô integrals with stochastic integrands. All of this will become clear in the following proof.

First, a representation of the Gaussian conditional mean is obtained as follows. [It should be noted that this is special and is in a refined form.]

8. Proposition. Let $\{X_t, t \ge 0\}$ be a continuous Gaussian martingale and $Y$ be a random variable such that $(Y, X_{t_1}, \ldots, X_{t_n})$ is a jointly distributed Gaussian vector for any $t_i \in \mathbb{R}^+$, $1 \le i \le n$, $n \ge 1$. Then the (regression function or) conditional expectation $E(Y|X_s, s \le t)$ is representable as
$$E(Y|X_s, s \le t) = E(Y|X_0) + \int_{0+}^{t} G(t, s)\,dX_s, \tag{37}$$
relative to a measurable deterministic kernel $G : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$ which is locally bounded.

Proof. Let $0 = t_0^n < t_1^n < \cdots < t_{2^n}^n = t$ be a dyadic partition of $[0, t]$, so that $t_k^n = \frac{kt}{2^n}$, $k \ge 0$. If $\mathcal{F}_{t,n} = \sigma(X_{t_k^n}, 0 \le k \le 2^n)$ and $\mathcal{F}_t = \sigma(X_s, s \le t)$, then by the continuity of the $X_t$-process one has $\mathcal{F}_{t,n} \uparrow \mathcal{F}_t$ as $n \to \infty$. Set $Y_n^t = E(Y|\mathcal{F}_{t,n})$. Then $\{Y_n^t, \mathcal{F}_{t,n}, n \ge 1\}$ is an $L^2(P)$-bounded martingale, since
$$E(Y_{n+1}^t|\mathcal{F}_{t,n}) = E[E(Y|\mathcal{F}_{t,n+1})|\mathcal{F}_{t,n}] = E(Y|\mathcal{F}_{t,n}) = Y_n^t, \text{ a.e.},$$
and $E(Y_n^t)^2 \le E(Y^2) < \infty$, by the conditional Jensen inequality. So $Y_n^t \to E(Y|\mathcal{F}_t)$ a.e. and in $L^2(P)$, by the classical martingale convergence theory (cf., e.g., Section IV.2).
Next using the fact that {Y, Xt , t ≥ 0} is (collectively) Gaussian, so that (Y, Xt1 , . . . , Xtn ) is a Gaussian vector, we have the (horizontal window definition of) conditional expectation of Y on Xt1 , . . . , Xtn to be a linear function, represented as: E(Y |Ft,n ) = E(Y |X0 ) +
n−1 2
Gn (t, tnj )(Xtnj+1 − Xtnj ),
(38)
j=0
where the Gn (t, tnj ) are some suitable constants, as seen prior to Theorem 7. [This representation will not generally be valid for non Gaussian families, and the proof (as well as the result) fails at this point.] Let us extend the definition of Gn as follows. Set Gn (t, s) = Gn (t, tnj ),
tnj ≤ s < tnj+1 ,
and hence (38) becomes E(Y |Ft,n ) = E(Y |X0 ) +
t
0+
Gn (t, s) dXs .
(39)
Then we claim that the measurable simple functions converge to a locally integrable function G giving the desired result. In fact, since {Xt , Ft , t ≥ 0} is a square integrable martingale, it is L2,2 -bounded relative to a σ-finite measure μ on R+ × Ω, (cf. the companion volume Rao [21],p.466; or Chapter IV here). Now consider E(Ynt
−
Ymt )2
t = E[ (Gn (t, s) − Gm (t, s)) dXs ]2 0 t = [Gn (t, s) − Gm (t, s)]2 dμ(s), since the martingale 0
Xt has orthogonal increments, → 0, as n → ∞. This is because the left side was already found to be Cauchy. Hence Gn (t, ·) → G(t, ·) in L2 ([0, t], μ), and (37) follows from (39). . Let us specialize this result for the (observation) Yt -process of (34), in the scalar case. [Thus {Yt , t ≥ 0} plays the role of the Xt in the above proposition, and the unobserved Xt -process is independent of the noises εt , ηt , which are orthogonal BMs.] Taking h2 = 0 for convenience, (34) is expressed slightly more generally for later use as: dYt = H1 (t)Xt dt + H2 (t)dηt ,
Let us specialize this result for the (observation) Y_t-process of (34), in the scalar case. [Thus {Y_t, t ≥ 0} plays the role of the X_t in the above proposition, and the unobserved X_t-process is independent of the noises ε_t, η_t, which are orthogonal BMs.] Taking h_2 = 0 for convenience, (34) is expressed slightly more generally for later use as:

    dY_t = H_1(t)X_t dt + H_2(t) dη_t,    (40)
where H_1, H_2 are deterministic locally bounded measurable functions. Let X̂_t = E(X_t | F_t) and X̂_0 = 0, where F_t = σ(Y_s, s ≤ t). This implies that

    E(X_t | F_0) = E[E(X_t | F_t) | F_0] = E[X̂_0 | F_0] = 0,

by (22), because F_0 and ε_t are independent in that equation. Next we claim that the transformed process {B_t, F_t, t ≥ 0} given by

    B_t = ∫_0^t (1/H_2)(s) dY_s − ∫_0^t (H_1/H_2)(s) X̂_s ds,    (41)

(again termed an "innovation process") is a BM. This is established, using Lévy's characterization of a BM among square integrable continuous martingales, by calculating the conditional ch.f. of B_t − B_s for 0 < s < t with the classical Itô differentiation formula, as follows. Since B_0 = 0 and t → B_t is continuous and locally square integrable, we first observe that it is a martingale. In fact, using (40) in (41) along with the change of variables for stochastic integrals, we have

    B_t = ∫_0^t [(H_1/H_2)(s) X_s ds + dη_s] − ∫_0^t (H_1/H_2)(s) X̂_s ds
        = ∫_0^t (H_1/H_2)(s)(X_s − X̂_s) ds + η_t.    (42)
Hence B_t is F_t-adapted and for 0 < s < t,

    E(B_t | F_s) = η_s + ∫_0^s (H_1/H_2)(u)(X_u − X̂_u) du + ∫_s^t (H_1/H_2)(u) E(X_u − X̂_u | F_s) du
                 = B_s,    (43)

since for u > s, E(X_u − X̂_u | F_s) = E(X̂_u − X̂_u | F_s) = 0, a.e. Also from (42) we see that the quadratic variations of B_t and η_t are the same, i.e., [B]_t = [η]_t = t, since the integral in (42) has finite variation (being integrable on [0, t]), so its quadratic variation vanishes. Hence by Lévy's theorem noted above, {B_t, F_t, t ≥ 0} is a BM. For completeness we include the detail here. Let f_y : R → C be given by f_y(x) = e^{ixy}, so that by the Itô formula one has for s < t

    f_y(B_t) − f_y(B_s) = ∫_s^t (df_y/dx)(B_u) dB_u + (1/2) ∫_s^t (d²f_y/dx²)(B_u) d[B]_u,
which becomes

    e^{iyB_t} − e^{iyB_s} = iy ∫_s^t e^{iyB_u} dB_u − (y²/2) ∫_s^t e^{iyB_u} du.

Integrating over A ∈ F_s and dividing by the F_s-adapted e^{iyB_s}, one gets

    p_y(A; s, t) (= ∫_A e^{iy(B_t − B_s)} dP) = P(A) − (y²/2) ∫_s^t p_y(A; s, u) du.    (44)

With the boundary condition p_y(A; s, s) = P(A), this integral equation has the unique solution

    p_y(A; s, t) = P(A) e^{−y²|t−s|/2}.    (45)

Taking A = Ω, we see that the conditional characteristic function is just e^{−y²|t−s|/2}, which is independent of F_s, so that B_t − B_s is distributed as N(0, |t − s|) and the increments are independent. Thus the B_t-process is a BM as asserted. Since X_t and the Y_t-process, hence X_t with the linear functions {B_t, F_t, t ≥ 0} of the Y_t-family, form a jointly normal system, the preceding proposition immediately implies the following:

9. Corollary. For the observable process {Y_t, F_t, t > 0} given by (40), which is Gaussian, the process defined by (41) is a BM, and for the conditional mean m_t = X̂_t = E(X_t | F_t), with m_0 = 0 (X_t being jointly normally distributed with the Y_t-process), there exists a deterministic measurable kernel G such that

    m_t = ∫_0^t G(t, s) dB_s,    (45)
with probability one.

Now we turn to a proof of the main theorem, using the above corollary.

Proof of Theorem 7. Step 1. We follow the same format as in the proof (discrete case) of Theorem 3. Let X̂_0 = 0. Then using the representation (45), we find for any bounded (deterministic) Borel function f and H_2 ≥ c > 0 (not necessarily H_2 = 1):

    E[(X_t − X̂_t) ∫_0^t f(s) dB_s] = E{E[(X_t − X̂_t) ∫_0^t f(s) dB_s | F_t]} = 0,

so that

    E(X_t ∫_0^t f(s) dB_s) = E(X̂_t ∫_0^t f(s) dB_s)
                           = E(∫_0^t G(t, s) dB_s ∫_0^t f(u) dB_u)
                           = ∫_0^t G(t, s) f(s) ds, since d[B]_s = ds.    (46)

On the other hand, the left side of (46) with (42) may also be simplified as:

    E(X_t ∫_0^t f(s) dB_s)
      = E[X_t ∫_0^t f(s) dη_s + X_t ∫_0^t f(s)(H_1/H_2)(s)(X_s − X̂_s) ds]
      = 0 + ∫_0^t f(s)(H_1/H_2)(s) E(X_t(X_s − X̂_s)) ds,
          since η_s, X_t are independent, and E(dη_s) = 0,
      = ∫_0^t f(s)(H_1/H_2)(s) E[(X_s − X̂_s) E(X_t | F_s)] ds
      = ∫_0^t f(s)(H_1/H_2)(s) e^{∫_s^t F_1(u) du} E[(X_s − X̂_s) X_s] ds,
          since dε_u, u > s, and F_s are independent; use (22) with G_1 = 1, E^{F_s}(dε_s) = 0 and initial value X_s,
      = ∫_0^t f(s)(H_1/H_2)(s) e^{∫_s^t F_1(u) du} E(X_s − X̂_s)² ds,
          since E(X̂_s E(X_s − X̂_s | F_s)) = 0,
      = ∫_0^t e^{∫_s^t F_1(u) du} γ_s (H_1/H_2)(s) f(s) ds,    (47)

where γ_s = E(X_s − X̂_s)². From (46)-(47) and the arbitrariness of f, the integrands can be identified. Thus the kernel G is given by

    G(t, s) = e^{∫_s^t F_1(u) du} γ_s (H_1/H_2)(s).    (48)
Consequently, the optimal filter in this case is obtained from (45) and (48) as:

    X̂_t = ∫_0^t G(t, s) dB_s
         = e^{∫_0^t F_1(u) du} ∫_0^t e^{−∫_0^s F_1(v) dv} (H_1/H_2²)(s) γ_s [dY_s − H_1(s) X̂_s ds].

Hence, putting it in differential form, we get

    dX̂_t = (γ_t H_1(t)/H_2²(t)) dY_t + [F_1(t) − γ_t H_1²(t)/H_2²(t)] X̂_t dt.

This is equivalent to (35) in one dimension when X̂_0 = 0 and H_2 = 1.
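The text gives no numerical scheme, so the following is only a sketch: an Euler discretization of the filter equation just obtained, run together with the Riccati equation (36) for γ_t on a simulated path of the model. All constants (F_1, F_2, H_1, H_2, horizon, step size) are illustrative assumptions, not values from the text.

```python
# Sketch of the scalar Kalman-Bucy filter: signal dX = F1 X dt + F2 de,
# observation dY = H1 X dt + H2 dn, filter
#   dXh = F1 Xh dt + (g H1/H2^2)(dY - H1 Xh dt),
# with Riccati equation dg/dt = 2 F1 g - g^2 H1^2/H2^2 + F2^2.
import numpy as np

rng = np.random.default_rng(1)
F1, F2, H1, H2 = -0.5, 1.0, 1.0, 0.5     # illustrative constants
T, n = 10.0, 20_000
dt = T / n

X, Xh, g = 1.0, 0.0, 1.0                 # state, estimate, error variance
err2 = 0.0
for _ in range(n):
    dY = H1 * X * dt + H2 * rng.normal(0.0, np.sqrt(dt))
    Xh += F1 * Xh * dt + (g * H1 / H2**2) * (dY - H1 * Xh * dt)
    g += (2 * F1 * g - (g * H1 / H2) ** 2 + F2**2) * dt
    X += F1 * X * dt + F2 * rng.normal(0.0, np.sqrt(dt))
    err2 += (X - Xh) ** 2 * dt

print("gamma at T:", g)                  # deterministic error variance
print("empirical mean square error:", err2 / T)
```

For this stable choice of constants, the time average of (X_t − X̂_t)² along a long path should be comparable to the stationary value of γ_t, which is governed by the Riccati equation derived in Step 3 below.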
Step 2. For the general case (still in one dimension), let P[X̂_0 ≠ 0] > 0. We reduce this to that of Step 1 by the following change of variables. Define

    (a) X′_t = X_t − X̂_0 e^{∫_0^t F_1(s) ds},
    (b) Y′_t = Y_t − X̂_0 ∫_0^t H_1(s) e^{∫_0^s F_1(u) du} ds.    (49)

These processes satisfy the stochastic equations:

    (a) dX′_t = F_1(t)X′_t dt + F_2(t) dε_t,   X′_0 = X_0 − X̂_0,
    (b) dY′_t = H_1(t)X′_t dt + H_2(t) dη_t,   Y′_0 = Y_0,    (50)

which are of the type (40) or (34) if F_2 = H_2 = 1 [and f_2 = 0 = h_2 as always]. Let F′_t = σ(Y′_s, s ≤ t), X̂′_t = E(X′_t | F′_t), and γ′_t = E(X′_t − X̂′_t)². Since Y′_0 = Y_0, (49) implies that F′_t = F_t, t ≥ 0, and hence

    X̂′_t = E(X′_t | F_t) = E(X_t | F_t) − X̂_0 exp[∫_0^t F_1(s) ds]
          = X̂_t − X̂_0 exp[∫_0^t F_1(s) ds],    (51)

and differentiating this we get (after using (50)(a)):

    dX̂′_t = [F_1(t) − γ′_t H_1²(t)/H_2²(t)] X̂′_t dt + (γ′_t H_1(t)/H_2²(t)) dY′_t.    (52)

However, a computation shows that γ′_t = E(X′_t − X̂′_t)² = γ_t. From these equations one finds, after a straightforward but careful simplification of (51) with (52) eliminating the primed quantities, that (35) holds as stated if we set H_2 = 1 = F_2 and f_2 = 0 = h_2. The case f_2 ≠ 0 ≠ h_2 is a simple extension, which we leave as an exercise.

Step 3. Next we obtain the Riccati equation for γ_t from (35) and (34) (still in one dimension). Since γ_t = E(X_t − X̂_t)², we consider the difference dX_t − dX̂_t from (34) and (35), and take expectations
of (X_t − X̂_t)². To simplify the latter, however, we need to use (a generalized form of) the Itô formula for stochastic differentials. This is done as follows. First note that (34) and (35) may be written as:

    dX_t = F_1(t)X_t dt + F_2(t) dε_t;   dY_t = H_1(t)X_t dt + H_2(t) dη_t;
    dX̂_t = F_1(t)X̂_t dt + (γ_t H_1(t)/H_2²(t))(dY_t − H_1(t)X̂_t dt),    (35′)

so that we get the differential of Z_t = X_t − X̂_t from the above as:

    dZ_t = [F_1(t) − γ_t H_1²(t)/H_2²(t)] Z_t dt + F_2(t) dε_t − (γ_t H_1(t)/H_2(t)) dη_t.    (52)
In this equation, the first term on the right is locally of finite variation, and the other two are (locally) L^{2,2}-bounded processes. [These will be BMs only if the coefficient functions F_2, H_2 of the ε_t, η_t-processes are independent of t; in the present case they are thus not necessarily BMs, but are simply L^{2,2}-bounded.] For these L^{2,2}-bounded processes also we have a generalized Itô formula (cf. the companion volume, Rao [21], Theorem VI.2.13) which, for continuous processes such as the Z_t in (52), may be expressed as follows. For any twice continuously differentiable f : R → C, one has

    f(Z_t) − f(Z_0) = ∫_0^t (df/dx)(Z_s) dZ_s + (1/2) ∫_0^t (d²f/dx²)(Z_s) d[Z]_s,    (53)

where [Z]_s is the quadratic variation of the Z_t-process. [There is also a multivariate analog to be used in the vector case.] Taking f(x) = x² here, (52) and (53) imply:

    Z_t² − Z_0² = 2 ∫_0^t Z_s dZ_s + (1/2) ∫_0^t 2 d[Z]_s
                = 2 ∫_0^t Z_s {[F_1(s) − γ_s H_1²(s)/H_2²(s)] Z_s ds + F_2(s) dε_s − (γ_s H_1(s)/H_2(s)) dη_s} + ∫_0^t d[Z]_s.    (54)

But d[Z]_t = [F_2²(t) + γ_t² H_1²(t)/H_2²(t)] dt, since the quadratic variation of a process of (locally) finite variation vanishes. Substituting this in (54) and taking expectations, one finds

    γ_t − γ_0 = 2 ∫_0^t [F_1(s) − γ_s H_1²(s)/H_2²(s)] γ_s ds + 0 − 0 + ∫_0^t [F_2²(s) + γ_s² H_1²(s)/H_2²(s)] ds,
since Z_s and dε_s as well as dη_s (i.e., their increments) are independent by hypothesis. This shows that γ_t is differentiable and gives (36) for dγ_t/dt, and the theorem is established in the one-dimensional case, if we recall that the linear SDE with the given initial value X̂_0 = E(X_0 | F_0) and γ_0 = E(X̂_0 − X_0)² has a unique solution (cf., Proposition 4), and the uniqueness of the solution of the Riccati equation under the initial condition is a classical result in the ODE theory.
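As a quick sanity check on (36) in the constant-coefficient scalar case (coefficients chosen for illustration only), one can integrate the Riccati equation and compare with the stationary value obtained by setting dγ/dt = 0:

```python
# Riccati equation dg/dt = 2 F1 g - g^2 H1^2/H2^2 + F2^2; its stationary
# point solves the quadratic and equals
#   g_inf = (F1 + sqrt(F1^2 + F2^2 H1^2 / H2^2)) * H2^2 / H1^2.
F1, F2, H1, H2 = -0.5, 1.0, 1.0, 0.5     # illustrative constants
g, dt = 1.0, 1e-4
for _ in range(200_000):                  # integrate up to t = 20
    g += (2 * F1 * g - g**2 * H1**2 / H2**2 + F2**2) * dt

g_inf = (F1 + (F1**2 + F2**2 * H1**2 / H2**2) ** 0.5) * H2**2 / H1**2
print(g, g_inf)                           # should agree to several digits
```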
Step 4. Except for notational complexity, the extension for X_t, m-dimensional, and Y_t, n-dimensional (with ε_t, η_t also as vector BMs and H_1, H_2, F_1 as suitable matrices satisfying the given conditions) is straightforward, and we merely indicate the salient points that should be remembered. The representation (45) holds exactly as before, but now

    X̂_t = ∫_0^t G(t, s) dB_s,

relative to the vector BM (B_s, s ≥ 0), with G an m × n matrix of locally bounded deterministic measurable functions, and one can obtain the result first for the case that X̂_0 = 0, and then for P[|X̂_0| > 0]. Again let Z_t = X_t − X̂_t, and derive an equation analogous to (52) which takes the form

    dZ_t = (F_1(t) − γ_t H_1^*(t)[H_2(t)H_2^*(t)]^{-1} H_1(t)) Z_t dt + F_2(t) dε_t − γ_t H_1^*(t) H_2^{-1}(t) dη_t,

where the * denotes transposition of a vector or a matrix. Since the corresponding vector Itô formula for L^{2,2}-bounded (vector) processes is available [in the same reference above], one can obtain the analog of (54). Computing the matrices E(Z_t Z_t^*) and simplifying, (36) is obtained in complete generality. We leave these straightforward calculations as an exercise.

It is useful to reiterate that although the independent processes ε_t, η_t are BMs, the processes {F_2(t)ε_t, t ≥ 0} and {H_2(t)η_t, t ≥ 0} are not (vector) BMs when F_2, H_2 are functions of t. But they are L^{2,2}-bounded in all cases, and the latter concept (introduced by Bochner in [1],[2]), used from Chapter IV onwards, is very useful in these problems. It also indicates a possible generalization of the theory. Moreover, (36) has stimulated a further study of the Riccati ODE in analysis, motivated by this work (cf., e.g., Curtain and Pritchard [1]). It is possible to weaken the strongly nonsingular condition on the matrices F_2, H_2 and use
their generalized inverses as in Section 3 above. Several applications of the linear theory are detailed in Kwakernaak and Sivan [1]. Another related question is to consider higher order SDEs in this analysis, analogous to what was done in the general case in a previous section. The nonlinear study in this work is a natural step for investigation. The higher order (nonlinear) problem has not been investigated thoroughly, but the linear Kalman-Bucy theory when the vectors take values in infinite dimensional Hilbert spaces as well as in normal Hilbert modules has been considered by several authors, and we refer to Kakihara [1], where some of these results, and his own work, are discussed in detail with references. We turn to the nonlinear case in the next section.

8.5 Kalman-Bucy filters: the nonlinear case

The corresponding nonlinear (continuous parameter) problems can utilize the modern developments of the general Itô stochastic calculus, thereby enriching the potential of the subject. Here we include a small portion of this work to show how the filter applications motivate both the specialization as well as some generalizations of the existing mathematical analysis. It was already noted in Theorem 4.1 that for any X ∈ H, a Hilbert space, and a closed subspace S ⊂ H, there exists a unique closest element X* ∈ S, so that ‖X − X*‖ = min{‖X − Y‖ : Y ∈ S}. In fact, X* = π_S(X), where π_S is the orthogonal projection of H onto S. Various methods of calculating X* if S is determined linearly by the set of observations (the linear filtering) have been developed in the preceding section, the most prominent being the Kalman-Bucy filtering. If L²(Ω, Σ, P) = L²(P) and if the subspace S is determined by the set of functions (observables) that are measurable relative to B_I = σ(Y_s, s ∈ I), then the projection operator π_S can be identified as the conditional expectation E^{B_I}. We state this for reference, and sketch a quick proof.

1. Proposition. Let {Y_t, t ∈ I} ⊂ L²(P) and X ∈ L²(P). Then the closest element Y* ∈ L²(Ω, B_I, P) = L²(P_{B_I}) to X is given as Y* = E^{B_I}(X), where B_I = σ(Y_t, t ∈ I), and thus Y* is a (Borel) function g of the conditioning variables {Y_t, t ∈ I}, written symbolically as:

    Y* = X̂ = E(X | Y_t, t ∈ I) = g(Y_t, t ∈ I).    (1)

[Here g : R^I → R, and R^I is given the product topology.]

Proof. By the uniform convexity of H, the uniqueness of Y* is immediate, and its existence follows from Theorem 4.1. Thus we only need to determine the form of Y*. This may be verified immediately from

    ‖X − Y*‖₂² = E(X − E^{B_I}(X) + E^{B_I}(X) − Y*)²
    = E(X − E^{B_I}(X))² + E(E^{B_I}(X) − Y*)² + 2E{E^{B_I}[(X − E^{B_I}(X))(E^{B_I}(X) − Y*)]},
        since E(Z) = E(E^{B_I}(Z)) for any Z ∈ L¹(P),
    = E(X − E^{B_I}(X))² + E(E^{B_I}(X) − Y*)² + 0.

This is a minimum iff Y* = E^{B_I}(X). It is a Borel function g of the conditioning variables {Y_t, t ∈ I} by the well-known Doob-Dynkin lemma (cf., e.g., Rao [21], p.75). This is the desired result. □

Remark. It should also be noted at this point that the conditional mean is the best or optimal (estimator or) filter relative to a wide class of loss functions that are merely increasing and symmetric, if the processes are Gaussian and, more generally, if their conditional distributions (based on finite sets of random variables) are unimodal and symmetric about their means, which are assumed to exist. This is shown in Theorem IV.2.7, and is of considerable interest in the present context. The result provides additional justification for using conditional means as optimal filters in some cases of non least squares criteria, indicated long ago based on practical experience in applications (cf., e.g., Laning and Battin [1]).

The proposition implies that for an explicit solution of the least-squares estimator X̂ of X, we need to evaluate the conditional expectation g(Y_t, t ∈ I), which in general is very complicated. We have shown that if X, Y_t, t ∈ I, are Gaussian, then g can be determined with explicit forms when X(= X_t) and Y_t are given by certain difference or differential (stochastic) equations, using special techniques. In the more general case of nonlinear g, we present an analysis here. To follow the Kalman-Bucy type method, one can also allow the coefficients F_i, H_i in the model equations (34) to depend on the unobservable X_t as well as the observed process Y_t, through the past and present values. Then the equations are no longer linear; in fact they are nonlinear. The new model may thus be represented as:

    dX_t = F_1(t, X_s, s ≤ t) dt + F_2(t, X_s, s ≤ t) dε_t
    dY_t = H_1(t, Y_s, s ≤ t) dt + H_2(t, Y_s, s ≤ t) dη_t,    (2)

where F_2 > 0, H_2 > 0 and all the F_i, H_i satisfy suitable conditions. They can be combined into a vector form as:

    dZ_t = F(t, Z_t) dt + H(t, Z_t) dW_t,    (3)

in which we denoted

    Z_t = (X_t, Y_t)*;   F(t, Z_t) = (F_1(t, Z_t), H_1(t, Z_t))*;

    H(t, Z_t) = ( F_2(t, Z_t)   0
                  0             H_2(t, Z_t) );   W_t = (ε_t, η_t)*.
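In code, the passage from (2) to (3) is just a matter of stacking. The following sketch, with placeholder coefficient functions that are not from the text, assembles Z_t, F, H, W_t and takes Euler steps of (3).

```python
# Sketch of the vector form (3): combined drift F, block-diagonal
# diffusion H, and noise W_t = (e_t, n_t). Coefficients are assumptions.
import numpy as np

def F1(t, z): return -0.5 * z[0]          # signal drift (placeholder)
def H1(t, z): return 1.0 * z[0]           # observation drift (placeholder)
def F2(t, z): return 1.0                  # signal noise coefficient
def H2(t, z): return 0.5                  # observation noise coefficient

def euler_step(t, z, dt, rng):
    F = np.array([F1(t, z), H1(t, z)])                # combined drift
    H = np.array([[F2(t, z), 0.0],
                  [0.0, H2(t, z)]])                   # block diffusion
    dW = rng.normal(0.0, np.sqrt(dt), size=2)         # (de_t, dn_t)
    return z + F * dt + H @ dW

rng = np.random.default_rng(2)
z = np.array([1.0, 0.0])                              # (X_0, Y_0)
for k in range(1000):
    z = euler_step(k * 1e-3, z, 1e-3, rng)
print(z)                                              # one sample of Z_1
```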
In this form it is possible to let F_i, H_i depend on both X_t, Y_t. Under certain Lipschitz conditions on F and H, the standard SDE theory implies the existence of a unique solution of (3) (hence (2)) when ε_t, η_t (or W_t) is an L^{2,2}-bounded (vector) process, F, H are bounded measurable vector or matrix functions with H uniformly positive definite, and under suitable initial conditions. We established the result as Proposition 4.4 for the linear case, but the corresponding result in the general case also uses a similar argument. We give the precise statement below for comparison and then restate the filtering problem. To motivate the general case, we present an extension of Proposition 4.4, giving an explicit solution (still for the one-dimensional problem):

2. Proposition. Consider the Itô type SDE given by

    dZ_t = (a(t)Z_t + a_0(t)) dt + Σ_{i=1}^k (b_i(t)Z_t + c_i(t)) dε^i_t,    (4)

where the noise processes {ε^i_t, t ≥ 0} are L^{2,2}-bounded (possibly correlated) and a_0, a, b_i, c_i are deterministic and continuous. Then (4) has a unique solution Z_t, for a given initial value Z_0, representable as:

    Z_t = M(t)^{-1} [Z_0 + ∫_0^t M(s)a_0(s) ds − Σ_{i,j=1}^k ∫_0^t M(s)b_i(s)c_j(s) d[ε^i, ε^j]_s
          + Σ_{i=1}^k ∫_0^t M(s)c_i(s) dε^i_s],    (5)

where the factor M(·) is defined by

    M(t) = exp{−∫_0^t a(s) ds − Σ_{i=1}^k ∫_0^t b_i(s) dε^i_s + (1/2) Σ_{i,j=1}^k ∫_0^t b_i(s)b_j(s) d[ε^i, ε^j]_s},    (6)

with t → [ε^i, ε^j]_t being the quadratic covariation of the processes {ε^i_t, t ≥ 0; i = 1, . . . , k}.

Proof. We convert (4) into the Stratonovich form, for which the classical ODE rules apply, solve it, and then revert back to the Itô form to obtain (5). [It is also possible to substitute (5) directly, use Itô's
differentiation formula, and verify that the relation holds, so that from the uniqueness assertion the result follows.] Here we derive (5) from (4). The essential step is to use the connecting relation between the Itô and the Stratonovich integrals, the latter denoted with a circle notation as ∫_a^t U_1 ∘ dU_2 for any pair of L^{2,2}-bounded processes U_i(t). This is given as (cf., e.g., Rao [21], Theorem VI.2.14, p.475):

    ∫_a^t U_1(s) ∘ dU_2(s) = ∫_a^t U_1(s) dU_2(s) + (1/2) ∫_a^t d[U_1, U_2](s),    (7)

the right side being in Itô's sense, and thus in differential form:

    U_1(s) ∘ dU_2(s) = U_1(s) dU_2(s) + (1/2) d[U_1, U_2](s).    (8)

Note that if either U_1 or U_2 has finite variation [in particular if U_1 is deterministic] then both integrals coincide and reduce to the Wiener [or martingale] type integrals, but to Lebesgue-Bochner type only when U_2 is of (locally) finite variation a.e. It may also be observed that for finite dimensional vector processes, in particular for this theorem, the L^{2,2}-boundedness condition for a process becomes equivalent to the semimartingale property, as noted in the above reference (cf. also Métivier and Pellaumail [1], p.129). The following properties of the stochastic symbolic rules devised by Itô will be useful for this proof and elsewhere. Thus if dS and dA denote the sets of differentials of (local) L^{2,2}-bounded processes and of (local) finite variation processes, both adapted to the same filtration {F_t, t ≥ 0}, then one has (as sets):

    dS · dS ⊂ dA;   dS · dA = {0},    (9)

and hence

    U_1 ∘ dU_2 = U_1 dU_2 (if U_1 or U_2 ∈ A),   (U_1 · dU_2) · dU_3 = U_1 · (dU_2 · dU_3).    (10)

(See, e.g., Rao [21], p.421.) With these relations we simplify (4), wherein the only Itô differential term is the second on the right, and it can be converted into the other with (9) and (10) as follows:

    (b_i(t)Z_t) ∘ dε^i_t = (b_i(t)Z_t) dε^i_t + (1/2) d(b_i(t)Z_t) dε^i_t
      = (b_i(t)Z_t) dε^i_t + (1/2)[Z_t d[b_i, ε^i]_t + b_i(t) dZ_t dε^i_t]
      = b_i(t)Z_t dε^i_t + (1/2)[Z_t · 0 + b_i(t) Σ_{j=1}^k (b_j(t)Z_t + c_j(t)) d[ε^i, ε^j]_t],
          using the Itô differential formula and (8),
      = b_i(t)Z_t dε^i_t + (1/2) b_i(t) Σ_{j=1}^k (b_j(t)Z_t + c_j(t)) d[ε^i, ε^j]_t.    (11)
Substituting (11) into (4) one finds

    dZ_t = [a(t)Z_t + a_0(t)] dt + Σ_{i=1}^k b_i(t)Z_t ∘ dε^i_t
           − (1/2) Σ_{i,j=1}^k b_i(t)(b_j(t)Z_t + c_j(t)) d[ε^i, ε^j]_t + Σ_{i=1}^k c_i(t) dε^i_t
         = Z_t (a(t) dt + Σ_{i=1}^k b_i(t) ∘ dε^i_t − (1/2) Σ_{i,j=1}^k b_i(t)b_j(t) d[ε^i, ε^j]_t)
           + (a_0(t) dt − (1/2) Σ_{i,j=1}^k b_i(t)c_j(t) d[ε^i, ε^j]_t + Σ_{i=1}^k c_i(t) dε^i_t).    (12)

The equation obtained by setting c_i = 0, a_0 = 0 is homogeneous, and is given as:

    dZ_t = Z_t (a(t) dt + Σ_{i=1}^k b_i(t) ∘ dε^i_t − (1/2) Σ_{i,j=1}^k b_i(t)b_j(t) d[ε^i, ε^j]_t).    (13)

The classical ODE rules give the 'integrating factor' M(t) for this as:
    M(t) = exp{−∫_0^t a(s) ds − Σ_{i=1}^k ∫_0^t b_i(s) dε^i_s + (1/2) Σ_{i,j=1}^k ∫_0^t b_i(s)b_j(s) d[ε^i, ε^j]_s}.    (14)

Hence multiplying (12) by M(t), we get from d(Z_t M(t)), after integration:
    Z_t M(t) = Z_0 + ∫_0^t M(s)a_0(s) ds − Σ_{i,j=1}^k ∫_0^t M(s)b_i(s)c_j(s) d[ε^i, ε^j]_s + Σ_{i=1}^k ∫_0^t M(s)c_i(s) dε^i_s.    (15)
Here again the Itô formula for semimartingales with two variables (cf., e.g., Rao [21], p.401) is employed. Since M(t) never vanishes, this is equivalent to (5). If the {ε^i_t, t ≥ 0}, i = 1, . . . , k, are independent BMs, then (5) or (15) reduces to the following:

    Z_t = M(t)^{-1} [Z_0 + ∫_0^t M(s)a_0(s) ds − Σ_{i=1}^k ∫_0^t M(s)b_i(s)c_i(s) ds + Σ_{i=1}^k ∫_0^t M(s)c_i(s) dε^i_s]
        = M(t)^{-1} [Z_0 + ∫_0^t M(s)(a_0(s) − Σ_{i=1}^k b_i(s)c_i(s)) ds + Σ_{i=1}^k ∫_0^t M(s)c_i(s) dε^i_s],    (16)

where

    M(t) = exp[−∫_0^t (a(s) − (1/2) Σ_{i=1}^k b_i²(s)) ds − Σ_{i=1}^k ∫_0^t b_i(s) dε^i_s].    (17)

As the proof shows, the result depends on the Itô differentiation formula. A direct verification of (16) with (17) was given by Wu [1]. The above method, valid for L^{2,2}-bounded processes, is useful in applications, and is taken from (Rao [28], p.237, correcting some obvious typographical slips). [Note that the result admits an immediate vector extension.]
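A direct pathwise check of (16)-(17) is easy to carry out for k = 1 and constant coefficients (the values below are illustrative): the closed form and an Euler-Maruyama path driven by the same Brownian increments should agree up to discretization error.

```python
# Pathwise check of (16)-(17), k = 1, constant coefficients:
# dZ = (a Z + a0) dt + (b Z + c) dW, closed form
#   M(t) = exp(-(a - b^2/2) t - b W_t),
#   Z_t  = M(t)^{-1} [Z0 + int_0^t M(s)(a0 - b c) ds + int_0^t M(s) c dW_s].
import numpy as np

rng = np.random.default_rng(3)
a, a0, b, c, Z0 = -1.0, 0.3, 0.4, 0.2, 1.0
T, n = 1.0, 100_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), n)
W = np.concatenate(([0.0], np.cumsum(dW)))
t = np.linspace(0.0, T, n + 1)

# Euler-Maruyama path
Z = np.empty(n + 1); Z[0] = Z0
for k in range(n):
    Z[k + 1] = Z[k] + (a * Z[k] + a0) * dt + (b * Z[k] + c) * dW[k]

# closed form, with left-endpoint (Ito) discretized integrals
M = np.exp(-(a - 0.5 * b**2) * t - b * W)
I1 = np.concatenate(([0.0], np.cumsum(M[:-1] * (a0 - b * c) * dt)))
I2 = np.concatenate(([0.0], np.cumsum(M[:-1] * c * dW)))
Z_cf = (Z0 + I1 + I2) / M
print(abs(Z[-1] - Z_cf[-1]))   # small, and shrinks as n grows
```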
Such explicit expressions cannot be expected for more general coefficient equations, even those of the form (3). We present the existence and uniqueness of solutions of equations (3) as follows.

3. Proposition. Consider the vector SDE (3) where the coefficients F and H are (locally) bounded vector and matrix functions on R_+ × R^{m+n} and R_+ × R^{m+n} × R^{m+n} respectively, and {W_t, t ≥ 0} is an R^{m+n}-valued (locally) L^{2,2}-bounded noise process. Suppose that F, H are left continuous in t and satisfy a Lipschitz condition expressed as (using vector and matrix norm symbols):

    ‖F(s, x) − F(s, y)‖ + ‖H(s, x) − H(s, y)‖ ≤ K_t ‖x − y‖,

for all x, y in a ball of R^{m+n}, for 0 ≤ s ≤ t, and a constant K_t depending only on t. Then there exists a unique process {Z_t, F_t, t ≥ 0} satisfying (3) with the given initial condition Z_0 = A (a constant, or with Z_0 ∈ L²(P) and Z_0 independent of F_t, t > 0). Moreover, the solution is also locally L^{2,2}-bounded.

A proof of this result is obtained with a modification of the existence argument of Prop. 4.4, but the details are also available in (Rao [28], p.248), and will not be reproduced here. In particular, we note that both equations of (2), under a given initial value and with the coefficients satisfying the corresponding Lipschitz conditions, have unique L^{2,2}-bounded solutions. Also, as already observed in the proof of Proposition 2 above, in the present context L^{2,2}-boundedness is equivalent to the semimartingale property assumed of the processes.

As seen from Proposition 1, the optimal least squares estimator of a function of the signal (or the state of the unobservable system) at time t, namely ϕ(X_t), based on the observed process Y_s, s ≤ t, is given by the conditional expectation E(ϕ(X_t) | Y_s, 0 ≤ s ≤ t) = π_t(ϕ), say, generally a nonlinear function of Y_s, 0 ≤ s ≤ t. This is usually difficult to calculate, and the next best thing is therefore to derive a stochastic differential equation for the process {π_t(ϕ), F_t, t ≥ 0}, where F_t = σ(Y_s, s ≤ t). Then using the theory of SDEs it is often possible to solve for π_t(ϕ), or at least to determine many properties of an optimal (nonlinear) estimator of ϕ(X_t), hereafter termed an optimal (nonlinear) filter. This can be done if the (X_t, Y_t)-process satisfies certain (first order, possibly nonlinear) SDEs under the conditions of Proposition 3. In this effort, Lemma 4.5 and Proposition 4.6, especially the representation (33), will be useful. We therefore restate it for immediate use.

Let (Ω, Σ, P_i), i = 1, 2, be given equivalent measure spaces and B ⊂ Σ a σ-algebra, such that the restrictions P_{iB} = P_i|B, i = 1, 2, are σ-finite, the latter being automatic if the measures P_i are probabilities. However, the following is true in this generality. Let Z = dP_2/dP_1, and let E^B_{P_i} denote the (generalized) conditional expectation for the P_i-measures, which is well-defined (cf., e.g., Rao [17], Theorem 5.5.15, on p.296), so that for any f ≥ 0, measurable for Σ, and for each A ∈ B_0 = {B ∈ B : P_i(B) < ∞, i = 1, 2}, one has

    ∫_A f dP_i = ∫_A E^B_{P_i}(f) dP_{iB},   i = 1, 2,

and E^B_{P_i} : f → E^B_{P_i}(f) has all the standard properties (positive, linear, contractive, averaging), including E^B_{P_i}(1) = 1 a.e. Then Lemma 4.5
holds in this case and one has

    E^B_{P_2}(f) = E^B_{P_1}(Zf)/E^B_{P_1}(Z) = E^B_{P_1}(Z̃ · f),    (18)

where Z̃ = Z/E^B_{P_1}(Z), and Z is the RN-derivative dP_2/dP_1 given above. [Thus
0 < Z = (dP_1/dP_2)^{-1} < ∞ a.e., and then dP_1/dP_2 = Z^{-1} a.e. holds.] Now Z, and hence 0 < Z̃ < ∞, are measurable for Σ (but need not be for B). Taking f = 1 and using a property of the (generalized) conditional expectation, we get E^B_{P_2}(1) = E^B_{P_1}(Z̃) = 1 a.e., although Z̃ ≠ 1 a.e. The idea is, for each such P_1, to choose the equivalent measure P_2 in such a way that with f = ϕ(X_t), B = F_t, one gets the process {π_t(ϕ(X_t)), F_t, t ≥ 0} as a solution of a (hopefully) relatively simple SDE on (Ω, F_t, P_2). The change of measures here is motivated by the Girsanov theorem (cf., Chapter V, Thm. 5.6). A problem in this form was first treated by Zakai [2], and we present a solution of it, with simplifications resulting from later developments. Also it may be noted that ϕ can be chosen so that ϕ(X_t) is bounded, and the existence questions are streamlined, although one has to settle for generalized (hence weak) solutions. But then the work can be applied to a wider area, thereby enriching the subject considerably (just as in the motivating deterministic PDE analysis). We concentrate on the diffusion type processes, and the material of Sections V.5, VII.1-2 will be of interest here again.

It is useful to record another representation of a conditional expectation, with certain ideas and results from martingale theory. Let {U_t, U_t, t ≥ 0} and {V_t, U_t, t ≥ 0} be square integrable right continuous martingales based on the same standard filtration {U_t, t ≥ 0}. Then by the Doob-Meyer decomposition

    (UV)_t = M_t + [U, V]_t,   t ≥ 0,

where {M_t, U_t, t ≥ 0} is a martingale and the family {[U, V]_t, U_t, t ≥ 0} is a 'natural' locally bounded covariation process. [Here 'natural' means E(∫_{R_+} S_{r−} d[U, V]_r) = E(S_∞ [U, V]_∞) for every bounded right continuous martingale {S_r, U_r, r ≥ 0}. This can be shown to be the same as saying that t → [U, V]_t is measurable relative to the filtration generated by all a.e. left continuous bounded martingales for the above filtration.] But by the CBS inequality for square integrable martingales (established by Kunita and Watanabe in the mid 1960s, cf., e.g., Rao [21], p.383) we have

    |[U, V]_{(0,t)}|² ≤ ([U, U]_{(0,t)})([V, V]_{(0,t)}),   t ≥ 0, a.e.    (19)
This implies that [U, V]_· is absolutely continuous relative to both Borel measures generated by t → [U, U]_t and t → [V, V]_t. Hence by the RN-theorem for t → [V, V]_t, there is an adapted integrable process {g_s, U_s, s ≥ 0} such that

    [U, V]_t = ∫_0^t g_s d[V, V]_s,    (20)

exactly as in the proof of (Rao [21], Prop. V.3.28, on p.413). Define V′_t = ∫_0^t g_s dV_s and V″_t = U_t − V′_t. Then from the definition of the quadratic covariation between U, V′, we get

    [U, V′]_t = ∫_0^t g_s d[U, V]_s = ∫_0^t g_s² d[V, V]_s, by (20).    (21)

Hence

    [V″, V′]_t = [U − V′, V′]_t = [U, V′]_t − [V′, V′]_t
               = ∫_0^t g_s² d[V, V]_s − ∫_0^t g_s² d[V, V]_s = 0, by (21).

Thus the processes V′, V″ are orthogonal. We state this result, with integral representations of U_t in terms of the V_t-process, for later use.

4. Proposition. Let F = {F_t, t ≥ 0} be a standard filtration from (Ω, Σ, P), so that it is right continuous and each F_t is P-complete. Let M²(F) be the set of right continuous square integrable martingales. If U, V ∈ M²(F) (so U_t, V_t ∈ L²(P)), then there is an integrable adapted process {g_t, F_t, t ≥ 0} and a square integrable martingale W = {W_t, F_t, t ≥ 0} such that

    U_t = ∫_0^t g_s dV_s + W_t,   t ≥ 0,    (22)

where U − W and W are orthogonal. This decomposition is unique. Moreover, if F_t is generated by V_s, s ≤ t, itself, then W = 0 a.e. In particular, with V = {V_t, V_t = σ(V_s, s ≤ t), t ≥ 0}, any right continuous square integrable martingale U (∈ M²(V)) admits a representation:

    U_t = U_0 + ∫_0^t g_s dV_s,    (23)

for a process {g_s, V_s, s ≥ 0} which also satisfies |∫_0^t g_s d[V, V]_s| < ∞, t > 0, with probability one.
Proof. Because of the analysis preceding the statement, only the last part remains to be established. So let L(U) = {V : V_t = ∫_0^t g_s dU_s, t ≥ 0}. Then L(U) ⊂ M²(F), and the latter is a complete locally convex (or Fréchet) space for the family of norms ‖U‖_t = E(|U_t|²)^{1/2}. (This is shown in Rao [21], p.412.) Let N be the smallest closed subspace of M²(F) containing U, with the property that for V ∈ N and g ∈ L²([U]_t), so that the sample paths are (locally) square integrable relative to [U], one has {∫_0^t g_s dV_s, t ≥ 0} ∈ N. Then L(U) ⊂ N by definition, since (setting V = U) the process U is in L(U). We observe that there is equality here. For, if there is 0 ≠ V ∈ N − L(U), then L(V) ⊂ N and, by linearity, L(U) ∪ L(V) ⊂ N, so that N is strictly larger than L(U). Since U ∈ L(U) and each of its elements is represented as above, L(U) would be a proper subspace of N, contradicting the minimal character of the latter. So there is no such V, and N = L(U). This implies, since W is orthogonal to the martingale V′_t = ∫_0^t g_s dV_s, t ≥ 0, and is adapted to V, that it is also in L(U), so that it must be trivial, i.e., W = 0. Now the last statement is immediate. □

There is an alternative way to show (23) when V_t is a BM: start with e^{ixV_t}, verify that it is orthogonal to the W_t-process with Itô's formula, and then proceed inductively with finite products of the exponential martingales of Brownian functionals. Such functionals are seen to be dense in L(U), and since W is also V_t-adapted, the result follows. For details, see Liptser and Shiryayev ([1], p.162).
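A concrete instance of (23), with V a BM on [0, 1]: the martingale U_t = E(V_1² | V_s, s ≤ t) = V_t² + (1 − t) has integrand g_s = 2V_s, which the following sketch verifies numerically (the example itself is standard, but it is our choice, not the text's).

```python
# Illustration of (23): for a BM V on [0,1], the martingale
# U_t = E(V_1^2 | V_s, s <= t) = V_t^2 + (1 - t) satisfies
# U_t = U_0 + int_0^t 2 V_s dV_s, with U_0 = 1.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
dt = 1.0 / n
dV = rng.normal(0.0, np.sqrt(dt), n)
V = np.concatenate(([0.0], np.cumsum(dV)))
t = np.linspace(0.0, 1.0, n + 1)

U = V**2 + (1.0 - t)                                   # the martingale
stoch_int = np.concatenate(([0.0], np.cumsum(2 * V[:-1] * dV)))
print(np.max(np.abs(U - (1.0 + stoch_int))))           # error -> 0 with n
```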
If the martingale is given by Ũ_t = E^{F_t}(ϕ(X_{t_0})), then we get (on setting F_t = V_t):

    E^{F_t}(ϕ(X_{t_0})) = Ũ_0 + ∫_0^t g(t_0, s) dV_s.

Hence, using equivalent measures P_1, P_2 with Z as the density given by (18), this becomes, for the optimal (estimator or) filter:

    E_{P_2}(ϕ(X_t) | F_t) = E_{P_1}(ϕ(X_t)Z̃_t | F_t) = U_t (say)
                          = U_0 + ∫_0^t g̃(t, s) dV_s.    (24)

If moreover ϕ(X_t) and V_s, s ≥ 0, are jointly Gaussian, then the kernel g̃ in (24) becomes deterministic, and the formula reduces to that of Corollary 4.9. The main nonlinear (continuous parameter) filtering problem, as formulated by Zakai [2], can be restated in the present terminology in the following way. Consider the (nonlinear first
order) SDE:

    dX_t = F_1(t, X_t, Y_t) dt + F_2(t, X_t, Y_t) dε_t + F_3(t, X_t, Y_t) dη_t,   X_0 = ξ,
    dY_t = H(t, X_t, Y_t) dt + dη_t,   Y_0 = 0,    (25)

where the X_t- is an unobserved m-vector state (or system) process and the Y_t- is an observable n-vector process, both satisfying the given SDEs. Here the coefficients F_1, F_2, F_3 and H are supposed to be uniformly bounded Borel measurable (vector or matrix) functions depending on the past state as well as the observations, and ε_t, η_t are mutually independent m- and n-vector BMs which are also independent of the initial state m-vector ξ. [Recall that an n-dimensional BM U_t is a process with continuous paths and independent increments, each increment U_t − U_s being an n-dimensional Gaussian distributed vector with mean zero and covariance |t − s|A, where A is a positive definite matrix. The existence of such a process is also a consequence of the basic Kolmogorov theorem.] We take G̃_t = σ(X_s, Y_s, 0 ≤ s ≤ t) ⊂ Σ, F̃_t = σ(Y_s, 0 ≤ s ≤ t), and G_t, F_t as the P-completions of G̃_t, F̃_t. Thus F_t ⊂ G_t ⊂ Σ, and F_i(t, ·, ·) and H(t, ·, ·) are adapted to G_t. Let P_1 = P and P_2 be defined by dP_{2t} = Z_t^{-1} dP_t, where 0 < Z_t < ∞ a.e. [P_t], with P_{it} = P_i|F_t, i = 1, 2. Here P_2 and Z_t are obtained by (following Girsanov's theorem, cf., (41) of the preceding section) the expression:

    Z_t = exp{∫_0^t H(s, X_s, Y_s)* dY_s − (1/2) ∫_0^t (H*H)(s, X_s, Y_s) ds},    (26)

which makes {Y_t, F_t, t ≥ 0} a BM on (Ω, Σ, P_2). This is the key property of the model, and the fact that ε_t, η_t are BMs is an essential part of this strategy. Here (26) is a multidimensional analog of (41) of the last section. The reason for considering the multidimensional case is that we get an important (and essential) motive for analyzing stochastic PDEs. This aspect is vividly seen in higher dimensional problems, and it is always present in the nonlinear filter theory. With the model (25), and equivalent measures P_1, P_2 defined with Z_t^{-1} in place of Z_t, (24) becomes, by interchanging the suffixes, with Z̃_t = Z_t/E_{P_2}(Z_t | F_t):

    E_{P_1}(U | F_t) = E_{P_2}(U Z̃_t | F_t),   t ≥ 0.    (27)

It should now be clear why all the σ-algebras are taken complete. Note that, with U = 1 in (27), we get E_{P_2}(Z̃_t | F_t) = 1, a.e.
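Numerically, (26) is computed by accumulating log Z_t along a discretized path; the sketch below uses a placeholder scalar H, chosen only for illustration, and toy signal dynamics.

```python
# Sketch of (26): the Girsanov exponent Z_t along a discretized path,
# scalar case dY = H(X) dt + dn. Coefficients are assumptions.
import numpy as np

rng = np.random.default_rng(4)
T, n = 1.0, 10_000
dt = T / n

def H(x):                 # assumed observation drift, for illustration
    return np.sin(x)

X, logZ = 0.0, 0.0
for _ in range(n):
    dY = H(X) * dt + rng.normal(0.0, np.sqrt(dt))   # observation increment
    logZ += H(X) * dY - 0.5 * H(X) ** 2 * dt        # increment of log Z_t
    X += rng.normal(0.0, np.sqrt(dt))               # toy signal dynamics

print("Z_T =", np.exp(logZ))
```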
Following essentially Pardoux [1], we present a solution of the Zakai equation (25), by deriving an SDE satisfied by the optimal filter X̂_t at time t > 0 when the coefficients are bounded and adapted to the filtration. More generally, consider the mapping Z^ϕ : t → Z^ϕ_t = E^{F_t}_{P_2}(ϕ(X_t)), t ∈ R_+, so that Z^ϕ_t is the desired (Zakai) filter of ϕ(X_t) for the least squares criterion. The SDE depends on the (multidimensional) Itô formula with its partial derivatives. To simplify the (unavoidably complicated) notation, we introduce some differential symbolism. Let D_i = ∂/∂x_i, D_{ij} = ∂²/∂x_i∂x_j, and define the operators:

    L_{txy} = (1/2) Σ_{i,j=1}^m a_{ij}(t, x, y) D_{ij} + Σ_{i=1}^m F_1^i(t, x, y) D_i,
    A^j_{txy} = Σ_{i=1}^m F_2^{ij}(t, x, y) D_i,   j = 1, · · · , m,
    B^k_{txy} = Σ_{i=1}^m F_3^{ik}(t, x, y) D_i,   k = 1, · · · , n,    (28)
    L^k_{txy} = H_k(t, x, y) + B^k_{txy},   k = 1, · · · , n.

These operators act on the space C_b²(R^m) of twice continuously differentiable real functions with bounded derivatives, into itself. Here a = (a_{ij}, 1 ≤ i, j ≤ m) is the matrix defined by a = F_2F_2^* + F_3F_3^* (again '*' denotes transpose). It will be assumed that a is uniformly nonsingular (needed for invoking Girsanov's theorem below). This is the same as saying that the eigenvalues of a are uniformly (in t) bounded away from zero. Also let π_t(ϕ) = E^{F_t}_{P_2}(ϕ(X_t)Z̃_t), the desired predictor of the system or state functional ϕ(X_t) for ϕ ∈ C_b²(R^m). So π_t(ϕ) depends on X_t. With this, we now present the (generalized) solution of (25), the main result of the nonlinear filtering considered, as follows.

5. Theorem. Consider the model (25) with bounded Borel coefficients. Then for each ϕ ∈ C_b²(R^m), the filter π_t(ϕ) satisfies the stochastic PDE:

    dπ_t(ϕ) = π_t(L_{t(·)Y_t}ϕ) dt + Σ_{k=1}^n π_t(L^k_{t(·)Y_t}ϕ) dY^k_t,   t ≥ 0.    (29)
Proof. We include the essential argument, omitting some algebraic simplifications with Itô's formula. First, from (25) we eliminate dη_t to get an m-vector equation as:

    dX_t = (F_1 − F_3H)(t, X_t, Y_t) dt + F_2(t, X_t, Y_t) dε_t + F_3(t, X_t, Y_t) dY_t
         = b(t) dt + c(t) dε̃_t,    (30)

where b = F_1 − F_3H = (b_i) is an m-vector, c = (F_2 F_3) = (c_{ij}) is an m × (m + n)-matrix, and ε̃_t = (ε_t, Y_t)* is an (m + n)-vector BM (the ε_t and Y_t being independent processes). From this we can obtain an SDE for ϕ(X_t) employing the m-dimensional Itô formula (cf., e.g., Rao [21], p.401), which can be stated as follows. The process {∫_0^t b(s) ds, F_t, t ≥ 0} is of (locally) bounded variation and {∫_0^t c(s) dε̃_s, F_t, t ≥ 0} is a (locally) square integrable martingale, both with continuous paths, so that X_t becomes an L^{2,2}-bounded (or, what is equivalent here, a semimartingale) process. Hence the Itô formula gives for any ϕ ∈ C_b²(R^m), with (30):
    dϕ(X_t) = Σ_{i=1}^m (D_iϕ)(X_t) b_i(t) dt + (1/2) Σ_{i,j=1}^m (D_{ij}ϕ)(X_t) d[(cε̃)^i, (cε̃)^j]_t
              + Σ_{i=1}^m (D_iϕ)(X_t)(c(t) dε̃_t)^i,    (31)
where for an m-vector ( )^i denotes its ith component, and cε̃ as well as b are such vectors. Here we utilize the fact that {∫_0^t c(s) dε̃_s, F_t, t ≥ 0} is a continuous locally square integrable martingale, since for t_1 < t_2 we have

    E^{F_{t_1}}(∫_0^{t_2} c(s) dε̃_s) = ∫_0^{t_1} c(s) dε̃_s + E^{F_{t_1}}(∫_{t_1}^{t_2} c(u) dε̃_u)
                                     = ∫_0^{t_1} c(s) dε̃_s + ∫_{t_1}^{t_2} E^{F_{t_1}}(c(u) E^{F_u}(dε̃_u))
                                     = ∫_0^{t_1} c(u) dε̃_u + 0,   a.e. [P_2].    (32)
Also the quadratic covariation above can be calculated using its bilinear property as:

    [(cε̃)^i, (cε̃)^j]_t = Σ_{k,k′=1}^{m+n} c_{ik}(t) c_{jk′}(t) [ε̃^k, ε̃^{k′}]_t
                        = Σ_{k,k′=1}^{m+n} c_{ik}(t) c_{jk′}(t) δ_{kk′} t,    (33)

since ε̃_t is a BM (and the ε_t, η_t are independent, δ_{kk′} being the Kronecker symbol). Hence (31) becomes, with (33),

    dϕ(X_t) = [Σ_{i=1}^m (D_iϕ)(X_t) b_i(t) + (1/2) Σ_{i,j=1}^m (D_{ij}ϕ)(X_t) Σ_{k=1}^{m+n} c_{ik}(t) c_{jk}(t)] dt
              + Σ_{i=1}^m (D_iϕ)(X_t)(c(t) dε̃_t)^i.    (34)
It remains to substitute the values of b_i(t) and (c(t) dε̃_t)^i from (25) and (30), as well as the symbols introduced in (28). Thus one finds, after a straightforward (but tedious) algebra, the following:

    dϕ(X_t) = [(L_{tXY}ϕ)(X_t) − Σ_{i=1}^n H_i(t, X_t, Y_t)(B^i_{tXY}ϕ)(X_t)] dt
              + Σ_{i=1}^m (A^i_{tXY}ϕ)(X_t) dε^i_t + Σ_{i=1}^n (B^i_{tXY}ϕ)(X_t) dY^i_t.    (35)
A similar but simpler procedure, with the exponential function (26), gives the companion result (Z_0 = 1) as:

    dZ_t = Z_t Σ_{i=1}^n H_i(t, X_t, Y_t) dY^i_t.    (36)
Next consider the Itô formula with f : R² → R and f(u, v) = uv, where u = ϕ(X_t), v = Z_t of (35) and (36). An analogous algebraic computation yields:

    d(Z_tϕ(X_t)) = Z_t(L_{tXY}ϕ)(X_t) dt + Σ_{i=1}^m Z_t(A^i_{tXY}ϕ)(X_t) dε^i_t
                   + Σ_{i=1}^n Z_t(L^i_{tXY}ϕ)(X_t) dY^i_t.    (37)
Integrating and applying E^{F_t}_{P_2} (a bounded operator which commutes with vector integrals), on remembering that under P_2 the process Y_t is a BM, that for any 0 < s < t, F_t = σ(F_s, Y_u − Y_s, s ≤ u ≤ t), and that the Y_u − Y_s, 0 < s ≤ t, are independent of F_s, one gets

    E^{F_t}_{P_2}(Z_tϕ(X_t)) − E^{F_0}_{P_2}(ϕ(X_0))
      = ∫_0^t E^{F_t}_{P_2}(Z_s(L_{sXY}ϕ)(X_s)) ds
        + Σ_{i=1}^m ∫_0^t E^{F_t}_{P_2}(Z_s(A^i_{sXY}ϕ)(X_s) dε^i_s)
        + Σ_{i=1}^n ∫_0^t E^{F_t}_{P_2}(Z_s(L^i_{sXY}ϕ)(X_s)) dY^i_s,
            cf., Dunford and Schwartz ([1], IV.10.8(f)),
      = ∫_0^t E^{F_s}_{P_2}(Z_s(L_{sXY}ϕ)(X_s)) ds
        + Σ_{i=1}^m ∫_0^t E^{F_s}_{P_2}(Z_s(A^i_{sXY}ϕ)(X_s)) E^{F_s}_{P_2}(dε^i_s)
        + Σ_{i=1}^n ∫_0^t E^{F_s}_{P_2}(Z_s(L^i_{sXY}ϕ)(X_s)) dY^i_s.
The middle term on the right vanishes, since ε_s and Y_s are independent and E^{F_s}_{P_2}(dε_s) = 0. This reduces to (29), as asserted. □

This result raises a host of interesting questions for theory as well as applications. An immediate one is that, although π_t(ϕ) satisfies (29), there may be other solutions of that equation, and they can have little, if any, relevance for filtering. So the uniqueness of solutions is a concern. Several conditions from the SDE and PDE theories may be utilized. We have assumed that in (25) the coefficient functions are uniformly bounded with twice continuous bounded derivatives, i.e., in C_b²(R^m). If these are moreover supposed infinitely differentiable (i.e., in C_b^∞(R^m)), then it can be shown that uniqueness obtains, and the solution takes its values in a certain Sobolev space. We omit a description of these spaces (cf., e.g., Pardoux [1], Fleming and Pardoux [1], and Bensoussan [1] for details). Even the boundedness condition on the coefficients can be replaced by certain Lipschitz and linear growth restrictions (as in Proposition 3). Thus far no distributional assumptions are imposed on the state process X_t. If, for instance, in (25) F_3 = 0, ε_t being a BM, under the usual Lipschitz conditions on F_1, F_2, which are also independent of the y-component, then the SDE theory implies that the solution process X_t is Markovian. Specializing the coefficients F_i further, more refined results can be described. Numerous other possibilities exist, and the connection of filtering solutions with SPDEs shown by the above theorem implies that one can take this technology in different directions, and such studies exist. But we have to conclude the work here due to space and energy constraints. Some related results on the topics of this chapter are discussed in the complements.
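Although the text pursues the SPDE (29) analytically, π_t(ϕ) can also be approximated by a standard weighted-particle (Monte Carlo) scheme, sketched below for the simplest scalar case of (25) with F_3 ≡ 0 and illustrative coefficients: the weight update is the discrete analogue of the exponential (26), and the final ratio is the usual normalization of the unnormalized filter. This is not the method of the text, only a numerical counterpart.

```python
# Weighted-particle sketch of the filter for dX = F1(X) dt + de,
# dY = H(X) dt + dn: particles follow the signal dynamics, and each
# weight is multiplied by exp(H(x) dY - H(x)^2 dt / 2). No resampling
# step, to keep the sketch minimal. All coefficients are assumptions.
import numpy as np

rng = np.random.default_rng(5)
T, n_steps, n_part = 1.0, 500, 20_000
dt = T / n_steps

def F1(x): return -x            # signal drift (assumed)
def H(x):  return x             # observation coefficient (assumed)
phi = lambda x: x               # functional to be filtered

X = rng.normal(1.0, 0.1)                  # hidden state, X_0 ~ N(1, 0.01)
x = rng.normal(1.0, 0.1, n_part)          # particles from the same prior
w = np.ones(n_part)                       # unnormalized weights

for _ in range(n_steps):
    dY = H(X) * dt + rng.normal(0.0, np.sqrt(dt))           # observation
    w *= np.exp(H(x) * dY - 0.5 * H(x) ** 2 * dt)           # reweight
    x += F1(x) * dt + rng.normal(0.0, np.sqrt(dt), n_part)  # propagate
    X += F1(X) * dt + rng.normal(0.0, np.sqrt(dt))          # true signal

print("filter estimate:", np.sum(w * phi(x)) / np.sum(w))
print("hidden state:   ", X)
```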
8.6 Complements and exercises

1.(i) This problem contains a sufficient condition for a second order purely nondeterministic process {X_t, t ∈ R} to have multiplicity one, so that X_t = ∫_a^t g(t, u) dZ(u) holds, where ∫_a^t g²(t, u) dF(u) < ∞, dF(u) = E(|dZ(u)|²). Verify that this will be true if in the representation of X_t in Theorem 2.2 each kernel g_n satisfies the following three restrictions: (a) g and ∂g/∂t are bounded, continuous, and integrable on (a, t), a < t < ∞, (b) g(t, t) = 1, and (c) F is absolutely continuous with a piecewise continuous density f which is nonzero on a compact interval. [Hints: Proceed as in Example 2.4. If N > 1, then H_t = s̄p{X_s, s ≤ t} ⊂
L²(P) contains the space s̄p{∫_a^s g(s, u) dZ(u), s ≤ t} properly. So there exists for each s ≤ t some 0 ≠ h ∈ L²(f(u)du) such that g(s, ·) ⊥ h. Differentiating ∫_a^s h(u)g(s, u)f(u) du = 0 gives a Volterra equation

    h(s)f(s) = −∫_a^s h(u)f(u) (∂g/∂s)(s, u) du,

whose solution, from the classical theory of integral equations, is h = 0 a.e., giving the desired contradiction.]

(ii) Without some additional restrictions, such as those in (i) above, the multiplicity N > 1 can obtain. Verify this statement for the following example. Let {B_i(t), t ∈ [0, 1]}, i = 1, 2, be a pair of mutually independent BMs on the Lebesgue unit interval as a probability space, and let X_t = B_1(t)χ_A(t) + B_2(t)χ_{A^c}(t), where A is the set of rationals of the interval. Then X_t has multiplicity N = 2, since one also has

    X_t = ∫_0^t g_1(t, u) dB_1(u) + ∫_0^t g_2(t, u) dB_2(u),

where g_1(t, u) = 1 = 1 − g_2(t, u) if t ∈ A (u ≤ t), and = 0 otherwise.

2. We use the concepts and notation of a function norm ρ introduced just before Theorem 1.4. Thus let ρ : M → R̄_+ be such a norm with the Fatou property, and for f, f_1, f_2 ∈ L^ρ(P) ⊂ M such that f ∧ f_i = 0, i = 1, 2,

    ρ(f_1) ≥ ρ(f_2) ⇒ ρ(f + f_1) ≥ ρ(f + f_2).

This property holds if ρ(·) = ‖·‖_W for a Young function W ∈ Δ_2, i.e., a modular derived norm. The space is also called a Riesz space, and let ρ′ be the associate norm of ρ, which is given by ρ′ : f → sup{|∫_Ω fg dP| : ρ(g) ≤ 1}. Let f, f_n ∈ L^ρ(Σ, P) = L^ρ(Σ), n ≥ 1, and consider L^ρ(B_n) = M_n, where B_n = σ(f_k, 1 ≤ k ≤ n). Suppose that L^ρ(Σ) is strictly convex and has property (k) (of Kadec, introduced in the Digression of Sec. 1), and that ρ is absolutely continuous. [These conditions are satisfied for ρ(·) = ‖·‖_W with W ∈ Δ_2 ∩ ∇_2, and then L^W(Σ) is reflexive.] Now show that for each n ≥ 1 there is a unique h_n ∈ M_n such that ρ(f − h_n) = inf{ρ(f − g) : g ∈ M_n}, and ρ(h_n − h_∞) → 0 as n → ∞, where h_∞ ∈ M_∞ = L^ρ(B_∞), B_∞ = σ(∪_{n≥1} B_n), and h_∞ is the closest element to f. Show moreover that h_n → h_∞ pointwise a.e. [This is similar to Theorem 1.4 but cannot be deduced from it. The result is a somewhat simplified version of that in (Rao [8], Thm. 3.3), and the reader may specialize the latter for this case, since there are several points to be established.]

3. Let Y_t = ΛX_t for an integro-difference-differential filter Λ, where {Y_t, t ∈ G}, G = R^k or Z^k, is weakly harmonizable, and suppose that the hypothesis of Theorem 3.3 holds. If β_y(·, ·) is the matrix bimeasure of
the Y_t-process (or field), and the covariance function of the X_t-process is given by r(s, t) = E(X_s X_t^*), show that r is representable as:

    (Λ̃r)(s, t) = ∫_Ĝ ∫_Ĝ e^{i(s,λ)−i(t,λ′)} β_y(dλ, dλ′)(F(λ′)^*)^{-1},

where F(λ) is the k × k spectral characteristic matrix and Λ̃ is the same as Λ, but acting on r(s, ·), with F^{-1} being the generalized inverse of F. [Hints: Compute E((ΛX)_s X_t^*) = E(Y_s X_t^*) in two different ways, and observe that we can express the left side as Λ̃E(X_s X_t^*), by justifying the commutation of E and Λ. The result corresponds to the traditional "Yule-Walker relations" in the classical time series analysis.]

4. This problem presents an adjoint to Theorem 4.3 on the linear Kalman filter, with a slightly different argument. Thus consider the (vector) linear filter model:

    X_{n+1} = F_n X_n + f_n + ε_n,
    Y_n = H_n X_n + h_n + η_n,   n = 0, 1, . . . , N − 1,    (*)
    X_0 = ξ,

where ε_n, η_n are k- and m-vector Gaussian sequences with means zero and covariance matrices Q_n, R_n which are invertible, and ξ is also N(m_0, A_0) but mutually independent of both ε_n and η_n, n ≥ 0. Here f_n, h_n, F_n, H_n are suitable deterministic Borel measurable vectors or matrices. If F_n = σ(Y_0, Y_1, . . . , Y_{n−1}), n = 1, . . . , N, then the optimal estimator or filter of X_N is, as before, X̂_N = E(X_N | F_N). Verify that the following recursion relations hold if we set X̂_0 = m_0 and A_N = E(X_N − X̂_N)(X_N − X̂_N)^*, the covariance of the error (of estimation) of the process {X_n − X̂_n, n ≥ 1}. Also let ('*' denotes transpose again)

    X̄_N = X̂_N + A_N H_N^*(R_N + H_N A_N H_N^*)^{-1}(Y_N − H_N X̂_N − h_N),

where A_{N+1} = Q_N + F_N Ā_N F_N^*, N ≥ 0, A_0 = cov(ξ), and

    Ā_N = A_N − A_N H_N^*(H_N A_N H_N^* + R_N)^{-1} H_N A_N.    (+*)

Then X̂_{N+1} = F_N X̄_N + f_N, N ≥ 0. (A numerical sketch of these recursions is given after Exercise 5.) [Hints: First consider the 'innovation' process Z_N = Y_N − (H_N X̂_N + h_N) and observe that Z_N is an uncorrelated mean zero Gaussian (hence independent) sequence with covariance matrix (R_N + H_N A_N H_N^*), N ≥ 1, and moreover Z_N is independent of the σ-algebra F_N. This is a key property. Let e_N = X_N − X̂_N, and ē_N = X_N − X̄_N (= e_N − K_N Z_N), where K_N is to be determined. Since E(e_N) = E(ē_N) = 0, we want to find the K_N that minimizes L_N = cov(ē_N). Note that the matrix L_N is explicitly computed to be

    L_N = A_N + K_N(H_N A_N H_N^* + R_N)K_N^* − K_N E(Z_N e_N^*) − E(e_N Z_N^*)K_N^*.
Since E(Z_N e_N^*) = E(Y_N e_N^*) = H_N A_N, the above expression can be written (by adding and subtracting suitably to complete the square)

    L_N = A_N − A_N H_N^*(H_N A_N H_N^* + R_N)^{-1} H_N A_N
          + [K_N − A_N H_N^*(H_N A_N H_N^* + R_N)^{-1}](H_N A_N H_N^* + R_N)[K_N − A_N H_N^*(H_N A_N H_N^* + R_N)^{-1}]^*.

This is a minimum iff the bracketed (squared) term vanishes, i.e., K_N = A_N H_N^*(H_N A_N H_N^* + R_N)^{-1}, and then L_N = Ā_N in the above notation, as desired. The fact that the random vectors X_n, Y_n are Gaussian is crucial in this computation. This argument essentially follows Bensoussan [1], p.6.]

5. Here is an extension of Theorem 5.5, in two parts. Let ϕ ∈ C_b^{1,2}(R_+ × R^m), the space of real functions with bounded continuous derivatives ∂ϕ/∂t, D_iϕ = ∂ϕ/∂x_i, D_{ij}ϕ = ∂²ϕ/∂x_i∂x_j.

(a) Again consider a Zakai model equation in the form:

    dX_t = F_1(t, X_t) dt + F_2(t, X_t) dε_t,   X_0 = ξ,
    dY_t = H(t, X_t) dt + dη_t,   Y_0 = 0,    (+)
where the vector and matrix coefficients satisfy a Lipschitz condition,

    |F_1(t, x_1) − F_1(t, x_2)| + ‖F_2(t, x_1) − F_2(t, x_2)‖ ≤ K_1|x_1 − x_2|,

and |H(t, x)| ≤ K_2(1 + |x|) for some absolute constants K_1, K_2 > 0. Suppose that ε_t, η_t are mutually independent BMs which are also independent of the initial state vector ξ, and that E(|ξ|³) < ∞. Define the differential operator A_t = Σ_{i=1}^m b_i D_i + Σ_{i,j=1}^m a_{ij} D_{ij}, with a = (a_{ij}) = (1/2) F_2 Q F_2^*, and F_1 = (b_i), Q(t), R(t) being the covariance matrices of ε_t, η_t. As before, let π_t(ϕ) = E_{P_2}(ϕ(t, X_t)Z̃_t | F_t), the desired least squares optimal estimator (or filter) of the functional ϕ(t, X_t), where ϕ ∈ C_b^{1,2}(R_+ × R^m), dP_{2t} = Z̃_t dP_{1t}, P_{it} = P_i|F_t, and Z̃_t = Z_t/E_{P_2}(Z_t | F_t) as in Loève's lemma. In our case Z_t is given as an exponential (cf., (26) of Sec. 5). With this formulation, verify that the filter π_t(ϕ) satisfies the SPDE:

    dπ_t(ϕ) = π_t((∂ϕ/∂t + A_tϕ)(t, X_t)) dt + π_t(H^*(t, ·)ϕ) R_t^{-1} dY_t,   a.e.    (*)

[Remarks: The proof of the above SPDE is obtainable by the same procedure as in Theorem 5.5. A somewhat different argument is given by Bensoussan ([1], p.83). His method is more in line with the classical
PDE theory, by first obtaining a number of a priori estimates on the sizes of the integrals and then taking limits at the end. The probabilistic method in the text is somewhat simpler. The reader is referred to the above work for the alternative procedure.]

(b) If we assume that F_1(t, x) = F_0(t)x + f(t), F_2 = I, and H(t, x) = H_0(t)x + h(t), where F_0, H_0, f, h are bounded Borel (matrix or vector) functions, then verify that we have the following unique 'explicit' solution of the above equation (*):

    π_t(ϕ) = S_t ∫_{R^m} ϕ(t, X̂_t + P(t)^{1/2} x) (2π)^{−m/2} exp(−|x|²/2) dμ(x),

where μ is the Lebesgue measure in R^m, P(t) is a (positive definite matrix) solution of the Riccati equation:

    dP(t)/dt + [P H_0^* R^{-1} H_0 P](t) = (Q + F_0 P + P F_0^*)(t),

P(0) being the covariance matrix of X_0; X̂_t is the (Kalman linear filter) solution of the SDE

    dX̂_t = (F_0(t)X̂_t + f(t)) dt + (P H_0^* R^{-1})(t)[dY_t − (H_0 X̂ + h)(t) dt],   X̂(0) = X_0,

and the S_t-process is given by

    S_t = exp{∫_0^t (X̂^* H_0^* + h^*)(s) R^{-1}(s) dY_s − (1/2) ∫_0^t (X̂^* H_0^* + h^*) R^{-1}(H_0 X̂ + h)(s) ds}.

[This result combines the work of Theorems 4.7 and 5.5. A detailed discussion of these cases may be found, in a somewhat condensed form, in Bensoussan's book noted above, which can be consulted for this and related analysis.]
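As promised in Exercise 4, here is a numerical sketch of its recursions, in the scalar case with f_n = h_n = 0 for brevity; all model constants are illustrative.

```python
# Sketch of the Exercise 4 recursions (scalar, f_n = h_n = 0):
#   Xb_N = Xh_N + K_N (Y_N - H Xh_N),  K_N = A_N H (H A_N H + R)^{-1},
#   Ab_N = A_N - K_N H A_N,  A_{N+1} = Q + F Ab_N F,  Xh_{N+1} = F Xb_N.
import numpy as np

rng = np.random.default_rng(6)
F, H, Q, R = 0.9, 1.0, 0.3, 0.5      # illustrative constants
N = 200

X = rng.normal(0.0, 1.0)             # xi ~ N(m0, A0), m0 = 0, A0 = 1
Xh, A = 0.0, 1.0                     # filter mean and error covariance
for _ in range(N):
    Y = H * X + rng.normal(0.0, np.sqrt(R))     # observation
    K = A * H / (H * A * H + R)                 # Kalman gain
    Xb = Xh + K * (Y - H * Xh)                  # updated estimate
    Ab = A - K * H * A                          # updated covariance
    Xh, A = F * Xb, Q + F * Ab * F              # predict next step
    X = F * X + rng.normal(0.0, np.sqrt(Q))     # true state evolves

print("final estimate:", Xh, " steady covariance:", A)
```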
Bibliographical notes

There exist linear and nonlinear parts of prediction and filtering problems. The well-developed least squares linear prediction, for stationary processes, has been the subject of extensive treatments by several authors, especially by Rozanov [3] and Yaglom [4]. In a not necessarily stationary process analysis, such as that considered here, we are able to present a format for prediction with optimality criteria based
on convex functions, or metrics derived from such, analogous to the estimation problems treated in Chapter III. In this case the linearity of a prediction operation is no longer retained. One can obtain sequences of best predictors based on expanding sets of observations. But finding their limit behavior will be of interest. We presented a general format in Section 1, in which some standard results of abstract analysis play an interesting rôle. We have considered various aspects of them, culminating in Theorem 1.4, which is essentially taken from Bru and Heinich [1]. Here a discussion is included to clarify the fact that there are different kinds of prediction operators (some tailored to a given element to be predicted, and some not depending on such an element), and the present result is one based on a reasonably general version of the second type. We have included both the strong and pointwise convergence conditions for these sequences. Thus the problem is laid out, and work on explicit calculations for classes of processes and specializations can clearly be undertaken. A different type of interesting prediction problem was discussed by Urbanik [1]. In the rest of the chapter, attention is restricted to the least squares criterion, and Hilbert space methods take the center stage. However, as a rule, processes are not restricted to be stationary in the treatment of problems, since we want primarily to illuminate the general structure of the subject. Any second order process can be decomposed into a deterministic and a purely nondeterministic part. For prediction purposes only the purely nondeterministic part needs an in-depth analysis, since the deterministic part is completely known from the remote past. The work here is based on the fundamental representation theory of Cramér and Hida. We have followed for the most part Cramér's [5] lectures in the presentation. There still remains much to be investigated, especially if the time parameter is multidimensional, i.e., for random fields. Some possible approaches are indicated here. In Section 3 we considered a general linear filtering problem, as formulated by Bochner [2]. Here the filter can be an integro-difference-differential linear operator. The associated equation is ΛX_t = Y_t, where Y_t is the output and X_t the input, and where the latter is to be obtained from the output for the filter Λ. If the Y_t-process is weakly harmonizable, then a solution of the problem is presented. It depends technically on an integral weaker than Lebesgue's, developed by Morse and Transue [1]. The necessary detail for the problem, based on the work of Chang and Rao [1], together with specializations if Λ is a finite difference operator, is included to illustrate how particular cases admit sharper results (although they cannot give a complete picture). An abstract concept of filter was formulated by Hannan [1] that may be used for (stationary) random fields. So far, not much use of it seems
to have been made. In the general studies, the primary questions concern existence, uniqueness, and convergence of sequences. In this work, so far no algorithm is found to calculate the solution or to update the results with new observations. Specializing the model further, however, such recursive relations can be obtained, and this is the thrust of the last two sections. We thus express the model as: Y_t = ΛX_t = F(t, X_t) + G(t, N_t), i.e., the observation as the sum of a "signal" and a "noise". This special representation leads to somewhat sharper formulas. Here one can study both the linear and nonlinear cases, which are basically due to Kalman [1] (also Kalman and Bucy [1]) and Zakai [2] respectively. Now the signal may be a stochastic difference or a differential equation with noise as a Brownian Motion. Such a representation enjoys the full impact of the stochastic calculus in the linear (or Stratonovich) and nonlinear (or Itô) types, and the diffusion processes have much to contribute. The least squares criterion of optimality gives the conditional expectation as the best predictor or filter, and so the problem in its essentials becomes finding methods of evaluation of the just noted expectation operator. For the most part, the analysis in the subject uses the 'horizontal window' procedure without comment. If X_t is the (unobserved) signal, ε_t is the BM noise, and Y_t is the observation, then π_t = E(X_t | Y_s, s ≤ t) is the best predictor, and its calculation, when X_t is assumed to be a diffusion (driven by ε_t), becomes a key ingredient of the problem. In the discrete case, extending the classical (finite set) Gaussian multivariate regression analysis, Kalman obtained a representation for the estimator X̂_t as well as the covariance of the error (X_t − X̂_t), both with recursion formulas. The work in the continuous parameter case uses equivalence of probability measures, and the calculation of π_t is not simple. A classical representation of conditional expectations for two equivalent finite measures, observed by Loève, serves as a useful tool in this analysis, with Girsanov's transformation playing an important part. The work leads to finding certain SDEs satisfied by the π_t-process (in lieu of a recursion). These results constitute the content of Sections 4 and 5, and they lead to the establishment of a new area, called 'stochastic control theory'. Separate (and voluminous) treatments are available on these subjects. The detailed analysis containing various generalizations appears in Liptser and Shiryayev [1], with further developments in Pardoux [1]. We utilized both these works in our presentation. Also useful is Bensoussan's [1] memoir, which brings in more tools, especially from PDE. A pair of important results from this point of view are indicated as Exercises 4 and 5 to supplement the textual analysis. Recently Shald [1] has shown how the continuous parameter Kalman filter version may be obtained from the discrete version (cf., Theorem 4.3) by a suitable
580
VIII. Prediction and Filtering of Process
limiting procedure. Another related point is approximating the solutions, in the mean or pointwise sense, of the filter PDEs which we have not discussed. Some interesting developments in this direction may be found in Elliot and Glowinski [1] and further results in Lototsky and Rozovskii [1]. They show, with references to related articles, how other approaches and methods are available in such approximations. The connection between these problems and martingale theory is intimate, and a recent extensive review and analysis of these is provided by Mikulevicius and Rozovskii [1] and it will be of interest here. Numerous other applications of continuous parameter processes (mostly solutions of SDEs), including some finance models, are surveyed recently in Mel’nikov [1], with details. Several other references are given at various places in the text where particular presentations have influenced our view. However, we could not include more concrete applications, since one needs considerably more background material of the subject area for this. We have indicated some books on the subject. In all the above work, the distributions (hence their parameters such as moments) are assumed completely known. Since this is seldom the case, they have to be estimated. We indicate how this may be done in the next and final chapter.
Chapter IX

Nonparametric Estimation for Processes
This final chapter is concerned with questions of (asymptotic) unbiasedness and consistent nonparametric estimation of some functions, such as bispectral densities, of a class of second order processes. After some necessary preliminaries, we discuss spectral properties of separable, especially harmonizable, processes of second order in Section 1. Then we consider an asymptotically unbiased estimator, and a related function, of the spectral distribution of the process in the next section. These are usually not consistent estimators when the process is not stationary. So we need to use another procedure, the so-called resampling method, wherein the covariance between samples falls off at a reasonable rate (made precise later). This is described in Section 3, and using such a procedure it is possible to obtain a consistent estimator of the bispectral density of strongly harmonizable processes, and this is given for such a class. Then Section 4 contains a slightly more general second order family for which some related results are discussed. In Section 5 a limit distribution of the (nonparametric) estimator defined above for a strongly harmonizable class is presented. Thus the conditions imposed are progressively more stringent, but then we get more refined results. Several new avenues and possible improvements of the results are pointed out, with related exercises in the last section, usually with sketches of proofs, as complements to the preceding work.
9.1 Spectra for classes of second order processes

Let $X = \{X_t, t \in T \subset \mathbb R\} \subset L^2(P)$ be a process with covariance function $r : (s,t) \to E[(X_s - m_s)(X_t - m_t)^*]$, where $m_t = E(X_t)$ and '*' denotes the complex conjugate. We have seen in Section VIII.2 that if the process X is separable, or in particular left continuous with right limits
(all in mean), then it has a generalized Karhunen representation:

$$X_t = \int_{-\infty}^{t} g(t,\lambda)\, dZ(\lambda), \qquad t \in T, \tag{1}$$

relative to a (perhaps vector) kernel $g : T\times\mathbb R \to \mathbb C$ with $g(t,\lambda) = 0$ for $\lambda > t$, and an orthogonally valued (perhaps vector) stochastic measure $Z(\cdot)$ on the Borel σ-algebra $\mathcal B$ of $\mathbb R$. In the separable case, there is an analogous representation:

$$X_t = \int_{S} g(t,\lambda)\, d\tilde Z(\lambda), \qquad t \in T, \tag{1$'$}$$

$(S, \mathcal S, \mu)$ being a measure space and $E(|\tilde Z(A)|^2) = \mu(A)$ (cf. Theorem VII.3.1). From (1) it follows that, taking $m(t) = 0$ for simplicity so that $E(Z(A)) = 0$, and $X_t$ purely nondeterministic, one obtains

$$r(s,t) = \sum_{i=1}^{N} \int_{-\infty}^{s\wedge t} g_i(s,\lambda)\, \bar g_i(t,\lambda)\, dF_i(\lambda), \tag{2}$$

where $g = (g_i, 1 \le i \le N)$ in (1) and $F_i(A) = E(|Z_i(A)|^2)$. Here N is the multiplicity of the process X, and in the separable case

$$r(s,t) = \int_{S} g(s,\lambda)\, \bar g(t,\lambda)\, d\mu(\lambda). \tag{2$'$}$$
Although the representation (1) (or (1$'$)) is quite general and interesting as a theoretical structure, one has little specific information on the $g_i$ to employ it in applications. This is true even if N = 1. Consequently, one would specialize further. If $g(t,\lambda) = e^{it\lambda}$, i.e., an exponential, and N = 1, then (1) reduces to the stationary case. To have g an exponential but X not necessarily restricted to stationarity, one has to abandon the orthogonality of the spectral measure $Z(\cdot)$. This then leads to the harmonizable class, and we treat this family here. Thus take $T = \mathbb R$ or $\mathbb Z$, and hence have (1$'$) and (2$'$) as:

$$X_t = \int_{\hat T} e^{it\lambda}\, dZ(\lambda), \qquad t \in T, \tag{3}$$

and then

$$r(s,t) = \int_{\hat T}^{*}\!\!\int_{\hat T} e^{is\lambda - it\lambda'}\, dF(\lambda,\lambda'), \qquad s,t \in T, \tag{4}$$

where $\hat T$ ($= \mathbb R$ or $(-\pi,\pi]$) is the dual of T and F is the spectral bimeasure "distribution" of the stochastic spectral measure Z, i.e., $F(A,B) = E(Z(A)\bar Z(B))$. Since $X_t$ given by (3) is continuous in mean, this forms a subclass of the Karhunen family. In what follows we consider only (3) (and (4)), which includes the stationary processes properly, since that case obtains when F concentrates on the diagonal of $\hat T\times\hat T$, i.e., when Z has (additionally) orthogonal values. The problem here is that the spectral distribution (i.e., $\tilde F(\lambda,\lambda') = F(A_\lambda, A_{\lambda'})$, where $A_\lambda = (-\infty,\lambda)$; we use the same symbol F for the distribution and the (bi)measure it generates, for simplicity) is typically unknown, and should be estimated from an observed segment of the X-process. We recall that the positive definite $F(\cdot,\cdot)$ is possibly complex valued, and it is always of Fréchet but not necessarily of the standard (Vitali) variation (cf. Sec. VIII.3). Thus the first one is:

$$F(\mathbb R,\mathbb R) = \sup\Big\{\Big|\sum_{i,j=1}^{n} a_i \bar a_j F(\lambda_i,\lambda_j)\Big| : |a_i| \le 1,\ \lambda_i \in \mathbb R\Big\} < \infty, \tag{5}$$

but the standard (Vitali) variation is:

$$|F|(\mathbb R,\mathbb R) = \sup\Big\{\sum_{i,j=1}^{n} |F(\lambda_i,\lambda_j)| : \lambda_i \in \mathbb R\Big\} \le \infty. \tag{6}$$
Since $F(\mathbb R,\mathbb R) \le |F|(\mathbb R,\mathbb R)$ always, we assume $|F|(\mathbb R,\mathbb R) < \infty$ in what follows, and briefly comment on the Fréchet case later. Also, if $|F|(\mathbb R,\mathbb R) < \infty$, then in (4) the '*' on the integral can be removed and the integrals taken in the Lebesgue-Stieltjes sense (for the Fréchet case one has to use the strict Morse-Transue definition, which is weaker, as already discussed in Section VIII.3). In any event, $F(\cdot,\cdot)$ is a function of two variables, and the methods of stationary processes are not sufficient. To distinguish these cases, we term them weakly (Fréchet case) and strongly (Vitali case) harmonizable. Let us now state the estimation problems in the context of a spectral function F, i.e., nonparametric estimation. Suppose a strongly harmonizable process $X = \{X_t, a \le t \le b\}$ is observed on the segment $[a,b]$. Let $G_{N,x,y}(X) = G_N(X_s, a \le s \le b)$ be a (known) Borel function of the observed process, depending on the given "frequencies" $x, y \in \hat T$, with $N (= N_{a,b})$ denoting the length of the observed segment. It is thus an estimator. If $E(G_{N,x,y}(X))$ exists and $\lim_{a\to-\infty, b\to\infty} E(G_{N,x,y}(X)) = F(x,y)$ for all continuity points $x, y \in \hat T$ of F (i.e., $F(x,y) = F(x\pm0, y\pm0)$) as $N \to \infty$, then $G_{N,x,y}$ is termed an asymptotically unbiased estimator of $F(x,y)$. This is the minimal requirement for the estimation methodology. Proceeding further, one may require that

$$E(|G_{N,x,y}(X) - F(x,y)|) \to 0, \qquad x,y \in \hat T, \tag{7}$$
as $a \to -\infty$, $b \to \infty$, which we write hereafter as $N \to \infty$. Then, for a large enough observed segment, one desires (not only asymptotic unbiasedness but) that $G_{N,x,y}(X)$ be a consistent estimator of F (as defined in Chapter III), which is a more useful property. A further important feature of the estimation (inspired also by our work in Chapter III) is that of finding the limiting distribution of $h(N,X)(G_{N,x,y} - F(x,y))$, where $h(N,X)$ is a suitable normalizing factor. Typically one hopes to find conditions on the estimator $G_{N,x,y}(X)$, and on the process itself, so that the limit distribution is Gaussian. As the work on nonstable stochastic difference equations of Section III.4 shows, if $h(N,X)$ is a pure function of N alone, then this may be impossible, i.e., the limit distribution may exist but fail to be Gaussian (it will always be infinitely divisible under the usual conditions). In case we accomplish the above tasks, the next important problem is the error of approximation as a function of N, i.e., the speed of convergence in the limit processes, so as to ascertain the desirable size of the observational segment and to control the error. Here we consider, for the strongly harmonizable case, some results which are progressively specialized and sharpened to find solutions of the first three questions. In what follows we also describe some leads from the stationary case that one may consider in refining this work in future investigations. The first (simple) property will be discussed in the next section.

9.2 Asymptotically unbiased estimation of bispectra

Let $X = \{X_t, t \in T\}$ be a strongly harmonizable process with $T = \mathbb R$ or $\mathbb Z$. Thus ($\hat T$ being the dual group of T):

$$X_t = \int_{\mathbb R} e^{it\lambda}\, Z(d\lambda),\ t \in \mathbb R, \qquad \Big[\text{or } X_n = \int_{-\pi}^{\pi} e^{in\lambda}\, Z(d\lambda),\ n \in \mathbb Z\Big], \tag{1}$$

and the bimeasure F is given by $F(A,B) = E(Z(A)Z(B)^*)$ for Borel sets A, B of $\hat T$. We now want to find an asymptotically unbiased estimator of F. Let $A = (-\alpha,\alpha)$, $B = (-\beta,\beta)$, $\alpha > 0$, $\beta > 0$, be intervals of $\hat T$, and let F also be considered as a 'distribution', i.e., a point function determined by the bispectral measure F, as noted before. We then have the following:

1. Proposition. Let $\{X_t, t \in \mathbb R\}$ be a (strongly) harmonizable process with spectral bimeasure F, observed on the interval $(-\alpha,\alpha)$, $\alpha > 0$, and let $A = (\lambda_1, \lambda_2)$, $B = (\lambda_1', \lambda_2')$ be continuity intervals of the 'distribution' F. Then the estimator

$$\hat F_\alpha(A,B) = \int_A\int_B Y_\alpha(u)\, Y_\alpha(v)^*\, du\, dv, \tag{2}$$
of F, is asymptotically unbiased, i.e., $E(\hat F_\alpha(A,B)) \to F(A,B)$ as $\alpha \to \infty$, where

$$Y_\alpha(u) = \frac{1}{2\pi}\int_{-\alpha}^{\alpha} e^{-isu}\, X_s\, ds, \tag{3}$$

defined as a stochastic (or a Bochner) integral.

Proof. From (1) we have, for the covariance r (taking $E(X_t) = 0$ for simplicity, whence $E(Z(A)) = 0$),

$$r(s,t) = \int_{\mathbb R^2} e^{is\lambda - it\lambda'}\, dF(\lambda,\lambda'), \qquad s,t \in \mathbb R, \tag{4}$$

and by the standard inversion formula for r (letting $F = F_1 - F_2 + i(F_3 - F_4)$, where the $F_j \ge 0$, the so-called Riesz components, are of bounded variation), the following brief argument verifies the statement. Let A, B be continuity intervals of F as in the statement:

$$\begin{aligned}
F(A,B) &= \lim_{\alpha_1,\alpha_2\to\infty} \frac{1}{(2\pi)^2}\int_{-\alpha_1}^{\alpha_1}\!\!\int_{-\alpha_2}^{\alpha_2} \Big(\frac{e^{-iu\lambda_2} - e^{-iu\lambda_1}}{-iu}\Big)\Big(\frac{e^{-iv\lambda_2'} - e^{-iv\lambda_1'}}{-iv}\Big)^{*} r(u,v)\, du\, dv\\
&= \lim_{\alpha_1,\alpha_2\to\infty} \int_A\int_B E\Big[\Big(\frac{1}{2\pi}\int_{-\alpha_1}^{\alpha_1} e^{-ius}\, X_s\, ds\Big)\Big(\frac{1}{2\pi}\int_{-\alpha_2}^{\alpha_2} e^{-ivt}\, X_t\, dt\Big)^{*}\Big]\, du\, dv\\
&= \lim_{\alpha_1,\alpha_2\to\infty} E\Big[\int_A\int_B Y_{\alpha_1}(u)\, Y_{\alpha_2}(v)^{*}\, du\, dv\Big], \quad \text{by Fubini's theorem and (3)},\\
&= \lim_{\alpha\to\infty} E[\hat F_\alpha(A,B)],
\end{aligned}$$
which shows that $\hat F_\alpha(A,B)$ is asymptotically unbiased for $F(A,B)$, and this implies the general assertion. $\Box$

The discrete version of the above result is entirely similar; in (2) and (3), as well as in the ensuing computations, one uses sums in lieu of integrals. Thus the estimator becomes

$$\hat F_N(A,B) = \int_A\int_B Y_N(u)\, Y_N(v)^{*}\, du\, dv, \qquad \text{where } Y_N(u) = \frac{1}{2\pi}\sum_{|k|\le N} e^{-iku}\, X_k. \tag{5}$$
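As a computational illustration (not part of the original text), the following sketch evaluates the discrete estimator (5) on simulated data; the data $X_k$ and the frequency grid below are hypothetical, and the double integral over $A\times B$ is approximated by a Riemann sum.

```python
# Sketch of the discrete estimator (5); the data X_k and all grid/parameter
# choices are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
N = 200
k = np.arange(-N, N + 1)
X = rng.normal(size=k.size)          # stand-in for an observed segment X_k

def Y(u):
    # Y_N(u) = (1/2π) Σ_{|k|<=N} e^{-iku} X_k
    return (X * np.exp(-1j * k * u)).sum() / (2 * np.pi)

# continuity intervals A = (a1, a2), B = (b1, b2) in (-π, π]
a1, a2, b1, b2 = -1.0, 1.0, -0.5, 0.5
u = np.linspace(a1, a2, 101)
v = np.linspace(b1, b2, 101)
Yu = np.array([Y(s) for s in u])
Yv = np.array([Y(s) for s in v])

# F_hat(A,B) = ∫_A ∫_B Y_N(u) Y_N(v)* du dv factorizes into two 1-d integrals
du, dv = u[1] - u[0], v[1] - v[0]
F_hat = (Yu.sum() * du) * (np.conj(Yv).sum() * dv)
print("F_hat(A,B) ≈", F_hat)
```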
The same argument extends to stochastic flows (already introduced in Sec. IV.3), and we state the corresponding result as an illustrative
application. Let $\{X_t, t \in T = [a,b]\}$ be a strongly harmonizable process with a derived (in $L^2(P)$) process $X'_t$, $t \in T$. Let $\{Y_t, t \in T\}$ be a flow driven by the derived process, i.e., $(L_t Y)_t = X'_t$, where for any given real continuous $a_j(\cdot)$ and k-times differentiable g:

$$(L_t g)_t = \sum_{j=0}^{k} a_j(t)\, \frac{d^{k-j} g}{dt^{k-j}}, \qquad a_0(t) \ne 0,\ t \in T, \tag{6}$$

so that for a compactly supported continuously differentiable (or a test) function $\varphi$ one has

$$\int_T \varphi(t)\,(L_t Y)_t\, dt = \int_T \varphi(t)\, dX_t. \tag{7}$$

Then it follows (cf. Proposition VIII.3.2) that $X'_t$ is also harmonizable when $\int_{\hat T}\int_{\hat T} \lambda_1\lambda_2\, dF(\lambda_1,\lambda_2)$ exists; its spectral bimeasure G will be given by $dG(\lambda_1,\lambda_2) = \lambda_1\lambda_2\, dF(\lambda_1,\lambda_2)$. Consequently, the preceding proposition implies the following statement for the flow $\{Y_t, t \in T\}$ defined by the operator $L_t$ of (6), driven by the $X_t$-process:

2. Corollary. Let $\{Y_t, t \in T\}$ be a stochastic flow driven by a mean square differentiable strongly harmonizable $\{X_t, t \in T\}$, such that $(L_t Y)_t = X'_t$, where $L_t$ is the differential operator given by (6). If F is its spectral bimeasure, then the estimator $\tilde F_\alpha(A,B)$ defined by

$$\tilde F_\alpha(A,B) = \int_A\int_B \tilde Y_\alpha(u)\, \tilde Y_\alpha(v)^{*}\, du\, dv \tag{8}$$

is asymptotically unbiased for $G(A,B) = \int_A\int_B \lambda_1\lambda_2\, dF(\lambda_1,\lambda_2)$, where

$$\tilde Y_\alpha(u) = \int_{-\alpha}^{\alpha} e^{-isu}\,(L_s Y)_s\, ds, \qquad \alpha > 0, \tag{9}$$

the sets $A, B \subset \hat T$ being continuity intervals of F.

There is naturally a discrete analog also; this will be left to the reader. In general, however, the asymptotically unbiased estimators given above are not consistent for F (hence for G): counter-examples can be constructed even with Gaussian processes. Further restrictions are essential: either the class of harmonizable processes must be severely restricted, or one must look for other techniques, at the cost of more observations. Here we present a resampling procedure needing a sequence of sequences of observations, superficially as in the study of infinitely divisible families. This is the main topic of the next section.
9.3 Resampling procedure and consistent estimation

It will be convenient to consider the discrete parameter problem here, since the continuous parameter case can be obtained using a well-known device which we indicate later. Now suppose that the bispectral function F of the harmonizable sequence $\{X_n, n \in \mathbb Z = T\}$ is not only of bounded (Vitali) variation, but is absolutely continuous (relative to the planar Lebesgue measure) with density f, so that the covariance function r, which tends to zero as $|s| + |t| \to \infty$ (by the Riemann-Lebesgue lemma), is given by:

$$r(s,t) = \int_{\hat T}\int_{\hat T} e^{is\lambda - it\lambda'}\, f(\lambda,\lambda')\, d\lambda\, d\lambda'. \tag{1}$$
The problem is to find a consistent estimator $\hat f$ of f which is asymptotically unbiased as well. In the stationary case, such an f is a function of just one variable and is also nonnegative. Both of these properties are crucial, and one can find the desired estimator with one realization on a large enough segment of the process (see, for instance, Grenander and Rosenblatt [1]). In the present case both properties are absent, and so the procedure is replicated, consisting of repeated sampling, allowing (perhaps) some dependence between samples; this dependence decreases to zero as replication at increasingly distant times continues, and the scheme is termed a resampling procedure. More precisely, it can be described as follows. Consider observing n vectors, each of size $(2m+1)$, so that

$$\begin{aligned}
\mathbf X_m^1 &= (X_{-m}^1,\ X_{-m+1}^1,\ \ldots,\ X_m^1)\\
&\ \ \vdots\\
\mathbf X_m^n &= (X_{-m}^n,\ X_{-m+1}^n,\ \ldots,\ X_m^n),
\end{aligned} \tag{2}$$
where $m(= m(n)) \to \infty$ as $n \to \infty$. Each of these vectors is assumed to come from the same harmonizable process with bispectral density f, and the dependence between $\mathbf X_m^j$ and $\mathbf X_m^{j+k}$ decreases to zero. This means the following conditions hold:

(a$_1$) $E(X_s^j X_t^{j*}) = \int_{\hat T}\int_{\hat T} e^{isu - itv}\, f(u,v)\, du\, dv$, for all $j = 1,\ldots,n$, and

(a$_2$) $\{\mathbf X_m^j, j = 1,\ldots,n\}$ is α-mixing with coefficient $\alpha(k) \to 0$ as $k \to \infty$; i.e., if $\mathcal F_{m;n} = \sigma(\mathbf X_m^j, j \le n)$ and $\mathcal F^{m;n+k} = \sigma(\mathbf X_m^j, j \ge n+k)$, then

$$\alpha(k) = \sup\{|P(A\cap B) - P(A)P(B)| : A \in \mathcal F_{m;n},\ B \in \mathcal F^{m;n+k},\ n \ge 1\}$$
satisfies $\alpha(k) \to 0$ as $k \to \infty$. It may be verified that the classical "m-dependence" between random variables (vectors) corresponds to $\alpha(k) = 0$ for $k > m$, and if the $\mathbf X_m^j$ are independent realizations (of the same process) then $\alpha(k) = 0$, $k \ge 1$. Thus our double sequence (2) of vectors, under conditions (a$_1$), (a$_2$), includes both these classical cases. A detailed illustrative analysis of mixing concepts may be found in the survey of Roussas and Ioannides [1]. The α-mixing is also called strong mixing; it was originally introduced by Rosenblatt [1], who later applied it in work on stationary sequences. It plays a key role in obtaining limit distributions of certain estimators of functions, such as a spectral density. Here we intend to use it in the context of (strongly) harmonizable processes; in fact, the point is to show that it may be applied in the current asymptotic analysis. The intuitive idea of α-mixing is that $\mathcal F_{m;n}$ represents the knowledge of the random vectors $\mathbf X_m^j$, $j \le n$, i.e., of the past through the present, and $\mathcal F^{m;n+k}$ represents the future of the $\mathbf X_m^j$, $j \ge n+k$, separated by k units of time, so that $\alpha(k)$ is a measure of dependence between the past and the future. Consequently, if X is $\mathcal F_{m;n}$-adapted (hence a function of the $\mathbf X_m^j$, $j \le n$) and Y is $\mathcal F^{m;n+k}$-adapted, then their dependence is measured by a constant multiple of $\alpha(k)$. If, for instance, X, Y here are either bounded, or more generally have two moments finite, then their covariance should be dominated by a suitable function of $\alpha(k)$. The following inequality of this type, originally due to Ibragimov and Linnik [1], will be of interest here. The present formulation is from Roussas and Ioannides ([1], p. 106 and p. 112), where its proof and related results may be found.

1. Proposition. Let X be $\mathcal F_{m;n}$-adapted and Y be $\mathcal F^{m;n+k}$-adapted, where the σ-algebras are determined by the sequence $\{\mathbf X_m^j, j, m\}$ as in (2).

(i) If $|X| \le M_1$, $|Y| \le M_2$ a.e., then

$$|\mathrm{Cov}(X,Y)| \le 16\,\alpha(k)\, M_1 M_2. \tag{3}$$

(ii) More generally, if $p_i > 1$ are such that $p_1^{-1} + p_2^{-1} < 1$, and $X \in L^{p_1}(P)$, $Y \in L^{p_2}(P)$, then

$$|\mathrm{Cov}(X,Y)| \le 40\,[\alpha(k)]^{1-\frac{p_1+p_2}{p_1 p_2}}\, \|X\|_{p_1}\|Y\|_{p_2}. \tag{4}$$

In particular, taking $p_1 = p_2 = 2+\delta$ for a $\delta > 0$, one has

$$|\mathrm{Cov}(X,Y)| \le 40\,[\alpha(k)]^{\frac{\delta}{2+\delta}}\, \|X\|_{2+\delta}\|Y\|_{2+\delta}. \tag{5}$$

In the case of real valued components of the $\mathbf X_m^j$, the numbers 16 and 40 of (3) and (4) may be replaced by 4 and 10.
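The m-dependence remark above is easy to probe numerically: for a 2-dependent moving average, variables built from blocks separated by more than 2 lags are independent, so the covariance bounded in (3) should vanish. A minimal simulated check (all data hypothetical):

```python
# Quick check of the m-dependence remark: for a 2-dependent moving average,
# blocks separated by k > 2 lags are independent, so Cov(X, Y) ≈ 0 in (3).
import numpy as np

rng = np.random.default_rng(4)
e = rng.normal(size=(100000, 12))
z = e[:, :-2] + e[:, 1:-1] + e[:, 2:]   # 2-dependent sequence, 10 columns
X = z[:, 0] * z[:, 1]                   # adapted to the "past" block
Y = z[:, 6] * z[:, 7]                   # "future" block, lag k = 5 > 2
print(np.cov(X, Y)[0, 1])               # ≈ 0 up to Monte Carlo error
```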
In our application we take the process to be centered, for simplicity, and start from the representation (1). One can invert the transform provided $r(\cdot,\cdot)$ is summable. Under this assumption it follows that, for $T = \mathbb Z$, the density f is continuous and given by:

$$f(x,y) = \frac{1}{(2\pi)^2}\sum_{j,j'\in\mathbb Z} e^{-i(jx - j'y)}\, r(j,j'), \qquad x,y \in \hat T. \tag{6}$$

Motivated by the analysis of the preceding section, and the fact that (by (1)) the function $\hat r_n(j,j') = \frac1n\sum_{s=1}^{n} X_j^s X_{j'}^{*s}$ is an unbiased estimator of r (i.e., $E(\hat r_n(j,j')) = r(j,j')$ because of condition (a$_1$)), one is led to the estimator $\hat f_{m,n}(x,y)$, for any given $x, y \in \hat T$ and $s, t \in \mathbb Z$, as:

$$\hat f_{m,n}(x,y) = \frac{1}{(2\pi)^2}\sum_{s,t=-m}^{m} e^{-i(sx-ty)}\, \hat r_n(s,t). \tag{7}$$
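For concreteness, here is a minimal sketch (not from the text) implementing (7) from n resampled rows of length $2m+1$; the independent rows used below are a hypothetical choice that trivially satisfies the mixing condition (a$_2$).

```python
# Sketch of the resampling estimator (7); the n x (2m+1) data matrix is
# hypothetical (independent rows), used only to exercise the formulas.
import numpy as np

rng = np.random.default_rng(2)
n, m = 500, 8
t = np.arange(-m, m + 1)
X = rng.normal(size=(n, t.size))      # row j is (X_{-m}^j, ..., X_m^j)

# unbiased covariance estimator r_hat_n(s, t) = (1/n) Σ_j X_s^j X_t^{*j}
r_hat = X.T @ np.conj(X) / n          # (2m+1) x (2m+1) matrix

def f_hat(x, y):
    # f_hat_{m,n}(x,y) = (1/(2π)^2) Σ_{s,t=-m}^m e^{-i(sx - ty)} r_hat(s,t)
    es = np.exp(-1j * t * x)          # e^{-isx}, s = -m..m
    et = np.exp(+1j * t * y)          # e^{+ity}, t = -m..m
    return (es[:, None] * r_hat * et[None, :]).sum() / (2 * np.pi) ** 2

print("f_hat at (0.5, 0.5):", f_hat(0.5, 0.5))
```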
We now present conditions on the process under which $\hat f_{m,n}(x,y)$ is a consistent (hence also asymptotically unbiased) estimator of $f(x,y)$ in $L^2(P)$, as n (and so m) tends to infinity. This result is due to Soedjak [1] (see also his paper [2]).

2. Theorem. Let $\{X_t^j, t \in \mathbb Z, j \ge 1\}$ be a strongly harmonizable double sequence subject to the resampling procedure of (2), where for different j the samples are α-mixing and have the same spectral densities, i.e., conditions (a$_1$) and (a$_2$) hold. Suppose moreover that the moments, the sizes, and the sampled vectors satisfy the following growth conditions:

(b$_1$) For some $0 < \varepsilon < 1$, $m^4 = o(n^{1-\varepsilon})$, i.e., $\frac{m^4}{n^{1-\varepsilon}} \to 0$ as $n \to \infty$;

(b$_2$) For some $\delta > 0$, if $M_s^j (= M_s^j(\delta)) = E(|X_s^j|^{2(2+\delta)})$, then

(i) $\displaystyle \frac{1}{n^{1+\varepsilon}}\sum_{j=1}^{n-k} \big[M_s^j M_t^j M_{s'}^{j+k} M_{t'}^{j+k}\big]^{\frac{1}{2(2+\delta)}} = O(1) \quad (n \to \infty;\ s, s', t, t' \in \mathbb Z)$,

(ii) $\displaystyle \sum_{k=1}^{\infty} [\alpha(k)]^{\frac{\delta}{2+\delta}} < \infty$.

Under these conditions, the estimator $\hat f_{m,n}(x,y)$ given by (7) is consistent in $L^2(P)$-mean, i.e.,

$$E(|\hat f_{m,n}(x,y) - f(x,y)|^2) \to 0, \quad \text{as } n \to \infty, \tag{8}$$

for any fixed $x, y \in (-\pi,\pi)$.
Proof. Consider the left side of (8), namely

$$E[|\hat f_{m,n}(x,y) - E(\hat f_{m,n}(x,y)) + E(\hat f_{m,n}(x,y)) - f(x,y)|^2] = \mathrm{Var}(\hat f_{m,n}(x,y)) + |E(\hat f_{m,n}(x,y)) - f(x,y)|^2. \tag{9}$$

We show that each of the terms on the right of (9) tends to zero as $n \to \infty$. Since $E(\hat r_n(s,t)) = r(s,t)$, we have from (7):

$$\begin{aligned}
E(\hat f_{m,n}(x,y)) &= \frac{1}{(2\pi)^2}\sum_{s,t=-m}^{m} e^{-i(sx-ty)}\, r(s,t)\\
&= \int_{\hat T\times\hat T} \Big(\frac{1}{2\pi}\sum_{s=-m}^{m} e^{-is(x-\alpha)}\Big)\Big(\frac{1}{2\pi}\sum_{t=-m}^{m} e^{it(y-\alpha')}\Big) f(\alpha,\alpha')\, d\alpha\, d\alpha', \quad \text{by (1)},\\
&= \int_{\hat T\times\hat T} D_m(x-\alpha)\, D_m(y-\alpha')\, f(\alpha,\alpha')\, d\alpha\, d\alpha', \tag{10}
\end{aligned}$$

where $D_m(x) = \frac{\sin\frac{(2m+1)x}{2}}{2\pi\sin\frac{x}{2}}$ is the Dirichlet kernel, the partial sum of the corresponding trigonometric series. It acts as an approximate identity in the sense that $\int_{-\pi}^{\pi} D_m(x)\, dx = 1$ and, for $\varepsilon > 0$, $\lim_{m\to\infty}\int_{-\varepsilon}^{\varepsilon} D_m(x)\, dx = 1$. Hence letting n (and thus m) tend to infinity in (10) and using the dominated convergence theorem, one gets $E(\hat f_{m,n}(x,y)) \to f(x,y)$, so that $\hat f_{m,n}(x,y)$ is asymptotically unbiased, and thus the second term of (9) tends to zero. It remains to consider the variance term in (9). Substituting the expression for $\hat f_{m,n}$ from (7), we find
$$\mathrm{Var}\,\hat f_{m,n}(x,y) = \frac{1}{(2\pi)^4}\sum_{s,s',t,t'=-m}^{m} e^{-ix(s-s') + iy(t-t')}\, \mathrm{Cov}(\hat r_n(s,t), \hat r_n(s',t')). \tag{11}$$

To use the conditions of the hypothesis on the $X_t$'s, we simplify the covariance by expanding $\hat r_n$ as follows:

$$\mathrm{Cov}(\hat r_n(s,t), \hat r_n(s',t')) = \frac{1}{n^2}\sum_{j,j'=1}^{n} \mathrm{Cov}(X_s^j X_t^{*j}, X_{s'}^{j'} X_{t'}^{*j'}) = \frac{1}{n^2}\Big(\sum_{j=j'=1}^{n} + \sum_{1\le |j-j'|=k\le n-1}\Big)\mathrm{Cov}(X_s^j X_t^{*j}, X_{s'}^{j'} X_{t'}^{*j'}) = I_{1n} + I_{2n} \quad \text{(say)}. \tag{12}$$
To obtain upper bounds on $I_{1n}$, $I_{2n}$, consider:

$$\begin{aligned}
|I_{1n}| &\le \frac{1}{n^2}\sum_{j=1}^{n} |\mathrm{Cov}(X_s^j X_t^{*j}, X_{s'}^{j} X_{t'}^{*j})| \le \frac{1}{n^2}\sum_{j=1}^{n} \Big[E|X_s^j (X_t^j X_{t'}^j)^{*} X_{s'}^j| + E|X_s^j X_t^{*j}|\, E|X_{s'}^{*j} X_{t'}^j|\Big]\\
&\le \frac{2}{n^2}\sum_{j=1}^{n} \big[E|X_s^j|^4\, E|X_t^j|^4\, E|X_{s'}^j|^4\, E|X_{t'}^j|^4\big]^{\frac14} \quad \text{(using the CBS inequality)}\\
&\le \frac{2}{n^2}\sum_{j=1}^{n} \big[E|X_s^j|^{2(2+\delta)}\, E|X_t^j|^{2(2+\delta)}\, E|X_{s'}^j|^{2(2+\delta)}\, E|X_{t'}^j|^{2(2+\delta)}\big]^{\frac{1}{2(2+\delta)}} \quad \text{(using the Liapounov inequality, cf. (19), Sec. II.1)}\\
&= \frac{2}{n^{1-\varepsilon}}\cdot\frac{1}{n^{1+\varepsilon}}\sum_{j=1}^{n} \big[M_s^j M_t^j M_{s'}^j M_{t'}^j\big]^{\frac{1}{2(2+\delta)}} = O\Big(\frac{1}{n^{1-\varepsilon}}\Big), \quad \text{by (b}_2\text{)}. \tag{13}
\end{aligned}$$
Next consider $I_{2n}$ and note that $X_s^j X_t^{*j}$ is $\mathcal F_{m;j}$-adapted and $X_{s'}^{j'} X_{t'}^{*j'}$ is $\mathcal F^{m;j+k}$-adapted, implying $E|X_s^j X_t^{*j}|^{2+\delta} \le (M_s^j M_t^j)^{\frac12}$ (CBS inequality and (b$_2$)); and similarly with $s', t'; j, j'$. Thus, using the bounds (3)-(5), we get

$$\begin{aligned}
|I_{2n}| &\le \frac{C}{n^2}\sum_{k=1}^{n-1}\ \sum_{\substack{|j-j'|=k\\ 1\le j,j'\le n}} \big(M_s^j M_t^j M_{s'}^{j'} M_{t'}^{j'}\big)^{\frac{1}{2(2+\delta)}}\, \alpha(k)^{\frac{\delta}{2+\delta}}\\
&\le \frac{2C}{n^{1-\varepsilon}}\sum_{k=1}^{n-1} \frac{\sum_{j=1}^{n-k}\big(M_s^j M_t^j M_{s'}^{j+k} M_{t'}^{j+k}\big)^{\frac{1}{2(2+\delta)}}}{n^{1+\varepsilon}}\, \alpha(k)^{\frac{\delta}{2+\delta}}. \tag{14}
\end{aligned}$$
Putting (12)-(14) in (11), we then get an upper bound for the variance of $\hat f_{m,n}$ as:

$$\mathrm{Var}\,\hat f_{m,n}(x,y) \le \frac{(2m+1)^4}{(2\pi)^4}\Bigg[\frac{2}{n^{1-\varepsilon}}\cdot\frac{\sum_{j=1}^{n}\big(M_s^j M_t^j M_{s'}^j M_{t'}^j\big)^{\frac{1}{2(2+\delta)}}}{n^{1+\varepsilon}} + \frac{C}{n^{1-\varepsilon}}\sum_{k=1}^{n-1}\frac{\sum_{j=1}^{n-k}\big(M_s^j M_t^j M_{s'}^{j+k} M_{t'}^{j+k}\big)^{\frac{1}{2(2+\delta)}}}{n^{1+\varepsilon}}\,\alpha(k)^{\frac{\delta}{2+\delta}}\Bigg].$$

From this we deduce that

$$\mathrm{Var}\,\hat f_{m,n}(x,y) \le \frac{2\big(2+\frac1m\big)^4}{(2\pi)^4}\cdot\frac{m^4}{n^{1-\varepsilon}}\Bigg[\frac{\sum_{j=1}^{n}(\cdots)^{\frac{1}{2(2+\delta)}}}{n^{1+\varepsilon}} + C\sum_{k=1}^{n-1}\Big(\frac{\sum_{j=1}^{n-k}(\cdots)^{\frac{1}{2(2+\delta)}}}{n^{1+\varepsilon}}\Big)\alpha(k)^{\frac{\delta}{2+\delta}}\Bigg]. \tag{15}$$

Since $\sum_{k\ge1}\alpha(k)^{\frac{\delta}{2+\delta}} < \infty$, and

$$\sum_{j=1}^{n-k}\big(M_s^j M_t^j M_{s'}^{j+k} M_{t'}^{j+k}\big)^{\frac{1}{2(2+\delta)}} = C n^{1+\varepsilon}(1+o(1))$$

as $n \to \infty$, by hypothesis, the right side of (15) is bounded by a constant multiple of $m^4/n^{1-\varepsilon}$. Hence

$$\mathrm{Var}(\hat f_{m,n}(x,y)) = O\Big(\frac{m^4}{n^{1-\varepsilon}}\Big), \tag{16}$$

which tends to zero by (b$_1$). Thus the right side of (9) $\to 0$ as $n \to \infty$, which implies the $L^2(P)$-consistency of the density estimator $\hat f_{m,n}(x,y)$ for each $x, y \in \mathbb R$, as asserted. $\Box$

The preceding result is also true for real valued processes, for which we need to make the following changes, slightly more elaborate than in the corresponding stationary case (compare, e.g., Yaglom [4], p. 100). Thus in the stochastic spectral representation set $Z = Z_1 + iZ_2$, where $Z_1, Z_2$ are real σ-additive (in $L_0^2(P)$-mean, but not necessarily orthogonally valued) set functions, so that (with $T = \mathbb Z$):

$$X_t = \int_{\hat T} e^{itx}\, Z_1(dx) + i\int_{\hat T} e^{itx}\, Z_2(dx), \tag{17}$$
and

$$X_t = X_t^{*} = \int_{\hat T} e^{-itx}\, Z_1(dx) - i\int_{\hat T} e^{-itx}\, Z_2(dx) = \int_{\hat T} e^{itx}\, Z_1(-dx) + i\int_{\hat T} e^{itx}\,(-Z_2(-dx)). \tag{18}$$
Comparing (17) and (18), and observing the uniqueness of the Fourier representation, one concludes that the measures $Z_1(\cdot)$ and $Z_2(\cdot)$ are symmetric and skew symmetric, respectively. Thus $Z_1(dx) = Z_1(-dx)$ and $Z_2(-dx) = -Z_2(dx)$, where for $A \subset \hat T$, $-A = \{-x : x \in A\}$, the reflection of A in the origin. Now from (17), since $X_t$ is real, squaring and expanding it, one has:

$$E(X_t^2) = E\Big[\int_{\hat T\times\hat T} e^{it(x+y)}\, Z_1(dx)Z_1(dy) - \int_{\hat T\times\hat T} e^{it(x+y)}\, Z_2(dx)Z_2(dy)\Big] + 2i\, E\Big[\int_{\hat T\times\hat T} e^{it(x+y)}\, Z_1(dx)Z_2(dy)\Big]. \tag{19}$$

With the already noted (symmetry) properties of $Z_1, Z_2$, one finds that the first bracket on the right is real; the left side being real, the second term must vanish. Since $e^{it(x+y)}$ is the value of a character of $\hat T\times\hat T$, it follows that the measure $dx\, dy \to E(Z_1(dx)Z_2(dy))$ must vanish, i.e., $E(Z_1(A)Z_2(B)) = 0$ for all Borel sets $A, B \subset \hat T$. Using all these properties, it may be further verified that the spectral bimeasures induced by the real $Z_1, Z_2$ are the same, i.e., $E(Z_1(A)^2) = E(Z_2(A)^2)$ for all Borel sets A. With these relations, one finds that (17) becomes

$$X_t = \int_{\hat T} \cos tx\, Z_1(dx) - \int_{\hat T} \sin tx\, Z_2(dx), \tag{20}$$
and its covariance is symmetric. In fact,

$$r(s,t) = E(X_s X_t) = \int_{\hat T\times\hat T} \cos(sx - ty)\, F_1(dx,dy), \tag{21}$$

where $F_1(A,B) = E(Z_1(A)Z_1(B))\ (= E(Z_2(A)Z_2(B)))$. The spectral density then takes the form $F_1(dx,dy) = f(x,y)\, dx\, dy$, where

$$f(x,y) = \frac{1}{4\pi^2}\sum_{s,t\in\mathbb Z} e^{-isx+ity}\, r(s,t) = \frac{1}{4\pi^2}\sum_{s,t\in\mathbb Z} \cos(sx - ty)\, r(s,t), \tag{22}$$

since $r(-s,-t) = r(s,t)$, and similarly

$$r(s,t) = \int_{\hat T\times\hat T} \cos(sx - ty)\, f(x,y)\, dx\, dy. \tag{23}$$

We then have from (7), with $\hat r_n(s,t) = \frac1n\sum_{j=1}^{n} X_s^j X_t^j$, the estimator of (the real) f as:

$$\hat f_{m,n}(x,y) = \frac{1}{4\pi^2}\sum_{s,t=-m}^{m} \cos(sx - ty)\, \hat r_n(s,t). \tag{24}$$
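A sketch of the real-valued variant (24) follows; it assumes a lag grid `t = -m..m` and an empirical covariance matrix `r_hat` as in the sketch given after (7) (for real data $\hat r_n$ is real and the cosine form applies); the function name is hypothetical.

```python
# Real-valued variant (24): reuses a lag grid t = -m..m and the (real)
# empirical covariance matrix r_hat from the sketch following (7).
import numpy as np

def f_hat_real(x, y, r_hat, t):
    c = np.cos(np.subtract.outer(t * x, t * y))   # cos(s*x - t*y), s,t = -m..m
    return (c * r_hat.real).sum() / (4 * np.pi ** 2)
```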
The result of Theorem 2 takes the following (slightly simpler) form, which was also obtained by Soedjak [1]:

3. Proposition. Let $X = \{X_n^j, n \in \mathbb Z, j \ge 1\} \subset L_0^2(P)$ be a strongly harmonizable real-valued process having a bispectral density which is
given by (22) and is symmetric. Suppose that X is also α-mixing and satisfies the growth conditions (a), (b) of Theorem 2. Then the real valued estimator $\hat f_{m,n}(x,y)$ of the bispectral density f, given by (24), is again consistent in $L_0^2(P)$-mean for any $x, y \in (-\pi,\pi]$.

Proof. It suffices to indicate the changes (simplifications) resulting in the work of Theorem 2. Since $\hat r_n$ is an unbiased estimator of r, we get as before

$$E(|\hat f_{m,n} - f|^2(x,y)) = \mathrm{Var}\,\hat f_{m,n}(x,y) + |E(\hat f_{m,n}(x,y)) - f(x,y)|^2, \tag{25}$$

and it is to be shown that each of the terms on the right tends to zero. Consider the second term:

$$\begin{aligned}
E(\hat f_{m,n}(x,y)) &= \frac{1}{4\pi^2}\sum_{s,t=-m}^{m} \cos(sx-ty)\, r(s,t)\\
&= \frac{1}{4\pi^2}\int_{\hat T\times\hat T}\sum_{s,t=-m}^{m} \cos(sx-ty)\cos(su-tv)\, f(u,v)\, du\, dv, \quad \text{using (23)},\\
&= \frac{1}{4\pi^2}\int_{\hat T\times\hat T}\sum_{s,t=-m}^{m} [\cos sx\cos ty\cos su\cos tv + \sin sx\sin ty\sin su\sin tv]\, f(u,v)\, du\, dv\\
&\quad + \frac{1}{4\pi^2}\int_{\hat T\times\hat T}\sum_{s,t=-m}^{m} [\cos sx\sin su\cos ty\sin tv + \sin sx\cos su\sin ty\cos tv]\, f(u,v)\, du\, dv. \tag{26}
\end{aligned}$$

The second term has sine factors which are odd, and the summation is symmetric about zero; so it vanishes. The first term is a product of sums of cosine terms giving the Dirichlet kernels (cf. the computation in (10)). Thus (26) simplifies, after elementary but tedious algebra, to the following:

$$\begin{aligned}
E(\hat f_{m,n}(x,y)) &= \frac12\int_{\hat T\times\hat T}\Big[D_m\Big(\frac{x-u}{2}\Big)D_m\Big(\frac{y-v}{2}\Big) + D_m\Big(\frac{x+u}{2}\Big)D_m\Big(\frac{y+v}{2}\Big)\Big]\, f(u,v)\, du\, dv, \quad \text{as in (10)},\\
&\to \frac12[f(x,y) + f(-x,-y)] = f(x,y),
\end{aligned}$$

as $n \to \infty$, since f is symmetric. Thus $\hat f_{m,n}(x,y)$ is again an asymptotically unbiased estimator of $f(x,y)$, $x, y \in \mathbb R$.
It remains to consider $\mathrm{Var}\,\hat f_{m,n}(x,y)$. But in this case we did not use any special properties of the complex valued $X_t$ in simplifying (11), except that the (exponential) coefficients of $\mathrm{Cov}(\hat r_n(s,t), \hat r_n(s',t'))$ are uniformly bounded (by one), and this is true of the cosine functions here. So the same computation applies verbatim (since the rest of the hypothesis is again the same), and this shows that $\mathrm{Var}\,\hat f_{m,n}(x,y) \to 0$ as $n \to \infty$. Consequently (25) tends to zero, as desired. $\Box$

In the above work $\hat r_n$ is an unbiased estimator of the covariance function r, which plays a key role in analyzing harmonizable processes, perhaps second only to the bispectral function. So it is also of interest to find conditions for $\hat r_n$ itself to be a consistent estimator of r. The following result addresses this problem (cf. Soedjak [1]).

4. Proposition. Let $X = \{X_n^j, n \in \mathbb Z, j \ge 1\} \subset L_0^2(P)$ be a harmonizable sequence satisfying the same conditions as in Theorem 2, dropping (b$_1$) (i.e., no restriction on m other than $m \to \infty$ as $n \to \infty$). Then $\hat r_n(s,t)$ is a consistent estimator of $r(s,t)$ in $L^2(P)$-mean.

Proof. Since $E(\hat r_n(s,t)) = r(s,t)$, if we show that $\mathrm{Var}(\hat r_n(s,t)) \to 0$ as $n \to \infty$, then the result will follow. Consider, as in (12),

$$\mathrm{Var}\,\hat r_n(s,t) = \frac{1}{n^2}\sum_{j,j'=1}^{n} \mathrm{Cov}(X_s^j X_t^{*j}, X_s^{j'} X_t^{*j'}) = \frac{1}{n^2}\Big(\sum_{1\le j=j'\le n} + \sum_{1\le |j-j'|\le n-1}\Big)\mathrm{Cov}(*,*) = I_{1n} + I_{2n} \quad \text{(say)}. \tag{27}$$
We simplify the two terms separately. Thus, using the CBS and Liapounov inequalities, we have:

$$\begin{aligned}
|I_{1n}| &\le \frac{1}{n^2}\sum_{j=1}^{n}\big[E(|X_s^j X_t^{*j}|^2) + |E(X_s^j X_t^{*j})|^2\big] \le \frac{2}{n^2}\sum_{j=1}^{n}\sqrt{E(|X_s^j|^4)\, E(|X_t^j|^4)}\\
&\le \frac{2}{n^2}\sum_{j=1}^{n}\big(E|X_s^j|^{2(2+\delta)}\big)^{\frac{1}{2+\delta}}\big(E|X_t^j|^{2(2+\delta)}\big)^{\frac{1}{2+\delta}} \le \frac{2}{n^{1-\varepsilon}}\cdot\frac{\sum_{j=1}^{n}\big(M_s^j M_t^j\big)^{\frac{1}{2+\delta}}}{n^{1+\varepsilon}}, \qquad 0 < \varepsilon < 1.
\end{aligned}$$

By hypothesis this implies that $|I_{1n}| = O(\frac{1}{n^{1-\varepsilon}})$ as $n \to \infty$. Using an analogous computation and the hypothesis, as in Theorem 2, leaving the algebra to the reader, one finds that $|I_{2n}| = O(\frac{1}{n^{1-\varepsilon}})$ as $n \to \infty$. Thus $\mathrm{Var}(\hat r_n(s,t)) = O(\frac{1}{n^{1-\varepsilon}})$ as $n \to \infty$, which implies the desired result. $\Box$
An extension to continuous parameter strongly harmonizable processes $\{X_t^j, t \in \mathbb R, j \ge 1\}$ is obtained with the following simple modifications, as already noted by Soedjak (op. cit.). In (2) replace $\mathbf X_m^j$ by the vector $\mathbf X_{T_n}^j = (X_t^j, -T_n \le t \le T_n)$, $j = 1, 2, \ldots$, and assume that

$$r(s,t) = E(X_s^j X_t^j) = \int_{\mathbb R^2} e^{isx - ity}\, f(x,y)\, dx\, dy, \qquad s,t \in \mathbb R, \tag{28}$$

and in the real case

$$r(s,t) = \int_{\mathbb R^2} \cos(sx - ty)\, f(x,y)\, dx\, dy. \tag{29}$$

Then let the estimator be defined as

$$\hat f_n(x,y) = \frac{1}{4n\pi^2}\sum_{j=1}^{n}\int_{-T_n}^{T_n}\!\!\int_{-T_n}^{T_n} e^{-isx + ity}\, X_s^j X_t^{*j}\, ds\, dt,\qquad \Big[\hat f_n(x,y) = \frac{1}{4n\pi^2}\sum_{j=1}^{n}\int_{-T_n}^{T_n}\!\!\int_{-T_n}^{T_n} \cos(sx - ty)\, X_s^j X_t^j\, ds\, dt.\Big] \tag{30}$$
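As a rough illustration (not from the text, and with hypothetical inputs throughout), the double integral in the real form of (30) may be discretized on a uniform time grid:

```python
# Crude discretization of (30); X is a hypothetical n x len(tg) array of
# sampled real paths X_t^j on a uniform grid tg over [-T_n, T_n].
import numpy as np

def f_hat_cont(x, y, X, tg):
    dt = tg[1] - tg[0]
    c = np.cos(np.subtract.outer(tg * x, tg * y))   # cos(s*x - t*y) on the grid
    r = X.T @ X / X.shape[0]                        # (1/n) Σ_j X_s^j X_t^j
    return (c * r).sum() * dt * dt / (4 * np.pi ** 2)

# example call with simulated white-noise paths (illustrative only)
tg = np.linspace(-10, 10, 201)
X = np.random.default_rng(6).normal(size=(100, tg.size))
print(f_hat_cont(0.3, 0.7, X, tg))
```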
Following Ibragimov and Linnik [1], one may consider continuous resampling with the vectors $\mathbf X_{T_1}^{T_2} = (X_t^\tau, -T_1 \le t \le T_1,\ 0 \le \tau \le T_2)$ and then let $T_1, T_2 \to \infty$ suitably in the above considerations. The details in the stationary case may be found in the last reference. In the next section we show that an analogous result on consistent estimation holds for an 'associated spectrum' of a set of processes more general than the strongly harmonizable class, which is also of interest in applications. Moreover, it will illuminate the procedure used in the above results.

9.4 Associated spectral estimation for a class of processes

We consider a set of nonstationary processes which contains the strongly (but not weakly) harmonizable class, introduced and studied with applications by Kampé de Fériet and Frenkiel [1] (cf. the comprehensive paper [2] and references to their earlier work), to be called class (KF), and independently by Parzen [4] under the name 'asymptotically stationary time series' (cf. also Rozanov [2]), for which a spectral measure can be associated. This parameter is of interest in applications of such processes, and may be defined as follows.
1. Definition. Let $X = \{X_t, t \in \mathbb R\} \subset L_0^2(P)$ be a process whose covariance function $K(s,t) = \mathrm{Cov}(X_s, X_t)$ satisfies the condition (KF):

$$r(h) = \lim_{T\to\infty} \frac1T\int_0^{T-|h|} K(s, s+|h|)\, ds, \qquad h \in \mathbb R. \tag{1}$$

Then the process X is said to belong to class (KF). Note that (1) is equivalent to $r(h) = \lim_{T\to\infty} r_T(h)$, where

$$r_T(h) = \frac1T\int_{\frac{|h|}{2}}^{T-\frac{|h|}{2}} K\Big(s - \frac h2,\ s + \frac h2\Big)\, ds, \tag{2}$$
by change of variables, and that $r_T(\cdot)$ is positive definite and continuous, whence $r(\cdot)$ is positive definite and measurable. That the class (KF) is large enough is attested by the following:

2. Proposition. If $X = \{X_t, t \in \mathbb R\} \subset L_0^2(P)$ is strongly harmonizable, then $X \in$ class(KF); in particular all (weakly) stationary processes are already in it.
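Before turning to the proof, condition (1) can be probed numerically. The following hypothetical sketch (not from the text) approximates $r_T(h)$ from one simulated stationary path (stationary processes lie in class (KF)) on a growing sequence of T values, so one can watch the averages settle:

```python
# Numerical probe of condition (1): approximate r_T(h) = (1/T)∫_0^{T-|h|} K(s, s+|h|) ds
# by a time average over one long path; the AR(1) path below is hypothetical
# (stationary, hence in class (KF)).
import numpy as np

rng = np.random.default_rng(3)
n = 20000
x = np.zeros(n)
for t in range(1, n):                    # AR(1): X_t = 0.8 X_{t-1} + noise
    x[t] = 0.8 * x[t-1] + rng.normal()

def r_T(h, T):
    # discrete analogue of (1), for h >= 0 and T <= n
    s = np.arange(0, T - h)
    return (x[s] * x[s + h]).mean() * (T - h) / T

for T in (1000, 5000, 20000):
    print(T, round(r_T(5, T), 3))        # should stabilize as T grows
```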
Proof. Let $K(s,t) = E(X_s X_t^*)$. Then by (1) we have:

$$\begin{aligned}
r_T(h) &= \frac1T\int_0^{T-|h|} K(s, s+|h|)\, ds = \frac1T\int_0^{T-|h|}\Big[\int_{\mathbb R^2} e^{isx - i(s+|h|)y}\, \beta(dx,dy)\Big]\, ds\\
&= \int_{\mathbb R^2}\frac1T\int_0^{T-|h|} e^{is(x-y) - i|h|y}\, ds\, \beta(dx,dy), \quad \text{by Fubini's theorem},\\
&= \int_{\mathbb R^2} e^{-i|h|y}\, a(T-|h|, x, y)\, \beta(dx,dy), \quad \text{(say)}\\
&\to \int_{\mathbb R^2} e^{-i|h|y}\, \chi_{[x=y]}\, \beta(dx,dy), \quad \text{as } T \to \infty,\\
&= \int_{\mathbb R} e^{-i|h|y}\, dG(y) = r(h), \quad \text{(say)}, \tag{3}
\end{aligned}$$

where G is β concentrated on the diagonal. Thus $X \in$ class(KF), and $r(\cdot)$ is a symmetric continuous covariance function. $\Box$

If $X \in$ class(KF), so that by (1) the averaged covariance $r_T$ has a limit r, which is a measurable positive definite function, then r is equivalent to a continuous positive definite function by the classical Riesz extension of the Bochner theorem (i.e., they are equal outside of a set
of Lebesgue measure zero). In fact, Crum [1] has shown, by a further detailed analysis, that each such function admits a representation $r = r_1 + r_2$, where $r_1$ is a continuous positive definite function on $\mathbb R^n$ and $r_2$ is positive definite with support in a Lebesgue null set (its positive definiteness being the nontrivial new fact). Thus we have

$$r(h) = \int_{\mathbb R^n} e^{i(h,x)}\, dH(x), \qquad \text{a.a. } h, \tag{4}$$

for a unique bounded positive non-decreasing function H, which is termed the associated spectral function of the process X. For harmonizable processes this is given by (3), derived on the diagonal of $\mathbb R\times\mathbb R$ from the bimeasure β. [It is of interest to note that for $h \in \mathbb R^n$, $n > 1$, if $r(h) = r(|h|)$, i.e., by definition r is isotropic and positive definite, then this r is necessarily continuous everywhere except possibly at the origin, in that $r_2(h) = 0$ for all $h \ne 0$, as noted by Crum.] With this provision, the discussion here holds for $\mathbb R^n$ or $\mathbb Z^n$ or a product of these, where the averaging is done on cubes or balls. Also, Crum observes (in different terms) that if the process (or field) X is measurable and of second order with an invariant covariance $r(\cdot)$, then r is automatically continuous (even without being isotropic in higher dimensional time), and hence X is weakly stationary. The above proposition shows that the same is true of harmonizable processes (or fields), which by definition have continuous covariances. Note that $E(X_t) = 0$ is assumed only for convenience, since otherwise $\mathrm{Cov}(X_s, X_t)$ can be replaced by $E(X_s X_t^*) = K(s,t)$ in (1) and (2), and one similarly considers the harmonizable product moment (in lieu of the covariance). We use this fact without further comment in what follows, and Proposition 2 holds with this understanding. (But then there is also the question of finding a suitable function $m(\cdot)$ that qualifies as a mean; cf. Rao [23].) If the mean function $m(\cdot)$ of a harmonizable process is not zero and unknown, is the natural estimator $\hat m_T = \frac1T\int_0^T X_t\, dt$ consistent for estimating a constant parameter in the above $L^2$-mean sense? It is somewhat surprising that the answer depends on the continuity properties of the associated spectral function. More precisely, we have the following statement, which extends a known stationary result in which case $m_t = a_0$, a constant, $t \in \mathbb R$.

3. Proposition. Let $X = \{X_t, t \in \mathbb R\} \subset L^2(P)$ be a process such that $\tilde r : (s,t) \to E(X_s X_t^*)$ is strongly harmonizable. If $m_t = E(X_t)$, then $\lim_{T\to\infty}\frac1T\int_0^T m_t\, dt = a_0$ exists; and if $\hat m_T = \frac1T\int_0^T X_t\, dt$, then $\hat m_T$ is an asymptotically unbiased estimator of $a_0$, and it is consistent in $L^2(P)$-mean (i.e., $E(|\hat m_T - a_0|^2) \to 0$ as $T \to \infty$) iff the associated spectral function H of X does not charge 0 (i.e., 0 is a continuity point of H).
Proof. By hypothesis and Proposition 2, $X \in$ class(KF), so that $\lim_{T\to\infty}\frac1T\int_0^{T-|h|} K(s, s+|h|)\, ds = \lim_{T\to\infty} R_T(h)$ (say) $= R(h)$ exists for each $h \in \mathbb R$. Moreover, by the representation of the harmonizable process,

$$X_t = \int_{\mathbb R} e^{itx}\, dZ(x), \qquad t \in \mathbb R, \tag{5}$$

for a stochastic measure Z, one has

$$m_t = E(X_t) = \int_{\mathbb R} e^{itx}\, d\mu(x), \tag{6}$$

where $\mu : A \to E(Z(A))$ defines a signed measure and $t \to m_t$ is continuous. [In fact, μ and the spectral bimeasure F of Z are related by $\mu(A) = \int_{\mathbb R} \bar g(v)\, F(A, dv)$ for a unique g such that $0 < \int_{\mathbb R^2} r(u)\bar g(v)\, f(du,dv) \le 1$; cf. Rao [23], but this detail is not needed here.] Then

$$\frac1T\int_0^T m_t\, dt = \int_{\mathbb R}\Big(\frac1T\int_0^T e^{itx}\, dt\Big)\, d\mu(x) \to \int_{\mathbb R} \chi_{[x=0]}\, d\mu(x) = a_0, \quad \text{as } T \to \infty, \tag{7}$$

which exists, since the dominated convergence theorem applies to interchange the limit and the integral here. Now set $\hat m_T = \frac1T\int_0^T X_t\, dt$, so that $E(\hat m_T) = \frac1T\int_0^T m_t\, dt = a_T \to a_0$ as $T \to \infty$. Moreover,

$$\begin{aligned}
E(|\hat m_T - a_0|^2) &= E[|(\hat m_T - a_T) + (a_T - a_0)|^2] = E(|\hat m_T - a_T|^2) + |a_T - a_0|^2\\
&= \frac{1}{T^2}\int_0^T\!\!\int_0^T K(s,t)\, ds\, dt + |a_T - a_0|^2 = \frac1T\int_{-T}^{T} R_T(h)\, dh + |a_T - a_0|^2. \tag{8}
\end{aligned}$$
Since the last term tends to zero, it suffices to simplify the first term on the right of (8). Now $X \in$ class(KF) $\Rightarrow R_T(h) \to R(h)$ as $T \to \infty$, uniformly on compact sets of $\mathbb R$, and both $R_T$ and R are Fourier transforms of bounded nondecreasing nonnegative functions $H_T$ and H, with $H_T(x) \to H(x)$ at each continuity point of H (by a classical theorem due to Bochner, cf., e.g., Cramér [4], Theorem 11). Thus for each fixed $\varepsilon > 0$ one has

$$\frac{1}{2T}\int_{-T}^{T} R_T(h)\, dh = \int_{\mathbb R}\frac{1}{2T}\int_{-T}^{T} e^{ith}\, dh\, dH_T(t) = \int_{\mathbb R}\frac{\sin Tx}{Tx}\, dH_T(x) = \Big(\int_{-\infty}^{-\varepsilon} + \int_{\varepsilon}^{\infty}\Big)\frac{\sin Tx}{Tx}\, dH_T(x) + \int_{-\varepsilon}^{\varepsilon}\frac{\sin Tx}{Tx}\, dH_T(x). \tag{9}$$

The first two terms on the right of (9) tend to zero as $T \to \infty$, since there $|\frac{\sin Tx}{Tx}| \le \frac{1}{\varepsilon T}$ and $H_T$ is bounded. Choose $\varepsilon > 0$ such that $\pm\varepsilon$ are continuity points of H and $H(\varepsilon) - H(0+) + H(0-) - H(-\varepsilon) < \varepsilon$, which is possible due to the density of the continuity set of H in $\mathbb R$. Since $|\frac{\sin Tx}{Tx}| \le 1$, it follows from the arbitrariness of $\varepsilon > 0$ that

$$\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} R_T(h)\, dh = H(0+) - H(0-).$$
Consequently the right side vanishes iff 0 is a continuity point of H, as asserted. $\Box$

In view of this result it is natural to ask for an estimator of the unknown associated spectral function H, or of the covariance function R determined by the process X. The latter question is answered by the following assertion, a special case of a result due originally to Parzen [4] (for the classical stationary case, see Doob [2], Chapter X, Sec. 7).

4. Proposition. Let $X = \{X_t, t \ge 0\} \subset L_0^2(P)$ be a measurable process such that $X \in$ class(KF). Let $R(\cdot)$ be the asymptotic covariance of X given by (1). Suppose also that $\rho(t_1,t_2,t_3,t_4) = E(|X_{t_1} X_{t_2} X_{t_3} X_{t_4}|)$ satisfies the condition that $\rho(t,t,t,t)$ is locally integrable. Then the random variable

$$\hat R_T(h) = \frac1T\int_0^{T-|h|} X_t X_{t+h}\, dt, \qquad h \in \mathbb R, \tag{10}$$

has finite variance and is a consistent estimator of $R(h)$, if for each $h \in \mathbb R$ we have

$$\int_0^{t} \mathrm{cov}(X_s X_{s+h}, X_t X_{t+h})\, ds = o(t), \tag{11}$$
as $t \to \infty$.

Proof. Since $X \in$ class(KF), we note that $\hat R_T(h)$ is an asymptotically unbiased estimator of $R(h)$, because

$$E(\hat R_T(h)) = \frac1T\int_0^{T-|h|} K(s, s+|h|)\, ds \to R(h)$$

as $T \to \infty$. On the other hand,

$$E(|\hat R_T(h) - R(h)|^2) = \mathrm{Var}(\hat R_T(h)) + |E(\hat R_T(h)) - R(h)|^2, \tag{12}$$

and the last term tends to zero, so consider the first term on the right of (12). Now for $0 < T' < T$ and $|h| < T - T'$ we have

$$\begin{aligned}
\mathrm{Var}(\hat R_T(h)) &= \frac{2}{T^2}\int_0^{T-|h|}\Big[\int_0^{t} \mathrm{cov}(X_s X_{s+h}, X_t X_{t+h})\, ds\Big]\, dt = \frac{2}{T^2}\int_0^{T-|h|} t\, C_t(h)\, dt \quad \text{(say)}\\
&\le \frac{2}{T^2}\int_0^{T'-|h|} t\, |C_t(h)|\, dt + \frac{o\big((T-|h|)^2\big)}{T^2},
\end{aligned}$$

using (11). First let $T \to \infty$ and then $T' \to \infty$, so that the right side tends to zero, and this establishes the desired consistency. $\Box$

The analysis of the last two propositions shows that the discontinuity set of the associated spectral function H of X and the consistent estimator $\hat R_T(h)$ of $R(h)$ lead to the following closely related problem. Since H is a bounded nondecreasing function, we know that it can be decomposed uniquely (relative to the Lebesgue measure) into a continuous and a discrete part, say $H_1$ and $H_2$. Also, $H_2$ has an at most countable support set in $\mathbb R$ or $(-\pi,\pi]$. Note that the latter played a key role in Proposition 3. It is thus of interest to find this set, or to estimate $H_2$ consistently. The Fourier transform of $H_2$ [i.e., the corresponding component in the decomposition of $R(\cdot)$] is a (uniformly) convergent trigonometric series, and thus an almost periodic function. Using the properties of the latter, and Bochner's theorem in the form used in Proposition 3, one can find conditions for the desired estimator of $H_2$, similar to those of Proposition 4. In fact, such a result was found by Hanin and Schwarz [1], using a particular case of the resampling procedure outlined in the last section, when X is a sequence, and the argument extends exactly to a locally compact σ-compact abelian group (index) set. This has also been given recently by Hanin and Schreiber [1] for processes with values in a Hilbert space. We now
outline from the latter the desired result when the indexing group is $G = \mathbb R$ or $\mathbb Z$, for scalar processes, which contains the essential ideas.

5. Proposition. Let $\{X^j = \{X_t^j, t \in G\}, j \ge 1\} \subset L^2(P)$ be a sequence of sequences of identically distributed measurable processes, $X^j \in$ class(KF), the $X^j$ being also independent for $j = 1, 2, \ldots$, so that the associated spectral function H is the same for all j. Suppose that $E(|X_t^j|^4) \le K_0 < \infty$ for all $j \ge 1$, $t \in G$. Let the resampling take place at $X_t^j$, $j = 1, \ldots, m$; $-T_m \le t \le T_m$, where $T_m \to \infty$ as $m \to \infty$. If $H_2$ is the discrete component of H, then the estimator

$$\hat H_{2,T_m}(x) = \frac1m\sum_{j=1}^{m}\frac{1}{(2T_m)^2}\int_{-T_m}^{T_m}\!\!\int_{-T_m}^{T_m} X_s^j X_t^{*j}\, e^{-ix(t-s)}\, ds\, dt, \tag{13}$$

is consistent for $\Delta H_2(x) = H_2(x+) - H_2(x-)$, i.e., $E(|\hat H_{2,T_m}(x) - \Delta H_2(x)|^2) \to 0$ as $m \to \infty$.

Proof. A quick outline of the argument can be provided as follows. First note that if $E(\hat H_{2,T_m}(x)) = \alpha_m(x)$, then by the identical distributions of the $X_t^j$, $j = 1, 2, \ldots$, we have:

$$\alpha_m(x) = \frac{1}{(2T_m)^2}\int_{A_m\times A_m} r(s,t)\, e^{-ix(t-s)}\, ds\, dt.$$

Note that by the membership of $X^j$ in class(KF), $\alpha_m(x) \to \Delta H_2(x)$ as $m \to \infty$ (cf. Proposition 3 above). Here and below $A_m = (-T_m, T_m)$. Also it is clear that

$$E[|\hat H_{2,T_m}(x) - \Delta H_2(x)|^2] = \mathrm{Var}(\hat H_{2,T_m}(x)) + |\alpha_m(x) - \Delta H_2(x)|^2. \tag{14}$$

With the additional conditions, it will be shown that $\mathrm{Var}(\hat H_{2,T_m}(x)) \to 0$ as $m \to \infty$. In (13), let $\hat F_m(x)$ denote the expression inside the sum. Then, using the Jensen and CBS inequalities, we get:

$$\begin{aligned}
\mathrm{Var}(\hat F_m(x)) \le E(|\hat F_m(x)|^2) &\le \Big[\frac{1}{(2T_m)^2}\int_{A_m\times A_m}\big(E(|X_s^j X_t^{*j}|^2)\big)^{\frac12}\, ds\, dt\Big]^2\\
&\le \Big[\frac{1}{(2T_m)^2}\int_{A_m\times A_m}\big[E(|X_s^j|^4)\, E(|X_t^j|^4)\big]^{\frac14}\, ds\, dt\Big]^2 \le \Big[\frac{K_0^{\frac12}}{(2T_m)^2}\cdot(2T_m)^2\Big]^2 = K_0.
\end{aligned}$$
Hence

$$\mathrm{Var}(\hat H_{2,T_m}(x)) \le \frac{1}{m^2}\sum_{j=1}^{m} \mathrm{Var}(\hat F_m(x)) \le \frac{K_0}{m} \to 0 \quad \text{as } m \to \infty.$$

This and the earlier assertion about $\alpha_m(x)$ together show that the right side of (14) tends to zero, which implies the desired result. $\Box$

Remark. If the sampled $X^j$ processes are not independent but have the same distributions, then one can proceed with the condition that they are α-mixing, and the procedures of Proposition 3.4 (or Theorem 3.2) may be applied. In particular, if the $X^j$ are strongly harmonizable and satisfy the hypothesis of Theorem 3.2, then a similar result seems possible, but we omit these extensions here. The proposition in the LCA group context is detailed in Hanin and Schreiber [1], as noted above.

We have seen that the strongly harmonizable processes belong to class(KF). However, the following known example shows that weakly harmonizable processes need not be in this class.

6. Example. Let $L_0^2(P)$ be separable and $\{e_n, n \in \mathbb Z\}$ be a complete orthonormal set, which is thus trivially a weakly stationary sequence. If V is any continuous linear mapping on $L_0^2(P)$, then $X_n = V e_n$, $n \in \mathbb Z$, is weakly (but generally not strongly) harmonizable. This is a consequence of the so-called dilation theorem of harmonizable processes (cf., e.g., Rao [13], p. 326). We now observe that for a proper choice of V, the process $\{X_n, n \in \mathbb Z\} \notin$ class(KF). Let $V = (\delta_{mn} a_n, m, n \in \mathbb Z)$ be an infinite (diagonal) matrix, where the $1 \le a_n \le 2$ are chosen as follows: $a_0 = 1$, $a_n = a_{-n}$, and for $k > 0$

$$a_k = \sum_{n\ge0}(\chi_{C_n} + 2\chi_{D_n})(k). \tag{15}$$
Hence Rn (h) = 0 for all n ≥ 1 if h = 0, and 5 1 − 2m−1 , if n = 22m − 1 lim Rn (0) = 34 3·21 n→∞ if n = 22m+1 − 1. 3 − 3·22m ,
(16)
This shows that $\lim_{m\to\infty} R_{2^{2m}-1}(0) = \frac53$ and $\lim_{m\to\infty} R_{2^{2m+1}-1}(0) = \frac43$. Hence $\lim_{n\to\infty} R_n(0)$ does not exist, and a priori $X \notin$ class(KF).

The example suggests the following enlargement of the above class, retaining many of its original properties. Since class(KF) is based on an averaging (or smoothing) of the covariances, similar to the classical Cesàro (or Abel) summability method, one can consider higher order classes $c(KF, p)$, $p \ge 1$, with $c(KF, 1) =$ class(KF). Thus for $p \ge 1$, let $R_N^{(p)}(h) = \frac1N\sum_{n=1}^{N} R_n^{(p-1)}(h)$, with $R_N^{(0)}(h) = R_N(h)$. Then $X \in c(KF, p)$ provided $\lim_{N\to\infty} R_N^{(p)}(h) = R^{(p)}(h)$ exists for each $h \in \mathbb R$. Since each $R_N^{(p)}(\cdot)$ is positive definite, the same must be true of $R^{(p)}(\cdot)$, and we get a p-th order associated spectral function $H^{(p)}(\cdot)$ (the term 'asymptotically stationary' will not be descriptive enough in this context). It is clear that $c(KF, p) \subset c(KF, p+1)$, with strict inclusion, and one can consider the largest set $(KF) = \cup_{p\ge1} c(KF, p)$. It appears that, for each $p \ge 1$, if the $a_k$ of (15) are chosen to have p terms of disjoint supports, then the resulting process is weakly harmonizable but not in $c(KF, p)$. Some discussion of the enlarged class appears in Swift [2]. However, a detailed analysis of (KF) itself is not yet available. It may be of interest to consider its structure, decomposing it as in the Wiener chaos; but this has not been done, and we have to leave the problem at this stage.

7. Example. Let us now discuss another important question, on template estimation, due to Grenander [4]. Recall that (by the standard college dictionary definition) a template is a pattern (or gauge), as of wood or metal, used as a guide in shaping something accurately. The problem thus is to estimate the template from data in global shape modeling. In mathematical terms, a simpler form starts with a compact figure $S \subset \mathbb R^d$, usually a cube, and a diffeomorphic group $G_1$ contained in the similarity group $\mathcal S$ of mappings $g : S \to S$ having special properties, such as leaving the boundary $\partial S$ invariant. If $J_t$ (t for template) denotes a template, such as a brain, a real valued nice function on S, then $I_g(x) = J_t(gx)$, $x \in S$, $g \in G_1$, the deformed image, is observable. It is desired to estimate $J_t(x)$ for each $x \in S$ from the sample $I_{g_i}(x)$, $i = 1, \ldots, n$, as g varies over $G_1$. Now the $I_{g_i}(x)$ are random, since generally each observation is corrupted by some error, and $I_{g_i}(x) = J_t(x) + \varepsilon_i(gx)$, $x \in S_0 \subset S$, $g \in G_1 \subset \mathcal S$. However, $G_1$ is too large, and an invariant integral that we need here does not seem to be available (cf. Palais [1], p. 136). So we take $G \subset G_1$, a locally compact subgroup, adding further suitable conditions on G hereafter, to formulate the problem for illustration. [If one asks for a "quasi-invariant integral", there is some better prospect (cf. Hirai and Shimomura [1]), but this leads to a different view of the problem than
we set out to study. So we stipulate some other conditions for the present exposition.] Suppose that $\{I_g^i(x), g \in G\}$ is a second order mean-continuous (locally uniformly in g) process, with $K(g,g') = E(I_g^i(x) I_{g'}^i(x))$, independent of $i \ge 1$. Thus $K(\cdot,\cdot)$ is a continuous hermitian positive definite kernel, and $E(I_g^i(x)) = J_t(x)$, $g \in G$. The problem is to estimate $J_t(x)$. We use a procedure motivated by (and abstracted from) Proposition 3. However, to employ a similar averaging process, we need a family of subsets of G, chosen as follows:

(A) Suppose there is a sequence $\{H_n, n \ge 1\}$ of Borel subsets of G with a Haar measure μ such that $0 < \mu(H_n) < \infty$ and, for each $g \in G$ (Δ denoting symmetric difference):

$$\lim_{n\to\infty}\frac{\mu((gH_n)\Delta H_n)}{\mu(H_n)} = 0. \tag{17}$$

This is an abstraction of the familiar fact that $(-T_n, T_n) \uparrow \mathbb R$ as $T_n \to \infty$ with $n \to \infty$. In fact, it is known (cf. Hewitt and Ross [1], p. 255) that if G is a locally compact, σ-compact, abelian group, then such a sequence always exists satisfying (17), with the (additional) property that the $H_n$ form an increasing open relatively compact collection. For general locally compact groups this may not obtain. Now an important consequence of condition (A) is that for any continuous almost periodic function f,

$$\lim_{n\to\infty}\frac{1}{\mu(H_n)}\int_{H_n} f(x)\, d\mu(x) = M(f) \tag{18}$$

exists, where $M(\cdot)$ is an invariant mean: $M(f_a) = M(f) = M(f^a)$. Here $f_a(x) = f(ax)$ and $f^a(x) = f(xa)$. We recall that a bounded continuous f is almost periodic if the set $\{f_a, a \in G\}$ is totally bounded (= relatively compact) in the uniform norm. Typically the limit in (18) can depend on the $H_n$-sequence, but for our present purposes this is not crucial. We observe that, for any locally integrable $f : G \to \mathbb C$, if

$$\frac{1}{\mu(H_n)}\int_{H_n} |f(x)|\, d\mu(x) \le K_0 < \infty, \quad \text{and} \quad \lim_{n\to\infty}\frac{1}{\mu(H_n)}\int_{gH_n\Delta H_n} |f(x)|\, d\mu(x) = 0,$$

then the same holds for the translated function $f_a$, $a \in G$, as well. This is true for the first relation by the translation invariance of μ, and the second holds from the fact that $gH_n\Delta g'H_n$ is a subset of $(gH_n\Delta H_n)\cup(g'H_n\Delta H_n)$, and the result holds for each of the terms on
the right by the subadditivity of the integral. Consequently this holds without the absolute value signs in the integrands. Taking $f = K(\cdot, g')$ for fixed $g'$, where $K(\cdot,\cdot)$ is the product moment function defined above, we find from (18) that, for any $h \in G$,

$$R(g') = \lim_{n\to\infty}\frac{1}{\mu(H_n)}\int_{H_n} K(gh, g')\, d\mu(g) = \lim_{n\to\infty}\frac{1}{\mu(H_n)}\int_{H_n} K(g, g')\, d\mu(g), \tag{19}$$

if the limit exists. We now formulate the next condition:

(B) For the sequence $\{H_n, n \ge 1\}$ of (A), the limit of the averaged product moment function in (19) exists.

This is the same as saying that $\{I_g(x), g \in G\}$ is of class(KF) relative to the $H_n$-sequence. Note that (B) is satisfied if $g \to I_g(x)$ is a stationary field, or if its product moment function is almost periodic (i.e., $I_g$ is almost periodically 'correlated'). Observe that when (B) holds, the limit function $R(\cdot)$ of (19) is also positive definite, and it is continuous if the process $I_g(x)$ is locally uniformly mean continuous, which will be assumed hereafter. Thus, taking $f = K(\cdot, g)$, we have under (A) and (B) that $R(g) = M(K(\cdot, g))$, and then:

$$\sum_{j,j'=1}^{n} a_j \bar a_{j'}\, R(g_j g_{j'}^{-1}) = \sum_{j,j'=1}^{n} a_j \bar a_{j'}\, M\big(K(\cdot\, g_{j'},\ \cdot\, g_j g_{j'}^{-1})\big) = M\Big(E\Big[\Big|\sum_{j=1}^{n} a_j I_{\cdot\, g_j}(x)\Big|^2\Big]\Big) \ge 0,$$
where $a_j \in \mathbb C$. Since $R(\cdot)$ is thus positive definite on G and continuous, it can be represented, by a theorem due to Gel'fand and Naĭmark (cf. Naĭmark [1], pp. 392-3), as:

$$R(g) = (U_g \xi, \xi), \tag{20}$$

where $g \to U_g$ is a continuous unitary representation of G on a Hilbert space $\mathcal H$, which may be taken as the RKHS determined by the kernel $R(\cdot - \cdot)$, and $\xi \in \mathcal H$ is a cyclic vector, i.e., a non-zero vector such that $\{U_g \xi, g \in G\}$ is dense in $\mathcal H$. However, for our work we need an integral representation of R in (20), using the fact that $g \to U_g$ is a continuous unitary representation. This is obtained essentially by decomposing $U_g \sim U_g(y)$ into a direct sum, and correspondingly $\xi \sim \xi(y)$. It is this form we find in a result due to Mautner ([1], p. 536), stating that if $\varphi_y(g) = (U_g(y)\xi(y), \xi(y))$, then

$$R(g) = \int_{\mathbb R} \varphi_y(g)\, d\nu(y), \tag{21}$$
for a suitable Borel measure $\nu(\cdot)$ on $\mathbb R$ and a continuous (in g), positive, jointly measurable (in $(y,g)$) elementary function $(y,g) \to \varphi_y(g)$. Here 'elementary' means that the $U_g(y)$ are irreducible unitary operators for each $y \in \mathbb R$. [One says that $\varphi_y(g)$ is elementary if it is equivalent to a continuous positive definite function outside of a negligible subset of G relative to the Haar measure. The stated representation (21) is a strengthened version of Mautner's, given slightly later by himself.] In case $G = \mathbb R$ we have $\varphi_y(x) = e^{ixy}$, and thus (21) is a generalized Bochner representation of a continuous positive definite function, so $\nu(\cdot)$ is the associated spectral measure of the $I_g(x)$-process of class(KF). [For a related representation of (21), see Yaglom ([1], pp. 598-600), and also Kunita [1], Sec. 4.7.] Thus, if ν does not charge the origin of $\mathbb R$, then the random variable $\hat I_n(x)$, based on n independent observations of the template under G, given by

$$\hat I_n(x) = \frac1n\sum_{j=1}^{n}\frac{1}{\mu(H_n)}\int_{H_n} I_g^j(x)\, d\mu(g), \tag{22}$$

will be an unbiased estimator of the template $J_t(x)$, and it is consistent in $L^2(P)$-mean if also $E[|I_g^i(x)|^4] \le K_0(x) < \infty$ for $g \in G$, $i \ge 1$. It would be desirable to find a simple verifiable condition on the template process in order that ν be continuous at the origin; for this we used an inversion formula in Propositions 4 and 5. It will also be useful to verify these conditions for some familiar groups G, which involves some nontrivial analysis. These are some of the new problems suggested by the example. It shows what new developments, from nonabelian harmonic analysis, are needed for proceeding with some of our applications; these are not yet available. We next include a brief discussion of the limit distributions of the estimators that were found to be consistent.

9.5 Limit distributions of (bi)spectral function estimators

Let $X^j = \{X_t^j, t \in \mathbb Z\}$, $j \ge 1$, be a sequence of sequences of centered strongly harmonizable processes with the same covariance function r. For simplicity, we assume that each process is real valued and that the (bi)spectral distribution has density f. It is then given by the Lebesgue integral

$$r(s,t) = \int_{\hat T}\int_{\hat T} \cos(sx - ty)\, f(x,y)\, dx\, dy, \qquad s,t \in \mathbb Z. \tag{1}$$
608
IX. Nonparametric estimation for processes
1), (m = mn ) so that fˆm,n (x, y) =
1 (2π)2
where
m
cos (sx − ty)ˆ rn (s, t),
(2)
s,t=−m
1 j j X X , n j=1 s t n
rˆn (s, t) =
s, t ∈ Z.
(3)
In the case of stationary processes, where just one realization of the process suffices for such estimation since r : (s, t) → r(s, t) = r(s − t) and f (·) ≥ 0 are functions of one variable each, Ibragimov and Linnik [1] have presented a method to obtain the limiting distribution of the corresponding estimator fˆn . Their method can be extended to the strongly harmonizable case where the resampling procedure involves α-mixing and other conditions. The actual details are much more involved, but, under suitable conditions, the limit distribution of fˆm,n (x, y) of (2) can be obtained. The desired result was recently established by Soedjak [1]. He is in the process of publishing it, and so we outline the main points, since it is the first time that such a limit distribution in the harmonizable case is established, and this gives a feeling for the subject (after Theorem 3.2 and Proposition 3.4) for the ensuing discussion. The conditions to be presented are a (nontrivial) generalization of the stationary case involving not only the growth of moments, but the cumulants (or semi-invariants) of the process also come into play. These techniques have an essential role in the case of dependent observations, and were used very effectively (cf., Brillinger [1] for a lucid account, where much of his own earlier work was summarized, along with the then available contributions of many others). In writing a “Forward” ˘ to Zurbenko’s [1] book on statistical treatment of stationary processes, Kolmogorov clearly approves the type of conditions which may appear complicated but which also give error estimates for the speed of convergence. Now, as a first step, we present conditions for the existence of limit distributions which by a similar extended analysis can be used for the error estimation as well. But this is not discussed since it will be the next item of investigation. Here then is the procedure for a limit normal distribution of the bispectral density estimator (2) of f (x, y) which is not necessarily positive, under an appropriate normalizing factor and a standard correction term, used for controlling the drift of its probability mass to infinity. The conditions desired are motivated by the work on stationary processes, and no further discussion is included. (See also Grenander and Rosenblatt [1] as well as [2] on this point.) Let us rewrite (2) and (3) in a more convenient form: Sm,n (x, y) = nfˆm,n (x, y)
9.5 Limit distributions of (bi)spectral function estimators
=
n j=1
=
n
1 (2π)2
m
609
cos (sx − ty) Xsj Xtj
s,t=−m
j Zm (x, y), (say),
j=1
=
k−1
U,m,n (x, y) +
=0
k
V,m,n (x, y),
(4)
=1
where we split the $Z_m$-terms into disjoint blocks defining $U_{\ell,m,n}(x,y)$ and $V_{\ell,m,n}(x,y)$, showing eventually that the U-set is useful but the V-set vanishes in the limit. Here we define these terms, omitting the dependence on the arbitrarily fixed $x, y \in \mathbb R$ from now on:

$$U_{\ell,m,n} = \sum_{j=\ell(p+q)+1}^{\ell(p+q)+p} Z_m^j, \qquad 0 \le \ell \le k-1,$$

and the next set as

$$V_{\ell,m,n} = \begin{cases} \displaystyle\sum_{j=\ell(p+q)+p+1}^{(\ell+1)(p+q)} Z_m^j, & 0 \le \ell \le k-1,\\[6pt] \displaystyle\sum_{j=k(p+q)+1}^{n} Z_m^j, & \ell = k,\end{cases}$$

where p, q and k (all depending on n) are chosen as $k = [\frac{n}{p+q}]$, the integer part, $p(= p_n) \to \infty$ as $n \to \infty$ such that $\frac{p}{n} = o(1)$, and $q = [\frac{n}{p}]$ satisfying $q = o(p)$. Thus the partial sum $S_{m,n}$ is partitioned into alternating blocks of lengths p and q, with the last block in $V_{\ell,m,n}$ having $n - (p+q)k$ terms. This split is motivated by the classical procedure given in Ibragimov and Linnik [1]. If we express (4) as

$$S_{m,n} = S'_{m,n} + S''_{m,n}, \tag{5}$$

where $S'_{m,n}$ has the U-terms and $S''_{m,n}$ has the V-terms, then we consider

$$\frac{S_{m,n} - E(S_{m,n})}{\sigma(S_{m,n})} = \frac{S'_{m,n} - E(S'_{m,n})}{\sigma(S_{m,n})} + \frac{S''_{m,n} - E(S''_{m,n})}{\sigma(S_{m,n})}. \tag{6}$$

Here $\sigma(X)$ denotes the standard deviation of X. Under suitable conditions, we wish to show that the second term on the right above tends to zero in probability and the first one gives the desired limit distribution, so that the classical Slutsky theorem (cf. Exercise I.6.4(c) and (d)) will imply the same limit law for the left side, and this is seen to give the
limit distribution of $\hat f_{m,n}(x,y)$. We now present details to implement this program. First let us recall the less frequently used concept of cumulants (cf. Cramér [1], p. 186). If $\varphi(\cdot)$ is the characteristic function of a random vector $X = (X_1, \ldots, X_k)$, i.e., $\varphi(t) = E(e^{i(t,X)})$, and $\psi(t) = \log\varphi(t)$ is defined via the principal branch of the logarithm of φ, suppose that both are (formally) expanded in Taylor series in $\mathbb R^k$ around $t = 0$. Then we get, with $|t| = \sum_{j=1}^{k} |t_j|$,

$$\varphi(t) = \sum_{\nu_1+\cdots+\nu_k\le n} E\big(\Pi_{j=1}^{k} X_j^{\nu_j}\big)\, \Pi_{j=1}^{k}\frac{(it_j)^{\nu_j}}{\nu_j!} + o(|t|^n),$$

and, 'cum' denoting the 'cumulant' (or semi-invariant), one has

$$\psi(t) = \sum_{\nu_1+\cdots+\nu_k\le n} \mathrm{cum}(X_1^{\nu_1},\cdots,X_k^{\nu_k})\, \Pi_{j=1}^{k}\frac{(it_j)^{\nu_j}}{\nu_j!} + o(|t|^n). \tag{7}$$
If the $X_j$ are bounded, then these are valid operations, and in the general case suitable moment conditions can be given as justification. One can thereafter obtain the following relations between moments and cumulants (cf., e.g., Rosenblatt [2], p. 33, and Z̆urbenko [1], p. 3):

$$\mathrm{cum}(X_1^{\nu_1},\cdots,X_k^{\nu_k}) = \sum_{\ell_1+\cdots+\ell_k=|\nu|}\frac{(-1)^{k-1}}{k}\, E\big[\Pi_{j=1}^{k} X_j^{\ell_j}\big]\, \Pi_{j=1}^{k}\frac{\nu_j!}{\ell_j!}, \tag{8}$$

where $|\nu| = \nu_1 + \cdots + \nu_k$ and the sum is over all k-partitions of the integer $|\nu|$. For instance, if $E(X_t) = 0$, $t \in \mathbb Z$, then the above sum reduces to:

$$\mathrm{cum}(X_1,X_2,X_3,X_4) = \mathrm{Cov}(X_1X_2, X_3X_4) - E(X_1X_3)E(X_2X_4) - E(X_1X_4)E(X_2X_3). \tag{9}$$

If the $X_j$ are moreover normal, so that the cumulants of order higher than 2 vanish, then (9) reduces to an old (1918) result, known as L. Isserlis's formula, given by

$$\mathrm{Cov}(X_1X_2, X_3X_4) = \mathrm{Cov}(X_1,X_3)\,\mathrm{Cov}(X_2,X_4) + \mathrm{Cov}(X_1,X_4)\,\mathrm{Cov}(X_2,X_3),$$

which nowadays is obtained immediately from the moment generating function of the multivariate normal distribution. Several properties of cumulants are discussed in Leonov and Shiryayev [1] and in Brillinger [1]. They are also used extensively in Z̆urbenko [1] as well as in Rosenblatt [2].
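Isserlis's formula is easy to verify by simulation; the following sketch (not from the text, using an arbitrary hypothetical covariance matrix) compares the two sides by Monte Carlo.

```python
# Numerical check of Isserlis's formula for a centered Gaussian vector;
# the covariance matrix C below is an arbitrary (hypothetical) choice.
import numpy as np

rng = np.random.default_rng(5)
C = np.array([[2.0, 0.6, 0.3, 0.1],
              [0.6, 1.5, 0.4, 0.2],
              [0.3, 0.4, 1.0, 0.5],
              [0.1, 0.2, 0.5, 1.2]])
X = rng.multivariate_normal(np.zeros(4), C, size=1_000_000)
x1, x2, x3, x4 = X.T

lhs = np.mean(x1*x2*x3*x4) - np.mean(x1*x2) * np.mean(x3*x4)  # Cov(X1X2, X3X4)
rhs = C[0, 2] * C[1, 3] + C[0, 3] * C[1, 2]                   # Isserlis's formula
print(lhs, rhs)   # agree up to Monte Carlo error
```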
They are also used extensively in Žurbenko [1] as well as Rosenblatt [2]. With this background we can set down our assumptions as follows.

Assumptions. (I) The $X^j$ form a sequence of real centered strongly harmonizable sequences such that
$$r(s,t) = E(X_s^j X_t^j) = \int_{\hat T \times \hat T} \cos(sx - ty)\, f(x,y)\, dx\, dy, \qquad j \ge 1,$$
and for $j \neq j'$, $s \neq s'$, $X_s^j \perp X_{s'}^{j'}$, but
$$E(X_s^j X_s^{j'}) = \rho(|j - j'|), \qquad \sum_{\ell \ge 1} \rho^2(\ell) < \infty.$$
(II) The numbers $p, q, k, m$ as functions of $n$ in (4) satisfy, for $0 < \mu, \nu < 1$, $p^{1+\nu} \sim n$, $kq = o(n)$, and $m^4 \sim n^{\mu}$ as $n \to \infty$. [For example, $p = [n^{\frac{1}{2}+\varepsilon}]$ and $q = [n^{\frac{1}{2}-\varepsilon}]$, with $0 < \varepsilon < \frac{1}{2}$, will satisfy the above growth relations if $\nu$ is chosen as $\nu = \frac{1-2\varepsilon}{1+2\varepsilon}$.]

(III) The $X_m^j$ are $\alpha(k)$-mixing in the $j$th variable, and
$$\alpha(k) \le \frac{c}{k^{1+\beta}}, \qquad c > 0, \quad \beta \ge \frac{\mu}{2}(1+\nu), \quad k \ge 1,$$
with $0 < \mu, \nu < 1$ chosen in (II).

(IV) The cumulants of the process of order 4 satisfy
$$\sum_{j,j'=1}^{n}\ \sum_{s,t,s',t'=-m}^{m} |\operatorname{cum}(X_s^j X_t^j,\, X_{s'}^{j'} X_{t'}^{j'})| = o(m^2 n),$$
as $m, n \to \infty$.

The above conditions are stronger than those used in Theorem 3.2 (and Proposition 3.4), especially on mixing. In fact, (III) above implies, for $\delta > \frac{2+\beta}{\beta}$,
$$\sum_{k \ge 1} \alpha(k)^{\frac{\delta}{2+\delta}} \le c \sum_{k \ge 1} k^{-\frac{(1+\beta)\delta}{2+\delta}} \le c \sum_{k \ge 1} k^{-\frac{(1+\beta)(2+\beta)}{2+3\beta}}, \quad \text{since } \frac{u}{2+u} \uparrow \text{ as } u \uparrow,$$
$$= c \sum_{k \ge 1} k^{-(1 + \frac{\beta^2}{2+3\beta})} < \infty.$$
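To fix ideas on the growth conditions of (II) and the block structure of (4), here is a short sketch; the function name and the choices $\varepsilon = 0.1$, $\mu = 0.5$ are ours, purely illustrative.

```python
def blocks(n, eps=0.1, mu=0.5):
    """Block sizes satisfying the growth conditions of (II), and the
    U/V index partition of (4); eps and mu are illustrative choices."""
    p = int(n ** (0.5 + eps))                 # long blocks: p -> infinity, p/n -> 0
    q = int(n ** (0.5 - eps))                 # short blocks: q = o(p)
    k = n // (p + q)                          # k = [n/(p+q)]
    m = int(n ** (mu / 4))                    # m^4 ~ n^mu
    U = [range(l * (p + q) + 1, l * (p + q) + p + 1) for l in range(k)]
    V = [range(l * (p + q) + p + 1, (l + 1) * (p + q) + 1) for l in range(k)]
    V.append(range(k * (p + q) + 1, n + 1))   # last block: n - (p+q)k terms
    return p, q, k, m, U, V

p, q, k, m, U, V = blocks(10_000)
assert sum(len(r) for r in U) + sum(len(r) for r in V) == 10_000
print(p, q, k, m)
```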
With an additional uniform bound on the fourth moments of the processes, we are ready to present the desired result. The proof involving
numerous details is long and will only be outlined. Thus the limit distribution of the density estimator $\hat f_{m,n}(x,y)$, due to Soedjak [1], can be given as follows:

1. Theorem. Let $X^j = \{X_t^j, t \in \mathbb{Z}\}$, $j \ge 1$, be a sequence of sequences of real centered strongly harmonizable processes satisfying Assumptions (I)–(IV) above. Suppose also that for a $\delta > \frac{2+\beta}{\beta}$ ($\beta$ as in (III)) we have
$$K = \sup_{s,j} E(|X_s^j|^{4(1+\delta)}) < \infty. \tag{10}$$
Then for the $S_{m,n}$ of (4),
$$\frac{S_{m,n} - E(S_{m,n})}{\sigma(S_{m,n})} \xrightarrow{D} N(0,1), \tag{11}$$
and consequently for the $\hat f_{m,n}$ of (2),
$$\frac{\hat f_{m,n}(x,y) - E(\hat f_{m,n}(x,y))}{\sigma(\hat f_{m,n}(x,y))} \xrightarrow{D} N(0,1), \tag{12}$$
as $n \to \infty$, where $\sigma(\cdot)$ is the standard deviation of the random variable in question, and $x, y \in \mathbb{R}$ are arbitrarily fixed.

The idea of proof of (11) or (12) is first to reduce the work to uniformly bounded processes using (10), with truncation, and then complete the demonstration step by step. We present here a sketch to give a feeling for the necessary detail.

Sketch of Proof. The basic ideas are as follows.

Step I. Let $N > 0$ be arbitrary, and consider a truncation of $X_s^j$ at $N$:
$$X_s^{j,N} = X_s^j\, \chi_{[|X_s^j| \le N]}; \qquad X_s^{j,N'} = X_s^j\, \chi_{[|X_s^j| > N]},$$
so that $X_s^j = X_s^{j,N} + X_s^{j,N'}$, and then split $Z_m^j$ as:
$$Z_m^j = \frac{1}{(2\pi)^2} \sum_{s,t=-m}^{m} \cos(sx - ty)\,(X_s^{j,N} + X_s^{j,N'})(X_t^{j,N} + X_t^{j,N'}) = Z_m^{j,NN} + Z_m^{j,NN'} + Z_m^{j,N'N} + Z_m^{j,N'N'} \ \text{(say)}.$$
Thus we have, for instance,
$$Z_m^{j,NN} = \frac{1}{(2\pi)^2} \sum_{s,t=-m}^{m} \cos(sx - ty)\, X_s^{j,N} X_t^{j,N},$$
and similarly for the other terms, each with at least one unbounded factor $X_s^{j,N'}$ or $X_t^{j,N'}$. The corresponding $S_{m,n}$ values are:
$$S_{m,n}^{N,N} = \sum_{j=1}^{n} Z_m^{j,NN}; \qquad S_{m,n}^{N,N'} = \sum_{j=1}^{n} Z_m^{j,NN'}, \tag{13}$$
and similarly $S_{m,n}^{N',N}$, $S_{m,n}^{N',N'}$ are obtained. We then assert that for large $N, n$ the normalized primed quantities (there are three of them), under condition (10), have negligible contributions to the limit distribution, in the sense that
$$\frac{S_{m,n}^{N,N'} - E(S_{m,n}^{N,N'})}{\sigma(S_{m,n})} \xrightarrow{P} 0. \tag{14}$$
For this it suffices to verify that $\operatorname{Var}\big(\frac{S_{m,n}^{N,N'}}{\sigma(S_{m,n})}\big)$ can be made arbitrarily small for suitably large $n$ and $N$, as follows:
$$\operatorname{Var}\Big(\frac{S_{m,n}^{N,N'}}{\sigma(S_{m,n})}\Big) = \frac{1}{\sigma^2(S_{m,n})}\Big[\sum_{j=1}^{n} \operatorname{Var}\big(Z_m^{j,NN'}\big) + 2 \sum_{1 \le j_1 < j_2 \le n} \operatorname{Cov}\big(Z_m^{j_1,NN'}, Z_m^{j_2,NN'}\big)\Big]. \tag{15}$$
Consider the first term in $[\ \cdot\ ]$ of (15):
$$\sum_{j=1}^{n} \operatorname{Var}\big(Z_m^{j,NN'}\big) \le \sum_{j=1}^{n} E\big(|Z_m^{j,NN'}|^2\big) \le \frac{8}{(2\pi)^4}\, N^2 (2m+1)^4 \sum_{j=1}^{n} \sup_{s,t,j} E\big[|X_s^{j,N} X_t^{j,N'}|\big] \le \frac{8}{(2\pi)^4}\, N^2 (2m+1)^4 \sum_{j=1}^{n} \sup_{s,j} \int_{[|X_s^j| > N]} |X_s^j|^2\, dP,$$
by the CBS inequality (and using $N^2 \le |X_s^j|^2$ on the set $[|X_s^j| > N]$),
$$\le \frac{8}{(2\pi)^4}\, n\,(2m+1)^4 \sup_{s,j} \int_{[|X_s^j| > N]} |X_s^j|^4\, dP. \tag{16}$$
However, using the Hölder, Markov, and Liapounov inequalities (cf. Sec. II.1), as well as (10), we get
$$\int_{[|X_s^j| > N]} |X_s^j|^4\, dP \le K^{\frac{1}{1+\delta}}\, \frac{1}{N^{\delta/(1+\delta)}}\, E\big[|X_s^j|^{4(1+\delta)}\big]^{\frac{\delta}{4(1+\delta)^2}} \le K^{\frac{1}{1+\delta} + \frac{\delta}{4(1+\delta)^2}}\, \frac{1}{N^{\delta/(1+\delta)}}. \tag{17}$$
So (16) and (17), along with Assumption (I), imply:
$$\mathrm{LHS}(15) \le \frac{8}{(2\pi)^4}\, (2m+1)^4\, n\, \frac{K^{\frac{4+5\delta}{4(1+\delta)^2}}}{N^{\delta/(1+\delta)}}. \tag{18}$$
Now, by using the initial relation between $m$ and $n$ (Assumption (II)), $\frac{(2m+1)^4}{n^{\mu/2} m^2}$ is bounded (tends to 16), and moreover, by an analogous detailed computation, one shows that
$$\sigma^2(S_{m,n}) = C\, m^2 n \sum_{j \ge 1} \rho^2(j)\, (1 + o(1)), \tag{19}$$
where $C$ is an absolute constant ($C \le \pi^{-4}$). Consequently, (18) and (19) imply that the first term of (15) is $O\big(\frac{n^{\mu/2}}{N^{\delta/(1+\delta)}}\big)$. A similar computation verifies that the second term of (15) is also of order $O\big(\frac{n^{\mu/2}}{N^{\delta/2(1+\delta)}}\big)$. Using this with (19) and (15), one concludes that
$$\operatorname{Var}\Big(\frac{S_{m,n}^{N,N'}}{\sigma(S_{m,n})}\Big) \le K_1\, \frac{n^{\mu/2}}{N^{\delta/(1+\delta)}} + K_2\, \frac{n^{\mu/2}}{N^{\delta/2(1+\delta)}}. \tag{20}$$
Thus for a given $\varepsilon > 0$, one can choose $n_0$ and $N_0$ such that if $N_0^{\delta/2(1+\delta)} > 3(K_1 + K_2)\, n_0^{\mu/2}/\varepsilon$, then $\mathrm{LHS}(20) < \frac{\varepsilon}{3}$.

In the same way (long computations), one shows that all the variances of the primed terms involving $\frac{S_{m,n}^{N,N'}}{\sigma(S_{m,n})}$ and the others in (13) can be made arbitrarily small. So we need only consider $S_{m,n}^{NN}$, which means that, for the proof of the theorem under condition (10), we have reduced the problem to uniformly bounded processes $X_s^j$, $s \in \mathbb{Z}$. With this reduction, one proceeds from now on without reference to (10).

Step II. We next calculate the variances (and covariances) of the random variables $Z_m^j$ introduced in (4), and then of $S_{m,n}$ as well as $U_{\ell,m,n}$, $V_{\ell,m,n}$. After a lengthy algebraic computation and simplification, involving some combinatorial and moment (cumulant) analysis, one finds that
$$\operatorname{Var}(U_{\ell,m,n}) = C\, p\, m^2 \sum_{j \ge 1} \rho^2(j)\,(1 + o(1)), \qquad 0 \le \ell \le k-1,$$
and
$$\operatorname{Var}(V_{\ell,m,n}) = \begin{cases} C\, q\, m^2 \sum_{j \ge 1} \rho^2(j)\,(1 + o(1)), & 0 \le \ell \le k-1, \\ C\, (p+q)\, m^2 \sum_{j \ge 1} \rho^2(j)\,(1 + o(1)), & \ell = k, \end{cases}$$
which are also used in finding $\operatorname{Var}(S_{m,n})$, given in (19), and $C$ is an absolute constant (the same as in (19)). These bounds are employed when the $X_s^j$ are $\alpha(k)$-mixing in the $j$-rows, to obtain, for $0 \le \ell \neq \ell' \le k-1$,
$$|\operatorname{Cov}(V_{\ell,m,n}, V_{\ell',m,n})| = O\big(q^2 m^4\, \alpha(p\,|\ell - \ell'|)\big), \qquad |\operatorname{Cov}(V_{\ell,m,n}, V_{k,m,n})| = O\big(q(p+q)\, m^4\, \alpha(p(k - \ell))\big).$$
It should be noted here that, since the $U$'s and $V$'s are products of the $X_s$'s and $X_t$'s, the variances and covariances involve the cumulants of $X_s, X_{s'}, X_t, X_{t'}$, and thus Assumption (IV) is used in the calculations. These bounds are then used to deduce the key relation that
$$\operatorname{Var}\Big(\frac{\sum_{\ell=0}^{k} V_{\ell,m,n}}{\sigma(S_{m,n})}\Big) \to 0, \tag{21}$$
as $n \to \infty$, so that by Čebyšev's inequality we get
$$\frac{S''_{m,n} - E(S''_{m,n})}{\sigma(S_{m,n})} \xrightarrow{P} 0.$$
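The mechanism behind (21), namely that the short $q$-blocks carry a vanishing share of the total variance, can be seen in a far simpler toy setting; the following is our own illustration with an AR(1) row sequence, not the harmonizable model itself.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(7)
n, p, q, R = 50_000, 1000, 100, 200
k = n // (p + q)
small = (np.arange(n) % (p + q)) >= p            # positions of the q-blocks
tot, sml = np.empty(R), np.empty(R)
for r in range(R):
    Z = lfilter([1.0], [1.0, -0.5], rng.standard_normal(n))   # AR(1) row
    tot[r], sml[r] = Z.sum(), Z[small].sum()
# the q-blocks hold roughly kq/n of the indices, and their sum carries a
# comparably small share of the variance of the full sum
print(sml.var() / tot.var(), k * q / n)
```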
But the actual detail takes a great deal of careful computation (somewhat analogous to that given in Section IV.5). This allows us to concentrate on $S'_{m,n}$ in (5). Thus we can turn to its limit distribution.

Step III. Consider the characteristic function (ch.f.) $\varphi_{j,m,n}$ of the random variable $\frac{U_{j,m,n} - E(U_{j,m,n})}{\sigma(S_{m,n})}$. We now study the limit distribution of the "accompanying sequence" of (independent in each row) random variables whose ch.f.'s are the $\varphi_{j,m,n}$. Then it is shown that
$$I_{1n}(t) = \Big| E\Big(e^{it\, \frac{S'_{m,n} - E(S'_{m,n})}{\sigma(S_{m,n})}}\Big) - \Pi_{j=0}^{k-1} \varphi_{j,m,n}(t) \Big| \to 0, \tag{22}$$
as $n \to \infty$. Here one uses the bounds on moments of mixing random variables given in Proposition 3.1. Then one verifies the fact that the $\frac{U_{j,m,n} - E(U_{j,m,n})}{\sigma(S_{m,n})}$ are uniformly asymptotically negligible for each $j$ as $n \to \infty$, and uses an extension due to Gnedenko of Bawly's theorem on infinitely divisible (limit) distributions (cf., e.g., Gnedenko and Kolmogorov [1], Theorem 3 on p. 101, or Rao [15], Theorem 5.3.5), by which the product converges to that of the standard normal law, i.e.,
$$I_{2n}(t) = \Big| \Pi_{j=0}^{k-1} \varphi_{j,m,n}(t) - e^{-\frac{t^2}{2}} \Big| \to 0, \tag{23}$$
uniformly in $t$ belonging to compact sets. Verification of the conditions of this generalized limit theorem is quite delicate and uses all the assumptions in our hypothesis. Thus from
$$\Big| E\Big(e^{it\, \frac{S'_{m,n} - E(S'_{m,n})}{\sigma(S_{m,n})}}\Big) - e^{-\frac{t^2}{2}} \Big| \le I_{1n}(t) + I_{2n}(t),$$
which tends to zero by (22) and (23), one concludes that
$$\frac{S'_{m,n} - E(S'_{m,n})}{\sigma(S_{m,n})} \xrightarrow{D} N(0,1), \tag{24}$$
as $n \to \infty$.

Step IV. Finally, using a form of Slutsky's theorem (cf. Cramér [1], p. 254, and I.6.4(c)), by (6) we get that
$$\frac{S_{m,n} - E(S_{m,n})}{\sigma(S_{m,n})} \xrightarrow{D} N(0,1),$$
as $n \to \infty$. Since $\hat f_{m,n} = \frac{S_{m,n}}{n}$, it follows from the above that (12) holds, i.e., the normalized $\hat f_{m,n}(x,y)$ has asymptotically the standard normal distribution for each $x, y \in \mathbb{R}$. The omitted details really take several pages of analysis, as seen from Soedjak's [1] work.

The point of this theorem is that it is the first of its kind for the problem. Any future simplifications and extensions could be aided by it. The same method and conditions may be refined to obtain the rate of convergence in the limit. Also note that the estimator of the bispectral distribution needs a separate treatment and cannot be obtained from the density result given above. For comparison and illustration, we present the following theorem on the asymptotic distribution of an estimator of the spectral distribution of a stationary Gaussian process, due to Ibragimov [1], which also indicates what other types of results can be expected in the harmonizable case.

2. Theorem. Let $\{X_n, n \in \mathbb{Z}\}$ be a real centered stationary Gaussian process with spectral distribution $F$ and density $f$ $(= \frac{dF}{dx} \ge 0)$. Consider the "periodogram" estimator $\hat F_n(x)$ of $F(x)$, $x \in \mathbb{R}$, given by (since $F(-x) = F(\pi) - F(x)$, $x \ge 0$):
$$\hat F_n(x) = \frac{1}{2n\pi} \int_0^x \Big| \sum_{k=1}^{n} X_k e^{-ikt} \Big|^2\, dt.$$
Suppose $F$ is strictly increasing, and $f \in L^p$, i.e., $\int_{-\pi}^{\pi} f(x)^p\, dx < \infty$, for some $2 < p < \infty$. Then
$$\sqrt{n}\,\big(\hat F_n(x) - F(x)\big) \xrightarrow{D} N(0, \sigma^2(x)), \tag{25}$$
where $\sigma^2(x) = 2\pi \int_0^x f(v)^2\, dv$. More generally, if $Y_n(x) = \sqrt{n}(\hat F_n(x) - F(x))$, then the sequence of processes $\{Y_n(x), x \in [0, \pi]\}$ converges weakly to a centered independent increment Gaussian process $\{Y(x), x \in [0, \pi]\}$ (meaning the finite dimensional distributions of the former converge to those of the latter) as $n \to \infty$, where $Y(0) = 0$, $E(Y(x)) = 0$, and $E(Y(x)Y(x')) = 2\pi \int_0^{x \wedge x'} f(v)^2\, dv$.

This is the main result of his paper, and the details, depending on the Gaussian nature of the stationary process $X_n$, are established after a delicate analysis, to which the reader is referred. There it is also indicated how the continuous parameter extension is obtainable, without much difficulty, from the discrete version.
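As a quick numerical illustration of $\hat F_n$ (our own sketch; the white-noise data make $f \equiv 1/(2\pi)$ and $F(x) = x/(2\pi)$ on $[0,\pi]$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.standard_normal(n)                      # white noise: f = 1/(2*pi)

def F_hat(x, n_grid=400):
    """Riemann approximation of (1/(2*n*pi)) * int_0^x |sum_k X_k e^{-ikt}|^2 dt."""
    t = np.linspace(0.0, x, n_grid)
    k = np.arange(1, n + 1)
    dft = np.exp(-1j * np.outer(t, k)) @ X      # sum_k X_k e^{-ikt} on the grid
    I = np.abs(dft) ** 2 / (2 * np.pi * n)      # integrand values
    return I.mean() * x

x = 1.0
print(F_hat(x), x / (2 * np.pi))                # estimate vs. the true F(x)
```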
We conclude this account with some complements of the work of the preceding sections.

9.6 Complements and exercises

1. Let $\{X_t, t \in T = \mathbb{R} \text{ or } \mathbb{Z}\}$ be a stationary or harmonizable [weakly or strongly] process. If $V$ is a bounded linear mapping on $L^2(P)$, let $Y_t = V X_t$, $t \in T$, be the transformed process. Verify that $\{Y_t, t \in T\}$ is always a weakly harmonizable process. [Hints: Use the integral representation $X_t = \int_{\hat T} e^{itx}\, dZ(x)$, $t \in T$, and justify that the integral and $V$ commute. Note that the statement cannot be strengthened to say that for a stationary $X_t$-process, $Y_t$ must be stationary, as Example 4.6 shows. This statement is not trivial.] However, find an example of an operator $V$ such that both the $X_t$- and $Y_t$-processes remain stationary with positive variances.

2(a). Let $X_t = Y_t + Z_t$, $t \in \mathbb{R}$, where $Y_t$ is weakly stationary with covariance $r_1$ and $Z_t$ is of second order with a periodic covariance $r_2$ of period $d > 0$, so that $r_2(s+d, t+d) = r_2(s,t)$ for all $s, t \in \mathbb{R}$ and some fixed $d$. Also suppose that the $Y$- and $Z$-processes are uncorrelated. Show that the $X_t$-process is in class (KF).

(b) We now seek a kind of converse to (a), namely to investigate the set of processes in class (KF) that admit such a decomposition with $r_1(0) > 0$. If $X_t = \frac{B_t}{\sqrt{t}}$, $t > 0$, symmetrically extended for $t < 0$, with $X_0 = 0$, where $B_t$ is Brownian motion, then verify that $X_t \in$ class (KF), that it is not stationary, and that it does not admit a decomposition as in (a), so that the problem is nontrivial and only a subclass can possibly have such a decomposition, as seen in the following part.

(c) We now present a positive solution here. Consider $X = \{X_t, t \in \mathbb{R}\} \in$ class (KF) with covariance $r$ which is completely monotonic on each line $\Delta_h$ of $\mathbb{R}^+ \times \mathbb{R}^+$ parallel to the diagonal. This means that $r(s - \frac{h}{2}, s + \frac{h}{2})$ is, as a function of $s$, completely monotonic for each
$h \in \mathbb{R}^+$. [Recall that a function $f$ on $[a,b] \subset \mathbb{R}$ is completely monotonic if it is infinitely differentiable and $(-1)^n f^{(n)}(x) \ge 0$, $n \ge 1$, $x \in [a,b]$, where $f^{(n)}$ is the $n$th derivative of $f$. An example is the Laplace transform of a distribution function on $\mathbb{R}^+$, if it exists. Interestingly, a classical Bernstein theorem says that this characterizes the concept if $f$ is conveniently normalized so that $f(0) = 1$. A proof of this theorem may be found, e.g., in Rao [17], p. 244.] Then show that $X$ can be decomposed as $X = Y + Z$, where $Y$ is stationary with a positive variance, and $Z \in$ class (KF) satisfies the condition that $Z_t \to 0$ in $L^2(P)$. (The existence of such an $X$ is a trivial consequence of Kolmogorov's existence theorem.) Thus the problem leads to some unanswered questions. [Hints: First observe that a covariance function $\rho(\cdot,\cdot)$ is completely monotone on $\Delta_h$, $h \in \mathbb{R}^+$, iff it can be expressed as:
$$\rho\Big(s - \frac{h}{2},\, s + \frac{h}{2}\Big) = \int_{\mathbb{R}^+} e^{-xs}\, G(dx, h), \tag{*}$$
where $G(\cdot, h)$ is a bounded nondecreasing function whose increment function $\tilde G_{xy}(\cdot) = G(y, \cdot) - G(x, \cdot)$, $x \le y$, is symmetric and positive definite.

Indeed, if $G$ defines $\tilde G_{xy}$ with the stated properties, then the $\rho$ of (*) is a covariance which is completely monotone on $\Delta_h$, by definition, since the integral exists, and, $\rho$ being real, $G(x, \cdot)$ is symmetric. Conversely, if $\rho$ has the stated monotonicity property on each $\Delta_h$, then by Bernstein's theorem there is a $G(\cdot, h)$ satisfying (*). Next observe that $s \mapsto \rho(s - \frac{h}{2}, s + \frac{h}{2})$ is an analytic function on $\mathbb{R}^+ \times \mathbb{R}^+$ (cf. Widder [1], p. 146), and it can be inverted to get (cf. again Widder [1], Thm. 7.6a, p. 69):
$$G(x, h) = \lim_{T \to \infty} \frac{1}{2\pi i} \int_{c - iT}^{c + iT} \rho\Big(s - \frac{h}{2},\, s + \frac{h}{2}\Big)\, \frac{e^{xs}}{s}\, ds, \tag{+}$$
for $x > 0$ and $c > \sigma_0 > 0$, where $\sigma_0$ is the abscissa of convergence of the integral. Since $\rho$ is a covariance (as in the harmonizable case), it follows that $\tilde G_{xy}(\cdot)$ is positive definite, and representation (*) holds. Now note that by (*), $\lim_{s \to \infty} \rho(s - \frac{h}{2}, s + \frac{h}{2})$ exists and is finite, and $\lim_{x \to 0+}[G(x, h) - G(0, h)] = G(0+, h)$ exists. The function $\tilde G_x(\cdot) = G(x, \cdot) - G(0+, \cdot)$ is positive, bounded, and has symmetric positive definite increments. Hence, by the preceding paragraph, the $\tilde\rho$ given by
$$\tilde\rho\Big(s - \frac{h}{2},\, s + \frac{h}{2}\Big) = \int_{\mathbb{R}^+} e^{-xs}\, \tilde G(dx, h)$$
is a covariance function satisfying $\lim_{s \to \infty} \tilde\rho(s - \frac{h}{2}, s + \frac{h}{2}) = 0$. If we set $r_1(h) = G(0+, h)$ and $r_2(s,t) = \tilde\rho(s,t)$, then $r(s,t) = r_1(s-t) + r_2(s,t)$
and the corresponding second order orthogonal processes $Y, Z$ with covariances $r_1, r_2$ give the desired representation $X_t = Y_t + Z_t$, $t > 0$, extended to the whole line. Regarding this problem, see Kampé de Fériet [1], where an interesting discussion of class (KF) can be found.]

3. A fundamental property of stationary processes is that they have translation (or shift) invariant covariances. In fact, if $\{X_t, t \in \mathbb{R}\} \subset L_0^2(P)$ is stationary and $\tau_s X_t = X_{s+t}$, then the $\tau_s X_t$, $t \in \mathbb{R}$, has the same covariance as the original process; one extends $\tau_s$ linearly onto $\overline{sp}\{X_t, t \in \mathbb{R}\}$, whence onto $L_0^2(P)$ by defining it to be the identity on the complement of this subspace. The extended operator, denoted $U_s$, is unitary and plays an important role in the analysis. The same is not always possible for other second order processes; for instance, this is false for the harmonizable class. Here we give a certain family of processes, forming a subset of the Karhunen class, for which it is possible to define a shift operator. In order that such a shift operator should exist, we must have, for each finite set of complex numbers $a_1, \ldots, a_n$: if $Y_n = \sum_{j=1}^{n} a_j X_{t_j}$, then $\|Y_n\|_2 = 0$ should imply $\|\tau_s Y_n\|_2 = 0$, or in terms of covariances, $\sum_{i,j=1}^{n} a_i \bar a_j\, r(t_i, t_j) = 0$ implies $\sum_{i,j=1}^{n} a_i \bar a_j\, r(t_i + s, t_j + s) = 0$; and $\tau_s$ will be bounded if $\|\tau_s Y_n\|_2 \le c\, \|Y_n\|_2$ for some $c > 0$. Under these conditions, for such $\tau_s$ we also have $\tau_s \tau_{s'} = \tau_{s+s'}$, so that $\{\tau_s, s \ge 0\}$ will form a semi-group of bounded operators on $L_0^2(P)$. [In the stationary case, this becomes a strongly continuous group of unitary operators, so that $U_t U_t^* = U_t^* U_t = \mathrm{id}$.] We now demand that $\tau_s$ and its adjoint $\tau_s^*$ commute (but are not necessarily equal to the identity). Thus $\{\tau_s, s \ge 0\}$ forms a semi-group of bounded normal operators on $L_0^2(P)$. Following the $U_t$-case, we also assume that this is a strongly continuous semi-group, so that $\|\tau_s X - X\|_2 \to 0$ as $s \to 0$, which is still general, allowing us some technical analysis tools here. Then the standard theory implies that the semi-group has a generator $A$, as a strong limit of $\frac{\tau_h - I}{h}$, which is a densely defined closed linear (usually unbounded) operator such that $X_s = \tau_s X_0 = e^{sA} X_0$. Hence, using the spectral theorem (cf. Riesz and Sz.-Nagy [1], p. 288) for normal operators, one gets
$$X_s = \tau_s X_0 = \int_{\mathbb{C}} e^{s\lambda}\, F(d\lambda) X_0 = \int_{\mathbb{C}} e^{s\lambda}\, Z(d\lambda),$$
where $Z(\cdot)$ has orthogonal values, so that $E(Z(A)\overline{Z(B)}) = G(A \cap B)$. Verify that $r(s,t) = E(X_s \bar X_t) = \int_{\mathbb{C}} e^{s\lambda + t\bar\lambda}\, G(d\lambda)$; consequently the nonstationary $X_t$-process is of Karhunen class. Note also, since a harmonizable process does not admit a shift but is of Karhunen class, that only a proper subset of the latter admits a shift operation. For related results and more details, see Getoor [1] and Rao [16].
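A finite-dimensional sketch of the last display may help: take $A$ diagonal (hence normal), so that $G$ concentrates on the eigenvalues with masses $|X_0(i)|^2$; everything below is an assumed toy choice of ours.

```python
import numpy as np

rng = np.random.default_rng(8)
lam = np.array([-0.2 + 1.0j, -0.5 - 0.3j])        # assumed spectrum of A
X0 = rng.standard_normal(2) + 1j * rng.standard_normal(2)

def X(s):
    return np.exp(s * lam) * X0                   # X_s = e^{sA} X_0, A = diag(lam)

s, t = 0.7, 1.9
r_st = X(s) @ np.conj(X(t))                       # analogue of E(X_s conj(X_t))
check = np.sum(np.exp(s * lam + t * np.conj(lam)) * np.abs(X0) ** 2)
print(np.allclose(r_st, check))                   # r(s,t) = sum of e^{s*lam + t*conj(lam)} masses
```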
4. Here we present a Cramér-Rao type lower bound for estimators in a template problem. The objects (inputs) are templates taking their values in the rotation group $SO(n)$, the set of all $n \times n$ orthogonal matrices of determinant one. The observations (outputs) are denoted by a set $\mathcal{I}$, and the problem is to estimate an element of $SO(n)$, given the observation $I \in \mathcal{I}$, i.e., to find an estimator $\hat\Theta : \mathcal{I} \to SO(n)$ of $\Theta$ such that $\hat\Theta - \Theta$, as an element of $\mathbb{R}^{n^2}$, is minimized in some sense. [This problem, with all the physical implications, is given in Grenander, Miller, and Srivastava [1]. Here we are presenting a particular case of it.] To state it briefly, since the observations $I$ are subject to error, they are random variables taking values in $\mathcal{I}$ with density (or likelihood) $f(\cdot|\Theta)$, assumed given relative to a $\sigma$-finite measure. Embedding $SO(n)$ in the linear space $M(n)$ of $n \times n$ matrices, one can put on it the Hilbert-Schmidt norm $\|\cdot\|$ given by $\|A\|^2 = \operatorname{tr}(AA^*) = \sum_{i,j=1}^{n} |a_{ij}|^2$, $A \in M(n)$. Thus if $A \in SO(n)$, then $\|A\|^2 = n$, and $\|A - B\|^2 = 2(n - \operatorname{tr}(AB^*))$ for $A, B \in SO(n)$. Let $\pi(\cdot)$ be the (normalized) Haar measure on the compact group $SO(n) \subset M(n)$, which is given the topology of $\mathbb{R}^{n^2}$, taken as the prior probability on this set of parameters. Thus the joint density of $(I, \Theta)$ obtained on $\mathcal{I} \times SO(n)$ is $f(dI|\Theta)\pi(d\Theta)$, with the marginal $p(\cdot)$ as the density of $I$. The posterior density of $\Theta$, given the observation $I$, is denoted $G(\Theta|I)$. A Bayes estimator $\Theta^*$ of $\Theta$, after observing $I$, is given by (cf. Definition III.2.3) the equation:
$$\int_{SO(n)} W(\|\Theta - \Theta^*\|)\, G(\Theta|I)\, \pi(d\Theta) = \inf_{\hat\Theta \in SO(n)} \int_{SO(n)} W(\|\Theta - \hat\Theta\|)\, G(\Theta|I)\, \pi(d\Theta),$$
where $W : \mathbb{R}^+ \to \mathbb{R}^+$ is a convex (loss) function. In general it is not easy to obtain a Bayes estimator $\Theta^*$, but a lower bound for the risk $R(\Theta, \hat\Theta) = E[W(\|\Theta - \hat\Theta\|)]$, where $\hat\Theta$ is any estimator of $\Theta$, can be given. Show that the following (CR-type) inequality holds:
$$R(\Theta, \hat\Theta) \ge E[W(\|\Theta^* - \Theta\|)], \qquad \Theta \in SO(n), \tag{+}$$
where $\Theta^*$ is a Bayes estimator, and $\hat\Theta$ is any other estimator of $\Theta$ based on the observation $I$ (compare it with Thm. III.2.4). If $W(x) = x^2$, a Bayes estimator is given by the following (prime for transpose), where the averaged matrix $A$ need not itself lie in $SO(n)$:
$$\Theta^* = \arg\max\{\operatorname{tr}(\Theta A') : \Theta \in SO(n)\}, \qquad A = \int_{SO(n)} \Theta\, G(\Theta|I)\, \pi(d\Theta).$$
[Hints: For the inequality (+), use the identity $E(X) = E(E(X|I))$ for any random variable $X \ge 0$, followed by the conditional Jensen inequality. The last part is obtained from the fact that $\|\Theta_1 - \Theta_2\|^2 = 2(n - \operatorname{tr}(\Theta_1 \Theta_2'))$, so that the minimization of
the left side is the same as maximization of the trace term on the right. Since SO(n) is not a convex set, the existence part of Theorem III.2.4 is not available here. Many aspects of the problem are presented in the above reference, and in fact new investigations are required when the parameter space is a curved manifold.]
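For the trace maximization in the hint, a closed form is available via the singular value decomposition (the orthogonal Procrustes solution). The sketch below is our own illustration, with an arbitrary matrix standing in for the posterior mean $A$:

```python
import numpy as np

def project_SO(A):
    """arg max over SO(n) of tr(Theta @ A.T), via the SVD of A."""
    U, _, Vt = np.linalg.svd(A)
    D = np.eye(A.shape[0])
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))    # force determinant +1
    return U @ D @ Vt

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))                   # stands in for the posterior mean
Theta = project_SO(A)
print(np.allclose(Theta @ Theta.T, np.eye(3)), np.linalg.det(Theta))
```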
Bibliographical notes

Spectral functions play a vital role in the analysis of second order processes, when they exist. However, for a given process under observation, these are generally unknown and must be estimated from the sample. Thus one wants to have at least an (asymptotically) unbiased estimator, but for applications its consistency should really be established. If the process is stationary, then the spectral function is nonnegative and of one variable. These important properties are lost for harmonizable and other nonstationary processes. Here typically the spectral functions are complex valued and involve two variables. It is thus necessary to have more observations, and, in order to account for the presence of both variables, one has to go in for a more elaborate procedure, such as resampling. In the literature there is no result on (bi)spectral estimation for processes such as the harmonizable family or class (KF). The first satisfactory investigation of problems of estimation for strongly harmonizable processes is due to Soedjak [1]. He established both the consistency and the subsequent limit distribution of a bispectral density estimator, based on the resampling method, for strongly harmonizable processes. The work in Sections 3 and 5 is essentially from his papers [1], [2], with details in [3]. The conditions imposed can be refined to obtain the speed of convergence as well. The method is largely motivated by the treatment due to Ibragimov and Linnik [1] for stationary processes. See also Grenander and Rosenblatt [2].

A basic result on the structure of harmonizable processes is that they have a stationary dilation: each harmonizable process is a projection of a stationary one from a super Hilbert space containing the observation space $L_0^2(P)$. However, this representation does not help in obtaining the consistency or asymptotic distributions of the density estimators from the known stationary theory, for at least two reasons. First, the super Hilbert space, and hence the dilated stationary process, is unobservable, so that the estimation procedure cannot be based on this information. Second, the problems we are concerned with are nonlinear, and the projection procedure, being linear, is not applicable even if the dilated process were somehow known. Thus the presented analysis, with all its long computations, cannot be avoided.
The other class of nonstationary processes of immediate interest is what we called class (KF), or c(KF, 1), also termed asymptotically stationary. These include stationary as well as strongly (but not weakly) harmonizable processes. They were introduced by Kampé de Fériet and Frenkiel [1], and independently by Parzen [4] and Rozanov [2]. The former authors, in a series of papers, have analyzed and illustrated this class for a number of applications. We have included a general treatment in Section 4. Proposition 4.4 is essentially from Parzen [4], and Proposition 4.5 is from Hanin and Schreiber [1], where the result is formulated for certain LCA groups. This class of processes can be generalized with $p$th-order summability methods, giving classes called c(KF, p), $p \ge 1$, which are strictly increasing in $p$. They and their union are classes of potential interest for further study. A brief analysis of these appears in Swift [2].

In studying the linear filtering problem $\Lambda X_t = Y_t$ in Section VIII.3, we observed that Nagabhushanam [1] has considered conditions on the filter in order that the inversion $X_t = \Lambda^{-1} Y_t$ be physically realizable (i.e., $X_t$ should depend only on the past and present values of the $Y_s$), when $\Lambda$ is a difference or integral operator and the processes are stationary. In the discrete case, the characteristic equation of the filter is a polynomial $p(\cdot)$, and the problem becomes more delicate if $p(\cdot)$ has roots on the unit circle. To overcome this obstruction, he used a summability method (called "(E, q)-summability") and obtained a physically realizable solution. He then remarks (on p. 449) that "more comprehensive forms of summability methods can be used in the same way." This is a precursor of the idea of class (KF), although neither the authors Kampé de Fériet and Frenkiel, nor Parzen or Rozanov, appear to have been aware of the above work. Indeed, (KF) $= \cup_{p \ge 1}$ c(KF, p) is similar, and the summability methods are of definite utility in the present context.

Another illustration is as follows. A modification of template estimation, undertaken in a number of studies by Grenander and his colleagues, seems to have a connection with class (KF), opening up new avenues, as seen in Example 4.7. Perhaps the most useful part for applications here is to consider the full diffeomorphism group. Unfortunately this is too big a group to allow a useful invariant integration on it. (See Palais [1], p. 136, where it is noted that further work is needed.) A quasi-invariant or other Radon measure could probably be considered, using the recent analysis due to Hirai and Shimomura [1]. This presents many new problems to solve. The recent extensive analysis of Grenander and Miller ([1], [2]) explains the numerous other related questions for detailed study in inference theory. Problem 6.4 is just an illustration of this phenomenon, where curved manifold estimation appears naturally. We thus see many deep and interesting questions of stochastic
inference arising from several different areas of real life applications. In most of the work in this book, processes of various types, without restriction to stationarity, have been considered, and each is shown to admit detailed analysis employing different tools. It is to be hoped that the treatment here forms a basis for several promising and useful problems still to be investigated. We end this work at this point, with the high expectation that the inference theory of processes and random fields can now proceed in new directions, utilizing new tools from different parts of mathematics, to the benefit of both areas.
Bibliography
Abraham, R., Marsden, J. E., and Ratiu, T. [1] Manifolds, Tensor Analysis and Applications, (2nd ed.), Springer, New York, 1988. Albert, A. [1] “Estimating the infinitesimal generator of a continuous time, finite state Markov process,” Ann. Math. Statist., 33 (1962), 727–753. Alekseev, V. G. [1] “On conditions for the perpendicularity of Gaussian measures corresponding to two stochastic processes,” Theor. Prob. Appl., 8 (1963), 286–290. Andersen, E. S., and Jessen, B. [1] “On the introduction of measures in infinite product sets,” Danske Vid. Selsk. Mat.-Fys. Medd., 25, no. 4 (1948), 8 pp. Anderson, T. W. [1] An Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958. [2] “On asymptotic distributions of estimates of parameters of stochastic difference equations,” Ann. Math. Statist., 30 (1959), 676–687. [3] The Statistical Analysis of Time Series, Wiley, New York, 1971. Anderson, T. W., and Taylor, J. B. [1] “Strong consistency of least squares estimates in dynamic models,” Ann. Statist., 7 (1979), 484–489. Andô, T. [1] “Contractive projections in Lp spaces,” Pacific J. Math., 17 (1966), 391–405. Andô, T., and Amemiya, I. [1] “Almost everywhere convergence of prediction sequence in Lp (1 < p < ∞),” Z. Wahrs., 4 (1965), 113–120. Aronszajn, N. [1] “Theory of reproducing kernels,” Trans. Am. Math. Soc., 68 (1950), 337–404. Bahadur, R. R. [1] “On unbiased estimates of uniformly minimum variance,” Sankhyā, 18 (1957), 211–224. Baker, C. R. [1] “Complete simultaneous reduction of covariance operators,” SIAM J. Appl. Math., 17 (1969), 972–983. Balakrishnan, A. V.
[1] “On a characterization of covariances,” Ann. Math. Statist., 30 (1959), 650–675. Barankin, E. W. [1] “Locally best unbiased estimates,” Ann. Math. Statist., 20 (1949), 477–501. Bartle, R. G. [1] “A general bilinear vector integral,” Studia Math., 15 (1956), 337– 352. Basawa, I. V., and Prakasa Rao, B. L. S. [1] Statistical Inference on Stochastic Processes, Academic Press, New York, 1980. Baxter, G. [1] “A strong limit theorem for Gaussian processes,” Proc. Am. Math. Soc., 7 (1956), 522–527. Belyaev, Yu. K. [1] “Analytic random processes” Theor. Prob. Appl., 4 (1959), 402– 409. Bensoussan, A. [1] Stochastic Control of Partially Observable Systems, Camb. Univ. Press, Cambridge, UK, 1992. Berg, C., Christensen, J. P. R., and Ressel, P. [1] Harmonic Analysis on Semigroups, Springer, New York, 1984. Berger, A. [1] “On disjoint sets of distribution functions,” Proc. Am. Math. Soc. 1 (1950), 25–31. Berger, A., and Wald, A. [1] “On distinct hypotheses,” Ann. Math. Statist. 20 (1949), 104–109. Berger, J. O. [1] Statistical Decision Theory and Bayesian Analyses, Springer, New York, 1985. Berger, M. A., and Mizel, V. J. [1] “An extension of the stochastic integral,” Ann. Prob., 10 (1982), 435–450. Bertoin, J. [1] Levy Processes, Cambridge Univ. Press, London, UK, 1996. Bhattacharyya, A. [1] “On the analogs of the amount of information and their use in statistical estimation,” Sankhy¯ a 8 (1946/47), 1–14, 201–218. Billingsley, P. [1] Statistical Inference for Markov Processes, Univ. of Chicago Press, Chicago, IL, 1961. Birnbaum, A.
[1] “On the foundations of statistical inference,” J. Amer. Statist. Assoc., 57 (1962), 269–306. Blackwell, D. [1] “On an equation of Wald,” Ann. Math. Statist., 17 (1946), 84–87. Blackwell, D., and Girshick, M. A. [1] Theory of Games and Statistical Decisions, Wiley, New York, 1954. Bochner, S. [1] Harmonic Analysis and the Theory of Probability, Univ. of Calif. Press, Berkeley and Los Angeles, CA, 1955. [2] “Stationarity, boundedness, almost periodicity of random valued functions,” Proc. 3rd Berkeley Symp. Math. Statist. and Prob., 2 (1956), 7–27. Bourbaki, N. [1] Éléments de Mathématique: Chapitre IX, Intégration, Hermann, Paris, 1969. Briggs, V. D. [1] “Densities for infinitely divisible processes,” J. Multivar. Anal., 5 (1975), 178–205. Brillinger, D. R. [1] Time Series: Data Analysis and Theory, Holt, Rinehart and Winston, New York, 1975. Brockett, P. L., and Tucker, H. G. [1] “A conditional dichotomy theorem for stochastic processes with independent increments,” J. Multivar. Anal., 7 (1977), 13–27. Brockett, P. L., Hudson, W. N., Tucker, H. G. [1] “The distribution of the likelihood ratio for additive processes,” J. Multivar. Anal., 8 (1978), 233–243. Brody, E. J. [1] “An elementary proof of the Gaussian dichotomy theorem,” Z. Wahrs., 20 (1971), 217–226. Brown, M. [1] “Discrimination of Poisson processes,” Ann. Math. Statist., 42 (1971), 773–776. Bru, B., and Heinich, H. [1] “Meilleures approximations et médianes conditionnelles,” Annales Inst. Henri Poincaré, 21 (1985), 197–224. Cairoli, R., and Walsh, J. B. [1] “Stochastic integrals in the plane,” Acta Math., 134 (1975), 111–183. Cambanis, S., and Liu, B. [1] “On harmonizable stochastic processes,” Infor. Control, 17 (1970), 183–202.
Cameron, R. H. [1] “The translation pathology of Wiener space,” Duke Math. J., 21 (1954), 623–627. Cameron, R. H., and Martin, W. T. [1] “Transformation of Wiener integrals under translation,” Ann. of Math., 45 (1944), 386–396. [2] “The behavior of measures and measurability under a change of scale in Wiener space,” Bull. Am. Math. Soc., 10 (1947), Chan, N. H., and Wei, C.-Z. [1] “Limiting distributions of least squares estimates of unstable autoregressive processes,” Ann. Statist., 16 (1988), 367–401. Chang, D. K., and Rao, M. M. [1] “Bimeasures and nonstationary processes,” In Real and Stochastic Analysis, Wiley, New York, (1986), 7–118. [2] “Special representations of weakly harmonizable processes,” Stoch. Anal. Appl., 6 (1988), 169–189. [3] “Bimeasures and sampling theorems for weakly harmonizable processes,” Stoch. Anal. Appl., 1 (1983), 21–55. Chernoff, H. [1] “Large sample theory: parametric case,” Ann. Math. Statist., 27 (1956), 1–22. Chernoff, H., and Scheffé, H. [1] “A generalization of the Neyman-Pearson fundamental lemma,” Ann. Math. Statist., 23 (1952), 213–225. Chiang, T.-P. [1] “On the linear extrapolation of a continuous homogeneous random field,” Theor. Prob. Appl., 2 (1957), 58–89. Chipman, J. S. [1] “Linear regression, risk reduction, and biased estimation in linear regression,” Linear Algebra and its Applications, 289 (1999), 55–74. [2] Advanced Econometric Theory, Routledge, London and New York, 2011. Chipman, J. S., and Rao, M. M. [1] “Projections, generalized inverses, and quadratic forms,” J. Math. Anal. Appl., 9 (1964), 1–11. Choksi, J. R. [1] “Inverse limits of measure spaces,” Proc. Lond. Math. Soc., (3) 8 (1958), 321–342. Chow, Y. S., and Teicher, H. [1] Probability Theory: Independence, Interchangeability, Martingales, Springer, New York, 1978. Coddington, E. A., and Levinson, N.
[1] Theory of Ordinary Differential Equations, McGraw-Hill, New York, 1955. Conkwright, N. B. [1] Introduction to the Theory of Equations, Ginn and Co., Boston, 1957. Cram´er, H. [1] Mathematical Methods of Statistics, Princeton Univ. Press, Princeton, NJ, 1946. [2] “On some classes of nonstationary stochastic processes,” Proc. 4th Berkeley Symp. Math. Statist. and Prob., 1 (1961), 57–77. [3] “On the structure of purely nondeterministic stochastic processes,” Ark. Mat., 4 (1961), 249–266. [4] Random Variables and Probability Distributions, Camb. University Press, Cambridge, UK, 1970 (3rd ed). [5] Structural and Statistical Problems for a Class of Stochastic Processes, Princeton Univ. Press, Princeton, NJ, 1971. [6] “A contribution to the theory of stochastic processes,” Proc. 2nd Berkeley Symp. Math. Statist. and Prob., (1951), 329-339. Cram´er, H., and Leadbetter, M. R. [1] Stationary and Related Stochastic Processes, Wiley, New York, 1967. Crum, M. M. [1] “On positive definite functions,” Proc. Lond. Math. Soc.,(3) 6 (1956), 548–560. Curtain, R. F., and Pritchard, A. J. [1] “The infinite dimensional Riccati equation for systems defined by evolution operators,” SIAM J. Control and Optim., 14 (1976), 951–983. Dalecki˘i, Ju. L., and Kre˘in, M. G. [1] Stability of Solutions of Differential Equations in Banach Spaces, Amer. Math Soc., Providence, RI, 1974. Dantzig, G. B., and Wald, A. [1] “On the fundamental lemma of Neyman and Pearson,” Ann. Math. Statist., 22 (1951), 88–93. Davis, M. H. A., and Vinter, R. B. [1] Stochastic Modelling and Control, Chapman and Hall, London, UK, 1985. Day, M. M. [1] Normed Linear Spaces, Springer, New York, 1962. DeGroot, M. H., and Rao, M. M. [1] “Bayes estimation with convex loss,” Ann. Math. Statist., 34 (1963), 839–846.
[2] “Multidimensional information inequalities and prediction,” Multivariate Analysis, Academic Press, New York, (1966), 287–313. Diestel, J., and Uhl, Jr. J. J. [1] Vector Measures, Amer. Math. Soc. Surveys, Providence, RI, 1977. Dinculeanu, N. [1] Vector Measures, Pergamon Press, London, UK, 1967. [2] Vector Integration and Stochastic Integration in Banach Spaces, Wiley-Interscience, New York, 2000. Dobrakov, I. [1] “On integration in Banach spaces, VIII (Polymeasures),” Chech. Math. J., 37 (1987), 487–506. Dobrushin, R. L., and Minlos, R. A. [1] “Polynomials in linear random functions,” Russian Math. Surveys, 32(2) (1971), 71–127. Dol´eans-Dade, C. [1] “Quelques applications de la formule de changement de variables for les semimartingales,” Z. Wahrs., 16 (1970), 181–194. Dolph, C. L., and Woodbury, M. A. [1] “On the relation between Green’s function and covariances of certain stochastic processes and its application to unbiased linear prediction,” Trans. Am. Math. Soc., 72 (1952), 519–550. Doob, J. L. [1] “The Brownian moment and stochastic equations,” Ann. Math., 43 (1942), 351–369. [2] Stochastic Processes, Wiley, New York, 1953. Douglas, R. G. [1] “Contractive projections in an L1 -space,” Pacific J. Math., 15 (1965), 443–462. Dubins, L. E., and Freedman, D. A. [1] “Random distribution functions,” Bull. Am. Math. Soc., 69 (1963), 548–551. Dunford, N., and Schwartz, J. T. [1] Linear Operators, Part I: General Theory, Interscience, New York, 1958. Duttweiler, D. L., and Kailath, T. [1] “RKHS approach to detection and estimation problems, Part IV: Non-Gaussian detection,” IEEE Trans. Inf. Th., IT- 19 (1973), 19–28. [2] “RKHS approach to detection and estimation problems, Part V: Parameter estimation,” IEEE Trans. Inf. Th., IT- 19 (1973), 29– 37. Dvoretzky, A., Kiefer, J., and Wolfowitz, J.
[1] “Sequential decision problems for processes with continuous time parameter: problems of estimation,” Ann. Math. Statist., 24 (1953), 403–415. Dynkin, E. B. [1] Theory of Markov Processes, Prentice-Hall, Oxford, 1960. [2] Markov Processes, Vols. I, II, Academic Press, New York, 1965. Elliot, R. J., and Glowinski, R. [1] “Approximations to solutions of the Zakai filtering equation,” Stoch. Anal. Appl., 7 (1989), 145–168. Ennis, P. [1] “On the equation E(E(X|Y )) = E(X),” Biometrika, 60 (1973), 432–433. Erdős, P., and Kac, M. [1] “On certain limit theorems in the theory of probability,” Bull. Am. Math. Soc., 52 (1946), 292–302. Feldman, J. [1] “Equivalence and perpendicularity of Gaussian processes,” Pacific J. Math., 8 (1958), 699–708; correction, ibid., 9, 1295–1296. [2] “Decomposable processes and continuous products of random processes,” J. Functional Anal., 8 (1971), 1–51. Feller, W. [1] An Introduction to Probability Theory and its Applications, Vols. I, II, Wiley, New York, 1957; 1966. Fend, A. V. [1] “On the attainment of Cramér-Rao and Bhattacharyya bounds for the variance of an estimate,” Ann. Math. Statist., 30 (1959), 381–388. Fisher, R. A. [1] “On the mathematical foundations of theoretical statistics,” Phil. Trans. Roy. Soc. (London, Ser. A), 222 (1921), 309–368. Fleming, R. J., Goldstein, J. A., and Jamison, J. E. [1] “One parameter groups of isometries on certain Banach spaces,” Pacific J. Math., 64 (1976), 145–151. Fleming, W. H., and Pardoux, E. [1] “Optimal control for partially observed diffusions,” SIAM J. Control and Optim., 20 (1982), 261–285. Fraser, D. A. S. [1] Nonparametric Methods in Statistics, Wiley, New York, 1957. Fuller, W. A. [1] “Nonstationary autoregressive time series,” Handbook of Statistics, Vol. 5, North-Holland, Amsterdam, The Netherlands, (1985), 1–23.
[2] Introduction to Statistical Time Series, Wiley, New York, 1996 (2nd ed.). Gel’fand, I. M., and Vilenkin, N. Ya. [1] Generalized Functions, 4: Applications of Harmonic Analysis, Academic Press, New York, 1964. Getoor, R. K. [1] “The shift operator for nonstationary stochastic processes,” Duke Math. J, 23 (1956), 175–187. Gikhman, I. I., and Skorokhod, A. V. [1] “On the densities of probability measures in function spaces,” Russian Math Surveys, 21(6) (1966), 83–156. Girsanov, I. V. [1] “On transformations of a certain class of stochastic processes with the help of absolutely continuous substitution of the measures,” Theor. Prob. Appl., 3 (1960), 285–301. Gladyshev, E. G. [1] “A new limit theorem for stochastic processes with Gaussian increments,” Theor. Prob. Appl., 6 (1961), 52–61. Gnedenko, B. V., and Kolmogorov, A. N. [1] Limit Distributions for Sums of Independent Random Variables, Addison-Wesley, Redding, MA., 1954. Gohberg, I. C., and Kre˘in, M. G. [1] Theory and Applications of Volterra Operators in Hilbert Space, Amer. Math. Soc., Providence, RI, 1970. Goldstein, J. A. [1] “An existence theorem for linear stochastic differential equations,” J. Diff. Eq., 3 (1967), 78–87. [2] “Groups of isometries on Orlicz spaces,” Pacific J. Math., 48 (1973), 387–393. [3] Semi-Groups of Linear Operators and Applications, Oxford University Press, New York, 1985. Golosov, Ju. L. [1] “Gaussian measures equivalent to Gaussian Markov measures,” Soviet Math. (Doklady) 7 (1966), 48–52. Green, M. L. [1] “Planar stochastic integration relative to quasi-martingales,” In Real and Stochastic Analysis: Recent Advances, CRC Press, Boca Raton, FL, (1997), 65–157. Grenander, U. [1] “Stochastic processes and statistical inference,” Arkiv fur Mat., 1 (1950), 195–277. [2] Abstract Inference, Wiley, New York, 1981.
[3] General Pattern Theory, Oxford University Press, London, UK, 1993. [4] “Template estimation” (Private Communication), (1998). Grenander, U., and Miller, M. I. [1] “Representations of knowledge in complex systems,” J. R. Statist. Soc. Ser B, 56 (1994), 549–603. [2] “Computational anatomy: an emerging discipline”, Quarterly Appl. Math., 56 (1998), 617–694. Grenander, U., Miller, M. I., and Srivastava, A. [1] “Hilbert-Schmidt lower bounds for estimators on matrix Lie groups for ATR,” IEEE Trans. Pattern Anal. Mech. Intel., 20 (1998), 790–802. Grenander, U., and Rosenblatt, M. [1] “Statistical spectral analysis of time series arising from stochastic processes,” Ann. Math. Statist., 24 (1953), 537–558. [2] Statistical Analysis of Stationary Time Series, Wiley, New York, 1957. Gretsky, N. E. [1] “Representation theorems for Banach function spaces,” Mem. Am. Math. Soc., 84 (1968), 1–56. Guichardet, A. [1] Symmetric Hilbert Spaces and Related Topics, Lect. Notes in Math., 261, 1972, Springer, New York. H´ ajek, J. [1] “On a property of normal distribution of any stochastic processes,” Chech. Math. J., 8 (1958), 610–618. Hanin, L. G., and Schreiber, B. M. [1] “Discrete spectrum of nonstationary stochastic processes on LCA groups,” J. Theor. Prob., 11 (1998), 1111–1133. Hanin, L. G., and Schwarz, M. A. [1] “Consistent statistical estimate of spectral measure discrete component for a class of random functions”, Nonparametric Statist., 2 (1992), 81–87. Hannan, E. J. [1] “The concept of a filter,” Proc. Camb. Phil. Soc., 63 (1967), 221– 227. [2] Multiple Time Series, Wiley, New York, 1970. Hardin, C. D. [1] “On the linearity of regression,” Z. Wahrs. 61 (1982), 291–302. Hardy, G. H. [1] Divergent Series, Oxford Univ. Press, London, UK, 1949. Hays, C. A., and Pauc, C. Y. [1] Derivation and Martingales, Springer, New York, 1970.
Hewitt, E., and Ross, K. A. [1] Abstract Harmonic Analysis I, Springer, New York, 1963. Hida, T. [1] “Canonical representation of Gaussian processes and their applications,” Mem. Coll. Sci. Univ. Kyoto, Ser. A, 38 (1960), 109–155. Hida, T., and Hitsuda, M. [1] Gaussian Processes, Providence, RI, 1993. Hida, T., and Ikeda, N. [1] “Analysis on Hilbert space with reproducing kernel arising from multiple Wiener integral,” Proc. 5th Berkeley Symp. Math. Stat. and Prob., 2, part I, (1967), 117–143. Higgins, J. R. [1] Sampling Theory in Fourier and Signal Analysis: Foundations, Oxford Science Publications, Oxford, UK, 1996. Hille, E., and Phillips, R. S. [1] Functional Analysis and Semi-Groups, (2nd ed.), Amer. Math. Soc., Providence, RI, 1957. Hirai, T., and Shimomura, H. [1] “Relations between unitary representations of diffeomorphism groups and those of the infinite symmetric group or of related permutation groups,” J. Math. Kyoto Univ., 37 (1997), 261–316. Hitsuda, M. [1] “Representation of Gaussian processes equivalent to Wiener processes,” Osaka J. Math., 5 (1968), 299–312. [2] “Formula for Brownian partial derivatives,” Second Japan-USSR Symp. Prob. Th., 2 (1972), 111–114. Hoeffding, W., and Wolfowitz, J. [1] “Distinguishability of sets of distributions,” Ann. Math. Statist., 29 (1958), 700–718. Holden, H., Øksendal, B., Ubøe, J., and Zhang, T. [1] Stochastic Partial Differential Equations, Birkhäuser, Boston, MA, 1996. Hurd, H. L. [1] “Representation of strongly harmonizable periodically correlated processes and their covariances,” J. Multivar. Anal., 29 (1989), 53–67. Hurewicz, W. [1] Lectures on Ordinary Differential Equations, The MIT Press, Cambridge, MA, 1958. Hwang, C.-R. [1] “Conditioning by EQUAL, LINEAR,” Trans. Am. Math. Soc., 274 (1983), 69–83.
Ibragimov, I. A. [1] “On estimation of the spectral function of a stationary Gaussian process,” Theor. Prob. Appl., 8 (1963), 366–401. Ibragimov, I. A., and Linnik, Ju. V. [1] Independent and Stationary Sequences of Random Variables, Noordhoff Publishers, The Netherlands, 1971. Ionescu Tulcea, A., and C. [1] Topics in the Theory of Lifting, Springer, New York, 1969. Ionescu Tulcea, C. [1] “Mesures dans les espaces produits,” Atti Acad. Nat. Lincei Rend., 7 (1949), 208–211. Ince, E. L. [1] Ordinary Differential Equations, Longmans, Green, and Co., London, 1927. Issacson, D. [1] “Stochastic integrals and derivatives,” Ann. Math. Statist., 40 (1969), 1610–1616. Isii, K. [1] “Inequalities of the types of Chebyshev and Cram´er-Rao and mathematical programming,” Ann. Inst. Statist. Math., 16 (1964), 277–293. Itˆo, K. [1] “On a formula concerning stochastic differentials,” Nagoya Math. J., 3 (1951), 55–65. [2] “Multiple Wiener integral,” J. Math. Soc. Japan, 3 (1951), 157– 169. Jensen, D. R., and Ramirez, D. E. [1] “Anomalies in the foundations of ridge regression,” International Statist. Review, 76 (2008), 89–105. Johansen, S., and Karush, J. [1] “On the semi-martingale convergence theorem,” Ann. Math. Stat., 37 (1966), 690–694. Jordan, K. [1] Calculus of Finite Differences, R¨ottig and Romwalter, Budapest, 1939. Joshi, V. M. [1] “On the attainment of the Cram´er-Rao lower bound,” Ann. Statist., 4 (1976), 998–1002. Kac, M., and Slepian, D. [1] “Large excursions of Gaussian processes,” Ann. Math. Statist., 30 (1959), 1215–1228. Kadota, T. T.
[1] “Simultaneous diagonalization of two covariance kernels and application to second-order stochastic processes,” SIAM J. Appl. Math., 15 (1967), 1470–1480. Kailath, T. [1] “On measures equivalent to Wiener measure,” Ann. Math. Statist., 38 (1967), 261–263. [2] “A general likelihood ratio formula for random signals in Gaussian noise,” IEEE Trans. Inf. Th., IT- 15 (1969), 350–361. [3] “The structure of Radon-Nikod´ ym derivatives with respect to the Wiener and related measures,” Ann. Math. Statist., 42 (1971), 1054–1067. [4] “RKHS approach to detection and estimation problems, Part I: Deterministic signals in Gaussian noise,” IEEE Trans. Inf. Th., IT- 17 (1971), 530–549. Kailath, T., and Duttweiler, D. [1] “An RKHS approach to detection and estimation problems, Part III: Generalized innovations representation and a likelihood-ratio formula,” IEEE Trans. Inf. Th., IT- 18 (1972), 718–745. Kailath, T., Geesey, R. T., and Weinert, H. L. [1] “Some relations among RKHS norms, Fredholm equations, and innovation representations,” IEEE Trans. Inf. Th., IT- 18 (1972), 341–348. Kailath, T., and Weinert, H. L. [1] “An RKHS approach to detection and estimation problems, Part II: Gaussian signal detection,” IEEE Trans. Inf. Th., IT- 21 (1975), 15–23. Kakihara, Y. [1] Multidimensional Second Order Stochastic Processes, World Scientific, Singapore, 1997. Kakutani, S. [1] “Some characterizations of Euclidean space” Japan J. Math., 16 (1939), 93–97. [2] “On the equivalence of infinite product measures,” Ann. Math., 49 (1948), 214–224. Kallianpur, G., and Streibel, C. T. [1] “Estimation of stochastic systems: arbitrary system process with additive white noise observation error,” Ann. Math. Statist., 39 (1968), 785–801. Kalman, R. E. [1] “A new approach to linear filtering and prediction problems,” J. Basic Eng., 82 (1960), 35–45. Kalman, R. E., and Bucy, R. S.
[1] “New results in linear filtering and prediction theory,” J. Basic Eng., 83 (1961), 95–108. Kampé de Fériet, J. [1] “Correlation and spectrum of asymptotically stationary random functions,” Math. Student, 30 (1962), 55–67. Kampé de Fériet, J., and Frenkiel, F. N. [1] “Estimation de la corrélation d’une fonction aléatoire non stationnaire,” C. R. Acad. Sci., Paris, 249 (1959), 348–351. [2] “Correlations and spectra for non-stationary random functions,” Math. Comput., 16 (1962), 1–21. Kanter, M. [1] “Linear sample space and stable processes,” J. Functional Anal., 9 (1972), 441–459. Karhunen, K. [1] “Über lineare Methoden in der Wahrscheinlichkeitsrechnung,” Ann. Acad. Sci. Fenn. A1, 37 (1947), 1–79. Kato, T. [1] Perturbation Theory for Linear Operators, Springer, New York, 1966. Kelsh, J. P. [1] Linear Analysis of Harmonizable Time Series, Ph.D. Thesis, UCR Library, Riverside, CA, 1978. Kiefer, J. [1] “On minimum variance estimators,” Ann. Math. Statist., 23 (1952), 627–628. Kingman, J. F. C. [1] “Completely random measures,” Pacific J. Math., 21 (1967), 59–78. Kluvánek, I. [1] “Sampling theorem in abstract harmonic analysis,” Matem.-Fyz. Časop. Sav., 15 (1965), 43–47. Kluvánek, I., and Knowles, G. [1] Vector Measures and Controlled Systems, North-Holland Math. Studies, Amsterdam, The Netherlands, 1975. Kluvánek, I., and Kovárikova, M. [1] “Product of spectral measures,” Chech. Math. J., 17 (1973), 248–256. Kolmogorov, A. N. [1] Foundations of the Theory of Probability, Chelsea, New York, 1933. (Translation, 1956.) Kozek, A. [1] “On the theory of estimation with convex loss functions,” Proc. Symp. in honor of J. Neyman, PWN publishers, Warszawa, (1977),
177–202. Kraft, C. [1] “Some conditions for consistency and uniform consistency of statistical procedures,” Univ. of Calif. Publ. Statist., 2 (1955), 125– 142. Krasnoselskii, M. A., and Rutickii, Ya. B. [1] Convex Functions and Orlicz Spaces, P. Noordhoff, Groningen, Netherlands, 1961. Krinik, A. [1] “Diffusion processes in Hilbert space and likelihood ratios,” In Real and Stochastic Analysis, Wiley, New York (1986), 168–210. K¨ uhn, T., and Liese, F. [1] “ A short proof of the H´ ajek-Feldman theorem,” Theor. Prob. Appl., 23 (1978), 448–450. Kulldorff, G. [1] “On the conditions for consistency and asymptotic efficiency of maximum likelihood estimates,” Skand. Aktuar., 40 (1957), 129– 144. Kunita, H. [1] Stochastic Flows and Stochastic Differential Equations, Cambridge Univ. Press, Cambridge, UK, 1990. Kunita, H., and Watanabe, S. [1] “On square integrable martingales,” Nagoya Math. J., 30 (1967), 209–245. Kuo, H.- H. [1] White Noise Distribution Theory, CRC Press, Boca Raton, FL, 1996. Kwakernaak, H., and Sivan, R. [1] Linear Optimal Control Systems, Wiley-Interscience, New York, 1972. Lai, T. L., and Wei, C. Z. [1] “Asymptotic properties of general autoregressive models and the strong consistency of least squares estimates of their parameters,” J. Multivar. Anal., 13 (1983), 1–23. Lamperti, J. [1] “On the isometries of certain function spaces,” Pacific J. Math., 8 (1958), 459–466. Laning, J. H., and Battin, R. H. [1] Random Processes in Automatic Control, McGraw-Hill, New York, 1956. Lebedev, N. N. [1] Special Functions and Their Applications, Dover Publications, New York, 1972.
Lee, A. J. [1] “On band limited stochastic processes,” SIAM J. Appl. Math., 30 (1976), 269–277. Lehmann, E. L. [1] Testing Statistical Hypotheses, Wiley, New York, 1958. Leonov, V. P., and Shiryayev, A. N. [1] “On the technique of computing semi-invariants,” Theor. Prob. Appl., 4 (1959), 319–329. Liapounov, A. [1] “Sur les fonctions-vecteurs complètement additives,” Izv. Akad. Nauk SSSR, Ser. Mat., 4 (1940), 465–478. Linnik, Ju. V. [1] Statistical Problems with Nuisance Parameters, Amer. Math. Soc., Providence, RI, 1968. Linnik, Ju. V., and Rukhin, A. L. [1] “Convex loss functions in the theory of unbiased estimation,” Soviet Math. Dokl., 12 (1971), 839–842. Liptser, R. S., and Shiryayev, A. N. [1] Statistics of Random Processes, I, II, Springer, New York, 1977. Lloyd, S. P. [1] “A sampling theorem for stationary (wide sense) stochastic processes,” Trans. Am. Math. Soc., 92 (1959), 1–12. Loève, M. [1] Probability Theory, D. Van Nostrand Co., Princeton, NJ, 1955. Lototsky, S., and Rozovskii, B. L. [1] “Recursive multiple Wiener integral expansion for nonlinear filtering of diffusion processes,” In Stochastic Processes and Functional Analysis, Lect. Notes in Pure and Appl. Math., 186, Marcel Dekker, New York, (1997), 199–208. Lukacs, E., and Laha, R. G. [1] Applications of Characteristic Functions, Hafner, New York, 1964.
[2] “On stochastic limit and order relationships,” Ann. Math. Statist., 14 (1943), 217–226. Maruyama, G. [1] “Infinitely divisible processes” Theor. Prob. Appl., 15 (1970), 1– 22. Masani, P., and Rosenberg, M. [1] “When is an operator the integral of a given spectral measure?,” J. Functional Anal., 21 (1976), 88–121. Mautner, F. I. [1] “Unitary representations of locally compact groups, I,II,” Ann. Math., 51 (1950), 1–25; 52 (1950), 528–556. Maynard, H. B. [1] “A general Radon-Nikod´ ym theorem,” Proc. Conf. Vector and Operator valued Measures and Applications, Acad. Press, (1973), 233–246. McGhee, D. F., and Picard, R. H. [1] Cordes’ Two-Parameter Spectral Representation Theory, Pitman Research Notes, Wiley, New York, 1988. Mehlman, M. H. [1] “Structure and moving average representation for multidimenstional strongly harmonizable processes,” Stoch. Anal. Appl., 9 (1991), 323–361. Mel’nikov, A. V. [1] “Stochastic differential equations, singularity of coefficients, regression models, and stochastic approximation,” Russian Math. Surveys, 51(5) (1996), 819–909. M´etivier, M. [1] “Limits projectives de measures. Martingales, Applications,” Ann. Math. Pur. Appl. 63 (1963), 225–352. M´etivier, M., and Pellemail, J. [1] Stochastic Integration, Academic Press, New York, 1980. Meyer, P. A. [1] “Sur une probleme de filration,” Sem. d. Prob. VII Lecture Notes in Math., 321 (1973), 223–247. Mikulevicius, R., and Rozovskii, B. L. [1] “Martingale problems for stochastic SPDEs,” In Stochastic Partial Differential Equations: Six Perspectives, Amer. Math. Soc. Surveys, Providence, RI (1999), 243–325. Miller, G. [1] “Properties of certain symmetric stable distributions,” J. Multivar. Anal., 8 (1978), 344–360. Mizel, V. J., and Rao, M. M.
[1] “Nonsymmetric projections in Hilbert space,” Pacific J. Math., 12 (1962), 343–357. Mizumachi, H., and Sato, H. [1] “Absolute continuity of similar translates,” J. Math. Kyoto Univ., 37 (1997), 317–326. Moedomo, S., and Uhl, Jr, J. J. [1] “Radon-Nikod´ ym theorems for the Bochner and Pettis integrals,” Pacific J. Math., 38 (1971), 531–536. Morse, M., and Transue, W. [1] “C-bimeasures and their integral extensions,” Ann. Math., 64 (1956), 480–504. Nagabhushanam, K. [1] “The primary process of a smoothing relation,” Ark. Mat., 1 (1951), 421–488. Na˘imark, M. A. [1] Normed Rings, Nordhoff, Groningen, The Netherlands, 1964. Neveu, J. [1] Mathematical Foundations of the Calculus of Probability, HoldenDay, San Francisco, CA., 1965. [2] Processus Al´eatoires Gaussiens, U. of Montr´eal, Montr´eal, Canada, 1968. Newman, C. M. [1] “The inner product of path space measures corresponding to random processes with independent increments,” Bull. Am. Math. Soc., 78 (1972), 268–271. [2] “On the orthogonality of independent increment processes,” In Topics in Probability, NYU Courant Inst. (1973), 93–111. Neyman, J., and Pearson, E. S. [1] “On the problem of the most efficient tests of statistical hypotheses,” Phil. Trans. Roy. Soc., 231 (1933), 289–337. [2] “On the testing of statistical hypotheses in relation to probability a priori,” Proc. Camb. Phil. Soc. 29 (1933), 492–510. [3] Joint Statistical Papers, Univ. of Calif. Press, Berkeley and Los Angeles, CA 1966. Novikov, A. A. [1] “On an identity for stochastic integrals,” Theor. Prob. Appl., 17 (1973), 717–720. Nualart, D. [1] The Mulliavin Calculus and Related Topics, Springer, New York, 1995. Palais, R. S. [1] “Natural operations on differential forms,” Trans. Am. Math. Soc., 92 (1959), 125–141.
Paley, R. E. A. C., Wiener, N., and Zygmund, A. [1] “Notes on random functions,” Math. Zeit., 37 (1933), 647–668. Pardoux, E. [1] “Filtrage non lineaire et equation dux derivees partielles stochastiques associees,” Lect. Notes Math., 1464 (1989), 69–163. Park, W. J. [1] “On the equivalence of Gaussian processes with factorable covariance functions,” Proc. Am. Math. Soc., 32 (1972), 275–279. Parzen, E. [1] “An approach to time series analysis,” Ann. Math. Statist., 32 (1961), 951–989. [2] “Extraction and detection problems and reproducing kernel Hilbert spaces,” SIAM J. Control, 1 (1962), 35–62. [3] “Probability density functionals and reproducing kernel Hilbert spaces,” In Proc. Symp. Time Series Analysis, Wiley, New York, (1963), 155–169. [4] “Spectral analysis of asymptotically stationary time series,” Bull. Inst. Internat. Statist., 29(2) (1962), 87–103. [5] Time Series Analysis Papers, Holden-Day, San Francisco, CA, 1967. Phillips, R. S. [1] “On weakly compact subsets of a Banach space,” Amer. J. Math., 65 (1943), 108–136. Piranashvilli, Z. A. [1] “On the problem of interpolation of random processes,” Theor. Prob. Appl., 7 (1967), 647–657. Pitcher, T. S. [1] “Likelihood ratios of Gaussian processes,” Ark. Mat. 4 (1959), 35–44. [2] “Likelihood ratios for diffusion processes with shifted mean values,” Trans. Am. Math. Soc., 101 (1961), 168–176. [3] “The admissable mean values of a stochastic process,” Trans. Am. Math. Soc., 108 (1963), 538–546. [4] “Likelihood ratios for stochastic processes related by groups of transformations, I and II,” Illinois J. Math., 7 (1963), 396–414; 8 (1964), 271–279. [5] “Parameter estimation for stochastic processes,” Acta Math., 112 (1964), 1–40. [6] “The behavior of likelihood ratios of stochastic processes related by groups of transformations,” Ann. Math. Statist., 38 (1965), 529–534. [7] “A more general property than domination for sets of probability measures,” Pacific J. Math., 15 (1965), 597–611.
[8] “An integral expression for the log likelihood ratio for two Gaussian processes,” SIAM J. Appl. Math., 14 (1966), 228–233.
Plessner, A. I., and Rohlin, V. A.
[1] “Spectral theory of linear operators, II,” Uspekhi Matem. Nauk (N. S.), 1 (1946), 71–191.
Pogány, T.
[1] “Almost sure sampling restoration of band limited stochastic signals,” (Preprint (1995), 29 pp).
Pourahmadi, M.
[1] “A sampling theorem for multivariate stationary processes,” J. Multivar. Anal., 13 (1983), 177–186.
Priestley, M. B.
[1] “Evolutionary spectra and nonstationary processes,” J. R. Statist. Soc. Ser. B, 27 (1965), 204–237.
Protter, P.
[1] Stochastic Integration and Differential Equations: A New Approach, Springer, New York, 1990.
Rao, C. R.
[1] “Information and accuracy attainable in the estimation of statistical parameters,” Bull. Calcutta Math. Soc., 37 (1945), 81–91.
Rao, C. R., and Mitra, S. K.
[1] Generalized Inverse of Matrices and Its Applications, Wiley, New York, 1971.
Rao, C. R., and Varadarajan, V. S.
[1] “Discrimination of Gaussian processes,” Sankhyā, Ser. A, 25 (1963), 303–330.
Rao, M. M.
[1] “Theory of lower bounds for risk functions in estimation,” Math. Ann., 143 (1961), 379–398.
[2] “Consistency and limit distributions of estimators of parameters in explosive stochastic difference equations,” Ann. Math. Statist., 32 (1961), 195–218.
[3] “Conditional expectations and closed projections,” Indag. Math., 27 (1965), 100–112.
[4] “Inference in stochastic processes, I–VI”: (a) Theor. Prob. Appl., 9 (1963), 217–233; (b) Z. Wahrs., 5 (1966), 317–335; (c) ibid., 11 (1967), 49–72; (d) Sankhyā, 33 (1974), 63–120; (e) ibid., 37 (1975), 538–549; (f) Multivariate Analysis-IV, (North-Holland) (1977), 311–324.
[5] “Existence and determination of optimal estimators relative to convex loss,” Ann. Inst. Statist. Math., 17 (1965), 133–147.
[6] “Notes on pointwise convergence of closed martingales,” Indag. Math., 29 (1967), 170–176.
[7] “Abstract Lebesgue-Radon-Nikodým theorems,” Ann. Mat. Pur. Appl., 76 (1967), 107–132.
[8] “Prediction sequences in smooth Banach spaces,” Ann. Inst. H. Poincaré, 8 (1972), 319–332.
[9] “Remarks on a Radon-Nikodým theorem for vector measures,” Proc. Conf. Vector and Operator Valued Measures and Applications, Academic Press, (1973), 303–317.
[10] “Conditional measures and operators,” J. Multivar. Anal., 5 (1975), 330–413.
[11] “Covariance analysis of nonstationary time series,” In Developments in Statistics I, Academic Press, New York, (1978), 171–225.
[12] Foundations of Stochastic Analysis, Academic Press, New York, 1981. (Dover Edition, 2011.)
[13] “Harmonizable processes: structure theory,” L’Enseign. Math., 28 (1982), 295–351.
[14] “Application and extension of Cramér’s theorem on distributions of ratios,” In Statistics and Probability, North-Holland, Amsterdam, (1982), 617–633.
[15] Probability Theory with Applications, Academic Press, New York, 1984. (Second Edition, Springer, Berlin (with R. J. Swift), 2006.)
[16] “Harmonizable, Cramér, and Karhunen classes of processes,” Handbook of Statistics, Vol. 5, Time Series in the Time Domain, North-Holland, Amsterdam, (1985), 279–310.
[17] Measure Theory and Integration, Wiley-Interscience, New York, 1987. (Second enlarged edition, Marcel Dekker-CRC Press, 2004.)
[18] Conditional Measures and Applications, Marcel Dekker, New York, 1993. (Second enlarged edition, Chapman-Hall/CRC Press, 2005.)
[19] “Exact evaluation of conditional expectations in Kolmogorov’s model,” Indian J. Math., 35 (1993), 57–70.
[20] “Harmonizable processes and inference: unbiased prediction for stochastic flows,” J. Statist. Plan. Inf., 39 (1994), 187–209.
[21] Stochastic Processes: General Theory, Kluwer Academic, Dordrecht, The Netherlands, 1995. (Now Springer, Berlin.)
[22] “Nonlinear prediction with increasing loss,” J. Comb. Infor. and System Sci., 23 (1998), 181–186.
[23] “Characterizing covariances and means of harmonizable processes,” in Trends in Contemporary Infinite Dimensional Analysis and Quantum Probability (Eds. L. Accardi et al.), Kyoto (2000), 363–381.
[24] “Martingales and some applications,” Handbook of Statistics, Vol. 19, Stochastic Processes, North-Holland, Amsterdam, (1999), 765–816.
[25] Stochastic Processes and Integration, Sijthoff and Noordhoff, Alphen aan den Rijn, The Netherlands, 1979.
[26] “Asymptotic distribution of an estimator of the boundary parameter of an unstable process,” Ann. Statist., 6 (1978), 185–190; Addenda, ibid., 7 (1979).
[27] “Sampling and prediction for harmonizable isotropic random fields,” J. Comb. Infor. and System Sci., 16 (1991), 207–220.
[28] “Higher order stochastic differential equations,” In Real and Stochastic Analysis: Recent Advances, CRC Press, Boca Raton, FL (1997), 225–302.
[29] “Representations of conditional means,” Georgia Math. J., 8 (2001), 363–376.
[30] “Linear regression for random measures,” Advances in Multivariate Statistical Methods, World Scientific, Singapore (2009), 131–144.
[31] Random and Vector Measures, World Scientific, Singapore, 2012.
Rao, M. M., and Ren, Z. D.
[1] Theory of Orlicz Spaces, Marcel Dekker, New York, 1991.
[2] Applications of Orlicz Spaces, Marcel Dekker, New York, 2002.
Rao, M. M., and Sazonov, V. V.
[1] “A projective limit theorem for probability spaces and applications,” Theor. Prob. Appl., 38 (1993), 307–315.
Rényi, A.
[1] “On a new axiomatic theory of probability,” Acta Math. Hung., 6 (1955), 285–333.
Revuz, D., and Yor, M.
[1] Continuous Martingales and Brownian Motion, Springer, New York, (1991; 2nd ed., 1994; 3rd ed., 1999).
Riesz, F., and Sz.-Nagy, B.
[1] Functional Analysis, F. Ungar Publishing Co., New York, 1955.
Root, W. L.
[1] “Singular Gaussian measures in detection theory,” Proc. Symp. Time Series Analysis, Wiley, New York, (1963), 292–315.
Rosenberg, M.
[1] “The square integrability of matrix functions with respect to a non-negative Hermitian measure,” Duke Math. J., 31 (1964), 291–298.
Rosenberg, R. L.
[1] “Orlicz spaces based on families of measures,” Studia Math., 35 (1970), 15–49.
Rosenblatt, M.
[1] “A central limit theorem and a strong mixing condition,” Proc. Nat. Acad. Sci., U.S.A., 42 (1956), 43–47.
[2] Stationary Sequences and Random Fields, Birkhäuser, Boston, MA, 1985.
Rosiński, J.
[1] “Stochastic integral representations of stable processes with sample paths in Banach spaces,” J. Multivar. Anal., 26 (1983), 277–302.
Rosiński, J., and Szulga, J.
[1] “Product random measures and double stochastic integrals,” In Martingale Theory in Harmonic Analysis and Banach Spaces, Lect. Notes in Math., 850 (1982), 181–199.
Roussas, G. G., and Ioannides, D.
[1] “Moment inequalities for mixing sequences of random variables,” Stoch. Anal. Appl., 5 (1987), 61–120.
Roy, S. N.
[1] Some Aspects of Multivariate Analysis, Wiley, New York, 1957.
Royden, H. L.
[1] Real Analysis, Macmillan Co., (2nd ed.), New York, 1968.
Rozanov, Yu. A.
[1] Infinite-dimensional Gaussian Distributions, Amer. Math. Soc., Providence, RI, 1978.
[2] “Spectral analysis of abstract functions,” Theor. Prob. Appl., 4 (1959), 271–287.
[3] Stationary Random Processes, Holden-Day, San Francisco, CA, 1967.
Ryll-Nardzewski, C.
[1] “Remarks on processes of calls,” Proc. 4th Berkeley Symp. Math. Statist. and Prob., 2 (1961), 455–465.
Schatten, R.
[1] Norm Ideals of Completely Continuous Operators, Springer, New York, 1960.
Schilder, M.
[1] “Some structure theorems for the symmetric stable laws,” Ann. Math. Statist., 41 (1970), 412–421.
Schwartz, L.
[1] Radon Measures on Arbitrary Topological Spaces and Cylindrical Measures, Oxford Univ. Press, London, UK, 1973.
Segal, I. E.
[1] “Fiducial distributions of several parameters with applications to a normal system,” Proc. Camb. Phil. Soc., 34 (1938), 41–47.
Seth, G. R.
[1] “On the variance of estimates,” Ann. Math. Statist., 20 (1949), 1–27.
Shald, S.
[1] “The continuous Kalman filter as the limit of the discrete Kalman filter,” Stoch. Anal. Appl., 17 (1999), 841–856.
Shepp, L. A.
[1] “Radon-Nikodým derivatives of Gaussian measures,” Ann. Math. Statist., 37 (1966), 321–354.
Shintani, T., and Andô, T.
[1] “Best approximants in L1-spaces,” Z. Wahrs., 33 (1975), 33–39.
Shiryayev, A. N.
[1] Statistical Sequential Analysis, Amer. Math. Soc., Providence, RI, 1973.
Skorokhod, A. V.
[1] Studies in the Theory of Random Processes, Addison-Wesley Publishing Co., Reading, MA, 1965.
[2] “On the densities of probability measures in functional spaces,” Proc. 5th Berkeley Symp. Math. Statist. and Prob., 2 (1967), 163–182.
[3] “On admissible translations of measures in Hilbert space,” Theor. Prob. Appl., 15 (1970), 557–580.
[4] “On a generalization of a stochastic integral,” Theor. Prob. Appl., 20 (1975), 219–233.
Soedjak, H.
[1] Asymptotic Properties of Bispectral Density Estimators of Harmonizable Processes, Ph.D. thesis, UCR Library, Riverside, CA, 1996.
[2] “Consistent estimation of the bispectral density of a harmonizable process,” J. Statist. Plan. Inf., (to appear) (1999/00).
[3] “Bispectral density estimation in harmonizable processes,” In Real and Stochastic Analysis, Vol. 4, World Scientific, Singapore (2014), 503–560.
Stein, C.
[1] “A note on cumulative sums,” Ann. Math. Statist., 17 (1946), 489–499.
Stigum, B. P.
[1] “Asymptotic properties of dynamic stochastic parameter estimates, III,” J. Multivar. Anal., 4 (1974), 351–381.
Stone, M. H.
[1] Linear Transformations in Hilbert Space and Their Applications to Analysis, Amer. Math. Soc., Providence, RI, 1932.
Striebel, C. T.
[1] “Densities for stochastic processes,” Ann. Math. Statist., 30 (1959), 559–567.
Stromberg, K. R.
[1] An Introduction to Classical Real Analysis, Wadsworth, Belmont, CA, 1981.
Stroock, D. W., and Varadhan, S. R. S.
[1] Multidimensional Diffusion Processes, Springer, New York, 1979.
Swift, R. J.
[1] “The structure of harmonizable isotropic random fields,” Stoch. Anal. Appl., 12 (1994), 583–616.
[2] “Some aspects of harmonizable processes and fields,” In Real and Stochastic Analysis: Recent Advances, CRC Press, Boca Raton, FL, (1997), 303–365.
Timan, A. F.
[1] Theory of Approximation of Functions of a Real Variable, Macmillan Co., New York, 1963.
Tukey, J. W.
[1] “Some examples of fiducial relevance,” Ann. Math. Statist., 28 (1957), 687–695.
Tweddle, I.
[1] “The exposed points of the range of a vector-valued measure,” Glasgow Math. J., 13 (1972), 61–68.
Uhl, Jr., J. J.
[1] “The range of a vector-valued measure,” Proc. Am. Math. Soc., 23 (1969), 158–163.
Urbanik, K.
[1] “Some prediction problems for strictly stationary processes,” Proc. 5th Berkeley Symp. Math. Statist. and Prob., 2-I (1966), 235–258.
Vakhania, N. N., and Tarieladze, V. I.
[1] “On singularity and equivalence of Gaussian measures,” In Real and Stochastic Analysis: Recent Advances, CRC Press, Boca Raton, FL, (1997).
Varberg, D. E.
[1] “On equivalence of Gaussian measures,” Pacific J. Math., 11 (1961), 751–762.
[2] “Gaussian measures and a theorem of T. S. Pitcher,” Proc. Am. Math. Soc., 13 (1962), 799–807.
Veeh, J. A.
[1] “Equivalence of measures induced by infinitely divisible processes,” J. Multivar. Anal., 13 (1983), 138–147.
Velman, J. R.
[1] Likelihood ratios determined by differentiable families of isometries, Hughes Aircraft Co., Research Report No. 35, (1970). (USC Ph.D. thesis, 1969; AMS Notices, 17, p. 899.)
von Neumann, J., and Morgenstern, O.
[1] Theory of Games and Economic Behavior, Princeton Univ. Press, Princeton, NJ, 1944.
Wald, A.
[1] “Tests of statistical hypotheses concerning several parameters when the number of observations is large,” Trans. Am. Math. Soc., 54 (1943), 426–482.
[2] “Asymptotic properties of the maximum likelihood estimate of an unknown parameter of a discrete stochastic process,” Ann. Math. Statist., 19 (1948), 40–46.
[3] “Basic ideas of a general theory of statistical decision rules,” Proc. Int. Cong. Math., 1 (1950), 231–243.
[4] Statistical Decision Functions, Wiley, New York, 1950.
[5] Sequential Analysis, Wiley, New York, 1947.
Watson, G. N.
[1] A Treatise on the Theory of Bessel Functions, Cambridge Univ. Press, (2nd ed.) London, UK, 1958.
White, J. S.
[1] “The limiting distribution of the serial correlation coefficient in the explosive case,” Ann. Math. Statist., 29 (1958), 1187–1197.
Widder, D. V.
[1] The Laplace Transform, Princeton Univ. Press, Princeton, NJ, 1941.
Wiener, N.
[1] “The homogeneous chaos,” Amer. J. Math., 60 (1938), 897–936.
Wijsman, R. A.
[1] “On the attainment of the Cramér-Rao lower bound,” Ann. Statist., 1 (1973), 538–542.
Wilks, S. S.
[1] Mathematical Statistics, Wiley, New York, 1962.
Wolfowitz, J.
[1] “The efficiency of sequential estimates and Wald’s equation for sequential processes,” Ann. Math. Statist., 18 (1947), 215–230.
Wright, J. D. M.
[1] “A Radon-Nikodým theorem for Stone algebra valued measures,” Trans. Am. Math. Soc., 139 (1969), 75–94.
Wu, R.
[1] Stochastic Differential Equations, Research Note Math., 140, Pitman, Boston, MA, 1985.
Yadrenko, M. I.
[1] Spectral Theory of Random Fields, Optimization Software, New York, 1983.
Yaglom, A. M.
[1] “Second order homogeneous random fields,” Proc. 4th Berkeley Symp. Math. Statist. and Prob., 2 (1960), 593–622.
[2] “On the equivalence or perpendicularity of two Gaussian probability measures in function space,” Proc. Symp. Time Series Analysis, Wiley, New York, (1962), 327–346.
[3] “Strong limit theorems for stochastic processes and orthogonality conditions for probability measures,” Bernoulli, Bayes, Laplace Anniversary Volume, Springer-Verlag, New York, (1965), 253–262.
[4] Correlation Theory of Stationary and Related Random Functions, Vols. I, II, Springer, New York, 1987/8.
Ylvisaker, N. D.
[1] “A generalization of a theorem of Balakrishnan,” Ann. Math. Statist., 32 (1961), 1337–1339.
Zaanen, A. C.
[1] Integration, (2nd ed.) North-Holland, Amsterdam, 1967.
Zakai, M.
[1] “Band-limited functions and the sampling theorem,” Infor. Control, 8 (1965), 143–158.
[2] “On the optimal filtering of diffusion processes,” Z. Wahrs., 11 (1969), 230–243.
Zayed, A. I.
[1] Advances in Shannon’s Sampling Theory, CRC Press, Boca Raton, FL, 1993.
Žurbenko, I. G.
[1] The Spectral Analysis of Time Series, North-Holland, New York, 1986.
Zygmund, A.
[1] Trigonometric Series, Vols. 1, 2, Cambridge Univ. Press, London, UK, 1968.
Notation Index

Note: To minimize proliferation of symbols, the same letters are used for different objects in different chapters.

Chapter I
(Ω, Σ, P)- Probability space, 2
Rn- real Euclidean n-space, 2
ΔF- increment of a distribution F, 2
Fθt1,...,tn- n-dimensional distribution with parameter θ, 3
H0 (H1)- null (alternative) hypothesis, 5
Π- cartesian product symbol, 6
R+ (R̄+)- positive (extended) reals, 7
|W|- Lebesgue measure of a set in Rn, 8
Bn (or B)- Borel σ-algebra of Rn, 10
R(·, ·)- risk function, 11
W(·, ·)- loss function, 11
δ(·)- decision function, 12
M1(AI), N1(AII)- sets of probability measures, 14
Xn →P X (Xn →D X)- convergence in probability (distribution), 16
d(F, G)- Lévy metric, 17
Xn =P Yn if Xn − Yn →P 0, 17

Chapter II
Pc- continuous part of P (relative to μ), 20
dPc/dμ- Radon-Nikodým derivative (= likelihood ratio), 20
LRN = Lebesgue-Radon-Nikodým, 20
C- set of critical or test functions, 22
Σ(A) = {A ∩ C : C ∈ Σ}, 23
S- σ-algebra of S, 26
|μ|, |ν|- variation measures of μ, ν, 30
(0, c) = (0, c1, · · · , cn), an n + 1 vector, 32
μ(Σ), ν(Σ)- ranges of measures, 34
AE(μ)- averaged range of μ, 35
X, X∗- Banach space and its dual, 36
≺- partial order, 36
P(A|B)- conditional probability of A given B, 39
fX|Y (·|·)- conditional density of X given Y, 39
Pμτ(A × B)- weighted product measure, 43
π(·|x, θ)- posterior density, 44
Σ ⊗ I- product σ-algebra, 46
Σ × I- product algebra, 46
{(Ω, Σ, Pθ), θ ∈ I}- family of probabilities Pθ on Σ, 50
PDE- partial differential equation, 51
UMP- uniformly most powerful, 55
∂H- boundary of H, 55
EθB- conditional expectation relative to B and Pθ, 56
Rm- Borel σ-algebra of Rm, 59
Rang(T)- range of T, 60
Sx2, Sy2- sample variances, 62
T(X)- a statistic from vector X, 68
a.a. (a.e.)- almost all (almost everywhere), 69

Chapter III
w+ (w−)- the right (left) derivative of W, 76
RW(·, ·)- Bayes risk function for loss W(·, ·), 79
PF- conditional probability relative to F and P, 84
U- unbiased estimators of zero, 89
sgn(x)- signum function of x, 90
(W, V)- complementary Young functions, 91
MW- the set of unbiased estimators of h(θ) in LW, 94
Δ2, ∇2- growth conditions of Young functions, 96
NW(·)- gauge norm, 115
LW(P)- Orlicz space, 115
Nk- integer valued stopping time, 117
SN- sum of random number N of variables, 117
Vθ- covariance matrix, 126

Chapter IV
Ak0- critical region, 134
Qc, Qs- continuous and singular parts of Q relative to P, 134
F∞ = σ(∪n≥1 Fn)- σ-algebra generated by Fn, 136
λn, ψn- eigenvalues and eigenfunctions of a kernel K, 140
Pn = P|Fn- restriction of P to Fn, 141
O.U. (= Ornstein-Uhlenbeck) process, 144
H(P, Q)- Hellinger distance of P and Q, 148
AΔB- symmetric difference of sets A and B, 151
{F, ⊂}- partially ordered set by inclusion, 152
Notation Index
T- a general stopping time of the net {Ft, t ≥ 0}, 156
F(T)- σ-algebra of events ‘prior to T’, 164
X(T)- process stopped at T, 164
Hn (ψn)- the Haar (Schauder) function, 169
λ ⊗ P- product measure, 170
SPRT- sequential probability ratio test, 174
T- set of all stopping times of a filtration, 176
Lt- nth-order differential operator, 180
∫A∫B f(x)g(y) dμ(dx, dy)- strict Morse-Transue integral, 181
D1n = (∂/∂s)n- 184
∨, (∧)- max (min) operators, 185
[i, j] = max(i, j), 197
ρ- maximal root of a characteristic equation, 199
S(n)- norming (random) factor, 207
τ(X)- double stochastic integral of dX(s) dX(t), 217
V(H)- Vitali variation of a function H, 218

Chapter V
Pf = P ◦ Tf−1, 223
RKHS- reproducing kernel Hilbert space, 225
(HK, ‖·‖)- RKHS determined by K, 226
(Ω, Σ, P, Q)- probability space for P, Q on Σ, 226
Q ≪ P- Q is P-continuous, 226
MP- set of admissible means of a Gaussian P, 228
≅- isometrically isomorphic, 231
BM- Brownian Motion, 236
MP (= M(X))- for nonGaussian P, 242
R(λ)- resolvent operator, 245
Pr1 ∼ Pr2- equivalence of measures with covariances ri, 248
I(P, Q)- entropy function, 257
ΔHK = Hb ⊖ Ha, for Δ = [a, b], 261
F = sp{Ψ(t, ·), t ∈ T} ⊂ L2(T, μ, ℓ2), 262
HS- Hilbert-Schmidt, 265
f ⊗ g- tensor product of f, g, 271
ch.f.- characteristic function, 279
ν(t, ·)- random measure, 281
CBS- Cauchy-Buniakovsky-Schwarz, 287
- projection operator, 293
supp(Q)- support of measure Q, 303
[X]t ([X, Y]t)- quadratic (co-)variation, 316
EZ0(L)- exponential martingale, 320
∇fn- gradient of fn, 329
[[0, t)- stochastic interval, 334

Chapter VI
Xt = ∫T g(t, x) dZ(x)- Cramér representation, 340
L2(ρ)- bispectral domain, 341
L̂1(R)- Fourier transform of L1(R), 350
Π(v)- projection of v, 359
Jν(·)- Bessel function of order ν, 363
Sm(·)- ultraspherical polynomial, 364
LCA- locally compact abelian, 370
H⊥- annihilator of H, 372
F(Tj)- σ-algebra of events prior to time T, 374
gY(X) = E(Y|X)- regression of Y on X, 376
X, a spherical vector, 378
Spherical vector characterization, 379
Linear regression for spherical processes, 380
Generalized Theorem for Linear Regressions, 383
Linear regression for random measures, 386
Linear regression for random integrals, 390
Regression characterization for Gaussian martingales, 392
Linear regression with constraints, 305
Regression characterization of the gamma distribution, 404
‘ridge regression’, 406

Chapter VII
Pf ⊥ P- mutually singular measures, 415
|r|∞- uniform norm of r(·, ·), 416
F- cylindrical functions, 420
Δx(t) = x(t + δ) − x(t), 422
∂+/∂t- right differential, 422
Δ- domain of a group of operators, 430
D- infinitesimal generator of a group, 430
T(α, β)- evolution family of operators, 438
Fα ⊂ F0- algebras of bounded cylindrical functions, 438
W(α, β)- (anti evolution) family of isometries, 441
ρ(A, B)- Fréchet metric on σ, 444
Dα (Dα∗)- closed (adjoint) operator, 455
HP = HrP- the RKHS relative to the kernel rP of the Gaussian measure P, 461
H2- Hilbert-Schmidt space, 463
FH- directed family of finite subsets of H, 463
ρ(Φa,t, Φb,s)- Hellinger integral, 465
(·, ·)- Volterra kernel, 474
[0, T)n- a cube in Rn, 480
In(f)- n-ple Wiener integral, 480
Hn(t, x)- Hermite polynomial of nth degree, 481
W(t)- white noise process, 484
F ⋄ G- Wick product, 485

Chapter VIII
LW(Σ)- Orlicz space on (Ω, Σ, P), 491
‖·‖W- gauge norm, 491
πBx (πB)- prediction operator, 492
τ- stopping time, 497
M (= L0(P))- set of real measurable functions, 498
ρ(·)- function norm, 498
U(·)- modular functional, 499
M ⊕ N- direct sum, 503
Ht- the space of past and present, 504
H−∞ (H∞)- the remote past (total space), 504
{πt, t ∈ G}- resolution of the identity, 508
ρ̃(·)- associated spectral measure, 508
Ht1,t2- space of random fields, 511
ΛXt = Yt- linear filter equation, 512
H(X)- total space of the process X, 512
τ(f)- stochastic integral, 513
C- complex number field, 514
β(·, ·)- a bimeasure, 514
F (F∗)- a matrix (conjugate transpose) filter characteristic, 519
F−1- (generalized) inverse of F, 521
Δij(·), the (ij)th cofactor of det(F), 526
HX = H(X)- total space of X, 530
πS- linear prediction operator in L2(P), 535
K(·)- Kalman gain matrix, 537
ODE- ordinary differential equation, 544
EμB- conditional expectation relative to B and μ, 546
F = σ(Ys, s ≤ t), 553
∫at U1(s) ◦ dU2(s)- Stratonovich integral, 562
SDE- stochastic differential equation, 564
M2(F)- space of square integrable martingales, 564
Di = ∂/∂xi, Dij = DiDj, 570
πt(ϕ)- (non linear) filter, 570
SPDE- stochastic PDE, 573

Chapter IX
F̂N(A, B)- estimator of FN(A, B), 585
Fm;n = σ(Xmj, j ≤ n), 587
Dm(·)- Dirichlet kernel, 590
Class(KF), 596
H2(·)- discrete component of the associated spectral measure, 602
Jt(·)- the template function, 604
cum(X)- cumulant of X, 610
Δh- line parallel to the diagonal at h units, 617
{τs, s ≥ 0}- a shift semi-group of normal operators, 619
SO(n)- special orthogonal group of n × n matrices, 620
Author Index

Abraham, R., 60
Albert, A., 294
Alekseev, V. G., 487
Amemiya, I., 494
Andô, T., 494,495,504
Andersen, E. S., 220
Anderson, T. W., 128,208,209,210,211,221,270,551
Aronszajn, N., 155,225,260,264,266
Bahadur, R. R., 56,94
Baker, C. R., 336
Balakrishnan, A. V., 264
Barankin, E. W., 95,96,97,130
Bartle, R. G., 35,36
Bartlett, M. S., 66,69
Basawa, I. V., 338
Battin, R. H., 560
Baxter, G., 249
Bayes, T., 40,70
Belyaev, Yu. K., 353,355,356,404
Bensoussan, A., 534,551,573,576,577,579
Berg, C., 281,306,366
Berger, A., 6,7,18
Berger, J. O., 43
Berger, M. A., 488
Bertoin, J., 279
Bhattacharyya, A., 97,131
Billingsley, P., 291,296,297,336
Birnbaum, A., 410
Blackwell, D., 17,68,119,129,130,131
Bochner, S., 170,182,300,349,489,515,520,558,578
Bourbaki, N., 328
Briggs, V. D., 310,311,312,337
Brillinger, D. R., 608,610
Brockett, P. L., 282,285,312,313,337
Brody, E. J., 149,220,459
Brown, M., 281,336
Bru, B., 499,579
Bucy, R. S., 534,579
Cairoli, R., 222
Cambanis, S., 394
Cameron, R. H., 333,421
Cartwright, M. L., 347
Chan, N. H., 210,211,213,221
Chang, D. K., 181,219,341,342,350,366,404,450,452,453,515,516,578
Chernoff, H., 18,32,33,34,36,70
Chiang, T.-P., 511
Chipman, J. S., 376,393,396,397,403
Choksi, J. R., 300,337
Chow, Y. S., 47
Christensen, J. P. R., 281,306,366
Coddington, E. R., 183
Conkwright, N. B., 212
Cramér, H., 28,51,111,130,131,154,210,215,221,222,261,341,342,503,506,508,510,514,516,578,600
Crum, M. M., 598
Curtain, R., 558
Daleckiĭ, Ju. L., 542,544
Dantzig, G. B., 32,33,34
Davis, M. H. A., 534
Day, M. M., 496
DeGroot, M. H., 94,104,127,128,130,131
Diestel, J., 34
Dinculeanu, N., 23
Dobrakov, I., 219,350
Dobrushin, R. L., 476
Doléans-Dade, C., 320
Dolph, C. L., 180,188,193,196,220
Doob, J. L., 72,125,147,158,160,164,166,214,278,280,292,294,296,305,399,420,458,600
Douglas, R. G., 503
Dubins, L. E., 47,71
Dunford, N., 34,35,85,98,116,158,244,345,350,357,428,500,502,513,519,526,534,548,572
Duttweiler, D. L., 255
Dvoretsky, A., 168,220
Dynkin, E. B., 176,376
Elliot, R. J., 580
Ennis, P., 403
Erdős, P., 112
Feldman, J., 226,305,306,307,308,309,311
Feller, W., 332,382
Fend, A. V., 127,128
Feynman, R. P., 69
Fisher, R. A., 17,19,56,61,67,107,130,396
Fix, E., 377
Fleming, R. J., 446
Fleming, W. H., 573
Fraser, D. A. S., 89
Freedman, D. A., 47
Frenkiel, F. N., 596,622
Frisch, R., 378
Fuller, W. A., 213,221
Geesey, R. T., 255
Gel’fand, I. M., 484,606
Getoor, R. K., 619
Gikhman, I. I., 28,285,287,288,312,336
Girsanov, I. V., 319,470
Girshick, M. A., 17
Gladyshev, E. G., 256
Glowinski, R., 580
Gnedenko, B. V., 85,305,615
Gohberg, I. C., 275,475
Goldstein, J. A., 245,446
Golosov, Ju. L., 337
Green, M. L., 222,488
Grenander, U., 19,22,36,38,51,72,134,138,143,144,164,179,180,186,220,276,294,332,336,374,408,409,486,529,532,587,604,608,620,621,622
Gretsky, N. E., 499
Guichardet, A., 427
Hájek, J., 226,258
Halmos, P. R., 56
Hanin, L. G., 601,603,622
Hannan, E. J., 578
Hardin, C. D., 378,379,380,382,402
Hardy, G. H., 527
Hayes, C. A., 39
Heinich, H., 499,578
Hewitt, E., 605
Hida, T., 261,470,475,482,483,486,504,506,508,509,510,578
Higgins, J. R., 353,373,406
Hille, E., 97,244,437
Hirai, T., 604,626
Hitsuda, M., 470,474,475,482,483,488,510
Hoeffding, W., 18
Holden, H., 486
Hotelling, H., 61
Hudson, W. N., 312,337
Hurewicz, W., 183
Hurd, H. L., 400
Hwang, C.-R., 72
Ibragimov, I. A., 588,596,609,616,621
Ikeda, N., 491
Ince, E. L., 93
Ioannides, D., 588
Ionescu Tulcea, A., 60,443
Ionescu Tulcea, C., 44,60,71,237,443
Isaacson, D., 320
Isii, K., 93
Isserlis, L., 610
Itô, K., 316
Jamison, J. E., 446
Jensen, D. E., 406
Jessen, B., 220
Johansen, S., 162
Jordan, K., 199
Joshi, V. M., 131
Kühn, T., 226
Kac, M., 51,53,112,385,549
Kadec, M. I., 495,574
Kadota, T. T., 270,273,336,453
Kailath, T., 255,267,336
Kakihara, Y., 516,536,559
Kakutani, S., 220,337,381,479
Kallianpur, G., 549
Kalman, R. E., 534,579
Kampé de Fériet, J., 596,619,622
Karhunen, K., 524
Karush, J., 162
Kato, T., 245
Kelsh, J. P., 520,532
Kiefer, J., 81,82,83,168,220
Kingman, J. F. C., 304,305,309,313,337
Kluvánek, I., 34,395,404,511
Knowles, G., 34
Kolmogorov, A. N., 3,39,40,52,54,72,85,138,305,605,615
Kováríkova, M., 501
Kozek, A., 104,130,492
Kraft, C., 155
Krasnosel’skii, M. A., 90
Kreĭn, M. G., 275,475,542,544
Krinik, A., 326,338
Kuldorff, G., 111
Kunita, H., 316,566,607
Kuo, H.-H., 485,486
Kwakernaak, H., 531,558
Laha, R. G., 380,388,398
Lai, T. L., 208,209,221
Lamperti, J., 422
Laning, J. H., 446
Leadbetter, M. R., 51
Lebedev, N. N., 365,368
Lee, A. J., 353
Lehmann, E. L., 17,105
Leonov, V. P., 610
Levinson, N., 183
Liapounov, A., 28,32,60
Linnik, Ju. V., 18,57,58,60,61,63,66,69,70,72,90,92,130,257,579,588,596,621
Liptser, R. S., 322,334,337,473,531,568,582
Liese, P., 226
Liu, B., 370
Lloyd, S. P., 359,360,361,404
Loève, M., 182,515,546
Lototsky, S., 580
Lévy, P., 386
Lukacs, E., 380,388,398
Métivier, M., 300,337,612
Mandl, P., 176
Mandrekar, V., 453
Mann, H. B., 18,70,128,199,210,211,220
Marsden, J. E., 60
Martin, W. T., 333
Maruyama, G., 300,307,337
Masani, P. R., 194
Mautner, F. I., 606,607
Maynard, H. B., 35
McGhee, D. F., 511
Mehlman, M. H., 516
Mel’nikov, A. V., 580
Meyer, P. A., 549
Mikulevicius, R., 580
Miller, G., 383,384
Miller, M. I., 374,620,622
Minlos, R. A., 486
Mitra, S. K., 521
Mizel, V. J., 488
Mizumachi, H., 337
Moedomo, S., 35
Morgenstern, O., 17
Morse, M., 181,578
Naĭmark, M. A., 513,606
Nagabhushanam, K., 520,527,622
Novikov, A. A., 470
Neveu, J., 43,44,117,214,229,257,294,488
Newman, C. M., 282,289,312
Neyman, J., 17,19,22,31,36,38,70,134
Nualart, D., 486
Øksendal, B., 486
Palais, R. S., 604,622
Paley, R. E. A. C., 238,402
Pardoux, E., 570,573,579
Park, W. J., 267
Parzen, E., 155,225,257,259,336,529,536,622
Pauc, C. Y., 39
Pearson, E. S., 17,19,22,31,36,38,70,134
Pellaumail, J., 562
Phillips, R. S., 97,244,436,548
Picard, R. H., 511
Piranashvili, Z. A., 346,347,404
Pitcher, T. S., 225,231,235,239,241,267,336,411,413,419,427,478,487
Plessner, A. I., 265
Pogány, T., 353,413
Pourahmadi, M., 400
Priestley, M. B., 453
Pritchard, A. J., 558
Protter, P., 317
Rényi, A., 72
Rao, B. L. S., 338
Rao, C. R., 68,487,521
Ratiu, T., 60
Ren, Z. D., 90,93,94,96,97,99,115,116,330,494,502
Ressel, P., 281,306,366,474
Revuz, D., 317,319,321,473
Riesz, F., 140,144,451,452,534,619
Rohlin, V. A., 265
Root, W. L., 267
Rosenberg, M., 194
Rosenberg, R. L., 644
Rosenblatt, M., 587,588,608,611,621
Rosiński, J., 219,222
Ross, K. A., 605
Roussas, G. G., 588
Roy, S. N., 270
Royden, H. L., 446
Rozanov, Yu. A., 256,336,577,622
Rozovskii, B. L., 580
Rukhin, A. L., 90,92,130
Rutickii, Ya. B., 90
Ryll-Nardzewski, C., 54
Sato, H., 337
Savage, L. J., 56,71
Sazonov, V. V., 44
Schatten, R., 265,268,269,463
Scheffé, H., 32,33,34,66,69,70
Schreiber, B. M., 601,603,622
Schwartz, J. T., 34,35,85,98,116,158,244,345,350,357,428,500,502,513,519,526,534,548,574
Schwartz, L., 51,328
Schwarz, M. A., 601
Segal, I. E., 67
Seth, G. R., 131
Shald, S., 579
Shepp, L. A., 214,219,222,267,470,475
Sherman, S., 85,130
Shimomura, H., 604,622
Shintani, T., 497
Shiryayev, A. N., 72,168,179,220,322,334,337,473,551,568,579,610
Sivan, R., 541,559
Skorokhod, A. V., 239,247,282,285,287,288,289,312,328,329,330,336,483
Slepian, D., 51,53,549
Slutsky, E., 16
Soedjak, H., 589,593,595,596,608,612,616,621
Srivastava, A., 620
Stein, C., 129
Stigum, B. P., 208,209,221
Stone, M. H., 606
Striebel, C. T., 255,549
Stromberg, K. R., 462
Stroock, D. W., 326,334
Swift, R. J., 353,365,404,540,604,622
Sz.-Nagy, B., 140,144,244,431,432,619
Szulga, J., 219,222
Tarieladze, V. I., 459,479,487
Taylor, J. B., 209
Teicher, H., 47
Timan, A. F., 346
Transue, W., 181,578
Trotter, H. F., 245
Tucker, H. G., 282,285,312,313,337
Tukey, J. W., 209
Tweddle, I., 34
Ubøe, J., 486
Uhl, J. J., 34,35
Urbanik, K., 578
Vakhania, N. N., 459,479,487
Varadarajan, V. S., 487
Varadhan, S. R. S., 326,334
Varberg, D. E., 236,248,252,254,270,336
Veeh, J. A., 312
Velman, J. R., 438,441,457,487
Vilenkin, N. Ya., 484
Vinter, R. B., 534
von Neumann, J., 13,14,15,17
Wald, A., 7,12,13,15,17,18,32,33,40,42,66,108,111,117,120,122,123,128,130,131,167,199,210,211,220,221
Walsh, J. B., 222
Watanabe, S., 316,566
Watson, G. N., 368,369
Wei, C.-Z., 208,209,210,211,213,221
Weinert, H. L., 255
Welch, B. L., 66
White, J. S., 208,210,217,221
Widder, D. V., 69,618
Wiener, N., 147,238,483,488
Wijsman, R. A., 131
Wilks, S. S., 376,383,405,545
Wold, H., 504,527
Wolfowitz, J., 18,117,125,129,131,168,220
Woodbury, M. A., 180,188,193,196,220
Wright, J. D. M., 37
Wu, R., 573
Yadrenko, M. I., 219,370,373,380
Yaglom, A. M., 399,577,592,607
Ylvisaker, N. D., 264
Yor, M., 317,319,321,473,474
Zaanen, A. C., 60,459,498,499
Zakai, M., 353,549,566,568,579
Zayed, A. I., 377,405
Zhang, T., 486
Zygmund, A., 238,363
Žurbenko, I. G., 608,611
Subject Index

A
α-mixing, 587
absolutely continuous, of measures, 20
  for non-Gaussian measures, 337
  of norm, 499
absorbing state, 292
admissible mean, or translate, 224
  relative to a covariance, 264
  in a Hilbert space, 330
aliasing effect, 348
analyticity of harmonizable process, 353
Aronszajn space, 225
associated spectral function, 598
asymptotically unbiased, 88
asymptotic distribution of maximal root estimator, 200
averaged range of a vector measure, 35
B
Banach’s contraction mapping, 568
band-limited process, 349
Baxter’s theorem, 249
Bayes, formula, 39
  theorem, 39
  estimator, 78
  solution, 41
  general characterization, 78
Behrens-Fisher problem, 62
  a general solution of, 63
  equal sample size solution, 69
best estimator, 75
  predictor for general loss, 80
Bhattacharyya bounds, 127
bias, 82
birth-and-death process, 297
BM, 168
  construction, 168
  pinned, 214
Borel paradox, 40
bounded in probability, 16
boundedly complete family, 69
Brownian bridge, 214
  motion, 168
C
Cameron-Martin-Girsanov formula, 333
canonical representation of a process, 508
Cauchy’s formula for nonrandom sampling, 370
causal filter, 525
Chapman-Kolmogorov equation, 290
characterization, of strongly harmonizable covariance, 351
  of weakly harmonizable covariance, 349
circular window (c.w.) approximation, 53
class (KF), 597
completely random measure, 304
  σ-finite measure, 304
composite hypothesis, 5
concave function, characterization of, 29
conditional density, a version of, 47
  expectation, relative, 523
  integral representation, 549
confidence region, 105
conservative family of measures, 436
consistency conditions, 2
  of estimator, 88
consistent estimator, 88
consistent estimator, of covariance function, 595
  of bispectral density, 589
convex function, characterization of, 29
convex loss function, 78
covariation, 383
Cramér class, 186
  decomposition, 504
  representation, 340
Cramér’s theorem on ratios of r.v.s, 215
  extension of, 215
Cramér-Rao lower bound, 83
  inequality for processes, 434
critical region, 4
  function, 22
cumulant (=semi-invariant), 610
cyclic subspace, 508
cylindrical function, 418
D
Darboux property, 23
decision function, 12
  minimax, 15
deformed stationary process, 476
derived process, 52
diagonal method (for cond. probabilities), 51
diagonalization of two kernels, 270
dichotomy, for Gaussians, 226
  for Poisson measures, 282
diffusion type processes, 320
directional (norm) derivative, 102
Dirichlet kernel, 614
distinguishability of measures, 7
distribution function (d.f.), 2
Doob decomposition, 160
E
embedded process, 294
equivalence through means and covariances of Gaussian measures, 260
ergodic transformation, 443
estimate, 11
estimation, of a parameter, 10
  interval, 10
estimator, 10
  locally efficient, 120
  of maximal root, 199
evolution identity, 438
explosive process, 111
exponential martingale, 321
extended Neyman-Pearson lemma, 36
F
factorization criterion, 56
failure of invariance principle, 209
filter operator, linear, 513
  nonlinear, 565
  equation, 570
filtration, (or net) of σ-algebras, 164
  predictable, 171
G
game, two person zero sum, 13
gauge norm, 91
Gaussian distribution on vector spaces, 460
generalized (=Moore-Penrose) inverse, 520
Girsanov’s theorem, 317
  multidimensional case, 333
glueing for processes (or method), 60
good estimator, 75
Green function, (extended or generalized), 183
Gronwall’s inequality, 425
H
harmonizable covariance, 341
  isotropic, 388
  process, 181
Hellinger distance, 148
Hellinger-Hahn theorem, 261
hierarchical priors, 43
horizontal window (h.w.), 51
  approximation, 54
hypothesis, distinguishable, 7
invariance property of MLE, 106 invariant power function, 62 isometries of Lp (and Orlicz) spaces, 446 Isserlis formula, 600 iterated BM integrals, 219 Itˆo’s integral, 170 J Jensen’s property of a function norm, 473 Jensen’s inequality,conditional, 547 K
Kac-Slepian paradox, 545
Kakutani’s product measure dichotomy, 463
  (product) theorem, 462
Kalman gain matrix, 537
Kalman-Bucy filter, 533
Karhunen class, 186
Karhunen representation, 341
Karhunen-Loève (series) representation, 141
Kolmogorov’s existence theorem, 3
Kotel’nikov-Shannon type sampling theorem, 344
  formula, 394
Köthe space, normed, 498
I
i.d. integral, 307
ill-posed method (for finding cond. densities), 54
indicator function, 22
infinitely divisible (i.d.) process parameters, 280
L Lagrange identsity, 183 Langevin type equation, 192 least powerful test, 23
666
Liapounov’s inequality, 28 (67)
likelihood ratio distribution, of equivalent i.d. processes, 313
  of a diffusion relative to BM, 322
  of a Hilbert space diffusion for BM, 327
  for birth-and-death processes, 332
likelihood ratio function, of two Gaussian measures, 300
  of two Poisson measures, 310
likelihood, 39
  principle, 410
limiting distribution of bispectral density estimator, 612
linearity of admissible set of translates, 242
linearity of regression, 380
locally best estimator, 75
  maximum risk estimator, 95
loss function, 11
lower bound, for Bayes risk, 79
  sequential estimation, 124
  for a template estimator, 620
Lévy metric, 17
Lévy representation of i.d. processes, 309
Lévy system of measures, 300
Lévy-Itô measure, 308
Lévy-Khintchine formula, 277
Lp-product limit, 463
Lp,q, L2,2-boundedness, 170
M
Markov property, strict sense, 510
  wide sense, 510
martingale, 157
  sub (super), 157
Maruyama’s theorem, 300
maximum likelihood (MLE) estimator, 105
mean of harmonizable process, characterization, 335
metrically transitive transformation, 443
minimal sufficient statistic, 57
minimax decision, 15
minimax solution, 41
  estimator, 75
mixed family of measures, 436
modular form of risk, 94
  mapping, strong, 499
Morse-Transue (or MT) integral, 181
  strict, 181
most powerful test, 23
MT-integrable, 181
  strict, 181
multiplicity of a process, 507
multistage priors, 43
N
necessary statistic, 57
Neyman’s rule (or Neyman ‘structure’), 57
Neyman-Pearson lemma, 22
  vector form, 36
Neyman-Pearson-Grenander (or NPG) version, 20
non linear prediction, 83
nondeterministic, 504
n-ple Wiener integral, 480
nuisance parameters, 61
null-regular test, 66
O
observable coordinates, 142
optional (=stopping time), 497
optional sampling process, 399
Ornstein-Uhlenbeck (or O.U.) process, 144
  characterization of, 214
  relations with BM, 147
oscillatory process, 452
P
payoff function, 13
periodic sampling, 358
periodically correlated, 400
periodogram estimator, 616
physically realizable filter, 525
Polish space, 59
polynomial chaos, 483
polynomial regression, 398
posterior probability, 39
power of a test, 4
prediction operator, 492
  linearity, 503
predictor, linear, 490
  non linear, 490
prior probability, 39
property k (or Kadec’s), 495
pth power summability method, 622
Q
quadratic (co)variation, 316
quadratic loss function, 74
quasi-complements, 469
R
random measures, 281
randomized test, 22
Rao-Blackwell theorem, 89
rate (= intensity) measure, 281
regression of Y over X, 376
regression with Gaussian martingale processes, 382
regular bimeasure, 306
remote past, 504
representation of harmonizable isotropic field, 368
resampling procedure, 587
resolution of the identity, 261
Riccati equation, difference, 527
  differential, 550
ridge regression, 406
Riemann function, 183
Riesz space, 498
risk function, 11
RKHS, 225
S
sampling theorem, band limited, 356
  general case, 347
  for stationary processes, 359
  on LCA groups, 371
Schauder functions, 168
semi-variation, 345
shift operator for nonstationary processes, 619
signal (and noise), 179
signum function, 90
similar test, 55
simple hypothesis, 5
size (of a test), 4
Skorokhod integral, 483
spectral bimeasure, 341
spherically distributed, 378
spherically generated, 378
SPRT, 174
  optimality for continuous parameter processes, 177
‘stable’ sequence (state), 111 (292)
standard Borel space, 308
standard modification of a family, 440
statistic(s), 55 (133)
stochastic flow, 182
stochastic integration by parts, 316
stochastically continuous, 277
Stone space, 158
stopping time (discrete case), 120
  (continuous case), 155
strategy, 14
  pure, 14
  randomized, 14
Stratonovich integral calculus, 562
strict MT-integral, 181
strictly convex space, 499
strong (or α-) mixing, 588
strong (or Fréchet) differentiability, 495
strong Markov process, 320
strong measurability, 104
Šmulian’s theorem, 98
strongly equivalent Gaussian measures, 275
strongly integrable density, 58
sub-Gaussian class, 382
sufficient statistic, 56
  σ-subalgebra, 56
Sylvester determinant, 211
symmetric p-stable random measure, 386
test, least powerful, 23 most powerful, 23 transition intensities, 296 transition probability function, 290 triangular type covariance, 184 Trotter’s theorem, 245 types A1 and A tests, 409 U UMP test, 55 unbiased estimator of zero, set of, 91 unbiased test, 13 decision function, 55 estimator, 88 uniformly best average power, 40 most powerful test, 55 uniformly convex, 495 uniformly monotone space, 498 unimodal distribution, 85 unstable process, 111 upcrossings lemma, 160 V value of game, 14 variance function, 185 variation, Fr´echet, 514 Vitali, 514 vertical window (v.w.), approximation, 53 V -bounded, weakly, 349
T W template estimation, 607 tensor product of RKHS spaces, 259 test function, 22
Wald’s identsity (in sequential estimation), 117 (generalization), 129
  continuous parameter, 172
weak martingales, 377
weakly (or Gâteaux) differentiable, 93
weakly asymptotically efficient, 106
weakly harmonizable covariance, characterization of, 349
white noise process, 484
Wick product, 485
Wiener’s integral, 170
Wiener-Itô chaos form, 481
Wold’s decomposition, 504
X-Y-Z
Yor’s formula, 332
Yosida-Hewitt decomposition, 215
Yule-Walker relation, 575
Zakai’s filter equation, 568