VARIATIONAL METHODS IN STATISTICS
This is Volume 121 in MATHEMATICS IN SCIENCE AND ENGINEERING, A Series of Monographs and Textbooks, Edited by RICHARD BELLMAN, University of Southern California. The complete listing of books in this series is available from the Publisher upon request.
VARIATIONAL METHODS IN STATISTICS
Jagdish S. Rustagi
Department of Statistics
The Ohio State University
Columbus, Ohio
ACADEMIC PRESS
New York
San Francisco
A Subsidiary of Harcourt Brace Jovanovich, Publishers
London
1976
COPYRIGHT © 1976, BY ACADEMIC PRESS, INC.
ALL RIGHTS RESERVED. NO PART OF THIS PUBLICATION MAY BE REPRODUCED OR TRANSMITTED IN ANY FORM OR BY ANY MEANS, ELECTRONIC OR MECHANICAL, INCLUDING PHOTOCOPY, RECORDING, OR ANY INFORMATION STORAGE AND RETRIEVAL SYSTEM, WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER.
ACADEMIC PRESS, INC.
111 Fifth Avenue, New York, New York 10003
United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD., 24/28 Oval Road, London NW1
Library of Congress Cataloging in Publication Data
Rustagi, Jagdish S.
Variational methods in statistics.
(Mathematics in science and engineering series)
Includes bibliographies and index.
1. Mathematical statistics. 2. Calculus of variations. I. Title. II. Series.
QA276.R88 519.5′35 75-13092
ISBN 0-12-604560-7
PRINTED IN THE UNITED STATES OF AMERICA
TO
Kamla, Pradip, Pramod, and Madhu
Contents

Preface
Acknowledgements

Chapter I Synopsis
1.1 General Introduction
1.2 Classical Variational Methods
1.3 Modern Variational Methods
1.4 Linear Moment Problems
1.5 Nonlinear Moment Problems
1.6 Optimal Designs for Regression Experiments
1.7 Theory of Optimal Control
1.8 Miscellaneous Applications of Variational Methods in Statistics
References

Chapter II Classical Variational Methods
2.1 Introduction
2.2 Variational Problem
2.3 Illustrations in Statistics
2.4 Euler-Lagrange Equations
2.5 Statistical Application
2.6 Extremals with Variable End Points
2.7 Extremals with Constraints
2.8 Inequality Derived from Variational Methods
2.9 Sufficiency Conditions for an Extremum
References

Chapter III Modern Variational Methods
3.1 Introduction
3.2 Examples
3.3 Functional Equations of Dynamic Programming
3.4 Backward Induction
3.5 Maximum Principle
3.6 Dynamic Programming and Maximum Principle
References

Chapter IV Linear Moment Problems
4.1 Introduction
4.2 Examples
4.3 Convexity and Function Spaces
4.4 Geometry of Moment Spaces
4.5 Minimizing and Maximizing an Expectation
4.6 Application of the Hahn-Banach Theorem to Maximizing an Expectation Subject to Constraints
References

Chapter V Nonlinear Moment Problems
5.1 Introduction
5.2 Tests of Hypotheses and Neyman-Pearson Lemma
5.3 A Nonlinear Minimization Problem
5.4 Statistical Applications
5.5 Maximum in the Nonlinear Case
5.6 Efficiency of Tests
5.7 Type A and Type D Regions
5.8 Miscellaneous Applications of the Neyman-Pearson Technique
References

Chapter VI Optimal Designs for Regression Experiments
6.1 Introduction
6.2 Regression Analysis
6.3 Optimality Criteria
6.4 Continuous Normalized Designs
6.5 Locally Optimal Designs
6.6 Spline Functions
6.7 Optimal Designs Using Splines
Appendix to Chapter VI
References

Chapter VII Theory of Optimal Control
7.1 Introduction
7.2 Deterministic Control Process
7.3 Controlled Markov Chains
7.4 Statistical Decision Theory
7.5 Sequential Decision Theory
7.6 Wiener Process
7.7 Stopping Problems
7.8 Stochastic Control Problems
References

Chapter VIII Miscellaneous Applications of Variational Methods in Statistics
8.1 Introduction
8.2 Applications in Reliability
8.3 Bioassay Application
8.4 Approximations via Dynamic Programming
8.5 Connections between Mathematical Programming and Statistics
8.6 Stochastic Programming Problems
8.7 Dynamic Programming Model of Patient Care
References

Index
Preface
Calculus of variations is an important technique of optimization. Our attempt in this book is to develop an exposition of the calculus of variations and its modern generalizations in order to apply them to statistical problems. We have included an elementary introduction to Pontryagin's maximum principle as well as Bellman's dynamic programming. Other variational techniques are also discussed. The reader is assumed to be familiar with elementary notions of probability and statistics. The mathematical prerequisites are advanced calculus and linear algebra. To make the book self-contained, statistical notions are introduced briefly so that the reader unfamiliar with statistics can appreciate the applications of variational methods to statistics. Advanced mathematical concepts are also introduced wherever needed. However, well-known results are sometimes stated without proof to keep the discussion within reasonable limits. The first two chapters of the book provide an elementary introduction to the classical theory of the calculus of variations, the maximum principle, and dynamic programming. The linear and nonlinear moment problems are discussed next, and a variety of variational techniques to solve them are given. One of the first nonclassical variational results is that given by the Neyman-Pearson lemma, and it is utilized in solving certain moment problems. A few problems of testing statistical hypotheses are also given. The techniques utilized in finding optimal designs of regression experiments are generally variational, and a brief discussion of optimization problems under various criteria of optimality is provided. Variational methods have special significance in stochastic control theory,
optimal stopping problems, and sequential sampling. Certain aspects of these problems are discussed. In addition, applications of variational methods in statistical reliability theory, mathematical programming, controlled Markov chains, and patient monitoring systems are given. The main concern of the author is to provide those statistical applications in which variational arguments are central to the solution of the problem. The reader may discover many more applications of these methods in his own work. The references provided are not exhaustive and are given only as sources of supplementary study, since many of the areas in statistics discussed here are still under vigorous development. Applications of variational methods in other engineering sciences, economics, or business are not emphasized, although some illustrations in these areas are discussed. Exhaustive expositions of such applications are available elsewhere. The chapters are so arranged that most of them can be studied independently of each other. The material collected here has not appeared previously in book form, and some of it is being published for the first time. Part of the material has been used by the author over the years in a course in optimizing methods in statistics at The Ohio State University. The book can be used as a text for such a course in statistics, mathematics, operations research, or engineering science departments in which courses on optimization are taught. It may also be used as a reference. Readers are invited to send their comments to the author.
Acknowledgements
I am greatly indebted to Professor Richard Bellman, who invited me to write this book. Professor Herman Chernoff has been a source of great inspiration and advice throughout the planning and development of the material, and I am highly obliged to him. I am grateful to Professor Stefan Drobot for introducing me to variational methods. I would also like to acknowledge the help of Professor D. Ransom Whitney for continuous advice and encouragement. Dr. Vinod Goyal helped me in the preparation of Chapter II; Professors Bernard Harris and Paul Feder provided comments on earlier drafts of some chapters, and I am very grateful to them. I am obliged to Mr. Jerry Keiper, who suggested many corrections and improvements in the manuscript. I would also like to thank the many typists, among them Betty Miller, Diane Marting, and especially Denise Balduff, who have worked on the manuscript. The work on the book began while I had a research grant from the Office of Scientific Research, United States Air Force, at The Ohio State University, and I am very much obliged to both for their support. I am also grateful to the editorial staff of Academic Press for various suggestions regarding the production of the book.
CHAPTER I
Synopsis
1.1 General Introduction
Variational methods refer to the technique of optimization in which the object is to find the maximum or minimum of an integral involving unknown functions. The technique is central to the study of functional analysis in the same way that the theory of maxima and minima is central to the study of calculus. During the last two centuries variational methods have played an important role in the solution of many physical and engineering problems. In the past few decades variational techniques have been developed further and have been applied successfully to many areas of knowledge, including economics, statistics, control theory, and operations research. Calculus of variations had its beginnings in the seventeenth century when Newton used it for choosing the shape of a ship's hull to assure minimum drag of water. Several great mathematicians, including Jean Bernoulli, Leibnitz, and Euler, contributed to its development. The concept of variation was introduced by Lagrange. In this book we first introduce the basic ideas of the classical theory of the calculus of variations and obtain the necessary conditions for an optimum. Such conditions are known as Euler or Euler-Lagrange equations. In recent years Bellman's introduction of the technique of dynamic programming has resulted in the solution of many variational problems and has provided practical answers to a large class of optimization problems. In addition to Bellman's technique we also discuss the maximum principle of Pontryagin, which has been regarded as a culmination of the efforts of the mathematicians in
the last century to rectify the rule of Lagrange multipliers. It gives a rigorous development of a class of variational problems with special applications in control theory. We include a brief introduction to both of the above variational techniques. Many problems in the study of moments of probability distributions are variational. The Tchebycheff-type inequalities can be seen to result from optimization problems that are variational. Many other problems in this category, in which one wants bounds on the expected value of the largest order statistic or the expected value of the range in a random sample from an unknown population, have had important applications in statistics. In addition to the classical theory of calculus of variations, the methods of the geometry of moment spaces have proved fruitful in characterizing their solutions. A brief introduction to these topics is also provided. One of the first nonclassical variational results is stated in the form of the Neyman-Pearson lemma, which arises in statistical tests of hypotheses. While introducing the fundamental concepts of most powerful tests, Neyman and Pearson provided an extension of the classical calculus of variations result, and this technique has been applied to a large variety of problems in optimization, especially in economics, operations research, and mathematical programming. We include a brief discussion of the Neyman-Pearson technique, which has also become a very useful tool in solving nonlinear programming problems. Many problems arising in the study of optimal designs of regression experiments can be solved by variational methods. The criteria of optimality generally result in functionals that must be minimized or maximized. We describe variational techniques used in stochastic control problems, controlled Markov chains, and stopping problems, as well as the application of variational methods to many other statistical problems, such as in obtaining efficiencies of nonparametric tests. In the next seven sections, we provide a brief introduction to the topics discussed in the book.
1.2 Classical Variational Methods
Calculus of variations, developed over the past two hundred years, has been applied in many disciplines. The basic problem is that of finding an extremum of an integral involving an unknown function and its derivative. The methods using variations are similar to those using differentials and make problems in optimization of integrals easy to solve. In Chapter II, we discuss the classical approach to variational problems. We obtain the necessary conditions for an extremum in the form of the Euler differential equation. The sufficient conditions are too involved in general, and we consider them only in the case in which the functions involved are convex or concave. Such assumptions guarantee
the existence and uniqueness of the optimum in various cases. We give a brief introduction to the statistical problems that have variational character. Many modern variational problems are discussed later in the book. Wherever possible, applications from statistical contexts illustrate the theory. The classical theory of calculus of variations is extensively discussed, and there are excellent textbooks available. In Section 2.2, we state the variational problem of optimizing the functional
$$W[y] = \int_a^b L[x, y(x), y'(x)]\, dx$$
over the class of continuous and differentiable functions y(x). Various other restrictions on the functions y(x) may be imposed. This class is generally called the admissible class. In the optimization process, a distinction must be made between a global optimum and a local optimum. For example, W[y] has a global minimum for y = y₀(x) if W[y] ≥ W[y₀(x)] for all y in the admissible class. However, a local minimum may satisfy such a condition only in a neighborhood of the function y₀(x). Such concepts for a strong and weak local minimum are also defined in this section. The impetus for the development of the calculus of variations came from problems in applied mechanics. However, the techniques have been utilized with increasing frequency in other disciplines such as economics, statistics, and control theory. In Section 2.3, we give a few illustrations of variational problems arising from statistical applications. Statistical notions are briefly introduced in this section; further discussion of these can be found in introductory statistics texts. The necessary condition for a weak local extremum of the integral of the Lagrangian L(x, y, y′) is obtained in terms of a differential equation, called the Euler equation or Euler-Lagrange equation,

$$\frac{\partial L}{\partial y} - \frac{d}{dx}\left(\frac{\partial L}{\partial y'}\right) = 0.$$
There is an integral analog of this differential equation,

$$\frac{\partial L}{\partial y'} = \int^x \frac{\partial L}{\partial y}\, dx + c.$$
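A minimal numerical check of this necessary condition (an added sketch, assuming Python with NumPy and SciPy; the endpoints and grid size are chosen arbitrarily): for the arc-length Lagrangian L = √(1 + y′²) the Euler equation gives y″ = 0, so direct minimization of the discretized functional should recover the straight line joining the endpoints.

```python
import numpy as np
from scipy.optimize import minimize

# Minimize W[y] = integral of sqrt(1 + y'^2) dx over curves joining
# (0, 0) to (1, 2).  The Euler equation gives y'' = 0, i.e., a straight line.
a, b, ya, yb, m = 0.0, 1.0, 0.0, 2.0, 50
x = np.linspace(a, b, m + 1)

def W(y_int):
    y = np.concatenate([[ya], y_int, [yb]])     # impose fixed endpoints
    slopes = np.diff(y) / np.diff(x)            # piecewise y'
    return np.sum(np.sqrt(1.0 + slopes**2) * np.diff(x))

res = minimize(W, np.zeros(m - 1), method="BFGS")
y_opt = np.concatenate([[ya], res.x, [yb]])
line = ya + (yb - ya) * (x - a) / (b - a)
print("max deviation from the straight line:", np.abs(y_opt - line).max())
```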
The Euler equation is derived in Section 2.4 through the considerations of variations. The variation of the functional is defined, and the fundamental lemmas of calculus of variations are stated and proved. One of the earliest problems in calculus of variations is the brachistochrone problem, in which one considers the path of a particle moving under gravity along a wire from a point A
to a point B so as to make the travel time from A to B a minimum. The solution of the brachistochrone problem is given. It is well known that the form of the wire is a cycloid. A statistical illustration of the Euler equation is given in Section 2.5. We consider a problem of time series describing the input and output of a system with an impulse response. The estimation of the impulse response so as to minimize the mean-square error in the sense of Wiener leads to a variational problem. Euler equations result in Wiener-Hopf integral equations that can be solved in many cases. In Section 2.6, we discuss the optimization problem with variable endpoints. It is assumed that a and b, the limits in the integral to be optimized, are no longer fixed and move along certain specified curves. In many applications, such situations occur quite frequently. The necessary conditions for a weak local extremum are obtained. They involve not only the Euler differential equation but also additional relations satisfied by the curves, generally known as the transversality conditions. Constrained optimization problems require additional considerations, and they are discussed in Section 2.7. The general theory of Lagrange multipliers as introduced in differential calculus also applies to the problems of the calculus of variations. An illustration is provided in which bounds on the mean of the largest order statistic are obtained under the restriction that every distribution has mean zero and variance one. The solution is also given for the maximum of Shannon information in which the probability density involved has a given variance. The solution in this case turns out to be the normal distribution. In Section 2.8, the Hamiltonian function is introduced. It provides a simpler form for the Euler equation. The basic reduction achieved by this device is to replace the second-order Euler equation by a first-order equation involving the Hamiltonian functions. This reduction simplifies the solution in many cases. An application of the Hamiltonian function is given to obtain Young's famous inequality. An elementary introduction to the sufficiency theory for the variational problem is given in Section 2.9. The general treatment in the calculus of variations for finding sufficient conditions of optimality is quite involved, and we do not give a general discussion of this topic. In case the Lagrangian is convex or concave in (y, y′), the sufficiency conditions for a global extremum can be easily derived. We provide such a discussion. Detailed expositions of sufficiency conditions in the classical case are given by Hadley and Kemp (1971), for example.
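The classical special case of Young's inequality mentioned above states that ab ≤ aᵖ/p + bᵠ/q for a, b ≥ 0 and conjugate exponents 1/p + 1/q = 1. A small numerical check (added here as a sketch; the ranges of the random inputs are arbitrary):

```python
import numpy as np

# Young's inequality: a*b <= a**p/p + b**q/q for a, b >= 0 and
# conjugate exponents 1/p + 1/q = 1 (p > 1).
rng = np.random.default_rng(0)
p = 1.5 + 1.5 * rng.random(100_000)     # random exponents in (1.5, 3)
q = p / (p - 1.0)                       # conjugate exponents
a = 10.0 * rng.random(100_000)
b = 10.0 * rng.random(100_000)
gap = a**p / p + b**q / q - a * b       # nonnegative if the inequality holds
print("smallest gap over all trials:", gap.min())   # ~0, never negative
```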
Modern Variational Methods
We discuss Bellman's technique of dynamic programming and the maximum principle of Pontryagin in Chapter III. Modern control theory in engineering and
applications in modern economics result in variational problems that can be solved by the above techniques. Although the maximum principle gives a rigorous mathematical development for the existence and for the necessary conditions of the solution of a general control problem, it is the technique of dynamic programming that provides the answer in practice. Before we consider the techniques, we give a few examples from control theory for illustrative purposes in Section 3.2. These examples introduce the functional equation of dynamic programming. An example is also given in which the functional equation arises from other considerations. Many examples of this nature are found in the literature. The functional equation of dynamic programming and Bellman's optimality principle are given in Section 3.3. For a process with discrete states, the multistage decision process reduces to the consideration of the optimal policy at any given stage through the application of the optimality principle. This principle states that an optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions also constitute an optimal policy with respect to the state resulting from the first decision. The application of this principle results in reducing the dimension of the optimization problem, and the solution of the problem becomes computationally feasible. When the process is stochastic and the criterion of control is in terms of expectations, a functional equation approach can be similarly used. Bellman and Dreyfus (1962) provide methods of numerical solution for many dynamic programming problems. The backward induction procedure utilized in the functional equation approach of dynamic programming is discussed in Section 3.4. When sequential sampling is used in statistical tests of hypotheses, optimal decision making requires a similar process of backward induction. In statistical contexts backward induction was introduced by Arrow et al. (1949). It has found various uses in other contexts, such as in stopping problems studied by Chernoff (1972) and Chow et al. (1970).
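To make the functional equation and backward induction concrete, here is a minimal sketch (the allocation problem, the weights, and the square-root returns are invented for this illustration and are not an example from the text) of a multistage decision process with value function f_k(s) = max_u [g_k(u) + f_{k+1}(s − u)]:

```python
import math

# Allocate S indivisible units among N activities to maximize total return;
# stage returns w_k * sqrt(u) with invented weights (diminishing returns).
S, weights = 10, [1.0, 2.0, 0.5, 1.5]
N = len(weights)

# f[k][s]: best return obtainable from activities k..N-1 with s units left.
f = [[0.0] * (S + 1) for _ in range(N + 1)]
best_u = [[0] * (S + 1) for _ in range(N)]
for k in range(N - 1, -1, -1):              # backward induction over stages
    for s in range(S + 1):
        for u in range(s + 1):              # decision: units given to stage k
            val = weights[k] * math.sqrt(u) + f[k + 1][s - u]
            if val > f[k][s]:
                f[k][s], best_u[k][s] = val, u

s, alloc = S, []                            # recover the optimal policy
for k in range(N):
    alloc.append(best_u[k][s])
    s -= best_u[k][s]
print("optimal return:", round(f[0][S], 4), " allocation:", alloc)
```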
The relationship between dynamic programming and the maximum principle is discussed in Section 3.6. In the case of the time optimal problem, the functional equation of dynamic programming is the same as that obtained by the maximum principle if certain extra differentiability conditions are assumed. Such conditions may not always be satisfied. The relationship is discussed by Pontryagin et al. (1962).
1.4 Linear Moment Problems
Moment problems have been of interest to mathematicians for a considerable period of time. In its earliest form the moment problem was concerned with finding a distribution function having a prescribed set of moments. A survey is given by Shohat and Tamarkin (1943). Various optimizing problems involving distribution functions with prescribed moments arise in statistical contexts, especially in nonparametric theory. In Chapter IV, we consider the variational problem in which the Lagrangian is a linear function of the distribution function. We utilize the geometry of moment spaces, as well as the Hahn-Banach theorem, in providing solutions to the linear moment problem. In Section 4.2, a few examples arising from applications in statistical bioassay and cataloging problems are given. Both of these examples lead to a general problem of finding the bounds of the integral ∫g(x) dF(x), where g(x) is a given function with certain properties and F(x) is a distribution function having prescribed moments. The bioassay problem is a special case of the above general problem with g(x) of the form 1 − e^{−λx}, and the cataloging problem also reduces to a similar g(x). First we consider the case in which the random variable having the distribution function F(x) is finite valued. The results are then extended to the case in which the random variable takes values on the positive real line. In Section 4.3, we give an introduction to the concepts of convexity and concavity. The celebrated theorem of Hahn-Banach is stated and proved. The notion of a convex set is very important in the discussion of the optimization problems considered here. Many results are given in such cases. For example, the sufficiency conditions in the classical theory of the calculus of variations are given for convex and concave functions. Many problems give unique solutions in such cases. We can consider the abstract notion of an integral as a linear functional defined over a function space. An elementary introduction to linear spaces is given. A few pertinent notions are provided to prove the Hahn-Banach theorem for the extension of linear functionals. This is an important theorem in function spaces and plays essentially the same role as the separating hyperplane theorem in the theory of convex sets. The Hahn-Banach theorem is then applied to finding solutions of the linear moment problem. We consider the geometry of moment spaces in Section 4.4. An exhaustive
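The following minimal sketch (assuming Python with NumPy and SciPy; the grid, the choice g(x) = 1 − e^{−x}, and the moment values are invented for illustration) discretizes such a linear moment problem as a linear program. Characteristically, the extremal distribution concentrates on only a few support points, in line with the extreme-point discussion below.

```python
import numpy as np
from scipy.optimize import linprog

# Maximize E[g(X)], g(x) = 1 - exp(-x), over distributions supported on a
# grid in [0, 10] with prescribed moments E[X] = 1 and E[X^2] = 2.
x = np.linspace(0.0, 10.0, 2001)
g = 1.0 - np.exp(-x)
A_eq = np.vstack([np.ones_like(x), x, x**2])   # mass, first, second moment
b_eq = np.array([1.0, 1.0, 2.0])

# linprog minimizes, so negate g; the default bounds enforce p_i >= 0.
res = linprog(-g, A_eq=A_eq, b_eq=b_eq, method="highs")
support = x[res.x > 1e-9]
print("max E[g(X)] =", round(-res.fun, 6))
print("extremal support points:", support)     # typically 2-3 points only
```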
account of the moment spaces is given by Karlin and Studden (1966). In the linear moment problem, the solution is characterized in terms of extreme points of a convex set. These extreme points in the convex set generated by the class of all distribution functions correspond to one-point distributions. Therefore, in many cases the optimizing solutions are obtained in terms of these distributions. In problems of regression designs also, the technique of the geometry of moments is quite useful, and we use it to obtain certain admissible designs. Section 4.5 discusses the main optimization problem of Chapter IV. We consider the expectation of the function g(x), given by E[g(X)], where the random variable X has a set of prescribed moments. It is shown that the set in (k + 1)-dimensional space with coordinates (E[g(X)], E(X), . . . , E(Xᵏ)) is convex, closed, and bounded. The existence of a minimizing and maximizing distribution is easily obtained from the above fact. The solution is then characterized in terms of discrete distributions. The results are applied to the examples of Section 4.2, and complete solutions in some simple cases are provided. In Section 4.6, we consider the same problem as in Section 4.5, but with the application of the Hahn-Banach theorem. In moment problems the Hahn-Banach theorem has been applied by many authors. The results of Isaacson and Rubin (1954) are generalized. The conditions on g(x) are more general than those assumed in Section 4.5, and solutions are available in some cases for which the geometry of moments does not provide the answer.
1.5 Nonlinear Moment Problems
Many statistical applications require optimization of integrals of the following nature. Minimize

$$\int \varphi(x, F(x))\, dx$$

over a class of distribution functions F(x) with given moments, where φ(x, y) is a known function with certain desirable properties. Nonlinear problems of this nature can be reduced to the study of linear moment problems involving the optimizing function when φ(x, y) is convex or concave in y. First we reduce the nonlinear problem to a linear one and then apply the Neyman-Pearson technique for the final solution. This ingenious approach works in many cases and is given in Chapter V. Applications of the nonlinear moment problem are made to obtain bounds for the mean range of a sample from an arbitrary population having a given set of moments. In Section 5.2, the fundamental problem of testing statistical hypotheses is discussed. Relevant notions are introduced, and an elementary introduction to
the solution of the problem of testing a simple hypothesis versus a simple alternative is given. The classical theory of Neyman and Pearson is given, and the Neyman-Pearson lemma is proved. This lemma is one of the first nonclassical variational results, and it has important applications in statistics as well as in other fields. The Euler equation in the calculus of variations gives necessary conditions for an extremum of an integral in which the limits of integration are given constants or variables. In the optimization problem of Neyman-Pearson, the integrals of known functions are defined over unknown sets, and the optimization is to be done over these sets. In this section various generalizations of the Neyman-Pearson lemma are also given. The application of the Neyman-Pearson technique to the duality theory of linear and nonlinear programming is receiving serious attention from many researchers. We consider some of these problems in Chapter VIII. Other applications of the lemma are given in Section 5.8. We discuss the nonlinear problem introduced at the beginning of this section in Section 5.3. Assuming that φ(x, y) is strictly convex in y and twice differentiable in y, we can prove the existence of the minimizing cumulative distribution function F₀(x). We then reduce the problem to that of minimizing the integral

$$\int \varphi_y(x, F_0(x))\, F(x)\, dx$$
over a wider class of distribution functions. This process linearizes the problem and makes it much easier to deal with by the Neyman-Pearson technique. Similar approaches are commonly made in solving certain nonlinear programming problems. The solution can be obtained by a judicious choice of F₀(x). This approach avoids the technical details of satisfying the necessary and sufficient conditions needed for obtaining the solution through the application of the classical calculus of variations. Also, the present approach takes care of inequality constraints without much difficulty. This approach seems similar in spirit to the Pontryagin maximum principle discussed in Chapter III. In Section 5.4, we consider some statistical applications. Functions of the form φ(x, y) = (y − kx)² occur in a problem of obtaining bounds on the variance of the Wilcoxon-Mann-Whitney statistic. The complete solution is given, including the case in which the constraints on the distribution function are inequality constraints: F(x) ≥ x.
Another example, in which φ(x, y) = yⁿ, is also given. Many other examples for the nonlinear case are given by Karlin and Studden (1966).
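A minimal numerical sketch of the Neyman-Pearson lemma discussed above (assuming SciPy; the hypotheses N(0,1) versus N(1,1) and the size α = 0.05 are invented for illustration): the likelihood ratio is monotone in x, so the most powerful test rejects for large x, and any other test of the same size has smaller power.

```python
from scipy.stats import norm

# Most powerful test of H0: X ~ N(0,1) vs H1: X ~ N(1,1), one observation.
# Likelihood ratio f1(x)/f0(x) = exp(x - 0.5) is increasing in x, so the
# Neyman-Pearson test of size alpha rejects when x > c.
alpha = 0.05
c = norm.ppf(1 - alpha)                  # P(X > c | H0) = alpha
power_np = norm.sf(c - 1.0)              # power under H1

# Any other size-alpha test is weaker; compare with rejecting |x| > c2.
c2 = norm.ppf(1 - alpha / 2)
power_alt = norm.sf(c2 - 1.0) + norm.cdf(-c2 - 1.0)
print(f"Neyman-Pearson power: {power_np:.4f}")
print(f"two-sided test power: {power_alt:.4f}  (smaller, as the lemma asserts)")
```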
The problem of maximizing the integral

$$\int \varphi(x, F(x))\, dx$$
over the same class of distribution functions considered before is discussed in Section 5.5. Using the same condition that φ(x, y) is strictly convex in y, we find by simple arguments that the maximizing distribution must be a discrete distribution, as in the case of the linear moment problem. Results of Chapter IV are used to simplify the problem further, and a few examples are provided to illustrate the results. One of the most common uses of the variational method is in finding the efficiency of tests in statistics. We give an introduction to such optimization problems in Section 5.6. Variational arguments are fruitfully applied by Chernoff and Savage (1958) in their famous paper in which they introduce a statistic for testing a wide variety of nonparametric statistical hypotheses. We give an expression for the large sample efficiency of the Fisher-Yates-Terry-Hoeffding statistic relative to the t-test for testing the hypothesis of equality of the location parameters of two populations. The variational technique provides the lower bound as well as the distribution for which the bound is attained. Another example of a similar kind is given in which the asymptotic efficiency of the Wilcoxon test with respect to the t-test is obtained. In Section 5.7, we introduce the problem of determining regions of type A and type D arising in testing statistical hypotheses. Consideration of unbiased tests, that is, tests for which the power function has a minimum at the null hypothesis, requires that the curvature of the power function be studied in the neighborhood of the null hypothesis. Such considerations for obtaining best unbiased tests lead to type A regions, and the problems are interesting variational problems studied by Neyman and Pearson (1936). In the case of several parameters, consideration of best unbiased tests results in the study of the Gaussian curvature of the power function in the neighborhood of the null hypothesis. The variational problems so introduced are studied. We give an example in which the parameters of the bivariate normal distribution are tested. Miscellaneous applications of the Neyman-Pearson technique are provided in Section 5.8. Problems considered are from stockpiling, mathematical economics, dynamic programming, combination of weapons, and discrete search.
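For the Wilcoxon-versus-t comparison mentioned above, a standard expression for the Pitman asymptotic relative efficiency is 12σ_F²(∫f²(x) dx)². The sketch below (an added numerical check, with the integration grid chosen arbitrarily) recovers the familiar values 3/π ≈ 0.955 for the normal and π²/9 ≈ 1.097 for the logistic distribution:

```python
import numpy as np

# Pitman ARE of the Wilcoxon test relative to the t-test:
#   ARE = 12 * var_F * ( integral of f(x)^2 dx )^2 .
def are_wilcoxon_t(pdf, var, lo=-40.0, hi=40.0, m=800_001):
    x, dx = np.linspace(lo, hi, m, retstep=True)
    int_f2 = np.sum(pdf(x) ** 2) * dx          # Riemann sum for int f^2
    return 12.0 * var * int_f2 ** 2

normal = lambda x: np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
logistic = lambda x: np.exp(-np.abs(x)) / (1.0 + np.exp(-np.abs(x)))**2

print("normal:  ", are_wilcoxon_t(normal, 1.0))               # 3/pi  ~ 0.9549
print("logistic:", are_wilcoxon_t(logistic, np.pi**2 / 3.0))  # pi^2/9 ~ 1.0966
```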
1.6 Optimal Designs for Regression Experiments
Design of experiments is an important branch of statistics. Its development, in the hands of Sir Ronald A. Fisher, not only improved the method of efficient experimentation in many applied sciences but also provided a large number of mathematical problems in combinatorial theory and optimization. In
Chapter VI, we consider the problems arising from performing regression experiments. In such experiments the main objective is to study a response y as a function of an independent variable x. The choice of levels of x when the total number of observations is given becomes an optimization problem. The problem considered is the allocation of observations to various levels of x in order to optimize a certain criterion.
E ( Y ) = e,j-,(x) t e2f2(X)t . . . t ekfk(x) = e'qx). A design of experiment will be represented by x , , xz, . . . , xk with associated integers n,, n 2 , . . . , nk such that Cik,lni= n. That is, the design specifies performing ni experiments at level xi. Or we can allocate proportion pi = ni/n observations at xi, i = 1 , 2, . . . ,k. A typical optimization problem is t o find a design that minimizes some characteristic of the estimate of 8.We give estimates of the parameter 0 in the case of linear models and also discuss an approximate theory of nonlinear models. Section 6.3 discusses a few optimality criteria in regression designs. Suppose the covariance matrix of the estimator 8 is given by D(8). Then, one of the most commonly used criteria is the D-optimality criterion, which requires choosing a design that minimizes the determinant of D(8). The minimax criterion requires that the maximum of the quadratic form f'(x)D(e)f(x) be minimized. Since the diagonal elements of the matrix D(6) represent variances of the components of 8 , there is another criterion that minimizes the sum of variances of these components or, essentially, the trace of the matrix D(8). This is known as the A-optimality criterion. There are many more criteria, but we do not discuss them in this section. Various connections among these criteria are available in the literature, and a few are discussed in Section 6.4. We also give an elegant geometrical method, due to Elfving (1951), for obtaining D-optimal designs in a simple case. In Section 6.4, continuous normalized designs are introduced. As seen above, a design is represented by a probability distribution with probabilities P I , . . . , P k at X I , . . . ,xk. In many problems the solution becomes easier if a continuous version of the design is used. The design corresponds to a continuous density function in this case. The criterion for D-optimality reduces to the minimization of an expectation of a known function. Similar problems are discussed in Chapter IV. We also discuss some important connections between D-optimality and minimaxity. Another criterion of linear optimality is also introduced, and some of the other criteria are obtained as special cases. In nonlinear models, obtaining optimal designs becomes difficult; therefore, asymptotic theory is considered for locally optimal designs, that is, designs in
the neighborhood of the known parameters. Locally optimal designs are discussed in Section 6.5. Direct application of the calculus of variations in some examples has recently been made by Feder and Mezaki (1971). A few examples, including those discussed by Chernoff (1962) for accelerated life testing, are also discussed in this section. In Section 6.6, we give an introduction to spline functions. The recent application of splines to optimal designs is due to Studden (1971). A spline function is given by polynomials over subintervals; it is continuous at the end points and satisfies smoothness conditions such as the existence of higher order derivatives. Splines have been used extensively in approximation theory and have also been applied to problems of control theory, demography, and regression analysis. A brief discussion of the application of splines to optimal designs is given in Section 6.7. Kiefer (1959) introduced the concept of admissible designs in the same spirit as that of an admissible rule in statistical decision theory. We give a brief introduction to the theory of admissible designs using splines.
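As promised above, here is a numerical sketch of the D-optimality criterion (the candidate grid is invented, and the multiplicative weight-update rule used here is one standard iterative scheme rather than a method from the text). For the straight-line model E(Y) = θ₁ + θ₂x on [−1, 1], the D-optimal continuous design is known to put mass 1/2 at each endpoint:

```python
import numpy as np

# D-optimal continuous design for E(Y) = theta1 + theta2*x on [-1, 1].
# Multiplicative update: w_i <- w_i * d(x_i)/k with d(x) = f(x)' M^{-1} f(x).
xs = np.linspace(-1.0, 1.0, 41)                 # candidate design points
F = np.column_stack([np.ones_like(xs), xs])     # rows are f(x_i)'
w = np.full(len(xs), 1.0 / len(xs))             # start from the uniform design

for _ in range(2000):
    M = F.T @ (w[:, None] * F)                  # information matrix M(w)
    d = np.einsum("ij,jk,ik->i", F, np.linalg.inv(M), F)
    w = w * d / 2.0                             # k = 2 parameters
    w = w / w.sum()

keep = w > 1e-4
print("support:", xs[keep], " weights:", np.round(w[keep], 3))
# Converges to mass 1/2 at x = -1 and x = +1.
```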
1.7 Theory of Optimal Control
Many stochastic control problems are variational problems, and dynamic programming methods are commonly used in solving them. In Chapter VII, we discuss the stochastic control problems, giving various forms in which they arise. The techniques of backward induction, statistical decision theory, and controlled Markov chains are also discussed. The deterministic control process is introduced in Section 7.2. The discrete control problem is given to introduce the multistage decision process. In the discussion of the dynamic programming technique and the maximum principle, we have already seen the basic structure of a control process. An example of feedback control with time lag is provided to illustrate the basic elements of the control process so introduced. The direct variational argument is used to obtain the differential equations that provide the solution to the problem. In Section 7.3, the theory of controlled Markov chains is given. Markov chains naturally occur in the consideration of stochastic analogs of difference equations such as

$$y_{n+1} = y_n + u_n,$$
where {uₙ} is a sequence of independent and identically distributed random variables. The sequence {yₙ} is then a Markov chain. The study of controlled Markov chains depends heavily on the technique of dynamic programming. The functional equations are derived in this section for the examples given. The study of controlled Markov chains is also known as discrete dynamic programming or Markovian decision processes.
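A minimal controlled-Markov-chain sketch (the two states, actions, transition matrices, and costs below are all invented for illustration): each action selects a transition law, and the optimal expected cost satisfies the dynamic programming functional equation, solved here by backward induction over a finite horizon.

```python
import numpy as np

# Controlled two-state Markov chain: P[a][s, s'] is the transition law
# under action a; c[s, a] is the one-step cost.  All numbers invented.
P = [np.array([[0.9, 0.1],          # action 0: "do nothing"
               [0.4, 0.6]]),
     np.array([[0.7, 0.3],          # action 1: "repair"
               [0.9, 0.1]])]
c = np.array([[0.0, 2.0],           # costs of actions 0, 1 in state 0
              [5.0, 3.0]])          # costs of actions 0, 1 in state 1

V = np.zeros(2)                     # terminal values
for _ in range(30):                 # backward induction, horizon 30
    Q = np.stack([c[:, a] + P[a] @ V for a in (0, 1)], axis=1)
    V = Q.min(axis=1)               # the functional equation of DP
print("optimal expected costs:", np.round(V, 3))
print("optimal action in each state:", Q.argmin(axis=1))
```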
The concepts of statistical decision theory are needed in studying the stopping problems connected with the Wiener process. Elements of this theory are discussed in Section 7.4. The statistical problems are formulated in terms of a game between nature and the statistician, and various criteria for finding optimal decisions, such as those of minimax and Bayes, are defined. Examples are given to illustrate the Bayes strategy in the case of the normal distribution and its continuous version, the Wiener process. In Section 7.5, we further develop the statistical decision theory for the sequential case. The Bayes rule in the case of sequential sampling consists of a stopping rule as well as a terminal decision rule. The stopping rule is obtained with the help of backward induction if it is assumed that sampling must terminate after a finite number of observations. The sophisticated use of backward induction for this purpose is due to Arrow et al. (1949) and was formalized into dynamic programming by Bellman while the latter was studying multistage processes. An example for testing a hypothesis about the normal mean is given, and the problem is reduced to that of a stopping problem for a Wiener process. Chernoff (1972) has made an extensive study of such problems. The Wiener process is introduced in Section 7.6. In many discrete problems concerning the normal distribution, the continuous versions lead to Wiener processes. The properties of the Wiener process are given, as is the simple result corresponding to the standardization of the normal distribution; that is, a Wiener process with drift μ(t) and variance σ²(t) can be transformed into a process with drift zero and variance one. Many problems of sequential analysis and control theory reduce to stopping problems. Stopping problems are also of independent interest in other applications. Such problems are discussed in Section 7.7. Examples of many interesting stopping problems are given by Chow et al. (1970). Stopping problems for the Wiener process have been discussed by Chernoff (1972). Let a system be described by a process Y(s). Let the stopping cost be d(y, s) when Y(s) = y. The problem of optimal stopping is to find a procedure S such that E[d(Y(S), S)] is minimized. For the Wiener process, the technique reduces to that of finding the solution of the heat equation with given boundary values. Such boundary value problems also arise in other contexts. The characterization of continuation sets and stopping sets is made in terms of the solutions of the heat equation. We derive the equation and describe the free boundary problem of the heat equation. The necessary condition for the optimization problem leads to the free boundary solution of the heat equation, and a theorem is stated to provide the sufficient condition for the optimization problem. A simple example is given to illustrate the theory developed by Chernoff. Continuous versions of controlled Markov chains lead to the study of stopping problems for Wiener processes. An example of rocket control is given in Section 7.8. The solution of the problem is reduced to the study of the stopping problem of the Wiener process in its continuous version.
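The stopping formulation E[d(Y(S), S)] can be illustrated in discrete form (an added sketch: a symmetric random walk stands in for the Wiener process, and the problem is posed equivalently as maximizing an expected payoff net of a time cost; the payoff and cost rate are invented). Backward induction over the lattice yields the stop/continue decision, and hence the continuation set, at each time:

```python
# Stop a symmetric random walk Y (a discrete stand-in for the Wiener
# process) to maximize E[g(Y(S)) - c*S]; payoff g(y) = min(max(y, 0), 5)
# and time cost c = 0.05 are invented for the illustration.
T, c = 100, 0.05
g = lambda y: min(max(y, 0), 5)

V = {}                                          # V[(s, y)] = optimal value
for y in range(-T, T + 1):
    V[(T, y)] = g(y) - c * T                    # sampling is forced to stop
for s in range(T - 1, -1, -1):                  # backward induction
    for y in range(-s, s + 1):
        cont = 0.5 * V[(s + 1, y - 1)] + 0.5 * V[(s + 1, y + 1)]
        V[(s, y)] = max(g(y) - c * s, cont)     # stop versus continue

s0 = 50                                         # inspect the rule at s = 50
go_on = [y for y in range(-s0, s0 + 1) if V[(s0, y)] > g(y) - c * s0]
print("value at the origin:", round(V[(0, 0)], 4))
print("continuation region at s = 50: y in", (min(go_on), max(go_on)))
```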
1.8 Miscellaneous Applications of Variational Methods in Statistics
A few applications of variational techniques not covered in earlier chapters are discussed in Chapter VIII. It is not possible to include the large number of possible applications available in the literature. The topics chosen are based on their current interest in statistics and their potential application to future directions in research. We have included applications in bioassay, reliability theory, mathematical programming, and approximations through splines. In Section 8.2, we discuss some of the important inequalities in the theory of reliability. The case in which the failure distributions have increasing or decreasing failure rates is specially treated. Roughly speaking, by failure rate or hazard rate we mean the conditional probability of failure of an item at time t given that it has survived until time t. Increasing failure rate distributions provide a realistic model in reliability in many instances. If F(x) denotes the distribution function of the time to failure, then F̄(x) = 1 − F(x) is the probability of survival until time x and is a measure of the reliability of an item. The bounds on this probability can be obtained by variational methods; a numerical illustration of one such bound is given at the end of this section. It is not difficult to see that the class of distributions with increasing failure rate is not convex, and hence the methods of the geometry of moment spaces used to solve such problems in Chapter IV cannot be used directly. Modifications of these methods are required, and the results can be extended to more general cases. We also give bounds for E[φ(x, F̄(x))], where φ(x, y) is a known function convex in y. Such functions are also considered in the general moment problem in Chapter V. In their monograph on mathematical models of reliability, Barlow and Proschan (1967) give a detailed account of increasing and decreasing failure rate distributions. Many variational results are also given in their discussion. The efficiency of the Spearman estimator is discussed in Section 8.3. In statistical bioassay, nonparametric techniques are increasingly being used. A simple estimator for the location parameter of a tolerance distribution in bioassay is the Spearman estimator. It is shown that the asymptotic efficiency of the Spearman estimator, when compared with the asymptotic maximum likelihood estimator in terms of Fisher information, is less than or equal to one. The bounds are attained for the logistic distribution by using straightforward variational methods. Spline functions are introduced in Chapter VI and are applied to a problem of optimal experimental design. In Section 8.4, splines are applied to problems of approximation, using the technique of dynamic programming. Splines are also used in developing models of regression, especially in data analysis, according to a recent study made by Wold (1974). An example is given in which the exact solution of an approximate problem is much easier to obtain than an approximate solution of an exact problem. The dynamic programming procedure is used to solve the optimization problem. This problem arises in the consideration of the best spline approximation s(x) of a function u(x) such that
∫[s(x) − u(x)]² dx is minimized. Since splines are defined over subintervals, the problem reduces to the study of the optimization of a finite sum. The dynamic programming technique becomes highly appropriate in such a case. In Section 8.5, we consider a few connections between mathematical programming methods and statistics. The scope of these applications is very large, and only a few cases are discussed. The application of the Neyman-Pearson lemma in developing the duality theory of nonlinear programming problems has been investigated by Francis and Wright (1969) and many others. We give an introduction to this theory. There are many applications of mathematical programming methods to statistics. Some of the moment problems can also be reduced to programming problems, and the duality theory for these problems leads to interesting results. An interesting example of minimizing an expectation with countably many moment constraints on the distribution function is given. This leads to an infinite linear programming problem. The problem of finding the minimum variance unbiased estimate of a binomial parameter is also solved through a constrained programming problem. An important class of optimization problems arises in stochastic programming. We give a brief account in Section 8.6. There is an extensive literature on this topic, so we consider only a few examples. The first example concerns a stochastic linear program in which the stochastic elements enter into the objective function. These elements may also enter the constraints, and the problem then becomes that of chance-constrained programming. An example of such a problem is given. An illustration of the application of the dynamic programming technique to the solution of an important decision-making problem in patient care is provided in Section 8.7. The process of patient care in the operating room, recovery room, or an out-patient clinic exhibits the elements of a control process and is amenable to treatment by dynamic programming. The basic objective in providing such care by the physician or the nurse is to restore homeostasis. Therefore, an objective function is formulated in terms of the physiological variables of the patient at any given time and the variables desired to restore homeostasis. The discussion follows along the lines of Rustagi (1968).
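One classical bound of the kind described in Section 8.2 above, recalled here from the Barlow-Proschan literature as an illustration (the gamma life distribution is an invented test case, and SciPy is assumed): if F is IFR with mean μ, then the survival function satisfies F̄(t) ≥ e^{−t/μ} for t < μ.

```python
import numpy as np
from scipy.stats import gamma

# IFR lower bound (Barlow-Proschan type): if F is IFR with mean mu, then
# the survival function satisfies Fbar(t) >= exp(-t/mu) for t < mu.
mu = 2.0                                       # gamma(shape=2, scale=1): IFR
t = np.linspace(0.0, mu, 9, endpoint=False)
survival = gamma.sf(t, a=2.0, scale=1.0)       # exact Fbar(t) = (1+t)e^{-t}
bound = np.exp(-t / mu)
print("min slack Fbar(t) - exp(-t/mu):", (survival - bound).min())
# A nonnegative minimum confirms the bound on this example.
```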
References

Arrow, K. J., Blackwell, D., and Girshick, M. A. (1949). Bayes and minimax solutions of sequential decision problems, Econometrica 17, 213-244.
Barlow, R., and Proschan, F. (1967). Mathematical Theory of Reliability. Wiley, New York.
Bellman, R., and Dreyfus, S. (1962). Applied Dynamic Programming. Princeton Univ. Press, Princeton, New Jersey.
Chernoff, H. (1962). Optimal accelerated life designs for estimation, Technometrics 4, 381-408.
Chernoff, H. (1972). Sequential Analysis and Optimal Design. Soc. Ind. Appl. Math., Philadelphia, Pennsylvania.
Chernoff, H., and Savage, I. R. (1958). Asymptotic normality and efficiency of certain nonparametric test statistics, Ann. Math. Statist. 29, 972-994.
Chow, Y. S., Robbins, H., and Siegmund, D. (1970). Optimal Stopping. Houghton-Mifflin, New York.
Elfving, G. (1951). Optimal allocation in linear regression theory, Ann. Math. Statist. 23, 255-262.
Feder, P. I., and Mezaki, R. (1971). An application of variational methods to experimental design, Technometrics 13, 771-793.
Francis, R., and Wright, G. (1969). Some duality relationships for the generalized Neyman-Pearson problem, J. Optimization Theory Appl. 4, 394-412.
Hadley, G., and Kemp, C. M. (1971). Variational Methods in Economics. American Elsevier, New York.
Isaacson, S., and Rubin, H. (1954). On minimizing an expectation subject to certain side conditions, Tech. Rep. No. 25, Appl. Math. Statist. Lab., Stanford Univ., Stanford, California.
Karlin, S., and Studden, W. J. (1966). Tchebycheff Systems: With Applications in Analysis and Statistics. Wiley (Interscience), New York.
Kiefer, J. (1959). Optimal experimental designs, J. Roy. Statist. Soc. Ser. B 21, 273-319.
Neyman, J., and Pearson, E. S. (1936, 1938). Contributions to the theory of testing statistical hypotheses. I. Unbiased critical regions of type A and type A₁. II. Certain theorems on unbiased critical regions of type A. III. Unbiased tests of simple statistical hypotheses specifying the value of more than one unknown parameter. Statist. Res. Mem. 1, 1-37; 2, 25-57.
Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., and Mischenko, E. F. (1962). The Mathematical Theory of Optimal Processes. Wiley (Interscience), New York.
Rustagi, J. S. (1968). Dynamic programming model of patient care, Math. Biosci. 8, 141-149.
Shohat, J. A., and Tamarkin, J. D. (1943). The Problem of Moments. Amer. Math. Soc., Providence, Rhode Island.
Studden, W. J. (1971). Optimal designs and spline regression. In Optimizing Methods in Statistics (J. S. Rustagi, ed.). Academic Press, New York.
Wold, S. (1974). Spline functions in data analysis, Technometrics 16, 1-11.
CHAPTER II
Classical Variational Methods
2.1 Introduction
During the development of the calculus of variations in the last two centuries, the primary impetus has come from problems in applied mechanics. The subject soon became an area of study in mathematics, and a large number of mathematicians contributed to its development. The earliest problems were concerned with finding the maxima and minima of integrals of functions with or without constraints. Since an integral is just a simple functional defined on the space of functions under study, the variational techniques play the same part in functional analysis as the theory of maxima and minima does in the differential calculus. In view of the new technological developments of the past few decades, problems of optimization have been encountered in many diverse fields. There arose a series of problems in economics, business, space technology, and control theory that required solutions through variational techniques. Recent advances were needed in order to solve some of the important problems in the above areas, and we shall discuss some of these topics in the next chapter. In this chapter, we discuss first the Euler-Lagrange equation and the concept of variations. Many aspects of the Euler-Lagrange equations are discussed. We offer a few examples in statistics in which variational techniques are needed. A few results in variational calculus with constraints and with variable boundary points are also given. The sufficient conditions are obtained for an extremum in which the Lagrangian is convex or concave. The Hamiltonian functions are
introduced, and Young’s inequality is derived. The Euler equation is intimately connected with partial differential equations, but we do not pursue the subject in this book.
2.2 Variational Problem

Let y(x) be a real-valued function defined for x₁ ≤ x ≤ x₂ having a continuous derivative. Let a given function L(x, y, z) (called the Lagrangian) be continuous and twice differentiable in its arguments. We denote by y′(x) the derivative of y with respect to x. Suppose now a functional W[y(x)] is defined as

$$W[y(x)] = \int_{x_1}^{x_2} L[x, y(x), y'(x)]\, dx. \tag{2.2.1}$$
The earliest problem of the calculus of variations is to optimize the functional (2.2.1) over the class of all continuous and differentiable functions y(x), x₁ ≤ x ≤ x₂, under possibly some more restrictions. The class over which the optimization of the functional W[y(x)] is made is generally known as the admissible class of functions. The class 𝒜 of admissible functions is sometimes restricted to piecewise continuous functions in general variational problems.
Definition A function f defined on a closed interval [a, b] is called piecewise continuous on [a, b] if the following conditions hold: (i) f(x) is bounded on [a, b]; (ii) the right-hand limit of f(x) exists at every x₀ ∈ [a, b) and the left-hand limit of f(x) exists at every x₀ ∈ (a, b]; and (iii) f(x) is continuous on (a, b) except possibly at a finite number of points.
Definition f is said to be piecewise continuous on an arbitrary subset S of the reals if it is piecewise continuous on [a, b] for all [a, b] ⊂ S.

Most often in variational problems, one is concerned with obtaining the global maximum or minimum of the functional W[y]. For a global minimum, for example, the problem is to find a y₀ ∈ 𝒜 such that

$$W[y_0] \le W[y] \tag{2.2.2}$$

for all functions y ∈ 𝒜. Not only does one need to show that a y₀ with property (2.2.2) exists, but also that it is unique. In many problems, a characterization of the optimizing y₀(x) is necessary. In contrast to the problem of finding the global maximum or minimum (called, for simplicity, the global extremum), one may be satisfied in many situations with a local (relative) extremum.
Let the distance between two functions u(t), v(t) be given by

$$d_0(u, v) = \sup_{t \in [x_1, x_2]} |u(t) - v(t)|. \tag{2.2.3}$$

The strong local minimum of a functional W[y] is given by y₀ if

$$W[y_0] \le W[y] \tag{2.2.4}$$

for all y ∈ 𝒜 such that there exists a δ with

$$d_0(y_0, y) < \delta. \tag{2.2.5}$$
The weak local minimum of a functional is given in terms of

$$d_1(u, v) = \sup_{t \in [x_1, x_2]} |u(t) - v(t)| + \sup_{t \in [x_1, x_2]} |u'(t) - v'(t)|. \tag{2.2.6}$$

We say that y₀ gives a weak local minimum of W[y] if there is a δ such that

$$W[y] \ge W[y_0] \tag{2.2.7}$$

for all y ∈ 𝒜 such that

$$d_1(y_0, y) < \delta. \tag{2.2.8}$$
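The difference between the two distances is easy to see numerically (a small added sketch; the functions u ≡ 0 and v(t) = ε sin(t/ε²) are chosen for illustration): u and v are close in d₀ yet far apart in d₁, so v lies in a small strong neighborhood of u but not in a comparably small weak neighborhood.

```python
import numpy as np

# u(t) = 0 and v(t) = eps*sin(t/eps^2) on [0, 1]: close in d0, far in d1.
eps = 0.01
t = np.linspace(0.0, 1.0, 200_001)
v = eps * np.sin(t / eps**2)
dv = np.cos(t / eps**2) / eps                  # exact derivative of v

d0 = np.abs(v).max()                           # d0(u, v), with u identically 0
d1 = d0 + np.abs(dv).max()                     # d1 adds the derivative term
print(f"d0(u, v) = {d0:.4f}   d1(u, v) = {d1:.2f}")
```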
In many problems of interest in variational theory, only one of the above kinds of extrema may exist. The distinction will be necessary in the statement of conditions for local or global extrema.
2.3 Illustrations in Statistics

Variational problems arose historically in engineering applications. The first use of the calculus of variations was made by Newton in the seventeenth century while he was choosing the shape of a ship's hull to assure minimum drag of water. Later, many applications in other fields were made. During the past few decades, variational methods have been utilized in many other areas of the applied sciences such as economics, statistics, business, and control theory. In this section we consider a few illustrations of variational methods in statistics and probability. An extensive discussion of variational methods in economics is given by Hadley and Kemp (1971). Many authors, some of whom are cited at the end of this chapter, have given comprehensive accounts of variational theory; see, e.g., Young (1969). Many examples in statistics will be discussed in later chapters, when corresponding variational techniques for their solutions are provided.
We assume that the reader is familiar with elementary notions in probability and statistics, such as those given in the well-known books by Feller (1957), Mood and Graybill (1963), and Hogg and Craig (1959). By a cumulative distribution function (cdf), we mean a function F(x) such that (i) F(x) is monotonic increasing, (ii) F(x) is continuous on the right, and (iii) F(−∞) = 0, F(∞) = 1. Such a function expresses the probability of the event X ≤ x for a random variable X. If F(x) is absolutely continuous, then f(x) = F′(x) is called the probability density function of X. The moments of the random variable X with cdf F(x) are defined by

$$\mu_r' = \int_{-\infty}^{\infty} x^r\, dF(x). \tag{2.3.1}$$
In many problems concerning optimization, both the cdf and pdf are involved. By a random sample of size n, we mean independently and identically distributed random variables X₁, X₂, . . . , Xₙ. If the random variables X₁, X₂, . . . , Xₙ are arranged in increasing order of magnitude and written as
X(1) ≤ X(2) ≤ · · · ≤ X(n), we call X(i) the ith order statistic of the sample. The largest, X(n), and the smallest, X(1), are the two important order statistics; they also give us the range R = X(n) − X(1). The cdf of X(n) can be obtained as follows, in case F(x) is continuous. Let G(y) denote the cdf of X(n). Then

$$G(y) = \Pr\{X_{(n)} \le y\} = \Pr\{X_1 \le y, X_2 \le y, \ldots, X_n \le y\} = [\Pr(X_i \le y)]^n = [F(y)]^n.$$

Similarly the cdf of the smallest order statistic is given by 1 − [1 − F(x)]ⁿ. For the ith order statistic X(i), the cdf is obtained as

$$\Pr\{X_{(i)} \le y\} = \sum_{j=i}^{n} \binom{n}{j} [F(y)]^{j} [1 - F(y)]^{n-j}.$$

In applications, we need the cdf G(w) of the range R, which is

$$G(w) = n \int_{-\infty}^{\infty} [F(x + w) - F(x)]^{n-1}\, dF(x).$$
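A quick Monte Carlo check of these formulas (an added sketch; the sample size and the values of y and w are chosen arbitrarily, and the range cdf is evaluated by a simple Riemann sum for uniform samples):

```python
import numpy as np

# Monte Carlo check of P(X_(n) <= y) = F(y)^n and of the range cdf
# G(w) = n * int [F(x+w) - F(x)]^{n-1} dF(x) for Uniform(0, 1) samples.
rng = np.random.default_rng(1)
n, reps = 5, 200_000
X = rng.random((reps, n))

y = 0.7
print("largest: MC", np.mean(X.max(axis=1) <= y), " formula", y**n)

w = 0.5
x, dx = np.linspace(0.0, 1.0, 100_001, retstep=True)
G_w = n * np.sum((np.minimum(x + w, 1.0) - x) ** (n - 1)) * dx
R = X.max(axis=1) - X.min(axis=1)
print("range:   MC", np.mean(R <= w), " formula", round(G_w, 4))
```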
There is an extensive literature on order statistics; a recent survey has been given by David (1970).
Example 2.3.1 Let X₁, X₂, . . . , Xₙ be a random sample from a continuous cdf F(x). Let μ₁′ and μ₂′ be its first two moments such that μ₂′ − μ₁′² > 0. Notice that under this condition the class of distributions is not empty. For the general case in which k arbitrary moments are given, the conditions for the existence of a distribution function pose a classical problem that has been studied in the literature, e.g., by Shohat and Tamarkin (1943). In many nonparametric statistical problems, one is concerned with the minimization of the expectation of the range or the expectation of the extreme order statistics over the class of admissible distribution functions with the first two moments given. That is, we want to minimize, say, the expectation of the largest order statistic. Minimize

$$\int_{-\infty}^{\infty} x\, d[F(x)]^n. \tag{2.3.2}$$
Integrating by parts the integral in (2.3.2), we have the above problem reduced to maximizing J [F(x)]"dx. Here we have
W[yl = and the Lagrangian is given by
s
y"dx,
L(x, y, y ' ) = y n .
(2.3.3)
(2.3.4)
Similarly the minimization of the expectation of the smallest order statistic requires the maximization of the integral (2.3.5) and the Lagrangian in this case is
L(x, y, $1
= (1 -y)".
(2.3.6)
For the minimum or maximum of the expectation of the range of the sample, we need to consider the integral W[F(x)l =
1 - F"(x) - (1 - F(x))" ] d x .
The Lagrangian in this case is
L(x9 Y y ' ) = y"
+ (1 -y)".
(2.3.7)
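As a numerical illustration of the integral $W[F]$ for the range, the following sketch (an illustration only; the normal parent and the sample size are arbitrary assumptions) computes $E(R)$ for standard normal samples both from $W[F] = \int [1 - F^n - (1-F)^n]\,dx$ and by direct simulation.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

n = 5  # sample size (arbitrary)

# E(R) via the integral W[F] = ∫ [1 - F^n - (1-F)^n] dx
integrand = lambda x: 1.0 - norm.cdf(x)**n - (1.0 - norm.cdf(x))**n
expected_range, _ = quad(integrand, -10, 10)

# E(R) via direct simulation
rng = np.random.default_rng(1)
samples = rng.standard_normal((100_000, n))
sim_range = (samples.max(axis=1) - samples.min(axis=1)).mean()

print(f"integral: {expected_range:.4f}   simulation: {sim_range:.4f}")
```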
Example 2.3.2 Let a random sample $X_1, X_2, \ldots, X_m$ be given from a population with cdf $F(x)$. An independent sample $Y_1, Y_2, \ldots, Y_n$ is also given from another population with cdf $G(y)$. Suppose we want to test the hypothesis $F = G$ against the alternative $F \ne G$.

One of the tests used for this hypothesis is based on the Wilcoxon-Mann-Whitney statistic; for reference see Fraser (1957). The test statistic is

$$U = \text{number of pairs } (X_i, Y_j) \text{ such that } Y_j < X_i, \qquad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, n.$$

Let $p = 1 - \int G[F^{-1}(x)]\,dx$, and setting $L(t) = G(F^{-1}(t))$, we have

$$\int_0^1 L(t)\,dt = 1 - p. \qquad (2.3.8)$$

It is well known that $E(U) = mnp$, and the variance of $U$, denoted $V(U)$, is

$$V(U) = mn\{(m-1)(p_1 - (1-p)^2) + (n-1)(p_2 - (1-p)^2) + p(1-p)\},$$

where $p_1$ and $p_2$ are the probabilities of concordance for pairs sharing a common $Y$ or a common $X$, respectively; both are expressible as integrals of $L(t)$. There is considerable interest in finding lower and upper bounds of $V(U)$ over the class of distribution functions $L(t)$, $0 \le t \le 1$. That is, the problem of interest is to find the extrema of the integral

$$\int_0^1 (L(t) - kt)^2 \, dt \qquad (2.3.9)$$

over the class of functions $L(t)$ that are themselves cdfs, with the side restriction (2.3.8). Notice that the Lagrangian in this case is given by

$$L(x, y, y') = (y - kx)^2. \qquad (2.3.10)$$
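The statistic $U$ and the quantity $1 - p$ are easy to compute for simulated data. The following sketch is a minimal illustration under assumed normal populations; it is not part of the text.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 40                       # sample sizes (arbitrary)
x = rng.normal(0.0, 1.0, m)         # sample from F
y = rng.normal(0.5, 1.0, n)         # sample from G

# U = number of pairs (X_i, Y_j) with Y_j < X_i
U = np.sum(y[None, :] < x[:, None])
print("U =", U, "  U/(mn) estimates Pr(Y < X) = 1 - p:", U / (m * n))
```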
Example 2.3.3 In information theory, one of the first problems is to maximize Shannon's information, given by

$$-\int f(x)\log f(x)\,dx,$$

over all pdfs $f(x)$ that are assumed to satisfy certain moment constraints. Assuming that $y = F(x)$ and that $F(x)$ is absolutely continuous, the above problem is a variational problem with Lagrangian

$$L(x, y, y') = -y'\log y'.$$

Sometimes one is interested in finding the lower bound of the Fisher information, given by

$$\int \frac{[f'(x)]^2}{f(x)}\,dx,$$

where the pdf $f(x)$ has given mean and variance. In this case, assume that $y = f(x)$; the Lagrangian is then

$$L(x, y, y') = \frac{y'^2}{y}.$$

2.4 Euler-Lagrange Equations
In this section, we derive the necessary conditions for an extremum of the integral

$$W[y(x)] = \int_{x_1}^{x_2} L[x, y(x), y'(x)]\,dx \qquad (2.4.1)$$

and introduce the concept of variations. In order to study the extremal function, we introduce a parameter $\alpha$ in the definition of the function $y(x)$ as follows. Let

$$y(x) \equiv Y(x, \alpha), \qquad x_1 \le x \le x_2, \quad -\infty < \alpha < \infty,$$

so that $y^0(x) = Y(x, 0)$ is considered to be a solution of the extremum problem if it exists. With the above notation, we have from (2.4.1)

$$W[y(x)] - W[y^0(x)] = \int_{x_1}^{x_2} L[x, Y(x,\alpha), Y'(x,\alpha)]\,dx - \int_{x_1}^{x_2} L[x, Y(x,0), Y'(x,0)]\,dx \qquad (2.4.2)$$

$$= \int_{x_1}^{x_2} \Big[\frac{\partial L}{\partial Y}\Big|_{\alpha=0}\,(Y(x,\alpha) - Y(x,0)) + \frac{\partial L}{\partial Y'}\Big|_{\alpha=0}\,(Y'(x,\alpha) - Y'(x,0))\Big]dx + o(\alpha). \qquad (2.4.2a)$$

By Taylor's expansion, we have

$$Y(x, \alpha) = Y(x, 0) + \alpha\,\frac{\partial Y(x,\alpha)}{\partial\alpha}\Big|_{\alpha=0} + o(\alpha). \qquad (2.4.3)$$

Similarly,

$$Y'(x, \alpha) = Y'(x, 0) + \alpha\,\frac{\partial Y'(x,\alpha)}{\partial\alpha}\Big|_{\alpha=0} + o(\alpha). \qquad (2.4.4)$$
Denoting $\partial Y(x,\alpha)/\partial\alpha|_{\alpha=0}$ by $\eta(x)$ and $\partial Y'(x,\alpha)/\partial\alpha|_{\alpha=0}$ by $\eta'(x)$, we can write (2.4.3) and (2.4.4) in the form

$$Y = y^0 + \alpha\eta + o(\alpha), \qquad (2.4.5)$$
$$Y' = y^{0\prime} + \alpha\eta' + o(\alpha). \qquad (2.4.6)$$

Here we suppress the arguments of the functions $Y$, $Y'$, $y^0$, $y^{0\prime}$, $\eta$, and $\eta'$ for the sake of notational convenience. Substituting in (2.4.2a) the values of $Y$ and $Y'$ from (2.4.5) and (2.4.6), we have

$$W[y] - W[y^0] = \int_{x_1}^{x_2}\Big[\frac{\partial L}{\partial y^0}\,(\alpha\eta + o(\alpha)) + \frac{\partial L}{\partial y^{0\prime}}\,(\alpha\eta' + o(\alpha))\Big]dx + o(\alpha). \qquad (2.4.7)$$

For simplicity of notation, we denote by $\partial L/\partial y$ and $\partial L/\partial y'$ the partial derivatives of $L$ evaluated along $(y^0, y^{0\prime})$, so that (2.4.7) simplifies to

$$W[y] - W[y^0] = \alpha\int_{x_1}^{x_2}\Big[\frac{\partial L}{\partial y}\,\eta + \frac{\partial L}{\partial y'}\,\eta'\Big]dx + o(\alpha). \qquad (2.4.8)$$

Notice that, differentiating the product $\eta\,\partial L/\partial y^{0\prime}$ with respect to $x$, we get

$$\frac{d}{dx}\Big(\eta\,\frac{\partial L}{\partial y^{0\prime}}\Big) = \eta'\,\frac{\partial L}{\partial y^{0\prime}} + \eta\,\frac{d}{dx}\Big(\frac{\partial L}{\partial y^{0\prime}}\Big). \qquad (2.4.9)$$

Equation (2.4.8) then reduces to

$$W[y] - W[y^0] = \alpha\int_{x_1}^{x_2}\Big[\frac{\partial L}{\partial y^0} - \frac{d}{dx}\frac{\partial L}{\partial y^{0\prime}}\Big]\eta\,dx + \alpha\Big[\eta\,\frac{\partial L}{\partial y^{0\prime}}\Big]_{x_1}^{x_2} + o(\alpha). \qquad (2.4.10)$$
Definition The variational derivative of the Lagrangian is defined by

$$\frac{\delta L}{\delta y^0} = \frac{\partial L}{\partial y^0} - \frac{d}{dx}\frac{\partial L}{\partial y^{0\prime}}. \qquad (2.4.11)$$

Definition The variations of $y^0$ and $y^{0\prime}$ are defined as

$$\delta y^0 = \alpha\eta(x) \quad\text{and}\quad \delta y^{0\prime} = \alpha\eta'(x). \qquad (2.4.12)$$

Since (2.4.5) and (2.4.6) give $\partial Y/\partial\alpha|_{\alpha=0} = \eta(x)$ and $\partial Y'/\partial\alpha|_{\alpha=0} = \eta'(x)$, we can write (2.4.10) as

$$W[y] - W[y^0] = \int_{x_1}^{x_2}\frac{\delta L}{\delta y^0}\,\delta y^0(x)\,dx + \Big[\frac{\partial L}{\partial y^{0\prime}}\,\delta y^0\Big]_{x_1}^{x_2} + o(\alpha) \qquad (2.4.13)$$

$$= \delta W[y^0, \delta y] + o(\alpha). \qquad (2.4.14)$$

Definition The variation of the functional $W[y]$ at $y^0$ is defined by $\delta W[y^0, \delta y]$ in (2.4.14).

Definition If the variation of the functional vanishes at $y^0$, then the functional is said to have an extremum at $y^0$, and $y^0$ is called an extremal.
In order to obtain the necessary conditions for an extremum, we need the following two lemmas.
The Fundamental Lemmas of the Calculus of Variations

Lemma 2.4.1 Let $q(x)$ be continuous on $x_1 \le x \le x_2$. If

$$\int_{x_1}^{x_2} q(x)\eta(x)\,dx = 0$$

for all continuously differentiable functions $\eta(x)$, $x_1 \le x \le x_2$, with $\eta(x_1) = \eta(x_2) = 0$, then

$$q(x) = 0 \quad\text{for all}\quad x \in [x_1, x_2]. \qquad (2.4.15)$$
Proof Assume there exists $c \in [x_1, x_2]$ such that $q(c) \ne 0$; suppose for simplicity $q(c) > 0$. Then, since $q(x)$ is continuous, there is an interval $[a, b]$ around $c$ on which $q(x) > 0$. Let

$$\eta(x) = \begin{cases} (x-a)^2(x-b)^2, & a \le x \le b, \\ 0, & \text{elsewhere}. \end{cases}$$

Then

$$\int_{x_1}^{x_2} q(x)\eta(x)\,dx = \int_a^b (x-a)^2(x-b)^2\,q(x)\,dx > 0.$$

This is a contradiction, proving the lemma.
Lemma 2.4.2 Let $q(x)$ be continuous in $[x_1, x_2]$. If $\int_{x_1}^{x_2} q(x)\eta'(x)\,dx = 0$ for every differentiable function $\eta(x)$ such that $\eta(x_1) = \eta(x_2) = 0$, then there exists a constant $c$ such that $q(x) = c$ for all $x \in [x_1, x_2]$.
Proof Define

$$c = (x_2 - x_1)^{-1}\int_{x_1}^{x_2} q(x)\,dx \quad\text{and}\quad \eta(x) = \int_{x_1}^{x} (q(t) - c)\,dt.$$

Notice that $\eta(x)$ so defined is differentiable and satisfies the hypotheses of the lemma. The integral

$$\int_{x_1}^{x_2} (q(x) - c)\,\eta'(x)\,dx = \int_{x_1}^{x_2} q(x)\eta'(x)\,dx - c[\eta(x_2) - \eta(x_1)] = 0.$$

Also,

$$\int_{x_1}^{x_2} (q(x) - c)\,\eta'(x)\,dx = \int_{x_1}^{x_2} (q(x) - c)^2\,dx.$$
Hence $q(x) - c = 0$, or $q(x) = c$, for all $x \in [x_1, x_2]$.

We now prove the following theorem giving a necessary condition for a weak local extremum.

Theorem 2.4.1 A necessary condition that $W[y]$ have a weak local extremum at $y^0$ is that $\delta W[y^0, \delta y] = 0$ for all variations $\delta y$.

Proof Suppose first that $y^0$ is a weak local minimum; then

$$W[y] - W[y^0] \ge 0$$

for all admissible functions $y$ in a $d_1$-neighborhood of $y^0$. Now from (2.4.14) we see that $\delta W[y^0, \delta y] + o(\alpha) \ge 0$ for small $\alpha$. For all $\eta(x)$ continuous and differentiable, $W[y] - W[y^0] - o(\alpha)$ is linear with respect to $\alpha$; therefore $\delta W[y^0, \delta y]$ is linear in $\alpha$. Write $\delta W[y^0, \delta y] = \alpha\Delta$. Then $\alpha\Delta + o(\alpha) \ge 0$ for $\alpha$ of either sign; hence $\Delta = o(1)$. But $\Delta$ is constant. Therefore $\Delta = 0$ and $\delta W[y^0, \delta y] = 0$.
Theorem 2.4.2 If $W[y]$ has a weak local extremum at $y^0$, then $\delta L/\delta y^0 = 0$; that is,

$$\frac{\partial L}{\partial y^0} - \frac{d}{dx}\frac{\partial L}{\partial y^{0\prime}} = 0. \qquad (2.4.16)$$

Proof From Theorem 2.4.1,

$$\int_{x_1}^{x_2}\frac{\delta L}{\delta y^0}\,\delta y^0\,dx + \Big[\frac{\partial L}{\partial y^{0\prime}}\,\delta y^0\Big]_{x_1}^{x_2} = 0,$$

using (2.4.10) and (2.4.14), for all $\eta(x)$ that are continuous and differentiable. Take $\eta(x_1) = \eta(x_2) = 0$; then, using Lemma 2.4.1, we have $\delta L/\delta y^0 = 0$.

Definition The differential equation

$$\frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'} = 0 \qquad (2.4.17)$$

is called the Euler-Lagrange equation, or simply the Euler equation. The Euler-Lagrange equation is really a second order differential equation, as can be seen by expanding the differential operator. We shall see later that the Hamiltonian transformation reduces the equation to two first order differential equations.
Special Cases of Euler-Lagrange Equation The following three cases are of interest in applications.
Case (i) $\partial L/\partial y = 0$. (2.4.18)

Then Eq. (2.4.17) reduces to

$$\frac{d}{dx}\Big(\frac{\partial L}{\partial y'}\Big) = 0,$$

so that

$$\frac{\partial L}{\partial y'} = c. \qquad (2.4.19)$$

Case (ii) $\partial L/\partial y' = 0$. (2.4.20)

Then we have

$$\partial L/\partial y = 0. \qquad (2.4.21)$$

Case (iii) If $L$ does not explicitly depend on $x$,

$$\partial L/\partial x = 0. \qquad (2.4.22)$$

Then the Euler-Lagrange equation reduces to

$$L - y'\,\partial L/\partial y' = c. \qquad (2.4.23)$$

To show (2.4.23), we proceed as follows:

$$\frac{d}{dx}\Big(L - y'\,\frac{\partial L}{\partial y'}\Big) = \frac{\partial L}{\partial y}\,y' + \frac{\partial L}{\partial y'}\,y'' - y''\,\frac{\partial L}{\partial y'} - y'\,\frac{d}{dx}\frac{\partial L}{\partial y'} = y'\Big(\frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'}\Big) = 0.$$

Hence $L - y'\,\partial L/\partial y' = c$.
Example 2.4.1 (Brachistochrone problem) One of the earliest problems that led to the development of variational methods is the following. The path of a particle moving under the force of gravity from a given point A to a given point B along a wire, say, is to be found so that the time taken to travel from A to B is a minimum; see Fig. 2.1. Let $m$ be the mass and $v$ the velocity of the particle, and let $g$ be the gravitational constant. The law of conservation of energy gives

$$\tfrac{1}{2}mv^2 - mgy = 0,$$

giving $v = (2gy)^{1/2}$, or

$$ds/dt = (2gy)^{1/2},$$
Figure 2.1
where $s$ is the distance traveled in time $t$. The time taken is obtained from $dt = ds/v$, where $ds$ is the element of arc traveled in time $dt$. Let A represent $x = 0$ and let B be given by $x = x_1$. Then the total time taken, which is a functional of the path $y(x)$, is

$$T[y] = \int_0^{x_1}\frac{ds}{v} = \int_0^{x_1}\Big[\frac{1 + y'^2}{2gy}\Big]^{1/2}dx,$$

using the well-known formula $(ds/dx)^2 = 1 + (dy/dx)^2$. The problem then is to find $\min T[y] = T[y^0]$ over the class of functions $y$ on $0 \le x \le x_1$. Here

$$L(x, y, y') = \Big[\frac{1 + y'^2}{2gy}\Big]^{1/2}.$$

Since $\partial L/\partial x = 0$, using (2.4.23) we have

$$[y(1 + y'^2)]^{-1/2} = \text{const} = c^{-1/2},$$

where $c$ is some constant. Hence $y' = [(c - y)/y]^{1/2}$. Solving this differential equation in parametric form, we obtain the solution

$$x = (c/2)(q - \sin q), \qquad (2.4.24a)$$
$$y = (c/2)(1 - \cos q). \qquad (2.4.24b)$$

The curve so defined is a cycloid and gives the form of the wire.
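Under the setup above, the travel time along the cycloid in parametric form reduces to $T = q_1\sqrt{c/(2g)}$, since $ds = c\sin(q/2)\,dq$ and $v = (2gc)^{1/2}\sin(q/2)$. The following sketch (the endpoint and the scale constant are arbitrary choices, not from the text) compares this with the time along the straight chord joining the same end points, computed directly from $T[y]$.

```python
import numpy as np
from scipy.integrate import quad

g = 9.81          # gravitational acceleration (y measured downward)
c, q1 = 1.0, 2.0  # cycloid scale and end parameter -- arbitrary choices

# End point B reached by the cycloid (2.4.24), with A at the origin
x1 = (c / 2) * (q1 - np.sin(q1))
y1 = (c / 2) * (1 - np.cos(q1))

# Time along the cycloid: dt = ds/v reduces to sqrt(c/(2g)) dq, so T = q1*sqrt(c/(2g))
t_cycloid = q1 * np.sqrt(c / (2 * g))

# Time along the straight line y = (y1/x1) x, using T[y] = ∫ sqrt((1+y'^2)/(2gy)) dx
k = y1 / x1
t_line, _ = quad(lambda x: np.sqrt((1 + k**2) / (2 * g * k * x)), 0, x1)

print(f"cycloid: {t_cycloid:.4f} s   straight line: {t_line:.4f} s")
```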
Remark The above problem can be generalized in several directions. One generalization common in applications is that the Lagrangian is a function of several variables and their derivatives. Another generalization is concerned with a Lagrangian involving derivatives of higher order. We do not discuss these generalizations here.

2.5 Statistical Application
We consider in this section an important application of the Euler equation in statistical time series (Jenkins and Watts, 1968).¹ The problem is concerned with the estimation of the impulse response in a linear system. Let the system be described at time $t$ by the time series

$$X(t) = \text{input of the system}, \qquad Y(t) = \text{output of the system}.$$

Assume that $X(t)$ has mean $\mu_x$ and $Y(t)$ has mean $\mu_y$. The linear system is described by

$$Y(t) - \mu_y = \int_0^{\infty} h(u)\,[X(t-u) - \mu_x]\,du + Z(t), \qquad (2.5.1)$$

where $h(u)$ is the system impulse response and $Z(t)$ is the error term. Figure 2.2 gives a schematic representation of the system. One of the many criteria for optimization is the Wiener minimum mean-square criterion, which requires choosing $h(u)$ such that

$$W[h(u)] = E[Z(t)]^2 \qquad (2.5.2)$$

is minimized.

Figure 2.2

¹ I am indebted to Edward Dudewicz for pointing out this application.
Let the time series $X(t)$ and $Y(t)$ be stationary, and let the covariance between $X(t)$ and $Y(t)$ be denoted by

$$\gamma_{xy}(u) = E\{[X(t) - \mu_x][Y(t+u) - \mu_y]\}, \qquad (2.5.3)$$

and that between $X(t)$ and $X(t+u)$ by

$$\gamma_{xx}(u) = E\{[X(t) - \mu_x][X(t+u) - \mu_x]\}.$$

Similarly, $\gamma_{yy}(0)$ denotes the variance of $Y(t)$, and $\gamma_{xx}(0)$ denotes the variance of $X(t)$. Then the criterion (2.5.2) reduces to the minimization of

$$W[h(u)] = E\Big[Y(t) - \mu_y - \int_0^{\infty} h(u)\,[X(t-u) - \mu_x]\,du\Big]^2$$
$$= \gamma_{yy}(0) - 2\int_0^{\infty} h(u)\gamma_{xy}(u)\,du + \int_0^{\infty}\int_0^{\infty} h(u)h(v)\,\gamma_{xx}(u-v)\,du\,dv. \qquad (2.5.4)$$
We use the variational approach to find $h(u)$ minimizing (2.5.4). The result is stated in the following theorem.

Theorem 2.5.1 The function $h$ that minimizes $W[h]$ satisfies the Wiener-Hopf integral equation

$$\gamma_{xy}(u) = \int_0^{\infty} h(v)\,\gamma_{xx}(u-v)\,dv, \qquad u \ge 0.$$

Proof Let $h_0$ be the minimizing function and let

$$h(u) = h_0(u) + \epsilon g(u). \qquad (2.5.5)$$
We have

$$W[h] = \gamma_{yy}(0) - 2\int_0^{\infty}[h_0(u) + \epsilon g(u)]\,\gamma_{xy}(u)\,du + \int_0^{\infty}\int_0^{\infty}[h_0(u) + \epsilon g(u)][h_0(v) + \epsilon g(v)]\,\gamma_{xx}(u-v)\,du\,dv.$$

We obtain a necessary condition for a minimum if

$$\frac{\partial W}{\partial\epsilon}\Big|_{\epsilon=0} = 0 \qquad (2.5.6)$$

holds for all $g$. From (2.5.6), we obtain

$$-2\int_0^{\infty} g(u)\gamma_{xy}(u)\,du + \int_0^{\infty}\int_0^{\infty}[g(u)h_0(v) + h_0(u)g(v)]\,\gamma_{xx}(u-v)\,du\,dv = 0.$$

Since $\gamma_{xx}$ is an even function, we have

$$0 = -2\int_0^{\infty} g(u)\Big[\gamma_{xy}(u) - \int_0^{\infty} h_0(v)\,\gamma_{xx}(u-v)\,dv\Big]du. \qquad (2.5.7)$$

Since (2.5.7) must be satisfied for every $g$, $h_0$ satisfies

$$\gamma_{xy}(u) = \int_0^{\infty} h_0(v)\,\gamma_{xx}(u-v)\,dv, \qquad (2.5.8)$$

which is in the form of the well-known Wiener-Hopf integral equation and can be solved in many cases of interest.
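A minimal numerical sketch of (2.5.8): discretize the equation on a truncated grid and solve the resulting linear system for $h_0$. The covariance function and impulse response used below are assumptions made for illustration, not quantities from the text; $\gamma_{xy}$ is built from the same quadrature, so the recovery is exact up to rounding.

```python
import numpy as np

# Discretize gamma_xy(u) = ∫_0^∞ h0(v) gamma_xx(u-v) dv on u = 0, du, ..., (N-1)du
du, N = 0.05, 200
u = du * np.arange(N)

gamma_xx = lambda t: np.exp(-np.abs(t))          # assumed input autocovariance
h_true   = lambda t: 0.5 * np.exp(-2.0 * t)      # assumed impulse response

A = gamma_xx(u[:, None] - u[None, :]) * du       # quadrature matrix for the convolution
gamma_xy = A @ h_true(u)                         # gamma_xy implied by the same quadrature

h0 = np.linalg.solve(A, gamma_xy)                # solve the discretized Wiener-Hopf equation
print("max recovery error:", np.max(np.abs(h0 - h_true(u))))
```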
2.6 Extremals with Variable End Points
In the previous section, we considered the case of fixed end points x 1 and x 2 . In many applications, especially in mechanics, the points (xl, y l ) and (x2, y 2 )
may be known to lie on certain given curves $C_1$ and $C_2$, respectively. Assume that the equations of the curves are given by

$$C_1(x, y) = 0 \quad\text{and}\quad C_2(x, y) = 0. \qquad (2.6.1)$$

We assume further that the extremal can be written explicitly as $(x, y(x))$ with

$$y(x_1) = y_1 \quad\text{and}\quad y(x_2) = y_2. \qquad (2.6.2)$$

We consider the problem of finding extremals $y^0$ for optimizing (2.2.1), where $x_1$ and $x_2$ satisfy (2.6.2). Notice that when the curves $C_1$ and $C_2$ shrink to points, the problem reduces to the case discussed earlier. Our first aim will be to obtain the total variation of the functional $W$, as defined earlier in (2.4.14), in an alternative form and then to give conditions for an extremum. The concept of variation is not only applicable to finding extrema of functionals but has many other applications. We assume here that both $x$, and $y$ as a function of $x$, vary. We obtain the total variation of the functional $W$ in terms of the total variation of $y$, which in turn is given in terms of the variation (differential) of $x$. Let $X(x, \alpha)$ be a function with parameter $\alpha$ such that $X(x, 0) = x^0$, the value for which the extremum occurs. Similarly $X(x_1, 0) = x_1^0$ and $X(x_2, 0) = x_2^0$. Now

$$X(x, \alpha) = X(x, 0) + \alpha\,\frac{\partial X}{\partial\alpha}\Big|_{\alpha=0} + o(\alpha) = x^0 + \Delta x + o(\alpha), \qquad (2.6.3)$$

where $\Delta x^0 = \delta x^0$ is the (total) variation of $X$ at $x^0$. Similarly,

$$Y(X(x, \alpha)) = Y(x^0 + \Delta x + o(\alpha)) = Y(x^0) + \Delta x\,Y'(x^0) + o(\alpha). \qquad (2.6.4)$$

Also, we find that

$$Y(X(x, \alpha), \beta) = Y(x^0, \beta) + \Delta x\,Y'(x^0, \beta) + o(\alpha) = y^0(x^0) + \delta y(x^0) + \Delta x\,y'(x^0) + o(\alpha) + o(\beta). \qquad (2.6.5)$$

The quantity

$$\Delta y = Y - y^0 = \delta y + \Delta x\,y' + o(\alpha) + o(\beta)$$

is the total variation of $y$.

Definition The total variation of $W$ is defined by

$$\Delta W = W[Y] - W[y] + o(\alpha) + o(\beta). \qquad (2.6.6)$$
To obtain the total variation $\Delta W$, we proceed as follows:

$$\Delta W = \int_{X_1}^{X_2} L(X, Y(X), Y'(X))\,dX - \int_{x_1}^{x_2} L(x, y, y')\,dx + o(\alpha, \beta), \qquad (2.6.7)$$

where $X_1 = X(x_1, \alpha)$ and $X_2 = X(x_2, \alpha)$. Or

$$\Delta W = \int_{x_1 + \Delta x_1}^{x_2 + \Delta x_2}\frac{dX}{dx}\,L(x + \Delta x,\ y + \Delta y,\ y' + \Delta(y'))\,dx - \int_{x_1}^{x_2} L(x, y, y')\,dx. \qquad (2.6.8)$$

Since $X = x + \Delta x + o(\alpha)$, $\partial X/\partial x = 1 + \frac{d}{dx}\Delta x + o(\alpha)$. Expanding (2.6.8) and retaining terms containing $\Delta x$, we have

$$\Delta W = \int_{x_1}^{x_2} dx\Big\{\Big[1 + \frac{d}{dx}\Delta x\Big]\Big[L(x, y, y') + \frac{\partial L}{\partial x}\Delta x + \frac{\partial L}{\partial y}\Delta y + \frac{\partial L}{\partial y'}\Delta(y')\Big]\Big\} - \int_{x_1}^{x_2} L\,dx + o(\alpha, \beta).$$

Since $\Delta y = \delta y + y'\Delta x + o(\alpha, \beta)$ and $\Delta(y') = \delta(y') + y''\Delta x + o(\alpha, \beta)$, and since $\delta y = \Delta y - y'\Delta x$, integrating by parts we have

$$\Delta W = \int_{x_1}^{x_2}\frac{\delta L}{\delta y}\,\delta y\,dx + \Big[\frac{\partial L}{\partial y'}\,\Delta y + \Big(L - y'\,\frac{\partial L}{\partial y'}\Big)\Delta x\Big]_{x_1}^{x_2} + o(\alpha, \beta), \qquad (2.6.9)$$

where

$$\frac{\delta L}{\delta y} = \frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'}. \qquad (2.6.10)$$

We can now give the necessary conditions for a weak local extremum of $W$ with variable end points.
Necessary Conditions for a Weak Local Extremum

(1) $\delta L/\delta y = 0$ for $x_1 \le x \le x_2$; (2.6.11)

(2) $\big[(L - y'\,\partial L/\partial y')\,\Delta x_1 + (\partial L/\partial y')\,\Delta y_1\big]_{x=x_1} = 0$, (2.6.12)

where $\Delta x_1 = \Delta x|_{x_1}$ and $\Delta y_1 = \Delta y|_{x_1}$;

(3) $\big[(L - y'\,\partial L/\partial y')\,\Delta x_2 + (\partial L/\partial y')\,\Delta y_2\big]_{x=x_2} = 0$. (2.6.13)

Proof (1) From Eq. (2.6.9), $\Delta W = 0$ implies that

$$\int_{x_1}^{x_2}\frac{\delta L}{\delta y}\,\delta y\,dx = 0.$$

Since $\delta y$ and $\Delta x$ are linearly independent, we have, using $\delta y = \beta\eta(x)$,

$$\int_{x_1}^{x_2}\frac{\delta L}{\delta y}\,\eta(x)\,dx = 0.$$

This is true for all $\beta$ and all $\eta$ such that $\eta(x_1) = \eta(x_2) = 0$, so, by Lemma 2.4.1, $\delta L/\delta y = 0$ for all $x_1 \le x \le x_2$.

(2) and (3) both follow from the assumption that $(\Delta x_1, \Delta y_1)$ and $(\Delta x_2, \Delta y_2)$ are independent.

Suppose now that the end points are constrained implicitly to move along the curves

$$C_1(x_1, y_1) = 0, \qquad C_2(x_2, y_2) = 0. \qquad (2.6.14)$$

Introducing variations, $C_1(x_1 + \Delta x_1,\ y_1 + \Delta y_1) = 0$, or

$$\frac{\partial C_1}{\partial x_1}\,\Delta x_1 + \frac{\partial C_1}{\partial y_1}\,\Delta y_1 = 0. \qquad (2.6.15)$$

Similarly, expanding $C_2(x_2 + \Delta x_2,\ y_2 + \Delta y_2)$, we have

$$\frac{\partial C_2}{\partial x_2}\,\Delta x_2 + \frac{\partial C_2}{\partial y_2}\,\Delta y_2 = 0. \qquad (2.6.16)$$

From (2.6.12) and (2.6.15), the determinant

$$\begin{vmatrix} L - y'(x_1)\,\partial L/\partial y'|_{x_1} & \partial L/\partial y'|_{x_1} \\ \partial C_1/\partial x_1 & \partial C_1/\partial y_1 \end{vmatrix} \qquad (2.6.17)$$

vanishes. Similarly, (2.6.13) and (2.6.16) give

$$\begin{vmatrix} L - y'(x_2)\,\partial L/\partial y'|_{x_2} & \partial L/\partial y'|_{x_2} \\ \partial C_2/\partial x_2 & \partial C_2/\partial y_2 \end{vmatrix} = 0. \qquad (2.6.18)$$

We also have

$$C_1(x_1, y_1) = 0, \qquad (2.6.19)$$
$$C_2(x_2, y_2) = 0. \qquad (2.6.20)$$

Equations (2.6.17)-(2.6.20) determine the weak local extremals in the case of variable end points. The conditions (2.6.17) and (2.6.18) are generally known as the transversality conditions.
The transversality conditions are sometimes written as

$$\Big[L + (\varphi_1'(x) - y')\,\frac{\partial L}{\partial y'}\Big]_{x=x_1} = 0, \qquad (2.6.21)$$
$$\Big[L + (\varphi_2'(x) - y')\,\frac{\partial L}{\partial y'}\Big]_{x=x_2} = 0, \qquad (2.6.22)$$

where $\varphi_1(x)$ describes the curve $C_1$: $C_1(x_1, \varphi_1(x_1)) = 0$, and $\varphi_2(x)$ describes the curve $C_2$: $C_2(x_2, \varphi_2(x_2)) = 0$. As an example, the reader may obtain the extremal when the Lagrangian is given by $L = (1 + y'^2)^{1/2}/y$ and the end points move along the curves $x_1^2 + y_1^2 = 1$ and $y_2 = 5 + x_2$.

2.7 Extremals with Constraints
Suppose $y = (y^1, y^2, \ldots, y^n)$ represents a set of $n$ functions, each $y^i$ being piecewise continuous on $[x_1, x_2]$. The object here is to find the extremum of the integral

$$W[y(x)] = \int_{x_1}^{x_2} L[x, y, y']\,dx \qquad (2.7.1)$$

with the following $m$ constraints satisfied by the extremal $y^0 = (y^{01}, \ldots, y^{0n})$:

$$F_a(x, y^0, y^{0\prime}) = 0, \qquad a = 1, 2, \ldots, m, \quad m < n. \qquad (2.7.2)$$

The constraints may be in the form

$$F_a'[x, y^0, y^{0\prime}] = 0. \qquad (2.7.3)$$

Alternatively, the constraints may also be given so as to involve the integrals of known functions,

$$\int_{x_1}^{x_2} F_a(x, y^0, y^{0\prime})\,dx = l_a. \qquad (2.7.4)$$

The method given here is that of Lagrange multipliers, as commonly used in elementary analysis. Given constants $\lambda = (\lambda^1, \lambda^2, \ldots, \lambda^m)$, consider the integral

$$\int_{x_1}^{x_2}\Big(L + \sum_{a=1}^{m}\lambda^a F_a\Big)dx = \int_{x_1}^{x_2}\Lambda(x, y, y', \lambda)\,dx. \qquad (2.7.5)$$

Then the local extremum is given by the equations

$$\delta\Lambda/\delta y = 0. \qquad (2.7.6)$$
From (2.7.2), we obtain the $m$ equations

$$F_a(x, y, y') = 0, \qquad a = 1, 2, \ldots, m. \qquad (2.7.7)$$

We assume that the Jacobian

$$\frac{\partial(F_1, \ldots, F_m)}{\partial(y^1, \ldots, y^m)} \ne 0, \qquad (2.7.8)$$

so that $y^1, y^2, \ldots, y^m$ can be obtained uniquely in terms of $y^{m+1}, \ldots, y^n$; the extremals are then given by Eqs. (2.7.6) and (2.7.7).

Example 2.7.1 Consider the problem of obtaining bounds for the mean of the largest order statistic, as considered in (2.3.2), in the form

$$n\int_0^1 x(F)\,F^{n-1}\,dF. \qquad (2.7.9)$$

It is assumed that $x$ is a function of $F$. We have the constraints

$$\int_0^1 x(F)\,dF = 0, \qquad (2.7.10)$$
$$\int_0^1 x^2(F)\,dF = 1. \qquad (2.7.11)$$

We consider the Lagrangian

$$\Lambda(x, y, y', \lambda) = nxy^{n-1} + \lambda_1 x + \lambda_2 x^2, \qquad (2.7.12)$$

where $y = F$. The Euler equation gives

$$ny^{n-1} + \lambda_1 + 2\lambda_2 x = 0. \qquad (2.7.13)$$

A form of admissible $F$ is then obtained from (2.7.13) as

$$x(F) = -\frac{nF^{n-1} + \lambda_1}{2\lambda_2}. \qquad (2.7.14)$$

The constants $\lambda_1$ and $\lambda_2$ can then be obtained from the constraints (2.7.10) and (2.7.11). In this form the Euler equation provides, in some sense, a sort of sufficiency condition, since the function (2.7.14) does maximize (2.7.9). In such a derivation, however, the existence of the constants $\lambda_1$ and $\lambda_2$ is not guaranteed. In Chapter V we shall consider the problem from another point of view and shall also demonstrate the existence of the solution. We shall also treat many problems such as the above without using the Euler equation.
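The constants can also be found numerically. The sketch below (an illustration under the standardization (2.7.10)-(2.7.11); the sample size is an arbitrary choice) solves the two constraint equations for $\lambda_1, \lambda_2$ and evaluates the resulting bound, which agrees with the classical value $(n-1)/(2n-1)^{1/2}$ for the mean of the largest order statistic from a standardized population.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

n = 5  # sample size (arbitrary)

# Solve the constraints (2.7.10)-(2.7.11) for lam1, lam2 in x(F) = -(n F^{n-1} + lam1)/(2 lam2)
def constraints(lam):
    lam1, lam2 = lam
    x = lambda F: -(n * F**(n - 1) + lam1) / (2 * lam2)
    c1, _ = quad(x, 0, 1)                       # ∫ x dF = 0
    c2, _ = quad(lambda F: x(F)**2, 0, 1)       # ∫ x^2 dF = 1
    return [c1, c2 - 1.0]

lam1, lam2 = fsolve(constraints, [-1.0, -1.0])
x = lambda F: -(n * F**(n - 1) + lam1) / (2 * lam2)

# Maximized mean of the largest order statistic, n ∫ x(F) F^{n-1} dF
bound, _ = quad(lambda F: n * x(F) * F**(n - 1), 0, 1)
print(f"variational bound: {bound:.6f}   (n-1)/sqrt(2n-1) = {(n-1)/np.sqrt(2*n-1):.6f}")
```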
Example 2.7.2 Consider the problem of maximizing the Shannon information

$$-\int_{-\infty}^{\infty} f(x)\log f(x)\,dx, \qquad (2.7.15)$$

where $f(x)$ is the probability density function of a random variable $X$. We assume that $f(x)$ belongs to a class of piecewise continuous functions such that

$$f(x) \ge 0, \qquad (2.7.16)$$
$$\int_{-\infty}^{\infty} f(x)\,dx = 1, \qquad (2.7.17)$$
$$\int_{-\infty}^{\infty} x^2 f(x)\,dx = \sigma^2. \qquad (2.7.18)$$

The constraints (2.7.16) and (2.7.17) are always satisfied by a probability density function, and (2.7.18) prescribes the second moment of the distribution. Consider the Lagrangian in this case as

$$\Lambda(x, y, y', \lambda) = -y\log y + (\lambda_1 + \lambda_2 x^2)\,y. \qquad (2.7.19)$$

The Euler equation gives

$$-\log y_0 - 1 + \lambda_1 + \lambda_2 x^2 = 0, \qquad (2.7.20)$$

that is,

$$f_0(x) = A\exp(\lambda_2 x^2). \qquad (2.7.21)$$

Using constraints (2.7.17) and (2.7.18), we have

$$\int_{-\infty}^{\infty} A\exp(\lambda_2 x^2)\,dx = 1, \qquad \int_{-\infty}^{\infty} Ax^2\exp(\lambda_2 x^2)\,dx = \sigma^2.$$

We obtain

$$A = 1/(2\pi)^{1/2}\sigma \quad\text{and}\quad \lambda_2 = -1/2\sigma^2.$$

The solution is given by

$$f_0(x) = \frac{1}{(2\pi)^{1/2}\sigma}\exp(-x^2/2\sigma^2). \qquad (2.7.22)$$

The probability density function given by (2.7.22) is the normal density with variance $\sigma^2$. That is, Shannon information is maximized by the normal distribution.
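A quick numerical check of this extremal property: among densities scaled to unit variance, the normal should have the largest differential entropy. The comparison densities below (Laplace and uniform, each scaled to variance one) are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, laplace, uniform

# Differential entropy -∫ f log f dx, computed numerically
def entropy(pdf, lo, hi):
    val, _ = quad(lambda x: -pdf(x) * np.log(pdf(x)), lo, hi)
    return val

s3 = np.sqrt(3.0)
print("normal :", entropy(norm(scale=1.0).pdf, -12, 12))                       # 0.5*log(2*pi*e) ~ 1.4189
print("laplace:", entropy(laplace(scale=1/np.sqrt(2)).pdf, -12, 12))           # variance 2b^2 = 1
print("uniform:", entropy(uniform(loc=-s3, scale=2*s3).pdf, -s3+1e-9, s3-1e-9))# variance 1
```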
2.8 Inequality Derived from Variational Methods
Many inequalities involving integrals can be obtained by variational methods. The idea is to obtain the extremizing functions for a specific integral and then obtain the bounds. In this section we include only one inequality that is directly concerned with the technique of the calculus of variations. In probability and statistics an important role is played by Tchebycheff-type inequalities, and there is an extensive literature on these inequalities and their various generalizations (for reference, see Savage, 1961). We shall study some of these inequalities later. In this section we obtain Young's inequality, which also allows us to introduce the concept of a Hamiltonian.

The Euler-Lagrange equations are normally given in terms of first order partial derivatives of the Lagrangian and their total derivatives; they are differential equations of the second order. However, by introducing a Hamiltonian function and canonical variables, these equations can be reduced to first order differential equations. Equation (2.4.17),

$$\frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'} = 0,$$

will be given in an alternative form. Let

$$p = \frac{\partial L}{\partial y'}, \qquad (2.8.1)$$

so that the Euler equation becomes

$$\frac{\partial L}{\partial y} = \frac{dp}{dx}. \qquad (2.8.2)$$

Introducing a function $H(x, y, p)$ defined by

$$-H(x, y, p) = L(x, y, y') - y'\,\frac{\partial L}{\partial y'},$$

where $y'$ is regarded as a function of $x$, $y$, and $p$, and writing for convenience $y' = u$, we have

$$H(x, y, p) = -L(x, y, u) + u\,\partial L/\partial u. \qquad (2.8.3)$$

The function $H$ so defined is called the Hamiltonian. Now

$$-dH = dL - \frac{\partial L}{\partial u}\,du - u\,d\Big(\frac{\partial L}{\partial u}\Big). \qquad (2.8.4)$$

But

$$dL = \frac{\partial L}{\partial x}\,dx + \frac{\partial L}{\partial y}\,dy + \frac{\partial L}{\partial u}\,du. \qquad (2.8.5)$$

Substituting (2.8.5) in (2.8.4), we have on simplification

$$-dH = \frac{\partial L}{\partial x}\,dx + \frac{\partial L}{\partial y}\,dy - u\,dp. \qquad (2.8.6)$$
Also, since

$$dH = \frac{\partial H}{\partial x}\,dx + \frac{\partial H}{\partial y}\,dy + \frac{\partial H}{\partial p}\,dp, \qquad (2.8.7)$$

comparing (2.8.6) and (2.8.7), we have

$$\partial L/\partial x = -\partial H/\partial x, \qquad (2.8.8)$$
$$\partial L/\partial y = -\partial H/\partial y, \qquad (2.8.9)$$

and

$$u = \partial H/\partial p. \qquad (2.8.10)$$

Hence the Euler-Lagrange equation is now transformed into

$$\partial H/\partial y + dp/dx = 0 \qquad (2.8.11)$$

and

$$\partial H/\partial p - dy/dx = 0. \qquad (2.8.12)$$

Equations (2.8.11) and (2.8.12) are called Hamilton's canonical equations.

Example 2.8.1 Consider

$$W[y] = \int_{x_1}^{x_2}(1 + y'^2)^{1/2}\,dx, \qquad L(x, y, y') = (1 + y'^2)^{1/2},$$

so that

$$p = \frac{\partial L}{\partial y'} = \frac{y'}{(1 + y'^2)^{1/2}}, \quad\text{or}\quad p^2 = \frac{y'^2}{1 + y'^2},$$

which gives $y'^2 = p^2/(1 - p^2)$. The Hamiltonian is given by

$$H = -L + y'p = -\frac{1}{(1 + y'^2)^{1/2}} = -(1 - p^2)^{1/2}.$$

The canonical equations are then given by

$$dp/dx = -\partial H/\partial y = 0,$$

giving $p = c$, a constant, and

$$dy/dx = \partial H/\partial p = \frac{p}{(1 - p^2)^{1/2}} = \text{const},$$

which implies that $y = A + Bx$, where $A$ and $B$ are constants.
Young's Inequality Let $L$ be strictly convex in $u$, that is, $\partial^2 L/\partial u^2 > 0$. Considering $L$ as a function of $u$ alone, we have from (2.8.3)

$$H = -L(u) + u\,\frac{\partial L}{\partial u} = -L(u) + uq, \qquad (2.8.13)$$

where $q = L'(u)$. Now $dH = -L'(u)\,du + u\,dq + q\,du = u\,dq$, so that

$$H'(q) = u. \qquad (2.8.14)$$

Now regard $q$ and $u$ as independent. For every value of $q$ there exists a $u(q)$ such that

$$q = L'(u(q)).$$

But

$$H''(q) = 1/L''(u(q)) > 0;$$

hence

$$\frac{\partial^2}{\partial q^2}\,[-H(q) + uq] < 0$$

for all $q$ and for all $u$, and

$$\max_q\,[-H(q) + uq] = -H(L'(u)) + uL'(u) = L(u),$$

by the definition of $H$. Hence

$$L(u) \ge -H(q) + uq \qquad (2.8.15)$$

for all $u$ and $q$ (independent). Inequality (2.8.15) is called Young's inequality. An extension to several variables $u_1, \ldots, u_n$ can easily be made. Assume that $L(u_1, \ldots, u_n)$ is such that the matrix $A = (\partial^2 L/\partial u_i\,\partial u_j)$ is positive definite. Let $q_i = \partial L/\partial u_i$, $i = 1, 2, \ldots, n$, and define

$$H(q_1, \ldots, q_n) = -L(u_1, \ldots, u_n) + \sum_{i=1}^{n} q_i u_i.$$

Then
$$L(u_1, \ldots, u_n) \ge -H(q_1, \ldots, q_n) + \sum_{i=1}^{n} q_i u_i. \qquad (2.8.16)$$

2.9 Sufficiency Conditions for an Extremum
We have seen in earlier sections that the Euler equation provides only necessary conditions for a minimum or a maximum in a variational problem. If the solution of the Euler differential equation is unique, and it is somehow also known that an optimal solution exists, then the optimal solution is given by the Euler equation. In many situations, however, the Euler equation has several solutions, and it becomes necessary to determine whether a given solution yields the maximum or the minimum. The classical theory of such sufficient conditions for an optimum is fairly complicated; detailed discussions of these conditions are available in many books, e.g., Ewing (1969), Hadley and Kemp (1971), and Young (1969). In many cases there are simpler ways of establishing the sufficiency of the solution of a variational problem, and we discuss a few such cases in later chapters. In this section we consider a special case of the problem in which the Lagrangian satisfies additional assumptions, such as convexity or concavity, which provide sufficient conditions for global extrema. In applications in economics, statistics, and mathematical programming, assumptions of convexity and concavity are satisfied quite frequently. For notions of convexity, see the discussion in Chapter IV.

Assume that the function $f$, defined over a convex set $A$ in Euclidean space of $k$ dimensions, is concave and differentiable on the open convex set $A$. Then for any $u, v \in A$, $u = (u_1, u_2, \ldots, u_k)$, $v = (v_1, v_2, \ldots, v_k)$,

$$f(v) \le f(u) + (v - u)'\,\partial f/\partial u, \qquad (2.9.1)$$

where $a'$ denotes the transpose of the column vector $a$ and $\partial f/\partial u$ denotes the column vector of partial derivatives of $f$ with respect to the components of the vector $u$. Assume further that the second order partial derivatives of $f$ exist, and let the Hessian $H$ be the matrix of the second order partial derivatives. That is,

$$H = \Big(\frac{\partial^2 f}{\partial u_i\,\partial u_j}\Big). \qquad (2.9.2)$$
Then by Taylor's expansion we have

$$f(v) = f(u) + (v - u)'\,\partial f/\partial u + \tfrac{1}{2}(v - u)'\,H[u + \theta(v - u)]\,(v - u), \qquad (2.9.3)$$

where $0 < \theta < 1$. From (2.9.1) and (2.9.3) we see that if $f$ is concave, then

$$(v - u)'\,H[u + \theta(v - u)]\,(v - u) \le 0.$$

That is, $H$ is a negative semidefinite matrix for all $\theta$. Similarly, if $f$ is convex, $H$ is a positive semidefinite matrix. Strict concavity of $f$ implies negative definiteness of the Hessian, and strict convexity implies positive definiteness. We assume for the variational problem that the Lagrangian $L(x, y, y')$ is a concave function in $(y, y')$ over an open convex set that includes every point $(y, y')$ for which $(x, y, y')$ belongs to a given set $R$ in three dimensions. We have the following theorem.

Theorem 2.9.1 Suppose $L(x, y, y')$ is concave with respect to $(y, y') \in R$, open and convex, for each $x$ in a finite interval $[a, b]$. Let $y^0$ be an admissible function satisfying the Euler equation, and let $y^0$ also satisfy the corner conditions. Then $W[y^0]$ is the global maximum of

$$W[y] = \int_a^b L(x, y, y')\,dx. \qquad (2.9.4)$$
Proof Let $y_1$ be any other admissible function. We show that

$$W[y_1] \le W[y^0].$$

Let the derivatives $y^{0\prime}$, $y_1'$ be defined arbitrarily at the corners, if $y^0$ and $y_1$ have any, subject to the condition that $(y^0, y^{0\prime})$ and $(y_1, y_1')$ lie in the open convex set $R$. Let

$$P = y_1(x) - y^0(x) \quad\text{and}\quad P' = y_1'(x) - y^{0\prime}(x).$$

Since $L(x, y, y')$ has been assumed concave, we have

$$L(x, y_1, y_1') - L(x, y^0, y^{0\prime}) \le P\,\frac{\partial L}{\partial y^0} + P'\,\frac{\partial L}{\partial y^{0\prime}}. \qquad (2.9.5)$$

Integrating both sides of the inequality (2.9.5), we have

$$W[y_1] - W[y^0] \le \int_a^b\Big(P\,\frac{\partial L}{\partial y^0} + P'\,\frac{\partial L}{\partial y^{0\prime}}\Big)dx. \qquad (2.9.6)$$

Assume that the corners of $y^0$ are at $x_1, \ldots, x_{m-1}$, with $x_0 = a$ and $x_m = b$. The integral in (2.9.6) can be integrated by parts; we have

$$\int_a^b\Big(P\,\frac{\partial L}{\partial y^0} + P'\,\frac{\partial L}{\partial y^{0\prime}}\Big)dx = \sum_{i=1}^{m}\Big[P\,\frac{\partial L}{\partial y^{0\prime}}\Big]_{x_{i-1}}^{x_i} + \int_a^b P\Big(\frac{\partial L}{\partial y^0} - \frac{d}{dx}\frac{\partial L}{\partial y^{0\prime}}\Big)dx. \qquad (2.9.7)$$

Since $y^0$ satisfies the Euler equation and the corner conditions, and $P(a) = P(b) = 0$, the right-hand side of (2.9.7) is zero, and hence

$$W[y_1] \le W[y^0]. \qquad (2.9.8)$$

That is, $y^0$ gives the global maximum.

Remark The case of a convex Lagrangian and a global minimum is exactly the same as above. When the interval of integration is infinite, the proof requires a few more technicalities, which can be overcome with additional assumptions.

Corollary Under the conditions of Theorem 2.9.1, if the set $R$ is convex and $L(x, y, y')$ is strictly concave in $(y, y')$ over $R$, then the admissible function $y^0(x)$ obtained from the Euler equation is unique.
Proof Suppose $y^0(x)$ is not unique, and let $y_1(x)$ be another function that maximizes $W[y]$. Now

$$y(x) = \lambda y^0(x) + (1 - \lambda)y_1(x), \qquad 0 \le \lambda \le 1,$$

is admissible, and since $L$ is strictly concave, we have

$$L(x, y, y') > \lambda L(x, y^0, y^{0\prime}) + (1 - \lambda)L(x, y_1, y_1'). \qquad (2.9.9)$$

Hence, integrating, we have

$$W[y] > \lambda W[y^0] + (1 - \lambda)W[y_1], \quad\text{or}\quad W[y] > W[y^0],$$

contradicting the fact that $W[y^0]$ was the maximum. This proves the corollary.
References

Becker, M. (1964). The Principles and Applications of Variational Methods. MIT Press, Cambridge, Massachusetts.
Caianiello, E. R. (1966). Functional Analysis and Optimization. Academic Press, New York.
Chernoff, H. (1970). A bound on the classification error for discriminating between populations with specified means and variances, Tech. Rep. No. 16, 1-13, Stanford Univ., Stanford, California.
David, H. A. (1970). Order Statistics. Wiley, New York.
Denn, M. M. (1969). Optimization by Variational Methods. McGraw-Hill, New York.
Dreyfus, S. (1962). Variational problems with constraints, J. Math. Anal. Appl. 4, 297-308.
Ewing, G. M. (1969). Calculus of Variations with Applications. Norton, New York.
Feller, W. (1957). An Introduction to Probability Theory and Its Applications, Vol. I. Wiley, New York.
Fraser, D. A. S. (1957). Nonparametric Methods in Statistics. Wiley, New York.
Gelfand, I. M., and Fomin, S. V. (1963). Calculus of Variations. McGraw-Hill, New York.
Hadley, G., and Kemp, C. M. (1971). Variational Methods in Economics. American Elsevier, New York.
Hardy, G. H., Littlewood, J. E., and Polya, G. (1952). Inequalities. Cambridge Univ. Press, London and New York.
Hogg, R. V., and Craig, A. T. (1959). Introduction to Mathematical Statistics. Macmillan, New York.
Jenkins, G. M., and Watts, D. G. (1968). Spectral Analysis and Its Applications. Holden-Day, San Francisco.
Mood, A. M., and Graybill, F. A. (1963). Introduction to the Theory of Statistics. McGraw-Hill, New York.
Morse, M. (1973). Variational Analysis. Wiley, New York.
Rustagi, J. S. (1957). On minimizing and maximizing a certain integral with statistical applications, Ann. Math. Statist. 28, 309-328.
Sagan, H. (1969). Introduction to the Calculus of Variations. McGraw-Hill, New York.
Savage, I. R. (1961). Probability inequalities of the Tchebycheff type, J. Res. Nat. Bur. Stand. 65B, 211-222.
Shohat, J. A., and Tamarkin, J. D. (1943). The Problem of Moments. Amer. Math. Soc., Providence, Rhode Island.
Young, L. C. (1969). Lectures on the Calculus of Variations and Optimal Control Theory. Saunders, Philadelphia, Pennsylvania.
CHAPTER III
Modern Variational Methods
3.1 Introduction
In modern engineering applications a prominent part is played by the theory of optimal processes. Variational theory is central to the understanding of many control problems, not only in engineering but also in business, industry, and medicine. In the engineering sciences, Bellman's dynamic programming has been extensively applied, not only to problems of control but also to many other problems of sequential decision making. The development of the Pontryagin maximum principle was motivated by the desire to solve some outstanding problems in control theory; it too has found applications in many other areas. The technique of dynamic programming uses the principle of optimality, and there is an extensive literature on its applications. Although both the maximum principle and the principle of optimality address similar problems, there is a basic difference in their approaches. While the Pontryagin principle provides necessary conditions for the existence of an optimum, dynamic programming gives an algorithm to arrive at the optimum, and less restrictive conditions are used in the techniques of dynamic programming. In this chapter we describe these two principles and give a comparison; it will be seen that, starting with one principle, one can arrive at the other. There are many other nonclassical methods in variational theory, such as those arising in the development of the theory of testing statistical hypotheses by Neyman and Pearson. We discuss the theory of testing statistical hypotheses in Chapter V. A large number of variational problems have also been solved by the use of the
Hahn-Banach theorem, fixed-point theorems, and the minimax theorem. We shall have occasion to refer to some of them later. In this chapter we discuss the maximum principle and dynamic programming. Problems in variational theory involving inequality constraints require that the solution provided by the Euler-Lagrange equation be used to arrive at the extremals while ignoring the inequality constraints, after which one tries to fit the solution to the constraints. The maximum principle is essentially a formalization of this procedure. Such an approach will also be seen to be fruitful in dealing with moment problems, and we shall discuss them later. For examples, see Rustagi (1957) and Karlin and Studden (1966).
3.2 Examples

We consider here a few examples specially chosen from control theory. As we shall see, the classical variational techniques can be applied to many situations. In computing numerical answers, however, the technique of dynamic programming will be introduced. Bellman's functional equation (also known as the Hamilton-Jacobi-Bellman equation) will be obtained later.

Example 3.2.1 (Bellman) Let the state of a system be described by $x(t)$ at time $t$, $0 \le t \le T$. Let the system be controlled by $u(t)$. Suppose the system equation is given by

$$x'(t) = ax(t) + u(t), \qquad (3.2.1)$$

where $x'(t) = dx/dt$, and

Case (i): $x(0) = c$; (3.2.2)

Case (ii): $x(0) = c_1$ and $x(T) = c_2$. (3.2.3)

In (i) we have constraints only on the initial state of the process; in (ii) the terminal state is also constrained, by Eqs. (3.2.3). The system is described in Fig. 3.1. Let the performance of the system be measured by a function $J$ given by

$$J(x, u) = \int_0^T [u^2(t) + x^2(t)]\,dt. \qquad (3.2.4)$$
Figure 3.1
Our object here is to find $u^0(t)$ so as to minimize $J(x, u)$. Assume that the class of admissible functions is continuous and differentiable almost everywhere in $(0, T)$. The necessary condition for a minimum can be obtained from the Euler-Lagrange equations. Now if the equations have a unique solution and the solution of the differential equation yields the minimum, then we have completely solved the problem. This approach sometimes works, but not always. In the above case, using (3.2.1), the criterion $J(x, u)$ becomes

$$J(x, u) = \int_0^T\{x^2(t) + [x'(t) - ax(t)]^2\}\,dt.$$

Using (3.2.3) for case (ii), the Euler equation now gives

$$x''(t) - (1 + a^2)\,x(t) = 0. \qquad (3.2.5)$$

The general solution of (3.2.5) is given by

$$x^0(t) = k_1 e^{bt} + k_2 e^{-bt}, \qquad (3.2.6)$$

where $b = (1 + a^2)^{1/2}$ and $k_1$, $k_2$ are determined in case (ii) by

$$c_1 = k_1 + k_2, \qquad c_2 = k_1 e^{bT} + k_2 e^{-bT},$$
so that

$$k_1 = \frac{c_2 - c_1 e^{-bT}}{e^{bT} - e^{-bT}}, \qquad k_2 = \frac{c_1 e^{bT} - c_2}{e^{bT} - e^{-bT}},$$

giving the extremal explicitly, and the solution (3.2.6) is unique. Consider a solution $x_1(t) = x^0(t) + w(t)$ with $w(0) = 0 = w(T)$, making $x_1(t)$ satisfy constraints (3.2.3); that is, $x_1(0) = c_1$ and $x_1(T) = c_2$. Then

$$J[x_1, u_1] = J[x^0, u^0] + \int_0^T[b^2 w^2(t) + w'^2(t)]\,dt + 2\int_0^T[b^2 w(t)x^0(t) + x^{0\prime}(t)w'(t)]\,dt.$$

Notice that

$$\int_0^T[b^2 w(t)x^0(t) + x^{0\prime}(t)w'(t)]\,dt = [x^{0\prime}(t)w(t)]_0^T - \int_0^T[x^{0\prime\prime}(t) - b^2 x^0(t)]\,w(t)\,dt = 0.$$
111.
MODERN VARIATIONAL METHODS
Hence J [ X l ( t ) ,uo(t>I < J [ X O ( t ) , uo(t)l
and is attained when w(t) = 0 for all 1. Indirectly, therefore, we have shown the existence and uniqueness of the solution. Case (i) and other special situations can be taken care of similarly. In the next example we consider a discrete case and introduce the idea of a functional equation.
Example 3.2.2 (Bellman) Let the system be described by a discrete set of variables $x_0, x_1, \ldots, x_N$, denoted by the vector $x$, and be governed by the difference equations

$$x_{n+1} = ax_n + u_n, \qquad n = 0, 1, 2, \ldots, N-1, \qquad (3.2.7)$$

with $x_0 = c$, where $u_n$ denotes the control variable, $u = (u_0, u_1, \ldots, u_N)$. Let the criterion function be given by

$$J_N(x, u) = \sum_{n=0}^{N}(x_n^2 + u_n^2). \qquad (3.2.8)$$

Suppose we are interested in finding the values of the $u_i$ such that (3.2.8) is minimized. One could consider the above problem as a discrete analog of the continuous problem discussed in Example 3.2.1. We introduce the function

$$f_N(c) = \min_u J_N(x, u). \qquad (3.2.9)$$

Notice that if $u_i = 0$, $i = 0, 1, 2, \ldots, N$, the $x_n$'s can easily be found, since from (3.2.7) we have

$$J_N(x, 0) = \sum_{n=0}^{N} a^{2n}c^2 = c^2\sum_{n=0}^{N} a^{2n},$$

since $x_n = ax_{n-1} = a^2 x_{n-2} = \cdots = a^n x_0 = a^n c$. In general, we proceed recursively from (3.2.9). After $u_0$ is chosen, we have $x_1 = ac + u_0$. Therefore,

$$J_N(x, u) = u_0^2 + c^2 + \sum_{n=1}^{N}(x_n^2 + u_n^2).$$

From (3.2.9), notice that the inner minimization over $u_1, \ldots, u_N$ is itself a problem of the same form with initial state $ac + u_0$,
so that we have

$$f_N(c) = \min_{u_0}\{c^2 + u_0^2 + f_{N-1}(ac + u_0)\}. \qquad (3.2.10)$$

That is, the minimization of $J$ over the $N$-dimensional vector $u$ has been reduced to a minimization over the one-dimensional $u_0$. Equation (3.2.10) is the well-known functional equation of Bellman and introduces the idea of dynamic programming for solving optimization problems of the type described here.
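A short sketch of the recursion (3.2.10): since each $f_N$ here is quadratic in $c$, writing $f_N(c) = p_N c^2$ turns (3.2.10) into a scalar recursion for $p_N$, which can be checked against brute-force minimization for small $N$. The parameter values and the control grid are arbitrary choices.

```python
import numpy as np
from itertools import product

a = 0.5

# Riccati-type recursion: if f_{N-1}(z) = p z^2, then (3.2.10) gives
# f_N(c) = c^2 + min_u [u^2 + p(ac + u)^2] = (1 + a^2 p / (1 + p)) c^2
def f_coeff(N):
    p = 1.0                                   # f_0(c) = c^2 (optimal u_N = 0)
    for _ in range(N):
        p = 1.0 + a**2 * p / (1.0 + p)
    return p

# Brute-force check for small N: minimize J_N over a grid of controls u_0..u_{N-1}
def brute(N, c, grid):
    best = np.inf
    for u in product(grid, repeat=N):
        x, J = c, 0.0
        for un in u:
            J += x**2 + un**2
            x = a * x + un
        best = min(best, J + x**2)            # terminal stage with u_N = 0
    return best

c, grid = 1.0, np.linspace(-1, 1, 41)
print("DP:", f_coeff(3) * c**2, "  brute force:", brute(3, c, grid))
```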
Example 3.2.3 (Kushner) Suppose two coins $(c_1, c_2)$ are tossed independently of each other, $c_1$ having probability of a head equal to $p_1$, and $c_2$ equal to $p_2$. Let $X_n$ be the number of heads up to the $n$th toss. The decision as to which coin to choose at toss $n$ depends on $X_n$, and not on $X_1, \ldots, X_{n-1}$. The decision rule $u_n(x)$, which specifies the coin to be chosen at the $(n+1)$st toss when $X_n = x$, can be designated as a control. The sequence of random variables $\{X_n\}$, $n = 0, 1, \ldots$, forms a Markov chain. Notice the probabilities

$$\Pr\{X_{n+1} = X_n\} = 1 - p_1 \quad\text{if}\quad u_n(X_n) = c_1, \qquad = 1 - p_2 \quad\text{if}\quad u_n(X_n) = c_2,$$

and

$$\Pr\{X_{n+1} = X_n + 1\} = p_1 \quad\text{if}\quad u_n(X_n) = c_1, \qquad = p_2 \quad\text{if}\quad u_n(X_n) = c_2.$$

The above equations give the transition probabilities of the Markov chain $\{X_n\}$. The problem of interest is to find the optimal control so as to minimize $E(X_M)$, where $M$ is specified in advance. Again the procedure outlined in the previous example can be used to obtain the optimal control. We describe the technique of dynamic programming in the next section.
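The backward recursion for this example is immediate to program. The sketch below (head probabilities and horizon are arbitrary choices) computes the minimal $E(X_M)$ by dynamic programming; for this criterion the optimal policy is simply to toss the coin with the smaller head probability at every stage, so the value is $M\min(p_1, p_2)$.

```python
import numpy as np

p = (0.3, 0.6)        # head probabilities of the two coins (arbitrary)
M = 10                # horizon

# V[n][x] = minimal E(X_M | X_n = x); backward induction on the Bellman equation
V = {M: {x: float(x) for x in range(M + 1)}}
policy = {}
for n in range(M - 1, -1, -1):
    V[n], policy[n] = {}, {}
    for x in range(n + 1):
        vals = [pi * V[n + 1][x + 1] + (1 - pi) * V[n + 1][x] for pi in p]
        policy[n][x] = int(np.argmin(vals))   # index of the coin to toss
        V[n][x] = min(vals)

print("minimal E(X_M) from X_0 = 0:", V[0][0])   # equals M * min(p) here
```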
3.3 Functional Equations of Dynamic Programming
In the last sections we have introduced a few examples from control theory and have also indicated the technique of solving them. The technique will be developed in this section for a general class of problems and will be further applied to problems in statistics. The theory of dynamic programming has proved to be of significance in solving problems in control theory as well as in
many other areas. Its successful use has been made in problems in economics, business, and medicine, and it is constantly being applied to new areas of optimization. We follow Bellman’s treatment in this section. The basic contribution of this approach is that a problem of optimization in n dimensions can be reduced to that of one dimension by using the functional equation. Suppose a physical or mechanical system is being studied at several stages. Let the system be characterized at any time t by state variables. At each stage one may choose a number of decisions that transform the state variables. In studying such a process by dynamic programming we shall suppose, in addition, that the past history of the system is not needed to determine future actions. The process is studied so as to optimize a certain given function of state variables. The problem is to find optimal decisions at any given stage. Such an optimal sequence of decisions is referred to as optimal policy. In solving the problem, Bellman utilized the principle of optimality and obtained the basic functional equation of dynamic programming, called the Bellman equation. The principle of optimality is stated below.
Principle of Optimality An optimal policy has the property that, whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

First consider the following process, in which the outcome of a decision is uniquely determined by the decision. Such a process is called a deterministic process. Suppose that the process has a finite or countably infinite number of stages. Let the vector

$$x(t) = [x_1(t), \ldots, x_n(t)]$$

be the state of the system at time $t$, and suppose $x(t)$ belongs to a set $D$. A policy is defined with the help of a transformation $T$ such that $T$ transforms a point of the set $D$ into itself. Consider a policy of the type $(T_1, T_2, \ldots, T_N)$, assuming $N$ finite ($N$ may, in general, be countably infinite). We assume that the successive states are generated by

$$x_1(t) = T_1(x(t)), \quad x_2(t) = T_2(x_1(t)), \quad \ldots, \quad x_N(t) = T_N(x_{N-1}(t)).$$

Let the criterion function be a function of the final stage of the process, say

$$R[x_N(t)]. \qquad (3.3.1)$$
When $D$ is finite, the maximum of $R$ over $D$ exists. An optimal policy can be obtained when $T_q(x(t))$ is jointly continuous in $x(t)$ and $q$ for all $x(t)$ in $D$ and $q$ in a finite region $S$. If $S$ is finite at each stage, the maximum always exists. Suppose the maximum of $R[x_N(t)]$ over the class of all $x_N(t)$'s is denoted by $f_N[x(t)]$. A recurrence relation involving members of the sequence $\{f_j[x(t)]\}$ is derived using the principle of optimality. That is, if we choose some transformation $T_q$ as part of our decision at the first stage, it gives a state $T_q(x(t))$, and the maximum of the criterion function over the following $(N-1)$ stages is $f_{N-1}[T_q(x(t))]$. Therefore,

$$f_N[x(t)] = \max_{q\in S} f_{N-1}[T_q(x(t))] \qquad\text{for}\quad N \ge 2, \qquad (3.3.2)$$

and

$$f_1[x(t)] = \max_{q\in S} R[T_q(x(t))]. \qquad (3.3.3)$$
Notice that there may be many optimal policies that maximize the criterion function. The functional equations (3.3.2) and (3.3.3) are called Bellman equations (sometimes Hamilton-Jacobi-Bellman equations) and form the basis of the method of dynamic programming.

Consider a class of continuous-time decision processes corresponding to the discrete-time process just discussed. Assume that the state of a system is defined by the vector

$$x(t) = [x_1(t), \ldots, x_n(t)] \qquad\text{for}\quad 0 \le t \le T.$$

The components of $x(t)$ may describe the motion of a particle in space at time $t$, giving its position, velocity, and acceleration. Suppose that $x(t)$ belongs to a set $X$, and suppose that the behavior of this system is controlled by a set of controls

$$u(t) = [u_1(t), u_2(t), \ldots, u_m(t)],$$

where $u(t)$ belongs to a set $U$. Assume that $U$ is a bounded and closed region. Let the motion of the particle or the behavior of the system be governed by the $n$ differential equations

$$dx_i(t)/dt = f_i[x_1(t), \ldots, x_n(t);\ u_1(t), \ldots, u_m(t);\ t], \qquad (3.3.4)$$

$i = 1, 2, \ldots, n$. In vector notation we can write the equations in (3.3.4) as

$$dx(t)/dt = f[x(t), u(t)], \qquad (3.3.5)$$

enlarging the vector $u(t)$ to include $t$ also. Let $f_0(x, u)$ be a given function, and let $x(0) = x^0$ and $x(T) = x^1$ give the states of the system at times $t = 0$ and $t = T$.
The problem of finding an optimal control is to optimize a functional $J(x, u)$,

$$J(x, u) = \int_0^T f_0[x(t), u(t)]\,dt, \qquad (3.3.6)$$

over the class of permissible controls in $U$. By permissible controls we mean controls that satisfy the differential equation (3.3.5). It is obviously a problem of variational type and, in many cases, can be solved with the help of Euler's equation. In a later section we shall give the maximum principle, which solves this problem under a more general setup. If $f_0(x, u) \equiv 1$, so that $J(x, u) = T$, the problem of finding an optimal control is called the time optimal case.

Consider the problem in general as described above. Equation (3.3.5) implies that $u(t)$ is a function of $x(t)$ and $x'(t)$, so that the criterion $J(x, u)$ can be rewritten in terms of some other functional $g$ when the constraints (3.3.5) are used. Let $f_0$ be replaced by $g$ in (3.3.6), so that

$$J = \int_0^T g(x(t), x'(t))\,dt. \qquad (3.3.7)$$

Let $x(0) = x^0$ be the initial condition used. We now introduce a function $V$:

$$V(x^0, T) = \min_x \int_0^T g(x(t), x'(t))\,dt.$$

The choice of $x(t)$ over $[0, T]$ consists of a choice over $[0, S]$ and a choice over $[S, T]$. Hence the initial value $x^0$ of $x(t)$, as a result of the choice of $x(t)$ over $[0, S]$, becomes

$$x^2 = x^0 + \int_0^S x'(t)\,dt. \qquad (3.3.8)$$

Using the principle of optimality and the additive character of the integral, we find the functional equation

$$V(x^0, T) = \min_{x(t),\,0\le t\le S}\Big\{\int_0^S g(x(t), x'(t))\,dt + V[x^2, T - S]\Big\}. \qquad (3.3.9)$$
The above informal discussion can be made rigorous. The functional equation (3.3.9) provides an alternative form of the Euler equations. When the process is stochastic, that is, when the state vector $x(t)$ is not completely determined but is a random vector, the problem of dynamic programming may be formulated in the following manner. Let $G(x(t), z)$ be the cumulative distribution function of a random vector $Z$ with initial value $x(t)$, $Z \in D$. Let $T_q$ transform the vector $x(t)$ into $Z$, and denote the distribution function of $Z$ by $G_q(x(t), z)$. In the stochastic case we first take the expected return after $N$ stages and then take the maximum. That is, we want to maximize $E[R(x_N(t))]$. Let the maximum again be denoted by $f_N[x(t)]$. Then we have

$$f_N[x(t)] = \max_{q\in S}\int f_{N-1}(z)\,dG_q(x(t), z). \qquad (3.3.10)$$

The functional equation in this case is given by (3.3.10). When the number of stages $N \to \infty$, we can approximate the above discrete functional equation by

$$F[x(t)] = \max_{q\in S}\int F(z)\,dG_q(x(t), z), \qquad (3.3.11)$$

where $F[x(t)]$ represents the final maximum. Computational techniques and other applied problems are given by Bellman and Dreyfus (1962). Examples in statistics using the techniques of dynamic programming will be discussed later. Here we give an example from Markov chains.
Example 3.3.1 (Markov decision chain) The principle of dynamic programming can easily be expressed for finite Markov decision chains with a discrete time parameter. Suppose a system having states $1, 2, \ldots, S$ is observed at times $n = 1, 2, 3, \ldots$. When the system is in state $s$, an action $a$ is chosen from a given set of possible actions $A$, and a reward $r(s, a)$ is received. Let $p(t|s, a)$ be the conditional probability that the system is in state $t$ at time $n + 1$, given that it is in state $s$ at time $n$ and that action $a$ is taken at time $n$. Here $r(s, a)$ is really an expected reward, given by

$$r(s, a) = \sum_t r(s, a, t)\,p(t|s, a), \qquad (3.3.12)$$

where $r(s, a, t)$ is the reward received at time $n + 1$ when the system moves from state $s$ to state $t$.
Let $V_s^N$ be the maximum expected reward in $N$ periods starting from state $s$. The principle of optimality gives the functional equation

$$V_s^N = \max_{a\in A}\Big\{r(s, a) + \sum_{t=1}^{S} V_t^{N-1}\,p(t|s, a)\Big\}, \qquad 1 \le s \le S, \qquad (3.3.13)$$

which can be solved by dynamic programming.
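Equation (3.3.13) is a backward recursion that can be carried out directly. The following sketch applies it to a small randomly generated decision chain; the states, actions, rewards, and transition probabilities are assumed data, chosen only for illustration.

```python
import numpy as np

# A toy finite Markov decision chain: S states, A actions, assumed data
S, A, N = 3, 2, 20
rng = np.random.default_rng(3)
r = rng.uniform(0, 1, (S, A))                 # rewards r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))    # transition rows p(t | s, a)

# Backward recursion (3.3.13): V^N_s = max_a { r(s,a) + sum_t V^{N-1}_t p(t|s,a) }
V = np.zeros(S)
for _ in range(N):
    V = np.max(r + np.einsum("sat,t->sa", P, V), axis=1)

print("V^N:", np.round(V, 4))
```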
3.4 Backward Induction
The functional equations of dynamic programming use backward induction. This procedure is also basic to obtaining optimal decision procedures in sequential sampling. Suppose a statistician must decide at each stage of experimentation whether to continue experimentation or to stop and make a decision. Let the number of observations be bounded by $n$. The process requires that the statistician know how to use his first observation $X_1$ in deciding whether to make a terminal decision $\delta$ without any more observations or to take $X_2$. This amounts to answering the question: should he take observation $X_2$ or stop sampling after taking $X_1$? The process continues until he reaches $X_n$, at which time he must make a decision, since sampling is terminated at that point. The answer at stage $n$ is easy to give, since there is no more sampling after observing $X_n$. The decision at stage $n - 1$ is given with the help of the observed sequence $X_1, \ldots, X_{n-1}$, and hence the statistician can work backwards. We shall give a formal derivation of the backward induction procedure for sequential decision making later.

The process of backward induction used in the functional equation of dynamic programming can be summarized as follows. For a stochastic or deterministic dynamic process, let $\zeta_n$ describe the history of the process up to stage $n$; $\zeta_n$ may consist of observations or of the posterior distribution. Let $\rho(\zeta_n)$ be the expected cost of an optimal procedure. Then $\zeta_{n+1}$ will describe the history of the process up to stage $n + 1$; it depends on $\zeta_n$, the history up to stage $n$, and $a_n$, the action taken at stage $n$. We show this dependence by $\zeta_{n+1}(a_n, \zeta_n)$. The backward induction is then given by the equation

$$\rho(\zeta_n) = \inf_{a_n}\rho[\zeta_{n+1}(a_n, \zeta_n)]. \qquad (3.4.1)$$
Equation (3.4.1) is like the functional equations of the various dynamic programming procedures mentioned in the last section. Although the first use of backward induction may be hidden in antiquity, its first sophisticated use was made by Arrow et al. (1949) in providing the Bayes
and minimax solutions of truncated sequential decision problems. The dynamic programming of Bellman can be thought of as a generalization of the concepts of sequential decision theory. For further discussion see Chernoff (1968).
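A small illustration of the backward induction (3.4.1), not from the text: optimal stopping of at most n independent Uniform(0,1) observations, where stopping yields the current observation. With k draws remaining, one stops at x if x exceeds the value v_{k-1} of continuing, giving v_k = E[max(X, v_{k-1})].

```python
# Optimal stopping of at most n i.i.d. Uniform(0,1) observations, reward = value stopped at.
n = 10
v = 0.0                                  # with 0 draws left the reward is 0
for _ in range(n):
    # E[max(X, v)] for X ~ U(0,1):  v*v + ∫_v^1 x dx = v^2 + (1 - v^2)/2
    v = v * v + (1 - v * v) / 2
print("value of the n-stage stopping problem:", round(v, 4))
```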
3.5 Maximum Principle

In a well-known series of papers, culminating in the book by Pontryagin et al. (1962), a general class of problems in the optimization of continuous processes is solved. In this section a statement of the problem and its solution are given. The language used in discussing the control problem is generally that of mechanical systems; however, there are many situations in economics, business, and industry where the study of optimal processes is useful. The importance of the maximum principle can be understood in the words of L. C. Young (1969): "The proof of the necessity of the Maximum Principle ... represents the culmination of the efforts of mathematicians for considerably more than a century, to rectify the Lagrange multiplier rule." Of course, the necessity of the condition without existence leaves us with only a recipe to be tried. The philosophy of this approach has also been used in solving moment problems.

We consider the problem of finding an optimal control in the set $U$ for optimizing the functional $J(x, u)$ in (3.3.6), with constraints given by the differential equations in (3.3.5) and initial conditions. The statement of the maximum principle is in terms of functions having general properties. Here, for the sake of simplicity, we assume that the class $U$ consists of piecewise continuous controls. The functions $f_i$ are assumed to be continuously differentiable with respect to $x_1, \ldots, x_n$. An interesting case occurs when the functions $f_i$ do not depend on $t$ explicitly; this situation is designated an autonomous system. We also assume that the function $f_0(x, u)$ is continuously differentiable. Consider a set of auxiliary functions $\psi_0(t), \psi_1(t), \ldots, \psi_n(t)$ that satisfy the system of linear differential equations

$$\frac{d\psi_i}{dt} = -\sum_{j=0}^{n}\psi_j\,\frac{\partial f_j(x, u)}{\partial x_i}, \qquad i = 0, 1, \ldots, n. \qquad (3.5.1)$$

For a given control $u(t)$, the system of differential equations (3.3.5) provides a solution $x(t)$ with the initial condition $x(0) = x^0$; therefore, for $0 \le t \le T$, the system of equations (3.5.1) becomes

$$\frac{d\psi_i}{dt} = -\sum_{j=0}^{n}\psi_j\,\frac{\partial f_j(x(t), u(t))}{\partial x_i}. \qquad (3.5.2)$$
The system (3.5.1) is a linear homogeneous system of differential equations and has a unique solution

$$\psi(t) = [\psi_0(t), \psi_1(t), \ldots, \psi_n(t)]. \qquad (3.5.3)$$

Notice that $\psi(t)$ is an ordered set of continuous functions having continuous derivatives except at a finite number of points. Define the Hamiltonian function $H$ by

$$H(\psi, x, u) = \psi_0 f_0(x, u) + \psi_1 f_1(x, u) + \cdots + \psi_n f_n(x, u). \qquad (3.5.4)$$

Then Eqs. (3.3.5) and (3.5.1) can be written in terms of the Hamiltonian $H$ as

$$\partial H/\partial\psi_i = dx_i/dt, \qquad (3.5.5)$$
$$\partial H/\partial x_i = -d\psi_i/dt, \qquad (3.5.6)$$

for $i = 0, 1, \ldots, n$. Regarding the Hamiltonian as a function of $u$, we can take the upper bound of $H(\psi, x, u)$ and denote it by

$$M(\psi, x) = \sup_{u\in U} H(\psi, x, u). \qquad (3.5.7)$$

If the supremum is attained for some value $u^0$, we have the maximum equal to $M(\psi, x)$. For this reason, the necessary condition for optimality has been designated the maximum principle. It is stated in the following theorem.
Theorem 3.5.1 (Maximum Principle) Suppose $u(t)$, $0 \le t \le T$, is a permissible control such that the corresponding trajectory $x(t)$ satisfies the equations

$$dx_i/dt = f_i(x, u, t), \qquad i = 0, 1, 2, \ldots, n, \qquad (3.5.8)$$

starts at the point $x^0$ at time $0$, and passes at time $T$ through the straight line parallel to the $x_0$ axis that passes through the point $x^1$. The necessary condition on the control $u(t)$ and the trajectory $x(t)$ is that there exist a nonzero vector function $\psi(t) = (\psi_0(t), \psi_1(t), \ldots, \psi_n(t))$ corresponding to $u(t)$ and $x(t)$ such that

(i) for all $t$, $0 \le t \le T$, the function $H(\psi(t), x(t), t, u)$ attains a maximum at $u = u(t)$:

$$H[\psi(t), x(t), t, u(t)] = M[\psi(t), x(t), t]; \qquad (3.5.9)$$

(ii) $\psi_0(t) = \text{const} \le 0$; (3.5.10)

(iii) $M(\psi(t), x(t), t) = \displaystyle\int_t^T\sum_{i=0}^{n}\frac{\partial f_i(x(t), u(t), t)}{\partial t}\,\psi_i(t)\,dt$. (3.5.11)
Further, it turns out that if $\psi(t)$, $x(t)$, $u(t)$ satisfy (3.5.8), (3.5.2), and (3.5.9), the function $\psi_0(t)$ is constant, while $M[\psi(t), x(t), t]$ can differ only by a constant from the integral in (3.5.11). Thus it is sufficient to verify (3.5.10) and (3.5.11) at any single instant $t$, $0 \le t \le T$, for example at $t = T$. The above theorem is Theorem 4 of Pontryagin et al. (1962) and covers only one of the many cases of the maximum principle. For example, if the boundary $x^1$ moves along a given curve, the statement must be slightly modified. We shall not give a detailed proof of the maximum principle; the interested reader is referred to the book by Pontryagin et al. The theorem will now be applied to the time optimal case for illustrative purposes.

Example 3.5.1 Consider the special case of the time optimal problem in which the acceleration $d^2x/dt^2$ is the control $u$, with $-1 \le u \le 1$, and we are to bring the particle to the origin in the least time. Define $x_1 = x$ and $x_2 = dx/dt$. Then we have $dx_1/dt = x_2$ and $dx_2/dt = u$. Suppose the initial state is $x^0$ at $t = 0$ and $x^1 = (0, 0)'$ at $T$.
From the auxiliary equations, we have $d\psi_1/dt = 0$ and $d\psi_2/dt = -\psi_1$, so that $\psi_1 = c_1$ and $\psi_2 = c_2 - c_1 t$, giving $u(t) = \text{sign}(c_2 - c_1 t)$. That is, every optimal control is a piecewise constant function (see Fig. 3.2).

Figure 3.2

When $u = 1$, we have $x_2 = t + s_2$, where $s_1$, $s_2$ are constants of integration. Hence $x_1 = \tfrac{1}{2}x_2^2 + s$, where $s$ is some other constant. Similarly, when $u = -1$, we have $x_1 = -\tfrac{1}{2}x_2^2 + s'$. These two possible solutions are shown in Figs. 3.3 and 3.4.
Figure 3.3

Figure 3.4

3.6 Dynamic Programming and Maximum Principle
There is an intimate relationship between dynamic programming and the maximum principle. In the case of the time-optimal control problem, this connection can be seen as follows. Assume $t_0 \le t \le t_1$. Let

$$T(x^0) = t_1 - t_0, \qquad (3.6.1)$$

and let $\Omega$ be the set of points of the space $X$ from which an optimal transition to $x^1$ is possible. We assume that the set $\Omega$ is open and that the function $T(x)$ has continuous partial derivatives with respect to $x$. Consider the following functional $W[x(t)]$ for the optimal trajectory $x(t)$:

$$W[x(t)] = -T(x^0) + (t - t_0), \qquad (3.6.2)$$

so that $W[x(t_1)] = 0$. Differentiating (3.6.2) with respect to $x_\alpha$, multiplying by $f_\alpha(x(t), u(t))$, and summing over $\alpha$, we have

$$\sum_\alpha\frac{\partial W[x(t)]}{\partial x_\alpha}\,f_\alpha(x(t), u(t)) = \frac{dW}{dt} = 1, \qquad (3.6.3)$$

since $dx_\alpha/dt = f_\alpha(x(t), u(t))$, $\alpha = 1, 2, \ldots$, are the given constraints. Suppose now that $u$ is any control, and at time $t + dt$ the trajectory $x(t)$ has changed to $x(t) + dx$. Given that $x^1$ is a fixed point, the time to reach $x^1$ optimally is $T[x(t)]$ and should be less than $T[x(t) + dx] + dt$. That is,

$$T[x(t) + dx] + dt \ge T[x(t)],$$
or

$$W[x(t) + dx] - W[x(t)] \le dt, \qquad\text{i.e.,}\qquad dW/dt \le 1. \qquad (3.6.4)$$

Since $W$ is assumed continuously differentiable, we have

$$\sum_\alpha\frac{\partial W[x(t)]}{\partial x_\alpha}\,f_\alpha(x(t), u) \le 1.$$

Hence

$$\max_u\sum_\alpha\frac{\partial W[x(t)]}{\partial x_\alpha}\,f_\alpha(x(t), u) = 1, \qquad (3.6.5)$$

in view of Eq. (3.6.3). Equation (3.6.5) is essentially Bellman's equation in the dynamic programming formulation of the time optimal problem. Now, assuming twice differentiability of $W[x(t), u]$, we can derive the equations of the maximum principle from the above. Differentiating (3.6.3) with respect to $x_i$ again, we have
$$\sum_\alpha\frac{\partial^2 W}{\partial x_i\,\partial x_\alpha}\,f_\alpha(x(t), u(t)) + \sum_\alpha\frac{\partial W}{\partial x_\alpha}\,\frac{\partial f_\alpha(x(t), u(t))}{\partial x_i} = 0. \qquad (3.6.6)$$

Also,

$$\sum_\alpha\frac{\partial^2 W}{\partial x_i\,\partial x_\alpha}\,f_\alpha(x(t), u(t)) = \frac{d}{dt}\Big(\frac{\partial W[x(t), u(t)]}{\partial x_i}\Big). \qquad (3.6.7)$$

The first term of Eq. (3.6.6) can then be replaced by (3.6.7), and substituting $\psi_i(t)$ for $\partial W[x(t), u(t)]/\partial x_i$, we have (3.6.6) reduced to

$$\frac{d\psi_i}{dt} = -\sum_\alpha\psi_\alpha(t)\,\frac{\partial f_\alpha}{\partial x_i}. \qquad (3.6.8)$$

Equation (3.6.5) also now becomes

$$\max_u\sum_\alpha\psi_\alpha(t)\,f_\alpha(x(t), u) = 1. \qquad (3.6.9)$$

Equations (3.6.8) and (3.6.9) provide the equations of the maximum principle. We have shown that, under the assumption that $W[x(t), u(t)]$ is twice continuously differentiable, the dynamic programming equations are the same as those obtained from the maximum principle. Note that such assumptions are not always satisfied, even in the simple case of the time-optimal control problem.
References

Arrow, K. J., Blackwell, D., and Girshick, M. A. (1949). Bayes and minimax solutions of sequential decision problems, Econometrica 17, 213-244.
Bellman, R. (1956). A problem in the sequential design of experiments, Sankhya 16, 221-229.
Bellman, R. (1957). Dynamic Programming. Princeton Univ. Press, Princeton, New Jersey.
Bellman, R. (1967). Introduction to the Mathematical Theory of Control Processes, Vol. I. Academic Press, New York.
Bellman, R., and Dreyfus, S. (1962). Applied Dynamic Programming. Princeton Univ. Press, Princeton, New Jersey.
Berkovitz, L. (1961). Variational methods in problems of control and programming, J. Math. Anal. Appl. 3, 145-169.
Chernoff, H. (1968). Optimal stochastic control, Sankhya Ser. A 30, 221-251.
Chow, Y. S., Robbins, H., and Siegmund, D. (1970). Optimal Stopping. Houghton-Mifflin, New York.
Desoer, C. (1961). Pontryagin's maximum principle and the principle of optimality, J. Franklin Inst. 271, 361-367.
Dreyfus, S. (1960). Dynamic programming and the calculus of variations, J. Math. Anal. Appl. 1, 228-239.
Fan, Liang-Tseng (1966). The Continuous Maximum Principle. Wiley, New York.
Feldbaum, A. A. (1965). Optimal Control Systems. Academic Press, New York.
Fletcher, R. (ed.) (1969). Optimization. Academic Press, New York.
Gluss, B. (1972). An Elementary Introduction to Dynamic Programming: A State Equation Approach. Allyn and Bacon, Boston.
Jacobs, O. L. R. (1967). An Introduction to Dynamic Programming. Chapman and Hall, London.
Karlin, S., and Studden, W. J. (1966). Tchebycheff Systems: With Applications in Analysis and Statistics. Wiley, New York.
Kaufman, A., and Cruon, R. (1967). Dynamic Programming. Academic Press, New York.
Kipiniak, W. (1961). Dynamic Optimization and Control: A Variational Approach. Wiley, New York.
Lavi, A., and Vogel, T. (eds.) (1966). Recent Advances in Optimization Techniques. Wiley, New York.
Leitmann, G. (ed.) (1967). Topics in Optimization. Academic Press, New York.
Luenberger, D. G. (1969). Optimization by Vector Space Methods. Wiley, New York.
Nemhauser, G. L. (1966). Introduction to Dynamic Programming. Wiley, New York.
Petrov, Iu. P. (1968). Variational Methods in Optimal Control Theory. Academic Press, New York.
Pierre, D. A. (1969). Optimization Theory and Applications. Wiley, New York.
Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., and Mishchenko, E. F. (1962). The Mathematical Theory of Optimal Processes. Wiley (Interscience), New York.
Rustagi, J. S. (1957). On minimizing and maximizing a certain integral with statistical applications, Ann. Math. Statist. 28, 309-328.
Rustagi, J. S. (1968). Dynamic programming model of patient care, Math. Biosci. 4, 141-149.
Sakaguchi, M. (1961). Dynamic programming of some sequential sampling designs, J. Math. Anal. Appl. 2, 446-466.
Siegmund, D. (1967). Some problems in the theory of optimal stopping rules, Ann. Math. Statist. 38, 1627-1640.
Wilde, D. J., and Beightler, C. S. (1967). Foundations of Optimization. Prentice-Hall, Englewood Cliffs, New Jersey.
Wonham, W. M. (1963). Stochastic problems in optimal control, IEEE Int. Conv. Rec. 11, 114-124.
Young, L. C. (1969). Lectures on the Calculus of Variations and Optimal Control Theory. Saunders, Philadelphia, Pennsylvania.
CHAPTER IV
Linear Moment Problems
4.1
Introduction
A study of the so-called classical “problem of moments” has been made by many mathematicians, beginning with Tchebycheff, Markov, and Stieltjes in the nineteenth century, and is still continuing today. Many new notions, such as those of the Stieltjes integral, have been introduced as a result of solving the moment problem. Briefly, the problem of moments is concerned with finding a distribution function with a prescribed set of moments. Many necessary and sufficient conditions for the existence of such a function have been provided in due course, and there is an extensive literature on the problem. For a recent survey the reader may consult Shohat and Tamarkin (1943) and Kemperman (1 97 1). Several optimization problems involving moments of a distribution function have also been studied. Finding bounds on the probability that a random variable exceeds a given constant and deriving Tchebycheff-type inequalities are problems of this type. Several authors have studied bounds for the expectation of a given function of a random variable with certain constraints on its other moments. Variational techniques have been universally utilized in solving many such optimization problems. In this chapter, we discuss linear moment problems such as those of minimizing and maximizing an expectation with statistical applications. Many such problems arise, especially in nonparametric theory, and require that a functional involving the distribution function be minimized or maximized when 64
4.2.
EXAMPLES
65
some of the moments of the distribution function are given, such as the mean and variance. An important problem is to find the distribution function having prescribed mean and variance for which the mean range of a random sample is minimized. Similar problems of optimization involving the extrema of the expected value of the largest or smallest order statistic occur in statistics. In the study of some nonparametric statistical problems, finding bounds on the efficiency of test statistics involves variational techniques which we discuss in the next chapter. Two basic approaches are used in the optimization of moment integrals. One is the classical variational approach and the other is through the geometry of moment spaces. In our discussion in this chapter, the approach through the geometry of moments is discussed and several statistical applications are studied. An important role in variational techniques is played by the Hahn-Banach theorem, and, for the sake of illustration, we will utilize it to solve a moment problem. Recently, the moment problems have been studied in much more abstract settings in the literature. For a brief account, see Kemperman (1968, 1971). The nonlinear moment problems will be discussed in Chapter V. Many related moment problems have been studied by Skibinsky (1968), Weiss (1956), Jacobson (1969), and Whittle (1960). Bounds in more general settings, such as in the martingale setting, have been studied by Dharmadhikari and Jogdeo (1969), Dharmadhikari er al. (1968), and Rosen (1970). Many restricted classes of distributions, such as distribution with increasing (decreasing) failure rate, have important applications in reliability, and bounds for them have been considered by Barlow and Proschan (1965). Some of the applications will be studied in a later chapter.
4.2
Examples
In this section we give two examples reflecting the various areas of application in which moment problems arise. We restrict our attention only to linear moment problems.
Example 4.2.1 (Bioassay problem) (Isaacson and Rubin) Suppose a material is associated with points u I , u 2 , . . . in the plane R and these points are uniformly and independently distributed over a two-dimensional region of infinite area. Hence the number of points in a finite subregion Ri of finite area Aj are Poisson with parameter, say, M i . A certain material emanates from each of the points ui and spreads over the region R . We are interested in the concentration of this material at a given point x in R .
66
IV.
L I N E A R MOMENT PROBLEMS
Assume that the concentration at x is some unknown function f of the distance between ui and x denoted by f(ui - x ) , where f is a fixed nonnegative function. The total concentration at x is a random variable Y such that m
Y(x)=
1 f(ui-x).
i=1
An important aspect of this bioassay problem is the choice of functionf'in such a way as to minimize E [1 - e-py(x)] subject to the conditions that the mean and variance of Y(x) are fixed constants. The above problem can be stated in terms of an optimization problem for f a s follows. Let the arbitrary point x be taken as 0 and let Y(0)= Y=Z&j@i). The characteristic function of Y is given by
where UjRi = R withRi n R , =
8,i#j, (4.2.1)
kt gk(n) be the probability that n of the points uj are in R k ; this is Poisson with parameter hAk. Then
or
(4.2.2) where
Hence we have from (4.2.2),
EXAMPLES
4.2.
67
(4.2.3) Therefore, to find the bounds of E [ 1 - e-OY
or = 1 -exp
1 , we need
(1 -A
[I
du)
.
(4.2.4)
Taking the logarithm of cp(t)and obtaining the first two derivatives at t = 0, we obtain the mean and variance of Y; therefore, the side conditions are
a and A
s s
H
R
f(~)du=p
(4.2.5)
f y u ) du = uz.
(4.2.6)
The bioassay problem is to choose f so as to minimize (4.2.4) subject to side conditions (4.2.5) and (4.2.6). It will be seen that A need not be known for the solution of the problem. Also note that (4.2.4) is minimum when
I(
1 -,-Of(’))
du
(4.2.7)
R
is minimum. We consider a more general problem later in this chapter. Example 4.2.2 (Goodman) A population with a finite number of elements is partitioned into an unknown number of disjoint classes. The classes are assumed to possess no natural ordering. A random sample is drawn without replacement, and the problem is to estimate the number of classes in the population. Several practical cases in which the above situation arises are the following.
(i) A company has received a large number of requests for a free sample of its product. It is known that the same people often send more than one request.
68
IV.
LINEAR MOMENT PROBLEMS
From a sample of the requests, we wish to estimate the number of different requests. (ii) In receiving unemployment benefits from a government agency, a family may have many of its members receiving benefits. From the sample of beneficiaries, it is desired to estimate the number of families receiving benefits. (iii) The total number of words in a book may be estimated and a sample can be taken. One may like to estimate the number of distinct words in the book. We formulate the above problems more generally as follows. Here we permit an infinite number of classes. The general problem has been discussed by Harris (1 959). Let Xi be the jth observation and Mibe the ith class. Then
P [ 4 € 4 ] =pi, i = 1, 2, . . . for all j and pi > 0,, ;Z p i = 1. If the number of classes is finite, say S, then we assume p i = 0 for i > S. Let n, be the number of classes occurring exactly r times in the sample. Then
2 rn, =N , r=O 03
(4.2.8)
and the number of distinct classes observed in the sample is
c n,. N
d=
r=l
(4.2.9)
Let the coverage of the sample be denoted by
where the set A consists of all classes from which at least one representative has been observed. Suppose further that the number of distinct classes that will be observed in a second sample of size OLN, a 2 1, is to be predicted. Then denoting by d(a) and ~ ( athe ) number of classes and coverage obtained from ah' observations, the problem is to find upper and lower predictors of E(d(a)), that is, to find the minimum and maximum of the expected value of d(a) when some moments of the distribution with respect to which the expected value is taken are given. Define the random variables
YI .
jth class occurs in the sample, ={ 1 ifotherwise. 0
4.2.
EXAMPLES
69
Then
so that 00
E(d(a)) u
1 [l-(1
j=1
For large N ,
so that
Let (4.2.13) Then (4.2.13) defines a cumulative distribution and is unknown to the experimenter, since p I , p z , . . . are unknown. Now let the random variable Zf) be defined as
zi"'
=
1
if the classj occurs r times in the sample,
0
otherwise.
Then
c zi"' 03
E(nr) =
j=1
(4.2.1 4)
or
or (4.2.1 5)
70
IV.
LINEAR MOMENT PROBLEMS
Similarly,
m
c pj(l-PjY, m
= 1-
(4.2.16)
j=1
so that E(c) u 1 - C pi exp (-Npj).
(4.2.17)
Using (4.2.12) and (4.2.15) and the definition of F(c) in (4.2.13 ) , the expression for E(d(a)) can be written as m
Similarly, m
The problem here then is to find bounds for the integrals
(4.2.19) m
r
(4.2.20) with restrictions on the cumulative distribution function F(x). We consider a general problem of obtaining bounds for the integral
dF(x) with constraints on F(x).
(4.2.21)
4.3.
4.3
CONVEXITY AND FUNCTION SPACES
71
Convexity and Function Spaces
In problems of finding extrema, the notions of convexity and concavity are extensively used. This section provides some relevant definitions and a few pertinent theorems. In later sections, use will be made of the terminology of linear function spaces, linear functionals, and some extensions. We also give the Hahn-Banach theorem, which is commonly utilized in extrema problems. A set S of the Euclidean space R" is called convex if for all x, y E S. the line joining them is also in S. That is, )cx + (1 - X)y E S for 0 < X < 1. A real-valued function f defined on a convex set is called a convex function if f(hx1
-t
( 1 - h )
< V(x1) -t ( 1 - M x z )
(4.3.1)
for O < X < 1. If the function f' is twice differentiable, then the criterion (4.3.1) can be replaced by the condition that
azflax2 2 0.
(4.3.1a)
It can be verified that a convex or concave function defined in the n-dimensional Euclidean space R" is continuous. In the space R", for any point x not equal t o zero and any constant c, the set of points b:Zy=lxfli = c ) is called a hyperplane. A sphere in R" is the set of points (x: Ix - xoI < r ) , where xo is the center and r is the radius of the sphere. A point x is said to be a boundary point of a convex set S if every sphere with center at x contains points of S as well as points outside S. A point of S that is not a boundary point is called an interior point. An important concept in convex sets is that of extreme point. This is a point x of a convex set S that is not interior to any line segment of S. In other words, x is an extreme point of S if there do not exist two points xl, x2 ES with xI# x z andx=Xxl +(1 - X ) x z , O < X < 1. If x is a boundary of a convex set S, then the hyperplane g = l u f l i = c containing x such that 2&ufli < c for ally ES is called a supportinghyperplane of S at x, and Zy==lufli< c is the halfspace determined by the supporting hyperplane. A point x of R" is said to be spanned by a set of points x l , x z , . . . , x p in R" if there are nonnegative quantities a l , . . . , a p with Zf=lai = 1 such that x = Zf=laixi. For later reference, we state a few results in convex sets without proof. For a detailed study of this subject, see Blackwell and Girshick (1954). Theorem 4.3.1 A closed and bounded convex set in R" is spanned by its extreme points, and every spanning set contains the extreme points. Theorem 4.3.2 (i) A closed, convex set is the intersection of all the halfspaces determined by its supporting hyperplanes.
72
IV.
LINEAR MOMENT PROBLEMS
(ii) Every boundary point of a convex set lies in some supporting hyperplane of the set. (iii) Two nonintersecting closed, bounded convex sets can be separated by a hyperplane.
Theorem 4.3.3 For a point x in a closed and bounded convex set in R ” , b(x), the least number of extreme points of a convex set to span x is not greater than n + 1. That is, b(x)< n + 1. Function Spaces Spaces whose elements are functions defined over some fixed set are known
as function spaces. Their study is involved in the development of the variational
methods. We have already seen that an extremum problem is concerned with the choice of a function from a given class of functions such that an integral over the given class is minimized or maximized. An integral is a special functional, the real-valued function defined on the class of functions, and finding an extremum of a functional is similar to finding the extremum of a function of a real variable. That is, variational methods are central to functional analysis in the same way that the theory of maxima and minima is central to the calculus of functions of a real variable. We define a few of the central notions of function spaces and state some pertinent theorems in the sequel. There are excellent books available in functional analysis, and the interested reader is referred to them for details. A real linear space is a set R of elements x, y, z, . . . for which operations of addition (+) and multiplication by the numbers a: 0, . . . are defined obeying the following system of axioms: ( 0 )
x + y = y + x isin R , (x + y )+ z = x + ( y + z ) , 3 0 E R 3 x + O = x forany x E R , Vx E R, 3-x E R 3 x + (-x) = 0, 31 E R 3 V x E R, 1 ‘ x = x , @(OX) = ( ~ P > X , (a + p)x = a x + p x , a(x + y )= a x + ay.
A linear space is called a normed linear space if a nonnegative number llxll is assigned to every x E R such that (i)
(4
(iii)
llxll = 0 if and only if Ilaxll = IaI Ilxll, IIX +yll < llxll + Ilrll.
x = 0,
Afinctional is a real-valued function defined on a function space.
4.3.
E
CONVEXITY A N D FUNCTION SPACES
A functional W b ] is said > 0, there is a 6 such that
73
to be continuous at the point y o E R if for any
IW[yl
-
W b 0 I I < -E
(4.3.2)
if Ily -yoII < 6. The inequality (4.3.2) is equivalent to the two inequalities
W b 0 I - W b l < -E
(4.3.3)
W b 0 I - W b l > --E
(4.3.4)
and The functional is called lower semicontinuous at y o if (4.3.2) is replaced by (4.3.3) only and upper semicontinuous at y o if (4.3.2) is replaced by (4.3.4) in the definition of the continuity of the functional W b ] . An important well-known result for lower semicontinuous functions is stated in the following lemma.
Lemma 4.3.1 If f(x) is a lower semicontinuous function defined on a compact set, then f(x) achieves its infimum. (In particular,fis bounded below.) A functional L is called a linear functional if for any xl,xz € R L((YlX1
+(Yzxz)=~lL(x1)+(YZL(XZ)
and a 2 . for all real numbers Let X* be the space of continuous linear functionals f defined on the function space X such that,
llfll
= sup If(x)l, Ilxll
(4.3.5)
then X* is called the dual space of X . The concept of dual spaces is further utilized in defining convergence in function spaces, which we discuss below. A sequence of linear functionals { f, defined on a normed space X is said to converge weakly to a functional f € X* if f,(x) + f(x) for every x E X . The sequence { x,> is said to converge to x E X if for every f E X * ,
>
f(Xd
+
f(x>.
A functional L on a linear space M is called additive if
LCf+g)=L(f)+L(g)
for all f , g E M .
L is called subadditive if
LCf+g)
for all f, g E M ,
and L is called homogenous if
L(af) = a L ( f )
for all f E M
and all real numbers a.
74
IV.
LINEAR MOMENT PROBLEMS
It can be easily seen that a linear functional is an additive, homogeneous functional. LO is an extension of a linear functional L from a linear space N to linear space M if (i) Lo is defined o n M ; (ii) Lo@ = Lcf) for all f E N . A set S is said to be partiallv ordered by the binary relation p if
(a) (b) (c)
xpy andypz for all x, y, z E S (transitivity). xpy and y p x implies x = y for all x, y E S (antisymmetry). xpx for all x E S (reflexivity).
A partially ordered set S is said to be totally ordered if transitivity holds and one of the following holds: xpy, x = y , y p x for all x, y ES. A maximal totally ordered set is a totally ordered set that is not a proper subset of any other totally ordered set.
Hausdorff Maximality Principle Every nonempty partially ordered set contains a maximal totally ordered subset. Theorem4.3.4 (Hahn-Banach Theorem) If N is a linear subspace of the linear space M, if p is a continuous subadditive functional on M such that p(af) =apcf> for all a 2 0, and if L is an additive, homogeneous functional on N such that L ( f ) < p ( f ) for every f E N, then there is an additive, homogeneous extension Lo of L to the whole of M such that L o ( f )< p ( f ) for everyfEM.
Proof (a) LetfoEM\Nandletf;gEN.Then
Lk)-L(f) =Lk-f)G p k - f ) =P[k+fo)+(-f-fo)l so -P(-f-fo)
-
L(f) G p ( g + fo) -
m.
Thus for each g E N , f
sup [-p(-f-fo)-L(f)l
G p(g+fo)-L(g).
EN
so sup [ - p ( - f - f o )
-L(f)l G inf g EN
fEN
b(g+fo) -ml.
Therefore, there is a number t independent o f f such that -p(-f-fo)
- L(f) G
r G p ( f + fo) - L(f)
for every f~ N .
4.3.
CONVEXITY AND FUNCTION SPACES
75
Now if h = f + a f o where f E N , we define L , ( h ) = Ldf) t a t . If + a l f O = f 2+ a 2 f o , then (al - a 2 ) f o = f 2 - f l E N , and since f o @ N , it follows that a l = a 2 , hence fl = f 2 . Thus the representation of h is unique, so L is well-defined. Now define p such that
fl
(01, /I)
P
(02912)
if O1C O2 and 1, = 12 on 01,where (O,,1 1 ) is an ordered pair with O1 as a subspace of M, N C 0 , C M and 1 is an extension of L to 0 , . The collection S of all ordered pairs (0, 1 ) is partially ordered by p ; and by the Hausdorff maximality principle, S contains a maximal totally ordered subset C. Let R be the union of those subsets C such that (0, I ) E C. Then R is a linear subspace of M . Suppose that f E R . Then f € 0 for some 0 such that (0, 1 ) E C. Now define Lo on R by
Lo(f)=l(f)
for f E R .
LO is well-defined, since if f € O l and f€O2 where (01,1 , ) E C and (02,12)E C, then since C is totally ordered O1 C O2 (or 0 2 C 01) and 1, (f)= l 2 df)by definition. Also it is obvious from the definition of 1 that Lo is an additive, homogeneous extension of L satisfyingL,(f)< p ( f ) forfE R . Now, if we can show that R = M , the theorem is proved. (b) Assume there exists go E q . Let N 1 denote the subspace spanned by R and go. Then, by part (a), there exists an extension L I of LO from R to N 1 satisfying Lly.) < p ( f ) for all f € N 1 and L , ( f ) =Lo(g) + at for f = g + ago E N , . Since N 1 is not a subspace of R , ( N , , L l ) $ C. Since for any (0, I ) € C, f € 0 implies f = f + OgoE N , , we have 0 g N 1 and /(f)=Lo(f)=Ll(f). This implies (0, 1 ) p (Nl, L I ) for all (0, I ) € C. However, since (N1,L 1 ) $ C, this implies that C i s not maximal. This contradiction implies R = M . Corollary Let fo EM\N and L be an extension of L1 satisfying L d f ) < p ( f ) for every f € M . Then
L o ( f o )< L d f )
for every f € M ,
where
L o d f o ) = sup [ - p ( - f - f o )
-
L 1 (fll'
fEN
Proof Let L ( f l be an extension o f L l satisfyingL(f)< p(f) for everyfEM. For f € N, L(-f- fo) < p(-f- fo) implies -L(-f-fo)
-L,(f)> -p(-f-fo)-LI(f)*
Therefore,
Ldfo) 2 - P ( - f - f o )
- L(f),
76
4.4
IV.
LINEAR MOMENT PROBLEMS
Geometry of Moment Spaces
The problems of optimization involving moments can be easily understood if the geometric formulation is kept in mind. In what follows, we describe some salient features of the geometry of moments used in obtaining solutions to the optimization problem of moments. An excellent development of the geometry of moment spaces is given by Karlin and Shapley (1953) and Karlin and Studden (1966). For some of the recent applications of the geometry of moments techniques in solutions of the optimization problems in probability and statistics, see Chernoff and Reiter (1954), Rustagi (1957, 1961), Harris (1959, 1962), and Skibinsky (1 968). For the sake of simplicity, in the discussion of the moment problem in this chapter, only a finite valued random variable is assumed. Without loss of generality, we define the distribution function F(x) of X on a finite interval [0, 11. Assume further that the rth moment of X is given by
E ( X r ) = pr' =
i
0
x r dF(x).
(4.4.1)
Let p r ' , r = 1, 2 , . . . , k be k given moments of the random variable X . Considerable study of the distribution function F(x) having a preassigned set of moments pl', . . . , pk' has been made by many mathematicians. For a review, see Shohat and Tamarkin (1943). A well-known condition such that the c l a s s 9 of distribution functions F(x) in [0, 11 exists with pl', pz', . . . ,p k ' is that the determinants 1
D. = I
PI'
...
(4.4.2)
for j = 1, 2, . . . ,k are all nonnegative. Conditions for the general case are also available in Shohat and Tamarkin (1 943). It is easy to verify that the class 9is convex, since for any F1, FZ€ AF, + (1 - h)F2 for 0 < X < 1 also belongs to the class 9. The extreme points
4.4.
GEOMETRY OF MOMENT SPACES
77
of the set 9 are the degenerate distribution functions. That is, the extreme points of the convex set 9are given by F, ( x ) where (4.4.3) The degenerate distribution F,(x) cannot be written as a nontrivial convex combination of other distributions. Every other distribution can be so written. Let 9k+l be the subclass of 9 s u c h that the class 9 k + l has a given set of first k moments. A discrete cumulative distribution function can be written as a convex linear combination of the one-point distributions. That is, W) FA(x)=
E
i=l
EiFuJx)
(4.4.4)
where Z:ie)4i = 1, ti > 0, i = 1,2, . . . , b(F). b(F) denotes the number of jumps in FA(x). The set of points X = (XI, . . . , X,,) whose coordinates are the first n moments of at least one cumulative distribution function is called the moment space, and we shall denote it by D". Theorem 4.4.1 D" is a closed, bounded convex set in n-dimensions. For proof, see Karlin and Shapley (1953). The extreme points of the moment space D" are generated by the degenerate distribu.tions F, (x). Then the point of the moment space D" corresponding to Fu ( x ) is given by
X ( a ) = (a, a', a3, . . . , a").
(4.4.5)
The following theorem is from Karlin and Shapley. Theorem 4.4.2 The set of extreme points of D" is the set of points X(a) as a runs between 0 and 1 .
Proof Let C" denote the set of points X ( a , ) as a runs through [0, 11. We first show that C" spans D". Let DA" be the subset of D" generated by the discrete distributions FA. Since C" is closed and bounded, DAn is also closed and bounded. Now for any F E 8, there is a sequence of step functions FA" such that 1 1 lim
u-00
1
f(t)dFA'(f)
0
f(f) d F ( t )
= 0
for every continuous functionf. Takingf(t) = f, t', . . . , t", we see that D" is the closure of the set DAn. But DA" is closed and hence D n = D A n .Therefore, C" spans D".
78
IV.
LINEAR MOMENT PROBLEMS
Further, we show that no points of C" are spanned by other points of C". Consider a fixed X(a) E C". Let H be the hyperplane defined by the equation a2-
2X,a t x2= 0.
(4.4.6)
general point x ( t ) E C", we have the hyperplane Thus X(a) E H , while the rest of C1 lies in the positive half space determined by H. Therefore X(a) is not spanned by other points of C". Hence, using Theorem 4.3.1, we complete the proof. But
for
a
a2 - 2ra + f 2 = (a - f)'.
In the next section we consider an optimization problem in which the expectation of a function (not linearly dependent on the x , x 2 , . . .) is to be minimized or maximized over a given moment space.
4.5
Minimizing and Maximizing an Expectation
In the previous section, the geometry of moment spaces was introduced; we now apply it in order to characterize the solution of some linear moment problems. We consider the problem of minimizing and maximizing the expectation of a function g of a random variable X having a cumulative distribution function F ( x ) ; that is, 1
E(g(X)) =
g(x) N X ) .
(4.5.1)
0
We again assume that the random variable X is finite valued and is between 0 and 1. We shall see that in case g(x) is convex on (0, l), the minimizing cumulative distribution function is a discrete distribution with the number of jumps specified by the number of moment conditions satisfied by the optimizing cumulative distribution function. Another approach using the Hahn-Banach theorem will be used later in this chapter for finding the optimizing distribution function for (4.5.1). In the subsequent development of this section, we follow Harris (1959). Let g ( x ) be continuous on [o, 1 1 and consider the set A k + l in ( k t 1)dimensional Euclidean space of points whose coordinates are
[ E ( g ( X ) ) ,E ( W , . . . E(Xk)l 9
for all cumulative distribution functions F ( x ) on [0, 11 in 9. Then we have the following theorem. Theorem 4.5.1 The set A k + l is closed, bounded, and convex in ( k + 1)dimensional Euclidean space.
4.5.
MINIMIZING AND M A X I M I Z I N G AN EXPECTATION
79
Proof Since the random variable X is bounded by definition, the set Ak+1 is bounded. To show convexity of the set A k + ] , consider the transformation
T: 9 + A k + l with
1
1
T is a linear transformation and hence maps a convex set into convex set. Since S i s convex,Ak+] is also convex. To show that Ak+l is closed, note first that the set 9 is compact. The compactness of 9is a restatement of the Helley-Bray lemma. Again since T is continuous, it maps a compact set into a closed and bounded set and hence Ak+I is closed and bounded. In order to find
1 1
min
(4.5.3)
g(x)dF(x)
FESck 0 01
1
c
(4.5.4)
where the class %k is determined by the given k moments PI , PZ , . . . ,Pk'r we consider the following point OfAk+l. I
or
!
F
b l and bz can be easily seen to be boundary points of the set Ak+l ,since they are the points whose second, third, . . . ,( k + 1)st coordinates are fixed and whose first coordinate is minimized or maximized. So long as the set Ak+l is nonempty and is closed, b , and bz exist. Therefore, the minimizing and maximizing distribution functions corresponding to these boundary points also exist.
To characterize further the extremizing distribution, we have the following theorem.
80
LINEAR MOMENT PROBLEMS
IV.
Theorem 4.5.2 If Ax) is strictly convex (or concave), the set of extreme points of A k + l are exactly those points that correspond to moment sequences of degenerate cumulative distribution functions.
Proof The proof follows from Theorem4.4.2, since the point of the set Ak+l
corresponding to degenerate distribution functions is
x = ( g ( t ) , t , t z ,. . . , t k ) where t E [0, 11. The hyperplane (4.5.5)
t 1 ~ - 2 p l f ttl p2f= 0
for a fixed rl is a supporting hyperplane
OfAk+l
and hence
t12-2flt t t Z =(t-t1)2>0
for t # t l and 0 for t = t , assuming k = 2. Thus (4.5.5) is a supporting hyperplane at and it is not attainable as a nontrivial convex linear combination of points of corresponding to degenerate distributions. Hence points in A k + 1 corresponding to degenerate distributions are extreme points of A k + l . This completes the proof of the theorem.
Ak+l
Let ii be a boundary point of the point of the set A k + l . Then 2 can be represented as a convex linear combination of at most k t 1 extreme points of A k + l . That is,
c hi(g(tj),
j =O
c k
k
a=
ti, t i * , . . . , t i k ) ,
x i 2 0,
j =O
xj
= 1.
Hence the minimizing distribution is discrete with positive probability concentrated on at most ( k t 1) points in (0, 1). If g(x) is concave, the same statement can be made about the maximizing distribution. In this case the number of jumps for the minimizing cumulative distribution can be further reduced. This we see as follows. Let the supporting hyperplane at u be
The roots of equation (4.5.6) correspond t o the relevant extreme points. Since (4.5.6) is a supporting hyperplane at a, for t E [0, 1 1 ,
4.5.
MINIMIZING A N D MAXIMIZING AN EXPECTATION
81
Let r be the number of distinct roots of P(t) in [0, 11. Also let (4.5.7)
r ‘ = r - l2s.
s = 0, if 0, 1 are not roots ofP(t); s = 1 , if one of 0, 1 is a root; and s = 2, if both are the roots ofP(t). Then by induction we obtain the following theorem. Theorem 4.5.3 If g(r) is a continuous, bounded, and monotone function on [0, I ] such that the first k derivatives exist and are monotone on (0, l), then P(t) has at most k t 1 roots in [0, 11. Theorem 4.5.4 If P ( t ) > 0 for all t E [0, 11 and At) satisfies the hypothesis of Theorem 4.5.3, then r‘ < i ( k t 1).
Proof
Let s be the number of distinct roots at 0 and 1. Then
r ’ = r - l2s,
s = 0, 1 , 2 .
Since P ( t ) > 0, P ’ ( t )has at most k roots in (0, 1). Notice that all interior roots of P(t) are multiple roots, since P’(t)> 0 for all r E [0, 11. Hence whenever P ( t ) = 0, f E (0, l), P’(t) = 0, t E ( 0 , 1). Also if t o , t l with lo < f l are distinct roots of P(t), then there exists a t* such that P’(t*) = 0, to < t* < t l . Hence
(r-l)t(r-s)
or
r’tis-1 +r‘-js
r’ < j(k t 1).
(4.5.8)
It will now be shown that the extremizing distributions have exactly r’ jumps. The above theorem provides a generalization of certain results of Chernoff and Reiter (1954) and Rustagi (1957). It is shown in what follows that there are exactly two cumulative distribution functions, one minimizing the expectation and the other maximizing the expectation. We also show that the minimizing cumulative distribution function has a jump at 1 whereas the maximizing distribution function is continuous at 1. Wald (1939) defined the degree of a discrete cumulative distribution function by rtls where r is the number of jumps in the interval (0, 1) and s is the number of jumpsatOandl,s=0,1,2. Theorem 4.5.4 determines the degree of the extremizing distribution to be at most r’. We use the following propositions due to Wald (1939). Proposition 4.5.1 If the distribution function F(x) has degree 4 and G(x) is the distribution function of any other random variable, then D(x) = F(x) - C(x) has at most 24 - 1 changes of signs.
82
IV.
LINEAR MOMENT PROBLEMS
Proof Assume first that q is an integer. Let P(X = ai)> 0, i = 1,2, . . . , 4 and consider intervals Ii = (ai,oli+l ), i = 1 , 2 , . . . ,q - 1. Then q + 2 points in the spectrum are possible, 0, 1, and 4 internal jumps. D(x) has at most one change of sign in the interior of the interval I i , since F(oli+, ) > F(ai) and G(x), being the cumulative distribution function, is monotonically increasing. Also, a change of sign may occur at q.Hence the maximum number of changes of signs is (q - 1) + q = 2q - 1. Again if q = q' + f ,where 4' is an integer, we can assume that there is a jump at 0; that is, P(X = 0) > 0 and there are q' further jumps at a1,. . . , a k . Arguing as above, the number of changes of signs is q' + q' = 24' = 2q - 1. Other cases are similarly obtained. We can extend the above proposition for discrete distributions or distributions having the same jumps. The results are stated in the following propositi ons. Proposition 4.5.2 If both distributions F and G are discrete, each having degree q , then the changes of signs of F(x) - C ( x ) are at most 2q - 2 . Proposition 4.5.3 If both cumulative distributions F and G have degree q but have one common jump, then the number of changes of signs of F(x) - G(x) is at most 2q - 3. Proposition4.5.4 If F(x) and C ( x ) have the same first k moments, then F(x) - G(x) changes sign at least k times unless F(x) = C(x). Theorem 4.5.5 There are exactly two distribution functions F1( x ) and F2( x ) with degrees less than 4 ( k + 1) each having the same first k moments such that F1(x) is continuous at 1 and F2(x)has a jump at 1.
Proof The existence of at least two distribution functions having the same first k moments is guaranteed by Theorem 4.5.4, and they are the minimizing and maximizing distribution functions. Suppose there exist two distributions of degree < f ( k t 1) having the same first k moments and continuous at 1. Let the degrees of F I and F2 be q . Then, from Proposition 4.5.2, we have that F1 ( x ) - F ~ ( x changes ) sign at most 24 - 2 times. Also, from Proposition 4.5.4, F l ( x ) - FZ(x) changes sign at least k times. However, by hypothesis, we have 2q - 2 < k - 1. Hence, FI ( x ) = F2(x). Again suppose that there exist two extremizing distributions with the same first k moments, both of which have a jump at 1 and both of w h c h have degree <(k + 1)/2.Then, by Proposition4.5.3, we see that F1( x ) - F 2 ( x ) changes sign at most 24 - 3 times, and by Proposition 4.5.4, F l ( x ) - F 2 ( x ) has at least k changes of sign. But 2q - 3 < k - 1. Hence F1( x ) = F2( x ) . This establishes the existence of two cumulative distribution functions with degree < ( k + 1)/2 having the same first k moments, where one has a jump at 1 and the other is continuous at 1. These are the minimizing and maximizing distributions.
4.5.
MINIMIZING AND MAXIMIZING AN EXPECTATION
83
The degree of minimizing (maximizing) distribution will now be shown to be exactly (k t 1)/2 except under the condition that the given constant &' is such that it minimizes (maximizes) the kth moment of the distribution function when its first k - 1 moments are specified. We have the following theorem.
Theorem 4.5.6 The degree of the minimizing (maximizing) distribution function is (k t 1)/2 unless 1 p k r = max (min)
1
xk d F ( x )
0
over the class of distribution functions having the first k - 1 moments equal to I f 7I-12 * * . pk--1'.
PI
2
9
Proof From Theorem 4.5.5, we note that the degree of the two distribution functions having first k - 1 moments p l r , . . . ,I-1k-f which minimize (maximize) 1 x k dF(x)
(4.5.9)
0
is < k/2. That is, the degree of the extremizing distribution is <(k t 1)/2, so that the degree of the extremizing distribution is (k t 1)/2 unless pk is the extremum value of (4.5.9). Using Theorems 4.5.5 and 4.5.6, the minimizing and maximizing distributions can now be characterized as follows.
Theorem 4.5.7 If p l r , p z r ,. . . ,&' is a moment sequence and (4.5.9) is not satisfied, then the extremizing distributions are given by F 1( x ) and FZ(x) where (i) When k is odd, the distribution F 1( x )has jumps at (k t 1)/2 points in (0,l). (ii) When k is even, the distribution F l ( x ) has jumps at 0 and k / 2 points in (0, 1). The distribution F z ( x ) is characterized as follows: (i) When k is odd, the distribution F z ( x ) has jumps at 0, 1 and at (k - 1)/2 points in (0, 1). (ii) When k is even, the distribution F 2 ( x ) has jumps at 1 and at k / 2 points in (0, 1). In a given case, the complete characterization is obtained by this theorem for distribution functions on [0, I ] . The case of distribution functions on [0, -1 can be considered by studying the expectation on [0, b] and then taking the limit as b +. -. Many of the above results also hold in this case, and the applications have been discussed by Harris (1 959). We study this case by another approach using the Hahn-Banach theorem and discuss it in the next section.
84
1V.
LINEAR MOMENT PROBLEMS
Example 4.5.1 Consider Example 4.2.2. Let k = 2. The condition of moments is specified for the cumulative distribution function, and we want an extremum of
i
l-e-(a-lb
x
d F (x).
(4.5.1 0)
0
We consider two-point distributions with jumps at x l and x 2 and probabilities XI and X2, so that hi 2 0, hl + h2 = 1,
hlXl + A2x2 =
(4.5.1 1)
X 2 X l 2 + h2xz2 = p2’
(4.5.12)
correspond to the first two given moments p l ’ and p 2 ’ , respectively. F 1 is given by solving (4.5.1 1) and (4.5.12). We find A1
r T:
=(PZ’--PI’~)/PZ’,
XZ = ~ 1 ’ ~ / p 2 ’ , and
x z =P~’/PI’,
since Xzxz = p I ‘ and hzx? = p2’ a n d x l = 0. Thus
FI (XI =
(P21-Pl’2)/P21,
x < 0, 0 d x
(4.5.1 3)
d x.
For finding FZ(x), we consider the distribution on (0, b ) with one jump at b and the other between 0 and b . We have
hlxl + h 2 b = p 1 ’ ,
(4.5.14)
h2xlZ+ h Z b ’ = ~ ( 2 ~ .
(4.5.1 5 )
Solving the system (4.5.14) and (4.5.15), we find
This gives
i
O,
4.6.
M A X I M I Z I N G AN EXPECTATION SUBJECT TO CONSTRAINTS
Taking limits as b -+
00,
85
we find (4.5.1 7)
The minimizing and maximizing distributions can be determined by comparing the values of (4.5.10) when (4.5.13) and (4.5.17) are used. Computations for k = 3 can be easily carried out, but for higher values of k , such computations become involved. 4.6
Application of the Hahn-Banach Theorem to Maximizing an Expectation Subject to Constraints
When the cumulative distribution function is defined on [0, -1, the problem of maximizing an expectation subject to moment constraints can also be solved with the application of the Hahn-Banach theorem. This theorem was discussed in Section 4.3, and we consider the case of k = 2 given moments for the sake of simplicity. The Hahn-Banach theorem has been utilized in optimization by several authors, for example, Kemperman (1 968), Isaacson and Rubin (1954), Sullivan (1 970), and Sullivan and Rustagi (1970). We consider here the problem of finding the maximum of
1 m
subject to the conditions
g(x)dF(x)
(4.6.1)
x dF(x) = 1-11'
(4.6.2)
x2 dF(x) = p 2 ' .
(4.6.3)
0
I
m
0
and
I m
0
It is assumed as before that the class of admissible cumulative distribution functions, that is, distributions satisfying (4.6.2) and (4.6.3), is nonempty. Essentially we are assuming that p2' - pi2 > 0 as is evident from condition (4.4.2). We assume further that g(x) E 99 where the class Q has the following properties:
Properties A A(i) g(x) is continuous on [0, m) and the first three derivativesg'(x), g"(x), g"'(x) exist on (0, m).
86
IV.
LINEAR MOMENT PROBLEMS
A(ii) g ( x ) is 'sounded above and below by certain real linear functions of the type a l x + u 2 x z for every x E [0, -). A(iii) g(x)/x is strictly convex on (0, -). The class Y is a little different from the class of functions g considered in Section 4.5. For example, the functions in class Y of the type g ( x ) = xe-Bx,
0>o
(4.6.4)
are not monotone, and hence the technique of geometry of moments is not applicable to them. However, the function g ( x ) = e-x
(4.6.5)
can be treated with the technique of geometry of moments; although it does not belong to 9,since it violates A(iii). We consider first a more general problem of minimizing
(4.6.6) where g E 9,over a class of finite measures H ( x ) defined on [0, -) such that 0 < H ( x ) < 1,
(4.6.7)
m
xdH(x)=pI',
(4.6.8)
x 2 d H ( x ) = p2'.
(4.6.9)
0
I m
0
Using the Hahn-Banach theorem, we first solve the problem of minimizing I&) with constraints (4.6.7)-(4.6.9) by utilizing the correspondence between linear functionals and I[&). There are several equivalent statements of the HahnBanach theorem. We use here the one given in Theorem 4.3.4 and its corollary. To apply the theorem and its corollary, we define the quantities N, M,L1, and p for our problem as follows. Definitions
N=9,
(4.6.10)
M = set of all continuous functionsf on [0, -) such that there exist f 1 , f 2 E N s u c h thatfl(x) < f ( x )
(4.6.1 1)
4.6.
MAXIMIZING A N EXPECTATION SUBJECT TO CONSTRAINTS
~ , ( f=)a , p I ’ + a2pz’for f ( x ) = a l x t a 2 x 2 , pdf) =
inf
gEN.g>f
L l ( g ) forfEM.
87
(4.6.12) (4.6.13)
With the above definitions, we note that N, M, L , , and p satisfy the conditions of the Theorem 4.3.4. To solve the problem, we obtain Lo&) by maximizingLl(f) subject t o f E N and f < g . For f < g is equivalent to a l x +a2x2 < g ( x ) .
(4.6.14)
Since H(x) is being considered for x > 0, (4.6.14) is equivalent to a]
+ u2x < g ( x ) / x = h(x).
(4.6.1 5)
By assumption A (iii), h(x) is strictly convex. Therefore, for maximizing L I , we consider those functions f € N for which f ( x ) / x is tangent to h(x) at some x > 0. This we see as follows. If f ( x ) / x is tangent to h(x), .f(x)/x < h(x)
(4.6.1 6)
by the convexity of h. Also, for any line a, + a z x not tangent to h at some > 0, we can find a tangent line at a3 + a4x which is greater than al + u z x for all x > 0. Since L 1 is positive, we have
x
Ll(UlX
t a 2 x Z )
Therefore, in maximizing L I ,only u1 + a 2 x need be considered. Given c > 0, if a, (c) i- az(c)x is tangent to h at c, we must have (i) a l ( c ) +az(c)c = h(c) (ii) a2(c) = h’(c). From (i) and (ii) together we have al(c) = [-cg’(c)
i-
2g(c)l / c ,
a2(c) = [cg’(c)-g(c)l IC’.
(4.6.17) (4.6.18)
Then we have
L1 [a1(c)x +az(c)x2] =PI ’a,@) +pz’az(c).
(4.6.19)
Now to obtain L o k ) , we maximize (4.6.19) with respect to c. Using (4.6.17) and (4.6.1 8), (4.6.19) becomes L1
= c-’ [-C2g’(C)p, i- 2 ~ g ( ~ ) p+l~’ g ’ ( ~ ) p Z ’ - g ( ~ ) p 2 ’ ] (4.6.20) .
88
IV.
LINEAR MOMENT PROBLEMS
Differentiating L1 with respect to c twice and simplifying, we have
L ~=’ - c - ~ ( ~ ~- ’p2’)[c2g”(c) c - 2 ~ g ‘ (t~Mc)] ) L’,’ = - ~ ~ { c ~ ~ ’ [ c ~ g ’ ’ ( c ) - 2+c2g(c)] g’(c) t
011‘~- ~Z’)[C~~’’’(C)-
3c2g”(c) + 6 ~ g ’ ( ~ ) - 6 g ().~ ) ]
We obtain c = p 2 ’ / p l f equatingLl’ = 0, and L l ” is negative for this value. Hence we have the following theorem.
Theorem 4.6.1 L l has a unique maximum at c = p 2 ’ / p 1 ‘ for g E 9. Corollary Lo(g) = b;’/pz’)g(c), where c = ~ ~ ’ 1 1 - 1 1 ’ . Theorem 4.6.2 For H(x) satisfying (4.6.7)-(4.6.9), IH(g)2 (p;2/p2’)g(c). is linear. For f € M , there exist f 1 ,
Proof By linearity of the integral, I&) fz E N with and
fl(x)=a1xtazx2
f 2 ( x ) = b l x+ b 2 x 2
such that UiX
Thus
1
+azXz Q f ( x ) Q b l x t b2x2.
m
m
(UlX t azx2) dH(x) Q
0
1 m
j
f ( x ) dH(x) Q
0
( b l x t b2XZ) dH(x),
0
or UlPl’ +a2P*’ Q I d f )
blP1‘ + b2112’
and hence IHcf)
w*
Again let f € N, then f ( x ) = a l x + a2x2 for some a l and a2.
,. m
For g E N such that g 2 f, we have
4.6.
MAXIMIZING AN EXPECTATION SUBJECT TO CONSTKAINTS
89
and hence
Therefore, using the corollary to Theorem 4.6.1,we have the result. Define now the distribution function F o ( x ) as follows:
Fo(x) =
i::
x
1-Pi2/P2',
0Gx
< P2'/P,',
(4.6.21)
x 2 P2'IPI'.
Since 0 < ~ , ' ~ / < p ~1, ' Fa is a cumulative distribution function and satisfies (4.6.2)and (4.6.3).The following theorem provides. the lower bound to IF@). Theorem 4.6.3 For a given g E 9, Fo as given by (4.6.21)minimizes IF(g) over the class of distribution functions defined on [0, a)and satisfying (4.6.2) and (4.6.3).Further min f,(g) = ( / ~ ; ~ / p ~ ' ) g ( p ~ ' / p , ' ) .
Proof
Now I-1
i2
g ( c ) = 7g ( c ) , P2
112'
c = I.
PI
(4.6.22)
Let H(x) = F ( x ) on (0, m) for any distribution function F(x) defined on [0, w) satisfying (4.6.2) and (4.6.3). Since g ( 0 ) = 0, IF(g) =IH(g). Also, since H(x) satisfies (4.6.7)-(4.6.9), Theorem 4.6.2implies
GI*//J2)g(c)
I&)
=I&).
Thus ( ~ . i 1 ~ / ~ 2g(c) ') is the lower bound for IF(g), and from (4.6.22)we notice that F o ( x ) defined above achieves it. Hence Fo(x) is the solution of the problem. Example 4.6.1 Consider Example 4.2.1 (bioassay problem). The function g ( x ) in this example is given by
g ( x ) = 1-e-Px.
(4.6.23)
The function (4.6.23)satisfies the properties A(i), A(ii), and A(iii), as can easily be verified. The lower bound for
90
IV.
LINEAR MOMENT PROBLEMS
is given by the distribution function Fo in (4.6.21); hence
W )2-W
O )
= (rU?/PZ’)
1 - exp (-flPz’/rU1
‘)I.
(4.6.25)
References Barlow, R. E., and Marshall, A. W. (1964). Bounds for distributions with monotone hazard rate, Ann. Math. Statist. 34, 375-389. Barlow, R. E., and Proschan, F. (1965). Mathematical Theory o f Reliability. Wiley, New York. Blackwell, D., and Girshick, M. A. (1954). Theory o f Games and Statistical Decisions. Wiley, New York. Chernoff, H., and Reiter, S. (1954). Selection of a distribution function to minimize an expectation subject to side conditions, Tech. Rep. No. 23, Appl. Math. Statist. Lab., Stanford Univ., Stanford, California. Dharmadhikari, S., and Jogdeo, K. (1969). Bounds on moments of certain random variables. A n n Math. Statist. 40, 1506-1508. Dharmadhikari, S., Fabian, V., and Jogdeo, K. (1968). Bounds on the moments of martingales, Aizn. Math. Statist. 39, 17 19-1 723. Goodman, L. A. (1949). On the estimation of the number of classes in a population,Ann. Matlr. Statist. 20,572-579. Harris, B. (1959). Determining bounds on integrals with application to cataloging problems, A n n Math. Statist. 30, 521-548. Harris, B. (1962). Determining bounds on expected values of certain functions, Ann. Math. Statist. 33, 1454-1457. Isaacson, S., and Rubin, H. (1954). On minimizing an expectation subject to certain side conditions. Tech. Rep. No. 25, Appl. Math. Statist. Lab. Stanford Univ., Stanford, California. Jacobson, H. I. (1969). The maximum variance of restricted unimodal distributions, Ann. Math. Statist. 40, 1746-1 752. Karlin, S., and Shapley, L. S. (1953). Geometry o f Moment Spaces. Amer. Math. SOC., Providence, Rhode Island. Karlin, S., and Studden, W. J. (1966). Tchebycheff Systems: With Applications in Analysis and Statistics. Wiley (Interscience), New York. Kemperman, J. H. B. (1968). The general moment problem, a geometric approach, Ann. Math. Statist. 39, 93-1 22. Kemperman, J. H. B. (1971). Moment problems with convexity conditions, 1. In Optimizing Methods in Statistics (J. S . Rustagi, ed.). Academic Press, New York. Munroe, M. E. ( 1 95 3). Introduction t o Measure and Integration, Addison-Wesley, Reading, Massachusetts. Proschan, F. (1 965). Peakedness of distributions of convex combinations, Ann. Math. Statist. 36, 1703-1706. Rostn, B. (1970). On bounds on the central moments of even order of a sum of independent random variables, Ann. Math. Statist. 41, 1074-1077. Rustagi, J. S. (1957). On minimizing and maximizing a certain integral with statistical applications, A n n Math. Statist. 28, 309-328. Rustagi, J . S. (1961). Bounds for the variance of Mann-Whitney statistic, Ann. Inst. Statist. Math. 13, 119-126.
REFERENCES
91
Shohat, J., and Tamarkin, J. (1943). The Problem of Moments. Amer. Math. SOC., Providence, Rhode Island. Skibinsky, M. (1968). Extreme nth moment for distributions on [0, 11 and the inverse of a moment space map,J. Appl. Probl. 5,693-701. Sullivan, J. (1970). Minimizing an expectation subject to moment conditions, Doctoral dissertation, The Ohio State Univ., Columbus, Ohio. Sullivan, J. A., and Rustagi, J. S. (1970). The application of Hahn-Banach theorem to a moment problem (abstract), Ann. Math. Statist., 41, 1809. Wald, A. (1939). Limits of a distribution function determined by absolute moments and inequalities satisfied by absolute moments, Trans. Amer. Math. SOC.46,280-306. Weiss, L. (1956). A certain class of solutions to a moment problem,Ann. Math. Statist. 27, 85 1-85 3. Whittle, P. (1960). Bounds for the moments of linear and quadratic forms of independent random variables, Theor. B o b . Appl. 5, 302-305.
CHAPTER V
Nonlinear Moment Problems
5.1
Introduction
One of the earliest nonclassical variational results is the basis of the theory of tests of hypotheses in statistical inference developed extensively by Neyman and Pearson. They introduced the concept of the power of a test and used the criterion of power to obtain optimal tests. Techniques for maximization of power led to variational problems in which it is desired to find a set over which the integral of a given function is maximized or minimized. The NeymanPearson technique has also been applied recently to a very general class of optimization problems. In this chapter a brief review of the Neyman-Pearson theory of hypothesis testing is given; it is then utilized in solving certain nonlinear moment problems, some of which occur in the theory of order statistics. We discuss the problem of finding a distribution function with given mean and variance such that the expectation of the largest observation in a random sample of given size is minimized. Similar problems are considered for the mean of the sample range subject to restrictions on the class of distribution functions. In certain problems of nonparametric inference, questions involving relative asymptotic efficiencies lead to similar nonlinear problems, and we discuss a few of these problems in this chapter. The nonlinear problem is first reduced to a linear one before applying the Neyman-Pearson technique. The problem of minimizing an expectation of a known functiong(x) over a 92
5.1.
93
INTRODUCTION
class of distribution functions having a given set of prescribed moments was considered in Chapter IV. The integral considered is (51.1) Integrating by parts, we have
0
That is, the problem of minimizing (maximizing) (5.1 . l ) is the same as that of maximizing (minimizing) the integral
F(x)i(x) dx.
(5.1.3)
0
1vQ.
The integral (5.1.3) is a special case of the integral of a functionq(x, F(x)),
F(x)) dx9
(5.1.4)
where q(x, F(x)) = F(x)g'(x) for the above problem. In the following, we consider cases in which q(x, F(x)) is nonlinear. In many cases of applications, q(x, F(x)) has other desirable properties; for example, q(x, F(x)) is a convex function of F(x). We consider below a few examples in which integrals of the type (5.1 A) occur in statistical contexts.
Example 5.1.1 Let a random sample of size n be given from a population having the cumulative distribution function F(x). Suppose the observations are ordered as XI < X 2 < . . . < X , . We have seen in Chapter 11 that the expectations of the largest X , , the smallest X I , and the range W, = X , - XI are given bY m
HX,) =
1 1 1
x d[F"(x)l,
(5.1.5)
-m m
E(XI)=
x d [ 1 -F(x)] ,,
(5.1.6)
x d [ 1 - F" (x) - ( 1 - F(x))"] .
(5.1.7)
-m
m
E( W, ) =
94
V.
NONLINEAR MOMENT PROBLEMS
Integrating by parts, we find that (5.1 S)-(5.1.7) give, respectively, m
E(X,) = c, -
J F"(x)
dx,
(5.1.8)
-m
1 m
E(X1) = c*-
(1 -F(x))" d x ,
(5.1.9)
-m
m
where Ci'sare constants. The problem of minimizing the expectations becomes that of maximizing the integrals in the above expressions. When the distribution functions belong to a class of distribution functions specified by moments such as those in (4.3.1), the problems can be similarly reformulated. We discuss first the general problem of minimizing an integral of a given functionq o f x and F(x) subject to moment constraints. The given function q, as has been mentioned earlier, will be assumed to have certain properties such as those of differentiability and convexity so that we may obtain existence and uniqueness of the solutions. The technique used here is very similar in spirit to that of the Pontryagin principle discussed in Chapter 111. Before we discuss the moment problems, we develop the elements of the Neyman-Pearson theory of testing hypotheses. This technique will be utilized in solving the nonlinear moment problem described above. The theory of tests of hypotheses is in a mature state of development, and the reader is referred to Lehmann (1959) for a detailed discussion.
5.2
Tests of Hypotheses and Neyman-Pearson Lemma
One of the central problems of statistical inference is concerned with the testing of hypotheses. The structure of the hypotheses testing problem is as follows. Let f o ( x ) and f l ( x ) be the two probability density functions of X representing the populations from which samples are observed. Based on an observed value of X , it is desired t o test the hypothesis that X has probability density f o ( x ) versus the alternative hypothesis that X has density f l (x). That is, we have
Ho : X has density f o ( x ) ,
H I : X has density f,(x).
5.2.
TESTS OF HYPOTHESES AND NEYMAN-PEARSON LEMMA
95
By a test of the hypothesis H o , it is generally meant a statisticq(x) that provides the probability of rejecting Ho when X = x is observed. Thus 0 Q q ( x ) G 1 . The probability of rejecting the null hypothesis Ho when it is true is known as the significance level or the size of the test. Notice that the significance level a is given by the conditional expectation of q(X) under the hypothesis H o . That is,
(5.2.1)
a = E H , ( d X ) ) = Eo(P(X)).
The power finction of the test is given by the probability of rejecting the hypothesis Ho when H I is true. That is, the power function 0 is given by
P ( d = EH,(lP(XN = E I ( d X ) ) . In general, the test cp(x) is a randomized test. q ( x ) = 0 means that Ho is accepted, and q ( x ) = 1 implies that H I is accepted. When q ( x ) = 0 or 1, the test is nonrandomized. q ( x ) is sometimes known as the critical finction of the test. The nonrandomized test reduces to the finding of a set S in the sample space over which the hypothesis Ho is rejected. In the classical theory of testing hypotheses developed by Neyman and Pearson, an. optimum test is the test of a given significance level having the maximum power. The fundamental NeymanPearson lemma that we discuss next provides the critical region S in terms of the likelihood ratio, f o ( x ) / f l(x), and shares the most powerful character of the test S. In what follows we show the existence, necessity, and sufficiency of the critical function. We state the lemma as given in Lehmann (1959). Theorem 5.2.1 (Neyman-Pearson fundamental There exist a test cp and a constant k such that
Eo(cp(X)) = a
lemma)
(Existence) (5.2.2)
and
(5.2.3) (Necessity) If q ( x ) is the most powerful test for testingHo against H1,then for some k , (5.2.3) holds. Also (5.2.2) is satisfied unless there exists a test of size smaller than a and power equal to 1. (Sufficiency) If a test statistic satisfies (5.2.2) for q ( x ) in (5.2.3) for some k, then it is most powerful for testing Ho against HI at significance level a. Proof We assume that 0 < a < 1 since, when a = 0 or 1, the lemma is seen to be trivially true by taking k = m and assuming (O)(m) = 0. (Existence) Define the probability PGfi(X)> CfO(X)IHO)=
(5.2.4)
96
NONLINEAR MOMENT PROBLEMS
V.
Let C(x) be the cumulative distribution function of the ratio f l ( X ) / f o ( x ) . Then (5.2.4) can be written as .(C)
= 1 - G(c).
(5.2.5)
Now a(c) is monotonic decreasing and is continuous to the right with a (-m) = 1 and a(m) = 0. Let co be the value of c such that a(c0) < a < ~ ( c o - ) and consider the following test cp. when f~(x) > C O ~ (XI, O
C'
(5.2.6)
If co is a point of continuity of a(c), the middle expression in (5.2.6) is not meaningful. The size of cp(x) is now given by
Hence co can be chosen equal to k , showing its existence. (Necessity) We prove necessity by contradiction. Let cp*(x) be the most powerful test satisfying (5.2.2) and (5.2.3). Let the sets S ' and S-be defined as
> O}, s- = {x : cp(x) - cp*(x) < 0). s+= (x
: cp(x) - cp*(x)
(5.2.7) (5.2.8)
Also let
T = (x : fI(X> Now since [cp(x) - cp*(x)]
J
s+vs-
f
kfo(x)).
(5.2.9)
r,(x) - kfo(x)] is positive on S = S+U S-n T,
[ d x ) - cp*(x)l [fib)- kfo(x)l dx =
I
[cp(x)-cp*(x)]
[fl(X)
-kfo(x)l dx > 0,
(5.2.10)
S
which shows clearly that cp(x) is more powerful than q*(x), leading to a contradiction. (Sufficiency) Assume that a test cp(x) satisfies (5.2.2) and (5.2.3). Let cp* be any other test such that Eo(cp*(X)) < a. For x E S+,since cp(x) > 0, f l (x)
2 kfo ( X I
5.2.
TESTS OF HYPOTHESES AND NEYMAN-PEARSON LEMMA
97
or
showing that q ( x ) is the most powerful test. Dantzig and Wald (1 95 1) stated the above lemma in a more general form and proved the existence and necessity of the conditions. In the original formulation of the lemma, Neyman and Pearson had shown only the sufficiency of the condition. The following form of the lemma seems more useful for our purposes, and we state it below. Theorem 5.2.2 Neyman-Pearson lemma (Dantzig and Wald) Let fi( x ) , f 2 ( x ) ,. . . ,f,,,(x) andg(x) be ( m t 1) integrable functions in a finite dimensional Euclidean space d.Let c l , c 2 , . . . , c , be given constants. LetYbe the class of all measurable subsets S of W such that
s
S
f i ( x ) dx = ci,
i = 1 , 2 , . . . , m.
(5.2.1 1)
Then the class of setsgo containing set So such that g ( x ) dx = max
(5.2.1 2)
g ( x ) dx S
is characterized by the following. There exist constants k l , k 2 , . . . , k,,, such that g ( x ) 2 k , f l ( x )t k 2 f 2 ( x )+ . . . + k,f,,,(x)
when x E SO
g ( x ) < k l f l ( x )+ k 2 f i ( x ) + . . .+ k m f m ( x )
whenx4So.
A further generalization of the Neyman-Pearson lemma was given by Chernoff and Scheff6 (1952) so as to maximize a known function of several integrals of the type (5.2.12) over the subsets S. The generalization is applied to finding type A and type D regions in testing hypotheses, which will be discussed in Section 5.8. Let m + n integrable functions f,( x ) , . . . ,f,,,(x), g , ( x ) , . . . ,g,(x) be given for a point x in the Euclidean space Y'.Suppose a function h ( z , , z2, . . . , z,,) of
98
V.
NONLINEAR MOMENT PROBLEMS
n real variables is defined on the n-dimensional Euclidean space %. Let constants
CI,CZ, . . . , c,
be given such that
d
fi(x) dx = c i ,
i = 1 , 2 , . . . , nz
(5.2.13)
and let a closed subset A of 2be such that the point with the coordinates
lies in the subset A . The problem is to prove the existence and provide the characterization of a set S o , which maximizes h (z (S)>
(5.2.1 5)
subject to conditions (5.2.13) and z(S) € A . Let the set of m t n points given by
be denoted by A when the set S varies over all possible Bore1 subsets of .X.The setd< can then be seen to be closed, bounded, and convex. It also follows from the Lyapounov theorem, as shown by Halmos (1948) and Karlin and Studden (1966). Let Ac be the subset ofdilsuch that (5.2.13) holds. We state without proof the generalized Neyman-Pearson lemma. Theorem 5.2.3 Generalized Neyman-Pearson lemma (Existence) If there exists a set S satisfying (5.2.13) and cp(z) is continuous o n A n A c ,and if A is closed, then there exists a set $' maximizingcp(z(S)) subject to the condition that y(S) = c and z(S) E A . (Necessity) If S o is a set for which z(So) is an interior point of A , ifcp(z) is defined inAlc n A , and cp'(z) exists at z = z(So), then a necessary condition that S o maximizes h(z(S)) subject to conditions y(S) = c and z(S) E A is given by the following. There exists constants k l , k Z ,. . . , k , such that n
m
n
m
(5.2.16)
and
where ai = 6h/6zi , i = 1 , 2 , . . . , IZ
5.3.
A NONLINEAR MINIMIZATION PROBLEM
99
(Sufficiency) If the set So satisfies the side conditions y(s) = c , z(s) E A , and if h(z) is defined, quasiconcave in a dC n A and is differentiable at z=z(So), then a sufficient condition that So maximizes h(z(S)) subject to z(S) E A is that So satisfies (5.2.16).
5.3 A Nonlinear Minimization Problem Let cp be a continuous and bounded function defined on a unit square {(x. y): 0 4 x < 1, 0 < y < l}. Assume further that cp is strictly convex and twice differentiable in y . We consider the problem of minimizing and maximizing
1 1
I(F)=
cp(& F ( x ) ) d x
(5.3.1)
0
over the class of cumulative distribution functions with specified moments, such as in
I 1
s
1
F(x) dx = c1
and
0
xF(x) dx = c2.
(5.3.2)
0
These restrictions on the cumulative distribution function F(x) are similar to the ones in (4.3.1), and we assume that the constants c I ,c2 are such that the class of distribution functions satisfying (5.3.2) is not empty. For simplicity we have assumed in (5.3.2) that only two moments are given. The results obtained are valid for the case in which k moments of the cumulative distribution function are given. Let d denote this class, and we call the cumulative distribution function in d admissible. The existence of an admissible minimizing or maximizing cumulative function Fo(x) can be seen in the same way as discussed in Section 4.4 using Theorem 4.4.1 ,since cp is assumed continuous and bounded. The minimizing cumulative distribution function is also unique, as we shall see below. It is a result of the strict convexity of the function cp i n y and is shown by contradiction. Suppose F,(x) is not unique and that there is another admissible minimizing cumulative distribution function F , (x), with FI ( x ) # Fo(x). Let
M = min
i
F E d O
p(x, F ( x ) ) d x .
100
V.
NONLINEAR MOMENT PROBLEMS
0
= AM
0
+ (1 - X)M = M ,
and hence we have a contradiction. The above results are expressed in the following theorem.
Theorem 5.3.1 If cp(x,y) is strictly convex in y and is continuous and bounded on a unit square, then there exists a unique F o ( x ) € d , which minimizesI(F) given in (5.3.1).
Reduction of the Nonlinear Problem to a Linear One We use the notation
in the following lemma.
Lemma 5.3.1 Fo(x) minimizes (5.3.1) if and only if
i
cpy(x9
Fo)F(x) dx 2
0
i
cpy(Y, Fo)Fo(x)dx.
(5.3.3)
0
Proof Let F(x) be any other admissible cumulative distribution function. Define [(A) for 0 < h Q 1 as follows.
1 1
I(h) =
~ ( xhF0(x) , + (1 - h)F(x))dx.
(5.3.4)
0
By the assumption of twice differentiability of cp, cpy exists and is continuous in y and hence I(X) is differentiable, giving
1 1
I’(h) =
0
cpY(x,hFo(x) + (1 - h)F(x))(Fo(x)-F(x))dx.
(5.3.5)
A NONLINEAR MINIMIZATION PROBLEM
5.3.
1
-
101
A
1
'0
Figure 5.1
Since cp is strictly convex in y , f ( h ) is a strictly convex function of A. If Fo(x) minimizes (5.3.1), f ( X ) achieves its minimum at h = 1 . This is possible if and only = ~0, as shown in Figure 5.1. That is, if I ) ( h ) l ~ <
j
0.
cpy(x, Fo)(Fo(x)-F(x)) dx
0
Conversely, suppose (5.3.3) holds. Then, retracing the steps backward, we can see that Fo(x) minimizes (5.3.1). This proves the lemma. Let
I
(5.3.6) 0
From Lemma 5.3.1 we note that minimizingf(F) is related to minimizing f o ( f l . In fact, we want an Fo such that Fo minimizes (5.3.6). Although the problem looks complicated, it is now linear in F. Consider the set r o f points with coordinates (u, u, w ) equal to 1
1
I
Again, by the same arguments as those in Theorem 4.4.1, the set r is a closed, bounded, and convex set in three dimensions and we state it as a lemma.
Lemma 5.3.2
r i s a closed, bounded, and convex set in three dimensions.
The minimum of (5.3.6) will be a boundary point, say (uo, u o , w o ) , of the set correspond to this boundary point, and there is a supporting hyperplane of r a t (uo, u o , w o ) . That is, there exist constants v0, q l , v2,and 773
r, FO will
102
V.
NONLINEAR MOMENT PROBLEMS
such that vou0 + v1u o + q2w o + q3 = 0 and qou + v1u + 772 w (u, u, w) E r. Hence
+ q3 2 0 for all
770(u-u0)+7)1(u-u0)+7)*~~-~0~~~.
(5.3.8)
vo in (5.3.8) is positive.
Lemma 5.3.3
h o f Suppose r ’ = { ( u ’u, , w) : u’ 2 u , (u, u, w ) E r ) . T h e n r ’ i s convex and contains the set r. Now t i o is the minimum of u subject to the condition that u = u o and w = wo.This implies that (uo, uo, wo)is also the minimum point of the set r’ and hence its boundary point. Hence there is a set (vo, vl, q 2 )f (0, 0,O) such that, for (u‘, u, w)E I?’, 77&’
- uo)
+ % ( u - uo) + 772(w-
WO) 2
0.
(5.3.9)
If qo = 0, then 7710-
or
UO)
+ 772(w-
WO) 2
0
1
1
that is, Fo(x)minimizes
(5.3.1 0) over all F E g . Now q1 + v 2 x is monotonic, and we see, from Theorem 4.5.4, that Fo(x)is a two-point distribution with its total mass concentrated at 0 and 1. But such a cumulative distribution is not admissible, and hence there is a contradiction. Further suppose qo is negative. Consider a point (uo + h, uo, w o ) E I” for some positive h. We obtain from (5.3.9) that qOh2 0, which again is a contradiction. Hence v0 is positive. Since go > 0, we can normalize it so as to make it one. Therefore, (5.3.8) becomes
j
( P y k
0
Fo) + 7)l + 7 7 2 X ) F ( X ) d X
5.3.
A NONLINEAR MINIMIZATION PROBLEM
103
or Fo minimizes 1
Il(F) =
j
( v y k
Fo) + 771 + 772x)F(x)dx
(5.3.1 1)
0
among the class of all distribution functions s o n [0, 11. Retracing our steps, we can see that if an admissible Fo(x) minimizes (5.3.1 l), then F o ( x ) minimizes lo(F) over the class of all admissible cumulative distribution functions. The above results are summarized in the following lemma.
Lemma 5.3.4 Fo(x) minimizes lo(F) if Fo(x) is an admissible cumulative distribution function minimizing Il (F),and any Fo(x) that minimizes Io(F) minimizes Il (F)for some v1 and v 2 . Omracterization of the Solution The solution of the above problem will now be characterized in terms of the solution FqIq2(x) of the equation for y given by Let and let
s = { x :A ( x ) f 0,o <x < 1 }:
(5.3.14)
We show below that the Fo-measure of the set S is zero, and the integral of A ( x ) over intervals where Fo(x) is constant is zero. The solution of the minimizing cumulative distribution function can be obtained in terms of the solution Fqlq2(x) modified so as to make a cumulative distribution function. .Essentially the solution coincides with Fqlq2(x), except on intervals in which it is constant. We give below a series of lemmas that lead to the characterization of the solution.
Lemma 5.3.5 If Fo minimizes 1, (F),then the set S has Fo-measure zero. a
Lemma 5.3.6 If for 0 < c < 1, Fo(x)< < x < b, and Fo(x) > c for x > b, then
I
c
for x < a, F o ( x ) = c for
b
A ( x ) dx = 0.
a
(5.3.15)
104
V.
NONLINEAR MOMENT PROBLEMS
Lemma 5.3.7 If Fo minimizes I I ( F ) , then Fo(x) has n o jumps in the open interval (0, 1) and hence A(x) is continuous on (0, 1). For proof of the lemmas, the interested reader is referred to Rustagi (1957). Let Fqlq2(x) be defined for 0 < Fqlq2(x) < 1 such that y = Fqlq,(x) satisfies Eq. (5.3.12). Since cp,(x, y ) is continuous and strictly increasing in y , Fqlq2 (x) is also continuous whenever Fqlq2(x) is defined. We define a function Gqlq2(x) on [0, 11 that is continuous on [0, 1 1 such that
and Gqlq2(1) = 1 . Theorem 5.3.2 If Fo minimizes I , (F),then for 0 < x < 1 , Fo(x)coincides with Gqlq2(x) except on intervals on which Fo(x) is constant.
Proof From Lemmas 5.3.6 and 5.3.7 we know that Fo(x) has no jumps in
(0, l), and Fo(x) is increasing whenever A(x) # 0. Hence Fo(x) remains constant
until it intersects with FqIq, (x). Notice that if GqIq2(x) is a cumulative distribution function, then Fo(x) = GqIq2(x). The solution in general may not be completely specified. However, there are many special cases of interest in which G,,,,,, (x) is the solution of the problem. A possible solution is expressed in Fig. 5.2.
t
F,(x) =
-,Fqlq,(x) = - - - - - -,G q l q , ( X ) = ___. Figure 5.2
'
. -.
105
STATISTICAL APPLlCATlONS
5.4.
Theorem 5.3.3 If cp,,(x,y ) is a nonincreasing function in x and Fo(x) = Gtllr12( x ) , 0 < x < 1.
772
< 0,
Proof Since cpy(x,y ) is nonincreasing in x and q2 is negative, it is clear from Eq. (5.3.13) that A ( x ) is a nonincreasing function of x . Therefore, Ftlls2 ( x ) will be an increasing function of x . Thus Gtllt12( x ) is a cumulative distribution function and hence is the solution of the minimizing problem. Theorem 5.3.4 If cp(x,y) is a function of y alone, say $b),then, corresponding to the solution Fo(x) of minimizing I ( F ) , q2 is negative. Further, the minimizing cumulative distribution function is Gtllt12( x ) for some 171 and 772.
Proof Suppose q2 2 0. Then v I + v 2 x is nondecreasing. Also as $'@) is nondecreasing in y , $ ' ( F o ( x ) )+ v1 + v 2 x is also a nondecreasing function ofx. Hence the solution of (5.3.12) with the conditions of the theorem is also nondecreasing. Therefore, from Theorem 5.3.2, F o ( x ) is constant on [0, 1). But such Fo(x) is not admissible, and there is a contradiction. From Theorem 5.3.3 the solution is given by G q I t l 2 ( x ) .
5.4
Statistical Applications
Example 5.4.1 In the problem of obtaining bounds of the variance of the Wilcoxon-Mann-Whitney statistic, we need the extrema of the integral
1 1
I(F)=
s
with side conditions
0
( F ( x )- kx)2 dx
(5.4.1 )
0
F ( x ) dx = 1 - p ,
(5.4.2)
where F(x) is a cumulative distribution function on (0, l), as discussed in Chapter 11. Now cp(x, F(x)) = (F(x) - kx)2 ; thus cp satisfies the conditions of Theorem 5.3 .l. Therefore, the admissible cumulative distribution function Fo(x) exists, is unique, and leads to the minimization of the integral
j
(Fo(x)- kx
+ X)F(x) d x ,
(5.4.3)
0
as seen from (5.3.1 l), over the class of all distribution functions on (0, 1).
106
V.
NONLINEAR MOMENT PROBLEMS
Let gA(t) be the value of y for which y - k t + A = 0. Then gh(t) is a straight line with a positive slope and its points of intersection with the straight lines y = 0 and y = 1, which are given by XI
Let Fo(x) =
= A/k,
x2 = (A
i::
(5.4.4)
+ l)/k.
(5.4.5)
if x < max (0, x
kx - A,
if
max (xl, 0) Q x < min (1, x ~ ) ,
if
min (x2, 1) < x.
(5.4.6)
We give the various possible cases of (5.4.6), obtaining the value of A so as to satisfy the constraint (5.4.2).
Case (i) x I
< 0, x2 2 1, so that
[$
Fo(x) =
giving A = k/2
-
-
A,
Case (ii) ”
if
O Q x < 1,
if
1 Qx,
(5.4.7)
1 + p , and the minimum value of (5.4.1) is obtained as Z(F0)=
Notice that k
if x < 0,
j(k - 2 + 2~)’.
(5.4.8)
< 2p and k Q 1. 0 < xI Q 1 , O Q x2 Q 1.
Is
6.
Fo(x) = \
{!/
if -
A,
x < xl,
if x1 Q x < x 2 ,
(5.4.9)
if x2 Q x.
W e f i n d A = p k - i and 1 (1 - k)3 I ( F o ) = k ( p k - - $ ) 2-~ 3k ‘
(5.4.1 0)
H e r e k 2 l , a n d 2 p > Ilk.
Case (iii) x 1 Q 0,O < x2 Q 1.
ioy
F~(x)= kx-A,
1,
if
x
if
0 Q x Q x2,
if
x2 Q x.
(5.4.1 1)
5.4.
+ (2kp)l/* and
Here h = -1
I(F0) = i p (2kp)'/* - 2p + 1 - k
Herek< 1,2p
Case(iv) O < X ,
{r:
-
+ f k'.
(5.4.12)
< l,xz 2 1. if x
<x l ,
if x 1 < x < 1 ,
F ~ ( x ) = kx-A,
giving h = k
107
STATISTICAL APPLICATIONS
(5.4.13)
if x 2 1 ,
[2k(l - p ) ] 'I2 and
I(F0) = ik' - 2kq
+ 4q(2kq)'I2,
where q = 1 - p .
(5.4.14)
Here k < 1 and 2(1 - p) < l/k. There are no admissible cumulative distribution functions for other possible values of x 1 and x 2 .
Example 5.4.2
Suppose an additional constraint on F ( x ) such that F(x) 2x
(5.4.15)
for 0 < x < 1 in the previous example. This requires that p < f from (5.4.2). The inequality (5.4.15) obtains when we consider alternatives for the tests of hypothesis using the Wilcoxon-Mann-Whitney statistic. The solution of the problem, F o ( x ) , is now obtained in terms of y = max (kx - A,x), as seen from (5.4.6) and (5.4.15). Let the point of intersection of y = kx - A and y = x be denoted by x3 = A / @ - 1). We consider now the following cases.
Case(i) k < l , k - l < h < O . T h e n 0,
Fo(x) =
if x < 0, if
O<x<x3,
(5.4.16)
if x 3 < x < 1 , if x > 1.
We have, in this case, A = -[(l - k)(1 - 2p)l 'I2, and I ( F o ) = f { ( k - 1)'+2[(l-k)(l-2p) Here 2p 2 k.
3 ] 112
>.
(5.4.17)
108
V.
NONLINEAR MOMENT PROBLEMS
Case(ii) k G l , h < k - 1 o r k > 1 , - 1 G h < O .
t
if x
(0,
F,(x)=
I
kx-A,
if
1,
if x 2 G x .
O<x<x2,
(5.4.1 8)
This is the same as (5.4.1 1) and hence leads to the same bound. Casefiii) k > 1 , O G X G k - 1 .
if x < O , (5.4.19) if x2 G x. giving h = k - 1 -[k(k - 1)(1 - 2p)] 'I2. Here the lower bound is (5.4.20) Example 5.4.3 Consider the problem of maximizing the expectation of the largest order statistic from a sample of size n from a population having a cumulative distribution function F(x) defined on (0, 1) with given first two moments. That is, from (5.1.8), we note that the problem is to minimize
1 1
F"(x)dx
(5.4.21)
0
with constraints 1
I
As seen before, the solution is given in terms of the function grllrll (x), which satisfies the equation
ny-1 + q1 t q2x = 0.
(5.4.23)
5.5.
MAXIMUM IN THE NONLINEAR CASE
109
77], q2 can be determined from the constraints, and the details are left for the reader. 5.5
Maximum in the Nonlinear Case
In the previous section we have considered the problem of minimizing an integral of a strictly convex function of a cumulative distribution function. The existence of the minimizing and maximizing cumulative distribution is given by the results of Section 4.4. However, the maximizing solution is not unique. Further, as we shall show next, the maximizing cumulative distribution function is discrete. We denote the admissible class of the cumulative distribution function satisfying (5.3.2) by &as described earlier. Lemma 5.5.1 The maximizing cumulative distribution function for I(F) over the admissible class d is a discrete distribution..
Proof Let F o ( x ) and F , ( x ) be two admissible cumulative distribution functions giving the maximum of I(F), and denote this by M. Then, for 0 < h < 1, we have, from the strict convexity of cp,
That is, a convex combination of Fo and F , gives a value smaller than the maximum. Hence the maximizing cumulative distribution function corresponds to the extreme points of the set d l e a d i n g to the discrete distributions. The characterization of such a distribution can now be easily obtained from results proved earlier. Notice that the maximizing cumulative distribution function depends solely on the number of moments given and is the same whether the problem is linear or nonlinear. However, the actual maximum or minimum will result from the given integral. Example 5.5.1 Consider the same example as Example 5.4.1. We would like to find the upper bound of I(F) over the class of distribution functions subject to restriction in (5.4.2). Again, since (F(x) - kx)’ is strictly convex in f i x ) , the maximizing distribution function is discrete and its degree is one from results of Chapter IV. Hence the admissible functions satisfying (5.4.2) are F I (x) and (5.5.1)
110
V.
NONLINEAR MOMENT PROBLEMS
and
F2(x)= 1 - p = q
if
O < x < 1.
(5 S . 2 )
Then IIF1( x ) ] = 4 - k(pq + q ) + 5 k 2 , and I [ F 2 ( x ) ]= q2 - kq + ik' . Since I(F1)- I(F2) = pq(1 - k ) , the maximum for I(F) is given by I ( F I )if k < 1 and by I(F2) if k 2 1, These results can then be used to obtain the upper bounds of the variance of the Wilcoxon-Mann-Whitney statistic. Example 5.5.2 Consider the same problem as in Example 5.4.2. Here there is an additional constraint that F(x) 2 x . The maximizing cumulative distributions F3(x)and F4(x) are of the form if
O<x
if
a<x
b,
if
O<x< b,
x,
if
b < x < 1.
(5.5.3)
and
(5.5.4)
The distribution F 3 ( x ) has a jump in the open interval (0, l), whereas F ~ ( xhas ) a jump at 0. It can be easily verified that a = 1 - (1 - 2p)'/' and b = (1 - 2 ~ ) ' ~ ' . We obtain
I(F,) = 3. k [ka3( 1 - k)' - (1- k)3 + (1 - ka)3] and
I(F4)= f [b3( 2 - k ) + (1 - k)' ] . Since I(F3) - I(F4)= a(1 - k ) (a - 1 ) 2 , the maximum of I(F) is given by I(F3) when k < 1 and I(F4) when k > 1. The upper bound of the variance of Wilcoxon-Mann-Whitney statistic can now be easily obtained. 5.6
Efficiency of Tests
Let X I , X 2 , . . . , X, be a random sample from a population having the cumulative distribution function F(x). Let T n ( x l ,x 2 , . . . , x , ) be a statistic to test the hypothesis about the translation parameter 6 of F(x) such that \k(x - e ) = F ( x ) , where J/ is a distribution function with density $ ( x ) . For testing hypotheses about 8, one may like to compare the statistic Tn with another one T,' having the same power. One measure of efficiency is the ratio of the variance of T , to the variance of T,,'. Sometimes the efficiency is measured by the ratio of the observations needed, using the statistics having the same power. We use the following definition for relative asymptotic efficiency.
111
EFFICIENCY OF TESTS
5.6.
Definition The relative asymptotic efficiency of a statistic TN relative to TN* is given as E T ~T,N : = lim (N *IN)
(5.6.1)
when the power of the tests based on the statistics TNand TN: is the same. Suppose Y1,Y,, . . . , Y, is another sample from a population having cumulative distribution function Gb). Let
< x, i = 1 , 2 , . . . ,m ) (5.6.2) the sampling distribution function o f X1, X z , . . . , X,,, . Similarly, let G,(x) F,(x) = ( l / m ) {number of Xi
be be the sampling distribution function of Y 1 , Y z , . . . , Y,. Then we can define the sampling distribution function of the combined sample of Xs and Y's by
HN(X)=ANFm(X) + ( l - h N ) G n ( x )
(5.6.3)
hN = m/(m + n) = m/N.
(5.6.4)
where We define below a statistic that includes many well-known statistics as special cases. Let if the ith observation in the combined 1, sample is an X observation, ZNi = (5.6.5) 0,
otherwise.
Suppose EN^ are certain given numbers. Then the Chernoff-Savage statistic is defined by (5.6.6) The statistic has been used for testing hypothesis in many nonparametric cases. For example, the Fisher-Yates-Terry-Hoeffding statistic c1 is of the type TN. In many applications, integral representation of the statistic in (5:6.6) is more useful, as given by
1 m
TN =
JN (HN ( X I )
dFtn (x)
(5.6.7)
-m
where EN^ = JN(i/N). In an important study, Chernoff and Savage (1958) showed that the statistic TN, when suitably normalized, has an asymptotic normal distribution. The proof is involved, and the interested reader is referred to the original paper. Let F(x)=\Ir(x - 0,) and C(x)=\k(x - el), where \JI is a cumulative distribution function and B o and B are the location parameters. Let JI (x) be the
112
V.
NONLINEAR MOMENT PROBLEMS
probability density function for \k(x). An important test of the hypothesis is to test H ~ e o: = e l or A = e o - e , =o, H] : A#O. (5.6.8) There are many nonparametric tests for testing the equality of the distribution functions F and G and, for the above case, we consider the Fisher-Yates-TerryHoeffding (normal scores) test statistic c I in which m
~1
=
C aN(Ri),
i=l
(5.6.9)
where R I ,Rz,. . . , R m are the ranks of Xi's in the combined sample and (IN', i = 1 , 2 , . . . N is a random sample of size m + n from uniform distribution on [0, 1 J , and Q-' is the inverse function of the standard normal cumulative distribution function
aN(i) =E[@-l((INi)].
s X
Q(x) = (2n)-'12
--m
exp (-tz/2) d t .
We also denote the standard normal density function by q ( x ) = (2n)-'12 exp (-xz /2). For a detailed discussion of the Fisher-Yates-Terry-Hoeffding test statistic and other nonparametric tests, the reader is referred to Hijek and Sid6k (1 967). A parametric competitor of the c1-test is the t-test, which turns out to be the likelihood ratio test when F ( x ) = @(x), the normal cumulative distribution function. Chernoff and Savage showed that the asymptotic relative efficiency of the c I-test as compared to the t-test for the above hypothesis is given by
GI,? (W= 1/oz
SJ..
[ W X ) l $'(XI dx,
(5.6.10)
where uZ is the variance of \k(x) and (5.6.1 1)
Jo = @ - I (x).
The efficiency Ec,,r is not affected by normalizing the cumulative distribution function Jl(x); that is, we impose the restrictions I.$(.)
dw = 0
and
s
x 2 $ ( x ) dx = 1 .
(5.6.12)
In what follows, we find the lower bound of the integral in (5.6.10) with restrictions (5.6.12). It is shown that Ec,,?(+)a1 and E C , , ? ( q )= 1 only when
*=a.
EFFICIENCY OF TESTS
5.6.
113
First the problem is simplified as follows. From (5.6.1 l), we have
1
J&)
x = @(Jo(x))=
(2n)-’/* exp ( - t 2 / 2 ) d t
(5.6.13)
--m
Differentiating (5.6.13) with respect to x , we have 1 = d J o ( X ) ) Jo’(x) giving JO’G-1
Also since \k(x) =mo!J
= l/cp(Jo(X))-
q ( x )dx, (5.6.1 4)
dx = d J o ) dJo-
We suppress 0 in J O in subsequent discussion for simplicity of notation. The integral in (5.6.10) is reduced to (5.6.1 5) and (5.6.12) is reduced to
Iz = I3 =
s s
xcp(J) dJ = 0,
(5.6.1 6)
xZcp(J)dJ= 1.
(5.6.17)
The problem of finding the lower bound of the relative asymptotic efficiency is then reduced to minimizing the integral I I with constraints that I 2 = 0 and I 3 = 1. When x = J,II = 1. The existence and uniqueness of the solution is evident from Theorem 5.3.1, since the integral I I is continuous and strictly convex in x . We can also show this directly as follows. Replace x by cx = x * , where c is some constant. Then 1 1 , Iz, and I 3 are replaced by I I * = c-’I I , IZ * = CIZ , and I 3 * = c Z I 3 ,respectively. Now if I z = 0 and 1 3 < 1, we can choose c such that 13* = 1 and I I * < I I . For a constant h with 0 B h S 1, let x = (1 - h)xl t Axz.Since I I is strictly convex in x , we have
1 cp
( J ) dJ
-k
h
s
x2 q ( J ) dJ
(5.6.1 9)
114
V.
NONLINEAR MOMENT‘PROBLEMS
Hence x I and x 2 cannot both be the solution of the minimization problem, since, if they were, ( x l +x2)/2 would give a smaller value of Il still satisfying the constraints. Hence the solution is unique, if it exists. To show that x = J is the solution, we suppose that x 1 and x 2 are monotone functions satisfying constraints (5.6.16) and (5.6.17) and that x 2 gives a smaller value of Il than x I. Since I1(h)is convex and monotonic decreasing in h, we have
Il’(0) = I,‘(A)lh=()
< 0.
From (5.6.1 8)-(5.6.20), we have (5.6.21) Z2’(0) = / ( x 2
Hence there exists a constant
-
X I ) d J dl ) =0
(5.6.22)
t such that (5.6.23a)
I’(0) t tz3‘(0)2 0.
The condition (5.6.23a) is essentially the Euler equation. However, here it plays the role of a sufficient condition in place of a necessary condition, since we have convexity and monotonicity at our disposal in this problem. Integrating by parts (5.6.2 1) we have
(5.6.24) When x l (4 = J , we have I’(0) t [I3’(())=
s
(x2
- X I ) [cp’(J)
+ 2EJdJ)I dJ.
(5.6.25)
Thus, when E = 4, (5.6.25) leads to 11’(0)t fZ,’(O) = 0. That is, for x l ( 4 = J , the bound is attained. We have proved the following theorem. Theorem 5.6.1 Suppose the cumulative distribution function * ( x ) has a density with finite second moments, then &J*) 2 1
and equality is attained when \k = a.
5.1.
TYPE A AND TYPE D REGIONS
115
Efficiency of the Wilcoxon Test with Respect to the t-Test The test of the following hypothesis is considered. : c ( x ) = F(X - el, H~ : F ( X ) = c ( x ) , where 6 is the location parameter. Again, the relative asymptotic efficiency of the Wilcoxon test with respect to the t-test is obtained in terms of the integral,
IfZ(x)dx,
(5.6.26)
where f(x) is the probability density function for F(x). Again we consider the lower bound of the integral in (5.6.26) with the restrictions that
l x f ( x ) dx = 0 ,
sf
( x ) dx = 1,
I x z f ( x ) dx = 1.
(5.6.27)
Using Lagrange’s multipliers, we reduce the above problem to that of minimizing J[fz(x)
+ b(xz - a ~ ) f ( ~dx. j]
Using the Euler-Lagrange equation, the extremal solution is given by setting
b(az - x z ) ,
if x z < a’,
0,
otherwise.
a, b can now be determined from (5.6.27) giving a = (5)’12, b = 3/20(5)’12. The
lower bound of the efficiency is
12aZ ( l f z ( x ) d x ) Z ,
which is 12(9/125) = 1081125. For a detailed discussion, see Hodges and Lehmann (1956).
5.7 Type A and Type D Regions In comparing various unbiased tests, that is, tests such that the power function has a minimum at the null hypothesis, it is the shape of the power function at the null hypothesis that is studied. Neyman considered regions in which the curvature of the power function at the null hypothesis has a maximum. The variational problems arising from these considerations are quite interesting and lead t o type A regions. Consider that X is observed from a population having a probability density function f (x, 6). Let : e > eo H~ : e G eo,
116
V.
NONLINEAR MOMENT PROBLEMS
be the hypothesis of interest. Suppose the power function of the test q ( x ) is given by a@,) = Po,
(rejecting Ho)
or
(5.7.1)
Also (5.7.2)
For type A regions, we consider the region such that we maximize (5.7.3) subject to constraints
a(eo)= a.
and
~ ’ ( 6 , ) = 0.
(5.7.4)
The Neyman-Pearson lemma can now be used to solve the above problem. Let Ro be the type A region; then we characterize R o as (5.7.5) when x E R o , and (5.7.6) when x $ R o . The necessary and sufficient conditions for the solution of the problem are given by (5.7.6), assuming the existence of the constants k l and k 2 . The constraints (5.7.4) and (5.7.5) are used to determine these unknown constants. The existence of the region Ro can be seen in the same way as in Section 4.4, since the set S of points ( z , , z 2 , z 3 ) defined below is closed and convex. (5.7.7) (5.7.8) (5.7.9)
5.7.
TYPE A A N D TYPE D REGIONS
117
The subset of S is simply a section with z 2 = 0 and z 3 = 010 and remains closed and convex. The set Ro is then obtained from the hyperplane at the maximum point. Type C regions are introduced to study hypotheses in which two parameters are involved. Assume that the population has a density depending on parameters O 1 and 0 2 . The hypothesis tested is Ho : O 1 =Ole and O2 =020. Let d o = (0 OzO). Optimal regions of fixed size for such an hypothesis are assumed to be unbiased and maximize the power along ellipses of constant power in the neighborhood of the null hypothesis. Such regions are known as type C regions and can be found if and only if one knows the relative importance of the power in the infinitesimal neighborhood of the point Ole, 02,,. It is impossible to find them otherwise. Isaacson (1951) introduced type D regions to rectify this situation. See also Lehmann (1959), p. 342. Type D regions are obtained by maximizing the Gaussian curvature of the power function at O OzO. Therefore, finding type D regions reduces t o maximizing
(5.7.1 0) We give below the general case in which the parametric space is k-dimensional
0 = ( e l , 0 2 , . . . , O k ) . Let Ho : 0 = do. Assume that f(x, 0 ) can be partially differentiated under the integral twice with respect to Oi, O j , i = 1 , 2 , . . . , k,
j = 1 , 2 , . . . ,k.Let
(5.7.1 1) and (5.7.1 2)
i, j = 1, 2, . . . , k. Let C(q) be the matrix whose (i, j)th element is gij(p). The
determinant of C(q) is denoted by IC(q)l. Utilizing in the following the definition of Gaussian curvature or total curvatdre (Eisenhart, 1909) for a surface in k-dimensional Euclidean space, we have, in our case (5.7.1 3)
118
NONLINEAK MOMENT PROBLEMS
V.
Under the restrictions imposed by the unbiased character of the regions under consideration above, the Gaussian curvature is given by IC(q)l, since h j ( c p ) = 0 for all j . The maximization of a determinant arises in many other contexts. In optimal designs of regression experiments, the criterion of D-optimality requires essentially the maximization of determinants; hence the term "D-optimality" is used to describe the criterion. We consider such problems in optimal designs in the next chapter. In what follows, we shall suppress 0 in B0 for notational convenience. In obtaining type D regions, we first characterize the test function q o ( x ) so as to find k
(5.7.1 4)
and will then extend it to the maximization of IG(q)l. Let cpo(x) be a critical function with
Lemma 5.7.1
such that there exist constants 1 1 1 ,
cpo(x)=
j (0
then II;=,
gjj(q)
if
. . . ,a k
k
k
i=l
j = l ,j+i
Z( II
k
$)&i(%)<
(5.7.16)
C
i = l Qihi(%),
is maximized for cp(x) = cpo(x).
Proof Consider the function
Applying the Neyman-Pearson lemma to fo , we have
(5.7.1 7)
5.7.
TYPE A A N D TYPE D REGIONS
119
or
(5.7.1 8)
(5.7.19)
i=I
That is, the arithmetic mean is that
xi
< k.
(5.7.20)
< 1 . Hence the geometric mean is also < 1 , so
( fi
<1
i=l
and therefore k
k
This gives the result of the lemma. Lemma 5.7.2 and if
If qo(x) is given by (5.7.16) and satisfies conditions (5.7.15), gij(qo)=O,
i f j = 1 , 2, . . . , k,
(5.7.21)
then iCl is maximized for q ( x ) = q o ( x ) whenever C is positive definite.
Proof Under conditions (5.7.21), the determinant of the matrix G(q) is obviously given by IG(~)I=
k
IIgii(q).
(5.7.22)
i=l
From Lemma 5.7.1 and the condition that C(q) is positive definite, the result follows. Theorem 5.7.1
Let C(q) be a positive definite symmetric matrix with its
(i, j)th element gij(q) given by(5.7.11). If q o ( x ) is such that hj (PO)= 0
and is characterized by (5.7.16), then qo(x) maximizes IC(q)l.
(5.7.23)
120
V.
NONLINEAR MOMENT PROBLEMS
Proof There exists an orthogonal matrix 0 of constants that diagonalize C(cpo). That is, O'G(cp0)O is diagonal such that 0'0 = I. Let G*(cpo) = O'G(cp0)O. Let the (i, j)th element g$(cpo)of G*(cpo)be given by (5.7.24) Since 0 is orthogonal,
Also where G*(cp) has its (i, j)th element g$(cp) given by
(5.7.25) Since G(cp0) and C(9) are positive definite, C*(cpo)and G*(cp) are also positive definite and hence their diagonal elements are positive. Applying Lemma 5.7.1 and Lemma 5.7.2, we see that IC*((Po*)I
IG*(cp)l
(5.7.26)
for cpo satisfying (5.7.23) and given by
10,
otherwise.
(5.7.27) We show below that cpo*(x) = cpo(x) given by (5.7.16). Let G be a matrix with (i, j)th element gij
..., k and
G* = O'GO = (g$, where
Now G*(qo)G* = O'G(cpo)OO'GO= O'G(cpo)GO,so that tr c*(cp0,) G* = tr O'G(cpo)GO = tr G(cpo)G,
5.7
TYPE A A N D TYPE D REGIONS
121
since trace is not affected by orthogonal transformation. Hence the inequality in (5.7.27) is satisfied whenever the inequality in (5.7.16) is satisfied. Example 5.7.1 Consider two independent random samples of sizes n l and n2 given by x l , . . . , x n l , x , ~ + ~. ,. . , x,,+,~* from normal populations with means p 1 and p2 and known variances u12and u2'. We wish to obtain type D regions for testing the hypothesis HO:~c=(p1,~2)=(0,0)=0, HI :pfO.
Without loss of generality we assume u12= u2' = 1 . Also, since XI and X 2 , the two sample means, are sufficient statistics for pl and p2 and have independent normal distributions, we assume n l = n 2 = 1 and let X I = u and X2 = u . The probability density function is given by
f ( u , u) =(2n)-' exp [ - $ ( u - P ~ )- ~1 ( ~ - ~ 2 ! ' 1 .
(5.7.28)
Now we have = (27r)-lu exp [-$(u' t u')] = uh(u, u),
where (2n)-' exp [-f(u' t u')]
= h(u, u ) , = uh(u, u)
(5.7 2 9 ) (5.7.30)
I dTf2 I
aPl
"'f
M"0
aP2
/J=o
= (u' - l)h(u, u)
(5.7.3 1)
=(u' - l)h(u, u).
(5.7.32)
For an unbiased critical region of size a given by po(u, u ) such that cp0(u, u ) is the probability of rejecting the null hypothesis when (u, u ) is observed, we have the constraints (5.7.33) (5.7.34) ,."
(5.7.35)
122
V.
NONLlNEAR MOMENT PROBLEMS
and the matrix
ss
u2h(u, u)qo(u, u) du du - a
uuh(u, u)qo(u, u) du du (5.7.36)
u2h(u, u)qo(u, u) du du - a
u)qo(u, u ) du du
is positive definite. From the criterion given by (5.7.27) and using (5.7.33), we have the following characterization of cpo(u, u).
4 = 1,
%(U,
when
-2
( l/uuh(u,
2
(5.7.37)
u)qo(u, u ) du du) 2 k l t k2u t k3u
and \po(u, u ) = 0, otherwise. The liklihood ratio test indicates that cpo(u, u ) = 1 when, for a constant k, h(u, u ) < k or (5.7.38)
u2 t u 2 >a2
and q0(u, u ) = 0, otherwise. Here a is some constant such that (5.7.33) is satisfied. Since u, u are independently normally distributed with mean zero, u2 t u2 has x2 distribution with two degrees of freedom and hence a can be determined. It is now verified that qo(u, u ) given by (5.7.38) satisfies the condition (5.7.37) and that k l , k 2 , k 3 exist. This can be seen as follows.
jj
U2h(U, u) du
du
u 2 + u 2>a‘
=
JJ
u2h(u, u)dudu
u 2 + u 2> a 2
= a(1 t a2/2).
(5.7.39)
Also, since h(u, u ) is an even function,
JJuuh(u, u)qo(u, u) du du = 0.
(5.7.40)
5.8.
THE NEYMAN-PEARSON TECHNIQUE
123
In view of (5.7.39) and (5.7.40), the matrix (5.7.36) now becomes
which is positive definite. The inequality (5.7.37) now reduces to (5.7.41) Choosing k2 = k3 = 0 and k l = ara4/2, we see (5.7.41) is satisfied. The above results can be used to find an unbiased critical region of type D for testing a simple hypothesis about the means of a multivariate normal population with a known covariance matrix, since it is possible, by an orthogonal transformation of variables, to transform this problem into the one we have just solved. 5.8
Miscellaneous Applications of the Neyman-Pearson Technique
Problems of optimization with applications in stockpiling (Danskin, 1 9 5 9 , in dynamic programming (Bellman et QL, 1954), and in choosing combinations of weapons (Karlin et al., 1963), among many others, have utilized the NeymanPearson technique for solving them. Some of these problems will be discussed here.
Stockpiling Problem (Danskin) Suppose that certain goods are desired to be stockpiled for a certain contingency that occurs only once. Let the probability density function of the time t that a contingency occurs at time t be given by u(t). Suppose that the production of goods depends on the expenditure per unit of time z and is given by H(z), where H is a concave function of z. That is, marginal production decreases with increasing expenditures. Let y ( t ) be the rate of spending at time t and let c be the desired expenditure; then, m
P
J
y ( t ) d t =c.
(5.8.1)
0
We also assume that the utility of a stockpile X is given by q x ) , which is also concave, since marginal utility decreases with increasing stockpile. The expected utility, as a function o f y , is given by m
r
(5.8.2) n
124
V.
NONLINEAR MOMENT PROBLEMS
Suppose further that the money accumulates at an interest rate a! compounded continuously. Hence if we want to spend at time f at a rate y ( f ) ,the actual expenditure at time r is at the rate y(t)e*' and, therefore, actual production is given by H(y(t)e"'). Suppose the initial stockpile is X;then the stockpile at time f is given by r
X ( f )=
x+
(5.8.3)
H(y(7)ea7) d7. 0
The problem here is t o findyo(r), which maximizes the expected utility given by (5.8.2) with constraints (5.8.1) and (5.8.3). Consider the case
> p > ff,
u(f)
(5.8.4)
1- V ( f )
where V(t)is the cumulative distribution function for u ( f ) . The existence of the solution follows from results in Section 5.3. To obtain the admissible y o , which maximizes (5.8.2), we apply Lemma 5.7.1 as follows. Let y*(t)=(l-A)yo(t)+hy(t),
o<
A < 1.
Then, from (5.8.3), we have t
X A ( f )=
x+
H(yh(T)e"') d7, 0
and (5.8.2) becomes
s
m
Y?v(h)=
0
Since yo yields a maximum of +(A),
On simplification (5.8.5) gives
F(XA(t))u(f) df-
(5.8.5)
5.8.
THE NEYMAN-PEARSON TECHNIQUE
125
or f
m
Interchanging the order of integration, we have m
Go
or (5.8.6)
where
J F'(X(r)u(t))di m
Qo(r)= eat
(5.8.7)
t
Using the Neyman-Pearson lemma, we have the following: There exists a constant k such that if y o ( r )> 0
then
H'(yo(t)e"')Qo(t)= k
(ii) y o ( t ) = 0
then
H'(yO(t)e"')Qo(r)G k .
(i)
The constant k can be determined from the constraint (5.8.1).
Combination of Weapons (Karlin er al.) Determining a weapons combination that optimizes a certain objective function depending on the individual merits of the weapons leads to problems solved by the Neyman-Pearson method. For simplicity we consider here only a single weapon being used against an advancing enemy at distance s. Let the accuracy of the weapon be given in terms of the probability of destroying the enemy when the enemy is at the distance s. We use the notation a(s) p(s) g(s)
F(s)
accuracy, firing policy giving the rate of fire, gain if the enemy is destroyed at distance s, the probability that the enemy survives to a distance s.
Assume that O
(5.8.8)
126
NONLINEAR MOMENT PROBLEMS
V.
s
and there is a constant 6 such that
p ( s ) ds
< 6.
(5.8.9)
0
Also we assume that g(s) is a bounded, increasing, absolutely continuous function having bounded derivative with g(0) = 0. The expected gain is then given by
PO) =
i’
ds) W S ) .
(5.8.1 0)
0
Integrating by parts, we have
1 m
d P >=
g(s)(l -F(s))ds.
0
For a given policy p(s), the survival probability F(s) is given by m
(5.8.1 1) The problem is to find an optimal policy po(s) that maximizes &I). maximize
That is,
m
m
(5.8.1 2) with constraints (5.8.8) and (5.8.9) The integrand can be easily seen to be a concave function of p and hence, by the same arguments discussed in Section 5.3, we find that the optimal policy po(s) exists and is unique. Using the results of Lemma 5.7.1, we have the reduced problem of finding a maximizing policy so as to minimize m
m
m
with constraints (5.8.8) and (5.8.9). Let S
m
5.8.
127
THE NEYMAN-PEARSON TECHNIQUE
and, since H(s) 2 0, (5.8.13) is minimized if m
(5.8.14)
H(s)a(u)p(u)du 0
is minimized. Note that H(s) involves po(s). The Neyman-Pearson lemma then gives the existence of a constant c such that
:I
if H(s)a (s) > c , if H(s)u (s) < C,
Po@) =
arbitrary,
if
(5.8.15)
H(s)u(s) = C.
In special cases (5.8.15) gives a complete characterization of po(s). Example 5.8.1
Let 6 = 1 = M . ks,
s
kd,
d,
g(s)=
a(s) =
s> { { yuo-s),
O<s
s 2ao.
Assume further that a.
(aobo+ 4)2 G ho’bo.
and
It can be verified that the following po(s) gives the optimal policy.
co,
s <so, so < s
P o @ ) ={ 2/bo(ao - s)2,
to,
(5.8.16)
s > To,
(0, where so = 4
To = a0 - 2ao/(aobo + 4).
2 ,
The probability of survival can then be calculated from (5.8.1 1) as follows: Fo(s) =
i
(a0
- to)/(ao - s),
(ao- to)2/(ao- s)2,
Consider a(s) = 1 /s and
As)
so G
s G To,
s >so.
(1,
Exercise
s < so,
=
ks,
1 kd,
sd.
128
NONLINEAR MOMENT PROBLEMS
V.
Show that po(s) defined below is the optimal policy. (0' p o ( s ) = { 1, I
(0,
s<so, so<sto,
where t o = ( h 2 t d2)'I2,
so = -6 t (ij2 t d2)'I2
and the probability of survival is, s < so,
( so/to,
F(s) = {
so
s/to,
(1,
< s < to,
s >to.
Mathematical Economics (Bellman et al.) Let xi(t), i = 1,2, . . . ,N be outputs of a system and let each xi be divided into two parts yi and zi such that y i is reinvested to increase future output and zj is the profit. Assume that the change in the output is determined by ,N
(5.8. 7 )
and
Xj(0) = c j .
(5.8. 8)
The total profit is given by T
(5.8.1 9 )
The problem is to find an optimal policy so as to maximize (5.8.19) and satisfy constraints (5.8.17) and (5.8.18). For simplicity, assume N = 1 and a l l > 0. We then have (5.8.20) (5.8.2 1)
The total profit is T
r
THE NEYMAN-PEARSON TECHNIQUE
5.8.
129
since, from (5.8.20), we have 1
0
We have the additional constraints OGy,
Iyl t
< Cl +a11
dt.
0
(5.8.22) can be rewritten as follows T
J’
J ( Y I ) = ~ I T + b11(T-t)-11~1(t)dt,
(5.8.2 3)
0
since, integrating by parts,
1 T
0
t
T
Syds)dsdt=(T-t) S.Yl(t)df. 0
0
Let TI be the value o f t for which a l l ( T- t ) - 1 = 0; that is, TI =(allT- 1)lQll Using the Neyman-Pearson technique, the optimizing policy is then given by t
when a l l T - 1 > 0. When a l l T - 1 < 0, y I ( I ) = O everywhere in (0, r). The generalization to the case N > 2 is similarly obtained. The interested reader may consult the details in Bellman er al. (1954). We consider below the case in which the objective function is a linear functional, and linear constraints of physical origin are introduced. The complexity is greatly increased in this case. Now let the differential equation for x(t), absolutely continuous, be given by dx/df = - x ( t ) t y(t),
with
0
~ ( 0=) 1 ,
(5 3.25) (5.8.26) (5.8.27)
130
V.
N O N L I N E A R MOMENT PROBLEMS
The problem is to find an optimal y o ( t )such that yo(t) minimizes T
J(y) =
(5.8.28)
(1 - x ( t ) ) 2 d t . 0
s
The differential equation (5.8.25) gives the following solution x ( t ) = e-' t e-'
Then J ( y ) in (5.8.28) becomes
0
(5.8.29)
e"y(u) du.
T
t
(5.8.30) 0
0
It can be easily seen that J ( y ) is the convex function of y . Let y o ( t ) be the optimal solution. Using Lemma 5.7.1, we reduce the problem to that of minimizing T
where
(5.8.32) 0
Changing the order of integration in (5.8.31), we have
s
T
J&)
where
=
0
1 T
y ( u ) [e"
T
y ( u ) K ( u )du,
e-'(l - x o ( t ) ) dt
(5.8.33)
0
U
s
T
K(u) = eu
e-'(l - x o ( t ) ) d t .
(5.8.34)
U
The optimal y o ( t ) can then be obtained in terms of the function K, as in (5.8.24).
Discrete Search (Kadane) Suppose an object is hidden in one of n boxes. The search consists of selecting a box and finding an object in that box. It is assumed that the search
5.8.
THE NEYMAN-PEARSON TECHNIQUE
131
may not be successful, even if the object is in that box. That is, the box may be searched several times. A search strategy may be a sequence of integers, such as S = ( 4 , 2 , 3 , 4 , 5 , 4 , . . .).
The strategy S means that the box number 4 may be searched several times. There is a cost associated with every search, and there is a budget constraint imposed. Let pik = probability that jth search of box k conducted is successful if the strategy includes the jth search of box k , = 0, otherwise. c i k = cost of jth search of box k . The probability of finding an object is then given by (5 3.35)
and the total cost has the constraint (5.8.36)
with a preassigned c. Let 9 be the class of all strategies S. The problem of optimal search strategy is to find a strategy So EYsuch that p is maximized with constraints (5.8.36). The form of the functions in (5.8.35) and (5.8.36) will be the same if we assume that there is one box for each pair (j, k), and Pjk is the probability that the box (j, k ) contains the object, j 2 1 and 1 < k < n. Similarly, cik is the cost of a search of box(j, k). We can then replace (5.8.35) and (5.8.36) by (5.8.37) and (5.8.38) &; hci (5.8.39) xio = 0, if pi < hc; for some A, 0 < X < m, and Z cixi = c.
{
132
V.
NONLINEAR MOMENT PROBLEMS
The set of constants A satisfying (5.8.39) is the same for each optimal xo and is a single point or an interval. Many special cases of the search problem have been discussed in the literature; for a bibliography, see Kadane (1968). References Bellman, R. E. (1957). Dynamic Programming. Princeton Univ. Press, Princeton, New Jersey. Bellman, R., Clicksberg, I., and Gross, 0. (1954). On some variational problems occurring in the theory of dynamic programming, Rend. Circ. Mate. Palermo 3, 363-397. Chernoff, H., and Savage, I. R. (1958). Asymptotic normality and efficiency of certain nonparametric test statistics, Ann. Math. Statist. 29, 972-994. Chernoff, H., and Scheff6, H. (1952). A generalization of Neyman-Pearson fundamental lemma, Ann. Math. Statist. 23, 213-255. Danskin, J. (1955). Mathematical treatment of a stockpiling problem, Naval Res. Log. Quart. 2,99-109. Dantzig, G . B., and Wald, A. (1951). On the fundamental lemma of Neyman-Pearson, Ann. Math. Statist. 22, 87-93. Eisenhart, L. (1909). Differential Geometry. Ginn, Boston, Massachusetts. Halmos, P. R. (1948). The range of a vector measure, Bull. Amer. Math. SOC.5 4 , 4 1 6 4 2 1 . &jek, J., and Sidik, Z. (1967). Theory of Rank Tests. Academic Press, New York. Hodges, J. L., and Lehmann, E. L. (1956). The efficiency of some nonparametric competitors of the t-test, Ann. Math. Statist. 21, 324-355. Isaacson, S. L. (1951). On the theory of unbiased tests of simple statistical hypotheses specifying the values of two or more parameters, Ann. Math. Statist. 22,217-234. Kadane, J. B. (1968). Discrete search and the Neyman-Pearson lemma,J. Math. AnaL Appl. 22, 156-171. Karlin, S . , and Studden, W. J. (1966). Tchebycheff Systems: With Applications in Analysis and Statistics. Wiley (Interscience), New York. Karlin, S., Pruitt, W. E., and Madow, W. G. (1963). On choosing combination of weapons, Naval Res. Log. Quart. 10, 95-1 19. Lehmann, E. L. (1959). Testing Statistical Hypotheses. Wiley. New York. Meeks, D. H., and Francis, R. L. (1973). Duality relationships for a nonlinear version of the generalized Neyman-Pearson problem, J. Optimization Theory Appl. 11, 360-378. Neyman, J., and Pearson, E. S. (1936). Contributions to the theory of testing statistical hypotheses, Statist. Res. Memoirs, Vol. I, 1-37. Rustagi, J. S. (1957). On minimizing and maximizing an expectation with statistical applications, Ann. Math. Statist. 28, 309-328. Terry, M. E. (1952). Some rank order tests which are most powerful against specific parametric alternatives, Ann. Math. Statist. 23, 346-366. Wagner, D. H. (1969). Nonlinear functional versions of the Neyman-Pearson lemma, SIAM Rev. 11,52-65.
CHAPTER VI
Optimal Designs for Regression Experiments
6.1
Introduction
The theory of design of experiments as developed by statisticians was primarily aimed to make efficient use of experimental resources. Although experimentation is as old as human knowledge, the development of the modern theory of experimental design was greatly enhanced by the contribution of Sir Ronald Fisher, while he was associated with Rothamstead Experiment Station in England. The construction of many proposed designs led to the formulation of many new problems in combinatorial theory, which has become a very active branch of mathematics. Fisher also popularized the statistical technique of analysis of variance for properly analyzing experiments. The use of randomization in experiments was initiated by Fisher, and the designs formulated by him and his colleagues have become important aspects of present practice in experimental sciences and data analysis. The book, The Design of Experiments by Fisher (1947), is still regarded as a classic and has stimulated research in many areas. We are concerned in this chapter with designs for regression experiments. Suppose an experimenter is interested in obtaining information on a response variable y that depends on a variable x, the levels of which are under his control. With a given set of resources such as money, time, and the number of observations he can take, he would like t o know where and how many 133
134
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
observations to take. The problem of optimal regression designs becomes the problem of choosing levels of x and allocating observations at x so as to optimize certain criteria. Depending on the object of experimentation, these criteria will differ. There have developed large numbers of criteria of optimality of regression experiments as a result of the varied nature of experimentation. We consider a few of these criteria in this chapter. The theory has progressed to an advanced stage and is attracting large numbers of researchers. A recent survey of optimal regression experiments has been given by Federov (1 972). The art and science of statistics is concerned with optimization problems in many facets. However, in the study of regression designs, optimization appears to be the main theme. In many of these investigations, the variational techniques are universally used. The purpose of this chapter is to emphasize the application of variational methods to the problems of regression experiments. Definitions of new concepts utilized are provided, and several results have been proved to make the treatment self-contained. Some of the most important contributions in optimal regression designs have been made by Elfving (1952), Chernoff (1953), Kiefer and Wolfowitz (1959), Kiefer (1 959), Karlin and Studden (1 966), and Federov (1972), among many others. 6.2
Regression Analysis
Suppose the experimenter observes yi, Xli,. . . , X k j for a given subject i, i = 1 , 2 , . . . ,n, where yi denotes some kind of response and &,. . . , Xki denote k levels of independent variable xi. For the above set of observations, we
may assume the linear model
yi = P o
+'
+P1xli
'
'
t Pkxkj +Ejr
(6.2.1)
where P o , P I , . . . , P k are unknown parameters and ~i are random errors such that they are uncorrelated and have the same variance u2. Let
y = (r** Y 2 , . . . ,Yn)'
(6.2.2)
REGRESSION ANALYSIS
6.2.
135
The linear model (6.2.1) can be written in the matrix notation (6.2.5)
Y=XptE.
The least-square criterion for estimation of parameters /3 requires that we minimize E'E
xp).
(6.2.6)
p'x'xp.
(6.2.7)
= (Y - XS)' (Y-
Equation (6.2.6) becomes on simplification, E'E
= Y'Y - 2y'xp +
Differentiating (6.2.7) with respect to /3 and equating to zero, we find the minimizing fl to be given by
6 = s-' X'Y,
(6.2.8)
where X'X = S, and S is assumed nonsingular. In general, using the generalized inverses, (6.2.8) could be modified when S does not have an inverse. The estimate of u2 is taken generally by the expression. 1
62 = -(Y- X&Y
n-k
-
(6.2.9)
There are several situations in which the linear model assumed in (6.2.1) does not provide the appropriate setup for an experiment. We consider a nonlinear model in Section 6.5; we also introduce asymptotic theory for it. We discuss below a more general form of the linear model considered in (6.2.1). Let ~ ( x6) , in (6.2.1) be of the form
54x9 6 ) = W x ) ,
(6.2.1 0)
where f'(x) = (fl(x), f2(x), . . . ,fk(x)) is a vector of k known functions of x. When these functions are polynomial functions of a given degree, the regression model is that of a polynomial regression. The least-squares estimate of 6 is obtained in the same manner as in (6.2.8):
8 = M-'Y, where
c f'(xi)f(xi), n
M= and
i=1
(6.2.1 1)
(6.2.
136
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
The well-known Gauss-Markoff theorem states that the estimate of 6 given by (6.2.11) is the best (in the sense of minimum variance) linear unbiased estimate of e. The variance-covariance matrix of the estimate 6 is given by D(d) = M-’ u2.
(6.2.14)
The determinant of D(d) denoted by ID(d)l is commonly known as the generalized variance of the estimate 6. Generally, this determinant is utilized in studying optimal designs for linear regression problems. Notice that the matrix M defined by Eq. (6.2.12) depends completely on the levels of the variable x and hence is called the design matrix. In subsequent discussion the matrix M will be called the information matrix of the experiment E and will be denoted by M(E). In many classical experimental setups such as randomized blocks, Latin squares, incomplete block designs, and factorial designs, the model
Y=Xpte is used, where X is given to be a matrix of 0’s and 1’s. The treatment in this chapter does not deal with such experimental situations. The study of the classical theory of the design of experiments, as developed by Fisher and other statisticians, provides various methods of design and analysis of a large number of experimental situations. There is a considerable amount of interest in the basic theory of construction of such designs, and this interest has stimulated substantial research in combinatorial theory. This is beyond the scope of the present exposition. We consider first a simple example.
Example 6.2.1
Consider the simple linear regression model y i = B o t &xi t ei,
i = 1,2, . . . ,n.
Let
y = 0 1 1 9 Y2,
f
f
*
,Yn)’,
8 = (60,e l ) ,
and we assume that the ei are uncorrelated with variance u 2 . So that
( iXi ) .
s=x’x=
=xi
ZXi2
6.3.
OPTIMALITY CRITERIA
Let
X = Zxi/n,
and
137
= nZ(xi -q2
n B i 2-
(summations are from 1 to n).
Using the estimates given by (6.2.8), we have
The covariance matrix of b is given by S-’u2. cov(eo,
el) =
-
u2 c x i n z ( x i - q2’
In order to obtain optimal x i so as to minimize the variance of 8 , , V(bl ) say, we maximize C ( x i - F)’. Similarly, to minimize the variance of bo, we minimize Zx: / C ( x i - X)2. Since Exi? 2 C(xi - X)’ , we can minimize V ( c 0 ) when X = 0. This condition also makes cov (bo,b l ) = 0. There are many other criteria of optimality in designing regression experiments, some of which we discuss in the next section.
6.3
Optimality Criteria
In regression experiments, the main problem is to find the levels of the independent variables so as to obtain estimates of the parameters of the model or estimates of some functions of the parameters optimally. By an experimental design, we mean the choice of these levels. Assuming that there are n observations available, we are interested in knowing the allocation of these observations to various levels of the variable x . That is, we want to choose levels X I , x 2 , . . . , xk to be repeated n , , n 2 , . . . , n k times such that n l + n 2 + . . * + nk = n , given in advance. The set X 1 , X z , . . . ,X k
with
ni,n2
, . . . , nk
(6.3.1)
138
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
is known as the design of an experiment E . The points x I, x 2 , . . . , x k are called spectrum of the design. In place of integers n , , n2, . . . , n k , we use a normalized design
x I , x Z $ .* .
,xk
.., P k ,
with
PI>P2,.
where k
p i = ni/n,
C pi = 1 i=l
(6.3.1 a)
The object of the theory of optimal regression designs is to determine optimal values of x I , . . . , xk and p l , . . . , P k such that certain criteria are satisfied. For example, the generalized variance of the estimate of the parameters in the model may be minimized. We shall consider various cases in this section. A large number of optimality criteria have been considered in the literature. We shall consider only the most commonly used ones. One of the earliest criteria is that of D-optimality, which is defined next. Definition
The design that minimizes ID(b)l is called a D-optimal design.
The intuitive reason for considering D-optimal designs is to choose the design such that the estimate of the unknown parameter vector is as close to the true value as possible. The matrix M is regarded as the information matrix of the design, and therefore we are maximizing the information while minimizing the generalized variance of the estimate. This is evident from (6.2.14). Most of the other criteria also involve the information matrix and indirectly involve maximizing this matrix. We consider them later in this section.
Example 6.3.1
Consider an experiment with n = 3 with the model y = 8,
+ e2x t e,
(6.3.2)
where e is assumed to have mean zero and variance 1. Suppose for simplicity that, -1 < x < 1. Note that there are at most three design points x l , x2, and x 3 . The information matrix M is obtained from (6.2.12):
(6.3.3)
139
OPTIMALITY CRITERIA
6.3.
The D-optimal design is then easily obtained to be the values of xl, x2, x3 maximizing IMI.That is, x 1 = x 2 = -1
and
or
x3 = 1
x 1 = x 2= 1
and x 3 = -1.
(6.3.4)
That is, either we take two observations at x = -1 and one observation at x = 1 or two observations at x = + I and one observation at x = -1. Both of these designs are D-optimal. We consider the above example for the case in which a linear function of the parameters and d 2 is to be estimated. Elfving (1952) made one of the earliest studies in optimal regression designs, and we discuss below a geometric approach used by him.
Elfvings Method Consider the model y i = elxli
t
i = 1 , 2 , . . . ,k .
O2xZi t ei,
Assume, as before, ei are uncorrelated with variance u2. Let
e = (el, e2)’.
xi = (xIi, xli)’,
Suppose we are interested in estimating the value of a given linear function
ale1 +a2e2 =a’8 at 0 where a=
(:)
is given. An unbiased estimate of a’0 is of the form k
C CiYi, i=l
(6.3.5)
where yi is the average of ni values o f y obtained at each xi, i = 1,2, . . . ,k and ci’s are constants to be determined. The expected value of (6.3.5) is given by
140
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
Or
leading to the equations
C cixli=al, k
i=l
C cixzi = a 2 . k
(6.3.6)
i=l
Let
p i = n i / n , i = 1 , 2 , . . . ,k ,
k
so that
pi>O,
2 p i = 1. i=l
(6.3.7)
The optimum values of ci can be obtained by using the weighted least-square estimates of e l , O2 with weights p i , where we minimize the expression k
C pi(yi - elxli - ezx2i>2. i=l
(6.3.8)
Suppose we use the optimizing criterion where the variance of Z ~ = I cis~ i minimized. Then
Regarding (6.3.9) as a function of p = k l , ... ,pk)' and c = (Cl, c2,. . ck)', we first find its minimum with respect to p , then with respect to c. Now V is minimized for pio =Alcil,
i = 1,2,... ,k,
(6.3.1 0)
where A is a constant that equals l/&,lcil utilizing the constraints (6.3.7). Hence p:
=lcilm, i = 1 , 2, . . . , k .
This problem is solved if we find an optimal co with
(6.3.1 1)
OPTIMALITY CRITERIA
6.3.
141
with constraints (6.3.6). In view of (6.3.1 1 ) the constraints reduce to k
ul =
C i=l
(sign ci)xli = A U ~ , ,
(6.3.13)
k
u2 =
2 ~ p ; ’(sign c;)x2i = A U ~ , .
i=l
Geometrically, the constraints (6.3.1 3) represent vectors in two-dimensional spaces. We donate the vector with components ul, and a2, in Eq. (6.3.13) by a,. The vector a, is in a convex set generated by the vectors fxl, f x 2 , . . . , +xk, as shown in Fig. 6.1. The minimum of V(po, c) is obtained when a, coincides with a. Since A is the ratio of the lengths of the vectors a and a,, the weights pio
Figure 6.1
should be in the same ratio as RQ : PR for i = 1 , 2 , and 0 for other i = 3, . . . , k . The minimum is given by (OTIOR)’ . The above geometrical argument shows that only two sources of experiment are relevant, and they have to be used in the proportion shown above. Suppose now in place of estimating a linear function of O I and 8 2 , we are interested in estimating both O 1 and 0 2 . The estimates of O 1 and O 2 are obtained
142
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
with the help of Eq. (6.2.8). The covariance of ( e l , 8,) is given by the inverse of the matrix M where
(6.3.14) A possible criterion for finding an optimal design may be such that the design minimizes the sum of the variances of 8 and 8,. That is, minimize
V(&)+
(6.3.15)
V(82).
The criterion (6.3.15) is the same as that of minimizing the trace of the covariance matrix D(b) or the inverse of the information matrix M, as seen from (6.2.21). This criterion is generally known as the A-optimality criterion and is formally defined below. llefinifion The design that minimizes the sum of the variances of the components of the estimates 6,that is, minimizes the trace of the matrix D(6), is called the A-optimal design. We shall show later that, in many cases, D-optimality and A-optimality criteria lead to the same optimal design. Elfi ing 's Method (continu ed) Let q = tr(M-').
(6.3.16)
Suppose there is a matrix A such that MA= I. From (6.3.14),
Also, differentiating MA= I, we have aAlapi = - A X ~ X ; ' A ' -(AX;)(AX;)'. ,= From (6.3.16) we have
aq/api =
(tr A) = -tr(Ax;)(AxJ'
= - I A X ; I=~ - k Z .
(6.3.17)
Equation (6.3.17) can be solved so as to obtain the relevant points of the design. Since a quadratic form can be determined by at most three points, there are at most three relevant sources for an A-optimal design. In general, for a model involving s parameters, A-optimal designs require at most s(s + 1)/2 relevant sources.
OPTIMALITY CRITERIA
6.3
143
Special case If in the model (6.3.4), x l i = 1 for all i, we have the simple linear regression model
y i = el + &xi
where, for simplicity, we assume that x 1 < x 2 < * that the convex set generated by the k points +
(’ ) ,
(6.3.18)
+ ei,
...,%(’ *k
XI
. . < x k . Now it is easily seen
)
is a parallelogram. This set is given in Fig. 6.2. If the interest of the experimenter is in estimating 8 2 , that is a = (0, l)‘, it is obvious that 8 2 uses only x1 and x k
-a,
’rk Figure 6.2
and in equal proportions. Again if O 1 is the only one to be estimated, and all xi have the same sign, the extreme ones are again used but in the proportion of x 1 : x k . If xi’s include both positive and negative numbers, then the values of the pi’s are arbitrary, except that k
1 p i x i = 0. i=l A general discussion of the geometrical allocation theory in regression designs has been given by Elfving (1955). Consider again the general linear regression model with V(X,
0) =
e’w,
144
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
as assumed in (6.2.10). The covariance matrix of the estimate of 0 is denoted by D(8). Then the quadratic form (6.3.1 9)
f '(x) D (6) f (x)
gives a measure of the covariance matrix D(8). Let X be the region of x. The optimal design can be obtained by first maximizing (6.3.19) over X and then minimizing the maximum so obtained over the unknown pi's. Such a design is called minimax.
Definition
A design that minimizes the max f'(x) xex
~ ( 8f(x) )
is called a minimax design. Sometimes minimax design is also called a Tchebycheff design. We consider the equivalence of minimax and D-optimal designs later.
6.4
Continuous Normalized Designs
It is not always possible to find designs satisfying the criteria of optimality mentioned earlier. Comparison of experiments, therefore, may not be possible if based on these criteria. It is, however, possible to show relationships among some of these criteria. By consideration of continuous analogs of discrete designs, we show the equivalence of D-optimal and minimax designs. Although continuous designs may not be directly applicable to experimentation, they reflect the approximation of discrete designs with large numbers of points in their spectrum. We assume now that in place of the probabilities p I ,p z , . . . , p n , there is continuous probability distribution g(x) over a given closed region X in which the spectrum of the design varies. That is, we have
/ d E ( x ) = 1,
0 < [(x)
< 1.
(6.4.1)
X
The information for the experiment E will be taken as
M(E) = S h ( x ) f(x) f'(x) dg(x). X
(6.4.2)
A(x) expresses the continuous weight functions and corresponds to weights
Wi
used in weighted least-squares estimation. When g(x) is an absolutely continuous
6.4.
145
CONTINUOUS NORMALIZED DESIGNS
distribution, it has a probability density function given by p(x). Equations (6.4.1) and (6.4.2) reduce to
s
X
and
p(x)dx= 1 ,
M(E) =
s
X
p(x)>O
h(x) f(x) f’(x)p(x) dx.
(6.4.3)
(6.4.4)
The problem of finding an optimal design is reduced to finding p(x) such that some function of M(E) is optimized. Certain properties of the information M(E) are studied first. The optimality questions will then be taken up. First we prove the following theorem. We assume h(x) = 1 in what follows, although the results hold for all h(x).
Theorem 6.4.1 (i) (ii) (iii) (iv)
(v)
For any experiment E ,
M(E) defined by (6.4.4) is a positive symmetric semidefinite matrix. IM(E)I = 0 if the spectrum of the design contains points less than the number of parameters in the model. The set of all information matrices of all possible designs is convex. log /M(E)) is strictly concave. For any design E , the matrix M(E) has a representation
Proof (i) The symmetry follows from definition (6.4.4) of M(E). TO show positive semidefiniteness, consider the quadratic form z‘M(E)z for any vector z given by z’M(E)z = which is equal to
/
s
z’f(x) f‘(x) zp(x) dx,
[Z‘f(X)]* p(x) dx 2 0.
(6.4.5)
(6.4.6)
(6.4.6) shows that M(E) is positive semidefinite. (ii) In the case of a design with finite spectrum having n points, we have n
M(E) =
1 pif(xi)f’(xi). i=l
146
V1.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
If the number of parameters to be estimated is m > n, the matrix M(E) is then m x m and its rank is equal to or less than m. Hence the determinant of M(E) is
zero. The argument carries over to the continuous case as well. (iii) Consider the design E such that €=(YE1
and
+(1-(Y)E2,
O
and e 2 are designs with respect to probability distributions t,(x) and
[ z (x), respectively. The design E has probability distributions
(Ytl(X)+(l - a ) t z ( x ) . Therefore,
M(E) = M((YE~+ (1 - ( Y ) E ~ ) =
/fW
f’(x)dbt,(x)+ (1 - 4 t z ( x > l ,
X
=(Yjf(x)f ’ ( X ) d t l ( X ) t ( l - 4 /f(x)f’(x)dt2(x), or
X
X
M(E) = (YM(E~)+ (1 - (Y) M ( E ~ ) ,
(6.4.7)
and hence the set of information matrices is convex. (iv) To show concavity of log IM(E)I, we proceed as follows. It is well known that for any positive definite matrices A and B of order n and 0 < (Y < 1. ~ ( Y tA( l - a ) B I >IAlalBI1-OL.
(6.4.8)
Consider the case when the information matrix M(E) is positive definite. Assume that
M(E)=(YMI(E)+(~-(Y)M~(E),
Mi Z M 2
Then
IM(E)I
> IMi IaIM2l‘-OLY,
so that log IM(E)I > a log IMI I + (1 -a) log IMzI, showing concavity of log IM(E)I. (v) Since M(E) is symmetric of order m , it can be represented by a vector in m(m + 1)/2 dimensions. Also from (iii), the set of vectors defining the information matrix M(E) is the closed and convex set consisting of vectors corresponding to information matrix M(E(x)), where the spectrum of design E ( X ) contains a single point x. Now a point of a convex set in (m t l)m/2 dimensions
147
CONTINUOUS NORMALIZED DESIGNS
6.4.
can be written as a convex linear combination of at most m(m extreme points. This proves the property (v).
+ 1)/2 + 1
Now we consider some relations among the optimality criteria defined earlier. For example, we show below that D-optimal and minimax designs are equivalent when we restrict our attention to continuous designs. In many cases, an equivalence of this type allows us to find optimal designs by simple and elementary methods. Some of the earliest results in this area were obtained by Kiefer (1961). Theorem 6.4.2 E , gives a D-optimal design for the linear model (6.2.1 1) if and only if E , also minimizes niax X
J' f ( x ) f ' ( x ) p( x ) dx.
Proof Suppose first that E , is D-optimal. That is, it maximizes IM(e)I, where M(E) is the information matrix of the design E . Suppose €1
=(l-a)E,
+a€,
o < a < 1.
Let log IM(~1)I=logIM[(l-a)Eo+aE)]l=log laM(~)+(l-a)M(~O)l. Since d da
- log IM(EI)I = tr
we have d - log IM(el)I = tr M-'(E) [M(E) - M(EO)]] do
For a = 0, we have d -
da
1
log M ( E ~ )
LY=O
= tr [.-I
(E,,) M(E)] - rn.
Since E , is D-optimal,
Assuming that there is at least one point in the spectrum of the design, we find that tr [M-'(E,) M(E)] - m < 0,
(6.4.9)
148
V1.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
or tr[M-'(Eo) f(x) f'(x)] -rn GO,
(6.4.10)
or f'(x)M-'(Eo) f(x)-rn GO. On the other hand
s
f'(x)M-'(E)f(x)p(x)dx=a* p(x)dX=a*=rn, X
where a* is such that min f'(x)M-'(E)f(x) X
< a* < max f'(x)M-'(E)f(x), X
so that
max f'(x)M-'(E)f(x) X
2 rn.
(6.4.11)
Therefore, (6.4.10) and (6.4.1 1) show that e0 is a minimax design. To show that the converse is true, suppose that € 0 minimizes max f'(x)M(E)f(x), X
and e0 is not D-optimal. Then there is a design E such that tr [M-l(e0) M(E)] - rn > 0
(6.4.12)
from (6.4.9). By Theorem 6.4.1 (v), we notice that any design E can be represented by a set of [m(m t 1)/2 + 11 designs M(e(xi)). We may consider that E consists of a finite number n of points. Hence tr[M-'(eO)M(~)] -rn =
n
1 p i f'(xi)M(EO)f(xi)-rn. i=l
But e 0 is minimax and hence f'(x)M-'(EO)f(x)
< rn.
(6.4.13)
The inequality (6.4.13) shows c p i f'(xl)M-'(Eo)f(xi)GrnC
pi-rn = O
and hence a contradiction. Therefore, e0 is D-optimal.
6.4.
Corollary
If
€0
CONTINUOUS NORMALIZED DESIGNS
149
is a D-optimal or a minimax design, max f’(x) M-l(e0) f(x) = m . X
Many other illustrations have been developed by Kiefer and Karlin and Studden based on the equivalence of the D-optimality and minimax optimality criteria. The minimax criterion is a reasonable one when the range of the level of x is given and the experimenter wishes to minimize the worst that can happen in his ability to predict over this range. This criterion does not provide a good procedure for extrapolation. The D-optimality criterion in which the generalized variance of the estimate is being minimized may prove to be a poor criterion, according to Chernoff (1972, p. 37). There seems to be no meaningful justification for D-optimality, since it abandons to the vagaries of the mathematics of the problem the scientist’s function of specifying the loss associated with guessing wrong. The invariance property of D-optimality under nonsingular transformations of the parameters disguises its shortcomings. Since a D-optimum design minimizes the generalized variance, under assumptions of normality it minimizes the volume of the smallest invariant confidence region of 0 = (el, . . . ,OS)’ for a given confidence coefficient. It follows from the result on type D regions discussed in Section 5.7 that for given variance, a D-optimum design achieves a test whose power function has maximum Gaussian curvature at the null hypothesis among all locally unbiased tests of a given size. The reader may further refer to Kiefer (1959) for details. It should also be remarked here that the equivalence of D-optimal and minimax designs does not hold when the consideration is restricted to discrete designs. We will have occasion to refer to this problem in a later section.
Cn‘terion of Linear Optimality
A general criterion of optimality of designs has been developed by Federov (1972). Let L be a linear functional defined on the set of positive semidefinite matrices such that L ( A ) 2-0 whenever A is positive semidefinite. Definition A design E is called linear optimal, or L-optimal for short, if it minimizes L [ D ( 6 ) ] ,where D(6) is the covariance matrix of the parameter vector of the model.
The generalization of Theorem 6.4.2 to linear optimality has been made by Federov. The proof follows the same variational technique used in the proof of the theorem. We state the results formally in the following theorem.
150
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
Theorem 6.4.3 The following conditions are equivalent: (i) (ii)
E' minimizes L [M-'(E)]. eo minimizes max L[M-'(e)f(x)f'(x)M-'(e)].
(iii)
L[M-'(e0)] = max [M-'(E)f(x)f'(x)M-'(e)].
X EX
x EX
Further, the set of designs satisfying (i), (ii), and (iii) are convex. Since the trace of a matrix satisfies the conditions of the linear functional L, A-optimal designs obtained by minimizing the trace of the dispersion matrix are special cases of L-optimal designs. It is easy to see that tr (A
+ 9) = tr A + tr 9,
tr ( k A ) = k tr A, and tr (A) 2 0 if A is a positive semidefinite matrix. Therefore trace is a linear functional. Linear optimality also extends to A,-optimality, where only 1 parameters out of a given m(1 < m ) are of interest to the experimenter and the optimal criterion requires the minimization of the sum of the variances of these 1 estimates. In the next section we discuss criteria of local optimality. When the model considered is nonlinear, the criteria of optimality discussed so far involve the parameters to be estimated. Hence we need other criteria so as to remove the dependence on the parameters. One general approach is to consider asymptotic theory and the Fisher information matrix in the neighborhood of the known value of the parameters. Use of variational techniques in such cases will be discussed in the next section. 6.5
Locally Optimal Designs
So far we have considered optimal designs for the assumed linear model. In the case of nonlinear models, it is not possible to arrive at the common criteria of optimality discussed so far. Since simple estimates for the parameters cannot be obtained, the study of the covariance matrix of the estimates is out of question. However, designs that are optimal for a given value of the parameter or in the small neighborhood of the parameter are possible. Such designs are generally known as locally optimal. In this section we discuss a few simple problems and obtain locally optimal designs for them. Variational methods can be usefully employed in studying the asymptotic theory of such designs. Elfving's geometrical technique discussed earlier is also variational and can also be used in obtaining locally optimal designs. Feder and Mezaki (1971) have given direct variational approaches to obtain locally optimal designs for studying a
6.5.
151
LOCALLY OPTIMAL DESIGNS
variety of problems. We restrict our approach to D-optimal designs only; however, the approach can be extended to other optimality criteria. Consider again a nonlinear regression model (6.5.1) Yi= q(Xi, 8 ) t ei. We assume as before that the errors ei’s are uncorrelated and have the same variance 0’. 8 has k components e l , . . . ,ek and let q(x, 6 ) be nonlinear in general. The least-squares estimates for 8 can be obtained by minimizing n
(6.5.2)
In the Appendix to this chapter we show that the normal equations obtained by equating to zero the partial derivatives of (6.5.2) with respect to e l , . . . , Ok do not lead to explicit solutions, and therefore the estimates cannot be given explicitly. See, for example, Eq. (6.A.3). However, it is possible to obtain the asymptotic variance and covariance matrix of the estimates. We denote the derivatives of q(Xi, 6 ) with respect to 8 at B = Bo by. (6.5.3) i = 1 , 2, . . . , n, j = 1, 2, . . . ,k. Let the matrix of the partial derivatives gV(B0)
be denoted by Xo. Under fairly general conditions, it is well known that the distribution of the least-squares estimates of B is k-dimensional multivariate normal with mean d o and covariance matrix a2(XOXi)-I, where u’ is the common variance of ei. When the matrix XoXd is singular, generalized inverse or other perturbation techniques could be considered. The designs that minimize the determinant of the matrix X o X i are called locally D-optimal. In what follows we consider locally optimal designs for both linear and nonlinear models. The classical variational techniques are used in obtaining the locally optimal designs. The discrete designs are first reduced to continuous cases, and the variational method is then utilized. Example 6.5.1
Consider the model such that q(x, e)=el + e , ( X - g ,
OGXG
1.
Here k = 2. Then the determinant of the matrix X o X i in this case is given by
(6.5.4) Locally optimal designs are obtained by maximizing (6.5.4) with constraints OGXl G . * . < x , < 1.
152
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
Consider a continuous analog of the discrete design as follows. Let x(t) be a right continuous and nondecreasing function with 0 < x(1) < 1, and let xi =x(i/n). The function x(t) here plays the same role as a distribution function on (0, 1). Now (6.5.4) can be written as
Utilizing the continuous transformation, we have the approximate value of (6.5.5) given by I
1
(6.5.6) The optimization problem reduces to finding a right continuous function x(t) on (0, 1) such that it maximizes (6.5.6). The existence of a maximizing function xo(t) is guaranteed by the continuity of the transformation from x(t) t o J(x), where
J(x)=
j
x’(t)dt -
0
(/
x(f)dt)2.
(6.5.7)
0
It can be verified that the set of points J(x) as x varies over the class of right continuous nondecreasing functions on (0, 1) is closed and bounded by results shown in Chapter V. We now utilize variational methods directly to characterize the maximizing function xo(t). The proportion of observations such that x(t) < y for 0 < y < 1 can be regarded asymptotically equivalent to sup(t : x(t) < y). Let
S6 = { t : 6 < xo(t) < 1- 6 )
for
6
>o
and let f ( t ) , be a bounded function on (0, 1) such that
l(t)=O
ift@S6.
(6.5.8)
Consider a variation of the functionxo(t) for sufficiently small E in the form xo(t) + E r ( t ) . The details of such an approach are discussed in Chapter 11, where we derived the Euler-Lagrange equation using variations. Let
The derivative of
is given by
@(E)
I 1
a'(€)= 2 or
1
=2
I[
J
t ( t )d t ,
0
0
1
I 1
[ x o ( ~+) f t ( t ) l d t
[xo(t)+ ~ E ( t )[l( t ) d t - 2
0
@I(€)
153
LOCALLY OPTIMAL DESIGNS
6.5.
1
x,(t) t E
m )-
0
xo(s) t E [ ( S ) ds]C(t) dt.
0
Since x o ( t ) is the solution of the problem, the maximum of @ ( E ) is attained at = 0. Therefore,
E
@TO) = 2
j[ j
x0(s)ds] [ ( t )d t = 0.
x o ( t )-
(6.5.9)
0
0
Now (6.5.9) is satisfied for all [ ( t )and, since [ ( t ) = 0 when t does not belong to the set Ss as assumed in (6.5.Q we have
or
xo(t)<6
or
xo(t)2 1-6,
I 1
xo(s)ds = 0.
x o ( t )-
0
Since 6 is arbitrary, we have
x o ( t )= 0, 1 , or k.
(6.5.10)
The above argument shows that a necessary condition that xo(t) gives a maximum of (6.5.7) is that, for some constants hl and hz, xo(t) has the form
xo(t)=
i
0,
O
k,
hi
1,
hl t A , < t < l .
< t < hi +hz,
(6.5.1 1)
The constants A , , h2 can now be determined by maximizing J[xo(t)] when expressed as a function of A t and h 2 . Notice that (6.5.1 1) gives
J[xo(t)] = k 2 h z+ ( l - h I
-h2)-
[khz + ( I - h l - h 2 ) ] '
W/ak = 2khz - 2hz [khz + 1 - hi - A2 ] = 2 x 2 [k - khz - 1 t hi t hz] = 2hz [A 1 - (1 - k ) (1
-
A,)].
154
V1.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
Since 6J/6k < 0 when k = 0 and 6J/6k > 0 when k = 1 , the maximum of J [ x o ( t ) ] occurs when k = 0 or k = 1. When k = 0, J [ x o ( t ) ] = (XI t X2)(1 - X I and it is maximized when A, t X2 = f . When k = 1, J[xo(t)I = h i ( l - h i ) and it is maximized when A, = 4. Hence h2 = 0 and, from (6.5.1 l), we have the following form of x o ( t ) . (6.5.12) The maximizing function x o ( t ) shows that for large n, the optimal design consists of assigning half of the observations at x = 0 and the other half at x = 1. Elfving’s method discussed earlier also gave the same result. Since x o ( t ) acts as a cumulative distribution function, the results of geometry of moments discussed in Chapter IV may also be applicable to such problems. The next problem concerns a nonlinear model.
Example 6.5.2 Let
The locally optimal designs are obtained by maximizing the information matrix obtained from the partial derivatives of 77 with respect to 6 and 0 2 . (6.5.13) (6.5.1 4) Expressions (6.5.13) and (6.5.14) are evaluated at 8 = do and they are then utilized in obtaining the information matrix X d X o . To simplify notation, we do not exhibit the subscript 0 further.
6.5.
155
LOCALLY OPTIMAL DESIGNS
Again we consider the continuous analog of (6.5.15), as in Example 6.5.1, by defining a right continuous nondecreasing function x(t) so that (6.5.16)
The determinant IX'XI reduces to
%{ (32
xx4(;)xx2(;)-[L-x3(;)]2}.
The continuous approximation makes the determinant IX'XI = ((3 12/(3?) J[x(t)] , where
4x0)) =
/
j
x2(t) dt
X4(t)
dt
0
0
-
[
s
&I2.
X3(t)
0
(6.5.17)
Let the maximizing function be xo(t) and, following along the lines of Example 6.5.1, we see that xo(t) is given by equating to zero the derivative of J[xo(t) + E t Wl,
where t ( t ) and E are defined exactly as in Example 6.5.1. We then have 1
1
0
1
1
0
0
0
(6.5.18)
Therefore, xo(t) d 6 or xo(t) 2 1 - 6 or
1 1
4x03(t)
xo(s) ds + 2 x o ( t )
0
1 1
1
xO4(s)ds - 6 x 0 2 ( t )
0
That is, there exist constants Ao,
A1
xo3(s) ds = 0. (6.5.19) 0
,A2 and a l ,a2 such that O
where y l , y2 are solutions of (6.5.19), that is, solutions of (6.5.21)
156
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
and
1 1
ai =
x’o+Z(s)ds,
i = o , 1,2.
0
The solution (6.5.20) is exhibited in Fig. 6.3. For x o ( t ) defined in (6.5.20),we have J[xO(t)]=y1z7:(’YZ-71)z
hlhZ +712 (1-r1)2hlh3+Y: ( l - 7 2 ) ’
Xo+X1+X2
hZh3. (6.5.22)
1
Figure 6.3
Maximizing J [ x o ( t ) ] with respect to y l , yz for fixed ho = 0, h l , and h z , we have additional constraints provided by Eq. (6.5.21) as 2y,2a0 - 3Y1a1+ a2 = 0 ,
2y2Zao - 3yzal t a2 = 0 ,
giving (TI -yz)[2ao(y1
+72)-3a11
=o.
If 71 = yz = y, x o ( t ) has the form (6.5.23 )
andJ[xo(t)] = yz(l - y)’h(l - A), so that y = h = f and we have J [ x o ( t ) ] = 1/64.
(6.5.24)
6.5.
157
LOCALLY OPTIMAL DESIGNS
If y I # 7 2 , a numerical search can be made. The solution still turns out to be of the same form as in (6.5.23) and y = h = 4. The interested reader is referred to Feder and Mezaki (1 97 1) for further details. The direct optimizing technique using the notions of the classical theory of calculus variations can be applied routinely to other nonlinear models. Problems in which functions of more than one variable are involved can also be treated in the same way. The following model occurs in the study of the kinetics of reactions catalyzed by solids and provides an example for the case of several variables. The model for this problem gives the function ~ ( x 8, ) as V(X,
e ) = cele2u~/(it e,u + e24'
where e l , O 2 > 0, x = (u, u ) ' , 8 = (el, 02)'. To obtain locally optimal designs for this nonlinear function, we introduce two functions u o ( t ) and u o ( t ) so as to maximize the corresponding measure of information. The complete solution in this case requires numerical methods and is not pursued here further. Similar nonlinear models can also be treated alternatively by the geometrical method of Elfving. Chernoff (1953) generalized the results of Elfving in a paper on locally optimal designs for estimating parameters. The theory has recently been applied to some practical problems of accelerated life testing by him (see Chernoff, 1962). An example illustrating the theory is discussed here. A random variable T has an exponential probability density function f ( t ) when
The mean of the random variable T is p and variance p 2 . The exponential distribution has been extensively used as a model of the lifetimes of many mechanical or electronic devices. For studying accelerated life tests, an assumption can be made that a device may have a lifetime with mean p(x)= i/(e,x t e 2 x 2 ) ,
o<x<x*,
the quantity x to be selected by the experimenter. It is also assumed that the cost of experimentation at level x is C(X) = c/(e,x
t
e2xZ)
for some given constant c. With the above assumptions, the lifetime T has the probability density function f ( t : 8, X) =
(0,
(Blx
t
e2x2)exp [-(e,x
t
e2x2)t],
o
00,
elsewhere.
(6.5.25)
158
V1.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
The Fisher information matrix for the above problem reflects the asymptotic “goodness” of a design where the information matrix is given by
I(x) =
+(a2 +(a2
+-(a2 iogfiae, ae,)
iOgf/ael2)
log f/ae,
- ~ ( a , log fiao,’)
ae2)
1.
(6.5.26)
Chernoff (1953) also showed that if it is desired to estimate a function . . . , 0,) when the distribution of the data involves r parameters 8 1 , 0 2 , . . . ,O r , there is an asymptotically optimal design that involves repeating at most r of the available experiments. The optimality here is considered in the sense that for a large amount to be spent on experimentation, the asymptotic variance of the maximum likelihood estimate 8 of 8 = (el, . . . ,0, )‘ is the least that can be obtained by any design combined with any estimation technique. Since there are two parameters in our problem, we require observations at most at two stress levels of x. Further, if the experiments are repeated n , , n 2 , . . . , n, times with information 1 1 , . . . , 1, , then the total information is given by
g(€J,,0 2 ,
A = n l I + n212 + .
. . + nsIs.
Let g’(e) = (ag/ae,,
. . . , aglae,.)‘.
Then, it is well known that the asymptotic variance of the maximum likelihood estimate 8 is given by g‘(8)
‘4-l
(6.5.27)
g@).
In our case, we are interested in estimating
xoel +~,33,. Hence
g’w = ( ~ 0 7xgz)’ = (al, a,)’. A simple calculation leads to the information matrix I(x) =
1 (ex
+ e,X2)2
x2
x3
x3
x4)
(
.
(6.5.28)
The information “per unit cost” can be taken as
(6.5.29)
J(x) = I ( x ) / C ( x ) .
Hence J(x) =
1
x2
x3
159
LOCALLY OPTIMAL DESIGNS
6.5
Let (6.5.30)
y 1 =x/[c(e,x t e z ~ Z ) ] 1 / 2 r
and (6.5.31)
Yz =xy1.
Thus with the new scales, we have
Jb)=
("
Yzz
YlYZ
We now apply Elfving's geometrical method in solving the above problem. This requires drawing the graph of points bl,y z ) and then obtaining the resulting convex set. The solution is obtained by drawing a vector from the origin through the point ( a l , az)' = (XO, XO')'. A numerical example of Chernoff is given below for illustration. Let 0 = 0.8, O2 = 0.32, xo = 1.2, c = 2, and x * = 25. We then have X
I(
= [2(0.8x t 0 . 3 2 ~ ~ ) ]=' /0.8(u ~ t u2)'/' XZ
y2 =
[2(0.8x
+ 0.32x2)]'/2
--
'
UZ
0.32(u + u z ) ' p '
using a slightly modified scale to represent (yl, yz) and (-yl, -yz) as functions of u = (0, /e ) x = 0.4x, so that 0 < u < 10. That is, we use the coordinates U
(u
u2)1fl
=
y, = 0.8y1,
and
where u varies from 0 to 10. The points ( 0 . 8 ~, 0~. 3 2 ~ are ~ ) plotted in Fig. 6.4. The values of u can be transformed back to the values of x. The convex set is obtained by drawing a tangent from u = 10 in the third quadrant to the curve in the first quadrant and from u = 10 in the first quadrant to the curve in the third quadrant. Now, to obtain the optimal design, we take the line from the origin through the point (xo, xoz). This line has slope xo given by uo = 0.48 in the u-scale. The details are shown in the inset in Fig. 6.4. Clearly the solution of the problem shows that only two stress levels are used. These correspond to the values u* = 10 and u** x 1.2 (the point of contact of the tangent with the curve). These points correspond to levels of x given by x * = 25 and x** = 3. Also, a very small part of the allocated expense (about 4%) should go to x*.
Figure 6.4 [Reproduced from H. Chernoff (1962), “Optimal Accelerated Life Designs for Estimation,” Technometrics 4, No. 3, p. 387, with permission of the publisher.1 160
6.6.
161
SPLINE FUNCTIONS
Spline Functions
6.6
In many applications, functions are commonly approximated with the help of step functions or piecewise linear functions. Similar approximations made in terms of higher degree polynomials lead to the theory of spline functions. A spline is a mechanical device used by draftsmen for drawing a smooth curve. The device consists of a flexible rod with attached weights so that it can be constrained to pass through a given set of points. Spline functions have found useful application in the theory of optimal regression experiments (Studden, 197 1). They have also been successfully employed in the theory of optimal control and other areas of mathematics. A brief account of spline functions is provided in this section. For more exhaustive treatments, see Greville (1 969), Schoenberg (1 969), and Rivlin (1 969). Definition if for
A function s(x) defined on the real line is called a spline function --oo
= C;"
< C;, <. . . < C;, < -oo = [ , + I ,
it satisfies the following properties: (i) (ii)
On every interval (ti, i = 0, 1, 2, . . . ,n, s(x) is a polynomial of degree < m ,a given integer. The jth derivative of s(x) is continuous for j = 0, 1, 2 , . . . ,m - 1 , except when m = 0.
Dej-inirion
The points [ I ,
C;2,
. . . , C;, are called knots of the spline s(x).
It is easy to see that the spline function of degree 0 is a step function. The spline function of degree 1 is a polygonal curve. In curve fitting by least squares, a spline function may be preferable to a polynomial involving the same number of parameters under many circumstances. A simple representation of a spline function with knots t I , t 2 ,. . . ,C;, can be made in terms of a polynomial p ( x ) of degree less than or equal to m and truncated functions f + ( x ) . The truncated function is defined as follows:
f+W=
otherwise.
A spline s(x) can be represented formally in terms of a polynomial p ( x ) of degree m and knots E l , t 2 ,. . . , [, as follows: (6.6.1)
162
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
where ci are given constants. That is, m
n
j=O
j=O
Notice that s(x) is determined by at most m
i- n
+1
constants uj and ci.
Definition A spline function is called natural if it is of odd degree n = 2k - 1 and is given in terms of a polynomial of degree k - 1 on the intervals (Goo, I' I 1 and ( E n , -1. Natural splines are needed in many applications. It can be seen from the definition of a spline that a natural spline o f degree 2k - 1 has the representation n
1 Ci(X - I i=l
s ( x ) = p ( x ) i-
where p ( x ) is a polynomial of degree equations
c cixi=O, n
i=l
-
' i y ,
1 and the coefficients ci satisfy the
j = O , 1 , 2, . . . , k - 1 .
It can also be proved that there is a unique natural spline that interpolates the data (XI, Yl),
. . . > (XflY,)
having knots at X I , x2, . . . , x , . In applications in which smooth functions are needed for interpolation, one considers the criterion o f minimizing u [ g ( x ) ] where b
(6.6.2) In the Theorem 6.6.1, we prove that a spline of degree at most k is needed to minimize (6.6.2). Let the class of functions on (u, b ) whose kth derivatives are step functions be denoted by Ck-'(u,b). Lemma 6.6.1 Let s(x) be a spline of degree k and f ( x ) be a function with the following properties: (i)
f ( x ) E Ck-'(u, b ) and f k = [ d k f ( x ) / d x ] is continuous in each open interval (ti,t i + i = 1 , 2 , . . . ,n, with t o= a and En+ = b . (ii) f k - r - l ( ~ ) ~ ( ~ ) k0+, rr==0, 1 , 2 , . . . , k - 2, f o r x = u a n d x = 6. (iii) f ( u ) sZk-'(u - 0) = f ( b )sZk-l(b t 0) = 0.
Then
J'
163
SPLINE FUNCTIONS
6.6.
f k ( x )sk(x)dx = (-l)k(2k
a
Proof Integrating by parts the integral b
r
(6.6.3) we have,
s
c (-1y [fk-'-'(b)
k-2
fk(X) Sk(X)
a
dx =
r=O
1
sk+r(b)
b
(a) #+'(a)] t (-1)k-1
-fk-r-l
(6.6.4)
f ' ( x ) s Z k - l ( x ) dx.
a
The summation in (6.6.4) vanishes due to hypothesis (ii) of the lemma. Since knots of s(x) and sZk-'(x) are the same, we have
(6.6.5)
i =O
a
where qi is the constant value of sZk-'(x) on (ti,ti+]), i = 0, 1 , . hypothesis (iii) of the lemma, the right hand of (6.6.5) reduces to
. . ,n. Using
n
(6.6.6)
Differentiating Eq. (6.6.1) with respect tox, (2k - 1 ) times, we have S ~ ~ - ' ( C ; ~ + O ) - S ~ ~ - ~ ( C ; ~ - - O ) = ( ~ ~ - ~ ) ! Ci ~= , 1 , 2 , . . . , n .
(6.6.7)
Using (6.6.6), (6.6.7), and (6.6.4), the lemma is proved.
Lemma 6.6.2 If s(x) is a natural spline with k 2 1 and if
f(x) E Ck-'(a, b ) , i = 1 , 2 , . . . , n, then
such that f k ( x ) is continuous in each interval (ti,
J'
a
f k ( x ) s k ( x )dx = (-l)k(2k
-
l)!
?
i=l
cif(&),
(6.6.8)
164
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
and, if f ( x ) = 0 at every knot of s(x),
(6.6.9)
fk(X) sk(x) dx = 0. a
Proof For natural splines, the conditions (ii) and (iii) of Lemma 6.6.1 are satisfied and hence (6.6.8) follows from Lemma 6.6.1. Equation (6.6.9) follows directly from (6.6.7) since f(ti)= 0 for i = 1 , 2 , . . . , n. Theorem 6.6.1 For a < x1 < x 2 < . . . < x, be a natural spline interpolating data points (xlul),
< b,
and 1
< k < n , let s(x)
. . . (Xn, Y n ) . 9
(6.6.10)
Let f ( x ) be any other function interpolating the data points (6.6.10) such that f ( x ) E Ck-'(a, b ) and f k ( x ) is piecewise continuous; then u[s(x)l
df(x)l,
with equality only for s(x) = f ( x ) , where u is defined in (6.6.2).
Proof Using f ( x ) - s(x) in place of f ( x ) in Eq. (6.6.9), we have, from Lemma 6.6.2, s k ( x ) [ f k ( x ) - s k ( x ) ] dx = 0. a
Now u [ ~ ( x )=] u [ ~ ( x ) - s ( x ) + s ( x ) =] u [ ~ ( x ) - s ( x ) ]+ u [ s ( x ) ] (6.6.11)
by definition. Since the first term in (6.6.1 1) is nonnegative,
u[f(x)l 2 o [ s ( x ) l . The equality also follows from (6.6.1 1) when s(x) = f ( x ) .
Remarks (1) Splines are the least oscillatory functions for interpolating data. In particular the above theorem shows that a differentiable function f ( x ) such that f ( x i ) =yi,i = 1 , 2 , . . . , n , and which minimizes b
a
is a cubic spline (rn = 3) of the form (6.6.1) with knots at xl, . . . , x, and is linear below x and above x, .
OPTIMAL DESIGNS USING SPLINES
6.7.
(2) 1,
165
In using the functions X,
x',
. . . ,xm,
( ~ - [ l ) + ~ ( ~,- [ 2 ) + " ' ,
. . . , ( x - [ ~ ) + " ' (6.6.12)
we may be interested in knowing for which set of values of XI,XZ,.
. . ,Xn+m+1
one can interpolate an arbitrary set of values
Y I ? .. . rYn+m+l using a unique linear combination of functions in (6.6.12). This is the case if and only if X~<[~<X,,+~+~,
i = 1 , 2 , . . . ,m.
(6.6.1 3)
The inequalities (6.6.13) say that we cannot have too many xi's in any given interval. It means that when n = 2, m = 1 and we use 1, x,
(x-t1)+
(x-E2)+,
the inequalities (6.6.13) tell us that xI, x2, x3, x4 must be such that <[1<x3,
6.7
X2<[2<x4.
Optimal Designs Using Splines
The most commonly used model in regression experiments is linear in the parameters, and the functions involved are generally polynomials. Models involving splines may be more appropriate in many applications. In practice, some situations may make it necessary to use splines. Application of the theory of spline functions in optimal design of experiments has recently been made by Studden and Van Arman (1 969) and Studden (1 97 la). Consider first the case of polynomial regression. We assume that the regression function has the form i=l
(6.7.1)
The model (6.7.1) is a special case of the general model discussed earlier in (6.2.10). Consider a continuous experiment described by a probability distribution function [(x) defined on the range of experimentation, which for simplicity we choose again to be [-1, 11. The information matrix of the experiment E for the model (6.7.1) is obtained as
166
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
where Mii ( E ) =
i?
-1
x i+i-2 d g (x).
(6.7.3)
That is, the (i, j)th element of the information matrix is the (i + j moment of the distribution g(x). We denote the rth moment by pr ( E ) .
+ 2)nd
Definition An experiment with information matrix M ( e 1 ) is said to be better than c 2 with information matrix M(E) if M ( E ~ ) M(e2) is nonnegative definite . Recall that a matrix A being nonnegative definite implies that for every vector t , t'At > 0. The admissibility of a design will be defined now.
Definition An experiment E is admissible if there is no other experiment which is better. Otherwise E is called inadmissible. The term experiment and the distribution g(x) are used synonymously in our discussion. We give, in the following, a characterization of optimal admissible designs for polynomial regression, due to Kiefer (1959). The results can be extended to spline regression, such as studied by Studden and Van Arman (1969). The problem of finding admissible designs is reduced in this case to optimization in moment spaces, and therefore we use results discussed in Chapter 1V. Consider a vector t' = ( t l ,t z ,. . . , t k + l )with k + 1 components. Let rl = 1, rfll = u , and all other components be zero. Since, from (6.7.2), the matrix WE)
= (I-1i+j-2(4),
we have for the above vector t , (6.7.4)
t'M(E)t = ~ u / J ~ ( E+ )u ' 1 - 1 2 r ( ~ ) ,
since ~
O ( E)
= p O ( e 2 )= 1. Therefore,
t' [M(E I )
- M ( ~ 2 1 I f=
2~ ( P r ( E 1)
- Ur(EZ
)) + u2(1-1 2 r ( ~ 1 -I-1 )
2r (€2)).
Assuming that M(eI) - M ( E ~ is ) nonnegative definite, we have
>
2 u [ 1 - 1 r ( ~ l ) - ~ r ( ~ 2+)u2 1 [ 1 1 2 r ( ~ 1 ) - ~ 2 r ( ~ 2 ) 10
(6.7.5)
for all u . The inequality (6.7.5) then gives I*r(EI) = 1 - 1 r ( ~ 2 )
(6.7.6)
for r = 0, 1, 2 , . . . , k . Similarly, taking the vector t such that t4+1 = 1 and rs+l = u , with all other components equal to zero and q < s, we have 1-1r ( 6 I ) = 1-1r ( € 2 )
(6.7.7)
6.7.
for r = 0, 1 , 2 , . . . , 2k
-
167
OPTIMAL DESIGNS USING SPLINES
1. Therefore, t' [M(E~) - M ( E ~ )t ]> 0
if and only if That is, the condition of admissibility of an experiment E can be obtained by finding an experiment E such that p Z k ( e )is maximum while its first (2k - 1) moments are specified. Using results of Theorem 4.5.6, we find that for such an experiment the distribution function [(x) has k + 1 jumps, including jumps at the points -1 and 1. We state the result formally in the following theorem. Theorem 6.7.1 The design with distribution function [(x) for the polynomial regression model in (6.7.1) defined on [-1, 11 is admissible if and only if in the interior of the interval [-1, 11, E(x) has at most k - 1 jumps. Consider the case of the regression model in which splines are used. That is, the regression functions are, for example, of the forms 1,
X,
x',
. . . ,x",
(X
- [i)+",
(X
- [i)+",
1
..
9
(X
- ti)+",
i = 1,2,. . ., h , with h knots [ 1 , [',
. . . , Eh
so that
-1 = Eo < .$I
< t z , . . . < Eh < Eh+l = 1. 7
The regression functions with the above splines give a polynomial of degree n on each of the intervals (ti, i = 0, 1, . . . ,h . From Theorem 6.7.l,,we have that the admissible spline design has at most n - 1 jumps in the interval (ti,Ei+i), i = 0, 1 , 2 , . . ,h . When n = 1, the regression function is linear on (ti,ti+,)and continuous at ti.That is, we have the set of regression functions 1, x,
(x - Ed+, (x - E d + ,
..
* >
(x
-
[h)+.
In this case the admissible design does not have a single jump in the interior of the interval (ti, except possibly at the end points .$i and ti+,. Therefore, the possible jumps are at the points -1,
1'1,
b , ..
*
,Eh,
1.
The reader interested in the general discussion of spline regression would find the paper by Studden and Van Arman (1969) very illuminating. The formulas for allocation of observations at given points can be obtained in terms of Lagrange interpolating polynomials. An exhaustive treatment of the problem is given in the books by Federov (1972) and Karlin and Studden (1966).
168
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
Appendix to Chapter VI Assume that
yi = q(xi,e) + Ei,
(6.A.1)
where q(Xi, 0 ) is a known function of Xi and 8 such that
Xi = (Xli, Xzi, . . . ,Xki)’,
e =(el, ez, . . . ,ek)’,
i = 1 , 2 , . . . ,n.
We assume again that the errors E ~ ’ Sare uncorrelated with the same variance u2. It is generally assumed for many purposes of statistical inference that the errors E = (E . . . ,en)’ have normal distribution with means 0 and covariance u2I, where I is the identity matrix. The least-squares estimates are obtained by minimizing the sum of squares (6.A.2) The k normal equations are obtained by differentiating (6.A.2) and equating them to zero (6.A.3) j = 1,2, . . . ,k. Obviously these equations cannot always be solved explicitly.
There are many approximate methods of solving the system (6.A.3) in general. We describe an iterative procedure below that essentially requires the linearization of the problem using the Taylor’s expansion of q(X, 0 ) at an initial value of 0 = do = (Ole, . . . ,OkO)’ and then using the results of linear least squares successively. The iterative methods depend heavily on the initial value d o and hence care must be taken to arrive at d o . We have, by Taylor’s expansion,
(6 .A .4) where
and
169
REFERENCES
R is the remainder of the expansion. We shall be using the approximation with R = 0 in further analysis. Let 8 - B0 = So = (Plo, . . . ,l j k o ) ' and
Then the model (6.A. I), using the approximation (6.A.4), becomes
Yi = qio +
c k
j=1
fljozji+ E j
or
(6.A.5)
Y-qo=ZoPo+E, where z o = (Z?',
zy, . . . , z:').
(6.A.6)
Therefore the estimate of Po is given by
b0 = (zo'zo)-'Z0'(Y - q o ) .
(6.A.7)
Po
The next trial value of 8 will now be obtained as 8, + and so on. In general, at any stage i, 8i+l = Bi + and = (Z;Zj)-' Z i ( Y - qj).The process continues until convergence; that is, the ratio of the (j+ 1)st estimate to the jth estimate is within 1 - 6 and 1 + 6 with a preassigned 6. There are many other computational procedures available in the literature, but we do not pursue them here.
pi
pi
References Atwood, C. L. (1969). Optimal and efficient design of experiments, Ann. Math. Statist. 40, 1570-1602. Box, G . E. P. (1954). The exploration and exploitation of response surfaces, Biometrics 10, 16-60. Box, G . E. P., and Draper, N. R. (1968). Evolutionary Operations, A Statistical Method f o r Process Improvement. Wiley, New York. Box, G . E. P., and Wilson, K. B. (1951). On the experimental attainment of optimum conditions, J. R o y . Statist. SOC. 13, 1-45. Box, M. J. (1971). Simplified experimental designs, Technometrics 13, 19-31. Chernoff, H. (1 95 3). Locally optimal designs for estimating parameters, Ann. Math. Statist. 24,586-602. Chernoff, H. (1959). Sequential design of experiments, Ann. Math. Statist. 30, No. 3, 755-770. Chernoff, H. (1 962). Optimal accelerated life designs for estimation, Technometrics 4, NO. 3.381-408.
170
VI.
OPTIMAL DESIGNS FOR REGRESSION EXPERIMENTS
Chernoff, H. (1963). Optimal Design of Experiments, Proc. 8th Con& Design Experiments Army Res. Develop. Testing, pp. 303-315. Chernoff, H. (1969). Sequential designs. In The Design of Computer Simulation Experiments (T.H. Naylor, ed.), pp. 99-120. Duke Univ. Press, Durham, North Carolina. Chernoff, H. (1972). Sequential Analysis and Optimal Design. SOC. Ind. Appl. Math., Philadelphia. Chernoff, H., Bessler, S., and Marshall, A. W. (1962). Accelerated life test, Technometrics 4, NO. 3, 367-380. Elfving, G. (1952). Optimum allocation in linear regression theory, Ann. Math. Statist. 23, 255-262. Elfving, G. (1955). Geometric allocation theory, Skand. Actuarietidskr. 38, 170-190. Elfving, G. (1 956). Selection of nonrepeatable observations for estimation, Proc. 3rd Berkeley Symp. Math. Statist. Probl. 1, 69-75. Elfving, G. (1959). Design of linear experiments. Probability and Statistics ( C m d r Volume) (U.Grenandar, ed.) pp. 58-74. Wiley, New York. Feder, P. I., and Mezaki, R. (1971). An application of variational methods to experimental design, Technometrics, 13, 77 1-793. Federov, V. V. (1972). Theory of Optimal Experiments. Academic Press, New York. Fisher, R. A. (1947). The Design ofExperiments (4th edition). Oliver and Boyd, Edinburgh. Creville, T. N. E. (ed.) (1 969). Theory and Applications of Spline Functions. Academic Press, New York. Hoel, P. G., and Levine, A. (1964). Optimal spacing and weighing in polynomial prediction, A n n Math. Statist. 35, 1553-1560. Hotelling, H. (1941). Experimental determination of the maximum of a function, Ann. Math. Statist. 12, 20-45. Hotelling, H. (1944). Some improvements in weighing and other experimental techniques, Ann. Math. Statist. 15, 297-306. Karlin, S., and Studden, W. J. (1966). Optimal experimental designs, Ann. Math. Statist. 37, 783-8 15. Kiefer, J. (1953). Sequential minimax search for a maximum, Proc. Amer. Math. Soc. 4, 502-506. Kiefer, J . (1959). Optimal experimental designs,J. R o y . Statist. SOC.Series B 21, 273-319. Kiefer, J. (1961). Optimum designs in regression problems: 11, Ann. Math. Statist. 32, 298-325. Kiefer, J., and Wolfowitz, J. (1959). Optimal designs in regression problems, Ann. Math. Statist. 30, 271-294. Kiefer, J., and Wolfowitz, J. (1960). The equivalence of two extremum problems, Can. J. Math. 12, 363-366. Kiefer, J., and Wolfowitz, J. (1965). On the theorem of Hoe1 and Levine on extrapolation, Ann. Math. Statist. 36, 1627-1 655. Kiefer, J., Farrel, R., and Walran, A. (1965). Optimum multivariate designs, Proc. 5th Berkeley Symp. Mcth. Stat. Probl. 1, 1 13-1 38. Murty, V. N., and Studden, W. J. (1972). Optimal designs for estimating the slope of a polynomial regression, J. Amer. Statist. Ass. 67, 869-873. Nalimov, V. V., Colikova, T. I., and Mikeshina, N. G. (1970). On practical use of the concept of D-optimality, Technometrics, 12, 799-812. Rivlin, T. J. (1969). A n Introduction to the Approximation of Functions. Blaisdell, Waltham, Massachusetts. Schoenberg, I. J. (1969). Approximations with Special Emphasis on Spline Functions. Academic Press, New York.
REFERENCES
171
Spendley, N., Hext, G. R., and Hinnisworth, F. R. (1962).Sequential application of simplex designs in optimization and evolutionary operations, Technometrics 4,441-459. Studden, W.J. (1968). Optimal designs on Tchebycheff points, Ann. Math. Statist. 39,
1435-1447.
Studden, W . J. (1971a).Optimal designs and spline regression. In Optimizing Methods in Statistics (J. S. Rustagi, ed.). Academic Press, New York. Studden, W . J . (1971b).Elfving’s Theorem and optimal designs for quadratic loss, Ann. Math. Statist. 42, 1621-1631. Studden, W.J., and Van Arman, D. J. (1969). Admissible designs for polynomial spline regression, Ann. Math. Statist. 40, 1557-1569.
CHAPTER V I I
Theory of Optimal Control
7.1
Introduction
While introducing the optimizing techniques of dynamic programming and maximum principle in Chapter 11, we discussed many problems of control theory. In this chapter many other problems of deterministic and stochastic control theory are discussed. In the development of the mathematical theory of control processes, the methods of dynamic programming are extensively used. The problem of control is basic in many areas of human endeavor, including industrial processes, mechanical equipment, and even the national economy. In many simple situations, the models of control theory will be developed in this chapter. The most common technique in solving these problems is that of dynamic programming, and it has been applied to a large variety of problems. In statistical applications, a large class of problems related to sequential sampling can be solved by backward induction and therefore have direct bearing on dynamic programming. There are many questions related to stopping strategies in dynamical systems, which can be treated by dynamic programming. The continuous versions of many of the above discrete problems can be transformed in terms of the Wiener process, and the related questions are solved with the help of the heat equation. In this chapter we treat the simple aspects of statistical decision theory as they arise in the study of Bayes solutions for the Wiener process. An exhaustive study has recently been made of controlled Markov chains. Markov chains form the basic model of multistage decision processes and we 172
1.2.
DETERMINISTIC CONTROL PROCESS
173
give a brief introduction to the study of control processes that can be thought of as Markov chains. For exhaustive treatment of the above control problems the reader should consult Bellman (1967, 1971), Chernoff (1972), and Kushner (1971). among many others. We follow Chernoff (1972) in the subsequent development. Many other references related to problems discussed here are given at the end of the chapter.
7.2
Deterministic Control Process
In developing the procedures of dynamic programming and maximum principle discussed earlier in this book the motivation came from many control problems. The language of control theory was utilized in obtaining solutions to problems that arise in many areas other than control theory. In this section we discuss the main problems of deterministic contr.01 theory. The solutions of these problems involve some of the variational techniques discussed so far in the book. A deterministic control process is concerned with the mechanical. electrical economic, or biological behavior of a process in which the outcome is completely determined when a given control is applied. There is no random element in the process, and we do not assume that the input or outcome depends on a chance mechanism. Control processes with stochastic elements will be discussed in the next section. The general structure of a deterministic control process is described by several quantities. The srare variable of the process describe the state of the process at any given instant of time, space, etc. The state variables may be a scalar or vector or a continuous function. The control variables reflect the decisions taken at any stage and again may be scalar, vectors, or continuous functions. An essential part of the process is the system of dynamic equations, which give the behavior of the process when the process is in a given state and a given control is applied. In most practical situations such equations are difference or differential equations. The physical and mechanical systems most often are governed by known relations. However, there are many important processes, such as those of economic and biological significance, in which such relations are not completely known. Statistical problems enter into the estimation of these relationships. For our present purposes we assume that the system is governed by a known set of difference or differential equations. For any effective control, it is necessary to have in mind some kind of objective or criterion function. The performance of a process is measured by this criterion function. Depending on the objective of the control, the optimizing problem appears. In a rocket control problem, it may be minimizing the distance
174
VII.
THEORY OF OPTIMAL CONTROL
by which the rocket misses the target; but in the case of an economic control problem, it may be the maximization of profit. Many of the processes have an element of feedback in them. Here, the performance of the process is used to guide the system back to its proper functioning. Some processes have time lag built into them, since a control applied at a given time r will take effect after several units of time have elapsed, rather than immediately. We also consider an example of this type in our discussion. Complete solutions of the deterministic control problems are available when the process is governed by linear difference or differential equations and the objective function is quadratic. The situation is quite similar to that in statistics, in which the theory of linear models is fairly complete when the criterion of minimization is quadratic. Many of these aspects in statistics are studied under the least-squares theory of inference. In connection with optimal design of regression experiments, we have discussed some elements of least-squares theory in Chapter VI. Consider a multistage decision process with Nstages. Let Xo, . . . X ~ b the e sequence of vectors describing the process at stages 0, 1 . 2, . . . , N respectively, with Xo as the initial state of the process. Let the controls at stages 1 . 2, . . . . N be given by U1, U2,. . . , UN. A schematic representation of this process is given in Fig. 7.1.
.
t
%
t
t U
%?
rJ- 1
t&
Figure 7.1
Suppose the state vector X,, at stage n depends only on the state vector at stage n - 1 and the control at stage n ; then, the equation governing the system is of the type X, =fn(XnPl, U,), Let the criterion function be g(X,)
H =
=
1 , 2 , . . . ,N.
IXNI*.
The object is to find the sequence of optimal controls U1 minimizesg(XN). Such a sequence is called an optimal policy.
(7.2.1) (7.2.2)
. . . , UN that
1.2.
DETERMINISTIC CONTROL PROCESS
175
The optimal policy will be obtained by the procedure of backward induction of dynamic programming as discussed earlier. The basic steps in this procedure are the following. Suppose XN-] is somehow determined, we find U, for each possible X N - ~ so as to minimize IX,?. Since X,=f,(XN-l, U,)from(7.2.1), we can find such . , U Tabulate the values of UNp1 and the minimum value of g(X,). Supposing now that XN-2 is available, we can find the optimizing values of UN-l and UN. Since determines X N - ~ and since U, has been tabulated for every XN-] with a minimum of g(X,), a search is simply made over the values of UN-]to find that XN...l for which minimum is attained. The above procedure reduces the dimensionality of the problem, since minimization is reduced over UN-] in place of UN-] and U., This process is repeated backwards until one reaches Xo. But X o is known, and an optimal UI can be chosen by means of tables of X I . In this way, the above numerical procedure determines the optimal policy u , , . . . , u,. We remark here that the numerical solution of the optimization problems requires the storage of functions. This may become a serious restriction on the computation of optimal policy in higher dimensions through the technique of dynamic programming. In Examples 3.2.2 and 3.2.3, we considered the basic elements of the control problem and showed the existence of the solution. The characterization of the solution was also given. We consider an example with lag in the following. Example 7.2.1 (Process with time lag) In some practical applications, the effect of the control may take some time to affect the behavior of the process. For example, when a certain drug is given to a patient and the physiological variables are being monitored, say, every minute, it will take several minutes before the drug’s effect is seen. We consider a simple example in which time lag matters. Suppose the differential equation governing the system for 0 < r < T is given by x ‘ ( t ) = a l x ( t ) + Q z x ( t - 1) + u ( t ) , 1 < t < T, with x ( t ) = g(t),
The criterion function is taken as
J(x, u ) =
i
1
0 < c < 1.
(7.2.3)
[x’ ( t ) + u’ ( t ) ] dt.
(7.2.4)
176
VII.
THEORY OF OPTIMAL CONTROL
Assume that xo(t), u o ( t ) are the minimizing functions. Consider X(f)
u ( t ) = uo(t) t € W ( t ) .
= xo(t) t E U ( t ) ,
(7.2.5)
Then (7.2.4) becomes, on simplification, T
J(x, u ) =J(xo, u o ) t 2~
( x o u t u o w )dt t E*J(v,w).
(7.2.6)
1
Also, from constraints (7.2.3), we have
1
u'(t) = a l u ( t ) + a2 ~ (-t1) t w ( t ) ,
s
< T.
(7.2.7)
The variational condition from (7.2.6) is [ x o ( t ) u ( t ) + u,(t)w(t)] dt = 0.
(7.2.8)
1
Eliminating w(t) from (7.2.8) using (7.2.7), we have
S 1
{ x o ( t ) u ( t ) t u o ( t ) [ v ' ( t ) - a a l u ( t ) - a 2 u (t l ) ] ) dt = 0.
Since
1 T
uo(t>u'(t)dt = uo(r)u(r)l
1
and
T
T
i i-' i
I
J
-
-
uo'(t)u(t)dt
1
uo(t)u(t - 1) dt =
1
0
substituting in (7.2.9), we find u ( t ) [ x o ( t ) -u'(t) - a l u o ( t ) - a 2 u o ( t
+ l)] dt
1
u ( t ) [ ~ ~ ( t ) - u g ) ( t ) - a ~ u ~dt( t+) ]uo(T)u(T)= 0.
t
T- 1
(7.2.9)
7.3.
CONTROLLED MARKOV CHAINS
177
Hence we have t 1 ) = 0,
xo(t)-ur(t)-u,uo(t)-u2uo(t
xo(t)-uo’(t)-Ulu,(t)=
0,
1
T-
< T - 1, 1 < t < T,
(7.2.10)
u ( T ) = 0. Laplace transformation methods can then be utilized to solve Eq. (7.2.10) further (Widder, 1946). The discrete analog of the problem, however, can be solved with the help of dynamic programming techniques. Before we consider the stochastic control problem in general, we discuss the controlled Markov chain problem. Many of the stochastic control problems occur in the form of Markov chains. We study the controlled Markov chain in the next section. 7.3
Controlled Markov Chains
In control theory many processes with random elements are Markovian in nature. The basic property of a Markov process is that the present state of the process depends upon the immediate past only. In many applications the process is finite, and we can describe the phenomenon by a Markov chain. In this section we give a brief discussion of Markov chains and the problem of control when the state of the process can be described by Markov chains. There is extensive literature on controlled Markov chains. For an exhaustive account, the reader should consult books by Derman (1970), Howard (1960), Kushner (1971), and Bellman (1958) among many others. The subject area has been called discrete dynamic programming by Blackwell (1 962) and Markovian decision processes by Bellman (1 958). A sequence of random variables X , , n = 0, 1 , 2 , . . . ,is called a Markov chain if, for every finite collection of integers
n, < n 2 < ~ ~ ~ < n r < n , we have
pix, IX,,,
Xn,, *
. . Xn,)= P(X, K r ) . 7
(7.3.1)
If the possible values of X , are denumerably infinite, the chain is called an infinite Murkov chain. Let X , take values . . . , -3, -2, -1 , 0, 1, 2 , . . . . Then P{Xm+I =jI&
=i)
i, j = . . . ,-3,-2,-1,0,
=Pij(m),
1,2,. ...
(7.3.2)
178
VI1.
THEORY OF OPTIMAL CONTROL
If for all i and j the probability pi,(m) does not depend on m ,we call the chain homogenous. The chain is completely determined if we also know the initial probabilities of the chain. When the values of X , are only finite, we have a finite Murkov chain. The Markov chain is a natural extension of the dynamics of a process. The process can be considered as a stochastic analog of the difference equation (7.3.3)
Yn+l = f O f n > ,
where y n + ] can be calculated without knowingy,,. For example, consider
. . ,yn-] ify,
is known. (7.3.4)
~ n I += ayrz + t n ,
where is a sequence of independent, identically distributed random variables. Then the sequence { y , } becomes a Markov chain. When a decision is made regarding a process and the decision, or control, affects the probabilities of the process, the process is called a controlled process. The case of a controlled Murkov chain is clearly evident from Example 3.2.3, considered earlier, and we now discuss it further.
Example 7.3.1 Let there be two coins 1 and 2 with probabilities p 1 and p z respectively of falling heads. At each time n = 1 , 2 , . . . , one of the coins is selected and tossed. Let X , be the number of heads accumulated till time n. Assume the trials are independent, so that X , depends on X n p 1 and not on X n - 2 , . . . , X I . Therefore, X , is a Markov chain. Let u,(x) denote the decision (control) for choice of a coin a t time n + 1 , when X , = x. Therefore, P{X,+] =X,}=l-pi
if
u,(X,)=i,
i=1,2.
P{X,+I = X , + I ) = p ; ,
if
u,(X,)=i,
i = 1,2.
Similarly These transition probabilities can be written explicitly as functions of u,(X) by introducing the Kronecker delta 6 i j , where 6ii = 1 if i = j and 0 otherwise.
Pi,;[U,(X)l = (1 -PI> pi,i+l[un(X)I = P I
&4,(X),l + (1 -Pz> 6 U , ( X ) , 2 ,
& ~ , ( x )+,PIZ 6 u n ( x ) , 2 .
(7.3.5)
Then X , is a controlled Murkov chain with transition probabilities given by (7.3.5). In general, the problem for finite time interval [O,n] with discrete times 0, 1,2, . . . ,n has the following framework. The state of the system is a Markov
7.3.
CONTROLLED MARKOV CHAINS
179
chain with N states. Let u'(i) = control when the state is i and there are r units of time left to control, also denoted by un-,.(i). n, = control policy consisting of the matrix of controls {ur(i)}; r = 1, 2 , . . . , n a n d i = 1 , 2 , . . . , N. pii [u'(i)] = transition probability that X I = j given that Xlp1 = i and control u r ( i ) i s u s e d a t t i m e I - l ; i , j = 1 , 2 , . . . ,N , I = 0 , 1 , 2 , . . . , n . A general controlled Markov chain will be given by the difference equation
m = 1 , 2 , . . . ,n , (7.3.6) =f(xm u m t m ) =f(xm ,t m ) , where t m are independent and identically distributed random variables. Let the xm+l
9
9
1
criterion function be assumed to be of the form
c
n-I
~ n , ,X o ) = E {
m =O
k [ X m , u n - m ( X m ) l + ko(xn)},
(7.3.7)
where k and ko are some given functions. The general form of the criterion is given by Kushner (1971). The basic problem of optimal control is to find n, such that V(n,, X o ) is minimized when constraints (7.3.6) are satisfied. From (7.3.7), we have
If the minimum of V(nn,X o ) over all n, is denoted by Vn0, the principle of optimality of dynamic programming leads to the functional equation
Vno = min E(k(Xo, Un(Xo)+ Vn-, o } . In the finite case, with restrictions of continuity on the functions k and pij and other reasonable conditions, the existence of an optimal policy can be proved by backward induction. An important example is the case of a system that is described by a linear system of difference equations and has an objective function given by a quadratic form. We discuss it in Example 7.3.2.
180
VII.
THEORY OF OPTIMAL CONTROL
Example 7.3.2 Let the system be described by Xm+] = A X , t BU, t g m ,
m = 0, 1 , . . . ,n and
Xo = c,
(7.3.8)
where A and B are matrices and X, and U, denote the state and control vectors. We also assume that [, are independently and identically distributed vectors with mean 0 and covariance matrix G . Let G be nonsingular. Let and
k(x, v) = x’Rx t V’QV
ko(x) = x’ROx,
so that [X,’RX,
m=O
+ U,’QU]
t X,’RoX,
where Q is positive definite and R, Ro are nonnegative definite matrices. Let V,(x) = min V(n, , x),
= tI
so that Vn+l(x) = min [ V,(X,) t x‘R,x t u’Qu], with Vo(x) = x’ROx. The dynamic programming solution of the problem is obtained in terms of (7.3.10)
V,(x) = x’Pnx t d,,
(7.3.1 1) (7.3.1 2) Here P, satisfies the functional equation P,+] = A‘P,A t R-A’P,B(B’P,B
t Q)-’B‘P,A.
(7.3.13)
The optimal control is given by
u,(x) = -(B’P,B
t Q)-,B’P,AX.
(7.3.1 4)
The proof is by induction and results from the following arguments. If n = 0, the system stops at 0 since no control is used. Thus (7.3.10) and (7.3.1 1) hold for n = 0. Assuming that (7.3.10) holds for n , we have Vn+](x) = min E[(Ax t Bv t &,)’P,(Ax t Bv t 5,) t x’Rx t v’Qv] , V
7.4.
STATISTICAL DECISION THEORY
181
= min [v'(B'P,B t Q)v t v'B'P,Ax t x'A'P,Bv] V
t x'(A'P,A t R ) x t tr (P,G).
Now the minimizing value of v is given by
U,(x) = -(B'P,B
+ Q)-'B'P,Ax.
Using (7.3.13), we can verify that Vn+ I (x) =
x'Pn+ 1 x t dn+1
9
completing the proof.
7.4 Statistical Decision Theory
In statistical theory, the major problems of inference are concerned with the estimation and testing of hypotheses about the parameters involved in the probability model. The general problem of inference is viewed as a process of decision making under uncertainty, and statistics is described as a science of making decisions under uncertainty, with or without observations. The scope of statistics so described includes a wide variety of problems in the real world. The object of our discussion in this section is to introduce briefly the structure of statistical problems in a more general framework and give a few elementary notions needed to discuss some of the control problems introduced earlier. For a comprehensive discussion of statistical theory, the reader should consult books by Blackwell and Girshick (1954), Ferguson (1967), and DeGroot (1970), among many others. According to decision theory formulation, a statistical problem is a game between nature and the statistician. The unknown state of nature represents the parameter involved in the probability law of the random variable that the statistician can observe. By spying on nature. the statistician can collect data and construct a function to make a decision about 0 . He is supposed to incur a loss (risk) due t o a wrong decision. The assumption of this loss function allows him to choose an optimal way somehow to deal with the loss. There are several ways to select an optimal strategy, and we shall discuss some of them in this section. Consider the following quantities involved in the statistical problem: H set of unknown states of nature 0 . Y" set of random observable X . the probability density function of the random variable X when the f(xl0) nature is in state 0. A set of actions a taken by the statistician.
182
VII.
THEORY OF OPTIMAL CONTROL
L(0, a ) loss to the statistician when he takes action a while nature is in state 8 . D class of decision functions d such that d : .+"'-+A.That is, d ( x ) = a , when statistician observesX = x , he takes action based on the decision rule d.
For a given decision rule d, the loss L(O,d(x)) becomes random, and we generally use its expectation, called the risk to the statistician. That is,
R (e, d ) = EeL(e, d ( ~ ) ) .
The statistical problem is then described by the triple (0,D, R ) , and the basic problem of statistical inference becomes the optimal choice of d in the class D so as to optimize R(O, d ) in some suitable way. Many criteria have been developed to answer this question. The minimax criterion stipulates that the statistician take a decision d o that minimizes, over the class D, the maximum of R(0, d) over the class 0. The minimax rule is the rule of the pessimist and is not always favored since it takes care of the worst possible situation. A rule that has many other desirable properties is the Bayes rule. We say that do is Bayes when do minimizes the expected risk, the expectation taken with respect to a prior distribution of 0 . When the loss is squared,
L(0, a) = (0
- a)',
(7.4.1)
assuming that 0 and A are the sets of real numbers, it can be readily seen that the Bayes rule is given by the mean of the posterior dism'bution. If l ( 0 ) is the prior density of 0 , the posterior distribution of 0 , given X = x , is obtained by the Bayes rule,
(7.4.2) We assume for simplicity that the random variable X and the parameter 0 have continuous probability distributions having a probability density. For the case in general, the expressions can be suitably modified. Now n
The Bayes risk is given by
r(d) =
I
R(e, d)5'(0)do.
(7.4.3)
7.4.
183
STATISTICAL DECISION THEORY
The Bayes rule do is obtained by minimizingr(d); that is, we minimize
Jrs.(O,
(7.4.4)
d(x)t(e)f(xle) dx do.
Expression (7.4.4) can be minimized when
/ u e , d(x))t(Olx)
(7.4.5)
is minimized. When the loss is. given by (7.4.1), the optimal rule is given by do(x) =
fe.geix) dx.
(7.4.6)
The mean of the posterior distribution, therefore, plays an important role in obtaining Bayes rules.
Example 7.4.1 variance d. If
Let X have a normal distribution with mean
I.(
and
q(x) = ( 2 7 ~ - ' /exp ~ (-x2/2),
we assume that (7.4.7) Suppose that interest centers around the estimation of Assume that
with one observation x. (7.4.8)
Now, to find [(FIX), we note that the joint distribution ofx and p is
The marginal density of X i s given by
184
VII.
THEORY OF OPTIMAL CONTROL
Since
-2p(xu-2
+ pOu;2) + x2u-2 + p;
we have
Therefore, Ebb) = If(x, p)/f(x)] is obtained on simplification as follows:
The posterior distribution of 1-1 is normal with mean (7.4.1 1) and V(j.lX)
Example 7.4.2
(7.4.12)
= (u-2 t uo2)-’
Suppose a sequence of independent random variables X1,
X2, . . . , X , have the normal distributions (I/ui) p((x - p)/ui), i = 1,2,
. . . ,n ,
respectively. Let the prior distribution of 1.1 be the same as (7.4.8). Then it follows that the posterior distribution of 1-1 given XI = x1, . . . X , = x, is normal with mean Y , where
.
(7.4.13) and variance S , where
s;1
= (762 t u2 ;
+ . . . t a2;.
(7.4.1 4)
If X I , . . . , X , is a random sample from a normal distribution (I/u)
p((x - p)/u), expressions (7.4.13) and (7.4.14) have the simple forms
(7.4.15)
S;’ = u’;
t nu- 2 .
(7.4.1 6 )
If the sequence of random variables Y , as defined in terms of the Xi’s from Eq. (7.4.13) is considered, we find that, for rn < n, Y , - Y,,, is normal with
7.4.
STATISTICAL DECISION THEORY
185
or (7.4.1 7)
Yml = E [ Y , I x , , . . . ~ x m l That is, Y , is the regression of Y , on x , , . . . , x,, . Thus
Y , = Y,, t u , where u has mean 0 and is uncorrelated with Y,. Therefore, V( Y,) = V( Y,) t V(u).
Also p = Y, +u,
where u has mean 0 and is uncorrelated with Y , ,
V h ) = V(Y,) t V(u). That is, 0:
= V ( Y,)
+ s, .
Thus
v(u)= v(Y,)-v ( Y , , ) = u ~ ~ - s ~ - ( u O ~ - S S ~ ) = S ~ - S ~ . Normality follows from the linearity of the function involved in Y , . The above result is stated in the following lemma.
Lemma 7.4.1 For rn < n, Y , - Y , as defined in (7.4.13) is normally distributed with mean 0 and variance S, - S,, and Y , - Y , is independent of Ym. Lemma 7.4.1 shows that Y(s) behaves like a normal process with independent increments starting from Y o = p o , and the conditional probability distribution of Y , - Y , given Y , is normal with mean 0 and variance S, - S,.
Continuous Version Consider the continuous version of the discrete process C.?=,xi as a process X ( t ) such that the mean and variance are given in terms of dX(t): E [ d X ( t ) ]= p d t , and V [ d X ( t ) ]= u2 d t . (7.4.1 8)
186
VII.
THEORY OF OPTIMAL CONTROL
Given that I-( has prior distribution that is normal with mean p0 and variance 02, the posterior distribution of p is obtained in terms of Y , and S,, which have a continuous version given in terms of a stochastic process Y(s) with parameters. Lemma 7.4.1 shows that Y(s) is a normal process with independent increments having
E [ d Y ( s ) ]= 0,
and
V [ d Y ( s ) ]= -ds,
(7.4.19)
with Y(s0) = P o ,
so = 0 0 ,
and
s-' = ui2 +
Essentially, Y(s) can be regarded as a Gaussian process in -s scale. The above process will be utilized in reducing the sequential analysis problem to that of a stopping problem when a continuous version is used. In a more general setting, we use the following result (Chernoff, 1972). Lemma 7.4.2 with mean
Let X(c) be a Gaussian process of independent increments E[dX(t)l = I.1 d H ( t ) ,
and variance
V [ d X ( t ) ]= dV(c), where V(c) is a nondecreasing function of c. Let I./ have a normal prior distribution with mean p0 and variance u2. Then the posterior distribution of p with given X ( t ) is also normal with mean Y(s) and variance s given by t
s-1
= 002
+ 0
Y(s) is a Wiener process with 0 drift in the -s ( Y o , so) = ( P o , ao2).
scale originating at
In the next section, we discuss elements of sequential decision theory. The stopping rule is obtained in terms of backward induction, which has also been utilized in dynamic programming.
7.5.
SEQUENTIAL DECISION THEORY
187
7.5 Sequential Decision Theory When the possibility of sequential sampling is introduced, the statistical decision problem assumes a more complicated structure. At each stage of sampling, the statistician has to decide whether to stop sampling or continue taking observations and, if he stops sampling, what terminal decision to take. The cost of taking an observation also enters the picture. It is intuitively clear that the more expensive the sampling, the less the number of observations should be taken. For simplicity of discussion we assume that the cost of observation is a constant c. Let the stopping rule be given by c p ~ ~ ~ = ~ c p o , c p l ~ ~ l ~ , c p.z.), ~~,,~z~,.
where tpi(x,, . . . , x i ) is the conditional probability that the statistician will stop sampling given that he has observed X I = x l , . . . , Xi = x i . cpo represents the probability of taking no observations at all. The terminal decision rule is given by where d i ( x l , . . . , x i ) gives a rule for 0 once a sample X I = x l , . . . , Xi = x i is observed and a stopping rule, cpi(xl, . . . , x i ) , is used. Therefore, a decision rule 6(x) has two components given by
6(x) =
[dx), Wl.
(7.5 . l )
The risk in a sequential decision problem is a function of 6(x) and the cost of taking j observations. The Bayes rule is obtained in two steps. First, a terminal decision is obtained for a given prior distribution and a given stopping rule. so as to minimize the risk. Then, the stopping rule is obtained so as to minimize the risk obtained after using the optimal terminal decision rule. Although the argument is circular. it is possible to show that the rule so obtained is the Bayes rule. We show below the procedure of obtaining the Bayes stopping rule given in terms of backward induction first proposed in this connection by Arrow et al. (1949). The generalization of the backward induction to that of the dynamic programming procedure by Bellman has already been discussed in several contexts. For more detailed discussion, see Blackwell and Girshick (1954) and Ferguson (1967), among many others. The basic process is the following. Consider the case in which the sampling must stop after N observations. The case of unlimited sampling can then be treated by taking the limit as N -+ 0 0 . We assume that tpN(X,,
* . . , x I v ) = 1,
188
THEORY OF OPTIMAL CONTROL
VII.
since sampling must stop when N observations are taken. At the ( N - 1)st observation, we stop if the conditional expected loss of stopping immediately, given XI = x l , . . . , X N - I = x ~ - l plus , the cost c of an additional observation, is less than the conditional expected loss given X1 = x l ,. . . , X N - ~= x ~ - land taking one more observation and then stopping. This gives P N - ~ .Knowing (PN and V N - ] , we can proceed inductively to obtain P N - 2 and so on. That is, P N , q N P l.,. . , cpo are determined. This procedure is formally described below. Assume that g(B) is the prior distribution of B and
PN-I =
if
> ....
if
(0,
(7.5.3)
Suppose, f o r j = 1 , 2 , . . . ,N , we define recursively
So that the optimal stopping rule is given by
yp(x I ,
. . . ,xi-])
=
any, if
LO,
if (7.5.5)
f o r j = 1 , 2 , . . . ,N . The stopping rule so obtained is Bayes rule. The terminal decision is chosen so that after stopping at j = J , it is regarded as a fixed sample rule for J
7.5.
SEQUENTIAL DECISION THEORY
189
observations. The procedure jointly giving the stopping and terminal rule provides the optimal decision rule for a sequential analysis problem. Our object in the following will be to introduce the sequential analysis problem so that it can be reduced to that of a stopping problem. The discrete problem can be transformed into a continuous problem to take care of more general situations.
Example 7.5.1 The test of the hypothesis of the mean of a normal distribution p will be reduced to a stopping problem. Let Ho : p > O ,
HI : p < O ,
(7.5.6)
and let
< 0,
L@, ai)= - k p
when
p
kp
when
p>O,
=
where ai = accept Hi, i = 0 , 1. Let the cost of each observation be c. Hence the cost is cn if we stop after 11 observations and the decision is correct, and cn + klpl if the decision is incorrect. In this case the optimal stopping rule can be obtained by the backward induction procedure as described above.
Example 7.5.2 Consider a continuous analog of Example 7.5.1. Let p have a prior density ( l / u o ) q ( p - p o ) / u o . Then, from (7.4.13) and (7.4.14), we have the posterior distribution of p given XI = x , , . . . , X , = x,, s y * q [ @ -y,)
s,q.
The risk using the posterior density, or the posterior risk, of the decision to accept H I and stop at n is db,, s,) where
1 0
d(y, s) =
k l p l ~ - ” ~ q [ s - ’ / ~ ( p - yd)p] ,
(7.5.7)
-m
since the loss is klp1 when p < 0 and 0 elsewhere. Let u = ( p - y)s-’/*,
then d b , S) = -
i’
k ( y + s ’ / ~ uq) ( ~d) ~ ,
--oo
(7.5.8)
190
VII.
THEORY O F OPTIMAL CONTROL
where u = s-'l2y. Simplifying, we have
i"
d ( y , S) = -ksLlz
(U
+ u ) ~ ( udu. )
(7.5.9)
-m
Integrating by parts, we have where --v
J'
@(u) =
q(t)dt.
_m
Or
9
Thus d b , s) = k S l / ' $ + ( U ) ,
(7.5 .I 0)
$+@I = d u ) - u [ I - N u ) ] ,
(7.5.1 1)
where where u Z 0. Similarly, the posterior risk of rejecting the hypothesis H 1 and stopping at ti is ks'l2$-(u), (7.5.12) where
$ - ( u ) = q(u) + u@(u),
where
u
< 0.
(7.5.13)
The cost of sampling is cn or C(S,'
- 002)
(7.5.1 4)
2,
from (7.4.1 6). Hence the risk associated with stopping at the nth observation d O n , s,,), where d(y, s) is given by using (7.5.10), (7.5.12), and (7.5.14), d ( y , S) = ks'/2$(y/s1/2)+ CU'(S-'
-C
J ~ ~ ) ,
(7.5.15)
with
Nu)=
$+(u),
when u > 0,
$-(u),
when u
< 0.
The sequential analysis problem for testing a hypothesis about the mean p has thus been reduced to that of a stopping problem with a given stopping risk. The study of the stopping problems will be made for the continuous versions
1.6.
WIENER PROCESS
191
of the above discrete problem. For that purpose, we introduce the notion of Wiener process in the next section. 7.6 Wiener Process The Wiener process (Brownian motion process) holds an important place in the study of stochastic processes. However, the usefulness of the Wiener process in control theory and statistics arises from the fact that it provides a method of generating continuous time analogs to discrete processes, especially of the type x n + l =g(xn, tn),
(7.6.1)
where t , is a sequence of independent random variables. We use the Wiener process in reducing the sequential analysis problem to that of a continuous case and give the characterization of stopping problems. A stochastic process, that is, a family of random variables X(t) for t in the interval (0, 0,is called stationary if for (7.6.2) O < f l < f 2 < . . . < f k < T, the probability distribution of
x ( f ,t r), x ( f 2 + r), . . . , X ( f k + 7 ) is the same as that of x ( t l )X, ( t 2 ) , . . . , X ( t k ) for all r such that O
...< f k t T < T .
Definition (Wiener process) A continuous stochastic process X(t), 0 < t < T is called a Wiener process if X ( t ) has stationary independent increments with X(t) - X(s) being normally distributed with mean 0 and variance u2 It - sI. The covariance of X(s) and X ( t ) in the above case is given by
r(s, t ) = u2 min(s, t ) .
(7.6.3)
It can be easily seen that if Y1,Y 2 , . . . , Y , are independent identically normal distributed random variables with
Xn=Y1+Y2+.*.+Y,, then for 0 < m < n, X , - X,,, is independent of X , and X , - X , is normally distributed with mean (n - m)p and variance (n - m)u’. For large n , the distribution of X , resembles that of a continuous process X(f) with the properties of a Wiener process. We generally write that
E ( d X ( t ) ) = p dt,
(7.6.4)
192
VII.
THEORY OF OPTIMAL CONTROL
and V [ d X ( t ) ]= u2 dt. p is called the drift of the Wiener process. A transformation of the Wiener process with drift p can be reduced to one without drift by taking s=
1
d ( t )d t ,
and
X ( S )= X ( t ) Then
E [dX(s)]= 0,
J’
p ( t ) dt.
(7.6.5)
V ( d X ( s ) )= ds.
The Wiener process also gives an approximation in the case of successive sums of independent random variables in which the central limit theorem applies.
7.7 Stopping Problems In the study of sequential sampling stopping problems occur naturally, since one part of the decision is the stopping rule. There are many other areas in which stopping problems arise. We have already seen that the sequential sampling procedure for testing a hypothesis about the mean of a normal distribution can be transformed into a stopping problem. Consider the following example, which arises in a different context. Example 7.7.1 (Secreraly problem) Suppose a secretary is to be hired by an executive out of N applicants. The executive interviews girls one at a time, rejecting or accepting a girl. At any stage of this process, the girl he is interviewing is ranked among those who have already been interviewed. The girls are interviewed a t random. How should he choose a girl such that he maximizes the probability of hiring the best girl among N . Suppose rank 1 is for the best and higher rankings are given for the worst in order. We may consider a criterion according to which the stopping procedure is obtained so as to minimize the expected value of the absolute rank of the girl chosen. Example 7.7.2 Another class of stopping problems may be stated in the form of a stochastic process Y,,, n = -m, . . . , -1,0 such that Y - , =yo is given. Let y,+1
= Y,, t u n ,
(7.7 .l)
7.7.
193
STOPPING PROBLEMS
where u,'s are independently and identically normally distributed random variables with means 0 and variance 1 . Suppose an observer can stop the process at n < 0 and collect 0 or wait till n = 0 at which point he collects 0 if Yo 2 0, otherwise Y: if Yo < 0. The cost of each observation is 1 unit. What is an optimal stopping procedure? Figure 7.2 describes the situation.
Y
11
0 0
+
1
8
4
3
2
'
I
1
0
-1
Figure 7.2
There is an extensive literature on stopping problems. In this section we consider only certain elementary aspects of stopping problems. There are sequential analysis problems, which have already been seen to reduce to stopping problems. Other control problems will also be seen to reduce to stopping problems. Many interesting mathematical results have recently been obtained by Chow e f al. (1971) and are discussed in their monograph. Statistical applications and applications in control theory leading to stopping problems as well as the general solution of their continuous analogous processes are discussed by Chernoff (1968). We discuss elementary aspects of the stopping problems related to the Wiener process. It is well known that the Wiener process is intimately connected with the heat equation. We give a brief introduction to the connections of the Wiener process with the heat equation. For a detailed discussion of stopping problems for Wiener processes, the reader is referred to Chernoff (1 968).
General Stopping Problem The stopping problem we consider here concerns a continuous stochastic process in continuous time. Let Y(s) be a Wiener process in -s scale originating from ( y o ,so) with E [ d Y ( s ) ]= 0
and
V [ d Y ( s ) ]= -ds.
(7.7.2)
194
VII.
THEORY OF OPTIMAL CONTROL
Let the cost of stopping be d(y, s). It is required to obtain a stopping procedure S so as to minimize the stopping risk
a).
b b o , so) = E ( d ( Y ( S ) >
(7.7.3)
Figure 7.3 represents a typical situation.
s
Figure 7.3
Example 7.7.3 The continuous analog of the sequential analysis problem discussed in Section 7.5 as Example 7.5.2 is the stopping problem with the stopping risk d(y, s) = ks1/2$(y/s’/2) t ca’(s-’ - a;2),
as given by (7.5.1 5). Example 7.7.4
The continuous time version of Example 7.7.2 can be seen = y o , we have (n an integer)
as follows. For the discrete case, withy-, d(y, n) =
{ mm - nn- ,y 2 ,
for
n=O,y
otherwise.
-
Thus for the continuous time, we have d(y,s)=m-s-yZ, = m - s,
s=O,
y
otherwise.
The general attack to solve continuous stopping problems is through the heat equation. The basic philosophy used is that of backward induction; however, it does not appear explicitly in the solution. Let the minimum of the stopping risk over all procedures S be given by p ( y o , so). That is, (7.7.4) p o l o , so) = inf bbo,SO), S
7.7.
STOPPING PROBLEMS
195
where b(yo, so) defined in (7.7.3) is the risk associated with any specific stopping procedure. Now p b o , s o ) < d ( y o , so). Since Y(s) is a process with independent increments, p b , s) represents the best that can be expected once we reach Y(s) = y , irrespective of how it was reached, although p b , s) is defined for all b, s). Therefore, we may characterize the stopping procedure S by the following rule.
Rule
Stop as soon as
That is, there are two sets for the optimal procedure So known as stopping set Yoand continuation set go,given by
The optimal procedure considered will be described only in terms of Vo and
Yo.It will be shown that the solution satisfies a partial differential equation, known as the heat equation, and in many cases it provides the necessity condition of the solution.
Necessary Condition Let 01,s) be in the continuation set. Then the probability of stopping between s + 6 and s is o(6) and the process changes from Y(s t 6) to Y(s). Therefore,
b(y, s t 6)=E{b(Y(s),s)lY(st 6 ) = y } to(&), = E{b(y t
W6 ’12, s))
t 0(6),
where W is distributed normally with mean 0 and variance 1. Using Taylor’s expansion,
where
196
VII.
THEORY OF OPTIMAL CONTROL
We have therefore,
so that, taking limits, we have and
b b , s) = d b , s)
on %.
(7.7.8)
Equations (7.7.7) and (7.7.8) form the heat equation, and their solution is the solution of the Dirichlet problem of the heat equation. To obtain the optimal sets V o and Yo,we need an additional condition (7.7.9) satisfied by them on the boundary, P y ( Y 9 s) = d,(X s).
That is, the optimal sets Vo and equations PdY, s) = fP,b?
.4”0
and the optimal risk p satisfy the following
s)
P(Y9 s) = d b , s) P,(X
(7.7.9)
s) = dybt s)
ongo,
(7.7.1 0)
on .Tb,
(7.7.1 1)
on the boundary of Vo.
(7.7.12)
The solution of Eqs. (7.7.1 0)-(7.7.12) is the Stefan problem or free boundary problem of the heat equation. Suppose that (yo,so) is a point on the portion of the boundary of Vo above which are stopping points and below which are continuation points. Suppose further that d Y b , s) exists a t (yo,so). Now, since p b , so) = d b , so) fory 2 yo, the right-hand derivative of p with respect t o y , p ; b , s), is such that P,+(Y? s) = d y b o , so).
(7.7.1 3)
Similarly, f o r y < y o , p b , so) G d b , so), so that Py-(Yo, so) G d y b o , so).
(7.7.14)
The risk of not stopping between so + 6 and so and then proceeding optimally is given by E(Pb0 +
W6
so)},
so that P b o , so -I 6) GE(Pb0 -I W6’/2,so)}.
(7.7.15)
7.7.
197
STOPPING PROBLEMS
Using Taylor's expansions, we have, for W > 0 , p(y0 i- W6
so) = p(y0,
so) i- W6 1~2py+(yo, so)
i- 4
6 1/2),
and for W < 0, p ( y 0 -I W6'l2, so) = p(y0, so) i- W6 1/2py-(Yo. so) -I o(61/2).
Thus
= P b O , so) -I (6/27r)1'2 (PY+- p y - ) i- 0(6'/2),
or, using (7.7.1 3) and (7.7.1 5), p(y0, so i- 6) - p(y0, so) G ( 6 / 2 7 v 2 [dY(YO,so)
-Py-(yO*
so)]
o(6 !I2).
Assuming that [ p ( y o , so t 6 ) - p b 0 , so)] /6 is bounded below, it follows that (7.7.1 6)
d y b o , so) 2 Py-CYo, so).
Then (7.7.1 3) and (7.7.1 6) give the condition (7.7.1 2 ) .
Sufficient Conditions Given the solution of the free boundary problem, can one say that it solves the original optimization problem? The answer is in the affirmative if certain additional conditions are satisfied for the heat equation. The results are given in the following theorem, which is stated here without proof.
Theorem 7.7.1 (Chernoff) If u(y, s) is the solution of the free boundary problem and go is a continuation set with u(y, s) and d(y, s) having bounded derivatives up t o third order, and if
u(y, s) G d ( y , s)
and
j d y y ( y ,s)
d,D, s)
on the continuation set V 0 , then the solution of the optimization problem is given by the function u(y, s) together with Eo under the condition that the optimal risk can be approximated by the risk of a procedure in which stopping is restricted to a finite number of discrete times.
In many cases stopping problems are not easily solved as simple solutions of the heat equations. Sometimes bounds on the optimal boundary of the
198
VII.
THEORY OF OPTIMAL CONTROL
stopping region and continuation region as well as tlie optimal risk are helpful. The following general procedure may be applied. First let ucv,s) be an arbitrary solution of the heat equation. Let the set on which My, s) = d b ,s)
be denoted b y & Now i f g i s the boundary of a continuation set go,the risk for tlie procedure defined by the continuation set KOis b b , s) = u(y, s)
oneo
b b , s) = d ( y , s)
and
on
%.
But then
b b , s).
P b , s)
Therefore, if ('yo,so) is a point of V0 where u(y, s)< d b , s), then < d(yo, so) and ('yo,so) is a continuation point of the optimal procedure.
p b o , SO)
Example 7.7.5 Consider Example 7.7.4, with its continuous analog giving tlie stopping risk by d ( y , s) = nz
s
~
-
y2,
s = 0, y
otherwise.
=m-s
Since the problem is invariant under the transformation
Y*=uY and the solution is simple. Also note that if %o = { ( y ,s) : y
s*=u2s,
< 0,
s>0)
and p(y, s) = -s, =
-y
y 2 0, s 2 0 , - s,
y
SGO,
then ( p , '60)is a solution of the free boundary problem, since
PCV, s)
= d(y, s),
and
pY(y, s) = d,(y, s),
for y = 0.
7.8 Stochastic Control Problems
A few examples of control problems involving a stochastic process with continuous time parameter are described here. We consider the problems
7.8.
STOCHASTIC CONTROL PROBLEMS
199
involving the Wiener process only. The problems of controlled Markov chains are discrete cases of these stochastic control problems. The problems involving stochastic processes with continuous time parameters in their dynamic equations with criterion functions involving their moments are quite complicated. Many advanced notions, such as those of stochastic integrals and martingales, are needed. These are beyond the scope of the present exposition. Exhaustive treatments of stochastic control are available in books by Kushner (1971), Bellman (1967), Chernoff (1972), Aoki (1972), and many others. Many of the control problems, as well as bandit problems, can be reduced to stopping problems, as discussed in the previous section. We introduce a control problem in this section. The solution involves technicalities and is given completely by Chernoff (1968). Example 7.8.1 Suppose a rocket is directed toward Mars and its miss distance can be measured at times t , , t 2 , .. . and can be adjusted by instantaneous use of fuel. Assume that the cost of missing Mars by amount y is k y 2 . The problem of control is the allocation of fuel so as to minimize total cost. In its continuous version, the problem can be transformed in terms of a Wiener process Y(s) in -s scale. Let Y(s) originate a t (yo, so) with
E [ d Y ( s ) ]= 0
and
V [ d Y ( s ) ]= 4 s .
Let the miss distance have a standard deviation proportional to the distance to the target. We assume that the mean has a prior normal distribution with mean po and variance a: as before. Applying Lemma 7.4.2 with
dH(t) = dt
and
d V ( t ) = a2(to- t)' d t ,
we have the posterior distribution of the miss distance as normal with mean Y(s) and variance s, where Y(s)is a Wiener process in -s scale and s-' = 00'
s
+0
( t o- t)-' dt = 00' - o - ~ti' t o-2(to- I ) - ' .
Assume that a: and 02t;', where to is total time of flight, are small and cancel each other, so that (7.8.1) s = fJ2( t0 - t). Assume further that the amount of fuel to change the miss distance by an amount A is inversely proportional t o the distance t o the target. Then, the cost of fuel required per unit change o f y is d(s) = ( t o - t ) - l
= s-l.
(7 A.2)
200
VII.
THEORY O F OPTIMAL CONTROL
In addition to the cost IAls-' to adjust by an amount s, there is an additional cost kY'(0). The problem in the continuous version is to find a rule so as to minimize the expected cost.
References Alvo, M. (1972). Bayesian sequential estimation, Stanford Univ. Tech. Rep., Department of Statistics, Stanford, California. Aoki, M. (1967). Optimization of Stochastic Systems. Academic Press, New York. Arrow, K. J., Blackwell, D., and Girshick, M. A. (1949). Bayes and minimax solutions of sequential decision problems, Econometrica 17,213-244. Astrom, K. J. (1970). Introduction to Stochastic Control Theory. Academic Press, New York. Bather, J. A. (1966). A continuous time inventory mode1,J. Appl. Probl. 3.538-539. Bather, J. A., and Chernoff, J . (1967). Sequential decisions in the control of a space-ship, Proc. 5th Berkeley Symp. 3, 181-207. Univ. of California Press, Berkeley. Bellman, R. (1958). Dynamic programming and stochastic control processes, Information and Control 11,228-239. Bellman, R. (1967). Introduction to the Mathematical Theory of Control Processes, Vol I . Academic Press, New York. Bellman, R. (1971). Introduction to the Mathematical Theory of Control Processes. Vol. 11. Academic Press, New York. Blackwell, D. (1962). Discrete dynamic programming, Ann. Math. Statist. 33. 7 19-726. Blackwell, D., and Girshick, M. A. (1954). Theory of Games and Statistical Decisions. Wiley, New York. Boudarel, R., Delmas, J., and Guichet, P. (1971). Dynamic Programmingand Its Application to Optimal Control Theory. Academic Press, New York. Canon, M. D., Cullum, J., Clifton, D., and Polak, E. (1970). Theory of Optimal Controland Mathematical Programming. McGraw-Hill, New York. Chernoff, H. (1968). Optimal stochastic control, SankhyZ 30,221-252. Chernoff, H. (1972). Sequential Analysis and Optimal Designs. SOC. Ind. Appl. Math., Philadelphia. Chow, Y. S., Robbins, H., and Siegmund, D. (1971). Great Expectations: The Theory o f Optimal Stopping. Houghton Mifflin, Boston. DeGroot, M. H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York. Denn, M. M. (1969). Optimization by Variational Methods. McGraw-Hill, New York. Derman, C. (1964). On sequential control processes, Ann. Math. Statist. 35. 341-349. Derman, C. (1970). Finite State Markovian Decision Processes. Academic Press, New York. Dreyfus, S. (1965). Dynamic Programming and the Calculus of Variations. Academic Press, New York. Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York. Hermes, H., and LaSalle, J. P. (1969). Functional Analysis and Time Optimal Control. Academic Press, New York. Hestenes, M. R. (1966). Calculus of Variations and Optimal Control Theory. Wiley, New York. Howard, R. (1960). Dynamic Programming and Markov Processes. Wiley, New York.
REFERENCES
20 1
Jazwinski, A. H. (1970). Stochastic Processes and Filtering Theory. Academic Press, New York. Kushner, H. (1971). Introduction t o Stochastic Control. Holt, New York. Mangasarian, 0. L., and Schumaker, L. L. (1969). Splines via optimal control. In Approximations with Special Emphasis on Spline Functions (I. J. Schoenberg, ed.), Academic Press, Ncw York. Petrov, lu. P. (1 968). Variational Methods in Optimum Control Theory. Academic Press. New York. Tracz, G. S. (1968). A selected bibliography o n the application of optimal control theory to economic and business systems, management science and operations research, J. Oper. Res. 16, 174-186. Tung, F., and Striebel, C.T. (19.56). A stochastic optimal control problem and its applications, J. Math. Anal. A p p f . 8 , 350-359. W a r g , J . (1972). Optimal Control of Differential arid Functional Equations. Academic Press, New York. Widder, D. (1946). Lapluce Transform. Princeton Univ. Press, Princeton, New Jersey. Wiener, N. (1958). Nonlinear Problems in Random Theory. Wiley, New York.
CHAPTER V l l l
Miscellaneous Applications of Variational Methods in Statistics
8.1
Introduction
In earlier chapters a wide variety of problem areas in statistics has been shown amenable to solution through variational methods. These applications serve only as examples and do not exhaust all the possible applications of variational techniques in statistics. Some more interesting applications of variational methods and connections with other optimizing techniques are discussed in this chapter. The applications chosen are in reliability theory, statistical bioassay, least-squares approximations via splines, and connections between mathematical programming and statistics. The illustrations serve only as examples from a large class of applications in engineering, biology, medicine, and economics. The reader will find further applications in the literature, a partial list of which is supplied at the end of this chapter. In the statistical theory of reliability, distributions having increasing failure rate or decreasing failure rate arise quite frequently. We give a few important inequalities that arise in this study. The procedures developed in earlier chapters have to be modified to be applied to such a class of distributions. There is extensive literature in mathematical models of reliability, and the reader may consult the book by Barlow and Proschan (1965). References to some recent papers are also given at the end of this chapter. We have already seen the application of variational methods to a problem of 202
8.2.
APPLICATIONS IN RELIABILITY
203
bioassay in Chapter IV. In that application the bioassay problem reduced to a linear moment problem of quite a wide interest. Here we discuss the efficiency of a special estimator-Spearman estimates-in bioassay. We have also discussed a problem of this nature earlier in connection with the study of the ChernoffSavage statistic. However, the application here is of special interest in bioassay. The reader may find detailed procedures in statistical bioassay in a book by Finney (1964). We discuss the problem of Brown (1959). Recent interest in splines has led to their applications in many areas of statistics. We pointed out some directions of these applications in Chapter VI. In approximations, splines have found considerable applications; for reference the reader may consult Schoenberg (1 969). However, the recent applications of dynamic programming to approximation have been stressed by Bellman et al. (1974), whose results we discuss here briefly. There are several problem areas in regression analysis, estimation, and testing hypotheses 'in which modern mathematical programming methods have been utilized. The discussion here concerns only those mathematical programming methods in which certain elements of variational 'arguments appear. A recent comprehensive survey of programming in statistics and probability has been given by Krafft (1970). There is also a considerable amount of literature on the duality theory of the Neyman-Pearson lemma and we attempt to give a few elementary insights into these connections. In this regard a recent paper of Francis and Meeks (1972) may be helpful to the interested reader. With the development of linear programming techniques, there has appeared an interest in the consideration of problems with random elements. Random variables may appear in mathematical programming problems in various ways. We demonstrate, by examples, the variational character of some of the stochastic programming problems. The book on probabilistic programming by Vajda (1972) may be quite illuminating for the interested reader. The richness of these applications can be caught by only a few examples that we discuss involving chance constraints. For further details, see Charnes et al. (1971). 8.2 Applications in Reliability In the performance of complex engineering systems, reliability concepts naturally arise. The notion of failure rate or hazard rate is quite commonly used to measure the ability of an item to perform efficiently or to perform at all. Especially in problems in which the length of life of an item is important, such as in the life of a light bulb or electron tube, the failure rate plays an important role in decision making. Let F(x) be the cumulative probability distribution function of the time to failure of an item. Let the conditional probability of failure time (x, x t d x ) ,
204
VIII.
MISCELLANEOUS APPLICATIONS
given that the item has survived till time x , be denoted by r(x)dx. Then r ( x ) is called the failure rate or hazard rate of the item. If f ( x ) is the probability density function corresponding to F(x), it is easy to see that
r(x) =
f(x)
~
1- F ( x ) '
(8.2.1)
when F(x) < 1 . For example, when the failure probability density is exponentially given by
f(x)=
{y
x > 0, elsewhere,
F ( x ) = 1- e P w , and (8.2.2) Conversely, one can show that, if the failure rate is constant, the failure probability distribution is exponential. The exponential distribution arises in reliability problems quite naturally, and we shall discover it again when we consider bounds for the probability of survival data. In some applications, the failure rate goes on increasing or decreasing. Such distributions have many interesting properties. We define formally the increasing failure rate distributions that have found extensive applications. Definition A continuous distribution function F(x) is said to have an increasing failure rate, or simply IFR, and if and only if
F(t + X ) - F ( x ) 1- F ( t ) is a monotone increasing function o f t for x
(8.2.3)
> 0 and r 2 0 such that F(t) < 1.
For the discrete case, let the time to failure X have the distribution
P(X=k)=pk,
k = 0 , 1,2,....
Then the distribution function is IFR if and only if m
1%-
(8.2.4)
is monotone increasing in k for k = 0, 1,2, . . . . Analogously, the notion of distributions with decreasing failure rate can be defined. Decreasing failure distributions may be denoted by DFR. Although
8.2.
APPLICATIONS IN RELIABILITY
205
there is a large class of either IFR or DFR distributions, there are many distributions that are neither. Also, a distribution may be IFR for some parameter values, while for others it may be DFR. One of the central problems of reliability is the probability of survival until time x , and this probability may sometimes be taken as the reliability of an item. It is of interest, therefore, t o find upper and lower bounds on 1 - F(x). Bounds on the distribution function F(x) have been studied extensively under moment constraints by several researchers, and the whole group of Tchebycheff-type inequalities belong to this problem area. A recent survey is given by Karlin and Studden (1966). In this section, inequalities are discussed for 1 - F ( x ) , when the distributions are IFR. In the classical study of Tchebycheff-type inequalities or in their generalizations such as discussed in Chapter IV, the geometry of moment spaces is utilized. The arguments depend heavily on the property that the class of distribution functions is convex. The class of distribution functions having a prescribed set of moments is also convex. However, in the study of IFR and DFR distribution, it is not true. Therefore, classical methods of geometry of moment spaces cannot be directly applied to the above problem. We first state an important inequality, called Jensen 's inequality.
Theorem 8.2.1 If cp(x) is a convex (concave) function of x , where x is a point in n-dimensional Euclidean space, then
HcpQI > cp bw31.
(8.2.5)
(4
Inequality (8.2.5) is essentially a variational result. We consider finding the lower bound of S[cp(X)]. That is, find the minimum, over the class of distribution functions F ( x ) , of
such that
J k x ) dF(x)
(8.2.6)
I
(8.2.7)
x dF(x) = p.
The Jensen's inequality, restated, says that the lower bound of the integral (8.2.6) is q ( p ) . Such problems have been discussed in Chapter IV. Utilizing these results, since tp is convex, in the case in which X is a real-valued random variable, the. one-point distribution Fo(x) with
(8.2.8) provides the lower bound.
206
V111.
MISCELLANEOUS APPLICATIONS
The upper bound, in casecp(x) is concave, is similarly obtained. In case the distribution function F(x) is absolutely continuous, we note that
T(X)= -d/dX {log [ 1 - F(x)]) .
(8.2.9)
Therefore, the property that F(x) is IFR is equivalent in this case to log( 1 - F(x)) being concave. In our subsequent discussion, logconcavity and logconvexity of 1 -F(x) may replace the property of being IFR and DFR, respectively. A large number of results are given by Barlow and Proschan (1967) for bounds on 1 - F(x), but we discuss only two for illustrative purposes._The interested reader should consult their monograph. We denote 1 - F(x) by F(x),
F(x) = 1 -F(x).
(8.2.10)
Theorem 8.2.2 If F(x) is IFR and
s
x dF(x) = p ,
then
,
if
x
otherwise.
Proof For the following proof, we assume that F(x) is continuous. For the discrete case, the proof follows along the same lines. Since log[] - F(X)] is concave, we have, from Jensen's inequality in (8.2.5), E{log [ I - F ( X ) ] ) < log [ l -F(p)].
(8.2.1 1)
Now, it is well known that if X h a s distribution F(x),the distribution of F(x)is uniform on (0,1). Consequently, the distribution of 1 - F(X) is uniform on (0,1). We then have
1 I
E[logF(X)] =
logxdx=-1.
0
From (8.2.1 1) we have
log F ( p ) 2 - 1 ,
or
F ( p ) 2 e-'.
We also have, from (8.2.3),
log F(x) - log F(0) X - 0
- log [F(x)]
as a decreasing function x , since F(x) is IFR. That is, [F(x)]'Ix is a decreasing function of x. Therefore,
207
APPLICATIONS IN RELIABILITY
8.2.
and hence F(x) 2
[ F @ ) ] + > e-X/’’
for
x
< p.
For x 2 p , the bound is attained by a one-point distribution with total probability at x = p . The theorem then follows. For finding the upper bounds of the probability F ( x ) , we have a similar result, which is stated in the following theorem. Theorem 8.2.3 Let F(x) have mean p and be IFR; then,
where w ( x ) is given by the equation 1- w ( x ) p = e
Proof
(8.2.12)
We consider the function C(y) such that
Then F(y) and C(y) cross each other at most once for each y crosses C ( y ) from above if it does at all. I f x > p, we can determine w in such a way that
< x , and F(y)
X
r
J
e-WY d y = p ,
or
1- e p W x = PW.
0
Assume that G(y) is not identically equal to F(y) for this choice of F(x)
For, if not p=
i
0
F(x)dx>
w ;then,
for y a x .
s
G(x)dx=p
0
leads to a contradiction. Since C ( y ) is IFR, the bound is sharp. For x G 1.1, the upper bound is given by one-point distribution with mass concentrated at x = p. This completes the proof of the theorem. Bounds for integrals for a general class of functions involving F(x)can be obtained in some cases. We consider nonlinear functions cp(x,y) with the following properties. A(i) A(ii)
cp(x, y ) is convex in y and twice differentiable in y . [&p(x, y)/dy] is increasing in x.
208
VIII.
MISCELLANEOUS APPLICATIONS
Such functions were used in the study of nonlinear moment problems in Chapter V and are of considerable interest in statistics. In many applications bounds of the expectation of the range of the sample from IFR distributions or of the expected value of the largest or smallest order statistics from the sample are needed. As has been seen in Chapter V, these problems are special cases of the general problem in which E [ q ( X , F(X))] is minimized or maximized. The class of distribution functions considered there was restricted only in terms of a given set of moments. Here we consider the case in which the class of distribution functions is IFR. The methods used are similar to the ones used by Rustagi (1957) and Karlin (1959). Consider IFR distributions F(x) with
F(O)=O
and
p=
J’
xdF(x).
(8.2.13)
0
We have the following theorem. Theorem 8.2.4 Let q(x,y ) be a given function with properties A(i) and A(ii) and F(x) be IFR with constraints (8.2.13). Then the lower and upper bounds of the integral 1 0=
i
q(x, F(x)) dx
0
are given respectively by the distribution functions C(X) =
and
{
1 - e-XIP ,
x>o,
0,
x
< 0,
(8.2.14)
(8.2.1 5)
Proof Consider the integral
0
for o < A < 1. I(A) is convex in A, since q(x, y ) is convex in y . If C(x) minimizes I @ ) , then I ( A ) achieves its minimum at A = 1. This is possible if and only if I’(h)IA=1
< 0,
209
BIOASSAY APPLICATION
8.3.
where
,.
m
=
I’(A)lA=,
,
0
Since is nonincreasing in x by A(ii) and q ( x , y ) is convex in y , we have that *(x) = -
avJ aY
I
is nondecreasing in y since
y=G(x)
is nonincreasing in x . If G is the exponential distribution and F ( x ) is IFR,F(x) crosses G ( x ) only once from above, say at x = t o ,so that F(to) = (?(to);then,
[
[*(x) - * ( t o ) ] [ c ( x ) - F ( x ) ] dx
< 0.
0
Therefore, I y h ) < 0. Hence the minimum is obtained for
G(x)=
,
x>o, x
< 0.
The upperbound is obtained in the same manner. Corollary If F is IFR with mean p, F(0) = 0, and p ( x ) is an increasing function of x , then the upper bound of JFq(x)F(x)dx is given for the distribution D ( x ) defined in (8.2.1 5).
Remark
The corresponding inequalities are reversed if q ( x , y ) is concave in
y and aq(x, y)/ay is a nondecreasing function of y. Similarly in the corollary,
the lower bound is given if the function q ( x ) is decreasing in x instead of increasing in x . The corollary is the special case of Theorem 8.2.4 when cp(x9 u) = d X X 1 - Y ) . 8.3
Bioassay Application
A variational problem resulting from a bioassay application was discussed in Chapter IV. The problem discussed resulted in finding bounds of an expectation. In this section we consider another aspect of the bioassay problem. In quanta1 assay, the basic problem of bioassay is to estimate the response distribution F(x) with the help of observations F(xi) at (2k + 1) levels xi for i = 0, +1, . . . , +k. Doses of a drug or a chemical with concentrations xi are given to n subjects at each dose. Let the number of responses at dose xi be ri, i = 0, + l , . . . , +k.
210
VIII.
MISCELLANEOUS APPLICATIONS
Suppose the tolerance distribution has the location parameter p , so that its general form is given by F(x - p). Let the dose levels xi be chosen in the following way. A number x o is chosen at random in an interval (0, d ) for a given d. Then
i = O , ? l , . . . ,?k.
xi=xo+id,
(8.3.1)
An important estimator, called the Spearman estimator, for p is @en by k
s = +i =C- k
(pi+]-pi)(xi+l + x i ) .
(8.3.2)
The efficiency of the estimator S will be obtained in this section. The likelihood of the sample is given by
n k
h=
j=-k
(;)[.(xi-
p)Iri[1 - F ( x i -p)]"-'i.
(8.3.3)
The Fisher information for this experiment is given by
(8.3.4) for a given x o . We find, from (8.3.3),
where F(xi - p ) = Fi, for convenience of notation. Using the fact that E(ri) = nFi and Var(ri) = nFi( 1 - Fi),we have
(8.3.6) For an infinite experiment, that is, when k -+ m, the information can be defined to be I = lim Ex" [ f k ( x o ) ] , k+-
(8.3.7)
the expectation taken with respect to xo, which is assumed to be uniformly distributed on (0,d). (8.3.7) can be calculated using (8.3.6) as follows:
21 1
BIOASSAY APPLICATION
8.3.
by letting xo t id = x,
(8.3.8) Also, the variance of the Spearman estimator (8.3.2)for the case k
V(S)=
n
F ( t ) [1 -F(t)] dt.
-+
CQ,
(8.3.9)
-m
For details, see Brown (1959). The efficiency of the estimator, then, can be defined for the infinite experiment by the ratio of the inverse of the information (8.3.8) and the variance V(S),given by
E(F) = P / V ( S ) , or m
m
The object of the remaining discussion is to find the upper bound of this efficiency and show that the efficiency attains the value 1 for the logistic distribution having the cumulative distribution function
F(x) = (1 t e-@+flx))-l.
(8.3.11)
We assume that the distribution function F(x) considered here is absolutely continuous, having probability density function f(x), so that
(8.3.12) The efficiency E(F) in (8.3.10)then becomes m
m
We show that the efficiency is 1 for the logistic distribution in which the only unknown parameter is the translation parameter and only symmetric distributions are considered. That is, we show that E(F) < 1, and it is equal to 1 for the logistic distribution. The result is stated in the following theorem.
212
VIII.
MISCELLANEOUS APPLICATIONS
Theorem 8.3.1 The logistic distribution is the only symmetric distribution, with the translation parameter as the single unknown parameter, for which the Spearman estimate has asymptotic efficiency equal to 1 .
h o o f Let C ( x ) be the symmetric distribution function that minimizes E(F) or maximizes [E(F)] = Y(F).Consider V ( x )to be another symmetric function such that for small E , G(x) + E V ( x ) is a distribution function. Let g ( x ) = G'(x), u(x)= V'(x), and y ( ~=)Y(G(x)+ E V(x)), so that y'(0) = 0. From (8.3.1 3), we have
-'
y'(O) =
[ 1 C(x)(1
g 2 ( x )dx
-m
] [ 1 V ( x ) [1 m
m
- C(X))
-
2G(x)] dx]
-m
m
(8.3.1 4) -m
By symmetry, we find m
(8.3.15) Denote by m
(8.3.16) and m
B=
C(x)[1-G(x)] d x . -m
Using (8.3.1 5)-(8.3.17), Eq. (8.3.14) becomes m
(8.3.17)
8.4.
APPROXIMATIONS VIA DYNAMIC P R O G R A M M I N G
21 3
Since V(x) is any function satisfying the condition of symmetry and differentiability and G ( x ) cannot be such that 1 - 2G(x) vanishes in the whole real line, the necessary condition that y’(0) = 0 in (8.3.1 8) is
A=
Bg2 (XI C’(x) [ 1 - G(x)]2
(8.3.19)
Or (8.3.20) Solving (8.3.20), we find that
where
(Y
is the constant of integration. Thus, G(x) = (1 + , - @ + P - ~ ) ) - l .
(8.3.2 1)
It can be easily verified that E(C) = 1, proving the theorem. The above problem was considered by Brown (1959). For other applications of Spearman estimates in bioassay, as well as for the general theory of statistical bioassay, see Finney (1964). 8.4
Approximations via Dynamic Programming
In model building or in obtaining solutions to complicated problems approximations reduce the complexity of problems and sometimes even make the solutions possible. It is quite common in practice to use a discrete analog of a continuous problem for computational purpose. By necessity the arithmetic of computers requires approximations t o many problems. Criteria used in approximating a function or a solution to a problem are generally such that the results obtained are still meaningful. The theory of approximation is a well-developed and rich branch of mathematics and has seen advanced development in recent years. The variety and interest in research in this area is extensive; for a recent account, the reader may consult Schoenberg (1 969). We discuss in this section a recent example of approximation via spline functions through the method of dynamic programming. It can be demonstrated that it is sometimes easier and more meaningful to solve a problem by finding an exact solution t o an approximate problem rather than t o find an approximate solution t o an exact problem. We give a n example of this nature, taken from
21 4
VIII.
MISCELLANEOUS APPLICATIONS
The elementary introduction t o the theory of spline functions was made in Chapter VI in connection with their application t o a design experiment problem. In this section, we give an example in which spline approximation is accomplished through dynamic programming. For a more rigorous development, the reader may refer t o a recent paper by Bellman er af. (1974). The use of splines has been explored in data analysis by Wold (1974).
of
Example 8.4.1 (Bellman) Consider the problem of finding the minimum of
\[
\int_0^1 (du/dt)^2\,dt \tag{8.4.1}
\]
with the following side conditions on the function u(t):
\[
\int_0^1 q(t)u^2(t)\,dt = 1, \tag{8.4.2}
\]
\[
u(0) = u(1) = 0. \tag{8.4.3}
\]
We assume that u(t) is well behaved and that q(t) is known. Using the classical variational methods, we obtain the Euler equation from the Lagrangian
\[
(du/dt)^2 + \lambda q(t)u^2(t). \tag{8.4.4}
\]
The Euler equation is given by
\[
-2(d^2u/dt^2) + 2\lambda q(t)u(t) = 0. \tag{8.4.5}
\]
The solution of the differential equation (8.4.5) with the boundary conditions (8.4.3) is not available in general, and this necessitates the consideration of approximate solutions. Consider now the alternative approach of optimizing an approximation to (8.4.1). We assume that u_1(t), \dots, u_n(t) is a sequence of linearly independent functions defined on (0, 1) such that
\[
u_i(0) = u_i(1) = 0, \qquad i = 1, 2, \dots, n. \tag{8.4.6}
\]
Let
\[
u(t) = \sum_{i=1}^{n} c_i u_i(t). \tag{8.4.7}
\]
Then we consider the problem of minimizing
\[
\int_0^1 \Big(\sum_{i=1}^{n} c_i u_i'(t)\Big)^2 dt, \tag{8.4.8}
\]
where u_i'(t) denotes the derivative of u_i(t). In addition to (8.4.6), the following constraint is to be satisfied:
\[
\int_0^1 q(t)\Big(\sum_{i=1}^{n} c_i u_i(t)\Big)^2 dt = 1. \tag{8.4.9}
\]
Other simplifying assumptions can also be made, such as that the system of functions u_i'(t), i = 1, 2, \dots, n, is orthogonal; that is,
\[
\int_0^1 u_i'(t)u_j'(t)\,dt = \delta_{ij}, \qquad i, j = 1, 2, \dots, n, \tag{8.4.10}
\]
where
\[
\delta_{ij} = 1 \ \text{if}\ i = j, \qquad \delta_{ij} = 0 \ \text{if}\ i \ne j. \tag{8.4.11}
\]
Orthogonal functions such as those defined by (8.4.10) have many different representations; simple examples are given by trigonometric functions, Legendre polynomials, etc. The problem of minimizing (8.4.8) with constraints (8.4.9) and (8.4.6) is exactly solvable, as the sketch following this paragraph illustrates. However, the question of whether the solution of the latter converges to the solution of the former is still unanswered. We do not address ourselves to this question here.
Many criteria for the goodness of an approximation are available. The most frequently used is that of least squares. In statistics the criterion of least squares is used extensively in regression analysis, curve fitting, time series problems, and many others, and it also forms an important principle for estimating parameters in other models. Splines are being used in regression models and in other applications in which approximations to functions are needed. Bellman and Roth (1969) describe the method of approximation using straight-line segments when the least-squares criterion is used. In obtaining a piecewise cubic polynomial approximation to a function, Bellman et al. (1974) use the principle of dynamic programming. We outline this procedure in the next example.

Example 8.4.2 (Bellman et al.) Consider the approximation of a function u(t) over (a, b). Let us partition the interval (a, b) into n subintervals given by
a = t_0 < t_1 < \cdots < t_n = b. It is required to find a function s(t) that is of third degree on the subintervals such that
\[
J(a, b) = \int_a^b [u(t) - s(t)]^2\,dt \tag{8.4.12}
\]
is minimized. Let s(t) be s_i(t) on the interval (t_i, t_{i+1}), i = 0, 1, 2, \dots, n-1, such that
\[
s_i(t) = y_i + z_i(t - t_i) + u_i(t - t_i)^2 + w_i(t - t_i)^3. \tag{8.4.13}
\]
Let t_{i+1} - t_i = h_{i+1}. The conditions of continuity of s(t) and its derivative at the knots require that
\[
s_i(t_{i+1}) = y_{i+1}, \qquad s_i'(t_{i+1}) = z_{i+1}, \tag{8.4.14}
\]
so that
\[
u_i = h_{i+1}^{-2}\big[3(y_{i+1} - y_i) - h_{i+1}(z_{i+1} + 2z_i)\big], \qquad
w_i = -h_{i+1}^{-3}\big[2(y_{i+1} - y_i) - h_{i+1}(z_{i+1} + z_i)\big]. \tag{8.4.15}
\]
The functional to be minimized over the intervals (t_i, t_{i+1}), \dots, (t_{n-1}, t_n) is
\[
F_i(y_i, z_i) = \min_{\substack{y_{i+k},\,z_{i+k}\\ k = 1, 2, \dots, n-i}} \; \sum_{j=i}^{n-1} \int_{t_j}^{t_{j+1}} [u(t) - s_j(t)]^2\,dt, \qquad i = 1, 2, \dots, n-1. \tag{8.4.16}
\]
Here y_i is the value of s_i(t) at t_i, and z_i is the value of the derivative of s_i(t) at t_i. The functional equation of dynamic programming can be obtained from (8.4.16) as
\[
F_i(y_i, z_i) = \min_{y_{i+1},\,z_{i+1}} \Big\{ \int_{t_i}^{t_{i+1}} [u(t) - s_i(t)]^2\,dt + F_{i+1}(y_{i+1}, z_{i+1}) \Big\}. \tag{8.4.17}
\]
The functional equation (8.4.17) can be further simplified, leading to explicit solutions of the problem.
For further details, the reader may consult the original paper of Bellman et al. (1974).
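A small sketch of the recursion (8.4.17) follows. It is not the authors' code: the target u(t), the knot grid, and the finite grids of candidate knot values and slopes are our illustrative assumptions, so this is an exact solution of an approximate (discretized) problem in the spirit of the section.

```python
import numpy as np

u = lambda t: np.sin(np.pi * t)                   # function to be approximated
knots = np.linspace(0.0, 1.0, 5)                  # a = t_0 < ... < t_n = b
y_grid = np.linspace(0.0, 1.2, 9)                 # candidate knot values y_i
z_grid = np.linspace(-4.0, 4.0, 9)                # candidate knot slopes z_i

def segment_cost(t0, t1, y0, z0, y1, z1):
    """int_{t0}^{t1} [u(t) - s_i(t)]^2 dt for the cubic (8.4.13)-(8.4.15)."""
    h = t1 - t0
    a = (3.0 * (y1 - y0) - h * (z1 + 2.0 * z0)) / h**2    # u_i of (8.4.15)
    b = -(2.0 * (y1 - y0) - h * (z1 + z0)) / h**3         # w_i of (8.4.15)
    t = np.linspace(t0, t1, 65)
    d = t - t0
    r = (u(t) - (y0 + z0 * d + a * d**2 + b * d**3)) ** 2
    return float(np.sum(0.5 * (r[1:] + r[:-1]) * (t[1] - t[0])))

states = [(y, z) for y in y_grid for z in z_grid]
F = {s: 0.0 for s in states}                      # no cost beyond t_n
for i in range(len(knots) - 2, -1, -1):           # backward induction, (8.4.17)
    F = {(y, z): min(segment_cost(knots[i], knots[i + 1], y, z, y2, z2)
                     + F[(y2, z2)] for (y2, z2) in states)
         for (y, z) in states}

print("minimum of (8.4.12) over the grids:", min(F.values()))
```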
8.5 Connections between Mathematical Programming and Statistics

The central problem of statistical inference, that of testing statistical hypotheses, and the development of the Neyman-Pearson theory were briefly
introduced in Chapter IV. The Neyman-Pearson technique has already been applied to solve some nonlinear moment problems, as discussed in Chapter V. The Neyman-Pearson problem has recently been applied to many mathematical programming problems, especially in duality theory; recent references in this connection are Francis and Wright (1969), Vyrsan (1967), Francis and Meeks (1972), and Francis (1971). Mathematical programming methods have been widely used in statistics in many contexts; for a survey, see Wagner (1959, 1962), Karlin (1959), Krafft (1970), and Vajda (1972). In the language of mathematical programming, we have two problems, the primal and the dual, and the solution of one provides the solution of the other. For many complicated optimization problems this duality theory simplifies the solution, and interesting new optimization problems arise as a result of the duality property. We discuss below an example in which the Neyman-Pearson problem is the dual to a problem of independent interest, called the primal problem. We consider first the Neyman-Pearson problem and describe a duality theory for it. It will be seen that the development resembles the one in Chapter V, in which the nonlinear moment problem is solved by first linearizing it and then applying the Neyman-Pearson technique, as considered by Rustagi (1957) and Karlin (1959).
Neyman-Pearson Problem Consider a given function \varphi(x, y), which is strictly concave in y and differentiable in y. Let \psi_1(x, y), \dots, \psi_m(x, y) be a given set of m functions such that \psi_i(x, y) is convex and differentiable in y for i = 1, 2, \dots, m. Let \lambda_1, \dots, \lambda_m be m nonnegative real numbers and let
\[
\eta(x, y, \lambda) = \varphi(x, y) - \sum_{i=1}^{m} \lambda_i \psi_i(x, y), \tag{8.5.1}
\]
where \lambda = (\lambda_1, \dots, \lambda_m)'. By the assumptions above, the function \eta(x, y, \lambda) is concave and differentiable in y. Let l(x) and u(x) be given functions and denote by W the class of functions f such that
\[
l(x) \le f(x) \le u(x). \tag{8.5.2}
\]
Let the subsets S_1, S_2, and S_3 be defined as
\[
S_1 = \Big\{x : \frac{\partial \eta}{\partial y}(x, y, \lambda) < 0 \ \text{for all}\ y,\ l(x) \le y \le u(x)\Big\}, \tag{8.5.3}
\]
\[
S_2 = \Big\{x : \frac{\partial \eta}{\partial y}(x, y, \lambda) > 0 \ \text{for all}\ y,\ l(x) \le y \le u(x)\Big\}, \tag{8.5.4}
\]
\[
S_3 = \Big\{x : \frac{\partial \eta}{\partial y}(x, y, \lambda) = 0 \ \text{for exactly one}\ y,\ l(x) \le y \le u(x)\Big\}. \tag{8.5.5}
\]
Since \eta(x, y, \lambda) is strictly concave in y and differentiable in y, we can see that the sets S_1, S_2, and S_3 are disjoint and that their union is the whole real line. Let f_0(x) be such that
\[
f_0(x) = \begin{cases} l(x), & x \in S_1, \\ u(x), & x \in S_2, \\ y(x), & x \in S_3, \ \text{where}\ \dfrac{\partial \eta}{\partial y}(x, y(x), \lambda) = 0, \end{cases} \tag{8.5.6}
\]
and let (b_1, \dots, b_m) = b be a given vector. Then we consider the following "primal" and "dual" problems.
Primal Problem Minimize I_1(\lambda), where
\[
I_1(\lambda) = \int \eta(x, f_0(x), \lambda)\,dx + \sum_{i=1}^{m} \lambda_i b_i, \tag{8.5.7}
\]
over the set of all \lambda such that \lambda \ge 0.
Dual Problem Maximize I_2(f), where
\[
I_2(f) = \int \varphi(x, f(x))\,dx, \tag{8.5.8}
\]
over the class of functions f(x) such that (8.5.2) is satisfied and
\[
\int \psi_i(x, f(x))\,dx \le b_i, \qquad i = 1, 2, \dots, m. \tag{8.5.9}
\]
Let the class of functions f(x) satisfying constraints (8.5.2) and (8.5.9) be denoted by 𝒜. A solution to the primal problem is given by finding a global minimum of I_1(\lambda) over the nonnegative orthant in m-dimensional Euclidean space. A solution to the dual problem is any admissible function (feasible solution) belonging to the class 𝒜. The dual problem is a version of the Neyman-Pearson problem involving inequality constraints. We give below a few relationships between the primal and dual problems.
Lemma 8.5.1 For any admissible solution f(x) of the dual problem and any \lambda \ge 0,
\[
I_1(\lambda) \ge I_2(f). \tag{8.5.10}
\]

Proof From (8.5.9) we have
\[
b_i - \int \psi_i(x, f(x))\,dx \ge 0, \qquad i = 1, 2, \dots, m.
\]
Thus
\[
I_2(f) = \int \varphi(x, f(x))\,dx \le \int \eta(x, f(x), \lambda)\,dx + \sum_{i=1}^{m} \lambda_i b_i. \tag{8.5.11}
\]
From (8.5.1), it is clear that \eta(x, y, \lambda) is strictly decreasing in y on S_1, strictly increasing in y on S_2, and
\[
\eta[x, f_0(x), \lambda] \ge \eta[x, y, \lambda]
\]
on S_3. Therefore
\[
\int_{S_1} \eta[x, f(x), \lambda]\,dx \le \int_{S_1} \eta[x, l(x), \lambda]\,dx, \tag{8.5.12}
\]
\[
\int_{S_2} \eta[x, f(x), \lambda]\,dx \le \int_{S_2} \eta[x, u(x), \lambda]\,dx, \tag{8.5.13}
\]
and
\[
\int_{S_3} \eta[x, f(x), \lambda]\,dx \le \int_{S_3} \eta[x, f_0(x), \lambda]\,dx. \tag{8.5.14}
\]
The integral on the right of the inequality (8.5.11) can be written as a sum of integrals over S_1, S_2, and S_3, and, using inequalities (8.5.12)-(8.5.14), the lemma follows from (8.5.11).
Lemma 8.5.2 If f is an admissible solution to the dual problem such that, for some \lambda \ge 0,
\[
f(x) = f_0(x), \tag{8.5.15}
\]
with
\[
\int \psi_i(x, f(x))\,dx = b_i \quad \text{whenever}\ \lambda_i > 0 \tag{8.5.16}
\]
and
\[
\lambda_i = 0 \quad \text{whenever}\ \int \psi_i(x, f(x))\,dx < b_i, \tag{8.5.17}
\]
so that
\[
\sum_{i=1}^{m} \lambda_i \Big[ b_i - \int \psi_i(x, f(x))\,dx \Big] = 0, \tag{8.5.18}
\]
then \lambda is a solution of the primal problem and I_1(\lambda) = I_2(f).

Proof Using (8.5.15), we see from (8.5.8) that
\[
I_2(f) = \int \varphi(x, f_0(x))\,dx.
\]
Using (8.5.16)-(8.5.18), we can write
\[
I_2(f) = \int \eta(x, f_0(x), \lambda)\,dx + \sum_{i=1}^{m} \lambda_i b_i = I_1(\lambda),
\]
or I_2(f) = I_1(\lambda). By Lemma 8.5.1, \lambda is then the solution of the primal problem and f(x) the solution of the dual problem, proving the lemma.
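The weak duality of Lemma 8.5.1 is easy to verify numerically. In the sketch below, all ingredients (\varphi, \psi_1, l, u, b, and the particular admissible f) are our assumptions, chosen to satisfy the hypotheses with m = 1; the pointwise maximizer f_0 of (8.5.6) then has a closed form.

```python
import numpy as np

# phi(x, y) = y - y^2 (strictly concave in y); psi_1(x, y) = x*y (convex in y);
# l(x) = 0, u(x) = 1 on [0, 1]; b_1 = 0.2.  All of these are assumed examples.
x = np.linspace(0.0, 1.0, 10_001)
w = np.full_like(x, x[1] - x[0])
w[0] = w[-1] = (x[1] - x[0]) / 2.0                 # trapezoidal weights
b = 0.2

def f0(lam):
    # d/dy [y - y^2 - lam*x*y] = 0  =>  y = (1 - lam*x)/2, clipped to [l, u]
    return np.clip((1.0 - lam * x) / 2.0, 0.0, 1.0)

def I1(lam):                                       # primal objective (8.5.7)
    f = f0(lam)
    return float(np.sum(w * (f - f**2 - lam * x * f)) + lam * b)

def I2(f):                                         # dual objective (8.5.8)
    return float(np.sum(w * (f - f**2)))

f = np.where(x < 0.5, 0.6, 0.1)                    # admissible: 0 <= f <= 1 and
assert float(np.sum(w * x * f)) <= b               # int x f(x) dx <= b, (8.5.9)

for lam in (0.0, 0.5, 1.0, 2.0):
    print(f"lambda = {lam:3.1f}:  I1 = {I1(lam):.4f} >= I2(f) = {I2(f):.4f}")
```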
Necessary and sufficient conditions for the solution of the primal and dual problems can be derived similarly. The techniques bear a resemblance to the moment problem discussed in Chapter IV; note that there the part of the function f(x) is played by the cumulative distribution function F(x), so that 0 \le F(x) \le 1, with l(x) = 0 and u(x) = 1. The reader should consult the paper of Francis and Meeks (1972) and other related papers for further discussion of this problem. Many other connections between problems of statistical optimization and programming have been discussed by Krafft (1970). Let X be a random variable with mean \mu and variance \sigma^2. Then
\[
P\{X - \mu \ge k\} \le \frac{\sigma^2}{\sigma^2 + k^2} \tag{8.5.19}
\]
or
\[
P\{X - \mu \le -k\} \le \frac{\sigma^2}{\sigma^2 + k^2}, \qquad k > 0. \tag{8.5.20}
\]
The above inequalities can be formulated as variational problems. Inequality (8.5.20) is obtained by finding the upper bound of the integral
\[
\int \chi_A(x)\,dF(x) \tag{8.5.21}
\]
with side conditions
\[
\int x\,dF(x) = \mu \tag{8.5.22}
\]
and
\[
\int x^2\,dF(x) = \mu^2 + \sigma^2, \tag{8.5.23}
\]
where A = (-\infty, \mu - k], F(x) is the distribution function of the random variable X, and \chi_A(x) is the characteristic function of the set A; that is,
\[
\chi_A(x) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A. \end{cases}
\]
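Discretizing the support turns (8.5.21)-(8.5.23) into a finite linear program whose value approaches the bound in (8.5.20). The sketch below is ours; the grid, the particular \mu, \sigma, k, and the use of scipy.optimize.linprog are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

mu, sigma, k = 0.0, 1.0, 1.5
xs = np.linspace(-10.0, 10.0, 2001)              # candidate support points
indicator = (xs <= mu - k).astype(float)         # chi_A, A = (-inf, mu - k]

A_eq = np.vstack([np.ones_like(xs),              # total probability mass = 1
                  xs,                            # mean constraint (8.5.22)
                  xs**2])                        # second moment (8.5.23)
b_eq = [1.0, mu, mu**2 + sigma**2]

# maximize sum indicator * p  <=>  minimize -indicator' p
res = linprog(-indicator, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, None))
print("LP value of max P{X - mu <= -k}:", -res.fun)
print("bound sigma^2/(sigma^2 + k^2) :", sigma**2 / (sigma**2 + k**2))
```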
There is extensive literature on Tchebycheff inequalities; for example, see Karlin (1959). Finding bounds on expectations can be related to problems of mathematical programming, and we consider an example below. First we consider the upper bound of an expectation and call it the primal problem.
Primal Problem Consider the problem of finding the maximum of the expectation of g(X),
\[
\int g(x)f(x)\,dx, \tag{8.5.24}
\]
over a class of probability density functions f(x) such that
\[
\int g_i(x)f(x)\,dx = c_i, \qquad i = 0, 1, 2, \dots, m, \tag{8.5.25}
\]
with g_0(x) = 1, f(x) \ge 0, and the c_i known constants with c_0 = 1.
Dual Problem Find the minimum value of
\[
\sum_{i=0}^{m} y_i c_i \tag{8.5.26}
\]
subject to the constraints
\[
\sum_{i=0}^{m} y_i g_i(x) \ge g(x) \quad \text{for all}\ x. \tag{8.5.27}
\]
The duality theory reduces the solution of the primal problem to that of the dual problem. We consider the case of a countable infinity of constraints like (8.5.25); the general problem has been discussed by Krafft (1970).

Example 8.5.1
Consider a countable set of constraints on moments with
\[
g_i(x) = x^i, \tag{8.5.28}
\]
and let
\[
\int x^{2i} f(x)\,dx = \frac{(2i-1)!}{2^{i-1}(i-1)!} \tag{8.5.29}
\]
and
\[
\int |x|^{2i+1} f(x)\,dx = \frac{2^{i+1}\, i!}{(2\pi)^{1/2}}, \tag{8.5.30}
\]
for i = 1, 2, 3, \dots. It is well known that there is exactly one probability distribution satisfying constraints (8.5.29) and (8.5.30), namely, the standard normal, with density \varphi(x). Let
\[
g(x) = \begin{cases} \exp(\lambda^2/2), & x \ge \lambda > 0, \\ 0, & \text{otherwise}; \end{cases}
\]
that is, if \chi_A(x) is the characteristic function of the set A,
\[
g(x) = \exp(\lambda^2/2)\,\chi_{[\lambda, \infty)}(x). \tag{8.5.31}
\]
Therefore, the maximum of E[g(X)] is given by \exp(\lambda^2/2)[1 - \Phi(\lambda)], where \Phi(x) is the cumulative distribution function of the standard normal. The dual problem is the following. Find the minimum of
\[
w + \sum_{i} u_i \int x^{2i}\varphi(x)\,dx + \sum_{i} v_i \int |x|^{2i+1}\varphi(x)\,dx, \tag{8.5.32}
\]
where w, u_i, and v_i satisfy the constraint
\[
w + \sum_{i} u_i x^{2i} + \sum_{i} v_i |x|^{2i+1} \ge g(x) \quad \text{for all}\ x, \tag{8.5.33}
\]
since
\[
1 - \Phi(\lambda) = \frac{\exp(-\lambda^2/2)}{(2\pi)^{1/2}} \int_0^{\infty} e^{-\lambda x}\exp\Big(-\frac{x^2}{2}\Big)\,dx. \tag{8.5.34}
\]
Since
\[
\sum_{i=0}^{2k+1} \frac{(-1)^i(\lambda x)^i}{i!} \le e^{-\lambda x}, \qquad k = 0, 1, 2, \dots, \tag{8.5.35}
\]
and
\[
\sum_{i=0}^{2k} \frac{(-1)^i(\lambda x)^i}{i!} \ge e^{-\lambda x}, \qquad k = 0, 1, 2, \dots, \tag{8.5.36}
\]
we obtain feasible values of w, u_i, and v_i by substituting the truncation (8.5.36) into (8.5.34) and integrating term by term, the coefficients being set equal to zero beyond the truncation point (8.5.37)-(8.5.38). The objective function (8.5.32) then becomes
\[
\psi_1(\lambda) = \sum_{i=0}^{k} \frac{\lambda^{2i}}{2^{i+1}\, i!} - \frac{1}{(2\pi)^{1/2}} \sum_{i=0}^{k-1} \frac{2^i\, i!\,\lambda^{2i+1}}{(2i+1)!}.
\]
Therefore, from duality, we have
\[
1 - \Phi(\lambda) \le \exp(-\lambda^2/2)\,\psi_1(\lambda). \tag{8.5.39}
\]
Similarly, considering the corresponding minimizing problem with the odd truncation (8.5.35), we have
\[
1 - \Phi(\lambda) \ge \exp(-\lambda^2/2)\,\psi_2(\lambda), \tag{8.5.40}
\]
where \psi_2(\lambda) denotes the corresponding sum obtained from (8.5.35); this yields the well-known bounds on the normal tail probability.
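The bounds (8.5.39)-(8.5.40) are easy to compute. The sketch below follows the reconstruction above: it truncates the expansion of e^{-\lambda x} in (8.5.34) after N + 1 terms and integrates term by term, using \int_0^\infty x^i e^{-x^2/2}\,dx = 2^{(i-1)/2}\Gamma((i+1)/2); even N gives an upper bound and odd N a lower bound.

```python
import math

def moment(i):
    # M_i = int_0^inf x^i exp(-x^2/2) dx = 2^{(i-1)/2} Gamma((i+1)/2)
    return 2.0 ** ((i - 1) / 2.0) * math.gamma((i + 1) / 2.0)

def tail_bound(lam, N):
    # exp(-lam^2/2)/sqrt(2 pi) * sum_{i=0}^{N} (-lam)^i / i! * M_i, from (8.5.34)
    s = sum((-lam) ** i / math.factorial(i) * moment(i) for i in range(N + 1))
    return math.exp(-lam * lam / 2.0) / math.sqrt(2.0 * math.pi) * s

lam = 1.0
exact = 0.5 * math.erfc(lam / math.sqrt(2.0))    # 1 - Phi(lambda)
for N in (2, 3, 6, 7, 14, 15):                   # even N: upper; odd N: lower
    print(f"N = {N:2d}  bound = {tail_bound(lam, N):.6f}  exact = {exact:.6f}")
```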
Many other problems of statistical interest in which programming methods can be applied are available in the literature. For example, the Cramér-Rao inequality can be obtained through programming methods; for reference, see Isii (1964). We consider here an example of unbiased estimation. An estimate T_n(X_1, \dots, X_n), based on a random sample X_1, X_2, \dots, X_n from a population having probability density f(x, \theta), is called unbiased if E[T_n(X_1, \dots, X_n)] = \theta. An important principle of estimation is to obtain unbiased estimates with minimum variance. That is, it is desired to find T_n(X_1, \dots, X_n) such that
\[
E[T_n(X_1, \dots, X_n) - \theta]^2 \tag{8.5.41}
\]
is minimized subject to the constraint
\[
E[T_n(X_1, \dots, X_n)] = \theta. \tag{8.5.42}
\]
It is easy to see that (8.5.41) is minimized if we minimize
\[
E[T_n^2(X_1, \dots, X_n)]. \tag{8.5.43}
\]
We consider an example involving Bernoulli trials.
Example 8.5.2 Let X_1, \dots, X_n be independent and identically distributed random variables with P(X_i = 1) = \theta and P(X_i = 0) = 1 - \theta, i = 1, 2, \dots, n. A statistic T_n(X_1, \dots, X_n) is a function defined on the 2^n n-tuples of 1's and 0's. There are \binom{n}{i} such n-tuples with i ones and n - i zeros, and the probability of any such n-tuple is \theta^i(1-\theta)^{n-i}. Let T_n(X_1, \dots, X_n) = Y_{ij}, where i = 0, 1, 2, \dots, n indexes the number of ones and j = 1, 2, \dots, \binom{n}{i} indexes the n-tuples with i ones. Let
\[
\bar{y}_i = \binom{n}{i}^{-1} \sum_{j=1}^{\binom{n}{i}} Y_{ij} \tag{8.5.44}
\]
and
\[
E[T_n] = \sum_{i=0}^{n} \theta^i (1-\theta)^{n-i} \sum_{j=1}^{\binom{n}{i}} Y_{ij}. \tag{8.5.45}
\]
Then the problem of finding the minimum variance unbiased estimate of \theta becomes that of minimizing
\[
E[T_n^2] = \sum_{i=0}^{n} \theta^i (1-\theta)^{n-i} \sum_{j=1}^{\binom{n}{i}} Y_{ij}^2 \tag{8.5.46}
\]
with the constraint
\[
\sum_{i=0}^{n} \theta^i (1-\theta)^{n-i} \sum_{j=1}^{\binom{n}{i}} Y_{ij} = \theta. \tag{8.5.47}
\]
The constraint is satisfied if
\[
\sum_{j=1}^{\binom{n}{i}} Y_{ij} = \frac{(n-1)!}{(i-1)!\,(n-i)!}, \quad \text{that is, if}\ \bar{y}_i = \frac{i}{n}. \tag{8.5.48}
\]
Since, by the Cauchy-Schwarz inequality,
\[
\sum_{j=1}^{\binom{n}{i}} Y_{ij}^2 \ge \binom{n}{i} \bar{y}_i^2, \tag{8.5.49}
\]
we have, from (8.5.49),
\[
E[T_n^2] \ge \sum_{i=0}^{n} \binom{n}{i} \theta^i (1-\theta)^{n-i} \Big(\frac{i}{n}\Big)^2, \tag{8.5.50}
\]
with equality if Y_{ij} = i/n. Therefore T_n(X_1, \dots, X_n) = i/n, the proportion of ones in the sample, is the minimum variance unbiased estimate of \theta.
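For a small n, the conclusion can be checked by direct enumeration. The sketch below is our construction: it compares the sample mean with another unbiased estimate, T = X_1, over all 2^n outcomes; in line with (8.5.49)-(8.5.50), the sample mean has the smaller second moment, and hence the smaller variance, at every \theta.

```python
import itertools

n = 4
outcomes = list(itertools.product([0, 1], repeat=n))

def E(T, theta):
    # expectation of T over the 2^n outcomes of the Bernoulli sample
    return sum(theta**sum(x) * (1 - theta)**(n - sum(x)) * T(x) for x in outcomes)

mean_ = lambda x: sum(x) / n        # the estimate Y_ij = i/n of the text
first = lambda x: x[0]              # another unbiased estimate of theta

for theta in (0.1, 0.3, 0.5, 0.9):
    assert abs(E(mean_, theta) - theta) < 1e-12   # both are unbiased
    assert abs(E(first, theta) - theta) < 1e-12
    print(f"theta = {theta}: E[mean^2] = {E(lambda x: mean_(x)**2, theta):.4f}"
          f"  <=  E[X1^2] = {E(lambda x: first(x)**2, theta):.4f}")
```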
8.6 Stochastic Programming Problems

The simplest mathematical programming problem is the linear programming problem, in which we optimize a linear function of a finite number of variables satisfying a system of linear inequality constraints. The theory of linear programming has been developed extensively and has found application in fields as diverse as engineering, economics, and agriculture. In statistics, especially in regression problems, the linear programming problem enters quite naturally. These programming problems have been extended in several directions; there are complete solutions to problems in which a quadratic functional or a convex functional of the variables is to be optimized under linear inequality constraints. There is extensive literature on these mathematical programming problems.
Most of the earlier work in programming considered mainly deterministic models. However, there are situations in applications in which stochastic elements enter the models, and they may enter in several ways. The functionals may involve random variables, the constants in the functionals may have a general stochastic behavior, or the inequality constraints may be required to hold only with a preassigned probability. This area of mathematical programming is called probabilistic or stochastic programming. In this section a few examples are given to illustrate the problems of stochastic programming; we consider especially their connection with problems discussed so far in this book. For a comprehensive account of probabilistic programming, the reader should consult the book by Vajda (1972), where many references to recent stochastic problems are also given.

The linear programming problem is concerned with finding the minimum of
\[
c'x \tag{8.6.1}
\]
with constraints
\[
Ax = b \tag{8.6.2}
\]
and
\[
x \ge 0, \tag{8.6.3}
\]
where x is an m-dimensional vector, A is an n x m matrix, and b and c are given vectors. The solution to the problem is obtained by considering the extreme points of the convex set given by (8.6.2) and (8.6.3); the minimum can then be found by evaluating the functional (8.6.1) over these extreme points. Often the constraint (8.6.2) is given in terms of inequalities. Well-known procedures, such as the simplex method developed by Dantzig, are available to obtain numerical solutions.

Suppose now that c is random. One approach to the solution of the problem is to find the minimum of E(c)'x over the set defined by the same constraints. Recently, Rozanov (1974) suggested the following criterion for this stochastic programming problem. Let the value of the objective function at an extreme point of the convex set generated by the restrictions be denoted by the random variable Z_i, and assume that the distribution of Z_i is known. The Rozanov criterion is then to
\[
\text{minimize } P(Z_i) \text{ over } i. \tag{8.6.4}
\]
Numerical procedures resembling the simplex method of deterministic linear programming have been developed by Dantzig (1974).
We consider below the problem of chance-constrained programming, in which the constraints are satisfied with specified probabilities. Many procedures of chance-constrained programming have been developed by Charnes et al. (1971).

Example 8.6.1 The problem is to minimize
\[
E(X) \tag{8.6.5}
\]
subject to the constraints
\[
P\{X < Y\} \le \beta \tag{8.6.6}
\]
and
\[
X \ge 0, \tag{8.6.7}
\]
where \beta is a specified probability level. For simplicity assume that Y is a continuous random variable having the probability density
\[
f(y) = \begin{cases} 2y, & 0 < y < 1, \\ 0, & \text{elsewhere}. \end{cases} \tag{8.6.8}
\]
Assume that f(x) is the probability density of X on (0, 1). Then (8.6.5) means that we minimize
\[
\int_0^1 x f(x)\,dx \tag{8.6.9}
\]
subject to constraints (8.6.6) and (8.6.7). With the assumption (8.6.8) we have
\[
P(X < Y) = \int_0^1 \Big[\int_x^1 2y\,dy\Big] f(x)\,dx = 1 - \int_0^1 x^2 f(x)\,dx.
\]
Hence the constraint (8.6.6) becomes
\[
\int_0^1 x^2 f(x)\,dx \ge 1 - \beta. \tag{8.6.10}
\]
The optimizing problem discussed above is a linear moment problem and can be solved easily by the methods of Chapter IV, wherein the solution of a general class of linear moment problems was developed.
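On a grid, the moment problem (8.6.9)-(8.6.10) is again a small linear program. In this sketch the grid, the value \beta = 0.2, and the use of scipy.optimize.linprog are our assumptions; the optimal distribution puts mass 1 - \beta at x = 1 and mass \beta at x = 0, giving minimum E(X) = 1 - \beta.

```python
import numpy as np
from scipy.optimize import linprog

beta = 0.2
xs = np.linspace(0.0, 1.0, 1001)                 # support grid for X
res = linprog(
    c=xs,                                        # minimize E[X] = sum x p(x)
    A_ub=[-(xs**2)], b_ub=[-(1.0 - beta)],       # E[X^2] >= 1 - beta, (8.6.10)
    A_eq=[np.ones_like(xs)], b_eq=[1.0],         # total probability mass 1
    bounds=(0.0, None),
)
print("minimum E[X] on the grid:", res.fun)      # ~ 1 - beta = 0.8
```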
8.7 Dynamic Programming Model of Patient Care

The process of patient care, whether in the operating room, the recovery room, or an outpatient clinic, exhibits elements of a control process. The physician and
the nurse provide controls to guide the patient to homeostasis with the help of drugs, instruments, and equipment. To understand the basic structure of the process, we consider only one aspect of patient care, the operating room. The example considered is that of the administration of anesthesia to the patient, and the control problem is stated in that terminology. For practical application, the discrete analog of the control process is considered; for details, see Rustagi (1968).

Consider the status of the patient in terms of monitored physiological and other variables, given by the vector X(t) at time t, and let t = 0, 1, 2, \dots, T be the times at which the patient is observed. Here X(t) = (X_1(t), \dots, X_m(t))', where X_1(t) is the systolic blood pressure, X_2(t) the diastolic blood pressure, X_3(t) the pulse rate, X_4(t) the body temperature, and so on. Let the actions taken by the physician and nurse be represented by the vector Y(t) at time t. Again assume that Y(t) = (Y_1(t), \dots, Y_n(t))'. We may have Y_1(t) as the amount of medication (e.g., neosynephrine, atropine sulphate, Demerol), Y_2(t) the amount of anesthetic, Y_3(t) the amount of blood given, and so on. Supposing that the rate of change of the patient vector depends on the current patient status and the action taken, we have
\[
X(t+1) - X(t) = f[X(t), Y(t), t], \qquad t = 0, 1, 2, \dots, T, \tag{8.7.1}
\]
with
\[
X(0) = c. \tag{8.7.2}
\]
Taking the vector Y(t) to be (n + 1)-dimensional by including t, we can write Eq. (8.7.1) as follows:
\[
X(t+1) = g(X(t), Y(t)). \tag{8.7.3}
\]
Suppose the actions are taken so as to minimize
\[
\sum_{t=1}^{T} \|X(t) - Z(t)\|^2, \tag{8.7.4}
\]
where \|a - b\| denotes the "distance" between two vectors a and b, and Z(t) is the desired value of the patient vector at time t. The actions are to be taken so that the patient is brought back to homeostasis; that is, the deviation of the patient vector X(t) at any time t from the desired value Z(t) is used to choose the action Y(t). Hence
\[
Y(t) = h[X(t) - Z(t)]. \tag{8.7.5}
\]
That is, Y(t) is a function of X(t) and Z(t), so that the objective criterion (8.7.4) reduces to minimizing
\[
\sum_{t=1}^{T} p_t[X(t), Y(t)], \tag{8.7.6}
\]
where p_t is such that p_t[X(t), h(X(t) - Z(t))] = \|X(t) - Z(t)\|^2. Suppose now that (8.7.6) is minimized subject to the constraints (8.7.2) and (8.7.3), and let the minimum be denoted by \Phi_T(c). It can then be seen, by using the optimality principle of dynamic programming from Chapter III, that
\[
\Phi_T(c) = \min_{Y(1)} \big\{ p_1[X(1), Y(1)] + \Phi_{T-1}\big(g[X(1), Y(1)]\big) \big\}. \tag{8.7.7}
\]
Equation (8.7.7) can then be solved by the numerical techniques of dynamic programming. The usual reduction of the optimization to one over only the vector Y(1), instead of Y(1), \dots, Y(T), provides the computational simplicity available through the dynamic programming approach. A sketch of this recursion for a simple scalar model is given below.

The model proposed above is fairly general and can be applied to various other situations in patient care, such as providing care in public health programs. In the control problems discussed so far, we assumed that the constraints are completely known. Here, however, the constraints may have to be estimated from the available data, thus introducing problems of a statistical nature into the study of dynamic programming solutions.
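In the sketch below, all model ingredients (the dynamics g, the drift, the target Z(t) = 0, and the grids) are illustrative assumptions of ours and are not taken from Rustagi (1968); it simply runs the backward induction (8.7.7) for a scalar patient state.

```python
import numpy as np

T = 6
states = np.linspace(-2.0, 2.0, 41)               # discretized patient state X(t)
actions = np.linspace(-1.0, 1.0, 21)              # discretized control Y(t)
Z = np.zeros(T + 1)                               # homeostasis target Z(t) = 0
drift = 0.3                                       # untreated state drifts upward

def g(x, y):                                      # assumed dynamics, Eq. (8.7.3)
    return x + drift + y

def nearest(x):                                   # project onto the state grid
    return int(np.argmin(np.abs(states - x)))

Phi = np.zeros(len(states))                       # zero stages to go: no cost
policy = []
for t in range(T, 0, -1):                         # backward induction, (8.7.7)
    new_Phi = np.empty_like(Phi)
    best = np.empty(len(states))
    for i, x in enumerate(states):
        costs = [(g(x, y) - Z[t])**2 + Phi[nearest(g(x, y))] for y in actions]
        j = int(np.argmin(costs))
        new_Phi[i], best[i] = costs[j], actions[j]
    Phi, policy = new_Phi, [best] + policy

x = 1.5                                           # initial patient state c
for t in range(T):
    y = policy[t][nearest(x)]
    x = g(x, y)
    print(f"t = {t+1}: action {y:+.2f}, state {x:+.2f}")
```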
References

Barlow, R. E. (1965). Bounds on integrals with applications to reliability problems, Ann. Math. Statist. 36, 565-574.
Barlow, R. E., and Proschan, F. (1967). Mathematical Theory of Reliability. Wiley, New York.
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Statist. 41, 164-171.
Beckenbach, E. F., and Bellman, R. (1961). Inequalities. Springer-Verlag, Berlin.
Bellman, R. (1967). Introduction to the Mathematical Theory of Control Processes, Vol. 1. Academic Press, New York.
Bellman, R., and Roth, R. (1969). Curve fitting by segmented straight lines, J. Amer. Statist. Ass. 64, 1079-1084.
Bellman, R., Kashef, B., and Vasudevan, R. (1974). Mean square spline approximation, J. Math. Anal. Appl. 45, 47-53.
Brown, B. W., Jr. (1959). Some problems of the Spearman estimator in bioassay, Tech. Rep. No. 6, Department of Statistics, Univ. of Minnesota, Minneapolis, Minnesota.
Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems, Ann. Math. Statist. 42, 855-903.
Charnes, A., Cooper, W. W., and Kirby, M. J. L. (1971). Chance-constrained programming: an extension of a statistical method. In Optimizing Methods in Statistics (J. S. Rustagi, ed.), pp. 391-402. Academic Press, New York.
Dantzig, G. B. (1974). On a convex programming problem of Rozanov, Appl. Math. Optimization 1, 189-192.
Fan, K., and Lorentz, G. G. (1954). An integral inequality, Amer. Math. Monthly 61, 626-631.
Finney, D. J. (1964). Statistical Method in Biological Assay. Charles Griffin, London.
Francis, R. L. (1971). On relationships between the Neyman-Pearson problem and linear programming. In Optimizing Methods in Statistics (J. S. Rustagi, ed.), pp. 259-280. Academic Press, New York.
Francis, R. L., and Meeks, H. D. (1972). On saddle point conditions and the generalized Neyman-Pearson problem, Aust. J. Statist. 14, 73-78.
Francis, R. L., and Wright, G. (1969). Some duality relationships for the generalized Neyman-Pearson problem, J. Optimization Theory Appl. 4, 394-412.
Godwin, H. J. (1964). Inequalities on Distribution Functions (Griffin's Statistical Monographs and Courses). Hafner, New York.
Isii, K. (1964). Inequalities of the types of Chebyshev and Cramér-Rao and mathematical programming, Ann. Inst. Statist. Math. 16, 277-293.
Johnson, E. A., and Brown, B. W., Jr. (1961). The Spearman estimator for serial dilution assays, Biometrics 17, 79-88.
Karlin, S. (1959). Mathematical Methods and Theory in Games, Programming, and Economics, Vol. 2, pp. 210-214. Addison-Wesley, Reading, Massachusetts.
Karlin, S., and Novikoff, A. (1963). Generalized convex inequalities, Pac. J. Math. 13, 1251-1279.
Karlin, S., and Studden, W. J. (1966). Optimal experimental designs, Ann. Math. Statist. 37, 783-815.
Karlin, S., Proschan, F., and Barlow, R. E. (1961). Moment inequalities of Pólya frequency functions, Pac. J. Math. 11, 1023-1033.
Krafft, O. (1970). Programming methods in statistics and probability theory. In Nonlinear Programming (J. Rosen, O. Mangasarian, and K. Ritter, eds.), pp. 426-446. Academic Press, New York.
Marshall, A. W., and Proschan, F. (1965). An inequality for convex functions involving majorization, J. Math. Anal. Appl. 12, 87-90.
Olkin, I., and Pratt, J. (1958). A multivariate Tchebycheff inequality, Ann. Math. Statist. 29, 226-236.
Rozanov, Yu. A. (1974). Stochastic linear programming, Colloquium talk at The Ohio State Univ., Columbus, Ohio.
Rustagi, J. S. (1957). On minimizing and maximizing a certain integral with statistical applications, Ann. Math. Statist. 28, 309-328.
Rustagi, J. S. (1968). Dynamic programming model of patient care, Math. Biosci. 3, 141-149.
Savage, I. R. (1961). Probability inequalities of the Tchebycheff type, J. Res. Nat. Bur. Stand. 65B, 211-222.
Schoenberg, I. J. (ed.) (1969). Approximations with Special Emphasis on Spline Functions. Academic Press, New York.
Vajda, S. (1972). Probabilistic Programming. Academic Press, New York.
Vyrsan, K. (1967). The Neyman-Pearson lemma and linear programming (in Russian), Rev. Roumaine Math. Pures Appl. 12, 279-293.
Wagner, D. H. (1969). Nonlinear functional versions of the Neyman-Pearson lemma, SIAM Rev. 11, 52-65.
Wagner, H. M. (1959). Linear programming and regression analysis, J. Amer. Statist. Ass. 54, 206-212.
Wagner, H. M. (1962). Nonlinear regression with minimal assumptions, J. Amer. Statist. Ass. 57, 512-518.
Wold, S. (1974). Spline functions in data analysis, Technometrics 16, 1-11.
Index

Page numbers given for individual authors refer to citations within the text. Complete references may be found at the end of each chapter.

A
Admissible experiment, 166-167
Aoki, M., 199
Approximations via dynamic programming, 213-216
Arrow, K. J., 187

B
Backward induction, 56, 186
Barlow, R. E., 202, 206
Bayes rule, 182
Bellman, R. E., 123, 128, 129, 173, 177, 187, 199, 203, 213, 214, 215
Bellman's functional equation, 47
Bioassay problem, 65
Blackwell, D., 177, 181, 187
Bounds of mean of the largest order statistic, 37
Brachistochrone problem, 27
Brown, B. W., Jr., 203, 213

C
Calculus of variations, 16, 17; fundamental lemmas of, 25
Charnes, A., 203, 227
Chernoff, H., 76, 97, 111, 134, 149, 157, 158, 160, 173, 193, 197, 199
Chernoff-Savage statistic, 111
Chow, Y. S., 193
Controlled Markov chains, 177-181
Convexity, 71-76
Cumulative distribution function, 19; degree, 81

D
Danskin, J., 123
Dantzig, G. B., 97, 226
David, H. A., 19
DeGroot, M. H., 181
Derman, C., 177
Designs: normalized, 138; spectrum, 138
Deterministic control process, 173-177
Dirichlet problem, 196
D-optimality, 148, 149; local, 151
Dynamic equations, 173
Dynamic programming, 175, 179, 186, 229; functional equations, 51; and patient care, 227

E
Elfving, G., 134, 139, 143, 157, 159
Euler-Lagrange equation (also Euler equation), 26, 114, 152; necessary conditions for an extremum, 22; special cases, 26
Extremals: sufficiency conditions, 42; with variable endpoints, 31

F
Failure rate (also hazard rate), 203; decreasing, 204; increasing, 204
Feder, P. I., 150, 157
Federov, V. V., 134, 149, 167
Ferguson, T. S., 181, 187
Finney, D. J., 203, 213
Fisher, R. A., 133
Fisher information matrix, 158
Fisher-Yates-Terry-Hoeffding test statistic, 112
Francis, R. L., 203, 217, 220
Functional: homogeneous, 73; subadditive, 73
Function spaces, 72

G
Gaussian curvature, 118
Gaussian process, 186
Gauss-Markoff theorem, 136
Generalized Neyman-Pearson lemma, 98
General stopping problem, 193
Girshick, M. A., 181, 187
Goodman, L. A., 67
Greville, T. N. E., 161

H
Hadley, G., 18
Hahn-Banach theorem, 74
Hájek, J., 112
Hamiltonian function, 39
Hamilton-Jacobi-Bellman equation, 47
Harris, B., 76
Hessian, 42
Hodges, J. L., 115
Howard, R., 177

I
Inequalities: Jensen's, 205; Young's, 17
Information: Fisher, 21; matrix, 136; Shannon, 21
Isaacson, S. L., 65, 117
Isii, K., 224

K
Kadane, J. B., 130, 132
Karlin, S., 76, 98, 123, 125, 134, 149, 167, 205, 208, 217, 221
Kemp, C. M., 18
Kiefer, J., 134, 147, 166
Krafft, O., 203, 217, 220, 222
Kushner, H., 51, 173, 177, 199

L
Lagrange multipliers, 36
Lagrangian, 16, 20
Laplace transformation methods, 177
Least-squares criterion, 135, 168
Lehmann, E. L., 115, 117
Likelihood ratio, 95
Linear moment problem, 227
Linear optimality criterion, 149

M
Markov chains: decision chain, 55; finite, 178; homogeneous, 178; infinite, 177
Mathematical economics application, 128
Mathematical programming and statistics, 216-225
Maximization: of a determinant, 118; of an expectation, 78; global, 17, 44; of information matrix, 154; nonlinear problem, 109; of Shannon information, 38; strong local, 18; weak local, 18
Maximum principle, 57-59; and dynamic programming, 60, 61
Meeks, H. D., 203, 217, 220
Mezaki, R., 150, 157
Minimax criterion, 182
Minimization: of an expectation, 78; global, 17, 43; nonlinear problem, 99; strong local, 18; weak local, 18

N
Neyman, J., 115
Neyman-Pearson lemma, 95-97, 116, 118, 125, 127, 203, 216, 217; discrete form, 131; miscellaneous applications, 123-132
Neyman-Pearson problem: dual, 218; primal, 218
Normed linear space, 72

O
Optimal designs using splines, 165-169
Optimality criterion: A-optimality, 143; D-optimality, 138; minimax, 144; Tchebycheff, 144
Optimal policy, 174
Order statistic, 19

P
Piecewise continuous function, 17
Posterior distribution, 182
Prior density, 182
Probability density function, 19
Proschan, F., 202, 206

R
Range, 19
Regression analysis, 134-137
Regression experiments, 137
Reiter, S., 76
Reliability, 205
Rivlin, T. J., 161
Roth, R., 215
Rozanov, Yu. A., 226
Rubin, H., 65
Rustagi, J. S., 76, 208, 217, 228

S
Sampling distribution function, 111
Savage, I. R., 111
Scheffé, H., 97
Schoenberg, I. J., 161, 203, 213
Secretary problem, 192
Semicontinuous function: lower, 73; upper, 73
Sequential decision theory, 187-192; stopping rule, 187; terminal decision, 187
Sequential test of hypotheses, mean of a normal, 189
Shapley, L. S., 76
Shohat, J., 76
Šidák, Z., 112
Skibinsky, M., 76
Spearman estimator, 210; efficiency, 211; variance, 211
Spline functions, 161-165; definition, 161; knots, 161; natural, 162
Statistical decision theory, 181-186
Stefan problem (also free boundary problem), 196
Stochastic control problem, 198-200
Stochastic programming, 225-227
Studden, W. J., 76, 98, 134, 149, 161, 165, 166, 167, 205

T
Tamarkin, J., 76
Tests of hypotheses: critical function, 95; efficiency, 110; nonrandomized, 95; randomized, 95; relative asymptotic efficiency, 116; type A regions, 115; type C regions, 117; type D regions, 115, 117, 121
Total variation, definition, 32

U
Unbiasedness: of critical regions, 121; of estimates, 224

V
Vajda, S., 203, 226
Van Arman, D. J., 165, 166, 167
Variation: definition, 24; of a functional, 24
Variational derivative, 24
Vyrsan, K., 217

W
Wagner, H. M., 217
Wald, A., 97
Weak local extremum: necessary conditions, 26, 34; transversality conditions, 35
Widder, D., 177
Wiener-Hopf integral equation, 30
Wiener process, 186, 191-192, 193; definition, 191; drift, 192
Wilcoxon test, efficiency, 115
Wilcoxon-Mann-Whitney statistic, 107; definition, 21; lower and upper bounds of variance of, 21; variance of, 21, 105
Wolfowitz, J., 134
Wright, G., 217

Y
Young, L. C., 18