This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
on NND(s) is called upper semicontinuous when the upper level sets {(f> > a} — {C > 0 : $(C) > a} are closed, for all a G R. This definition conforms with the one for the matrix-valued information matrix mapping CK, in part (a) of Theorem 3.13. There we met an alternative sequential criterion (b) which here takes the form liiriw--^
5.8. INFORMATION FUNCTIONS
119
5.7. SEMICONTINUITY AND REGULARIZATION Lemma. For every isotonic function $ : NND(s) —> R, the following three statements are equivalent: a. (Upper semicontinuity) The level sets
are closed, for all a e R. b. (Sequential semicontinuity criterion) For all sequences (Cm}m>\ in NND(^) that converge to a limit C we have
c. (Regularization) For all C,D e NND(s), we have
Proof. This is a special case of the proof of Theorem 3.13, with s = 1 and CK =
120
CHAPTERS: REAL OPTIMALITY CRITERIA
set, thus visualizing it geometrically (Section 5.10). We establish that the set of all information functions is closed under appropriate functional operations, providing some kind of reassurance that we have picked a reasonable class of criteria (Section 5.11). We introduce polar information functions (Section 5.12) which provide the basis for the duality discussion of the optimal design problem. We study the composition of an information function with the information matrix mapping (Section 5.14), serving as the objective function for the optimal design problem (Section 5.15). 5.9. UNIT LEVEL SETS There exists a bewildering multitude of information functions. We depict this multitude more visibly by associating with each information function
The following Theorem 5.10 singles out the characteristic properties of such sets. In general, we say that a closed convex subset C C NND(s) is bounded away from the origin when it does not contain the null matrix, and that it recedes in all directions of NND(s) when
The latter property means that C + NND(s) C C for all C e C, that is, if the cone NND(^) is translated so that its tip comes to lie in C € C then all of the translate is included in the set C. Exhibit 5.1 shows some unit level sets C = {®p > 1} in Rj. Given a unit level set C C NND(^), we reconstruct the corresponding information function as follows. The reconstruction formula for positive definite matrices is
Thus
For S > 0, nothing has changed as compared to (1) since then (SC) +
5.9. UNIT LEVEL SETS
121
EXHIBIT 5.1 Unit level sets. Unit contour lines of the vector means 3>p of Section 6.6 in IR+, for p = -co, —1,0,1/2,1. The corresponding unit level sets are receding in all directions of R+, as indicated by the dashed lines.
NND(s) = S(C + (l/S)NND(s)) - SC. But for 8 = 0, condition (2) turns into C 6 (OC) + NND(s) = NND(s), and holds for every matrix C > 0. Hence the supremum in (2) is always nonnegative. The proof of the correspondence between information functions and unit level sets is tedious though not difficult. We ran into similar considerations while discussing the Elfving norm p(c) in Section 2.12, and the scale factor p2(c) in Section 4.10. The passage from functions to sets and back to functions must appear at some point in the development, either implicitly or explicitly. We prefer to make it explicit, now.
122
CHAPTERS: REAL OPTIMALITY CRITERIA
5.10. FUNCTION-SET CORRESPONDENCE Theorem. The relations
define a one-to-one correspondence between the information functions
A unit level set C e F is
o finite i positively homogeneous ii superadditive iii nonnegative
0 I II III
iv v
IV V
nonconstant upper semicontinuous
a subset of NND(s) bounded away from the origin convex receding in all directions of NND(s) nonempty closed
In the first part of the proof, we assume that an information function
5.10. FUNCTION-SET CORRESPONDENCE
123
In order to verify the second equality in the theorem we fix C € NND(s) and set a = sup{8 > 0 : C € (8C) + NND(^)}. It is not hard to show that (C) = 0 if and only if a = 0. Otherwise
we learn that
iii. Since 8 = 0 is in the set over which the defining supremum is formed, we get 4>(C) >0. ii. In order to establish superadditivity of $, we distinguish three cases. In case
and
Thus we have
and therefore
124
CHAPTERS: REAL OPTIMALITY CRITERIA
iv. Since C is nonempty there exists a matrix C € C. For any such matrix we have
To prove the direct inclusion, we take any matrix C e {<£ > a} and choose numbers Sm > 0 converging to
whence entails <;
Converselv. if C G aC then the definition of <6 immediatelv for all This proves
In particular, a = 1 yields the first formula in the theorem, Altogether the mapping <j> H-» {<£ > 1} from <£ onto F is bijective, with inverse mapping given by the second formula in the theorem. The cone NND(^) includes plenty of nonempty closed convex subsets that are bounded away from the origin and recede in all directions of NND(s), and so there are plenty of information functions. The correspondence between information functions and their unit level sets matches the well-known correspondence between norms and their unit balls. However, the geometric orientation is inverted. Unit balls of norms include the origin and exclude infinity, while unit level sets of information functions exclude the origin and include infinity, in a somewhat loose terminology. The difference in orientation also becomes manifest when we introduce polar information functions in Section 5.12. First we overview some functional operations for constructing new information functions from old ones.
5.11. FUNCTIONAL OPERATIONS New information functions can be constructed from given ones by elementary operations. We show that the class of all information functions is closed under formation of nonnegative combinations and least upper bounds of fi-
5.12. POLAR INFORMATION FUNCTIONS AND POLAR NORMS
125
nite families of information functions, and under pointwise infima of arbitrary families. Given a finite family of information functions, fa,...,
if at least one of the coefficients 5, > 0 is positive. In particular, sums and averages of finitely many information functions are information functions. The pointwise minimum is also an information function,
Moreover, the pointwise infimum of a family >/ with / ranging over an arbitrary index set J is an information function,
unless it degenerates to the constant zero. Upper semicontinuity follows since the level sets (inf/ e j
where denotes the class of all information functions. The set over which the infimum is sought is nonempty, containing for instance the sum ^,-<»i fall is also bounded from below, for instance by fa. Therefore the infimum cannot degenerate, and lub,<m
126
CHAPTERS: REAL OPTIMALITY CRITERIA
best thought of as the largest function satisfying the (concave version of the generalized) Holder inequality,
For the definition, it suffices that <£ is defined and positive on the open cone PD(s). DEFINITION. For a function
That the function >°° so defined satisfies the Holder inequality is evident from the definition for C > 0. For C > 0, it follows through regularization provided
for all for all (
Sym(s) Sym(s), and
A norm <£ has polar function <£° defined by
This leads to the (convex version of the generalized) Holder inequality,
Alternatively the polar function <j>° is the smallest function to satisfy this inequality. The principal distinction between a norm and an information function is, of course, that the first is convex and the second is concave. Another difference, more subtle but of no less importance, is that norms are defined everywhere in the underlying space, while information functions have a proper subset for their domain of definition. One consequence concerns continuity. A norm is always continuous, being a convex function on a linear space which is everywhere finite. An information function is required, by definition, to be upper semicontinuous.
5.13. POLARITY THEOREM
127
Semicontinuity is not an automatic consequence of the other four properties that constitute an information function. On the other hand, an information function is necessarily isotonic, by Lemma 5.4, while a norm is not. Another feature emerges from the function-set correspondence of Section 5.10. An information function > is characterized by its unit level set {<£ > 1}. A norm <£ corresponds to its unit ball {<£ < 1}. The distinct orientation towards infinity or towards the origin determines which version of the Holder inequality is appropriate, as well as the choice of an infimum or a supremum in the definitions of the polars. In order that our notation indicates this orientation, we have chosen to denote polars of concave and convex functions by a superscript oo or 0, respectively. Some matrix norms <£ have the property that
In this case, the similarity between the polars of information functions and the polars of norms becomes even more apparent. The next theorem transfers the well-known polarity relationship from norms to information functions. Polars of information functions are themselves information functions. And polarity (nomen est omen} is an idempotent operation. The second polar recovers the original information function. 5.13. POLARITY THEOREM Theorem. For every function <£ : NND(s) —» R that is positively homogeneous, positive on PD(s), and upper semicontinuous, the polar function
Proof. The definition of the polar function is <£°° — mfoo'/'c* where the functions ^c(I>) - (C,D)/
where the set T> = {C > 0 : trace C — 1} is compact. Owing to upper semicontinuity, the supremum of > over T> is attained and finite. Therefore the value
128
CHAPTERS: REAL OPTIMALITY CRITERIA
In the second part of the proof, we take
It suffices to show that hen the function-set correspondence of Theorem 5.10 tells us that <£ and ( coincide. We claim that the two sets C°° and C satisfy the relation
For the direct inclusion, take a matrix D € C°°, that is, D > 0 and 4>°°(D) > 1. For C <E C, the Holder inequality then yields (C, D) >
Formulae (1) and (2) entail the direct inclusion The converse inclusion is proved by showing that the complement of C is included in the complement of C0000, based on a separating hyperplane argument. We choose a matrix E e NND(s) \ C. Then there exists a matrix D e Sym(s) such that the linear form {-,£>) strongly separates the matrix E and the set C. That is, for some y e R, we have
Since the set C recedes in all directions of NND(.s), the inequality shows in particular that for all matrices C e C and A > 0 we have {£, D) < (C+8A, D) for all 8 > 0. This forces (A,D) > 0 for all A > 0, whence Lemma 1.8 necessitates D > 0. By the same token, 0 < (E,D) < y. Upon setting D = D/v the. sfrnna sp.naration of F. and C takes the form
Now (1) gives D € C°°, whence (2) yields and the prooi is complete.
Therefore
5.14. COMPOSITIONS WITH THE INFORMATION MATRIX MAPPING
129
Thus a function $ on NND(s) that is positively homogeneous, positive on PD(s), and upper semicontinous is an information function if and only if it coincides with its second polar. This suggests a method of checking whether a given candidate function
This looks like being a rather roundabout way to identify an information function. However, for our purposes we need to find the polar function anyway. Hence, the quasi-linearization method of computing polar functions and to obtain, as a side product, the information function properties is as efficient as can be. Another instance of quasi-linearization is the definition of information matrices in Section 3.2, CK(A) = minL€Kix*. LK=is LAL1. The functional properties in Theorem 3.13 follow immediately from the definition, underlining the power of the quasi-linearization method. The theory centers around moment matrices A e NND(/c). Let K'6 be a parameter system of interest with a coefficient matrix K that has full column rank s. We wish to study A H-> <£ o CK(A), the composition of the information matrix mapping CK : NND(/c) —> NND(s) of Section 3.13 with an information function <£ on NND(s) of Section 5.8. This composition turns out to be an information function on the cone NND(fc) of k x k matrices. 5.14. COMPOSITIONS WITH THE INFORMATION MATRIX MAPPING Theorem. Let the k x s coefficient matrix K have full column rank s. For every information function $ on NND(s), the composition with the information matrix mapping, <j> o CK, is an information function on NND(fc). Its polar is given by
Proof. In the first part of the proof, we verify the properties of an information function as enumerated in the proof of Theorem 5.10. 0. Clearly
130
CHAPTERS: REAL OPTIMALITY CRITERIA
iii. Nonnegativity of CK and of
for all From Theorem 3.13. we know that the matrices satisfy and converge Monotonicity of
To see this, we write the definition i Making use of regularization, we obtain the inequality chain
Hence equality holds throughout this chain, thus proving (1). The unit level set of the composition <£ o CK is
Indeed, for such that For all
with Conversely, if for. then monotomcitv and
we get there exists a matrix imolies we get
with
5.15. THE GENERAL DESIGN PROBLEM
131
To see this, we estimate KCK' (see Section 3.21). Monotonicitv leads to the lower bound (A.B) Now (KCK'B}. The lower bound is attained at KCK' proves (3). Finally we apply (1) to
A first use of the polar function of the composition
This calls for maximizing information as measured by the information function $, in the set M of competing moment matrices. The optimal value of this problem is, by definition,
A moment matrix M e M is said to be formally <j>-optimal for K'B in M when > (CK(M)) attains the optimal value v (<£). If, in addition, the matrix M lies in the feasibility cone A(K), then M is called
132
CHAPTERS: REAL OPTIMALITY CRITERIA
However, an optimal design is not an end in itself, but an aid to identifying efficient practical designs. The appropriate notion of efficiency is the following. DEFINITION. The $-efficiency
of a design £ e H is defined by
It is a number between 0 and 1, and gives the extent (often quoted in percent) to which the design £ exhausts the maximum information v(
Lemma. If the set M of competing moment matrices is compact, then there exists a moment matrix M G M that is formally <£-optimal for K' 0 in M, and the optimal value v(
5.17. SCALAR OPTIMALITY, REVISITED
133
Proof. By Theorem 5.14, the composition
as follows from the Holder inequality, the property z'CK(M)z = 0, the polarity equation, and strict monotonicity of
134
CHAPTERS: REAL OPTIMALITY CRITERIA
The concept of information functions becomes trivial, for scalar optimality. They are functions <£ on NND(l) = [0;oo) satisfying 0(y) =
Show that a linear function
for 5.5 Is an information function?
for
5.6 Show that neither of the matrix norms are isotonic on NND(s) [Marshall and olkin (1969), p. 170]. 5.7 Which properties must a function <£ on NND(s) have so that >(|C|) is a matrix norm on Sym(.s), where |C| = C+ + C_ is the matrix modulus of Section 6.7?
CHAPTER 6
Matrix Means
The classical criteria are introduced: the determinant criterion, the averagevariance criterion, the smallest-eigenvalue criterion, and the trace criterion. They are just four particular cases of the matrix means
positive
and
Each of these criteria reflects particular statistical aspects, to be discussed in Section 6.2 to Section 6.5. Furthermore, they form but four particular members of the one-dimensional family of matrix means
135
136
CHAPTER 6: MATRIX MEANS
6.2. D-CRITERION The determinant criterion
Indeed, in Section 3.5 the inverse C l of an information matrix was identified to be the standardized dispersion matrix of the optimal estimator for the parameter system of interest. Its determinant is called the generalized variance, and is a familiar way in multivariate analysis to measure the size of a dispersion matrix. This is the origin of the great popularity that the determinant criterion enjoys in applications. In a linear model with normality assumption, the optimal estimator K'0 for an estimable parameter system K'6 has distribution N^/ fl .( ff 2/ n ) C -i, with C = (K'M'KY1 and M = (\/ri)X'X. It turns out that the confidence ellipsoid for K'0 has volume inversely proportional to (det C)1/2. Hence a large value of det C secures a small volume of the confidence ellipsoid. This is also true for the ellipsoid of concentration which, by definition, is such that on it the uniform distribution has the same mean vector K'6 and dispersion matrix (ar2/ri)C~l as has K'0. For testing the linear hypothesis K'0 — 0, a uniform comparison of power leads to the Loewner ordering, as expounded in Section 3.7. Instead we may evaluate the Gaussian curvature of the power function, to find a design so that the F-test has good power uniformly over all local alternatives close to the hypothesis. Maximization of the Gaussian curvature again amounts to maximizing det C. Another pleasing property is based on the formula det(H'CH) = (det //2)(det C), with a nonsingular 5 x 5 matrix H. Suppose the parameter system K'0 is reparametrized according to H'K'6. This is a special case of iterated information matrices, for which Theorem 3.19 provides the identities
Thus the determinant assigns proportional values to Cx(A) and CKu(A), and
6.4. E-CRITERION
137
the two function induced orderings of information matrices are identical. In other words, the determinant induced ordering is invariant under reparametrization. It can be shown that the determinant is the only criterion for which the function induced ordering has this invariance property. Yet another invariance property pertains to the determinant function itself, rather than to its induced ordering. The criterion is invariant under reparametrizations with matrices H that fulfill det H = ±1, since then we have det CKH(^) = det CK(A), We verify in Section 13.8 that this invariance property is characteristic for the determinant criterion. 6.3. A-CRITERION Invariance under reparametrization loses its appeal if the parameters of interest have a definite physical meaning. Then the average-variance criterion provides a reasonable alternative. If the coefficient matrix is partitioned into its columns, K = (c\,... ,cs), then the inverse l/<£_i can be represented as
This is the average of the standardized variances of the optimal estimators for the scalar parameter systems c[6,...,c'sd formed from the columns of K. From the point of view of computational complexity, the criterion >_! is particularly simple to evaluate since it only requires the computation of the s diagonal entries of the dispersion matrix K'A~K. Again we can pass back and forth between the information point of view and the dispersion point of view. Maximizing the average-variance criterion among information matrices is the same as minimizing the average of the variances given above. 6.4. E-CRITERION The criterion $-00, evaluation of the smallest eigenvalue, also gains in understanding by a passage to variances. It is the same as minimizing the largest eigenvalue of the dispersion matrix,
Minimizing this expression guards against the worst possible variance among all one-dimensional subsystems z'K'O, with a vector z of norm 1. In terms of variance, it is a minimax approach, in terms of information a maximin approach. This criterion plays a special role in the admissibility investigations of Section 10.9.
138
CHAPTER 6: MATRIX MEANS
The eigenvalue criterion <£_oo is one extreme member of the matrix mean family <£p, corresponding to the parameter 6.5. T-CRITERION The other extreme member of the <(>p family is the trace criterion <j>\. By itself the trace criterion is rather meaningless. We have made a point in Section 5.1 that a criterion ought to be concave so that information cannot be increased by interpolation. The trace criterion is linear, and this is so weak that interpolation becomes legitimate. Yet trace optimality has its place in the theory, mostly accompanied by further conditions that prevent it from going astray. An example is Kiefer optimality of balanced incomplete block designs (see Section 14.9). The trace criterion is useless if the regression vectors x e X have a constant squared length c, say. Then the moment matrix Af (£) of any design £ € H satisfies
whence the criterion
Similarly
In these situations the criterion
The moments /u; = $<\.\\ t> dr for j = 2,4 attain the maximum value 1 if and only if the design is concentrated on the points ±1. Thus every
6.6. VECTOR MEANS
139
design T has at most two support points, ±1, whence a formally optimal moment matrix has rank at most equal to 2. No such moment matrix can be feasible for a three-dimensional parameter system. The weaknesses of the trace criterion are an exception in the matrix mean family <£p, with p e [-00; 1]. Theorem 6.13 shows that the other matrix means are concave without being linear. Furthermore they fulfill at least one of the conditions (b) or (c) of Lemma 5.16 for every formally optimal moment matrix to be feasible (see Theorem 7.13). 6.6. VECTOR MEANS Before turning to matrix means, we review the vector means 4>p on the space Rs. The nonnegative orthant W+ = [0;oo)* is a closed convex cone in R5. Its interior is formed by those vectors A e Rs that are positive, A > 0. It is convenient to (1) define the means <J>P for positive vectors, and (2) extend the definition to the closed cone Us+ by continuity, and (3) cover all of the space Us by a modulus reduction. For positive vectors, A = ( A i , . . . , A s )' > 0, the vector mean 4>p is defined by
For vectors A e Us+ with at least one component 0, continuous extension yields
For arbitrary vectors, A e Rs, we finally define
The definition of <J>P extends from positive vectors A > 0 to all vectors A £ RJ in just the same way for both p > 1 and for p < 1. It is not the definition, but the functional properties that lead to a striking distinction. For p > 1, the vector mean 4>p is convex on all of the space R*; for p < 1, it is concave but only if restricted to the cone IR+. We find it instructive to follow up this distinction, for the vector means as well as for the matrix means, even
140
CHAPTER 6: MATRIX MEANS
though for the purposes of optimal designs it suffices to investigate the means of order p e [-00; 1], only. If A = als is positively proportional to the unity vector ls then the means <&p do not depend on p, that is, 4>p(a75) = a for all a > 0. They are standardized in such way that
Since the vector means 3>p are invariant under permutations of their argument vector A, the order in which A(C) assembles the eigenvalues of C does not matter. Hence
6.7. MATRIX MEANS
141
For integer values p, the meaning of Cp is the usual one. In general, we obtain trace Cp trace For positive definite matrices, C 6 PD(s), the matrix mean is represented by
For singular nonnegative definite matrices, C e NND(s) with rank C < s, we have
For arbitrary symmetric matrices, C e Sym(s), we finally get
where 1C I, the modulus of C, is denned as follows. With eigenvalue decomthe positive part C+ and the negative part C_ are position given by
They are nonnegative definite matrices, and fulfill C = C+ - C_. The modulus of C is defined bv |C| = C+ + C_. It is nonneeative definite and thus substansatisfies tiating (3) It is an immediate consequence of the definition that the matrix means q>p on the space Sym(.s) are absolutely homogeneous, nonnegative (even positive if p 6 (0;ooj), standardized, and continuous. This provides all the properties that constitute a norm on the space Sym(^), or an information function on the cone NND(s), except for subadditivity or superadditivity. This is where the domain of definition of (f>p must be restricted for sub- and superadditivity to hold true, from Sym(s) for p > 1, to NND(s) for p < 1. We base our derivation on the well-known polarity relation for the vector means <J>P and the associated Holder inequality in Rs, using the quasilinearization technique of Section 5.13. The key difficulty is the transition from the Euclidean vector scalar product (A,/x) = X'fi on Rs, to the Euclidean matrix scalar product ( C , D ) = trace CD on Sym(s). Our approach
142
CHAPTER 6: MATRIX MEANS
uses a few properties of vector majorization that are rigorously derived in Section 6.9. We begin with a lemma that provides a tool to recognize diagonal matrices. 6.8. DIAGONALITY OF SYMMETRIC MATRICES Lemma. Let C be a symmetric s x s matrix with vector 8(C) = (en,..., css)' of diagonal elements and vector A(C) = ( A i , . . . , A s ) ' of eigenvalues. Then the matrix C is diagonal if and only if the vector S(C) is a permutation of the vector A(C). Proof. We write c = 8(C) and A = A(C), for short. If C is diagonal, C = Ac, then the components c;; are the eigenvalues of C. Hence the vector c is a permutation of the eigenvalue vector A. This proves the direct part. For the converse, we first show that Ac may be obtained from C through an averaging process. We denote by Sign(s) the subset of all diagonal s x s matrices Q with entries qy} € {±1} for / < s, and call it the sign-change group. This is a group of order 2s. We claim that the average over the matrices QCQ with Q e Sign(s) is the diagonal matrix Ac,
To this end let e, be the ; th Euclidean unit vector in IR5. We have e-QCQej = quqjjCij. The diagonal elements of the matrix average in (1) are then
The off-diagonal elements are, with i ^ /,
Hence (1) is established. Secondly, for the squared matrix norm, we verify the invariance property
6.8. DIAGONALITY OF SYMMETRIC MATRICES
143
On the one hand, we have ||(>C()||2 = trace QCQQCQ = trace C2 = £) y<5 A 2 . On the other hand, we have ||AC||2 = trace ACAC = ]Cy<5c/; = £] 7 < S A 2 , where the last equality follows from the assumption that c is a permutation of A. Finally we compute the squared norm of the convex combination (1) and utilize the invariance property (2) to obtain, because of convexity of the squared norm,
Hence (3) holds with equality. Since the convexity of the squared norm is strict and every weight 1/2* is positive, equality in (3) forces all matrices in the sum to be the same, QCQ = C for all Q € Sign(s). Going back to (1) we find that Ac = 1/25 £G€Sign(*) C = C, that is, C is diagonal. D The proof relies on equation (1), to transform C into QCQ, then take the uniform average as Q varies over the sign-change group Sign(s), and thereby reproduce Ac. The vectors A(C) and 5(C) are related to each other in a similar fashion; to transform A(C) into Q\(C), then take some average as Q varies over a finite group Q, and finally reproduce 8(C). The relation is called vector majorization. It is entirely a property of column vectors, and a change of notation may underline this. Instead of S(C) and A(C), we choose two arbitrary vectors x and y in IR*. The group Q in question is the permutation group Perm(/c), that is, the subset of all permutation matrices in the space Rkxk. A permutation IT of the numbers ! , . . . , & induces the permutation matrix
where €j is the yth Euclidean unit vector of Uk. It is readily verified that the mapping IT i-» Q^ is a group isomorphism between permutations, and permutation matrices. Hence Perm(/c) is a group of order k\. A permutation matrix Q^ acts on a vector y e Uk by permuting its entries according to TT~I,
144
CHAPTER 6. MATRIX MEANS
We are interested in averages of the type
The with weights aQ that satisfy mmQePeTm(k) aQ>Q vector x has its components less spread out than those of y in the sense that x results from averaging over all possible permutations Qy of y. This averaging property is also reflected by the fact that the matrix 5 is doubly stochastic. A matrix S £ R /cx * is called doubly stochastic when all elements are nonnegative, and all row sums as well as all column sums are 1,
The set of all doubly stochastic k x k matrices is a compact and convex subset of the matrix space Rkxk, closed under transposition and matrix multiplication. Every permutation matrix is doubly stochastic, as is every average Y^QePerm(k) aQQ- Conversely, the Birkhoff theorem states that every doubly stochastic matrix is an average of permutation matrices. We circumvent the Birkhoff theorem in our exposition of vector majorization, by basing its definition on the property of 5 being doubly stochastic. 6.9. VECTOR MAJORIZATION The majorization ordering compares vectors of the same dimension, expressing that one vector has its entries more balanced, or less spread out, than another vector. For two vectors x,y e Rk, the relation x -< y holds when x can be obtained from y by a doubly stochastic transformation, for some doubly stochastic k x k matrix 5. In this case, the vector x is said to be majorized by the vector y. Not all pairs of vectors are comparable under majorization. A necessary reauirement is that the comnonent sums of the two vectors are the same: if then The relation is reflexive and transitive,
Hence the majorization ordering constitutes a preordering, For this preordering to be a partial ordering, antisymmetry is missing. We claim that vector majorization satisfies a weaker version which we call
6.9. VECTOR MAJORIZATION
145
antisymmetry modulo Perm(/c),
To prove the direct part in (1), we consider decreasing rearrangements and partial sum sequences. Given a vector x its decreasing rearrangement, jq, is defined to be the vector that has the same components as x, but in decreasing order. For x to be a permutation of another vector y, it is evidently necessary and sufficient that the two vectors share the same decreasing rearrangements, JC| = yi. This equality holds if and only if the partial sums over the consecutive initial sections of jq = (x^,..., x^)' and y^ — (y^,..., y i/t )' coincide,
If jc -x y, then the two sums are the same for h — k, since jc,y,jtj,y| share one and the same component sum. We show that majorization implies an ordering of the partial sums,
To this end, let x be majorized by y, that is, x = Sy for some doubly stochastic matrix S. We can choose two permutation matrices Q and R that map x and y into their decreasing rearrangements, jt| = Qx and yj = Ry. Since R' is the inverse of /?, we obtain y = R % It follows that x± = Qx = QSy = QSR 'yj = /*yi, where the matrix P = QSR' is doubly stochastic. For h < k, we get
with coefficients finally yields
that satisfy
and
This
We have verified property (3). If jc and y majorize each other, then the two inequalities from (3) yield the equality in (2). Hence the direct part in (1) is established. For the converse, we only need to apply the majorization definition to x = Qy and y = Q'x. The proof of our claim (1) is complete.
146
CHAPTER 6: MATRIX MEANS
We are now in a position to resume the discussion of Section 6.8, and derive the majorization relation between the diagonal vector 8(C) and the eigenvalue vector A(C) of a positive definite matrix C. 6.10. INEQUALITIES FOR VECTOR MAJORIZATION Lemma. Let C be a positive definite 5 x 5 matrix with vector 8(C) = (en,...,c s s )' of diagonal elements and vector A(C) = ( A i , . . . , A,)' of eigenvalues. Then we have: a. (Schur inequality) 8(C) is majorized by A(C). b. (Monotonicity of concave or convex functions) For every strictly concave function g : (0; oo) —> R, we have the inequality
while for strictly convex functions g the inequality is reversed. In either case equality holds if and only if the matrix C is diagonal. c. (Monotonicity of vector means) For parameter p € (-00; 1) the vector means <&p obey the inequality
while for parameter p e (l;oo), the inequality is reversed. In either case equality holds if and only if the matrix C is diagonal. Proof. We write and for short. For oart fa), we choose an eigenvalue decomposition and define the s x s matrix S with entries Since Z' = (zi,...,zs) is an orthogonal sxs matrix, the rows and columns of S sum to 1. Thus 5 is doubly stochastic, and we have, for all i < s.
This yields c = S\, that is, c is majorized by A. To prove part (b), let g be a strictly concave function. For a single subscript / < s, equalitv (1) vields
6.11.
THE HOLDER INEQUALITY
147
Summation over / gives
as claimed in part (b). Equality holds in (3) if and only if, for all / < s, equality holds in (2). Then strict concavity necessitates that in (1) positive coefficients s/y come with an identical eigenvalue Ay = c,,, that is, stj > 0 implies c(/ = A y for all i,j < s. Therefore A; = Y^i-s >os'i^i — lL,i<ssijc"> ^or all 7 < 5, that is, A = S'c. Thus c and A are majorized by each other. From the antisymmetry property (1) in Section 6.9, the vector c is a permutation of the vector A. Hence C is diagonal by Lemma 6.8. The case of a strictly convex function g is proved similarly. The proof of part (c) reduces to an application of part (b). First consider the case p e (—oo;0). Since the function g(x) = xp is strictly convex, part (b) yields Ej<s8(cjj) < £;•<,?(*/), that is, g(*p(c)) < g(*p(X)). But g is also strictly antitonic, whence we obtain $>p(c} > $ P (A). For p — 0, we use the strictly concave and strictly isotonic function g(x) — log x, for p e (0; 1) the strictly concave and strictly isotonic function g(x) = xp, and for p € (l;oo) the strictly convex and strictly isotonic function g(x) — X?. LJ We next approach the Holder inequality for the matrix means <j>p. Two numbers p,q e (-00; oo) are called conjugate when p + q — pq. Except for p — q = 0, the defining relation permits the familiar form l/p + l/q = 1. For finite numbers p and q, conjugacy implies that both lie either in the interval (—oo;l), or in (l;oo). The limiting case, as p tends to ±00, has q = 1. Therefore we extend the notion of conjugacy to the closed real line [-00; oo] by saying that the two numbers -oo and 1 are conjugate in the interval [—00; 1], while the two numbers 1 and oo are conjugate in [1; oo]. The only self-conjugate numbers are p = q = 0, and p = q = 2 (see Exhibit 6.1). 6.11. THE HOLDER INEQUALITY Theorem. Let p and q be conjugate numbers in [—oo; 1], and let p and q be conjugate numbers in [l;oo]. Then any two nonnegative definite 5 x 5 matrices C and D satisfy
Assume C to be positive definite, C > 0. In the case p,q € (—oo;l), equality holds in the left inequality if and only if D is positively proportional to Cp~l, or equivalently, C is positively proportional to Dq~^. In the case
148
CHAPTER 6: MATRIX MEANS
EXHIBIT 6.1 Conjugate numbers, p + q = pq. In the interval [-00; 1] one has p < 0 if and only if q > 0; in the interval [l;oo] one has p > 2 if and only if < 2. The only self-conjugate numbers are p = q = 0 and p — q = 2.
p,q € (l;oo), the equality condition for the right inequality is the same with p in place of p, and q in place of q. Proof. First we investigate the cases p,q € (—oo; 1), for positive definite matrices C and D. We choose an eigenvalue decomposition Z 'A A Z. With the orthogonal 5 x 5 matrix Z' = (z\,..., zs) that comes with D, we define C = ZCZ'. Then the scalar product (C, D) of the matrices C and /) is the same as the scalar product (S(C),A(Z))) of the diagonal vector 5(C) of C and the eigenvalue vector \(D} of D.
The Holder inequality for vector means provides the lower bound 0>p (8(C))
6.12. POLAR MATRIX MEANS
149
we obtain the inequality chain
We denote the components of A(D) by A y , while 8(C) has components Equality holds in the first inequality of (1) if and only if, for some a > 0 and for all j < s, we have A in case p,q ^ 0, and in case p = q — 0. In the second inequality of (1), equality holds if and only if the matrix C is diagonal. Hence c;/ are the eigenvalues of C as well as of C, and we have Altogether equality in (1) entails
That this condition is also sufficient for equality is seen bv straightforward verification. Evidently is equivalent with The case p,q e (l;oo) uses similar arguments, with reversed inequalities in (1). The extension from positive definite matrices to nonnegative definite matrices C and D follows by continuity. The extension to parameter values P->4 £ {-0°,!} and p,q € {l,oo} follows by a continuous passage to the limit. If p, q / 0 and C, D > 0, then D is proportional to Cp~l if and only if Cp and Dq are proportional. The latter condition looks more symmetric in C and D. The polar function of a matrix mean
In either case the denominators are positive and so the definition makes sense. The formulae are the same as those in Section 5.12. It turns out that the polar functions again are matrix means, up to a scale factor. 6.12. POLAR MATRIX MEANS Theorem. Let p and q be conjugate numbers in [-co; 1], and let p and q be conjugate numbers in [1; oo]. Then the polar functions of the matrix means
150
CHAPTER 6: MATRIX MEANS
are
Proof. We again write (C, D) = trace CD. In the first part, we concentrate on the case p,q < 1. Fix a positive definite matrix D. The Holder inequality proves one half of the polarity formula,
The other half follows from inserting the particular choice Dq~l for C, provided p,q € (-00; 1). This choice fulfills the equality condition of the Holder inequality. Hence we obtain
For/? = -oo, we insert for C the choice /5. Forp = 1, we choose the sequence Cm = zz' + (l/m)/s > 0 and let m tend to infinity, where z is any eigenvector of D corresponding to the smallest eigenvalue Amin(D). This shows that the functions
This bound and representation (3) from Section 6.7 yield
(This estimate has no counterpart for polars of information functions.) We restrict ourselves to the case p, q e (1; oo) and a nonsingular matrix D. The other cases then follow by continuity. Nonsingularity of D forces |D| to
6.13.
MATRIX MEANS AS INFORMATION FUNCTIONS AND NORMS
151
be positive definite. Hence the Holder inequality applies:
Together with (2), this establishes the first half of the polarity formula, The other half follows if we insert the choice Positive and negative parts fulfill the orthogonality relation entailing
From i
and
) for
we obtain
Hence equality holds, and the proof is complete. The functional properties of the matrix means may be summarized as follows. 6.13. MATRIX MEANS AS INFORMATION FUNCTIONS AND NORMS Theorem. Let p and q be conjugate numbers in [-00; 1]. Then the matrix mean (f>p is an information function on NND(s), with s<j>q as its polar function; $p is strictly concave on PD(s) if p e (-oo;l), (j>p is strictly isotonic on NND(s) if p e (0; 1], and
152
CHAPTER 6: MATRIX MEANS
Strict concavity on PD(s) follows from Corollary 5.3, provided tj>p is strictly superadditive on PD(s). To this end we show that, for C > 0, D ^ 0 and p e (-00; 1), the equality (f>p(C+D) =
Now we assume that equality holds. Then E minimizes {C, F)/
For C > 0 fixed,
6.15. ORTHOGONALITY OF TWO NONNEGATIVE DEFINITE MATRICES
153
The matrix means of greatest importance are singled out in Section 6.1 to Section 6.5, the D-, A-, E-, and T-criteria
In Section 6.5, we illustrated by example that there need not exist a
6.15. ORTHOGONALITY OF TWO NONNEGATIVE DEFINITE MATRICES Let A and B be two n x k matrices. We call A orthogonal to B when the matrix scalar product (A,B) = trace A'B vanishes,
This notion of orthogonality refers to the scalar product in the space Rnxk. It has nothing to do with an individual k x k matrix A being orthogonal. Of course, the latter means A 'A — Ik and n — k. If A and B are square and symmetric, then for the scalar product to vanish, it is clearly sufficient that the matrix product AB is null. For nonnegative definite matrices A and B the converse also holds true:
In order to establish the direct part, we choose two square root decompositions, A = UU' and B = VV (see Section 1.14). We get (A,B) = trace UU'VV = \.race(U'V)'U'V = \\U'V\\2. Hence if A is orthogonal
154
CHAPTER 6: MATRIX MEANS
to B, then U'V is null. Premultiplication by U and postmultiplication by V' leads to Q=UU'VV = AB. Moreover the equation AB = 0 permits a geometrical interpretation in terms of the ranges of A and B. The equation means that the range of B is included in the nullspace of A. By Lemma 1.13, the latter is the orthogonal complement of the range of A. Hence AB is the null matrix if and only if the ranges of A and B are orthogonal subspaces. For nonnegative definite matrices A and R we thus have
Notice the distinct meanings of orthogonality. The left hand side refers to orthogonality of the points A and B in the space Sym(fc), while the right hand side refers to orthogonality of the two subspaces range A and range B in the space HI*. Lemma 2.3 is a similar juxtaposition for sums of matrices and sums of subspaces. The equivalences (1), (2), and (3) are used repeatedly and with other nonnegative definite matrices in place of A and B.
6.16. POLARITY EQUATION Lemma. Let C be a positive definite 5 x 5 matrix, and let p be a number in [—00;!]. Then D e NND(s) solves the polarity equation
if and only if
where the set 5 consists of all rank 1 matrices zz' such that z is a norm 1 eigenvector of C corresponding to its smallest eigenvalue A m j n (C). Proof. For the direct part, let D > 0 be a solution of the polarity equation. In the case p e (-00; 1), the polarity equation holds true if and only if D = aCP~l for some a > 0, by Theorem 6.11. From trace C(aCp~l) a trace Cp = 1, we find a = I/trace Cp. In the case p = 1, trace monotonicity of Section 1.11 and the polarity
6.17.
MAXIMIZATION OF INFORMATION VERSUS MINIMIZATION OF VARIANCE
155
equation yield
where the penultimate line employs the definitions of fa and >-oo, and the last line uses the polarity formula s^-oo = 4>\°' Thus we have traceC(D hmm(D)Is) = 0, with C positive definite. This entails D — A m i n (D)/ 5 , that is, D is positively proportional to Is — C1"1. From trace C(als) = 1, we get a = I/trace C1. The last case, p = -oo, is more delicate. With C > A m i n (C)/ s , we infer as before
This entails trace (C - A min (C)/ s )£> = 0, but here D may be singular. However, C - A m i n (C)/ 5 is orthogonal to D in the sense of Section 6.15, that is, CD = A m i n (C)D. With eigenvalue decomposition D = ^2j<s hjZjZJ, postmultiplication by Zj yields \jCzj = A m j n (C)A,z ; , for all ; < s. If Ay ^ 0, then Zj is a norm 1 eigenvector of C corresponding to A m i n (C), and ZjZJ G S. Furthermore, the polarity equation implies that the nonnegative numbers A min (C)A, sum to 1, £;<,A m in(C)A ; - - trace A min (C)D - trace CD = 1. Hence A m i n (C)D = Y^j:\>o(^mm(C)^j}zjZ- is a convex combination of rank 1 matrices from S. The proof of the direct part is complete. The converse follows by straightforward verification. 6.17. MAXIMIZATION OF INFORMATION VERSUS MINIMIZATION OF VARIANCE Every information function > is log concave, that is, log $ is concave. To see this, we need only apply the strictly isotonic and strictly concave logarithm function,
156
CHAPTER 6: MATRIX MEANS
In view of log > = — log(l/0), log concavity of > is the same as log convexity of 1/0. Log convexity always implies convexity, as is immediate from appealing to the strictly isotonic and strictly convex exponential function. Hence the design problem, a maximization problem with concave objective function 0, is paired with a minimization problem in which the objective function N *-* 1/0 °° (#'#.£) is actually log convex, not just convex. The relation between 0 and 1/0 has the effect that some criteria are covered by the information function concept which at first glance look different. For instance, suppose the objective is to minimize a convex matrix mean 0p of the dispersion matrix C"1, in order to make the dispersion matrix as small as possible. A moment's reflection yields, with p = -p,
Hence the task of minimizing the norm 0p of dispersion matrices C l is actually included in our approach, in the form of maximizing the information function 0P of information matrices C, with p = —p. By the same token, minimization of linear functions of the dispersion matrix is equivalent to maximization of a suitable information function (compare the notion of linear optimality in Section 9.8). Maximization of information matrices appears to be the appropriate optimality concept for experimental designs, much more so than minimization of dispersion matrices. The necessary and sufficient conditions for a design to have maximum information come under the heading "General Equivalence Theorem". EXERCISES 6.1
Show that the nroiection of ( positive part C+, that is,
onto the cone NND(s) is the
6.2 Disprove that if C = A - B and A,B > 0 then A > C+ and B > C_, for C € Sym(.y). 6.3 Show that the matrix modulus satisfies
, for all C e Sym(s) [Mathias
(1990), p. 129]. 6.4 For a given matrix C € NND(s), how many solutions has the equation in SymCO? 6.5 Are |(0,1,1)' and |(3,1,1)' comparable by vector majorization?
EXERCISES
157
6.6 Verify that, for p < 1, the matrix mean $p fails to be concave on Svmf.vV and 6.7 Prove the oolaritv equation for vector means, for i if A vector solves and and only if in case conv S in case where the set 5 consists of all Euclidean unit vectors ef such that y, is the smallest component of y.
6.8 Let w-i,..., ws be nonnegative numbers. For p ^ 0, ±00 and (0; oo)s, the weighted vector mean is defined by
What is the aoorooriate definition for of definition of < to the space Us.
Extend the domain
6.9 (continued) Show that if wi,...,ws are positive and sum to 1, then the polar of the weighted vector mean is w^Vs) on NND(s), where p and q are conjugate over [-00; 1] [Gutmair (1990), p. 29]. 6.10
Let W > 0 be a nonnegative definite matrix. For p ^ 0, ±00 and C > 0, the weighted matrix mean is defined by What is the appropriate definition for p — 0, ±00? Extend the domain of definition of <^ to the space Sym(,s).
6.11 Show that the polar of the weighted matrix mean $\ '(C) = trace WC equals l/A m a x(WD~) or 0 according as D e A(W) or not [Pukelsheim (1980), p. 359].
CHAPTER 7
The General Equivalence Theorem
The General Equivalence Theorem provides necessary and sufficient conditions for a moment matrix to be
158
7.2. NORMAL VECTORS TO A CONVEX SET
159
EXHIBIT 7.1 Subgradients. Each subgradient Bi,B2,B3,... to a concave function g at a point M is such that it determines an affine function A »-» g(M) + (A — M, Bj) that bounds g from above and coincides with g at M.
The set of all subgradients of g at M is called the subdifferential of g at M, and is denoted by dg(M). When the set dg(M) is nonempty the function g is said to be subdifferentiable at M. In the design problem g is a composition 4> ° C/c-
The subgradient inequality has a pleasing geometrical interpretation. Let us consider the linear function T(A) = (A,B). Then the right hand side of the subgradient inequality is the affine function A i-» g(M) + T(A - M), with value g(M) at A = M. Hence any subgradient of g at M gives rise to an affine function that globally bounds g from above and, locally at M, coincides with g (see Exhibit 7.1). We rigorously prove whatever properties of subgradients we need. However, we briefly digress and mention some more general aspects. The subdifferential dg(M) is a closed and convex set. If M is singular then the subdifferential dg(M) may be empty or not. If M is positive definite then dg(M) is nonempty. The function g is differentiable at M if and only if the subdifferential dg(M) is a one-point set. In this case the unique subgradient is the gradient, dg(M) = {Vg(M)}. A similar relation holds for generalized matrix inversion (compare Section 1.16). The matrix A is invertible if and only if the set A~ of generalized inverses is a one-point set. In this case the unique generalized inverse is the inverse, A' = {A'1}.
7.2. NORMAL VECTORS TO A CONVEX SET The other central notion, normal vectors to a convex set, generalizes the concept of orthogonal vectors to a linear subspace. Let M be a convex set of symmetric k x k matrices, and let M be a member of M. A matrix B e Sym(A:)
160
CHAPTER?: THE GENERAL EQUIVALENCE THEOREM
EXHIBIT 12 Normal vectors to a convex set. Any normal matrix B to a set M at a point M is such that the angle with the directions A - M is greater than or equal to a right angle, for all A 6 M.
is said to be normal to M at M when it satisfies the normality inequality
Geometrically, this means that the angle between B and all the directions A — M from M to A e M is greater than or equal to a right angle, as shown in Exhibit 7.2. If M. is a subspace, then every matrix B normal to M at M is actually orthogonal to M. If M is a subspace containing the matrix A, then it also contains SA for 8 > 0. The inequality (8A, B) < (M, B), for all 6 > 0, forces {>!,#} to be nonpositive, (A,B) < 0. Since the same reasoning applies to —A e M, we obtain (-4,5) = 0, for all A £ M. This shows that the matrix B is orthogonal to the subspace M. The following lemma states that, under our grand assumption that the set of competing moment matrices M. intersects the feasibility cone A(K), we can formulate an "equivalent problem" in which the set M intersects the open cone of positive definite matrices. The "equivalent problem" is obtained by reparametrizing the original quantities. This is the only place in our development, and a technical one, where we make use of a reparametrization technique. A similar argument was used in the second part of the proof of Lemma 4.12.
7.3. FULL RANK REDUCTION Lemma. Let the set M. C NND(/:) be convex. Then there ev'ct a mimh*»r r < k and a k x r matrix U with U'U = Ir such that the set of reduced r x r moment matrices satisfie and the reduced
7.3. FULL RANK REDUCTION
161
information matrix mapping Cu >K fulfills
Proof. By Lemma 4.2, the set M contains a moment matrix M with maximum rank r, say. Let be an eigenvalue decomposition of M. Then the k x r matrix U = (u\,...,ur) satisfies U'U = Ir, and UU' projects onto the range of M. The set M. — U 'M.U contains the r x r matrix U'MU = U'U&\ U 'U = AA which is positive definite. Hence the intersection of M and PD(r) is nonempty. Since the range of M is maximal it includes the range of M, for all M e M. By our grand assumption of Section 4.1, there is a matrix in M. such that its range includes the range of K, hence so does M. Recalling that UU' projects onto the range of M, we get
Now we verify the equality string, upon setting >
for
The first and the last equation hold by definition. The two minima are equal because the two sets of matrices over which the minima are formed are the same. Namely, given GU'MUG' with GU'K = Is, we have that L = GU' is a left inverse of K with LML' = GU'MUG'. Conversely, given LML' with LK = /,, we see that G = LU satisfies GU'K = LUU'K = LK = Is and GU'MUG' = LUU'MUU'L' = LML'. This establishes the first equation of the lemma. For the second equation, we use Cv>K(M) = CK(M), and conclude with CK(M) = CK(U(U'MU)U ') = CK(UMU'). The lemma says something about the reparametrized system K'6 = (UU'K)'O = (U'K)'U'O = K'6, where K = U'K and 0 = U'O. The set of information matrices C#(M) for K'6, as M varies over the set M of competing moment matrices, coincides with the set of reduced information matrices C~(M) for K'6, as M varies over the set M of reduced moment /C
162
CHAPTER?: THE GENERAL EQUIVALENCE THEOREM
matrices. Therefore, the moment matrix M is optimal for K'6 in M if and only HU'MU is optimal for K'8 in M. We employ the lemma in the Duality Theorem 7.12. Otherwise it allows us to concentrate on the assumption that the set M contains a positive definite matrix, that is, M intersects the interior of the cone NND(fc). The following theorem provides the essential tool for our optimality investigations. 7.4. SUBGRADIENT THEOREM Theorem. Let g : NND(fc) —» R be a concave function, and let the set M C NND(fc) be convex and intersect the cone PD(fc). Then a moment matrix M e M maximizes g over M if and only if there exists a subgradient of g at M that is normal to M at M, that is, there exists a matrix B € dg(M) such that
Proof. The converse part of the proof is a plain consequence of the notions of subdifferentiability and normality, g(A) < g(M) + (A - M,B) < g(M) for all A e M. The direct part is more challenging in that we need to exhibit the existence of a subgradient B that is normal to M. at M. We invoke the separating hyperplane theorem in the space Sym(fc) x IR with scalar product ((A, a), (B, ft)) = (A,B) + aft = (trace AB) + aft. In this space, we introduce the two sets
It is easily verified that both sets are convex. Optimality of M forces any point (A, a) in the intersection /C n C to fulfill a < g(A) < g(M) < a. This is impossible. Hence the sets 1C and £ are disjoint. Therefore there exists a hyperplane separating /C and £, that is, there exist a pair (0,0) ^ (B,ft) £ Sym(k) x R and a real number y such that
In other words, the hyperplane H = {(A, a) e Sym(fc) x IR : (A, B)+aft — y} is such that the set /C is included in one closed half space associated with H, while £ is included in the opposite closed half space. In (1), we now insert (A/,g(M)) e K. for (A,a) and for (A, a):
7.5. SUBGRAD1ENTS OF ISOTONIC FUNCTIONS
163
This yields e$ > 0 for all e > 0, whence /3 is nonnegative. We exclude the case j8 = 0. Otherwise (1) turns into sup~^ ft (^4, B} < iniAeM(A,B}, with B ^ s\^\) 0. That is, in Sym(fc) there exists a hyperplane separating the sets NND(fc) and M. But by assumption, the set M contains a matrix that is positive definite. Therefore NND(fc) and M cannot be separated and /3 is positive. Knowing j8 > 0, we define B — -B//3. Replacement of y by g(M)/3 + (M,B) from (2) turns the first inequality in (1) into (M,B). With a = g(A), we get
Therefore B is a subgradient of g at M. The second inequality in (1) becomes
Lettine a tend to e(M} we see that the suberadient B is normal to M at M. It remains to compute subdifferentials, and to this end we exploit the properties that are enjoyed by an information function. First we make use of concavity and monotonicity. Again we have in mind compositions g = <£ o CK on NND(/c), given by information functions <£ on NND(^). 7.5. SUBGRADIENTS OF ISOTONIC FUNCTIONS Lemma. Let the function g : NND(fc) —> U be isotonic and concave. Then for all M e NND(fc), every subgradient of g at M is nonnegative definite. Let the function
Hence the matrix B is nonnegative definite. Let £ be a subgradient of $ at C > 0. Then strict monotonicity on PD(s) entails
Therefore the matrix E is positive definite.
164
CHAPTER?: THE GENERAL EQUIVALENCE THEOREM
We now study the subdifferential of compositions of <£ with the information matrix mapping CK, where the k x 5 coefficient matrix K has full column rank s. 7.6. A CHAIN RULE MOTIVATION The tool for computing derivatives of compositions is the chain rule. The definition of subgradients in Section 7.1 applies to functions with values in the real line R, with its usual (total) ordering. We extend the idea to functions with values in the linear space Sym(.s), with its Loewner (partial) ordering. A subgradient mapping T of the information matrix mapping CK at M is defined to be a linear mapping T from the space Sym(A:) into the space Sym(s) that satisfies the subgradient inequality
where the inequality sign refers to the Loewner ordering in the space Sym(.s). Can we find such subgradient mappings Tl The answer is in the affirmative. The information matrix CK(M) is a minimum relative to the Loewner ordering,
where L is a left inverse of K that is minimizing for A/. We define a linear mapping T from Sym(A:) to Sym(s) by T(A) = LAL', and claim that T is a subgradient mapping of CK at M. Indeed, for all A > 0 we have
It remains open whether all subgradient mappings of CK at M arise in this way. Now let
The subgradient D is nonnegative definite, by the first part of Lemma 7.5. As seen in Section 1.11, the linear form C i-» (C,D) is then isotonic relative
7.7. DECOMPOSITION OF SUBGRADIENTS
165
to the Loewner ordering. Now we apply the first inequality to E = CK(A). Making CK(A) larger according to the second inequality, we get, for all A
Therefore the matrix T'(D) — L'DL is a subgradient of the composition 0 o CK at M. However, this argument leaves open the question whether all subgradients of
is also a subgradient of > o CK at M. It emerges as Corollary 7.8 to the (somewhat technical) next theorem that all subgradients are of this form if M is feasible for K'6.
7.7. DECOMPOSITION OF SUBGRADIENTS Theorem. Let the function
166
CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM
I. First we show that (a) implies (b). Let B be a subgradient of
With A = MK, the subgradient inequality gives 0 < (MK — M,B) = -(M MK,B). But M — MK is nonnegative definite, by Lemma 3.14 and so Lemma 1.8 provides the converse inequality (M — MK, B) > 0. This proves B to be orthogonal to M — MK. Setting C = CK(M), we get (M,B) = (MK,B) = (KCK',B). For every nonnegative definite s x s matrix E, we have CK(KEK') — minL€Ksxk. LK=Is LKEK'L1 = E, giving
Thus K'BK is a subgradient of φ at C, and (b) is established. II. Secondly we assume (b) and construct matrices G and F that fulfill (c). With R being a projector onto the nullspace of M, we introduce the k × k matrix
This definition is legitimate: F is the generalized information matrix of B for R'θ, in the terminology of Section 3.21. By Theorem 3.24, the matrix F enjoys the three properties
By (1), F is nonnegative definite. From (2), we have MF = 0, that is, F is orthogonal to M. By assumption, the matrix B is orthogonal to M − M_K, whence we have MB = M_KB and BM = BM_K. This yields
A passage to the complement turns (3) into (nullspace(B − F)) + (range M) = ℝ^k. As in the proof of Theorem 2.16, this puts us in a position where we can find a nonnegative definite k × k matrix H with a range that is included in
the nullspace of B - F and that is complementary to the range of M,
The first inclusion means (B − F)H = 0, whence (4) gives (M + H)(B − F)(M + H) = M_K B M_K. Choosing for M the generalized inverse G = (M + H)⁻¹ from Lemma 2.15, we obtain B − F = G M_K B M_K G. This provides the representation required in part (c). III. Thirdly we verify that (c) implies (a). For A ≥ 0, we argue that
The first inequality holds because K'BK is a subgradient of φ at C_K(M). The equality uses the representation A_K = KC_K(A)K'. For the second inequality we observe that B inherits nonnegative definiteness via M_K B M_K from K'BK, by Lemma 7.5. Thus monotonicity yields ⟨A_K, B⟩ ≤ ⟨A, B⟩. Since G'MG is a generalized inverse of M, it is also a generalized inverse of M_K, by Lemma 3.22. Inserting the assumed form of B we obtain ⟨M, B⟩ = ⟨M_K G'MG M_K, B⟩ + ⟨M, F⟩ = ⟨M_K, B⟩. Except for the full column rank assumption on the coefficient matrix K, the theorem is as general as can be. The matrix M need neither be positive definite nor lie in the feasibility cone A(K), and the optimality criterion φ is required to be isotonic and concave only. Part (c) tells us how to decompose a subgradient B of φ ∘ C_K at M into various bits and pieces that are simple to handle individually. For the purpose of construction, we reverse the issue. Given a subgradient D of φ at C, a generalized inverse G of M, and a nonnegative definite matrix F orthogonal to M, is GKCDCK'G' + F a subgradient of φ ∘ C_K at M? This is indeed true provided M is feasible.
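The following minimal sketch (added here, not from the book) illustrates the construction just posed under simplifying assumptions: M positive definite, G = M⁻¹, F = 0, and the linear trace criterion φ(C) = trace(C)/s, whose subgradient at any C is D = I_s/s. The matrix B = GKCDCK'G' should then satisfy the subgradient inequality of φ ∘ C_K at M.

```python
# Numerical sketch (assumptions: M positive definite, G = M^{-1}, F = 0, phi = trace/s).
import numpy as np

rng = np.random.default_rng(1)
k, s = 5, 2

def random_pd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + 0.1 * np.eye(n)

K = rng.standard_normal((k, s))                              # full column rank coefficient matrix
M = random_pd(k)

C_K = lambda A: np.linalg.inv(K.T @ np.linalg.inv(A) @ K)    # information matrix mapping
phi = lambda C: np.trace(C) / s                              # the trace criterion phi_1

C = C_K(M)
D = np.eye(s) / s                                            # subgradient of phi at C
B = np.linalg.inv(M) @ K @ C @ D @ C @ K.T @ np.linalg.inv(M)   # F = 0 here

violations = 0
for _ in range(1000):
    A = random_pd(k)
    violations += phi(C_K(A)) > phi(C) + np.trace((A - M) @ B) + 1e-9
print("subgradient inequality violations:", violations)      # expected: 0
```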
7.8. DECOMPOSITION OF SUBDIFFERENTIALS

Corollary. For every nonnegative definite k × k matrix M, the subdifferentials of φ ∘ C_K at M and of φ at C = C_K(M) fulfill the inclusion
relations
Moreover, if M lies in the feasibility cone A(K), then the three sets are equal. Proof. The first inclusion is verified in Section 7.6. The second follows from part (c) of Theorem 7.7, upon replacing M_K by KCK' and K'BK by D. Moreover, let M be feasible. For any member B = GKCDCK'G' + F of the last set, we define L = CK'G'. Because of feasibility, Theorem 3.15 yields LK = CK'G'K = (K'M⁻K)⁻¹K'M⁻K = I_s and LML' = CK'G'MGKC = C. Thus B = L'DL + F is a member of the first set. Lemma 7.5 provides some partial insight into the subgradients D of φ at C = C_K(M), for isotonic and concave criteria φ. The following theorem says much more for information functions
which we have already encountered earlier, in part (c) of Lemma 5.16, and which for the matrix means φ_p is solved explicitly in Lemma 6.16.

7.9. SUBGRADIENTS OF INFORMATION FUNCTIONS

Theorem. Let φ be an information function on NND(s). Then for every pair C and D of nonnegative definite s × s matrices, the following three statements are equivalent:
a. (Subdifferential of φ)
b. (Polarity equation) φ(C)φ^∞(D) = 1 = ⟨C, D⟩
c. (Subdifferential of φ^∞)
In particular, the subdifferential of
provided φ(C) is positive. Moreover, if C is positive definite, then φ is subdifferentiable at C, ∂φ(C) ≠ ∅.
Proof. It suffices to establish the equivalence of (a) and (b). The equivalence of (b) and (c) then follows from the polarity correspondence φ = (φ^∞)^∞ of Theorem 5.13. First we derive (b) from (a). We insert δC with δ > 0 into the subgradient inequality, δφ(C) ≤ φ(C) + ⟨δC − C, φ(C)D⟩. Cancelling φ(C) > 0, we get δ − 1 ≤ (δ − 1)⟨C, D⟩. The values δ = 2 and δ = 1/2 yield ⟨C, D⟩ = 1. With this, the subgradient inequality simplifies to φ(E) ≤
Hence D solves the polarity equation, φ(C)φ^∞(D) = 1 = ⟨C, D⟩. A similar argument was employed in the proof of Theorem 5.13. For a composition of the form φ ∘ C_K, we now have a couple of ways of representing the subdifferential. If M ∈ A(K), then C = C_K(M) is positive definite, and the subdifferential ∂(φ ∘ C_K)(M) admits the representations
The first two equalities follow from Corollary 7.8. The last equality applies the present theorem to the information function
If a scalar parameter system c'θ is of interest, then φ is the identity on [0;∞), with derivative one on (0;∞), as remarked in Section 5.17. Hence for M ∈ A(c), we get
These expressions are familiar to us from the proof of the Equivalence Theorem 2.16 for scalar optimality, except for the matrix F. The General Equivalence Theorem 7.14 follows just the same lines. If subdifferentiability of φ ∘ C_K at M fails to hold, that is, ∂(φ ∘ C_K)(M) = ∅, then M is neither feasible for K'θ nor formally φ-optimal for K'θ in ℳ. In this case M is of no interest for the general design problem of Section 5.15.
7.10. REVIEW OF THE GENERAL DESIGN PROBLEM

With these results from convex analysis, we now attack the design problem in its full generality. The objective is to maximize the information as measured by some information function φ on NND(s):
The set ℳ ⊆ NND(k) of competing moment matrices is assumed to be compact and convex. The k × s coefficient matrix K is assumed to be of full column rank s, with information matrix mapping C_K. We avoid trivialities by the grand assumption of Section 4.1 that there exists at least one competing moment matrix that is feasible, ℳ ∩ A(K) ≠ ∅. Although our prime interest is in the design problem, we attack it indirectly by first discussing its dual problem, for the following reason. Theorem 7.4 tells us that a moment matrix M₁ is optimal if and only if there exists a subgradient B₁ with certain properties. Similarly another moment matrix M₂ will be optimal if and only if there exists a subgradient B₂ with similar properties. In order to discuss multiplicity of optimal moment matrices, we need to know how M₁ relates to B₂. The answer is given by the dual problem, which represents B (more precisely, its scaled version N = B/φ(C))
7.11. MUTUAL BOUNDEDNESS THEOREM FOR INFORMATION FUNCTIONS

Theorem. Let M ∈ ℳ be a competing moment matrix and let N be a matrix in the set
Then we have φ(C_K(M)) ≤ 1/φ^∞(K'NK), with equality if and only if M and N fulfill conditions (1), (2), and (3) given below. More precisely, upon setting C = C_K(M) we have
with respective equality if and only if
Proof. Inequality (i) and equality condition (1) are an immediate consequence of how the set 𝒩 is defined. Inequality (ii) uses M ≥ M_K = KCK' and monotonicity of the linear form A ↦ trace AN, from Theorem 3.24 and Section 1.11. Equality in (ii) means that N is orthogonal to M − M_K ≥ 0. By Section 6.15, this is the same as condition (2). Inequality (iii) is the Hölder inequality from Section 5.12, leading to condition (3). The theorem generalizes the results for scalar optimality of Theorem 2.11, suggesting that the general design problem is accompanied by the dual problem:
The design problem and the dual problem bound each other in the sense that every value for one problem provides a bound to the other,
For a moment matrix M to attain the supremum and therefore to be formally optimal for K'θ, that is, disregarding the identifiability condition M ∈ A(K), it is sufficient that there exists a matrix N ∈ 𝒩 such that φ(C_K(M)) = 1/φ^∞(K'NK).
7.12. DUALITY THEOREM

Theorem.
We have
In particular, a moment matrix M ∈ ℳ is formally optimal for K'θ in ℳ if and only if there exists a matrix N ∈ 𝒩 such that
Moreover, any two matrices M ∈ ℳ and N ∈ 𝒩 satisfy
Thus the matrix N is a member of the set 𝒩, and it satisfies φ(C) = 1/φ^∞(K'NK). Hence M is an optimal solution of the design problem, N is an optimal solution of the dual problem, and the two problems share a common optimal value. In the second part, we make do with the assumption that the set ℳ meets the feasibility cone A(K). From Lemma 7.3, we first carry through a full rank reduction based on the k × r matrix U and the set ℳ̃ = U'ℳU that appear there. Because of ℳ̃ ∩ PD(r) ≠ ∅, the first part of the present proof
is applicable and yields
Here we have set 𝒩̃ = {Ñ ∈ NND(r) : trace ÃÑ ≤ 1 for all Ã ∈ ℳ̃}. We claim that this set is related to the set 𝒩 ⊆ NND(k) of Theorem 7.11 through
For the direct inclusion, we take a matrix Ñ ∈ 𝒩̃ and define N = UÑU'. Then ℳ̃ = U'ℳU implies N ∈ 𝒩 since trace MN = trace MUÑU' = trace M̃Ñ ≤ 1 for all M ∈ ℳ. Furthermore U'U = I_r entails U'NU = Ñ, whence Ñ is a member of U'𝒩U. For the converse inclusion, we are given a matrix N ∈ 𝒩. From the proof of Lemma 7.3, we borrow
Thus Ñ = U'NU lies in 𝒩̃ since trace M̃Ñ = trace U'MUU'NU = trace MN ≤ 1 for all M ∈ ℳ. This proves (2). Now we may continue in (1),
and observing UU'K = K concludes the proof. We illustrate the present theorem with the pathological example of the parabola fit model of Section 6.5. We consider the design which places equal mass 1/2 on the two points ±1. Its information matrix for the full parameter vector θ is singular,
Under the trace criterion, the information for θ is φ₁(M) = (1/3) trace M = 1. Now we define the matrix N = I₃/3 > 0. For the moment matrix M₂(τ) of an arbitrary design τ on T = [−1; 1], we compute
Thus the matrix N lies in the set 𝒩, providing for the design problem the upper bound
Since
Theorem. There exists a moment matrix M ∈ ℳ that is formally
7.14. THE GENERAL EQUIVALENCE THEOREM

Theorem. Let M ∈ ℳ be a competing moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ-optimal for K'θ in ℳ if and only if there exists a nonnegative definite s × s matrix D that solves the polarity equation
and there exists a generalized inverse G of M such that the matrix N = GKCDCK'G' satisfies the normality inequality
In case of optimality, equality obtains in the normality inequality if for A we insert M, or any other matrix M̃ ∈ ℳ that is φ-optimal for K'θ in ℳ. Proof. For the direct part, we do not need the feasibility assumption. Let M be a formally φ-optimal moment matrix for K'θ in ℳ. We cannot appeal directly to the Subgradient Theorem 7.4 since there we require ℳ ∩ PD(k) ≠ ∅, while here we only assume ℳ ∩ A(K) ≠ ∅. Instead we piece things together as follows. The Duality Theorem 7.12 provides a matrix N ∈ 𝒩 with φ(C) = 1/φ^∞(K'NK).
Theorem 7.9 states that φ(C)N is a subgradient of
with D ∈ ∂φ(C).
We record the variants that emerge if the full parameter vector θ is of interest, and if maximization takes place over the full set M(Ξ) of all moment matrices.
7.15. GENERAL EQUIVALENCE THEOREM FOR THE FULL PARAMETER VECTOR

Theorem. Let M ∈ ℳ be a competing moment matrix that is positive definite. Then M is φ-optimal for θ in ℳ if and only if there exists a nonnegative definite k × k matrix N that solves the polarity equation
and that satisfies the normality inequality
In case of optimality, equality obtains in the normality inequality if for A we insert M, or any other matrix M̃ ∈ ℳ that is φ-optimal for θ in ℳ. Proof. For the full parameter vector θ, any feasible moment matrix M ∈ ℳ is positive definite by Theorem 3.15. In the General Equivalence Theorem 7.14, we then have G = M⁻¹, K = I_k, and C_K(M) = M. The polarity equation and the normality inequality simplify accordingly.

7.16. EQUIVALENCE THEOREM

Theorem. Let M ∈ M(Ξ) be a moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ-optimal for K'θ in M(Ξ) if and only if there exists a nonnegative definite s × s matrix D that solves the polarity equation
and there exists a generalized inverse G of M such that the matrix N = GKCDCK'G' satisfies the normality inequality
In case of optimality, equality obtains in the normality inequality if for x we insert any support point x_i of any design ξ ∈ Ξ that is φ-optimal for K'θ in Ξ.
Proof. Since M(Ξ) contains the rank 1 moment matrices A = xx', the normality inequality of Theorem 7.14 implies the present normality inequality. Conversely, the present inequality implies that of Theorem 7.14, for if the moment matrix A ∈ M(Ξ) belongs to the design η ∈ Ξ, then we get trace AN = Σ_{x∈supp η} η(x) x'Nx ≤ 1. In the case of optimality of M, let ξ ∈ Ξ be a design that is φ-optimal for K'θ in Ξ, with moment matrix M̃ = M(ξ) and with support points x₁,...,x_ℓ ∈ X. Then x_i'Nx_i < 1 for some i entails the contradiction trace M̃N = Σ_{i≤ℓ} ξ(x_i) x_i'Nx_i < 1 = trace M̃N.

7.17. EQUIVALENCE THEOREM FOR THE FULL PARAMETER VECTOR

Theorem. Let M ∈ M(Ξ) be a moment matrix that is positive definite. Then M is φ-optimal for θ in M(Ξ) if and only if there exists a nonnegative definite k × k matrix N that solves the polarity equation
and that satisfies the normality inequality
In case of optimality, equality obtains in the normality inequality if for x we insert any support point x_i of any design ξ ∈ Ξ that is φ-optimal for θ in Ξ. Proof.
The proof parallels that of the previous two theorems.
7.18. MERITS AND DEMERITS OF EQUIVALENCE THEOREMS

Equivalence theorems entail an amazing range of consequences. It is worthwhile to pause and reflect on what we have achieved so far. Theorem 7.14 allows for a fairly general set ℳ of competing moment matrices, requiring only that ℳ is a compact and convex subset of the cone NND(k), and that it contains at least one feasible matrix for K'θ. It is for this generality of ℳ that we term it a General Equivalence Theorem. Most practical applications restrict attention to the set M(Ξ) of all moment matrices, and in such cases we simply speak of an Equivalence Theorem. Exhibit 7.3 provides an overview. In any case the optimality characterization comes in two parts, the polarity equation and the normality inequality. The polarity equation refers to s × s matrices, as does the information function φ. The normality inequality applies to k × k matrices, as does the information matrix mapping C_K. Thus
EXHIBIT 7.3 A hierarchy of equivalence theorems. A General Equivalence Theorem (GET) allows for a compact and convex subset ℳ of the set M(Ξ) of all moment matrices. Over the full set M(Ξ) we speak of an Equivalence Theorem (ET). Either theorem simplifies if interest is in the full parameter vector θ.
the two parts parallel the two components φ and C_K of the composition φ ∘ C_K. In order to check the optimality of a candidate matrix M, we must in some way or other compare M with the competing matrices A. Equivalence theorems contribute a way of comparison that is linear in A, and hence simple to evaluate. More involved computations do appear, but need only be done once, for the optimality candidate M, such as determining the information matrix C = C_K(M), a solution D of the polarity equation, and an appropriate generalized inverse G of M. The latter poses no problem if M is positive definite; then G is the regular inverse, G = M⁻¹. In general, however, the choice of the generalized inverse G does matter. Some versions of G satisfy the normality inequality, others do not. Yet more important, for the matrix means φ_p, reference to the polarity equation disappears. Lemma 6.16 exhibits all of its solutions. If p is finite, then the solution is unique.
7.19. GENERAL EQUIVALENCE THEOREM FOR MATRIX MEANS

Theorem. Consider a matrix mean
In case of optimality, equality obtains in the normality inequality if for A we insert M, or any other matrix M̃ ∈ ℳ that is φ_p-optimal for K'θ in ℳ. Specifically, let M ∈ ℳ be a competing moment matrix that is positive definite. Then M is φ_p-optimal for θ in ℳ if and only if
Proof. The polarity equation has solution D = C^{p−1}/trace C^p, by Lemma 6.16. Hence the General Equivalence Theorem 7.14 specializes to the present one. ALTERNATE PROOF. If the optimality candidate M is positive definite, then the theorem permits an alternate proof based on differential calculus, provided we are ready to use the facts that the functions f_p(X) = trace X^p, for p ≠ 0, and f₀(X) = det X are differentiable at a positive definite matrix C, with gradients
The chain rule then yields the gradient of the matrix mean φ_p at C,
Because of positive definiteness of M, the information matrix mapping C_K becomes differentiable at M. Namely, utilizing twice that matrix inversion X ↦ X⁻¹ has at M the differential −M⁻¹XM⁻¹, the differential of C_K(X) = (K'X⁻¹K)⁻¹ at M is found to be
Composition with the gradient of
That is, the gradient is proportional to M⁻¹KC^{p+1}K'M⁻¹. Hence it is normal to ℳ at M if and only if the normality inequality of Theorem 7.19 holds true.
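A quick finite-difference check of this proportionality can be run numerically. The sketch below (added code, with matrix sizes and the criterion p = −1 chosen only for illustration) compares a numerical gradient of φ_p(C_K(M)) with the direction M⁻¹KC^{p+1}K'M⁻¹; the entrywise ratio of the two should be a constant positive factor.

```python
# Finite-difference check (not from the book) of the gradient direction of phi_p o C_K.
import numpy as np

def frac_power(S, a):
    """Fractional power of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w**a) @ V.T

def info_matrix(M, K):
    return np.linalg.inv(K.T @ np.linalg.inv(M) @ K)

def phi_p(C, p):
    s = C.shape[0]
    return (np.trace(frac_power(C, p)) / s) ** (1.0 / p)

def objective(M, K, p):
    return phi_p(info_matrix(M, K), p)

rng = np.random.default_rng(0)
k, s, p = 4, 2, -1.0                        # average-variance criterion as an example
A = rng.standard_normal((k, 2 * k))
M = A @ A.T / (2 * k)                       # a positive definite "moment matrix"
K = rng.standard_normal((k, s))             # full column rank coefficient matrix

C = info_matrix(M, K)
analytic_dir = np.linalg.inv(M) @ K @ frac_power(C, p + 1) @ K.T @ np.linalg.inv(M)

eps, num_grad = 1e-6, np.zeros((k, k))
for i in range(k):
    for j in range(i, k):
        E = np.zeros((k, k))
        E[i, j] = E[j, i] = 1.0 if i == j else 0.5
        d = (objective(M + eps * E, K, p) - objective(M - eps * E, K, p)) / (2 * eps)
        num_grad[i, j] = num_grad[j, i] = d   # d = trace(E @ gradient) = gradient[i, j]

ratio = num_grad / analytic_dir
print("ratio spread (should be close to zero):", ratio.max() - ratio.min())
```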
7.20. EQUIVALENCE THEOREM FOR MATRIX MEANS

Theorem. Consider a matrix mean
In case of optimality, equality obtains in the normality inequality if for x we insert any support point x_i of any design ξ ∈ Ξ that is φ_p-optimal for K'θ in Ξ. Specifically, let M ∈ M(Ξ) be a moment matrix that is positive definite. Then M is φ_p-optimal for θ in M(Ξ) if and only if
Proof.
The results are special cases of Theorem 7.19.
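As a concrete illustration (assumed example, not worked out at this point in the text), for p = 0 and the full parameter vector the normality inequality of Theorem 7.20 specializes, with K = I_k, G = M⁻¹ and C = M, to x'M⁻¹x ≤ k for all x ∈ X. The snippet below checks this for the design with equal weights 1/3 on t = −1, 0, 1 in the parabola fit model f(t) = (1, t, t²)' on [−1, 1], which is known to be determinant optimal there.

```python
# Checking the determinant-optimality condition x'M^{-1}x <= k on a grid (illustration only).
import numpy as np

f = lambda t: np.array([1.0, t, t * t])
support, weights = [-1.0, 0.0, 1.0], [1/3, 1/3, 1/3]

M = sum(w * np.outer(f(t), f(t)) for t, w in zip(support, weights))
Minv = np.linalg.inv(M)

ts = np.linspace(-1, 1, 2001)
d = np.array([f(t) @ Minv @ f(t) for t in ts])

print("max of x'M^{-1}x over the regression range:", d.max())          # equals k = 3
print("attained near t =", ts[np.isclose(d, d.max(), atol=1e-6)])
print("values at the support points:", [float(f(t) @ Minv @ f(t)) for t in support])
```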
For the smallest-eigenvalue criterion φ_{−∞}, Lemma 6.16 provides many solutions of the polarity equation, unless the smallest eigenvalue of C has multiplicity 1. Some care is needed to appropriately accommodate this situation.

7.21. GENERAL EQUIVALENCE THEOREM FOR E-OPTIMALITY

Theorem. Let M ∈ ℳ be a competing moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ_{−∞}-optimal for K'θ in ℳ if and only if there exist a nonnegative definite s × s matrix E with trace equal to 1 and a generalized inverse G of M that satisfy the normality inequality
In case of optimality, equality obtains in the normality inequality if for A we insert M, or any other matrix M̃ ∈ ℳ that is φ_{−∞}-optimal for K'θ in ℳ; and any matrix E appearing in the normality inequality and any moment matrix M̃ ∈ ℳ that is φ_{−∞}-optimal for K'θ in ℳ, with information matrix C̃ = C_K(M̃), fulfill
where the set S consists of all rank 1 matrices zz' such that z ∈ ℝ^s is a norm 1 eigenvector of C corresponding to its smallest eigenvalue.
Specifically, let M ∈ ℳ be a competing moment matrix that is positive definite. Then M is φ_{−∞}-optimal for θ in ℳ if and only if there exists a nonnegative definite k × k matrix E with trace equal to 1 such that
Proof. We write λ = λ_min(C), for short. For the direct part, we solve the polarity equation in the General Equivalence Theorem 7.14. Lemma 6.16 states that the solutions are D = E/λ with E ∈ conv S, where S comprises the rank 1 matrices zz' formed from the norm 1 eigenvectors z of C that correspond to λ. Hence if M is φ_{−∞}-optimal, then the normality inequality of Theorem 7.14 implies the present one. For the converse, the fact that trace E = 1 entails that every spectral decomposition E =
The first inequality exploits the monotonicity behavior discussed in Section 1.11. The middle equality expands C into CK'G'MGKC. The last inequality uses the special case trace MGKCECK'G' ≤ λ of the normality inequality. Thus we have trace FE = 0. Section 6.15 now yields FE = 0, that is, CE = λE. Postmultiplication by z_j gives Cz_j = λz_j, for all j = 1,...,r, giving E ∈ conv S. It follows that the matrix D = E/λ solves the polarity equation and satisfies the normality inequality of the General Equivalence Theorem 7.14, thus establishing φ_{−∞}-optimality of M. Now let M̃ ∈ ℳ be another moment matrix that is φ_{−∞}-optimal for K'θ in ℳ, with information matrix C̃. Then M and M̃ share the same optimal value, λ_min(C̃) = λ. Since the dual problem has optimal solution N = GKCECK'G'/λ, condition (1) of Theorem 7.11 yields trace M̃N = 1. Hence the normality inequality is an equality for A = M̃. Moreover, with condition (2) of Theorem 7.11, we may continue 1 = trace M̃N = trace C̃E/λ. Therefore the nonnegative definite s × s matrix F̃ = C̃ − λI_s is orthogonal to E. Again we get F̃E = 0, thus establishing property (1). Postmultiplication of (1) by z_j shows that z_j is an eigenvector of C̃, whence follows (2).

7.22. EQUIVALENCE THEOREM FOR E-OPTIMALITY

Theorem. Let M be a moment matrix that is feasible for K'θ, with information matrix C = C_K(M). Then M is φ_{−∞}-optimal for K'θ in M(Ξ) if
and only if there exist a nonnegative definite s × s matrix E with trace equal to 1 and a generalized inverse G of M that satisfy the normality inequality
In case of optimality, equality obtains in the normality inequality if for x we insert any support point x_i of any design ξ ∈ Ξ that is φ_{−∞}-optimal for K'θ in Ξ; and any matrix E appearing in the normality inequality and any moment matrix M̃ ∈ M(Ξ) that is φ_{−∞}-optimal for K'θ in M(Ξ), with information matrix C̃ = C_K(M̃), fulfill conditions (1) and (2) of Theorem 7.21. Specifically, let M ∈ M(Ξ) be a moment matrix that is positive definite. Then M is φ_{−∞}-optimal for θ in M(Ξ) if and only if there exists a nonnegative definite k × k matrix E with trace equal to 1 such that
Proof.
The theorem is an immediate consequence of Theorem 7.21.
If M is φ_{−∞}-optimal for K'θ in ℳ and the smallest eigenvalue λ of C_K(M) has multiplicity 1, then the matrix E in the two preceding theorems is uniquely given by zz', where z ∈ ℝ^s is a norm 1 eigenvector of C_K(M) corresponding to λ. If the smallest eigenvalue has multiplicity greater than 1, then little or nothing can be said about which matrices E ≥ 0 with trace E = 1 satisfy the normality inequality. It may happen that all rank 1 matrices E = zz' that originate with eigenvectors z fail, and that any such matrix E must be positive definite. As an illustration, we elaborate on the example of Section 2.18, with regression range X = {x ∈ ℝ² : ‖x‖ ≤ 1}. The design assigning weight 1/2 to each of the two standard unit vectors, with moment matrix M = I₂/2, is φ_{−∞}-optimal for θ in Ξ. Indeed, the positive definite matrix E = I₂/2 satisfies x'Ex = ‖x‖²/2 ≤ λ_min(M), for all x ∈ X. On the other hand, the norm 1 eigenvectors of M corresponding to 1/2 are the vectors z ∈ ℝ² with ‖z‖ = 1. For x = z, we get x'zz'x = 1 > λ_min(M). Hence no rank 1 matrix E = zz' fulfills the normality inequality. The same example illustrates that φ_{−∞}-optimality may obtain without any scalar optimality property. For every coefficient vector c ≠ 0, the Elfving Theorem 2.14 shows that the unique optimal design for c'θ is the one-point design in c/‖c‖ ∈ X, with optimal variance (ρ(c))² = ‖c‖², where ρ is the Elfving norm of Section 2.12. Indeed, the variance incurred by the φ_{−∞}-optimal moment matrix I₂/2 is twice as large, c'(I₂/2)⁻¹c = 2‖c‖². Nevertheless, there are some intriguing interrelations with scalar optimality which may help to isolate a φ_{−∞}-optimal design. Scalar optimality always comes into play if the smallest eigenvalue of the φ_{−∞}-optimal information matrix has multiplicity 1.
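The following snippet (an illustration added here, with the circular regression range represented by its boundary points) verifies both assertions of the example numerically: E = I₂/2 meets the normality inequality everywhere on X, while a rank 1 matrix built from a unit eigenvector does not.

```python
# Numerical illustration of the unit-disk example: M = I_2/2, lambda_min(M) = 1/2.
import numpy as np

M = np.eye(2) / 2
lam_min = np.linalg.eigvalsh(M).min()                                 # = 1/2

angles = np.linspace(0, 2 * np.pi, 720, endpoint=False)
boundary = np.stack([np.cos(angles), np.sin(angles)], axis=1)         # extreme points of X

E_pd = np.eye(2) / 2
print("max x'Ex for E = I/2 :", max(x @ E_pd @ x for x in boundary))  # 0.5 <= lambda_min

z = np.array([1.0, 0.0])                                              # a norm 1 eigenvector of M
E_rank1 = np.outer(z, z)
print("max x'Ex for E = zz':", max(x @ E_rank1 @ x for x in boundary))  # 1.0 > lambda_min
```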
7.23. E-OPTIMALITY, SCALAR OPTIMALITY, AND EIGENVALUE SIMPLICITY

Theorem. Let M ∈ ℳ be a competing moment matrix that is feasible for K'θ, and let ±z ∈ ℝ^s be an eigenvector corresponding to the smallest eigenvalue of the information matrix C_K(M). Then M is φ_{−∞}-optimal for K'θ in ℳ and the matrix E = zz'/‖z‖² satisfies the normality inequality of Theorem 7.21 if and only if M is optimal for z'K'θ in ℳ. If the smallest eigenvalue of C_K(M) has multiplicity 1, then M is φ_{−∞}-optimal for K'θ in ℳ if and only if M is optimal for z'K'θ in ℳ. Proof. We show that the normality inequality of Theorem 7.21 for φ_{−∞}-optimality coincides with that of Theorem 4.13 for scalar optimality. With E = zz'/‖z‖², the normality inequality in Theorem 7.21 reads
The normality inequality of Theorem 4.13 is
With c = Kz the two left hand sides are the same. So are the right hand sides, because of c'M⁻c = z'K'M⁻Kz = z'C⁻¹z = ‖z‖²/λ_min(C_K(M)). If the smallest eigenvalue of C_K(M) has multiplicity 1, then the only choice for the matrix E in Theorem 7.21 is E = zz'/‖z‖². Under the assumption of the theorem, the right hand sides of the two normality inequalities in the proof are the same, ‖z‖²/λ_min(C_K(M)) = z'K'M⁻Kz. The following theorem treats this equality in greater detail. It becomes particularly pleasing when the optimal solution is sought in the set of all moment matrices, M(Ξ). Then the Elfving Theorem 2.14 represents the optimal variance for z'K'θ as (ρ(Kz))², where ρ is the Elfving norm of Section 2.12.

7.24. E-OPTIMALITY, SCALAR OPTIMALITY, AND ELFVING NORM

Theorem. Every moment matrix M ∈ M(Ξ) and every vector 0 ≠ z ∈ ℝ^s fulfill
If equality holds, then M is φ_{−∞}-optimal for K'θ in M(Ξ), and any moment matrix M̃ ∈ M(Ξ) that is φ_{−∞}-optimal for K'θ in M(Ξ) is also optimal for z'K'θ in M(Ξ). Proof. Let M ∈ M(Ξ) be a moment matrix that is feasible for K'θ. Then we get 0 ≠ Kz ∈ range K ⊆ range M ⊆ ℒ(X), with ℒ(X) as introduced in Section 2.12. It follows that Kz has a positive Elfving norm, ρ(Kz) > 0. If M is not feasible for K'θ, then the information matrix C = C_K(M) is singular, by Theorem 3.15, and λ_min(C_K(M)) = 0 < (‖z‖/ρ(Kz))². Otherwise maximization of the smallest eigenvalue of C > 0 is the same as minimization of the largest eigenvalue of C⁻¹ = K'M⁻K. Using λ_max(C⁻¹) = max_{‖z‖=1} z'C⁻¹z, we have for all 0 ≠ z ∈ ℝ^s the trivial inequalities
If equality holds for M and z, then, attaining the upper bound (‖z‖/ρ(Kz))², the moment matrix M is φ_{−∞}-optimal for K'θ in M(Ξ). For every φ_{−∞}-optimal moment matrix M̃ for K'θ in M(Ξ), we get
Now (ρ(Kz))² = z'K'M̃⁻Kz shows that M̃ is optimal for z'K'θ in M(Ξ). For ‖z‖ = 1, the inverse optimal variance 1/(ρ(Kz))² is the maximum information for z'K'θ in M(Ξ). Therefore the theorem establishes the mutual boundedness of the problems of maximizing the smallest eigenvalue λ_min(M) as M varies over M(Ξ), and of minimizing the largest information for z'K'θ as z varies over the unit sphere in ℝ^s. The two problems are dual to each other, but other than in Theorem 7.12, here duality gaps do occur, as is evidenced by the example at the end of Section 7.22. Duality does hold in the parabola fit model of Section 2.21. Indeed, the vector c = (−1,0,2)' fulfills ‖c‖/ρ(c) = 1/√5. The optimal moment matrix M for c'θ has smallest eigenvalue 1/5. The present theorem proves that M is φ_{−∞}-optimal for θ in M(Ξ). This will guide us to determine the φ_{−∞}-optimal design for polynomial fit models of arbitrary degree d ≥ 1, in Section 9.13.
In the present chapter, our concern has been to establish a set of necessary and sufficient conditions for a moment matrix to be optimal. We have concentrated on moment matrices M rather than on designs ξ. In the next chapter we explore what consequences these results imply for the optimality of a design ξ, in terms of the support points and weights of ξ.
EXERCISES

7.1 Let T be a linear mapping from a scalar product space ℒ into a scalar product space 𝒦. The transposed operator T' is defined to be the unique linear mapping from 𝒦 into ℒ that satisfies ⟨Tx, y⟩ = ⟨x, T'y⟩ for all x ∈ ℒ, y ∈ 𝒦. Find the transposed operator of (i) T(x) = Ax : ℝ^k → ℝ^n, where A ∈ ℝ^{n×k}, (ii) T(A) = LAL' : Sym(k) → Sym(s), where L ∈ ℝ^{s×k}.

7.2 For p ∈ (−∞; 1], prove that

7.3 Prove that φ_{−∞} is differentiable at C > 0 if and only if the smallest eigenvalue of C has multiplicity 1, by discussing whether ∂φ_{−∞}(C) is a singleton.

7.5 True or false: if φ(C) > 0, then
7.9 Demonstrate by example that in the Equivalence Theorem 7.16, the generalized inverse G cannot generally be taken to be the Moore–Penrose inverse M⁺ [Pukelsheim (1981), p. 17].

7.10 Demonstrate by example that in the Equivalence Theorem 7.16, the domain of the normality inequality cannot generally be reduced from x ∈ X to x ∈ X ∩ range M [Pukelsheim (1981), p. 17].

7.11 Use the General Equivalence Theorem to deduce Theorem 4.13.

7.12 Show that any matrix E ∈ NND(s) with trace E = 1 satisfies (i) E ≤ I_s, and (ii) c'Ec = c'c if and only if E = cc'/‖c‖², where 0 ≠ c ∈ ℝ^s.

7.13 For the two-point regression range X = {(1,1)', (1,−1)'}, show that I₂ is the unique φ_{−∞}-optimal matrix for θ in M(Ξ). Which matrices E
satisfy Theorem 7.22?
7.14 Show that v(φ_{−∞}) ≤ r², where v(φ_{−∞}) is the φ_{−∞}-optimal value for θ in Ξ and r is the Euclidean in-ball radius of the Elfving set ℛ =
conv(X ∪ (−X)).
7.15 (continued) Demonstrate by example that v(φ_{−∞}) < r² is possible.
CHAPTER 8
Optimal Moment Matrices and Optimal Designs
The interrelation between optimal moment matrices and optimal designs is studied. Necessary conditions on optimal support points are derived in terms of their number, their location, and their weights. The optimal weights on linearly independent regression vectors are given as a fixed point of a nonlinear equation. Multiplicity of optimal moment matrices is characterized by a linear relationship provided the criterion is strictly concave.
8.1. FROM MOMENT MATRICES TO DESIGNS

The General Equivalence Theorem 7.14 concentrates on moment matrices. This has the advantage of placing the optimization problem in a linear space of finite dimensions. However, the statistical interest is in the designs themselves. Occasionally we experience a seamless transition between optimal moment matrices and optimal designs, such as in the Elfving Theorem 2.14 on scalar optimality. However, the general rule is that the passage from moment matrices to designs is difficult, and limited in its extent. All we can aim at are necessary conditions that aid in identifying the support points and the weights of optimal designs. This is the major theme of the present chapter. First we establish an upper bound on the number of support points. The bound applies to all designs, not just to optimal ones. The theorem states that every k × k moment matrix A ∈ M(Ξ) is achieved by a design η which has at most ½k(k + 1) + 1 support points, and that for a feasible moment matrix A, the s × s information matrix C_K(A) (possibly scaled by some δ ≥ 1) is realized by a design ξ with a support size that obeys the tighter bound ½s(s + 1) + s(rank A − s).
8.2. BOUND FOR THE SUPPORT SIZE OF FEASIBLE DESIGNS

Theorem. Assume that the k × s coefficient matrix K of the parameter system K'θ is of full column rank s. Then for every moment matrix A ∈ M(Ξ), there exists a design η ∈ Ξ such that
For every design η ∈ Ξ that is feasible for K'θ, there exists a design ξ such that
Proof. From Lemma 1.26, the set of all moment matrices M(Ξ) admits a representation as a convex hull,
The set M(Ξ) is a convex and compact subset of the linear space Sym(k) of symmetric matrices. The latter has dimension ½k(k + 1), whence (1) and (2) are immediate consequences of the Carathéodory theorem. For the tighter bound (5), we replace the linear space Sym(k) by a hyperplane in the range of an appropriate linear transformation T on Sym(k). The construction of T is reminiscent of the full rank reduction in Lemma 7.3. Let the moment matrix M of η have rank r. Feasibility entails s ≤ r ≤ k. We choose some k × r matrix U with U'U = I_r such that UU' projects onto the range of M. Then the matrix U'M⁻K is well defined, as follows from the preamble in the proof of Theorem 4.6. We claim that U'M⁻K has rank s. To this end we write UU' = MG, where G is a generalized inverse of M (see Section 1.16). From MG = (UU')' = G'M and M²G²M² = MG'MGM² = M², we infer G² ∈ (M²)⁻. Now we get K'G'UU'GK = K'G'MG²K = K'(M²)⁻K. By Lemma 1.17, the last matrix has rank s, and so has U'GK = U'M⁻K. With a left inverse L of U'M⁻K, we define the projector R = I_r − U'M⁻KL. Now we introduce the linear transformation T through
In view of T(A) = T(UU'AUU'), the range of T is spanned by T(UBU') = B − R'BR with B ∈ Sym(r). For counting dimensions, we may choose R to
be in its simplest form
This induces the block partitioning
The symmetric s × s matrix B₁₁ contributes ½s(s+1) dimensions, while another s(r − s) dimensions are added by the rectangular s × (r − s) matrix B₁₂ = B₂₁'. Therefore the dimension of the range of T is equal to ½s(s + 1) + s(r − s). In the range of T, we consider the convex set ℳ generated by the finitely many matrices T(xx') which arise from the support points x of η,
Since ℳ contains T(M), the scale factor ε = inf{δ > 0 : T(M) ∈ δℳ} satisfies ε ≤ 1. If ε = 0 then T(M) = 0, and RU'M⁻K = 0 yields the contradiction
Hence ε > 0, and we may define δ = 1/ε ≥ 1. Our construction ensures that δT(M) lies in the boundary of ℳ. Let ℋ be the hyperplane in the range of T that supports ℳ in δT(M). Thus we obtain δT(M) ∈ ℋ ∩ ℳ = conv(ℋ ∩ S(η)). By the Carathéodory theorem, δT(M) is a convex combination of at most
members of ℋ ∩ S(η). Hence there exists a design ξ such that T(M(ξ)) = δT(M), the support points of ξ are support points of η, and the support size of ξ is bounded by (5). It remains to establish formula (3). We have range M(ξ) ⊆ range M. This entails UU'M(ξ)UU' = M(ξ). Because of RU'M⁻K = 0, we obtain
Evidently the range of K is included in the range of M(ξ), whence M(ξ) is feasible for K'θ. Premultiplication by K'M(ξ)⁻ and inversion finally yield formula (3). In the case of optimality, we obtain the following corollary.
8.3. BOUND FOR THE SUPPORT SIZE OF OPTIMAL DESIGNS

Corollary. Let φ be an information function. If there exists a φ-optimal moment matrix M for K'θ in M(Ξ), then there exists a φ-optimal design ξ for K'θ in Ξ such that its support size is bounded according to
Proof. Every design ξ with a feasible moment matrix for K'θ needs at least s support points. This is so because the summation in range M(ξ) = Σ_{x∈supp ξ} (range xx') must extend over at least s terms in order to include the range of K, by Lemma 2.3. For a design η that has M for its moment matrix, we choose an improved design ξ as in Theorem 8.2, with support size bounded from above by (5) of Theorem 8.2. Positive homogeneity and nonnegativity of
Optimality of M forces the design ξ to be optimal as well. If there are many optimal designs, then all obey the lower bound, but some may violate the upper bound. The corollary only ascertains that at least one optimal design respects both bounds simultaneously. For scalar optimality, we have s = 1. Thus if the moment matrix M is optimal for c'θ in M(Ξ), then there exists an optimal design ξ for c'θ in Ξ such that
The upper bound k also emerges when the Carathéodory theorem is applied in the context of the Elfving Theorem 2.14. For the full parameter vector θ, we have s = k, and the bounds become
In Section 8.6, we provide examples of optimal designs that require ½k(k + 1) support points. On the other hand, polynomial fit models illustrate attainment of the lower bound k in the very strict sense that otherwise a design becomes inadmissible (see Theorem 10.7). Before studying the location of the support points we single out a matrix lemma.

8.4. MATRIX CONVEXITY OF OUTER PRODUCTS

Lemma. Let Y and Z be two k × s matrices. Then we have, relative to the Loewner ordering,
for all α ∈ (0; 1), with equality if and only if Y = Z. Proof.
With X = (1 − α)Y + αZ, the assertion follows from
The following theorem states that in order to find optimal support points, we need to search the "extreme points" X̄ of the regression range X only. Taken literally, this does not make sense. The notion of an extreme point applies to convex sets, and X generally fails to be convex. Therefore we first pass to the Elfving set ℛ = conv(X ∪ (−X)) of Section 2.9. Since ℛ is convex, it has points which are extreme, that is, which do not lie on a straight line connecting any other two distinct points of ℛ. Convex analysis tells us that every extreme point of ℛ is a member of the generating set X ∪ (−X). We define those extreme points of ℛ which lie in X to form the subset X̄.
8.5. LOCATION OF THE SUPPORT POINTS OF ARBITRARY DESIGNS

Theorem. Let X̄ be the set of those regression vectors in X that are extreme points of the Elfving set ℛ = conv(X ∪ (−X)). Then for every design η ∈ Ξ with support not included in X̄, there exists a design ξ ∈ Ξ with support included in X̄ such that
Proof. Let x₁,...,x_ℓ ∈ X be the support points of η. Being closed and convex, the set ℛ is the convex hull of its extreme points. Hence for every i = 1,...,ℓ, there exist extreme points y_{i1},...,y_{in_i} of ℛ such that x_i = Σ_{j≤n_i} α_{ij} y_{ij}, with min_{j≤n_i} α_{ij} > 0 and Σ_{j≤n_i} α_{ij} = 1.
where the design ξ is defined to allocate weight η(x_i)α_{ij} to the point y_{ij}, for i = 1,...,ℓ and j = 1,...,n_i. The points y_{ij} are extreme in ℛ, and hence contained in X or in −X. If y_{ij} ∈ −X is extreme in ℛ then, because of point symmetry of ℛ, −y_{ij} ∈ X is also extreme in ℛ. From the fact that (−y_{ij})(−y_{ij})' = y_{ij}y_{ij}' we may replace y_{ij} ∈ −X by −y_{ij} ∈ X. This proves the existence of a design ξ with supp ξ ⊆ X̄, and with moment matrix M(ξ) improving upon M(η). The preceding result relates to the interior representation of convex sets, by sweeping mass allocated in the interior of the Elfving set into the extreme points out on the boundary. This improvement applies to arbitrary designs η ∈ Ξ. For designs ξ that are optimal in Ξ, the support points x_i must attain equality in the normality inequality of the Equivalence Theorem 7.16, x_i'Nx_i = 1. This property relates to the exterior representation of convex sets, in that the matrix N induces a cylinder which includes the Elfving set ℛ. Altogether the selection of support points of designs optimal in Ξ is narrowed down in two different ways, to search the extreme points x of ℛ that lie in X, or to solve the quadratic equation x'Nx = 1 that arises with an optimal solution N of the dual problem. Polynomial fit models profit from the second tool, as detailed in Section 9.5. The following example has a regression range X so simple that plain geometry suggests concentrating on the extreme points X̄.
8.6. OPTIMAL DESIGNS FOR A LINEAR FIT OVER THE UNIT SQUARE

Over the unit square as the regression range, X = [0; 1]², we consider a two-way first-degree model without constant term,
also called a multiple linear regression model. We claim the following. Claim. For p ∈ [−∞; 1), the unique design ξ_p ∈ Ξ that is φ_p-optimal for θ in Ξ assigns weight w(p) to (1,1)' and weights ½(1 − w(p)) to (1,0)' and (0,1)', where
Proof. The proof of our claim is arranged in three steps, for p > −∞. The case p = −∞ is treated separately, in a fourth step.
I. The regression vectors x ∈ X that are extreme points of ℛ visibly are
For p ∈ (−∞; 1), the matrix mean φ_p is strictly isotonic on PD(2), by Theorem 6.13. Therefore Theorem 8.5 forces the φ_p-optimal design ξ_p to have its support included in X̄, whence its moment matrix takes the form
II. By interchanging the weights w₂ and w₃ we create the design ξ̃_p, with moment matrix
If ξ_p and ξ̃_p are distinct, w₂ ≠ w₃, then their mixture η = ½(ξ_p + ξ̃_p) improves upon ξ_p. Namely, the matrix mean
contradicting the optimality of ξ_p. Thus ξ_p and ξ̃_p are equal, and we have w₂ = w₃ = ½(1 − w₁). Upon setting w(p) = w₁, the moment matrix of ξ_p becomes
III. It now suffices to maximize φ_p over the one-parameter family
This maximization is carried out by straightforward calculus. The matrix M_w has eigenvalues ½(1 + 3w) and ½(1 − w). Hence the objective function f_p =
EXHIBIT 8.1 Support points for a linear fit over the unit square. For p ∈ (−∞; 1), the support of the φ_p-optimal design ξ_p for θ comprises the three extreme points (1,0)', (0,1)', (1,1)' in ℛ ∩ X. The φ_{−∞}-optimal design ξ_{−∞} is supported by the two points (1,0)', (0,1)', and has optimal information one for c'θ with c = ½(1, −1)'.
The unique critical point of f_p is w(p) as given above. It lies in the interior of the interval [0; 1], and hence cannot but maximize the concave function f_p. IV. For φ_{−∞}-optimality, the limiting value lim_{p→−∞} w(p) = 0 suggests the candidate matrix M₀ = ½I₂. Indeed, the smallest eigenvalue ½(1 − w) of M_w is clearly maximized at w = 0. Thus our claim is proved. See also Exhibit 8.1. The argument carries over to the linear fit model over the k-dimensional cube [0; 1]^k (see Section 14.10). The example is instructive from various perspectives. The smallest-eigenvalue optimal design ξ_{−∞} makes do with a minimum number of support points (two). Moreover, there is an intriguing relationship with scalar optimality. The eigenvector corresponding to the smallest eigenvalue of the matrices M_w is c = ½(1, −1)'. It is not hard to see that ξ_{−∞} is uniquely optimal for c'θ in Ξ. This is akin to Theorem 7.23. For p > −∞, the designs ξ_p require all three points in the set X̄ for their support, and hence attain the upper bound ½k(k+1) = 3 of Theorem 8.3.
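The maximization of Step III can also be carried out numerically. The sketch below (supplementary code, not from the book; function names are ad hoc) maximizes φ_p over the family M_w via its eigenvalues and reproduces the weights quoted next, w(0) = 1/3 and w(−1) = (√3−1)/(√3+3).

```python
# Numerical maximization of phi_p over M_w with eigenvalues (1+3w)/2 and (1-w)/2.
import numpy as np
from scipy.optimize import minimize_scalar

def phi_p(eigenvalues, p):
    lam = np.asarray(eigenvalues, dtype=float)
    if p == 0:
        return np.exp(np.mean(np.log(lam)))       # geometric mean (determinant criterion)
    return (np.mean(lam ** p)) ** (1.0 / p)

def optimal_weight(p):
    obj = lambda w: -phi_p([(1 + 3 * w) / 2, (1 - w) / 2], p)
    return minimize_scalar(obj, bounds=(0.0, 1.0), method="bounded").x

print("w(0)   =", optimal_weight(0.0), " (should be 1/3)")
print("w(-1)  =", optimal_weight(-1.0),
      " (should be", (np.sqrt(3) - 1) / (np.sqrt(3) + 3), ")")
print("w(-20) =", optimal_weight(-20.0), " (tends to 0 as p -> -infinity)")
```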
The average-variance optimal design ξ_{−1} has weight w(−1) = (√3−1)/(√3+3) = 0.1547 which, very roughly, picks up half the variation between w(−∞) = 0 and w(0) = 1/3. The determinant optimal design ξ₀ distributes the weight 1/3 uniformly over the three points in X̄, even though the support size exceeds the minimum two. (In Corollary 8.12 we see that a determinant optimal design always has uniform weights 1/k if the support size is a minimum, k.) The design with w(1) = 1 is formally φ₁-optimal for θ in Ξ. This is a one-point design in (1,1)', and fails to be feasible for θ. Theorem 9.15 proves this to be true in greater generality. This concludes our discussion of the number and the location of the support points of optimal designs. Next we turn to the computation of the optimal weights. Theorem 8.7 and its corollaries apply to the situation where there are not very many support points: x₁,...,x_ℓ are assumed to be linearly independent.

8.7. OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS

Theorem. Assume that the ℓ regression vectors x₁,...,x_ℓ ∈ X ⊆ ℝ^k are linearly independent, and form the rows of the matrix X ∈ ℝ^{ℓ×k}, that is, X' = (x₁,...,x_ℓ). Let Ξ be the set of designs for which the support is included in {x₁,...,x_ℓ}. Let ξ ∈ Ξ be a design that is feasible for K'θ, with information matrix C = C_K(M(ξ)). Then the design ξ is φ-optimal for K'θ in Ξ if and only if there exists a nonnegative definite s × s matrix D solving the polarity equation φ(C)φ^∞(D) = trace CD = 1 such that the weights w_i = ξ(x_i) satisfy
where a₁₁,...,a_ℓℓ are the diagonal elements of the nonnegative definite ℓ × ℓ matrix A = UCDCU' with U = (XX')⁻¹XK. Proof. If x_i is not a support point of ξ, then we quite generally get a_ii = 0. To see this, we may assume that the initial r weights are positive while the others vanish, w₁,...,w_r > 0 = w_{r+1} = ··· = w_ℓ. Then x₁,...,x_r span the range of M, by Lemma 2.3. Because of feasibility, there exists some r × s matrix H such that
By assumption, X has full row rank ℓ, whence
For i > r, we get (H', 0)e_i = 0 and a_ii = 0, by the definitions of A and U.
Let Δ_w be the diagonal matrix with weight vector w = (w₁,...,w_ℓ)' ∈ ℝ^ℓ on the diagonal. The moment matrix M of ξ may then be represented as M = X'Δ_wX. Hence
is a specific generalized inverse of M, where Δ_w⁺ is the diagonal matrix with diagonal entries 1/w_i or 0 according as w_i > 0 or w_i = 0. With the Euclidean unit vector e_i of ℝ^ℓ we have x_i = X'e_i. Thus we obtain
which equals a_ii/w_i² or 0 according as w_i > 0 or w_i = 0. For the direct part of the proof, we assume ξ to be
If x_i is a support point of ξ, then (2) holds with equality, while the left hand side is a_ii/w_i². This yields w_i = √a_ii. If x_i is not a support point of ξ, then w_i = 0 = a_ii, as shown in the beginning of the proof. For the converse, (2) is satisfied with G = G₀ since the left hand side only takes the values 1 or 0. By Theorem 7.16, the design ξ is optimal. For the matrix means φ_p with parameter p ∈ (−∞; 1], the polarity equation has solution D = C^{p−1}/trace C^p, by Lemma 6.16. Hence the matrix A = UCDCU' is proportional to B = UC^{p+1}U'. With diagonal elements b₁₁,...,b_ℓℓ of B, the optimal weights satisfy
From these relations we obtain the optimal weights; hence the optimal value becomes
Application to average-variance optimality is most satisfactory. Namely, if p = -1, then B = UU' only involves the given regression vectors, and no weights.
8.8. A-OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS

Corollary. A design ξ is φ_{−1}-optimal for K'θ in Ξ if and only if the weights w_i = ξ(x_i) satisfy
where b₁₁,...,b_ℓℓ are the diagonal elements of the matrix B = UU' with U = (XX')⁻¹XK. The optimal value is

Proof. The corollary is the special case p = −1 of the preceding discussion.
If X is square and nonsingular, ℓ = k, then we obtain U = X'⁻¹K. If the full parameter θ is of interest, then K = I_k and B = (XX')⁻¹. The average-variance optimal weights for the arcsin support designs in Section 9.10 are computed in this way. If a scalar parameter system c'θ is of interest, s = 1, then (XX')⁻¹Xc is a vector, and matters simplify further. We can even add a statement guaranteeing the existence of linearly independent regression vectors that support an optimal design.
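As a small illustration of Corollary 8.8, the following sketch (invented toy data, not from the book) computes the A-optimal weights from the diagonal of B = UU' and confirms them against a crude random search over the weight simplex.

```python
# A-optimal weights on linearly independent support points (hedged numerical sketch).
import numpy as np

# three linearly independent regression vectors in R^3 (an assumed toy example)
X = np.array([[1.0,  0.0, 0.0],
              [1.0,  1.0, 1.0],
              [1.0, -1.0, 1.0]])
K = np.eye(3)                                   # full parameter vector, K = I_k

U = np.linalg.solve(X @ X.T, X @ K)             # U = (XX')^{-1} X K
b_diag = np.diag(U @ U.T)
w = np.sqrt(b_diag) / np.sqrt(b_diag).sum()     # weights proportional to sqrt(b_ii)
print("A-optimal weights:", w)

# sanity check: these weights should minimize trace(K' M(w)^{-1} K) over the simplex
def avg_variance(weights):
    M = X.T @ np.diag(weights) @ X
    return np.trace(K.T @ np.linalg.solve(M, K))

grid = np.random.default_rng(1).dirichlet(np.ones(3), size=20000)
print("formula:", avg_variance(w), "<= best random weights:", min(map(avg_variance, grid)))
```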
8.9. C-OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS

Corollary. There exist linearly independent regression vectors x₁,...,x_ℓ in X that support an optimal design ξ for c'θ in Ξ. The weights w_i = ξ(x_i) satisfy
where u₁,...,u_ℓ are the components of the vector u = (XX')⁻¹Xc. The optimal variance is
Proof. An optimal design η for c'θ in Ξ exists, according to the Elfving Theorem 2.14, or the Existence Theorem 7.13. From Section 8.3, we know that k support points suffice, but there nothing is said about linear independence.
If the optimal design η has linearly dependent support points x₁,...,x_ℓ, then the support size can be reduced while still supporting an optimal design ξ, as follows. The Elfving Theorem 2.14 provides the representation
where y_i = ε_i x_i for suitable signs ε_i ∈ {−1, 1}. Because of linear dependence there are scalars μ₁,...,μ_ℓ, not all zero, such that 0 = Σ_{i≤ℓ} μ_i y_i. We set
say. Because of
we may rewrite (1) as
with w_i = η(x_i) − αμ_i. The numbers w_i are nonnegative, for if μ_i ≤ 0, then this is obvious, and if μ_i > 0, then η(x_i)/μ_i ≥ α by definition of α. We apply the Elfving norm ρ to (2) and use the triangle inequality to obtain
By the Elfving Theorem 2.14, the design ξ with weights ξ(x_i) = w_i is optimal for c'θ. It has a smaller support size than η, because the weight at the index attaining α vanishes. We may continue this reduction process until it leaves us with support points that are linearly independent. Finally Corollary 8.8 applies, with
As an example we consider the line fit model, with experimental domain T = [−1; 1]. We claim the following. Claim. The design ξ that assigns to the points x₁ = (1, 1)', x₂ = (1, −1)' the weights
is optimal for c'θ in Ξ, with optimal variance
Proof. The set of regression vectors that are extreme points of the Elfving set is X̄ = {x₁, x₂}. By Theorem 8.5, there exists an optimal design ξ in Ξ
with support included in X̄. The present corollary then proves our claim, by way of
Without the assumption of linear independence, we are left with a more abstract result. First we single out a lemma on the Hadamard product, that is, entrywise multiplication, of vectors and matrices.

8.10. NONNEGATIVE DEFINITENESS OF HADAMARD PRODUCTS

Lemma. Let A and B be nonnegative definite ℓ × ℓ matrices. Then their Hadamard product A ∗ B = ((a_ij b_ij)) is nonnegative definite. Proof. Being nonnegative definite, B admits a square root representation B = UU' with U = (u₁,...,u_ℓ) ∈ ℝ^{ℓ×ℓ}. For x ∈ ℝ^ℓ, we then get
8.11. OPTIMAL WEIGHTS ON GIVEN SUPPORT POINTS

Theorem. Assume that the design ξ ∈ Ξ is φ-optimal for K'θ in Ξ. Let E ∈ ℝ^{s×s} be a square root of the information matrix, C_K(M(ξ)) = C = EE'. Let D ∈ NND(s) and G ∈ M(ξ)⁻ satisfy the polarity equation and the normality inequality of the Equivalence Theorem 7.16. Let the support points x₁,...,x_ℓ of ξ form the rows of the matrix X ∈ ℝ^{ℓ×k}, that is, X' = (x₁,...,x_ℓ), and let the corresponding weights w_i = ξ(x_i) form the vector w ∈ ℝ^ℓ. Then, with A = XGKE(E'DE)^{1/2}E'K'G'X' ∈ NND(ℓ), the weight vector w solves
and the individual weights w_i satisfy w_i ≤ 1/a_ii² ≤ λ_max(CD), for all i = 1,...,ℓ.
Proof. The proof builds on expanding the quadratic form x_i'GKCDCK'G'x_i from Theorem 7.16 until the matrix M(ξ) appears in the middle. We set N = GKCDCK'G', and v_i = E'K'G'x_i for i ≤ ℓ. Theorem 7.16 states that
We expand E'DE into (E'DE)^{1/2} I_s (E'DE)^{1/2}. Since C is nonsingular, so is E. From E'⁻¹E⁻¹ = C⁻¹ = K'M(ξ)⁻K = K'G'M(ξ)GK, we get I_s = E'K'G'M(ξ)GKE. Inserting M(ξ) = Σ_{j≤ℓ} w_j x_j x_j', we continue in (1),
This proves (A ∗ A)w = 1_ℓ. From the lower bound a_ii² w_i for (2), we obtain w_i ≤ 1/a_ii². On the other hand, we may bound (1) from above by
Since E'DE and EE'D = CD share the same eigenvalues, this yields the remaining bound of the theorem. By Lemma 8.10, the coefficient matrix A ∗ A of the equation (A ∗ A)w = 1_ℓ is nonnegative definite. The equation is nonlinear in w since A depends on w, through G, E, and D. Generally, the fixed points w of the equation (A ∗ A)w = 1_ℓ appear hard to come by. For the matrix means
the optimal information matrix C, an exception emerges for determinant optimality.
8.12. BOUND FOR DETERMINANT OPTIMAL WEIGHTS

Corollary. Every φ₀-optimal design ξ for K'θ in Ξ has all its weights bounded from above by 1/s. Moreover, every φ₀-optimal design ξ for θ in Ξ that is supported by k regression vectors has uniform weights 1/k. Proof. With D = C⁻¹/s, we get w_i ≤ λ_max(CD) = λ_max(I_s)/s = 1/s. Moreover, if the full parameter vector θ is of interest, then w_i ≤ 1/k. If there are no more than k support points, then each weight must attain the upper bound 1/k in order to sum to 1. Next we turn to multiplicity of optimal designs. We assume that the information function φ is strictly concave, forcing uniqueness of the optimal information matrix C. Optimal moment matrices M need not be unique, but are characterized as solutions of a linear matrix equation.
8.13. MULTIPLICITY OF OPTIMAL MOMENT MATRICES

Theorem. Assume that the information function φ is strictly concave on PD(s). Let the moment matrix M ∈ ℳ be φ-optimal for K'θ in ℳ, with generalized inverse G of M that satisfies the normality inequality of the General Equivalence Theorem 7.14. Then any other moment matrix M̃ ∈ ℳ is also φ-optimal for K'θ in ℳ if and only if
Proof. For the direct part, we relate the given optimal solution M of the primal problem to the optimal solution N = GKC_K(M)DC_K(M)K'G' of the dual problem of Section 7.11. We obtain NK = GKC_K(M)D and K'NK = D. Any other optimal moment matrix M̃, and the dual optimal solution N that comes with M, jointly fulfill equation (2) of Theorem 7.11. Postmultiplication of this equation by K gives
At this point, we make use of the strict concavity of
to be positive definite, for otherwise z'Dz = 0 and z ≠ 0 lead to the contradiction
much as in the proof of Theorem 5.16. Therefore we may cancel D in (1). Secondly, strict concavity entails uniqueness of the optimal information matrix. Cancelling C_K(M̃) = C_K(M) in (1), we now obtain the equation M̃GK = K. For the converse part we note that M̃ is feasible for K'θ. Premultiplying M̃GK = K by K'M̃⁻, we obtain K'M̃⁻K = K'M̃⁻M̃GK = K'GK = K'M⁻K. Thus M̃ has the same information matrix for K'θ as has M. Since M is optimal, so is M̃. The theorem may be reformulated in terms of designs. Given a φ-optimal moment matrix M for K'θ in ℳ and a generalized inverse G of M from the General Equivalence Theorem 7.14, a design ξ is φ-optimal for K'θ in Ξ if and only if M(ξ)GK = K.
For general reference, we single out what happens with the matrix means
8.14. MULTIPLICITY OF OPTIMAL MOMENT MATRICES UNDER MATRIX MEANS
Corollary. If p ∈ (−∞; 1), then given a moment matrix M ∈ ℳ that is φ_p-optimal for K'θ in ℳ, any other moment matrix M̃ ∈ ℳ is also
In the remainder of this chapter we treat more specialized topics: simultaneous optimality relative to all matrix means, and matrix mean optimality if the parameter system of interest K'θ comprises the first s out of all k components of θ, or if it is rank deficient. If a moment matrix remains φ_p-optimal while the parameter p of the matrix means runs through an interval (p₁, p₂), then optimality extends to the end points p₁ and p₂, by continuity. The extreme cases p₁ = −∞ and p₂ = 0 are of particular interest since φ_{−∞} and
8.15. SIMULTANEOUS OPTIMALITY UNDER MATRIX MEANS

Lemma. If M is
with s × s block M₁₁, (k − s) × (k − s) block M₂₂, and s × (k − s) block M₁₂ = M₂₁'. The vectors x = (x₁,...,x_k)' ∈ ℝ^k such that x_{s+1} = ··· = x_k = 0 form a subspace which we denote by ℝ^s × {0}. Theorem 7.20 and Theorem 7.22 specialize as follows.
8.16. MATRIX MEAN OPTIMALITY FOR COMPONENT SUBSETS

Corollary. Let M ∈ M(Ξ) be a moment matrix such that its range includes the leading s coordinate subspace, ℝ^s × {0}, with information matrix
C = M₁₁ − M₁₂M₂₂⁻M₂₁.
If p ∈ (−∞; 1], then M is φ_p-optimal for (θ₁,...,θ_s)' in M(Ξ) if and only if there exists some s × (k − s) matrix B such that
In case of optimality, B satisfies BM₂₂ = −M₁₂.
The matrix M is φ_{−∞}-optimal for (θ₁,...,θ_s)' in M(Ξ) if and only if there exist a nonnegative definite s × s matrix E with trace equal to 1 and some s × (k − s) matrix B such that
In case of optimality, B satisfies EBM₂₂ = −EM₁₂. Proof. We set K = (I_s, 0)' and choose a generalized inverse G of M according to Theorem 7.20. Partitioning the s × k matrix CK'G' = (A, B) into a left s × s block A and a right s × (k − s) block B, we postmultiply by K to get A = CK'G'K = I_s. This yields GKCDCK'G' = (I_s, B)'D(I_s, B). The normality inequality now follows with D = C^{p−1}/trace C^p and D = E/λ_min(C) from Lemma 6.16. In the case of optimality, we have, upon setting N = (I_s, B)'D(I_s, B),
Because of condition (2) of Theorem 7.11, we may equate the bottom blocks to obtain DBM₂₂ = −DM₁₂. If p > −∞, then D cancels; if p = −∞, then λ_min(C) cancels. In order to discuss parameter systems K'θ with a k × s coefficient matrix of less than full column rank, we briefly digress to discuss the basic properties of Moore–Penrose matrix inverses.

8.17. MOORE-PENROSE MATRIX INVERSION
Given a rectangular matrix A ∈ ℝ^{k×s}, its Moore–Penrose inverse A⁺ is defined to be the unique s × k matrix that solves the four Moore–Penrose equations
The Moore-Penrose inverse of A obeys the formulae
Therefore, in order to compute A⁺, it suffices to find the Moore–Penrose inverse of the nonnegative definite matrix AA', or of A'A. If a nonnegative definite matrix C has r positive eigenvalues λ₁,...,λ_r, counted with
their respective multiplicities, then any eigenvalue decomposition permits a straightforward transition back and forth between C and C⁺,
It is easily verified that the matrix A⁺ obtained by (1) and (2) solves the Moore–Penrose equations. One can also show that this solution is the only one.

8.18. MATRIX MEAN OPTIMALITY FOR RANK DEFICIENT SUBSYSTEMS

If K is rank deficient and a moment matrix M is feasible for K'θ, M ∈ A(K), then the dispersion matrix K'M⁻K is well defined, but has at least one vanishing eigenvalue, by Lemma 1.17. Hence K'M⁻K is singular, and regular inversion to obtain the information matrix C fails. Instead one may take recourse to generalized information matrices, to a reparametrization argument, or to Moore–Penrose inverses. Indeed, the positive eigenvalues of K'M⁻K and (K'M⁻K)⁺ are inverses of each other, as are all eigenvalues of K'M⁻K and (K'M⁻K)⁻¹ if K has full column rank s. On the other hand, the matrix mean φ_p(C) depends on C only through its eigenvalues. Therefore we introduce a rank deficient matrix mean φ_p', by requiring it to be the vector mean of the positive eigenvalues λ₁,...,λ_r of C = (K'M⁻K)⁺:
The definition of the rank deficient matrix mean <£p is not consistent with how the matrix mean >p is defined in Section 6.7. If p e [-00; 0], then the ordinary matrix mean 4>p vanishes identically for singular matrices C, whereas >p' does not. Nevertheless, with the rank deficient matrix means >p', the equivalence theorems remain valid as stated, except that in Theorem 7.19, negative powers of C = (K'M'KY must be interpreted as positive powers of K'M'K. This leads to the normality inequality trace AGKC(K'M~K)l-pCK'G'
< trace C(K'M-K)l-p
for all A e M.
In Theorem 7.21, A m i n (C) becomes the smallest among the positive eigenvalues of C, entailing l/A min (C) - \max(K'M~K). An alternate approach leading to the same answers is based on reparametrization, K'6 — UH'd, with some k x r matrix H which has full column
206
CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS
rank r and with some s x r matrix U which satisfies U'U = lr. Then we get
Respecting multiplicities, the positive eigenvalues of (K 'M K)+ are the same as all eigenvalues of (H'M~H)~l. Therefore, by adopting (H'M~H)~l as the information matrix for K'O, the resulting optimality criterion coincides with the one that is built on Moore-Penrose inversion,
8.19. MATRIX MEAN OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS For the centered contrasts of the factor A in a two-way classification model, Section 4.8 establishes Loewner optimality of the product designs rs' in the set T(r) of designs with given treatment marginals r. Relative to the full set T of all a x b block designs we now claim the following. Claim. The equireplicated product designs las' with arbitrary column sum vector s are the unique ^'-optimal designs for the centered contrasts of factor A in the set T of all block designs, for every p e [-00; 1]; the optimal contrast information matrix is Ka/a, and the optimal value is I/a. Proof. The centered contrasts of factor A have a rank deficient coefficient matrix
An equireplicated product design 1as' has row sum vector 1a = la/a, that is, the levels i = 1,..., a of factor A are replicated an equal number of times. Its
8.19. MATRIX MEAN OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS
207
moment matrix Af, a generalized inverse G, and the product GK are given by
(see Section 4.8). Hence the standardized dispersion matrix is K'GK — aKa, and contrast information matrix is C = Ka/a. The powers of C are Cp+l = Ka/ap+l. Let A e M(T) be a competing moment matrix,
If the row sum vector of W coincides with one given before, r = r, then we get K'GAGK = a2Ka&rKa. The left hand side of the normality inequality of Section 8.18 becomes
The right hand side takes on the same values, trace Cp = (a - l)/ap. This proves ^'-optimality of 1as' for Kaa in T for every p € [-oo;l], by Theorem 7.19 and Lemma 8.15. Uniqueness follows from Corollary 8.14, since by equating the bottom blocks in
we obtain aW 'Ka = 0 and W — 1as'. The optimal value is found to be
In other words, the optimal information for the contrasts is inversely proportional to the number of levels. This completes the proof of our claim. A maximal parameter system may also be of interest. The expected response decomposes into a mean effect, the centered contrasts of factor A, and the centered contrasts of factor B,
208 tem
CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS
This suggests an investigation of the parameter sys-
Here we claim that simultaneous <£p'-optimality pertains to the uniform design 1al b. This is the unique product design that is equireplicated and equiblocksized. It assigns uniform weight \/(ab) to each combination (/,;') of factor A andB. Claim. The uniform design 1al b is the unique ^'-optimal design for the maximal parameter system (1) in the set T of all block designs, for every p € [-00; 1]; the dispersion matrix D = K'(M(1 alb)]~K and the >p'-optimal value are
Proof, With the generalized inverse G of M = M(l al b ) as in Section 4.8, we get MGK — K for the coefficient matrix K from (1). Hence M is feasible. Furthermore C = D+ is readily calculated from D, and
say. With regression vectors
from Section 1.5, the normality inequality of Theorem 7.19 becomes (e/,d/)Af(e/,<*/)' = 1 + (a - l)/aP + (b - l)/bP = trace CP. Therefore the uniform design is (^'-optimal, for every p e [-00; 1]. Uniqueness follows from Corollary 8.14 since
forces r = s = and W = This proves the claim.
EXERCISES
209
In the following chapter we continue our discussion of designs that are optimal under the matrix means
(continued) Show that if K' = K+, then <}>p(U'AKU) = <}>p((K'A-K)+) for all A € A(K).
CHAPTER 9
D-, A-, E-, T-Optimality
Optimality of moment matrices and designs for the full parameter vector in the set of all designs is discussed, with respect to the determinant criterion, the average-variance criterion, the smallest-eigenvalue criterion, and the trace criterion. Optimal designs are computed for polynomial fit models, where designs with arcsin support are seen to provide an efficient alternative. In trigonometric fit models, optimal designs are obtained not only for infinite sample size, but even for every finite sample size. Finally we illustrate by example that the same design may be optimal in models with different regression functions.
9.1. D-, A-, E-, T-OPTIMALITY The most popular optimality criteria in the design of experiments are the determinant criterion fa, the average-variance criterion >_i, the smallesteigenvalue criterion <£_oo, and the trace criterion
9.2. G-CRITERION A special case of scalar optimality arises if the experimenter wishes to investigate x '0, the mean value for the response that comes with a particular regression vector x — /(/). The performance of a design £ with a feasible moment matrix M is then measured by the information value CX(M) = (x'M~x)~l 210
9.3. BOUND FOR GLOBAL OPTIMALITY
211
for x'6, or equivalently, by the standardized variance x'M x of the optimal estimator x'O. However, if the experimenter is interested, not just in a single point x'6, but in the regression surface x H+ x'O as x varies over the regression range X, then a global performance measure is called for. The following criterion g concentrates on the smallest possible information and provides a natural choice for a global criterion. It is defined through
A design £ E H is called globally optimal in M(H) when its moment matrix M satisfies g(M) = sapA£M^g(A). Thus we guard ourselves against the worst case, by maximizing the smallest information over the entire regression range X. Traditionally one prefers to think in terms of variance rather than information, as pointed out in Section 6.17. The largest variance over X is
The global criterion thus calls for maximization of g(A), the smallest possible information over X, or minimization of d(A\ the largest possible variance over X, as A varies over the set M(H) of all moment matrices. A bound on the optimal value of the global criterion is easily obtained as follows. 9.3. BOUND FOR GLOBAL OPTIMALITY Lemma. Assume that the regression range X C IR* contains k linearly independent vectors. Then every moment matrix M e M (H) satisfies
Proof. If M is singular, then there exists a regression vector x e X which is not a member of the range of M. Hence we obtain d(M) = oo and g(M) = 0, and the bounds are correct. If M is nonsingular and belongs to the design £ e H, then the bounds follow from
Indeed, the upper bound I/k for the minimum information g, and the lower bound k for the maximum variance d are the optimal values. This
212
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
comes out of the famous Kiefer-Wolfowitz Theorem. Furthermore, this theorem establishes the equivalence of determinant optimality and global optimality if the set of competing moment matrices is as large as can be, M = M (E). The moment matrices and designs that are determinant optimal are the same as those that are globally optimal, and valuable knowledge about the location and weights of optimal support points is implied. 9.4. THE KIEFER-WOLFOWITZ THEOREM Theorem. Assume that the regression range X C Rk contains k linearly independent vectors. Then for every moment matrix M e M(E) that is positive definite, the following four statements are equivalent: a. b. c. d.
(Determinant optimality) M is
In the case of optimality, any support point *,- of any design £ e E that is
Proof. Theorem 7.20, with p = 0, covers the equivalence of parts (a) and (b), as well as property (1). From (b), we obtain d(M) < k. By Lemma 9.3, we then have d(M) = k, that is, (c). Conversely, (c) plainly entails (b). Condition (c) says that M attains the global optimality bound of Lemma 9.3, whence (c) implies part (d). It remains to show that part (d) implies (c). By the Existence Theorem 7.13, there is a moment matrix M e M(S) which is
9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS
213
Property (2) entails that if a
9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS The Kiefer-Wolfowitz Theorem 9.4 is now used to determine the
with tf e [—!;!]• The regression function / maps t e [—1;1] into the power vector x = f ( t ) = (1, r , . . . , td}' e M.d+l. The full parameter vector 6 comprises the k — d + 1 components BQ, &\,..., 6d. Rather than working with designs £ e a on the regression range X C Ud+l, we concentrate on the set T of all designs T on the experimental domain T — [-!;!]. In the dth-degree model, the moment matrix of a design T on T is given in Section 1.28:
A design r is feasible for the full parameter vector 0 if and only if its moment matrix Md(r) is positive definite. The minimum support size of r then is d+l, because r must have at least d + 1 support points r0, f i , . . . , td e [-1; 1] such that the corresponding regression vectors jc/ = /(f,-) are linearly independent. If the design T is
214
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
Degree 2 3 4 5 6 7 8 9 10
Legendre Polynomial Pd on [— 1, 1] 2
(-l + 3r )/2 (-3r + 5r3)/2 (3-30?2 + 35*4)/8 (15/-70f 3 +63f s )/8 (-5 + 105f2 - 315f4 + 231f6)/16 (-35/ + 315f3 - 693/5 + 429f7)/16 (35 - 1260/2 + 6930/4 - 12012/6 + 6435f8)/128 (315* - 4620f3 + 18018*5 - 25740?7 + 12155f9)/128 (-63 + 3465r2 - 30030/4 + 90090/6 - 109395/8 + 46189/10)/256
EXHIBIT 9.1 The Legendre polynomials up to degree 10. We have PQ(t) = 1 and /^(r) = t, the next nine polynomials are shown in the exhibit.
This yields k = d +1 support points of the form
for any design T e T that in the dth-degree model is <£-optimal for 0 in T. We have constructed the matrix N as the subgradient of
where Pd is the derivative of the d th-degree Legendre polynomial Pd. Proof.
Uniformity of the weights follows from the Kiefer-Wolfowitz The-
9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS
215
orem 9.4. The location of the support is found by studying the normality inequality f(t}'(Md(T*)}-lf(t] < d+l. Introducing the (d+l) x (d+l) matrix X through
we get Md(r*) = X'X/(d + 1). Since Md(r*} is nonsingular so is X. This reduces the normality inequality to IJA""1/!')!!2 < 1 f°r au" f £ [-!;!]• The inverse V of X' admits an explicit representation once we change from the power basis l,t,...,td to the basis provided by the dth-degree Lagrange polynomials L/ with nodes fo, h > • • • > t
where in the products, k ranges from 0 to d. Evidently L/(f ; ) equals 1 or 0 according as / = j or / ^ j. The same is true of the dth-degree polynomial P(t) = e!Vf(t) since we have P(tj) = e!Vf(tj) - efVX'ej = etc,. Thus the two polynomials are identical,
In other words, the matrix V, which has the power basis coefficient vectors of the Lagrange polynomials as rows, is the inverse of the matrix X', which has the power vectors jc, = /(/,) that come with the nodes ?, as columns. This yields \\X'~lf(t)\\2 = ||V/(r)||2 = £?=oL?(0- In order to compare this sum of squares with the constant 1, we use the identity
Indeed, on either side the polynomials are at most of degree 2d+l. At tj, they share the same value 1 and the same derivative 2L;-(fy), for j = 0,1,...,d. Hence the two sides are identical. The polynomial Q(t) = n*=o(' ~'*) nas degree d+\ and satisfies Q(r,-) = 0. Around f,-, it has Taylor expansion
216
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
With this, the Lagrange polynomials and their derivatives at r, become
We let c e R be the coefficient of the highest term td+l in (1 -t2)Pd(t). By assumption this polynomial has zeros tk, giving cQ(t) = (1 — t2)Pd(t). Only now do we make use of the particulars of the Legendre polynomial Pd that it is characterized through the differential equation
In terms of Q this means cQ(t) = -d(d+l)Pd(t), and cQ(t) = -d(d+l)Pd(t). Insertion into (2) yields
For i = 1,... ,d — 1, the points /, are the zeros of Pd, and so the derivatives L,-(fj) vanish. This aids in evaluating (1), since it means that the only nonvanishing terms occur for / = 0 and / = d. The left hand side in (3) is (l-t2)Pd(t)-2tPd(t). Thus t = ±1 in (3) implies L 0 (-l) = -d(d+l)/4 and L rf (l) = d(d+l)/4 in (4). With Pd(-l) = (-\)d and Pd(l) = 1, we insert L0 and Ld from (4) into (1) to obtain
This verifies the normality inequality. The proof of our claim is complete. For d > 1, the optimal value i^(
with starting value vi(
9.6. ARCSIN SUPPORT DESIGNS
217
form weight 1/2, and optimal value ui(
for / = 0,1,... ,d. Symmetry of the arcsin distribution entails symmetry of the quantiles, s^ = — sd_i for all / = 0 , 1 , . . . , d , with s0 = —1 and sd = 1. Exhibit 9.5 illustrates the construction. Designs with arcsin support are very efficient in many polynomial fit problems. They deserve a definition of their own. DEFINITION. In a polynomial fit model of degree d over the experimental domain [—1; 1], an arcsin support design ad is defined by having for its support the d th-degree quantiles
of the arcsin distribution on [-!;!]. The set of all arcsin support designs for degree d is denoted by ^d. The member with uniform weights l/(d + 1) is designated by
218
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
EXHIBIT 9.2 Polynomial fits over [-1; 1): ^-optimal designs r* for 6 in T. Left: degree d of the fitted polynomial. Middle: support points and weights of rfi, and a histogram representation. Right: optimal value vd(
9.6. ARCSIN SUPPORT DESIGNS
219
EXHIBIT 93 Polynomial fits over [-1; 1]: ^-optimal designs a* for G in 2d. Left: degree d of the fitted polynomial. Middle: arcsin support points, weights of
220
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
EXHIBIT 9.4 Histogram representation of the design TJ}°. Superimposed is the arcsin density
EXHIBIT 9.5 Fifth-degree arcsin support. Top: as quantiles sf - A '(//5). Bottom: as projections s, = cos(l - i/5)tr of equispaced points on the half circle.
9.7. EQUIVALENCE THEOREM FOR A-OPTIMALITY
221
Exhibit 9.2 presents the overall optimal designs TQ. In contrast, the optimal arcsin support designs crfi and their (^-efficiencies are shown in Exhibit 9.3. For degrees d — 1,2, the arcsin support designs aft coincide with the optimal designs TQ and hence have efficiency 1. Thereafter the efficiency falls down to 97.902% for degree 9. Numerical evidence suggests that the efficiency increases from degree 10 on. The designs are rounded to three digits using the efficient design apportionment for sample size n = 1000 of Section 12.12. The determinant criterion fa is peculiar in that optimal designs with a minimum support size d + 1 must assign uniform weights to their support points. For other criteria, the optimal weights tend to vary. We next turn to the matrix mean
for positive definite k x k matrices A, and d_\(A) — oo for nonnegative definite k x k matrices A that are singular. Because of
the function d_\ is on A/(a) bounded from below by 1. 9.7. EQUIVALENCE THEOREM FOR A-OPTIMALITY Theorem. Assume that the regression range X C U.k contains k linearly independent vectors. Then for every moment matrix M € M (E) that is positive definite the following four statements are equivalent: a. b. c. d.
(Average-variance optimality) M is <£_!-optimal for 6 in M(E). (Normality inequality] x'M~2x < trace M"1 for all x € X. (Minimax property) d_i(M) = 1. (d ^-optimality) M minimizes d_\ in M (E).
In the case of optimality, any support point jc, of any design £ e S that is >_i-optimal for 6 in S satisfies
Proof.
The proof parallels that of the Kiefer-Wolfowitz Theorem 9.4.
222
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
The present theorem teaches us a major lesson about the frame that the Kiefer-Wolfowitz Theorem 9.4 provides for determinant optimality. It is false pretense to maintain the frame also for other criteria, in order to force a structural unity where there is none. No interpretation is available that would promise any statistical interest in the function d^\. Therefore, the frame of the Kiefer-Wolfowitz Theorem 9.4 is superseded by the one provided by the General Equivalence Theorem 7.14. 9.8. L-CRITERION An optimality concept not mentioned so far aims at minimizing linear criteria of the dispersion matrix. It arises when a design £ is evaluated with a view towards the average variance on the regression surface x »-> x'0 over X. Assuming M(£) to be positive definite, the evaluation is based on the criterion
where the distribution A on X reflects the experimenter's weighting of the regression vectors x e X. Upon setting W = A/(A), this approach calls for the minimization of trace WM (^)~1. The generalization to parameter subsystems K'0 is as follows. Let W be a fixed positive definite s x s matrix. Under a moment matrix M that is positive definite, the optimal estimator for K' 6 has a dispersion matrix proportional to K'M~1K (see Section 3.5). The notion of linear optimality for K'O calls for the minimization of
This is a linear function of the standardized dispersion matrix K'M 1K. Linear optimality poses no new challenge, being nothing but a particular case of average-variance optimality. Let H 6 R5X* be a square root of W, that is, W — HH'. Then the criterion takes the form
Hence minimization of
9.9. A-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS
223
9.9. A-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS Continuing the discussion of polynomial fit models, we now study designs that for degree d are $_i -optimal for the full parameter vector 6 in the set T of all designs on the experimental domain T = [—!;!]. Let Md be a moment matrix that is >_]-optimal for 6 in M(E). The criterion
In summary, there is a unique design T^ that is
Exhibit 9.7 shows that the >_!-efficiency of a^ is remarkably high. Next we turn to the left extreme matrix mean, the smallest-eigenvalue criterion <£_oo- In order to determine the 4>_oo-optimal designs for polynomial fit models we need to refer to Chebyshev polynomials. We review their pertinent properties first.
224
CHAPTER 9. D-, A-, E-, T-OPTIMALITY
EXHIBIT 9.6 Polynomial fits over [-!;!]: ^-optimal designs T^, for 9 in T. Left: degree d of the fitted polynomial. Middle: support points and weights of r^r and a histogram representation. Right: optimal value u,/(<^_i) of the average-variance criterion.
9.9. A-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS
225
EXHIBIT 9.7 Polynomial fits over [-!;!]:
226
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
Degree
2 3 4 5 6 7 8 9 10
Chebyshev Polynomial Td on [-1;i]
-l + 2f 2 -3r + 4t3 l - & 2 + 8r4 5f - 20f3 + 16f5 -! + 18f 2 -48r 4 + 32r6 -7r + 56r3-112r5 + 64f7 1 - 32f2 + 160r4 - 256f6 + 128r8 9t - 120r3 + 432r5 - 576r7 + 256r9 -1 + 50r2 - 400r4 + 1120f6 - 1280r8 + 512/10
EXHIBIT 9.8 The Chebyshev polynomials up to degree 10. We have T0(t) = 1 and 7\ (/) = t, the next nine polynomials are shown in the exhibit.
9.10. CHEBYSHEV POLYNOMIALS The Chebyshev polynomial Td of degree d is defined by
The Chebyshev polynomials for degree d < 10 are shown in Exhibit 9.8. It is immediate from the definition that the function Td is bounded by 1,
All extrema of Td have absolute value 1. They are attained at the arcsin support points
from Section 9.6:
In order to see that Td(t] is a polynomial in t, substitute cos (p for / and compare the real parts in the binomial expansion of the complex exponential function, cos(d
9.11. LAGRANGE POLYNOMIALS WITH ARCSIN SUPPORT NODES
227
coefficient and then every second are nonzero,
where [d/2\ is the integer part of d/2, We call c — (CQ,CI, ... ,Q)' e IRd+1 the Chebyshev coefficient vector. With power vector f ( t ) = (1, t,..., td)', we can thus write Td(r) - c'/(r). Whereas the vector c pertains to the power basis l , f , . . . , f d , we also need to refer to the basis provided by the Lagrange interpolating polynomials, with nodes given by the arcsin support points Sj.
9.11. LAGRANGE POLYNOMIALS WITH ARCSIN SUPPORT NODES The Lagrange polynomials with nodes SQ, s\,..., s^ are
where in the products, k ranges from 0 to d. We find it unambiguous to use the same symbol L, as in Section 9.5 even though the present node set is distinct from the one used there, and so are the Lagrange polynomials and all associated quantities. Again we introduce the (d + 1) x (d + 1) matrix V = (VQ,VI, ... ,vd) that comprises the coefficient vectors v, from the power basis representation Li(t) = i>//(0- F°r degrees d = 1,2,3,4, the coefficient matrices V and the sign pattern (-l)d~l+j of vijtj-2j are shown in Exhibit 9.9. More precisely, the entries of V satisfy the following. Claim. For all i = 0,1,...,d and ; = 0,1,..., |///2j, we have
Proof. The proof of (1) is elementary, resorting to the definition of L, and exploiting the arcsin support points -1 = SQ < Si < ••• < sd^ < sd = 1, solely their symmetry, sd_( = -5,-. The denominator satisfies
228
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
EXHIBIT 9.9 Lagrange polynomials up to degree 4. For degrees d = 1,2,3,4, the (d +1) x (d + 1) coefficient matrix V is shown that determines the Lagrange polynomial L,-(f) = y*._ n v;/f'', with nodes given by the arcsin support points 5,-. The bordering signs indicate the pattern
It remains to multiolv out the numerator oolvnomial and to find the coefficient belonging to td 2>. We distinguish three cases. I. Case d odd. Except for / - sd_t = t + s,-, the factors in P come in pairs, / - sk and / + sk. Hence we have P(t) = (t + Si)Q(t), where the polynomial
involves only even powers of t. In order to find the coefficient of the odd power td~2i in P, we must use the term t of the factor t + st and \(d - 1) - ; terms t2 in the factors of Q. This results in the coefficient
With \(d - 1) = \\(d - 1)J, we see that (3) and (2) establish (1). n. Case d even and i = \d. The definition of P misses out on the factor t — 0. Since the remaining factors come in pairs, we get
9.12. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, I
229
The even power td~2i uses \d — j terms t2, and thus has coefficient
With \d-l = \\(d - 1)J now (4) and (2) imply (1). III. Case d even and i / \d. Here P comprises the factors t - 0 = t and t - sd_i = t + St. We obtain P(t) = t(t + s,-)G(0» with
The even power td 2> is achieved in P only as the product of the leading factor r, the term t in the factor t + Si, and \d-l-j terms t2 in the factors of Q. The associated coefficient is
This and (2) prove (1). Moreover (5) is nonzero unless the summation is empty. This occurs only if ; = \d; then the set contains \d - 1 numbers and does not include a ^-element subset. As depicted in Exhibit 9.9, we refer to (1) through the sign pattern (-\Y~l+l• Furthermore both numerator and denominator in (1) are symmetrical in / and d — i. The discussion of (5), together with (3) and (4) show that the numerator vanishes if and only if d is even and / = ^d ^ i. All these properties are summarized in
This concludes our preparatory remarks on Chebyshev polynomials, and Lagrange polynomials with nodes s,. In order to apply Theorem 7.24, we need the Elfving norm of the vector Kz. For the Chebyshev coefficient vector c and z = K'c, we compute the norm p(KK'c) as the optimal value of the design problem for c'KK'0. 9.12. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, I In a d th-degree polynomial fit model, we consider parameter subsystems of the form QX = (Oi)iei, with an ^-element index set I = {ii,...,is}. Using
230
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
the Euclidean unit vectors e§,e\,...,ed of Rd+1, we introduce the (d + 1) x 5 matrix K = (e,,,... ,eis) and represent the parameter system of interest by K'O = ( f t , , . . . , ft,)'. The matrix K fulfills K'K = Is and KK1 the latter is a diagonal matrix D where da is 1 or 0 according as / belongs to the set X or not. Because of the zero pattern of the Chebyshev coefficient vector c, the vector KK 'c depends on X only through the set
More precisely, we have KK'c = Y^j&jcd-2jed-2j- We assume the index set X to be such that J is nonempty, so that KK 'c ^ 0. In this section, we consider scalar optimality for c'KK'd and find, as a side result, the Elfving norm p(KK'c) to be \\K 'elf. It turns out that the optimal design is an arcsin support design. Claim. Let c e F8rf+1 be the coefficient vector of the Chebyshev polynomial Td(t) = c'f(t) on [—!;!]. Then the unique design TJ that is optimal for in T has support points
and weights
and optimal variance (p(KK 'c)) = \\K 'c||4, where the coefficients w0, "i, • • • , ud are determined from
If d is odd or J ^ {\d} then all weights are positive. If d is even and J — {\d}, then we have c'KK'B = BQ and wd/2 = 1, that is, if d is even, then the one-point design in zero is uniquely optimal for BQ in T. Proof. The proof is an assembly of earlier results. Since the vector u = (HO, MI , • • • i ud)' solves X'u — KK 'c, the identity X' V = Id+i from Section 9.5 yields u = VKK'c, that is, «, = X)yej vi,d-2jCd-2j- An exceptional case occurs for even degree d and one-point set J = {\d}\ then we have wd/2 = 1 and all other weights are 0.
9.12.
SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, I
Otherwise we use the sign patterns to show that the weights are positive,
The normalization follows from
and
231
and
upon using
From (6) in Section 9.11, we see that the weights are symmetric, w, = Wd-iLet M be the moment matrix of the design TJ. The key relation is the following:
Premultiplication by c'KK'M~ gives \\K'c\\4 = c'KK'M'KK'c. That is, for the design TJ the optimality criterion for c'KK'O takes on the value ||AT'c||4. Now we switch to the dual problem. The matrix N = cc' satisfies f(t)'Nf(t) = (Td(t)}2 < 1, for all t e [-!;!]. The dual objective function has value c'KK'NKK'c = ||A"'c||4. Therefore the Mutual Boundedness Theorem 2.11 proves the matrices M and N to be optimal solutions of the design problem and its dual. The theorem also stipulates that any support point t of an optimal design satisfies the equation f(t)'Nf(t) = (Td(t)}2 = 1. By Section 9.10, this sin gles out the arcsin quantiles SQ, s\,..., sd as the only possible support points. The matrix identity X'V = Id+i shows that X is nonsingular, whence the regression vectors f ( s Q ) , f ( s i ) , . . . ,f(sd) are linearly independent. Hence Corollary 8.9 applies and provides the unique optimal weights, i These are the weights given above. Our claim is established. As a corollary we obtain a representation as in the Elfving Theorem 2.14,
Thus the coefficient vector KK 'c penetrates the ElfVing set 72. through the ddimensional face generated by (-I)d/(s0), (-l)*"1/^!)* • • • > -/(*d-i)i/(**)•
232
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
The full index set J = {0,1,..., d} gives the unique optimal design for c'B, denoted by TC, with optimal variance \\c\\4. The one-point set J = {d — 2j} yields the optimal design Td_2j for the scaled individual parameter cd_2j8d-2j-> with optimal variance c^_2.. The same design is optimal also for the unsealed component 0-2/» with optimal variance c^_2,. These designs satisfy the relation
Therefore rc is a mixture of the designs T d _ 2 y, with mixing weights c^_2 -/||c||2. We now leave scalar optimality and approach our initial quest, of discussing optimality with respect to the smallest-eigenvalue criterion
For instance, in a fourth-degree model, the sets {0,1,3,4}, {0,1,4}, {0,3,4}, {0,4} all share the same set J = {0,2}, the same vector KK'c = Coeo+c4e4 — (1,0,0,0,8)' e R5, and the same optimal design T{0,2}- The information matrices for K'8 are of size 4 x 4 , 3x3, 3x3, 2 x 2 , with respective eigenvectors (1,0,0,8)', (1,0,8)', (1,0,8)', (1,8)', and common eigenvalue l/(cJ + c3) = l/65. While it is thus fairly easy to see that \\K'c\\~2 is some eigenvalue, <£_oooptimality boils down to showing that it is the smallest eigenvalue of C. 9.13. E-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS The designs TJ of the previous section are symmetric whence the odd moments vanish. Therefore their moment matrices decompose into two interlacing blocks (see Exhibit 9.10). In investigating the subsystem K'B = 0j, we place a richness assumption
9.13. E-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS
233
EXHIBIT 9.10 E-optimal moment matrices. Because of symmetry of the design TC = T^, its moment matrix splits into two interlacing blocks, as shown for degrees d — 1,2,3,4. Dots indicate zeros.
on the index set J which ensures that the smallest eigenvalue of the information matrix comes from the block that is associated with the Chebyshev indices d - 2j. More precisely, we demand that every non-Chebyshev index d — 1 — 2j in J is accompanied by its immediate successor d — 2/,
for all j — 0,1,..., [\d\. Since the scalar case is taken care of by the previous section, we further assume that the index set J contains at least two indices. This prevents the set
from degenerating to the one-point set {\d}. We claim the following for a dth-degree polynomial fit model with experimental domain [-!;!]. Claim. Let c e Ud+l be the coefficient vector of the Chebyshev polynomial Td(t) — c'f(t) on [-1;1], and assume that the index set X satisfies assumption (1). Then the design TJ of Section 9.12 is the unique <£_oo-optimal design for K'6 — % in T, with optimal value ||A"'c||"2. If d > 1, then the smallest eigenvalue of the information matrix C = CK(Md(Tj)} has multiplicity 1. Proof. From Section 9.12 we have p(KK'c) = ||/C'c||2. The proof rests on Theorem 7.24, in verifying the equality
234
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
That is, we wish to show that
with equality if and only if z is proportional to K 'c. I. Let M be the moment matrix of TJ. With a left inverse L of K that is minimizing for M, we can write C = LML'. Hence the assertion is
Because of the interlacing block structure of A/, the quadratic form on the left hand side is
where P and Q are polynomials of degree d and d — 1 associated with the Chebyshev index set and its complement,
The contributions from P2 and Q2 are discussed separately. II. In (1) of Section 9.12, we had d th-degree polynomial satisfies
A comparison of coefficients with
yields
Observing sign patterns we apply this to the Chebyshev polynomial Td:
Any
9.13. E-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS
235
We also apply it to the dth-degree polynomial P, in that Now, for each / 6 J, the Cauchy inequality yields
If equality holds in all Cauchy inequalities, then we need exploit only any one index j € J with j ^ \d to obtain proportionality, for / = 0,1,... , d, of P(s,-)Kd-2;|1/2 and (-l)d~~i+i\vitd_2j\l/2- Because of i/M_2;- ^ 0, this entails P(SI) — a(-l)d~', for some a e IR. Hence equality holds if and only if P = aTd, that is, a d _ 2y = acd_2y for all j = 0,1,..., \\d\. III. The argument for Q2 is reduced to that in part II, by introducing the d th-degree polynomial
From sj < 1 and part II, we get the two estimates
If d is even and \d e J, then the last sum involves the coeffcient of t° in P which is taken to be a,\ = 0.
236
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
EXHIBIT 9.11 Polynomial fits over [-!;!]: ^-oo-optimal designs vd_^ for 0 in T. Left: degree d of the fitted polynomial. Middle: support points and weights of T^ and a histogram representation. Right: optimal value v(>-oo) of the smallest-eigenvalue criterion.
9.14. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, II
237
Equality holds in (3) only if, for some /3 e R, we have P = f$Td. Equality holds in (2) only if Q2(Si) = s2Q2(si); in case d > 1, any one index / = l , . . . , d - 1 has 5? < 1 and hence entails Q(s,-) = 0 = P(SI) = (-l)d-'p. Thus equality obtains in both (2) and (3) if and only if 0^-1-2; — 0 f°r a^ ; = 0,l,...,Lj(d-l)J. IV. Parts II and III, and the assumption that an index d - 1 - 2; occurs in X only in the presence of d - 2j yield
With a = L'z and LK = 7S, we get a'KK'a = ||z||2. Therefore ||tf'c||-2 is the smallest eigenvalue of C. If d > 1, then a'Ma = \\z\\2/\\K'c\\2 holds if and only if a — ac, that is, z = aK'c\ hence the eigenvalue ||/£'c||~2 has multiplicity 1. V. If T is another ^^-optimal design for K'O in T, then T is also optimal for c'KK'O, by Theorem 7.24. Hence the uniqueness statement of that theorem carries over to the present situation. This completes the proof of our claim. For instance, in a fourth-degree model only the last two of the four sets {0,1,3,4}, {0,1,4}, {0,3,4}, {0,4} meet our assumption (1). Hence T{0,2} is 4>_oo-optimal in T for (0o,03,04)', as we^ as f°r (0o>04)', with common optimal value 1/65. The present result states that many 0_oo-optimal designs are arcsin support designs. The case of greatest import is the full index set I — {0,1,..., d} for which the design TC of the previous section is now seen to be also $-00optimal for 0 in T. It is in line with our earlier conventions that we employ the alternate notation r^ for TC. Exhibit 9.11 lists the ^-oo-optimal designs r^ up to degree 10, with weights rounded using the efficient design apportionment of Section 12.5. The line fit model has <£_oo-optimal design supported by ±1 with uniform weight 0.5 and optimal value f^-oo) = 1; the (^oo-optimal moment matrix for 0 is /2, its smallest eigenvalue, 1, has multiplicity 2. The parabola fit model has >_oo-optimal design r 2 ^ supported by -1, 0, 1 with weights 0.2, 0.6, 0.2, and optimal value u2(<£-oo) = 0.2. From Theorem 7.24, we also know now that the Chebyshev coefficient vector c determines the in-ball radius of the polynomial Elfving set 72., r2 = l/||c||2. The in-ball touches the boundary of ft only at 9.14. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, II In Section 9.12, we derived the optimal designs rd_2j for those individual parameters 0^-2y that are an even number apart from the top coefficient Bd.
238
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
A similar argument leads to the optimal designs rd_i_2j for the coefficients Od-\-2jthat are an odd number away from the top. Their support falls down on the arcsin support of one degree lower, d — \. In order that the coefficient vector of the (d — 1) st-degree Chebyshev polynomial Td_i fits into the discussion of the dth-degree model, we use the representation Td_i(t) = c'f(t), with dth-degree power vector f(t) = ( l , r , . . . ,td)' as before. That is, while ordinarily Td_\ comes with d coefficients c 0 ,ci,... ,Q_I, it is convenient here to append the entry cd = 0. Let again Ox = (#i)<ez be the parameter system of interest. Because of the zero pattern of the Chebyshev coefficient vector c introduced just now, the vector KK 'c depends on J only through the set
Assuming the index set I to be such that J is nonempty, we claim the following. Claim. For degree d > 1, let c e Rd+1 be the coefficient vector of the Chebyshev polynomial Td_i(t) = c'f(i) on [-!;!]. Then the unique design TJ that is optimal for in T has support points
and weights
<2
and optimal variance (p(KK 'c)) = \\K'c ||4, where the coefficients w0, MI , . . . , ud_i are determined from
If d is even or J ^ {\(d — 1)}, then all weights are positive. If d is odd and j = {I(d - 1)}, then we have c'KK'O = QQ and w(d^)/2 = 1, that is, if d is odd then the one-point design in zero is uniquely optimal for OQ in T. Proof. The support of the design T~ gives rise to only d linearly independent regression vectors /(?o))/(^i)> • • • >/fo-i) in the d + 1 dimensional space Ud+l. Hence the moment matrix Md(r~) is singular, and feasibility of r~ for c'KK'O requires proof.
9.14. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, II
239
Using the (d - l)st-degree power vector f(t) = (l,t,...,td~1)', equaThis yields tion (1) without the last line reads
are the power basis coefficients of the Laeranee where polynomials with nodes The last line in (1) is and Using the symmetries we get
Thus the last line in (1) is fulfilled and T~ is feasible for c'KK'B. The rest \J of the proof duplicates the arguments from Section 9.12 and is therefore omitted. Again we deduce a representation as in the Elfving Theorem 2.14:
Here the coefficient vector ATAT'? nenetrates the Elfvine set 72. throueh the
(d — l)-dimensional face that is generated by The one-point set I — {d - 1 - 2;} yields the optimal design rd_i_2j for :he component 0
240
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
Degree 3 4 5 6 7 8 9 10
fib 0i
0.3600 1 0.2528 1 0.2062 1 0.1793 1
1 0.6141 1 0.4679 1 0.3893 1 0.3393
h
0.5625 1 0.4973 1 0.4312 1 0.3824 1
03
04
0s
06
0j
0s
09
#10
1 0.6863 I 1 0.5968 1 0.5821 1 0.6462 1 1 0.5404 1 0.6066 I 0.5085 1 0.5825 1 0.6331 1 1 0.4907 1 0.5609 1 0.6106 1 0.4549 1 0.5313 1 0.5860 1 0.6271 1
EXHIBIT 9.12 Arcsin support efficiencies for individual parameters 0y. For degree d < 10 and j — 0,1,..., d, the table lists the efficiencies of the optimal arcsin support design ay for 0, in ?.d, relative to the optimal design TJ for 0, in T.
by Corollary 8.9. This design has efficiency
The efficiency is 1 for degree 1 and 2. From degree 3 on, the discrepancy between the arcsin support sets of degree d and d - 1 entails a drastic jump in the efficiencies, as shown in Exhibit 9.12. Finally, we treat the right extreme matrix mean, the trace criterion fa. Although there is little practical interest in this criterion its discussion is instructive. Just as the Elfving Theorem 2.14 addresses design optimality directly without a detour via moment matrices, so does the Theorem 9.15. Furthermore, we are thrown back to the concept of formal optimality of Section 5.15, that a moment matrix can have maximum fa -information for the full parameter vector 6 without being feasible for 0. 9.15. EQUIVALENCE THEOREM FOR T-OPTIMALITY Theorem. Assume that the regression range X C Uk contains k linearly independent vectors. Let R be the maximum length of all regression vectors, R = max{ ||jr|| : x 6 X}. Then a design £ e H is formally (fo-optimal for 0 in H if and only if every support point of £ has maximum length R. Proof. It is easy to establish the theorem by a direct argument. Every moment matrix M(£) obeys the bound trace M(g) = £)*esupp f £(*) x'x < R2. The bound is attained if and only if all support points of £ have maximum length R.
9.16. OPTIMAL DESIGNS FOR TRIGONOMETRIC FIT MODELS
241
From Section 2.17, the maximum length R of the regression vectors is the radius of the Euclidean ball circumscribing the regression range X. Usually this makes the situation easy to analyse. For polynomial fit models over the experimental domain T — [—1;1] the squared length of a regression vector is \\x\\2 = ||/(OII2 — 1 + /2 + • • • + t2d. The maximum R2 = d + 1 is attained only at t = ±1. Hence any ^-optimal design rf on [—1;1] has a moment matrix of rank at most 2 and cannot be feasible for 0, except for the line fit model. The optimal value is Vd(4>i) — R2/(d + l) = l. In our discussions of the polynomial fit models we have managed to study the optimality properties of, not just moment matrices, but designs proper. The following example, trigonometric fit models, even leads to optimal designs for finite samples sizes.
9.16. OPTIMAL DESIGNS FOR TRIGONOMETRIC FIT MODELS
The trigonometric fit model of degree d > 1 has regression function
and a total of k — 2d + 1 components for the parameter vector 6 (see Section 2.22). The experimental domain is the "unit circle" T = [0;27r). We call a design rn an equispaced support design when r" assigns uniform weight \/n to each of n equispaced support points on the unit circle [0;27r). The support points of an equispaced support design T" are of the form
where a £ [0;27r) is a constant displacement of the unit roots 2Trj/n. For these designs we claim the following optimality properties. Claim. Every equispaced support design T" with n > 2d+l is $p-optimal, for all p £ [-00; 1], for the full parameter system 0 in the set T of all designs on the experimental domain T = [0;27r). The optimal value function strictly increases in p,
from
over
242
Proof. matrix
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
First we show that the designs r" all share the same moment
The diagonal entries and the off-diagonal entries of M are, with a,Z> = !,...,<*,
Evaluation of the five integrals (2), (3) and (4)-(6) is based on the two formulas
where m e {!,...,«-!} determines some multiple of ITT. In order to establish (7), we set /3 = 2irm/n e (0;27r). The complex exponential function provides the relation e" = cos t + i sin t, and we get
Thus both sums in (7) are linear combinations ot the nnite geometric series The two series sum to , with quotients q This proves (7). and hence vanish, because of We now return to the integrals (2)-(6). For cos2 (at) I in , and apply (7) to obtain in (2) we set i
9.17. OPTIMAL DESIGNS UNDER VARIATION OF THE MODEL
243
Then sin2 = 1 - cos2 gives fsm2(bt)drn = \ in (3). The integrals (4), (5) evidently are of the form (7) and vanish. In the integral (6), the sin-cos addition theorem transforms the integrand into cos(at) sm(bt) = \ (sin(at + bt)-sm(at-bt)). Again the integral vanishes because of (7). Therefore every equispaced support design T" has moment matrix M as given by (1). Optimality of the designs T" and the moment matrix M is approached through the normality inequalities. With parameter p € (-00; 1] we have, for all t e [0;27r),
Theorem 7.20 proves 0P-optimality of the designs T" if Lemma 8.15 extends optimality to the t^-oo-criterion. What do we learn from this example? Firstly, there are many designs that achieve the optimal moment matrix M. Hence uniqueness may fail for designs although uniqueness holds true for moment matrices. Secondly, the theory of designs for infinite sample size occasionally leads to a complete solution for the discrete problem of finding optimal designs for sample size n > k = 2d + l. Thirdly, every regression vector/(f) has squared length d + 1. Therefore every design is
244
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
I. The first model has regression function f(t) = (l,r,f 2 )' and hence fits a parabola. The design £ on the regression range X which is induced by T has support points, moment matrix, inverse moment matrix, and normality inequality given by
Hence the design r is ^-optimal for 6 in T. The cfo-optimal value is 41//3/3 = 0.52913 (see also Section 9.5). II. The next model has regression function /(?) = (l,sin(| / 7rr),cos(5irr))' and fits a trigonometric polynomial of first degree over the half circle. The induced design £ has support points, moment matrix, inverse moment matrix and normality inequality given by
Again r is (^-optimal for 0 in T, with value 41/3/3 = 0.52913. III. The third model has regression function f(t) = (l,e',e~')'. The design £ has support points
With a = 1 + e + e~l = 4.08616 and b = 1 + e2 + e~2 = 8.52439, the moment
EXERCISES
245
matrix of £ and its inverse are
where c - 6a2 + 3b2 - 2a2b - 27 = 6.51738 is the determinant of 3M. The normality inequality turns out to be for a;;l Hence T is
on R.
246
CHAPTER 9: D-, A-, E-, T-OPTIMALITY
iv. Represent O in the power basis. obtain and compare coefficients in
with
for all a\ = 6fls, and a0 = 2«2- In other words, the polynomial Q which solves the differential equation (*) is unique. v. Show that (1 - t2)Pd(t) solves (*), where Pd is the Legendre polynomial.
93
Verify that 2.1e °nd is a reasonable fit to the determinant optimal value VdM'
9.4 How does the linear dispersion criterion 4>w of Section 9.8 relate to the weighted matrix mean >_^? 9.5 Show that relative to the arcsin distribution A of Section 9.6, the Chebyshev polynomials satisfy fTdTmdA = 0, 1/2, 1 according as d=£m, d = m ^ 0, d = m = Q. 9.6 In Section 9.11, show that viid^_2j
= SiVitd_2j and |v/ f d_i_2/| =
\vd-i,d-l-2j\-
9.7 In the line fit model over the interval [-b;b] of radius b > 0, show that the uniform design on ±b is ^-oo-optimal for 6 in T, with information min{l,Z?2}. 9.8 In the quadratic fit model over T = [-V5;\/2], show that (i) the unique ^>_oo-optimal design for 0 in T assigns weights 1/8, 3/4, 1/8 to —\/2, 0, v/2, (ii) its moment matrix has smallest eigenvalue 1/2 with multiplicity 2, and eigenvectors z = (-1/V5,0,1/V5)' and z = (0,1,0)', (iii) the only nonnegative definite trace 1 matrix E satisfying Theorem 7.22 is zz' [Heiligers (1992), p. 33]. 9.9 In the dth-degree polynomial fit model over [-1;1], show that the arcsin support design with weights l/(2d),l/d,... ,l/d,l/(2d) is the unique design that is optimal for the highest coefficient 6d [Kiefer and Wolfowitz (1959), p. 282]. 9.10 In the Jth-degree polynomial fit model over [—1;1], show that (i) Amax(M d (r)) < d + 1 for all T € T, (ii) supT€T fy (Md(r)) = d+ 1 for all p e [l;oo]. 9.11
In the trigonometic fit model of Section 9.16, are the equispaced support designs the only $p-optimal designs for 0 in T?
C H A P T E R 10
Admissibility of Moment and Information Matrices
Admissibility of a moment matrix is an intrinsic property of the support set of the associated design. Polynomial fit models serve as an example. Nevertheless there are various interrelations between admissibility and optimality, a property relating to support points and weights of the design. The notion of admissibility is then extended to information matrices. In this more general meaning, admissibility does involve the design weights, as is illustrated with special contrast information matrices in two-way classification models. 10.1. ADMISSIBLE MOMENT MATRICES A kind of weakest requirement for a moment matrix M to be worthy of consideration is that M is maximal in the Loewner ordering, that is, that M cannot be improved upon by another moment matrix A. In statistical terminology, M has to be admissible. Let M C NND(£) be a set of competing moment matrices, and H C H be a subset of designs on the regression range •y r~
DEFINITION. A moment matrix M e M is called admissible in M when every competing moment matrix A e M with A > M is actually equal to M. A design £ e H is called admissible in H when its moment matrix M(£) is admissible in M(H). In the sequel, we assume the set M to be compact and convex, as in the general design problem reviewed in Section 7.10. A first result on admissibility, in the set H of all designs, is Theorem 8.5. If a design rj e H has a support point that is not an extreme point of the Elfving set 72. = conv(X U (-.#)), then 17 is not admissible in H. Or equivalently, if £ e H is admissible in then its support points are extreme points of 72. Here is another result that emphasizes the role of the support when it comes to discussing admissibility.
247
248
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
10.2. SUPPORT BASED ADMISSIBILITY Theorem. Let 17 and £ be designs in H. If the support of £ is a subset of the support of 17 and 17 is admissible in H, then £ is also admissible in H. Proof. The proof is akin to that of Corollary 8.9, by introducing the minimum likelihood ratio a = min{ri(x)/^(x): x e supp£}. Because of the support inclusion, a is positive. It satisfies i7(jc) - ag(x) > 0 for all x e X. Let £ e H be any competing design satisfying M(£) > M(£). Then a£+ 17 - a£ is a design in H, with moment matrix
Here equality must hold because of admissibility of rj. Since a is positive, we get M(£) = M(£). This proves admissibility of £. In measure theoretic terms the inclusion of the supports, supp £ C supp 17, means that £ is absolutely continuous relative to 17. A matrix M e M. is said to be inadmissible in M. when it is not admissible in M, that is, if there exists another competing moment matrix B e M such that B ^ M. In this case, B performs at least as well as M, relative to every isotonic function
10.4. POSITIVE POLYNOMIALS AS QUADRATIC FORMS
249
be improved upon, A^M, where the moment matrix A e M is admissible. Theoretically, the design problem is simplified by investigating the "smaller" subset Hadm of admissible designs, rather than the "larger" set H of all designs. Practically, the design set Eadm and the moment matrix set M(H a dm) may well lack the desirable property of being convex, besides the fact that the meanings of "smaller" and "larger" very much depend on the model. Here are two examples in which every design is admissible, the trigonometric fit model and the two-way classification model. In both models the regression vectors have constant length,
as mentioned in Section 6.5. But if x'x is constant over X, then any two designs £ and 17 that are comparable, M(T]} > M(£), have identical moment matrices, M (17) = M(£). This follows from
and the strict monotonicity of the trace function. Thus every moment matrix M e M(H) is admissible in M(H), and admissibility leads to no simplification at all. In plain words, admissibility may hold for lack of comparable competitors. In this respect, polynomial fit models show a more sophisticated structure. We take the space to explicitly characterize all admissible designs in this model. The derivation very much relies on the peculiarities of the model. Only fragments of the arguments prevail on a more general level. We begin with an auxiliary result from calculus. 10.4. POSITIVE POLYNOMIALS AS QUADRATIC FORMS Lemma. Let P be a polynomial defined on R, of even degree 2d. Then P is positive, P ( t ) > 0 for all t 6 (R, if and only if there exists a positive definite (d + 1) x (d + 1) matrix A such that, with power vector f ( t ) = (1, t,..., td)',
Proof. For the direct part, we extract the coefficient c e IR of the highest power t2d in P. Because of P > 0, we have c > 0. We proceed by induction on d > 1. If d = 1, then P is a parabola,
250
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
with a, )8 e IR and y — a2 + )3. Because of P > 0, we have Thus we get the desired representation P(t) = ( l , t ) A ( l t ) , with
Assuming the representation holds for all positive polynomials R of degree 2(d — 1), we deduce the result for an arbitrary positive polynomial P of degree Id. Because of P > 0, the roots of P are complex and come in conjugate pairs, Zj = ay +i/3 ; and 7; = a; - i/3;. Each pair contributes a parabola to the factorization of P,
with a y , Pj € U. Because of P > 0 we have 0? > 0. Thus the factorized form of P is
say. The first polynomial in (1) is Q(t) bd_itd~l +td = (fr',l)/(0, with Z> = ( f t o , * i i - - - , ^ - i ) ' e ^d- Hence we can write
The second polynomial in (1) is In this sum, each term is nonnegative, the constant term is and 2 1 the highest power r ^- ) has coefficient £/
Altogether, (1), (2), and (3) provide the representation P(t) = with the positive definite matrix
This completes the direct part of the proof. The converse part holds because of f(t) ^ 0 for all t 6 R.
f(t}'Af(t),
10.5. LOEWNER COMPARISON IN POLYNOMIAL FIT MODELS
251
10.5. LOEWNER COMPARISON IN POLYNOMIAL FIT MODELS Again we denote by T the set of all designs on the experimental domain T = [—1; 1]. The model for a polynomial fit of degree d > 1 has regression function f(t) = (!,*,..., td}. A design T e T has moment matrix
Here the Loewner comparison of two moment matrices reduces to a comparison of moments. We claim that only the highest moments can differ, the others must coincide. To this end, we introduce a notation for the initial section of the first 2d-\ moments,
We can now give our claim a succinct form. Claim. Two designs cr, T e T satisfy moments of a and T fulfill Proof.
if and only if the and
For the proof of the direct part, we evaluate the quadratic form
With just two nonzero entries, Zi — 1 for vectors z = (z0> z\,. • • , zd)' and Zi = a, we obtain, tor i — 0,1,..., a — 1,
By induction on i, we show that (1) implies
that is, the consecutive initial sections of the moments of a and r coincide. For i = 0 we have HQ(O) = 1 = /AO(T). Hence (1) forces JJLJ(or) == /A/(T) for all
252
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
; = !,...,, that is, H(d)(<*} = AK<*)(T)- Assuming /i(rf+l--i)(o-) = M(d+i-i)(T), we deduce /i^+oC0") = /*(<*+/) ( T )- The restriction i < d-l leads to 2/ < d+i — 1. Hence jt2,-(o-) = M2i(T) and (1) yields /A,+/(O-) = Pi+j(r), for all / = /+!,..., d. Adjoining At/+d(cr) = /t l+ d(T) to the assumed identity of the initial sections of length d + i - I, we obtain i*.(d+i)((r} — P-(d+i)(T)- Thus (2) is established. The case / = d - 1 in (2) gives At(2d-i)(o") = ^(2d-\)(f)- Finally (1) yields lJL2d(°') > ^2d( T )» where equality is ruled out because of Md(a) ^ Md(r). The proof of the converse is evident since Md(a) - Md(r) has entries 0 except for the bottom right entry fadfa) - M2d( T ) which is positive. Thus the proof of our claim is complete. A first admissibility characterization is now avaifabte. fn a o"tfi-cfegree model, a design r 6 T is admissible in T if and only if it maximizes the moment of highest order, 2d, among those designs that have the same lower order moments as has T,
This characterization is unwieldy and needs to be worked on. The crucial question is how much the 2dth moment At2d(0-) varies subject to the restriction /t(2d-i)(cr) = / A (2d-i)( T )- If there is no variation, then /u,2d(T) is the unique moment that goes along with iA(2d-i)(T)-> and again admissibility holds for lack of comparable competitors. Otherwise /i2d, given the initial section M(2-i)( T )i is nonconstant, and admissibility is a true achievement.
10.6. GEOMETRY OF THE MOMENT SET The existence of distinct, but comparable moments depends on how the given initial section relates to the set of all possible initial sections,
We call ju,(2d-i)(T) the moment set up to order 2d - 1. Its members are integrals,
just as the members of the set A/(H) of moment matrices are the integrals fxxx' dg, with £ e H. For this reason the arguments of Lemma 1.26 carry over. The moment set /A(2d-i)(T) is a compact and convex subset of R2d~\
10.7.
ADMISSIBLE DESIGNS IN POLYNOMIAL FIT MODELS
253
being the convex hull of the power vectors g(t) — (t,...,t2d~1)' with t e
[-i;i].
The set M(2d-i)(T) includes all polytopes of the form conv{0,g(fi),..., g(hd-\}}- If tne points 0 ^ *i, • • • , *2<*-i £ [-1»1] are pairwise distinct, then the Vandermonde determinant proves the vectors g(t\),... ,g(?2d-i) to be linearly independent,
In this case, the above polytopes are of full dimension 2d-l. Therefore the moment set /A(2-i)(T) has a nonempty interior. We now claim the following, for a given design T € T. Claim. If /u,(2rf_i)(r) lies in the interior of the moment set /t(2d-i)(T), then there exists a design a e T satisfying /A(2d-i)(0") = M(2d-i)( r ) and ^2d(a'} / ^2d(f}-
Proof. The statement has nothing to do with moments, but follows entirely from convex analysis. A change in notation may underline this. Let C — /i(2d)(T) be the moment set up to order Id, a convex set with nonempty interior in Rw+1, with m = 2d - 1. Let D C Um be its image under the projection ( y , z ) i-> y. We assume that y — )H(2d-i)( T ) nes in tne interior of D. The assertion is that the cut Cy — [z € 1R : (y,z) € C} contains at least two points. Convex analysis provides the fact that the number z lies in the interior of Cy if and only if the vector (y,z) lies in the interior of C. The latter is nonempty, and hence so is Cy. Thus Cy is a nondegenerate interval and contains another point /u,2(cr), say, besides /^X1")- This proves the claim (see also Exhibit 10.1).
10.7. ADMISSIBLE DESIGNS IN POLYNOMIAL FIT MODELS We are now in a position to describe the admissible designs in polynomial fit models. Claim. For a design T G T in a d th-degree polynomial fit model on the experimental domain [-1;1], the following three statements are equivalent: a. (Admissibility) r is admissible in T, b. (Support condition) r has at most d - 1 support points in the open interval (-1;1).
254
CHAiTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
EXHIBIT 10.1 Cuts of a convex set. If C C R m+1 is a convex set with nonempty interior and y e IRm is an interior point of the projection D of C on Rm, then the cut Cy = |z 6 IR : (y,z) € C\ is a nondegenerate interval.
c. (Normality condition) There exists a positive definite (d + l) x (d + l) matrix N that satisfies
with equality for the support points of T. Proof. First we prove that part (a) implies part (b). Let r be admissible. We distinguish two cases. In the first case, we assume that the vector M(2d-i)(T) lies °n the boundary of the moment set /i(2d-i)(T). Then there exists a supporting hyperplane in R2d-1, that is, there exists a vector 0 ^ h e U2d~l such that
with power vector g(t) = (t,..., t2d *)'. Therefore the polynomial
is nonpositive on [-1; 1], and satisfies //*(*) dr = 0. The support points of T are then necessarily zeros of P. Because P < 0, they actually determine local maxima of P in [—!;!]. The degree of P is at most 2d — 1. As h ^ 0, the
10.7. ADMISSIBLE DESIGNS IN POLYNOMIAL FIT MODELS
255
polynomial P is nonconstant. Hence P possesses at most d - 1 local maxima on R and r has at most d - 1 support points in (—1; 1). In the second case we assume that H(2d-i)(T) ues m the interior of the moment set M(2d-i)(T). Because of admissibility, the moment /-^(T) maximizes ^2d °ver the designs that have initial section At(2rf-i)(7"), by Section 10.5. Stepping up to order Id, this puts the vector n^d)^} °n the boundary of the moment set /t(2d)(T). Hence there exists a supporting hyperplane in U2d, that is, there exists a vector 0 ^ h e R2d such that
with enlarged power vector g(t) — (t,...,t2d I,t2d)'. Therefore the polynomial
is nonpositive on [—1; 1], and again the support points of r are local maxima of P in [-!;!]. In order to determine the degree of P, we choose a design cr with distinct but comparable moments. Section 10.6 secures the existence of such designs cr, for the present case. Moreover T is assumed to be admissible, whence //^(o") < M2d( T )> by Section 10.5. From
we obtain h2d > 0. If h2d = 0, then 0 ^ (hi,.. - ,/i 2 d-i)' e 032d~l defines a supporting hyperplane to /X(2d-i)(T) at n^d-i)^}, contradicting the present case that H(2d~i)(T) lies in the interior of the moment set M(2d-i)(T). This leaves us with h2d > 0. Any polynomial P of degree 2d with highest coefficient positive has at most d-1 local maxima on R. Thus r has at most d-1 support points in (—1;1). Next we show that part (b) implies part (c). Let t\,...,tt be the support points of T in (—1; 1). By assumption, we have i < d - 1. If i < d - 1, then we add further distinct points r ^ + i , . . . , td_\ in (-1; 1). Now the polynomial
is nonnegative inside [-1;1], nonpositive on the outside, and vanishes at the support points of r. Therefore the polynomial
256
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
is positive on R and of degree 2d. From Lemma 10.4, there exists a matrix AT € PD(d +1) such that P(t) = f(t)'Nf(t). In summary, we obtain
with equality for the support points of T. Finally, we establish part (a) from part (c). Let a e T be a competing design with Md(o~) > Md(r). Since the support points t of T satisfy f(t)'Nf(t) = 1, the normality condition (c) yields
Strict monotonicity of the linear form A H-+ trace AN forces Md(o~) = Md(r}. Hence T is admissible, and the proof of the claim is complete. This concludes our discussion of admissible designs in polynomial fit models. The result that carries over to greater generality is the interplay of parts (a) and (c), albeit in the weaker version of Corollary 10.10 only. We may view part (c) as a special instance of maximizing a strictly isotonic optimality criterion
10.9.
E-OPTIMALITY AND ADMISSIBILITY
257
Proof. The argument for part (a) is indirect. If there exists a competitor B e M with B ^ M, then strict monotonicity of
For the direct part, we assume admissibility. Let A e M be <£_oo-optimal for K'd in M', such matrices exist by the Existence Theorem 7.13. Optimality of A yields
This means Ir < CK(A). Pre- and postmultiplication by K and K' give
where AK is the generalized information matrix for K'B from Section 3.21. Admissibility forces A = M, showing that M is uniquely 4>_oo-optimal for K'B inM. The converse follows from part (c) of Lemma 10.8, with As an application, we consider one-point designs £(*) = 1. Their moment matrices are of the form M(£) = xx'. Such a design is admissible in H if and only if it is uniquely optimal for x '6 in H. The Elfving Theorem 2.14 permits us to check both optimality and uniqueness. Thus the one-point design £(*) = 1 is admissible in H if and only if x is an extreme point of the Elfving set K = con\(Xu(-X)).
258
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
Another application occurs in the proof of the following theorem, on optimality relative to the information functions
In particular, M is
Hence if M maximizes A H-» trace AN, then M has a longest projection onto £(N) among all competitors A e M. The necessary condition (a) and the sufficient condition (b) differ in whether the matrix N points in any direction of the cone NND(fc), N > 0, or whether it points into the interior, N > Q (see Exhibit 10.2). Except for the standardization trace M N = 1, the normality inequality of Section 7.2 offers yet another disguise of
In the terminology of that section, the matrix N is normal to M at M.
10.10. T-OPTIMALITY AND ADMISSIBILITY
259
EXHIBIT 10.2 Line projections and admissibility. The matrices M\ and MI have a longest projection in the direction N^ > 0, but only MI is admissible. The matrices A/2 and MI are admissible, but only M2 has a longest projection in a direction N > 0.
Relative to the full set M(H), we need to refer to the regression vectors jc e X only. The design £ is (j>N -optimal for 6 in H if and only if
with equality for the support points of £. This is of the same form as the normality condition (c) of Section 10.7. The geometry simplifies since we need not argue in the space Sym(fc) of symmetric matrices, but can make do with the space Rk of column vectors. Indeed, N induces a cylinder that includes the regression range X, or equivalently, the Elfving set Tl (see Section 2.10). Neither condition (a) nor (b) can generally be reversed. Here are two examples to this effect. Let HI be the set of all designs in the line fit model of Section 2.20, with regression range X\ — {1} x [-1; 1]. Choosing N = (QQ), we find x'Nx = 1 for all jc e X\. Hence the one-point design £(Q) =1 is
Therefore the converse of condition (a) fails to hold in this example. A variant of this model is appropriate for condition (b). We deform the regression range X\ by rounding off the second and fourth quadrant,
260
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
In the set Ha of all designs on X2, the design £(Q) = 1 is uniquely optimal for
in Ha and hence admissible, by the Elfving Theorem 2.14 and by part (c) of Theorem 10.7. But the only matrix
satisfying
is the singular matrix N = (J|j). Indeed, we have a = (1,0)N(J) = 1. Nonnegative definiteness yields y > /32. Inserting x = (j), we get 1 + 2j8 + y < 1, that is, y < -2/3. Together this means
From Now we exploit the curved boundary of and Siihriivisinn hv we obtain eives '. As Jt? tends 1 tends giving 2 From (1), we get Thus N is necessarily singular. The and converse of condition (b) is false, in this setting (see Exhibit 10.3). The results of the present corollary extend to all matrix means
10.11. MATRIX MEAN OPTIMALrTY AND ADMISSIBILITY Corollary. Let M e M be a competing moment matrix and let 0 ^ p €
a. (Necessity) If M is admissible in M and N > 0 satisfies trace AN < 1 = trace MN, for all A e M, then M is >p-optimal for HK'B in M where K is obtained from a full rank decomposition MNM = KK' and where H = (C*(Af ))<1+^/Grt. b. (Sufficiency) If M is uniquely <jH»p-optimal for K'O in M., for some full column rank coefficient matrix K, then M is admissible in M.
10.11. MATRIX MEAN OPTIMALITY AND ADMISSIBILITY
261
EXHIBIT 10J Cylinders and admissibility. Left: the one-point design £(J) = 1 is ^-optimal, but inadmissible over the line fit regression range X\. Right: the same design is admissible over the deformed regression range X^, but is ^-optimal for the singular matrix N = (^Q), only.
Proof. We recall that in part (a), admissibility of M entails the existence of a suitable matrix N', from part (a) of Corollary 10.10. The point is that N is instrumental in exhibiting the parameter system HK'6 for which the matrix M is <£p-optimal. Suppose, then, that the k x r matrix K satisfies MNM = KK', where r is the rank of MNM. From range AT = range MNM C range M, we see that M is feasible for K'6. Hence C = CK(M) is a positive definite r x r matrix, and we have
For the positive definite matrix H = C(1+p)/(2p), we obtain
with q conjugate to p. Therefore the primal and dual objective functions take the value
The Duality Theorem 7.12 shows that M is ^-optimal for HK'6 in M, and
262
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
that N is an optimal solution of the dual problem. Part (b) follows from part (c) of Lemma 10.8. If the hypothesis in part (a) is satisfied by a rank 1 matrix N = hh', then MNM = cc', with c = Mh, and M is optimal for c'6 in M. Closely related conditions appear in Section 2.19 when collecting, for a given moment matrix M, the coefficient vectors c so that M is optimal for c'6. 10.12. ADMISSIBLE INFORMATION MATRICES The general optimality theory allows for parameter subsystems K'6, beyond the full parameter vector 0. Similarly admissibility may concentrate on information matrices, rather than on moment matrices. The requirement is that the information matrix C^(Af) is maximal, in the Loewner ordering, among whichever competing information matrices are admitted. To this end, we consider a subset C C C/^(M(H)) of information matrices for K'6 where, as usual, the coefficient matrix K is taken to be of full column rank. An information matrix C#(M) e C is called admissible in C when every competing information matrix CK(A) € C with CK(A) > C#(A/) is actually equal to C/^(M). A design £ e H is said to be admissible for K'6 in H C H when its information matrix C#(M(£)) is admissible in C/c(A/(E)). Some of the admissibility results for moment matrices carry over to information matrices, such as Section 10.8 to Section 10.11, others do not (Theorem 10.2). We do not take the space to elaborate on these distinctions in greater detail. Instead we discuss the specific case of the two-way classification model where admissibility of contrast information matrices submits itself to a direct investigation. 10.13. LOEWNER COMPARISON OF SPECIAL C-MATRICES In the two-way classification model, let the centered contrasts of the first factor be the parameter system of interest. For the centered contrasts to be identifiable under an a x b block design W, the row sum vector r = Wlb must be positive. Then the special contrast information matrix Ar — rr' is Loewner optimal in the set T(r) of designs with row sum vector equal to r, by Section 4,8. The first step is to transcribe the Loewner comparison of two special contrast information matrices into a comparison of the generating row sum vectors. We claim the following. Claim. In dimension a > 2, two positive stochastic vectors f, r e Ra satisfy
10.13. LOEWNER COMPARISON OF SPECIAL C-MATRICES
263
if and only if the components fulfill, for some / < a,
Proof. The major portion of the proof is devoted to showing that, under the assumption of (1), conditions (2) and (3) are equivalent to
Indeed, with K' = (Ka,Q) and utilizing generalized information matrices as in (1) of Section 3.25, condition (4) means (M(ts'))K > (M(rs'))K. By Lemma 3.23, this is the same as K'M(ts')~K < K'M(rs')~K. Insertion of the generalized inverse G of Section 4.8 yields an equivalent form of (4),
The a x (a - 1) matrix //, with iih row -l^-i while the other rows form the identity matrix 7 fl _i, satisfies KaH = H and H(H'H)~1H' = Ka. This turns (5) into //'A'1// < H'k~lH. Upon removing the ith component from t to obtain the shorted vector 7= ( f i , . . . , r,-_i, tM,..., ta)', we compute //'A,"1// = A f ~ ] -i- (l/ti)la-.\lg__r With f defined similarly, we may rearrange terms to obtain
Because of (1) the factor 1/f, - 1/r, is positive. It is a lower bound to the elements l/r; — l/r; of the diagonal matrix A^1 — A f ~\ whence (6) entails Ar 1 > A^1. Moreover, in view of the Schur complement Lemma 3.12, we see that (6) is equivalent to
Another appeal to Lemma 3.12, with the roles of the blocks in (7) interchanged, yields
264
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
This is merely another way of expressing conditions (2) and (3). Hence (4) is equivalent to (2) and (3). Only a few more arguments are needed to establish the claim. For the direct part we conclude from that t ^ r, whence follows (1), and then (4) leads to (2) and (3). For the converse, (1), (2), and (3) imply (4). Equality in (4) is ruled out by (1). If we assume A,-f?' = A r —rr', then we trivially ge Since (1) secures / ^ r, there exists some k < a such that tk > rk and
We have more than two subscripts, a > 2, and so conditions (2) and (8) cannot hold simultaneously. Hence A, - tt' ^ Ar - rr', and the proof is complete. Thus we have A, — tt' ^ Ar - rr' if and only if, with a single exception (1), the components of t are larger than those of r (2), and the discrepancies jointly obey the quantitative restriction (3). This makes it easy to compare two special contrast information matrices through their generating row s,um vectors, and to attack the problem whether comparable row sum vectors exist. 10.14. ADMISSIBILITY OF SPECIAL C-MATRICES For the set of special contrast information matrices
the question is one of admissibility of a particular member Ar - rr' in the class C. Claim. For a positive stochastic vector r € Rfl in dimension a > 2, the matrix Ar - rr' is admissible in C if and only if r, < \ for all / = 1,..., a. 'Proof. We prove the negated statement, that ' is inadmissible in C if and only if r, > \ for some /. If Ar - rr' is inadmissible, then there exists a positive stochastic vector t e Ra such that . Consideration of the diagonal elements yields t} - tj > r; - r?, for all j. From condition (1) of Section 10.13, we get r, < r/, for some /. This is possible only if r, > \ and >/€[l-r,,r(-). For the converse, we choose /, e [1 - r,,r/) and define
Then t — (t\,...,ta)' is a positive stochastic vector satisfying (1) and (2) of
10.15. ADMISSIBILITY, MINIMAXITY, AND BAYES DESIGNS
265
Section 10.13. It also fulfills (3) since f,• > 1 — r, yields
By Section 10.13 then This completes the proof.h
whence Ar - rr' is inadmissible.
In view of the results on Loewner optimality of Section 4.8, we may summarize as follows. A block design W e T is admissible for the centered contrasts of factor A if and only if it is a product design, W = rs', and at most half of the observations are made on any given level / of factor A, r, < \ for all i < a. It is remarkable that the bound \ does not depend on a, the number of levels of factor A. The mere presence of the bound \ is even more surprising. Admissibility of a design for a parameter subsystem K '6 involves the design weights. This is in contrast to Theorem 10.2 on admissibility of a design for the full parameter vector 6 which exclusively concentrates on the design support. 10.15. ADMISSIBILITY, MINIMAXITY, AND BAYES DESIGNS The notion of admissibility has its origin in statistical decision theory. There, a parameter 6 e @ specifies the underlying model, and the performance of a statistical procedure T is evaluated through a real-valued function 6 H-» R(6,T], called risk function. The terminology suggests that the smaller the function R(-,T), the better. This idea is captured by the partial ordering of smaller risk which, for two procedures T\ and T2, is defined through pointwise comparison of the risk functions,
It is this partial ordering to which admissibility refers. A procedure T\ is called admissible when every competing procedure T2 with T2 < T\ has actually the same risk as T\. Otherwise, T\ is inadmissible. Decision theory then usually relates admissibility to minimax procedures, that is, procedures that minimize the maximum risk,
Alternatively, we study Bayes procedures, that is, procedures that minimize some average risk,
266
CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES
where TT is a probability measure on (an appropriate sigmafield of subsets of) 6. In experimental design theory, the approach is essentially the same except that the goal of minimizing risk is replaced by one of maximizing information. A design is evaluated through its moment matrix M. The larger the moment matrix in the Loewner ordering, the better. The Loewner ordering is a partial ordering that originates from a pointwise comparison of quadratic forms,
where Sk = {x € Rk : \\x\\ = 1} is the unit sphere in IR*. Therefore, the difference with decision theoretic admissibility is merely one of orientation, of whether an improvement corresponds to the ordering relation <, or the reverse ordering >. With this distinct orientation in mind, the usual decision theoretic approach calls for relating admissibility to maximin designs, that is, designs that maximize the minimum information,
This is achieved by Theorem 10.9. Alternatively, we may inquire into some average information.
where TT is a probability measure on 5^. This is dealt with in Corollary 10.10, with a particular scaling of N. From the analogy with decision theory, a design may hence be called a Bayes design when it maximizes a linear criterion (j>N(A) = trace AN. However, there is more to the Bayes approach than its use as a tool of decision theory. In the arguments above, TT conveys the experimenter's weighting of the various directions x e Sk. More generally, the essence is to bring to bear any prior information that may be available. Other ways of how prior information may permeate a design problem are conceivable, and each of them may legitimately be called Bayes, EXERCISES 10.1
Show that M is admissible in M if and only if HMH' is admissible in HMH1, for all H e GL(fc).
10.2
In the parabola fit model over [-1; 1], a design r is optimal for BQ+ fa in T if and only if ftdr = 0 and /1 2 dr = |. Which of the optimal
EXERCISES
designs 1652].
267
with such that are admissible in T? [Kiefer and Wolfowitz (1965), p.
10.3 Show that M is admissible in M if and only if M is uniquely optimal for (CK(M})l'2K'9 in M, where K provides a full rank decomposition MNM = KK' for some N > 0 such that trace AN < 1 = trace MN for all A e M. 10.4
Show that £ is admissible for K'9 in H if and onlv if £ is admissible for for all
10.5 Verify that violate 10.6
and
fulfill (;i)-(3) of Section 10.13, but
(continued) Show that H — (Ia-\,-la-i)' satisfies KaH = H and H(H'H)~1H' = Ka.
C H A P T E R 11
Bayes Designs and Discrimination Designs
The chapter deals with how to account for prior knowledge in a design problem. These situations still submit themselves to the General Equivalence Theorem which, in each particular case, takes on a specific form. In the Bayes setting, prior knowledge is available on the distribution of the parameters. This leads to a design problem with a shifted set of competing moment matrices. The same problem emerges for designs with bounded weights. In a second setting, the experimenter allows for a set of m different models to describe the data, and considers mixtures of the m information functions from each model. The mixing is carried out using the vector means 3>p on the nonnegative orthant R™. This embraces as a special case the mixing ofm parameter subsets or ofm optimality criteria, in a single model. A third approach is to maximize the information in one model, subject to guaranteeing a prescribed efficiency level in a second model. Examples illustrate the results, with special emphasis on designs to discriminate between a second-degree and a third-degree polynomial fit model.
11.1. BAYES LINEAR MODELS WITH MOMENT ASSUMPTIONS An instance of prior information arises in the case where the mean parameter vector 6 and the model variance a2 are not just unknown, but some values for them are deemed more likely than others. This is taken into account by assuming that 6 and o2 follow a distribution, termed the prior distribution, which must be specified by the experimenter at the very beginning of the modeling phase. Thus the parameters 6 and a2 become random variables, in addition to the response vector Y. The underlying distribution P now determines the joint distribution of Y, 6, and a2. Of this joint distribution, we utilize in the present section some conditional moments only, as a counterpart to the classical linear model with moment assumptions of Section 1.3.
268
11.1. BAYES LINEAR MODELS WITH MOMENT ASSUMPTIONS
269
The expectation vector and the dispersion matrix of the response vector Y, conditionally with 6 and or2 given, are as in Section 1.3:
The expectation vector and the dispersion matrix of the mean parameter vector 0, conditionally with a2 given, are determined by a prior mean vector do 6 Rk, a prior dispersion matrix RQ € NND(A:), and a prior sample size n0 > 1, through
Finally, a prior model variance a^ > 0 is the expected value of the model variance cr2,
Assumptions (1), (2), and (3) are called the Bayes linear model with moment assumptions. Assumption (2) is the critical one. It says that the prior estimate for 6, before any sampling evidence is available, is OQ and has uncertainty (o-2/n0)/?oOn the other hand, with the n x k model matrix X being of full column rank k, the Gauss-Markov Theorem 1.21 yields the sampling estimate 9 = (X'X)~1X'Y, with dispersion matrix (a2/n)M~l, where as usual M = X'X/n. Thus, uncertainty of the prior estimate and variability of the sampling estimate are made comparable in that the experimenter, through specifying nQ, assesses the weight of the prior information on the same per observation basis that applies to the sampling information. The scaling issue is of paramount importance because it touches on the essence of the Bayes goal, to combine prior information and sampling information in an optimal way. If the prior information and the sampling information are measured in comparable units, then the optimal estimator is easily understood and looks good. Otherwise, it looks bad. With a full column rank k x s coefficient matrix K, let us turn to a parameter system of interest K'6, to be estimated by T(Y) where T maps from R" into Rs. We choose a matrix-valued risk function, called mean squared-error matrix,
Two such matrices are compared in the Loewner ordering, if possible. It is possible, remarkably enough, to minimize the mean squared-error matrix among all affine estimators AY + b, where the matrix A e Rsxn and the shift
270
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
b e Rs vary freely. The shift b is needed since, even prior to sampling, there is a bias
unless BQ = 0. Any affine estimator that achieves the minimum mean squarederror matrix is called a Bayes estimator for K'B. 11.2. BAYES ESTIMATORS Lemma. In the Bayes linear model with moment assumptions, let the prior dispersion matrix jR0 be positive definite. Then the unique Bayes estimator for K'6 is K'8, where
and the minimum mean squared-error matrix is
Proof. I. We begin by evaluating the mean squared-error matrix for an arbitrary affine estimator T(Y) = AY + b. We only need to apply twice the fact that the matrix of the uncentered second moments, provided it exists, decomposes into dispersion matrix plus squared-bias matrix. That is, for a general random vector Z we have E/>[ZZ'] = D/>[Z] + (E/>[Z])(E/>[Z])'. Firstly, with Z = AY + b — K'B, we determine the conditional expectation of ZZ' given B and a2,
where B — AX-K'. Secondly with Z = BB + b, the conditional expectation given a2 is
Integrating over a2, we obtain the mean squared-error matrix of AY + b,
11.2. BAYES ESTIMATORS
271
II. The third term, (BBo + b)(B8Q + b)', is minimized by b = -B00 =
K'OQ-AXBQ.
III. The key point is to minimize the sum of the first two terms which, except for oj, is S(A) = AA1 + (l/no)(AX -K')R0(AX -K')'. With A = K'(n0RQl +X'X)-1X', we expand S(A + (A-A)). Among the resulting eight terms, there are two pairs summing to 0. The other four terms are rearranged loyiGldS(A) = S(A) + (A-A)(In+nQlXR0X')(A-AY > 5(1), with equality only for A — A. IV. In summary, with & = K'^-AXOo = K'(nQR-1 + X'X^noR^Oo, the unique Bayes estimator for K'6 is AY + b = K'O, with 6 as in (1) and mean squared-error matrix as in (2). The Bayes estimator B does not yet look good because the terms which appear in (1) are too inhomogeneous. However, we know that the sampling estimate 6 satisfies the normal equations, X'Xd = X'Y, and so we replace X'Y by X'X9. This exhibits 9 as a combination of the prior estimate &Q and the sampling estimate 6. In this combination, the weight matrices can also be brought closer together in appearance. We define
to be the prior moment matrix which, because of (2) of Section 11.1, is properly scaled. The corresponding scaling for the sampling portion is X'X = nM. With all terms on an equal footing the Bayes estimator looks good:
It is an average of the prior estimate 6fo and the sampling estimate B, weighted by prior and sampling per observation moment matrices M0 and A/, and by prior and experimental samples sizes n0 and n. The weights sum to the identity, (/ioM) + nAf )~1n0A/o + («o^o + nM)~lnM = Ik. The mean squarederror matrix of the Bayes estimate K'd is
It depends on the Bayes moment matrix Ma which is defined to be a convex combination of prior and sampling moment matrices,
272
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
with a specifying the sampling weight that the experimenter ascribes to the observational evidence that is to complement the prior information. The Bayes design problem calls for finding a moment matrix M such that the mean squared-error matrix (4) is further minimized. As in the general design problem of Section 5,15, we switch to maximizing the inverse mean squared-error matrix, (K'M~lK)~l = CK(Ma), and recover the familiar information matrix mapping CK. This mapping is defined on the closed cone NND(&), whence we can dispense with the full rank assumption on MO = RQ l . As a final step we let a vary continuously in the interval [0; 1]. 11.3. BAYES LINEAR MODELS WITH NORMAL-GAMMA PRIOR DISTRIBUTIONS In this section we specify the joint distribution of Y, 6, and o2 completely, as a counterpart to the classical linear model with normality assumption of Section 1.4. The joint distribution is built up from conditional distributions in three steps. The distribution of Y, conditionally with 6 and tr2 given, is normal as in Section 1.4:
The distribution of 0, conditionally with a2 given, is also normal:
where the prior parameters are identical to those in (2) of Section 11.1. The distribution of the inverse of a2 is a gamma distribution:
with prior form parameter «o > 0 and prior precision parameter /3o > 0. Assumptions (1), (2), and (3) are called a Bayes linear model with a normalgamma prior distribution. Assumption (2) respects the scaling issue, discussed at some length in Section 11.1. In (3), we parametrize the gamma distribution F^^, in order to have Lebesgue density
with expectation oo/A) and variance 2a0//3o- These two moments may guide the experimenter to select the prior distribution (3). For OQ > 2, the moment
11.4. NORMAL-GAMMA POSTERIOR DISTRIBUTIONS
273
of order -1 exists:
say. Solving for OQ we obtain «0 = 2 + /3/0-Q. The parametrization F2+ft)/(72.A) is more in line with assumption (3) of Section 11.1, in that the parameters are prior model variance ofi and prior precision j3o- Note that Fy;1 is the ^-distribution with / degrees of freedom. The statistician cannot but hope that the normal-gamma family given by (2) and (3) allows the experimenter to model the available prior information. If so, they profit from the good fortune that the posterior distribution, the conditional distribution of 6 and a2 given the observations Y, is from the same family.
11.4. NORMAL-GAMMA POSTERIOR DISTRIBUTIONS Lemma. In the Bayes linear model with a normal-gamma prior distribution, let the prior dispersion matrix RQ be positive definite. Then the posterior distribution is the normal-gamma distribution given by
where 6 is the Bayes estimator (1) of Section 11.2, and where the posterior precision increase is
with mean Proof. With z = l/o-2, the posterior distribution has a Lebesgue density proportional to the product of the densities from assumptions (1), (2), and (3) of Section 11.3,
274
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
where the quadratic form Q(B) involves )8i from (3),
With gamma density 'Xa0+/j;#)+0i(z) as given in Section 11.3, (1) and (2) follow from
Lemma 3.12 entails nonnegativity of ft, as a Schur complement in the nonnegative definite matrix
The determination of the expected value of fii parallels the computation of the mean squared-error matrix in part I of the proof of Lemma 11.2,
The posterior distribution testifies to what extent the experimental data alter the prior opinion as laid down in the prior distribution. More evidence should lead to a firmer opinion. Indeed, for the mean parameter vector 0, the posterior distribution (1) is less dispersed than the prior distribution (2) from Section 11.3,
and for the inverse model variance l/
11.5. THE BAYES DESIGN PROBLEM
275
The prior and posterior conditional distributions of 6 are so easily compared because the common conditioning variable a2 appears in both dispersion matrices in the same fashion and cancels, and because (4) is free of the conditioning variable Y. The situation is somewhat more involved when it comes to comparing the prior and posterior distributions of 1 /a2. The posterior precision increase Pi depends on the observations y, whence no deterministic statement on the behavior of ft is possible other than that it is nonnegative. As a substitute, we consider its expected value, E/>[ft] = na^, called preposterior precision increase. This quantity is proportional to the sample size «, but is otherwise unaffected by the design and hence provides no guidance towards its choice. For the design of experiments, (1) suggests minimizing (no/?^1 + X'X)~l. With a transition to moment matrices, R^1 = MQ and X'X = nM, this calls for maximization of the Bayes moment matrices
In summary, the Bayes design problem remains the same whether we adopt the moment assumptions of Section 11.1, or the normal-gamma prior distributions of Section 11.3. 11.5. THE BAYES DESIGN PROBLEM Given a prior moment matrix MQ € NND(/c) and a sampling weight a G [0; 1], the matrix Ma(g) = (1 - a)MQ + aM(g) is called the Bayes moment matrix of the design £ e H. The Bayes design problem is the following:
A moment matrix M that attains the maximum is called Bayes <j> -optimal for K'O in M. If there is no sampling evidence, a = 0, then there is only one competitor, MQ, and the optimization problem is trivial. If all the emphasis is on the sampling portion, a = 1, then we recover the general design problem of Section 5.15. Hence we assume a e (0; 1). The criterion function M \->
The set Ma inherits compactness and convexity from M. If M intersects the feasibility cone A(K), then so does Ma, by Lemma 2.3. Now, in the
276
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
formulation
the Bayes design problem is just one manifestation of the general design problem of Section 5.15. That is, the general design problem is general enough to comprise the Bayes design problem as a special case. As an illustration, we consider the question of whether optimal Bayes moment matrices for K'0 are necessarily feasible. From a Bayes point of view, the prior moment matrix MQ often is positive definite; then the set Ma is included in the open cone PD(/c). Hence by part (a) of the Existence Theorem 7.13, all formally ^-optimal moment matrices for K'B in Aia are automatically feasible for K'B. Thus, for the Bayes design problem, the existence issue is much less pronounced. As another example, we present the Bayes version of the General Equivalence Theorem 7.14. 11.6. GENERAL EQUIVALENCE THEOREM FOR BAYES DESIGNS Theorem. Assume that a prior moment matrix M0 € NND(fc) and a sampling weight a e (0; 1) are given. Let the matrix M e M be such that the Bayes moment matrix B = (1 — a)M0 + aM is feasible for K'0, with information matrix C - CK(B). Then M is Bayes ^-optimal for K'0 in M if and only if there exists a nonnegative definite s x s matrix D that solves the polarity equation
and there exists a generalized inverse G of B such that the matrix N = GKCDCK'G' satisfies the normality inequality
In the case of optimality, equality obtains in the normality inequality if for A we insert any matrix M e M that is Bayes $-optimal for K'0 in M. Proof. Evidently M is Bayes <£-optimal for K '0 in M if and only if B is >-optimal for K'0 in Ma = {(1 - a)MQ + aA: Ae M}. To the latter problem, we apply the General Equivalence Theorem 7.14. There, the normality inequality is
11.7. DESIGNS WITH PROTECTED RUNS
277
This is equivalent to a trace AN < 1 - (1 - a) trace MQN, for all A e M. Because of trace BN — 1, the right hand side is trace (B - (1 - a)Mo)N — a trace MN.lna trace AN < a trace MN we cancel a to complete the proof.
Three instances may illustrate the theorem. For a matrix mean <j>p with p e (-00; 1), the matrix M e M is Bayes
Indeed, in the theorem we have N = GKCp+lK'G'/irace Cp, and the common factor trace Cp cancels on both sides of the inequality. Second, for scalar optimality, M e M is Bayes optimal for c'B in M if and only if there exists a generalized inverse G ot B = (l-a)A/o + aM that satisfies
Third, we consider the situation that the experimenter wants to minimize a mixture of the standardized variances c'Ma(g)~c where c is averaged relative to some distribution /A. With W — JR* cc'rf/x e NND(fc), this calls for the minimization of a weighted trace, trace WM a (£)~. An optimal solution is called Bayes linearly optimal for 6 in M. This is the Bayes version of the criterion that we discussed in Section 9.8. Again it is easy to see that the problem is equivalent to finding the <£_i-optimal moment matrix for H'6 in Ma, where W = HH' is a full rank decomposition. Therefore M e M is Bayes linearly optimal for B in M if and only if some generalized inverse G of B = (1 - a)M0 + aM fulfills
The optimization problem for Bayes designs also occurs in a non-Bayes setting.
11.7. DESIGNS WITH PROTECTED RUNS Planned experimentation often proceeds in consecutive steps. It is then desirable to select the next, new stage by taking into account which design £0 was used in the previous, old stage. Let us presume that n0 observations were taken under the old design £0- If the new sample size is n and the new design is £, then joint evaluation of all n0 + n observations gives rise to the
278
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
per observation moment matrix
where a = n/(n$ + n] designates the fraction of observations of the new experiment. The complementary portion 1 - a consists of the "protected" experimental runs of the old design &• Therefore an optimal augmentation of the old design £0 is achieved by maximizing a criterion of the form (f> o C/i:(Af a (£)). This optimization problem coincides with that motivated by the Bayes approach and Theorem 11.6 applies. Another way of viewing this problem is that the weights of the old design provide a lower bound a(x) = &(*) f°r the weights of the new design £. There may also be problems which, in addition, dictate an upper bound b(x). For instance, it may be costly to make too many observations under the regression vector x. For the designs with weights bounded by a and b, the General Equivalence Theorem 7.14 specializes as follows. 11.8. GENERAL EQUIVALENCE THEOREM FOR DESIGNS WITH BOUNDED WEIGHTS Theorem. Assume a[a\b] = {£ e H : a(x) < £(x) < b(x) for all x e X} is the set of designs with weights bounded by the functions a, b : X —> [0; 1]. Let £ e H[a;6] be a design that is feasible for K'B, with information matrix C = C/c(A/(£)). Then £ is ^-optimal for K'O in E[«;fc] if and only if there exists a nonnegative definite s x s matrix D that solves the polarity equation
and there exists a generalized inverse G of M(g) such that the matrix N = GKCDCK'G' satisfies the normality inequality
Proof. The set M(H [«;&]) of moment matrices is compact and convex, and intersects A(K) by the assumption on £. Hence the General Equivalence Theorem 7.14 applies and yields the present theorem but for the normality inequality
We prove that (1) and (2) are equivalent. First we show that (1) entails (2). For any two vectors y, z 6 X with g(y) and £(z) lying between both bounds, the inequalities in (1) go either way and
11.8. GENERAL EQUIVALENCE THEOREM FOR DESIGNS WITH BOUNDED WEIGHTS
279
become an equality. Hence there exists a number c > 0 such that we have, for all x € X,
For a competing design 17 e E[a;b], let x\,... ,xe be the combined support points of 17 and £. We have the two identities
the latter invoking 1 = trace M(£)N. Splitting the sum into three terms and observing a(x) < 7j(jt) < b(x), we obtain (2) from
Conversely, that (2) implies (1), is proved indirectly. Assuming (1) is false there are regression vectors y,z 6 X with £(v) < b(y) and £(z) > a(z) fulfilling e = y 'Ny -z'Nz > 0. We choose 8 > 0 and define 17 in such a way that
Therefore 17 is a design in H[«; b], with moment matrix M(17) = M(£)+d(yy 'zz'). But trace M(rj)N = l + 8e > 1 violates the normality inequality (2). D As an example we consider a constraint design for a parabola fit model, in part II of the following section. We place the example into the broader context of model discrimination, to be tackled by> different means in Section 11.18 and Section 11.22.
280
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
11.9. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, I With experimental domain [-1;1], suppose the experimenter hopes that a second-degree polynomial fit model will adequately describe the data. Yet it deems desirable to guard against the occurrence of a third-degree term. This calls for a test, in a third-degree model, of 03 = 0 versus #3 ^ 0. If there is significant evidence for 03 not to vanish, then the third-degree model is adopted, with parameter vector 0(3) = (0o> #i ? #2> 03)'- Otherwise a seconddegree model will do, with parameter vector 0(2) = (0o, 0i> flz)'The experimenter proposes that the determinant criterion fe is appropriate for evaluating the designs, in the present situation. The statistician advises the experimenter that there are many designs that provide information simultaneously for 03, 0(3), and 0(2). The efficiencies of the following designs are tabulated in Exhibit 11.1 in Section 11.22, along with others to be discussed later. The efficiencies are computed relative to the optimal designs, so these are reviewed first. i. In a third-degree model, the optimal design for the individual component 03 = e3'0(3) minimizes T ^ e^M^(r)e^. The solution is the arcsin support design r from Section 9.12, with weights, third-degree moment matrix, and inverse given by
Dots indicate zeros. The optimal information for 03 is 1/16 = 0.0625. ii. In a third-degree model, the ^-optimal design for the full vector 0(3) maximizes T i-+ (det M3(T))1/4. The solution T is shown in Exhibit 9.4 in Section 9.6,
11.9. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, I
281
The
It has respective efficiencies of 72%, 94%, and 84% for 03, 0(3), and 0(2). Of course, there is no direct merit in the constant spacing. The high efficiencies of r are explained by the fact that the support set T = (±1, ±1/2,0} is the union of the second-degree and third-degree arcsin support, as introduced in Section 9.6. We call T the arcsin support set for the discrimination between a second-degree and a third-degree model. II. As an alternative, half of the observations are drawn according to the old design T from the previous paragraph while the other half r is adjoined in a
282
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
This design has respective efficiencies of 42%, 89%, and 94% for #3, 0(3), and 0(2). To see that r is the
where d = w + 17/80 - (w + 1/4)2. It is a member of the class T[a; 1] of designs with bounded weights, with lower bound function a(t) = 1/10 for t = ±1,±1/2,0 and a(t) = 0 elsewhere. Theorem 11.8 states that z'M(w)~lz is constant for those z = f(t} for which the new weight r(t) is positive. Our conjecture that one of the designs rw solves the problem requires us to consider positivity of r(t) for the points t — ±1,0. That is, the sum of all entries of (M(vf))~ 1 must be equal to the top left component. This leads to the equation w 2 -H>/6 = 11/120. The relevant solution is w = (1+^/71/5)712 = 0.3974. For the design T = T#, the moment matrix M(w] and its inverse are
The lower bound a(t) is exceeded only at / = ±1,0 and, by construction of iv, we have f(t) 'M(w)~lf(t) — 3.2. All weights r(t) stay below the constant
11.10. MIXTURES OF MODELS
283
upper bound 1. Therefore the normality inequality of Theorem 11.8 becomes
The inequality holds true by the very contruction of w. Namely, the polynomial P ( t ) = f ( t ) ' M ( w ) - l f ( t ) = 3.2-5.3f2+5.3f4 satisfies P(0) == P(±l) = 3.2 and has a local maximum at 0. Thus P is on [-1; 1] bounded by 3.2, and optimality of the design r is established. Alternatively we could have utilized Theorem 11.6. There are other ways to find designs that perform equally well across various models. We explore two of them, to evaluate a mixture of the information obtained from each model and, from Section 11.19 on, to maximize the information in a first model, within a subclass of designs that achieve a prescribed efficiency in a second model. 11.10. MIXTURES OF MODELS The problem of maximizing mixed information from different models is relevant when the experimenter considers a number of potential models to describe the data, and wants to design the experiment to embrace each model in an efficient manner. Hence suppose that on a common experimental domain T, we are given m different regression functions,
Eventually the task is one of finding a design on T, but again we first concentrate on the optimization problem as it pertains to the moment matrices. With model dimensions equal to k\,... ,km, we introduce the cone
in the Euclidean space Sym(A:i x • • • x km} — Sym(A:i) x • • • x Sym(km). The scalar product of two members A = (Ai,...,Am) and B = (Bi,...,Bm) in Sym(A;1 x • • • x km} is the sum of the trace scalar products, (A, B) = trace A\B\ + • • • + trace AmBm. In the space Sym(A:i x • • • x km), the cone NND(£i x • • • x km) is closed and convex; for short we write A > 0 when A e NND(*i x • - . x km). In the i th model, the moment matrix M, is evaluated through an information function «//,. Compositions of the form
284
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
Thus ^(M) = (tj/i(Mi),..., i]/m(Mm))' is the vector of information numbers from each of the m models, and takes its values in the nonnegative orthant ra? = [0;ooT. To compress these m numbers into a single one, we apply an information function <J> on R™, that is, a function
which is positively homogeneous, superadditive, nonnnegative, nonconstant, and upper semicontinuous. The prime choices are the vector means 4>p with p e [-00; 1], from Section 6.6. Thus the design problem is the following,
where the subset M^ C NND(fci x • • • x km) is taken to be compact and convex. Increasing complexity to formulate the design problem does not necessarily mean that the problem itself is any more difficult than before. Here is a striking example. Suppose the regression function / = (/i, ...,/*)' has k components, and the experimenter is uncertain how many of them ought to be incorporated into the model. Thus the rival models have moment matrices MI arising from the initial sections /(,) of /,
In the /th model, the coefficient vector of interest is e, = (0,... ,0,1)' G R', in order to find out whether this model is in fact of degree /. The criterion becomes a Schur complement in the matrix A — A/, by blocking off AH = Mi-i,
see Lemma 3.12 and its proof. If the information from the k models is averaged with the geometric mean 3>0 °n ^+»then tne criterion turns into
11.11. MIXTURES OF INFORMATION FUNCTIONS
285
Thus we simply recover determinant optimality in the model with regression function /, despite the challenging problem formulation. Generally, for maximizing 4> o tf/ over M^m\ little is changed compared to the general design problem of Section 5.15. The underlying space has become a Cartesian product and the optimality criterion is a composition of a slightly different fashion. Other than that, its properties are just the same as those that we encountered in the general design problem. The mapping ty is an Um-valued information function on NND(&i x • • • x km), in that it is positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous, relative to the componentwise partial ordering A > 0 that the cone IR™ induces for vectors A in the space Rm. Therefore its properties are exactly like those that the mapping CK enjoys relative to the Loewner ordering C > 0 on the space Sym(s') (see Theorem 3.13). With the present, extended terminology, the information matrix mapping CK is an instance of a Sym(s)-valued information function on NND(/c). The following lemma achieves for the composition o if/ what Theorem 5.14 does for the composition > o CK, in showing that $ o ty enjoys all the properties that constitute an information function on NND(A;1 x • • • x km), and in computing its polar.
11.11. MIXTURES OF INFORMATION FUNCTIONS Lemma. Let <J> be an information function on IR™, and let if/i,...,if/m be information functions on NND^),... ,NND(/:m), respectively. Set ty — ( « A i , . . . , \ltm)' and tf/°° = (i/^ 0 0 ,..., ty™)'. Then 4> o if/ is an information function on NND(A:1 x • • • x km), with polar function (4> o if/)00 — 4>°° o i^00. Proof. The same steps o-v as in the proof of Theorem 5.14 show that <£ = <J> o i/r is an information function. Also the polarity relation is established in a quite similar fashion, based on the representation
With level sets {tf/ > A} - {A > 0 : ^r(A) > A}, for all A € R™, the unit level set of the composition <£ o tf/ is
For all B e KNO^ x • • - x km) and A e R?, we get
286
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
The last line uses inf /1 . e ( l/ , i > A .}( J 4,,fi,) = A,-^/30 (/?/). This is so since for A, > 0 we have {^/ > A,} = A,{^ > 1}, while for A, = 0 both sides are 0. Finally we apply (1) to $ o tfr and then to 4> to obtain from (2) and (3), for all B > 0,
The grand assumption of Section 4.1, Mr\A(K) / 0, ensures that for some moment matrix M e A4, the information matrix mapping CK(M) falls into PD(s), the interior of the cone NND(s) where an information function
11.12. GENERAL EQUIVALENCE THEOREM FOR MIXTURES OF MODELS Theorem. For the problem of Section 11.10, let M = (M 1 ,...,M m ) e M^ be a tuple of moment matrices with «fc(M/) > 0 for all / < m. Then M maximizes o ^(A) over A e M^ if and only if for all i <m there exist a number a, > 0 and a matrix Af, € NND(&,) that solve the polarity equations
11.12. GENERAL EQUIVALENCE THEOREM FOR MIXTURES OF MODELS
287
and that jointly satisfy the normality inequality
Proof. Just as in the Subgradient Theorem 7.4, we find that M e M^ maximizes the function $ o ^r : NND(/C! x • • • x km) —* IR over .M(m) if and only if there is a subgradient B of 4> o t[/ at M that is normal to M^ at M. Theorem 7.9, adapted to the present setting, states that a subgradient B exists if and only if the tuple N = B/(3> o ^r)(M) satisfies the polarity equation for
where we have set ities for 4> and for «/*-,• yield
The polarity inequal-
Thus (4) splits into the conditions
The numbers a, = trace A/,Af, > 0 sum to 1, because of (5). If a, > 0, then we rescale (7) by I/a, to obtain (2); conversely, (2) implies (7) with Ni = a,-W/. If a, = 0, then ^(Af/) > 0 forces ^(ty) = 0 in (7), and AT/ = 0 is (another) particular solution to (5), (6), and (7). With AT,- = a,-Af,-, (5), (6), and (7) are now seen to be equivalent to (1) and (2). The normality of B to M^ at M then translates into condition (3). The result features all the characteristics of a General Equivalence Theorem. The polarity equation (1) corresponds to the component function 4>; it serves to compute the weighting a\,..., am. The set of polarity equations in (2) come with the individual criteria ^,, they determine the matrices Ni. The normality inequality (3) carries through a comparison that is linear in the competing tuples A e M^m\ It evaluates not the m normality inequalities
288
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
trace AiNi < 1 individually, but their average with respect to the weighting a,. More steps are required than before, but each one alone is conceptually no more involved than those met earlier. The vector means <J>P and the compositions fa = 4>i ° CKi let the result appear yet closer to the General Equivalence Theorem 7.14.
11.13. MIXTURES OF MODELS BASED ON VECTOR MEANS
Theorem. Consider a vector mean 4>p of finite order, p e (-oo;l], and compositions of the form fa — <& o C#., with information functions
and there exists a generalized inverse G, of Af, such that
jointly satisfy the normality inequality
Proof. For a vector mean 4>p, the solution of the polarity equation is given by the obvious analogue of Lemma 6.16. With this, property (1) of Theorem 11.12 requires a/i/^Af,) to be positive and positively proportional to (fa(Mi)Y~l. This determines the weighting a,. The polarity equations (2) of Theorem 11.12 reduce to the present ones, just as in the proof of the General Equivalence Theorem 7.14. We demonstrate the theorem with a mixture of scalar criteria, across m different models. Let us assume that in the ith model, a scalar parameter system given by a coefficient vector c, e IR*( is of interest. The design T in T is taken to be such that the moment matrix A/, = JT fj(t)fj(t)' dr is feasible, MI € A(ci). For the application of the theorem, we get
11.14. MIXTURES OF CRITERIA
289
Therefore (Mlt...tMm) 6 X (w) makes ^((c^fci)- 1 ,...,^^-^)- 1 ) a maximum over M^ if and only if for all / < m there exists a generalized inverse G, of M,- such that, for all (>4i,... ,Am) <E .M(w),
The discussion of Bayes designs in Section 11.6 leads to the average of various criteria, not in a set of models, but in one and the same model. This is but a special instance of the present problem.
11.14. MIXTURES OF CRITERIA We now consider the situation that the experimenter models the experiment with a single regression function
but wishes to implement a design in order to take into account m parameter systems of interest AT/0, of dimension st, evaluating the information matrix CK.(M) by an information function fa on NND(s,). This comprises the case that a single parameter system K'B is judged on the grounds of m different criteria
Carrying out the joint evaluation with an information function 4> on R™, the problem now is one of studying a mixture of information functions, utilizing an information function on R™:
where the set M C NND(A;) of competing moment matrices is compact and convex. Upon defining the set M(m) C NND(A: x • • • x A;) to be M(m) = { ( M , . . . , M ) : M € M}, the problem submits itself to the General Equivalence Theorem 11.12.
290
CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS
11.15. GENERAL EQUIVALENCE THEOREM FOR MIXTURES OF CRITERIA Theorem. For the problem of Section 11.14, let M G M be a moment matrix with ^(M) > 0 for all / < m. Then M maximizes <1> o i f / ( A , . . . ,A) over A e M if and only if for all / < m there exist a number a, > 0 and a matrix Af, e NND(£) that solve the polarity equations
such that the matrix N —
Proof.
satisfies the normality inequality
The result is a direct application of Theorem 11.12.
Again the result becomes more streamlined for the vector means <&p, and compositions ifa = <£, o CK,. In particular, it suffices to search for a single generalized inverse G of M to satisfy the normality inequality. 11.16. MIXTURES OF CRITERIA BASED ON VECTOR MEANS Theorem. Consider a vector mean 4>p of finite order, p € (-00; 1], and compositions of the form fa = fa o £#., with information functions fa on NND(s/), for all i < m. Let M e M be a moment matrix that is feasible for K/6, with information matrix C, = C/^.(M). Then M maximizes 4>p(