H S Migon and D Gamerman
Federal University of Rio de Janeiro, Brazil
ARNOLD
A member of the Hodder Headline Group
LONDON · NEW YORK · SYDNEY · AUCKLAND
First published in Great Britain in 1999 by Arnold, a member of the Hodder Headline Group, 338 Euston Road, London NW1 3BH
http://www.arnoldpublishers.com

Co-published in the United States of America by Oxford University Press Inc., 198 Madison Avenue, New York, NY 10016

© 1999 Dani Gamerman and Helio Migon

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronically or mechanically, including photocopying, recording or any information storage or retrieval system, without either prior permission in writing from the publisher or a licence permitting restricted copying. In the United Kingdom such licences are issued by the Copyright Licensing Agency: 90 Tottenham Court Road, London W1P 9HE.

Whilst the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the publisher can accept any legal responsibility or liability for any errors or omissions that may be made.
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalogue record for this book is available from the Library of Congress

ISBN 0 340 74059 0

1 2 3 4 5 6 7 8 9 10

Publisher: Nicki Dennis
Production Editor: Julie Delf
Production Controller: Sarah Kett Cover Design: Terry Griffiths
Typeset in 10/12 pt Times by Focal Image Ltd, London
What do you think about this book? Or any other Arnold title? Please send your comments to
[email protected]
For Mirna, Marcio, Marcelo and Renato (Helio)
For my parents Bernardo and Violeta (Dani)
Contents

Preface  vii

1  Introduction  1
1.1  Information  2
1.2  The concept of probability  3
1.3  An example  7
1.4  Basic results in linear algebra and probability  10
1.5  Notation  14
1.6  Outline of the book  15
Exercises  16

2  Elements of inference  19
2.1  Common statistical models  19
2.2  Likelihood-based functions  21
2.3  Bayes' theorem  26
2.4  Exchangeability  32
2.5  Sufficiency and the exponential family  35
2.6  Parameter elimination  41
Exercises  45

3  Prior distribution  53
3.1  Entirely subjective specification  53
3.2  Specification through functional forms  55
3.3  Conjugacy with the exponential family  59
3.4  The main conjugate families  60
3.5  Non-informative priors  65
3.6  Hierarchical priors  71
Exercises  73

4  Estimation  79
4.1  Introduction to decision theory  79
4.2  Classical point estimation  86
4.3  Comparison of estimators  92
4.4  Interval estimation  99
4.5  Estimation in the normal model  105
Exercises  116

5  Approximate and computationally intensive methods  125
5.1  The general problem of inference  125
5.2  Optimization techniques  126
5.3  Analytical approximations  133
5.4  Numerical integration methods  144
5.5  Simulation methods  147
Exercises  160

6  Hypothesis testing  167
6.1  Introduction  167
6.2  Classical hypothesis testing  168
6.3  Bayesian hypothesis testing  181
6.4  Hypothesis testing and confidence intervals  186
6.5  Asymptotic tests  188
Exercises  191

7  Prediction  197
7.1  Bayesian prediction  197
7.2  Classical prediction  201
7.3  Prediction in the normal model  203
7.4  Linear prediction  206
Exercises  208

8  Introduction to linear models  211
8.1  The linear model  211
8.2  Classical linear models  213
8.3  Bayesian linear models  218
8.4  Hierarchical linear models  224
8.5  Dynamic linear models  227
Exercises  232

Sketched solutions to selected exercises  235
List of distributions  247
References  253
Author index  256
Subject index  258
Preface
This book originated from the lecture notes of a course in statistical inference taught on the MSc programs in Statistics at UFRJ and IMPA (once). These notes have been used since 1987. During this period, various modifications were introduced until we arrived at this version, judged as minimally presentable. The motivation to prepare this book came from two different sources. The first, and more obvious one for us, was the lack of texts in Portuguese dealing with statistical inference to the desired depth. This motivation led us to prepare the first draft of this book in the Portuguese language in 1993. The second, and perhaps the most attractive as a personal challenge, was the perspective adopted in this text.
Although there are various good books in the literature dealing with this subject, in none of them could we find an integrated presentation of the two main schools of statistical thought: the frequentist (or classical) and the Bayesian. This second motivation led to the preparation of this English version. This version has substantial changes with respect to the Portuguese version of 1993. The most notable one was the inclusion of a whole new chapter dealing with approximate and computationally intensive methods. Generally, statistical books follow their author's point of view, presenting at most, and in separate sections, related results from the alternative approaches. In this book, our proposal was to show, wherever possible, the parallels existing between the results given by both methodologies. Comparative Statistical Inference by V. D. Barnett (1973) is the book that is closest to this proposal. It does not, however, present many of the basic inference results that should be included in a text proposing a wide study of the subject. Also, we wanted to be as comprehensive as possible for our aim of writing a textbook in statistical inference. This book is organized as follows.
Chapter 1 is an introduction, describing the way we find most appropriate to think about statistics: discussing the concept of information. It also briefly reviews basic results of probability and linear algebra. Chapter 2 presents some basic concepts of statistics such as sufficiency, the exponential family, Fisher information, exchangeability and likelihood functions. Another basic concept, specific to Bayesian inference, is the prior distribution, which is dealt with separately in Chapter 3. Certain aspects of inference are individually presented in Chapters 4, 6 and 7. Chapter 4 deals with parameter estimation where, intentionally, point and interval
estimation are presented as responses to the summarization question, and not as two unrelated procedures. The important results for the normal distribution are presented and also serve an illustrative purpose. Chapter 6 is about hypothesis testing problems under the frequentist approach and also under the various possible forms of the Bayesian paradigm.
In between them lies Chapter 5, where all approximation and computationally based results are gathered. The reader will find there at least a short description of the main tools used to approximately solve the relevant statistical problem in situations where an explicit analytic solution is not available. For this reason, asymptotic theory is also included in this chapter. Chapter 7 covers prediction from both the frequentist and Bayesian points of view, and includes the linear Bayes method. Finally, in Chapter 8, an introduction to normal linear models is made. Initially the frequentist approach is presented, followed by the Bayesian one. Based upon the latter approach, generalisations are presented leading to the hierarchical and dynamic models. We tried to develop a critical analysis and to present the most important results of both approaches, commenting on the positive and negative aspects of each. As
has already been said, the level of this book is adequate for an MSc course in statistics, although we do not rule out the possibility of its use in an advanced undergraduate course aiming to compare the two approaches.

This book can also be useful for the more mathematically trained professionals from related areas of science such as economics, mathematics, engineering, operations research and epidemiology. The basic requirements are knowledge of calculus and probability, although basic notions of linear algebra are also used.

As this book is intended as a basic text in statistical inference, various exercises are included at the end of each chapter. We have also included sketched solutions to some of the exercises and a list of distributions at the end of the book, for easy reference.
There are many possible uses of this book as a textbook. The first and most obvious one is to present all the material in the order it appears in the book and without skipping sections. This may be a heavy workload for a one-semester course. In this case we suggest postponing Chapter 8 to a later course. A second option for exclusion in a first course is Chapter 5, although we strongly recommend it for anybody interested in the modern approach to statistics, geared towards applications. The book can also be used as a text for a course that is more strongly oriented towards one of the schools of thought. For a Bayesian route, follow Chapters 1, 2, 3, Sections 4.1, 4.4.1 and 4.5, Chapter 5, Sections 6.3, 6.4, 6.5, 7.1, 7.3.1, 7.4, 8.1, 8.3, 8.4 and 8.5. For a classical route, follow Chapter 1, Sections 2.1, 2.2, 2.5, 2.6, 4.2, 4.3, 4.4.2 and 4.5, Chapter 5, Sections 6.1, 6.2, 6.4, 6.5, 7.2, 7.3.2, 7.4, 8.1 and 8.2.
This book would not have been possible without the cooperation of various people. An initial and very important impulse was the typing of the original lecture notes in TeX by Ricardo Sandes Ehlers. Further help was provided by Ana Beatriz Soares Monteiro, Carolina Gomes, Eliane Amiune Camargo, Monica Magnanini and Ot. We also had the stimulus of several colleagues; in particular, we would like to mention Basílio de B. Pereira. We would also like to thank Nicki Dennis for her support and encouragement throughout all the stages of preparation of this book and for making us feel at home with Arnold, and Marcia N. Migon for the many comments on earlier versions of the book. The Brazilian research supporting agencies CNPq and FAPERJ and the Ministry of Science and Technology have helped through their continuing support of our research work. This support enabled the use in this project of the computational and bibliographic material they provided for our own research. Our families also played the important roles of support and understanding, especially in the weekends and late nights spent trying to meet deadlines! To all of them, our gratitude.

Finally, the subject of the book is not new and we are not claiming any originality here. We would like to think that we are presenting the subject in a way that is not favoured in many textbooks and that will help readers to have an integrated view of the subject. In our path to achieve this goal, we have been influenced by many researchers and books. We tried to acknowledge this influence by referring to these books whenever we felt they provided a description of a topic worth reading. Therefore, we tried to relate every major subject presented in our book to books that treat the subject in a more complete or more interesting way. In line with its textbook character, we opted to favour books rather than research papers as references. We would like to think of our book as a basis for discovery and will feel our task is accomplished whenever readers understand the subject through the book alone, its references or a combination of both.

H.S.M. & D.G.
Rio de Janeiro, December 1998
1 Introduction
Before beginning the study of statistics, it is relevant to characterize the scope of the area and the main issues involved in this study. We avoid defining the subject directly, which is a hard and polemical task. Some of the components involved in this area of science will be presented in the hope that in the end the reader will have a clear notion of the breadth of the subject under consideration.

The fundamental problem towards which the study of statistics is addressed is that where randomness is present. The statistical methodology to deal with the resulting uncertainty is based on the elaboration of probabilistic models in order to summarize the relevant information available. Many concepts used in the last sentence deserve an explanation of their meaning in order to ensure a unified understanding. A model is a formal collection of coherent rules describing, in a simplified way, some real-world problem. The language used to describe precisely a model in which uncertainty is present is probability. The meaning of summarization, in this context, refers to the ability to describe a problem in the most concise way (under the assumption that there are many possible ways to describe a problem). The art involved in model building is the desire to balance the need to include as many aspects of reality as possible while preventing the model from being too complex. A related concept, which is useful to have in mind, is that of parsimony. This means that the model must have an optimal complexity. A very simple model can be misleading since it misses relevant aspects of reality. On the other hand, if it is highly complex it will be hard to understand and extract meaningful information from.

From the previous discussion it is not difficult to guess that information is the main input for statistics. However, this is a hard concept to define. Among the objectives of our study, we can identify the two main ones as being: to understand reality (estimation and hypothesis testing) and to make decisions (predictions). A strong limitation of many presentations of statistical inference is to be mainly concentrated on estimation and testing.
In this context, it only deals with quantities that can never be observed in the present or in the future. Nevertheless, our view of statistics is that it must be concerned with observable quantities. In this way, one is able to verify the model adequacy in an irrefutable way.
For the sake of completeness, however, we will present the main results
available from all topics above.
1.1 Information
As we have already said, the notion of information is present in all the studies developed in statistics. Since uncertainty is one of the main ingredients of our models, we need to gather as much information as possible in order to reduce our initial uncertainty. A fundamental question which we are concerned with is about the type of information that is relevant and must be retained in the analysis. A possible reply to this question is that all available information is useful and must be taken into consideration. Another answer is to avoid arbitrariness and take into consideration only objective observations coming from a sampling process. In this way all the subjective information must be discarded. These two points of view roughly form the bases for two different forms of statistical analysis: the Bayesian (or subjectivist) and the classical (or frequentist) approach, respectively. As we will see in the next section, the divergence between these two approaches is much stronger, beginning with the interpretation of the concept of probability. This is always the starting point of a statistical model. An example to illustrate and clarify these points is as follows.
Example. Consider the situation described by Berger (1985, p. 2) concerning the following experiments:

1. A fine musician, specializing in classical works, tells us that he is able to distinguish whether Haydn or Mozart composed some classical song. Small excerpts of the compositions of both authors are selected at random and the experiment consists of playing them for identification by the musician. The musician makes 10 correct guesses in exactly 10 trials.

2. A drunken man says that he can correctly guess in a coin toss which face of the coin will fall down. Again, after 10 trials the man correctly guesses the outcomes of the 10 throws.

3. An old English lady is well known for her ability to distinguish whether a cup of tea is prepared by first pouring the milk or the tea. Ten cups filled with tea and milk, well mixed and in a random order, are presented to her. She correctly identifies all ten cups.

It is not difficult to see that the three experiments provide the same information and therefore any test to verify the authenticity of the persons' statements would give a positive result for all of them, with the same confidence. This does not make any sense! We have more reasons to believe in the authenticity of the statement of the musician than of the old lady and, certainly, much more than of the drunken man. There is no doubt that the experimental outcome increases the veracity of the statements made. But we cannot reasonably say that we have the same confidence in the three assertions. By common sense, there is a long way to go before one accepts this conclusion.
1.2 The concept of probability
Although the definition of probability is well accepted by almost every statistician (ignoring some technical details), its interpretation or the sense attributed to it varies considerably. We mention here some of the more common interpretations: physical (or classical), frequentist and subjective.
(1) Physical or classical. The probability of any event is the ratio between the number of favourable outcomes and the total number of possible experimental results. It is implicitly assumed that all the elementary events have the same chance. The concept of probability was first formulated based on these classical ideas, which are closely related to games of chance (cards, dice, coins, etc.), where the equal chance assumption is taken for granted. The probability associated with more elaborate events is obtained just as a consequence.
(2) Frequentist. The probability of an event A, denoted by Pr(A) or P(A), is given by

Pr(A) = lim_{n → ∞} m/n,

where m is the number of times that A has occurred in n identical and independent experimental trials. This interpretation intends to be objective insofar as it is based only on observable quantities. However, it is worth noting that:

(i) The limit cannot be understood as a mathematical limit since, given ε > 0 and N > 0, there could well exist an N0 > N such that |Pr(A) - (m/N0)| > ε. This is improbable but not impossible.

(ii) The concepts of identical and independent trials are not easy to define objectively and are in essence subjective.

(iii) In practice, n does not go to infinity and so there is no way to ensure the existence of such a limit.
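The stabilization of relative frequencies behind this interpretation can be illustrated by a short simulation. This is only a sketch, not part of the text; the event probability 0.3 and the number of trials are arbitrary choices:

```python
import random

random.seed(42)

def relative_frequency(p_true, n_trials):
    """Simulate n_trials independent trials of an event A with probability
    p_true and record the relative frequency m/n after each trial."""
    m = 0
    freqs = []
    for n in range(1, n_trials + 1):
        m += random.random() < p_true  # did A occur on this trial?
        freqs.append(m / n)
    return freqs

freqs = relative_frequency(0.3, 100_000)
# The early frequencies wander; the late ones hover near the true value.
print(freqs[99], freqs[-1])
```

Of course, as point (iii) above notes, no finite simulation can establish the existence of the limit; it can only make the stabilization plausible.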
The scope of the two interpretations is limited to observable events and does not correspond to the concept used by most people. Human beings evaluate (explicitly or implicitly) probabilities of observable and unobservable events.

Examples. 1. Consider a proposition A such as 'it will rain today'. Pr(A) is a useful quantity to consider. If its numerical value is low then I will decide not to take an umbrella, I will prepare myself for a nice walk back home, etc.

2. Let A be 'John has disease X'. Although A can be an observable quantity after delicate and expensive surgery, John's doctor can take a number of actions (including the surgery itself) based on the value he ascertains for Pr(A).

3. The proposition A is 'John will get married to Mary'. Once again it makes sense to think about Pr(A), especially if I have some sort of personal relationship with John and/or Mary.

In all the cases presented above the classical and frequentist interpretations do not make sense. A is always non-observable, unique and cannot be repeated under similar conditions.
(3) Subjective. The probability of an event A is a measure of someone's degree of belief in the occurrence of A. To emphasize its subjective character, it is better to denote this probability by Pr(A | H), where H (for history) represents the available information set of the individual. For example, let A be the event 'it is raining in Moscow'.

(a) The easiest probability to associate with A, for someone in Rio de Janeiro who does not know anything about the Moscow climate, is Pr(A | H1) = 0.5, which is based on his or her body of knowledge H1.

(b) On the other hand, someone in St. Petersburg could have stated one value for Pr(A | H2) if it is raining in St. Petersburg and a different one if it is not. Note that, in contrast to someone in Rio, this person will typically have more information, contained in H2.

(c) But for someone in Moscow, Pr(A | H3) = 1 if it is raining and 0 otherwise, because in this case H3 contains A!

It is worth pointing out that the values of Pr(A | H) are not equal since they depend on the information H, which is different for each case. This interpretation of probability illustrated by the above example is called subjective and obeys the basic rules of probability. Note also that adopting the subjective interpretation we can associate probability with the cases unsolved by the other schools of thought. The remaining question is how to obtain its value for a given event or proposition based on a specified information set H.

Probabilities can be evaluated directly or indirectly. One standard tool for direct probability evaluation can be an urn with 100 (or 1000, say) balls of two different colours: blue and red. For example, let us suppose that you want to assess the probability that the Canoas road (a very pleasant road along the mountains surrounding Rio) is closed because of a road accident. In the direct approach, you must compare this probability with the chance of drawing a red ball from the urn. If these probabilities were judged equal when the urn has 20 red balls, then the probability that the road is closed is 0.2.
For the case of indirect measurement, let us assume that we have two lottery systems: one involving the event you are interested in and the other any direct evaluation instrument. Imagine the following lotteries:

1. Bet A: If the road is not blocked you win 5 monetary units, otherwise you do not win anything.

2. Bet B: If the ball drawn from the urn is red, you win 5 monetary units, otherwise nothing.

Considering the two lotteries offered to you, on which one do you prefer to bet? If you prefer A, it might be because there is a bigger chance of winning the premium betting on A than on B. Now, let us make a small modification in the urn composition, to 10 red balls and 90 blue ones, so that the probability of winning bet B is 0.1. If you still prefer A, redefine the composition of the urn again and continue until you become indifferent between bets A and B.

There are also difficulties associated with these forms of probability evaluation. In the first case, the difficulty is associated with the comparison of probabilities coming from different propositions. In the second case, the difficulty is caused by the introduction of monetary units and the evaluation of their respective utilities.
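The iterative adjustment of the urn just described amounts to a bisection on the fraction of red balls. In the sketch below the bettor's preferences are simulated by a hidden degree of belief in the event of bet A; the value 0.8 is a made-up illustration, not taken from the text:

```python
def elicit_by_urn(prefers_A, tol=1e-4):
    """Adjust the fraction of red balls in the urn until the bettor is
    indifferent between bet A (on the event) and bet B (on a red draw)."""
    lo, hi = 0.0, 1.0          # bounds on the fraction of red balls
    while hi - lo > tol:
        red_fraction = (lo + hi) / 2
        if prefers_A(red_fraction):
            lo = red_fraction  # event judged more likely: add red balls
        else:
            hi = red_fraction  # red draw judged more likely: remove red balls
    return (lo + hi) / 2

# Hypothetical bettor whose hidden degree of belief that the road is
# not blocked equals 0.8; they prefer bet A while it wins more often.
p_event = 0.8
elicited = elicit_by_urn(lambda red_fraction: p_event > red_fraction)
print(round(elicited, 2))
```

Only the preference ordering between the two bets is used, which is the point of the indirect method: no numerical probability is ever asked of the bettor directly.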
1.2.1 Assessing subjective probabilities

There are many alternative ways to determine subjective probabilities. De Finetti (1974) provides a very useful scheme based on the notion of a squared error loss function. Let A be a proposition or an event identified with the value 1 when it is true and 0 otherwise. The probability p that we associate with A is obtained by minimizing the squared error loss

L(p) = (p - 1)^2  if A = 1
L(p) = p^2        if A = 0.

The basic properties of probability follow easily from this definition, as will be shown below.

1. p ∈ [0, 1]. If p > 1 then p^2 > 1 and (p - 1)^2 > 0. Therefore the losses are always bigger than 1 and 0, the losses obtained by making p = 1. A similar argument is used to show that if p < 0 the losses will be bigger than those evaluated with p = 0. Then, minimization of the squared error losses imposes p ∈ [0, 1]. Figure 1.1 arrives at the same conclusion graphically.
Fig. 1.1 The possible losses are given by BD^2 and CD^2, which are minimized if D is between B and C.

2. Pr(Ā) = 1 - Pr(A), where Ā denotes the complement of A. The possible losses associated with the specifications Pr(A) = p and Pr(Ā) = q are:

A = 1 : (p - 1)^2 + q^2
A = 0 : p^2 + (q - 1)^2.

As we have already seen in (1), the possible values of (p, q) are in the unit square. In Figure 1.2 line segments are drawn to describe the possible losses. The squared distance between two consecutive vertices represents the losses. It is clear from the figure that the losses are reduced by making p + q = 1.
Fig. 1.2 The possible losses are given by BD^2 when A = 1 and CD^2 when A = 0, which are minimized if D is projected on E over the line p + q = 1.

3. Pr(A ∩ F) = Pr(A | F) Pr(F). Define Pr(A | F) as the probability of A if F = 1. Denoting this probability by p, Pr(F) by q and Pr(A ∩ F) by r, the total loss is given by (p - A)^2 F + (q - F)^2 + (r - AF)^2. Its possible values are:

A = F = 1 : (p - 1)^2 + (q - 1)^2 + (r - 1)^2
A = 0, F = 1 : p^2 + (q - 1)^2 + r^2
F = 0 : q^2 + r^2.

Note that (p, q, r) assume values in the unit cube. The same arguments used in (2) can be developed in the cube. Minimization of the three losses is attained when p = r/q.
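De Finetti's minimization can also be checked numerically. The sketch below takes an arbitrary degree of belief Pr(A) = 0.7 (a made-up value, purely for illustration) and confirms that the expected squared error loss is smallest exactly at p = Pr(A):

```python
# Expected squared error loss for stating probability p when the event A
# occurs with probability pr_A: E[(p - A)^2] = pr_A*(p - 1)^2 + (1 - pr_A)*p^2.
def expected_loss(p, pr_A):
    return pr_A * (p - 1) ** 2 + (1 - pr_A) * p ** 2

pr_A = 0.7                                # hypothetical degree of belief
grid = [i / 1000 for i in range(1001)]    # candidate statements p in [0, 1]
best_p = min(grid, key=lambda p: expected_loss(p, pr_A))
print(best_p)  # → 0.7
```

This is the sense in which the squared error loss is a proper scoring rule: honestly reporting one's degree of belief is the unique optimal statement.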
1.3 An example
In this section a simple example will be presented with the main intention of anticipating many of the general questions to be discussed later on in the book. The problem to be described is extremely simple but is useful to illustrate some relevant ideas involved in statistical reasoning. Only very basic concepts of probability are needed for the reader to follow the classical and Bayesian inferences we will present. The interested reader is recommended to read the excellent paper by Lindley and Phillips (1976) for further discussion.
On his way to university one morning, one of the authors of this book (say, Helio) was stopped by a lady living in the neighbouring Maré slum. She was pregnant and anxious to know the chance of her seventh baby being male. Initially, his reaction was to answer that the chance is 1/2 and to continue on his way to work. But the lady was so disappointed with the response that Helio decided to proceed as a professional statistician would. He asked her some questions about her private life and she told him that her big family was composed of five boys (M) and one girl (F) and that the sequence in which the babies were born was MMMMMF. The lady was also emphatic in saying that all her pregnancies were a consequence of her long relationship with the same husband. In fact, her disappointment with the naive answer was now understandable. Our problem is to calculate the chance of the seventh baby being male, taking into consideration the specific experience of this young lady. How can we solve this problem?

Assuming that the order of the Ms and Fs in the outcome is not relevant to our analysis, it is enough, or sufficient, to take note that she had exactly five baby boys (5 Ms) and one baby girl (1 F). The question about the order of the births is
brought up because people usually want to know if there is any sort of abnormality in the sequence. However, it seems reasonable to assume the births to be equally distributed in probabilistic terms.

Before proceeding with an analysis it is useful to define some quantities. Let us denote by X_i the indicator variable of a boy for the ith child, i = 1, ..., 7, and let θ denote the common unknown probability of a boy, i.e., Pr(X_i = 1 | θ) = θ, with 0 ≤ θ ≤ 1. Note that θ is a fixed but unknown quantity that does not exist in reality but becomes a useful summary of the situation under study. A frequentist statistician would probably impose independence between the X_i's. In doing that, the only existing link between the X_i's is provided by the value of θ. It seems reasonable at this stage to provide to the lady the value that one considers the most reasonable representation of θ. Given that he is only allowed to
use the observed data in his analysis, this representative value of θ must be derived from the observations X_i, i = 1, ..., 6.
There are many possible ways to do that. Let us start with the probabilistic description of the data. Given the assumptions above, it is not difficult to obtain

Pr(X_1 = 1, X_2 = 1, ..., X_5 = 1, X_6 = 0 | θ) = θ^5 (1 - θ).

One can proceed on this choice of value for θ by finding the single value θ̂ that maximizes the above probability for the actual, observed data. It is a simple exercise to verify that, in the above case, this value is given by θ̂ = 5/6 = 0.83. We would then say that she has an 83% chance of giving birth to a boy.

There are other ways to proceed, still based only on the observed data, but more assumptions are now needed. One possible assumption is that the lady had previously decided that she only wanted to have six children, the last one being an unwanted pregnancy. In this case, the observed data can be summarized into the number Y of Ms among the six births. It is clear that Y has a binomial distribution with size six and success probability θ, denoted by Y ~ bin(6, θ), and that E(Y/6 | θ) = θ. Using frequentist arguments, one can reason that when we are able to observe many boys in a similar situation one would like to be correct on average. Therefore, one would estimate the value of θ as Y/6, the relative frequency of boys in the data. Given that the observed value of Y is 5, a reasonable estimate for θ is θ̃ = 5/6, coinciding with θ̂. There is no guarantee that the two approaches coincide in general, but it is reassuring that they did in this case.
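That θ̂ = 5/6 maximizes θ^5 (1 - θ) can be checked numerically. The sketch below uses a plain grid search in place of the calculus exercise mentioned in the text:

```python
# Likelihood of the observed sequence MMMMMF as a function of theta.
def likelihood(theta):
    return theta ** 5 * (1 - theta)

# A dense grid search over [0, 1] stands in for differentiation.
grid = [i / 10000 for i in range(10001)]
theta_hat = max(grid, key=likelihood)
print(theta_hat)  # close to 5/6 = 0.8333...
```

Setting the derivative 5θ^4 (1 - θ) - θ^5 to zero gives the same answer, θ = 5/6, exactly.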
When asking the lady if the assumption to stop at the sixth child was true, she said that her decision had to do with having had her first baby girl. In this case, the observed data should have been summarized by the number Z of Ms she had until the first girl, and not by Y. (Even though the observed values of Z and Y are the same, their probability distributions are not.) It is not difficult to see that Z has a negative binomial distribution with size 1 and success probability 1 - θ, denoted Z ~ NB(1, θ), and that E(Z | θ) = θ/(1 - θ). Proceeding on Z with the reasoning used for Y leads to the estimation of θ by 5/6, as in the previous cases.
The main message from this part of the example, for choosing a representative value for θ, is that there are many possible methods, two of which were applied above; while the first one did not depend on the way the data was observed, the second one did. These issues are readdressed at greater length in Chapters 2 and 4.

Another route that can be taken is to decide whether it is reasonable or not to discard 1/2 as a possible value for θ. This can be done by evaluating how extreme (in discordance with the assumption θ = 1/2) the observed value is. To see that, one can evaluate the probabilities that Y ≥ 5 and Z ≥ 5, depending on which stopping rule was used by the lady. The values of these probabilities are respectively given by 0.109 and 0.031. It is generally assumed that the cutoff point for measuring extremeness in the data is to have probabilities smaller than 0.05. It is interesting that in this case the stopping rule has a strong effect on the decision to discard the equal probabilities assumption. This will be readdressed in a more general form in Chapter 6.
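The two tail probabilities quoted above can be reproduced with elementary sums using only the standard library; this sketch of the calculation is ours, not part of the original text:

```python
from math import comb

theta = 0.5  # the value under scrutiny

# P(Y >= 5) for Y ~ bin(6, theta): at least five boys among six births.
p_binomial = sum(comb(6, y) * theta**y * (1 - theta)**(6 - y) for y in (5, 6))

# P(Z >= 5) for the negative binomial stopping rule: at least five boys
# before the first girl, i.e. the first five births are all boys.
p_negbin = theta ** 5

print(p_binomial, p_negbin)  # 0.109375 0.03125
```

Rounded, these are the 0.109 and 0.031 of the text: the binomial tail exceeds the conventional 0.05 cutoff while the negative binomial tail does not, which is exactly the stopping-rule sensitivity being discussed.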
Intuition, however, leads to the belief that specification of the stopping rule is not relevant to solving our problem. This point can be more formally expressed in the following way: the unique relevant evidence is that in six births, the sex of the babies was honestly annotated as 1 F and 5 Ms. Furthermore, these outcomes occurred in the order specified before. This statement points to the conclusion that only the results that have effectively been observed are relevant for our analysis.

For a Bayesian statistician, the elements for the analysis are only the sequence of the observed results and a probability distribution describing the initial information about the chance of a baby being male. The experimental conditions were carefully described before and they guarantee that a result observed in any given birth is equivalent to that obtained in any other birth. The same is true for pairs, triplets, etc. of births. This idea is formalized by the concept of exchangeability. The sequence of births is exchangeable if the order of the sex outcomes in the births is irrelevant.
In the next chapter, we will define precisely the concept of exchangeability. For our present example this means that the probability of any sequence of r Ms and s Fs (subject to r + s = n) is the same as that of any other sequence with the same number of Ms and Fs.
Let us return to our original problem, that is, to calculate the chance of the seventh baby born being male based on the information gathered, namely that provided by the previous births. This probability is denoted by Pr[X₇ = 1 | (5, 1)], where the pair (5, 1) denotes the number of births of each sex previously observed. Using basic notions of probability calculus we can obtain

$$
\begin{aligned}
\Pr[X_7 = 1 \mid (5,1)] &= \int_0^1 p(X_7 = 1, \theta \mid (5,1))\, d\theta \\
&= \int_0^1 \Pr[X_7 = 1 \mid \theta, (5,1)]\, p(\theta \mid (5,1))\, d\theta \\
&= \int_0^1 \theta\, p(\theta \mid (5,1))\, d\theta \\
&= E[\theta \mid (5,1)],
\end{aligned}
$$

where the expected value is with respect to the distribution of θ given the past results. As we will see in the next chapter, this is the unique possible representation for our problem if the assumption of exchangeability of the sequences of births is acceptable. One of the elements involved in the above calculation is p(θ | (5, 1)), which has not yet been defined. It has the interpretation, under the subjective approach, of the probability distribution of the possible values of θ after observing the data (5, 1).
Let us suppose that before observing the values of (5, 1), the subjective probability specification for θ can be represented by the density

$$p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}, \quad 0 \le \theta \le 1 \quad (a, b > 0),$$

which is a beta distribution with parameters a and b (see the list of distributions at the end of the book).
Note that

$$
p(\theta \mid (5,1)) = \frac{p((5,1), \theta)}{p((5,1))} = \frac{p((5,1) \mid \theta)\, p(\theta)}{p((5,1))} \propto \theta^5 (1-\theta)\, \theta^{a-1}(1-\theta)^{b-1} = \theta^{5+a-1}(1-\theta)^{1+b-1},
$$

since p((5, 1)) does not depend on θ, and the stopping rule (or the sample space, in classical language) is irrelevant because it gets incorporated into the proportionality constant. Furthermore, for any experimental result the final distribution will be a beta. So we can complete the calculations to obtain
$$\Pr[X_7 = 1 \mid (5,1)] = E[\theta \mid (5,1)] = \frac{a+5}{a+b+6}.$$
We still have a problem to be solved. What are the values of a and b? Suppose, in the case of births of babies, that our initial opinion about the chances associated with M and F is symmetric and concentrated around 0.5. This means that θ is symmetrically distributed with mean 0.5 and with high probability around the mean. We can choose in the family of beta distributions the one with a = b = 2, for instance. With this specification, E(θ) = 0.5, Pr(0.4 < θ < 0.6) = 0.3 and the probability of the seventh birth being a boy is 7/10 = 0.70. If we had been thinking of an experiment with honest coins instead of births of babies, we could have chosen a beta(50, 50), which is symmetrically distributed but much more concentrated around 0.5 than the beta(2, 2). This is a clear representation of the fact that we know much more about coins than about the sex of a new birth, even before observing the data. Under this specification of the beta, Pr(0.4 < θ < 0.6) = 0.8 and the chance of the 7th outcome being a head would be 55/106 = 0.52. Evidently this result is closer to 0.5, which seems reasonable since the sex of babies and coins are quite different things. In Chapter 3 we will come back to the discussion of how to specify and determine this prior distribution in any given statistical problem.
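The predictive calculation above can be sketched in a few lines of Python (the function name `predictive_male` is ours, not the book's):

```python
def predictive_male(a, b, males=5, females=1):
    """Pr(next birth is male | data) = E[theta | data] under a beta(a, b) prior."""
    return (a + males) / (a + b + males + females)

# Vague symmetric prior beta(2, 2): the data dominate the answer.
print(predictive_male(2, 2))              # 0.7
# Concentrated prior beta(50, 50), as for an honest coin.
print(round(predictive_male(50, 50), 2))  # 0.52
```

The more concentrated the prior around 0.5, the less the six observed births move the predictive probability away from 0.5.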
1.4
Basic results in linear algebra and probability
This is a book concerned with the study of statistics. In order to do that, a few simple results from linear algebra and probability theory will be extensively used. We thought it might be useful to have the main ones cited here to provide the reader with the basic results he will be using throughout the book. We will start with the basic results concerning densities and probability functions of collections of random variables and will then define the multivariate normal distribution to motivate the importance of linear algebra results about matrices and their connection with distributions.
1.4.1
Probability theory
Let X = (X₁, ..., X_p), Y = (Y₁, ..., Y_q) and Z = (Z₁, ..., Z_r) be three random vectors defined over sample spaces 𝒳, 𝒴 and 𝒵, respectively (p, q, r ≥ 1). Assume for simplicity that they are all continuous with joint probability density function p(x, y, z). The marginal and conditional densities will be denoted by their relevant arguments. So, for example, p(x) denotes the marginal density of X and p(z | x, y) denotes the conditional density of Z | X, Y. Then, the following equations hold:
$$
\begin{aligned}
p(x) &= \int p(x, y)\, dy \\
p(x \mid y) &= \frac{p(x, y)}{p(y)} \\
p(x \mid z) &= \int p(x, y \mid z)\, dy \\
p(x \mid y, z) &= \frac{p(x, y \mid z)}{p(y \mid z)}.
\end{aligned}
$$
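The marginalization and conditioning identities can be illustrated with a discrete toy example, where the integrals become sums; a sketch with numpy (the joint table is made up purely for illustration):

```python
import numpy as np

# A toy joint probability table p(x, y) over x in {0, 1, 2}, y in {0, 1}.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])

p_x = p_xy.sum(axis=1)       # p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)       # p(y) = sum_x p(x, y)
p_x_given_y = p_xy / p_y     # p(x | y) = p(x, y) / p(y)

# Each column of p(x | y) is a proper distribution over x (sums to 1).
print(p_x_given_y.sum(axis=0))
```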
There are many possible combinations of these results but most can be derived easily from one of the above results. In fact, all results below are valid under more general settings with some components of the vectors being discrete and others continuous. The only change in the notation is the replacement of integrals by summations over the relevant parameter space. These relations define a structure of dependence between random variables. An interesting concept is conditional independence. X and Y are said to be conditionally independent given Z if p(x, y | z) = p(x | z) p(y | z). Conditional dependence structures can be graphically displayed with the use of influence diagrams. Figure 1.3 shows some possible representations involving three random variables. These diagrams can be very useful in establishing probability structures for real problems that tend to have many more than just three variables. Important elements to describe a distribution are its mean vector and variance-covariance matrix. The mean vector μ of a collection of random variables X = (X₁, ..., X_p) has components μᵢ = E(Xᵢ), i = 1, ..., p. The variance-covariance matrix Σ of X has elements σᵢⱼ = Cov(Xᵢ, Xⱼ). It is clearly a symmetric matrix since Cov(Xᵢ, Xⱼ) = Cov(Xⱼ, Xᵢ), for every possible pair (i, j). These moments are defined for any distribution, although they may fail to exist for some distributions.
Therefore, one can evaluate the conditional mean of X₃ | X₁, X₂, X₅ by calculating the mean of X₃ under the conditional distribution of X₃ | X₁, X₂, X₅. Likewise, one can evaluate the joint marginal variance-covariance matrix of X₁, X₂ and X₅ by evaluating the variance-covariance matrix of the joint marginal distribution of (X₁, X₂, X₅).

Fig. 1.3 Possible influence diagrams for three random variables: (a) X and Y are conditionally independent given Z; (b) (X, Y) is independent of Z.
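The mean vector and variance-covariance matrix can be illustrated by estimating them from simulated draws; a sketch with numpy (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw n observations of a 3-dimensional random vector X.
x = rng.multivariate_normal(mean=[0., 1., 2.],
                            cov=[[2.0, 0.5, 0.0],
                                 [0.5, 1.0, 0.3],
                                 [0.0, 0.3, 1.0]],
                            size=50_000)

mu_hat = x.mean(axis=0)              # estimate of the mean vector mu
sigma_hat = np.cov(x, rowvar=False)  # estimate of the covariance matrix Sigma

# Sigma is symmetric: Cov(X_i, X_j) = Cov(X_j, X_i).
print(np.allclose(sigma_hat, sigma_hat.T))  # True
```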
Another useful result concerns transformation of random vectors. Let X = (X₁, ..., X_p) and Y = (Y₁, ..., Y_p) be p-dimensional random vectors defined over continuous spaces 𝒳 and 𝒴, respectively. Assume further that X and Y are uniquely related by the one-to-one transformation Y = g(X) with inverse function h, and that these functions are at least differentiable. As a consequence, Y = g(X) and X = h(Y). Then the densities p_X(x) and p_Y(y) are related via

$$p_Y(y) = p_X(h(y)) \left| \frac{\partial h(y)}{\partial y} \right|, \quad y \in \mathcal{Y},$$

where |∂h(y)/∂y| is the absolute value of the Jacobian of h, the determinant of the matrix of derivatives of h with respect to y. This matrix has element (i, j) given by ∂hᵢ(y)/∂yⱼ, ∀(i, j). In the case of scalar X and Y, the relation has exactly the same form, with the Jacobian reducing to the absolute value of the ordinary derivative of h.
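The change-of-variables formula can be checked on a familiar case; a sketch, assuming X ~ N(0, 1) and the one-to-one transformation Y = exp(X), whose inverse is h(y) = log y with |dh/dy| = 1/y:

```python
from math import exp, log, pi, sqrt

def norm_pdf(x, mu=0.0, sd=1.0):
    """Density of the N(mu, sd^2) distribution."""
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

def lognorm_pdf(y):
    """p_Y(y) = p_X(h(y)) |dh/dy| with h(y) = log(y): the log-normal density."""
    return norm_pdf(log(y)) * (1.0 / y)

# At y = 1 the log-normal density equals the standard normal density at 0.
print(round(lognorm_pdf(1.0), 4))  # 0.3989
```

The result is the usual log-normal density, obtained here purely from the Jacobian rule.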
An important distribution where some of these results can be used is the multivariate normal distribution with mean vector μ and variance-covariance matrix Σ, denoted by N(μ, Σ), with density

$$(2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right\}, \quad x \in R^p,$$

where |A| denotes the determinant of A. The scalar version (p = 1) of this density will simply be referred to as a normal distribution, with density

$$(2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}, \quad x \in R.$$
When μ = 0 and σ² = 1, the distribution is referred to as standard normal. It can be shown that the quadratic form in the exponent of the density has a χ² distribution. Also, the normal distribution is preserved under linear transformations. The multivariate normal distribution is completely characterized by its parameters μ and especially Σ. If the components are uncorrelated, with σᵢⱼ = 0 for all i ≠ j, then they are also independent. In any case, the correct understanding of the variance-covariance structure of a distribution is vital, for example, to obtain some properties such as marginal and conditional distributions (see exercises). In order to do that, a few basic results about matrices will be reviewed below. There are many other distributions that will be considered in some detail in the book. They will be introduced in the text as they become needed.
1.4.2
Linear algebra
Let A be a real matrix of order r × p, with r, p ≥ 1. Denote the matrix element in row i and column j by aᵢⱼ, i = 1, ..., r and j = 1, ..., p. If p = r, the matrix is said to be square of order p. Such a matrix is said to be symmetric if aᵢⱼ = aⱼᵢ for every possible pair (i, j). In this case, the transpose of A, denoted by A', satisfies A' = A.

Let Y = (Y₁, ..., Y_r)' be a random vector defined by the linear transformation Y = c + CX of another random vector X = (X₁, ..., X_p)', where c is an r-dimensional vector and C an r × p matrix of constants. Then the expectation and variance-covariance matrix of Y are respectively given by E(Y) = c + C E(X) and V(Y) = C Σ C'. As an example, let Y = 1'_p X = Σᵢ Xᵢ, where 1_p is the p-dimensional vector of 1's. Then Y will have mean 1'_p E(X) = Σᵢ E(Xᵢ) and variance 1'_p V(X) 1_p = Σ_{i,j} Cov(Xᵢ, Xⱼ).
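These moment results for linear transformations are easy to verify numerically; a sketch with numpy, with μ, Σ, c and C chosen arbitrarily:

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])        # E(X)
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])   # V(X)

c = np.array([1.0, -1.0])
C = np.array([[1.0, 1.0, 1.0],        # first row is 1_p': Y_1 = sum_i X_i
              [1.0, 0.0, -1.0]])

mean_Y = c + C @ mu       # E(Y) = c + C E(X)
var_Y = C @ Sigma @ C.T   # V(Y) = C Sigma C'

# The (1,1) entry of V(Y) is the variance of the total sum_i X_i,
# which equals the sum of all entries of Sigma.
print(np.isclose(var_Y[0, 0], Sigma.sum()))  # True
```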
A matrix A is said to be positive (non-negative) definite if b'Ab > (≥) 0 for every non-null vector b = (b₁, ..., b_p)'. Similarly, negative (non-positive) definite matrices can be defined by replacement of the > (≥) sign by the < (≤) sign. A trivial example of a positive definite matrix is the p-dimensional identity matrix, denoted by I_p, with diagonal elements equal to 1 and off-diagonal elements equal to 0. It is very easy to check that b'I_p b = b₁² + ··· + b_p², which must be larger than zero because b is non-null.
Variance-covariance matrices are always non-negative definite and usually are positive definite matrices. To see that, assume that Σ is the variance-covariance matrix of X = (X₁, ..., X_p). Then, a random variable Z = b'X can be formed. This variable will have variance V(Z) = b'Σb, which is necessarily non-negative. Varying b over all possible values of b in R^p (excluding the origin) proves the result.

Positive definiteness defines an ordering over matrices. Using the notation
A > (≥) 0 for a positive (non-negative) definite matrix A allows one to denote by A > B the fact that the matrix A − B > (≥) 0. This ordering makes sense in the context of probability. Let A and B be the variance-covariance matrices of independent p-dimensional random vectors X and Y. Then A > B implies that V(b'X) − V(b'Y) = b'Ab − b'Bb = b'(A − B)b > 0, for every non-null vector b. So, there is a sense in which matrices can be compared in magnitude and one can say that A is larger than B.
Symmetric positive-definite matrices are said to be non-singular, because they have a non-null determinant. This implies that they have full rank and all their rows are linearly independent. Likewise, singular matrices have null determinant, which means that they do not have full rank and some of their rows can be represented as linear combinations of the other rows. Non-singular matrices also have a well-defined inverse matrix.
A square matrix A of order p with p-dimensional rows aᵢ = (aᵢ₁, ..., aᵢ_p)', for i = 1, ..., p, is said to be orthogonal if a'ᵢaⱼ = 1, if i = j, and 0, if i ≠ j. Note that if A is orthogonal then A'A = AA' = I_p. Therefore, orthogonal matrices can be shown to be full rank with inverse A⁻¹ = A'. There are many methods available in linear algebra for iteratively constructing an orthogonal matrix from given starting rows.

In most statistical applications, matrices are non-negative definite and symmetric. The square root matrix of A, denoted by A^{1/2}, can then be defined and satisfies A^{1/2}A^{1/2} = A. One of the most common methods of finding the square root matrix is called the Choleski decomposition.
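With numpy, the Choleski factor is available directly; note that it is a triangular square root with LL' = A, while a symmetric square root A^{1/2} with A^{1/2}A^{1/2} = A can be built from the eigendecomposition (a sketch, with an illustrative matrix):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])       # symmetric positive definite

# Choleski factor: a lower-triangular L with L L' = A.
L = np.linalg.cholesky(A)
print(np.allclose(L @ L.T, A))   # True

# Symmetric square root from the eigendecomposition A = V diag(w) V'.
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(w)) @ V.T
print(np.allclose(A_half @ A_half, A))  # True
```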
1.5
Notation
Before effectively beginning the study of statistical inference it will be helpful to make some general comments. Firstly, it is worth noting that here we will deal with regular models only, that is, the random quantities involved in our models are of the discrete or continuous type. A unifying notation will be used and the distinction will be clear from the context. Then, if X is a random vector with distribution function denoted by F(x), its probability (or density) function will be denoted by p(x) if X is discrete (or continuous) and we will assume that

$$\int dF(x) = \int p(x)\, dx,$$

independently of X being continuous or discrete, with the integral symbol representing a sum in the discrete case. In addition, as the probability (or density) function is defined from the distribution of X, we will use the notation X ~ p, meaning that X has distribution p or, being more precise, a distribution whose probability (or density) function is p. Similarly, X →ᴰ p will be used to denote that X converges in distribution to a random variable with density p.
In general, the observables are denoted by the capital letters of the alphabet (X, Y, ...), as usual, and their observed values by lowercase letters (x, y, ...). Known quantities are denoted by the first letters of the alphabet (A, B, ...), and the Greek letters (θ, λ, ...) are used to describe unobservable quantities. Matrices, observed or not, will be denoted by capitals. Additionally, vectors and matrices will be distinguished from scalars by denoting the first ones in bold face. Results will generally be presented for the vector case and whenever the specialization to the scalar case is not immediate, they will be presented again in a univariate version. The distribution of X is denoted by P(X). The expected value of X|Y is denoted by E(X|Y), E_{X|Y}(X) or even E_{X|Y}(X|Y), and the variance of X|Y is denoted by V(X|Y), V_{X|Y}(X) or even V_{X|Y}(X|Y). The indicator function, denoted by I_X(A), assumes the values

$$I_X(A) = \begin{cases} 1, & \text{if } X \in A \\ 0, & \text{otherwise.} \end{cases}$$

1.6

Outline of the book
The purpose of this book is to present an integrated approach to statistical inference at an intermediate level by discussion and comparison of the most important results of the two main schools of statistical thought: frequentist and Bayesian. With that in mind, the results are presented for multiparameter models. Whenever needed, the special case of a single parameter is presented and derivations are sometimes made at this level to help understanding. Also, most of the examples are presented at this level. It is hoped that they will provide motivation for the usefulness of the results in the more general setting. Presentation of results for the two main schools of thought is made in parallel as much as possible. Estimation and prediction are introduced with the Bayesian approach followed by the classical approach. The presentation of hypothesis testing and linear models goes the other way round. All chapters, including this introduction, contain a set of exercises at the end.
These are included to help the student practice his/her knowledge of the material covered in the book. The exercises are divided according to the section of the chapter they refer to, even though many exercises contain a few items which cover a few different sections. At the end of the book, a selection of exercises from all chapters have their solutions presented. We tried to spread these exercises evenly across the material contained in each chapter. The material of the book will be briefly presented below. The book is composed of eight chapters that can broadly be divided into three parts. The first part contains the first three chapters, introducing basic concepts needed for statistics. The second part is composed of Chapters 4, 5 and 6, which discuss in an integrated way the standard topics of estimation and hypothesis testing. The final two chapters deal with other important topics of inference: prediction and linear models.

Chapter 1 consisted of an introduction with the aim of providing the flavour of, and intuition for, the task ahead. In doing so, it anticipated at an elementary level many of the points to be addressed later in the book.
Chapter 2 presents the main ingredients used in statistical inference. The key concepts of likelihood, sufficiency, posterior distribution, exponential family, Fisher information and exchangeability are introduced here. The issue of parameter elimination, leading to marginal, conditional and profile likelihoods, is presented here too.

A key element of statistical inference for the Bayesian approach is the use of prior distributions. These are separately presented and discussed in Chapter 3. Starting from an entirely subjective specification, we move on to functional form specifications where conjugate priors play a very important role. Then, non-informative priors are presented and illustrated. Finally, the structuring of a prior distribution in stages with the so-called hierarchical form is presented.

Chapter 4 deals with parameter estimation where, intentionally, point and interval estimation are presented as different responses to the same summarization question, and not as two unrelated procedures. Different methods of estimation (maximum likelihood, method of moments, least squares) are presented. The classical results in estimation are shown to numerically coincide with Bayesian results obtained using vague prior distributions in many of the problems considered for the normal observational model.

Chapter 5 deals with approximate and computationally intensive methods of inference. These results are useful when an explicit analytic treatment is not available. Maximization techniques including Newton-Raphson, Fisher scoring and EM algorithms are presented. Asymptotic theory is presented and includes the delta method and Laplace approximations. Quadrature integration rules are also presented here. Finally, methods based on simulation are presented. They include the bootstrap and its Bayesian versions, Monte Carlo integration and MCMC methods.

Chapter 6 is about hypothesis testing problems under the frequentist approach and also under the various forms of the Bayesian paradigm. Various test procedures are presented and illustrated for the models with normal observations. Tests based on the asymptotic results of the previous chapter are also presented.

Chapter 7 deals with prediction of unknown quantities to be observed. The prediction analysis is covered from the classical and Bayesian points of view. Linear models are briefly introduced here and provide an interesting example of prediction. This chapter also includes linear Bayes methods by relating them to prediction in linear models.

Chapter 8 deals with linear models. Initially, the frequentist inference for linear models is presented, followed by the Bayesian one. Generalizations based on the Bayesian approach are presented, leading to hierarchical and dynamic models. Also, a brief introduction to generalized linear models is presented.
Exercises

§1.2

1. Consider the equation P(A ∩ F) = P(A | F)P(F) in the light of the de Finetti loss function setup, with the three losses associated with the events A = 0, F = 1; A = 1, F = 1; and F = 0, and respective probabilities p, q and r. Show that the losses are all minimized when p = r/q.
§1.3

2. Consider the example of the pregnant lady.
(a) Show that proceeding on Z with the reasoning used for Y leads to the estimation of θ by 5/6.
(b) Evaluate the probabilities that Y ≥ 5 and Z ≥ 5 and show that the values of these probabilities are respectively given by 0.109 and 0.031.

3. Consider again the example of the pregnant lady. Repeat the evaluation of P(X_{r+s+1} = 1 | (r, s)) assuming now that
(a) the observed values of (r, s) are (15, 3), using beta priors with parameters (a, b) = (1, 1) and (a, b) = (5, 5). Compare the results obtained.
(b) it is known that her 7th pregnancy will produce twins and the observed value of (r, s) is (5, 1).
§1.4

4. Let X | Y ~ bin(Y, π) and let Y ~ Pois(λ).
(a) Show that E(X) = E[E(X | Y)] = λπ and that V(X) = E[V(X | Y)] + V[E(X | Y)] = λπ.
(b) Show that X ~ Pois(λπ) and that Y − X | X ~ Pois(λ(1 − π)).

5. Let X | Y ~ N(0, Y⁻¹) and Y ~ G(a/2, b/2). Obtain the marginal distribution of X and the conditional distribution of Y | X.

6. Show that normality is preserved under linear transformations, i.e., if X ~ N(μ, Σ) and Y = c + CX then Y ~ N(c + Cμ, CΣC'). Apply the result to obtain the marginal distribution of any subvector of X.
7. Show that if (X', Y')' ~ N(μ, Σ), with μ and Σ partitioned accordingly, then X | Y ~ N(μ_{X|Y}, Σ_{X|Y}), where μ_{X|Y} = μ_X + Σ_{XY}Σ_Y⁻¹(Y − μ_Y) and Σ_{X|Y} = Σ_X − Σ_{XY}Σ_Y⁻¹Σ'_{XY}.

8. Show that if X₁, ..., X_p are independent standard normal variables then Σᵢ₌₁ᵖ Xᵢ² ~ χ²_p.
9. Show that if X = (X₁, ..., X_p)' ~ N(μ, Σ) and Y = (X − μ)'Σ⁻¹(X − μ) then Y ~ χ²_p. Hint: define Z = A(X − μ), where the matrix A satisfies A'A = Σ⁻¹, and use the result from the previous exercise.

10. Let A and B be non-singular symmetric matrices of orders p and q, and C a p × q matrix. Show that

(a) (A + CBC')⁻¹ = A⁻¹ − A⁻¹C(C'A⁻¹C + B⁻¹)⁻¹C'A⁻¹;

(b) $$\begin{pmatrix} A & C \\ C' & B \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + E D^{-1} E' & -E D^{-1} \\ -D^{-1} E' & D^{-1} \end{pmatrix},$$

where D = B − C'A⁻¹C and E = A⁻¹C.
2 Elements of inference

In this chapter, the basic concepts needed for the study of Bayesian and classical statistics will be described. In the first section, the most commonly used statistical models are presented. They will provide the basis for the presentation of most of the material of this book. Section 2.2 introduces the fundamental concept of the likelihood function. A theoretically sound and operationally useful definition of measures of information is also given in this section. The Bayesian point of view is introduced in Section 2.3, where Bayes' theorem acts as the basic rule in the inferential procedure. The next section deals with the concept of exchangeability. This is a strong and useful concept, as will be seen in the following chapter. Other basic concepts, like sufficiency and the exponential family, are presented in Section 2.5. Finally, in Section 2.6, the multiparameter case is presented and the main concepts are revised and extended from both the Bayesian and the classical points of view. Particular attention is given to the problem of parameter elimination in order to make inference with respect to the remaining parameters.
2.1

Common statistical models

Although the nature of statistical applications is only limited by our ability to formulate probabilistic models, there are a few models that are more frequently used in statistics. There are a number of reasons for this: first, because they are more commonly found in applications; second, because they are the simplest models that can be entertained in non-trivial applications; finally, because they provide a useful starting point in the process of building up models. The first class of models considered is a random sample from a given distribution. They are followed by the location model, the scale model and the location-scale model. Excellent complementary reading for this topic is the book of Bickel and Doksum (1977). The most basic situation is the observation of a homogeneous population described by a distribution F_θ, depending on the unknown quantity θ. Knowledge of the value of θ is vital for the understanding and description of this population and we would need to extract information from it to accomplish this task. Typically in this case, a random sample X₁, ..., X_n is drawn from this population and we hope to
build strategies to ascertain the value of θ from the values of the sample. In this case, the observations X₁, ..., X_n are independent and identically distributed (iid, in short) with common distribution F_θ. Assuming that F_θ has density or probability function f, they are probabilistically described through

$$p(X_1, \ldots, X_n \mid \theta) = \prod_{i=1}^n f(X_i \mid \theta).$$
Example. Consider a series of measurements made on an unknown quantity θ. Unfortunately, measurements are made with imprecise devices, which means that there are errors that should be taken into account. These errors are the result of many (small) contributions and are more effectively described in terms of a probability distribution. This leads to the construction of a model in the form Xᵢ = θ + eᵢ, i = 1, ..., n. The eᵢ's represent the measurement errors involved. If the experiment is performed with care, with measurements being collected independently and using the same procedures, the eᵢ's will form a random sample from the distribution of errors F_e. For the same reason, the Xᵢ's will also be iid with joint density

$$p(X_1, \ldots, X_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) = \prod_{i=1}^n f_e(x_i - \theta),$$

where f_e is the density of the error distribution.
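The measurement-error likelihood can be sketched as follows, assuming normal errors so that f_e is the N(0, σ²) density (names and values are illustrative):

```python
import numpy as np

def log_likelihood(theta, x, sigma=1.0):
    """Log of prod_i f_e(x_i - theta) for normal errors e_i ~ N(0, sigma^2)."""
    e = x - theta  # residuals x_i - theta
    return np.sum(-0.5 * (e / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(42)
theta_true = 3.0
x = theta_true + rng.standard_normal(100)   # X_i = theta + e_i

# The log likelihood is maximised near the sample mean (the MLE here).
grid = np.linspace(2.0, 4.0, 401)
theta_hat = grid[np.argmax([log_likelihood(t, x) for t in grid])]
print(abs(theta_hat - x.mean()) < 0.01)     # True
```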
Definition (Location model). X has a location model if a function f and a quantity θ exist such that the distribution of X given θ satisfies p(x | θ) = f(x − θ). In this case, θ is called a location parameter.

Examples.
1. Normal with known variance. In this case the density is p(x | θ) = (2πσ²)^{−0.5} exp{−0.5σ⁻²(x − θ)²}, which is a function of x − θ.
2. Cauchy with known scale parameter. The Cauchy distribution is the Student t distribution with 1 degree of freedom. In this case, the density is p(x | θ) = {πσ[1 + (x − θ)²σ⁻²]}⁻¹, which is a function of x − θ, too.
3. Multivariate normal with known variance-covariance matrix. In this case the density is p(x | θ) = (2π)^{−p/2}|Σ|^{−1/2} exp{−(x − θ)'Σ⁻¹(x − θ)/2}, which is a function of x − θ. Note that an iid sample from the N(θ, σ²) distribution is a special case with θ = θ1 and Σ = σ²I.
Definition (Scale model). X has a scale model if a function f and a quantity σ exist such that the distribution of X given σ satisfies

$$p(x \mid \sigma) = \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right).$$

In this case, σ is called a scale parameter.

Examples.
1. Exponential with parameter θ. The density p(x | θ) = θ exp(−θx) is in the form σ⁻¹f(x/σ) with σ = θ⁻¹ and f(u) = e⁻ᵘ.
2. Normal with known mean θ. The density p(x | σ²) = (2π)^{−1/2}σ⁻¹ exp{−[(x − θ)/σ]²/2} is in the form σ⁻¹f(x/σ).

Definition (Location and scale model). X has a location and scale model if there are a function f and quantities θ and σ such that the distribution of X given (θ, σ) satisfies

$$p(x \mid \theta, \sigma) = \frac{1}{\sigma} f\left(\frac{x - \theta}{\sigma}\right).$$

In this case, θ is called the location parameter and σ the scale parameter.

Some examples in the location-scale family are the normal and the Cauchy distributions. Once again, the location part of the model can also be multivariate.
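The scale form of the exponential density is a one-line check; a sketch, with σ = 1/θ and f(u) = e^{−u} (the value of θ is arbitrary):

```python
from math import exp

def exp_density(x, theta):
    """Exponential density p(x | theta) = theta * exp(-theta * x)."""
    return theta * exp(-theta * x)

def scale_form(x, sigma):
    """The same density written as sigma^(-1) f(x / sigma) with f(u) = e^(-u)."""
    return (1.0 / sigma) * exp(-x / sigma)

theta = 2.5
print(all(abs(exp_density(x, theta) - scale_form(x, 1 / theta)) < 1e-12
          for x in (0.1, 0.5, 1.0, 3.0)))   # True
```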
2.2
Likelihood-based functions
Most statistical work is based on functions that are constructed from the probabilistic description of the observations. In this section, these functions are defined and their relevance to statistical inference is briefly introduced. These functions will be heavily used in later chapters, where their importance will be fully appreciated. We start with the likelihood function and then present Fisher measures of information and the score function.
2.2.1

Likelihood function

The likelihood function of θ is the function that associates the value p(x | θ) to each θ. This function will be denoted by l(θ; x). Other common notations are l_x(θ), l(θ | x) and l(θ). It is defined as follows:

l(·; x): Θ → R⁺, with θ → l(θ; x) = p(x | θ).

The likelihood function associates to each value of θ the probability of an observed value x for X. Then, the larger the value of l, the greater are the chances associated to the event under consideration, using that particular value of θ. Therefore, by fixing the value of x and varying θ we observe the plausibility (or likelihood) of each value of θ. The concept of likelihood function was discussed by Fisher,
Barnard and Kalbfleisch among many others. The likelihood function is also of fundamental importance in many theories of statistical inference.

Fig. 2.1 Likelihood function of the example for different values of x.

Note that even though
$$\int_{\mathcal{X}} p(x \mid \theta)\, dx = 1, \quad \int_{\Theta} l(\theta; x)\, d\theta = k \ne 1, \text{ in general.}$$

Example. X ~ bin(2, θ):

$$p(x \mid \theta) = l(\theta; x) = \binom{2}{x} \theta^x (1 - \theta)^{2-x}, \quad x = 0, 1, 2; \ \theta \in \Theta = (0, 1),$$

but

$$\int_{\Theta} l(\theta; x)\, d\theta = \int_0^1 \binom{2}{x} \theta^x (1 - \theta)^{2-x}\, d\theta = \binom{2}{x} B(x + 1, 3 - x) = \frac{1}{3} \ne 1.$$
Note that:

1. If x = 1 then l(θ; x = 1) = 2θ(1 − θ). The value of θ with highest likelihood or, in other words, the most likely (or probable) value of θ is 1/2.
2. If x = 2 then l(θ; x = 2) = θ², and the most likely value of θ is 1.
3. If x = 0 then l(θ; x = 0) = (1 − θ)², and the most likely value is now 0.

These likelihood functions are plotted in Figure 2.1.
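The three likelihood functions and their modes can be reproduced on a grid of θ values; a sketch using only the standard library:

```python
from math import comb

def likelihood(theta, x, n=2):
    """l(theta; x) = p(x | theta) for X ~ bin(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

grid = [i / 1000 for i in range(1, 1000)]          # theta in (0, 1)
modes = {x: max(grid, key=lambda t: likelihood(t, x)) for x in (0, 1, 2)}
print(modes)   # {0: 0.001, 1: 0.5, 2: 0.999}
```

The grid maxima sit at the boundary near 0 for x = 0, at 1/2 for x = 1, and at the boundary near 1 for x = 2, matching the three cases above.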
The notion of likelihood can also be introduced from a slightly more general perspective. This broader view will be useful in more general observation contexts than those considered so far. These include cases where observations are only obtained in an incomplete way. Denoting by E an observed event and assuming the probabilistic description of E depends on an unknown quantity θ, one can define the likelihood function of θ based on the observation E as l(θ; E) ∝ Pr(E | θ), where the symbol ∝ is to be read as 'is proportional to'. If E is of a discrete nature, there is no difference with respect to the previous definition.

Example. Let X₁, ..., X_n be a collection of iid random variables with a common Bernoulli distribution with success probability θ. Let E be any observation of X₁, ..., X_n consisting of x successes. Then, l(θ; E) ∝ Pr(E | θ) = θˣ(1 − θ)ⁿ⁻ˣ.

For continuous observations, assume that any observed value x is an approximation of the real value due to rounding errors. Therefore, the observation x in fact corresponds to the observed event E = {x : a ≤ x ≤ a + Δ} for given values of a and Δ > 0, which do not depend on θ. In this case, Pr(E | θ) = F(a + Δ) − F(a), where F is the distribution function of the observation. For typical applications, the value of Δ is very small and one can approximate F(a + Δ) − F(a) ≈ p(x | θ)Δ. Therefore, l(θ; E) ∝ p(x | θ). This definition can be extended to a vector of observations with minor technical adaptations to the reasoning above, leading to l(θ; E) ∝ p(x | θ).
The likelihood function leads to the likelihood principle, which states that all the information provided by the experiment X is summarized by the likelihood function. This principle draws a clear line that separates inference schools. On one side lie the Bayesian and likelihood approaches, which do not violate this principle, and on the other side the frequentist approach, which is based on the probabilities implied by the sampling distribution of X. In this way, it takes into consideration all the possible values of X.
2.2.2

Fisher information

We have already mentioned that the understanding and measuring of information is one of the key aspects of the statistical activity. In this section, the most commonly accepted measures of information are introduced. They have important connections with the notion of sufficiency, to be defined later in this chapter, and will prove to be very useful in the sequel. In fact, they consist of a series of related measures generally known as Fisher information measures.

Definition. Let X be a random vector with probability (density) function p(x | θ). The expected Fisher information measure of θ through X is defined by

$$I(\theta) = -E_{X \mid \theta}\left[\frac{\partial^2 \log p(X \mid \theta)}{\partial \theta^2}\right].$$
If θ = (θ₁, ..., θ_p) is a parameter vector then the expected Fisher information matrix of θ through X can be defined by

$$I(\theta) = -E_{X \mid \theta}\left[\frac{\partial^2 \log p(X \mid \theta)}{\partial \theta\, \partial \theta'}\right],$$

with elements I_{ij}(θ) given by

$$I_{ij}(\theta) = -E_{X \mid \theta}\left[\frac{\partial^2 \log p(X \mid \theta)}{\partial \theta_i\, \partial \theta_j}\right], \quad i, j = 1, \ldots, p.$$
The information measure defined this way is related to the mean value of the curvature of the likelihood. The larger this curvature is, the larger the information content summarized in the likelihood function, and so the larger I(θ) will be. Since the curvature is expected to be negative, the information value is taken as minus the curvature. The expectation is taken with respect to the sampling distribution.
The observed Fisher information corresponds to minus the second derivative of the log-likelihood:

$$J_x(\theta) = -\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta\, \partial \theta'}$$

and is interpreted as a local measure of the information content, while its expected value, the expected Fisher information, is a global measure. Both J_x(θ) and I(θ) will be used later on in connection with Bayesian and frequentist estimation. The motivation for these definitions will be clear from the following example.
Example. Let X ~ N(θ, σ²), with σ² known. It is easy to get I(θ) = J_x(θ) = σ⁻², the normal precision. Then, the observed and expected Fisher information measures with respect to θ obtained from the observation X coincide with the precision, which we had previously tried to identify with information.
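The normal example can be checked numerically. The sketch below (in Python; not part of the original text, and the particular values of θ, x and σ are arbitrary illustrations) approximates the observed information by a central finite difference of the log-likelihood and compares it with the precision σ⁻².

```python
import math

def log_lik(theta, x, sigma):
    """Log-likelihood of a single observation x under N(theta, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - theta)**2 / (2 * sigma**2)

def observed_info(theta, x, sigma, h=1e-4):
    """Observed Fisher information: minus the second derivative of the
    log-likelihood in theta, approximated by a central finite difference."""
    second = (log_lik(theta + h, x, sigma)
              - 2 * log_lik(theta, x, sigma)
              + log_lik(theta - h, x, sigma)) / h**2
    return -second

sigma = 2.0
print(observed_info(theta=1.0, x=0.3, sigma=sigma))  # close to 1/sigma^2 = 0.25
```

Because the log-likelihood is quadratic in θ, the result agrees with σ⁻² up to floating-point error and does not depend on the observed value x, just as the example states.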
There are many properties that can be derived from the Fisher information. One of the most useful concerns the additivity of the information with respect to independent observations or, more generally, sources of information.
Lemma. Let X = (X₁, ..., X_n) be a collection of independent random variables with distributions pᵢ(x | θ), i = 1, ..., n. Let J_X and J_{Xᵢ} be the observed information measures obtained through X and Xᵢ, i = 1, ..., n, respectively. Let I and Iᵢ be the expected information measures obtained through X and Xᵢ, i = 1, ..., n, respectively. Then,

$$J_X(\theta) = \sum_{i=1}^{n} J_{X_i}(\theta) \quad \text{and} \quad I(\theta) = \sum_{i=1}^{n} I_i(\theta).$$
The lemma states that the total information obtained from independent observations is the sum of the information of the individual observations. This provides further intuition about the appropriateness of the Fisher measures as actual summaries of information.
Proof. p(X | θ) = ∏ᵢ₌₁ⁿ pᵢ(Xᵢ | θ) and therefore log p(X | θ) = Σᵢ₌₁ⁿ log pᵢ(Xᵢ | θ). Then,

$$-\frac{\partial^2 \log p(X \mid \theta)}{\partial \theta\, \partial \theta'} = -\sum_{i=1}^{n} \frac{\partial^2 \log p_i(X_i \mid \theta)}{\partial \theta\, \partial \theta'},$$

which proves the result about observed information. Taking expectation with respect to X | θ on both sides gives

$$I(\theta) = E\left[-\sum_{i=1}^{n} \frac{\partial^2 \log p_i(X_i \mid \theta)}{\partial \theta\, \partial \theta'} \,\Bigg|\, \theta\right] = \sum_{i=1}^{n} E\left[-\frac{\partial^2 \log p_i(X_i \mid \theta)}{\partial \theta\, \partial \theta'} \,\Bigg|\, \theta\right] = \sum_{i=1}^{n} I_i(\theta). \qquad \square$$
Another very important statistic involved in the study of the likelihood function
is the score function.
Definition. The score function of X, denoted as U(X; θ), is defined as

$$U(X; \theta) = \frac{\partial \log p(X \mid \theta)}{\partial \theta}.$$

In the case of a parametric vector θ = (θ₁, ..., θ_p)', the score function is also a vector U(X; θ), with components Uᵢ(X; θ) = ∂ log p(X | θ)/∂θᵢ, i = 1, ..., p.
The score function is very relevant for statistical inference, as will be shown in the next chapters. The following lemma shows an alternative way to obtain the Fisher information based on the score function.
Lemma. Under certain regularity conditions,

$$I(\theta) = E_{X \mid \theta}[U^2(X; \theta)].$$

In the case of a vector parameter θ, the result becomes

$$I(\theta) = E_{X \mid \theta}[U(X; \theta)\, U'(X; \theta)].$$
Although we shall not go into the technical details of the regularity conditions, the main reason for their presence is to ensure that differentiation of the likelihood can be performed over the entire parameter space and that integration and differentiation can be interchanged.
Proof. Using the equality ∫ p(X | θ) dX = 1 and differentiating both sides with respect to θ, it follows, after interchanging the integration and differentiation operators, that

$$0 = \int \frac{\partial p(X \mid \theta)}{\partial \theta}\, dX = \int \frac{1}{p(X \mid \theta)} \frac{\partial p(X \mid \theta)}{\partial \theta}\, p(X \mid \theta)\, dX = \int \frac{\partial \log p(X \mid \theta)}{\partial \theta}\, p(X \mid \theta)\, dX.$$

Therefore the score function has expected value equal to a zero vector. Differentiating again with respect to θ and interchanging integration and differentiation, we have

$$0 = \int \left[\frac{\partial \log p(X \mid \theta)}{\partial \theta}\right] \left[\frac{\partial \log p(X \mid \theta)}{\partial \theta}\right]' p(X \mid \theta)\, dX + \int \frac{\partial^2 \log p(X \mid \theta)}{\partial \theta\, \partial \theta'}\, p(X \mid \theta)\, dX = E_{X \mid \theta}[U(X; \theta)\, U'(X; \theta)] - I(\theta).$$

The result follows straightforwardly. □
2.3 Bayes' theorem

We have seen that the statistical inference problem can be stated as having an unknown, unobserved quantity of interest θ assuming values in a set denoted by Θ. θ can be a scalar, a vector or a matrix. Until now, the only relevant source of inference was provided by the probabilistic description of the observations. In this section, we will formalize the use of other sources of information in statistical inference. This will define the Bayesian approach to inference.

Let H (for history) denote the initial available information about some parameter of interest. Assume further that this initial information is expressed in probabilistic terms. It can then be summarized through p(θ | H) and, if the information content of H is enough for our inferential purpose, this is all that is needed. In this case the description of our uncertainty about θ is complete.

Depending upon the relevance of the question we are involved with, H may not be sufficient and, in this case, it must be augmented. The main tool used in this case is experimentation. Assume a vector of random quantities X related to θ can be observed, thus providing further information about θ. (If X is not random then a functional relationship relating it to θ should exist. We can then evaluate the value of θ and the problem is trivially solved.) Before observing X, we should know the sampling distribution of X given by p(x | θ, H), where the dependence on θ, central to our argument, is clearly stated. After observing the value of X, the amount of information we have about θ has changed from H to H* = H ∩ {X = x}. In fact, H* is a subset of H (a refinement on H was performed).
Now the information about θ is summarized by p(θ | x, H) and the only remaining question is how to pass from p(θ | H) to p(θ | x, H). From Section 1.4, one can write

$$p(\theta \mid x, H) = \frac{p(\theta, x \mid H)}{p(x \mid H)} = \frac{p(x \mid \theta, H)\, p(\theta \mid H)}{p(x \mid H)},$$

where

$$p(x \mid H) = \int_{\Theta} p(x, \theta \mid H)\, d\theta.$$
The result presented above is known as Bayes' theorem. This theorem was introduced by the Rev. Thomas Bayes in two papers in 1763 and 1764, published after his death, as mentioned in Barnett (1973).
As we can see, the function in the denominator does not depend upon θ and so, as far as the quantity of interest θ is concerned, it is just a constant. Therefore, Bayes' theorem can be rewritten in its more usual form

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta).$$

The dependence on H is dropped, for simplicity of notation, since it is a common factor to all the terms. Nevertheless, it should not be forgotten. The above formula is valid for discrete and continuous, scalar, vector and matrix quantities. The theorem provides a rule for updating probabilities about θ, starting from p(θ) and leading to p(θ | x). This is the reason why the above distributions are called prior and posterior distributions, respectively. To recover the removed constant in the former equation, it is enough to notice that densities must integrate to 1 and to rewrite it as p(θ | x) = k p(x | θ) p(θ), where

$$1 = \int_{\Theta} p(\theta \mid x)\, d\theta = k \int_{\Theta} p(x \mid \theta)\, p(\theta)\, d\theta$$

and hence

$$k^{-1} = p(x \mid H) = \int_{\Theta} p(x \mid \theta)\, p(\theta)\, d\theta = E_{\theta}[p(x \mid \theta)].$$
This is the predictive (or marginal) distribution of X. As before, after removing
the dependence on H, it can be denoted by p(x). This is the expected distribution
of X (under the prior) and it behaves like a prediction, for a given H. So, before observing X it is useful to verify the prior adequacy through the predictions it provides for X. After observing X, it serves to test the model as a whole. An observed value of X with a low predictive probability is an indication that the stated model is not providing good forecasts. This is evidence that something unexpected happened. Either the model must be revised or an aberrant observation occurred.
2.3.1 Prediction
Another relevant aspect that follows from the calculations presented above is that we obtain an automatic way to make predictions for future observations. If we want to predict Y, whose probabilistic description is p(y | θ), we have

$$p(y \mid x) = \int_{\Theta} p(y, \theta \mid x)\, d\theta = \int_{\Theta} p(y \mid \theta, x)\, p(\theta \mid x)\, d\theta = \int_{\Theta} p(y \mid \theta)\, p(\theta \mid x)\, d\theta,$$

where the last equality follows from the independence between X and Y, once θ is given. This conditional independence assumption is typically present in many statistical problems. Also, it follows from the last equation that p(y | x) = E_{θ|x}[p(y | θ)]. It is always useful to concentrate on prediction rather than on estimation because the former is verifiable. The reason for the difference is that Y is observable and θ is not. This concept can be further explored by reading the books of Aitchison and Dunsmore (1975) and Geisser (1993).
Example. John goes to the doctor claiming some discomfort. The doctor is led to believe that he may have disease A. He then takes some standard procedures for this case: he examines John, carefully observes the symptoms and prescribes routine laboratory examinations. Let θ be the unknown quantity of interest indicating whether John has disease A or not. The doctor assumes that P(θ = 1 | H) = 0.7; H in this case contains the information John gave him and all other relevant knowledge he has obtained from former patients. To improve the evidence about the illness, the doctor asks John to undertake an examination. Examination X is related to θ and provides an uncertain result, of the positive/negative type, with the following probability distribution:

P(X = 1 | θ = 0) = 0.40, positive test without the disease;
P(X = 1 | θ = 1) = 0.95, positive test with the disease.
Suppose that John goes through the examination and that its result is X = 1. So, for the doctor, the probability that John has the disease is

P(θ = 1 | X = 1) ∝ l(θ = 1; X = 1) P(θ = 1) = (0.95)(0.7) = 0.665

and the probability that he does not have the disease is

P(θ = 0 | X = 1) ∝ l(θ = 0; X = 1) P(θ = 0) = (0.40)(0.30) = 0.120.

The normalizing constant, such that the total probability adds to 1, is calculated so that k(0.665) + k(0.120) = 1, giving k = 1/0.785. Consequently,

P(θ = 1 | X = 1) = 0.665/0.785 = 0.847 and P(θ = 0 | X = 1) = 0.120/0.785 = 0.153.

The information X = 1 increases, for the doctor, the probability that John has disease A from 0.7 to 0.847. This is not too much, since the probability that the test would give a positive result even if John were not ill was reasonably high. So the doctor decides to ask John to undertake another test Y, again of the positive/negative type, where

P(Y = 1 | θ = 0) = 0.04 and P(Y = 1 | θ = 1) = 0.99.

Note that the probability of this new test yielding a positive result given that John does not have the illness is very small. Although this test might be more expensive, its results are more efficient.
The posterior distribution of θ given X, P(θ | X), will be the prior distribution for the Y test. Before observing the result of test Y, it is useful to ask ourselves what the predictive distribution will be, that is, what are the values of P(Y = y | X = 1), for y = 0, 1. As we have already seen, in the discrete case,

$$P(Y = y \mid X = x) = \sum_{j=0}^{1} P(Y = y \mid \theta = j)\, P(\theta = j \mid X = x)$$

and so

P(Y = 1 | X = 1) = (0.04)(0.153) + (0.99)(0.847) = 0.845

and

P(Y = 0 | X = 1) = 1 − P(Y = 1 | X = 1) = 0.155.
Let us suppose now that John undertook test Y and the observed result was Y = 0. This is a reasonably unexpected result, as the doctor only gave it a 15.5% chance. He should rethink the model based on this result. In particular, he might want to ask himself:

1. Did 0.7 adequately reflect his P(θ = 1 | H)?
2. Is test X really so unreliable? Is the sampling distribution of X correct?
3. Is test Y so powerful?
4. Have the tests been carried out properly?
In any case, the doctor has to re-evaluate the chances of θ = 1 considering that, as well as the knowledge that X = 1, he now also knows that Y = 0. By Bayes' theorem,

P(θ = 1 | X = 1, Y = 0) ∝ l(θ = 1; Y = 0) P(θ = 1 | X = 1) = (0.01)(0.847) = 0.008

and

P(θ = 0 | X = 1, Y = 0) ∝ l(θ = 0; Y = 0) P(θ = 0 | X = 1) = (0.96)(0.153) = 0.147.

Note that now P(θ | X = 1) is working as the prior distribution for the experiment Y. The normalizing constant is 0.008 + 0.147 = 0.155 and

P(θ = 1 | Y = 0, X = 1) = 0.008/0.155 = 0.052 and P(θ = 0 | Y = 0, X = 1) = 0.147/0.155 = 0.948.
Summarizing the doctor's findings chronologically,

P(θ = 1) = 0.7 before the tests X and Y,
P(θ = 1) = 0.847 after X and before Y,
P(θ = 1) = 0.052 after X and Y.

The doctor can then decide his course of action as these updating operations take place.
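The whole chain of updates in this example can be reproduced with a few lines of code. The sketch below (Python; the helper name `update` is ours, not the book's) carries exact products rather than rounded intermediate values, so the final probability comes out near 0.055 instead of 0.052, the small gap being due to the text's rounding of 0.847 and 0.153.

```python
def update(prior, likelihood):
    """One application of Bayes' theorem for a binary parameter.
    prior[i] = P(theta = i); likelihood[i] = p(observed result | theta = i).
    Returns the posterior and the predictive probability of the result."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    norm = sum(joint)            # predictive probability of the observed result
    return [j / norm for j in joint], norm

prior = [0.3, 0.7]               # [P(theta = 0), P(theta = 1)]

# Test X: P(X=1 | theta=0) = 0.40, P(X=1 | theta=1) = 0.95; observed X = 1
post_x, _ = update(prior, [0.40, 0.95])

# Test Y, observed Y = 0: P(Y=0 | theta=0) = 0.96, P(Y=0 | theta=1) = 0.01
post_xy, pred_y0 = update(post_x, [0.96, 0.01])

print(round(post_x[1], 3))       # 0.847 = P(theta=1 | X=1)
print(round(pred_y0, 3))         # 0.155 = P(Y=0 | X=1)
print(round(post_xy[1], 3))      # 0.055 = P(theta=1 | X=1, Y=0)
```

Note how the posterior after X is simply fed back in as the prior for Y, exactly as in the text.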
2.3.2 Sequential nature of Bayes' theorem

Bayes' theorem is nothing more than a rule for updating probabilities. From an experimental result X₁ with probability distribution p₁(x₁ | θ) (and, consequently, likelihood l₁(θ; x₁)), it follows that

$$p(\theta \mid x_1) \propto l_1(\theta; x_1)\, p(\theta).$$

After observing another experimental result X₂ with probability p₂(x₂ | θ) not depending on X₁, we have

$$p(\theta \mid x_2, x_1) \propto l_2(\theta; x_2)\, p(\theta \mid x_1) \propto l_2(\theta; x_2)\, l_1(\theta; x_1)\, p(\theta).$$

Repeating this procedure n times, after observing X₃, ..., X_n related to θ through the observational distributions pᵢ(xᵢ | θ), for i = 3, ..., n, we get

$$p(\theta \mid x_n, \ldots, x_1) \propto l_n(\theta; x_n)\, p(\theta \mid x_{n-1}, \ldots, x_1)$$
Bayes' theorem 31 or alternatively
p(O I Xn, Xn-1·
.
.
•
, Xt)
()( [n x;)J p(O) l=l
1;(0;
and it is not difficult to see that the order in which the observations are processed
by the theorem is irrelevant. In fact, the observations can be processed in one batch, on an individual basis or through disjoint subgroups. The final result is
always the same as long as the observations are conditionally independent (on 0).
The sequential nature of Bayesian inference is deeply explored by Lindley (1965).
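The order-irrelevance claim is easy to verify numerically. The hedged sketch below uses a discrete grid of θ values as a prior for Bernoulli observations (an illustrative choice, not from the text) and checks that processing the data in a different order, or in two disjoint batches, yields the same posterior.

```python
def sequential_posterior(prior, data):
    """Apply Bayes' theorem one Bernoulli observation at a time over a
    discrete grid prior given as a dict {theta: probability}."""
    post = dict(prior)
    for x in data:
        post = {th: p * (th if x == 1 else 1.0 - th) for th, p in post.items()}
        norm = sum(post.values())
        post = {th: p / norm for th, p in post.items()}
    return post

grid = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}        # an arbitrary discrete prior
a = sequential_posterior(grid, [1, 1, 0, 1])
b = sequential_posterior(grid, [1, 0, 1, 1])    # same data, different order
c = sequential_posterior(sequential_posterior(grid, [1, 1]), [0, 1])  # two batches

print({t: round(p, 10) for t, p in a.items()})
print({t: round(p, 10) for t, p in b.items()})  # identical to a
print({t: round(p, 10) for t, p in c.items()})  # identical to a
```

All three routes multiply the same likelihood terms into the same prior, which is exactly the content of the displayed product formula.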
Another basic result corresponds to the case of normal observations with unknown mean, which is used in many practical situations. If the mean is described by a normal prior distribution, the posterior distribution will also be a normal distribution.

Theorem 2.1 (Normal prior and observation). Let θ ~ N(μ, τ²) and X | θ ~ N(θ, σ²), with σ² known. Then the posterior distribution of θ is (θ | X = x) ~ N(μ₁, τ₁²), where

$$\mu_1 = \frac{\tau^{-2}\mu + \sigma^{-2}x}{\tau^{-2} + \sigma^{-2}} \quad \text{and} \quad \tau_1^{-2} = \tau^{-2} + \sigma^{-2}.$$

Defining the precision as the reciprocal of the variance, it follows from the theorem that the posterior precision is the sum of the prior and likelihood precisions and does not depend on x. Interpreting the precision as a measure of information and defining w = τ⁻²/(τ⁻² + σ⁻²) ∈ (0, 1), w measures the relative information contained in the prior distribution with respect to the total information (prior plus likelihood). Then one can write

$$\mu_1 = w\mu + (1 - w)x,$$

which is the weighted mean of the prior and likelihood means.

Proof. From Bayes' theorem, it follows that

$$p(\theta \mid x) \propto l(\theta; x)\, p(\theta) \propto \exp\left\{-\frac{(x - \theta)^2}{2\sigma^2} - \frac{(\theta - \mu)^2}{2\tau^2}\right\} \propto \exp\left\{-\frac{\theta^2}{2}\left(\frac{1}{\sigma^2} + \frac{1}{\tau^2}\right) + \theta\left(\frac{x}{\sigma^2} + \frac{\mu}{\tau^2}\right)\right\},$$

where in the first step all constants involved were included in the proportionality factor. Now let τ₁² = (τ⁻² + σ⁻²)⁻¹ and μ₁ = (σ⁻²x + μτ⁻²)τ₁². Substituting into the above expression gives

$$p(\theta \mid x) \propto \exp\left\{-\frac{\theta^2}{2\tau_1^2} + \frac{\theta\mu_1}{\tau_1^2}\right\} \propto \exp\left\{-\frac{1}{2\tau_1^2}(\theta - \mu_1)^2\right\} \propto \frac{1}{\sqrt{2\pi\tau_1^2}} \exp\left\{-\frac{1}{2\tau_1^2}(\theta - \mu_1)^2\right\}.$$

It is easy to recognize that the last term in the above expression corresponds to a normal density. Therefore, the last proportionality constant is equal to 1 and (θ | x) ~ N(μ₁, τ₁²). □
Example (Box and Tiao, 1992). Two physicists, A and B, want to determine the value of a physical constant θ. Physicist A, with large experience in the area, specifies his prior as θ ~ N(900, (20)²). On the other side, physicist B, not so experienced in the subject, states a more uncertain prior θ ~ N(800, (80)²). It is easy to obtain that for physicist A, Pr(θ ∈ (860, 940)) ≈ 0.95 and for physicist B, Pr(θ ∈ (640, 960)) ≈ 0.95. Both physicists agree to make an evaluation of θ, denoted by X, using a calibrated device with sampling distribution (X | θ) ~ N(θ, (40)²). After observing the value X = 850, the posterior distributions of θ can be obtained using Theorem 2.1 and the values stated above as

1. physicist A: (θ | X = 850) ~ N(890, (17.9)²);
2. physicist B: (θ | X = 850) ~ N(840, (35.7)²).

The inferential procedure of the two physicists can be summarized in Figure 2.2. It is worth noting that, due to the different initial uncertainties, the same experiment provides very little additional information for A, although the uncertainty of B has been substantially reduced. Identifying precision (the inverse of the variance) with information, we have that the information about θ for physicist A increases from 0.0025 to 0.003125, since the likelihood precision was 0.000625 (an increase of 25%). For physicist B, it increases from 0.000156 to 0.000781 (an increase of 400%).
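The two posterior distributions can be recomputed directly from Theorem 2.1. In the sketch below, `normal_update` is our own helper name implementing the precision-weighted mean; the numbers match the example (890 and 17.9 for A, 840 and 35.7 for B).

```python
def normal_update(mu, tau2, x, sigma2):
    """Posterior N(mu1, tau1^2) for a N(mu, tau2) prior and one observation
    x ~ N(theta, sigma2), as in Theorem 2.1."""
    w = (1 / tau2) / (1 / tau2 + 1 / sigma2)   # relative weight of the prior
    mu1 = w * mu + (1 - w) * x
    tau12 = 1 / (1 / tau2 + 1 / sigma2)        # posterior variance
    return mu1, tau12

# Physicist A: prior N(900, 20^2); physicist B: prior N(800, 80^2); X = 850, sigma = 40
for name, mu, tau in [("A", 900, 20), ("B", 800, 80)]:
    mu1, tau12 = normal_update(mu, tau**2, 850, 40**2)
    print(name, round(mu1, 1), round(tau12**0.5, 1))  # A 890 17.9 / B 840 35.8
```

For A the prior weight w is 0.8, so the data barely move the prior mean; for B it is only 0.2, which is why B's uncertainty shrinks so much more.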
2.4
Exchangeability
Exchangeability is a very important concept introduced by de Finetti (1937). It is weaker than independence but just as useful, as will be shown below.
[Figure 2.2 appears here: prior, likelihood and posterior densities plotted against θ (range 600–1000) for physicists A and B.]

Fig. 2.2 Prior and posterior densities and likelihood for θ for the physicists' example.
Definition. Let K = (k₁, ..., k_n) be a permutation of {1, ..., n}. Random quantities X₁, ..., X_n are exchangeable if the n! permutations (X_{k₁}, ..., X_{k_n}) have the same n-dimensional probability distribution. An infinite sequence of random quantities is exchangeable if any finite subsequence is exchangeable.

One immediate consequence of exchangeability is that all marginal distributions must be the same. To see this, consider any two distinct permutations K and K' of an exchangeable sequence of random variables, which therefore must have the same probability. Let k₁ and k₁' be the first indices of the two permutations, respectively. If both sides of this equality are integrated with respect to all the components but the first, one gets that the marginal distributions of X_{k₁} and X_{k₁'} must be the same. Since one is free to choose the values of k₁ and k₁', this means that all marginal distributions are equal. A sequence (finite or not) of iid random quantities is trivially exchangeable, although the converse is not true in general.
Examples.

1. Consider an urn with m balls, r with number 1 and m − r with number 0. Balls are drawn from the urn, one at a time, without replacement. Let X_k denote the number associated with the kth ball selected. Then X₁, ..., X_n, n ≤ m, is an exchangeable sequence, but the Xᵢ's are not independent.
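Example 1 can be verified by exact enumeration. The sketch below (with illustrative values m = 5, r = 2, our choice) lists all equally likely draw orders and confirms that X₁ and X₂ have the same marginal distribution while failing independence.

```python
from itertools import permutations
from fractions import Fraction

# Urn with m = 5 balls: r = 2 labelled 1 and m - r = 3 labelled 0,
# drawn one at a time without replacement.
balls = [1, 1, 0, 0, 0]
orderings = list(permutations(balls))   # all equally likely draw orders

def prob(event):
    """Exact probability of an event over the equally likely orderings."""
    hits = sum(1 for seq in orderings if event(seq))
    return Fraction(hits, len(orderings))

p1 = prob(lambda s: s[0] == 1)          # P(X1 = 1)
p2 = prob(lambda s: s[1] == 1)          # P(X2 = 1): same marginal
p2_given_1 = prob(lambda s: s[0] == 1 and s[1] == 1) / p1
print(p1, p2, p2_given_1)               # 2/5 2/5 1/4
```

Both marginals equal r/m = 2/5, as exchangeability requires, but P(X₂ = 1 | X₁ = 1) = 1/4 ≠ 2/5, so the draws are not independent.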
2. Let X₁, X₂, ... be a sequence of Bernoulli trials with unknown success probability θ. The classical assumption is that the X_k's are iid. For a Bayesian, if θ is unknown, the information content of the jth observation can modify your belief about the distribution of X_k. The example of the previous chapter illustrated this point. If the experiments are judged similar in some sense, the hypothesis of exchangeability is acceptable, while marginal independence is not.

The relevance of the concept of exchangeability is due to the fact that, although it is based on weak assumptions, it allows one to state a very powerful result, known as the representation theorem of de Finetti. The theorem is stated here without proof, but it can be found in de Finetti (1937).
Theorem 2.2. To every infinite sequence of exchangeable random quantities {X_n, n = 1, 2, ...} assuming values in {0, 1} there corresponds a distribution F on (0, 1) such that, for all n and k ≤ n,

$$P[(k, n - k)] = \int \theta^k (1 - \theta)^{n-k}\, dF(\theta),$$

where (k, n − k) denotes the event that k of the Xᵢ's are 1 and the other n − k of the Xᵢ's are 0. Note that, due to the exchangeability assumption, any sequence with k 1's and n − k 0's also admits a representation according to the above theorem. A very simple outline of the proof of this representation theorem, worth reading,
can be found in Heath and Sudderth (1976). The importance of the theorem is that it provides further backing for the Bayesian approach. If one is willing to consider a collection of 0–1 observations as exchangeable, then one is prepared to rephrase their model into a sampling Bernoulli model with success probability θ that is itself random with probability distribution F. The theorem however does not tell us anything about the distribution F. For example, we could have:

1. a degenerate distribution: P(θ = θ₀) = 1, for some θ₀, implying that

$$P[(k, n - k)] = \theta_0^k (1 - \theta_0)^{n-k};$$

2. a discrete distribution: P(θ = θᵢ) = pᵢ, i = 1, ..., s, with pᵢ ≥ 0 and Σᵢ pᵢ = 1, implying that

$$P[(k, n - k)] = \sum_{i=1}^{s} p_i\, \theta_i^k (1 - \theta_i)^{n-k};$$

3. a continuous distribution: θ ~ beta(a, b), implying that

$$P[(k, n - k)] = \int_0^1 \theta^k (1 - \theta)^{n-k}\, \frac{\theta^{a-1}(1 - \theta)^{b-1}}{B(a, b)}\, d\theta = \frac{1}{B(a, b)} \int_0^1 \theta^{a+k-1}(1 - \theta)^{b+n-k-1}\, d\theta = \frac{B(a + k, b + n - k)}{B(a, b)}.$$
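The beta case can be checked numerically. The sketch below (helper names are ours) evaluates B(a + k, b + n − k)/B(a, b) through log-gamma functions and compares it with a direct midpoint-rule integration of the mixture integral.

```python
import math

def log_beta(a, b):
    """log B(a, b) computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prob_seq(n, k, a, b):
    """P[(k, n - k)] when theta ~ beta(a, b): B(a + k, b + n - k) / B(a, b)."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

def prob_seq_numeric(n, k, a, b, steps=100000):
    """Same quantity by midpoint integration of theta^k (1 - theta)^(n-k) dF."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        th = (i + 0.5) * h
        dens = math.exp((a - 1) * math.log(th) + (b - 1) * math.log(1.0 - th)
                        - log_beta(a, b))
        total += th**k * (1.0 - th)**(n - k) * dens * h
    return total

print(prob_seq(5, 2, 2, 3))            # 1/42, about 0.0238
print(prob_seq_numeric(5, 2, 2, 3))    # agrees to many decimal places
```

Working on the log scale keeps the computation stable for large n, a and b, where the gamma functions themselves would overflow.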
The exchangeability concept has already been extended to many other distributions with the inclusion of some additional hypotheses; see Bernardo and Smith (1994) for a review of the main results. The definition is the same as presented before, with removal of the constraint imposed on the sample space. In particular, if we introduce the hypotheses of symmetry of the distributions and invariance under linear transformations, it is not difficult to show that the joint density of any finite subsequence is given by

$$p(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p_N(x_i; \theta, \sigma^2)\, dF(\theta, \sigma^2),$$

where θ is a quantity varying in R, σ² a quantity assuming values in R⁺, p_N(·; a, b) is the density of an N(a, b) distribution and F is a distribution function on R × R⁺. Exchangeability, along with invariance, now leads to a representation where a normal sampling distribution is obtained with parameters having some prior distribution F. It is worth noting that F once again is not specified.

Another useful extension, well explored in recent years, is the concept of partial exchangeability. In this case the exchangeability holds only under certain conditions. For example, we can define some groups of variables where exchangeability is valid only within each group. This concept can be extended to more general cases than group classification and is formalized as follows.
Definition. Let {Xᵢ, i = 1, 2, ..., n} be any sequence of random quantities and K be any permutation of {1, 2, ..., n}. We say that X is partially exchangeable if there are quantities {Zᵢ, i = 1, 2, ..., n} such that the distribution of (X | Z) is the same as that of (X_K | Z_K), for any permutation K, where for any vector C = (c₁, ..., c_n), C_K = (c_{k₁}, ..., c_{k_n}).
The main idea behind this definition is that when the indices of the Xᵢ's are permuted, the resulting vector will have the same conditional distribution, as long as the same permutation is applied to the indices of the Zᵢ's. This is clearly a weaker concept than exchangeability. The case Zᵢ = 1, for all i, corresponds to the exchangeability definition. Another interesting and less trivial case is when there are s groups and each Zᵢ takes on a value in {1, ..., s} to identify the group the observation Xᵢ belongs to. In this case, one has exchangeability within each group but not globally. The extension of this concept to countable sequences is immediate. The notion of partial exchangeability will be returned to in Chapter 3, when the related concept of hierarchical prior is introduced, and in Chapter 6, when inference for hierarchical models is discussed.
2.5 Sufficiency and the exponential family

As we have said before, one of the main goals of statistics is to summarize information. An important question is to know if, given a set of observations X sampled to get information about a parameter of interest θ, there are statistics, i.e. functions of the observations X, that summarize all the information contained in X.
Definition (classical). Let X be a random quantity with probability (density) function p(x | θ). Then the statistic T = T(X) is sufficient for the parameter θ if

$$p(x \mid t, \theta) = p(x \mid t).$$

The definition states that, given T, X does not bring any additional information about θ. From the Bayesian point of view this means that X and θ are independent conditionally on T. The main point of the definition is that, after observing T, one can forget the rest of the data if one is only interested in gathering information about θ. The concept of sufficient statistics was introduced by Fisher and was studied by Lehmann, Scheffé and Bahadur in the 1950s, as pointed out by DeGroot (1970). The strength of the definition lies in the possibility of finding sufficient statistics of a smaller dimension than the data X, thus implying savings in information storage. In some cases, it is possible to find sufficient statistics with fixed dimension independent of the sample size. In these cases the dimension reduction and consequent storage saving can be huge if large sample sizes are considered.
Theorem 2.3. If T = T(X) is a sufficient statistic for θ, then

$$p(\theta \mid x) = p(\theta \mid t), \quad \text{for all priors } p(\theta).$$

Proof. p(x | θ) = p(x, t | θ), if t = T(x), and 0, if t ≠ T(x). So,

$$p(x \mid \theta) = p(x \mid t, \theta)\, p(t \mid \theta) = p(x \mid t)\, p(t \mid \theta),$$

by the definition of sufficiency. But, by Bayes' theorem,

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) = p(x \mid t)\, p(t \mid \theta)\, p(\theta) \propto p(t \mid \theta)\, p(\theta) \propto p(\theta \mid t),$$

since p(x | t) does not depend on θ. Then p(θ | x) = k p(θ | t), for some k > 0. Additionally,

$$1 = \int_{\Theta} p(\theta \mid x)\, d\theta = k \int_{\Theta} p(\theta \mid t)\, d\theta = k,$$

and so p(θ | x) = p(θ | t). □
This theorem leads to a possible Bayesian definition of sufficiency.

Definition (Bayesian). The statistic T(X) is sufficient for θ if there is a function f such that p(θ | x) ∝ f(θ, t).

A useful insight into sufficiency is gained through the notion of partitions of the sample space. Let T = T(X) be a p-dimensional statistic and A_t = {x : T(x) = t}. The collection of sets {A_t : t ∈ Rᵖ} = {A_t} is a partition if A_t ∩ A_{t'} = ∅, for all t ≠ t' in Rᵖ, and ∪_t A_t = S, the sample space. A partition induced by a sufficient statistic is called a sufficient partition. The equivalence between the classical and Bayesian definitions follows easily from the theorem presented below for the classical definition of sufficiency.
Theorem 2.4 (Neyman's factorization criterion). The statistic T is sufficient for θ if and only if

$$p(x \mid \theta) = f(t, \theta)\, g(x),$$

where f and g are non-negative functions.

Proof. (⇒) We have already seen that p(x | θ) = p(x | t) p(t | θ). Then it is enough to define g(x) = p(x | t) = p(x | T(x)) and f(t, θ) = p(t | θ), completing this part of the proof.

(⇐) Conversely, we have that p(x | θ) = f(t, θ) g(x). Defining A_t = {x : T(x) = t}, the probability (density) function of T | θ is

$$p(t \mid \theta) = \int_{A_t} p(x \mid \theta)\, dx = f(t, \theta) \int_{A_t} g(x)\, dx = f(t, \theta)\, G(t),$$

for some function G, and so f(t, θ) = p(t | θ)/G(t). On the other hand, from the hypothesis of the theorem, f(t, θ) = p(x | θ)/g(x). Equating the two forms for f(t, θ) leads to

$$\frac{p(x \mid \theta)}{g(x)} = \frac{p(t \mid \theta)}{G(t)}.$$

Since p(x | t, θ) = p(x | θ)/p(t | θ), then

$$p(x \mid t, \theta) = \frac{g(x)}{G(t)} = p(x \mid t),$$

since it does not depend on θ. Thus, T is sufficient for θ. □
Neyman's factorization criterion is the tool usually used for identifying sufficient statistics.
We can now show that the two concepts of sufficiency are equivalent, since:

1. (classical ⇒ Bayesian) follows trivially from Theorem 2.4;
2. (Bayesian ⇒ classical):

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = f(t, \theta)\, p(\theta), \quad \text{by hypothesis.}$$

So p(x | θ) = f(t, θ) p(x), which, by the factorization criterion, is equivalent to saying that T is a sufficient statistic.
Definition. Suppose that X has density p(x | θ). Then T(X) is an ancillary statistic for θ if p(t | θ) = p(t).

In this case, T does not provide any information about θ, although it is a function of X, which is related to θ. Ancillarity can be understood as an antonym for sufficiency.

Sufficiency is a basic concept in classical statistics, although it is not so relevant for the Bayesian approach. On the other hand, from the applied point of view this is also not a very useful concept, since even small perturbations in the model can imply the loss of sufficiency.
Examples.

1. Let X = (X₁, ..., X_n) be observations with values 0 or 1, where P(Xᵢ = 1 | θ) = θ. Then

$$p(x \mid \theta) = \theta^t (1 - \theta)^{n-t}, \quad \text{with } t = \sum_{i=1}^{n} x_i.$$

From the factorization criterion it follows that T(X) = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic. In this case, it is also possible to conclude straightforwardly from the definition of sufficiency, using some combinatorial arguments, that T(X) is a sufficient statistic, since p(x | T(x) = t) = $\binom{n}{t}^{-1}$, which does not depend on θ.
2. Let X₁, X₂, ..., X_n be iid conditional on θ with common density p(x | θ). Then:

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$

The order statistics are defined as Y₁ = X₍₁₎ = minᵢ Xᵢ, Y₂ = X₍₂₎ = the second smallest sample value, ..., Y_n = X₍ₙ₎ = maxᵢ Xᵢ. Since the order of the terms does not alter the product and to each xᵢ there corresponds a unique yᵢ (assuming continuity),

$$\prod_{i=1}^{n} p(x_i \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta).$$

Then, with g(x) = 1, t = (y₁, ..., y_n) and f(t, θ) = ∏ᵢ₌₁ⁿ p(yᵢ | θ), the factorization criterion holds and T = (Y₁, ..., Y_n) is a sufficient statistic for θ. Note that the dimension of T depends upon the sample size. In this case no dimensionality reduction was achieved and the definition becomes deprived of its strength. It is also clear that the sample X itself is trivially a sufficient statistic for θ.

The application of the sufficiency concept developed above is not necessarily useful. It is only relevant when the dimension of the sufficient statistic is significantly smaller than the sample size. An interesting question is how to obtain a sufficient statistic with maximal reduction of the sample data. Such a statistic is known in the literature as a minimal sufficient statistic.
Definition. Let X ~ p(x | θ). The statistic T(X) is a minimal sufficient statistic for θ if it is a sufficient statistic for θ and a function of every other sufficient statistic for θ.

If S(X) is a statistic obtained as a bijective function of a sufficient statistic T(X), then S is also a sufficient statistic. On the other hand, the minimal sufficient statistic is unique, apart from bijective transformations of itself.
Definition. Two elements x and y of the sample space are information equivalent if and only if the ratio p(x | θ)/p(y | θ), or equivalently l(θ; x)/l(θ; y), does not depend on θ.

Information equivalence defines an equivalence relation on the elements of the sample space. Therefore, it defines a partition of the sample space. This partition is called a minimal sufficient partition. It can be shown that a sufficient statistic is minimal if and only if the partition it defines on the sample space is minimal sufficient.
Example. Let X₁, ..., X_n be iid Poisson variables with mean λ and define T(X) = Σᵢ₌₁ⁿ Xᵢ. Then,

$$p(x \mid \lambda) = \prod_{i=1}^{n} p(x_i \mid \lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda}\, \lambda^{T(x)}}{\prod_{i=1}^{n} x_i!}.$$

Therefore, p(x | λ)/p(y | λ) = λ^{T(x)−T(y)} ∏ᵢ₌₁ⁿ (yᵢ!/xᵢ!), which does not depend on λ if and only if T(x) = T(y). Hence, T(X) is a minimal sufficient statistic for λ.
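The minimality argument can be illustrated numerically: the likelihood ratio p(x | λ)/p(y | λ) is constant in λ exactly when the two samples share the same total. A possible sketch (sample values are arbitrary):

```python
import math

def log_lik_poisson(x, lam):
    """Joint log-likelihood of iid Poisson observations x with mean lam."""
    return sum(-lam + xi * math.log(lam) - math.lgamma(xi + 1) for xi in x)

def ratio(x, y, lam):
    """p(x | lam) / p(y | lam); free of lam exactly when sum(x) == sum(y)."""
    return math.exp(log_lik_poisson(x, lam) - log_lik_poisson(y, lam))

x, y = [3, 0, 2], [1, 1, 3]          # same total T = 5
print([round(ratio(x, y, lam), 6) for lam in (0.5, 2.0, 7.0)])  # constant: 0.5

z = [2, 2, 2]                        # total T = 6 differs
print([round(ratio(x, z, lam), 6) for lam in (0.5, 2.0, 7.0)])  # varies with lam
```

When the totals match, the λ factors cancel and only the ratio of factorials remains, exactly as in the algebra above.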
Another interesting question is whether there are families of distributions admitting sufficient statistics of fixed dimension for θ. Fortunately, for a large class of distributions, the dimension of the sufficient statistic T is equal to the number of parameters. Maximum summarization is obtained when we have one sufficient statistic for each parameter. Subject to some weak regularity conditions, all distributions with the dimension of T equal to the number of parameters belong to the exponential family.
Definition. The family of distributions with probability (density) function p(x | θ) belongs to the exponential family with r parameters if p(x | θ) can be written as

$$p(x \mid \theta) = a(x) \exp\left\{\sum_{j=1}^{r} U_j(x)\, \phi_j(\theta) + b(\theta)\right\}, \quad x \in \mathcal{X} \subset R,$$

where the set 𝒳 does not depend on θ. By the factorization criterion, U₁(X), ..., U_r(X) are sufficient statistics for θ (when a single X is observed). For a size n sample of X we have

$$p(x \mid \theta) = \left[\prod_{i=1}^{n} a(x_i)\right] \exp\left\{\sum_{j=1}^{r} \left[\sum_{i=1}^{n} U_j(x_i)\right] \phi_j(\theta) + n b(\theta)\right\},$$

which belongs to the exponential family too, with a(x) = ∏ᵢ₌₁ⁿ a(xᵢ) and Uⱼ(x) = Σᵢ₌₁ⁿ Uⱼ(xᵢ), j = 1, ..., r. So T = (T₁, ..., T_r), with Tⱼ = Uⱼ(X), j = 1, 2, ..., r, is a sufficient statistic for θ.

The exponential family is very rich and includes most of the distributions more commonly used in statistics. Among the most important distributions not included in this family we have the uniform distribution (which has sufficient statistics with dimension not depending on the sample size) and the Student t distribution (which has none). Darmois, Koopman and Pitman have independently shown that, among families satisfying some regularity conditions, a sufficient statistic of fixed dimension will only exist for the exponential family.
Examples. 1. Bernoulli($\theta$):

$$p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x} I_x(\{0, 1\}) = \exp\left\{ x \log\left( \frac{\theta}{1 - \theta} \right) + \log(1 - \theta) \right\} I_x(\{0, 1\}).$$

For a sample x we have

$$p(x \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} I_{x_i}(\{0, 1\}) = \exp\left\{ \sum_{i=1}^{n} x_i \log\left( \frac{\theta}{1 - \theta} \right) + n \log(1 - \theta) \right\} I_x(\{0, 1\}^n).$$

Then, the Bernoulli belongs to the one-parameter exponential family with $a(x) = I_x(\{0, 1\}^n)$, $b(\theta) = n \log(1 - \theta)$, $\phi(\theta) = \log[\theta/(1 - \theta)]$ and $U(x) = \sum_{i=1}^{n} x_i$. So, $U$ is a sufficient statistic for $\theta$, as we have seen before.
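The exponential-family form can be verified directly. In this sketch (our own illustration), the pmf written with $b$, $\phi$ and $U$ reproduces the Bernoulli pmf:

```python
import math

def bernoulli_pmf(x, theta):
    """Standard Bernoulli pmf on {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

def expfam_pmf(x, theta):
    """Same pmf in exponential-family form:
    a(x) * exp{U(x) * phi(theta) + b(theta)}, with a(x) = 1 on {0, 1}."""
    phi = math.log(theta / (1 - theta))   # natural parameter
    b = math.log(1 - theta)
    return math.exp(x * phi + b)

for theta in (0.2, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bernoulli_pmf(x, theta) - expfam_pmf(x, theta)) < 1e-12
```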
2. Poisson($\lambda$):

$$p(x \mid \lambda) = \frac{e^{-\lambda} \lambda^x}{x!} I_x(\{0, 1, \ldots\}),$$

which in the exponential form is

$$p(x \mid \lambda) = \frac{1}{x!} \exp\{-\lambda + x \log \lambda\}\, I_x(\{0, 1, \ldots\}).$$

For a sample x we have

$$p(x \mid \lambda) = \frac{1}{\prod_{i=1}^{n} x_i!} \exp\left\{ \sum_{i=1}^{n} x_i \log \lambda - n\lambda \right\} I_x(\{0, 1, \ldots\}^n).$$

So, the Poisson belongs to the one-parameter exponential family with $a(x) = I_x(\{0, 1, \ldots\}^n)/\prod_{i=1}^{n} x_i!$, $b(\lambda) = -n\lambda$, $\phi(\lambda) = \log \lambda$ and $U(x) = \sum_{i=1}^{n} x_i$. Then $U$ is sufficient for $\lambda$.

3. Normal($\mu$, $\sigma^2$):
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} = \frac{1}{\sqrt{2\pi}} \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{x^2}{2\sigma^2} - \frac{1}{2}\left( \frac{\mu^2}{\sigma^2} + \log \sigma^2 \right) \right\}.$$

For a sample x it follows that

$$p(x \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}} \exp\left\{ \frac{\mu}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 - \frac{n}{2}\left( \frac{\mu^2}{\sigma^2} + \log \sigma^2 \right) \right\}.$$

Then, the normal distribution is a member of the exponential family with a bidimensional parameter $\theta = (\mu, \sigma^2)$, $a(x) = (2\pi)^{-n/2}$, $b(\theta) = -(n/2)[(\mu^2/\sigma^2) + \log \sigma^2]$, $\phi_1(\theta) = \mu/\sigma^2$, $\phi_2(\theta) = -1/(2\sigma^2)$, $U_1(x) = \sum_{i=1}^{n} x_i$ and $U_2(x) = \sum_{i=1}^{n} x_i^2$.
2.6
Parameter elimination
Sometimes models need to be developed with the inclusion of various parameters, many of which are included regardless of our particular interest in them. These parameters are often included to describe relevant aspects of the reality we are modelling, even though they are not related to our main concerns about the problem. To simplify the discussion, the parametric vector can be broadly split into two
subvectors: $\theta$, containing the parameters of interest, and $\phi$, the components that, despite being present in the model, are not of our immediate concern. The first subvector is called the parameter of interest and the second is known as the nuisance parameter. Usually, it is our desire to eliminate $\phi$ from the analysis as soon as possible, in order to concentrate efforts on $\theta$. From the Bayesian point of view, this is done using probability rules. Once having observed $X = x$, we get $p(\theta, \phi \mid x)$, and we can easily calculate

1. Marginal posterior distributions:

$$p(\theta \mid x) = \int_{\Phi} p(\theta, \phi \mid x)\, d\phi \quad \text{and} \quad p(\phi \mid x) = \int_{\Theta} p(\theta, \phi \mid x)\, d\theta,$$

where $\Theta$ and $\Phi$ are the sets of possible values of $\theta$ and $\phi$.

2. Conditional posterior distributions:

$$p(\theta \mid \phi, x) = \frac{p(\theta, \phi \mid x)}{p(\phi \mid x)} \propto p(\theta, \phi \mid x) \quad \text{and} \quad p(\phi \mid \theta, x) = \frac{p(\theta, \phi \mid x)}{p(\theta \mid x)} \propto p(\theta, \phi \mid x),$$

where these conditional distributions are well defined if the corresponding denominators are non-zero, that is, for values of $\theta$ and $\phi$ with strictly positive marginal posterior density. The above calculations use the fact that the terms in the denominator do not depend on the quantity of interest and can be included in the proportionality constant.

3. Marginal likelihood functions:
The likelihood function is defined as $l(\theta, \phi; x) = p(x \mid \theta, \phi)$. In a similar way, one can define the marginal likelihood functions as

$$l(\theta; x) = p(x \mid \theta) = \int_{\Phi} p(x, \phi \mid \theta)\, d\phi = \int_{\Phi} p(x \mid \theta, \phi)\, p(\phi \mid \theta)\, d\phi$$

and

$$l(\phi; x) = \int_{\Theta} p(x \mid \theta, \phi)\, p(\theta \mid \phi)\, d\theta.$$

Conceptually, there is no difficulty in defining and obtaining any of the above quantities, although sometimes there are difficulties in analytically solving these integrals. The same is not true for classical or frequentist inference. Some special rules must be stated, leading to ad hoc procedures to solve the stated problem.
Many efforts in classical inference are geared towards a proper definition of marginal likelihood. This is a relevant problem to the frequentist statistician and has received much research attention, beginning with the work of Bartlett in the 1930s and further developed by Kalbfleisch and Sprott in the 1970s, as discussed in Cox and Hinkley (1974). Many proposals to express the marginal likelihood $l(\theta; x)$ are based on substituting $\phi$ in the (total) likelihood by some particular value. Often, this is taken as the value that maximizes the likelihood. Denoting this value by $\hat{\phi}(\theta)$, because it can depend on $\theta$, we get the relative or profile likelihood for $\theta$ as

$$l_p(\theta; x) = l(\theta, \hat{\phi}(\theta); x).$$

Some authors suggest corrections in this expression taking into consideration measures of information associated with $\phi$. There are other frequentist definitions for the marginal and conditional likelihood. Suppose that the vector $X$, or some one-to-one transformation of it, is partitioned into $(T, U)$, with joint density given by

$$p(t, u \mid \theta, \phi) = p(t \mid \theta, \phi)\, p(u \mid t, \theta, \phi).$$

The likelihood function of $(\theta, \phi)$ is given by the left-hand side of the equation above, and the right-hand side terms can also be written in likelihood terms as $l(\theta, \phi; t, u) = l(\theta, \phi; t)\, l(\theta, \phi; u \mid t)$. The first term on the right-hand side is called the marginal likelihood and the second, the conditional likelihood. These forms of likelihood are useful when some of the parametric components can be eliminated. For example, if $T$ is such that $l(\theta, \phi; t) = l(\theta; t)$, only this term is used to make inferences about $\theta$. For conditional likelihood, this form is related to sufficient statistics because if $T$ is sufficient for $\phi$, with $\theta$ fixed, then $l(\theta, \phi; u \mid t) = l(\theta; u \mid t)$. Again, only this term is used to make inferences about $\theta$. The question is how much information is lost when ignoring the other part of the likelihood.
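As an illustration of the profile likelihood (our own sketch, using the normal model with the variance as nuisance parameter; the data are simulated):

```python
import numpy as np

def profile_loglik(theta, x):
    """Profile log-likelihood for the mean theta of a normal sample,
    maximizing the full likelihood over the nuisance variance."""
    n = len(x)
    s2_hat = np.mean((x - theta) ** 2)   # maximizer in sigma^2 for fixed theta
    # log l(theta, s2_hat(theta)) up to an additive constant
    return -0.5 * n * np.log(s2_hat) - 0.5 * n

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=30)

grid = np.linspace(0, 4, 401)
lp = np.array([profile_loglik(t, x) for t in grid])
theta_hat = grid[np.argmax(lp)]

# The profile likelihood is maximized at the sample mean
assert abs(theta_hat - np.mean(x)) < 0.02
```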
Example. Let $X = (X_1, \ldots, X_n)$ be a random sample from a $N(\theta, \sigma^2)$ distribution and define $\phi = \sigma^{-2}$. The parameter vector $(\theta, \phi)$ is unknown and we assume that the main interest lies in the mean of the population. The precision $\phi$ is essentially a nuisance parameter that we would like to eliminate from the analysis. Suppose that the prior distribution is such that $n_0 \sigma_0^2 \phi \sim \chi^2_{n_0}$, or equivalently $\phi \sim G(n_0/2,\, n_0\sigma_0^2/2)$, and $\phi$ is independent of $\theta$ a priori. In these conditions we have

$$p(\phi \mid \theta) = p(\phi) \propto \phi^{n_0/2 - 1} \exp\left\{ -\frac{n_0 \sigma_0^2 \phi}{2} \right\}.$$

On the other hand,
$$p(x \mid \theta, \phi) \propto \phi^{n/2} \exp\left\{ -\frac{\phi}{2} \sum_{i=1}^{n} (x_i - \theta)^2 \right\},$$

but

$$\sum_{i=1}^{n} (x_i - \theta)^2 = \sum_{i=1}^{n} [(x_i - \bar{x}) + (\bar{x} - \theta)]^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2 = n[s^2 + (\bar{x} - \theta)^2],$$

where

$$s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
Therefore, the marginal likelihood of $\theta$ is

$$l(\theta; x) = \int_0^\infty p(x \mid \phi, \theta)\, p(\phi \mid \theta)\, d\phi$$
$$\propto \int_0^\infty \phi^{n/2} \exp\left\{ -\frac{\phi}{2}[ns^2 + n(\bar{x} - \theta)^2] \right\} \phi^{n_0/2 - 1} \exp\left\{ -\frac{n_0 \sigma_0^2 \phi}{2} \right\} d\phi$$
$$= \int_0^\infty \phi^{\frac{n + n_0}{2} - 1} \exp\left\{ -\frac{\phi}{2}[ns^2 + n(\bar{x} - \theta)^2 + n_0 \sigma_0^2] \right\} d\phi$$
$$= \frac{\Gamma[(n + n_0)/2]}{\{[ns^2 + n(\bar{x} - \theta)^2 + n_0 \sigma_0^2]/2\}^{(n + n_0)/2}}$$
$$= \frac{\Gamma[(n + n_0)/2]}{[(ns^2 + n_0 \sigma_0^2)/2]^{(n + n_0)/2}} \left[ 1 + \frac{n(\bar{x} - \theta)^2}{ns^2 + n_0 \sigma_0^2} \right]^{-\frac{n + n_0}{2}}.$$

Letting $n_0 \to 0$, which corresponds to vague prior information (as will be seen in Chapter 3), gives

$$l(\theta; x) \to k \left[ 1 + \frac{n(\bar{x} - \theta)^2}{ns^2} \right]^{-n/2} = k \left[ 1 + \frac{T^2(\bar{x}, \theta)}{n - 1} \right]^{-\frac{(n - 1) + 1}{2}},$$

where

$$T(\bar{X}, \theta) = \frac{\bar{x} - \theta}{s/\sqrt{n - 1}}.$$

Interpreting the likelihood as proportional to the probability density function of $T$, it follows that $T \sim t_{n-1}(0, 1)$ (see list of distributions). The sampling distribution of $\bar{X}$, the minimal sufficient statistic for $\theta$, is used in frequentist inference. This distribution depends on $\sigma^2$. Substituting it by the estimator $S^2$, the minimal sufficient statistic for $\phi$, leads to $T(\bar{X}, \theta)$ with a $t_{n-1}(0, 1)$ sampling distribution, as will be seen in Chapter 4. Therefore, if the adopted prior distribution for $\theta$ is proportional to a constant, then the Bayesian and classical results will agree, numerically at least, although theoretically they were obtained using different arguments.
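The integration over $\phi$ can also be checked numerically. In this sketch (simulated data, our own illustration), the likelihood integrated against the vague $n_0 \to 0$ prior comes out proportional to the $t_{n-1}$ kernel derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
x = rng.normal(3.0, 2.0, size=n)
xbar = x.mean()
s2 = np.mean((x - xbar) ** 2)

phis = np.linspace(1e-8, 5.0, 200001)   # integration grid for the precision phi
dphi = phis[1] - phis[0]

def integrated_lik(theta):
    """Integrate the normal likelihood over phi with the vague (n0 -> 0) prior."""
    integrand = phis ** (n / 2 - 1) * np.exp(-0.5 * phis * n * (s2 + (xbar - theta) ** 2))
    return np.sum((integrand[1:] + integrand[:-1]) * dphi / 2)   # trapezoidal rule

# Closed form derived above: l(theta) proportional to [1 + T^2/(n-1)]^{-n/2}
grid = np.linspace(xbar - 3, xbar + 3, 7)
lik = np.array([integrated_lik(t) for t in grid])
T2 = (xbar - grid) ** 2 / (s2 / (n - 1))
kernel = (1 + T2 / (n - 1)) ** (-n / 2)

ratios = lik / kernel
assert np.allclose(ratios, ratios[0], rtol=1e-4)   # constant ratio: proportionality holds
```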
Exercises

§ 2.1

1. For each of the following distributions verify if the model is location, scale or location-scale.

(a) $t_\alpha(\mu, \sigma^2)$, $\alpha$ known;

(b) Pareto($x_0$, $\alpha$), with $\alpha$ fixed, with density $p(x \mid x_0) = \alpha x_0^{\alpha}/x^{1+\alpha}$, $x > x_0$ ($\alpha, x_0 > 0$);

(c) uniform distribution on $(\theta - 1, \theta + 1)$;

(d) uniform distribution on $(-\theta, \theta)$.
§ 2.2

2. Let $X$ have sampling density $p(x \mid \theta)$, with $\theta \in \Theta \subset \mathbb{R}$. Prove or give a counterexample to the assertion that if the sampling distribution $p(x \mid \theta)$ has a unique maximum then the likelihood $l(\theta; x)$ also has a unique maximum. Generalize the result to the case of a vector of observations $X$ and also for a parameter vector $\theta$.

3. A situation that usually occurs in lifetime or survival analysis is to have observations that are censored because of time and/or cost restrictions on the experiment. One common situation occurs when the experiment is run only until time $T > 0$. If the observational unit fails before time $T$, it is uncensored, but if it is censored then all one can extract from the experiment is that the lifetime of this unit is larger than $T$. Assuming that a random sample $x_1, \ldots, x_n$ from a density $f(x \mid \theta)$ is observed, show that

(a) the likelihood is
$$l(\theta) = \prod_{i=1}^{n} [f(x_i \mid \theta)]^{1 - \delta_i} [1 - F(T \mid \theta)]^{\delta_i},$$
where $F$ is the distribution function of the observations and $\delta_i$ is the censoring indicator, $i = 1, \ldots, n$, taking the value 0 when failure is observed and 1 when censoring takes place;

(b) in the case that $f(x \mid \theta) = \theta e^{-\theta x}$, $l(\theta) = \theta^m e^{-\theta U}$, where $m \leq n$ is the number of uncensored observations and $U = (n - m)T + \sum_i (1 - \delta_i) x_i$ is the total time on test.

4. Let $X_1, \ldots, X_n$ be iid random quantities from the Weibull distribution, denoted by Wei($\alpha$, $\beta$), with density
$$p(x \mid \alpha, \beta) = \beta \alpha x^{\alpha - 1} \exp(-\beta x^{\alpha}), \quad x > 0 \quad (\alpha, \beta > 0).$$

(a) Obtain the likelihood function, the score function and the observed and expected Fisher information matrices for the pair of parameters $(\alpha, \beta)$.

(b) The Weibull distribution is sometimes parametrized in terms of $\alpha$ and $\theta = 1/\beta^{\alpha}$. Repeat item (a) for the pair of parameters $(\alpha, \theta)$.
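As a numerical illustration related to exercise 3 (a sketch with simulated data; it checks the identity in item (b), it is not a proof), the censored product likelihood collapses to $\theta^m e^{-\theta U}$ for the exponential model:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, T = 0.5, 2.0
lifetimes = rng.exponential(1 / theta_true, size=20)
delta = (lifetimes > T).astype(int)       # 1 = censored at time T
x = np.where(delta == 1, T, lifetimes)    # observed times

def loglik(theta):
    """Log of the censored likelihood from item (a), exponential model."""
    log_f = np.log(theta) - theta * x     # log density
    log_S = -theta * T                    # log survival 1 - F(T)
    return np.sum((1 - delta) * log_f + delta * log_S)

m = int(np.sum(delta == 0))                       # uncensored count
U = (len(x) - m) * T + np.sum((1 - delta) * x)    # total time on test
for theta in (0.2, 0.5, 1.0):
    assert np.isclose(loglik(theta), m * np.log(theta) - theta * U)
```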
§ 2.3

5. Return to the example about John's illness and consider the same distributions.

(a) Which test result makes the doctor more certain about John's illness? Why?

(b) The test X is applied and provides the result $X = 1$. Suppose the doctor is not satisfied with the available evidence and decides to ask for another replication of test X, and again the result is $X = 1$. What is now the probability that John has disease A?

(c) What is the minimum number of repetitions of test X which allows the doctor to ensure that John has the disease with 99.9% probability? What are the results of these replications that guarantee this?
6. Suppose that $X \mid \theta \sim N(\theta, 1)$ (for example, $X$ is a measurement of a physical constant $\theta$ made with an instrument with variance 1). The prior distribution for $\theta$ elicited by scientist A corresponds to a $N(5, 1)$ distribution and scientist B elicits a $N(15, 1)$ distribution. The value $X = 6$ was observed.

(a) Which prior fits the data better?

(b) What kind of comparison can be done between the two scientists?

7. Classify the following assertions as TRUE or FALSE, briefly justifying your answer.

(a) The posterior distribution is always more precise than the prior because it is based on more information.

(b) When $X_2$ is observed after $X_1$, the prior distribution before observing $X_2$ has to be necessarily the posterior distribution after observing $X_1$.

(c) The predictive density is the prior expected value of the sampling distribution.

(d) The smaller the prior information, the bigger the influence of the sample on the posterior distribution.

8. A test to verify if a driver is driving in a drunken state has 0.8 chance of being correct, that is, of providing a positive result when in fact the driver has a high level of alcohol in his/her blood or a negative result when it is below the acceptable limit. A second test is only applied to the suspected cases. This never fails if the driver is not drunk, but has only a 10% chance of error with drunk drivers. If 25% of all the drivers stopped by the police are above the limit, calculate:

(a) the proportion of drivers stopped that have to be submitted to a second test;

(b) the posterior probability that a driver indicated by the two tests really has a high level of alcohol in his blood;

(c) the proportion of drivers that will be submitted only to the first test.
9. (DeGroot, 1970, p. 152) The random variables $X_1, \ldots, X_k$ are such that $k - 1$ of them have probability function $h$ and one has probability function $g$. $X_j$ will have the probability function $g$ with probability $\alpha_j$, $j = 1, \ldots, k$, where $\alpha_j \geq 0$, $\forall j$, and $\sum_{j=1}^{k} \alpha_j = 1$. What is the probability that $X_1$ has probability function $g$ given that:

(a) $X_1 = x$ was observed?

(b) $X_i = x$, $i \neq 1$, was observed?

10. Let $X \mid \theta, \mu \sim N(\theta, \sigma^2)$, $\sigma^2$ known, $\theta \mid \mu \sim N(\mu, \tau^2)$, $\tau^2$ known, and $\mu \sim N(0, 1)$. Obtain the following distributions:

(a) $(\theta \mid x, \mu)$; (b) $(\mu \mid x)$; (c) $(\theta \mid x)$.
11. Let $(X \mid \theta) \sim N(\theta, 1)$ be observed. Suppose that your prior is such that $\theta$ is $N(\mu, 1)$ or $N(-\mu, 1)$ with equal probabilities. Write the prior distribution and find the posterior after observing $X = x$. Show that

$$\mu' = E(\theta \mid x) = \frac{x}{2} + \frac{\mu}{2} \cdot \frac{1 - \exp(-\mu x)}{1 + \exp(-\mu x)}$$

and draw a graph of $\mu'$ as a function of $x$.
12. The standard Cauchy density function is $p(x \mid \theta) = (1/\pi)\{1/[1 + (x - \theta)^2]\}$; it is similar to the $N(\theta, 1)$ density and can be used in its place in many applications. Find the modal equation (the first order condition for the maximum) of the posterior, supposing that the prior is $p(\theta) = 1/[\pi(1 + \theta^2)]$.

(a) Solve it for $x = 0$ and $x = 3$.

(b) Compare with the results obtained assuming that $(x \mid \theta) \sim N(\theta, 1)$ and $\theta \sim N(0, 1)$.

13. Assume that an observation vector $X$ has multivariate normal distribution, introduced in Chapter 1, with mean vector $\mu$ and variance-covariance matrix $\Sigma$. Assuming that $\Sigma$ is known and the prior distribution is $\mu \sim N(\mu_0, B_0)$, obtain the posterior distribution for $\mu$.
§ 2.4

14. Let $X = (X_1, \ldots, X_n)$ be an exchangeable sample of 0-1 observations. Show that

(a) $E[X_i] = E[X_j]$, $\forall i, j = 1, \ldots, n$;

(b) $V[X_i] = V[X_j]$, $\forall i, j = 1, \ldots, n$;

(c) $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_k, X_l)$, $\forall i \neq j$, $k \neq l$, $i, j, k, l = 1, \ldots, n$.

15. Let $X = (X_1, \ldots, X_n)$ be an exchangeable sample of 0-1 observations and $T = \sum_{i=1}^{n} X_i$. Show that

(a) $P(T = t) = \int_0^1 \binom{n}{t} \theta^t (1 - \theta)^{n - t} p(\theta)\, d\theta$, $t = 0, 1, \ldots, n$;
(b) $E(T) = nE(\theta)$.

Hint: in (b), use the definition of $E(T)$ and exchange the summation and integration signs.

16. Let $\theta_1, \ldots, \theta_k$ be the probabilities that patients $I_1, \ldots, I_k$ have disease B. After summarizing all the available information, the doctor concludes that:

(a) The patients can be divided in two groups, the first containing the patients $I_1, \ldots, I_j$, $j < k$, and the second with patients $I_{j+1}, \ldots, I_k$.

(b) The patients in the same group are similar, that is to say, they are indistinguishable with respect to $\theta$.

(c) There is no relevant information relating these two groups.

Use the idea of partial exchangeability to construct a prior distribution for $\theta = (\theta_1, \ldots, \theta_k)$. If, instead of (c), there was information relating the two groups, what modifications would this imply in the prior for $\theta$?

§ 2.5
17. Let $X = (X_1, \ldots, X_n)$ be a random sample from $U(\theta_1, \theta_2)$, that is,

$$p(x \mid \theta_1, \theta_2) = \frac{1}{(\theta_2 - \theta_1)^n}, \quad \theta_1 \leq x_i \leq \theta_2, \quad i = 1, \ldots, n.$$

Let $T(X) = (X_{(1)}, X_{(n)})$. Obtain its joint distribution and show that it is a sufficient statistic for $\theta = (\theta_1, \theta_2)$.
18. Let $X$ be a random sample from $p(x \mid \theta)$. Show that if $T = T(X)$ is a sufficient statistic for $\theta$ and $S(X)$ is a 1-to-1 function of $T$, then $S(X)$ is also a sufficient statistic for $\theta$.

19. Let $X_1, \ldots, X_n$ be a random sample from $p(x \mid \theta_1, \theta_2)$. Show that if $T_1$ is sufficient for $\theta_1$ when $\theta_2$ is known and $T_2$ is sufficient for $\theta_2$ when $\theta_1$ is known, then $T = (T_1, T_2)$ is sufficient for $\theta = (\theta_1, \theta_2)$.
20. Verify whether the following distributions belong to the exponential family. If so, determine the functions $a$, $b$, $u$ and $\phi$.

(a) bin($n$, $\theta$), $n$ known; (b) exp($\theta$); (c) $G(\alpha, \beta)$; (d) beta($\alpha$, $\beta$); (e) $N(\mu, \sigma^2)$, $\sigma^2$ known.
21. Which of the following distribution families are members of the exponential family? Obtain the minimal sufficient statistic for those belonging to the exponential family.

(a) $p(x \mid \theta) = 1/9$, $x \in \{0.1 + \theta, \ldots, 0.9 + \theta\}$;

(b) the family of $N(\theta, \theta^2)$ distributions;

(c) the family of $N(\theta, \theta)$ distributions, with $\theta > 0$;

(d) $p(x \mid \theta) = 2(x + \theta)/(1 + 2\theta)$, $x \in (0, 1)$, $\theta > 0$;

(e) the distribution family of $X \mid X \neq 0$, where $X \sim$ bin($n$, $\theta$);

(f) $f(x \mid \theta) = \theta/(1 + x)^{1 + \theta}$, $x \in \mathbb{R}^+$;

(g) $f(x \mid \theta) = \theta^x \log \theta/(\theta - 1)$, $x \in (0, 1)$;

(h) $f(x \mid \theta) = (1/2) \exp(-|x - \theta|)$, $x \in \mathbb{R}$.
22. Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$, with $\sigma^2$ unknown. Show, using the classical definition, that the sample mean $\bar{X}$ is a sufficient statistic for $\mu$.

Hint: it is enough to show that $E(X \mid \bar{X})$ and $V(X \mid \bar{X})$ are not functions of $\mu$. Why?

23. Let $(X_1, X_2, X_3)$ be a vector with probability function

$$\frac{n!}{\prod_{i=1}^{3} x_i!} \prod_{i=1}^{3} p_i^{x_i}, \quad x_i \geq 0,$$

where $p_1 = \theta^2$, $p_2 = 2\theta(1 - \theta)$, $p_3 = (1 - \theta)^2$ and $0 \leq \theta \leq 1$.

(a) Verify whether this distribution belongs to the exponential family with $k$ parameters. If this is true, what is the value of $k$?

(b) Obtain the minimal sufficient statistic for $\theta$.

24. Using the same notation adopted for the one-parameter exponential family,

(a) show that
$$E[U(X)] = -\frac{b'(\theta)}{\phi'(\theta)} \quad \text{and} \quad V[U(X)] = \frac{b'(\theta)\phi''(\theta) - \phi'(\theta)b''(\theta)}{[\phi'(\theta)]^3}.$$

Hint: from the relationship $\int p(x \mid \theta)\, dx = 1$, differentiate both sides with respect to $\theta$.

(b) Verify that the result in (a) is correct for the case where $X \sim$ exp($\theta$), by direct evaluation of $E[U(X)]$ and $V[U(X)]$.

25. Show that information equivalence defines an equivalence relation on the elements of the sample space.

26. Let $X_1, X_2, X_3$ be iid Bernoulli variables with success probability $\theta$ and define $T = T(X) = \sum_{j=1}^{3} X_j$, $T_1 = X_1$ and $T_2 = (T, T_1)$. Note that the sample space is $S = \{0, 1\}^3$.

(a) Obtain the partitions induced by $T$, $T_1$ and $T_2$.

(b) Show that $T_2$ is a sufficient statistic.

(c) Prove that $T$ is a minimal sufficient statistic for $\theta$ but $T_2$ is not, by showing that $T$ induces a minimal sufficient partition on $S$ but $T_2$ does not.

27. Consider a sample $X = (X_1, \ldots, X_n)$ from a common density $p(x \mid \theta)$ and let $T$ be the vector of order statistics from the sample.
(a) Prove that the sample $X$ is always sufficient for $\theta$.

(b) Obtain the factor $g(x)$ in the factorization criterion for the sufficient statistic $T$.
28. Consider a sample $X = (X_1, \ldots, X_n)$ from a uniform distribution on the interval $[\theta_1, \theta_2]$, $\theta_1 < \theta_2$, so that $\theta = (\theta_1, \theta_2)$.

(a) Show that this distribution does not belong to the exponential family.

(b) Obtain a sufficient statistic of fixed size.

(c) Specialize the results above for the cases where $\theta_1$ is known and where $\theta_2$ is known.

29. Consider a sample $X = (X_1, \ldots, X_n)$ from a $t_\nu(\mu, \sigma^2)$ distribution, with $\theta = (\nu, \mu, \sigma^2)$.

(a) Show that this distribution does not belong to the exponential family.

(b) Show that it is not possible to obtain a sufficient statistic for $\theta$ of fixed size.

(c) Show that the results above are retained even for the cases when some of the components of $\theta$ are known.
§ 2.6

30. Let $X$ and $Y$ be independent random variables, Poisson distributed with parameters $\theta$ and $\phi$, respectively, and suppose that the prior distribution is $p(\theta, \phi) = p(\theta)p(\phi) \propto k$. Let $\psi = \theta/(\theta + \phi)$ and $\xi = \theta + \phi$ be a parametric transformation of $(\theta, \phi)$.

(a) Obtain the prior for $(\psi, \xi)$.

(b) Show that $\psi \mid x, y \sim$ beta($x + 1$, $y + 1$) and $\xi \mid x, y \sim G(x + y + 2, 1)$ are independent.

(c) Show that the conditional distribution of $X$ given $X + Y$ depends only on $\psi$, that is, $p(x \mid x + y, \psi, \xi) = p(x \mid x + y, \psi)$, and that the distribution of $X + Y$ depends only on $\xi$.

(d) Show that $X + Y$ is a sufficient statistic for $\xi$, $X$ is a sufficient statistic for $\psi$, given $X + Y$, and that $(X, X + Y)$ is a sufficient statistic for $(\psi, \xi)$.

(e) Obtain the marginal likelihoods of $\psi$ and $\xi$.

(f) To make an inference about $\xi$, a statistician decides to use the fact presented in item (d). Show that the posterior is identical to that obtained in (b). Does it mean that $X + Y$ does not contain information about $\psi$?

31. Suppose that $X$ has density $f(x \mid \theta)$ where $\theta = (\theta_1, \theta_2, \theta_3)$ and the prior for $\theta$ is built up as $p(\theta) = g(\theta_1, \theta_2 \mid \theta_3)\, h(\theta_3)$, where $g$ and $h$ are densities. Obtain the marginal likelihood $f(x \mid \theta_2, \theta_3)$ as a function of $f$, $g$ and $h$.

32. Let $X = (X_1, X_2)$ where $X_1 = (X_{11}, \ldots, X_{1m})$ and $X_2 = (X_{21}, \ldots, X_{2n})$ are samples from the exp($\theta_1$) and exp($\theta_2$) distributions, respectively. Suppose that independent $G(a_i, b_i)$ priors are assumed, $i = 1, 2$, and define $\psi = \theta_1/\theta_2$.
(a) Obtain the distribution of $(\theta_1, \theta_2)$ given $X = x$.

(b) Repeat item (a), assuming now that $a_i, b_i \to 0$, $i = 1, 2$.

(c) Using the posterior obtained in item (b), show that

$$\psi \frac{\bar{X}_1}{\bar{X}_2} \,\Big|\, x \sim F(2m, 2n), \quad \text{where} \quad \bar{X}_1 = \frac{1}{m} \sum_{j=1}^{m} X_{1j} \quad \text{and} \quad \bar{X}_2 = \frac{1}{n} \sum_{j=1}^{n} X_{2j}.$$
Hint: complete the transformation with $\psi_1 = \theta_2$.

33. Let $(X_1, X_2, X_3)$ be a random vector with trinomial distribution with parameter $\theta = (\theta_1, \theta_2, \theta_3)$, where $\theta_3 = 1 - \theta_1 - \theta_2$, and assume that the prior for $\theta$ is constant.
1/J =
(b) Obtain the marginal likelihood of (c) Show that
Xt + Xz is
Bt + B2 and obtain their priors.
1/J.
a sufficient statistic for
1/J.
34. A machine emits particles following a Poisson process with mean intensity of $\lambda$ particles per unit time. Each particle generates a $N(\theta, 1)$ signal. A signal detector registers the particles. Unfortunately, the detector only registers the positive signals (making it impossible to observe the number $n$ of emitted particles).

(a) Obtain the distribution of $k \mid n$, where $k$ is the number of particles registered.

(b) Show that the likelihood $l(\theta, \lambda)$, based on the observation of just one registered signal ($k = 1$) assuming the value $x_1 > 0$ during a unit interval, is given by

$$\phi(x_1 - \theta)\, \Lambda(\theta) \exp\{-\Lambda(\theta)\}, \quad \text{with } \Lambda(\theta) = \lambda \Phi(\theta),$$

where $\phi$ and $\Phi$ are the density and the cumulative distribution function of the standard normal.

Hint: obtain the joint distribution of $(x_1, k, n)$ and eliminate $n$ by integration.

(c) Obtain the profile likelihood of $\theta$, that is, $l(\theta, \hat{\lambda}(\theta))$, where $\hat{\lambda}(\theta)$ maximizes the likelihood of $\lambda$ supposing $\theta$ known.

(d) Supposing that the prior is $p(\theta, \lambda) \propto k$, obtain the marginal likelihood of $\theta$.
3 Prior distribution

In this chapter, different specification forms of the prior distribution will be discussed. Apart from the interpretation of probability, this is the only novelty introduced by the Bayesian analysis, relative to the frequentist approach. It can be seen as an element implied from exchangeability by de Finetti's representation theorem. It is determined in a subjective way, although it is not forbidden to use past experimental data to set it. The only requirement is that this distribution should represent the knowledge about $\theta$ before observing the results of the new experiment. In this chapter, alternative forms of assessing the prior distribution will be discussed. In Section 3.1 entirely subjective methods for direct assessment of the prior will be presented. An indirect approach, via functional forms, is discussed in Section 3.2. The parameters of those functional forms, known as hyperparameters, must be specified in correspondence with the subjective information available. The conjugate distribution will be introduced in Section 3.3 and the most common families will be presented in Section 3.4. The concept of reference prior and different forms of building up non-informative priors will be presented in Section 3.5. Finally, hierarchical prior specification will be discussed in Section 3.6.
3.1 Entirely subjective specification

Let $\theta$ be an unknown quantity and consider its possible values. If it is discrete, a prior probability for each possible value of $\theta$ can be evaluated directly. Also, one may use some auxiliary tools, like lotteries or roulettes, as described in Chapter 1. De Finetti (1974) characterizes subjective probability through the consideration of betting and scoring rules. The continuous case is slightly more complicated. Some suggestions are:

1. The histogram approach: first, the range of values of $\theta$ is divided into intervals, and prior probabilities for $\theta$ belonging to each interval are specified, as in the discrete case. Hence, a histogram for $\theta$ is built up and a smooth curve can be fitted to obtain the prior density of $\theta$. Note that the number of intervals involved is arbitrarily chosen. Although the probabilities in the tails of the prior distribution are often very small, they can influence the subsequent inference. This is a relevant aspect in prior elicitation that deserves
some caution. Figure 3.1 shows one such elicitation exercise for a positive quantity $\theta$.

Fig. 3.1 Histogram representing (subjective) probabilities of the intervals $I_1$, $I_2$, $I_3$, $I_4$, $I_5$ and $I_6$ with a fitted density.
2. The distribution function approach: first, let us define percentiles. $z_\alpha$ is the $100\alpha\%$ percentile ($\alpha$ quantile) of $X$ if $P(X \leq z_\alpha) = \alpha$, $\alpha \in [0, 1]$. The median of $X$, denoted by $m$, is the 50% percentile, that is, $P(X \leq m) = 0.5$. The collection of all percentiles of $X$ describes the distribution function of $X$. In this approach, some percentiles are subjectively assessed, as in the discrete case, and later a smooth curve is fitted to the distribution function of $\theta$, as in Figure 3.2. This approach is less used since it is easier to identify a distribution through its density than through its distribution function.
3. Relative likelihood approach: the procedure is similar to the histogram approach but, instead of intervals, it evaluates the relative chances of isolated points. Even though $\Pr(\theta = \theta_0) = 0$, $\forall \theta_0 \in \Theta$, this can be done because

$$\Pr(\theta = \theta_0 \mid \theta = \theta_0 \text{ or } \theta = \theta_1) = \frac{p(\theta_0)}{p(\theta_0) + p(\theta_1)},$$

where $p(\theta)$ is the prior density of $\theta$. Then a set of values proportional to the prior density of $\theta$ can be evaluated. For example, if $\theta = 2$ is three times more probable than $\theta = 1$ and $\theta = 3$ is twice as likely as $\theta = 1$, then we have $p(2) = 3p(1) = 1.5p(3)$. Again a smooth curve can be fitted to these points. One problem still outstanding is the evaluation of the
normalization constant. Note that every density must integrate to 1 and, by construction, this curve does not necessarily satisfy this requirement. These concepts are well described in the book by Berger (1985).

Fig. 3.2 Distribution function fitted to the quartiles $z_{0.25}$, $z_{0.5}$ and $z_{0.75}$.
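The normalization step can be carried out numerically. In the following sketch (the elicited values and the interpolation scheme are our own illustration, using the ratios from the example above), the fitted relative-likelihood curve is rescaled so that it integrates to 1:

```python
import numpy as np

# Elicited relative densities: theta = 2 is 3x as probable as theta = 1,
# theta = 3 is 2x as probable as theta = 1 (illustrative support points)
support = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
relative = np.array([0.0, 1.0, 3.0, 2.0, 0.0])   # proportional to p(theta)

grid = np.linspace(0, 4, 4001)
curve = np.interp(grid, support, relative)        # crude "smooth curve" by interpolation

h = grid[1] - grid[0]
area = np.sum((curve[1:] + curve[:-1]) * h / 2)   # trapezoidal rule
density = curve / area                            # now integrates to 1

check = np.sum((density[1:] + density[:-1]) * h / 2)
assert abs(check - 1.0) < 1e-9
```

The elicited ratios are preserved by the rescaling; only the overall level changes.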
3.2
Specification through functional forms
The prior knowledge about $\theta$ can be used to specify a prior density with a particular functional form. A parametric family of densities can, for instance, be defined. Although very often this family can make the analysis easier, one must be careful and make sure that the chosen density really represents the available information. For example, we could make the following assumptions about $\theta$:

• $\theta$ is symmetrically distributed with respect to the mode;

• its density decays fast (say, exponentially) when far away from the mode;

• intervals far from the mode have irrelevant probabilities.

These considerations can characterize, at least approximately, a normal distribution with parameters, generically called hyperparameters, determined in correspondence with the information expressed in $H$. These ideas may be put in a more general framework and have led to a systematic approach to the determination of prior distributions. The most relevant case corresponds to a conjugate family of distributions.
We have seen in Theorem 2.1 that if the observational distribution is $(X \mid \theta) \sim N(\theta, \sigma^2)$ and the prior is $\theta \sim N(\mu, \tau^2)$, then the posterior distribution is also normal, with mean $\mu_1$ and variance $\tau_1^2$. So, if we start with a normal prior we end up with a normal posterior. The main advantage of this approach is the ease of the resulting analysis. Among other things, this allows for the possibility of exploring the sequential aspect of the Bayesian paradigm. Every new normal observation that is obtained only leads to changes in the parameters of the (new) posterior distribution. No new analytic calculations are required.
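A minimal sketch of this sequential updating (our own illustration; the precision-weighted update formulas are the standard normal-normal conjugate results):

```python
import numpy as np

def update(mu, tau2, x, sigma2):
    """One conjugate normal-normal update: prior N(mu, tau2), observation x ~ N(theta, sigma2).
    Posterior precision is the sum of the precisions; the mean is precision-weighted."""
    prec = 1 / tau2 + 1 / sigma2
    mu1 = (mu / tau2 + x / sigma2) / prec
    return mu1, 1 / prec

# Processing observations one at a time ...
mu, tau2 = 0.0, 4.0
data, sigma2 = [1.2, 0.8, 1.5], 1.0
for x in data:
    mu, tau2 = update(mu, tau2, x, sigma2)

# ... gives the same posterior as a single update with the sample mean,
# since the mean of n observations has variance sigma2/n.
n = len(data)
mu_batch, tau2_batch = update(0.0, 4.0, float(np.mean(data)), sigma2 / n)
assert np.isclose(mu, mu_batch) and np.isclose(tau2, tau2_batch)
```

Each intermediate posterior serves as the prior for the next observation, which is exactly the sequential use of Bayes' theorem described above.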
Definition. Let $\mathcal{F} = \{p(x \mid \theta), \theta \in \Theta\}$ be a family of sampling or observational distributions. A class $P$ of distributions is said to be a conjugate family with respect to $\mathcal{F}$ if, for all $p(x \mid \theta) \in \mathcal{F}$ and $p(\theta) \in P$, $p(\theta \mid x) \in P$.
Thus, we can say that the class of normal distributions is a conjugate family with respect to the class of normal (sampling) distributions. Some caution is necessary when using the notion of conjugacy:

• The class $P$ can be very broad. For example, take $P = \{\text{all distributions}\}$ and $\mathcal{F}$ to be any family of sampling distributions. It is easy to see that $P$ is conjugate with respect to $\mathcal{F}$, since any posterior will be a member of $P$. In this context, the definition of conjugacy does not have any practical appeal and is useless.

• The class $P$ could be very narrow. Suppose, for example, that $P = \{p : p(\theta = \theta_0) = 1, \theta_0 \in A\}$ for some non-null set $A$. This means that $P$ consists only of distributions concentrated on a single point. Whatever the information provided by the sample, the posterior distribution would be the same as the prior because if we know, a priori, that $\theta = \theta_0$ with certainty, nothing will remove this certainty. That is,

$$p(\theta \mid x) \propto l(\theta)\, p(\theta) = \begin{cases} l(\theta) \times 1, & \text{if } \theta = \theta_0 \\ l(\theta) \times 0, & \text{if } \theta \neq \theta_0. \end{cases}$$

Then it follows that

$$p(\theta \mid x) = \begin{cases} k\, l(\theta), & \text{if } \theta = \theta_0 \\ 0, & \text{if } \theta \neq \theta_0. \end{cases}$$

As $\int p(\theta \mid x)\, d\theta = 1$, we must have $p(\theta \mid x) = 1$ if and only if (iff, in short) $\theta = \theta_0$. Hence, $P$ is conjugate for any distribution family and again the definition is not helpful.

The last consideration illustrates, in an extreme situation, another very important aspect of prior specification. When a null probability is given to a particular subset of the possible values of $\theta$, no observed information will change this specification, even if it is proved to be obviously inadequate. In order to avoid this incoherence, it is strongly recommended that the statistician always associates a non-null prior probability with every possible value of $\theta$, even if some of them are judged very unlikely. Dennis Lindley refers to this recommendation as Cromwell's rule. Therefore the class $P$ must be broad enough to ensure elicitation of the convenient prior and, at the same time, restricted enough in order that the definition be useful. A general procedure for obtaining conjugate prior families is illustrated in the following example.
Example (Bernoulli trials). Let $(X_i \mid \theta) \sim$ Ber($\theta$), $i = 1, \ldots, n$. The joint sampling density is

$$p(x \mid \theta) = \theta^t (1 - \theta)^{n - t}, \quad \text{where } t = \sum_{i=1}^{n} x_i, \quad x_i = 0, 1, \ i = 1, \ldots, n,$$

defining a class of distributions parameterized by $\theta \in (0, 1)$. From Bayes' theorem, we know that the posterior density of $\theta$ given $x$ is given by

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) \propto \theta^t (1 - \theta)^{n - t}\, p(\theta).$$

It is worth pointing out that $p(\theta)$ and $p(\theta \mid x)$ are related through the likelihood function. The conjugate prior can then be obtained by mimicking the kernel of the likelihood function. In this example the likelihood kernel is of the form $\theta^a (1 - \theta)^b$. This is also the kernel of the beta family of distributions, introduced in Chapter 1. Taking the prior distribution as a beta($\alpha$, $\beta$) and combining with the likelihood, the posterior distribution is

$$p(\theta \mid x) \propto \theta^t (1 - \theta)^{n - t}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \propto \theta^{\alpha + t - 1} (1 - \theta)^{\beta + n - t - 1}.$$

Therefore, $(\theta \mid x) \sim$ beta($\alpha + t$, $\beta + n - t$), which belongs to the same family of distributions used for the prior. So, the beta family is conjugate with respect to the Bernoulli sampling model. It is not difficult to show that the same is true for binomial, geometric and negative binomial sampling distributions. The proportionality constant for the posterior density is given by $1/B(\alpha + t, \beta + n - t)$.
1/ B(a +t, fJ +n -t).
We can now discuss the setting of the conjugate prior family from a practical point of view for the general case of any given random sample. Let X = (X_1, ..., X_n) be a sample from p(x | θ), θ ∈ Θ, and consider the density function of X, p(x | θ). The family P is said to be closed under multiplication or sampling if, for all p_1, p_2 ∈ P, there is a k such that k p_1 p_2 ∈ P.
Prior distribution

Example. Let P be the class of gamma distributions and let p_i, i = 1, 2, denote gamma densities with parameters (a_i, b_i), i = 1, 2. Then

p_1 × p_2 = [b_1^{a_1}/Γ(a_1)] x^{a_1−1} e^{−b_1 x} × [b_2^{a_2}/Γ(a_2)] x^{a_2−1} e^{−b_2 x} ∝ x^{a_1+a_2−2} e^{−(b_1+b_2)x},

which is proportional to another gamma density, with parameters a_1 + a_2 − 1 and b_1 + b_2. It can be shown that the same result is true for the class of beta distributions.

Closure under multiplication is very important in the search for conjugate families. If the kernel of the likelihood can be identified with the kernel (of a member) of a given family of distributions and this family is closed under multiplication, then prior and posterior will necessarily belong to the same family and conjugacy is obtained. The concept of conjugacy was formalized by Raiffa and Schlaifer (1961). They also studied many of the families that are presented in the next section.

With the definition of closure under sampling in hand, it becomes easy to specify a procedure to determine the conjugate class. It consists of
1. identifying the class P of distributions for θ whose members are proportional to l(θ; x);
2. verifying whether P is closed under sampling.

If, in addition, for any given likelihood l(θ; x) obtained from a family F, there exists a constant k defining a density p as p(θ) = k l(θ; x), then the family P of all such densities p is said to be a natural conjugate family with respect to the sampling model with likelihood function l.
Example (continued). Setting k^{−1} = B(t + 1, n − t + 1) implies that

k l(θ; x) = θ^t (1 − θ)^{n−t} / B(t + 1, n − t + 1),

which has the form of a beta(t + 1, n − t + 1) density. Therefore, the class of beta distributions with integer parameters is a natural conjugate family to the Bernoulli sampling model. Nothing substantial is lost, however, if this class is enlarged to the class of all beta distributions, including all positive values for the parameters. This new class, strictly speaking, is no longer a natural conjugate family. Nevertheless it keeps the essence of the definition and is used in practice.

Natural conjugate families are especially useful because an objective meaning can be attributed to the hyperparameters involved. Revisiting the above example, suppose that n_0 hypothetical (or not) trials were previously made, with t_0 successes. Then, the likelihood l′ of this hypothetical experiment would be l′(θ) ∝ θ^{t_0}(1 − θ)^{n_0−t_0}. If our (subjective) prior information is equivalent to that provided by the experiment described above, the prior for θ will be a beta with hyperparameters t_0 + 1 and n_0 − t_0 + 1.
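This beta–Bernoulli update is easy to check numerically. The sketch below is ours, not the book's (the function name is an assumption for illustration): it maps a beta(α, β) prior and a 0/1 sample to the beta(α + t, β + n − t) posterior.

```python
def beta_bernoulli_update(alpha, beta, data):
    """beta(alpha, beta) prior + Bernoulli 0/1 sample -> beta(alpha + t, beta + n - t)."""
    t = sum(data)                  # number of successes
    n = len(data)                  # number of trials
    return alpha + t, beta + n - t

# uniform beta(1, 1) prior combined with 7 successes in 10 trials
print(beta_bernoulli_update(1, 1, [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]))  # (8, 4)
```

Updating one observation at a time gives the same answer as a single batch update, which is another way of seeing the closure of the beta family under sampling.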
3.3 Conjugacy with the exponential family
The one-parameter exponential family includes many of the most common probability distributions. An essential characteristic of this family is that there exists a sufficient statistic with fixed dimension. Recall that its density has the form p(x | θ) = a(x) exp{u(x)φ(θ) + b(θ)}. The conjugate class P to the one-parameter exponential family is easy to characterize. Following the reasoning behind natural conjugacy, it is not difficult to see that members of this class have density

p(θ) ∝ exp{αφ(θ) + βb(θ)}

and so

p(θ | x) ∝ exp{[α + u(x)]φ(θ) + [β + 1]b(θ)}.

Denoting by k(α, β) the constant involved in the definition of p(θ), the constant associated to p(θ | x) will be k(α + u(x), β + 1). Using k as defined above, it is easy to obtain p(x) without explicitly calculating ∫ p(x | θ)p(θ) dθ. From the equation p(x) p(θ | x) = p(x | θ) p(θ), it follows that

p(x) = p(x | θ) p(θ) / p(θ | x).

Substituting the densities previously obtained we get

p(x) = [ a(x) exp{u(x)φ(θ) + b(θ)} k(α, β) exp{αφ(θ) + βb(θ)} ] / [ k(α + u(x), β + 1) exp{[α + u(x)]φ(θ) + [β + 1]b(θ)} ]

and after some simplification we arrive at

p(x) = a(x) k(α, β) / k(α + u(x), β + 1).
A straightforward extension of the Bernoulli example is the binomial model. In that case, it follows that

p(x) = [ C(n, x) θ^x (1 − θ)^{n−x} B^{−1}(α, β) θ^{α−1}(1 − θ)^{β−1} ] / [ B^{−1}(α + x, β + n − x) θ^{α+x−1}(1 − θ)^{β+n−x−1} ]
     = C(n, x) B(α + x, β + n − x) / B(α, β),  for x = 0, 1, ..., n,  n ≥ 1.

This is the beta-binomial distribution. In general, from a sample of size n of the exponential family we obtain the joint density

p(x | θ) = [ ∏_{i=1}^n a(x_i) ] exp{ [ Σ_{i=1}^n u(x_i) ] φ(θ) + n b(θ) }.
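The beta-binomial predictive derived above can be evaluated directly. The sketch below is ours (function names are assumptions), using log-gamma values for numerical stability:

```python
from math import comb, lgamma, exp

def log_beta_fn(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(x, n, alpha, beta):
    """Predictive p(x) = C(n, x) B(alpha + x, beta + n - x) / B(alpha, beta)."""
    return comb(n, x) * exp(log_beta_fn(alpha + x, beta + n - x) - log_beta_fn(alpha, beta))

# the predictive probabilities over x = 0, ..., n sum to one
print(sum(beta_binomial_pmf(x, 10, 2.0, 3.0) for x in range(11)))
```

For n = 1 and a uniform beta(1, 1) prior, the predictive puts probability 1/2 on each outcome, as intuition suggests.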
The use of a conjugate prior p(θ) = k(α, β) exp{αφ(θ) + βb(θ)} leads to the posterior density

p(θ | x) = k(α + Σu(x_i), β + n) exp{ [α + Σu(x_i)]φ(θ) + [β + n]b(θ) }

and the marginal or predictive distribution is

p(x) = [ ∏ a(x_i) ] k(α, β) / k(α + Σu(x_i), β + n).
3.4 The main conjugate families

The main members of the exponential family will be presented in this section. The results obtained previously will be applied to these particular cases and the resulting conjugate families obtained. Some of these families were presented before.
3.4.1 Binomial distribution

The family of beta distributions is conjugate to the binomial (or Bernoulli) model, as we have shown above.
3.4.2 Normal distribution with known variance

Theorem 2.1 stated that the normal distribution family is conjugate to the normal model, based on a single observation. For the case of a sample of size n, we have seen in Section 2.6 that

l(θ; x) ∝ exp{ −(n/2σ²)(x̄ − θ)² },

where the terms involving σ² were incorporated into the proportionality constant. So, the likelihood has the same form as that based on a single observation x, substituting x by x̄ and σ² by σ²/n. Another way to say this is to note that X̄ is a sufficient statistic for θ, and so the likelihood based on the observed value of X̄, which is distributed as N(θ, σ²/n), is proportional to the likelihood obtained with the individual observations X. Therefore, the result presented in Theorem 2.1 is still true, with the substitutions mentioned above; i.e., the posterior distribution of θ given x is N(μ_1, τ_1²), with

μ_1 = (nσ^{−2} x̄ + τ^{−2} μ) / (nσ^{−2} + τ^{−2})   and   τ_1^{−2} = nσ^{−2} + τ^{−2}.
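In code, the posterior parameters are a precision-weighted average. The sketch below is ours (names are assumptions; mu and tau2 are the prior mean and variance):

```python
def normal_posterior_known_var(mu, tau2, xbar, sigma2, n):
    """N(mu, tau2) prior + n observations from N(theta, sigma2) -> N(mu1, tau1^2)."""
    prec1 = n / sigma2 + 1.0 / tau2              # tau1^{-2} = n*sigma^{-2} + tau^{-2}
    mu1 = ((n / sigma2) * xbar + mu / tau2) / prec1
    return mu1, 1.0 / prec1

# prior N(0, 1); four observations with mean 2 and known sigma^2 = 1
print(normal_posterior_known_var(0.0, 1.0, 2.0, 1.0, 4))  # (1.6, 0.2)
```

As n grows, the data precision n/σ² dominates and the posterior mean approaches x̄, as the formula makes explicit.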
3.4.3 Poisson distribution

Suppose that X = (X_1, ..., X_n) is a random sample from the Poisson distribution with parameter θ, denoted Pois(θ). Its joint probability function is

p(x | θ) = e^{−nθ} θ^{Σx_i} / ∏_{i=1}^n x_i!

and the likelihood function assumes the form

l(θ | x) ∝ e^{−nθ} θ^{Σx_i}.

Its kernel has the form θ^a e^{−bθ}, characterizing a gamma family of distributions. We have already seen in the previous section that the gamma family is closed under sampling. Then the conjugate prior distribution of θ will be θ ~ G(α, β). The posterior density will be

p(θ | x) ∝ θ^{α+Σx_i−1} exp{−(β + n)θ},

corresponding to the G(α + Σx_i, β + n) density. The calculation of the predictive distribution, using the method described before, is left as an exercise.
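A numerical sketch of this Poisson–gamma update (the function name is our assumption); the posterior mean (α + Σx_i)/(β + n) shrinks the sample mean towards the prior mean α/β:

```python
def poisson_gamma_update(alpha, beta, counts):
    """G(alpha, beta) prior + Poisson counts -> G(alpha + sum x_i, beta + n)."""
    return alpha + sum(counts), beta + len(counts)

a1, b1 = poisson_gamma_update(2.0, 1.0, [3, 0, 4])
print(a1, b1, a1 / b1)   # posterior parameters and posterior mean
```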
3.4.4 Exponential distribution

Suppose that X = (X_1, ..., X_n) is a random sample from the exponential distribution with parameter θ, denoted by Exp(θ). Its joint density function is

p(x | θ) = θ^n exp{ −θ Σ_{i=1}^n x_i }.

The form of the likelihood allows recognition of the kernel of the gamma family as a conjugate distribution for θ. Assuming a G(α, β) prior, the posterior will have the form

p(θ | x) ∝ θ^n exp{−θ Σx_i} θ^{α−1} exp{−βθ} ∝ θ^{α+n−1} exp{−(β + Σx_i)θ},

which is the density of a G(α + n, β + Σx_i) distribution.

3.4.5 Multinomial distribution

Denote by X = (X_1, ..., X_p) and θ = (θ_1, ..., θ_p), respectively, the number of observed cases and the probabilities associated with each of p categories in a sample of size n. Assume that the following constraints are true: Σ_{i=1}^p X_i = n and Σ_{i=1}^p θ_i = 1. X is said to have a multinomial distribution with parameters n and (θ_1, ..., θ_p). The joint probability function of the p counts X is

p(x | θ) = [ n! / (x_1! ··· x_p!) ] ∏_{i=1}^p θ_i^{x_i}.

It is not difficult to show that this distribution also belongs to the exponential family. The likelihood function for θ is l(θ) ∝ ∏_{i=1}^p θ_i^{x_i}. Its kernel is the same as the kernel of the density of a Dirichlet distribution. The Dirichlet family with integer parameters a_1, ..., a_p is natural conjugate with respect to the multinomial sampling distribution. Again, little is lost by extending natural conjugacy over all Dirichlet distributions. The posterior distribution will then be

p(θ | x) ∝ [ ∏_{i=1}^p θ_i^{x_i} ] [ ∏_{i=1}^p θ_i^{a_i−1} ] = ∏_{i=1}^p θ_i^{x_i+a_i−1}

and, as anticipated, this posterior is also a Dirichlet distribution with parameters a_1 + x_1, ..., a_p + x_p, which is denoted by (θ | x) ~ D(a_1 + x_1, ..., a_p + x_p). From the above results about the Dirichlet, it is not difficult to obtain the proportionality constant as

Γ(a_0 + n) / ∏_{i=1}^p Γ(a_i + x_i),  where a_0 = Σ_{i=1}^p a_i.

This conjugate analysis generalizes the analysis for binomial samples with beta priors.
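The Dirichlet–multinomial update is just a coordinate-wise addition of counts. A sketch (function names are our own):

```python
def dirichlet_multinomial_update(a, x):
    """Dirichlet(a_1, ..., a_p) prior + counts (x_1, ..., x_p) -> Dirichlet(a_i + x_i)."""
    return [ai + xi for ai, xi in zip(a, x)]

def dirichlet_mean(a):
    """Posterior mean of each category probability: a_i / sum(a)."""
    s = sum(a)
    return [ai / s for ai in a]

post = dirichlet_multinomial_update([1.0, 1.0, 1.0], [2, 3, 5])
print(post, dirichlet_mean(post))
```

With p = 2 this reduces to the beta–binomial update of Section 3.4.1, which is the generalization noted in the text.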
3.4.6 Normal distribution with known mean and unknown variance

Let X = (X_1, ..., X_n) be a random sample of size n from the N(θ, σ²), θ known, and let φ = σ^{−2}. In this case the joint density function will be

p(x | φ) ∝ φ^{n/2} exp{ −n s_0² φ/2 },  where s_0² = (1/n) Σ (x_i − θ)².

The conjugate prior may have the kernel of l(φ; x), which is in the gamma distribution form. As the gamma family is closed under sampling, we can consider a G(n_0/2, n_0σ_0²/2) prior distribution or, equivalently, that n_0σ_0²φ has a χ² distribution with n_0 degrees of freedom. The posterior distribution of φ is obtained using Bayes' theorem:

p(φ | x) ∝ l(φ; x) p(φ)
        ∝ φ^{n/2} exp{−n s_0² φ/2} φ^{(n_0/2)−1} exp{−n_0σ_0² φ/2}
        = φ^{[(n_0+n)/2]−1} exp{−(n_0σ_0² + n s_0²) φ/2}.

The above expression corresponds to the kernel of the gamma distribution, as expected. Therefore

φ | x ~ G( (n_0 + n)/2, (n_0σ_0² + n s_0²)/2 )

or, equivalently, (n_0σ_0² + n s_0²) φ | x ~ χ²_{n_0+n}. Then it follows that the gamma (or the χ²) family of distributions is conjugate with respect to the normal sampling model with θ known and σ² unknown.
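A sketch of this update in code (names are ours), written directly in the degrees-of-freedom parameterization (n_0, σ_0²):

```python
def known_mean_variance_update(n0, s0sq, theta, data):
    """G(n0/2, n0*s0sq/2) prior on phi = sigma^{-2} -> G(n1/2, n1*s1sq/2) posterior."""
    n = len(data)
    ss = sum((x - theta) ** 2 for x in data)   # sum of squares about the known mean
    n1 = n0 + n
    s1sq = (n0 * s0sq + ss) / n1
    return n1, s1sq

# prior with n0 = 2, s0^2 = 1; three observations around the known mean theta = 0
print(known_mean_variance_update(2.0, 1.0, 0.0, [1.0, -1.0, 2.0]))  # (5.0, 1.6)
```

The posterior s_1² is a weighted average of the prior guess σ_0² and the data sum of squares, with weights n_0 and n.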
3.4.7 Normal distribution with unknown mean and variance

The conjugate prior distribution for (θ, φ) will be presented in two stages. First, the following conditional distribution of θ given φ is assumed:

θ | φ ~ N[μ_0, (c_0φ)^{−1}],

and, at the second stage, n_0σ_0²φ ~ χ²_{n_0}, where (n_0, σ_0²) and (μ_0, c_0) are obtained from the initial information H. This distribution is usually called normal-gamma or normal-χ², with parameters (μ_0, c_0, n_0, σ_0²) and joint density given by

p(θ, φ) = p(θ | φ) p(φ)
        ∝ φ^{1/2} exp{ −(c_0φ/2)(θ − μ_0)² } φ^{(n_0/2)−1} exp{ −n_0σ_0²φ/2 }
        = φ^{[(n_0+1)/2]−1} exp{ −(φ/2)[ n_0σ_0² + c_0(θ − μ_0)² ] }.
The marginal prior distribution of θ can be obtained by integration, using the following result:

∫_0^∞ φ^{a−1} e^{−bφ} dφ = Γ(a)/b^a,  a, b > 0.

Application of the result gives

p(θ) ∝ ∫_0^∞ φ^{[(n_0+1)/2]−1} exp{ −(φ/2)[ n_0σ_0² + c_0(θ − μ_0)² ] } dφ
     = Γ[(n_0 + 1)/2] / { [n_0σ_0² + c_0(θ − μ_0)²]/2 }^{(n_0+1)/2}
     ∝ [ n_0σ_0² + c_0(θ − μ_0)² ]^{−(n_0+1)/2},

since the Γ(·) term does not depend on θ. Rearranging terms gives

p(θ) ∝ [ 1 + (θ − μ_0)² / (n_0 σ_0²/c_0) ]^{−(n_0+1)/2},

which is the kernel of the Student t distribution with n_0 degrees of freedom, location parameter μ_0 and scale parameter σ_0²/c_0, denoted by t_{n_0}(μ_0, σ_0²/c_0).
The conditional distribution of φ | θ can be obtained from the joint distribution of (θ, φ) and is a G( (n_0+1)/2, [n_0σ_0² + c_0(θ − μ_0)²]/2 ) or, equivalently, [n_0σ_0² + c_0(θ − μ_0)²] φ | θ ~ χ²_{n_0+1}.

The joint distribution of a random sample X = (X_1, ..., X_n) is

p(x | θ, φ) ∝ φ^{n/2} exp{ −(φ/2) Σ (x_i − θ)² } = φ^{n/2} exp{ −(φ/2)[ n s² + n(x̄ − θ)² ] },

where s² = (1/n) Σ (x_i − x̄)², as we have seen in Section 2.4. The above expression has the same kernel as the normal-gamma density for (θ, φ). Next, it is necessary to check whether the normal-gamma family is closed under sampling. It is not difficult to verify that it is. The posterior distribution will then be
p(θ, φ | x) ∝ p(x | θ, φ) p(θ, φ)
            ∝ φ^{[(n+n_0+1)/2]−1} exp{ −(φ/2)[ n_0σ_0² + n s² + c_0(θ − μ_0)² + n(x̄ − θ)² ] }.

It is not difficult to show that

c_0(θ − μ_0)² + n(θ − x̄)² = (c_0 + n)(θ − μ_1)² + [c_0 n/(c_0 + n)](μ_0 − x̄)²,

where μ_1 = (c_0μ_0 + n x̄)/(c_0 + n). Thus it follows that the posterior density for (θ, φ) is proportional to

φ^{[(n+n_0+1)/2]−1} exp{ −(φ/2)[ n_0σ_0² + n s² + (c_0 n/(c_0 + n))(μ_0 − x̄)² + (c_0 + n)(θ − μ_1)² ] }.

Therefore, the joint posterior for (θ, φ | x) is normal-gamma with parameters (μ_1, c_1, n_1, σ_1²) given by

μ_1 = (c_0μ_0 + n x̄)/(c_0 + n),
c_1 = c_0 + n,
n_1 = n_0 + n,
n_1σ_1² = n_0σ_0² + n s² + [c_0 n/(c_0 + n)](μ_0 − x̄)².

Prior and posterior distributions are members of the same family. So, the normal-gamma family is conjugate with respect to the normal sampling model when θ and σ² are both unknown. Table 3.1 summarizes the distributions involved in the Bayesian analysis of the normal model with unknown mean and variance.
Table 3.1 Summary of the distributions

            Prior                                        Posterior
θ | φ       N(μ_0, (c_0φ)^{−1})                          N(μ_1, (c_1φ)^{−1})
φ           n_0σ_0²φ ~ χ²_{n_0}                          n_1σ_1²φ ~ χ²_{n_1}
θ           t_{n_0}(μ_0, σ_0²/c_0)                       t_{n_1}(μ_1, σ_1²/c_1)
φ | θ       [n_0σ_0² + c_0(θ − μ_0)²]φ ~ χ²_{n_0+1}      [n_1σ_1² + c_1(θ − μ_1)²]φ ~ χ²_{n_1+1}
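The four update equations above translate directly into code. The sketch below is ours (names are assumptions), checked on a small sample:

```python
def normal_gamma_update(mu0, c0, n0, s0sq, data):
    """Normal-gamma(mu0, c0, n0, s0sq) prior -> posterior (mu1, c1, n1, s1sq)."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / n   # s^2 = (1/n) sum (x_i - xbar)^2
    c1 = c0 + n
    mu1 = (c0 * mu0 + n * xbar) / c1
    n1 = n0 + n
    s1sq = (n0 * s0sq + n * s2 + (c0 * n / c1) * (mu0 - xbar) ** 2) / n1
    return mu1, c1, n1, s1sq

print(normal_gamma_update(0.0, 1.0, 1.0, 1.0, [1.0, 3.0]))
```

Note how the last term of n_1σ_1² penalizes disagreement between the prior mean μ_0 and the sample mean x̄, inflating the posterior variance estimate.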
3.5 Non-informative priors

Many statisticians show concern about the nature of the prior distribution. This is mainly due to an influence from the frequentist point of view. They typically maintain that the prior distribution is arbitrary and alters the conclusions about the statistical problem at hand. Therefore, they argue that prior information is not acceptable for use in a scientific context. In this section, the concept of non-informative or reference prior will be presented in an effort to reconcile these arguments with the Bayesian point of view. The idea behind these priors comes from the desire to make statistical inference based on a minimum of subjective prior information. This minimum is clearly a relative concept and should take into consideration, for example, the sample information content.

Another context where the concept of a reference prior may be useful was outlined in the example of the two physicists in Chapter 2. Let us suppose that two scientists have strong and divergent prior opinions about an unknown quantity and that it is not possible to reconcile these initial opinions. This is a situation where it is necessary to produce a 'neutral' analysis, introducing a referential. Another plausible justification to support a reference analysis is the usual expectation that the evidence from the experiment is stronger than the prior.

Initially, uniform priors were proposed to represent situations where little or no initial information is available or, even, if it is available, we do not wish to use it. So, p(θ) ∝ k for θ varying in a given subset of the real line means that none of the particular values of θ is preferred (Bayes, 1763). This choice brings some difficulties with it. The first one is that p(θ) is not a proper distribution if the range
of values of θ is unbounded. This means that ∫ p(θ) dθ → ∞, which goes against the basic rules of probability. Also, if φ = φ(θ) is a one-to-one transformation of θ and if θ is uniformly distributed then, by the theorem of variable transformation, the density of φ is

p(φ) = p(θ(φ)) |dθ/dφ| ∝ |dθ/dφ|,

which is only constant if φ is defined by a linear transformation. However, the same assumptions leading to the specification of p(θ) ∝ k should also lead to p(φ) ∝ k, which contradicts the above deduction. Ideally, we would like to state an invariant rule that would not violate results about variable transformation.
In practice, we are concerned with the posterior distribution, which is often proper even when the prior distribution is not. In this case, one does not need to give much relevance to the impropriety of the prior distribution. Careful examination must be carried out to make sure the posterior is actually proper in order to proceed confidently with the analysis.

The class of non-informative priors proposed by Jeffreys (1961) is invariant, but in many cases it leads to improper distributions. This class of priors is extensively used by Box and Tiao (1992). Intuitively, it tries to provide as little prior information as possible, relative to the sample information. It is not surprising, therefore, that it should depend on Fisher information measures.
Definition. Consider an observation X with probability (density) function p(x | θ). The Jeffreys non-informative prior has density given by

p(θ) ∝ [I(θ)]^{1/2},  θ ∈ Θ.

In the multivariate case, the density is given by

p(θ) ∝ |I(θ)|^{1/2}.

Lemma. The Jeffreys prior p(θ) ∝ [I(θ)]^{1/2} is invariant under one-to-one transformations; that is, if φ = φ(θ) is a one-to-one transformation of θ, then the Jeffreys prior for φ is p(φ) ∝ [I(φ)]^{1/2}.
Proof. Let φ = φ(θ) be a one-to-one transformation of θ. Taking the derivative of log p(X | φ) with respect to φ, it follows that

∂ log p(X | φ)/∂φ = [ ∂ log p(X | φ(θ))/∂θ ] (∂θ/∂φ),

where θ = θ(φ) is the inverse transformation of φ. To obtain the Fisher information of a parameter, the log-likelihood of this parameter needs to be differentiated twice. This gives

∂² log p(X | φ)/∂φ² = [ ∂ log p(X | φ(θ))/∂θ ] (∂²θ/∂φ²) + [ ∂² log p(X | φ(θ))/∂θ² ] (∂θ/∂φ)².

Multiplying both sides by (−1) and calculating the expected value with respect to p(x | θ) gives

I(φ) = −E_{X|θ}[ ∂ log p(X | θ)/∂θ ] (∂²θ/∂φ²) + I(θ) (∂θ/∂φ)² = I(θ) (∂θ/∂φ)²,

since E_{X|θ}[ ∂ log p(X | θ)/∂θ ] = 0, as seen in Chapter 2. Therefore, I^{1/2}(φ) = I^{1/2}(θ) |∂θ/∂φ|. By the rules of probability, if θ has density proportional to I^{1/2}(θ), then φ has density

p(φ) = p(θ(φ)) |∂θ/∂φ| ∝ I^{1/2}(θ) |∂θ/∂φ| = I^{1/2}(φ),

and the specification is invariant to one-to-one transformation. □
Corollary. The same result is true for the multiparameter case; that is, the Jeffreys prior is invariant under one-to-one transformations.

There is only one transformation ψ of θ which satisfies the invariance rule and has constant density. This transformation is easily obtained by making

|∂θ/∂ψ| ∝ I^{−1/2}(θ)  or  |∂ψ/∂θ| ∝ I^{1/2}(θ),  that is,  ψ ∝ ∫^θ I^{1/2}(u) du.
Xn) be a sample of Pois(8) variables. The joint
Taking logarithm, it follows that n
logp(x I 8) = -ne +
n
l:x; log e -log TI x1!. i=l
i=l
The first and second order derivatives of the log likelihood are Jlogp(xl8)
ae
= n - +
L:7-1x; e
and
Then, the Fisher information will be: !(8)
=
Exw
[z:=x,J ----rf2
=
I" ne n z L... E(X;) = ez = (j e
and the non-informative prior is p(θ) ∝ θ^{−1/2}. So the posterior density will be

p(θ | x) ∝ p(x | θ) p(θ) ∝ e^{−nθ} θ^{Σx_i} θ^{−1/2} = e^{−nθ} θ^{Σx_i−1/2},

that is, θ | x ~ G(Σx_i + 1/2, n) or, alternatively, 2nθ | x ~ χ²_{2Σx_i+1}. The transformation leading to the uniform prior is

ψ ∝ ∫_0^θ u^{−1/2} du = 2θ^{1/2} ∝ √θ.

The non-informative prior is frequently obtained from the conjugate prior by letting the scale parameters go to zero and keeping the other ones constant. In the above example, it can be noted that the reference prior is the limit of the gamma distribution (the natural conjugate prior for the Poisson model) with θ ~ G(1/2, ε), ε → 0.
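The Fisher information I(θ) = n/θ behind this prior can be checked by simulation: the variance of the score −n + Σx_i/θ over repeated samples should be close to n/θ. A sketch using only the standard library (function names are ours; the Poisson sampler is Knuth's multiplication method, adequate for small θ):

```python
import random

def poisson_score_variance(theta, n, reps=20000, seed=0):
    """Monte Carlo estimate of Var[-n + sum(x_i)/theta], which equals n/theta."""
    rng = random.Random(seed)
    exp_neg_theta = 2.718281828459045 ** (-theta)
    def draw():
        # Knuth's Poisson sampler: multiply uniforms until the product drops below e^{-theta}
        k, p = 0, rng.random()
        while p > exp_neg_theta:
            k += 1
            p *= rng.random()
        return k
    scores = [-n + sum(draw() for _ in range(n)) / theta for _ in range(reps)]
    m = sum(scores) / reps
    return sum((s - m) ** 2 for s in scores) / reps

est = poisson_score_variance(theta=2.0, n=3)
print(est)  # should be close to n/theta = 1.5
```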
The Jeffreys prior specification was alternatively obtained in the univariate case by Bernardo (1979). He called them reference priors, and they are defined as the distributions that maximize the amount of unknown information about θ in an infinite number of replications of the experiment. The amount of unknown information about θ in n replications of the experiment is defined as the expected gain of information

E_{X^n} [ ∫ p(θ | x^n) log( p(θ | x^n)/p(θ) ) dθ ],

where X^n = (X_1, ..., X_n). The amount of unknown information about θ in an infinite number of replications of the experiment is obtained as the limit of the information based on n replications when n → ∞.
In the multivariate case, Bernardo proposed a modification to the Jeffreys rule. He suggested a two-stage procedure. The parametric vector is divided into two components: θ, denoting the parameters of interest, and φ, the nuisance parameters. First, a (conditional) reference prior distribution p(φ | θ) is obtained. This prior is used to eliminate the parameters φ from the likelihood and gives a marginal likelihood p(x | θ), as we have seen in Section 2.4. Then, this likelihood is used to obtain the (marginal) reference prior p(θ). Finally, the complete reference prior is obtained by the multiplication rule p(φ, θ) = p(φ | θ) p(θ). This procedure seems to provide better results than that proposed by Jeffreys, although it depends on an arbitrary partition of the parametric vector. Therefore, reference priors are not invariant to the choice of the parameter partition. Another similar procedure, based on information maximization, was proposed by Zellner (1971).

There are other difficulties associated with the reference prior distribution besides its specification not being unique and often leading to an improper density. It can lead to incoherent inferences in the sense that if analysis conditional on the
sample is replaced by analysis conditional on a sufficient statistic, which should not affect inferences, it could lead to different posterior distributions for some parameter transformations. This is obviously an unpleasant situation that fortunately does not occur very often.

Another drawback of Jeffreys priors is that they do not satisfy the likelihood principle. Since the prior is based on the experiment, it produces different results for equal likelihoods. A famous example illustrating that the non-informative prior depends on the sampling model is provided by Bernoulli trials with success probability θ. If the sample design is such that n fixed Bernoulli trials are made and the number of successes observed, then X ~ bin(n, θ) and

p(x | θ) = C(n, x) θ^x (1 − θ)^{n−x},

which implies that

log p(x | θ) = log C(n, x) + x log θ + (n − x) log(1 − θ).

Therefore, the first and second derivatives of the log-likelihood are

∂ log p(x | θ)/∂θ = x/θ − (n − x)/(1 − θ)   and   ∂² log p(x | θ)/∂θ² = −x/θ² − (n − x)/(1 − θ)².

Then the expected information measure is

I_B(θ) = E(X | θ)/θ² + E(n − X | θ)/(1 − θ)² = n/θ + n/(1 − θ) = n/[θ(1 − θ)]

and the non-informative prior is p_B(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2}. Note that this prior is a beta(1/2, 1/2) distribution and it is not improper.
Now, suppose that the sampling scheme consists in observing the number of replications until s successes are obtained. The observation now is Y, with negative binomial distribution denoted by Y ~ NB(s, θ) and

p(y | θ) = C(y − 1, s − 1) θ^s (1 − θ)^{y−s},

which implies that the first and second derivatives with respect to θ are

∂ log p(y | θ)/∂θ = s/θ − (y − s)/(1 − θ)   and   ∂² log p(y | θ)/∂θ² = −s/θ² − (y − s)/(1 − θ)².

The expected information is

I_{BN}(θ) = s/θ² + [E(Y | θ) − s]/(1 − θ)² = s/θ² + s/[θ(1 − θ)] = s/[θ²(1 − θ)],

since E(Y | θ) = s/θ, and the non-informative prior is p_{BN}(θ) ∝ θ^{−1}(1 − θ)^{−1/2}. This prior is the limit of a beta(ε, 1/2) when ε → 0, and hence it is an improper distribution.
Suppose now that in 10 experiments, three successes were observed. Therefore, l(θ; x = 3) ∝ θ³(1 − θ)⁷ and l(θ; y = 10) ∝ θ³(1 − θ)⁷. The sample information, namely the likelihood, about θ is the same for both models. Although the information is the same, the prior and consequently the posterior distributions are different, with

p(θ | x = 3) ∝ θ^{2.5}(1 − θ)^{6.5}   and   p(θ | y = 10) ∝ θ^{2}(1 − θ)^{6.5},

showing, then, that the non-informative prior violates the likelihood principle.
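The two posteriors can be written as beta densities and compared directly. A sketch (function names are our own); the exponents match the expressions above:

```python
def jeffreys_posterior_binomial(x, n):
    """beta(1/2, 1/2) prior x theta^x (1-theta)^{n-x} -> beta(x + 1/2, n - x + 1/2)."""
    return x + 0.5, n - x + 0.5

def jeffreys_posterior_negative_binomial(s, y):
    """Prior theta^{-1}(1-theta)^{-1/2} x theta^s (1-theta)^{y-s} -> beta(s, y - s + 1/2)."""
    return float(s), y - s + 0.5

print(jeffreys_posterior_binomial(3, 10))            # (3.5, 7.5): theta^2.5 (1-theta)^6.5
print(jeffreys_posterior_negative_binomial(3, 10))   # (3.0, 7.5): theta^2   (1-theta)^6.5
```

Identical likelihoods, different posteriors: a direct numerical display of the violation of the likelihood principle.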
These criticisms only reinforce the point that non-informative priors must be used with extreme care. Nevertheless, if proper care is taken, they provide a useful benchmark in a number of problems and may be a useful input to the analysis. Applications of non-informative priors in a few frequently used models described in Chapter 2 are presented below, starting with the location model. If X has the location model, then
∂ log p(x | θ)/∂θ = ∂ log f(x − θ)/∂θ = −f′(x − θ)/f(x − θ),

where f′ denotes the derivative of f with respect to its argument. Then

I(θ) = E_{X|θ}[ ( f′(X − θ)/f(X − θ) )² ].

Making the transformation U = X − θ, then

I(θ) = E_U[ ( f′(U)/f(U) )² ],

which does not depend on θ. So, I(θ) = k and then p(θ) ∝ k. It is easy to see that this result is also true for a parameter vector θ.
One way to justify this prior is through model invariance directly. Working with an observation vector X and location parameter θ is equivalent to working with observation vector Y = X + c and location parameter η = θ + c, for any given constant c. We can then insist on the same non-informative prior specification for θ as for η. It is not difficult to show that the only distribution satisfying this requirement has density p(θ) ∝ k.

If X has the scale model, then

∂ log p(x | σ)/∂σ = ∂[ −log σ + log f(x/σ) ]/∂σ
                  = −1/σ − (x/σ²) f′(x/σ)/f(x/σ)
                  = −(1/σ)[ 1 + (x/σ) f′(x/σ)/f(x/σ) ],

where f′ denotes the derivative of f with respect to its argument.
Therefore the information measure about σ will be

I(σ) = (1/σ²) E_{X|σ}[ ( 1 + (X/σ) f′(X/σ)/f(X/σ) )² ] = (1/σ²) E_U[ ( 1 + U f′(U)/f(U) )² ]

after the transformation U = X/σ. Since the distribution of U does not depend on σ, then I(σ) = k σ^{−2} and p(σ) ∝ σ^{−1} is the non-informative prior distribution.
Once again, model invariance can be invoked directly by assuming equivalence between a model with observation X and scale parameter σ and a model with observation Y = cX and scale parameter η = cσ. Insisting on the same non-informative prior for σ and η leads to p(σ) ∝ σ^{−1}.

If X has a location-scale model, a reference prior for (θ, σ) can be obtained following the procedure proposed by Bernardo, with θ being the parameter of interest and σ the nuisance parameter. This partition corresponds to the majority of cases of interest. Now, if θ is supposed known, we are restricted to a scale model and its prior is p(σ | θ) ∝ σ^{−1}. The distribution of X | θ is now obtained as

p(x | θ) = ∫ p(x | σ, θ) p(σ | θ) dσ,

where we can observe that the dependence on x and θ will continue to be of the form f(x − θ). Therefore, in a location-scale model, the reference prior is

p(θ, σ) = p(θ) p(σ | θ) ∝ 1/σ.

This prior is also recommended by Jeffreys (1961), although it is not the one implied by direct application of his rule.
3.6 Hierarchical priors

A good strategy for specifying the prior distribution, or for a better description of an experimental situation, is often to divide it into stages or into a hierarchy. The idea of using a multistage hierarchical structure was formalized by Lindley and Smith (1972). This way, the prior specification is made in two phases:

1. structural, for the division into stages;
2. subjective, for quantitative specification at each stage.

Example. Suppose that Y_1, ..., Y_n are such that Y_i ~ N(θ_i, σ²), with σ² known. Among many possibilities depending on the situation under study, many choices are available for specification of the prior for θ = (θ_1, ..., θ_n). The following options can be used:
• the θ_i's are independent, that is, p(θ) = ∏_i p(θ_i);
• the θ_i's are a sample from a population with p(θ_i | λ), where λ contains the parameters describing the population.

So, for the last option,

p(θ | λ) = ∏_{i=1}^n p(θ_i | λ).

This specification corresponds to the first stage. To complete the prior setting, it is necessary to specify the second stage: the distribution of λ, p(λ). Note that p(λ) corresponds to the second stage and does not depend on the first stage. One can then obtain the marginal prior distribution of θ by

p(θ) = ∫ p(θ, λ) dλ = ∫ p(θ | λ) p(λ) dλ = ∫ ∏_{i=1}^n p(θ_i | λ) p(λ) dλ.
Note that the θ_i's are supposed exchangeable, as their subscripts are irrelevant in terms of this prior. Since the distribution of λ is independent of the first stage, it can be stated as:

1. Concentrated: p(λ = λ_0) = 1.
2. Discrete: p(λ = λ_j) = p_j, j = 1, ..., k, with Σ_j p_j = 1. In this case the distribution of θ will be a finite mixture of the p(θ | λ_j), j = 1, ..., k.
3. Continuous: as before, the distribution of θ will be a continuous mixture of p(θ | λ) with weights given by p(λ).
If the first-stage prior assumes that θ_i ~ N(μ, τ²), i = 1, ..., n, then λ = (μ, τ²). Assuming that p(τ² = τ_0²) = 1 and that μ is normally distributed, then θ has a multivariate normal distribution. On the other hand, assuming that p(μ = μ_0) = 1 and that τ^{−2} has a gamma prior distribution implies that θ has a multivariate Student t distribution.
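The two-stage construction can be simulated directly: draw μ once, then the θ_i's given μ. Marginally the θ_i's are exchangeable and positively correlated, not independent. A sketch using only the standard library (names and parameter values are illustrative assumptions):

```python
import random

def sample_hierarchical_normal(n, mu0=0.0, v_mu=1.0, tau2=1.0, rng=None):
    """Second stage: mu ~ N(mu0, v_mu); first stage: theta_i | mu ~ N(mu, tau2)."""
    rng = rng or random.Random(0)
    mu = rng.gauss(mu0, v_mu ** 0.5)          # one second-stage draw, shared by all theta_i
    return [rng.gauss(mu, tau2 ** 0.5) for _ in range(n)]

# crude Monte Carlo check of the marginal covariance cov(theta_1, theta_2) = v_mu
rng = random.Random(1)
draws = [sample_hierarchical_normal(2, rng=rng) for _ in range(20000)]
m1 = sum(d[0] for d in draws) / len(draws)
m2 = sum(d[1] for d in draws) / len(draws)
cov = sum((d[0] - m1) * (d[1] - m2) for d in draws) / len(draws)
print(cov)  # should be close to v_mu = 1.0
```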
This subdivision into stages is a probabilistic strategy that allows easy identification and specification of coherent priors. Nothing prevents these ideas from going further into the hierarchy. For example, the distribution of λ can depend on φ. In this case,

p(θ) = ∫∫ p(θ | λ) p(λ | φ) p(φ) dλ dφ.

The parameters λ and φ are called hyperparameters and are introduced to ease the prior specification. Theoretically, one can state as many stages as one thinks are necessary to improve the prior specification. In practice, it is very hard to interpret the parameters of third or higher stages, so it is common practice to use a non-informative prior for these levels.
The concept of hierarchical modelling will be returned to in Chapter 8, at least for the normal case. That chapter provides an introduction to more elaborate models where the full strength of prior specification will be better appreciated.
Exercises

§ 3.1

1. Let θ represent the maximum temperature at your house door in September.
   (a) Determine, subjectively, the 0.25 and 0.5 quantiles of your prior distribution for θ.
   (b) Obtain the normal density that best fits these quantiles.
   (c) Find subjectively (without using the normal density obtained in (b)) the 0.1 quantile of your prior distribution for θ. Is it consistent with the normal obtained in (b)? What can you conclude from this fact?

2. Let θ be the probability that a football team from Rio de Janeiro will be the winner of the next Brazilian championship. Supposing that θ does not vary in time, build a prior distribution for θ based on past information.
§ 3.2

3. Show that the classes of beta and normal-gamma distributions are closed under sampling.

§§ 3.3/3.4

4. Show that the beta family is conjugate with respect to the binomial, geometric and negative binomial sampling distributions.

5. For four pairs of a rare specimen of bird that nested last season, the number of eggs per nest n and the number of eggs hatched y were observed, providing the data n = 2, 3, 3 and 4 and y = 1, 2, 3 and 3. For a fifth pair nesting this season in similar conditions, n_5 = 3 eggs were observed and are about to hatch. Let θ be the probability that an egg is hatched.
   (a) Obtain the likelihood function for θ, based on the observations y = (y_1, ..., y_4).
   (b) Assess a conjugate prior that in your opinion is adequate and calculate the posterior of θ | y.
   (c) State a probabilistic model for Y_5, the number of eggs hatched in the fifth nest, and obtain its predictive distribution.

6. Suppose that a random sample X = (X_1, ..., X_n) from the N(θ, σ²) is observed, with θ known, and that a χ² prior for σ^{−2} is used. If the prior coefficient of variation (CV) of σ^{−2} is equal to 0.5, what must the value of n be to ensure that the posterior CV reduces to 0.1?
   Note: The coefficient of variation of X is defined by CV = σ/|μ|, where μ = E(X) and σ² = var(X) represent the mean and the variance of X, respectively.
7. Let X_1, ..., X_n be a random sample of the N(θ, φ^{−1}) distribution and consider the conjugate prior distribution for θ and φ.
   (a) Determine the parameters (μ_0, c_0, n_0, σ_0²) of the prior distribution, knowing that E(θ) = 0, Pr(|θ| < 1.412) = 0.5, E(φ) = 2 and E(φ²) = 5.
   (b) In a sample of size n = 10, x̄ = 1 and Σ_{i=1}^n (X_i − x̄)² = 8 were observed. Determine the posterior distribution of θ, sketch a graph of the prior, posterior and likelihood functions, and evaluate Pr(|Y| > 1 | x), where Y is a new observation taken from the same population.
8. A random sample X_1, ..., X_n is selected from the N(θ, σ²) distribution, with σ² known. The prior distribution for θ is a N(μ_0, σ_0²). What must the sample size be to reduce the variance
   (a) of the posterior distribution of θ to σ_0²/k (k > 1)?
   (b) of the predictive distribution of Y, a future observation drawn from the same population, to σ_0²/k (k > 1)?
9. Consider the sampling model X ~ N(θ, σ²I_p), where the p-dimensional mean vector θ and the scalar σ² are unknown. Show that the distribution family given by θ | σ² ~ N(μ, σ²C_0) and n_0σ_0²/σ² ~ χ²_{n_0} is conjugate to the model presented above, generalizing the results obtained in Section 3.3 for univariate normal distributions.
10. Let X_1, ..., X_n be a random sample from the Pois(θ) distribution.
    (a) Determine the conjugate prior parameters for θ, assuming that E(θ) = 4 and CV(θ) = 0.5, and determine n such that V(θ | x) < 0.01.
    (b) Show that the posterior mean is of the form γ_n x̄ + (1 − γ_n)μ_0, where μ_0 = E(θ), and that γ_n → 1 when n → ∞.
    (c) Repeat the previous item for a sample from a Bernoulli distribution with success probability θ and prior θ ~ beta(a, b).
11. Let X = (X_1, ..., X_n) be a random sample of the U(0, θ) distribution.
    (a) Show that the Pareto family of distributions, with parameters a and b and density p(θ) = a b^a / θ^{1+a}, θ > b (a, b > 0), is a conjugate family to the uniform.
    (b) Obtain the mode, mean and median of the posterior distribution of θ.
12. Consider the conjugate Poisson-gamma model with n = 1. Obtain the predictive distribution using (a) the usual integration procedures; (b) the approach described at the end of Section 3.3; (c) calculate also the mean and variance of this distribution.

13. Show that

c0(θ − μ0)² + n(θ − x̄)² = (c0 + n)(θ − μ1)² + (c0 n/(c0 + n))(μ0 − x̄)²,

where μ1 = (c0μ0 + nx̄)/(c0 + n).
§ 3.5
14. Consider the observation of a sample X = (X1, ..., Xn) with probability (or density) function p(x | θ).
(a) Show that

E_{θ|x}[ log( p(θ | x)/p(θ) ) ] ≥ 0, ∀x,

with equality obtained only when p(θ | x) = p(θ).
(b) Interpret the above result.

15. Let Xi ~ p(xi | θi) and pi(θi) be the non-informative prior for θi, for i = 1, ..., p. Assuming that the Xi's are independent, show that the non-informative Jeffreys prior for θ = (θ1, ..., θp) is given by ∏ᵢ₌₁ᵖ pi(θi).
16. Consider a random sample of size n from the Pareto distribution with parameters θ and b. (a) Show that there is a sufficient statistic of fixed dimension for θ. (b) Obtain the non-informative prior for θ. Is it improper?

17. Suppose that X | θ ~ Exp(θ) and that the prior for θ is non-informative. (a) Obtain the predictive distribution of X and show that p(x) and p(x | θ) are monotonically decreasing in x. (b) Calculate the mode and the median of the sampling distribution and of the predictive distribution.

18. Suppose that X = (X1, X2, X3) has a trinomial distribution with parameters n and π = (π1, π2, π3), where π1 + π2 + π3 = 1 and n is known, with density given by

p(x | π) = (n!/(x1! x2! x3!)) π1^x1 π2^x2 π3^x3,

where xi = 0, 1, ..., n, i = 1, 2, and 0 ≤ x1 + x2 ≤ n. Show that the non-informative Jeffreys prior for π is p(π) ∝ [π1π2(1 − π1 − π2)]^(−1/2).

19. Suppose that the lifetimes of n bulbs are exponentially distributed with mean θ. (a) Obtain a non-informative prior for θ and show that it is improper. (b) Suppose that the times are observed up until r failures have occurred. Obtain the likelihood expression. (c) Show that, if no bulbs fail before a pre-specified time limit c > 0 when observation stops, then the posterior distribution is also improper.
20. Suppose that θ has non-informative prior p(θ) ∝ k. Show that φ = aθ + b, a ≠ 0, also has prior p(φ) ∝ k. Suppose now that θ has non-informative prior p(θ) ∝ θ⁻¹, θ > 0. Show that φ = θᵃ, a ≠ 0, also has prior p(φ) ∝ φ⁻¹ and that ψ = log θ has prior p(ψ) ∝ k.
21. Let X = (X1, X2, X3) be a random vector with trinomial distribution with parameters n and (θ1, θ2, θ3). The statistician decides to reparametrize the problem defining λ = θ1/(θ1 + θ2) and ψ = θ1 + θ2. (This is a valid procedure since the transformation from (θ1, θ2) to (λ, ψ) is 1-to-1.)
(a) Write the density of X as a function of λ and ψ.
(b) Show that T = X1 + X2 is a sufficient statistic for ψ. Interpret these results in terms of the inference about ψ.
(c) Obtain the non-informative prior for ψ based on the result proved in (b).
(d) Interpreting ψ as the success probability in an experiment and supposing that in n repetitions of the experiment t successes were observed, what is the probability of a future experiment being a success?
22. Show that the Jeffreys prior p(θ) ∝ |det I(θ)|^(1/2) is invariant under one-to-one transformations, that is, if φ = φ(θ) is a one-to-one transformation of θ, then the Jeffreys prior for φ is p(φ) ∝ |det I(φ)|^(1/2).
23. Assuming that to work with an observation vector X and location parameter θ is equivalent to working with an observation vector Y = X + c and location parameter η = θ + c, for any given constant c, and insisting on the same non-informative prior specification for θ as for η, show that the only possible distribution for θ has density p(θ) ∝ k.

24. Repeat the above exercise under the conditions of the scale model to show that the non-informative prior must have density of the form p(σ) ∝ σ⁻¹ by
(a) assuming equivalence between the model with observation X and scale parameter σ and the model with observation Y = cX and scale parameter η = cσ;
(b) transforming the problem into a location model with observation Z = log X and location ξ = log σ.

§ 3.6

25. Assume that the first stage prior specifies that θi ~ N(μ, τ²), i = 1, ..., n, and define λ = (μ, τ²).
(a) Assuming that p(τ² = τ0²) = 1 and μ is normally distributed, prove that θ has a multivariate normal distribution.
(b) Assuming that p(μ = μ0) = 1 and τ⁻² has a gamma prior distribution, prove that θ has a multivariate Student t distribution.
Exercises 77 1970, p. 154) Suppose that the prior distribution p(e) fore is built up hierarchically as follows
26. (DeGroot,
(a) If� = i, then the prior density fore is p; (8), i = I, . . ., k. i) c;, i = I, ... , k. (b) The distribution of� is p(� =
=
Suppose also that X, with density p(x I 8), is observed and define ljJ; as the class, containing p;, of conjugate distributions to the sampling distribution of X, i 1, . . . , k, and \lJ as the class of distributions given by =
k
{p: p(8)
=
L, f3;p;(8) and p; E ljJ;). i=l
(a) What is the mathematical expression for the prior of p(B)? (b) Show that l.J! is conjugate to the distribution of X, that is, there exist constants b;, i 1, . . . , k, such that =
k
p(8 I x)
=
L_ b p (8 1 x). ,
,
i=l
(c) Obtain the relationship between h and the prior and posterior proba bilities of� = i, i 1, ... , k. =
Hint: Define h;(x) = f p(x I 8)p;(8) d8 and p;(8 I x) (8)I h; (x), i = I, . .. , k.
=
p(x I 8)p;
27. The IQs of a sample of n senior undergraduate students of statistics at UFRJ are represented respectively by θi, i = 1, ..., n, and the common unknown mean for all the final year students at UFRJ is denoted by μ. Suppose that the θi's constitute a random sample of the population of IQs, with unknown mean but with known variance b. A useful test to assess IQs is applied, providing the independent observations Y1, ..., Yn, where Yi | θi ~ N(θi, a), with a known.
(a) Build a hierarchical prior for the parameters θ1, ..., θn.
(b) Calculate p(μ | y1, ..., yn) and obtain E(μ | y1, ..., yn).
(c) Obtain p(θi | μ, y1, ..., yn) and E(θi | μ, y1, ..., yn), for i = 1, ..., n.
(d) Obtain E(θi | y1, ..., yn), for i = 1, ..., n.
28. Suppose that the prior distribution for (θ1, ..., θn) is such that the θi's constitute a random sample from a N(μ, τ²) distribution and μ | τ² ~ N(μ0, τ²/c). Obtain the prior distribution for (θ1, ..., θn) and, in particular, calculate the covariance between θi and θj, 1 ≤ i, j ≤ n, with i ≠ j, supposing that (a) τ² is known; (b) τ² is unknown with prior distribution τ⁻² ~ G(α, β).
4 Estimation
One of the central problems of statistical inference is discussed in this chapter. The general problem of decision making is briefly described to motivate estimation as a special case. Estimation is then treated from both Bayesian and frequentist points of view. In Section 4.1, the decision problem is defined and the concepts of loss function and Bayes risk are discussed. Different loss functions are considered in the definition of different decision problems. The Bayes estimators ensuing from these losses are presented and their advantages and disadvantages discussed. The more important classical methods of estimation, namely maximum likelihood, least squares and moments, are presented in Section 4.2. Properties of these methods are extensively discussed. In Section 4.3, methods of comparison of these estimators are defined. The concepts of bias, (classical) risk and consistency of an estimator are introduced. Following the discussion on point estimation, interval estimation is presented in Section 4.4. Finally, the results are applied in Section 4.5 to the estimation of mean and variance in the normal model. Results concerning approximate methods of estimation, including asymptotics, are deferred to the next chapter.
4.1
Introduction to decision theory
Consider the posterior density exhibited in Figure 4.1. This density is not completely uncommon and illustrates the difficulties that may be associated with the learning process involved in a statistical procedure. Nevertheless, this density contains all that is available in terms of a probabilistic description of our information about a quantity of interest. Any attempt to summarize the information contained in this density must be made with caution. The graph of the posterior density is the best description of the inferential process. Sometimes, however, it is useful to summarize the information further into a few numerical figures for communication purposes. The simplest possible case is point estimation, where one seeks to determine a single value of the unknown quantity of interest θ that summarizes the entire information of the distribution. Denote this value by θ̂, the point estimator of θ. As we will see below, it is easier to understand the choice of the value of θ̂ in the context of decision theory.

Fig. 4.1 Posterior density of θ with three distinct regions: the first containing around 30% of the total probability, the second with 10% and the third with around 60%. The mode of this density is 3.78, the mean is 4.54 and the median is 4.00.
A decision problem is completely specified by the description of three spaces: 1. parameter (or states of the riature) space 8; 2. space of possible res�l,ts of an experiment Q; 3. space of possible actions
A.
A decision rule 0 is a function defined in Q with values in A, that is 0
:
Q
--+
A.
A Joss function may be associated to each decision O(x) and each possible value of 0 E 8. It can be interpreted as the punishment that one suffers for taking decision 0 when the value of the parameter is e. This function from 8
x
A with values in
R+ will be denoted by L(S, B). Definition. The risk of a decision rule, denoted by loss given by
R(S)
=
R(O), is the expected posterior
Eerx[L(S, &)].
The importance of the risk is the introduction of a measure that enables one to rank different decision rules.
Definition. A decision rule δ* is optimal if it has minimum risk, namely R(δ*) ≤ R(δ), ∀δ. This rule is called the Bayes rule and its risk is called the Bayes risk.

Example. Suppose that a doctor must decide if a patient (for example, John from previous chapters) with a given disease must undergo surgery or not.
The states of nature are: John is sick (θ = 1) or not (θ = 0). Let us simplify the problem by assuming that the doctor will only prescribe surgery (δ = 1) if he thinks John is sick. This way, a decision rule δ directly related to the value of θ gets established. Unfortunately, the value of the parameter θ is unknown to the decision maker, the doctor. Table 4.1 is a possible representation of the losses (measured in a hypothetical monetary unit) associated with all combinations of values of the action and state of nature.
Table 4.1 Losses associated with the doctor problem

                   no surgery (δ = 0)   surgery (δ = 1)
  healthy (θ = 0)           0                 500
  sick    (θ = 1)        1000                 100
These losses represent the subjective evaluation of the decision maker with respect to the combinations of actions and states of nature. The smallest loss is null, which occurs when the patient is not sick and does not undergo surgery. The largest loss occurs when the patient is sick but is not prescribed surgery. This implies a loss in the doctor's reputation and may even lead to legal problems. He evaluates this loss at 1000 monetary units. Note that all losses are non-negative as defined and do not take into account the doctor's fee, which is constant or at least should be immaterial to the problem considered. As must be clear by now, the decision must be guided by taking into consideration the uncertainty about the unknowns involved in the problem. In this case, the unknown is θ; let us assume that its uncertainty is described by its updated distribution, Pr(θ = 1) = π and Pr(θ = 0) = 1 − π, for 0 ≤ π ≤ 1. This can be a prior distribution or a posterior distribution, obtained after a few tests have been carried out on the patient. Evaluation of the risk of an action δ is straightforward:

R(δ = 0) = E_θ[L(δ = 0, θ)] = 0(1 − π) + 1000π = 1000π
R(δ = 1) = E_θ[L(δ = 1, θ)] = 500(1 − π) + 100π = 500 − 400π.

As can be seen from Figure 4.2, the two actions have equal risk if R(δ = 0) = R(δ = 1), which happens iff 1000π = 500 − 400π, or π = 5/14. For π < 5/14, the risk associated with δ = 0 is smaller than the risk associated with δ = 1. In this case, δ = 0 is the Bayes rule and the Bayes risk is 1000π. For π > 5/14, the problem is reversed: the Bayes rule is δ = 1 and the Bayes risk is 500 − 400π. In summary, the doctor's strategy must be to prescribe John surgery iff π > 5/14.
This example shows how sensitive the decision is to the choice of priors.

Fig. 4.2 Risks associated with the two actions, surgery or not, as functions of the probability of disease π.

It is important to study also the sensitivity of the decision to the choice of loss function. Estimation is clearly dependent on the specified losses, and variation in their values may lead to different decisions.
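The risk comparison in this example is simple enough to reproduce directly. The sketch below (plain Python, using the losses of Table 4.1) computes both risks and recovers the threshold π = 5/14:

```python
# Risks for the doctor's problem (losses from Table 4.1):
# L(delta=0): 0 if healthy, 1000 if sick; L(delta=1): 500 if healthy, 100 if sick.
def risk_no_surgery(pi):
    return 0 * (1 - pi) + 1000 * pi      # = 1000*pi

def risk_surgery(pi):
    return 500 * (1 - pi) + 100 * pi     # = 500 - 400*pi

def bayes_rule(pi):
    # prescribe surgery (1) iff its risk is the smaller of the two
    return 1 if risk_surgery(pi) < risk_no_surgery(pi) else 0

threshold = 500 / 1400                   # = 5/14, where the two risk lines cross
assert bayes_rule(0.2) == 0 and bayes_rule(0.5) == 1
assert abs(risk_surgery(threshold) - risk_no_surgery(threshold)) < 1e-9
```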
Definition. An estimator is an optimal decision rule with respect to a given loss function. Its observed value is called an estimate.

This definition is broad enough to be useful in a classical perspective, where other optimality criteria will be introduced. In what follows, most of the presentation will be concentrated on symmetric loss functions of the form L(δ, θ) = h(δ − θ), for some function h. These are the most commonly used loss functions. In general, Θ ⊂ R and the loss functions are continuous.
Lemma. Let L1(δ, θ) = (δ − θ)² be the loss associated with the estimation of θ by δ. (This loss is usually known as quadratic loss.) The estimator of θ is δ1 = E(θ), the mean of the updated distribution of θ.

Proof. We have to calculate the risk and show that δ1 minimizes it. So, R(δ) = E[(δ − θ)²] = E([(δ − δ1) + (δ1 − θ)]²), where δ1 = E(θ). Therefore,

R(δ) = E_θ[(δ − δ1)²] + E_θ[(δ1 − θ)²] + 2E_θ[(δ − δ1)(δ1 − θ)]
     = (δ − δ1)² + E_θ[(δ1 − θ)²] + 2(δ − δ1)E_θ[δ1 − θ]
     = (δ − δ1)² + E_θ[(δ1 − θ)²]
     = (δ − δ1)² + V(θ),

since δ1 = E(θ), and the risk is minimized for δ = δ1. In this case, the Bayes risk is R(δ1) = V(θ) and R(δ1) ≤ R(δ), ∀δ, with equality iff δ = δ1. □
The quadratic loss is sometimes criticized for introducing a penalty that increases strongly with the estimation error δ − θ. In many cases, it is desirable to have a loss function that does not overly emphasize large estimation errors. The next lemma presents the estimator associated with the absolute loss function, which considers punishments increasing linearly with the estimation error.

Lemma. Let L2(δ, θ) = |δ − θ| be the loss associated with the estimation of θ. The estimator of θ is δ2 = med(θ), the median of the updated distribution of θ.
The proof of this lemma is more cumbersome and will be left as an exercise.

Another way to reduce the effect of large estimation errors is to consider loss functions that remain constant whenever |δ − θ| > ε, for some arbitrary ε. There is some freedom in the choice of suitable values of ε. The most common choice is the limiting value as ε → 0. This loss function associates a fixed loss when an error is committed, irrespective of its magnitude. This loss is usually known as the 0-1 loss.

Lemma. Let L3(δ, θ) = lim_{ε→0} I_{|δ−θ|}([ε, ∞)). The estimator of θ is δ3 = mode(θ), the mode of the updated distribution of θ.
Proof (for the continuous case).

E[L3(δ, θ)] = lim_{ε→0} [ ∫_{−∞}^{δ−ε} 1·p(θ) dθ + ∫_{δ−ε}^{δ+ε} 0·p(θ) dθ + ∫_{δ+ε}^{∞} 1·p(θ) dθ ]
           = 1 − lim_{ε→0} Pr(δ − ε < θ < δ + ε).

But, for small ε, Pr(δ − ε < θ < δ + ε) ≈ 2ε p(δ), and so E[L3] is minimized when p(δ) is maximized. Hence, δ3 = mode(θ). □
When the updated distribution is the posterior, the estimator associated with the 0-1 loss is the posterior mode. This is also referred to as the generalized maximum likelihood estimator (GMLE), obtained as a solution of the equation dp(θ | x)/dθ = 0.

Fig. 4.3 Loss functions: quadratic, ----; absolute, ·····; 0-1, - - -.

Figure 4.3 illustrates the variation of the loss
functions considered here as a function of the estimation error. Many of these results can be generalized to the multivariate case. Apart from the absolute value, which has no clear extension to the multivariate case, the quadratic and 0-1 losses can be respectively extended by

L1(δ, θ) = (δ − θ)'(δ − θ)  and  L3(δ, θ) = lim_{vol(A)→0} I_{δ−θ}(A),

where A is a region containing the origin and vol(A) is the volume of the region A. It is not difficult to show that the Bayes estimators of θ under the loss functions L1 and L3 are respectively given by the joint mean and joint mode of the updated distribution of θ. These concepts are treated in greater depth by Berger (1985), DeGroot (1970) and Ferguson (1967).
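The correspondence between losses and posterior summaries can be illustrated numerically. The sketch below is an illustrative Python fragment (the Gamma(2, 1)-shaped posterior and the grid are assumptions, not from the text) that computes the mean, median and mode of a discretized posterior, i.e. the Bayes estimates under quadratic, absolute and 0-1 losses:

```python
import math

# Posterior represented on a grid (illustrative: unnormalized Gamma(2,1) shape,
# which has mode 1, mean 2 and median about 1.68).
theta = [i * 0.01 for i in range(1, 1001)]             # grid on (0, 10]
dens = [t * math.exp(-t) for t in theta]
total = sum(dens)
probs = [d / total for d in dens]

post_mean = sum(t * p for t, p in zip(theta, probs))   # quadratic loss -> mean

cum = 0.0
for t, p in zip(theta, probs):                         # absolute loss -> median
    cum += p
    if cum >= 0.5:
        post_median = t
        break

post_mode = theta[max(range(len(dens)), key=lambda i: dens[i])]  # 0-1 loss -> mode

assert abs(post_mode - 1.0) < 0.02
assert abs(post_mean - 2.0) < 0.05
```

Each loss picks out a different summary of the same posterior, which is exactly the point of the three lemmas above.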
Example. Let X = (X1, ..., Xn) be a sample from a N(θ, σ²) distribution and φ = σ⁻². We have previously seen that in a conjugate analysis the posterior distribution is θ | φ ~ N(μ1, (c1φ)⁻¹) and n1σ1²φ ~ χ²(n1).
To ease the derivation of the mode of the joint distribution, it is usual to work with the logarithm of p(θ, φ | x), given by

log p(θ, φ | x) = k − (φ/2)[c1(θ − μ1)² + n1σ1²] + ((n1 + 1)/2 − 1) log φ,

and differentiate it with respect to θ and φ, leading to

∂ log p(θ, φ | x)/∂θ = −c1φ(θ − μ1)

and

∂ log p(θ, φ | x)/∂φ = −(1/2)[c1(θ − μ1)² + n1σ1²] + ((n1 + 1)/2 − 1)/φ.

Making ∂ log p(θ, φ | x)/∂θ = 0 gives θ = μ1 as a critical point and ∂ log p(μ1, φ | x)/∂φ = 0 gives φ̂ = σ1⁻²(n1 − 1)/n1. The second order conditions are satisfied as

∂² log p(θ, φ | x)/∂θ² |_{θ=μ1, φ=φ̂} = −c1φ̂ < 0 and ∂² log p(θ, φ | x)/∂φ² |_{θ=μ1, φ=φ̂} = −((n1 − 1)/2)/φ̂² < 0,

while the cross-derivative −c1(θ − μ1) vanishes at the critical point. Therefore, (μ1, φ̂) is the mode of the joint posterior distribution of (θ, φ).
The above calculations do not guarantee that μ1 is the maximum of the marginal distribution of θ nor that φ̂ is the maximum of the marginal distribution of φ. In this example, it is easy to see that μ1 is also the marginal mode, since the marginal distribution of θ is a Student t centred at μ1. This automatically implies that μ1 is the mean and the median of the marginal posterior distribution of θ. However, the same is not true for φ. It was shown in Section 3.4 that φ | x ~ G(n1/2, n1σ1²/2). This distribution has mode (n1 − 2)/(n1σ1²), which differs from the joint mode φ̂. Note also that the posterior mean of φ is σ1⁻² and the median cannot be explicitly evaluated.

Another important consequence from probability theory is that the mode and the mean are not invariant under transformations. Let σ² = φ⁻¹ and denote the mode of σ² by σ̂²; φ̂⁻¹ is neither the joint nor the marginal mode of σ². To evaluate the mode of σ², the posterior distribution of σ² must be obtained. For the case of the marginal distribution,
p(σ² | x) ∝ (σ²)^(−(n1/2 + 1)) exp{ −n1σ1²/(2σ²) }

and its logarithm is

log p(σ² | x) = k − (n1/2 + 1) log σ² − n1σ1²/(2σ²).

Differentiating with respect to σ²:

d log p(σ² | x)/dσ² = −(n1/2 + 1)(1/σ²) + n1σ1²/(2σ⁴).

The solution of the equation d log p(σ² | x)/dσ² = 0 is σ̂² = n1σ1²/(n1 + 2) ≠ n1σ1²/(n1 − 1) = φ̂⁻¹.
The second order condition guarantees the maximum, as

d² log p(σ² | x)/d(σ²)² |_{σ²=σ̂²} = (n1/2 + 1)(1/σ̂⁴) − n1σ1²/σ̂⁶ = −(1/2)(n1 + 2)³/(n1σ1²)² < 0. □
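The non-invariance of the mode is easy to check numerically. In the sketch below (plain Python, with illustrative values of n1 and σ1², not from the text), the marginal mode of σ² differs from the reciprocal of the marginal mode of φ, exactly as derived above:

```python
# Non-invariance of the posterior mode under the transformation sigma2 = 1/phi.
# phi | x ~ G(n1/2, n1*s1/2) has mode (n1 - 2)/(n1*s1), while the marginal
# distribution of sigma2 (an inverse gamma) has mode n1*s1/(n1 + 2).
n1, s1 = 10, 2.0                      # illustrative values (s1 plays sigma1^2)

mode_phi = (n1 - 2) / (n1 * s1)       # marginal mode of the precision phi
mode_sigma2 = n1 * s1 / (n1 + 2)      # marginal mode of the variance sigma2

assert abs(1 / mode_phi - n1 * s1 / (n1 - 2)) < 1e-12
assert mode_sigma2 != 1 / mode_phi    # the mode does not transform as 1/phi
```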
4.2
Classical point estimation
In the Bayesian methodology, point estimation is always dealt with by minimization of the expected loss. In the classical perspective, many methods have been proposed in an effort to make them adequate to a variety of problems. In this section, the three most important methods will be cited: the method of maximum likelihood, the method of least squares and the method of moments. We will also briefly present non-parametric estimation. Classical point estimation is covered in a clear and concise way in the books by Cox and Hinkley (1974) and Silvey (1970). We recommend both books to the interested reader.
4.2.1
Maximum likelihood
This is currently the most used method of estimation in classical inference. Its use is justified in many instances and it is useful to note that it is entirely based on the likelihood function. Therefore it does not violate the likelihood principle. In addition, there is a good body of theory developed in a variety of situations. Its intuitive appeal can be grasped in the very simple example below.
Example. Consider a situation where all that is known about an unknown quantity of interest θ is that its value is either 1/4 or 3/4. Assume now that a 0-1 random variable X is observed and the success (X = 1) probability of X is either 1/6, when θ = 1/4, or 4/5, when θ = 3/4. This probabilistic setup is summarized in Table 4.2. Observe that the sum of each column is 1 but the sum of each line is not. Given the value of X, the likelihood function of θ can be constructed as l(θ; x) = p(x | θ).

• If X = 1 is observed, l(1/4; x = 1) = 1/6 < 4/5 = l(3/4; x = 1). This means that the model with θ = 3/4 attached a larger probability to the observed event than the model with θ = 1/4 and therefore seems more plausible or likely. If we had to choose an estimate for θ after observing X = 1, we would probably opt for the value 3/4.
Table 4.2 Table of probabilities

          θ = 1/4   θ = 3/4
  x = 0     5/6       1/5
  x = 1     1/6       4/5

• If X = 0 is observed, the same reasoning would lead to the choice of the value 1/4 for θ, since l(1/4; x = 0) = 5/6 > 1/5 = l(3/4; x = 0).
So, in the above example, we have opted to estimate θ by the value that maximizes the likelihood function for every value of x. This simple and powerful idea is the basis of the method.
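The reasoning in this example amounts to comparing the two columns of Table 4.2 row by row. A minimal sketch in Python:

```python
# Likelihoods from Table 4.2: lik[x][theta] = p(x | theta).
lik = {
    0: {0.25: 5 / 6, 0.75: 1 / 5},
    1: {0.25: 1 / 6, 0.75: 4 / 5},
}

def mle(x):
    # pick the value of theta that gives the observed x the largest probability
    return max(lik[x], key=lik[x].get)

assert mle(1) == 0.75   # X = 1 observed -> estimate theta by 3/4
assert mle(0) == 0.25   # X = 0 observed -> estimate theta by 1/4
```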
Definition. Consider the observation of X with joint density p(x | θ). The maximum likelihood estimator (MLE, in short) of θ is the value of θ ∈ Θ that maximizes l(θ; X). The usual notation for the MLE of θ is θ̂. Its observed value is called the maximum likelihood estimate.

In most cases, θ varies continuously over an interval or, more generally, over a region of Rᵖ, for some p. In these cases, irrespective of whether X contains discrete variables or not, the MLE can typically be found by solving the equation ∂l(θ; X)/∂θ = 0, or equivalently ∂ log l(θ; X)/∂θ = 0, where 0 is a vector of 0's.

We are now in a position to compare the GMLE, presented in the previous section, with the MLE. The GMLE generalizes the MLE just as the posterior density generalizes the likelihood function. Recall that p(θ | X) ∝ l(θ; X)p(θ). In the special case p(θ) ∝ k, it follows that p(θ | X) ∝ l(θ; X). Therefore, the value of θ that maximizes the posterior (the GMLE) also maximizes the likelihood. So, the MLE is the GMLE in the case p(θ) ∝ k.

Despite their similarity, it is important to stress the distinction between classical and Bayesian estimators. The former are statistics and therefore have a sampling distribution, based on which their properties will be established. Bayesian estimators are based on the posterior distribution, which is always conditional on the value of the observed sample, and therefore their properties are based on the posterior distribution, an entirely different object. Nevertheless, they can be seen as functions of the observed sample and in this way compared numerically with classical estimators.
Example. Let X = (X1, ..., Xn) be a sample from the N(θ, σ²) distribution. The likelihood function is

l(θ, σ²; X) = (2πσ²)^(−n/2) exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − θ)² },

with logarithm

log l(θ, σ²; X) = −(n/2) log(2πσ²) − (1/(2σ²)) [ Σᵢ₌₁ⁿ (Xᵢ − X̄)² + n(X̄ − θ)² ].
Differentiating with respect to B and a2 gives 1 a!og/(8,"2;X) '----'- = - -, 2 n(X- 8) -"-= -': 2" " a 8-
and
where
S2 =
z L.. (X; - X) . n- I i=l 1
--
�
-
Observe that the denominator used in the expression of 52,
(n
with respect to the value used in previous chapters (n).
Equating derivatives to 0 leads toe = X and &2 = (n
-
-
1), was modified
l)S2 jn as critical
points.
Differentiating again gives

∂² log l(θ, σ²; X)/∂θ² = −n/σ² < 0

and

∂² log l(θ, σ²; X)/∂(σ²)² |_{θ=θ̂, σ²=σ̂²} = n/(2(σ̂²)²) − (1/(σ̂²)³)[(n − 1)S² + n(X̄ − θ̂)²] = −n/(2(σ̂²)²) < 0.

Therefore, (θ̂, σ̂²) is the MLE of (θ, σ²).
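For a concrete sample, the MLEs above reduce to simple arithmetic. A minimal sketch (the data are illustrative, not from the text):

```python
# MLEs for a normal sample: theta_hat = Xbar and sigma2_hat = (n-1)*S^2/n,
# i.e. the average squared deviation computed with divisor n.
x = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]
n = len(x)

xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # S^2, divisor n - 1
sigma2_mle = (n - 1) * s2 / n                      # MLE of sigma^2, divisor n

assert abs(sigma2_mle - sum((xi - xbar) ** 2 for xi in x) / n) < 1e-12
```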
This result can also be established from the similarities between the GMLE and the MLE and the calculations in the previous section. Note that by taking c0 → 0, σ0 → 0 and n0 = 2, the prior becomes constant and the above result becomes a special case of the derivations of the previous section.

There are many interesting properties of the MLEs that can be easily shown:

1. The MLE is invariant to 1-to-1 transformations (unlike the GMLE). If the MLE of θ is θ̂ and φ = φ(θ) is a 1-to-1 function of θ, then the MLE of φ is given by φ̂ = φ(θ̂). To see that, remember that θ̂ maximizes l(θ) and that the likelihood of φ is l*(φ), with l*(φ(θ)) = l(θ). Therefore, if θ̂ maximizes l, then φ(θ̂) will maximize l* by uniqueness of the transformation. Consequently, φ̂ = φ(θ̂) is the MLE of φ.

Example. In the case of a N(θ, σ²) distribution, the MLE of σ² is σ̂². Therefore, the MLE of the standard deviation σ is σ̂ and the MLE of the precision φ = σ⁻² is σ̂⁻².
2. As briefly mentioned before, the MLE does not depend on the sampling plan. If different experiments E1 and E2 lead to respective likelihood functions l1(θ) and l2(θ) with l1 = kl2 for some k > 0 that does not depend on θ, their MLEs will be the same. Therefore, the MLE does not violate the likelihood principle.

3. The MLE may not exist. Assume that X1, ..., Xn is a sample from the U(0, θ) distribution, θ > 0. The likelihood function is

l(θ; X) = p(X | θ) = (1/θⁿ) ∏ᵢ₌₁ⁿ I_θ((Xᵢ, ∞)) = (1/θⁿ) I_θ((T, ∞)),

where T = maxᵢ Xᵢ. As the likelihood function is a strictly decreasing function of θ, its maximum is attained at the lower value of its domain, the interval (T, ∞). As the interval is open, the function does not have a maximum and θ does not have an MLE. This technical difficulty is easily remedied by considering, without loss of generality, closed intervals. In this case, the MLE of θ is T = maxᵢ Xᵢ. Note, however, that Pr(T < θ) = 1 and therefore the MLE will underestimate θ with certainty.

4. The MLE may not be unique. Assume that X1, ..., Xn is a sample from the U(θ, θ + 1) distribution, θ ∈ R. The likelihood function is

l(θ; X) = ∏ᵢ₌₁ⁿ I_θ((Xᵢ − 1, Xᵢ)) = I_θ((T2 − 1, T1)),

where T1 = minᵢ Xᵢ and T2 = maxᵢ Xᵢ. Therefore, the MLE of θ will be any value in the interval (T2 − 1, T1), if it exists, because the likelihood function is constant over that region.

5. MLE and Bayes estimators depend on the sample only through minimal sufficient statistics. By the factorization criterion, l(θ; X) = g(X)f(T, θ), where T is a sufficient statistic. Maximization of l(θ; X) is therefore equivalent to maximization of f(T, θ). Since f only depends on the sample through T, the MLE will have to be a function of T. The same reasoning applies for the Bayes estimators. As this is valid for every sufficient statistic, it must be valid for the minimal one.
To obtain the (G)MLE, one must obtain the maximum of the likelihood function (posterior density, respectively). This task can be seen as an optimization problem. Assuming further that θ varies continuously over a region simplifies the task considerably. The problem can typically be solved by solving the equation ∂l(θ; X)/∂θ = 0 (or ∂p(θ | x)/∂θ = 0). In many cases, it is easier to work with the logarithm. Concentrating on the likelihood from now on, let L(θ; X) = log l(θ; X). The problem can then be rephrased in terms of finding the solution of

U(X; θ) = ∂L(θ; X)/∂θ = 0,

where U is the score function introduced in Chapter 2. Note that

∂ log p(θ | X)/∂θ = ∂L(θ; X)/∂θ + ∂ log p(θ)/∂θ = U(X; θ) + ∂ log p(θ)/∂θ,

and the left-hand side can thus be referred to as the generalized score function. In some cases, the above equation can be analytically solved and the roots of the (generalized) score function found explicitly. In other applications, typically involving a highly dimensional θ, no analytical solution can be found. Algorithms for solving this problem will be presented in the next chapter.
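As a small preview of such root-finding, the sketch below solves the score equation numerically for a Poisson sample, a case where the root is known analytically to be x̄. (Plain Python; the bisection scheme and the data are illustrative assumptions, not the algorithms of the next chapter.)

```python
# Score equation for a Poisson(theta) sample:
#   U(X; theta) = sum(x)/theta - n = 0, with root theta = xbar.
x = [2, 0, 3, 1, 4, 2]
n, s = len(x), sum(x)

def score(theta):
    return s / theta - n

# Bisection: the score is strictly decreasing in theta on (0, inf).
lo, hi = 1e-6, 100.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        lo = mid            # root lies to the right of mid
    else:
        hi = mid            # root lies to the left of mid

assert abs(mid - s / n) < 1e-8   # numerical root agrees with the MLE xbar
```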
4.2.2
Method of least squares
Assume now that Y = (Y1, ..., Yn) is a random sample such that E(Yᵢ | θ) = fᵢ(θ) and V(Yᵢ | θ) = σ². One can rewrite each Yᵢ as

Yᵢ = fᵢ(θ) + eᵢ,

where E(eᵢ) = 0 and V(eᵢ) = σ², i = 1, ..., n.

One possible criterion for the estimation of θ is to minimize the observation errors eᵢ incurred. There are many ways to account globally for the errors. Given the assumption of homoscedasticity (equal error variances), it seems fair to account for all the errors in the same way and with the same weight. Also, it seems reasonable to account for the errors symmetrically, to avoid penalizing positive or negative errors more. One possibility is to attempt minimization of the sum of the absolute errors. In fact, this choice is as plausible as the choice of the absolute loss function in the context of Bayesian estimation. However, for historical and mathematical reasons, the criterion preferred in many cases is to account for the squared errors.

Therefore, the estimation criterion can be stated as the minimization of

S(θ) = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (Yᵢ − fᵢ(θ))².

Note that by forming the vector f(θ) = (f1(θ), ..., fn(θ))', the quadratic form can be rewritten as S(θ) = (Y − f(θ))'(Y − f(θ)). The value of θ that minimizes S(θ) is called the least squares estimator (LSE, in short) of θ. Once again, minimization is achieved by solving the equation ∂S(θ)/∂θ = 0.
Classical point estimation 91 Example. Simple linear reg re ssio n . Assume one knows that his/her variable of interest
Y
is affected by the values of another known quantity
X
and that this
dependence is linear on the mean. One model for this setup is to take E
eo+ e, X; where 0
=
(Yi
(eo, eJ). The quadratic form is given by n sceo, e,) L(Y; - e0-e1X;)2 i=l
1 0)
=
=
Differentiation gives
ascoJ aeo ascoJ aei
--
=
f-. L.)Y;-e0-e1X;) -n(Y- e0-e 1X) _
_
=
,�1
=
--
where
-2
g(X, Y)
-2
f-. X;(Y;-80-81X;) L.. i=I
generically denotes
_
_
=
_
-n(XY- BoX-e1XZ)
(1/n) 2::7�1, g(X;, Y;) (Bo, BJ) is
It is not difficult to
.
show that the least squares estimator of
, , (Y -e,1x-, XY- :XY) .
(Oo,Oil
-
=
X2-X2
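The closed-form solution of the two score equations can be checked on a small data set. A sketch (illustrative data; the bar notation above becomes explicit sample averages):

```python
# Least squares for simple linear regression, via the closed form
# theta1 = (mean(XY) - Xbar*Ybar)/(mean(X^2) - Xbar^2), theta0 = Ybar - theta1*Xbar.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]        # illustrative data, roughly Y = 2X
n = len(X)

xbar = sum(X) / n
ybar = sum(Y) / n
xybar = sum(a * b for a, b in zip(X, Y)) / n
x2bar = sum(a * a for a in X) / n

theta1 = (xybar - xbar * ybar) / (x2bar - xbar ** 2)
theta0 = ybar - theta1 * xbar

# Both derivatives of S(theta) must vanish at the estimate:
assert abs(sum(y - theta0 - theta1 * x for x, y in zip(X, Y))) < 1e-9
assert abs(sum(x * (y - theta0 - theta1 * x) for x, y in zip(X, Y))) < 1e-9
```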
One important support for the method is the fact that it coincides with the MLE if the error distribution is normal. This method can be extended to the case of error variances that are unequal up to constants, namely V(eᵢ) = wᵢ⁻¹σ². It seems reasonable in this case to take into account the different variabilities and weigh more heavily in the sum the more precise observations, those with larger values of wᵢ. This modification of the criterion leads to the weighted LSE, obtained by minimization of

S(θ) = Σᵢ₌₁ⁿ wᵢ (Yᵢ − fᵢ(θ))².
The sum can again be written in matrix notation as (Y − f(θ))'W(Y − f(θ)), where W is the n × n diagonal matrix with elements w1, ..., wn. Note that in its full generality, the matrix of weights W does not even need to be diagonal, by allowing correlated observation errors. The previous method (without weights) is also called ordinary least squares. It can be obtained as a special case where W = In, the n × n identity matrix.
One useful application of this method is in the case of spurious observations. We would not want our analysis to be influenced by observations that are known to be discordant from the rest of the observations. Reduction of the effect of these variables is achieved by setting smaller values of wᵢ for them. In the limit, wᵢ → 0 and the observation has no influence on the estimation. This line of reasoning forms the basis of robustness studies.

In all the cases above, σ² is estimated by σ̂²_LS, given by

σ̂²_LS = (Y − f(θ̂_LS))'W(Y − f(θ̂_LS))/n,

where θ̂_LS is the weighted LSE of θ. Sometimes, n is replaced by n − p, where p is the dimension of θ.

4.2.3
Method of moments
Assume again a random sample X1, ..., Xn from a distribution p(x | θ) with moments of order k given by μk = E(Xᵏ | θ), k = 1, 2, .... The method of moments recommends estimation of μk by

Mk = (1/n) Σᵢ₌₁ⁿ Xᵢᵏ.

In words, the populational moments are estimated by the sample moments. Any other function of θ is estimated in the same way, by taking into account its relation with the population moments. For example, μ1 = E(X | θ) is estimated by μ̂1 = X̄, μ2 = E(X² | θ) is estimated by M2 and the populational variance V(X | θ), once it is written as μ2 − μ1², is estimated by M2 − M1².
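A minimal sketch of the method for the variance (illustrative data):

```python
# Method of moments: estimate population moments by sample moments, then
# estimate V(X | theta) = mu2 - mu1^2 by M2 - M1^2.
x = [1.0, 3.0, 2.0, 5.0, 4.0]
n = len(x)

m1 = sum(xi for xi in x) / n            # M1 estimates mu1 = E(X)
m2 = sum(xi ** 2 for xi in x) / n       # M2 estimates mu2 = E(X^2)
var_mm = m2 - m1 ** 2                   # moments estimator of the variance

# identical to the average squared deviation from the sample mean
assert abs(var_mm - sum((xi - m1) ** 2 for xi in x) / n) < 1e-12
```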
4.2.4 Empirical distribution function
This is a non-parametric method which involves no knowledge of the distribution function to be estimated. This estimator is useful at least as an initial estimator, thus providing some insight into the form of the distribution function. Let X = (X_1, …, X_n) be a random sample from an unknown distribution function F. Recall from the definition that F(x) = P(X ≤ x). The empirical distribution function is denoted by F̂ and is given by

F̂(x) = #{X_i : X_i ≤ x} / n.

Observe that, just like F, F̂ is non-decreasing and contained in the interval [0, 1]. It is interesting to note that F̂(x) can be written as Z̄, where Z_i = I_{(−∞,x]}(X_i), i = 1, …, n. The population quantity equivalent to Z̄ is the proportion of population elements that are ≤ x. This is given by P(X ≤ x) = F(x). So, the empirical distribution function is a form of method of moments estimator of the distribution function. Other properties of F̂ will be described in the sequel.
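A direct sketch of F̂, counting the proportion of sample points not exceeding each argument x (the use of sorting and binary search is one implementation choice among many):

```python
import numpy as np

def ecdf(sample):
    """Return the empirical distribution function of the sample:
    F-hat(x) = #{X_i <= x} / n."""
    data = np.sort(np.asarray(sample, dtype=float))
    n = len(data)
    def F_hat(x):
        # searchsorted with side='right' counts how many points are <= x
        return np.searchsorted(data, x, side="right") / n
    return F_hat

F_hat = ecdf([3.0, 1.0, 2.0, 2.0])   # step function, non-decreasing, in [0, 1]
```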
4.3 Comparison of estimators

Given that there is no unifying criterion for the choice of frequentist estimators in any given problem, it is important that a set of criteria is established to compare them. The main criteria for comparison are: bias, (frequentist) risk or mean squared error, and consistency. Much effort was concentrated on this area during the 1950s and 1960s. These studies led to the characterization of uniformly minimum variance unbiased (UMVU, for short) estimators.
4.3.1 Bias
Definition. Let X = (X_1, …, X_n) be a random sample from p(x | θ) and δ = δ(X) an estimator of h(θ), for any given function h. δ is an unbiased estimator of h(θ) if E[δ | θ] = h(θ), ∀θ. The estimator δ is said to be biased otherwise. In this case, the bias is denoted by b(θ) and defined as b(θ) = E[δ | θ] − h(θ). The frequentist interpretation of the definition is that after repeating the sampling of X from p(x | θ) many times, averaging the corresponding values of δ will produce h(θ) as a result. This is a desirable property because one formulates an estimator δ in an effort to obtain the value of h(θ). The difficulty is that in most cases only a single sample X is observed, due to time and/or financial restrictions. Note also that unbiased estimation is always related to a given parametric function; an estimator can be biased with respect to a given function but unbiased with respect to another one.
Example. None of the three parametric methods of estimation proposed in the previous section can guarantee unbiased estimators. The empirical distribution function, however, is an unbiased estimator of the distribution function.

4.3.2 Risk
Definition. Let X = (X_1, …, X_n) be a random sample from p(x | θ) and denote now by δ = δ(X) an estimator of h(θ). The frequentist risk of the estimator δ is defined as R_δ(θ) = E_{X|θ}[L(δ(X), θ)]. In the case of a quadratic loss function L, the risk is given by R_δ(θ) = E_{X|θ}[(δ − h(θ))′(δ − h(θ))] and is also called the mean squared error (MSE, for short). In the scalar case, the MSE reduces to E_{X|θ}[δ − h(θ)]².
Comparing with the Bayes risk, once again we see the presence of an expected loss. The change of approach in the evaluation of the expectation must be stressed. The Bayesian risk considers expectation with respect to the posterior distribution of θ|x whereas here expectations are taken with respect to the sampling distribution of X|θ. Although the dependence on x causes no harm to the Bayesian estimators, the dependence on θ will cause additional problems, to be described below. In terms of risk, the estimation task reduces to finding the estimator of smallest risk. Let δ_1 = δ_1(X) and δ_2 = δ_2(X) be estimators of h(θ). Their respective risks are R_{δ_1}(θ) and R_{δ_2}(θ), and δ_1 is better than δ_2 if R_{δ_1}(θ) ≤ R_{δ_2}(θ) for all θ, with strict inequality for at least one value of θ. An estimator is admissible if there is no better estimator than it. When an estimator is unbiased and its variance is uniformly smaller, for all possible values of θ, than that of all other unbiased estimators, it is referred to as a uniformly minimum variance unbiased (UMVU, for short) estimator.
If the estimator is unbiased, its quadratic risk is given by tr[V(δ | θ)], the trace of its sampling covariance matrix. If it is biased, the quadratic risk is given by tr[V(δ | θ)] + [b(θ)]′b(θ). In the case of a scalar θ, the quadratic risk of an unbiased estimator is given by its sampling variance V(δ | θ) and, if δ is biased, its quadratic risk is given by V(δ | θ) + b²(θ).

Example. Let X = (X_1, …, X_n) be a random sample from the N(θ, σ²) distribution with σ² known and h(θ) = θ. Taking δ_1(X) = X̄ and δ_2(X) = X_1 gives

E[δ_1(X) | θ] = E(X̄ | θ) = (1/n) Σ_{i=1}^n E(X_i | θ) = (1/n) nθ = θ
E[δ_2(X) | θ] = E(X_1 | θ) = θ

and therefore δ_1 and δ_2 are unbiased estimators of θ. Therefore, their quadratic risks will coincide with their sampling variances and will be respectively given by

R_{δ_1}(θ) = V(X̄ | θ) = σ²/n  and  R_{δ_2}(θ) = V(X_1 | θ) = σ².

Of course, R_{δ_1}(θ) < R_{δ_2}(θ), if n > 1, for all values of θ and therefore δ_1 is better
than δ_2. It is not always possible to find an estimator that completely dominates the other ones in terms of risk. In the ideal situation one would eliminate θ by suitably weighting the risks over their different values. This is performed naturally in the Bayesian context. Here however θ is fixed and no such natural weighting scheme exists. An alternative is to consider the worst possible risk for each estimator and choose the estimator with smallest worst possible risk. This is the definition of the minimax estimator. Returning to the example, it is not entirely surprising that δ_1 is better than δ_2. After all, δ_1 seems to be using the sample information better than δ_2. Once again, the key concept here is sufficiency and the result is formalized in the Rao–Blackwell theorem below.

Theorem 4.1. Let X = (X_1, …, X_n) be a random sample from p(x | θ), δ = δ(X) an unbiased estimator of h(θ), for some function h, and T = T(X) a sufficient statistic for θ. Then, δ* = δ*(X) = E(δ | T) is an unbiased estimator of h(θ) with V(δ* | θ) ≤ V(δ | θ), ∀θ.¹ In the case of a scalar h(θ), the result states that V(δ* | θ) ≤ V(δ | θ) in the usual numerical sense.
¹Recall from Chapter 1 that if A and B are square matrices of the same dimension, A ≤ B means that A − B is a non-positive definite matrix.
Before proving the result there are a few important comments to make. The first one is that the theorem states that whenever one finds an unbiased estimator, it can always be improved in terms of risk by conditioning on a sufficient statistic. Also, the conditional expectation used in the definition of δ* does not introduce θ because of the definition of a sufficient statistic.

Proof. Initially note that δ* is unbiased because

E[δ*(X) | θ] = E{E[δ(X) | T(X)] | θ} = E[δ(X) | θ] = h(θ).

Finally note that

V(δ* | θ) = V(δ | θ) − E[V(δ | T) | θ].

As V(δ | T) ≥ 0, its expectation is also non-negative and therefore V(δ* | θ) ≤ V(δ | θ). □
The important message of the theorem is that estimators have their risks reduced if they are functions of sufficient statistics. Risks are indeed reduced by properties of non-negative definite matrices (see Chapter 1). Yet again, maximal improvement in terms of risk is achieved if minimal sufficient statistics are used. A related interesting question is to know whether the resulting risk is the smallest possible. The search for maximal reduction is helped in a sense by the concept of complete families of distributions.
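The risk reduction promised by the theorem can be checked by simulation. A sketch assuming Ber(θ) data, where δ = X_1 is unbiased for θ and conditioning on the sufficient statistic T = ΣX_i gives δ* = E(X_1 | T) = T/n = X̄ (the chosen θ, n and number of replications are arbitrary):

```python
import numpy as np

# Monte Carlo illustration of the Rao-Blackwell theorem for Ber(theta) data.
rng = np.random.default_rng(42)
theta, n, reps = 0.3, 10, 20_000
samples = rng.binomial(1, theta, size=(reps, n))

delta = samples[:, 0].astype(float)    # crude unbiased estimator X_1
delta_star = samples.mean(axis=1)      # Rao-Blackwellized version T/n

# Both are (approximately) unbiased, but the risks differ sharply:
# V(delta) = theta(1-theta) versus V(delta*) = theta(1-theta)/n.
```

Both estimators average close to θ, but the Rao–Blackwellized one has its variance reduced by roughly a factor of n.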
Definition. Let X = (X_1, …, X_n) be a random sample from p(x | θ), T = T(X) any statistic and g any function of T. The family of distributions of T is complete if, for all θ,

E(g(T) | θ) = 0  ⟹  g(T) = 0 with probability 1.
Verification of completeness of families directly from the definition is cumbersome. Fortunately, for exponential families with k parameters, it can be shown that the family of distributions of the k-dimensional statistic (U_1(X), …, U_k(X)), with the U_i's and φ_i's as in the definition of the exponential family, is complete if the variation space of (φ_1, …, φ_k) is k-dimensional. The proof requires elements that are beyond the scope of this book and will therefore be omitted. The interested reader is referred to Lehmann (1986).

Example. Let X = (X_1, …, X_n) be a random sample from the N(θ, σ²) distribution. Then, U_1 = Σ X_i, U_2 = Σ X_i² and (φ_1, φ_2) = (θ/σ², −1/(2σ²)) vary over a bidimensional space. Therefore, the family of distributions of (U_1, U_2) is complete. If one assumes that θ = σ², the space of variation of (φ_1, φ_2) is reduced to a single dimension and the family of distributions of (U_1, U_2) is no longer complete.
The concept of completeness is useful to ensure uniqueness of the UMVU estimator. It can be shown that the UMVU estimator is unique in the presence of complete families because if δ_1 and δ_2 are both UMVU estimators of h(θ), their difference has null expectation for all θ and must therefore be null with probability 1. Another useful result for the comparison of estimators is a lower bound for the variance of unbiased estimators. The result is due to Fisher although it was independently stated in its present form by Cramér and Rao in the 1940s, as cited in Cox and Hinkley (1974).
Theorem 4.2. Let X = (X_1, …, X_n) be a random sample from p(x | θ) and δ an unbiased estimator of h(θ), for some function h. Assume further that {x : p(x | θ) > 0} does not depend on θ, that the derivatives ∂p(x | θ)/∂θ and ∂h(θ)/∂θ exist, that E(δ | θ) is differentiable under the integral sign and that the Fisher information I(θ) is finite. Then

V(δ | θ) ≥ (∂h(θ)/∂θ)′ [I(θ)]⁻¹ (∂h(θ)/∂θ).

In the case of a scalar θ, the inequality reduces to

V[δ | θ] ≥ [dh(θ)/dθ]² / I(θ).
Proof (of the scalar case). E[δ | θ] = ∫ δ(x) p(x | θ) dx = h(θ) because δ is unbiased. Differentiating both sides with respect to θ gives

dh(θ)/dθ = ∂/∂θ ∫ δ(x) p(x | θ) dx
         = ∫ δ(x) [∂p(x | θ)/∂θ] dx, where the interchange of differentiation and integration is valid by hypothesis,
         = ∫ δ(x) [∂ log p(x | θ)/∂θ] p(x | θ) dx
         = E[δ(X) U(X; θ) | θ].

As previously seen in Section 2.4, E[U(X; θ)] = 0 and therefore

dh(θ)/dθ = E[(δ(X) − h(θ)) U(X; θ) | θ] = Cov[δ(X), U(X; θ) | θ].

Since the absolute value of the correlation between two random variables is never larger than 1, the squared covariance will never be larger than the product of the two variances. Hence,

[dh(θ)/dθ]² ≤ V[δ(X) | θ] V[U(X; θ) | θ].

But

V[U(X; θ) | θ] = E[U²(X; θ) | θ] = I(θ),

completing the proof. □

The proof of the theorem in the multiparameter case is left as an exercise. Observe that the unbiased estimator attains the lower bound when it has maximal correlation with the score function, in other words, when there are functions c and d of θ such that

δ(X) = c(θ) U(X; θ) + d(θ)

with probability 1. Taking expectations of both sides with respect to X | θ gives that δ(X) is an unbiased estimator of d(θ) and therefore d = h. In this case, c(θ) = [dh(θ)/dθ] I⁻¹(θ).
Also, when the MLE is unbiased, it attains the Cramér–Rao lower bound. This can be seen by solving the above equation for U. This leads to

U(X; θ) = ∂ log p(X | θ)/∂θ = [δ(X) − d(θ)] / c(θ).

Equating it to 0 implies that δ(X) is the MLE of d(θ). But we have already seen that d = h and, by hypothesis, δ is unbiased for h(θ). Hence, it attains the Cramér–Rao lower bound.
Definition. The estimator δ of h(θ) is said to be efficient if it is unbiased and its variance attains the Cramér–Rao lower bound, for all θ. The efficiency of an unbiased scalar estimator is given by the ratio between the Cramér–Rao bound and its variance. Note that there is no guarantee that UMVU estimators will attain the Cramér–Rao bound and they may have efficiency smaller than 1. However, the converse is true, with efficient estimators being necessarily UMVU.
Example. Let X = (X_1, …, X_n) be a random sample from the Pois(θ) distribution. Then

log p(X | θ) = −nθ + Σ_{i=1}^n X_i log θ − Σ_{i=1}^n log X_i!

and therefore U(X; θ) = −n + Σ X_i / θ. As estimators cannot possibly depend on the parameter they are supposed to estimate, define c(θ) = θ/n and d(θ) = θ. This way, it is immediate that X̄ is an efficient estimator of its mean θ. In fact, any linear function of X̄ is an efficient estimator of the respective linear function of θ. More than that, these are the unique efficient estimators that can be found in the presence of a random sample from the Pois(θ) distribution.
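A quick Monte Carlo check of efficiency in this example: since I(θ) = n/θ, the Cramér–Rao bound for estimating θ is θ/n and the sampling variance of X̄ should match it (the chosen θ, n and number of replications are arbitrary):

```python
import numpy as np

# Sampling variance of the Poisson sample mean versus the Cramer-Rao bound.
rng = np.random.default_rng(7)
theta, n, reps = 4.0, 25, 50_000
xbar = rng.poisson(theta, size=(reps, n)).mean(axis=1)

cr_bound = theta / n     # lower bound for the variance of unbiased estimators
# xbar.var() should be close to cr_bound, confirming efficiency of the mean
```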
4.3.3 Consistency

It is to be expected that the information contained in the sample increases with an increase in the sample size. This is certainly true at least for the Fisher measures of information. One would then expect that reasonable estimators will tend to get closer and closer to their estimands. This subsection discusses theoretical properties of the estimators as the sample size gets larger and larger. The relevant question is how close the estimator and its estimand are. Related questions of interest are: is the bias getting smaller when the sample size increases? Is the variance getting smaller as well? These questions will be deferred to the next chapter.
Definition. Let X_n = (X_1, …, X_n) be a random sample of size n from p(x | θ) and δ_n(X) an estimator of h(θ) based on a sample of size n. As the sample size n varies, a sequence of estimators for h(θ) is obtained. This sequence is said to be (weakly) consistent for h(θ) if δ_n(X) → h(θ), in probability, when n → ∞.

In practice, the definition is shortened by saying that the estimator is or is not consistent, instead of the sequence. The definition means that

∀ε > 0,  P(|δ_n(X) − h(θ)| > ε) → 0, when n → ∞.

This result is usually denoted by plim δ_n(X) = h(θ). As an example, F̂_n (the empirical distribution function) is consistent for F.
The three important questions to ask about an estimator are: is it unbiased, how large is its risk and is it consistent? When Bayes estimators are considered as functions of the sample X_n instead of its observed value x_n, they can be studied for their sampling properties just like any other estimator. In particular, it can be shown that Bayes estimators are invariably biased. Also, it may be reasoned that as the sample size increases, the influence of any non-degenerate prior becomes smaller and Bayes estimators will become closer to the MLE. So, intuitively, one can expect them to inherit all the properties of the MLE, irrespective of the loss function used.
Example. Let X_n = (X_1, …, X_n) be a random sample from the Ber(θ) distribution, with θ unknown. We know that the MLE of θ is θ̂_n = X̄_n and that, by the laws of large numbers, X̄_n → θ, in probability and almost surely. Therefore, X̄_n is a consistent estimator of θ.

In the case of a conjugate prior beta(α, β) and quadratic loss function, the Bayes estimator is δ_n*(x_n) = (α + nx̄_n)/(α + β + n), which converges to x̄_n when n → ∞. Therefore, |δ_n*(X_n) − X̄_n| → 0 in probability and, as X̄_n → θ almost surely, δ_n* is also consistent.
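The convergence of both estimators can be watched numerically. A sketch under the Ber(θ) model with a beta(α, β) prior, as in the example (the particular θ, α and β are arbitrary choices):

```python
import numpy as np

# MLE and Bayes estimator (posterior mean) along a growing Bernoulli sample.
rng = np.random.default_rng(3)
theta, a, b = 0.7, 2.0, 2.0
x = rng.binomial(1, theta, size=100_000)

for n in (10, 1_000, 100_000):
    xbar = x[:n].mean()                     # MLE based on the first n points
    bayes = (a + n * xbar) / (a + b + n)    # Bayes estimator under beta(a, b)
    # both columns drift towards theta; their difference vanishes as n grows
    print(n, xbar, bayes)
```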
It follows readily from Chebyshev's inequality that

P(|δ_n(X) − h(θ)| > ε) ≤ E{[δ_n(X) − h(θ)]′[δ_n(X) − h(θ)] | θ} / ε²,  ∀ε > 0,

but E{[δ_n(X) − h(θ)]′[δ_n(X) − h(θ)] | θ} = R_{δ_n(X)}(θ). So, if a sequence of estimators has quadratic risk tending to 0, the estimators are consistent.
One can also define strong consistency of a sequence of estimators if the convergence to the parameter is almost sure instead of in probability. The theory of probability assures us that almost sure convergence implies convergence in probability and therefore strong consistency implies weak consistency. It can be shown that the MLE is strongly consistent under the same regularity conditions of the Cramér–Rao inequality. Nevertheless, the concept of weak consistency retains the essence of what is needed and will be maintained in the sequel.
4.4 Interval estimation

4.4.1 Bayesian approach
Returning to the Bayesian point of view, the most adequate form to express available information about unknown parameters is through the posterior distribution. Despite its coherent specification through expected loss functions, point estimation presents some inconvenient features. The main restriction is that it simplifies the multitude of information from a distribution into a single figure. It is important at least to have some information about how precise the specification of this figure is. One possibility is to associate point estimators with a measure of the uncertainty about them. So, for the mean, one can use the variance or the coefficient of variation. For the mode, the observed information given by the curvature at the mode is usually adequate. Finally, for the median, the interquartile distance could be used.

In this section, another line of work is sought. The aim is to provide a compromise between the complete posterior distribution and a single figure extracted from it. This compromise is reached by providing a range of values extracted from the posterior distribution. Typically, one attaches a probability to this region and when the probability is large one gets a good idea of the probable or likely values of the unknown of interest. Ideally, one would like to report a region of values of θ that is as small as possible but that contains as much probability as possible. The size of the interval informs us about the dispersion of the values of θ.

Definition. Let θ be an unknown quantity defined in Θ. A region C ⊂ Θ is a 100(1 − α)% credibility or Bayesian confidence region for θ if P(θ ∈ C | x) ≥ 1 − α. In this case, 1 − α is called the credibility or confidence level. In the scalar case, the region C is usually given by an interval, [c_1, c_2] say, hence the name.
It should be clear from the above definition that the intervals are defined by simple probability evaluation over the posterior distribution of θ. Many Bayesian authors reject the use of the word confidence for Bayesian intervals. As will be seen shortly, confidence has a very precise meaning in the definition of frequentist intervals that differs substantially from the meaning given here. These authors consider it important to dissociate the concepts. In the sequel, we will refer to confidence intervals whenever we refer to the method in general or in the classical context and will use the word credibility in reference to Bayesian intervals.
Note that C is never an interval in the multidimensional case. Even in the uniparameter case, there is nothing in the definition enforcing the region C to be an interval. Therefore, there is a slightly misleading use of the word interval. The above probability is evaluated over the updated distribution of θ, which will be taken from now on as the posterior. In general, one would want both α and C to be as small as possible. This in turn implies that the posterior distribution is as concentrated as possible. The requirement of a posterior probability at least as large as the confidence level is essentially technical. It is mainly due to the use in discrete distributions where it is not always possible to find a region that exactly satisfies the probability required by a given level. In many cases, the inequality can be taken as an equality, thus implying that the region C will be as small as possible. Note also that credibility intervals are invariant under 1-to-1 transformations of the parameter. So, if C is a 100(1 − α)% credibility interval for θ and φ = φ(θ) is a 1-to-1 transformation of θ, then φ(C), the image of C under φ, is a 100(1 − α)% credibility interval for φ. This useful property is also shared by frequentist confidence intervals.
Example. Let X = (X_1, …, X_n) be a random sample from the N(θ, σ²) distribution with σ² known. The non-informative prior for θ is p(θ) ∝ k and the likelihood is

l(θ; x) ∝ exp[−(n/2σ²)(θ − x̄)²].

Then p(θ | x) ∝ l(θ; x)p(θ) ∝ l(θ; x) and therefore θ | x ~ N(x̄, σ²/n) or, equivalently, √n(θ − x̄)/σ | x ~ N(0, 1). From there, many 100(1 − α)% confidence intervals may be constructed for θ with the use of the standard normal distribution function Φ. Intervals with posterior probability 1 − α can be constructed from:

1. 1 − α = P(√n(θ − x̄)/σ ≤ z_α | x), which implies that θ ≤ z_α σ/√n + x̄ with posterior probability 1 − α. Hence, the interval C_1 = (−∞, x̄ + z_α σ/√n] is a 100(1 − α)% Bayesian confidence interval for θ. The length of C_1, however, is infinite, which is not very useful for our summarization purposes. As previously mentioned, one would like to have C as small as possible. The problem with this interval is that it includes (infinitely) many values that have negligible probability around them.

2. Let z_β and z_γ be numbers such that 1 − α = P(−z_β ≤ √n(θ − x̄)/σ ≤ z_γ | x).
Using the symmetry of the normal distribution, P(Z ≤ −z_β) = P(Z ≥ z_β) = β and the probability of the above interval is given by 1 − (γ + β), and therefore γ + β = α. With this assumption,

1 − α = P(−z_β ≤ √n(θ − x̄)/σ ≤ z_γ | x)
      = P(x̄ − z_β σ/√n ≤ θ ≤ x̄ + z_γ σ/√n | x).

The interval C = [c_1, c_2], where

c_1 = x̄ − z_β σ/√n  and  c_2 = x̄ + z_γ σ/√n,

is a 100(1 − α)% Bayesian confidence interval for θ. Note that it has length (z_γ + z_β)σ/√n. There still remains the point about minimization of the length of the interval subject to γ + β = α. Note that if φ = φ(θ) is a monotonically increasing transformation of θ, then [φ(c_1), φ(c_2)] is also a 100(1 − α)% Bayesian confidence interval for φ. If φ = φ(θ) is a monotonically decreasing transformation of θ, then [φ(c_2), φ(c_1)] is a 100(1 − α)% Bayesian confidence interval for φ.

Consider without loss of generality that z_γ ≤ z_{α/2} ≤ z_β and define a = z_{α/2} − z_γ ≥ 0, b = z_β − z_{α/2} ≥ 0 and A and B as the areas under the density between z_γ and z_{α/2} and between z_{α/2} and z_β, respectively. The length of the confidence interval becomes 2z_{α/2} + b − a and A = B. It is clear from Figure 4.4 that the density over the first interval is strictly larger than over the second interval. Therefore, b ≥ a and b − a ≥ 0.
Fig. 4.4 Density of the standard normal distribution, with the areas A and B marked.
The shortest possible interval is then obtained by taking b = a = 0, that is, z_β = z_γ = z_{α/2}. Therefore, the symmetric interval is the shortest one and every value of θ inside it has larger density than any point lying outside the interval. This simple example provides the key to finding intervals of shortest length. It indicates that the length of the interval is inversely proportional to the density height. Shortest intervals are then provided by inclusion of points of higher density. This idea is mathematically expressed in the definition below and represented graphically in Figure 4.5.
Fig. 4.5 The HPD interval for the density above is given by C1 ∪ C2.
Definition. A 100(1 − α)% Bayesian interval of highest posterior density (HPD, for short) for θ is the 100(1 − α)% Bayesian interval C given by

C = {θ ∈ Θ : p(θ | x) ≥ k(α)},

where k(α) is the largest constant such that P(θ ∈ C | x) ≥ 1 − α.
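When the posterior is only available numerically, the definition suggests a direct recipe: include grid points in decreasing order of density until the accumulated probability reaches 1 − α. A sketch assuming a grid approximation, illustrated with an asymmetric Gamma(3, 1) density standing in for a posterior (all names and the density choice are illustrative):

```python
import numpy as np
from scipy.stats import gamma

def hpd_region(grid, dens, alpha):
    """Approximate HPD bounds from a density evaluated on an equally
    spaced grid; assumes the density is unimodal so the region is an
    interval."""
    d = grid[1] - grid[0]
    order = np.argsort(dens)[::-1]          # highest-density points first
    cum = np.cumsum(dens[order]) * d        # accumulated probability
    keep = order[: np.searchsorted(cum, 1 - alpha) + 1]
    return grid[keep].min(), grid[keep].max()

grid = np.linspace(0.0, 20.0, 20_001)
dens = gamma(3).pdf(grid)
lo, hi = hpd_region(grid, dens, 0.05)       # 95% HPD interval
```

For this skewed density the HPD interval comes out shorter than the equal-tailed one, in line with the discussion above.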
Example (Berger, 1985). Let X = (X_1, …, X_n) be a random sample from the Cauchy(θ, 1) distribution and let θ have non-informative prior p(θ) ∝ k. The posterior density of θ is

p(θ | x) ∝ Π_{i=1}^n [1 + (x_i − θ)²]⁻¹.

Assume now that the observed sample was x = (4.0, 5.5, 7.5, 4.5, 3.0), with sampling average x̄ = 4.9. Then, the 95% HPD confidence interval for θ can be numerically obtained as [3.10, 6.06]. Had we assumed a N(θ, 1) sampling distribution, the 95% HPD interval would be [4.02, 5.86], which is more affected by the suspect value 7.5. In both cases, the intervals are easily obtained with the help of a computer. It will be seen in the sequel that the exercise is far from trivial for the Cauchy case in the frequentist approach. The main reasons are the absence of a univariate sufficient statistic for θ and the absence of asymptotic results to allow for approximations.

Note however that despite their appeal, HPD regions are not invariant under 1-to-1 transformations. The main reason is the existence of the Jacobian required when obtaining the density of any parametric transformation. Because of the
Because of the
invariance, transformation of HPD regions remain valid confidence intervals with the same confidence level. AU they lose is the HPD property. Final1y, assume now that Pr(8i c
=
cl
X
...
X
Cr.
Pr(O E C)=
and Cis a 100(1
E
Cj}
:::.
1
- a;, i
=
1, ... , r, and let
If the 8; 's are independent a posteriori then
-a)%
'
'
i=l
i=l
n Pr(O; E C;) 2: n(l- ct;)
confidence region for 0 if
fl (I -a;) 2: I -a.
If the 8;'s
are not independent then ' Pr(9 E C)
2: I- L Pr(8; ¢ C;). i=l
This result is also known as the Bonferroni inequality. As Pr(8i one takes
L; a;
=
a,
taking for example
a;
=
ajr,
¢. C;) :sa;, if -a)%
then Cis a 100(1
confidence region for (}.
4.4.2
Classical approach
In the case of classical confidence intervals, only sampling distributions can be used since parameters are unknown but fixed. Therefore, parameters are not amenable to the probabilistic description they get under the Bayesian treatment. That is why the concept of confidence, instead of probability, intervals becomes relevant. Before describing the general formulation, it is useful to see it applied in an example.
Example. Let X = (X_1, …, X_n) be a random sample from the N(θ, σ²) distribution, with σ² known. To draw a classical inference one should ideally base calculations on a minimal sufficient statistic for θ. In this case, we have seen that X̄ is such a statistic and

U = (X̄ − θ)/(σ/√n) ~ N(0, 1).

Observe that U is a function of the sample and of the parameter θ, the parameter of interest, and its distribution does not depend on θ. It can be said that

P(−z_{α/2} ≤ U ≤ z_{α/2}) = 1 − α

and isolating θ yields

P(X̄ − z_{α/2} σ/√n ≤ θ ≤ X̄ + z_{α/2} σ/√n) = 1 − α.

So, even though an interval for θ was obtained, it cannot be understood as a probability interval for θ as in Bayesian intervals. It can only be interpreted in the sampling framework by saying that if the same experiment were to be repeated many times, in approximately 100(1 − α)% of them, the random limits X̄ − z_{α/2}σ/√n and X̄ + z_{α/2}σ/√n would include the value of θ. Also, this assertion is useless from a practical perspective since it is based on an unobserved sample. What can be done is to replace the observed value of X̄ in the expression and state that one can have 100(1 − α)% confidence, instead of probability, that
the so-formed numerical interval contains θ.

The general procedure to obtain confidence intervals in the frequentist framework is based on a generalization of the steps of the above example to any statistical problem. These are:

1. A statistic U = G(X, θ) ∈ 𝒰 with a distribution that does not depend on θ ∈ Θ must be found. Ideally, this statistic must depend on X through minimal sufficient statistics and have a known distribution. Both requirements were met in the example since U depended on the sample through X̄ and had a standard normal distribution.

2. With knowledge of the distribution of U, find a region A ⊂ 𝒰 such that P(U ∈ A) = 1 − α. When θ is scalar, a scalar U can be found in many cases with P(a_1 ≤ U ≤ a_2) = 1 − α and, in this case, A = [a_1, a_2].

3. The confidence interval C ⊂ Θ is obtained by isolating θ in the above expression and replacing sample values.

A useful complementary text on the subject is Silvey (1970).
Definition. Let θ be an unknown quantity defined in Θ, U be a function U = G(X, θ) with values in 𝒰 and A be a region in 𝒰 such that P(U ∈ A) ≥ 1 − α. A region C ⊂ Θ is a 100(1 − α)% frequentist confidence region for θ if

C = {θ : G(x, θ) ∈ A}.

In this case, 1 − α is called the confidence level. In the scalar case, the inversion in terms of θ usually leads to an interval, C = [c_1, c_2] say, hence the name.
Once again, the use of the word interval for the general case is an abuse of language. The inequality in the definition is taken as an equality whenever possible. The quantity U is usually called a pivot or a pivotal quantity and finding one such quantity is fundamental. The choice of U is crucial to the success of the method. It is not at all obvious that reasonable options are available in any given problem. The effort towards the use of minimal sufficiency is in the direction of shortening intervals as much as possible.

All randomness present here is due to the sample X, leading to a probability interval for U and not for θ. The procedure involves an elaboration that is absent from the Bayesian definition. It provides an interval to which is associated a numerical value, the confidence of the interval. For that reason, it is often interpreted misleadingly as a probability interval, as in the Bayesian framework. Care must be exercised to ensure a correct interpretation of the intervals. The existing symmetry in many canonical situations leads to intervals that coincide numerically when obtained by a frequentist or a non-informative Bayesian approach. This happened in the above example but should not be used to unduly equate the two approaches.

In the case of a parameter vector, the use of Cartesian products of confidence intervals can also be applied to the construction of classical intervals. In particular, approximations such as those from the Bonferroni inequality are used more often in classical inference, where the search for the pivotal quantity U does not always lead to independent components. This problem typically does not occur in the Bayesian approach, where most problems lie in the computations.
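The three steps can be sketched for the normal example, together with a frequency check of the confidence statement: over repeated samples, roughly 100(1 − α)% of the numerical intervals cover the fixed θ (the simulation settings are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def normal_mean_ci(x, sigma, alpha=0.05):
    """Classical 100(1-alpha)% interval for theta from N(theta, sigma^2)
    data, based on the pivot U = (xbar - theta)/(sigma/sqrt(n)) ~ N(0,1)."""
    n, xbar = len(x), np.mean(x)
    half = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
    return xbar - half, xbar + half

# Frequentist interpretation: coverage over repeated experiments
rng = np.random.default_rng(5)
theta, sigma, n = 10.0, 2.0, 16
cover = 0
for _ in range(10_000):
    lo, hi = normal_mean_ci(rng.normal(theta, sigma, n), sigma)
    cover += lo <= theta <= hi
# cover / 10_000 should be close to 0.95
```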
4.5 Estimation in the normal model
This section deals with applications of point and interval estimation of means and variances to problems of one and two normal populations. The Bayesian perspective, with both non-informative and proper conjugate priors, and the frequentist perspective are presented and compared. We will particularly emphasize the similarity between the results from the frequentist and the non-informative Bayesian points of view. The similarity is only numerical since we have seen that the derivations are completely different.
4.5.1 One sample case

Assume initially a single sample X = (X_1, …, X_n) from the N(θ, σ²) distribution, with φ = σ⁻². If φ is known and the prior distribution is θ ~ N(μ_0, τ_0²), we have already obtained that θ | x ~ N(μ_1, τ_1²), where

μ_1 = (nσ⁻²x̄ + τ_0⁻²μ_0)/(nσ⁻² + τ_0⁻²)  and  τ_1⁻² = nσ⁻² + τ_0⁻².

So, posterior mean, median and mode coincide and the posterior precision and curvature of the log posterior are given by τ_1⁻². Credibility intervals can be obtained by noticing that

(θ − μ_1)/τ_1 | x ~ N(0, 1)

and therefore

1 − α = P(−z_{α/2} < (θ − μ_1)/τ_1 < z_{α/2} | x) = P(μ_1 − z_{α/2}τ_1 < θ < μ_1 + z_{α/2}τ_1 | x)

and, due to the symmetry of the normal, (μ_1 − z_{α/2}τ_1, μ_1 + z_{α/2}τ_1) is the 100(1 − α)% HPD interval for θ.

A non-informative prior can be obtained by letting τ_0² → ∞. In this case, τ_1⁻² → nσ⁻² and μ_1 → x̄. Posterior mean, median and mode coincide with the moments estimator and MLE, x̄. It is easy to check that X̄ is also an unbiased, efficient and (strongly) consistent estimator for θ. Also, the 100(1 − α)% HPD confidence interval for θ coincides with the classical confidence interval obtained in the previous section.
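The posterior updating and the HPD interval above can be sketched as follows (the data values and prior settings are illustrative only):

```python
import numpy as np
from scipy.stats import norm

def posterior_normal_mean(x, sigma2, mu0, tau02):
    """Posterior N(mu1, tau1^2) for theta given N(theta, sigma2) data with
    sigma2 known and prior theta ~ N(mu0, tau02)."""
    n, xbar = len(x), np.mean(x)
    prec1 = n / sigma2 + 1.0 / tau02                 # tau1^{-2}
    mu1 = (n / sigma2 * xbar + mu0 / tau02) / prec1  # precision-weighted mean
    return mu1, 1.0 / prec1

def hpd_normal(mu1, tau12, alpha=0.05):
    """By symmetry, the HPD interval is mu1 -/+ z_{alpha/2} tau1."""
    z = norm.ppf(1 - alpha / 2)
    return mu1 - z * np.sqrt(tau12), mu1 + z * np.sqrt(tau12)

mu1, tau12 = posterior_normal_mean([4.8, 5.1, 5.3], sigma2=1.0,
                                   mu0=0.0, tau02=100.0)  # vague prior
lo, hi = hpd_normal(mu1, tau12)
```

With τ_0² large the posterior mean is pulled almost entirely towards x̄, in line with the non-informative limit described above.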
Assuming now that θ is known and the prior for σ² is n_0σ_0²φ ~ χ²_{n_0}, the posterior is

(n_0σ_0² + ns_0²)φ | x ~ χ²_{n_0+n},  where  s_0² = (1/n) Σ_{i=1}^n (x_i − θ)².

The following quantities can be obtained:

E[φ | x] = (n_0 + n)/(n_0σ_0² + ns_0²)

and

{E[φ | x]}⁻¹ = (n_0σ_0² + ns_0²)/(n_0 + n) = [n_0/(n_0 + n)] σ_0² + [n/(n_0 + n)] s_0²,

which is a weighted average between the prior estimate σ_0² and the maximum likelihood estimate s_0², with weights n_0/(n_0 + n) and n/(n_0 + n), respectively. Also,

E[σ² | x] = E[φ⁻¹ | x] = (n_0σ_0² + ns_0²)/(n + n_0 − 2),

which coincides with the inverse of the posterior mode (but not of the mean) of φ. The main dispersion measures are

V(φ | x) = 2(n + n_0)/(n_0σ_0² + ns_0²)²

and

J(mode) = (n_0σ_0² + ns_0²)²/[2(n + n_0 − 2)].

Note that once again the expression of J(mode) is very similar to the posterior precision V⁻¹(φ | x).
Estimation in the normal model 107 Confidence intervals can be obtained with the percentiles of the x2 that are available in many tables and statistical softwares. Definingx2 and X� v as the , :...,.:..a v
lOOcx% and 100(1 -a)% percentiles of the x2 distribution with freedom, respectively gives
(&;z,n1 < (noaJ + ns6 )¢< X�;z,n11x) X�f2,n, �f2,n1 "' I x =P < < noa02+ ns0z '+' noa0z + ns0z
1 -a= P
)
(
where
v
degrees of
n1 =no+ n
·
This gives rise to a 100(1-a)% credibility interval for cj>. Given the asymmetry of thex2 distribution, this is not an HPD interval. As a2 = lfcj>,
(
noa02 + ns02 noa02 + ns02 , 2 2 n1 Xa.j2, &12.n1
)
is a 100(1-a)% credibility interval fora2 The non-informative prior can be obtained by letting no -7 0. In this case, the posterior is ns� ¢ I x x; and {£[¢ Ix] ) - 1 7 s& . which coincides with the maximum likelihood estimate of cp-1 = a-2. The MLE is sJ with sampling distribution nS6/a2\a2,......, x;. Therefore, it is unbiased and since �
a log p(X I a2) n 2 = (S2 c- cr) 2a4 o aa2 Its variance is 2a4jn and tends to 0 as n �
it is also efficient. oo which means that the estimator is also consistent. The 100(1- a)% credibility interval for a2 becomes ·
(
ns20 nsz0 _2_'_2 Xo:;2,n "fa;z,n
)
__
.
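As a numerical sketch of the interval above (the data, sample size and seed below are all assumed for illustration, not taken from the text), the $\chi^2$ percentiles can be read off scipy:

```python
import numpy as np
from scipy.stats import chi2

# Assumed data: n observations with known mean theta (illustrative values only)
rng = np.random.default_rng(0)
theta, n = 0.0, 30
x = rng.normal(theta, 2.0, size=n)
s0_sq = np.mean((x - theta) ** 2)        # s0^2 = (1/n) * sum (x_i - theta)^2

alpha = 0.05
chi_lo = chi2.ppf(alpha / 2, df=n)       # lower 100(alpha/2)% percentile
chi_hi = chi2.ppf(1 - alpha / 2, df=n)   # upper 100(1-alpha/2)% percentile

# 100(1-alpha)% credibility interval for sigma^2 under the non-informative prior
interval = (n * s0_sq / chi_hi, n * s0_sq / chi_lo)
```

Since $\underline{\chi}^2_{\alpha/2,n} < n < \bar{\chi}^2_{\alpha/2,n}$ here, the point estimate $s_0^2$ always falls inside the interval, but the interval is not symmetric around it.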
The pivotal quantity used for the construction of the interval is $nS_0^2/\sigma^2$, with a $\chi^2_n$ sampling distribution. The classical confidence interval can be easily obtained and shown to coincide with the above interval, obtained for the non-informative prior.

If $\theta$ and $\sigma^2$ are both unknown quantities, using the conjugate prior $\theta \mid \phi \sim N(\mu_0, (c_0\phi)^{-1})$ and $n_0\sigma_0^2\phi \sim \chi^2_{n_0}$ gives the marginal posterior distributions
$$ \theta \mid x \sim t_{n_1}\!\left(\mu_1, \frac{\sigma_1^2}{c_1}\right) \quad\text{and}\quad n_1\sigma_1^2\phi \mid x \sim \chi^2_{n_1} $$
where $c_1 = c_0 + n$, $n_1 = n_0 + n$, $\mu_1 = (c_0\mu_0 + n\bar{x})/(c_0+n)$,
$$ n_1\sigma_1^2 = n_0\sigma_0^2 + (n-1)s^2 + \frac{c_0 n(\mu_0-\bar{x})^2}{c_0+n} \quad\text{and}\quad s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2 .$$
Once again, the posterior mean, mode and median of $\theta$ coincide and are given by $\mu_1$. Also,
$$ V(\theta \mid x) = \frac{n_1}{n_1-2}\,\frac{\sigma_1^2}{c_1} \quad\text{and}\quad J(\mu_1) = \frac{n_1+1}{n_1}\,\frac{c_1}{\sigma_1^2} .$$
Denoting the $100(1-\alpha)\%$ percentile of the $t_\nu(0,1)$ distribution by $t_{\alpha,\nu}$ gives, by the symmetry of the Student $t$,
$$ 1-\alpha = P\!\left( -t_{\alpha/2,n_1} < \frac{\theta-\mu_1}{\sigma_1/\sqrt{c_1}} < t_{\alpha/2,n_1} \,\Big|\, x \right) = P\!\left( \mu_1 - t_{\alpha/2,n_1}\frac{\sigma_1}{\sqrt{c_1}} < \theta < \mu_1 + t_{\alpha/2,n_1}\frac{\sigma_1}{\sqrt{c_1}} \,\Big|\, x \right) $$
and the above is an HPD interval. For $\sigma^2$, by analogy with the results for known $\theta$,
$$ \{E[\phi \mid x]\}^{-1} = \sigma_1^2 \quad\text{and}\quad E[\sigma^2 \mid x] = E[\phi^{-1} \mid x] = \frac{n_1\sigma_1^2}{n_1-2} ,$$
which coincides with the inverse of the posterior mode (but not of the mean) of $\phi$. The main dispersion measures are
$$ V(\phi \mid x) = \frac{2n_1}{(n_1\sigma_1^2)^2} \quad\text{and}\quad J(\text{mode}) = \frac{(n_1\sigma_1^2)^2}{2(n_1-2)} .$$
Once again, the expression of $J(\text{mode})$ is very similar to the posterior precision $V^{-1}[\phi \mid x]$.
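These posterior summaries are straightforward to compute; the sketch below uses assumed values for $\mu_1$, $\sigma_1^2$, $c_1$ and $n_1$ (they would come from the data and prior) and scipy's Student-$t$ percentiles:

```python
from scipy.stats import t

# Assumed posterior quantities (illustrative values, not from any real data set)
mu1, sigma1_sq, c1, n1 = 10.0, 4.0, 25.0, 20

scale = (sigma1_sq / c1) ** 0.5              # scale of t_{n1}(mu1, sigma1^2/c1)
post_var = n1 / (n1 - 2) * sigma1_sq / c1    # V(theta | x)

alpha = 0.05
half_width = t.ppf(1 - alpha / 2, df=n1) * scale
hpd = (mu1 - half_width, mu1 + half_width)   # HPD interval by t symmetry
```

Note that $V(\theta \mid x)$ exceeds the scale parameter $\sigma_1^2/c_1$ by the factor $n_1/(n_1-2)$, the variance inflation of the Student $t$.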
Confidence intervals can again be obtained with the $\chi^2$ percentiles, leading to
$$ 1-\alpha = P\big(\underline{\chi}^2_{\alpha/2,n_1} < n_1\sigma_1^2\phi < \bar{\chi}^2_{\alpha/2,n_1} \mid x\big) = P\left( \frac{\underline{\chi}^2_{\alpha/2,n_1}}{n_1\sigma_1^2} < \phi < \frac{\bar{\chi}^2_{\alpha/2,n_1}}{n_1\sigma_1^2} \,\Big|\, x \right). $$
This gives a $100(1-\alpha)\%$ confidence interval for $\phi$ (that is also not an HPD interval). As $\sigma^2 = 1/\phi$,
$$ \left( \frac{n_1\sigma_1^2}{\bar{\chi}^2_{\alpha/2,n_1}} \,,\, \frac{n_1\sigma_1^2}{\underline{\chi}^2_{\alpha/2,n_1}} \right) $$
is a $100(1-\alpha)\%$ confidence interval for $\sigma^2$.
The non-informative prior in this case is $p(\theta,\phi) \propto \phi^{-1}$. This can be seen as a limiting case of the conjugate prior above when $c_0, \sigma_0^2 \to 0$ and $n_0 = -1$. This gives marginal posterior distributions
$$ \theta \mid x \sim t_{n-1}(\bar{x}, s^2/n) \quad\text{and}\quad (n-1)s^2\phi \mid x \sim \chi^2_{n-1} .$$
Again, the posterior mean, mode and median of $\theta$ coincide with $\bar{x}$, which is the maximum likelihood estimate. The dispersion measures for $\theta$ are
$$ V(\theta \mid x) = \frac{n-1}{n-3}\,\frac{s^2}{n} \quad\text{and}\quad J(\bar{x}) = \frac{n^2}{(n-1)s^2} $$
and, since $\sqrt{n}(\theta-\bar{x})/s \mid x \sim t_{n-1}(0,1)$, the HPD $100(1-\alpha)\%$ confidence interval for $\theta$ is analogously obtained as
$$ \left( \bar{x} - t_{\alpha/2,n-1}\,\frac{s}{\sqrt{n}} \,,\; \bar{x} + t_{\alpha/2,n-1}\,\frac{s}{\sqrt{n}} \right). $$
The classical (moments and maximum likelihood) estimator $\bar{X}$ for $\theta$ is unbiased, efficient and consistent. Classical confidence intervals for $\theta$ cannot be obtained with the same pivotal quantity used before because it depends on the unknown $\sigma$. A new pivotal quantity, depending only on $\bar{X}$ and $\theta$ and with a distribution that is known and does not depend on any of the unknown parameters, must be found. Fortunately, this is possible in the normal case with the following results.
Theorem 4.3. Let $X = (X_1, \ldots, X_n)$ be a random sample from the $N(\theta, \sigma^2)$ distribution and let $\bar{X}$ and $S^2$ be the sample mean and variance, respectively. Then, conditional on $\theta$ and $\sigma^2$, $\bar{X}$ and $S^2$ are independent with respective sampling distributions
$$ \sqrt{n}\,\frac{\bar{X}-\theta}{\sigma} \sim N(0,1) \quad\text{and}\quad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} .$$

Proof. Define $Z_i = (X_i-\theta)/\sigma$, $i = 1, \ldots, n$, and $Z = (Z_1, \ldots, Z_n)$. Then the $Z_i$ are iid $N(0,1)$, $\sqrt{n}\bar{Z} = \sqrt{n}(\bar{X}-\theta)/\sigma \sim N(0,1)$ and, in matrix notation, $Z \sim N(0, I_n)$, where $I_n$ is the $n \times n$ identity matrix. Let $A$ be an orthogonal matrix with first row given by $(1/\sqrt{n})1_n'$, where $1_n$ is an $n$-dimensional vector of 1's. There are many methods in linear algebra available for completing orthogonally the other $n-1$ rows of $A$. From the invariance of the multivariate normal distribution under linear transformations (Exercise 1.6), it follows that $Y = AZ \sim N(0, I_n)$, since $A0 = 0$ and $AA' = I_n$. Therefore, the $Y_i$'s are iid $N(0,1)$ variables and
$$ Y'Y = \sum_{i=1}^n Y_i^2 \sim \chi^2_n .$$
Also, from the independence of the $Y_i$'s, $\sum_{i=2}^n Y_i^2 \sim \chi^2_{n-1}$, and $Y_1 = (1/\sqrt{n})1_n'Z = \sqrt{n}\bar{Z}$. Now,
$$ \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i-\bar{X})^2}{\sigma^2} = \sum_{i=1}^n (Z_i-\bar{Z})^2 = \sum_{i=1}^n Z_i^2 - n\bar{Z}^2 = Z'Z - Y_1^2 .$$
But
$$ Z'Z = Z'A'AZ = (AZ)'(AZ) = Y'Y = \sum_{i=1}^n Y_i^2 .$$
Therefore $(n-1)S^2/\sigma^2 = \sum_{i=2}^n Y_i^2 \sim \chi^2_{n-1}$ and is independent of $Y_1$ and, consequently, of $\bar{X}$, completing the proof. □
Lemma. If $T \sim N(0,1)$ and $W \sim \chi^2_\nu$, and $T$ and $W$ are independent, then $T/\sqrt{W/\nu} \sim t_\nu(0,1)$.

The proof of the Lemma is an adaptation of results previously shown and is left as an exercise.
Corollary. Let $X = (X_1, \ldots, X_n)$ be a random sample from the $N(\theta, \sigma^2)$ distribution and let $\bar{X}$ and $S^2$ be the sample mean and variance, respectively. Then, conditional on $\theta$ and $\sigma^2$,
$$ \sqrt{n}\,\frac{\bar{X}-\theta}{S} \sim t_{n-1}(0,1) .$$

Proof. A straightforward application of the last lemma with $T = \sqrt{n}(\bar{X}-\theta)/\sigma$, $W = (n-1)S^2/\sigma^2$ and $\nu = n-1$. Then $T/\sqrt{W/\nu} = \sqrt{n}(\bar{X}-\theta)/S$, completing the proof. □
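Both results are easy to confirm numerically. The Monte Carlo sketch below (with an assumed $\theta$, $\sigma$, sample size and seed) checks that $\bar{X}$ and $S^2$ are uncorrelated and that the pivot $\sqrt{n}(\bar{X}-\theta)/S$ has the variance $\nu/(\nu-2)$ of a $t_{n-1}$ variable:

```python
import numpy as np

rng = np.random.default_rng(42)
theta, sigma, n, n_sim = 1.0, 2.0, 10, 200_000

x = rng.normal(theta, sigma, size=(n_sim, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)                  # sample variance S^2

corr = np.corrcoef(xbar, s2)[0, 1]          # near 0: independence of xbar, S^2

pivot = np.sqrt(n) * (xbar - theta) / np.sqrt(s2)
pivot_var = pivot.var()                     # near (n-1)/(n-3), a t_{n-1} variance
```

The near-zero correlation only illustrates independence, of course; the theorem gives the full result.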
The above results indicate how to define pivotal quantities for the construction of confidence intervals for $\theta$ and $\sigma^2$. In the case of $\theta$, $\sigma$ is replaced by its estimator $S$, leading to the new pivotal quantity $\sqrt{n}(\bar{X}-\theta)/S$, whose sampling distribution is $t_{n-1}(0,1)$. Note the similarity with the Bayesian standardization over the marginal posterior of $\theta$. It is easy to obtain that the classical interval will coincide with the non-informative Bayesian one. Even if $S$ could estimate $\sigma$ without error, this substitution implies an increase in the uncertainty bounds, since $t_{\beta,\nu} > z_\beta$ for small $\beta$. For $\sigma^2$, we have that $\{E[\phi \mid x]\}^{-1} = s^2$, the mode of $\phi$ is $(n-3)/[(n-1)s^2]$ and the dispersion measures are
$$ V(\phi \mid x) = \frac{2}{(n-1)s^4} \quad\text{and}\quad J(\text{mode}) = \frac{(n-1)^2 s^4}{2(n-3)} .$$
The MLE of $\sigma^2$ is $\hat{\sigma}^2 = (n-1)s^2/n$, which is biased and is usually replaced by the unbiased estimator $S^2$, with sampling distribution $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$ and $V(S^2) = 2\sigma^4/(n-1)$. Since $S^2$ is unbiased and $V(S^2) \to 0$ as $n \to \infty$, $S^2$ is a consistent estimator of $\sigma^2$. The difference between $S^2$ and $\hat{\sigma}^2$ becomes negligible as $n$ increases, which means that $\hat{\sigma}^2$ is also consistent. Non-informative Bayesian and classical intervals for $\sigma^2$ once again coincide and are given by
$$ \left( \frac{(n-1)s^2}{\bar{\chi}^2_{\alpha/2,n-1}} \,,\, \frac{(n-1)s^2}{\underline{\chi}^2_{\alpha/2,n-1}} \right). $$
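The remark that $t_{\beta,\nu} > z_\beta$ can be checked directly, as in this minimal sketch using scipy percentiles:

```python
from scipy.stats import norm, t

beta = 0.025
z = norm.ppf(1 - beta)                                  # z_beta
t_quantiles = {nu: t.ppf(1 - beta, df=nu) for nu in (5, 10, 30, 100)}
# every finite-df t percentile exceeds z and decreases toward z as df grows
```

The excess over $z_\beta$ is the price paid for estimating $\sigma$ by $S$, and it shrinks as the degrees of freedom increase.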
When making an inference about one of the parameters, the other one becomes a nuisance parameter. In the Bayesian approach it is eliminated by integration. In the frequentist approach it is eliminated by appropriate choices of pivotal quantities. In the normal case, we were able to find suitable quantities, given by $\sqrt{n}(\bar{X}-\theta)/S$ and $(n-1)S^2/\sigma^2$. These are based on minimal sufficient statistics and do not depend on the respective nuisance parameters. This fortunate coincidence does not necessarily occur in all statistical problems. In those cases, alternative approaches based on some form of approximation must be used.
4.5.2
Two samples case
From now on, until the end of the section, we will concentrate on two normal samples, where $X_1 = (X_{11}, \ldots, X_{1n_1})$ is a random sample from the $N(\theta_1, \sigma_1^2)$ distribution and $X_2 = (X_{21}, \ldots, X_{2n_2})$ is a random sample from the $N(\theta_2, \sigma_2^2)$ distribution. In addition, the two samples will be assumed to be independent. If $\sigma_1^2$ and $\sigma_2^2$ are known, the likelihood factors out into separate likelihoods for $\theta_1$ and $\theta_2$. So, if $\theta_1$ and $\theta_2$ are prior independent, they will remain posterior independent. One class of conjugate priors is given by independent $\theta_i \sim N(\mu_i, \tau_i^2)$ distributions, for $i = 1, 2$. Another class is given by bivariate normal distributions. It includes the previous class by allowing also non-null prior correlation between $\theta_1$ and $\theta_2$. The first class will be used here for simplicity.
Combining the adopted prior with the likelihood leads to the independent posteriors $\theta_i \mid x_i \sim N(\mu_i^*, \tau_i^{*2})$, where
$$ \mu_i^* = \frac{n_i\sigma_i^{-2}\bar{x}_i + \tau_i^{-2}\mu_i}{n_i\sigma_i^{-2} + \tau_i^{-2}} \quad\text{and}\quad \tau_i^{*-2} = n_i\sigma_i^{-2} + \tau_i^{-2}, \quad i = 1, 2 .$$
The analysis is exactly like two separate conjugate analyses and all the results for one sample follow. The same comments are true for non-informative priors and for classical inference. The non-informative prior for $\theta_1$ and $\theta_2$ is $p(\theta_1, \theta_2) \propto k$ and
$$ \theta_i \mid x_i \sim N\!\left(\bar{x}_i, \frac{\sigma_i^2}{n_i}\right), \quad i = 1, 2, \text{ independent.} $$
The equivalent sampling result is
$$ \bar{X}_i \mid \theta_i \sim N(\theta_i, \sigma_i^2/n_i), \quad i = 1, 2, \text{ independent,} $$
from where point and interval estimation can be processed in the same way. An interesting problem, absent in the single sample case, is the comparison of means. This can be done by estimation of $\beta = \theta_1 - \theta_2$. The posterior distribution of $\beta$ is obtained from the properties of the normal distribution as
$$ \beta \mid x \sim N(\mu_1^* - \mu_2^*, \tau_1^{*2} + \tau_2^{*2}) .$$
This posterior reduces in the non-informative case to the $N(\hat{\beta}, \sigma_1^2/n_1 + \sigma_2^2/n_2)$ distribution, where $\hat{\beta} = \bar{x}_1 - \bar{x}_2$. In the case of classical inference, estimation is based on $\bar{X}_1 - \bar{X}_2$, with sampling distribution $N(\beta, \sigma_1^2/n_1 + \sigma_2^2/n_2)$. Observe that all the above distributions are symmetric, which eases the calculation of estimators and HPD intervals.
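The whole known-variance calculation can be sketched in a few lines (all numerical summaries below are assumed values, not data from the text):

```python
from scipy.stats import norm

def posterior(xbar, n, sigma2, mu, tau2):
    """Posterior N(mu_star, tau_star^2) for one normal mean, known variance."""
    prec = n / sigma2 + 1.0 / tau2                   # posterior precision
    mu_star = (n / sigma2 * xbar + mu / tau2) / prec
    return mu_star, 1.0 / prec

# Hypothetical summaries for the two independent samples
m1, v1 = posterior(xbar=5.2, n=40, sigma2=4.0, mu=5.0, tau2=1.0)
m2, v2 = posterior(xbar=4.6, n=30, sigma2=9.0, mu=5.0, tau2=1.0)

beta_mean, beta_var = m1 - m2, v1 + v2               # beta = theta1 - theta2
z = norm.ppf(0.975)
hpd = (beta_mean - z * beta_var**0.5, beta_mean + z * beta_var**0.5)
```

The 95% HPD interval for $\beta$ is symmetric around $\mu_1^* - \mu_2^*$, exactly as in the single-sample normal case.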
Assume now that $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal, with $\phi = \sigma_1^{-2} = \sigma_2^{-2}$. Then a conjugate prior can be constructed in the following way: $\theta_i \mid \phi \sim N(\mu_i, (c_i\phi)^{-1})$, $i = 1, 2$, are conditionally independent and $n_0\sigma_0^2\phi \sim \chi^2_{n_0}$. The prior density of $(\theta_1, \theta_2, \phi)$ is
$$ p(\theta_1, \theta_2, \phi) \propto \phi^{n_0/2} \exp\left\{ -\frac{\phi}{2}\left[ n_0\sigma_0^2 + c_1(\theta_1-\mu_1)^2 + c_2(\theta_2-\mu_2)^2 \right] \right\}. $$
In particular, the prior distribution of $\beta \mid \phi$ is $N(\mu_1-\mu_2, \phi^{-1}(c_1^{-1}+c_2^{-1}))$. Therefore, using the results from Section 3.4, one can obtain its marginal prior
$$ \beta \sim t_{n_0}(\mu_1-\mu_2,\; \sigma_0^2(c_1^{-1}+c_2^{-1})) .$$
The likelihood is
$$ p(x_1, x_2 \mid \theta_1, \theta_2, \phi) = p(x_1 \mid \theta_1, \phi)\, p(x_2 \mid \theta_2, \phi) \propto \prod_{i=1}^2 \phi^{n_i/2} \exp\left\{ -\frac{\phi}{2}\left[ (n_i-1)s_i^2 + n_i(\theta_i-\bar{x}_i)^2 \right] \right\} $$
where
$$ s_i^2 = \frac{1}{n_i-1}\sum_{j=1}^{n_i} (x_{ij}-\bar{x}_i)^2, \quad i = 1, 2 .$$
Combining the likelihood with the prior gives the posterior density of $(\theta_1, \theta_2, \phi)$:
$$ p(\theta_1, \theta_2, \phi \mid x) \propto \phi^{(n_0+n_1+n_2)/2} \exp\left\{ -\frac{\phi}{2}\left[ n_0\sigma_0^2 + \nu s^2 + \sum_{i=1}^2 \frac{c_i n_i}{c_i+n_i}(\mu_i-\bar{x}_i)^2 + \sum_{i=1}^2 (c_i+n_i)(\theta_i-\mu_i^*)^2 \right] \right\} $$
where
$$ \mu_i^* = \frac{c_i\mu_i + n_i\bar{x}_i}{c_i+n_i}, \quad i = 1, 2, \qquad \nu = n_1 + n_2 - 2 $$
and $s^2 = [(n_1-1)s_1^2 + (n_2-1)s_2^2]/\nu$, that is, a weighted average of the $s_i^2$ obtained within each sample with weights given by $n_i-1$, $i = 1, 2$. Note that the posterior and prior densities have the same kernel. Hence, the following results can be established, by analogy:
$$ \theta_i \mid \phi, x \sim N\!\left(\mu_i^*, \frac{1}{c_i^*\phi}\right), \quad i = 1, 2, \text{ independent,} \qquad \theta_i \mid x \sim t_{n_0^*}\!\left(\mu_i^*, \frac{\sigma_0^{*2}}{c_i^*}\right), \quad i = 1, 2, $$
and
$$ n_0^*\sigma_0^{*2}\phi \mid x \sim \chi^2_{n_0^*} $$
where $c_i^* = c_i + n_i$, $n_0^* = n_0 + n_1 + n_2$ and
$$ n_0^*\sigma_0^{*2} = n_0\sigma_0^2 + \nu s^2 + \sum_{i=1}^2 \frac{c_i n_i}{c_i+n_i}(\mu_i-\bar{x}_i)^2 .$$
In terms of $\beta$, one can obtain the conditional posterior distribution
$$ \beta \mid \phi, x \sim N(\mu_1^* - \mu_2^*,\; \phi^{-1}(c_1^{*-1}+c_2^{*-1})) $$
and the marginal posterior distribution
$$ \beta \mid x \sim t_{n_0^*}(\mu_1^* - \mu_2^*,\; \sigma_0^{*2}(c_1^{*-1}+c_2^{*-1})) .$$
Once again, the posterior is symmetric and the posterior mean, mode and median of $\beta$ coincide. HPD intervals for $\beta$ can be obtained using percentiles of the Student $t$ distribution. For $\phi$, we have that $\{E(\phi \mid x)\}^{-1} = \sigma_0^{*2}$, and credibility intervals can be constructed using percentiles of the $\chi^2$ distribution.
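For example, an HPD interval for $\beta$ can be read off the marginal $t$ above; the quantities $\mu_i^*$, $c_i^*$, $n_0^*$ and $\sigma_0^{*2}$ below are assumed values standing in for a completed conjugate update:

```python
from scipy.stats import t

# Assumed posterior quantities from a conjugate analysis (illustrative values)
mu1_s, mu2_s = 3.1, 2.4        # mu_i^*
c1_s, c2_s = 42.0, 33.0        # c_i^* = c_i + n_i
n0_s, s0_s2 = 75, 2.5          # n_0^* and sigma_0^{*2}

center = mu1_s - mu2_s
scale = (s0_s2 * (1 / c1_s + 1 / c2_s)) ** 0.5   # t scale for beta | x
q = t.ppf(0.975, df=n0_s)
hpd = (center - q * scale, center + q * scale)   # 95% HPD by t symmetry
```

The structure is identical to the one-sample case: only the degrees of freedom and the scale change.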
The non-informative prior distribution can be obtained by noting that this is a location-scale model with location parameters $\theta_1$ and $\theta_2$ and scale parameter $\phi$. Therefore, the non-informative prior is given by $p(\theta_1, \theta_2, \phi) \propto \phi^{-1}$. The posterior mean, mode and median of $\beta$ are given by $\hat{\beta}$ and the $100(1-\alpha)\%$ HPD interval for $\beta$ has limits
$$ \hat{\beta} \pm t_{\alpha/2,\nu}\, s\sqrt{n_1^{-1} + n_2^{-1}} .$$
A possible estimate for $\sigma^2$ is given by $s^2$. The $100(1-\alpha)\%$ credibility interval for $\sigma^2$, obtained as before, is given by
$$ \left( \frac{\nu s^2}{\bar{\chi}^2_{\alpha/2,\nu}} \,,\, \frac{\nu s^2}{\underline{\chi}^2_{\alpha/2,\nu}} \right). $$
In the case of classical inference, $\hat{\beta} = \bar{x}_1 - \bar{x}_2$ and $\hat{\sigma}^2 = \nu s^2/(n_1+n_2)$ are the MLE of $\beta$ and $\sigma^2$, respectively. It is not difficult to show that $\hat{\beta}$ is an unbiased, efficient and consistent estimator for $\beta$ and $\hat{\sigma}^2$ is a consistent but biased estimator for $\sigma^2$. It is usually replaced by $s^2$, which is an unbiased, efficient and consistent estimator for $\sigma^2$. The relevant sampling distributions are
$$ \frac{\hat{\beta} - \beta}{S\sqrt{n_1^{-1} + n_2^{-1}}} \sim t_\nu(0,1) \quad\text{and}\quad \frac{\nu S^2}{\sigma^2} \sim \chi^2_\nu .$$
Note that these variables provide pivotal quantities for the construction of confidence intervals for $\beta$ and $\sigma^2$, respectively. The resulting confidence intervals coincide numerically with those provided by the non-informative prior.

If $\sigma_1^2 = \phi_1^{-1}$ and $\sigma_2^2 = \phi_2^{-1}$ are unknown and unequal, the likelihood factors into the product of the two separate one-sample likelihoods $p(x_i \mid \theta_i, \phi_i)$, $i = 1, 2$.
Adopting independent normal-gamma conjugate priors with parameters $(\mu_i, c_i, \nu_i, s_{0i}^2)$, $i = 1, 2$, for each of the samples leads to the independent posterior distributions
$$ \theta_i \mid x \sim t_{\nu_i^*}\!\left(\mu_i^*, \frac{s_{0i}^{*2}}{c_i^*}\right) \quad\text{and}\quad \nu_i^* s_{0i}^{*2}\,\phi_i \mid x \sim \chi^2_{\nu_i^*}, \quad i = 1, 2, $$
where the relevant quantities are obtained by the usual one-sample operations associated with the conjugacy of the normal-gamma prior with the normal observational model.
If interest lies in the comparison of the means, the posterior distribution of $\beta$ must be obtained. First, let $r$ and $\omega$ be such that
$$ r = \frac{\beta - (\mu_1^* - \mu_2^*)}{\sqrt{s_{01}^{*2}/c_1^* + s_{02}^{*2}/c_2^*}} \quad\text{and}\quad \tan\omega = \frac{s_{01}^*/\sqrt{c_1^*}}{s_{02}^*/\sqrt{c_2^*}} .$$
It follows that
$$ \sin\omega = \frac{s_{01}^*/\sqrt{c_1^*}}{\sqrt{s_{01}^{*2}/c_1^* + s_{02}^{*2}/c_2^*}} $$
and therefore
$$ r = \frac{\theta_1-\mu_1^*}{s_{01}^*/\sqrt{c_1^*}}\,\sin\omega \;-\; \frac{\theta_2-\mu_2^*}{s_{02}^*/\sqrt{c_2^*}}\,\cos\omega $$
where the fractions on the right-hand side of the equation have independent standard Student $t$ distributions with respective $\nu_1^*$ and $\nu_2^*$ degrees of freedom. A random quantity under these conditions is said to have a Behrens-Fisher distribution with parameters $\nu_1^*$, $\nu_2^*$ and $\omega$. This distribution is similar to the Student $t$ distribution and has been tabulated, enabling easy construction of confidence intervals.

In the case of a non-informative prior, $p(\theta_1, \theta_2, \sigma_1^2, \sigma_2^2) \propto \sigma_1^{-2}\sigma_2^{-2}$. This can be seen as a limiting case of the conjugate prior above when $c_i, s_{0i}^2 \to 0$ and $\nu_i = -1$, $i = 1, 2$. Replacing these values in the expression of the conjugate posterior gives $c_i^* = n_i$, $\mu_i^* = \bar{x}_i$, $\nu_i^* = n_i - 1$ and $s_{0i}^* = s_i$, where $s_i^2$ is the usual unbiased estimate of $\sigma_i^2$, $i = 1, 2$. Estimators and confidence intervals can be obtained accordingly. The problem is more difficult to treat under the classical perspective. No easy pivotal quantity for $\beta$ can be found with known distribution, although approximations based on the Behrens-Fisher distribution have been proposed.
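Even without tables, the Behrens-Fisher posterior of $\beta$ is trivial to handle by simulation, drawing from the two independent Student-$t$ marginals (the posterior summaries below are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed marginal posteriors: theta_i | x ~ t_{nu_i}(mu_i, s0_i^2 / c_i)
mu = [1.0, 0.2]; s0 = [1.5, 2.0]; c = [20.0, 15.0]; nu = [19, 14]

theta_draws = [m + (s / ci ** 0.5) * rng.standard_t(df=v, size=500_000)
               for m, s, ci, v in zip(mu, s0, c, nu)]
beta_draws = theta_draws[0] - theta_draws[1]     # posterior draws of beta

cred = np.quantile(beta_draws, [0.025, 0.975])   # equal-tail 95% interval
```

This is a preview of the simulation methods of Chapter 5; the equal-tail interval is also HPD here because the Behrens-Fisher density is symmetric and unimodal.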
Another situation of interest is the comparison of variances. Since variances measure the scale of a distribution and are always positive, it makes more sense to compare them through their ratio instead of their difference, as we did for the means. Therefore, we will focus on the posterior distribution of $\psi = \sigma_1^2/\sigma_2^2 = \phi_1^{-1}/\phi_2^{-1}$. We have just obtained that $\phi_1$ and $\phi_2$ are posterior independent with joint density
$$ p(\phi_1, \phi_2 \mid x) \propto \prod_{i=1}^2 \phi_i^{\nu_i^*/2-1} \exp\left\{ -\frac{\nu_i^* s_{0i}^{*2}}{2}\,\phi_i \right\}. $$
The easiest way to obtain the posterior distribution of $\psi$ is to complete the transformation to ensure a bijection and use results available for densities of 1-to-1 transformations of random quantities. Let $\psi_2 = \phi_2$ complete the transformation. The inverse relation is $\phi_1 = \psi_2/\psi$, $\phi_2 = \psi_2$, and the Jacobian of the transformation is
$$ J = \left| \frac{\partial(\phi_1, \phi_2)}{\partial(\psi, \psi_2)} \right| = \frac{\psi_2}{\psi^2} $$
so the posterior density of $(\psi, \psi_2)$ is
$$ p(\psi, \psi_2 \mid x) \propto \psi^{-\nu_1^*/2-1}\, \psi_2^{(\nu_1^*+\nu_2^*)/2-1} \exp\left\{ -\frac{\psi_2}{2}\left[ \nu_2^* s_{02}^{*2} + \frac{\nu_1^* s_{01}^{*2}}{\psi} \right] \right\}. $$
Finally, the marginal density of $\psi$ is
$$ p(\psi \mid x) = \int_0^\infty p(\psi, \psi_2 \mid x)\, d\psi_2 \propto \psi^{-\nu_1^*/2-1} \frac{\Gamma\big((\nu_1^*+\nu_2^*)/2\big)}{\left[ \big(\nu_2^* s_{02}^{*2} + \nu_1^* s_{01}^{*2}/\psi\big)/2 \right]^{(\nu_1^*+\nu_2^*)/2}} \propto \psi^{\nu_2^*/2-1} \left( \nu_1^* s_{01}^{*2} + \nu_2^* s_{02}^{*2}\,\psi \right)^{-(\nu_1^*+\nu_2^*)/2} .$$
It can then be shown that
$$ \frac{s_{02}^{*2}}{s_{01}^{*2}}\,\psi \;\Big|\; x \sim F(\nu_2^*, \nu_1^*) .$$
Even though the distribution function of the $F$ distribution cannot be obtained analytically, it is tabulated in many books and computer software packages. Its percentiles are given in the list of distributions and can be used in the construction of intervals. An interesting property for probability evaluation with the $F$ distribution can be derived from the fact that if $X \sim F(\nu_2, \nu_1)$ then $X^{-1} \sim F(\nu_1, \nu_2)$, by simple inversion in the ratio of independent $\chi^2$ distributions. Therefore, denoting the $\alpha$ and $1-\alpha$ quantiles of the $F(\nu_1, \nu_2)$ distribution respectively by $\underline{F}_\alpha(\nu_1, \nu_2)$ and $\bar{F}_\alpha(\nu_1, \nu_2)$ gives
$$ \underline{F}_\alpha(\nu_1, \nu_2) = \bar{F}_\alpha^{-1}(\nu_2, \nu_1) .$$
Point estimators are given by
$$ E[\psi \mid x] = \frac{\nu_1^*}{\nu_1^*-2}\,\frac{s_{01}^{*2}}{s_{02}^{*2}} \quad\text{and}\quad \text{mode}(\psi) = \frac{\nu_1^*(\nu_2^*-2)}{\nu_2^*(\nu_1^*+2)}\,\frac{s_{01}^{*2}}{s_{02}^{*2}} .$$
Both estimators converge to $(s_{01}^*/s_{02}^*)^2$ when $\nu_i^* \to \infty$. In the case of a non-informative prior, the $s_{0i}^*$'s are given by the $s_i$'s, the unbiased estimates of the population variances, and the $\nu_i^*$'s are given by $n_i - 1$, $i = 1, 2$.
Therefore, for large samples the estimators will be given by the ratio of the variance estimates. In classical inference, we have already obtained the independent sampling distributions $(n_i-1)S_i^2\phi_i \sim \chi^2_{n_i-1}$, $i = 1, 2$. Therefore, from the properties of the $F$ distribution,
$$ \frac{S_1^2}{S_2^2}\,\psi^{-1} \sim F(n_1-1, n_2-1) .$$
Once again a pivotal quantity was found for inference about $\psi$, and the resulting confidence interval will be numerically equivalent to the non-informative Bayesian one. The unbiased estimator of $\psi^{-1} = \sigma_2^2/\sigma_1^2$ is $[(n_1-3)/(n_1-1)]\,S_2^2/S_1^2$.

Confidence intervals are obtained in either case using quantiles of the $F$ distribution. These are not HPD intervals, due to the asymmetry of the $F$ distribution. Using the results listed above about the $F$, a $100(1-\alpha)\%$ credibility interval for $\psi$ is obtained from
$$ 1-\alpha = P\left( \underline{F}_{\alpha/2}(\nu_2^*, \nu_1^*) \le \frac{s_{02}^{*2}}{s_{01}^{*2}}\,\psi \le \bar{F}_{\alpha/2}(\nu_2^*, \nu_1^*) \,\Big|\, x \right) $$
and, therefore,
$$ \left[ \frac{s_{01}^{*2}}{s_{02}^{*2}}\,\bar{F}_{\alpha/2}^{-1}(\nu_1^*, \nu_2^*) \;,\; \frac{s_{01}^{*2}}{s_{02}^{*2}}\,\bar{F}_{\alpha/2}(\nu_2^*, \nu_1^*) \right] $$
is a $100(1-\alpha)\%$ credibility interval for $\psi$. In the case of non-informative priors, the modifications previously mentioned apply and the resulting interval is
$$ \left[ \frac{s_1^2}{s_2^2}\,\bar{F}_{\alpha/2}^{-1}(n_1-1, n_2-1) \;,\; \frac{s_1^2}{s_2^2}\,\bar{F}_{\alpha/2}(n_2-1, n_1-1) \right]. $$
Finally, the confidence interval derived from classical inference coincides with the above interval, obtained with a non-informative prior.
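In the non-informative case the whole calculation reduces to $F$ percentiles; the sketch below assumes illustrative sample sizes and variance estimates:

```python
from scipy.stats import f

# Assumed summaries: sample sizes and unbiased variance estimates s_i^2
n1, n2 = 25, 31
s1_sq, s2_sq = 4.2, 2.9
ratio = s1_sq / s2_sq        # point estimate of psi = sigma_1^2 / sigma_2^2

alpha = 0.05
# limits use F_bar_{a/2}(n1-1, n2-1)^{-1} and F_bar_{a/2}(n2-1, n1-1)
lower = ratio / f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)
upper = ratio * f.ppf(1 - alpha / 2, n2 - 1, n1 - 1)
```

Both upper $F$ percentiles exceed 1, so the interval always contains the point estimate $s_1^2/s_2^2$, although asymmetrically.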
Exercises § 4.1
1. Prove that the estimator of $\theta$ associated with the absolute loss is the posterior median of the updated distribution of $\theta$.
2. Show that if $L_1$ and $L_2$ are two proportional loss functions, that is, $L_1(\delta, \theta) = kL_2(\delta, \theta)$, then the Bayes estimators associated with these losses coincide.
3. Suppose that $X$ with $N(\theta, \sigma^2)$ distribution, $\sigma^2$ known, is observed and it is known that $\theta \in (a, b)$.
(a) Obtain the non-informative prior for $\theta$.
(b) Obtain the complete expression of the resulting posterior.
(c) Obtain the posterior mean and mode.
4. Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli distribution with unknown parameter $\theta$ having unit uniform prior distribution. One wishes to estimate $\theta$ using the loss function
$$ L(d, \theta) = \frac{(\theta-d)^2}{\theta(1-\theta)} .$$
(a) Calculate the Bayes estimator and obtain its risk.
(b) Determine the predictive distribution for the $(n+1)$th observation and determine its mean and variance.
(c) Generalize items (a) and (b) to $k$ possible values for each $X_i$ with respective probabilities $\theta_j$, $j = 1, \ldots, k$, and obtain the Bayes estimator of $\theta_j$, $j = 1, \ldots, k$.
5. Suppose that the loss function used to estimate $\theta$ through $\delta$
(i) equals the distance between $\delta$ and $\theta$ if $\delta$ is smaller than $\theta$, and
(ii) triples the distance between $\delta$ and $\theta$ if $\delta$ is larger than $\theta$.
(a) Obtain the mathematical expression of the loss function.
(b) Show that the estimator of $\theta$ is the first quartile of the updated distribution of $\theta$.
(c) Generalize the result in (b) for when the loss in (ii) is $p$ times the distance between $\delta$ and $\theta$.
6. Suppose that $X \sim \text{bin}(n, \theta)$ and the conjugate prior $\theta \sim \text{beta}(a, b)$ is used:
(a) What is the value of $X$ that minimizes the variance of the posterior distribution of $\theta$?
(b) What is the value of $X$ that maximizes it? Interpret the results.
(c) Repeat items (a) and (b) for the case of a negative binomial distribution for $X$.
7. Assume that $X \sim U[\theta-1, \theta+1]$ is observed and assume a prior $p(\theta) \propto \theta^{-1}$, $\theta > 0$.
(a) Prove that $p(\theta \mid x) = c\theta^{-1}$, $\theta \in (x-1, x+1)$, where $c^{-1} = \log[(x+1)/(x-1)]$, $x > 1$.
(b) Calculate the mean, mode and median of the posterior distribution.
8. The income tax policy of a given country establishes that an individual pays tax $q$ if his income is larger than $k$. Assume that the income distribution for these individuals follows a $\text{Pa}(k, \theta)$ distribution.
(a) Show that $\log(X/k) \sim \text{Exp}(\theta)$.
(b) Assuming that little is known a priori about $\theta$, a sample of these individuals is observed and their incomes registered. Show that the posterior distribution of $\theta$ is $G(n, n\log(G/k))$, where $n$ is the sample size and $G$ is the geometric average of the observations.
(c) Assume a change in policy is under study, aiming at taxing the rich more. The threshold would be raised to $l > k$ and the tax raised to $r > q$. Show that the expected revenue would only rise if
$$ r > q\left( 1 + \frac{\log(l/k)}{n\log(G/k)} \right)^{\!n} .$$
Hint: first calculate the expected revenue given $\theta$.
§ 4.2
9. Consider a simple linear regression where $E(Y_i \mid \theta) = \theta_0 + \theta_1 x_i$, with $\theta = (\theta_0, \theta_1)$, $i = 1, \ldots, n$. Obtain the sum of squares $S(\theta)$ and show that its minimization leads to the least squares estimator, where the notation $\overline{g(X,Y)} = \frac{1}{n}\sum_{i=1}^n g(X_i, Y_i)$ may be used.
10. Show that the MLE of a parameter is a function of a minimal sufficient statistic for this parameter.
I 0. Show ihat the MLE of a parameter is a minimal sufficient statistic for this .. parameter. 1 1 . Suppose that the waiting time in a bank queue has Exp(B) distribution with 8
>
0. A sample of n customers is observed over a period ofT minutes.
(a) Suppose.-the individual waiting times were discarded and only the
number X of clients was recorded. Determine the MLE of 8 based on
X.
(b) Determine the maximum likelihood and Bayes estimators for 8, as suming that in a sample of n
=
20 customers the average serving time
was 3.8 minutes and all 20 clients were served. (c) Suppose that, in addition to the observations reported in (b), an ad ditional observation was made but all that is known is that it lasted for more than 5 minutes. Obtain the maximum like1ihood and Bayes estimators of 8 in this case. 12. Suppose one wishes to test three types of bulbs: normal life, long life and extra long life. The lifetimes of the bulbs have exponential distribution with means 8, 28 and 38, respectively.
-{\ssurne the test consists of observing one
randomly selected bulb of each tyjJe. (a) Determine the MLE of 8.
(b) Determine the method of moments estimator of 8. (c) Let 1ft
=
1/8 and assume the prior 1ft
posterior distribution of e.
�
G(a, {3).
Determine the
<J
(d) Determine the Bayes estimators of e using 0-1 and quadratic loss functions.
13. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be bidimensional vectors forming a sample from the bivariate normal with mean vector $\mu$, variances $\sigma_i^2$, $i = 1, 2$, and correlation coefficient $\rho$. Determine the MLE of all model parameters.
§ 4.3
14. Prove the Cramér-Rao inequality for the multiparameter case.
Hint: Consider the joint covariance matrix of $\hat{\theta}$ and the score function and explore the non-negative definiteness of this matrix.
15. Consider a random sample $X = (X_1, \ldots, X_n)$ from an unknown distribution function $F$ and let $\hat{F}$ denote the empirical distribution function.
(a) Show that $\hat{F}$ is non-decreasing and contained in the interval $[0, 1]$.
(b) Show that $\hat{F}(x)$ can be written as $\bar{Z}$ where $Z_i = I_{X_i}((-\infty, x])$, $i = 1, \ldots, n$.
(c) Show that $\hat{F}$ is an unbiased estimator of $F$.
(d) Show that $\hat{F}$ is a consistent estimator of $F$.
Hint: obtain the sampling distribution of $\hat{F}(x)$, $\forall x$.
16. Let $X_1, X_2, \ldots, X_n$ be a random sample from the $\text{Pois}(\theta)$ distribution and $Y = \sum_{j=1}^n X_j$.
(a) Determine $c$ such that $\exp(-cY)$ is an unbiased estimator of $\exp(-\theta)$.
(b) Obtain a bound for the variance of the estimator obtained in (a).
(c) Discuss the efficiency of the estimator obtained in (a).
17. Let $X_1, \ldots, X_n$ be a random sample from the uniform distribution over the interval $[0, \theta]$.
(a) Obtain $\hat{\theta}_n$, the MLE of $\theta$, and show it is a biased but consistent estimator of $\theta$.
(b) Obtain, from (a), an unbiased estimator of $\theta$.
(c) Calculate the quadratic risks of the estimators in (a) and (b).
(d) Find an estimator of $\theta$ with smaller risk than those obtained in (a) and (b).
18. It is common in genetics to obtain samples from the binomial distribution with the impossibility of 0 observations.
(a) Show that the sampling distribution is given by
$$ f(x \mid \theta) = \binom{n}{x}\,\frac{\theta^x(1-\theta)^{n-x}}{1-(1-\theta)^n}, \quad x = 1, \ldots, n .$$
(b) Obtain the MLE of $\theta$, assuming that $n = 2$.
(c) Verify whether the above estimator is unbiased.
19. Let $X_1, \ldots, X_n$ be a random sample from the uniform distribution on the interval $[a-b, a+b]$, $a \in R$ and $b > 0$.
(a) Verify if $a$ and $b$ are location and/or scale parameters.
(b) Obtain the MLE of $a$ and $b$.
(c) Assume now that $b = 1$ and define $T_1 = \bar{X} = \sum_{i=1}^n X_i/n$ and $T_2$.
(d) Show that $T_1$ and $T_2$ are consistent and unbiased estimators of $a$.
(e) Compare $T_1$ and $T_2$, specifying a choice between them and justifying it.
20. Let $X_1, \ldots, X_n$ be iid with probability function $f(x \mid \theta) = \theta(1-\theta)^x$, $x = 0, 1, 2, \ldots$, $\theta \in (0, 1)$, and define $U = \bar{I}_X(\{0\})$.
(a) Show that $U$ is an unbiased estimator of $\theta$ and calculate its variance.
(b) Show that the expected Fisher information for $\theta$ is $n/(\theta^2(1-\theta))$ and find the efficiency of $U$.
21. Let $X_1$ and $X_2$ be iid with $\text{Exp}(\theta)$ distribution.
(a) Show that $U = \exp(-X_1)$ is an unbiased estimator of $\psi = \theta/(\theta+1)$.
(b) Show that $(1-e^{-T})/T$ is a UMVU estimator of $\psi$, for $T = X_1 + X_2$.
§4.4
22. Show that Bayesian confidence intervals are invariant under 1-to-1 transformations of the parameter. So, if $C$ is a $100(1-\alpha)\%$ confidence interval for $\theta$ and $\phi = \phi(\theta)$ is a 1-to-1 transformation of $\theta$, then $\phi(C)$, the image of $C$ under $\phi$, is a $100(1-\alpha)\%$ confidence interval for $\phi$. Show that HPD intervals are not invariant under 1-to-1 transformations. Show also that frequentist confidence intervals are invariant under 1-to-1 transformations.
23. Let $\theta = (\theta_1, \ldots, \theta_r)$. Show that if the $\theta_i$'s are not independent then
$$ \Pr(\theta \in C) \ge 1 - \sum_{i=1}^r \Pr(\theta_i \notin C_i) $$
where $C = C_1 \times \cdots \times C_r$. Show that if $\Pr(\theta_i \in C_i) \ge 1-\alpha_i$ with $\alpha_i = \alpha/r$, then $C$ is a $100(1-\alpha)\%$ confidence region for $\theta$. Obtain the equivalent result from a classical perspective.
Hint: Define pivotal quantities $U_i = G_i(X, \theta_i)$, $i = 1, \ldots, r$.
24. Let $X \sim \text{bin}(n, \theta)$ and assume the prior $\theta \sim U[0, 1]$. Suppose that the observed value was $X = n$.
(a) Show that the $100(1-\alpha)\%$ HPD interval for $\theta$ has the form $[a, 1]$, $a < 1$.
(b) Let $\psi = \theta/(1-\theta)$. Show from (a) that $\Pr[\psi \ge a/(1-a) \mid x] = 1-\alpha$ and therefore $[a/(1-a), \infty)$ is a $100(1-\alpha)\%$ credibility interval for $\psi$.
(c) Obtain the posterior distribution of $\psi$ and discuss the form of a $100(1-\alpha)\%$ HPD interval for $\psi$.
(d) In particular, is the interval obtained in (b) HPD?
25. Let $X_1, \ldots, X_n$ be a random sample from the uniform distribution over the interval $[0, \theta]$.
(a) Obtain a pivot based on the sampling distribution of $\hat{\theta}_n$, the MLE of $\theta$.
(b) Obtain a $100(1-\alpha)\%$ classical confidence interval for $\theta$ based on the pivot used in (a).
(c) Which conditions must be satisfied for a minimum length interval?
(d) Assuming the non-informative prior $p(\theta) \propto \theta^{-1}$, find the $100(1-\alpha)\%$ HPD interval for $\theta$.
26. (Berger, 1985, p. 134) Let $X$ be such that $p(x \mid \theta) = \exp[-(x-\theta)]$, $x \ge \theta$, and assume the prior $p(\theta) \propto (1+\theta^2)^{-1}$, $\theta \ge 0$.
(a) Prove that the posterior density of $\theta$ given $x$ is monotonically increasing and that the mode of $\theta$ is $x$.
(b) Show that the $100(1-\alpha)\%$ HPD interval for $\theta$ must have the form $[c(\alpha), x]$, where $c(\alpha)$ is such that $P[c(\alpha) \le \theta \le x \mid x] = 1-\alpha$.
(c) Obtain the posterior density of $\eta = \exp(\theta)$ and prove it is a monotonically decreasing function of $\eta$.
(d) Show that the $100(1-\alpha)\%$ HPD interval for $\eta$ must have the form $[1, d(\alpha)]$, where $d(\alpha)$ is such that $P(1 \le \eta \le d(\alpha) \mid x) = 1-\alpha$.
(e) Show that the confidence interval in (d) implies a $100(1-\alpha)\%$ lowest posterior density interval for $\theta$.
27. Let $X = (X_1, \ldots, X_n)$ be a random sample from a $\text{Pois}(\theta)$ distribution.
The prior distribution for $\theta$ is judged to be based on information equivalent to that obtained after observing $a$ lifetimes with $\text{Exp}(\theta)$ distribution with observed mean lifetime $b$, where $a, b > 0$.
(a) Show that the prior distribution of $\theta$ is $G(a+1, ab)$.
(b) Obtain the distribution of $\theta \mid x$.
(c) Suppose that instead of observing the sample, only $T = \sum_{i=1}^n X_i$ was observed. Obtain the distribution of $\theta \mid T = t$.
Hint: if $X$ consists of iid $\text{Pois}(\theta)$ variables then $T \sim \text{Pois}(n\theta)$.
(d) Compare the distributions obtained in (b) and (c) when $\sum_{i=1}^n x_i = t$, justifying the result obtained.
(e) Obtain a $100(1-\alpha)\%$ confidence interval for $\theta$ using the percentiles of the $\chi^2$ distribution.
28. Let $X_i = \theta t_i + \epsilon_i$, $i = 1, \ldots, n$, where the $\epsilon_i$ are iid $N(0, \sigma^2)$, $\sigma^2$ known, and the non-informative prior for $\theta$ is assumed.
(a) Obtain the posterior distribution of $\theta$.
(b) Obtain the $100(1-\alpha)\%$ HPD region for $\theta$.
(c) Show that the MLE of $\theta$ is the UMVU estimator of $\theta$.
(d) Based on the sampling distribution of the MLE of $\theta$, construct a $100(1-\alpha)\%$ confidence interval for $\theta$.
(e) If $0 \le t_i \le 1$, $\forall i$, what values of the $t_i$'s must be chosen to obtain the confidence intervals in (b) and (d) of shortest possible length?
§ 4.5
29. Show that if $T \sim N(0, 1)$ and $W \sim \chi^2_\nu$ are independent then $T/\sqrt{W/\nu} \sim t_\nu(0, 1)$.
30. Consider the situation with two normal samples where $X_1 = (X_{11}, \ldots, X_{1n_1})$ is a random sample from the $N(\theta_1, \sigma_1^2)$ distribution and $X_2 = (X_{21}, \ldots, X_{2n_2})$ is a random sample from the $N(\theta_2, \sigma_2^2)$ distribution. In addition, the two samples will be assumed to be independent.
(a) If $\sigma_1^2$ and $\sigma_2^2$ are known, show that the class of bivariate normal distributions for $\theta_1$ and $\theta_2$ is conjugate with the above observation model. Also, obtain the posterior correlation between $\theta_1$ and $\theta_2$ and compare it with the prior correlation.
(b) If $\sigma_1^2 = \sigma_2^2 = \sigma^2$ is unknown, show that
$$ \frac{\nu S^2}{\sigma^2} \sim \chi^2_\nu .$$
(c) Show that the resulting confidence intervals for $\beta$ and $\sigma^2$ obtained with the pivotal quantities of the previous item coincide numerically with the credibility intervals obtained with a non-informative prior.
(d) If $\sigma_1^2$ and $\sigma_2^2$ are unequal and unknown, show that
$$ \frac{s_{02}^{*2}}{s_{01}^{*2}}\,\psi \;\Big|\; x \sim F(\nu_2^*, \nu_1^*) .$$
31. Let $X = (X_1, \ldots, X_n)$ be a random sample from the $N(\theta, \sigma^2)$ distribution and assume that the normal-$\chi^2$ conjugate prior is used for $(\theta, \phi)$, where $\phi = \sigma^{-2}$, with $n_0 = 5$. What should be the sample size to guarantee that $\Pr((\theta-\mu_1)^2 \le 4V_1 \mid x) \ge 0.95$, where $\mu_1 = E(\theta \mid x)$ and $V_1 = V(\theta \mid x)$?
32. Let $x_1$ and $x_2$ be two observations from the $N(\theta, \sigma^2)$ distribution and assume a non-informative prior for $(\theta, \sigma^2)$. If $Y_1$ is the smallest of the observations and $Y_2$ the largest one, show that, a posteriori,
$$ \Pr(y_1 \le \theta \le y_2) = 0.5 .$$
Hint: write the posterior distribution of $\theta$ as a function of $y_1$ and $y_2$.
33. Let $X = (X_1, \ldots, X_n)$ be a random sample from the $N(\theta_1, \sigma^2)$ distribution and $Y = (Y_1, \ldots, Y_n)$ be a random sample from the $N(\theta_2, k\sigma^2)$ distribution, $k$ known.
(a) Assuming a non-informative prior for $(\theta_1, \theta_2, \sigma^2)$, obtain the posterior distribution of $\theta_1 - \theta_2$ and $\sigma^2$.
(b) Construct a $100(1-\alpha)\%$ HPD interval for $\theta_1 - \theta_2$.
(c) Obtain a $100(1-\alpha)\%$ confidence interval for $\theta_1 - \theta_2$ from a frequentist perspective.
34. Let $X_1, \ldots, X_n$ be independent observations from the $N(\theta, \sigma^2/k_i)$, $i = 1, \ldots, n$, distributions, respectively, where the $k_i$'s are known positive constants.
(a) Obtain a family of natural conjugate distributions to the observational model above.
Hint: define $\bar{X} = \sum_{i=1}^n k_i X_i / \sum_{i=1}^n k_i$.
(b) Construct a $100(1-\alpha)\%$ HPD interval for $\theta$.
(c) Obtain a $100(1-\alpha)\%$ confidence interval for $\theta$ from a frequentist perspective.
5 Approximate and computationally intensive methods
As we have seen in Chapter 4, the classical approach to statistics requires the sampling distribution of the estimators to be available, from both a practical and a theoretical point of view. In the Bayesian approach, all the information needed is described by the posterior distribution. In both approaches, the evaluation of probabilities or expected values and the optimization of some criterion function are often demanded. When the exact solution fails, evaluation of these quantities must involve analytical or computationally intensive approximation techniques. Typically, the accuracy of the analytic procedures depends critically on the sample size, and the accuracy of simulation-based techniques depends on the number of simulations.
In this chapter the central problem of inference is stated in Section 5.1 and efficient numerical solutions are discussed. Many approximating techniques used in statistics are described in this chapter. Some optimization methods are presented in Section 5.2 and analytical techniques are discussed in Section 5.3. An introduction to numerical integration is presented in Section 5.4, and methods based on simulation are described in Section 5.5, including Monte Carlo, Monte Carlo with importance sampling, classical and Bayesian bootstrap and some ideas on Markov chain Monte Carlo techniques. A good account of some of the techniques presented in this chapter can be found in the books by Davison and Hinkley (1997), Gamerman (1997), Tanner (1996) and Thisted (1976).
5.1
The general problem of inference
In the development of statistical inference, as we have seen, we are often involved with the optimization of some criterion function or with the evaluation of integrals or expected values. In some cases these problems are not analytically tractable. The classical methods are often based on the maximization of an objective function, such as the likelihood function, or on the minimization of the squared error loss. Based on decision theory, the minimization of the expected loss function provides Bayes estimators, where the expectation is calculated with respect to the posterior distribution.
Generally speaking, we are often faced with one of two basic problems:

1. maximization of a function q(θ) that can be either a posterior density or a likelihood function;
2. evaluation of an integral

E[g(θ) | x] = ∫ g(θ) p(θ | x) dθ   or   E[g(X) | θ] = ∫ g(x) p(x | θ) dx

where x represents the observed data, g(·) is an integrable function and θ is a p-dimensional vector.

The first problem is related to the definition of the (generalized) MLE and the second includes many alternatives, which are exemplified below:

1. One of the basic problems in Bayesian inference is to find the value of the normalizing constant k of the posterior density. It is obtained as k⁻¹ = ∫ l(θ; x) p(θ) dθ.
2. If one wishes to ascertain the bias of an estimator θ̂ = g(X) of θ then one must evaluate the sampling expectation above.
3. In order to find the Bayes estimator with respect to the loss function L(δ, θ), one must evaluate ∫ g(θ) p(θ | x) dθ with g(·) = L(δ, ·).
4. Evaluation of the confidence of a region C involves calculation of P(G(X, θ) ∈ C) = ∫ g(x) p(x | θ) dx where g(x) is the indicator that G(x, θ) ∈ C.
5. The predictive density of a future sample Y with density p(y | θ), independent of X conditionally on θ, is given by ∫ g(θ) p(θ | x) dθ where g(θ) = p(y | θ).

In the next section some useful optimization methods will be presented. Their importance is to yield classical and Bayesian point estimators and also confidence and HPD intervals.
5.2
Optimization techniques

The optimization of some criterion function is present in many theoretical developments of statistical theory, as seen in Chapter 4, from both the classical and the Bayesian perspectives. For example, if we wish to obtain least squares estimators, generalized maximum likelihood estimators or the minimum expected loss then some sort of optimization will be required. As will be seen in Section 5.3, even to calculate the approximate value of some integrals via Laplace methods, maximum values of some functions are needed. Since in many relevant problems the optimum cannot be obtained analytically, some numerical optimization techniques will be reviewed in this section. The main goal of this section is to present and illustrate the use of the Newton-Raphson technique and the Fisher scoring method to locate the maximum of the likelihood function or of the posterior distribution. Many statistical books include chapters on statistical computing with discussion of optimization techniques, as for example Garthwaite, Jolliffe and Jones (1995).
An algorithm to find the zeros of a twice differentiable function g : R^p → R (p ≥ 1) is easily obtained from the Taylor expansion of g around an arbitrary point x^(0) ∈ R^p. Neglecting higher-order terms in x − x^(0) for suitably close values of x and x^(0) gives

g(x) ≈ g(x^(0)) + (x − x^(0))' ∂g(x^(0))/∂x.

If x* is a zero of g then solving the above equation for x* gives

x* ≈ x^(1) = x^(0) − [∂g(x^(0))/∂x]⁻¹ g(x^(0)).

It follows that, starting with an initial value x^(0) and using the relation stated before, the algorithm provides us with a new value x^(1) closer to the root of the above equation. This new point is the intersection of the tangent line, the linear approximation of g at x^(0), with the x axis. The procedure is then repeated with x^(1) replacing x^(0). This will lead to an even better approximation for x*, denoted by x^(2). Repeating the process successively gives the recursive relation

x^(j) = x^(j−1) − [∂g(x^(j−1))/∂x]⁻¹ g(x^(j−1)).

The procedure is graphically illustrated in Figure 5.1. This is the well-known Newton-Raphson algorithm and it must be repeated until some convergence criterion is achieved. Typical criteria are

|x^(j) − x^(j−1)| < δ   and   |g(x^(j))| < ε

where δ and ε are preset precisions determined arbitrarily. Since it is easier to evaluate proximity at the x level rather than at the g level, the former is sometimes preferred.
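The recursion and the two stopping criteria above can be sketched as follows in the scalar case (the function names and the test equation are our own illustrative choices):

```python
def newton_raphson(g, dg, x0, delta=1e-10, eps=1e-10, max_iter=100):
    """Iterate x_j = x_{j-1} - g(x_{j-1}) / g'(x_{j-1}) until both
    criteria |x_j - x_{j-1}| < delta and |g(x_j)| < eps are met."""
    x = x0
    for _ in range(max_iter):
        x_new = x - g(x) / dg(x)
        if abs(x_new - x) < delta and abs(g(x_new)) < eps:
            return x_new
        x = x_new
    raise RuntimeError("no convergence")

# Root of g(x) = x^2 - 2, i.e. the square root of 2.
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```

Starting from x^(0) = 1, the iterates 1.5, 1.4167, 1.4142, ... converge quadratically to √2.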
5.2.1
Solution of the likelihood equation

Let U(X; θ) = ∂ log p(X | θ)/∂θ be the score function. The MLE is the solution of the equation U(X; θ) = 0. Remember that J(θ) = −∂U(X; θ)/∂θ is the observed information matrix. Then, replacing the relevant terms in the expression of the Newton-Raphson iteration gives

θ^(j) = θ^(j−1) + J⁻¹(θ^(j−1)) U(X; θ^(j−1)).

There is a sense, to be made clearer in the next section, in which the observed information J approaches the expected information I. This idea can be introduced as a modification to the Newton-Raphson algorithm by replacement of the factor involving J by another one involving I. The jth step of the iteration becomes

θ^(j) = θ^(j−1) + I⁻¹(θ^(j−1)) U(X; θ^(j−1)).
Fig. 5.1 Graphical representation of the iterative method for finding the roots of an equation in the scalar case.
This revised algorithm is usually known as Fisher scoring and enjoys many of the nice properties of the Newton-Raphson algorithm. Under mild regularity conditions found in many applications, it converges to the maximum likelihood estimator. Its strength lies in the fact that considerable reduction in computation is sometimes achieved by replacement of J by I, with consequent elimination of a few terms.
Example. In this example an indirect use of the Fisher scoring algorithm is presented. First the profile likelihood defined in Chapter 2 is obtained, reducing the dimension of the parameter space, and then the algorithm is implemented. Let X₁, ..., Xₙ be iid Wei(α, β) random variables (see list of distributions). The log-likelihood function is given by

L(α, β) = n log β + n log α + (α − 1) Σ_{i=1}^n log xᵢ − β Σ_{i=1}^n xᵢ^α.

Substituting β by its conditional MLE β̂(α) = n / Σ_{i=1}^n xᵢ^α, the profile log-likelihood function can be written as

L(α, β̂(α)) = n log n − n log(Σ_{i=1}^n xᵢ^α) + (α − 1) Σ_{i=1}^n log xᵢ + n log α − n.

Differentiating with respect to α and making some algebraic simplifications gives the score function and the observed information as

U(α) = n/α + Σ_{i=1}^n log xᵢ − n Σ_{i=1}^n xᵢ^α log xᵢ / Σ_{i=1}^n xᵢ^α

and

J(α) = n/α² + n [Σ_{i=1}^n xᵢ^α (log xᵢ)² Σ_{i=1}^n xᵢ^α − (Σ_{i=1}^n xᵢ^α log xᵢ)²] / (Σ_{i=1}^n xᵢ^α)².

Applying the Newton-Raphson algorithm with some initial value α^(0) gives at the jth step of the iteration the equation

α^(j) = α^(j−1) + J⁻¹(α^(j−1)) U(α^(j−1)).

As soon as convergence in α is reached, β can be estimated using the relationship between α and its conditional MLE given by

β̂(α) = n / Σ_{i=1}^n xᵢ^α.

Using n = 25 observations artificially generated from a Weibull distribution with α = 1.5 and β = 2.0, we obtained via Newton-Raphson the following MLEs: α̂ = 1.59 and β̂ = 1.97. Confidence intervals can be obtained from the asymptotic distribution of the MLE. As will be seen in the next section, this requires the evaluation of the Fisher information. Fortunately, this can be obtained at no extra cost since the Fisher information must be evaluated in the algorithm. Approximate 95% confidence intervals for β and α are respectively given in this case by (1.80, 2.15) and (1.42, 1.76). Note that they include the true values used in the generation, as expected.
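A sketch of this profile-likelihood Newton-Raphson in Python follows. The data are regenerated here by inversion with an arbitrary seed, so the resulting estimates will differ from the ones quoted above; the function names and starting value are our own.

```python
import math
import random

def weibull_profile_mle(x, alpha0=1.0, tol=1e-10, max_iter=200):
    """Newton-Raphson on the profile score U(alpha); beta is then
    recovered from its conditional MLE beta = n / sum(x_i ** alpha)."""
    n = len(x)
    sum_log = sum(math.log(xi) for xi in x)
    alpha = alpha0
    for _ in range(max_iter):
        s0 = sum(xi ** alpha for xi in x)
        s1 = sum(xi ** alpha * math.log(xi) for xi in x)
        s2 = sum(xi ** alpha * math.log(xi) ** 2 for xi in x)
        score = n / alpha + sum_log - n * s1 / s0
        info = n / alpha ** 2 + n * (s2 * s0 - s1 ** 2) / s0 ** 2
        step = score / info
        alpha += step
        if alpha <= 0:           # guard against a large overshoot
            alpha = tol
        if abs(step) < tol:
            break
    beta = n / sum(xi ** alpha for xi in x)
    return alpha, beta

# n = 25 draws from Wei(alpha = 1.5, beta = 2.0) by inversion:
# if U ~ Unif(0, 1) then X = (-log(U) / beta) ** (1 / alpha).
random.seed(42)
data = [(-math.log(random.random()) / 2.0) ** (1 / 1.5) for _ in range(25)]
alpha_hat, beta_hat = weibull_profile_mle(data)
```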
5.2.2
Determining confidence intervals

Recall from Chapter 4 that the HPD interval in the uniparameter case is the collection of all θ ∈ Θ such that p(θ | x) > k, where k is related to γ, the confidence probability, through the equation

P[θ_l < θ < θ_u | x] = γ.

Note that we are assuming here that the confidence region is in fact an interval. Defining g(θ) = log p(θ | x) − log k, the limits of the interval are obtained by applying the Newton-Raphson algorithm or the Fisher scoring algorithm twice, with different starting values θ̂ − ε and θ̂ + ε, where θ̂ is the posterior mode and ε > 0. The convergence of the method is assured as long as the starting values are not far from the mode.

Many difficulties can occur with numerical optimization methods. The solution may not be stable when numerical derivatives are used; the choice of a bad initial value can lead the solution to a local (instead of global) optimum, and so forth. Nevertheless, many solutions to these problems have been proposed in the literature. There are methods that avoid the numerical evaluation of the second derivative, others that allow for different randomly selected initial values, and so forth. The interested reader is referred to Thisted (1976) for further discussion.
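The two-start root search just described can be sketched as follows. The standard normal posterior and the choice of log k as the log density at 1.96 are our own illustrative assumptions, made so that the interval limits should come out at −1.96 and +1.96:

```python
def newton_root(g, dg, x0, tol=1e-12, max_iter=100):
    x = x0
    for _ in range(max_iter):
        x_new = x - g(x) / dg(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Illustrative posterior: theta | x ~ N(0, 1), so log p(theta | x) equals
# -theta**2 / 2 up to a constant.  Choosing log k as the log density at
# 1.96 makes the HPD limits equal -1.96 and +1.96.
log_post = lambda t: -t * t / 2.0
dlog_post = lambda t: -t
log_k = log_post(1.96)
g = lambda t: log_post(t) - log_k

mode, eps = 0.0, 0.5
lower = newton_root(g, dlog_post, mode - eps)
upper = newton_root(g, dlog_post, mode + eps)
```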
5.2.3
The EM algorithm
This is a general iterative method useful for obtaining the (generalized) MLE when we are faced with an incomplete data set or when it is simpler to maximize the likelihood of a more general problem involving unobserved quantities. Dempster, Laird and Rubin (1977) provided the first unified account of the algorithm. A discussion in the context of data imputation is presented by Little and Rubin (1988). The algorithm is iterative and at each iteration it alternates the operations of expectation (E) and maximization (M).

Let X ∈ R^n be the n-dimensional vector of observed quantities and Z ∈ R^m an m-dimensional vector of unobserved quantities. The complete data is denoted by Y = (X, Z)' ∈ R^{n+m} and its density function is p(y | θ) = p(x, z | θ), θ ∈ Θ. On the other hand, let the conditional density of the unobserved data given the observed one be p(z | x, θ), which also depends on θ. To obtain the MLE of θ, the logarithm of the marginal likelihood

L(θ; x) = log ∫ p(x, z; θ) dz

is usually directly maximized. To avoid the high-dimensional integral involved in the marginalization of p(x, z | θ), the following relationship can be used:

L(θ; x) = log [p(x, z | θ) / p(z | x, θ)] = log p(x, z | θ) − log p(z | x, θ).

Since Z is unobserved, it is necessary to eliminate it before maximizing L(θ; x). One way to do that is to take expected values with respect to the conditional density p(z | x, θ^(0)) in the above equation. Noting that E_{Z|x,θ^(0)}[L(θ; x)] = L(θ; x) gives

L(θ; x) = Q(θ; θ^(0)) − H(θ; θ^(0))

where

Q(θ; θ^(0)) = E_{Z|x,θ^(0)}[log p(x, Z | θ)]   and   H(θ; θ^(0)) = E_{Z|x,θ^(0)}[log p(Z | x, θ)].

The jth iteration of the algorithm alternates the two operations:

1. E (expectation): evaluation of Q(θ; θ^(j−1));
2. M (maximization): maximization of Q(θ; θ^(j−1)) with respect to θ, giving θ^(j).

Dempster, Laird and Rubin (1977) showed that the sequence θ^(j), j ≥ 1, generated by the EM algorithm satisfies L(θ^(j) | x) ≤ L(θ^(j+1) | x) and so is monotonically increasing in the likelihood l(θ | x). Therefore, the sequence θ^(j) converges to θ̂, the MLE, if the likelihood function has only a single maximum. Convergence can be established by criteria such as |θ^(j) − θ^(j−1)| < δ or |Q(θ^(j); θ^(j−1)) − Q(θ^(j−1); θ^(j−1))| < ε. The convergence can be slow in some cases, especially if the missing information about Z is substantial. The adaptation for posterior mode evaluation involves replacement of the likelihood l(θ; y) by the posterior p(θ | y) in the E step.
Example (Rao, 1973). A classical application in the statistical literature refers to a genetic study stating that the four-dimensional vector of animal counts X = (X₁, X₂, X₃, X₄) has multinomial distribution with parameters n and π, where π = (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4). So the probability function of X is

p(x | θ) = [(x₁ + x₂ + x₃ + x₄)! / (x₁! x₂! x₃! x₄!)] (1/2 + θ/4)^{x₁} [(1 − θ)/4]^{x₂+x₃} (θ/4)^{x₄}.

Direct maximization of the above expression is awkward due to the presence of the term 1/2 + θ/4. To avoid it, the EM method described above will be applied. To do that, let X₁ = Y₀ + Y₁ and Yᵢ = Xᵢ, i ≥ 2, where the augmented vector Y = (Y₀, Y₁, Y₂, Y₃, Y₄) has multinomial distribution with parameters n and π' = (1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4). To complete the notation, define Z = Y₀, so that Y = (X, Z). Therefore,

p(y | θ) = [n! / (y₀! y₁! y₂! y₃! y₄!)] (1/2)^{y₀} (θ/4)^{y₁} [(1 − θ)/4]^{y₂+y₃} (θ/4)^{y₄}

and

log p(y | θ) = k₁(y) + y₀ log(1/2) + (y₁ + y₄) log(θ/4) + (y₂ + y₃) log[(1 − θ)/4]
             = k₂(y) + (y₁ + y₄) log θ + (y₂ + y₃) log(1 − θ)

where k₁(y) and k₂(y) are functions of y but not of θ. Therefore

Q(θ; θ^(j)) = E[k₂(Y) + (Y₁ + Y₄) log θ + (Y₂ + Y₃) log(1 − θ) | x, θ^(j)]
            = k(x, θ^(j)) + [E(Y₁ | x, θ^(j)) + x₄] log θ + (x₂ + x₃) log(1 − θ)

since Yᵢ = Xᵢ, i = 2, 3, 4. We only need to evaluate the expectation of Y₁ since k(x, θ^(j)) does not depend on θ and will therefore be irrelevant in the M step. From the construction of Y it follows that (Z | x, θ) ~ (Z | x₁, θ) ~ bin(x₁, p) where p = (1/2)/[(1/2) + (θ/4)] = 2/(2 + θ). Therefore E(Y₀ | x, θ) = x₁p and E(Y₁ | x, θ) = x₁(1 − p). The M step involves finding the value of θ that maximizes Q(θ; θ^(j)). This is easily obtained by differentiation of Q and gives

θ^(j+1) = [x₁(1 − p^(j)) + x₄] / [x₁(1 − p^(j)) + x₂ + x₃ + x₄],   where p^(j) = 2/(2 + θ^(j)).

To illustrate the results, assume it was observed that the counts were x = (125, 18, 20, 34)' and the EM algorithm was started at θ^(0) = 0.5. Substituting p^(j) above gives

θ^(j+1) = (159θ^(j) + 68) / (197θ^(j) + 144).

The first five iterations of the algorithm give the values 0.608, 0.624, 0.626, 0.627, 0.627.
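The E and M steps of this example can be sketched as follows; the code reproduces the iteration path quoted above (function and variable names are our own):

```python
def em_genetics(x, theta0=0.5, n_iter=5):
    """EM for the genetics example: the first count x1 is split into an
    unobserved cell of probability 1/2 and one of probability theta/4."""
    x1, x2, x3, x4 = x
    theta = theta0
    path = []
    for _ in range(n_iter):
        # E step: E(Y1 | x, theta) = x1 * (1 - p) with p = 2 / (2 + theta)
        ey1 = x1 * (1 - 2 / (2 + theta))
        # M step: closed-form maximizer of Q
        theta = (ey1 + x4) / (ey1 + x2 + x3 + x4)
        path.append(theta)
    return path

path = em_genetics((125, 18, 20, 34), theta0=0.5)
rounded = [round(t, 3) for t in path]   # [0.608, 0.624, 0.626, 0.627, 0.627]
```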
Example (Randomized response). The proportion θ of individuals belonging to a certain stigmatized category must be estimated. To avoid non-response (and its consequent loss of information), a new sampling scheme is proposed. An alternative question, not related to the main one, with known proportion of YES responses is introduced, together with the guarantee that the selected question will be known exclusively by the respondent. The idea is to increase his/her confidence in providing the correct response without revealing his/her true status. The probability of a YES response will be λ(θ) = πθ + (1 − π)θ_A, where θ is the probability of a YES answer to the original question of interest, θ_A is the known probability of a YES answer to the alternative question and π is the probability of selection of the question of interest. In a sample of 150 individuals, 60 YES responses were obtained, based on a procedure with π = 0.7 and θ_A = 0.6.

Using the EM algorithm, the observed data is X, the number of YES responses, and X | θ ~ bin(n, λ). The unobserved data is Z, the number of YES responses coming from the question of interest. Then Z | x, θ ~ bin(x, p), where p = πθ/λ. The joint density of the observed and unobserved data is given by

p(x, z | θ) = p(z | x, θ) p(x | θ) = [n! / (z!(x − z)!(n − x)!)] (πθ)^z [(1 − π)θ_A]^{x−z} (1 − λ)^{n−x}

and

log p(x, z | θ) = k(x, z) + z log θ + (n − x) log(1 − λ).

Then, the jth iteration of the EM algorithm will have:

1. E step: Q(θ; θ^(j)) = E[log p(x, Z | θ) | x, θ^(j)] = xp^(j) log θ + (n − x) log(1 − λ) + k(x, θ^(j));
2. M step: maximization of Q involves finding the solution of

∂Q(θ; θ^(j))/∂θ = xp^(j)/θ − (n − x)π/(1 − λ) = 0,

since ∂λ/∂θ = π. Solving for θ gives

θ^(j+1) = xp^(j)[1 − (1 − π)θ_A] / {π[xp^(j) + (n − x)]}.

In the conditions of the example, with a sample of 150 individuals, initializing the procedure with θ^(0) = 0.4 provides the sequence of estimates 0.338, 0.322, 0.317, 0.315, 0.314, 0.314, ....
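The iteration can be sketched as follows; it reproduces the sequence above and converges to the closed-form MLE θ̂ = (x/n − (1 − π)θ_A)/π ≈ 0.3143 (digits in the last decimal place may differ slightly from the rounding quoted in the text):

```python
def em_randomized_response(n, x, pi, theta_a, theta0, n_iter=20):
    """EM iterations for the randomized response example."""
    theta = theta0
    path = []
    for _ in range(n_iter):
        lam = pi * theta + (1 - pi) * theta_a    # P(YES response)
        p = pi * theta / lam                     # E step: E(Z | x) = x * p
        theta = x * p * (1 - (1 - pi) * theta_a) / (pi * (x * p + n - x))
        path.append(theta)
    return path

path = em_randomized_response(n=150, x=60, pi=0.7, theta_a=0.6, theta0=0.4)
```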
5.3
Analytical approximations

Some analytical methods used in statistical inference will be presented in this section. Firstly, results based on asymptotic theory will be presented, followed by the Kullback-Leibler approximation and a technique for numerical integration called the Laplace method.
5.3.1
Asymptotic theory

The behaviour of statistical inference will be studied in this section from the standpoint of an infinite sample size. Clearly, as in practice n is never infinite, the results presented here can be applied only when n can be thought of as sufficiently large to ensure that the results stated are approximately valid. The results presented here provide a method to obtain approximate solutions to problems for which the exact solution is not feasible, which is their main utility. There are other ways to obtain approximate solutions but almost all of them are variations on the methods described in this section. The main exception is the class of methods based on the Kullback-Leibler divergence measure, which will be developed in the following subsection.

In general, when the sample size increases, its influence on the inference also increases, minimizing the importance of the chosen prior distribution. However,
this will only be true if the prior distribution is non-degenerate, that is, p(θ) > 0 for all θ ∈ Θ, as we have seen in Section 3.2. As n increases, the posterior distribution will be more and more concentrated around its mode. Then it follows that

p(θ | x) ∝ l(θ; x) p(θ) = exp{L(θ) + log p(θ)}

where L(θ) = log l(θ; x). Since L(θ) = Σ_{i=1}^n log l(θ; xᵢ), the number of terms in L(θ) will increase with n but p(θ) is fixed and, consequently, the influence of the prior p(θ) becomes less and less relevant. Then, approximately,

p(θ | x) ∝ exp{L(θ)}.

Expanding L in a Taylor series around θ̂, the MLE of θ, it follows that

L(θ) = L(θ̂) + (θ − θ̂)' ∂L(θ̂)/∂θ + (1/2)(θ − θ̂)' [∂²L(θ̂)/∂θ∂θ'] (θ − θ̂) + R(θ, θ̂)

where R(θ, θ̂) contains terms of higher order, which can be eliminated when θ is supposed to be close enough to θ̂. As ∂L(θ̂)/∂θ = 0, it follows that

L(θ) = k − (1/2)(θ − θ̂)' J(θ̂)(θ − θ̂)   where   J(θ̂) = −∂²L(θ̂)/∂θ∂θ'.

Therefore,

p(θ | x) ∝ exp{−(1/2)(θ − θ̂)' J(θ̂)(θ − θ̂)}   and so   J^{1/2}(θ̂)(θ − θ̂) | x ≈ N(0, I_p).

Then, if Xₙ = (X₁, ..., Xₙ) is a random sample from p(x | θ) and θ̂ is the MLE of θ, under certain mild regularity conditions it follows that

J^{1/2}(θ̂)(θ − θ̂) | Xₙ → N(0, I_p) in distribution, when n → ∞.

The result simplifies when θ is a scalar. In this case,

(θ − θ̂)/J^{−1/2}(θ̂) | Xₙ → N(0, 1) in distribution, when n → ∞,

and J(θ̂) is given by −∂²L(θ̂)/∂θ².

The regularity conditions required are basically the same as those involved in the statement of the Cramér-Rao inequality (see Section 4.3). The results can be extended further if the sample is not made up of independent observations. Observe that J(θ) = −Σ_{i=1}^n ∂² log f(xᵢ | θ)/∂θ∂θ' and so J(θ) will typically increase with n.

This result indicates that θ | xₙ has an approximately normal distribution, N(θ̂, J⁻¹(θ̂)), when n is large enough. This means that, as long as the regularity conditions are satisfied, it is possible to draw approximate inferences about the parameters. In particular, it is possible to construct approximate credibility regions based on the above results.
Definition. Let θ be an unknown quantity defined in Θ. A region C ⊂ Θ is an asymptotic 100(1 − α)% credibility or Bayesian confidence region for θ if limₙ→∞ Pr(θ ∈ C | xₙ) ≥ 1 − α. In this case, 1 − α is called the asymptotic credibility or confidence level. In the scalar case, the region C is usually given by an interval, [c₁, c₂] say.
More accurate approximations are obtained by retaining the third-order term in the Taylor expansion. Then, approximately,

p(θ | xₙ) ∝ exp[−(1/2)(θ − θ̂)' J(θ̂)(θ − θ̂)] [1 + t(θ)]

where

t(θ) = (1/3!) Σ_{i,j,k} [∂³L(θ̂)/∂θᵢ∂θⱼ∂θₖ] (θᵢ − θ̂ᵢ)(θⱼ − θ̂ⱼ)(θₖ − θ̂ₖ).

These approximations were studied by Lindley (1980).
Example. The expressions stated above, applied to the posterior mean of a scalar θ, simplify to

E(θ | xₙ) ≈ ∫ θ [1 + t(θ)] p_N(θ | θ̂, J⁻¹(θ̂)) dθ

where p_N(· | a, b) denotes the density of the N(a, b) distribution. Then

E(θ | xₙ) ≈ θ̂ + (1/3!) [∂³L(θ̂)/∂θ³] E_N(θ − θ̂)⁴ = θ̂ + (1/2) [∂³L(θ̂)/∂θ³] J⁻²(θ̂).
Example. Let X₁, ..., Xₙ be a random sample from Pois(θ) and suppose that the prior for θ is a G(a₀, b₀) distribution. So the posterior of θ will be G(a₁, b₁), with a₁ = a₀ + t, b₁ = b₀ + n and t = Σ_{i=1}^n xᵢ. It is easy to verify that the posterior mode is θ̂ = (a₁ − 1)/b₁,

J(θ̂) = −∂² log p(θ̂ | xₙ)/∂θ² = (a₁ − 1)/θ̂²

and

t(θ) = (1/3!) [∂³ log p(θ̂ | xₙ)/∂θ³] (θ − θ̂)³ = (1/3!) [2(a₁ − 1)/θ̂³] (θ − θ̂)³ = (a₁ − 1)(θ − θ̂)³/(3θ̂³).

Then the approximations of second and third order will be

p(θ | xₙ) ∝ exp[−(a₁ − 1)(θ − θ̂)²/(2θ̂²)]

and

p(θ | xₙ) ∝ exp[−(a₁ − 1)(θ − θ̂)²/(2θ̂²)] [1 + (a₁ − 1)(θ − θ̂)³/(3θ̂³)]

where the last equation means that the distribution is proportional to the product of a normal density and the correction term in brackets. Using the result from the previous example, it follows that

E(θ | xₙ) ≈ θ̂   and   E(θ | xₙ) ≈ θ̂ + 1/b₁

respectively.
Note that for the validity of the results, p(θ) must be strictly positive and continuous in Θ. If p(θ) is proportional to a constant, it follows that the posterior coincides with the likelihood. Also, in order that J(θ̂) and t(θ) exist, the posterior must be three times differentiable.
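In the Poisson-gamma example the third-order approximation is in fact exact, since the mode (a₁ − 1)/b₁ plus the correction 1/b₁ equals the exact posterior mean a₁/b₁. A quick numerical check, with made-up hyperparameter values of our own:

```python
# Made-up values for illustration: prior G(2, 1) and n = 10 Poisson
# observations with total count t = 15, so the posterior is G(a1, b1).
a0, b0, n, t = 2, 1, 10, 15
a1, b1 = a0 + t, b0 + n          # a1 = 17, b1 = 11

exact_mean = a1 / b1             # exact mean of the G(a1, b1) posterior
mode = (a1 - 1) / b1             # second-order approximation
third_order = mode + 1 / b1      # third-order approximation
# The third-order correction recovers the exact mean: mode + 1/b1 = a1/b1.
```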
It has just been shown that the distribution of θ converges to the normal distribution when n increases. What can be stated about reparametrizations, that is, transformations of θ? The answer is not immediate because, θ being normal, the only transformations that preserve normality are the linear transformations of θ. But since all the above developments do not depend on any special property associated with θ, any transformation of θ preserving the regularity conditions will produce the same results. So, these results are valid for any transformation of θ; the only difference will be in the speed of the convergence to the normal distribution.

The Taylor series expansion applied directly to some transformation of the parameter is another approximation that leads to the same result. Suppose that E(X) = a, V(X) = A and Y = g(X) is a 1-to-1 transformation of X with derivatives well defined at the point a. Then

g(X) = g(a) + (X − a)' ∂g(a)/∂x + o(X − a)

where o(u)/|u| → 0 when u → 0. If X is close to a then the last term on the right-hand side can be omitted and Y will have an approximately linear relationship with X, with

E(Y) = g(a)   and   V(Y) = [∂g(a)/∂x]' A [∂g(a)/∂x].

Also, if X is normally distributed, so will Y be. This result is commonly known as the delta method.
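A small simulation check of the delta method; the model X ~ N(5, 0.04) and the transformation g = log are arbitrary illustrative choices of our own:

```python
import math
import random

random.seed(7)

# Hypothetical illustration: X ~ N(a, A) with a = 5, A = 0.04, and the
# 1-to-1 transformation Y = g(X) = log(X).  The delta method predicts
# E(Y) ~ g(a) = log(5) and V(Y) ~ g'(a)^2 * A = (1/5)^2 * 0.04 = 0.0016.
a, A = 5.0, 0.04
reps = 50000
ys = [math.log(random.gauss(a, math.sqrt(A))) for _ in range(reps)]
mean_y = sum(ys) / reps
var_y = sum((y - mean_y) ** 2 for y in ys) / (reps - 1)

approx_mean = math.log(a)
approx_var = (1 / a) ** 2 * A
```

The simulated mean and variance of Y should agree with the delta-method values to within Monte Carlo error.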
An interesting question concerns the choice of the optimal reparametrization in the sense of inducing fast convergence. This theme is still under investigation and some preliminary empirical results seem to indicate that the strongest candidates are the parameter transformations that appear in the definition of the exponential family and the transformations leading to constant non-informative priors. It is worth pointing out that the mean of normally distributed data with known variance, which has a normal posterior distribution for any value of n, belongs to both groups of transformations. Note also that the first class of transformations encompasses the class of parameters with constant Fisher information and therefore is, in some sense, stable.

Example. Let X
~ bin(n, θ) with unknown θ. It then follows that

L(θ) = log p(x | θ) = log C(n, x) + x log θ + (n − x) log(1 − θ)

and its derivatives are given by

∂L(θ)/∂θ = x/θ − (n − x)/(1 − θ)   ⟹   θ̂ = x/n

and

∂²L(θ)/∂θ² = −x/θ² − (n − x)/(1 − θ)² < 0,

so that

J(θ̂) = nθ̂/θ̂² + (n − nθ̂)/(1 − θ̂)² = n/θ̂ + n/(1 − θ̂) = n/[θ̂(1 − θ̂)].

So, for large enough n, θ has a posterior distribution approximately N[θ̂, θ̂(1 − θ̂)/n]. The non-informative prior for this model is given by p(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2}, as seen in Section 3.4, and the transformation producing a constant non-informative prior is

φ ∝ ∫₀^θ u^{−1/2}(1 − u)^{−1/2} du ∝ sin⁻¹(√θ).

Defining φ = sin⁻¹(√θ), it follows that

∂φ/∂θ = 1/[2√(θ(1 − θ))]   and   J⁻¹(φ̂) = [θ̂(1 − θ̂)/n] × 1/[4θ̂(1 − θ̂)] = 1/(4n).

So, the posterior distribution of φ for large enough n is approximately N(φ̂, 1/4n). The parameter obtained in the definition of the exponential family is ψ = log[θ/(1 − θ)]. Then

∂ψ/∂θ = 1/[θ(1 − θ)]

and so the posterior distribution of ψ is approximately N(ψ̂, 1/[nθ̂(1 − θ̂)]).

In classical inference, similar calculations can be performed with the asymptotic distribution of the maximum likelihood estimator. Let Xₙ = (X₁, ..., Xₙ) be a
random sample from p(x | θ) for a scalar θ and θ̂ₙ be the MLE of θ obtained from Xₙ. Suppose that the Cramér-Rao regularity conditions are true and also that |∂²U(X; θ)/∂θ²| < k. Under these conditions, it follows that

Σ_{i=1}^n U(Xᵢ; θ̂ₙ) = Σ_{i=1}^n U(Xᵢ; θ) + (θ̂ₙ − θ) Σ_{i=1}^n ∂U(Xᵢ; θ)/∂θ + [(θ̂ₙ − θ)²/2] Σ_{i=1}^n ∂²U(Xᵢ; ξ)/∂θ²

for ξ between θ and θ̂ₙ. By the definition of θ̂ₙ, the term on the left-hand side is null. Isolating the term (θ̂ₙ − θ), it follows that

√n(θ̂ₙ − θ) = [−n^{−1/2} Σ_{i=1}^n U(Xᵢ; θ)] / {(1/n) [Σ_{i=1}^n ∂U(Xᵢ; θ)/∂θ + 0.5(θ̂ₙ − θ) Σ_{i=1}^n ∂²U(Xᵢ; ξ)/∂θ²]}.

Since E[U(Xᵢ; θ) | θ] = 0 and V[U(Xᵢ; θ) | θ] = I(θ), it follows from the central limit theorem that the numerator converges in distribution to a N(0, I(θ)). The first term of the denominator converges almost surely to E[∂U(Xᵢ; θ)/∂θ | θ] = −I(θ). The second term converges almost surely to zero by the (strong) consistency of the maximum likelihood estimator and by the bounds imposed on the second derivatives of U(Xᵢ; θ), by hypothesis. So, the denominator converges to −I(θ) with probability one and the fraction converges to the quotient of the limits, since if Zₙ converges in distribution to Z and Wₙ converges almost surely to a constant w then Zₙ/Wₙ converges in distribution to Z/w. So,

√n(θ̂ₙ − θ) → N(0, I⁻¹(θ)) in distribution, when n → ∞.

The result is equivalent to the Bayesian asymptotic result. It is usually said that for large n the distribution of θ̂ₙ is approximately N(θ, I⁻¹(θ)/n). It is worth pointing out that the information measure considered is based on a single observation, p(x | θ).

The above result is also true for multivariate parameters θ. In this case, the asymptotic result is

√n(θ̂ₙ − θ) → N(0, I⁻¹(θ)) in distribution, when n → ∞.

The result is used to say that when n is large the distribution of θ̂ₙ is approximately N(θ, I⁻¹(θ)/n). Approximate classical inference can be made on the basis of these results. Once again, asymptotic confidence regions can be constructed.

Definition. Let
θ be an unknown quantity defined in Θ, U = G(Xₙ, θ) be a function with values in U and A be a region in U such that limₙ→∞ Pr(U ∈ A) ≥ 1 − α. A region C ⊂ Θ is an asymptotic 100(1 − α)% frequentist confidence region for θ if

C = {θ : G(Xₙ, θ) ∈ A}.

In this case, 1 − α is called the asymptotic confidence level. In the scalar case, the inversion in terms of θ usually leads to an interval, C = [c₁, c₂] say.

It is easy to obtain from the asymptotic results above that E[θ̂ₙ | θ] → θ, or
simply that the MLE is asymptotically unbiased. This means that although the maximum likelihood estimator could be biased for a fixed n, its expected value always tends to the parameter being estimated. Besides, it follows that

limₙ→∞ n V[θ̂ₙ | θ] = I⁻¹(θ).

So, the variance of the MLE asymptotically reaches the Cramér-Rao lower bound. As θ̂ₙ is asymptotically unbiased, the above limit could be thought of as a measure of the asymptotic efficiency of θ̂ₙ. Estimators satisfying this property are said to be asymptotically efficient.

Sometimes I(θ) depends on θ, making the process of inference (point estimation and confidence intervals) harder. It is common in these cases to substitute I(θ) by I(θ̂ₙ) or, in cases where the expected value is hard to obtain, by J(θ̂ₙ). In the former case the asymptotic distributions coincide and certainly lead to the same numerical results. The consistency of the maximum likelihood estimator and the strong law of large numbers justify the substitutions made to obtain both of the above results.

The convergence in distribution of the score function can be used to make approximate inference about θ. This is particularly useful when there is no explicit form for the maximum likelihood estimator. Then, for example, the asymptotic 100(1 − α)% confidence region for a scalar θ can be constructed using the percentile z_{α/2} of the normal distribution and is given by

{θ : |(1/√(nI(θ))) Σ_{i=1}^n U(xᵢ; θ)| < z_{α/2}}.

This result can also be extended to the case of parametric vectors. The same considerations made in the Bayesian case with respect to invariance under 1-to-1 parametric transformations are again true here, since the results are entirely based on the maximum likelihood estimator and on the Fisher information measures, which satisfy the invariance conditions.
Example (continued). If X ~ bin(n, θ), then

U(X; θ) = X/θ − (n − X)/(1 − θ) = (X − nθ)/[θ(1 − θ)]   and   θ̂ₙ = X/n.

It is also known that I(θ) = 1/[θ(1 − θ)]. Applying the results developed above gives √n(θ̂ₙ − θ) distributed as a N(0, θ(1 − θ)), for large n. Then an asymptotic confidence region with coverage probability of 100(1 − α)% for θ will be given by

{θ : |√n(θ̂ₙ − θ)|/√(θ(1 − θ)) < z_{α/2}}.

A simple inequality must be solved to get an explicit solution for the confidence region. In this case it could be more convenient to proceed with the suggested modification of replacing θ by its MLE. In this case, observed and expected information coincide at the MLE and

J(θ̂ₙ) = 1/[θ̂ₙ(1 − θ̂ₙ)] = I(θ̂ₙ).

The above confidence region is now replaced by the interval

(θ̂ₙ − z_{α/2} √(θ̂ₙ(1 − θ̂ₙ)/n), θ̂ₙ + z_{α/2} √(θ̂ₙ(1 − θ̂ₙ)/n)).

An alternative can be to use the asymptotic distribution of U(X; θ)/√n, given by a N(0, 1/[θ(1 − θ)]). It is not hard to see that the 100(1 − α)% asymptotic confidence region for θ is again given by

{θ : |√n(θ̂ₙ − θ)|/√(θ(1 − θ)) < z_{α/2}}.

In this case, the confidence region coincides with the region obtained from the asymptotic distribution of θ̂ₙ.
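The plug-in interval above is straightforward to compute; a small sketch with made-up counts (x = 40 successes in n = 100 trials, our own illustrative values):

```python
import math

def wald_interval(x, n, z=1.96):
    """theta_hat +/- z * sqrt(theta_hat * (1 - theta_hat) / n)."""
    theta_hat = x / n
    half = z * math.sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half, theta_hat + half

# Made-up counts for illustration: 40 successes in 100 trials.
lo, hi = wald_interval(x=40, n=100)   # roughly (0.304, 0.496)
```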
5.3.2
Approximation by Kullback-Leibler divergence

A measure of the divergence between the density p(θ) and its approximation p₀(θ) is defined by the expected value (see Bernardo and Smith, 1994)

δ[p(θ); p₀(θ)] = ∫ p(θ) log[p(θ)/p₀(θ)] dθ = E[log(p(θ)/p₀(θ))].

In the discrete case, all one needs to do is to substitute the integration with a summation. Smaller values of δ indicate better approximations. Two central questions in the applications are concerned with the determination of the best normal approximation and the most convenient reparametrization to accommodate such an approximation. These two questions will be answered only for particular cases in the exponential family.

The first question posed has an easy and general response. The best normal approximation to the distribution p(θ) is that with mean and variance given respectively by μ = E(θ) and σ² = V(θ), the mean and variance of the original distribution.

The second question involves finding the best transformation ξ = ξ(θ) to induce normality, in the sense of minimizing the divergence measure between p(ξ) and its normal approximation. This is equivalent to finding ξ that minimizes the expected value

δ = ∫ p(ξ) log[p(ξ)/p₀(ξ)] dξ

where p₀(ξ) is the density function of the N[E(ξ), V(ξ)] distribution and p(ξ) = p(θ)|dθ/dξ|. This is a difficult problem for which the solution will be presented only for two particular distributions: the beta and the gamma.

1. If θ ~ beta(α, β) then ξ(θ) = log[θ/(1 − θ)] is the best transformation to induce normality and the approximating normal will have mean and variance given by

E(ξ) ≈ log[μ/(1 − μ)] = log(α/β),   where μ = α/(α + β),

and

V(ξ) ≈ 1/[μ(1 − μ)] × 1/(α + β + 1) = (α + β)²/[αβ(α + β + 1)].

2. If θ ~ G(α, β) then ξ(θ) = log θ is the best transformation and the mean and variance of the approximating normal distribution will be

E(ξ) ≈ log μ   and   V(ξ) ≈ 1/α,   where μ = α/β.

The delta method was used to get these results. It is worth pointing out that these transformations are exactly the same as those that appear in the exponential family for the Bernoulli and Poisson models, respectively. These are the observational models to which the beta and gamma are respectively conjugate.

Example. Suppose someone wants to elicit the hyperparameters of the conjugate prior for θ in a Poisson model. Assume that θ ~ G(α, β) and now suppose that its mode was assessed as 0.5 and also that P(θ < 0.25) = 0.05. Then (α − 1)/β = 0.5, or β = 2(α − 1), and

P(θ < 0.25) = P(ξ < log 0.25) ≈ Φ[(log 0.25 − E(ξ))/√V(ξ)]

where E(ξ) ≈ log[α/(2α − 2)] and V(ξ) ≈ 1/α. So β = 2α − 2 and 1.96 = [log 0.25 − log[α/(2α − 2)]]/√(1/α). Solving this implicit function for α gives α = 3.35 and β = 4.7.
5.3.3 Laplace approximation
This class of approximation methods is very useful to evaluate integrals of the type I = ∫ f(θ) dθ by rewriting them as

I = ∫ g(θ) exp[−n h(θ)] dθ,

where g: Rᵖ → R and h: Rᵖ → R are smooth functions which are at least three times differentiable. Let θ̂ be the value of θ which minimizes h. The Laplace method approximates I by

Î = g(θ̂) (2π/n)^{p/2} |Σ̂|^{1/2} exp[−n h(θ̂)],

where Σ̂ = [∂²h(θ̂)/∂θ ∂θ']⁻¹.

The Laplace approximation is based on Taylor approximations for h and g around θ̂. Only the univariate case will be presented here for ease of exposition, and Σ̂ will be denoted by σ̂². As in the last section, it will be supposed that θ and θ̂ are close.
Using a Taylor expansion up to the third order it follows that

n h(θ) = n h(θ̂) + (n/2σ̂²)(θ − θ̂)² + n t(θ) + o(n⁻¹),

where

t(θ) = (1/3!) [∂³h(θ̂)/∂θ³] (θ − θ̂)³.

Exponentiating the last expression and applying a linear expansion to exp(−n t(θ)), it follows that

exp[−n h(θ)] = exp[−n h(θ̂)] exp[−(n/2σ̂²)(θ − θ̂)²] [1 − n t(θ) + o(n⁻¹)] [1 + o(n⁻¹)].

The same Taylor series expansion around θ̂ can be applied to g, giving

g(θ) = g(θ̂) + [∂g(θ̂)/∂θ](θ − θ̂) + o(n⁻¹).
Recognizing that

1. ∫ exp[−(n/2σ̂²)(θ − θ̂)²] dθ = (2π)^{1/2}(σ̂²/n)^{1/2},
2. ∫ (θ − θ̂)^{2k+1} exp[−(n/2σ̂²)(θ − θ̂)²] dθ = 0, for every integer k, and
3. ∫ n t(θ)(θ − θ̂) exp[−(n/2σ̂²)(θ − θ̂)²] dθ = o(n⁻¹),

and applying the expression for I leads to a scalar version of the following proposition.

Proposition. When n → ∞, Î = I[1 + o(n⁻¹)].
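As a sanity check (ours, not from the book), the scalar Laplace formula can be tested on an integral with known value, I = ∫₀^∞ exp[−n(θ − log θ)] dθ = Γ(n + 1)/n^{n+1}:

```python
import math

# Scalar Laplace approximation of I = ∫ exp[-n h(θ)] dθ with h(θ) = θ - log θ:
# the minimum is at θ̂ = 1, where h(θ̂) = 1 and h''(θ̂) = 1, so σ̂² = 1 and
# Î = (2π/n)^{1/2} exp(-n).  The exact value is Γ(n+1)/n^{n+1}.
n = 30
sigma2_hat = 1.0  # inverse of h''(θ) = 1/θ², evaluated at θ̂ = 1
log_I_lap = 0.5 * math.log(2 * math.pi * sigma2_hat / n) - n
log_I_exact = math.lgamma(n + 1) - (n + 1) * math.log(n)
rel_error = math.exp(log_I_lap - log_I_exact) - 1  # roughly -1/(12n)
```

The relative error is of order n⁻¹, as the proposition states.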
In Bayesian applications, generally −n h(θ) = L(θ) + log p(θ), which is the expression of the posterior density but for the proportionality constant. If g(θ) is non-negative, the integral can be redefined as

I = ∫ exp[−n h*(θ)] dθ,  where n h*(θ) = n h(θ) − log g(θ).

Denoting by θ̂* the value of θ that minimizes h*, and by Σ̂* the inverse of the matrix of second derivatives of h* evaluated at θ̂* (σ̂*² in the scalar case), there follows an alternative approximation for I given by

Î = (2π/n)^{p/2} |Σ̂*|^{1/2} exp[−n h*(θ̂*)].
Following the same steps as before it is easy to see that Î = I[1 + o(n⁻¹)]. Tierney and Kadane (1986) proposed evaluating

E[g(θ)] = ∫ g(θ) exp[−n h(θ)] dθ / ∫ exp[−n h(θ)] dθ

by approximating the numerator and the denominator separately. They have shown that by doing so the o(n⁻¹) terms cancel out and an improved approximation of order o(n⁻²) is obtained. The final expression for their approximation can be obtained by combining the above results to give
" E[g(O)] Example. Let X 1,
..
a conjugate priorB,.....,.
=
ao
1 g(iJ')Ii:'l 12exp[-nh'(0')] "
/
·
,
I 1: 1 1 2 exp[-nh (0)]
. , X, be a random sample from a Pois(B) and suppose that
G(ao, bo)
E[Bix] where GJ
=
=
+ L�1=1.x;
':'. bt
and
is used. Taking g(e) =Bit follows that
(�"-1-) n,�I/Z a1- I
b
1
=
bo
�1 e ,
G]
>
J
+ n.
The exact posterior mean in this example is a1/b1, and so we can easily evaluate the relative error (Ê − E)/E involved in the approximation. For example, for a1 = 6 the relative error is 0.0028 and for a1 = 10 it is only 0.00097.
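This closed form is easy to verify numerically (our check, not the book's code):

```python
import math

# Tierney-Kadane approximation of E[θ|x] for a G(a1, b1) posterior with
# g(θ) = θ, compared with the exact posterior mean a1/b1.
def tk_posterior_mean(a1, b1):
    return (a1 / (a1 - 1)) ** (a1 - 0.5) * math.exp(-1) * a1 / b1

b1 = 4.0  # the relative error does not depend on b1
rel_err_6 = tk_posterior_mean(6, b1) / (6 / b1) - 1
rel_err_10 = tk_posterior_mean(10, b1) / (10 / b1) - 1
```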
Example (continued). Randomized response (Migon and Tachibana, 1997). A variation of the randomized response model consists in asking as the alternative question the negation of the original one.
In this case, θ_A = 1 − θ and the probability of a YES response is

λ = πθ + (1 − π)(1 − θ) = (2π − 1)θ + (1 − π).

The Laplace approximation for the evaluation of the posterior mean of θ can be applied. The derivatives and points of maxima can all be obtained analytically or numerically. With the same data of the example and with a unit uniform prior for θ, the exact posterior mode is 0.25. The posterior distribution of θ is a mixture of n + 1 beta distributions and the calculations become tedious as the sample size increases. The performance of the Laplace approximations can be assessed as a function of the sample size by keeping the YES proportion fixed at 60/150 = 0.40 and letting the sample size n vary. For this example, the exact and approximated posterior means were:
n      Exact    Laplace
50     0.28     0.27
150    0.251    0.255
450    0.251    0.251
It is worth noting that even though the exact posterior mean itself depends on the sample size, the Laplace approximation gets better as the value of n increases.
Example. Weibull data. Using the same data of the example presented in Section 5.2.1, we obtain the following estimates via Laplace methods with a non-informative prior:

α̂ = 1.59, V(α̂|x) = 0.064  and  β̂ = 2.05, V(β̂|x) = 0.079,

which compare well with the maximum likelihood results, as expected.
5.4 Numerical integration methods

Numerical integration, also called the quadrature technique, is a collection of methods convenient to solve some useful problems in inference, mainly when the dimension of the parameter space is moderate. They become useful as soon as the analytical solution fails.

Firstly we will take care of the general problem of obtaining the value of the integral I = ∫ₐᵇ f(x) dx, where f: R → R is a smooth function. Let w: R → R⁺ be a well-defined weight function. Quadrature methods are essentially approximations of I obtained by evaluating f at points xᵢ, i = 1, ..., n. The simplest solution is given by the weighted sum

Î = Σ_{i=1}^n wᵢ f(xᵢ),  where wᵢ = w(xᵢ), i = 1, ..., n.

The quadrature methods are fully characterized by choosing the points of evaluation, or nodes, x₁, ..., xₙ in the interval (a, b) and the corresponding weights involved. An integration rule must have easily obtainable weights and nodes. The nodes should lie in the region of integration and the weights should all be positive.
5.4.1 Newton–Cotes type methods

The interval of integration (a, b), with finite a and b, is divided into n equal parts, the function f(x) is evaluated at the middle point of each subinterval and the weights are then applied. Then

Î_NC = h Σ_{i=1}^n f(a + (2i − 1)h/2)

approximates I, with h = (b − a)/n. These methods are generically named Newton–Cotes rules. This is an approximation by the area of rectangles with equal base (b − a)/n. Often a good approximation is obtained with n of the order of 10², which seems reasonable for the unidimensional case.

A slight variation is the trapezoidal rule, involving unit weights except at the extremes of the interval, where they are set to 1/2. The rule gives the approximation

Î_T = h [f(a)/2 + Σ_{i=1}^{n−1} f(a + ih) + f(b)/2].

The Simpson rule is another variation, described by weights alternating between 4/3 and 2/3, except at the extremes where they assume the value 1/3. In this case the approximation is given by

Î_S = (h/3) [f(a) + 4 Σ_{i=1}^{n/2} f(a + (2i − 1)h) + 2 Σ_{i=1}^{n/2−1} f(a + 2ih) + f(b)],

with n even.
The p-dimensional case is slightly more demanding. A general solution follows from an iterative application of the cartesian product rule. Let x = (x₁, x₂) be a bidimensional vector. A quadrature in two dimensions is based on the cartesian product rule

∫ f(x) dx = ∫ [∫ f(x₁, x₂) dx₂] dx₁ = ∫ f₁(x₁) dx₁.

The last term on the right-hand side is obtained by integration with respect to x₂ using the unidimensional quadrature rule

∫ f₂(x₂) dx₂ ≅ Σ_{j=1}^m wⱼ f₂(x₂,ⱼ),  where f₂(x₂) = f(x₁, x₂).

As the last integral is also unidimensional it can be approximated again by quadrature, ∫ f₁(x₁) dx₁ ≅ Σ_{i=1}^n wᵢ f₁(x₁,ᵢ). Joining the two weights it follows that

∫ f(x) dx ≅ Σ_{i=1}^n Σ_{j=1}^m wᵢ wⱼ f(x₁,ᵢ, x₂,ⱼ).

This is a bidimensional rule based on n × m evaluation points (x₁,ᵢ, x₂,ⱼ) and weights (wᵢ, wⱼ), i = 1, ..., n, j = 1, ..., m.
In the sequel, a general integration method more adequate to statistical problems will be introduced. From now on we will be concerned with an integral over the whole real line.
5.4.2 Gauss–Hermite rules
The method introduced in this section is specifically useful for integration over the whole real line. If the domain of the integral is τ > 0, which sometimes happens when we are dealing with a precision or a scale parameter, then the reparametrization log τ is often convenient.

Assuming that the integrand f(x) can be expressed in the form g(x) exp(−x²), a general unidimensional rule is

∫ f(x) dx = ∫ g(x) e^{−x²} dx ≅ Σ_{i=1}^n hᵢ g(xᵢ),

where the xᵢ's are the zeros of a Hermite polynomial of degree n, Hₙ(x), and the weights hᵢ depend on n and on the Hermite polynomial H_{n−1}(x) evaluated at xᵢ, i = 1, ..., n. As long as g(x) is a polynomial of maximum degree 2n − 1, the formula is exact. There are other Gaussian integration rules that can be applied when the domain of integration is finite. The values of the zeros and weights are tabulated in many mathematical tables and can be found in Abramowitz and Stegun (1965), for example.

The accuracy of the approximation depends on f(x) being well approximated by a polynomial of degree 2n − 1 or less times a normal weight function. For the multivariate case some sort of parameter orthogonality must be guaranteed for application of the cartesian rule.

The integration rule introduced before can be extended to a more general approximating function, f(x) = g(x)(2πσ²)^{−1/2} exp{−0.5[(x − μ)/σ]²}. Making the variable transformation z = (x − μ)/√(2σ²), it follows that

∫ f(x) dx = π^{−1/2} ∫ g(√(2σ²) z + μ) e^{−z²} dz

and the Gauss–Hermite approximation will be

∫ f(x) dx ≅ π^{−1/2} Σ_{i=1}^n hᵢ g(√(2σ²) zᵢ + μ) = Σ_{i=1}^n mᵢ g(xᵢ),

where mᵢ = π^{−1/2} hᵢ and xᵢ = √(2σ²) zᵢ + μ, i = 1, ..., n.
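A small sketch of this rule (ours, using NumPy's Gauss–Hermite nodes, which are assumed to be available):

```python
import math
import numpy as np

# Gauss-Hermite evaluation of ∫ x² N(x; μ, σ²) dx = μ² + σ² using the rule
# ∫ f(x) dx ≅ π^{-1/2} Σ h_i g(√(2σ²) z_i + μ) with g(x) = x².
mu, sigma2 = 1.5, 4.0
z, h = np.polynomial.hermite.hermgauss(10)  # nodes/weights for ∫ g(z) e^{-z²} dz
x = np.sqrt(2 * sigma2) * z + mu
approx = float(np.sum(h * x ** 2) / math.sqrt(math.pi))
exact = mu ** 2 + sigma2
```

Since g here is a polynomial of degree 2 < 2n − 1, the rule is exact up to rounding.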
In general, the mean μ and variance σ² of the approximating normal are unknown. The following strategy, proposed by Naylor and Smith (1982), can be used:

1. Let μ₀ and σ₀² be initial values for μ and σ².
2. Apply the above expression with g(θ) = θ and g(θ) = θ², respectively. This will provide approximating values for the mean and variance.
3. Repeat the last step until the mean and variance values obtained are stable.

In the multiparameter case, the cartesian product rule is applied after some
orthogonalization is operated over the parameters. This can be done through a Cholesky decomposition of the (approximate) variance-covariance matrix Σ. Let Σ = H D H', where D is a diagonal matrix and H is a lower triangular matrix, and define the transformation z = D^{−1/2} H^{−1}(x − μ)/√2. Therefore

∫ f(x) dx = π^{−p/2} ∫ g(μ + √2 H D^{1/2} z) exp(−z'z) dz
≅ π^{−p/2} Σ_{i₁=1}^{n₁} ... Σ_{i_p=1}^{n_p} h_{i₁} ... h_{i_p} g(μ + √2 H D^{1/2} z),

where nⱼ is the number of nodes involved in the approximation at the jth coordinate and z = (z_{i₁}, ..., z_{i_p}), with iⱼ = 1, ..., nⱼ, j = 1, ..., p.

If the mean vector μ and variance-covariance matrix Σ of the approximating normal density are unknown, an extension of the iterative method of Naylor and Smith presented before can be applied.
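A one-dimensional sketch of this iterative strategy (our illustration; the target below, the log of a G(3, 1) variable, is an assumed example, not the book's):

```python
import numpy as np

# Iterative Gauss-Hermite in the spirit of Naylor and Smith: moments of the
# unnormalized density f(x) = exp(3x - e^x) (x = log θ with θ ~ G(3, 1)) are
# computed under the current μ, σ², which are then updated until stable.
# Exact values: mean ψ(3) ≈ 0.9228 and variance ψ'(3) ≈ 0.3949.
def f(x):
    return np.exp(3 * x - np.exp(x))

z, h = np.polynomial.hermite.hermgauss(20)
mu, sigma2 = 0.0, 1.0  # rough starting values
for _ in range(50):
    x = np.sqrt(2 * sigma2) * z + mu
    # ∫ t(x) f(x) dx ≅ √(2σ²) Σ h_i e^{z_i²} t(x_i) f(x_i)
    w = np.sqrt(2 * sigma2) * h * np.exp(z ** 2) * f(x)
    m0, m1, m2 = w.sum(), (w * x).sum(), (w * x * x).sum()
    mu, sigma2 = m1 / m0, m2 / m0 - (m1 / m0) ** 2
```

The iteration stabilizes quickly here because the target is smooth and unimodal; in harder problems the number of nodes and the starting values matter more.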
5.5 Simulation methods

In this section we will be discussing a collection of techniques useful to solve many of the relevant statistical problems. From a classical point of view we are interested in simulating a sampling distribution arising from a possibly complex model, and in the Bayesian case we are interested in obtaining some characteristics of the posterior distribution. All the techniques described in this section share a common characteristic: they involve the random generation of samples from a distribution of interest. This is the empirical distribution in the classical approach and the posterior distribution in the Bayesian case. The interested reader can complement their study by reading the books by Efron (1982), Ripley (1987) and Gamerman (1997).

5.5.1 Monte Carlo method
The basic idea of the Monte Carlo method consists in writing the desired integral as an expected value with respect to some probability distribution. To motivate our discussion we will begin with a very simple problem. Assume we wish to calculate the integral of a smooth function g over a known interval (a, b), that is,

I = ∫ₐᵇ g(θ) dθ.

The above integral can be rewritten as

I = ∫ₐᵇ [(b − a) g(θ)] [1/(b − a)] dθ.

This problem can be thought of as the evaluation of the expectation of (b − a)g(θ) with respect to the uniform distribution over (a, b), and

I = E_{U(a,b)}[(b − a) g(θ)],

where U(a, b) represents the uniform distribution on (a, b). A method of moments estimator of this quantity is

Î = (1/n) Σ_{i=1}^n (b − a) g(θᵢ),

where θ₁, ..., θₙ is a random sample selected from the uniform distribution on (a, b). An algorithm can be described by the following steps:

1. generate θ₁, θ₂, ..., θₙ from a U(a, b) distribution;
2. calculate g(θ₁), g(θ₂), ..., g(θₙ);
3. obtain the sample mean ḡ = (1/n) Σ_{i=1}^n g(θᵢ);
4. finally, determine Î = (b − a) ḡ.
A generalization can be obtained straightforwardly. Let I = E_p[g(θ)] be the expected value of g(θ) with respect to a distribution with density p(θ). The algorithm is similar to that described above, with the sampling in step 1 modified from the U(a, b) to p(·).

The multivariate extension is based on evaluation of

I = ∫_{a₁}^{b₁} ... ∫_{a_p}^{b_p} g(θ) dθ

and the Monte Carlo estimator is

Î = (1/n) Σ_{i=1}^n g(θᵢ),

where θ₁, ..., θₙ is a random sample selected from the uniform distribution on (a₁, b₁) × ... × (a_p, b_p).

Some questions need to be answered to implement these techniques. How large must n be? How do Monte Carlo methods compare with quadrature rules? Or, to pose it in another way: when is Monte Carlo preferred to numerical integration? An elementary example will be useful to motivate further developments.
Example. The evaluation of I = ∫₀¹ eˣ dx is desired. As we have just seen, the simple Monte Carlo estimator of I is Î = (1/n) Σ_{j=1}^n exp(Xⱼ), Xⱼ ~ U(0, 1), j = 1, ..., n. Since Î is a sample mean, its precision in estimating I can be measured by its variance. It is given by

V(Î) = (1/n) V(eˣ) = (1/n) (∫₀¹ e²ˣ dx − I²) = 0.242/n.
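A short sketch of the simple Monte Carlo estimator for this example (ours):

```python
import math
import random
import statistics

# Simple Monte Carlo for I = ∫₀¹ eˣ dx = e - 1, whose one-draw variance is 0.242.
random.seed(1)
n = 100_000
draws = [math.exp(random.random()) for _ in range(n)]
I_hat = statistics.fmean(draws)
var_one_draw = statistics.pvariance(draws)  # V(Î) is then var_one_draw / n
```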
5.5.2 Monte Carlo with importance sampling
The Monte Carlo method with importance sampling is a technique developed to reduce the variance of the estimator. Consider now explicitly that the integral I of interest is the expectation of a given function g with respect to a density p. It can then be rewritten as

I = ∫ g(x) p(x) dx = ∫ g(x) [p(x)/h(x)] h(x) dx,

where h(x) is a positive function for all x where p(x) > 0 and ∫ h(x) dx = 1. It is therefore a density. An alternative method of moments estimator for I can be obtained as

Î = (1/n) Σ_{i=1}^n g(Xᵢ) w(Xᵢ),  where w(Xᵢ) = p(Xᵢ)/h(Xᵢ) and Xᵢ ~ h(x), i = 1, ..., n,

where h is called the importance sampling density. The only differences with respect to simple Monte Carlo are in the first step, where sampling from the uniform distribution is replaced by sampling from h, and in the third step, where the values of g × w instead of the values of g are averaged. Therefore, V(Î) = (1/n) ∫ [g(x)w(x) − I]² h(x) dx. Choosing g(x)w(x) approximately constant can make V(Î) as small as we want. Therefore, whenever possible the importance sampling density should be roughly proportional to g(x)p(x).

The multivariate extension is again trivial. Assume we wish to obtain the value of the expectation of g with respect to p given by

I = ∫ g(x) p(x) dx,
and use the multivariate density h as the importance density. Once again, h must be positive whenever p is. The Monte Carlo estimator is given by
Î = (1/n) Σ_{i=1}^n g(Xᵢ) w(Xᵢ),

where w(Xᵢ) = p(Xᵢ)/h(Xᵢ) and Xᵢ ~ h(x), i = 1, ..., n.
Example (continued). If we take as importance sampling density h(x) = 2(1 + x)/3, x ∈ (0, 1), then w(x) = p(x)/h(x) = 3/[2(1 + x)] and g(x)w(x) = 3eˣ/[2(1 + x)], for x ∈ (0, 1). Then

I = ∫₀¹ eˣ [3/(2(1 + x))] [2(1 + x)/3] dx = ∫₀¹ {3eˣ/[2(1 + x)]} h(x) dx.

Therefore, Î = (1/n) Σ_{i=1}^n 3 e^{Xᵢ}/[2(1 + Xᵢ)], where Xᵢ ~ h(x), i = 1, ..., n, and

V(Î) = (1/n) [(3/2) ∫₀¹ e²ˣ/(1 + x) dx − I²] = 0.027/n.

The variance reduction is quite substantial: it was reduced to approximately one tenth of the value obtained with a simple Monte Carlo using the same number of replications.

The implementation of the algorithm in this case depends on sampling from the density h. The importance sampling distribution function is given by

H(x) = 0 if x < 0;  (2/3)(x + x²/2) if 0 ≤ x < 1;  1 if x ≥ 1.

Using the probability integral transform, one simply has to generate a unit uniform random variable U and solve U = (2/3)(X + X²/2). The unique solution satisfying the equation is X = [(4 + 12U)^{1/2} − 2]/2.
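A numerical check (ours) of the importance sampling estimator and its variance:

```python
import math
import random
import statistics

# Importance sampling for I = ∫₀¹ eˣ dx with h(x) = 2(1+x)/3 on (0, 1):
# draws by inversion, X = [(4 + 12U)^{1/2} - 2]/2, and g(x)w(x) = 3eˣ/[2(1+x)].
random.seed(2)
n = 100_000
xs = [(math.sqrt(4 + 12 * random.random()) - 2) / 2 for _ in range(n)]
gw = [3 * math.exp(x) / (2 * (1 + x)) for x in xs]
I_hat = statistics.fmean(gw)
var_one_draw = statistics.pvariance(gw)  # close to 0.027, versus 0.242 before
```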
Example. Let θ = P[X > 2] where X has a standard Cauchy distribution with density

p(x) = 1/[π(1 + x²)],  x ∈ R.

In this example, g(x) = I_x[(2, ∞)] and θ = ∫_{−∞}^∞ g(x) p(x) dx. Let X₁, ..., Xₙ be a random sample from the Cauchy distribution. It is easy to obtain that

θ̂ = (1/n) Σ_{i=1}^n I_{Xᵢ}[(2, ∞)] = #(Xᵢ > 2)/n

and nθ̂ ~ bin(n, θ). Note that P(x) = ∫_{−∞}^x p(t) dt = 0.5 + π⁻¹ arctan x, so that θ = 1 − P(2) = 0.1476 and V(θ̂) = θ(1 − θ)/n = 0.126/n.

Let h be the density defined by h(x) = (2/x²) I_x[(2, ∞)]. It is easy to obtain that the distribution function is H(x) = 1 − 2/x, for x ≥ 2. For any U ~ U(0, 1) it follows that X = 2/(1 − U) has density h, and a sample X₁, ..., Xₙ from h can easily be obtained. The importance sampling estimator of θ is given by

θ̃ = (1/n) Σ_{i=1}^n w(Xᵢ),  where w(x) = (1/2π) x²/(1 + x²).
It can be shown that the variance of this estimator is smaller than that of θ̂ (see Exercise 5.16).
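A numerical check of this example (ours):

```python
import math
import random
import statistics

# θ = P[X > 2] for a standard Cauchy: crude indicator estimator versus
# importance sampling from h(x) = 2/x² on (2, ∞), with w(x) = x²/[2π(1 + x²)].
random.seed(3)
n = 100_000
theta = 0.5 - math.atan(2) / math.pi  # 0.1476

crude = [1.0 if math.tan(math.pi * (random.random() - 0.5)) > 2 else 0.0
         for _ in range(n)]
xs = [2 / (1 - random.random()) for _ in range(n)]  # draws from h by inversion
w = [x * x / (2 * math.pi * (1 + x * x)) for x in xs]

theta_crude = statistics.fmean(crude)
theta_is = statistics.fmean(w)
var_crude = statistics.pvariance(crude)  # about θ(1-θ) ≈ 0.126
var_is = statistics.pvariance(w)         # orders of magnitude smaller
```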
Therefore, the Monte Carlo algorithm can be used to solve any of the basic inference problems cited in the introduction of this chapter that can be written as an expectation. In the Bayesian case, when one wants to evaluate E[g(θ)|x], the algorithm can be summarized as follows:

1. Generate θ₁, ..., θₙ from the posterior density p(θ|x) (or from the importance density h(θ)).
2. Calculate gᵢ = g(θᵢ) (or gᵢ = g(θᵢ) p(θᵢ|x)/h(θᵢ)), i = 1, ..., n.
3. Obtain the estimator Ê[g(θ)] = (1/n) Σ_{i=1}^n gᵢ.
5.5.3 Resampling methods

In this section we will be concerned with some sampling and resampling techniques from a classical and a Bayesian point of view. Firstly, the classical bootstrap, which essentially consists in resampling from the empirical distribution function, will be presented. A weighted version of the bootstrap will be useful to implement the Bayesian argument. Intuitively the argument goes as follows: a sample is generated from the prior distribution and a resample is taken using some well-defined weights. It is not difficult to show that the points in the resample constitute an approximate sample from the posterior distribution. This approximation becomes better as sample sizes increase. Another classical resampling technique, named the jackknife, will be presented and exemplified. Its Bayesian version will be developed for the exponential family. A general account of these resampling methods from a classical perspective is provided by Davison and Hinkley (1997) and Efron (1982).

The main objective of the jackknife and the bootstrap is to obtain a measure of the accuracy of some complex statistics. From a classical point of view the question arises from the fact that the sampling distribution is hard to determine in some cases. As examples we can mention robust statistics, like trimmed means, the correlation coefficient, concordance measures in probabilistic classification and so on. On the Bayesian side the interest in techniques like the jackknife and the bootstrap is slightly different. One important application of leave-one-out methods is to obtain information about the influence of particular observations or, more generally, to define diagnostic measures. One may also use the bootstrap as a resampling technique useful to implement the Bayesian paradigm, as will be shown later in this section.

This class of procedures is characterized by its computational demand, although they are often very easy to implement even in complex situations or high-dimensional problems.
Jackknife

The jackknife is a useful technique to build confidence intervals, and it also works as a bias reduction technique, as will be shown in an example. It is a generic tool, and so in specific problems it may provide less accurate results. The basic idea of splitting the sample was introduced by Quenouille (1949, 1956) to eliminate estimation bias.

Suppose that X₁, ..., Xₙ is a random sample from p(x|θ) and that θ̂(X) is an estimator of θ. Denote by θ̂ᵢ the estimator based on the original sample without the ith observation. Let θ̄ᵢ = nθ̂ − (n − 1)θ̂ᵢ be a sequence of pseudo-values and define the jackknife estimator of θ as

θ̂_J = (1/n) Σ_{i=1}^n θ̄ᵢ.

The name pseudo-value derives from the fact that for the special case where θ̂(X) = X̄, the pseudo-value coincides with the ith observation, that is, θ̄ᵢ = Σ_{j=1}^n Xⱼ − Σ_{j≠i} Xⱼ = Xᵢ. It is not difficult to show that θ̂_J is unbiased for θ if θ̂ and the θ̂ᵢ are also unbiased. Besides that, the jackknife estimator has the property of eliminating terms of order 1/n in the bias of the estimator.
Example. Let X₁, ..., Xₙ be iid observations from the uniform distribution on (0, θ). It is well known that T = maxᵢ Xᵢ is a sufficient statistic for θ, with E(T) = [1 − 1/(n + 1)]θ, for all θ. A jackknife estimator is based on the pseudo-values

θ̄ᵢ = nT − (n − 1)θ̂ᵢ,  where θ̂ᵢ = max{X₁, ..., Xᵢ₋₁, Xᵢ₊₁, ..., Xₙ}.

Since E(θ̂ᵢ) = [1 − 1/n]θ, it follows that E(θ̂_J) = n[1 − 1/(n + 1)]θ − (n − 1)[1 − 1/n]θ = [1 − 1/(n(n + 1))]θ, so the O(1/n) term in the bias is eliminated.
Let θ̄ᵢ, i = 1, ..., n, represent random variables approximately independent and identically distributed with mean θ. A jackknife estimate of the variance of θ̂_J will be given by

S²_J = [1/(n(n − 1))] Σ_{i=1}^n (θ̄ᵢ − θ̂_J)²,

and therefore the statistic (θ̂_J − θ)/S_J is approximately distributed as a Student t.
Example. Correlation coefficient. A very popular data set in statistics was given by Fisher (1936) and contains measurements on three species of iris. We concentrate on the sepal length and sepal width of the species Iris setosa. The correlation coefficient of length and width is calculated in a sample of 50 observations. The sample correlation coefficient is estimated as ρ̂ = 0.742. Using the jackknife methodology we obtain the point estimate 0.743, and the 95% confidence interval based on the Student t distribution was (0.63, 0.84).
From a Bayesian perspective the notion of jackknife corresponds to obtaining the posterior or predictive distribution leaving one of the observations out, and is very useful for model checking. For example, the influence of an observation can be assessed by this procedure. Using a divergence measure like the Kullback–Leibler divergence, the posterior or predictive distribution based on the whole sample can be compared with that leaving one observation out, to evaluate its influence in the analysis.

For the exponential family with one parameter, defined in Chapter 2, the conjugate analysis leads to the posterior density

p(θ|x) ∝ exp{α₁φ(θ) + β₁b(θ)},

where α₁ = α + T(x), β₁ = β + n and T(x) = Σ_{i=1}^n u(xᵢ), and to the predictive density p(y|x) = a(y) k(α₁, β₁)/k(α₁ + u(y), β₁ + 1). So, leaving the ith observation out gives the posterior density

p(θ|x₍ᵢ₎) ∝ exp{α₁'φ(θ) + β₁'b(θ)},

where α₁' = α + T(x₍ᵢ₎), β₁' = β + n − 1 and T(x₍ᵢ₎) = Σ_{j≠i} u(xⱼ).
Bootstrap

The concept of the bootstrap was introduced by Efron (1979). The method consists in generating a large number of samples based on the empirical distribution obtained from the original sampled data. Confidence intervals with some pre-specified coverage probability can be built up easily under mild assumptions.

Let X₁, ..., Xₙ be the observed data from a random sample of a distribution p(x|θ), where θ ∈ Θ is the unknown parameter, and let θ̂(x) be an estimator of θ. The empirical distribution function is defined by Fₙ(x) = (1/n)#(Xᵢ ≤ x), for all x ∈ R, as seen in Chapter 4. A resampling procedure consists of the selection of samples with replacement from a finite population using equal probabilities. This corresponds to selecting a sample from the empirical distribution Fₙ(x). These sampled values will be denoted by {X₁*, ..., Xₙ*} and the bootstrap estimator of θ by θ̂*(x*).

The inference will be based on B replications of the above procedure and on the evaluation of the statistic of interest, in this case the estimator θ̂* = θ̂*(x*), for each of the B replications. Denote the resulting values by θ̂₁*, ..., θ̂_B*. The bootstrap distribution of θ̂* is given by the empirical distribution formed by the resampled values. Summarizing, the bootstrap distribution of θ̂* is used in place of the sampling distribution of θ̂ in order to make inferences about θ.

A central assumption in the method is that Fₙ is a good approximation of F, that is, the bootstrap distribution of θ̂* is similar to that of θ̂ or, equivalently, that the distribution of θ̂* − θ̃ is similar to that of θ̂ − θ, where θ̃ is the value of the parameter under Fₙ. The mean and variance of the B replications will be denoted by θ̄* = (1/B) Σ_{b=1}^B θ̂_b* and σ̂²(θ̂*) = [1/(B − 1)] Σ_{b=1}^B (θ̂_b* − θ̄*)². From the above suppositions it follows that V(θ̂) ≅ V(θ̂*) ≅ σ̂²(θ̂*) and E[θ̂ − θ] ≅ θ̄* − θ̂. A bias-adjusted estimate of θ will then be θ̂ − [θ̄* − θ̂].

Confidence intervals for θ can be built from the percentiles of the bootstrap distribution. Let θ*(α) be the 100α% percentile of the bootstrap distribution of θ̂*, that is, P̂[θ̂* ≤ θ*(α)] = α. The interval (θ*(α/2), θ*(1 − α/2)) obtained as described before is named the 100(1 − α)% bootstrap confidence interval.
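A minimal sketch of the percentile bootstrap (ours; the data are made up for illustration):

```python
import random
import statistics

# Percentile bootstrap for the mean of a small (made-up) sample.
random.seed(5)
data = [2.1, 3.4, 1.9, 4.0, 2.8, 3.1, 2.5, 3.7, 2.2, 3.0]
B = 4000
boot = sorted(statistics.fmean(random.choices(data, k=len(data)))
              for _ in range(B))
theta_hat = statistics.fmean(data)
se_boot = statistics.stdev(boot)                      # bootstrap standard error
ci_95 = (boot[int(0.025 * B)], boot[int(0.975 * B)])  # percentile interval
```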
Example (continued). Returning to the evaluation of the correlation between length and width of samples of Iris setosa and applying the bootstrap with different values of B gives the following results:

B      2.5%   Mean    97.5%
100    0.64   0.736   0.81
400    0.71   0.741   0.82
1600   0.71   0.742   0.82

It seems that B = 400 is a reasonable number of replications to accurately describe the bootstrap distribution. It is interesting to note that this is the number of replications usually recommended in the literature. The point estimates for the bootstrap and jackknife are almost the same, although the confidence intervals are shorter for the bootstrap.
Weighted bootstrap

Sometimes we are not able to sample directly from the distribution of interest, p(x). A useful strategy is to sample from an approximation of this distribution and use the accept-reject scheme:

1. Generate x from an auxiliary density h(x).
2. Generate u independently from a uniform distribution on (0, 1).
3. If u ≤ p(x)/[A h(x)], where A = max_x p(x)/h(x), accept x; otherwise return to step 1.

The probability of accepting a value x generated from h(x) is

P(accept x) = ∫∫ I_u(0, p(x)/[A h(x)]) h(x) dx du = ∫ {p(x)/[A h(x)]} h(x) dx = 1/A.
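A minimal sketch of the accept-reject scheme (ours), with target Beta(2, 2), p(x) = 6x(1 − x), and auxiliary density U(0, 1), so that A = 1.5:

```python
import random

# Accept-reject sampling of Beta(2, 2), p(x) = 6x(1 - x), from h = U(0, 1).
# Here A = max p(x)/h(x) = 1.5, so the acceptance rate should be 1/A = 2/3.
random.seed(6)
A = 1.5
accepted, tries = [], 0
while len(accepted) < 20_000:
    x, u = random.random(), random.random()
    tries += 1
    if u <= 6 * x * (1 - x) / A:
        accepted.append(x)
rate = len(accepted) / tries          # close to 2/3
mean = sum(accepted) / len(accepted)  # Beta(2, 2) has mean 1/2
```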
The expected number of accepted values in n independent runs of the algorithm will be n/A. So, the algorithm is improved by decreasing the value of A as much as possible. If the determination of A is difficult, the following modification of the algorithm can be applied:

1. Take a sample x₁, ..., xₙ from h(x).
2. Evaluate the weights w(xᵢ) = p(xᵢ)/h(xᵢ), i = 1, ..., n.
3. Select a new sample x₁*, x₂*, ..., x_m*, with replacement, from the set {x₁, ..., xₙ} with respective probabilities given by wᵢ/Σ_{j=1}^n wⱼ, i = 1, ..., n.

Note that

P(x* ≤ a) = Σ_{i=1}^n [wᵢ/Σ_{j=1}^n wⱼ] I_{xᵢ}(−∞, a].

Taking the limit as n → ∞,

Σ_{i=1}^n [wᵢ/Σ_{j=1}^n wⱼ] I_{xᵢ}(−∞, a] → ∫_{−∞}^a p(x) dx.
It is interesting to note that the algorithm allows approximate sampling from p(x) even when p is known only up to an arbitrary constant. This is particularly useful for Bayesian inference, where in many cases the proportionality constant of the posterior distribution is not known. The above algorithm is known in the literature as the weighted bootstrap.

As before, many questions deserve consideration in applications. For example, how big must n, the initial sample size, be? And m, the resampling size? Is this approximation efficient? Note that if values of x were not generated in some regions then these values will never be resampled, even if the weights there were large.

We shall concentrate here on a modification of the algorithm to solve Bayes' theorem numerically. Remember that

p(θ|x) = k p(θ) l(θ; x),  θ ∈ Θ,

where p(θ) is the prior distribution, l(θ; x) is the likelihood function and x denotes the available data. Taking h(θ) = p(θ) in the algorithm gives

w(θ) = p(θ|x)/p(θ) = k l(θ; x).

Therefore, the algorithm simplifies to:

1. Take a sample θ₁, ..., θₙ from the prior distribution p(θ).
2. Evaluate the weights wᵢ = p(θᵢ|x)/p(θᵢ) = k l(θᵢ; x), i = 1, ..., n.
3. Sample θ₁*, θ₂*, ..., θ_m*, with replacement, from the finite population {θ₁, ..., θₙ} with respective weights lᵢ/Σ_{j=1}^n lⱼ, where lᵢ = l(θᵢ; x), i = 1, ..., n.
Fig. 5.2 Summary of inference for β in the example: (a) initial sample; (b) final resample.

It is worth noting that it is often necessary to make some adjustments to take care of numerical difficulties. One must make sure to sample over the relevant region of the parameter space. It may well be that the prior is concentrated over a region of low posterior probability. In this case, the prior is not a suitable candidate for initial sampling and other distributions must be used.
a2), where Jl.i f3xi and a2 I, i 1, . .. , 5. (-2. 0, 0, 0, 2) and x = (-2, -I, 0, I, 2). Using prior N(O, 4) for fJ and simulating n 1000 samples and
Example. Let Yi "' N(J.Li,
=
=
=
The observed data is y = a reasonably vague
m
=
=
500 resamples we can easily obtain numerical summaries. The estimated
mean is 0.80 1 and the estimated variance is 0.087. The exact MLE in this example is
{3
=
0.8, which is in complete agreement with the Bayesian bootstrap resample
depicted in Figure 5.2.
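A sketch of this example via the weighted bootstrap (ours; the sample sizes were increased a little for stability):

```python
import math
import random
import statistics

# Weighted bootstrap (sample from the prior, resample with likelihood weights)
# for y_i ~ N(beta * x_i, 1) with prior beta ~ N(0, 4).
random.seed(7)
x = [-2, -1, 0, 1, 2]
y = [-2, 0, 0, 0, 2]

def log_lik(beta):
    return -0.5 * sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))

prior = [random.gauss(0, 2) for _ in range(5000)]
weights = [math.exp(log_lik(b)) for b in prior]
resample = random.choices(prior, weights=weights, k=2000)
post_mean = statistics.fmean(resample)     # exact posterior mean is 8/10.25
post_var = statistics.pvariance(resample)  # exact posterior variance is 1/10.25
```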
5.5.4 Markov chain Monte Carlo methods

The central idea behind Markov chain Monte Carlo (MCMC, in short) methods is to build up a Markov chain that is easy to simulate and has equilibrium distribution given by the distribution of interest. These techniques are often more powerful than quadrature rules and simple Monte Carlo because they can be successfully applied to high-dimensional problems. A general discussion of this topic can be found in Gamerman (1997).
Let X₁, ..., X_p have joint density p(x) = p(x₁, ..., x_p) defined on the space X ⊂ Rᵖ. In fact, the derivations below are also valid for the more general case where the Xᵢ's are vector variables. Suppose that a homogeneous, irreducible and aperiodic Markov chain with state-space X and equilibrium distribution p(x) can be constructed, and denote by q(x, y) the transition kernel of the chain, which means that q(x, ·) defines a conditional distribution governing the transitions from state x. In other words, it is possible to build a chain with transition probabilities invariant in time, where each state can be reached from any other state in a finite number of iterations and without absorbing states. Assume further that it is easy to generate values from these transition probabilities. This means that for any given initial state, a trajectory from the chain can be generated. For a sufficiently large number of iterations, this trajectory will eventually produce draws from the equilibrium distribution p(x). By constructing a suitable Markov chain, one is able to perform a Monte Carlo simulation of values from p, hence the name MCMC.
There are many possible ways to construct such a chain. One scheme is provided by the Gibbs sampler algorithm, proposed by Geman and Geman (1984) and popularized to the statistical community by Gelfand and Smith (1990). Let p_i(x_i | x_{-i}) denote the conditional density function of X_i given values of all the other X_j's (j ≠ i) and assume that it is possible to generate from these distributions for each i = 1, ..., p. The algorithm starts by arbitrarily choosing initial values x^(0) = (x_1^(0), ..., x_p^(0)). If at the jth iteration the chain is at state x^(j), then the position of the chain at iteration j + 1 will be denoted by x^(j+1) and will be given after

• generating a random quantity x_1^(j+1) from p_1(x_1 | x_{-1} = (x_2^(j), ..., x_p^(j)));

• generating a random quantity x_2^(j+1) from p_2(x_2 | x_{-2} = (x_1^(j+1), x_3^(j), ..., x_p^(j)));

• successively repeating the procedure for i = 3, ..., p, where at the last step a random quantity x_p^(j+1) is generated from p_p(x_p | x_{-p} = (x_1^(j+1), ..., x_{p-1}^(j+1))).

This way, a vector x^(j+1) = (x_1^(j+1), ..., x_p^(j+1)) is formed. Under suitable regularity conditions, the limiting distribution of x^(j), as j → ∞, is just p(x).
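The cycle above can be sketched in code. The following illustrative example (ours, not the book's) uses a bivariate normal target with correlation ρ = 0.8, whose full conditionals are the univariate normals X_1 | X_2 = x_2 ~ N(ρx_2, 1 − ρ²) and, symmetrically, X_2 | X_1 = x_1 ~ N(ρx_1, 1 − ρ²); the function name and settings are our own choices.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for a bivariate normal with zero means, unit
    variances and correlation rho; each full conditional is
    X1 | X2 = x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for X2."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)        # conditional standard deviation
    x1, x2 = 0.0, 0.0                   # arbitrary starting values x^(0)
    draws = np.empty((n_iter, 2))
    for j in range(n_iter):
        x1 = rng.normal(rho * x2, sd)   # draw from p1(x1 | x2)
        x2 = rng.normal(rho * x1, sd)   # draw from p2(x2 | x1)
        draws[j] = (x1, x2)
    return draws

draws = gibbs_bivariate_normal(rho=0.8, n_iter=20000)
post = draws[2000:]                     # discard an initial stretch (burn-in)
```

After the initial iterations are discarded, the retained pairs behave approximately as draws from the joint target, as can be checked from their sample moments.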
Another scheme is provided by the Metropolis–Hastings algorithm, initially proposed by Metropolis et al. (1953) and later extended by Hastings (1970). A clear introductory explanation of the algorithm is presented by Chib and Greenberg (1995). It is based on the same idea of using an auxiliary distribution, previously used for importance sampling, accept–reject schemes and weighted bootstrap. Let q*(x, ·) denote an arbitrary transition kernel and assume that at iteration j the chain is at state x^(j). The move to iteration j + 1 is made by

• proposing a move to x* according to q*(x^(j), ·);

• accepting the proposed move with probability

  α(x^(j), x*) = min{ 1, [p(x*)/q*(x^(j), x*)] / [p(x^(j))/q*(x*, x^(j))] },

thus setting x^(j+1) = x*, or rejecting the move with probability 1 − α(x^(j), x*), thus setting x^(j+1) = x^(j).

It is not difficult to show that the Metropolis–Hastings chain has equilibrium distribution given by p(x). Note that the move is made in block for all model parameters. In practice, with high-dimensional models it is very difficult to find suitable kernels q* for such spaces that ensure large enough acceptance probabilities. A commonly used variation of the algorithm incorporates the blocking strategy used in the Gibbs sampler and performs moves componentwise by defining transition kernels q_i*, for i = 1, ..., p. A transition is then completed after cycling through all p components of x, and the generations of the components are made according to the Metropolis–Hastings scheme.
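As an illustration (ours, not from the book), consider the common random-walk variant with a symmetric proposal, for which q*(x, y) = q*(y, x) and the acceptance probability reduces to min{1, p(x*)/p(x^(j))}; note that p need only be known up to the normalizing constant.

```python
import numpy as np

def metropolis(log_p, x0, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis: propose x* = x + step*N(0,1) (symmetric
    kernel, so q* cancels in the acceptance ratio) and accept with
    probability min{1, p(x*)/p(x)}."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    out = np.empty(n_iter)
    for j in range(n_iter):
        prop = x + step * rng.normal()
        # compare on the log scale; log_p is the log target up to a constant
        if np.log(rng.uniform()) < log_p(prop) - log_p(x):
            x = prop                  # accept the proposed move
        out[j] = x                    # otherwise stay at the current state
    return out

# target: standard normal, known only up to the normalizing constant
chain = metropolis(lambda x: -0.5 * x * x, x0=5.0, n_iter=20000, step=2.0)
```

The deliberately bad starting value x0 = 5 is forgotten after a short initial stretch, after which the draws track the N(0, 1) target.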
Whatever the scheme used to generate the chain, a stream of values x^(1), x^(2), ... is formed. Although consecutive values x^(j) and x^(j+1) are correlated, a random sample of size n from p(x) can be formed by retaining n successive values after convergence has been ascertained. If approximately independent observations are required, one might hold only the observations lagged by l units, for example x^(m), x^(m+l), ..., x^(m+(n−1)l), where m is large enough to ensure convergence has been achieved and l is large enough to carry only residual correlation over the chain. This will be a random sample of n approximately iid elements of the joint distribution p(x). This sample is valid for any positive value of m and l; in particular, l = 1 is a common choice since independence is not really required.

After generating a large random sample, the inference about each X_i can be done as in any Monte Carlo method. For example, the mean of the ith component of X is estimated by (1/n) Σ_{k=0}^{n−1} x_i^(m+kl). This idea can be applied in the Bayesian context to obtain a sample from the posterior distribution of a parametric vector θ, or in the frequentist context to obtain a sample from the sampling distribution of an estimator or of a test statistic T(X).
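The estimation step just described can be sketched as follows (the helper name, and the AR(1) chain standing in for generic MCMC output, are our own illustrative choices):

```python
import numpy as np

def mcmc_estimate(chain, g, burn, lag=1):
    """Estimate E_p[g(X)] from an MCMC stream: drop the first `burn`
    draws (the value m in the text) and keep every `lag`-th one (l)."""
    kept = chain[burn::lag]
    return np.mean([g(v) for v in kept])

# stand-in for MCMC output: an AR(1) chain with stationary mean 0
rng = np.random.default_rng(0)
v, vals = 10.0, []                 # deliberately bad starting value
for _ in range(20000):
    v = 0.9 * v + rng.normal()     # stationary distribution N(0, 1/(1 - 0.81))
    vals.append(v)
chain = np.array(vals)
est = mcmc_estimate(chain, g=lambda t: t, burn=1000, lag=5)
```

With burn = m and lag = l, `kept` contains exactly the draws x^(m), x^(m+l), ... entering the average above.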
Nevertheless, most of the work and applications in the area are geared towards the Bayesian approach. Estimates using the known conditional distributions can also be obtained. In the case of the mean of X_i, the Rao–Blackwellized estimator of E_p(X_i) is

Ê_p(X_i) = (1/n) Σ_{k=0}^{n−1} E_p(X_i | x_{-i}^(m+kl)).

This estimator of E_p(X_i) is usually better than X̄_i, which is based only on the generated values. The improvement is justified by a more efficient use of the (probabilistic) information available. A very similar idea was used in the Rao–Blackwell theorem (see Section 4.3.2) to prove that conditioning on sufficient statistics improves the estimator, hence the name. It is worth pointing out that inferences about any quantity related to the X's are easily done. For example, the mean value of g(X), given by E_p[g(X)], is estimated as (1/n) Σ_{k=0}^{n−1} g(x^(m+kl)), where the x^(j) are the values generated from the chain.
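To see why conditioning helps, consider again a bivariate normal target with correlation ρ sampled by Gibbs, where E_p(X_1 | X_2 = x_2) = ρx_2 is available in closed form. A sketch (the setting and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n_iter, burn = 0.8, 20000, 2000
sd = np.sqrt(1.0 - rho ** 2)
x1 = x2 = 0.0
naive, rb = [], []
for j in range(n_iter):
    x1 = rng.normal(rho * x2, sd)       # Gibbs step for X1
    x2 = rng.normal(rho * x1, sd)       # Gibbs step for X2
    if j >= burn:
        naive.append(x1)                # raw draw of X1
        rb.append(rho * x2)             # E(X1 | X2 = x2), known in closed form
est_naive, est_rb = np.mean(naive), np.mean(rb)
# both estimate E(X1) = 0, but the conditional means vary less
var_naive, var_rb = np.var(naive), np.var(rb)
```

With ρ = 0.8 the conditional means have variance ρ² = 0.64 against 1 for the raw draws, so the Rao–Blackwellized average is the less variable of the two estimators.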
There are many practical problems that are easily handled by the combination
of Bayesian methods and Gibbs sampling or some other MCMC methods, but are difficult to handle by other means.
Example. Let X1, ... , Xn be a random sample from the N(B, a2) distribution and assumeindependentprior distributionse N(tLo, rJ) and> G(no/2, noaJ/2). �
�
Note that this distribution is different from the usual conjugate prior used so far in this book but may be a suitable representation of the prior knowledge in some situations. Then the joint posterior is
p(B, <J>[x)
ex
l(e, a2;
ex
¢"12 exp x
cx
·
exp
x)p(B)p(¢)
( -� [
{
-
I
2rJ
ns2 +n(x-B)2
(8- tLo)2
}
])
(
noaJ¢ ¢"0/2-1 exp -- 2
)
cp[(n+no)/2]-1 x
exp
( -�
[>(noaJ +ns2 +n(e-x)2) +r0-2(e - !LQ)2]
)
.
This distribution has no known form and it is not possible to perform the ana lytic integration to obtairi the proportionality constant. The kernel of the marginal distributions can be obtained but are of no known form which prevents the exact evaluation of their mean, variance and so forth. Nevertheless, the posterior con ditional distributions of
81¢ and ¢18 are easy to obtain. They are given by the
posterior density once the terms that do not depend on the quantity of interest are incorporated into the proportionality constant. So,
p(B[¢, x)
,
p(<J>[B x)
ex
exp
ex
exp
0<
(� ( -� [
- [n<J>(B
e2(r0-2
X)2) + r02(B
� �
�
-Mfl) J)
+n<J>)-2B(r(J2!Lo +n<J>x)
¢1(n+n ol/ 2J-I exp -
from which it is clear that (8[¢, x)
nwf(B)/2) where
-
[noaJ +ns2 +n(8-X)2 ]
N[tLJ(>), rf(¢)] and (>[B, x)
)
�
G(nJ/2,
160 Approximate and computationally intensive methods and n ,a (61) = noa +ns2 +n(() �X). Therefore, it is easy to generate from the
f
J
conditionals and the Gibbs sampler becomes easy to implement.
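A sketch of this Gibbs sampler follows; the data and hyperparameter values are illustrative choices of ours. Recall that ns² = Σ(x_i − x̄)², and note that NumPy's gamma generator is parametrized by a scale, the reciprocal of the rate n_1σ_1²(θ)/2.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data (illustrative values, not from the book)
x = rng.normal(2.0, 1.5, size=50)
n, xbar = len(x), x.mean()
s2 = x.var()                       # so that n*s2 = sum (x_i - xbar)^2

mu0, tau0sq = 0.0, 100.0           # hyperparameters of theta ~ N(mu0, tau0^2)
n0, sig0sq = 2.0, 1.0              # hyperparameters of phi ~ G(n0/2, n0*sig0^2/2)

n_iter, phi = 10000, 1.0           # arbitrary starting value for phi
theta_draws = np.empty(n_iter)
phi_draws = np.empty(n_iter)
for j in range(n_iter):
    # theta | phi, x ~ N(mu1(phi), tau1^2(phi))
    prec = 1.0 / tau0sq + n * phi
    mu1 = (mu0 / tau0sq + n * phi * xbar) / prec
    theta = rng.normal(mu1, np.sqrt(1.0 / prec))
    # phi | theta, x ~ G(n1/2, n1*sig1^2(theta)/2)
    n1 = n + n0
    n1sig1sq = n0 * sig0sq + n * s2 + n * (theta - xbar) ** 2
    phi = rng.gamma(n1 / 2.0, 2.0 / n1sig1sq)   # scale = 1/rate
    theta_draws[j], phi_draws[j] = theta, phi
```

With a vague prior for θ (large τ_0²), the posterior draws of θ concentrate around x̄, as expected from the form of μ_1(φ).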
One difficult problem in the application of MCMC schemes is to ensure the convergence of the chain. Some theoretical results and many diagnostic statistics are available. The practical recommendation is to monitor the trajectory of the chain using output diagnostics. A common approach is to plot the averages of selected quantities, such as the components of the vector X, and assess by visual inspection whether convergence has occurred. More formal diagnostic tools have also been derived and should be used in addition to visual inspection. Convergence of the chain can be slow for many reasons. For example, if the components of x are highly correlated, or if the joint density has multiple modes with regions of low probability between some of them, then the chain may take a large number of iterations to converge. The interested reader is referred to the books by Gamerman (1997) and Gilks et al. (1996) for some theoretical and practical aspects of MCMC methodology.
Exercises

§ 5.2

1. Consider the genetic application of Section 5.2 where a four-dimensional vector of counts X = (X_1, X_2, X_3, X_4) has multinomial distribution with parameters n and π, where π = (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4). Assume that the observed data was (125, 18, 20, 34).
(a) Obtain the equations required for calculation of the MLE via the Newton–Raphson algorithm and apply it to obtain the maximum likelihood estimate for the given data set.
(b) Obtain the equations required for calculation of the MLE via the Fisher scoring algorithm and apply it to obtain the maximum likelihood estimate for the given data set.
(c) Use the expressions given for the successive iterates in the EM algorithm to show that the likelihood is monotonically increasing through the steps of the algorithm.
(d) Compare the three different algorithms for finding the MLE in terms of computational complexity and time.
(e) Assume now a prior θ ~ beta(a, b). Repeat the exercise to obtain the (generalized) MLE. Specify numerical values for a and b and obtain the corresponding posterior modes.

2. Consider a random sample X_1, ..., X_n from the G(α, β) distribution with both parameters unknown. Obtain the maximum likelihood equations and describe the use of an iterative scheme to obtain the MLE of α and β.
3. Consider the randomized response example. Show that the MLE of θ can be calculated directly using invariance properties of the MLE and evaluate its value with the figures provided in the example.
4. Consider the EM algorithm with a sequence of iterated values θ^(j), j ≥ 1. Show that the sequence satisfies L(θ^(j) | x) ≤ L(θ^(j+1) | x) and is therefore monotonically increasing in the likelihood L(θ | x).

§ 5.3

5. Let X_n = (X_1, ..., X_n) be a random sample from the N(0, θ²) distribution.
(a) Obtain the asymptotic posterior distribution of θ as n → ∞.
(b) Obtain the asymptotic posterior mean and variance of θ². Hint: X ~ N(0, 1) ⇒ X² ~ χ²_1.
(c) Obtain the asymptotic distribution of θ² based on the delta method and compare it with the results obtained in (b).
6. Let X ~ bin(20, θ) and assume that X = 7 was observed. Obtain a 90% confidence interval for θ using a uniform prior and
(a) the fact that if Z ~ beta(a, b) then (b/a)·Z/(1 − Z) ~ F(2a, 2b);
(b) an asymptotic approximation for ψ = θ/(1 − θ);
(c) an asymptotic approximation for φ = sin^{−1}(√θ).
(d) Compare the results.

7. Let X_n = (X_1, ..., X_n) be a vector of independent random variables where X_i ~ Pois(θt_i), i = 1, ..., n, and t_1, ..., t_n are known times.
(a) Prove that the MLE of θ is θ̂ = X̄/t̄, where X̄ = Σ_{i=1}^n X_i/n and t̄ = Σ_{i=1}^n t_i/n.
(b) Obtain the asymptotic posterior distribution of θ | X_n and, based on it, construct an asymptotic 100(1 − α)% confidence interval for θ assuming that n is large.
(c) Obtain the asymptotic posterior distribution of θ^{1/2} | X_n and construct an asymptotic 100(1 − α)% confidence interval for θ assuming that n is large.
(d) Compare the confidence intervals obtained in (b) and (c), considering especially their lengths.

8. Let X_1, ..., X_n be a random sample from the distribution with density
f(x | θ) = θx^{θ−1} I_{[0,1]}(x).
(a) Verify which function(s) of θ (up to linear transformations) can be estimated with highest efficiency and determine its (their) corresponding estimator(s).
(b) Obtain the asymptotic 100(1 − α)% confidence interval for θ based on approximations for the posterior distribution of θ.
(c) Repeat item (b) basing calculations now on the asymptotic distribution of the score function U(X; θ).
(d) Repeat item (b) basing calculations now on the central limit theorem applied to the sample X_n = (X_1, ..., X_n).

9. Let X_1, ..., X_n be a random sample from the Pois(θ) distribution and define λ = θ^{1/a}, a ≠ 0.
(a) Obtain the likelihood function l(λ; X).
(b) Obtain the Jeffreys non-informative prior for λ.
(c) Obtain the Taylor expansion of L(λ) = log l(λ) around the MLE of λ and determine the value(s) of a for which the third-order term vanishes.
(d) Discuss the importance of the result obtained in the previous item in terms of asymptotic theory.

10. Let X_1, ..., X_n be a random sample from the uniform distribution over the interval [0, θ] and let θ̂_n be the MLE of θ.
(a) Obtain a non-degenerate asymptotic distribution for θ̂_n, or in other words, find functions h(n), a(θ) and b(θ) and a non-degenerate asymptotic distribution P such that

h(n)[θ̂_n − a(θ)]/b(θ) → P when n → ∞.

Hint: use the density of θ̂_n to obtain the form of h, a and b and use the result (1 + s/n)^n → e^s when n → ∞, for s ∈ R.
(b) Comment on the convergence rate found.
(c) Obtain the asymptotic 100(1 − α)% confidence interval for θ of smallest length based on the results of item (a).
(d) Show that the parameter θ is such that h(n)[θ − a(θ̂_n)]/b(θ̂_n) converges in distribution to P, where h, a, b and P are the same ones obtained in item (a). Therefore, the asymptotic result obtained with the Bayesian inference is similar to the result obtained with the classical inference.
(e) Obtain the asymptotic 100(1 − α)% HPD confidence interval for θ.
(f) Compare the intervals obtained in items (c) and (e) with the exact interval.

11. Show that for any distribution p(θ) in the exponential family, the best normal approximation in the Kullback–Leibler sense has mean and variance given respectively by μ = E(θ) and σ² = V(θ), the mean and variance of the original distribution.
12. Consider again the variation of the randomized response model which consists in asking as the alternative question the negation of the original one.
(a) Show that the posterior distribution of θ is a mixture of n + 1 beta distributions.
(b) Obtain the relevant derivatives and points of maxima required for the evaluation of the posterior mean of θ, analytically or numerically.
(c) Apply the results of the previous items to reproduce the table of exact and approximated posterior means for θ given in the text.
§ 5.4

13. Apply the Gauss–Hermite integration rules to obtain approximations for the posterior expectation of α and θ given the observed values already provided in the Weibull example of Section 5.3, and compare the results with the approximations from the Laplace method.
§ 5.5

14. Use the simple Monte Carlo method to evaluate ∫_{−∞}^{∞} e^{−x²/2} dx and compare it with the known answer √(2π). Also, evaluate the variance of the estimator.
Hint: make a transformation to take the line into the interval [0, 1] and then proceed as before.
15. Show that if an integral I = ∫ g(x)p(x)dx is estimated by importance sampling, then its estimator

Î = (1/n) Σ_{i=1}^n g(x_i)w(x_i), where w(x_i) = p(x_i)/h(x_i) and x_i ~ h(x), i = 1, ..., n,

is unbiased and has variance given by V(Î) = (1/n) ∫ (g(x)w(x) − I)² h(x)dx.

16. Let θ = P(X > 2) where X has a standard Cauchy distribution with density

p(x) = 1/[π(1 + x²)], x ∈ R.

Let h be an importance sampling density defined by h(x) = 2 I_{(2,∞)}(x)/x². Show that use of this sampling density reduces the variance of the estimator of θ over the simple Monte Carlo estimator.
17. Show that the Rao–Blackwellized estimator of E_p(X_i) given by

Ê_p(X_i) = (1/n) Σ_{k=0}^{n−1} E_p(X_i | x_{-i}^(m+kl))

provides an unbiased and consistent estimator of E_p(X_i). Generalize the result to obtain the Rao–Blackwellized estimator of the marginal density of X_i and show that it is also an unbiased and consistent estimator.

18. Let X_1, ..., X_n be a sample from a Poisson distribution whose mean is either θ, for the first m observations, or φ, for the remaining ones, where the change point m is unknown.
(a) Obtain the likelihood of the unknown parameters θ, φ and m.
(b) Suggest a reasonable family of conjugate prior distributions for θ, φ and m. Hint: to simplify matters, assume independent priors for θ, φ and m.
(c) Obtain the full conditional distributions required for implementation of the Gibbs sampler.
(d) Generate data (X_1, ..., X_n) for given values of θ, φ and m and apply the Gibbs sampler to draw inference about them.
19. (Casella and George, 1992) Let π denote the following discrete distribution over S = {0, 1}²:

              X_2 = 0   X_2 = 1
   X_1 = 0     π_00      π_01
   X_1 = 1     π_10      π_11

where π_00 + π_01 + π_10 + π_11 = 1 and π_ij > 0, for i, j = 0, 1. Assume that instead of drawing samples directly from π, one decides to draw values from π through the Gibbs sampler.
(a) Show that the transition probabilities for X_1 are given by the conditional distribution π_1 of X_1 | X_2 = j, with Pr(X_1 = i | X_2 = j) = π_ij/π_{+j}, where π_{+j} = π_0j + π_1j, j = 0, 1.
(b) Show that the transition probabilities for X_2 are given by the conditional distribution π_2 of X_2 | X_1 = i, with Pr(X_2 = j | X_1 = i) = π_ij/π_{i+}, where π_{i+} = π_i0 + π_i1, i = 0, 1.
(c) Show that the 4 × 4 transition matrix P of the chain formed by the Gibbs sampler has elements

P((i, j), (k, l)) = Pr((X_1, X_2)^(n) = (k, l) | (X_1, X_2)^(n−1) = (i, j)) = π_kl π_kj / (π_{k+} π_{+j})

for (i, j), (k, l) ∈ S.
(d) Show that π is the only stationary distribution of this chain.
(e) Extend the results for cases when X_1 can take n_1 values and X_2 can take n_2 values.

20. Show that the Metropolis–Hastings chain has equilibrium distribution given by p(x).
6 Hypothesis testing
6.1 Introduction
In this chapter we still consider statistical problems involving an unknown quantity θ belonging to a parametric space Θ. In many instances, the inferential process may be summarized in the verification of some assertions or conjectures about θ. For example, one may be interested in verifying whether a coin is fair, whether a collection of quantities is independent or whether distinct populations are probabilistically equal. Each one of the assertions above constitutes an hypothesis and can be associated with a model. This means here that it can be parametrized in some form. Considering the simple case of two alternative hypotheses, two disjoint subsets Θ_0 and Θ_1 belonging to Θ are formed. Denote by H_0 the hypothesis that θ ∈ Θ_0 and by H_1 the hypothesis that θ ∈ Θ_1. A new statistical problem is to decide whether H_0 or H_1 is accepted, or in other words, whether θ is in Θ_0 or Θ_1.

If the subset of the parameter space defining an hypothesis contains a single element, the hypothesis is said to be simple. Otherwise, it is said to be composite. Under a simple hypothesis, the observational distribution is completely specified whereas under a composite hypothesis it is only specified that the observational distribution belongs to a family. From now on, the hypotheses H_0 and H_1 will be uniquely associated with disjoint subsets Θ_0 and Θ_1 of the parameter space. Whenever they are simple, the notation Θ_0 = {θ_0} and Θ_1 = {θ_1} will be used. Note that in some cases, only one of the hypotheses is simple.

Typically, a test of hypotheses is a decision problem with a number of possible actions. If the researcher makes the wrong decision he incurs a penalty or suffers a loss. Once again, his/her objective is to minimize his/her loss in some form. For example, under the Bayesian approach he/she would try to minimize the expected loss. A rule to decide the hypothesis to be accepted is called a test procedure or simply a test and will be denoted by ψ. One may define for example that ψ = i if the hypothesis accepted is H_i. From the Bayesian perspective, one may have many alternative hypotheses H_1, ..., H_k that can be compared through P(H_i | x), i = 1, ..., k. Under the classical perspective, it is important to have only two hypotheses H_0 and H_1.
The theory of this chapter is developed for this case in order to ease comparisons between the two approaches. Usually there is an hypothesis that is more important. This will be denoted by H_0 and called the null hypothesis. The other hypothesis is denoted by H_1 and called the alternative hypothesis. In general, H_0 and H_1 are mutually exclusive, which means that at most one of the hypotheses is true. If, in addition, H_0 and H_1 exhaust all possibilities, then necessarily one of them must be true. In this case, rejecting one of them necessarily implies accepting the other one. In this chapter, we start with the presentation of the classical procedures and later present the Bayesian procedure. We then move on to establish a connection between hypothesis tests and confidence intervals. Finally, tests based on asymptotic theory results are described from both classical and Bayesian perspectives. A systematic introductory account of this topic is presented in Bickel and Doksum (1977) and DeGroot (1970). The classical theory is presented at a more formal level in Lehmann (1986).

Assume that before deciding which hypothesis to accept, the statistician is offered the choice of observing a sample X_1, ..., X_n from a distribution that depends on the unknown parameter θ. In a problem of this kind, the statistician can specify a test procedure by splitting the sample space into two subsets. Under the frequentist viewpoint, this is the only available option as the only source of information comes from the data. One subset of the sample space will contain the values of X that will lead to acceptance of H_0 and the other one will lead to rejection of H_0. This latter set is called the critical region and a test procedure gets completely specified by the critical region. Of course, the complement of the critical region contains sample results that lead to the acceptance of H_0.
6.2 Classical hypothesis testing

The general theory of classical hypothesis testing comes from the pioneering work of Neyman and Pearson (1928). The probabilistic characteristics of a classical test can be described by specification of π(θ), the probability that the test leads to the rejection of H_0, for each value of θ ∈ Θ. The function π is called the power of the test. If C is the critical region then π is defined by

π(θ) = P(X ∈ C | θ), ∀θ ∈ Θ.

Some textbooks define the power function only for θ ∉ Θ_0. The size or significance level α of a test procedure is defined as

α ≥ sup_{θ∈Θ_0} π(θ).
Just as in the case of confidence levels seen in Section 4.4, the inequality above is a technical requirement. It is more useful in discrete sample spaces where not all values in [0, 1] are possible probabilities. As will shortly be seen, one wishes to use as small a value for α as possible. In practice, this means that an equality is used.

For any given test procedure ψ, two types of error can be committed. A type I error is committed when the test indicates rejection of H_0 when it is true. Note that the largest possible value for the probability of this error is α. Similarly, the type II error is committed when the test indicates acceptance of H_0 when it is false. The probability of a type II error is usually denoted by β. Note that β(θ) = 1 − π(θ), for θ ∈ Θ_1. In the case of simple hypotheses, the probability of a type I error is α = π(θ_0) and the probability of a type II error is β = 1 − π(θ_1).

6.2.1 Simple hypotheses
It is useful to start the study of the theory with the case of two simple hypotheses H_0: θ = θ_0 and H_1: θ = θ_1. Ideally, one wishes to find a test procedure for which the two error probabilities are as small as possible. In practice, it is impossible to find a test for which these probabilities are simultaneously minimized. As an alternative, one may seek to construct a test that minimizes linear combinations of α and β.

Theorem (optimal test). Assume that X = (X_1, ..., X_n) is a random sample from p(x | θ), H_0: θ = θ_0 and H_1: θ = θ_1. Let ψ* be a test of H_0 versus H_1 such that H_0 is accepted if p_0/p_1 > k and H_0 is rejected if p_0/p_1 < k, where p_i = p(x | θ_i), i = 0, 1 and k > 0. (If p_0/p_1 = k, nothing can be decided.) Then, any other test ψ will be such that

aα(ψ*) + bβ(ψ*) ≤ aα(ψ) + bβ(ψ),

where α(ψ) and β(ψ) respectively denote the probabilities of errors of type I and II of test ψ, for any a, b ∈ R⁺.
Proof. Let C be the critical region of any arbitrary test ψ and define p_i = p(x | θ_i), i = 0, 1. Then, for a, b ∈ R⁺,

aα(ψ) + bβ(ψ) = a ∫_C p(x | θ_0)dx + b ∫_{C̄} p(x | θ_1)dx
= a ∫_C p(x | θ_0)dx + b[1 − ∫_C p(x | θ_1)dx]
= b + ∫_C (ap_0 − bp_1)dx.

So, minimization of aα(ψ) + bβ(ψ) is equivalent to choosing the critical region C in such a way that the value of the integral is minimal. This will occur if the integration is performed over a set that includes every point x such that ap_0 − bp_1 < 0 and does not include points x such that ap_0 − bp_1 > 0. Therefore, minimization of aα(ψ) + bβ(ψ) is achieved by having the critical region C include only points x such that ap_0 − bp_1 < 0. (If the sampling distribution is continuous and ap_0 − bp_1 = 0, this point has 0 contribution and is irrelevant.) This completes the demonstration because ap_0 − bp_1 < 0 iff p_0/p_1 < k = b/a, which corresponds to the description of the test ψ*. □
The ratio p_0/p_1 is called the likelihood ratio (LR). The theorem establishes that a test that minimizes aα(ψ) + bβ(ψ) rejects H_0 when the LR is small and accepts H_0 when the LR is large. Usually, the null hypothesis H_0 and the error of type I are privileged. Therefore, one considers only tests ψ such that α(ψ) cannot be larger than a pre-specified level α_0 and, among them, searches for the one that minimizes β(ψ). This is a variation of the problem solved with the theorem and the solution is provided by the following lemma.

Lemma (Neyman–Pearson). Assume that X = (X_1, ..., X_n) is a random sample from p(x | θ), H_0: θ = θ_0 and H_1: θ = θ_1. Let ψ* be a test of H_0 versus H_1 such that H_0 is accepted if p_0/p_1 > k and H_0 is rejected if p_0/p_1 < k, where p_i = p(x | θ_i), i = 0, 1. (If p_0/p_1 = k, nothing can be decided.) Then, any other test ψ such that α(ψ) ≤ α(ψ*) satisfies β(ψ) ≥ β(ψ*). Also, α(ψ) < α(ψ*) implies β(ψ) > β(ψ*).

Proof. Following the definition of the optimal test ψ* in the theorem, it follows that for any other test ψ

α(ψ*) + kβ(ψ*) ≤ α(ψ) + kβ(ψ), for k > 0.

If α(ψ) ≤ α(ψ*) then necessarily β(ψ) ≥ β(ψ*). Also, if α(ψ) < α(ψ*), it follows that β(ψ) > β(ψ*), completing the demonstration. □
Special attention must be given to the wording of the lemma. It only considers acceptance or rejection of H_0, with no reference to H_1. This is consistent with the preferential status given to H_0 and also with the fact that H_0 and H_1 do not exhaust the parameter space. This point will be readdressed below. In the lemma, α_0 plays the role of significance level. Recalling that π(θ_1) = 1 − β(ψ), minimization of β implies maximization of π. Hence, the Neyman–Pearson lemma shows that of all tests with a given significance level, that based on the LR has largest power or is most powerful.
Example. In the N(θ, σ²) model with known σ², consider the test of H_0: θ = θ_0 versus H_1: θ = θ_1, with θ_0 < θ_1. Then

p_0/p_1 = [(2πσ²)^{−n/2} exp( −(1/2σ²) Σ_{i=1}^n (x_i − θ_0)² )] / [(2πσ²)^{−n/2} exp( −(1/2σ²) Σ_{i=1}^n (x_i − θ_1)² )]
∝ exp( n(θ_0 − θ_1)x̄/σ² ),

where the proportionality constant involves the constants θ_0, θ_1 and σ². The LR test accepts H_0 when p_0/p_1 > k. Then,

p_0/p_1 > k ⟺ exp( n(θ_0 − θ_1)x̄/σ² ) > c_3 ⟺ n(θ_0 − θ_1)x̄/σ² > c_2 ⟺ x̄ < c_1,

since θ_0 < θ_1, for constants c_1, c_2 and c_3. As the best estimator of θ is X̄, the sample mean, when testing H_0 against H_1 one expects the test to accept H_0 for small values of X̄. That is exactly the result of the optimal, LR test. The next step is to determine the value of c_1. To do this, note that the test has level α and therefore

α = P(rejection of H_0 | θ = θ_0) = P(X̄ > c_1 | θ = θ_0).

But X̄ | θ_0 ~ N(θ_0, σ²/n), or Z = (X̄ − θ_0)/(σ/√n) ~ N(0, 1). Since

α = P(Z > (c_1 − θ_0)√n/σ),  it follows that (c_1 − θ_0)√n/σ = z_α, so that c_1 = θ_0 + z_α σ/√n.

The test with significance level α accepts H_0 if X̄ < θ_0 + z_α σ/√n.
e1 but for the fact that 611 > Bo. This means that the test is the same for any value of e, such that e, > eo. Therefore the LR test is also more powerful to test Ho versus H, : e, > Bo. The fact that the test does not depend on the value of 81 may also cause problems. Consider the case when Ho is rejected but X is much closer to Bo than to e1, that is, X- eo << e, -X. In this case, common sense suggests that given the choices of Ho and H1 one should choose Ho. This is another reason to avoid commitments towards acceptance or rejection of H1 since the test does not provide infonnation about it. Similar comments apply if the distances of X to Bo and e1 are much larger than the distance between eo and e,. The intuitive reasoning based on the frequentist argument is that if sampJing of X is repeated many times, in only
IOOa% of them, Ho will be erroneously rejected. The power of the test Jr(e) = P( rejection of Ho I e) is given by rr(e) =
[
P x >eo+
:;,z.
]
1 e >eo .
'II 172 Hypothesis testing But
X I e>eo- N(e, a2jn) rr(e) = P
=P
=1
[ [
Fn
Fn
_
and therefore
(X- e) a
-/ii(X- e)ja- N(O, !).
�
eo+ (a/ aj
>
n
z.- e i
]
Then,
e>eo
]
(X- e) (IJo -e) > Fn+z.le>Bo a a
(
z.
(e _
: eo) -Iii) ,
which is an increasing function of 8. So, the more distant is the parameter value in the alternative, the smaller are the chances of a type II error. G
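The power function just derived is easy to evaluate numerically. The sketch below (ours, using only the standard library; the numerical settings are illustrative) checks that π(θ_0) = α and that π increases in θ:

```python
from math import sqrt
from statistics import NormalDist

def power(theta, theta0, sigma, n, alpha):
    """Power of the test that rejects H0 when xbar > theta0 + sigma*z_alpha/sqrt(n):
    pi(theta) = 1 - Phi(z_alpha - sqrt(n)*(theta - theta0)/sigma)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return 1.0 - NormalDist().cdf(z_alpha - sqrt(n) * (theta - theta0) / sigma)

# at theta = theta0 the power equals the level alpha,
# and it increases with theta, as shown above
pi_at_null = power(1.0, 1.0, 2.0, 25, 0.05)
```

Evaluating the function over a grid of θ values traces out the whole power curve of the test.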
This test is not only the most powerful test of H_0 versus H_1: θ > θ_0, but is also most powerful to test H_0: θ ≤ θ_0 versus H_1: θ > θ_0, because the level of the test with the new hypothesis is

max_{θ≤θ_0} π(θ) = max_{θ≤θ_0} P[ X̄ > θ_0 + (σ/√n)z_α | θ ].

As just seen, π is an increasing function of θ, so the maximum in the region {θ: θ ≤ θ_0} is attained at the value θ_0, and the value of π at this point is 1 − Φ(z_α) = α.
Also, in the example we could evaluate the size at which we would reject H_0 after observing X̄ = x̄, that is, we could evaluate γ such that θ_0 + σz_γ/√n = x̄. This gives a more precise account of the strength of the data evidence in favour of or against the hypotheses. Larger values of γ indicate that rejection would only occur with a higher probability of type I error and therefore more evidence in favour of H_0. Likewise, smaller values of γ indicate that rejection occurs even with a lower probability of type I error and therefore more evidence against H_0.

More generally, suppose we have a test where H_0 is rejected when a test statistic T belongs to a region of the form [T > c], and let t be the observed value of T. Then, evaluation of Pr(T > t | H_0) gives an idea of how extreme the observed value is under H_0. This probability is usually known as the p-value. In the previous example, the p-value is given by 1 − Φ(√n(x̄ − θ_0)/σ). The notion of a p-value is useful for determining the size at which one would reject H_0 based on the information actually obtained, for

H_0 is rejected ⟺ p-value < α,

where α is a pre-specified level of the test. It should be stressed that under the frequentist treatment, no probabilities can be associated with the hypotheses as θ is not random. Therefore, no association between the p-value and the probability of H_0 can be made because such a probability simply cannot be defined. The notion of p-value can be placed in a general setting whenever it makes sense to specify the border of the critical region in terms of observed values of the sample.
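For this example, the p-value 1 − Φ(√n(x̄ − θ_0)/σ) is immediate to compute (a sketch with illustrative numbers of our own choosing):

```python
from math import sqrt
from statistics import NormalDist

def p_value(xbar, theta0, sigma, n):
    """One-sided p-value Pr(Xbar > xbar | theta0) = 1 - Phi(sqrt(n)(xbar - theta0)/sigma)."""
    return 1.0 - NormalDist().cdf(sqrt(n) * (xbar - theta0) / sigma)

p = p_value(xbar=1.8, theta0=1.0, sigma=2.0, n=25)
# H0 is rejected at level alpha exactly when p < alpha
```

Here √25(1.8 − 1.0)/2 = 2.0, so p = 1 − Φ(2.0), small enough to reject H_0 at the usual 5% level.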
Returning now to the case of discrete populations, it is not always possible to obtain tests of any pre-specified level exactly. By exact, we mean to have α = η, where η = sup_{θ∈Θ_0} P(rejection of H_0 | θ ∈ Θ_0). The notion of p-value becomes even more important here. There is an alternative approach that allows one to obtain tests of an exact level even for discrete distributions. This alternative is known as randomized tests, where any pre-specified level is obtained after realization of an additional independent Bernoulli experiment with success probability conveniently chosen to complete the difference between α and η.
Composite hypotheses
Consider again the test Vr of Ho: 0 E eo versus H,: 0 E e,. Let ct be the-flxed significance level and rr �(0) the power function of Vr. Then, rr � (0) ::; a for every 0 E eo.
Definition. A test lfr* is uniformly more powerful (UMP, in short) for Ho: 8 E eo ...... versus H1: 0 E 81 at the s.ignificance level a if .
I. a(>j;') ::; a; 2. V>jr with a(>jr) ::0 a, rr�(O) ::0 rr�·(O), V0 E e1.
The test of the example above.is UMP to teste= 8o versus e > eo and also to teste ::::: 8o versus e >eo.
Theorem 6.1. Let X = (X1, X,) be a random sample from p(xl8) and p(x)8) belong to the one-parameter exponential family with density • • • ,
p(x I e)= a (x) exp { ¢(8)T(x) + b(8)1 and let ¢ be a strictly increasing function of 8. Then the UMP test of level a to test Ho: 8 ::: Bo versus H1: 8 > 8o is given by the critical region T(X) > cwhere cis such that a ?:. P(T(X) > c I Bo) (with equality in the continuous case). The power of this test is _an increasing function of 8. If the hypotheses are interchanged or¢ is a strictly decreasing function of 8, then the UMP test of level a rejects Ho if T (X) < cwhere cis such that a ?:. P(T(X) < c I Bo) and the power of this test is again an increasing function of 8. If the two conditions above are simultaneously true, the UMP test remains unaltered.
Proot Consider the standard case where 1> is strictly increasing and the hypotheses are Ho : f3 8o and H1 : f3 = tJ1 > 8o. In this case, the Neyman-Pearson lemma ensures that the most powerful test rejects Ho when =
p (xl8ol ---
p(xl8t)
=
exp{ ¢(8o)T(x ) + b(8o)}
exp{ 1> (Btl T(x) + b(8t)} = exp{¢ [ (8o) -¢(8t)JT(x)j < cz
= [¢(8o)-¢(8t)]T(x) < Ct = T(x)
> c
174
Hypothesis testing
... , CJ, care constants such that a= P(T(X) > ciBo). For the most Jr ( Ot ) (see the exercises). So, the power function is an increasing function of 8 and where c4,
powerful test, it must be true thatJr(Oo) ::=
Jr(Oo) = sup
{B:tkBo)
1r(O).
So, the test is UMP for Ho : fJ :::::
Bo. As in the calculation above only the condition Bo was used, so the results must be equal1y true for any such value of Bt. Therefore, the test is UMP for Ht : e > Oo. Bt
>
In the case of a strictly decreasing¢,
[.j>(Oo)- ,P(OJ )]T(x) < Ci <=> T(x)
{x: T(x)
In the case of interchanged hypotheses, all inequalities must be reversed because one must work with the ratio
p(x!Ot)/p(xiOo) instead of p(xiOo)/p(xiOt). This {x: T(x)
leads to the critical region in the form
Finally, in the case of a strictly decreasing¢ and interchanged hypotheses, double reversal of the inequalities preserves it as it was and the critical region remains in the form
{x: T(x)
Example. Let X₁, …, Xₙ be a random sample from the Ber(θ) distribution. From Section 2.5, we know that φ(θ) = log[θ/(1 − θ)], which is an increasing function of θ, and that T(X) = Σᵢ₌₁ⁿ Xᵢ. Therefore, the UMP test for H₀: θ ≤ θ₀ versus H₁: θ > θ₀ has critical region of the form Σᵢ₌₁ⁿ Xᵢ > c.
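Because T = Σ Xᵢ is discrete, an exact level α generally requires the randomized test mentioned at the beginning of the section. The sketch below (with illustrative values of n, θ₀ and α; the function names are ours, not the book's) finds the critical value c and the randomization probability from the binomial tail:

```python
from math import comb

def binom_tail(n, theta, c):
    """P(T > c | theta) for T ~ Binomial(n, theta)."""
    return sum(comb(n, t) * theta**t * (1 - theta)**(n - t)
               for t in range(c + 1, n + 1))

def ump_critical_value(n, theta0, alpha):
    """Smallest c with P(T > c | theta0) <= alpha, together with the
    randomization probability gamma that recovers the exact level alpha."""
    for c in range(n + 1):
        tail = binom_tail(n, theta0, c)
        if tail <= alpha:
            atom = comb(n, c) * theta0**c * (1 - theta0)**(n - c)
            gamma = (alpha - tail) / atom  # reject with prob. gamma when T == c
            return c, gamma
    return n, 0.0

c, gamma = ump_critical_value(10, 0.5, 0.05)  # illustrative: n = 10, theta0 = 0.5
```

The returned pair (c, γ) says: reject when T > c, and reject with probability γ when T = c, which attains the level α exactly.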
The property that guarantees the existence of UMP tests in the exponential family is in fact more general. It finds an appropriate setting under families with monotone LR.
Definition. The family of distributions {p(x | θ), θ ∈ Θ} is said to have monotone likelihood ratio if there is a statistic T(X) such that, for all θ₁, θ₂ ∈ Θ with θ₁ < θ₂, the likelihood ratio

p(x | θ₂)/p(x | θ₁)

is a monotone function of T(x).
The uniform distribution over the interval [0, θ] does not belong to the exponential family. Nevertheless, it has monotone LR because the ratio of sampling densities is a monotone function of T(X) = maxᵢ Xᵢ.
The results just proved for exponential families can be extended to families with monotone likelihood ratio. So, if the LR is an increasing function of T(X), then the UMP test of level α for H₀: θ ≤ θ₀ versus H₁: θ > θ₀ is given by the critical region of the form T(X) > c, where c is such that α ≥ P(T(X) > c | θ₀), and the power of this test is an increasing function of θ. Likewise, if the hypotheses are interchanged or the LR is a decreasing function of T(X), the UMP test of level α rejects H₀ if T(X) < c, where c is such that α ≥ P(T(X) < c | θ₀), and the power of this test is again an increasing function of θ. If the two conditions above are simultaneously true, the UMP test remains unaltered.

These results make intuitive sense. The larger the LR, the more plausible is the value θ₂ relative to θ₁. If the LR is an increasing function of T(X), the same reasoning is true for T(X). Therefore, a reasonable rejection region for H₀ would be given by large values of T(X).
Example (continued). Let X₁, …, Xₙ be a random sample from the Ber(θ) distribution. Then, the LR is

p(X | θ₂)/p(X | θ₁) = [θ₂^T (1 − θ₂)^{n−T}] / [θ₁^T (1 − θ₁)^{n−T}] = η^T [(1 − θ₂)/(1 − θ₁)]ⁿ, with η = θ₂(1 − θ₁)/[θ₁(1 − θ₂)]

and T = Σ Xᵢ. If θ₁ < θ₂, then η > 1 and the LR is increasing in T, confirming the results obtained earlier in the example.

So far, only one-sided tests have been considered. These are tests where the parametric regions defining the hypotheses are given by a single strict inequality. An example of interest of a hypothesis that is not one-sided is H₀: θ = θ₀ versus H₁: θ ≠ θ₀. This test may be useful when comparing two competing treatments.
Assuming now the observation of a sample from the N(θ, σ²) distribution with known σ², consider the three tests of H₀: θ = θ₀ below, based on X̄:

1. reject H₀ if |X̄ − θ₀| > 1.645 σ/√n;
2. reject H₀ if X̄ − θ₀ > 1.282 σ/√n;
3. reject H₀ if |X̄ − θ₀| < 0.126 σ/√n.

Calculation of the rejection probability of H₀ shows that the three tests have level 0.1. The next step is to proceed with evaluation of the power of each of the tests. It can easily be seen from Figure 6.1 that none of the tests is UMP over the other two. Nevertheless, the first test is the only one with min_{θ ≠ θ₀} π(θ) > π(θ₀). This means that the rejection probability is larger under the alternative than under the null hypothesis. This way, one guarantees that the chances of rejecting H₀ are larger when H₀ is false. This seems like a reasonable property to require from tests.
Definition. A test ψ for H₀: θ ∈ Θ₀ against H₁: θ ∈ Θ₁ is said to be unbiased if, for every pair (θ, θ′) where θ ∈ Θ₀ and θ′ ∈ Θ₁, π_ψ(θ) ≤ π_ψ(θ′). The power function is at least as large in Θ₁ as it is in Θ₀. If the test does not satisfy the condition above, it is said to be biased.
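The power functions behind Figure 6.1 can be checked numerically from the standard normal cdf. A minimal sketch, writing d = √n(θ − θ₀)/σ (the three constants are those of the tests above):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Power of each level-0.1 test, as a function of d = sqrt(n)(theta - theta0)/sigma.
def power1(d):   # reject if |Xbar - theta0| > 1.645 sigma/sqrt(n)
    return 1 - (phi(1.645 - d) - phi(-1.645 - d))

def power2(d):   # reject if Xbar - theta0 > 1.282 sigma/sqrt(n)
    return 1 - phi(1.282 - d)

def power3(d):   # reject if |Xbar - theta0| < 0.126 sigma/sqrt(n)
    return phi(0.126 - d) - phi(-0.126 - d)
```

At d = 0 all three functions return the common level 0.1; away from 0, only the first stays above its level on both sides, confirming that only test 1 is unbiased.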
Fig. 6.1 Power functions for tests 1, 2 and 3 as functions of θ′ = √n(θ − θ₀)/σ.

One can then try to construct UMP tests for H₀: θ = θ₀ versus H₁: θ ≠ θ₀
within the class of unbiased tests. In the one-parameter exponential family, it can be shown that if φ is a strictly increasing function of θ, the UMP unbiased test of level α for H₀: θ = θ₀ versus H₁: θ ≠ θ₀ accepts H₀ when c₁ < T(X) < c₂, with P(c₁ < T(X) < c₂ | θ₀) = 1 − α and (c₁, c₂) an interval of highest sampling density of T(X). The sampling distributions of T(X) are not necessarily symmetric (around θ₀) as in the normal case above. For such tests, we are led to two distinct p-values. In general, the smallest one is used.

It may not be possible in general to find unbiased tests. A general procedure for testing H₀: θ ∈ Θ₀ versus H₁: θ ∈ Θ₁ is based on the maximum likelihood ratio (MLR, in short) statistic given by

λ(X) = sup_{θ ∈ Θ₀} p(X | θ) / sup_{θ ∈ Θ₁} p(X | θ).
The most common case for use of this procedure is when Θ₀ and Θ₁ are exclusive and exhaustive and Θ₀ is of smaller dimension than Θ₁. In these cases, the denominator is replaced by the supremum over the whole parameter space Θ, which is easier to evaluate. Formally, the statistic λ(X) is being replaced by min{λ(X), 1}. In any case, λ(X) is a random variable depending on the sample. The maximum (or generalized) LR test for H₀ of level α accepts H₀ if λ(X) > c, where c satisfies

α ≥ sup_{θ ∈ Θ₀} P(λ(X) < c | θ).
Once again, α is taken as equal to the supremum of the above probabilities whenever possible. The power of the test is given by π(θ) = P(λ(X) < c | θ). This test rejects the null hypothesis H₀ if the maximized value of the likelihood under H₀ is distant from the global maximized value. This is an indication that there is great improvement in likelihood by consideration of points outside H₀ and that this hypothesis does not provide a good description of the data. In this case, it makes sense to reject H₀. It is important to distinguish between the above test and the test based on monotone likelihood ratios. Although the maximum likelihood ratio test enjoys good asymptotic properties, it is not always unbiased or UMP. The main difficulties associated with it are the calculation of the maximized likelihood in closed form and the determination of its sampling distribution. The first point was dealt with in Chapter 5 and the second one will be addressed below when asymptotic tests are treated in Section 6.5.

Other desirable properties of test procedures are similarity and invariance. Consider the problem of testing H: θ = θ₀ in the presence of a nuisance parameter φ. A test is said to be similar if the level of the test is the same whatever the value of the nuisance parameter. An example of a similar test is the t test, to be seen later in this section. A test is said to be invariant (under a specific family of transformations) if the distribution of any transformation of the observations inside the family remains in the same family. This will allow the hypotheses to remain unaltered under any of the transformations operated on the data. The tests presented below in this section are all invariant under linear transformations of the observations.
6.2.3
Hypothesis testing with the normal distribution
The most common tests for samples from a normal distribution are presented here. Once again, let X₁, …, Xₙ be iid with Xᵢ ~ N(θ, σ²) and suppose one wishes to test H₀: θ = θ₀ versus H₁: θ ≠ θ₀. Assume initially that σ² is known. In this case, we have shown that the UMP unbiased test of level α is given by the critical region √n |X̄ − θ₀|/σ > z_{α/2}. We will now obtain the MLR test. Since Θ₀ = {θ₀}, then

sup_{θ ∈ Θ₀} p(x | θ) = p(x | θ₀).
For θ ∈ Θ₁ = Θ − {θ₀}, the maximum of p(x | θ) is obtained at θ̂, the MLE of θ, which in this case is the sample mean X̄. Note that, for every value of θ, P(θ̂ = θ₀) = 0 and therefore considering maximization over Θ instead of over Θ₁ in the expression of the MLR does not produce any change. Then the MLR statistic is given by

λ(X) = p(X | θ₀)/p(X | θ̂)
= (2πσ²)^{−n/2} exp{−[Σ(Xᵢ − X̄)² + n(X̄ − θ₀)²]/2σ²} / [(2πσ²)^{−n/2} exp{−Σ(Xᵢ − X̄)²/2σ²}]
= exp{−(n/2σ²)(X̄ − θ₀)²}.
Observe that, under H₀, X̄ ~ N(θ₀, σ²/n) and therefore Z = √n(X̄ − θ₀)/σ ~ N(0, 1). The MLR is given by λ(X) = exp(−Z²/2). The MLR test rejects H₀ if

λ(X) < c₂ ⇔ Z²/2 > c₁ ⇔ |Z| > c.

Given a significance level α, α = P(|Z| > c | H₀) and c = z_{α/2}. The power of the test is π(θ) = P(|Z| > z_{α/2} | θ). Under H₁, X̄ ~ N(θ, σ²/n). So,

W = Z − √n(θ − θ₀)/σ = (√n/σ)[X̄ − θ₀ − (θ − θ₀)] ~ N(0, 1),

and therefore the power of the test is

π(θ) = 1 − P(−z_{α/2} < Z < z_{α/2} | θ)
= 1 − P(−z_{α/2} − √n(θ − θ₀)/σ < W < z_{α/2} − √n(θ − θ₀)/σ)
= 1 + Φ(−z_{α/2} − √n(θ − θ₀)/σ) − Φ(z_{α/2} − √n(θ − θ₀)/σ)
> α,

where Φ denotes the standard normal distribution function. The MLR test is unbiased as π(θ) > α for all θ ≠ θ₀. The rate of increase of the power depends on σ. The smaller is σ (more precise distribution, more concentrated population), the faster is the rate of growth of the power towards 1.

In the case where σ² is unknown, Θ₀ = {(θ, σ²): θ = θ₀, σ² > 0} and Θ = {(θ, σ²): θ ∈ R, σ² > 0}. Since the dimension of Θ₀ is smaller than the dimension of Θ₁ and of Θ, we will work with the latter. As before, this change can be proved to affect only a zero probability set. As seen in Section 4.5.1,

sup_{(θ,σ²) ∈ Θ₀} p(X | θ, σ²) = p(X | θ₀, σ̂₀²), where σ̂₀² = Σ(Xᵢ − θ₀)²/n,

and

sup_{(θ,σ²) ∈ Θ} p(X | θ, σ²) = p(X | θ̂, σ̂²), where θ̂ = X̄ and σ̂² = Σ(Xᵢ − X̄)²/n.
Therefore, the MLR statistic becomes

λ(X) = (σ̂²/σ̂₀²)^{n/2}.

One can also write

σ̂₀²/σ̂² = [Σ(Xᵢ − X̄)² + n(X̄ − θ₀)²] / Σ(Xᵢ − X̄)² = 1 + n(X̄ − θ₀)²/[(n − 1)S²] = 1 + T²/(n − 1),

where T = √n (X̄ − θ₀)/S, and the MLR can be rewritten as λ(X) = [1 + T²/(n − 1)]^{−n/2}. Therefore, the MLR test accepts H₀ if T² < c₁, or |T| < c. As T | H₀ ~ t_{n−1}(0, 1), the value of c of the level α test is t_{α/2,n−1}. This test is usually known as the t test, and is possibly the most used test in statistics. It can be shown that the power of the test is a strictly increasing function of |θ − θ₀| (see Exercise 6.5). An immediate consequence is the unbiasedness of the test, since the smallest value of the power occurs when θ = θ₀.
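The decreasing one-to-one relation between λ(X) and |T| is easy to verify numerically. A small sketch (the data are illustrative):

```python
from math import sqrt

def mlr_statistic(xs, theta0):
    """Return lambda(X) = [1 + T^2/(n-1)]^(-n/2) and T = sqrt(n)(Xbar - theta0)/S."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)   # S^2
    t = sqrt(n) * (xbar - theta0) / sqrt(s2)
    lam = (1 + t * t / (n - 1)) ** (-n / 2)
    return lam, t

data = [5.1, 4.8, 5.4, 5.0, 4.7, 5.3]      # illustrative sample
lam_near, t_near = mlr_statistic(data, 5.0)
lam_far, t_far = mlr_statistic(data, 4.0)   # a theta0 far from the data
```

Since λ(X) decreases strictly in T², rejecting for small λ is the same as rejecting for large |T|, which is where the t table enters.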
This test is similar because none of the properties of the test is affected by the value of the nuisance parameter σ². This was achieved by the replacement of σ² by its estimator S² and the existence of the pivotal quantity T. This test is also invariant under linear transformations.

Another common test is the test of equality of two means. Consider two random samples X₁ = (X₁₁, …, X₁n₁) from the N(θ₁, σ²) distribution and X₂ = (X₂₁, …, X₂n₂) from the N(θ₂, σ²) distribution. It is assumed, for simplicity, that the variances are equal. Then, the hypothesis of interest is defined by the parameter space Θ₀ = {(θ₁, θ₂, σ²): θ₁ = θ₂ = θ ∈ R, σ² > 0}. Observe that under H₀ the problem contains only a single sample of size n₁ + n₂ from the N(θ, σ²) distribution. The parameter vector can be more usefully described by 3 new parameters: the parameter of interest θ₂ − θ₁, any other transformation of θ₁ and θ₂ that is not a multiple of θ₂ − θ₁, and σ². The last two are nuisance parameters.
Once again, the MLR is based on maximizations over Θ and Θ₀. These operations give

sup_{(θ₁,θ₂,σ²) ∈ Θ₀} p(X₁, X₂ | θ, σ²) = p(X₁, X₂ | θ̂, σ̂₀²),

where

θ̂ = (n₁X̄₁ + n₂X̄₂)/(n₁ + n₂) and σ̂₀² = [Σᵢ₌₁^{n₁} (X₁ᵢ − θ̂)² + Σᵢ₌₁^{n₂} (X₂ᵢ − θ̂)²]/(n₁ + n₂),

and

sup_{(θ₁,θ₂,σ²) ∈ Θ} p(X₁, X₂ | θ₁, θ₂, σ²) = p(X₁, X₂ | θ̂₁, θ̂₂, σ̂²),

where

θ̂ᵢ = X̄ᵢ, i = 1, 2, and σ̂² = [Σᵢ₌₁^{n₁} (X₁ᵢ − X̄₁)² + Σᵢ₌₁^{n₂} (X₂ᵢ − X̄₂)²]/(n₁ + n₂).

This gives λ(X₁, X₂) = (σ̂²/σ̂₀²)^{(n₁+n₂)/2}. Note that

X̄₁ − θ̂ = [n₂/(n₁ + n₂)](X̄₁ − X̄₂),

which implies that

σ̂₀² = σ̂² + [n₁n₂/(n₁ + n₂)²](X̄₁ − X̄₂)².

Therefore

λ(X₁, X₂) = [1 + T²/(n₁ + n₂ − 2)]^{−(n₁+n₂)/2}
where

T = (X̄₁ − X̄₂) / [S (n₁⁻¹ + n₂⁻¹)^{1/2}], with S² = (n₁ + n₂)σ̂²/(n₁ + n₂ − 2).

So, the MLR test accepts H₀ when λ(X₁, X₂) > c₁, or when |T| < c. As seen in Section 4.5.2, T | H₀ ~ t_{n₁+n₂−2}(0, 1). For a level α test, c = t_{α/2,n₁+n₂−2}.

Analogously, it can be shown that the power of this test is a strictly increasing function of |θ₁ − θ₂|. Therefore, this test is unbiased for the same reasons as the previous t test. It is also similar, because none of its properties are affected by the two nuisance parameters, and invariant under linear transformations of the observations.
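The pooled statistic T of this test is straightforward to compute from raw data. A minimal sketch (illustrative samples):

```python
from math import sqrt

def two_sample_t(x1, x2):
    """Pooled two-sample t statistic of the MLR derivation above."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    s2 = ss / (n1 + n2 - 2)                    # pooled variance S^2
    return (m1 - m2) / sqrt(s2 * (1 / n1 + 1 / n2))

t = two_sample_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # illustrative samples
```

Comparing |T| with t_{α/2,n₁+n₂−2} completes the test.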
6.3
Bayesian hypothesis testing
In the Bayesian context, the problem of deciding about which hypothesis to accept is conceptually simpler. Typically, one would compare the hypotheses H1, ... , Hk through their respective posterior probabilities, obtained via Bayes' theorem as
p(Hᵢ | x) ∝ p(x | Hᵢ) p(Hᵢ).
Once again, this setup can be framed as a decision problem. In addition to the (posterior) probabilities attached to the hypotheses (or states of nature), a loss structure associated with the possible actions can be incorporated. Returning to the special cases of two hypotheses, suppose one wishes to test
H₀: θ ∈ Θ₀ versus H₁: θ ∈ Θ₁. It suffices to examine the posterior probabilities p(H₀ | x) and p(H₁ | x). If p(H₀ | x) > p(H₁ | x), then H₀ should be accepted as the most plausible hypothesis for θ. In this case, it can be said that H₀ is preferable to H₁. Otherwise, H₁ is preferred to H₀. There is a clear-cut rule for the choice between the hypotheses, which is not always true under the frequentist framework. As
p(H₀ | x) ∝ p(x | H₀) p(H₀) and p(H₁ | x) ∝ p(x | H₁) p(H₁),

and recalling that the proportionality constant is the same in both expressions,

p(H₀ | x)/p(H₁ | x) = [p(H₀)/p(H₁)] [p(x | H₀)/p(x | H₁)].

The ratio p(H₀)/p(H₁) is called the prior odds between H₀ and H₁, and the ratio p(H₀ | x)/p(H₁ | x) is called the posterior odds between H₀ and H₁. The ratio

p(x | H₀)/p(x | H₁)

is called the Bayes factor and is denoted by BF(H₀; H₁). This concept was introduced by Jeffreys (1961). Note that it is in some way a ratio of likelihoods. So, the posterior odds is given by the product of the prior odds and the Bayes factor.
Once again, the likelihood ratio introduces the influence of the observations in the setting of hypothesis testing. In general, the likelihoods here are marginal in the sense that they are obtained after integrating out some of the parameters not associated with the specification of the hypotheses.

Despite their notational simplicity, it is not easy in many cases to specify p(Hⱼ) when Hⱼ is a simple hypothesis, j = 0, 1, and θ is continuous. If a prior density f is specified for θ ∈ Θ, one will have that p(Hⱼ) = p(Hⱼ | x) = 0. In these cases, one solution is to attribute a lump prior probability π to the simple hypothesis, say H₀, for π ∈ (0, 1). So, if H₁ is the complement of a simple hypothesis H₀, then p(H₁) = 1 − π and this probability is distributed over the different values of θ under H₁, according to the prior distribution for θ | H₁. This distribution will have density f over Θ₁.

As H₀ is a simple hypothesis, it follows that p(x | H₀) = p(x | θ₀), the marginal density of X given H₀. This can also be referred to as the marginal likelihood of H₀ based on X. The marginal likelihood of H₁ based on X is

p(x | H₁) = ∫_{Θ−{θ₀}} p(x, θ | H₁) dθ = ∫_{Θ−{θ₀}} p(x | θ, H₁) p(θ | H₁) dθ = ∫_Θ p(x | θ) f(θ) dθ,

observing that the last integration is performed over the entire parameter space Θ because a single point does not alter its value. The Bayes factor reduces to

BF(H₀; H₁) = p(x | θ₀) / ∫ p(x | θ) f(θ) dθ.
Note that it provides the relative odds between H₀ and H₁ without taking into account the prior odds. It is a Bayesian measure of the goodness of fit of a given model to the data set. A Bayes factor larger than 1 indicates that H₀ fits the data better than H₁ and thus has larger likelihood. Also, the marginal prior for θ is mixed and therefore the marginal distribution of X is

p(x) = ∫ p(x | θ) dF(θ) = π p(x | θ₀) + (1 − π) ∫ p(x | θ) f(θ) dθ = π p(x | θ₀) + (1 − π) p(x | H₁).

Let X = (X₁, …, Xₙ) be a random sample from the N(θ, σ²) distribution and assume one wishes to test H₀: θ = θ₀ versus H₁: θ ≠ θ₀. Consider initially the case of known variance σ². Then, with a prior probability π for H₀ and assuming θ | H₁ ~ N(μ, τ²),

p(x | H₁) = ∫ p(x | θ) p(θ | H₁) dθ
= ∫ (2πσ²)^{−n/2} exp{−Σᵢ₌₁ⁿ (xᵢ − θ)²/2σ²} (2πτ²)^{−1/2} exp{−(θ − μ)²/2τ²} dθ.

After a few substitutions and algebraic transformations,

p(x | H₁) = (2πσ²)^{−n/2} (σ²/n)^{1/2} (τ² + σ²/n)^{−1/2} exp{−ns²/2σ²} exp{−(x̄ − μ)²/[2(τ² + σ²/n)]},

where s² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/n. The Bayes factor is

BF(H₀; H₁) = [(τ² + σ²/n)/(σ²/n)]^{1/2} exp{−(1/2)[n(x̄ − θ₀)²/σ² − (x̄ − μ)²/(τ² + σ²/n)]}.

As expected, the Bayes factor depends on the sample only through x̄. To obtain the sample value that maximizes the Bayes factor, it suffices to solve the maximization problem for x̄. Taking the logarithm of the previous expression and differentiating with respect to x̄ gives

∂ log BF/∂x̄ = n[(x̄ − μ)/(σ² + nτ²) − (x̄ − θ₀)/σ²] and
∂² log BF/∂x̄² = n[1/(σ² + nτ²) − 1/σ²] < 0,

and solving the first equation gives the maximum. The solution is

x̄_max = θ₀ + (σ²/nτ²)(θ₀ − μ)

and the maximized value of the Bayes factor is

(1 + nτ²/σ²)^{1/2} exp{(θ₀ − μ)²/2τ²}.
The larger the sample size n, the larger the chances of H₀, and the larger is τ², the larger is the maximized value of the Bayes factor. The prior uncertainty plays a crucial role in the Bayesian comparison of sharp null hypotheses. In the limit, when τ² → ∞, the Bayes factor BF(H₀; H₁) also increases indefinitely. This limit is the result obtained when a non-informative prior for θ | H₁ is used. This was first noted by Lindley (1957) and is known as Lindley's paradox. It has been the object of study of comparisons between the Bayesian and frequentist approaches to hypothesis testing.

To ease comparisons between the approaches, take μ = θ₀ and τ² = σ². The first assumption centres the alternative prior distribution over the single value of H₀ and the second takes the prior variance in the alternative as equal to the observational variance. Then,

BF(H₀; H₁) = [(σ² + σ²/n)/(σ²/n)]^{1/2} exp{−(1/2)[n(x̄ − θ₀)²/σ² − (x̄ − θ₀)²/(σ² + σ²/n)]}
= (n + 1)^{1/2} exp{−[n/(n + 1)] z²/2},

where z = √n (x̄ − θ₀)/σ. If π = p(H₀), then

p(H₀ | x)/p(H₁ | x) = [π/(1 − π)] BF(H₀; H₁) ⇔ p(H₀ | x) = {1 + [(1 − π)/π] BF(H₀; H₁)⁻¹}⁻¹.

Assuming prior indifference (π = p(H₀) = p(H₁) = 1/2) gives

p(H₀ | x) = {1 + [(1 + n)^{1/2} exp{−nz²/[2(n + 1)]}]⁻¹}⁻¹.

The posterior probability of H₀ can be calculated for different values of n and z², since both classical and Bayesian tests are based on z². Working with the most common values, associated with the p-values 0.05 and 0.01, gives Table 6.1. The probabilities of H₀ are smaller for the larger value of |z|, which is reasonable. What is less reasonable is the values that are obtained for these probabilities, but they reinforce the idea that p-values should not be taken as probabilities that can be associated with H₀. Another interesting result is that, for any given value of z, posterior probabilities can vary substantially, from a very low value that would lead to rejection of H₀ to a very large value that would lead to acceptance of H₀. One possible way to reconcile these findings with the significance level is that levels should also be changed with sample size. A large sample size should call for a
Table 6.1 Values of P(H₀ | x)

n       |z| = 1.96 (p-value 0.05)   |z| = 2.576 (p-value 0.01)
1       0.35                        0.21
10      0.37                        0.14
100     0.60                        0.27
1000    0.80                        0.53
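The entries of Table 6.1 follow directly from the last expression for p(H₀ | x); the sketch below reproduces them up to rounding:

```python
from math import sqrt, exp

def post_prob_h0(n, z):
    """p(H0 | x) under prior indifference, with mu = theta0 and tau^2 = sigma^2."""
    bf = sqrt(n + 1) * exp(-n * z * z / (2 * (n + 1)))  # Bayes factor
    return 1 / (1 + 1 / bf)

table = {(n, z): round(post_prob_h0(n, z), 2)
         for n in (1, 10, 100, 1000) for z in (1.96, 2.576)}
```

The growth of p(H₀ | x) with n at fixed z is exactly the Lindley paradox phenomenon discussed above.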
reduced level and vice versa. A more detailed discussion of the subject can be found in Berger and Sellke (1987).

Assume now the same situation but with an unknown observational variance, and take φ = σ⁻². A prior for φ must also be specified. Since the hypotheses do not involve φ, it is reasonable to take the same marginal prior for φ under both hypotheses. Adopting conjugate priors under both hypotheses gives

n₀σ₀²φ ~ χ²_{n₀} under H₀ and H₁, and θ | φ, H₁ ~ N(μ, (cφ)⁻¹).

The relevant quantities for the calculation of the Bayes factor are

p(x | H₀) = ∫ p(x | θ₀, φ) p(φ | H₀) dφ and
p(x | H₁) = ∫∫ p(x | θ, φ) p(θ | φ, H₁) p(φ | H₁) dθ dφ,

where all the above densities are known. Substituting their expressions gives

BF(H₀; H₁) = [(c + n)/c]^{1/2} { [n₀σ₀² + (n − 1)s² + [cn/(c + n)](x̄ − μ)²] / [n₀σ₀² + (n − 1)s² + n(x̄ − θ₀)²] }^{(n₀+n)/2},

where now s² = Σ(Xᵢ − X̄)²/(n − 1). It is interesting to study the behaviour of the Bayes factor in extreme situations, such as with non-informative priors. Taking c, n₀σ₀² → 0 and assuming as before that μ = θ₀ gives

BF(H₀; H₁) → [(c + n)/c]^{1/2} { [(n − 1)s² + [c/(c + n)]n(x̄ − θ₀)²] / [(n − 1)s² + n(x̄ − θ₀)²] }^{n/2}
= k^{−1/2} [(n − 1 + kt²)/(n − 1 + t²)]^{n/2},

where k = c/(c + n) and t = √n (x̄ − θ₀)/s. The above expression is graphed in Figure 6.2. The Bayes factor is a symmetric function of the sample values through the statistic t. It varies from the highest value of support of H₀ when t = 0 to a minimum value when |t| → ∞, as expected.
Fig. 6.2 Bayes factor for H₀: θ = θ₀ versus H₁: θ ≠ θ₀ as a function of t.
In the general case, the hypotheses can be incorporated into the prior distribution and the problem of hypothesis testing can be thought of as comparison of possible alternative distributions for θ. For example, in the above test, the priors corresponding to H₀ and H₁ would be P(θ = θ₀ | φ) = 1 and θ | φ ~ N(μ, (cφ)⁻¹). The prior corresponding to H₀ is degenerate. This will be the case whenever one of the hypotheses corresponds to a parameter space with smaller dimension than the complete parameter space Θ. It will be seen in Chapter 8 that it is common to test if a few of the components of θ are null. Writing θ = (θ₁, θ₂), where θ₁ is q-dimensional and θ₂ is (p − q)-dimensional, one may wish to test whether θ₂ = 0. In this case, p(θ | H₀) will be concentrated on a q-dimensional subspace of Rᵖ passing through the point θ₂ = 0.
6.4
Hypothesis testing and confidence intervals
You may have noticed the strong connection between hypothesis testing and confidence intervals. This connection is clearer within the frequentist framework, but it is also relevant under the Bayesian approach, as will shortly be seen.

Starting with the classical approach, consider a level α test for the hypothesis that θ = θ₀. Assume that, based on a given sample size, the test yields an acceptance region A = A(θ₀), where it is important to explicitly denote the dependence of the region on θ₀. So,
P(X ∈ A(θ₀) | θ₀) = 1 − α.

Varying the value of θ₀ in Θ leads to a family of regions, all depending on θ and with probability 1 − α. This automatically implies that, after observing the value x of X, the region

{θ: x ∈ A(θ)}

will have confidence 1 − α.

Example. Let X = (X₁, …, Xₙ) be a random sample from the N(θ, σ²) distribution with known σ². The UMP unbiased test of level α to test H₀: θ = θ₀ versus H₁: θ ≠ θ₀ has acceptance region {X: θ₀ − z_{α/2}σ/√n ≤ X̄ ≤ θ₀ + z_{α/2}σ/√n}. This interval can be rewritten as {X: X̄ − z_{α/2}σ/√n ≤ θ₀ ≤ X̄ + z_{α/2}σ/√n}. Replacing now θ₀ by θ gives the 1 − α confidence interval for θ:

X̄ − z_{α/2}σ/√n ≤ θ ≤ X̄ + z_{α/2}σ/√n.

Of course, this is exactly the same interval as that obtained in Section 4.4. This relation is more heavily explored in the next section, when asymptotic tests are introduced. In fact, this relation can also be used in the reverse direction, with hypothesis tests obtained from confidence regions. To see this, suppose that {θ: G(x, θ) ∈ C} is a 100(1 − α)% confidence region for θ. Then, for every value θ₀ of θ, it is true that P(G(X, θ₀) ∈ C | θ₀) = 1 − α. A level α test for the hypothesis θ = θ₀ can be defined by the critical region {X: G(X, θ₀) ∉ C}.

This brings us naturally to an alternative definition of Bayesian hypothesis testing. The method consists in constructing a credibility region for θ with probability 1 − α and accepting the hypothesis θ = θ₀ if the above region contains the value θ₀. As before, one should aim to construct HPD regions (intervals, in the scalar case), or at least regions with the smallest possible volume (length, in the scalar case). The value of α is typically low, but there is no prescription about the value to be adopted in any given problem. This method was proposed and extensively used by Lindley (1965).
Example. Taking again a random sample X = (X₁, …, Xₙ) from the N(θ, σ²) distribution with known σ² and prior θ ~ N(μ, τ²) gives the posterior θ | x ~ N(μ₁, τ₁²). The 100(1 − α)% confidence interval for θ is given by [μ₁ − τ₁ z_{α/2}, μ₁ + τ₁ z_{α/2}]. The hypothesis θ = θ₀ can be accepted if θ₀ belongs to the interval above. In the non-informative case, μ₁ → X̄ and τ₁ → σ/√n, and the hypothesis is accepted if θ₀ belongs to the interval [X̄ − z_{α/2}σ/√n, X̄ + z_{α/2}σ/√n], coinciding with the classical test.

In the previous section, Bayesian tests of a simple hypothesis in the form θ = θ₀ were defined by attributing a lump prior probability π to this hypothesis, while the remaining 1 − π was distributed according to some probability distribution of θ | H₁. This specification is criticized for giving a sometimes unjustified special, different status to the value θ₀. This discussion is in fact wider and goes beyond the boundaries of Bayesian thinking. It has to do with the capacity of a single hypothesis to represent adequately the situation under study. Some authors suggest that these hypotheses are always at most a useful approximation of more realistic hypotheses where θ actually belongs to a neighbourhood of θ₀. We will not pursue this discussion here further than acknowledging that the spectrum of thoughts on this matter goes from total rejection to total acceptance of this formulation. Good references for this discussion are Berger and Delampady (1987) and Lindley (1993).
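The credible-interval test of this section reduces to a conjugate normal update followed by an interval check. A minimal sketch (the prior and data values are illustrative):

```python
from math import sqrt

def normal_posterior(xs, sigma2, mu, tau2):
    """Conjugate update: posterior N(mu1, tau1^2) for theta given a
    N(theta, sigma2) sample and a N(mu, tau2) prior."""
    n = len(xs)
    xbar = sum(xs) / n
    prec = 1 / tau2 + n / sigma2               # posterior precision
    mu1 = (mu / tau2 + n * xbar / sigma2) / prec
    return mu1, 1 / prec

def accept_h0(theta0, mu1, tau1sq, z=1.96):
    """Accept theta = theta0 iff it lies in the 95% HPD interval."""
    half = z * sqrt(tau1sq)
    return mu1 - half <= theta0 <= mu1 + half

# Nearly non-informative prior (huge tau2): recovers the classical interval.
mu1, tau1sq = normal_posterior([0.1, -0.2, 0.3, -0.1, 0.2], 1.0, 0.0, 1e8)
```

With a vague prior the posterior mean approaches X̄ and the posterior variance approaches σ²/n, matching the classical test as stated in the example.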
6.5
Asymptotic tests
Asymptotic tests are those based on an asymptotic approximation for the distribution of the test quantity, irrespective of whether it is under the Bayesian or frequentist paradigm. This is a very broad definition, including results based on the central limit theorem. We shall concentrate here on tests based on the MLE, score function and MLR. In most cases we are typically led to a limiting χ² distribution. The use of asymptotic theory to develop useful test statistics was carried out by Bartlett, Wald and Wilks, among others, in the 1940s.

In many cases it is not possible to obtain analytically the exact distribution of the MLR λ(X), and asymptotic methods are frequently used. Alternative approximating methods were described in the previous chapter. Suppose that θ ∈ Θ ⊂ Rᵖ and one wishes to test the hypothesis H₀: θ = θ₀. In this case, a Taylor series expansion of the function L(θ₀; X) = log p(X | θ₀) around θ̂ gives

L(θ₀; X) ≈ L(θ̂; X) + [U(X; θ̂)]′(θ₀ − θ̂) − (1/2)(θ₀ − θ̂)′ J(θ̂)(θ₀ − θ̂)
= L(θ̂; X) − (1/2)(θ₀ − θ̂)′ J(θ̂)(θ₀ − θ̂),

since U(X; θ̂) = 0. The higher-order terms are neglected since, under H₀, θ₀ and θ̂ are close for n large. Therefore,

−2 log λ(X) = −2 log [p(X | θ₀)/p(X | θ̂)] = −2[L(θ₀; X) − L(θ̂; X)] ≈ (θ₀ − θ̂)′ J(θ̂)(θ₀ − θ̂).

Since the MLE is asymptotically normal and J(θ̂)/n converges almost surely to its expectation I(θ₀)/n under H₀, the quadratic form on the right-hand side of the equation has a χ²_p asymptotic distribution. Therefore, the asymptotic distribution of −2 log λ(X) is χ²_p and the test with asymptotic level α accepts H₀ if −2 log λ(X) < χ²_{α,p}. If the null hypothesis is of the form H₀: θ ∈ Θ₀ with
dim Θ₀ = p − q > 0, then the asymptotic distribution of −2 log λ(X) is χ²_q and the test with asymptotic level α accepts H₀ if −2 log λ(X) < χ²_{α,q}.
These results are also useful in the construction of confidence intervals. The idea is once again to explore the relation between confidence intervals and tests. Suppose that H₀: θ = (θ₁, …, θ_{p−q}, θ_{p−q+1,0}, …, θ_{p,0}); in other words, the last q components of θ are fixed. Let the MLR now be denoted by λ(X | θ_{p−q+1,0}, …, θ_{p,0}). A 100(1 − α)% confidence region for (θ_{p−q+1}, …, θ_p) is given by

{(θ_{p−q+1}, …, θ_p): −2 log λ(X | θ_{p−q+1}, …, θ_p) < χ²_{α,q}}.

Similarly, from the Bayesian point of view, the asymptotic posterior distribution of −2 log λ(x | θ_{p−q+1}, …, θ_p) is χ²_q. Tests and confidence regions can be constructed as described before. Note that we have chosen to test for known values of the last q components for simplicity. The same Bayesian and frequentist results hold for a test on any q components of θ.
The asymptotic distributions of the score function and of the MLE lead to two classes of classical tests. Assume that θ ∈ Θ ⊂ Rᵖ and one wishes to test H₀: θ = θ₀. Defining

W_E(θ₀) = (θ̂ − θ₀)′ I(θ₀)(θ̂ − θ₀) and W_U(θ₀) = [U(X; θ₀)]′ I⁻¹(θ₀) U(X; θ₀),

we have that both statistics have a χ²_p asymptotic distribution under H₀. So, the tests of asymptotic level α reject H₀ if W_E(θ₀) > χ²_{α,p} and W_U(θ₀) > χ²_{α,p}, respectively. Given the almost sure convergence of J(θ̂) and J(θ₀) to I(θ₀), these replacements can be made in the definitions of W_E and W_U and the same results are obtained. Equivalent Bayesian results can be obtained: the asymptotic posterior distributions of W_E(θ) and W_U(θ) are χ²_p. These results were partially presented in Section 5.3. Hypothesis testing can then be made as described in the previous section.
The tests based on W_E and W_U can also be applied to situations where dim Θ₀ = p − q > 0. In this case, the value of θ₀ is replaced in their expressions by θ̂₀, the estimator of θ under H₀. Their asymptotic distribution becomes a χ²_q, just like that of the MLR test.
Note that three general tests have been defined and they all have asymptotic χ² distribution under H₀, where the number of degrees of freedom depends on the difference between the dimensions of Θ and Θ₀. The test based on the MLR depends on maximization under both hypotheses, whereas the score test requires maximization only under the null hypothesis and the test based on the MLE requires maximization only under the alternative hypothesis. In most but not all cases, the null hypothesis provides a simplification to the model. It is then simpler to estimate parameters under the null hypothesis, which favours the score test. When estimation under the null hypothesis is harder, it is simpler to perform the test based on the MLE. A good reference for comparison between and interpretation of these tests is Buse (1982).
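As a concrete scalar illustration, the sketch below evaluates the three statistics for the Bernoulli model used earlier in the chapter (the counts are illustrative). Note that in this model W_E and W_U coincide when both are evaluated with I(θ₀):

```python
from math import log

def asymptotic_tests(t, n, theta0):
    """-2 log lambda(X), W_E and W_U for H0: theta = theta0,
    with t successes in n Bernoulli trials."""
    theta_hat = t / n                               # MLE
    loglik = lambda p: t * log(p) + (n - t) * log(1 - p)
    lr = 2 * (loglik(theta_hat) - loglik(theta0))   # -2 log lambda(X)
    info = n / (theta0 * (1 - theta0))              # Fisher information I(theta0)
    w_e = (theta_hat - theta0) ** 2 * info
    score = t / theta0 - (n - t) / (1 - theta0)     # U(X; theta0)
    w_u = score ** 2 / info
    return lr, w_e, w_u

lr, w_e, w_u = asymptotic_tests(55, 100, 0.5)  # illustrative counts
```

For samples compatible with H₀ the three statistics are numerically close, as the asymptotic theory predicts; each is referred to the χ²₁ distribution.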
Example. Assume a specific model for observations $Y_1, \ldots, Y_n$ of the form
$Y_i \sim N(f_i(\theta), \sigma^2)$ where $\theta = (\theta_0, \theta_1, \theta_2)$ and $f_i(\theta) = \theta_0 + \theta_1 \exp(\theta_2 Z_i)$,
$i = 1, \ldots, n$. Let the null hypothesis be in the form $H_0: \theta_2 = \theta_{2,0}$. Under the null
hypothesis the model becomes a simple linear regression with known regressor
variable $x_i = \exp(\theta_{2,0} Z_i)$, $i = 1, \ldots, n$. It is then simpler to estimate the model
under the null hypothesis. If, however, the null hypothesis is of the form
$H_0: \theta_1/\theta_2 = 1/3$ then estimation under the null hypothesis becomes a non-linear
problem with restriction whereas estimation under the alternative is only a non-linear
problem. In those cases, it is easier to use the test based on the MLE.
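The simplification brought by the null hypothesis in this example can be illustrated numerically. The sketch below (not from the book; the values of $\theta_{2,0}$, the $Z_i$ and the noise level are hypothetical) shows that, once $\theta_2$ is fixed at $\theta_{2,0}$, fitting the remaining parameters is ordinary linear least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: under H0 the regressor x_i = exp(theta_20 * Z_i) is known,
# so fitting f_i(theta) = theta_0 + theta_1 * x_i is ordinary linear least squares.
n, theta_20 = 50, 0.5                      # assumed values, for illustration only
Z = rng.uniform(0, 2, size=n)
y = 1.0 + 2.0 * np.exp(theta_20 * Z) + rng.normal(0, 0.1, size=n)

x = np.exp(theta_20 * Z)                   # known regressor under H0
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # (theta_0, theta_1) estimates
print(beta_hat)
```

Under the alternative, $\theta_2$ would also have to be estimated, turning the same fit into a non-linear optimization.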
Another interesting question concerns the appropriateness of the asymptotic
approximation. It can be shown that the approximation of the distribution of $\lambda(X)$
by the $\chi^2_p$ is of order $n^{-1}$. Bartlett (1947) showed that
$P(\lambda(X) \le x) = P(Z \le x) + O(n^{-1})$, where $Z \sim \chi^2_p$, and obtained a corrected MLR
statistic $\lambda^*(X) = \lambda(X)[1 + b(\theta_0)/n]^{-1}$ such that $E[\lambda^*(X)] = p + O(n^{-2})$ for many
multivariate problems. Lawley (1956) showed that all moments of $\lambda^*$ agree with
those of a $\chi^2_p$ distribution to order $n^{-2}$. Cordeiro (1987) proved that this
approximation order is also valid for the distribution function of $\lambda^*$. This result
was extended further to any test statistic with asymptotic $\chi^2$ distribution by
Cordeiro and Ferrari (1991).
A particular case of special interest is when $n$ items are observed and classified
independently into one of $p$ possible groups. The parameter of interest is the
vector with the group probabilities $\theta = (\theta_1, \ldots, \theta_{p-1})$, where the probability for
the $p$th group is obtained from the unit sum restriction. Suppose one wishes to
test $H_0: \theta = \theta_0 = (\theta_{1,0}, \ldots, \theta_{p-1,0})$. Then it can be shown (see Exercise 6.21)
that the test statistics $W_E$ and $W_U$ are given by

$$W_E(\theta_0) = \sum_{i=1}^{p} \frac{(N_i - n\theta_{i,0})^2}{N_i} \quad \text{and} \quad W_U(\theta_0) = \sum_{i=1}^{p} \frac{(N_i - n\theta_{i,0})^2}{n\theta_{i,0}}.$$

Both have an asymptotic $\chi^2_{p-1}$ distribution under $H_0$. The only difference between
them is the replacement of $N_i$ by $n\theta_{i,0}$. But, under $H_0$, $N_i/n \to \theta_{i,0}$ almost surely
when $n \to \infty$ by the strong law of large numbers. This is an indication of an
asymptotic equivalence between the two statistics. The tests reject $H_0$ when the
values of the statistics $W_E$ and $W_U$ are larger than the $1-\alpha$ quantile of the
$\chi^2_{p-1}$ distribution.
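The two statistics above are straightforward to compute. The sketch below (not from the book; the counts and null probabilities are hypothetical) evaluates both and compares them with the $\chi^2_{p-1}$ critical value:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical counts over p = 4 categories and a simple null hypothesis.
N = np.array([18, 25, 30, 27])             # observed counts N_i
theta0 = np.array([0.2, 0.25, 0.3, 0.25])  # hypothesized probabilities theta_{i,0}
n = N.sum()

WE = np.sum((N - n * theta0) ** 2 / N)             # denominators N_i
WU = np.sum((N - n * theta0) ** 2 / (n * theta0))  # denominators n*theta_{i,0} (Pearson)

p = len(N)
crit = chi2.ppf(0.95, df=p - 1)   # 1 - alpha quantile of chi^2_{p-1}, alpha = 0.05
print(WE, WU, WE > crit, WU > crit)
```

With these counts both statistics are close to each other, illustrating the asymptotic equivalence noted in the text.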
These tests try to measure how well a given hypothesis fits the data. For that
reason, they are known as goodness-of-fit tests and are heavily used in statistics
whenever a situation can be represented in $p$ mutually exclusive categories. Of
course the assessment of the fit of a model is much more general and leads to a
variety of other tests. Analogously, the Bayesian asymptotic result is that when
the above quantities are written as functions of $\theta$ (instead of $\theta_0$) they will have an
asymptotic $\chi^2_{p-1}$ posterior distribution.
When $H_0$ is no longer a simple hypothesis but has instead dimension
$p - q - 1 > 0$, the goodness-of-fit statistics are given by

$$W_E(\hat{\theta}_0) = \sum_{i=1}^{p} \frac{(N_i - n\hat{\theta}_{i,0})^2}{N_i} \quad \text{and} \quad W_U(\hat{\theta}_0) = \sum_{i=1}^{p} \frac{(N_i - n\hat{\theta}_{i,0})^2}{n\hat{\theta}_{i,0}}$$

where $\hat{\theta}_{i,0}$ is the MLE of $\theta_i$, $i = 1, \ldots, p-1$, under $H_0$. Both test statistics have
asymptotic $\chi^2_q$ distribution under $H_0$ and reject $H_0$ if and only if the value of the
statistic is larger than the $1-\alpha$ quantile of the $\chi^2_q$ distribution.
An important application of goodness-of-fit tests is the test of the fit of a given
distribution to a data set. In this case, a random sample $X = (X_1, \ldots, X_n)$ from an
unknown distribution is observed and one wishes to test whether this distribution
is of a given known form. Partitioning the line into intervals $I_i$, $i = 1, \ldots, p$, the
number of observations in each interval can be counted. These counts jointly have a
multinomial distribution with probabilities $\theta_{i,0} = P(X \in I_i \mid H_0)$ and the test of
the fit is given as above. If the null hypothesis specifies a class of distributions with
$q > 0$ unknown parameters then one should first estimate the unknown parameters,
by maximum likelihood say, under $H_0$. Then the statistics $W_E$ and $W_U$ can be
evaluated with the estimates $\hat{\theta}_{i,0} = \hat{P}(X \in I_i \mid H_0)$. The test statistics will now
have an asymptotic $\chi^2_{p-q-1}$ distribution and the level $\alpha$ test rejects $H_0$ if the value
of the test statistic is larger than the $1-\alpha$ quantile of the $\chi^2_{p-q-1}$ distribution.
Another important application of these tests, to contingency tables, is left as an
exercise.
All these tests have the property that their power function converges to 1 when
$n \to \infty$ for any parameter value in the alternative. This result can be obtained
from the Taylor series expansion of the log-likelihood around the point in the
alternative. This follows essentially from the consistency of the MLE. Tests with
this property are said to be consistent.
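The distribution-fit recipe above can be sketched in code. In this illustration (not from the book; the data and binning are hypothetical) a normal family with $q = 2$ parameters is fitted by maximum likelihood, the line is cut into $p = 5$ intervals, and $W_U$ is referred to a $\chi^2$ with $p - q - 1$ degrees of freedom:

```python
import numpy as np
from scipy.stats import norm, chi2

# Sketch of the distribution-fit test: bin the sample, estimate the q = 2 normal
# parameters by ML, and compare W_U against chi^2 with p - q - 1 degrees of freedom.
rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=200)

mu_hat, sigma_hat = x.mean(), x.std()      # MLEs under H0 (normal family)
edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
edges = np.concatenate(([-np.inf], edges, [np.inf]))  # p = 5 intervals

N, _ = np.histogram(x, bins=edges)
theta_hat = np.diff(norm.cdf(edges, mu_hat, sigma_hat))  # estimated cell probs
n = len(x)

WU = np.sum((N - n * theta_hat) ** 2 / (n * theta_hat))
p_cells, q = len(N), 2
crit = chi2.ppf(0.95, df=p_cells - q - 1)
print(WU, crit, WU > crit)
```

Cutting at sample quantiles keeps the expected counts per cell roughly balanced, which is one common (though not unique) choice of partition.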
Exercises

§ 6.2

1. Assume that $X_1, \ldots, X_n$ are iid with density $p(x \mid \theta) = \theta x^{\theta-1} I_{[0,1]}(x)$, where
   $\theta > 0$ is unknown. Determine the UMP test of level 0.05 for $H_0: \theta \le 1$
   against $H_1: \theta > 1$.
2. Let $X \sim \mathrm{bin}(n, \theta)$ and suppose one wishes to test $H_0: \theta = 1/2$ versus
   $H_1: \theta \ne 1/2$.
   (a) Show that the MLR test statistic is $|2X - n|$.
   (b) Find the critical region for a test of level 0.05 when $n = 25$ using the
       normal approximation.
3. Suppose that $k$ independent tests about the same hypothesis $H: \theta = \theta_0$
   have been performed using different data sets and were based on independent
   statistics $T_1, \ldots, T_k$ with continuous distributions under $H_0$. Let
   $\alpha(T_1), \ldots, \alpha(T_k)$ be their respective $p$-values.
   (a) Show that $\alpha(T_1), \ldots, \alpha(T_k)$ form a random sample from the uniform
       distribution on $(0, 1)$.
   (b) Define $F = -2 \sum_{i=1}^{k} \log \alpha(T_i)$ as the test statistic for a combined test
       of $H$. Which values of $F$ should lead to the rejection of the hypothesis $H$?
   (c) Show that $F \sim \chi^2_{2k}$ under $H$ and specify the critical region of a level
       $\alpha$ test.
4. Let $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ be independent samples from
   the exponential distributions with means $\theta$ and $\phi$ respectively and suppose
   we wish to test the hypothesis $H_0$ of equality of the distributions.
   (a) Show that the MLR test rejects $H_0$ when $\bar{X}/\bar{Y}$ is either sufficiently
       small or sufficiently large.
   (b) Show that the value of $c$ is given by
       $$c = \frac{1 + F_\beta(2n, 2n)}{2}$$
       for the level $\alpha$ test, where $\beta$ is such that
       $P[F_\beta(2n, 2n) < F(2n, 2n) < F_\beta(2n, 2n) + 2] = 1 - \alpha$.
       Hint: show that under $H_0$, $\bar{X}/\bar{Y} \sim F(2n, 2n)$.
5. Show that the power of the $t$ test is a strictly increasing function of $|\theta - \theta_0|$.
   What is the expression of the $p$-value for this test?
6. Show that the uniform distribution over the interval $[0, \theta]$ has monotone LR
   because the ratio of sampling densities is a monotonically non-increasing
   function of $T(X) = \max_i X_i$.
7. Consider a random sample $X_1, \ldots, X_n$ from the $N(\theta, \sigma^2)$ distribution with
   both parameters unknown and define the hypotheses $H_0: \theta = \theta_0$ and
   $H_1: \theta \ne \theta_0$. Show that working with the full parameter space $\Theta$ instead of
   the parameter space $\Theta_1$ under $H_1$ amounts to a zero probability change in
   the MLR statistic.
8. Let $X_1, \ldots, X_p$ be independent random variables with respective
   $\mathrm{Pois}(\mu_i)$ distributions, $i = 1, \ldots, p$, and suppose we wish to test the
   hypothesis $H_0$ of the equality of the distributions.
   (a) Obtain the MLR test for $H_0$.
   (b) Show that $(X_1, \ldots, X_p) \mid X_1 + \cdots + X_p = s$ has a multinomial
       distribution with parameters $s$ and $(\theta_1, \ldots, \theta_p)$ where
       $\theta_i = \mu_i/(\mu_1 + \cdots + \mu_p)$, $i = 1, \ldots, p$.
   (c) Justify the following test (commonly used in such situations): reject $H_0$
       if $\sum_{i=1}^{p} (X_i - \bar{X})^2/\bar{X} \ge \chi^2_{p-1,\alpha}$.
   (d) Are the test in (a) and the test in (c) similar?
9. Prove that the tests in Section 6.2.3 are unbiased, similar and invariant under
   linear transformations.
§ 6.3

10. Let $X \sim \mathrm{Exp}(\theta)$ and assume one wishes to test $H_0: \theta = \theta_0$ versus
    $H_1: \theta = \theta_1$, where $\theta_1 < \theta_0$.
    (a) Prove that the level $\alpha$ likelihood test accepts $H_0$ if
        $X < -\theta_0^{-1} \log \alpha$.
    (b) Obtain the $p$-value associated with $X = 3$, when $\theta_0 = 1$.
    (c) Assuming that $\theta_1 = 1/2$, calculate the BF for $X = 3$ and $\theta_0 = 1$.
    (d) Supposing $p(H_0) = p(H_1)$, calculate the posterior probability of $H_0$.
    (e) Compare the results of the classical test obtained in item (a) with those
        from the Bayesian test.
11. Suppose that $X \sim \mathrm{Cauchy}(\theta, 1)$ and that one wishes to test $H_0: \theta = 0$
    versus $H_1: \theta \ne 0$. To do that, set a mass probability $\pi > 0$ to $H_0$ and
    distribute the remaining probability over $H_1$ according to a density $p(\theta)$.
    (a) Prove that $p(H_0 \mid x) \to \pi$ when $|x| \to \infty$.
    (b) What conclusions can be drawn from this result?
12. The data set in Table 6.2 represents the number of vehicles that travel through
    a section of a highway during a 15 minute interval in the afternoons of six
    consecutive days of two consecutive weeks.
    Table 6.2 Number of vehicles

    Week   Mon   Tue   Wed   Thu   Fri   Sat
    1st     50    65    52    63    84   102
    2nd     56    49    60    45   112    90

    Assume that the number of vehicles follows a Poisson distribution, with
    average number of vehicles $\lambda$ from Monday to Thursday and $\mu$ for Friday
    and Saturday.
    (a) Obtain 95% credibility intervals for $\lambda$ and $\mu$ based on non-informative
        priors and comment on the results.
    (b) Test the hypothesis that $\lambda = \mu$.
13. Suppose that independent random variables $X_1$ and $X_2$, where
    $P(X_i = 1) = \theta_i$ and $P(X_i = 0) = 1 - \theta_i$, $i = 1, 2$, are observed and one
    wishes to test $H_0: \theta_1 = \theta_2 = \theta$ versus $H_1: \theta_1 \ne \theta_2$. Assume also that
    all prior distributions are uniform, that is, under $H_0$, $\theta$ is uniform over the
    unit interval, and under $H_1$, $(\theta_1, \theta_2)$ is uniform over the unit square.
    (a) Show that the MLE of $\theta$ under $H_0$ is $(X_1 + X_2)/2$ and the MLEs of
        $\theta_i$ under $H_1$ are $X_i$, $i = 1, 2$.
    (b) Obtain the posterior distributions under the two hypotheses, that is, the
        distributions of $\theta \mid X, H_0$ and $(\theta_1, \theta_2) \mid X, H_1$.
    (c) Obtain the GMLEs under $H_0$ and $H_1$ and compare them with the MLEs
        obtained in item (a).
    (d) Obtain the predictive distributions of $(X_1, X_2)$ under $H_0$ and under
        $H_1$.
    (e) Show that $BF(H_0, H_1) = 4/3$ if $x_1 = x_2$, and $4/6$ if $x_1 \ne x_2$.
        Interpret the result.
14. Suppose one wishes to verify if a quantity $\theta$ is smaller than a prespecified
    value $\theta_0$. Therefore, assume $\theta \sim N(\mu, \tau^2)$ and observe
    $X \mid \theta \sim N(\theta, \sigma^2)$, $\sigma^2$ known.
    (a) Obtain the prior probability of $H_0: \theta < \theta_0$.
    (b) Obtain the posterior probability of $H_0$.
    (c) Prove that the probability of $H_0$ increases after observing $X = x$ iff
        $x < 2\theta_0(1 - 1/\sqrt{2})$, in the case $\sigma^2 = \tau^2$ and $\mu = 0$.
§ 6.4

15. Construct a level $\alpha$ test for the hypothesis $\theta = \theta_0$ from confidence
    intervals, assuming that independent $X_i \sim \mathrm{Pois}(\theta t_i)$, $i = 1, \ldots, n$, are
    observed with $t_i$, $i = 1, \ldots, n$, known. Can the hypothesis be accepted if
    $\sum_{i=1}^{n} X_i = 10$, $\sum_{i=1}^{n} t_i = 5$, $\theta_0 = 1$ and $\alpha = 0.05$? And if
    $\alpha = 0.01$?
16. Let $X_1 \sim N(\theta_1, 1)$ and $X_2 \sim N(\theta_2, 1)$ be independent and define
    $\rho = \theta_1/\theta_2$.
    (a) Show that the test that rejects the hypothesis $\rho = \rho_0$ when
        $|X_1 - \rho_0 X_2| > (1 + \rho_0^2)^{1/2} z_{\alpha/2}$ has level $\alpha$.
    (b) Obtain a $100(1-\alpha)\%$ confidence region for $\rho$ from the test described
        in (a). Draw a graph of the region and interpret the result.

§ 6.5

17. Let $X = (X_1, \ldots, X_n)$ be a random sample from the $\mathrm{Exp}(\theta)$ distribution
    and suppose one wishes to test $H: \theta \le 1$.
    (a) Show that the likelihood ratio test rejects $H$ when $\sum_{i=1}^{n} X_i < c$.
    (b) What is the value of $c$ for the test with level $\alpha$?
    (c) Show that the power of this test is a strictly monotonic function of $\theta$.
    (d) Draw a graph of the power for $\alpha = 0.05$ and $n = 15$.
    (e) Construct a test based on asymptotic results and compare it with the test
        based on exact results.
18. Let $X = (X_1, \ldots, X_{10})$ be a random sample from an unknown distribution.
    After observing $X = (1, 0.7, 0.2, -1.3, -0.5, 1.52, -0.85, 0.25, 0.47, -0.67)$,
    test whether one can assume that the data comes from a standard normal
    distribution. Check also whether one can assume that the data comes from a
    normal distribution. Obtain the $p$-values of the two tests and compare them.
19. Show that the $t$ test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ based on a
    random sample of size $n$ from the $N(\theta, \sigma^2)$, $\theta$ and $\sigma^2$ unknown, is
    consistent.
20. Suppose that $X_1, \ldots, X_n$ are iid $N(\theta, \sigma^2)$ and one wishes to test
    $H_0: \sigma = \sigma_0$ versus $H_1: \sigma \ne \sigma_0$.
    (a) Show that the MLR test with asymptotic level $\alpha$ accepts $H_0$ if
        $\sum_{i=1}^{n} (X_i - \bar{X})^2/\sigma_0^2 \in [c_1, c_2]$, where $c_1$ and $c_2$ are such
        that $F(c_2) - F(c_1) = 1 - \alpha$ and $F$ is the distribution function of the
        $\chi^2_{n-1}$ distribution.
    (b) What condition must be satisfied by $c_1$ and $c_2$ for an unbiased test?
    (c) Show that the normal approximation gives $c_1 = n - \sqrt{2n}\, z_{\alpha/2}$ and
        $c_2 = n + \sqrt{2n}\, z_{\alpha/2}$, where $z_{\alpha/2}$ is the $1 - \alpha/2$ quantile of
        the $N(0, 1)$ distribution.
    (d) Show that the equal tail test, where $F(c_2) = 1 - \alpha/2$ and
        $F(c_1) = \alpha/2$, is asymptotically unbiased, i.e. that the test is unbiased
        when $n \to \infty$.
21. Prove that in the case of multinomial observations, the statistics $W_E$ and
    $W_U$ are respectively given by
    $$W_E(\theta_0) = \sum_{i=1}^{p} \frac{(N_i - n\theta_{i,0})^2}{N_i} \quad \text{and} \quad W_U(\theta_0) = \sum_{i=1}^{p} \frac{(N_i - n\theta_{i,0})^2}{n\theta_{i,0}}.$$
22. A contingency table is a table of multiple classification of observational units
    into cells. In the simplest case of double entry, the cells are defined by the
    intersection of two factors: $A$ with levels $A_1, \ldots, A_p$ and $B$ with levels
    $B_1, \ldots, B_r$. Define the probabilities $\theta_{ij} = P(A = A_i, B = B_j)$,
    $\forall (i, j)$, and assume that $n$ independent observations are made, each one
    of them is classified into a single cell $(i, j)$ and the counts $N_{ij}$ associated
    with the cells registered.
    (a) Show that $\{N_{ij}\}$ has multinomial distribution with parameters $n$ and
        $\{\theta_{ij}\}$. Define the parametric space $\Theta$ and obtain the MLEs of
        $\{\theta_{ij}\}$, $\forall (i, j)$.
    (b) The factors $A$ and $B$ are said to be independent if
        $P(A = A_i, B = B_j) = P(A = A_i)P(B = B_j)$. Define the parametric space
        $\Theta_0$ under independence between $A$ and $B$ and calculate its dimension.
        Hint: define $\theta_{i+} = \sum_{j=1}^{r} \theta_{ij}$ and $\theta_{+j} = \sum_{i=1}^{p} \theta_{ij}$.
    (c) Calculate the MLEs of $\{\theta_{ij}\}$ under $\Theta_0$.
    (d) Obtain the goodness-of-fit and MLR tests of level $\alpha$, specifying the
        critical regions and the distributions involved.
7 Prediction

All that has been done up to now concerns estimation, that is, inference about
unobservable quantities. The big and fundamental difference here is that all
statements will be confronted with reality and are subject to approval or dismissal
without dispute. Even though notions of decision theory were introduced when we
dealt with Bayesian estimation, we think the appropriate moment for decision
taking is when we make predictions. We have a clear view of the consequences and
respective losses associated with our acts when we state our positions about
quantities that will become known.
7.1
Bayesian prediction

The typical situation of the prediction problem is that in which a quantity $X$,
related to an unobserved quantity $\theta$ through $p_1(x \mid \theta)$, is observed and we are
interested in producing statements about another random quantity $Y$ that is related
to $X$ and $\theta$ through $p_2(y \mid \theta, x)$. So, after observing $X = x$ we have updated
information to make an inference about $Y$ and this information is contained in the
distribution of $Y \mid X$. Therefore, the distribution of $Y \mid X = x$ is

$$p(y \mid x) = \int p(y, \theta \mid x)\, d\theta = \int p(y \mid \theta, x)\, p(\theta \mid x)\, d\theta = \int p(y \mid \theta)\, p(\theta \mid x)\, d\theta$$

with the last equality valid in the common case of conditional independence between
$X$ and $Y$ given $\theta$. That happens, for instance, when we sample future and past
observations from the same population. In the trivial case in which it is possible
to directly specify the distribution of $Y \mid x$, the above calculations involving
the removal of unobserved quantities are not needed.
In the case of a random sample $X = (X_1, \ldots, X_n)$ from $p(\cdot \mid \theta)$ and a single
future observation $Y$ that also comes from the same population, that is, $Y$ has
density $p(\cdot \mid \theta)$, then

$$p(y \mid x) = \int p(y \mid \theta)\, p(\theta \mid x)\, d\theta \quad \text{and} \quad p(\theta \mid x) \propto p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)$$

where $p(y \mid \theta)$ and $p(x_i \mid \theta)$ have the same form. One can then write

$$p(y \mid x) = E_{\theta \mid x}[p(y \mid \theta)].$$
In the case of prediction in samples from the one-parameter exponential family
with conjugate prior, we have, using the notation of Section 3.2, that

$$p(x \mid \theta) = a(x) \exp\{u(x)\phi(\theta) + b(\theta)\}$$

$$p(\theta) = k(\alpha, \beta) \exp\{\alpha\phi(\theta) + \beta b(\theta)\}$$

and so

$$p(\theta \mid x) = k\left(\alpha + \sum_{i=1}^{n} u(x_i),\, \beta + n\right) \exp\left\{\left[\alpha + \sum_{i=1}^{n} u(x_i)\right]\phi(\theta) + [\beta + n]\, b(\theta)\right\}$$

and

$$p(x) = \prod_{i=1}^{n} a(x_i)\, \frac{k(\alpha, \beta)}{k\left(\alpha + \sum_{i=1}^{n} u(x_i),\, \beta + n\right)}.$$

The expression of $p(y \mid x)$ is obtained similarly to $p(x)$, using the posterior
density $p(\theta \mid x)$ instead of $p(\theta)$. So

$$p(y \mid x) = a(y)\, \frac{k\left(\alpha + \sum_{i=1}^{n} u(x_i),\, \beta + n\right)}{k\left(\alpha + \sum_{i=1}^{n} u(x_i) + u(y),\, \beta + n + 1\right)}.$$
Example. Let $X_1, \ldots, X_n$ be a sample from the exponential distribution with
parameter $\theta$. The conjugate distribution is the $G(\gamma, \delta)$. The densities can be
written in the form

$$p(x \mid \theta) = \exp\{-x\theta + \log\theta\}, \quad x > 0$$

$$p(\theta) = \frac{\delta^\gamma}{\Gamma(\gamma)} \exp\{-\delta\theta + (\gamma-1)\log\theta\}, \quad \theta > 0.$$

Identifying with the above notation, we have

$$\alpha = -\delta, \quad \beta = \gamma - 1, \quad k(\alpha, \beta) = \frac{(-\alpha)^{\beta+1}}{\Gamma(\beta+1)}, \quad a(x) = 1, \quad u(x) = -x$$

from which we get, rewriting as a function of $y$, $\gamma$ and $\delta$,

$$p(y \mid x) = \frac{\gamma + n}{\delta + \sum_{i=1}^{n} x_i + y}\left(1 + \frac{y}{\delta + \sum_{i=1}^{n} x_i}\right)^{-(\gamma+n)}, \quad y > 0.$$

Note that this density is strictly decreasing with $y$.

With the predictive distribution, we can proceed to an inference as previously
seen in Chapter 4. In particular, we can make an inference by point prediction. To
do this, we put the problem into the framework of decision theory whose elements
are:

1. States of nature - here, represented by the possible values of the quantity $Y$
to be observed in the future and that we wish to predict.
2. Space of possible actions - containing possible actions to be taken. Here,
taking an action is to choose a value for $Y$, its point predictor $\delta$.
3. Loss function - to each possible value of $Y$ and to each predictor $\delta$ we
have a loss $L(Y, \delta)$.
Put in this form, the problem is mathematically identical to that from Section 4.1,
that is, we choose the predictor $\delta$ so as to minimize the expected loss

$$\int L(y, \delta)\, p(y \mid x)\, dy.$$
This decision theoretic approach to prediction is pursued by Aitchison and
Dunsmore (1975) and Geisser (1993). The difference with respect to parameter
estimation is not in the equation but in the interpretations of its elements. We can
objectively quantify the loss we will incur by predicting $Y$ by $\delta$ because $Y$ is
observable. This quantification is less clear in the case of estimation and only
makes sense when related to observed quantities associated with the parameters.
The example of John and his doctor in Section 4.1 illustrates this point. It was only possible to construct the loss table after referring to observed quantities in
the future such as death and definition of the disease state of the patient.
That is due to the fact that we do not take decisions against values of theoretical
objects used to understand a phenomenon (parameters) but only after evaluating the consequences these values will have upon observables. This point is the basis of the predictivist approach to inference which itself is not free from arguments.
Its advantage however is that it allows judgments that are not ambiguous with a clear and unquestionable meaning. A prediction can always be confronted against
reality while estimation never is.
So, a point predictor can be chosen according to the loss function we incur.
Using results from Section 4.1, we have that:

1. The predictor associated with quadratic loss is the mean of the predictive
distribution.
2. The predictor associated with absolute loss is the median of the predictive
distribution.
3. The predictor associated with the 0-1 loss is the mode of the predictive
distribution.
Example (continued). The predictors described above are given by:

(a) The predictive mean,
$$E(Y \mid x) = E_{\theta \mid x}[E(Y \mid \theta)] = E_{\theta \mid x}[1/\theta] = \frac{\delta + \sum_{i=1}^{n} x_i}{\gamma + n - 1}.$$
(b) The solution med of the equation
$$\frac{1}{2} = \int_0^{\text{med}} p(y \mid x)\, dy, \quad \text{given by} \quad \text{med} = \left(\delta + \sum_{i=1}^{n} x_i\right)\left(2^{1/(\gamma+n)} - 1\right).$$
(c) 0, the mode of the predictive distribution of $Y \mid x$.
Observe that for large $n$, the predictive mean is approximately equal to $\bar{X}$
and for the non-informative prior ($\gamma, \delta \to 0$) the predictive mean is

$$E(Y \mid x) = \frac{\sum_{i=1}^{n} x_i}{n - 1}.$$
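As a numerical check of the formulas in this example, the sketch below (not from the book; the hyperparameters and data values are hypothetical) evaluates the predictive density, verifies that it integrates to one, and recovers the three point predictors:

```python
import numpy as np
from scipy.integrate import quad

# Sketch of point prediction for the exponential/gamma example, with hypothetical
# hyperparameters (gamma0, delta0) and data x.
gamma0, delta0 = 2.0, 1.0
x = np.array([0.8, 1.3, 0.5, 2.1])
n, S = len(x), x.sum()

def pred_density(y):
    # p(y|x) = (gamma0+n)/(delta0+S+y) * (1 + y/(delta0+S))**-(gamma0+n), y > 0
    return (gamma0 + n) / (delta0 + S + y) * (1 + y / (delta0 + S)) ** -(gamma0 + n)

total, _ = quad(pred_density, 0, np.inf)               # density integrates to 1
mean = (delta0 + S) / (gamma0 + n - 1)                 # quadratic-loss predictor
median = (delta0 + S) * (2 ** (1 / (gamma0 + n)) - 1)  # absolute-loss predictor
mode = 0.0                                             # 0-1 loss predictor (density decreasing)
print(total, mean, median, mode)
```

Half of the predictive mass indeed lies below the closed-form median, which can be confirmed by integrating the density up to that point.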
In analogy to the estimation problem, $100(1-\alpha)\%$ predictive regions may be
obtained for $Y$. All that is needed is to find a region $C$ such that
$P(Y \in C \mid x) \ge 1 - \alpha$. In the case of a scalar $Y$, the region $C$ may be
reduced to an interval $[a_1, a_2]$ satisfying $P(a_1 < Y < a_2 \mid x) \ge 1 - \alpha$. The
same comments are still valid here that for a given value of $\alpha$ one wishes to find
the interval with smallest possible length, leading naturally to the choice of
regions where the predictive density is higher. This is formalized by the concept of
regions of highest predictive density (HPRD) $C$ given by

$$C = \{y : p(y \mid x) \ge k(\alpha)\}$$

where $k(\alpha)$ is the largest constant guaranteeing that $P(Y \in C \mid x) \ge 1 - \alpha$.
Example (continued). As the predictive density is strictly decreasing, the HPRD
interval for $Y$ must be of the form $[0, a]$ where $a$ is such that

$$\int_0^a p(y \mid x)\, dy = 1 - \alpha.$$

Solving for $a$ gives

$$a = \left(\delta + \sum_{i=1}^{n} x_i\right)\left((1/\alpha)^{1/(\gamma+n)} - 1\right).$$

In the case of a large sample, the posterior distribution of $\theta \mid x$ gets
concentrated around $\hat{\theta}$, the MLE of $\theta$. So,

$$p(y \mid x) = \int p(y \mid \theta)\, p(\theta \mid x)\, d\theta \approx p(y \mid \hat{\theta}).$$

This approximation neglects the variability of the parameter and leaves only the
sampling variability of the quantity to be predicted. In the general case, both forms
of variability are important and should be taken into account.
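The difference between the exact predictive interval and the plug-in approximation can be seen numerically. In this sketch (not from the book; hyperparameters and data are hypothetical), the exact HPRD bound is compared with the corresponding quantile of $\mathrm{Exp}(\hat{\theta})$:

```python
import numpy as np

# Sketch contrasting the exact HPRD bound a = (delta0+S)((1/alpha)**(1/(gamma0+n)) - 1)
# with the plug-in approximation p(y|x) ~ p(y|theta_hat); values are hypothetical.
gamma0, delta0, alpha = 2.0, 1.0, 0.05
x = np.array([0.8, 1.3, 0.5, 2.1])
n, S = len(x), x.sum()

a_exact = (delta0 + S) * ((1 / alpha) ** (1 / (gamma0 + n)) - 1)

theta_hat = n / S                        # MLE of the exponential rate
a_plugin = -np.log(alpha) / theta_hat    # (1-alpha) quantile of Exp(theta_hat)
print(a_exact, a_plugin)                 # plug-in ignores parameter uncertainty
```

The exact bound is the wider of the two, reflecting the parameter uncertainty that the plug-in approach discards.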
7.2
Classical prediction
There are no clear rules as to how to proceed to make predictions of future
observations in classical inference. One of the most frequently used procedures is
to substitute the value of the parameter appearing in the sampling distribution of
future observations by some estimate based on past data. Specifically, assuming
that $Y$ with sampling distribution $p(y \mid \theta)$ must be predicted based on
observations $X$ from a sample of $p(x \mid \theta)$, one uses the distribution
$p(y \mid \hat{\theta})$ where $\hat{\theta}$ is an estimator of $\theta$. The most common choice is
the MLE, which takes us back to the discussion in the final paragraph of the last
section. The drawback of this procedure is that it does not take into account the
variability associated with the estimation of $\theta$.
As an example, assume that $E(Y \mid \theta) = b(\theta)$ and that $V(Y \mid \theta) = B(\theta)$. It
is common practice to take $b(\hat{\theta})$ as a point predictor of $Y$ and $B(\hat{\theta})$ as
a measure of the variability associated with that prediction. This procedure
underestimates the variability of the prediction by not taking into account the
variability of the estimation of $\theta$.
The maximum likelihood estimator of 8 is
1/X
and the
mean of Y is I ;e. After doing the substitutions we get the point predictor X for Y. Note that the estimator of the variance of Y, of
e with respect to e.
1 ; X2 , does not consider the variability
One approach avoiding this problem consists in obtaining a pivotal quantity
G(Y, X) whose distribution does not depend on
0.
This approach was used in
Section 4.4 in interval estimation. As before, the function G must depend on the
sample X in an optimal form (for instance, through minimal sufficient statistics
for
0).
Once the pivot is obtained, one can make probabilistic statements about it.
C)
::=:
1
In particular, for a given value of a (a E -a.
(0, I)), one can obtain that P (G(Y, X) E 100(1 -a)% predictive
·Then, it becomes possible to construct a
region for Y given by
{y: G(y, x)
E
C}.
The problem with this approach is that it is not always possible to find such a function G.
exceptions.
The example we are dealing with in this chapter is one of the
Example (continued). We now want to find a function $G(Y, X)$ whose distribution
does not depend on $\theta$. Preferably, this function would depend on $X$ through
$\bar{X}$, which is a minimal sufficient statistic for $\theta$. Fortunately, in this case,
this is possible since

$$Y \sim \mathrm{Exp}(\theta) \Rightarrow G(1, \theta) = 2\theta Y \sim \chi^2_2 \quad \text{and} \quad G(n, \theta) = 2\theta \sum_{i=1}^{n} X_i \sim \chi^2_{2n}$$

and they are independent. So, $Y/\bar{X} \sim F(2, 2n)$, which does not depend on
$\theta$. Using the properties of the $F$ distribution, we have that
$E(Y/\bar{X}) = 2n/(2n-2)$, from where we can take as a point predictor for $Y$

$$\bar{X}\, \frac{2n}{2n-2} = \frac{\sum_{i=1}^{n} X_i}{n - 1}$$

which differs from the predictor obtained above but coincides with the Bayesian
predictor with non-informative prior. The $100(1-\alpha)\%$ confidence interval for $Y$
can be constructed with the percentiles $\underline{F}_{\alpha/2}(2, 2n)$ and
$\overline{F}_{\alpha/2}(2, 2n)$ as
$P(\underline{F}_{\alpha/2}(2, 2n) < Y/\bar{X} < \overline{F}_{\alpha/2}(2, 2n)) = 1 - \alpha$. The
prediction interval is given by
$(\bar{X}\, \underline{F}_{\alpha/2}(2, 2n),\ \bar{X}\, \overline{F}_{\alpha/2}(2, 2n))$.
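The classical interval from this example is easy to compute. The sketch below (not from the book; the data values are hypothetical) uses the $F(2, 2n)$ pivot directly:

```python
import numpy as np
from scipy.stats import f

# Sketch of the classical prediction interval for a future exponential observation,
# using the pivot Y / Xbar ~ F(2, 2n). Data values are hypothetical.
x = np.array([0.8, 1.3, 0.5, 2.1, 1.0])
n, xbar, alpha = len(x), x.mean(), 0.05

lower = xbar * f.ppf(alpha / 2, 2, 2 * n)       # Xbar * lower alpha/2 percentile
upper = xbar * f.ppf(1 - alpha / 2, 2, 2 * n)   # Xbar * upper alpha/2 percentile
point = x.sum() / (n - 1)                       # point predictor sum(x)/(n-1)
print(point, (lower, upper))
```

Note that the point predictor coincides, as the text remarks, with the Bayesian predictor under a non-informative prior.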
7.3
Prediction in the normal model
7.3.1
Bayesian approach
The structure of normal models allows many easy calculations, especially in the
case of linearity. The application of these rules to prediction reduces the
calculation considerably. Basically, consider an observation
$Y \mid \theta \sim N(\theta, \sigma^2)$ where $\sigma^2$ is known. One can rewrite this model
as $Y = \theta + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$ can be regarded as an
observation error. Observe that the distribution of $\epsilon$ does not depend on
$\theta$ and therefore $\epsilon$ and $\theta$ are independent. Assume now that the
updated distribution of $\theta$ (possibly after the observation of a sample) is
$N(\mu, \tau^2)$. $Y$ is therefore the sum of independent normal quantities and has a
$N(\mu, \sigma^2 + \tau^2)$ distribution. This is the predictive distribution of $Y$.
Point and HPRD interval predictions can be made as described in Section 7.1.
Example. Assume the updated distribution of $\theta$ is the posterior distribution
relative to a sample $X$ from a $N(\theta, \sigma^2)$ distribution. This leads to
$\theta \mid x \sim N(\mu_1, \tau_1^2)$ where $\mu_1$ and $\tau_1^2$ are given by Theorem 2.1.
The predictive distribution of $Y$ in this case will be $N(\mu_1, \sigma^2 + \tau_1^2)$
and the $100(1-\alpha)\%$ HPRD interval for $Y$ is of the form

$$\left(\mu_1 - z_{\alpha/2}\sqrt{\sigma^2 + \tau_1^2},\ \mu_1 + z_{\alpha/2}\sqrt{\sigma^2 + \tau_1^2}\right).$$

In the case of a non-informative prior, $\theta \mid x \sim N(\bar{x}, \sigma^2/n)$, leading
to the predictive distribution

$$Y \mid x \sim N\left(\bar{x},\ \sigma^2\left(1 + \frac{1}{n}\right)\right).$$

The point predictor of $Y$ in this case will be $\bar{x}$ and the $100(1-\alpha)\%$ HPRD
interval for $Y$ is of the form

$$\left(\bar{x} - z_{\alpha/2}\,\sigma\sqrt{1 + n^{-1}},\ \bar{x} + z_{\alpha/2}\,\sigma\sqrt{1 + n^{-1}}\right).$$
r),
we
have by the same reasoning that the predictive distribution of Y is N(JL, 1: +
T .
sampling distribution N(8,
:!::)
and the updated distribution of 8 is N(/L,
)
Assume now that the sampling mean ofY is given by the linear relation XO where X is a matrix of known constants and that 8 remains with the same distribution. If the matrix X is square, the dimension of 8 and the hyperparameters of its distribution remain unaltered. This restriction is unnecessary and we can consider any matrix
X with the correspondent change in the dimension and distribution of 0. We shall see in the next chapter that the cases of interest involve a lower dimension of (} than the dimension of Y implying a reduction in the dimensionality of the problem. Schematically, Y where 8 and
t:
=
xe
+<
are independent. So, the predictive distribution of Y is given by
the sum of two independent normal quantitieS, the first relative to 8 given by a
N(XJL, XrX') distribution and so Y I X- N(X!L, XrX'
+:E).
An example is the case of prediction of a series of future observations from a
population from which a sample had previously been observed. Up to now, the
sampling variance was assumed known. Generally it is not, and the usual approach in
this case is to try to specify a conjugate distribution for the variance. Although a
conjugate analysis when the sampling covariance matrix is totally unknown is
possible, we will consider here only the case where the covariance matrix is totally
known but for an unknown multiplicative scalar $\sigma^2$. Without loss of generality
we will assume that $V(Y \mid \theta, \sigma^2) = \sigma^2 I_p$ where $p$ is the dimension
of $Y$. For a conjugate analysis, it is necessary that the updated covariance matrix
of $\theta$ be proportional to $\sigma^2$, as seen in Chapter 3. So,
$\theta \mid \sigma^2 \sim N(\mu, \sigma^2\tau)$ and
$Y \mid \sigma^2 \sim N(X\mu, \sigma^2(X\tau X' + I_p))$. Assuming as before that the
updated distribution of $\sigma^2$ is $n_0\sigma_0^2\phi \sim \chi^2_{n_0}$ with
$\phi = \sigma^{-2}$,

$$p(y) = \int p(y \mid \phi)\, p(\phi)\, d\phi \propto \int \phi^{p/2} \exp\left\{-\frac{\phi}{2} Q(y)\right\} \phi^{(n_0/2)-1} \exp\left\{-\frac{n_0\sigma_0^2\phi}{2}\right\} d\phi \propto [n_0\sigma_0^2 + Q(y)]^{-(n_0+p)/2}$$

where $Q(y) = (y - X\mu)'(X\tau X' + I_p)^{-1}(y - X\mu)$. It is easy to obtain then
that $Y \sim t_{n_0}(X\mu,\ \sigma_0^2(X\tau X' + I_p))$.
Note that the only changes with respect to the known variance case are the
substitutions of the normal by the Student $t$ distribution with $n_0$ degrees of
freedom and of $\sigma^2$ by its updated estimator $\sigma_0^2$. The main properties of
the multivariate Student $t$ distribution that are relevant here are: if
$U \sim t_\nu(m, C)$ then (i) the marginal distribution of any $q$-dimensional
subvector of $U$ ($q < p$), say $U_1$, is also Student $t$ with $\nu$ degrees of
freedom and parameters $m_1$ and $C_1$ obtained from the components of the vector
$m$ and of the matrix $C$ corresponding to the components of the vector $U_1$; in
particular, the $j$th component of $U$, $U_j$, has (univariate) $t_\nu(m_j, C_{jj})$
distribution; (ii) $LU \sim t_\nu(Lm, LCL')$, for any $r \times p$ matrix $L$ of full
rank.
Example (continued). The same results are valid with the modifications cited
above. As seen in Section 3.3, the posterior distribution of $(\theta, \phi)$ is given
by the normal-$\chi^2$ with parameters $(\mu_1, C_1, n_1, \sigma_1^2)$. A future
observation $Y$ will then have the corresponding Student $t$ predictive
distribution. In the case of a non-informative prior, the predictive distribution is
given by a $t_{n-1}(\bar{x},\ s^2(1 + n^{-1}))$ distribution, where $s^2$ is the
unbiased estimator of $\sigma^2$. The HPRD intervals coincide with those given for
known $\sigma^2$ but for the substitutions of the percentiles of the $N(0, 1)$
distribution by those of the $t_{n-1}(0, 1)$ distribution and of $\sigma$ by $s$.
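The non-informative unknown-variance case can be sketched directly. In this illustration (not from the book; the data are hypothetical) the predictive interval uses $t_{n-1}$ percentiles and the unbiased variance estimator:

```python
import numpy as np
from scipy.stats import t

# Sketch: predictive interval for a future normal observation, unknown variance,
# non-informative prior -> Y | x ~ t_{n-1}(xbar, s^2 (1 + 1/n)). Hypothetical data.
x = np.array([9.8, 11.2, 10.4, 10.9, 9.5, 10.1])
n, xbar, alpha = len(x), x.mean(), 0.05
s2 = x.var(ddof=1)                      # unbiased estimator of sigma^2

q = t.ppf(1 - alpha / 2, df=n - 1)
half = q * np.sqrt(s2 * (1 + 1 / n))
print((xbar - half, xbar + half))
```

As Section 7.3.2 notes, the same interval is obtained classically from the pivot $(Y - \bar{X})/(s\sqrt{1 + 1/n})$.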
7.3.2
Classical approach
For the frequentist inference, one should only work with the sampling
distributions of $Y$ and $X$ in search of a pivot whose distribution does not depend
on the parameters. Again, we write $Y = \theta + \epsilon$ with
$\epsilon \sim N(0, \sigma^2)$ and all that is left is to find suitable estimators for
$\theta$. Assuming initially that $\sigma^2$ is known, we take the estimator
$\hat{\theta} = \bar{X}$. As $Y$ and $\bar{X}$ are both independent normal with the same
mean, $Y - \bar{X} \sim N(0, \sigma^2(1 + n^{-1}))$ is the pivot, with distribution
identical to that obtained with a non-informative prior. Confidence intervals for
$Y$ will also coincide. The difference in the approaches is theoretical: the Bayesian
result is conditional on $X$, which in fact is the way it will be used in classical
inference too.
In the case where $\sigma^2$ is unknown, another pivot must be found because the
distribution of $Y - \bar{X}$ depends on $\sigma^2$. In the normal case, this is easy
because $s^2$ is independent from $Y - \bar{X}$ and
$(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$. So,

$$\frac{Y - \bar{X}}{s\sqrt{1 + 1/n}} \sim t_{n-1}(0, 1)$$

is the required pivot. Again, we get the same results previously obtained with a
non-informative prior. The results of the example above were obtained only for the
simplest case of a single future observation following the observation of a sample
from the same population.
They can be extended to more general situations.
Some of these
extensions will be seen in the next chapter. In the purely predictive case without
unknown parameters, we have $Y$ and $X$ with joint distribution

$$\begin{pmatrix} Y \\ X \end{pmatrix} \sim N\left[\begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix}, \begin{pmatrix} \Sigma_Y & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_X \end{pmatrix}\right]$$

and the prediction of $Y$ given the observation of $X$ is based on the distribution
of $Y \mid X = x$ given by

$$N\left(\mu_Y + \Sigma_{YX}\Sigma_X^{-1}(x - \mu_X),\ \Sigma_Y - \Sigma_{YX}\Sigma_X^{-1}\Sigma_{XY}\right).$$

If $Y$ and $X$ are scalar quantities, this distribution becomes

$$N\left(\mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(x - \mu_X),\ \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}\right)$$
where $\sigma_X^2 = V(X)$, $\sigma_Y^2 = V(Y)$ and $\sigma_{XY} = \mathrm{Cov}(X, Y)$.
We can see from the results above that the knowledge of the value of $X$ reduces
the variance of $Y$. The larger the correlation between $X$ and $Y$ in absolute
value, the greater the reduction.
More important than that is the fact that the optimal predictor (under any
reasonable criterion) is a linear function of $X$. This result is explored in the
next section.

7.4
Linear prediction

The problem of linear prediction takes place in the absence of complete
distributional information. Assume the presence of two random vectors $X$ and $Y$
from which it is only known that

$$E\begin{pmatrix} Y \\ X \end{pmatrix} = \begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix} \quad \text{and} \quad V\begin{pmatrix} Y \\ X \end{pmatrix} = \begin{pmatrix} \Sigma_Y & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_X \end{pmatrix}.$$
The complete distribution of $X$ and $Y$ may not be known for a number of reasons.
Our problem here is to establish a form of predicting $Y$ based on $X$. A few
restrictions must be imposed to solve it. The most natural one is to restrict the
class of predictors to linear functions of $X$. This restriction is not only
justified by rendering the problem tractable. Linear solutions are obtained as
first-order approximations. The linear predictor thus serves as a preliminary
predictor even in the case where the joint distribution is completely specified. To
choose the predictor among all predictors of the form $\hat{Y}(X) = a + BX$
$$R(Y, \hat{Y}) = E\{[Y - \hat{Y}(X)][Y - \hat{Y}(X)]'\},$$
with optimality obtained by minimization of the trace of the matrix $R$. The search for the optimal predictor reduces to the search for optimal constants $a$ and $B$. The predictor so obtained is called the linear predictor and the associated risk, the linear risk. Observe that the predictor that (globally) minimizes the risk is the conditional expectation $E(Y \mid X)$ and the associated risk is the conditional variance. For that reason, the linear predictor is also called the linear expectation and its risk, the linear variance. In the specific case of the normal distribution, the
conditional expectation is a linear function of $X$. So, the linear predictor minimizes the quadratic expected loss among all possible predictors (linear or not).
In the special case where $Y$ and $X$ are scalar, the risk is given by
$$R(Y, \hat{Y}) = E[Y - \hat{Y}(X)]^2 = E[Y - (a + bX)]^2.$$
In this case, $\mathrm{trace}(R) = R$, and to find the values of $a$ and $b$ that minimize $R$ we calculate
$$\frac{\partial R}{\partial a} = E[2(Y - a - bX)(-1)] = 2[a + b\mu_X - \mu_Y]$$
and
$$\frac{\partial R}{\partial b} = E[2(Y - a - bX)(-X)] = 2[a\mu_X + bE(X^2) - E(XY)].$$
Equating the derivatives to zero gives the system of equations
$$a = \mu_Y - b\mu_X \quad \text{and} \quad b = \frac{E(XY) - a\mu_X}{E(X^2)}.$$
Replacing the first in the second equation gives
$$b = \frac{E(XY) - (\mu_Y - b\mu_X)\mu_X}{E(X^2)} = \frac{E(XY) - \mu_X\mu_Y + b\mu_X^2}{E(X^2)} = \frac{\sigma_{XY} + b\mu_X^2}{E(X^2)},$$
which has solution $b = \sigma_{XY}/\sigma_X^2$, where $\sigma_{XY} = \mathrm{Cov}(X, Y)$. So the optimal linear predictor is
$$\hat{Y} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$$
and its risk is given by
$$R(Y, \hat{Y}) = E\left[Y - \mu_Y - \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)\right]^2 = E(Y - \mu_Y)^2 + \left(\frac{\sigma_{XY}}{\sigma_X^2}\right)^2 E(X - \mu_X)^2 - 2\frac{\sigma_{XY}}{\sigma_X^2}E[(Y - \mu_Y)(X - \mu_X)] = \sigma_Y^2 + \frac{\sigma_{XY}^2}{\sigma_X^2} - 2\frac{\sigma_{XY}^2}{\sigma_X^2} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}.$$
In the multivariate case, it can be shown that the linear predictor is
$$\mu_Y + \Sigma_{YX}\Sigma_X^{-1}(x - \mu_X)$$
and its risk is given by
$$\Sigma_Y - \Sigma_{YX}\Sigma_X^{-1}\Sigma_{XY},$$
generalizing the result obtained when $Y$ and $X$ are scalar. As mentioned above, the linear predictor is given by the expression of the conditional expectation in the normal case. This implies that the linear predictor is as close to the global optimum (in terms of minimization of the quadratic risk) as the normal distribution is close to the joint distribution of $X$ and $Y$.
The method of linear prediction was developed in the 1970s by Hartigan and Goldstein among others. It does not depend on the point of view adopted for inference. However, it generally appears in the Bayesian literature, where it receives the name of linear Bayes methodology.
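Numerically, the linear predictor requires only the first two joint moments. The sketch below (illustrative, not from the text; numpy assumed, moments taken as known) computes $\mu_Y + \Sigma_{YX}\Sigma_X^{-1}(x - \mu_X)$ and the linear risk $\Sigma_Y - \Sigma_{YX}\Sigma_X^{-1}\Sigma_{XY}$:

```python
import numpy as np

def linear_prediction(mu_x, mu_y, Sx, Sxy, Sy, x):
    """Optimal linear predictor of Y given X = x and its risk matrix.

    Sx = V(X), Sy = V(Y), Sxy = Cov(X, Y); only these moments are needed.
    """
    A = Sxy.T @ np.linalg.inv(Sx)        # regression matrix Σ_YX Σ_X^{-1}
    pred = mu_y + A @ (x - mu_x)
    risk = Sy - A @ Sxy                  # Σ_Y - Σ_YX Σ_X^{-1} Σ_XY
    return pred, risk

# scalar illustration: σ_X² = 4, σ_Y² = 9, σ_XY = 3, so b = 3/4
pred, risk = linear_prediction(np.array([1.0]), np.array([2.0]),
                               np.array([[4.0]]), np.array([[3.0]]),
                               np.array([[9.0]]), np.array([3.0]))
```

With scalar quantities this reproduces $b = \sigma_{XY}/\sigma_X^2$ and the risk $\sigma_Y^2 - \sigma_{XY}^2/\sigma_X^2$.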
Exercises

§ 7.1

1. Let $X = (X_1, \ldots, X_n)$ be a random sample from a $U[0, \theta]$ distribution and $\theta \sim Pa(\alpha, \beta)$.
(a) Obtain the predictive distribution for a new observation from the same population based on all the information.
(b) Assuming that $\max x_i > \beta$, what is the probability of $Y > \max x_i$?
(c) Repeat (a) and (b) for $\alpha \to 0$ and $\beta \to 0$.

2. Show that the problem of choice of a point predictor for $Y$ may be reformulated as that of finding the predictor $\delta$ that minimizes
$$\int V(\theta, \delta)\, p(\theta \mid x)\, d\theta, \quad \text{where} \quad V(\theta, \delta) = E_{Y\mid\theta}[L(Y, \delta)].$$

3. Consider the problem of choosing the point predictor of $Y \in R^p$ based on the quadratic loss
$$L(Y, \delta) = (Y - \delta)'M(Y - \delta),$$
where $M$ is a known positive definite matrix. Show that the point predictor is given by $\delta = E(Y \mid x)$ and that the expected loss is given by the trace of the matrix $M V(Y \mid x)$.

§ 7.2

4. Assume that one wishes to make a prediction for $Y \sim bin(m, \theta)$ based on observation of $X \sim bin(n, \theta)$. How can that be done from a frequentist point of view?

5. A geologist wishes to study the incidence of seismic movements in a given region. He then selects $m$ independent but geologically similar observation points and counts the number of movements in a specific time interval. The observational model is $X_i \sim Pois(\theta)$, where $X_i$, $i = 1, \ldots, m$, is the number of occurrences at the $i$th observation point and $\theta$ is the average rate of seismic movements. From his previous experience, the researcher assumes that $E[\theta] = 2$ movements per time interval and that $V[\theta] = 0.36$, and uses these values to specify a conjugate prior.
(a) Assuming that $x = (2, 3, 0, 0, 1, 0, 2, 0, 3, 0, 1, 2)$ was observed, what is the posterior distribution?
(b) He wishes to predict the expected number of seismic movements and its precision at an $(m+1)$th site based on the observations he has made. Establish the necessary hypotheses and perform the calculations using the data above.

§ 7.3

6. Consider the observation of independent samples $Y = (Y_1, \ldots, Y_m)$ and $X = (X_1, \ldots, X_n)$ from a $N(\theta, \sigma^2)$ population. Indicate how to obtain point and confidence interval predictors for the components of $Y$ using:
(a) a conjugate prior for $(\theta, \sigma^2)$;
(b) a non-informative prior for $(\theta, \sigma^2)$;
(c) frequentist inference.

§ 7.4

7. Prove that, in the multivariate case, the linear predictor is given by
$$\mu_Y + \Sigma_{YX}\Sigma_X^{-1}(x - \mu_X)$$
and its risk is given by
$$\Sigma_Y - \Sigma_{YX}\Sigma_X^{-1}\Sigma_{XY}.$$
8 Introduction to linear models

In this chapter one of the most important problems in statistics will be considered. The scope is introductory because the subject deserves a whole book to be considered in depth. A complete approach to the subject can be found in books such as Draper and Smith (1966) and Broemeling (1985). The class of normal linear models is characterized in the first section. In this same section the class of generalized linear models will also be introduced, as an extension of the normal linear models to the exponential family. In the following sections, classical and Bayesian inferences are developed for these models. Two other broad classes of models, natural extensions of the normal linear models, are described in Sections 8.4 (hierarchical linear models) and 8.5 (dynamic linear models).
8.1 The linear model
In this section, the problem of observing a random variable $Y$ with values affected by other variables will be discussed. For example, the income of a firm is affected by its capital and by the number of persons it employs; the production of a machine is influenced by the maintenance scheme implemented and how well trained its operator is; the arterial blood pressure depends upon the age of the patient; and so on. In these cases, the variability of $Y$ is explained by some other variables. As a first approximation, only a linear relationship will be used to describe how these variables influence $Y$. Let $X_1, \ldots, X_p$ denote the set of $p$ explanatory variables. It follows that
$$E(Y \mid X_1, \ldots, X_p) = \beta_1 X_1 + \cdots + \beta_p X_p.$$
If a model with an intercept is required, all we need to do is to specify $X_1 = 1$. The model introduced above is called a linear model or linear regression model. The case with $p = 1$, where only one explanatory variable is involved, is named simple linear regression. Note that the expectation of $Y$ is calculated conditionally on the values of the variables $X_1, \ldots, X_p$. Throughout this chapter it will be assumed that the values of the explanatory variables are known.
If, in addition, a common $V(Y_i) = \sigma^2$ is assumed for all $i$, the errors, after observing the sample $Y_1, \ldots, Y_n$, can be specified as
$$e_i = Y_i - (\beta_1 x_{1i} + \cdots + \beta_p x_{pi}), \quad i = 1, \ldots, n,$$
and the parameters $\beta = (\beta_1, \ldots, \beta_p)'$ can easily be estimated by least squares. In this case, the estimator is given by the value of $\beta$ that minimizes $\sum_{i=1}^n e_i^2$. Note that no hypothesis about the distribution of $Y$ or the errors was needed. If the hypotheses of normality and independence of the distribution of the $Y_i$'s are assumed, then it is easy to obtain the likelihood function as
$$l(\beta, \sigma^2; Y_1, \ldots, Y_n) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n e_i^2\right\}.$$
Then, the value of $\beta$ that maximizes the likelihood is the value that minimizes $\sum_{i=1}^n e_i^2$; that is, the maximum likelihood and the least squares estimators coincide.
It is useful, for further development, to adopt a matrix notation. Let us define
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \quad \text{and} \quad X = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{p1} \\ \vdots & & \vdots \\ x_{1n} & \cdots & x_{pn} \end{pmatrix},$$
from which it follows that $Y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n)$, where $I_n$ is the $n \times n$ identity matrix. The likelihood can be rewritten as $l(\beta, \sigma^2; Y) \propto \sigma^{-n}\exp\{-S(\beta)/2\sigma^2\}$, where
$$S(\beta) = \sum_{i=1}^n (Y_i - x_i'\beta)^2 = (Y - X\beta)'(Y - X\beta).$$
The model presented before can be rewritten through the following equations:
$$Y_i \sim N(\mu_i, \sigma^2), \quad i = 1, \ldots, n, \quad \text{independent},$$
$$\mu_i = \lambda_i, \quad i = 1, \ldots, n,$$
$$\lambda_i = x_i'\beta,$$
where the (apparently unnecessary) second equation states the relationship between the mean of the observations and the explanatory structure in the model.
Stating the main objective in this form, it is clear that there is no reason to be restricted to the normal distribution and to the class of linear relationships. One of the most relevant extensions of the model presented before constitutes the class of generalized linear models, which allows us to model observations described by any member of the exponential family and to relate the mean of the distribution to the explanatory structure through any differentiable function. Therefore the $Y_i$'s have density given by
$$p(y_i \mid \theta_i) = a(y_i)\exp[u(y_i)\phi(\theta_i) + b(\theta_i)], \quad i = 1, \ldots, n, \quad \text{independent},$$
$$g(\mu_i) = \lambda_i, \quad \text{where } g \text{ is differentiable and } \mu_i = E(Y_i \mid \theta_i),$$
$$\lambda_i = x_i'\beta.$$
This class is broad enough to include many of the most frequently used models in statistics. A complete description of these models, including many inferential aspects, can be found in McCullagh and Nelder (1989).
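To make the three-equation formulation concrete, the sketch below (hypothetical numbers, not from the text; numpy assumed) evaluates the mean structure of a Poisson generalized linear model with log link, $g(\mu_i) = \log\mu_i = \lambda_i = x_i'\beta$; estimation is a separate matter, treated in McCullagh and Nelder (1989):

```python
import numpy as np

# design matrix with X1 = 1 (intercept) and one explanatory variable
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
beta = np.array([0.5, 0.3])        # illustrative coefficients

lam = X @ beta                     # linear predictor λ_i = x_i'β
mu = np.exp(lam)                   # log link inverted: μ_i = E(Y_i) = exp(λ_i)
```

The same three-part specification (distribution, link $g$, linear predictor) covers the normal model as the special case $g(\mu) = \mu$.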
8.2 Classical linear models

Classical estimators for $\beta$ and $\sigma^2$ can be obtained from the likelihood function exhibited before. Beginning with the estimation of $\beta$, we see that the least squares and maximum likelihood estimators do coincide and are given by the value of $\beta$ that minimizes the quadratic form $S(\beta)$. Differentiating this expression with respect to the elements of the parameter vector $\beta$, it follows that
$$\frac{\partial S(\beta)}{\partial \beta} = 2(X'X\beta - X'Y)$$
(see Exercise 8.1). Since the matrix of second derivatives is positive definite, the solution of $\partial S(\beta)/\partial \beta = 0$ provides the point of minimum of $S(\beta)$ and must satisfy
$$X'X\hat{\beta} = X'Y.$$
The above $p$ equations are known as the normal equations. If the matrix $X'X$ is of full rank, or equivalently if the columns of $X$ are linearly independent, then $X'X$ has an inverse and the maximum likelihood estimator of $\beta$ is given by
$$\hat{\beta} = (X'X)^{-1}X'Y.$$
It will be assumed from here on that the matrix $X'X$ has full rank. This restriction does not seem to be that serious, since $X$ not of full rank means that there are some redundancies in its specification or in the model specification. These redundancies can be useful for understanding the model but can be eliminated without any loss to the following inferences. The quadratic form can be expressed as
$$S(\beta) = (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) + S_e,$$
where
$$S_e = Y'Y - \hat{\beta}'X'X\hat{\beta} = (Y - X\hat{\beta})'(Y - X\hat{\beta}).$$
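The normal equations translate directly into code. A minimal sketch (simulated data, all names illustrative; numpy assumed) solves $X'X\hat{\beta} = X'Y$ and forms $S_e$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# solve the normal equations X'X β = X'y (preferable to inverting X'X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
Se = resid @ resid                 # S_e = (Y - Xβ̂)'(Y - Xβ̂)
```

The identity $S_e = Y'Y - \hat{\beta}'X'X\hat{\beta}$ provides a convenient numerical check.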
The maximum likelihood estimator of $\sigma^2$ is obtained as the solution of the equation $\partial l/\partial \sigma^2 = 0$ and will be denoted by $\hat{\sigma}^2$. Since $S(\hat{\beta}) = S_e$ by the above development, the maximum likelihood estimator of $\sigma^2$ is given by $S_e/n$.
The sampling distribution of these estimators can easily be obtained. Since $\hat{\beta}$ is a linear function of $Y$, its sampling distribution is multivariate normal with mean and variance given by
$$E(\hat{\beta} \mid \beta, \sigma^2) = (X'X)^{-1}X'E(Y \mid \beta, \sigma^2) = (X'X)^{-1}X'X\beta = \beta,$$
$$V(\hat{\beta} \mid \beta, \sigma^2) = (X'X)^{-1}X'V(Y \mid \beta, \sigma^2)X(X'X)^{-1} = \sigma^2(X'X)^{-1},$$
and so it is an unbiased estimator. Since the score function is linear in $\hat{\beta}$, it also has minimum variance.
The quadratic form $S(\beta) - S_e$ is a function of $\hat{\beta}$ alone and can be shown to be independent of $S_e$. Since $S(\beta)/\sigma^2 \sim \chi^2_n$ and $S_e/\sigma^2 \sim \chi^2_{n-p}$, it follows that $[S(\beta) - S_e]/\sigma^2 \sim \chi^2_p$. Therefore, $s^2 = S_e/(n-p)$ is an unbiased estimator of $\sigma^2$, and $(\hat{\beta} - \beta)/s$ has a multivariate Student $t$ sampling distribution with $n - p$ degrees of freedom, location parameter $0$ and scale parameter matrix $(X'X)^{-1}$.
The above statistic can be used as a pivot to obtain confidence intervals for $\beta$ or its components. In particular, it is easy to obtain the sampling distribution of $(\hat{\beta}_j - \beta_j)/s$, which is $t_{n-p}(0, c_{jj})$, where $c_{jj}$ is the $(j, j)$th element of the matrix $(X'X)^{-1}$, $j = 1, \ldots, p$. The $100(1-\alpha)\%$ confidence interval for $\beta_j$ is given by
$$[\hat{\beta}_j - t_{\alpha/2,\, n-p}\, s\, c_{jj}^{1/2},\; \hat{\beta}_j + t_{\alpha/2,\, n-p}\, s\, c_{jj}^{1/2}].$$
Alternatively, from the independence between $S(\beta) - S_e$ and $S_e$, it follows that $[S(\beta) - S_e]/ps^2 \sim F(p, n-p)$ and confidence regions for $\beta$ can be obtained.
Example 1. Simple linear regression. Consider the model $Y_i = \beta_0 + \beta_1(x_i - \bar{x}) + e_i$, $i = 1, \ldots, n$, with only one explanatory variable and where $\bar{x}$ is the mean of the observed values $x_i$. Then $x = (x_1, \ldots, x_n)'$, $X = (1_n, x - \bar{x}1_n)$,
$$\hat{\beta}_0 = \bar{Y}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^n (Y_i - \bar{Y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
and
$$s^2 = \frac{1}{n-2}\sum_{i=1}^n [Y_i - \hat{\beta}_0 - \hat{\beta}_1(x_i - \bar{x})]^2,$$
where $1_n$ is an $n$-dimensional vector of 1's. Due to the centring of the values of the $x_i$'s, the covariance matrix of the sampling distribution of $\hat{\beta}$ is diagonal. As this is a multivariate normal, this is equivalent to saying that $\hat{\beta}_0$ and $\hat{\beta}_1$ are independent.
Each of these estimators will separately provide information about each parameter. Confidence intervals can be built based on the independent sampling distributions
$$\frac{\hat{\beta}_0 - \beta_0}{s} \sim t_{n-2}(0, 1/n) \quad \text{and} \quad \frac{\hat{\beta}_1 - \beta_1}{s} \sim t_{n-2}\left(0, \left[\sum_{i=1}^n (x_i - \bar{x})^2\right]^{-1}\right).$$
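The closed-form estimators of Example 1 are easily coded. A small sketch (made-up data; numpy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)
xc = x - x.mean()                             # centred covariate x_i - x̄

beta0 = Y.mean()                              # β̂0 = Ȳ
beta1 = ((Y - Y.mean()) @ xc) / (xc @ xc)     # β̂1
s2 = ((Y - beta0 - beta1 * xc) ** 2).sum() / (n - 2)
```

Because the covariate is centred, the two estimators are computed from orthogonal pieces of the data, mirroring the independence noted above.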
Example 2. Analysis of variance with one classification factor. Let $Y_{ji} = \beta_j + e_{ji}$, $i = 1, \ldots, n_j$, $j = 1, \ldots, k$, be the model; that is, the $n_j$ observations in group $j$ have the same mean $\beta_j$, $j = 1, \ldots, k$. This model is frequently referred to as one-way ANOVA. The total number of observations $n$ is given by $\sum_{j=1}^k n_j$. Other parametrizations are possible, with the most usual given by $\beta_j = \mu + \alpha_j$, where $\mu$ is the global mean and $\alpha_j$ the deviation of the group mean with respect to the global mean. This parametrization is not used here due to the redundancy $\sum_{j=1}^k \alpha_j = 0$, which we are trying to avoid.
The model is completely characterized with parameter vector defined by $\beta = (\beta_1, \ldots, \beta_k)'$, the observation vector $Y = (Y_{11}, \ldots, Y_{1n_1}, \ldots, Y_{k1}, \ldots, Y_{kn_k})'$ and the design matrix $X$, whose $j$th column contains 1's in the positions of the observations of group $j$ and 0's elsewhere.
As in the previous example, the matrix $(X'X)^{-1}$ has a diagonal form, with $(j, j)$ element given by $n_j^{-1}$, and $\hat{\beta}_j = \bar{Y}_j$, where $\bar{Y}_j$ is the observation mean in group $j$. Besides that,
$$S_e = \sum_{j=1}^k \sum_{i=1}^{n_j} (Y_{ji} - \bar{Y}_j)^2$$
and $s^2 = S_e/(n-k)$ is the unbiased estimator of $\sigma^2$. Therefore, $(\hat{\beta}_j - \beta_j)/s \sim t_{n-k}(0, n_j^{-1})$, $j = 1, \ldots, k$.
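For the one-way ANOVA model the estimators reduce to group means. A sketch with invented data (numpy assumed):

```python
import numpy as np

groups = [np.array([4.1, 3.9, 4.4]),          # n_1 = 3 observations
          np.array([5.0, 5.2]),               # n_2 = 2
          np.array([3.2, 3.0, 3.1, 2.9])]     # n_3 = 4
k = len(groups)
n = sum(len(g) for g in groups)

beta_hat = np.array([g.mean() for g in groups])           # β̂_j = Ȳ_j
Se = sum(((g - g.mean()) ** 2).sum() for g in groups)     # within-group SS
s2 = Se / (n - k)                                         # unbiased for σ²
```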
Hypothesis tests based on the likelihood ratio can be obtained. One of the most useful is the model validation test, that is, the test of the hypothesis $H_0: \beta_2 = \cdots = \beta_p = 0$ in the model $Y_i = \beta_1 + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + e_i$, $i = 1, \ldots, n$. Under $H_0$, none of the explanatory variables has influence over the value of $Y$. As we have seen in Section 4.2, the maximum likelihood estimator of $(\beta_1, \sigma^2)$ under $H_0$ is $(\bar{Y}, \hat{\sigma}_0^2)$, where $\hat{\sigma}_0^2 = \sum (Y_i - \bar{Y})^2/n$. The maximized likelihood under $H_0$ is proportional to $\hat{\sigma}_0^{-n}e^{-n/2}$. So, the maximum likelihood ratio is given by
$$\left(\frac{n\hat{\sigma}_0^2}{S_e}\right)^{n/2} = \left(1 + \frac{n\hat{\sigma}_0^2 - S_e}{S_e}\right)^{n/2} = \left(1 + \frac{p-1}{n-p}F\right)^{n/2},$$
where $F = (n\hat{\sigma}_0^2 - S_e)/[(p-1)s^2]$. The likelihood ratio test of significance level $\alpha$ rejects $H_0$ if $F > c$, where $c$ is implicitly given by the equation $\alpha = \Pr(F > c \mid H_0)$. Using results about quadratic forms of normal random variables, it is possible to show that $n\hat{\sigma}_0^2 - S_e$ is independent of $S_e$. As $n\hat{\sigma}_0^2/\sigma^2 \sim \chi^2_{n-1}$, it follows that $(n\hat{\sigma}_0^2 - S_e)/\sigma^2 \sim \chi^2_{p-1}$ and so $F \sim F(p-1, n-p)$ under $H_0$. Therefore, $c = F_\alpha(p-1, n-p)$. Confidence regions for $(\beta_2, \ldots, \beta_p)$ follow from the above test statistic.
Example 1 (continued). The model validation test in this case corresponds to the test of $H_0: \beta_1 = 0$. The expression $n\hat{\sigma}_0^2$ is given by
$$\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n \{[Y_i - \bar{Y} - \hat{\beta}_1(x_i - \bar{x})] + \hat{\beta}_1(x_i - \bar{x})\}^2$$
$$= \sum_{i=1}^n [Y_i - \bar{Y} - \hat{\beta}_1(x_i - \bar{x})]^2 + \hat{\beta}_1^2\sum_{i=1}^n (x_i - \bar{x})^2 + 2\hat{\beta}_1\sum_{i=1}^n [Y_i - \bar{Y} - \hat{\beta}_1(x_i - \bar{x})](x_i - \bar{x})$$
$$= S_e + \hat{\beta}_1^2\sum_{i=1}^n (x_i - \bar{x})^2,$$
since the cross-product term vanishes by the definition of $\hat{\beta}_1$. So, $F = \hat{\beta}_1^2\sum_{i=1}^n (x_i - \bar{x})^2/s^2 \sim F(1, n-2)$ and the model validation test rejects the null hypothesis $H_0$ if $F > F_\alpha(1, n-2)$ or, equivalently, if
$$|\hat{\beta}_1| > t_{\alpha/2,\, n-2}\, s\left[\sum_{i=1}^n (x_i - \bar{x})^2\right]^{-1/2}.$$
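The decomposition above can be verified numerically. The sketch below (made-up data; numpy assumed) checks $n\hat{\sigma}_0^2 = S_e + \hat{\beta}_1^2\sum_i(x_i - \bar{x})^2$ and computes the $F$ statistic:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])
n = len(x)
xc = x - x.mean()
Sx = xc @ xc

beta1 = ((Y - Y.mean()) @ xc) / Sx
Se = ((Y - Y.mean() - beta1 * xc) ** 2).sum()
s2 = Se / (n - 2)

n_sig0 = ((Y - Y.mean()) ** 2).sum()          # nσ̂0², the SS under H0
F = beta1 ** 2 * Sx / s2                      # ~ F(1, n-2) under H0
```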
This result can be extended to multiple regression in the following sense. Consider the model $Y_i = \beta_1 + \beta_2(x_{2i} - \bar{x}_2) + \cdots + \beta_p(x_{pi} - \bar{x}_p) + e_i$, $i = 1, \ldots, n$, where all the explanatory variables were conveniently centred. The likelihood ratio test for $H_j: \beta_j = 0$, $j = 2, \ldots, p$, with level $\alpha$ rejects $H_j$ if
$$|\hat{\beta}_j| > t_{\alpha/2,\, n-p}\, s\left[\sum_{i=1}^n (x_{ji} - \bar{x}_j)^2\right]^{-1/2}.$$

Example 2 (continued). Now the model validation test is concerned with the following hypothesis, $H_0: \beta_1 = \cdots = \beta_k$. The expression of $n\hat{\sigma}_0^2$ is
$$n\hat{\sigma}_0^2 = \sum_{j=1}^k \sum_{i=1}^{n_j} (Y_{ji} - \bar{Y})^2, \quad \text{where} \quad \bar{Y} = \frac{1}{n}\sum_{j=1}^k n_j\bar{Y}_j,$$
and so
$$n\hat{\sigma}_0^2 = \sum_{j=1}^k \sum_{i=1}^{n_j} (Y_{ji} - \bar{Y}_j)^2 + \sum_{j=1}^k n_j(\bar{Y}_j - \bar{Y})^2 + 2\sum_{j=1}^k \sum_{i=1}^{n_j} (Y_{ji} - \bar{Y}_j)(\bar{Y}_j - \bar{Y}) = S_e + \sum_{j=1}^k n_j(\bar{Y}_j - \bar{Y})^2,$$
because the third term vanishes.
We have already seen that squared deviations around the mean for a normally distributed variable are independent of the sample mean. Then, $S_e$ is independent of the means $\bar{Y}_1, \ldots, \bar{Y}_k$ and so independent of the second term in the above expression. This term is also in the form of squared deviations of $k$ normal observations with respect to their mean and so has the sampling distribution given by $\sigma^2\chi^2_{k-1}$. Therefore,
$$F = \frac{\sum_{j=1}^k n_j(\bar{Y}_j - \bar{Y})^2/(k-1)}{s^2} \sim F(k-1, n-k),$$
confirming the result described above for the general linear model. $F$ is the MLR test statistic for $H_0$, and the test of significance level $\alpha$ rejects $H_0$ if $F > F_\alpha(k-1, n-k)$.
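The analysis-of-variance decomposition and the resulting $F$ statistic admit the same kind of numerical check (invented data; numpy assumed):

```python
import numpy as np

groups = [np.array([2.0, 2.4, 1.9]),
          np.array([3.1, 3.3, 2.8, 3.0]),
          np.array([1.1, 0.9, 1.3])]
k = len(groups)
n = sum(len(g) for g in groups)
ybar = sum(g.sum() for g in groups) / n                   # grand mean Ȳ

Se = sum(((g - g.mean()) ** 2).sum() for g in groups)     # within groups
between = sum(len(g) * (g.mean() - ybar) ** 2 for g in groups)
total = sum(((g - ybar) ** 2).sum() for g in groups)      # nσ̂0²

s2 = Se / (n - k)
F = (between / (k - 1)) / s2                              # ~ F(k-1, n-k) under H0
```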
Finally, we will address the relevant question of how to make predictions using the classical linear model. The problem here is how to make an inference about an $m$-dimensional vector of future observations $Y^*$ with explanatory variables gathered into an $m \times p$ matrix $x^*$. Once again, the trick is to find a convenient pivot, which in this case is given by $(Y^* - x^*\hat{\beta})/s$. To verify this result, it suffices to observe that $Y^*$, $\hat{\beta}$ and $s^2$ are independent and so $Y^* - x^*\hat{\beta}$ is independent of $s^2$. The numerator is distributed as normal with mean and variance given by
$$E[Y^* - x^*\hat{\beta} \mid \beta, \sigma^2] = x^*\beta - x^*\beta = 0,$$
$$V[Y^* - x^*\hat{\beta} \mid \beta, \sigma^2] = \sigma^2[I_m + x^*(X'X)^{-1}x^{*\prime}] = \sigma^2 C^*.$$
Since $(n-p)s^2/\sigma^2 \sim \chi^2_{n-p}$, the pivotal quantity is distributed as a $t_{n-p}(0, C^*)$ independently of the parameters in the model. So, confidence intervals for $Y^*$ with confidence level $100(1-\alpha)\%$ can be built. These results are easily extended to the prediction of a vector of future observations following the same steps as before.
8.3 Bayesian linear models

In Bayesian inference, a prior distribution for the parameters must be specified in addition to the likelihood function. Firstly, the results involving proper priors (in fact natural conjugate, as will be shown) will be presented. The analysis with non-informative prior will be presented in the sequel, and some comparisons with classical inference will be made. The examples presented in the previous section will be revisited.
The prior distribution adopted for the parameters is a multivariate generalization of the normal-$\chi^2$ presented in Section 3.3. Assume that the parametric vector $\beta$ has conditional prior distribution $\beta \mid \phi \sim N(\mu_0, \phi^{-1}C_0^{-1})$ and that $n_0\sigma_0^2\phi \sim \chi^2_{n_0}$, so that the joint prior density is
$$p(\beta, \phi) \propto \phi^{p/2}\exp\left\{-\frac{\phi}{2}(\beta - \mu_0)'C_0(\beta - \mu_0)\right\}\phi^{(n_0/2)-1}\exp\left\{-\frac{\phi}{2}n_0\sigma_0^2\right\}.$$
Then, as in the univariate case, the conditional distribution of $\phi \mid \beta$ can be obtained from the joint prior distribution of $\beta$ and $\phi$. The marginal distribution of $\beta$ can be obtained by dividing $p(\beta, \phi)$ by $p(\phi \mid \beta)$ or, as done before, by integrating the joint distribution with respect to $\phi$, giving
$$p(\beta) \propto [n_0\sigma_0^2 + (\beta - \mu_0)'C_0(\beta - \mu_0)]^{-(n_0+p)/2},$$
which corresponds to the density of a $t_{n_0}(\mu_0, \sigma_0^2C_0^{-1})$ distribution, as seen in Chapter 4. The normalizing constant is
$$\frac{\Gamma[(n_0+p)/2]}{\Gamma(n_0/2)\,\pi^{p/2}}(n_0\sigma_0^2)^{n_0/2}|C_0|^{1/2}.$$
On the other hand, the likelihood of $\beta$ and $\phi$ is given by
$$l(\beta, \phi; y) \propto \phi^{n/2}\exp\left\{-\frac{\phi}{2}[S_e + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})]\right\}$$
and has the same form as the prior density. Therefore, the posterior is given by
$$p(\beta, \phi \mid y) \propto \phi^{[(n+n_0+p)/2]-1}\exp\left\{-\frac{\phi}{2}[n_0\sigma_0^2 + S_e + (\beta - \mu_0)'C_0(\beta - \mu_0) + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})]\right\}.$$
It can be shown that
$$(\beta - \mu_0)'C_0(\beta - \mu_0) + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) = (\beta - \mu_1)'C_1(\beta - \mu_1) + \mu_0'C_0\mu_0 + \hat{\beta}'X'X\hat{\beta} - \mu_1'C_1\mu_1,$$
with $C_1 = C_0 + X'X$ and $\mu_1 = C_1^{-1}(C_0\mu_0 + X'X\hat{\beta}) = C_1^{-1}(C_0\mu_0 + X'y)$. Noting that $S_e + \hat{\beta}'X'X\hat{\beta} = y'y$ and $\mu_1'C_1\mu_1 = \mu_1'(C_0\mu_0 + X'y)$, the remaining terms in the exponent can be grouped as
$$S_e + \mu_0'C_0\mu_0 + \hat{\beta}'X'X\hat{\beta} - \mu_1'C_1\mu_1 = (y - X\mu_1)'y + (\mu_0 - \mu_1)'C_0\mu_0.$$
Then, the posterior density of $\beta$ and $\phi$ can be written as
$$p(\beta, \phi \mid y) \propto \phi^{p/2}\exp\left\{-\frac{\phi}{2}(\beta - \mu_1)'C_1(\beta - \mu_1)\right\}\phi^{(n_1/2)-1}\exp\left\{-\frac{\phi}{2}n_1\sigma_1^2\right\},$$
where $n_1 = n + n_0$ and $n_1\sigma_1^2 = n_0\sigma_0^2 + (y - X\mu_1)'y + (\mu_0 - \mu_1)'C_0\mu_0$. This has the same form as the prior and, so, is natural conjugate to the normal linear model. In particular, $\beta \mid y \sim t_{n_1}(\mu_1, \sigma_1^2C_1^{-1})$ and $n_1\sigma_1^2\phi \mid y \sim \chi^2_{n_1}$, with posterior mean $E(\phi \mid y) = \sigma_1^{-2}$. The point estimators of $\beta$ and $\phi$ are given by $\mu_1$ and $\sigma_1^{-2}$, respectively. Confidence intervals for $\beta_j$ and $\phi$ are obtained from the percentiles of the $t_{n_1}$ and $\chi^2_{n_1}$ distributions,
respectively. It is also possible to make joint inference about $\beta$ based on the fact that $(\beta - \mu_1)'C_1(\beta - \mu_1)/p\sigma_1^2 \mid y \sim F(p, n_1)$.
The non-informative prior can be used to represent, in some sense, the absence of initial information. Using the same approach as in Section 3.4, as $\beta$ is essentially a (multivariate) location parameter and $\phi$ is a scale parameter, it follows that the joint non-informative prior distribution is given by $p(\beta, \phi) \propto \phi^{-1}$. This prior is a particular, degenerate case of the natural conjugate prior. Making the convenient substitutions, the posterior density is
$$p(\beta, \phi \mid y) \propto \phi^{(n/2)-1}\exp\left\{-\frac{\phi}{2}[S_e + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})]\right\}$$
$$\propto \phi^{p/2}\exp\left\{-\frac{\phi}{2}(\beta - \hat{\beta})'X'X(\beta - \hat{\beta})\right\}\phi^{[(n-p)/2]-1}\exp\left\{-\frac{\phi}{2}(n-p)s^2\right\}.$$
Therefore, the posterior remains in the same class, with only changes to the values of the hyperparameters of the relevant distributions. So, $\beta \mid y \sim t_{n-p}(\hat{\beta}, s^2(X'X)^{-1})$, $(n-p)s^2\phi \mid y \sim \chi^2_{n-p}$, and the quadratic form in $\beta$ reduces to $(\beta - \hat{\beta})'X'X(\beta - \hat{\beta})/ps^2$, with posterior distribution $F(p, n-p)$. These distributions provide the parallel Bayesian results to those obtained in the previous section using the classical approach.
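The conjugate updating is a few lines of linear algebra. A sketch (simulated data and illustrative hyperparameters, not from the text; numpy assumed) computes $\mu_1$, $C_1$ and $n_1\sigma_1^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=n)

# prior: β|φ ~ N(μ0, φ^{-1} C0^{-1}), n0 σ0² φ ~ χ²_{n0} (values illustrative)
mu0, C0 = np.zeros(p), np.eye(p)
n0, sig02 = 4, 1.0

C1 = C0 + X.T @ X                                  # posterior precision (up to φ)
mu1 = np.linalg.solve(C1, C0 @ mu0 + X.T @ y)      # posterior mean
n1 = n + n0
n1sig12 = n0 * sig02 + (y - X @ mu1) @ y + (mu0 - mu1) @ (C0 @ mu0)
# posterior: β|y ~ t_{n1}(μ1, σ1² C1^{-1}) with σ1² = n1sig12 / n1
```

Taking $C_0 \to 0$ and the $\phi^{-1}$ prior recovers the non-informative results numerically.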
Example 1 (continued). Suppose that in the prior distribution $\beta_0$ and $\beta_1$ are conditionally independent given $\phi$, with $\beta_j \mid \phi \sim N(\mu_j, (c_j\phi)^{-1})$, $j = 0, 1$, and $n_0\sigma_0^2\phi \sim \chi^2_{n_0}$. Define
$$S(\beta_0) = n(\beta_0 - \hat{\beta}_0)^2 \quad \text{and} \quad S(\beta_1) = \left[\sum_{i=1}^n (x_i - \bar{x})^2\right](\beta_1 - \hat{\beta}_1)^2;$$
the likelihood function is then given by
$$l(\beta_0, \beta_1, \phi; y) \propto \phi^{n/2}\exp\left\{-\frac{\phi}{2}[S_e + S(\beta_0) + S(\beta_1)]\right\}.$$
Combining with the prior, the posterior distribution follows:
$$p(\beta_0, \beta_1, \phi \mid y) \propto \phi^{[(n+n_0+2)/2]-1}\exp\left\{-\frac{\phi}{2}[n_0\sigma_0^2 + S_e + c_0(\beta_0 - \mu_0)^2 + S(\beta_0) + c_1(\beta_1 - \mu_1)^2 + S(\beta_1)]\right\},$$
from where it is easy to see that, conditional on $\phi$, $\beta_0$ and $\beta_1$ stay independent with distributions $N(\mu_j^*, (c_j^*\phi)^{-1})$, $j = 0, 1$, respectively, where $c_0^* = c_0 + n$, $c_1^* = c_1 + \sum_{i=1}^n (x_i - \bar{x})^2$,
$$\mu_0^* = \frac{c_0\mu_0 + n\hat{\beta}_0}{c_0 + n} \quad \text{and} \quad \mu_1^* = \frac{c_1\mu_1 + [\sum_{i=1}^n (x_i - \bar{x})^2]\hat{\beta}_1}{c_1 + \sum_{i=1}^n (x_i - \bar{x})^2}.$$
The posterior distribution of $\phi$ is $n_1\sigma_1^2\phi \mid y \sim \chi^2_{n_1}$, with $n_1 = n + n_0$ and
$$n_1\sigma_1^2 = n_0\sigma_0^2 + S_e + \frac{c_0 n}{c_0 + n}(\hat{\beta}_0 - \mu_0)^2 + \frac{c_1\sum_{i=1}^n (x_i - \bar{x})^2}{c_1 + \sum_{i=1}^n (x_i - \bar{x})^2}(\hat{\beta}_1 - \mu_1)^2.$$
The marginal posterior distributions of $\beta_0$ and $\beta_1$ are Student $t$ with $n_1$ degrees of freedom and parameters $\mu_j^*$ and $\sigma_1^2/c_j^*$, $j = 0, 1$, respectively.
In the case of a non-informative prior $p(\beta_0, \beta_1, \phi) \propto \phi^{-1}$, the posterior is
$$p(\beta_0, \beta_1, \phi \mid y) \propto \phi^{-1}\phi^{n/2}\exp\left\{-\frac{\phi}{2}[(n-2)s^2 + S(\beta_0) + S(\beta_1)]\right\}$$
$$\propto \phi^{1/2}\exp\left\{-\frac{\phi}{2}S(\beta_0)\right\}\phi^{1/2}\exp\left\{-\frac{\phi}{2}S(\beta_1)\right\}\phi^{[(n-2)/2]-1}\exp\left\{-\frac{\phi}{2}(n-2)s^2\right\}.$$
Examining the above expression, it is easy to identify that $\beta_0 \mid \phi, y \sim N(\bar{Y}, (n\phi)^{-1})$, $\beta_1 \mid \phi, y \sim N(\hat{\beta}_1, (\phi\sum_{i=1}^n (x_i - \bar{x})^2)^{-1})$ and $(n-2)s^2\phi \mid y \sim \chi^2_{n-2}$. Then, $\beta_0 \mid y \sim t_{n-2}(\bar{Y}, s^2/n)$ and $\beta_1 \mid y \sim t_{n-2}(\hat{\beta}_1, s^2/\sum_{i=1}^n (x_i - \bar{x})^2)$, and so the inference coincides numerically with the one obtained following the classical approach.
Example 2 (continued). Assume that $\beta_j \mid \phi \sim N(\mu_j, (c_j\phi)^{-1})$, $j = 1, \ldots, k$, are conditionally independent and that $n_0\sigma_0^2\phi \sim \chi^2_{n_0}$. An alternative to the conditional independence of the $\beta_j$'s will be considered in the next section. Then, as in Example 1, $X'X$ is a diagonal matrix and the quadratic form $(\beta - \hat{\beta})'X'X(\beta - \hat{\beta})$ reduces to
$$\sum_{j=1}^k n_j(\beta_j - \bar{Y}_j)^2.$$
The likelihood is given by
$$l(\beta, \phi; y) \propto \phi^{n/2}\exp\left\{-\frac{\phi}{2}\left[(n-k)s^2 + \sum_{j=1}^k n_j(\beta_j - \bar{Y}_j)^2\right]\right\}.$$
Combining with the prior distribution, the following posterior is obtained:
$$p(\beta, \phi \mid y) \propto \phi^{[(n_1+k)/2]-1}\exp\left\{-\frac{\phi}{2}\left[n_1\sigma_1^2 + \sum_{j=1}^k c_j^*(\beta_j - \mu_j^*)^2\right]\right\},$$
where $c_j^* = c_j + n_j$, $\mu_j^* = (c_j\mu_j + n_j\bar{Y}_j)/c_j^*$, $j = 1, \ldots, k$, $n_1 = n + n_0$ and
$$n_1\sigma_1^2 = n_0\sigma_0^2 + (n-k)s^2 + \sum_{j=1}^k \frac{n_jc_j}{n_j + c_j}(\bar{Y}_j - \mu_j)^2.$$
It is easy to obtain that $\beta_j \mid \phi, y \sim N[\mu_j^*, (c_j^*\phi)^{-1}]$, thus retaining the prior independence, and $n_1\sigma_1^2\phi \mid y \sim \chi^2_{n_1}$; therefore, the marginal posterior distributions of the $\beta_j$ are $t_{n_1}(\mu_j^*, \sigma_1^2/c_j^*)$, $j = 1, \ldots, k$. The point estimators of $\beta_j$ and $\phi$ are given by $\mu_j^*$ and $\sigma_1^{-2}$, $j = 1, \ldots, k$, respectively. Confidence intervals for $\beta_j$ and $\phi$ (or $\sigma^2$) can be obtained using the percentiles of the $t_{n_1}$ and $\chi^2_{n_1}$ distributions, respectively. In the case of a non-informative prior, we will have the posterior
$$p(\beta_1, \ldots, \beta_k, \phi \mid y) \propto \phi^{(n/2)-1}\exp\left\{-\frac{\phi}{2}\left[(n-k)s^2 + \sum_{j=1}^k n_j(\beta_j - \bar{Y}_j)^2\right]\right\}$$
$$\propto \left\{\prod_{j=1}^k \phi^{1/2}\exp\left[-\frac{\phi}{2}n_j(\beta_j - \bar{Y}_j)^2\right]\right\}\phi^{[(n-k)/2]-1}\exp\left\{-\frac{\phi}{2}(n-k)s^2\right\}.$$
Making the same identifications as in Example 1, the marginal posterior distributions $\beta_j \mid y \sim t_{n-k}(\bar{Y}_j, s^2/n_j)$, $j = 1, \ldots, k$, and $(n-k)s^2\phi \mid y \sim \chi^2_{n-k}$ are easily obtained. These distributions are similar to the ones obtained in classical inference.
Predictive distributions are needed to perform hypothesis testing and prediction from a Bayesian point of view. As we have already seen, these distributions can be obtained by integrating the sampling distribution with respect to the distribution of the parameters. Fortunately, in the normal case, this job is greatly simplified by the linear structure of the model. Consider the model
$$Y = X\beta + e, \quad e \sim N(0, \phi^{-1}I_n), \quad \text{where } \phi = 1/\sigma^2,$$
and suppose that the conditional prior distribution of $\beta$ is $\beta \mid \phi \sim N(\mu, \phi^{-1}C^{-1})$. Combining these two results, we get
$$Y \mid \phi \sim XN(\mu, \phi^{-1}C^{-1}) + N(0, \phi^{-1}I_n) \sim N[X\mu, \phi^{-1}(I_n + XC^{-1}X')],$$
since the above normal distributions are independent. Supposing now that $\nu\sigma_0^2\phi \sim \chi^2_\nu$, the marginal distribution of $Y$ is
$$Y \sim t_\nu[X\mu, \sigma_0^2(I_n + XC^{-1}X')],$$
and the marginal distribution of $Y_i$ is $t_\nu[x_i'\mu, \sigma_0^2(1 + x_i'C^{-1}x_i)]$, $i = 1, \ldots, n$.
To compare two hypotheses $H_0$ and $H_1$, evaluation of the distributions of $Y \mid H_i$, $i = 0, 1$, is required. The model validation test is based on the Bayes factor
$$BF(H_0, H_1) = \frac{p(y \mid H_0)}{p(y \mid H_1)},$$
where the denominator is the density of the above distribution, given by
$$p(y \mid H_1) = \frac{\Gamma[(n_0+n)/2]}{\Gamma(n_0/2)\,\pi^{n/2}}(n_0\sigma_0^2)^{n_0/2}|I_n + XC_0^{-1}X'|^{-1/2}[n_0\sigma_0^2 + (y - X\mu_0)'(I_n + XC_0^{-1}X')^{-1}(y - X\mu_0)]^{-(n_0+n)/2}.$$
If it is desirable to test the model validation hypothesis, we build up $H_0: \beta_2 = \cdots = \beta_p = 0$. Under $H_0$, the model simplifies to $Y = 1_n\beta_1 + e$ and $\beta_1 \mid \phi \sim N[\mu_{00}, (c_{00}\phi)^{-1}]$, and therefore $Y \mid \phi \sim N[1_n\mu_{00}, \phi^{-1}(I_n + c_{00}^{-1}1_n1_n')]$. Then, the density $p(y \mid H_0)$ is given by
$$p(y \mid H_0) = \frac{\Gamma[(n_0+n)/2]}{\Gamma(n_0/2)\,\pi^{n/2}}(n_0\sigma_0^2)^{n_0/2}|I_n + c_{00}^{-1}1_n1_n'|^{-1/2}[n_0\sigma_0^2 + (y - 1_n\mu_{00})'(I_n + c_{00}^{-1}1_n1_n')^{-1}(y - 1_n\mu_{00})]^{-(n_0+n)/2}.$$
The Bayes factor is, then, given by
$$BF(H_0, H_1) = \left\{\frac{|I_n + XC_0^{-1}X'|}{|I_n + c_{00}^{-1}1_n1_n'|}\right\}^{1/2}\left\{\frac{n_0\sigma_0^2 + (y - X\mu_0)'(I_n + XC_0^{-1}X')^{-1}(y - X\mu_0)}{n_0\sigma_0^2 + (y - 1_n\mu_{00})'(I_n + c_{00}^{-1}1_n1_n')^{-1}(y - 1_n\mu_{00})}\right\}^{(n_0+n)/2}.$$
In the case of the prediction of an $m$-dimensional vector of future observations $Y^*$ with explanatory variable matrix $x^*$, the same result used above can be applied, as far as the prediction is based on the predictive distribution of $Y^* \mid y$, with density given by
$$p(y^* \mid y) = \int\!\!\int p(y^* \mid \beta, \phi, y)\,p(\beta, \phi \mid y)\, d\beta\, d\phi = \int\left\{\int p(y^* \mid \beta, \phi)\,p(\beta \mid \phi, y)\, d\beta\right\}p(\phi \mid y)\, d\phi,$$
and the calculation is similar to that involved in the evaluation of the Bayes factor. An important difference is that the marginalizations are with respect to the posterior distribution of the parameters, while the Bayes factor is obtained using marginalizations with respect to the prior distribution. Then, using the adopted notation, it follows that
$$Y^* \mid y \sim t_{n_1}(x^*\mu_1, \sigma_1^2[I_m + x^*C_1^{-1}x^{*\prime}])$$
and so point predictions and confidence intervals for $Y^*$ can easily be obtained. The analysis using non-informative priors leads to $\mu_1 \to \hat{\beta}$, $C_1 \to X'X$, $n_1 \to n - p$ and $\sigma_1^2 \to s^2$, and the predictive distribution of $Y^*$ reduces to
$$Y^* \mid y \sim t_{n-p}(x^*\hat{\beta}, s^2[I_m + x^*(X'X)^{-1}x^{*\prime}]).$$
This distribution coincides with the predictive distribution of the classical approach, providing the same predictions.
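Given the posterior summaries, the predictive $t$ parameters for a future design point follow immediately. An illustrative sketch (all numbers hypothetical; numpy assumed):

```python
import numpy as np

# posterior summaries from a conjugate fit (illustrative values)
mu1 = np.array([1.2, -0.8])
C1 = np.array([[40.0, 2.0],
               [2.0, 30.0]])          # posterior precision (up to φ)
n1, sig12 = 44, 0.3

x_star = np.array([[1.0, 0.5]])       # one future design point (m = 1)

# Y*|y ~ t_{n1}( x* μ1 , σ1² [I_m + x* C1^{-1} x*'] )
centre = x_star @ mu1
scale = sig12 * (np.eye(1) + x_star @ np.linalg.solve(C1, x_star.T))
```

Note that the predictive scale always exceeds $\sigma_1^2$, reflecting the extra uncertainty about $\beta$.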
8.4 Hierarchical linear models

In Section 3.5 we have seen how to combine structural information with strictly subjective information to build up the prior distribution in stages. This strategy is explored within the context of linear models in this section. The linear structure of the model combines very well with hierarchical modelling in the context of normal linear models, and this structure can be explored in more detail. Specifically, the hierarchical structure described in Chapter 3 is used with the additional assumptions of linearity and normality. This setup preserves the model linearity as a whole and the conjugacy at every stage of the model. This area was organized systematically and received great research development following the paper by Lindley and Smith (1972). In order to specify the model and its prior, we can rewrite it as
$$Y \mid \beta_1, \phi \sim N(X_1\beta_1, \phi^{-1}I_n),$$
$$\beta_1 \mid \beta_2, \phi \sim N(X_2\beta_2, \phi^{-1}C_1^{-1}),$$
where the matrix with explanatory variables is renamed $X_1$ due to the presence of another matrix including the second stage explanatory variables, $X_2$. This matrix includes the values which explain the variations in $\beta_1$, and the coefficients of this explanation are given by $\beta_2$. The model specified above is the simplest in the class of hierarchical models, containing only two stages. More stages can be included in the model specification depending on the structure of the problem. The form of the model is not modified; only some extra equations are included, each in the form
$$\beta_i \mid \beta_{i+1}, \phi \sim N(X_{i+1}\beta_{i+1}, \phi^{-1}C_i^{-1}).$$
Generally, for higher stages, it is more difficult to specify the equations as the level of elaboration involved gets deeper and deeper. The dependence of the variances on $\phi$ is not by chance: it allows the use of the results about conjugacy developed in Sections 3.3 and 8.3. All the derivations that follow are done only for the model with two stages, to keep the notation simple, although there is no technical problem in extending the results to models with $K$ stages, $K > 2$.
Firstly, it is worth mentioning that the analysis conditional on $\beta_2$ is completely similar to the developments made in the last section. If $\beta_2$ is known, the prior does not depend on its probabilistic specification. So, all that is needed is to apply the results of the last section substituting $\mu_0$ by $X_2\beta_2$. In particular, it is obtained that $\beta_1 \mid y \sim t_{n_1}(\mu_1, \sigma_1^2C_1^{*-1})$ and $n_1\sigma_1^2\phi \mid y \sim \chi^2_{n_1}$, where $\mu_1 = C_1^{*-1}(C_1X_2\beta_2 + X_1'y)$, $C_1^* = C_1 + X_1'X_1$, $n_1 = n + n_0$ and $n_1\sigma_1^2 = n_0\sigma_0^2 + (y - X_1\mu_1)'y + (X_2\beta_2 - \mu_1)'C_1X_2\beta_2$.
Considering again the case with $\beta_2$ unknown, it is worth noting that the distributions of $\beta_1 \mid \phi$ and $\beta_1$ can be obtained via marginalization, as was done at the end of the last section to obtain the predictive distribution of the observations. Therefore, combining the distributions of $\beta_1 \mid \beta_2, \phi$ and $\beta_2 \mid \phi \sim N(\mu, \phi^{-1}C_2^{-1})$, it follows that
$$\beta_1 \mid \phi \sim N[X_2\mu, \phi^{-1}C^{*-1}], \quad \text{with} \quad C^{*-1} = C_1^{-1} + X_2C_2^{-1}X_2',$$
and, by integration, we obtain that $\beta_1 \sim t_{n_0}(X_2\mu, \sigma_0^2C^{*-1})$.

Example 2 (continued). In the last section, it was assumed that the means $\beta_j$, $j = 1, \ldots, k$, were independently distributed a priori. A reasonable alternative for the case in which similarity among the $k$ groups can be assumed is to suppose that the means are a random sample from a population of means. This population is fictitious and, to fix ideas, taken as homogeneous. To keep the structure presented above, it is assumed that this population is normal. So, it follows that $\beta_1, \ldots, \beta_k$ is a sample from the $N(\mu, (c_1\phi)^{-1})$ distribution, where $c_1$ measures the precision of the population of means relative to the precision of the likelihood. The model is completed with the specification of the distributions of $\mu \mid \phi$ and $\phi$. It follows that the prior is completely specified by
$$\mu \mid \phi \sim N(\mu_0, (c_2\phi)^{-1}) \quad \text{and} \quad n_0\sigma_0^2\phi \sim \chi^2_{n_0},$$
and the distribution of $\beta \mid \phi$ is $N[1_k\mu_0, \phi^{-1}(c_1^{-1}I_k + c_2^{-1}1_k1_k')]$; the components of $\beta$ are no longer independent. In particular, the prior correlation between any two distinct components is given by $(1 + c_2/c_1)^{-1}$. This may be helpful in the specification of the constants $c_1$ and $c_2$: a larger value for the ratio $c_2/c_1$ leads to a smaller prior correlation and vice versa.
It is interesting to see that the distributions of $\beta_1$ and $\beta_2$ conditional on $\phi$ are multivariate normal and therefore the theory developed in the last section is directly applicable, and the predictive and posterior distributions can easily be obtained. So, the predictive distribution for $Y \mid \phi$ is given by
226
Introduction to linear models
and the marginal for
N
by 1n0 and
normal-X
2
.p-I
Y is obtained from the above expression substituting the
by
with parameters JL , C*, n and
err given by
1 1 llt=C'-1(CjXzJLo+X'tY) C* = Cj +X' X1 1 n1 =n +no nw�=noaJ + (y- XtJLt)'y +(X2JLo- llt)'Cj-1X21LO·
Example 2 (continued). Consider again the analysis of variance model with one classification factor with a hierarchical prior. Applying the above results, it follows that

C1* X2 μ0 + X1' y = (c1⁻¹ + k c2⁻¹)⁻¹ μ0 1_k + (n1 ȳ1, ..., nk ȳk)'
C* = diag(c1 + n1, ..., c1 + nk) − [c1²/(k c1 + c2)] 1_k 1_k'.

It can still be shown, in the case of a non-informative prior in the second level (c2 → 0) and an equal number of observations in each group, that the posterior mean of βj is of the form

E(βj | φ, y) = wj ȳj + (1 − wj) ȳ,  with 0 ≤ wj ≤ 1, j = 1, ..., k,

where ȳ is the average of the group averages. This type of estimator is known as a shrinkage estimator since it brings all the group estimates closer to the global mean, shrinking the distance among the means. Shrinkage estimators and their properties were studied by many authors including Copas, James, Morris and Stein, as cited in O'Hagan (1994).
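The shrinkage effect can be illustrated numerically. In the sketch below the weight w is simply assumed (in the model above it would be determined by the data and prior precisions), and the group averages are invented for illustration:

```python
import numpy as np

# Shrinkage estimator: each group average is pulled towards the
# global mean (the average of the group averages).
# The weight w and the group averages are assumed for illustration only.
group_means = np.array([10.0, 12.0, 20.0])   # invented group averages
grand_mean = group_means.mean()              # average of group averages
w = 0.8                                      # assumed weight, 0 <= w <= 1
shrunk = w * group_means + (1 - w) * grand_mean
```

Every estimate moves towards the grand mean, so the spread among the shrunk estimates is strictly smaller than among the raw group averages.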
To obtain the posterior distribution of β2 it is first necessary to write the likelihood function of β2 and φ. Since Y = X1 β1 + ε with β1 | β2, φ ~ N(X2 β2, φ⁻¹ C1⁻¹) and ε ~ N(0, φ⁻¹ I_n), it follows that

Y | β2, φ ~ N(X1 X2 β2, φ⁻¹ C0⁻¹),  where C0⁻¹ = I_n + X1 C1⁻¹ X1'.

Therefore, likelihood and prior are normal as before, with a slight difference since the observational variance is not proportional to an identity matrix. The same results are still valid with the respective substitutions, that is, the posterior distribution of β2 and φ is multivariate normal-χ² with parameters μ2, C2*, n2 and σ2² given by

μ2 = C2*⁻¹(C2 μ0 + (X1 X2)' C0 y)
C2* = C2 + (X1 X2)' C0 X1 X2
n2 = n + n0
n2 σ2² = n0 σ0² + (y − X1 X2 μ2)' C0 y + (μ0 − μ2)' C2 μ0.
8.5 Dynamic linear models

This is a broad class of models with time-varying parameters, useful for modelling time series data and regression. It was introduced by Harrison and Stevens (1976) and is very well documented in the book by West and Harrison (1997). In this section some fundamental aspects of dynamic models will be introduced and some examples in time series as well as in regression will be addressed.
8.5.1 Definition and examples

Dynamic linear models are parametric models where the parameter variation with time and the available data information are described probabilistically. They are characterized by a pair of equations, named the observational equation and the parameter evolution or system equation. These are given by

Y_t = x_t' β_t + ε_t,  ε_t ~ N(0, σ_t²)
β_t = G_t β_{t−1} + ω_t,  ω_t ~ N(0, W_t)

where Y_t is a time sequence of observations, independent conditionally on the sequence of parameters β_t, x_t is a vector of explanatory variables as described in Section 8.1, β_t is a p × 1 vector of parameters, G_t is a p × p matrix describing the parameter evolution and, finally, σ_t² and W_t are the variances of the errors associated with the unidimensional observation and with the p-dimensional vector of parameters, respectively. Summarizing, a dynamic linear model is completely specified by the quadruple {x_t, G_t, σ_t², W_t}. Two special cases are, respectively, time series models, characterized by x_t = x and G_t = G, ∀t, and dynamic regression models, described by G_t = I_p.
Example 3. The simplest model in time series is the first-order polynomial model, which corresponds to a first-order Taylor series approximation of a smooth time function, named the time series trend. This model is completely defined by the quadruple {1, 1, σ_t², W_t}. The above equations specialize to

Y_t = β_t + ε_t,  ε_t ~ N(0, σ_t²)
β_t = β_{t−1} + ω_t,  ω_t ~ N(0, W_t)

where β_t is unidimensional.
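A short simulation helps to visualize this model; the values of σ², W and the series length below are assumptions made only for this sketch:

```python
import numpy as np

# Simulate the first-order polynomial model {1, 1, sigma2, W}:
#   beta_t = beta_{t-1} + w_t,   w_t   ~ N(0, W)       (system equation)
#   Y_t    = beta_t + eps_t,     eps_t ~ N(0, sigma2)  (observational equation)
# sigma2, W and T are illustrative values, not from the text.
rng = np.random.default_rng(0)
T, sigma2, W = 200, 1.0, 0.05
beta = np.cumsum(rng.normal(0.0, np.sqrt(W), size=T))  # slowly wandering level
y = beta + rng.normal(0.0, np.sqrt(sigma2), size=T)    # noisy observations
```

The level β_t wanders slowly (variance W per step) while the observations scatter around it with the larger variance σ².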
Although this model is very simple, it can be applied in many short-term forecasting systems involving a large number of time series, such as in stock control or production planning. The observational and parameter variances can also evolve in time, offering a broad scope for modelling.

A slightly more elaborate model, named the linear growth model (LGM, in short), is derived after including an extra parameter β_{2,t} to describe the underlying growth of the process. Then, it follows that

Y_t = β_{1,t} + ε_t
β_{1,t} = β_{1,t−1} + β_{2,t−1} + ω_{1,t}
β_{2,t} = β_{2,t−1} + ω_{2,t},  ω_t = (ω_{1,t}, ω_{2,t})' ~ N(0, W_t).

The parameter β_{1,t} is interpreted as the current level of the process and it is easy to verify that x_t = (1, 0)' and G_t = (1 1; 0 1), ∀t, characterizing a time series model.

Example 1 (continued). Simple dynamic linear regression. Suppose, in this example, that pairs of values (x_t, Y_t) are observed through time and that it is wished to model the existing relationship between x_t and Y_t. Assuming that the linear model is a good approximation for the relationship between these values, a simple linear regression model can be set. Its parameters can be estimated via classical methods or through the use of the Bayesian argument as described in Section 8.3. Since the linear relationship is only a local approximation to the true functional dependence involving x and Y, a model with varying parameters is appropriate. In many applications time-varying parameters are more adequate. For example, the omission of some variables can justify the parameter oscillation; the non-linearity of the functional relationship connecting x and Y or some structural changes occurring in the process under investigation can also be responsible for the parameter instability. These situations can then be modelled as

Y_t = x_t' β_t + ε_t
β_t = β_{t−1} + ω_t

where x_t = (1, x_t)' and ω_t ~ N(0, W_t). Note that, in this case, G_t = I_2.
As we can observe, the choice of x_t and G_t depends on the model and the nature of the data being analysed. To complete the model specification the variances σ_t² and W_t must be set. The observational variance is usually supposed time-invariant, as in the previous sections, and W_t describes the speed of the parameter evolution. In applications σ_t² is often larger than the elements of W_t. In what follows the parameter estimation method, including the observational variance, will be described. To make the conjugate analysis easier, W_t is scaled by σ_t², and the conditional variances of the ω_t become σ_t² W_t. Therefore, the matrix W_t can be interpreted as a matrix of relative weights with respect to the observational variance. The parameter evolution variance matrix must be assessed subjectively by the user of the method and, in order to do that, the notion of a discount factor will be useful. Alternatively, it can be estimated by one of the approximating methods described in Chapter 5.
The equations presented before can be rewritten as

Y_t | β_t ~ N(x_t' β_t, σ_t²)
β_t | β_{t−1} ~ N(G_t β_{t−1}, σ_t² W_t).

Let D_t = {D_{t−1}, Y_t}, with D_0 describing the initial available information, including the values of x_t and G_t, ∀t, which are supposed to be known.

It is worth noting that it is assumed that, for any time t, the current observation Y_t is independent of the past observations given the knowledge of β_t. This means that the temporal dynamics are summarized in the state parameter evolution. This linear structure for modelling data observed through time combines very well with the principles of Bayesian inference, by the possibility of describing subjectively the probabilities involved and by its sequential nature. Therefore, subjective information is coherently combined with past information to produce convenient inferences.
8.5.2 Evolution and updating equations

The equations described before enable a joint description of (Y_t, β_t) given the past observed data D_{t−1} via

p(y_t, β_t | D_{t−1}) = p(y_t | β_t) p(β_t | D_{t−1}).

This leads to the predictive distribution after integrating out β_t. One of the main characteristics of the dynamic linear model is that, at each instant of time, all the information available is used to describe the posterior distribution of the state vector. The theorem that follows shows how to evolve from the posterior distribution at time t − 1 to the posterior at t.
Theorem. Consider the model defined above with σ_t² = σ², ∀t. Denote the posterior distribution at t − 1 by (β_{t−1} | D_{t−1}, σ²) ~ N(m_{t−1}, σ² C_{t−1}) and the marginal posterior distribution of φ = σ⁻² by φ | D_{t−1} ~ G(n_{t−1}/2, n_{t−1} s_{t−1}/2). Then:

1. Conditionally on σ², it follows that
(a) Evolution — the prior distribution at t will be

β_t | σ², D_{t−1} ~ N(a_t, σ² R_t)

with a_t = G_t m_{t−1} and R_t = G_t C_{t−1} G_t' + W_t.
(b) The one-step-ahead predictive distribution will be

Y_t | σ², D_{t−1} ~ N(f_t, σ² Q_t),

with f_t = x_t' a_t and Q_t = x_t' R_t x_t + 1.
(c) Updating — the posterior distribution at t will be β_t | σ², D_t ~ N(m_t, σ² C_t) with m_t = a_t + A_t e_t and C_t = R_t − A_t A_t' Q_t, where A_t = R_t x_t/Q_t and e_t = y_t − f_t.

2. The precision φ is updated by the relation φ | D_t ~ G(n_t/2, n_t s_t/2) with n_t = n_{t−1} + 1 and n_t s_t = n_{t−1} s_{t−1} + e_t²/Q_t.

3. Unconditionally on σ², we will have
(a) β_t | D_{t−1} ~ t_{n_{t−1}}(a_t, s_{t−1} R_t);
(b) Y_t | D_{t−1} ~ t_{n_{t−1}}(f_t, Q_t*), with Q_t* = s_{t−1} Q_t;
(c) β_t | D_t ~ t_{n_t}(m_t, s_t C_t).

Proof. (1) Item (a) follows immediately using the parameter evolution equation and standard facts from normal theory. With respect to (b), using the prior distribution in (a), it follows that

f_t = E[E(Y_t | β_t) | σ², D_{t−1}] = E[x_t' β_t | σ², D_{t−1}] = x_t' a_t
V(Y_t | σ², D_{t−1}) = V[E(Y_t | β_t) | σ², D_{t−1}] + E[V(Y_t | β_t) | σ², D_{t−1}] = V[x_t' β_t | σ², D_{t−1}] + σ² = σ²(x_t' R_t x_t + 1) = σ² Q_t

and the normality is a consequence of the fact that all the distributions involved are normal. To prove part (c), suppose that the posterior distribution at t − 1 is as given in the theorem. We wish to show that (c) follows from the application of Bayes' theorem, that is,

p(β_t | σ², D_t) ∝ p(β_t | σ², D_{t−1}) p(y_t | β_t, σ²).

This can be shown using Theorem 2.1 and the identity C_t⁻¹ = R_t⁻¹ + x_t x_t'.

If σ² is unknown, define φ = σ⁻². By hypothesis, φ | D_{t−1} ~ G(n_{t−1}/2, n_{t−1} s_{t−1}/2) and y_t | φ, D_{t−1} ~ N(f_t, Q_t/φ). Then, by Bayes' theorem,

p(φ | D_t) ∝ φ^{(n_{t−1}+1)/2 − 1} exp{−(φ/2)(n_{t−1} s_{t−1} + e_t²/Q_t)}

and therefore φ | D_t ~ G(n_t/2, n_t s_t/2). Finally, for part 3 of the theorem, the proofs of items (a)–(c) follow from the results about conjugacy of the normal-χ² to the normal model and from the marginal distributions obtained in Sections 3.3 and 8.3. □
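For the first-order polynomial model (x_t = G_t = 1) and conditionally on σ² = 1, the evolution and updating equations of the theorem reduce to simple scalar recursions. The sketch below implements them; the data and the values of W, m0 and C0 are assumptions for illustration only:

```python
import numpy as np

def dlm_filter(y, W, m0, C0):
    """Scalar evolution/updating recursions of the theorem, with sigma2 = 1."""
    m, C = m0, C0
    ms, Cs = [], []
    for yt in y:
        a, R = m, C + W          # evolution: prior moments at t
        f, Q = a, R + 1.0        # one-step-ahead forecast moments
        A = R / Q                # adaptive coefficient A_t = R_t / Q_t
        e = yt - f               # one-step-ahead forecast error e_t
        m = a + A * e            # updating: posterior mean m_t
        C = R - A * A * Q        # posterior variance C_t = R_t - A_t^2 Q_t
        ms.append(m)
        Cs.append(C)
    return np.array(ms), np.array(Cs)

# Illustrative data and starting values (assumed, not from the text)
ms, Cs = dlm_filter([1.0, 2.0, 1.5], W=0.1, m0=0.0, C0=1.0)
```

After the first observation the posterior mean is m_1 = A_1 y_1 = (1.1/2.1) × 1, and the posterior variances decrease as information accumulates.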
8.5.3 Special aspects
Among the special aspects involved in dynamic Bayesian modelling, the fact that it is possible to model the observational variance deserves special attention. For example, it can be modelled as a power law, σ_t² = σ² μ_t^b, where μ_t = x_t' β_t is the process mean level. The constant b can be chosen in parallel to the well-known Box–Cox family of transformations. The scale factor σ² can be sequentially estimated as stated in the theorem. The main advantage of this form of modelling is to avoid data transformation, thus leaving the original data and the interpretation of the parameters unchanged. This can be useful at times where one wishes to perform subjective intervention in the series. Also, when analysing similar time series, it is possible to incorporate the hierarchical structure into the dynamic framework. This idea is formalized with dynamic hierarchical models (Gamerman and Migon, 1993).

To avoid directly setting the state parameter evolution matrix, the use of discount factors is proposed. These are fixed numbers between zero and one describing subjectively the loss of information about the state parameters through time. Remember that R_t = P_t + W_t, where P_t = G_t C_{t−1} G_t'. Denoting the discount factor by δ, we can rewrite R_t = P_t/δ, showing clearly that there is a relationship between W_t and δ. This is given by W_t = (δ⁻¹ − 1) P_t, showing that the loss of information is proportional to the posterior variance of the state parameters. For example, if δ = 0.9, only about 90% of the information passes through time.
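In numerical terms, with an assumed value of P_t the discount construction gives:

```python
# Discount-factor specification: R_t = P_t / delta, so W_t = (1/delta - 1) P_t.
# P_t = 2.0 and delta = 0.9 are assumed values used only for illustration.
delta = 0.9
P_t = 2.0
W_t = (1.0 / delta - 1.0) * P_t
R_t = P_t + W_t                 # equals P_t / delta by construction
```

R_t equals P_t/δ, i.e. the prior variance is inflated by about 11% when δ = 0.9, which is the sense in which roughly 90% of the information passes through time.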
Other relevant aspects of dynamic linear models are that they easily take care of missing observations and automatically implement subjective interventions. In the first case, it suffices not to use the updating equations at the times the observations are missing. In this way, the uncertainties increase with the evaluation of the new prior distribution and the recurrence equations continue to be valid without any
additional problem. From the intervention point of view, the simplest proposal is to use a small discount factor, close to zero, at the times of announced structural changes in the data generation process. In this way the more recent observations will be strongly considered in the updating of the prior distribution and the system can be more adaptive to possible changes.

Finally, it is worth mentioning that the parameter distribution at time t can be revised with the arrival of new observations. We can generically state the parameter distributions as p(β_t | D_{t+k}), ∀k integer. If k > 0, this is named the smoothed distribution; if k = 0, it is just the posterior; and if k < 0, it is the prior distribution.
. , n,
where n is the size of the series, to retrospectively analyse the parameter behaviour.
For example, one may want to quantify the effect of a behaviour change induced by some measure of economic policy. The future data would inform about the change occurring in any particular parameter of the model describing the behaviour of the economic agents involved.
Exercises

§ 8.2
1. Show that the derivative of S(β) with respect to β is given by 2(X'Xβ − X'y). Verify that β̂ does in fact minimize S(β) by calculating the second derivative of S(β) with respect to β.

2. Prove that S(β) = (β − β̂)'X'X(β − β̂) + S_e, where S_e = (y − Xβ̂)'y = y'y − β̂'X'Xβ̂.

3. Obtain a 100(1 − α)% confidence region for β based on [S(β) − S_e]/ps². What is the form of the region?
4. Consider the model Y_i = β_1 + β_2 x_{2i} + ··· + β_p x_{pi} + e_i, i = 1, ..., n, with the e_i's iid N(0, σ²). Construct a 100(1 − α)% confidence region for (β_2, ..., β_p) from the MLR test of validity of the model with level α.

5. Consider the model Y_i = β_1 + β_2(x_{2i} − x̄_2) + ··· + β_p(x_{pi} − x̄_p) + e_i, i = 1, ..., n, with the e_i's iid N(0, σ²) and x̄_j = (1/n) Σ_{i=1}^n x_{ji}, j = 2, ..., p. Show that the MLR test of the hypothesis H_j: β_j = 0, j = 2, ..., p, of level α rejects H_j if

|β̂_j| > t_{α/2, n−p} s / [Σ_{i=1}^n (x_{ji} − x̄_j)²]^{1/2}.

6. Obtain the expression of the 100(1 − α)% confidence interval for the prediction of a future observation Y*:
(a) in the simple linear regression with explanatory variable x*;
(b) in the analysis of variance model with a single classification factor, for an observation from group j, j = 1, ..., k.
§ 8.3

7. Show that

(β − μ0)'C0(β − μ0) + (β − β̂)'X'X(β − β̂) = (β − μ1)'C1(β − μ1) + μ0'C0μ0 + β̂'X'Xβ̂ − μ1'C1μ1

where μ1 = C1⁻¹(C0μ0 + X'y) and C1 = C0 + X'X.

8. Obtain the expressions of μ1, C1, n1 and σ1² for the particular case of a simple linear regression, showing that they coincide with the expressions obtained in Example 1.

9. Obtain the joint posterior distribution of β_0 and β_1 in the simple linear regression model. Are these parameters independent a posteriori? (Note that they are conditionally independent given φ.) Repeat the exercise for the one-way analysis of variance model.

10. Perform Bayesian inference for the simple linear regression model with the same prior as before but with prior correlation ρ (0 < |ρ| < 1) conditional on φ, instead of 0, as assumed before.

11. Prove that if
(a) Y = Xβ + e, with e ~ N(0, φ⁻¹I_n), where φ = 1/σ²,
(b) β | φ ~ N(μ, φ⁻¹C) and
(c) n0σ0²φ ~ χ²_{n0},
then Y ~ t_{n0}(Xμ, σ0²(I_n + XCX')), using only the formula p(z) = ∫ p(z | w)p(w) dw.

12. Obtain the expression of the Bayes factor to test the hypothesis of validity of the model for
(a) the simple linear regression model;
(b) the analysis of variance model with a single classification factor with k levels.
§ 8.4

13. Consider the hierarchical model with K stages given by

Y | β_1, φ ~ N(X_1 β_1, φ⁻¹ I_n)
β_k | β_{k+1}, φ ~ N(X_{k+1} β_{k+1}, φ⁻¹ C_k⁻¹),  k = 1, ..., K − 1
β_K | φ ~ N(μ, φ⁻¹ C_K⁻¹)
n_0 σ_0² φ ~ χ²_{n_0}.

(a) Obtain the prior distribution of β_k | φ, k = 1, ..., K − 1.
(b) Obtain the marginal prior distribution of β_k, k = 1, ..., K − 1.
(c) Obtain the marginal distribution of Y.

14. Show that in the hierarchical analysis of variance model with a single classification factor with k levels, Cov(β_j, β_{j'} | φ) = φ⁻¹c_2⁻¹ and Cor(β_j, β_{j'} | φ) = Cor(β_j, β_{j'}) = (1 + c_2/c_1)⁻¹, ∀(j, j'), j ≠ j', where Cor denotes the correlation.

15. Consider the hierarchical model with two stages. Show that the posterior distribution of β_2 and φ is multivariate normal-χ² with parameters μ_2, C_2*, n_2 and σ_2² given by

μ_2 = C_2*⁻¹(C_2 μ0 + (X_1 X_2)' C_0 y)
C_2* = C_2 + (X_1 X_2)' C_0 X_1 X_2
n_2 = n + n_0
n_2 σ_2² = n_0 σ_0² + (y − X_1 X_2 μ_2)' C_0 y + (μ0 − μ_2)' C_2 μ0.

16. Show that in the hierarchical analysis of variance model with a single classification factor, m observations in each group and non-informative prior at the second level,

E(β_j | φ, y) = w_j ȳ_j + (1 − w_j) ȳ

with 0 ≤ w_j ≤ 1, j = 1, ..., k, where ȳ is the average of the group averages. Obtain the expressions for w_j, j = 1, ..., k.

§ 8.5

17. Consider the first-order polynomial dynamic model with σ_t² = σ² and W_t = W, ∀t. Derive the predictive distribution for a horizon k > 0 and show that

Y_{t+k} | D_t ~ N(m_t, C_t + kW + σ²).

Obtain the distribution of (Y_{t+1} + ··· + Y_{t+k} | D_t) from the joint predictive distribution of (Y_{t+1}, ..., Y_{t+k} | D_t). For example, take k = 2 and prove that

Y_{t+1} + Y_{t+2} | D_t ~ N(2m_t, 4C_t + 2σ² + 5W).

18. Consider again the first-order polynomial dynamic model. Show that

E(β_{t−1} | D_t) = m_{t−1} + (C_{t−1}/R_t)[m_t − m_{t−1}]

and

V(β_{t−1} | D_t) = C_{t−1} − (C_{t−1}/R_t)²(R_t − C_t).

19. Suppose the series was not observed at time t, so that Y_t was missing and, therefore, D_t = D_{t−1}. Obtain the distributions of β_t | D_t and Y_{t+1} | D_t, assuming knowledge of m_{t−1} and C_{t−1}, for a first-order polynomial dynamic model.
Sketched solutions to selected exercises

Chapter 2

2. The test X is such that P(X = 1 | θ = 1) = 0.95, P(X = 1 | θ = 0) = 0.40 and the disease prevalence is P(θ = 1) = 0.70.
(a) From the example on page 26, P(θ = 1 | X = 1) = 0.847. By Bayes' theorem, P(θ = 1 | X = 0) ∝ l(θ = 1; X = 0)P(θ = 1) = 0.035 and P(θ = 0 | X = 0) ∝ l(θ = 0; X = 0)P(θ = 0) = 0.180. Therefore, P(θ = 1 | X = 0) = 0.035/(0.035 + 0.180) = 0.163. The result X = 1 makes the doctor more certain about John's illness because P(θ = 1 | X = 1) = 0.847 > 0.163 = P(θ = 1 | X = 0).
(b) Using P(θ | X_1 = 1) as the prior for the second experiment, it follows from Bayes' theorem that P(θ = 1 | X_1 = 1, X_2 = 1) ∝ 0.95 × 0.847 = 0.805 and P(θ = 0 | X_1 = 1, X_2 = 1) ∝ 0.40 × 0.153 = 0.061. So, the probability that John is ill is P(θ = 1 | X_1 = 1, X_2 = 1) = 0.805/(0.805 + 0.061) = 0.929.
(c) The likelihood for n positive results is (0.95)^n. Therefore the solution of P(θ = 1 | X_1 = 1, ..., X_n = 1) = (0.95)^n 0.70/((0.4)^n 0.3 + (0.95)^n 0.7) ≥ 0.99 is obtained by trial and error as n = 5.
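The updates in this solution can be checked numerically; the helper below simply applies Bayes' theorem on the two-point parameter space, using the values stated in the exercise:

```python
def posterior(prior, lik_ill, lik_healthy):
    """P(ill | data) from the prior P(ill) and the two likelihood values."""
    num = lik_ill * prior
    return num / (num + lik_healthy * (1.0 - prior))

p_pos = posterior(0.70, 0.95, 0.40)    # after X = 1
p_neg = posterior(0.70, 0.05, 0.60)    # after X = 0
p_two = posterior(p_pos, 0.95, 0.40)   # second positive result
```

The three values reproduce, to rounding, the probabilities 0.847, 0.163 and 0.929 computed above.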
5. Let θ = 1 represent the event that the driver is drunk, with P(θ = 1) = 0.25. The test X_1 will be positive (= 1) if the level of alcohol in the blood is high and zero otherwise. It is known that P(X_1 = 1 | θ = 1) = 0.8 and P(X_1 = 1 | θ = 0) = 0.2, and for a second test P(X_2 = 0 | θ = 0) = 1 and P(X_2 = 0 | θ = 1) = 0.10.
(a) P(X_1 = 1) = (0.8)(0.25) + (0.2)(0.75) = 0.35, which can be interpreted as the proportion of stopped drivers that have to be submitted to a second test.
(b) By Bayes' theorem, P(θ = 1 | X_1 = 1) = (0.8)(0.25)/0.35 = 0.571. Also, P(X_2 = 1 | X_1 = 1) = 0 × (1 − 0.571) + (0.9)(0.571) = 0.514. Therefore, P(θ = 1 | X_1 = 1, X_2 = 1) = (0.9)(0.571)/0.514 = 1.
(c) Obviously, P(X_1 = 0) = 1 − P(X_1 = 1) = 0.65.
(a)
p(81x, J.t) ex. p(8, J.t, x) = p(xl&, J.t)p(BIJ.t)p(J.t). But, p(&lx, J.t) ex. exp{- ( 0.5) [(x - 8)2 ja2 + (8 - J.t)2 jr2 + J.t'JJ. After some algebra
rI
236 Solutions to selected exercises (8fx, !-') - N(!-'1, Tf) with 1-'1 = Tf(xja2 + i-'/r2) andr!2 = (a-2 + r-2). This is in fact an application of Theorem 2.1. (b) In order to apply Bayes' theorem we need the likelihood l(tL; x) p(xftL) f p(xf8, tL)p(8ftL)d8 <X f exp {-( 0.5)[ (x -8)2ja2-(81-')2jr2)]}d8. After some calculations we get we get
=
=
p(xftL)
<X
exp [-( 0 .5 )( x
-tL)2/(r2 +a2)].
1-'fx- N(l-'1· rj') with 1-'1 xj (a2 + r2 + I) and r;' = (az + r2)/(a2 + r2 + 1). (c) p(8) <X J p(8ftL)P(tL)dl-'or8- N(O, l+r2). It follows immediately by Bayes' theorem that Blx "" N(8t, r-}) where 81 = r}xa2 and r] = az(r2 + l)j (a2 + r2 + !). Using Theorem 2.1 once again, it follows that =
In fact the above results can be easily obtained after rewriting the
= e + E, E - N(O, a2), e = 1-' + w, w - N(O, r2) N(0, I), where E, w and J.L are independent. Using results about linear combinations of nonnal variates we get X = M + w + E or XII-'- N(tL, r2 +a2). Analogously, 8 tL+w ore- N(O, r2+ 1). 15. From exchangeability of the Xi's and Theorem 2.2, any sequence of n Xi's having k values of 1 and n - k values 0 has probability given by 1 ' /0 e o - 8)"-'dF(8), Vk :s n, n > o. exercise as X and p.,
"'
·
=
'
(a) Since T
=LX;, P(T = t) = LXEA fo 8'(1- 8)"-'dF(8) where A {Xi I: X; t). The number of n-tuples in A is (';) and they all have the same probability. Then, P(T t) (;) J8' (I 8)''-'dF(8). (b) E[T] = 2::7�; E(X;) 2::7�1 E(X,) nE(X1) = nE[ E(Xif8)] = n£[8] since Xi18- Ber(8). =
=
=
=
=
21.
=
(a) The sample space depends on the unknown parameter, hence the dis tribution does not belong to the exponential family. (c)
p(xf8) = (2nT1i2 exp{-0.5[x2-28x + 82]/8] 1 (2rr)- i2 exp(x) exp{ -0.5[x2j8-8-log(8)]}. =
a(x) (2 rr )-1i2 exp (x), u(x) = x2j2 , ¢(8) = e-1 and b(8) = -(8 + log(8))/2, we conclude that it is a member of the Identifying
=
one-parameter exponential family and, by the Neyman factorization criterion, X2 is a sufficient statistic.
(e)
'
p(xfx -1 0, 8) p(xf8)/P[X -1 Of&] G)exo - e)n-x /[1 (l-8)"],x = l, ... ,n or p(xfx "I 0, 8) = G ) exp{x log [8/( 1 8)] +n log(! - 8) -log[ I -(I - 8)"]}. So, a(x) (;) , u(x) = x, ¢(8) = log[8 /0 - 8)] and b(8) n log(! - 8)- log[I -(I -8)"]. =
=
=
=
It is clear that X is a sufficient statistic for fJ.
,
I
l
I -�
l
1
'
I !
Solutions to selected exercises
237
(g) p(x | θ) = θ^x log θ/(θ − 1) = exp{x log θ + log[log θ/(θ − 1)]}. So, a(x) = 1, u(x) = x, φ(θ) = log θ and b(θ) = log[log θ/(θ − 1)]. So it is a member of the exponential family and T(X) = X is a minimal sufficient statistic.
24. Remember that p(x | θ) = a(x) exp[u(x)φ(θ) + b(θ)] and ∫ p(x | θ) dx = 1.
(a) Differentiating both sides with respect to θ, it follows that

∫ [u(x)φ'(θ) + b'(θ)] p(x | θ) dx = 0.

It is then easy to get E[u(X)] = −b'(θ)/φ'(θ).
(b) Differentiating once more gives

∫ [u(x)φ''(θ)p(x | θ) + u(x)φ'(θ)p'(x | θ) + b''(θ)p(x | θ) + b'(θ)p'(x | θ)] dx = 0,

where p'(x | θ) = [u(x)φ'(θ) + b'(θ)] p(x | θ). After some calculation it follows that

E[u²(X)] = [b'(θ)φ''(θ) − b''(θ)φ'(θ)]/φ'(θ)³ + [b'(θ)/φ'(θ)]²

and hence V[u(X)] = [b'(θ)φ''(θ) − b''(θ)φ'(θ)]/φ'(θ)³.
I
l
l
32. (a) It is easy to get p(ψ, ξ) = p_{θ,φ}(ψξ, ξ(1 − ψ)) ξ ∝ ξ, because the Jacobian of the transformation is just ξ.
(b) p(ψ, ξ | x, y) ∝ θ^x exp(−θ) φ^y exp(−φ) ξ, where θ = ψξ and φ = ξ(1 − ψ). So, p(ψ, ξ | x, y) ∝ ξ^{x+y+1} exp(−ξ) ψ^x (1 − ψ)^y, or proportional to the product of a G(x + y + 2, 1) and a beta(x + 1, y + 1) distribution.
(c) First of all, it is easy to see that X + Y ~ Pois(ξ), where ξ = θ + φ, since X and Y are independent Poisson distributed with parameters θ and φ, respectively. Since p(x | x + y, ψ, ξ) = p(x | ψ, ξ) p(x + y | x, ψ, ξ)/p(x + y | ψ, ξ), it follows that p(x | x + y, ψ, ξ) = (x + y choose x) ψ^x (1 − ψ)^y, which is only a function of ψ.
(d) Using the distributions obtained in (c) and the factorization criterion, it is simple to obtain the results.
(e) The marginal likelihoods of ψ and ξ are l(ψ; x, y) = p(x, y | ψ) = ∫ p(x, y | ψ, ξ) p(ξ | ψ) dξ and l(ξ; x, y) = p(x, y | ξ) = ∫ p(x, y | ψ, ξ) p(ψ | ξ) dψ. Since p(ξ | ψ) ∝ ξ and p(ψ | ξ) ∝ k, simple integrations lead to l(ψ; x, y) ∝ ψ^x (1 − ψ)^y and l(ξ; x, y) ∝ ξ^{x+y} exp(−ξ).
35. (a) The inverse transformations are θ_1 = λψ and θ_2 = ψ(1 − λ) and the Jacobian is J = ψ. The distribution of (λ, ψ) follows as p(λ, ψ) ∝ ψ, since p(θ) ∝ k.
(b) p(x | λ, ψ) = λ^{x_1}(1 − λ)^{x_2} ψ^{x_1+x_2}(1 − ψ)^{1−x_1−x_2}. Then the marginal likelihood for ψ is p(x | ψ) = ∫ p(x | λ, ψ) p(λ | ψ) dλ. Since p(λ | ψ) ∝ k, p(x | ψ) ∝ ψ^{x_1+x_2}(1 − ψ)^{1−x_1−x_2}.
=
_
(c) By the Neyman factorization criterion with f(x_1 + x_2, ψ) = ψ^{x_1+x_2}(1 − ψ)^{1−x_1−x_2} and g(x) = k, it follows that X_1 + X_2 is sufficient for ψ.
Chapter 3

1. (a) My experience of living in Rio gives me confidence to assess the first quartile as 20 °C and the median as 28 °C.
(b) A good approximation for the standard deviation is σ ≈ 1.25|Q1 − Q2|, where Qi is the ith quartile. So my assessment is that the temperature is normally distributed with mean 28 °C and standard deviation equal to 10 °C.
(c) The 0.1 quantile determined from the normal distribution conflicts with the subjective assessment of this quantile. The assessment of normality must be revised.
5. (a) l(θ; y) = Π_{i=1}^{k} (n_i choose y_i) θ^{y_i}(1 − θ)^{n_i−y_i} ∝ θ^9(1 − θ)^3.
(b) The class of beta distributions can be used as prior, say with a = 1 and b = 1 representing vague initial information. So, p(θ | y) ∝ θ^9(1 − θ)^3, or θ | y ~ beta(10, 4).
(c) Since conditions are similar, it is natural to assume that y_5 | θ ~ bin(n_5, θ) with n_5 = 3. The predictive distribution will be a beta-binomial distribution with parameters (3, 10, 4), providing the following probabilities:

y_5:        0      1      2      3
p(y_5 | y): 0.036  0.179  0.393  0.393
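The beta-binomial probabilities in the table can be reproduced directly from the definition; the sketch below uses only the standard library:

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial(y, n, a, b):
    """P(Y = y) for the beta-binomial(n, a, b) distribution."""
    return comb(n, y) * exp(log_beta(a + y, b + n - y) - log_beta(a, b))

# n5 = 3 future trials with the beta(10, 4) posterior of this exercise
probs = [beta_binomial(y, 3, 10, 4) for y in range(4)]
```

The four probabilities sum to one and agree with the tabled values to three decimal places.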
10. (a) Assuming that θ ~ G(a, b), then μ = E[θ] = a/b = 4 and the coefficient of variation is CV(θ) = a^{−1/2} = 0.5, so that a = 4 and b = 1. The posterior variance will be less than or equal to 0.01 if and only if (t + a)/(n + b)² ≤ 0.01, where t = Σx_i. Solving the quadratic inequality in n it follows that n ≥ 10(4 + t)^{1/2} − 1.
(b) The posterior mean can be written as μ_1 = (a + t)/(b + n) = γ_n x̄_n + (1 − γ_n)μ_0, where μ_0 = a/b and γ_n = n/(n + b). The limit of γ_n, when n → ∞, will be 1.
(c) The posterior for θ is beta(a + t, b + n − t), where t = Σx_i. This distribution has mean (a + t)/(a + b + n) = γ_n x̄_n + (1 − γ_n)μ_0, where μ_0 = a/(a + b) and γ_n = n/(a + b + n). The limit of γ_n, when n → ∞, will be 1.

16.
Solutions to selected exercises 239 (b) The observed information is -d2log(p(xl8))/d82 =nj82 which co incides with the Fisher infonnation for a sample of size n. The Jeffreys
p(8) <X I(8)112 <X 8-1• It is obviously improper since f p(8)d8 =f k8-1d8 diverges. (c) The posterior distribution is p(8 tx) e< 8"-1b"0 /({n7� xi}11")"9. Let 1 1/ z = {D7=t Xi} n be the sample geometric mean. It is c1ear that prior will be
p(Oix)
e<
e"-1exp(-n0log(z/b)) which corresponds to a
Iog(z/b)).
G(n,n
20. The non-informative prior for e is
27.
p($) e< k. The density for¢=ae + b, a f" 0, is p(¢) = pe[ (¢- b)/a] ldO/d¢1 <X k d[(¢- b)ja]jd¢ oc k. If p(O) e< e-1, e > 0 then p(¢) = pe (¢11" ) I dO/d¢1 with¢ =0", a f" 0 or 0 p(>) ex q,-l If 1jr =log e then p(1jr) e< exp(-1/r) exp(1jr) =I . (a) LetO;II' N(!', b),b known and suppose that theO;'s are independent given I'· Assuming that p(!') ex k, p(8, !') ex n7�, p(O;I f.l, b), or 811' N(!'la,b!a). (b) Since p(f.liY) ex p(YI!t)P(!') where Y;l�t N[f.l, (a +b)], i = I, . . . , n, independent, it follows from Theorem 2.1 that I'IY ex N[y, (a + b)jn]. (c) Note that p(8;II', y) <X p(8;ll')p(yl8; ). Once again, it follows from Theorem 2.1 thatO;I�t, y N[�ti, b'], where l'i (a!'+by;)j(a+b) and b' a bj(a +b). Hence, E(O;I!', y) =�ti,, i =I, ... ,n. (d) E[O;Iy] =E[E[O;Iy,!']] =E(�ti} = by;j(a + b). �
�
�
�
=
=
Chapter 4

4. The posterior is p(θ | x) ∝ θ^t(1 − θ)^{n−t} I_{(0,1)}(θ), where t = Σ_{i=1}^n x_i; that is, a beta(t + 1, n − t + 1).
(a) E[L(θ, d) | x] ∝ ∫₀¹ [(θ − d)²/[θ(1 − θ)]] θ^t(1 − θ)^{n−t} dθ = ∫₀¹ (θ − d)² θ^{t−1}(1 − θ)^{n−t−1} dθ. This integral is proportional to the expected value of the square loss with respect to the beta(t, n − t) distribution. So, it is minimized when d = t/n = x̄. The risk of the Bayes estimator is R(x) = B(t + 1, n − t + 1)⁻¹ ∫₀¹ (θ − x̄)² θ^{t−1}(1 − θ)^{n−t−1} dθ. Multiplying and dividing by B(t, n − t), we get R(x) = [B(t, n − t)/B(t + 1, n − t + 1)] V(θ | t, n − t), the variance taken under the beta(t, n − t) distribution. After some simplifications, it follows that R(x) = 1/n.
(b) The predictive distribution is obtained via

p(x_{n+1} | x) = ∫₀¹ p(x_{n+1} | θ) p(θ | x) dθ.

Assuming that X_{n+1} ~ Ber(θ), independent of X, it follows that p(x_{n+1} | x) = B(t + 1, n − t + 1)⁻¹ ∫₀¹ θ^{x_{n+1}+t}(1 − θ)^{n−x_{n+1}−t+1} dθ. Solving the integral gives p(x_{n+1} | x) = B(t + x_{n+1} + 1, n − x_{n+1} − t + 2)/B(t + 1, n − t + 1). Finally, p(x_{n+1} | x) = (t + 1)^{x_{n+1}}(n − t + 1)^{1−x_{n+1}}/(n + 2). The mean and variance of the predictive distribution are (t + 1)/(n + 2) and (t + 1)(n − t + 1)/(n + 2)², respectively.
(c) Let t_1, ..., t_k be the counts associated with each of the k possible values of X. Then, straightforward generalizations give for each θ_i the Bayes estimator t_i/n, i = 1, ..., k, with associated risks 1/n. The predictive probability function generalizes to p(x_{n+1} = i | x) = (t_i + 1)/(n + k), i = 1, ..., k.
7. (a) p(θ | x) ∝ θ⁻¹ I_{(x−1, x+1)}(θ). The proportionality constant is obtained by making 1 = ∫ p(θ | x) dθ = k log θ |_{x−1}^{x+1}; so, k⁻¹ = log[(x + 1)/(x − 1)], x > 1.
(b) Posterior mean, mode and median are easily obtained. E(θ | x) = ∫_{x−1}^{x+1} kθ θ⁻¹ dθ = k[(x + 1) − (x − 1)] = 2k = 2/log[(x + 1)/(x − 1)]. To evaluate the mode, it is enough to observe that p(θ | x) is a strictly decreasing function of θ. So its maximum occurs at x − 1. The median, in turn, is the solution of ∫_{x−1}^{m} kθ⁻¹ dθ = 1/2, or k log[m/(x − 1)] = 1/2. After some simplifications it follows that m = [(x − 1)(x + 1)]^{1/2}.
fxm-l ke-' de=I12 or k log[m I(x - I)] = 1/2. After some simplifications it follows thatm = [(x-I)(x+ 1)]112
12. The density functions for each type of bulb are p(xio/) = o/ exp(-,Yx),
,y-1 = 8, 2e and 38, respectively. (a) Observing one bulb of each type, it follows !hat l(e; x) ex exp[-(XJ + x212 + x,I3)18JI6e3 The MLE is the solution of the equation
. dlogi(e; x)ld8 = -318 +(X, + Xzl2 + X,l3)18 2 = 0: That is,§=(X1 + X212 + X313)13. (b) Assuming the prior o/ G(a, {3), it follows that the posterior distri" bution is p(o/ix) ex ,yJ+a-l exp[-o/(XJ + x212 + x,l3 + {3)]. (c) The Bayes estimator is the posterior mean, E['frix] =(a+ 3)1({3 + XJ + xz/2 + XJ3), if a square loss function is assumed. For the 0� 1 loss function the Bayes estimator is the posterior mode, (a+2)/(fJ + Xi + X212 + X313). �
16. It is worth remembering that the moment generating function for the Pois(tl) is
E[exp(-cY)] =
=
exp[-ne(l - exp(-c))].
The estimator exp(-cY) wil! be unbiased for exp(-e) iff ne(J exp(-c)) e or c = log[nl(n - 1 )], Therefore, (1- lln)Y is an unbiased estimator of e-8. =
(b) By the Cramer-Rao inequality, the variance lower bound of an unbi ased estimator of h(e) is given by h'(8)2 1 I (e). In the present case h(B) = exp(-8), so h'(O) = -exp(-8) and the expected informa tion is obtained as E[-d21ogi(&;x)ld&2] = E[,Lx,;e2] = nle. Therefore V[exp(-cY)]?:: e exp(-28)1n, fore= log[nl(n - 1)].
Solutions to selected exercises 241 (c) The variance of exp(-cY) can be calculated as V[exp(-cY) ] = exp[-n8( 1-exp(-2c)]-exp[ -2n8(l-exp( -c)], q>y(2c)-q>�(c) for c = log[n/(n - !)]. After some algebra, it follows that V[exp (-cY)] = exp[-8(2-n)]- e xp(-28). Tbe ratiobetween theCramer Rao lower bound and the variance of the estimator is less than 1, so the estimator is not efficient for exp(-B). =
24.
(a) p(81x) c< p(xl8) c< 8x ( l - &)"-'· Assuming that X= n it follows that p(8]x) ex en, which is a monotonic increasing function of e. So the maximum posterior derisity interval will be in the fonn [a, I], for some a < 1 such that P[8 :0: a lx] I - a. (b) Let 1/f 8/(l -8). Then, P[af(l- a) 'S 1/flx] = P[a/(1- a) -s 8/(l - 8)1x] = P(a "08 Sllx)= 1-a. (c) ltis easyto eva!uatep(>/flx) Pe(1/f/(!+1/f)ix) ld8/d1/flor p(>/flx) ex 1/f" j(l + o/)"+2, which is in the form of an F.[2(n + !), 2] distribution. (d) The interval obtained in (b) is not an HDP because the F Oe'risltY is not monotonically decreasing in 8. =
=
=
(a) a exponentially distributed lifetime_s with o.bse_rv e d mean lifetime equal to b lead to a likelihood /(8) ex 8" exp(-8ab). The prior distribution for 8 can then be obtained.as p(B) ex ea exp(-8ab). This distribution is in the form of a G (a + 1, ab) distribution. (b) Using res u lts of the conjugate analysis, it follows that p(Bix) is a G(a + t + !, n +a b) where t = _Lx;.
27.
(c) Since T Pois(n8), p(81t) 811 G(a +I+ I, n + ab). �
c<
[(n8)' exp(-n8)][8" exp(-ab8)] or
�
(d) The posterior distributions obtained in (b) and (c) are the same, which is not surprising since T is a sufficient statistic for 8.
31. P((θ − μ₁)² ≤ 5V₁ | x) = P(|θ − μ₁|/√V₁ ≤ √5 | x) = P(|t_{n₁}(0, 1)| ≤ √5). By trial and error, n₁ = 11 is the largest integer ensuring that the above probability is larger than 0.95. Therefore, n = n₁ − n₀ = 6.
Chapter 5

2. The log-likelihood function is given by L(α, β; X) = n[α log β − log Γ(α)] + (α − 1)T₁ − βT₂, where T₁ = Σ_i log X_i and T₂ = Σ_i X_i. The maximum likelihood equations are ∂L(α, β; X)/∂α = n[log β − Γ'(α)/Γ(α)] + T₁ and ∂L(α, β; X)/∂β = nα/β − T₂. The MLE of β as a function of α is β̂(α) = α/X̄, where X̄ = T₂/n. The profile log-likelihood of α is L(α, β̂(α)) = n[α log(α/X̄) − log Γ(α)] + (α − 1)T₁ − nα. Differentiating the profile log-likelihood with respect to α gives ∂L(α, β̂(α); X)/∂α = n[log α − Γ'(α)/Γ(α)] + T₁ − n log X̄. A numerical optimization method, such as Newton-Raphson, can then be used, after a numerical approximation for the digamma function, to obtain the MLE of α. The MLE of β follows from the equation β̂ = α̂/X̄ and the invariance of MLEs.
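As a numerical aside, the recipe above can be sketched with a standard asymptotic-series approximation to the digamma function Γ'(α)/Γ(α) and simple bisection on the profile score in place of Newton-Raphson; the data are drawn from a G(2, 3) distribution purely for illustration.

```python
import math, random

def digamma(x):
    """Gamma'(x)/Gamma(x) via psi(x) = psi(x+1) - 1/x and an asymptotic
    expansion that is accurate for large arguments."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    return result + math.log(x) - 1 / (2 * x) - 1 / (12 * x**2) + 1 / (120 * x**4)

random.seed(1)
# gammavariate uses the scale parametrization, so scale 1/3 means rate beta = 3
data = [random.gammavariate(2.0, 1 / 3.0) for _ in range(500)]
n = len(data)
t1 = sum(math.log(x) for x in data)
xbar = sum(data) / n

def profile_score(a):
    # derivative of the profile log-likelihood in alpha, as derived above
    return n * (math.log(a) - digamma(a)) + t1 - n * math.log(xbar)

lo, hi = 1e-3, 50.0                       # the score is decreasing in alpha
for _ in range(200):                      # bisection on the profile score
    mid = 0.5 * (lo + hi)
    if profile_score(mid) > 0:
        lo = mid
    else:
        hi = mid
alpha_hat = 0.5 * (lo + hi)
beta_hat = alpha_hat / xbar               # invariance: beta-hat = alpha-hat / xbar
print(alpha_hat, beta_hat)
```

With 500 observations the estimates land close to the true values (2, 3).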
9. We have θ = λ^a and p(X|θ) ∝ e^{−nθ}θ^T, where T = Σ_i X_i.

(a) The likelihood function is l(λ; X) = k exp(−nλ^a)λ^{aT}.

(b) The Jeffreys prior is defined as p(λ) ∝ I(λ)^{1/2}, where I(λ) = E[−L''(λ; X)], L is the log-likelihood and derivatives are taken with respect to λ. Then L(λ; X) = −nλ^a + aT log λ. Its first and second derivatives are respectively given by L'(λ; X) = −naλ^{a−1} + aT/λ and L''(λ; X) = −na(a − 1)λ^{a−2} − aT/λ², and the MLE is λ̂ = (T/n)^{1/a}. Also, I(λ) = na²λ^{a−2} and the Jeffreys prior is p(λ) ∝ λ^{a/2−1}.

(c) The third derivative is L'''(λ; X) = −na(a − 1)(a − 2)λ^{a−3} + 2aT/λ³. Evaluating it at λ̂ = (T/n)^{1/a} gives L'''(λ̂; X) = [−a(a − 1)(a − 2)T + 2aT]/(T/n)^{3/a}. It is easy to verify that L'''(λ̂; X) = 0 iff a = 3 or a = 0 (a degenerate solution).

(d) The transformation that makes the third derivative null improves the asymptotic approximation. So, with respect to skewness, the parametrization λ = θ^{1/3} should be used to improve approximations.
16. We have already obtained that V(θ̂) = 0.126/n. Now, V(θ̃) = V[w(X)]/n, where w(X) = (2π)^{−1}X²/(1 + X²)² and V[w(X)] = E[w²(X)] − θ², since the importance sampling estimator is unbiased. Here E[w²(X)] = ∫₀¹ (2π)^{−2}[x²/(1 + x²)]²(2/x²) dx = 0.021875. Therefore, V(θ̃) = [0.021875 − (0.1476)²]/n = 9 × 10^{−5}/n, which is substantially smaller than V(θ̂).

20. First assume that y ≠ x. The transition kernel of the chain is given by q*(x, y) = q(x, y)α(x, y). Then, suppose that α(x, y) < 1, so that α(x, y) = [p(y)q(y, x)]/[p(x)q(x, y)]. Then α(y, x) = 1 and p(x)q*(x, y) = p(x)q(x, y)α(x, y) = p(y)q(y, x)α(y, x) = p(y)q*(y, x). This equality is also trivially satisfied when y = x. Integrating both sides with respect to x gives p(y) = ∫ p(x)q*(x, y) dx, which means that p(y) is the stationary distribution of the chain.
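The detailed-balance argument above can be checked numerically on a small discrete state space; the target p and proposal q below are arbitrary illustrative choices, with the rejected-proposal mass placed on the diagonal of the effective kernel.

```python
p = [0.5, 0.3, 0.2]                      # target distribution (illustrative)
q = [[0.2, 0.5, 0.3],                    # proposal kernel q(x, y), rows sum to 1
     [0.4, 0.4, 0.2],
     [0.3, 0.3, 0.4]]
k = len(p)

# effective kernel q*(x, y) = q(x, y) * alpha(x, y) off the diagonal
alpha = [[min(1.0, p[y] * q[y][x] / (p[x] * q[x][y])) for y in range(k)]
         for x in range(k)]
qstar = [[q[x][y] * alpha[x][y] if x != y else 0.0 for y in range(k)]
         for x in range(k)]
for x in range(k):
    qstar[x][x] = 1.0 - sum(qstar[x])    # stay put when the proposal is rejected

# detailed balance p(x) q*(x, y) = p(y) q*(y, x) ...
balance = all(abs(p[x] * qstar[x][y] - p[y] * qstar[y][x]) < 1e-12
              for x in range(k) for y in range(k))
# ... implies stationarity: sum_x p(x) q*(x, y) = p(y)
station = [sum(p[x] * qstar[x][y] for x in range(k)) for y in range(k)]
print(balance, station)
```

The printed stationary check returns p itself, exactly as the integration step in the proof predicts.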
Chapter 6

3. (a) The p-value is the statistic α(T) = F_T(T), where F_T is the distribution function of the test statistic T. By the probability integral transform (see, for example, DeGroot, 1986, p. 154), it follows that α(T) = F_T(T) ~ U(0, 1). Since the α(T_i), i = 1, ..., k, are functions of independent statistics, all with the same distribution, they constitute a random sample from the U(0, 1) distribution.

(b) The hypothesis is rejected if α(T) < ᾱ, the significance level. So small values of α(T) should lead to rejection, and hence large values of −2 log α(T) should lead to rejection. Therefore, large values of F must lead to rejection of the hypothesis.

(c) It is well known that if X ~ U(0, 1) then Y = −log X ~ Exp(1) = G(1, 1), or 2Y ~ G(1, 1/2). Since F is the sum of k iid gamma-distributed random quantities, F ~ G(k, 1/2), or χ²_{2k}.
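As a numerical aside, for even degrees of freedom the χ²_{2k} tail probability has the closed form P(χ²_{2k} > x) = e^{−x/2} Σ_{j<k} (x/2)^j/j!, so the combined test can be sketched without any special-function library; the p-values below are made up for illustration.

```python
import math

def chi2_tail_even(x, k):
    """P(chi-squared with 2k degrees of freedom > x), closed form for even df."""
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(k))

pvalues = [0.021, 0.047, 0.30, 0.012]     # illustrative p-values from k tests
k = len(pvalues)
F = -2 * sum(math.log(a) for a in pvalues)
combined_p = chi2_tail_even(F, k)         # large F means small combined_p
print(F, combined_p)
```

Here F is large and the combined p-value is small, so the combined hypothesis is rejected, as part (b) anticipates.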
10. (a) The maximum likelihood ratio is Λ(X) = θ₀ exp[−X(θ₀ − θ₁)]/θ₁, and the null hypothesis is accepted iff Λ(X) > c, or X < (θ₀ − θ₁)^{−1} log[θ₀/(cθ₁)] = t, where t is such that P(X > t | θ₀) = ᾱ. Then exp(−θ₀t) = ᾱ, or t = −θ₀^{−1} log ᾱ.

(b) The p-value is P[X > 3 | θ₀] = exp(−3θ₀) = 0.0498.

(c) The Bayes factor is BF(H₀, H₁) = 2e^{−3/2} = 0.45.

(d) The posterior odds ratio is p(H₀|x)/p(H₁|x) = [p(H₀)/p(H₁)]BF(H₀, H₁) = BF(H₀, H₁), since p(H₀) = p(H₁). So

p(H₀|x) = {[BF(H₀, H₁)]^{−1} + 1}^{−1} = [(0.45)^{−1} + 1]^{−1} = 0.31.
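The figures in (b)-(d) can be reproduced directly; the specific values θ₀ = 1, θ₁ = 1/2 and observation x = 3 used below are assumptions consistent with the quoted numbers, not stated in this excerpt.

```python
import math

theta0, theta1, x = 1.0, 0.5, 3.0         # assumed values, consistent with the text

p_value = math.exp(-theta0 * x)           # P(X > x | theta0) for the Exp(theta0) model
bf = (theta0 * math.exp(-theta0 * x)) / (theta1 * math.exp(-theta1 * x))
post_h0 = 1.0 / (1.0 / bf + 1.0)          # equal prior odds: posterior odds = BF
print(round(p_value, 4), round(bf, 2), round(post_h0, 2))
```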
(e) Using the result in (b), the null hypothesis is rejected at the ᾱ = 5% level because the p-value is less than ᾱ. This is somewhat conflicting with the posterior probability of 0.31 for the null hypothesis: the latter is smaller than 0.5, but far from recommending rejection.

13.
(a) The likelihood function under H₀ is p(x₁, x₂|θ) = θ^t(1 − θ)^{2−t}, where T = X₁ + X₂, and θ̂ = t/2 is the MLE. Under H₁, p(x₁, x₂|θ₁, θ₂) = θ₁^{x₁}(1 − θ₁)^{1−x₁} θ₂^{x₂}(1 − θ₂)^{1−x₂}, and the MLE of (θ₁, θ₂) is (X₁, X₂).

(b) Using a uniform prior, the posterior distribution follows, from Bayes' theorem, as p(θ|x, H₀) ∝ θ^t(1 − θ)^{2−t} and p(θ₁, θ₂|x, H₁) ∝ θ₁^{x₁}(1 − θ₁)^{1−x₁} θ₂^{x₂}(1 − θ₂)^{1−x₂}.

(c) The GMLE is the mode of the posterior distribution. In this exercise, the GMLEs coincide with the MLEs obtained in (a).

(d) The predictive distribution is given by p(x) = E_θ[p(x|θ)]. So p(x₁, x₂|H₀) = ∫₀¹ θ^t(1 − θ)^{2−t} dθ = Γ(t + 1)Γ(3 − t)/Γ(4) and p(x₁, x₂|H₁) = Γ(x₁ + 1)Γ(2 − x₁)Γ(1 + x₂)Γ(2 − x₂)/Γ(3)². Numerically, we get the table of probabilities below:

(x₁, x₂)          (0,0)  (0,1)  (1,0)  (1,1)
p(x₁, x₂ | H₀)     1/3    1/6    1/6    1/3
p(x₁, x₂ | H₁)     1/4    1/4    1/4    1/4

(e) The result follows immediately from the table above: the data favour H₀ (equality of distributions) twice as much when x₁ = x₂ as when x₁ ≠ x₂.

15.
The MLE of θ is θ̂ = Σ_i X_i/Σ_i t_i and the Fisher information is I(θ) = Σ_i t_i/θ. From Section 5.3.1, the asymptotic distribution of the MLE is θ̂ ~ N[θ, I^{−1}(θ)]. A 100(1 − ᾱ)% asymptotic confidence interval is θ̂ − z_{ᾱ/2}Î^{−1/2}(θ̂) ≤ θ ≤ θ̂ + z_{ᾱ/2}Î^{−1/2}(θ̂), where Î is an estimator of I given by Σ_i t_i/θ̂. If Σ_i x_i = 10 and Σ_i t_i = 5, then θ̂ = 2 and Î(θ̂) = 5/2. A 95% asymptotic confidence interval for θ is (0.76, 3.24). Since θ₀ = 1 belongs to the confidence interval described above, we accept the null hypothesis at the 5% level. With a 1% level, the corresponding interval is (0.37, 3.63) and θ₀ = 1 will again be accepted.
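The interval arithmetic above is easy to reproduce; the normal quantiles 1.96 and 2.576 are used just as in the solution.

```python
import math

sum_x, sum_t = 10.0, 5.0
theta_hat = sum_x / sum_t                 # MLE of theta
info_hat = sum_t / theta_hat              # estimated Fisher information
se = 1.0 / math.sqrt(info_hat)            # asymptotic standard error

def interval(z):
    return (theta_hat - z * se, theta_hat + z * se)

ci95 = interval(1.96)                     # 95% level
ci99 = interval(2.576)                    # 99% level
print(ci95, ci99)                         # theta0 = 1 lies inside both intervals
```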
21. Let the cell frequencies be denoted by N₁, ..., N_p, with N_i > 0 and Σ_i N_i = n, and the parameters by θ₁, ..., θ_p, with θ_i > 0 and Σ_i θ_i = 1. The log-likelihood function is L(θ; N) = k + Σ_j N_j log θ_j, the MLE of θ is θ̂ = N/n and the score function is U(N; θ) = (N₁/θ₁, ..., N_{p−1}/θ_{p−1})' − 1 N_p/θ_p. The Fisher information is I(θ) = n diag(1/θ₁, ..., 1/θ_{p−1}) + n11'/θ_p and its inverse is n^{−1}[diag(θ₁, ..., θ_{p−1}) − (θ₁, ..., θ_{p−1})(θ₁, ..., θ_{p−1})']. Then it follows that W_E = Σ_i (N_i − nθ_{i,0})²/N_i and W_u = Σ_i (N_i − nθ_{i,0})²/(nθ_{i,0}).
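Both statistics are simple sums over cells. As a numerical aside, the counts below are an illustrative sample of size 60 tested against the (hypothetical) uniform null values θ_{i,0} = 1/3.

```python
counts = [10, 20, 30]                     # illustrative cell frequencies N_i
theta0 = [1 / 3, 1 / 3, 1 / 3]            # null hypothesis values theta_{i,0}
n = sum(counts)

# W_E divides by the observed counts, W_u by the expected counts under H0
w_e = sum((ni - n * t0) ** 2 / ni for ni, t0 in zip(counts, theta0))
w_u = sum((ni - n * t0) ** 2 / (n * t0) for ni, t0 in zip(counts, theta0))
print(w_e, w_u)   # both are referred to chi-squared quantiles with p - 1 = 2 df
```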
Chapter 7

1. The prior has density given by p(θ) = αβ^α θ^{−(α+1)} I_{(β,∞)}(θ), and the likelihood is given by l(θ; X) = I_{(T,∞)}(θ)/θ^n, for T = max_i X_i. Then θ|x ~ Pa(α₁, β₁), where α₁ = α + n and β₁ = max{T, β}.

(a) From Bayes' theorem it follows that p(x) = p(x|θ)p(θ)/p(θ|x) = αβ^α/(α₁β₁^{α₁}). Assuming that Y and X are conditionally independent given θ, it follows that p(y|x) = α₁β₁^{α₁}/(α₂β₂^{α₂}), with α₂ = α₁ + 1 and β₂ = max{y, β₁}.

(b) If T > β then β₁ = t, β₂ = max{t, y} and

p(y|x) = α₁t^{α₁}/[α₂(max{t, y})^{α₂}].

So P(Y > t | x) = (α₁/α₂)t^{α₁} ∫_t^∞ y^{−α₂} dy = α₁/[α₂(α₂ − 1)].

(c) If α → 0 then P(Y > t | x) → n/[(n + 1)n] = 1/(n + 1). If α, β → 0 then p(y|x) → nt^n/[(n + 1)(max{t, y})^{n+1}], that is, p(y|x) → n/[(n + 1)t] if y < t and nt^n/[(n + 1)y^{n+1}] if y > t. Also, P(Y > t | x) → 1/(n + 1).
4. The easiest way to predict Y in a classical way is to replace θ by its estimator θ̂. In this example, p(y|θ̂) = (m choose y)(x/n)^y (1 − x/n)^{m−y}, where θ̂ = x/n. This distribution has mean mx/n and variance mx(n − x)/n². (This can be contrasted with the Bayesian prediction under the improper prior p(θ) ∝ θ^{−1}(1 − θ)^{−1}, which leads to E(Y|x) = mx/n and V(Y|x) = mx(n − x)(1 + c)/n², where c = (m − 1)/(n + 1) ≥ 0. This again is similar, but overdispersed with respect to the classical result.)
Chapter 8

2. Define F(β) = [S(β) − S_e]/(ps²) ~ F(p, n − p). Therefore, P[F(β) ≤ F̄_α(p, n − p)] = 1 − α. Solving for β, this means that {β : (β − β̂)'X'X(β − β̂) ≤ ps²F̄_α(p, n − p)} defines a 100(1 − α)% confidence region for β. This region has the form of a p-dimensional ellipsoid centred at β̂.
7. Developing the product and conveniently collecting the terms,

Q = (β − μ₀)'C₀(β − μ₀) + (β − β̂)'X'X(β − β̂)
  = β'[C₀ + X'X]β − 2β'[C₀μ₀ + X'Xβ̂] + μ₀'C₀μ₀ + β̂'X'Xβ̂.

Using μ₁ and C₁ as defined and completing the squares, it follows that Q = Q₁ + Q₂, where Q₁ = β'C₁^{−1}β − 2β'C₁^{−1}μ₁ + μ₁'C₁^{−1}μ₁ and Q₂ = μ₀'C₀μ₀ + β̂'X'Xβ̂ − μ₁'C₁^{−1}μ₁. Now, Q₁ = (β − μ₁)'C₁^{−1}(β − μ₁), which proves the result.
16. The group means ȳ_j ~ N(β_j, (mφ)^{−1}) are sufficient and independent statistics for β₁, ..., β_k. Therefore, E(β_j|φ, y) = E(β_j|φ, ȳ₁, ..., ȳ_k) = E[E(β_j|μ, φ, ȳ_j)|ȳ₁, ..., ȳ_k]. From Theorem 2.1, E(β_j|μ, φ, ȳ_j) = (mȳ_j + c₁μ)/(m + c₁). So, E(β_j|φ, y) = [mȳ_j + c₁E(μ|φ, ȳ₁, ..., ȳ_k)]/(m + c₁). From Theorem 2.1 again, E(μ|φ, ȳ₁, ..., ȳ_k) = ȳ. Therefore, E(β_j|φ, y) = (mȳ_j + c₁ȳ)/(m + c₁).
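The resulting estimate shrinks each group mean towards the overall mean. As a small numerical sketch, the group means, group size m and precision ratio c₁ below are made-up illustrative values.

```python
ybars = [4.0, 5.5, 7.0]                   # illustrative group means
m, c1 = 10, 5.0                           # group size and prior precision ratio
ybar = sum(ybars) / len(ybars)            # overall mean, E(mu | phi, data)

# each posterior mean is a weighted average of its group mean and the overall mean
estimates = [(m * yj + c1 * ybar) / (m + c1) for yj in ybars]
print(estimates)
```

Each estimate lies between its own group mean and the overall mean, which is the shrinkage effect discussed in the hierarchical models chapter.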
By Bayes' theorem and using the fact that y_t and θ_{t−1} are conditionally independent given θ_t, it follows that the first term in the integrand is p(θ_{t−1}|θ_t, D_t) = p(θ_{t−1}|θ_t, D_{t−1}) ∝ p(θ_{t−1}|D_{t−1}) p(θ_t|θ_{t−1}, D_{t−1}). Application of Bayes' theorem for the normal observation θ_t and parameter θ_{t−1} gives (θ_{t−1}|θ_t, D_t) ~ N[m_{t−1} + c_{t−1}(θ_t − a_t)/R_t, c_{t−1} − c²_{t−1}/R_t]. Therefore, (θ_{t−1}|D_t) has mean

E[θ_{t−1}|D_t] = E[E(θ_{t−1}|θ_t)|D_t] = m_{t−1} + c_{t−1}(m_t − a_t)/R_t

and variance

V[θ_{t−1}|D_t] = E[V(θ_{t−1}|θ_t)|D_t] + V[E(θ_{t−1}|θ_t)|D_t] = c_{t−1} − c²_{t−1}(R_t − c_t)/R²_t.

The normality follows from the linear relation between the conditional mean of θ_{t−1} and θ_t, and the marginal normality of θ_t.
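One backward (smoothing) step of these recursions can be written directly from the two moment formulas; the filtering summaries below are illustrative numbers, not the output of a fitted model.

```python
def smooth_step(m_prev, c_prev, a_t, r_t, m_t, c_t):
    """Moments of (theta_{t-1} | D_t) from the filtering summaries:
    m_prev, c_prev at time t-1 and a_t, r_t, m_t, c_t at time t."""
    gain = c_prev / r_t
    mean = m_prev + gain * (m_t - a_t)
    var = c_prev - gain ** 2 * (r_t - c_t)
    return mean, var

# illustrative filtering quantities
mean, var = smooth_step(m_prev=0.5, c_prev=1.0, a_t=0.5, r_t=2.0, m_t=0.8, c_t=0.9)
print(mean, var)
```

Since r_t > c_t, the smoothed variance is below the filtered variance c_{t−1}, reflecting the extra information carried backwards by y_t.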
List of distributions

This list includes the distributions that are most often used in statistics and that frequently appeared in this book. They are listed in alphabetical order and are not divided into groups such as continuous × discrete or univariate × multivariate; this information will be clear from the context.

1. Bernoulli
X is said to have a Bernoulli distribution with success probability θ, denoted by X ~ Ber(θ), if its probability function is given by

p(x|θ) = θ^x (1 − θ)^{1−x},  x = 0, 1,

for θ ∈ [0, 1]. This distribution has mean θ and variance θ(1 − θ).
2. Beta
X is said to have a beta distribution with parameters α and β, denoted by X ~ beta(α, β), if its density function is given by

p(x|α, β) = k x^{α−1}(1 − x)^{β−1},  x ∈ [0, 1],

for α, β > 0. The constant k is given by k = 1/B(α, β), where

B(a, b) = Γ(a)Γ(b)/Γ(a + b)  and  Γ(c) = ∫₀^∞ x^{c−1}e^{−x} dx,  c > 0.

The functions B(·, ·) and Γ(·) are respectively known as the beta and gamma functions. This distribution has mean α/(α + β) and variance αβ/[(α + β)²(α + β + 1)].
3. Beta-binomial
X is said to have a beta-binomial distribution with parameters n, α and β, denoted by X ~ BB(n, α, β), if its probability function is given by

p(x|n, α, β) = (n choose x) B(α + x, β + n − x)/B(α, β),  x = 0, 1, ..., n,

for n ≥ 1 and α, β > 0. This distribution has mean nα/(α + β) and variance nαβ[1 + (n − 1)/(α + β + 1)]/(α + β)². The expression for the variance can be rewritten in the form nθ(1 − θ)(1 + ε), where θ = α/(α + β) and ε = (n − 1)/(α + β + 1). Since ε > 0 whenever n > 1, the variance of the beta-binomial distribution is bigger than the variance of a binomial distribution with the same success probability θ.

4. Binomial
X is said to have a binomial distribution with parameter n and success probability θ, denoted by X ~ bin(n, θ), if its probability function is given by

p(x|n, θ) = (n choose x) θ^x (1 − θ)^{n−x},  x = 0, 1, ..., n,

for n ≥ 1 and θ ∈ [0, 1]. This distribution has mean nθ and variance nθ(1 − θ). This family of distributions includes the Bernoulli as the special case n = 1.
5. Dirichlet
X = (X₁, ..., X_p)' is said to have a Dirichlet distribution with parameter θ = (θ₁, ..., θ_p)', denoted by X ~ D(θ), if its joint density function is given by

p(x) = [Γ(θ₊)/∏_i Γ(θ_i)] ∏_i x_i^{θ_i−1},  x_i ∈ [0, 1],  i = 1, ..., p,  Σ_i x_i = 1,

for θ_i > 0, i = 1, ..., p, and θ₊ = Σ_i θ_i. This distribution has mean θ/θ₊ and variance-covariance matrix given by θ₊^{−2}(θ₊ + 1)^{−1}[θ₊ diag(θ) − θθ'], where diag(c) denotes a diagonal matrix having the elements of the vector c in the main diagonal. This family of distributions includes the beta as the special case p = 2.
6. Exponential
X is said to have an exponential distribution with parameter θ, denoted by X ~ Exp(θ), if its density function is given by

p(x|θ) = θe^{−θx},  x > 0,

for θ > 0. This distribution has mean 1/θ and variance 1/θ².
7. Gamma (or χ²)
X is said to have a gamma distribution with parameters α and β, denoted by X ~ G(α, β), if its density function is given by

p(x|α, β) = k x^{α−1}e^{−βx},  x > 0,

for α, β > 0. The constant k is given by k = β^α/Γ(α). This distribution has mean α/β and variance α/β². This family of distributions includes the exponential as the special case α = 1. The χ²_p distribution is equivalent to the G(p/2, 1/2) distribution; therefore, the G(α, β) distribution corresponds to a χ²_{2α}/(2β) distribution.

8. Multinomial
X = (X₁, ..., X_p)' is said to have a multinomial distribution with parameter n and probabilities θ = (θ₁, ..., θ_p)', denoted by X ~ M(n, θ), if its joint probability function is given by
p(x|θ) = [n!/(x₁! ⋯ x_p!)] ∏_i θ_i^{x_i},  x_i = 0, 1, ..., n,  i = 1, ..., p,  Σ_i x_i = n,

for θ_i ∈ [0, 1], i = 1, ..., p, with Σ_i θ_i = 1. This distribution has mean nθ and variance-covariance matrix given by n[diag(θ) − θθ']. This family of distributions includes the binomial as the special case p = 2.
9. Negative binomial
X is said to have a negative binomial distribution with parameters r and θ, denoted by X ~ NB(r, θ), if its probability function is given by

p(x|r, θ) = (r + x − 1 choose x) θ^r (1 − θ)^x,  x = 0, 1, ...,

for r ≥ 1 and θ ∈ (0, 1]. This distribution has mean r(1 − θ)/θ and variance r(1 − θ)/θ².
10. Normal
X is said to have a normal distribution with mean μ and variance σ², denoted by X ~ N(μ, σ²), if its density function is given by

p(x|μ, σ²) = (2πσ²)^{−1/2} exp[−(x − μ)²/(2σ²)],  x ∈ R,

for μ ∈ R and σ² > 0. When μ = 0 and σ² = 1, the distribution is referred to as standard normal.

X = (X₁, ..., X_p)' is said to have a multivariate normal distribution with mean vector μ and variance-covariance matrix Σ, denoted by X ~ N(μ, Σ), if its density function is given by

p(x|μ, Σ) = (2π)^{−p/2}|Σ|^{−1/2} exp[−(x − μ)'Σ^{−1}(x − μ)/2],  x ∈ R^p,

for μ ∈ R^p and Σ > 0, where |A| denotes the determinant of A.
11. Pareto
X is said to have a Pareto distribution with parameters α and θ, denoted by X ~ Pa(α, θ), if its density function is given by

p(x|θ, α) = αθ^α/x^{1+α},  x > θ,

for α, θ > 0. This distribution has mean αθ/(α − 1), when α > 1, and variance αθ²/[(α − 1)²(α − 2)], when α > 2.
12. Poisson
X is said to have a Poisson distribution with parameter θ, denoted by X ~ Pois(θ), if its probability function is given by

p(x|θ) = e^{−θ}θ^x/x!,  x = 0, 1, 2, ...,

for θ > 0. This distribution has mean and variance given by θ.
13. Snedecor F
X is said to have a Snedecor F (or simply F) distribution with ν₁ and ν₂ degrees of freedom, denoted by X ~ F(ν₁, ν₂), if its density function is given by

p(x|ν₁, ν₂) = {Γ[(ν₁ + ν₂)/2]/[Γ(ν₁/2)Γ(ν₂/2)]} ν₁^{ν₁/2} ν₂^{ν₂/2} x^{(ν₁/2)−1}(ν₂ + ν₁x)^{−(ν₁+ν₂)/2},  x > 0,

for ν₁, ν₂ > 0. This distribution has mean ν₂/(ν₂ − 2), when ν₂ > 2, and variance [2ν₂²(ν₁ + ν₂ − 2)]/[ν₁(ν₂ − 4)(ν₂ − 2)²], when ν₂ > 4.

If X₁ ~ χ²_{ν₁} and X₂ ~ χ²_{ν₂} are independent, then (X₁/ν₁)/(X₂/ν₂) ~ F(ν₁, ν₂). This family of distributions includes the square of the Student t as the special case ν₁ = 1. If ν₂ → ∞, then ν₁F(ν₁, ν₂) → χ²_{ν₁}.
14. Student t
X is said to have a Student t (or simply t) distribution with mean μ, scale parameter σ² and ν degrees of freedom, denoted by X ~ t_ν(μ, σ²), if its density function is given by

p(x|ν, μ, σ²) = {Γ[(ν + 1)/2]ν^{ν/2}/[Γ(ν/2)π^{1/2}σ]} [ν + (x − μ)²/σ²]^{−(ν+1)/2},  x ∈ R,

for ν > 0, μ ∈ R and σ² > 0. This distribution has mean μ, when ν > 1, and variance νσ²/(ν − 2), when ν > 2. This family includes the Cauchy distribution as the special case ν = 1, denoted by Cauchy(μ, σ²).

X = (X₁, ..., X_p)' is said to have a multivariate Student t distribution with mean vector μ, scale matrix Σ and ν degrees of freedom, denoted by X ~ t_ν(μ, Σ), if its density function is given by

f(x|ν, μ, Σ) = {Γ[(ν + p)/2]ν^{ν/2}/[Γ(ν/2)π^{p/2}]} |Σ|^{−1/2}[ν + (x − μ)'Σ^{−1}(x − μ)]^{−(ν+p)/2},  x ∈ R^p,

for ν > 0, μ ∈ R^p and Σ > 0. This distribution has mean μ, when ν > 1, and variance νΣ/(ν − 2), when ν > 2.
15. Uniform
X is said to have a uniform distribution with parameters θ₁ and θ₂, denoted by X ~ U[θ₁, θ₂], if its density function is given by

p(x|θ₁, θ₂) = 1/(θ₂ − θ₁),  x ∈ [θ₁, θ₂],

for θ₁ < θ₂. When θ₁ = 0 and θ₂ = 1, the distribution is referred to as unit uniform. This distribution has mean (θ₁ + θ₂)/2 and variance (θ₂ − θ₁)²/12.
16. Weibull
X is said to have a Weibull distribution with parameters α and β, denoted by X ~ Wei(α, β), if its density function is given by

p(x|α, β) = βαx^{α−1} exp(−βx^α),  x > 0,

for α, β > 0. This distribution is sometimes parametrized in terms of α and θ = β^{−1/α}, and includes the exponential as the special case α = 1. This distribution has mean β^{−1/α}Γ(1 + α^{−1}) and variance β^{−2/α}[Γ(1 + 2α^{−1}) − Γ²(1 + α^{−1})].
References
Abramowitz, M. and Stegun, I. A., editors (1965). Handbook of Mathematical Functions, National Bureau of Standards Applied Mathematics Series, Number 55. Washington: U.S. Government Printing Office.
Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. Cambridge: Cambridge University Press.
Barnett, V. (1973). Comparative Statistical Inference. New York: Wiley.
Bartlett, M. S. (1947). Multivariate analysis. Journal of the Royal Statistical Society, B (supplement), 9, 76-97.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag.
Berger, J. and Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 2, 317-352.
Berger, J. and Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of significance levels and evidence. Journal of the American Statistical Association, 82, 112-122.
Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). Journal of the Royal Statistical Society, Series B, 41, 113-147.
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Chichester: Wiley.
Bickel, P. J. and Doksum, K. A. (1977). Mathematical Statistics. Englewood Cliffs: Prentice-Hall.
Box, G. E. P. and Tiao, G. C. (1992). Bayesian Inference in Statistical Analysis. Reading: Addison-Wesley.
Broemeling, L. D. (1985). Bayesian Analysis of Linear Models. New York: Marcel Dekker.
Buse, A. (1982). The likelihood ratio, Wald and Lagrange multiplier tests: an expository note. The American Statistician, 36, 3, 153-157.
Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46, 167-174.
Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49, 327-335.
Cordeiro, G. M. (1987). On the corrections to the likelihood ratio statistics. Biometrika, 74, 265-274.
Cordeiro, G. M. and Ferrari, S. L. P. (1991). A modified score test statistic having chi-squared distribution to order n⁻¹. Biometrika, 78, 573-582.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman & Hall.
Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and their Application. Cambridge: Cambridge University Press.
de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré, 7, 1, 1-68.
de Finetti, B. (1974). Theory of Probability (vols. 1 and 2). New York: Wiley.
DeGroot, M. H. (1970). Optimal Statistical Decisions. New York: Wiley.
DeGroot, M. H. (1986). Probability and Statistics (2nd edition). Reading: Addison-Wesley.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, B, 39, 1-38.
Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. New York: Wiley.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans, Monograph 38. Philadelphia: SIAM.
Ferguson, T. S. (1967). Mathematical Statistics: A Decision-Theoretic Approach. New York: Academic Press.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. London: Chapman & Hall.
Gamerman, D. and Migon, H. (1993). Dynamic hierarchical models. Journal of the Royal Statistical Society, B, 55, 629-643.
Garthwaite, P. H., Jollife, I. T. and Jones, B. (1995). Statistical Inference. London: Prentice Hall.
Geisser, S. (1993). Predictive Inference: An Introduction. London: Chapman & Hall.
Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J., editors (1996). Markov Chain Monte Carlo in Practice. London: Chapman & Hall.
Harrison, P. J. and Stevens, C. F. (1976). Bayesian forecasting (with discussion). Journal of the Royal Statistical Society, B, 38, 205-247.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
Heath, D. L. and Sudderth, W. D. (1976). De Finetti's theorem for exchangeable random variables. American Statistician, 30, 333-345.
Jeffreys, H. (1961). Theory of Probability. Oxford: Clarendon Press.
Kalbfleisch, J. G. (1975). Probability and Statistical Inference, volume 2 (2nd edition). New York: Springer-Verlag.
Lawley, D. N. (1956). A general method for approximating to the distribution of likelihood ratio criteria. Biometrika, 43, 295-303.
Lee, P. M. (1997). Bayesian Statistics: An Introduction (2nd edition). Oxford: Oxford University Press.
Lehmann, E. (1986). Testing Statistical Hypotheses (2nd edition). New York: Wiley.
Lindley, D. V. (1957). A statistical paradox. Biometrika, 44, 187-192.
Lindley, D. V. (1965). Introduction to Probability and Statistics (Parts 1 and 2). Cambridge: Cambridge University Press.
Lindley, D. V. (1980). Approximate Bayesian methods (with discussion). In Bayesian Statistics (eds J. M. Bernardo et al.), 223-245.
Lindley, D. V. (1993). The analysis of experimental data: the appreciation of tea and wine. Teaching Statistics, 15, 1, 22-25.
Lindley, D. V. and Phillips, L. D. (1976). Inference for a Bernoulli process (a Bayesian view). The American Statistician, 30, 112-119.
Lindley, D. V. and Smith, A. F. M. (1972). Bayes estimates for the linear model (with discussion). Journal of the Royal Statistical Society, Series B, 34, 1-41.
Little, R. J. A. and Rubin, D. B. (1988). Statistical Analysis with Missing Data. New York: Wiley.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd edition). London: Chapman & Hall.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1091.
Migon, H. S. and Tachibana, V. M. (1997). Bayesian approximations in randomized response model. Computational Statistics and Data Analysis, 24, 401-409.
Naylor, J. C. and Smith, A. F. M. (1982). Applications of a method for the efficient computation of posterior distributions. Applied Statistics, 31, 214-235.
Neyman, J. and Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika A, 20, 175-240 and 263-294.
O'Hagan, A. (1994). Bayesian Inference. Volume 2B of Kendall's Advanced Theory of Statistics. London: Edward Arnold.
Quenouille, M. H. (1949). Approximate tests of correlation in time series. Journal of the Royal Statistical Society, Series B, 11, 68-84.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43, 353-360.
Raiffa, H. and Schlaifer, R. (1961). Applied Statistical Decision Theory. Cambridge, MA: Harvard University Press.
Rao, C. R. (1973). Linear Statistical Inference (2nd edition). New York: Wiley.
Ripley, B. D. (1987). Stochastic Simulation. New York: Wiley.
Silvey, S. D. (1970). Statistical Inference. London: Chapman & Hall.
Tanner, M. A. (1996). Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions (3rd edition). New York: Springer-Verlag.
Thisted, R. A. (1976). Elements of Statistical Computing. New York: Chapman & Hall.
Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82-86.
West, M. and Harrison, P. J. (1997). Bayesian Forecasting and Dynamic Models (2nd edition). New York: Springer-Verlag.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics. New York: Wiley.
Author index
Abramowitz, M., 146
Aitchison, J., 28, 200
Bahadur, R. R., 36
Barnett, V. D., xi, 27
Bartlett, M. S., 43, 188, 190
Bayes, Rev. Thomas, 27
Berger, J., 2, 55, 84, 102, 121, 185, 188
Bernardo, J. M., 35, 67, 68, 71, 140
Bickel, P. J., 19, 168
Box, G. E. P., 32, 66
Broemeling, L. D., 211
Buse, A., 190
Casella, G., 164
Chib, S., 157
Copas, J. B., 226
Cordeiro, G. M., 190
Cox, D. R., 43, 86, 96
Darmois, G., 40
Davison, A. C., 125, 151
De Finetti, B., 5, 32
DeGroot, M. H., 36, 84, 168
Delampady, M., 188
Dempster, A. P., 130, 131
Doksum, K. A., 19, 168
Draper, N. R., 211
Dunsmore, I. R., 28, 200
Efron, B., 147, 151, 153
Ferguson, T. S., 84
Ferrari, S. L. P., 190
Fisher, R. A., 36, 96, 153
Gamerman, D., 125, 147, 156, 160, 231
Garthwaite, P. H., 126
Geisser, S., 28, 200
Gelfand, A. E., 125, 157
Geman, D., 157
Geman, S., 157
George, E. I., 164
Gilks, W. R., 160
Goldstein, M., 208
Greenberg, E., 157
Harrison, P. J., 227
Hartigan, J., 208
Hastings, W. K., 157
Heath, D. L., 34
Hinkley, D. V., 43, 86, 96, 125, 151
James, W., 226
Jeffreys, H., 66, 68, 71, 182
Jollife, I. T., 126
Jones, B., 126
Kadane, J. B., 143
Kalbfleisch, J. G., 43
Koopman, B. O., 40
Laird, N. M., 130, 131
Lawley, D. N., 190
Lehmann, E. L., 36, 95, 168
Lindley, D. V., 7, 31, 57, 71, 135, 184, 187, 188, 224
Little, R. J. A., 130
McCullagh, P., 213
Metropolis, N., 157
Migon, H. S., 143, 231
Morris, C. N., 226
Naylor, J. C., 147
Nelder, J. A., 213
Neyman, J., 168, 170, 173
O'Hagan, A., 226
Pearson, E. S., 168, 170, 173
Phillips, L. D., 7
Pitman, E. J. G., 40
Quenouille, M. H., 152
Raiffa, H., 58
Rao, C. R., 96, 131
Ripley, B. D., 147
Rubin, D. B., 130, 131
Scheffe, H., 36
Schlaifer, R., 58
Sellke, T., 185
Silvey, S. D., 86, 105
Smith, A. F. M., 35, 71, 125, 140, 147, 157, 224
Smith, H., 211
Sprott, D. A., 43
Stegun, I. A., 146
Stevens, C. F., 227
Sudderth, W. D., 34
Tachibana, V. M., 143
Tanner, M. A., 125
Thisted, R. A., 125, 130
Tiao, G. C., 32, 66
Tierney, L., 143
Wald, A., 188
West, M., 227
Wilks, S. S., 188
Zellner, A., 68
Subject index
Alternative hypotheses, 168
Analytical approximations, 133
  asymptotic theory, 133-40
  by Kullback-Liebler divergence, 140-41
  Laplace approximation, 141-4
Ancillarity, 38
Asymptotic
  efficiency, 139
  tests, 188-91
  theory, 133-40
Bayes factor, 182, 183-4, 185-6, 223
Bayes risk, 93
Bayes rule, 80, 81
Bayes' theorem, 26-7
  prediction, 28-30
  sequential nature, 30-32
Bayesian approach (subjective approach), 15
  confidence intervals, 186, 187-8
  decision theory, 80, 81, 83
  estimation, 87, 89
  hypothesis testing, 167-8, 181-6
  inference, general problem of, 125, 126
  Laplace approximation, 142
  linear models, 218-24
  Monte Carlo with importance sampling, 151
  parameter estimation, 42, 44
  prediction, 197-201, 203-5, 208
  resampling methods, 151, 153, 155, 156, 158-9
  sufficiency, 35-8
Bias
  classical hypothesis testing, 176
  comparison of estimators, 93
  jackknife, 152
Bidimensional rule, 145
Bonferroni inequality, 103, 105
Bootstrap, 151, 153-4
  weighted, 151, 154-6
Cartesian product rule, 145, 147
Central limit theorem, 138
Cholesky decomposition, 14, 147
Classical approach (frequentist approach), 7-8, 15
  confidence interval, 100, 104-5
  estimation, 82, 86, 92-3
  hypothesis testing, 167-81
  parameter elimination, 42-4
  prediction, 201-2, 205-6
  probability, 3, 4
  resampling methods, 151
  sufficiency, 37-8
Complete families of distributions, 95-6
Conditional
  density, 11
  independence, 11, 12
  likelihood, 43
  posterior distributions, 42
Confidence interval
  asymptotic, 135, 138-9, 140
  Bayesian approach, 100-101, 102-3
  Bayesian linear models, 219, 222
  bootstrap, 153, 154
  classical approach, 100, 104, 105
  classical linear models, 214, 215, 218
  highest posterior density (HPD), 101-2
  highest predictive density (HPD), 201, 203, 205
  jackknife, 152, 153
  normal model: one sample case, 106, 107, 108, 109, 110
  normal model: two samples case, 114, 115, 116, 117
  optimization techniques, 129-30
Conjugate family of distributions, 55, 56-8
  natural, 58
Consistency of estimators, 98-9
Correlation coefficient, 153
Cramer-Rao inequality, 96
Credibility interval, 100
Critical region, 168, 174-5
Cromwell's rule, 57
De Finetti's representation theorem, 34, 53
Decision rules, 80, 81
Decision theory, 79-86
Delta method, 136, 141
Density function, 14
Distribution
  Behrens-Fisher, 114, 115
  Bernoulli, 247, 248
  beta, 57, 247
  beta-binomial, 59, 247-8
  binomial, 248, 249
  bivariate normal, 111-12
  Cauchy, 250
  chi-squared, 107, 113, 188, 189
  Dirichlet, 62, 248
  exponential, 248
  gamma, 248
  multinomial, 248-9
  multivariate normal, 12, 13, 249
  multivariate Student t, 250
  negative binomial, 8, 249
  normal, 12-13, 249
  normal-chi-squared, 63
  normal-gamma, 63, 64
  Pareto, 249
  Poisson, 41, 61, 250
  Snedecor F, 116, 250
  Student t, 250
  uniform, 250-51
  Weibull, 251
Dynamic linear models, 227
  definition and examples, 227-9
  evolution and updating equations, 229-30
  special aspects, 231
Efficiency of estimators, 97-8
  asymptotic, 139
Elimination of parameters, 41-4
EM algorithm, 130-33
Empirical distribution function, 92
Estimation, 1, 79
  Bayesian prediction, 199-200
  classical point, 86-92
  comparison of estimators, 92-9
  decision theory, 79-86
  interval, 99-105
Exchangeability, 32-5
  partial, 35
  prior distribution, 53
Exponential family, 39-40
  completeness, 95
  conjugacy, 59-65
  jackknife, 153
  sufficiency, 39-41
Factorization criterion, Neyman's, 37, 38, 39, 40
Fisher information, 23-5
  asymptotic theory, 137, 139
  consistency, 98
  expected, 24, 25
  observed, 24, 25
Fisher scoring, 128, 129
Frequentist approach, see Classical approach
Gamma function, 247
Gauss-Hermite rules, 146-7
Generalized linear models, 212-13
Generalized maximum likelihood estimator (GMLE), 87-90
  decision theory, 83-4
  EM algorithm, 130
Generalized score function, 90
Gibbs sampler algorithm, 157-60
Hierarchical
  linear models, 224-6
  priors, 71-3
Homoscedasticity, 90
Hyperparameters, 55, 58, 72
Hypothesis testing, 1, 167-8
  asymptotic tests, 188-91
  Bayesian, 181-6
  classical, 168-81
  composite, 167, 173-7
Importance sampling, 149-51
Influence diagrams, 11, 12
Information, 2
  additivity of, 24
Jackknife, 151, 152-3
Kullback-Liebler divergence, 140-41
Laplace approximation, 141-4
Least squares, method of, 90-92
  estimator (LSE), 90-91
  ordinary, 91
  weighted, 91, 92
Likelihood
  function, 21-3
  maximum, 87, 89, 90
  principle, 23
Likelihood ratio (LR)
  classical hypothesis testing, 170, 171, 174-5, 177
  monotone, 174-5, 177
Lindley's paradox, 184
Linear algebra, basic results in, 10-11, 13-14
Linear Bayes methodology, 208
Linear growth model (LGM), 227-8
Linear models, 211-13
  Bayesian, 218-24
  classical, 213-18
  dynamic, 227-31
  hierarchical, 224-6
Linear prediction, 206-8
Location
  model, 20, 70
  parameter, 20, 21
  -scale model, 21
Loss functions, 80-84
  0-1, 83-4
  absolute, 84
  quadratic, 84, 93
Marginal
  density, 11
  likelihood, 182
  posterior distributions, 42
Markov chain Monte Carlo (MCMC), 156-60
Matrices
  linear algebra, 13
  orthogonal, 14
  positive definite, 13-14
  singular, 14
  square, 13
  symmetric, 13
Maximum likelihood estimator
Optimization techniques, 126-7
Parameter elimination, 41-4
Posterior distribution, 27, 31-3
Rao-Blackwell theorem, 94-5, 158-9
Reference priors, see Non-informative priors
Risk
  Bayes, 80, 81, 83, 93
  comparison of estimators, 93-8
  decision theory, 80, 81-3
  linear, 206-7
  quadratic, 94, 99
Scale
  model, 20-21, 71
  parameter, 21
Score function, 25-6
  generalized, 90
Shrinkage estimators, 226
Simulation methods, 147
  Markov chain Monte Carlo, 156-60
  Monte Carlo methods, 147-9
  Monte Carlo with importance sampling, 149-51
  resampling methods, 151-6
Stein estimators, 226
Subjective approach, see Bayesian approach
Sufficiency, 35-41
  Fisher information, 23
  partitions, 37, 39
  risk, 94
Taylor expansion
  dynamic linear models, 227
  hypothesis testing, 188, 191
  Laplace approximation, 142
  optimization techniques, 127
Tchebychev's inequality, 99
Test procedures, 167
  consistent, 191
  goodness-of-fit, 190-91
  invariant, 177
  optimal, 169-70
  similar, 177
  t test, 177, 179
  Type I errors, 169, 170, 172
  Type II errors, 169, 172
  unbiased, 175-6, 179
  uniformly more powerful (UMP), 173-6
Unbiased estimators, 93
Uncertainty, 1, 2
Uniformly minimum variance unbiased (UMVU) estimators, 93
Variance-covariance matrices, 11-12, 13
Weibull distribution, 251
asymptotic theory, 134, 135-6,
(MLE), 86--90
137
asymptotic tests, 188,189, 190,
decision theory, 79
191
Laplace approximation, 143-4
asymptotic theory, 137-8, 139,
Markov chain Monte Carlo
140
methods, 159
consistency, 98
parameter elimination, 42
EM algorithm, 130, 131 generalized,
see
Posterior odds, 182
Generalized
Prediction, 1, 197
maximum likelihood
Bayesian, 197-201
estimator optimization t"echniques, 127---.9· Mean squared error (MSE), 93 Metropolis-Hastings algorithm, 157-8; 165 Mini�ax estimatOr, 94 Moments, method of, 92 Monte CarJo·rnethods, 147-9 with importance sampling, 149-51 Markov chain, 156-60 Newton-Raphson algorithm, 127 confidence intervals, determin ing, 129 likelihood equation, solution of, 127,128, 129 Neyman's factorization criterion, 37, 38,39,40 Non-informative priors, 65-71 Nuisance parameters, 42, 43 non-informative priors, 68, 71 Null hypothesis, 168 Numerical integration methods, 144 Gauss-Hermite rules, 146-.7 Newton-Cotes type methods, 144--5
·
classical, 201-2 linear, 206-8 linear model, 217-18 Prior distribution, 27, 30-33, 53 Conjugacy, 59-65 distribution function approach to specification, 54 hierarchical, 71�3 histogram approach, specifica tion of, 53-4 Jeffreys, 66-9, 76 linear models, 218-19,220, 221-2 non-informative,
see
Non-informative priors subjective specification, 53-5 specification through functional forms, 55-8 Prior odds, 182 Probability, I basic results in, 10-13 concept of, 3-5 function, 14 p-value, 172 Quadrature technique,
see
Numer
ical integration methods
One-way ANOVA, 215
Randomized !fsts, 173
Optimality criterion, linear
Rao-Blackwellized estimator, 158,
prediction, 206
164