llII(IIl1!j!ItIIIj!!
Contents
Foreword
xi
by David Hendry
XV
Preface Acknowledgements List of symbols and abbreviations
lntroduction
Part I
Econometric modelling, a preliminary view 1 1.1 Econometries - a brief historical overview 1.2 Econometric modelling - a sketch of a methodology Looking ahead
2
2.1 2.2 2.3
study of data Histograms and their numerical Frequency curves Looking ahead
DescriptiYe
Probability
Part 11 3 3.1 3.2 3.3
characteristics
15 22 23 23 27 29
theory
Probability The notion of probability The axiomatic approach Conditional probability
33 34 37 43
Contents
and probability distributions Random Yariables The concept of a random variable The distribution and density functions The notion of a probability model distributions Some univariate Numerical characteristics of random variables
47 48 55 60 62 68
5 5. 1 5.2 5.3 5.4
and their distributions Random vtors Joint distribution and density functions Some bivariate distributions Marginal distributions Conditional distributions
78 78 83 85 89
6 6. 1 6.2* 6.3
Functions of random Yariables Functions of one random variable Functions of several random variables Functions of normally distributed random
,' 4 4.1 4.2 4.3 4.4 4.5
a
96 96
99 variables,
108 109
summary
Looking ahead Appendix 6.1 The normal
distributions
-
and
related 110
7 7. 1 7.2 7.3
The general notion of expectation Expectation of a function of random variables Conditional expectation Looking ahead
116 116 121 127
8* 8. 1 8.2
Stochastic processes The concept of a stochastic process of a stochastic Restricting the time-heterogeneity
130 131
process
Restricting the memory of a stochastic process Some special stochastic processes
8.3 8.4 8.5 9 9. 1 i?2 .
J
Summary
-
Limit theorems The early limit theorems The 1aw of la rge n umbers The central limit theorem l.: nit ! heorems for stochastic processes
137 140 144 162 165 165 168 173 178 180
Contents
10* 10.1 10.2 10.3 10.4 10.5 10.6
lntroduction to asymptotic theory lntroduction Modes of convergence Convergence of moments The 0' and little o' notation Extending the limit theorems expansions Error bounds and asymptotic tbig
183 183 185 192 194 198 202
Statistical inference
Part III
The nature of statistical inference lntroduction The sampling model The frequency approach An overview of statistical inference Statistics and their distributions Appendix 11.1 -. The empirical distribution
213 213 215 219 221 223
function
228
12 12.1 12.2 12.3
Estimation I - properties of estimators Finite sample properties Asymptotic properties Predictors and their properties
231 232 244 247
13 13.1 13.2 13.3
Estimation 11 The method of The method of The maximum
14 14.1 14.2 14.3 14.4 14.5 14.6
Hypothesis testing and confidence regions Testing, definitions and concepts Optimal tests Constructing optimal tests The likelihood ratio test procedure Confidence estimation Prediction
285 285 290 296 299 303 306
15* 15.1 15.2 15.3
normal distribution The multivariate Multivariate distributions The multivariate normal distribution Quadraticforms related to the normal distribution
3 12 3 12 3 15
11 11.1 11.2 11.3 11.4 11.5
methods
least-squares moments likelihood method
252 253 256 257
319
Contents
15.4 15.5
Estimation Hypothesis
16* 16.1 16.2
Asymptotic test procedures Asymptotic properties The likelihood ratio and related
Part IV
320 323
testing and confidence regions
test procedures
The Iinear regression and related
models
326 326 328
statistical
17 17.1 17.2 17.3 17.4 17.5
Statistical models in onometrics Simple statistical models Economic data and the sampling model Economic data and the probability model The statistical generating mechanism Looking ahead Appendix 17.1 Data -
339 339 342 346 349 352 355
18 18.1 18.2 18.3 18.4 18.5
The Gauss Iinear model
357 357 359 363 366 367
l9. 1 19.2 19.3 19.4 19.5 19.6 19.7 19.8
20
Specification Estimation Hypothesis testing and confidence intervals Experimental design Looking ahead The Iinear regression model I spification, estimation and testing lntroduction
369
369 370
Specification Discussion of the assumptions Estimation Specification testing Prediction The residuals Summary and conclusion Appendix 19. 1 A note on measurement -
375
378 392 402 405
408 systems
The Iinear regression model 11 departures from the assumptions underlying the statistical GM The stochastic linear regression model The statistical parameters of interest
409
412
413 418
Contents Weak exogeneity Restrictions on the statistical parameters
of
interest
422 432 434
20.5 20.6
Collinearity S'Near' collinearity
2 1.1 2 1.2 2 1.3 2 1.4 2 1.5 2 1.6
The Iinear regression model III departures from the assumptions underlying the probability model Misspecification testing and auxiliary regressions Normality Linearity Homoskedasticity Parameter time invariance Parameter structural change Appendix 21.1 - Variance stabilising -
443 443 447 457 463 472 48 1
transformations
22 22.1 22.2 22.3 22.4
The linear regression model IV departures from the sampling model assumption Implications of a non-random sample Tackling temporal dependence Testing the independent sample assumption Looking back Appendix 22. 1 Deriving the conditional -
expectation
23 23. 1 23.2 23.3 23.4 23.5 23.6
The dynamic Iinear regression
24 24.1 24.2 24.3 24.4 24.5 24.6
The multiYariate lntroduction
model
Specification Estimation Misspecification
testing Specification testing Prediction Looking back
linear regression model
Specification and estimation A priori information The Zellner and Malinvaud formulations Specification testing Misspecification testing
493 494 503
511 521 523
526 527 533 539 548 562 567 571 571 574 579 585 589 596
Contents
Prediction The multivariate dynamic linear regression (MDLR) model Appendix 24. 1 The Wishart distribution .. Appendix 24.2 Kronecker products and matrix -
25 25. 1 25.2
25.4 25.5 25.6 25.7 25.8 25.9 25.10 26
599 6O2
differentiation
603
The simultaneous equations model Introduction The multivariate linear regression and simultaneous equations models ldentification using linear homogeneous
608 608
restrictions Specification Maximum likelihood estimation
610 614 619 621 626 637 644 649 654
Least-squares estimation lnstrumental variables Misspecification testing Specification testing Prediction Epilogue: towards a methodology of econometric
modelling 26. 1 26.2
A methodologist's critical eye Econometric modelling, formalising
26.3
methodology Condusion
659 659 a
References
673
lndex
689
* Starred chapters and or sections are typically more difficult and might be avoided at first reading.
PA R T I
Introduction
Econometric
1.1
modelling, a preliminary view
- a brief historical overview
Econometrics
lt is customary to begin a textbook by defining its subject matter. ln this case this brings us immediately up against the problem of defining 'econometrics'. Such a definition, however, raises some very difficult methodological issues which could not be discussed at this stage. The epilogue might be a better place to give a proper definition. For the purposes of the discussion which follows it suffices to use a working definition which provides only broad guide-posts of its intended scope: Econometrics
phenomena
is concerned willl observed data.
the slaremarfc
study
t#'
economic
lsfrlg
This definition is much broader than certain textbook definitions of narrowing the subject matter of econometrics to the theoretical relationships as suggested by economic theory. lt is argued in the epilogue that the latter definition of econometrics constitutes a relic of an outdated methodology, that of the logical positivism (see Caldwell ( 1982:. The methodological position underlying the denition given above The term systematic is used is largely hidden behind the word where economic theory observed of data in framework describe the a use to well role, play statistical inference important as yet undetined. The an as as observed data is what distinguishes econometrics from other forms of of use studying economic phenomena. Econometrics, detined as the study of the economy using observed data, can be traced as far back as 1676, predating economics as a separate discipline by a century. Sir William Petty could be credited with the first kmeasurement'
tsystematic'.
A preliminary
view
'systematic' attempt to study economic phenomena using data in his Political Arithmetik. Systematic in this case is used relative to the state of the art in statistics and economics of the time. Petty ( 1676) used the pioneering results in descriptive statistics developed forms of economic by his friend John Graunt and certain rudimentary in studying economic attempt theorising to produce the first phenomena using data. Petty might also be credited as the first to submit to modelling. According to Hull, a most serious temptation in econometric main biographers and collector of his works: one of his 'systematic'
Petty sometimes appears to be seeking figures that will support a conclusion he has already reached: Graunt uses his numerical data as a basis for conclusions. declining to go beyond them. (See Hull ( 1899), p. xxv.)
Econometrics, since Petty's time, has developed alongside statistics and economic theory borrowing and lending to both subjects. In order to understand the development of econometrics we need to relate it to developments in these subjects. Graunt
(i) (ii) (iii)
and Petty initiated three important
developments
in statistics:
the systematic collection of (numerical)data; related to life-tables; and theory (#' probabilitl' the mathematical
the development
of what we nowadays call descriptive statistics a coherent set of techniques for analysing
(see Chapter 2) into numerical data.
It was rather unfortunate that the last two lines of thought developed largely independent of each other for the next two centuries. Their slow convergence during the second half of the nineteenth and early twentieth centuries in the hands of Galton, Edgeworth, Pearson and Yule, inter alia, culminated with the Fisher paradigm which was to dominate statistical theory to this day. The development of the calculus of probability emanating from Graunt's work began with Halley (1656-1742jand continued with De Moivre (1667-1754q, Daniel Bernoulli (1700-824,Bayes 41702-61q, Lagrange (1736- 1813j, Laplace g1749-1827j, Legendre g1752-1833q, Gauss (17891857) inter alia. ln the hands of De Moivre the main line of the calculus of probability emanating from Jacob Bernoulli (1654-17054 was joined up with Halley's life tables to begin a remarkable development of probability theory (see Hacking ( 1975), Maistrov ( 1974)). The most important of these developments can be summarised under the following headings: manipulation of probabilities (addition, multiplicationl; (i)
1.1
A brief historical oveniew
families of distribution exponentiall;
functions
(normal, binomial, Poisson,
1aw of error, least-squares. least-absolute errors', limit theorems (law of large numbers, central limit theoreml; life-table probabilities and annuities; higher order approximations; probability generating functions. Some of these topics will be considered in some detail in Part 11 because they form the foundation of statistical inference. The tradition in Political Arithmetik originated by Petty was continued by Gregory King (1656-17141. Davenant might be credited with demand schedule (see Stigler ( 1954:, drawing publishing the rst freely from King's unpublished work. For this reason his empirical demand for wheat schedule is credited to King and it has become known as tKing's law'. Using King's data on the change in the price (pf) associated with a given change in quantity (fyt)Yule (1915) derived the empirical equation
(iii) (iv) (v) (vi) (vii)
tempirical'
explicitly as p,
=
-
2.33:, +0.05:,2
-
0.0017gJ.
Apart from this demand schedule, King and Davenant extended the line of thought related to the population and death rates in various directions thus establishing a tradition in Political art of reasoning by relating figures upon things, to government'. Political Arithmetik was to stagnate for almost a century without any major developments in the descriptive study of data apart from grouping and calculation of tendencies. From the economic theory viewpoint Political Arithmetil played an where numerical role in classical economics data on money important and prices, public imports were extensively stock, finance, exports wages, used as important tools in their various controversies. The best example of the tradition established by Graunt and Petty is provided by Malthus' 'Essay on the Principles of Population'. ln the bullionist and currencybanking schools controversies numerical data played an important role (see Schumpeter ( 1954)). During the same period the calculation of index numbers made its lirst appearance. With the establishment of the statistical society in 1834 began a more coordinated activity in most European countries for more reliable and complete data. During the period 1850-90 a sequence of statistical congresses established a common tradition for collecting and publishing data on many economic and social variables making very rapid progress on this front. ln relation to statistical techniques, however, the progress was much slower. Measures of central tendency (arithmetic mean, median, u4/'lnltall'/v',
'the
A preliminary
view
mode, geometric mean) and graphical techniques were developed. of dispersion (standard deviation, interquartile Measures range), correlation and relative frequencies made their appearance towards the end of this period. A leading figure of this period who can be considered as one of the few people to straddle all three lines of development emanating from Graunt and Petty was the Belgian statistician Quetelet(1796-1874j. From the econometric viewpoint the most notable example of empirical modelling was Engel's study of family budgets. The most important of his conclusions was: The poorer a family, the greater the proportion of its total expenditure that must be devoted to the provision of food.
(Quotedby Stigler (1954).) This has become known as Engel's law and the relationship between consumption and income as Engel's curve (see Deaton and Muellbauer
(1980)).
The most important contribution during this period (early nineteenth century) from the modelling viewpoint was what we nowadays call the Gauss linear model. In modern notation this model takes the following general form:
y =X#
+u,
where y is a F x 1 vector of observations linearly related to the unknown k x 1 vector p via a known fixed F x k matrix X (ranktx) k, F> k) but subject to (observation)error u. This formulation was used to model a situation such that: =
it is suspected that for settings linear relation: k )., jj;1 pix; =
xl
,
xc,
.
.
xk there is a value y related
,
.
by a
,
i
=
where p is unknown. A number T of observations on y can be made, corresponding to F different sets of (x: xkl,i.e. we obtain a data set (yf, xfk), r 1s2, xf : xta, 'Cbut the readings on y, are subject to error. (See Heyde and Seneta (1977).) uzzpi)
,
,
.
.
.
,
.
.
.
,
.p
=
.
.
.
,
The problem as seen at the time was one of interpolation (approximation), that is, to the value of #. The solution proposed came in the form of the least-squares approximation of p based on the minimisation of Gapproximate'
u'u=ty
-x#)'(y
-X#),
whichlead to /= (X'X)--1X'y
( 1.4)
A brief historical oveniew
(see Seal (1967:. The problem, as well as the solution, had nothing to do with probability theory as such. The probabilistic arguments entered the problem as an afterthought in the attempts of Gauss and Laplace to justify T are assumed the method of least-squares. lf the error terms lk1, t 1,2, to be independent and identically distributed (IlD) according to the normal distribution, i.e. =
c2),
.::(.0,
ld, x.
t
1, 2,
=
.
.
.
,
.
.
.
,
(1.5)
T)
optimal solution' from a probabilistic then j- in (4) can be justified as viewpoint (see Heyde and Seneta (1977),Seal (1967), Maistrov (1974)). The Gauss linear model was later given a very different interpretation in the context of probability theory by Galton, Pearson, and Yule, which gave rise to what is nowadays called the linear regression model. The model given in (2)is now interpreted as based wholly on probabilistic arguments, y, and Xf are assumed to bejointly normally distributed random variables and X/ is viewed as the conditional expectations of pf given that X! xf (Xr takes the Tt.i.e. value xp) for r 1, 2, kthe
=
=
.
.E()rry/Xt xf ) =
=
.
.
,
#'xt,
t
1, 2,
=
.
.
(1.6)
'C
,
.
with the error term u, defined by lz, yf - F(y/ Xr further details). The linear regression model =
.pf .Et-p,Xf xl) + =
=
u,,
t
=
1, 2,
.
.
.
,
=
xf)
(see Chapter 19 for
T
can be written in matrix form as in (2) and the two models become indistinguishable in terms of notation. From the modelling viewpoint, however, the two models are very different. The Gauss linear model relationship where the xffs are known constants. On describes a the other hand, the linear regression model refers to a relationship where yf is related to the observed values of the random vector Xl (forfurther discussion see Chapter 19). This important difference went largely unnoticed by Galton, Pearson and the early twentieth-century applied econometricians. Galton in particular used the linear regression causal relationships in support of his theories model to establish of heredity in the then newly established discipline of eugenics. The Gauss linear model was initially developed by astronomers in their relationships for planetary orbits, using a attempt to determine large number of observations with less than totally accurate instruments. The nature of their problem was such as to enable them to assume that their theories could account for all the information in the data apart from a vhfrc-ntpfsp(see Chapter 8) error term uf. The situation being modelled design' situation because of the relative resembles an of constancy the phenomena in question with nature playing the role of the 'law-like'
'predictive-like'
'law-like'
'law-like'
'experimental
8
A preliminary
view
experimenter. Later, Fisher extended the applicability of the Gauss linear phenomena using the idea of randomisation model to (see Fisher ( 1958:. Similarly, the linear regression model. firmly based on expectation, was later extended by Pearson to the the idea of conditional stochastic of regressors (see Seal ( 1967)). case ln the context of the Gauss linear and linear regression models the convergence of descriptive statistics and the calculus of probability became a reality, with Galton (1822-1911.q, Edgeworth g1815- 1926q, Pearson g1857-19361 and Yule g1871-195 1q being the muin protagonists. ln the hands of Fisher (1890-1962jthe convergence was completed and a new One of the most important modelling paradigm was proposed. contributing factors to these developments in the early twentieth century was the availability of more complete and reliable data towards the end of the nineteenth century. Another important development contributing to the convergence of the descriptive study of data and the calculus of probability came in the form of Pearson's family of frequency curves which provided the basis for the transition from histograms to probability density functions (seeChapter 2). Moreover, the various concepts and techniques and provide the developed in descriptive statistics were to be reinterpreted basis for the probability theory framework. The frequency curves as used in descriptive statistics provide convenient for the observed data at hand. On the other hand, probability density functions were postulated as imodels' of the population giving rise to the data with the latter viewed as a representative sample from the former. The change from the descripti,e statistics to the probability theor approach in statistical modelling went almost unnoticed until the mid-1930s when the latter approach formalised by Fisher dominated the scene. During the period 1890-1920 the distinction between the population from where the observed data constitute a sample and the sample itself was blurred by the applied statisticians. This was mainly because the paradigm tacitly used, as formulated by Pearson, was firmly rooted in the descriptive statistics tradition where the modelling proceeds from the observed data in hand to the frequency (probablity)model and no distinction between the population and the sample is needed. In a sense the population consists of the data in hand. In the context of the Fisher paradigm, however, the modelling of a probability model is postulated as a generalised description of the actual data generation prpcess (DGP), or the population and the observed data are viewed as a realisation of a sample from the process. The transition from the Pearson to the Fisher paradigm was rather slow and went largely unnoticed even by the protagonists. ln the exchanges between Fisher and Pearson about the superiority of the maximum likelihood estimation over the method of moments on efficiency grounds, Pearson 'experimental-like'
'models'
1.1
A brief historical oveniew
9
never pointed out that his method of moments was developed for a different statistical paradigm where the probability model is not postulated a priori (see Chapter 13). The distinction between the population and the sample was initially raised during the last decade of the nineteenth century and of the early twentieth century in relation to biqher t?l'J:'r approximations limit theorem (CLT) results emanating from Bernouli, Moivre De central and Laplace. These limit theorems were sharpened considerably by the Russian school (Chebyshev (1821-941, Liapounov (1857-1922q,
Markov g1856-1922q, Kolmogorov g19031 (seeMaistov ( 1974)) and this period. Edgeworth and Charlier, among extensively used during expansions which could be used to improve the others, proposed asymptotic offered by the CLT given sample size T (see Cramer approximation for a The of development ( 1972)). a formal distribution theory based on a fixed sample size 'T)however, began with Gosset's (Studenfs) t and Fisher's F. distributions (see Kendal and Stuart ( 1969)). These results provided the basis of modern statistical theory based on the Fisher paradigm. The transition from the Pearson to the Fisher paradigm became apparent in the 1930s when the theory of estimation and testing as we know it today was formulated. lt was also the time when probability theory itself was given its axiomatic foundations by Kolmogorov ( 1933) and tirmly established as part of mathematics proper. By the late 1930s probability theory as well as statistical inference as we know them today were firmly established. The Gauss linear and linear regression models were appropriate for Yule ( 1926) discussed the modelling essentially static phenomena. difficulties raised when time series data are used in the context of the linear regression model and gave an insightful discussion of regressions' (seeHendry and Morgan ( 1986)). ln an attempt to circumvent these problems, Yule (1927) proposed the linear autorenressive model (AR(m)) where the xffs are replaced by the lagged ).fS, i.e. 'non-sense
An alternative model for time-series data was suggested by Slutsky ( 1927) in such data using weighted his discussion of the dangers in showed that weighted He averaging of a white-noise by averaging. process with produce periodicities. Hence, somebody looking for a data series ut can cyclic behaviour can be easily fooled when the data series have been smoothed. His discussion gave rise to the other important family of timeseries models, subsequently called the moinq tkptalv/gcp m()(leI(MA(p)): ksmoothing'
A preliminary
view
Wold ( 1938) provided the foundations for time series modelling by relatinp the above models to the mathematical theory of probability establshed bj Kolmogorov ( 1933). These developments in time series modelling were to have only a marginal effect on mainstream econometric modelling until the mid-7os when a slow but sure convergence of the two methodologies began. One of the main aims of the present boc)k is to complete this convergence in methodology, the context of a reformulated With the above dcvelopments in probability theory and statistical inference in mind, let us consider the history of econometric modelling in the early twentieth eentury. The marginalist revolution of the 187s, with Walras and Jevons the protagonists, began to take root and with it a change of attitude towards mathematical and statistical techniques and their role in studying the economy. ln classical ekronomics ebserved data were used mainly to tendencies in support of theoretical arguments or as kfacts' to be explained. The mathematisation of economic theory brought about by the marginalist revolution contributed towards a purposeful attempt to quantify theoretical relationships using observed data. The theoretical relationships formulated in terms of equations such as demand and supply funetions seemed to offer themselves for quantification using the newly established techniques of correlation and regression. The early literature in econometric modelling concentrated mostly on two general areas, business cycles and demand curves (see Stigler (1954:. This can be explained by the availability of data and th influence of the marginalist revolution. The statistical analysis of business cycles took the form of applying correlation as a tool to separate long-term secular movements, periodic movements and short-run oscillations (seeHooker (1905), Moore (1914)inter alia). The empirical studies in demand theory concentratd mostly on estimating demand curves using the Gauss linear model disguised as regression analysis. The estimation of such curves was fitting' with any probabilistic treated as being arguments coincidental. Numerous studies of empirical demand schedules, mostly of agricultural products, were published during the period 1910-30 (see Stigler ( 1954), Morgan ( 1982), Hendry and Morgan (1986:, seeking to of demand'. These studies establish an empirical foundation for the purported to estimate demand schedules of the simple form 'establish'
tcurve
'law
qD= l
(1.10)
f/() +J1n,
OD
refers to quantities demanded at time l (intentions on behalf of where economic agents to buy a certain quantity of a commodity) corresponding to a range of hypothetical prices (q. By adopting the Gauss linear model line through the these studies tried to approximate' (10) by fitting the of where usually referred to 1, T fq- t ) t 2, t scatter diagram 'best'
/7
'
=
.
.
.
,
1.1
A brief historical overview
11
quantities transacted (or produced) and the corresponding 1. That is, they would estimate
L
=
b1
-1-
b1
,
l
=
1 2s ,
.
.
.
r
prices
f
at time
(1 11)
X
.
using least-squares or some other interpolation method and interpret the estimated coefficients h and /)1as estimates of the theoretical parameters?o and f1j respectively, if the signs and vaiues were consistent wth the law of demand'. This simplistic modelling approach, however, ran into difficulties immediately. Moore (19l4) estimated ( l 1) using data on pig-iron (raw steel) production and price and (or so he thought) a positively sloping demand schedule (Jj > 0). This result attracted considerable criticism from the applied econometricians of the time such as Lehfeldt and issue in Wright (see Stigler ( 1962)) and raised the most important econometrie modelling; te connccrft?n bpfwtzpn be estimated equations pt/ullf/tvlt?l %'cctpntpl?c' rs/ng observed data tznJ the theovetical l-elationships argued that 1915), commenting Moore*s Lehfeldt theory. ( on equation estimated supply demand but the a curve. Several was not a applied econometricians argued that Moore's estimated equation was a mixture of demand and supply. Others, taking a more extreme view, raised the issue of whether estimated equations represent statistical artifacts or genuine empirical demand or supply curves. lt might surprise the reader to learn that the same issue remains largely unresolved to this day. Several 'solutions' have been suggested since then but no satisfactory answer has emerged. During the next two decades (1910-30) the applied econometricians struggled with the problem and proposed several ingenious ways to 'resolve' some of the problems raised by the estimated versus theoretical relationships issue. Their attempts were mainly directed towards specifying theoreteal models and attempting to rid the observed data more of information'. For example, the scenario that the demand and supply curves simultaneously shifting allowing us to observe only their intersection points received considerable attention (see Working (1927) incr ('/fll. The time dimension of time-series data proved particularly given that the theoretical model was commonly static. difticult to Hence the observed data the data was a popular way to In order to bring them closer to the theoretical concepts purporting to estimated-theoretical As argued Morgan the below (1982/. measure (see tssue raises numerous problems which, given the state of the art as far as statistical inference is concerned, could not have been resolved in any Attisfactory way. ln modern terminology these problems can be xummarised under the following headings: theoretical variables versus observed data; : Sdiscovered'
'discovery',
4realistic'
kirrelevant
'solve'
Sdetrending'
'
kpurify'
A preliminary
(ii) (iii) (iv) (v)
statistical
N'iew
model specificalion'. misspecification testing;
statistical
specification
testing, reparametrisation, identification; theoretical models. By the late 1920: there was a deeply felt need for a more organised effort to face the problems raised by the early applied econometricians such as Moore. Mitchell, Schultz. Ctark- Working, Wallace, Wright, fnlcr (Ilftl. This led to the creation of the Econometric Society in 1930. Frisch, Tinbergen of international society' and Fisher (Irving) initiated the establishment //- r/lt, adt,ancement (?/' economic l/:t?t??'). in its rt?lation to statistics and Anfl/lpnlrkrs. The decade immediately after the creation of the Econometric Society can be characterised as the period during which the foundations of modern econometrics were laid mainly by posing some important and insightful questions. An important attempt to resolve some of the problems raised by the estimated theoretical distinction was mae by Frisch ( 1928) (1934). Arguing from the Gauss linear model viewpoint Frisch suggested the so-called errors-in-variables formulation where the theoretical relationships defined Jtkf) in terms of theoretical variables pt EEEfy 1 ,, defined by the system of k linear equations: empirical
versus
'an
bare
.
A'p f and
,
5
,
=
.
=0
the observed data yt BE (.v11 yg
.
pt + ct
.
.
.
,.J.!k2 )'
are related
to pt
via
,
where cr are errors of measurement. This formulation emphasises the distinction between theoretical variables and observed data with the measurement equations ( 13) relating the two. The problem as seen by Frisch was one of approximation (interpolation) in the context of linear algebra in the same way as the Gauss linear model was viewed. Frisch, however, with his coqjluence analq,sis offered no. proper solution to the problem. A complete solution to the simplest case was only recently provided, 50 years later, by Kalman ( 1982). lt is fair to say that although Frisch understood the problems raised by the empirical theoretical the quotation distinction as below testifies, his formulation of the problem turned out to be rather unsuccessful in this respect. Commenting on Tinbergen's 'A statistical test of business cycle theories'. Frisch argued that: The qucstion of what connection there is between relations we work with in theory and those we get by fitting ctlrves to actual statistical data is a very delicatc onc. Tinbergen in his work hardly mentions it. He more or Iess takes it for granted that the relatons he has found are in their nature
1.1
A brief historical overview
the same as of this sort,
the theory This is, in my opinion, unsatisfactory. In a work and bctween slatisifL'fll the connection l,b.ltll'et'il'al l-ta/f-/ri?ns must be thoroughly understood and the nature of the intbrmation which the statistical relations furnish - although they are not identical with the theoretical rclations - should be clearly brought out. (Sce Frisch ( 1938), pp. 2 - 3.) .
.
.
As mentioned above, by the late 1930s the Fisher paradigm of statistical inference was formtllated into a coherent body of knowledge with a firm foundatin in probability theory. The first important attempt to introduce this paradigm into econometrics was made by Koopmans ( 1937). He formulation in the proposed a resetting of Frisch's errors-in-variables context of the Fisher paradigm and related the least-squares method to that of maximum likelihood, arguing that the latter paradigm provides us with additional insight as to the nature of the problem posed and its (estimation). Seven years later Haavelmo (a student of Frsch) published his celebrated monograph on b'T'heprobability approach in econometrics' (see Haavelmo ( 1944)) where he argued that the probability approach (the bsolution'
Fisher
paradigm)
was
the most promising
approach
to econometric
modelling (see Morgan ( 19841). His argument in a nutshell was that if statistical inference (estimation, testing and prediction) are to be used systematically we need to accept the framework in the context of which these results become available. This entails formulating theoretical propositions in the context of a well-defined statistical model. ln the same monograph Haavelmo exemplified a methodological awareness far ahead of his time. ln relation to the above discussion of the appropriateness of the Gauss linear model in modelling economic phenomena he distinguished between observed data resulting from: ( 1) experiments that we should like to make to see if ccrtain real economic intlucnces' would phenomena - when artificially isolatcd from verify certain hypotheses, and 'other
(2) the stream own enormous observers
of experiments
that Nature is steadily turning out from her wc mcrely watch as passive
laboratory, and which
He went on to argue:
ln the first case we can make the agreement or disagreement between theory and facts depend upon two things: the facts we choose to consider, ln the second casc we can only try to as well as our theory about them adjust our theories to reality as it appears before us. And what is the meaningof a design of experiments in this case'? lt is this: Wc try to choose a theory and a design of experiments to go with it. in such a way that thc .
.
.
.
A preliminary
view
resulting data would be those which we get by passive observation reality.And to the extent that we succeed in doing so, we become master reality - by passive agreement.
t)! 0:
Now if we examine current economic theories, we see that a great many of them, in particular the more profound ones, require experiments of the first type mentioned above. On the other hand, the kind of economic data that we actually have belong mostly to the second type. (See Haavelmo (1944).)
Unfortunately for econometrics, Haavelmo's views on the methodology of econometric modelling had much lesser influence than his formulation of a statistical model thought to be tailor made for econometrics; the so-called simultaneous equations model. ln an attempt to capture the interdependence of economic relationships Haavelmo ( 1943) proposed an alternative to Frisch's errors-in-variables and formulation where no distinction between theoretical variables observed data is made. The simultaneous equation formulation was specified by the system F'y f + A'xf + tt
=
0,
where yr refers to the variables whose behaviour this system purports to explain (endogenous)and x/ to the explanatory (extraneous) variables whose behaviour lies outside the intended scope of the theory underlying ( 14) and zt is the error term (seeChapter 25). The statistical analysis of (14) provided the agenda for a group of distinguished statisticians and econometricians assembled in Chicago in 1945. This group known as the Cowles Foundation Group, introduced the newly developed techniques of estimation (maximum likelihood) and testing into econometrics via the simultaneous equation model. Their results, published in two monographs (see Koopmans (1950)and Hood and Koopmans (1953)) were to provide the main research agenda in econometrics for the next 25 years. lt is important to note that despite Haavelmo's stated intentions in his discussion of the methodology of econometric modelling (seeHaavelmo (1944)), the simultaneous equation model was later viewed in the Gauss linear model tradition where the theory is assumed to account for a11 the information in the data apart from some non-systematic (white-noise) errors. lndeed the research in econometric theory for the next 25-30 years was dominated by the Gauss linear model and its misspecification analysis equations model and its identification and and the simultaneous estimation. The initial optimism about the potential of the simultaneous equations model and its appropriateness for econometric modelling was not fulfilled. The problems related to the isstle of estimated versus
A sketcb of a metbodology
mentioned above were largely ignored because of theoretical relationships this initial optimism. By the late 1970s the experience with large equations macroeconometric models based on the simultaneous formulation called into queson the whole approach to econometric modelling (seeSims (1980),Malinvaud (1982),fnl!r t:!ia). The inability of large macroeconometric models to compete with BoxJenkins ARIMA models, which have no economic theory content, on prediction grounds (see Cooper (.1972)) renewed the interest of econometricians to the issue of sttzlic tllptlry tlcrsus Jynclmfcrfmpseries Jcltl raised in the 1920s and 30s. Granger and Newbold ( 1974) questioned the conventional econometric approach of paying little attention to the time series features of economic data', the result of specifying statistical models using only the information provided by economic theory. By the late 1970s it was clear that the simultaneous equations model, although very useful, was not a panacea for al1 econometric modelling problems. The whole in view of the econometric methodology needed a reconsideration experience of the three decades since the Cowles Foundation. The purpose of the next section is to consider an outline of a particular approact to econometric modezing wtch takes account of some of the problems raised above. lt is only an outline because in order to formulate the methodology in any detail we need to use concepts and resultswhich are developed in the rest of the book. In particular an important feature of the proposed methodology is the recasting of statistical models of interest in econometrics in the Fisherian mould where the probabilistic assumptions are made directly in terms of the observable random variables giving rise to the observed data and not some unobservable error term. The concepts and ideas involved in this recasting are developed in Parts 1l-IV. Hence, a more detailed discussion of the proposed methodology is given in the epilogue.
1.2
Econometric modelling - a sketch of a methodology
In order to motivate the methodology
of econometric
modelling adopted
below let us consider a simplistic view of a commonly propounded methodology as given in Fig. 1.1 (for similar diagrams see lntriligator ( 1978), Koutsoyiannis (1977),inter 4?/fJ), ln order to explain the procedure represented by the diagram let us consider an extensively researched theoretical relationship, the transactions demand for money. There is a proliferation of theories related to the demand for money which are beyond the scope of the present discussion (fora survey see Fisher (1978:. For our purposes it suffices to consider a simple theory where the transactions demand for money depends on income. the price level and
A preliminary
view
Data Prediction Econometric model
Theoretical
Theory
model
(f Orecasting)
Estimation testing
policy evaluation Statistical inference
Fig. 1.1. The
interest
'textbook'
approach
to econometric
modelling.
rate, i.e. MB
=./'(Ft #, 1).
(1.15)
in some Most theories of the demand for money can be accommodated variation of (15) by attributing different interpretations to F. The theoretical model is a mathematical formulation of a theory. ln the present case we expressed the theory directly in the functional form (15) in an attempt to keep the discussion to a minimum. Let the theoretical model be an explicit functional form for ( 15), say MB
or
ln
=
zlyvt'ac.fza
MB
=
atj + rzj ln F + xz ln # + aa ln 1
in log-linear form
with tzll ln being a constant. The next step in the methodological scheme represented ,4
=
by Fig. 1.1 is to transform the theoretical model ( 17) into an econometric model. This is commonly achieved in an interrelated sequence of steps which is rarely explicitly stated. Firstly, certain data series, assumed to represent measurements of the theoretical variables involved, are chosen. Secondly, the theoretical variables are assumed to coincide with the variables giving rise to the observed data chosen. This enables us to respecify (17) in terms of these observable variables, say, V,, f,, Ft and I-t ln Vt
=
txtl +
a1 ln 1-,)+ a2 ln #f + aafr,
(1.18)
The last step is to turn (18) into an econometric (statistical) model by attaching an error term t1t which is commonly assumed to be a normally distributed random variable of the form
A sketch of a methodology
l.2
of the exeluded variables.
the effects n
t
=
x () + x 1 .f?+ a at + u
3l
Adding this error term onto + at
,
t
6
( 18) yields (1
.20)
T,
u here small letters represent the logarithm of the corresponding capital lctters. Equation (20) is now viewed as a Gauss linear model with the cstimation, testing and prediction techniques related to this at our disposal transactions demand for money'. :o analyse next qate ''l'e t%to esttmate 2% ustttg tte statistkal results retated to the Gauss linear model and test the postulated assumptions for the error term. lf any of the assumptions are invalid we correct by respecifying the error term and then we proceed to test the a priori restrictions suggested by the theory such as, tz1 :>: 1, aa cls 1, 1 < (s <0, using the statistical techniques related to the linear model. When we satisfy ourselves that the theory is 'correct' we can proceed to use the estimated equation for prediction or and policy evaluation. ln practice the above methodological procedure turns out to be much more difficult to apply and applied econometricians find themselves having procedures in order to get something worth publishing. to use include estimating dozens of equations like (20)with Such procedures relevant variables as well as including a of combinations possibly various variables. variables the This is most explanatory few lagged among graphically described in the introduction of Leamer ( 19781: 'the
-
'illegitimate'
1 began thinking about these problems when l was a graduate student in economics at the University of Michigan, 1966-1970. At that timc there was a very active group building an econometric model of the United modelling was done in the States. As it happens, the econometric basement of the building and the econometric theory courses were taught on the top tloor (the third). l was perplexed by the fact that the same language was used in both places. Even more amazing was the transmogrification of particular individuals who wantonly sinned in the basement and metamorphosed into the highest of high priests as they ascended to the third floor.
ln the same book Leamer went on to attempt a systematisation of these 'illegitimate' procedures using Bayesian techniques. The approach procedures are proposed below will show that some of these certain appropriate problems indeed in the context of an ways to methodology Chapter 22). alternative (see ln an attempt to motivate the underlying logic of the methodology to be sketched below 1et us return to Fig. 1.1 in order to consider some of the possible links' in the textbook methodology. The first possible weakness of the textbook methodology is that the 'illegitimate'
'tackle'
'weak
A preliminary
view
starting point of econometric modelling is some theory. This arises because the intended scope of econometrics is narrowly defined as the Such a definition was rejected at the outset of of theoretical relationships'. the present book as narrow and misleading. Theories are developed not for the sake of theorising but in order to understand some observable phenomenon of interest. Hence, defining the intended scope of econometrics as providing numbers for our own constructions and ignoring the original aim of explaining phenomena of interest, restricts its scope considerably by attaching to the modeller. In a nutshell, it only that the information' contained in the presupposes observed data chosen is what the theory allows. This presents the modeller with insurmountable difficulties at the statistical model specification stage chosen for them without their when the data do not fit the nature being taken into consideration. The problem becomes more apparent when the theoretical model is turned into a statistical (econometric) model by attaching a white-noise error term to a reinterpreted equation in terms of observable variables. lt is naive to suggest that the statistical model should be the same whatever the observed data chosen. ln order to see this 1etus consider the demand schedule at time t referred to in Section 1.1: 'measurement
Sblinkers'
ilegitimate
tstraightjacket'
qtB
(1.2 1)
atl + a Lpt.
=
If the data refer to intentions q?tt q) pitj,i 1, 2, n wlich correspond to the hypothetical range of prices pit,i 1, 2, n then the most appropriate statistical model in the context of which (21) can be analysed is indeed the Gauss linear model. This is because the way the observed data were generated was under conditions which resemble an experimental situation', the hypothetical prices pff, i 1, 2, n were called out and the economic agents considered their intentions to buy at time !. This suggests that uo and 1 can be estimated using =
=
=
=
y-P,t =
+ a j pit+ uft,
'xo
.
i
=
.
.
.
.
1, 2,
.
.
.
,
.
,
,
.
.
.
,
(1.22)
n.
ln Haavelmo's categorisation, ty-pjf, n constitute observed pit),i 1, 2, data of type one; experimental-like situations, see Section 1.1. On the other hand, if the observed data come in the form of time series li, f), t 1,2, F where tj'r refers to quantities transacted and the corresponding prices at time t then the data are of type two; generated by nature. In this case the Gauss linear model seems wholly inappropriate unless there exists additional information ensuring that =
.
.
.
,
=
.
.
.
,
!
t?2D(f)
=
Such a condition
(
for all
1.
(1.23)
is highly unlikely to hold given that in practice other
of a methodology
A sketch
factors such as supply-side, historical and institutional will influence the It is highly likely that the data (!, t), t 1, 2, determination of j', and T when used in the context of the Gauss linear model, .
=
)'()+ / l
r
+
=
.
.
.
,
lt
will give rise to verq' misleading estimates for the theoretical parameters of interest tztl and tzj (see Chapter 19 for the demand for money). This is because the GM represented by the Gauss linear model bears little, if any, resemblanee to the acttlal DGP which gave rise to the observed data (f, pt), TJln order to account for this some alternative statistical model t 1.2, should be specified in this case (see Part IV for several such models). Moreover, in this case the theoretical model (21) might not be estimable. A moment's retlection suggests that without any additional information the estimable form of the model is likely to be an kp-juslrrlcn! process (price or/and quantity). lf the observed data have a distinct time dimension this should be taken into consideration in deciding the estimable form of the model as well as in specifying the statistical model in the context of which the latter will be analysed. The estimable form of the model is directly related to the observable phenomenon of interest which gave rise to the data (the actual DGP). More often than not the intended seope of the theory in question is not the demand schedule itself but the explanation of changes in prices and quantities of interest. ln such a case a demand or and a supply schedule are used as a means to explain price and quantity changes not as the intended scope of the theory. =
.
.
.
,
ln the context of the textbook methodology distinguishing between the theoretical and estimable models in view of the observed data seems totally unnecessarj' for three interrelated reasons: the observed data are treated as an afterthought', (i) the actual DGP has no role to play', and (ii) theoretical variables are assumed to coincide (one-to-one) with the (iii) observed data chosen. As in the case of (2 1) above the theoretical variables do not correspond directly to a particular observed data series unless we generate the data isolating the economic phenomenon of interest ourselves by from other influences- (see the Haavelmo ( 1944) quotation in Section 1.1). We only have to think of theoretical variables such as aggregate demand for money, income, price level and interest rates and dozens of available data these variables. for measuring series become possible candidates what the theoretical Commonly, none of these data series measures variables refer to. Proceeding to assume that what is estimable coincides with the theoretical model and the statistical model differs from these by a -artificially
.4 preliminary'
view
w'hite-noise error terln re.ltlrtlll?vs to misleading conclusions.
q/' lhe
(?).st?,-l't.?t/
(l(l(l
t'/?t?st>??. can only
lead
The question which naturally arises at this stage is whether we can tackle some of the problems raised above in the context of an alternative methodological framework. ln view of the apparent limitations of the framework should be flexible textbook methodology any alternative enough so as to allow the modeller to ask some of the questions raised above even though readily available answers might not always be forthcoming. With this in mind such a methodological framework should attribute an important role to the actual DGP in order to widen the modelling. lndeed. the estimable model intended scope of econometric should be interpreted as an approximation to the aclual DGP. This brings the nature of the observed data at the centre of the scene with the statistical model being defined directly in terms of the random variables giving rise to the data and not the error term. The statistical model should be specified as a generalised description of the mechanism giving rise to thedata. in view of the estimable model- because the latter is going to be analysed in its context. A sketch of such a methodological framework is given in Fig. 1 An important feature of this framework is that it can include the textbook methodology as a speeial case under certain conditions. When the actual DGP is to resemble the conditions assumed by the theory in question (Haavelmo type one observed data) then the theoretical and estilnable models could coincide and the statistical model could differ from these by a white-noise error. In general, however. we need to distinguish between them even though the estimable model might not be readily available in some cases such as the case of the transactions demand for money (see Chapter 23). In order to be able to turn the above skeleton of a methodology into a ftllly fleshed framework we need to formulate some of the concepts involved in more detail and discuss its implementation at length. Hence, a more detailed discussion of this methodology is considered in the epilogue where the various components shown in Fig. 1 are properly defined and their role explained. In the meantime the following w'orking definitions will stlffice for the discussion which follows: .2.
-designed-
,2
Tlle()l-
'.'
a conceptual
constrtlct
provid ing an ideal ised description of the us to seek
phenomena within its intended scope which will enable cxplanations and predictions related to the actual DGP.
A sketch of a methodology
l.2
Actual data generating
Theory
Theoretical
Observed data
model
-
-
r'-
l l
l Statistical model
Estimable model
I
l
I
1 1
l
l
Estimation Misspecification Reparametrisation Model selection
l l I
I I l l 1 l
I I 1 I
l l
I I
l
Empirical econometric prediction,
1
model
poI icy evaluation
I l
1 l
l
potentitlly chosen.
l
1 I I
l l l
Estilnable
3
zyt?t/cp/..a particular form of the theo retical model which is estimable in view of the actual DG P and the observed data
a probabilistic formulation purporting to provide a generalised description of the actual DGP with a view of analysing thtt estimable model in its context. Sttltivstil'tll
Enlpirical
??kt)t/(>/.'
trvtlrltpnt-zr-.
motlel:
a
reformulation
(reparametrisation/'
restriction) of a well-defined estimated statistical model in view of the estimable model which can be used for description. explanation or/and prediction.
A preliminary
view
Looking ahead its main aim is the statistical As the title of the book exemplifies. of modelling. econometric In relation to Fig. 1.2 the book foundations mainly The concentrates on the part u ithin the dotted rectangle. statistical model in lerms of the variables giving rise to the of specification a observed data as well as the related statistictl inference results will be the subject matter of Parts 11and 111.ln Part 15' N'arious statistical models of interest in econometric modelling and the related statistical inference results will be considered in some detail. Special attention will be given to the procedure from the specification of the stttistical model to 1he dcmand for money of the empirical econometric model. The transactions example considered above will be tlsed th roughout Part 1V in an attempt to awaiting the tlnaware il1 the context of the textbook illustrate the well with the alternative methodology methodology as as compare this formalised in the present book. modelling Parts 11 and lll form an integral part of econometric and viewed of and the concepts should not be as providing a summary definitions to be used in Part lV, A sound background in probability theory and statistical inference is crtlcial for the implementation of the approach adopted in the present book. This is mainly becatlse the modeller is required statistical model taking into consideration the to specify the nature of the data in hand as well as the estimable model. This entails making decisions about characteristics of the random variables which gave independence, rise to the observed data chosen such as normalitys stationarity- mixing, before any estimation is even attempted. This is one of the most crucial decisions in the context of econometric modelling because model renders choice of the statistical the related an inappropriate reader advised statistical inference concltlsions invalid. Hence. the is to view of Parts 11 and 11I as an integral part econometric modelling and not as reference appendices. ln Part IV the reader is encouraged to view econometric modelling as a thinking person's activity and not as a sequence of technique recipes. Chapter 2 provides a very brief introduction to the Pearson paradigm in an attempt to motivate 1he Fisher paradigm which is the subject matter of Parts 11 and 111. tdesign'
'dangers'
'appropriate'
Additional references
C H AP T E R 2
Descriptive study of data
2.1
Histograms
and their numerical characteristics
By descriptive study of data we refer to the summarisation and exposition of observed data as well as (tabulation, grouping, graphical representation) such as measures of location, the derivation of numerical characteristics dispersion and shape. Although the descriptive study of data is an important facet of modelling with real data in itself, in the present study it is mainly used to motivate the need for probability theory and statistical inference proper.
In order to make the discussion more specific let us consider the after-tax personal income data of 23 000 households for 1979-80 in the UK. These data in raw form constitute 23 000 numbers between f 1000 and f 50000. This presents us with a formidable task in attempting to understand how income is distributed among the 23 000 households represented in the data. The purpose of descriptive statistics is to help us make some sense of such data. A natural way to proceed is to summarise the data by allocating the The number of intervals is chosen a priori numbers into classes (intervals). and it depends on the degree of summarisation needed. In the present case the income data are allocated into 15 intervals, as shown in Table 2. 1 below (see National lncome (CH(/ Expenditul'e ( 1983)). The first column of the table shows the income intervals, the second column shows the number of incomes falling into each interval and the third column the relative frequency for each interval. The relative frequency is calculated by dividing the number of observations in each interval by the total number of observations. Summarising the data in Table 2.1 enables us to get some idea of how income is distributed among the various classes. lf we plot the relative frequencies in a bar graph we get what is known as the histogram,
23
Descriptive study of data
0. 16 0. 15 0. 14 0.13 U'0. 12 11 . E 0. # 0.10
,/ 0.09 1 0.08 g o o7
a
0.06
c.c5
0.04 0.03 0.02 0,01
0
1
2
3
4
5
6
7
8
9
10
11
12 13 14
15
l ncome Fig. 2.1. The histogram and frequency polygon of the personal
income
data.
shown in Fig. 2. 1. The pictorial representation of the relative frequencies gives us a more vivid impression of the distribution of income. Looking at the histogram we can see that most households earn less than E4500 and in some sense we can separate them into two larger groups: those earning between f 1000 and f 4500 and those above f4500. The first impression is
and their numerical characteristies
Histograms
that the distribution rather smilar.
of income inside these two larger groups appears to be
For further information on the distribution of income we could caiculate location, charaeteristcs describing the hstogram's varous numerical dispersion and shape. Such measures can be caleulated directly in terms of the raw data. Howeqrer, in the present case it is more convenient for expositional purposes to use the grouped data. The main reason for this is to introduce variotls concepts which will be reinterpreted in the context of probability theory in Part ll. The n'lpt/l' as a measure of location takes the form 15
()j
.- =
..u
=1
-. '...
L
=
?
...
'y
.
,1
where $: and zi refer to the relative frequency and the midpoint of interval I'. ' Tbe rnott? as a measure of location refers to the value of income that occurs most frequentl) in the data set. ln the present case the mode belongs to the first interval f 1.0- 1.5. Another measure of location is the mtalan referring to when incomes al'e arranged in an the value of ineome in thc luiddle ascenling (01' descending) order according to the size of income. The best cIftnulatit'v qvaph which way to calculate the median is to plot the such consrenient answering questions -Ho5v is more for as many observations fall below a particular value of income?' (see Fig. 2.2). From the cumulative frequency graph we can see that the median belongs to the interval f 3.0-3.5. Comparing the three measures of location we can see that .//gt/l/.eznc)'
$'
1.0 0.9 $
0.8 ' o7 g .
# o.6
+-
y
0.5
';
'
'B 0.4 E
c=
0,3 1
0.2 0. 1
l 1
2
3
4
5
6
Fig. 2.2. The cumulative data.
7
8 9 lncome
10
1 1 12
13
14
15
histogram and ogive of the personal income
Descriptive study of data
mode < median histogram.
<
confirming
mean,
the
obvious
asymmetry
ol' the
Another important feature of the histogram is the dispersion of the relative frequencies around a measure of central tendency. The most defined by frequently used measure of dispersion is the l'al-iance 15
'2
(zf pli
=
t
=
4.85,
=
-
i
which is a measure of dispersion around the mean; standard deriation. We can extend the concept of the variance to
15
mk =
f
) =
(zf -
1
z'lksi
k
,
3 4
=
,
,
.
,
.
p
is known as the
,
defining what are known as hiqber central rrlf?rntanrs. These higher moments can be used to get a better idea of the shape of the histogram. For example, the standardised form of the third and fourth moments defined by SK
?n a X. an d -
=
K
n --7.4 u?
=
,
(2
.4)
,
L
known as the skewness and kurtosis tro//'icfcaurk. measure the asymmetry and peakedness of the histogram, respectively. In the case of a symmetric and the less peaked the histogram thegreater value of histogram, SK K. For the income data =0
SK
=
1.43
and
K
=
7.33,
which confirms the asymmetry of the histogram (skewed to the right). The above numerical characteristics referring to the location, dispersion and shape were calculated for the data set as a whole. lt was argued above, however, that it may be preferable to separate the data into two larger groups and study those separately. Let us consider the groups f 1.0-4.5 and f4.0-20.0 separately. The numerical characteristics for the two groups are and
h
=
2.5,
(721=
0.996,
SKL
=
0,252,
h
=
6.18,
:22
3.8 14,
SKz
=
2.55,
=
Kz
=
11.93,
respectively.
Looking at these measures we can see that although the two subsets of the income data seemed qualitatively rather similar they actually differ substantially. The second group has much bigger dispersion, skewness and kurtosis coefficients. Returning to the numerical characteristics of the data set as a whole we
2.2
Frequency curves
can see that these seem to represent an uneasy compromise between the above two subsets. This confirms our first intuitive reaction based on the histogram that it might be more appropriate to study the two larger groups separately. Another form of graphical representation for time-series data is the time 'C The temporal pattern of an economic time series grkpll (zf l). l 1 2e is important not only in the context of descriptive statistics but also plays an important role in econometric modelling in the context of statistical inference proper', see Part lV. =
s
2.2
.
.
.
.
.
Frequency curves
Although the histogram can be a very useful way to summarise and study observed data it is not a very convenient descriptor of data. This is because ()mof intervals) are j (m being the number m 1 parameters /1, /a, describe it. analytically histogram is a Moreover, needed to the of form the cumbersome step function -
.
( ) ,.c
m
)''')(Sf i -.-
=
--
1
( f '
+
1-
.
,
J (g,:.r i i ,:.r
.
L'i
where gf z'i 1 ) represents indicator function .
.
.
.
)
,
.
j.
)) ,
the Eth half-closed for for
interval and 1(
'
)
is the
Ilc,zi 1 ) k!llz'f,zi 1 ).
zg
+.
z
.
Hence, the histogram is not an ideal descriptor especially in relation to the modelling facet of observed data. The first step towards a more convenient descriptor of observed data is Jmtvfr/tpnwhich is a modified histogram. This is the so-called p-equency midpoints of the step function, as shown in Fig. the obtained by joining up continuous function. 2. 1. to get a An analogous graph for the cumulative frequency graph is known as the ogive (seeFig. 2.2). These two graphs can be interpreted as the histograms obtained by increasing the number of intervals. In summarising the data in the form of a histogram some information is lost. The greater the number of intervals the smaller the information lost. This suggests that increasing the number of intervals we might get more realistic descriptors for our data. lntuition suggests that if we keep on increasing the number of intervals to infinity we sllould get a much smoother frequency curve. Moreover, with a smooth frequency curve we should be able to describe it in some functional form with fewer than m - l parameters. For example, if we were to describe
Descriptive study of data
be able to the two subsets of the data separately we cotlld conceivably version smoothed of the frequency in polynomial polygons form a express a reasoning This line of 1ed statisticians the in with one or two parameters. second part of the nineteenth century to suggest various such families of frequency curves with various shapes for describing observed data, The Pearson
familyof frequencytwrptas'
ln his attempt to derive a general family of frequency curves to describe observed data, Karl Pearson in the late 189()s suggested a family based on the differential equation
dtlo /(z))
z+
(1
bo + b 1 :: + b 2::U,
d ):
which satisfies the condition that the curve touches the z-axis at T)(.c) 0 and has an optimum at z= -a, that is, the curve has one mode. Clearly, the solution of the above equation depends on the roots of the denominator. By imposing different conditions on these roots and choosing different values for a, bv, ?1 and bz we can generate numerous frequency curves such as =
(2.8)
(iii)
4(J)
.,4
=
aty,':--lt'*
11
-
J-shaped.
(2. 10)
In the case of the income data above we can see that the J-shaped (iii) frequency curve seems to be our best choice. As can be seen it has only one parameter (1 and it is clearly a much more convenient descriptor ('if equal to the appropriate) of the income data than the histogram. For lowest income value this is known as the Pareto frequency curve. Looking at Fig. 2. 1 we can see that for incomes greater than f 4.5 the Pareto frequency curve seems a very reasonable descriptor. An important property of the Pearson family of frequency cursres is that the parameters a, bv, l?I and bs are completely determined from knowledge of the first four moments. This implies that any frequency curq'e can be fitted to the data using these moments (see Kendall and Sttlart ( 1969)). At this point, instead of considering how such frequency curves can be fitted to observed data we are going to leave the story unfinished to be taken up in Parts ll1 and IV in order to look ahead to probability theory and statistical inference proper. ,4a
2.3 2.3
luooking ahead
Looking ahead
The most important drawback of descriptive statistics is that the study of the observed data enables us to draw certain conclusions which relate tpnlr to the data in hand. The temptation in analysing the above income data is to attempt to make generalisations beyond the data in hand, in particular about the distribution of ineome in the UK. This- however, is not possible in the descriptive statistics framework. ln order to be able to generalise model' the distribution of income in the beyond the data in hand weneed UK and not just the observed data in hand. Such a general 'model- is provided by probability theory to be considered in Part Il. lt turns out that the model provided by probability theory owes a 1ot to the earlier developed descriptive statistics. ln partieular, most of the concepts which form the basis of the probability model were motivated by the descriptive statistics concepts eonsidered above. The concepts of measures of location, dispersion and shape, as well as the frequency curve, were transplanted into probability theory with renewed interpretations. The frequencl curve when reinterpreted becomes a density function purporting real world phenomena. ln particular the Pearson to model observable family of frequency curves can be reinterpreted as a family of density in functions. As for the various measures- they will now be reinterpreted terms of the density function. Equipped with the probability model to be developed in Part 11 we can go on to analyse observed data (now interpreted as generated by some assumed probability model) in the context of statistical inference proper', the subject matter of Part 111.ln such a context we can generalise beyond the observed data in hand. Probability theory and statistical inference will enable us to construct and analyse statistical models of particular interest in econometrics', the subject matter of Part lV. ln Chapter 2 we consider the axiomatic approach to probability which forms the foundation for the discussion in Part ll. Chapter 3 introduces the concept of a random variable and related notions; arguably the most widely used concept in the present book. ln Chapters 4--10 we develop the mathematical framework in the context of which the probability model could be analysed as a prelude to Part 111. :to
tdescribe'
Additional references
PART
11
Probability theory
Probability
'Why do we need probability theory in analysing observed dataf?' ln the in the previous chapter it was descriptive study of data considered emphasised that the results cannot be generalised outside the observed data under consideration. Any question relating to the population from which the observed data were drawn cannot be answered within the descriptive statistics framework. ln order to be able to do that we need the theoretical framework offered by probability theory. ln effect probability theory molt?! which provides the logical foundation of develops a matbematical statistical inference procedures for analysing observed data. ln developing a mathematical model we must first identify the important features, relations and entities in the real world phenomena and then devise the concepts and choose the assumptions with which to project a generalised description of these phenomena', an idealised picture of these of its phenomena. The model as a consistent mathematical system has a own' and can be analysed and studied without direct reference to real world phenomena. Moreover, by definition a model should not bejudged as because we have no means of making suchjudgments (seeChapter or approximation model only A bejudged to the 26). as a or can with grips explain if enables it the 'reality' it purports to us to come to phenomena in question. That is, whether in studying the model's behaviour the patterns and results revealed can help us identify and understand the real phenomena within the theory's intended scope. The main aim of the present chapter is to construct a theoretical model for probability theory. ln Section 3. 1 we consider the notion of probability itself as a prelude to the axiomatisation of the concept in Section 3.2. The probability model developed comes in the form of a probability space (S, P( )).ln Section 3.3 this is extended to a conditional probability space. 'life
'true'
'false',
'good'
'better'
,%
'
Probability
3.1
The notion of probability
The theory of probability had its origins in gambling and games of chance in the mid-seventeenth eentury and its early history is associated with the names of Huygens, Pascal, Fermat and Bernoulli. This early development of probability was rather sporadic and without any rigorous mathematical foundations. The first attempts at some mathematical rigour and a more sophisticated analytical apparatus than just combinatorial reasoning, are credited to Laplace, De Moivre, Gauss and Poisson (see Maistrov (1974)). Laplace proposed what is known today as the classical definition of
probability: Dehnition 1 If a random experiment can rtrsu/r in N mutuall-v exclusive and equally likely outcomes and if NA oj' rtasp outcomes result in lr then te probability of A is desned !?.p occurrence oj' te event a4,
N J'(.4) c=..-..J-. N To illustrate the definition let us consider the random experiment of tossing The set of al1 a fair coin twice and observing the face which shows up. equally likely outcomes is S
)(SF), (FS), (SS), (TF)l,
=
Let the event
-4
'observing
be
With
N
=
4.
at least one head (S)', then
.4 )(.J.fF). TH), (J.ff.f)). =
Applying the classical definition in the above Since Nz 3, P(A) straightforward but in general it can be a tedious exercise example is rather Moreover, there are a number of 1968/. in combinatorics (see Feller ( this definition of probability, which render it serious shortcomings to foundation for probability theory. The obvious totally inadequate as a limitations of the classical approach are: it is applicable to situations where there is only a jlnite number of (i) possible outcomes; and likely' condition renders the definition circular. the (ii) Some important random experiments, even in gambling games (inresponse to which the classical approach was developed) give rise to a set of infinite until it turns up outcomes. For example, the game played by tossing a coin possible outcomes S (4S), (TS), heads gives rise to the infinite set of could flip a coin somebody (FTf1), (TFTS), it is conceivable that likely' is indefinitely without ever turning up heads! The idea of =t.
=
Sequally
=
.);
.
.
iequally
3.1
The notion
of probability
synonymous with equally probable', thus probability is defined using the idea of probability! Moreover, the definition is applicable to situations symmetry exists, which raises not only the where an apparent question of circularity but also how this definition can be applied to the case of a biased coin or to consider the probability that next year's rate of likely' outcomes inflation in the UK will be 10oz,z7Where are the and which ones result in the occurrence of the event? These objections were well known even by the founders of this approach and since the 1850s several attempts have been made to resolve the problems related to the 'equally likely' presupposition and extend the area of applicability of probability theory. The most intluential of the approaches suggested in an attempt to tackle the problems posed by the classical approach are the so-called frequency approach had its and subjective approaches to probability. Therwltpncy origins in the writings of Poisson but it was not until the late 1920s that Von Mises put forward a systematic account of the approach. The basic argument of the frequency approach is that probability does not have to be restricted to situations of apparent symmetry (equallylikely) since the notion of probability should be interpreted as stemming from the observable stability of empirical frequencies. For example, in the case of a (S) is not because there are fair coin we say that the probability of two equally likely outcomes but because repeated series of large numbers of trials demonstrate that the empirical frequency of occurrence of 'converges' to the limit as the number of trials goes to infinity. lf we denote by n.4 the number of occurrences of an event zt in n trials, then if 'objective'
iequally
-4
=
,
..4
1im na ex
11
=
PA ,
n
t1:fl
PA. Fig. 3. 1 illustrates this notion for the case of #= we say that #(z4)= in a typical example of 100 trials. As can be seen, although there are some twild fluctuations' of the relative frequency for a small number of trials, as these increase the relative frequency tends to (convergearound ). Despite the fact that the frequency approach seems to be an improvement over the classical approach, giving objective status to the notion of probability by rendering it a property of real world phenomena, there are as n goes to some obvious objections to it. tWhat is meant by infinity'''?' l-low can we generate infinite sequences of trials'?' 'What happens to phenomena where repeated trials are not possible'?' The subjecttve approach to probability renders the notion of probability a subjective status by regarding it as degrees of belief' on behalf of individuals assessing the uncertainty of a particular situation. The tsettle'
ilimit
Probability
36 1
.0
0.9 0.8 0.7
..c. (ru)
0.0
0. 5 0.4 0.3 0.2 0.1 10
20
Fig. 3. 1. Observed tossings.
30 relative
l 50
l
40
I 60
I
70
l 80
frequency of an experiment
1 90
I 100
with 100 coin
protagonists of this approach are interalia Ramsey ( 1926), de Finetti ( 1937), Savage ( 1954), Keynes ( 192 1) and Jeffreys ( 196 1),.see Barnett ( 1973) and Leamer (1978)on the differences between the frequency and subjective approaches as well as the differences among the subjectivists. Recent statistical controversies are mainly due to the attitudes adopted towards the frequency and subjective definitions of probability. Although these controversies are well beyond the material covere in this book, it is advisable to remember that the two approaches lead to alternative methods of statistical inference. The frequentists will conduct the discussion around what happens average', and attempt to develop the long-run' or objective', procedures which perform well according to these criteria. On the other hand, a subjectivist will be concerned with the question of revising prior beliefs in the light of the available information in the form of the observed data, and thus devise methods and techniques to answer such questions (see Barnett ( 1973)). Although the question of the meaning of probability was high on the agenda of probabilists from the mid-nineteenth century, this did not get in the way of impressive developments in the subject. ln particular the systematic development of mathematical techniques related to what we nowadays call limit theorems (see Chapter 9). These developments the work of the Russian School were mainly tin
-on
(Chebyshev, Markov. Liapounov and Bernstein). By the 1920s there was a wealth of such results and probabilith' began to grow knto a systematic body of knowledge. Although various people attempted of a systematisation probability it was the work of the Rtlssian mathematician Kolmogorov which proved to be the cornerstone for a systematic approach to
The axiomatic
approach
managed to relate the concept of probability theory. Kolmogorov that of integration probability to theory and exploited to the a measure in of functions on the one theory and the analogies between theory the set full and variable other. the of random ln a monumental concept hand on the a monograph in 1933 he proposed an axiomatisation of probability theory establishing it once and for a1l as part of mathematics proper. There is no doubt that this monograph proved to be the watershed for the later development of probability theory growing enormously in importance and applicability. Probability theory today plays a very important role in many disciplines including physics, chemistry, biology, sociology and economics.
3.2
The axiomatic
approach
The axiomatic approach to probability proceeds from a set of axioms (accepted without questioning as obvious), which are based on many centuries of human experience, and the subsequent development is built deductively using formal logical arguments, like any other part of mathematics such as geometry or linear algebra. ln mathematics an axiomatic system is required to be complete, non-redundant and consistent. By complete we mean that the set of axioms postulated should enable us to prove every other theorem in the theory in question using the axioms and refers to the mathematical logic. The notion of non-redundancy impossibility of deriving any axiom of the system from the other axioms. Consistency refers to the non-contradictory nature of the axioms. A probability model is by construction intended to be a description of a chance mechanism giving rise to observed data. The starting point of such a model is provided by the concept of a vandom t?xpt?rrntrnr describing a simplistic and idealised process giving rise to observed data. Djinition
2
wllfcll satishes ,4 random experiment, denoted /?y 4 is an f?xpt?rrrlt?;r conditions: the Ji?lltpwng alI possible distinct f.?lkrct?nlty.s are knfpwn a priori; (f-l) il1 /ny particular trial rt? ourtrol'tlt/ is not known a priori; and (/?) it trfkn be repeated unde. Eftpnlftrtnl conditions. (c) Although at first sight this might seem as very unrealistic, even as a model of a chance mechanism, it will be shown in the following chapters that it can be extended to provide the basis for much more realistic probability and statistical models. The axiomatic approach to probability theory can be viewed as a In an attempt to formalisation of the concept of a random yxpcrzlcnr.
Probability
formalise condition (a) all possible distinct outcomes are known a priori, possible distinct Kolmogorov devised the set S which includes outcomes' and has to be postulated before the experiment is performed. tall
Dejlnition J The samplespace,
denoted by S, is dejlned to be the set
outcomes (?J te elementary events.
The elements
'.
experiment
t#'
t#*a//
S
possible
are
called
Example Consider the random experiment Ji' of tossing a fair coin twice and observing the faces turning up. The sample space of & is
S
l(f1T), (TS), (HH4' (TT)l,
=
with (ST), (TS), (SS), (TT) being the elementary events belonging to S. The second ingredient of ($* to be formulated relates to (b)and in particular to the various forms events can take. A moment's reflection suggests that there is no particular reason why we should be interested in elementary outcomes only. For example, in the coin experiment we might be interested least one S', at most one H' and these are not in such events as z4l particular in elementary events; .,42
'at
-
,41 t(z;T),
TH4, HH)
-4c l(SF),
CTH), (FT)l
=
and
-
are combinations of elementary events. A1lsuch outcomes are called e,ents associated with the sample space S and they are defined by combining' elementary events. Understanding the concept of an event is crucial for the discussion which follows. lntuitively an event is any proposition associated with which may occur or not at each trial. We say that event occurs when any one of the elementary events it comprises occurs. Thus, when a trial is made only one elementary event is observed but a large number of events may have occurred. For example, if the elementary event (ST) and have occurred as well, occurs in a particular trial, Given that S is a set with members the elementary events this takes us immediately into the realm of set theory and events can be formally defined ('t..J5 - union, to be subsets of S formed by set theoretic operations complementation) on the elementary events (seeBinmore intersection, z41
'
.,4l
.,12
'c7'
-
$-'
-
3.2
The axiomatic
approacb
( 1980:. For example,
Two special events are S itself, called the sul-e plllrll and the impossible event Z defined to contain no elements of S, i.e. .Z yf j; the latter is defined for =
completeness. A third ingredient of &' associated with (b) which Kolmogorov had to formalise was the idea of uncertainty related to the outcome of any particular trial of Ji. This he formalised in the notion of probabilities attributed to the various events associated with $ such as #(,4j), #(.,4c), expressing the likelihood' of occurrence of these events. Although attributing probabilities to the elementary events presents no particular mathematical problems, doing the same for events in general is not as and straightforward, The difficulty arises because if are events z1c, etc., k.p S z41, zlc S ch are also events because of and implies the occurrence or the occurrence or non-occurrcnce not of these events. This implies that for the attribution of probabilities to make sense we have to impose some mathematical structure on the set of all which reflects the fact that whichever way we combine these events, say events, the end result is always an event. The temptation at this stage is to define .F to be the set of a1l subsets of S, called the pwt!r ser; surely this covers all possibilities! ln the above example the power set of S takes the form .g-f:
-4:
.,4a,
-
z4l
=
.4a,
z4:
x4c,
zzlc
=
-41
-
-42
.,4l
r%
,.F
lS, t(.HT)), )(Tf1)), )(1S)), )(TT)), )(F11), (ST)), (4'.Ff1),(1/f1)), (TT')), .t(1-1'F),(HHII', )(f1T'), (TT)), (TT)), .t(fT), (TH), (ff'f1)l,.t(f.fT'), (Tf.f), (TT)), ((ff'.f), tlfff/'l,(TT), (Tff)l, (CHH), (FT'), (HT)l). .?,
=
t('f'f:l),
lt can be easily checked that whichever way end up with events in .LEFor example,
tlflffl,('FT))
ch
(fT')l (('f'f1),
=
(T1-1)) k.p t(T1-1),(1-7T)l(CHHS,
we combine any events
in ,F we
(3 6,:/; tl'fffl.tTfll,
(ffT)l e z:/k etc.
lt turns out that in most cases where the power set does not lead to any inconsistencies in attributing probabilities we dene the set of events .F to be the power set of S. But when S is intinite or uncountable (it has as many
40
Probability
elements as there are real numbers) or we are interested in some but not a11 ft possible events, inconsistencies can arise. For example, if S ) zlf S and #(z4) a > 0, #f, such that 1 ch Aj ,Z (#.j), isjzzz1, 2, Then #4.) where .P(.,4)refers to the probability assigned to the event z' 1 P(.,4) )J,z 1 a > 1 (seebelow), which is an absurd probability- being p') greater than one; similar inconsistencies arise when S is uncountable. Apart from these inconsistencies sometimes we are not interested in a1Ithe subsets of S. Hence, we need to define ,LF independently of the power set by structure which ensures that no endowing it with a mathematical inconsistencies arise. This is achieved by requiring that .LF has a special mathematical structure, it is a c-field related to S. .,41
z42,
=
U,i)-
zztf
=
.
.
.
,
,
.
.
.
=
=
-4.
=
=
4
Dnron
lf : is called a J-field Let ..F be a set q subsets t#' S. lnt/l/complementation: (f) e: r/ltpnWG .F - closure zzlfl zzlf 1* 1 2, clllsure then ( I-% 6: ,F g () 1 ,.t+-
.4
.:)
U
.kj
=
.
,
.
,
.
utlioll. ctprfflrlh/t.?
Note that
-
t.'/??J ullder
(i) and (ii)taken together imply the following: S e .J; because
(iii) (iv) (v )
.z'1-=
.,4
k.p
S;
,F (from(iii) V= .(J (E
and ,5.) .p1. 1 2 t h en ((-) I i ) G .:F These suggest that a c-field is a set of subsets of S which is closed under complementation- and countable unions and intersections. That is, any of these operations on the elements of will give rise to an element of lt can be checked that the power set of S is indeed a c-field, and so is the set .f3'
.#-),.
(EE .k9j
-4
j
G
i
=
,
,
.
.
,
.
..9
-#'
=
but the set C What we can
)HH), (T/f), (FT)),
(.t(ST)),
.?A)
=
Z,
.),
t(fT),( TH)( is not because ZIC, S'#C, t(ST), (FS) )#C. t do, however, in the latter case is to start from C and construct j
the minimal cn#e'/J generated by its elements. This can be achieved by extending C to include all the events generated by set theoretic operations (unions, intersections, complementations) on the elements of C. Tlaen the minimal c-field generated by C is Z, ((J-1F), (FS)), t(SS), (FT))) c(C). and we denote it by This way of constructing a c-field can be very useful in cases where the events of interest are fewer than the onesgiven by the power set in the case of each H or F a finite S. For example. if we are interested in events w'ith one of c-field and to be the power set, can do as there is no point in defining the ...kL. well with fewer events to attribute probabilites to. The usefulness of this method of constructing c-fields is much greater in the case where S is either in such cases this method is indispensable. Let us infinite or uncountable; .%
=
.6
=
ts',
3.2
The axiomatic
eonsider an example such a c-field.
approach
where S is uncountable
and discuss the construction
of
Example Let S be the real line R
=
be J
t6BxL
=
x c 2)
tx:
-
:c)
x<
<
where Bx
=
f
'.r
the set of events of interest
) and
c: z .%x
lj
(
=
x)
'.'.f -
,
.
This is an educated ehoice, whieh will prove to be very useful in the sequel. How can we construct a c-field on E2?The definition of a c-field suggests
that if we start from the events Bx, x 6: R then extend this set to include Xx andtake countable unions of Bx and X'xwe should be able to define a c-field on (R, c(J) -- the mfnmll c-,/it!?J lenerated b t'Ile t?rt?nrs Bx, x iE Q. By definition Bx G c(.f). lf we take complements of Sx: X'x z.. e R, z > x (x, :7- ) e c( J4 Taking countable unions of Bx : UJ- 1 ( :f- x (1/))j ( :y- x) s c(./). These imply that c(.f) is indeed a c-field. ln order to see how large a collection c(J') is we can show that events of the form (x, ), gx, also belong to c(J), using set theoretie operations as (x, z) for x < c, and follows'. '
=
-
.
t
.7
=
=
-
.
,
-
,
.:ys),
.cc
tx)
(x, (y.)
=
(
gx, :y:. ) ( =
(x, z) fx )
=
=
(
'ctp -
,
-
'L<) ,
xj c c(./). .Y) g
tr(J),
gc'.1x. ) e:c(J),
.x(l
-
>;.
(')
11 =
l
ct., ,
1.
x, x
1 -
-
/1
e:c(J).
This shows not only that t#J) is a c-field but it includes almost every conceivable subset (event) of R, that is, it coincides with the c-field The generated by :7r?.1. set of subsets of R, which wedenote by i.e. tr(J) c-field will play a very important role in the sequel; we call it the Borel #c/J on R. .?4
.?d,
=
..'#
solved the teehnical problem of possible inconsistencies in attributing probabilities to events by postulating the existence of a c-field '..F associated with the sample space S, Kolmogorov went on to formalise the concept of probability itself. Having
Probability
42
Dqflnition
.5
Probability is dhned as a set function on ,W'satisfying thefollowing
axioms:
6 +j. Axiom 1: #(.g1)): 0 for 1,.and Axiom 2: PS) ZI5-) ,5; ft Axiom 3: IL-1 1 #(,4f) (' ylf ) : is that sequence of muttally exclusive events in called countable additivity). zlf ra Aj= for i #.j) -4
'l7crl'
=
#IU
,4fl
=
,),/-
.g
-
In other words, probability is defined to be a set function with 'F as its domain and the closed real line interval g0,1J as its range, so that
#(
'
):,/-
I0, 1q.
-+
The first two axioms seem rather self-evident and are satisfied by both the classical as well as frequency definitions of probability. Hence, in some sense,the axiomatic definition of probability the deficiencies of of probability the other definitions by making the interpretation dispensable for the mathematical model to be built. The third axiom is less obvious, stating that the probability of the union of unrelated events must be equal to the addition of their separate probabilities. For example, since ((SF)l rn Z, tovercomes'
(IHHII =
#(.r(JfF))
kl
(f'f)))
=
#(l(ST'))) +#(((ffS)))
+ - .t .) =
.
interpretation' result. To Again this coincides with the summarise the argument so far, Kolmogorov formalised the conditions (a) and (b)of the random experiment ($ in the form of the trinity (k%, P )) comprising the set of a1l outcomes S v-the sample space,a c-field c'Fof events related to S and a probability function #( assigning probabilities to events For the coin example, if we choose .F t)(SF)), ((TH), (HH), (F7)), in Z, 5') to be the c-field of interest, P( ) is dened by dfrequency
..%
'
.)
si'
=
'
Psl-
1,
.13(.43):=:0,
#(t(z.fT)))=.t,
Because of its importance the trinity (S,
,%
P(((TH), P
'
)) is given
HH), (TT')))=.1. a name.
Dejlnition 6
S endowed witb -4 sample satisfying axioms 1-3 is function .sp/cre
.F and a probability a c-jeld called a probability space.
As far as condition (c)of & is concerned, yet to be fonnalised, it will prove of paramount importance in the context of the limit theorems in Chapter 9, as well as in Part 111.
3.3 Conditional
probability
43
Having defined the basic axioms of the theory we can now proceed to derive more properties for the probability set function using these axioms and mathematical logic. Although such properties will not be used directly what we called a probability model, they will be used in constructing indirectly. For this reason some of these properties will be listed here for references without any proofs: P1) (#J) P34 (P4)
(#.5)
#(W) 1 J'(..4), =
-
.y1
E:
q.t'
f'4Z) 0. f.J' ! c X2, P(X1) G P(X2), 1, X2 6 .X' P(.,1l) + Ptz4al Pt,4j ro 8.,1.1%.) t#* 1 is a monotone sequence 1.Jjz4,,)J,.: =
z4
./l
v4c)
#(lim,,....
..42).
=
-
events in
.#-
then
a4,,)
limpi...wP(.4,,). A monotone sequence of events in ,F can be either increasing (expanding) z4p, z4l z1l c yla c c c A, uo or or decreasing (contracting), i.e. - l z4,,, z4,, z4p, :u) respectively. For an increasing sequence lim,,-. .4,, j 1 z4,,. P5 is known as the and for a decreasing sequence lim,,-, 1 continuity propert of the set function #( ) and plays an important role in probability theory. In particular it ensures that the distribution function (see Section 4.2) satisfies certain required conditions see also Section 8.4 on martingales. .
=
.
'
'
.
.
.
'
'
=
.
0,*,
.,1,,
=
.
'
U,*,.
-
'
Conditional probability One important extension of the above formalisation of the random experiment t.$' in the form of the probability space (.$, #( )) is in the probabilities. direction of conditional So far we have considered probabilities of events on the assumption that no information is available relating to the outcome of a particular trial. Sometimes, however, additional information is available in the form of the known occurrence of z4. For example, in the case of tossing a fair coin twice we might some event know that in the first trial it was heads. What difference does this information make to the original triple (S, P ))? Firstly, knowing that the first trial was a head, the set of all possible outcomes now becomes ,t)6
'
,%
'
SA
)(.f/T'), (SS)),
=
sincetTW), (TT) are no longer possible. Secondly, the c-field taken to be the power set now becomes .F,
=
(S.a
((f.fT)), ((f.fS))).
,g,
,
Thirdly the probability
#.:(,,4)=1,
set function becomes
.P,4(,3)=0,
#x(l(ST)))=,
#x(t(SS)))=-i'.
Probability
Thus, knowing that the event least one H' has occurred (in the first trial) transformed the original probability space (,, P )) to the conditional probability space (5'u,.FA #4( )).The question that naturally arises is to what extent we can derive the above conditional probabilities without having to transform the original probability space. The following formula provides us with a way to calculate the conditional probability. *at
.4
-
.%
'
'
,
,4)
/',14z11)
>
I
z4)
#(z4I
-
z'4,4,ra #4z1)
=
In order to illustrate this formula let then since #(-4j) #(.,4) J, .P(z41fo =t,
.
/ll
=
zz
#a(X1)
=
P(XI
j,l 4
I -4)
=
t(ST)) and #( t(1-JT')))
.,4
=
t(/fT), (HH4)
-
=.t,
.4)
=
1
=j=-,
2
as above. Note that .P4-4)>0 for the conditional Using the above rule of conditional
PlAl
l
.?121'--P(,41
f--
,4cl
I
'
probabilities to be delined. probability we can deduce that
Pfz4cl
z11) #4d1), = .P(.,42 .
(3.8) for
.,41,
zzlc
e'
(3.9)
,?>
This is called the mullp/fccllfon rule. Moreover, when knowing that occurred does not change the original probability of z4c, i.e.
-42
has
t
uIX.,4l l -42)-
we say that
ztj
Independence
and
#tz'1ll. zlc
are independent.
r
r
is very different from mutual t?xc/l/sllt?ne'ss in the sense that but Ptz4j ..41rn # .!X.41)and vice versa can both arise. Z lndependence is a probabilistic statement which ensures that the occurrence of one event does not influence the occurrence (or nonoccurrence) of the other event. On the other hand, mutual exclusiveness is a statement which refers to the events (sets) themselves not the associated probabilities. Two events are said to be mutually exclusive when they cannot occur together (see exercise 4). The careful reader would have noticed that the axiomatic approach to probability does not provide us with ways to calculate probabilities for individual events unlike the dassical or frequency approaches. What it provides us with are relationships between the probabilities of certain events when the events themselves are related in some way. This is a feature of the axiomatic approach which allows us to construct a probability modcl without knowing the numerical values of the probabilities but still lets us deduce them from empirical evidence. .42
I
-42)
=
Conditional probability
45
Impovtant concepts
random experiment; classical, frequency and subjective definitions of probability; sample space, eiementary events's c-field. minimal c-field generated by eventss Borel field; probability set function, probability space (S, P( )); conditional probability, independent events, mutually exclusive events. e?6
'
Questions Why do we need probability theory in analysing observed data? What is the role of a mathematical model in attempting to explain real P henomena'? Compare and contrast the classieal and frequency definitions of probability. How do they differ from the axiomatic definition'? Explain how the axiomatic approach formalises the concept of a random experiment 4 to that of a probability space (S, )). Why do we need the coneept of a c-field in the axiomatisation of probability? Explain the concept intuitively. Explain the concept of the minimal c-field generated by some events using the half-closed intervals ( uo, xj, x (E R on the real line as an t%p
.
-
example.
Explain intuitively the continuity property of the probability set function #( ). Discuss the concept of conditional probability and show that #( 1 for some e: is a proper probability set funetion. '
'
.,1
z4)
.t#'
Exerdses Consider the random experiment of throwing a dice and you stand to lose money if the number of dots is odd. Derive a c-field which will enable you to consider your interests probabilistically. Explain your choice. Consider the random experiment of tossing two indistinguishable fair coins and observing the faces turning up. Derive the sample space S, the c-field of the power set L.F'and (i) define the probability set function P( ). Derive the c-field generated by the events )SS) and (T'l). lf you stand to lose a pound every time a coin turns up what is the c-field of interest'? '
'heads'
Probability
46
Consider the effect on S, P ))when knowing that event $at least one F' has occurred and dene the new conditional #a( )).Confirm that for the event probability space (,$x,,.#r4, tails, - two .6
.4
'
-41
'
P,4(z41) =
8.4
c5
z11) .
17(X)
Consider the events IHH). and (FT) and show whether they are mutually exclusive or and independent. Consider the random experiment of tossing a coin until it turns up theads'. Define the sample space and discuss the question of detining a c-field associated with it. Consider the random experiment of selecting a card at random from an ordinary deck of 52 cards. Find the probability of (i) .41 - the card is an ace;
and
.,42- the card is a diamond. Knowing that the card is a diamond show how the original (S, #( )) changes and calculate the probability of ,d
'
yla - the card is the ace of diamonds. Find 174..11 ro
z4cl
derived in (ii).
and compare
Define two events which are: mutually exclusive and (a) mutually exclusive but (b) mutually exclusive not (c) (d) not mutually exclusive
it with the probability
of
..43
independent; not independent;
but independent; and and not independent.
Additional references Barnett (1976).
Giri ( 1974); Mood, Graybill (1973);
and Boes ( 1974); Pfeiffer ( 1978/ Rohatgi
CHAPTER
4
Random variables and probability
distributions
In the previous chapter the axiomatic approach provided us with a mathematical model based on the triplet (S, P( )) which we called a probability space, comprising a sample space S, an event space + (c-field) and a probability set function P( ). The mathematical model was not developed much further than stating eertain properties of P ) and introducing the idea of conditional probability. This is because the model based on (S, Pq ))does not provide us with a flexible enough framework. ,#t
.
.
'
-f6
'
The main purpose of this section
is to change this probability
space
by
mapping it into a much more flexible one using the concept of a random varable. The basic idea underlying the construction of S, #( ))was to set up a framework for studying probabilities of events as a prelude to analysing problems involving uncertainty. The probability space was proposed as a formalisation of the concept of a random experiment & One facet of tf' which can help us suggest a more flexible probability space is the fact that when the experiment is performed the outcome is often considered in relation to somc quantisable attribute; i.e. an attribute which can be represented by numbers. Real world outcomes are more often than not expressed in numbers. lt turns out that assigning numbers to qualitative outcomes makes possible a much more flexible formulation of probability theory. This suggests that if we could find a consistent way to assign numbers to outcomes we might be able to change (,S, #( ))to something more easily handled. The concept of a random variable is designed to do just that without changing the underlying probabilistic structure of (S, % P( )). ,#7k
'
,%
'
'
Random variables
and probability
distributions
4 LHHi
j(J?s)) 1(Fr)l
(/?F)
1(8r) (88) (r/?)l 14F8) (/?r), (rrlt (rr)# 148/-/), jjsr), (rsjj ,
(r8) (rr)
Fig. 4. 1. The relationship probability set function.
4.1
,
!
,
l
I
I 1 1 1 1 I
l
I
0 0.2 0.4 0.6 0.8 1.O
between
sample
space,
c-field
and
The concept of a random uriable
Fig. 4. 1 illustrates the mathematical model (,$', #( )) for the coin-tossing example discussed in Chapter 3 with the c-tield of interest being .F= (S, Z, )(TT)), l(HH,(TT))l(TH),(11T')), 't(1.T),(TH),(H.H)l, ((1.f1.f)), )(SF),(T'S),(FF))). The probability set function #(') is defined on .F and 1j, i.e. #(.) assigns probabilities to the events takes values in the interval in ,F. As can be seen. various combinations of the elementary events in S define the c-field .F (ensure that it is a c-fieldl) and the probability set function #(.) assigns probabilities to the elements of .F. The main problem with the mathematical model (S, #( )) is that the general nature of S and .F being defined as arbitrary sets makes the of #( ) N'ery difficult; its domain being a c-field mathematical manipulation of arbitrary sets. For example, in order to define #( ) we will often have to derive all the elements of .F and tabulate it (a daunting task for large or infinite to say nothing about the differentiation or integration of such a r9;
'
r0,
,@
'
'
.
./-s),
set function.
Let us consider the possibility of defining a function Ar( ) which maps S directly into the real Iine R, that is, '
A'(
.
): S
-+
Rx,
assigning a real number xl to eaeh sl in S by xl xl g R. sl e S. For example, in the coin-tossing experiment we could define the function A' the number of heads'. This maps all the elements of S onto the set Rx .t0, 1, 2) see Fig. 4.2. -tsll,
=
-
=
,
The question arises as to whether
every function from S to R will provide
4.1
The concept of a random
Fig. 4.2. The random example.
variable
variable
Ar-number of
49
Sheads'
in the coin-tossing
us with a consistent way of attaching numbers to elementary events', consistent in the sense of preserving the event structure of the probability space (S, t'f #( )). The answer, unsurprisingly, is certainly not. This is because, although is a function defined on S, probabilities are and in qF the issue we have to face is how to dene the assigned to events values taken by X for the different elements of S in a way which preserves the ln order to illustrate this let us return to the earlier event structure of To value of X, equal to 0, 1 and 2 there correspond some each example. of i.e. S, subset '
x
.?/f
0
'r))
-...
t(T' (z?'z')), tt-rffl, ,
1 2 .:(1-1.r.f)), -+
-.+
and we denote it by -Y-
i(0)
=
t(TT)),
A--
14
1)= t(TS), (f1T)),
X-
1(2)::=,
)(Sf1)),
'(
used by abuse of mathematical using the inverse mapping ) (sinverse language). What we require from A' - 1( ) (or .Y) is to provide us with a correspondence between Rx and S which reflects the event structure of that is, it preserves unions, intersections and complements. ln other words, x'
-
.#1
Random variables
and probability distributions
i for each subset N of Rx the inverse image X - (N) must be an event ' in ,F. Looking at X as defined above we can see that X - (0)G,?A, k..p 1 X - 1 (2)c X - 1( X - 1 (/t.0 J1 t..p f( 2 ))g X - 1( 1) G ))(E 1(( 1) t.p g A' that is, -Y( ) does indeed preserve the event structure of Ry defined by F(tff1F) )= ,X On the other hand, the function Y'( ): S YI..fLHH)) 1, y(.t TH j ) F( TT)) 0 does nOt preserve the event structure 140) 1( of .Lt- since F # F - 1) ( ,i/J This prompts us to define a random variable A' to be any such function satisfying this event prpst?rnfng condition in relation to some c-field defined on Rx; for generality we always take the .%
to)
cz
.t2))
.t#',
'
.ft
.%
-+
'
t
=
=
.#t
Borel field
,?d
.%
=
on R.
Dhnition
l
A random variable X is a p-,(?l valued function S to R wllfc B G 4 on E, tbe set satjles the c'(?nlll't?n that jr ptktr/? Borel X - 1(/) in s.' .Y(.$)e:B, s g s') is an .#(?rl1
.$t?r
=
t
.?8
'gt?rlr
Three important features of this definition are worth emphasising. A random variable is always defined relative to some specific c-
(i)
field R is a ln deciding whether some function F( ) : S (ii) variable we proceed from the elements of the Borel field of the c-field rF and not the other way around. variable'. (iii) A random variable is neither nor Let us consider these important features in some more detail in of the concept of a random enhance our understanding undoubtedly the most important concept in the present book. ,%'
-+
'
..,d
trandom'
random
to those
'a
order to variable)
The question is X( ) : S -+ R a random variable?' does not make any sense unless some c-field ..F is also specified. ln the case of the function Xnumber of heads, in the coin-tossing example we see that it is a random variable relative to the c-field as defined in Fig. 4. 1. On the other hand, F, variable relative to ..R This, however, does random above, is nt?r defined a as preclude from variable with respect to some other crandom F being not a 'y (S,Z,)(S1'1),(f1F)), ((Ff.f),(TT)) ) lntuition field ,.Fy; for instance valued real function .Y( S R we should be able to that for suggests any ): variable. that random ln the previous such Ar define a c-field S is on a section we considered the c-field generated by some set of events C. Similarly, we can generate tr-fields by functions A-( ): S -+ R which turn Indeed above is the nlfnfrnf?/ o'zfleld Ar( ) into a random variable. generated ?v F, denoted by c(F). The way to generate such a minimal c-field 14 is to start from the set of events of the inverse mapping F - ), i.e. 1(0) 1 J-field .t(.fF), (HH); by and generate a FJ' - ( 1) and t(TS), (FT)) taking unions, intersections and complements. ln the same way we can see '
,%
=
.
.
-+
rh
'
.%
'
'
=
=
The concept of a random variable
4.1
that the minimal c-field generated by .Y - the number of heads, c(Ar) coincides with the c-field .LFof Fig. 4.2,. verify this assertion. ln general, however, the c-field .F associated with S on which a random variable X is defined does not necessarily coincide with c(Ar). Consider the function X :(
): S
'
(R
-+
Xl( klSfflll
=
A' !('t(Tf1)))= X1('t(1T)l)
Xl('t(TT)))
1,
=
140)
'(
0
=
'
(4.2) )(TF)) iE
1)= t(ffJ1), (Tf1), (ffT)) iE ,,F (see Fig. 4.2), X( tj ) s (E x, is a random variable on with respect to the c1 s x(ST), (TS))) # indeed field ,.'F c(i-). But c(X:) (S, Z, (41-1f1),
since i(
z%-1
.%
=
,#,
tot, .:
-
.%
=
=
c(A-1) cu ..'F
=
c(aY).
The above example is a special case of an important general result where A',, are random variables on the same probability space -Y:, Xz, (S, P( )) and we define the new random variables .
.
,
.
.%
'
Fz
=
X : + A'a + X s
,
(4.5)
c4Y;,)form an increasing sequence of c-fields in ln the above i.e. c(Yj), example we can see that if we define a new random variable ,Yc( ): S R by ,.'k
.
.
.
,
'
-Yztttf1'flll=
1,
A-2()(f1T)))= Xa()(Tf1)))
=
-+
A-2('t(TT)))=0,
+ X1 (seeTable 4. 1) is also a random variable relative to c(X); then X X is defined as the number of J'ls (see Table 4.1). =
x'j
Note that
z1
is defined as
:at
least one H' and Xz as
Etwo
Jls'.
generated by random variables will prove very useful in the discussion of conditional expectation and martingales (see Chapters 7 and 8). The concept of a c-field generated by a random variable enables us to concentrate on particular aspects of an experiment without having to consider everything associated with the experiment at the same time. Hence, when we choose to define a r.v. and the associated c-field we make an implicit choice about the features of the random experiment we are The above concept of c-fields
interested in. il-low do we decide that some function .X( ): S R is a random variable relative to a given c-field ,i.F?9From the above discussion of the concept of a .
-+
and probability distributions
Random uriables
random variable it seems that if we want to decide whether a function X is a random variable with respect to .F we have to consider the Borel field on R or at least the Borel field on Jx'. a daunting task, lt turns out, however. that this is not necessary. From the discussion of the c-field c(J) generated by the set J x e: J@ ) where Bx ( 'Lt, x(l we know that .t4 c(J) and if .Y( ) is such that .4
./dx
.t#.:
=
=
=
-
,
'
1
X '' (( -
v.
xj
,
)
=
ft
.Y( -)
.'
:
G
(-
xj
py.- s
,!;
,
e:S
j!
g
.kT'
for a11 ( -
then
1
A-- (B4
=
f
(s
:
-Y(s)e: B, s (F S 'j c .F
v-
.
x)Js
.??,
for a1l B g .A
ln other words, when we want to establish that A- is a random variable or define #xt ) we have to look no further than the half-closed intervals (x(Iand the c-field c(..f)ft they generate, whatever the range Rx. Let us g( yt xj, s 6E to use the shorthand notation .YtylGxlj. instead of f(s: number the above in the of of Hs. with argument A' - the consider case in Fig. 4.2. respect to '
':y-
,
'(x)
-
,
,
.t7-
1 x< 2 :;
,(3 ,
=
f#
( TH (T T) ) -
(H
t
1
'r)-
0 .A.J' < 1
,
(HH)
)
(4.9)
1 % )',
,
and thus F -- (( - v- ).q) # .F for ). 0, y 1, i.e. F ( ) is not a random however, f(s : F(s) ,6Ll.j1 G .k) fo r variable with respect to With respect to all )' e: R and thus it is a random variable. The tenn random variable is rather unfortunate because as can be seen h?t?/- (1 vtll-l'tlble'', it i s a real from the above definition A- is tleib t?lvalued function and the notion of probability does not enter its definition. =
=
,
'
.../'),,
..
.
wrandotnn
Probability an attempt
-
enters the picture after the random variable has been defined in model induced by X. to complete the malhematical
Tbe concept of a random variable
Table 4. 1
A' relative to .- maps S into a subset of the real line, on 2 plays now the role of .k ln order to complete the Common assign probabilities to the elements B of the assignment of probabilities to the events B (E @ must the probabilities assigned to the corresponding events need to define a set function #xl ): E0,11 such that
variable
A random
and the Borel field model we need to sense suggests that be consistent with in Formally, we .ut
.?4
.?4
.?#
-->
'
for all B G in the case illustrated
For example, p
i
).
() x ( J
(
)
Px l (h0 l)
=
4.
=
1.4
4
*
,
p x ( tf j J) j
-1-
=
Px , ( 111)
2' =
3-4 .
(4.10)
,.#.
in Table 4. 1 p x (ft 2,))
=
-l4
Px ( )0 l k.p j
5
p x ( j()'j
h11)
=
k.p ( t
1 Px ,
j J).j
=
J- ? 4
etc.,
( ft0 ) rn t 1)) 0. =
j
The question which arises is whether, in order to define the set function #x( ), we need to consider al1 the elements of the Borel field 4. The answer is that we do not need to do that because, as argued above, any such element of can be expressed in terms of the semi-closed intervals ( :s, .xq. This we can implies that by choosing such semi-closed intervals define #xt ) with the minimum of effort. For example, Px( ) fOr x, as defined in Table 4. 1, can be defined as follows: '
..#
-
Sintelligently',
'
'
As we can see, the semi-closed intervals were chosen to divide the real line at the points corresponding to the values taken by X. This way of defining the semi-closed intervals is clearly non-unique but it wll prove very convenient in the next section. The discerning reader will have noted that since we introduced the concept of a random variable A'( ) on (S, .k P( )) we have in effect '
'
Random variables
54
and probability
distributions
developed an alternative but equivalent probability space (R, Px )) induced by X. The event and probability structure of (S, #4 )) is #xt ))and the latter has a preserved in the induced probability space (R, much to handle' mathematical structure; we traded S, a set of arbitrary elements, for R, the real line, ,F' a c-field of subsets of S with 2..d, the Borel field on the real line- and #( ) a set function defined on arbitrary sets with #x( ),a set function on semi-closed intervals of the real line. ln order to P( )) to illustrate the transition from the probability space (S, (Rx, Pxt )) let us return to Fig. 4. 1 and consider the probability space of heads, defined above. As can induced by the random variable z-number variable Fig. 4.3, the random A'( ) maps S into k0,1, 2). be seen from 1q, ( 'Js, 21 we can intervals semi-closed the Choosing ( vs, %, ( which of Borel #xl R field on forms the domain generate a ).The concept of a random variable enables us to assign numbers to arbitrary elements of a as set (S) and we choose to assign semi-closed intervals to events in induced by X. By defining #xt ) over these semi-closed intervals we complete the procedure of assigning probabilities which is consistent with the one used in Fig. 4. 1. The important advantage of the latter procedure is Px )) is a that the mathematical structure of the probability space (R. lot more flexible as a framework for developing a probability model. The purpose of what follows in this part of the book is to develop such a tlexible mathematical framework. lt must be stressed, however, that the original probability space (S, #( )) has a role to play in the new mathematical framework both as a reference point and as the basis of the probability model we propose to build. Any new concept to be introduced has to be related to (S, P( )) to ensure that it makes sense in its context. ..%
'
.%
'
.?d,
'
Ceasier
'
'
.%
'
..%
'
'
':yo,
-
'
.t7-
'
.@,
'
-i/')
'
-%
'
/
s
1(8H), ( rrll
1(/./s)) j(rrlk
CHHL
,
(TH3 t8f) Fr)
s 1(rJ?)
(/./r)) (/.fr) ( 1 r/.?),(r7')t 1(8r) (THb (/-/8)1 ,
-
(
,
,
!
0 !
.-
(
I
!
0
1
2
s
O
1
I
k
1
!
i
i
.
1
o o.2 o.4 0.6.-0.8 I
L
I
1
l
I
!
1
2
.))
to (Rx,?#,#x(
'
1
1
.0
0.2 O.4 0.6 0,8 1.O
Rx Fig. 4.3. The change from (.S,,#-,17(
L
)) induced
by X.
I
I
The distribution and density functions
4.2
The distribution and density functions
4.2
ln the previous section the introduction of the concept of a random variable (r.v.), X, enabled us to trade the probability space (,$', #( )) for .gi
'
(R, #xt )) which has a much more convenient mathematical structure. The latter probability space, however, is not as yet simple enough because Px( ) is still a set function albeit on real line intervals. ln order to simplify it we need to transform it into a point function (a function from a point to a point) with which we are so familiar. The first step in transforming #xl ) into a point function comes in the form of the result discussed in the previous section, that #xt ) need only be defined on semi-closed intervals ( - cc, x g R, because the Borel field 4 viewed c-field the minimal generated by such intervals. With this as can be view of the fact that a1l such proceed in mind in to argue that we can starting intervals have a common ( ) we could conceivably define point function a .%
'
'
'
'
.x(1,
'point-
E:c,
-
F( ): ER--+ .
g0,11,
which is, seemingly, only a function of x. In effect, however, this function will do exactly the same job as Px ). Heuristically, this is achieved by defining F( ) as a point function by '
.
for all x c R, and assigning the value zero to F( - :y.). Moreover, given that as increases the interval it implicitly represents becomes bigger we need to ensure that F(x) is a non-decreasing function with one being its maximum value (i.e. .x
F(x1) % F(x2) if xl :t xc and limx- . F(.x) 1). For mathematical also require F( ) to be continuous from the right.
reasons we
=
'
Dlflnition 2 Let ZYbe a ,-.!?. tljlned I0. 11 dehned tv F(x) #x(( =
(.,
(pll
x1)
'wy-
,
#(
.k',i
=
.
)).The ptpnr
Pr(X G x),
is cf;l//t?J tbe distribution function (1)F) ///t?.'fng pl-operties: (f)
F(x) is rltpn-f/fs?c-tv/.sfng''
( )
F( -
i1*
:y.-
) li rn.x =
-+
-
.,
F( x ) 0 =
,
.jr
t#'
./ntrlforl
all x
A- and
F( 6
R ustkrs/'s
.
)..R
-+
(4.14) the
56
Random variables
and proability
distributions
It can be shown (seeChung ( 1974)) that this defines a unique point function for every set function #x( ). The great advantage of F( ) over #( ) and #xt ) is that the former is a point function and can be represented in the form of an algebraic formula) the kind of functions we are so familiar with in elementary mathematics. '
'
'
'
This will provide us with a very convenient way of attributing probabilities to events. Fig. 4.4 represents the graph of the DF of the r.&'.X in the coin-tossing example discussed in the previous section, illustrating its properties in the of Hs. case of a discrete 1-.1,. --number
Drlrft'?n
3
z4random of l/?t? set
variable
is called discrete if its rc/nfyer Izx is stplot? subset 0 k 1 + 2, Z integers (?J ). -
=
,
.
.
.
ln this book we shall restrict ourselves to only two types of variables, namely, discrete and (absolutely)continuous.
random
Dehnition 4 X is called (absolutelyj continuous i.f its F(x) is continuous .Jtpr alI x iE R and there t.?--?'xr. real tbe that Iine .J( ) on
-4 random lwr/?/g (Iistribution (1non-neqative ,lrlcrft'?n
-s?.,/c/?
.//ntrrft?l?
'
X
F(x)
=
(l) d r/,
4.2
The distribution and density functions
It must be stressed that for A- to be continuous is not enough for the distribution function F(x) to be continuous. The above definition postulates that F(x) must also be derivable by integrating some non-negative function /'(x). So far the examples used to illustrate the various concepts referred to discrete random variables. From now on, however, emphasis will be placed almost exclusively on continuous random variables. The reason for this is that continuous random variables (r.v,-s)are stlsceptible to a more flexible and this helps in the mathematical treatment than discrete r.N construction of probability models and facilitates the mathematical and statistical analysis. r.N we introdtlced the function In defining the concept of a contnuous /'(x) which is directly related to F(x). .'s
.
Dehni rtpn 5
Y e: r'.'i 't'
-
F(x)
=
()
ld A
i.
t't'?? t 1*?.7 ut?l?.s
(4.20)
vx G Er,!l - (Iis'l-ee
./'(u),
.v
(probability)density function ( pt/./') q/' X. A,y, .1-4. example, 1) $-sand (see Fig. 4. 5). I n
saitl t() /?t?tll(?
ln the coin-tossing for a discrete with those of a continuous r.v. order to compare F(x) and where consder the Alet us takes values n the interval lk, ?q and a1l case attributed of the values z are same probability', we express this by saying unljrnll is Adistributed in the interval (k, l and we write Athat t7tl, !?).The DF of Ar takes the form ,/-(0)
./'(
=
./'(2)
=
=
,/'(x)
.
'v
X < (1
(see Fig. 4.6). The corresponding
pdf of X is given b)'
elsewhere.
Random uriables
and probability
distributions
Comparing Figs. 4.4 and 4.5 with 4.6 and 4.7 we can see that in the case of random variable the DF is a step function and the density discrete a function attributes probabilities at discrete points. On the other hand, for a continuous r.v. the density function cannot be interpreted as attributing probabilities because, by definition, if X is a continuous r.v. P(X .x) 0 for a1l x 6 R. This can be seen from the detinition of /'(x) at every continuity =
=
4.2
59
The distribution and density functions
Fig. 4.7. Thedensity
function of a uniformly distributed
random
variable.
(4.23) i.e. .J( ): R .
-+
r0,
v-,l.
(4.24)
Although we can use the distribution function F(x) as the fundamental concept of our probability model we prefer to sacrifice some generality and adopt the density function ,J(x)instead, because what we lose in generality we gain in simplicity and added intuition. lt enhances intuition to view density functions as distributing probability mass over the range of .Y. The density function satisfies the following properties:
(4.25) (4.26) (4.27) (iv)
d
.J(x)=.uF(x),
at every point where the DF is continuous.
(4.28)
Random variables
60
and probability
distributions
Properties (ii)and (iii)can be translated for discrete r.v.'s by substituting t) for j- dx' It must be noted that a continuous r.v. is not one with a continuous DF F( ). Conlfnufrv refers to the condition that also requires the existence of a non-negative function /'( ) such that '
'
*
x
'
.
(4.29) ln cases where the distribution function F(x) is continuous but no integrating function .J(x)existss i.e. (d/dx)F(x) 0 for some x e: J2,then F(x) is sqid to be a sinqular f/f.$r?'l'?l./lf(?n. Singular distributions are beyond the scope of this book (see Chung ( 1974)). =
4.3
The notion of a probability model
Let us summarise the discussion so far in order to put it in perspective. The axiomatic approach to probability formalising the concept of a random #( )), where S experiment J' proposed the probability space (., of all possible of ,F the is the set events and #( ) set outcomes, rcpresents assigns probabilities to events in The uncertainty relating to the outcome is formalised in P( ). The concept of a of a particular performance of random variable A-enabled us to map S into the real line Ii and construct an equivalent probability space induced by X. (R, Px( )),which has a much teasier to handle' mathematical structure, being defined on the real line. Although #xt ) is simpler than #( ) it is still a set funiion albeit on the Borel field Using the idea of c-fields generated by particular sets of #xt ) on semi-closed intervals of the fonn ( vs, xl and defined we events managed to define the point function F( ), the three being related by .:/t
'
'
.k
'
'
.@,
'
'
'
,.?d.
'
'
Pfs: -Y(y)6
(-
:y.,
x(1,
us
6
S)
=
#x( -
.,.'y,
xl
=
(4.30)
F(.x).
The distribution function F(x) was simplified even further by introducing jXvia F(x) du. This introduced further the density function w is definable in closed flexibility into the probability model because algebraic form. This enables us to transform the original uncertainty related to J to uncertainty related to unknown parameters 0 of /'('); in order to emphasise this we write the pdf as $. We are now in a position to define ./?'ld'/J' ()f t/pnsl'r model probabilit parametric in form the of J.' jilnctions a our which we denote by ./'4lk)
,/'(x)
=
./'(x)
./'(x;
'
(l)
=
).J(.x',04, 0 G O ) .
* represents a set of density functions indexed by the unknown parameter (.) (usuallya multiple of 0 which are assumed to belong to a parameterspace the real line). In order to illustrate these concepts let us consider an example
The notion of a probability
model
Fig. 4.8. The density function of a Parcto distributed random of the parameter.
variable
for
differentvalues
family of density functions, the Pareto distribution:
of a parametric (Lp=
ftx.,(?)
=
.
-
p
x
x -9. x
t?+ l ,
x > 0, ()(E (.)
,
xll - a known number O r1+- the positive real line. For each value in 0. /'(.x;p)represents a different density (hencethe term parametric family) as can be seen from Fig. 4.8. When such a probability model is postulated it is intended as a description of the chance mechanism generating the observed data. For example, the model in Fig. 4.8 is commonly postulated in modelling personal incomes exceeding a certain level x(). lf we compare the above graph with the histogram of personal income data in Chapter 2 for incomes over E4500 we can see that postulating a Pareto probability density seems to be a reasonable model. In practice there are numerous such parametric families of densities we can choose from, some of which will be considered in the next section, The choice of one such family, when modelling a particular real phenomenon, is usually determined by previous experience in modelling similar phenomena or by a preliminary study of the data. When a particular parametric family of densities * is chosen, as the appropriate probability model for modelling a real phenomenon, we are in effect assuming that the observed data available were generated by the 'chance mechanism' described by one of those densities in *. The original uncertainty relating to the outcome of a particular trial of the experiment =
,
62
Random variables
and probability
distributions
has now been transformed into the uncertainty relating to the choice of one 0 in 6), say 0*, which determines uniquely the one density, that is, tx,'p*), which gave rise to the observed data. The task of determining 0* or testing some hypothesis about 0* using the observed data lies with statistical inference in Part 111. ln the meantime, however, we need to formulate a mathematical framework in the context of which the probability model (l) can be analysed and extended. This involves not only considering a number of different parametric families of densities, appropriate for modelling different real phenomena but also developing a mathematical apparatus which enables us to describe, compare, analyse and extend such models. The reader should keep this in mind when reading the following chapters to enable him her not to lose sight of the woods for the trees. The woods comprise the above formulation of the probability model and its various generalisations and extensions, the trees are the various concepts and techniques which enable us to describe and analyse the probability model in its various formulations. 4.4
uniYariate
Some
distributionst
ln the previous section we discussed how the concept of a random variable (r.v.) X defined on the probability space (S, P( )) enabled us to construct a general probability model in the form of a parametric family of densities (31).This is intended to be an appropriate mathematical model purporting of real phenomena in a stochastic to provide a good approximation (probabilistic) environment. ln practice we need a menu of densities to describe different real phenomena and the purpose of this section is to consider a sample of such densities and briefly consider their applicability to such phenomena. For a complete menu and a thorough discussion see Johnson and Kotz ( 1969), (1970),(1972). .%
'
(1)
Discrete distributions
(i) Bernoulli distribution experiment J's where there are only two possible and for convenience, that is, S tfailure'). vafiable X by lf we define on S the random 1) p 1, A-tfailure) 0 and postulate the probabilities Przr 0) 1 p we can deduce that the density function of X takes the
Consider a random outcomes, we call (Esuccess',
Asuccess) and Pr(X
=
=
'success'
kfailure'
=
=
=
=
=
-
'!- The term probability distribution is used to denote a set of probabilities complete system (a c-field) of events.
on a
4.4 Some
.
/'(x', p)
=
pXl
distributions
univariate
1 p) -
1 -
fo r x
'N '
=
0, 1
otherwise. and the probability
ln practice p is unknown
takes the form
model
(4.34) Such a probability model might be appropriate in modelling the sex of a newborn baby, boy or girl, or whether the next president of the USA will be a Democrat or a Republican.
(ii) Tlle binomial distributiov The binomial distribution is unquestionably the most important discrete distribution. lt represents a direct extension of the Bernoulli distribution in the sense that the random experiment Js is repeated n times and we define in rl trials. If we the random variable F to be the number of al1 21 trials the density of probability is in that the the same assume of Y takes the form tsuccesses-
isuccess'
otherwise. and
denote this by writing J'
we
'v
#(n, p):
&
' 'w
Sdistributed
reads
as'.
Note that
n .'
n! -(n y) ! .p!
/(!
,
k
'
(/
1)
'
(/( 2)
.
.
.
2 1. .
The relationship between the Bernoulli and binomial distributions is of considerable interest. lf the Bernoulli r.v. at the th trial is denoted by Xi, i + X,,; + ,Ya + 1, 2, n, then Y is the summation of the Arfs,i.e. L This is used emphasise subscript because Y' dependence in is its the to on n. which implies that the Arfstake the value 1for a success' and 0 for a ))- j Xi represents the number of successes' in n trials. The interest in this relationship arises because the 'is generate a sequence of increasing c-fields cu c(L); c(Ff) represents the c-field generated of the fonn c(Fj) cz c('L)c by the r.f. Y).This is the property, known as martinqale condition (seeSection 8.4),that underlies a remarkable theorem known as the De Moivre-luaplace =
.
.
.
=
,
tfailure',
'
'
'
'
-1
'
'
64
Random variables and probability distributions
cvntral Iimit rtrtpl-?rrl.De Moivre and Laplace, back in the eighteenth n) for a century, realised that in order to calculate the probabilities large n the formula given above was rather impractical. ln their attempt to find an easier way to calculate such probabilities they derived a very important approximation to the formula by showing that, for large n, ./'()';
where j. ,g
-u-
=
g
.jy
&
x'7.t,7/.,( -
-- . - .
,
1 J?)q
5
rx
reads approximately
equal.
on the RHS of the equality was much easier. This the density function of the most celebrated of a1l has a bell-shaped symmetric curve, the nortnal. Fig. 4.9 of a binomial density for a variety of values for 11 and p. As we can see, as n increases the density function becomes more and more bell-shape like- especially when the value of # is around 1y. This result gave rise to one of the most important and elegant chapters in probability theory, the so-called limit theorems to be considered in Chapter 9.
Using the formula formula represents distributions which represents the graph
(2)
Continuous distributions (i)
-'b
e ntp/vzlt//
tlist'lnibut
k't?/'?
The normal distribtltion is by far the most important distribution in both probability theory and statistical inference. As seen above, De Moivre and Laplace regarded the distribution only as a convenient approximation to the binomial distribution. By the beginning of the nineteenth century, however, the work of Laplace, Legendre and Gauss on the theory of placed the normal distribution at the centre of probability theory. lt was found to be the most appropriate distribution for modelling a large number situations in astronomy, physics and eugkmics. Moreoverof experimental Markov, Lyapounov and the work of the Russian School (ChebysheN, 'errors'
limit theorems, relating to the behaviour of certain standardised sums of random variables, ensured a central role for the Kolmogorov)
on
normal distribution.
A random variable
zYis normally
distributed if its probability
function is given by j'q
x y .
''
.
5
!
(y
2
j
=
1 ex j) cx (.27r) -
-s .-. ...
1 -
.
2c
j
2 (x - g j .
y
density
4.4
Some univariate distributions
0.8
0.8
0.7
0. 7
65
0,6
0.6 0.6
n R
< 04
5 0'0S
=
=
0.5
n V
< 04
0.3
0.3
0.2
0,2
0.1
0.1 0 12 3 4 5 6
0.5
0 1 2 34 56
X
X
0.6
0.6
0.5
0.5 0.4
n #'
< 03
S
s
=
=
1()
=
0.05
=
0.4 < 03
0.2
0.2
0. 1
0. 1
lo 0. 5
0 1 2 3 4 5 6 7 8 9 10
X
X
0.6
0.5
0. 5
0.4
=
=
0 12 3 4 56 7 0.6
n /7
< 03
n P
ac
=
0.05
=
0.4 < 03
0.2
0.2
0. 1
0. 1 0 12 34 5 6 7 X
=
z
a
P
=
.
.
pc 0'5
.
.
l
I
j
jI
I
-
-
.
0 1 2 3 4 5 6 7 8 9 10 12 14 1 1 13 15
Fig. 4.9. The density function of a binomially distributed variable for different values of the parameters n and p.
random
this by Ar N(p, c2). The parameters p and c2 will be studied in more detail when we consider mathematical expectation. At this stage we will treat them as the parameters determining the location and flatness of the density. For a fixed c2 the normal density for three different values of p is given in Fig. 4. 10. 'v
and probability distributions
Random variables
0
-4.0
Jz
g
=
=
Jl
4.0
=
0.40 0.30
:i o,2o <
0.10 0.00
l
-8
-6
-7
-4
-5
-3
-2
0
-1
1
2
3
4
6
5
7
8
X
Fig. 4.10. The density function of a normally distributed random variable with c2 1 and different values for the mean p. =
1.00 0.90 0.80 0.70 0,60
Sil(Lso <
0.40 0.30 0.20 0, 10 0.00
J=
2.5
5
6
1
I
-8
-7
-6
-4
-5
-3
-2
-1
0
1
2
3
4
7
8
X
Fig. 4.1 1. The density function of a normally distributed random variable and different values for the variance. with mean p =0
Fig. 4.1 1 represents the graph of the normal density for p 0 and three alternative values of c; as can be seen, the greater the value of c the flatter the graph of the density. As far as the shape of the normal distribution and density functions are concerned we note the following characteristics: =
The normal density is symmetric about p, i.e.
fp =
+
k)
Pry
=
1 expt /(2, 2c cx/(2a) x x/t+ k) Pry -k
,v
.ytjj
(4.39)
.k),
-
=
(4.40)
G..
and for the DF, F(
-
x)
=
1 - F(x + 2p).
(4.4 1)
4.4 Some
univariate distributions
67
The density function attains its maximum
d./-txl ftx) dx =
M
'
2(x
-
=0
x
=
-
2cz
'
at x
=p,
-p)
= '
u
and
-
ftuj -'
1
=
.
'
c
(2zr) (4.42)
(iii)
The density function has two points of inflection at x
=p
+ c,
Thus, c not only measures the flatness of the graph of the pdf but it determines the distance of the points of inflection from p. Fig. 4.12 represents the graphs of the normal DF and pdf in an attempt to give the reader some idea about the concentration of these functions in terms of the standard deviation parameter c around the mean p.
10 .
W)
F
1
l
I
I
0.84 r----0.50
I
I
f
I
I I
4
#
---u 1 c.:6
-c
y
f (x) Shaded area
=
I I I
p
f .-
+ (z
p
+
2c
x
1 (J,s)
0.9545
o 65 tV(27r1
0.1353 (V(2F1
-2g
p Fig. 4. 12. lmportant
functions.
y-o
p
Jz + c
g + 2c
features of the normal
distribution and density
Random variables and probability distributions
68
The density function of the random variable Z
1 /.(y -t;jaalexp g . x.'
..--j.:r2
=
-
=
c is
(-Y-p)
)
---
,
which does not depend on the unknown parameters p, c. This is called the standard normal distribution, which we write in the form Z N(0, 1). This shows that any normal nv. can be transformed to a standard normal nv. when p and c are known. The implication of this is that using the tables for and F(x) using the transformation Z J(z)and F(z) we can deduce (.Y -p) c. For example, if we want to calculate P6X G 1.5) for A' N( 1, 4) /:-4z) F(0.25) 0.5987, that is, F(x) we form J (x 1) 2 0.25 x
./'(x)
=
x
=
F(1.5)
=
uc>
=
-
=
=
=
0.5987.
(ii) Expon
(.w tial
.?'l'll'/-J'
t?/' (listributions
Some of the most important distributions in probability theory, including the Bernoulli, binomial and normal distributions, belong to the exponential family of distributions. The exponential family is of considerable interest in statistical inference because several results in estimation and testing (see Chapters 11- l4) depend crucially on the assumption that the underlying probability model is defined in terms of density functions belonging to this family; scc Barndorff-Nielscn ( 1978). 4.5
characteristics
Numerical
of random
Yariables
ln modelling real phenomena using probability models of the form *= p), 0 g (.)). we need to be able to postulate such models having only a f general quantitative description of the random variable in question at our Such information comes in the form of certain numerical disposal a characteristics of random variables such as the mean, the variance. the skewness and kurtosis coefficients and higher moments. lndeed, sometimes such numerical characteristics actually determine the type of probability density in *. Moreover, the analysis of density functions is usually undertaken in terms of these numerical characteristics. ./'(x;
'priori.
(1)
Mathematial
expectation
(r.v.) on (S, P )) with F(x) and .J(x)its distribution function (DF) and (probability) density function (pdf) vafiable
Let Ar be a random
respectively. The F(.Y)
mean .ylxl
=
n%
'
of A- denoted by f)A-) is defined by dx - for a continuous
r.v.
Characteristics
4.5
69
of random vayiables
and F(A-)
=
j
xf./'t-Yf
i
)
-
fOr
a discrete r.v.,
when the integral and sum exist. is over all possible values of X. The integral in the definition of mathematical expectation for a continuous random variable can be interpreted as an improper Riemann integral. lf a unifying approach to both discrete and continuous r.v.'s is required the integral (see Clarke ( 1975)) concept of an improper Riemann-stieltjes
Note that the summation
(4.47) used. We sacrifice a certain generality by not going directly to the Lebesque integral which is tailor-made for probability theory. This is done, however, to moderate the mathematical difficulty of the book. The mean can be interpreted as the centre q/' gravitv of the unit mass as distributed by the density function. lf we denote the mass located at a from the origin by m(xf) then the centre of gravity is distance x, i 1, 2,
can be
=
located
.
.
.
at -vfnyt-vj)
j
Zf-
.-
(4.48)
.
?Fl(.Y)
1. If we identify nt-vsl with ptxf) then f)A-)= jf xptxf), given that f Jx'f)= provides a measure of location (orcentral In this sense the mean of the r.v. tendency) for the density function of X. '
If A- is a Bernoulli distributed
r.v. (X
'v
b 1, p)) then
0
A'
/'(x) (1 - p) lf X
'v
distributed
/?), i.e. Ar is a uniformly
U((I,
'(A-)
=
,x?
h
.Vtxldx
=
j x
-
b-
(1
dx
1 =
-.
then
r.v.,
b
j
x2
---
2 b
-
a
=
a
a +. /)
2
.
70
and probability distributions
Random uriables
lf X
Np, c2), i.e. X is a normally distributed r.v., then
'v
F(Ar)
t
=
-
(2,c) +. p ) e .-ya (ja (2zr)
.x;
p
'
--
-
1
cc
2
1 x u 2 c -
exo
c
(Jz
x
=
1
x
e
for
e-izz
(jz
J
X
-
=
(r
:c
a
.
dx,
(2z:)
z
-
+
.
-.jc2 (j Z
(27:)
= 0 + p 1 p, since the first term is an odd function, i.e. h - x)= - /1(x).Thus, the P arameter p for X Np, c2) represents its mean. .
=
'w
ln the above examples the mean of the nv. A- existed. The condition which guarantees the existence of '(Ar) is that X
dx < Ixl.(x)
cc.
-
< w. )(gIxfl/txf)
or
vo
(4.49)
i
One example where the mean does not exist is the case of a Cauchy distributed r.v. with a pdf given by flxs
=
1 zr(1 + x 2 )
,
R.
.X 6
In order to show this let us consider the above condition:
x, -X
dx Ixl/txl
1
1
r
lxl1 + x
=-
z: - x
a
=- olimzc 2 zr -+
Ctl
1 --
zr
x,
x
2
j
o
oxc
dx
by synlnxetry
1
=
dx c
o
1 x dx=- lim logell : a . 1+ x a
+J2)
-+
.
That is, '(Ar) does not exist for the Cauchy distribution.
Some properties of the expectation (E1) (.E2)
c, (' c is a constant. ftzArj + bxz) tkEtxjl + lxEl-fzl for Jn-p '(c)
=
=
lwt? r.,.'s
ArI and
p
of random
4.5 Cllaracteristics
variables
Xz whose means exist and a, b are real constants. For example, (J Xi b 1, p), i 1, 2, n, i.e. Xi represents the Bernoulli r.r. of the ith trial, then for F;, S(Fl, P) f (Y;,) 1 E (Xf) np. 1 Xfl 1 p That is, the mean of a binomiallv distributed r.r. equals the number of trials multiplied b)' the probability' oj' and Properties E1 E2 desne F(.) as a linear transformation. Prx > ;.E(X)) % 1/2 jr a positive r.t,. X and ;.> 0,' this is ('3) the so-called Markov inmuality. Although the mean of a r.v. X, .E(Ar) is by far the most widely used measure of location two other measures are sometimes useful. The first is themode defined to be the value of X for which the density function achieves its maximum. The second is the median, xm of X defined to be the value of X =
'w
.
.
.
,
=
(Z)1-
::::>'
'v'
=
Z)'=
=
Z7=
=
bsuccess'.
such that
It is obvious that if the density function of A- is sq?mmetric then '(.Y) x..
(4.51)
=
If it is both symmetric and unimodal
mean
=
median
(i.e.it has
only one mode) then
=mode,
assuming that the mean exists. On the other hand, if the pdf is not unimodal this result is not necessarily valid, as Fig. 4. 13 exemplifies. ln contrast to the mean the median always exists, in particular for the Cauchy distribution x,n =0.
I I
l l
I
I
I
1
I
I I I
I I I 1
1
I j 1 1 I I I
'G
mode
Fig. 4.13. A symmetric
mode.
mean
median
mode
r
x
density whose mean and median differ from the
(2)
and probability distributions
variables
Random
The varlance
When a measure of location for a nv. is available, it is often required to get widely the values of .,Y are spread around the location an idea as to how is, measure, that a measure of dispersion tor spread). Related to the mean as variance and a measure of location is the dispersion measure called the defined by '
Vart#l
SIIA- f'tA-ljz
=
-
(x - f(-))2/(x)
=
F(.Y))2/'(xf)
= il (xf-
(4.52)
dx - continuous
(4.53)
- discrete.
t#' inertia of the mass can be interpreted as the moment through the mean. axis distribution with respect to the perpendicular
The variance
is referred to as
Note: the square root of the variance
deviation.
standard
Exkrnp/t?s (i)
Let A' ??(1,p); it x,
was shown above that F(Xl -p2)(
VarlA-l
=
(0
Sf
Var(A-)
X-
=
p, thus
=
-p)2p=p(
1
a+
2
1
+ (1
-p)
b
-p).
-f.'/)2
2
1
(? dx =
b
-a
-
-
12
(verify).
An equality which turns out to be convenient for deriving Var(-Y) is given by E'(A-2) (E'(.Y)12, Vart-l =
-
where v2./'(x)dx.
for z
=
x
p
-
-..-
c
Characteristics of random
4.5
Yariables
.fr
( PrI)
fkny c'tpnsltknr ('. Vartt?l 0 2 (1 constant. Varll X) a VartXl, #-((xY E(xY)l> k) :$ gVar(X)q,/k2 Chebyshev's inmuality the fr /( > 0. This ntx?-fllflr lives (1 relation and JC#nt?l probabilitv q.f by dispersion as variance te such tA- E(A-)1p: k, pl-onpng in //t'?cr bound an upper for =
lS2)
.jr
( ( Iz'3)
=
-
-lt?lwtlt?n
probabilities. (3)
Higher moments
Continuing
the analogy with the various concepts from mechanics we define the moments of inertia from x 0 to be the so-called rlw' moments: =
the l-th raw moment, if it exists, with p'a 1 and yj usually denoted by Jt. Similarly, the rth moment around central nlt?rnt?nl, is defined (ifit exists) by pvEEE/I..Y -
plr
(-x-- p)r./-(x)dx,
=
r
=
2, 3,
F(A-); the mean is x p, called the rth
H
=
=
.
.
.
.
c2. These higher ;tz a E'(- - p)2 is the variance, usually denoted by moments are sometimes useful in providing us with further information relating to the distribution and density functions of r.v.'s. In particular, the in the form: 3rd and 4th central moments, when standardised X3
=
14
=
:3
--
G
3
and lt *
-:
(T
are referred to as measures of skpwncss and kljrtosis and provide us with measures of asymmetry and flatness of peak, respectively. Deriving the raw moments first is usually easier and then the central moments can be derived via (see Kendall and Stuart ( 1969:: r
pr- j=1 (
l .
-
1/
i
p'ipr-j. ,
(4.58)
An important tool in the derivation of the raw moments is the characteristic
74
and probability distributions
Random variables
function defined by J
Eteilx)
kx-
eirxdz-txl,
=
Using the power series form of
/x(l)
=
v''- 1.
(4.59)
eA
we can express it in the form
., (jrlr
p'r. F. --r!
1+
=
i
X
-
r
(4.60)
1
=
This implies that we can derive y'r via t
Fr
=
dr/
x (jjr
(r) =
A function related to Chapter 10) is
loge/xtr)
()
/xtrl of
1+
=
(4.6j)
1
t
interest in asymptotic theory
particular
x) (ir)r&r r 1 r!
)2
(see
(4.62)
=
where
sr, r= 1, 2,
.
.
.
are called the cumulants.
Example Let A- Np, c2), the characteristic
function takes the form
'w
-r2c2),
/
1d/xlll 'i- dl
(M=expirp
X
1
= o 1-
,-
1 d24 (r) 7 dr ,
c
-.ir
exptirp
X
()
-
a = p +
Gz =
pa
=
c
J,
)(ip
(r a
-
rc c)
p,
=
()
,-
.
Similarly we can show that ps pzb 3/4, ps /z6 15c6, etc. c2, x,=0, Kendall and Stuart rb 3, aa 0, a,yc::z3', see ( 1969). &72 =0,
=0,
=
=
l1
=p,
=
=
the various numerical characteristics of r.v.'s as related under certain to their distribution, it is natural to ask whether 1, 2, Jz'r, knowing the moments we can determine the circumstances r DF F(x). The answer is that F(x) is uniquely determined by its moments /t;, if and only if r= 1, 2, Having considered
=
.
.
.
.
.
.
X
l (/t'c,)-r -
r
=
(4.63)
vs.
1
This is known as Carleman's
condition.
4.5
Important Random
of random variables
Characteristics concepts
the probability
variable,
by a r.v., a c-field
space induced
generated by a r.v., an increasing sequence of c-fields, the minimal Borel field generated by half-closed intervals ( r xl x e R, distribution function,density function,discrete and continuous r.v.'s, probability model, parametlic family of densities, unknown parameters, normal distribution, expectation and variance of a r.v., skewness and kurtosis, higher raw and central moments, characteristic function, cumulants. -
,
Questions Since we can build the whole of probability theory on (S, % P ))why do we need to introduce the concept of a random variable? Define the concept of a r.v. and explain the role of the Borel field generated by the half-closed intervals ( vz, x(l in deciding whether a function .Xt ): S R is a r.v. 'Although any function Art ); S Ii can be defined to be a nv. relative c-field valuable information if we do not stand lose to we to some define the nv. with care'. Discuss. Explain the relationship between #( ), #xt ) and Fx( ). Discuss the relationship between Fx( ) and .J( ) for both discrete and continllous r.v.'s. What properties do density functions satisfy? Explain the idea of a probability model * ).J(.x; 0j, 0 (E 6)) and discuss its relationship with the idea of a random experiment $ as well as the real phenomenon to be modelled. Give an example of a real phenomenon for which each of the following distributions might be appropriate: Bernoulli; (i) binomial', (ii) normal. (iii) '
-
-+
.
'
--+
,.'h
'
'
'
'
'
=
Explain your choice. Explain why we need the concepts of mean, variance and higher moments in the context of modelling real phenomena using the probability model * 0), 0 6 O). B'hat features of the density function do the following numerical characteristics purport to measure? =
ttx;
mean, median, mode, variance, skewness, kurtosis. ixplain the difference between these and the concepts with the same names in the context of the descriptive study of data. iixplain Markov's and Chebyshev's inequalities.
76
Random variables and probability
distributions
12. Compare the properties of the mean with those of the variance. do the moments characterise the 13. Under what circumstances distribution function of a r.v.? Exercises b, c, J) and 84/) #(h) Consider a random experiment with S (tp, #(C) P(Is l'. Derive the c-eld of the power set (i) l/J, ?) say ..W). Derive the minimal c-field generated by (ii) S: Consider the following function defined as
=t,
=
=
=-),
=
,?/.'
,
-tfzl
-Y(c) -Y(J)
0,
.(?)
=
=
J'(/))
=
=
F((.')
=
1,
=
7 405 926,t
J'(J)
2.
=
Show that .Y and F are both nv.'s relative to but F is not. is a r.v. relative to Show that Find the minimal c-field generated by F, .?>:
(iii) (iv) (v) (vi) (vii)
,t7j
-
,#(.
1) k.p g1, 2:. Find #y(.t0) ), #y((0. 1q), #y,((- cfs 11), Py(I)0, Derive the distribution and density functions F(y), ,
.(y)
and
plot them. Calculate E(F), Var(1'), az(F) and a4(J'l. (viii) The distlibution function o the exponential distl-ibution is F(x)
=
1
-
exp
-
x
'
,
and plot its graph. Derive the density function Derive the characteristic function /(1) 'Derive /J(Ar), VartxYl, aa(,Y) and a4(Ar). .(x)
(i) (ii) (iii)
'(ei'x).
=
Note:
'# If the reader is wondering about the significance of thfs number it fs the number of demons inhabiting the earth as calculated by German physician Weirus in the sixteenth centtlry (see Jaslrow (1962:.
Characteristics
4.5
of random
variables
lndicate which of the following functions functions and explain your answer: (i) /'(x) kx2, 0 < x < 2,'
represent
proper
density
=
2(1 -x)2, x > 1,' 341 x y 1,e .J(x) < x < 2,' 1), + 0 .(x3 (iv) /'(x) 3, J(.Y) iE R. x (v) -lx Prove that Vart-Yl E(X2) - gE(xY)q2. lf for the nv. X, E(-Y) 2, F(xY2) 4, find F(3X + 4), Var(4X). Let Ar N(0. 1). Use the tables to calculate the probabilities F(2.5); (i) F(0. 15),(ii) 1 - F(2.0). (iii) and compare them with the boundsfrom Chebyshev's inequality. What is the percentage of error in the three cases? ,f(x)
(ii)
=
(iii)
=
''E).
-
=
=
=
=
=
x
Additional references Bickel and Doksum ( 1977)., Chung ( 1974)., Cramer ( 1946)., Dudewicz ( 1976),. Giri t 1974)) Mood, Graybill and Boes (1974); Pfeiffer ( 1978); Rohatgi ( 1976),
C H AP T E R 5
Random vectors and their distributions
The probability model formulated in the previous chapter was in the form of a parametric family of densities associated with a random variable (r.v.) 0), 0 s O). ln practice, however, there are many observable X: * phenomena where the outcome comes in the form of several quantitative attributes. For example, data on personal income might be related to number of children, social class, type of occupation, age class, etc. ln order to be able to model such real phenomena we need to extend the above framework for a single r.v. to one for multidimensional r.v.'s or random vectors, that is, =
t/tx',
X
=
(A'1 Xz,
.
X',)'.
,
.
.
where each Xf, i 1, 2, n measures a particular quantifiable attribute of experiment's random (J) outcomes. the For expositional purposes we shall restrict attention to the twodimensional (bivariate)case, which is quite adequate for a proper understanding of the concepts involved, giving only scanty references to the n-dimensional random vector case (just for notational purposes). ln the next section we consider the concept of a random vector and its joint distribution and density functions in direct analogy to the random variable case. ln Sections 5.2 and 5.3 we consider two very important forms of the joint density function, the marginal and conditional densities respectively. These forms of the joint density function will play a very important role in Part 1V. =
.
.
.
,
Joint distribution and density functions
Consider the random experiment 78
of tossing a fair coin twice. The sample
Joint distribution and density functions
5.1
space takes the form S t(ST), CTH), CHH), (TT)). Define the function Both and -Y2( ) to be the number of .lt ) to be the number of of these functions map S into the real line (!4in the form =
Stails'.
'heads'
'
'
(.X'1( ),X2( ))1)(FS)l '
'
(-''i( ),.X'2( ))1)(.fT)l '
'
=
(1,1), i.e. (.X'1(Tf1), X2(TS))= (1,1), (1,1),
),.Yc( ))((f.f.f)l (2,0),
(A-1(
=
'
'
(.Y1(
=
'
),
-'2( -
))l(TT)l
=
(.0,2).
R2 is a twoThis is shown in Fig. 5. 1. The function (xY:( ), A-2(. )): S dimensional vector function which assigns to each element s of S, the pair of ordered numbers (x1,x2) where xj xYj(s), xc Xc(s). As in the onedimensional case, for the vector function to define a random vector it has to satisfy certain conditions which ensure that the probabilistic and event structure of (S, P( )) is preserved. ln direct analogy with the single s ariable case we say that the mapping -+
'
=
=
..%
.
X(
'
)H (X1( ), X2( )): S '
-+
222
vector if for each event in the Borel #a), the event defined by
definesa random .22.say B H (:j, X- 1(B)
belongsto
'
=
(s: xYI(s)
e #1, -Y2(.s)e #c, s e
tield product
.@
x
,)
.@
M
(5.2)
..%
S Xa
(88) (HP
a
( F8) (rr)
1
@
0
1
Fig. 5.1. A bivariate random vector.
2
xl
80
Random vectors and their distributions
Extending
can be profitably seen as being the c-field generated by half-closed intervals of the fonn ( :yz to the case of the direct product x we can show that the random vector X( ) satisfying the result that
,#
,xj
-
.@
,?d
.
1
X-
((
w xq)
-
,
g
.F for a1l x
g
/2
implies X - 1(B) (E .F for all B G 41. This allows us to define a random vector Dhnition
1
A random
Vtor
Ar(
'
)..is
a vector
as follows:
.jnction
-
cc <
X a(.s) G x 2 s s ,
s) e
..@
Note. ( :r xj (( x. xlqs ( LJ -xc(I) represents an infinite rectangle (see Fig. 5.2). The random vector (as in the case of a single random variable) induces a probability space (R2, J7xl where are Borel subsets on the plane and #x( ) a probability set function defined over events in in a way which preserves the probability structure of the original probability ,
=
-
,
,
,82,
')),
r@1
.#2,
'
Joint distribution an4 density functions
5.1
This is achieved
by attributing
i@l
to each B i!
the
(5.6) This enables us to reduce
joint
#x(
tlistribution
tcunultlrfrt?l Djlnition
'
./ncrftpn.
a point function F(x1, xa), we call the
) to
2
fat?lX (Ar1 Ar2) be /rlcrt?n dehned py EEE
,
F(
.
.
,
):
/2
--.
a random
ptvltpr
dehned on (S, .@I P
'
)). The
g0,1q,
stch that EE
#?-(X .Gx)
is said to be the joint distribution function of X. ln the coin-tossing example above, the random vector
X(
) takes the alue (1, 1), (2,0), (0,2) with probabilities .l, ln order to derive thejoint distribution function (DF) we have to define al1 the events of the form )s: Xltsl Gxl, -Yc$) Gxc, s c 5') for all (xj, x2) (E (22 .
1.4and .1.4. respectively.
N
x:
<0,
0 Gx l
xc <0 <
2, 0 :; x2
.xl 1 2, Ntlrt, a
The
degree of arbitrariness
in choosing the intinite
joint DF of Xj and X1 is given by 0, xl 12 ,
F(x1 xa) ,
<
0,
xa <
0
0t x1 < 2 0 Gx2 < 2 ,
=
-1.4.x 1 > 2 0 ,
1,
t;
,
xl
y 2,
xcy
xz< 2
2.
<
2, x 2 y 2
x2 >
rectangles
2
<2
:1;h 2, oGxc 0 Gx j
<
2.
(
':t;, -
xq.
and their distributions
Random vectors
82
of (X1, X2)
jnction
Table 5. 1. Joint Jcnsly
From the definition of the joint DF we can deduce that F(x1 x2) is a monotone non-decreasing function in each variable separately, and s
lim F(x1 X
-/
-
-#
1 X 2 -#
x2)
1im F(x1, x2)
=
-*
W
lim F(x1 X
,
,
x2)
=
-
(5.8)
=0.,
X
1.
m. 'X
As in the case of one r.v., we concentrate exclusively on discretc and continuous joint DF only; singular distributions are not considered. Dehnition 3 and Xz is called a discrete distribution DF 4?./--Y'1 function.J(.) such that
Tejoint
4' rlrtr
exists a density
(5.10)
J(%1,.Y2) > 0, (x1 x2) (E (22 ,
and it takes tbe value zero injlnite ptpjfTl in lc plane
.Jtxl.x2)
=
Pt-lxk
everywhere
except at a
jinite t)r
countablv
wjr =
x1, Xz
(5.11)
x2).
=
In the coin-tossing example the density function in array form is represented in Table 5. 1. Fig. 5.3 represents the graph of the joint xz) via density function of X > (.Y1-A'2). Thejoint DF is obtained from the relation a rectangular ,/'(xj
,
.''
F(x1, x2)
./'txlf,
-
.'. j i < .' :
A i< ,z
xci).
2
Dehnition 4 Thejoint DF t?/' A-1 and A-cis called absolutely') continuous if r/-lthl't? exists a non-negative function .J(x1x2) such that ,
5.2
Some
bivariate distributions f (x xc) z
A'
O
A
A
-
-ej
2
A
Xc
1 2 Xl
function of Table 5. 1. Fig. 5.3. The bivariate density
-va).
if j'
.
at
) is continuous
(xj,
Some bi>'ariate distributions (1)
Bivariate normal distvibution --i.
2) (.--.&-1
f'lx:xa; 0) aagjo.a-
,
0- (pj pc, ,
cf, czc,p) c R2 x pi2+ x
(p,1q.
Random vectors and their distributions
f (x
xa)
,-
l l l l l
x
Xz xe
t
Nw
-
l
Nx N
.-..
N
-
1
'-
w.
1
M
..-
>
w.
(0 0)
A
'--
x
z
w
R
N >.
-
X1
normal density.
Fig. 5.4. The density function of a standard
lt is interesting to note that the expression inside the square brackets expressed in the form of X1
#1
-
c
2
--
1
X1
- 2p
#1 c1 -
p2
X2 -
+
(r 2
X2
- p2 o'z
2 m
ca
defines a sequence of ellipses of points with equal probability which can be x2) represented in Fig. 5.4. viewed as map-like contours of the graph of .f(xl,
0 (k =
(3)
,
a l a a) ,
.
Bikariate binomial dltrl'bution
/'(x 1 x c) . ,
!
/1 ... - ! J?1'P)2, a -.j 1 .v 2
=
-Y1
+ xc
=
1
,
p 1 + I.'z
=
1 .
(5.20)
The extension of the concept of a random variable A- to that of a random (.YI .Y,,) enables us to generalise the probability model X1, vector X =
,
.
.
.
,
85
5.3 Marginal distributions *
.f
=
/ (x ; 0) 0 6 O ) '
t.
,
family of
to that of a parametric *
,/'(
l
=
x
!
xa
.
,
.
.
.
,
joint density functions
x,,; 0) 0 6 (9 ,
(5.22)
)
.
generalisation since in most applied disciplines, the real phenomena to be modelled are usually including econometrics, multidimensional in the sense that there is more than one quantifiable feature to be considered.
This is a very important
5.3
Marginal distributions
Let X EB (xY1 Xal be a bivariate random vector defined on (,. ...Fi#( ))with a joint distribution function F(x1, xa). The question which naturally arises is whether we could separate A-l and Xz and consider them as individual random variables. The answer to this question leads us to the concept of a marginal J.srrf/pfklfon. The marginal distribution funetions of A-1and Xz are defined by '
,
F1 (x1)
=
and F c(x 2 )
lim F(x1 xc) ,
X
=
z
-+
J-
l im F(x : x 2) ,
. 1
''*
.
ik
Having separated .,Y1and Xz we need to see whether they can be considered defining a random as single r.v.'s defined on the same probability space. In that condition the vector we imposed
$ts : A-1 (s) < x l
-
,
2(#
:%
x2
)G
(5.25)
.:''
The definition of the marginal distribution function we used the event s : A-1 (s)
:$
2.( x 1 A- s) < vs ,
lj
(5 6)
.2
,
t which we know belongs to .K This event, however, can be written as the intersection of two sets of the form
u hich implies that .ts:
.,1
(-$)% xl X 2(s) ,
<
'zt-'
)
=
ts: A- (s) I
:;
xl
)
,
(5.28)
and it is the condition needed for A'1 to be a r.v. u hich indeed belongs to probability function Fjlxjl,' the same is true for X1. In order to see with a ..k-
WK
86
and their distributions
Random vectors
this, consider the joint distribution function
F1(x1)
lim Ftxl
=
X2
since 1im,,-+
.(e-'')
-+
,
x2)
1
=
e-ttVl, -
A
0. Similarly,
=
F 2 (x2) 1 - e-tz =
?
x2
6F R +
.
Note that F1(xj) and F2(x2) are proper distribution functions. Given that the probability model has been defined in terms of the joint density functions, it is important to consider the above operation of marginalisation in terms of these density functions. The marginal densitq' functions of -Y1 and Xz are defined by
(5.30)
and
that is, the marginal density of Xii= 1, 2) is derived by integrating out Xji #.j) from thejoint density. ln the discrete case this amounts to summing out with respect to the other variable: X
h (x1)
-
/t-vl x2). -
i
=
1
Example Consider the working
population
of the UK classified by income and age as
follows: Income:
2000-4000, f 4000-8000, f 8000- 12 000, f 12 000-20 000, f 20 00050 000, over f 50 000.
Age: young, middle-aged. senior. Define the random variables A'l-income class, taking values 1-6, and Xzage class, taking values 1-3. Let the joint density be (Table 5.2):
5.3
distributions
Marginal
87
Table 5.2. Joint densi )., q/' (.zYj Xz) ,
-
.?
-
-..;
.
-U-
...y.
.
Xz
---..
.--.
.
:LL
S
-l-
7
.
.y.
I
2
t
l-
1
1 2 3 4 5 6
.
Jc(xa)
l
a
0.250 0.075
0.020 0.250
0.040
0.075
0.020 0.010 0.005
0.030 0.015 0.010
0.400
0.400
m
.y,
ay
-7
.3
.....t.
/
0.5
0.020 0. 1(y) 0.035
1
1 I
(
/.1(.xj )
.
0.275 0.345 0.215
0.020
0.085 0.045
0.020
0.035
0.200
1.000
The marginal density function of -Y: is shown in the column representing r()w totals and it refers to the probabilities that a randomly selected person will belong to the various income classes. The marginal density of Xz is the column totals and it refers to the probabilities that a row representing randomly selected person will belong to the various age classes. That is, the marginal distribution of A'I(Ara) incorporates no information relating to Xat-Yk ). Moreover, it is quite obvious that knowing the joint density function of Arl and Xz we can derive their marginal density functions; the is reverse, however, is not true in general. Knowledge of txj)and .Jc(x2) xc) only when enough to derive ./'(x1,
../'txlx2)
'-h
(-x,)
-
(5.32)
(x2),
'jz'
Independence in that ,Yl and Xz are independent terms of the distribution functions takes the same form
in which case we say F(x1,
.X2)
=
l-.t.'.s.
F1(x1) F2(x2). '
ln the case of the income-age
example it is clear that
'..f2(x2),
/'(x1,x2) #.Jllxll 0.250 #
(0.275)(0.4),
and hence, Arl and Xz are not independent r.v.'s, i.e. income and age are related in some probabilistic sense; it is more probable to be middle-aged and rich than young and rich! In the continuous r.v.'s example we can easily verify that Fl(xl) Ltnd
'
F2(x2)
=
(1
-
e-0xt)( 1 -
e-0X2) =
thus .Y1 and X2 are indeed independent.
F(x
1,
x2),
and their distributions
Random vectors
and in the context of the probability Note that two events, said P( )) are (S, to be independent (seeSection 3.3) if .42,
,41
space
,.1
'
#(v4: ro
.4cl
=
#(-4j) #(,4a). .
It must be stressed that marginal density functions are proper density functions satisfying all the properties of such functions. ln the income-age 7: 0 and 0, .Jc(x2) j .J1(x1f) 1 and example it can be seen that .txlly
=
Zi
.J2(X2)
=
1.
Because of its importance in what follows let us consider the marginal density functions in the case of the bivariate normal density:
(5.36)
(5.37)
being a proper conditional
since the integral equals one, the integrand density function (see Section 5.4 below). Simlarly, we can show that
1
../-2 ( ) '-y.,-y. N ( a )o.a .X2
=
eX p
1 -
j
//x
2 -
u :1
l
(5.39 )
'-
.
.
(72
Hence, the marginal density functions ofjointly normal r.v.'s are unvarate
normal. provides us with ways to ln eonclusion we observe that marqinalisation when model is model such defined in terms of joint simplify a probability variables. random In unwanted out' by density functions any be of Xk interest X2, the density of nv.'s marginal the can general, 'taking
-'.'l
,
.
.
.
,
distributions
Conditional
5.4
our investigation
ln the ineome-age example if age is not relevant in simplify the probability by marginalising out
we can
-a.
Conditional distributions
5.4
ln the previous section we considered the question of simplifying probability models of the form (22)by marginalising out some subset of the away' the information X,,. This amounts to r.v.'s A' 1 -Ya, r.v.'s; irrelevant. integrated out as being In this section we related to the ('ollditiollis) with simplifying respect to some of * by question consider the of r.v.'s. the subset ln the context of the probability space (.i. ./7 #( )) the conditional is defined by (see Section 3.3): probability of event z'll given event bthrowing
,
.
,
s
.
'
zzla
-4....1.
#( ''jz?d :k' !'kF'
1 -4
..1
1
2
)
P(
=
-
..4
z)
ro
-
#(d2)
,.4
-.4
1
,
.
a e:./'
(1:* ''''
By choosing X j fts: X 1 (s) < .'7l ) we could use the above formula to derive an analogous definition in terms of distribution functions. that is =
F
where
f .Gxl ( .''A(::!1. ) P$X . 't'
..
' ..
..'::1.
,,.'.,4c),
=
there are two related forms we are As far as event zulc is concerned zlc it zk-z jl where is a specific value taken interested in, particularly c-field generated and by X 2. ln the case where a), the ct i.e. by A'a. 1 arising in ctz'a particular problems the definition of the .4a ), there are no .f-a
=
.-a
=
.
,4
=
-
=
conditional distribution function
P ( t s : A 1 (s ) .Gx : j...'ro c ( a )) 7 -
-
.
F .!k
j .' rr1
.
lk
2)
f'(c('c))
*
since ctA'al e although it is not particularly clear what form Fv, ct,ya) will however, it is immcdiately X2(s) take. In the case where a) a xa) when is Xz since #(s: obvious that a continuous 1'.v., there will .t
./1
=
z4a(s)
'(s:
.f
=
=0
=
,
90
and their distributions
Random vectors
be problems. When Xz is a discrete nv. there are no problems arising and we can define the conditional density jnction in direct analogy to the conditional probability formula'. /'(x1
c)
.f
=
Pr(X
1
Pr(X
1
=
=
x j 'Xz
=
x 1 X: ,
Prlx,
.f
=
=
2)
.f2) =
j'lx 1 ./2(.z)
.2)
.f-
,
2) .
.
=
The upper tilda is used to emphasise the fact that it refers to just one value
taken by Xz.
Example Let us consider the income-age example of the previous section in order to derive the conditional density for the discrete r.v. case. Assume that we want 6 (incomeclass of over to derive the conditional density of Xz given f 50 000). This conditional density takes the form: -',t-1
=
flxz
.f1)
=
0.005 0.035
=0.143
=
for Xz
=
2 (middle aged)
1 for Xz
=
3 (senior).
0.0 10 = 0.035 0.286 =
0.020 = 0.035
=0.57
1 (young)
for Xz
This example shows that conditioning is very different from marginalising because in the latter case all the information related to the r.v.'s integrated out is lost but in the former case some of that information in the form of the value taken by the conditioning variable is included in the conditional
density. variables
In the case of continuous random above procedure because Prukrj
=
xj Xz
Prx,
,
-
=
.kL)
7cl
it does not make sense to use the
0 (j
apparatus needed to bypass this problem is rather formidable but we can get the gist of defining the conditional distribution in the continuous case by using the following heuristic argument. Let us define the two events to be
The mathematical
-41
.t
=
s: X
j
(.s),I x 1 ) i! .F
(5.46)
5.4
Conditional distributions
distribution of .Y1 given Xz
This enables us to define the conditional 0. That is, taking the limit as
kz by
=
-+
lim = 0
tg, ..fa)
xl
=
X
.
(5.49)
dK.
J2(.f2)
we can define the conditional density of Arl
Using this heuristic argument given Xz xc to be =
xl
g
Rx:.
Examples Consider the bivariate logistic distribution F(x1 x2) ,
(X
FX l
1
=
+e-x2)-
(1
+e-At
l (1 + C - )Xl
=
1
1
1+ e - x e 1+ j
=
2( 1 +
e-X2)
-
1
e-X2
A:2
-
3
e-x'
.
92
and their distributions
Random vectors
The distribution function of the bivariate exponential takes the form
Xx c (X 2 ) -
Hence
=
1-
= Let j'lx 1 x a) . ,
xl
>fC1 >0,
=
C-
X2 '
f?xzl - *1 exp
1 + $x 1 )(1 + ()4
2)t2 ;.lk + 1)(r.,1 1 (1
xc >az
distribution
,
>0.
-1-
l $4/./
2x 1
+ a 1x a -
k- x ( 1 + 0x a) l 21
(1 1 (1
1
c )-
.
(2
+
'
density functions There are two things to note in relation to conditional above examples'. brought out by the the conditional density is a proper density function- i.e. (a)
both properties can be verified in the above examples. lf we vary Xz. that is, allow Xz to take all its values in Rxz, we get different conditional density for each Xz .x2 of the form
a
=
/x' vct-xl /x2),
xlgR.y,
,
(5.54)
xzeRxa
reflection suggests that knowledge of al1 these xa), a densities is equivalent to knowledge of conditional equality brought general relationship out by the A moment's
./'(xj
,
(x1 x2) -
e' R2.
5.4 Conditional distributions model tp .t.J'(x x,:; 0), 0 e (.)) because it offers us a general way to I xc, decompose the joint density function. It can be seen as a generalisation of holding when A-I and Xz are xa) the equality (x1) independent, considered in the previous section; (55)-(56)being valid for any joint density function. lndeed, we can use the condition which makes the two equalities coincide as an alternative definition of independence, i.e. .;t-: and Xz are independent if =
,
.
.
,
.
'./'2(xa),
=./)
./'(xj
,
/)
x24-Y1
,'''xz)
(.Y1
=l
(5.57)
),
This definition of independence can be viewed as saying that the information relating to X2 is irrelevant in attributing probabilities to XI. Looking back at the way (x1) was derived from the bivariate normal density we can see that the expression inside the integrl in (37)was the conditional density of X; given Arj xl. It can be verified directly that in this case =
/'(xl . .
.x2)
-
=
-), ( x
=
1
exp
j azrr.):
-j
x,(.x'2, ( /'1(x1))(-/x' c
The marginal and conditional
1 vi -
'
.:1)).
distributions
2
t'/2 cj
2) -
( 1 - /? cax,
.
(2z:) --
(5.59) in this case are denoted by
(5.60) (5.6 1) (5.62)
C a Se :
(5.64)
Random vectors
and their distributions
A sequence of r.v.'s -Yl, Xc, any x G R, Fj(x)
=
F2(x)
=
'
'
'
.
=
.
.
,
Ar,,is said to be identically distributed if,for .65)
(f
F,,(x).
The concept of conditioninq enables us to simplify the probability model in the form of a parametric family of multivariate density function (1) x,,; p), 0 i5 e) in two related ways: t/'txl xa, (i) to decompose the joint density function into a product of conditional densities which can make the manipulation much easier; and information in some in the case where the stochastic (probabilistic) of the r.v.'s is not of interest we can use the conditional density with respect to some observed values for these r.v.'s. For instance in the case of the income-age example if we were to consider the question of poverty in relation to age we would concentrate on a:(X2/'X1 exclusively. 1) 2 . =
,
.
.
.
,
=
.
lmportant concepts Px( )); random vector, the induced probability space (Rn, the joint distribution and density functions', marginal distribution and density functions, marginal normal density; independent r.v.'s; identically distributed r.v.'s; conditional distribution and density functions, conditional normal density. .@'',
'
Questions Why do we need to extend the concept of a random variable to that of a random vector defined on (S, P( ))? Explain how we can extend the definition of a r.v. to that of a random vector stressing the role of half-closed intervals ( vz xq. Explain the relationships between #( ), #xl ) and Fx x2( ). ! Compare the concepts of marginalisation and conditloning. Define the concepts of marginal and conditional density functions and discuss their properties. Define the concept of independence among r.v.'s via both marginal as well as conditional density functions. Define the concept of identically distributed r.v.'s. .%
'
-
,
'
'
.
'
,
6.
Exercises
Consider the random experiment of tossing a fair coin three times.
5.4 Conditional distributions Derive the sample space S. Define the random variables A-1- the number of Xz lnumber of number of Derive the c-fields generated by c(-Y1), Xz, c(A-a) and cl-Yl, A-2). Define the joint distribution and density functions F(x1, xc), x2). /'(xj x2) and plot Derive the marginal density function /'j(x1)and ,J2(x2). Derive the conditional densities theads',
=
'tails'l.
eheads'
-
(iii)
-4-1
,
,/'(x1
,
,
/'l(x1.7l),
3) .J'1(x1
2),
./(x2
./'clxc
0),
,
For the income-age example discussed above l), Plot the graph of .J(x: xc), .J'1(.x (i) ://2), c,/'3) (ii) Derive /'ltx/ 1), .J:(x fzqx and compare their graphs with that of Jttx ). be bivariate normal 3. Let the joint distribution of Arj and ,/'24x2).
,
.:-2
X1 Xz
p1 (i) (ii) (iii)
=
t7'il
p1
'w N
PG 1 (r 2 -
,
cl
pclcc
/:2
1
,
Derive Derive
(x1) and L (x2). jx' xa(.Y1/'.X2), fOr x2 j
=
0, 1, 2.
Under what circumstances are X3 and X1 independent'? Let the joint density function of A'j and X1 be /'(x 1 x2) ,
=
Derive Derive f. 1, 2, 10.
x l expt - xj( 1 + xzlj'
and
./-(x1)
v-tx,
'x
,)
fctxz). for x
,
=
,
x 1 > 0,
1, 2, 10, and
x2
>
0.
for (xc,/'x1) /'xz.a.j
x:
=
XG4itional derences Bickel and Doksum (1977); Chung (1974)',Clarke (1975);Cramer (.1946/ Dudewicz 1974); Pfeiffer ( 19781.,Rohatgi (1976), (1976); Giri ( 1974); Mood, Graybill and Boes (
6.1
of one random
Functions
variable
CHAPT ER 6
Functions
of random
variables Fig. 6. 1. A Borel function of a random
One of the most important problems in probability theory and statistical A',,) when inference is to derive the distribution of a function I1lZYl A-a, This known. randoln X,, ) is vecto r X (A-l the distribtltion of the problem is important for at least two reasons'. it is often the case that in modelling observable phenomena we are (i) primarily interested in functions of random variables; and inference the quantities ot- primary interest are in statistical colnmonly functions of random variables. It is no exaggeration to say that the whole of statistical inference is based on of various l'unctions of r.v.'s. l n the fi rst ou r abil ity to deri ve the distribution consider the distribution of functions of a single subsection we art! going to and then consider the case of functions of random vecto rs. r.v. .
.
=
6.1
Functions of one random
.
.
.
.
.
.
,
,
considered a function from S to R and the above ensures that the composite function (X): S R is indeed a random variable, i.e. the set /l(Ar)(s)e:.B,)s,.i/-for any Bkt?d (seeFig. 6. 1). Let us denote the r.v. hX) by Ft then F induces a probability set function #yt ) such that PyBh)= #x(#x) JN.,4),in order to preserve the probability structure of the original (S, P )).Note that the reason we need ( ) to be a Borel function is to preserve the event structure of Having ensured that the function ( ) of the r.v. A' is itself a r.v. F hzrj we want to derive the distribution of F when the distribution of X is known. Let us consider the discrete case first. When A' is a discrete r.v. the F hzr) is again a discrete r.v. and a1l we need to do is to give the set of values of F and the corresponding probabilities. Consider the coin-tossing example where X is the r.v. defined by A' (number of H - number of F), then since .4
-.+
=
'
ts: =
'%
.
'
,..d.
'
=
=
=
variable
S
P Let A' be a r.v. on the probability space (,, Suppose -.+ function S. real valued R, i.e. -Yis a on S function with at most a 11 is a continuous (liscontinuities. More formally we need /1( ) to be ..#7t
'
)). By definition, A-( ): '
'
that hl
.
countable
): R
=
)HT TH, HH, FF), XIHT)
XITH)
=0,
X(SS)
2, A'IFFI -2 and =
=
the probability
function is A- x - 2 0 2 .,t J
Rs where number of --+
a B()reI function.
=
=
t
'
22 4 with the same &'&'' I.et F Xl, then F takes values ( 2)2 4, (0)2 probabilities as X but since 4 occurs twice we add the probabilities, i.e. =
Dpnit 1011l
variable.
=0,
=
F
y
=
#r(F
=
y)
0
4
1.
-t2
2
'
ll1 gencral. the distribution 1-3( )')
!1t- I't. t I1t- i1)
=
'$
'
P(
,s
t. 1' st.
:
l
1'-(s) lI
.1
t'
<
=
function of J' is defined as ).)
=
t ik) I l /?
#( s : 1
(.)
1
-
- (( (s) es1-1 l 1 t-t--t
I
11#.
'.'
t 17e
cs. )j ,
)),
tl 11it.l t.l e
.
I1tlt')ll v'' '$''
I.'l I,t- t il/llh 4) f ra
y ba
r ia Illi.s
I 11 t llc ct str w' I'le1't,., is lt c() 11t i11tlt) tl s I'. tle l'i i I1g t 11etl ist l'l 1)11t I i 1 11 ( ) 1'1' /7( ) is not as si l'npltt as thtl discrcte eastl because. rstly. 1'' is ntlt al ways a continutus r.v. as well and, secondly, the solution to the problem depends crucially on the nature of h ). A sufflcient condition for F to be a continuous nv. as well is given by the following lemma. .'
.
-
'
.
where hX) is dlrentiable Let X be a continuous r.r. and F= h < 0 forall x. Then > 0 or (dll(x)q/(dx) for alI x (5 Rx and gd/ltxlq/tdx) the densit), functionof F is given by =.fxl
(.p)
>
1(.p))
-
d
h - '(-p)
for a <
dy
.p
<
(6.2)
b,
I
where I stands for the absolute value and a and b smallest and biggest value y can take, respectively. Example 1 -p)/c,
=
(aY
=
1/(2Uy)
(y)-./)(x71')(j-try1
=
-j i.e. F
ln cases where the conditions of Lemma 1 are not satisfied we need to derive the distribution from the relationship Fy(y)
=
P#(x)
> y)
=
Pr(X
E; h -
1((
-
(6.3)
xt x1)). ,
Example 2 Np, c2) and F= Xl (seeFig. 6.2). Since Edtxll/tdx) 2.x we can see increasing for x > 0 and monotonically that /l(x) is monotonically <0 and Lemma 2 is not satisfied. However, for y>0 decreasing for x Let X
(
1
1
expt
(aa)
-1 (y- 1-l (.y)
normal distribution.
N(0, 1) the standard
'v
1+.t( )
Ed-
i(y))/
-/.p)4yt7y1
-
=
distributed.
(see Fig. 6.2). In this form we can apply the above lemma with for x > 0 and x < 0 separately to get (d.') =
which implies that Edtxlq/tdx) 1/c> and F Let X N(p, 1(y)q/(dy) 1(y) > o.y +y and Ed 0 for a1l x c R since tr 0 by definition; /lc. Thus since tr2)
'w
Fig. 6.2. Thc function F= ..Y2where X is normally
to the
refer
l
-jy)
expt
+
for y>o
((a) expt .-jyjjy-.i
-y)
That is, fytl,'lis the so-called gamma density, where F( is the gamma fnction (F(n) jJ' t??1e-r d$. A gamma r.v.1 denoted by F.v G(r, p) has a density of the form .J(y) g.p/F(r)q(p#)'- expt p)1, > 0. The above 1.2) and is known distribution; an distribution is G(1y, as the chi-square important distribution in statistical inference; see Appendix 6.1. .)
=
,y
=
-
=
'v
Fy(y)
=
#r@(x)
= #r(
-
:%
y)
=
#r(x
6F
h
-
1(
-
),
xj) -
Functions of several random variables
As in the case of a single r.v. for a Borel function h ): R'1 R and a random xY,,), (X) is a random variable. Let us consider vector X (.11,Xz, used functions of random variables concentrating on the certain commonly variables case for convenience of exposition. two '
=
Fx( Vy) Uy% A' %Qyl Fx(Uy) =
6.2*
-
.
.
.
,
-
Fig. 6.3. The function
(l )
F
=
.Y1 + Xz.
The distribution of X 1 + Ar2
By definition
the distribution
function of F
=
..Yj +
.:-.,.,
(sey Fig. 6.3) is Example 4 U( - 1, 1), 1 1, 2 (uniformlydistributed). t 7sing Fig. 6.3 we can show that I
.et
'j
'v
=
and define F
=
A-j + X1.
=
ln particular,
if .,Y1and .Y2 are independent,
Iy'(..v) =
by X1
symmetry. -
A' 2
./'1
(.p -
x.z)./2'(x2)dx2
Using an analogous
=
argument
then ,1
(xj )./a* (-'
-vl -
) dx 1
,
we can show that for F
=
(2hl -#/1
:6:
zc
1L .
This density function is shown below (seeFig. 6.5) and ascan
be seen it is not
1():!
6.2*
of random variables
Functions
Functions
of several random variables
Fig. 6.6. The function F= Xb Xz for F Fig. 6.4. The density function of F uniformly distributed.
X3 + Xz where
=
and
A-I
l03
.::0
and F
>0.
A'c are
1$k t-s t he lbrl'rl ./'txl
F y (.p)
fv (y)
=
,
0
x2) dx1 dxc
- ccl
0.5 .;''r'
) :
i -1
-p
1
o
Fig. 6.5. The density function of F= uniformly distributed.
.11
+ Xz + X5 where
2
. ).
B
Xi, i
=
1, 2, 3, X
are
.
only continuous but also differentiable everywhere. The shape of the curve is very much like the normal density. This is a general result which states uniformly distributed independent r.v.'s, that for Xi U( 1, 1), i 1, 2, which is closer to a normal distribution the L Z)'-l Xi has a distribution particular of value case of the central limit theorem (see greater the n; a Chapter 9). 'v
=
-
.
.
.
=
II1 tilt. t'tstt
:.
The distrihution
(.
A' 1 and Xz are independent
.
'
this becomes
.
I., tl'l;I' t j
(.1,.v2) '.fL(x2)dxc, I.vzI./'l
#
.
.
'q
of A-1 A'2
where
:
.E.
(2)
xa) dxc, jxajytyxa,
y,j ,y,.,(
1 1- t .1! t
.$ (t l)e
(6.8)
htllnatical manipulations
lllat
are not ilnportantl)
2
.N(t ). l ) :1 11 tl A c y 4tl ) chi-sq u a re wi t h 11 deg rees of freedorn, .' 1 t. 1l t j? l I ) k lt- I I 1t It- l l I 'lt-t'il t- 1 lt ) lt n d let u s d e ri ve its j . x ( j t l I 1 l l 1 1 l t I 1 I l l t ( Ik l 1 s l ! l't I I ) t' l i( ) I ( .3#' l I1t- (.It. ! 1( I 11il : t t (. 1' l ) is i t't r'l 4 N ( .
,
x.
-
I
i
,
l
'
.1
'')t-
''
.'
.'
'1
..!
.'
(
'(
', 1'1sitl
t- 1'
t u-(
-s ',
l'.
'$ .
.'
I
:1 1,1(.1
.'
?
(.1
Ct1'1
It'!t
'i
'
'
'
.
'$
.1
'k
'1
'k
.....'''1
.j?.
--
..:
'$'
Functions of random
l() l
Ix. (Z) . Since
.J'(x1,::)
=
n
variables
n/ 2 Zn
ag(u,a)jjj-jus)
,
=/'1(x:)
,
-
*
nz
eXP -
2
a
,
it takes values only for z > 0, which implies that
'.gz),
1
' .x
=
.-
--
Jr G j G 2
=
This is the density of Sfut/pnr's
g
j)
()
u-) z V
S( G 2
'
Y
2
'y.
:-
- 2-
2
y +.
()' 1
j-
V1
)
($ .- u
d ly
z2(nj)and
'v
()
l-t/fsll-fhuffon.
F
'
/y( . Aj
=
xz2(na)
Xz
be two independent
=
.-.
.
xz/nz)
'c
.
,?,
n -!.
=
The distribution
v2
.
n1
/ 1 -na ,
-
n2
exp F
/'y (
.r.)
n 1 + nz 2
--
j-
n j
1
2
n2
..2
.
y,
y))
nl
,
dxa nj /2
-n
),'c(,,j
-
ajyaj
2 j + 1.4,1
m,
-
j'n
h (x g ) dx .c .
.v
tj1( la 1+
xa
:=
and define
(X1 /n1) nz Xl
. --
0
.
r.v.'S
1+
u 1 '
112
y?
)/21
,, ...?
,
d xa
y
W Uere bl =
,
Example 6 Let Arl
1
.
G2
x
V 1 CT2 .--
xp
'
of F
=
mint A' 1 A-c) ,
'
v %
--
2 2
2
!
.
2
j
u! ..y -V ---j G1
G2
106
Functions
of random
6.2*
uriables
of several random variables
Functions
'rhese assumptions
enable
to deduce that
us
(6. 12) Example 9
l
.ct
Xi
N(0, 1), i
'v
1'1 =
1, 2 be two independent
=
ll1(.. 1 X2) ,
X1 + X2,
=
nv.'s and
1'2 /12(X1. X2)
A' 1
=
=
-
.
Ar2
sillce X
l'' 1 0 1( 1 -
.172)
-
,
'1.F2
j+
'
.y,
2
.'j!
Fig. 6.7. The function F
=
min tXj
,
i
X2).
J= det
consider them together. Let (#1, X2, joint probability density function transformation'. .(x1,
.
Ar,,) be a random vector with a x,,) and define the one-to-one x2, .
.
.
.
1 1+ )'c
t lllskilnplies that
,
.
+ ),c
I.'''l
,
/'(y1 .
1
-p2)
,
=
2,:
exp
1
-j
jy 3....1 (yy-).4)
(y 1 .J.'2 ) + y f (1 + yc) c 2
...
,
1 ( 1 -I 1 J.c)2 .J,f -j exp + ( j .y y,a) ( 1 27: .rc)
(6. 10) l l1t. Illltin
drawback of this approach
.
is well demonstrated by the above
'N.$l1:I'le. The method provides us with a way to derive the joint density 1(,11 of the Ffs and not the marginal density functions. These can be t 1$l,1k 1It.I 15 t'tl by integrating out the other variables. For instance,
j .. E
.I1l
F
111
t 11cltbove
example
these take the form
Assume:
(i) (ii) (iii)
hi( ) and gi ) are continuous; the partial derivatives Jxf/pyf, i,j 1, 2, continuous', and the Jacobian of the inverse transformation '
'
=
.
.
.
,
bn exist and
are
(. atlchy density. 'f! .
,:@
108
Functions
6.3
Functions of normally distributed random
of random variables
Looking ahead
6.4 variables,
109
a summary
The above examples on functions of random variables show clearly that A',,) when deriving the distribution of hz-, x,,) is known is not an easy exercise. Indeed this is one of the most difficult problems in probability theory as argued below. Some of the above results, although involved (as far as mathematical manipulations are concerned), have been included because they play a very important role in statist'al /ld/-pnct?. Because of their importance generalisations of these results will be summarised below for reference purposes. ,(x1,
,
.
.
,
.
.
.
.
,
Lgnlrntt 6 l .
1/' Xi
(Yt'1
I
Nqyi c/).
x.
i
,
z'b'ij
Letnlna 6
1 2,
=
,
.
N(Yt,'-pi, Y'f'-
-
1
1
.
.
11 tgr'(?
,
c/)
-
normal.
independen r
?-.t?.*.!f
then
.2
//' X,. .N(0, 1),i 1 2, z2(rl) chi-square with =
x.
indepvndent n Jtx' rees q/' jeedom. -.l?.5y
,
.
.
.
n
,
g/-g
tbe
()-)'.1 .Y,2.)
x
6.2* f--pnklrlt?
1/' Xi
()-'1 i 1
N(/tf c/), i x g2(j7.
'v
z
=
1 2,
=
,
2/.c2) i i
,
1'-'l'i/i fl
=
j
..
.
.
,
j
'G 2 i
Fig. 6.8. The normal
independent r'.!7.'-$ then chi-square wr non-centralit p
n are
) ..-non-central
'
,
p(.l /-(1111(.4l Q9l* ,
.
I-cmma 6.5
N(0, 1), ..Y1 x. 2(n) arv A' : .,Yz j?7(/t?/?(.?nJ?n t ?-.t).'s then z with ?? f Student's 'A j l)x,'' X Jt.yf://-t.afz?-s )1 t 2, n ( (rl).q.Jjeedoln. -
'
distributions.
,
Ll,nltntl 6.-3
1.1%X 1
and related
x.
,
../
.
.
x,
Lt.zrrlnkt/6
.-3
N(p, c2), Xz 1/' X 1(?; J) .,:-1,,/'(x'/(X2,/:)1 t.??J p/'c. ptlratnet j
among the distributions referred to in these lemmas are in Fig. 6.8. For a summary of these distributions see Appendix 6. 1 1,4.11 lt'r tt more extensive discussion see the excellent book by Johnson II,lt $ K t ' t z ( I970). I llt.
'v
'v
'v
I'cl:ttionships
lt'ki
L'1'It
*
,y$
c2g2(n),
.
.Yj Xz indepgndgnt ,-.j?.'y thgn non-central t with ntr?n-trt.?nl?-./// )' ,
=
Lemma 6.4 ?-.!,:.*s A' 1 ztln1 )s X1 z2(nc), A-j Xz independent thetl (xYl n j )/(A-2/n2) F(n l na) - Fisher's F with n j and na degrees t?/'
lf
'v
'v
,
'v
,
jeedom.
Lenma 6.4*
.k. .t *4 '
I
II,
lllx
.*, II1
? q 1) .(.)
I :
,
,1
,1
'
'lk's
y (.t
ahead
.tltpking
w'e considered
of functions of random AIt Iltltlgh thtl are in general rather l 1t is kt ve 1'),' iInp() rtant facet of probability theo ry for two reasons: I ( t 'l't t.I1 k'cctl I's i1) prltct icl?t hat the probability model is not defined I 11 t t. l I 11 s $ ) l. l llt! ( ) l'i g i11:1l I'. N' s btl t i 11 sorne fu nctions of these. %l ; t 1I s l I k': 1 I I I 1 t't-I't- I 1k't.- isk c I'tlt. i t lIy tlttpentlell t o 11 t he d ist ribution of l 1 1 l 1 t I l , s . l l ; 1! 1k 1( ', I 1) y : t !' 1kl l7 It- s I s ! iI1 1 t t ( I's t 11(.l t t.ls t s t l.t t is t ic s .trtt .
k'llitlltel'
the distribution
manipulations
l'nathematical
.
.
1dtI
k
( '
(
.,
Functions
of random
Appendix 6.1
variables f(B)
A-,,)and the distribution functions of r.v.'s of the form h(X3, Xz, of such functions is the basis of any inference related to the unknown parameters 0. From the above discussion it is obvious that determining the distribution Ar,,)is by no means a trivial exercise. lt turns out that more of /1(.-1, X1, often than not we cannot determine the distribution exactly. Because of the importance of the problem, however, we are forced to develop approximations', the subject matter of Chapter 10. lt is no exaggeration to say that most of the results derived in the context of the various statistical models in econometrics, discussed in Part IV, in Section 6.3 above. depend crucially on the results summarised Estimation, testing and prediction in the context of these models is based on the results related to functions of normally distributed random variables and the normal, Student's t, Fisher's F and chi-square distributions are used extensively in Part IV. .
.
.
.
.
.
.
f (/; 1)
,
Appendix 6.1 The normal Univariate
normal
.- Ar
#lxl
1
=
.
.. .
Gx/
l
;.- .....-
/-(x) --,V-,2 exp c ( z) ' x F(.Y)
=
.
=
y,
exp
(27:)
-
Var(.Y)
=
1 u- p - -2 (5'
1 xy - 2 .--..c skvwnvss
s)
p'
Chi-square distribution - F
'v
z2(n)
=
=
2
d r/,'
Reproductit'e
prtpptrrly
2 ,
=
as
=
0,
kurtosis
=
a4
=
distribution
chi-square
3
()
7. ,
,
n)
=
1
---.;.,-
Vn
Higher moments:
x
r t?7 el
2-
cc
Zch
k
-
tN''2)
exp
z2(n,.(i)
- 15 'v
y!fH.''
.j 2 ), + J)1. t - -.-41
f
.
2) - 1
)J)(,?'/'
is
Some prtppt?rres
F( F)
=
n+
(),)kuk +. j.)
(2k)! 1- k +
,
)
.
n
,
>0
Cumulants
chi-square.
.E'(.p) n (thedegrees q/' freedomb,Vart)l) 2/. I ilt! density function is illustrated for several values of n in Fig. 6.9.
Non-central c2,
,
Fig. 6.9. The density functions of a central and non-central
distributions
and related
/. (z ; 6
0
N(p, c2)
'v
.Y
(y; 6 )
Var(F)
=
>
n
=
12 ,
,
.
.
.
2(n + 2).
E:
;
'E
'
.) ''m. :
is that the rtant difference with the centra j c jaj-square ftl llct itln is shifted to the right and the variance increases.
Ilctlt.t-. l Ile iIllpo
deltsl l
11t
-;'''(
?t/
.
.
11(
+1
l 1'(:
/)?'f
)'
Functions
of random
variables f tx)
z (w; 4 ) Nxw
z
-..V'''''- f (x)
z-
N
tzvj
exp
x2 -
nz nz -2
.E'(LJ)
=
nz > 2,
,
r
=
I l1ccentral and non-central
>
2
1
3
x
t's /-J-ifr?-//7l.flk'(?l71.1z' r(N)
F-distribution
/'( &; n 1
'v
-
F-distribution
l ig. 6. 11 for purposes of comparison. Non-central
e-
,
n 2,'
y .. 1+
k
=
)
0
.. U
2k) - 1
N1
nz
j-'
f-distribution
F(n1 rl2,' J), ,
>
0
u
+2:) n 1 -i''t -..n.2
n 1 + n z + 2/ -.... 2
yn, +nz
n 1 4. These moments show that for large n the
'v
density functions are shown in
=
ku J(n,+
Gt'
+2k)
F
nz
n2
2
F
...... Fl 1 +
gk
... . ,
2
>2s
is very close to the
normal (see Fig. 6. 10).
t (t.?)
Non-central
t-distrib. u/ ion - Hz' r(n; J), J
'
w; n
>
x
N .,
'''
ir
/'(
(jj ,
=
:;2!
.--z
(.ti
C .....(
iEl
-.-::?2
.-
0
)1
- .-y wzliV' Y /n r(n/2)(n + --
j
f (tz; n , n z )
) , a-j
,
'
)'.
);'.
tit
x
y'gs(.l-+k -1--ljtjt k=0
2
2w2
b-y .
jk/l
n + u(r,
,
ws
f (fz; n j n a ; ) ,
u
.
'. )il)1/., '''' ::.
1201-ft,lp-f/n g
() :'
'
kr( 1.12)
---
'i;
,
-
N
0
St l-ff/cri
+rl2 2#lJ(n1 Vart (p) n1(n2 2) c (n2 -4) -2)
2
N
A
A
1
=
&
u s ()
,
Functions
of random variables
Important
Appendix 6.1
tl'tlaccr/.:
Additional
Borel functions, distribution of a Borel function of a r.v., normal and related distributions, Student's t, chi-square, Fisher's F and Cauchy distributions.
'larke
( ( l t?78);
Questions Why should
be interested
we
in Borel functions
of r.v.'s and their
distributions? :A Borel function is nothing more than a r.v. relative to the Borel field on the real line.' Discuss. Explain intuitively why a Borel function of a r.v. is a r.v. itself. Explain the relationships between the normal, chi-square, Student's !, Fisher's F and Cauchy distributions. What is the difference between central and non-central chi-square and F-distributions'? .td
Exercises Let zYkbe a nv. with density function
f
A-1
0
(.X
*2
1
)
Derive the density functions (i)
X X A'
(ii) (iii)
of
2*
A' 17
=
e'Yi
=
;
l0 + 2%-21 Let the density function of the r.v. X be /'(x) e distribution of F logc X. Let the joint density function of X3 and Xz be =
.
=
-X,
x > 0. Find the
=
),t
Derive the distribution of F X21 + X2'25 (i) F m in (A-j .Ya ) (ii) ( .'(() 1) kl t.-I-i t..'yt I1t-tl is t 1'i13t I t i( 1. t
j *)
=
=
-
.
.'
.e
.x.
'$.'
.
.
',
1) (
.'
1'1'
references
(1975)., Cramer ( 1946),, Giri (1974),. Mood, Graybill Rao ( 1973),' Rohatgi
( 1976).
and Boes
( 1974)*,Pfeiffer
The general notion
of expectation
expectation
In Chapter 4 we considered the notion of mathematical context of the simple probability model
m
)
(7.1)
0j, 0 e:O
./'(x;
=
single random as a useful characteristic of density functions of a model probability generalised the to Since then we
*
f
=
(
/ (x 1
,
x 2s
variable.
J'
'
..
in the
.
.
,
.
x , , ; 0 ) 0 iE (). ,
and put forward a framework in the context of whichjoint density functions luarginalisation, conditioning and functions can be analysed. This included section is variables of this The to consider the of random (r.v.'s). purpose general framework. of this more notion of expectation in the context Expectation
7.1
of a function of random
A'ariables
ln the one-dimensional case of a single r.v. we considered many numerical characteristics of density functions, namely. E'(-Y), S(,Yr), .E'I-Y A(A-))2, which contain summary information concerning '(.Y - JJ(x))r, r= 2, 3. the nature of the distribution of zY.lt is important to note that each of these characteristics is the expectation of some function of X, that is, F(/;(xY)); /)4zY) X, /?(') Xr, etc. This provides us with the natural way to proceed in the present section, having diseussed the idea of a Borel functioll ( ): For (r which preserves the event structure of the Borel field R'' where n 2. simplicity of exposition we consider the case their joint Let (.Y1 z'c) be a bivariate random vector with density function and let h ): 22 -+ R be a Borel function. Define F (A'1 Xcl and consider its expectation. This can be dened in two -
.
.
.
,
=
=
'
;..d.
--+
=
.vc)
.(.x1
,
,
'
,
=
7.1
of a function of random
Expectation
variables
equivalent ways:
:X.
J
(ii) E(ll('
:,
A-z))
/1(.x1 x2)./'(x1
=
,
,
dxj dxa.
xz)
and it is usually The choice between (i) and (iij is a matter of convenience of of Y. diffkulty deriving by the degree in the distribution determined
Let Xi Using
'v
N(0, 1), i
,
X( + X(.
=
(ii), E(X 21 + X 2) 2
1 /',(J.') 2,)-.( . =
j
----
-
(x( +
=(v2j
the other hand, Y freedom, that is,
x
2aj
-
2/
exp
)
-
jx ( + x 2aj) dx ( dx c
z2(2.j- chi-square
+ -Y2a) '.w
exp
t
cc
r
=
on
and
r.v.'s and /1(Xl X a)
1. 2 be two independent
=
=
l
.
with two degrees of
t 1-v) -
we know that
E(Y)
)') -.'Jy',(
=
-v
2
=
equals the number of degrees of freedom (see Appendix 6.1). Before we consider particular forms of pltAr(. .Ya) let us consider some properties of E(ll(A-: .Xa)). ,
Properties
of expectation fkE(lll(A-l Xa)) + ?E(/12(X1, A'2)); avd :( ), b a( ) are Borel are 122 rtp R. .1n particular
EIJ/k1(X1. X2) + bhzx
Iinearity, u/ntl'l ons
u'lltpl-t? a .//4)m
1,
,
'
'
11
aiz-i 1* .=
=
t'onsrlnr.s
/1
E x
Ar2)1
/nt
1
-
i
Z aikjz-i). =
(7.5)
1
I/' X 1 and X a are independent
-.p.'s,
.ll-
tvyt?ry
Bovel jnction h
(')
The general notion of expectation
and Jl2( ),'R
R,
-.+
'
f(/'l1(A-1)2(A'c))
Eht
=
(.,1))
/'l2(..Y2)),
'
(7.6)
given that the above expectations exist. This is botb a nessary as well as suflkient condition for independence. particular of interest is when /lj(xY1) XL and /lc(aY2) Xz, One case =
=
.E'(A-1X2) F(.X1) &.zY2). =
This is in some sense Iinear independence independence. Moreover, given that CovtA'l
,
(7.7)
'
which is much weaker than
.Y2) F(Ar:A-c) - f)-Y1) .E't-tcl
(7.8)
'
=
(see below), linear independence is equivalent to uncorrelatedness since it implies that Covt-rj A'a) 0. A special case of linear independence of particular interest in what follows is when =
,
=0,
S(.Y1A'2)
and we say that A-l and Xz are orthogonal, writing X3 -1-Xz. A case between F(.Y1), that independence and uncorrelatedness is that in which E(k3/Xz) is, the conditional and unconditional expectations of .11 coincide. The analogous case for orthogonality is =
E(XL,/X1)
when .E'(A-:) 0.
=0,
(7.10)
=
This will prove to be a useful property in the context of limit theorems in Chapter 9, where the condition EzY',/xn
.',,
.Y'l) 0 - c, plays an important role. For reasons that will become apparent sequel we call this property martingale orthogonality'. 1,
-
Forms of hX
j
.
,
.
.
=
,
A-cl ofparticular
in the
interest
lllA'l -Y2) A-tA-22, 1,k > 0, ,
=
then
!
p;k =
'(.YtA%)
=
X)
xtxt-fxlxz)
dxl dx2
are calledytpfnl mw moments of order I + k; this is a direct generalisation the concept for one random variable. For (-Y1,.-2)=(*-1
-&A-1)/(-Y2 -'(A-2))k,
of
of a function of random variables
Expectation
7.1
E'(-Y1)
EEEE
/t1
are called joint central moments of order I + k. Two especially interesting joint central moments are the variance and covariance; () Covariance.. l k =
Covtxl,
1
=
(7.14)
-/-t2)).
-Y2)- .E'((A'1-p1)(A-c
provides a measure of the linear lrlwt?t?nfwt? random variables. With a direct multiplication above formula becomes :
relationship
The covariance
Covt-l
Ar2)
E(A-1) F(A-a). '(Ar1.Y2) independent then using E2 wtl deduce tat =
,
I/' zYl and Xz are
'
=0.
to note that the
tlt?rv important
(7.15) (7.16)
Covlx'j, Ar2) 1t is
the
is not true.
conYerse
(iij Variance l 2, k 0 =
=
-p1)2,
'Varl-Yl) For a linear
E(A'1
EEE
functionjf
var jl aiuri
aixi the tlariance
=)
i
is of the
form
a,? Var(A-i). i
Using the variance wc could dejlne the standardised formof a covariance kntpwnas rt? correlation coefficient and dehned by
Corrt-'l, X2)=
Covtx'l A'c) ,
gvartA'jl
'
VartAralq
.
(7.20)
The general notion of exptation
Properties (C/) (CJ)
(4)7.)
of CorrtA'l
,
-Y2)
Corr(A'1, aY2) 0 for A-j and .Y2 independent r.p.'.s. 1 % Corrt.r 1 A'al % 1. Corrtfj A'2) 1 for a + bX1, a, b being real constants. =
-
,
-:-1
=
=
,
Example 2 Let pcgljczj), (*'x1a) xllyLljafc, -
Covtfj, X1) X
*
(x1
-/.tc)T(x1,xc) dx1 dx2
(-Y1
1
-/z1)(xa
'X
=
cjcc
=,
x
.
-/t1) cl
xz (2r)
(2a)
exp
1
exp
-j
-p2)-.
(1
.%1
1
1
(1
z
2
dx1
xa -/.ta
-
-p
-Jt1 cl
)
o.,
.-ptxlts-/jtjztj (txa
-0.,a,v.
1
Hence, Corrtztj A-a) It should be noted that p in the case of normality does imply independence, as can be easily verified directly that p 0 o.J'(x1 x2) 'X(xz). =p.
=0
,
=/(x1)
=
,
Nlta that correlation measures the linear dependence between r.v.'s but if we consider only the first two moments that is the only dependence we can analyse.
7.2 Conditional
expectation
Conditional expectation
7.2
The concept of conditional expectation plays a very important role both in x,,; 0), 0 (E 0) to time extending the probability model *= ).(x1, x2, and in the specification of variables process) random dependent (stochastic models Part in lV. Iinear of Arj for Xz x2 was ln Section 5.4 the conditional distribution function exists) limit the defined to be (if .
.
.
,
=
Fx: Aatxl
xc)
).
1im #r(.Y1 Gxj/xa - h %Xz %xa +
=
0
(7.21)
(7.22) Note that
(7.23) 'fx,
X. '
xzlx
1
x
2)
-
.f
qc -
h (x1) h (x1)
.
) (x2?7x1 x, (xz/ x 1 ) d x I
x. ./-,-2
,
,
-.
Bayes'
formula,
the denominator being equal to .Ja(x2). propertiesL The conditional distribution function satisfies the following in .Yl, i.e. (CDI) For a jlxed x2, Fxj x2(x1/x2) s a distribution function 1j. Fxj ); IR (0, ,t.a(
'
-,
For a jlxed xl Fxj,.xatxl/'xzl is a talues of x1, x2 For 'ny
(CDJ) (CD3)
,
./'x,,xa(x1/x2)
xgf/ xa, /xc(.X1 '.Y2) > 0, xl
For a fxL
.fnction
of x2.
is a proper density . fxj xatxl, xc) dx1 1. .rfrlcrft?rl
EER
and
J'''-
wl
=
The conditional expectation of A-j given that Xz takes a particular value .vc (A'a xc) is defined by =
The general notion of expectation
ln general for any Borel function hl ) whose expectation '
Properties
of the conditional
exists
expectation
Let X, Arl and Xz be random variables ()n (s%, P )), tben: x) '(tz1:(A-1)+ a1h(Xz)jX x) + azkhzrzlj J1 Eh(X3),''X X x), tJ1 az constants. (CE2) lf Arj > X1, '(.Y1/-Y x) )v:ELXI/X x). (CE3) E(l1(A-j -Ya) Xz x2) E(l1(X1 xa),,''X2 xa). ICfVI f((A'1) X1 x2) f(ll(-Yt)) if A-1 al)(I are independent. outside tbe square (CF5) J)/l(-Y1)) 'gS((-1) Xz xaq, tbe pxptvltzrr?rl brackets being wfr respect t() X1. The conditional expectation EIXL/XZ x2) is a non-stochastic function of -+ f(.Y1 Rx, The graph i.e. @. ): (x2, EIXL x2)) is called the regression x2, .%
'
=
=
=
=
,
=
,
=
=
=
,
=
-:-2
=
=
=
=
=
'
(7&rPf?.
Example 3
ln the case of the bivariate normal distribution,
The graph of this linear
reqression
fncrftan
is shown in Fig. 7.1.
Conditional expectation
f (xlexa)
M
e'
-
.-
l I
.-
I I
X1 1 1 l
Xz
I
+
>
7z .z.
.f
''-o
Fig. 7. 1. The regression
.
.
;v
2w
7e
eJ
curve of a bivariate normal density.
As in the case of ordinary expectation, moments:
we can define higher conditional
Rflw conditional moments:
(7.28) Central conditional moments: A'EIA-I
Of particular skedasticity'
-EX,/Xz
=x2))'/-Y2
interest is the conditional
Vart-Yl/-'rc
=
x2)
=
'E(Arj Exljjxz
-
-
(7.29)
x21,
variance,
EIXL/XZ
=
sometimes
x2))2/Ara
=
called
x 2.
xc) g.E'(Arj Xz xalq = In the above example of the bivariate normal distribution we can show that VartArl/Arc x2) c21(1 In the case where the conditional variance is free of the conditioning variables it is said to be homoskedastic; otherwise it is called heteroskedastic. -p2).
=
=
=
-
=
The general notion of exptation
Example 4 Let
?6 This
is the distribution distribution
function of Gumbel's
bivariate
E0,1(l .
exponential
since
'(X:/X2
.X2)
=
r! ( 1 + pxa + r+ (1 + pxc)
=
Var(.Y1,.'A-a xc) =
=
l
and -2p2
+pxa)2
(l +p
--
r0)
,f.--.--
( 1 + pxc)
For a bivariate exponential distribution the regression and the conditional variance is heteroskedastic.
curve is non-linear
Example 5 The
joint density of a bivariate Pareto distribution takes the form
'./'(x1xc). ,
-
ltkcx,
ptp-h 1)(4?1t7c)+
+
',x.a
-:71Ja)-''+''
xl The marginal
> fzj >0,
x2 >az
>0.
density function of X1 is 2
(.X
2
)
=
PJt? '-tt'-b 11 , 2x 2
2
Var(.;Y'1/,Y2 xa) =
=
t?l (p+ 1) xj c az 0 0 + 1)
-
ln the case of the Pareto distribution the conditionalvariance is heteroskedastic.
regression curve is linear but the
7.2
Conditional
expectation
Example 6
joint density of a bivariate Iogistic distribution is
The
.Jtxj xa) ,
E(X3,,'Xz
=
x2)
'Xz
=
xa)
Vart-
1
2E1 +
zll -
3
expt
xc)),
- (x1+ 1 - logt 1 + e-'2) - non-linear in x2
=
=
=.1yzrz -
e-rt + e-
1
- homoskedastic
xa) is a non-stochastic function of x2 and for a particular .Q)is interpreted as the average value of .Y1 given Xz k1. The temptation at this stage is to extend this concept by considering the conditional expectation '(-Y1 Xz
As argued above value
ECXL,/XI
.fa,
=
=
=
(7.31) The problem, however, is that we will be very hard pressed to explain its meaning. What does it mean to say average value of A-j given the --+ variable E'? The only meaning we can attach to such a -Y2( ): S random conditional expectation is 'the
.
StA- 1 c(-Ya)),
where c(A'c) represents the c-field generated by Xz, since what we condition on must be an event. Now, however, E'(-Yj c('c)) is no longer a nonstochastic function, being evaluated at the r.v. Xz. lndeed, .E'tArl/ct.Ycll is a random tlariable with respect to the c-field c(A-a) c: .F which satisfies certain properties. The discerning reader would have noticed that we have already used this generalised concept of conditional expectation in stating CE5, '(/l(A'1)/ where we took the expected value of the conditional expectation 'c xc). What we had in mind there was '(llj(.Y1) c(-2)) since otherwise the conditional expectation is a constant. The way we defined '(A-j/c(-Y2)) as a direct extension of E(X !yA-a x2), one would hope that the similarity between the two concepts wlll not end there. lt turns out, not surprisingly, that f'taj /c(A'2)) satisfies certain properties analogous to CEI-CES: =
=
LSCE/) '(J1(A-1) + t72g(A-2)/c(A-)) fk11)ll(-Y!) t7'(xY)) + t72S(#(-Y2) t4aY)). $SCE2) 1./*-Y1y Xz. J).Yl c(-)) > E(A-a c4x)). (SCS-?) SEll(A'1) #(zY2)1 FEg(-Y2).E'((A-l) c(X2))1. (SCE4) &(zY1),/c(xYa)) JJ(&(-Y1))#' zk-k and A-a are independent. '((aYI) g(xYc)f)(zYI), (SC'5) g(-Y2) c(-2)) c(A-c)). This implies that the two conditional expectation concepts are directly related but one is non-stochastic and the other is a random variable. =
'
=
=
.
=
The general notion
of expectation
Example 7
ln the case of the bivariate normal distribution considered above, '(-Yl/c(A'a))
-i-p --1 (A'a
=pl
-p2),
G2
and it is a Xz
random
variable. Note that Vartx''l 'c(A'2))
(homoskedastic).
ln relation to the conditional variance the variance exists, Vart.Yll
=
we can show
c2j( 1
=
-p2)
is free of
that if '(A-f) <
(x),
i.e.
Var(F(.Y1/c(A'2))) + '(Var(A-1,,'c(.Yc))),
that is, the varianee of X : can be decomposed into the variance of the conditional expectation of ..Y1 plus the expectation of the conditional variance. This implies that Var(-Y1) > Vart'tA-l
c(Ara)).
(7.34)
Note that some books write E(X3/Xz) when they EI-Yl X 2 x2) or F(-Y1,/c(X2)), thus causing confusion.
mean
either
=
The concept of conditional expectation can be extended further to an arbitrary c-field .f.2 where .1 is some sub-c-field of not necessarily generated by a random variable. This is possible because all elements of fzl to which the conditional expectation of a r.v. X are events with relative to ...F can be defined. If we were to interpret the conditional expectation EtXl Xz x2) as -Y1 to a constant we can think of X to a random variable. EX/9) as Because of the generality of EX,'''L we are going to summarise its properties (whichindude CE1-5 and SCE1-5 as special cases) for reference purposes: Let A' and i' be two r.v.'s defined on (.$, P( )) and 91.g (c-C.E1) 1/' c- is a ctlnslclnl, then E(c V') c. wCE2) 1.f X % F, then E(X f/) % JJIF V). ouCE3) 1j' a, b Jp'c constanss, E(aX + :F V) aEIX,/f?-) + bE(Yj,@ ). .%
'respect
'smoothing'
=
tsmoothing'
.%
,%
'
=
=
(c-cE4)
lE(x
(c-Cf.5) (c-()7f3)
If F(A'). tZ, then EX, E(X/,F) X. flgftA- f/)) F(A-). i 1, 2), FIJJIA#' V'1 9 2 (h Ex/g' 1 ).
(c-CE7) Lo'-CE8)
.f.0)1
z:
E(lA'l,,'f). ,$)
o'.s
.#t))
=
=
=
=
.%
'.f/)/.f/.
=
lj
=
2)
[email protected]
=
7.3
Looking ahead
Let be a r.p. relative to 94 and '(J-Yj)< tx;,, '()ArFj) < SI-YF r2l A-JJIF f/'). If A- is independent of f/ ten fxY f?) S(Ar). .;t'
(c-C'E9)
:t;
then
=
(wCE10)
=
These properties hold when the various expectations used are defined and are always relative to some equivalence class. For this reason all the above be qualified statements related to conditional expectations must (formally) surely' ('a.s.)(see Chapter 10). For example, with the statement C-CEI formally should read: c' is a constant, i.e. X c a.s. then flxY V') Ealmost
'lf
a S .
=c
=
.
expectation will prove invaluable in the study The concept of conditional considered stochastic in Chapter 8 because it provides us with a of processes natural way to formulate dependence.
7.3
Looking ahead
The concept of conditional expectation provides us with a very useful way to exploit auxiliary information or manipulate information (stochasticor otherwise) related to r.v.'s in the context of the probability model
*
=
tf
j'l.zj x 2 ,
,
.
.
.
,
x,:,' p),
0 g (.))
(7.36)
.
Moreover,
as argued in the next section, the conuept of conditional expectation provides us with the most natural and direct link between sequences of independent r.v.'s discussed above and sequences of dependent r.v.'s of stochastic processes which enable us to extend (l) to include more realistic Jvnlrnfc phenomena. Stochastic processes are the subject of Chapter 8.
Important
concepts
expectation of a Borel function of a r.v.-' linearity of the expectation operator; independence in terms of expectation, linear independence (uncorrelatednessl; orthogonality, martingale orthogonality', joint raw and central moments, covariance, correlation; conditional distribution and density functions- Bayes' formula; F4.Y1 f.?)., conditional expectations E(X3,/'Xz x2). '(X1 regression curve, raw and central conditional moments, skedasticity, homoskedasticity, heteroskedasticity. ,/c(-c)),
=
The general notion of exptation
Questions
3. 4. 5. 6. 7.
Explain the concept of expectation for a Borel function of a nv. Explain the concepts of independence and uncorrelatedness in the context of expectation of Borel functions of r.v.'s. Compare the concepts of independence and martingale orthogonality. What does the regression curve represent'? What does the correlation coefficient measure? What does skedasticity measure? Compare the concepts of Etxj X1 x2), f2(A'l c(-Ya)) and fl.Yl 9. ), where c + =
.f/
Explain intuitively the equality Vart-Yl)
VartE't.Yl -Y2))+ E'(Var(.Y 1 /Xz)).
=
Exercise. For exercise 3 of Chapter 6 show that f'(A-1.Yc) F(A-1).E'(-Y2)but Ar1 and Xz are not independent. For exercise 3 of Chapter 6, derive J)-Y: /'Xz xa). Vartx't-l Xz x2), for x2 1, 2, (i) find Covt'j -Y2)-CorrtxYl -Ya). (ii) Let Xj and Xz be distributed as bivariate normal r.v.'s such that =
=
,
,
A' 1
cf
Jzl
N X2 ,v
=
=
pc1 c2 G1a
PtF1tr2
/22
,
pl
-
6'2a
=4,
(lalculate E(X
X2
l
E(Ar2
=
-:-1
=
VartAfj/.Yz
4. Determine the flxb
,
X2),
x1),
=
.x1
=
1, 2, 6,
=0,
1, 2,
Vart-rc/fj
x2),
=
=
x1).
of c for which
value .X2)
X2
.X1(.X1
+
A72),
is a proper joint density function and derive: EX 1 X 2 .X2); (i) Vart-'j Xz xc),' (ii) .E'IA'/-YZ Xz x2); (iii) Przk-k + Xz % 0.5),. (iv) Corrt-l, .:-2). (v) Show tat f'(.Y1) f'tf'tatj/A-cll. =
=
=
=
2,
pz P
=
=0.2.
4,
7.3 Looking
ahead
Additional references Bickel and Doksum Whittle ( 1970).
( 1977); Chung (1974)) Clarke ( 1975/ Giri (1974); Rao (1973);
CHAPTER
8*
Stochastic processes
ln Chapter 3 it was argued that the main aim of probability theory is to provide us with mathematical models, appropriately called probability models, which can be used as idealised descriptions of observable phenomena. The axiomatic approach to probability was viewed as based on a particular mathematical formulation of the idea of a random expeliment in the form of the probability space (.$,i@1 P )).The concept of a random variable introduced in Chapter 4 enabled us to introduce an isomorphic probability space (Rx,2,#x(')) which has a much richer (and easier) mathematical structure to help us build and analyse probability models. From the modelling viewpoint the concept of a random variable is particularly useful because most observable phenomena come in the form of quantifiable features amenable to numerical representation. A particularly important aspect of real observable phenomena,which the random variable concept cannot accommodate, is their time dimension;the concept is essentially static. A number of the economic phenomena for which we need to formulate probability models come in the form of dynamic processes for which we have discrete sequence of observations in time. Observed data referring to economic variables such as inflation, national income, money stock, represent examples where the time might be very important as argued in Chapters 17 dependency (dimension) and 23 of Part IV. The problem we have to face is to extend the simple probability model, .
* (.f(x;04, 0 e' t-)), -
to one which enables us to model dynamic pbenomena. We have already moved in this direction by proposing the random vector probability model
8.1 (I)
=
The concept of a stochastic ).f
txl xa,
.
,
.
.
,
13 1
process
(8.2)
x,,; 0), 0 6 (.)).
The way we viewed this model so far has been as representing different characteristics of the phenomenon in question in the form of the jointly X,,. lf we reinterpret this model as representing the distributed r.v.'s Arl, same characteristic but at successive points in time then this can be viewed as a dynamic probability model. With this as a starting point let us consider the dynamic probability model in the context of (S, % P )). .
.
.
,
'
The concept of a stochastic process
8.1
The natural way to make the concept of a random variable dynamic is to extend its domain by attaching a date to the elements of the sample
space S. Dejnition
1
probability space and -F an index set of real R. The A-( ) by .X'( ): S x T numbers and dejlne the function ordered sequence of random variables )A-( r), t e: T) is called a
Let
(.,
,./
#4
'
)) be a
'
'
'
,
-+
'
,
.
,
stochastic (random)process. This definition suggests that for a stochastic process (xY( r), l c T), for each r c T, r) represents a random variable on S. On the other hand,for each s G S, A'(s, ) represents a function of t which we call a realisation of the process. -Y(s, r) for given s and t is just a real number. '
,
x(
'
,
.
Example 1
Consider the stochastic process (-Y( r), r G T) defined by .
,
Xs, r)
=
F(# costztslr + u(s)), -z:,zr),
U( where y(.) and z(.) are two jointly distributed r.v.'s and tI(') independent of F( ) and Z( ). For a fixed r, say t 1, -Y(# F(# cos(Z(# + u(#), being a function of r.v.'s, it is itself a r.v. For a fixed y, F(# y, Z(# z, n(,s) u are just three numbers and there is nothing stochastic about the function z(r) y costzr + ?# being a simple cosine function of l (see Fig. 8.1(fz)). This example shows that for each t G T we have a different nv. and for each s 6 S we have a different realisation of the process. ln practice we observe one realisation of the process and we need to postulate a dynamic probability model for which the observed realisation isconsidered to beone of a family of possible realisations. The original uncertainty of the outcome of an experiment is reduced to the uncertainty of the choice of one of these ,v
.
'
=
=
=
=
=
=
Stochastic X
processes
(t) 3 2 1 0
10 2030
40
50 60
t
0. 10
0.05
-0.05
-0, 10
1964
1967
1970
1973
1976
1979
1982
T ime
(b)
possible realisations. Thus, there is nothing about a realisation of a process which can be smooth and regular as in the above example (seeFig. 8. 1(:1))or have wild fluctuations like most econometric data series (seeFig. 8. 1(4)). The main elements of a stochastic process :A'( r), r e: T) are: (i) its range space (sometimes called the state space), usually R,' (ii) the index set -1J)usually one of R. ii 1, 0, 1, 2, zc ). Z t Z + k.f0 1 2 d ; a n ) the dependence structure of the r.v.'s t g -1J'. 'random'
'
,
.
.
.
.
zj ,
=
,
,
,
.
.
.
=
r0, ..(1)-
=
t
.
.
-
,
,
The concept of a stochastic
8.l
process
ln what follows a stochastic process will be denoted by t-Y(r), l (E Tj. (s is dropped) and its various interpretations as a random varia t7le, a rea lisation orjust a number should be obvious from the context used. The index set T used will always be either T + 1, EfE2, ) or T )0, 1, 2, ), thus concentrating exclusively on discree slotl/lfgsrtr processes (for contlnuous stochastic processes see Priestley ( 198 1)). The dependence structure of kfA-(1),t (E T) in direct analogy with the case of a random vector, should be determined by the joint distribution of the T is commonly an infinite set, process. The question arises, however, do we need an infinite dimensional distribution to define the structure of the process?' This question was tackled by Kolmogorov ( 1933) who showed that when the stochastic process satisfies certain regularity conditions the ln particular, if we define the joint answer is definitely < r,,) of T by distribution of the process for the subset (r1< tz < r3, A'(r,,)%x,,) then, if the stochastic P-l.-trll F(-Y(11), x1, satisfies conditions: A'(r), the -t'-) t e' process: Ar(ly,)) A'(r,,)) Ftxtryl ), (i) symmetrq F(xY(f1 ), .X(12), is any permutation of the indices 1, 2, where.jl..jz, n (i.e. reshuffling the ordering of the index does not change the distributionl; .X(l,, 1)) A-(r,,)) F(Ar(lj), compatibilitq lirflxa.-.. F(.Y(rj )s (i.e. the dimensionality of the joint distribution can be reduced by marginalisationl', there exists a probability space (S, P )) and a stochastic process tA'(l), t e T) defined on it whose finite dimensional distribution is the distribution F(.(rj), A-(l,,)).as defined above. That is, the probabilistic structure of the stochastic process t-Y(!), l e:T) is completely specified by the joint -Y(r,,))for all values of n (a positive integer) and any distribution F(A-(21), This is a remarkable result because it enables us of T. r,,) subset (r1 r:!, the stochastic to process without having to detine an infinite dimensional distribution. ln particular we can concentrate on the joint distribution of a finite collection of elements and thus extend the mathematical apparatus built for random vectors to analyse stochastic processes. Given that, for a specific 1, .Y(r) is a random variable, we can denote its distribution and density functions by F(A-(l)) and /'(A'(l)) respectively. Moreover, the mean, variance and higher moments of -Y(r)(as a r.v.) can be defined as in Section 4.6 by:
t0,
=
.
.
=
.
.
.
.
,
tsince
itentative'
Kno'.
.
-(f,,))
.
.
.
.
,
'tti
=
,
.
.
.
.
,
'tl.jal,
.
.
.
=
,
.
.
,
.
.jn
.
.
.
.
..
.
.
.
,
=
.
,
.
.
.
,
,
.%
.
.
.
.
,
.
,
,
.
.
.
.
,
.
Adescribe'
(8.3)
'(-Y(r))
=p(l),
'E(.X(r)-/.t(r))21
=
tl(r),
'(-Y(r)r) /t,,(r), =
(8.4) (8.5)
Stochastic processes As we can see, these numerical characteristics of X(1) are in general 1, given that at each t 6 T, Ar( 1) has a different distribution F(Ar(r)). The compatibility condition (ii)enablcs us to extend the distribution function to any number of elements in T, say r1, lc, 1,,. That is, F(.Y(rj), A'(rz), X(r,,))denotes thejoint distribution of the same random variables A'(r) at different points in T. The question which naturally arises at this stage is how is this joint distribution different from the joint distribution of the random vector X !E (#1, Xz, Xnj' where X3 Xz, Xt, are different random variables'?' The answer is not very different. The only real difference stems from the fact that the index set T is now a cardinal set, the difference between ti and tj is now crucial, and it is not simply a labelling device as in the case of Fz'j Xn). This suggests that the mathematical Xz, developed in Chapters 5-7 for random vectors can be easily apparatus extended to the case of a stochastic process. Forexpositional purposes let us consider the joint distribution of the stochastic process )A'(l), t c T) for t t 1 , t :, The joint distribution is defined by
functions of
'
,
.
.
.
.
.
,
=z
.
,
.
,
.
.
.
,
.
,
.
.
.
,
,
.
.
F(Ar(ll), aY(r2)) #r(A'(r1) G x1, -Y(r2)G x2),
(8.6)
=
The marginal and conditional distributions for xY(r1)and are defined in exactly the same way as in the case of a two-dimensional random vector (see Chapter 5). The various moments related to this joint distribution, however, take on a different meaning due to the importance of the cardinality of the index set T. ln particular the linear dependence measure .:-(12)
t11,l2) is now called the
=
'rl,
p(l t
(lf
)
-p(l2))q,
r1,r2GT.
(8.7)
function. ln standardised form
autocovariance
-------2-..5--.,
r2)
FE(.X'(f1)-p(f1))(A'(ra)
rl tz G T ,
1)t'(12))'
is called the autocorrelation function. Similarly, the autoproduct moment is defined by m(r1, l2) .E'(-Y(l1)A-(r2)).These numerical characteristics of the stochastic process )A7(r),t s T) play an important role in the analysis of the process and its application to the modelling of real observable phenomena. We say that (.:71), t is T) is an uncorrelated process if r(lj lc) 0 for any r1, 12 6 T, 11 # l2. When rn(11 l2) 0 for any l1, t2 c T, rl # ra the process is said to be orllltlgontll. =
,
,
=
=
Example 2
One of the most important examples of a stochastic process is the normal
(or
8.1
The concept of a stochastic
process
Gaussian) process. The stochastic process tAr(l), t (E T) is said to be normal if -Y(1,:))EEEEX,,(t)' for any finite subset of T, say lj r2., r,,, (-Y(r1),.:-(12)), has a multivariate normal distribution, i.e. .
,
f (.2(11), .
.
.
.
.
.
,
.
.
,
A''(l,,))
,
(det V )-1 expl
-ifX,,(t) #(t)) v,, 1 (x -
,,
- (2zr)
,
-
,,/2
-
,,
(t)-
,(t))),
1, 2, n is an n x n autocovariance matrix and g(12), /t(?,,))' is a n x 1 vector of means. ln view of the #t) (/t(rl), deduce condition the marginal distribution of each we can compatibility .Y(rf), which is also normal,
where V,, EEEEt?trf, ry)1,i,j =
.
.
.
.
,
,
.
.
=
(8. 10) As in the case of a normal random variable, the distribution of a normal stochastic process is characterised by the first two moments /t(r) and n(l) but now they are both functions of t. The concepts introduced above for the stochastic process .Y(1),t e:T) can be extended directly to a k x 1 vector stochastic process )X(l), t e:-F) where X(l) = (.Y1(l), .Yc(r), .Xk(l))'. Each component of X(l) defines a stochastic tArftrls (E T), i 1, 2, k. This introduces a new dimension to the process r concept of a random vector because at each 1, say r1, X(l1) is a k x 1 random X(l,,)')' defines a random n x k vector and for tzuz11 r2, r,,,?' EEE(X(r1)?, matrix. The joint distribution of .'.Tis defined by .
.
.
,
=
,
.
.
.
.
.
.
,
,
.
.
.
,
with the marginal distributions F(X(lg)) #r(X(rf) xf) being k-dimensional distribution functions. Most of the numerical characteristics introduced above can be extended to the vector stochastic process )X(l), r s T) by a simple change in notation, say J)X(l)) p(r), 'g(X(l) - p(l))(X(l) p(l))?1 V(l), t c T, but we also need to introduce new concepts to describe the relationship between .Yj(l) and Xjz) where i #.j and 1, z c T. Hence, we detine the cross-covariance and cross-correlation functions by =
=
cf./ll,
'r)
=
:$
-
=
.E'E(-Y/?)-/.ti(l))(.'.j('r) -Jt/z))(1
(8.13)
Stochastic processes These concepts for i ;e tween the stochastl'c processes Arf(r), linear dependence the measure ) moment r g Tl and .Y/(z), z e: Tl Similarly, we define the Note that 1.41, z) function by l?1fyl/, z) )-j(l), .Yj(z)) ntr, z) when i z)Et.?(r)l.T)() r(r, z) p(l)p(z) ' Using the notation introduced in Chapter 1.n(1, 6 (seealso Chapter 15) we can denote the distribution of a normal random by matrix N()te that
cijt, z)
t
r7(r,z) and
=
1')
rijt,
r41, r)
=
=j.
'ross-product
.
=
=j.
=
=
-
=
.
nk'
X(l1) X(r2) X(r,,)
XN
.
Jl(r1)
V(r1 )C(f1 rc),
p(r2)
C(r2-rl )V(r2)
'
1,,) (r(?.1 ,
,
1
llltnl
.
t7(r,, t I )
V(r,,)
.
(8. 14)
where V(rj) and C(lj, ly) are l x k matrices of autocorrelations and crosscorrelations, respectively. The formula of the distribution of needs special notation which is rather complicated to introduce at this stage. ,'
(EX
In defining the above concepts we (implicitly)assumed that the various moments used are well defined (bounded)for all t G -F, which is not generally true. When the moments of LzY(r),t e: T) are bounded for al1 l G T up to order
/, i.e.
for
all '
7J,
(8.15)
we say that the process is q/' order 1. In defining the above concepts we assumed implicitly that the stochastic processes involved are at least of order 2. The definition of a stochastic process given above is much too general to enable us to obtain a manageable (operational) probability model for modelling dynamic phenomena. ln order to see this let us consider the question of constructing a probability model using the normal process. The natural way to proceed is to define the parametric family of densities /'(.Y(f) p,) which is now indexed not by 0 alone but r as well- i.e.
*
=
t./'(X(1); pr), %e' (% r .
EET),
lf ./'(Ar(r);p,) is the normal density %EB (/t41),lz'(r,r)) and (% EE E2x R+ The fact that the unknown parameters of the stochastic process tA'(r), l (E T) change with r (such parameters are sometimes called incidentalj presents us with a difficult problem. The problem is that in the case where we only have a single realisation of the process (the usual case in econometrics) we will
.
8.2
Restricting
time-heterogeneity
have to deduce the values of p(1) and P'(r, r) with the help of a single observation! This arises because, as argued above for each r, -Y(s. 1) is a random variable with its own distribution. The main purpose of the next three sections is to consider various special forms of stochastic processes where we can construct probability models which are manageable in the context of statistical inference. Such manageability is achieved by imposing certain restrictions which enable us to reduce the number of unknown parameters involved in order to be able to deduce their values from a single realisation. These restrictions come in two forms: of the process', and restrictions on the time-heteroleneitq' restrictions on the memory of the process.
ln Section 8.2 the concept of stationarity inducing considerable timehomogeneity to a stochastic process is considered. Section 8.3 considers various concepts which restrict the memory of a stochastic process in different ways. These restrictions will play an important role in Chapters 22 and 23. The purpose of Section 8.4 is to consider briefly a number of important stochastic processes which are used extensively in Part lV. These include martingales, martingale differences, innovation processes. Markov processes, Brownian motion process, white-noise, autoregressive (AR) and moving average (MA) processes as well as ARMA and ARIMA processes. 8.2
Restricting the time-heterogeneity of a stochastic process
process :A'(1), t c T) the distribution function the parameters % characterising it being F(X(tj; 0t) depends well. stochastic That l is, a functions of as process is time-heterogeneous in raises issues in modelling real difficult general. This, however, very only usually have phenomena because one observation for each 1. we will pf On the basis Of a single practice Hence, in have to we observation, which is impossible. For this reason we are going to consider an important class of stationary processes which exhibit considerable timehomogeneity and can be used to model phenomena approaching their undergoing but continuously equilibrium steady-state, stochastic stationary class of processes. fluctuations. This is the
For an
arbitrary
stochastic
on t with
'estimate'
'random'
Djinition
2
,4 stochastic process .Y(r), t EET) is said lo be (strictly)stationary 1,,) C!JT and some for Jn-' subset (rj !2, ft
(J
'r,
,
.
.
.
,
(8.17)
Stochastic processes That is, the distribution function of the process remains unchanged when shifted in time by an arbitrary value z. ln terms of the marginal distributions t c T stationarity implies that F(A'(M), F(-Y(r))
FX(t +z)),
-
(8.18)
F(-Y(f,,)). That is, stationarity implies and hence F(A'(!1)) F(zY(t2)) -Y(r,,)are (individually) that A'(!j), identically distributed (ID); a perfect time-homogeneity. As far as thejoint distribution is concerned, stationarity implies that it does not depend on the date of the first time index l1. This concept of stationarity, although very useful in the context of probability theory, is very difficult to verify in practice because it is defined in terms of the distlibution function. For this reason the concept of Ithorder stationarity, defined in terms of the first l moments, is commonly preferred. =
.
.
.
'
=
'
'
=
,
Dehnition 3
,4 stochastic process 1tAr(r),t (E T) is said to be lth-order stationary 4J' 1,,) of T and Jny z, F(aY(l1, for any subset (r1,tz, is of equal corresponding the itsjoint moments are order land to moments .:71,,))
.
t?.JF(Ar(l1 +z),
.
.
.:-(11 .
.
t.Y(12)ldz, E'EtArtrjl )d1 .
= ln order to understand (1)
First-order
.
.
,
,
-
-
-
+
(2)
=
.
.
.
,
/,, l,.see Priestley ::t
(8.19)
,
(1981).
this definition let us take 1= 1 and I 2. =
stationarity' < for al1 t c T and '(I.Xr)I)
=p,
=
Second-order
(x;
stationarity'
(.-(8, l c T) is said to be second order stationary if '(I-Y(l)12)< and (/j
=
1, Iz
=
,
A-(!',,))l,,(I
tA-(r), t c T) is said to be first order stationary if 1, f'(AXr))E(X(t +z)) constant free of r.
for /1
.
k.f
FgeftArtrl + z) )..'1 wtf A'(r,:+ z))Z(1
where/1 + lz +
.
+ z)), i.e.
,
.
.
,
0)
'ExY(r)j
=
.E'EXII
+ z)j
:x;
forall t e T
=pl,
(/1 2, lz 0) =
=
constant free of
l
#.2
Restricting
time-heterogeneity
(/j
1,lz
(iii) '(.tA-(r1)).tA'rtral)j
and
Taking z
=
=
-
1)
=
z)) .t.-Y(/'a+
= figta'&-trj+ rl we can deduce that
.EE.Lz''(0)).:.'(r2 - r1))I -
lllfa
z))(I .
Irz rl 1.
a function of
...-.11),
-
These suggest that second-order stationarity for -Y(!) implies that its mean and variance (Var(Ar(l)) p pI) are constant and free of r and its l2) f;g)A-(0)))A'(l2 autocovariance (r(11 , r1))(J pl) depends on the stationarity, which is also and Second-order not interval jl2 rj j; r2. rj or called wuk or wide-sense stationarit-v, is by far the most important form of stationarity in modelling real phenomena. This is partly due to the fact that in the case of a normal stationary process second-order stationarity is given that the first two moments equivalent to strict stationarity normal distribution. the characterise ln order to see how stationarity can help us define operational probability models for modelling dynamic phenomena let us consider the implications of assuming stationarity for the normal stochastic process ',A-(r),t c T) and its parameters 0t. Given that '(-Y(r)) p and Var(-Y(l)) c2 for all l G T and tyrj, f2) s(lr1 rcl)for any rl rc c T we can deduce that for 1,,) of T the joint distribution of the process is the subset (tI l2, =
-
=
-
-
-
=
=
=
,
.
.
.
-
,
,
characterised by the parameters
a
(n+
1) x 1 vector.
(8.20)
This is to be contrasted with the non-stationary case where the parameter N n), a (n+ n2) x 1 vector. A sizeable ector is 0= (p(rf),ntrf, ly), i,jzuz1, 2, reduction in the number of the unknown parameters. lt is important, however, to note that even in the case of stationarity the number of 1,,) although the parameters increases with the size of the subset (lj l2, This does (E time-homogeneity T. is because depend t on parameters do not Ar(rj) and The of between dependence restrict the process. the not r2l but the 4rc) is restricted only to be a function of the distance . r'unction itself is not restricted in any way. For example h ) can take forms .
.
.
,
,
.
.
.
,
'memory'
'
Il1-
xueh as:
-?a)2.
(a)ll(lr1 -rc1)-(r1 (b)/,(lr1 /.c1) exp.t - Ir1 1c1). -
-
-
(8.21) (8.22)
Ic case (a) the dependence between .Y(rj) and xY(lc) increases as the gap een rl and tz increases and in case (b) the dependence decreases as -hi?ru
140
Stochastic processes
Il: !cIincreases. ln terms of the indeed from
'memory'
of the process these two cases are stationarity viewpoint they are identical different but the very (second-order stationary process autocovariance functions). ln the next section we are going to consider restrictions in an obvious the problem of the parameters increasing with the size of attempt to the subset (r: 12, r,,)of T. Before we consider memory restrictions, however, it is important to stochastic process as the comment on the notion of a non-stationary absence of time-homogeneity. Stationarity, in time-series analysis, plays a similar role to linearity in mathematics', every function which is not linear is said to be non-linear. A non-stationary stochastic process in the present context is said to be a process which exhibits time-heterogeneity. ln terms of actual observed realisations, the assumption of stationarity is considered appropriate for the underlying stochastic process, when a z-period (z> 1) wfnltpw,wide enough to include the width of the realisation, placed directly over the time graph of the realisation and sliced over it along the time axis, shows same picture' in its frame; no systematic variation in the picture (see Fig. 8. 144). Non-stationarity will be an appropriate assumption for the underlying stochastic process when the picture shown by the window as sliced along the time axis changes such as the presence of a variance. change monotonic An trend or a important form of nonin the is so-called non-stationarity stationarity which is the homogeneous described as local time dependence of the mean of th,process only (see ARIMAI/?,q) formulation below). -
kmemory'
'solve'
,
.
.
.
,
'the
'systematically',
8.3
Restricting the memory of a stochastic process
In the case of a typical economic time series, viewed as a particular realisation of a stochastic process .tA-(1),r 6 T) one would expect that the dependence between Ar(r1)and Ar(r2)would tend to weaken as the distance refers to the GNP in the UK at time l (r2 - l1) increases. For example, if would between .Y(r1) and .:X12) dependence that expect one to be much greater when lj 1984 and tz 1985 than when 11 1952 and tz 1985. Formally, this dependence can be described in terms of the joint distribution F(aY(t1), .X(la), X(!,,)) as follows: -lrl
=
=
.
.
.
=
=
,
Dehnition 4 A slochflslfc process .fA-(1),t e T) dejlned on the probability space #( ))is said to be asymptotically independent lf forany subset
(.,
,%
'
14 1
8.3
gOeS
!t? zero
J.$ T
izfw).
-+
Let us consider the intuition underlying this definition of asymptotic and B in independence. Any two events are said to be independent 0. Using when #4z4 ra S) #(yl) PBj or equivalently P(.4 ch Bj - #(z4)#(:) this notion of independence in terms of the distribution function of two random variables (r.v.'s)Arl and Xz we can view tF(-Yj .Y2) - F(x1)F(Ara)l as a measure of dependence between the two r.v.'s. In the above definition of asymptotic independence j(z) provides an upper bound for such a measure ) the of dependence in the case of a stochastic process. lf j(z) 0 as z +z)) +z), subsets Ar(l,,)) and (A'(j), A-(r,, become two (-:-411 independent. A particular case of asymptotic independence is that of m-dependence and -Y(l2) are which restricts j(T) to be zero for all T > n;. That is, would for practice > ln independent expect to be able to find a we rrl. )lj enough' able approximate 'large m so as to be to any asymptotically m-dependent This is equivalent to independent process by an process. small able > assuming that j(z) for z m are so to equate them to zero. as to be An alternative way to express the weakening of the dependence between .Y(l1) and A'(l2) as !lj ra( increases is in tenms of the autocorrelation function which is a measure of linear dependence (see Chapter 7). ,4
,t7-
.
=
=
,
-+
.
.
.
,
.
.
.
-+
,
-(r1)
lcJ
-
-
Dejlnition J
.4 stochastic process (A'(l), r (E T) is said to be asymptotically uncorrelated #' tere exists t; sequence of constants kfp(z), z y 1) dehned by d1, t +z)
(t'(l)r(f+z))''
G P(T),
./r
aIl r 6 V,
sucb tat
0 %p(z) <
l
As wecan see, the sequence of constants z 7. 1) defines an upper bound r(!, r +z). Moreover, given autocorrelation coefficients the of sequence or -f 1 +' -+ and < ptz) p(z) for > 0, a sufficient 0 as z w, is a necessary '.hat z v'-, underlying the White 1984)), intuition < /?(z) condition for the I x (see ( obvious. above definition is
tplz),
-.+
Stochastic processes
ln the case of a normal stochastic process the notions of asymptotic coincide because the dependence independence and uncorrelatedness l1, and -Y(lj) A'(rc) for between tz e:T is completely determined by the any autocorrelation function rtlj r2). This will play a very important role in Part IV (see Chapters 22 and 23) where the notion of a stationary, asymptotically independent normal process is used extensively. At this stage it is important to note that the above concepts of asymptotic which restrict the memory of a independence and uncorrelatedness stochastic process are not defined in terms of a stationary stochastic process but a general time-heterogeneous process. This is the reason why #(z)and ptz) for z > 1 define only upper bounds for the two measures of dependence given that when equality is used in their definition they will depend on (11, Svell 1,,) as as z. r2, general formulation of asymptotic independence can be achieved A more using the concept of a c-field generated by a random vector (seeChapters 4 where (.Y(r), and 7). Let denote the c-field generated by ..Y(1), T) is a stochastic process. A measure of the dependence among the tc elements of the stochastic process can be defined in terms of the events (E and B c Mt by ,
.
.
.
,
,@j
.:-(1)
.
.,4
.
.
,
.@y
,
a(z) suplf'l-gt =
J
c B4 - #(.4)/'4:)1,
(8.25)
Dejlnition 6
,-1stocbastic process (z) 0 as z --.
kf
( c T) .Y(M,
is said to be
(strongly)mixing
1)/-
bz'.
-+
of the asymptotic As we can see, this is a direct generalisation independence concept which is defined in terms of particular events and B related to the definition of thejoint distribution function. In the case where )A'(r), l G T) is an independent process a(z) for z > 1. Another interesting special case defined above of a mixing process is the m-dependent process where a(z) 0 for z > m. In this sense an independent process is a zerodependent process. The usefulness of the concept of an m-dependent process stems from the fact that commonly in practice any asymptotically independent (ormixing) process can be approximated by such a process for ilarge enough' ?'n. A stronger form of mixing, sometimes called unlfrm mixing, can be defined in terms of the following measure of dependence: ,4
=0
=
tp(T)
=
sup
(8,4/,)-#(-4)t,
PB)
>0.
(8.26)
Restricting
#.3
143
memory
Dehnition
,4 stochastic process (Ar(r), t c T) is said to be uniformly mixing :f 0 as
(P(T)
'r
'--.
(f
-+
.
Looking at the two delinitions of mixing we can see that a(z) and @(z)define absolute and relative measures of temporal dependence, respectively. The formeris based on the definition of dependence between two events and B separated by z periods using the absolute measure ,4
E#(z4ro #) 8-4) -
'
#(#)) 10
and the latter the relative measure EPIA/.B)
-
#(X)) > 0.
stochastic stationary processes ln the context of second-order asymptotic uncorrelatedness can be defined more intuitively in terms of the temporal covariance as follows: Xt CovtAXll,
+z))
=
t?(z)
0
-+
A weaker form of such memory restriction is the so-called ergodicity property. Ergodicity can be viewed as a condition which ensures that the p(z) by averaging over memory of the process as measured by time' kweakens
Dehnition 8
-4second-order
stationary
1 r lim -T Zl p(z) zv-vx
process (-Y(l), r e T) is said to be ergodic
=0.
(8.28)
If wecompare (28)with (25)wecan deduce that in the case of a second-order stationary process strong mixing implies ergodicity. The weaker form of temporal dependence in (28),however, is achieved at the expense of a very In modelling we need restrictive form of time-homogeneity (stationarity). and off between them (see there is often a trade both type of restrictions Domowitz and White (1982:. Memory restrictions enable us to model the temporal dependence of a stochastic process using a finite set of parameters in the form of temporal moments or some parametric process (see Section 4). This is necessary in order to enable us to construct operational probability models for modelling dynamic phenomena. The same time-heterogeneity and memory restrictions enable us to derive asymptotic results which are crucial for
Stochastic processes
statistical inference purposes. For example one of the most attractive features of mixing processes is that any Borel function of them is also mixing. This implies that the limit theorems for mixing processes (see Section 9.4) can be used to derive asymptotic results for estimators and test statistics which are functions of the process. The intuition underlying these results is that because of stationarity the restriction on the memory enables us to argue that the observed realisation of the process is typical (in a certain sense) of the underlying stochastic process and thus the time averages of the corresponding probability constitutee reliable estimates expectations.
8.4
Some spial
stochastic processes
The purpose of this section is to consider briefly several special stochastic processes which play an important role in econometric modelling (seePart IV). Thcse stochastic processes will be divided into parametric and nonparametric processes. The non-parametric processes are defined in terms of their joint distribution functions or the first few joint moments. On the other hand, parametric processes are defined in terms of a generating mechanism which is commonly a functional form based on a nonparametric process.
(1)
Non-parametric
processes
The concept of conditional expectation discussed in Chapter 7 provides us with an ideal link between the theory of random variables discussed in Chapters 4-7 and that of stochastic processes, the subject matter of the present chapter. This is because the notion of conditional expectation enables us to formalise the temporal dependence in a stochastic process (.:-(1),t e:T) in terms of the conditional expectation of the process at time t, .:-(1)('the present') given (-(l 1), A'(l -2, (ithe past). One important application of conditional expectation in such a context is in connection with a stochastic process which forms a martingale. .)
-
.
(1)
.
Martingales Dehnition 9 (lfned #( )) and Let )A'(r), t G T) be a stocastic on (S, process Jt oj' oujlelds G T increasing 9$. l e: T) sequence r (f ,, ) an tc the conditions: satisfyinq followinq ,.%
'
.@
,#7t
8.4 Some
special stochastic
Ar(l) is a random
() (ff) (fi)
processes lmriable
zR1x(r)l) (.c. its
(r.t?.)relative rtl
mean is bounde X(l 1), for aII t 6 T. '(-Y(l) 9. t - j) Then (A'(l), t 6 T) is said rtp be a martingale ).f4, t 6 T) and wtl wrflp tAr(l), 4, t e T). < c/s
=
.f/?.
t
for aII r e: T.
for all t c T,. and
-
wirll
respect
to
Several aspects of this definition need commenting on. Firstly, a martingale is a relative concept; a stochastic process relative to an increasing sequence of c-elds. That is, c-fields such that 91 c cLhcu L'it, and each -Y(l) is a r.v. relative to %, t G T. A natural choice for c Vt cu c-fields X(1)), t 6 T. Secondly, the 1), will be gtt c(Ar(l), Xt such value bounded all of xY(l) expected for must be t c T. This, however, implies because Z'(X(l))= stochastic has constant that the mean process jq all c-CE7 of conditional E(.Y(l 1)) 'gF(xY(t))/.@ T for t c by property 1.expectations (see Section 7.2). Thirdly, (iii)implies that .
.
.
.
.
,
.
=
=
-
.
.
.
,
-
E(xY(!+ z) i? t-
1)
=
Xt
-
1) for all l (E T and z y 0.
(8.29)
That is, the best predictor of xY(r+ z), given the information 9't-. 1, is A'(l 1) for any z >0. Intuitively, a martingale can be viewed as a fair game'. Defining .Y(1)to be the money held by a gambler after the rth trial in a casino game (say, black jack) and Vk to be the history' of the game up to time 1, then the because the gambler condition (iii)above suggests that the game is before trial t expects to have the same amount of money at the end of the trial as the amount held before the bet was placed. It will take a very foolish gambler to play a game for which -
%fair'
(iiil'
Eqxltj/.@t
1) .GX(l 1). defines what is called a supermartingale
(8.30)
-
This last condition (tsuper'for the casino?). The importance of martingales stems from the fact that they are general enough to include most forms of stochastic processes of interest in econometric modelling as special cases, and restrictive enough so as to theorems' (seeChapter 9) needed for their statistical allow the various analysis to go through, thus making probability models based on operational. In order to appreciate their generality let martingales consider examples of martingales. extreme two us tlimit
ilargely'
Example 3 Let tZ(r), l c T) be a sequence of independent r.v.'s such that F(Z(r))
=
0 for
146
Stochastic processes
al1 t e: T. If we define .:-(1) by f
z(k),
-Y41) =
k=1
then )X(r),
(8.3 1)
h, r e T) is a
martingale, with 9.t c(Z(M, Z(l - 1), c(A'(r), X(t 1), xY(1:. This is because conditions (i) and automatically satisfied and we can verify that =
.
-
.
.
.
,
.
.
,
z( 1))
=
(ii) are (8.32)
using the properties c-CE9 and 10 in Section 7.2.
Example 4
t 6 T) be an arbitrary' stocbastic process whose tz(!), F(lz(r)j) < for all t e:T. If we define ..Y(1)by that
only restriction is
Let
gz
l
A'(r)
=
E
EZ(k)
-
1)1,
Elzkl/.i
(8.33)
-
k=1
.Y(1:, then c(Z(k), Zk 1), Z( 1)) c(A-(k), .Y(k 1), tA'(r), ?,, t G T) is a martingale. Note that condition (iii)can be verified using the property c-CE8 (seeSection 7.2).
where 9k
=
=
-
,
.
.
,
-
.
.
.
,
The above two extreme examples illustrate the flexibility of martingales very well. As we can see, the main difference between them is that in example 3, aY(r)was defined as a linear function of independent r.v.'s and in example 4 as a linear function of dependent nv.'s centred at their conditional means,
i.e.
F(r)
.E'(Z(l)/Mt
.:-(/)
==
.-.
-
(8.34)
1),
lt can be easily verified that )F(l), r e: T defines martingale dfference process relative to tt because '(F(l),
j
In the case where F(F(l)F(k))
:)
-
=0,
for all
'(.E'(F(r)F(/())
= E gF(k)'(F(f)/f/, That is, (F(l), t e: T) is an orthogonal
rG T
q -
is known as a
(8.35)
f G T.
< x for all .E'(IZ(f)I2) =
what
we can
deduce that for r > k
yj
:)j
=0
sequence as well
(8.36) (seeChapter 7).
special stochastic
8.4 Some
147
processes
Dehnition 10
.4 stochastic process tF(l), t (E 7-) is said to be a martingale diffrence c c process relative to the increasing sequence of c-.l/l.s .f4
j.j Q.y (uu F41) is a nn, relative to (f) '(lF(r)!) < Jz'; and (if) f)F(r) . , - 1) 0, t c T. (fff) .
(uu
.
.
.
.
.
%;
=
Dehnition 11
-4 stochastic process )F41), t (EFT' ) is said to be an innovation process c(-(l), Xt 1), #- it is a martinqale dierence wfr/l respect to .f4
=
.
.
. (i) (f)
,
xY(0),where '(1
)
-:-(1)
=
1'(r)I<
2)
'(y(r)F(z))
Js; =
.,
0, t >
-
() F(.j), and
z,
r, z c T.
These special processes related to martingales will play a very important role in the statistical models of interest in econometrics in Part 1V. Returning to the main difference between the two examples above we can elargely' equivalent to that see the independence assumption in a sequence is of orthogonality in the context of martingales. It will be shown in Chapter 9 that as far as the various limit theorem results are concerned this Trude tlaw of large numbers' equivalence' carries over in that context as well. The limit theorem' results for sequences of independent nv.'s and the can be extended directly to orthogonal martingale difference processes. Martingales are particularly important in specifying statistical models of interest in econometrics (seePart 1V) because they enable us to decompose any stochastic process (Ar(1),t G T) whose mean is bounded for all t G T into two orthogonal components, p(r) and u(r), called the systematic and nonsystematic components, respectively, such that icentral
.X'(1)
=/-t(r)
+
141),
(8.37)
and u41) xY(1) '(A'(r) a't Some c-field 9t where /t41) E(X(t)/.% defining our sample information set. The non-systematic component u(l) defines a martingale difference and thus all the limit theorem results needed for the statistical analysis of such a specification are readily available. ln view of the discussion in the last two sections on time-homogeneity and memory restrictions, the question which arises naturally is to what extent martingales assume any of these restrictions. As shown above, martingales are first-order stationary bcause 1)
=
=
F(aY(l)) .E'4-Y(1 1)) p, =
-
=
-
for a11 t 6 T.
1), fOr
(8.38)
Stochastic
processes
Moreover, their conditional memory is restricted enough to allow us to define a martingale difference sequence with any martingale. That is, if tA-(r). t e: T) is a martingale, then we can define the process F(r)
=
xY(r)- X(t - 1),
F(0)
=
(8.39)
A-(0),
which is a martingale difference and A'(!) j () FU).ln the case where the martingale is also a second-order process then F(f), r e: T) is also an innovation process. In Chapter 9 it will be shown that an innovation process behaves asymptotically like an independent sequence of random variables; the most extreme form of memory restriction. =
-
t
(2)
Markov processes
class of stochastic processes is that of Markov These processes are based on the so-called Markov property that processes. is independent of the tthe future' of the process, given the Another
important
'past'.
'present',
Djlnition
12
-4 stocbastic pl-otr'ss t-Y(r), r e: T) is said rtpbe a s'Iarko: process /??-ever #, Borel yncrtpn (.Y(r))e: (tbe future')such that ,.;d;'
#'
< 'j/2(A''(r))I :yz
A((A'(l)),
.@%
c(A-(r): a w/-ltlrp,Mb J =
ln particulara in the case suggest that f(-Y(r + z)/,.df-
.,)
=
,
.)
F44-(1)) /'c(A'(r
=
< t<
where
f-'tr
/?).(Note: ,#L
,g:
-
1)), is past' p/lf.s
(8.40) ipresent'.)
/l(A-(r)) X(t + z), the Markov property =
+ z)/tr(A'(!)),
(8.41)
An alternative but equivalent way to express the Markov property is to 1 define the events B c e: t 1 and state that .4'-
.4
.4cE
+
LxL:,
#(X
r-h B Jdff)=
#(X
Jdf)
'
#(S .Xj).
lt is important to note that the Markov property is not a direct restriction on the memory of the process. lt is a restriction on the conditional melnory of the future of the process the present provides the process. For all the relevant information. A natural extension of the Markov process is to allow the relevant information to be m periods into the past. tpredicting'
8.4 Some
spial
149
stochastic processes
Dejlnition 13 <
Marko:
prtptrtlsy.taY(l),t ( Tl is said to be vth-order
.,4stocbastic
f(IX(r)l)
('
and
:y-
.2?d--.1,)
.E4..Y41)
.E'(.X'(r)c(.Y(r
=
1),
-
,
.
.
,
(8.43)
Xt - n'l))).
ln terms of this definition a Markov process is a first-order Markov. The rnth-order Markov property suggests that for predicting X(l) only the irecent past' of the process is relevant. ln practice a Markov process is commonly supplemented with direct restrictions on its memory and time-heterogeneity. In particular, if an rnthorder Markov process is also assumed to be normal, stationary and asymptotically independent we can deduce that '(A'(r)
.?f--,1,)
=
- 2) +
a1-Y(l - 1) + xzzrt
'
'
'
(8.44)
+ amhjt - rn),
'
a.) 0 lie inside the and the roots of the polynomial km unit circle (seeAR(m) below). This special stochastic process will play a very important role in Part lV. xk;.m -
-
(3)
'
-
'
'
=
-
Brownian motion process
A particular form of a Markov process with a long history in physics is the so-called Brownian motion (or Wiener) process. Dqjlnition 14
.4 stochastic process .X/1), r iE T) is called a Brownian motion #( )) (J process, dfned on (, (J) 0, .Y41) for r 0 the process at 0,' a con,entionj; independent A'(r) stationar-b' increments, i-e. is t' wf (la) process for 0 < 11 G tz % G r,,, tbe increments (-Y(lj) - Xti j)), f
,%
.
-srlrf.s
=
=
'
i
=
1, 2,
.
.
.
,
.
'
n, are independent
.E'(.Y(?,.) - Arlrf-
)) 0, =
1
ti 1 ) 1zz41,, -
czlrj
=
- ti
1),.
the increments Xtrfl Artrf 1 ), i 1, 2, distributed. This implies rt7l tbe densitv =
-
/'(.v 1
t1
-v,,,'
,
.
.
.
,
,
-
r'.r.'ss such tbat
.
.
.
,
.
.
.
,
n, are normally
Jnction is
tn) -.i
=
1
---=-
c
N
.
CXn '-
/(27:)lj
x2 1
2c221
(;
,,
f
jw
c
-.
c
) (2zr)
ti -
1
Stochastic processes (.Xf
X
exp
-
1)2
xi
2c2(/.f /.j-
-
'
-
.
j)
ln the case where (7.2 1 the process is called standard Brownian motion. lt is not very difficult to see that the standard Brownian motion process is both a martingale as well as a Markov process. That is, (.Y'(1),l G T) is a .Y(z), z %r. Moreover, martingale with respect to 4t- since '(xY(l)/,?Fsince E(X(t)/@% f'(A'(r)/c(A)z)))it is also a Markov process. Note also that Eg(-Y(l) Xzjllj@z. (1 z), t % z. =
.)
(x)
.)
=
=
.1
-
=
-
Dejlnition 15
,4 stochastic
(f) ()
process '(l/(r))
-
JtM4,r g T ) is said
0,'
J;(l/(r)lg(z)) -
z
t
if t (f t+ =
to be a white-noise process
#'
z z
r.v.'s). (uncorrelated
Hence, a white-noise process is both time-homogeneous, in view of the fact that it is a second-order stationary process, and has no memory. In the case where (?.(r), t e: T) is also assumed to be normal the process is also strictly
stationary.
Despite its simplicity (or because of it) the concept of a white-noise process plays a very important role in the context of parametric time-series models to be considered next, as a basic building block.
(11)
Parametric
stochastic processes
The main difference between the type of stochastic processes considered so far and the ones to be considered in this sub-section is that the latter are defined in terms of a generating mechanism', they are in some sense stochastic processes. 'derived'
(4)
Autoregressivenjirst
order
(z4#(1))
The AR( 1) process is by far the most widely used stochastic process in econometric modelling. An adequate understanding of this process will provide the necessary groundwork for more general parametric models such as AR(m), MA(m) or ARMA(p,t?) considered next. Dejlnition 16
.4 stochastic
process
(x(r),!GT)
s said to be autoregressive
of
8.4
Some spial
stochastic
processes
order one (,4R( 1)) lf it satisjles the stochastic difference equation, .-41)=xX(t
1) + u(1),
-
wllcrp u is a con,stfznf flnt
l1(f)
(8.45) is a white-noise
process.
The main
difference between this definition and the non-parametric definitons given above is that the processes )-Y(r), r c T) are now defined indirectly via the generating mechanism GM (45).This suggests that the properties of this process depend crucially on the properties of u(l) and the structure of the difference equation. The properties of /(r) have already been discussed and thus we need to consider how the structure of the GM as a non-homogeneous difference equation (seeMiller (1968)) determines the properties of .t-Y(l), t s T), T )0,1, 2, Viewing (45) as a stochastic non-homogeneous first-order difference equation we can express it (by repeated substitution) in the form .).
=
.
.
/-1
.:-41) a'Ar(0) + =
i
)( xiult =
f).
-
(8.46)
0
This expresses X(t) as a linear function of the white-noise process )u(l), Using this form we can deduce certain properties of the r s T) and AR( 1) such as time-homogeneity or and memory. ln particular, from (46) we can deduce that ..(0).
F(xY(f))
=
fE'(-Y(0))
(8.47)
and +z)) '(A-(t)aY(l f- 1
a'?r(0) A-
l
== .
i
) =
liut
)
--
0
a'+1?f(0) ?-
j i 0
xiut A-z
--
i4
=
f-1
=
r+ z - l
')a2+'1xY(0)2)+ E i
) =
l+r-1
autr
-
0
j) i
j0
aijtr +z
-
jj
=
(8.48) This shows clearly that if no restrictions are placed on .Y(0) or/and a the AR(1) process represented by (45)is neither stationary nor asymptotically independent. Let us consider restricting and a. .1(0) in (46) represents the initial condition for the solution of the stochastic difference equation. This seems to provide a solution for the difference equation (seeLuenberger (1969)) and plays an important role in the determination of the dependence structure of stochastic difference .40)
bunique'
Stochastic processes
equations. If form
that .Y(0)
we assume
for simplicity,
=0
(47)and (48)take the
(8.50)
As we can see, assuming .140) does not make the stochastic process T) generated (..'.t-41), by (45),stationary or asymptotically independent. tG as The latter is achieved by supplementing the above initial condition by the < 1 we can coefficient restriction lal< 1. Assuming that .Y(0) 0 and Jal deduce that
=0
=
J)-Y(r).Y(r + z))
1
clul
=
(:t
-
2f
0
-+
1
-
a
2
(8.j1)
and thus )A-(l), l e: T) is indeed asymptotically independent (but not stationary). For (a1 > 1 the process is neither asymptotically independent stationary. nor For the stochastic process (xY(!), t e: T), as generated by (45),to be stationary we need to change the index set T to a double infinite set T* (0, That is, assume that the process X(t) stretches back to the u!r1, + 2, infinite remote past (a convenient fictionl). The stochastic process (.,:71), t 6 T*) with GM (45)can be expressed in the form: =
.).
.
.
f(AX0)) 0, =
X
'(A-(l).Y(r+ z))
=
)
E i
=
X
liult
-
i)
0
j Vult + z j 0 =
X
= Hence, for
c2
< 1 the stochastic 1(z1 and asymptotically
stationary
i
j0
-jj
C&
aiaf +1
=
a2
c2a1 i
=
=
0
5
(8.54)
process )A'(r), r c T*) is both second-order independent with autocovariance function
2
?-?(z) =
-
(1
-
z
-
)
(8.55)
a',
az, and the process is not even On the other hand, for k: 1, (:/7.0 a2) f(tAr(r)t2) order since second is not bounded.
)(
-+
stocllastic processes
Some spial
8.4
The main purpose of the above discussion has been to bring out the importance of seemingly innocuous assumptions such as the initial conditions (Ar(0) 0, Ar(- F)-+O as T-+ :yz ), the choice of the index set (T or T*) and the parameter restrictions ()al 1), as well as the role of these assumptions in determining the properties of the stochastic process. As seen above, the choice of the index set and the initial conditon play a very important role in determining the stationarity of the process. In particular, for the stochastic process as defined by the GM (45)to be stationary it is necessary to use T* as the index set. This, however, although theoretically convenient, is an unrealistic presupposition which we will prefer to avoid if possible. Moreover, the initial condition .Y(0) 0 or .Y( T) 0 as T :yz is not as innocuous as it seems at first sight. -Y(0) 0 determines the mean of the process in no uncertain terms by attributing to the origin a very special status. The condition A'( T) 0 as T :x- ensures that A'Ir)can be < 1. expressed in the form (52)even without the condition modelling IV) it is interesting For the purposes of econometric (seePart 1, 2, in relation to to consider the case where the index set is T )0, above, in this case kfxYtfl, stationarity and asymptotic independence. As seen restriction #40) 0, under second-order T) stationary even the is not l iE asymptotically < 1. Under the same conditions, however, the process is independent. The non-stationarity stems from the fact the autocorrelation function t?(f, t +z) depends on t because for z 0, =
-..
--+
=
-
=
-+
-+
-
Ial .)
=
.
.
=
Ial
/1
+ ,t
1')
1
J2aT =
.
- tz
j
lt
c
y
-
(8.56)
,
This dependence on 1, however, decreases as t increases. This led Priestley (198 1) to introduce the concept of asymptotic stationarity which enables us to approximate the autocorrelation function by
t,tr,t +z)
::x:
c2(g
-#
1
(8.57)
,
1
At this stage, it is interesting to consider the question of whether, instead it by of postulating (45)as a generating mechanism, we can actually imposing certain restrictions on the structure of an arbitrary stochastic stochastic' process )A'fr), t c T) is: (i)normal, process. lf we assume that the (ii) Markov and (iii)stationary then lderive'
f)Ar(l)/f/
t- j
)
=
fJ(A'(r)/c(X(l
= aA'tr
-
1/)
-
(8.58) (8.59)
1),
.-40)). The first equality stems from c(A'(r - 1), Arlf - 2), where 1 the Markov property, the linearity of the conditional mean from the .%
=
-
.
.
.
,
Stochastic processes
normality and the time invariance of a from stationality. In order to ensure < 1 we need to assume that the process is also (iv)asymptoticall), that IaI independent. Defining the process u(r) by u(r) A-(r) Elxtlj.,Lit't =
-
1
),
N(0)
-Y40),
=
(8.60)
we can construct the GM A'(r)
yX(t - 1)+ u(r),
=
(8.61)
where lk(1) is now a martingale difference orthogonal process relative to This is because by construction Elutl/t 0 and for t > k, j - ) 1q) 0 c-CE8 of 7.2. by Moreover, Section the process .E'(FElg(r)lt(/()/f4 t(u(l), ! G 5-) ean be viewed as a wllilp-noisc process or an innosation process relative to the information set ? because c,#t.
'(u(r)l(k))
=
=
=
,
E'(u(r))
'(.E'(lf(l) f. t
=
'(l,/(f)2)FtF(I#l)2/.(#()
- 1
f(-Y/)
=
0-,
(8.62)
j))
=
=
)
-
-a/.E'(-YJ- trx2(1 -af) 1)
=
c2,
=
(8.63)
say. This way of defining a white-noise process is very illuminating because it brings out the role of the information set relative to which the white-noise process is not predictable. ln this case the process (l(1),t t5 T). as defincd in That is, ?.(r) contains (60) is white-noise (or non-predictable) relative to no systematic' information relative to tt. This, however, does not preclude the possibility of being able to predict u(r) using some other information set .@)= Lh with respect to which l1(f) is a random variable. To summarise the above argument, if :A'(r), t e:T) is assumed to be a normal, Markov, stationary and asymptotically independent process, then the GM (61) follows by The question which naturally arises at this point is that (61)seems identical to (45) what happens to the difficulties related to the time dependence of the autocorrelation function as shown in (50)75 The answer is arises stationarity of given that the difficulty never because the we can derive its autocorrelation function by ...'
,.
tdesign'.
kgiven
'(r)
tl(T)
=
EX(t)Xt
t
= E @.Y(r
-
z))
1) + !(r.))aY(r z) ) -
=
az
-
1),
since '(l/(r).Y(r z)) 0. This implies that -
t(z)
=
(f/0).
=
(8.65)
8.4 Some
spial
stochastic
processes
Moreover,
d0)
f'(A-(r)Ar(l)) Exxt
- 1).:-(8) + /)tI(1)-Y(1)) =ar(1) + c2 a2tj0) + c2.
=
=
=
Hence, G
d0)
2
=
(F
and
-
(1 a)
tz
2
=
'
'
-
(1
-
-aZ)
(8.67)
af.
tdesigned' This shows clearly that in the case where the GM is no need to change to asymptotic stationarity arises as in the case where (45) is postulated as a GM. What is more, the role of the various assumptions becomes much easier to understand when the probabilistic assumptions are made directly in tenns of the stochastic process tAr(r), t c: T) and not the
white-noise process. (5)
mth-order
Autoregressive,
(z1#(-))
The above discussion of the AR(1) process generalises directly to the AR(/# process where mb 1. Dejlnition 1 of order -4stochastic process t-Y(l), t G T) is said to be autoregressiYe lf satisjles the stochastic (v4R(m)) it equation dterence m Ar(r) alArtl =
1) + xzxt
-
- 2) +
'
'
.
+ xmzrt
1.n)
-
+ u(l),
(8.68)
where aj, a2, xmare constant and u(r) is a white-noise process. For the discussion of (68)viewed as a generating mechanism (GM) it is convenient to express it in the 1ag operator notation .
a(f)x(l)
=
.
,
.
(8.69)
/.k48,
-xzLl -am1r) with 1tX(t) Xt k),k> 1. The whereatLl 1 T* of equation 1, for 2, the difference + :I: can be solution (24) (0, writtenas =(
.-ajf-
.
.
.
=
-
.)
=
.71)
=
g(r) + x -
.
.
'(f-)l/(r).
(8.70)
g(1) c12l +c22@ + + c,,,2tmis the so-called general solution which is 2m) of the expressed as a linear combination of the roots (21,22, polynomial '
=
.
.
.
txm
-
ajx'l -
1
-
.
.
. -
aml
=
.
.
,
(8.71)
0
(assumed to be distinct for convenience) with the constants cj, c2,
.
.
.
,
cm
Stocbastic processes
being functions of the initial conditions particular solution is the second component
.:-40),
-Y(1), xtn1 1). The of (70)and takes the form: -
.
.
.
,
(8.72) where 7 t -F- 1 7 () (7
0
=
:
7,,,+
a j ),,,,
+
7,. vz
7,,,+z
l
+
j
-
+
1
-
.
'
+
'
'
'
myfl
+
'
(8 3) .7
.
0
=
0
m7,=
z
0, 1, 2,
=
.
.
.
ln the simple case m 1 considered above a(fa) (1 - xL), qt) The restriction (21 a), j= a/,.j 0, 1, 2, IaI< 1 in the AR( 1) case is now extended to al1 the roots of (71), i.e. =
=
=
=
I2f I<
1,
.
i
=
1, 2,
.
.
.
.
A-(0)aJ
=
.
.
,
(8.74)
n1.
That is, the roots of the polynomial (71) are said to lie within the unit circle. Under the restrictions (74) the general solution goes to zero, i.e.
(/(1) 0 ->
and the solution of the difference equation (68) can be Fritten as X
.Y4/.) =
-.j). )- o?.#l j 0 =
This form can be used to detennine the first two moments stochastic proccss )A'(r), l (E T*). In particular &.Y(r))
=
0,
'(Ar(r)A'(l+ z))
=
E j
=
l ),yl(l
-jj
0
pfutl + i
=
of the
)
': -
0
X'
= Z(, j
'jb-vz
'2.
-
tUz'
This is bounded when )?y?) < ts a condition which holds when the () of polynomial 1) within the the unit circle (seeMiller (1968/.ln a roots (7 lie conditions the sense (74)ensure that the stochastic process tA-(r), r (E T*) as generated by (68) is both second-order stationary and asymptotically independent because the condition (.l- (, -yJ?) < implies that t?('r) 0 as ,
'w.
T
-->
%;
-+
.
As in the case of an AR( 1) process when the index set T
=
)0,1, 2,
.)
.
.
is
Some spial
8.4
stochastic processes
used instead of T* )0,EE 1, +2, we run into problems with the secondT), EE of which stationarity order t we can partly overcome using the (.:-41), will not be pursued any further asymptotic stationarity. This of concept AR( argued in 1) when the the parametric model is not because, as case, making the necessary assumptions GM but by postulated as a T) tA-(l), of such problems arise. ln particular, if we in directly terms t e: no mth-order normal, Markov, (iii)stationary the process is (i) (ii) assume that independent asymptotically then and (iv) .)
=
.
.
idesigned'
E'(.Y(l) c(.X(l - 1),
.
.
.
,
.Y(0))) F(A-(r) c(Ar(l - 1), =
.
.
.
,
.Xtr
-
n'l)))
= i j 1 afzYtr-).
(8.78)
=
The first equality stems from the mth-order Markov property. The linearity of the conditional mean is due to the normality, the time invariance of the afs is due to the stationarity and asymptotic independence implies that the roots of the polynomial (71) 1ie inside the unit circle. lf we define the X(0)), t e T, and the increasing sequence of c-fields 9L tr(.Y(r), A-t 1), u(r), T) by t c process =
-
.
.
.
,
t
utrl-xtr)- Ex/'u
k-
(8.79)
1)-
fk, l 6 T) is a martingale difference, orthogonal (an we can deduce that )?,f(r), the AR(m) GM as in (68) innovation) process. This enables us to from rst principles with u(1) being a white-noise process relative to the information set %. The autocovariance function can be defined directly, as in the AR(1) case by multiplying (68)with X(t -z) and taking expectations to yield Sdesign'
tdesigned'
F(-Y(f)A'(r -z)) >
=
aj '(Ar(r- 1)Ar(l-z)) +
'
'
'
+ E(l(l)A-(t -z)),
(8.80)
(8.81) satisfy the same difference Hence, we can see that the autocovariances equation as the process itself. Similarly, the autocorrelation function takes the form
(8.82) The system of equations for z 1, 2, m are known as Flg/t?-lzl//k!r which play important role in the estimation of the coefticients equations an The relationship 198 1)). between the a. (see Priestley ( tz1 aa, =
,
.
.
.
,
.
.
.
,
Stochastic processes
autocorrelations and the asymptotic independence of )A'(r), t c T) is shown most clearly by the relationship z
=0,
1, 2,
.
.
.
(8.83)
,
viewed as a general solution of the difference equation (82).Under the 1, 2, < 1, i restrictions m (impliedby asymptotic independence) we can deduce that
I2fI
r(z) (6)
-+
=
.
.
.
,
0
(8.84) (4f-4)processes
Moving average Dejlnition 18
The stochastic process taY(r), t (5 T) is said to be a moving average process of order k (M,4(k)) 4- it can be expressed in the form
(8.85) where :1, bz, bk are constants and t e:T) is a white-noise the white-noise used to build the process process is process. Ttzr is, (aY(?), t c T), beinq a linear combination (#' the Iast k ?,k(r ils. Given that f .Y(f), t c T) is a linear combination of uncorrelated random variables we can deduce that .
.
tlt(r),
,
.
-
(8.86) 0 tKz %;k z> k
(8.87)
o/zGk (8.88) (!?() 1). These results show that, firstly, a MA(k) process is second-order stationary irrespective of the values taken by /?1 bz, bk, and, secondly, after k its autocovariance and autocorrelation functions have a periods. That is, a MA(k) process is both second-order stationary and kcorrelated (r(z) 0, z > k). ln the simple case of a MA(1), (85)takes the form =
,
.
.
.
,
tcut-off'
=
..(1)
=
Ik(1)
+
/71u(t
1),
-
(8.89)
with t7(0) =
r( 1)
=
(1+ hf)c2, 1)1 +hf)
(1
,
p(1)
r(z)
=
=
0,
#1c2,
r(z)
=
0,
(8.90) (8.9 1)
8.4 Some
special stochastic
processes
As we can see, a MA(k) process is severely restrictive in relation to timeheterogeneity and memory. It turns out, however, that any second-order independent stationary, asymptotically process )-Y(r), t e: T) can be as a MA(:ys ), i.e.
Kexpressed'
x(r)
)( bjut
=
-j),
j=
bjb)< (yo and t (E T). is an innovation process. This result where (N.)'o constitutes a form of the celebrated l4b/J decomposition theorem (see Priestley ( 198 1)) which provided the theoretical foundation for MA(k) and ARMNP, q) processes to be considered next. The MA(aa) in (92) can be constructed from tirst principles by restricting the time-heterogeneity and the memory of the process. If we assume that ).Y(r), l G T) is (i)second-order stationary, and (ii)asymptotically uncorrelated, then we can define the innovation process )u(r), r e:T) by .tu(!),
l(r)
A'(r)
=
EX(t)(%-
-
where fh c(A-(r), A'(1 1), c(u(l), u(l to deduce that 7.2) =
1),
-
.
.62,
=
.
.
(8.93)
.:70)). Asymptotic independence enables us u(0)) and thus by c-CE6 (seeStion - 1), ,
.
.
.
,
A-(r) S(aY(r)/f4).
(8.94)
=
tl4l),
Given that t G T) is an innovation process (martingaledifference, orthogonal process), it can be viewed as an orthogonal basis for %. This '(-Y(r) enables us to deduce that can be expressed as a linear combination of the l41 i.e. -4)
-.j)s,
.E'(-Y(1).@f )
=
j
bjut -j4,
(8.95)
j=O
from which (92)follows directly. ln a sense the process :u(1), t G T). provides blocks' for any second-order stationary process. This can be the seen as a direct extension of the result that any element of a linear space can be expressed in terms of an orthogonal basis uniquely, to the case of an infinite dimensional linear space, a Hilbert space (see Kreyszig ( 1978/. The MA(k) process can be viewed as a special case of (92) where the uncorrelatedness assumption of asymptotic is restricted to kcorrelatedness. ln such a case )Ar(r), l G -F) can be expressed as a linear u(? -k). function of the last k orthogonal elements l(l), l41 - 1), ibuilding
,
(7)
Autoregressive
.
.
,
moving avevage processes
As shown above, any second-order
stationary, asymptotically
uncorrelated
Stochastic processes
process can be expressed in a MA(az) form %)
.Y(r)
)
=
bjut
.-jj,
j=
where ()5t () %)< c/s and lg(r), r G T) is an innovation process. Such a representation, however, is of very little value in practice in view of its nonoperational nature. The ARMAIT,q) formulation provides a parsimonious, operational fonn for (96).
t
Dejlnition 19
,4 stochastic process .tvY(l), t e T) is said to be an autoregressive moving average process of order p, q (,4RMz4(p, q)) 4- it can be expressed frl the tlrm' .Y(f) + aI-Y(l
1) +
-
'xvxl.t '
'
+
.
-
p)
=
l/(r)
+
htutr
1) +
-
+ bqult q) -
1p, /71 bz, where a1, ac, white-noise process. .
.
.
,
,
.
.
.
,
bq,are
constants
and
.ju(l),
'
'
'
(8.97)
t g T) is a
In order to motivate the ARMAIP, qj formulation as an operational form of the MA(cc)) representation (96)let us express the latter in terms of the 1 + bj L + bzl? + infinite polynomial )*(f-) =
.
.
.
.:-(1) 47*(fa)u,.
(8.98)
=
Under certain mild regularity conditions b*L) can be approximated ratio of two finite polynomials (seeDhrymes ( 197 1:, b*L)
bq(L) =
=
xp(L)
(1+ lljfz + bzl? + (1 + a jfa + ac1a2+
.
.
.
-
+
bq) ,
.
.
.
+ uplup)
For large enough p and q, b*(L) can be approximated accuracy. Substituting (99)back into (98)we get
xpl-z-
=
k(1a)l/(r),
by the
(8.99) to any degree of
(8.100)
which is an ARMAIP, q) model. This is an operational form which is widely used in time-series modelling to provide a parsimonious approximation to second-order stationary processes. Time-series modelling based on ARMAIP, q) formulation was popularised by Box and Jenkins (1976). The ARMAIP, q) formulation (97)can be viewed as an extension of the part of the ARIrn)representation in so far as the non-homogeneous difference equation includes additional terms. This, however, makes no difference to the mathematical properties of (97)as a stochastic difference
8.4 Some
special stochastic processes
equation. ln particular, the asymptotic independence of the process depends only on the restriction that the roots of the polynomial x (2) (2#+ =
#
1
a12#-
+
'
'
'
+ aP) .
=
0
(8.101)
lie inside the unit circle. No restrictions are needed on the coefficients or the roots of bql. Such restrictions are needed in the case where an AR((x)) lie inside formulation of (97)is required. Assuming that the roots of bpli the unit circle enables us to express the ARMAIP, q) in the form =0
(8.102) This form, however, can be operational only when it can be approximated enough' m. The conditions on a/2) by an AR(m) representation for stability commonly known and those on k(2) as conditions as are invertibility conditions (seeBox and Jenkins ( 1976)). The popularity of ARMAIP, q) formulations in time-series modelling stems partly from the fact that the formulation can be extended to a stochastic processes; the so-called particular type of non-stationary homogeneous non-stationarity This is the case where only the mean is time dependent (the variance and covariance are time invariant) and the time change is local. ln such a case the stochastic process tZ(l), t G T) exhibiting such behaviour can be transformed into a stationary process by differencing, i.e. define 'large
-Y(f)
=
( 1 - f-)dz(l),
(8.103)
where J is some integer. For J=0, A'(l)=z(l); for J first difference and, for J 2,
=
ls Ar(r) Z(r) -Z(r =
-
1);
=
-:-(1) Z(l) - 2Z(l =
-
1)+Zt
-2).
(8.104)
Once the process is transformed into a stationary one the ARMAIP, q) fonnulation is used to model *-(8. ln terms of the original model, however, the formulation is ap(f,)(1 - L)dZ(t)
=
$(f)l(l),
(8.105)
which is called an ARIMA (p, J, /)-, autoregressive. integrated moving average, of order p, J, q (see Box and Jenkins ( 1976)). ln the context of econometric modelling the ARIMA formulation is of limited value because it is commonly preferable to model non-stationarity as part of the statistical model specification rather than transform the data at the outset.
Stochastic processes
8.5
Summarr
The purpose of this chapter has been to extend the concept of a random variable (r.v.) in order to enable us to model dynamic processes. The extension came in the form of a stochastic process (Ar(r), t 6 T) where .:71) is dened on S x T notjust on S as in the case of a r.v.; the index set T provides the time dimension needed. The concept of a stochastic process enables us x,,; 0), 0 6 0) to extend the notion of the probability model *= discussed so far to one with a distinct time dimension
t/xl,
%4,#! 6 Or' t 6 T1. * l'./'(x(l);
.
.
.
,
(8.106)
=
This, however, presents us with an obvious problem. The fact that the unknown parameter vector 0t, indexing the parametric family of densities, their values from depends on t will make our task of (commonly) a single sample realisation impossible. In order to make the theory build upon the concept of a stochastic process manageable we need to impose certain restrictions on the process itself. The notions of asymptotic independence and stationarity are employed with this purpose in mind. Asymptotic independence, by restricting the memory of the stochastic process, enables us to approximate such processes with parametric ones which reduces the number of unknown parameters to a finite set 0. Similarly, stationarity by imposing time-homogeneity on the stochastic process enables us to use timeindependent parameters to model a dynamic process in a is to reduce the equilibrium'. The effect of both sets of restrictions model 106) probability to ( testimating'
tstatistical
*=
(.J(x(l);0), 0 G (1),t e:T).
(8.107)
This form of a probability model is extensively used in Part IV as an important building block of statistical models of particular interest in econometrics. Impovtant concepts
Stochastic process, realisation of a process discrete stochastic processes, distribution of a stochastic process, symmetry and compatibility autoproduct, autocorrelation, restrictions, autocovariance, crossfunctions, normal stochastic process, covariance and cross-correlation vector stochastic process, lth-order process, time-homogeneity and memory of a process, stlict stationarity, second-order stationarity, nonasymptotically independent stationarity, homogeneous non-stationarity, m-dependent and uncorrelated process, strong mixing, processes,
8.5 Summary
163
Markov ergodicity, asymptotic independence, nth-order process, martingale, martingale difference, innovation process, Markov property, Brownian
motion process, white-noise, parametric and non-parametric processes, AR( 1), AR(m), initial conditions, stability and invertibility conditions, MA(m), ARMAIP, (3), ARIMAIP, J, q).
Questions What is the reason for extending the concept of a random valiable to that of a stochastic process? Define the concept of a stochastic process and explain its main
4. 5. 6.
components. 'Ar(.s,r) can be interpreted as a random valiable, a non-stochastic function (realisation) as well as a single numben' Discuss. Wild fluctuations of a realisation of a process have nothing to do with its randomness.' Discuss. How do we specify the structure of a stochastic process? Compare the joint distribution of a set of n normally distributed independent r.v.'s with that of a stochastic process (.:-(0, t E: T) for (r1, ;,,) in terms of the unknown parameters involved. l2, Let (Ar(r), l c T) be a stationary normal process. Define its joint < l,, and explain the effect on the unknown distribution for r < t, parameters involved by assuming (i)m-dependence or (ii)mth-order .
.
.
,
.
8.
9. 10.
14.
.
.
,
Markovness. (xY(r), l e: T) is a normal stationary process then: asymptotic independence and uncorrelatedness', as well as (i) strict and second-order stationarity, coincide.' (ii) Explain. Discuss and compare the notions of an m-dependent and an mthorder Markov process. Explain how restrictions on the time-heterogeneity and memory of a stochastic process can help us construct operational probability models for dynamic phenomena. restriction notions of asymptotic Compare the memory Fnth-order independence, asymptotic uncorrelatedness, Markovness, mixing and ergodicity. Explain the notion of homogeneous non-stationarity and its relation to .4RlMA(p, J, q) formulations. Explain the difference between a parametric AR(1) stochastic process tdesigned' non-parametric AR(1) model. and a Define the notion of a martingale and explain its attractiveness for tlf
modelling dynamic phenomena.
Stochastic
processes
15. Compare and contrast the concepts of a white-noise and an innovation process. The AR( 1) process is a Markov process but not a martingale unless we sacrifice asymptotic independence.' Discuss. 'T'he AR( 1) process defined over T tfo, 1, 2, is not a second-order stationary process even if < l.' Discuss. .)
=
tAny
19.
second-order
.
.
lal
stationary and asymptotically uncorrelated stochastic process can be expressed in MA(cfs) form.' Explain. Explain the role of the initial conditions in the context of an AR(1)
PI-OCCSS.
20.
Explain the role of the stability conditions
in the context of an AR(m)
PI-OCCSS.
22.
b'T'heARMAIP, q) formulation provides a parsimonious representation for second-order stationary stochastic processes.' Explain. Discuss the likely usefulness of ARIMAIP, q) formulations in econometric modelling. Additional references
Anderson (197 1); Chung ( 1974); Doob ( 1953.4;Feller ( 1970),' Fuller ( 1976); Gnedenko ( 1969); Granger and Newbold (1977); Granger and Watson ( 1984),' Hannan ( 1970); Lamperti ( 1977); Nerlove et (11. ( 1979); Rosenblatt ( 1974); Whittle ( 1970); Yaglom ( 1962).
C H A PT E R 9
Limit theorems
9.1
The early Iimit theorems
The term
tlimit
theorems' refers to several theorems in probability theory under the generic names, of large numbers' (LLN) and limit theorem' (CLT). These limit theorems constitute one of the most important and elegant chapters of probability theory and play a crucial role in statistical inference. The origins of these theorems go back to the seventeenth-century result proved by James Bernoulli. tlaw
Scentral
Bernoulli's theorem Let S', be the number q occurrences t#' an cpt?nr in n independent trials t? (1 random experiment rsi and p #4z1) is the probabilitq' 0)' iv each of rf? trials. F!n jr t/ny s > 0 occurrence .,4
=
.f
.4
lim Pr ->
,1
(c
S',
-p
--Fl
< z
=
1,
i.e. tbe Iimit t#' the probabilitq' q/' the event $((S,,,/n) J?)l
--+
.:x;.,
(9.2)
166
Limit theorems
These two results gave rise to a voluminous literature related to the various ramifications and extensions of the Bernoulli and De MoivreLaplace theorems known today as LLN and CLT respectively. The purpose of this chapter is to consider some of the extensions of the Bernoulli and De Moivre-Laplace results. In the discussion which follows emphasis is placed on the intuitive understanding of the conclusions as well as the crucial assumptions underlying the various limit theorems. The discussion is semi-historical in a conscious attempt to motivate the various extensions and the weakening of the underlying assumptions giving rise to the results. The main conditions underlying the Bernoulli and De Moivre-l-aplace results are the following: (LTI) variables Sn Z)'= 1 Xi, that is, & dejlned as the sum of n random (r.n.'#. LT2) Xi 1, (' occurs, and Xi 0, otherwise, i 1, 2, n, i.e. the Xis is a binomially distributed r.t?. are Bernoulli r.v.'s and hence LT3) A-j Xz, -Y,, are independent r.17.'y. /'(.X1) =(.X2) =(.X,,), (faF#) i.e. A'l, X2, xY,, (lre identically wl/l Prlxi 1)= p, Prlzri distributed 1 -p for i 1,2, n. Est,/nl p, i.e. wc consider the event of the dterence between a r.r. value. and its expted The main difference between the Bernoulli and De Moivre-l-aplace theorems lies in their notion of convergence, the former referring to the convergence of the probability associated with the sequence of events IE(&/n) < s and the latter to the convergence of the probability associated with a very specific sequence of events, that is, events of the form (z %z) which define the distribution function F(z). ln order to discriminate between them we call the fonner convergence in probability' and the latter convergence in distribution'. ithe'
'the'
=
-4
=
=
=
.
.
.
,
.,,
.
,
.
,
.
=
'
'
'
.
.
.
,
=0)
=
=
=
.
.
.
,
=
-pq1
Dehnition 1
t
z'tsequence of nv.'s F;,, n > 1) is said to converge in probability to r.p. (t?rconstant) F (/' for every ; > 0 lim #r(1F;, 11->
cf)
-
FI<
I'Iz denote this wflll Y;,
-+
P
:)
=
1.
a
(9.3)
K
Desnition 2 tF,,(y), ,4 sequence of r.v.'s (Y;,, n > 1) wr distribution functions n > 1) is said to conYerge in distribution to a r.p. F with distribution
9.1 The
early limit theorems
function F(y)
4J'
1im F,,(y) 11
--#
F(y)
=
C:l
D
at aII points oj' continuity, of F(y); denoted by F;,
--+
FL
lt should be emphasised that neither of the above types of convergence tells us anything about any convergence of the sequence ('F;,) to F in the sense used in mathematical analysis, such as for each s > 0 and s sS, there exists an N N(c, # such that =
IL(s) -
< F(s)I
for n > N.
;
(9.5)
Both convergence types refer only to convergence of probabilities or functions associated with probabilities. On the other hand, the definition of a r.v. has nothing to do with probabilities and the above convergence of F;, to F on S is convergence of real valued functions defined on S. The type of stochastic convergence which comes closer to the above mathematical sure' convergence. convergence is known as 'almost
Dehnition 3 .,4sequence of r.v.'s (Y;,,n > 1) con,erges to F almost surely or with probability one) ,)' lim Y;, F
Pr
=
lt
--#
or, equivalently',
lim Pr
u -/
(x)
=
1,' denoted b)' F;,
243
(f for any
U IYk
-
(f?r.1-,.or -+
F,
a constantj
(9.6)
) 2>0
Fl
<
m 7) n
s
=
1.
This is a much stronger mode of convergence than either convergence in probability or convergence in distribution. For a more extensive discussion of these modes of convergence and their interrelationships see Chapter 10. The limit theorems associated with convergence almost surely are appropriately called law of large numbers' (SLLN). The term is used 1awof large numbers' (WLLN) to emphasise the distinction with the associated with convergence in probability. In the next section the 1aw of large numbers is used as an example of the developments the various limit theorems have undergone since Bernoulli. For this reason the discussion is intentionally rather long in an attempt to motivate a deeper understanding of the crucial assumptions giving rise to all the limit theorems considered in the sequel. Sstrong
'weak
lwimittheorems The law of large numbers
(1)
The weak
of lavge numbers
/Jw.,
( WL f.,'vl
Early in the nineteenth century Poisson asserting identical distributions for A'j result to go through. ,
that the condition LT4 Xn was not necessary for the
realised .
.
.
,
Poisson's theorem Let .tfxY,,,n y 1) be a sequence 1) pi and Pr(Xi 0) Prlzri > 0, ; =
=
=
I-S-
-+
,1
1
-.11. - ... N Fl
ljm pr x
(?Jindependent 1 - pi i
=
,
=
1, 2,
.
St?rnt?l/// r.!r.'y witb n, then, for t7/nl' .
.
,
''
jj i =
1
pi
< ;
(9.s)
1
=
.
The important breakthrough in relation to the WLLN was made by Chebyshev who realised that not only LT4 but LT2 was unnecessary for the Ar,,were Bernoulli r.v.'s was result to follow. That is, the fact that Xj, not contributing to the result in any essential way. What was crucially important was the fact that we considered the summation of n r.v.'s to form Sn 3Xi and comparing it with its mean. .
=
.
.
,
7-
(9.9)
Pl't?q/:. Since the A'ls are independent
since lim u
--+
.x
(, ll8
a
=
0, lim Pr ,,
--+
w.
j
. i 11
11 .=
Z =
j
1 Xi - 11
!1
i
Z Jtf > =
1
E
=
0,
9.2
The Iaw of Iarge numbers
1
1imPr
-/1
169
1
j( Xi -- 1 )2pi j
<)
=
1,
i
Markov, a student of Chebyshev's, noticed in the proof of Chebyshev's A-,, are independent played only a theorem the fact that the .YI Xz. minor role in enabling us to deduce that Var(&) ( 1 ,72) :)1-j c/. The above proof goes through provided that ( 1,.'r?2)Var(&) 0 as n ry-.. Since .
,
.
.
,
=
-+
-+
11
Var(5',,)
Vart.Yfl +
=
f
we
need
Chapter
=
1
j #j)
Covtxl..Yjli
(9.10)
i
ol-der f#' magnitude (see to assume that Vartzf Xf) is Of Smaller 10) than nl for the result to follow. Hence LT3 is not a crucial
condition.
Markov's theorem
j
1im Pr ,,
-+
az.
-. '
j
11
i
) =
1
Xi - 11
11
E(Xij i
=
< ;
=
1.
1
Khinchin. a student of Markov's, realised that, in the case of independent and identically distlibuted (11D) r.v.'s, Markov's condition was not a nessary condition. In fact in the llD case no restriction on the nature of the valiances is needed.
& % N --/
'Jw
Limit theorems
(2)
Iaw of large numbers
Te strong
(SL Nm
The first result relating to the almost sure convergence of Bernoulli distributed nv.'s case was proved by Borel in 1909.
& for the
Borel's theorem Let fA-,,) be a sequence 0). 11D Bernoulli r.v.'.%wjr Prlxi Prlxi 0) 1 p for aIl i, ten =
=
=
1) p and =
-
S ''
lim
Pr
p
=
N
-+x
13
=
1 .
ln other words, the event defined by s e:S), has n (k,,(y)q probability one; S being the semple space. An equivalent way to express this is .ty:1im,,-,
=p,
.
Sm
lim Pr max ?1 -+
t
-p
I'?I
mp n
x:.
=0.
This brings out the relationship between the SLLN and the WLLN since simultaneous refers realisation of the inequalities and former the to the
S -.$1
Sm
%max
-p
&
(9.17)
-p .
??1
mn
P
as
t-0*=
%-.+'
This implies that Kolmogorov, by replacing the Markov condition 11
j
lim
l1 --/ ::Ll
'Y
,
) -
VartA4l
=0
(9.18)
1
for the WLLN in the case of independent r.v.'s, with the stronger condition Cf
) =1
j
s
Varta&k) < cs,
(9.19)
proved the first SLLN for a general sequence of independent Kolmogoro''s
r.v.'s.
2
teorem
Let (a,,,n y 1) be a sequence of independent r.v.'s such that E(Xi) then #- they satisfy lt? and Var(.Y) exist for aIl i 1, 2, condition (19) wc can deduce that =
)
Pr
lim n n rl -+
.
.
.
,
?1
i
) g#f-f(A-j)() =
1
=0 =
1.
9.2
The Iaw of large numbers
This SLLN is analogous
to Chebyshev's WLLN and in the same way we inequality. The inequality used in this context is an Kolmoqorov's inequalit-v If Arj, xYa, .Y,,are independent r.v.'s, such that c/ < :Ef,, i 1, 2, Vartlf) rl, then for any ) >0
can prove it
using
.
=
=
Pr Kolmogorov
,u
.
.
'
k < ,,
.
.
,
,
jsk-'(sk)Iy
max
l
.
11
1g
c <
2
:
i
)) c/. 1
=
went on to prove that in the case where 1t-Y,,,n 7: 1) is a r.v.'s such that F(A'f) < then
sequence of llD
,
:* Varl'x k ) )
:2
k=1
k
1
%' =
j
Y
k=1
-k
dx x2.J(x)
< ct;. ,
which implies that for such a sequence the existence of expectation is a necessary as well as sufficient condition for the SLLN. Having argued that some of the conditions of the Bernoulli theorem did not contribute (in any essential way) to the result, the question that arises of naturally is, are the important elements giving rise to the large numbers'' (SLLN, WLLNI'?'The Markov condition ( 18) for the WLLN, and Kolmogorov's condition ( 19) for the SLLN, hold the key to the answer of this question. It is clear from these two conditions that the most important ingredient is the restriction on the variance of the partial sums S,,, that is, we need the Var(&) to increase at niost as quickly as ;. More formally we need Var(&) to be at most of order n and we write Var(S,,) O(n). ln order to see this let us consider some of the cases discussed above. ln the llD case if VartxYf) c2 for all f, then Var(5',,) nc2 O(n). ln the case of independent r.v.'s with 'what
tlaw
=
=
=
Varl Xi ) cf2 < =
':s
,
1
=
1, 2,
.
.
.
=
,
(9.22) Moreover,
0(.;2)
where the Markov condition can be written as Var(S,,) smaller order than' achieves the same effect since Var(5',,) O(n) Var(S,,) o(n2) (see Chapter 10). The Kolmogorov 1. conditlon is a more restrictive form of the Markov condition, requiring the variance of the partial sums to be uniformly of at most of order n. This being the case, it becomes obvious that the conditions LT3 and LT4, assuming independence and identically distributed r.v.'s, are not fundamental ingredients. lndeed, if we drop the identically distributed condition altogether and weaken independence to martingale orthogonality the above limit theorems go through with minor modifications. We say that a sequence of r.v-'s Xn, n y 1) is martingale orthogonal if f'(A',,/c(A-,,- I
small
:o'
=
reads
=
'of
'=.
=
t
,
.
.
.
,
Limit theorems
n > 1. It should come as no surprise to learn that both important proving the WLLN and SLLN, the Chebyshev and Kolmogorov tools in inequalities hold true for orthogonal r.v.'s. This enables us to prove the WLLN and SLLN under much weaker conditions than the ones discussed above. The most useful of these results are the ones related to martingales because they can be seen as direct extensions of the case and the results are general enough to cover most types of dependencies we are .Y1))
=0,
'independent'
interested in.
The law of large numbers
(3)
for martingales
such that E%) 0, for all n, and define f.;,, n (E N) be a martingale 1 S,, discussed 0). As in Section 8.4, if Sn defines a L S,,- - ! n ): (.$0 then by construction F;,defines an orthogonal martingale wlth respect to process and thus, assuming a bounded variance for F;,, the above limit theorems can go through with minor modifications.
Let
.t5',,,
=
=
=
,
.';,
WLLN
for martingales
f
.Y,,, n k: 1) be a sequence t? r.v.'.%with respect to the increasing < ct;, and sequence of c-//z-us ( n > 1) suc that z;(1Az'1) #(1A-,,1> x) and x) ,? > 1, c-constant (i.e.all > f0r x > 0 xis A-). Tben are bounded /?y some r.p. Let
%c#(lA-I
z,-
y;.K Xi - Elxijfh j ), An equivalent way to state the WLLN is 1g
-N i
13
11
LXi =
(9.24)
1
-
E(Xi/%.
1)j
-
-->
0.
This result shows clearly how the assumption
of stationarity
of (A-,,, n y 1)
9.3
The central limit theorem
1), and );(-Y,, n > 1) (seeChapter 8) can strengthen the WLLN result to that of the SLLN. The above discussion suggests that the most important ingredients of the Bernoulli theorem are that: (i) we consider the probabilistic behaviour of centred nv.'s of the form ,f.6-
Zn S', - n;) j)- 1 l-Yj- F(Ar2)); Var(.,,) 0(,1)., and (f F,,. n y 1) is a martingale X,, Y;, for - f).Y,,), the sequence F1)) 0, n > 1. i.e. '(Y;, difference, c(L- 1, This suggests that martingales provide a very convenient framework for these limit theorems because by definition they are r.v.'s with respect to an increasing sequence of c-fields and under some general conditions they converge to some nv. as n tys. The latter being of great importance when r.v. is needed. Moreover, for any convergence to a non-degenerate 1) tXu, 9. u, n y: the martingale differences sequence martingale sequence orthogonal martingale sequence of r.v.'s which can help tL, n y 1) defines a us ensure (ii) above. =
=
=
=
.
.
.
=
,
-+
The SLLN is sometimes credited as providing a mathematical foundation for the frequency approach to probability. This is, however, erroneous because the definition is rendered circular given that we need a notion of probability to define the SLLN in the first place. Remarl:
9.3
The central limit theorem
As with the WLLN and SLLN, it was realised that LT2 was not contributing in any essential way to the De Moivre-Laplace theorem and the literature considered sequences of r.v.'s with restrictions on the first few jt j Xi, the CLT moments. Let tA-,,, 1177 1 be a sequence of r.v.'s and S,, of considers the limiting behaviour =
F,:
=
S,, - .E(S,,)
-7tvar(s,,); N' .
which is a normalised and SLLN.
,
version
Lindeberl-luevy,
of
F(%,), the subject matter of the WLLN
.%, -
theorem
j
#
lim F.,( y) ''
?,-.
cc.
)f'-
-
'
=
1im P(K. G r '
?,-.
w.
''
-
=
--,
'
-
.
x''(27:)
expt
-
1yu2)
d)?.
(9.28)
Limit theorems
Liapunov's theorem Let
kfX,,,
n > 1)
be a
VartA-zl
f
c2 i
.
< J >0. .E'(lAz'fI2+)
c/ < az,
=
r.v.'s wr/l 'aa,
1 2
11
)-
C= 1,
of independent
sequence
>
1
=
ten if
(9.30) theorem is rather restrictive because it requires the existence of higher than the second. A more satisfactory result providing both moments and sufficient conditions is the next theorem', Lindeberg in 1923 necessary established the if' part. part and Feller in 1935 the Liapunov's
'if'
'only
Lindeberq-Feller
theorem
Let (A',,, n 7: 1) be a sequence ofindependent )F,,(x),n > 1) such tat
r.p.'.s wf/'ll
distribution
jnctions
() Exi) ii) Vartxf)
=pi
Ten
=
(9.3 1)
c/ < vs,
the relations
(7) lim max l -+x un ?1
t2
11
Gi =
t??,
0,
where
tr,, = i
l =
c/
,'
1
(9.34) 11
jg (x i 1 l-pilxsc '
'
-pf)2
dFj(x)
-
=
0(c',2,) jr
alI
; >0.
(9.35)
The necessary and sufficient condition is known as the Lindeberg condition really gives rise to the result'. and provides an intuitive insight into twhat
The central Iimit theorem
9.3 Given that 1g
11
2 Cn i
F, =
y
1
:2
-/z)2
Ix
.-pikxircf
dFftx)
(x
max prtjAry - pfl > c(rf),
1% i ts n
this shows that the heart of the CLT is the condition that no one r.a dominates the sequence of sums, that is, each (Arf cf is small relative to the sum gS,,- f(S,,)q c,, as n increases. The Liapunov condition can be deduced from the Lindeberg condition and thus it achieves the same effect. Hence the CLT ref ers to ihe istributional bealliour of the summation of an increasing number of r.v.'s whfcll individuall), do not exert any signlkfcant c/-kcr on the behaviour of the sum. An analogy can be drawn from economic theory where under the assumptions of perfect competition (no individual agent dominates the aggregate) we can prove the existence of a general equilibrium. A more pertinent analogy can be drawn between the CLT and the theory of gas in physics. A particular viewpoint in physics considers a gas as consisting of an enormous number of individual particles in continuous but chaotic motion. One can say nothing about the behaviour of individual particles in relation to their position or velocity but we can determine (at least probabilistically) the behaviour of a large group of them. Fig. 9. 1 illustrates the CLT in the case where Ar,, are 1lD r.v.'s, 1, U( 1, 2, Xi l), uniformly distributed i.e. i n, and + Xz + + Xn. represents the density function of Y;, -/tf)
-:-1
,
,v
.
.
.
.
,
.
,(#)
=
-
.
,
.:-1
'
=
.
.
fl (p')
f (y) --#'
+.
Z#
y'
.* Z zz
N
-
--
-
a
(y1
*.
x
*
N
N
*
'
N
x.
0
y
Fig. 9. 1. lllustrating the CLT using the density function of are uniformly distributed for n 1, 2, 3.
where the
xfl.s
(y)
=
r
=
)'= Xi 1
Limit theorems
Returning to the Bernoulli and De Moivre- Laplace theorems, we can see that the important ingredients were the same in both cases and both families of limit theorems refer to the behaviour of the sum of a sequence of P
r.v.'s in a different probabilistic
sense. The WLLN referred n)Qp(land the CLT to
the SLLN to
to
r(,%,
' j--!S pj
jr 1.-,y.-?)
j-
r
--+
pj,
z xtt),1).
-..,
-
!),L/n)
,.
&
Let us consider the relationship between these limit theorems in the particular case of the binomial distribution. From the CLT we know that pv
...-
:u
. .--
,,'1),,p(
N/ '
b
S - np 1-p); ''
g <
i)
1 -x,7(z:l exp
rx a
2
u -
du
-
(9.38)
equal'). ln order to see how good the (t c>: reads 10, ;) < & 8), n approximation is 1et us take From the binomial 0 -k tables we get j7.($(1k0)(0.j)k(0.5)1 0.3662. 'J'hu normal approximation to this probability takes the form 'approximately
.13?-46
:
=
=..
=
8 + .$ np L-
6
-
#r(6 < S, % 8) :x *
-
y l N' (DJ?)(
-
-
pq
*
$ nn
-
-
=
'
(142.2
=
tnpii-i p)
1)
-
(140.3
-
x
16),
where *4 ) refers to the normal cumulative distribution function. It must be noted that #r(S,, G h) is ap,proximated by F + -p)q ) V'gnptl rather than Ft(? - np) x,'' Inpl1 ) in order to improve the approximation by bridging the discontinuity between integers. From the normal tables we get (142.2 1) *(0.3 16) 0.9866 -0.6239 0.3627 which is a very good approximation of the exact binomial probability for /7 as small as 10. .
t(l?
-p)q
=
-
Using the above results we can deduce Pr
c-
s,, -p
< c
'
(-
=
1 -,z (2J:)
-e x/'
h, 100
-
p wa
-
=
lhat
uz eXp
K--!.p.p
Pr
.-np)
e.qat),#?' Pr
--
2
S2(?q
p 20 -
S5 oo
500 -p
d
.,
;
'
,# =
1. /:/( 1 - p) .
.
.
n
<
2
=
0.944,
<
2
=
0.965.
The central Iimit theorem
From the above example of the normal distribution providing an approximation to the binomial distribution we saw that for n as small as 10 This is, however, by no means the rule. ln it was a very good approximation. this case it arose because p =c1.Z. For values of p near zero or one the approximaton is much worse for the same n. ln general the accuracy of the approximation depends on n as well as the unknown parameter 0. This presents us with the problem of assessing how good the approximation is for a particular value of n and a range for 0. This problem will be considered further in Chapter 10 where the additional question of improving the approximation will also be considered. Although the CLT refers to the convergence in distribution of the it standardised sum (.,, - m,,) c,,, where 1,n,, /;4S,,) and c',, x/'EVar(,%,)(I is common place in practice to refer to Sj being asymptotically normally distributed with mean mn and variance c,,z an d to denote this by .,*,
=
=
=
c,2, s x(m,,, ). pl
-
Strictly speaking, following sense:
such a statement
is incorrect, but it can be justified in the
of the form Pr$S', s: (k) can be approximated rl probabilities J) since approximation *uf.? -nu) the - m,,),/'c,J error Pljst, :K -+ uniformly R to vt as n on goes zero
for large (I)r(fJ
-
by c
.
(See Ash ( 1972).) The CLT can be extended to a sequence of random vectors where X 11 is a k x 1 vector. Lintleberg-yeller
)X,,, n 7: 1)
CLT
X,,, n y 1) be a sequence 0j. k x 1 independent rtlndom vectors and distribution w'ff F(Xf) pi and COv(Xf) Zf, i 1. 2, .//nclfons 'f(F,, n 7: 1) such that:
Let
ft
=
=
=
.
.
.
,
,
lim (?,)
() I
--+
=
j: + 0,
(9.40)
z
each n ln practice this
:
>
0,'
11
-1 i
j
L-.Z
1
result
(Xf - pi)
N(0, Z).
'v
(9.42)
t:t
is demonstrated by showing that for any fixed
Limit theorems
(9.43) Since
c'Z,,c c'Zc + 0, c' --+
lt
n-'l
j
i
(Xj - pi)
1
=
N(0, c'Ec)
,v
x
for all c + 0.
(9.44)
Then, using the Cramer-Wold theorem (seeChapter 10) we can deduce the CLT result stated above. As in the case of the other limit theorems (WLLN, SLLN) the CLT can be easily extended to the martingale case. CLT
for martinqales
.()?. n p: 1) be t,%,, dljlrences S,, Let
,,,
',,
,$',,
=
wl
-
.E'(.X'n/.f/.',,I )
=
-
and dejlne the nmrlfntytk/tr that JtA',,, is, n > 1) is a sequence oj' r.n.'s 1 0, n 1 sucb that: =
a martingale ,
,
.
.
.
,
(f)
(9.45) (9.46) 11
ctl, j
c/,
=
1
ci
=
EX,?),
j
then -.. f-n
?1
i
j
Xi 1
'w
N(0, 1). (9.47)
x = This theorem is a direct extension of the Lindeberg-Feller theorem. lt is important to note that (i)and (ii)ensure that the summations involved are of smaller order of magnitude than c,2, ....-
.
9.4*
Limit theorems for stochastic processes
The purpose of this section is to cosider briefly various extensions of the limit theorems discussed above to some interesting cases where )A',,, n > 1) is a stochastic process satisfying certain restrictions (seeChapter 8). The first step towards generalising the limit theorems to dependent r.v.'s has already been considered above for the case where kfX., n > 1) is a martingale relative to the increasing sequence of c-fields Jtf2pj, n 7: 1). Another interesting form of a stochastic process is the m-dependent process (seeChapter 8). For an m-dependent zero mean stochastic process (A',,, n > 1) with finite third moments, '(1A',,13) < K for all n > 1 and some
Limit theorems for stochastic processes
9.4*
constant K, it can be shown that 1:1
11
jg
V Xi --sn V
i
if c2
tT2),
ZxN(,
-+
--
1
)
1im 11 co
=
n
-+
11
oj-vi k
=
< cx',,
1
(9.48)
m-1
c/
=
j
2
Covtx''
+ Vart-f
Xi +.)
+j,
+.).
j=0
The importance of martingales and m-dependent processes lies with the and mixing stochastic processes behave fact that stationary-ergodic asymptotically like martingale differences and m-dependent processes, respectively. Hence, the limit theorems for martingales and m-dependent when certain processes can be extended to stationary and mixing processes restrictions related to their homogeneity and memory are imposed.
(t7) /(:n) O(r?1-'), =
(?) a(rn)
04?,n-1),
=
z>
(r (?--
(9.51)
1:,
1 for ry 1 and such tbat 11,,1 > 0 (f.t?.Xn is dominated !?yz,,, n y 1) then .j
11
a.s.
E-Yi-.E.;t-)(I - ) N i
=
--+
Jrl),'
(9.52)
0
1
(5t?pWhite and Domowitz 1984)). The value of r is the above SLLN determines the highest moment assumed kf to exist and the same value restricts the memory of Ar,,,n y 1). For 1) ftAr,,,n 2: an independent process r= 1 and $(m)dies out exponentially. CLT
./r
mixinq
processes
Let n y 1) be a mixing process satisfyinq the restrictions the SLLN to hold. ln addition let ly assume that.. imposed .J't??kfX,,,
(i)
'(Xn)
=
(9.53)
0,
(ii)
.(lA',,l2r)
(fff)
for 5',,(z)
'K
(9.54)
< crs,
--i
=
n
j Xi, 7--7+
tere
exists I/-hnite and non-zero (x), then n
such r/-lfl (fJ(S,,(z))2 - Jzz-.j 0 jr alI z as --+
(np'l-'is(0) x(O, 1) ,,
-
(see White and Domowitz (198444.
--+
Limit tlleorems
The importance of the above limit theorems for mixing processes becomes more apparent in view of the following result'. lf kf-Y,,, n > 1) is a mixing p?vpcas-s(44n1)or .:t4?A2)) then /./ny Borel lf zY,,is Xn k ) is also mixing. A-/rt?rct??.lpr, jllnction F;, f./,,(zY,,, O(n then F; is aIs() O(m z > 0. This result is of considerable interest in statistical inference where asymptotic results for functions of stochastic processes are at a premium. For stationary stoehastic processes which are also ergodic several limit theorems can be proved. =
.
.
.
,
-
-')
-')
1
SLLN
5
/br stationarq',
t.?rg(4Jfc'processes
1) be a stationary n > 1, then
.f..,:4f Xn,n > J;1.1,,(<
:x,
,
1 w-m Xi -N i L 1
and
ergodic
process such
that
a s .
K
-+#
(9.56)
E (X,,),
=
tsce' Stout (1974)).
(9.58) (9.59)
m
=
)
E('n)1't'+J''t'+''1
<
'v
1
then l
S
-1 .-.r! G
(5't?t!Hall
,.
srjtl j) ,
x
and S?)'Je'
(198044.
Note that mixing implies ergodicity (see Chapter r'
)'
9.5
8).
Summary
The limit theorems discussed above provide us with useful information relating to the probabilistic behaviour of a particular aggregate function, the sum, of an increasing sequence of r.v.'s, as the number of nv.'s goes to
18 l
9.5 Summary
infinity, when certain conditions are imposed on the individual r.v.'s to one dominates the sum'. The WLLN refers to the ensure that of Sn r?, i.e. probability in convergence tno
P
S11
. -#
ES
11
)
1
11
The SLLN strengthens 11
-+
the convergence
to
surely', i.e.
balmost
11
.
.
11
h1
The CLT, on the other hand, provides us with information relating to the rate of convergence. This information comes in the form of the factor by which to premultiply % E(%) so that it converges to a nondegenerate r.v. (the convergence in WLLN and SLLN is to a degenerate r.v.; standard deviation of S,,,i.e. a constant). This factor comes in the form of the 'appropriate'
-
D
1 N
,-k ,,, g ar( s )j
(S,,
-
.E'(S,,))-/ Z
'v
(9.62)
N(0- 1).
,,
lmportant
concepts
convergence almost surely, convergence in numbers, strong law of large numbers, large law of distribution, central limit theorem, Lindeberg condition. Convergence
in probability.
weak
Questions 1 and contrast it k) P?
-
=
=
conclusions. 7*. ln the CLT why is the limit distribution a normal and not some other distribution'? 'A1l limit theorems impose conditions on the individual r.v.'s of a sequence in order to ensure that no one dominates the behaviour of the aggregate and this leads to their conclusions.' Discuss.
Limit theorems
Compare the Lindeberg-Feller CLT with the one for martingales. Compare Markov's condition for the WLLN with Kolmogorov's condition for the SLLN. Compare the conclusions of the WLLN, SLLN and CLT. Exercises
(i)
#/.Xp,
-1.
+ 2/)
(ii) (iii)
u (jyy. (jj
11
Note: i
Determine
) =
i
=
2
1
.
the value of a for which
#?-(.Y,, + n =
)
=
the sequence of nv.'s )Ar,,)
ly
satisfies the SLLN. 3*. Show that for the sequence of independent r.v.'s (A-,,, n > 1) with I)j Pr(X,, the Lindeberg conditions holds iff a <-12. + h7) 1/gan2t?=
=
5
Additional references
Chung (1974); Cramer (1946); Feller (1968); Giri (1974); Gnedenko (1969); (1963); Pfeiffer ( 1978); Rao (1973); Renyi ( 1970); Rohatgi (1976).
Locve
10*
CHAPTER
lntroduction to asymptotic theory
10.1
Introduction
At the heart of statistical inference lies the problem of deriving the distribution of some Borel function h ) of the random vector X (A'j, Ar )', i.e. determine '
=
.
.
.
,
?1
F,,(y)
Prlhqz'b-j
=
,
.
.
.
,
.Y,,)G )?),
from the distribution of X. ln Chapter 6 we saw that this is by no means a trivial problem even for the simplest functions hl ).Indeed the results in this area are almost exclusively related to simple functions of normally distributed r.v.'s, most of which have been derived in Chapter 6. For more complicated functions even in the case of normality very few results are available. Given, however, that statistical inference depends crucially on being able to determine the distribution of such functions (X) we need to tackle the problem somehow. lntuition suggests that the limit theorems discussed in Chapter 9, when extended, might enable us to derive approximate solutions to the distribution problem. The limit theorems considered in Chapter 9 tell us that under certain conditions, which ensure that no one r.v. in a sequence tAr,,, n > 1) dominates the behaviour of the sum (Z)'I A-), we can deduce that: .
jj
IA
1;
-n )'-=
i
-->
''
j
--+
11
i
1
a.s. z-i
=
J -
n
j
1
-rl i
Xi
-
11 i
j -
F(xYf) 1
(WLLNI;
11
) =
F(vYf) (sLLN); 1
Introduction
to asymptotic
theory
(CLT).
ln order to be able to extend these results to arbitrary Borel functions /7(X), notjust j)'- 1 Xi, we firstly need to extend the various modes of convergence (convergence in probability, almost sure convergence, convergence in distribution) to apply to any sequence of r.v.'s ).Y,,, ?1 y 1). The various modes of convergence related to the above limit theorems are considered in Section 10.2. The main purpose of this section is to relate notions of convergence to the probabilistic the various mathematical needed modes in asymptotic theory. One important mode of convergence encountered in the context of the limit theorems not convergence s 'convergence in the rth mean', which refers to convergence of moments. Section 10.3 discusses various concepts related to the convergence of moments such as asymptotic moments, limits of moments and probability limits in an attempt to distinguish between these concepts often confused in asymptotic theory. ln Chapter 9 it was stressed that an important ingredient which underlies the conditions giving rise to the various limit of magnitude'. For example, the theorems is the notion of the Markov condition needed for the WLLN, 'order
)
lim
c ?,-+x n
/1
k
j =
Vart-tk)
0,
=
1
is a restriction on the order of magnitude Var(&)
o(rl2).
=
In Section 10.4 the
tbig
of Var(&) of the form
( 10.6)
o' notation is considered in some detail as a O', prelude to Sections 10.5 and 10.6. The purpose of Section 10.5 is to consider the question of extending the limit theorems of Chapter 9 from (j)'- I Arf)to A',,) such as more general functions of (A-j Xz, ))'-1 XL,/-> 1. This is indeed the main aim of asymptotic theoo'. Asymptotic theory results are resorted to by necessity, when finite sample results are not available in a usable form. This is because asymptotic results provide only approximations. 'How good the approximations are' is commonly unknown because we could answer such a question only when the finite result is available. But in such a case the asymptotic result is not needed! There are, however, various error bounds which ean shed some light on the magnitude of the approximation error. Moreover, it is often possible to upon the asymptotic results using what we call Klittle
,
.
.
.
,
krough'
kimprove'
Modes of convergence
10.2
asymptotic expansions such as the Edgeworth expansion. The purpose of Section 10.6 is to introduce the reader to this important literature on error bounds and asymptotic expansions. The discussion is only introductory and much more intuitive than formal in an attempt to demystify this For a more literature whieh plays an important role in econometrics. complete and formal discussion see Phillips ( 1980), Rothenberg ( 1984), intelalia.
Modes of convergence
10.2
and play a very important role in of of only because the limit theorems discussed in probability theory, not they underlie Chapter 9 but also because some of the most fundamental concepts such as probability and distribution functionss density functions, mean, variance- as well as higher moments. This was not made explicit in Chapters 3-7 because of the mathematical subtleties involved. ln order to understand the various modes of convergence in probability theory 1et us begin by reminding ourselves of the notion of convergence in ftf.?,,, G mathematical analysis. n t is defined to be a function sequence from the natural numbers . # )1, 2, 3, ) to the real line R. 'convergence'
'limit'
The notions
'
,4
.
'
=
.
.
.
Dqpnition 1 f A smuence ( a n g # is said to converge to a Iimit a f/' jr t:'t,ta?-v Llrbitrar)' (1 l?un'l/?t?/- A'(;) such small nl/yl/afskls > 0 tbere ctp?-?-fpsptpnps ')
?;
,.
-
.
< t 1t.,, w't? denote tbis /1
l/-lar/'/?ta inequtllitq' n > N(;);
wlll
all rt?r,e'n-s ?,, t?/' /-':'IJs /?p lim,,-.. fk,, a. ./?-
.
-
the s:?t/?..',nce-
=
E'xtl/nkp/a1 ' .. ..
lim,,-, n (1,/'n)''-Ie .,x')'
=
1
j-
:.'
N
nj
=
=
.(log, 0, b > 0, b + 1,'1im,,-..I( 1 + nln) 1, for any n > 0'. 1im,,-+ + 2.7 1828-,1im,,- .,. g(n2 + n + 6)/(3n2 2n + 2)j -j;1im,,-+ . gx'''(n () fo r n g t ,
=
=
=
-
'.
5
,
whose This notion of convergence can be extended directly to kny --+ The necessarily but any subset of R, i.e. /?(.v):(J R. way domain is not this is done is to allow the variable x to define a sequence of numbel's 1x,,. h? G . t l converging to some limit x() and consider the sequence ) (x?,), --+ ?2i: . f as x,, .v() and denote it by limx .x(, /7(x) /. .lnction
.
'
','
=
.-.
Depn i rt?n 2 A function
said r(?converge to a Iimit I fxs (x) 1.%
x
--+
xo.
(J
'for
crt?rl?
Introduction
theory
to asymptotic
)> 0, bowever small, there exists
a number
J(#
0 such that
>
Example 2 For
(x)
=
ex, limx.-
/1(x) =
./1(x)
0 and for the polynomial
=
-
aox'' + t/lx'' -
1
+
.
.
+ a 11 j x + -
.
function
1imJl(x) a,,.
(I,t,
=
x-0
Ix
-x()l < J(c) excludes the point x xtl in the Note that the condition 0 < above definition and thus for /1(x) (x2 9),(x 3), limx-ps (x) 6, even though /7(x) is not defined at x 3. For =
=
=
-
=
(x)
=
t7X,
0,
a>
/l(x) (1 + x)1?-v, =
7
(x) /l(x)
=
=
lim /1(x) e, =
x-0
X
1+ X
=
lim /1(.x) 1, x-0
,
1im J?(x)
=
eJ,
x-0
gloget1 +
x) xj,
lim /?(x)
=
1,
x-0
Using the notion of convergence of a function to a limit we can detine the continuity of a function at a point. Djlnition
3
-4 function (x),dehned over some interval D) tf to be continuous at the point jr eacb (:) > 0 such tat .xo
l(x)
:
R, xa G D() is said 0, tere exists a
>
-(xo)I
./??-trllcr.y: x satiqjkving r/-l?rgstriction l-x x()l < J(c). W' denote this b)., /l(-Y) limx /-1(x) /,1(.X()). is-said to be continuous if it is q/' point at its domain, D. continuous every .,4
-xo
Example 3 The functions
(verifyl).
=
./ntrrtpl7
Modes of convergence
10.2
187
t,,(x),
n e: f ln the case of a general function we can define the sequence is a subset of Dh). and consider its behaviour for each x e: where .4
.4
')
,
Dqhnitioll 4 -)
-4 sequence of jnctions (,,(x),n c f wl common domain D is D f/' jr each c > 0, there said to converge pointwise to Jl(x) on x) ,7 tben if x), > sucb that N(c, N(c, exists a .
-4
1,,(x)-
ll(x)1
holds
for all
x s
-4.
Example 4 1,
,,(x) k
X
/(
-
=
()
k!
lim
,
,:
-.
for all x (EFR.
,,(x)=ex
x.j
ln the case where N(s, x) does not depend on x (onlyon c) then (ll,,(x), n 67 #'') is said to converqe uniformly on X. The importance of uniform convergence stems from the fact that if each h,,(x)in the sequence is continuous and converges uniformly to (x)on D then the limit /l(x) is also continuous. That iss if ,
,,(x)
lim Jl,,(x) =
X
for xo D
,:(x)
Xo
-*
for al1 x G D,
lim /1?,(.x) /l(x) =
!1
untformly,
( 10.8)
CX)
-#
then lim /?(x) ll(xo). =
X
X0
''*
With the above notions of continuity and limit in mathematical analysis in mind let us consider the question of convergence in the context of the #( )) and (R, @,Px )). Given that a random probability spaces (,, xY( function from S to R we can define pointwise and uniform variable ) is a the S sequence )A-,,(y), n (E f by conyergence on for '
'
,
.
'')
.
t.Y,,(,s) -
and
< Ar(s)I
I.Y,,(,s) A'(-s)1< -
:
for
')
for n > N(c),
(10.10)
n > .N(o, s),
s s S,
(10. 11)
respectively. These notions of convergence are of little interest because the probabilistic structure of )A-,,(#, n e is ignored. Although the probability set functions #4 and #xt ) do not come into the definition of a '-)
.f
.
.)
'
188
Introduction
to asymptotic
theory
random variable they play a crucial role in its behaviour. lf we take its probabilistic structure into consideration both of the above forms of convergence are much too strong because they imply that for n > N $z,,(s) Xlsll
.s-set
.gl
=
Dqhnition 5 fX
r.l.'s
.4 sctyttlnctl f#'
t.
(s),n (E
l1
''.
#
.
J
s said a.S.
(tk.s.)t() a
>
xY(s),denoted b J' A-,,-->
-.t!.
Pr s: 1im ..Y,,(-s) A-(s) =
rtlconverge almost surely ij'
z,
1.
=
-#
11
z4rlequit/alent 1
wly of Jt-/nfl';g almost
1im Pljs --#
ll
:
tyL
k.Y,,,(-$) .(.)I< -
(stit?Cnnj.y (1974)). The convergence associa ted
tllazlt.l.sl
x'ith
F;-
all m > n)
f/?t? stronq
/kJ1,
(SLLN).
Another mode
of convergence
not considered
theorems (see Chapter 9) is that of
=
converlencv
l
(10. 14) ls
conterqence
'sn-t?
is /?),
sl/r'f? convergence
t?
lllt, 'node
Iarge
in relation in rth mean.
q/
nl/rrlltu?l-.s
to the limit
/3nlt?rl6
D
Let
ft,',,(y).
')
I 1r)
t'ha l E ( A-,, < n e: # be a std/l/t?llct? q/' r.l?.'s -sl,lt)'Jl all n e' f and .E'(1A-1r) ,. > 0, thvn the sequence < w converges to x r .
-
./y,.
c/?-
.::yz:
.
in
>
rth mean,
denoted /?r X,,
lim f(IA-,, ''''*
11 11
-
-+
X,
#'
A-lr) 0. =
'Ed(';'
0/' particular in rcrtszslin w/'?tkr Illos (/- 1) and mean ygulrt:? (?- 2). =
is the
c(?rlLrt/rlyt?nct?
in
lnean
=
A weaker mode of convergence related to the (WLLN) is that of convergence in probability.
weak law of large numbers
189
Modes of convergence
10.2
Dejlnition
to
Ar,,(s),n
-)
(E
,.
jy said to converge in probability
#
r
X, denoted b Ar,,--> X, '
?-.'.
a
f
of r.v.'s
.4 sequence
#'
lim #/-(-s: IX,,(s) - A''(-s)i<
1
1)
-*
:f
between
relationship
Tbe
probability
l)
=
l
( 10. 16)
.
'
t't?l
ctlnt'tlrfy?nt't?
be deduced b).'
tyornpt/rfnry
surel p and almost wr/, ( 16) (14).
in
The mode of convergence related to the central limit theorem (CLT) is that of convergence in distribution.
8
Dhnition
.4 sequence F,:(x), n (E .ft A' A' /' ,
t?./ l-.v.'s ')
')
t
,1
is said to converge
f
distribution jnctions denoted b.v in distribution to
x (s), n e:
.
f
wlll
.
-(-s),
-+
A
);
.
1im F,,(x)
1
11
-#
=
F(x)
(Et)
continuity point x of F(x). This is nothing more than the pointwise convergence of a sequence of functions considered above. In the case where the convergence is also unijrm then F(x) is continuous and vice versa. lt is important, however, to note that F(x) in (17)might not be a proper distribution function (see Chapter 4). Without any further restrictions on the sequence of r.v.'s nc the above four modes of convergence are related as shown in Fig. 10.1. As we can see, convergence in distribution is the weakest mode of convergence being implied by al1 three other modes. Moreover, almost sure and rth mean convergence are not directly related but they both imply convergence in probability. ln order to be able to relate almost sure and rth mean convergence we need to impose some more restrictions on the sequence CX11(s),n e: f such as the existence of moments up to order r. t at
El/lfsll-r
t
-u(#,
-,#-)
')
,
,
.
The implication
Fig. 10.1
P
a.s. --+
'=>
-+
stems from the fact that
( 14) is a stronger form of
Introduction
to asymptotic
theory
convergence than ( 16), which holds for a1l n 7:/.n. The implication based on the inequality
P
r =
-+
-+
is
(10. 18) The = is rather obvious in the case where F(x) is a proper implication distribution function because for every .)>0, J >0, there exists N so that for all 17> N, #?'($A',, ..Y1> c) < J, and thus -+
-+
-
F,,(x -c)
implying that as
J < F(x) < F,,lx +
-
:,
l)
J,
+
0,
--+
D
( 10.19)
X 11 --+ X. P
D
The reverse implication A-,, is a consttlnt.
A' = .Y,,
-.+
X holds only in the case where X
-+
P
In order to be able to establish the result the convergence in probability is
--+
-sufficiently
co
)(
a.s.
we need to ensure that fast'. lf
=
-->
!>
#?-()A',,
'j
-
>
11z= 1
4
<
for every s > 0.
.cc
Moreover, a similar condition on the convergence convergence almost surely. ln particular, if
a.s. =
--+
(10.20)
-+.
in rth mean implies
a..S.
then f,,
-->
(10.2 1)
X.
In order to go from convergence in probability or almost sure convergence to rth mean convergence we need to ensure that the sequence of r.v.'s .ftA-,,,,1 e: f is bounded and the moments up to order 1- exist. In particular if ')
,
P
(i)-(iii) above hold, then Xn
.Y,, --+ X and conditions ( 1980)).
r .-.+
X
(see Serfling r
An important property l
of
p'th
mean convergence
is that if A-,, X for -+
some rb 1then Xn X foro < l < ?'. For example, mean square convergence implies mean convergence. This is related to the result that if -+
A((zY,,(r)<
'w
then F(1-Y,,(d)<
:t.
for 0 < I <
/-.
(10.22)
That is, if the rth moment exists (is bounded) then a11the moments of order less than r also exist. This is the reason why when we assume that
slodesof
10.2
Var(.Y,,) <
191
convergence
we do not need to add that .E'(A-,:)< cfs, given that it is always
ct..
implied. asymptotic theory we often need to extend the above results to transformed sequences of random vectors .t(y(X,,), convergence iE f above results are said to hold for a random vector The convergence n .fX f k if they hold for each component Xin, i 1, 2, ,1 c n sequence ( of X ?1 In applying ').
.
'')
=
.
,
.
.
.
,
.
Lemma 10.1 ')
fatrr t X n e: f be a random lltrcrtppcontinuous function at X, then: ,
11,
a.S.
(f)
a.9.
X
=
#(X,,)
X,, --> X
=
g(X,,) --+ g(X),'
=
g(X,,)
X,,
-+
-+
and g(
'
): Rk
-+
Ra
(10.23)
g(X),'
P
P
( 10.24)
D
D
(iii) X,,
sequence
X
-.+
-+
( 10.25)
g(X).
These results also hold in the case where g ) is a Borel jnction (see Chapter 6) when certain conditions are placed on the set of discontinuities of g ): see Mann and Wald (1943).Borel functions have a distinct advantage over continuous functions in the present context because the limit of such functions are commonly Borel functions themselves without requiring uniform convergence. Continuous functions are Borel functions but not vice versa. ln order to get some idea about the generality of Borel functions note that if h and g are Borel functions then the following are also Borel functions: (i)a + bg, a, b G R, (ii) (iii) maxt/?, g), (iv)mintn, (F), (v) '
'
'
11,
l.
Of particular
interest in asymptotic
theory are the following results.
( 10.26) Lemma 10.3 ,f
Let t X Y n c f l1 5
?1 '
,
')
be a sequence
(?Jpair
(?/'
randotn
k x 1 vectors.
lntroduction
theory
to asymptotic
P
(1)
/./' (X,, - 5'-!? )
0
-+
D
P
(x?)If X?,
X
-+
D
(-) /./ X,,
-+
P
Y 11 --+ 0
= X ??Y 11
0,'
-,+
P
X
and
Y,, --. C
D (('o??s/Jnl)
=
(X,,+ Y,,)
= Y, X 1
10.3
11
--.
--.
X+ C
CX;
Convergence of moments
Consider
the sequence
of r.v.'s
6
( .Y11,
n > 1J? such
that
D
A-,,
-+
A-
(i,e. lim F,,(x) F(x)), =
!1
--*
LX?'
where F,,(x) and F(x) refer to the cumulative distribution functions of .Y,, respectively. We define the momens q/' A',, (when they exist) by and -
/J(.Y;l )
xr
=
d F,,(x),
(10 8) .2
This definition of raw moments involves the Riemann-stieltjer integral which is a direct generalisation of the ordinary Riemann integral. ln the case where F,,tx) is a monotonically non-decreasing function of x and it has a continuous derivative (as in the case of a continuous r.v.) then dF,,(x) is equivalent to the differential dxs' ()dF,,(x)1/''dx being the corresponding density function. Thv /fnlr t?/' the rth moment (f;( is defined by ./,l(x)
,/,',(x)
=
',r,))
lim f).Y,r,),
( 10.29)
and it refers to the ordinary mathematical limit of the sequence it F(A',r,), lj ,77: 1 This limit is by no means equivalent to the asllnptotic nytp?'nt.?nlyof -Y,, defined by .
%
F( Xr11) EEEEq.) M
r)
=
xr d yqx ) ,
-%
As w'e can see from (30) the asymptotic
( 10.30) moments of .Y,,are defined in terms
10.3 Convergence of
moments
distribution F(x) and not its finite sample distribution F,,(x). In view of the fact that F,,(x) might have moments up to order m and F(x) might not (or vice versa), there is no reason why '(A'r,,),lim,,-. .A'(A-r,,) and Fz(xYr,,)will be equal for all r :Kq and all n. lndeed, we can show that the for some l.> 1 provide upper bounds for the limit inferior of corresponding asymptotic moments. ln particular:
of its asymptotic
.E'(I-Y,,Ir)
Lemma 10.4
o
1/' A'
A- then;
--+
11
.
lim inf
11
--#
E'41-Y,,1) > F(1-1)..
C42:
(fi) lirrl illf 5; t rt ?,
-*
;y SJa rt
ah,,)
,l)
cxl
tsct? Chunq 1974)). however, these concepts are equal.
Under certain conditions, Lemma 10.5 If Xn
o
-+
X and the sequence
i.e. lim sup c -+
(x)
n
A7,, n p: 1) is uniformly integrable
d.p 0 IA',r,I =
(jxir>t.)
f
l-tlrl< uo
,
and
Iim,,- w F(-Y,r,) Exr). -
Lemma 10.6
Lemma 10.7 P
If .Y,:
limn
-+
< n y 1) is uniformly '(!aYIr) txprj,
X and
-+
:x;
,
integrable, then
F(.Y,',) f(.Yr). =
.,
Lemma 10.8
l.f
a S. .
x,,
-+
A' and lim,,-
limu..+.'txYrnl
=
xinf
F(I-,,Ir)A'(lxlr) ,v
then
'txYrl.
(For these lemmas see Sertling (1980).)Looking at these results we can see that the important condition for the equality of the limit of the rth moment and the Kth asymptotic moment is the uniform integrability of tA',r,,n y 1) which allows us to interchange limits with expectations.
Introduction
to asymptotic
theory
Beyond the distinction between moments, limits of moments and asymptotic moments we sometimes encounter th concept of approximate moments. Consider the Taylor series expansion of glmr), mv (1/n) Z)'= 1 Ary =
+ /t''(>-rlln',r
-#(/z'r)
glmr)
-i#t2'(/t',.)(,nr+ -/z'r)2
-p'r)
+
-
'
'.
(10.31)
This expansion is often used to derive approximate moments for gmvj. Under certain regularity conditions (see Sargan (1974/, Elglm
>#(/z'r)
Var(g(?'nr)) as -'(:(,n2))3
Eglm
-/t'r)2,
+#t2'(/'r)f(,rlr
(10.32)
Vartmr), (kt1'(/z-r)(I'2 (#'11(p'r))3'(g(,nr) gt''(/t;))2#t2'(/z')'E#(,'nr)
(10.33)
-:/'r)1*,
(10.34)
-g(/z'r))3
+c
r
reads approximately where equal. These moments are viewed as moments of a statistic purporting to approximate glm and under certain conditions can be treated as approximations to the moments of glm (seeSargan ( 1974)). Such approximations must be distinguished from f(A7,) as well as 1)A-r). The approximate moments derived above can be very useful in choosing the functions g( ) so as to make the asymptotic results more accurate in the context of variance stabilising transformations and asymptotic expansions (seeRothenberg (1984:.In deriving the asymptotic distributions of g(m,) only the l'irst two moments are utilised and one can improve upon the normal approximation by utilising the above approximate higher moments in the context of asymptotic expansions. A brief introduction to asymptotic expansions is given in Section 10.6. :
:xt
'
'
10.4
The
ibig
0' and little o' notation
As argued above, the essence of asymptotic theory is approximation; approximation of Borel functions, random variables, distribution functions, mean, variances and higher moments (see Section 10.5). A particularly useful notion in the context of any approximation theory is ln that of the accuracy or order of magnitude of the approximations. quantities of magnitude of various analysis order the mathematical the track of' by the use of the big 0, little involved in an approximation is notation notation. lt o' turns out that this can be extended to probabilistic modilications. with minor The purpose of this section is to approximations consider extension notation and its review the 0, o to asymptotic theory. tkept
10.4 Let tfat,,
),,,
The
0' and
tbig
195
o' notation
ilittle
n e:-.'U-)be a double sequence of real numbers.
Dejlnition 9 is said to ttku,nc..,.1-)
The sequence
denotedby
be at most of order bn' and 1J,,
>'
J,,
=
O(hn)
tf lim 11
I
bn
-.
.
<
K,
for some constant K
>
0.
(10.35)
Dejlnition 10 The sequence
denoted by' 1
a',
=
is said to be tln,ng.zfz-l lf
/(/7,,)
smaller order than bn' and
iof
a lim --'! ll
--#
(10.36)
=0.
bn
'2Kl
Example 5
1 2n2-
1
= O ip. ; (n+
3
ln + n2 O(n5n2+ n3 =
1),.
1)
log. n
O(n) o(n2);
=
=
o(n?),
=
a > o;
expt
-
n)
=
(6n2+ 3n)
=
o(n-), o(n3)
=
O(n2).
A very important implication stemming from these examples is that if
a
pl
=
O(n*) then
The 0, o notation If
an
=
o(n'+)
a, J > 0.
'
satisties the following properties'. t7,, =
O(q,)
and
',, J,,',,
=
O(c,,), then
=
O(c,,c,,);
IJ,,IrOert -
an+
),,
=
,1
c,,)). otrrlintcu,
The same results hold for small o' in place of
1j' r./,, O(!,,) and =
/?,, =
t?(c,,),
Vbig
0%above.
then
t7l,+ b,, O(p,,); anbn o(c,,c,,). =
=
The 0, o notation can be extended to general real valued functions hl ) and
Introduction
to asymptotic
p ) with common domain D # constant K >0,
theory
Z.
We say
(x) O(g(x)) as x =
xo
--+
if for a
/?(x) K. x (E CD- xo). lim g(x) x-xo Moreover, we say that /1tx) o(g(x)) if :;
=
/14x)
lim
x-xv
0,
=
g(x)
x G CD- xa).
Example 6 Jl(x)= ex
I(x)IGclxl for x e E- 1, 1), /?(x)= O(x),
1,
-
and /l(x) cos =
1 + otx)
(x)
x,
=
In the case where
lx) - g(x)
O(l(x)) we write
=
yx) 47(x)+ O(/(x)) =
and for '
(x) g(.x) -
0(/(x))
=
we write /k(x) gtx) + o(/(x)1. =
is particularly useful in the case of the Taylor expansion, where we can show that if /1(x)is differentiable of order n (i.e.the derivatives (dJ) (7x./)v hlb, j 1, 2, n, exist for some positive integer n) at x x(), This notation
=
.
.
.
=
,
then
/y(?,)()( ()) +--- - J + o(j ,2
n!
kt
) as
--+
0.
The 0, o notation considered above can be extended to the case of stochastic convergence, convergence almost surely and in probability. Dehnition 11 Let kfX ?1 n is ,
'') '#
-
be a sequence
(4j%r.t-.'s
and
R,,,n G
'(1
.
i.' a sequence
positive real numbers. W sa r tbat X,, is at most of order c,, in probability if there (j)
of
exists ntln-
10.4
The
()9 and
kbig
stochastic
ilittle
o' notation )J,,,
sequence
(i)
that
suc
P
A'p'
(ln J
and denoted /?y Xn
')
..i
6
n
a ?,
-
C'',
-..+
()
t?p((r,,).
=
X,, is of order smaller than
(.',,
in probability i.f
P
X
't tl,,
0,' denoted by Xn
-+
In the same way we can define Xn
=
=
o
p(c,:).
Ou.,.(c,,) and .Y,,
=
t?g.s.(c,,).
lt turns out that the properties P 1 and P2 for 0, o can be extended directly to Op, ov and Os.s.,ou.s. with minor modilications (seeWhite (1984:.
Moreover, order of magnitude results related to non-stochastic sequences can be transformed into stochastic order of magnitude using the following theorem due to Mann and Wald (1943). Theorem tMtknn-tl/#l Let
JtX,,,
'')
n
where X,, N X
(E
,
xjn :
=
jn
J,,(a,,)
jr
some
Ja
(E
n
be
f .j
=
(1
of k-dimensional random
sequence
1, 2,
Op (cj,,), j
.
=
.
.
,
k)
1, 2,
t'ectors
such tbat .
.
.
,
n.l,
0(,,),
=
non-stochastic
#'' sucb
sequence
of
rllr?l
k-dimensional vectors
g,,(X,,) 0,(7,,) =
tscfrFuller (./9 76)). This theorem can be used to translate non-stochastic Taylor expansion to stochastic ones.
results related to the
198
Introduction
to asymptotic
Useful results (OJ)
If Var(Ar,,)
relating
theory tlrdel.
to
c,2,< rx) then Xt,=
=
deviation. (02) (OJ)
Os(c,,) - as big as the standard
If A'u Op(1/v' n) = A% op( 1). =
=
D
If X,,
D
0/1).
X
=
Xn
X
:cc>
Xn +
--+
=
tp/1)
D
X.
(04)
lf -Y,,
10.5
Extending the Iimit theorems
-+
-.+
As mentioned above, in statistical inference most of the Borel functions /4.Y1, Xz, .Y,,)whose asymptotic behaviour is of interest are functions of The limit theorems discussed in Chapter 9 refer exclusively xYr, 1. rb :J'- 1 asymptotic behaviour of the case r= 1. What we need to do now is to the to extend this to any r > 1 as a prelude to a discussion of the asymptotic behaviour of a function of such quantities and then extend results to any Borel function /74.:-1,Xz, .Y,,), when possible. For expositional purposes let us consider the case where t-Y,,, n > 1) is a The quantities rb 1, are j .YC, sequence of IID r.v.'s on S, #( directly related to the raw moments p'vn '(Arr),rb 1,in the sense that if we were to select an estimator' of p'v we would select .
.
.
,
.
.
.
,
,.d
1g
mr
=
-
n
.)).
:'f'-
11
Fl -Y7 i 1
(10.38)
,
-=
as a natural choice. Given that the l'aw moments tf/ti, rb 1) play an important role in the manipulation and description of the distribution function F(x) their estimators sequence (rn,,rb 1) is expected to play an important role in statistical inference. For reasons which will become apparent in the next chapter the mrs are called sample r/w moments. Let us X,, consider mr=(1/n) j),!-1 xYras a Borel function of the lID r.v.'s -Y1, and thus a nv. itself. Taking its expectation, .
1g
.E'4,?1,)
.
,
?1
=-
n
.
i
) =
1
'(.Yr) - because of the linearity of .E'( ) .
1
1)
=-rl jg p'r i 1 =
- because of the IlD assumption,
(10.39)
10.5
the Iimit theorems
Extending
199
Let
i
1. 2,
=
.
.
.
then
,
s
nzly =
that is, S,, is the sum of IID r.v.'s F1, Fa, we can deduce that
.
.
,' ''
where S,,
n .
,
=
i
Y Ff, =
1
and SLLN
L. From the WLLN
P f
( 10.40)
and
(1
( 10.4 1)
r,
v
Using the Lindeberg-l-evy
CLT we can deduce that if Var(F;)
then (
wn lf Vart#fl
z
-+
Gy
c2
=
assuming that p'zr <
<
..'x
:;c
( 10.42)
1), pj. -p')
-/t;)2
=
.E'(Arr
=
=
c)l,
,
1. Moreover, given that
ly
jj
11
frnr,nkl-iyj.1 j i
-x(0,
VartFk.l
=
ry.), rb jj
cy2<
D
'r)
mr - #
=
=
&A-r-Y))
1
n-1
/ = - Fr +k + n
Z
rl-'
l
-
n
?
j
!1
--s
F(-Yr+k) +
1
-
nc
!
Z 2 ;(A'rA-)) i y,./
( 10.43)
#r#k,
we can deduce that Covtrrlrlh)
1
=
-
n
(/t;+k -/.t;/zk).
( 10.44)
Since Covtnlrrnk) fitnkrrnkl
-/z;/t'k,
We
=
(iv)v'km, l'nr
=
can deduce that
D
z -x(0,E),
-p'r)--+
(n'?11 ma,
.
.
.
,
m,.), p',.
=
(/t/1p'2 ,
,
.
.
.
,
/t',.)
(10.45)
The results (i)-(iv)can be p'ipj' i,j 1,2, with E gcfjqo-ij p' Xi to seen as extensions of the limit theorems in Chapter 9 for the sum the general sum )'-: .Yr, rb 1. Having derived the asymptotic results (i)-(iv)for the sample raw of moments we can now extend these results to any continuous function them using Lemma 10.1 above. =
,
=
..j
-
,
=
,r.
.
.
.
Zl'j
to asymptotic
Introduction
theory
Example 7
lf
4?( '
) loget =
) then for rnr >
'
0, p'v> 0, Z
>
0:
P
logetrnr) -->
logel/t'rl,'
a.S.
(ii) logetn-lr)
logetp/,.l;
-'.
/t;)j ,.' c & (/-tzr / -pr )
I)ltn.?r .52-,--, -
(iii) log e
,
,
D
log (? z
-.+
The last result, however, is not particularly useful because we usually need distribution and of derive of that lmrl the not to ,.' (;1! -p21 asymptotic of Let distribution derive the us #?lx''Entrrlr x' pzv -pr )). g(mr), taking the opportunity to use some of the concepts and results propounded above. and hence from the From 01 above we know that nly=p'r + 0/ 1/V'n) Mann-Wald theorem we can deduce that if q( ) has continuous derivatives of order k then ,.F'''
/.
,,
,,
'
( 10.46) Assuming that k
qm) r
and
=
2 we can deduce that
-f?(/.t',.) -g'''(/z'r)(rnr -/t',.)
andby o2, ) -f?(p'r)) x/''N (# (r?'? r
Op(n- t)
-
v'-nln'lr-/,t'r)gt'1(p')r
-
+ o p (1).
Let
1z;,x' =
then
n(g(rrlr)
P;, &U,,+ =
-(/(/.tr)),
L ,,
=
x.'
( 10.47)
-Jt:)
lit?'Flr
t.
=J
( 10.48)
1).
0p(
(/zr)#
D
D
From Lemma 3 we may conclude that if b'n -+ U then 1,$-+ cb', and thus j'
(v) w/ ntf/trrlrl For example, if glnr)
-g(/.tr))
=
y
t
'-a
A'(0- (/tzr-pr
y
g
.a). )Eg )(/-tr)1
jj
f
(10.49)
loge n, then ,
/2
/t 2r -- 1if 7 n'lr- loge pv) N 0, . wj .. N ,''ntloge Jtr a ,
'v
.
( 10.50)
to the
This result can be generalised G(In,.)
20 1
Extending the Iimit theorems
10.5 H(g1(l'nr),
#./2(I'nr),
.
.
.
,
case (iv) directly. lf we dene
vector
f?k(ln,.))
(10.51)
v'n(G(mr) -G(gr))'v N(0, DED'), where (I'l1r)
I
=
.-
'li ln'l./
i
,
mr
=
1, 2,
.
.
I, j
,
.
1 2,
=
,
.
.
,
,
r.
-/z;
The above procedure is known as the (see Rothenberg ( 1984/. generalised results directly also above to an I1D sequence of be The can crucial changes. In order 1), 1 without X,,:/g x any random vectors kfX,,,n y results that easily show useful these one of the most are we can to see how estimator lV, of Borel the functions interest in Part important -metbod
( 10.52)
1
z1
.i ?'1
J4
11 i
11
1
=
j
xf
,
?1
j'r 2 ! f
.
1
=
then ,rs)
-
qz,
c'2,
,
.7:,-
.-,.,
=
-Sz - zl zz 2 zs z 1
( 10.53)
.
-
Let us suinmarise the argument so far. ln our attempt to derive asymptotic A-,,)we extended the limit results for arbitrary Borel functions /;t'l, Xz, general Xi to more theorems for j)'- 1 sums ))- 1 -Yr, r7: 1, and then these continuous extended functions of these general sums. results were to following The results so far suggest the way to extend them even further. of k-dimensional random vectors kX,,,n y 1) Consider the 1ID sequence ?- (analogous to and define the Borel functions Ii ): Rk R, i 1, 2, EE. /,.(X,,)) (the z'fs above) with )'- 1 .-;, r > 1) and Z,, (!1(X,,), /c(X,,), iRbe a Borel function (f?( ) above), and define /1(Z,:) Let ( ): Rr '(Z,,) 2 ff. where Z,, ( 1 .n) f I Zi. The above result ln terms of /?(Z,,) being continuously differentiable at Z,, # takes the form .
'
-.+
.
=
.
,
=
.
=
.
.
.
.
.
.
,
-
.
-+
,1
w
=
/1(/z)) N, Un((2,,) 'w
1
DVD'),
(10.54)
202
Introduction
to asymptotic theory
where 5,5 Cov(Zj),
D
=
(see Bhattacharya
P(Z) t'lzf
=
i
=
.
.
.
,
r
( 1977) and references therein).
Error bounds and asymptotic
10.6
1, 2,
=
zmJ,
expansions
The CLT tells us that under certain conditions the distribution of the gVar(,%,)q), tends to a standard standardised sum F;, (gS,, normal r.v., i.e. '(5',,)j/v
=
lim F,,()?) -.+r
=
-
y
(14.J7)=
11
-
1 x
l.o 2
exo 1-
x (21)
-
i'
k
u
) dl.
( 10.55)
It provides us with no information as to how large the approximation error l.&(y) is for a particular value n. The first formal bound of the approximation error was provided by Liapunov in 190 1 proving that: For X, X1, Vartxtf) c2 and an I1D sequence of nv.'s with E(Xi) 3) B pl E v3, all finite for i 1, 2, -
,
*(y)I .
.
=p,
.
=
'
(I.'f
=
-
.
-*4A,)1 'K C v sup 1J--,,(.:) .-G R
>
rs
.
.
,
log n
(10.56)
----,
v
N
w'here C is a positive constant. This result was later sharpened by Cramer but the best result was provided by Berry (1941)and Esseen (1945). Berry-Esseen
theorem
Under the same conditions as tbe Liapunov result we can deduce that
c uu supl#,,ly) -*(y)1 G--.r,i ca x'
33 C--j-.
,
>
,
( 10.57)
y
That is, the factor 1og n was unnecessary. As we can see, this is a sharpening of the Lindeberg-l-evy CLT (seeChapter 9) in so far as the latter states that *(y)1 0 as n :x. without specifying the rate of convergence. The Berry-Esseen theorem using higher moments provides us with the a dditional information that the rate of convergence is O(n-). Various authors improved the upper bound of the error by reducing the constant C. The best bound for C so far was provided by Beeck ( 1972), 0.4097 G C <
supyI#,,(y) -
-+
-+
0.7975.
For an independent but not identically distributed sequence of nv.'s Xn with '(-Yf) paf < :c i Vartatf) c/ and F(1Arf 1,2,
XL,
-/tfl3)
=pi,
.
.
.
,
=
=
,
=
.
.
.
,
10.6 Error bounds and the Berry-Esseen
asymptotic
203
expansions
result takes the form
( 10.58)
Although these absolute magnitude (i.e. uninformative) :yz as n y order to reflect the
results provide us with a simple and usually accurate of the approximation error, they become trivially true when y is allowed to vary with n. For example if --+ --> then 6,(.$and *(.y) tend to zero (seePhillips (1980/. ln dependence of the upper bound on y as well, the BerryEsseen result can be extended to ':c
-
v:3
*(.$1 G c' 1f-*,,4.r)
>
-
:$
c
1 1+ yc
for all y e' EEi
( 10.59)
(see Ibragimov and Linnik (1971/. An important disadvantage of these results is that they provide us with no information as to how we can improve the asymptotic approximations provided by the CLT or how we can choose between asymptotic approximations provided by the CLT or how we can choose between asymptotically equivalent results. This line of reasoning leads us naturally to the idea of asymptotic expansions related to the approximation error. The idea underlying some asymptotic expansions of the approximation error is that we can think of the asymptotic distribution as the first term in an expansion of the form r (y) ,4
Fk(.p) *(.r)
>
-
+
Z (,,2)+ Rr,,(.v),
( 10.60)
-----.-,
f=1
here the terms of the expansion are polynomials in y in powers of (n-'i)and -r/2). This is the so-called Edgeworth expansion the remainder Rr,, O(n U idely used in econometrics. Let us consider how such asymptotic expansions can be derived in the present context. Let be the density function of an appropriately normalised Borel X,,. function of the llD sequence of r.v.'s -Y1, X1, is assumed to belong to a particular family of functions defined over the interval ( :f;, x) nd satisfy the condition
%&
=
,(x)
,(x)
.
.
.
,
-
X
-X
dx< I/'(x)I2
.....c ,
xnown as square integrability. The space of al1 such functions is denoted by J-2l - w :c) and constitutes a Hilbert space (seeKreyszig (1978/.One of ,
204
Introduction
to asymptotic theory
the most important properties of 1-zt is that any element in this space can be represented or sufficiently accurately approximated by the use of the orthogonal polynomials: ,yzl
,yz
-
,
Hklx)
(
=
..
1) dk4(x)
-
--
/(-Y)
'
dx k
,
k 0, 1 2, =
,
.
.
.
( 10.62)
,
known as H erm ite polynomials of degree k; /(x) (J N /(Jzr)j exp'f( - 1.x2 2 y, the density function of a standard normal r.v. This is a natural extension of the concept of an orthogonal basis spanning a linear space. The of polynomials orthogonality these comes in the form of =
WI
Hks,xjHmlxjst.
dx
=/f!
=0 The first five of these polynomials H0
1 H1
=
,
are
=x2
Hz
=.x,
otherwise.
-
1, H5
=
x3 -3x,
-6.%2 + 3. ( 10.64)
=x*
H.
The density function /,,(x) is assumed to be an element of f.2( it represented thus can be in the form
and
'.yl
-
'wx ,
J.
bkllkx).
./;,(x)k=0 -
( 10.65)
sufficiently accurately by Although in principle we can approximate choosing a finite value for the summation. in practice we prefer to approximate the ratio L)n' (x)/4(x)qin view of the fact that it is usually and thus easier to approximate sufficiently accurately smoother than by a low degTee polynomial. This ratio can be approximated by ,(x)
,(x)
,&(.Y)
r
.4(x) :x: k ) bkuklx), o
( 10.66)
-
where
( 10.67) These coefficients are chosen so as to minimise the error ./;
. -
/(x) .
'
(x)
t)(x)
2
r
-
k
Hence, the density function
-
(o bklikxs ./,',(x)
(I.Y.
can be approximated
by
./')(x)
where
r
=4(A:) /*1(x)
Z bklikx).
k=0
(10.69)
10.6
Error bounds and asymptotic
205
expansions
As we can see, the coefficients bk, /( 0, 1, 2, as defined above are directly of For instance the moments related to /,',(x). =
b3
=
3/t'1 6 hp - ),
.
.
.
etc.
This implies that if we know the moments of up to some order r we can it. approximate to l use .J,*(x) ln view of the relationship between the derivatives of /(.x)and Hermite polynomials /',*, (x)can be expressed in the form r
.
/*(x) ). .
-
l1
k=0
./?',(x)
P
-' 4tk) (x), j yjy
(10.70)
where
4('(x)
=
dk/tx) k dvk ,
1, 2,
=
.
.
.
.
of 4(x) and its is a linear combination is, the approximation derivatives up to order r,' see Cramer (1946). This form of the approximation approximation. In terms of the series is known as the Grtkr?1-Cllf.'l?'/ft?ljXd?zlthe approximation is cumulative distribution function F,,(x) .
That
-4
./;,(u)
=
r
1
F,*(x) '
=
k=0
r -1 (I)tk) (x). k!
( 10.72)
The question which is of primary interest about these approximations is ,(x) as r not so much whether the above series converges to F,,(x) or but whether for a small 1. < vl the above series will provide a good approximation to these functions or not. A particular way to proceed which enables us to choose the order of the approximation error is to break down them according to lhe terms bkblkqxjinto components and then -'. That of is, express .J,*, (x)in the n :heir order of magnitude. say in powers -+
,.',f
reassemble
f ;1'(x ) 4(x ) -
1+
r
)
.,1.
(x)
-.-3.6.,....-, k
rl) k (x?' -
l
,
( 10.73)
lntroduction
206
to asymptotic
theory
where is a polynomial in x. ln order to be able to choose the order of Rr,,(x) defined by Ro,(x) the approximation error (remainder) (x) we need to ensure that the above series constitutes a proper asymptotic expansion. An asymptotic expansion is defined to be a series which has the property that when truncated at some finite number r the remainder has the same order of magnitude as the first neglected term. ln the present case the remainder must be of order ,4ktxl
=l(x)
+ 1
(?1-)r
j (;
!
.
x (x) ojn- ((r
.
=
l.tr
-e
-f1,
1 ). 2 1)
(1:
.
.
g4)
This is ensured in the case of an expansion r
./'(x) k=0
akkzktx) + Rr(x),
-
(/k(x),k > 0)
when the sequence
/k+
1(.X)
=
( 10.75) is an asymptotic
sequence
as x
-+
O(/k)
xo
(10.76)
because then Rr(x) O(/k). Under certain restrictions =
>
r
k ( l
--
expansion
of the form
.4ktxl
Z
/(.'x) 1 +
.,(x)=
(seeFeller ( 1970))the ?') k
(10.77)
+ Rm
1'/21) asymptotic expansion and R (x) otn-nr'h This expansion widely applied Edgewortb and has is known as the been in the econometric literature (see Sargan ( 1976), Phillips (1977),Rothenberg ( 1984), inter alia). In order to illustrate some of the concepts introduced above let us consider the standardised Borel function
a proper constitutes
'',:
z,, x(7' =
m
1
*
''
-
l f
-
=
l Xi
( 10.78)
-p
1
c2, Exi -y)k Xn with E(Xi) Vartx'f) of the 11D sequence -Y1, 3, The problem 4, is first have face derive the necessary i to to we able of in order evaluate coefficients Z,, be the moments ck, k 0, 1, 2, to to E(Z,l,) EZn) Apart from the first two moments 0, 1, the higher r. l.. . moments involve some messy manipulations (see Cramer (1946/.Using these moments given by Cramer we can evaluate the first few coefficients'. .
=
.
.
.
.
.
=y,
,
zzzpi,
=
.
=
.r
=
,
c()= 1, cl
=0,
C5
=
1 -
&5,
=
10.6
Error bounds and asymptotic
expansions
207
3, 4, 5, 6 refer to the cumulants of Xi c (seeChapter 4); c*) 1946:. The Gram-charlier 3 Cramer ( (see y5 (/t4 approximation of for r 6 takes the form where Ks
=
i
Ki,
-p)
=
c3,
&4.
=
-
,(x)
=
,:3
lc l & 1 (3J(x) +. 1 .... 45(x) y*txl.f:(x) 4(x) --3!k-.4 -j -4 ua yf u 1 .s, + l0,l 4(,j (x). + -f n n =
-
(j(j..yq)
Collecting the terms with the same order of magnitude we can construct the Edgeworth expansion
1x
1
--l
*
1
4(x) 3! nz $ aj(x)+ V V &44
./;,(x)=
-
(4j
10
(x)+ 6! x14 (6j (xj
.j.
g c,,,
(10.80) here Rc,, O(n-f). ln terms of the Hermite polynomials moments this takes the form
W
w
=
./;,(x)-
4(x)
1
+)
and the central
(q) (?-:-
3) 1',2:x'
''33tx'+-' !
o'
+
n
c
10
p5 a c
2
S6(x)
6!
+ Ra,,,.
(10.8 1)
lt is important to note that the presence of cumulants in (79)and (80)is no coincidence. lt is due to the fact that for the sum of standardised IID r.v.'s &: =
O(n1-4r/2))
,
From these expansions we can see that in the case where the distribution of the CLT approximation is of order 1 n the Arfs is symmetric g(/t3c3) and when the kurtosis a4= (p4c4) 3 the approximation is even better, of order 1 nl and not of order 1 x/'n as the CLT suggests. ln a sense we can interpret the CLT as providing us with the first term in the above Edgeworth expansion and if we want to improve it we should include higher-order terms. and Edgeworth expansions for arbitrary Borel The Gram-charlier Ar,,can be derived similarly when the moments needed functions of aY1 to determine the required coefficients are available (see Bhattacharya (1977)). When these moments are not easily available the approximate moments considered in Section 10.5 can be used instead. lt must be emphasised that Edgeworth expansions are not restricted to Borel functions (X) with asymptotic normal distributions. Certain Borel =0)
=
,
.
.
.
,
208
Introduction
to asymptotic
theory
functions of considerable interest in econometrics (see Chapter 16) have asymptotic chi-square distributions for which Edgeworth approximations can be developed (see Rothenberg ( 1984)). The only difference from the derivation of the normal Edgeworth expansion is that the appropriate space is Lz(0, ) and the corresponding orthogonal polynomials are the socalled Laguerre polynomials: 'yt
x dk .L,,(x) e i. (.x e -=
'7),
k
t.ydx
1 2,
=
,
.
.
.
( 10.82)
.
Error bounds and asymptotic expansions of the type discussed above are of considerable interest in econometrics where we often have to resort to asymptotic theory (see Part 1V). Important Convergence
concepts
almost
surely, convergence
in rth mean, convergence
in
probability, convergence in distribution. convergence on S, convergence uniformly on S, big Op, small op, order of magnitude, Mann-Wald theorem, Taylor series expansions, sample raw moments, J-method, limits approximate moments, of moments, asymptotic moments, uniform theorem, integrability, Berry-Esseen integrable functions, square polynomial approximation, Hermite polyorthogonal polynomials, series A, asymptotic nomials. Gram-charlier sequence, asymptotic expansion, Edgeworth expansion.
Questions Why do we need to bother with asymptotic theory and run the risk to use inaccurate approximations? Compare and contrast the two modes of convergence, convergence in probability and almost sure convergence. Explain intuitively why P
.s.
a -+
=
-+ ,
Compare the concepts of convergence in probability and convergence D p = in distribution and give an intuitive explanation to the result -.+
-..
Explain the Cramer-Wald device of proving convergence in distribution for a vector sequence tX,,, ?? y 1) Explain the Op and op notation and compare it with the 0, o notation of mathematical analysis. Discuss the Mann-Wald theorem on how to derive stochastic order of magnitude results from non-stochastic ones. Explain the Taylor series expansion for a Borel function /l(X) of a j. sequence of r.v.'s X H (' j .
'2,
,
-,,
.
.
,
,
10.6 8.
Error bounds and asymptotic
expansions
209
Explain how the results of the limit theorems for )'.j Xi can be extended to the sample raw moments )'-k A'r, rb 1. How can the latter be extended to arbitrary continuous functions of them'? Discuss the for deriving the asymptotic distribution of f7(/?) ((/( ) continuous) from that of rnr. Compare and contrast the following concepts: ?'th-order (i) moments', (ii) limits of rth-order moments; asymptotic rth-order moments; and (iii) approximate moments. (iv) -method
'
Explain intuitively the concept of uniform integrability. Discuss the Berry-Esseen theorem and its role in deriving upper bounds for the approximation error in the CLT. Explain the role of Edgeworth expansions in asymptotic theory. Discuss the derivation of an orthogonal polynomial approximation to an appropriately standardised Borel function of a sequence of r.v.'s t-Y,,sn y 11) in the context of the space 1w24 %, ). Explain the difference between Gram-charlier and Edgeworth -yt
-
,
expansions. What order of approximation does the CLT provide in the case where the skewness and kurtosis coefficients of )'=l Xi are zero and three respectively'? Explain. Discuss the question of how Edgeworth approximations can help us discriminate between asymptotically equivalent Borel functions of a sequence of r.v.'s X I1 n > l 1 .(f
*
5
Exercises Determine the order of magnitude
sequences: n3 + 6n2 + 2 -
6n3+
n rr::rc .1 2
1 .-
,
,
1.,1 1 2 =:::
Determine
,
,
.
.
.
(bigO and
small
0)
of the following
,. ?
.
.
.
.
the order of magnitude
of the following series:
to asymptotic
Introduction
thtxory
For the 11D sequence ).Y,,, n k: 1) where A-,, N(0, 1) we know p,, ()'- 1 Xl) z2(n).From the CLT we know that h,,(X) (p,, 'v
=
-n)/
=
'w
Ev/'(2n)) N(0, '.w
(Z
order 1/n. Hint: Additional
1). Derive the Edgeworth a,,
=
(2x/2)/x/n,a4,,
=
approximation
of
(X) of
12/n.)
references
Bhattacharya and Rao (1976); Bishop Wallace (1958).
et al.
(1975);
Billingsley
(1968);Rao (1984);
PART
lII
Statistical inference
C H A P T E R 11
The nature of statistical
11.1
inference
Introduction
ln the discussion of descriptive statistics in Part l it was argued that in order to be able to go beyond the mere summarisation and description of the it was important observed data under consideration to develop a mathematical model purporting to provide a generalised description of the data generating process (DGP). Motivated by the various results on frequency curves, a probability model in the form of the parametric family of density functions *= 0), 0 ( (4) and its various ramifications was formulated in Part ll.providing such a mathematical model. Along with the formulation of the probability model * various concepts and results were discussed in order to enable us to extend and analyse the model, preparing the way for statistical inference to be considered in the sequel. Before we go on to consider that, however, it is important to understand the difference between the descriptive study of data and statistical inference. As suggested above, the concept of a density function in terms of which the probability model is defined was motivated by the concept of a frequency curve. lt is obvious that any density function 0) can be used as a frequency curve by reinterpreting it as a non-stoehastic function of the observed data. This precludes any suggestions that the main difference between the descriptive study of data and statistical inference proper lies with the use of density functions in describing the observed data. 'What is the main difference thent?' In descriptive statistics the aim is to summarise and describe the data under consideration and frequency curves provide us with a convenient way to do that. The choice of a l-requency curve is entirely based on the data in hand. On the other hand, in statistical inference a probability model (l) is
rtx,'
./'(x;
The nature
of statistical
inference
postulated a priori as a generalised description of the underlying DGP giving rise to the observed data (not the observed data themselves). Indeed, there is nothing stochastic about a set of members making up the data. The stochastic element is introduced into the framework in the form of uncertainty relating to the underlying DGP and the observed data are viewed as one of the many possible realisations. ln descriptive statistics we start with the observed data and seek a frequency curve which describes these data as closely as possible. In statistical inference we postulate a probability model * a priori, which purports to describe either the DGP giving rise to the data or the population which the observed data came from. These constitute fundamental departures from descriptive statistics allowing us to make generalisations beyond the numbers in hand. This being the case the analysis of observed data in statistical inference proper will take a very different form as compared with descriptive statistics briefly considered in Part 1. In order to see this 1et us return to the income data and discussed in Chapter 2. There we considered the summarisation description of personal income data on 23 000 households using descriptors like the mean, median, mode, variance, the histogram and the frequency curve. These enabled us to get some idea about the distribution of incomes among these households. The discussion ended with us speculating about the possibility of finding an appropriate frequency curve which depends on few parameters enabling us to describe the data and analyse them in a much more convenient way. ln Section 4.3 we suggested that the parametric family of density functions of the Pareto distribution
could provide a reasonable probability model for incomes over f4500. As can be seen, there is only one unknown parameter 0 which once specified flx,' 0) is completely determined. In the context of statistical inference we postulate * a priori as a stochastic model not of the data in hand but of the distribution of income of the population from which the observed data constitute one realisation, i.e. the UK households. Clearly, there is nothing cr/rrc in the context of descriptive wrong with using .J(x;0) as a frequency statistics by returning to the histogram of these data and after plotting .J(x;p) for various values of 0, say 0= 1, 1.5, 2, choose the one which comes closer to the frequency polygon. For the sake of the argument let us assume that the curve chosen is 0= 1.5, i.e.
11.2
The sampling model
This provides us with a very convenient descriptor of these data as can be easily seen when compared with the cumbersome histogram function
(see Chapter 2). But it is no more than a convenient descriptor of the data in hand. For example, we cannot make any statements about the distribution of personal income in the UK on the basis of the frequency curve .J*(x).ln order to do that we need to consider the problem in the context of statistical inference proper. By postulating * above as a probability model for the distribution of income in the UK and interpreting the observed data as a sample from the population under study we could go on to consider questions about the unknown parameter 0 as well as further observations from the probability model, see Section 11.4 below. ln Section 11.2 the important concept of a sampling model is introduced (p= ).(x; @,0 e:O), as a way to link the probability model postulated. say model E!E sampling available. The x,,)' obstrved data the x (x1, to statistical needed ingredient define second important to a provides the statistical inference. model; the starting point of any ln Section 11.3, armed with the concept of a statistical model, we go on to discuss a particular approach to statistical inference, known as the frequency approach. The frequency approach is briefly contrasted with another important approach to statistical inference, the Bayesian. A brief overview of statistical inference is considered in Section 11.4 as a prelude to the discussion of the next three chapters. The most important concept in statistical inference is the concept of a statistic which is discussed in Section 11.5. This concept and its distribution provide the cornerstone for estimation, testing and prediction. .
.
.
,
iparametric'
11.2
The sampling model
As argued above, the probability model * ).J(x;0), 0 G 0) constitutes a very important component of statistical inference. Another important element in the same context is what we call a samplinq model, which provides the link between the probability model and the observed data. lt is designed to model the relationship between them and refers to the way the observed data can be viewed in relation to *. In order to be able to formulate sampling models we need to detine formally the concept of a =
sample in statistical inference.
nature of statistical
Te
inference
D ehni tion 1
-4 sample is J#n?J .
.
A',,)
. jnction ,
to
??
(?./a st.ar random
variables
(-Y1 -Ya (?-.p.-s) (lensit ,
densit (nctions tr'o!'nc!'Jpwith the 0o4 t?y postulated b)' the probabilitl' model.
'krl/tpst;r
./'(x,'
.
*rp-rft.a'
'
J'
Note that the term sample has a very precise meaning in this context and it is not the meaning attributed in everyday language. In particular the term does not refer to any observed data as the everyday use of the term might suggest. The significance
of the concept becomes apparent
when we learn that the
observed data in this context are considered to be one of the many possible realisations of the sample. ln this interpretation lies the inductive argument of statistical inference which enables us to extend the results based on the observed data in hand to the underlying mechanism giving rise to them. Hence the observed data in this context are no longerjust a set of numbers we want to make some sense of, they represent a particular outcome of an experiment; the experiment as defined by the sampling model postulated to complement the probability model * f(./'(x; p), 0 g (4) =
.
Given that a sample
distribution
which
Djlnition
is a set of r.v.'s related to * it must call the distribution of the sample. we
have a
2
The distribution of the sample X iE!E(jtp/'!rdistl-ibution t?/' tbe ?-.r.'.$ .zY, ,
/-(x
-v,,
,
,
.
.
.
,
; 04 N
.
.
A',,)' is /t?.#??t?J to lat?the zY,,delloted b
1,
.
.
-
.
.'
,
.
/'(x ) p).
The distribution of the sample incorporates both forms of relevant information. the probability as well as sample information. It must come as p) plays a very important role in statistical no surprise to learn that 04 depends crucially on the nature of the inference. The form of sampling model as well as *. The simplest but most widely used form of a sampling model is the one based on the idea of a random experiment (f (see Chapter 3) and is called a random sample. ./'(x;
./'(x;
De-hn rt'pn 3 zY,,)is called a random .,4 set t?. random variables (.Yj A'a. 0) (/ the A' j zY2, X,, are independent (7/7J sample jom identicall), distributed (11D4. //7 this ctk-s'r tbe distribution of te -
.f(x;
,
.
-
.
-.t'.'s
,
.
.
.
,
11.2
The sampling model
the sample rt//t-e'-s
/rn'l
vqualit). Jlf? t() independence and f/? (.7s()('()n(l
tbe //?f?l-.v.'s
.#?-sr
identically distributed.
are
rf?th (.2
l'ha t
.?c/
One of the important ingredients of a random experiment t.') is that the experiment can be repeated under identical conditions. This enables us to construct a random sample by repeating the experiment n times. Such a procedure of constructing a random sample might suggest that this is feasible only when experimentation is possible. Although there is some truth in this presupposition, the concept of a random sample is also used in cases where the experiment can be repetted tlnder identical conditions, if ln order to see this let tIs consider the personal income only conceptually. example where (I) represents a Pareto family of density functions, KWhat is a random sample in this case?' lf we can ensure that every household in the UK has the same chance of being selected in one performance of a conceptual experiment then we can interpret the ?? households selected as X,,) and their incomes (the representing a random sample (. 1 X1, sample. being realisation of the ln general we denote the data) observed as a x,,l', where x Xt,)' and its realisation by x EEE(x1 salnple by X EH ( l k./',' usually values obsert'ation yptkt't? assumed take i.e. in the is to xG R'' , (?' A less restrictive form of a sampling model is what we call an independent sample- where the identically distributed condition in the random sample is ,
,
.
.
.
'
.
.
.
.
,
.
.
.
.
,
.4,'
=
.
relaxed.
#ni l???
D
4
..4set' q/' r.t'.-s (X 1 /'( i ; p) 1 1 2 . 1, lt1 l'nr/t?pta/?tt-i'n f(l ?'l
,
.
.
.
-
-:'
.
,
,
.
.
.
,
2/,11..$
A',,) is said n
,
l-espet-
Jo
l i ?.rt?/ .
'k'.
bv (7?7 independent sarnple ?'/' 11 t t? l.. ?...*s 1 .'.'.t''
case r/lt?distribution
(?j'
l'tl'n
.7
,
.
.
.
.
.,
lt
(1p-t?
sanlple rtk/t?.'.k the r/'7t.?
( 11.4) belong to the same the density functions I-l.i ; %). i 1 family but their numerical characteristics (moments, etc.) may differ. If we relax the independence assumption as well we have what we can call non-random sample. a U sually
,2
=
,??
-
.
.
.
The nature
of statistical
.1) tlili J'1,-.l
-''
inference
said tp be a non-random sample from s .Y,,)is ( I r.1?.'s ('/' t'be .-Yt,are non-llD. ln this case the X 1 x,,,' 0) J(x1 t?/' the distribution t#' the sample possible is onlv lcl'/rn/?t/./lf?r/ tt/' ?'.?..-y
4 se:
,
.
.
,
z
,
.
.
.
,
.
.
.
.
,
( 1 1.5) given xo wllprtl .J'(xj conditional distribution of Xi
'x:
,x
,
,
.
.
.
,2,
i 1 f/illt??'l Xl, X2, -
j
; 0i4'
,n,
=
.
.
.
.
.
.
,
represent
zTi -
the
1.
A non-random sample is clearly the most general of the sampling models considered above and includes the independent and random samples as special cases given that
Xn are independent r.v.'s. Its generality, however, renders the whcn A-l, unless certain restrictions non-operational are imposed on the concept restrictions have been heterogeneity and dependence among the A-js. Such often restrictions extensively discussed in Sections 8.2-3. ln Part IV the used are stationarity and asymptotic independence. ln the context of statistical inference we need to postulate both a probability as well as a sampling model and thus we define a statistical model as comprising both. .
.
.
,
Dhnition
6
,4 statistical model is dejlned as comprising (p (i) a probability model )./'(x;p), 0 e: (6)),.and > model Ar,,)'. samplinq X (-1 Xz, (f) a The concept of a statistical model provides the starting point of all forms of statistical inference to be considered n the sequel. To be more precise, the concept of a statistical model forms the basis of what is known as parametric inference. There is also a branch of statistical inference known where no * is assumed a priori (see Gibbons as non-parametric inference ( 1971)). Non-parametric statistical inference is beyond the scope of this book. lt must be emphasised at the outset that the two important components of a statistical model, the probability and sampling models, are clearly interrelated. For example, we cannot postulate the probability model *= =
,
.
.
.
,
11.3
The frequency approach
This is because if the r.v.'s the probability model must be defined in Jt-ftxl, x,,; p), 0e O). Moreover, terms of theirjoint distribution, i.e. (I) sample we need independent distributed the of identically not but in case an sample, i.e.*= functions each individual specify in the for density the nv. to n). of this The important implication 1, 0 6 2, .fttxk; 0), most 0, k when postulated sampling model is found the relationship is that to be respecified model probability be it has the inappropriate as to means that well. Several examples of this are encountered in Chapters 2 1 to 23.
f J(x; 0), 0 G (4) if the sample X is non-random.
A'I,
.
.
.
,
-Y,,are not independent
=
=
.
.
.
.
.
.
,
,
The frmuency approach
ln developing the concept of a probability model in Part 11it was argued that no interpretation of probability was needed. The whole structure was built upon the axiomatic approach which defined probability as a set function #( ): .F E0,1q satisfying various axioms and devoid of any interpretations (see Section 3.2). In statistical inference, however, the interpretation of the notion of probability is indispensable. The discerning reader would have noted that in the above introductory discussion we have already adopted a particular attitude towards the meaning of probability. ln interpreting the observed data as one of many possible realisations of the DGP as represented by the probability model we have committed ourselves towards the frequency interpretation of probability. This is because we implicitly assumed that if we were to repeat the experiment under identical conditions indefinitely (i.e. with the number of observations going to infinity) we would be able to reconstruct the probability model *. In the case of the income example discussed above, this amounts to assuming that if we were to observe everybody's income and plot the relative frequency curve for incomes over f4500 we would get a Pareto density function. This suggests that the frequency approach to statistical inference can be viewed as a natural extension of the descriptive study of data with the introduction of the concept of a probability model. ln practice we never have an infinity of observations in order to recover the probability model completely and hence caution should be exercised in interpreting the results of the frequency-approach-based statistical methods which we consider in the sequel. These results depend crucially on the probability model which we interpret as referring to a situation where we keep on repeating the experiment to infinity. This suggests that the results should be interpreted the long run' or as holding under the same circumstances, i.e. average'. Adopting such an interpretation implies that we should propose results' according to statistical procedures which give rise to '
-+
'in
koptimum
ton
The nature
of statistical
Probability
4)
inference
model G (i)j
1 (x ; $ ) 0
=
,
Distribution
/(x
X
1,
xz,
.
of the sample xn ; 0 ) .
.
,
Sampling model (.X1,Xa Xn )
EE
.
,
.
.
,
Observed data
x
EE
(x,, xz,
.
.
,
,
xp)
The frequentist approach
Fig.
to statistical
inference.
criteria related to this intemretation. Hence, it is important to keep this in mind when reading the following chapters on criteria for optimal estimators, tests and perdictors. The various approaches to statistical inference based on alternative intemretations of the notion of probability differ mainly in relation to what constittites rele,ant information for statistical inference anj lltpw,it should be tp///prtptzc/k processed. ln the case of the h-equency (sometimes called the classical approach) the relevant information comes in the form of a probability model (1) fytx,' @,0e 0. ) and a sampling model X N (-Y1 X1, Xtt)', providing the link between * and the observed data x E/ (x1,x2, . x,,)'. The observed data are in effect interpreted as a realisation of the . sampling model, i.e. X x. This relevant information is then processed via the distribution of the sample txj, xa, x,,; 0j (see Fig. 11.1). The subjective' interpretation of probability, on the other hand, leads to a different approach to statistical inference. This is commonly known as the Bayesian Jpproflc because the discussion is based on revising prior beliefs about the unknown parameters 0 in the light of the observed data using Bayes' formula. The prior information about 0 comes in the form of a probability distribution .J(p);that is, 0 is assumed to be a random variable. The revision to the prior .J(#)comes in the form of the posterior distribution Slong-run'
=
,
.
.
,
.
.
,
=
.
.J'(px)
,
.
,
via Bayes' formula: .,flx
./(p/x) . =
0f (t?)
':x
,
/(x)
..f(x/p)./
,
.
(p),
/'(x 0) being the distribution of the sample and
./'tx)
being constant
for
11.4
An overview of statistical
inference
X x. For more details and an excellent discussion of the frequency and Bayesian approaches to statistical inference see Barnett ( 1973). ln what follows we concentrate exclusively on the frequency approach. =
11.4
An o&'erview of statistical
inference
As defined above the simplest form of a statistical model comprises: (i) a probability model *= tf/'(x;04, 0 (E 0).., and Ar,,)' - a random sample. (ii) a sampling model X (A-1 Using this simple statistical model, let us attempt a brief overview of statistical inference before we consider the various topics individually in order to keep the discussion which follows in perspective. The statistical model in conjunction with the observed data enable us to consider the following questions: Are the observed data consistent with the postulatcd statistical ( 1) =
model?
,
-2,
.
,
.
,
misspecljlcation)
Assuming that the statistical model postulated is consistent with the observed data, what can we infer about the unknown parameters 0 e:0. ? Can we decrease the uncertainty about 0 by reducing the (a) parameter space from (.) to O() where (6):is a subset of (9:/ (conjldence estimation) about 0 by choosing a Can we decrease the uncertainty (b) particular value in (6), say 1, as providing the most representative value of 0 point estimationj Can we consider the question that 0 belongs to some subset (i)() of 6)? qhypotbesistestin Assuming that a particular representative value l of 0 has been chosen what can we infer about further observations from the DGP as described by the postulated statistical model? prediction) The above questions describe the main areas of statistical inference. Comparing these questions with the ones we could ask in the context of descriptive statistics we can easily appreciate the role of probability theory in statistical inference. The second question posed above (thelirst question is considered in the appendix below) assumes that the statistical model postulated is and considers various forms of inference relating to the unknown parameters 0. ivalid'
Point psrfmt/rt)n (t?- just estimatio: refers to our attempt to give a numelical value to 0. This entails constlucting a mapping h(') : 1.-.+* (see Fig. 11.2). We call function h(X) an estimator of 0 and its value h(x) an
The nature
of statistical
inference
Fig. 11.2. Point estimation.
I
O
g( # -
x
Oo
Fig. 11.3. lnterval
estimation.
estimate of 0. Chapters 12 and 13 on point estimation deal with the issues of estimators, respectively. defining and constructing 'optimal'
Consdence estimation: refers to the construction of a numerical region for 0, in the form of a subset (1)0of 0. (s Fig. 11.3). Again, contidence estimation gt ): 0. comes in the form of a multivalued function (one-to-many) .%.
'
-.+
Hypothesis testinq, on the other hand, relates to some a priori statement about 0 of the form Ho : o). e, (.la cu (6), against some opposite statement Hko@4v or, equivalently, #GO1 0. 1 r-ht.l4l Z and Oj kp(.)() 0. In a situation like this we need to devise a rule which tells us when to accept Ho tinvalid' HL in view of the observed data. Using the or reject as as postulated partition of (.) into (6)0 and 01 we need, in some sense, to construct a mapping t?(') :.4---+(1) whose inverse image induces the partition =
,
'valid'
q - 14O0) Cll - acceptance 140. 1 ) C1 q- rejection
region,
=
=
where Ca t..p Cj
-??''
=
region,
(seeFig. 11.4).
=
11.5 Statistics
and their distributions
223
'
e
Co
(%
f'l
e
Fig. 11.4. Hypothesis
1
testing.
The decision to accept Stj as a valid hypothesis about 0 or reject Ho as an invalid hypothesis about #, in view of the observed data, will be based on whether the observed data X belongs to the acceptance or rejection regions respectively, i.e. X G Crll or X (E C71(seeChapter 14). Hypothesis testing can also be used to consider the question of the appropriateness of the probability model postulated. Apart from the direct test based on the empirical cumulative distribution function (seeAppendix theorems. For 11.1) we can use indirect tests based on characterisation example, if a particular parametric family is characterised by the form of its first three moments, then we can construct a test based on these. For several characterisation results related to the normal distribution see Mathai and Pederzoli (1975. Similarly, hypothesis testing can be used to assess the appropriateness of the sampling model as well (see Chapter 22). As far as question 3 is concerned we need to construct a mapping 1( ): which will provide us with further values of X not belonging to the O sample X, for a given value of 0. '
...,t'
-+
11.5
Statistics and their distributions
As can be seen from the bird's-eye view of statistical inference considered in the previous section. the problem is essentially one of constructing some mappinq of the form:
q( ): '
.?' -+
O
or its inverse, which satislies certain criteria (restrictions)depending on the nature of the problem. Because of their importance in what follows such mappings will be given a very special name, we call them (sample) statistics'.
The nature
of statistical
Dellni
7
T't??
A statistic is
to /?t?any #0/-t?/ jnction (.s?t?Chapter 6)
y('?jt/
q( ) : 4
--+
.
Estimators,
R
'
'
Note that t/(
inference
,
) does not depend on any unknown parameters.
'
confidence intervals, rejection
regions and predictors
are al1
statistics which are directly related to the distribution of the sample. -statistics' are themselves random variables (r,v.'s)being Borel functions of r.v.'s or random vectors and they have their own distributions. The discussion of criteria for optimum is largely in terms of their distribtltions. bstatistcs'
Two important examples of statistics what occasions in follows are: numerous
Y,,
j
called the samplv
-j,'
.
i
sz
s.
. ..
ueun,
j
j =
on
11
)12
=
which we will encounter
11
-.-
n- 1i
y (x
-.
vkj,
i -
j
2 ,.
)
called the
sflnlplf;z
ptlrftlnc't?.
(1l
.
l0)
On the other hand, the functions / 1(
-7!k'
)
1
pi
&
.)k'
...--..j:f
cccn .... !.
(!17-) ( .
j
and 1
pl
y n .- 1 i 1
/a(.,y)
=
--
(xj
-/t)2
-,.
unless c2 and ;i are known. respectively. The concept of a statistic can be generalised to a vector-valued of the form
are not statistics
q(') : .i'
--+
(.)
=
R'l,
function
( 11.13)
As with any random variable, any discussion relating to the nature of (?(X)must be in terms of its distribution. Hence- it must come as no surprise to Iearn that statistical inference to a considerable extent depends critically on our ability to determine the distribution of a statistic t?(X) from that of X T.E.Ee (-Yl A-,,)',and determining such distributions is one of the most difficult problems in probability theory as Chapter 6 clearly exemplified. ln that chapter we discussed various ways to derive the distribution function of F ?(X) -2,
.
,
.
.
-
=
F( )')
=
.P#t?(X) : )')
,
( 11.14)
and their distributions
11.5 Statistics
when the distribution of X is known and several results have been derived. The reader can now appreciate the reason he she had to put up with some rather involved examples. All the results derived in that chapter will form the backbone of the discussion that follows. The discerning reader must have noted that most of these results are related to simple functions t?(X) Ar,,. lt turns out that most of of normally distributed r.v.'s, A'l, Xz, the results in this area are related to this simple case. Because of the importance of the normal distribution, however, these results can take us a inference avenue'. Let us restate some of these long way down results in terms of the statistics Y,, and s'l, for reference purposes. .
.
.
,
-statistical
Example 1
Consider the following statistical model:
.Jtx;0)
(I)
=
1
=
X > (A' j Xz, ,
.
.
.
'
(2zr)
c ,
2
1 x- u exp 2
,
c
0 EEE(p, c2)
A-,,)' is a random sample from
,(x;
G
R x R+ ;
#).
For the statistics
x=
1 N
11
y xj
11
=1
and
/
(.; J.t) .x'J.-..'7.-, x (() jj. ?
.
,
( 11.16)
,
G
( 11.18)
(iv) Note that n jl i
=
Arl .
1
.
''$ -
g
%''
u r'z
2
)
=
n
X -y l'
g
2
.$2
j
+ (n - j ) Jj
x
;
2
(n )
( 11.19)
The nature of statistical
226
inference
and
Covf x.
y,2)
.;f'
?1
.
0.
=
1
n( Xpj p ) -
.j
.r(u
(s2/c2) ,' yn ('rpl/z2) x
(11.20)
)
5
Sn
1, m
-
-
( 11.2 1)
1),
wherez,l is the corresponding
sample variance of a random sample (Z1, Zc, z2) s,l, zm2are independent. and from Npz, . z.) results follow from Lemmas 6.1-6.4 of Section 6.3 where the Al1these chi-square, normal, Student's t and F distributions are related. Using the distribution of g(X), when known, as in the above cases, we can consider questions relating to the nature of this statistic such as whether it tbad' estimator or test provides a (to be defined in the sequel) or a statistic. Once this is decided we can go on to make probabilistic statements about 0, the parameter of *, which is what statistical inference is largely about. The question which naturally arises at this point is: SWhat happens if we cannot determine the distribution of the statistic t?(X)?' Obviously, without a distribution for q(X) no statistical inference is possible and thus it is imperative to solve' the problem of the distribution somehow. In such cases asymptotic theory developed in Chapters 9 and 10 best' solutions in the form of comes to our rescue by offering us of t?(X). In Chapter 6 we discussed distribution the approximations to various results related to the asymptotic behaviour of the statistic T,, such .
.
,
,
Ggood'
Strue'
isecond
as: a S. .
--
X 11
T,
(iii)
.
p,
->
-+
p
and
p;
1
(Y,,-p)
w'
D
z x(0,1);
-,
( 11.22)
,x.
irrespective of the original distribution of the Xis. Given only that c2 < Vartfj) ln Chapter 10 these results were x; note that '(Y,,) general extended to more functions (X). ln particular to continuous functions of the sample rJw moments '(-f)
=p,
=p.
=
1
mr
=
-
ni
19
Z aYr ,
=
1
(11.23)
11.5 Statistics
and their distributions
227
ln relation to mr it was shown that in the case of a random sample
:
a.$.
lllv
-+
r?lr
--.
/.
pv, P
('ii)
;i';r
ftrl (1
( 11.24)
(iii)
with
'tr?1rl
=
p;,
avl y. =
-
(p;)2 ,
(11.25) assuming that pr < :x, It turns out that in practice the statistics :(X) of interest are often functions of these sample moments. Examples of such continuous functions of the sample raw moments are the sample central moments being dened by .
(11.26)
r > 1.
These provide us with a direct extension of the sample variance and they represent the sample equivalents to the central moments X
pr
(.X
=
(11.27)
d.X. -FI'.'JIXI
the help of asymptotic theory we could generalise the above asymptotic results related to n1,, rb 1, to those of F;, <(X) where q ) is a Borel function. For example, we could show that under the same conditions With
=
(i)
'
a S. .
br --. Jtr; P
#r
#r
-#
V'nt,
,'
-pr)
Gr+
where
e'*l r
=
n
z
/42,+ 2 '-ph
assuming that p2r <
'L
.x((),
-2(r
1
(11.28)
1),.
+ 1)/trpr+ a +
see exercise 1.
(r+
Lllyvlyz,
( 1 1.29)
228
inference
The nature of statistical
Asymptotic results related to L q(X) can be used when the distribution of Y;,is not available (or very difticult to use). Although there are many ways to obtain asymptotic results in particularcases it is often natural to proceed by following the pattern suggested by the limit theorems in Chapter 9: =
Step 1 Under certain conditions F;, :(X) can be shown to converge in probability to some function of (p) of 0, i.e. =
P
(f?), or
Y;,
-+
Construct two sequences )z; j (p)
a
Let F.(y*)
11
=
.
,,
?1
,. /1
(?,
h(04.
Y;:
( l 1.30)
c,,(@, n k: 1) such (11,,4p), o
-
y- +
.s.
-+
...+
z
x
denote the asymptotic
that ( 11 3 1)
N(0 1). ,
.
distribution of F,*, then ,
for
rfyc n
F,,(.F) :> Fw,(.1'*),
and F.(y*) can be used as the basis of any inference relatipg to F,, t?(X). A question which naturally comes to mind is how large n should be to justify the use of these results. Commonly no answer is available because the answer would involve the derivation of F,,(y) whose unavailability was the very reason we had to resort to asymptotic theory. ln certain cases higherorder apprbximations based on asymptotic expansions can throw some light on this question (see Chapter 10). In general, caution should be exercised when asymptotic results are used for relatively small values of n, say n < 100? =
Appendix 11.1 - The empirical distribution function The first question posed in Section 11.4 relates to the validity of the probability and sampling models postulated. One way to consider the validity of the probability model postulated is via the empirical distribution /nclt?n F)(x) defined by F,*,(x)
=
1 -
t1
(number of
xjs G x),
Alternatively, if we define the random variable (r.v.) Zi to be
Appendix 11.1
229
1 if xi (E ( vt0 otherwise, -
Zj
=
,
xq
,
)'-
distribution postulated in * is thell F,*,(x) ( 1/n) 1 Zi. If the original F(x), a reasonable thing to do is to compare it with F,*,(x). For example, =
consider the distance
o,,
=max
F(x)l, l#,*,(x) -
x
G
R
is the D,, as defined is a mapping of the form D,,( ): -+ g0,1(1where observation space. Given that Zi has a Bernoulli distribution F,*,(x) being Z,, is binomially distributed, i.e. the sum of Zj Zz, .9-
,?'
'
,
.
Pr F,t(x)
.
,
.
k
=
n
=
-
gF(x) k(1 E1
k
n
-
F(x)
,,
-
k
)I, k
0, 1,
=
.
.
.
,
n,
F(x) and Var(F,t(x)) where f'(Fl(x)) ( 1/n)F(x)(1 F(x)). Using the central limit theorem (see Section 9.3) we can show that =
=
-F(x)).. x,'',?(/7(x) tFtxlg 1
n
z x((),1). ,w.
F(x)q )
-
-
D
Using this asymptotic
result
it can be shown that
v'n
.D,,--+
.p
where
J?
F(y)
=
1- 2
j
k-1
1 ( 1/ -- expt - 22 -
p2)
,
This asymptotic distribution of x//nl)n can be used to test the validity section21.2.
of *;
see
Important
concepts
Sample, the distribution of the sample, sampling model, random sample, independent sample, non-random sample, observation space, statistical model, empirical distribution function, point estimation, confidence estimation, hypothesis testing, a statistic, sample mean, sample variance, sample raw moments, sample central moments, the distribution of a statistic, the asymptotic distribution of a statistic.
Questions Discuss the difference between descriptive statistics and statistical
inferenee.
Te nature of statistical inference
230
4.
5. 6.
Contrast .Jtx;0) as a descriptor of observed data with ftx; p) as a member of a parametric family of density functions. Explain the concept of a sampling model and discuss its relationship to the probability model and the observed data. Compare the sampling models: random sample; (i) (ii) independent sample; non-random sample; (iii) and explain the form of the distribution of the sample in each case. Explain the concept of the empirical distribution function. tEstimation and hypothesis testing is largely a matter of constructing mappings of the form g ): I (.).9Discuss. Explain why a statistic is a random variable. 1) (seeAppendix Ensure that you understand the results (11.15V(11.2 6.1). Being able to derive the distribution of statistics of interest is largely what statistical inference is alI about.' Discuss. Discuss the concept of a statistical model. '
9. 10.
--+
Exercises
1.* Using the results (22)-(29)show that for a random sample X from a distribution whose first four moments exist, nxn
-/z)
ntsaz -
c
2
)
XN
0 0
c2
ps
y5
y4
-cr*
Additional references Barnett
(1973);
Bickel and Doksum
(1977);Cramer ( 1946);
Dudewicz
(1976).
CHAPTER
12
Estimation
l - properties of estimators
Estimation in what follows refers to point estimation unless indicated otherwise. Let (S, P( ))be the probability space of reference with X a r.v. defined on this space. The following statistical model is postulated: '%
.
(.J(x;P), 0 6 O), O L
(i)
*
(ii)
X H(Ar1, Xz,
=
.
.
.
R;
X,,)' is a random
,
sample from
.J(x;0).
in the context of this statistical model takes the fol'm of constructing a mapping 4.)..I 0- where 3- is the observation space and Borel composite h ) is a function. The function (a statistic) $- h(X): S O called and value is its an estimator (x),x c I an estimate. It is important to distinguish between the two because the former is a random variable (r.v.) and the latter is a real number. Estimation
-+
,
--+
'
Example J
p)2), ps n, and x be a random sample Let ylx;p)- El/x/tznnexpt R'' and the following functions define estimators from .J(x;p). Then of p: .--ifx
-
.133-,,z:
1(
(i) (ii)
-
/1
Ni
p-2 P-3
-Yf,'
-
--
1
=
k
1
=
ki
)''1 X i'
..)'
''''''-
231
k
1 2,
=
9
.
-
fr
i
-
12 ,
'
*
w
.
,
n>
.
.
>
n
-
1;
Properties
p-z
=
p-5
=
(X 1 + X ); 11
N
1g
?1
-
N i
jj
2 i ',
zr
.
1
=
1
,1
-
p-6 $
1 -
of estimators
K f
iA-i =
1
1g
=
'
,
'
) n+ 1i j
Xf
.
-
lt is obvious that we can construct intinitely many such estimators. estimators is not so obvious. From the above However, constructing examples it is clear that we need some criteria to choose between these estimators. ln other words, we need to fonnalise what we mean by a estimator. Moreover, it will be of considerable help if we could devise such good estimators; a question general methods of constructing considered in the next chapter. kgood'
igood'
Finite sample properties
12.1
ln order to be able to set up criteria for choosing between estimators we need to understand the role of an estimator first. An estimator is representative constructed with the sole aim of providing us with the value of 0 in the parameter space 0- based on the available information in the form of the statistical model. Given that the estimator V= (X) is a r.v. (being a Borel function of a random vector X) any formalisation of what we representative value' must be in terms of the distribution of mean by a (? say ftb. This is because any statement about near f/is to the true p' probabilistic only be one. a can of 0 to satisfy is that estimator The obvious property to require a Jt#l is centred around p. tmost
>
,
.
'most
'how
'good'
Dejlnition 1
# 0-f 0 is said
.,1,7estimator t.'.;jtyyd
&p be an tmbiased
estimator
of 0
#-
X
E'lJ)
() dp.
=
........
. .
:
That is, the distribution parameter to be estimated.
t#'
( has mean equal to the unknown
Finite sample properties
12.1
but equivalent, way to define fit( is
Note that an alternative,
( 12.2) (iistribution 0) =.J(x1 xa, where x,,; ?) is the (?Jtbe sample, X. without having to derive either of the Sometimes we can derive abovedistributions by just using the properties of E ) (see Chapter 4). For example, in the case of the estimators suggested in Section 12.1, using and the properties of the normal distribution we can deduce independence N(p, l (1/n), this is because ('Ikis a linear function of normally that distributedr.v.'s (see Chapter 6.3), and .(x;
.
,
,
,
.
'(#)
'
'v
j
J;(;1)
-
E N
j
11
i
)( xi =
=
/
1
j
11
-/
i
)(
1;(A',.)
N i
l
=
N (?
11
p=--
= ''-=
p
=
D
1
(see Fig. 12. 1). The second equality due to independence and the property .E'(c) c if c is a constant and the third equality because of the identically distributed assumption. Similarly for the variance of #1 =
jj
flt)1 P)2 -
=
E
j
11
-j. K
f
=
1
(A-g- ?)2
=
K
j
11
.-j f
=
1
Eqxt. - 0jl
=
Ka i
j
11
1 =
1
=
.
K
(12.4)
k
::=:
1 2,
05 N(p, 1), 04. N 'v
'v
,
.
.
.
,
1-1.-
27 2 ,
n
nz
,
1
,
-? nz 2 (n 5x
-
,
np
2
) ,
Properties
of estimators
h
n + 1 0, (rl+ 1)(2n+ 1) 6n 2
-v.
N
',y
N
'v
,
n 0,
n+ 1
n
(n+ 1)c
.
Hence, the estimators #1,#2,#aare indeed unbiased but #4, 4) and #, are h, biased. We define bias to be B(0) f(#) p and thus ,46) E(2 n)/n?0, Rt/5 ) n2( 1 + p2)- p, #())) g(n 1j/2qp, #($j= 0(qn+ 1). As can be seen from the above discussion, it is often possible to derive the mean of an estimator # without having to derive its distribution. It must be remembered, however, that unbiasedness is a property based on the distribution of #.This distribution is often called sampling distribution of 1 in order to distinguish it from any other distribution of functions of r.v,'s. Although unbiasedness seems at first sight to be a highly desirable property it turns out to be a rather severe restriction in some cases and in most situations there are too many unbiascd estimators for this property to be used as the sole criterion for judging estimators. The question which naturally arises is, can we choose among unbiased estimators?'. Returning to the above example, we can see that the unbiased estimators #2,05 have the same mean but they do not have the same variances. j Given that the variance is a measure of dispersion, intuition suggests that the estimator with the smallest variance is in a sense better because its around p. This argument leads to the distribution is more second property, that of relative efficiency. =
=
=
=
-
-
-
-
ihow
,
iconcentrated'
Dehnition 2
zl.rlunbiased estimator #1of 0 is said to be relatiYely than some other unbiased estimator 'a
more efficient
t
vartj) < vartpc) or
)
=
and
1 -
n 1
vart'/al,z=<
is relatively more efficient than either
1
< -
=
k 1
I/) vartpa)< 1.
'1
ln the above definition since Vart'j
vartl )
=-----%-
efftlj
=
Var(2),
k
=
1, 2,
.
.
.
,
n
-
h or h
1,
varta),
i.e. h is relatively more efficient than t')a(see Fig. 12.2). In the case of biased estimators relative efficiency can be defined in terms of the mean square error (MSE) which takes the form E1-
0)1
=var()+
E#()q2,
(12.6)
12.1
Finite sample properties
/'(j)
f (z) ) / (J:., '!
r
0
that is, an estimator Or
'(* -
$2 <
MSE(*)
<
#* ij E(
-
'i
relatively more efficient than
if
p)2
MSE(t.
As can be seen, this definition includes the definition in the case of unbiased estimators as a special case. Moreover, the definition in terms of the MSE enables us to compare an unbiased with a biased estimator in terms of efficiency. For example, MSE($)
<
MSE(l),
estimator than (35despite the fact and intuition suggests that (Lis a Fig. 12.3 illustrates the case. ln unbiased and 0y is not; that #a is circumstanceslike this it seems a bit unreasonable to insist on unbiasedness.Caution should be exercised, however, when different distributionsare involved as in the case of ). 'better'
Let us consider the concept of MSE in some detail. As defined above, the MSE of depends not only on but the value of ()in O chosen as well. That is, for some pa e:O
MSEII,t?o)=E(4- %)2 =
figt-Ft')l
= Vart)
+
+ E)
-p())j2
g#(, p())12,
(12.7) pe) zittil 0v is the bias of #
thecross-product term being zero. B$, to the value po.Using the MSE as a criterion foroptimal estimators relative would like to have an estimator $ with the smallest MSE for a1l ?in O we =
Properties
of estimators
f'(J,)
(Ja) 'C
r'
0 Fig. 12.3. Comparing
the sampling distribution of
h and h.
(uniformly in O). That is, (?) MSEI#,p) MSEIV. :
for all p e:(6),
where Fdenotesany other estimator of p. For any two estimators #and o 0 if MSEI, ?) MSEIC p), () (E 0- with strict inequality lplding for some 0 G 0, is said to be inadmissible. For example, l above is inadmissible because :$
MSEIVI ,
04<
MSEIVa,?), for
a1l 0 e O if n > 1.
ln view of this we can see that 4 and s are inadmissible because in MSE terms are dominated by ,. The question which naturally arises is: can we find an estimator which dominates every other in MSE terms'?' A moment's reflection suggests that this is impossible because the MSE criterion depends on the value of ? chosen. ln order to see this 1et us choose a particular value of p in 0, say 0v, ani define the estimator 0* 00 =
for al1 x e:
,?,'
Then MSE(?*, 0v) and any uniformly best estimator would have to MSE(t?*, ?) 0 for a1l 0 g (9, since 0o was arbitrarily chosen. That is, satisfy whatever value! Who needs criteria in such a perfectly its estimate 0 case? Hence, the fact that there are no uniformly best estimators in MSE terms is due to the nature of the problem itself. Using the concept of relative efficiency we can compare different =0
=
'true'
12.1
Finite sample properties
estimators we happen to consider. This, however, is not very satisfactory since there might be much better estimators in terms of MSE for which we know nothing about. ln order to be able to avoid choosing the better of two Such a inefficient estimators we need some absolute lntzlsu/'t! t?/' tllciency. which provided takes form Cramer-Rao the is by !(?u':7b ound the measure
(7R(?)
=
---
2
d#(p)
1+
d
-
.
c
Iog ytx; ?) t?p
p
E
.
.J(x;p) is the distribution
of
,
sample
the
and #(t?) the bias. lt can be shown
0* of p
that for any estimator
MSE(p*, 04y CR(@
under the following regularity conditions on m: The ?)> 0 ) does ntpr depend t?n 0. )x : For each 0 g (i) the derivatives I)t?log (x, ?)(I ), i fo r a 11x g (CR3) 0 < f'Rglt)0)lof /'(x; ?)()2 < r.y. jr aII p (Eq(.),
(CR1) (CRJ)
./'(x;
.,4
-t?r
=
,'q(()
./
=
1. 2, 3, exist
.k.'
ln the case of unbiased
the inequality takes the form
estimators
2
lO g (x ; P) Pp ,/'
(
Var(p*) )v E
- 1
the inverse of this lower bound is called Fisher's by 1,,(P). Dejlnition
l/rrrlfkrfon and is denoted
.3
zln unbiased
t?/'
estimator
Vart#l
E
=
0
1Og '
0 is said to 2
J'(X,' P)
-
<
0
(fully)efficient #'
?t7
1
1,,(p)-
wpn /--/itrfpnr
That is, an unbiased estimator is Cramer-Rao drpwprbound.
its
1 .
tariance
equals tbe
ln the example considered above the distribution of the sample is ./'(xi;
-
=
1
p1
11
.f(x;p) il-1
p)
-
1
-,'/.2
= (2a)
exp t 17 -.--),.'-ja) (
-
exp
--
1
1k
--l-(x
-
v 11
a fj =
1
(xj- ?)2
'
p)2)
of estimators
Properties
and
2
d loy. .J(x;0) d0
E
An alternative equality
ojtd
2
E
=
F (x2
=n
way to derive the Cramer-Rao log
..sjd...2
.J(x; )42(j dp
by independence.
-p)
'--'j
lower bound is to use the
log tx;0) dp2
(j ,
which holds true under CR 1-CR3 and density function. p) is the and hence the equality ln the above example Ed21og .Jtx;(?)j/(dp=) holds. So, for this example, CR(@ 1/n and, as can bc seen from above, the only estimator which achieves the bound is #! that is, Var(#1) CR(p); hence #1 is a fully efficient estimator. The properties of unbiasedness, relative efficiency and full efficiency enabled us to reduce the number of the originally suggested estimators considerably. Moreover, by narrowing the class of estimators considered, to the class of unbiased estimators, we uniformly best estimator', succeeded in the problem of discussed above in relation to the MSE criterion, This, however, is not very sumrising given that by assuming unbiasedness we exclude the bias term which is largely responsible for the problem. Sometimes the class of estimators considered is narrowed even furtherby requiring unbiased as well as linear estimators. That is, estimators which are linear functions of the r.&'.'s of the sample. For example, in the case of Within the example 1 above, h, l, 6, h and ;c are linear estimators. 411 of linear unbiased estimators and show that class has minimum we can variance. ln order to show that let us take a general linear estimator :true'
./'(.xr'
-n
=
=
=
,
'solving'
'no
':
,
11
p-=c + jl i
=
aixi,
( 12.10)
1
which includes the above linear estimators as special cases, and determine 7, which ensure that the values for c, ai, i 1, 2, is best Iinear unbiased (BLUE) of Firstly, for be unbiased J'to estimator y. we must have 1. Secondly, since Vartp) which implies that c 0 and j ai )'c2 a? we must choose the ais so as to minimise j)'- j J/c2 a,? as j satisfy unbiasedness). 1 (for well as Setting up the Lagrangian for 1 ai =
.
.
.
,
'(p)=0
=
=
N)'. :)'-
=
=
=
Z)'=
12.1
Finite sample properties
239
this problem we have ?1
min : ltaa2)
=
i
tzi
11
Y
ail
2
-
i
1
=
F, ai =
1
-
,
1
Summing over f, l1
. i
=
'1
-
i
1=
=
2
,
=
jyj
,;g
?1
lf =
1
-
-
n
1e
,
.
jj ai
.
=
-
i
,
n
12
=
,
.,
.
.
.
,
n
,
for c 0 and ai 1/n, i 1, 2, n, (1 n) j)'- 1 Xi, which is identical to #1. Hence 4'1is BLUE (minimumvariance among the class of linear and unbiased estimators of p). This result will be of considerable interest in theorem. Chapter 2 1, in relation to the so-called Gauss-Markov The properties of unbiasedness and efficiency can be generalised directly where 0 H(p1, 0m). is said to be an unbiased to the multiparametercase of estimator 0 if =
=
=
.
.
.
=
,
.
.E'(#) 0, =
i.e. E(#f)
0i
=
i
,
1, 2,
=
.
,
.
.
.
.
,
?'l.
ln the case of full efficiency wecan show that the Cramer-Rao of 0 takes the form an unbiased estimator -j1(0
Yj(0
10g XX;
covt)
lOg
pp
Or
> Inoli-i varlpf)
',
i
t?0 1, 2,
-
.f(x;
.
.
.
,
1,,(:=E
Yjt
./x;
I0B /tX;-..?6
0 -
pp
l log ftx; 0) (0 Jp
,
,
(12.12)
m
(rrl being the number of parameters), where diagonal element of the inverse of the matrix
z;j(0 I0B
j'jy:
04
inequality for
1 I,,(p)J represents the th
)'j ( 12.13)
called the sample information matrix; the second equality holding under the restrictions CRI-CR3. ln order to illustrate these, consider the following example:
Properties
In example
1 discussed above we deduced that j
/ = 'good'
is a sample
of estimators
11
'l j
y xi
(12.14)
j
=
estimator of y, and intuition suggests that since i is in effect the corresponding to /t the sample variance
moment
1
(f2
11
y (xy-p2
.
tl
( 12.15)
1
=
estimator of c2. In order to check our intuition let us dl satisfies any of the above discussed properties.
should be a examine whether
'good'
( 12.16) Since
2
E(Xi -p)2
Ep -/t)2
c2,
=
=
and
-
n
from independence, we can deduce that ,1
(-Yj i
=
!
1, -/@)2
E
=
1
)
1*=
(7.2 1
+--
S
1)
-2
=(n
-K
-
1)c2.
( 12.17)
This, however, implies that E(dl) g(n- 1)/'??jcr2+ c2, that is, dl is a biased c2. Moreover, it is clear that the estimator estimator of =
$.2 ' n
1
=
11
t.yj 1 f )-q 1
j2
-
-
=
( 12.18)
is unbiased. From Chapter 6.3 we also know that (rl 1)
S2 'v
-
2
2
z (rl -
1)
( 12. 19)
12.1
Finite sample properties
and thus Var(s2)
c*
=
1)a
(n -
2(n
1)
-
2c* =
n
1
-
( 12.20)
,
since the variance of a chi-square r.v. equals twice its degrees of freedom. Let Y,, and are efficient estimators: us consider the question whether .s2
=
( 12.2 1) o
1og ftx; 0)
t?log .J(x;0) Jp
27:
-jlog
=
-j
(72
log
-#..j i ) =
-p)2,
( 12.22)
(-Yf 1
p log tx;0) t?/z
=
=
f?
log
tx;04
pc2
:2 1og tx,'p) (0
t'?p
' J2 1og f (x;0)
l
t2
=
2
log
Jc2
tx,'0)
(?y
log
tx;0)
i)y (?(r2 2
log
tx;p)
pc4.
( 12.24)
2
0 n 2c 4
and
gI,,(p)j -
1
N =
0
Properties
of estimators
This clearly shows that although T,, achieves the Cramer-Rao lower bound sl does not. lt turns out, however, that no other unbiased estimator exists which is relatively more efficient than s2; although there are more efficient biased cstimators such as
d21
1r
=
n+ 1 i
?1
) -
(Xi
-
Y,,)2.
( 12.26)
j
Efficiency can be seen as a property indicating that the estimator all the information contained in the statistical model. An important concept related to the information of a statistical model is the concept of a sujhcient statistic. This concept was introduced by Fisher ( 1922) as a way to reduce the sampling infonnation by discarding only the information of no relevance to any inference about 0. In other words, a statistic T(X)is said to be sufficient for 0 if it makes no difference whether we use X or T(X) in inference concerning 0. Obviously in such a case we would prefer to work with r(X) instead of X, the former being of lower dimensionality. 'utilises'
Dejlnition 4 R'n, n > rn, is called sufficient for 0 ,4 statistic T( ): t/' the conditional distribution z(x) z) is independent 0j% 0, i.e. 0 does not appear in /(x,,''r(x) T) and the domain of f ) does not involve 0. In example 1 above intuition suggests that z(X) j)1-1 Xi must be a sufficient statistic for 0 since in constructing a good' estimator of p, 1, we only needed to know the sum of the sample and not the sample itself. That is, as far as inference about 0 is concerned knowing all the numbers (A'j Xz, A-,,)orjust 1 Xi makes no difference. Verifying this directly by deriving f(x T(x) T) and showing that it is independent of 0 can be a very difficult exercise. One indirect way of verifying sufficiency is provided by the following lemma. .4-
-+
'
.f(x
=
=
'
=
tvery
,
.
.
.
Z)'-
,
=
Fisher-Neyman The statistic jctorisation
letnma factorisation
T(X) is of the
sujlcient abrp?
/'(x; 0) =.J(T(x);
p)
.
for 0 #' and (x),
only
#' there
exists
a
(12.27)
wllcrc .J(T(x); p) fs (he Jtrnsfty ynctfono/' z(X) anl depends t)n 0 and Jl(X),some function(?JX independent (?J0. Even this result, however, is of no great help because we have to have the statistic T(X) as well as its distribution to begin with. The following method suggested by Lehmannand Scheffe( 1950) provides us with a very convenient way to derive minimal sufhcient statistics. A sufficient statistic T(X)is said to
12.1
243
Finite sample properties
be minimal if the sample X cannot be reduced beyond T(X) without losing i?sufficiency. They suggested choosing an arbitrary value xa in and form the ratio
J(x; P) '/'(x();0$
=
x G ..J, 0 (E 0.
,
,
( 12.28)
.
g (x x () ; p)
and the values of xtl which make gtx, minimal sufficient statistics. ln example 2 above
xtl;
,
0)independent of 0 are the required
( 12.29) Xl?) is a minimai sufficient This clearly shows that T(X) 1 1 A'j, ZJ'= statistic since for these values of xa g(x, x(); 04 1. Hence, we can conclude y2) being simple functions of z(X) are sufficient statistics. lt is that (X,,, Xt? separately as 1 important to note that we cannot take l Xi or minimal sufficient statistics; they are jointly sufficient for 0 EB (/t,c2). In contrast to unbiasedness and efficiency, sufficiency is a property of statistics in general, notjust estimators, and it is inextricably bound up with the nature of *. For some parametric family of density functions such as the exponential family of distributions sufficient statistics exist, for other families they might not. lntuition suggests that, since efficiency is related to full utilisation of the information in the statistical model, and sufficiencycan be seen as a maximal reduction of such information without losing any relevant information as far as inference about 0 is concerned, there must be relationship along the a direct relationship between the two properties. A should needed look no further efficient estimator when is we lines that an following provided lemma. statistics, by sufficient the is than the =
(Z)'-
=
Z)'-
)-
Rao and Blackwell Iemma Let T(X) be (1 sufjlcient
statistic
)r
0 and t(X) be an estimator of 0,
then &h(X)
-
04l %F(t(X) - p)2, 0 g (1),
=
=
=
T),
conditional
of t(X)
expectation i.e. the wt'?rc h(X) F(t(X),/T(X) given z(X) z. From the above discussion of the properties of unbiasedness, relative and full efficiency and sufficiency we can see that these properties are directly
Properties
of estimators
related to the distribution of the estimator (1 of t?.As argued repeatedly. deriving the distribution of Borel functions of r.v.'s such as )'=/?(X)is a very difficult exercise and very few results are available in the literature. These results are mainly related to simple functions of normally distributed r.v.'s (see Section 6.3). For the cases where no such results are available (which is the rule rather than the exception) we have to resort to asymptotic results. This implies that we need to extend the above list of criteria for estimators to include asymptotic properties of estimators. These asymptotic properties will refer to the behaviour of # as n r ln order to emphasise the distinction between these asymptotic properties and the properties sample (or small sample) properties. considered so far we call the latter The finite sample properties are related directly to the distribution of #,,,say /'((,). On the other hand, the asymptotic properties are related to the asymptotic distribution of t)-,. tgood'
-+
.
.jlnite
l
12.2
Asymptotic properties
A natural property to require estimators to have is that as n --+ (i.e.as the sample size increases) the probability of being close to the true value 0 should increase as well. We formalise this idea using the concept of convergence in probability associated with the weak law of large numbers (WLLN) (see Section 9.2). 'z.
'
Dehnition 5
4n
0- /?(X) is said to be consistent jr 0
estimaor
2
lt
lim
11-*
LX
=
.
p!< c) P r( p-. 11
I
-
-
l
.
/' ( 12.3 1)
,
P
w't? u,,?-/lt?t', -+ p. t./,7(./
This is in effect an extension of the WLLN for the sample mean T,, to some Borel function (X). lt is important to note that consistency does not refer to t approaching The 0 in the sense of mathematical converence. with pl refers probability associated < the event to the c as convergence derived from the distribution of as n --+ x.. Moreover, consistency is a very minimal property (although a very important one) since if In is a 7 405 926, n if #r(t,, - ?t> consistent estimator of () then so is n + which implies that for a small n the difference t?l 7405926/:) 1,n, n > 1, might be enormous, but the probability of this occurring decreasing to zero
Ip,, -
t
.
%* =
Iu
=
aS
n
-+
7f
-
.
Fig. 12.4 illustrates the concept in the case where
,,
has a well-behaved
Asymptotic
12.2
properties
symmetric distribution for n! < na < ns < rl4.< ks. This diagram seems to suggest that if the sampling distribution .J(),,)becomes less and less dispersed as n %) and eventually collapses at the point () (i.e. becomes degenerate), then is a consistent estimator of (). The following lemma formalises this argument. --+
,,
P
t hen
(i)
11
--+
p
.
lt is important, however, to note that these are only stfjl-lcient conditions for consistency (nor necessaryl; that is, consistency is not equivalent to the above conditions, since for consistency Varttdi,ylneed not even exist. The above lemma, however, enables us to prove consistency in many cases of interest in practice. lf we return to example 1 above we can see that P
t
-+
p
246
Properties
of estimators
by Chebyshev's inequality and lim,,-, .E1 -( 1/n;2)j 1. Alternatively, using Lemma 12.1 we can see that both conditions are satisfied. Similarly, we can =
P
P
showthat P
P
#c p, ;a +.. p ('+.+'reads
'does
not converge
-+
P
t) +.+ 0, s ++ p 4 ; showthat dl --+
P
:
in probability
P
0,
+.+
,
p. Moreover, for 42 and sl of example 2 we can
+.+ p
c2 and sl
to'),
(7.2
--+
A stronger form of consistency associated with almost sure convergence very desirable asymptotic property for estimators.
is a
Dejlnition 6
t is said
zzlnestimator
to be a strongly consistent estimator
1im #,, 0
Pr
=
?1 --#
of 0 (J'
1
=
2:)
.%.
and is
tlenoted
.
i
by
-+
0.
The strong consistency of ),,in example 1 is verified directly by the SLLN and that of sl from the fact that it is a continuous function of the sample X? (see Chapter 10). Consistency and moments Y,, and r/?c ( 1 n) 1 consistency extensions of the weak 1aw and strong law can be seen as strong statistic respectively. of large numbers for 1 Xi to the general Extending the central limit theorem to n leads to the property of asymptotic normality.
)'-
=
(,,
Z)'-
Dehnition 7 'J,is ,s(:fJto be asymptotically normal zln estimator 'tj exist sttch that 1. tf U;,(p) n y 1 )p,, > 1) ,
,
(1-lw()
sequences
,
-p
()p))-( ()1;-7,
,,
,,
)
-+
z
-
( 12.32)
.N'(0-1),
normality presents several logical This way of defining asymptotic problems deriving from the non-uniqueness of the sequences P;,(p)) and )p,,), n y 1. A more useful definition can be used in the case where the order of magnitude (seeSection 10.4) of P;,(p)is known. ln most cases of interest in practice such as the case of a random sample, P;,($ is of order 1/n, denoted by P;,(@ 04 1/rl). In such a case asymptotic normality can be written in the following form:
t
=
',-1 j ( - p ) x(()prjpjj -$,?' ,,
where
'
i 'v
,,
x
-
J
reads asylnptotically
,
(12.33) distributed as' and U(p) > 0 represents the
12.3
and their properties
Predictors
247
asymptotic variance. ln relation to this form of asymptotic consider two further asymptotic properties. D#nron
normality
we
8
zln estimator (, wflll Vart'(l unbiased ('/' x/ n(P,, P) 0
0( 1 rl) is .$JJ to be asymptotically
=
(12.34)
-->
-
This is automatically satisfied in the case of an asymptotically normal estimator for Vartt?,yl J,)(p) and f)#,,) p,,. Thus, asymptotic normality can be written in the form =
,,
v'''rl($
-
,,
p)
-
=
N
p-(p)).
,
(12.35)
lt must be emphasised that asymptotic unbiasedness is a stronger condition than 1im,,-, F((,) p; the former specifying the rate of convergence. ln relation to the variance of the asymptotic normal distribution we can define thc concept of asymptotic efficiency. =
.
Dlinition
9
tt is said rtpbe asymptotically
An asynptoticall y' normal lsrrrlfkrt?r I,''(p) 1,..(?) - 1 w/-llrt? efficient ,/'
=
,
1 .l.,(p) lim -1,11,,(p) cx7. l =
;
(12.36)
,
--.
i.e. the asymptotic pr.lrff.lncl achieves the lrrlr ltpwgr-bound lyct?Rothenberq (1973:.
0j%
the Cramer-Rao
At this stage it is important to distinguish between three different forms of the information matrix. The sample information matrix 1,,(p) (see(13)),the single observation one 1(p) with txf;P) in (13), i.e. E
J 1og f (xf,'p) d0
and the asymptotic 12.3
2
information matrix 1.(?) in (36).
Predictors and their properties
Consider the simple statistical model: (i) Probability model: * p) i-e. .N((),1). Samplinq model: X N (xYj A''a, =
r
'v
ttx; ,
.
.
1 x/(27:) expt
=
.
,
-.(x -
@2,0 c R)
.Y,,)' is a randorn salnple.
Properties
248
of estimators
Hence the distribution of the sample is !1
flx,
.
.
1-l ..ftx-'
p)
,x,,; .
,
-
i
=
(12.37)
p).
1
Prediction of the value of X beyond the sample observations, say A',,+ 1, refers to the construction of a Borel function 1( ) from the parameter space O to the observation space i?t.
/(
): 0- --+
'
.kL
(12 38) .
If 0 is known we can use the assumption that Xn + 1 x, N(0, 1) to make probabilistic statements about Xn + j Otherwise we need to estimate 0 first and then use it to construct 1( ln the present example we know from Sections 12.1 and 12.2 above that .
').
p11
1
l
-
=-
&f
=
l
( 12.39)
=kr
i
1
estimator of 0. lntuition suggests that a is a might be to use /(#,,) #,,,that is,
tgood'
igood'
predictor of X,, +. 1
=
k
?1 +
j
=
(,.
(12.40)
The random
variable .1 /(,,) is called the predictor of X,, + 1 and its prediction value. Note that the main difference between value the testimating' and prediction estimation is that in the latter case what we are (A-,,+j) is a random variable itself not a constant parameter p. /(t) we In order to consider the optimal properties of a predictor + j define the prediction error to be .-,,
=
=
,,
e
1, +
1
-
.Y,,+
-',,
1
-
+
(12.41)
1.
Given that both -Y,,+ 1 and X,, 1 are random variables en + I is also a random variable and has its own distribution. Using the expectation operator with respect to the distribution of c,,+ j we can define the following properties'. Unbiasedness. The predictor .)Q,,1 of X,, + l is said to be unbiased if (1) +
+
f (G,+
1)
=
0.
(12.42)
Minimum MSE. The predictor
mean square error if E(el
kn
+ 1
-k
)
H zl + 1
E(X
?1
+
1
n
+
1
of X,+
)2KEIX
1
11 +
is said to be minimum 1
-X
n +
1
)2
for any other predictor Y,:+1 of X,, 1. Another property of predictors commonly used in practice is linearity. Linear. The predictor kn + 1 of X', + 1 is said to be linear if 1( ) is a (3) linear function of the sample. +
Predictors
12.3
and their properties
ln the case of the example considered above e
11 .1
'v
N 0, 1 +-
249 we can
1
deduce that (12.44)
,
N
function of normally distributed r.v.'s, :,,+ 1 + 1 is a linear 1 Xi. Hence, n) X,,+ 1 is both linear and unbiased. Moreover, ( j)?-1 using the same procedure as in Section 13. 1 for linear least-squares estimators, we can show that X,,+1 is also minimum MSE among the class of Iinear unbiased predictors. The above properties of predictors are directly related to the same properties for estimators discussed in Section 12.1. This is not surprising, however, given that a predictor can be viewed as an estimator' of a random variable which does not belong to the sample. given that c,, Xt, .1
=
-
Important
concepts
estimate, unbiased estimator, bias, relative efficiency, mean full efficiency, Cramer-Rao lower bound, information matrix, error, square properties, sufficient statistic, finite sample properties, asymptotic asymptotic normality, asymptotic consistency, strong consistency, unbiasedness, asymptotic efticiency, BLUE. Estimator,
Questions Define the concept of an estimator as a mapping and contrast it with the concept of an estimate. Define the finite sample properties of unbiasedness, relative and full efficiency, sufliciency and explain their meaning. kunderlying every expectation operator J) ) there is an implicit distribution.' Explain. Explain the Cramer-Rao lower bound and the concept of the information matrix. method of constructing minimal Explain the Lehmann-scheffe sufficient statistics. Contrast unbiasedness and efliciency with sufficiency. Explain the difference between small sample and asymptotic '
4.
properties. 8. 9.
Define and compare consistency and strong consistency. Discuss the concept of asymptotic normality and its relationship the order of magnitude of Var((,).
to
of estimators
Properties
10.
Explain
the
of asymptotic
concept
efficiency
in
relation
to
asymptotically normal estimators. %What happens when the asymptotic distribution is not normal?' Explain intuitively why x' n(t?,, 0j 0 as n %. is a stronger condition than lim,,-. p. 12. Explain the relationships between 1,,(p), l(0j and 1w(f?), --+
-
.'(p,,)
--.
=
Exercises Let X EEE(A'1 A-c, X,,)' be a random consider the following estimators of (?: .
,
;
.'
=
;a
1
.
-s
-
.
l
j
j =
jXj
12
1
-
,
,
,
1
2,aA-,, #.k
J-a.- + 1
=
i
1
1,
,
sample from N(p, 1) and
,
.
---j-
1
-
,
.
.
,
n- 1
,
'
jl ix i h ;, + l-a i I ,
rl
.
-
.
=
Derive the distribution of these estimators. Using these distributions consider the question whether these estimators satisfy the properties of unbiasedness, full efficiency and consistency. (iii) Choose the relatively most efficient estimator. Consider the following estimator defined by
h
A
1
=
1 jrx1- t?--,,
) nl
cuc -.
...
n
(1'
/
and
n
=
Prt,,
=
n)
=
??
cuc -
j
n+
,
l 11-f- J
,
and show that: (i) (, as defined above has a proper sampling distribution; is a biased estimator of zero; (L?, (ii) lim,,-, Vart#,,l does not exist; and (iii) (iv) #,,is a consistent estimator of zero. Let X H (.Y1 X.z, zY,,)' be a random sample from N(0, c2) and consider ..
,
022
.
J
.
,
?1
Xll
.
tl i
.
=
j
as an estimator of c2. Derive the sampling distribution of *2 and show that it is an (i) unbiased, consistent and fully efficient estimator of c2.
12.3
Predictors
and their properties
Compare it with l of example 2 above and explain intuitively why the differences occur. Derive the asymptotic distribution of l. (iii) .Y,,)' be a random sample from the exponential 4. Let X EEE(Ar1, distribution with density (ii)
.
.
.
,
Construct a minimal sufficient statistic for 0 using the LehmannScheffe method.
Additional references Bickel and Doksum ( 1977)) Cox and Hinkley ( 1974); Kendall and Stuart Lloyd (1984)) Rao ( 1973)., Rohatgi (1976); Silvey (1975); Zacks (197 1).
(1973)',
Estimation 11 - methods
The purpose of this chapter is to consider various methods for constructing igood' estimators for the unknown parameters 0. The methods to be discussed are the least-squares method, the method # moments and the maximum likelibood method. These three methods played an important role in the development of statistical inference from the early nineteenth century to the present day. The historical background is central to the discussion of these methods because they were developed in response to the particular demands of the day and in the context of different statistical frameworks. lf we consider these methods in the context of the present-day framework of a statistical model as developed above we lose most of the early pioneers' insight and the resulting anachronism can lead to misunderstanding. The statistical model method developed in relation to the contemporary framework i'sthe maximum likelihood method attributed to Fisher (1922). The other two methods will be considered briefly in relation to their historical context in an attempt to delineate their role in contemporary statistical inference and in particular their relation to the method of maximum likelihood. The method of maximum likelihood will play a very important role in the discussion and analysis of the statistical models considered in Part IV; a of this method will be of paramount importance. sound understanding After the discussion of the concepts of the likelihood function, maximum likelihood estimator (MLE) and score function we go on to discuss the properties of MLE's. The properties of MLE'S are divided into finitesample and asymptotic properties and discussed in the case of a random as well as a non-random sample. The latter case will be used extensively in Part IV. The actual derivation of MLE'S and their asymptotic distributions
The method of least-squares
13.1
throughout as a prelude to the discussion of estimation
is emphasised Part lV.
in
The method of Ieast-squares
13.1
The method of least-squares was tirst introduced by Legendre in 1805 and Gauss in 1809 in the context of astronomical measurements. The problem as posed at the time was one of approximating a set of noisy observations vf, n, n. with some known functions (pft?l 0z, i 1, 2, k), i 1, which depended on the unknown parameters p= ()j tk)', m < r7. h?, minimising ?j i 1, 2, Legendre argued that in the case of qiol =
.
,
.
.
.
,
.
=
,
.
,
=
=
,
.
.
.
.
.
.
.
,
,
.
,
.
1)
) ( J,j - )1)2, with respect to 71 i
1
=
( 1,/n)jgyf #,,, the sample mean, which was generally givesrise to .J',,). On the consideredto be the most representative value of (y1 ya, squared result he went on to suggest minimising the errors basisof this =
=
.
,
.
.
,
11
)
l04
=
1*crr
1
(-pf- (/f(p))2,' the least-squares,
in the general case. Assuming differentiability of #j.(#), i 1, 2, gt?!(p)(l(0 0 gives rise to the so-called normal equations of the form =
.
.
.
,
n,
=
(
?1
2)
-
i
l
())
g),-
1
=
gf(#)1 '
'.
liob
=
ppk
0,
k
=
1 2, ,
.
.
.
,
n'l.
ln this form the least-squares method has nothing to do with the statistical model framework developed above, it is merely an interpolation method in approximation theory. Gauss, on the other hand, proposed a probabilistic set-up by reversing the Legendre argument about the mean. Crudely, his argument was that if X EE (A-l Xz, X,,)' is a random sample from some density function .J(x) value for all such A-s, then the is the most representative and the mean normal, density function must be i.e. ,
.
.
.
.
.k,,
f'tvxl '' =
'
----
'
1
-
exo '
'*(2z:)
cx
Using this argument he
'$.'ggio) :=
-h cf
.
cf lbrl, c2), 'v
1 x2 - 2c c
to pose the problem in the form
went on
i zzu l 2, ,
i
=
1, 2,
,
.
.
.
,
n.
.
.
.
,
u,
Methods
normal independent, justifying the normality grounds of being made up of a large number of the assumption on cancelling otherout.) In this form the problem can each independent factors viewed of estimation the in context of the statistical model: be as one Not(e: Nl(
*
'
)
'reads'
.f
=
()'f;#)
1
=
cx
y,(aal'exp
(-vf qil04jl -
assumption
the probabilistic
by transferring r.v., and
1 ,o.2
-
from
,
0GO
to yi, the observable
;j
consider
3' ( ==
.r
.3..'..2
1,
,
.
.
,
.
.J.?,,)/
rl. Gauss went on to to be an independent sample from tm;$, i 1, 2, derive what we have called the distribution of the sample =
lg
.
.
.
,
,1
2 /'(y,. 0) (27:c2)- n.'' exp - --j ' 2c =
)( (yf -
i
gi(0))l
,
1
=
interpreting it as a function of 0, and suggested that maximisation of fty,' 0) with respect to 0 gives rise to the same estimator of 0 as minimising the squared errors 11
f
Yl1 El'-f?f(p)12. =
As we will see below, the above maximisation can be seen as a forerunner of the maximum likelihood method. Since these early contributions the method of least-squares has been extended in various directions, both as an interpolation method as well as a statistical model with the probabilistic structure attached to the error term. The model mostly discussed in this literature is the linear model where gjoj is linear, i.e. m
yi
=
j
(lkxki+ t:i
and the normality Etc
i
,
1, 2,
=
k=1
=
=
c
.
.
,
n
,
replaced by the assumptions
assumption
) 0, Vartl)
.
2 ,
i,j
=
1, 2,
.
.
.
,
7.
( 13. 10) ln some of the present-day literature this model is considered as an extension of the Gauss formulation by weakening the normality assumption. For further discussion of the Gauss linear model see Chapter 18.
of least-squares
The method
13.1
For simplicity of exposition let us consider the case where m model becomes
,f 0 1 x 1 i .1.t?i ::=
,
i
12
:=
,
,
,
.
,
.
1 and the
=
1.1 .
method suggests minimising
The least-squares
(13.12)
( 13. 14)
Note that in the case where xl
sample
t71.
of
estimator
is the least-squares
i
1, i
=
=
1, 2,
.
.
.
,
n,
0-1
=
(1/''n)j)'=1
i.e. the
.pf,
mean,
Given that the xjfs are in genera l assume d to be known constants 0-1 is a .J',, of the form linear function of the r.v.'s )'I ,
.
.
.
,
19
'-'
01
,
t. l ). f .
.
..
i
=
1
where
Hence, 11
??
F(;1 )
)'- c'fuE'tyfl -
-
i
=
1
i
)'-1 cfxl
i03
-
( 13.16)
t?l ,
=
i.e. t?: - is an unbiase d es timator of p1 Moreover, since (p-1 - ()1 ) .
lt can be shown that, under the above (j)'- j x2jy)+ 0, the least-squares estimator
'l
=
Y)' I x.c f, r
-
I
assumptions relating to ;f if of ?l has the smallest variance
Methods
within the class of linear and unbiased estimators (Gauss-Markov see Section 2 1.2).
theorem,
The method of moments
13.2
From the discussion in the previous section it is clear that the least-squares method is not a general method of estimation because it presupposes the existence of approximating functionsg/p), 1 1,2, n, which play the role of the mean in the context of a probability model. In the context of a probability model *, however,unknown parameters of interest are not only associated with the mean but also with the higher moments. This prompted Pearson in 1894 to suggest the method t?/' moments as a general estimation method. The idea underlying the method can be summarised as follows: A-,,)'is a random sample from .J(x;0), Let us assume that X (11,X1, Rk. The of moments 0c r > 1, are by definition mw .J(x,'0), p'v> of unknown since functions the parameters, =
=
.
.
.
.
.
.
,
,
'(.xr),
:'
.X'.'/-(.X;
/t'r(#) =
P)dx,
(13.18)
ln order to apply the method we need to express the unknown parameters 0 in the form
0i
gipk p-,,
=-
,
.
1. 2,
Jtk'). 1
''''''-
,
.
.
.
.
.
,
(13.19)
k,
where the gis are continuous functions. The method of moments based on the substitution idea, proposes estimating oiusing li
gjtn'l 1 n'l2,
=
.
,
.
n'lkl,
,
.
i
1. 2,
=
.
.
.
,
(13.20)
k,
represent ;)'-1, 2,.Yr, r > 1,The
where r#l,= ( 1 n) estimators of Vi,i the fact that if pj Jt h;. .
m
r
-+
the sample rlw' moments, as the justification of the method is based on p'k are one-to-one functions of 0 then since
1
=
,
.
.
.
.
,
.
.
k.
,
(13
p?r ,
.2
1)
it follows that a S. .
3332'-'i -+ t, i 9
(see Chapter
'v
'
5
...
'i..
',
( 13.22)
10).
Exanvle Let Xi
(I.:2
2,'
1
Np, c2), i
=
1, 2,
.
.
.
,
?,
then pj
=
p, y
=
c2 +/t2
and
l';j
=
d2
=
Iikelihood method
The maximum
13.3
1 12
(D1
ma -
=
1 -
tz
i
)X ;
-
1
(Y/j
=
1
11 j
?1
)-')(A' --
,
i -
.
Xtlt).
Formally, the above result can be stated as follows: Let the functions k, have continuous partial derivatives up to order I on 0/z;(p),i 1, 2, Jacobian of the transformation the and =
.
.
,
.
J( --.-P
dett?(p If the equations
.j
,
1
,
,
.
.
.
,
.
.
s
p;(@
/t q)
pk)
=
( l 3.23)
+ () fo r 0 e,o.
i
p1g,
=
1, 2,
.
.
.
/, have a unique solution (, EEE(t p xL. then J 0 (i.e.i is a one, as n ,
,
h)' with probability approaching consistent estimator of 0). Although the method of moments usually yields (strongly) consistent estimators they are in general inefficient. This was taken up by Fisher in several papers in the 1920: and 30s arguing in favour of the maximum efficient estimators likelihood method for producing (at least and Fisher The about asymptotically). the controversy between Pearson relative merits of their respective methods of estimation ended in the mid1930s with Fisher the winner and the absolute dominance since then of the maximum likelihood method. The basic reason for the inefficiency of the estimators based on the method of moments is not hard to find. lt is due to the fact that the method does not use any information relating to the probability model * apart from the assumption that raw moments of order I exist. lt is important, however, to remember that this method was proposed by Pearson in the late nineteenth century when no such probability model was postulated a priori. The problem of statistical inference at the time was seen as one A-,,)' and estimating .J(x; 04 without starting from a sample X (-Yj assuming a priori some form for /'( ).This point is commonly missed when comparisons between the various methods are made; it was unfortunately missed even by Pearson himself in his exchanges with Fisher. lt is no surprise then to discover that a method developed in the context of an alternative framework when applied to present-day set-up is found wanting. -+
,
=
,
.
.
.
--+
,
'
13.3
The maximum
likelihood method
The maximum likelihood method of estimation
was formulated by Fisher
Methods
in a series of papers in the 1920s and 30s and extended by various authors such as Cramer, Rao and Wald. In the current statistical literature the method of maximum likelihood is by far the most widely used method of estimation and plays a very important role in hypothesis testing. The likelihoodfunction
Consider the statistical model: (i)
*
=
(ii)
X
=
(fx;
#), 0 6 O );
(xY'1X1, ,
.
.
,
.
Xtt)' a sample from
tx,'04,
where X takes values in .I= R'', the observation space. The distribution of the sample D(x1, xc, x,,; 04 describes how the density changes as X takes different values in I for a given 0 6 0. In deriving the likelihood function we reason as follows: .
.
.
,
since D(x; 0) incorporates al1 the information in the statistical model it makes a 1ot of intuitive sense to reverse the argument in deriving D(x; 0) and consider the question which value of 0 e:(4 is mostly supported by a given sample realisation X x? =
Example 1
Consider the case where X is a random sample from a Bernoulli distlibution with density function
.Jx; 0)
pX( =
1
-
?)1-'Y,
x
=
and for simplicity assume that the 0.8. Suppose the sample realisation
(13.24)
0, 1,
0 can only be either 0= 0.2 or 0=
'true'
was
what can we say about the two possible values of 03 Intuition suggests that since the average of the observed realisation is 0.7 it is more reasonable to assume that x is a sample realisation from flx; 0.8) rather than tx,'0.2). That is, the value of 0 under which x would have had the highest ilikelihood' of arising must be intuitively our best choice of 0. Using this intuitive argument the likelihood function is defined by L0; x)
=
k(x)D(x; p),
0 (E (1),
where k(x) > 0 is a function of x only (not p). In particular 1a( ; x): O '
-+
g0, ). Evz
(13.26)
Iikelihood method
The maximum
13.3
259
Even though the probability remains attached to X and not 0 in detining fp; x) it is interpreted as if it is reflected inferentially on 0; reflecting the of a given X x arising for different values of 0 in 0. In order to consider this, the following example. see
Slikelihood'
=
Example 2 Let X EEE(aYj
,
.
.
,
.
the randomness Dlxl
,
.
.
.
Xn)' be a random sample from N(0, @;0= c2. Because of of the sample its distribution takes the form ,
x,,;
p)
1
,,
1-l
=
exp
x?' -
(2ap)
f=1
11
-''l
= (2ap)
2p XI
1-I
exp
i
=
1
J!
- 2p
,
and the likelihood function is ZIP' X) ,
=
S/
V(X)(27:P) -
x/
p1
2 j
J'JCXP .j
-
j-j
.
If we were to reduce the dimensionality of x to one (to enable us to draw pictures) the derivation of fz4:; x) is shown in Fig. 13.1 for two sample realisations X xl and X x2. Fig. 13.14/) shows a family of D(x; 0)for 0= 0.5, 1,2, 3,4 and for two given values of X,xj and x2 the likelihood functions 1atp;x1) and Ltp; x2) are reflected in Fig. 13.1()). As can be seen from these diagrams, different sample realisations provides different likelihoods' for 0 c 0. In the derivation of these likelihoods the constant k(x) was chosen arbitrarily to be equal to one. The presence of k(x)in the definition of .L(p',x) =
=
D
0
0.5 0 1 () 2
=
(x; 0 )
l (0 ; x)
n
=
L (0 ; x : )
=
0
=
0 4
=
L (p; xz)
3
o (1 X1
(2 3
4
5 6 7 x
0
1
2
3
4
0
Xz
(b) Fig. 13.1. Deriving the likelihood function from the distribution of the
sample.
Methods
implies that the likelihood function is non-unique; any monotonic of particular: ln transformation it represents the same information. and the
sl'()re
( 13.27)
/ifl'lt'rff.?n,
incorporate the same information as 1-(p,'x) itself; if we have any one of the functions L(0; x), 1og 1-tp; x), s(p; x) we can derive the others. In example 2 above
and
Hence.
( Iog fatp,' x) -- .-..(0
=
-
1 n + j 20 2 p
d log L0'-'- x) (.
(1l
?'
-.
dp
--
=
-
y x,? .
i
-
:
1 '' (). log - x-;j j uz: 0) i n
=
X) +
(y*
j
and
As will be seen in the sequel, although the proportionality factor is function, it plays no indispensable in the actual detinition of the likelihood what with 1og L(0; x) follows working role in the derivation of estimators. ln and g Iog L(0; x)(I(?0 will prove more convenient than using 1-(p; x) itself; see Figs. 13.2-13.4.
(2)
The maximum
Iikelihood estimatov
ML f)
Given that the likelihood function represents the support given to the various 0 (E (.) given X x, it is natural to define the maximum likelihood O such that estimator of 0 to be a Borel function : =
.%
--+
L4#; x)
=
max 1-tp; x), 0 e (.)
and there may be one, none or many such MLE's.
(13.29)
13.3
The maximum
Iikelihood method
L (9 ; x)
iI (
Fig.
, 13.2. A likelihood
; function.
Fig. 13.3. The log-likelihood function.
Methods
262
Note that
log L,
for a1l 0* e:(9.
x) > log fz(p*; x),
(13.30)
In the case where fz(p; x) is dlfferentiable the MLE can be derived as a solution of the equations
P log f-(p; x) EEEs(#; x) 00
=0,
(13.3 1)
referred to as the likelihood equations. ln example 2, 0 log .L(p; x) = f?0
n
''
n
Ar/ 0, -j+ 20 a i )g 1 =
-
lg
p= -n
11
i
Zx
2 f
1
=
(for a maximum). Example 3 Let X N
(11,
.
.
.(x; 0) 0x - t?=
,
.
l
Ar,,)'be a random sample from a Pareto distribution with 1 %x % r.'.c p 6 IJ?+ ,
,
11
L(0; x)
=
/(x)
7) k(x)/'(x1
./xf;
=
i
=
1
and =
d log L0; dp
x)
''
n =.-p
f
=>
F. log xf
r1
?t
= n N log xf i
=
=0
1
--. 1
,
.
.
.
,
x,,) - P-
1
13.3 The
Iikelihood method
maximum
Before the readerjumps to the erroneous conclusion that deriving the MLE is a matter of a simple differentiation let us consider some examples where the derivation is not as straightforward.
Example 4 Let Z
=
(Z1, Za, N
i-e.
Zf
Z,,) where
,
.
.
.
sample from
(A-d,F;.) be a random
=
1 p
0 0
,
.
-
l
p
exp 1-pc)-2, -k ( l'(x,,;p)-f
log 1-(p; x, y) c =
n
1 j
n log 27: -jlog (1 1
z
-p
,
)
?1
-
- 2(1 -
-cpxy+l,zl
(x2
-pc)
p2)
i
-$'/),
)- (x,?-
2/?xfyi +
.1
11
d log L dp '
,?( -
+
-2)
+ )'/) )(()txk? (1+p c) i .y
2(1 - p 2 )
/' - /'
2) (1 - p c
(1 p c)z -
/1
=
n/(1
-/2)
+
/2)
(1+
,,
I
-
i
=
Z.
xfyf
1
J
11
x/ + i
=
()
=
j
11 -
i -Pi
.''
i
i
1
)) =
.vl
=
0.
1
This is a cubic equation in p and hence there are three possible values for the MLE and additional search is needed to locate the maximum value using MLE'S was a numerical methods. The use of numerical methods in deriving major drawback for the method when it was first suggested by Fisher. Nowadays, however, this presents no difficulties. For a discussion of several numerical methods as used in econometrics see Harvey (1981). Example
.5
Xn)' be a random sample from Let X H(Ar1, Xz, The function is likelihood 0. x% .
fatp,'x)
=
.
.
0-
,
''
if 0
:$
xf
% 0, i
=
1 2, ,
.
.
.
,
1 0 where 0 %
tx',t?) =
n.
x)1 dp= 0 to derive the MLE is out of the question since Using Edfwtp,' Ltp; x) is not continuous at the maximum (see Fig. 13.5). A moment's xY,,). maxtArl, .Ya, reflection suggests that the MLE of ? is '=
.
.
.
,
Methods 1 (0 ; x)
l
--
O
-.
p
Fig. 13.5. The likelihood
function of example 5.
Example 6 Let X (-Y1 Xa, A',,)' be a random The (). likelihood 7: function is x =
,
,
.
,
.
sample from
/'(x;0)
=
e-tx-t') where
Again gdLtt?,' x)q d? 0 is inappropriate for dcriving the MLE. Common sense suggests 1w4?;x) is maximised by choosing 0 as large as possible such that 1-lp; x) > 0. Since 0 is bounded below by the xfs, V=mint-j Xz, ',,) represents the MLE of 0. Looking at examples and 6 we can see that the problem of the derivation of the MLE arose because tbe range of the Xis depended on the unknown parameter 0. lt turns out that in such cases there are not only problems with deriving the MLE but also the estimators derived do not in general satisfy al1 the properties MLE'S enjoy (seebelow). For example, V= maxlzYj is not asymptotically normal. Such cases are excluded by the assumption CR1 of Chapter 12. So far the examples considered refer to the case where 0 is a scalar. In econometrics, however, p is commonly a k x 1 vector, a case which presents certain additional difficulties. For differentiable likelihood functions the MLE of 0- ((?j,pa, pkl' is derived by solving the system of equations =
,
.5
',,)
,
.
.
.
,
.
Iog L
=0
.
.
,
.
.
.
,
likelihood method
The maximum
13.3
265
Or
?1og L . lp,
(), i
a,zz
c=,
12 ,
.
,
.
k
,
.
,
ensuring that the Hessian matrix
is negative
?loe L t?pt?#'
? log L
-
H(
=
.
=
,
lpf t?oi.-j
o-t.
i,j
;tt
1, 2,
.
.
.
,
k
i,j
definite, i.e. for any zeRk, z'H()z
z: 0.
<0,
Example Let X EEE(Ar1 X2, ,
f*; p, '
.
.
.
J2)
1
=
1 x - -2 .
exp
(2zr)
tr
tr2) G
0-(p,
x c R,
sample from N(y, c2), i.e.
Ar,,)' be a random
,
p
-
2 ,
tr
J! x (!Rs .
The likelihood function takes the form
and log L P log L t?p -
=
c
n log 27: n 1og c c 2 2
-
1 =
y (xj-p)
/1
2t7.a i-l
1
0
=
--
j
1
j
wthe Mt,E is y-
?'
--
plogL t?c2 = -
j
-
'1
a
=
o'
--
-
n
+ 2c a
1
2c
,y
n
11i
) xi =
-/:)2=0
Z (xf
f-1
1
-
T,,,
(xj-p)
2 ,
( 13.33)
266
Methods
Since
and
N
-/
H(#)=
0
z'H(l)z<0
and
N
for any zg R2.
- 24
This example will be of considerable interest in Part IV where most estimation problems will be a direct extension of this case. Despite the intuitive appeal of the method of maximum likelihood its ultimatejustification as a general estimation method must be based on the optimum properties of the resulting MLE's.
(3)
Finite sample properties
Let us discuss the finite sample properties of MLE'S in the context of the statistical model: simple probability model, *= ).(x; @,0 G (9),. (i) sampling model, X N (.Y1 A',,)', is a random sample from (ii) .
.
.
,
,
./'(x;03. One of the most attractive properties of
MLE'S
is invariance.
Invariance
#be a MLE
of p. lf g ): O --+ R is a Borel function of 0 then a MLE of exists and is given by g(#). This means that if the MLE of 0 is available go) thenforfunctionsk, spch as 0k,et?5 log p, its MLE is derived by substituting in log are the MLE'S of these functions. In the case of P laceof p, i.e. example 2 above the MLE of 0 was
Let
'
,
?1
p-= n i
--.
)( =
1
log Xi
11
13.3
The maximum
of
The invariance property
4=j
1
1
.
is
/
likelihood method
=-
N
enables us to deduce that the MLE of
MLE'S
n
Z 1ogXi. =
1
ln relation to invariance it is important to note that in general
(13.34)
'(:())##()).
p2 it is well known that For example, if qoj # (E(t)2 in general. This MLE'S contributes to the fact that the are not in general unbiased estimators. For instance, in example 7 above the MLE of c2, d2 (1 n) =1 (Ari T)2 is a biased estimator since (nc-2)c2 g2(u 1) (see Section 11.5) and hence f(d'2) g(n 1) njc2 # c2. Thus, in general, unbiased and MLE'S do not coincide. ln one particular case, when unbiasedness is accompanied by full efficiency, however, the two coincide. '(2)
=
=
))'
'w
-
=
-
-
full-eflciency
Unbiasedness,
ln the case where satisfies the regularity conditions CRI-CR3 and is an unbiased estimator of 0 whose variance achieves the Cramer-Rao lower bound, then the likelihood equation has a unique solution equal to 1.This suggests that any unbiased fully efficient estimator 1 can be derived as a solution of the likelihood equation (a comforting thoughtl). ln example 7 Nt, o'ljn) since above the MLE of p was /,, Y,,which implies that is an is a linear function of independent r.v.'s. Hence, S(,,) p and unbiased estimator. Moreover, given that (1)
=
'w
,,
,,
=
N
I,,(p)H E
-p2 1ogz-(p;x) 0 ?g
'JY
,,
0
=
0
n
2
and gI,,(p)(1-
1
N =
0 we can see that Var(,,) achieves the Cramer-Rao lower bound. On the other hand, the MLE of c2, :,2, (1 n) )'=: (A'f X,,)2as discussed above, is not an unbiased estimator. The property mostly emphasised by Fisher in support of the method of maximum likelihood was the property of sufficiency. =
-
268
Methods Sujhciencj'
lf z(X) is a sufficient statistic for p and a unique MLE 0 exists then #is a function of z(X). ln the case of a non-unique MLE, a MLE be found which is a function of T(X).lt is important to note that this does not say that MLE'S any MLE is a function of z(X); in the case of non-uniqueness some 1 Xi, are not functions of z(X). lt was shown in Chapter 12 that T(X) j)'- j Xi?) are jointly minimal sufficient statistics for 0-(p, c2) in the case Xn)' is a random sample from Np, c2). In example 7 where X HE (11, MLE'S of p and c2 were above the 'of
'can
=()-'
.
lj
y-
11
=
.
,
.
,1
-
F
X'i
:
n
i
=
d,2, =
,
1
jj
11
..
n
)'')(A'
y-
i
=
1
Xt,)2
which are clearly functions of z(X). An important implication for ML estimation when a sufficient statistic exists is that the asymptotic covariance of ,, (seebelow) can be consistently estimated by the Hessian evaluated at 0= #,,.That is, (?1
1 -n .
)
log L (0
t?p
!!,
P -.+
I .,
(p)
( 13.35)
(see Dhrymes ( 1970)). (4) Although
Asymptotic properties MLE'S
(IID casej
enjoy several optimum
finite sample properties,
as seen
above, their asymptotic properties provide the main justification for the almost universal appeal of the method of maximum likelihood. As argued below, under certain regularity conditions, MLE'S can be shown to be consistent, asymptotically normal and asymptotically efficient. Let us begin the discussion of asymptotic properties enjoyed by MLE'S by considering the simplest possible case where the statistical model is as follows: probability model, * (i) 1 p), p (E O), 0 being a scalar; EEE sampling model, X tA-j Xn)' is a random sample from flx; 0). (ii) of Although this case is little interest in Part 1V, a brief discussion of it will help us understand the non-random sample case considered in the sequel. conditions The regularity needed to prove the above-mentioned MLE'S asymptotic properties for can take various forms (see Cramer ( 1946), Wald ( 1949), Norden ( 1972-73), Weiss and Wolfowitz (1974), Serfling ( 1980), inter alia). For our purposes it suffices to supplement the regularity conditions of Chaptcr 12, CR 1-CR3, with the following condition: ./'(x.'
=
,
.
.
.
,
13.3
The maximum
likelibood method
(CR4)
wllcre the
/1a(x)are inteqrable
/11(x)an functions
12
i
,
tptpt?r (
-
vt zt ,
),i.e.
,
tlntl X
P)dx < lls(x).f(.'.t7;
K
,
w/lprr K does ntpr depend on p. The conditions CRI-CR4 are only sufficient conditions forconsistency and asymptotic normality. In order to get some idea about how restrictive these conditions are it is instructive to consider various examples which do not
satisfy them.
The examples 4 and 5 considered above are excluded by the condition CR1 because the range of the random variables X,, depends on the Moreover, of example 6, if p in the case then parameter 0. -l,
.
.
.
,
=0
J3 1og f Jc 6
3x2
1
=
+ --jc c
-
--.
-
w
as c2
--+
0
'
i.e. the third derivative is not bounded in the open interval 0..:: (72 < :;c. ; condition CR4 is not satised (seeNorden ( 1973/. Under the regularity conditions CR 1-CR4 we can prove (seeSerfling x)1 (0= 0 admits a sequence 1980)) ( that the likelihood equation y log 1a4@, such solutions 1) that: n> of
t,,,
Consistency: a S. .
0tt 0o (strong consistencyl -+
which implies that
tt
P -+
pll
(weakconsistency)
(see Chapter 10). Asymptotic normality:
p() N(0, x,/ntqtl'-, -
-
a
.!'(p)-
').
270
Methods
enlcienc
-,45b'mptobic
p.'
1 1(p) lil'n - t,(p) Fl co n =
,
-.
i.e. the asymptotic Rao lower bound.
variance
of
@,equals
the limit of the Cramer-
Although aformal proof of these results isbeyond the scopeof thisbook it is important to consider an informal heuristic argument on how such results come about. If we were to take a Taylor expansion of gt?log ft; x)?/(0 at #= 0v we would get
p 1ogL(h) p log L(0o) + :2 log L(%) (% P0)+ 0p(#, pp2 pp pp = -
(13.36)
where Op(n) refers to a11 tenns of order n (see Chapter 10). The above expansion is based on CR2 and CR4. ln view of the fact the log L0) j)'- 1 log 0) and the .Jtxi;0), i 1, 2, n, can be interpreted as functions of 1ID (independentand identically distributed r.v.'s) we can express the above Taylor expansion in the form =
.(xi;
=
.
.
.
,
Using the strong law of large numbers (SLLN) for llD ,4,,
=
1 -
ni
''
Y
t? log
ftxf,'p())a.s. -+
P0
=-
j
0
=
E
p Iog
.(x;
r.v.'s
p())
t?0
,
(seeSection 9.2) (13.38)
These in turn imply that
(J .
11
a. S
a. S. -
0o) -+ 0
To show asymptotic expansion by ljw/'n
or
-+
,,
.
0v.
(13.40)
normality we divide all the terms of the Taylor (not 1/n) to get
plog ytxf;p,) + Op(vi). %j 1n ) #,,Un(#,, i 1 V oo n
-
=--,u
=
(13.41)
13.3 The
maximum
likelihood method
Using the central limit theorem (CLT) for llD nv.'s (seeSection 9.3) wc can show that 1 n
i
txf;)())
p log
n
Z
Given that Bn
a S. . -.+
X(0, T(P)).
'--
PP
1
=
(13.42)
a
1(p) we can deduce that
x/nlh 0v)
').
Nl, 10)-
w,
-
(13.43)
As far as asymptotic efficiency is concerned we can see that the asymptotic variance of a consistent and asymptotically normal (CAN) estimator, is 1 equal to 1(@ where ,,
-
1604 E =
lx,'otj
(?log
2
(?2
E
=
(?0
1og
tx,'0vj
-
t?pc
,
(13.44)
the information for a single observation. We know, however, that in the lID (3462 equals n case the sample information matrix 1?,(p) 1;g0 log L(0vj times 1(04,given that each observation contributes equally to the sample,i.e. nl(0) l',f0j. This implies that the limit of the Cramer-Rao lower bound is 1(p) bccause =
=
1 lirn - 1,,(p)
1 =-
1,:(p) l(0j. =
It must be stressed that this is only true for the llD case. ln more general cases care should be exercised in assessing the order of magnitude of 1,,4?)
(see Chapter 19). Returning to the example 2 above we can see that n
1,:(J1)=
and
agj
1(c)
1 =
2c
which implies that /
,2
x/ ntd,, I1-1example
-
c()2
4 X(0, 2c().
'.w
1
3, 1
?,
n po2and
(po) =
v/'nq
,,
-
0())
x.
N(0, p()2).In example
4
,
2
although /,rcannot be derived explicitly, this does not stop us from deriving its asymptotic distribution. This takes the form -p2)2
/
&nb', -pq
.t%
-
a
x
(1 0. 1+ # a.
.
Methods The above asymptotic
where 0 EEEE(()1 t?c ,
J (the
a .s.
.
,
.
properties can be extended directly to the case k 1: 1 to read :
?k)',
,
P
and
0
-,
1
.
-
0
--+
,,
dropped for notational
zero subscript ''nk
(ii)
p)
-
-.
',
N(0, I(p) -
'v
conveniencel;
1) ,
(13.46)
J
where
(
( log
l(po) E -
polh
.(x;
.(x;
)
oo
:2 1og
'j'j'
.?.lo..y
0v)
))
pp
y
-
).
0o)
.(x;
-
vo pg
( 13.47)
That is,
xz'noin
0i)
'
aN,
-
El(p)!-f),
2
1, 2,
i
.
1 I(p)J being the
.
.
,
k,
( 13.48)
fth diagonal element of I(p) - ' Asymptotic efficiency in the multiparameter case is in terms of the I(p)- 1. For any asymptotic covariance of which takes the form Covti) .
i
other CAN estimator
cov(#,,) -
7
#,,,the
=
matrix difference
I(p) - 1 > 0
(13.49)
is ntpn-ntx/rpt? dflnite. In the case of example 7 above the asymptotic takes the form distribution of l >(,,, ',2)/
lJ
#l(J,,
1
-/t)
Y,
/n(2
'dddj,...
1.lr
-c2)
x
a
x
0
0
.s.
c2
0
0
2c4
'
This shows that asymptotically *,2, ( 1,,7n) )'=j Xi Y,,)achieves the lower bound even though for a fixed ?? it does not (see Chapter 12). =
(5)*
Asymptotic properties
non-llD
twNe)
The cases of particular interest in Part IV are the ones of an independent Xtf)' is a set of and a non-random sample, that is, (i) X,,H(zY: independent (ID) (but not identically distributed) r.v.'s, and (li)X,, is a set of non-llD r.v.'s. The asymptotic properties of MLE'S for the 11D case considered above can be extended to the cases of interest without much difficulty. The way this is achieved is to reduce the more complicated cases to something for which the limit theorems considered in Chapter 9 can be ,
.
.
.
,
applied.
The extension of the above asymptotic
results to the independent (ID)
The maximum
13.3
likelihood method
case presents no particular difficulties apart from the fact that the result 11(#) 1,:(P) no longer holds. This in turn implies that the asymptotic variance of can no longer be 14$. This, however, is not such a difficult problem because it can always be replaced by the assumption that 1,,(?) is of order rl. That is, =
,,
1
lim -N 1,,(p) ,1 --.
<
(x)
( 13.50)
,
.:x)
denoted by I,,q0) O(n) (seeChapter 10). Given that =
1',04 =
E
z log
?'
Z
-
j'txf; p)
'
Ppc
= 1
,
wecan see that the assumption t,(@=O@)ensures that as n (yz, 1,,(?) at the same speed, i.e. information continues to accrue as n increases. The non-random sample case is more difficult to tackle but again all we need to do is to reduce it to a case where the various limit theorems can be applied. ln our discussion of these limit theorems it was argued that independence is not needed for the various results and that martingale orthogonality suffices for most purposes. This suggests that if we were to reduce ztp, and S,, above to martingale orthogonal processes then the results will follow. The natural way to orthogonalise the non-random sample X,, is to start from ArI, define Xj Ar1 '(Ar1),and then construct -+
=
kz
x-?
1
=
X2
=
A-1
-+
-
F(Xa/c(xY1)),
-
Ex
-
l
c(A' 1
l1
'
.
.
.
,
A-?, --' 1.)).
c(aY1, Xz, Xi- 1), i 2, 3, n, denote the c-field generated 1 z%i r.v.'s A-1, Xi By construction includes only the new the by the 1. and satisfies properties; Xi the following infonuation in Let
.%
=
.
-
.
(i) (ii)
.
.
.E'(Xi f.?. iEzk
f
Rj
.
.
=
,
.
.
.
,
,
1
)
f- j
)
.@,
=
0
,
=
0
,
i
j
=
2 3
<
i
,
,
,
.
.
ij ,
.
=
,
(13 52)
n;
2 3 ,
.
,
.
.
.
,
n
.
(13 53) .
That is, the Xis define a martingale difference (seeSection 8.4). Hence by conditioning on past infonuation at each i we can reduce the non-random
Methods ,Xn)' to a martingale orthogonal sample X,,EB (X1, asymptotically can be treated in a similar way as a random sample. For this we need to impose certain time-homogeneity and memor) restrictions on t-Yu, n > 1) so as to ensure for example that
sample Xu EB (A-1 ,
.
.
.
.
.
.
.
T,,) which
n - ESn)
(7.2
-+
as n
(13.54)
:c
-+
where Sn n r 1) behaves asymptotically as the 7=1 Xi.ln such a case of martingale with each variance c2. This in turn enables differences sum n various the limit theorems (see Chapter 9) to derive a number of us to use L asymptotic results needed. Heuristlcally, this enables us to treat the parameters oiin the decomposition
t5',,,
=
as being equal, i.e. deduce that %
-/
0
( 13.56)
This in turn allows us to define the likelihood function to be glven
.X().
( 13.57)
For simplicity 1et us consider the case where 0 is scalar.
d loc L..(04 d0
=
-
' '
-
'
d
''
Y - log fxif/xk i 1 d0 ,
=-
.
.
.
,
xf
-
1,-
0)
( 13.58)
is the random quantity we are particularly interested in. Observe that the terms in the summation can be written as d dp
log wJtxf/.Yl,
.
.
.
A'f
,
-
1;
P)
=
d
j-j Elog
Li(0)
Li 1(P)(1HE zf(@, -
'-log
( 13.59) which implies that d d Lf(?) 1ogLi log fo,(p) F d0 d 0 glog 1 ir and assuming the expected values exist, d E --- 1og flxi x 1 xi 1 04/V' j dp ''
=
''
1(p)j
EEE
i
j 1 zftp) -
,'
,
.
.
.
,
z-.tply,z-lj-st(jdpIog Iog
-,'(.yjd
.r-,--,(?)yz-:)-e.
(13.61)
13.3
Iikelihood method
The maximum
1) and hence kf(d dp) 1og Lnoj, These imply that Eziolf/gi 0, i 1, 2, cftpls martingale and 1 is the are martinqale dlfferences ,A,,n > ) a zero mean variable Defining the random 8.4). (see Section (r.v.) =
=
.
.
.
,
( 13.62) we observe that it corresponds to the Fisher information matrix defined above. Moreover, under conditions similar to CR 1-CR3 above, ( 13.63) and 1,,(?) can be defined alternatively 1
,1
P)
,1(
i
dgf ( pj Vf o '
'
E
=
-
1
=
as
,,.'
,,
'
( 13.64)
.'''
j.j
u
1
.
,''
Under certain regularity conditions relating the first three derivatives of it can be shown that 1og1.,,(p), n 1, 2, =
.
.
.
11
I7,,(p)(1 - F,1 '
i
,
I
0,
.:'f(p)
--+
(13.65)
=
P
provided 1,,(p) as property for MLE's. ':x)
-+
n
This enables us to deduce the following
,LJo.
-->
consistency The likelihood equation
0, i.e.
lim #rtl,,
!1 --
Moreover,
)'-j
- pp< c)
'zl
=
zf()=0
has a root
,,
which is consistent for
( 13.66)
0.
under slightly stronger
restrictions
( 13.67) a.S.
provided /,,(p) i.e. consistent,
'vz
-+
a. S. . ,,
--.
as n
-+
:t.
and hence a MLE
,1
is also strongly
p. The most interesting of the sufficient conditions
for
Methods
the consistency of
is the condition
MLE'S
that ( 13.68 )
(or a.s.). This can be interpreted intuitively as saying that fn/brmafon 0 increases with the sample size.
tilhlu!
As argued above consistency is only a minimal property. A more useful
property is that of asymptotic normality. ln order to get asymptotic normality we need to find a normalisation sequence n y 1) such that (7)X0 in be probability transformed into can convergence % convergence in distribution of the form
tc',,,
-
t!,,(tt p)
D
z
-+
-
N(0, 1).
-
( 13.69)
The natural choice for c,, in this case is c',,
E.l(p)jl( 11
In
l/?tz
p)
-
11
=
which E/,,(p)(I'l
ensures
x(0, 1).
X
11D case ,1
Fl i=1
(jj
11
ctp)
Y
-
J=1
log
dp
Jjxi; 0)
and
,, f,,'(p)i Fl
,, '(z,?(p)
?i
-
1
=
lf we denote the fq/rmllforl d log
1(p) E =
fxi
d().
1
-
) )i 1 -
(j
E
-'.u
jog .--,-7.--yfx
dp
.,
o)
; ?)
2
.-
.
random
sample case
11
n1$04-
=
and hence the asymptotic
normality
p) N(0, (,?.f(p)/(t', -
-
.
mtt rl-fx ./i?l-ontt ollstrrtwrforl by 1(P),i.e.
and assume CR l-CR3 then, in the
1',(0)i )-1 .f(p)
,
1)
takes the form
that
13.3
The maximum
likelihood method
( 13.76) of n. lt is obvious that in this
since 0 < l0) < :y- from CR3 and independent case the condition 1,,(?) n1(@ =
uo
-+
s automatically satisfed. ln the general non-random asymptotic normality result
like an analogous
N(0, Iz'(p)),
c,,(t - t?) -
W
sample case we would
(13.78)
here c refers to the order of magnitude of ,,
For g1,,(?)1l.
example, in the llD
Case,
l (0)p =
11
n1(@
=
0(n)
and
nll < k < vz since i.e.limn- E1,,(p) N example 7 above. 2 in (;u, 4,,) .
'
c',,
=
/''n -bh--
,
0 < 1(p) < .z. To illustrate .
this consider
!1
P)
1?11
lim -. z
G
0
n
11
l
=
ln cases where the sampling model is either an independent or a nonran dom sample the order of magnitude of 1,,(?) can be any power of U.For most cases of interest, however, it suftices to concentrate on cases where
1,,(:) O P (n) =
information matrix I.(p) to be
and define the asymptotic l
''t8) -
P
-+
11
I .(#),
( 13.80)
1.
I.(p) - The notation above is used to emphasise the fact that I,,(p) might be stochastic given that the conditional expectation is relative to a c-field (see Section 7.2).
with V(p)
=
X. b'tlptotic Under
certain
llt?rFrl/lf
!).'
regularity conditions any consistent ),, is asymptotically normal, i.e.
solution
of
likelihood equation
(.f((?)/(t- l1 .
1
p)
-
x(0,1).
(13.81)
Methods
The case of particular interest in econometrics is when 0 is a k x 1 vector of unknown parameters. ln such a case we need to normalise each individual parameter separately in general. With this in mind let us define the following normalising matrix: D,,(p)
=
logL
diag
. '
?01
.
.
p log L (''? pk
.
,
.
-
,
P log L where . (oi Under certain regularity conditions D ? (p)( 11
where D 11- 1(p)A11(1)D 11-
( 13.83)
P
1(p)
c(p)
-+
p2 jog L 'oi t'?pj
-
=
we can show that
x(0,c(p)-
'v
Cr
A,,(p)
( 13.82)
cxs
l),
-p)
and
--+
i,)'
,
Asymptobic efhcienc
1,
=
.
.
.
k.
,
'
As shown above, the asymptotic distribution of the CAN estimator is F(p)) where N(0 P'(pl achieves asymptotic n( p ' the Cramer-Rao ) x,'lower bound fw.(p)- 1, that is, ,,
--
,,
'v
,
Q
P-(p) 1.0) =
1
(13.84)
.
This last equality defines the asymptotic
(6)
efficiency of the MLE
,,.
Summary of asymptotic properties
For reference purposes let us summarise the asymptotic properties of MLE of 0o, in the case where I,,(p) Op(n): =
Consistency'. For some root
i
a .S -+
.
0o
lim
Pr
?1 --*
P
0-?1 0() -+
2t2.
J,, of the likelihood
i
=
0v
1im #r(1 0-tt 0() -
!1
--*
::J;
=
t
<
1 ;
4 1 =
.
equation
I
l
,
a
279
Iikelihood method
The maximum
Asvmptotic normality. For a consistent
(2)
V
n(
-
0v) x N(0,
Asymptotic c//ccncy. V(p) Note that asymptotic
=
1.(p) -
of 0o
v(p)). l,, of
For a CAN estimator
0v
1
normality
P
x''nt
i
MLE
.
unbiasedness,
also implies asymptotic
i.e.
.
,,
Example 8 Let X,, EEE(.Y1, X1,
.
.
.
,
sample generated by the AR(1)
A',,)' be a non-random
time-series model: Xi
=
tzztf
+ - 1
Uf,
where b'i V(0, c2); a normal white-noise process (see Chapter 8). The distribution of the sample D(X,:; p) is multivariate normal of the form: 'v
X,,
'v
N(0, c2V,,(a)),
where
This is because
k 0, 1 2 =
,
,
.
.
.
Ial
< 1 which ensures stationarity (seeChapter 8). in view of the restriction As argued above the only decomposition possible in the case of a nonrandomsample is 11
D(x,:; 0j
=/(x();
0$ i
1-lTtxi/xl, .
=
1
.
.
,
xi -
-,
1
04
where .J(xo;0) refers to the marginal distribution of the initial conditions. ln practice there are several ways the initial conditions can be treated'. (i) assume Xo is a known constant, say Xo 0 (i.e.a degenerate r.v.); =
280
Metllods
(ii)
assume
J
(
(iii)
x
?,
this ensures stationarity
;
y 1)
of
,.
n
'
-a)2)
that Xv ,vX(0,c2/'(1
.t#,,,
A-,, which defines as a circular Anderson 1971/. stochastic process (see ( For expositional purposes let us adopt (i) to define the likelihood function assume
that Xv
=
as
t?log L pc2 = Hence, the
n
-
1 2c4
+
2c2
?'
i
jq1
of a and c2 are
MLE'S
y,2 =
1
t72log L 2
t?a
1
.
c
lo L 472 g (.7a 7c2 =
:2 log L . c4
=
i-1 i
1
n
2c4
E
-
:2 log L
oyc p2loe <'
-
L
t'?a:t7.2
11
)'(1 fzkri-
)2,
Xi .j
=
-
'
1 ''
n
-
c
1
.1 n i
n
.ya.a
2
Taking expectations E
Xi -aA'f-1)2=0.
=
4
i
jj xixi =
c6
) xi
2
-a i
j
1 -
j
-1
,
j
...,..
''
i
)g(xi - xxi =
j)2.
..
1
with respect to D(X,,; 04 it follows that -
1 c,
''
yyj
Exi-
1 (n 1J c2 (1 a2)
1
-
'
=
'
-
note the role of the initial conditions.
)
1 -
(n
a, -
j
1)a '
'
-
(7.2
''
jya
-a2)
(1
=
-
0;
ac
-
n
-
1
1 aa -
A
1,
likelihood method
The maximum
13.3
02loq L
E-
n o ncl + 2c* c
<'
=
-
=
.-
Jc*
In order to establish consistency
of
281 0
n .-u; 2c
N
'IG
note that the score function:
ti,, we
martingale defines a zero-mean (,%,,n > 1). Using martingales and Markov processes it follows that 42* 11,
-J--1
n
1
n
i
)(
zr
,?-
(r;.
--+
1
-
1 xc
,
-
1
-
for
;)
l2'
li
--..
and
0
-+
the WLLN
respectively. Hence
p &,, x. -+
P
Using a similar argument we can show that :,2,--+ c2. For asymptotic normality we need to express &,,in the form &
/
jj
'
n(
.-
,,
deducing that
a)
=;:
x.''rl(4u
.
-
l
/1
i
--
y
1
=
a)
'w
a
y
2 f- l
1
j N
y
,,
.
X(0, (1 a2)) and x'''n(62 -
-
c2)
'.w
N(0, 2c4)
1
> 1 (see Anderson ( 197 1)). It is interesting to note that in the case where la! the order of magnitude of 1u(a) is no longer n, in which case we need a different normalising sequence for asymptotic 1)lnormality. In particular for the sequence it follows that n y 1) where c,, ()'- 1 Xh
tc,,,
(.,,(.4,,a) -
-
=
xN(0,c2)
(see Anderson (1959)).It is also important to stress that the asymptotic distribution of 4,, depends crucially on the distribution of the white-noise > 1 if t-p's is not normally error process )U,,, n y 1). ln the case where asymptotic normality above result does not follow. distributed the
1a1
Methods Important
concepts
normal equations, method of moments, the likelihood Least-squares, function, the log-likelihood function, the score function, maximum likelihood estimator, likelihood equations, invariance, sample information matrix, single observation information matrix, asymptotic information matrix.
Questions Explain the least-squares estimation method. Explain the logic underlying the method of moments. Why is it that the method of moments usually leads to inefficient
4. 5.
7. 8.
estimators? Define the likelihood function and explain its relationship with the distribution of the sample. Discuss the relationship between the likelihood, log likelihood and score functions. Define the concept of a MLE and explain the common-sense logic underlying the definition. State and explain the small sample properties of MLE's. sample X can be transformed into a Explain how a non-random martingale orthogonal sample X. Explain why log Ln(()), 9,,, n > l ) defines a zero mean #l, define a martingale difference. martingale and the zf(p), i 1, 2, State and explain the asymptotic properties of MLE's. Discuss the relationship and between the order of magnitude asymptotic normality of MLE's. If we were to interpret the likelihood function as a density function what does the MLE correspond to?
ttd/dp) =
10. 11.
.
.
.
,
Exerdses
Consider the linear model y'j pj x 1 j + =
;i ,
li
'v
N-1(, c2)
,
i
=
1,
.
.
.
,
rl.
0= (p1,c2). Define the MLE-S of 0 and compare its properties those of the least-squares estimators. Show that
with
13.3
likelihood method
The maximum
283
where #1is the least-squares estimator of exercise 1, and compare it with the Cramer-Rao lower bound. X,,)' be a random sample from the Poisson Let X EEE(Arl Xz, distribution with density function ,
.
.
.
PX
JlX
P)
;
'
=
,
-
:h %-'
P
-
.
x!
Derive the likelihood, the log-likelihood and the score functions. Derive the MLE of 0 and its asymptotic distribution. (ii) Let X H(xY'1, sample from the Bernoulli X,,)' be a random distribution with a density function
(i)
4.
.
.
,
.
.Jlx;p) pxt1 - p)1- x
x
=
=
0, 1.
Derive the MLE and state its properties. Let X EEE(-Y1, Xn4' be a random sample from N(p, 1). Show that .
.
,
.
d log
E . Let X
E
(A-1 ,
.
.
d2 1og J-,t./:;x)
E
-
(jjta
tr2)
1
=
xc
exo -
(2z:)
j.
sample from the log normal 1
-
x lo2 2cu: m -
--'
2 ,
Derive the MLE'S of m and c2. (Hint: use invariance.) Derive the method of moments estimators for m and c2. (A-1 Ar,,)' be a random sample from the exponential ,
distribution
(X',
8.
=
A-,,)' be a random density function
,
.
with
tx,' ?'n,
Let X =
j
cjy
distribution
(i) (ii)
2
f-tJt; x)
.
.
.
,
with
density function
P) P e - 0x =
Derive the MLE of 0 and show that it is both consistent and asymptotically normal. What is the MLE of 1/?? Let X EEE(A'1, Xz, X,,)' be an independent sample from Nlpi, c/). SILE'S c2, cj? of (c2,gj, i 1, 2, For n, derive the (i) /t,,) and their asymptotic distribution. (Hint: check 1.(.$.) MLE'S of @,c2j, For pi i 1, 2, n, derive the cJ) and their asymptotic distribution. .Y,,)' be a random sample from the Weibull Let X -(.:-1, distribution with a density function .
.
.
,
=
=
=p,
.
.
=
.
.
.
.
.
.
,
.
1
expt
-
.
,
,
.Jtx;p) ocf =
.
pxc), x > 0,
.
.
,
.
.
,
284
Methods
where c fs a known constant. Derve the MLE of 0 and fts asymptotfc distributions. 10*. Let X EEE(X1, X2, X,,)' where .
Xf
.
.
,
H(X') X,i
be a random sample from the bivariate normal distribution
# 1i
A-2i
c2
/?c l c 2
pc 1 c2
c22
0
..v../V' 0
1
,
i zzz 1, 2,
.
.
.
,
n.
Derive the MLE of 0m (cf, cl, p) and its asymptotic distribution. Of Assuming that cf c2a 1 derive the MLE p and its asymptotic distribution. Compare the asymptotic variance of /$ and / and explain intuitively why they differ. For p and p 1 derive the MLE'S of trf and cq and variances. their asymptotic compare =
=
=0
=
Additional references
Bickel and Doksum (1977/ Cox and Hinkley ( 1974); Kendall and Stuart ( 1973); (1984); Rao (1973);Rohatgi ( 197$,. Silvey ( 197,5); Zacks (197 1).
Lloyd
CHAPTER
14
Hypothesis testing and confidence regions
The current framework of hypothesis testing is largely due to the work of Neyman and Pearson in the late 1920s, early 30s, complementing Fisher's work on estimation. As in estimation, we begin by postulating a statistical model but instead of seeking an estimatorof pin Owe consider the question whether 0 65 (.)0c O or 0 (E (41 (.) - Oo is mostly supported by the observed data. The discussion which follows will proceed in a similar way, though less systematically and formally, to the discussion of estimation. This is due to the complexity of the topic which arises mainly because one is asked to assimilate too many concepts too quickly just to be able to deline the problem properly. This difficulty, however, is inherent in testing, if any of the topic ls to be attempted, and thus unavoidable. proper understanding effort made is Every to ensure that the formal definitions are supplemented explanations and examples. ln Sections 14.1 and 14.2 the with intuitive needed tests are concepts to define a test and some criteria for discussed using a simple example. ln Section 14.3 the question of constructing tests is considered. Section 14.4 relates hypothesis testing to confidence estimation, bringing out the duality between the two areas. ln Section 14.5 the related topic of prediction is considered. =
'good'
'good'
14.1
Testing, definitions and concepts
Let A- be a random variable (r.v.) defined on the probability (S, P( )) and consider the statistical model associated with Xi
space
.%
'
tl' (i) ),J(.X', ?). ()6 0) X (Ar1 zY2, -Y,,)/is a random sample, from .J(x',@. (ii) The problem of hypothesis testing is one of deciding whether or not some =
=
,
.
.
.
,
286
Hypotllesis testing and confidence regions
conjecture about 0 of the form p belongs to some subset (i)o of (.) is x,,)'. We call such a conjecture the null supported by the data x (.xI x2, hypothesis and denote it by Ho 0 c Olljwhere ifthe sample realisation x e: l-a we accept Ho, if x (E C1 wc reject it. The mapping which enables us to define IJI(see Fig. 11.4). Ca and C1 we call a test statistic z(X): ?' ln order to illustrate the concepts introduced so far let us consider the following example. Let X be the random variable representing the marks achieved by students in an econometric theory paper and let the statistical model be: =
,
.
.
.
,
-.+
(ii)
*
=
X
=
n 40 is a =
.(Ar; 0)
(.11,Xz,
.
1
=
8 .
.
,
exp
(2zr)
Ho1 0= 60
(i.e.A'
f/1 : p# 60
(i.e.A'
against
1 A- - p j ,
2
peg
,
>
E0,100j ;
.Y,,)',
sample from
random
-
'v
x
.(x;
$.
N(60, 64:,
The hypothesis to be tested is
(,.)0 =
Nlp, 64/, y # 60),
t60) (91
=
g0,100)
-
(60) .
estimator of 0, say T,, Common sense suggests that if some sample realisation Xi, for the (1/n) j7- 1 60 then we x takes a value will be inclined to accept Hv. Let us formalise this argument: 'good'
=
earound'
The acceptance Co
=
C'1
=
and
reqion
takes the form 60 -c GY,,< 60 + c, s >0, or
1X,, 601 G
l)
tx:IX, 60I>
l)
ft
x:
-
-
,
is tbe
rejection
region.
The next question is,showdo we choose ;?' If ; is too small we run the risk of rejecting Hv wpn l is true; we call this tbpe I error. On the other hand, if c is too large we run the risk of acceping Ho wJlt?nl is false;we call this type 11 error. Formally, if x 6 C1 (rejectHo) and p e:(6)0(Ho is true) - type I error; if x 6: Ctl (acceptHv) and 0 6 (91 Ho is false) - type 11errortsee Table 14.1). The
Table 14.1
Ho true Hj false
Ho accepted
Hv rejected
correct
type I error correct
type 11 error
14.1
Testing, delinitions and concepts
hypothesis to be tested is formally stated as follows: Sa: 0 e:0c),
Oa
0.
Against the null hypothesis Ho the form:
(14.1) we postulate
Sj which takes
the alternati,e
H j : 0#0..o
or, equivalently, H j : 0 e:0- 1
BE
0- - (-)()
(14 3)
.
.
lt is important to note at the outset that Ho and H3 are in effect hypotheses about the distribution of the sample .Jtx; ?), i.e. H
:
(X;
P),
H 1:
/(X;
P), P E:0- 1
.
A hypothesis Ho or Sj is called simple if knowing 0 e:(.)0 or 0 6 01 specifies /(x; ?) completely, otherwise it is called a composite hypothesis. That is, if J(x; ?), 0 (E (.): or /'(x;p), 0 (E(.): contain only one density function we say that Ho or Sl are simple hypotheses, respectively; otherwise they are said to be composite. ln testing a null hypothesis Hv against an alternative Hj the issue is to decide whether the sample realisation x tsupports' Ho or Sl. In the former case we say that Hv is accepted, in the latter Ho is rejected. ln order to be able to make such a decision we need to formulate a mapping which relates (')0 to some subset of the observation space say C(), we call an acceptance ('o rn Cc region, and its complement C71 (Co t..p C1 we call the rejection reqion (see Fig. 11.4). Obviously, in any particular situation we cannot say for certain in which of the four boxes in Table 14.1 we are in; at best we can only make a probabilistic statement relating to this. Moreover, small' we run a higher risk of committing a type l if we were to choose s of committing than a type 11 error and vice versa. That is, there is a error trade ojf' between the probability of type l error, i.e. .../,'
.%
=
.g)
=
btoo
#K(x
e:C1; p e O0)
and the probability
=
a,
j of type 11 error, i.e.
Prtx c Ca; 0 (E O 1)
=
j.
( 14.6)
Ideally we would like x j= 0 for al1 0 G O which is not possible for a fixed n. Moreover, we cannot control both simultaneously because of the trade-off between them. tl-low do we proceed, then'?' ln order to help us decide let us consider the close analogy between this problem and the dilemma facing the jury in a trial of a criminal offence. =
288 The
Hypothesis
testing and confidence regions
jury in a criminal offence trial are instructed to choose between: Hv the accused is not guilty; and the accused is guilty;
HL :
with their decision based on the evidence presented in the court. This evidence in hypothesis testing comes in the form of * and X. The jury are instructed to accept Ho unless they have been convinced beyond any reasonable doubt otherwise. This requirement is designed to protect an innocent person from being convicted and it corresponds to choosing a small value for a, the probability of convicting the accused when innocent. By adopting such a strategy, however, they are running the risk of letting a off the hook'. This corresponds to being prepared to number of relatively high value of j, the probability of not convicting the accept a accused when guilty, in order to protect an innocent person from conviction. This is based on the moral argument that it is preferable to let off a number of guilty people rather than to sentence an innocent person. However, we can never be sure that an innocent person has not been sent to prison and the strategy is designed to keep the probability of this happening very low. A similar strategy is also adopted in hypothesis testing where a small value of a is chosen and for a given a. p is minimised. Formally, this amounts to choosing a* such that 'crooks
and
Prtx e:C1; 0 G Oe) Prlx
6
=
a(@ %a*
C().,0 c O1) jt)l
for 0 e:O()
is minimised for pcl')j
=
by choosing C1 or Ctl appropriately. In the case of the above example if we were to choose
#r(IT,,
-601
> ;,'
(14.7)
p= 60)
=
0.05.
a, say a*
(14.8) =0.05,
then
(14.9)
This represents a probabilistic statement with c being the only unknown. l-low do we determine c, then?' Being a probabilistic statement it must be based on some distribution. The only random variable involved in the statement is X,, and hence it has to be its sampling distribution. For the above probabilistic statement to have any operational meaning to enable us to determine :, the distribution of Y,,must be known. In the present case
we know that
T,
'v
,
c2
(9.2
N p, -u
where
which implies that for 0 60 =
ztx)-
.S,,- 60 jocyj
) -
-
n
64 =
=
4:
(i.e.when
.x(() ,
!),
1.6,
( 14.10)
Hv is true)
( 14.11)
14.1
Testing, definitions and concepts
and thus the distribution of z( ) is known completely (no unknown parameters). When this is the case this distribution can be used in conjunction with the above probabilistic statement to determine E:. In order to do this we need to relate jXp,-601 to z(X) (a statistic) for which the distribution is known. The obvious way to do this is to standardise the which is equal to former, i.e. consider 1Y,, This suggests pz(X)I. changing the above probabilistic statement to the equivalent statement .
-601/1.265
IY,, -
Pr
60I> c,;
1.265
p 60 =
=
0.05 where
cx
=
c .
1.265
Given that the distribution of the statistic z(X) is symmetric and we want to determine cx such that ##1z(X)1 > cz) 0.05 we should choose the value of c'a from the tables of N(0, 1) which leaves a*/2 0.025 probability on either side of the distribution as shown in Fig. 14.1. The value of ca given from the N(0, 1) tables is cx= 1.96. This in turn implies that the rejection region for the test is =
=
( 14.13) C1
=
tx:1X,,- 60l > 2.48 ).
(14.14)
That is, for sample realisations x which give rise to T,, falling outside the interval (57.52,62.48) we reject Hv. Let us summarise the argument so far in order to keep the discussion in perspective. We set out to construct a test for Hv : p= 60 against S1 : 0+ 60 and intuition suggested the rejection region (1X,, k' 4. In order to determine : we had to -
60I
(i) (ii)
choose an a; and then define the rejection region in terms of some statistic z(X). The Iatter is neeessary to cnable us to determine t; via some known distribution. This is the distribution of the test statistic z(X) under Ho (i.e. when Hv is true). (z)
::
-ca
r'
0
Fig. 14.1. The rejection region
ca
(14.13).
z
Hypothesis testing and confidence regions
Iz(X)I
Given that C1 (x: the question > 1.96) defines a test with a What probbility of naturally arises need is: do we which the type 11error j for'?'The answeris that we need # to decide whether the test defined in terms of f71 is a test. As we mentioned at the outset, the way we or a problem of the trade-off between a and j was to decided to the small and value for a define C1 so as to minimise j. At this stage we choose a whether do not know the test defined above is a good' test or not. Let us setting consider up the apparatus to enable us to consider the question of optimality. =
=0.05,
tbad'
'good'
tsolve'
14.2
Optimal tests
Since the acceptance and rejection regions constitute a partition of the observation space i.e. Co k.p C1 I and Co ra C1 Z, it implies that Prx E:Co) 1 c C1) for all 0 e O1. Hence, minimisation of Prtx 6 Co) for all 0 E:01 is equivalent to maximising Prtx e:C1) for a1l 0 c O1. .%
=
=
-##x
=
Dejlnition 1 Theprobability of rejecting So w/lcn falseat some point t?1601, Prx s C1; 0 pl) is called the power of the test at 0 0j =
=
i.e.
.
Note that G C1;
Prx
0 p1) 1 - #r(x =
C0; 0=
6
=
t?1) =
1
-
j(p1).
In the case of the example above we can define the power of the test at some pj eO1, say )l 54, to be PrE(lT)-601) 1.265 > 1.96; p= 54q. 'How do we calculate this probability'?' The temptation is to suggest using the same distribution as above, i.e. z(X) N (X,, 60)/ 1.265 N(0, 1).This is, however, wrong because 0 is no longer equal to 60,.we assumed that p= 54 and thus (-Y,, 54)/1.265 N(0, 1). This implies that =
'v
-
'v
-
z(X)
N( ,v
(54-60) jj j.a65
for
,
o
.54.
Using this we can define the power of the test at 0= 54 to be Ar,, 60 -
Pr
j.a6j
y 1.96,. 0= + Pr
541=
Pr
(T,,- 54) j.r6j
%
-
(X,, 54) (54-60) > 1.96 -
1.265
1.265
1.96 =
-
(54-60) j.z6j
0.9973.
Hence, the power of the test defined by Cj above is indeed very high for p= 54. ln order to be able to decide on how good such a test is, however, we
14.2 Optimal tests need to calculate the power for a1l 0 c (9: Following the same procedure the power of the test defined by C1 for 0= 56, 58, 60, 62, 64, 66 is as follows: #r(Iz(X)1> 1.96; 0= 56) .
=0.8849,
Pr(lz(X)1> 1.96; 0= 58)
=0.3520,
Pr(Iz(X)1> 1.96,. p= 60)
=0.05,
> 1.96; p= 62) #r(1z(X)I
=0.3520,
> 1.96; p= 64) #r(Iz(X)I > 1.96; p= 66) #r(Iz(X)I
=0.8849,
=0.9973.
As we can see, the power of the test increases as we go further away from p= 60 Hjl and the power at 0 60 equals the probability of type l error. This prompts us to define the power function as follows: =
Dejlnition 2
,#(p) #r(x G C1), 0 E: 0- is called the power function of the test dejlned by the rejection region C1. =
Dehnition 3 is dejlned to be tbe size
.#(@
a the test. =
maxpoo,
(t)rte
signcance
leve of
ln the case where Hv is simple, say 0= 0o,then a #ool. These definitions enable us to define a criterion for best' test of a given size x to be the one (if it exists) whose power function c?(p), 0 6 Oj is maximum at every 0. =
ta
Dejlnition 4
..4lcsl of So : 0 6 0-o against S1 : 0 e:01 as desned by some rejection reqion C1 is said to be uniformly most powedul (UM#) test of size a .:#(@
(f)
max
(ff)
@0) y
wcrp
=
0 (2 (90
kt.#*(?)
a,.
J?*(p)
is the
for all 0 6 01,. of any power function
other test of size a.
As we saw above, in order to be able to determine the power functlon we need to know the distribution of the test statistic z(X) (interms of which C1 is defined) under Sl (.c. wen Ho is false). The concept of a UMP test provides us with the criterion needed to choose between tests for the same Ho. Let us consider the question of optimality for the size 0.05 test derived
Hypothesis testing and confidence regions f (z) C(
;) .
.,
0
(x:z(X) > 1
=
(14.16)
.645)
.
1.645
z
f (z) C(
4-
=(x:z(X) G 1
'Z
-1
(14.17)
r
O
.645
.645)
z
/ (z) C(
-0.03
0 0.03
Fig. 14.2. The rejection
regions
above with rejection (71
=
tx
:
*
+
=(x:1
z(X)I< 0.038) (14.1 8)
z
(14. 16), ( 14. 17) and ( 14. 18).
region
Iz(X)1> 1.96).
( 14.19)
To that end we shall compare the power of this test with the power of the size 0.05 tests (Fig. 14.2), defined by the rejection regions. All the rejection regions define size 0.05 tests for HvL t? 60 against H3 : p# 60. ln order to tgood' and discriminate between tests we have to calculate their power functions and compare them. The diagram of the power * + functions ((?),.# (p) is illustrated in Fig. 14.3. Looking at the diagram we can see that only one thing is clear + + C) defines a very bad test, its power function being Jomnlrp by the other tests. Comparing the other three tests we can see that C( is more + < a for 0.< 60. Cib is more powerful than the other two for 0>. 60 but + > powerful than the other two for 0 < 60 but t (p)< a for p > 60, but none of the tests is more powerful over the whole range. That there is no UMP test of size 0.05 for HvL 0= 60 against Hj : ()+ 60. As will be seen in the sequel, no UMP tests exist in most situations of interest in practice. The procedure adopted in such cases is to reduce the class of all tests to some subclass by imposing some more criteria and consider the question of UMP tests within =
kbad',
,#(p),
.#-'((?),
.#
'better'
-
-
+
'cut';
.#*(@
293
Optimal tests @+
'++ (p)
1.00
>.
x.
N
N
N
p'
N N
N
/ h
.t.#++'' 0.05
ttpy
X ?
(#)
l
l
/
e
z
/
/
,(# (0 )
0
60
0 Fig. 14..3. The pow'er functions
..#(f?),
.#*(0j,
.#-t
''-(?),
.#-F
> ''(#).
the subclass. One of the most important restrictions used in this context is the criterion of unbiasedness. Dejlnition J
-4 test of Ho1 0 e:(')() alainst ..@
max t?e O f.,
(0)t
0 g 0- 1 is
said t() be unbiased
#)J
(14.20)
,(#(p).
max
() G (''1j
ln other words, a test is unbiased if it rejects Hv more often when it is false than when it is true; a minimal but sensible requirement. Another fonn these added restrictions can take which reduces the problem to one where UMP do exist is related to the probability model *. These include exponential jmilv. restrictions such as that (l) belongs to the one-parameter + ln the case of the above example we can see that the test defined by C) is biased and C1 is now UMP within the class of unbiased tests. This is because C1 and C3* are biased for 0 < 60 and 0>. 60 respectively. lt is obvious, however, that for *
*
() 60
HvL
=
against H3 : 0 > 60 H
:
()< 60,
+ the tests defined by C1 and C* are UMP, respectively. That is, for the one+. lt is sided alternatives there exist UMP tests given by C1 and C) of H3 above H3 and in the parameter that the space important to note case implicitly assumed is different. ln the case of Hk the parameter space implicitly assumed is O E60,10% and in the case of H3, O g0,6%. This is needed in order to ensurc that (6): and 0- I constitute a partition of 0=
=
.
Hypothesis testing and confidence regions
Collecting al1 the above concepts together we say that a test has been dejlned when the following components have been specified: (FJ) a test statistic z(X). (T2) the sjcc of the test a. (FJ) the distribution of z(X) under Hv and Sj (F4) the rejection region C1 (t?r,equivalently, C0). Let us illustrate this using the marks example above. The test statistic is .
n(.Y,, p) (.Y,, 60) ; -
r(X)
=
( 14.2 1)
=
1.27
c
we call it a statistic because c is known and 0 is known under Ho and S1. If wechoose the size a 0.05 the fact that z(X) N(0, 1)underse enables us to define the rejection region C1 (x:Iz(X)I > ca) where ca is determined from Pr(Iz(X)l > cz; 0= 60) to be 1.96,from the standard normal tables, i.e.if 4tz) denotes the density function of N(0, 1) then =
'v
=
=0.05
Ca
/tz) dz= 1 - C
(14.22)
-a.
a
In order to derive the power function we need the distribution of z(X) under Sl. Under Hb we know that
w/nuk'np1) ,vN(0, z*(x)= -
1),
(j.4 2,3) .
for any pl G (91 and hence we can relate z(X) with z*(X) by
(p1 -po)
z(x)=z*(x) +
v
( 14.24)
to deduce that
xtx/'n (p1 -
z(x)
-
jj
p())
( 14.25)
,
c
under H3. This enables us to define the power function as yca) p(pj)=#r(x: Iz(x)I -p())
=Pr
+Pr
z*(x) %
-c,
z*(x) > c,
-
. (p1 V'n
-vG
(pj
-
0o) ,
p: s e1.
( 14.26)
Using the power function this test was shown to be UMP unbiased. The most important component in detining a test is the test statistic for
14.2 Optimal tests which we need to know its distribution under both Hv and HL. Hence, constructing an optimal test is largely a matter of being able to tind a statistic z(X) which should have the following properties: z(X) depends on X via a estimator of p; and (i) (ii) the distribution of z(X) under both Ho and Sl does not depend on any unknown parameters. We call such a statistic a pivot. lt is no exaggeration to say that hypothesis testing is based on our ability to construct such pivots. When X is a random sample from N(.p, c2) pivots are readily avail/ble in the form of Sgood'
.&4#,,-/tj.x((),
j),
1)
(n -
S2 'x,z
Gc
2
(n-
1),
(14.27) but in general these pivots are very hard to come by. The tirst pivot was used above to construct tests for p when c2 is known (both one-sided and two-sided tests). The second pivot can be used to set up similar tests for y when c2 is unknown. For example, testing Ho2 p against H3 : p #/t() the rejection region can be defined by =pv
C1
)x:lzj(X)1 y cz)
=
where z1(X)
fxcy
=
x/n
(Y,,-Jt) ,
s
(14.28)
and cacan be determined by: flt) d! 1 tz,' .J(r)being the density of the with n 1 degrees of freedom. For S0: p po Student's l-distribution against Sl : p < po the rejection region takes the form =
-
=
-
W:
C1
zj(X) y c.)
tx:
=
with
(x =
dr, .J'(1)
detennining ca. The pivot z2(X)
=
(n- 1)s2 z2(n Gz 'v
-
1)
c2. For example, in the case of a can be used to test hypotheses about c()2against Sj : c2 < cj the c2) c2 sample N(g, Hv from testing y random rejectionfor an optimal test takes the form
C7:
=
wherec.
)x:z2(X) %c.),
is detenuined via Cz
dzztn 1) -
0
=
a.
(14.31)
296
Hypothesis testing and confidence
regions
Constructing optimal tests ln constructing the tests considered so far we used ad hoc intuitive arguments which led us to a pivot. As with estimation, it would be helpful if there were general methods for constructing optimal tests. It turns out that the availability of a method forconstructing optimal tests depends crucially on the nature of the hypotheses (Hv and H3) or, and the probability model postulated. As far as the nature of Ho and HL is concerned existence and optimality depend crucially on whether these hypotheses are simple or composite. As mentioned in Section 14.2 a hypothesis Hv or lk is called simple if 0-() or (6)1contain just one point respectively. In the case of the 'marks' example above, (6)0 ft60) and 0- I (E0,60) k..p (60, 10% i.e. Hv is ) simple and H, is composite since it contains more than one point. Care should be exercised when 0 is a vector of unknown parameters because in such a case 0-() or 0- 1 must contain single vectors as well in order to be simple. For example, in the case of sampling from N(p, tr2) and c2 js not c2j, c2 g R krlfpwn,Hv3 p pv is nOt a simple h b'potbesis Since Oa =
=
,
'f(tpa,
=
=
.)
.
(1)
Slmple
null and slmple alternative
The theory concerning two simple hypotheses 1920: by Neyman and Pearson. Let (lJ
t
$, t?e O )
./'(x',
=
was fully developed in the
be the probability model and X (.Xj, Xz, Xn)' be the sampling model and consider the simple null and simple alternative Hv: 0= ?ll and HL : f?(), 71) i.e. there are only two possible distributions for *, that ?= f?j O iss ?1). Given the available data x we want to choose 0o) and between the two distributions. The following theorem provides us with sufficient conditions for the existence of a UMP test for this, the simplest of the cases in testing. =
.
.
.
,
.f
,
.(x;
=
,
./'(x'.
( 14.32)
( 14.33)
optimal tests
14.3 Constructing
297
ln this simple case (x 1
/#(t?) =
for 0 % for 0 ()1 =
1)
-
=
(14.34)
.
The Neyman-pearson theorem suggests that it is intuitively sensible to base the acceptance or rejection of Hv on the relative values of the distributions of the sample evaluated at 0= 0o and 0= ?l i.e. reject Hv if the ratio .Jlx; t?o)/'(x; f?I) is relatively small. This amounts to rejecting Hv when it a higher lt is very the evidence in the form of x favour H3 rgiving that solve Neyman-pearson theorem not note does the the important to relating problem the ratio completely because of the p()) problem tx,' J(x; ?1) to a pivotal quantity (test statistic) remains. Consider the case where X N(t?, c2), c2 known, and we want to test HvL 0= 0v against f11: theorem we know that the p= ?j (p()< ?:). From the Neyman-pearson ratio region of in delined the rejection terms -
tsupport'.
'x-
()) a exists for some a. can provide us with a UMP test if #rtx c Cj; 0= The ratio as it stands is not a proper pivot as we defined it. We know, however, that any monotonic transformation of the ratio generates the same family of rejection regions. Thus we can define =
z(X)
=
w''n (X,1- 04
-
c
-
r;-
gn(o1 .p())g
log /(x, p(),p1) t:'
+.N-
a in terms of Cl
which =
we can define the rejection region
)x: T(X) y 6: C1 ;
0
=
2
'
?o) a =
( 14.37) x if
( 14.38)
exists.
Remark: in the case of a discrete random variable
#rtx
6: C 1 ;
0
=
(14.36)
as
c)).
C1 defines a UMP test of size P#x
0 + 0o n --!- p
(74)) a, =
x might not exist since the random variable takes discrete
values.
298
Hypothesis testing and confidence regions
For example if a
,t?qpl) =
zl
=0.05,
cl
-N
(x)> c)
1.645 and the power of the test is
=
/ n(p1 p2 ) -
=
-
1
-
(14.39)
p,
where under Hj
(14.40)
.
ln this case we can control j if we can increase the
1 j= -
#r(z1(x) ,Ac)*)
=
c'l + c)*
=
sample
size since
Vn(() - o ). G
1
(14.41)
()
For the hypothesis So: p= 0o against Hf : 0= ?: when 0 statistic takes the form
<
0v the test
yz'
(# s p) z(X) v----s-..n =
which gives C1
(2)
rise =
kf : x
to the
rejection
region
z(X) % ca*)
Composite
( 14.42)
.
null and composite alternative
one parameter
cas
For the hypothesis Ho : 0 > 0o
against H j : 0 < 0v
being the other extreme of two simple hypotheses, no such results as the Neyman-pearson theorem exist and it comes as no surprise that no UMP tests exist in general. The only result of some interest in this case is that if we restrict the probability model to require the density functions to have monotone Iikelihood ratio in the test statistic z(X) then UMP tests do exist. This result is of limited value, however, since it does not provide us with a method to derive z(X).
14.4
The Iikelihood ratio test procedure
299
Simple Hv against composite H 1 ln the case where we want to test HoL 0= ?a against Sl : 0>. ?o (or 0< ?()) uniformly most powedul (UMP) tests do not exist in general. ln some particular cases, however, such UMP tests do exist and the NeymanPearson theorem can help us derive them. lf the UMP test for the simple Ho 0= 0v against the simple Hj : 0= ?1does n()r depend on pj then the same ln the example > 0v (or 0 < t?()). test is UMP for the one-sided alternative discussed above the tests defined by t71
)x:z(X) > c))
(14.43)
C1
)x:z(X) %c))
(14.44)
=
=
HoL 0= 0v against .S1 : 0 > 0v and Ho1 are also UMP for the hypotheses 0= 0v against Sj : 0 < po, respectively. This is indeed confirmed by the example above. diagram of the power function derived for the Another result in the simple class of hypotheses is available in the case exponential where sampling is from a one-parameter famil),of densities (normal, binomial, Poisson, etc.). ln such cases UMP tests do exist for onesided alternatives. 'marks'
Two-sided alternatives For testing S(): 0= 0o against H3 : 0+ ?tlno UMP tests exist in general. This is rather unfortunate since most tests in practice are of this type. One interesting result in this case is that if wc restrict the probability model to the one-parameter exponential familvand narrow down the class of tests by imposing unbiasedness, then we know that UMP tests do exist. The test defined by the rejection region
C'1
=
)x:1z(X)!p: ca)
(14.45)
example) is indeed UMP unbiased; the one-sided alternative (see biased tests being over the whole of 0. 'marks'
14.4
The Iikelihood ratio test procedure
The discussion so far suggests that no UMP tests exist for a wide variety of cases which are important in practice. However, the likelihood ratio test procedure yields very satisfactory tests for a great number of cases where none of the above methods is applicable. lt is particularly valuable in the case where both hypotheses are composite and 0 is a vector of parameters. This procedure not only has a lot of intuitive appeal but also frequently leadg to UMP tests or UMP unbiased tests (whensuch exist).
300
Hypothesis
testing and confidence regions
Consider Ho 0 c(.)()
against H 1 : 0 e:O j
.
Let the likelihood function be Ltp; x), then the likelihood ratio is defined by
max Atp', x) 2(x) dceo
-
L0; x)
=
=
x)
.1.z(#;
max ,s e)
-
( 14.46)
.
L0; x)
measures the highest x renders to 0 G (6)0and the value maximum of denominator measures the the Iikelihood function (see Fig. 14.4). By definition 2(x) can never exceed unity and the smaller it is the less Hv is by the data. This suggests that the rejection region based on 2(x) must be of the form
'support'
The numerator
'supported'
Cj
)x:2(x) G k)
=
( 14.47)
,
and the size being defined by max @0)
=
oBo
a.
(x and k as well as the power function can only be dened L (0 ; x )
L ()
1(#a)
--
---
-
-
-
-
-
I 1 I I
I l
I I I
1 I I
-
I I I 1 l
I I I
I
1
I l I
l
J Fig. 14.4. The likelihood ratio test.
r
0
when the
14.4
The likelihood ratio test procedure
301
distribution of 2(x) under both Ho and Hj is known. This is usually the exception rather than the rule. The exceptions arise when * is a normal family of densities and X is a random sample in which case 2(x) is often a monotone function of some of the pivots we encountered above. Let us illustrate the procedure and the difficulties arising by considering several examples. Example 1
Xn)' be a random sample be the probability model and X tA-j, Xz, from /'(x;p, c2), Ho : p yv against f11 : p # pv. =
.
.
,
.
=
1k
(2zrc2)
-,:./2
L(p; x)
=
11
s-c j
exp
=
?1
.... l , t
)( (xf--pn)2 i 1
(xj-p)2
I=1
pr
=
2(x) =
,,
j (xi- Y,,)2 f=1
At first sight it might seem an impossible task to detennine the distribution of 2(x). Note, however, that
which implies that
n(-'.t'
2(x)
,,
1+
,1
)7 (l -i =
W
-
1,'
14J1
)
=
-
2
sq
,v,
(
1+
j;j;i (! .jj
--
l1/
;!
u
r(n 1) under Ho,
) under H 1
Since 2(x) is a monotone the form
C 1 jfx :
-'p,,)
1
-/t()) here1&r=x7nI((#,, I'I'',v l(n
-/1...2
2
-/-to)
=
-
V:1(Jt1 ,
=
-/.10)
,
Jzl G 01
decreasing function of I''Uthe rejection
> cz )
,
.
region
takes
Hypothesis testing and confidence regions
and a, ca and @ol can be derived from the distribution of W( Example 2 In the context of the statistical H
against and
:
o
(72
=
of example 1 consider
model
c2 o
H 1 : c 2 + c 0,2
0- K J@x R
.
under Hv
2
z (n -
17 'v
1; J ) un d e r H 1
=
,
no'j2 n tr
,
tr
2:
(E;())
,
with kj and Lz defined by z
dzztn
1)
-
kl
=
1 -a,
18.5, kz 29.3. n - 1 30, /j e.g. if a Hence, the rejection region is (71 n kj or p y kc). Using the analogy between this and the various tests of y we encountered so far we can postulate that in the case of the one-sided hypotheses:
=0.1,
=
=
=
.ftx:
:
=
H
:
o
c2 > cj,
H 0 : c2
t;
Hj
:
c2 H : 0, 1
c2 < c(j (7.2
>
(7.2
0,
the rejection region is C1 i>1
=
fx:
py
=
jf
x : 17%kj );
/(a).
The question arising at this stage is: tWhat use is the likelihood ratio test procedure if the distribution of 2(X)is only known when a well-known pivot exists already'?' The answer is that it is reassuring to know that the procedure in these cases leads to certain well-known pivots because the likelihood ratio test procedure is of considerable importance when no such pivots exist. Under certain conditions we can derive the asymptotic distribution of 2(X). We can show that under certain conditions Ho
-
2 Iog
(x) z2(,.) -
J
(14.48)
14.5 Confidence estimation H ()
('
'w
'
distributed under Hv'),
iasymptotically
reads
r
being the number of
7
parameters tested. This will be pursued further in Section 16.2. Confidence estimation In point estimation whcn an estimator of p is constructed we usually think ofit notjust as a point but as a point surrounded by some region of possible standard error of This can be error,i.e. 1+ e, where e is related to the condence of interval for 0 viewedas a crude form a .
(
-
;<
0 :A1+ n);
(14.49)
crude because there is no guarantee that such an interval will include 0. Indeed, we can show that the probability the 0 does not belong to this interval is actually non-negative. In order to formalise this argument we need to attach probabilities to such intervals. ln general, interval estimation refers to constructing random intervals of the form
(14.50)
(z(X)G 0 %f(X)),
together with an associated probability for such a statement being valid. r(X) and f(X) are two statistics referred to as the lower and upper respectively; they are in effect stochastic bounds on 0. The associated probability will take the form tbound'
#r(#X) G p %f(X))
=
1
(14.51)
-a,
where the probabilistic statement is based on the distribution of J.(X) and f(X). The main problem is to construct such statistics for which the distribution does not depend on the unknown parameterts) 0. This, however, is the same problem as in hypothesis testing. ln that context we 'solved' the problem by seeking what we called pivots and intuition suggests that the same quantities might be of use in the present context. lt turns out that not only this is indeed the case but the similarity between interval estimation and hypothesis testing does not end here. Any size a test about 0 with 1 can be transformed directly to an interval estimator of 0 consdence level. -a
Dehnition 6 The interlml (#X), f(X)) is called a (1 jr alI 0 e:O #r(z(X) G 0 %f(X))> 1
-.
-
$ confidence interval for 0 #(14.52)
( 1 -$ is called the probability of coverage of the interval and the statement
Hypothesis
testing and confidence regions
suggests that in the long-run (in repeated experiments) the random interval (z(X), f(X)) will include the ttrue' but unknown 0. For any particular sure' whether (z(X), f(X)) realisation x, however, we do not know includes or not the true' 0,' we are only ( 1 - a) confident that it oes. The duality between hypothesis testing and confidence intervals can be seen in example discussed above. For the null hypothesis the ifor
Kmarks'
H () : 7 0() =
0() 6: 0-
,
against H 1 : p # 0v, a size x test based on the acceptance
we constructed C()(?o)
x : ()v- cz
=
c G Xn G 0v +
c
--c
(,'a
.
v'n
region
--rx n
,
( 14.53)
with cx delined by fr:;4
$(z) dz= 1 -
T'
Z
-a,
x
( 14.54)
N(0, 1).
a
C(,, 0= p(,) 1 x and hence by a simple manipulation of Co we can define the (1-a) confidence interval
This
implies
that Prx
C(X)
=
is
=
()o: X,, c!a c % 0v %
--s ,n
-
N
-
x-, x<
,,
+ cx
tr
-sx,. /
,
Proo e:C7) 1 - a.
( 14.55) ( 14.56)
=
ln general, any acceptance region for a size x test can be transformed into a ( 1-a) confidence interval for 0 by changing ()-a, a function of x 6,7/,. to C, a function of 0o s0. One-sided tests correspond to one-sided confidence intervals of the form #r(z(X) G 0j > 1 -a
(14.57)
Pro Gf(A-)) y 1
(14.58)
-a,
In general when O R''', m 1. 1, the family of subsets C(X) of O where C(X) For example, depends on X but not 0 is called a random lzgt??. =
C(X)
=
)#:z(X)
:$
0 %f(X))
()-(X)
=
)#:z(X) % p).
The problem of confidence estimation is one of constructing region C(X) such that, for a given a e:(0, 1),
#rtx : 0 G C(X)/p)> 1 a, -
for a1l 0 G (4.
( 14.59) a random
(14.60)
14.5 Confidence estimation
305
It is interesting to note that C(X) could be interpreted as
C)X)
tp: r(X)
=
0i fk(.X), i
t:i
:$
1, 2,
=
in which case if (r(X), i(X)) represent intervals
.
.
.
,
??l
)
( 14.6 1)
,
independent
(1
(zf)
-
confidence
m
(1
-a)=
J-l(1
-j).
i
=
1
The duality between hypothesis testing and confidence estimation does not end at the construction stage. The various properties of tests have corresponding counterparts in eonfidence estimation. Djlnition
7
.4 hlmil)' ( 1 a) Ie,el conjdence rtxitpn.s C(X) is said to be uniformly most accurate (UNIA) among ( 1 a) Ievel regions C*(X) /' 0j'
-
x'tpny'zf/f?n'ta
-
#r(x: 0 6: C(X) #) G#r(x: 0 E: C*(X)
'#)
for al1 0 6 0. ( 14.63)
This clearly shows that when power is reinterpreted as accuracy it provides us with the basic optimality criterion in confidence estimation. lt turns out (not surprisingly) that bkbp tests lead to UMA confidence regions. This is
because 0 t5 C7(X) (9 if and only if x 6
rt)tpl
( 14.64)
,?'',
where C()(p) represents the acceptance region of Ho 0 confidence region C(X) can be formed by
=
0o. In effect the
f'tx) )p():x e t)-o(p()))s
( 14.65)
=
and the acceptance region C()($ by t)-(#()) =
hence
tx:0v e:C(x)),
#?'(x: x c 7()(p())0= p()) Prtx: 0v c f7(x) 0= 0vj > l =
-
a.
( 14.67)
This duality between C04*0) and C(X) is illustrated below for the above example assuming that n l to enable us to draw the graph given in Fig. 14.5. Continuing with this duality it comes as no surprise to learn that unbiased tests give rise to unbiased confidence regions and vice versa. =
306
Fig. 14.5. The duality between hypothesis testing and interval estimation.
e:C(x)/#J 1 - a for #1 0z i 0. (14.68) ln general, a test will give rise to a good conlidence region and vice Lehmann (1959)). versa (see
#r(x:
0L
:$
,
igood'
14.6
Prediction
In the context of a statistical model as defined by the probability and sampling model components, prediction refers to the construction of an toptimal' Borel function 1( ) (seeChapter 6) of the form: '
l
'
): 0-
-+
k??l
(14
.69)
guess' for the value of a random variable which purports to provide a zY,, 1 which does not belong to the postulated sample. If we denote the A-,,)' and its distribution with D(X,,; 04then we sample with X,, E (#1, need to construct l0) which for a good estimator l,, of 0, X,, 1 /(i) is a Sgood
+
.
.
.
,
+
=
Prediction
14.6
307
'good' predictor of .Y,,+ 1. Given that 1, ll(X,,) we can define /( ) as a 1 lnction of the sample directly, i.e. /(,,) !(X,,). Properties of optimal predictors were discussed in Section 12.3, but no methods of constructing such predictors were mentioned. The purpose of this section is to consider this problem briefly. The problem of constructing optimal predictors refers to the question of -how do we choose the function 1( ) so as the resulting predictor to satisfy certain desirable properties?' To be able to answer this question we need to specify what the desirable properties are. The single most widely used criterion for a good predictor is minimum mean square error (MSE). This criterion suggests choosing /( ) in such a way so as to minimise '
=
=
'
'
X,, +
1
-
l(X,,))2,
(14.70)
is defined in tenus of the joint distribution of Xn j and Xn, say, X,,; 1, #). lt turns out that the solution of this minimisation problem is theoretically extremely simple. This is because (70)can be expressed in the where DX,,
'(')
+
form
'(X,: +
-
l(X,,))2
'().Y,,+
-
1
Exn
+
7(-Y,, + 1 c(X,,)) + .tF(-Y,,+1/c(X,,))
-
1
1-
+ 2'('t.Y,,
c(X,,)))2 + ECECXn+
flX?,
+ 1.
1 -
Ezrn
+
+
1
/(X,,)))2
-
c(X,,))
-
l(X,,))2
l/c(X'')) .t'(A-,, + 1/c(X,,) l(X,,))). -
(14.7 1)
Using the properties CE5 and SCE5 of Section 7.2 we can show that the last term is equal to zero. Hence, (71) is minimised when l(X,,)
f;(..,,
-
( 14.72)
1/c(X,,)).
+
When this is the case the second term is also equal to zero. That is, the form 1(.Y,,)which minimises (70) is of the predictor j .-,,.
=
.i'
11 +
Exn
=
1
+
1/t4X,,)),
(14.73)
where c(X,,) is the c-field generated by X,, (see Chapter 4). As argued in Sections 5.4 and 7.2, the functional form of EXt, .I/c(X,,)) depends entirely on the form of the joint distribution Dx,, + 1, X,,; #).For example, in the case where D4Ar,,+ 1, X,,; #) is multisariate normal then the conditional expectation is linear, i.e. E (X,,
.
1
WlX,,)) #'X,,
(14.74)
=
(see Chapter 15). ln practice, when D(',, + 1 X,,; #) is not known linear predictors are commonly used as approximations to the particular ,
308
Hypotllesis testing and confidence regions
functional form of '(A-,,+ 1/c(X,,)) =g(X,,).
ln such cases the joint distribution is implicitly assumed to be closel) approximated by a normal distribution. The prediction value of Xn + j will take the form
.4 l
+
1
=
E(Xn + 1/X,,
=
x,,),
where x,, refers to the observed realisation of the sample X,,. The intuition for the value xY,, 1 must be the underlying (76)is that the best values, in view of the past realisations of a11 possible of its average iguess'
+
#(XC=
It
then
Is
xn)
(seeFig.
14.6).
important to note that in the case where .Y,,
.
'(A-,,+ 1/W(X,,)) E(Xn + =
and X,, are independen:
1
1).
(14.77)
expectation with the marginal That is, the conditional coincides expectation of aY,, : (seeChapters 6 and 7). This is the reason why in the case of the random sample X,, where Xi N((). 1), i 1, 2, n, if A-,, 1 is also assumed to have the same distribution, its best predictor (in MSE sense) is its mean, i.e. kn 1 (1//rl) )'-1 Xi (seeSection 12.3). lt goes without saying that for prediction purposes we prefer to have non-random sampling models because the past history of the stochastic proess t.Y,,, n )y:1) will be of considerable value in such a case. Prediction regions for A-,, take the same form as confidence regions for ..
'v
+
=
.
=
.1
1 I I I
l
I // I /
51
.''
ly
A
Il sk-x x ll NX ll N x
'x
Il
ll I
1 I I I
0
n n
+
1
.
.
,
+
14.6
Prediction
309
0 and the same analysis as in Section 14.5 goes through with minor interpretation changes. Important
concepts
Null and alternative hypotheses, acceptance region, rejection region, test statistic, type l error, type 11 error, power of a test, the power function, size of a test, uniformly most powerful test, unbiased test, simple hypothesis, lemma, likelihood ratio test, composite hypothesis, Neyman-pearson level, uniformly most accurate pivots, confidence region, condence confidence regions, unbiased confidence region, optimal predictor,
minimum mean square error.
Questions Explain the relationship between Hv and H3 and the distribution of the sample. and rejection Describe the relationship between the acceptance regions and (-)0 and O1. Define the concepts of a test statistic, type l and type 11 errors and probabilities of type l and 11 errors. Explain intuitively why we cannot control both probabilities of type l and type 11 errors. How do we this problem in hypothesis testing'? Define and explain the concepts of the power of a test and the power function of a test. Explain the concept of the size of a test. Define and explain the concept of a UMP test. State the components needed to define a test. Explain why we need to know the distribution of the test statistic under both the null and the alternative hypotheses. Define the concept of a pivot and explain its role in hypothesis testing. Explain the concepts of one-sided and two-sided tests. Explain the circumstances under which UMP tests exist. Explain the Neyman-pearson theorem and the likelihood ratio test procedure as ways of constructing optimal tests. Explain intuitively the meaning of the statement isolve'
10. 11. 12. 13. 14.
#r(z(X) G 0 Gf(X))
=
1 - a.
Define the concept of a ( 1 -a) confidence region for 0. Explain the relationship between C()(?()),the acceptance region for HoL 0= 0o against Sj : o 0v and the confidence interval C(X) for 0o.
Hypothesis
310
testing and confidence regions
Define and explain the concept of a (1
confidence
uniformly
-a)
most accurate
region.
Exevcises For the example of Section 14.2 construct a size 0.05 test for Hv HL : against 60 0 < 60. Is it unbiased? Using this, construct a 0.95 0 confidence interval for 0. level significance c2) the following hypotheses: and consider Let X N(p,
'marks'
=
'v
(i)
ff0: /t G/zo,
(ii)
S(): c2 >
(iij)
H () : p
(iv)
H,.. /.t
(7.24),
y ()
=
c2 >
f11 : p Apo,
0, po - known;
c2 < c2(),
HL :
c2()
H 1 : p # p ()
,
c2
-pv-
> czt), H,
:
c2
,
- known;
:#:
c
2:,.
c2 < czo.
y +pv,
State whether the above null and alternative hypotheses are simple or composite and explain your answer. X,,)' be a random sample from N(0, 1) where 0 e:OLet X H(-Y1, .tft?1,p2). Construct a size tz test for .
.
=
,
.
against S1 : 0 0z. =
Using this, construct for p.
(1
a
-
tz)
significance
level conlidence
interval
Xn)' be a random sample from a Bernoulli distribution with a density function Let X EH (71,
.
.
,
.
fx; 0)
Construct a size
1
p'I(
=
-
041 -
'Y
x
0, 1.
=
test for HvL 0 %0v against S1 : 0>. 0o. (Snl.' is binomially distributed.) Let X EEE(A-j, A-,,)' be a random sample from N(p, c2) (i) Show that the test defined by the rejection region .
cl
.
(x
.
,
x:
=
(j)'=1 Xi)
(.S,,-yJ
Un
>k
s
sl
,
j
11
=
n
-
1j
y xi
-.:,,)2
.j
defines a UMP unbiased test for Ho'. p G/t() against Sl : p wpo. Derive a UMP unbiased test for Hv.. p y/to against Sl : p <po. (ii) BE F.)? be two random samples Let X tA-j Xn)' and Y (F1 from Ni, c2j) and Np, c2a) respectively. Show that for the hypotheses'. ,
.
.
.
=
,
,
(i)
H o c 12 c c2 H j : c lj > c
(ii)
S
.:2
:$
,.
,
-.
()
(7
2)
y: c
22 ,
S
j
:c
2j.
<
2,
c ;
.
.
.
,
Additional references
(iii)
H () : c 21 c 2, =
H j : c 2j # c 2,
,
,
the rejection regions are: Cl
c1
x y : z (x y) y; k 1 )
.f(
=
=
,
,
xf
tx
,
,
y : z(x y) %k
.z
,
cl (x,y : ks %ztx, where respectively, =
y)
)
,
,::
/,j.)
,
11
Z (A-i
-f,,)2/n
ztx,y)
=
f
-
1
1
=
,
,,,
i
Z (1'i=
1V?nl2
m
-
1
1
define UMP unbiased tests (seeLehmann ( 1959), pp. 169-70). What is the distribution of ztx, y)? Construct a size a test for HvL cf cq the likelihood ratio test procedure. =
=
c2, H3 : cf + cl using
Additional references Bickel and Doksum ( 1977); Cox and Hinkley ( 1974); Kendall Lehmann (1959); Rao ( 1973); Rohatgi ( 1976)) Silvey ( 1975).
and Stuart
(1973);
CHAPTER
15*
The multivariate
15.1
normal distribution
distributions
Multivariate
The multivariate normal distribution is by far the most important distribution in statistical inference for a variety of reasons including the fact that some of the statistics based on sampling from such a distribution have tractable distributions themselves. lt forms the backbone of Part IV on and thus a closer study of this statistical models in econometrics distribution will greatly simplify the discussion that follows. Before we consider the multivariate normal distribution, however, let us introduce some notation and various simple results related to random vectors and their distributions in gcneral. Let X EB (A-1 Xz, Xn)' be an n x 1 random vector defined on the probability space (S, P )).The mean vector '(X) is defined by ,
,
.%'
'
F(A-1) f(X)=
F(A'2)
j
r 1
Kp,
an
7
x 1 vector
s(-v,,) and the covariance
matrix Cov(X) by
covt.vlx,,l I ,
covtxcAVar(A',,)
/1
) 1l j MY, J
.j
15.1
distribution
Multivariate
definite matrix, i.e. E' where E is an n x n symmetric non-negative Iljth element of E is x'Ea> 0 for any x G R''. The ojj. f;(-tf =
jlxj.
/t
-
pj),
-
i,j
=
1, 2,
.
.
.
,
=
E and
( 15.3)
u.
In relation to a'Ya y 0 we can show that if there exists an x e: R'', a#0 such that Var(a'X)= a'Ea 0 then Pr@'X c) 1 where c' s a constant (only holding constants have zero variance), i.e. there is a linear relationship Xn with probability one. among the nv.'s .Y1 =
,
=
.
.
.
=
,
Lemma 15.1
Ltzhnrtltk15.2
AS(X) + b
(f)
F(Z)
(/)
Cov(Z)
=
.F)rZ
and ct/r/rfkrlc'tz E
//' X bas mean y
=
AX + b
=
Ap + b,'
SEIAX + b -(Ap + b))(AX + b (Ap + b))1 AFIX p)(X p)'A' AYA'. = -
=
=
-
Let X and Y be n x 1 and m x 1 random '(Y) py, rllt?rl
wr
pt?cltpry
F(X)
=
/lx,
=
Cov(X, Y)
I)tcovtA-fFy.lfyj FEIX - px)(Y
=
=
-
/ly)'(l.
Correlation
So far correlation has been defined for random variables (r.v.'s)only and the question arises whether it can be generalised to random vectors. Let F(X) 0 X into =
(without any loss of generality), Cov(X)
X EEE Define Z
=
A' 1 X2
X2:
,
(n- 1) x 1 Y=
c
=
E. X n x 1, and partition
11
5'1 ,
J2 1
Y2 2
.
x'X2 and 1et us consider the correlation
Corrtzl A-j )
=
between A-: and Z
5'12
---t-1
.---u-.
c21(x E22*2 ,
This is maximised for the value of x which minimises S(xY1
-
x'Xc)2
(see Chapter 7), which is
=
12-21Jz1 and
we
define the multiple correlation
The multivariate
normal distribution
coefhcient to be 1 2N22 - J21
(J1
R
=
)
.
c'l 1
c o rrtx
y
j
z/xa),
,
ln the case where X
X1
=
X1: k x 1, X: (n k) x 1, E -
X.a
,
we could define the r.v.'s coefficient is Corrtzl
,
ZL
'1Xj and Z:
=
.
E 1c
E21
E2 2
,
x'1 E 1 2 x 2
za)
=
(1El 1l)'2X222) ---u. ,
(y 1 x 12 x 22 -
=
E 11
'aXa whose correlation
=
,
From the above inequality it follows that for x2
C1 c
=
1x
21 y 1
)j
(1, x 1 1 a 1 ).#
>
=
Ec1Y211
corrtzj za).
(15.10)
,
E1-/El ztz-zs Ecj has at most k non-zero eigenvalues which measure association between XI and Xa and are called canonical correlations. Let X1 X
where Xa : (n 2) x 1
X1
=
the
-
X3
c11 E
Another
between form the
=
J13
ca1
c22
Jc3
J31
J32
Y33
.
form of correlation of interest in this context is the correlation and X1 given that the effect of Xa is taken away. For this we
-:-1
r.v.'s
F1
=
X3
-
b'lxa
and
L
=
X2
-
b'2Xa
Corrt'l
,
F2)
is maximised by bl Ea-alcaj and b2 Ya-alo'a:as seen above. Hence We define the partial correlation coejhcient between X3 and Xz given Xa to be =
=
pl 2.3
Ecl1
-
cl
1
- a5z 3 E 33 c12 1 l --y-3E33 /31J - /32)1 1c22 (r23E3:5 -tr1
=
-
-
,
b Corrth
,
Fc).
15.2
normal distribution
The multivariate
15.2
normal distribution
The multiuriate
normal density function discussed above was of the form
The univariate
( 15.12) X,,)' when the xY,.sare l1D The density function of X EEE(-Y1, X2, normally distributed r.v.'s was shown to be of the form .
- n/ 2jg2)
gaj
(
=
11/ 2
-
.
,
.
1k
exp
11
p) l
(xy -
-
20-a f
1
=
.
Similarly, the density function of X when the .Ys are only independent, Xi Nqpi, cJ),izuu1, 2, n, takes the form
i.e.
'v
.
.
,
.
2
,,/2
= (27:)
2
(c:, c2,
.
.
.
,
2 c,,)
j. -
exp
-
11
-
2i
)-( 1
.Xf
1!
pi
.....
(1j
,
o'i
-
.
j4)
Comparing the above three density functions we can discern a pattern developing which is very suggestive for the density function of an arbitrary normal vector X with f(X) p and Cov(X) E, which takes the form =
=
E)-l expt
-?'/2(det
tx; p, E) (2z:) =
-.(x
-
1(x
/z)'E -
Jl)),
-
(15.15)
and we write X N(p, E). lf the Arfsare 1lD r.v.'s E c2I,, and (detX) (c2)';. On the other hand, if the A'zsare independent but not identically distributed =
=
'.w
11
E
=
diagtcz1.
,
ln the case of n p= o
p1 pz
(det E)
=
=
.
.
.
c2) l1
,
(det N)
and
=
(cJ. ) (c =
i
1
=
2: ,
.
.
.
,
ct;
) .
2 E
t7'l2
21 2a(
c c
c 12
=
2
tF12
1 p 2) -
c2
G2
>
0
1
=
Pc 1c 2 2 G2
p/1tF2
for
-
1< p
<
1
,
where p
=
c
12 ,
(71J2
3 16
Te multivariate
normal distribution
and
(see Chapter 6). The standard bivariate density function can be obtained by defining the new r.v.'s
(1)
Properties
N(p, E) then Y (AX + b) NlAp +b, AEA') for A: m x n and b: m x 1 constant nktkrl'l't't/s. p.(7. i)' Y C'X,c # 0, Y x Nlcp, c2E). This property shows that if X is normally distributed then any linear jllnction of X is also normally distributed. (N2) Let Xtx Ntpf, Xr), r 1, 2, F be independently' distributed ?n)' arbitrarq' constant matrices Aj, t 1, 2, random rpcltary, tben
(N1)
Let X
'v
'v
=
=
=
.
.
.
,
./r
=
F
F
j A,X, l
x.
N
1
=
f
1
jj =
1
A,s, f
jjl (A,tA,')
The converse also holds. If the Xrs are llD then pt
and
(-'' $x,) '
,
1
-
x(,z, y' x)
.
.
=
=
p, Yf
=
E, t
=
1, 2,
.
.
.
,
F
The multivariate
15.2
normal distribution
Let X N(p, E) then the -Yfs are independent if and only f.Jaij 0, c,,,,). ln general, zero n, i.e. E diagtc 1 1 i #.js i,j 1, 2, implq' independence but in tbe case ofnormality' covariance does not =
'v
=
.
.
=
,
,
.
.
.
,
are equivalent.
r//? lwo Ij' X
.
Np, Y) then the marginal distribution
'w
0j'
an), k x 1 subset Xj
whvre X EEE
X1 ,
X2
p
p,
=
E
,
pz
=
E11
E 1z
.-
.a
=2
,
L 22
1
q/r
N(p1 E1 1). This jllows jom property' N1 Similarly, Xa N(p1, E2a). k x n), b These can be verified directly using Xj
A
'v
,
=
(Ik: 0),
=0.
'v
J(X
1;
(X;
P1)
=
and
P)dx 2
/'(x2;p2)
/'(x;0
-
dx1
,
although the manipulations involved are rather too cumbersome. k 1, this property implies that each component of X Np,t) normally distributed; the converse, however, is not true. 'v
=
(N5)
F()r
1t?
partition of X considered Xl (?J given X2 takes the
same
distribution
This follows from property Ik
AX
0 =
in N4 the conditional
(X2 - p2), + E1 2E2-21
NpL
E1 1
-
E1 2E2-21 E21).
N 1 for
- E 1 2E zh I,, -k
=
is also
./?nl
(Xl/X2),
A
Taking
and
X 1 - E 1 aE c-21X z
,
Xa
(15.19)
b 0 =
Cov(AX)
=
Ej 1
E 1 21
-
2-21
:22 1
0 .
0
Ec c
(15.20) From this we can deduce that if E1 2 0 then X1 and Xa are independent since(X1 Xc) x. N(/z1 E1 1). Moreover, for any E1 c, (Xj -E1cE2-21Xc) and X2 given that their covariance is zero. Similarly, (X2,/Xj) are independent X(#2 + E21E1-/ (X1 /11), E22 E21E1-11E1 2). ln the case n 2 =
,
'w
-
X3./Xz) ,
=
-
(7'
'v
N p.tl+ p - !. (-2 -p2), cfl 1 - p 2 ) Gz
.
( 15.2 1)
These results can be verified using the formula
/'(x1 x2; /)
'.
=
Jtx; p)
.I'(xc-,0z)
.
(15.22)
The multivariate
normal distribution
function
N5 suggests that the regression 'IX 1 Xc
x2)
=
,
+
pL
=
E1 cE2-altxa pz) -
is linear in
xc
(15.23)
and the skedasticity jnction Cov(X1/Xa
x2)
=
E1 yz'tft l cE2-21E2j
=
is
freeof
These are very important properties of the multivariate and will play a crucial role in Part lV.
(2) Without
partition
Multle
X1
E
X2
let X
c11
=
,.w
N(0, E), X: n x 1 and define the
c12 ,
..
=22
J21
where X2: (n 1) x 1 and .,Y1: 1 x 1. The squared coefficient takes the form R2
(3)
j
=
normal distribution
correlation
any loss of generality
Xv
x2.
vartx
1
-
,'x
Vart.-
1
2
)
)
-
multiple
a 1 cta-clcc,
=
( 15.25)
,
c:
correlation
1
Partial corvelation
further into
Let Xa be partitioned Xz
X z EB
Xa
X s : (n - 2 ) x 1
A' 2 : 1 x 1
,
,
with J22
Ec2
J23
=
E33
G5z
The partial correlation
,
E
=
tz1 1
J1 3
c21
az
J31
E33
.
between Xj and X1 given X3 takes the form
covtxl xcy/'xal ,
/?1 2
.
3
=
'
--.y
EVar(A-1/Xa)(1IEVar(.Yc,Xalj =
Ec11 -
c1 (!Ea-31a aa o'13E:'-3'/..51 lEcc2 - c'cata-a'
c 12 -
,a2(1-:
'
(15.26)
319
15.3 Quadraticforms with
*'
X ,''X3 XN 1 A' ,'X 3
3E 3-31X a
J1
3-31
X :9
J 2 3E
2
c
1
j
c 21
3E a-alTr :$1
J1
-
o.z 3E
-
1
J 1 3 E 3- 3 J 3 2 - 1
(T 1 2 -
-I J 3 2 o'z,
/23E33 c32
-
( 15.27)
15.3
Quadraticforms
(Q1)
Let X
to the normal distribution
related
Np, E), w/llrl E> p, X : n x 1, then
,v
p)'t
(j)
(X
(f)
XE - IX
-
1
-
(X
x
g2(n;
'v
z2(n)-
-p)
)-
chi-square;
chi-square;
non-central
These rtlsl/r,s depend crucially on E beinq a p'Ewt?rt? E > 0 tbere exists a ntpn-sfngur ptlsflt'l dejlnite matrix because matrix H, E HH' Z H- '(X pj N(0, 1,,), i.e. the Zis are independent and (X p)'E - 1(X pj j)'- 1 Z/. Similarly (). For r/yl MLE q y, 1p.
=
./lr
z=.
=
=
'v
-
./),-
/
T'
1
=
T
r
=>
Let X
(Q2)
-
1 N p, - E F
'w
1 '(/-p)
N(p, 1,,), then (X
(t)
-
X,
-p)'E--
T z..-g
)
p)'A(X
=
-
-
z2(n).
-
for A a p)
-
h'om N2
yyn-lrnt?r/-ft.'
(A'
A) matrix
=
zzttrA),'
'.w
'wzzttr
A; ), j= p'Ap. X/AX (ff) A). Note tr A re-krs to the if and only if A is idempotent (.c. 7jj). of A A (tr j)'=1 rmc.l matrix, l/l?n Let X Np, E), Y > 0 and A is a -*.2
=
=
(Q3)
.s.rnlm'r?-c'
x,
(X - #)'A(X
(f)
X'AX
if and
(Q#) X
()
zzttrA);
-p)
'v
zzttrA', J), J
x.
only if AE is ftcrnptprenr
Le
'v
Nlp, E),
E > 0,
and
p'Ap,.
=
(f.t?.AYA
(xz X1
X EEE
,
#
N
=
A). p1 ,
yg
x
E 1l
=
xaj xaa
for X1 : k x 1 the dlerence 1(x E(x- p)'E -
-
p)
-
(X1
-
p1)'E1-11(Xl -
JIjlll
x
X1 2
z2(n k). -
The multivariate
X
normal distribution
N(p, E), then
'v
matrices. AEB 0.
(./1
=
X'AX
A
pr
and
(./2
=
and B sjnlmetric and jleplpt?rtv7r X'BX ar? independent if and onlq' 4'/
=
(Q6)
Let X Np, 1,,), then A a syrn#npr/-t.' and idempotent matrix antl B a k x n l'nr/rrjx, then X'AX and BX are independent #' BA 0. Let X N(p, 1,,) and Z xN(0,1.) tben A and B symmetr. ./ip,-
x
=
(Q7)
./,-
x
'v
matrices flt?rn/ptpr???l
X'AX -
tr A
x
.
Z BZ g
F( t !- A t r B
.j
(j)
,
y' A p
=
,
.
tr 11 15.4
Estimation
Let X H (Xj X2, Xw)'be a random sample from N(p, E), i.e. Xf N(p, Y), 1, 'lL 2, X being l a T x n matrix. The likelihood function takes the ,
=
.
.
.
.
.
.
'w
,
,
form
= k(X)(27:)
ztdet
'-nT
E)-T'
2
1'
j
exp
-
i
) =
(Xt X'E- 1(.Y - p) t -
1
,
( 15.28) 1ogL(0; X)
=
c-
r n T lo2 27: T logtdet E) 1 (Xf-p)'EF --s 2 2 2t j --,-
-
=-'-
1(X,
-p).
( 15.29)
Since (Xf -/z)'E- 1(Xr -p)-
r
/'
)(Xr -&r)E-
1(X,
-&w)
+
T(X, g)'X - 1(Xw p) tr E - IA + T'(X.wp)/E - 1(X.w# =
-
Xw=
log L(0; X)=
c*
T -
--
1 T
X,,
A
=
r
r
logtdet E) -
j( (X, -&w)(x,
1
-&w)',
( j 5.3())
tr E- IA ( 15.3 1)
15.4 Estimation j log1-(p,'X) wx- (: v . JI ) . (p t?1og f.te, X) F E 1 A 0 = i =Y 2 t?E- 1 .()
=
=
--
l 1og fatp,' X) Lp t?/l'
'rx
=
-
-
.;.:y
1
j
=
F
(Xl -Xw)(Xl
,
-Xw), (15.33)
,
f?log L4:; X) . PE
j
(jj.ya)
,
..
F =
-
l
2
Y(I - : )E. ( 15.34)
Hence,
Xwand i are the
of p and E respectively.
MLE'S
Properties Looking at Xwand i we can see that they correspond directly to the MLE'S in the univariate case. It turns out that the analogy between the univariate and multivariate cases extends to the properties of Xwand i. ln order to discuss the small sample properties of Xwand i we need their distributions. Since Xwis a linear function of normally distributed random vectors it is itself normally distributed
1
Xw'vN /z, E T
.
The distribution of i is a direct generalisation of the chi-square distribution, the so-called Wishart distribution with F- 1 degrees of freedom (seeAppendix 24.1), i.e. Tk
'v
I#,(E, F-
1)
( 15.36)
From these we can deduce that .E'(Xw)p unbiased estimator of p and '(f) g(T'- 1)/FjE - biased estimator of E. S (1/(T- 1)j jt (Xt Xw) (Xf-Xw)' is an unbiased estimator of E. X / and i are independent and jointly sufjlcient for (!z,E). =
=
(2)
=
-
Useful distributions
Using the distribution (T- 1)S I4'(E, T- 1) the following results relating to the sample correlations can be derived (see Muirhead ( 1982)): 'v
Simple correlation
ri.i =
sij S Lsij?i-j,i,) siisjj ,
=
.
=
1, 2,
.
.
.,
n.
The multivariate
normal distribution
If
For M
Lrij?ijwhen
=
E
diagtcl
=
1,
.
.
.
,
c.),
-
2 logtdettsl/
,.v.
.;
zztntn-
1:.
(15.38)
Multiple correlation Rmw
=
ls sl zS 2-z a j 'i
( 15.39)
.
S11
Under R
k;
w-nj
=0,
h;-
.stu
.j,
w.u).
1-.!'i/
1
( 15.40)
In particular, 1
n'(.l'il.)= r
1
-
2(T-n)(n
Vart y #)=
,
(F c
-
1)
(15.41)
.
-
1)4F- 1)
The distribution of 11.2when R # 0 is rather complicated commonly use its asymptotic distribution:
v'..'-
1)(w -.R2)
x(0,4R2(1 -R2)2)
,v,
a
5
0..::Rl
and instead we
<
1.
(15.42)
On the other hand, under R
=0,
1).)
('r-
'w
z2(n -
1).
to R2 is the quantity
A closely related sample equivalent
#2F
=
'
X 1 X 2 (XIX 2
)--1X'2 x 1 l
(x1x1) t
(15.43)
,
(15.44)
x 1 X c: T x k The sampling distribution of 112was derived by Fisher (1928) but it is far variance complicated of and 1ts be direct interest. too to mean, however, are of some interest
x1 : T
,
.
testing and consdence
Hypothesis
15.5
4R2(1
Var(#2)
R2)2
-
=
T
1
-
+ O(T-
323
regions
2)
(15.46)
(see Muirhead (1982)).On the 0( ) notation (seeChapter 10). Hence, the mean of 112 increases as k increases and for R.2 '
=0
'(#2)
k
=
T- 1
+ O(T
-
2
).
(15.47)
Partial correlation 1
s 1 2 s 1 3 S 3- 3 s 3 2 ! s31)-(s22 :'S33 (s1l -s: -
P12
.
3
=
Under
pt 2.3
15.5
.-scass:s
-
-
1
( 15.48)
'
s32)-j
0.
=
(15.49)
Hypothesis testing and confidence regions
Hypothesis testing in the context of the multivariate nonnal distribution will form the backbone of testing in Part IV where the normal distribution plays a very important role. For expositional purposes let us consider an example of testing and confidence estimation in the context of the statistical model of Section 15.4. Consider the null hypothesis HoL p 0 against 1-11: p# 0 when E is unknown. Using the likelihood ratio test procedure with =
L0. x)
maX
,
Ti)-'lT'-n-
:)-'F'2(det
=
c*tdet
=
c*tdettt +
t) expt
-1.Fn),
f-(p; x) max 0 e (6,0
Fi/F-n-
zldet Xw&'w)-r
(15.50)
a
p e (.)
1)
expt -JTn),
( 15.5 1) we get
2(x)
det(TE)
=
dettt
+&w&r,)
vfz =
1
1+&J
1
.r.,z
--
=
j:
w
j ..).ggzjg-
-
j
,
(15.52) where
z12=
T
'r&ys-l&
F
s= T-1 t
(15.53)
324
The multiuriate
normal distribution
is the so-called Hotelling's statistic which can form the basis of the test, being a monotone function of 2(X). Indeed, T- n H1 n(F- 1)
(see Muirhead based on the t-j
(1982:.Using
rejection
T
x:
=
1)
n(T-
the Hotelling
H 2 > cu
rl). For
Q
Tp E
j
-
( 15.54)
p
statistic we can define a test
(15.55)
,
p= po against H3 : ## pv the
HvL
test statistic
T'(Xw JIp'S - 1(Xz -
(15.56)
-,:).
=
region for the latter
Using the acceptance Ca
n
-
fcze
H2
,
=
region
dF(n, Twhere a takesthe form =
F(n, T - n; J),
'v
F
x:
=
=
Important
n
o
-.
rl(T- 1)
we can define a ( 1 C(X)
-
H= %;ca
( 15.57)
,
conjldence region for p of the form
-a)
Jl: W(Xw p)'S - 1(Xw p) -
-
:$
(T-
1)n
(F-n)
ca
.
concepts
Mean and covariance of a random vector, multiple correlation, canonical normal density function, correlation', partial correlation, the multivariate marginal and conditional normal density functions, Wishart distribution.
Questions Explain the various correlation measures in the context of random vectors and compare the general formulae with the ones associated normal distribution. Comment with the multivariate on the
similarities. Discuss the relationship
between normality and linearity. Discussthe marginal and conditional distributions of a subvectorxj of normally random vector X distributed (X'1 X?2)'. a chiUnder what circumstances is the quadratic form (X #'A(X =
,
-p)
-
square distributed?
15.5
Hypothesis
testing and confidence regions
325
State the conditions under which the quadratic forms X'AX and X'BX will be independent. 6. State the conditions under which the ratio of the quadratic forms X'AX and Z'BZ will be F-distributed. Under which circumstances will the quadratic form X'AX and BX be independent? 8. Discuss the properties of the MLE'S X,,and i of p and E respectively in the context of the statistical model defined by the random sample X ESE X,,)' from Np, E). (X1 X2, ,
.
.
.
,
Additional references
Anderson
Mardia (1984),.
et al. (1979); Morrison
(1976),. Seber (1984).
Asymptotic test procedures
As discussed in Chapter 14, the main problem in hypothesis testing is to construct a test statistic z(X) whose distribution we know under both the null hypothesis Hv and the alternative HL and it does not depend on the unknown parameterts) 0. The first part of the problem, that of constructing z(X), can be handled relatively easily using the various methods discussed above (Neyman-pearson lemma, likelihood ratio) when certain conditions are satisfied. The second part of the problem, that of determining the distribution of z(X) under both Ho and H3 is much more difficult to solve' and often we have to resort to asymptotic theory. This amounts to deriving the asymptotic distribution of z(X) and using that to determine the rejection region C1 (or C() and the associated probabilities. For agiven sample size n, however, these will be as accurate as the asymptotic distlibution of z,,(X)is an accurate approximation of its finite sample distribution. Moreover,since the distribution of zp(X) fora given n is not known (otherwise we use that) we do not know how good' the approximation is. This suggests that when asymptotic results are used we should be aware of their limitations and the inaccuracies they can lead to (see Chapter 10). ,
16.1
Asymptotic properties
Consider the test defined by the rejection region c:
> c,, ) tx lz,,(X)I :
=
,
and whose power function is ,?4,(p) Prx s =
G),
0 iE (9.
Since the distribution of Tn(X)is not known we cannot determine cn or 326
16.1 Asymptotic
properties
J#(p). lf the asymptotic distribution of z,,(X)is available,however, we can use that instead to define cu from some fixed a and the asymptotic power jnction
zr(p) #r(x
G C)''),
=
0 6 0.
(16.3)
ln this sense we can think of (z,,(X), n > 1) as a sequence of test statistics defining a sequence of rejection regions (C,1,n > 1) with power functions ).#(p), n y: 1, 0 G(6)) and we can choose c,, accordingly to ensure that the sequence of tests have the same size if '1
max
,#,(p)
a
=
0 c 6o
for all n > 1.
( 16.4)
Note that limn-.. X(#)= In this context the various criteria for tests discussed above must be reformulated in terms of the asymptotic power function z:(:); see Bickel and Doksum (1977). .;:(:).
Dehnition 1 The sequence of tests for Ho.' 0 6 (i)o against 1'f1: 0 01 dejlned by c (C'1 n > 1) is said to be consistent of size f/ ,
c(#) max e e
and
=
(z
etl
7:(p) 1, 0 6 01 =
.
(16.6)
As in the case of estimation, consistency is a reasonable property but only a minimal property. In order to be able to make comparisons between various tests we need better approximations to the power than 1. With this in mind we define asymptotic unbiasedness. Dehnition 2
.4 sequence of tests as dehned above is said to be asymptotically unbiased of size a #' max
z:4:)
=
e (.:0 'l0)
tx<
<
1,
0 6 O1.
( 16.8)
Desnition 3
.,4sequence of tests as desned above is said to be uniformly most
Asymptotic
test procedures
power (UM#) of
size a
7(0)
max ('a
(/ ( 16.9)
=
ee
and
z0) >
for any
zr*(p),
0 (E (.l,
( 16.10)
,
size tz test with asymptotic
ptpwt!r jnction zr*(p). often interested in Iocal alternatives of the form tests we are
ln asymptotic
b#0
( 16.11)
in order to assess the power of the test around the null. When I
(p) opn) then -
11
v'',,(4-po)
x(b,I.(p)-
-
1)
X
for l the MLE of 0. ln this case we consider only local power and a test with greatest local power is called locally' uniformly most powerful. The Iikelihood ratio and related test procedures
16.2
ln this section three general test procedures which give Iise to asymptotically optimal tests will be considered', the Iikelihood ratio, Wrtzlf! and Lagrange multiplier test procedures. All three test procedures can be interpreted as utilising the information incorporated in the 1oglikelihood function in different but asymptotically equivalent ways. For expositional purposes the test procedures will be considered in the context of the simplest statistical model where *= (.(x; p), 0 (E 0) is the probability model; and (i) X EB (11, Xz, Ar,,)' is a random sample. (ii) The results can be easily generalised to the non-random sample case where I,,(p) O/n) as explained in Chapter 13 above in the context of maximum likelihood estimation. For most results the generalisation amounts to substituting I(p) for I.(p) and reinterpreting the results. .
.
.
,
=
(1)
Simple
null
hypothesis
Let the null hypothesis
be Ho: 0= 0o, 0 c (.) EEER''' against Sl
:
0+ 0v.
The likelihood ratio test The likelihood ratio test statistic discussed in Section 14.4 takes the form
2(x) =
x) 1-(p(); L(0., x) max e e til
=
f/pe', x) L(1,. x)
.
( 16.13)
16.2
Likelihood
ratio
329
and related test procedures
where I is the MLE of 0. ln cases where 2(x) or some monotonic function of it have a tractable distribution there is no need for asymptotic theory. Commonly, however, this is not the case and asymptotic theory is called for. The IOIJ test Under certain regularity conditions which include CRI-CR3 13) log L(0; x) can be expanded in a Taylor series at p= l Lo; 1og
p Iog z-(p;x)
l-t; x) +(4-p)
xl-log
+.:1- 0)' whereJp* p) < J- p! and - 10). Since (seeChapter p f-(p; x) 'ik log #--J
00
p2log 1-(p*; x) pp 0 ,
p-j
(
(seeChapter
:-j
p) +
-
0,(1),
( 16.14)
og 1) refers to asymptotically
negligible terms
( 16.15)
=0,
conditions
beingthe tirst order 1 p2 n 0 Pp
log.L(;
for the MLE, and
x) XI(p),
the above expansion can be simplified log L0,. x) 1og .L(1;x) +.l.n(; =
(16.16)
-
(seeSerfling (1980))to: p)'l(p)(I p) + o/ 1). -
(16.17)
This implies that, since 2glog L(1; x) - log l..t?(); x)(l, (16.18) - 2 log 2(x) -2 1og2(x) n(4 - pf))'I(#)(l - 0o) +o1). (16.19) it is known that under certain For the asymptotic properties of MLE'S regularity conditions =
=
x/nt
0o)
-
-
tz
x(0,I(p)-
Using this we can deduce Z-R
-2
-
log
1).
(16.20)
(seeproperty Q1, Chapter
(x):>,n(I-pp'I(p)(4
being a quadratic form in asymptotically
Ho -p(,)
-
15) that
z2(,,l),
normal random variables
(16.21) (r.v.'s).
330
test procedures
Asymptotic
Wald (1943),using the above approximation of log 2(x), proposed an alternative test statistic by replacing I(p) with 141): -2
Ho
Wz'=n(l-po)I()(J-po) given that
(iii)
I(J)
xz2(m),
(16.22)
P
14$. This is the so-called W/-J/Jstatistic.
-+
The Lagrange multiplier test
Rao (1947)using the asymptotic distribution of the score function of that ot l), i.e.
1
?
log f7tr V n -u
q(#)
=---s
L(0; x)
proposed the emcient score LM
1
=-
n
x
N(0, I(p))
1q(po)
q(#()'l(#() -
Hv
( 16.23)
multiplier)
(or Lagrange
(instead
test statistic
zztrj,
x
x
which is again a quadratic
form in asymptotically
normally
distributed
I-.V. S.
For all three test statistics (LR, l4zi.LA.flthe rejection
)x : !(x)> cz)
(71
=
region
takes the form
(16.25)
,
where /(x) stands forall three test statistics and the critical value ca is defined by dz2(r# a, x being the size of the test. Under local alternatives with a Pittman type drift of the folnn:
Jc
=
(t
/.fl : 0,, 0o + =
b
(16.26)
,
M
all three test statistics are asymptotically H3
l(x4
z2(m;J), J
'v
=
distributed as:
b'Itp()b,
( 16.27)
since
x'',;(J -
0v)
-
vt
-
#,,) + b
-
X
xtb,1(:0) - 1)
(16.28)
and qtpo) v7n
-x/ntl-ppltp()l
+0p(1)
-
xtbI(po), I(po)).
(16.29)
16.2 Likelihood
ratio and related test procedures
Fig. 16.1. The LR, W' and LM tests compared.
Hence, the power function for all three test statistics takes the form X
itol
dzztm.. J),
=
(16.30)
C?
and thus, LR, W' and LM are asymptotically equivalent in the sense that they have the same asymptotic properties. Fig. 16.1, due to Pagan (1982),shows the relationship between LR, ##rand LM in the case of a scalar 0. z,M=2
area
Lo
q(po),
p (- p())2
B
g,,=2
p
=q(po)2
E
A
area Atl
=
C
q),
D
LR= 2 area
L
A C
qo) dp.
=2
( 16.3 1)
(16,32)
( 16.33)
po
Note that al1 three test statistics can be interpreted as functions of the score function.
332
Asymptotic
test procedures
(2)
Composite
null
hypothesis
Consider the case where the Ho is composite, i.e. aga in st
H o : 0 i! 0. o
(p:R(#)
=
,
0. ()
Rr
0. i Rm
,
as well as practical to parametrise
It is both convenient
(1)0
H 1 : 0 (F0. j
=
.
(.)0 in the form
0, 0 c 0)
(16.34)
represents r non-linear equations, i.e. R(#) (R1(#), Rz0), Rr(#))'. In most situations in practice the parametrised form arises . naturally in the fonn of restrictions such as R1(p) 0305+ 0t, Rz(0) log ?l p2, Rz(0) p2j+ 0z 1, Rzj0) 0L etc. lf we define 0 to be the maximum likelihood estimator (MLE) of 0, i.e. J is the solution of (t?log L0; xlj/Jp= 0, then from =0
where R(#) .
.
=
,
=
=
-2pc,
=
-
x'V(l
0)
-
=
-
.N(0, 14:) -
'w
1),
(16.35)
and 1
t?
log L(0; x)
n'&
N(0, I(p)),
'v
z
we can deduce that Ho
xV(R(
R),I(p) - 1R:),
-R(p)) xN(,
since R(#) can be approximated R(p)
R(4) +Rp(p
=
where
by
) +0/1),
R(p)
Rp=
(i)
-
at 0=
( 16.37)
.
(0
The W'/Z/Jtest procedure
lf the null hypothesis Ho is true we expect the MLE 1,without imposing the restrictions, to be close to satisfying the restrictions, i.e.if Hv is true, R(l) 0. This implies that a natural measure for any departure from Hv should be p1 R(
-011
(16.39)
.
If this is different from zero it will be a good indication that Hv different from is false. The problem is to formalise the concept zero'. The obvious way to proceed is to construct a pivot based on 1jR(l)1jin order to enable us to turn this statement into a precise probabilistic statement. Ssignificantly'
tsignificantly
16.2
Likelihood ratio and related test procedures
333
ln constructing such a vot there are two basic problems to overdepends on the units of measurement and come. The first is that jjR(#)Jj the second is that absolute values are not easy to manipulate. A quantity both problems is the quadratic form which 'solves'
- 'R(;),
R('Ev(R()1
(16.40)
where V(R()) represents the covariance of R(l). Determining V(R()) can be a very difficult task since we often know very little about the distribution of Asymptotically, however, we know the distribution of R() and . v(R(l))
hencewe
'
(t6.41)
R;I(p) - R:,
=
can deduce that
IRPI iR()
nR('gRI(#)-
Ho
-
Wald's suggestion estimator, i.e.
z2(r).
(16.42)
'X
to replacing
amounts
1'V'=nR()'gRI(l)-
'w
Ho
'RI;II-1R()
'v
X
V(R()
with
a consistent
z2(r).
Note that the Wald procedure can be used in conjunction asymptotically normal estimator 0* (not just MLE's) since if
w''n0
*
-
#)
'.-
N (0 E : ) ,
W'= nR(#*)'gRipRj1
with any
(16 44)
,
.
Ho
- 'R(#*)
'w
:'
(ii)
( 16.43)
z2(r).
(16.45)
Laqranqe multiplier test procedure
In contrast to the Wald test procedure the Lagrange multiplier procedure is based solely on the restricted MLE of 0, say #. Although the Lagrange multiplier test statistic can take various equivalent forms we consider only two such forms in what follows. Estimation of 0 subject to the restrictions R(p)= 0 is based on the optimisation of the Lagrangian function
,.l? 1og L(#., x) =
-
R(p)p,
(16.46)
where p: r x 1 vector of multipliers. The restricted MLE of 0 is defined to be the solution of the system of equations:
log1z(:
x)- Rp=0,
(16.47)
Asymptotic
test procedures
R(#) 0.
(16.48 )
=
In the case of the Wald procedure we began our search for an asymptotic pivot using R(l) which should be close to zero when Hv is true. ln the bydelinition and thus it cannot be used. But. present case, however, R(#) =0
the score function evaluated at 0=
although in the Wald procedure zero, i.e. 91og L(f. x)
is
(16.49)
= 0,
this is not the case for (t7log L(; xlj/'Pp and we can use it to construct an asymptotic pivot. Equivalently, the Lagrange multipliers pfe can be used instead. The intuition underlying the use of g(#) is that these multipliers can be interpreted as shadow prices for the constraints and should register a11 d epar tures from Ho; if #is closed to'd #'#)is small and vice versa. Hence, a reasonable thing to do is to consider the quantity Using the same argument as in the Wald procedure for jR(J) we set up the quadratic form
1-)
-0(
-01.
1#(#).
/z(l1'EV(/#l))I -
(16.50)
Using the fact that
1t? log L(,. t?p -n
x)
N(0,
'v
l(p)),
(16.5 1)
a
we can deduce that 1
(p()
p(04)
N0, gR;l(p) - j R?q
'v
''''''''''''''tiklih',------,---' -
-)--
-
j
( 16.52)
).
Hence, 1
g(#)'(R;l(p)- 'RJIZI-I
-11
Ho -
u
z2(r).
( 16.53)
The Lagrange multiplier test statistic takes the form Z-M
1 =
-
Hv
#(#)'gR;-l(#)- 1RJg(#) -
X
z2(r),
(16.54)
or, equivalently, LM
=
-
1
n
'
log L(. x)
which is the efjlcient score form.
lo -
1
p '0
log L(. x)
,
(16.55)
16.2 Likelihood
ratio test statistic takes the form
The likelihood
Ho
LR
2(1ogLk,' x)
=
1og Ll; x))
-
IVO:
LM
:>:
n(l
'v
(:
#-)'1(#)(1 -
-
z2(r).
( 16.56)
we can show that
Using the Taylor series expansions
LR :>:
335
ratio and related test procedures
#).
Thus, although a1l three test statistics are based on three different asymptotic pivots, as n a:i the test statistics become equivalent. Al1three tests share the same asymptotic properties; they are a11consistent as well as asymptotically locally' UMP against local alternatives of the form considered above. In the absence of any information relating to higherorder approximations of the distribution of these test statistics under both Ho and SI the choice between them is based on computational convenience. The Wald test statistic is constructed in terms of J the unrestricted MLE of 0, the Lagrange multiplier in tcrms of # the restricted MLE of 0 and the likelihood ratio in terms of both. ln order to be able to discriminate between the above three tests we need to derive higher-order approximations such as Edgeworth approximations (see Chapter 10). Rothenberg (1984)gives an excellent discussion of various ways to derive such higher-order approximations. Of particular interest in practice is the case where 0- (p1 0z) and f'fo: 0L 0 against ffj : 03# o, pl : r x 1 with pc: (rn- r) x 1left unrestricted. In this case the three test statistics take the form -+
,
=
.f.-(4,-
LR
(16.58)
x)),
x) -log
L,
-2(log
-
W'= rl(11 -001)/2-11 -I1a(l)I.,2%1(I2)(;)(()(lj ...-.#0:), 1(1)
LM
1g(#) EI11(-) ,
--
-I1c(#)I22
-
n
1
ljyj-),
(yj21 tj.)g -
where
-(11, 2), #= o, #2), This is because for R(p)
=
0L
0
-
1(*)
I(p)
=
11c(*) 11 121(#) 1224#)
R;
,
=
pb
=
(16.59) (16.60)
x) t7log z-(p,' (o, .-4.
.
,
(1,:0)
and hence R;I(#)(#) - 1R:
=
1(:) E11
-
I 1 a(p)I2-21(p)Ia1(#)q-
For further discussion of the above asymptotic survey by Engle (1984).
t
.
test procedures
(16.6 1) see the
Asymptotic
test procedures
Important
concepts
power function, consistent test, asymptotically unbiased test. asymptotically uniformly most powerful test, locally uniformly most powerful test, Wald test statistic, Lagrange multiplier (efficientscore) test Asymptotic
statistic.
Questions Why do we need asymptotic theory in hypothesis testing'? Explain the concept of an asymptotic power function and use it to define consistency, asymptotic unbiasedness and UMP in testing. What do we mean by a size x test in this context? Compare the LR, 1#' and LM tests in the case of a simple null hypothesis (draw diagrams if it helps). Explain the common-sense logic underlying the LR, 14' and LM test procedures in the case of a composite null hypothesis. 6. Discuss the similarities and differences between the LR, Hzrand LM test procedures. Explain the circumstances under which you would use these asymptotic test procedures in preference to the test procedures discussed in Chapter 14. 8. Explain the derivation of the 145and LM test statistics in the case of HvL p1 03 against HL : 0L 0. p! being a subset of parameter vector 0= (p1,0z), considered above. 9. Verify the form of the Wald and Lagrange multiplier test statistics for 0z) using the partitioned matrix Ho #1 0 against Sl : pl # 0, pHtpj inversion rule :#
=
=
,
- 1
A 11
M12
A2l
A2c
A
(A1 1 A1 CA2-3IAC 1) Az-l A21(Al l - A1 cA2-2'A21) -
=
-
1
l
A1-/Aj 24A22 A2 j A1-/ A1 c) 1 (Ac2 A21 A1-/ A1c) -
1 .
-
Additional references Aitchison
and Silvey ( 1958); Buse
Silvey (1959).
( 1982),. Engle (1984);
Moran
(1970);Rao (1973);
PART
IV
The Iinear regression and related statistical
models
17
CHAPTER
Statistical models in econometrics
17.1
Simple statistical models
The main purpose of Parts 11and IIl has been to formulate and discuss the concept of a statistical model which will form the backbone of the discussion in Part 1V. A statistical model has been defined as made up of two related components: (i) a probability model, *= .tD(y; 04, 0 6: (.))t specifying a parametric family of densities indexed by 0', and )'w)' defining a sample from a sampling model, y (y':, D()?; p()),for some true' 0v in (4. The probability model provides the framework in the context of which the stochastic environment of the real phenomenon being studied can be defined and the sampling model describes the relationship between the probability model and the observable data. By postulating a statistical model we transform the uncertainty relating to the mechanism giving rise to the observed data to uncertainty relating to some unknown parameterts) 0 whose estimation determines the stochastic mechanism D(y', 0). An example of such a statistical model in econometrics is provided by the modelling of the distribution of personal income. ln studying the distribution of personal income higher than a lower limit y() the following statistical model is often postulated: .y,2,
=
*
=
D()/ yo ; 0)
y2, y EEF(-9?1,
.
.
.
=
,
P '* 1
P
--
Vo ,
.
J.7
yw)?is a random
sample from Dly/yo; 0).
!' The notation in Part IV will be somewhat different from the one used in Parts 11and 111.This change in notation has been made to conform with the established econometric notation. 339
Statistical models in econometrics
Note:
if 0 >2. For y a random
sample the likelihood
v
L0; y)
p
=
t
>'0
1
=
-
function is
p+ l
.
V0
wp 0w),() (y 1
=
).'f
,
.J,,,2 ,
.
.
.
,
)?w)-(p
+ 1) ,
r
log L4t?;y)
T log 0 + T? log yo -((? + 1)
=
d log L T dp =-0
=>
iis the maximum (d2 log L) dp2 Chapter 13): =
V
-
+ T
log )'()-
log
T t
log yr
=
)
l=1
log yf
,
0,
, -
L
1
.J.'()
likelihood estimator (MLE) of the parameterp. Since F/p2, the asymptotic distribution Of P takes the form (see
./T((4- 0)
x(0-p2).
Q
Although in general the finite sample distribution is not frequently available, in this particular case we can derive D() analytically. It takes the form D)
pw-lww-l =
.j.
F( F
1).J,,
-
exp
y.p -
--
,
y
i.e.
awp --)--
'v
z2(2F)
(see Appendix 6.1). This distribution of can be used to consider the finite sample properties of as well as test hypotheses or set up condence intervals for the unknown parameter 0. For instance, in view of the fact that 'l
E)
T =
T
-
2
0
we can deduce that is a biased estimator of p. It is of interest in this particular case to assess the accuracy' of the asymptotic distribution of for a small 'T:(W= 8), by noting that
vart
=
w2p2 (T- 2)a (F- 3)
17.1 Simple statistical
34 1
models
(see Johnson and Kotz ( 1970:. Using the data on income distribution 5000 (reproducedbelow) to estimate 0, Chapter 2), for
(see
.vy
6000
5000
we get
=
#1
log
T' l
..
8000
7000
-
12000
15 000
20 ()(X)
1
P()
as the ML estimate. Using the invariance property of MLE's (seeSecton 13.3) we can deduce that 'tarl)
'()
2. 13,
=
0.9 1.
=
As we can see, for a small sample (T= 8) the estimate of the mean and the variance are considerably larger than the ones given by the asymptotic
distribution:
'll= a
1.6,
42
vartl=F
=0.32.
a
On the other hand, for a much larger sample, say T= 100,
'()
=
1.63,
Varl) 0.028, =
with
as compared
4 -
1.6,
kart
=0.026.
These results exemplify the danger of using asymptotic results for small samples and should be viewed as a warning against uncritical use of asymptotie theory. For a more general discussion of asymptotic theory and how to improve upon the asymptotic results see Chapter 10. The statistical inference results derived above in relation to the income of the distribution example depend crucially on the appropriateness should represent a statistical model postulated. That is, the statistical model good approximation of the real phenomenon to be explained in a way which takes account the nature of the available data. For example, if the data were collected using stratified sampling then the random sample
assumption is inappropriate
(seeSection 17.2 belowj. When any of the
Statistical models in
onometrics
assumptions underlying the statistical model are invalid the above estimation results are unwarranted. In the next three sections it is argued that for the purposes of econometric modelling we need to extend the simple statistical model based on a random sample, illustrated above, in certain specific directions as required by the particular features of econometric modelling. In Section 17.2 we consider the nature of economic data commonly available and discuss its implications for the form of the sampling model. lt is argued that for most forms of economic data the random sample assumption is inappropriate, Section 17.3 considers the question of constructing probability models if the identically distributed assumption does not hold. The concept of a statistical generating mechanism (GM) is introduced in Section 17.4 in order to supplement the probability and sampling models. This additional certain specific features of component enables us to accommodate econometric modelling. ln Section 17.5 the main statistical models of interest in econometrics are summarised as a prelude to the discussion which follows. 17.2
Fxonomic data and the sampling model
Economic data are usually non-experimental in nature and come in one of three forms: (i) time series, measuring a particular variable at successive points in time (annual, quarterly, monthly or weekly); cross-section, measuring a particular variable at a given point in time over different units (persons, households, firms, industries, countries, etc.); panel data, which refer to cross-section data over time. (iii) Economic data such as M1 money stock (M), real consumers' expenditure (F) and its implicit deflator (P), interest rate on 7 days' deposit account (f), over time, are examples of time-series data (seeAppendix, Table 17.2). The income data used in Chapter 2 are cross-section data on 23 000 households in the UK for 1979-80. Using the same 23 000 households of the crosssection observed over time we could generate panel data on income. In practice, panel data are rather rare in econometrics because of the difficulties involved in gathering such data. For a thorough discussion of econometric modelling using panel data see Chamberlain ( 1984). The econometric modeller is rarely involved directly witll the data collection and refinement and often has to use published data knowing very little about their origins. This lack of knowledge can have serious repercussions on the modelling process and lead to misleading conclusions. Ignorance related to how the data were collected can lead to an erroneous
17.2 Economic data
and the sampling model
343
choice of an appropriate sampling model. Moreover, if the choice of the data is based only on the name they carry and not on intimate knowledge about what exactly they are measuring, it can lead to an inappropriate choice of the statistical GM (see Section 17.4, below) and some misleading conclusions about the relationship between the estimated econometric model and thc theoretical model as suggested by economic theory (see Chapter 1). Let us consider the relationship between the nature of the data and the sampling model in some more detail. In Chapter 11 we discussed three basic forms of a sampling model: random sample (i) - a set of independent and identically distributed (ilD) random variables (r.v.'s),' independent sample a set of independent but not identically and r.v.'s; distributed non-random sample - a set of non-llD r.v.'s. (iii) For cross-section data selected by the simple random sampling method (where every unit in the target population has the same probability of being selected), the sampling model of a random sample seems the most appropriate choice. On the other hand, for cross-section data selected by the stratified sampling method (the target population divided into a number of groups (strada)with every unit in each group having the same probability of being selected), the identically distributed assumption seems rather inappropriate. The fact that the groups are chosen a priori in some systematic way renders the identically distributed assumption inappropriate. For such cross-section data the sampling model of an independent sample seems more appropriate. The independence assumption can be justified if sampling within and between groups is
(ii)
-
random. For time-series data the sampling models of a random or an independent sample seem rather unrealistic on a priori grounds, leaving the non-random sample as the most likely sampling model to postulate at the outset. For the time-series data plotted against time in Fig. 17. 1@)-(J)the assumption that they represent realisations of stochastic processes (seeChapter 8) seems more realistic than their being realisations of llD r.v.'s. The plotted series exhibit considerable time dependence. This is confirmed in Chapter 23 where these series are used to estimate a money adjustment equation. ln Chapters 19-22 the sampling model of an independent sample is intentionally maintained for the example which involves these data series and several misleading conclusions are noted throughout. In order to be able to take cxplicitly into consideration the nature of the observed data chosen in the context of onometric modelling, the statistical models of particular interest in econometrics will be specified in terms of the observable r.v.'s giving rise to the data rather than the error term, the usual
Statistical models in econometrics
35000
;
25000
f X 15000
5000
1963
1966
1969
1972 Time
1975
1978
1982
1966
1969
1972 Time
1975
1978
1982
consumerf
expenditure.
(a)
18o '!' 160 J. 14000
12000
1963
Fig. 17,1(5). Money stock ftmillion).
(/?)Real
approach in econometrics textbooks (see Theil ( 1971), Maddala ( 1977), Judge et aI. ( 1982) inter alia). The approach adopted in the present book is to extend the statistical models considered so far in Part 1II in order to modelling. In accommodate certain specific features of econometric particular a third component, called a statistical generating mechanism (GM) will be added to the probability and sampling models in order to enable us to summarise the information involved in a way which provides
17.2
Economic data and the sampling model
240
200
160 @ Q.
120
80
40 1963
1966
1969
1972 Time
1975
1966
1969
1972
1975
1978
1982
1978
1982
15
12
9 w*
6
3
() 1963
'
T i me
Fig. 17. 1(c). lmplicit
price detlator. (J) lnterest rate on 7 days' deposit
account.
'an adequate' approximation to the actual DGP giving rise to the observed will be considered data (see Chapter 1). This additional eomponent section 17.4 ln next the nature of the the below. extensively in Section modelling will econometric required models be discussed in in probability model. sampling of the view of the above discussion
346
Statistical models in econometrics Economic data and the probability model
In Chapter 1 it was argued that the specification of statistical models should take account not only of the theoretical a priori information available but the nature of the observed data chosen as well. This is because the specification of statistical models proposed in the present book is based on the observable random variable giving rise to the observed data and not by attaching a white-noise error term to the theoretical model. This strategy such as implies that the modeller should consider assumptions independence, stationarity, mixing (see Chapter 8) in relation to the observed data at the outset. As argued in Section 17.2, the sampling model of a random sample seems rather unrealistic for most situations in econometric modelling in view of the economic data usually available. Because of the interrelationship between the sampling and the probability model we need to extend the simple probability model (I) tD( 1,,. 0), 0 c (4) associated with a random sample to ones related to independent and non-random samples. An independent (but non-idcntically distributed) sample y /5 (y'1, ywl' raises questions of time-heterogeneity in the context of the corresponding probability model. This is because in general every element y, of y has its own distribution with different parameters Dt),f; pf). The parameters ot which depend on t are called incidental parameters. A probability model related to y takes the general form =
.
*= (D(y'2; 0t),
ot6 0,
t CETl.,
.
.
,
(17.1)
.)
where T ) 1, 2, is an index set. A non-random sample y raises questions not only of time-heterogeneity but of time-dependenceas well. In this case we need thejoint distribution of y in order to define an appropriate probability model of the general form =
*
.
=
.
/t.D().'1
.p2,
,
.
.
.
,
A'w;#r),
ove' 0,
T1
=
(1, 2,
.
.
.
,
F)
T)
.
ln both of the above cases the observed data can be viewed as realisations of the stochastic process (yf,t e: T) and for modelling purposes we need to restrict its generality using assumptions such as normality, stationarity and asymptotic independence or/and supplement the sample and theoretical information available. In order to illustrate these let us consider the simplest case of an independent sample and one incidental parameter: (i)
*
=
D(y2,. 0 t )
1
=
c
exp
(2zr)
1 -- 2
l
,
-pt
2 ,
c
)'vl' is an independent sample from D(yI; 'T;respectively.
y EEE()71,ya, .
.
.
,
347
model
Data and the probability
17.3
.
.
0t), t
,
.
1, 2,
=
The probability model postulates a nonnal density with mean pt (an incidental paramete and variance c2. The sampling model allows each y; to have a different mean but the same variance and to be independent of the The distribution of the sample for the above statistical model other ns. and 0 n 1, yz, D(y: p) where y H (.v1,yz, pv, c2) is .ywl'
.
.
.
.
,
.
.
,
T
Dlyt', g,, c2) oty,04 f17 1 -
=
T
(c2)-F/2(2a)-F/2
=
exp
1 2c a f j
-
-jtf)2
(.yf
j
-
.
gw), As we can see, there are T+ 1 unknown parameters, #= (c2,jtj, yz, sufficient provide with which observations only estimated and T us to be warning that there will be problems. This is indeed confirmed by the maximum likelihood (ML) method. The log likelihood is .
.
log L(p', y)
=
log L t'q/z, =
,log L f,c2 =
const
T
--j-
1 ( 2c a
-
T -
2c z
log c2 2)(.J4
-
1
+
2c
2c a
pt)
-
T
1 -
=
.
-/tl)2,
0,
(17.4)
(y:
I
,-
r
1, 2,
=
(17.5)
-6 .
.
,
.
-/zf)2=0.
Z(.pt
.
,
( 17.6)
,
These first-order conditions imply that lt yt, Before we rush into pronouncing these as MLE'S the second-order conditions for a maximum. =
:2 1og L aA
1= 1, 2,
.
.
,
'C and 82
=0.
it is important to look at 1
T ;
.
= 2c 4.
#
-
c
6
z
Z(A'-/4)
'
,
t
-;
1
c=
which are unbounded and hence lt and :2 are not MLE's; see Section 13.3. This suggests that there is not enough information in the statistical model (i)-(ii) above to estimate the statistical parameters 0= (Jz1,pz, pw', c2). An obvious way to supplement this information is in the form of panel T: In the case where N N, t= 1, 2, data for yt, say y, i 1, 2, realisations of y, are available at each r, 0 could be estimated by .
=
1.
pt N =
--
i
.
.
.
.
,
xv
Z A', -
1
r= 1,2,
.
.
.
,
T
.
.
,
.
.
,
Statistical models in econometrics
and
:2
=
w
1 T
x
j jy (yy ptll. -
t
=
1 i
1
=
of 0. lt can be verified that these are indeed the MLE'S An alternative way to supplement the information of the statistical model (ijvii) is to reduce the dimensionality of the statistical parameter space 0. This can be achieved by imposing restrictions on 0 or modelling p by relating it to other observable random variables (r.v.'s)via conditioning (see Chapter 7). Note that non-stochastic variables are viewed as degenerate r.v.'s. The latter procedure enables us to accommodate theoretical information within a probability model by relating such information to the statistical parameters 0. ln particular, such information is related to the meap (marginalor conditional) of r.v.'s involved and sometimes to the variance. Theoretical information is rarely related to higher-order moments (seeChapter 4). The modelling of statistical parameters via conditioning leads naturally to an additional component to supplement the probability and sampling models. This additional component we call a statistical generating mechanism (GM) for reasons which will become apparent in the discussion which follows. At this stage it suffices to say that the statistical GM is postulated as a crude approximation to the actual DGP which gave rise to the observed data in question, taking account of the nature 4 such data as well as theoretical a priori information. In the case of the statistical model (i)-(ii) above we could solve' the inadequate information problem by relating pt to a vector of observable 1, 2, variables xlj, xcf, 'C say, linearly, to postulate xkf l .
.
,
.
=
,
.
,
.
,
pt b'xf,
( 17.9)
=
bkl', k < Ti is a vector of unknown parameters. By where b > (4j, b2, postulating this relationship we reduce the parameter space from (.) > RFX R + and increasing with T to (i)cp> R x (Ji and independent of The statistical GM in this case takes the general form .
.
.
,
'.r
+
yt
=b'x
t
+ ut,
where pt b'xf and ut are called systematic and non-systematic of respectively. construction By components ),'t, -b'x,
=
=
Eytut)
=
.y,
0 and
Eutj
=
0,
Elutl)
=
c2,
Eutus)
=
0, f #s.
where E( ) is defined relative to Dtyf,' 0), the marginal distribution of rf Equation ( 10) represents a situation where the choice of the values xlf, xal, '
.
17.4
The statistical
generating
mechanism
349
xkt determines the systematic part of yr with the unmodelled part ut . being a white-noise process (seeChapter 8). This is the statistical GM of the .
.
,
Gauss linear model (see Chapter 18). The above statistical GM will be extended in the next section in order to deline some of the most widely used statistical models in econometrics.
17.4
The statistical
generating mhanism
The concept of a statistical GM is postulated to supplement the probability sampling models and represents a crude approximation to the actual DGP which gave rise to the available data. It represents a summarisation of the sample information in a way which enables us to accommodate any a priori information related to the actual DGP as suggested by economic theory (see Chapter 1). Let (y, l (E T) be a stochastic process defined on (., i P( (seeChapter 8). The statistical GM is defined by
and
.))
3'f pt + =
where
ut
(17. 11)
,
#f f(#t/V),
(17. 12)
=
L being some c-field. This defines the statistical process generating y, with pf being the postulated systematic mechanism giving rise to the observed and ut the non-systematic data on part of yr defined by ut y'f Defining ut this way ensures that it is orthogonal to the systematic component /t,; denoted by Jtf-l-l/? (see Chapter 7). The orthogonality condition is needed for the logical consistency of the statistical GM in view of the fact that ut represents the part of yf left unexplained by the choice of pt. The terms systematic, non-systematic and orthogonality are formalised in terms of the underlying probability and sampling models defining the statistical model. lt must be emphasised at the outset that the terms systematic and nonsystematic are relative to the information set as defined by the underlying probability and sampling models as well as to any a priori information related to the statistical parameters of interest 0. This information is incorporated in the definition of the systematic component and the remaining part of yt we call non-systematic or error. Hence, the nature of ut depends crucially on how pf is defined and incorporates the unmodelled part of #,1.This definition of the error term differs significantly from the usual use of the term in econometrics as either errors-in-equation or errors of The of the in the book concept present measurement. use comes much tnoise' used in engineering and control literatures (see closer to the term .vf
=
-pt.
350
Statistical models in econometrics
Klman
(1982:.Our
aim in postulating a statistical GM is to minimise thc non-systematic component ut by making the most of the systematic information in defining the systematic component pt. For more discussion on the error term see Hendry (1983). Let t c T) be a k x 1 vector stochastic process defined on (S, .@I P )) which represents the observable random variables involved. Let y, be the random variable whose behaviour is of interest, where
tzt, Zt
'
.l Xt
H
For a conditioning
.
information set
%
the systematic component of yr can be defined by pt
=
E (#l/V't )
(17 13)
,
.
where % is some sub-c-tield of The non-systematic unmodelled of given the part represents y, pt, i.e. .@
l,
=
.p, -
component
.E(.h/f4),
These two components
ut
(17.14)
give rise to the general statistical GM
(17.16) ( 17. 17) using the properties of conditional expectation (see Chapter 7). It is important to note at this stage that the expectation operator E( ) in (16)and (17) is defined relative to the probability distribution of the underlying probability model. By changing Lh (andthe related probability model) we can define some of the most important statistical models of interest in econometrics. Let us consider some of these special cases. Assuming that .tZf, l q T) is a normal 11D stochastic process and (a) choosing % (Xt xf ),a degenerate c-field, (15)takes the special form '
=
yf
=
p'xt+ ut,
=
t 6 T,
(17.18)
where the underlying probability model is based on D(yf,/X,; 04. This dcfines the linear regression model (seeChapter 19). l (E T) is a normal IlD stochastic process and Assuming that .tzt,
choosing
.t
.f-
t =
=
c(Xf), (15)becomes
#'X +
l&,
(17.19)
generating mechanism
The statistical
17.4
with D(Z,; #) being the distribution dening the probability model. ( 19) represents the statistical GM of the stocbastic Iinear reqression model (seeChapter 20). Assuming that (Zt, t g T) is a normal stationary lth-order Markov c-field to be % c(yt0- j, process and choosing the appropriate XO, Xy (X, f, i 0, 1,2, x,0), y)'- : (y, i 1,2, (15) takes the form =
.),
.),
-
=
=
=
-
-f,
.
.
I
+ A', #k)x, =
(d)
Z (.af1', + /ixf -i
i
=
-f)
.
.
(17.20)
+ uf,
1
where the underlying probability model is based on Dytjyh 1 ,Xy; 004.This defines the statistical GM of the JpntzrnfcIinear regression model (seeChapter 23). Assuming that (Zr, t G T) is a normal llD stochastic process and yt is an m x 1 subvector of Z, the c-field t c(Xj xt) reduces (15)to .9).
=
=
(17.21)
yl B'xf + u2, =
with Dtyf X,; p*) the distribution defining the underlying probability model. This is the statistical GM of the rrltllritwritzlt? linear regression model (seeChapter 24). An important feature of any statistical GM is the set of parameters defining it. These parameters are called the statistical parameters of interest. For instance, in the case of (18)and (19)the statistical parameters of interest cl c11 cj aEallaj These are functions of the are 0- (#,c2), #= Xz-zbaz, assumed to be D(Zf;#) of parameters =
z,
(xX,)
.
-
N
-
.
) (t'..; ((0:)
s&'a2a))
( 17.22)
(see Chapter 15). In practice the statistical parameters of interest 0 might not coincide with the theoretical parameters ofinterest, say (. ln such a case we need to relate the two sets of parameters in such a way that the latter are uniquely determined by the former. That is, there exists a mapping (
=
(17.23)
H($,
which detine ( uniquely This situation for example arises in the case of the yrrlulrtlnptlusequations model where the statistical parameters of interest are the parameters defining (21) but the theoretical parameters are different (see Chapter 25). In such a case the statistical GM is reparametrised/restricted in an attempt to define it in tenns of the theoretical parameters of interest. restricted statistical GM is said to be an econometric The reparametrised model. .
Statistical models in econometrics It must be stressed that the statistical GM postulated depends crucially on the information set chosen at the outset and it is well defined within such a context. When the information set is changed the statistical GM should be respecified to take account of the change. This implies that in econometric modelling we have to decide on the information set within which the specification of the statistical model will take place. This is one of the reasons why the statistical model is defined directly in terms of the random variables giving rise to the available observed data chosen and not in terms of the error term. The relevant information underlying the specification of the statistical GM comes in three forms: (i) theoretical information; sample information; and (ii) (iii) measurement information. ln terms of Fig. 1.2 the theoretical information relates to the choice of the observed data series (andhence of Zf) and the form of the estimable model. The sample information relates to the probabilistic structure of l 6: T) of and and the measurement information to the measurement system Zt any exact relationships among the observed data chosen (see Chapter 26 for further discussion). Any theoretical information which can be tested as restrictions on p is not imposed a priori in order to be able to test it. An important implication of this is that the statistical GM is not restricted a priori to coincide with either the theoretical or estimable model apart from a white-noise term at the outset. Moreover, before any theoretical meaning is attached to the statistical GM we need to ensure that the latter is first assumptions defining the the underlying well-dqjlned statistically; statistical model are indeed valid for the data chosen. Testing the underlying assumptions is the task of misspectjlcation testing (seeChapters 20-22). When these assumptions are tested and their validity established we in order to derive a can proceed with the reparametrisationjrestl-iction theoretically meaningful GM, the empirical econometric model (seeFig. .ftzf,
1.2).
17.5
Looking ahead
As a prelude to the extensive discussion of the linear regression model and related statistical models of interest in econometrics let us summarise these in Table 17.1. ln the chapters which follow the statistical analysis (specification, misspecification, estimation and testing) of the above statistical models will be considered in some detail. ln Chapter 18 the linear model is briefly considered in its simplest form k 2) in an attempt to motivate the linear regression model considered extensively in Chapters 18-22. The main =
(:)x
*
o
(:)x
+
rt
O
1I! Y
Q
'
w-
a
*
*=
II
-
''7
=
'
cp>>-
.-.
Q
-M
'
X
.-.
.
o
*'
J
=
Q
Z'-u A
Il Q
x:
.
41
ga
ea
o
=6
-
>'
...
d
w.
d :
;=
*
-* G o ew .=. (7 ' + : m
=
.0
=
(:)x
o
*
+ u. .-k o + <
8
:
V A
o
=
=
AN
Q
a-
+
+
#=
>/
I
lD
tp
t
-
-
w
r ..
hw
t '-d
Ux1!1
.
.
7:-
1* + rm..-x 8.. cw W Il Il
+
-.
+
,.>
rq
k)
w
X
n
M
W
<
o ,. R
C1
G
<
<
<
w
11 I1
11 >'
;wx
.
:>.
-
X
o
%
>h=
. .
'-
1I1 =
Q
'U o
-6
r o .a m CJ V-
x d = = :: o 1) .a = -
=
*' tv 1) =
o
Q
. .
.q;!
ce
.=
2 l =
x o =c:
-
o
w.
:
:
g
x 1)
Q =
Q
ct
x
r w. >.
Q
lz
o
o
g
O
m
D C)
;;
* .
D =
=
.-.
$7
.
*v:
.
G
'i;
=
Y O
=
; o .
'-
=u '
r
y
o
a m
v.
xo
w.
iz
'z
=
.z
o
z : O
-
354
Statistical models in econometrics
reason for the extensive discussion of the linear regression model is that this statistical model forms the backbone of Part lV. ln Chapter 19 the estimation, specication testing and prediction in the context of the linear regression model are discussed. Departures from thc assumptions (misspecification) underlying the linear regression model are discussed in Chapters 20-22. Chapter 23 considers the dynamic linear regression model which is by far the most widely used statistical model in econometric modelling. This statistical model is viewed as a natural extension of the linear regression model in the case where the non-random sample is the appropriate sampling model. In Chapter 24 the multivariate linear regression model is discussed as a direct cxtension of the linear regression model. The simultaneous equation model viewed as a reparametrisation of the multivariate linear regression model is discussed in Chapter 25. In Chapter 26 the methodological discussion sketched in Chapter 1 is considered more extensively. Important concepts Time-selies,
cross-section and panel data, simple random sampling, sampling, stratied incidental parameters, statistical generating mechanism, systematic and non-systematic statistical components, of of interest, theoretical parameters interest, reparametrisaparameters tion/restriction.
Questions Explain why for most forms of economic data the notion of a random
4.
sample is inappropriate. Explain the concept of a statistical GM and its role in the statistical model specication. Explain the concepts the systematic and non-systematic components. Discuss the type of information relevant for the specification of a statistical GM.
Appendix 17.1
Appendix 17.1 adjusted data on rnt?rll),' stock M1 (M), real consumers' expenditure (F), its implicit price dejlator P) and interest rate on 7 days' deposit account (1)for the period 19631-19821v. source: Economic Trends, ptrlnurl/ Supplemenb, J9#J, CSO)
Table 17.2.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
Quarterlyseasonally
M
F
#
1
6740.0 6870.0 6990.0 7210.3
12 086.0 12 446.0 12 575.0 12 618.0 12 691.0
0.402 53 0.403 26 0.405 8 1 0.408 15 0.412 58 0.416 36 0.42 1 27 0.426 98 0.432 98 0.437 8 1 0.442 38 0.446 37 0,449 94 0.454 97 0.459 72 0.465 36 0.465 33 0.467 36 0.470 42 0.474 35 0.477 82 0.489 52 0.497 14 0.50 1 10 0.51 103 0.516 94 0.520 83 0.527 46 0.534 99 0.544 82 0.552 99 0.565 33 0.576 53 0.592 99 0.603 48 0.610 49 0.6 15 75 0,624 45 0.641 8 1 0.657 58
0.202 OOE-OI 0.200 E--0I 0.200 (XIE--OI 0.200 E-0I 0.237 OOE-OI 0.300 OOE-OI 0.300 E-0I 0,390 E-0I 0.500 E-0 1 0.470 E-0 1 0,400 OOE-OI 0.400 E-OI 0.400 E-0 1 0.400 E-0 1 0.486 E-0 1 0.500 E-0 l 0.45500E-01 0.368 E-0I 0.350 E-0I 0.489 OOE-OI 1 0.594 OOE-.O 0.550 E-.0 1 0.544 OOE-OI 0.500 E-.01 0.535 E--0I 0.600 E-0 1 0.600 E--0I 0.600 OOE-OI 0.585 E-0I 0.508 E-OI 0.500 YE--PI 1 0.500 OOE-.O 0.500 E-0 1 0.400 E-0I 0,367 XE--OI 0.325 E-0I 0.250 E-.01 0.250 E-0I 0.470 E-0 1 0.544 XE-OI 0.7 18 OOE-OI 0.703 OOE-O1 0.827 E-0I
7280.0 7330.0 7440.0 7450.0 7490.0
7570.0 7620.0 7610.0 7910.0 7830.0 7740.0
7600.0 7780.0
7880.0 8160.0
8250.0 82 10.0 8340.0 8530.0 8640.0 8490.0 83 10.0 8380.0 8660.0 8640.0 8920.0 9020.0
9420.0 9820.0 9900.0 10210.0 10 310.0 1 13.0 11 740.0 12 050.0 12 370.0 12 440.0 13 200.0 12 960.0
12787.0 12 847.0 12 949.0 12 959.0
12960,0 13 095.0 13 117.0 13 304.0 13 458.0 13 258.0 13 164.0 13 311.0 13 527.0 13 726.0 13 82 1.0 14 290.0 13 69 1.0 13 962.0 14 083.0 13 960.0
13988.0 14 089.0 14 276.0 14 2 17.0 14 359.0 14 597.0 14 64 1.0 14 603.0 14 867.0
15071.0 15 183.0
15 503,0 15 766.0 15 930.0 16 07 1.0 16 724.0 16 525.0 16 566.0
0.665 15
0.677 76 0.695 34
continued
356
Statistical models in econometrics
Table 17.2. continued
44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
13 020.0 12 850.0 13 230.0 13 550.0 14 460.0 14 850.0 15 250.0 16 770.0 17 150.0 17 880.0 18 430.0 19 050.0 19 0.0 19 440.0 20 430.0 2 1 970.0 23 170.0 24 280.0 24 950.0 25 920.0 26 920.0
27 520.0 28 030.0
28 840.0 29 360.0
29 260.0 29 880.0 29 660.0 30 550.0 3 1 8 10.0 32 870.0
33 210.0 33 760.0 36 720.0 37 590.0 38 140.0 40 220.0
16 5 17.0 16 2 11.0 16 169.0 16 288.0 16 38 1.O 16 342.0 16 358.0 16 0 15.0 15 937.0 16 105.0 16 163.0 16 199.0 16 240.0 15 980.0 16 020.0 16 153.0 16 364.0 16 840.0 16 884.0 17 249.0 17 254.0 17 396.0 18 3 15.0 17 8 16.0 18 072.0 18 120.0 17 729.0 17 83 1.0 l 7 870.0 18 040.0 17 926.0 17 934.0 17 97 1.0 17 927.0 17 998.0 18 242.0 18 543.0 -
0.72 1 44 0.752 76 0.790 53
0.950
0.824 84 0.867 22 0.9 19 47
0.982 88 1.0297 1.0703 1 1027 1.1322 1.1679 1.2237 1.2770 1.32 16 1.3523 1.3766 1 1.4363 1.46 19 1.49 10 1.5357 1.5796 1 1.7419 l.8 109 1.8823 1.9246 1 17 2.0 154
0.642 OOE-OI E--0I E-.0 1 0.592 OOE-.O1 0.693 OOE-OI
0.700 0.588
.
0.107 30 0.565 E-0 1 0.428 OOE-.O1 0.377 E-.0 1 0.332 E-.O1 0.306 OOE-0 1 0.5 l 8 OOE-.O1 0.675 E-0I 0.857 OOE.-01 0. 103 70 0.993 E-0 1 0. 1 15 00 0.13 1 40 0. l50 0. 150 ()0 0. 140 50 0. 13 1 (X) 0.109 40 0.9E-0 1 0.943 O0E-O1 0.133 0.114 20 0. 100 10 0.829 E-0I 0.624 E-.0 1
,4059
.6808
.97
2.0867 2. 1343 2. 1838 2.2 177
2.2673 2.29 19
2.3076 -U
Additional references
Granger (1982);Griliches (1985); Richard (1980).
.
E-0 1
0.950 OOE--OI 0.950 E-0I 0.950 E--0I 0.950 E-0 1 0.846 E-.0 1 0.625 E-.0 1
'
.
-
CHAPTER
18
The Gauss linear model
18.1
Specifkation
In the context of the Gauss linear model the only random variable involved is the variable whose behaviour is of interest. Denoting this random variable by yt we assume that the stochastic process ftyt, t G T) is a normal, independent process with F(y,) pf and a time-homogeneous variance c2 for t (E T (-T being some index set, not necessarily time), =
37f Nlpt, 'v
tr2),
defined on the probability space (S, #( )). In terms of the general statistical GM (17.15) the relevant conditioning which implies that set is the trivial c-field o (., ,.1
'
.3)
.@
=
pt
=
Eyt/fqz'zu E(#,).
That is, the statistical GM is #t E(J',) +
(tB.2)
ut,
=
with gt assumed to be related to a set of k non-stochastic (or controlled) variables xlt, xcf, xkt, in the form of the linear function .
.
.
,
k
J'tt=
i
=
1
) ix i =
b'x t
,
is an obvious notation. Defining the non-systematic Llt =
357
J't
-
.E(#t).
component
by
(18.4)
358
The Gauss Iinear model
the statistical GM .J'! ,
=
b'x t +
(2)takes the
particular
form (18.5)
u t,
The underlying probability is naturally delined in terms of the marginal distribution of y,, say, D(y,; 0), where 0 > (b,c21) are the statistical parameters of interest, being the parameters in terms of which the statistical GM (5) is defined. The probability model is defined by D(yf; 0)
*=
1
=
j--r-jV( )
exp
1 -
(y? b'x,)2 -
IG
c
0 e' Ild x
,
R+ tGT ,
(18.6)
tyf,
In view of the assumption of independence of t s T) the sampling model, providing the link between the observed data and the statistical GM, is defined as follows:
(-p1, ymb .p2,
.
.
.
,
yvl'
is an independent sample from Dtyr; p), tzzu 1, 2, T; respectively. It could not be a random sample in view of the fact that each yt has a different mean. By construction the systematic and non-systematic components satisfy the following properties: .
(i) (ii)
.E'(l,) Eptutj Elutus)
=
.
.
,
uE'ty,- .E'(yf1) 0,' =
=/t,'(l,)
=0-,
G' 2, =
0,
l + s,
t, s c T.
Properties (i)and (iii)show that )l/t,t e: T) is a normal white-noise process and (ii)establishes the orthogonality of the two components. It is important to note that the distribution in terms of which the above expectation is defined is none other than D(yt; %), the distribution operator E underlying the probability model with 0o the true' value of 0. The Gauss linear model is specified by the statistical GM (5), the probability model (6)and the sampling model defined above. Looking at this statistical model we can see that it purports to model an experimentallike' situation where the xfts are either fixed or controlled by the experimenter and the chosen values determine the systematic component of y, via (3).This renders this statistical model of limited applicability in econometrics where controlled experimentation is rather rare. At the outset the modeller adopting the Gauss linear model discriminates between yf and .)
18.2 Estimation
359
the xfrs on probabilistic grounds by assuming yt is a random variable and the xfrs non-stochastic or controlled variables. ln econometric modelling, however, apart from a time trend variable. say x, t, t G T, and dummy variables taking the value zero or one by design, it is very difficult to think of non-stochastic or controlled variables. The Gauss linear model is of interest in econometrics mainly because it enhances our understanding of the linear regression model (seeChapter 19) when the two are compared. The two models seem to be almost identical notation-wise, thus causing some confusion; but a closer comparison reveals important differences rendering the two models applicable to very different situations. This will be pursued further in the next chapter. =
18.2
Estimation
For expositional purposes 1et us consider the simplest case where there are only two non-stochastic variables (k and the statistical GM of the Gauss linear model takes the simple form =2)
A't d?1+ =
d72.X1t
+
(18.7)
14,
The reason for choosing this simple case is to utilise the similarity of the mathematical manipulations between the Gauss linear and linear regression models in order to enhance the reader's understanding of the matrix notation used in the context of the latter (seeChapter 19). The first variable in (7)takes the value one for all t, commonly called the constant (or intercept). ln view of the probability model (6)and the sampling model assumption of independence we can deduce that the distribution of the sample (see Chapter 11) takes the form F
D(1'1, #2,
.
.
.
,
J'w; 04
=
17D(A',;04.
l
=
1
(18.8)
Hence, the likelihood function, ignoring the constant of proportionality, can be defined by
360
The Gauss Iinear model
(see Section 13.3). The log likelihood takes the form: T
log L0; y)=
log 2r--j-
--j-
The first-order conditions estimators (MLE's) are:
t?log L
T
1og c a
for the derivation of the maximum likelihood
1
( p/?l = 2c
Z (A',
-2)
-'1
-
2
(18.11)
72-Y1) =0,
-
t
(18.12)
Solving (1 1H13)simultaneously JI
/;2
MLE'S
J'c.f,
)- -
=
we get the
Z (.J.,,-
( 18.14) .9
.:(.x',
-
f
=
( 18. 15)
5
x32
Z(x1(
where 42
and Jj J2x, represents the where gfEE: of the error term uf. Taking testimator'
T
,,
-
( 18.16)
t
estimated residuals', second derivatives
.p,
-
1
=-
the natural
( 18 17) .
2
( log L p/?lpha = p21ogL
c
t?c2
c
J)2
=
-
1 1
p2log L jgx,, kb 1 pcz ,
1 #.
$ xtut. t
1 -
-
Z
''-'tri
&''
t
(18.18)
18.2 Estimation For p-(/?:, hc, c2), the sample information matrix Iw(p) and its inverse are
-
Zxt c2
xf
Z xl
T
lwtpl-
c2 )
l
x2t
''
Z(x t
r -' 1
T
0
0
Il w($1
(18.19)
0.2
2
-
-F
-t-Y3
- c
-y
f
2
-c2 jj-0 x T
j-2.x
)((x,-'.)
l
=
2
) (xf
---------L?iq---L--)3 --')i)7----t)
.92
---------
-
f
t
t
if , (x,-.f)2#0,i.e. theremust is positivedefinitetlwtpl>t) be at least two distinct values for xt. This condition also ensures the existence of J'cas defined by ( 15).
xorcthat/rtp)
of 0
Properties
=
(
1,
)
2,
Asymptotic properties
I
is a MLE enables us to conclude limw- a information (matrix) defined by 1.(:) : 12-13) then definite (seeChapters
The fact that
=
J t-0, i.e. Iis a consistent -+
V
/'T4
E,(4)
p)
-
=
-
that if the asymptotic ()(l F)lw(#)) is positive
estimator of 0 (if 1
N(0, gI.(p)1 -
'
0, i.e. asymptotically
i.e.
tr:
.xt
-.
I is asymptotically
unbiased
=
()I,x,(p)q-
1,
i.e.
as T-+ w);
normal;
(the asymptotic mean of ;
is 0);
varal
,z-
is asymptotically
efficient.
362
The Gauss linear model
1.(.:) is positive definite if det(1.(p)) > 0,' this is the case if
lim v- x
--
1 T
)((x/ -
x3 c
=
( 18.2 1)
qxx > 0.
,
Finite sample properties
(2)
being a MLE we can deduce that: is a function of the set of minimal sufficient statistics F
r
r
#y) Z yl, Z.)4, Z xtn l=1
f=l
t=1
(18.22)
;
=
is invariant with respect to Borel functions,i.e. ifh(p):O (6)then the MLE of h4#)is h(J);see Section 13.3. ln order to consider any other small (finite)sample properties of J we need to derive its distribution. Because the mathematical manipulations are rather involved in the present case no such manipulations are attempted. It turns out that these manipulations are much easier in matrix notation and will be done in the next chapter for the linear regression model which when reinterpreted applies to the present case unaltered.
(vi)
-->
Jj Jc
'v N
bj b,
var(J,) Covt/;l J2) covtrl Jc) vartrc) ,
( 18.23)
,
,
where
vartra)
2
-
.
Z(x:-
.92
t
This result follows from the fact that J'l )5where 42 jyt and linear functions of normally distributed ;.t= gtx, xx/gzf (xt random variables and thus themselves normally distributed (seeSection 6.3). The distribution of 42 takes the form -f,
=
-x')2j,
-
T.l 2
.vz2(F-2),
=
Zf
.f),
-
(18.24)
18.3
Hypothesis
363
testing and confidence intervals
where z2(T-2) stands for the chi-square distribution with T- 2 degrees of freedom. (2)follows from the fact that F2 c2 /7ol(f/c)2 involves F- 2 independent squared standard normally distributed random variables. (viii) From (vii)it follows that E(J1) /?1,f)/J2) /?c, i.e. j and z are unbiased estimators of ?1 and bz respectively. On the other hand, since the mean of a chi-square random variable equals its degrees of freedom (seeAppendix 6.1) =
=
=
E
Tl
F-2
=
/.2
(T-2) S442)=-T'
::::>
i.e. 82 is a biased estimator of (1 (T- 2)(1 f l is unbiased and (T- 2).:2 2
(; 1,
2
(x)
'
.$2
but
the estimator
,vz2(T'-2),
) are independent
This can be verified
c2.
(72c/:c2
of
by considering
=
(18.25) .s2
(or 42). the covariance
between them.
(20) we can see that (/1, ;a) achieve the Cramer-Rao lower bound and hence we can deduce that they are fully efficient. Given that :2 isbiased the Cramer-Rao given by (20) is not applicable, but for sl we know that
Comparing
Var o
(23) with
(T- 2).s2
=
az
o.
Var(y2)=
2c* T- 2
y
2(T-2) 2c*
the Cramer-Rao T -
bound.
Thus, although sl does not achieve the Cramer-llao lower bound, no other unbiased estimator of c2 achieves this bound.
18.3
Hypothesis testing and confidence intervals
In setting up tests and conlidence intervals the distribution of and any pivotal quantities thereof are of paramount importance. Consider the null hypothesis Hv1 Intuition
deviation
)1
=
l-fj against
H3 : hl #
-1,
h
being a constant.
suggests that the distance $J1 511, scaled by its standard (toavoid any units of measurement problems), might provide the
364
The Gauss Iinear model
basis for a
test statistic. Given that
igood'
( 18.26)
this is not a pivotal quantity unless c2 is known. Otherwise we must find an alternative pivotal quantity. Taking (25)and (26)together and using the independence between J'1and sl we can set up the pivotal quantity
( 18.27)
IJ 1 1(r-2). --,-$---1-.-N,'g'%ar(/71)(I -b-
( 18
.28)
The rejection region for a size x test is
cj
$J -l; j
(' (k'
.----t---s--.y =
y..
ca
Jgvart/?llj
where 1
,
dr(F-2).
-a
= -
(18.29)
L'a
Using the duality between hypothesis testing and confidence intervals (see level confidence interval for bk Section 14.5) we can construct an (1 based on the acceptance region, -a)
C()('1)
=
y:
1;1
-
'1
l
(?1 -
% cu
#r
=
Evartrll.'.l
-
cz %
'1)
v'g j- ar( j 1)j
%ca
1
=
-
( 18.30)
T'Zt.Y'-O2 l
)
.v/
l
p
jj (xj-p2 t
(18 1) .3
x,
18.3 Hypothesis Similarly, for H(): Iz Clts-cl
=
y:
-
y: a
1$-2 the rejection
z#
1r2- -l > cz ' x' gkartrclq
region of a size a test
:%
(T'-2)s2
b
:$
c
#r(Co)
,
is
( 18.32)
.
l
=
=
lh against
against Sj : c2 # l. The pivotal quantity up the acceptance region
Consider Hv c2 useddirectly to set
tr'tl
365
testing and confidence intervals
(18.34)
1 a,
=
(25)can be
-
such that ( 18.35) A ( 1 -J)
level confidence interval is
c(y)
=
c2:
(F-2)s2
ca
ts
b
:6:
(T-2)s2
( 18.36)
.
a
Remark: One-sided tests can be easily constructed by modifying the above two-sided results; see Chapter 14. Consider the question of constructing a (1 -a) level confidence interval for
pt= :1
+
bzxt.
A natural estimator Vart/)
=
of pt is
Vart;j +
t
=
41+ lzxt,
with Et)
=
pt, and
;2xt)
( 18.37)
366
The Gauss Iinear model
These results imply that the distribution of
(Jf -
(i)
pt)
gvartrn
'v
Jf is
normal and
x(() 1).,
(js.a8)
,
(18.39) construct
jsl
a
(x, -
.y
2
)
j
.
Z(x,-.k)2 l
( 18.40) This confidence interval can be extended to t> T in order to provide us with a prediction conlidence interval for yvyt, ly 1
( 18.4 1)
(see Chapters 12 and 14 on prediction). In concluding this section it is important to note that the hypothesis testing and confidence interval results derived above, as well as the estimation results of Section 18.2, are crucially dependent on the validity of the assumptions underlying the Gauss linear model. lf any of these assumptions are in fact invalid the above results are unwarranted to a greater or lesser degree (seeChapters 20-22 for misspecscation analysis in the context of the linear regression model).
18.4
Experimental
design
ln Section 18.2 above we have seen that the
MLE'S
J1and
h of
bL
and bz
Looking ahead
18.5
respectively are distributed as bivariate normal as shown in (23). The fact that the xfs are often controlled variables enables us to consider the question of the statistical GM (5) so as to ensure that it satisfies certain desirable properties such as robustness and parsimony. These can be achieved by choosing the xfs and their values appropriately. Lookin4at their variances and covariances we can see that we could make 41 and b2 more by choosing the values of xf in a certain way. Firstly, if 0 then idesigning'
'accurate'
.k
=
;2) 0
Covlrl,
(18.42)
=
and and ;2 are now independent. This implies that if we were to make a changeof origin in x, we could ensure that J1 and z are independent. Secondly, the variances of J1and /2 are minimised when f xtl (given isas large as possible. This can be easily achieved by choosing the value of xl and as large as possible. For to be on either side of zero (to achieve example,we could choose the xfs so that ';1
=0)
.'.1=0)
X
1
=
X2
X T+ l
=
*
=
*
*
*
*
=
*
=
X F
XF
=
=
-
K>
( 18.43)
11
(T even) and n is as large as possible; see Kendall and Stuart (1968). Another important feature of the Gauss linear model is that repeated observations on y can be generated for some specified values of the xts by repeating the experiment represented by the statistical GM (7).
18.5
Looking ahead
From the econometric viewpoint the linear control knob model can be seen to have two questionable features. Firstly, the fact that the xfls are assumed to be non-stochastic reduces the applicability of the model. Secondly, the independent sample assumption can be called into question for most economic data series. In other disciplines where experimentation is possible the Gauss linear model is a very important statistical model. The purpose of the next chapter is to develop a similar statistical model where th first questionable feature is substituted by a more realistic formulation of the systematic component. The variables involved are all assumed to be random variables at the outset. Important concepts Non-stochastic or controlled repeated observations.
variables, residuals, experimental
design,
368
The Gauss Iinear model
Questions Explain the statistical GM of the Gauss linear model. Derive the MLE'S of b and c2 in the case of the general Gauss linear model where b'xr + ut, t= 1, 2, 7; x, being a k x 1 vector of nonstochastic variables, and state their asymptotic properties. Explain under what circumstances the MLE'S f 1 and 42of bj and bz respectively are independent. Can we design the values of the non.p,
4.
=
.
.
.
,
stochastic variables so as to yet independence'? Explain why the statistic l/l,/Vartrll is distributed as r(T'-2) and use it to set up a test for Ho: bj
=
0
against
HL :
bf #U,
as well as a confidence interval for ht. Verify that and 42 are independent. Additional references Chow ( 1983); Dhrymes ( 1978); Johnston ( 1984); Judge e (11. (1982);Kmenta Koutsoyiannis (1975) Maddala ( 1977);Pindyck and Rubinfeld (1981).
(197 1);
CHAPTER
19
The linear regression model 1 - specification, estimation and testing
19.1
Introduction
The linear regression model forms the backbone of most other statistical of the models of particular interest in econometrics. A sound understanding regression prediction and estimation, the linear in testing specification, model holds the key to a better understanding of the other statistical models
discussed in the present book.
In relation to the Gauss linear model discussed in Chapter 18, apart from and the mathematical similarity in the notation some apparent statistical involved analysis, the linear regression in the manipulations situation model from the one envisaged model pumorts to a very different particular model could be considered to the Gauss linear by the former. In models of the estimable analysing statistical model for be the appropriate
form
k
Mt
Jli,
=aa+
i
=
(19.2)
1
where Mt refers to money and Qit,i 1, 2, 3 to quarterly dummy variables, in view of the non-stochastic nature of the xfts involved. On the other hand, estimable models such as =
M
=
Akrxbpxzlx
(19.3)
referring to a demand for money function (M - money, F- income, P -price level, 1 - interest rate), could not be analysed in the context of the Gauss 369
Specilkation, estimation and testing
linear model. This is because it is rather arbitrary to discriminate on probabilistic grounds between the variable giving rise to the observed data chosen for M and those for F, # and 1. For estimabl'e models such as (3)the linear regression model as sketched in Chapter 17 seems more appropriate, especially if the observed data chosen do not exhibit time dependence. This willbecome clearerin thepresent chapter after the specification of thelinear regression model in Section 19.2. The money demand function (3)is used to illustrate the various concepts and results introduced throughout this chapter.
Spaification
19.2
Let tZ?, t c T) be a vector stochastic process on the probability space (S, #4.) where Zt (yf,X;)' represents the vector of random valiables giving rise to the observed data chosen, with yr being the variable whose behaviour we are aiming to explain. The stochastic process (Zf,lGT) is assumed to be nornml, independent and identically' distributed (N11D) with '(Zt) m and Covtzf E. i.e. .28
=
=
=
j (( j( ) jj
(x-'',
.''' yx
x
-
t;,1
sl1,2,
,
in an obvious notation (seeChapter 15). lt is interesting to note at this stage that these assumptions seem rather restrictive for most economic data in general and time-series in particular. On the basis of the assumption that .f(Z?,r s T) is a Nl1D vector stochastic Zw; #)in process we can proceed to reduce thejoint distlibution D(Z1, order to define the statistical GM of the linear regression model using the general form .
yt
=
Eyt/xt
=
xf)
y, E(y,,/Xf 14,,.,u, =
-
is the x,)
systematic
the
pt p0
and
,
E (A',/'X,
=
=
=
my
-
/:
X,)=
component,
non-systematic
(seeChapter 17). In view of the normality where
.
(19.5)
where and
.
of (Zl, t G T) we deduce that
Jo + fxt (linearin
2E2-21m x ,
component
x,),
(19.6)
az 1 # Ec-21 =
Var(u,/X,=x,) =Var(y'/X,=x,)=c2
(homoskedastic), (19.7)
19.2 Specification where c2 cj 1 - o.j atc-al ,yj (seeChapter 15). The time invariance of the parameters jo, ( and tz = stems from the identically distributed (lD) assumption related to .tZr, t G T). It is important, however, to note that the ID assumption provides only a sufficient condition for the time invariance of the statistical parameters. ln order to simplify the notation let us assume the m 0 without any loss of generality given that we can easily transform the original variables in and (Xt This implies that ?v, the mean derivation form (y, coefficient of the constant, is zero and the systematic component becomes =
=
-n1y)
-mx).
#'Xl. (19.8) l-tt E (-Fr?''XtN) ln practice, however, unless the observed data are in mean deviation form the constant should never be dropped because the estimates derived otherwise are not estimates of the regression coeflicients #= Ec-21J2, but of '(XlX;)- ''(Xf'.Ff); SCe Appendix 19.1 on the role of the constant. #* The statistical GM of the linear regression model takes the particular form y', p'xt+ Ikt, (19.9) =
=
=
=
=
with 0- (#,c2) being the statistical parameters t#' interest; the parameters in terms of which the statistical GM is defined. By construction the systematic and non-systematic components of (9)satisfy the following properties: E(ut%t
=
Xf)
=
(ii)
f;tutls/x,
=
xf)
(iii)
Eptut/xt
=
Xt)
f
=
E(1'r -
'(-Pt,?/Xf Xf)),'''Xf Xt1 =
ptkutjxt
X?)
=
=
=0,
tuf,
The first two properties define t 6 T) to be a white-noise process and (iii) establishes the orthogonality of the two components. It is important to note ,,'XZ that the above expectation operator xr) is defined in terms of D()//Xf', 0), which is the distribution underlying the probability model for (9).However, the above properties hold for F( ) defined in terms of Dzt; #) as well, given that: '(
'
=
'
(i)'
'(u,)
=
'tutlks) =
F(f'(?.k,/X,= xl)) 'hftftagus/''xt
=
=0.,
xt))
G 2, =
0,
Specification, estimation and testing
and (iiil'
JJ(/t,uf)
=
Elkytut
'Xf
xj))
=
=0,
r, s iE T
(see Section 7.2 on conditional
expectation). The conditional distribution D()',/X,', 04is related to thejoint distribution Dn, X; #) via the decomposition D(#,, Xf',
) D(1',//X?; l ) D(X,;
( 19. 10)
a)
'
=
(see Chapter 5). Given that in defining the probability model of the linear regression model as based on 1)(),,,/'Xf;p) we choose to ignore D(X,; #2)for the estimation of the statistical parameters of interest p. For this to be possible we need to ensure that X, is wt?kk/), exogenous with respect to 0 for the sample period r= 1, 2, r (see Section 19.3. below). For the statistical parameters of interest 0 EEE(#, c2) to be well defined we need to ensure that Ec2 is non-singular, in view of the formulae #= E2-21o.c1, c2 cj j - o.j aE2-a1,a1 at least for the sample period r 1, 2, T This requires that the sample equivalent of 122,( 1/F)(X'X) where X EEE(xj i.e. xwl' is indeed non-singular, .
=
.
.
,
=
,
.
.
,
.
,x2,
.
rank(X'X)
=
ranktx)
=
.
,
,
k,
Xf being a k x 1 vector. As argued in Chapter 17, the statistical parameters of interest do not necessarily coincide with the theoretical parameters of interest t. We need, however, to ensure that ( is uniquely defined in terms of 0 for ( to be identnable. ln constructing empirical econometric models we proceed from a well-defined estimated statistical GM (seeChapter 22) to reparametrise it in terms of the theoretical parameters of interest. Any restrictions induced by the repa'rametrisation, however, should be tested for their validity. For this reason no a priori restrictions are imposed on p at the outset to make such restrictions testable at a later stage. As argued above, tbe probabilitv tnodel underlying (9)is defined in terms of D(y'r?'Xr;p) and takes the form
(1) =
D(yr/Xt; 0)
1
=
.
c
exp
(21)
1 -
2c
j.
(yf p'xtll -
,
0 s IJd x IJ@ + t e: T ,
.
(19.12)
in view of the independence of (Z,, r 6 T) the sampling model sequentially takes the form of an independent sample, y > (y1 drawn from D(y,,/X,', p), l 1, 2, T respectively. Having defined all three components of the linear regression model 1et us Moreover,
,y'w)',
,
=
.
.
.
,
.
.
.
19.2 Specification together and specify the statistical
collect al1 the assumptions properly.
Statistical GM, yf
specilication
model:
The linear regrenion =
#'xf+
model
u,, l G T
g3j g41 g5q
'(yr,/Xt=x,) S(y,,/'Xt xf) - the systematic component; ut yr non-systematic component. the - N(#, J2)e 0 p Na-clcaj c2 o.j j - cj ara-al ,aj are the statistical Cov(X,, y,), parameters of interest. Note: Eac Cov(X,), ,2I Var(yf).) cl j T Xf iS weakly exogenous with respect to 0. r 1, 2, No a priori information on 0. xwl/,' T x I data matrix, (T > /). Ranklx) k, X (x1, xa,
(11)
Probability
E1(1 17(1
p,
=
=
=
-
=
=
,
=
=
=
=
=
*
=
=
.
.
.
.
.
.
,
,
model
Dl.,'r,''Xf;0$
=
-
1
exp
'-QVx' ( ) .
l
(-J.'f-
2c 1
#'x,)2
,
D(#f Xl ; #) is normal; ti) Ft-r X, xf) p'xt linear in xr; (ii) - homoskedastic xr) (7.2 Vart))r/x, (free of xt),' (iii) p is time invariant. .
=
=
=
(111)
=
-
Sampling model
y' represents an independent sample sequentially 'J: drawn from 1)(y,,,/X,.,p), t 1, 2, speeification is that the above An important point to note about the making of D(y,g'Xf;0) no assumptions model is specified directly in terms model there is FOr regression the specification of the linear about D(Zf; #). T). The problem, no need to make any assumptions related to 1Zf, r 6 however, is that the additional generality gained by going directly to D( #'l/''Xf;0) is more apparent than real. Despite the fact that the assumption that tZr, t g -IJ-)is a NllD process is only sufficient (not necessary) for g6) to g81above, it considerably enhances our understanding of econometric modelling in the context of the linear regression model. This is, firstly, because it is commonly easier in practice to judge the appropriateness of
g8j
y EB (-:1 ,
.
.
.
,
=
.
.
.
,
Specification, estimation and testing
probabilistic assumptions related to Zf rather than (yr,/X, x,); and, secondly, in the context of misspecification analysis possible sources for the departures from the underlying assumptions are of paramount importance. Such sources can commonly be traced to departures from the assumptions postulated for (Zr, r (E T) (see Chapters 2 1-22). Before we discuss the above assumptions underlying the linear regression it is of some interest to compare the above specification with the standard textbook approach where the probabilistic assumptions are made in terms of the error term. =
Standard
' X# =
textbook
spcct/krprft?n of tbe linear regression
model
+u.
X(0, c21w); no a priori information on (#,c2); rank (X) k. Assumption (1) implies the orthogonality F(Xf'?-/f,,At x,) 0, t,z,z1, 2, 'T;and assumptions (6j to g8j the probability and the sampling models respectively. This is because (y,?'X) is a linear functionof u and thus normally distributed (see Chapter 15), i.e.
(1) (2) (3)
(u,/X)
'v
=
=
(y?'X) N(X#, c2Iw). 'v
=
.
.
.
,
(19.13)
As we can see, the sampling model assumption of independence is behind the form of the conditional covariance c2J. Because of this the independence assumption and its implications are not clearly recognised in certain cases when the linear regression model is used in econometric modelling. As argued in Chapter 17, the sampling model of an independent sample is usually inappropriate when the observed data come in the form of aggregate economic time series. Assumptions (2)and (3)are identical to (4j and g5) above. The assumptions related to the parameters of interest onp, c2) and the weak exogeneity of X with respect to 0 (g2qand g3(J above) are not made in the context of the standard textbook specification. These assumptions relpted to the parametrisation of the statistical GM play a very important role in the context of the methodology proposed in Chapter 1 (seealso Chapter 26). Several concepts such as weak exogeneity (see Section 19.3, below) and collinearity (seeSections 20.5-6) are only detinable with respect to a given parametrisation. Moreover, the statistical going GM is turned into an econometric model by reparametrisation, from the statistical to the theoretical parameters of interest. The most important difference between the specification g1()-g8q and ( 1)-(3), however, is the role attributed to the error term. In the context of the thidden'
19.3
Discussion of the assumptions
latter the probabilistic and sampling model assumptions are made in terms of the error term not in terms of the observable random variables involved as in g11-r81.This difference has important implications in the context of misspecification testing (testingthe underlying assumptions) and action thereof. The error term in the context of a statistical model as specified in the present book is by construction white-noise relative to a given information set % 7 +.
19.3
Discussion of the assumptions
r1q
The systematic
and non-systematic
components
in Chapter 17 (see also Chapter 26) the specification of a T; i.e. statistical model is based on the joint distribution of Zf, l 1, 2,
As argued
=
D(Z1 Z2, ,
.
.
.
,
.
.
.
,
(19. 14)
Zw; #)H D(Z; #)
which includes the relevant sample and measurement information. The specification of the linear regression model can be viewed as directly using the assumptions of related to ( 14) and derived by normality and llD. The independence assumption enables us to reduce 1)(Z; #) into the product of the marginal distributions D(Zf; kt),r= 1, 2, Ti i.e. treduction'
.
.
.
,
F
)-
D(z;
l
17D(Z,; #,) =
( 19. 15)
1
The identical distribution enables us to deduce that #f # fOr t 1.2, ,Z The next step in the reduction is the following decomposition of D(Z(; #): =
=
D(Zf;
#)
=
.
.
(19. 16)
#1) D(X,; /2).
D(#t,/X,;
.
'
The normality assumption with E> 0 and unrestlicted enable us to deduce the weak exogeneity of Xf relative to 0. tXf xt) depends The choice of the relevant information set crucially on the NIID assumptions', if these assumptions are invalid the will in general be inappropriate. Given this choice of , the choice of non-systematic components are defined by: and systematic .)t
=
=
.1-
.1,
pt
E ()4,/Xt
=
=
Xt),
lh
Under the NIID assumptions l.t? j'xg, u: =
=
y?
-
=
l?f-
EtAt
=
yt and ut take the particulgr
q'xt.
(19.17)
Xr).
forms:
(19.18)
Spification,
estimation
and testing
Again, if the NIID assumptions
pt*p1 (seeChapters
(21
and
are invalid then
EplulAt
=
( 19. 19)
xt) #z
2 1-22).
Te parameters
of
interest
As discussed in Chapter 17, the parameters in terms of which the statistical GM is defined constitute by defnition the statistical parameters of interest parametrisation of the unknown and they represent a particular of underlying probability model. ln the case of the linear the parameters model interest of the regression parameters come in the form of p>(j, c2) c2 Ec-21o'c1, -,1cE2-c1la1. above the As argued cj1 where pn parametrisation 0 depends not only on D(Z; #) but also on the assumptions of NIID. Any changes in Zf or and the NIID assumptions will in general change the parametrisation. =
(31
Exogeneity
ln the linear regression model we begin with Dtyf, Xf; where concentrate exclusively on Dtyt/xf ,./1)
D(.Fl, Xr/#)
=D(-pt,/X,;#1) D(Xr; /2), '
#) and then
we
(19.20)
which implies that we choose to ignore the marginal distribution D(Xf;/2). ln order to be able to do that, this distribution must contain no information relevant for the estimation of the parameters of interest, 0n (p,o.l),i.e. the stochastic structure of X/ must be irrelevant for any inference on 0. Formalising this intuitive idea we say that: Xf is weakl exogenous over the with # H(#1, sample period for 0 if there-exists a reparametrisation #2) such that: (i) p is a function of #j (0 h(/1)); (ii) /1 and 2 are variation free ((#: #a)e:T1 x Ta). Variation free means that for any specific value #a in T2, /1 call take any other value in Tj and vice versa. For more details on exogeneity see Engle, Hendry and Richard (1983).When the above conditions are not satisfied the marginal distribution of Xf cannot be ignored because it contains relevant information for any inference on 0. =
,
Discussion of the assumptions
19.3
g4(l
N0 a priori information
on 0 >
(#, c2)
is made at the outset in order to avoid imposing invalid testable on p. At this stage the only relevant interpretation of 0 statistical parameters, directly related to #j in D( I'f/''Xr;/1). As such no is as Such information is a priori information seems likely to be available for 0. of interest related theoretical parameters commonly (. Before 0 is to the statistical underlying need the that to ensure used to define (, however, we observed of in data misspecification) the well is terms defined (no model chosen.
This assumption
restrictions
The observed data matrix X is ofill
IS
For the data matrix X H!
ranktx)
(x1 ,
x2,
.
.
.
,
vank
xz)', T x /(, we need to assume that
k, k < 'J:
=
The need for this assumption is not at al1 obvious at this stage except perhaps as a sample equivalent to the assumption
rankttzz)
=
k,
needed to enable us to define the parameters of interest 0. This is because ranktx) rank(X'X), and =
1 k . jg xrxt'
1 =
y.
!
(X'X)
can be seen as the sample moment equivalent to Ecc.
L61
linearity
Normality
homoskedasticity
of normality of D( J',, X,; #) plays an important role in the specification as well as statistical analysis of the linear regression model. As far as specification is concerned, normality of D(),,, Xr; #) implies D(.r, X/; p) is normal (see Chapter 15),. (i) x,) #/x,,a linear function of the observed value xr of Xr', f)y,/'Xl (ii) is free of xt, i.e. Varly, Xj x,) c2, the conditional variance (iii) homoskedastic. Moreover, (i)-(iii)come very close to implying that Dt.rt, Xf; #) is normal as well (see Chapter 24.2). The assumption
=
=
=
=
Spification, Parameter
estimation
and testing
time-invariance
As far as the parameter invariance assumption is concerned we can see that it stems from the time invariance of the parameters of the distribution Dlyt, Xf; ); that is, from the identically distributed (ID) component of the normal 11D assumption related to Z,.
g8)
Independent sample
The assumption that y is an independent sample from D(y,/Xf; 0), t= 1, 2, T; is one of the most crucial assumptions underlying the linear . regression model. ln econometrics this assumption should be looked at very closely because most economic time series have a distinct time which cannot be modelled exclusively in terms of dimension (dependency) exogenous random variables X,. ln such cases the non-random sample assumption (seeChapter 23) might be more appropriate. .
.
,
19.4
Estimation
(1)
Maximum likelihood estimators
Let us consider the estimation of the linear regression model as specified by the assumptions I(1)-g81discussed above. Using the assumptions (61 to (8j we can deduce that the likelihood function for the model takes the form L(p, c2; y, A')
1'
=k(y)
t
lgJ cv -
1
1
exp
(2zr)
1
-
-#'x,)2
2c .z(yt
( 19.23)
(3)
( 19.25)
19.4 Estimation F
v
1
1 ucz- 'r t ) 1 (y,- j'x' )2ec2
T'i j(1
=
tl,
in an obvious notation,
=
=
( 19.26)
c2, respectively. lf are the maximum likelihood estimators (MLE's) of # and staiisticl 2, T; in the GM, p, #'x,+ u, r 1, we were to write the matrix notation form =
=
.
.
.
,
(19.27)
y= Xj+u, where y >
F x 1, the
(y1 ,
xsl', F x k, and u H T x 1, X EEE(x! suggestive form take the more
(uI
.vv4',
.
.
.
MLE'S
,
,
.
.
.
,
,
/ (x'x) - t x'y
.
,
.
.
14z.)',
=
1
and for
BE
The information
y -XJf,
42
=
( 19.28)
.
1
'
F
matrix Iw(#) is defined by
Iw(p)-s((-Pl0BL)(P1Og)')-s(-t?21OBL),
(-)0
Pp (0'
t?p
where the last equality holds under the assumption that D()'t X,; 0) probability model. ln the above case represents the 'true'
:2 log L . P/ 0/
1 =
-
c2
p2log L T t?c4 = 2c4
T
y xfx;
-
ca
,-1
JX
:2 log L
(x'x), t7'j
cl
-
1
= -
tr
4. f
) =
xf ut
,
1
F
1 -
1
2Es
ut1
( 19.29)
.
t
=
1
Hence X'X
c2(x'X)-
2
Ir($
=
and
gIw(p)q-
1
=
1
0 2c4
.
T
( 19.30) lt is very important to remember that the expectation operator above is defined relative to the probability model Dty'f,/Xr;0). ln order to get some idea as to what the above matrix notation formulae
380
Specilication, estimation and testing
look like let us consider these formulae for the simple model: ''f-/?cxf +',/t, l,f ib
t- 1, 2,
-
.
.
.
T
,
( 19.3 1)
X1 X2
X
X'X
l
=
l
x, Z
x, xl
l
,
v
s
), tpyyl 2
)
.vf
,
X'
'
t
EB
,
xf)', t
- p-2
.p
).;
Compare these formulae with those of Chapter 18. One very important feature of the MLE p-above is that it preserves the original orthogonality between the systematic and non-systematic components
(19.32)
y p + u,
p .u between the estimated systematic and non-systematic =
components
in the
form y=
/+,
/
-L
,
( 19.33)
19.4 Estimation b NXj-
-
5
-X/
y
respectively. This is because
!h
and
= Pxy
(1-
=
( 19.34)
Pxly,
where Px X(X'X) - 1X' is a symmetric (P' P ), idempotent matrix (i.e.it represents an orthogonal projection) and =
X
E(;')
=
X
(Px2 Px) =
'(Pxyy'(l - Px))
=
= f7tpxyu'll
Px)),
-
Px)c2, = Px(I -
since
(1 Pxly -
since S(yu')
=
(1 Pxlu
=
-
c2Iw
since Px(I - Px) 0. = 0, ln other words, the systematic and non-systematic components were estimated in such a way so as to preserve the original orthogonality. Geometrically Px and (1 Px) represent orthogonal projectors onto the subspace spanned by the columns of X, say .,//(X), and into its orthogonal complement //(X)-, respectively. The systematic component was estimated by projecting y onto ,.,//(X) and the non-systematic by component projecting y into //(X)i, i.e. =
.
.
y Pxy =
+
(1 Pxly.
(19.35)
-
Moreover, this orthogonality, which is equivalent to independence in this (92 since is independent of context, is p assed over to the MLE'S j and y'(I Pxly, the residual sums of squares, because Px(I - Px) (seeQ6, Chapter 15). Given that #= X/ and 42 ( 1/T)' we can deduce that / and 42 are independent', see (E2) of Section 7.1. Another feature of the MLE'S / and :2 worth noting is the suggestive similarity between these estimators and the parameters j, c2: '
=
=0
-
=
/ J
B
2>
N 2-ai /2 tr 11
-
'
1
a
l-
X'X =
-
1
X' y
.--.
T
( 19.36)
,
T
1
12
E 22 - a 21,
)-(y,.'X)(Xw'X)-'(X'y).
oz-l;'
(19.37)
T
Looking at these formulae we can see that the MLE'S of j and c2 can be derived by substituting the sample moment equivalents to the population
moments: Na a :
-
1
/
(X'X),
c'zl
Using the orthogonality
:
1
w
X'y,
cj
1:
1 -.
w
y'y.
of the estimated components
( 19.38)
/
and
we could
Specification,
estimation
and testing
decompose the variation in y as measured by y'y into '''h.'d
-?
? yr=##+=#X m
-?
?
/
-
X#+.
?
( 19.39)
Using this decomposition we could define the sample equivalent multiple correlation coefhcient (seeChapter 15) to be
to the
''-'
#2
'
M
=
f
?
X(X X) /
=
1
-
-
Xy ,
.'
1-
=
?
'
p;
.
(19.40)
y yy This represents the ratio of the variation by / over the total variation and can be used as a measure t#' goodness C!JJlt for the linear regression model. A similar measure of fit can be constructed using the decomposition of y around its mean jF, that is
'
'explained'
(y'y
T)*)
-
=
(/'/
-
Tgl) +
/,
(19.41)
denoted as TSS ltotal)
ESS
=
RSS
+
(19.42)
,
( residual)
(explained)
where SS stands for sums of squares. The multiple correlation coefficient in this case takes the form a
R- 2
=
pa,#
-
'
(y y
-
yy F
=-
'zk
=
)
1
Rss -
.
T SS
,
( 19.43)
Note that R2 was used in Chapter 15 to denote the population multiple correlation coefficient but in the econometrics literature R2 is also used to 2 and #2. denote of t', #2 and kl, have variously Both of the above measures of sample multiple correlation coefficient in the defined been to be the when reading different Iiterature. should exercised econometric Caution be and properties. textbooks because #2 For example, have different 0 < #2 < 1, but no such restriction exists for 2 unless one of the regressors in X, is the constant term. On the role of the constant term see Appendix 19.1. One serious objection to the use of *2 as a goodness-of-fit measure is the fact that as the number k of regressors increases, kl increases as well irrespective of whether the regressors are relevant or not. For this reason a goodness-of-fit measure is defined by bgoodness
.k2
Ecorrected'
( 19.44)
19.4 Estimation
383
is the division of the statistics involved corresponding degrees of freedom; see Theil ( 197 1).
The
(2)
correction
An empirical
by their
example
In order to illustrate some of the concepts and results introduced so far let us consider estimating a transactions demand for money. Using the simplest form of a demand function we can postulate the theoretical model: MD=
(19.45)
(X P, 1),
MD
is the transactions demand for money, F is income, P is the price where level and 1 is the short-run interest rate referring to the opportunity cost of holding transactions money. Assuming a multiplicative form for ( ) the demand function takes the form '
A,jnw Ayapxzix,
or
ln
MD
=
( 19.46) ( 19.47)
atj + aj ln F + aa ln P + as ln 1,
where ln stands for loge and
(xo
=
ln
.4.
For expositional purposes let us adopt the commonly accepted approach modelling (seeChapter 1) in an attempt to highlight some of econometric to the problems associated with it. If we were to ignore the discussion on econometric modelling in Chapter 1 and proceed by using the usual 'textbook' approach the next step is to transform the theoretical model to an econometric model by adding an error term, i.e. the econometric model
is
( 19.48) where mt ln Mt, yr In F), pf ln Pt, it ln It and t1( N/(0, c2). Choosing some observed data series corresponding to the theoretical variables, M, F, =
=
=
=
.v
P and 1, say:
Vf -
M 1 money stock; real consumers' expenditure; J Pt - implicit price deflator of #;,' f, interest rate on 7 days' deposit account (seeChapter 17 and its appendix for these data series), respectively, the above equation can be transformed into the linear regression statistical GM: D-t, =
/.0+ /1h + /2/, + Ist
Estimation of this equation
+
(19.49)
&f.
for the period
1963-1982f:
(T= 80) using
Spilication,
estimation
and testing
quarterly seasonally adjusted (for convenience) data yields 2.896 0.690
/-
0.865 -
.$2
=0.00
TSS
=
0.055 155
24.954,
That is, the estimated
kl
'
ESS
=
=
42
0,9953, 24.836,
equation
l?-i 2.896 +0.690.ff f
=
=0.995
RSS
1
5
=
0.1 18.
takes the form
+0.865/:$
-0.0554 +
t.
( 19.50)
The danger at this point is to get carried away and start discussing the plausibility of the sign and size of the estimated (?).For example, telasticities' have both a we might be tempted to argue that the estimated tcorrect' sign and the size assumed on a priori grounds. Moreover, the tgoodness of fit' measures show that we explain 99.5t% of the variation. Taken together these results that (50)is a good empirical model for the transactions demand for money. This, however. will be rather premature in view of the fact that before any discussion of a priori economic estimated statistical theory information we need to have a well-dhned model which at least summarises the sample information adequately. Well defined in the present context refers to ensuring that the assumptions underlying the statistical model adopted are valid. This is because any formal testing of a priori restrictions could only be based on the underlying assumptions which when invalid render the testing procedures incorrect. Looking at the above estimated equation in view of the discussion of econometric modelling in Chapter l several objections might be raised: The observed data chosen do not correspond one-to-one to the (i) theoretical variables and thus the estimable model might be different from the theoretical model (see Chapter 23). The sampling model of an independent sample seems questionable in view of the time paths of the observed data (see Fig. 17.1). The high /2 (and #2) is due to the fact that the data series for Mt and (iii) Pt have a very similar time trend (seeFig. 17.1(t# and (L')). If we Iook at the time path of the actual ( #'f) and fitted (.f,,)values we notice (explains) largely the trend and very little else (see that fr Fig. 19.1). An obvious way to get some idea of the trend's contribution in .*2 is to subtract pt from both sides of the money equation in an attempt to the dependent variable. 'elasticities'
kindicate'
ttracks'
bdetrend-
19.4
Estimation actual z fitted
10,6 '**
1
0.4
A
z
A -'
10.2
z
10.0
g
z'e
'sz 9.8
?
z
z.
gs'
z
'-
z
z
--
9. 4
,-
A
9.2
.e
>'
.A
9.0 8.8
1963
1972 Time
1969
1966
1982
1978
1975
f', from ( 19.50).
Fig. 19. 1. Actual .J4 ln Mt and fitted =
9.9
9. 8 /
/
;
/
k/ N N.
actual
.../
9.7
x N. xN / w
d '-
z
j
I
v
/
r
<.
z v z'h-
l
z.s h
sv'''.'X
/
/ !
.x
/ l / l l 'hu
9.6
1963
.h.
1966
1972 Time
1969
Fig. 19.2. Actual
.pz
=
1975
lj 1
N/ u /
-..v
/ .e-'
1978
/
/
f itted
/
1982
ln (M P)t and fitted ft from ( 19.5 1).
detrended dependent ln Fig. 19.2 the actual and fitted values of the the The new regression emphasise point. variable (r?1, are shown to equation yielded elargely'
-pt)
tmf-p1)
=
2.896 +0.690.:,
.k2 0.468 =
!
-0.135pt
42 0.447 =
5
sl
-0.055t
+ =0.00155.
Iif,
(19.51)
386
Spification,
estimation
and testing
Looking at this estimated equation we can see that the coefficients of the constant, yf and it, are identical in value to the previous estimated equation. The estimated coefficient of pt is, as expected, one' minus the original estimate and the is identical for both estimated equations. These suggest that the two estimated equations are identical as far as the estimated coefficients are concerned. This is a special case of a more general result related to arbitrary linear combinations of the xzs subtracted from both sides of the statistical GM. In order to see this let us subtract y'xf from both sides of the statistical GM: .s2
t
or
-T'Xf
(#'-T')Xf
=
),t
(19.52)
+ ut
j*zx + ut,
=
t
in an obvious notation. It is easy to see that the non-systematic component as well as c2 remain unchanged. Moreover, in view of the equality *
=
(19.53)
,
where *
* =y -X#
-+
'-'*
j =(X X) '
,
-
l
X y# '
=#-y,
-'
we can deduce that sl
1
*'*
=
1
F- k
T
On the other hand, *2
=
1
'
= -
kl is not
invariant
'
-
t' +,y+-
(19.54)
'
k
2.
#
,z'y-+2
to this transformation because
(19.55)
.*2
of the As we can see, the dependent variable equation is less than half of the original. This confirms the suggestion that the trend in pt contributes significantly to the high value of the original kl. It is important to note at this stage that trending data series can be a problem when the asymptotic properties of the MLE'S are used uncritically (seesub-section (4) below). 'detrended'
(/ J2j jinite sample In order to decide whether the MLE l is a good' estimator of 0 we need to consider its properties. The tinite sample properties (seeChapters 12 and 13) will be considered tirst and then the asymptotic properties. (3)
Pvoperties of the MLE
-
l being a MLE satisfies certain properties by definition: For a Borel function hf ) the MLE of h(0)is (4).For example, the (1) MLE of logtj'j) is loglj/j) '
19.4 Estimation (2)
387
If a minimal sufficient statistic
T(y)
of it. Using the Lehmann-scheffe theorem the values of yo for which the ratio
exists, then
(seeChapter
oty,'x; (2,,:,.c)-w,c
(
12) we can deduce that
(y-x#)'(y-x#))
expt-aota
p) Dyo X; 0' =
l must be a function
.y
.rg1
w,,a
taagal-
exp
.xj),(y()
g
ty()
-
(19.56)
xpjq
is independent of 0, are ygyll y'y and X'y: X'y. Hence, the minimal lz T2(y)) sufficient statistic is z(y) H (z1(y), (y'y,X'y) and J (X'X) - 2 (y), 1za(y)) T(y). 42 (1/F)(z1(y) -zk(y)(X'X)are indeed functions of ln order to discuss any other properties of the MLE of 0 we need to derivethe sampling distribution of 4.Given that / and :2 are independent we can consider them separately. =
=
=
=
=
The distribution of
/=(x'x)-
lx'y
=
/ (19.57)
Ly,
''*
!X) - 1
where L H(X X is a k x T matrix of known constants. That j s, linear functionof the nonnally distributed random vector y. Hence ,
p- NILX/, c2LL')
from N1, Chapter 15,
'v
or
p-'wNp c2(X'X)-
j is a
')
(19.58)
From the sampling distribution (58)wecan deduce the following properties for p (3(i)) / is an unbiased estimator of # since Elp-) #, i.e. the sampling distributionof /1has mean equal to p. - is a fully efflcient estimator of # since Cov(/) c2(X'X)- 1 i e Covtj) achieves the Cramer-Rao lower bound; see (30)above. =
=
The distribution of 42
42 where Mx
=
1
=-
F I
Tdl
c2
-
-
1
-
1
(y-Xj)'(y
-Xj)
Px. From
(Q2)of Chapter
.
'wzzttr Mx),
=-
.
'
=-
F
T
u'Mxu, .
(19.59)
15 we can deduce that
(19.60)
Spaitkation,
estimation
and testing
where tr Mx refers to the trace of Mx (tr A tr Mx
)J'- 1 aii, A: n x
=
tr I - tr X(X'X) - 1X' (since trtxi + B)
=
= F- tr(X'X) - 1(X'X)
(since tr(AB)
=
=
n),
tr A + tr B)
tr(BA))
'-k.
= Hence, we can deduce that T.l
g2(w- k).
x
2
( 19.6 1)
lntuitively we can explain this result as saying that (u'Mxu)/c2 represents the summation of the squares of T- L independent standard normal components. Using (61) we can deduce that Fd2
E
-
T- k
=
.
cz
and
Var
F#2 c
a
=
2(T- k)
(see Appendix 6.1). These results imply that T- k
f)ti2)=
-T
c2 # c2
'
That is:
82 is a biased estimator of c2; and 82 is not a jll), efhcient estimator of c2. However, 3(ii) implies that for
(3(ii)) (4(ii))
sl
(T- k)
1
'
( 19.62)
=
F
S2
-
k
2
(Ty z 'v
/()
(19.63)
and E(Sl) cl, Vartyzj (2c4)/(F- k) > (2c4)/F - Cramer-Rao bound. That is, is an unbiased estimator of c2, although it does not quite achieve the Cramer-Rao lower bound given by the information matrix (30)above. lt turns out, however, that no other unbiased estimator of c2 achieves that bound and among such estimators has minimum variance. ln statistical inference relating to the linear regression model is preferred to :2 as an estimator of c2. The sampling distributions of the estimators p- and involve the =
=
.s2
.:2
.$2
.s2
389
19.4 Estimation
c2. ln practice the covariance of is needed to / unknownparameters # and of the estimates. From the above analysis it is known assessthe that 1 'accuracy'
Cov(/)=c2(x'x)-
(19.64)
5
which involves the unknown parameter c2. The obvious such a case is to use the estimated eovariance ovt#-)=.$2(X'X)-
way
to proceed in
( 19.65)
1 .
The diagonal elements of ov()) refer to the estimated variances of the eoefficient estimates and they are usually reported in standard deviation form underneath the coefficient estimates. ln the case of the above example results are usually reported in the form mt 2.896 +0.690./,, +0.865p, ( 1.034) (0.105) (0.020)
(19.66)
-0.055,
+
=
5.2
72
=0.9953,
f,
(0.013) (0.039)
F= 80.
1og L= 147.412,
:=0.0393,
=0.9951,
Note that having made the distinction between theoretical variables and observed data the upper tildas denoting observed data have been dropped for notational convenience and R2 is used instead of 112in order to comply with the traditional econometric notation.
(4)
Propevties of the MLE
v
(/
EEE
'2)
asymptotic
-
,
is the fact that under certain regularity An obvious advantage of MLE'S conditions they satisfy a number of desirable asymptotic properties (see Chapter 13). P
0v
Consistency
-.+
0)
Looking at the infonuation matrix
(i)
(30)we can deduce that:
42 is a consistent estimator of
lim
#K(l'2
c2j <
-
:)
c2, i.e.
1,
=
T-+ x
since MSE(#2)
-+
0 as T-+ cti.' and -
lim(X'X) F-+
1
EEE
x
limw- #r(l/ y
lim
F-,
y
j
-j1
< )) =
1, i.e.
xfx;
1 =
0,
t
J is a consistent
estimator
of
b.
390
Specification, estimation and testing
Note that the above restriction is equivalent to assuming that c'(X'X')c ) for any non-zero vector c (seeAnderson and Taylor (1979/. The above condition is needed because it ensures that
Covt#)
(2)
ln order
0
-+
(x''z'(I,.
Asymptotic normality
I to be
of
(seeChapter
as T-+ ct.l
asymptotically
12). -1))
-x(O,I.(p)
-p)
-+
normal we need to ensure that
1im 1 Iw(p) 1.(p) F-+ -gx. =
Given that J.(c2)
exists and is non-singular.
(i)
x'/wtdz
-c2)
x(0,2c4).
-
Moreover, if limw- .(X'X/T)
.v''r(/
and non-singular
then
1).
-
,v
(19.68)
Qx is bounded
=
x(0,c2Q
-#)
1/2c4 we can deduce that
=
(19.69)
X
X
/
and 42 we can deduce F ro m the asymptotic normal distribution of asymptotic unbiasedness as well as asymptotic efficiency (seeChapter 13).
(3)*
(&
Strong consistency
a .S
.
0)
-,+
a S. .
-2 iS a Strongly
consistent
G
lim
Pr
T-+ a.
+2
c
=
estimator
2
c
(J2
of c2
-+
2)
t)-
,
ie .
.
( 19.70)
1.
=
Let us prove this result for sl and then extend it to d2, (y2
c2)
-
l
u'
=
P u F k
c2
X
-
-
where w H(w1, that
wc,
.
.
.
,
ww-k), w
T- k
1
-
F (w/
=
-
T
k tr 1
-
c2),
H'u, H being an orthogonal
=
(19.71) matrix, such
F-k
1-1'41 Px)l1 -
=
diagt 1, 1,
.
.
.
,
1, 0,
.
.
.
,
0).
Note that w .vN(0, c2Iw-k) because H is orthogonal. Since F(w/ 2c4< (x) we can apply Kolmogorov's SLLN and S(w/ -c2)
=0
-c2)2
=
,
(see
19.4 Estimation
391
Chapter 9) to deduce that j.
a.s.
-F E(w/ , Using the fact that -c2)
('2
-
c2)
-+
jg1 (w/
c2) +-
-
T
,
s2.
....+
w-k
j
=-
a,s.
.s2
0,
-
k
( 19.72)
(7.2
T a. S
.
and the last term goes to zero as T-+ cc, 42 c2. (ii) j is a strongly consistent estimator of j (#--+
X Y' '
m.
p) if
-+
< C, i Ixjgl x'x
T m
(#
-
f)
=
=
1, 2,
Z t
z'
,
.
k, t
,
.
(X'X)- 1X'u
jg
=
x f x/t T
f
J
1 2,
=
for all T
=0
- 1
.
is non-singular
and F(xftlr)2 Since f2txttltl SLLN to deduce that XIX/ r
.
-T' Z sut
=
xJc2<
.
C-constant; and
,
.
(19.73) 1
1
) xfjkf.
7
(19.74)
l
we can apply Kolmogorov's
w,
a.s. ->
0.
( 19.75)
t
Ixftxpl
Note that (1) implies that < C* for i C* being a constant. It is important to note that the assumption
=
1, 2,
.
.
k, r, s
,
.
=
1, 2,
.
.
.
,
1 xrx'r =Qx < :z) and non-singular, lim F t r-, x needed for the asymptotic normality of / is a rather restrictive assumption because it excludes regressors such as xf, t, t 1, 2, T) since -
)
=
=
x,?, =*6
.
.
.
,
T(T + 1)(2T + 1) O(F3)
(19.76)
=
t
Zl
(see Chapter 10)sand limw- x E(1/F) x/J (x). The problem arises because the order of magnitude of x,?jis higher than O(T) and hence it goes to infinity much quicker than T The obvious way out is to change the factor UT in x/'Tl1)so as to achieve the same rate of convergence. For example, in the above case of the regressor xft t UF3 in order to ensure that limw- .E(1 T3) xjq < ). The we need to use question which naturally arises is how can we generalise this particular result? One of the most important results in relation to orders of magnitude =
:1
-
=
Zf
Spification,
estimation
and testing
big as its is that every random variable with bounded variance is c/ < tz, then Zi Op(cf) standard deviation-, i.e. if Varlzf) (seeChapter 10).Using this result we can weaken the above asymptotic normality result (69) to the following: tas
=
=
Lemma Ft??-the linear regression Aw
model as specied
v
Z xjx;
=
(F)
t,li i
Qw
l=1
1
D ?- AwD /.
1
as y'..+ w
w
...+
=
in Section 19.2 above let
information increases 2 --+0, i (li(r)
Xfw..h 1
i
=
12 ,
,
.
wl
.
.
,
T);
k
tnt? individual observation dominates the summationl; lim
F-+ z
Q, Q < =
vz
and non-singular,
tben D T (/-#)
-
X
Anderson txstde
19.5
Spification
x(0,c2Q-
')
(1971:. testing
testinq refers to tests based on the assumption of correct That specification. is, tests within the framework specified by the statistical model in question. On the other hand, misspecscation testing refers to testing outside this specified framework (seeMizon (1977:. Logically, misspecification tests precede specification tests because, unless we ensure that the assumptions underlying the statistical model in question are valid (misspecificationtests), specification tests (basedon the validity of these assumptions) can be very misleading. For expositional purposes the estimated money equation of Section 19.4 will be used to illustrate some of the specification tests discussed below. It must be
Specihcation
19.5 Specification testing
393
emphasised, however, that the results of these tests should not be taken seriously in view of the fact that various misspecifications are suspected (indeed, confirmed in Chapters 20-.22). ln practice, misspecification tests are used first to ensure that the estimated equation represents a well-defined estimated statistical GM and then we go on to apply specification tests. This is because specification tests are based on the assumption of correct specification'. hypothesis-testing Within the Neyman-pearson framework a test is defined when the following components (see Chapter 14) are specified: (i) the test statistic z(y); (ii) the size (x of the test; (iii) the distribution of z(y) under Ho; the rejection (or acceptance) region; (iv) (v) the distribution of z(y) under HL. (1)
Tests velating to c2
As argued in Chapter 14,the problem of setting up tests for unknown largely exercise appropriate pivot related to finding in an an parameters is c2 unknown parameterts) question. of ln the in the case the likeliest quantity candidate must be the
'good'
(T-/)
S2
2
'v
(19.77)
/().
(Ty z
Let us consider the null hypothesis Hv: c2 /2() (cg-known) against the alternative hypothesis Sj : c2 > c()2. Common-sense suggests that if the estimated c2 is much bigger than c we will be inclined to reject Hv in favour of H3, i.e. for sl > c where c is some constant considered to be enough', we reject Hv. ln order to determine c we have to relate this to a probabilistic statement which involves the above pivot and decide on the size of the test a. That is, define the rejection region to be =
kbig
C1
2
=
s y: (T- k) z > cz
,
c
( 19.78)
where cx is determined by the distribution of 2
z(y)=(Fi.t).
k)
z
under Ho,
( 19.79)
394
Spification,
estimation
and testing X
dz2(F-k)
a.
=
C2
This implies that ln the case of the money example 1et us assume c/ 155, ca 85.94 for a 0.05 and the rejection region takes the since sl form (F- klsl > 85.94 tl'j y: =0.001.
=0.00
=
=
=
J
a
.
-:0
and hence Ho is rejected. In order to decide whether this is an test or not we need to consider its power function for which we need the distribution of z(y) under Sl. In this case we know that toptimum'
( 19.80) (
'v
reads
:distributed
zty) =
under S1'), and thus
c/
zoty)
'''z
H, 'w
o.
,
..a' z
(T-k),
(19.81)
because an affine function of a chi-square distributed random valiable is also chi-square distributed (seeAppendix 6.1). Hence, the power function takes the form
.#4c2) Jar
z(y) > cz
=
2 ti'o 2 ; c > coc Gz
r
dz 2 (F- k).
= (?a(o.j,/'o.2)
(19.82)
The above test can be shown to be uniformly most powerful (UMP); see Chapter 14. Using the same procedure we could construct tests for: Hv.. c2
=
c2()against =
or
Ho'. c2
=
cj against
c/* x 0
: c2 <
ty.'z(y) < ca*),
C*j (ii)
ffj
=
Sl
:
c2()(one-sided) with a
c2 # cj
(y:z(y) ga
cl
=
dz2(F-
>
)),
GO
dz2(F-k) b
( 19.83)
(two-sided)With
or z(y)
dz2(F-k)=
k)
-X
=
a
i
.
(19.84)
395
19.5 Specilication testing
The test defined by C.'tis also UMP but the two-sided test defined by C1* is UMP unbiased. Al1 these tests can be derived via the likelihood ratio test
procedure.
Let us consider another test related to c2 which will be used extensively in Section 21.6. The sample period is divided into two sub-periods, say,
and
t e: T 1
(1 2
=
,
.t
t s T2
=
,
.
.
T) + 1,
T1)
,
.
.
.
F),
,
.
whereT- Tk Tc, and the parameters of interest 0- p, c2) are be different. That is, the statistical GM is postulated to be =
yt pkxt+
1&,
=
and
#r2Xl
#t
=
+ ut,
Var(J',/X,
=
x,)
=
Vartyr/x,
=
x,)
=
allowed to
cf
(19.85)
c2z for t (E T2..
( 19.86)
An important hypothesis in this context is 21 H () t:r c '.
J2
against
=
2
H 1:
2 (7'1
a (7'2
>
co
,
where cll is a known constant (usuallyc 1). lntuition suggests that an obvious way to proceed in order to construct a test for this hypothesis is to estimate the statistical GM for the two subperiods separately and use =
Tk
1
:2 s2= 1 T1 k l : 1 f -
=
and
to define the statistic z(y)
(-6 c? -
l
st?
'vZ
c (T;
sllslz. Given that
=
-/f),
i
=
from (77),and sf is independent of assumption),we can deduce that ((Tc (Tk
-
-
sl/o-l/lj k)s#/(cl/(Tc
-/4)
( 19.87)
1, 2, yl
(due to the
sampling
model
k))1
-
(19.88)
396
Spification,
estimation
and testing
Hence, T4$
s 21 c c 0 sz
=
Ho
#(TI k' V /f1.
'v
'
-
-
This can be used to define a test based on the rejection region C1 ty: r(y) > cy ) where the critical value ca is determined via dF(A I, T2 a, a being the size of the test chosen a priori. It turns out that this defines a UMP unbiased test (seeLehmann ( 1959/. Of particular interest is thecase where ctl 1, i.e. HvL cf c2.z.Note that the alternative Hj : c2:/ca2 < 1 ca2/c2j > 1, i.e. have the can be easily accommodated by defining it as H3 : greater of the two variances on the numerator. =
Jt*',
-/)
=
=
(2)
-
=
Tests relating
to
p
The first question usually asked in relation to the coefficient parameters whether they are statistically significant'. Statistical significance formalised in the form of the null hypothesis: Hv :
pi
H
pi# 0 for
against j
:
=
# is is
0 some i
=
1, 2,
.
.
.
,
k.
Common sense suggests that a natural way to proceed in order to construct a test for these hypotheses is to consider how farfrom zero / is. The problem with this, however, is that the estimate of # depends crucially on the units of measurements used for yt and X,. The obvious way to avoid this problem is to divide the pisby their standard deviation. Since pxNlp, c2(X'X)- 1), the standard deviation of Jf is
)q x/'gcztx'xlJ x'tvartp-
i
1q,
1 where (X/XIJ refers to the fth diagonal element of (X'X) - 1 Hence we can deduce that a likely pivot for the above hypotheses might be .
?'-/'i
(c2(X'X)J1q
N
xo,
1).
(1q.89)
The problem with this suggestion, however, is that this is not a pivot given that c2 is unknown. The natural way to this problem is to substitute its estimator sl in such a way so as to end up with a quantity for which we know the distribution. This is achieved by dividing the above quantity with the square root of ksolve'
((w-,)s2 ,,''(wc-2/()),
' .
'
,
19.5
Specification
testing
( 19.90)
which is a
very convenient
T(y)
pivot. For Ho above
jf
=
H
e
(F- /f).
'v
1
s E()()()f; 1 ?
Using this we can define the rejection
('1
region
> ca ), (z(y)j
.ity:
=
where ca is determined from the dr(F- k)
1
=
l
tables for a given size
a. That
is,
-a.
The decision on optimal' the above test is can only be considered using its power function. For this we need the distribution of zly) under H3, say /f /$9,X # 0. Given that %how
=
- jy Ib l(T-k), 1 Esa (X X)J l I? + 1y) 'r(y)c i E,s(x,xl.y
':()(y) ='
( 19.9 1)
-
,
'ro(r)
a non-central
with non-centrality
l
?
j=
'tT'- k;
-
( 19.92)
),
parameter
1q
tz g(X'X)J
(see Appendix 6. 1). This test can be shown to be UMP unbiased. In the case of the money example above for J1t: jf 0, f 1, 2, 3, 4, =
/1
g(X'X)1
s
1
I :. s ()( X):'3 (1 =
--
?
-
z
= 2.8,
s
42 9 .s .
,
j
=
g(X'X)2-1
/'?.
=
6.5,
.-4.1.
E(X X).t1 ,
.-
0.05 two-sided test thecritical value is cx 1.993,given that we 76 degrees of freedom. Assuming that the underlying assumptions of the linear regression model are valid (asit happens they are notl) we can proceed to argue that al1 the above hypotheses of significance For a size have T- I
=
=
=
398
Specification, estimation and testing
(the coefficients are zero) are rejected. That is, the coefficients are indeed significantly different from zero. lt must be emphasised that the above ttests on each coefficient are separate tests and should not be confused with the joint test: HvL Fj /72 p5 p.k 0, which will be developed next. The null hypothesis considered above provides an example of linear hypotheses, i.e. hypotheses specified in the form of linear functions of #. lnstead of considering the various forms such linear hypotheses can take we will consider constructing a test for a general formulation. Restrictions among the parameters j, such as: =
=
=
=
(i) + /5 135
/72+
/73+ /74+ ks
can be accommodated R#= r,
=
1',
=
1,'
within the linear formulation
ranktR)
=
(19.93)
m
where R and r are m x k (k> m) and m x 1 known matrices. For example, in the case of (iii), R-(0
0 0 1 0
0
r -
() 'j
( 19.94)
.
This suggests that linear hypotheses related to special cases of the null hypothesis Ho1 Rp= r
against the alternative
Hj
# can be :
considered
as
R## r.
ln Chapterzo a test forthis hypothesis will bederived via thelikelihood ratio test procedure. In what follows, however, the same test will be derived using the common-sense approach which served us so well in deriving optimal tests for c2 and #f. The problem we face is to construct a test in order to decide whether # satisfies the restrictions R#= r or not. Since p is unknown, the next best thing to do is use p-(knowingthat it is a estimator of #) and check whether the discrepancies 'good'
Rb- r $$ -
$r
(19.95)
to zero' or not. Since when deriving p-no such restrictions were are taken into consideration, if Ho is true, p should come very close to satisfying these restrictions. The question now is close is close?'. The answer to this question can only be given in relation to the distribution of some test 'close
thow
19.5 Specification testing
399
statistic related to (95).We could not use this as a test statistic for two reasons: it depends crucially on the units of measurement used for yt and Xf; (i) and (ii) the absolute value feature of (95) makes it very awkward to manipulate. The units of measurement problem in such a context is usually solved by dividing the quantity by its standard deviation as we did in (89)above. The absolute value difficulty is commonly avoided by squaring the quantity in question. If we apply these in the case of (95)the end result will be the quadratic form
(RJ -
r)' EVar(R/
r)q - 'IR#-
-
r)
-
(19.96)
,
which is a direct matrix generalisation of (89).Now, the problem is to determine the form of VartR/ Since R/ - r is a linear function of a normally distributed random vector, j, -r).
(Rj-r)
,v
N(Rj
c R(X X)
-r,
(from N1 in Chapter 15). Hence, -
(R #
r)
-
'
R
(1
.
(96)becomes
IR'q - iIR#-
Ec2R(X'X)-
)
-
(19.98)
r).
This being a quadratic form in normally distributed random variables, it must be distributed as a chi-square. Using Q1of Chapter 15 we can deduce that
-.
(R/
'-'
-r)
,
2 1 Ec R(X ) lR, 1- (R # -r) ,x
-
,v
2
z (1,n,),
(19.99)
.
chi-square with m i.e. (16) is distrlbuted as a non-central (R(X'X') - 1R')) degrees of freedom and non-centrality parameter =
(Rj- r)'ER(X'X)--IR'J- t(Rj- r) (F a
(rank
( 19.100)
(see Appendix 6.1). Looking at (99)we can see that it is not a test statistic as yet because it involves the unknown parameter a1. Intuition suggests that if we were to substitute sl in the place at c2 we might get a test statistic. The problem with this is that we end up with (R/
-
r)'(R(X'X) S
2
r) IR/IIIRJ -
5
( 19. 101)
for which we do not know the distribution. An equivalent way to proceed which ensures thatwe end up with a pivot (atest statistic whose distribution
400
Specilication,
estimation
and testing
is known) is the following. Since, S2
(T- k) c
z
'w
2
(T'- k),
(19.102)
if we could show that this is independent of (99)we could take their ratio (divided by the respective degrees of freedom) to end up with an Fdistributed test statistic', see 95 and ()7 of Chapter 15. ln order to prove independence we need to express both quantities in quadratic forms which involve the same normally distributed random vector. From ( 102) we know that
sl
(F- k)
2
u'(I
P 'Y lu
-
=
2
After some manipulation
u'Q u +
P .Y
'
=
X(X'X) - 1X'.
we can express
( 19.103)
(99) in the form
r)'gR(X'X) - R'j - 1(R# r) 1
(Rj-
x
2
( 19. 104)
l
where
Q
X(X'X) - 1R/(R(X'X) - 1R/j - 1R(X'X)- 1X'
=
x
In view of Qx(I
-
Px)
=
*
0. (99)and (102)are independcnt. This implies that
u'Qu+ (R# r)'(R(X'X) - 1R'j -
-
1
(R#
-r)
2
T(y)
=
u (1-Px)u ,
'v
'---
F(m, T- k;
(T'-k)c2
A more convenient form for
1 (RJ r)'gR(X'X) -
=
-
-
.
(19.105)
is
z(y)
-
z(y)
).
a
IR'II- IIRJ r) '
,
( 19.106)
which apart from the factor ( l m) is identical to ( 101), the quantity derived by our intuitive argument. lt is important to note that under Hv, R#= r and J i.e. =0,
Ho
z(y)
x
F(m, T-/().
( 19.107)
Using this we can define the rejection
t)7j =
ty:z(y) > cz)
where a
region to be dF(m, F- k).
=
t' St
The power of this test depends crucially on the non-centrality
(19.108) parameter
J
19.5 Specilication
40 1
testing
E(T- k)(m+ ) as given in (100).ln view of the fact that n1(T- k 2)q (seeAppendix 6. 1) we can see that the larger is the greater the power of the test (ensure that you understand why). The non-centrality Rj (a very desirable parameter is larger the greater the distance 2/ i variance smaller conditional The power also and the feature) the c 1'n, p2 of order Fln k. depends on the degrees freedom vl to show this well-known relationship the between the F and beta explicitly 1et us use enables which distributions us to deduce that the statistic '(z(y))=
-
.-.r
.
=
z*(y) E''1'r(y)1
+
is distributed as non-central terms of z*(y) is
beta
E)J.'1'r(y)
=
k#(#) #r(z*(y) =
=
e
- (j
'2) I
(
,i
1.t'
7
''-'/
)-( /! =
J?al
(19.109)
(seeSection 21.5). The
power function in
cz*)
>
?
=
O
1
..2)
.v.
t
*
E(v1
c:
-
1+
U( 1*
-.
vb
.
lt-tyvl + I
*
/
l pz
2 )- 1
jjg +
,4.zva.l
( 19.110) (see Johnson and Kotz (1970/. From (l 10) we can see that the power of the test, ceteris paribus, increases as F-k increases and m decreases. lt can be shown that the F test of size a is UMP unbiased and invariant to transformations of the form: * (ii)
y*
=cy
=
where
y + pv,
pv e: (40.
(for further details see Seber (1980/. of the F-test is that it provides ajoint test One important for the m hypotheses Rj= r. This implies that when the F-test leads us to reject Hv any one or any combination of these m separate hypotheses might be responsible for the rejection and the F-test throws no light on the matter. In order to be able to decide on this matter we need to consider simultaneous hypothesis testing; see Savin (1984). As argued in Chapter 14, there exists a duality relationship between hypothesis testing and confidence regions which enables us to transform an optimal test to an optimal confidence region and vice versa. For example, the acceptance region of the F-test with R h; k x 1, r= h'#0, #0known, 'disadvantage'
=
1h'/ b',01 -
C0(/0)-
y:
<cu k---WAF7XYL-X/E ( ) 1 .
(19.112)
402
Specification, estimation and testing
can be transformed into a (1 h,/ -s.v''p'(x'x) c(y) -
=
) confidence interval
t#:
-
1h(1),
lhjca
(19.113) Note that the above result is based on the fact that if
then
V''v F( 1, T'- k)
Qvxr(F-
k).
(19.114)
A special case of the linear restrictions R#= r which is of interest in econometric modelling is when the null hypothesis is Ho3 #(1) 0 against Sj =
:
#(1) + 0,
where /91)represents a1l the coefficients apart from the coefficient of the constant. ln this case R 1 and r=0. Applying this test to the money equation estimated in Section 19.4 we get =1k-
z(y) =
tjjo.(ml 248.362
1
-
546 a
5353.904,
a
=
0.05.
This suggests that the null hypothesis is strongly rejected. Caution, however, should be exercised in interpreting this result in view of the discussion of possible misspecification in Section 19.4. Moreover, being a joint test its value can be easily inflated' by any one of the coefficients. In the present case the coefficient of pt is largely responsible for the high value of the test statistic. If real money stock is used, thus detrending Mt by dividing it with Pt which has a very similar trend (seeFig. 17. 1(a) and 17.1(c)), the test statistic for the signilicance of the coefficients takes the value 22.27% a great deal smaller than the one above. This is clearly exemplified in Fig. 19.2 where the goodness of fit looks rather poor.
19.6
Prediction
The objective so far has been to estimate or construct related to the parameters of the statistical GM
h
=
tests for hypotheses
p'xt+ I/f,
(19.115)
using the observed data for the observation period r= 1, 2, T The question which naturally arises is to what extent we can use (115) together with the estimated parameters of interest in order to predict values of y beyond the observation period, say .
l
.'.'.'.=
12 ,
,
.
.
.
.
.
.
,
Prediction
19.6
403
From Section 12.3 we know that the best predictor for yw+i,1= 1,2, is its conditional expectation given the relevant information set. ln the present case this infonnation set comes in the fonu of w./ )Xw+/= xw.;). This suggests that in order to be able to predict beyond the sample period we need to ensure that 9. w+/, I >0, is available. Assuming that Xws, xr..; is available for some lb 1 and knowing that Eluvytlxv-qt= xw-r/) 0 and .
.
.
.9.
=
=
=
H
El-vvst Xw-,
pv-vl a natural
=
x,.+,)
(19.116)
#'x,.+,,
=
predictor for pv +/ must be
lkl t #-xw.,.
(19. 117)
=
-
In order to assess how good this predictor is we need to compare it with the value of y, y,vsk. The prediction error is defined as
actual
e
-y,.-,-w-,-uw+?+(,-#-w)'x,
and
(19.118)
-.,
r +l
evs:
x
N(0, c241 +x'w+/(X'X)- 1xw+/),
(19.119)
since ev..l is a linear function of normally distributed random variables, and the two quantities involved are independent. The optimal properties of # predictor and evwk has the smallest variance among make pv..lan linear predictors (see Section 12.3 and Harvey (198 1:. ln order to construct a confidence interval for ),' a pivot is needed. The obvious quantity 'optimal'
+/
c gl +
ev-vl
x')...,(X'X) -
lxw-j-/ll
( 19. 120)
x x((),
1) tr2.
is not a pivot because it involves the unknown parameter reduce it to a pivot, however, by dividing it by x/ E(T-k).$ since sl is independent of ev-vi, to obtain
We could ((Fklc 1, 1
( 19. 121) see Section 6.3. (Using ....
Prt#xwvg-casv
(121)we can set E1+xw+/(X X) ,;'
,
,
....
up the prediction interval gj
xw.h/qGyw+/<
+x'w,.,(x'x)-1x,.../q)+c ,.u,''E1
,'h
# xw.., ,
1
-a,
(19.122)
404 where
Specification, cx
estimation
and testing
is determined from the r tables for a given
via
As in the case of specification testing, prediction is based on the assumption that the estimated equation represents a well-defined estimated statistical GM; the underlying assumptions are valid. lf this is not the case then using the estimated equation to predict $'vp: can be very misleading. Prediction, however, can be used for misspecification testing pumoses if additional observations beyond the sample period are available. It seems obvious that if it is assumed to represent a well-defined statistical GM and the sample observations t 1,2, Xare used for the estimation of p, then the predictions based on pv.l= #-'xwo/, 1= 1, 2, n1, when compared with should give idea about the validity of the 1, 2. I m, us some y'.:.pt, specification' assumption. correct Let us re-estimate the money equation estimated in Section 19.4 for the sub-period 1963-1980/ and use the rest of the observed data to get some idea about the predictive ability of the estimated equation. Estimation for the period 1963f-1980ff yielded =
.
.
,
.
.
=
.
.
.
mt 3.029 +0.678)?, ( 1.116) (0.113)
+0.863p,
=
R2 RSS
72
=0.993,
=
.
.
,
,
-0.049f,
(0.024)
=0.993,
s
0. 10667,
li,,
+
(0,014)
(19.123)
(0.040)
=0.0402,
1og L
=
127.703.
Using this estimated equation to predict for the period l980ffollowing prediction errors resulted: -0.0317,
t?j
-0.0279,
ez
=
14,
-0.03
t?s
=
e
=
0.0276,
t?a =
=
-0.0193,
e p1 ()
ey
=
-0.02 17,
=0.0457,
1982/1,,the
ezb
-0.0243,
f?g
0.0408,
=
=
=0.0497.
As can be seen, the estimated equation untlerpredicts for the first six periods and overpredicts for the rest. This clearly indicates that the estimated equation leaves a lot to be desired on prediction grounds and re-enforces the initial claim that some misspecification is indeed present. Several measures of predictive ability have been suggested in the econometric literature. The most commonly used statistics are: j.
MSE=-
m t
T' + m
.;
=
.
.
el l
(meansquare
errorl
(19. 124)
19.7
The residuals
MAE
r
1
=
+
m
-
N1
;
y. +. l
=
405
If4I(mean absolute 1 ;. -m t jj +.
1
j -D1
sm
jg r
=
+
,
z
2
)t
l
et1
w.
=
U=
,,2
( 19. 125)
errorl;
.)
1
1
--
-if
#
7.+..
V
&
2 t.*t
m f= w-j-I
(Theil's inequality coefficient)
(19.126)
(see Pindyck and Rubinfeld (198 1) for a more extensive discussion). For the above example MSE= 0.001 12, MAE 0.03201, U 0.835. The relatively high value of U indicates a weakness in the predictive ability of the estimated equation. The above form of prediction is sometimes called ex-post prediction because the actual observations for ),f and X are available for the prediction period. Ex-ante prediction, on the other hand, refers to prediction where this is not the case and the values of X, for the post sample period are kguessed' (in some way). ln ex-ante prediction ensuring that the underlying assumptions of the statistical GM in question are valid is of paramount importance. As with specification testing, ex-ante prediction should be preceded by misspecification testing which is discussed in Chapters 20-22. Having accepted the assumptions underlying the linear regression model as valid we can proceed. to xw t by iw-, and use =
=
kguessestimate'
#
JF
+
l
p-T k T + /
-
l
5
=
12 '
'
.
.
.
ln such a case the prediction error defined by as the predictor of decomposed be into three sources of errors: can )'v-l -J/. w-? + (x7.., #w+.,)',w, uv +(# #w)'xw+g ( 19. 128) ,y.+/.
=
,/
=
?.-s,
one additional to 19.7
-
.,
(?vyt
-
(see (118)).
The residuals
The residuals for the linear regression model are defined by
l,i, / y! EB
-
#'xt
y -X#
,
l
=
1, 2,
in matrix
.
.
.
,
T
notation.
(19 129) ( 19.130)
From the definition of the residuals we can deduce that they should play a role in the misspecification analysis (testing the very important assumptions of the linear regression model) because any test related to
406
Specification, estimation and testing
ty,,/Xr= xt,l6T) can only be tested via f. For example, if we were to test any one of the assumptions relating to the probability model (.VI,/X,
Xf)
=
X(#'xf,
'v
t7'2)
(19.131)
we cannot use y'j because ( 131) refers to its conditional distribution marginal) and we cannot use
#'x,)
(y, -
x.
N(0, c2)
(notthe (19. 132)
because # is unknown. The natural thing to do is to use (y?-/x,), i.e. the T: lt must be stressed, however, that in the same residuals Iif, t 1, 2, of its own' (it stands for y, #'xf), Stands for way as ut does not have a and random variable interpreted should not be as an 1,' - /'xf observable of form but as the ty,'Xf xr) in mean deviation form. The distribution of > y -X#= (1 Pxly (1 Pxlu takes the form =
.
.
,
.
tlife
-
!
bautonomous'
=
=
-
-
c2(1
(19. 133) - Px)), 'v .N(0, where Px is the idempotent matrix discussed in Section 19.4 above. Given, however, that rank (1 Px) trtl Px) T-k we can deduce that the distribution of is a singular multivariate nonnal. Hence, the distributions of and u can coincide only asymptotically if =
-
v(X(X'X) - 1X')
-
=
0
-+
(19.134)
where v(A) maxf,yst7jjt,A Laiji,j.This condition plays an important role in relation to the asymptotic results related to y2. Without the condition (134) the asymptotic distribution of s2 will depend on the fourth central moment of its finite sample distribution (seeSection 21.2). What is more, 1 the condition limw-+w(X'X) 0 or equivalently =
=
=
v(X'X) -
'
-+0
(19.135)
as
verified for xf, 2'. v Looking at ( 133) we can see that the finite sampling distribution of is inextricably bound up with the observed values of X, and thus any finite sample test based on will be bound up with the particular X matrix in hand. This, together with the singularity of (l33l,promptus to ask whether we could transform in such a way so as to sidestep both problems. One way we could do that is to find a Fx (T-/() matrix H such that
does not imply ( 134) as can be
H'(I
-
Px)H
=
A,
where A takes the form ('F-k
A
01,
=
0
0
=
(19.136)
The residuals
19.7
407
being the matrix of the eigenvalues of I Px. This enables us to detine the transformed residuals, known as BLUS residuals, to be -
' H' =
N(0, c2, I.r-k),
'v
(19.137)
i being a (T- k) x 1 vector (seeTheil (1971/. These residuals can be used in misspecification tests nstead of but their interpretation becomes rather difficult. This is because fl is a linear combination of all lils and cannot be related to the observation date r. Another form of transformed residuals which emphasises the time relationship with the observation date is the recursive residuals. The recursive residuals are defined be >'
9
l
0
=
y,l
-
..
bt- 1 recursive
=
l
0
t-
1
1, 2,
=
.
.
.
,
k
( 19. 138)
,.
where is the
for t
j,t -
r k+ 1
1x t ,
(X, lX, -
=
1)
X,
Ieast-squares E&
(x j
,
xz
,
.
.
,
.
x,
1 yt
-
-
-
1
)
,
y
.
.
'
T
!'
( 19.139)
1,
estimator '
.
j0-
of
# with
1 EB
(y j y z ,
,
.
.
.
,
yt
'
-.
j.
)
.
This estimator of j uses information up to t 1 only and it can be of considerable value in the context of theoretical models which involve expectations in various forms. -
#, N(0,c2( 1 + x;(X?0-'jXr0-j) 'v
-
1xj)),
and for P=
'
,
g1 + x;(X,0-
tl* ;* EEE(t?* k+ k+ 1 $
2,
'
#+ x((),c2I.,-k). x
'
1x,0-
*
5
1
tl*)/ F
)-
1
(19. 140)
'
x f (j
>
(19.141)
residuals have certain distinct advantages in over related of misspecification tests the yfs (seeSection to the time dependency 2 1.5 and Harvey (1981:. The residuals play a very important role in misspecitication testing, as shown in Chapters 20-22, because they provide us with a way to test the relating to the process t-vf,/Xr, t e: T). lt is underlying assumptions interesting at this stage to have a look at the time path of the residuals for the money equation estimated in Section 19.4, shown in Fig. 19.3. The residuals exemplify a certain tendency to increase once they start increasing and to decrease once they start decreasing. This indicates the presence of' strong positive correlation in successive residuls (orserial correlation) and The
recursive
Speciation,
408
estimation
and testing
0.10
0.06
-0.05
-0. 10
1964
1967
1970
1973 T ime
1976
1979
1982
Fig. 19.3. The residuals from ( 19.66).
therefore the sampling model assumption of independence seems rather suspect (invalid).The time path of the residuals indicates the presence of systematic temporal information which was by the postulated systematic component (seeChapter 22). tignored'
19.8
Summary and conclusion
The linear regression model is undoubtedly the most widely used statistical model in econometric modelling. Moreover, the model provides the foundation for various extensions which give rise to several statistical models of interest in econometric modelling. The main purpose of this chapter has been to discuss the underlying assumptions in order to enable the econometric modeller to decide upon its appropriateness for the particular case in question as well as derive statistical inference results related to the model. These included estimation, specification testing and prediction. The statistical inference results are derived based on the presupposition that the assumptions underlying the linear regression model are valid. When this is not the case these results are not only
inappropriate, they can be very
misleading.
The money equation estimated in Section 19.4 was chosen to highlight the importance of choosing the most appropriate statistical model by taking into consideration not just the theoretical infonuation but the information related to the observed data chosen. The latter form of information should be taken into consideration in postulating the
Appendix 19.1
409
probability and sampling models as well as the statistical GM. The estimation and testing results related to the money equation might of a well-specified transactions encourage premature pronouncements demand for money. Such conclusions, however- are not warranted in view discussed brefly above. Before any of several indications of misspecification estimation, specification testing or prediction results can be considered appropriate we need to test the underlying assumptions on the validity of which they are based. This is the task of misspecification testing eonsidered in the next three chapters. Once the underlying assumptions g1q-g8(1 are tested and their validity established with the data chosen, the estimated statstical GM is said to estimated statistical model. This. however, does not constitute (1 wt?//-#npf necessarily coincide with the empirical econometric model because the statistical and theoretical parameters of interest are commonly different. estimated statistical model s transformed The well-defined into an reparametrising econometric model in empirical by it terms of the theoretical parameters of interest.
Appendix 19.1 A note on measurement -
systems
A measurement system refers to the range of the variables involved and the associated mathematical structure. This system is normally selected so that the mathematical structure of the range retlects the structural properties of systems such as the phenomena being measured. Different measurement nominal, ordinal, interval and ratio are used in praetice. Let us consider them one by one very briefly. possible are Nominal: ln a nominal system the only relationships (i) whether the quantities involved belong to a group or not without any ordering among the groups. Onlinal: In this system we add to the nominal system the ordering of the various groups (e.g. social class, ordinal utility). lnerval: ln this system we add to the ordinal system the (iii) interpoint distances (e.g. measures of possibility of comparing of the values that Note temperature). any linear transformation Fahrenheit and scales). is legitimate (Celsius taken Ratio: ln the ratio system we add to the interval system a natural origin for the values. ln such a systen the ratio of two values is a
meaningful
value.
In the case of the linear regression model caution should be exercised when using different measurement systems for the variables involved. In general, most economic variables are of the ratio type and in such a case the
410
Specification, estimation and testing
constant term among the regressors is of paramount importance since it estimates a linear function of the means of the variables involved. For this reason tbe ctpnsrfznr term sbould alwaq's be included in a regression among ratio scaled variables because it represents the origin of the estimated empirical relationship. Important
concepts
White-noise error term, exogeneity, parametrisation, residuals, R2, specification tests, F-test, recursive residuals, BLUS residuals, least-squares, measurement system.
2,
.#2 #2
recursive
Questions Compare the linear Gauss linear and the linear regression models (statistical GM, probability and sampling models). Explain the concept of exogeneity in the context of the linear regression model. Discuss the differences and similarities between '(.?f
pt BE
X!
=
x?), p)
EEE
'(.rr,'''c(Xt)).
Explain the role of conditioning of the in the parametrisation statistical model. Explain the relationship between normality, linearity and homoskedasticity. 6. Discuss the orthogonality of the estimated systematic and nonsystematic components and their relationship to the goodness-of-fit measure #2. 2 and #2 7 Compare the goodness-of-fit measures /12 MLE'S sample properties of the l EEE(j 8 State the finite IH(/ MLE'S c2)'. asymptotic properties of 9. State the the and model consistency and + discuss the 10. Consider the ut y jr of (Note.' normality 1. 1, asymptotic 1 t a =J, a J for l.cT( F+ 1),limw- w (j7- 1 1/r2) 7:2/6.) Explain why we need to assume that limv- .(X'X) --1 0 for the consistency of / and no such assumption is needed for the consistency of 42
4.
.l)'.
'
,)
5
=
=
=
7-
-
=
=
Exercises Derive the
MLE'S
of 0 EE (j, c2)'.
=
Additional references
41 1
Derive the information matrix Iw(p) and discuss the question of full > efficiency of j and c Show that az
.
(i)
Z'(#)
(ii)
Px2 Px
(iii)
Px'
(iv)
(1 Px)X 0.
0,'
=
(Px
=
=
X(X'X) - 1X');
Px;
=
=
-
Derive the distribution of p-and c-2 Check whether the variables .X'lt 2', 0 < k < 1 and Xzt =
=
t satisfy the
conditions: lim v-.x
-.
1 T
(X'X)
lim(X'X)-
1
=
Qx<
LJo
non-singular;
=0.
F-y x
Compare these two conditions. HvL c2 cj against /11 : c2 # c2(). a test for Explain how the following linear restrictions can be accommodated within the general form Rj= r. (i) J$1 /72*,
Construct
=
=
(iii)
/71+
/72+ /33
jl
fz /3
=
=
=
1,'
=0.
Show that (R/
-
r)'(R(X'X)- 1R')-
IIRJ- r) (R#
=u'Qu +
-
r)'gR(X'X)- IR') - 1(R#
-r),
where Q X(X'X)- 1R'gR(X'X)- IR'j - 1R(X'X)- 1X' Show that () 2 Q Derive the distribution of pv+: fxv. 1 and use it to construct a prediction interval for yv.v,. Compare the prediction errors with the recursive residuals. Derive the distribution of and explein how can be transformed to N(0, c2Iw-k). define the BLUj residual vector =
=
.
=
'
'v
Additional references
Goldberger (1968)) Judge et al. (1982);Madansky (1976); Maddala Dhrymes (1978); 1977);Malinvaud ( 1970);Schmidt (197$; Theil (1983). (
C H AP T E R 20
The linear regression model 11 departures from the assumptions underlying the statistical GM
ln the previous chapter we discussed the specification of the linear regression model as well as its statistical analysis based on the underlying eight standard assumptions. In the next three chapters several departures and their implications will be discussed. The discussion differs from E1q-g8j somewhat from the usual textbook discussion (see Judge et aI. ( 1982)) because of the differences in emphasis in the specification of the model. $ In Section 20. 1 the implications of having '(yl,/c(Xr)) instead of '(rf, Xf xl) as the systematic component are discussed. Such a change gives rise to the stochastie regressors model which as a statistical model shares some features with the linear regression model, but the statistical inference related to the statistical parameters of interest p is somewhat different. The statistical parameters of interest and their role in the context of the statistical GM is the subject of Section 20.2. ln this section the socalled omitted variables bias problem is reinterpreted as a parameters of interest issue. In Section 20.3 the assumption of exogeneity is briefly considered. The cases where a priori exact linear and non-linear restrictions on 0 exist are discussed in Section 20.4. Estimation as well as testing when such information is available are considered. Section 20.5 considers the concept of the rank deficiency of X known as collinearity and its implications. The potentially more serious problem of collinearity' is the subject of Section 20.6. Both problems of collinearity and near collinearity are interpreted as insufficient data information for the analysis of the parameters of interest. It is crucially important to emphasise at the outset that the discussion of the various departures from the assumptions underlying the statistical GM which follows assumes that the probability =
Snear
4 12
The stochastic Iinear regression model
20.1
and sampling models remain valid and unchanged. This assumption is needed because when the probability and/or the sampling model change the whole statistical model requires respecifying. 20.1
The stochastic linear regression
model
the statistical GM is that the systematic
The first assumption underlying component is defined as
X). (20.1) pt G f(-Pr/X/ An alternative but related form of conditioning is the one with respect to the c-field generated by Xf, which defines the systematic component to be =
pl
'(A 6'(Xf)).
H
The similarities and differences between ( 1)and (2)were discussed in Section 7.2. ln this section we will consider the meaning and intuition underlying (2) variable in Xr, which is as compared with (1).Let xYjt be the first random assumed to be defined on the probability space (S, #( )). c(A-1t) represents the c-field generated by .Y1,, i.e. the minimal c-field with respect to which -Yjt is a random variable. By construction c(A-jr) z / The c-field -Yk,) is defined to be generated by Xt HE (zY1f,Xzt, .%
'
.
.
.
,
k
c(X,)
-
i
U c(A-.f) zzn =
1
Let y, be also defined on (S,,W,P4.)). The J)y,/c(X,)): (,, c(X,)) --+ (R, is defined via
conditional
expectation
,)#)
with respect to c(Xf). This shows that F(yf/c(Xf)) is a random t'ariable Intuitively, conditioning )?fOn c(Xt) amounts to considering the part of the random variable yf associated with all the events generated by Xf. Conditioning on X, x, can be seen as a special case of this where only the event Xl x! is considered. Because of this relationship it should come as no surprise to learn that in the case where D( Xr,' p) is jointly normal the conditional expectations take the form =
=
',
pt f(.r, 'X, x,) lz)EEEuE'l-r,'c(Xr)) EEEE
=
=
(seeChapter J', =
15). Using
,'X,
+ u,,
=
J1
212-21
(20.5)
xr,
cl 2E2-2'X?
(6) we can define the
(20.6) statistical
GM:
(20.7)
414
from assumptions
Departures
-
statistical
GM
where the parameters of interest are 0- p, c2), p- Yc-allaj (7.2 2E2-a1J21. The random vectors (X1, X2, Xw)' > i'k- are assumed cj 1 J1 rank satisfy condition, the rankt) k to for any observed value of 1. The defined by term error =
,
-
.
.
.
,
=
f()?,/oX,)), ut satisfiesthe following properties: M
(20.8)
.)4
-
=
'lF(&,/c(Xf)))
=
f'tF(/t?,/,/c(Xt)))
.E'(ul) Eutplq Elutus)
=0,
(20.9) =0,
.E')S(u,l
=
=
(20. 10) c2
t
,
0,
s,
=
(20. 11)
tzp s,
The statistical GM (7) represents a situation where the systematic component of y, is defined as the part of yf associated with the events c(X!) and the observed value of Xt does not contain a1l the relevant information. That is, the stochastic structure of Xt is of interest as far as it is related to yt. This should be contrasted with the statistical GM of the Gauss linear and linear regression models. Given that Xfin the statistical GM is a random vector, intuition suggests that the probability model underlying (7)should comc in the form of thejoint distribution 1)()'r, X,; #). We need, however. a form of this distribution which involves the parameters of interest directly. Such a form is readily available using the equality D()'f, Xf;
#)
Dq' 'Xt;
=
#:)
'
D(X,;
a),
(20.12)
with 0 (#,c2) being a parametrisation of #j. This suggests that the probability model underlying (7) should take the form EEE
(20.13)
where
(20.14) D(Xr;
(det E 2 2 )kyc (2zr)
#c)
=
(see Chapter 15) and The random
#c are
vectors
X1,
expt --l.x,Eca ,
-
l
xt )
(20. 15)
said to be nuisance parameters (notof interest). Xw become part of the sampling model .
.
.
,
which is defined as follows: (Zl Zc, ,
.
.
.
,
Zw) is a random sample
fromD(Zt,.#), t
=
1, 2,
.
.
.
,
T;
The stochastic linear regression model
20.1
respectiyely,
where as usual
(xA'r,,)
z,
EEE
.
lf we collect a1l the above components together stochastic linear regression model as follows: The statistical
GM: yf
=
j'Xr + ut,
l (E T
'c(X,)) and lk, )'f - Et))r 'c(Xf)). pt 0 HE (#, c2), # E z1 az j and (72 cj j - Jj ztz-t o.z j parameters of interest. X is assumed to be weakly exogenous wrt 0, for t 1, 2, No a priori information on 0. Xw)?, ranktt') For :t(' EEE(X1, Xa, k for al1 observable > T of k.
E1q r21
the
we can specify
'tl
=
=
=
g31 E4q E51
the
=
=
.
.
=
,
.
.
.
,
.
'TJ
values
.?)
The probability
model
1
1 2c
cv ( ) (detxccl-feXp --l(X;E2-2'Xr)) k/z t x (2zr) E6(1 (i)
,
0 c(Rk x R +)
.
D(y/X,; p) is normal;
#'x, - linear in x,', vartyf' x )) c2 - homoskedastic-, 0- (j, c ) are time-invariant.
(ii) (iii)
;t),,, c(x,)) c( -
(Z1, Zc,
,
=
j!
model
Ihe sampling
g8(I
=
.
.
.
,
respectively.
Zw) is a random sample from D(Zt;
k), t
=
1, 2,
.
.
.
,
T)
The assumption related to the weak exogeneity of xr with respect to 0 for T shows clearly that the concept is related only to inference on 0 t= 1,2, and not to the statistical inference based on the estimator of 0. As shown below, the distribution of the MLE of 0 depends crucially on the marginal distribution of X2. Hence, for prediction purposes this marginal distribution has a role to play although for efficient estimation and testing on 0 it is not needed. This shows clearly that the weak exogenty concept is about statistical parameters of interest and not so much about .
.
.,
distributions.
Departures
The probability
(
-.v4' 'k'
1 .J'a. ,
.
.
.
from assumptions
statistical
-
GM
and sampling models taken together imply that for y K Eiii (X and X w)' t he l ikelihood function is 1 X c, .?'
,
,
.
.
,
,
T
f-(0 ; y / )
D( #'j Xf; 0) tg-I
'
,
.
=
'
,
D (X f ;
1
:=
# a) .
The log likelihood takes the form
in ( 17) can be treated as a constant as far as the respect to the parameters of interest 0 is concerned. Because of the apparent similarity between ( 17) and the log likelihood function of the linear regression model (seeSection 19.4), it should come as with respect to 0 yields no surprise to learn that maximisation The last component
differentiation
with
1
-
-.
)(x'),
j x x't
=
l
f
!
l
E,
(.?'',,z.)-
1
.,t''y
''
!
(20.18)
and 1
(i*2
=
Ta
V(
-
1
p*xtj a EEI -T
.: -
(20.19)
'''b
.
in an obvious notation. The same results can be derived by defining the likelihood function in terms of D( T'f, X,,' #) and, after estimating cl j, c1c,
thes: estimators and the invariance property of MLE'S to c2 - Yc-21 EEE estimators and the corresponding for j cl ccj eonstruct p MLE'S of jand 0-2 tzl2E2-21caj. Looking at ( 18)and ( lglwecan see that these differ from the corresponding estimators for the linear regression model #= 1 in so far as the latter include the observed value xr (X'X) - X'y. 42 ( 1'Fl' of random the X,, as above. This difference, however. implies vector instead 4*2 and longer are no a linear and a quadratic function of y and thus that p-. results relation distributional in to and 42 cannot be extended to j* the * and J.*2 That is, are no longer normally and chi-square and respectively. ln distributed, fact, the distributions of these estimators are analytically tractable at present. As can be seen from ( 18) and (19) they not complicated functions of normally distributed random variables. are very which naturally arises is whether we can derive any The question without knowing their distributions. Using the properties of p-*and SCE3) properties SCE 1-SCE5 (especially on conditional expectations with c-field Section 7.2) we can deduce the following'. (see respect to some Ecc- using
-
=
/
/*
'*2
/*
=
1 1 ( 1.'.1''4- k''lb:t'p + uj j-h (,/''./') - #'''u =
(20.20)
The stochastic Iinear regression model
20.1
# + 'I).4'';'')- ',f.''E'(u = #,
'(#*) if
'g(;??''t?)
E(u/c(J'))
=
1,'71
-
< x?,
That is,
=0.
COvll
**
c(t?'))(l
'E.E'(#*/c(.(F))1
=
)O
Ep
(20.21) GM
since by the construction of the statistical #* is an unbiased estimator of (. R*
A*
A*
>*
-#)
-#)(#
ELEC'
/ =
(7),
-#)(/
/) /c(?)1 '
-
1,'?''A'(uu' c(,'.F))t'(;?''t?') -
= FI)-''.??'') o'lE.J-'.'jt -
=
1
11
(20.22)
if E-/''.I) - exists, since ftuu//ct.'F' )) czlw. Using the similarity between the log likelihood function (17)and that of the linear regression model we can deduce that the information matrix in the present case should be of the form 1
=
E(I'-% I1(p)
G
=
0
2
(20.23)
.
T
0
4
2c
estimator of (. j* is an Using the same properties for the conditional expectation of c2 can show that for the MLE
mcient
This shows that
operator
we
'*2
.E'(#*2) Sg.E'(#*2 c(.?'))j =
1 = T Fgfu/Mxu 1 = -,jr Sgftr 1 =
''
Mx) c2,
This implies that although
defined by
is unbiased.
,
1 =
'.
czsttr
*/c(kY-))(j
,
where Mx 1
,
o
F
c(.?)(1
Mxuu'/c(.?))1
cuftr
T-k
=
1 Fgf't* w
=-
'rtr
=
F
Iw - ''tl'.Il .
=
-
.-
j
.,
i
Ma.E'(uu'/c(.'?/))1 1
Iw tr(tf'.'') - ,')F/,jF)) -
for al1 observable
values
of ,?J
d'*2 is a biased estimator of
c2
(20.24)
the estimator
418
from assumptions - statistical GM
Departures
Using the Lehmann-scheffe theorem (seeChapter 12) we can show that > (y'y,1'il, I'y) is a minimal sujhcient statistic and, as can be seen from (18)and (19),both estimators are functions of this statistic. Although we were able to derive certain finite properties of the MLE'S J* without having their distribution, no testing or confidence regions and are possible without it. For this reason we usually resort to asymptotic theory. Under the assumption Tty,
,+)
'*2
lim F-+
E(.I'Ij
.
'
=
F
Qxx<
cfz
and non-singular,
(20.26)
we deduce that
x'''z'(/*
-
.N0,c2Qx4),
(20.27)
,v
N(0, 2c*).
(20.28)
-#)
UT(+2
c2)
-
These asymptotic distributions can be used to test hypotheses and set up condence regions when T is large. The above discussion of the stochastic linear regression model as a separate statistical model will be of considerable value in the discussion of the dynamic linear regression model in Chapter 23. In that chapter it is argued that the dynamic linear regression model can be profitably viewed as a hybrid of the linear and stochastic linear regression models.
20.2
The statistical
parameters
of interest
The statistical parameters which define the statistical GM are said to be the statistical parameters of interest. In the case of the linear regression model these are #= E2-)lc1 and c2 cl 1 cj aE2-zllaj. Estimation of these statistical parameters provides us with an estimated data generating mechanism assumed to have given rise to the observed data in question. The notion of the statistical parameters of interest is of paramount importance because the whole statistical analysis around these assumptions defining the linear parameters. cursory look at E1j-E8q regression model reveals that al1 the assumptions are directly or indirectly related to the statistical parameters of interest 0. Assumption E1jdefines the systematic and non-systematic component in terms of 0. The assumption of weak exogeneity (3q is dened relative to 0. Any a priori information is introduced into the statistical model via 0. Assumption g5qreferring to the rank of X is indirectly related to 0 because the condition =
-
trevolves'
rank(X'X)
=
k
(20.29)
20.2 The
statistical
parameters
of interest
419
is the sample equivalent to the condition rankttzz)
=
k,
(20.30)
required to ensure that E2c is invertible and thus the statistical parameters of interest 0 can be defined. Note that for T > k, ranklx) rank(X'X). Assumptions E6qto (8(Iare directly related io 0 in vi ofthe fact that they w defined in of a1l D(y,,/Xt; 0). terms are The statistical parameters of interest 0 do not necessarily coincide with the theoretical parameters of interest, say (. The two sets of parameters, however, should be related in such a way as to ensure that ( is uniquely defined in terms of 0. Only then the theoretical parameters of interest can be given statistical meaning. In such a case t is said to be identnable (see Chapter 25). Empirical econometlic models represent reparametrised statistical GM's in terms of (. Their statistical meaning is derived from 0 and their theoretical meaning through (. As it stands, the statistical GM, =
y'f =
#'xl+
lzf,
t e: T,
(20.31)
might or might not have any theoretical meaning mapping ' G(t 0)
depending on the
(20.32)
=0,
relating the two sets of parameters. lt does, however, have statistical meaning irrespective of the mapping (32). Moreover, the statistical parameters of interest 0 are not restricted unduly at the outset in order to enable the modeller to test any such testable restrictions. That is, the statistical GM is not restricted to coincide with any theoretical model at the outset. Before any such restrictions are imposed we need to ensure that the estimated statistical GM is well defined statistically; the underlying assumptions g1!-E8qare valid for the data in hand. The statistical parametrisation 0 depends crucially on the choice of Zt and its underlying probabilistic structure as summarised in D(Z; #). Any changes in Zt or/and D(Z; #) changes 0 as well as the statistical model in question. Hence, caution should be exercised in postulating arguments which depend on different parametrisations, especially when the parametrisations involved are not directly comparable. In order to illustrate this let us consider the so-called omitted variables bias problem. The textbook discussion of the omitted variables bias argument can be summarised as follows: The true specification is y X#+ W? + =
:,
E,V
N(0, c,Z1z),
'
(20.33)
420
from assumptions - statistical GM
Departures
but instead y X# =
+ u,
u
N(0, c2Jw)
'w
(20.34)
was estimated by ordinary least-squares (OLS) (seeChapter 2 1), the OLS estimators being j= (X X) 42
Xy
1
=
(20.35)
' '
'r - k
=
y -X#.
-
(20.36)
ln view of the fact that a comparison u
between
(33) and (34) reveals that
WT +E,
=
(20.37)
we can deduce that E(u)
=
Wy 10
(20.38)
and thus A'(/)
'x'wy#o;
-#=(x'x)-
(20.39)
and (20.40) 1
'-'
l .-XIX ?X) - X That is, # and ti 2 suffer from omitted variables bias and y= 0, respectively; see Maddala (1977), Johnston ( 1984), unless W'X Schmidt ( 1976), inter alia. Mx
=
,
.
=0
viewpoint, From the textbook specification approach where the statistical model is derived by attaching an error tenn to thc theoretical model, it is impossible to question the validity of the above argument. On the other hand, looking at it from the specification viewpoint proposed in
Chapter 19 we can see a number of serious weaknesses in the argument. The most obvious weakness of the argument is that it depends on two statistical models with different parametrisations. ln particular # in (33)and (34) is very different. lf we denote the coefficient of X in (33)by j= Ea-21J21, the same coefficient in X takes the form: lJ2 lo N2-=
1
-
Ec- lEc
3EL' ,3
1,
where Ec.a (Ycc - EaaYL1Es2), Eza Cov(Wr), Eaa Cov(Xf, Wf), JaI CovtWt, yf) (seeChapter 15). Moreover, the probability models underlying (33) and (34)are Dt.pf Xf; 0) and 1)(y,,,'Xr,Wr,' 0o) respectively. Once this is realised we can see that the omitted variables bias (39)should be written as =
=
=
=
20.3 Weak E )'/-1f
exogeneity
=(X X)
-bo
1J'
t#)
(20.41)
X Wy#0,
,
since E
(u) WT #0,
(20.42)
=
j',..''A'U/ .
where E)..
) refers to the expectation operator defined in terms of D( pt/Xr, Wf;0v). Looking at (41)we can see that the omitted variables bias arises when we try to estimate pv in f,.(
'
y =X#() + Wy+ c
(20.43)
by estimating # in (34)wheres by construction, the context of the same statistical model, E (/)
## po.On the other hand, in
0
-#-
(20.44)
since E
(u) 0
(20.45)
=
vt'.kr
and no omitted variables problem arises. A similar argument can be made for ti2. From this viewpoint the question of estimating the statistical parameters of interest 0v H (#a, )!, c,2) by estimating 0 H (j, c2) never arises since the two parameter sets 0o and 0 depend on different sample ct.yt, X,, Wf t 1, 2, T) and ..F c(y,, Xt, t 1, 2: information, F) respectively. This, however, does not imply that the omitted variables argument is useless, quite the opposite. ln cases where the sample the argument can be very useful in information is the same (.#$ deriving misspecification tests (see Chapters 2 1 and 22). For further dscussion of this issue see Spanos ( 1985b). The above argument illustrates the dangers of not specifying explicitly the underlying probability model and the statistical parameters of interest. By changing probability the underlying distribution and the parametrisation the results on bas disappear. The two parametrisations are only comparable when they are both derivable from the joint distribution, D(Zj ,Zw;#) using alternative arguments. .#-()
=
=
s
.
.
.
=
,
=
.
.
.
,
.t./-)
=
treduction'
,
20.3
.
.
.
Weak exogeneity
When the random vector Xr is assumed to be weakly exogenous in the context of the linear regression model it amounts to postulating that the stochastic structure of Xf, as specified by its marginal distribution D(Xf; #c), is not relevant as far as the statistical inference on the parameters of interest
422
Departures from assumptions
-
statistical
GM
o-p, c2) is concerned. That is, although at the outset we postulate Dt-pt, Xt; #) as far as the parameters of interest are concerned, D(yj/Xt; #1) suffices; note that D(#t, Xf; #) D(.F/Xl; =
#1) D(Xf; #2)
(20.46)
'
is true for any joint distribution (seeChapter 5). If we want to test the exogeneity assumption we need to specify 1)(X,; #2) and consider it in relation to D(yt/X2; #1) (see Wu (1973),Engle (1984),. inter alia). These exogeneity tests usually test certain implications of the exogeneity assumption and this can present various problems. The implications of exogeneity tested depend crucially on the other assumptions of the model as well as the appropriate specification of the statistical GM giving rise to xr, T; see Engle et al. (1983). t= 1, 2, Exogeneity in this context will be treated as a non-directly testable assumption and no exogeneity tests will be considered. It will be argued in Chapter 2 1 that exogeneity assumptions can be tested indirectly by testing the assumptions g6q-E8j.The argument in a nutshell is that when inappropriate marginalisation and conditioning are used in defining the parameters of interest the assumptions g6J-E8(I are unlikely to be valid (see Engle et al. (1983),Richard (1982:.For example the weak a way to exogeneity assumption indirectly is to test for departures from the nonuality of Dtyf, Xf; #) using the implied normality of D(yl/Xf; 0) and homoskedasticity of Vartyt/xr xr). For instance, in the case where D(y,, Xf; #) is multivariate Student's 1, the parameters /1 and #c above are no longer variation free (seeSection 21.4). Testing for departures from normality in the directions implied by D(y,, X,; #) being multivariate t can be viewed as an indirect test for the variation free assumption underlying weak exogeneity. .
.
.
,
ttest'
=
20.4
Restrictions
on the statistial
parameters
of interest 0
The statistical inference results on the linear regression model derived in Chapter 19 are based on the assumption that no a priori information on 0 BE (#, c2) is available. Such a priori infonuation, when available, can take various forms such as linear, non-linear, exact, inexact or stochastic. In this section only exact a priori infonuation on # and its implications will be considered; a priori information on (7.2 is rather scarce.
(1)
Lineav a 'l'ftll'f restrictions
on
p
Let us assume that a priori information in the form of m linear rcstrictions R/= r
(20.47)
20.4 Restrictions
of interest
on parameters
is also available at the outset, where R and r are m x k and m x 1 known Such restrictions imply that the parameter space matrices, ranktlt) values in no longer Rk but some subset of it as determined by where # takes relevant for the statisrestrictions represent information (47). These and regression model of the linear tical analysis can be taken into consideration. In the estimation of 0 these restrictions can be taken into consideration by extending the concept of the log likelihood function to include such restrictions. This is achieved by defining the Lagrangian function to be =?'n.
l0, p; y, X)
const
=
T --j.
1og c a
1
,
2c z ty -
-
.,.
ns'g
.A#l
v n. y - .,h.P? # (R/ ,
-
-
r),
(20.48) where p represents an m x 1 vector of Laqranqe multipliers. Optimisation of (48) with respect to #, c2 and p gives rise to the first-order conditions'. 01 1 (X'y -X?X#) - R'p p = cl
pl t?cz
T
=
-
2c a
(31 = -(R#
+
1 .(y 2c =
(49)with
(20.49)
-X#)'(y -X/)=0,
(20.50) (20.5 1)
0.
-r)
ep
Premultiplying
=0,
R(X'X)-
1
solving for p we get /
(tomake the second term invertible) and
gc2R(X'X)- 1R'j - 1(Rj- r),
=
(20.52)
where J= (X'X)- 1X'y is the unconstrained that the constrained MLE of # is ..-.
...- j-(X #=
and
From
R(
- r
(50)we
=
X)
-
1
R ERIX X)
t
y
-
j
R
,
q
-
1
1
1 =- F
#. This in turn implies
-
(Rj - r)
(20.54)
0.
can deduce that the constrained
:2 Y -
,
MLE of
-x#
-x#
(y '
MLE of c2 is
,
N
(20.55)
(y y -Xj.
-
424
Departures
from assumptions - statistical GM
Properties
of
1-0(J # 5
2)
5
Using these formulae we can derive the distributions of the constrained of # c2 and #. Jand# being linear functions of we can deduce that J
MLE'S
7
j
#
x
(X'X) - 1 R' gR(X'X) - 1 R/) - 1(RjEc2R(X'X)-1R!j - 1 (R#-r)
(#
N
-
r)) Cj 1
Cj z
C.z, Caa
'
(20.56) c 11
=
c2E(x'X) - 1 -(x'X)
1 1 - R'gR(X'X) - 1R'j - R(X'X) -
C12 (X'X)- 1 R'gc2R(X'X) - 1 R'() - 1 > Covt/, =
C22 gc2R(X'X)- IR'II- ' =
Using
(i)
lj
;
Covtj),
#) C'a1, =
Cov(#).
EEE
(56)we
can deduce that When R#= r, Elp-) p and 0, i.e. #- and # are unbiased estimators of # and 0, respectively. # are fully efficient estimators of p and p since their variances achieve the Cramer-ltao lower bounds as can be verified directly using the extended information matrix: '(#)
=
=
Jand
x'x Iw(#, p, c2)
R
-
0
R'
()
0
0
0
(20.57)
T
2c
;y
(see exercises 1 and 2). - Covt#-lj G0 i e. the covariance of the constrained EC0v(#) #- is always less than or equal to the covariance unconstrained MLE j-2 irrespective of whether R#= r holds - MSE(#)j 10 where MSE stands for mean but EMSEI#) 12). Chapter (see error The constrained MLE of c2 can be written in the form -
,
.
-
:2
=
42 +
1 F
-
Given that T'42
--s
x
-
(R# -r)?gR(X'X) -
li = -X(X'X)-
z2(w-/()
MLE of the or not; square
1
RgR(X'X) -
IR'II- 1(R# -r),
(20.58)
IR'II- 1(R/
(20.59)
-
r).
20.4 Restrictions
of interest
on parameters
425
and (Rj-
r)'
IR'I--1 (Rj-
ERIX'X) .7
r)
G
J), z2(m;
'w
(20.60)
where 1 1 (R# -..j'gR(X'X) - R') - (R#
= we ean deduce that 7-:2 - a c
m
k,.
-
(20.61)
,
z
z2(T +
'v
r)
-
(20.62)
),
using the reproductive property of the chi-square (seeAppendix 6.1) and the independence of the two components in (59).This implies F(d2) + c2.
But for
(J().63)
.92
gl (F+
=
f'(#2)
-/()j?g,
m
c2 wjl.w
=
R#= r, since J
0.
=
The F-test revisited
ln Section 19.5 above we derived the F-test based on the test statistic 'r(y)
=
(R/
i 1 (R/ - r)' (R(X'X) - R'j --
/l1 l
z--'
r)
-
(20.64)
--
for the null hypothesis Ho: R#= r
against
Hj
R## r,
:
must be close to using the intuitive argument that when Hv is valid jlR#zero. We can derive the same test using various other intuitive arguments similar to this in relation to quantities like j1J-/)tand being close to of the Fvalid when question A formal derivation is 5). Ho (see more zero test can be based on the likelihood ratio test procedure (see Chapter 14). The abovc null and alternative hypotheses in the language of Chapter 14 can be written as .-rij
tj/j!
Ho1 pgeo,
H,
where 0G
(j,
(F2),
O
(.)
pc(6)1
:
c2):
.t(#,
=
jG
-60,
(F/, c2
g
(i)tl (#, c2j: # G iRk Rj= =
.tf
7
.
R o ).s r,
(7.2
(j
R+
jl .
The likelihood ratio takes the form
#y)
=
max L(0; y) ps eo -----. max L(0, y) pct.k
L (#. ,
=
.-v--
w(,;.ya)- r?y,.-w.a # (a j --jwn--. sj.-- o J.
.
=
=
ttp,. y) (2a) .
(tz )
t?
(?c
-.
r,,a
ti a
(20.65)
from assumptions - statistical GM
Departures
The problem we have to face at this stage is to determine the distribution of 2(y) or some monotonic function of it. Using (58)we can write 2(y) in the form
2(y) /'
Looking
(R/
1 ?-
==
'R'II-1(R/-r)
-r)'gR(X'x)-
-r/2
(20.66)
(T- k)sa --
(66) we can see that it is directly related to (64) whose
at
distribution we know. Hence, 2(y) can be transformed into the F-test using the monotonic transformation z(y)
-27z (2(y) . -
=
T- k
1)
(20.67)
.
This transformation provides us with an alternative way to calculate the value of the test statistic zty) using the estimates of the restricted and unrestricted MLE'S of c2. An even simpler operational form of z(y) can be specified using the equality (58).From this equality we can deduce that '''h
(R#
r) r (RIX X)
-
-
1
R
?,
(1 -
1
''''
(R# - r)
??
=
/' -
(20.68)
(see exercise 4). This implies that z(y) can be written in the form
'r(y) z(y)
uu-uu
=
-
,
T-k
(20.69)
'
-
UU
m
RRSS -URSS URSS
=
T-k .
-
(20.70)
,
m
where RRSS and URSS stand for restricted and unrestricted residual sums of squares, respectively. This is a more convenient form because most computer packages report the RSS and instead of going through the calculation needed for (64)we estimate the regression equation with and without the restrictions and use the RSS in the two cases to calculate z(y) as in (70).
Example Let us return to the money equation
mt 2.896 + 0.690.:, ( 1.034) (0.105) =
R2
log L
=0.995,
=
42
estimated
0.865p,
+
-0.055,
+
in Chapter 19: l'ir,
(0.020) (0.013) (0.04) =0.995,
147.4, RSS
=0.117
.:=0.0393,
52, T= 80.
(20.7 1)
20.4 Restrictions
of interest
on parameters
427
that this is a well-defined estimated statistical GM (a very questionable assumption) we can proceed to consider specification tests related to a priori restrictions on the parameters of interest. One set of such theory a priori restrictions which is interesting from the economic viewpoint is Assuming
ja # 1.
against
and
lnterpreting j2 and ja as income and price elasticities, respectively, we can view Hv as a unit elasticlty hypothesis. ln order to use the form of the F-test (linear restrictions) as specified in (70) we need to re-estimate (71) imposing the restrictions. This estimation yielded lmt
-p1
R2
-
=
yt)
-0.529
=
0.629,
(20.72)
-0.2 19f1+ (0.055) (0.019) (0.087) ,
#2
=0.624,
s
=0.0866,
log 1-= 83.18, RSS 0.58552,
T= 80,
=
Given that RRSS 0.585 52 and URSS= 0. l 17 52 =
we can
deduce that (20.73)
(20.74) Hence we can conclude that Hv is strongly rejected. lt must be stressed, however, that this is a specification test and is based on the presupposition that all the assumptions underlying the linear regression model are valid. From the limited analysis of this estimated equation in Chapter 19 there are clear signs such as its predictive ability and the residual's time pattern that some of the underlying assumptions might be invalid. In such a case the above conclusion based on the F-test might be very misleading. The above form of the F-test will play a very important role in the context of misspecification testing to be considered in Chapters 2 1-23.
(2)
Exact non-linear restrictions
on
ji
Having considered the estimation and testing of the linear regression model when a priori information in the form of exact linear restrictions on j we turn to exact non-linear restrictions.
428
from assumptions - statistical GM
Departures
Consider the case where a priori information comes in the form of m nonlinear restrictions (e.g.pj #c/ja, jl -#c2): =
f(#) 0,
i
=
=
1, 2,
=
.
.
.
,
m,
or, in matrix form: H(#)
=0.
(20.75)
ln order to ensure independence between the m restrictions we assume that ?H(#)
rank
=
(p
(20.76)
rl.
As in the case of the linear restrictions, 1etus consider first the question of constructing a test for the null hypothesis Ho1 H(#)
0 against
=
Hj
:
H(#) + 0.
(20.77)
Using the same intuitive argument as the one which served us so well in constructing the F-test (seeSection 19.5) we expect that when Ho is valid, H(j) :%0 The problem then becomes one of constructing a test statistic
based on the distance f!H(#1
-
0 !) .
Following the same arguments in relation the absolute value we might transform H(#) ECov(H(#))j
to the units of measurement and (78) into
H(#).
(20.79)
Unlike the case of the linear restrictions, however, we do not know the distribution of (79)and thus it is of no use as a test statistic. This is because x'Np c2(X'X) - 1) the distribution of H(j) is no a lthough we know that longer normal, being a non-linear function of j. Although, for some nonlinear functions we might be able to derive their distribution, it is of little value because we need general results which can be used for any nonlinear restrictions. The construction of the F-test suggests that if we could linearise H(/) such a general test could be derived along similar lines as the F-test. Linearisation of H(/) can be achieved by taking its first-order Taylor's expansion at #, i.e.
/
,
f(J)
H(/)
=
H(#) +
pH(#)
p#
(/-#)
tasymptotically
+op(1),
(20.80)
negligible terms' (seeChapter 10)., where op(1) stands for (80) provides us with a linearised version of H(j) at the expense of the approximation. The fact that we chose to ignore a1l the higher terms as
Restrictions
20.4
of interest
on parameters
429
asymptotically negligible implies that any result based on (80)can only be justified asymptotically. Hence, any tests based on (80) can only be asymptotic. What we could not get in finite sample theory (linearity), we get by going asymptotic. Given that
v''r(/ - ,)
-
c2Qx-')
N(0,
(20.81)
,
we can deduce that x''T(H(
-H(#))
N 0, c
-
H(,)
z op -
Qx-
pH(#)
1
'
(20.82)
op
(see N 1 in Chapter 15). This result implies that if we substitute the asymptotic covariance (Cova(H(j))) in (79)we could get an asymptotic test because 'z'I4(/)'Ecov(H(#)()
- 1H(/)
where
j), z2(m,.
'v
(20.83)
x
(:
c2j(''JjC>)ox-
covt/lQ
1(t''Jj>)'j
- H(#)'Ecov(/)q
-1H(,).
(20.84)
Q
(83) cannot be used as a test statistic because # and P
As it stands
unknown and Qx is not available. Given, however, that s2 --+ limw-+.,(X'X) T= Qx we can deduce that t?H(/)
2
.'b'
where
P#
(?H(j)
t?j
X'X - 1 ?H(/) F Jj
'
P --+
covtj)
(20.85)
,,
.
.
t?j
#-j
the test statistic
H(/)'js2(0Hgj(j(x,x)-
u,(y)-
ltt?l-lgjjj-
'u(j) ,...,
zztm;), (20.86)
which can be used to define an a size asymptotic test based on the region =
are
c2 and
?H(#) =
This enables us to construct
C1
c2
1y:1'Pr) >
Gl,
rejection
(20.87)
430
Departures
from assumptions
statistical
-
GM
where
dzztn1l =
,x.
C
This is because H BStyl
z2(r#l).
-
(20.88)
This test is known as the WW/J test whose general form was discussed in Chapter 16. ln the same chapter two other asymptotic test procedures which give rise to asymptotically equivalent tests were also discussed. These are the Lagrange multiplier and the likelihood ratio asymptotic test procedures. The Lagrange multiplier test procedure is based on the constrained MLE of J # instead of the unconstrained MLE J. That is, #-is derived from the optimisation of the Lagrangian function
1(0,#;
=log
y) ?/
t? log
pt?/
t?H(#)
-
=
#-(#- d2) is the
--
/.t(#)
lp
lH(#)
L
P#
= -H4#)
(p =>
L0; y) -#H(#)
-
()
(20.90)
(#; y)
t7#
plp
0,
=
log L
p
(20.89)
H(#
-
0,
(20.91)
MLE of onp, c2). In order to understand what is involved in (89)-491),it is advisable to compare these with (48),(49) and (52)above for the linear restrictions case. ln direct analogy to the linear .
'
constrained
restrictions case the distance
r)/z(# -0)(
can be used as a intuitiveargument 1 T
(20.92) measure of whether Ho is true or not. Using the same as for !hH(/) j above we can construct the quantity -0
-
#(#)'ECOv(#(#))q
-
j
-
p(#)
(20.93)
x
as the basis of a possible test statistic. lt can be shown that
-v,'v pt)
-,,(#))
--
x((,, c2g(f'Hp,t#')Qz'(-'----/?.'4p#
)'j-')
(ae.94)
Restrictions
20.4
(see Chapter
used
C1
431
16). This suggests that the statistic
f-Mtyl
can be
of interest
on parameters
=
#/1/
pH(J)(X X)
(72
,
1
p#
)y : Lsiy)
> ca
p(
y
(j) c z (n1, .
'v
a
(20.95)
x size test with rejection
to construct an =
H(J) '
-.
region
)
(20.96)
and power function izthp) PrLM(y)
dzztnl,'
> ral =
=
Cz
).
(20.97)
ln Chapter 16 it was shown that the Lagrange multiplier test can take an alternative formulation based on the score jnction gr log L)?((0. ln view of the relationship between the score function and the Lagrange multipliers given in (91) we can deduce that we can construct a test for Hv against H3 based on the distance
logL)
.
0
- 0
.
Following the same argument as in the case of the construction LMyj we can suggest that the quantity
( log Lj' (0
Cov
1o2 L(#)
a
t? log
Ltpl
t?p
T
1
7loz Lj '-'
(20.99)
'
t?p
0
should form the basis of another 1
-
-
of lVtyl and
reasonable
test statistic. Given that
x N(0, Iw.(p))
(20.100)
,
(see Chapters 13 and 16) we can deduce the following test statistic: Esy)
--
1 'r
p log L)' t0
Iw)
-
,
log L(#)
Hu
2
-
(0
x
z (m).
(20. 10 1)
This test statistic can be simplified further by noting that Ho involves only a subset of 0 t'ustj) and 1.(p) is block diagonal. Using the results of Chapter
16 we can deduce that Esyl-ji?
log 1-(1, 0)j'
'y
lrjt
g:2(x,x)-
10g .L(1, 0)j.
y
(20. 102)
Departures
from assumptions - statistical GM
The test statistic Esy) constitutes what is sometimes called the ejjlcient score form of the Lagrange multiplier test. The Iikelihood ratio test is based on the test statistic (seeChapter 16): J-aR(y)
=
2(log L(, y) - log .I-(', y)) x. 1
-
z2(m; ),
(20.103)
with a rejection region C1
.fy
f-R(y)
:
=
>
ca ).
(20.104)
Using the first-order Taylor's expansion f-R(y)
T'(#-
cx
This approximation
4)'I,c0)(Jis very
we can approximate
LRy) as
(20.105)
.
suggestive
because
it shows
clearly
an
important feature shared by a1l four test statistics WJtyl, LMy), ESy) and .LR(y). A11four are based on the intuitive argument that some distance l H(/)1) g(/311 E7log L($-,0)l/?,)r and 14 #-- respectively, is to zero' when Ho is valid. Another important feature shared by a11 four test statistics is that their asymptotic distribution (z2(m)under Hv) depends crucially on the asymptotic nqrmality of a ertain quantity inyolved in defining these / dijtancesg x?' F(# - #), ( 1,/x.' Tqlp) - #(#)), ( 1 x7 F) EPlog .L(#, 0)j/J$ and v''F(#- 0) respectively. All thre tests, the Wald ( 14,'),Lagrange multiplier LLM) and likelihood ratio (faR) are asymptotically equivalent, in the sense that they have the same asymptotic power characteristics. On practical grounds the only difference between the three test statistics is purely computational, l4'tyl is based only on the unconstrained MLE p-LM(y) is based on the constrained MLE J and LR(y) on both. For a given example size T; however, the three test statistics can lead to difference decisions as far as rejection of Hv is concerned. For example, if we were to apply the above procedures to the case where H(#) R# r we could show that l4z'(y)
11
,
/)1
41
,
R
F'
Y
'
:
,
tclose
,
5
=
20.5
-
(see exercises 5 and 6).
f.R(y) y LMy)
Collinearity
As argued in Section 19.3 above, assumption ranktx) is directly
E2-21o'al c2 ,
k,
=
related =
cj
ranktx)
-
j =
g5qstating that
to the statistical parameters of interest 0 vja Ya:. This is because gj aEa-.z1gaj) rank(X'X)
(20.106) H(#,
c2)
(#
=
(20.107)
20.5 Collinearity and (106) represents the sample equivalent to ranktrzc)
=
k.
(20.108) (72
When (108)is invalid and Ea2 is singular # and cannot even be defined. verified directly and thus we need to Condition ( 108), however, cannot be relyon ( 106) which ensures that the estimators
/=(x'x)-
'x'y
42
and
1 y'(I -X(X'X)
=-
-
T
l
X
,
ly
(20.109)
of # and c2 can be defined. ln the case where X'X is singular j and 42 cannot be derived. The problem we face is that the singularity of (X'X) does not necessarily imply the singularity of Ec2. This is because the singularity of (X'X) might be a problem with the observed data in hand and not a population problem. For example, in the case where T< k rank(X?X) < k, irrespective of E22 because of the inadequacy of the observed data information. The only clear conclusion to be drawn from the failure of the condition ( 106) is that the of the statistical the estimation sample injrmation in X is inadequate t#' interest c2. The source of the problem is rather more j and parameters difficult to establish (sometimesimpossible). In econometric modelling the problem of collinearity is rather rare and when it occurs the reason is commonly because the modeller has ignored relevant measurement information (see Chapter 26) related to the data chosen. For example, in the case where an accounting identitq' holds among some of the xrs. It is important to note that the problem of collinearity is defined relative to a given parametrisation. The presence of collinearity among the columns of X, however, does not preclude the possibility of estimating another of the statistical GM. parametrisation restriction One such parametrisation is provided by a particular linear combination of the columns of X based on the eigenvalues and eigenvectors of (X'X). .jr
Let P(x?X)P'
=
A
=
diagti-l, 2c,
.
.
.
,
2,,,,0, 0,
.
.
.
,
(20.110)
0)
and P'P= PP' 1, where P represents a k x k orthogonal matrix whose columns are the eigenvectors of (X'X) and ;.1, ;.a, m its non-zero eigenvalues (seeHouseholder (1974/. lf we define the new observed data matrix to be X* XP and #* P'# the associated coefficient parameters we could reparametrise the statistical GM into =
.
=
.
.
,
=
(20.111) The new data matrix X* can be viewed as referring to the values of the
from assumptions - statistical GM
Departures
artificial random variable X1= X,'pf, i 1, 2, k. The columns of X* Xpf are known as principal components of X and in view of defined by X =
.
.
.
,
=
(X'X)pf
fOr
0
=
(see 110), where pj, i
=
1, 2,
X* HE (X*1:X1),
#*
Decomposing
(iii) for t
=
1, 2,
.
.
i
m + 1,
=
.
X1
.
.
,
.
.
.
,
T
0.
=
as y
(20.112)
,
k, are the columns of P, we can deduce that
P'# conformably
=
.
=
in the form
#*
=(a',
/)'
(20.114)
X*a + u, ,:2
being the new parameters where ranktxl) rl, with and data specsc. These parameters can be estimated via =
- =(X/'Xt)-1X1'y
Moreover, in
view of
#= P#* any linear
=
1-2
1
=-
T
which are now
y'(I -X1(Xt'X1)-1X1)y.
(20.115)
relationship
P1 + P2)'
combination
c'b=c'PIa
and
the
we can rewrite
c'# of
+c'Pcy
# is
estimable if c'Pa
=
0 since
=c'P1a,
(20.117)
Using the principal components as the new columns of X, however, does not constitute a solution to the original collinearity problem because the estimators ( 115) refer to a new parametrisation. This shows clearly how the collinearity problem is relative to a given parametrisation and not just a problem of data matrices. The same is also true for a potentially more serious problem, that of near collinearity', to be considered in the next
section.
20.6
Near' collinearity
lf we define collinearity as the situation where the rank of (X'X) is less than collinearity refers to the situation where (X'X) is k, singular or ill-conditioned as is known in numerical analysis. The effect of this near singularity is that the solution of the system dnear'
tnearly'
(X'X)j
=
X'y
(20.118)
to derive j is highly sensitive to small changes in (X'X) and X'y. That is, small changes in (X'X) or X y can lead to big changes in q. As with
20.6
tNear' collinearity
collinearity, this is a problem of insujhcient data information relative to a given parametrisation. ln the present context the information in X is not quite adequate for the estimation of the statistical parameters of interest / and c2. This might be due to insufficient sample information or to the choice of the variables involved (Ecc is nearly singular). For example, in cases where there is not enough variability in some of the observed data series the sample information is inadequate for the task of determining j collinearity is a problem to be tackled. On the and c2. In such cases other hand, when the problem is inherent in the choice of Xt (i.e. a population problem) then near collinearity is not really a problem to be tackled. In practice, however, there is no way to distinguish between these collinearity because we do not know Ec2, unless we are two sources of in a Monte Carlo experimental situation (seeHendry (1984:.This suggests collinearity relative to a given that any assessment of whether parametrisation is a problem to be tackled will depend on assumptions values of the statistical parameters of interest. about the collinearity Some of the commonly used criteria for detecting suggested by the textbook econometric literature are motivated solely by its of J as measured by effect on the :near'
'near'
knear'
ttrue'
'near'
taccuracy'
Cov(#)
c (X X)
=
(20.119)
.
Such criteria include: Simple correlations These refer to the transformation of the (X'X) matrix into a correlation matrix by standardising the regressors using
(xfrv (
kit
=
(-YI'!
f
=
.ff)
-ff
-
a
1)2
,
i
=
1, 2,
.
.
.
,
k.
(20.120)
1
The standardisation is used in order to eliminate the units of measurement problem. High simple correlations are sometimes interpreted as indicators of near collinearity.
(b)
yluxlrv
regressions
436
from assumptions - statistical GM
Departures
and a high value of the multiple correlation coefficient from this regression, Rl, is used as a criterion for tnear' collinearity. This is motivated by the following form of the covariance of h :
/
covt/ h )
c2(x'x)j-;,1
=
c2g( 1
=
Rjzlxkx
-
1
-
(20.122)
else assumed fixed) leads (see Theil (1971)). A high value for RL(everything to a high value for Cov (/ )9 viewed as the uncertainty related to j h. By the same token, however,a small value for xkxh, interpreted as the variability of the th regressor, will have the same effect. N()te that R; refers to
RJ
xkX -,(X'-p,X
-,)
-
1X'-y:x,
=
(c)
(20.123)
.
,
xhxh
Condition numbers
Using the spectral decomposition of (X'X) in ( 110) we can express ( 119) in the form Cov(#) -
2
1
c (X'X)-
=
c2(PA - 1P/)
=
k 0.2
=
i
;::1
.x..z
y 1
9-6j 2,.
,
(20.124)
and thus Var(#f) -
.g
c
=
n
2
pij
'E ;j f=1
,
f
1, 2,
=
.
.
.
k
,
(20. 125)
(see Silvey (1969:. This suggests that the variance of each estimated coefficient /f depends on a1l the eigenvalues of (X'X). The presence of a relatively small eigenvalue ki will these variances. For this reason we look at the condition numbers tdominate'
&j(X'X)
=
2max i ki ,
=
1, 2,
.
.
.
,
k,
(20. 126)
refers to the largest for large values indicating near' collinearity, where zmax Belsley number is large large condition eigenvalue (see et aI. (1980:.How a enough to indicate the presence of near collinearity is an open question in view of the fact that the eigenvalues 21 kk are not invariant to scale changes. For further discussion see Belsley (1984). Several other criteria for detecting collinearity have been suggested in the econometric literature (seeJudge et aI. ( 1985) for a survey) but all these criteria, together with (a)-(c) above, suffer from two major weaknesses'. (i) they do not seem to take account of the fact that near' collinearity should be assessed relative to a given parametrisation; and ,
%near'
.
.
.
,
20.6
tNear' collinearity
none of these criteria is invariant to linear transformations of the data (changesof origin and scale). Ideally, we would like the matrix X'X to be diagonal, reflecting the orthogonality of the regressors, because in such a case the statistical GM takes a form where the effect of each regressor can be assessed separately ckk) and given that E22 diagtccc, caa, =
.
.
.
,
(20.127) tJ'l i
=
k
/'11 l'ny - Fl t;cii
Covtx'flrf),
=
i
rrlx f (Xft), =
i
-
=
1/
plx,
a
2, 3,
.
.
.
,
(20.128)
k.
ln such a case 1x' - (x'x) i bi =
i
y
>
Vartj)
=
c
2
-1
(xJx)
,
(20.129)
with the estimator as well as its variance being effected only by the th regressor. This represents a very robust estimated statistical model because changes in the behaviour of one regressor over time will affect no other coefficient estimator but its own. Moreover, by increasing the number of regressors the model remains unchanged. This situation can be reached by design in the case of the Gauss linear model (seeChapter 18). Although such an option is not directly available in the context of the linear regression model because we do not control the values of Xl, we can achieve a similar effect via reparametrisation. Given that the statistical parameters of interest 0 rarely coincide with the theoretical parameters of interest ( we can tackle the problem at the stage of reparametrising the estimated statistical GM in order to derive an empirical econometric model. It is important to note that statistical GM is postulated only as a crude approximation of the actual DGP purporting to summarise the sample information in a particular way, as suggested by the estimable model. Issues of efficient estimation utilising a1l a priori information do not arise at this stage. Hence, the presence of tnear' collinearity in the context of the statistical model should be viewed as providing us with important information relating to the adequacy of the sample information for the estimation of 0. This information will help us to model by construct a much more robust empirical econometric estimated statistical GM which provide in reparametrising the us with ways without, which close orthogonal transformed regressors to being are however, sacrificing its theoretical meaning. This is possible because there is no unique way to reparametrise a statistical GM into an empirical econometric model and one of the objectives in constructing the latter is to
438
from assumptions - statistical GM
Departures
ensure that the sample information is sufficient for the accurate determination of its parameters. It is important to emphasise at this stage empirical econometric models are not, as some econometric and that time-series textbooks would have us believe, given to us from outside by 'sophisticated' theories or by observed data regularities, but constructed by econometric modellers using their ingenuity and craftsmanship. ln view of the above discussion it seems preferable to put more emphasis robust empirical econometric models with in constructing orthogonal regressors without, however, sacrificing either their statistical or economic meaning. Hence, instead of worrying how to detect collinearity (whichcannot be defined precisely anyway) by some battery of suspect criteria, it is preferable to turn the question on its head and consider the problem as one of constructing empirical econometric models with 'nearly' orthogonal regressors among its other desirable properties. To that end we need some criterion which assesses the contribution of each regressor separately and is invariant to linear transformations of the observed data. That is, a criterion whose value remains unchanged when igood'
'nearly'
tnear'
and
h is mapped into
.b't
=
X, is mapped into X)
=
tk1 ),t + c1
(20.130)
zlcxf
(20.131)
+ c2,
where f?j # 0 and is a (k 1) x k 1) non-singular We know that under the normality assumption ,42
-
Z,
'v
Zt
EB
matrix.
(20.132)
N(m, E),
where l'f Xt
m
,
ml
=
m2
E
,
=
(r1 1
J1 2
J1 1
X22
(20.133)
,
the statistic (2,S),
Z
r
1
=-
T
,
F
)( -
Zf
and
1
s
=
)( (z,
l=1
-2)(z,
-2)'
(20.134)
is a sufficient statistic (seeChapter 15). Under the data transformation and (131) the sufficient statistic is transformed as 2
-.+
A2
+c
(20.135)
ASA',
(20.136)
and S
-->
where A
-
(130)
j (/4)1 ) (cr x0 2
,
c
-
al
z
20.6 ENear' collinearity are k x k and k x 1 matrices. The corresponding parameters are: m E
transformations
Ammc,
-+
(20.137)
AEA'.
--+
on the
What we are seeking is a criterion which remains unchanged under this group of transformations. The obvious candidates as likely measures for assessing t h e separate contr ibutions of the regressors involved are the multiple and partial correlation comcients (see Chapter 15). The sample partial correlation coefcient of y, with one of the regressors, say Arlr, given This represents a measure of the the rest Xaf, is given in equation (15.48). when the effect of all the other variables have been correlation of y'fand tpartialled out'. On the other hand, if we want the incremental contribution of Xjt to the regression of y, on Xt we need to take the difference between the sample multiple correlation coefficient, between yyand Xf and that between yt and X 1r (X, with xYj, excluded), denoted by kl and #2-1, i.e. use uhrbt
-
(2
kl 1 )
-
(20.138)
-
2).
It is not very surprising
(seeequation (15.39)for measuresare directly related
(#2
2- j)
-
=
yz :$( 1
that both of these
via -
2-
,
j
(20.139)
)
p-12.a denotes the sample coefficient of y, and Xqt given the rest Xsf Let kl S) and /(c,a=g1(Z, S).
(seeTheil (1971)),where
partial
correlation
.
=g(2,
We can verify directly that
(/Z, S) g(A2 =
g1(2, S) =gj(A2
+c, ASA') +c, ASA').
(20.140) (20.141)
That is, the sample multiple and partial correlation coefficients are invariant to the group of linear transformations on the data. lndeed it can be shown that 112 is a maximal invariant to this group of transformations That is, any invariant statistic is a function of 2. (see Muirhead (1982)). The multiple and partial correlation coefficients together with the simple correlation coefficients can be used as guides' to the construction of empirical models with nearly orthogonal regressors. The multiple used overall picture of the relative particular correlation in to get an can be statistical of various GM and the contribution the regressors (in both the
440
Departures
from assumptions
empirical econometric
-
statistical
GM
model) using the incremental contributions
i
=
1, 2,
.
,
.
.
k - 1,
(20.142) (20. 143)
which Theil ( 1971) called the multicollinearity effect. ln the present context such an interpretation should be viewed as coincidental to the main aim of constructing robust empirical econometric models. It is important to remember that the computation of the multiple correlation coefficient differs from one computer package to another and it is rarely the one used above. To conclude, note that the various so-called to the problem of such adding collinearity dropping as or near regressors or supplementing the model with a priori information are simply ways to introduce alternative reparametrisations, not solutions to thc original problem. isolutions'
Important
concepts
Stochastic linear regression model, statistical versus theoretical parameters of interest, omitted variables bias, reparametrisation, constrained and MLE's, priori and non-linear unconstrained restrictions, restricted linear a and unrestricted residual sum of sqtlares, collinearity, near' collinearity, orthogonal regressors, incremental contributions, partial correlation coefficient, invariant to linear transformations.
Questions Compare and contrast
(i)
F().'f X,
(ii)
'''''
1 j= (X X) X y, p+
(iii)
42
xE7lyfc(X,));
xf),
=
''-'
,
=
1
-
',
t
t*2
1 =--
T
(.)??') .?
=
.1
'
-
1
.4
.t .
y,
*?*,
T
Compare and contrast the statistical GM's, the probability and sampling models for the' linear regression and stochastic linear regression statistical models. model be tl-et the ebtrue''
yt
=
j'x t + y,w ! + f;f
20.6
Near' collinearity
d the one used be v #'xt + u t lt can be shown that for j' (X X )- X y E(p) p, i.e. # suffers from omitted variables bias; and (i) 741/,) y'wt.' (ii) an
'
f
=
=
.
'
u#
=
Discuss. Explain informally how you would go about constructing an exogeneity test. Discuss the difficulties associated with such a test. MLE'S of p and c2: Compare the constrained and unconstrained
)= j-(x'X)- 1 R' j=(x'x)-1x'y,
42
1 (y
=
g.2
=
-x#
,
r -.
1 T
-
(y
-
(y-x#, -
x jj (y- X#) ,
,
Explain how you would Hv1 Rj= r
(R(X/X)- IR') - 1(Rj-r).
a test for
go about constructing
against
S1 : Rp+ r
based on the intuitive argument that when Hv is true is close to zero. kWhy don't we use the distance 11R#- r) Compare the resulting test with the F-test. Explain intuitively the derivation of the Wald test for '?'
.f
Hv H(j)
=
'Why don't we use
0 against
rH(/)
!
Hj : H(j) # 0.
1 instead of
!
.
IH(/) i as the basis of the
!
! I
.
argument for the derivation?' What do we mean by the statement that the Wald, Lagrange multiplier and likelihood ratio test procedures give rise to three asymptoticallv equitalent lasrs? When do we prefer an asymptotic to a finite sample test? Explain the role of the assumption that rank (X) k in the context of the linear regression model. and their and Discuss the concepts of implications as far as the MLE'S # and 42 are concerned. in the How do we interpret the fact that (X'X) is context of the linear regression model? How do we tackle the problem? =
tcollinearity-
Snear-collinearity'
'ill-conditioned-
442
Departures from assumptions - statistical GM Exercises Using the first-order conditions (49/-(51)of Section 20.4 derive the information matrix (57). Using the partitioned inverse formula, A B'
B
- 1
D
=
E=D
A-
1
A-FE- 1F'
1F' - EIB
-B'A-
EF
c2)j 1 and derive g1w(#, compare its p, and Cc2 in (56). Verify the distribution of -
1
--FE-
=
1
A- IB
various
elements with C1 j C1 2 ,
($-
in(56). vCr ify the equality
cuu' -' (R/ -r)'gR(X'X) - IR') - 1(Rj For the null hypothesis Hv: Rp= r against ff1 : R## r use the Wald, Lagrange multiplier and likelihood ratio test procedures to derive the following test statistics:
?U(y)
==
-r)
.
( 1 i-
y- k
FNT(Y)' L A4tY)
==
.-
LRy)
=
T log
(7-
--CIVKGCXI ) j. suzty) ,
mzty)
1 + T-i
,
respectively, where z(y) is the test statistic of the F-test. Using IzFtyl, LMy) and .LR(y) from exercise 5 show that PU(y);: 2L1(y) )h pA4ly). Note that logtl + z)> z/(1 + z), cylogtl Savin (1982).)
+ z), zb 0)
(see Evans
Additional references
Aitchison and Silvey (1958),. Judge et
al.
( 1982),. Leamer (1983).
and
The linear regression model IlI - departures from the assumptions underlying the probability model The purpose of this chapter is to consider various forms of departures from the assumptions of the probability model; D(J'f/Xt; 04 is normal, E61 (i) E(1', X, xt) j'xf: linear in xf, (ii) Vartyf/x, x,) c z homoskedastic, (iii) =
=
=
=
,
g7l 0- (j1c2)are time-invariant. ln each of the Sections 2-5 the above assumptions will be relaxed one at a time, retaining the others, and the following interrelated questions will be discussed : what are the implications of the departures considered? (a) (b) how do we detect such departures'?, and (c) how do we proceed if departures are detected? It is important to note at the outset that the following discussion which considers individual assumptions being relaxed separately limits the scope of misspecification analysis because it is rather rate to encounter such conditions in practice. More often than not various assumptions are invalid simultaneously. This is considered in more detail in Section 1. Section 6 discusses the problem of structural change which constitutes a particularly important form of departure from E7).
21.1
Misspification
testing and auxiliary regressions
Misspecication testing refers to the testing of the assumptions underlying statistical model. In its context the null hypothesis is uniquely defined as a the assumptionts) in question being valid. The alternative takes a particular fonn of departure from the null which is invariably non-unique. This is
443
444
Departures
from assumptions
probability
model
because departures from a given assumption can take numerous forms with the specified alternative being only one such form. Moreover, most misspecification tests are based on the questionable presupposition that the of the model are valid. This is because joint other assumptions misspecification testing is considerably more involved. For these reasons the choice in a misspecification test is between rejecting and not rejecting the null; accepting the alternative should be excluded at this stage. An important implication for the question on how to proceed if the null is rejected is that before any action is taken the restllts of the other misspecification tests should also be considered. lt is often the case that a particular form of departure from one assumption might also affect other assumptions. For example when the assumption of sample independence (8))is invalid the other misspecification tests are influenced (see Chapterzz). In general the way to proceed when any of the assumptions 1(6(1-1(8) are invalid is first to narrow down the source of the departures by relating them back to the NIID assumption of LZ,, f (E Ir and then respecify the model of the taking into account the departure from NllD. The respecification of the reduction from D(Z1 Za, Zw,' #) model involves a reconsideration to D(y,/Xf; p) so as to account for the departures from the assumptions involved. As argued in Chapters 19-20 this reduction coming in the form of: ,
D(Z j
,
.
.
.
,
Z 7. ; #)
.
.
.
,
v
=
t
Dt Zrs' #) J'''.j .....
j
I
=
tg'ID( yf 'X?
,
t
=
1
) D(Xr '
j
,'
#a)
involves the independence and the identically distributed assumptions in ( 1). The noimality assumption plays an important role in defining the parametrisation of interest 0- (#, c2) as well as the weak exogeneity condition. Once the source of the detected departure is related to one or takes the form of an more of the NIID assumptions the respecification form of reduction. This is alternative illustrated most vividly in Chapter 22 assumption discussed. lt is invalid not where turns out that when (8(1 (8(1is results 19 invalid but the other misspecification only the in Chapter tests are inappropriate as well. For this reason it is advisable in practice are first and then proceed with the other assumptions if to test assumption (8(1 (8(lis not rejected.-f'he sequence of misspecification tests considered in what follows is chosen only for expositional purposes. Slargely'
With the above discussion in mind 1et us consider the question of general procedures for the derivation of misspecification tests. In cases where the alternative in a misspecification test is given a specific parametric form the various procedures encountered in specification testing (F-type tests, Wald,
21.1
Misspecification
testing
445
and likelihood ratio) can be easily adapted to apply in
Lagrange mtlltiplier
the present context. ln addition to these procedures several specific misspecification test procedures have been proposed in the literature (see White ( 1982), Bierens ( 1982), intel. alia. Of particular interest in the present variables' argument which book are the procedures based on the lead to auxiliary regressions (see Ramsey ( 1969), ( 1974), Pagan and Hall ( 1983). Pagan ( 1984), inter /!i/). This particular procedure is given a prominent rolkl in what follows because it is easy to implement in practice of most other interpretation and it provides a common-sense misspecification tests. variables' argument was criticised in Section 20.2 because it The -omitted
-omitted
statistical GM's. was based on the comparison of two This was because the information sets underlying the latter were different. lt was argued, however, that the argument could be reformulated by postulating the same sample information sets. ln particular if both Zw; #) by using parametrisations can be derived from D(Z1 Zc, statistical redtlction GM's then the arguments alternative two can be made comparable. Sttchastic Let ftzf l G 1 ) bo a kector process defined on the probability which includes the stochastic variables of interest. In #( )) space (S, 'non-comparable'
,
.
.
.
.
,
..t
'
that for a given fz.r( .F
17 it was argued
Chapter
;
.y t
E(. .
=
'
,
,/
t
t/'.) + ,
uj
,
defines a general sta' tistical GM .'k)), #f = .E'( f /
I/t
'
-
).'!- E .J',
=
satisfying some desirable orthogonality condition: Espttlts
with
(2 1.4)
t)
properties
by construction
including the
=0,
lt is important to note, however, that (3)-(4)as defined above arejust empty boxes'. These are filled when .tZ,, l e: T) is given a specific probabilistic structure such as NllD. In the latter case (3)-44) take the specific forms: $,r
#'xr +
=
.
/tt* j'x =
(2 1.6)
tt r , LI* f
?
rk'r lX,
=
x,
=
-
.
l
#'x
f,
(2 1.7)
information set being
with the conditioning =
'
) .
When any of the assumptions in NllD are invalid, however, the various properties of pt and lIt no longer hold for pl and ul. ln particular the
446
Departures
from assumptions - probability model
orthogonality condition Eplutt)
(5)is invalid.
The non-orthogonality
#0,
(21.9)
can be used to derive various misspecification tests. If we specify the alternative in a parametric fonu which includes the null as a spial case (9) could be used to derive misspecification tests based on certain auxiliary regressions. ln order to illustrate this procedure let us consider two important parametric forms which can provide the basis of several misspecification tests: 'iplli
g*(x,)-f=1 (b)
(21.0) k
#(x,)=t1 + j'
'-V
=
k
k
J,bixit+ )( )cijxitxjt 1
i
k
k
=
1
jp 1
k
+ l )wil lkdi-lxirx./fxl
i-
'
'
'
'
1
/yj
The polynomial g*(xt) is related to RESET type tests (see Ramsey (1969) and g(x,) is known as the Kolmogorov-Gabor polynomial (scelvakhnenko (1984/. Both of these polynomials can be used to specify a general parametric form for the alternative systematie component:
ltf
=
#'0Xf
+
''o%*
(21.12)
where z) represents known functions of the variables gives rise to the alternative statistical GM rr
=
bLxt+ i''vzl + cf
Zf
-
1,
.
.
.
,
Z1, Xf. This
(21. 13)
,
which includes (6) as a special case under
ffe: yll 0, =
with
A direct comparison -
li ( 1
=
-
:
between
regression #)'X, + SZ) ut (#o whoseoperational form =
HL
),e # 0.
(13)
(21.14) and
(6) gives rise to the auxiliary
(21. 15)
+ cf,
/)?x+ ylzlp + c,
(21.16)
t
can be used to test ( 14) directly. The most obvious test is the F-type test discussed in Sections 19.5 and 20.3. The F-test will take the general form FF(y)
=
( 1
RRSS - URSS F-k*
vgss
sy
(21.17)
447
21.2 Normality
where RRSS and URSS refer to the residuals sum of squares from (6)and ( 16) (or ( 13/, respectively; k* being the number of parameters in (13) and m the number of restrictions. This procedure could be easily extended to the highercentral moments of ) ?t /'Xt s
Elut/xt
(21.18)
r y 2.
x,),
=
For further discussion see Spanos
21.2
(1985b).
Normality
As argued above, the assumptions underlying the probability model are all interrelated and they stem from the fact that Dt-pf,Xf; /) is assumed to be multivariate normal. When D(),,, Xf,' ) is assumed to be some Other multivariate distribution the regression function takes a more general fonn (not necessarily linear),
ElhAt
X!)
=
11(*,X!),
=
and the skedasticity function is not necessarily free of xf, Vartyf/'x,
x,)
=
=g(k,
x,).
(21.20)
Several examples of regression and skedasticity functions in the bivariate case were considered in Chapter 7. In this section, however, we are going to consider relaxing the assumption of normality only, keeping linearity and homoskedasticity. In particular we will consider the consequences of
assuming (J'f,/Xf
=
X)
'v
D(FX,,
(T2),
where D( ) is an unknown distribution, and discuss the problem of testing whether D ) is in fact normal or not. .
'
(1)
Consequences of
non-normality
Let us consider the effect of the non-normality assumption in (21)on the specification, estimation and testing in the context of the linear regression model discussed in Chapter 19. As far as specification (see Section 19.2) is concerned only marginal changes are needed. After removing assumption g6(l(i)the other assumptions can be reinterpreted in terms of Dtj/xf, c2). This suggests that relaxing normality but retaining linearity and homoskedasticity might not constitute a major break from the linear regression framework.
448
from assumptions
Departures
--
probability
model
The first casualty of (2 1) as far as estimation (see Section 19.4) is concerned is the method of maximum likelihood itself which cannot be used unless the form of D ) is known. We could, however, use the least-squares method of estimation brietly discussed in Section 13. 1, where the form of the not needed. underlying distribution is is alternative method of estimation which is historically Least-squares an older the maximum likelihood much than or the method of moments. The least-squares method estimates the unknown parameters 0 by minimising the squares of the distance between the observable random variables yf, r e: T, and ht(0) (a function of p purporting to approximate the mechanism giving rise to the observed values p,), weighted by a precision factor 1 rcf which is assumed known, i.e. '
'apparently'
hto) minV ),,,lk't -
2
(2 1.22)
-
.
psf.l r
It is interesting to note that this method was tirst suggested by Gauss in 1794 as an alternative to maximising what we, nowadays, call the log-likelihood function under the normality assumption (seeSection 13. 1for more details). In an attempt to motivate the Ieast-squares method he argued that:
the most probable value of the desired parameters will be that in which the sum of the squares of differences between the actually observed and computed values multiplied by numbers that measure the degree of precision.is a minimum .
.
.
.
This clearly shows a direct relationship between the normality assumption method estimation. the least-squares of lt can be argued, however, that and applied least-squares method the to estimation problems without can be normality. In relation to such an argument Pearson (1920) assuming
warned that:
methods are theoretically we can only assert that the lcast-squares obey the normal law, accurate on the assumption that our observations Hence in disregarding normal distributions and claiming great . the generality by merely using the principle of least-squares generalisation gained merely been the has at apparent expense of theoretical validity .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Despite this forceful argument let us consider the estimation of the linear regression model without assuming normality, but retaining linearity and homoskedasticity as in (21). The least-squares method suggests minimising
6$)
w (), #,x,)2 ---j. '
=
,
t
=1
G
(21.23)
21.2 Normality
V9
or, equivalently: F
f(#) =
el
0# = -
f
Z (A't=
#'x,)2 (y-X#)'(y
2X
,
(y-X#)
=
(2 1.24)
-X#),
=
1
(2 1.25)
0.
that rankl) Solving the system of normal equations (25)(assuming get the ordinary least-squares (OLS) estimator of # b (X'X) - 1X'y.
l
of c2 is
1 /(b)
1
=
=
k) we
(21.26)
=
The OLS estimator
=
T- k
T- k
(y-
Xb) , (y Xb).
(2 1.27)
-
Let us consider the properties of the OLS estimators b and fact that the form of DFxt,c2) is not known. Hnite sample properties
of b and
l
in view of the
.92
Although b is identical to p-(the MLE of #)the similarity does not extend to the properties unless Dtyf/xf; 0) is normal. Since b Ly, the OLS estimator is linear in y. (a) Using the properties of the expectation operator E ) we can deduce: .E'(b) .E'tb+ Lu) j+ L.E'(u) #, i.e. b is an unbiased estimator of #. (b) czLL' c2(X'X) - 1. .E'(b (c) - #)(b- j)' F(Luu'L') Given that we have the mean and variance of b but not its distribution,what other properties can we deduce? Clearly, we cannot say anything about sujh'ciency or full efflciency without knowing D(.p,,/Xf;0) but hopefully we could discuss relative efhciency within the class of estimators satisfying (a) and (b).The GaussMarkov theorem provides us with such a result. =
'
=
=
=
=
Gausr-Markov
=
=
theorem
Under the assumption (21), b, the OLS estimator of #, has minimum variance among the class of linear and unbiased estimators (fora proof see Judge et aI. ( 1982/. .1
is concerned, we can show that As far as .E'(.2) c2, i.e. l is an estimator Of c2, (d) using only the properties of the expectation operator relative to DFxt,c2). ln order to test any hypotheses or set up confidence intervals for 'unbiased
=
450
from assumptions - probability model
Departures
0=p, c2) we need the distribution of the OLS estimators b and J2. Thus, unless we specify the form of Dxt, c2), no test or/and confidence interval statistics can be derived. The question which naturally arises is to what theory' can at least provide us with large sample results. extent 'asymptotic
Asymptotic distribution of b and Lemma 21
.1
(21),
Unft?r assumption
x/zb p) x(0,czo.y-
-
lim
F
.2
X'X
u-
-+
1)
(2 1.28)
(2 1.29)
Qx
=
F
is flniteand non-singular. Lemma 21
.2
(21)we
Under
x''w(J2
ean deduce that -
c2)
x(0,
-.
tfl-y -
where puyrefers to the fourthcentral to be (seeSchmidt 1976)). ./znffr
1)c4),
lzlomcnr
of 1)(.J,/Xf;
(2 1.30) 04assumed
Note that in the case where Dlyf, Xj',0j is normal
P-43 4
=
N/
G'
7w(.2-c2)
-
x
Lemma 21 Under
x(0,2c4).
(2 1.3 1)
.3
(21) P
b p -+
and
(!/limw-.(X'X)=0)
(2 1.32) (2 1.33)
P
a2 S
-+
G
2 .
From the above lemmas we can see that although the asymptotic distribution of b coincides with the asymptotic distribution of the MLE this is not the case with ?l. The asymptotic distribution of b does not depend on
Normality
21.2
4f1 l
1)(yt,/X,; 04but that of does via p4. The question which naturally arises is to what extent the various results related to tests about 0- (#,c2) (see Section 19.5) are at least asymptotically justifiable.Let us consider the Ftest for Ho R#= r against HL : R## r. From lemma 2 1.1 we can deduce that 1R') - 1) which implies that under Hv : V'F(Rb - r) N(0,c2(RQx'v
(Rb
-
r)
ERIX/X)- 1 R'(l G
2
=
Ho
(Rb
r)
-
'w
a
Using this result in conjunction
zwty) (Rb
1
(21.35)
with lemma 21.3 we can deduce that
gR(X'X) - IR'II- 1 (Rb r)' . c
-
z2(n'l).
-
m.
r)
1 x.
-
x
m
z2(r#
under Hv, and thus the F-test is robust with respect to the non-normality assumption (21) above. Although the asymptotic distribution of zwty) is chisquare, in practice the F-distribution provides a better approximation for a small T (seeSection 19.5)).This is particularly true when D(#'xt, c2) has heavy tails. The significance r-test being a special case of the F-test, c=-,,--,.--2,
zty)
b
i; sv E(xk-')j ,
1
N(0, 1) under Ho:
-
x
/,.-0
(2 1.37)
is also asymptotically justifiable and robust relative to the non-normality assumption (21) above. Because of lemma 2 1.2 intuition suggests that the testing results in relation to c2 will not be robust relative to the non-normality assumption. Given that the asymptotic distribution of depends on pzbor a4= p4/c4 the kurtosis coefficient, any departures from normality (wherea4 3) will seriously affect the results based on the normality assumption. In particular the size a and power of these tests can be very different from the ones based on the postulated value of a. This can seriously affect al1 tests which depend l such as some heteroskedasticity and structural on the distribution of change tests (seeSections 2 1.4-2 1.6 below). ln order to get non-normality robust tests in such cases we need to modify them to take account of /.24. .92
=
(2)
Testing
fovdepartures from normality
Tests for normality
can be divided into parametric and non-parametric whether depending the alternative is given a parametric form or tests on not.
452
Departures
(a)
Non-parametric
from assumptions - probability model tests
The Kolmoqorotwsmirnov
test
tlf/xf,
Based on the assumption that r e: T) is an IID process we can use the results of Appendix 11.1 to construct test with rejection region
C1
(y:Q'T 1> ca)
=
(21.38)
where y refers to the Kolmogorov-smirnov residuals. Typical values of cx are: .01
.05
test statistic in terms of the
.01
a (21.39) ca 1.23 1.36 1.67 . For a most illuminating discussion of this and similar tests see Durbin (1973). The Shapiro-Wilk
test
This test is based on the ratio of two different estimators of the variance c2. 2
n
1V= r
where
li41)
n
Z =
tilrrtl'r-r.f.
-
li(t))
1 '
.
'
2
F
/Z l I =
(21.40)
1
are the ordered residuals,
G lo
if T is even
=-
/
!
% lcj G F
1)
n
=
T- 1
2
if T is odd,
and atv is a weight coefficient tabulated by Shapiro and Wilk samplg sizes 2 < T 50. The rejection region takes the form: C1 where (b)
ty:1z:'< c.)
=
cu are
(1965)for (21.41)
tabulated in the above paper.
Parametric
tests
The skewness-kurtosis
test
The most widely used parametric test for normality is the skewnesskurtosis. The parametric alternative in this test comes in the form of the Pearson family of densities. The Pearson family of distributions is based on the differential equation
d ln dz
-/)
.(z)
=
(z
c +cjz-hczzz
,
(21.42)
21.2 Normality
453
where solution for different values of @,c(), cj generates a large number of interesting distributions such as the gamma, beta and Student's t. lt can be shown that knowledge of c2, aa and a4 can be used to determine the distribution of Z within the Pearson family. In particular: ,
a
cl
=
(a4+
=
(..2)
3)(a3/c-
(21.43)
(414 33)c2/J, 12as ca (2a4 3a3 6) J, J= (1014. (see Kendall and Stuart (1969:.These parameters co
=
-
=
-
-
(21.44) 18)
(21.45)
can be easily estimated using (13 and :14 and then used to give us some idea about the nature of the departure from non-normality. Such information will be of considerable interest in tackling non-normality (see subsection (3/. ln the case of normality cj c2 0 aa 3. Departures from normality within the Pearson family of particular interest are the following cases: (a) cc cl # 0. This gives rise to qamma-type distributions with the chi-square an important member of this class of distributions. For ',
=
=0,a4
u't>
=
=
=0,
Z
(b)
'v
g2(r1),
aa
23
=
(21.46)
,
cl 0, trtl > 0, (r2 > 0. An important member of this class of (x4. 3+ distributions is the Student's !. For Z 1(rl), aa %m 4), m > 4). which are < (r2. This gives rise to beta-type distributions cl directly related to the chi-square and F-distributions. In particular i 1, 2, and Z, ,Zz are independent, then if Zi z2(n1f), =
=0,
'v
=
-
..::0
'.w
=
) ( ) (--11 jZ,)
z
-
z,
z
m--s
B
-
,
,
(2 1.47)
where Bmk/l, rnc/2) denotes the beta distribution with parameters /11 2 and mzjl. As argued above normality within the Pearson family is characterised by aa
(/t3/c3)
and
=0
=
=(p4/c*)
a4
=
3.
(21.48) '
lt is interesting to note that (48)also characterises normality short' (lirstfour moments) Gram-charlier expansion: -..aa(z3
gz) E1 =
-
3z) +
Wta,j.3)(z*-6z2 -
+ 3)q(I)(z)
within the
(21.49)
(see Section 10.6). Bera and Jarque (1982)using the Pearson family as the parametric alternative derived the following skewness-kurtosis test as a Lagrange
454
from assumptions
Departures
-
model
probability
multiplier test: T1(y)
F =
6 where
a
ti=3+
,- g() -
l
((i4 3)2 24
'w
-
s.--
--)((-i
tljl-i
z2(2)
(2 1.50)
a
,i/)'j
,z.--
,.-gt-,) ,z.-j
H'
F
,.'',,?)2j.
(2 1.5 1)
The rejection region is defined by X
C1
-
zlty) 1)y:
dz2(2) a.
>cl,
=
Ca
A less formal derivation of the test can be based on the asymptotic distributions of tia and 44:
v'''r 43
Ho
(21.54)
x(0,6)
-
a
0
x,7r(4.-3) -
With
N,
24).
(2 1.55)
(13
and 44.being asymptotically independent (seeKendall and Stuart (1969))we can add the squares of their standal-dised forms to derive (50)., see Section 6.3. Let us consider the skewness-kurtosis test for the money equation
mt 2.896 +0.690y, +0.865/5 -0.0552 + f, ( 1.034) (0.105) (0.020) (0.013) (0.039) =
R2
=0.995 ,
F= 80,
(ij=
42
=0.995,
s
((i4.- 3)2
0.005,
1og L
=0.0393,
=0.
=
(21.56)
147.4,
145.
Thus, zj(y) and since ca 5.99 for a we can deduce that under the assumption that the other assumptions underlying the linear regression and a4 3 is not rejected for model are valid the null hypothesis HvL aa =0.55
=0.5
=
=0
a
=
=
0.05.
There are several things to note about the above skewness-kurtosis test. Firstly, it is an asymptotic test and caution should be exercised when the sample size T is small.' For higher-order approximations of the finite sample distribution of &s and 4 see Pearson, D'Agostino and Bowman ( 1977), Bowman and Shenton (1975), inter alia. Secondly, the test is sensitive to outliers' (tunusually large' deviations). This can be both a blessing and a
21.2
Normality
455
hindrance. The first reaction of a practitioner whose residuals fail this normality test is to look for such outliers. When the apparent nonnormality can be explained by the presence of these outliers the problem can be solved when the presence of the outliers can itself be explained. Otherwise, alternative fonns of tackling non-normality need to be considered as discussed below. Thirdly, in the case where the standard error of the regression is relatively large (becausevery little of the variation in yt is actually explained), it can dominate the test statistic z1(y). lt will be suggested in Chapter 23 that the acceptance of normality in the case of the money equation above is largely due to this. Fourthly, rejection of normality using the skewness-kurtosis test gives us no information as to the nature of the departures from normality unless it is due to the presence of '
Outliers.
A natural way to extend the skewness-kurtosis test is to include cumulants of order higher than four which are zero under normality (see Appendix 6.1).
(3)
Tackling non-normality
When the normality assumption is invalid there are two possible ways to proceed. One is to postulate a more appropriate distribution for D(yl/Xf; 0) and respecify the linear regression model accordingly. This option is rarely considered, however, because most of the results in this context are developed under the normality assumption. For this reason the second way to proceed, based on nonnalising transformations, is by far the most commonly used way to tackle non-normality. This approach amounts to applying a transformation to y, or and Xl so as to induce normality. Because of the relationship between normality, linearity and homoskedasticity these transformations commonly induce linearity and homoskedasticity as well. One of the most interesting family of transformations in this context is the Box-cox (1964)transformation. For an arbitrary random variable Z
the Box-cox transformation takes the form 0 %% 1.
(2 1.57)
Of particular interest are the three cases: (i)
j=
(ii)
J
=0.5,
-
1, Z* Z*
=
=
Z(Z/
1
reciprocal',
(2 1.58)
- square root;
(21.59)
-
Departures
= 0,
from assumptions - probability model
Z*
logg Z - logarithmic
=
ote'. 1imZ*
log.
=
(2 1.60)
z).
J-0
The first two cases are not commonly used in econometric modelling because of the diliculties involved in interpreting Z* in the context of an model. empirical econometric Often, however. the square-root transformation might be convenient as a homoskedasticity inducing transformation. This is because certain economic time-series exhibit variances which change with its trending mean (%),i.e. Var(Z,) pl,c2, tczz1, T: In such cases the square-root transformation can be used as a 2, variance-stabilising one (seeAppendix 2 1.1) since Var(Z)) r!, c2. The logarithmic tranAformation is of considerable interest in econometric modelling for a variety of reasons. Firstly, for a random variable Zf Whose distribution is closer to the 1og normal, gamma or chisquare (i.e.positively skewed), the distribution of loge Zt is approximately nonual (seeJohnson and Kotz ( 1970:. The loge transformation induces 'near symmetry' to the original skewed distribution and allows Zt to take negative values even though Z could not. For economic data which take only positive values this can be a useful transformation to achieve near normality. Secondly, the loge transformation can be used as a variancestabilising transformation in the case where the heteroskedasticity takes the form =
.
.
.
,
Varty /X t t
=
xf)
=
c,2
=
(/t,/c2
!,
t
=
1, 2,
.
.
.
,
'fJ
(21.61)
xt) c2, t= 1, 2, T Thirdly, the log yt, Vartyl/xf transformation can be used to define useful economic concepts such as elasticities and growth rates. For example, in the case of the money equation considered above the variables are a11in logarithmic form and the estimated coefficients can be interpreted as lasticities (assumingthat the estimated equation constitutes a well-defined statistical model; a doubtful assumption). Moreover, the growth rate of Zt defined by Z* (zf -Zt- 1)/ Zt 1 can be approximated by A loge Zt H loge Zt Z,- j because Alogezt logt 1+Zt*) ::x Zt* In practice the Box-cox transformation can be used with unspecified and let the data determine its value (seeZarembka (1974/.For the money equation the original variables Mf, 'F;,Pt and It were used in the Box-cox transformed equation: For y)
xloge
=
=
.
.
.
,
=
-log
.
(Ms-
1)-/,1
+fzLY,-
1)+/,z(#2j-
1)+,.42-,
1)+u,
(21.6a)
457
21.3 tnearity and allowed the data to determine the chosen was J=0.530 and JI
=
J2
J3
(0.223)
(0.119)
of
.
The estimated j
J4=
-0.000
=0.005,
=0.865,
0.252,
value
(0.0001)
value
07. (0.00002)
transformation is this mean that the original logarithmic inappropriate'?' The answer is, not necessarily. This is because the estimated value of depends on the estimated equation being a well-delined statistical GM (nomisspecification). ln the money equation example there is enough evidence to suggest that various forms of misspecification are indeed present (seealso Sections 21.3-7 and Chapter 22). The alternative way to tackle non-linearity by postulating a more appropriate form for the distriution of Zf remains largely unexplored. Most of the results in this direction are limited to multivariate distributions closely related to the normal such as the elliptical family of distributions (see Section 21.3 below). On the question of robust estimation see Amemiya (1985).
tDoes
21.3
Linearity
As argued above, the assumption f(J't/Xf
=
xt)
=
(2 1.63)
#'Xt,
can be viewed as a consequence of the assumption that Z, N(0, E), l 6E T (Zf is a normal lID sequence of r.v.'s). The form of (63)is pr//x) x)) can be nonnot as restrictive as it seems at first sight because .E'( linear in x) but linear in xf llx1) wherg /(') is a well-behaved transformation such as xt log xl and xf (x))2.Moreover, tenns such as
where
#=Xa-21J2j
'v
=
=
=
=
2 ca + c 1 r + ca l ..j.
.
(,tl (a, .s.9:
+
costt')
.
.
.j. (ym (m
sintt')
)r+ )r), -,,.
(2 1.64)
pumorting to model a time trend and seasonal effects respectively, can be easily accommodated as part of the constant. This can be justified in the context of the above analysis by extending Lt N(0, E), r c T, to Zt Nlmf, X), t 6 T, being an independent sequence of random vectors where the 65 mean is a function of time and the covariance matrix is the same for all t T. The sequence of random vectors tZ,, t (E T) in this case constitutes a nonstationary sequence (seeSection 21.5 below). The non-linearities of interest in this section are the ones which cannot be accommodated into a linear conditional mean after transformation. 'w
,v
458
from assumptions - probability model
Departures
lt is important to note that postulating (63),without assuming normality of Dty'f, Xr; #), we limit the class of symmetric distributions in which Dyt, X:,' #) could belong to that of elliptical distributions, denoted by Eljp, E) (seeKelker ( 1970/. These distributions provide an extension of the multivariate normal distribution which preserve its bell-like shape and symmetry. Assuming that
s,4(0(,) s/lacal) (x-p.,) ) (gc)
(2 1.65)
-
implies that
.E'(y,,'X! xf) -
,'1
-
ztz-lxt
(2 1.66)
and Vart)'f/xf
X2)
=
(/(Xt.)(t7'1
=
5'!
1 -
2X2-21J21).
(21.67)
This shows that the assumption of linearity is not as sensitive to some departures from normality as the homoskedasticity assumption. Indeed, homoskedasticity of the conditional variance characterises the normal distribution within the class of elliptical distributions (seeChmielewski ( 198 1:.
(1)
Implications
of non-linearity
Let us consider the implications of non-linearity for the results of Chapter 19 related to the estimation, testing and prediction in the context of the linear regression model. ln particular, are the implications of assuming that D(A;#) is not normal and 'what
'()'f/'Xt
Xf)
=
=
/l(N),
(21.68)
where (xf) p'xt'?' ln Chapter 19 the statistical GM for the linear regression model was defined to be :#;
)'f
=
#'xf+
thinking that /t) Epituttf/Ak x,) however, is =
'tyr/xf
=x,)
=
=
With
-p1
=j'xf
and ul x,) c2. The =yt
Elutlt
0 and
yr /)(xt)+ =
(21.69)
ut,
=
ttrue'
=
E (!t,/Xt=xf) statistical GM, =0,
(21.70)
cf,
and ct where pt '(yl,/Xl= xf) f()4/X? xf). Comparing (69) and (70) we can see that the error term in the former is no longer /I(x2) white noise but ut +8, Hg(xf) + st. Moreover, =/?(xf)
=.pr
=
.--p'xt
=y',
-
.-p'xt
=
=
21.3 t-inearity Eutxt
=
x.,)
JJ(utlt/x t
ln
view of
459
# 0 and
Eptutq
=:(xf),
2 xf ) gtxf ) + c/. =
=
l
these properties of
deduce that for
ut we can
g(x,..))', e - (f?(x1),g(x2), #) #+ (X X) X e' ,, .
.
.
,
(2 1.72)
=
and
J)s2)
(7.2
=
+ e'
Mx
w- k
e+ c
2 ,
Mx
=
l .-XIX X) ,
-
1
X
(2 1.73)
, ,
because y Xpv + e + E not y Xj+ u. Moreover, / and are also inconsistent estimators of # and c2 unless the approximation error e satisfies (1/T)X'e 0 and (1/T)e'Mxe 0 as T-+ CX; respectively. That is 2 non-linear and the non-linearity decreases with 'C# unless /1(x,) is not and s2 are inconsistent estimators of j and c2. As we can see, the consequences of non-linearity are quite serious as far as t h e proper ties of j and sl are concerned being biased and inconsistent estimators of j and c2, in general. What is more, the testing and prediction results derived in Chapter 19 are generally invalid in the case of nonlinearity. In view of this the question arises as to what is it we are estimating by s and # in (70)7 Given that plt (Jl(xt)- p'xt)+ st we can think of p-as an estimator of #* where #* is the parameter which minimises the mean square error of ut, i.e. .s2
=
=
--+
-+
'too'
,
'h
2
=
#*
=
min c2(#)
where c2(j)
K
Elutlj.
(21.74)
Ecztjlj/)j= ( -2)A'g(/?(xf) - j'xylxg'q (assumingthat we operator). Hence, #* differentite inside the expectation can 1 This is because
::::c0
=
Y2-21/2,, say. Moreover, sl can be viewed as the xtx/t)natural estimator of c2(#*). That is j and are the natural estimators *'xt unknown function /l(xr) the of a least-squares approximation to respectively. What is more, and the least-squares approximation-error '(/1(xt)x;)
=
,$2
we can show
(2)
that
Testing
.
a.S.
#-+#* and
.s2
a,S. -+
c2(#*)
(see White ( 1980)).
for non-linearity
ln view of the serious implications of non-linearity for the results of Chapter 19 it is important to be able to test for departures from the linearity assumption. ln particular we need to construct tests for f10: '(yr/Xl x2) j'xf =
=
460
from assumptions - probability model
Departures
against
J.f1 : El)'tjxt
xf)
=
=/l(x2).
(21.76)
This, however, raises the question of postulating a particular functional form for txf)which is not available unless we are prepared to assume a Alternatively, particular form for D(Zf;#). use the we could and systematic parametrisation related to the Kolmogorov-Gabor component polynomials introduced in Section 21.2. Using, say, a third-order Kolmogorov-Gabor polynomial (.&'G(3)) we can postulate the alternative statistical GM:
)'f where
#Xt
=
+
Szf
kzt includes the
+ 713/32+
second-order
injzzz2, 3, and
.
.
(21.77)
:t
terms
k,
,
.
(21.78)
#at the third-order terms 1, inj, 1--2, 3,
xitx-itxtt,>./>
.
.
.
,
(2 1.79)
k.
Note that x1, is assumed to be the constant. Assuming that F is large enough to enable us to estimate linearity in the form of: HoL
y2
0 and
=
HL :
yz + 0
or
(77)we can test
ya # 0
using the usual F-type test (seeSection 2 1.1). An asymptotically test can be based on the R2 of the auxiliary regression: t
= (#0-
#)'X
+ T'2z, +
t
'kst
+
equivalent
(21.80)
Ef
using the Lagrange multiplier test statistic RRSS URSS RRSS
ffe
-
LM(y)
FR2
=
q being the number of Cj
=
ty
:
T
=
restrictions
LM(y)
>
ca )
;l(q)
x
a
(21.8 1)
(see Engle ( 1984)). Its rejection region is dz 24g)
,
=
a.
Cu
For small T the F-type test is preferable in practice because of the degrees of freedom adjustment; see Section 19.5. Using the polynomial in pt we can postulate the alternative GM of the form:
yt where pt
=
=
p'sxt+
czptl + cspt3 +
.
.
xt. A direct comparison
.
+ cmp? +
p?
between
(21.82)
(75)and (82)gives rise to
a
!
21.3
Linearity
461
RESET type test (seeRamsey (1974)) for linearity based on Ho : cw= c3 l'n. Again this can be tested using the F-type cm 0, Sl : ci #0, i 2, ' LM test or the test both based on th'e auxiliary regression : =
.
'
=
=
=
e'
,
=
b. #) xt +
.
.
.
,
Z cipt + zx
,
-
f
i
's
14,
pt
2
=
=
l
t
x,.
(2 1.83)
Let us apply these tests to the money equation estimated in Section 19.4. The F-test based on (77)with terms up to third order (but excluding ) yielded : because of collinearity with .)
477 0.1 17 520 ().:4j4,7-/ -0.045
FF(F)
=
(-j-167
11.72.
Given that ca 2.02 the null hypothesis of linearity is strongly rejected. Similarly, the RESET type test based on (82)with m=4 (excluding;J because of collinearity with Jf) yielded: =
FT(y)
(:-1-
0.1 17520 -0.060 28 74 ().:6: as
=
35. l3.
Again, with cz 3. 12 linearity is strongly rjected. lt is important to note that although the RESET type test is based on a more restrictive form of the alternative (compare(77)with (82))it might be the only test available in the case where the degrees of freedom are at a premium (see Chapter 23). =
(3)
Tackling non-linearity
As argued in Section 21.1 the results of the various misspecification tests should be considered simultaneously because the assumptions are closely interrelated. For example in the case of the estimated money equation it is highly likely that the linearity assumption was rejected because the independent sample assumption g8qis invalid. ln cases, however, where the source of the departure is indeed the normality assumption (leadingto nonlinearity) we need to consider the question of how to proceed by relaxing the normality of )Z! 1 l 6 T). One way to proceed from this is to postulate a general distlibution Dtyf, Xf; #) and derive the specific form of the conditional expectation ,
ClA,'''Xf
=
Xr)
=
(xf).
(21.84)
Choosing the form of Dtl?f, Xt: #) will determine both the form of the conditional expectation as well as the conditional variance (seeChapter 7). An alternative way to proceed is to use some normalising transformation
462
from assumptions - probability model
Departures
on the original variables y, and X? so as to ensure that the transformed variables y) and X) are indeed jointly normal and hence E)'1%1
=x))=#*'x
(21.85)
and Vart-vtl/rx)
x))
=
c2.
=
(21.86)
The transformations considered in Section 2 1.2 in relation to normality are also directly related to the problem of non-linearity. The Box-cox transformation can be used with different values of for each random variable involved to linearise highly non-linear functional forms. ln such a case the transformed nv.'s take the general form X tt>
=
zkri it
j
-
12
.
,
x
!
=
,
,
.
.
.
7
(2 1.87)
k
f (see Box and Tidwell
(1962:.
In practice non-linear regression models are used in conjunction with the inter alia). normality of the conditional distribution (seeJudge et aI. (1985), The question which naturally arises is, reconcile the noncan we linearity of the conditional expectation and the normality of D(y?/Xf; p)?' As x,) is a direct mentioned in Section 19.2, the linearity of pt Es Eytj'xt consequence of the normality of thejoint distribution Dyt, X,; #). One way the non-linearity of '()!f/Xr xf) and the normality of D(y,/Xt; 0j can be reconciled is to argue that the conditional distribution is normal in the transformed variables X) /)(X), i.e. Dlttf/Xl x)) linear in xtl but non'how
=
=
=
=
linear in xf, i.e.
E-vbt/xt
=
xr)
=
gls, T).
(2 1.88)
Moreover, the parameters of interest are not the linear regression H(y, c2+). It must be emphasised that parameters 0-(p, c2) but 4 nonlinearity in the present context refers to both non-linearity in parameters (y) and variables (X,). Non-linear regression models based on the statistical GM: .Vf
=
#(N, $ +
(21.89)
l/f
can be estimated by least-squares
based on the minimisation
of
(2 1.90)
I
463
21.4 Homoskedasticity
This minimisation will give rise to certain non-linear normal equations which can be solved numerically (seeHarvey ( 198 1), Judge et al. (1985), Malinvaud ( 1970), inter alia) to provide least-squares estimators for c2 ),: m x 1. * can then be estimated by
sl
1 Z (-p,-/x,, f))2. w- k
=
(21.91)
,
Statistical analysis of these parameters of interest is based on asymptotic theory (seeAmemiya ( 1983) for an excellent discussion of some of these results).
21.4
Homoskedasticity
that Vartyf,/Xl xt) c2 is free of x, is a consequence of the assumption that Dtyt, Xr; #) is multivariate normal. As argued above, the assumption of homoskedasticity is inextricably related to the assumption of normality and we cannot retain one and reject the other uncritically. x2) above, homoskedasticity of Vartyt/xf lndeed, as mentioned class. elliptical For within normal distribution the characterises the argument's sake, let us assume that the probability model is in fact based on D(#'xf, c,2) where D( ) is some unknown distribution and oj1 Jl(xt). The assumption
=
=
=
=
.
(1)
of keteroskedasicity
Implications
A far as the estimators s
/ and
.:2
are concerned
p i.e. p-is an unbiased Covt (x'x)- '(X'ox)(X'X)-
(i)
Ep
=
,
estimator
f)
=
diagtcf,
c2a,
.
.
.
,
of
(21.92)
c;/)
=
c2A,
lf limw-+.((1,'JIX'DX) is bounded and non-singular P
/-
-+
p i e. b-is a ,
.
estimator of
consistent
/
#.
l
=
where
we can show that
then
#.
(X'X)-- 1X'y retains some desirable results suggest that and unbiasedness consistency, although it might be properties such as with the so-called generalised Ieastinefficient. p- is usually compared of #, #,derived by minimising sylt?rt?s (GLS) estimator These
/(l)=(y-X#)'n-
'(y
-X,),
=
(21.93)
464
Departures from assumptiolks - probability model
P(#) P# =0
#=(x f) x) ,
=
1
-
-
lx/o
- ly
-((),)(--'-,)')-' )):(---'-,)('t).
Given that
Covt#l =(X'f1 - 1X) -
and
(2 1.94)
1
(21.95)
Cov(/) y Cov(#)
(21.96)
(see Dhrymes (1978:, p- is said to be relatively inefficient. It must be emphasised, however, that this efficiency comparison is based on the presupposition that A is known a priori and thus the above efficiency comparison is largely irrelevant. lt should surprise nobody to that supplementing the statistical model with additional information we can get a more efficient estimator. Moreover, when A is known there is no need for GLS because we can transform the original variables in order to return to a homoskedastic conditional variance of the form kdiscover'
Vartyle/''xle
=
x))
=
c2,
t= 1,
.
.
.
,
(21.97)
T
This can be achieved by transforming y and X into
y* Hy =
and
X*
HX
=
In tenns of the transformed
where
H'H =A -
1.
(21.98)
the statistical GM takes the form
variables
y* X*#+ u*
(21.99)
=
and the linear regression assumptions are valid for )'1 and X). Indeed, it can be verified that
#- (x+'x*)=
1x*'
Hence, the GLS estimator known a priori.
y
*
=
is
(x'A - 1X) - 1x,A - 1 y rather
unnecessary
=
#
(J 1.104))
in the case where A is
The question which naturally arises at this stage is, twhat happens when fl is unknown'?' The conventional wisdom has been that since fl involves T unknown incidental parameters and increases with the sample size it is clearly out of the question to estimate T+k parameters from T observations. Moreover, although p- (X'X)- 1X'y is both unbiased and of Covt#-) estimator consistent s2(X'X)- 1 is an inconsistent (X'X) - IX'DXIX'X) - 1 and the difference =
=
c2(X'X) - 1 -(X'X)- 1(X'f1X)(X'X)- 1
(21.101)
21.4 Homoskedasticity
465
can be positive or negative. Hence, no inference on #, based on p-is possible since for a consistent estimator of Cov(/) we need to know f (or estimate it the consistently). So, the only way to proceed is to model V so as to incidental parameters problem. Although there is an element of truth in the above viewpoint White (1980) ointed out that for consistent inference based on j we do not need to P estimate fl by itself but (X'fX), and the two problems are not equivalent. The natural estimator c/ is l (yt-J'xf)2, which is clearly unsatisfactory because it is based on only one observation and no further information accrues by increasing the sample size. On the other hand, there is a perfectly acceptable estimator for (X'f1X) coming in the form of 5
tsolve'
=
r
1
Wr
=
-
T
l
F,1
li/xrxt',
(21.102)
r
for which information accrues as T-+ x. White (1980) showed that under certain regularity restrictions
j and
a
.
;
.
--+
p
(21 103)
(X'f1X).
(21.104)
$
.
it .S.
Ww -+
The most important implication of this is that consistent inference, such as the F-test, is asymptotically justifiable, although the loss in efficiency should be kept in mind. In particular a test for heteroskedasticity could be based on the difference (X'X)- 1X'DX'(X'X) -
1
-
c2(X'X) - 1
(21.105)
Before we consider this test it is important to summarise the argument so far. Under the assumption that the probability model is based on the c(.) is distribution Dxt, c/), although no estimator of fl diagtczj, possible, #= (X X) X y is both unbiased and consistent (undercertain conditions) and a consistent estimator of Cov(/) is available in the form of WT. This enables us to use j for hypothesis testing related to #. The c/ will be taken argument of up after weconsider the question of from testing for departures homoskedasticity. =
.
.
.
,
'modelling'
(2)
Testing departures
from homoskedasticity
White ( 1980), after proposing a consistent estimator for X'DX, went on to
466
Departures
from assumptions - probability model
use the difference (equivalentto ( 105:: (X'f1X) -c2(X'X)
(2 1.106)
to construct a test for departures from homoskedasticity. expressed in the folnn
(106)can be
F
('(l?r2) czlxyxj',
(21.107)
-
;=1
and a test for heteroskedasticity could be based on the statistic
1 jr (2 t wt 1
dzjxtxt',
-
(2 1.108)
=
the natural estimator of (107).Given that (108) is symmetric we can express the jklk 1) different elements in the form -
F
1
-'.,
where
-
l
1
-45/
(21.109)
2,, /p,rl', klt xitxjt, , (/1,, k l 12 i /' 1 2 .
.
,
-
,
Note the similarity
,
,
.
.
.
between
,
,
,
,
.
.
,
.
m
gklk
n'l
,
1) .
#f above and the second-order term of the (11). Using (109),White (1980)went on to
Kolmogorov-Gabor polynomial suggest the test statistic
zy)
=
-
T
t
where F
jl1
tl
-
-
zl'f
bw-1
r
1 --
T
,
) -
tl
,llt
-
,
(2 1.110)
1
F
1
I,.--
r
1
ll l
-42)2(,
1
=
-#w) kt -#w)',
r
1 #v-y f Z1 #,.
(2 1. 111)
=
Under the assumptions of homoskedasticity can be based on the rejeciion region
cl
=
Jt
y: zy)
> ca),
zy)
,vz2(m)
?
and a size a test
(2 1.112)
Because of the difficulty in deriving the test statistic (109)White went on to suggest an asymptotically equivalent test based on the R2 of the auxiliary
Homoskedasticity
21.4
regression equation l
acj + a j
=
/ 1t +
Under the assumption TR2
.v
1
ac/
zt +
.
.
.
(2 1.113)
+ am/mt.
of homoskedasticity,
z2(n1),
(21.114)
and TR2 could replace zy) in (112) to define an asymptotically equivalent test. lt is important to note that the constant in the original regression should not be involved in defining the /ftS but the auxiliary regression should have a constant added. Example For the money equation estimated above the estimated auxiliary equation of the form
2
!
co + y'#! +
=
lll
yielded R2 0.190, FF(y) 2.8 and FR2 15.2. ln view of the fact that the null hypothesis of ./6.73) 2.73 and z2(6) 12.6 for a homoskedasticity is rejected by both tests. The most important feature of the above White heteroskedasticity test that is no particular form of heteroskedasticpy is postulated. subsection ln (3)below, however, it is demonstrated that the White test is an exact test in the case where D(Z,; #) is assumed to be multivariate t. ln this case the conditional mean is pt #'xt but the variance takes the form: =
=
=
=0.05
=
=
kapparently'
=
c/
=
c2
k
(21.115)
f +x'Qxf.
variables' xf) + lll We Can Using the argument for ul Eluijxt This suggests derive the above auxiliary regression (seeSpanos (1985b)). that although the test is likely to have positive power for various forms of heteroskedasticity it will have highest power for alternatives in the multivariate t direction. That is, multivariate distributions for D(Zr; #) which are symmetric but have heavier tails than the normal. ln practice it is advisable to use the White test in conjunction with other tests based on particular fonns of heteroskedasticity. ln particular, tests which allow first and higher-order terms to enter the auxiliary regression, such as the Breusch-pagan test (see(128)below). lmportant examples of heteroskedasticity considered in the econometric literature (seeJudge et al. (1985),Harvey (1981)) are: tomitted
Cr2
=
c2'x);
=
=
(21.116)
>'
468
Departures
lii)
Gl
(iii)
c2l
.
t
=
=
from assumptions
-
probability
model
c2('x*)2. t
exp('x*)! .
!'
x2) and x) is an m x 1 vector which incltldes known where V Vartyf/xf of and transformations its first element is the constant term. It must be xr econometric noted that in the literature these forms of heteroskedasticity are expressed in tenus of w, which might include observations from weakly exogenous variables not included in the statistical GM. This form of heteroskedasticity is excluded in the present context because, as argued in Chapter 17, the specification of a statistical model is based on all the observable random variables comprising the sample information. lt seems very arbitrary to exclude a subset of such variables from the definition of the Elytlxt xr) and include them only in the systematic component conditional variance. ln such a case it seems logical to respecify the systematic component as well in order to take this information into conditioning consideration. Inappropriate in delining the systematic component can lead to heteroskedastic errors if the ignored information affects the conditional variance. A very important example of this case is when the sampling model assumption of independence is inappropriate, a non-random sample is the appropriate assumption. ln this case the systematic component should be defined in such a way so as to take the variables temporal dependence among the random involved into consideration (seeChapter 22 for an extensive discussion). lf, however, the systematic component is defined as /tt NJJ(y!,7X,=xf) then this will lead to autocorrelated and heteroskedastic residuals because important temporal information was left out from yt. A similar problem arises in the case where stochastic processes (see Chapter 8) with y, and Xt are non-stationary distinct time trends. These problems raise the same issues as in the case of non-linearity being detected by heteroskedasticity misspecification tests discussed in the previous section. Let us consider constructing misspecification tests for the particular forms of heteroskedasticity (il---tiii). It can be easily verified that (i)-(iii)are special cases of the general form =
=
tother'
=
oj2
=
/;(a'x)),
(2 1.119)
for which we will consider a Lagrange multiplier misspecification test. Breusch and Pagan ( 1979) argued that the homoskedasticity assumption is equivalent to the hypothesis Hv1 a2
=
aa
=
.
.
'
=
ap,=0,
given that the first element of xl is the constant and
(aj)
=
c2.
The 1og
21.4 Homoskedasticity likelihood function
logL,
see discussion above) is
(retainingnormality, x)
,'
469
const
=
1.
1
1
jl log c,2 t 1
-j
-.j.
'-'
=
w
j c;t 1
2(.p
-
#'xf)2,(21.120)
=
where V hlx'xt). Under Ho, c/ c2 and the Lagrange multiplier statistic based on the score takes the general form =
=
uM-L.t-j
oeIogz-tyl,
log,,(),I(p)-1
test
(2 1.121)
where #refersto the constrained MLE of 0- (j, ). Given that only a subset of the parameters 0 is constrained the above form reduces to
p log .L(0, #) (122-lc1l1l21)/k
p
'
LM
=
1
log fa(0, &)
nx
(21.122)
(see Chapter 16). In the above case the score and the information matrix evaluated under Hv take the forms (2 1. 123) (2 1. 124)
andlc1
42 LM test statistic is where
=(),
j
LM
=(1
j7- 1
F)
tl
,
-
)(
)(
xlw,
=j
=
Hence, the
H (;
)(
xlw,
klm
'v
''.:t
t
42) 1j. Given that wherew, g(li,2 =
1
xlxl'
t
t
R1
is the MLE of c2 unders.
- 1),
(21.125)
R2 in the linear regression model is
y'X(X'X)- 1X'y - T#2
(21.126)
',y- wy.z
(see Chapter 19), the LM test statistic expressed in the form /
1--bv4*
-
'
=
)
1
xlxl'
xl'w,
T
'
Z
xlwf '
(21.127)
w'?2
t
is asymptotically
equivalent
to TR2 from the auxiliary regression (2 1.128)
Departures
from assumptions
-
probability
model
Ho
that is, T'R2
'v
z2(m -
Q
1) (see Breusch and Pagan
(1979),Harvey (1981)1.
lf we apply this test to the estimated money equation with x,l > (x,, c?. #af) (See (78) and (79)) xlr, x excluded because of collinearity) the auxiliary regression :2 --2 k'
yielded R2
=
/ 1X
l
+ 7'2 zt + T'3
3t
+
L't
FF(y) 1) 19.675 and Given that TR2 F(11, 68) 1.94, the null hypothesis of homoskedasticity is relected by both test statistics. =0.250,
=20,z2(1
=2.055.
=
=
(3)
Tackling heteroskedasticity
When the assumption of homoskedasticity is rejected using some misspecification test the question which arises is, do we proceed'?' The first thing we should do when residual heteroskedasticity is detected is to diagnose the likeliest source giving rise to it and respecify the statistical model in view of the diagnosis. In the case where heteroskedasticity is accompanied by non-normality or and non-linearity the obvious way to proceed is to seek an appropriate nonnalising, variance-stabilising transformation. The inverse and loge above transformations discussed can be used in such a case after the form of heteroscedasticity has been diagnosed. This is similarto the GLS procedure where A is known and the initial variables transformed to khow
y* Hy, =
X*
=
for
HX
H'H
=
A - 1.
ln the case of the estimated money equation considered in Section 2 1.3 above the normality assumption was not rejected but the linearity and homoskedasticity assumptions were both rejected. ln view of the time paths of the observed data involved (seeFig. 17.1) and the residuals (seeFig. 19.3) it seems that the Iikeliest source of non-linearity and heteroskedasticity which led to dynamic might be the inappropriate conditioning non-linearity, misspecification (see Chapter 22). This heteroskedasticity can be tackled by respecifying the statistical model. An alternative to the normalising, variance-stabilising transformation is postulate to a non-normal distribution forDtyr, Xt; 0j and proceed to derive f'tyt/'Xf x?) and Vart)'t/xf xf) which hopefully provide a more appropriate statistical for the actual DGP being modelled. The results in this direction, however, are very limitedjpossibly because of the complexity of the approach. ln order to illustrate these difficulties 1et us consider the kapparent'
=
=
21.4
Homoskedasticity
case where D(yf,Xf
471
t with n degrees of freedom, denoted by
is multivariate
*,0)
y,
Xf zw&
O
cj
j
o.1 c
0
az3
Y22
(21.13 1)
,
lt turns out that the conditional mean is identical to the case of normality (largelybecause of the similarity of the shape with the normal) but the conditionalvariance is heteroskedastic, i.e. ft.Ft/''Xf Xf) J1 2Y2-11X/ (21.132) and =
=
(see Zellner ( 197 1)). As we can see, the conditional mean is identical to the one under normality but the conditional variance is heteroskedastic. ln particular the conditional variance is a quadratic function of the observed values of Xl. ln cases where linearity is a valid assumption and some form of heteroskedasticity is present the multivariate l-assumption seems an obvious choice. Moreover, testing for heteroskedasticity based on H0:
0-2 f
=
(7.2
tzzzz1, 2,
,
.
.
F
,
.
T; Q being a k x k matrix, will lead against H3 : c,2 (x;Qxf)+ c2, t 1, 2, White identical the directly to a test test. to The main problem associated with a multivariate r-based linear regression model is that in view of (133)the weak exogeneity assumption of X, with respect to 0n (p,tr2) no longer holds. This is because the parameters /1 and /2 in the decomposition =
=
D(.)Xr;
#)
=
D(),/X,;
.
.
#1) D(Xt; '
.
,
(21.134)
a)
are no longer variation free (seeChapter 19 and Engle et al. ( 1983) for more details) because /1 = (Jj cE2-21 cl 1 J1 2Ec-21 o.a1, Ea-a1)and /2 EEE(Eac) and the constant in the conditional variance depends on the dimensionality of Xl. This shows that /1 and /2 are no longer variation free. The linear regression model based on a multivariate sdistribution but with homoskedastic conditional variance of the form -
,
,
Vart?&/xl
=
xt)
=
1.,()c (J,'0
-
2
2)
,
was discussed by Zellner (197$.He showed that in this case indeed the MLE'S of J and c2 as in the case of normality.
/ and
:2 are
from assumptions
Departures
- probability
21.5
Parameter
time invariance
(1)
Parameter
time dependence
An important
)'f
assumption
#'xt+
=
underlying
the linear
model
regression
statistical GM
(21.136)
ut,
is that the parameters of interest 0 HE (#,c2) are time nlllnknr, where #> 1 tj.latc-alcaj The time invariance of these parameters 1222 - J 21 and c2 cj j is a consequence of the identically distributed component of the assumption =
At
'v
-
.
N(0, Y),
i.e.
tzr,r e:T)
is Nl1D.
(21.137)
This assumption,
however, seems rather unrealistic for most economic time-series data. An obvious generalisation is to retain the independence assumption but relax the identicallq' distributed restriction. That is, assume that f(Zj,t (E T) is an independent stochastic process (seeChapter 8). This introduces some time-heterogeneity in the process by allowing its parameters to be different at each point in time, i.e. Z,
-
N(m(!), E(r)),
(21.138)
where IZr, t e: T) represents a vector stochastic process. A cursory
look at Fig. 17.1 representing
the time path
of several
confinus that the economic time series for the period 1963-1982n assumption (137) is rather unrealistic. The time paths exhibit very distinct time trends which could conceivably be modelled by linear or exponential type trends such as:
(i)
mt
=
mt eXP ) =
(iii)
n2t
(21.139)
ao + al 1,' ll
tl + a1
=
+ aj 11',.
(1 -
e-'
(21.140)
(21.14 1)
1),
The extension to a general stochastic process where time dependence is also allowed will be considered in Chapters 22 and 23. For the purposes of this chapter independence will be assumed throughout. ln the specification of the linear regression model we argued that (137)is equivalent to Zf N(m, E) because we could always define Zf in mean deviation or add a constant term to the statistical GM; the ccnstant is defined by jj n1l 0.: 2Yc1m2 (seeChapter 19). ln the case where (138)is valid, however, using mean deviation is not possible because the mean varies with 1. Assuming that 'w
=
(xA'f,) -
-
;)2at(r/))) (.)'j'(t,')t ((m''' 1at('/))
x
,
(21.142)
time inuriance
21.5 Parameter
& vt -
Xt
.'
Vart
where
T'l
x t)
=
,/x
-
!
=
=
#'x* t l
xt )
=
:
c,2
-
1', l-kt,,/(1.,,),/1, ',.zltJ'), #(l)f Ycct/)=
=
r&zcl =
1(r)
-
take the form
mean and variance
we can deduce that the conditional
,.1
-
/.',11(r)
x)
=
2(l)Y22(l) -
-
,1
2(r)Ec.a(r)-
',...,1t.?'4-
t1, x;)' 'cc1(r).
convenience the star Several comments are in order. Firstly, for notational written will conditional the and x) be dropped as #'fxf.Secondly, mean in stochastic non-stationary independent 142) under defines a the sequence Zf ( restrictions on the time process (see Chapter 8). Without further (E 'T-jt of interest p, (#;, cfzl cannot be the parameters heterogeneity of kZf, t sample size ILThis gives us a fair with the estimated because they increase warning that testing for departures from parameter time invariance will not be easy. Thirdly, ( 142) is only a sufficient eondition for ( 143) and ( 144), it is not necessary. We could conceive of parametrisations of ( 142) whieh could lead to time invariant j and c2, Fourthlys it is important to distinguish between time invariance and homoskedasticity of Vart r,,,'Xf xr), at least at the theoretical level. Homoskedasticity as a property of the conditional variance refers to the state where it is free of the conditioning variables (see Chapter 7). ln the context of the linear regression model homoskedasticity of Vart-l'r//'xf xf) Stems from the normality of Zr. On the other hand, time invariance refers to the time-homogeneity of Vaq '!,'X, xr) and follows from the assumption that f(zt, l 65 7-) is an identically distributed process. ln principle, heteroskedasticity and time dependence need to be distinguished because they arise by relaxing different assumptions relating to the stochastic process ftzt, t e: T). In practice, however, it will not be easy to test for discriminate between the two on the basis of a misspecification both and heteroskedasticity dependence be time Moreover, can either. Section multivariate r-distribution where is in (see (142) a present as the case 2 1.4 above). Finally, the form of Jlf above suggests that, in the case of economic time series exemplifying a very distinct trend, even if the variance is constant over time, the coefficient of the constant term will be time dependent, in general. ln cases where the non-stationarity is homogeneous by (restricted to a local trend, see Section 8.4) and can be time-invaliant. differeneing,its main effect will be on jj, leaving j(:), This might exptain why in regressions with time series data the coefficient of the constant seems highly volatile althougt the other coefficients appear to be relatively constant. =
=
=
=
Geliminated'
Klargely'
474
Departures
(2)
Testing
from assumptiolts - probability model
for parameter
time dependence
Assuming that (Zr, t e: Ty is a non-stationary, independent normal process, and detining the systematic and non-systematic components by pt
and
E()'t ,.'Xj x)
=
=
the implied
#;Xf
=
+
yt - S()'f/'Xf
=
(21.145)
XJ),
=
GM takes the form
statistical
yt
tIt
(21.146)
&t,
with pf V)being the statistical parameters of interest. lf we compare (146) with (136)we can see that the null hypothesis for parameter time invariance for a sample of size T is Klpt,
Hv :
#j #2 =
against Sl
:
=
'
.
'
#w #
=
ct2 # c2
pt# #
c2j c22
and
=
=
for any r
.
=
1, 2,
=
.
.
.
=
.
.
,
cw2 =
(7.2
T
Given that the number of parameters to be estimated is Tk + 1) + F and we only have T observations it is obvious that 03 pware not estimable. It is and ahead ignore instructive, however, to this to attempt estimation of go these parameters by maximum likelihood. Differentiation of the 1og likelihood function: ,
logfa=const
1
-
i
f
F
jg log c,2
1 -
-1
.
.
,
.
T
)
2(
cf-
'
-#r'xf)2
pr
(2 1.147)
,-,
yields the following first-order conditions: t?log L tn(bt =
c
(y, #,x,)xf ,
,
-
=
()7lo g L
0,
1
2 ..-Y (, G't
=
-
ac:r
+
1
2c4t
2
u?
=
()
(2 1.148)
ctl
These equations cannot be solved for pt and because ranktxg) ranktxr, xt') 1, which suggests that xtx; cannot be inverted; no MLE of 0t exists. Knowing the source' of the problem, however, might give us ideas on how we might solve' it. In this case intuition suggests that a possible (Xk0'xk0),where Xk0EEEE(x1,x2, x:). invertible form of xfx; is 1 x,x;) That is, use the observations r= 1,2, k, in order to get ranktxko) k and invert it to estimate pkvia =
=
(Z,k-
=
.
.
j
(x0'X0) k #
=
-
txk'yk
=
.
.
#,
=
lyk,
In turn the residuals
t
=
,
t
=
k + 1,
.
.
.
-
.
.
.
,
,
.
.
T; the corresponding
pts
(21.150)
.T:
(y't-#;xf), t k + 1, =
.
(2 1.149)
1, k + 2,
,
(X, X, ) Xt yt
,
(xk0) -
H yk0 (y1, yk). Moreover, for 1,:::ck + couldconceivably be estimated by .
.
.
.
=
.
,
'Ccan be derived which,
21.5 Parameter time inuriance however, cannot be used to estimate cf2 because the estimator implied by the above first-order conditions is (2 1.151) This is clearly unsatisfactory given that we only have one observation for each ct1 and the fs are not even independent. An alternative form of residuals which are at least independent are the recusri'e residuals (see Section 19.7). These constitute the one-step-ahead prediction errors
fh =
(A'f-
#; 1x,) -
+ x;(#,
l/,
=
#, 1), -
-
t
=
k + 1,
.
.
,
and they can be used to update the recursive estimators of observation becomes available using the relationship
/? k- 1 +(X,=
1X!-
1)
x! d, t
r= k + 1,
,
.
.
(21.152)
'f) .
,
.
each new
#fsas
(21.153)
7-:
(see exercises 5 and 6) where + x'(X0' (21154) t/ t -- 1 )! (1 t t -- 1 x0 As we can see from (153), the new information at time r comes in the form of f'f and #f is estimated by updating p-f-1 p-t - 1 in (152)yields
J
lx
==
'
'
.
substituting f'!
lg,
=
+ x'
pt
-
(X,0-'j x,0-j)
- 1
f- 1 i
jg1
xfx;
=
pi
x;(x,0-' 1 x,0-j)
-
-
1
t- 1
g
'
i
)
xylj 1
=
(2 1.155) c22, (see exercise 7). Hence, under So, Et) 0, El) This implies that the standardised recursive residuals =
wl
=
6 .-
Jf
,
l
=
k+ 1
.
,
.
.
,
tzzzk + 1,
=
.
.
.
,
T
(21 156)
T
.
suggest themselves as the natural candidates on which a test for Ho might wrl', be devised. lndeed, for w BE (wk 1, wk c, +
+
Hv
w
'v
.
.
.
,
N(0, c2Iw-k)
(21.157)
N(J, C),
(2 1.158)
Hl
where
w
,w.
J EB (k + 'j = t df
!, s c Ectsq, 1 x/f #t (Xt0-'I Xf0- ) ' F x ix'iqi i 1 1
,
k+
a,
.
.
.
,
w),
=
=
f-
-
j
=
,
k + 1,
.
.
.
,
T)
476
from assumptions .- probability model
Departures (7.2
--j t!?, -(It ...............=
0'
+ x'(X t t
t
0
Xt -
- j
)-'
i )-)x x'c2 i
1
1. .
i
j
-.-
i
0 X (X0' r- l t - 1j-
t 1' + =
-
f
X l'(X,0-'1 X,0- j.
.
t/ (/
)-
1
(
i
t s
,
.
.
.
,
xt
,
T2. (2 1 160 ) .
1
' -'
1
=
s
1
l
)-')x
'ic-,? ,.x
1
-:.
(2 1.161) 7- (see exercise 8). for t < y, l k + 1lf we separate Hv into =
.
Hlvq) '
,
.
for a11 t
pt #,
..
H$1t. 0
.
=
(7.2 = t
(7.2
for al1 r
q
=
12
=
1 2.
,
,
.
.
.
,
'J) '/-i
,
.
.
.
.
we can see that H)$
w but
N((), c),
x
(2 1.162)
HL2$
w
.N(J, c2Iy.-j),
'v
(2 1.163)
This shows that coefficient time dependcnce only affects the mean of w and variance time dependence affects ils coq'ariance. The implication from these results is that we could construct a test for //t)1) given that f:lttlz'holds against St/ '.' pt+ p for any t 1, 2, T based on the chi-square distribution. ln view of ( 163) we can deduce that =
F
k.l:)
(j
w,
,
.
)
x
.
,
I'
'
)
N
r=k
f=k+1
'
cl
(,
(21 164)
.
,
:1
This result implies that testing for Ho '. given Hlvlbis valid, is equivalent to testing for F(wr) 0 against F(wt) # 0. Before we can use ( 164) as the basis of a test statistic we need to estimate c2. A natural estimator for c2 is =
2 .sw
where
=
1 . -..----.. 1F k f
j''n (u, k
,.
1/(F- /$:))J E) jJ--'k
-#.
fi
(y) ('F- .) -.Sw, =
(2 1 165) .
,
1
j
This enables us to construct Tl
).gj2
.
=u
u.',. kNrt?t,that
the test statistic
? + 0 when
HL') is
not
valid.
Ho x
t (w -
. -
1) .
(2 1.166)
21
time invariance
Parameter
.5
#t'a
(71
7' t?a z 1 (y)1
.:y :
=
). -
1- a
d rl T
=
-
(.
(21 167)
&- 1)
-
y
.
and ( 167) is (see Harvey ( l98 1)). Under Hv the above test based on ( 166) HLI' does not UMP unbiased (see Lehmann ( 1959)). On the other hand, when (7.2 E'(sw2,) and this can reduce the power of the test significantly (see > 1101(1 Dufour ( 1982)). 1 conditional on HLlb being valid was suggested Another test related to HL2 by Brown, Durbin and Evans ( 1975). The CL.rsL'AzJ-test is based on the test statistic l
1$.f ;
J4(
=
-
i
=
k
1
v
..
t
-
S
k + 1,
=
.
.
,
.
(21.168)
71,
(1/(T-/)q j7- j ttl. They showed that under Hv the distribution by N(0, l - /) ( W'; being an approximate of W( can be approximated Brownian motion). This 1ed to the rejection region .s2
where
=
(I-T
.L, y : !Hz;(>
=
)
(.z
'F- k)'l+ 2t/(l
/(
()v =
.
I%' )(T' - /..)-' 5,
-
with a depending on the size x of the test. For tx 0.0 1, 0.05, 0. 10, a 1.143, 0.948, 0.850, respectively. The underlying intuition of this test is that if Hv is invalid there will be some systematic changes in the j,s which will give rise to a disproportionate number of wfs having the same sign. Hopefully, these T will be detected via their cumulative effects 1V;,t Ik'+ 1, Brown vt aI. ( 1975) suggested a second test based on the test statistic =
=
=
,
.
.
,
(2 1.170)
has a beta distribution Under Hv : l !'. k now'n as the Cl-zrit-zrA/sT-statistic, ly v-sT k i t i h t ) t r t e ( ( ) s w pa ra n-le e ,
,
/fo
lz,--
.
.
(21 17 1)
B6--#T' r), -ty(1- /)). -
.
between the beta and F-distributions ln view of the relationship Johnson and Kotz ( 1970)) we can deduce that
)
j,zr
=
1-5 f-- k ) . 1 '.J( -- r) -- 11 ; (t
-
-
.
'
.
H0 x
F ((F - ;) (t - li)) ,
.
(see
(2 1.172)
478
from assumptions - probability model
Departures
This enables us to use the F-test rejection region whose tabulated values are more commonly available. Looking at the three test statistics (166),(168)and (170)we can see that one way to construct a test for Hvlt is to compare the prediction error with the average over the previous periods, i.e. use the test squared (wr2)
statistics
wr2
z2(y0) t
=
t- 1
1
(1- k) i
t
,
)-
k + 1,
=
.
.
.
-k
(173)is that the denominator
The intuition underlying natural estimator of hl-
1
can be viewed as the with the new information at
which is compared
time t. Note that Wl2
Hv
G2
'v
2
z (1)
(2 1.173)
TJ
,
wf2
1
t- 1
Hv
w/ y ij 1
and
2
(21.174)
z (1-k),
x
=
and the two random variables are independent. These imply that under Hv, z2(y/) F(1,
zty)')
t(t - k), l k + 1, TJ (21.175) lt must be noted that z(y/) provides a test statistic for #t pt j assuming - For an that c,2 cJ.j; see Section 2 1.6 for some additional comments. overall test of Hv we should use a multiple comparison procedure based on the Bonferroni and related inequalities (see Savin ( 1984:. One important point related to a11 the above tests based on the standardised recursive residuals is that the implicit null hypothesis is not quite Hv but a closely related hypothesis. lf we return to equation (156)we can see that x
-k)
l
or
=
'v
.
.
.
,
=
=
'(u?r) 0 =
which is
if x'tlqt
not the same as
-
bt ) -
j
(2 1.176)
0,
=
pt.-pt- 1) 0. =
In practice, the above tests should be used in conjunction with the time of the recursive estimators / ft i 1 2 aths k and the standardised /7 TJ lf we ignore the first few values of recursive residuals wf, t k + 1, these series their time paths can give us a lot of information relating to the time invariance of the parameters of interest. ln the case of the estimated money equation discussed above, the time paths of p,t, #2r,jat, #4 are shown in Fig. 21.1(J)--(J)for the period t 20, 80. As we can see, the graphs suggest the presence of some time . dependence in the estimated coefficients re-enforcing the variance time dependence detected in Section 21.4 using the heteroskedasticity tests. ,
=
.
.
.
=
,
,
.
.
.
,
,
,
=
.
.
,
21.5
(3)
479
time inuriance
Parameter
Tackling parameter
time dependence
When parameter time invariance is rejected by some misspecification test do we proceed'?' The answer to the question which naturally arises is, likely the this question depends crucially on source of time dependence. If behaviour of the the agents behind the actual time dependence is due to model DGP we should try to this behaviour in such a way so as to take this additional information. Random coejjl'cient models (seePagan (1980/ or state space models (see Anderson and Moore ( 1979)) can be used in such cases. lf, however, time dependence is due to inappropriate conditioning or stochastic process then the way to proceed is to Z, is a non-stationary respecify the systematic component or transform the original time series in order to induce stationarity. ln the case of the estimated money equation considered above it is highly likely that the coefficient time dependency exemplified is due to the nonstationarity of the data involved as their time paths (see Fig. 17.1) indicate. One way to tackle the problem in this case is to transform the stochastic processes involved so as to induce some stationarity. For example, if Mf, l > 1) Shows an exponential-like time path, transforming it to .ftA1n(M #)f, r y 1) can reduce it to a near stationary process. The time path of A lntM #lf ln(M/P)f ln(M #)r- I as shown in Fig. 2 1.2, suggests that this transformation has induced near stationarity to the original time series. lt is interesting to note that if A ln(M,/#)r is stationary then ihow
t
=
-
a-,-(Ms),-,n(wM),-cIn('wM),.,
+In(,ws),.-
(2 1.177)
is also stationary (seeFig. 21.3),*any linear combination of a stationary process is stationary. Caution, however, should be exercised in using for stationarity inducing transformation, because ollerd4ferencing, example, can increase the variance of the process unnecessarily (seeTintner ( 1940)). ln the present example this is indeed the case given that the variance of A2 1n(A.1P4t is more than twice the variance of A ln(M/#)f, Var A ln
M
-P
=
t
0.000 574,
M Var A z ln P
=0.001
t
354.
(2 1.178)
ln econometric modelling differencing to achieve near stationarity should not be used at the expense of theoretical interpretation. lt is often using appropriate explanatory the non-stationarity possible to variables. Note that it is the stationarity of (yj/xf, t G T) which is important, not that of (Z,, l (5 T). 'model'
480
Departurcs
from assumptiow
- probability
model
(b) Fig. 2 1.1(fz). The time path of the recursive estimate of p3t- the coefcient time path of the recursive estimate of pzt the
of the constant. () The coefcient of ',,
-
In view of our initial discussion related to the possible inappropriateness of the sampling model assumption of independence (see Chapter 19) we can argue that the above detected parameter time dependence might also be due to invalid conditioning (see Chapters 22-23 for a detailed discussion). ln
21.6
Parameter
structural
481
change
0.100 0.075 0.050 0.025 it 0 -0.025 -0.050 -0.075 -0.1
1968
1970
1972
1974
1976
1978
1980
1982
1976
1978
1980
1982
Time
(c) 4 2 0 -2
Jl! -4 -6 -8 1 -. O1968
1970
1972
1974
Time
(d)
Fig. 21. l(c). The time path of the recursive estimate of pt - the coeftkient of the recursive estimate of #4, the coefficient of (.
of p,. (J) The time path
-
this case the way to tackle time dependence is to respecify pt in order to take account of the additional information present. 21.6
Parameter
structural
change
structural change is interpreted as a special case of time dependence, considered in Section 2 1.5, where some a priori information
Parameter
482
Departures
from assumptions - probability model
0.075 0.050 0.025
f
kc S <
-0.025 -0.050 -0.075
1964
1967
1970
1973
1976
1979
1982
Time
Fig. 2 1.2. The time path of
ln (M/#)l.
0.10
0.05
f * g
o
A
-0.05
-0.10
1964
1967
1970
1973
1976
1979
1982
T ime Fig. 2 1.3. The time path of A2 ln (M/P)f.
related to the point of change is available. For example, in the case of the money equation estimated in Chapter 19 and discussed in the previous sections we know that some change in monetary policy has occurred in 1971 which might have induced a structural change. The statistical GM for the case where only one structural change has occurred at t T'1(T1 > k), takes the general form =
21.6 y ..
=
?l
=
structural
Parameter
p'jxr +
uj t
pjxt+
lkct
change
483 (2 1.179)
,
(2 1.180)
,
12 Tk) T 2 T1 + 1 T) .)w it h p1 EEEE(#1 tr 2j) an d pcEEE(#2,c2a) being the underlying parameters of interest respectively. For T) ,Tk + 1, 'Jl the the case where the sample period is 1::::c1,2, of sample form the takes the distribution
whe re T 1
=
(
.ft
.
.
.
,
,
,
.
.
.
,
=
,
.
,
.
.
,
.
.
.
y1 X 1 N '2/X 2 x whereTz =
X 1 #1
21 c I wj
X 2 #2
0
.
.
,
.
,
.
0 a c-21
.
.
,
(21.181)
,
wa
T- T'l The hypothesis of interest in this case is .
S0:
and
/1 /2 =
which can be separated into .J'1C1'' 0
c2 Sf2)' 0 1
#1 pzt =
'
=
'
c2.2,
IS
0
S(1) 0
=
fa
.J1(21) 0 '
The alternative f-1l is specified as in (181).These hypotheses are very similar to the ones we considered in the previous section and it should come as no surprise to learn that constructing optimal tests for the present case raises similar problems. Moreover, the test statistics enjoy more than just a passing resemblance with some of the statistics derived in Section 2 1.5. The main advantage in the present case is that the point of structural change T1 is assumed known a priori. This, however, raises the issuetof whether F2 > k (p2is estimable) or Tz < k (p2is not estimable). Let us consider the two cases
separately.
Case 1 (L
>
k)
Chow ( 1960) proposed the F-type RSSV
T2(')
.-RSS'I
(seeChapter
,-RSSZ
F-2k
=
RSSL
+
19) test statistic
RSSZ
'
k
,
(21.182)
where RSSV, RSSL and RSSZ refer to the residual sums of squares for the Tk) and subperiod 2( 1:::=Fl + 1, whole sample, subperiod 1 (1 1, 2, F) respectively. The test statistic (182)can be used to construct a UMP That islxitcan invariant test (seeLehmann (1959)) for HL'S against .J/1 fo HLl$. used against for be Sj with ct c2a. to construct an $op timal' test pb j2 The test statistic is distributed as follows: =
.
.
.
,
.
=
r2(y) F(), F-2k)
under
'v
'rcty) F(/(, T-2/; 'v
)
under
HL3', Sl HLl$, f'''h
.
.
,
=
(2 1.183) (21.184)
484
Departures
from assumptions - probability model
where
E(X'kX1)-1 +(XrcXc)- 1(I-1
-(#1
-#c)'
.z
HL''can be used
The distribution of zaty) under rejection region is C1
ly:T2(y)
=
(#1-#2).
to define a size a test whose
Gl,
>
(21.185)
(21 186) .
This test raises the same issues as the time invariance tests considered in Section 21.6 where c21 o'2a had to be explicitly tor implicitly) assumed in constructing a coefficient time invariant test. There is no HLI' is valid when Hvl might not be. A test reason to believe, however, that for Hvlb against S1 can be based on the test statistic comparing the two estimated variances (seeChapter 19): =
.s1 1'3(b p' EEE 1
RSSZ
T,
-
-
=
T3(b
F(T2
'v
k
T2 k
RSSL
- k, T1 k)
under
-
zaty) F(Tc k, T1 k; ) 'v
-
(2 1.87)
,
-
-
HLlI,
under Hj
(2 1.188) (2 1.189)
,
(c2a/c2j) is the non-centrality parameter which ortunately' does where not depend on #1or #2.This turns out to be critical because the test for Hlvt) against Sl defined by the rejection region, =
Cl
ty:z:'(y)>
ca),
(2 1.190)
is independent of the test defined by zaty) (seePhillips and Mccabe (1983/. This implies that a test for Hv against Sj can be implemented by testing for HLltt'irstusing zz(y) and, if accepted, testing for HL''using zaly) in that Seqklence.
Let us apply the above tests to the money equation considered in the previous sections. As mentioned above, a structural change is suspected to have occurred in 1971 because of important changes in monetary policy. Estimation of the money equation for the period 1963f-1971f: yielded -0.793
mt
+ 1.050yf +0.305pt
=
(2.055) (0.208) R2
1og L
=0.968,
=
90.08,
RSSL
=
+
t,
(0.103) (0.014) (0.015)
42 0.964, =
-0.007t ,$1
=0.0155,
0.006 72,
(21.191)
21.6
structural
Parameter
of the same equation
Estimation
mt
=
6.397 + 1.641y, (1.875) (0.191)
-
R2
#2
=0.994,
log1.=95.37, Testing for
zaty) -
yielded
-0.0761
+0.784/7,
+
(21.192)
f,
(0.022) (0.014) (0.035) sz =0.052
=0.0346,
84,
F=48.
H3 the test statistic zaty) is
(u)
(0.05284) 28 ,7a,)
(:.(06
485
for the period 1972-1982/:
=0.993,
RSSZ
HLltagainst
change
-
(2 1.193)
5.004.
HLI' For a size x is rejected. At this stage there is not c,= 1.81 and given that HLI' against Sl r'h HL1t much point in proceeding with testing HL't has been rejected. For illustration purposes, however, 1et us consider the test regardless. The residual sum of squares for the whole period is RSS 0.1 17 52 (see against S1 f''h HLI/ the test statistic is Chapter 19). Hence, testing for HL3' =0.05,
=
'2tS-
52) 72)-(0.052 -m.(06 (0.117 72)+(0.052 84)
(0.006
:4)
72 1-Y117.516. (21.194) HLZ,
is also rejected. strongly It is important to note that in the case of the test for Hv the size is not a i.e. for the above example the overall size is 0.0975. This is but 1 becauseHo was tested as two independent hypotheses in a multiple hypothesistesting context (seeSavin (1984/. Given that ca=2.5 for a size
a=0.05
test we can deduce that
-a)2,
-(1
Case 2 (T2 < k) ln this case 0z is not estimable since ranktxqxa) % Ta < k. This is very similar to the time invariance test where F2 1. If we were to express the test statistic =
W22
z(y,0) -
1
t k -
t i
-
(2 1.195)
1
jg
w/
.k
(see (173/ in terms of the residual sum of squares with test statistic emerges:
l G T2
the following
486
Departures from assumptions - probability model
w/ This is based on the equalities RSSt RSSt- 1 + w/ and RSSt- 1 (see exercise 9). The test statistic CH, known as Cllt?w test (seeChow ( 1960:, as in the case of z(y,0) in Section 2 1.5, can be used to construct a UMP invariant test not of Ho against J.f1but of Ht against Hj fo HLI' where S): Xa(#1 #2)0 0. This is not surprising given that pz is not estimable and thus we need to rely on (y2-Xa/ 1 ) uc - X2(/ 1 - #2)(adirect analogue of Section 2 1.5) in order to construct a test for to the recursive residuals statistic T he Cs-test is distributed as follows: HL).
)r--2
=
=
=
-
=
j
I
czf-Ftw CH
2:
'F1
IILIb under 11% 0 ro
-k)
F( F2, T1 - k; J)
'w
where
under H 1
r''h
,
HLlt,
(21.198) (2 1.199)
Given that parameter time dependence can be viewed as an extension of structural change at every observation t ?L+ 1, 'Cit is not surprising that the Chow test statistic is directly related to the CUSUMSQstatistic discussed in Section 2 1.5 above (seeMcAleer and Fisher (1982:. This test can also be used for Hv against Hj ra HLII but caution should be exercised because although the size is correct the test can be inconsistent when S1 is valid but HL3' is not (seeRea (1978), Mizon and Richard (198$). Moreover, for S1 against H3 the above test does not have the correct size against alternatives when ct # cJ, given that the distribution under f.f1 when HLIS is invalid is not F(Ta, F1 - k). Similarly for testing Hv against Sj the Chow test has the correct size but 1ow power for alternatives when Ht and HLI, are valid but HL'is not, given that the implicit alternative is in fact H1 rn HLl3. These comments can also be made for the test based on z(yf0) in Section 2 1.5. As in case 1 (T2 < k), we need to test HLI3 before we can safely apply the Chow test. Given that yl cannot be estimated, intuition suggests using the prediction sum of squares =
F f
=
Z rj
+
-
1
'-'/
2
'-'
(.:2, l1x2!)
=(y2
-Xa,1)
.
.
.
,
''h
/
(y2-X2,1)
(21.200)
Appendix 21.1
487
instead. This gives rise to the test statistic T2b
(y2 - X 2 / 1 )'(y 2 - X 2 / 1 ) s'f
=
W heru
,
st
=
RSS 1 T1 - k
(a1.a()j)
.
lt is not very difficult to see, however, that the numerator and denominator of this statistic are not independent and hence the ratio is not F-distributed. Asymptotically, T4(y)
however, z2(T2)
'v
s(
P
c2 under
--+
HLI)
and thus
HL2t.
under
Using this we can construct an based on the rejection region
(21.202) size a test for
aspnptotic
HLlagainst
1-1l
'
ly:z.tyl
t7l =
> c'al,
a
dz2(T'2).
-
(21.203)
Ca
Let us apply the above tests to the money equation discussed above. Estimation of this equation for the period 1963-1980 yielded: mt 2.685 + 0.7 13yr + 0.852,/ 0.0521,+ f, (1.055) (0.107) (0.022) (0.014) (0.039) =
(21.204)
-
R2
Iog L
#2
=0.994,
138.52,
=
=0.994,
RSSL
.s1
=0.109
=
0.0392,
23,
Testing for HLlt against 6.484, ca= 11.07, a 0.05. Testing for Ht : z4(y) : CH 1.078, ca 2.34, a 0.05. These results imply that against HLOHLI' both hypotheses cannot be rejected. This is not very surprising given that the post-sample period used for the tests was rather small (Tc 5). In cases where Tc islarger we expect the two hypotheses to be rejected on the basis of the results using CH. In the present case this was not attempted because when Ta > k, CH is no longer the best' test to use; T/y) is the optimal test. SI
=
=
=
=
=
=
Appendix 21.1
uriance
-
stabilising transformations
Consider the case where Vartyf/xf xt) o'lgpt) and q ) is a known function. Our aim is to find a transformation h ) such that c2. Let us assume that we can take the first-order Taylor x,) Varty)/x, of expansion (y,) at p,: =
'
=
'
=
=
(.pf) hpt) ':>:
+ (A',
-/t,)/l'(/t,),
/?'(pr)being the first derivative of /l(yf)evaluated at yt
=pt.
Then, we can use
488
from assumptio
Departures
- probability
model
1972 Time
1975
15
12
9 @ 6
3
0
1963
1966
1969
1978
1982
-2.0
-2.5
-3.0
-3.5
-4.0
1963
1969
1966
1978
1975
1972
1982
Time
(b) Fig. 2 1,4(J).
Time graph of lt.
(#)Time
in order to approximate
this approximation Varltytl/xf
xf) cxvargll/lt/tfl
=
+
graph of ln 1t.
the
(yt-/tt)'(p,))/X,
= vargx/x, xJ('(/t,))2. =
This implies that when we choose hl
'(p,)
1
==
then Vartyl/x,
,
El(/t,)1 =
x,) :>: c2.
'
) such that
of
variance =
x;
(.pf)by
Appendix 21.1
489
The variance stabilising transformation can be used in a general case where the variance of a random variable depends on some unwanted parameters #r. ln the case where Vartyl Xt x,) yto.l the transformation (y,) loge yl is called for because =
=
=
1
'
11(/.tt )
pt.
.
ln Fig. 2 1.4(:/) the time path of 7 days' interest rate is shown which exhibits a variance increasing with the Ievel Jtt. Its log. transformation is shown in Fig. 2 1.4()). Important
ctlactzpt.k
Auxiliary regression misspecification test procedure, Kolmogorov-Gabor elliptical theorem, OLS estimators, P olynomial, Gauss-Markov distributions, non-linear conditional expectations, normalising transformations, GLS estimators, variance stabilising transformations, time dependent parameters, recursive estimator, structural change.
Questions Explain linearity between normality, the relationship homoskedasticity. variables' the Explain the intuition underlying misspecification test. State the finite sample properties of the OLS estimator Komitted
and
type
b (X'X) - 1X'y, =
based on the assumptions that (yf Xt xf) D(#'x,, c2) and y is an T;where the form of independent sample from D(y,/Xf,' #),l 1,2, 1)( ) is not known. Explain the statement: b'T'heOLS estimator b (X'X)- 'X'y of # has minimum variance among the class of linear and unbiased estimators.' theorem shows that b is a fully efficient t'l-he Gauss-Markov estimator.' Discuss. for the linear t'T'he normality assumption is largely unnecessary regression model because without it we can use the OLS estimators b and of # and c2, respectively. Moreover, all the hypothesis testing (72 derived using the MLE*S j and c2 are results about j and asymptotically vald- anyway.' Discuss. =
=
x
.
.
.
'
=
.1
,
490
10.
from assumptions - probability model
Departures
Explain the difference in the asymptotic distributions of sl and J2, the MLE and OLS estimators of c2. Explain the intuition underlying the skewness-kurtosis test for departures from normality. How do we proceed when the normality assumption is rejected in the linear regression model? Discuss the implications of non-linearity in the context of the linear
regression model. How can we test for non-linearity'?
'When linearity is rejected by some misspecification test we should adopt a non-linear specification h(0, xf) instead of p'xt and retaining the assumption of normality for D( pf/''X,;#)) proceed to derive MLE'S for p and c2.' Discuss. Discuss the implications of heteroskedasticity in the context of the linear regression model. A'T'hecomparison between F and / is largely irrelevant since the derivation of #-is based on the assumption that fl is known up to a scalar multiple.' Discuss. Explain the intuition underlying the derivation of the GLS estimator
p-of #.
Discuss the following matrix inequalities: (X'f1 - 1X)-- ' -(X'X) - 1X/f1X(X'X) - 1 Gll c2(X'x) - 1 -(X'x) #'
-
lx'flxtx/xj
- l
y().
'*h
(.Although
#= (X X)
l
X y is both unbiased and consistent estimator of #, no consistent inference about # is possible since no consistent estimator of fl is possible unless c/ is modelled.' Discuss. 18. Explain the intuition underlying the White heteroskedasticity test. 19. How is the White heteroskedasticity related to non-linearity as well? 20. t'T'he way to proceed in the case where homoskedasticity is rejected by some misspecification test is to model cf2 by relating it to xf or some other exogenous variable z, and retaining the linearity and normality (7.2 of #, and the unknown assumptions proceed to derive MLE'S c,2.' postulated model of in the Discuss. parameters Discuss the linear regression model based on the assumption that D(yf,Xr; ) is multivariate t. Explain why Xr cannot be weakly exogenous for 0 > (#,c2) in this model. Explain how we can reconcile the heteroskedasticity of Vartyt/x, =xJ with the normality of D k,f,/X?; $. What do we mean by time invariance of 0u p, c2)'? of Explain how non-stationarity ?.
-
Appendix 21.1 .J?r
Zf HE
Xv
can lead to time dependence parameters of interest. Explain the following form of the recursive estimator of .,.
26.
.w
#r bt 1 + -
(),
() .j
(X, X, )
xrt.'4
n.f
xt), t- 1
-
t
-
k + 1,
.
.
#,: .
,
IL
How can we test for parameter time-invariance in the context of the linear regression model? Explain the intuition underlying the CUSUM test. b'T'heimplicit null in testing for coefficient time invariance using the recursive residuals is not Hlb : pt jf 1 but S! : xtpt jf 1) 0.' Discuss. How do we tackle the problem of time dependence of p? How do we test for coefficient structural stability (constancy)in the case where T2 > k? How is this different from the case Tz < k'? :In the case where Fc < k the implicit null for coefficient constancy is not Hv': (#I - ja) 0 but Ht X2(#j ja) 0.' Discuss. =
=
-
=
=
-
Exercises
and when r, N(0, c2) and compare them with the same quantitites when nt D(0,c,2) and the form of D4.) is unknown. Show that b=(X'X)- 1X/y has minimum variance among the class of unbiased estimators of the form 'v
x
b* (L + Cly, =
L
=
(X'X) - 1X'
is a k x T arbitrary matrix. Show that under the assumption
c
Ep +
p
and
(y,/Xt
=
xl) 'vD(/7(x,),c2),
JJ(y2)# c2,
Show that under the assumption Cov(#) (X X) =
Derive the GLS estimator
(),'t'X, xr) D(#'xf, =
X DXIX X) of
cf2)
.
assumption of exercise 3. c2A is essentially the same as
p under the
Show that knowing f) or A where f) far as estimation of j is concerned.
x
=
492
Departures
from assumptions
probability
-
Using the formula B- 1 A 1 c 1/4.1 + #'A - #, show that =
'
model
C?A- 1a#'A -' 1 for B
--
=
A h-ap'
where
=
'x
0
0) -
(x t
?
1 :=
0
0
'
(x t -
1
x
?-
1
)-
1
(X0' X0
1
x'('X0' X0
1
--G-1-.C---.-t)- x..1.-.4-.L- 1 t - 1 j v, 1 (x, - 1) -k x,- -1-h x, ((X,
-
.
,
-
Show that -
#?
#, l -
=
0 (X0' j X ...q- j ) , -
+ -
1+
1
-;-
,
x/xf
'*1
#y.syx,) j' ' L -'' 1) 1Xr- x, x-q.(y..j,
.
1 8. Show that (X,0-') x,0.. j )jj ) xyx; 9. Verify the expressions for Vartw'r) and Covtwrws) of Section 2 1.5. 10. Show that wf2 RSS: RSSt 1 t 7: /f + 1 Where RSS: .-
=
=
Zl -
1
(.f
-
-
-
#-'Xf) 2 t
.
,
,
=
.
Additional references Bickel and Doksum ( 198 l ); Gourieroux Ramsey ( 1974),. White and MacDonald
t?r
aI. ( 1984),. Hinkley ( 1975),. Lau ( 1974).
( 1980),. Zarembka
( 1985),.
22
CHAPTER
The linear regression model IV - departures from the sampling model assumption
One of the most crucial asstlmptions underlying the linear regression model )'w)' constitutes an is the sampling model assumption that y EEE( )'a. seqtlentially T) X,; 1, 2, from draw'n salnple D( t t?), independent the assumption enables function This likelihood to respectively. us to define be 'l
.
.
.,
.
.
.
=
.
.
.
,
F
-IP; y)
=
c(y)
l-lD(),/X,;
p).
(22. 1)
f=1
lntuitively. this assumption amounts to postulating that the ordering of the observations in yl and Xt plays no role in the statistical analysis of the model. That is, in the case where the data on )'2 and Xt are punched observation by observation a reshuffling of the cards will change noneof the results in Chapter 19. This is a very restrictive assumption for most economic time-series data where some temporal dependence between successive values seems apparent. As argued in Chapter 17, for most model sampling economic time series the non-random seems more appropriate. ln Section 22.1 we consider the implications of a non-independent sample for the statistical results derived in Chapter 19. lt is argued that these implications depend crucially on how non-independence is modelled and two alternative modelling strategies are discussed. These strategies give and autocovrelation rise to two alternative approaches, the respecillcation testing and ways to tackle the tppprotzc/lt?sto the misspecification dependence in the sample. ln the context of the autocorrelation approach the dependence is interpreted as due to error temporal correlation. On the other hand, in the context of the misspecification approach the error term's 493
Departures
from the sampling model assumption
rolt?as the non-systematic component of the statistical GM is retained and samples dependence is modelled from first principles in terms of the in the random variables observable involved. Sections 22.2 and 22.3 consider the proceed with sample and the various ways to a non-random sample independent assumptions misspecification testing for the 22.4 of misspecification analysis in In respectively. Section the discussion perspective. Chapters 20-22 is put into
22.1
Implications
of a non-random
(1)
Defining the concept
t#*
sample
a non-random
sample
lt is no exaggeration to say that the sampling model assumption of independence is by far the most crucial assumption underlying the linear regression model. As shown below, when this assumption is invalid no estimation, testing or prediction result derived in Chapter 19 is valid in general. ln order to understand the circumstances under which the independence assumption might be inappropriate in practice, it is instructive to return to the linear regression model and consider the Zr,. #) to Dlt't 'X,,' 04, t 1, 2, reduction from D(Z1, Zc. F in order to understand the role of the assumption in the reduction process and its relation to the NIID assumption for Z?. r g Tj Zr N (yr,X;)'. As argued in Chapter 19, the linear regression model could be based directly on the conditional distribution D vry'X,,'#j) and no need to define D(Zf; #) arises. This was not the approach adopted for a very good reason. ln practice it is much easier to judge the appropriateness of assumptions related to Dzt., #), on the basis of the observed data, rather than assumptions related to D( pf/Xl; 1). What is more, the nature of the latter distribution is largely determined by that of the former. In Chapter 19 we assumed that (Z!, l 6 T is a normal, independent and identically distributed (NllD) stochastic process (seeChapter 8). On the basis of this assumption we were able to build the linear regression model ln particular the probability and sampling defined by assumptions E1j-I)8(I. viewed assumptions model can be as consequences of )Zt, t c T) being NIID. The normalitl' of D(.yr/Xf; /1), the Iinearit of f'ly'r/xr xr) and the homoskedasticity of Vartyr//''xf x,) stem directly from the normality of Zt. .
.
.
=
.
t
.
.
.
,
.
=
=
The time invariance of 0- (p,c2) and the independent sample assumptions stem from the identically distributed and independent component Of N11D. Note that in the present context we distinguish between homoskedasticity and time invariance of Vart)'r/xt x,) (seeSection 21.5). This suggests that =
22.1
Implications
of a non-random
sample
495
sample assumption the obvious way to make the independent t G -1i-)is a (dependent) stochastic inappropriate is to assume that process. ln particular (becausewe do not want to lose the convenience of normality) we assume that
tzt,
Zf
'v
8), Cov(Z?Z,)
N(m(r), E(l,
E(l,
=
t, s e: T.
#,
(22.2)
to the money equation estimated in Chapter 19, a cursory look the realisation of Z,, t 1, 2, F (see Fig. 17.1(4g)-(J)),would convince at above assumption that the seems much more appropriate than the 1ID us realisations of the process exhibit a very The for such data. assumption systematically trend changes time distinct (the mean over time) and for at well. variance the change of least two them seems to as The question which naturally arises at this stage is to what extent the assumptions underlying the linear regression model will be affected by relaxing the I1D assumption for Z,. r G7T1, One obvious change will come sample y H ( )'1 .'2. )'w)'.What is not so in the form of a non-independent replace which B i11 f)( )'; X:.. #:). Table 22.1 obvious is the distribution reduction of the linear regression summarises the ilnportant steps in the is a N1lD stochastic model from the initial assumption that :Z,, i! where :Z,, l e: 5) is a non-llD process and contrasts these with the case minor of the linear difference between the process. The only regression model as summarised in this table and that of Chapter 19 is that the mean of Zf is intentionally given a non-zero value because the mean plays an important role in the case of a non-llD stochastic process. As we can see from Table 22.1, the first important difference between the Zw is complicated 1lD and non-llD case is that distribution of Z1 considerably in the latter case with the presence of the temporal covariances and the fact that all the parameters are changing with r. The presence of the temporal covariance implies that the decomposition of the joint Zr in terms of the marginal distributions is no longer distribution of Z1, valid. The dependence among the Z,s implies that the only decomposition possible is the sequential conditioning decomposition (seeChapter 6) where the conditioning is relative to the past history of the process denoted by This in turn Z0l-1 Ntzj Zr-- 1)., with t representing the zc, implies that the role played by D(Zr; #) in the lID case will now be taken over by D(Zf Zr0- 1 ; #*(r)).In particular the corresponding decomposition into the conditional (D4y, 'Xf; /1)) and marginal (D(Xr; /2))distribution w111
lf we return
=
.
.
.
t
,
.
.
.
.
.
.
tconstruction'
,
.
.
,
,
.
.
,
,
tpresent'.
,
.
,
.
,
now be
where the past histor-b'of the process comes into both distributions because it contains relevant information for yf as well as X,. Although the algebra in
)) z, ss
Z
:
22
XN
:
Zw
mt l )
E( I l ), E( 1 2)
m(2)
E(2, l )-E42, 2)
:
:
,
'
z
x
,x.
t'
@21
)!'
q m =
.
tz!
c2
#'xr ) aE2-a1ca I mxl +
.
(
=
#s ez (('()p
rz2)
-
.
.
.
j
-
x,)
x
lp Tt F)
A.-(('(,(t) +. #'tltrjx, +
definition of the parameters 0
Xl t.r('--J' . 1 ),
'k,,,
.
c(Y,O
(iii). Vartl',.
time independcnt
(iv)e1
sample )?wl' is an independent y LE t),l sequentially drawn from .I)(.', X,,' kj ). t l 2, 'f; respectively ,
I.'r/.fz0,
(ii) E
=
model
'p
'
=
Sampling
E42,
E( F. l )
mt T)
'r)
-
22
cj j - .1 araa- trcj J.'t//X, E( (ii) xt ) - linear in xt Vartl', Xt x,) - homoskedastic (iii)
c
(iv)
1J(1
.
(x'''' ) 1(''1)) N((.'''''
-
(.1,, 'X / ) Nt'v
model
'
:
j
t ...
Probab ility
'
-
=
,
.
.
.
,
).
xl
XO t
=
-
xI)
(t'()(/),polt),#f(l)-
.
:
.
.
-
i
=
-
,
.
.
,
=
i
+
#'f(/)x,J .
see Appendix
,
r,rg4/)) (for the
1)
1
homoskedastic
,
,
.
involved
txJ(r),
i
.- 1
Iinear in Zlt'-
)'y.)/ is a non-random y Es (yj D( y'p,'/Zt0j Xl,. 9(l)), t 1. 2. .
,
- 1
0)
=
i
)-(Iaf(/).J.,
(free of
Zl' -
1
)
t - 1, c4(l)) - time dependent samf'le sequentially drawn from 'T)respcctively 1-2,
.
.
.
,
22.1
Implications
of a non-random
sample
497
the non-llD case is rather involved, the underlying argument is the same as in the llD case. The role of D()'f/''Xr; 1) is taken over by D()'f Zf0- j Xf ; k$(t)) and the probability and sampling models in the non-llD case need to be defined in terms of the latter conditional distribution. A closer look at this distribution, however, reveals the following: ,
f-1
't.pf,/'J(Y/- j ), Xr0
=
Vart
)f
X/ 't7'(Yr0-1) ,
x))l
c()(1)+
=
#(l)xI
+
F (af(l)yti + #(r)xt J; -
r=1
=
xt0)
=
-
(22.4)
c(jttl ;
where X f0 the conditional
Firstly, although
(x 1
EEE
mean
.
x 2,
.
.
.
x ). t
.
is linear in the conditioning
variables, the number of variables increases with r. Secondly, even though the conditional variance is homoskedastic (free of the conditioning variables) it is time dependent along with other parameters of the distribution. This renders the conditional distribution Dt-pt Zt0- 1, Xt; #y(l)), as defined above, non-operational as a basis of an alternative spification to be contrasted with the linear regression model. It is in a sense much too general to be operational in practice. Theoretically, however, we can define the new sampling model as: ),w)' represents a non-random y drawn, from D()'r Z,0- l X,', 41:(1)), t 1, 2, -(.v1,
.
.
.
,
.
=
.
,
sample sequentially 'Tkrespectively. .
,
ln view of this it is clear that what we need to do is to restrict the generality assumptions related to )Z,, l iE T) in order to render of the underlying ln particular, we need to impose certain D( 'Z,0- j Xr,. t(r))operational. the of time-heterogeneity and dependence of the form the restrictions on ftzf, T). (E This is the purpose of the next two substochastic process r sub-section the dependence of ftzf, t g Tjt is restricted to sections. In the next independence and the complete time-heterogeneity (nonasj'mptotic identically distributed) is restricted to stationarity (seeChapter 8) in order to get an operational model. ln sub-section (3) the time-dependence assumption is restricted even further in an attempt to retain the systematic component of the linear regression model and relegate the dependence to the error term. ,r
,
Departures from the sampling model assumption
(2)
The respec6cation
approach
As seen above, when l G T) is assumed to be normal, and non-llD, the conditional distribution we are interested in, D()'tf'Zh j, Xj,' #y(J)), has a mean whose number of terms increases with r, and its parameters #1'(r)are time dependent (see(4)-46/. The question which naturally arises at this stage is to what extent we need to restrict the non-llD assumption in order the incidental Jptzl'tkrrlplt?p-s problem. ln order to understand to the role of the dependence and non-identically distributed assumptions let us eonsider restricting the latter first. Let us restrict the non-identically distributed assumption to that of stationarity, i.e. assume that )Zr, l (E T) is a normal, stationary stochastic process (without any restrictions on the form of dependence). Stationarity (see Section 8.2) implies that
tzr,
tjolve'
f(Zr)
=
m,
Cov(ZtZ,)
Z1
m
Z2
m
=
E(Ir Yotl Y1Y:
xN
.$1).
(22.7)
-
Yz. 1 Er-?
.
,
5
.
Y1
Zw
Ew-l
m
tiuvttlt -s(),
Y1Eo
i
=
1 2, ,
.
.
.
,
T - 1.
(22.8)
That is, the Zfs have identical means and variances and their temporal covariances depend only on the absolute value of the distance between them. This reduces the above sample g'F x ( + 1)(1x E.'Fx (k+ 1)(1 covariance matrix to a block Toeplitz matrix (see Akaike ( 1974)). This restricts the original covariance matrix considerably by inducing symmetry and reducing the number of k + 1) x (k+ 1) matrices making up these covariances from 7-2 to F- 1. A closerlook at statienarity reveals that it is a direct extension of the identically distributed assumption to the case of a dependent sequence of random variables. ln terms of observed data the sample realisation of a stationary process cxemplifies no systematic changes either in mean or variance and any z-period section of the realisation should look like any other r-period section. That is, if we slide a z-period along the time axis the picture' over the realisation should not differ systematically. Examples of such realisations are given in Figs. 21.2 and 21.3. Assuming that ftzf, t (y T) is a normal stationary process implies that as far tdifferent'
-wnf/tpw'
22.1
Implications
of a non-random
(ii)
Var(),,''c(Y?0- j ),Xf0 x,0)
(iii)
0* EEEE(c(), v
=
pi a i
,
,
,
i
=
,
499
c()2;
=
12
sample
,
.
.
.
,
(22.10) t - 1 c()2).
(22.11)
,
As we can see, stationarity enables us to the parameter timeincidental problem but dependence the parameters problem remains unresolved the number of largely because terms (and unknown parameters) 1. Hence, the time-homogeneity conditional with in the mean (9)increases introduced by stationarity is not sufficient for an operational model. We also need to restrict the form of dependence of (Z,, r g T) ln particular we of the process by imposing restrictions such need to restrict the ergodicity, mixing or asymptotic independence. ln the present case the as independence. most convenient memory restriction is that of (ls This restricts the conditional memory of the normal process so as to enable us to approximate the conditional mean of the process using an mth-ovder Markov process (see Chapter 8). A stochastic vector process .CtZf,l 6 T). is said to be mth-order Markov if 'solve'
.
bmemory'
'lnprtalfc-
Ftzr Z,0- 1)
;(zr,,'c(zr-
=
l
,
z,
- a,
.
.
.
,
z,
Assuming that .f(Zf t e: T) is: normal; (i) stationary; and (ii) asymptotically independent (iii) enables us to deduce that for large enough conditional mean takes the form
(22.12)
-,,,)),
,
m
(hopefully m < F) the
This form provides us with an operational form for the systematic ?,n, c02)is both time component for t > ?1. Now 0* (c'0,je, jf, xi, i 1, 2, invariant and its dimensionality remains fixed as F increases. lndeed, Dty', Zf0- j Xr,. rrl, a t) yields a mean linear in xt, y, -i, xf -f, i 1, 2, homoskedastic variance and /1 is time invariant. Hence, defining the nonsystematic component by =
=
.
.
.
,
=
,
.
u;b
=
.pt -
E(yt/''c(Yy0-I )X0
=
x$0),
.
.
,
(22.14)
from the sampling model assumption
Departures
w'e can define a new statistical GM based on D()'t,
ZJ0-
j
,
Xf
t) to
be
in an obvious notation. Notv that c'a has been dropped for notational convenience (implicitlyincluded in pvq. of stationarity and lt must be noted at this stage that the assumption asymptotic independence for IZ, t e: T) are not the least l'estrictive assumptions for the results which follow. For example asymptotic independence can be weakened to ergodicity (see Section 8.3) without affecting any of the asymptotic results which follow. Moreover, by strengthening' the memory restriction to that of tp-mixing some timeheterogeneity might be allowed without affecting the asymptotic results (see
White (1984)). It is also important to note that the maximum lag m in ( 15) does not represent the maximum memory lag of the process tyf/'Z)'- 1 Xf, t e: T) a.S in the case of an m-dependence (see Section 8.3). Although there is a duality with an mth-order Markov process, in the result relating an rn-dependent of is considerably longer than m (seeChapter 8). latter the the memory case This is one of the reasons why the AR representation is preferred in the is determined The maximum lag by the solution of present context. memory polynomial: 1ag the ,
l
a(.f-) =
af.f-f
-
i
=
=
(22.17)
0.
1
ln view of the statistical GM ( l5) let us consider the implications of the non-random sample (i)-(iii)for the estimation, testing and prediction results in Chapter 19. Starting with estimation we can deduce that for #=(X X)
and
sl
(22.18)
Xy
1 T'-/l'
=(
(22.19)
the following results hold: i () (ii)
A'(/)
#() + (X'X) - 1X'.&Z* 7)# #' j is a biased # #; # is an inconsisten: estimator of #; AP ++
=
-
estimator
of
#' ,
-
''h
MSEI#)HE c 2 (X X) t
-
1
A(s2)=c2 +y';(z*'M
+
t
(X X)
-
1
z*ly#c2.
.X
IXIX X) 1 # c 2 (X X) 1 js a biased estimator Of c2'
....y
z
X '(Z SZ .:2
5
+?
,
-
/
-
?
.
,
Implications
22.1
(v) (vi)
s2
+%
c2;
-$2
of a non-random
sample
is an inconsistent estimator
j01 Of
c2;
p MSE(j)'
$2(X'X) -
1 +.+
where Mx Iw-X(X'X) - 1X'. These results show clearly that the sample assumption are very serious for the implications of the non-random y2) as estimators of 0 E (#, c2). Morcover, (i)(vi) appropr iateness of l N (/ taken together imply that none of the testing or prediction results derived in Chapter 19, under the assumption of an independent sample, are valid. In particular the l-statistics for the significance of the coefficients in the estimated money equation are invalid together with the tests for the coefficients of v, and pt being equal to one as well as the prediction intervals. At first sight the argument underlying the derivation of (i)-(vi)seems to be variables problem' criticised in Section 20.2 as identical to the being rather uninteresting in that context. A direct comparison, however, between ( 15) and =
,
bomitted
y, j'x, =
(22.20)
+ u,.
reveals that both statistical GM*s are special cases of the general statistical
GM
p,
1), Xf0 xr0 .E'()'j,/W(Yr0)+
=
=
under alternative constitute
on .tZt, t c T) ln this sense from the same joint distribution
assumptions
'reductions'
D(Zl
,
.
.
.
,
.
Zw; #)
them directly comparable; conditioning set f'j
(3)
.t
=
c(Y,0-j ). x,0 xy
The autocorrelation
(20) and ( 15) (22.22)
which makes
.
(22.21)
gt
=
' )
.
they are based on the same
(22.23)
approach
The main difference between the respecification approach considered approach above and the autocorrelation lies with the systematic the respecification In approach we need to respecify the component. systematic component in order to take account of (model) the temporal systematic information from the sample. ln the autocorrelation approach remains the same and hence the temporal the systematic component systematic information will be relegated to the error term which will no longer be non-systematic. The term autocorrelation stems from the fact that dependence in this context is intereted as due to the correlation among
from the sampling model assumption
Departures
the error terms. This is contrary to the logic of the approach of statistical model specification propounded in the present book (seeChapter 17). The approach, however, is important for various reasons. Firstly, the approach is very illuminating for both comparison with the respecification approach approaches. Secondly, the autocorrelation dominates the textbook econometric literature and as a consequence it provides the basis for most misspecification tests of the independent sample assumption. The systematic component for the autocorrelation approach is exactly the same as the one under the independent sample assumption. That is, assuming a certain fol'm of temporal dependence (see (41)). j) yt > .E'(y, V',
=
#'x,.
(22.24)
This implies that the temporal dependence in the sample will be left in the
error term:
(22.25) ;f yt - E (.)/.67)). .%.as defined in (23).In view of this the error term will satisfy the following properties: =
(i)
Elst/.@-,)
0
=
E (cJs/.f? r)
(22.26) c2(p, t,(r,
=
,s),
These assumptions in terms of the observable the rather misleading notation: (y,''X) N(Xp, c2Vw), 'v
r.v.'s are often expressed
Vw>0.
in
(22.28)
The question which naturally arises at this stage is. are the implications of this formulation for the results related to the linear 19 under the independence regression model derived in Chapter assumption'?' As far as the estimation results related to j (X'X) - 1X'y and .$2 g1/(F- klj/, y - Xj are concerned we can show that: bwhat
=
=
=
J is an
unbiased estimator
(i)'
Ep
(ii)'
j
(iiil'
Covt#-) c2(X'X) - IX'V F X(X'X) -' 1.
P -+
p
=
5
# if limr..o.(X'VwX'''F< =
)s2)
(iv)'
=
S
2X
,5,2
,
and non-singular;
tr( 1 Px)Vw + c2; -
is an inconsistent estimator
C+c2(X'X) - IX'V z'x(x'Xjsl and bare not independent.
S 2(X/X) -
#;
'
gc2/(T'-/()j
(7.2
y.
of
1
1
.
,
Of
c2. ,
22.2 Tackling temporal dependence
503
ln view of (iiil', (v)',(vi)' and (viil' we can conclude that the testing results derived in Chapter 19 are also invalid. The important difference between the results (i)-(vi) and (i)'-(vi)'is that / estimator in the latter case. This is not surprising, is not such a however- given that we retained the systematic component of the linear regression model. On the other hand, the results based on Covty/-'= X) property of / in the present G 21 are inappropriate. The only undesirable v the relative said its is inefliciency be proper MLE of j when Vw to to context is assumed kntpwn. That is, p-is said to be an inefficient estimator relative to the GLS estinlator kbad'
=
'''
j j j #= (X Vw X) . X Vw y y
.
g
.
(22.29)
(see Judge et aI. (1985:. A very similar situation was encountered in the case of heterosk-edasticity and the same comment applies here as well. This efficiency comparison is largely irrelevant. ln order to be able to make (2) with justifiable efficiency comparisons NN': shotlld be able to compare (/. well is information l known, however, the set. t based same estimators on no uonsistent estimator of the that in the case where 5'w i s tlnknown exi information and the of interest st mat rix could not be used to parameters efficiency bound. full lower define a 22.2
Tackling temporal dependence
sample question: 'How do we proceed when the independent of considered the this will testing invalid?' before is be assumption approach autocorrelation the the assumption because in two are inextricably bound up and the testing becomes easier to understand when the above question is considered first. This is because most misspecification approach consider tests of sample independence in the autocorrelation which we assumption independence particular forms of departure from the approach, This however, will be will discuss in the present section. above, mentioned respecification approach because, as considered after the respecification the the former is a special case of the latter. Moreover, approach provides a most illuminating framework in the context of which the autocorrelation approach ean be thoroughly discussed. The
(1)
The
rtzArptlclr/ztw/t?zl
approach
ln Section 22. l we considered the question of respecifying of the linear regression model in view of the dependence model. lt was argued that in the case where lZ:. (E 'T' is independent process stationary and asymptotically
the components in the sampling assumed to be a the systematic
Departures from the sampling model assumption
component takes the form
(22.30) The non-systcmatic uf
'(F(Y/- j ),Xf0 x/),
).,t - S(.J.'?
=
This suggests that to the sequence 9 t
f'lf Vf -
1)
is defined by
component
,f(
t > m.
=
u,, t > rn)
defines a martinqale derence process relative J(Z?() Xr+ I), l > m, Of c-fields (see Chapter 8). That is,
=
,
0,
=
(22.32)
Moreover, 2
Gf),
JJ(url,/x)
(22.33)
=
0,
These properties will play a very important i.e. it is an innovation p?-?c'(?.$.$. role in the statistical analysis of the implied statistical GM:
'f p'vxt+
afl'r
=
1*=
J
-
i
+ i
=
l
(22.34)
&t, - f+
ixt
1
If we compare (34)with the statistical GM of the linear regression model we can See that the error term, say f;r in .
)'f
=
Fxf+
(22.35)
lr,
noise relative to the information set (c(Yt0- Xf0 xf0) is given that t)t is largely predictable from this information set (see Granger (1980:. Moreover, in view of the behaviour of ;r the recursive estimator no longer white
#, #! 1 + =
j),
(Xf Xl )
xtll'f -
#, 1 xr) -
=
(22.36)
(seeChapter 2 1) might exemplify parameter time dependence given that (yj // 1x,), t > k, is no longer a mean innovation process but varies systematically with t. Hence, detecting parameter dependence using the recursive estimator j l > k, should be interpreted with caution when invalid conditioning might be the cause of the erratic behaviour of -
1
-
15
f(#f
-
#f 1), -
(22.37)
The probability distribution underlying (34) comes in the form of D()'/'Zy- :, X.,',#), which is related to the original sequential decomposition T
D(Z*; #)
=
L1D(Z,/Zf -
f=1
1,
.
.
.
,
Z1;
#),
(22.38)
22.2
Tackling temporal dependence
D(Zf /Zf0- j ;
#)
Dt.y'f(L,t-k
=
xf 41) otxf//'Zf0- ; #a). ,.
,
505
(22.39)
.
j
The parameters of interest 0 EEE(jf, xi + 1, i 0, ls n'l, c2) are functions of 41 and Xt remains weakly exogenous with respect to 0 because #: and /2 are variation free (seeChapter 19). Hence, the estimation of 0 can be based on D(yf/'Zf0-j, xf',#j) only. For prediction purposes. however, the fact that D(Xl Z,0- j ; #a) involves Yf0-j cannot be ignored because of the feedback between these and Xf. Hence, for prediction purposes in order to be able to ZtOconcentrate exclusively on D4y1 1, Xf,. #1) we need to assume that =
D(Xf/'Z!0- 1 ;
#c)
=
D(Xf Xf0.j
,'
.
.
.
,
(22.40)
#a),
i.e. Yt0- j does not Granqer cause Xf (See Engle et aI. ( 1983:. When weak exogeneity is supplemented with Granger non-causality we say that Xt is stronql), exogenous with respect to 0. The above changes to the linear regression model due to the non-random sample taken together amount to specifying a new statistical model which (1 ?-t?g?-e'ssf(??7 nl()(leI. Becatlse of its inportance in we call the J'nfilhnftr lineal. specification. estimation, testing and prediction econometric modelling the in the context of the dynamic linear regression model will not be considered here but in a separate chapter (see Chapter 23). (2)
The autocorrelation
approach
As argued above in the case of the autocorrelation approach the stochastic t 6 T) is restricted even further than just stationary and process asymptotic independent. ln order to ensure (24) we need to restrict the identical, temporal dependence among the components of Zr to be .tzf.
'largely'
i-e. rovtzffzsj)
.....=
c-/l't
j
--
yj
), i.j
=::
1 2, ,
.
.
,
,
(22.41)
1%' .+ 1,
(see Spanos ( 1985/7) for further details). The question of restricting the time-heterogeneity and memory of the process arises in the context of the autocorrelation approach as restrictions on c(l) and n(l, s) (see (27)).lndeed, looking at (28)we can see that c2(r) has already been restricted (implicitly),c2(t) c2, t (E T. Assuming that ty 1) is a stationary process Vw becomes a Toeplitz matrix fsee Durbin ( 1960)) of the form Ltts trjr-l, r, s 1, 2, T and although the number of unknown parameters is reduced the incidental parameters problem is not entirely solved unless some restrictions on the 'memory' of the process, such as ergodicity or mixing, are imposed. Hence, the same sort of restrictions as in the respecification approach are needed here as well. In practice, however, this problem is solved by postulating a =
tcf,
=
=
.
.
.
,
506
from the sampling model assumption
Departures
generating mechanism forcf which ensures both stationarity as well as some form of asymptotic independence (see Chapter 8). This mechanism (or model) is postulated to complement the statistical GM of the Iinear regression model under the independence assumption in order to take account of the non-independence. The most commonly used models for F;, are the AR(m), MA(m) and ARMAIJ?. qj models discussed in Chapter 8. By far the most widely used autocorrelated error mechanism is the AR( 1) model where st pBt - 1 + ut and ut is (1 w/lfftontpyt? process. Taking this as a typical example, the statistical GM under the non-random sample assumption in the context of the autocorrelation approaeh becomes =
p
=
c,
=
.1
#'x + l
pBt
- 1
(22.42)
;r,
+
N,,
0< p
<
1
(22.43)
,
The effect of postulating (43)in order to supplement (42)is to reduce the number of unknown parameters of Yv tojust one, /?, which is time invariant as well. The temporal covariance matrix :'7. takes the form /? r-2
1)
(22.44)
(see Chapter 8). On the assumption that (43) represents an appropriate model for the dependency in the sample we can proceed to estimate the 4,7.2 J(;/). parameters of interest 0 (#, p, c2) where The likelihood function based on thejoint distribtltion under (42)--(43) is ''
L(0,. ).) ,
=
2
(2z:c )
1%2
=
(t.je t y,rv j; ;))
T
1ogA(0,' y) con st - - Ioc 2 =
-
0-2
-.
tj
)
+ . 1og ( 1 - /? 2
2)
(22.46)
22.2
Tackling temporal dependence
1 t?log L X'Vw(p) (7p =-rj c
(' 1og L t?c2 -
T =
-
x+
2c=
1 2c
L p p lo-#-.-+ 1 p'/? ( - p2)
1
1
4.
c,v z.(/?) ..
pc1
-
(22.47)
0,
1;=
(7.2
(22.48)
(j
s
1,
1
+-.j c
:=
ya (;, - vst-
t
-
j
)f;, j -
-
0,
(22.49)
where c HE y -X#. Unfortunately, the first-order conditions ( log Lj, ?p= 0 cannot be solved explicitly for the MLE of 0 because they are not linear in p. To derive # we need to use some numerical optimisation procedure (see Beach and Melinnon ( 1978), Harvey (198 1), inter alia). For a more extensive discussion of the estimation and testing in the context of the autocorrelation approach, see Judge et aI. ( 1985).
(3)
compared
The two approaches
- the common
factor restrictions
As mentioned in Section 22.1, the autocorrelation approach is a special case of the respecification approach. ln order to show this 1et us consider the example of the statistical GM in the autocorrelation approach when the error mechanism is postulated to be an AR(rn) process
lpl
v, #'xf+ =
j
pftl't
f=1
-.
j'xr
i -
1
-
) +.
ld,
.
(22.51)
lf we compare this with the statistical GM (34) in the respecification approaeh we can see that the two are closely related. lndeedst34) is identical to (51) under Ho
ppi
=
-
pi,
?' =
l 2, ,
.
.
.
,
m.
(22.52)
These are called ctprtlmt?njctor restrictions (see Sargan (1964),Hendry and Mizon ( 1978). Sargan ( 1980)). This implies that in this case the model suggested by the autocorrelation approach is a special case of (34)with the which arises at this common factors imposed a priori. The natural question result or is only true for this particular stage is whether this is a general example. In Spanos (1985/)it was shown that in the case of a temporal covarianee matrix Vw based on a stationary ergodic process the hybrid statistical GM in the context of the autocorrelation approach takes the
508
from the sampling model assumption
Departures
general form (22 53) .
Hence, the autocorrelation approach can be viewed as a special case of the respecification approach with the common factors restrictions imposed a
priori.
Having established this relationship between the two approaches it is factors interesting to consider the question: tWhy do the common The answer restrictions arise naturally in the autocorrelation approach?' lies with the way the two approaches take account of the temporal dependence in the formulation of the non-random sample assumption. In the respecification approach based on the statistical model specification procedure proposed in Chapter 17 any systematic information in the statistical model should be incol-porated directly into the definition of the systematic component. The error term has to be non-systematic relative the to the information set of the statistical model and represents tunmodelled' part of y'r. Hence, the systematic component needs to be redefined when temporal systematic information is present in the sampling approach attributes the model. On the other hand, the autocorrelation dependence in the sample to the covariance of the error terms retaining the same systematic component. The common factors arise because by modelling the error term the implicit conditioning is of the form #n
z);r,,'c(E0 ( .
where Ef0-j
(:,
EB
'tcr
t
=
-
1
1,
)) ;f
f
-
r =
a,
.
(22.54)
ait:t -f-
.
1
.
c(Et0-j ))+
.
,
;()),
The implied statistical GYI is
(22.55)
ut
(22.56) which is identical to (53), Hence, the common factors are the result of tmodelling' the dependence in the sample in terms of the error term and not in terms of the observable random variables directly. The result of both approaches is a more general statistical GM which the dependence of the in the sample in two different but related ways. The restrictiveness autocorrelation approach can be seen by relating the common factor restrictions (52)to the parameters of the original AR(lz7)representation of Gmodels'
dependence
Tackling temporal
22.2
tZ,, r G T) based on the following sequential conditional
(Zr
At
-
1
m ) kN ))
f.'k1
1()
2(1')
a1 A 2 a(,)
'.w
f
a 2 1 (j)
I
=
-
I.? i -
xf -
.
distribution: -1
f7.)1 1 ,
i
2
(22.57)
.
'jcc
f-tu j
As shown in the appendix- the parameters of the statistical GM relatedto the above parameters via the following equations:
/30
Dz-zif5zl
ui
)(1 1 (i) + e?
=
and
=
ta1
2()
b'i =
.
)
a c 1 (ij ) fo r i
12
=
,
Hence the common factor restrictions hold al a()
a'21()
=
and
0
=
(22.58)
+ t?3l 2612-21 A22(f)
2-a1
.,2f1
:
1
(4) are
A2a(f)
,
.
.
,
ln.
when
J1 1 (f)I
=
.
for a11 i
=
1.
.
.
.
,
n1.
(22.59) hold when Granger non-causality That is. the common factor restrictions holds among all Zfts and an identical form of temporal self-dependence 1985tJ)). These are very rn. t > m (see Spanos ( exists for all Zits. f= 1, principle factor priori. ln the impose restrictions common unrealistic a to of the the by context in tested testing indirectly be (59) restrictions can representation. general AR(m) A direct test for these restrictions can be formulated as a specification test approach. In order to illustrate how the in the context of the respecification restrictions can be tested 1et us return to the money eommon factor estimated in Chapter 19 and consider the case where m 1. The equation statistical GM of the money equation for the respecification and autocorrelation approaches is .
.
,
.
=
illl <
Under S
X () :
1::
j
=
-
-.
a '
,::
.
Iz
?
,
-
=
Xg -
--
j
/.3
(:4, '
=
(Z4
-
..
.-
/.4
1
(22.61)
from the
Departures
model assumption
sampling
wtt can see that the two sides of (62)have the common jctor ( 1 al L) which can be eliminated by dividing both sides by the common factor. This will give rise to (61). The null hypothesis Hv is tested against -
H 1: a 1 +
,-3 a
-
tz....j
s/74
.
or
/72
# -
a:
#
.:1
/./3
-
.
Although the Wald test procedure (see Chapter 16) is theoretically much more attractive, given that estimation under 1j is considerably easier (see Mizon ( 1977), Sargan ( 1980) on the Wald tests), in our example the likelihood ratio test is more convenient because most computer packages provide the log likelihood. The rejection region based on the asymptotic likelihood ratio test statistic (seeChapter 16).
-2 logc-;-ty)
=
2(loge
H
-(#,
y)
# refer to the MLE'S
where and the form
y))
log. L,
-
z2(k
x.
-
(22.63)
1),
Sl and Ho, respectively,
of 0 under
takes
X. .f(
C1
y:
=
of
Estimation
mt
=
-
-
2 logg 2(y) y
0.766 + 0.793,n,
+ 0.160p! (0.220)
log L Estimation
of
) ,
a
dz
=
1
-
+ 0,038)!r + 0.240.:!
(0. 169)
(0.182)
0.04 1f, + 0.006ft l 12)
1 -
(0.0 /12
=0.999,
-
=0,999,
1).
(22.64)
s
=0.018
-
1
+ 0.023p/
(0.208) (22.65)
li,,
+
(0.013)
(0.018) 1,
=209.25,
(61) for the
same period yielded
mt 4. 196 + 0.56 1rf + 0.884/4 - 0.040, ( 1.53) (0.158) (0.037) (0.0l3) =
R2
24k
C2
(60) for the period 1963//.-1982f,, yielded (0,582) (0.060)
R2
(.',
42
=0.998,
=0.998,
+
,
t
=
0.8 19t
(0.064)
-
1
+
t,
(0.022)
log L= 187.73,
.$=0.0223,
(22.66) and three 43.04. Given that c, 7.8 15 for (z of is strongly rejected. Ho freedom, degrees As mentioned above, the validity of the result of a common factor test Hence,
-2
1og.2(y)
=0.05
=
=
Testing the independent sample assumption
22.3
511
of the statistical GM postulated for the depends on the appropriateness well-defined model of estimated statistical model. The general a as part of extensively discussed in the next chapter. this is ensuring question
22.3
Testing the independent sample assumption
(1)
The respecijication
approach
In Section 22. 1 it was argued that the statistical results related to the linear sample regression model (see Chapter 19) under the independent assumptions are invalidated by the non-independence of the sample. For importance to be able to test for this reason it is of paramount
independence.
As argued in Section 22.2, the statistical GM takes different forms when the sampling model is independent or asymptotically independent, that is,
'f
=
#'x?+
uf
t
,
=
1 2, ,
.
.
.
,
(22.67)
'F
(22.68) respectively. This implies that a test of independence based on the significance of the parameters x EEE(aj, #' )' i.e. .
.
.
,
p.j
.
.
can be constructed a.)', j* H(#'j, #2, .
,
,
and
pz 0 =
/l* 0. '
In view of the linearity of the restrictions an F-test type proeedure (see Chapters 19 and 20j should provide us with a reasonable test. The discussion of testing linear hypotheses in Chapter 20 suggests the statistic: r(y)
=
RRSS - URSS URSS --
--
-
T'- ktnl + 1) mk
---
--
.j-
(22.69)
nl/zt-rl has an asymptotic chi-square distribution (z2(mk))under Hv. ln small samples, however, it might be preferable to use the F distribution approximation (F(mk, F- Llm+ 1/) even though it is only asymptotically
Departures
from the sampling model assumption
justifiable. This is because the statistic z*( #') nykztl'l increases with the number of regressors; a feature which is particularly problematical in the present case because of the choice of n1. On the other hand, the test statistic (69) does not necessarily increase when m increases. ln practice we use the z(y) > t'alj where ca is defined by test based on the rejection regict) C1 =
.ry:
=
,
L'L?
a
(22.70)
dFtl?1k, T- /(v? + 1))
=
f'J
as if' it were a finite sample tcst. Let us consider this test for the money equation estimated in Chapter 19 with m 4 using the estimation period 1964- 1982/:,, =
mt 2.763 + 0.705.4 + 0.8621,, - 0.0533.:+ ,, ( 1.10) (0.112) (0.022) (0.014) (0.040) =
.R2
log L
2
=0.995,
=
138.425,
0.995,
=
RSS
mt 0.706 +0.589-, (0.815) (0.132) =
-
1
0.040 22.
=
0. 1165,
=
-0.018rn -0.046mf 3 +0.2 14-1 -z (0.152) (0.166) (0.129)
+ 0. 19lyt + 0.5 18 v, : (0.199) (0.261) -0.060pr (0.348)
s
+0.606p-
j
(0.670)
-0.047f, + 0.0 17ff (0.0 14) (0.020)
-
0.253.9,1 c -
(0.255)
-
0. 116.:,
(0.260)
-0.38 1pf 2 +0.558p,-
(0.642) -
(0.022)
2
+ 0.006/2
(0.021)
-
0.022.:/ 4. - a-
(0.223)
-0.479pf
(0.630)
-0.025?'r
1
3
a
-4
(0.348)
-0.0 18)
(0.014)
log L
#2
=0.999,
=
-
,j.
(22.72)
+ l, (0.0 18) R2
-4
2 10.033,
=0.999,
s
Rk5'=0.017
=0.0
17 78,
697.
19.25. Given that cx 1.8 12 for The test statistic (69)takes the of 56) l6, and 0.05 deduce that Hv is strongly freedom degrees ( can we a rejected. That is, the independence assumption is invalid. This confirms our initial reaction to the time path of the residuals , y - x t (see Fig. 19.3) that some time dependency was apparently present. An asymptotically equivalent test to the F-test considered above which corresponds to the Lagrange multiplier (seeChapters 16 and 20) test can be based on the statistic value z(y)
=
=
=
=
#
Hv
LMy)=
TRI
x
z2(mk),
(22.73)
22.3
Testing the independent sample assumption
where the R2 comes from the auxiliary
regression
t
=
m + 1,
.
.
.
,
'.E (22.74)
In the case of the money equation the R2 for this auxiliary regression is 64.45. Given that cx= 26.926 for x 0.848 which implies that LMy) and 16 degrees of freedom, hv is again strongly rejected. lt is interesting to note that faMtyl can be expressed in the form : =0.05
=
LMqy)
TR2
=
RRSS
T
=
URSS
-
(22.75)
RRSS
(see exercise 4). This form of the test statistic suggests that the test suffers from the same problem as the one based on the statistic z*( T'), given that 5.2 increases with the number of regressors (see Section 19.4). (2)
approach
The autocorrelation
of the As argued in Section 22.3 above, tackling the non-independence approach before testing the sample in the context of the autocorrelation appropriateness Of the implied common factors is not thecorrect strategy to adopt. In testing the common factors, implied by adopting an error autocorrelation formulation, however, we need to refer back to the useful is a test of respecification approach. Hence, the question arises, the independence assumption in the context of the autocorrelation approach given that the test is based on an assumption which is likely to be erroneous'?' In order to answer this question it is instructive to compare the statistical GM's of the two approaches: 'how
J.',-
#'oxf+ j i
=
afy, 1
-
,.
+ i
) =
'ixt
i
-
1
+ u,,
r>
(22.76)
n1,
and p/
=
p'xt+
t'tl-lrt
.;t,
=
?(f-)u,,
where J(L) and /74L)are pth- and /th-order polynomials in L. That is, the postulated model for the error term is an ARMAIP, q) time series formulation (see Section 8.4). The error term )f interpreted in the context of (76) takes the form G
:-(#0 -
bs'xt+
Z Eaiy'f-i+ #ixf-(l + u,-
(22.78)
=1
That is, cf is a linear function of the normal, stationary and asymptotically t > rn) as independent process Zf -.f, i 1, 2, m ).This suggests that .j
=
.
.
.
,
t;f ,
Departures
from the sampling model assumption
defined in (78)is itself a normal, stationary and asymptotically independent process. Such a proccss, however, can always be approximated to any degree of approximation by an ARMAIP, q) stationary process with enough' p and q (seeHannan ( 1970), Rosanov ( 1967), inter alia). ln view of this we can see that testing for departures from the independent sample assumption in the context of the autocorrelation approach is not unreasonable. What could be very misleading is to interpret the rejection of assumption the independence of the error as an autocorrelation model postulated by the misspecification test. Such an interpretation is totally unwarranted. An interesting feature of the comparison between (76)and (77)is that of the role of the coefficients of xf, pvand #, are equal only when the common factor restrictions are valid. In such a case estimation of # in yt #'xf+ should yield the same estimate as the estimate of # in klarge
'endorsement'
=
)', j'x, =
+
)r,
faly1-laf
=
h(1.)u,.
.5!
(22.79)
Hence, a crude
of the implicitly imposed test' of the appropriateness restrictions might the two estimates of j in factor be to compare common the context of the autocorrelation approach. Such a comparison might be quite useful in cases where one is reading somebody else's published work and there is no possibility of testing the common factor restrictions directly. In view of the above discussion we can conclude that tests of the sample independence assumption in the context of the autocorrelation approach do have a role to play in misspecification testing related to the linear regression model, in so far as they indicate that the residuals Jl, t 1, F do not constitute a realisation of a white-noise process. For this reason we are going to consider some of these tests in the light of the above discussion. ln particular the emphasis will be placed on the non-parametric aspects of these tests. That is, the particular form of the error autocorrelation (AR(1), MA( 1),ARMAIP, t?)) will be less crucial in the discussion. ln a certain sense based tests will be viewed in the context of the these autocorrelation auxiliary regression =
.
.
.
,
(22.80) which constitutes a special case of (74). The particular aspect of the process t G T). we are interested in is its temporal structure. The null hypothesis of interest in the present context is that t g T) is a white-noise process (uncorrelated over time) and the alternative is that it is a dependent stochastic process. The natural way to
tif,
.f(:,,
Testing the independent sample assumption
22.3
proceed in order to construct tests for such a hypothesis is to choose a measure' of temporal association among the cts and devise a procedure to detennine whether the estimated temporal associations are significant or not. The most obvious measure of temporal dependence is the correlation defined by between cf and Ct 1, What We Call lth-order autocorrelation +
jej czz -
Covtkk
+I)
; gvartk)vartq..1---6) j
In the case where (ct,t e: T) is also assumed to be stationary Vartrt +!)) this becomes e 1
=
Covtk;,
(Var(cf)
=
+/)
var(:,)
--
(22.82)
'
(22.83) Intuition suggests that a test fof thesapple independence assumption in the present context should eonsider whether the values of for some I 1, 2, rn, say, are significantly different from zero. ln the next subsection we . 1 and then generalise the results to != m > 1. consider tests based on 1:::z: ,
.
.
=
,
Tbe Durbin-ktson
tes
(/ 1) =
used (and misused) misspecification test for the approach is autocorrelation the of the assumption in independence context GM The postulated statistical so-called Durbin-Watson for the test. the purposes of this test is
The
most
.J.'r=
widely
#'x,+
'ir,
t:t =
pst - I +
(22.84)
l,,
The null hypothesis of interest is Ho1 p (i.e.t?t u!, white noise) against Durbin the alternative J.fl : p #0. Building on the work of Anderson (1948), and Watson (1950, 1951) proposed the test statistic, =0
=
(22.85)
from the sampling model assumption
Departures
where 0
-1
0
-1
2
0 2
-1
0
(22.86)
: 2
- 1 1
between A1 and %'v is given by
The relationship
where C
=diagl
1. 0, '
V,.(p) -
:>:
.
.
,
.
0, 1). Durbin and Watson
( 1 -p)2I
used
the approximation
(22,88)
+ /?A1
,
which enabled them to use Anderson's result based on a known temporal 1 Vp)Iw. The covariance matrix. As we can see from (88), when /? region for hypothesis the null rejection =0,
HoL
p
0,
=
against
=
H3 : p + 0
takes the form C1
.j
: z 1 ()') %c'a
.p
=
). ,
(22 9) .8
where L'z refers to the critical value for a size a test, determined by the distribution of the test statistic (85)under Hv. This distribution, however, is inextricably bound up with the observed data matrix X given that b= MxcMx= Iw X(X'X) - 1X' (see Chapter 19) and -
r 1 (y)
=
:'M .h(.A1Mx:
.
-
E'Mx:
.
.
(22.90)
This implies that the Durbin-Watson test is data spt?c'tc and the distribution of z1(y) needs to be evaluated for each X. ln order to make the evaluation of this distribution easier to handle Durbin and Watson, using the fact that Mx and MxA1Mx commute (i.e. My(MxA1 Mx) (MxA1 Mx)Mx) and Mx is an idempotent matrix, suggested using an orthogonal matrix H which diagonalises both quadratic forms in (90) simultaneously, i.e. =
F-k
''C1
(y)
=
i
Z
r-k i
.'ftl?
1
.=
.
) 1 t,? =
(22.91)
Testing the independent sample assumption
22.3
t: H'c =
N(0, c2Jw).
'v
(22.92)
Hence the value cx can be evaluated by statement based on (91) for ca:
##z 1 (y)f for a given size
c.)
=
Pr i
.=
1
(
',.
-
the following probabilistic
'solving'
t7'.)(/
G0
=
a
x.
( 1970), however, suggested that in practice there was no need to evaluate cx. Instead we could evaluate Hannan
p(y)-Pr
F .-
l
1*uzur
1(
(vj-r1(y))tk? /9
(22.94)
,
and if this probability exceeds x we accept Ho' otherwise we reject Hv in favour of HL* : p > 0. For H( : p as the alternative 4 -z1(y) should be used in place of r1(y). The value 4 stems from the fact that -::0
'r1 (y) 2( 1 'v
lI
(22.95)
)
is the first-order residual correlation
coefficient and takes values between
and l hence 0 Gz1(y) < 4. - 1 ln the case of the estimated money equation discussed Watson test statistic takes the value
above
the Durbin-
z 1 (y) 0.376.
(22.96)
=
For this value p(y ) ().()()()and hence H () : /) 0 for any size a test. The question which arises is, do we interpret the rejection of Ho in this case/' 'Can we use it as an indication that the appropriate statistical GM is not k., j'x, + ut but (84)?5The answer is, certainly not. ln no circumstances should we interpret the rejection of Hv against some specific form of departure from independence as a confirmation of the validity of the alternative in the context of the autocorrelation approach. This is because in a misspecification testing framework rejection of Ho should never be interpreted as equivalent to acceptance of H3', see Davidson and McKinnon ( 1985). ln the case where the evaluation of (93) is not possible, Durbin and Watson ( 1950. 1951) using the eigenvalues of A1 proposed a bounds test which does not depend on the particular observed data matrix X, That is, =
=
1*.$stl.onJly
thow
=
,
l'ttjt?rlp/
Departures
from the sampling model assumption
they derived Ju and Ju such that for any X, Ju %f 1(y) %Ju,
(22.97)
where dv and Ju are independent of X. This 1ed them to propose bounds test for against
the
(22.98) and
L-'o
=
ry:z1(y) > Ju ).
(22.99)
ln the case where dv Gz1(y) %dv the test is inconclusive (seeMaddala (1977) for a detailed discussion of the inconclusive region). For the case S(): p 0 should be used in (99). against Sj : p <0 the test statistic 4 view In of the discussion at the beginning of this section the DurbinWatson (DW) test as a test of the independent sample assumption is useful coefficient rl in so far as it is based on the first-order autocorrelation Because of the relationship (95)it is reasonable to assume that the test will have adequate power against other forms of first-order dependence such as MA(1) (seeKing (1983) for an excellent survey of the DW and related tests). Hence, in practice the test should be used not as a test related to an AR(1) only but as a general first-order dependence test. error autocorrelation Moreover, the DW-test is likely to have power against higher-order dependence in so far as the lirst order autocorrelation coefficient captures' part of this temporal dependence. =
-z1(y)
.
Higher-order
tests
of the linear regression model is easier to handle under the independent sample assumption rather than when supplemented with an autocorrelation error model, it should come as no surprise to discover that the Lagrange multiplier test procedure (see Chapter 16) asymptotic provides the most convenient method for constructing misspecification tests for sample independence. In order to illustrate the derivation of misspecification tests for higherorder dependencies among the y,s let us consider the LM test procedure for the simple AR(1) case which generalises directly to an AR(rn) as well as a moving-average (MA(rn)). Postulating an AR(1) form of dependency among the yfs is equivalent to changing the statistical GM for the linear regression model to Given that estimation
X
=
#'Xf
+ Pt-
1
-P#'Xf
- 1
+
&r,
(22.100)
Testing the independent sample assumption
22.3
against is Hv : p as can be easily verified from (84).The null hypothesis easier than its under much is of Hv estimation 100) : the Because ( 0. + l'fl p estimation under H3 the f-M-test procedure is eomputationally preferable to both the Wald and the likelihood ratio test procedures. The efficient score form of the faAf-test statistic is =0
( log
z-A,1(y) =
Lt,' y) -
pp
I y.( #)-
,
1
log
(
-.-.
--
z-(#., ))) -.--
-
.
(')0
ln the case where Hv involves only a subset of 0, say Hv t?l 03 (f?1,0zj then the statistic takes the form =
p log Lol #a) (l1l (7,p, '
tA.ftyl
,
=
lj nlc2 j j.z1) .. j -
-
,
log Lo
where 0 E:E
,
z)
oo1 (22.102)
(see Chapter 1$. ln the present case the 1og likelihood function was given in equation and for ?j p and 0z BE (#, c2) it can be shown that:
(46)
=
T
0
2)
( 1 - /? 0
lv(04
=
(X'Vw-1 X)
(22.103)
1
(22.104)
(22.105) and LMy)
=
(22.106)
TI z2(1) --
(see exercise 5). Hence, the Lagrange
multiplier
test is defined by the
from the sampling model assumptim
Departures
rejection
region
C'1
.ty:
LMy)
=
ca )
>
wherex
dz2( 1).
=
(22.107)
C'(y
The form of the test statistic in ( 106) makes a lot of intuitive sense given that the first-order residual correlation coefficient is the best measure of firstorder dependence among the residuals. Thus it should come as no surprise to learn that for zth-order temporal dependence, say ;f
/?
=:
-1-tl r
t - c
,
the f-M-test statistic takes the form LMy)
Tl
=
(22. 108)
C
72rlc'r
v'
=
t
r
=
va
a
lflf .+.
:jr''
''' ,..''
m
.y,
1
,.'.
,,.'
-
z .''
t
=
2 ;j
m
T
,
=
12 ,
.
,
.
,
.
(22. 109)
m< F
1
(see Godfrey ( 1978), Breusch and Pagan ( 1980)). This result generalises directly to higher-order autoregressive and moving-average error models. lndeed both forms of error models give rise to the same AM-test statistic. That is, for the statistical GM, ),'f p'xt+ lf, supplemented with either an AR(m) error process: t 1, 2. =
'f)
=
.
.
.
,
(22. 110) or a MA(-) process: (ii)
(22.111)
the J-M-test statistic for for all i
1, 2,
=
for any i
=
.
1. 2,
.
.
.
m
,
.
.
,
m
takes the same form LMy)
z' j
=
.
with
rejection
Cj
=
r
=
?J 1
z2(m),
-
1
(22. 112)
region
.ty
:
Lsly) y ca )
,
(22.113)
(seeGodfrey ( 1978), Godfrey and Wickens ( 1982/. The intuition underlying this result is that because of the approximations involved in the derivation
22.4
Looking back
of the faM-test statistic the two underlying distinguished asymptotically.
error processes
equivalent test, sometimes called the modified test, can be based on the .R2 of the auxiliary regression (see (80)): An asymptotically
be
cannot
LM-
(22. 114) IIv
'rR2
with the
z2(/n)
-
rejection
(22. 1lf )
region X
C
1
f t.5. '.
=
.
TR2 p: c
::t
' .f ,
a
dz2(m).
=
(22. 116)
C
This is a test for thejoint significance of 7: The auxiliary regression for the estimated
yielded'. FR2
=
similar to ( 112) above. money equation with ?n 6 -,.,u.
','a-
.
.
.
.
.
=
(22.117)
72(0.7 149)= 51.47.
Given that ('a 12.592 for a 0.05 the null hypothesis 11(): 7 0 is strongly rejected in favour of Sl : + 0. Hence, the estimated money equation has failed every single misspecification test for independence showing clearly that the independent sample assumption was grossly inappropriate. The above discussion also suggests that the departures from linearity, homoskedasticity and parameter time invariance detected in Chapter 2 1 might be related to the dependence in the sample as well. =
=
=
'y
22.4
Looking back
In Chapters 20-22 we disctlssed the implications of certain departures from the assumptions underlying the linear regression model, how we can test for these assumptions as w'ell as how we should proceed when these assumptions are inN alid. ln relation to the implications of these departures we have seen that the statistical results (estimation testing and prediction) related to the linear regression model (seeChapter 19) are seriously affected, with some departures, such as a non-random sample, invalidating these results altogether. This makes the testing for these departures particularly crucial for econometric modelling. This is because the first stage in the statistical analysis of an econometric model is the specification and estimation of a well-defined (no misspecifications) statistical model. Unless we start with a well-defined estimated statistical GM any deductions such as specification testing of a priori restrictions- economic interpretation of
Departures from the sampling model assumption
estimated parameters, prediction and policy analysis, will be misleading, if not outright unwarranted, given that these conclusions will be based on erroneous statistical foundations. When some misspecification is detected the general way to proceed is to respeclf'ythe statistical model in view of the departures from the underlying of al1 the assumptionts). This sometimes involves the reconsideration underlying statistical model such assumptions the as the case of the sample assumption. independent One important disadvantage of the misspecification testing discussed in Chapters 20-22 is the fact that most of the departures were considered in isolation. A more appropriate procedure will be to derive joint misspecitication tests. For such tests the auxiliary regressions test procedure discussed in Chapter 2 1 seems to provide the most practical way forward. In particular it is intuitively obvious that the best way to generate a misspecification test is to turn it into a specification test. That is, extend the linear regression model in the directions of possible departures from the underlying assumptions in a way which defines a new statistical model which contains the linear regression model as a special case; under the null that thc underlying assumptions of the latter are valid (seeSpanos (1985c)). Once we have ensured that th underlying assumptions g1j-g8)of the linear regression model are valid for the data series chosen we call the estimated statistical GM with the underlying assumptions a well-dejned estimated statistical model. This provides the starting point for reparametrisation restriction and model selection in an attempt to construct an empirical econometric model (seeFig. 1.2). Using statistical procedures based on a well-defined estimated statistical model we can proceed to reparametrise the statistical GM in terms of the theoretical parameters of interest t: which are directly related to the statistical parameters 0 via a system of equations of the form 'good'
G(p, () 0. =
(22.118)
The reparametrisation in terms of t: is possible only when the system of equations provides a unique solution for ( in terms of o
t
=
H(p)
(22.119)
(explicitly or implicitly). ln such a case ( is said to be identlsed. As we can see, the theoretical parameters of interest derive their statistical meaning from their relationship to 0. In the case where there are fewer tfsthan 0is the extra restrictions implied by ( 118) can (andshould) be tested before being imposed. lt is important to note that there is a multitude of possible reparametrisations in terms of theoretical parameters of interest giving rise
Apandix
22.1
to a number of possible empirical econometric models. This raises the question of model selection where various statistical as well as theoryoriented criteria can be used such as theorb' consistency, parsimony, encompassings robustness, and nearly orthoqonal explanatory variables (see Hendry and Richard (1983:, in order to select the best' empirical restriction of the estimated econometric model. This reparametrisation statistical GM, however, should not be achieved at the expense of its statistical properties. The empirical econometric model needs to be a welldefined estimated statistical model itself in order to be used for prediction and policy analysis. Appendix 22.1 Assuming that
deriYing
zl
mt 1)
Z2
1%
-
E( 1, 1).
E( 1, F)
m(2)
Y(2. 1).
E(2, F)
:
:
:
mt T)
E( T,' 1)
E(T) T)
'v
:
Zw
the conditional exptation
.
5
we can deduce that
For
z, Af(r)
=
1. (x'''' ) (m/vt'' , mtr)
T
-
x(r)
,
z
' a 1 : (i. r)-a j 2(f,
a2 ! (. )-A22(,
1)
1)
,
f#rl
o.l1 1 (l)
=
)21 l -
(#, Z,0-1
,
Xf)
-
5
cotr) +
jk,(llx,+
c)1
2(r)
(l) f12a(/)
,
1
i + ,;(r)xr )- gaftll),', i =
1
j1,
c()2(r)
,
Departures
from the sampling model assumption
c()(l) p1y,(!) =
#7(r)
fza21
(l) #;(f)
71 1
=
=
=
Important
al
-
f.51
2(r)f12-2'(r)mx(r) '
(r)f122(r) -
? 1 ct rlD
(i, r) -
2(f,
r)
-
,
2--21(
rlaz 1(, r),
f.,tyl z(r)f12-21(r)A22(,
r).
concepts
Error autocorrelation,
stationary
process, ergodicity,
mixing, an inno-
vation process, a martingale difference process, strong exogeneity, Durbin-Watson test, well-defined estimated common factor restrictions, statistical GM, reparametrisation? restriction, model selection.
Questions the interpretation autocorrelation approach
of non-independence in the context of the that of the respecification approach. Compare and contrast the implications of the respecification and autocorrelation approaches for the properties of the estimators HvL (X'X) - 1X'y, sl g1/(F-klj', y - Xj and the F-test for Rp r against Hj : R## r. Explain the role of stationarity, in the context of the respecification approach, for the form of the statistical GM (34). 'T'he concept of a stationary stochastic process constitutes a of of generalisation the concept identically distributed random variables to stochastic processes.' Discuss. of the lnappropriate conditioning due to the non-independence sample leads to time dependency of the parameters of interest #EEE (#, c2).' Explain. Compare the ways to tackle the problem of a non-random sample in and autocorrelation approaches. the context of the respecification Explain why we do not need strong exogeneity for the estimation of the parameters of interest in (34). Explain how the common factor restrictions arise in the context of the autocorrelation approach. Discuss the merits of testing the independent sample assumption in the context of the respecification approach as compared with that of the autocorrelation approach. 'Rejecting the null hypothesis in a misspecification test for error independence in the context of the autocorrelation approach should not be interpreted as acceptance of the alternative.' Discuss.
Compare
with
/
=
=
=
=
Appendix 22.1
Explain the exact Durbin-Watson test. i'T'he Durbin-Watson test as a test of the independence assumption in the eontext of the autocorrelation approach is useful in so far as it is directly related to ?-1 Discuss. error Discuss the Lagrange multiplier test for higher-order autocorrelation (AR(m) and MA(n1))and explain intuitively why the test is identical for both models. underlying the linear 14. State briefly the changes in the assumptions non-independence of the regression model brought about by the .-
sample.
Explain the role of some asymptotic independence restriction on the non-random memory of )Zr, t (E T) in defining the statistical GM for a
sample.
Exercises sample assumption for p-and Verify the implications of a non-random Of .$2as given by (i)-(vi)and (i)'--(vi)' Section 22.1 in the context of the respecification and autocorrelation approaches, respectively. Show that I) 1ogL(@,y)q, (p 0 where log L(0; y) given in equation (49) =
is
non-linear
in p.
Derive the Wald test for the common factor restriction in the case where I 2 and m 1. Verify the formula FR2 LIRRSS URSS) 'RRSS? T given in equation (75).,see Engle ( 1984). Derive the f-ktf-test statistic for Ho /? 0 against H 1 : /? # 0 in the case where t?t /?;, 1 + lI, as given in equation ( 106/. =
=
=
-
=
=
-
Additional references
The dynamic linear regression
model
of the linear regression ln Chapter 22 it was argued that the respecification model induced by the inappropriateness of the independent sample assumption led to' a new statistical model which we called the dynamic linear regression (DLR) model. The purpose of the present chapter is to consider the statistical analysis (specication, estimation, testing and prediction) of the DLR model. The dependence in the sample raises the issue of introducing the concept of dependent random variables or stocastl'c processes. For this reason the reader is advised to refer back to Chapter 8 where the idea of a stochastic process and related concepts are discussed in some detail before proceeding further with the discussion which follows. The linear regression model can be viewed as a statistical model derived by reduction from the joint distribution D(Z1 Zw; #), where Z, < (yt,Xf')', and )Zt,l6T) is assumed to be a normal, independent and identically distributed (N11D) vector stochastic process. For the purposes of the present chapter we need to extend this to a more general stochastic process in order to take the dependence, which constitutes additional systematic information, into consideration. ln Chapter 2 1 the identically distributed component was relaxed leading to time varying parameters. The main aim of the present chapter is to relax the independence component but retain the identically distributed assumption in the form of stationarity. In Section 23. 1 the DLR is specified assuming that tZ,, t c T) is a stationary, asymptotically independent normal process. Some of the approach arguments discussed in Chapter 22 in terms of the respecification will be considered more formally. Section 23.2 considers the estimation of the DLR using approximate MLE's. Sections 23.3 and 23.4 discuss .
526
.
.
,
23.1 Specification misspecification and specification testing in the context of the DLR model, respectively. Section 23.5 considers the problem of prediction. In Section 23.6 the empirical econometric model constructed in Section 23.4 is used to the misspecification results derived in Chapters 19-22. Sexplain'
23.1
Spification
In defining the linear regression model )Zt, t e: T) was assumed to be an independent and identically distributed (llD) multivariate normal process. ln dening the dynamic linear regression (DLR) model (Zf, t e T) is assumed to be a stationary, asymptotically independent normal process. That is, the assumption of identically distributed has been extended to that of stationarity and the independence assumption to that of asymptotic independence (seeChapter 8). ln particular, (Z/, l e T) is assumed to have the following autoregressive representation:
zf
=
i
=
l A()z,-f
(23.1)
+ El,
1
where
c(z1L1))
El,t
YlA(f)Zl
-
i Z/-
1
=
-
(Zf
-
1
1()
A(i)
=
Ef
=
and
(23.2)
-,
1
,
Zt
t7l a21() Z?
-
2,
.
.
.
2() al A22(f)
S(Z,,'c(Z,O-
Z1),
,
j
)),
(23.3)
with tEl,c(Zr0- j),r > rtll defining a vector martingale difference process, which is also an innovation process (see Chapter 8), such that (Er/''Z,0-1)
N(0,
x
n), n >
0,
It is important to note that stationarity is not a necessary condition for the existence of the autoregressive representation (1).The existence of (1) hn, and f being time invariant are the essential with A(f), i 1, 2, conditions for the argument which follows and both are possible without the stationarity of )Zf, l 6 T). For homogeneous non-stationary stochastic process (1)also exists but differencing is needed to achieve stationarity (see Box and Jenkins (1976/. Homogeneous non-stationary processes are =
.
.
.
,
The dynamic linear regression
model
characterised by realisations which, apart from the apparent existence of a local trend, the time path seems similar in the various parts of the realisation. ln such cases the differencing transformation Ai
-z-)i,
where Ljzt
1
=(
zrt-i,
=
i
1, 2,
=
.
.
.
can be used to transform the original series to a stationary one (see Chapter 8). The distribution underlying ( 1) is D(Z, Zr0-j ; ) arising from the sequential decomposition of the joint distribution of (Z1, Z2, Zw): .
.
.
,
Z.', #) refers to the joint distribution of the first m observation D(Z1 interpreted as the initial conditions. Asymptotic independence enables us to argue that asymptotically the effect of the initial conditions is negligible. One important implication of this is that we can ignore the first m observations and treat r m + 1, T as being the sample for statistical will be adopted for inference purposes. ln what follows this expositional purposes because proper' treatment of the initial conditions can complicate the argument without adding to our understanding, The statistical generating mechanism (GM) of the dynamic linear regression model is based on the conditional distribution D( pf//xl, Zt0- 1,. #j) related to D(Zl ?'Z,0-j ; ) via ,
.
.
,
.
=
,
.
,
.
-solution'
) D ).,,,'X,; z,0- (#')04x,
D(Z, ,''Z,0-, ;
=
,
,.
,/z,1.
,
,
4
,.
,j.
The systematic component is defined by
',''-
0 r- 1
( .k, .
f -
1
((I
#;
(a12()
.
11
t
( )+
t:
1
!,
..
a, 1
.
.
2f1
.
,
)-j. ), X,0
a-cl
:.,.)1 zfkz-l
!'))
a a 1(
((x, .'x,...j
,
,
.
.
pv (1.c-c1 a j
.
,
.'.x: ),
.o
,
Ac.c(i'))- i
l 2. ,
.
.
.
,
,
m
(see Spanos ( 1985z)). The non-systvmatic component is defined by 1 ), af )'f - E (.3./7-?6' c(Zp0-1 xf xt), and satisfies the following properties-. 1 =
where
-..:);)
=
-
,
=
(23.6)
23.1 Specication '(ut ) E $4E
ut
=
(F T24
t6
,
.- 1
)) 0 =
fturl.ksl E )f utus/qth'-
1.)
=
2
Go
c? 1 1 - o 12 fl 22
=
)
(23.9)
.
2 tl'o 0
=
1
21
t
s
=
(23. 10)
.
These two properties show that ut is a martingale difference process relative with bounded variance, i.e. an innovation process (see Section 8). to is also orthogonal to the Moreover, the non-systematic component systematic component, i.e. '%
'ft
ET34
Eptut
'tprlf)
-
1))
-t/6-
0.
=
The properties ETI-ET3 can be verified directly using the properties of the conditional expectation discussed in Section 7.2. ln view of the equality
o'lut ut ,
j
,
.
.
that deduce can we ET4)
.
u 1)
,
(23.12)
-@(',
=
(23.13)
E'(ur,.'c(Uf0-1 )) 0, =
/)- lJf0- j (lf - 1 t1t - j2 i.e. u, is not predictablefrom :=
,
,
.
.
.
,
N1
)
,
its
past.
tpwn
This property extends the notion of a white-noise process encountered so far (see Granger ( 1980:. As argued in Chapter 17, the parameters of interest are the parameters in terms of which the statistical GM is defined unless stated otherwise. These parameters should be defined more precisely as statistical parameters of interest with the theoretical parameters of interest being functions of the former. ln the present context the statistical parameters of interest are 0* H(#l) where 0* H(j(), #1 am, c()2). pm,al, D(Zf/'Z,0,. that The normality of #j and #a are variation free ) implies weakly exogenous with respect to 0. This suggests that 0* can and thus Xr is be estimated efficiently without any reference to the marginal distribution D(X,,,/Zf0j #z). The presence of Y,0-I in this marginal distribution, however, raises questions in the context of prediction because of the feedback from the lagged yrs. ln order to be able to treat the xfs as given when predicting .I.f we need to ensure that no such feedback exists. For this purpose we need to assume that =
,
.
.
.
.
,
.
.
,
',
D(Xf,,''Zr0-j ;
a)
=
'L
(23.14)
(seeEngle et aI. ( 1983) for a
more detailed
.l)(X,,?'Xr0j ;
i.e. J't does not Gl'fgl?fyt?' cause X
2),
t
=
m + 1,
.
,
.
,
The dynamic Iinear regression
model
discussion). Weak exogeneity of X, with respect to 0* when supplemented with Granger non-causality, as defined in (14), is called stronl exogeneitq'. Note that in the present context Granger non-causality is equivalent to aclt)
0, i
=
1, 2,
=
.
.
(23.15)
m
.
.
(1),which suggests that the assumption is testable. In the case of the linear regression model it was argued that, although the joint distribution D(Zr; #) was used to motivate the statistical model, its specification can be based exclusively on the conditional distribution D(yr/Xf; #j). The same applies to the specification of the dynamic linear regression model which can be based exclusively on Dyt/zt. I Xf; /1). ln need to be imposed on the such a case, however, certain restrictions parameters of the statistical generating mechanism (GM): in
,
m
IN
).,, #'oxf+ =
i
Z
1
=
a-vf - +
i
Z #;x!-
i
+
(23.l6)
l/f,
1
=
ln particular we need to assume that the parameters the restriction that a1l the roots of the polynomial m -
am
i
=
,
a2,
.
.
.
,
a.) satisfy
1
) afz m
z -
(aI
m -
i
=
0
1
1ieinside the unit circle, i.e. l-ft< 1, i l 2, m (seeDhrymes (1978)). This restriction is necessary to ensure that .fty;, t e: T) as generated by (16) is indeed an asymptotically independent stationary stochastic process. ln the case where m 1 the restriction is < 1 which ensures that =
1 1
a2f
-
Covtyfllf
+,)
=
0
-+
-
:z
.
.
.
,
ta4
=
(7'211
,
2
as z --+
Jz
(see Chapter 8). lt is important to note that in the case where )Zt, t G T) is assumed to be both stationary and asymptotically independent the above restriction on the roots of the polynomial in ( 17) is satisfied automatically. convenience let us rewrite the statistical GM ( 16) in the For notational concise form more ),f p*'X1 =
where
+ u,,
For the sample period matrix fonn:
y =X*#*
+ u,
l
r > m,
=
m + 1,
'fj .
.
.
,
( 18) can be written in the following
(23.19)
23.1 Specification where y: (T- rn) x 1, X* : (T- m) x rrltk + 1). Note that xr is k x 1 because it 1) 1 vectors', rrl, are (/ includes the constant as well but xf - f, i 1, 2, - x this convention is adopted to simplify the notation. Looking at ( 18) and (19) the discerning reader will have noticed a purposeful attempt to use notation which relates the dynamic linear regression model to the linear and stochastic linear regression models. lndeed, the statistical GM in (18) and (19)is a hybrid of the statistical GM's of these models. The part is directly related to the stochastic linear regression model in Z)''-1 fyf -f conditioning view of the on the c-field c(Y,0- ) ) and the rest of the systematic extension of that of the linear regression model. being direct component a =
.
.
.
,
will prove very important in the statistical analysis of the This relationship of dynamic linear regression model discussed in what the parameters
follows. ln direct analogy to the linear and stochastic linear regression models we need to assume that X* as defined above is of full rank, i.e. ranktx*) 1)'. ml + 1) for all the observable values of Y?- j H ()..,, + 1 )'vThe probability model underlying ( 16)comes in the form of the product of the sequentially conditional normal distribution Dtyf Zy- 1 Xf; 1), t > F/1. T the distributon of the sample is For the sample period t m + 1, =
.p.,
,
.
.
.
,
,
=
.
.
.
,
(23.20) sample from The sampling model is speeified to be a non-random )'vl' can be viewed as a nonD*(y; /1). Equivalently, y EEE(yu+ 1, )u + c, T) random sample sequentially drawn from D( bj/zt.- l X,; 41),r m + 1, respectively. As argued above, asymptotlcally. the effect of the initial conditions summarised by .
.
.
,
,
DIZ 1 ; #)
17D .'? Z)'-
1
.
Xr
1
.=
',
=
.
.
.
,
#)
can be ignored: a strategq adopted in Section 23.2 for expositional pumoses. The interested reader should consult Priestley (198 1) for a readable discussion of hovs the initial conditions can be treated. Further discussion of these conditions is given in Section 23.3 below.
The dynamic Iinear regression
0)
vt
'oxt =
+ i
F =
-
spec+cation
GM
The statistical
w
model
1
x i )'r
-
i
+ f
Y1 p'ixf - i + ut =
,
(23.22)
The dynamic Iinear regression model
(23.23)
Ll (1 The
(statistical)parameters of interest are 0* EEEE(a 1
(x.2 ,
,
.
.
.
,
a,,,,
#I
v,
,
.
.
.
,
#mc()2); ,
see Appendix 22. 1 for the form of the mapping 0* H(#1). Xp is strongly exogenous with respect to #*. >(a1, The parameters aml' satisfy the restriction that all polynomial of the roots the =
g3(I g4(I
.12,
.
-
i
E5q (11)
,
jj
m - i
ai
,
=
0
1
=
are less than one in absolute value. ranktx*)
m(k + 1)for a1l observable
=
The probability
*
E61
.
1
m-
)wm
.
T > m (k+ 1).
model
'z)L 1, X,; p*)
D()'t
=
values of Y7-.,
=
1 exp J'''''l''''''--o x ( )
1
-
1c / ( '
pr -
#*'X))2
D()',?'Z,0- j X,; p*) is normal; 7( ',/'c(Y,0- j), xr0 x,0) .'x;b .- Iinear in x),' (ii) Vartyf/ctYto- j), X,0 xfo) c()2 homoskedastic (iii) X) ); 0* is time invariant.
(i)
,
,
=
=
=
(111)
The sampling
E81
y EEE()',,,+
=
(free of
model
asymptotically is a stationary, )',n+ 2, independentsample sequentially drawn from D()4 Zr0- , 'F respectively. t m + 1, m + 2, -pw)'
I
,
.
.
.
,
,xl,.p*),
=
.
.
.
,
N()te that the above specification is based on D(y, Zl0- j Xr; p*j, t m + 1, T) directly and not on D(Z,,''Zf0- j ; #). This is the reason why we need . assumption (4(1in order to ensure the asymptotic independence of ,
.
.
J( t
.
,
J't /Z0t -
1 !,
Xl ) t > ',
n1l.f
.
=
23.2 23.2
Estimation
Estimation
ln view of the probability and sampling likelihood function is defined by
model
assumptions
(6qto (8()the (23.25)
and log 1-(p*,'y) const =
(T- rn) -
-
2
1og c/
1 -X*#*)'(y - 2/.0 (y
-X*#*),
(23.27)
:20
=
1 -
/ -
(F - m)
u
*
5
(23.28)
li y - X*J*The estimators /* and 42o are said to be appvoximate where estimators (MLE's) of #* and c()2, respectively, bause Iikeliood nmximum conditions initial have been ignored. The formulae for these estimators the similarity between the dynamic, linear and stochastic linear the bring out the similarity does not end with the formulae. models. Moreover, regression for the dynamic linear regression model can statistical GM the Given that viewed the other of two models we can deduce that in direct be as a hybrid linear regression model the finite sample the stochastic analogy to (vl of and distributions are likely to be largely intractable. One important J* dynamic and stochastic linear regression models, difference between the however, is that in the former case, although the orthogonality between pt and l1t (pf Ik?) holds in the decomposition (23.29) t', /t, + l?r. for each t > m, it does not extend to the sample period formulation =
.
=
(23.30)
y p+u =
for z < 0
(23.31)
The dynamic Iinear regression model
One important Ep-'b
implication
#*) Sg(X*/X*)- 1X*'uj
:#:
=
-
/* is a biased estimator
of this is that
0,
of
#*,i.e. (23.32)
This, however, is only a problem for a small T because asymptotically #*> ()2) enjoys certain attractive properties under fairly general conditions, including asymptotic unbiasedness.
(J*,
Asymptotic properties
of
#y
Using the analogy between the dynamic and stochastic linear regression models we can argue that the asymptotic properties of 11depend crucially on the order of magnitude (seeChapter 10) of the information matrix Iv0*) dcfined by '(X*'X*) Ir(#*)
cp
=
(23.33)
This can be verified directly from (27)and (28).lf E(X*'X*) is of order Op(T) then Iw(p*) Op(F) and thus the asymptotic information matrix I.,(p *) limw.-.((1/F)Iw(p*)j < Moreover, if GwEB S(X*?X*/T) is also nontsufciently large', then the asymptotic properties of singular for all T as of MLE be Let us consider the argument in more detail. 0* deduced. can a If we define the information set (c(Y)L1), X,0 x/)) we can deduce 1 that the non-systematic component as defined above takes the form =
=
,'.'.f,
.
s
,6'-
=
nr =
)'? Ef.t,ih-
1),
-
.f(/.r,
=
(23.34) hrj)
and the sequence represents a martingale difference, i.e. r> j) 0, t > m (seeSection 8.4). Using the limit theorems related to '(l, martingales (seeSection 9.3) we can then proceed to derive the asymptotic properties of #y. ,#(,
.%-
=
In view of the following relationship:
(, *
tf we
ian
-
j*)
(X*'X*
'j-
=
y
y
ttat
ensue
yj,x-* x '
*
hy
.0
a4a .-ss
T
then
we can use
( ), X*'tl
the strong
s
$7%A%
,
taw of targenumbes fo
mattqtates
to stow
Estimation
23.2
(-X*
'u
v
-j a'S. --,
(23.37)
().
Hence, a %. .
'*+
#
...#
p.
.
l
The convergence in (37) stems from the fact that .tfX)l?f, martingale difference given that ,h'
E(Xt3ut/.%
1)
Xtt.qElut/eh -
=
-
:)
0, i
=
=
1, 2,
.
.
.
,
m(k + 1).
defines a
(23.38)
This suggests that the main assumption underlying the strong consisteney or p-*is the order of magnitude of F(X*'X*) and the non-singularity of F(X*'X*/F) for T > m (k+ 1). Given that X) EEE(.J.7, 1 yf 2, .J7f-?n, N, xf 1 conditions F(X*'X*) satisfies if the crossthese we can ensure that . products involved satisfy the restrictions. As far as the cross-products which ipvolve the y,-fS are concerned the restrictions are ensured by assumption afz'n - i) 0. For the xts we need to assume (4J on the roots of (2n j7--11 ,
.
.
.
,
,
,xt-.)
.
.
=
-
that:
< c, i lxffl
1, 2, k, t (EFT, C being a constant; .E1,/(T-z)q limw7--Jx,x;-,= Qzexists for z > 1 and in particular non-singular. also Qo is
(i) (ii)
=
.
.
,
.
a S . .
These assumptions ensure that '(X*'X*) ()( F), i.e. Gw 0( 1)and Gw G where G s non-singular. These in turn imply that Ir(#*) 04 F) and 1.4#*) 0( 1) which ensures that not only (37) holds but --+
=
=
=
=
a S. .
ri'g -+
(23.39)
c.
by multiplying (#1 0w) with x? T (the order of its standard normality deviation) asymptotic can be deduced: Moreover,
/'' ' T( #+ #+) -
N
'
-
1 x(() j (p*)- )
x
,
.
.
(23.40)
That is, ...'
'r(:
o (
Y
%
' Tdo
2
-, v ) -
j!
c())
-
x J
x(0,cpG - ),
(23.41)
4. -N(0,2co).
(23.42)
j
The dynamic linear regression model
The
lwc(7/tlconsistency
of
#w* as
an estimator
of 0*, i.e.
P
#* F
--+
(23.43)
p*
follows from the strong consistency (seeChapter 10). Moreover, asymptotic d/ictnktry follows from (40). If we compare the above asymptotic properties of #) with those of .s (/ 42) in the linear regression model we can see that their asymptotic properties are almost identical. This suggests that the statistical testing and prediction results derived in the context of the linear regression model which are based on the asymptotic properties of kvare likely to apply to the present context of the DLR model. Moreover, any finite sample based result, which is also justifiable on asymptotic grounds, is likely to apply to the present case with minor modifications. The purpose of the next two sections is to consider this in more detail. =
Example The tests for independence applied to the money equation in Chapter 22 showed that the assumption is invalid. Moreover, the erratic behaviour of estimators and the rejection of the linearity and the recursive assumptions Chapter confirmed 1 homoskedasticity in 2 the invalidity of values conditioning observed of only. Xf ln such a case the on the current natural proceed appropriate statistical model the is to respecify the way to for the modelling of the money equation so as to take into consideration the time dependence in the sample. ln view of the discussion of the assumption of stationarity as well as economic theoretical reasons the dependent variable chosen for the postulated statistical GM is m;b ln (,hzJ,/'#,)(seeFig. 19.2). The value of the maximum lag postulated is m 4, mainly because previous studies restriction adequate to demonstrated the optimum 1ag for characterise similar economic time series (see Hendry (1980/. The postulated statistical GM is of the form =
=
'memory'
4
ln:
=
()v+
4
'ximJ-
(//1
i+
i
=
1
-)';
i
=
-
i
0
+ c'l Q1,+trzQ2,+ caoaf
+
pzipf - i + (lsiit -
f)
(23.44)
+ ut.
where Qit,i 1, 2, 3, are three dummy variables which refer to important changes: monetary policy changes which 1ed to short-run =
tunusual-
(Q1/-197?fr)The
introduction
of competition and crpf/l'r control.
23.2
Estimation
Table 23, 1. E.%tinltltetl srtkrf-srcltJl
(Qar- 19
',75
i t..)
GM
l-le sl./spt?n.tft??'lq.Jthe
th t? banks
to
('/'?(?nl'lt?/
t-tp/-stpr (1;1(1
r/?e
?7t,u'
r/7t.,Btlnk
llltliial.l
t?/'
Enlland t/s/t?t/ J/t?/-st-?k// vjol
t/)4lp)'
Ioans. I-he in rroc/nclfo'l f?/- A.11 as t/ tnonetar )' t(1?-gt.:1 The estimattd coeftcients for the period 1964f 1982f1, are shown in Table 23. 1. Estislatlon of (44)with htnl as the dependent variable changed only the goodltlss-of-fit measure as given by R.2 and #2 to R2 0,796 and /12 0.7 11.The change measures the loss of goodness of fit due to the presence of ()Lt a trend (compare Fig. 19.2 with 2 1.2). The parameters p EEEF(jt). jl (lzi c2()j in terms of whicl the statistical G M is ui ,. I i 0- 1, 2, 3. 4- c I t'a- t. a. defined are the statistical and not the (econonic ) theo retical pa rameters of interest. ln order to be able to determ intt the latter (using specification statistical G M is well testing), we need to ensure first that the klstilrlatel assumptiens tlnderlying is. 11 That that the defined. the statistical g) 1))8-J is the task of model are indeed : al id. Testing for these assun-iptions nisspecificttion tcsti ng in the context of the dynamic linear regression naodel consid t?red i1-1 t ht2 next sectio n Looking at the time graph of the actual (y,)and fitted ( f'r)(see Fig. 23. 1) pr quite closely values of the dependent variable we can see that t for the estimation period. This is partly confirmed by the value of the variance of the regression which is nearly one-tenth of that of the money equation estimated in Chapter 19,*see Fig. 23.2 showing the two sets of residuals. This reduction is even more impressive given that it has been achieved using the same sample information set. In terms of the 5.2 we can see that this also confirms the improvement in the goodness of fit being
((? t-l 9*J1.)
.
=
=
.
=
,
.
,
.
ttracks'
-
.
The dynamic Iinear regression model k' $' !'
0.075 0.050
actual' l
0.025
l l l l
l j
c
$ v
N
l
I t/
..1
l
p /l l l
t l
I
y
lw
q
f'-
$,
ZZ
0 025
f ktlsd
j
1/
/
t
t
l k
-0.050 -0.075
1964
1967
1970
1973 Ti m e
Fig. 23. 1. The time graph of actual estimated rcgression in Table 23.1
Fig. 23.2. Comparing the residuals and (23.44) (see Table 23.1).
1976
T.,
=
1979
1982
A ln (uYJ,,/#),and fitted fr from the
.
from the estimated
regressions
(19.66)
more than double the original one with ml as the dependent variable and y,, Only the regressors (see Chapter 19). pf and it Before we proceed with the misspecification testing of the above estimated statistical GM it is important to comment on the presence of the
23.3
Misspecification
testing
539
three dummy variables. Their presence brings out the importance of the background in econometric modelling. sample period economic Unless the modeller is aware of this background the modelling can very easily go astray. ln the above case, leaving the three dummies out, the normality assumption will certainly be affected. lt is also important to remember that such dummy variables are only employed when the modeller believes that certain events had only a temporary effect on the relationship without any lasting changes in the relationship. ln the case of longer-term effects we need to model them, not just pick them up using dummies. Moreover, the number of such dummies should be restricted to be relatively small. A liberal use of dummy variables can certainly achieve wonders in terms of goodness of fit but very little else. lndeed, a dummy variable for each observation which yields a perfect fit but no explanation' of any kind. In a certain sense the coefficients of dummy variables represent a measure of our ignorance'. thistory'
23.3
Mi%pecification
testing
As argued above, specification testing is based on the assumption of correct specncation, that is, the assumptions underlying the statistical model in question are valid. This is because departures from these assumptions can invalidate the testing procedures. For this reason it is important to test for the validity of these assumptions before we can proceed to determine an empirical econometric model on the sound basis of a well-defined estimated statistical GM.
(1)
Assumption underlying
(2lf
the statistical
Assumption I(1Jrefers to the definition of the systematic and non-systematic components of the statistical GM. The most important restriction in defining the systematic component as
pt =
f(.A'/ CIY/0-1), Xf0 xr0l =
=
p'vx+
) taf.y,,+ #;x,
-f)
-f
i
=
1
(23.45)
is the choice of the maximum /tktym. As argued in what follows an inappropriate choice of m can have some serious consequences for the estimation results derived in Section 23.2 as well as the specification testing results to be considered in Section 23.4 below. (i) large' m chosen lf m is chosen larger than its optimum but unknown 'too
value
m* then
Snear
The dynamic Iinear regression model
collinearity' (insufficientdata information) or even exact collinearity will 'creep' into the statistical GM (see Section 20.6). This is because as m is increased the same observed data are to provide further and further of about number unknown parameters. The information an increasing insufficient of data information implications discussed in the context of the regression problem be applied linear to the DLR model with some can reinterpretation due to the presence of lagged yts in X). 'asked'
m cbosen
small'
'too
small' then the omitted lagged Zrs will form part of the lf m is chosen unmodelled part of )'l and the error term 'too
(23.46) is
no
longer non-systematic .%
1
=
kfutvytnt
relative to the information set
(c(Yt0- j ) X;0 x;0) =
,
(23.47)
.
> rnj. will no longer be a martingale diflrence, having vel'y serious consequences for the properties of #* discussed in Section 23.2. In particular the consistency and asymptotic normality of #* are no longer valid, as can be verified using the results of Section 22.1,. see also Section 23.2 above. Because of these implications it is important to be able to test for statistical GM is given by m < m*. Given that the
That is.
'true'
*1 +
#'oxf+ j
)),
=
i
the error term
((zf-'J
i
-
j'fxr j) -
+
1
=
ul can
(23.48)
+ u,,
be written in the form (23.49)
This implies that m < m* can be tested using the null hypothesis H o : ap,* 0
a nd
=
where
x.*K
+. 1 (.1,,,
,
.
.
,
.
pm*0 =
a.)'
s
aga in s t
#.*EEEE(jm.#j
,
.
H 1: ,
,
,
<
#0
#.*+ 0
,
#,,,.) .
The obvious test for such a hypothesis will be analogous to the F-type test for independence suggested in Chapter 22. Alternatively, we could use an asymptotically equivalent test based on FR2 from the auxiliary regression
;b -(,()
-#'x + r
i
=
1
(af).'r-f+ #;x,-f)+cf,
(23.50)
Misspecification
23.3 where
)
testing
refers to the residuals from the estimation
of
f) + ut.. + v p'vxt+ Yltaiy,- i #'fxf-
.
-
l
i
1
=
The rejection region is defined by
(71
f
ty
=
FR2
:
>
c.)
a
,
dzzunl*
=
-
n1)/t-1
(23.52)
L'y
(see Chapter 22 for more details). The F-type test for the money equation n1* 6 yielded
estimated
in Section 23.3
with
=
-0.009
Fr(y)
0.010 49 0.009 65
=
43
65
V
=
0.467.
(23.53)
2. 14 for a 0.05. Ho is strongly aecepted and the value of seems appropriate. m As argued in Chapter 22. the various tests for residual autocorrelation for can be used in the context c)f the respecification approach as tests of independence assumption especially when the degrees freedom are at a premium. ln the present context the same misspecification tests can be used with certain modifications as indirect tests o m < m*. As far as the asymptotic Lagrange multiplier (.LM) tests for AR(p) or MA(p), where pb 1, based on the auxiliary regressions are concerned, can be applied in the present context without any modifications. The reason is that the presence of the lagged yts in the systematic component of the statistical GM (51) makes no difference asymptotically. On the other hand, the Durbin-
Given that
t'a
=
=
=4
W'r.st?n (DW) test is not applicable in the present context because the test depends crucially on the non-stochastic nature of the matrix X. Durbin of the DLR 1 + ur) in the context ( 1970) proposed a test fpr AR(1) (.u) statistic so-called by defined ll-test model based on the 1 F- m N(0, 1) (23.54) b ( 1 - JD I4z-(y)) 1-(T-rn) Vaitafl=pu1-
x.
=
<
under Ho: p
0. Its rejection region takes the form
=
'X
cl
.fty
=
:
1n1 >
czy'. ,
1a
4() d-
=
(23.55)
Ca
where 44 ) is the density function of a standard normal distribution. lt is interesting to note that Durbin's ll-test can be rationalised as a Lagrange multiplier test. The Lagrange multiplier test statistic for a first-order '
The dynamic linear regression model
dependency (AR(1) or MA( 1)) takes the form T(y)
=
1-*
(23.56)
l1I1 w*vartti'l) Ho
and z(y)
(198 1:,
(see Harvey
z2(1). Noting that
'v
X
(23.57)
-#Dw),
=k1
1%1
we can see that the Durbin ll-test can be interpreted as a Lagrange multiplier (faM) test based on the first-order temporal correlation coefficient and As in the case of the linear regression model the above Durbin's the f,M() (/1 1) test can be viewed as tests of sijnificance in the context of the auxiliary regression -test
l
?%=J'x)
l pflir-f + nr,
+ f
(23.58)
1
=
The obvious way to test Ho : pl
=
pc
'
=
'
'
=0,
=pI
for any i
S1 : pi #0,
=
1,2,
,1 .
.
.
(23.59) is to use the F-test approximation which includes the degrees of freedom chi-square form. The correction term instead of its asymptotic autocorrelation error tests will be particularly useful in cases where the Ftest based on (48) cannot be applied because the degrees of freedom are at a premium. ln the case of the estimated money equation in Table 23.1 the above test yielded: statistics for a =0.05
h
=
1.19,
LM
(2): FF(y)
LM
(3):
LM
(4): FT(y)
=
FF(y)=(
cx
=
1.96,
0.0 10487 -0.0 10 339 -
()s()j();gj)
a
0.010500 -0.010 291 ().()j() ajlj
=
1) 1(-j-1-0.318, ..:-1-0.23 1
0.010 466
-0.010
().()j() aj5
49
255
47
45
-0.350,
3.18,
(23.60)
c.=2.80,
(23.61)
cx
-
1, c.=2.5
-
/.
(23.62)
As we can see from the above results in all cases the null hypothesis is (53)above.
accepted confirming
23.3
Misspification
543
testing
The above tests can be viewed as indirect ways to test the assumption postulating the adequacy of the maximum lag rn. The question which naturally arises is whether m can be determined directly by the data. ln the statistical time-series literature this question has been considered extensively and various formal procedures have been suggested such as Akaike's AlC and BIC or Parzen's CAT criteria (see Priestley (198 1) for a readable summary of these procedures). In econometric practice, however, it might be preferable to postulate m on a priori grounds and then use the above indirect tests for its adequacy. specifies the statistical parameters of interest as being 0* HE Assumption g2(I (J1 xm, #(),#1 #,,,,tr()2). These parameters provide us with an opportunity to consider two issues we only discussed in passing. The first issue is related to the distinction made in Chapter 17 between the statistical and (economic)theoretical parameters of interest. ln the present context 0* as detined above has very little, if any, economic intemretation. Hence, 0* represents the statistical parameters of interest. These parameters enable us to specify a well-defined statistical model which can provide the basis of the tdesign' for an empirical econometric model. As argued in Chapter 17, the estimated statistical GM could be viewed as a sufficient statistic for the theoretieal parameters of interest. The statistical parameters of interest with the parametrisation provide only a statistically (sufficient) theoretieal parameters of interest being defined as functions of the former. This is because a theoretical parameter is well defined (statistically) only statistical parameter. The when it is directly related to a well-defined determination of the theoretical parameters of interest will be considered in Section 23.4 on specification testing. The second related issue is concerned collinearity. ln Section 20.6 it was argued that with the presence of and relatiy e to a giN en parametrisation defined collinearity is enear' collinearity likely it that is In the context (or present information set. insufficient data information) might be a problem relative to the parametrisation based on 0*. The problem, however, can be easily overcome a in determining the theoretical parameters of interest so as to Both issues theoretical parametrisation. parsimonious as well as will be considered in Section 23.6 below in relation to the statistical GM estimated in Section 23.2. postulates the strong exogeneity of Xt with respect to the Assumption (.3(J 7 As far as the weak exogeneity component parameters 0* for t m + 1 concerned it will be treated as a non-testable of this assumption is of the linear regression model (seeSection presupposition as in the context 20.3). The Granger non-causality component, however, is testable in the context of the general autoregressive representation ,
.
.
.
,
,
.
.
.
,
tadequate'
'near'
-near'
Sdesign'
'robust'
=
.
.
.
.
.
The dynamic linear regression
model
#l1
At
=
i
j
A(j)Z,
-f
+ E,
(23.63)
1
'l'.r
does not 'Granger cause' Xr if x21(f) 0 (see Appendix 22. 1).In particulars 2. 1, This i for suggests that a test of Granger non-causality can be m. based on Ho a21(f) 0 for all i 1,2, m against H3 : aa1(f) # 0 for any i l?y. where 2, For the 1, 1, Granger ( 1969) suggested the Wald test k case statistic .y'l
=
=
.
,
.
,
=
=
.
.
.
.
.
.
=
,
=
,
(23.64) where URSS and RRSS refer to the residuals sums of squares of the /'n, respectively. The regressions with and without the .-js, i 1. 2, rejection region takes the form .,
C1
=
f tX
'
'
14S
> (' z>
=
.
.
.
,
.
(23.65)
The Wald test statistic can be viewed as an F-type test statistic and thus a natural way to generalise it to the case where k > 1is to use an F-type test of significance in the context of multivariate linear regression (seeChapter 24). For a comprehensive survey of Granger non-causality tests see Geweke ( 1984). Assumption g4qrefers to the restrictions on x needed to ensure that l )'r, t s T ) as generated by the statistical GM:
(23.66) is an asymptotically form alfal.pf =
where
independent stochastic process. If we rewrite
w, + n1,
(66)in' the (23.67)
and treat it as a diflrence equation, then its solution (assumingthat (fz) has ??7 distinct roots) takes the general form
=0
(23.68)
Misspification
23.3
testing
g(r),called the complementary function, is the solution of the homogeneous cu)' are constants difference equation at1.alyr 0 and c EEE(t'1 cc, via determined by the initial conditions yj )u =
.
,
.
.
,
.v2,
-
171 -
-1-c 2 +
'
=
1
'
,
2
+
'
+ t?2/.
( 1 /.l .
m ''
2
+
.
.
+ c m?vm .
.
.
,
.
)
i>.
''
''
=
'
.
(23.69)
independence In order to ensure the asymptotic (stationarity) of T) g need this l decay : )'t, component to to zero as t we w in order for Priestley 198 conditions initial T) (E l the .ty,, ( 1/. For this to to (see polynomial the of which ;.z, the 2,,,), roots be the case (21 are -+
E
.
iforget
.,
.
,
.
(23.70) should satisfy the restrictions
1
12f < l
(23 1) .7
l 2,
1.=
,
.
,
.
.
rn.
.
,'';.,.
Note that the roots of a(1-) are ( 1 When the restrictions hold
and
), i
1, 2,
=
lim gt) l
-+
=
.
.
.
,
m.
.72)
(23
0.
%
As argued in Section 23. 1 in the case where f#Z,, r G Tlj, is assumed to be a stationary, asymptotically independent process at the outset, the time invariance of the statistical parameters of interest and the restrictions ?77. are automatically satisfied. 1, 2, < 1, i ln order to get some idea as to what happens in the case where the < 1 1 1 2. restrictions m- are not satisfied let us consider the where ll and 1 (,:z 1), i.e. l ;.k simplest case 1
Il
=
.
I2f
,
.
-
.
=
.
=
.
.
.
.
=
=
(23.73) Cov(y,),,
z) -
=
c2()r,
(23.74) .f
These suggest that both the mean and covariance of $.yf, l c -1J-)increase with . t and thus as l --+ C/S they become unbounded. Thus yf, r (E T), as generated by (73),is not only non-stationary (withits mean and covariance varying remains constant. with r) but also its
t
'memory'
Tlle dynamic linear regression model > 1 again Ia1I
In the case where
f
y,, t c T) is non-stationary
-at
Z'(.Ff)
1 1-
Wt
=
(: j
and
CovtA',yf +,)
1
co2
=
since
lt
x1
-
1
a'k
(; (.)yj)
,
.
a1
-
and thus .E'(yt) tx) and Covtyfyrsr) ct;. as t -+ ). Moreover, the tmemory' of the process increases as the gap z -.+ c/a ! < 1 is not In the simplest case where m 1 when the restriction satisfied we run into problems which invalidate some of the results of Section 23.2. In particular the asymptotic properties of the approximate MLE'S of 0* need to be modified (seeFuller (1976)for a more detailed discussion). For the general case where m > 1 the situation is even more complicated and most of the questions related to the asymptotic properties of #* are as yet unresolved (seeRao (1984:. x't, x; 1, Assumption ES, relating to the rank of X* E (yf 1, yf ln the case x;-m), has already been discussed in relation to assumption E1q. where for the observed sample y the rank condition falls, i.e. -+
-+
1211
=
-
.
.
.
,
-.,
-
.
.
.
,
ranktx*) n < mk + 1), (23.76) a unique #*does not exist. If this is due to the fact that the postulated m is =
much larger than the optimum maximum 1ag m* then the way to proceed is to reduce rrl. 1f, however, the problem is due to the rank of the submatrix xwl' then we need to reparametrise (seeChapter 20). In X Htxj, x2, either case the problem is relatively easy to detect. What is more difficult to collinearity which might be particularly relevant in the detect is As argued above, however, the problem is relative to a context. present and thus can be tackled alongside the given parametrisation reparametrisation of (48) in our attempt to design' an empirical econometric model based on the estimated form of (48)(seeSection 20.6). .
.
.
,
%near'
(2)
Assumptions underlying
the probability
model
The assumptions underlying the probability model constitute a hybrid of those of the linear and stochastic linear regression models considered in Chapters 19 and 20 respectively. The only new feature in the present context is the presence of the initial conditions coming in the form of the following
distribution:
D(Z1 ; /)
t
1-1Dyt//=
2
1,
X,;
4),
(23.77)
where 4 =H(#). For expositional purposes we chose to ignore these initial conditions in Sections 23.1 and 23.2. This enabled us to ignore the problem
23.3 Misspecification testing of having to estimate the statistical GM t-1
#f
=
#lxr+
i
l-1
Z =
1
ai-vr-f +
i
Z #;xt-f+ ut =
(23.78)
1
for the period l 1, 2, rn. The easiest way we can take the initial conditions into consideration is to assume that 4 coincides with 0*. If this is adopted the approximate MLE'S #*will be modied in so far as the various summations involved will no longer be of the form jr- P! + 1 uniformly but v +fxl-fyt, i 1, 2, ,- z m. That 1s, start summing L-r 1 +iXIX)-f and where observations become available for each individual from the point component. As far as the assumptions of normality, linearity and homoskedasticity are concerned the same comments made in Chapter 21 in the context of the linear regression model apply here with minor modifications. In particular the results based on asymptotic theory arguments carry over to the present case. The implications of non-normality, as defined in the context of the linear regression model (seeChapter 21), can be extended directly to the DLR model unchanged. The OLS estimators of j* and co2have more or less the same asymptotic properties and any testing based on their asymptotic distributions remains valid. In relation to misspecication testing for departures from nonuality the asymptotic test based on the skewness and kurtosis coefficients remains valid without any changes. Let us apply this test to the money equation estimated in Section 23.3. The skewnesskurtosis test statistic is z(y) 2.347 which implies that for a 0.05 the null hypothesis of normality is not rejected since ca 5.99. Testing linearity in the present context presents us with additional problems in so far as the test based on the Kolmogorov-Gabor polynomial (see (21.11) and (21.77))will not be operational. The RESET type test, however, based on (21.10) and (2 1.83) is likely to be operational in the tl t3 present context. Applying this test with j$l and were excluded because of collinearity) in the auxiliary regression =
.
.
.
,
,
=
.
.
.
,
=
=
=
pr J'x =
+ y.4 + t',
(23.79)
which does not reject Ho. yielded: FT(y) 2.867, ca 4.03 for a Testing homoskedasticity in the present context using the KolmogorovGabor parametrisation or the White test (seeNicholls and Pagan (1983)) presents us with the same lack of degrees of freedom' problem. On the other hand, it might be interesting to use only the cross-products of the xfts Only as in the linear regression case. Such a test can be used to refute the conjecture made in Chapter 21 about the likely source of the detected heteroskedasticity. It was conjectured that invalid conditioning (ignoring =
=
=0.05,
548
The dynamic linear regression model
the dependence in the sample) was the main reason for rejecting the homoskedasticity assumption based on the results of the Whitc test. An obvious way to refute such a conjecture is to use the same regressors /1,. /6f in an auxiliary regression where tIt refers to the residuals of the . dynamic money equation estimated in Section 23.3 above. This auxiliao regression yielded: .
.
,
TR2
5. 11,
=
FF(y)
=
0.830.
(23.80)
The values of both test statistics for the significance of the coefficients of the kits reject the alternative (heteroskedasticity)most strongly; their critical values being ca 12.6 and t'a 2.2 respectively for 0.05. The time path of the residuals shown in Fig. 23.345) exemplifies no obvious systematic ,x
=
=
=
variation. ln the present context heteroskedasticity takes the general form Vart r,/'c(Y0,,-j),
x,0 =
x,t') /7(v,0.) x,0). =
,
This suggests that an obvious way to the problem of applying the White test' is to use the lagged residuals as proxies for Z,0- j That is, use tl..v r2-1 tl. a, as proxy regressors in the auxiliary regression: 'solve
.
,
.
li2 t
=
.
.
t'tj
,
+
(.':
tl.
j
+ cztl.- a +
.
+ cgth
.
.
g
+
gt
(J,3.8.'2)
and test the significance of c'1, cv. This is the so-called ARCH (autoregressive conditional heteroskedasticity) test and was suggested by Engle (1982).ln the case of the above money equation the ARCH test statistic for p takes the value F'Ijyj 0.074, with critical value cz 2.5 1 for x 0.05. Hence, no heteroskedasticity is detected, confirming the White test given above. ln Chapter 22 it was argued that the main reason for the detected parameter time dependence related to the money equation was the invalid conditioning. That is, the fact that the temporal dependence in the sample Having conditioned was not taken into consideration. on a11 past information to definc the dynamie linear regression model the question arises as to what extent the newly defined parameters are time invariant. ln Fig. 23.3())-(J) we can see that the time paths of the estimated recursive coefficients of J'l, pr and it exemplify remarkable time invariance for the last 40 periods. lt might be instructive to compare them with Fig. 2 1.1(h)-(J) of the static money equation. .
.
.
,
=4
=
23.4
=
tshort'
=
Spification
testing
Specification testing refers to testing of hypotheses related to the statistical
23.4 Specification
testing
parameters of interest assuming that the assumptions underlying the statistieal model in question are valid. In the context of the dynamic linear testing is particularly important regression (DLR) model specification stands it statistical GM has very little, if any, economic because the as GM when tested for any estimated statistical The interpretation. assumptions is rejected can underlying of and the misspecifications none of the sample convenient providing summarisation only be interpreted as a 0. 10
0.05
b
0.00
-0.05
-0.10
1964
1967
1970
1973
1976
1979
1982
T i me
(b)
Time
from (23.44).(h)--(J) The time paths of the Fig. 23.34(/).Thc residuals of estimates recursive Jya,,/s, and jzkg,the coefficients of ln yr. ln J'f and ln lt respectively from (23.44).
The dynamic Iinear regression
model
1.25 ,00
1
0.75 lat 0.50 0.25
-0.25
1973
1976
1979
1982
1979
1982
Time
(c) 0.100 0.075 0.050 0.025
l4t
0 -0.025 -0.050 -0.075 -0.1
1973
1976 Time
(d)
Fig. 23.3 continued
information. The (economic) theoretical parameters of interest are assumed of simple functions the statistical parameters 0*. These theoretical to be will be determined using specification testing. parameters well-defined The estimated statistical GM provides the firm foundation on which the empirical econometric model will be constructed. There are two important considerations which must be taken into account in going from the estimated statistical GM to the empirical econometric model.
23.4 Specification testing
55l
Firstly, any theoretical restrictions needed to determine the theoretical from the statistical parameters of interest must be tested before being imposed. Secondly, when these restrictions are imposed we should ensure that none of the statistical properties defining the original statistical model has been invalidated. That is, we should ensure that the empirical econometric model constructed is as well specified (statistically) as the original statistical GM on which it was based. An important class of restrictions motivated by economic theory considerations are the exact linear restrictions related to j*: Ho: R#*
against
r
=
Sj
:
(23.83)
R#* # r,
and q x 1 known matrices with ranktRl=
where R and r are q x k* q, k* mtk + 1). Using the analogy with the linear regression model the F-type test statistic suggests itself: (R/% -r)'ER(X'X)-
FT*(y)=
1R/1 -
qdl RRSS - URSS URSS
=
=
1(RJ*r) -
T-k* q
(23.84)
(see Chapter 20). The problem, however, is that since the distribution of #-* is no longer normal we cannot deduce the distribution of FT*(y) as Fq, T- k*) under Ho. On the other hand, we could use the asymptotic distribution of #*, i.e. v'T(/*
N(0,c2G-1)
-,*)
-
(23.85)
in order to deduce that So
qyl-sy) z2(t?). -
7
Using this result we couldjustify the use of the F-type test statistic (84)as an approximation of the chi-square test based on (85).lndeed, Kiviet (1981) has shown that in practice the F-type approximate test might be preferable in the present context because of the presence of a large number of regressors. In order to illustrate the wide applicability of the above F-type specification test 1et us consider the simplest form of the DLR model's statistical GM where k 1 and m 1: =
A'f =
JOXI
+
Jlxf 1 +J1X-
=
1
+'
lfr.
Despite its simplicity (86) incorporates a large proportion
(23.86) of empirical
The dynamic Iinear regression model
econometric models in the applied econometric literature. In particular, following Hendry and Richard ( 1983), we can consider at least nine special cases of (86) where certain restrictions among the coefficients po,jl and aj are imposed: Case 1. Static 't
k,f
fvxt+
ttt.
u j yl
+ ut
=
=
.,
()1
regression
..j
.I1
=
=
0)..
(23.87)
(23.88)
.
(23.89) Case 4. (23.90) Case 5.
#oxf+ ()1 xr - 1 +
.r
=
Partial adjustment
(vx;+
#'f =
x 1 )'f
-
1
Case 8. Dead-start' 'r
).
Jl1 -Y
==
+ ut
f -
-i- t: j )'f
0).. (23.92)
+ x1 /.71
)(x, 1 - )'r - 1 ) + (/.() 0).'
al
model j.
=
.
/A7oJt.?/ (/.()+
poAx, + ( 1 -
=
.
n-ltpt/c/ (jj
Error--correction A)',
t1t
=
ur.
1)..
(23 9 3) .
=
-
1
-h 1/f
.
(23.94)
For the above eight cases the restrictions imposed are a1l linear restrictions which can be tested using the test statistic (84) in conjunction with the rejection region C1
.t
=
y : F r(y)>
(',
)
where ca is determined by x
dkq, F- k*).
= Q'y
family of restrictions considered extensively in Chapter 22 the factor restrictions. ln relation to (86)the common factor common are restriction is :1 jo + jl 0, which gives rise to the special case of an autoregressive error model. An important
=
23.4 Specilkation testing Case 9. Autoreqressive
yf jax, + =
t;,,
t/-rt?r
c,
=
553
model
al;,
-
l
+
(a1/0+
/JI
=
0)..
(23.95)
l,.
For further discussion on a1l nine cases see Hendry and Richard ( 1983), and Hendry, Pagan and Sargan (1984). ln practice, the construction ('design')of an empirical econometric model taxes the ingenuity and craftsmanship of the applied econometrician more than any other part of econometric modelling. There are no rules or reduce any well-defined established procedures which automatically empirical estimated statistical GM specified) to a ('correctly' economic mainly model. both theory This is because econometric as well as role choice sample data play in the properties of the the ('design')of the a statistical GM for a this, order ln let us return to the latter. to illustrate estimated 23.3 this equation estimated ln in Section 23.2. Section money and misspecifications possible equation was tested for any none of the natural question rejected. The to ask at underlying assumptions tested was constitutes this estimated statistical GM that this stage is a wellspecify statistical model, how do we proceed to (choose) an defined empirical econometric model?' As it stands, the money equation estimated in Section 23.2 does not have any direct economic interpretation. The estimated parameters can only be viewed as well-defined statistical parameters. ln order to be able to proceed of an empirical econometric model we need to consider the with the question of the estimable form of the theoretical model, in view of the observed data used to estimate the statistical GM (see Chapter 1). In the case of the money equation estimated in Section 23.2 we need to decide whether th theoretical model of a transactions demand for money considered in Chapter 19 could coincide with the estimable model. Demand in the context of a theoretical model is a theoretical concept which refers to to a range of hypothetical the intentions of economic agents corresponding values for the variables affecting their intentions. On the other hand, the observed data chosen refer to actual money stock M 1and there is no reason why the two should coincide for all time periods. Moreover, the other variables used in the context of the theoretical model are again theoretical constructs and should not be uncritically assumed to coincide with the observed data chosen. In view of these comments one should be careful in isearching' for a demand for money function. ln particular the assumption that the theory accotlnts for al1 the information in the data apart from a white-noise term is highly questionable in the present case. In the case of the estimated money equation the special case of a static demand for money equation can be easily tested by testing for the significance of the coefficients of all the lagged variables using the F-type 'proper'
'assuming
'design'
554
The dynamic Iinear regression model
0 and #f 0, test considered above. That is, test the null hypothesis S0: : #0 or i 1, 2, 3, 4, against #f# 0, i 1, 2, 3, 4. The test statistic for this hypothesis is =
=
.#.fj
=
=
F'Ijy)
=
(
0.11650
53 53 1(-fI1-
-0.010
(23.96)
28.072.
().()j() 53
c.= 1.76 for a 0.05 we can deduce that Hv is strongly rejected. A moment's reflection suggests that this is hardly surprising given that what
Given that
=
the observed data referto are not intentions orhypothetical range of values, but realisations. That is, what we observe is in effect the actual adjustment process for money stock M 1 and not the original intentions. Hence, without any further information the estimable form of the model could only be a money adjustment equation which can be dominated by the demand, supply or even institutional factors. The latter question can only be decided by the data in conjunction with further a priori information. Having decided that the estimable model is likely to be an adjustment process rather than a demand function we could proceed with the design' of thc empirical econometric model. Using previous studies rclated to adjustment equations, without actually calling them as such (seeDavidson Hendry and Ungern-sternberg (1981), Hendl'y et al. (1978),Hendry (1980), and Richard (1983), Hendry (1983/,the following empirical econometric chosen: model was
a
In(->-M),
y.''j
-0.134
In(->-M),.f)
(j1
-0.474
-
a
(0.130) (0.02) (Intyl
-e.196
l-1
(0.022)
4
-0.80 1 ln Pt -0.059 ln ff
j) (
-0.025
f=1
(0.145)
(0.007)
=0.758,
log 1=223.318,
42 0.725, =
s
1)f ln It
-
i
(0.008)
+ lif, -0.045Qjt+0.059Qct+0.053Q31 (0.014) (0.014) (0.015)
R2
-
(23.97)
=0.0137,
R&S=0.012 47,
F= 76.
A1l the restrictions imposed on the estimated statistical GM in order to reduce it to the above empirical econometric model are linear and the relevant F-type test statistic is
23.4 Specification testing FT(y)
53
0.0 12 476 -0.010 530 0.0 10 530
=
(23.98)
=0.753.
j
test we can deduce that these Given that ca 1.88 for a size a restrictions are not rejected. This test, however, does not suffice by itself to establish the validity of (97)as a well-defined statistical model as well. For this we need to ensure that the misspecification test results of the original estimated statistical GM are maintained by (97). against m 6. Using the F-type test Cboice of m - testing for m (i) 1 342 the value of the test 379 and URSS with RRSS statistic is =0.05
=
=4
=
=0.01
=0.012
1 342 56 0.012 379 : ().()1 1 y4a -0.01
FT(y)
=
with c.= 2.1 1, x
=
1-
0.640
0.05,
the null hypothesis m 4 is not rejected. The Lagrange multiplier error autocorrelation test for I 2 in its F-type form yielded: =
=
177 62
-0.012
FT(y)
=
0.012 379 0.012 177
Y
(23.99)
0.514
=
with ca 3. l5, q 0.05. Misspectjlcation test for normality - skewness-kurtosis (ii) the test statistic is 43 0.3433 and :4. 3 =
=
test. With
-0.2788
=
=
-
z(y)
F =-
6
.
4a +-
T
and with c.= 5.99 for rejected.
x
=
0.0f the assumption
(b)
=
0.685,
.t3,
tl
RESET test with FT(y)
=
rpsrs Jr Iinearity
Misspecscaion
=
=
1.18,
cx
=
2.76.
1.75.
ARCH test: FF(y)=
0.53 1, cz=2.51.
These tests indicate no misspecification
of normality
homoskedasticity
:
,
ca
and
.*
White test: FF(y)
(23.100)
(44. 3)2 1.801 -
24
at a = 0.05.
is not
The dynamic linear regression
model
Structural t.-/:fgnf?tr tests (seeSection 2 1.6). It might be interesting to test the hypothesis that the changes picked up' by the dummy variables are indeed temporary changes without any lasting effects as assumed and not important structural changes. The first dummy variable was defined by =0
D1
for t # 36, f
Using T1 37 the F-type test for =
F T) (y)
=
0.0 16 00 0.0 14 549
=
1l .
HLlt:clj
5, 6,
=
.
.
.
,
80.
c2a yielded
=
(23.103)
,
implies that HLl) is not rejected. Given which for x 0.05, ca 1 this result we can proceed to test /'t)': #j pz. The F-type test statistic is .85,
=
=
=
FF2(y)
=
(11-0.0517,
0.013 749 -0.0 -
()s
j?
13 678 3 60 ?- g'yjj
(23.104)
) is strongly accepted. Using given that c?, 2.254 for x 0.05, HL3 the same procedure for Tz 51. =
=
=
FFI
0.016 278
(y) 0.01 356 =
c'a= 1.83,
and
FF2(y)
=
x
1.22,
=0.05
(23.105)
0.0 13 749 -0.0 12 867 60 -j0.0 12 867 -.
.---
cu 2.254,
a
#1 #2 and
cf
=
Hence, Hv
=
=
=
0.685,
0.05.
=
=
o'q is accepted
(23.106) at
*
=
1-
-a)2
(1
=
0.0975. ln Chapter 2 1 we used the distinction between structural change and parameter invariance with the former referring to the case where the point of change is known a priori. Let us eonsider time invariance in relation to (97) as well. A very important property for empirical econometric models when needed for prediction or policy analysis is the time invariance of the estimated coefficients. For this reason it is necessary to consider the time invariance of the model in order to check whether the original time invariance exemplified by the estimated coefficients of the statistical GM the empirical econometric model we try has been lost or not. In to capture the invariant features of the observable phenomena in order for 'designing'
23.4 Specification
testing
the model to have a certain value for prediction and policy analysis purposes. Hence, if the model has been designed at the expense of the time invariance the estimated statistical GM will be of very little value.
The recursive estimates of Lj7- 1 Atn:f ..j -pt -j)1, (Fl, - 1 - pt 1 - yt 1), 1)J'jf -y), AR, it and in Fig. 23.4(4J)-(./') 1 ( ..j are shown () AJ'l respectively for the period 1969/-1982/:. Apart from some initial volatility due to insufficient sample information these estimates show remarkable time constancy. The estimated theoretical parameters of interest have indeed preserved the time invariance exemplified by statistical parameters of interest in Section 23.3. lt is important to note that the above misspecification tests are not iproper' tests in the same sense as in the context of the statistical GM. They checks' in order to ensure that the should be interpreted as determination of the empirical econometric model was not achieved at the expense of the correct specification assumption. This is because in the empirical econometric model from the estimated statistical GM we need to maintain the statistical properties which ensure that the end meaningful estimated equation but a product is not just an economically well-defined statistical model as well. Having satisfied ourselves that (97)is indeed well defined statistically we can proceed with its economic meaning as a money adjustment equation. The main terms of the estimated adjustment equation are: (i) (1y j7, j A ln(M P4t -f) - the average annual rate of growth of real money stock; (ln(M/#)t- 1 -ln F) 1) -- the error-correction (ii) term (see Hendry ( 1980))., j) Vz-ot) (1.3 the average annual rate of growth of real 1 A ln Fp consumers' expenditure'. (iv) A ln Pt - inflation rate: ln lt - interest rate (7 days- deposit accountl; (v) (vi) jt) j ( - 1)iln 1:- annual polynomial lag for interest rate. Interpreting (97)as a money adjustment equation we can see that both the rate of interest and consumers' expenditure play a important role in the determination of tht? changes in real money stock. As far as inflation is concerned we can see that the restriction for its coefficient to be equal to minus one against being less than minus one (one-sidedtest) is not rejected 1.67 and the test statistic is at a 0.05 given that (',
l7-
(ZJ?-
,
Sdiagnostic
Sdesigning'
-
-
-
!
=
=
z(y)
=
-.-
-
-
- 0.80 1 + 1.00
tj.j4j
-
-0.
=
137.
(23. 107)
This implies that in effect the inflation rate term cancels out from both sides
558
The dynamic linear
regression
model
a 1t
(a)
T ime
Fig. 23.4(44-(/1, The time paths of the recursive coelicients of (j7- j A ln (M #lf -j), (ln (M #F)f 1), and (7. j ( 1)J ln lt--j) respectively (from(97:. -
-
estimates of the 1 A ln 1-j), ln lt
(1F
with short-run behaviour being largely determined in nominal terms; not an unreasonable proposition. The long-run represented by the static solution (assumingthat yt y'f-f,N N-f,i 1,2, ) Of the adjustment is equation =
M PF
=
Aj -
0.301(
j .j. gj -
=
=
4..087 '
.
.
.
(23. 108)
23.4 Specification testing
559
0
Fig. 23.4. collrulktuat
This suggests that the long-run where is a constant (see Hendry (1980)). behaviour differs from the short-run in so far as the former is related to real money stock. The question which arises at this stage is, is the estimated adjustment equation related to the theoretical model of a transactions demand for money?' From (97) we can see that the adjustment equation is dominated by the demand side in view of the signs of I A ln Ft- f, ln It .,4
ihow
Z)-,
The dynamic Iinear regression model
560 0
ast
Time
Time
Fig. 23.4. continued
and jt) 1 ( 1)i ln It f. This suggests that if we were to assume that the supply side is perfectly elastic then the equilibrium state, where inherent tendency to change' exists, can be related directly to ( 108). Hence, in view of the perfect elasticity of supply ( 108) can be interpreted as a transactions demand for money equation. For a more extensive discussion of adjustment processes, equilibrium and demand, supply functions, see Hendry and Spanos (1980). -
'no
23.4 Specification Re-estimation
of
testing
561
(97) with Apt excluded from both sides yielded the
following more parsimonious
empirical model:
(23.109) .R2ccz(j.'y:9,
s 0.01384=
1og L= 222.25, The above estimated coefficients can be interpreted as estimates of the theoretical parameters of interest defining the money adjustment equation. These parameters are simple functions of the statistical parameters of interest defining the statistical GM. An interesting issue in the context of this distinction between theoretical and statistical parameters of interest is collinearity or/nd short data (collectively related to the problem of called insufficient data information) raised in Chapter 20. ln view of the large number of estimated parameters involved in the money statistical GM one might be forgiven for suspecting that insufficient problems might data information be affecting the statistical parametrisation estimated in Section 23.3, One of the aims of econometric modelling, however, is to design' l'tlbusl estimated coefficients which are directly related to the theoretical parameters of interest. For this reason it will be interesting to consider the correlation matrix of the estimated coefficients as a rough guide to such robustness, see Table 23.2. The correlations among the coefficients of the regressors are relatively small; none is greater than 0.8 1 with only one greater than 0.68. These correlations suggest that most of the esttmated (theoretical)parameters of design'. interest are nearly orthogonal; an important criterion of a The first column of Table 23.2 shows the partial correlation coefticients (see Section 20.6) between Am and the regressors with the numbers in parentheses underneath referring to the simple correlation coefficients. coefficients show that every The values of the partial correlation contributes substantially the explanation of Arnfwith the errorto regressor correction term and the interest rate playing a particularly important role. above (see Another important feature of the empirical model bnear'
tgood
tdesigned'
(??1t1 - Pt /'
1
YkxAnA7K
'he
562
1 -
.1L-
1
llel:
0.4 13 ( - 0.053) - 0.754 ( - 0.4 18)
1
3
-3 j=)
mWt
leresm
0.422 (0.056)
A y,j..j .
-0.70 l ( -0.080)
j'
-0.077
Table 23.3. Restricted coefhcient
p;)@-j pt..j pt-j it ..j
0.4 13
27
4.=2
0.354
0 O 0 -0.025
0.025
on
(97) .
--.
j=j 0. 196 0.80 1
-0.80 1 -0.059
based
estimates -
j=Q
0.093
j=? 0 13 - 0.4 0 0.025
j=4 0.158
0 0
-0.025
( 109/ is that the restricted coefficient estimates do not differ significantly from the unrestricted estimates as can be seen from Table 23.3. Further evidence of the constancy of the estimated coefficients is given in Fig. 23.5(fl)-4.) where the 4o-observation estimates are plotted. These estimates. based on a fixed sample size of 40 observations, run through the whole sample, that is, jl based on observations 1...-40,#2 on 2..41, ja on 3-42, etc. iwjnftpw'
23.5
Prediction
An important question to consider in the context of the dynamic linear regression (DLR) model is to predict the value of yw+1 given the sample T As argued in Chapter 13, the best information for the period t 1, 2, predictor (in mean square error (MSE) sense) is given by the conditional expectation of yw+1 given the relevant information set. For simplicity let =
.
.
.
,
23.5
Prediction
563
(a)
(b) Fig. 23.5(a)-(:'). The time paths of 40 observation the coefficients of ( 109) apart from the constant.
window estimates of
564
The dynamic linear regression model
Time
(c) 0
Fig. 23.5. coatinued
565
Prediction
23.5
ffgt/
Fig. 2.3.5. cottil;
us assume that 0* EEE(al aa, predictor of .
,
.
xowsI EB (x1 xa, .
,
',f..1
tv
i
+
.
,
HF
is given and the parameters ln such a case the best known. are
x:.,.I)'
pm, am. j(), j1, is given by its systematic component, .
.
,
.
,
!
.
-
) =
x
=
0y I
.
i.e.
j
B1
N1
i
xw, c2a)
,
.
0w 0w) E'(4,w+ l ,/c(Y X
=
Moreover,
.
'
Xi l y
I-
.
1
j
Va
V
- t)
the predietion
error
F;X
!
.
j
-
?
(23.110)
.
is
(23. 111) (23. 112) and /f( l
1
-.
1
c(Y01 ) X?. 1
+
.
l
x0T 4.
=
)
1
=
0.
Similarly the best predictor of ),wv! given a similar information set is m
?N
)(( pr-hz-alpwsl
afy,.+a-..+
+
i
=
2
i
z= =
0
#;xw+?-f.
In general we can construct predictors for I periods ahead by substituting
566
The dynamic Iinear regression
model
and p1
p1
pv-l- l i
cf/zwx./-i
1
=
i
#;xw-,-, j(2 0
/-rn+
1, ,n+2,
.
.
.
(23.115)
.
=
The above predictors were derived on the assumption that 0* was known a priori 't-a grossly unrealistic assumption. In the case where 0* is not known we use 0*, the approximate MLE and the predictor of )'v+l takes the form /-1
p- v + l
j i
=
=
m .:i'
i 1
pr
+
l- i
+ i
m
)
(:i.
l
l
=
-
p v+
j ..
j
l-
j
+
j p-'i x v
1*=
+
O
g..
j
l 2 3, =
,
,
.
.
m (23. 116) .
,
and m
F-
'r
+
l
=
i
m
jl =
1
(i'
i
yv+
t- i +
i
) =
p-k'J .
+
0
,
I
m+ 1
=
,
.
.
.
(23.117)
.
ln the case where xw.hp,/ 1,2, is not known it has to be and substituted in the above formulae. Returning to ( 1 10),we can see that the prediction error variance by (111), i.e.
tguessestimated'
=
.
.
Ellyw-h 1 - p..+ 1)2 c(Y?),
,
.
x7-
)
-
xow..ll
czt).
-
is given
(23.118)
Similarly, (23. l 19)
and
I 2 3, (23.120) =:
.
.
.
l
m
a,?c()2+ j = ij 1 i =
,
=
r?T +
cj 1
,
l m+ 1 =
,
.
.
.
.
(23 12 1) .
formulae for the case where #* is used instead of 0*, when the latter is unknown, are rather complicated. ln practice, however, since #*is a consistent estimatorof 0* the formulae ( 118)-(12 1) are also used (but only asymptotically valid) in the case where the estimated coefficients are used. For hypothesis testing and interval estimation related to prediction The corresponding
Looking back
567
is given by
distribution
errors the asymptotic
l
N 0, )( a/coz (.pw+,-w-,) -
i
('
=
,
1- 1, 2,
.
.
.
(23.122)
r,,,
,
1
where xl 1. Note that the predictors pv..lare correlated w+ 1, w+2, and anyjoint testing or confidence estimation needs to take the covariance into consideration as well. =
23.6
.
.
.
,
Looking back
In Section 23.4 we constructed an empirical econometric model of a money adjustment equation based on a well-defined statistical GM estimated in Section 23.2. lt will be useful at this stage to return to the money equation estimated in the context of the linear regression model (seeChapter 19) and derived in the various statistical inference attempt to 19-22. For convenience 1et us reproduce both equations Chapters estimated for the same sample period 1964/- 1982/:r: 'results'
'explain'
hmt
1
124 -0.485
-0.
=
j
(0.018) (0.130)
4
)
Aln1r-j
-
pt-j)
J=1
(0.022) (23. 123) (0.007)
(0.0 14)
kl
.R2
=0.709,
1ogL= 222.25,
=0.675,
s=0.013
RSS=0.012
832,
84, F= 76.
Note that small letters are used to denote the natural logs of the variables represented by the capital letters.
nl, 2.763 +0.705yf ( 1,100) (0.112) =
Rz
=
0.995,
log L= 138.425,
+0.862pf
-0.053t
+
,
(0.022) (0.014) (0.040)
Rl 0.995, =
RSS
=0.
s=0.040
(23.124)
22,
1165,
Looking at these estimated equations we can see how grossly misspecified (124)is as a linear regression statistical GM. The inappropriate
equation
568
The dynamic linear regression model
assumption of an independent sample invalidates a1l the statistical inference results based on ( 124) derived in Chapter 19. Moreover, interpretipg the estimated coefficients as elasticities and using these in any way is again unwarranted given that these are not well-defined statistical parameters and any arguments based on the estimated coefficients can be very misleading. ln terms of goodness of fit the estimated variance of ( 123) is almost a tenth of that of (124). From the statistical viewpoint the main difference between (123) and ( 124) comes in the form of the unmodelled part of yf. The residuals for the two estimated equations behave very differently. In Fig. 23.6, t and J, are plotted over time and as we can see they differ greatly. Firstly, the standard deviation of t is three times smaller than that of J,. Secondly. the whitenoise assumption seems much more appropriate for ut rather than :f. From Chapters 19-22 we know that f exhibit not only time dependency but heteroskedasticity as well. Hence, in no circumstances should ( 124) be interpreted as an empirical econometric model. The main problem associated with ( 124) is that of invalid conditioning. That is, in defining the systematic component we should have conditioned (X, xr) but the past as well. This invalid not only on the conditioning induced time dependency in Jt, V as well as j. Looking at ( 123) we can see how this problem can arise. Using the terminology bpresent'
=
Time
Fig. 23.6. Comparing the residuals from
(123)and (124).
569 23.6 Looking back model empirical econometric and Richard 1982) the by Hendry introduced ( l 124). Encompassing refers to the ability of a particular ( 123) estimated statistical model to explain the results of rival models (seeMizon ( 1984) for an extensive discussion). The comparison of ( 123) and (124) above ln this case, howevers a was in effect a simple exercise in encompassing. formal discussion of encompassing seems rather unnecessary given that ( 124) could not be seriously entertained as an empirical econometric model given that it has failed almost all the misspecifieation tests applied. Although the statistical fotlndations of ( 123) seem reasonably sound, its theoretical underpinnings are less obvious. This, however. it not surprising given that economic theory is rather embarrassingly slent on estimable adjustment equations. At this early stage it might be advisable to rely more heavily on data-based specifications which might provide the basis for more realistic theoretical formulations of estimable models. Muellbauer ( 1986) provides an excellent example of how' economic theoretical arguments can and interpret successful data-based empirical be used to econometric models. This seems a most promising way forward with each other; economic theory and data-based specifications complementing 1985). Nickell and Engle 1985). Pagan 1985). Granger ( ( ( see The above discussion exemplifies the dangers of using an inappropriate statistical model in economuttric modelling. At the outset it must have been obvious that the sampling model assumption of independence associated with the linear regression model was highly suspect for the kind of data chosen. Despite that, we adopted an inappropriate statistical model in order to illustrate the importance of the decision to adopt one in preference of other statistical models. Throughout the discussion in Chapters 17-23 every attempt has been made to persuade the reader that the nature and statistical properties of the observed data chosen have an important role to play in econometric modelling. These should be taken into consideration when the decision to adopt a particular statistical model in preference to the others is made. In a certain sense the choice of the statistical model to be used for the particular case of cconometric modelling under consideration is one of the most important dccisions to be taken by the modeller. Once an inappropriate choice is made quite a few misleading conclusions can be drawn unless the modellel' is knowledgeable enough to put the estimated statistical GM through a battery of misspecification tests before embarking on specification tcstilg and prediction. lf, however, the modeller follows the mtxim that theory accounts for all the naive methodological information in the data f irrespecti: e of the choice of the data) apart from a theoretical information is real information-, white-noise error term' or then the misspecifieation testing seems only of secondary importance and misleading conclusions arc more than likely. wencompasses-
trationalise'
'the
'only
The dynamic Iinear regression model Important
concepts
Autoregressive representation, homogeneous non-stationary processes, initial conditionsrasymptotic independence, Granger non-causality, strong exogeneity, unit circle restrictions, approximate MLE, Durbin's test, statistical versus theoretical parameters of interest, error correction term, long-run solution, partial correlation eoefficients, encompassing.
Questions Explain the role of the stationarity and asymptotic independence of the stochastic proccss .tZ!, l (E T) in the context of the specification of the DLR model. Is the stationarity of )Z,, t c T) necessary for the autoregressive representation of the process? Define the concept of strong exogeneity and explain its role in the context of the DLR model. statistical GM of the DLR model is a hybrid of those for the linear 4. and stochastic linear regression models.' Explain. Discuss the difference between the exact and approximate MLE'S of 0*. :Do they have the same asymptotic properties?' Explain your answer. State the asymptotic properties of the approximate MLE #* of p*. l-low do we test whether the maximum lag postulated in the specification of the statistical GM is too large'?' tl-low do we interpret residual autocorrelation in the context of an estimated statistical GM in the DLR model'?' iWhy is the Durbin-Watson test inappropriate in testing for an AR(1) error term in the context of the DLR model'?' -f'he
Additional references
Anderson (1959))Crowder (1980); Durbin ( 1960); Harvey (198 1); Granger and Weiss and Wald (1943:; Priestley ( 1981).
(1983);Mann
linear regression model
The multivariate
Introduction
24.1
The multivariate linear regression model is a direct extension of the linear regression model to the case where the dependent variable is an m x 1 random vector yr. That is, the statistical GM takes the form
y
t
=
B'x ! + u t
,
where yf : m x 1, B : k x rrl, xt: k x 1, uf : m x 1. The system system of m linear regression equations'. yff
=
fixt+
i
uit,
=
1, 2,
.
.
.
,
(1) is effectively
n1, l 6 T,
a
(24.2)
with B (#1,#c, pm). ln direct analogy with the m 1 case (seeChapter 19) the multivariate linear regression model will be derived from first principles based on the joint distribution of the observable random variables involved, D(Zf; #) + k) x 1. Assuming that Z, is an I1D normally where Zt BE (y;,X;)', (?'? distributed vector, i.e. =
.
.
.
,
=
(xY',) -
N
) ((0())()1a 2a)) Yxal
wecan proceed to define the by: pt E(y,/X, =
and
tl,
=
=
xl)
=
=
systematic and non-systematic
B'x,,
yt - C(y!/Xt x,),
for all t s --,
B
=
components
Ec-21E2:
(24.5)
'
The multivariate
(i)
'(u,)
=
Stulu'sl
Ftpruf'l
where D 19.2).
=
E1 j
=
=
-
model
regression
ul and pt satisfy the following properties'.
by construction,
Moreover,
(iii)
linear
'EZ'(u,/X! x,)(l =
ErEtuu's Xf
'ES(pu;,/Xf E21 El 2Ea-21
=
=
=
Xf)1
x,)(1
0,'
=
=
fl 0 E
'X, x,)() rprluf', =
(compare these
with
=
0,
the results in Section
The similarity between the m 1 case and the general case allows us to consider several loose ends left in Chapter 19. The first is the use of thejoint distribution D(Zr; #) in defining the model instead of concentrating exclusively on Dtyf X!; /1). The loss of generality in postulating the form of the joint distribution is more than compensated for by the additional insight provided. In practice it is often easier to judge' the plausibility of assumptions relating to the nature of D(Z,; #) rather than D(yf/Xr; j). analysis the relationship Moreover, in misspecification between the assumptions underlying the model and those underlying the random vector of the nature of the process (Z,, t e: T) enhances our understanding possible departures. An interesting example of this is the relationship of the assumption that pZ,, t e: T) is a normal (N); ( 1) independent (J); and (2) identically distributed (1D) process; and (3) D(y!,/X,; 1) is normal; !61 (i) Eyt/'xt x,) is linear in x,; (ii) (iii) Covtyf/'x, xr) is homoskedastic (free of xr),' p>(B, f1) are time-invariant; g7q g8(I (.J,'l,,'Xl,t (E T) is an independent process. =
=
=
The relationship
between these components
is shown diagrammatically
below:
(.f) I8I --+
The question which naturally arises is whether (i)-(iii)imply (N) or not. The following lemma shows that if (i)-(iii)are supplemented by the assumption
24.1 that X,
x
Introduction
implication
N(0.Eac), detltac) + 0, the reverse
holds.
Lemma 24.1 Zf
kV(0,E) jr
'v
only
(/'
N(0, Ya2), dettEaa) + 0,'
(,*)
X
(1'1')
E (yly'xf
'v
Xt
=
Covtyf/Xf
Y
4' and
t e: T
Xt; ) X 1 2Y2-J =
)
Xl
=
X1 1
=
-
X j 2X2--/ X 2 1
(24.6)
XB + U,
=
where Y: F x n?. X: T x k. B: k x n?. U : T x 131. The system in f 1) can be lth row of (6). The fth row taking the fo rm as the yi Xjf =
+ uf
i
,
1, 2,
=
.
.
.
,
viewed
tn
I'th regression in (2).ln order to define represents al1 T observations on the #I) the need special notation of conditional D(Y X; distribution the we Appendix 2). Using notation this the matrix Kronecker products (see written in form the distribution can be
(Y
where f
X)
.3'=
NIXB, f ()) Iw),
,v
the covariance of
(:& Ir represents y
vec (Y )
=
(24.8)
1
'2
:
'F?'nx 1 .
y,,l The vectoring operator vect ) transforms a matrix into a column vector by stacking the columns of the matrix one beneath the other. Using the vectoring operator we can express (6) in the form '
vectYj
=
(Im () X) vectll) + vec(U)
y* Xyjv =
in an obvious notation.
+ uv
(24.10)
The multivariate
linear regression model
The multivariate linear regression (MLR) model is of considerable interest in econometrics because of its direct relationship with the simultaneous equations formulation to be considered in Chapter 25. ln particular, the latter formulation can be viewed as a reparametrisation of the MLR model where the statisticalparameters of interest 0- (B, f) do not coincide with the theoretical parameters of interest (. lnstead, the two sets of parameters are related by some system of implicit equations of the form:
f 0,I) These
0,
=
equations
i
1, 2,
=
be
can
.
.
.
,
(24.11)
p.
interpreted
providing
as
an
alternative
parametrisation for the statistical GM in terms of the theoretical parameters of interest. ln view of this relationship between the two statistical models a sound understanding of the MLR model will pave the way for the simultaneous equations formulation in Chapter 25.
24.2
Spification
and estimation
ln direct analogy to the linear regression model (n linear regression model is specified as follows: (1)
Statistical GM: y, y, : m x 1
x, :
,
B'xr + ut,
=
B: k x m
kx 1
,
=
=
x,)
.
B'x,,
=
1) the multivariate
lsT
The systematic and non-systematic
;tt JJty, 'Xr
=
components u,
=
are:
y, - .E'(y,/Xt x,), =
and by construction ut) E EE(u,,''Xf xrll =
=
=
0,
'(pt'uf) E Ef(#Ju,/'X, x?)(1 0, =
(4J
(5q
=
The statistical parameters of interest are 0- (B, D) where B 1 1 E 22 E 11 E 12 E 22 - Y21, f - E 21' X, is assumed to be weakly exogenous with respect to 0. No a priori information on 0. Ranktx) xw)': T x k, for F > k. k, X E (x:, xz, =
g31
=
-
=
.
.
.
,
=
24.2 Specification (11)
Probability
alxl estimation
model --i
tll
D(yt/'Xl; #)=
=
f1)
(det
CXP)
mjz
(2zr)
'--llyf
B Xf) D .y ,
-
,
0
(yt- s,xt jj,
R'nkx frt
!
tGT
7
D(y,//Xt,' #) - normal; (i) JJty; X, x,) B'xr - linear in x,; (ii) (iii) Covlyf/xr= N) f - homoskedastic (free of xp); 0 is time invariant. =
=
=
(111)
Mmpling
model
ywl' is an independent sample sequentially drawn from D(y,,/Xr;0j, t= 1, 2, 'C and T> m + k. The above specilkation is almost identical with that of m 1 considered in Chapter 19. The discussion of the assumptions in the same chapter above with only minor modilications due to m> 1. The applies to E1(1-E81 only real change brought about by m > 1 is the increase in the number of statistical parameters of interest being mk +-)m(n1 + 1).It should come as no surprise to learn that the similarities between the two statistical models extend to estimation, testing and prediction. From assumptions (6) to E8qwe can deduce that the likelihood function takes the form
g81
Y=
(y1,y2,
.
.
,
.
.
.
.
,
=
r
v)=c(Y) z-(p;
l
1-1o(y,,/x,;04 =
1
and the log likelihood is T
1
logtdet f1) - s 2 2
F (y, B'x,)/f1- 1(y,
B'x,)
(24.12)
1(Y XB)'(Y XB)q = const -gT logtdet D) + tr fl (see exercise 1). The first-order conditions are
(24.13)
log L= const
-
-
-
t?log L PB
= (X Y -X XBID ,
p log L T fl l pfl- .= Y
,
1
-
=
0.
-J(Y -XB) (Y -XB) ,
-
'-'r
-
(24.14) =0.
(24.15)
't C?''denotes the space of all real positive definite symmetric matrices of rank m.
The multivariate
Iinear
regression
model
These first-order conditions lead to the following MLE's: = (X'X) IX'Y -
(24. 16)
!!
where U Y - Xb. For ( to be positive dcfinite we need to assume that T y n?+ /:. (see Dykstra (1970)). lt is interesting to note that ( 16) amounts to estimating each regression equation separately by =
p-i (x'x) =
-
lx'y
1, 2,
1
=
'
.
.
(24.18)
/3.7.
-
.
Moreover, the residuals from these separate regressions j yf X/ i can be used to derive f' via &ij ( 1y'Flf/iy, ivj 1,2, As in the case of j- in the linear regression modd, the MLE preserves the original orthogonality between the systematie and non-systematie and t y: components. That is, for t =
=
y,
=
,
+r,
t
=
1, 2.
.
z'x,
=
.
.
.
=
-
.
.
z'xf
T
.
-
,rn.
=
(24. 19)
and t ..1- r. This orthojonality can be used to define a goodness-of-fit ' 2 1 ( ) (y y) to measure by extending R ,
=
G
=
I -((;''l'))(Y'v)
.,
,
-
-
t
(Y'Y - l')l'))(Y'Y) -
=
'.
(24.20)
The matrix G varies between the identity matrix when l-J 0 and zero when Y U (no explanation). In order to reduce this matrix goodness-of-fit measure to a scalar we can usc the trace or the determinant =
=
(24.2 1) (see Hooper ( 1959)). ln terms of the eigenvalues (21 ;.1, goodness of fit take the form .
j
t/1
=
N7
)-q;.i l',1 i . --
.
.
.
,
2.) of G the above measures of
tn
and
(lz
- 1
=
i
i
=
(24.22)
.
.j.
The orthoyonality extends directly to 1 X: and U and can be used to show that B and fl are independen random matriees. ln the present context this amotlnts to =
covt () f1)=0,
(24.23)
where E( ) is relative to D(Y X; 0j. '
.
24.2 Specification
and estimation
Finite sample properties
t#'
2
fl
and
and f' are MLE'S we can deduce that they enjoy the of such estimators (see Chapter 13) and they are invarianct? stfjh' trtdnl if they exist. Using the Lehmanstabistics, militnal of the functions result 12) Chapter we can see that the ratio Scheffe (see Irrom the fact that
prtppgl-ll'
o(Y X; p) D(f l .X; p)
=exp) -
1 .--YTYII tr f1- EY'Y
-(Y -Yo)'XB -
(,
B'x'(Y -Y())(l) of 0 if Y'Y =Y'Y() and Y'X
is independent z(Y)
=
(24.24)
YLX. This implies that
=
(z1(Y)-Tc(Y)), where z1(Y) Y'Y, zc(Y) =
=
Y'X
defines the set of minimal sufficient statistics and :
=
(X'X) -
lz
2 (Y)',
(24.26) ln order to discuss the other properties distributions. Since 2 B + (X'X) - IX'U
and f let us derive their
of
=
L=(X'X)- 1X'
=B + LU,
5
we can deduce that
(24.27)
').
,..-N(B, f (x)(X'X) This is because is a linear function of 5'
where
(24.28)
(Y 'X) NIXB'D L I). x
Y(I Mx)Y', its distribution is the matrix equivalent to the Given that T chi-square, known as the Wishart distribtltion with F- k degrees of freedom and scale matrix fl and written as =
T
x,
-
(24.29)
Wz,,,(f1,r- /()
(see Appendix 1). ln the case where T'fz'v c2z2(T-k),
m
=
1 T
E(Tf1)= c2(F-
.
=
ti'li and
/&j.
properties of the The Wishart distribution enjoys most of the attractive normal distribution (seeAppendix 1#.In direct analogy to (30), multivariate =(T-k);n, A'I)('.r'f)
(24.31)
The multivariate
linear regression model
and thus ( gl (F-k)qU'Uis an unbiased estimator of f1. In view of (25/-(31) we can summarise the finite sample properties of the MLE'S (: and f of B and fl respectively: * and f are insariant (withrespect to Borel functions of the form (1) + 1:). y( ): O Rr (1%r Gtrnk + J?'ntpl B and fl are jnctions of the minimal sufficient statistics T1(Y) Y'Y and z2(Y) Y'X. 2 is an unbiased estimator of B (i.e..E'1) B) but f'i is a biased (1/(F- k)(lI'.)'U estimator of f; l being unbiased. : is a fullyeihcient estimator of B in view of the fact that Cov(2) fl ()!)(X'X)-- 1 and the information matrix of 0 EEE(B, f1) takes the form =
'
-+
=
=
=
=
=
lr(#)
(5)
f1-1 ()) X'X
=
T
(24.32) j
- j
- (Eldll ) (d1!
(see Rothenberg (1973:. and f are independent; in view of the orthogonality Asymptotic properties of
in
(19).
and f-1
Arguing again by analogy to the m= 1 case we can derive the asymptotic B and fl of B and f1, respectively. properties of the MLE'S
(i)
( Q B, f X f1)
Consistency:
B) N(0, !'1 (&(X'X)w-1) we can deduce that if In view of the result ( 1 limw- .(X'X)w0 then Covt) 0 and thus 2 is a consistent estimator of and B (see Chapters 12 and 19). Similarly, given that limw- x ew
-+
=
'4(1)
limw-w Covtfll
=
0, Lh % n.
Note that the following statements 1
lim (X'X)w-
2min(X'X)w-.+ 1
,.'.'c
2 (x'x)wmaX
tr(X?X)w-1
-+
are equivalent:
0,'
=
F-* x
(d)
=f1
-+
0
0',
!' Note that the expectation model Dlyt X,; 0j.
operator
'( .
) is relative to the
underlying
probability
579
information
24.3 A priori
1 where 2min(X'X)r and 2max(X'X)w-refer to the smallest and largest eigenvalue of (X'X)w and its inverse respectively', see Amemiya (1985).
Strong consistencyz
lim(XlX).w-x
1
(*
an d
=0
'
a..S. -.+
B)
2
maX
(X'X) F .
<
2min(XX)w ,
C
a.. S.
some arbitrary constant
for (iii)
c, then
-->
B; see Anderson and Taylor(1979).
Asymptotic normality
From the theory of maximum likelihood estimation we know that under V'F(;- 0) relatively mild conditions (seeChapter 13) the MLE ; of 0 '.w
Q
1). N(0,I.(p) For this result to apply, however, we need the boundedness of zxclimr... js(1/T)Ir(#) as well as its non-singularity. ln the present case I.(p) the asymptotlc information matrix is bounded and non-singular (fullrank) Under this condition we can if limw- w(X'X) F= Qx< %. and non-singular. -
deduce that
/
T't -B)
1)
-
(1
v7'r(fz-n) x(0,2(n -
(see Rothenberg
(24.33)
x(0,n () Qx-
(24.34)
() n))
(1973/.
Note that if t(X'X)w,T > k) is a sequence of k x k positive definite matrices r.'.'c as such that (X'X)r- ) - (X'X)r is positive semi-definite and c'(X'X)zc 1 then w(X'X)wlimw0. T-+ x for every c # 0 -+
=
ln view of (iii) we can deduce that (iv) unbiased and jf'ciell.
24.3
2 and (1 are both
asymptotically
A priori information
One particularly important departure from the assumptions underlying the model is the introduetion of a priori multivariate linear regression such additional information is available When 0. restrictions related to applies and the results on estimation derived in assumption g4)no longer modified. The importance of a priori information in Section 24.2 need to be
580
Iinear regression model
The multivariate
the present context arises partly because it allows us to derive tests which testing and partly because this can be usefully employed in misspecification will provide the link between the multivariate linear regression model and the simultaneous equations model to be considered in Chapter 25. (1)
Lineav restrictions
kelated' to X,
The first form of restrictions D1B + Cj
=
to be considered
is
(24.35)
0,
: p x m are matrices of known where D1 : p x k (p< /(), ranktll) p and special particularly important A case of (35) is when constants. =
D1
(0,Ik
M
), 2
B
CI
B1
=
B
(24.36)
,
2
That is, a subset of the coefficients in B is and (35)takes the form Ba restrictions The about thing is that they are not the same these note to zero. fonn the as =0.
(24.37)
R#= r,
discussed in the context of the m 1 case (seeChapter 20). This is because the D1 matrix affects a1l the columns of B and thus the same restrictions, apart from the constants in C1 are imposed on al1 ?z? linear regression equations. The form of restrictions comparable to (37) in the present context is =
,
Rj+
=
(24.38)
r,
: n?1, x 1, R : p x mk, r: p x 1. This form of where #,y vec(B) (#'j #'2, #'m)' linear restrictions is mor general than (35) as well as =
=
BF1 + A1
=
,
.
.
.
,
(24.39)
0.
All three forms, (35),(38)and (39),will be discussed in this section because they are interesting for different reasons. When the restrictions (35)are intel-preted in the context of the statistical GM
y, B'x, =
(24.40)
+ ur,
1, 2, k. we can see that they are directly related to the regressors Xit, The easiest way to take (35)into consideration in the estimation of pH(B, f) the system (35)for B and substitute the is to into (40).ln order to do that we define two arbitrary matrices D1' :4/f -p) x k, ranktDl) =
'solve'
.
.
.
,
'solution'
=
A priori information
24.3 I
-p,
and Cf
x rn, and reformulate
(k -p)
:
(35) into
DB+C=0
(24.41)
D=(D1,D1),
C=
The fact that ranktl) 1
B= - D -- C
where G yields
(GI EEE
Y*
,
G1')
=
GICI
=
D-
-
,
(41) for B to yield
(24.42)
GICI',
+ 1.
tc?
us to solve
k enables
=
/'C1
Substituting
this into
(40)for r 1, 2, =
X*C1'+ U,
=
where Y* EEEY .-XGICI underlying probability
(G/'X'XG
=
Hence, from (42) the constrained
GjCj),
-
1
G 1'X'(Y - XG1 C1)
(24.44)
MLE of B is 1G1''X'X(
-G1Cj),
+ L(
1')
- G1 Cj ).
+ G/(G/'X'XGt)-
P(
P
=
P= I
-
(24.46)
lu.
=
1
1
(24.45)
1. =G1'(G/'X'XG1')- IGI'X'X
where where
-GIC1)
Given that 1u2 L, P2 P and LP=0 (i.e.they are orthogonal we can deduce that P takes the form =
T
=
1G1'X'X'(
=
,
and X* XGt. The fact that the form of the is unchanged implies that the MLE of C1 is
= (GI'X'XGI) -
-
.
model
=
=GjC1
.
(24,43)
1 C1 (X*'X*) - X*'Y*'
2,,,zGICI
.
(X'X) - 1 D' 1 ED 1 (X'X) -- D' 1(1- D 1
projections)
(24.47)
.
(see exercise 2). This implies that
E= -(x'x)since D1G j
1
1l,
D' 1 LD 1 (x'X)-
1
I- 1(D 1
Ip. Moreo: er. the constrained
=
l
f1= f f 'r-.,
.
1
=(1
v.
r
(fj
-
+
c1),
(24.48)
MLE of fl is
)'(x'x)(:-).
(24.49)
of B and f we can see that they are Looking at the constrained MLE'S direct extensions of the results in the case of m 1 in Chapter 20. Another important speeial case of (35)is the case where al1 the coefficients apart from the constant terms, say j.1 are zero. This can be expressed in the =
,
582 form
Tbe multivariate
linear regression model
(35)with D1
=
(0,lk - 1),
B=
(j.1,B(1)),
and Hj takes the form B(:)
=
(2)
kelated' to yt
Linear restrictions
0.
The second form of restrictions
BFl
+ A1
to be considered is
=0,
(24.50)
where k x q are known matrices with ranktr'l) q. m x q (qGn1) and The restrictions in (50) represent linear between-equations restrictions F1 :
A1:
=
because the ith row of B represents the fth coelicient on a11 equations. lnterpreted in the context of (35)these restrictions are directly related to the yffs.This implies that if we follow the procedure used for the restrictions in (38) we have to be much more careful because the form of the underlying probability model might be affected. Richard (1979) shows how this procedurecan give rise to the restricted MLE'S of B and f1. For expositional purposes we will adopt the Lagrange multiplier procedure. The Lagrangian function is !(B,f), M)
T =
logtdet f1)
--j-
--)
tr f
.j
(Y
-tr(A'(BF1 +A1)q,
-
XB)
,
(v
.xs)
(24.51)
where A is a matrix of Lagrange multipliers. 01
(X Y -X XBID ,
PB =
,
el T f) -#Y 1 po - = Y Pl -(BFj PA =
-
B)
=
1
-
asj
.() ,
-XB)'(Y -xB)=0,
+ A1)=0
(see Appendix 2). From (X'X)(
-
(52)we
(24.53) (24.54)
can deduce that
AF'1f1.
(24.55)
Premultiplying by A1 and solving for A yields A
=
(X'X)(F1
which in view of A
=
-BF1)(F'jQF1)-
1,
(24.56)
(54)becomes
(X'X)(rj
+
A1)(F'1f1F1)- 1.
(24.57)
A priori information
24.3
583
This implies that the constrained
2
=
1 f=- T f; fT=f ,
+-
1 T
(:-
of B and fl are
1F'1f1,
(24.58)
)(x x)(:-)
(24.59)
+ A:)(l7fFj)-
(F1
-
MLE'S
,
,
(see Richard ( 1979/. lf we compare (58)with (48)we can see that the main difference is that ( enters the MLE estimator of B in view of the fact that the restrictions (50)affect the form of the probability model. It is interesting to note that if we premultiply (58)by Fj it yields (54).The above formulae,ts8), (59), will be of considerable value in Chapter 25. (3)
Linear
frela/ed'
restvictions
to both y, and X,
A natural way to proceed is to combine the linear restrictions in the form of DIBF
+
(38)and (50) (24.60)
C= 0,
where D1 : p x k, F1 : ?,n x q, C: p x q, are known ranktrl) q. Using the Lagrangian function
matrices with rank(D1)=
p,
=
I(B,f1,A)
=
T
--j-
logtdet f1)
-trgA'(D1BF1
we can show that the restricted
j tr f - (Y -XB) (Y -XB)
--i'
,
+
(24.61)
C)q,
MLE'S
are
'D' 1 ED1(x'X)- 1D'.1 - 1(D1r'1
-(x'x)-
=
+
c)(r'IfF1)-
'rlf,
(24.62)
1 f= (1+ F (B-E) (x x)(E- ). ,
An alternative
derive
w'ay to
(24.63)
,
.
(62)and (63)is to
consider
(24.64)
D1B* + C= 0
for the transformed specification (24.65)
Y* =XB* + E. B* BF1 and E= UFj. where Y* =YFI, The linear restrictions in (60) in vector form can be written as =
vec(D1BFj or
C) (F': () D1) vectB)
+
(F'1 @)D1)/+
=
=
r,
+ vectc)
=
0
(24.66) (24.67)
The multivariate
584
Iinear
regression
model
vec(C). This suggests that an obvious way to where j+ vec B and r (F'I this is substitute generalise L D1) with a p x km matrix R to to restrictions the in the form formulate =
=
R#v
=
-
(24.68)
r,
where ranktR) p p < /(rn). The restrictions in (68) represent the most general form of linear restrictions in view of the fact that #+ enables us to and reach' each coefficient of B directly and impose within-equation between-equations restrictions separately. In the case where only within-equation linear restrictions are available R is block-diagonal, i.e. =
R
=
R1
0
0
Rc
0
'
and
rc
=
r..
ranktRf)
restrictions
(24.69)
,
R.j
0 R : pi x k, Exclusion
r
r1
pi
=
?-f:
,
pf x 1.
are a special case of within-equation
restrictions
where Rf has a unit sub-matrix, of dimension equal to the number of excluded variables, and zeros everywhere else. Across-equations linear restrictions in the off can be accommodated ?,l, of R with Rfyreferring block-diagonal submatrices Rfy,i,j 1, 2, to the restrictions between equations i and j. of B and f) under Let us consider thederivation of the constrained MLE'S the linear restrictions (68).The most convenient form of the statistical GM F is not for the sample period t 1, 2, zpj
=
=
Y
.
.
.
.
XB + U,
=
5'+=X+#+
(24.70) formulation
vectorised
(24.71)
+u+,
X+
j + (#'1#'2 =
and
,
.
,
as in the case of (35) and (39),but its
where
.
,
,
.
.
.
,
pm' )': lnk
x 1
u+
,
=
=
(1p,L X) : Tm x mk, (u'I ,
u'2 ,
.
.
.
,
u;
)': bn
x 1
Iw): Tm x Fn. f1+ (f1 Lx?l =
The Lagrangian
function is defined using the
Ip., fl+, 2)
T =
--y
-
--jyv
logtdet f1,y)
2.'(Rj+ - r),
vector
2: p x 1 of multipliers'.
-X+#+)'f1+ -
1
(y,,-X+#+) (24.72)
24.4
The Zellner and Malinvaud
t?! -
=
b.
X'+f1+-1(y+
l
T
t?I - 2 (R#+ =
-
r)
X+j+) - R'2
-
j
7.0. = - 'i- o.
- 1
+o.
=
formulations =
585
0,
(24.73)
1 (y.- x . p. )jy. .xs#s),os- (24.,74)
0.
(24.75)
Looking at the above first-order conditions (73)-(75)we can see that they equations which cannot be solved constitute a system of non-linear explicitly unless fl is assumed to be known. In the latter case (73)and (75) imply that
J #.*
k and
F.
=
=
=
(X'+n+- IX * )- IR'ERIX' * ( *- IX * )- 1 R*l - 1(Rj
-
#
*,(24.76)
gR(X'+n+- X+) - R'1 - 1(R#+ - r).
(24.77)
(X'.f1+-lx
(24.78)
1
*
'
)- 1X'* j1-*
ly *
.
lf we compare these formulae with those in the m 1 case (seeChapter 20) we can sec that thc only diffcrencc (when fl is known) is the presence of f1.. This is because in the m > 1 case the restrictions Rj+ r affect the restricting probability model the econometric literature ln by underlying y,. the estimator (78) is known as the generalised Ieast-squares (GLS) estimator. In practice fl is unknown and thus in order to solve' the conditions (73)-475)we need to resort to iterative numerical optimisation (seeHarvey ( 198 1), Quandt ( 1983)- inter alia). The purpose of the next section is to consider two special cases of (68) where the restrictions can be substituted directly into a reformulated statistical GM. These are the cases of exclusion and across-equations linear homogeneous restrietions. ln these two cases the constrained MLE of j. takes a form similar to ( 78). =
=
24.4
The Zellner and
formulations slalinvaud
ln econometric modelling t&5o special cases of the general linear Rj+
=
r
restrictions
(24.79)
useful. These are the exclusion and across-equations linear restrictions. In order to illustrate these let us consider the homogeneous
are particularly
The multivariate
Iinear regression model
two-equation case
(.p1!j-(#11 )?cf #l2
.X1 f
j
#a1 pz5
(i)
Exclusion restrictions..
(ii)
Across-equation
x.:,
#11
(24.80)
c,
.X3t =0,
tl/jyl ,j,
+.
=0,.
pz5
#21 #j 2.
linear homogeneous restrictions:
=
lt turns out that in these two cases the restrictions can be accommodated directly into a reformulation of the statistical GM and no constrained optimisation is necessary. The purpose of this section is to discuss the estimation of #+ under these two forms of restrictions and derive explicit formulae which will prove useful in Chapter 25. Let us consider the exclusion restrictions first. The vectorised form of Y =XB + U,
(24.81)
as defined in the previous sections, takes the explicit form yl
X
0
0
y2
0
X
.' ,
=
#1 #2
0
y
.
0
y. =X+#+
0
u1 u2
+
.
X
(24.82)
.
#.
um
(24.83)
+ uv
in an obvious notation. Exclusion restrictions can be accommodated directly into (82) by allowing the regressor matrix X to be different for different regression equations yf X#f + u, i 1, 2, m, and redefining reformulate accordingly. That is, the #s (82)into =
=
X1
yl y2
0 =
0
m
)'+ X1#:
'.
X2
0
y
0 ,
0 + u1,
Xpl
#*1 2 #* #* m
.
.
.
,
u1 u2
+ '
(24.84)
um
(24.85) where Xf refers to the regressor data matrix for the h equation and p? the corresponding coefficients vector. ln the case of the example in (80)with the =
24.4 restrictions jl
The Zellner and Malinvaud 1
587
j23 0, (84)takes the form
0,
=
formulatios
=
xh)(',))+(--))'
Where
(ry))-()
(24.86)
xs), X2 EEE(x1,x2), jl (j21ja1)?and pz (j12jac)'. The formulation (84)is known as the seeminqly unrelated reqression equations (SURE), a term coined by Zellner (1962), because the m linear regression equations in (84)seem to be unrelated at first sight but this turns out to be false. When different restrictions are placed on different equations the original statistical GM is affected and the various equations become interrelated. In particular the covariance matrix fl enters the estimator of #l. As shown in the previous section, in the case where fl is known the MLE
of
Xj
H(x2,
=
=
#: takes the form
1
#:= (X:'(D-
()) Iw)X:) - 1X:'(f1-
1
(:&Iw)y+.
(24.87)
Otherwise, the MLE is derived using some iterative numerical procedure. For this case Zellner ( 1962) suggested the two-step least-squares estimator
/1=(X1(Z- 1 @Iw)X:)- 'X:'(Z (&Iw)y+, where f1=(1/F)fJ'I'.), f; =Y -X. lt is not very difficult to
(24.88)
see that this estimator can be viewed as an approximation to the MLE defined in the previous section by the first-order conditions (73/-(75) where only two iterations were performed. One to derive ( and then substitute into (87). Zellner went on to show that if
o *- 1
lim X!' w--w
X:
F
*
and non-singular, taking the form
Q+<
=
'.yt
distribution of
the asymptotic
V'y j#-+ jv ) -
*
*
x
'
Q*-
y (() '
1
,
j
'
(24.89) (87)and (88) coincide, (24.90)
lt is interesting to note that in the cases: (a)
X1
=
Xc
=
.
,
'
X,u X:
=
=
and f)= diagl?: &* jv =
=
1
.
ac.
(x+xv ?
.
.&+
.
.
,
yv
mv,)
(24.91)
(see Schmidt (1976)). Another important special case of the linear restrictions in (79)is the case
The multivariate
Iinear
model
regression
of across-equation linear homogeneous restrictions example (80).Such restrictions can be accommodated (82) directly by redefining the regressor matrix as
X)
x'1!
0
0
x' 2?
=
such as #2j #12in into the formulation =
0
.
(24.92)
.
0
0
x'mt
.
(where x, refers to the regressors included in the Rh equation) and the coefficient vector #+,so as to include only the independent coefficients. The form yr X#1
(24.93)
+ uf
=
is said to be the Malinvaud form (see Malinvaud (1970/.For the above example the restriction /y1 2 #21 can be accommodated into (80) by defining X) and #: as =
f
p .zj
Y 1t
0
X 1t
-Y2t
-
X=+
-
) x+?n-
1x*
1
r
t
+
-
(24.94)
.
&*in the case where f is known is
MLE of
F =
p5j
#2c
The constrained -+ #+
#* *
and
=
1
r
jg1 x*'nr
1
(24.95)
,
l'
-
Given that fl is usually unknown, the MLE of p; as defined in the previous section by (73)-(75)can be approximated by the GLS estimator based on the iterative formula F
-
>+
jj x t
*
j')'G
1
F
- 1
jl x t
() i- 1 y!
j g / (; 4 jj 6) r where l refers to the number of iterations which is either chosen a priori or determined by some convergence criterion such as
Ft
l
=:
-
lj) +1
!
-
i
l
x
*
4: /
?
t
-....-
==
,
,
.
.
.
,
.
,
-j))
< ;
for some
l >0.
In the case where /= 2 the estimator F
#-:=
,
.1
t
)()x)'f1=
'x)
- 1
1
wheref1=(1,/r)'l7,
(J =Y -X.
e.g,
;
=0.00
defined by
1.
(96)coincides
(24.97) with
1'
l
jl x)'f1=
1
ly
r.
(24.98)
24.5 Specification testing 24.5
589
testing
Spification
ln the context of the linear regression model the F-type test proved to be by far the most useful test in both specification as well as misspecification analysis', see Chapters 19-22. The question which naturally arises is whether the F-type test can be extended to the multivariate linear regression model. The main purpose of this section is to derive an extended F-type test which serves the same purpose as the F-test in Chapters 19-22. From Section 24.2 we know that for the MLE'S fkand D
b
(i)
1)*,
N(B, fl @ (X'X)-
'w
(24.99)
and T(1
(ii)
(24.100)
I'F;tD, T'- k).
'v
Using these results we can deduce that, in the case where we consider one regression from the system, say the fth,
y the
=
MLE'S
of
bi
=
(24. 10 1)
X/f V uf'
jf and
(X X)
are
ii
X yf
o'bii-,--
F
f
i-
-
y -X#i
(24.102)
Moreover, using the properties of the multivariate normal (seeChapter and Wishart (seeAppendix 1) distributions we can deduce that
-
Nbi, t.t)if(X'X)- : )
#iN
and
oh
T - . t?f 1
'v
a
z (y,.j).
15)
(a4.j()?)
These results ensure that, in the case of linear restrictions related to jf of the form Hv Rzjf rj against Hk : Rsjf + rf where Rf and rf are pi x k and pi x 1 known matrices and ranklR/) ;)i, the F-test based on the test statistic =
=
- ri ) R..j..(X X )- 1 R j.(1- 1 (R i p-i ....j.r ) T - k (Rd#.,.j ;... -.ltjf pi
r
'
'
'rtyy
=
-
-
--
'
is applicable without anj' changes. In particular, individual coefficients based on the test statistic
I
O
'rlyi) ---j,,-s -.-L-jx.. i (x x ) T --
'/-'-..
-
)
'
(a4.j()4)
tests of significance for
(24.105)
,) 1
(a special case of ( 104)) is applicable to the present context without any modifications', see Chapters 19 and 20.
590
linear regression model
The multivariate
Let us now consider the derivation of a test for the null hypothesis: So: DB
C
-
against
=0
Sl
:
DB
-
C# 0
(24.106)
where D and C are p x k and p x m known matrices, particularly important special case of (106)is when D
(0,Ikc): k2 X k, B
=
B1
=
and
B2
C
=
i.e. HvL B2 0 against S1 : B2 #0. The constrained Ho take the form
MLE'S
=
=
(x'X)- 1D'gD(x'x)
-
and f
=fl
+-
1 T
( -)'(X'X)(2
-
ID'II- 1(D
0: kz x
-
ranktD)
=p.
A
,n,
of B and fl under
(24.107)
c)
(24.108)
-),
=(X'X) - IX/Y, f'i =(1/F)U'(J, U =Y -X are the unconstrained of B and fl (see Section 24.3 above). Using the same intuitive argument as in the m 1 case (seeChapter 20) a test for Hv could be based
where
MLE'S
=
on the distance hSD:
-CS(
(24.109)
.
The closer this distance is to zero the more the support for Ho. lf we normalise this distance by defining the matrix quadratic form f1-1(D
-c)'gD(X'X)
-
lD'j - 1(D -C),
the similarity between (110)and the F-test statistic Moreover, in view of the equality (T'fl +(D
fT'f;=
stemming from ( 108),
=417'17 -
-c)'ED(x'x)-
(110) can be
(24.110) (104)is a1l too apparent.
1D'j - 1(D
-c)
(24.111)
written in the form
f.J'I'))(I')'fJ)-1,
(24.112)
where U =Y -X2. This form constitutes a direct extension of the F-test statistic to thegeneral m > 1 case. Continuing the analogy, we can show that
U'U
Gmtf1, r- k),
(24.113) where U'U U'MXU, Mx I -X(X'X) - 1X'. Moreover, in view of (112)the chidistribution of 0/0 0'U is a direct extension of the non-central 'v
=
=
-
square distribution, the non-central Wishart, denoted by (0'0 U'U) l4z;(f1, p; A), F y m + k, -
,v
(24.114)
SpecifKation
24.5 where
A= f1- IIDB
(0/17 where Mo
=
ID'I - IIDB -C)
C)'ED(X'X)-
-
is the non-centrality
testing
(24.115)
This is because
parameter.
0/U)= U'MOU
-
X(X'X)- 1D?(D(X'X)- ID'J- 1D(X'X)- 1X'
(24.116)
5
.
and Mo is a symmetric idempotent matrix (M/ Mo, MoMo Mp) with ranktMp) ranktD) p. Given that Mp and Mx are orthogonal, i.e. =
=
=
=
MoMx
(24.117)
0,
=
we can also deduce that U'MXU and U'MPU are independently distributed (see Chapter 15).The analogy between the F-test statistic, FF(y)
'll
S0
T-k P
'-/
= ,
,.w
F(p, T- k),
(24.118)
in the case k::zr1, and C as defined in (112), is established. The problem, however, is that G is a random m x m matrix, not a random variable as in the case of (118).The obvious way to reduce a matrix to a scalar is to use a matrix real-valued function such as the determinant or the trace of a matrix: zj(Y)=det((fJ'fJ zz(Y)=trE(I7'I7
17,1-.))(f.J,fJ)-1q,
-
-
I-J'I'.))(I')'(J)-
1q.
(24. 119) (24.120)
ln order to construct tests for Ho against H3 using the rejection regions C1
=
J'tY :
z i (Y)>
)
(.'x
,
12
1.=
,
,
(24. 12 1)
we need the distribution of the test statistics r1(Y) and za(Y). These distributions can be derived from thejoint distribution of the eigenvalues of 21, 2c, kt where l mintm. p). because , say, .
.
.
,
=
,
(24. 122) The distribution of 2 > (21 k. ;.!)' was derived by Constantine (1963) and James (1964)in terms of a zonal polynomial expansion. Constantine (1966) went on to derive the distribution of z2(y) in terms of generalised Laguerre polynomials which is rather complicated to be used directly. For this reason several approximations in terms of the non-central chi-square distribution have been suggested in the statistical literature; see Muirhead (1982) for an excellent discussion. Based on such approximations several .
.
.
,
,
592
The multivariate
linear regression model
tables relating to the upper percentage points of (T- kh2(')
z1(y) =
have been constructed asymptotic result,
(24.12.3
(seeDavis ( 1970)). Forlarge
T-k
we can also use tht'
Ho
':ltyl zzlmp), -
7
in order to derive c. in ( 121). The test statistic is known as the Lawlel', Hotelling statistic. Similar results can be derived for the determinental ratio test statistic
z?(y) -
(F- k)'r1(y).
(24.125)
statistics z1(y) and z2(y) can
The test be interpreted as arising from the Wald test procedure discussed in Chapter 16. Using the other two test procedures, the likelihood ratio and Lagrange multiplier procedures, we can construct alternative tests for Hv against Sj. The Iikelihood ratio test procedure gives rise to the test statistic L( Y) L(4., Y)
.I-.RIYI ln terms
dettf'i)
2'F
'
=
=
of the eigenvalues of
1,R(Y)=
l
f
C this test
-
1j.
(24.127)
,
region
is defined by
tY: f,R(Y) Gca).,
=
(24.126)
statistic takes the form
1
and thus its rejection C1
detg(')(U'fJ)
=
1
1-11+ ki =
dettf'i)
(24.128)
cx being determined by the distribution of .LS(Y) under Hv. This distribution can be expressed as the product of p independent beta distributed random variables (see Johnson and Kotz ( 1970), Anderson ( 1984), inter alia). For large T we might also use the asymptotic result Ho
- v. where F*
log LR(y)
-
z2(,np),
(24. 129)
gF-k + 1)j (;)y?.n)', see Davis (1979)for tables of upper percentage points ca. The Lagrange multiplier test statistic based on the function
/(p,2)
-.(?'n
F =
-p
=
-j-
dettfl)
-
j
trtfl - (Y XBIIY XB) ) trtA ,
-
-
-
,
(ss
.c))
(,4. jyt)l
24.5 Specification
testing
can be expressed in the form:
z-M(v)=tr(G),
(24.131)
where
(7 (f;'I7 - fJ'U)(fJ'U)- 1 (24.132) This test statistic is known in the statistical literature as Pillai's trace test statistic because it was suggested by Pillai (1955), but not as a Lagrange multiplier test statistic. In terms of the eigenvalues 21, 2a, 2,,, this statistic takes the form =
.
.
z,Af(v)
l
)
=
.
,
'i
=
i
.
-1V;.i
1
(24.133)
'
The distribution of f>M(Y) was obtained by Pillai and Jayachandran (1970) but this is also rather complicated to be used directly and several approximations have been suggested in the literature; see Pillai ( 1976), ( 1977), for references. For large F- k the critical value c-a for a rejection region identical to ( 12 1) can be based on the asymptotic result Hv
(T'- k)f-M(Y)
x
z2(mp).
test statistic known as I4'ro's ratio test statistic
A similar
other matrix scalar function of (7, the determinant z3(Y)
=
dettG).
(24.134) is defined as the
(24.135)
ln terms of the eigenvalues of (-1 this test statistic is ': 3(Y)
l
;
'wi
=
(24.136)
.
i
1
=
1+ ;.i
It is interesting to note that (7 as defined above is directly related to multivariate goodness of tit measure G as defined in Section 24.2 above. Note that G
-
fv'v
-
f-r'(;)('''Y)
-
1
In order to see the relationship HvL
B2
=
Y
=
0.
.
let us consider the special case where
H 1 : Bc # 0.
and XI
(24.137)
B1 + XcBa + U.
(24.138) (24.139) where :1
residuals the restricted by U =Y -Xj21 IXIY view (X'jX1)C as the multivariate multiple correlation we can Defining
=
The multivariate
594
linear regression model
matrix of the auxiliary multiple regression
U =XIBI
+ XcBc + V.
(24.140)
All the tests based on the test statistics mentioned so far are unbiased but no uniformly most powerful test exists for Hv against Sl; see Giri ( 1977), Hart and Money (1976)for power comparisons. A particularly important special case of Hv : DB C 0 against S1 : DB C # 0 is the case where the sample peliod is divided into two sub71) and T2 (74+ 1, F), where F- rl F2 samples, say, T1 2, and '.TkF2 > k. If we allow the conditional means for the two sub-periods to be different, i.e. for -
=
-
=41,
.
.
.
=
,
.
.
.
=
,
,
+ U1
,
(24.141)
XaBc + U2,
(24.142)
=XIBI r e: T1 : Y1
t G T2:
Y2
=
but the conditional
variances to be the same, i.e.
E (Yf/.?; Xf) f =
(24.143)
(x)1.6,
=
then the hypothesis:so: B1 Bc against Sl : B1 # B2 can be accommodated into the above formulation with =
D
and
=(l,
-Ik),
j, c
B1
B=
B2
('t))-()
x0-)(Bs))+(7)).
=
()
(24. 144)
This is a direct extension of the F-test for structural change considered in Chapter 21. ln the same chapter it was argued that we need to test the equality of the conditional variances before we apply the coefficient
constancy test. A natural way to proceed in order to test the hypothesis Ho f11 f12 against Sl : f11 is to generalise the F-test derived in Chapter 21. That scalar function of the is, use a =
c#f12
'ratio':
V
=f1
2
f- 1
1
(24.145)
>
where fl- i
=
1 Fl
-
-
U'U f, i
i
=
such as
d, (ii)
=
dettflafll-
1),-
dz trtflcf-il- 1). =
1 2. ,
596
The
24.6
Misspilication
multiYariate
linear regression model testing
Misspecification testing in the context of the multivariate linear regression model is of considerable interest in econometric modelling because of its relationship to the simultaneous equations model to be discussed in Chapter 25. As argued above, the latter model is a reparametrisation of the former and the reparametrisation can at best be as well defined (statistically) as the statistical parameters 0n (B, D). ln practice, before any questions related to the theoretical parameters of interest ( can be asked the misspecification testing for 0 must be successfully completed. As far as assumptions E1()-(8(l are concerned the discussion in Chapters 19 to 22 is directly applicable with minor modications. Let us consider the probability and sampling model assumptions in the present case where l'n> l .
Normalitv. The assumption
that D(yf//'X!;0) is multivariate normal can be : tested uslng a multivariate extension of the skewness-kurtosis test. The skewness and kurtosis coefficients for a random vector ur with mean zero and covariance matrix fl are defined by J3.,,
lu
E u'n t -
=
t
)3q
and
x 4,,,,
zJ(u'f1 t
=
respectively. In the case where ur N(0, f1), These coefficients can be estimated by 'v
2 3.m
T
where 0ts
F
) =
j
t
=
t
)2
(24157)
,
.
0 and a4,?,, mtn + 2). =
T
jj jg gt3s
(24. 158)
,
t
'n--
==
a,m
lu
j x
=
l s,
=,
j
t s '
12
=
,
.
,
.
.
,
IL
(24. 159)
Asymptotically', we can show that F
6
H
:23.,n
x
g2(/),
l
=
-t(;n1tpn + ljjj?y+ J)
z
(24. 160)
and F
(4.j.m -n1(?'n + 2) 8n1(,.:1
Hv
+2))2
'v
'
x
Note that in the case where m F
-6
H
(i2a
,x.
a
z2(lj
and
=
T
(24.161)
1:
(.i4 3)2 -
24
z2(1),
Hu 'v
z
g2j1).
(24.162)
Misspification
24.6
testing
Using ( 160) and ( 161) we can define separate tests for the hypothesis f1g1J:
and
az
m
x 4., n 1'1t2): 0
=
0 against
=
pntr?l
S$1 ):
tya
?yj
+0
J.ft/': azj m + mjm + 2j
+ 2)
.
respectively. When Dtyr//xf', 0) is normal then Hl ' ro Hvls is valid. As in the case where rrl 1, the above tests based on the residuals skewness and kurtosis coefficients are rather sensitive to outliers and should be used with caution. normality see Mardia For further discussion of tests for multivariate (1980), Small ( 1980) and Seber ( 1984). inter alia. =
Linearity. A test for the linearity of the conditional the auxiliary regression
mean can be based on
IL (24.163) = (Bo - )'x,+ F'kt + f)f r 1, 2. where 1/:1 EEE(/1/, /pr)' are the higher-order terms related to the Kolmogorov-Gabor or RESET type polynomials (see (2 1.10) and (2 1.11)). The hypothesis to be tested in the present case takes the form t
=
,
.
Sa: F
.
0
=
.
.
.
.
.
-
(24.164)
H3 : F# 0.
against
This hypothesis can be tested using any one of the tests zf(Y), i LR(Y) or 1wM(Y) discussed in Section 24.4.
=
1, 2, 3,
A direct extension of the White test for departures from homoskedasticity is based on the following multivariate linear auxiliary regression:
Homoskedasticitv.
4-r c'0
+ C' # + c,
=
where and
4, Eiis t(/?1,.() a:.
,.
(24.165)
.
a.
n
()-l
1.
=
li
i!
liyr
.
.
.
.
k: j
.
4:,)
=
.
Testing for homoskedasticity Hv C
=0
(
1 2, .
.
.
.
,
n1,
q
=
Aan-ltr?-l
+ 1).
(24.166)
can be based on HL :
(24.167)
C# 0.
which can be tested using A linear set of restrictions Section 24.4 above. The main difference with the m the cross-products of the residuals in addition to the regressors. =
the tests discussed in 1 case is that we have
cross-products
of the
598
The multiuriate
linear regression model
Time invariance and structural change. The discussion of departures from the time invariance of 0 > (#,c2) in the m 1 case was based on the behaviour of the recursive estimators of 0, say 0t, t k + 1, T This discussion can be generalised directly to the multivariate case where 0 EE (B, f1) without any great difficulty. The same applies to the discussion of structural change whose tests have been considered briefly in Section 24.4. =
=
.
.
.
,
Independence. Using the analogy with the m= 1 case we can argue that X;)' is assumed to be a normal, stationary and lthwhen )Z,, t e: -T), Zt > (y'r: order Markov process (seeChapter 8), the statistical GM takes the form l
=B7xr +
yt
i
) =
!
Aiyf -f + 1
i
=
zB;x,-
+
(24.168)
:,,
1
.tc,,
l > !) is a vector innovation process. lf where statistical GM under independence
yf Fxr
(24.169)
we can see that the independence auxiliary multivariate regression (Bo
=
)'x,+ j 1
-
i
ln particular,
(168)with the
+ u,,
=
t
we compare
can be tested using the
assumption
gA/fyf f + B'fxf fl + -
=
(24.170)
Et.
the hypothesis of interest takes the form
S0: Af
=
0
and
B
for all i
0
=
1, 2,
=
.
.
.
.
.
,
,
I
against S1
:
Af # 0
Bf # 0
for any i
=
1,
.
1.
This hypothesis can be tested using the multivariate
F-type tests discussed in Section 24.4. The independence test which corresponds to the autocorrelation approach (see Chapter 22) could be based on the auxiliary multivariate regression Dox, + Crtf 1 + + Clf -1 -1-V?. (24.171) t HvL That is, test C1 C2 C! 0 against H3 : Cf # 0 for any i 1, 2, !. This can also be tested using the tests developed in Section 24.4 for linear restrictions. Testing for departures from the independence assumption is particularly important in econometlic modelling with time-series data. When the assumption is inappropriate a respecification of the multivariate linear regression model gives rise to the multivariate dynamic linear regression model which is very briefly considered in Section 20.8. '
=
=
=
'
'
'
=
'
'
=
=
.
.
.
,
599
24.7 Prediction 24.7 ln
Prediction
view of
that
the assumption
yt B'xt =
f
+ ur,
(24.172)
c: T,
Twere the best predictor of yw-hg,given that the observations t 1, 2, used to estimate B and n, can only be its conditional expectation =
1= 1, 2,
'x,+,,
=
v-vl
.
.
.
.
.
.
,
(24.173)
,
where xw../ represents the observed value of the random vector Xt at T + /. The prediction error is ewmlH
yv-l
(2
w+l=
-
(24.174)
B)'xr+,+uw+I.
-
1=:,
Given that ev-vt is a linear function of normally distributed r.v.'s, 'xw-2)) (24.175) evstx N40, f1(1 + x'w+,(X'X)(see exercise 7), which is a direct generalisation of the prediction error distribution in the case of the linear regression model. Since f) is unknown its unbiased estimator
1
S=
'- k
fi/0
is used to construct the -v+tj'SFH (yr+l where
(24.176)
'(yw+I
=
Ss
=s(1
test statistic
prediction
+ xy..I(X'X)-
-
(24.177)
w+I),
(24.178)
lxw+/).
Hotelling (1931) showed that H*
(F-/(
-rn
+ 1)S 'v
=
(F-k)m
F(m, T-k
-n
+ 1),
(24.179)
and this can be used to test hypotheses about the predictions or construct prediction regions. 24.8
The multi&'ariate dynamic Iinear regression (MDLR) model
In direct analogy to the
0)
ln
=
1 case the MDLR model is specified as follows:
Statistical GM (24.180)
600
The multivariate
linear
model
regression
E1(1 pt Eyt c(Yr0-1), X,0 x,) and uf y, Elyt c(Yt0- j ), x,0 x,O). B/,f1())are the statistical parameters g2(l 0. v (A1,. At, B(),B1, interest. E3q Xf is strongly exogenous with respect to 0*. =
=
.
(4(1
.
=
,
.
.
.
=
-
,
of
The roots of the matrix polynomial /-1
211- ) AfJ -f f=1 I(5j
0
=
lie within the unit circle. Rank(X*) k* where X* H (Y j mk + Imlk 1) + /!,n2,
Y -.,, X, X
=
,
.
.
,
.
-
(11)
Probablity -
1
,
.
.
.
X
,
1),
-
k'b
=
model
otyr/zl'-1
(p
-
,.
p*)
(det n())-'ix 1 = (27$=7
expt
1
B +,x)),o
yy, -
-
-
()
ljyf
-
s+,x)))y,
p*G (i), r (E -I7
.
(24.181)
D(y,/'Z,0- l ; p*) normal; '(y,/c(Y,0- j), x,0 xfo) B*'X) - linear in X); Covty, c(Y,0- j ), x,0 x,0) ntl homoskedastic;
(i) (ii) (iii)
=
=
=
=
-
E'7(I
0* is time invariant.
(111)
Sampling model
g8j
Yr)/ is a non-random sample sequentially drawn from Y > (Y1, Dlyt JZt-. 1 ; p*), t 1, 2, .T;respectively. .
.
.
,
=
B*'
(A?1 A'2 s
EEE
X* EH (rt
5
yt
.
*
*
.
,
.
.
,
A'I B'0 B'1 ,5
!
''
.
.
.
'h
B')I
'h
-,)-
..2, xr tt -I, xf- xr - l The estimation, misspecification and specification testing in the context of this statistical model follows closely that of the multivariate linear regression model considered in Sections 24.2-24.5 above. The modifications to these results needed to apply to the MDLR model are analogous to the ones considered in the context of the m 1 case (see Chapter 23). In particular the approximate MLE'S of 0* -
.1.,
.
.
.
,
-
.
.
.
-
=
*
=
(x''x#)- 1x*'Y
(24.182)
24.8
The MDLR
model
and o- 0
=
I.T *
-
I
,
U*
+
t. '
x
%
&,-
=
x
(24. 183)
B
like 2 and f (see Anderson and Taylor ( 1979)). Moreover, the multivariate F-type tests considered in Section 24.4 are asymptotically justifiable in the context of the MDLR model. For policy analysis and prediction purposes it is helpful to reformulate GM ( 180) in order to express it in the first-order the statistical
behave
asymptotically
autoregression form y'h A1'y)=
(24. 184)
+ B1t'Z) + u),
1
where .'.$. 1
-
()
4a '
.
* 1
0
I.
1)
s
- I tt?
0
.*.2
(24 l 85)
(24. 18 6) This is k ne S.Nn i n t htl
51f) B'l =
*1, B =
')'
ric 1 teratu re as
1-10l-net
tlco
.fnal
./i?rnk,
with
(24. 187)
.
..$.
t he
')
' .
:'
=
known as thc ilptll' ttlld solution in ( 186) pro'$ ides
12 .
.
.
.
.
.
(24 188 ) .
The rt??'??7nlultipliers of delay z respectively. t-ayrfl'/ll-l'l,//'n with so-called also Ionjl-run the t1s ll
-
602
linear regression model
The multiuriate
multiplier matrix defined by X
L=
l
BIAI:T
1.
=B1(I -A1)-
(24.189)
z=0
The elements of this matrix lij refer to the total expected response of thelth endogenous variable to the sustained unit change in the fth exogenous variable, holding the other exogenous variables constant (see Schmidt (1973),(1979), for a discussion of the statistical analysis of these multipliers). Returning to the question of prediction we can see that the natural predictor for yw+1 given by y1, yv and xj xw+ , is .
.
,
.
,
.
.
.
,
+ 2f'Z1+ 1 (24.190) v. l St'yv well ln order to predict yw..z we need to know xw+ 1, xw..c as as yw+1. Assuming that xr+ l and xwo.care available we can use the predictor of yv+ j, Jw.l in order to get yw..c in =
=
v-vz
.
S'yv,v 1 +
2?'Z1+2
* * *j *w = A 1 (A 1 y v + B Z + '
'-'
Hence,
'
'
:
)+
'-'
''h
B *1 Z w*+ 2 '
'-'
+ *1 2 *1 *1 + = (A ) y w + A B Z w+ ( + B 1 Z w+ a t
y
?
rl
(24.191)
t
.
'r -(')'yw+
v..z
j=1
(?')'-')'zy+,,
z
-
1,2,
.
.
.
(24.192)
will provide predictions for future values of yt assuming that the values taken by the regressors are available. For the asymptotic covariance matrix of the prediction error (yw..z see Schmidt (1974). Prediction in the context of the multivariate (dynamic) linear regression model is particularly important in econometric modelling because of its relationship with the simultaneous equation model discussed in the next chapter. The above discussion of prediction carries over to the simultaneous equation model with minor modifications and the concepts of impact, interim and long-run multipliers are useful in simulations and policy analysis. 'wsz)
-
Appendix 24.1 - The Wishart distribution Let (Z,, t c T) be a sequence of n x 1 independent random vectors such that Zf X(0, f1), t G T, then for Tb n, S J=1 Ztz't, S W$(f1,T), where the density function of S is ew
=
D(S; 0)
=
ctdet
S)EtF-N
ldet
-
'w
1'/21
fIIEX/X
expt
-J
tr fl - 1S)
Appendix 24.2
603
where D
Fn
2,
,=
2
t'j
1 /'4 l 1Ilntn- )1 )
i
r(
.
the gamma function
) being
Pvoperties .
.
.
1
=
XV 1
i
-
1
-
,
2
x
(seePress (1972:.
Wishart distribution
of the
Sk are T;), i 1, 2, 14.7:(Q,
If S1,
tj-.
n
,
=
.
.
,
.
independent matrices and Sf
n random
)(
'v
k, then
k
Sf f
=
I4$(n, T),
-
1
where
If S
W'(f1,T) and M is a /( x
'w
MSM'
--
IXIMQM',
h?
matrix of rank k then
r)
fllll. (see Muirhead ( 1982), Press ( 1972) inter W;,tf, enable S T) and S and fl are These results us to deduce that if conformally partitioned as '.w
s
S11 Slc Scl Sa2
=
,
(1
=
then
(a)
n11 n12 f21 nc2
,
where v : + n J, n, S j l : n 1 x n 1 Sc : S lf W';,f(Dff,T), i 1, 2, c n2 x n a S1: and Sj c are independent if fll 2 0. 2f12-c1f121 T- na) and is inde(S11 S1 2Sc-21S21 ) l4$,(f1:1 - f11 of and Sa2. pendent S1a ( : cfl LzS a 2 (f1l l fl afl z-z flc 1 ) (& S z c). (S1 2 'Sc a) =
'v
.
.
.
=
,
,
.
(b) @)
=
'v
-
,
-
is'
-
,
Appendix 24.2 Kronecker products and matrix differentiation The Kronecker product between two arbitrary matrices A m x n and B: p x q is defined b '
A ()l B
=
rz11 B
x: CB
: 2 1B
:x
JmIB
j,,B
za B .
am,,B
604
The multivariate
linear regression model
Let A, B, C and D be arbitrary matrices, then A () (aB) l(A @ B), x being a scalar; (i) (A + B) ()) C A (&C + B () C, A and B being of the same order; (ii) A (&(B+ C) A ?t B + A () C, B and C being of the same order; (iii) A @)(B @ C) (A ()) B) (@ C,. (iv) (V) (A (# B)' A' @)B'; (Vi) (A () B)(C (&D) AC (#)BD 1 1 (A @)B) A - () B- 1 A and B being square non-singular (vii) =
=
=
=
=
=
=
matrices
(viii) (ix) (x)
vec(ACB) (B' ()j)A) vec(C); trtA () B) (tr Alttr B), A and B being square matrices; dettA @)B) (det AlMtdet B)'', A and B being n x n and m x m =
=
=
matrices;
vectA + B) vec(A) + vec(B), A and B being of the same order; tr(AB) (vec(B')')(vec(A)). =
=
Usejl
derlvatives
' logtdet A) 1 .4A - ) -- dA
,.
t?trtAB h
.
'-.
=
.
4y-7 1E3;
( tr(A'B) PB
A
/
'
'
= A;
t7tr(X'AXB) = AXB px tr(.?')
(v)
(.7.4
=
n.h -
t?vec(AXB) . ( vec X
.
Important
,
.s,
1
+ A'XB''
'
'
'
s x.
concepts
Wishart distribution, trace correlation- coefficient of alienation, iterative numerical optimisation, SURE and Malinvaud formulations, estimated GLS estimator, exclusion restrictions, linear homogeneous restrictions, final form, impact, interim and long-run multipliers.
Appendix 24.2
605
Questions Compare
the linear regression
statistical
models.
linear regression
and multivariate
how linearity and homoskedasticity are related to the normality of Z,. Compare the formulations Y XB + U and y,j X+j+ + u+. el-low do you explain the fact that, although yj ,, ymf are correlated, the MLE estimator of B, given by Explain
=
=
-pa,,
.
.
.
,
=(X'X)- IX'Y
does not involve f'?' Derive the distribution of Explain why the assumption F 7: m + k is needed for the existence of the MLE f of f1. Discuss the distribution of f1. Discuss the relationship between the goodness-of-fit measures JI ( 1#'n) tr G and t/c dettf;) where .
5.
=
=
G
=
I - (5'''5')-
1
f,l/f.'
rdettElq (detlEl 1 ) detltaalj- known as Hotelling's alienation i t? co c n t and (1 and discuss the of the MLE'S state the distributions which can be deduced from their distributions. properties Discuss intuitively how the conditions with
t##'
(1 =
.
Jt
.
S.
B. imply that : Give two examples for each of the following forms of restrictions: D 1 B + C 1 0,' (i) ii) BF 0 1 + A: ( Discuss the differences between them. 10. Explain how the linear restrictions formulation R#v r generalises and (ii) in question 9. -.+
=
=
.
=
(i)
Verify the follovs ing equality:
2 ti ( =
12.
What is the reason
formulations'?
1 I- 1 + A 1 )fF'1f1F1 )- F'1(1
=
(rl
-
+ Al )(F'1fF1) -
'
rlf.
for the interest in the Zellner and Malinvaud
606
14.
The multivariate
Explain how the Gl-s-type estimators of #+ for the Zellner and Malinvaud formulations can be derived using some numerical optimisation iterative formula. Discuss the question of constructing a test for Hv DB
15.
linear regression model
C
-
against
=0
ff1 : DB
C # 0.
-
Compare the following test statistics defined in Section 24.5: 'r1(Y), z2(Y), .LR(Y), 1-M(Y), za(Y).
16. Discuss the question of testing for departures from normality and compare it with the same test for the m 1 case. Explain how you would go about testing for linearity, homoskedasticity and independence in the context of the multivariate linear regression model. 18. 'Misspecification testing in the context of the multivariate dynamic linear regression model is related to that of the non-dynamic model in the same way as the dynamic linear regression is related to the linear regression model.' Discuss. 19. Explain the concepts of impact, interim and long-run multipliers and discuss their usefulness. =
Exercises
1. Verify the following: vec(B) # vec(B'); (i) Covt5,/ectY'l vec(Y?)') Iw L f1; (ii) ltyr -B'xt) (iii) tr D - 1(Y-XB)'(Y -XB). (yr B'xf)'D lrelationships Using the L2 L P P and LP=0 show that for L= G1(G/'X'XG1')- IGt/X'X P takes the form =
=
-
=
where
=
,
P= (X'X) - 1D'1 (D 1 (X'X) - tD' 1! - ID D
=
D1 D1'
,
(see Section 24.3). Verify the formulae
(Gj, Gt)=
-
D
1:
- !
(58),(59)and (62),(63.).
Consider the following system of equations
)'z=Ibzxtt +/$'22x2,
+
fszxst+/'/icx4t
m
+
=
2)
l?2t.
Appendix 24.2
607
Discuss the estimation of this system in the following three cases: no a priori restrictions; 0, /42 0,' s,
(i) lii) (iii)
=
=
/.z1 ja2. =
Derive the F-type test for Hv1 DB - C 0 against HL : DB -C + 0. tln testing for departures from the assumption of independence we can use either of the following auxiliary equations; =
l
l
,= xiyt--i
B'fxt-f +.,,
+
i
1
=
i
1
=
l
l
?r-(B()
-())'xr
i
because X'
=
B)x,-f +v,,
Afyl-f +
+ =
1
f
=
1
0 and thus both cases should give the same answer.'
Discuss.
Construct a 1
-
x prediction
region for yy.pz.
Additional references Anderson (1984),. Kendall and Stuart ( 1968); Mardia et al. ( 1979)) Morrison (197$,. Srivastava and Khatri (1979).
The simultaneous
25.1
equations model
Introduction
The simultaneous equations model was first proposed by Haavelmo ( 1943) as a way to provide a statistical framework in the context of which a of simultaneous theoretical model which comprises a system interdependent equations can be analysed. His suggestion was to provide the basis of a research programme undertaken by the Cowles Foundation during the latc 1940s and carly 50s. Their results, published in two monographs (Koopmans ( 1950), Hood and Koopmans ( 1953)). dominated the econometric research agenda for the next three decades. ln order to motivate the simultaneous equations formulation 1et us consider the following theoretical model:
where l??p, f,, ;)t. )',, )t refer to (the logg of) the theoretical variables, money, interest rate, price level, income and government budget deficit, respectively. For expositional purposes 1et us assume that there exist observed data series which correspond one-to-one to these theoretical variables. That is, ( 1)-(2) is also an estimable model (see Chapter 1). The question which naturally arises is to what extent the estimable model ( 1)-(2) can be statistically analysed in the context of the multivariate linear regression model discussed in Chapter 24. A moment's reflection suggests that the presence of the so-called endoqenous variables it and mt on the RHS of ( 1) and (2), respectively, raises new problems. The alternative 608
25.1
Introduction
609
formulation'.
mt /71+ /21J?, + llt.,'t + /'.lrkr, it /31z + #22p,+ pszyt+ /4.26/,, =
(25.3)
1
(25.4)
=
can be analysed in the context of the multivariate linear regression model because the bijs,i 1,2, i 1,2,3,4, j 1,2, are directly related to the statistical parameters B of the multivariate linear regression model (see Chapter 24). This suggests that if we could find a way to relate the two and (3h4) we could interpret the theoretical parametrisations in (1V42) of the jfys. ln what follows it is parameters xijs as a reparametrisation argued that this is possible as long as the afys can be uniquely defined in terms of the ijs. The way this is achieved is by (3)V4)first into the formulation: =
=
=
'reparametrising'
(25.6) and then derive the uijs by imposing restrictions on the aijs such as Jyl 0, tzsc 0. ln view of this it is important to emphasise at the outset that the simultaneous equations formulation should be interpreted as a theoretical parametrisation of particular interest in econometric modelling because it lmodels' the co-determination of behaviour, and not as a statistical model. ln Section 25.2 the relationship between the multivariate linear regression and the simultaneous equation formulation is explicitly derived in an attempt to introduce the problem of reparametrisation and overparametrisation. The latter problem raises the issue of identification which is considered in Section 25.3. The specification of the simultaneous equation model as an extension of the multivariate linear regression model where the statistical parameters of interest do not coincide with the theoretical (structural) parameters of interest is discussed in Section 25.4. The estimation of the theoretical parameters of interest by the method of maximum likelihood is considered in Section 25.5. Section 25.6 considers two least-squares estimators in an attempt to enhance our understanding of the problem of simultaneity and its implications. These estimators are related to the instrumental variables method in Section 25.7. In Section 25.8 we consider misspecification testing at three different but interrelated levels. Section 25.9 discusses the issues of spccification testing and model selection. In Section 25.10 the problem of prediction is briefly discussed. lt is important to note at the outset that even though the dynamic =
=
610
The simultaneous equations model
simultaneous equations model is not explicitly considered the results which follow can be extended to the more general case in the same way as in the context of the multivariatc linear regression model (seeChapter 24). In particular, if we interpret Xf as including al1 the predetermined variables, i.(). jx
::.t
,
yr
t
,
,
-
1,
.
.
.
,
yt
-
,
,
l , X/', Xt - 1 ,
.
.
.
,
Xt
.
-
t)
, ,
the following results on estimation, misspecification and specification testing as well as prediction go through asymptotically (seeHendry and Richard ( 1983/.
25.2
The multiuriate linear regression and simultaneous equations models
z'rhe multivariate linear regression model discussed in Chapter 24 was based 'on the statistical GM: yt B'xt +u:,
(25.7)
=
with 0 (B,f1), B E2-2lEcl fl Ej j E1 ct2-cltcl the statistical parameters of interest. As argued in Section 25.1, for certain estimable models in econometric modelling the theoretical parameters of interest do not coincide with 0 and we need to reparametrise (7)so as to accommodate such models. In order to motivate the reparametrisation needed to accommodate estimable models such as ( 1-(2) let us separate y1f from the other endogenous variables yjt) and decompose B and fl conformably in an obvious notation: H
=
,
((.Vytt)
y''x,x,) -
=
-
,
((s#) (&'.) ,)
x
-
X)x'
jl
f1)2a))
.
(25.8)
A natural way to gd the endogenous variables ykl' into the systematic component,purporting to explain yjt, is to condition on the c-field generatedby yjll, say c(yj11), i.e. for Ajl) Jt1,
e(c(yj1)), H
f(.y1t
Xf xf) =
l-:''y)1' + A'lxr,
..'/')'')
=
(25.9) (25.10)
whererf #j - B(1)n1(z).:1. The systematic component by defined (10)can be used to construct the statistical GM =
nc-cllajand A1
+ 1 y1 l ro'yjl) =
'
1xt
=
+ c:,,
(25.11)
25.2 The MLR where
and simultaneous
equatiols
models
'(.p:,/,.*-11)).
;1,
A'1,
=
-
-.:/11')1=0,
E'E.E'IEIII
E(;:,)-
(i)
p11 .EEF(7.t1,;1,.'.Fj1))!
'(#1,;1,)
(iii)
=
=
-
,
11
)1 2
nc-al?cl
.
=0,
(see Chapter 7 for the properties of the conditional expectations needed to prove (i)-(iii)). Looking at (11) we can see that such a statistical GM can easily accommodate any estimable equation of the form: r1l al l + =
zit + al
.a1
3p!
+ al
(25.12)
4).r
separately. The thing to note about structural parameters (1, lJ11) where l= t,I 1
=
(11) is that its
parameters
we call
(r0,A, 1 )' 1
(25. 13)
2j l t
(25.14)
,
E(
)
,
are simple functions of B and f1. That is. they constitute an alternative parametrisation of 0, in Dtyt/Xf; 0), based on the decomposition D(y,/Xt; 04 D(.J'1r''yl1', Xl; 41) o(y;'',/X,; 4(1)(). -
-
(25.15)
the normality assumptipn ensures that y)1) is indeed weakly with respect to the parameterstxj, rj j).The statistical GM (11)is exogenous of the linear and stochastic linear regression models. These a hybrid comments suggest that the so-callezd simultaneity problem has nothing to do with the presence of stochastic variables among the regressors. The decomposition ( 15) holds for any one endogenous variable yf, Moreover,
Dt't Xr; 04
=
givingrise to
D(-J',.'yji',
a statistical
-
x,;,l(fj).
(25.16)
GM:
A'x i t +cfl. t + l''ft = Fgdytfl I
Xt; ,lf) Dlyjil
(25.17)
The simultaneous equations
model
The problem, however, is that no decomposition of D(yf/Xf; #) exists whicl-. m in (17).That is, the system can sustain all m equations i 1, 2, =
F'y + A'x t + '
'1.
'.)
.1-EEEE((1- *1'.z
,
.
,
EEE ar (l1,,c2f,
.
.
.
.
.
.
.
,
0,
t:t =
t
.
.1-
,
'
...:'.
z:'i /1
,'.'.$ 'z
(
',',',',',',',',','! ,
,
.
,
.
.
,
t'.
'.) ,
smtj'
,
(I-f is essentially F;9 with - 1 added as its fth element), is not a well-defined of (7).Fol' statistical GM because it constitutes an over-reparametrisation equation, reparametrisation first, the the say one EEE <1 (r01A 1 ,
,
d?1 1
2) 441 ) = (B(1 ) f1.,2.
)
,
is well defined but 41 x tl,
x
.
.
.
X
x ttm = f
=
tlj
1
is not. A particular case where the cartesian product X7- 1 tli is a proper reparametrisation of 0 is when there exists a natural ordering of the yfts such 1, i.e. that yjt depends only on the pas up to .j
-
Eyjt
'.../-.
rU') =
E( v-jfc()'f! i ,
1 2,
=
,j -
,
.
.
.
1), Xt
=
x,),
j
=
1, 2,
.
.
.
,
rn.
(25.19) ln this case the distribution D(yr,'''X;;0j can be decomposed
in the form
p1
.l)(yt,''EXt,0)
lL/q-it/''.vkt.),'2,-
'-'
f
=
.
1
.
.
-
)'i -
1f,
X,,-<).
(25.20)
This decomposition gives rise to a lower triangular matrix F and the system ( 18) is then a well-defined statistical GM known as a recursive system (see below). given in (18)is defined in In the non-recursive case the parametrisation unknown ?? H (F, A, V) 1) mk of 1) + + parameters terms n mm -hJ(n: '(:fE;), and EEE well-defined statistical only mk 1) V + there are where +Jn1(r? ?! constitutes shortfall of parameters in 0,'a mm - 1) parameters. Given that of 0 there can only be mk +.ntn1 + 1) well-defined a reparametrisation parameters in tl. ln order to see the relationship between 0 and ?? let us premultiply ( 18) by (F') - 1 (assumedto be non-singular): =
-
1
(r')-
Yf+ (r') - A'x f +
11) l
=
0
'
lf we compare (21) with (7) we deduce that 0 and ?Jare related via BF + A
=
0,
V
=
F'flr.
(25.22)
25.2
The MLR and simultaneous
equations models
The first thing to note about this system of equations is that they are not a priori restrictions of the form considered in Chapter 24 where 1-and A are assumed known. The system of equations in (22) the parameters 4 in terms of 0. As it stands, however, the system allows an infinity of solutions for 4 given p. The parameters 0 and ?; will be referred to as statistical and structural parameters, respectively. The system of equations (22) enables us to determine only a subset 41 of <(4H (4j:42)) for any given set of well-defined statistical parameters 0. That is, (22) can be solved for mk +-tanltm + 1) structural parameters <, in the form tdefines'
(25.23)
41 G(P, 42). =
Hence, we need additional Note that
tlz
information to determine qz elsewhere.
is a m(m - 1) x 1 vector of strtlctural parameters.
Without any
additional information the structural parameters ?/ are said to be nt?r identlhed. The problem of identifying the structural parameters of interest using additional information will be considered formally in the next section, ln the meantime it is sufficient to note that. if we supplement the system (22) with mm - 1) additional independent restrictions, then a unique solution (implicit or explicit), (= G*(p),
(25.24)
exists for the structural parameters of interest t L(<). There are several things worth summarising in the above discussion. Firstly, given that the structural formulation ( 18) is a reparametrisation of the statistical GM for the multivariate linear regression model, the structural parameters of interest ( (when identified) are no more well defined than the statistical parameters 0. This suggests that before any question about ( can be asked we need to ensure that 0 is well defined (no misspecification test has shown any departures from the assumptions linear regression model). Secondly, ( 18) does underlying the multivariate well-defined constitute statistical GM unless F is a triangular matlix; not a recursive. of equations is This is because although each equation the system well separately is defined. the system as a whole is not because of overparametrisation. In the case of a recursive system F is lower triangular and V is diagonal with =
detttl) dettt! 1) being the /th leading diagonal matrix. vL, =
EI
(25.25)
.
This implies that
X7'-1 l;f
614
The simultaneous equations model
constitutes a proper reparametrisation of 0 with mk +1.zmlm + 1) structural 1t)' parameters in (F, A, V). Using the notation yj9-1r 58 (y1,,ycr, we yo recursive form the system in the can express .
j''it= ygzyg 1, +J'xf ,
, -
t
i
+ st,
We can estimate the structural A.tj
Y-fjy 1Yf() X/YO
l'
m=
ti
i
1
i- 1
and
=
1, 2,
.
.
parameters
v0, 1Av i-
-
1
X'X
.
,
-
rn,
Jf, l7ff) by (y,9,
tti =
v0, *
i - 1Yi X'y i
(25.27)
)
-Y,9ii T (yf =-
a 1).,9
.
-XJf)
g
(yf-
Yf(; 1 yr -
It can be verified that these are indeed the
25.3
,
.
.
Identification
.j
-
xgf j
MLE'S
,
of
tti.
using Iinear homogeneous restrictions
As argued in the previous section, the reparametrisation GM, yt B'xf + ul, in the form of the structural F'yf + A'xt + tp 0,
of the statistical
(25.29)
=
system,
=
(25.30)
is not well defined because there are only mk +.l.n#?'; + 1) statistical parameters 0 defining (29) and mtrn - 1) + mk + .i'?'?1(?Al+ 1) structural parameters < defining (30).The two sets of parameters 0 and 1) are related via BF+ A 0, =
V= F'f1F.
(25.31)
No unique solution for ?J exists, corresponding to any set of well-defined statistical parameters 0, unless the system of equations (31)is supplemented with some additional a priori restrictions. In order to simplify the discussion of the identification problem let us assume that the system V= F'DF determines V for given F and fl and concentrate on the detennination of F and A. The system of equations BF+ A HA
=0
written in the form
=0,
where H
H
(B, 1)
and
.-(r,)
(25.32)
25.3 ldentification
615
is linear in A for given B. ln Kronecker 24.2), (32)can be expressed as (1m() H) vectA)
0
=
I'1v
product
notation
(seeAppendix
0,
=
(25.33)
a'.m)' are the columns of A turned into a where @'1 x'.2, The 1) 1 system of equations (33)cannot be solved mm + k - x vector. ranktrl+) mk < rn4n1+ k 1) for m > 1. For a unique uniquely for x because need supplement solution we the system with a priori restrictions such as to >
,
.
.
.
,
=
-
the Iinear homogeneous restrictions *
=
0,
(25.34)
ensuring that ranktm) > mm
-
1). If this is the case then the system
('))--e
has a unique solution
for
(25.35)
.
Dejlnition 1 The structural
are said
parameters H*
rank *
=
nytny +
=
rankt*A*l
k
rtpbe identified #' and only
1).
-
(f
(25.36)
Using the result that rank (see Schmidt
I1+
*
(197$),we
+mk
can state condition
rank(*A*)
=
m(m
-
1).
(25.37)
(36) in the form
(25.38)
Note that A* denotes the restricted structural coefficient parameters. Hence, the structural formulation (30)is said to be identified if and only if we can supplement it with at least nllnl 1) additional restlictions. More general restrictions as wcll as covariance restrictions are beyond the scope of the present book (see Hsiao (1983)inter alia). The identification problem in econometrics is usually tackled not in terms of the system (30) as a whole but equation by equation using a particular form of linear homogeneous restlictions, the so-called exclusion (or zero-one) restrictions. ln order to motivate the problem 1et us consider the two equation estimable model introduced in Section 25.1 above. The -
616
The simultaneous equations model
unrestricted
form(30)(comparewith (5)V6))of this
structural
model is
:.,21it + J1 1 + Jc1 yr + Ja1 l,t + J-,,:gt + clf,
mt =
(25.39)
(25.40) As can be seen, the two equations are indistinguishable given that they differ arises because only by a normalisation condition. The overparametrisation statistical underlying GM effect the (39) and (40) is in + #21 + #31pt + l.4l(/r+ 1,/1,, 1 (25.41) /.1 mt .rf
=
(25.42) and the two parametrisations 1 /.12 /71 /21 #22 #3 1 /' 3 2 /4.1 I4.,
A) are related via
B and (r,
J1 a
''
l 1
-
1
(621
)'12
21
+
1
-
2c
J
:3.1
J.l
,3
'0
0
0
0
0
0
0
0
=
,c
J4c
,
,
(25.43)
,
which cannot be solved for F and A uniquely, given that there are only eight pijs and ten ijs and (bijs. A natural way to the identification problem is to impose exclusion These restrictions restrictions on (39) and (40) such as 4l J22 enable us to distinguish between the money and interest rate equations. Equivalently, the restrictions enable us to get a unique solution for (712, ),a1, (5 al) given (py, i 1. 4- j 1, 2). l ., 1 2, J1 1 2l lThe exclusion restrictions on the fth structural equation can be expressed in the form esolve'
=0,
.3,
,
=
,
*.j
=0.
.
.
.
,
=
0,
=
(25.44)
matrix of zeros and ones and xi is the th column of A. where *f is a above example the the selector matrices are ln tselector'
mj
=40,
0, 0,0, 0, 1)
ma
(0,0, 0, 1, 0, 0).
=
The identification condition (36)for each equation separately,known rank condition, takes the form
rank
H*
=
(hi
ranktmfA*)
=
m+ k m
-
-
1,
1
i
'
i
=
=
1, 2,
1, 2,
.
.
.
.
,
.
.
???.
,
(25.45) as the
m
(25.47)
In the special case of exclusion restlictions the order ondition can be made even easier to apply. Let us consider the first equation of the system
25.3 BF
1
ldentification
+ A.j
=
(25.48)
0
restrictions', omit (n1 + (/4 k1) exelusion and impose (m endogenous and (/ - k:) exogenous variables from the first equation. ln this case we do not need to define *1 explicitly and consider (46)because we can substitute the restrictions directly into (48).Re-arranging the variables so as to have the excluded ones last, the resbricted structural parameters Ft and -n11)
1111)
-
A1' are
where )!1 : (rak - 1) x 1
,
#ll /21
B1z Ba2
where
jl
1:
kl
jc
1:
(k -
0
J1
=
)x
.
0-
0,
B1 c : kl x
x 1,
1
,
.
0
J1
o 0
(25.52)
=0,
B2zrl
1
'y1
Bzsz
+ B1 c71 +
- / 11 -/21 +
.- 1
Bla'j
J 1 : k1 x 1
(m1-
B 2 a : (k - l
1), B1 : /(1x (h'?-lhl'lj ), l
) x (tn 1
-
1) ,
B z :! : (l - k 1 ) x (m
-
m 1) .
Determining J1 from (51)presents no problems if yl is uniquely determined in (52).For this to be the case we need the condition that ranklBaa)
=
(25.53)
rnl - 1.
ln view of the result that the ranktBaz) mintk - pti nlj - 1) we can deduce that a necessal'q' t't??7t/frl't?n for identification under exclusion restrictions is that =
(/k'- l 1 )k: ??) 1 - 1.
,
(25.54)
That is, the numbcr of excluded exogenous variables is greater or equal to the number of incltlded cndogenous variables minus one. This condition in (I)f takes the form terms of the selector luatrix
ranklm:
,b
.=
??? ..
(2f
.55)
Tlle simultaneous
equations model
This is known as the order condition for identification which is necessary but not sufficient. This can be easily seen in the example (39),(40),above 14.=0. The selector matrices when the exclusion restrictions are J4j for the two equations are *1 (0, 0, 0, 0, 0, 1), *2 (0,0, 0, 0, 0, 1). Clearly 1 and thus the order condition is satisfied but the ranktml) ranktmc) rank condition (47)fails because =0,
=
=
=
=
rank(*1A*)
=
rankto,
0)
(25.56)
=0,
This is because when gt is excluded from both equations the restriction is 'phoney'. Such a situation arises when: and a1l equations satisfy the same restriction; (i) restrictions a11 of the fth equation other equation satisfies (ii) the some (see example below). related to the identification of Let us introduce some nomenclature particular equations in the system (30).
Dehni tion 2
-4 particular to be:
equation
q/' the structural
c/?-p?
(J(?), saq' the ith,
-$
saitl
under identified if rank(*jA*)
< r??-
1,'
(25.57)
exactly identified ('/' rankt*fA*l
iii)
over identified
=
=
m
1,.
-
(25.58)
(f'
ranktmfA*) The system
ranklmf)
(.)) is said to
??
=
m - 1,
identi.fled
ranktmf)
t'Jevery
>
m
equation
-
1. .$
(25.59)
identilied.
Exatnple
the following structural
Consider
form with the exclusion
restrictions
imposed: (25.60) 712.F1/ -
(25.61)
-1.'2,
-
.F3f+
l 3.X1,
+
(123.X2,
=
:3,,
(25.62)
25.4 Specilkation
619 0 0
0
0
0
0
0
1
*:! (0, 1, 0, =
,
*a
0 0 0 0
=
0 0
0, 0)
-
1),
2a
*aA*
=
(0, 0,
1).
-
3. 2 but ranktmj) The first equation is overidentified since ranktmjA*) underidentified ranktmcA*) second equation 1 because is The even though ranklmz) 2 (the order condition holds). This is because the first equation satisfied all the restrictions of the second equation rendering these The third equation is underdentified condtons as well, beeause rankt*aA*l l < 2. lt is important to note that when certain equations (af the structural form (30) are overidentified then not only the structural parameters of interest are uniquely defined by (31) but the statistical parameters 0 are themselves restricted. That is, overdentifying restrctions imply that 0 G (6))where (.)1is a subset of the parameter space (i).An important implication of this is that the identifying restrictions cannot be tested but the overidentifying ones can (see Section 25.9). The above discussion of the identification problem depends crucially on the assumption that the statistical parameters of interest 0n (B, f) are well defined the assumptions g1q-g8()underlying the multivariate linear regression model are valid. However, in the case where some assumption is changes we need to reconsider the invalid and the parametrisation identification of the system. For example, in the case where the independenee assumption is invalid and the statistical GM takes the form =
=
=
=
'phoney'.
=
B'x i
j -
i
+ ut
!,
(25.63)
the dentification problem has to be related to the statistical parameters At, Bo. B:, Bf, f1ol, not 0. Hence, the identification of the 0* H (A:, of is system not just a matter a priori information only. .
25.4
.
.
,
.
.
.
,
Spification
The simultaneous
equation formulation as a statistcal
model is viewed as a
620
Tlle simultaneous
equations model
reparametrisation of the multivariate linear regression model where the theoretical (structural) parameters of interests ( do not coincide with the statistical parameters of interest 0. In particular the statistical model ih specified as follows:
0)
Statistical GM y, Bk, =
g1j
(25.64)
+ u,,
and up )', - Elt't ,'''X/ xr) are the systematic and ,'X, non-systematic components, respekrtively. 0 EEE(B, f1) are the statistical parameters of interest where (i) B Y 2-21Ya j f' E j j E j zrc-2l E z : 0 g (.) > j2MlV x C ; (= L(r, A, V) are the theoretical parameters of interest where r H1(p), A Hc(p)- V H a(p), X, is assumed to be weakly exogenous with respect to 0 (and t). The theoretical parameters of interest I=H(0), IVE, are identitied. ranklx) x?.)': T x k for 1' > k. k where X >(x1 x2, 'tyr
pt
=
xJ)
=
=
=
=
=
,
-
,
=
g3j g4j
g5(I (11)
=
=
,
Probability (p
=
m
=
.
,
.
,
model
Dlyt ,,/XfP) ,'
=
(det f1)(27:)
eXP )
2--'--.rj-
-
lg
yt - B
,
Xr)
,
D
-
.1
) ()'f- B XJy
0 (E R'n
,
,
x
c
m5
t( T
.
(25.65) D( yf/'Xt ; 0) is no rmal (i) (ii) J)y;,/'Xr xf) B'x, - linear in x,; (iii) Covty,/'''xf xr) f'l homoskedastic > (B. f1) is time invariant. 0 ',
=
=
=
(111)
Sampling model
g8j
y
=
-
(free of x,);
y1)' is an independent sample sequentially drawn 1 2 -J( respecti vely. The above specification suggests most clearly that before we can proceed to consider the theoretical parameters of interest t we need to ensure that the That is, the statistical parameters of interest 0 t//'t, well Jp/v/. testing misspecification discussed in Chapter 24 precedes any discussion of either identification or statistical inference related to ). Testing departures from multivariate normality, linearity, homoskedasticity, time invariance EBt,'j
,
y2.
.
.
.
,
frona D(yr,/X?;#), l
=
,
,
.
.
.
.
25.5
likelihood estimation
Maximum
and independence became of paramount modelling in the context of simultaneous 25.5
importance
in econometric
equations,
likelihood estimation
Maximum
ln view of the simultaneous equations model specification considered in the previous section and the discussion of the identification problem in Section 25.3 we can deduce that in the case where the theoretical (structural) parameters of interest ( zjust-identilled the mapping H( ): R'''k x C,, E, and onto. This implies that the where ( H(p), is one-to-one '(() and an obvious estimator of ( is invertible, H is 0= reparametrisation estimator indirect maximum (lM LE). likelihood the -+
.
=
D/in i rffpl3 I n r/lt.zcase J4,/? eve ( is r-itlvlkrftl,ll i r.s (indirect) Iikelihood estirnator (Iz%' /.$ /?)' ILE ) (Ie./i?7tp(/
n'Iaxirnurn
.jlk.$
(-
=
jjj
jj
0- (
.
fl.
=
-
(r
(x'x)- lx'Y
=
v -x. (25.67)
tbe invariance pvopert t?/' i%ILE's. A more explicit formula for the IMLE is given in the case of one equation, say the first, when just-identificationis achieved by linear homogeneous restrictions. It is defined as the solution of Tbis
lLsbased
t?s/rf'nTf/rtal-
fl
=
1
1
ln the simpler
0.
f1= (
+
J1
('?n
lk).
case of exclusion
jj la): ; - ll1 +
'k'
restrictions
the system
(66) is
()
(2,j.69)
= 0,
(25.70)
=
,
(25.71) J1
=
j:
: -
B : a B c-jj B c :
.
matrix. given that B2a is a sqtlare (nl - 1) x (mj 1) non-singular The IMLE can be N iewed as an unconstrained MLE of the theoretical for testing any parameters of interest which might provide the -
'benchmark'
622
The simtlltaneous
equations model
As argued above, the identifying restrictionh overidentifying restrictions. testable from 0 to ( bu: because they provide the reparametrisation are not restrictions restrictionh overidentifying because imply the they are testable statistical of interest. parameters for 0, the of ( 1et us utilise thc In order to simplify the derivation of general MLE'S MLE'S of constrained derived of 0 under a priof: formulae in the context restrictions in linear Chapter 24. The constrained MLE'S of B and f2, in the context of the multivariate linear regression model, subject to the linear a priori restrictions HA >BF + A
=0,
where (F, A) are known, take the following form: :--(r'+A)(r'f1r)-1r'f,
f=
f+-
1 (:
-*)
,
F
,
(x x)(
-)
(25.75)
(see Chapter 24). In the context of the simultaneous equations formulation these formulae can be reinterpreted as determining the theoretica! parameters of interest (= G(F,A,V) given and f. That is, determine ( via
:
-(r(j)
+ A(j))(F(j)'f(r'()
=
-
lrtf
and
1 A(
v(t)-F(t)'fF(-.,jk
.
Z ZA( ,
(25.77)
(see exercise 2) where A-(F)
a and
and
z-tv,x)
Richard (1983)). (see Hendry Substituting (76)and (77)into the log of multivariate likelihood function the linear regression model log Lttl; Y) likelihood function log f-((., Y) defined by we get the Cconcentrated'
1og f,tt; Y)
=
const T
-
F
Y
logldet fla
log(det((F'Y'MxYF')
- 1(A'Z'ZA)j).
(25.78)
This function can be viewed as the objective function for estimating ( using 12/ (, f) as opposed to the direct estimation of (by solving (76)and (77)for 1.The log likelihood function (78)has the advantage that it provides us with
Iikelihood estimation
Maximum
25.5
(76)and (77).ln the case of general a natural objective function to rather prohibitive and thus we consider is restrictions ( H(#) the l-estrictions. exclusion the case where the restrictions are the exclusion restrictions by achieved When the identifieation of ( is This A(() matrices and in structural parameter are Iinear (. constrained r4() implies that the first-order conditions can be derived explicitly. The conditions tsolve'
'solution'
=
i
12 ,
,
.
.
.
,
p
.
(2 5 80) .
Using the relation
X'Z'ZX
t'h'.
=
T --....
(80) takes the form t?1-
tr tf-'ff-l-itf--,.,--(f''P)t?(i
?A v )-t
Z'Z
1,
T
(25.82)
=0.
f
of ( and The system of equations (82)could be used to derive the MLE'S t-i, derived asymptotically equivalent be An form f. and can hencef', 2
using
fl
for f in view of the fact that
1 F
-
-
r/n
PF
ti
(7
-
- A'z/z
JA --. C'1
(f
-
-
i
P
0 and
--+
t7A A'Z/XH ..0
-
-
-
=
f)
(i
H
,
Using these, (82) ean be expressed in the asymptotically tr
tL'ff'
1 )- .i'Z'Xf1
JA
Ptf
=0,
i
=
1, 2,
.
.
.
,
p.
-
(B : I),
EE
equivalent
(25.83) form
(25.84)
For the details of the derivation see Hendry ( 1976), Hendry and Richard system (84)is particularly intuitive because it can ( 1983). The reformulated iding interpreted an estimator of ( in terms of the sufficient as pro: be
statistics and H(, 1-,,,= and f
f of the fonn fl,
(25.85)
are sufficient statistics for the MLR statistical model because, as
The simultaneous
624
equations model
argued in Chapter 24, they are functions
of the minimal
z(Y) N (Y/Y, Y'X).
sufficient statistics
(25.86)
the orthogonality between the systematic and non-systematic components, preserved with their estimated counterparts by the MLE'S (discussed in Chapters 19 and 23) holds asymptotically for (84).ln order to see this consider the systematic and non-systematic components related to the system of equations (73)defined by Moreover,
EZ
and
t
'X t
tt, where cf
=
=
It can be shown 1
-T
xr) -
I'l'xr
=
I)'
x?
=
lk
,
A'Z,, Zf EEE(y'f X;)'. ,
(25.87) (25.88)
(seeHendry ( 1976)) that P
A'z'xl
--.
0,
(25.89)
a condition which suggests that the MLE can be interpreted as an IV estimator (see Section 25.7). Unfortunately, explicit solutions for ( of the form given in (85) are not possible because (84) is non-linear in ( and it has to be solved by some numerical optimisation procedure (see Hendry ( 1976:. As emphasised by Hendry the numerical optimisation rule for deriving the MLE'S (-of ( from (84)must be distinguished from the alternative approximate solutions of the system (25.90) where V= B=
1
A'Z/ZA, -(r+
H=(B: A)'(r'fr)
I), -
'
rv!1.
(25.91) (25.92)
When V and H are given, (90)is linear in A and an explicit solution can be derived. On the other hand, when A is given, V and 11can be derived easily. This suggests that (90)-492)can be used to generate estimators of A for different estimators of V and n. lndeed, Hendry ( 1976) showed that most of the conventional estimators in the econometric literature such as threestage Ieast-squares (3SLS), two-stage least-squares (2SLS), can be easily Hendry referred to (90)as the estimator generating generated by (90)V92).
j'
25.5
Maximum
likelihood estimation
625
equation (.EGE). The usefulness of the EGE is that it unifies and summarises a huge literature in econometrics by showing that: (a)
Every solution is an approximation to the maximum likelihood estimator values' selected for V and obtained by variations in the rl and in the through iterating. of taken the cycle including not steps number (90)--492), Al1 known econometric estimators for linear dynamic systems can be obtained in this way, which provides a systematic approach to estimation theory in an area where a very large number of methods have been 'initial
proposed.
methods immediately into distinct groups of Equation (90) classies varying asymptotic efficiency as follows: if (V, H) (i) as efcient asymptotically as the maximum likelihood estimated consistently', are consistent for A if any convergent estimator is used for (5', H); asymptotically as eticient as can be obtained for a single equation out of a system if V I but H is consistently estimated wx
=
(see Hendry and Richard ( 1983)). Dellnition 4
ln tbe case wlztlr? F is m x m 'lnf von-sinqular tbe MLE of (, called te full information maximum likelihood (FIML) estimator, is the solution of the system (#4) for (-. lt is interesting to note that in this case determining j' takes the simple form
E
=
-
A(r(j)
-
the system of equations
'
(76)
(25.93)
,
whenthe theoretical parameters of interest ( are just identified then the : and f1= f and thus statisticalparameters are not constrained, i.e. indirect maximum estimator reduces likelihood to the (93) =
2
*
*
-
-
A(t)F(()
-
'
(25.94)
.
The score function for the general case is delined as
qit)
(7lo L
=
c
t(
with log L as defined in (78).The information
1/()
=
E!q(()q(t)'1
matrix can be defined by
(25.96)
(see Chapter 13), where E'( ) is with respect to the underlying probability model distribution D4y, Xf,' 0). Using the asymptotic information matrix .
626
The simultaneous
equations model
defined by
V
.,''j j'-
tlslxj- x((),I x
1
y
Q
(() - ).
(25.98)
A more explicit form of 1.(() will be given in the next section in the case exist, for the 3SLS (three-stage leastwhere only exclusion restrictions squares) estimator. One important feature of the above derivation of (84)and (90)-(92)is that it does not depend on F being ln x rn and non-singular. These results hold true with r being m x q ((/%?n). In this case the structural formulation is said to be incomplete (seeRichard ( 1984:. For f- m x r?11 (:n3< ?zl), (84) gives rise to a direct sub-system generalisation of the llmited information 1) (see Richard ( 1979:. maximum likelihood (LIML) estimator (with -1 'solving'
=
25.6
Least-squares
estimation
As argued in the previous section most of the conventional estimators of the structural parameters ( can be profitably viewed as particular approximations of the FIML estimator. lnstead of deriving such estimators (EGE) discussed above (see using the estimator generating equation Hendry (1976)for a comprehensive discussion) we will consider the derivation of two of the most widely used (and discussed) least-squares estimators, the two-stage (ZSLS) and three-stage least-squares (3SLS) estimators, in order to illustrate some of the issues raised in the previous sections. ln particular, the role of weak exogeneity and a priori restrictions in the reparametrisation.
(1)
Two-stage least-squares
(2N15)
ZSLS is by far the most widely used method of estimation for the structural parameters of interest in a particular equation. For expositional purposes 1et us consider the first structural equation of the system
25.6
Least-squares estimation
GM based on the following decomposition 1 ',
''yrt
o(y,,/Xl; 04 D ).,.r =
X,,' 1/1 )
of the probability
otyjl',,-'x
'
t
;
model:
(25.10 1)
141))
with systematic component .::/-:1
p''j, =
))
.E().,
and non-systematic
;1, =
(25.102)
component
J)).,//,F)1
.J'! -
,#7'J'' (c(y)1'),Xr
'),
=
=
(25.103)
xr).
This suggests that, because of the normality of Dt)'t Xf; 0), ?!j and 4(1) are variationfree and thus yjl' is wctpkly exogenous with respect to ?;: (see Chapter 20). If no restrictions are placed on -(F) 1
A1
its )1 1-1! is
*1 (Z'(1)Z(1))- 'Z'(1)y1.
(25.104)
=
1'
X). This estimator has the same properties in the estimator of j* in the stochastic linear regression model given that ( 100) is a hybrid of the linear and stochastic linear regression models. Moreover, the variance L?11 can be estimated by
where Z(1) EEE(Yt
5
il
'1
=
-
Z(1):1
(25.105)
.
The optimality of the estimators (104)and ( 105) arises because of the orthogonality between the systematic and non-systematic components .l.t/.t'l,E'('lll c(y;''))l .E7(p'l,c1?)- E.t7E/,.t'lfc1,,7c(y)'')1l -
-0
(25.106)
0. Note that the expectation operator in .E'(/t'ly1f) iS since .!;Rjt/'c(y)1')(l 'XC,' relati: defined p), the distribution underlying the probability e to Dty, model. The above scenario. however, changes drastically by the imposition of a priori restrictions on 41 ln order to see this 1etus consider the simple case of Let us rearrange the vectors yt and xf so as to get the exclusion and included endogenous kl (y1t)and exogenous (x1f)variables tirst, i,e. hrl1 EE (x'1,, x'(1)f)'.ln EEn y'1,, y('1),)'. xr terrns of the decolnposition in (101) 3' (.3.71,, this rearrangement suggests that =
.
't?yrl'l'c-rt?p7s.
t
1)(y!/(X:; 04 1)( J'1 r y 1 l y(1)f (Xf; 41.) ,
.
.D(y1 t
''y(
,
X, ; <*1 )
.l)lyl
-
1 )1,
1 )1
xf; 44*1)).
(25. 107)
628
The simultaneous
equations model
ln terms of this decomposition the statistical GM ( 100) becomes .J'1f 7-1yl t + 7/(1,y(
1 )f
),1 : (n j
l'l j. j : (r?'s.- r?7j
=
where
1) x 1
.-
,
+J'l x l t +J'( 1 )x( .-
1 )1 +
(25. 108)
clf,
1) x 1 .
and
(25.109) (25. 110)
Let us consider
system
the implications of these restrictions
for the rest of the
via
Bl-/ + A1t
=0.
Given that the restricted
coefficient vectors At'
(#1 #c1 B22 B21
r 1 + J 1 #1 1 =
Bc2r1
! -
g'j h.t, 'j
.j.
0
.g,0.-j
(25. 113)
) q07
(25. 114)
,
(25. 115)
=#21 .
These equations
#1
,
y,
Bca7
(25.112)
(J'1 0)',
=
.-
B1 a-h
B1 2
1
take the form
relate 41 and <41) via
/1 l #21
=
and
B(j)
B21
B31
B22
B3c
=
.
In the case where the equation ( 110) isjust identljled k k : rl1 - 1) Bca is (,'n1 1) x (,,n1 1) square matrix with rank rn1 1. Hence, (114) and ( 115) can be solved uniquely for )!1 and J1 : =
-
-
r1
=
-
B 2-cl #,
,
J1
,
=
p1 ,
B c 1 B z-, pa 1
-
.
Looking at (116) we can see that in this case 4j and tlj ) are still variation free. Moreover, no structural parameter estimator is needed because these can be estimated via the indirect MLE's..
71
=
: 22- #21 j 1 #11 j
-
-
,
=
: 22- j #21 -
-
21
.
(25.1 17)
25.6
629
estimation
Least-squares
1 the first equation is overiden tihed, the restrictions on B( j ) and thus the variation impose in 114). 115)( equations ( h-ee condition between ?)1 and q(: j is invalidabetl. ln an attempt to enhance of this condition let us return to the decomposition in our understanding structural and the 107) parameters involved. define (
(L'- /..'1 ) >
when
However,
n1
-
,
f.51 3
12 t.?a f122 f23 2
l'l
=
f' 2 2 c)2 1 + j'j
32
c?g 1
) (1 j y
.
=u
j'j
D22
as
well as
ln the
=
t/ya 1 .j. j'j
.
f2 ! 3 3
3
ft) 3 1
.
(25. 120)
f132
rr,)
23
f12 3
(25. 121)
t),
(25. 122)
( 114), ( 115).
just identified case
nz-alc)z j
=
B2-21 pz1
(25.123)
and yl as defined by ( 121) and ( 115) coincide. On the other hand, in the overidentified case ( 12 1) and ( 115) define ),l differently with ( 115) free condition, invalidating the N'ariation is in terms of the An alternativ'e u'ay to view the above argument non-systematic lf components. we define plj in ( 110) by systematie and :1::
then
lt
=
J( ).: : c( y 1 ? ). X 1 t
=
x j f ),
(25.124) (25. 125)
(25.126)
630
The simultaneous equations model
where
cl',
1'1t
=
-
ftl'lf/'cty:
t
), X 1
,
xl
=
,)
(see Spanos (1985$). This suggests that the natural way to proceed in order to estimate ()'1,Jj t'11) is to find a way to impose the restrietions (114)-( 115) and then construct an estimator which preserves the orthogonality between pl and t?st. The LIML estimator briefly considered in the previous section is indeed such an estimator tseeTheil ( l97 1/. Another estimator based on the same argument is the so-called two-stage least-squares (ZSLS) estimator. Let us consider its derivation in some detail. The orthogonality in ( 126) suggests that ,
.r1t =
7'1y1r+J'1x1f + zt
(25.127)
is equivalent to
vlf #'1l x1, + #'21xrl), =
subject to ( 114) and
y1,-B'1zxl,
(1 15). In
+
(25.128)
lg,,
order to see this let us substitute
+Brz.zx(1)fA-ulr
(25.129)
into ( 127):
I!
3.'1,r'1(B', cxlf -
+ B'22x(1),+ ul
f)
+J'1 xj,
+ t?st
+ = (/1B?12+J'l)x1, +J'1B)cx(1), + l'-l'u, l/jr - y/lulf, (130)becomes
ltf.
'y'jul,
Given
lt,
=
(25.130) (25.131)
=
-
3.'1:(7'1B'12 +J?1)x1t+ =
/1B'2cx(1 ), +
(25.132)
l,flt.
The LIML estimator can be viewed as a constrained MLE of 1 and #21in #1 (128) subject to the restrictions (114)-41 15) as shown in (132)(see Theil (1971:. Similarly, the 2SLS estimator of )/> (/, J')' can be interpreted as achieving the same effect by a two-stage procedure. The method is based on F: a re-arranged formulation of (130)for t 1, 2, =
yl =(X1B12 +X(1)Bc2)71
-FXIJI
+
.
ll
.
.
,
(25.133)
,
in an attempt to impose (114) and (115) in stage one by estimating (XIBI 2 + X(1)B2c) separately using ( 129). Once the restrictions are imposed the next step is to construct an estimator of xlt which preserves the orthogonality between the systematic and non-systematic component of (133).More
explicitly:
Stage one: using the formulation provided by
XB.2 (X1B12 + X(1)B22) =
(133)estimate (25.134)
25.6
estimation
Least-squares
631
in the context of ( 129) or Y1
=
XB
2
.
+ U1
using
.
c
l
(X'X) - X'Y
=
1
(25.135)
.
(25.136)
(130) can be transformed XIJI yl i' l r1+ or *1 *1
into
y1
Z 1x
=
+ u
(25.137)
+ u?
=
u
,
*1 =
*1
:
+ U
'
Z1
1,
(25. 138)
(Y l ; X ! )
=
.
Estimate a*1 by ,a.
.a.
1 x*
=
m.
,'<
) (Z:ZI)
f1 J1 zsu.s
.
j
(:.!5 139)
j
zlyl
i''
'
.
X'
''
x,1$, x,'
=
t
!
1
1
x
-
1
..
1
-'
y'
'
x,' ..
1
1 '
3.1
ln the context of the model ( 1/-(2) estimating the parameters of ( 1) by 2SLS amounts to estimating the statistical form in equation (4):
it /21+
+ ?2l, Izsh + /724.::
zzpt +
=
by OLS and substitute the fitted values the structural form of (1) to get: lf
=
+
( 1:
J 21 t
+
(x 3 j
pt +
J41
,
into j21 + jazp; + qzyt + fz.bgt
=
),: + ul
.
Applying OLS to this equation yields the ZSLS estimator of aj 1, a21 a31 and ayl. Given that ( 133)preserves the orthogonality between the systematic and non-systematic component by redefining the latter, the 2SLS method uses the equivalence between ( 127) and (130) and via an operational form of the ,
P
Consistency (*1: x1:) stems from the
latter estimates tz:l' consistently. asymptotic orthogonality ;(2'1u/ )
--.+
as T-+
0
-.
(25.14 1)
-..c .
For a given Tl however. f (2,':ul) #0 and thus &1'is a biased estimator, i.e. Ell'l
z,.tz
(25.142)
tz1:.
The ZSLS estimator can be compared ,j.
1
$-1uluu
vv
:= X1A 1 1
,
-
1
directly
p r 1 () 1 v x j ,
,
,k
,
XIXI
- l
the LIML estimator
with
(v j
-
(.T J)y j ),y j ,
Xlyl
,
(2 5 14 3) .
632
The simultaneous
which, in
view of
V'IX j
equations model
the fact that
i''1i' I
Y'jX j
=
Y'1Y 1
=
it differs from the 2SLS estimator in so far as 0/
Ory
=
where 0 T1
1 mx
.-. 1
=
f*is not
,
(25.144)
given the value one but
,
71
(25.145)
'
-0
,v
*
fJ'l 0 1
0 S)
$-711 yj 1 y 1.r1 J* --o 0/ u v 0 f1
-
171
*
Y 01
(yj v j )
=
,
M1
,
I -XI(X')X1) --1X'1
=
(see Schmidt ( 1976:. lf we substitute k for J* in ( 143) where k is a scalar (stochastic or non-stochastic) ( 143) defines the so-called k-class estimator of all which includes both 2SLS and LIML estimators. For the 2SLS estimator, k 1. Looking at ( 140) we can see that the ZSLS estimator exists if the rank of =
Y'1 Y 1
Y'1 X 1
Y1 1
x 1x 1
x
'
'
is ???j + kj - 1. Given that rank(X'jX1) /(1 by assumption (5j, Section 25.3, this rank condition is satisfied if rankti/l'j) nl 1. The latter condition holds when ranktBzc) > r??l 1, i.e.the order condition for the first equation is satisfied. It must be noted that in the case where this equation is exactly identified, i.e. =
=
-
-
ranktml)
ranktmjA*)
=
=
m
-
(25. 146)
1,
the 2SLS estimator is equivalent to solving
AI'+ rt
where
=0,
(25.147) - 1
= (X'X) - IX/Y,
r?
=
(25.148)
zl 0
The solution of (147)for a/ is the indirect MLE. ln order to discuss the properties of the ZSLS estimator we need to derive its distribution. From (140)we can deduce that
fzst.s=(Y'1(Px ZSLS
=
Px.)Y1) --1Y'1(Px
-
(X'1X1)- 1X'1 (yj -Y,
-
Px, )y1
fcsus),
,
(25.149) (25. 150)
where Px X(X'X) - 1X' P Xj(X/1X1)- 1X'1. These results show that the distribution oftssus can be derived directly from that of fzsus. Concentrating =
!'
.Yj
=
25.6
we can express it in the
on the latter estimator
form
1
7zsus g.?2- z 17 2 1 yr::z
where
estimation
Least-squares
B''
W01 v
(25 15 1) .
-
/
W'12
11
(),
=Y1 (Px - p y,)
w 21 w 22
vo vo j
,
1
=
'
3l yr 1
(25. 152)
(see Mariano ( 1982)). ln view of the fact that (y0jt 'Xr
x)
=
N(B0j 'xf
x,
,
Doj)
(25.153)
a
YO/QYO 1
1 where Q where B01 EEr (#'j B',c)', we can see that the distribution of Px - Pxj is an idempotent matrix must be a matrix extension to the noncentral chi-square (see Appendix 24.1). That is. .
wo1
x
,
11J
??1j
(n01 w.M0)1 5
>
=
5
where I4Jmjt ) stands for the non-central Wishart distribution with F degrees of freedom, scale matrix f1ol and non-centrality (or means-sigma) matrix '
M0
1
=
no -
1
1
)Y(')'Q)Y0). i
1
The 2SLS estimator of J1 as given in ( 150) can be viewed as a regressor functionof Wf. The l'-class and LIML estimators of yj can be expressed
similarlyas
f(k)=(W22
+V22)- '(1V21 -i-'S21).
(25.156)
-l*Sc1).
(25.157)
fuIs1u=(5V22 -f*S22)-'(M'21
where Sy
=
1 (Y -
T
?-
XBO:
)'(Y0j xB:)) -
SE
S1 1 S21
S1 2 S2 1
(25, 158)
(see Mariano ( 1982)). These results suggest that the same methods for the derivation of the distribution of fcsps can be used to derive those of ftkjand fujul-. ln Chapter 6. several ways to derive the distribution of Borel functions of random vectors were discussed and the general impression was that such deri: ations can be very difcult exercises in multivariate calculus. In the case of fl abo: k?. the derivations are further complicated by the fact that the distribtltion ftlnction of the non-central Wishart is a highly complicated function based on infinite series (see Muirhead (1982)). The complexity of the distribution increases rapidly with the ranktMoj). This is basically the reason why the econometric literature on the distribution of these estimators concentrated mainly on the case where 1Al1 1 (see Basmann (1974) for an excellent survey). ln the general case of n1 > 1 (see Phillips ( 1980)) the distribution of # is so complicated so as to be almost =
The simultaneous
model
equations
non-operational. This encouraged the development of asymptotic results and related expansions such as the Edgeworth expansions (seeChapter 10). Before we consider these results it is important to summarise some of tht! most interesting aspects of the finite sample research related to the above estimators; in particular to the existence of moments of the various estimators discussed above. (i) fzsl.shas moments up to order (/2 ?,:11 + 1), i.e. the 2SLS estimator has moments up to the degree of overidentification. Thus, for a just-identifiedequation where lz ?'?I - 1 no moments exist; see Kinal ( 1980). f(k)has moments up to order 7--4/1 + r?1 1) if 0 k < 1 and k is non-stochastic; see Kinal ( 1980). (iii) fl-luuhas no moments (see Sargan ( 1970)). See Mariano ( 1982) for an excellent survey of these results. It must be emphasised that the non-existence of moments should not be interpreted as making the relevant estimators inferior to ones whose moments exist. lt implies, however, that comparisons of estimators based on criteria such as mean square error cannot be made since they presuppose the existence of some moments. Moreover, in Monte Carlo studies (seeHendry (1984))the non-existence of moments provides useful information. The asymptotic distribution of the ZSLS estimator takes the form -
=
-
/
F( :*1 x *1)zs I-s
X(0
x.
-
Y
1
*1
?-? 1 D
,
(25.159)
)
,
where D 11
B/ 2 Qx B .
=
B' 2 Q1
2
v
The asymptotic
,
T
x'x 1
lim -,
Q1 j
--
x
=
Q11
X'X
=
,
.X
.
X'1 X 1 --.T
lim w..- y
,
E(ss1 f 1;'y1 t ).
of *1 can be estimated
covariance
* * Cov(41)= f11
=
I7* 11 a
,
T
r
(25. 160)
.
,
1im Qx F-+
Q1
.
Q, B.c
ir'1 i' 1
i''1 X 1
X1Y1
X1X:
,
-
by
1
(25. 161)
,
where ,1
=
,
1
-
v 1f - x 1 1
1
(see Schmidt ( 1976)). lt is important to note that D1 1 is ranktBac) n'?1 1, l.e. the first equation ls. identi.fled. =
-
non-singular
if
25.6
Least-squares
Under the condition
estimation
that
P
v'' F(k - 1)
--+
0.
(25.162)
the k-class estimators are asymptotically equiv alent to the ZSLS. ln particular the LIML and ZSLS are asymptotically eqtli: alent. A brief ntroduction in to Edgeworth expansions was considered Chapter 10. Sargan ( 1976) considered the question of applicability of such expansions to econometric estimators. He proposed an Edgeworth expansion for the difference (
#
-
=
t?/P,
W),
(25.163)
where p is a normally distributed random vector. and w a random vector, independent of p. The conditions placed on Lawlp. B ) are general enough to be applicable to many of the conventional estimators mentioned above; see also Phillips ( 1977). These conditions are related to the smoothness and invertibility of trwt ) and the boundedness of the moments of x F w of all orders. ln the case of the ZSLS it can be shown that .
fl
=
lllp, /.1, B.2).
(25.164)
where P 55E
X'u 1 X'U 1 T
,
F
and the Sargan conditionj expansion of order Op(T-I) (f l -T1)= F () + F -
,
W
0
=
apply
(25.165)
(seePhillips ( 1980a)).
The asymptotic
takes the general form + F-
1
+ Op(F-),
where the components of Fo, F-.L, F O P (T-) O P (T - 1) respectively.
j
(25.166)
include the terms of order Op(1),
5
(2)
Three-stage least squares
(3S+5)
The 3SLS estimator is a system GLS estimator based on the SURE formulation considered in Chapter 24. As argued there, this formulation exclusion restrictions directly into the enables us to accommodate formulation itself. The results related to the estimation of the MLR model when a priori information is available (see Section 24.4) suggest that in estimating (110) separated from the system there must be some loss of efciency. ln the case of exclusion restrictions the reformulation of the MLR model into the SURE form enabled us to incorporate such a priori information. lntuition
636
The simultaneous equations model
suggests that the same formulation should lead us to a more efficient system estimator than the ZSLS estimator. Expressing all m equations of (99) in the same form as equation one n ( 110) the SURE formulation in the present context gives rise to the systenl y1
Z1
ya
()
0
0
za
=
x1
c*1
12
P ':2
.
'
+
0
ym
where
0
Zm
7f
('' f : xf)
Ai
f
,
Jf
.
12
.
l
,
x FX
,
-
.
.
.
s
(25.167)
5
'
.
.
:*
.P1
'
n''l ,
or, more compactly, +):
'+ Z++ =
is an obvious the system
notation.
y+ =
+
(25.168) The ZSLS estimator amounts
a,y + c:.
(25.169)
where 2+ is Zv with if instead of Yj, i.e. estimator for the system as a whole is ti* zsus
to applying OLS to
i
(i'f'.X). That is, the ZSLS
EEE
(2'+2+)- 1 k'* y,y
=
(25.170)
.
In view of the fact that (c:,/X)
'v
N(0, V () 1w),
intuition suggests that the generalised-least-squares (GLS) estimator used in the context of the MLR model should be more appropriate. This estimator takes the form (see Section 24,4): ti* ysus
=
(2.'+($''- l () Iw)2'v) - 12' * ($' '' 1 (>()Iw)y+,
(25.172)
where Vis estimated from the ZSLS residuals applied to all the equations of the system via l''t'ij
1
''?A 2) I --' 2'-'A
T
?
,
'
l,
' ..1
12 ,
,
,
.
.
,
n
(25. 173)
.
For obvious
reasons this estimator of a,j is called the three-stage leastestimator, first suygested by Zellner and Theil ( 1962). (3SLS) squares l It is important to note that for(Z'+(V - (@)Iw)2 * ) in ( 189) to be invertible k,v must be of full column rank which requires that each 2f (ij. Xf), i 1, =
=
Instrumental
2,
.
.
.
,
variables
comprising rn, i.e. a11 equations when the system is identified
the system must be identihed.
Moreover.
1 (;' (# l -T * -
N
y
(
(g)
Iy );
-.a
v3st.s
+
P *
)
....+
(;j.
D
1 ) N ((j D - ) x
j.74) ,
(25.175)
,
a
(see Schmidt ( 1976)). Sargan (1964/) showed that when only exclusion restrictions are present and the system is identified, 3SLS and FIML are asymptotically equivalent, i.e. D 1.((),. see Section 25.3. the result related to the ln relation to the finite sample distribution of Vysus moments of the ZSLS applies to this ease as well. That is, moments exist up 1he nonto the order of overidentification (see Sargan ( 1978)). Moreover. the L1M L estimator extends to the FIML existence of moments estima-tor (see Sargan ( 1970)). As argued above. howeNrer. existence or nonexistence of moments should not be considered as a criterion for choosing between estimators in general. =
.for
25.7
Instrumental
variables
variables (lV) was initially proposed by The method of instrumental Reiersol ( 1941) in the context of the errors-in-variables model (seeJudge et the aI. ( 1982)) as an alternative estimation method which inconsistency problem associated with the OLS estimator. This was then systematised and extended by Durbin ( 1954) to a more general linear statistical model where the explanatory variables are correlated with the error term. He also conjectured the direct relationship between the IV estimator and the LIML estimator in the context of the simultaneous equations model. The current approach to the IV method was formalised by Sargan ( 19583. who considered the asymptotic efficiency as well as the consistency of IV estimators. Let us summarise the current approach in order to motivate the formalisation proposed in the sequel. Consider the statistical formulation Ksolved'
Ktextbook'
pf I'X =
where X,: p that
(25.176)
+ Tr.
x 1 vector
'(X;c,) # 0.
of
fpossibly stochastic) explanatory
variables
such
(25.177)
The simultaneous
equations model
If we estimate x using the usual orthogonal
projection
(OLS) estimator
:=(X'X)- 1X'y we can see that this estimator is in general biased and inconsistent since i x + (X'X) - 1X/c and S)(X'X)- 1X'E) # 0 =
and the bias does The
IV method
introducing
not go
a new vector
'(Z; G)
=
zero as T'-+ :f)
to
'solves'
.
the bias (and inconsistency) Z : m x 1, t 1, 2,
of instruments
0
((Z/ c,/F)
Eaa
((Z'Z,'F)
=
,$
.
.
.
,
problem b) F such that:
0)
-->
P
Cov(Z,)
=
Eaa)
-+
P
(iii)
Covtzf ,Xf) =X32
((Z'X,/F)
Eaa)
-+
where Eaa > 0. ln the case where the number number of unknown parameters in a, i.e. m form: (9IV =(Z'X)- 1Z'y. =p,
of instruments equals the the IV estimator takes the
Assuming that Ea2 is bounded and non-singular
we can deduce that
P '(Iv)
and
x
=
91:. -+
if in addition to
Moreover,
y'
(iv)
(Z v#x/ T)
X(0,
'w
'
x
(i)-(iii)we
also assume that
azXaa)
then /'F(4Iv N/
a)
-
'.w
J
- 1 E3aEaa - 1 N(0, c 2 N.zs )
ln the case where m wp (we have more instruments (1958) proposed the so-called generalised instrumental
than variable
fs)
Sargan
estimator
(GIVE)
&)j=(X'PaX) - IX'P y
(25.179)
where Pz Z(Z'Z) - 1Z'. The GIVE estimator was derived by Sa-gan as an instruments' are chosen as linear extension of mv where the functions of Z, i.e. Z* ZH, H being a m x J?matrix and H is chosen so as to minimise the asymptotic covariance matrix =
'optimal
=
Covt:lv) 2
=
c2(HYca) - 1(H'ZaaH)(N'aaH')- 1
Instrumental
25.7
variables
639
1
(Z*'X) - Z*'y 1 1 Given that for /(H) (HEaa) - (H'EasH)(E'asH') ' /(H) /(AH) for any matrix minimising we can choose H by m x m non-singular Of
:lt
=
,
=
(H'Ea aH)
=
subject to HE2a
1.
=
The optimal H takes the form: H
E 5-1 E a a(N aa E -l E aa )- l
=
Using its sample Z*
=
the
equivalent 1
P :: XIX'P X):5
.
toptimal'
set of instruments
is
and thus
,'
jjrgjjf jjj.f ....'
. ..
lj,
''
(# x. N(0, c (EazEa........a Ea a) ... ) The GIVE estimator as derived above is not just consistent but also asymptotically efficient. Because of this it should come as no surprise to learn that a number of well-known estimators in econometrics including OLS, GLS, ZSLS, 3SLS. LIML and FIML can be viewed as GIVE estimators (see Bowden and Turkington ( 1984), furt/l' alia). Norp that in the case where m /?, g)i). I& as defined in (178). The main problem in implementing the IV method in practice is the choice of the instruments Z,. The above argument begins with the presupposition that such a set of instruments exists. As for the conditions (i)-(iv), they are of rather limited value in practice because the basic ln order to orthogonality condition (i), as it stands, is non-operational. make it operational we need to specify explicitly the distribution in terms of which E ) in ( 176) and (i)-(iv) are defined. lf we look at the argument leading to the GIVE estimator ( 179) we can see that the whole argument around an (implicit)distribution which esomehow' involves y't, Xf and Z,. ln order to see this let us assume that the underlying distribution is Dt-pr,Xp; #) (assumed to be normal for simplicity). lf we interpret the systematic and non-systematic components of ( 176) as in Chapters 17 and 20, i.e. .
Y
F()
.
=
=
'
Krevolves'
and
c-al x E o'c : =
'(Xf';r) 0. =
(E a c
by
=
t?t =
Covtxf )
,
yf - E ()'r,6'Xf))
o'c 1
=
Co v(X,,
(25.180)
-pf)),
Construction
and 9,:u:(X'X) ' 'X' fully efficient.
.
is th0 MLE of x and it is unbiased, consistent
and
640
The simultaneous
equations model
This suggests immediately that D ).,, X?; ) is not the distribution underlying the above IV argument and thus ( 180) is not the (implicit) interpretation of a'X, and st. Another question which raises the issue of the underlying distribution is related to the possible conflict between condition (i) and
(v)
Cov(Z,,.pr)
=
/3 1
# 0.
A glance at the IV estimators ( 178) and ( 179) confirm that the latter condition is needed in order to ensure the existence of these estimators, in addition to (i)-(iii).This, however, requires Zf to be correlated with pt but do we uncorrelated with a random variable cf. directly related to yr. resolve this apparent paradox?' ln addition to the above raised questions the discerning reader might be estimator of x in wondering how the GIVE estimator ( 179) can be a 176) when apparently the has latter ( parameter to do' with Zf. As emphasised throughout Chapters 19-24 there is nothing coincidental estimator. For about the form of the parameter to be estimated and its 1X'y (X'X)estimator of is a example the fact that & x Ea-alokj is no that the natural given sample analogues of and J21 are X'X E22 accident, respectively. analogy in of and X'y Using the same i),. we can see the case estimator of must be: that the parameter this is a Sl-low
'good'
'nothing
'best'
'good'
=
=
'good'
* (E2 aE 5-1 Ea c )- 1 Ea aEa'-a1 as j =
(25. 18 1)
.
How is this parameter implicitly assumed to be the parameter of interest in ( 176)? The above raised questions indicate that the issues of the (implicit) condition ( 177) and distribution underlying ( 176), the non-orthogonality the (implicit)parametrisation are inextricably bound up. The fact that (181) involves the moments of a11random variables (.k,f, that the Xf, Zf) Suggests (implicit) distribution is related to their joint distribution. Moreover, the apparent conllict between conditions (i) and (v) can only be resolved if Z, together they suggest refer to conditioning variables. Putting these that the underlying distribution should be: 'clues'
D( J..,, Xr/'Zr;0j.
(25. 182)
Let us consider how this choice of the implicit distribution could help us resolve the above raised issues. Firstly, assuming that the parametrisation of interest in ( 176) is I*, as defined in ( 18l), we can see how the non-orthogonality (177) arises. lf we ignore ( 182) and instead treat f'( ) in ( 177) as defined in terms of D()'t, Xt; #) .
then
'(X;k)
=
'(X;(yf- x'Xt))
=
(c21-
E22t:* )# 0
25.7
Ilstrumental
variables
unless * Ea'-21o'c1 This leaves us with having to explain how the parametrisation of x* in ( 181) arises. Using ( 182) as the underlying distribution we can deduce that =
.
E(X;y,./c(Z,))
=
E( f'gX,'cl c(Zt)1 c(X,))
=
E'IXCfR,c(X,)1/'c(Z,))
*'Xf)/'c(Zf)1/'c(X/)) = F)X;JJg(yf Z/Ya-.tEa aa*l c(Xt) ) f Ygi as 1 t f = E$X'LZ -
-
l
= E 23 E-3 3
f(Xf'G 'c(Zl))
=
65 1
-E
2.3
t- 3 3IE 32 x*
0
in ( 177) for as defined in (181). ln other words, the non-orthogonality expectation of the x* in 1 82) but in defined terms terms (. was arose because Xl., D( p,, of #). The above discussion suggests that the instruments Zf are random variables jointly distributed with the pt and X, but treated (implicitly)as specification ( 176), ln defining the statistical conditioning variables. explicitly. include That natural statistical is, the Z, however, we did not *
GM
:
yl a'()Xt+ SZ,
(25.184)
+ ut
=
Ec- l(Jl 2 J1 3ELI Ea2)' and 7c, EL1o.5 1 - Egs-alEtpza()where E2 :! (Eac - E2aE:!:5Esa) is not the parametrisation f?/' interest. lnstead Zf is marginalised out from the systematic component F(y,/c(Xf, Zr)). In the present case this amounts to imposing the restriction yo 0 or equivalently
where
atl
=
=
-
=
=
(25.185)
X2370 0 =
since m y p. Given the relationship (185) implies that ' aEa-sI al atl (Ea aE s- Es a )- Ec =
between ya and atl in ( 184), however, o's I
=
x*
(25.186)
and ( 184) reduces to ( l 76# with x as defined in ( 18 1). This relationship between ( 184) and ( 176) enhances our understanding of the IV argument conditioning variables considerably. lt shows that IV are indeed whose separate effect is of no interest in a particular formulation. lf we were would be the usual orthogonal to estimate alg and y() in ( 184) the MLE'S projection estimator: tomitted'
=
ll
1
(X'M,X) - X'5'1Z y
(25.187)
642
The simultaneous
equations model
and
ftl (Z'MxZ) - IX'M y =
X
=
(Z'Z) - 1Z'y -(Z'Z) .
- 1(Z'X)()
Pz= Z(Z'Z)- lZ', Mz=I - Pz. The estimator ('10 of (), however, is not an estimator of the parameters of interest a*. On the other hand, when Zf in (184) is dropped without taking into account its indirect effect (throughthe parametrisation) the estimator #= (X'X) - 'X'y is inappropriate. This is because (i introduces a non-existent orthogonality between the estimated 'systematic' and components. That is, for #= X:= Pxy and -X4= Mxy sample equivalent to ( 176) induced by that the wecan see =y knon-systematic'
is
y
pxy + Mxy
=
(25.188)
given that PxMx 0, Pv X(X'X) - IX', Mx I Px. ln view of where / . 177), however, ( no such orthogonality between yt and st exists within the framework defined by D(o, Xr; #) and thus x is inappropriate. From ( 183) we can see that in order to achieve an orthogonality between pt and st we need to take account of the conditioning variables Z,, The way this was achieved in ( 184) was inappropriate not because we introduced a nonexistent orthogonality (the orthogonality was valid) but because we orthogonality achieved the at the expense of the parametrisation of interest. The question which naturally arises at this stage is how we can introduce an of orthogonality between p, and ;, without losing the parametrisation interest. =
=
=
Intuition
suggests that when /t, a'X, and ), yf - J'X: are nOt orthogonal it must be the case that t includes information which trightfully' belongs to pt. In the present context what is systematic and non-systematic information is determined by ( 182), the underlying distribution. This being the case the obvious way to proceed is to redefine both components so as to ensure that they are indeed systematic and non-systematic relative to D(y,, X, Zt,' 0j. This can be achieved by the new components : =
ptt .E'(/t,?'c(Zr)) and =
t:1
-
zt
-
=
(p) -p,)
the systematic information y1 -/tf) is Subtracted the new error term c). The redesigned form of ( 176) is Notejow
5't E (Jtf'rqZ,)) =
+
it
-
f(/lf,/fT(Zr))()+ c,
(25.189) from cf to define
(25.190)
0. with Eptttl) (190) is different from ( 184) in so far as the original parametrisation has been retained It is interesting to note that the above can be motivated =
.
're-design'
25.7
Instrumental
variables
643
directly in terms of condition (i). Given that F(Z;;,)
'('gZ'f)f/J(Zf)1)
=
(see Chapter 7)
which in turn is
if
(25.19 1)
0
=
for
valid
Ept c(Zf>
(i)holds
deduce that condition
we can
E (lt J(Zl))
E (ZIEI;, c(Z,)))
=
(25.192)
c(Zr))
'(.)?,
-
Vchoice'
The conditions ( 19 1) and (192) are equivalent to ( 189) and thus the of Z, so as to ensure condition (i) is equivalent to the above re-design' of ( 176). ln order to understand how the IV estimator (179) is indeed the most appropriate estimator of x* in the context of ( 176) 1et us derive the sample equivalent of (190).Using the well-known analogy between conditional expectations and orthogonal projections we can deduce the sample F (T > m + p): equivalents to (189) and (190) for t 1, 2, =
p * P z xa* =
and
y
c*
,
=
PzXa* + Mzxa
=
1) +
.
.
.
,
M z X()
(25.193)
+ c.
Pz and Mz I Pz are the orthogonal projections onto the space spanned by the columns of Z and its orthogonal complement, respectively. ln view of the orthogonality between Pz and Mz (PzMz 0) the usual orthogonal (OLS) estimators of * and tl coincide with (179) and (187),respectively. This derivation of the IV estimator provides us with additional insight as tsolves' the original estimation problem. ln a sense the to how the method method method IV is a two-staqe (seePagan (1986))where the jlrst staqe the original formulation so as to achieve orthogonality between the systematic and non-systematic components, without changing the parametrisation, and the second stage the parameters of interest are estimated using the usual orthogonal projection (OLS) estimator. The re-designed formulation is particularly illuminating in so far as the relationship between ( 176) and (184)is concemed. lf we rearrange (193)in =
-
=
Sre-designs'
the form:
y Xa* =
+ M,X(a() - a*) + c
(a()- x*) the hypothesis against f11 : J#0
we can see that for J J.f():J=0
(25.194)
=
can be easily tested in the context of (194) when T > m + p. Ho is often viewed as referring to the admissibility of the instruments in Zf. From (184)-
644
The simultaneous equatios
model
(186) we can see that Hv is an indirect test of the hypothesis that the coefficient of the conditioning variables Zt is zero, i.e. covt)?t,ZXt)
0
=
This brings out the connection between conditioning variables variables 25.8 and instrumental below). (seeSection In the context of the simultaneous equations model a1l the estimators discussed (ZSLS, LIML, 3SLS and FIML) can be viewed as IV estimators; see the recent monograph by Bowden and Turkington (1984).In order to see this let us consider the case of the ZSLS (see Section 25.6 above). The first restricted structural equation for the sample period tzzu1, 2, T takes the form: 'omitted'
.
yl
=
Zll
+ zl3
.
.
,
(25.195)
where Z1 >(Y1,X1) and aj =(y'1.J'!)'. Given that the underlying distribution is Dlyttixt; 0) where y, HE (yl,, y'1,, y'(j)r)' and Xf EEEE(X?1 t, X'(1),)' we can see that X, is the set of instruments and ( 195) includes some of those instruments as genuine conditioning variables. Expressing (19f) in the form (193) yields: yl since PxX1 i
=
PXYITI
=
X1. From this we can deduce that the IV estilnator
1
J'1
+
XIJI
Y' P x Y 1
Y'1 P x X 1
XIPXY 1
X'IX 1
,1
.
Iv
+ MxY17 +
-
:1'
1
(25.196) Y'1 P xyl
X'1yl
of al is:
'
.,)
(aj. j oy)
which coincides with the 2SLS estimator ( 140).
25.8
Misspecification testing
For theoretical as well as practical reasons misspecification testing in the context of the simultaneous equations model will be considered at three different but interrelated levels:
the statistical system level, in terms of the statistical GM, yf B'xt =
(ii)
M-ut,
(25.198)
the unrestricted structural equation, in terms of the statistical GM,
25.8
Misspecification
testing
the restricted structural parameters of interest, =
(1)
equation,
7'1y1, -F-ti'1X1? +
1'1,
The statistical
system
645
in terms of the theoretical
'l/f
(25.200)
.
level
lf we look at the specification of the simultaneous equations model given in Section 25.4 we can see that a sufficient condition for the theoretical parameters of interest ( to be well defined is that the statistical parameters 0 are well defined. This suggests that the natural way to proceed with misspecification testing is to test assumptions (6(1to (8qin the context of the multivariate linear regression model (seeChapter 24). In the case where the system defined by the structural form with the restrictions imposed is just identified, the mapping H(.) : O-/E, (= H(p), is one-to-one and onto, implying that the inverse image of H( ) is (.) itself. Hence, ( is a simple reparametrisation and as well defined as 0. ln the case where the system is whose inverse image overidentified then the mapping H( ) is many-to-one 0. () is a subset of (4. That is, in the case of overidentification H( ) imposes restrictions on 0. which are, however, testable in the context of (198)(see Section 25.91. The above argument suggests that before any questions related to the theoretical parameters of interest t are considered we need to ensure that the statistical model in terms of the statistical parameters p is well defined. This is achieved by testing for departures from the underlying assumptions using l/?L2 procedures discussed in C'lr/rt/l- J4. lf this misspecification testing reveals that the statistical assumptions underlying ( 198) are valid we can proceed to consider the identification, estimation as well as specification testing in the context of the structural form reparametrisation. Otherwise we need to respecify the underlying statistical model. For example, the assumption of independence is likely to prove inappropriate in econometric modelling with time-series data. A respecification of the underlying statistical model U i1lgive rise to the multivariate dynamic linear regression model whose statistical GM is .
.
.
(25.201) (see Chapter 24). This suggests that if the identification problem was in terms of 0 in f 198) we need to reconsider it in terms of 0* in (199).Given that (198) and ( 199) coincide under isolved'
Hv : Bf
=
0,
-'.f
=
0-
1.=
1, 2,
.
.
.
-
/,
(25.202)
646
The simultaneous equations model
restrictions. In the we need to account for these implicitly imposed restrictions context of (198)the restrictions in (202)are viewed as which fail the rank condition (seeSection 25.3) because a11 equations satisfy these restrictions. When Ho is tested and rejected, however, we need to reconsider our estimable model (seeChapters 1 and 26) in view of the fact that the original model allowed for no lags in the variables involved. When a situation like this arises in practice the modelling is commonly done equation by equation and not in terms of the system as a whole because of the relative uncertainty about which variables enter which equation. For this reason it is important to consider misspecification testing in terms of individual equations as well. itestable'
tphoney'
(2)
The unrestricted
single structural
equation level
(194)is of no interest because no structural identified in is its context. From the statistical viewpoint, parameter however, (194)is of some interest because it can be viewed as a bridge' testing between (193)and (195)which can be used for misspecification stochastic and of Treating the it linear linear hybrid as a purposes. regression models (see Chapters 19-20) (194)presents us with no new testing is concerned. The testing problems as far as misspecification procedures proposed in Chapters 2 1 and 22 can be easily adapted for the present context. From the theoretical viewpoint
The restricted
single structural
Ievel
equation
Having tested for misspecification in the context of (198)and accepted the underlying assumptions as valid we have a well-defined estimated statistical GM. From the theoretical viewpoint, however, (198)makes very little sense (if any). For that we need to reparametrise it (by imposing a priori restrictions) in terms of the theoretical parameters of interest. The question will be discussed in Section of how we achieve such a reparametrisation 25.9. At this stage we need to assume that this has been achieved by J(1) 0 (enoughin number to imposing the exclusion restrictions >1) identify the first equation) in order to get (200). expressed in From the misspecification viewpoint the reparametrisation (200)is of interest in so far as this has not been achieved at the expense of the statistical properties of the estimated GM (198) we started with. That is, we need to check that (200)is not just a theoretically meaningful but also statistically well-detined relationship. The discerning reader would have noticed that the same problem arose in the context of the linear and dynamic linear regression models (see Chapter 23). =0,
=
25.8
Misspecilkation
647
testing
ln Sections 25.6 and 25.7 it was argued that (200)can be viewed in various alternative ways which are useful for different purposes. The two intemretations we are interested in here are: (25.203) (25.204) Given that E (. . k'1 f
'o-l
y1
.'
X1t
:),
)?'jy j t + J'1x 1 r
x j t)
aE-tylf,''X, x!) -
c
,
Bz1cxl, + B'lcxtl
=
*j
t
),'ju j t +
?.fj t
)?,
we can see that we can go from (204)to (203)by subtracting y'jyjt from both sides of (204).For estimation ptlrposes (204)is preferable because of the orthogonality between the systematic and non-systematic components (see Section 25.6). For misspecification testing- howey'er,it is more convenient to use (203). Tbe
oI' D(
??t?,-n'la/!'r.
Xl,,' (1) can be tested using a direct extension test discussed in Chapter 2 1 based on the residuals
'1,..y1,,
of the skewness-kurtosis J: t
=
.yl
, -
f;
j..y
:
, -
i
k
x
t
t,
=
l 2 ,
,
.
.
.
,
T
(25.205)
whereflv,tsvrefer to
some IV estimator of ),j and J! respectively such as the estimators. LIML ZSLS or The Iinearitl' of the conditional expectation c(.ylf), X1t
f( 1':,
=
xl!)
=
?'Z1,,
(25.206)
x'Ifl' and 1'H (/1,J'1) can be tested using the F-type test where Z1t EEE(y'1p, discussed in the context of the linear regression model (see Chapter 2 1). That is, generate the higher-order tenns #r,using Z1f or powers of ft),and test the null hypothesis
He: d
regression
in the auxiliary )',1 /
=
=
z
H 1: d# 0
0
=
1
a01 +
Z1(at'
-
tlzd
+
:0 1
$) ) + Td
(25.207) +
&'
(25.208)
(see Chapter 2 1). The augmented equation (207) or (208) should be estimated using an IV estimator, say ZSLS. Using the asymptotic normality of &jj.the F-type test procedure discussed in Chapter 21 can be justified asymptotically in the present context.
The simultaneous
The homoskedasticitv Var()'1 t/o-ly
j
equations model
of the conditional
), X j r
=
xl
r)
variance
(25.209)
r: 1
=
can be tested by extending the White and related tests discussed in Chapter 2 1 to the present context. If we assume that no heteroskedasticity is present in the simultaneous equation model as a whole we can test (209)against t/c(y1,),
Vartyl
using Hv cl
=
X 1,
-
xlr)
/'l(z 1 ,),
-
(25.210)
0 against H3 : cl # 0 in the auxiliary
regression
Lqtl cll + c'1 kt + q
(25.211)
=
(see Kelejian ( 1982)). This can be tested using the F-type test or its asymptotic variants (e.g. T'R2) discussed in Chapter 21. ln the case where the assumption related to the presence of heteroskedasticiLy in the rest of the appropriate, need modify F-type is because the the tests not system to we asymptotic distribution of > ((),1) is not Ft
x
-c)
N(0, c
x J
Q/ )
(25.212)
but j.'
N
,. T( -c)
x. J
N(0, 2c #.(Q-
j
j
,
Q/ and Qz: refer to the probability 1#) ( #,) ' and (), r 1 k lf k'lt ) k'1, (''1r, x'lr) of matrix coefficients of the auxiliary regression: where
'
=
f(Z1?l1',)
,
-..
-
=
L'#) +
%1
(25.213)
L )), 2LQcj
+
=
limits of respectively
(,Fj
#)#)'q,
and L is the
(25.214)
and Hall ( 1983), White ( 1982a)). An important assumption which should be tested in the present context is whether by imposing the exclusive restrictions on ( 198) the parameter time invariance no longer holds. The diagnostic graphs related to the recursive windows discussed estimates of the coefficients as well as the z-observation in Chapter 2 1 apply to the present case in terms of some IV estimator of a1. Moreover, the F-type tests related to time-dependence and structural change can be applied to equation (203)using some IV estimator for 1'. Time-dependence of the parameters is particularly important in the present context because this formulation is commonly used for prediction and policy evaluation purposes. ln cases where no misspecification tests have been applied to the statistical GM ( 198), testing for parameter time dependence is of paramount importance because heteroskedasticity in its context might appear as time dependence in the context of (204).This is because the parameters in (204)are defined in terms of B and f2 and if
(seePagan
25.9 Specification testing Covtyf Xf
=
xr)
=
649 (25.215)
ftxt),
will be a function of xt via flxfl. The sampling model assumption of independence can be tested by applying the F-type test to test the significance of the coefficients of the lagged Zj, in
t
the vector of coefficients
(25.2 16) 0r l
*1 t
(01 4*1 r )'Z 1 t + )-' a'jzjf
=
-
-
,
i
=
1
i
(25.217)
+ vl,
The only modilication of the F-type test discussed in Chapter 22 needed is to estimate the parameters of (216)or (217) by some IV estimator because of the presence of simultaneity. The F-type autocorrelation test is based on the auxiliary equation l
*1t (01
-&*
=
I l''
)'z 1 t + Y c'0 f
=
*
1
1t - i
+v
(25.218)
t5
(see Godfrey (1976),Breusch and Godfrey (198 1)). 25.9
Spaification testing
ln practice, specification testing in the simultaneous equations model is almost exclusively considered in the context of the single identilied structural equation .J.'lf T'1ylt =
whereylt: (n11(1)
+J'I
=
X1t
where .J,'1, ylf of both xIf and =
-
+
=
against Sl
f
:
yl #
j
( 1949:. Under Hv the GM (219) takes the form
(25.220)
lj f,
yj Under i,'1 ,.
x(1)r given
y1f B'12X 1 t + =
> n'lj - 1 (foridentification).
1) x 1, x1f : kl x 1, (k
(Anderson and Rubin 'lf
(25.219) -kj)
Testing Ho : l'j
'l
l11,
+
X1,
H3 however, y1, ,
that
B'22X(1)t
+ ulf
.
=
ylr - y'ylf and a function
(25.221)
Hence, Hv can be tested indirectly by testing that the coefficient of X(:) is zero in the regression equation
yf
=
X1l1' + X(l)/(1)
+V1
.
(25.222)
650
The simultaneous
equations model
This can be tested using the F-type test statistic .)'?'(Mx, Mxly? Mxyl
T-k
FT(y1)=
-
'f
kz
-
,
where Mx l -X(X'X) - 1X' and M.y region being =
=
1
H '--
F- k),
F(2,
(25.223)
I -X1 (X)X1) - 1Xk the rejection ,
Lk
t)-l
=
.ftyl:
FT(yt)
>
ca),
x
df-lkc, F
=
-k).
(25.224)
t7t
A special case of Ho above of particular interest is yl i.e.the endogenous variables are insignificant in (219). This test can be readily extended to include some of the coefficients in Jj as long as al1 the coefficients in yj are specified under Ho. The difficulty with including only a subset of yl in Hv is that we cannot apply the usual linear regression estimator under Ho, we need to apply an IV estimator such as ZSLS or LIML. The most convenient way to extend Hv to a more general set of linear restrictions on a/ EEE(y'j,J?1)' is to use the asymptotic distribution of a consistent and asymptotically normal estimator. =0,
(2)
Testing S(): R1'
r against S1 : Ra/ # r
=
where R and r are p x (rn1+ kj - 1) and p x 1 known matrices ranktR) p. Using the asymptotic normality of S*sss (0r &ljuu),
with
=
x'/T*lsus
*)
-
.N'(0,t?1 1 D 1-11)
-
t'
(25.225)
(see Section 25.6), we can use the same intuitive argument as in the case of the F-test (seeChapter 20) to suggest that a natural choice for a test statistic in the present context is (R&lst.s r)'gR(2'121) - 1 R'(J- ltR&lsus
-r)
1
-
FT'(y1)
where
=
f1:1
21 B (#1: ij). Asymptotically pFrtyll
In practice
,
p
-
(25.226)
we can argue that
&o
z2(.
it is preferable
approximation
-
(25.227) to use the F-type form
(226)based
on the
(25.228)
25.9 Specification
651
testing
The above results suggest that asymptotically the specification testing results discussed in Chapter 19 in the context of the linear regression model individual (r-tests)and can be extended to the present case. ln particular, Two special cases of asymptotically significance of justified. joint tests are which using be restrictions tested the above F-type can linear homogeneous
test are: (i) (ii) which (3)
the overidentifying restrictions; the exogeneity restrictions; we consider next. Testing the overidentifying
and
restrlctions
As argued in Section 25.3 the identification restrictions in the case of exclusion restrictions come in the form of the system (51)-(52). For identification we need (k kj) > (?'n1 1). ln the case of overidentification the overidentifying restrictions q= k -zn1 + 1 can be tested via (51(52). Given that (51) is not directly involved in the overidentification the latter conditions can be tested using (52),i.e. -
-
-k:
H3 : B22r1 #21#0. 1.o: 82271 #21 enables us to test Ho is the statistical which A convenient formulation sub-model: =0,
-
-
A'lt
/'11 xl t + b',j.x(1), +
=
1./1t
+ Br2cxll), + u1,
y1, B'1ax,, -
Multiplying the second equation by y'2and subtracting from the first we can define the unrestricted structural formof y1, as :
(#11 -B12T1)'x1f
7'1y1l+
-P1, =
+
(/21 -
B2271)/x(jjt +
:1,.
(25.229)
This form presents itself as the natural formulation in the context of which Hv can be tested. If we compare (229)with restricted structural form(219) we can derive the auxiliary regression:
tt or its
=
t#11
operational
Jt,
=
J?'x1
The obvious test statistic
B12r1 -J1)'x1, +
-
form: :
J!l'lxllj, +
+
way to test Hv II
LM
=
rR2
(#c1-B2c),1)'x(l), +c1,
x 7
f
h
g2((y).
!?f.
is to use the R2 from
(25.230) (230)to define the LM (25.231)
652
The simultaneous
An asymptotically for small F is:
equivalent form suggested by Bassman
RRSS
FF
model
equatio>
T-k
URSS
-
fo 'v
=
URSS
q
where RRSS=(y1 URSS
ylflv
-
appx
-
FM1,T-k)
-Y1fIv
-X1JIv)'(y1
-Y1fIv)'(Mx
=(y1
(1960)as better (25.232)
Xl#lv)
-
Mx,)(yl -Y1iIv).
The main problems with (231)and (232)are: when Hv is rejected the tests provide no information as to the likely source of the problem, and (ii) for a large k their power is likely to be 1ow (seeSection 19.5). In an attempt to alleviate the problem of low power and get some idea about the likely source of any rejection, the modeller could choose q instruments at a time. That is, for X(1) BE (X2; X3) where X2: F x q we could ,T: consider the following augmented form of (219) for 1,,=,1,2,
(i)
.
=YI)4
XIJI+X2J1
+
r1
+
:1
J1
against
=0
H3 : J1
.
(25.233)
,
where X1 : T x kj and X2: T x q. Using exclusion restrictions takes the form HvL
.
(233)the test for the overidentifying
'0.
The F-type test statistic for this hypothesis is RRSSV
FXp'(r1)
-
URSS;V
T- k
=
q
where and URSS;..
(y1
=
-
(25.234)
,
URSS;V
Y1f hp,
-
X1J*1,pXc#*c,p.)' -
*1 p ) (y 1 - Y 1 f , p X 1IL'p. - X 2J*w
X
-
,
and IV refers to some instrumental variables estimator such as 2SLS or LIML of the parameter in question. Asymptotically, we can argue by analogy to the linear regression case that Ho
FT',wlyl)
-
J
z2(>.
ln practice, however, it might approximation:
be preferable
use the F-type
H
FT;Jy1)
-
aPPX
Flq, T-k),
(25.235)
653
25.9 Specifkation testing
with the usual rejection region. For a readable survey of the finite sample results related to the test statistics (226)and (234)see Phillips (1983). The above tests can be generalised directly to the system as a whole using either the 3SLS formulation in conjunction with the F-type tests or the FIML formulation using the likelihood ratio test procedure.
for exogeneity specification of (219) above, Testing
(4)
yjt was treated as a random vector of by defining the systematic component as the endogenous and the expectation of conditional yjt on the c-field generated by y1f (c(y1f)) XIf yI, and of asymmetric xlt). This treatment observed value of X1, (X1; distinction. was used as the main implication of the endogenous-exogenous of refers possibility the Testing for exogeneity in present context to the This amounts to treating yjt or some subset of it in the same way as Xjy. testing whether the stochastic component of yj, contributes significantly to the equation (219). of this A convenient formulation for testing the appropriateness of suggested equation context the in asymmetric treatment of yjf is the (194) form: the 194) context takes IV estimation. In the present ( In the
variables
=
y1 =Y171
+
XIJI
+ MxY1(y1
(25.236)
r1)+ :1.
-
ln this formulation we can see that y1t and x1, in (219)are treated similarly when ),t yl 0. Hence, an obvious way to parametrise the endogeneity' of yj, is in the form of: 0, Sl : (7? 71) #0. Hl :(T1-T1) (25.237) =
-
=
-
This can be tested using the F-type test statistic: FT=
RRss -
usss
'Rss
(
T-2(rn1 ,,n1
-
1)-I
1 Hu -
-
1
Ft'''l
-
1, T'-2@1
-
1'+k1)
(25.238)
where RRSS and URSS refer to the RSS from (219) and (236)respectively, both estimated by OLS (seeWu (1973),Hausman (1978), inter ali. The above test can be easily extended to the case where only the exogeneity of a subset of the variables in y1f is tested by re-arranging (236)
yl
=
(see Hausman
PxYt7? +Y171
''FXIJI
+
MxY1(y -y!)
+ MxYtyf
+Et
(25.239) and Taylor
(1981) for a similar test).
The simultaneous
equations model
As a conclusion to this section it is important to emphasise that a1l the above specilication tests will be very sensitive to any departures from assumptions (1j-g8q(see Section 25.4). If any of these assumptions are invalid the conclusions of the above tests will be at best misleading. ln particular the above test is likely to be inappropriate in cases where the independent sample assumption is invalid', see Section 25.7. texogeneity'
25.10
Prediction
lf we assume that the observed value xw+, of the exogenous random vector Xw./ is available, the obvious way to predict yw../ is to use the statistical GM, yt B'xf + u,,
t G T,
=
(25.240)
estimator of the unknown in conjunction with a parameters 0(B, f1). The problem, however, is that (240) can also be expressed (parametrised) in terms of the theoretical parameters of interest (, i.e. tgood'
yt B(()'xt =
(25.241)
+ u?,
and the MLE of 0 and ( coincide only in the case where the system of structural equations (withthe identifying restrictions imposed), F*'yt + A*'xr + st*
0,
=
(25.242)
identsed. Otherwise, if the system is overidentified the FIML (01' is 3SLS) estimator of ( has a smaller variance than This discussion suggests that in the case where the system isjust identified we apply the prediction methods discussed in the context of the multivariate linear regression model (see Chapter 24). ln the case of overidentification, however, we derive the restricted estimator of the statistical parameters B via .jlsr
(25.243) . - trt(-)where A(j) and F(j) refer to some asymptotically efficient estimator f of the structural parameters (, such as FIML or 3SLS. lt can be shown that in the '
B(-
7
overidentified case:
x,7''z-(B(j)-B(t))
x//'rt with (Fo
-
-B)
,v
-
N(e, Fa),
x(e,F0),
Fa) being positive semi-definite.
(25.244) (25.245)
25.10 Prediction
655
In the case where ( is estimated by a single equation estimation such as LIML or ZSLS then '''r(B(j
.v'
/
B(())
r )-
N(0, F2),
-
method
(25.246)
1
and both (F() - Fz) and (F2 F:5) are positive semi-definite (see Dhrymes ( 1978)). lt is not very difficult to see that this asymptotic efficiency in estimation carries over to prediction as well given that the prediction error defined by -
(w+/ - yr+/)
(2
=
-
B)'Xw+!
is a function of the difference lmportant
(
(25.247)
+uw+/ -
B) and
n.
concepts
of interest, Reparametrisation, theoretical (structural) parameters equations, overparametrisation, recursive of simultaneity, system restrictions, restrictions, exclusion identification, linear homogeneous order and rank conditions for identification. underidentification, just and overidentification, indirect MLE, estimator generating equation, full informatio' n maximum likelihood (FIML), limited information maximum likelihood (LIML), two-stage least squares (2SLS), k-class estimator, noncentral Wishart distribution, instrumental variables, three-stage least-
squares (3SLS).
Questions
3.
Compare the multivariate linear regression and the simultaneous equations models. Discuss the relationship between statistical and structural parameters and well structural theoretical parameters of interest. as as Show how the identification problem arises because of the overparametrisation induced by the structural formulation. 'In the context of the statistical GM T. 1 ,
=
0
r:
'
1
t' ) yt + Ajx, + /
ij,
of Section 25.2 no endogeneity problem arises and the usual orthogonal estimator (OLS) is the appropriate estimator to use.' 5.
Explain. of the variation 'Endogeneity arises because of the violation free exogeneity induced by the identifying restrictions.' condition of Discuss.
656
The simultaneous equatiom model
Explain what we mean by a recursive system of simultaneous equations and discuss the estimation of the parameters of interest. refers to the uniqueness of the reparametrisation from the statistical to the theoretical parameters of interest.' Discuss. Explain the order and rank conditions of identification in the case of linear homogeneous restrictions. Discuss the two cases where the ordercondition might be satisfied but the rank condition fails. 10. Before we could even discuss the identification problem we need to have a well-defined statistical GM.' Discuss. Explain the concept of an indirect MLE of the theoretical parameters of interest. Under what circumstances is the IMLE fully efficient? Explain the intuition underlying the concept of an estimator generating equations (EGE). How do restrictions of the form BF + A 0 differ in the context of the simultaneous equations and multivariate linear regression model? structural parameters estimators such as ZSLS, 1V, LIML, k14. class, 3SLS are numerical approximations of the FIML estimator.' Discuss (see Hendry (1976:. Explain the derivation of the 2SLS estimator of 1 in Sldentification
=
t'l-he
yl =Z
1
*
1
+:* 1,
both as constrained least-squares as well as an instrumental variables estimator. 'Fhe 2SLS estimator can be viewed as being a numerical LIML estimaton' Discuss. approximation to the 'lf the rank condition for identification is not satisfied then 2SLS does not exist.' Discuss. S'T'he3SLS estimator is derived by using the Zellnerformulation in the context of the simultaneous equation model.' Explain. 'Misspecification testing in the context of the simultaneous equation model coincides with that in the context of the multivariate linear regression model.' Discuss. Consider the question of misspecification testing in the context of a single structural equation before and after the identifying restrictions have been imposed. Explain how you could construct a test for the overidentifying restrictions in the context of a single structural equation. Discuss the question of testing for the exogeneity of a subset of the endogenous variables included in the first equation.
25.10 Prediction
657
'Using the derived statistical parameter .-
..<
..x
B(t)
will provide
=
A(()F(t)
-
efficient predictors
more
estimators
defined via
j
.
for yw-hj.'Discuss.
Exerclses
Consider the two equation estimable model: mt
a 1 1 + a 1 zit + a 1 5pt +
=
it
(1) (2)
J 1 4y1,.
a21 + xzzmt + acapf + uzqgt.
=
Express equations
y
=
t
(1) and (2)in the
B'x f +u
forms
'
f'
r*' t't + A*'x t +
t1
=0.
F* and A* in terms of (B, f1) and show that the theoretical parameters of interest are uniquely Derive the parameters
defined. Verify that in the case of a recursive simultaneous 1 IFft= y0?y9
1t
+J'xi
t
+
ilf
,
i
=
1, 2,
.
.
the orthogonal estimators (where a,9 HE (),,9',J;)',Zf EH (yf 1 49 (z;zf)- l zkyi l
and
model
n'l,
,
.
equations
,
X)):
=
1 ii T =
--.
if
ii
,
yi
=
Zf&,9, i
-
=
1 2, ,
.
.
.
,
m
,9
MLE'S of and nf. are indeed the equations structural Consider the following imposed'. restrictions
+ 7 4 1 .V41+
- 1.'1 l + 7 2 1 )' 2f
+ l
4.1' 1
r
-h
-/'
24.-.21
-
.F4t
-1-
1 1 X 1t
+
lcxl,
+
l 3x1,
+
1 4.,X1 t
+
the exclusion 3 1 .X 3/
l1
=
;3,,-
2cx2t
+ aaxa, +
t;
=
(524..X2/
=
fl4t.
Discuss the identifiability of the above equations under the following conditions: No additional restrictions; (a) 0 J z2 0 ; (b) 24 (c) 7 32 0, T41 0. Explain how you would estimate the first equation by ZSLS. =
=
(ii)
-/4.3.:4.1
with
,
=
=
The simultaneous
658
muationsmodel
Show that in the case of ajust identified equation the ZSLS and IMLE estimators coincide. Show that l-((-)'fF(t-) ( 1/F).d(()'Z'ZA((-) in Section 25.5. Derive the 2SLS estimator of al' in yl Z1al + 1)1' and explain why where * (y-.irj sus -X1J2sus)is an inconsistent !7tl (1/F)t'/ estimator of tll'j Compare your answer in 6 with the derivation of the 2SLS estimator as an instrumental variables estimator. t'T'he GM yl Pxzlf + ct* (see Section 25.6) can be viewed as a equivalent sample for the sample to y3t y'1'(y1,/c(X,)) + J':xlf + k:T* T' Explain. period t 1, 2, Derive a test for the exogeneity of a subset y2, of y1f in the context of the single equation =
=
=
=
.
=
=
=
.
.
.
,
)'1, r'1yl t +J'1 x1, =
+ tt
.
Additional references Anderson
( 1982);
Hausman
( 1983);
Judge et al. ( 1982).
Epilogue: towards a methodology econometric modelling
26.1
A methodologist's
of
critical eye
The purpose of this chapter is to formalise the methodology sketched in Chapter 1 using the concepts and procedures developed and discussed in the intervening chapters. The task of writing this chapter has become considerably easier since the publication of Caldwell ( 1982) which provides a lucid introduction to the philosophy of science for economists and establishes the required terminology. Indeed, the chapter can be seen as a response to Caldwell's challenge in his discussion of possible alternative approaches to economic methodology: One approach which to my knowledge has been completely ignored is
. and philosophy with the integration of economic methodology econometrics. Methodologists have generally skirted the issue of methodological foundations of econometric theory, and the few econometricianswho have addressed philosophical issues have seldom gone beyond gratuitous references to such figures as Feigl or Carnap .
.
.
.
.
.
(See ibid, p. 216.)
ln order to avoid long digressions into the philosophy of science the discussion which follows assumes that the reader has some basic knowledge of philosophy of science at the level covered in the first five chapters of Caldwell ( 1982) or Chalmers (1982). Let us begin the discussion by considering the textbook econometric methodology criticised in Chapter 1 from a philosophy of science perspective. Any attempt to justify the procedure given in Fig. 1.1 reveals a deeply rooted influence from the logical positivist tradition of the late 1920s early 30s. Firstly, the preoccupation of logical positivism with criteria of cognitive significance, and in particular their verifiability criterion, is clearly 659
Epilogue
660
discernible in defining the intended scope of econometrics as measurement of theoretical relationships'. A theory or a theoretical concept was considered meaningful in so far as it can be verified by observational evidence. Secondly, the trcatment of the observed data as not directly related to the specification of the statistical model was based on the logical positivists' view that observed data represent facts' and any with the facts was rendered meaningless. theory which does not Unless the theoretical model, now re-interpreted in terms of the observational language,differs from the statistical model only by some nonsystematic effects (white-noise error) the theory had no meaning. Moreover. there was no need to bring the actual DGP into the picture because the theory could only have cognitive significance if it constitutes a description of the former. Thirdly, emphasis in the textbook methodology tradition is cliterion of cognitive significance) and placed on testing theories (testability choosing between theories on empirical grounds. Despite such an emphasis, however, to my knowledge no economic theory was ever abandoned because it was rejected by some empirical econometric test, nor was a clear-cut decision between competing theories made in lieu of the evidence of such a test. From the philosophy of science viewpoint the textbook econometric methodology largely ignored the later reformulations of logical positivism in the shape of logical empiricism in the late 1950s early 60s (seeCaldwell (1982/. For example, the distinction between theoretical and observational concepts was never an issue in econometric modelling. adhering to the logical positivist view that the two should coincide for the former to have any meaning. Moreover, the later developments in philosophy of science challenging every aspect of logical positivism and in particular the verication principle and objedivity, as well as the incorrigibility of observed data (see Chalmers (1982)), are yet to reach the textbook econometric methodology. The structure of theories in economics has been influenced by the axiomatic hypothetico-deductive formulation of logical empiricism but this served to complicate the implementation of the textbook econometric methodology even further. Research workers found themselves having to use not only statistical procedures but also theory construction procedures. That is, they would begin but with a very theoretical model and after estimating a variety of similar equations and a series of statistical techniques end up with the and revise sense) return the some to (in theoretical model in order to rationalise the empirical equation. Commonly, research workers find themselves having to include lagged variables in their estimated equations (becauseof the nature of the observed data used), even though the original theoretical model could not account 'the
iobjective
'comply'
'illegitimate'
'illegitimate'
'vague'
'estimable'
iillegitimate'
ibest'
'best'
26.2
Formalising
661
a methodology
for them. Often, research workers felt compelled, and referees encouraged them, not to report their modelling strategy but to devise a new theoretical hypothetico-deductive model using some form of dynamic optimisation method and pretend that that was their theoretical model all along (see Ward ( 1972/. As argued in Chapter 1, research workers are driven to use procedures because of the the textbook methodology forces them to wear. Moreover, most of these procedures become the natural way to proceed in the context of the alternative methodology sketched in Chapter 1. In order to see this let us consider a brief formalisation of this methodology in the context of the philosophy of science. 'straightjacket'
Sillegitimate'
'illegitimate'
26.2
Econometric
modelling, formalising a methodology
Having criticised the textbook econometric methodology for being deeply rooted in an outdated philosophy of science the task of fonnalising an alternative methodology would have been easier if the new methodology could be founded on the most recent accepted view in philosophy of science. Unfortunately (or fortunately) no such generally accepted view has since emerged the dethronement of the positivist (logicalpositivism as well empiricism) philosophy (seeCaldwell (1982),Chalmers (1982), as logical alia). The discussion which follows, purporting to bring inter Suppe (1977), various threads of methodological arguments in the book, together the related cannot be to any one school of thought in the current philosophy of science discussions aspiring to become the new orthodoxy. At best it could be seen as the product of a cross-fertilisation between some of the current views on the stnlcture, status and function of theories and the particular features of economic theory and the associated observed data. For expositional purposes the discussion will be related to Fig. 1.2 briefly discussed in Chapter 1. The concept of the actual data generation process (DGP) is used to designate the phenomenon of interest which a theory purports to explain. The concept is used in order to emphasise the intended scope of the theory as well as the source of the observable data. Defined this way, the concept of the actual DGP might be a real observable phenomenon or an experimental-like situation depending on the intended scope of the theory. For example, in the case of the demand schedule discussed in Chapter 1, if between a the intended scope of the theory is to determine a relationship hypothetical range of prices and the corresponding intentions to buy by a group of economic agents, at some particular point in time, the actual DGP might be used to designate the experimental-like situation where such data could be generated. On the other hand, if the intended scope of the theory is
662
Epilogue
to explain observed quantity or and price changes over time, then the actual DGP should refer to the actual market process giving rise to the observed data. lt should be noted that the intended scope of the theory is used to determine the choice of the observable data to be used. This will be taken up in the discussion of the theory-dependence of observation. ,4 theory is defined as a conceptual construct which purports to provide an idealised description of the phenomena within its intended scope (the actual DGP). A theory is not intended as a carbon copy of providing an exact description of the observable phenomena in its intended is much too complicated for such an exact copy scope. Economic to be comprehensible and thus useful in explaining the phenomena in question. A theory provides only an idealised projected image of reality in tenns of certain abstracted features of the phenomena in its intended scope. These abstracted features, referred to as concepts, provide the means by descriptions are possible. To some extent which such generalised (idealised) the theory assumes that the phenomena within its intended scope can be iadequately explained' in terms of the proposed idealised replicas viewed as isolated systems built in terms of the devised concepts. ln the context of the centuries-old dichotomy of instrumentalism versus realism the above view of theory borrows elements from both. lt is instrumentalist in so far as it assumes that a theory does not purport to provide an exact picture of reality. Moreover, concepts are not viewed as describing entities of the real world which exist independently of any theory. It is also realist in two senses. Firstly, it is realist in so far as underthe circumstances assumed by the theory (as an isolated system) its can be ascertained. There is something realistic about a demand schedule in so faras it can be established ornot under the circumstances assumed by the theory of demand. Secondly, it is realist because its main aim is to provide explanation of the phenomena in its intended scope. As such, an of adequacy the a theory can be evaluated by how successful it is in coming the reality it purports to explain. Theories are viewed as with grips to providing to reality and they are judged by the extent to approximations enhance our understanding of the phenomena such which We question. cannot, however, appraise theories by the extent to which in provide exact pictures of reality they treality'
treality'
:validity'
tadequate'
kapproximations'
,
.
simply because we have no access to the world independently of our would enable us to assess the adequacy of those
. theories in a way that descriptions .
.
.
.
(See Chalmers
(1982),p.
163.)
This stems from the view that observation itself presupposes some theory providing the tenus of reference. Hence, in the context of the adopted
26.2 Formalising
663
a methodology
methodology of econometric modelling, theories are treated as providing approximations to the observable phenomena without being exact copies of reality; a realist position without the logical empiricists' correspondence theory of truth (see Suppe ( 1977:. In view of the realist elements of a theory propounded above the logical positivists' contention that the only function of theories is description is rejected. Theories are constructed so as to enable us to consider a variety of questions related to the phenomena within their intended scope, that includes description, explanation and prediction. The distinction between theoretical and observational concepts associated with logical empiricism as well as naive instrumentalism (see Caldwell (1982)) is not adhered to in the context of the proposed methodology. This is because observation itself is theory-laden. We do not just observe, we observe within a reference framework, however rudimentary. ln constructing a theory we devise the concepts and choose the assumptions in terms of which an idealised picture of the phenomena within its intended scope is projected. This amounts to devising a language and accepting a system of picturing and conceiving the structure of the features of phenomena in question. Such a system specifies the the phenomena to be observed. The main problem in econometric modelling, however, is that what a theory suggests as important features to be observed and the available observed data can differ substantially. Commonly, what is observed (data collected) was determined by some outdated theory and what was deemed possible at the time given the external constraints on data collection. Although there is an on-going revision process on what aspects of the observable phenomena data are collected, there is always a sizeable time-lag between current theories and the available observed data. For this reason in the context of the proposed methodology we distinguish between the concepts of te theory and the obser,ed data available. lt should be stressed that this is not the theoreticalobservational concepts distinction of logical positivism in some disguise. lt is forced upon the econometric modelling because of the gap between what is needed to observe (as suggested by the theory in question) and what is available in terms of observed data series. This distinction becomes even more important when the theory is contrasted with the actual DGP giving rise to the asailable data. Observed tt/rtk in econometric modelling are rarely the result of the experiments on some isolated system as projected by a theory. They constitute a sample taken from an on-going real DGP with al1 its variability and irrelevant' features (as far as the theory in question is concerned). These, together with the sampling impurities and observational errors, objective which against facts that published data far from being suggest are 'important'
664
Epilogue
theories are to be appraised, striking at the very foundation of logical positivism. Clearly the econometrician can do very little to improve the quality of the published data in the short-run apart from suggesting better ways of collecting and processing data. On the other hand, bridging the gap between the isolated system projected by a theory and the actual DGP giving rise to the observed data chosen is the econometrician's responsibility. Hence, in view of this and the multitude of observed data series which can be chosen to correspond to the concepts of theory, a distinction is suggested between a theoretical and an estimable model. theoretical model is simply a mathematical formulation of a theory. This differs from the underlying theory in two important respects. Firstly, there might be more than one possible mathematical formulation of a theory. Secondly, in the context of a theory the initial conditions and the simplifying assumptions are explicitly recognised; in the context of the of the model these simplifying assumptions govern its characterisation phenomena of interest. This is to be contrasted with an estimable model whose form depends crucially on the nature of the observed data series chosen (seeChapter 1). To some extent the estimable model refers to a mathematical formulation of the observational implications of a theory, in view of the observed data and the underlying actual DGP. ln order to determine the form of the estimable model the econometrician might be required to use auxiliary hypotheses in an attempt to bridge the gap between the theory and the actual DGP. lt is, however, important to emphasise that an estimable model is defined in terms of the concepts of the theory and not the observed data chosen. ln practice, the form of estimable models might not be readily available. This, however, is no reason to abandon the distinction and return to the textbook methodology assumption that the theory coincides with the actual DGP apart from a white-noise error. The estimable form of a theory depends crucially on the nature of the available observed data and the gap between the theory and the actual DGp,given that its aim is to bridge thisgap. The estimable model plays an important role in the propesed methodology because it provides the link between a theory and a statistical model. Having rejected the presupposition that the actual DGP, the theory and the statistical model coincide apart from a white-noise error, the task of specifying a statistical model, for the problem in hand, is no longer a trivial matter. The statistical model. can no longer be specified by attaching a white-noise process, assumed to follow a certain distribution, to the theoretical model. The nature of the observed data has a crucial role to play because of their relationship with the estimable form of the theory. Because of this it is important to specify the statistical model, in the context of which the estimable model will be analysed, taking account of the nature of the -4
26.2 Formalising
a methodology
665
observed data in question.
The procedure from the observed data to the statistical model has been discussed in various places throughout the book but it has not been formalised in any systematic way. Because of its central role in the proposed methodology, however, it is imperative to provide such a formalisation. The first step from the observed data to the statistical model is made by assuming that:

the observed data $\mathbf{Z} \equiv (\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_T)$ constitute a realisation of a sequence of random vectors (a stochastic process) $\{\mathbf{Z}_t, t \in \mathbb{T}\}$.

This assumption provides the necessary link between the actual DGP and probability theory. It enables us to postulate a probabilistic structure for $\{\mathbf{Z}_t, t \in \mathbb{T}\}$ in the form of its joint distribution function $D(\mathbf{Z}; \boldsymbol{\phi})$. This will provide the basis for the statistical model specification.
The specification of a statistical model is based on three sources of information:
(i) theory information;
(ii) measurement information; and
(iii) sample information.
The theory information comes in the form of the estimable model. Its role in the initial specification of the model is related to the choice of the observable variables underlying the model as well as the general form of the statistical GM.
The measurement information is related to the quantification and the measurement-system properties of the variables involved. These include the units of measurement, the measurement system (nominal, ordinal, interval, ratio; see Appendix 19.1), as well as exact relationships among the observed data series such as accounting identities. Such information is useful for the specification of the statistical model because it enables us to postulate a sensible statistical GM for the problem in question. For example, if the statistical GM $y_t = \boldsymbol{\beta}'\mathbf{x}_t + u_t$ allows $y_t$ to take values outside its range we need to reconsider it (a small simulation below illustrates such a range check). Also, if an accounting identity holds either among some of the $x_{it}$'s or among $y_t$ and some of the $x_{it}$'s, the statistical GM is useless. For further discussion on accounting identities and their role in econometric modelling see Spanos (1982a).
The sample information comes in the form of the observable random variables involved and their structure. It is helpful to divide this information into three mutually exclusive sets related to $\{\mathbf{Z}_t, t \in \mathbb{T}\}$:
(a) past: $\mathbf{Z}_{t-1}^{0} \equiv (\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_{t-1})$;
(b) present: $\mathbf{Z}_t$;
(c) future: $\mathbf{Z}_{t+1}^{T} \equiv (\mathbf{Z}_{t+1}, \ldots, \mathbf{Z}_T)$.
Such information is useful in relation to important concepts underlying the specification of a statistical model, such as exogeneity, Granger-causality and structural invariance (see Hendry and Richard (1983)).
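Returning to the measurement-information example above, the following is a minimal sketch, not from the book, of how such a range check might look in practice: if $y_t$ is a budget share confined to $[0, 1]$, a linear GM with normal errors necessarily places probability outside that range. All names and numerical values below are illustrative assumptions.

import numpy as np

# Hypothetical GM: y_t = b0 + b1*x_t + u_t, with y_t a share in [0, 1].
rng = np.random.default_rng(0)
T = 200
x = rng.uniform(0.0, 1.0, T)               # a single regressor, for simplicity
b0, b1, sigma = 0.1, 0.8, 0.15             # illustrative parameter values
y = b0 + b1 * x + rng.normal(0.0, sigma, T)

# Measurement information check: does the postulated GM respect the range?
violations = np.mean((y < 0.0) | (y > 1.0))
print(f"share of simulated y_t outside [0, 1]: {violations:.2%}")

# One conventional repair: specify the GM for the log-odds of y_t, which is
# unrestricted on the real line, and map back with the logistic function.
y_clip = np.clip(y, 1e-6, 1 - 1e-6)
z = np.log(y_clip / (1.0 - y_clip))        # logit transform respects the range

A non-negligible violation share signals that the postulated GM is inconsistent with the measurement system of $y_t$ and should be reconsidered, exactly the kind of reconsideration the text describes.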
The specification of a statistical model is indirectly related to the distribution of the stochastic process $\{\mathbf{Z}_t, t \in \mathbb{T}\}$ because in practice it is easier to evaluate the appropriateness of probabilistic assumptions about the marginal distributions of the $\mathbf{Z}_t$'s than about any conditional distributions directly related to the postulated statistical GM. According to Fisher the problem of statistical model specification

is not arbitrary, but requires an understanding of the way in which the data are supposed to, or did in fact, originate.

(See Fisher (1958), p. 8.) For this reason the statistical model is directly related to the actual DGP, being defined in terms of the observed data and not some arbitrary white-noise error term.
From the modelling viewpoint it is convenient to specify a statistical model in terms of three interrelated components:
(i) statistical GM;
(ii) probability model; and
(iii) sampling model.
The statistical GM defines a probabilistic mechanism purporting to provide a generalised approximation to the actual DGP. It is a generalised approximation in the sense that it incorporates no theoretical information apart from the choice of the observed data and the general form of the estimable model. That is, it is 'designed' to provide the framework in the context of which any testable theoretical information of interest could be tested. In particular, any theoretical information not needed in determining the distribution of $\{\mathbf{Z}_t, t \in \mathbb{T}\}$ for the sample period $t = 1, 2, \ldots, T$ might be testable in the context of the postulated statistical model. On the other hand, the other two sets of information, the measurement and sample information, play a vital role in determining $D(\mathbf{Z}; \boldsymbol{\phi})$, $\mathbf{Z} \equiv (\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_T)$, and should be taken into consideration at the outset.
The theoretical information which is taken into account in designing a statistical GM is the form of the estimable model. This is because the estimable model constitutes a reformulation of the theoretical model so as to provide an idealised approximation to the actual DGP in view of the available data, whereas the statistical GM postulates a probabilistic mechanism which could conceivably have given rise to the observed data. The latter, however, does not necessarily coincide with the former apart from some white-noise error. The statistical GM is a stochastic mechanism whose particular form depends on the nature of $D(\mathbf{Z}; \boldsymbol{\phi})$. In particular, the statistical GM is defined in terms of a distribution derivable from $D(\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_T; \boldsymbol{\phi})$ by marginalisation and conditioning (see Chapters 5 and 6) in such a way as to enable us to analyse (statistically) the estimable model in its context. That is, the estimable model, 'translated' in terms of the observable variables involved, should be 'nestable' (a special case) within the postulated statistical GM.
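For the normal case this marginalisation-and-conditioning construction can be sketched as follows; the partition and symbols are the standard ones for the linear regression derivation rather than a quotation from the text. Partition $\mathbf{Z}_t \equiv (y_t, \mathbf{X}_t')'$ and assume

\[
\begin{pmatrix} y_t \\ \mathbf{X}_t \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \boldsymbol{\sigma}_{21}' \\ \boldsymbol{\sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix} \right).
\]

Conditioning on $\mathbf{X}_t = \mathbf{x}_t$ yields the statistical GM as an orthogonal decomposition into systematic and non-systematic components,

\[
y_t = E(y_t \mid \mathbf{X}_t = \mathbf{x}_t) + u_t = \beta_0 + \boldsymbol{\beta}'\mathbf{x}_t + u_t,
\]

with $\boldsymbol{\beta} = \boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}_{21}$, $\beta_0 = \mu_1 - \boldsymbol{\beta}'\boldsymbol{\mu}_2$, and $u_t \sim N(0, \sigma^2)$ where $\sigma^2 = \sigma_{11} - \boldsymbol{\sigma}_{21}'\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}_{21}$; by construction $E(u_t \mid \mathbf{X}_t = \mathbf{x}_t) = 0$, so the error is derived from $D(\mathbf{Z}_t; \boldsymbol{\phi})$ rather than attached by assumption.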
The probability distribution underlying a statistical GM defines the probability model component of the statistical model in question. For the completeness of the argument and the additional insights available at the misspecification stage, the distribution defining the probability model is related to $D(\mathbf{Z}; \boldsymbol{\phi})$. The apparent generality we lose by ignoring $D(\mathbf{Z}; \boldsymbol{\phi})$ and going directly to the distribution of the probability model is illusory. The probability model in the context of statistical inference plays the role of the 'anchor distribution' in terms of which every other distribution or distribution-related quantity is definable. In Part IV we discussed many instances where, if the modeller does not adhere to the same probability model, the argument can be led astray. Such instances include the omitted variables argument, the stochastic linear regression model, as well as the problem of simultaneity. In the last case the argument about simultaneity bias is based entirely on defining the expectation operator $E(\cdot)$ in terms of $D(y_t \mid \mathbf{X}_t; \boldsymbol{\theta})$. This bias can be made to disappear if the underlying distribution is changed (see Chapter 25).
As argued in Chapter 17, the sampling model, which constitutes the third component of a statistical model, provides the link between the probability model and the observed data in hand. Its role might not seem so vital at the specification stage because of its close relationship with the other two components. At the misspecification stage, however, it can play a crucial role, and its close relationship with the other two components can be utilised to determine the way to proceed. Given that the statistical model is defined directly in terms of the observable variables giving rise to the observed data in hand, and not some unobservable error term, the sampling model has a very crucial role to play in econometric modelling. For example, in the case of the linear regression model very few eyebrows will be raised if a modeller assumes that the error is independent. On the other hand, if time-series data are used and the modeller postulates that $y_t$ (conditional on $\mathbf{X}_t = \mathbf{x}_t$) is independent for $t = 1, 2, \ldots, T$, a sizeable number of econometricians will advise caution, even though the two assumptions are equivalent in this context. Making probabilistic assumptions about the observable random variables which gave rise to the data in hand seems to provide a better perspective for judging the appropriateness of such assumptions.
Once a statistical model is postulated we can proceed to estimate the parameters in terms of which the statistical GM is defined, what we called the statistical parameters of interest. These parameters do not necessarily coincide with the theoretical parameters of interest in terms of which the estimable model is defined.
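A small simulation, not from the book, can illustrate both points at once: the simultaneity 'bias' and the distinction between statistical and theoretical parameters. In the simple system below (all numbers are illustrative), OLS consistently estimates the statistical parameter defined by the conditional distribution $D(c_t \mid y_t)$, which differs from the structural (theoretical) parameter because $y_t$ is jointly determined with $c_t$.

import numpy as np

# Structural (theoretical) system: c_t = b*y_t + e_t, with y_t = c_t + z_t,
# so the reduced form is y_t = (z_t + e_t)/(1 - b).
rng = np.random.default_rng(1)
T = 100_000
b, sigma_e, sigma_z = 0.6, 1.0, 2.0        # hypothetical structural values
z = rng.normal(0.0, sigma_z, T)            # exogenous component
e = rng.normal(0.0, sigma_e, T)            # structural error
y = (z + e) / (1.0 - b)                    # reduced form for y_t
c = b * y + e                              # structural equation for c_t

b_ols = np.polyfit(y, c, 1)[0]             # what OLS converges to
# Population regression coefficient implied by D(c_t | y_t):
b_star = b + (1 - b) * sigma_e**2 / (sigma_z**2 + sigma_e**2)
print(f"structural (theoretical) parameter : {b:.3f}")
print(f"OLS estimate                       : {b_ols:.3f}")
print(f"statistical parameter Cov(c,y)/V(y): {b_star:.3f}")

OLS matches the statistical parameter, not the structural one; whether this is 'bias' depends entirely on which distribution anchors the expectation operator, which is the point made above.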
Before we are in a position to relate the two sets of parameters we need to ensure that the estimated statistical model is well defined. Given that statistical arguments will be used to 'define' as well as test any hypotheses related to the theoretical parameters of interest, we need to ensure that their foundation (the estimated statistical model) is well defined in the sense that the underlying assumptions are valid. The estimated statistical model might be viewed as a convenient summarisation of the sample information in the Fisher paradigm, analogous to the histogram in the Pearson paradigm. As such, the estimated statistical model is well defined if the underlying assumptions (defining it) are valid. Otherwise, any statistical arguments based on invalid assumptions will be at best misleading.
Misspecification testing refers to the testing of the (testable) assumptions underlying the statistical model in question. The estimated statistical parameters of interest acquire a meaning only after the misspecification testing has been completed without any assumptions being rejected. This is to ensure that the estimated parameters are indeed estimates of the intended statistical parameters of interest. It would be very difficult to overestimate the importance of misspecification testing in the context of econometric modelling. A misspecified estimated statistical model is essentially useless in this context.
The concept of an estimated well-defined statistical model plays a vital role in the context of the proposed methodology of econometric modelling. No valid statistical inference argument can be made unless it is based on such an estimated statistical model. For this reason the statistical GM is not 'constrained' to coincide with the estimable model. For example, there is no point in adhering to one lag in the variables involved, in view of the estimable model, if the temporal structure of these variables requires more than one lag in order to yield a well-defined statistical model. The modeller should not feel constrained from allowing for features necessitated by the nature of the observed data chosen simply because the theory (estimable model) does not account for them. More often than not the theoretical information relating to the estimable model is vague because of the isolated-system characteristics of economic theories. An important implication of this is the apparent non-rejection of theories on the basis of statistical tests. Such theories can be accepted or rejected only if the 'isolated system' conditions assumed by the theory are made to hold under an experimental-like situation, where the theoretical model and the actual DGP can be ensured to coincide apart from a white-noise error term. When this is not the case, rejecting a theory whose estimable model can take a number of different forms will be very difficult.
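As a hedged sketch of what such a misspecification-testing battery might look like with modern tools (the data below are simulated placeholders, and the choice of tests is illustrative rather than the book's own list), each test probes one assumption underlying the statistical model:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
T = 250
X = sm.add_constant(rng.normal(size=(T, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=T)

res = sm.OLS(y, X).fit()

jb, jb_pv, skew, kurt = jarque_bera(res.resid)         # normality
bp_stat, bp_pv, _, _ = het_breuschpagan(res.resid, X)  # homoskedasticity
dw = durbin_watson(res.resid)                          # sample independence

print(f"Jarque-Bera p-value     : {jb_pv:.3f}")
print(f"Breusch-Pagan p-value   : {bp_pv:.3f}")
print(f"Durbin-Watson statistic : {dw:.3f}")  # near 2 if no autocorrelation

# Only if all the assumptions survive do the estimated coefficients acquire
# their intended meaning as the statistical parameters of interest.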
In the case where misspecification testing leads to the rejection of one or more of the assumptions underlying the statistical model, we proceed by respecifying the model so as to take account of the apparently invalid assumption. What we do not do is to 'engage in local surgery' by 'drafting' the alternative hypothesis of a misspecification test into an otherwise unchanged statistical model, such as postulating an AR(1) error because the Durbin-Watson test rejects the independence assumption, in the context of the linear regression model; see Hendry (1983) for a similar viewpoint.
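To see why such 'local surgery' is restrictive, a standard manipulation (consistent with the common-factor restrictions discussed earlier in the book) shows that the linear regression with an AR(1) error,

\[
y_t = \boldsymbol{\beta}'\mathbf{x}_t + u_t, \qquad u_t = \rho u_{t-1} + \varepsilon_t,
\]

is equivalent, on substituting $u_{t-1} = y_{t-1} - \boldsymbol{\beta}'\mathbf{x}_{t-1}$, to

\[
y_t = \rho y_{t-1} + \boldsymbol{\beta}'\mathbf{x}_t - \rho\boldsymbol{\beta}'\mathbf{x}_{t-1} + \varepsilon_t,
\]

i.e. a special case of the dynamic linear regression $y_t = \alpha_1 y_{t-1} + \boldsymbol{\beta}_0'\mathbf{x}_t + \boldsymbol{\beta}_1'\mathbf{x}_{t-1} + \varepsilon_t$ subject to the common-factor restriction $\boldsymbol{\beta}_1 = -\alpha_1\boldsymbol{\beta}_0$. Respecification proceeds by postulating the unrestricted dynamic model and testing this restriction, rather than imposing it by assumption.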
Once a well-defined estimated statistical model is reached we can proceed to determine (construct) the empirical econometric model. Starting from a well-defined statistical model we can proceed to test any theoretical restrictions which can be related to the statistical formulation.
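Such restrictions are typically of the linear form $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ and can be assessed with the F-test $F = [(\mathrm{RRSS} - \mathrm{URSS})/q]/[\mathrm{URSS}/(T - k)]$, in the book's RRSS/URSS notation. A small sketch follows; the data are simulated and the restriction $\beta_1 + \beta_2 = 1$ is a hypothetical stand-in for whatever the estimable model actually implies.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
T = 200
x1, x2 = rng.normal(size=T), rng.normal(size=T)
# The restriction holds in the simulated DGP: 0.7 + 0.3 = 1.
y = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.5, size=T)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# statsmodels computes the F-test of R*beta = r directly from the
# unrestricted fit (default exog names here are const, x1, x2):
print(res.f_test("x1 + x2 = 1"))

A large p-value leaves the theoretical restriction compatible with the well-defined statistical model; rejection sends the modeller back to the estimable model, not to the statistical foundations.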
The specification of the empirical econometric model can be viewed as a reparametrisation/restriction of the estimated statistical GM, in view of the estimable model, so as to be expressed in terms of the theoretical parameters of interest. Any reparametrisation which imposes restrictions on the statistical parameters can be tested in a formal hypothesis-testing framework and accepted or rejected. The aim is to construct an approximation to the actual DGP in terms of the theoretical parameters of interest, as suggested by the estimable form of the theory: an empirical econometric model. This should be done not at the expense of the statistical properties enjoyed by the estimated statistical GM, because the empirical econometric model needs to be itself a well-defined statistical model for prediction and policy evaluation purposes. Hence, when the empirical econometric model is constructed by reparametrising/restricting a well-defined statistical GM we need to 'check' that it satisfies the underlying statistical assumptions we started with. This does not constitute proper statistical testing but informal diagnostic 'checking'. It is important, however, in order to be able to use formal statistical arguments in relation to prediction or policy evaluation in the context of the empirical econometric model.
Although we need to ensure that the theoretical parameters of interest $\boldsymbol{\xi}$ are uniquely defined in terms of the statistical parameters of interest $\boldsymbol{\theta}$, there is nothing unique about $\boldsymbol{\xi}$. Numerous theoretical parametrisations are possible for any well-defined set of statistical parameters. In practice, we need to choose one of such possible reparametrisations/restrictions of the estimated statistical GM. Several criteria for model selection (see Fig. 1.2) have been proposed in the econometric literature, depending on the potential uses of the empirical econometric model in question. The most important of such criteria are:
(i) theory consistency;
(ii) goodness of fit;
(iii) predictive ability;
(iv) robustness (including nearly orthogonal explanatory variables);
(v) encompassing; and
(vi) parsimony
(see Hendry and Richard (1982), (1983), Hendry (1983), Hendry and Wallis (1984)).
In cases where the observed data can be relied upon as accurate measurements of the underlying variables, there is something realistic about an empirical econometric model in so far as it can be a 'good' or a 'bad' approximation to the actual DGP. In such cases the instrumentalist interpretation of such an empirical econometric model is not very useful, because it will stop the modeller from seeking to improve the approximation. The realist interpretation, however, should not be viewed as implying the existence of the 'truth' econometric modelling is aiming at; we have no criteria outside our theoretical perspective to establish ultimate 'truth'. This should not stop us from seeking better empirical econometric models which provide us with additional insight in our attempt to explain observable economic phenomena of interest. In particular, an empirical econometric model which explains why and how other empirical studies have reached the conclusions they did is invaluable. In such a case we say that the empirical econometric model encompasses the others purporting to explain the same observable phenomenon; on the subject of encompassing see Hendry and Richard (1982), (1983) and Mizon (1984). Hendry and Richard (1982), in their attempt to formalise the concept of a well-defined empirical model, include the encompassing of all rival models as one of the important conditions for what they call 'a tentatively adequate conditional data characterisation' (TACD); for further discussion on the similarities and differences between the two approaches see Spanos (1985).
The empirical econometric model is to some extent as close as we can get to the actual DGP within the framework of an underlying theory and the available observed data chosen. As argued above, its form takes account of all three main sources of information - theory, measurement and sample. As such, the above-discussed methodology differs from both extreme approaches to econometric modelling, where only the theory or only the data are used for the specification of empirical models. The first extreme approach requires the statistical model to coincide with the theoretical model apart from a white-noise error term. The second ignores the theoretical model altogether and uses only the structure of the observed data chosen as the source of information for the determination of the empirical model; see Box and Jenkins (1976) and Spanos (1985).
This concludes the formalisation of the alternative methodology sketched in Chapter 1. A main feature of the new methodology is the broadening of the intended scope of econometrics. Econometric modelling is viewed not as the estimation of theoretical relationships, nor as a procedure for establishing the 'trueness' of economic theories, but as an endeavour to understand observable economic phenomena of interest using observed data in conjunction with some underlying theory in the context of a statistical framework.
26.3 Conclusion
The methodology formulated in Section 26.2 can be viewed as an attempt towards a coherent approach to econometric modelling where economic theory as well as the structure of the observed data have a role to play. In a certain sense the adopted methodology can be seen as an attempt to formalise certain procedures which are currently practised by an increasing number of applied econometricians. The most important features of the adopted methodology are:
(i) the role of the actual DGP;
(ii) the distinction between a theoretical and an estimable model;
(iii) the role of the observed data in the statistical model specification;
(iv) the notion of a well-defined estimated statistical model; and
(v) the distinction between an estimated statistical GM and an empirical econometric model.
As mentioned in the preface, the methodology formalised in the present book can be viewed as having evolved out of the LSE tradition in econometric modelling (see Gilbert (1985)), which owes a lot to the work of Denis Sargan and David Hendry. The main features of the LSE tradition in time-series econometric modelling, including its emphasis on specification and misspecification testing, maximum likelihood and instrumental variables estimation, asymptotic approximations, dynamic specifications, common factor restrictions and error-correction parametrisations, are integrated within a (hopefully) coherent framework in order to formalise an approach to econometric modelling. Another related methodology at odds with the textbook methodology was proposed by Sims (1980). The Sims methodology, in terms of Fig. 1.2, essentially ignores the left-hand side of the diagram and concentrates almost exclusively on modelling the observed data. For a discussion of the relationships between the methodology adopted in the present book and these alternative methodologies see Spanos (1985).
Finally, it is important to warn the reader about the nature of any methodological discussion. On this I can do no better than quote from Caldwell (1982):

Methodology is a frustrating and rewarding area in which to work. Just as there is no best way to listen to a Tchaikovsky symphony, or to write a book, or to raise a child, there is no best way to investigate social reality. Yet methodology has a role to play in all of this. By showing that science is not the objective, rigorous, intellectual endeavour it was once thought to be, and by demonstrating that this need not lead to anarchy, that critical discourse still has a place, the hope is held out that a true picture of the strengths and limitations of scientific practice will emerge. And with luck, this insight may lead to a better, and certainly more honest, science. (See ibid., p. 252.)
Additional references

Blaug (1980); Boland (1982); Caldwell (1984); Hendry and Wallis (1984).
Symbols and abbreviations

(1) Symbols
$N(\mu, \sigma^2)$ - normal distribution with mean $\mu$ and variance $\sigma^2$
$\mathcal{E}$ - random experiment
$\mathcal{F}$ - $\sigma$-field
$\mathcal{B}$ - Borel field
$\cup$ - union
$\cap$ - intersection
$\bar{A}$ - complementation
$\subset$ - subset of
$\in$ - belongs to
$\notin$ - does not belong to
$\emptyset$ - empty set
$\sigma(A)$ - minimal $\sigma$-field generated by $A$
$\forall$ - for all
$U(\alpha, \beta)$ - uniform distribution between $\alpha$ and $\beta$
$B(n, p)$ - binomial distribution with parameters $n$ and $p$
$\Phi$ - probability model
$\chi^2(n)$ - chi-square distribution with $n$ degrees of freedom
$F(n_1, n_2)$ - F distribution with $n_1$ and $n_2$ degrees of freedom
$\xrightarrow{P}$ - convergence in probability
$\xrightarrow{D}$ - convergence in distribution
$\xrightarrow{a.s.}$ - almost sure convergence
$\xrightarrow{r}$ - convergence in $r$th mean
$\mathbb{R}$ - the real line $(-\infty, \infty)$
$\mathbb{R}^k$ - Euclidean $k$-dimensional space
$\mathbb{R}_+$ - positive real line $[0, \infty)$
$\mathbf{I}_T(\theta)$ - sample information matrix
$\mathbf{I}(\theta)$ - single observation information matrix
$\mathbf{I}_\infty(\theta)$ - asymptotic information matrix
$\sim$ - distributed as
$\overset{a}{\sim}$ - asymptotically distributed as
$\underset{H_0}{\sim}$ - distributed under $H_0$
$E(\cdot)$ - expected value
$E_\infty(\cdot)$ - asymptotic expected value
$\mathrm{Var}(\cdot)$ - variance
(2) Abbreviations
DGP - data generating process
CLT - central limit theorem
DF - distribution function
pdf - probability density function
WLLN - weak law of large numbers
SLLN - strong law of large numbers
URSS - unrestricted residual sum of squares
RRSS - restricted residual sum of squares
DLR - dynamic linear regression
MLR - multivariate linear regression
MDLR - multivariate dynamic linear regression
MSE - mean square error
UMP - uniformly most powerful
IID - independent and identically distributed
MLE - maximum likelihood estimator
OLS - ordinary least-squares
GLS - generalised least-squares
AR - autoregressive
MA - moving average
ARMA - autoregressive, moving average
ARIMA - autoregressive, integrated, moving average
ARCH - autoregressive, conditional heteroskedasticity
LR - likelihood ratio
LM - Lagrange multiplier
GM - generating mechanism
rv - random variable
IV - instrumental variable
GIVE - generalised instrumental variables estimator
2SLS - two stage least-squares
LIML - limited information maximum likelihood
FIML - full information maximum likelihood
3SLS - three stage least-squares
wrt - with respect to
IMLE - indirect maximum likelihood estimator
CAN - consistent, asymptotically normal
NIID - normal IID
UMA - uniformly most accurate
MAE - mean absolute error
RESET - regression specification error test
BLUE - best linear unbiased estimator
BLUS - best linear unbiased scalar (residuals)